# **Diagnosis Usage Guide**

### Welcome to the Data Diagnosis usage guide!

#### **Step 1: Importing Libraries**

To begin, we import the library under a short alias:

- datalab as dl

In [None]:
import datalab as dl

#### **Step 2: Loading Data**

The second step is to load our data file.

We can do this by importing ``load_tabular`` directly from datalab.

Alternatively, we can call it through the alias, like this:

``dl.load_tabular('example.csv')``

In [None]:
from datalab import load_tabular

# Load the synthetic dataset with the directly imported function
df = load_tabular('synth-data.csv')

#### **Step 3: Importing Diagnosis class from datalab**

After loading our DataFrame, we can diagnose it. To do so, we import the ``Diagnosis`` class from datalab.

In [3]:
from datalab import Diagnosis

# We will now pass the DataFrame we wish to diagnose
diagnosis = Diagnosis(df)

#### **Step 4: Preview of Data**

We begin our diagnosis by looking at the data itself.

We can do that with the ``data_preview()`` method.

In [4]:
diagnosis.data_preview(5)  #  We can pass in the number of rows we would like to see

| | age | income | expenses | savings | loan_amount | credit_score | num_of_dependents | years_at_job | risk_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 56.0 | 73892.871297 | 55599.652884 | 17417.10366 | 18809.36736265557 | 727.914047 | 0.0 | 22.0 | 0.132589 |
| 1 | 69.0 | 62617.964801 | 30006.449742 | 28868.981712 | 14792.72848134152 | 740.039311 | 3.0 | 0.0 | 0.118505 |
| 2 | 46.0 | 47897.329539 | 25023.236338 | 24674.446 | 13678.633414495547 | 703.442518 | 2.0 | 13.0 | 0.235282 |
| 3 | | 49346.645711 | 22799.815975 | | 5425.383264748043 | 623.241966 | 5.0 | 4.0 | 0.145748 |
| 4 | 60.0 | 167409.013096 | 124105.981773 | 44924.782142 | 30292.05477411715 | 698.567366 | 0.0 | 0.0 | 0.0 |


#### **Step 5: Data Summary**

Okay! I have looked at the data, now what?

After we have seen what our data looks like, we may want to know how many rows and columns it has in total.

Or we may want to know what type of data we are dealing with: numbers, plain text, categories, or dates and times.

We can do that with the ``data_summary()`` method of the ``Diagnosis`` class.

The ``data_summary()`` method returns a dictionary that lets you see:

1. Shape of Data (number of rows and columns)
2. Column Names
3. Data Types (whether int, float, str, bool, datetime or category)
4. Index (the label of each row, like a house number for each row of your DataFrame)

In [5]:
summary = diagnosis.data_summary()
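
For intuition, a dictionary with the same four keys can be assembled from a plain pandas DataFrame. This is only a minimal sketch and not datalab's implementation: the helper name `summarize` and the toy data are made up for illustration.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: collect basic structural facts about a DataFrame."""
    return {
        "shape": frame.shape,               # (number of rows, number of columns)
        "columns": frame.columns.tolist(),  # column names as a plain list
        "dtypes": frame.dtypes,             # Series mapping column name -> dtype
        "index": frame.index,               # row labels
    }

# Toy data, loosely modelled on the columns used in this guide
toy = pd.DataFrame({
    "age": [56.0, 69.0, None],
    "loan_amount": ["18809.3", "14792.7", "13678.6"],
})
toy_summary = summarize(toy)
print(toy_summary["shape"])
print(toy_summary["columns"])
```

Each entry can then be looked up by key, in the same way the following cells look up entries of ``summary``.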

**Shape of Data**

Let's begin by looking at the shape of our data.

In [6]:
summary['shape']   # We have 1 M rows and 9 columns

(1000000, 9)

**Column Names** 

We can also check the names of columns we are dealing with.

In [7]:
summary['columns']   # We can just copy paste above code and replace 'shape' with 'columns'

['age',
 'income',
 'expenses',
 'savings',
 'loan_amount',
 'credit_score',
 'num_of_dependents',
 'years_at_job',
 'risk_score']

**Indices**

We can also check the index.

In [8]:
summary['index']   # Our row labels start at 0 and end at 999999 (stop=1000000 is exclusive)

RangeIndex(start=0, stop=1000000, step=1)
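
The ``stop`` value of a pandas ``RangeIndex`` is exclusive, just like Python's built-in ``range``. A small sketch with a five-row index (a toy size chosen for illustration) makes this easy to check:

```python
import pandas as pd

idx = pd.RangeIndex(start=0, stop=5, step=1)

print(len(idx))   # number of row labels
print(idx[-1])    # last label; the stop value itself is not included
print(5 in idx)   # membership test for the stop value
```

Applied to the output above, the index holds 1,000,000 labels, and the last one is 999999.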

**DataTypes**

Let us now check the datatypes we are working with.

In [9]:
summary['dtypes']

age                  float64
income               float64
expenses             float64
savings              float64
loan_amount           object
credit_score         float64
num_of_dependents    float64
years_at_job         float64
risk_score           float64
dtype: object

**Column Types**

Alternatively, we can detect what type each column is (Numerical, Categorical or Datetime) using ``detect_column_types()`` from the ``Diagnosis`` class of datalab.

In [10]:
diagnosis.detect_column_types()

{'Numerical': ['age',
  'income',
  'expenses',
  'savings',
  'credit_score',
  'num_of_dependents',
  'years_at_job',
  'risk_score'],
 'Datetime': [],
 'Categorical': ['loan_amount']}

As we can see, **loan_amount** has been classified as a Categorical column because its dtype is *object*, even though it holds numerical values.

We can change that with the ``ColumnConverter`` class from datalab, but here we are focusing only on diagnosis.

You can see the ``ColumnConverter`` class in action in the ``DataLab Workflow Guides``.
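
To illustrate how a numeric column can end up stored as text, here is a sketch in plain pandas. It does not use ``ColumnConverter`` or ``detect_column_types()``, and it makes an assumption: that a stray non-numeric entry is what forces text storage. That is one common cause, but this guide does not say what happened to **loan_amount**. The toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "income": [73892.87, 62617.96, 47897.33],
    # One unparseable entry is enough to make pandas store the column as text
    "loan_amount": ["18809.37", "unknown", "13678.63"],
})

# Split columns by dtype: numeric ones versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in toy.columns if c not in numeric_cols]
print(numeric_cols, other_cols)

# Coerce the text column to numbers; entries that cannot be parsed become NaN
toy["loan_amount"] = pd.to_numeric(toy["loan_amount"], errors="coerce")
print(toy.select_dtypes(include="number").columns.tolist())
```

After the coercion both columns count as numeric, at the cost of turning the unparseable entry into a missing value.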

#### **Step 6: Memory Usage**

We can also check how much memory the DataFrame consumes, using the ``show_memory_usage()`` method from the ``Diagnosis`` class.

In [11]:
diagnosis.show_memory_usage()

Total Memory Usage: 121.87 MB


np.float64(121.8696060180664)
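
A comparable figure can be computed in plain pandas, as sketched below. Whether ``show_memory_usage()`` counts string contents (``deep=True``) or divides by 1024² rather than 10⁶ is not stated in this guide, so both choices here are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [56.0, 69.0, 46.0],
    "loan_amount": ["18809.37", "14792.73", "13678.63"],
})

# Bytes used per column (plus the index); deep=True also measures string contents
per_column_bytes = toy.memory_usage(index=True, deep=True)

# Convert the total from bytes to megabytes (assuming 1 MB = 1024 * 1024 bytes)
total_mb = per_column_bytes.sum() / 1024**2
print(f"Total Memory Usage: {total_mb:.6f} MB")
```

Looking at ``per_column_bytes`` directly shows which columns dominate. Text columns are usually far more expensive than numeric ones.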

#### **Step 7: Duplicates in Data**

Okay, I have seen how many rows I have, what datatypes I am dealing with, and whether my columns are numerical, categorical or datetime.

Now what?

We will now check whether our data contains any duplicates (two or more rows holding the same data).

This matters because duplicates distort both analysis and predictions: every repeated row is counted more than once.

Imagine analyzing a financial dataset where 50% of the records are the same transaction.


**Count Duplicates**

In [12]:
diagnosis.count_duplicates()

np.int64(0)

We can see that ``count_duplicates()`` returned 0, which means that no two rows are identical across all columns.

However, we can also pass the ``in_columns`` parameter to count duplicates within a specific column or across a combination of columns.

In [13]:
diagnosis.count_duplicates(in_columns=['age'])

np.int64(999936)

In [14]:
diagnosis.count_duplicates(in_columns=['age', 'expenses'])

np.int64(64407)
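
The same three counts can be reproduced in plain pandas with ``DataFrame.duplicated``. This is a sketch on invented toy data. It assumes that ``count_duplicates()`` counts every repeat after the first occurrence, which this guide does not confirm.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 30, 41, 41, 41],
    "expenses": [100.0, 100.0, 250.0, 250.0, 900.0],
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],  # unique per row, so no full-row duplicates
})

# A row counts as a duplicate when an earlier row has the same values
full_row_dupes = int(toy.duplicated().sum())
age_dupes = int(toy.duplicated(subset=["age"]).sum())
pair_dupes = int(toy.duplicated(subset=["age", "expenses"]).sum())
print(full_row_dupes, age_dupes, pair_dupes)
```

Under this convention, the count for a single column equals the number of rows minus the number of distinct values. That is consistent with the result for ``age`` above: 1,000,000 rows minus the 64 distinct ages reported in Step 8 gives 999,936.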

**Show Duplicates**

Besides counting, we can also see which rows are duplicates by using the ``show_duplicates()`` method from the ``Diagnosis`` class.

Here too, we can pass the ``in_columns`` parameter to see duplicates within a specific column or across a combination of columns.

In [15]:
diagnosis.show_duplicates(in_columns=['age', 'expenses', 'num_of_dependents'])

| | age | income | expenses | savings | loan_amount | credit_score | num_of_dependents | years_at_job | risk_score |
|---|---|---|---|---|---|---|---|---|---|
| 537 | | 80902.888122 | | 19885.360533 | 12566.928142510105 | 748.697744 | 1.0 | | 0.000000 |
| 601 | 36.0 | 29797.094398 | | 11152.628462 | 3688.285059972357 | 771.568987 | | 36.0 | 0.199874 |
| 609 | 43.0 | 55135.631251 | | 1816.627577 | 2761.846981935697 | | 3.0 | 40.0 | 0.000000 |
| 783 | 58.0 | 20291.124801 | | 11916.304510 | 785.1556296340912 | 719.454001 | 2.0 | | 0.308494 |
| 918 | | 46123.322775 | | 22112.380369 | 14726.982341055256 | 743.394466 | | 26.0 | 0.192440 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999892 | 79.0 | 45062.347764 | | 8931.276040 | 4152.119464917672 | 698.030844 | | 21.0 | |
| 999922 | 34.0 | 110794.649679 | | 15054.116421 | 27338.24876021334 | 697.444777 | 0.0 | 5.0 | 0.000000 |
| 999931 | 24.0 | 74696.320523 | | 30812.216581 | 5809.787392556175 | 837.672651 | 1.0 | 10.0 | 0.000000 |
| 999947 | 80.0 | 29709.646104 | | 8893.800135 | | 748.001036 | | 31.0 | 0.303836 |
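
Selecting the duplicated rows themselves can be sketched in plain pandas with a boolean mask. Which rows ``show_duplicates()`` returns (only the later copies, or every member of a duplicated group) is not stated in this guide, so the sketch shows both options on invented data. It also shows that pandas treats missing values as equal to each other when comparing rows. That may be why rows with missing values appear in a listing like the one above.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 30, 41, 41, 52],
    "expenses": [100.0, 100.0, None, None, 100.0],
})
cols = ["age", "expenses"]

# keep="first": flag only the later copies of each repeated combination
repeats_only = toy[toy.duplicated(subset=cols, keep="first")]

# keep=False: flag every row that belongs to a repeated combination
all_members = toy[toy.duplicated(subset=cols, keep=False)]

print(repeats_only.index.tolist())
print(all_members.index.tolist())
```

The original row labels are preserved in both results, so each flagged row can be traced back to its position in the full DataFrame.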


#### **Step 8: Cardinality**

In data, ``cardinality`` means the number of distinct values in a column.

You can read more about **cardinality** in the **Data Diagnosis** documentation under the **docs** section of DataLab.

We can check cardinality by using the ``show_cardinality()`` method from the ``Diagnosis`` class.

In [16]:
diagnosis.show_cardinality()

{'age': 64,
 'income': 944772,
 'expenses': 935530,
 'savings': 920519,
 'loan_amount': 920609,
 'credit_score': 941267,
 'num_of_dependents': 7,
 'years_at_job': 42,
 'risk_score': 517721}
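
A per-column count of distinct values can be sketched in plain pandas with ``DataFrame.nunique``. This is an illustration on invented data, not datalab's implementation. Whether ``show_cardinality()`` counts missing values as a distinct value is not stated in this guide; ``nunique`` ignores them by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "num_of_dependents": [0, 3, 2, 3, 0, None],
    "income": [73892.87, 62617.96, 47897.33, 49346.65, 167409.01, 51200.00],
})

# Distinct values per column; missing values are not counted by default
cardinality = toy.nunique().to_dict()

# Share of rows that carry a distinct value: near 1.0 means almost every row is unique
ratio = {col: n / len(toy) for col, n in cardinality.items()}
print(cardinality)
print(ratio)
```

Read this way, the output above separates low-cardinality columns such as ``num_of_dependents`` (7 values) from near-unique ones such as ``income`` (944,772 values in 1,000,000 rows).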