# **Diagnosis Usage Guide**

### Welcome to the Data Diagnosis usage guide!

### **Importing Libraries**

To begin, we import the library:

- datalabx as dl

In [3]:
import datalabx as dl

#### **Loading Data**

The second step is to load our data file.

We can do this by importing ``load_tabular`` directly from datalabx.

Alternatively, we can call it through the alias like this:

``dl.load_tabular('example.csv')``

In [4]:
from datalabx import load_tabular

df = load_tabular('messy_dataset.csv') # synthetic dataset

#### **Importing Diagnosis class from datalabx**

After loading our DataFrame, we can diagnose it by importing the ``Diagnosis`` class from datalabx.

In [5]:
from datalabx import Diagnosis

# We will now pass the DataFrame we wish to diagnose
diagnosis = Diagnosis(df)

#### **Preview of Data**

We will begin our diagnosis by looking at what our data looks like.

We can do that by using the ``data_preview()`` method.

In [6]:
diagnosis.data_preview(5)  #  We can pass in the number of rows we would like to see

Unnamed: 0,age,income,expenses,debt,score,savings_ratio,gender,region,membership_type,subscription_status,signup_date,last_active
0,45.960569836134795,tpUwb,21337.20758558362,2077881.1645985672,49.015507649186574,0.4311585563690656,M,East,vip,inactive,2019-10-23,2019-11-05
1,38.340828385945784,17182.44345210847,3621.2092817197617,3752.959576902923,60.75904929703295,0.7892494573421452,M,East,basic,active,2019-10-14,2021-04-11
2,NXlMl,23497.04853541588,16516.059770785752,,,0.2971006658180088,M,East,basic,active,2015-05-07,2017-02-09
3,58.2763582768963,10510.744673253303,8219.049415817683,9614.040727545927,70.71124084031835,0.2180335769422026,F,West,basic,active,2013-10-06,2015-03-22
4,37.19015950331997,18865.239961809577,,11116.85061890834,34.32423437519151,,M,West,basic,active,INVALID,INVALID


#### **Data Summary**

Okay! I have looked at the data, now what?

After we have seen what our data looks like, we may want to know how many rows and columns it has in total.

We may also want to know what type of data we are dealing with: numbers, plain text, categories, or dates and times.

We can do that by using the ``data_summary()`` method of the ``Diagnosis`` class.

The ``data_summary()`` method returns a dictionary that lets you see:

1. Shape of Data (number of rows and columns)
2. Column Names
3. Data Types (whether int, float, str, bool, datetime or category)
4. Index (the row labels, like a house number for each row of your DataFrame)

In [7]:
summary = diagnosis.data_summary()

**Shape of Data**

Let's begin by looking at the shape of our data.

In [8]:
summary['shape']   # We have about 2 million rows and 12 columns

(2040000, 12)

**Column Names** 

We can also check the names of columns we are dealing with.

In [9]:
summary['columns']  

['age',
 'income',
 'expenses',
 'debt',
 'score',
 'savings_ratio',
 'gender',
 'region',
 'membership_type',
 'subscription_status',
 'signup_date',
 'last_active']

**Indices**

We can also check the index.

In [10]:
summary['index']   # Our row labels start at 0 and end at 2039999 (stop is exclusive)

RangeIndex(start=0, stop=2040000, step=1)

**DataTypes**

Let us now check the datatypes we are working with.

In [11]:
summary['dtypes']

age                    object
income                 object
expenses               object
debt                   object
score                  object
savings_ratio          object
gender                 object
region                 object
membership_type        object
subscription_status    object
signup_date            object
last_active            object
dtype: object
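For readers who know pandas, each of the four entries above has a direct pandas counterpart. The snippet below is a minimal sketch on a tiny made-up DataFrame. It uses only standard pandas attributes, and the dictionary layout is an assumption about what ``data_summary()`` collects, not datalabx's actual implementation.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset
toy = pd.DataFrame({"age": [45.9, 38.3, 58.2], "region": ["East", "East", "West"]})

# Hypothetical stand-in for the dictionary returned by data_summary()
toy_summary = {
    "shape": toy.shape,               # (number of rows, number of columns)
    "columns": toy.columns.tolist(),  # column names as a plain list
    "dtypes": toy.dtypes,             # one dtype per column
    "index": toy.index,               # row labels
}

print(toy_summary["shape"])
print(toy_summary["index"])
```

Note that the ``stop`` value of a ``RangeIndex`` is exclusive, so a frame with three rows has labels 0, 1 and 2.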

**Column Type Detection**

Alternatively, we can detect what type of column (Numerical, Categorical or Datetime) we are dealing with by using ``detect_column_types()`` from the ``Diagnosis`` class of datalabx.

In [12]:
diagnosis.detect_column_types()

{'Numerical': [],
 'Datetime': [],
 'Categorical': ['age',
  'income',
  'expenses',
  'debt',
  'score',
  'savings_ratio',
  'gender',
  'region',
  'membership_type',
  'subscription_status',
  'signup_date',
  'last_active']}

As we can see, every column has been classified as Categorical because each one has the *object* dtype, even though several of them actually hold numerical or datetime values.

We can change that by using ``ColumnConverter`` class from datalabx, which you can read more about in **Column Type Detection & Conversion** docs.
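A likely reason for the all-text result is that a single invalid entry (such as ``tpUwb`` in the preview) is enough to stop a column from being read as numeric. The sketch below illustrates this with plain pandas and a made-up three-row CSV. It says nothing about how ``load_tabular`` or ``detect_column_types()`` work internally.

```python
import io
import pandas as pd

# Made-up CSV text: each column contains one stray string
messy_csv = "age,income\n45.96,tpUwb\n38.34,17182.44\nNXlMl,23497.05\n"
messy = pd.read_csv(io.StringIO(messy_csv))

# Neither column is numeric, because of the stray strings
numeric_flags = {col: pd.api.types.is_numeric_dtype(messy[col]) for col in messy.columns}
print(numeric_flags)

# The same layout without the stray strings is read as numeric
clean_csv = "age,income\n45.96,100.5\n38.34,17182.44\n"
clean = pd.read_csv(io.StringIO(clean_csv))
print(clean.dtypes)
```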

#### **Memory Usage**

We can also check how much memory the DataFrame is consuming by using the ``show_memory_usage()`` method of the ``Diagnosis`` class.

In [13]:
diagnosis.show_memory_usage()  # 1GB file

Total Memory Usage: 1379.94 MB


np.float64(1379.940689086914)
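A comparable figure can be computed in plain pandas. The following is a rough sketch that assumes the total is the sum of per-column bytes divided by 1024², which may differ from how ``show_memory_usage()`` actually computes its result.

```python
import pandas as pd

# Made-up frame with text columns, similar in spirit to the unconverted dataset
toy = pd.DataFrame({
    "age": ["45.96", "38.34", "NXlMl"],
    "region": ["East", "East", "West"],
})

# deep=True also counts the memory held by the strings themselves
per_column_bytes = toy.memory_usage(deep=True)  # includes an "Index" entry
total_mb = per_column_bytes.sum() / 1024 ** 2   # assumption: 1 MB = 1024 * 1024 bytes

print(per_column_bytes)
print(f"Total Memory Usage: {total_mb:.6f} MB")
```

Without ``deep=True``, pandas may report only the size of the references for text columns, so the total can be much lower than the real footprint.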

#### **Duplicates in Data**

Okay, I have seen how many rows I have, what datatypes I am dealing with, and whether my data is numerical, categorical or datetime.

Now what?

We will now check whether our data contains any duplicates (rows that hold the same data).

This matters because many duplicates reduce the accuracy of any predictions or analysis we build on the data.

Imagine analyzing a financial dataset where 50% of the records are the same transaction.


**Count Duplicates**

In [14]:
diagnosis.count_duplicates()

np.int64(40000)

We can see that ``count_duplicates()`` has returned 40000, which means that 40000 rows are duplicates when all columns are compared.

We can also pass the ``in_columns`` parameter to count duplicates within a single column or across a chosen set of columns.

In [15]:
diagnosis.count_duplicates(in_columns=['age'])

np.int64(279142)

In [16]:
diagnosis.count_duplicates(in_columns=['age', 'expenses'])

np.int64(61350)
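The counts above grow as fewer columns are compared, because rows only need to match on the selected columns. The sketch below reproduces that behaviour with pandas' ``DataFrame.duplicated`` on made-up data. Treating ``in_columns`` as equivalent to pandas' ``subset`` is an assumption, not something confirmed by datalabx.

```python
import pandas as pd

# Made-up data: rows 1 and 4 repeat earlier rows exactly
toy = pd.DataFrame({
    "age": [18, 18, 25, 18, 25],
    "expenses": [100, 100, 200, 300, 200],
})

# Rows that repeat an earlier row when every column is compared
full_row_dupes = int(toy.duplicated().sum())

# Rows whose 'age' alone repeats an earlier value
age_dupes = int(toy.duplicated(subset=["age"]).sum())

print(full_row_dupes, age_dupes)
```

By default pandas does not count the first occurrence of each group as a duplicate, so the result is the number of repeats rather than the number of rows involved.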

**Show Duplicates**

Besides counting, we can also see which rows are duplicates by using the ``show_duplicates()`` method of the ``Diagnosis`` class.

It accepts the same ``in_columns`` parameter, so we can view duplicates within a single column or across a chosen set of columns.

In [17]:
diagnosis.show_duplicates(in_columns=['age', 'expenses', 'debt'])

Unnamed: 0,age,income,expenses,debt,score,savings_ratio,gender,region,membership_type,subscription_status,signup_date,last_active
1926,18.0,12178.135497480767,,,60.53615966903524,,M,West,basic,inactive,2018-03-21,2020-06-23
3458,,43681.033458597856,,,,mAnvv,M,West,premium,active,2020-05-04,2022-06-07
3992,18.0,4357.393019012679,,,33.572798887702945,0.34014310705825745,M,West,basic,inactive,2010-08-02,2011-01-03
6799,,27487.995807444542,,,63.518245845765,0.5375501652277817,M,South,premium,inactive,2020-07-10,2023-03-13
6979,18.0,19504.92078544328,,,43.524921293797966,0.27594410395045893,F,East,basic,active,2017-08-08,2019-09-09
...,...,...,...,...,...,...,...,...,...,...,...,...
2039995,27.57950658009139,24867.969251518636,17934.879104637497,19332.828168179854,60.165409812400725,0.2787959916130968,F,South,basic,inactive,2014-03-06,2014-11-19
2039996,,26883.291521692,17286.903908107197,24629.853050641286,52.44913736033527,0.3569647565604652,M,South,premium,inactive,2019-08-25,2021-10-13
2039997,59.91820212620553,28089.858787093744,22076.978333513045,13123.203972510484,54.68026819711335,0.21405876402423912,M,West,premium,active,2016-09-13,2018-05-02
2039998,58.391209727936065,26376.324092594194,18995.68056462294,,,0.279820777984888,F,North,premium,active,2019-10-27,2020-09-29
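Several rows in the output above match on missing values. In plain pandas, missing values in the same positions are treated as equal when looking for duplicates, as the sketch below shows on made-up data. Whether ``show_duplicates()`` keeps every member of a duplicate group (``keep=False``) or skips the first occurrence is not stated here, so that choice is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up data: rows 0-1 and rows 2-3 match on 'age' and 'expenses'
toy = pd.DataFrame({
    "age": [18.0, 18.0, np.nan, np.nan, 27.5],
    "expenses": [np.nan, np.nan, np.nan, np.nan, 100.0],
    "region": ["West", "East", "West", "South", "North"],
})

# keep=False flags every member of a duplicate group, not only the repeats
mask = toy.duplicated(subset=["age", "expenses"], keep=False)

# Sorting places the members of each group next to each other
dupe_rows = toy[mask].sort_values(["age", "expenses"])
print(dupe_rows)
```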


#### **Cardinality**

In data, ``cardinality`` means the number of distinct values in a column.

You can read more about **cardinality** in the **Data Diagnosis** documentation under the **docs** section of datalabx.

We can check cardinality by using the ``show_cardinality()`` method of the ``Diagnosis`` class.

In [18]:
diagnosis.show_cardinality() # shows count of unique values

{'age': 1760858,
 'income': 1819994,
 'expenses': 1819993,
 'debt': 1819999,
 'score': 1819974,
 'savings_ratio': 1819998,
 'gender': 52,
 'region': 99989,
 'membership_type': 3,
 'subscription_status': 59999,
 'signup_date': 4001,
 'last_active': 4996}
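The same kind of per-column count can be obtained in plain pandas with ``DataFrame.nunique``. This is a minimal sketch on made-up data. Whether ``show_cardinality()`` counts missing values as a distinct value is not stated here, so both variants are shown.

```python
import pandas as pd

# Made-up data: 'gender' mixes spellings and has one missing value
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "m", None],
    "membership_type": ["basic", "vip", "basic", "premium", "basic"],
})

# Distinct values per column, ignoring missing values (pandas default)
cardinality = toy.nunique().to_dict()

# Same count, but treating a missing value as one more distinct value
cardinality_with_missing = toy.nunique(dropna=False).to_dict()

print(cardinality)
print(cardinality_with_missing)
```

Inconsistent spellings such as ``M`` and ``m`` count as separate values, which is one way a column like ``gender`` can report more distinct values than expected.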

#### **Separation of Data**

We can also use datalabx to split our DataFrame so that we can work on Numerical, Categorical or Datetime data separately.

You can read more about separation of data in the **Data Diagnosis** docs.

However, since all of our columns currently have the *object* dtype, let us first use ``ColumnConverter`` to convert them to the correct types.

In [25]:
from datalabx import ColumnConverter

df = ColumnConverter(df, ['age', 'income', 'expenses', 'debt', 'score', 'savings_ratio']).to_numerical()
df = ColumnConverter(df, ['signup_date','last_active']).to_datetime()

Diagnosis(df).detect_column_types()

ColumnConverter initialized with columns: ['age', 'income', 'expenses', 'debt', 'score', 'savings_ratio']
ColumnConverter initialized with columns: ['signup_date', 'last_active']


{'Numerical': ['age', 'income', 'expenses', 'debt', 'score', 'savings_ratio'],
 'Datetime': ['signup_date', 'last_active'],
 'Categorical': ['gender', 'region', 'membership_type', 'subscription_status']}

Great! We can see that our columns now have the correct datatypes.
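In the converted data shown further down, entries that could not be parsed (such as ``tpUwb`` or ``INVALID``) appear as missing values. Plain pandas offers the same behaviour through ``errors="coerce"``. The sketch below uses made-up data and is only an illustration of that idea, not a description of how ``ColumnConverter`` is implemented.

```python
import pandas as pd

# Made-up text columns with one unparseable entry each
toy = pd.DataFrame({
    "age": ["45.96", "38.34", "NXlMl"],
    "signup_date": ["2019-10-23", "2019-10-14", "INVALID"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")
toy["signup_date"] = pd.to_datetime(toy["signup_date"], format="%Y-%m-%d", errors="coerce")

print(toy.dtypes)
print(toy)
```

Coercion silently discards the invalid entries, so it is worth counting missing values before and after conversion.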

Now we will separate the data.


#### Numerical Data Separation:

In [29]:
# Getting Numerical Data

numerical_df = Diagnosis(df).get_numerical_columns()
numerical_df.head()

Unnamed: 0,age,income,expenses,debt,score,savings_ratio
0,45.96057,,21337.207586,2077881.0,49.015508,0.431159
1,38.340828,17182.443452,3621.209282,3752.96,60.759049,0.789249
2,,23497.048535,16516.059771,,,0.297101
3,58.276358,10510.744673,8219.049416,9614.041,70.711241,0.218034
4,37.19016,18865.239962,,11116.85,34.324234,


#### Categorical Data Separation:


In [31]:
categorical_df = Diagnosis(df).get_categorical_columns()
categorical_df.head()

Unnamed: 0,gender,region,membership_type,subscription_status
0,M,East,vip,inactive
1,M,East,basic,active
2,M,East,basic,active
3,F,West,basic,active
4,M,West,basic,active


#### Datetime Data Separation:

In [32]:
datetime_df = Diagnosis(df).get_datetime_columns()
datetime_df.head()

Unnamed: 0,signup_date,last_active
0,2019-10-23,2019-11-05
1,2019-10-14,2021-04-11
2,2015-05-07,2017-02-09
3,2013-10-06,2015-03-22
4,NaT,NaT
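For comparison, a similar split can be done in plain pandas with ``DataFrame.select_dtypes``. The following is a hedged sketch on made-up data. Treating every column that is neither numeric nor datetime as categorical is an assumption about the grouping, not datalabx's documented rule.

```python
import pandas as pd

# Made-up frame with one column of each kind
toy = pd.DataFrame({
    "age": [45.96, 38.34],
    "gender": ["M", "F"],
    "signup_date": pd.to_datetime(["2019-10-23", "2019-10-14"]),
})

numerical_part = toy.select_dtypes(include="number")
datetime_part = toy.select_dtypes(include="datetime")
# Assumption: whatever is neither numeric nor datetime is treated as categorical
categorical_part = toy.select_dtypes(exclude=["number", "datetime"])

print(numerical_part.columns.tolist())
print(datetime_part.columns.tolist())
print(categorical_part.columns.tolist())
```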
