# **Diagnosis Usage Guide**

### Welcome to the Data Diagnosis usage guide!

### **Importing Libraries**

To begin, we import the library:

- datalabx

In [3]:
import datalabx as dx

#### **Loading Data**

The second step is to load our data file.

We can do this by importing ``load_tabular`` directly from datalabx.

Alternatively, we can call it through the ``dx`` alias imported above:

``dx.load_tabular('example.csv')``

In [9]:
from datalabx import DataLoader

df = DataLoader('ultra_messy_dataset.csv').load_tabular() # synthetic dataset

DataLoader - INFO - Data Loader initialized with csv file.


#### **Importing Diagnosis class from datalabx**

After loading our DataFrame, we diagnose it with the ``Diagnosis`` class, which we import from datalabx.

In [10]:
from datalabx import Diagnosis

# We will now pass the DataFrame we wish to diagnose
diagnosis = Diagnosis(df)

Diagnosis - INFO - Data Diagnosis initialized with columns: ['Age', 'Salary', 'Expenses', 'Height_cm', 'Weight_kg', 'Temperature_C', 'Purchase_Amount', 'Score', 'Rating', 'Debt']


#### **Preview of Data**

We begin our diagnosis by looking at what our data looks like.

We can do that by using the ``data_preview()`` method.

In [11]:
diagnosis.data_preview(5)  #  We can pass in the number of rows we would like to see

Unnamed: 0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
0,five,1.04e+05,4735.244618878169,"1,55e+02",,unknown,,30.70,4.888018131992931,64972
1,20.673408544550288,134460.66741276794,9533.158420128186,"1,53e+02",12193,2.05e+01,2400.0,one,"3,33e+00","$58,276.81"
2,33.56,missing,4104.95,204,96.50051410846052,"-1,14e+01",2877.0871437672777,four,unknown,94033.3425007563
3,missing,,894429.0,196.34 cm,?,-1.57e+01,,?,4.31,"27,400cm"
4,28$,112877.67251785153,1040.45,171.40 cm,4912,-6,4344.39,2.24,4.14,67682


#### **Data Summary**

Okay! I have looked at the data, now what?

After seeing what our data looks like, we may want to know how many rows and columns it has in total.

We may also want to know what type of data we are dealing with: numbers, plain text, categories, or dates and times.

We can do that by using the ``data_summary()`` method from the ``Diagnosis`` class.

The ``data_summary()`` method returns a dictionary that lets you see:

1. Shape of Data (number of rows and columns)
2. Column Names
3. Data Types (whether int, float, str, bool, datetime or category)
4. Index (the label of each row in your DataFrame, like a house number)

In [12]:
summary = diagnosis.data_summary()
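
To make the structure of such a dictionary concrete, here is a minimal, library-free sketch that builds one from a few CSV rows. The helper `summarize` and the inline sample (adapted from the preview, with values shortened) are hypothetical and not part of datalabx; only the four keys come from this guide.

```python
import csv
import io

# Three rows adapted from the preview; a real run would read the full file.
RAW = """Age,Salary,Expenses
five,1.04e+05,4735.24
33.56,missing,4104.95
28$,112877.67,1040.45
"""


def summarize(rows, columns):
    """Return a dictionary with the four keys used in this guide (hypothetical helper)."""
    return {
        "shape": (len(rows), len(columns)),
        "columns": list(columns),
        # The csv module reads every field as text, so each column reports 'str'.
        "dtypes": {col: type(rows[0][col]).__name__ for col in columns} if rows else {},
        "index": range(len(rows)),
    }


reader = csv.DictReader(io.StringIO(RAW))
rows = list(reader)
summary = summarize(rows, reader.fieldnames)

print(summary["shape"])    # (3, 3)
print(summary["columns"])  # ['Age', 'Salary', 'Expenses']
print(summary["dtypes"])   # {'Age': 'str', 'Salary': 'str', 'Expenses': 'str'}
```

Returning a plain dictionary keeps each piece addressable by key, which is how the cells that follow read ``shape``, ``columns``, ``index`` and ``dtypes``.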

**Shape of Data**

Let's begin by looking at the shape of our data.

In [None]:
summary['shape']   # We have 100k rows and 10 columns

(100000, 10)

**Column Names** 

We can also check the names of columns we are dealing with.

In [15]:
summary['columns']  

['Age',
 'Salary',
 'Expenses',
 'Height_cm',
 'Weight_kg',
 'Temperature_C',
 'Purchase_Amount',
 'Score',
 'Rating',
 'Debt']

**Indices**

We can also check the index.

In [None]:
summary['index']   # Our rows are numbered from 0 to 99999 (the stop value, 100000, is exclusive)

RangeIndex(start=0, stop=100000, step=1)
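
A ``RangeIndex`` uses the same start/stop/step convention as Python's built-in ``range``, so the stop value itself is not included. The short check below uses plain ``range`` as a stand-in for the index shown above.

```python
# Built-in range as a stand-in for RangeIndex(start=0, stop=100000, step=1).
index = range(0, 100000, 1)

print(len(index))       # 100000 labels in total
print(index[0])         # 0, the first row label
print(index[-1])        # 99999, the last row label
print(100000 in index)  # False, because stop is exclusive
```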

**DataTypes**

Let us now check the datatypes we are working with.

In [17]:
summary['dtypes']

Age                large_string[pyarrow]
Salary             large_string[pyarrow]
Expenses           large_string[pyarrow]
Height_cm          large_string[pyarrow]
Weight_kg          large_string[pyarrow]
Temperature_C      large_string[pyarrow]
Purchase_Amount    large_string[pyarrow]
Score              large_string[pyarrow]
Rating             large_string[pyarrow]
Debt               large_string[pyarrow]
dtype: object

**Column Type Detection**

Alternatively, we can detect what type of column (Numerical, Categorical or Datetime) we are dealing with, using ``detect_column_types()`` from ``Diagnosis`` class of datalabx.

In [18]:
diagnosis.detect_column_types()

{'Numerical': [],
 'Datetime': [],
 'Categorical': ['Age',
  'Salary',
  'Expenses',
  'Height_cm',
  'Weight_kg',
  'Temperature_C',
  'Purchase_Amount',
  'Score',
  'Rating',
  'Debt']}

As we can see, every column has been classified as Categorical because all values are stored as strings (``large_string[pyarrow]``), even when a column actually holds Numerical or Datetime data.

We can change that by using the ``ColumnConverter`` class from datalabx, which you can read more about in the **Column Type Detection & Conversion** docs.
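
To see why messy columns end up as text, the sketch below measures what share of a column's values can be parsed as numbers. It is only an illustration in plain Python: the helpers `is_number` and `numeric_share` are hypothetical and say nothing about how datalabx detects or converts types. The sample values are the first five ``Age`` and ``Expenses`` entries from the preview.

```python
def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def numeric_share(values):
    """Fraction of non-empty values that parse as a float (hypothetical helper)."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return 0.0
    return sum(is_number(v) for v in present) / len(present)


age = ["five", "20.673408544550288", "33.56", "missing", "28$"]
expenses = ["4735.244618878169", "9533.158420128186", "4104.95", "894429.0", "1040.45"]

print(numeric_share(age))       # 0.4, since 'five', 'missing' and '28$' do not parse
print(numeric_share(expenses))  # 1.0
```

A single unparseable token such as ``five`` or ``28$`` is enough to keep a whole column from being read as numeric, which is why a cleaning or conversion step is needed before the columns can be treated as numbers.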

#### **Memory Usage**

We can also check how much memory the DataFrame is consuming by using the ``show_memory_usage()`` method from the ``Diagnosis`` class.

In [19]:
diagnosis.show_memory_usage()  # total memory usage, reported in MB

Diagnosis - INFO - Showing total memory usage in MB


0    14.504131
dtype: float64

#### **Duplicates in Data**

Okay, I have seen how many rows I have, what datatypes I am dealing with, and whether my data is numerical, categorical or datetime.

Now what?

We will now check whether our data contains any duplicates (two or more rows holding the same data).

This matters because a large number of duplicates makes both predictions and analysis less accurate.

Imagine analyzing a financial dataset where 50% of the records are the same transaction.


**Count Duplicates**

In [20]:
diagnosis.count_duplicates()

0

We can see that ``count_duplicates()`` has returned 0, which means that no row is an exact copy of another row across all columns.

However, we can also pass in the parameter ``in_columns``, to check duplicates separately within a specific column or across multiple columns.

In [22]:
diagnosis.count_duplicates(in_columns=['Age'])

60059

In [25]:
diagnosis.count_duplicates(in_columns=['Age', 'Expenses'])

13491
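
The difference between checking all columns and checking a subset can be sketched in plain Python. The function `count_repeats` and the four rows below are made up for illustration; the sketch assumes that a "duplicate" is every occurrence of a value combination after the first one, which may not be exactly how ``count_duplicates()`` is implemented.

```python
from collections import Counter


def count_repeats(rows, in_columns=None):
    """Count rows whose values in the chosen columns already appeared earlier."""
    if not rows:
        return 0
    columns = in_columns if in_columns is not None else list(rows[0].keys())
    seen = Counter(tuple(row[col] for col in columns) for row in rows)
    # Each distinct combination contributes (occurrences - 1) repeats.
    return sum(n - 1 for n in seen.values())


rows = [
    {"Age": "five", "Expenses": "one", "Debt": "?"},
    {"Age": "five", "Expenses": "one", "Debt": "64972"},
    {"Age": "five", "Expenses": "two", "Debt": "?"},
    {"Age": "33.56", "Expenses": "one", "Debt": "?"},
]

print(count_repeats(rows))                                  # 0, no two full rows match
print(count_repeats(rows, in_columns=["Age"]))              # 2, 'five' appears three times
print(count_repeats(rows, in_columns=["Age", "Expenses"]))  # 1, ('five', 'one') appears twice
```

The fewer columns are compared, the easier it is for two rows to match, so the count for a single column is at least as high as the count for any set of columns that contains it.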

**Show Duplicates**

Besides counting, we can also see which rows are duplicated by using the ``show_duplicates()`` method from the ``Diagnosis`` class.

Here too, we can pass in the parameter ``in_columns`` to see duplicates within a specific column or across multiple columns.

In [26]:
diagnosis.show_duplicates(in_columns=['Age', 'Expenses', 'Debt'])

Unnamed: 0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
518,,,two,16g,one,18$,?,-50.96123153389587,4$,
1046,five,,,1.56e+02,55.60 kg,-14.42914420248782,-2049.7945912651485,52.015539487224714,?,
1272,,1.12e+05,one,173.99,106.66382261019672,four,five,two,,
1349,,60530,one,one,63.22351312072803,8.73e+00,3110.938952375893,81.18,missing,
1487,,one,,146.68,5005,19.43 C,"€2,860.49",unknown,4.405274741198807,five
...,...,...,...,...,...,...,...,...,...,...
99819,five,?,,174.64,?,,error,2.20e+00,-4.482239007967144,?
99920,?,38839.512869087914,,,,2.77,"£2,705.90",7405,2.79,?
99921,missing,five,missing,179.0133900816666,112.31198735047545,approx 1000,2128.47,81.50474414360973,,?
99938,three,,,181.54 cm,40,34.298229955574755,?,"9,15e+01",2.47,


#### **Cardinality**

In data, ``cardinality`` means the number of distinct values in a column.

You can read more about **cardinality** in the **Data Diagnosis** documentation under the **docs** section of datalabx.

We can check cardinality by using the ``show_cardinality()`` method from the ``Diagnosis`` class.

In [27]:
diagnosis.show_cardinality() # shows count of unique values

{'Age': 39941,
 'Salary': 63894,
 'Expenses': 61558,
 'Height_cm': 36277,
 'Weight_kg': 40118,
 'Temperature_C': 38180,
 'Purchase_Amount': 58888,
 'Score': 40086,
 'Rating': 21232,
 'Debt': 64519}
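
Cardinality can be illustrated with a plain Python set, as in the sketch below. The list `ages` is made up; the three totals in the second half are copied from the outputs in this guide. The final check shows that the ``Age`` numbers are consistent with the duplicate count being the number of rows minus the number of distinct values, though this is an observation and not a statement about how datalabx computes either figure.

```python
# Made-up column with repeated values.
ages = ["five", "28$", "five", "missing", "28$", "five"]

cardinality = len(set(ages))       # number of distinct values
repeats = len(ages) - cardinality  # occurrences beyond the first of each value

print(cardinality)  # 3
print(repeats)      # 3

# Figures reported earlier in this guide for the 'Age' column.
total_rows = 100_000
age_cardinality = 39_941
age_duplicates = 60_059

print(age_cardinality + age_duplicates == total_rows)  # True
```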