# **Column Type Detection & Conversion Workflow Guide**
--------------------------------------------------------

### Welcome to the Workflow Guide for Column Type Detection and Conversion!

In this workflow guide, we will explore how to detect and convert column types using datalab.

This matters because incorrectly identified columns can lead to data loss or, worse, to values being silently converted into NaN (Not a Number) when we operate on them.

We can see examples of this in the ``Workflow Docs`` section of the DataLab docs.

This in turn leads to inaccurate computations, visualizations, analysis, model building, etc.
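To make the NaN problem concrete, here is a minimal sketch in plain pandas. It illustrates the general coercion technique and is not a description of how datalab works internally. The values are made up to resemble the ``age`` column of the dataset used in this guide.

```python
import pandas as pd

# A small made-up column mixing numeric strings with one corrupted entry.
raw_age = pd.Series(["45.9", "38.3", "NXlMl", "58.2"])

# errors="coerce" replaces anything that cannot be parsed with NaN.
age = pd.to_numeric(raw_age, errors="coerce")

print(age.isna().sum())  # how many values were lost to NaN
print(age.mean())        # NaN values are skipped by default
```

The corrupted entry disappears silently, which is why it helps to know in advance which columns really are numeric.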

To prevent all of that, let us go step by step and experience the workflow firsthand!

### **Importing Libraries**

To begin with, we will be importing these libraries:

- datalab as dl
- pandas as pd 

In [1]:
import datalab as dl
import pandas as pd

### **Loading the Data**

In [2]:
from datalab import load_tabular

df = load_tabular('messy_dataset.csv')  # our synthetic dataset

### **Diagnosing the Data**


In [3]:
from datalab import Diagnosis

diagnosis = Diagnosis(df)

Let us check what our data looks like, its shape and also what column types we have.

In [4]:
diagnosis.data_preview(5)

Unnamed: 0,age,income,expenses,debt,score,savings_ratio,gender,region,membership_type,subscription_status,signup_date,last_active
0,45.960569836134795,tpUwb,21337.20758558362,2077881.1645985672,49.015507649186574,0.4311585563690656,M,East,vip,inactive,2019-10-23,2019-11-05
1,38.340828385945784,17182.44345210847,3621.2092817197617,3752.959576902923,60.75904929703295,0.7892494573421452,M,East,basic,active,2019-10-14,2021-04-11
2,NXlMl,23497.04853541588,16516.059770785752,,,0.2971006658180088,M,East,basic,active,2015-05-07,2017-02-09
3,58.2763582768963,10510.744673253303,8219.049415817683,9614.040727545927,70.71124084031835,0.2180335769422026,F,West,basic,active,2013-10-06,2015-03-22
4,37.19015950331997,18865.239961809577,,11116.85061890834,34.32423437519151,,M,West,basic,active,INVALID,INVALID


In [5]:
summary = diagnosis.data_summary()

In [6]:
summary['shape']

(2040000, 12)

In [7]:
summary['columns']

['age',
 'income',
 'expenses',
 'debt',
 'score',
 'savings_ratio',
 'gender',
 'region',
 'membership_type',
 'subscription_status',
 'signup_date',
 'last_active']

In [8]:
summary['dtypes']

age                    object
income                 object
expenses               object
debt                   object
score                  object
savings_ratio          object
gender                 object
region                 object
membership_type        object
subscription_status    object
signup_date            object
last_active            object
dtype: object

### **Column Type Detection:**

We can see that every column has the ``object`` dtype, which means all of them have been identified as **Categorical** type columns.

However, when we compare this to the preview of our DataFrame, we can see that some columns are actually **Numerical**, some are **Categorical**, and a few are **Datetime** type columns.

The correct column types for this DataFrame are as follows:

1. **Numerical**: ['age', 'income', 'expenses', 'debt', 'score', 'savings_ratio']

2. **Categorical**: ['gender', 'region', 'membership_type', 'subscription_status']

3. **Datetime**: ['signup_date', 'last_active']
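One way to arrive at such a classification without reading the preview by eye is to measure what share of each column parses as a number or as a date. The sketch below uses plain pandas. The helper ``guess_column_type``, its ``threshold`` parameter, and the assumed ``%Y-%m-%d`` date format are hypothetical choices for illustration and are not part of datalab's API.

```python
import pandas as pd

def guess_column_type(series, threshold=0.7, date_format="%Y-%m-%d"):
    """Hypothetical helper: guess 'numerical', 'datetime' or 'categorical'."""
    values = series.dropna().astype(str)
    if values.empty:
        return "categorical"

    # Share of values that can be parsed as numbers.
    numeric_share = pd.to_numeric(values, errors="coerce").notna().mean()
    if numeric_share >= threshold:
        return "numerical"

    # Share of values that match the assumed date format.
    parsed = pd.to_datetime(values, format=date_format, errors="coerce")
    if parsed.notna().mean() >= threshold:
        return "datetime"

    return "categorical"

# Made-up rows resembling the preview above.
sample = pd.DataFrame({
    "age": ["45.9", "38.3", "NXlMl", "58.2", "37.1"],
    "gender": ["M", "M", "M", "F", "M"],
    "signup_date": ["2019-10-23", "2019-10-14", "2015-05-07",
                    "2013-10-06", "INVALID"],
})

guessed = {name: guess_column_type(sample[name]) for name in sample.columns}
print(guessed)
```

A threshold below 1.0 tolerates a few corrupted entries such as ``NXlMl`` or ``INVALID``. The right value depends on how messy the data is.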

We can convert the incorrectly identified columns to the correct types by importing the ``ColumnConverter`` class from datalab.


In [9]:
from datalab import ColumnConverter


### **IMPORTANT:**

All classes in datalab, whether they are Column Converters, Backend Converters, Data Visualizers, Cleaners, Preprocessors, or Diagnosis, accept the specific columns you wish to work with. If no columns are given, they apply their operations to all columns of the DataFrame.

For this dataset, we definitely do not want that, because everything that is not a number would be converted into NaN (Not a Number).

That is why we will explicitly pass the list of columns we wish to convert to each specific column type: **Numerical**, **Categorical**, or **Datetime**.
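The difference between converting every column and converting only selected ones can be sketched in plain pandas. This is an illustration of the general idea under made-up data, not datalab's implementation.

```python
import pandas as pd

# Made-up rows with one numeric-looking column and one categorical column.
sample = pd.DataFrame({
    "income": ["tpUwb", "17182.4", "23497.0"],
    "region": ["East", "East", "West"],
})

# Coercing every column: 'region' loses all of its values.
all_columns = sample.apply(pd.to_numeric, errors="coerce")

# Coercing only the chosen column leaves 'region' untouched.
numeric_columns = ["income"]
selected = sample.copy()
selected[numeric_columns] = selected[numeric_columns].apply(
    pd.to_numeric, errors="coerce"
)

print(all_columns)
print(selected)
```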

### **Numerical Type Conversion:**

To convert incorrectly identified columns into numerical columns, we will use the ``to_numerical()`` method of the ``ColumnConverter`` class of datalab.

This returns a ``pd.DataFrame`` in which the selected columns are converted into a numeric type, along with the rest of the DataFrame, which is left as it was.

In [10]:
# converting columns into Numerical

df = ColumnConverter(df, columns=['age', 'income', 'expenses', 'debt', 'score', 'savings_ratio']).to_numerical()

ColumnConverter initialized with columns: ['age', 'income', 'expenses', 'debt', 'score', 'savings_ratio']


Let us verify that the conversion worked by checking the datatypes.

In [11]:
df.dtypes

age                    float64
income                 float64
expenses               float64
debt                   float64
score                  float64
savings_ratio          float64
gender                  object
region                  object
membership_type         object
subscription_status     object
signup_date             object
last_active             object
dtype: object

We can see that the columns passed to the ``ColumnConverter`` class have been successfully converted into numbers.

The datatype of those columns is now shown as **float64**, which is a numeric datatype.
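Checking dtypes confirms the type change, but it does not show how many values could not be parsed. A rough way to count them, sketched here in plain pandas on made-up data, is to compare missing-value counts before and after coercion. This assumes non-parseable values end up as NaN, as described earlier in this guide.

```python
import pandas as pd

# Made-up raw columns: one value is already missing, two are corrupted.
raw = pd.DataFrame({
    "age": ["45.9", "38.3", "NXlMl", None],
    "income": ["tpUwb", "17182.4", "23497.0", "10510.7"],
})

converted = raw.apply(pd.to_numeric, errors="coerce")

# Values present before but NaN afterwards were not parseable as numbers.
newly_missing = converted.isna().sum() - raw.isna().sum()

print(converted.dtypes)
print(newly_missing)
```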

### **Categorical Type Conversion:**

To convert incorrectly identified columns into categorical columns, we would use the ``to_categorical()`` method of the ``ColumnConverter`` class of datalab.

However, we will skip this method for this DataFrame, since its categorical columns are already correctly identified.
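For reference, plain pandas also offers a dedicated ``category`` dtype for such columns. The sketch below shows that pandas feature on made-up data. Whether ``to_categorical()`` produces this dtype is not stated in this guide, so treat the link between the two as an assumption.

```python
import pandas as pd

# Made-up values resembling the 'membership_type' column.
sample = pd.DataFrame({"membership_type": ["vip", "basic", "basic", "basic"]})

# Convert the object column to the pandas 'category' dtype.
sample["membership_type"] = sample["membership_type"].astype("category")

print(sample["membership_type"].dtype)
print(list(sample["membership_type"].cat.categories))
```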

### **Datetime Type Conversion:**

To convert incorrectly identified columns into datetime columns, we will use the ``to_datetime()`` method of the ``ColumnConverter`` class of datalab.

This returns a ``pd.DataFrame`` in which the selected columns are converted into datetime columns, along with the rest of the DataFrame, which is left as it was.

In [12]:
df = ColumnConverter(df, columns=['signup_date', 'last_active']).to_datetime()

ColumnConverter initialized with columns: ['signup_date', 'last_active']


In [13]:
df.dtypes

age                           float64
income                        float64
expenses                      float64
debt                          float64
score                         float64
savings_ratio                 float64
gender                         object
region                         object
membership_type                object
subscription_status            object
signup_date            datetime64[ns]
last_active            datetime64[ns]
dtype: object
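The preview showed ``INVALID`` entries in the date columns, which cannot be parsed as dates. A minimal plain-pandas sketch of how such values can be handled follows. It shows the general technique with an assumed ``%Y-%m-%d`` format and is not a statement about datalab's documented behavior.

```python
import pandas as pd

# Made-up values resembling the 'signup_date' column.
raw_dates = pd.Series(["2019-10-23", "2019-10-14", "INVALID"])

# errors="coerce" turns entries that do not match the format into NaT.
dates = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(dates.dtype)
print(dates.isna().sum())  # number of entries that became NaT
```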

**Great!** 

We can see that we have successfully converted the incorrectly identified columns into the correct types for our data, using datalab.

