In [1]:
from dataset_generation import *

In [2]:
dataframe = generate_dataset()

In [3]:
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email,Useless
0,,,Special,1.95,83000,5 1976-18,China,MQbJR.8WDzY@gmail.com,9m9a334AHz
1,Sarah,,Classic,,461000,3 1956-27,China,HxJMj.glQNM@gmail.com,iWxgWkTvSa
2,James,,Special,1.91,87000,2 1954-16,USA,dqLlv.aDT6L@hotmail.com,JzVksclC8s
3,,,Special,1.88,139000,5 1994-5,USA,uLZVE.6fbPU@hotmail.com,P1Xq6UOSDm
4,,,Regular,1.86,125000,10 1956-20,Fr,Qznoc.FyGqH@gmail.com,Q9QFXsRaMz


## 1/ Useless feature deletion

Some features in your data can look odd, or be genuinely hard to understand or interpret. Such features may carry no useful information, and keeping them risks misleading interpretations later in the analysis, so deleting them is a reasonable option.

In my dataset, I deliberately created a **Useless** feature to show how to prepare the data and continue the analysis on an appropriate scope.

In [4]:
# Look at the first few rows of the dataset to get a sense of the data
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email,Useless
0,,,Special,1.95,83000,5 1976-18,China,MQbJR.8WDzY@gmail.com,9m9a334AHz
1,Sarah,,Classic,,461000,3 1956-27,China,HxJMj.glQNM@gmail.com,iWxgWkTvSa
2,James,,Special,1.91,87000,2 1954-16,USA,dqLlv.aDT6L@hotmail.com,JzVksclC8s
3,,,Special,1.88,139000,5 1994-5,USA,uLZVE.6fbPU@hotmail.com,P1Xq6UOSDm
4,,,Regular,1.86,125000,10 1956-20,Fr,Qznoc.FyGqH@gmail.com,Q9QFXsRaMz


In [5]:
# Drop the "Useless" feature from the dataset
dataframe = dataframe.drop(columns=['Useless'])

In [6]:
# Have another look at the data now that the useless feature has been deleted
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email
0,,,Special,1.95,83000,5 1976-18,China,MQbJR.8WDzY@gmail.com
1,Sarah,,Classic,,461000,3 1956-27,China,HxJMj.glQNM@gmail.com
2,James,,Special,1.91,87000,2 1954-16,USA,dqLlv.aDT6L@hotmail.com
3,,,Special,1.88,139000,5 1994-5,USA,uLZVE.6fbPU@hotmail.com
4,,,Regular,1.86,125000,10 1956-20,Fr,Qznoc.FyGqH@gmail.com


After deleting the `Useless` feature, which I was not interested in, I look at the dataset again to check that the column is really gone, then continue the data cleaning process.
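The same check can be done programmatically rather than by eye. The sketch below is only an illustration: the `toy` frame is a hypothetical stand-in for the generated dataset, and `errors="ignore"` is an optional choice that lets the cell be re-run without raising once the column is already gone.

```python
import pandas as pd

# Hypothetical stand-in for the generated dataset (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["Sarah", "James", None],
    "Salary": [461000, 87000, 139000],
    "Useless": ["iWxgWkTvSa", "JzVksclC8s", "P1Xq6UOSDm"],
})

# columns= names the axis explicitly; errors="ignore" makes re-running safe
cleaned = toy.drop(columns=["Useless"], errors="ignore")

# Dropping again does nothing instead of raising a KeyError
cleaned_again = cleaned.drop(columns=["Useless"], errors="ignore")

# drop() returns a new frame, so the original still has the column
remaining_columns = list(cleaned.columns)
print(remaining_columns)
```

Because `drop` returns a new DataFrame, the result has to be assigned back (as in the cell above) for the deletion to take effect.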

## 2/ Missing values as a special category

In this part, I study the different values taken by the **Category** feature of my dataset.

I quickly notice that several of these values stand for a missing or invalid category. I therefore want to treat them all as a single representation of the missing-value concept by assigning them to one category, which I name `Unknown`.

In [7]:
# Count the different values taken by the Category feature
dataframe['Category'].value_counts()

Classic    15
Special    15
Regular     8
NaN         5
null        3
???         3
UNKWN       1
Name: Category, dtype: int64

In [8]:
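# Replace every placeholder for a missing Category with one dedicated value.
# Assigning the result back avoids inplace=True on a column selection, which may not update the DataFrame.
dataframe['Category'] = dataframe['Category'].replace(['NaN', 'null', 'UNKWN', '???'], 'Unknown')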

In [9]:
# Count the values taken by the Category feature again, after cleaning the missing categories
dataframe['Category'].value_counts()

Classic    15
Special    15
Unknown    12
Regular     8
Name: Category, dtype: int64

Having checked the transformed dataset, I can process the Category feature later on knowing that it takes either one of the values I have already seen (`Classic`, `Regular`, `Special`) or the single `Unknown` value that marks a missing category.
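One caveat: `value_counts()` excludes real missing values by default, so the `NaN` counted above is the literal string `'NaN'`. If the column could also hold true missing values, they would need separate handling. The sketch below assumes such a mixed column; the `category` Series and `placeholders` list are illustrative, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing string placeholders with a real missing value
category = pd.Series(
    ["Classic", "NaN", "null", "???", "UNKWN", np.nan, "Regular", "Special"],
    name="Category",
)

# dropna=False makes real missing values visible in the counts
raw_counts = category.value_counts(dropna=False)

# String placeholders that all mean "missing"
placeholders = ["NaN", "null", "UNKWN", "???"]

# replace() handles the strings, fillna() handles real missing values
cleaned = category.replace(placeholders, "Unknown").fillna("Unknown")

counts = cleaned.value_counts()
print(counts)
```

Chaining `fillna` after `replace` keeps the two cases explicit: the list documents which strings are treated as missing, while real missing values are caught regardless of how they were produced.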