In [1]:
from dataset_generation import *

In [2]:
dataframe = generate_dataset()

In [3]:
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email,Strange
0,Richard,52.0,Regular,1.92,124000,5 1973-16,CHinA,J1hvT.NM3Gz@gmail.com,aMj5LYJRui
1,,,???,1.7,39000,4 1998-12,USA,uTeUW.tGhw8@laposte.net,7aAS02QOAg
2,Linda,,Special,1.74,121000,6 1996-12,Brasil,8JDQS.LuwbQ@gmail.com,nM4c2sUjcx
3,John,,Special,1.67,134000,1 1970-12,South Africa,fGVZI.lMkes@gmail.com,gUuhu3L6XQ
4,,61.0,Special,2.08,56000,7 1983-21,brUsil,7OgCi.cVnKm@gmail.com,CPwAzWZsJc


## 1/ Strange feature deletion

Some features in a dataset look odd and are hard to interpret. They may simply be useless, and deleting such a feature can be a reasonable way to avoid misinterpretations later in the analysis.

In my dataset, I deliberately created a **Strange** feature to show how to prepare the data and continue the analysis on an appropriate subset of columns.

In [4]:
# Look at the first few lines of the dataset to get a sense of data
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email,Strange
0,Richard,52.0,Regular,1.92,124000,5 1973-16,CHinA,J1hvT.NM3Gz@gmail.com,aMj5LYJRui
1,,,???,1.7,39000,4 1998-12,USA,uTeUW.tGhw8@laposte.net,7aAS02QOAg
2,Linda,,Special,1.74,121000,6 1996-12,Brasil,8JDQS.LuwbQ@gmail.com,nM4c2sUjcx
3,John,,Special,1.67,134000,1 1970-12,South Africa,fGVZI.lMkes@gmail.com,gUuhu3L6XQ
4,,61.0,Special,2.08,56000,7 1983-21,brUsil,7OgCi.cVnKm@gmail.com,CPwAzWZsJc


In [5]:
# Drop the "Strange" feature from the dataset
dataframe = dataframe.drop(['Strange'], axis=1)

In [6]:
# Have another look at the data now that the useless feature has been deleted
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email
0,Richard,52.0,Regular,1.92,124000,5 1973-16,CHinA,J1hvT.NM3Gz@gmail.com
1,,,???,1.7,39000,4 1998-12,USA,uTeUW.tGhw8@laposte.net
2,Linda,,Special,1.74,121000,6 1996-12,Brasil,8JDQS.LuwbQ@gmail.com
3,John,,Special,1.67,134000,1 1970-12,South Africa,fGVZI.lMkes@gmail.com
4,,61.0,Special,2.08,56000,7 1983-21,brUsil,7OgCi.cVnKm@gmail.com


After deleting the `Strange` feature, which I was not interested in, I take another look at the dataset to check that the column is really gone, then continue the Data Cleaning process.
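One way to support the decision to drop a column is to check how many distinct values it holds: a free-text column in which every row is unique is of little use for grouping or comparison. The sketch below is only an illustration. It uses a small hypothetical DataFrame (`sample`) instead of the generated dataset, and `unique_ratio` and `cleaned` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for the generated dataset (values are illustrative only)
sample = pd.DataFrame({
    "Category": ["Regular", "Special", "Special", "Regular"],
    "Strange": ["aMj5LYJRui", "7aAS02QOAg", "nM4c2sUjcx", "gUuhu3L6XQ"],
})

# Share of distinct values per column: 1.0 means every row is unique
unique_ratio = sample.nunique() / len(sample)
print(unique_ratio)

# Drop the column by name; errors="ignore" keeps the cell safe to re-run
# once the column is already gone
cleaned = sample.drop(columns=["Strange"], errors="ignore")
cleaned = cleaned.drop(columns=["Strange"], errors="ignore")
print(cleaned.columns.tolist())
```

A high ratio is not proof that a column is useless (identifiers and emails are unique too), so it is best read as a hint alongside what you know about the data.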

## 2/ Missing values as a special category

In this part, I will study the different values taken by the **Category** feature of my dataset.

I quickly notice that several of these values stand for a missing or invalid category. I therefore want to treat them all as a single representation of the missing-value concept and assign them to one category, which I name `Unknown`.

In [7]:
# Check the different values taken by data for the Category feature
dataframe['Category'].value_counts()

Regular    15
Special    13
Classic    10
NaN         5
???         3
UNKWN       2
null        2
Name: Category, dtype: int64

In [8]:
# Replace the placeholder values of the Category feature with a single dedicated value
# (assigning the result back avoids relying on an in-place replace on a column selection)
dataframe['Category'] = dataframe['Category'].replace(['NaN', 'null', 'UNKWN', '???'], 'Unknown')

In [9]:
# Check the different values taken by data for the Category feature, after missing category cleaning process
dataframe['Category'].value_counts()

Regular    15
Special    13
Unknown    12
Classic    10
Name: Category, dtype: int64

Now that I have checked that the Category feature uses a single missing-value label, I can process it later knowing it takes only the values I have already seen (`Classic`, `Regular`, `Special`) or the `Unknown` placeholder.
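The replacement above targets placeholder strings. A column can also contain real missing values (`NaN` floats), which `value_counts()` leaves out by default. As a minimal sketch, assuming a hypothetical sample Series named `categories`, both kinds can be mapped to the same label:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing placeholder strings with one real missing value
categories = pd.Series(["Regular", "???", "null", np.nan, "UNKWN", "NaN", "Special"])

# Placeholder strings observed in the value counts of the Category feature
placeholders = ["NaN", "null", "UNKWN", "???"]

# Map the placeholder strings first, then the real missing values, to one label
normalized = categories.replace(placeholders, "Unknown").fillna("Unknown")

counts = normalized.value_counts()
print(counts)
```

Passing `dropna=False` to `value_counts()` before cleaning is a simple way to see whether real missing values are present at all.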

## 3/ Fixing spelling mistakes with known correct values

There is a **Country** feature in my dataset that contains a few country names, but some of them are spelled or written inconsistently. This happens in real life when you collect data from multiple sources that do not share the same software or language conventions.

I will therefore try to understand the misspelled values and replace them with a uniform spelling.

In [10]:
# Check the different values taken by data for the Country feature
dataframe['Country'].value_counts()

South Africa    10
Brasil           8
China            8
USA              6
France           5
brUsil           4
CHinA            3
uSa              3
Fr               3
Name: Country, dtype: int64

As the counts show, a few values of the Country feature are misspelled or abbreviated, but we can guess the real country behind each of them. For example, *Fr* probably means *France*, *CHinA* and *uSa* are capitalization variants of *China* and *USA*, and *brUsil* is clearly a typo for *Brasil*. *SAF*, an abbreviation of *South Africa*, does not appear in the counts above, but the replacement below covers it as well.

In [11]:
# Replace misspelled and erroneous values of the Country feature with known correct values
dataframe['Country'] = dataframe['Country'].replace(['SAF', 'Fr', 'CHinA', 'uSa', 'brUsil'],
                                                    ['South Africa', 'France', 'China', 'USA', 'Brasil'])

In [12]:
# Check the different values taken by data for the Country feature, after fixing misspelled values
dataframe['Country'].value_counts()

Brasil          12
China           11
South Africa    10
USA              9
France           8
Name: Country, dtype: int64

With this spellchecking and fixing step of the Data Cleaning process, we now clearly see the different countries involved in the dataset, with no surprising or confusing values.
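Listing every capitalization variant by hand becomes tedious as the data grows. One possible alternative, sketched here on a hypothetical sample (`countries`) with an illustrative mapping (`canonical`), is to lower-case the raw values first and look them up in a dictionary of known spellings:

```python
import pandas as pd

# Hypothetical sample of raw country values
countries = pd.Series(["CHinA", "uSa", "Fr", "brUsil", "Brasil", "South Africa", "SAF"])

# Canonical spellings keyed by lower-cased raw value,
# so one entry covers all capitalization variants
canonical = {
    "china": "China",
    "usa": "USA",
    "fr": "France",
    "france": "France",
    "brusil": "Brasil",
    "brasil": "Brasil",
    "saf": "South Africa",
    "south africa": "South Africa",
}

# Normalize whitespace and case, then look up the canonical spelling
key = countries.str.strip().str.lower()
fixed = key.map(canonical)

# Values with no entry in the mapping come back as NaN; keep the original there
fixed = fixed.fillna(countries)
print(fixed.tolist())
```

The mapping still has to be written from knowledge of the data: lower-casing handles *CHinA* and *uSa*, but a typo such as *brUsil* needs its own entry.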