In [1]:
from dataset_generation import *
import math

In [2]:
dataframe = generate_dataset(50)

In [3]:
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email,Strange
0,John,,Special,1.78,72000,3 1951-14,Brasil,T7iP2.FaTN8@hotmail.com,nPM9ie8Dsi
1,,64.0,Classic,1.52,138000,1 2003-2,uSa,NAMqv.SNh8u@laposte.net,I1AhNlWRjp
2,Robert,,Regular,1.61,70000,3 1999-24,Fr,6s6Zs.Ay2FO@weneverknow.com,phPmw5uv8A
3,,,Special,1.52,33000,8 1994-10,Brasil,4Rw22.8Dgfv@laposte.net,EuD7n4Ym87
4,,,Regular,1.74,98000,2 2003-6,USA,ZWU0x.IFOeQ@hotmail.com,KtXAtgdsGK


## 1/ Strange feature deletion

Some features in a dataset look odd and are hard to interpret. If such a feature carries no usable information, deleting it is a reasonable option, because it avoids misleading interpretations later in the analysis.

In my dataset, I deliberately created a **Strange** feature to show how to prepare the data and continue the analysis on an appropriate subset of features.

In [4]:
# Look at the first few lines of the dataset to get a sense of data
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email,Strange
0,John,,Special,1.78,72000,3 1951-14,Brasil,T7iP2.FaTN8@hotmail.com,nPM9ie8Dsi
1,,64.0,Classic,1.52,138000,1 2003-2,uSa,NAMqv.SNh8u@laposte.net,I1AhNlWRjp
2,Robert,,Regular,1.61,70000,3 1999-24,Fr,6s6Zs.Ay2FO@weneverknow.com,phPmw5uv8A
3,,,Special,1.52,33000,8 1994-10,Brasil,4Rw22.8Dgfv@laposte.net,EuD7n4Ym87
4,,,Regular,1.74,98000,2 2003-6,USA,ZWU0x.IFOeQ@hotmail.com,KtXAtgdsGK


In [5]:
# Drop the "Strange" feature from the dataset
dataframe = dataframe.drop(['Strange'], axis=1)

In [6]:
# Have another look at the data now that the Strange feature has been deleted
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email
0,John,,Special,1.78,72000,3 1951-14,Brasil,T7iP2.FaTN8@hotmail.com
1,,64.0,Classic,1.52,138000,1 2003-2,uSa,NAMqv.SNh8u@laposte.net
2,Robert,,Regular,1.61,70000,3 1999-24,Fr,6s6Zs.Ay2FO@weneverknow.com
3,,,Special,1.52,33000,8 1994-10,Brasil,4Rw22.8Dgfv@laposte.net
4,,,Regular,1.74,98000,2 2003-6,USA,ZWU0x.IFOeQ@hotmail.com


After deleting the `Strange` feature, which I was not interested in, I look at the dataset again to check that the feature is gone, and then continue the Data Cleaning process.
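Before dropping a feature like this, it can help to confirm that it really looks like noise. The following is only a minimal sketch on a hypothetical toy frame (`toy`, `distinct_ratio` and `candidates` are names I made up for illustration): it flags columns in which every row holds a different value, which is typical of random strings but also of legitimate identifiers, so the result is a list to review rather than to delete blindly.

```python
import pandas as pd

# Hypothetical toy frame standing in for the generated dataset.
toy = pd.DataFrame({
    "Category": ["Regular", "Classic", "Regular", "Special"],
    "Strange": ["nPM9ie8Dsi", "I1AhNlWRjp", "phPmw5uv8A", "EuD7n4Ym87"],
})

# Share of distinct values per column: 1.0 means every row holds a different value.
distinct_ratio = toy.nunique() / len(toy)

# Columns where every value is unique are candidates for review, not automatic deletion.
candidates = distinct_ratio[distinct_ratio == 1.0].index.tolist()
print(candidates)
```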

## 2/ Missing values as a special category

In this part, I study the different values taken by the **Category** feature of my dataset.

I quickly see that several of these values stand for a missing or invalid category. I therefore want to treat them all as a single representation of a missing value, by assigning them to one category that I name `Unknown`.

In [7]:
# Check the different values taken by data for the Category feature
dataframe['Category'].value_counts()

Regular    22
Classic    10
Special     6
???         4
NaN         3
UNKWN       3
null        2
Name: Category, dtype: int64

In [8]:
# Replace the placeholder values of the Category feature with a single dedicated value
# (assigning the result back avoids relying on an in-place update of a column selection)
dataframe['Category'] = dataframe['Category'].replace(['NaN', 'null', 'UNKWN', '???'], 'Unknown')

In [9]:
# Check the different values taken by data for the Category feature, after missing category cleaning process
dataframe['Category'].value_counts()

Regular    22
Unknown    12
Classic    10
Special     6
Name: Category, dtype: int64

Now that the Category feature has a single missing-value marker, I can process it later knowing that it only takes the values I have seen (`Classic`, `Regular`, `Special`) or the `Unknown` placeholder.
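In my dataset the missing categories are all stored as placeholder strings, including the literal text `NaN`. In other datasets, real missing values (`numpy.nan`) may sit next to such strings. As a sketch under that assumption, on a hypothetical sample series, both kinds can be folded into `Unknown` in one pass:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing placeholder strings with a real missing value.
category = pd.Series(["Regular", "???", "null", np.nan, "UNKWN", "Classic", "NaN"])

placeholders = ["NaN", "null", "UNKWN", "???"]

# Map the placeholder strings to "Unknown", then fill real missing values too.
cleaned = category.replace(placeholders, "Unknown").fillna("Unknown")

print(cleaned.value_counts())
```

Note that `value_counts()` ignores real missing values by default, so passing `dropna=False` is a useful check before deciding which markers to replace.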

## 3/ Fixing spelling mistakes with known correct values

There is a **Country** feature in my dataset that contains a few country names, but some of them are spelled or written inconsistently. This happens in real life when data is collected from multiple sources that do not share the same software or language conventions.

I will therefore try to understand the misspelled values and replace them with a uniform spelling.

In [10]:
# Check the different values taken by data for the Country feature
dataframe['Country'].value_counts()

Brasil          9
France          9
China           8
USA             6
Fr              5
South Africa    5
brUsil          4
uSa             3
SAF             1
Name: Country, dtype: int64

As the value counts show, a few values of the Country feature are misspelled or abbreviated, but the intended country can be guessed. For example, *Fr* probably means *France*, *uSa* is a miscapitalized *USA* (as *CHinA* would be for *China*, although it does not appear in this run), *SAF* is an abbreviation of *South Africa*, and *brUsil* is clearly a typo for *Brasil*.

In [11]:
# Replace misspelled and erroneous values of the Country feature with known correct values
dataframe['Country'] = dataframe['Country'].replace(
    ['SAF', 'Fr', 'CHinA', 'uSa', 'brUsil'],
    ['South Africa', 'France', 'China', 'USA', 'Brasil'])

In [12]:
# Check the different values taken by data for the Country feature, after fixing misspelled values
dataframe['Country'].value_counts()

France          14
Brasil          13
USA              9
China            8
South Africa     6
Name: Country, dtype: int64

With this spellchecking and fixing step of the Data Cleaning process, we now clearly see the different countries involved in the dataset, with no surprising or confusing values.
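Listing every variant by hand does not scale well when capitalization varies freely. A possible alternative, sketched here with a hypothetical lookup table (`canonical`) and sample series, is to normalise case and whitespace first and then map the result to the known correct spelling. Values that are not in the table are left untouched so they can be reviewed.

```python
import pandas as pd

# Hypothetical lookup table: lowercase variant -> canonical spelling.
canonical = {
    "saf": "South Africa",
    "south africa": "South Africa",
    "fr": "France",
    "france": "France",
    "china": "China",
    "usa": "USA",
    "brasil": "Brasil",
    "brusil": "Brasil",
}

country = pd.Series(["Fr", "uSa", "CHinA", "brUsil", "SAF", "France", "Atlantis"])

# Normalise case and whitespace before the lookup so "uSa" and "USA" share a key.
keys = country.str.strip().str.lower()

# Unmapped values fall back to the original text instead of being silently lost.
cleaned = keys.map(canonical).fillna(country)
print(cleaned.tolist())
```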

## 4/ Mean filling for missing values

The **Height** feature of the dataset is missing a few values, but not a large number. To analyse the data without deleting too much information, it can be sensible to replace missing values in a numerical feature with the mean of that feature.

There are not many outliers in this feature, so replacing missing values with the mean is a reasonable choice.

In [13]:
# Count null and non-null values in the Height feature
height_null_values = dataframe['Height'].isnull().sum()
height_notnull_values = dataframe['Height'].notnull().sum()
print("Height feature has {} null values and {} not null values !".format(height_null_values, height_notnull_values))

Height feature has 7 null values and 43 not null values !


In [14]:
# Compute the mean of the Height feature (rounded to 2 decimals) and fill missing values with it
height_mean_value = round(dataframe['Height'].mean(), 2)
dataframe['Height'] = dataframe['Height'].fillna(height_mean_value)

In [15]:
# Count null and non-null values in the Height feature after mean filling
height_null_values = dataframe['Height'].isnull().sum()
height_notnull_values = dataframe['Height'].notnull().sum()
print("Height feature has {} null values and {} not null values !".format(height_null_values, height_notnull_values))

Height feature has 0 null values and 50 not null values !


Once the missing values of the `Height` feature have been replaced with the mean, you have a complete, homogeneous set of values for this feature on which to do further analysis.
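Mean filling has a side effect worth keeping in mind: the mean of the feature is preserved, but its spread shrinks, because every filled entry sits exactly on the mean. A small sketch on hypothetical heights (not taken from my dataset, and using the unrounded mean) makes this visible:

```python
import numpy as np
import pandas as pd

# Hypothetical heights in metres, with two missing entries.
height = pd.Series([1.60, 1.70, np.nan, 1.80, np.nan, 1.90])

# Fill missing entries with the mean of the observed values.
filled = height.fillna(height.mean())

# Compare mean and standard deviation before and after filling.
print("mean:", round(height.mean(), 3), "->", round(filled.mean(), 3))
print("std: ", round(height.std(), 3), "->", round(filled.std(), 3))
```

The more values are filled this way, the more the standard deviation is understated, which is one reason to reserve mean filling for features with few missing values.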

## 5/ Useless observations deletion

The **Name** feature, in my example dataset, serves as an identifier for each observation. With real-world data, we can face cases where these identifiers are not filled in: plenty of data is available, but it cannot be linked to a defined identifier, a specific object or an individual.

This generally means that, whatever analysis you run on your data, you will get stuck whenever you need to identify a data point. It is therefore sometimes useful to delete these points for specific analyses.

In [16]:
# Count null and non-null values in the Name feature
name_null_values = dataframe['Name'].isnull().sum()
name_notnull_values = dataframe['Name'].notnull().sum()
print("Name feature has {} null values and {} not null values !".format(name_null_values, name_notnull_values))

Name feature has 20 null values and 30 not null values !


In [17]:
# Delete observations where Name feature value is missing
dataframe = dataframe.dropna(subset=['Name'])

In [18]:
# Count null and non-null values in the Name feature after deleting unnamed observations
name_null_values = dataframe['Name'].isnull().sum()
name_notnull_values = dataframe['Name'].notnull().sum()
print("Name feature has {} null values and {} not null values !".format(name_null_values, name_notnull_values))

Name feature has 0 null values and 30 not null values !


Deleting observations with a missing `Name` reduces the size of your dataset (here from 50 to 30 rows), but makes identification and relationship analysis easier.
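Since this step throws data away, it is worth reporting how much is lost. As a minimal sketch on a hypothetical toy frame, the number and share of dropped rows can be computed around the `dropna` call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two unnamed observations.
toy = pd.DataFrame({
    "Name": ["John", np.nan, "Robert", None],
    "Salary": [72000, 138000, 70000, 33000],
})

rows_before = len(toy)

# Keep only the rows where the identifier is present.
named = toy.dropna(subset=["Name"])

rows_dropped = rows_before - len(named)
print(f"Dropped {rows_dropped} of {rows_before} rows ({rows_dropped / rows_before:.0%})")
```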

## 6/ Useless feature deletion

There is a **Missing** feature in my dataset that, as its name indicates, lacks a large amount of data. When a feature has no value for the majority of the data points, it brings little to the analysis done later on.

In this case, the feature is considered useless and is better deleted.

In [19]:
# Check the first few lines of the dataset, especially the Missing feature values
dataframe.head()

Unnamed: 0,Name,Missing,Category,Height,Salary,Date,Country,Email
0,John,,Special,1.78,72000,3 1951-14,Brasil,T7iP2.FaTN8@hotmail.com
2,Robert,,Regular,1.61,70000,3 1999-24,France,6s6Zs.Ay2FO@weneverknow.com
5,Margaret,,Regular,1.99,51000,9 1989-8,China,ameU8.prboY@gmail.com
6,Elizabeth,,Unknown,1.76,128000,3 1997-17,Brasil,7xW6p.pgOzn@gmail.com
7,John,,Special,2.0,46000,10 1975-22,France,eg6H1.YnwFH@darkmagic


In [20]:
# Drop the "Missing" feature of the dataset
dataframe = dataframe.drop(['Missing'], axis=1)

In [21]:
# Look back at the dataset with Missing feature deleted
dataframe.head()

Unnamed: 0,Name,Category,Height,Salary,Date,Country,Email
0,John,Special,1.78,72000,3 1951-14,Brasil,T7iP2.FaTN8@hotmail.com
2,Robert,Regular,1.61,70000,3 1999-24,France,6s6Zs.Ay2FO@weneverknow.com
5,Margaret,Regular,1.99,51000,9 1989-8,China,ameU8.prboY@gmail.com
6,Elizabeth,Unknown,1.76,128000,3 1997-17,Brasil,7xW6p.pgOzn@gmail.com
7,John,Special,2.0,46000,10 1975-22,France,eg6H1.YnwFH@darkmagic


When a useless feature is deleted, models and statistics computed on the dataset are no longer disturbed by a large amount of incorrect or missing data, which can make them more reliable. It is important to focus on the data that is of interest.
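Rather than judging from the first few rows, the share of missing values can be measured for every feature at once. The sketch below uses a hypothetical toy frame and an assumed cut-off of 50% (`threshold`), which is an illustrative choice and not a rule:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "Missing" is mostly empty, "Height" has one gap.
toy = pd.DataFrame({
    "Missing": [np.nan, 64.0, np.nan, np.nan, np.nan],
    "Height": [1.78, 1.52, np.nan, 1.52, 1.74],
    "Salary": [72000, 138000, 70000, 33000, 98000],
})

# Fraction of missing entries per column (the mean of a boolean mask).
missing_share = toy.isna().mean()

threshold = 0.5  # assumed cut-off, chosen for illustration only
to_drop = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop, list(trimmed.columns))
```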

## 7/ Median filling for erroneous values

The **Salary** feature of the dataset has a few values that do not seem plausible and may be errors. When data is missing or implausible in a numerical feature, and that feature potentially contains large outliers, replacing the data with the median instead of the mean can be more appropriate, because the median is much less sensitive to extreme values.

I deliberately generated large outliers, and I want to show how to replace them with the median of the feature.
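To see why the median is the safer statistic here, consider a short worked example with hypothetical salaries (illustrative numbers, not the notebook's data). Adding a single extreme value moves the mean a lot and the median only slightly:

```python
from statistics import mean, median

# Hypothetical salaries without and with one extreme value.
salaries = [34000, 70000, 88000, 98000, 137000]
with_outlier = salaries + [463000]

print("without outlier:", mean(salaries), median(salaries))
print("with outlier:   ", round(mean(with_outlier)), median(with_outlier))
```

In this example the mean rises from 85400 to roughly 148333, while the median only moves from 88000 to 93000.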

In [22]:
# Summarise the Salary feature before handling outliers
max_salary = max(dataframe['Salary'])
min_salary = min(dataframe['Salary'])
mean_salary = math.ceil(round(dataframe['Salary'].mean(), 0))
median_salary = math.ceil(dataframe['Salary'].median())
print("Salary feature : max={}, min={}, mean={}, median={}".format(max_salary, min_salary, mean_salary, median_salary))

Salary feature : max=463000, min=34000, mean=93367, median=88500


In [23]:
# Treat salaries above 2.5 times the mean as outliers and replace them with the median
max_salary_no_outlier = math.ceil(round(dataframe['Salary'].mean() * 2.5, 0))
# A boolean mask with .loc selects rows by label, so no index reset is needed
dataframe.loc[dataframe['Salary'] > max_salary_no_outlier, 'Salary'] = median_salary

In [24]:
# Recompute the summary statistics after replacing the outliers
max_salary = max(dataframe["Salary"])
min_salary = min(dataframe["Salary"])
mean_salary = math.ceil(round(dataframe["Salary"].mean(), 0))
median_salary = math.ceil(dataframe["Salary"].median())
print("Salary feature : max={}, min={}, mean={}, median={}".format(max_salary, min_salary, mean_salary, median_salary))

Salary feature : max=137000, min=34000, mean=80883, median=88250


Once the outliers of the `Salary` feature have been replaced with the median, the summary statistics are less distorted: the maximum drops from 463000 to 137000 and the mean from 93367 to 80883.