# Data Preprocessing in Python

## Importing Dependencies and uploading our data

In [0]:
import pandas as pd
import numpy as np
from google.colab import files
import math

In [98]:
data = files.upload()

Saving mock_data_preprocessing.csv to mock_data_preprocessing (2).csv


### Sanity check to ensure our data was successfully uploaded

In [99]:
!ls

'mock_data_preprocessing (1).csv'   mock_data_preprocessing.csv
'mock_data_preprocessing (2).csv'   sample_data


## Data Exploration

In [0]:
dataset = pd.read_csv('mock_data_preprocessing.csv')

In [293]:
dataset.head(9)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
5,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
6,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000"
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000"


### Note:
.describe() is only showing us the Age and BMI columns because they are the only columns with only numbers in them. This is a sign that all the other columns will need to be processed to some degree as well.

In [294]:
print(dataset.describe())
print('============================')
print(dataset.dtypes)

             Age         BMI
count   9.000000    9.000000
mean   30.000000  132.133333
std     9.539392  325.087242
min    20.000000   17.700000
25%    23.000000   23.300000
50%    24.000000   23.700000
75%    40.000000   27.400000
max    42.000000  999.000000
Id                  object
Age                  int64
Siblings            object
Height              object
Weight              object
BMI                float64
Vehicle             object
Travel Distance     object
Education           object
Salary              object
dtype: object


## Data Preprocessing

### Duplicates

To reduce the amount of computation required to clean the data we will first remove any existing duplicates.

**TODO**

1.) Check if there are any duplicate rows

2.) Remove any duplicate rows

**Note:** Make sure to keep one copy of any duplicates found in the dataset

In [295]:
df = dataset
df.drop_duplicates(subset=None, keep='first', inplace=False)
df.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"


### Age

In order to prepare our data for training we will normalize all our data to be between 0 and 1.

**Note:** When normalizing data only look at the values of the feature/column being normalized

**TODO**

1.) Find the minimum age

2.) Find the maximum age

3.) Normalize all the age values

**Note**

The normalize formula is as follows:

$$ X_{new} = \frac{X - X_{min}}  {X_{max} - X_{min}}$$



In [296]:
df['Age'] = (df['Age'] - df['Age'].min())/(df['Age'].max() - df['Age'].min())
df.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,0.136364,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,0.0,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,0.909091,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,0.045455,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,1.0,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"


### Siblings

Under Siblings we notice that we have a Value of "None". This could be due to an error or it could mean that they have no siblings. Either way we need to convert this to a number.

**Todo**

1.) Replace any None values in Siblings with 0

2.) Find the minimum value

3.) Find the maximum value

4.) Normalize the siblings values

**Note**

Only numerical data types(ints, floats) can be normalized.

In [297]:
df.replace(to_replace='None', value=0, inplace=True)
df.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,0.136364,0,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,0.0,2,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,0.909091,1,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,0.045455,6,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,1.0,0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"


### Height, Weight and BMI
The BMI column allows us to deal with these three columns at the same time. We can use the following formula to fix any missing values as well:

$$BMI = \frac{703 * Weight_{pounds}}  {Height_{inches}^2} $$

**TODO**

1.) Convert the Height values to inches

2.) Remove 'lbs' from the Weight values

3.) Find and fix the outlier in the BMI column

4.) Find and fix the 0 value in Height

5.) Normalize all three columns

In [305]:
def convert_height(string):
  if isinstance(string, str):
    feet = int(string.split("’")[0])
    inches = int(string.split("’")[1])
    total_inches = feet*12 + inches
    return total_inches
  else:
    return string

def remove_lbs(string):
  if isinstance(string, str):
    return int(string.rstrip('lbs'))
  else:
    return string

def find_bmi(height, weight):
  return (703*weight) / (height*height)

def find_height(weight, bmi):
  inches = round(math.sqrt((703*weight) / bmi))
  return height

find_height(180, 22.7)

58

In [299]:
height = df.loc[df['BMI']==999.0]['Height'].to_string(index=False)
height = convert_height(str(height))

weight = df.loc[df['BMI']==999.0]['Weight'].to_string(index=False)
weight = remove_lbs(weight)

df.loc[df['BMI']==999.0,'BMI'] = find_bmi(height, weight)

df.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,0.136364,0,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,0.0,2,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,0.909091,1,4’10,100lbs,20.897741,Bus,40km,Master’s,"$75,000"
3,x4,0.045455,6,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,1.0,0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"


In [300]:
df.loc[df['Height']=='0']
weight = df.loc[df['Height']=='0']['Weight'].to_string(index=False)
weight = remove_lbs(weight)

bmi = float(df.loc[df['Height']=='0']['BMI'].to_string(index=False))
df.loc[df['Height']=='0','Height'] = find_height(weight, bmi)
df.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,0.136364,0,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,0.0,2,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,0.909091,1,4’10,100lbs,20.897741,Bus,40km,Master’s,"$75,000"
3,x4,0.045455,6,58,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,1.0,0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"


In [306]:
df['Weight'] = df['Weight'].apply(remove_lbs)
df['Height'] = df['Height'].apply(convert_height)
df.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,0.136364,0,65,140,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,0.0,2,72,175,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,0.909091,1,58,100,20.897741,Bus,40km,Master’s,"$75,000"
3,x4,0.045455,6,58,130,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,1.0,0,68,180,27.4,Bus,35km,Ph.D.,"$80,000"


### Vehicle
The entries in the Vehicle column each have text representing that cell's value. Since ML models require numerical data to work, we will need to encode the categorical values into a numerical form. One-Hot encoding is a popular way of dealing with categorical data.

One-Hot encoding works by creating multiple boolean variables and assigning each variable to one category.

**ex.** If our categories were Cat, Dog, Bird. We can encode them as followed: Cat = 1 0 0,  Dog = 0 1 0, Bird = 0 0 1
<br>

**TODO**

1.) Create a copy of the Vehicle column

2.) Convert the copy into a One-Hot representation

3.) Delete the original Vehicle column

4.) Add the One-Hot encoded representation back into the dataset

In [307]:
pd.get_dummies(df['Vehicle'], prefix=['Vehicle'], drop_first=True)
copy_df = df.copy()
copy_df = pd.get_dummies(df['Vehicle'], prefix='Vehicle', drop_first=True)
new_df = df.join(copy_df)
new_df.drop(columns=['Vehicle'], inplace=True)
df = new_df.copy()
df

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Travel Distance,Education,Salary,Vehicle_Bus,Vehicle_Car,Vehicle_Foot,Vehicle_Taxi
0,x1,0.136364,0,65,140,23.3,60km,Bachelor’s,"$60,000",0,1,0,0
1,x2,0.0,2,72,175,23.7,20km,High-School,"$20,000",0,0,0,0
2,x3,0.909091,1,58,100,20.897741,40km,Master’s,"$75,000",1,0,0,0
3,x4,0.045455,6,58,130,23.8,50km,High-School,"$2,000,000",0,0,0,1
4,x5,1.0,0,68,180,27.4,35km,Ph.D.,"$80,000",1,0,0,0
5,x5,1.0,0,68,180,27.4,35km,Ph.D.,"$80,000",1,0,0,0
6,x1,0.136364,0,65,140,23.3,60km,Bachelor’s,"$60,000",0,1,0,0
7,x7,0.181818,1,68,155,23.6,3km,Bachelor’s,"$100,000",0,0,1,0
8,x8,0.681818,4,63,100,17.7,15km,Master’s,"$50,000",1,0,0,0


### Travel Distance

Just as with Weights column, we need to remove the units and normalize the values.

**TODO**

1.) Remove the km unit from all the values in the Travel Distance column

2.) Normalize all the values in the columns

In [308]:
def remove_km(dist):
  dist = int(dist.rstrip('km'))
  return dist
  

df['Travel Distance'] = df['Travel Distance'].apply(remove_km)
df['Travel Distance'] = (df['Travel Distance'] - df['Travel Distance'].min())/(df['Travel Distance'].max() - df['Travel Distance'].min())
df


Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Travel Distance,Education,Salary,Vehicle_Bus,Vehicle_Car,Vehicle_Foot,Vehicle_Taxi
0,x1,0.136364,0,65,140,23.3,1.0,Bachelor’s,"$60,000",0,1,0,0
1,x2,0.0,2,72,175,23.7,0.298246,High-School,"$20,000",0,0,0,0
2,x3,0.909091,1,58,100,20.897741,0.649123,Master’s,"$75,000",1,0,0,0
3,x4,0.045455,6,58,130,23.8,0.824561,High-School,"$2,000,000",0,0,0,1
4,x5,1.0,0,68,180,27.4,0.561404,Ph.D.,"$80,000",1,0,0,0
5,x5,1.0,0,68,180,27.4,0.561404,Ph.D.,"$80,000",1,0,0,0
6,x1,0.136364,0,65,140,23.3,1.0,Bachelor’s,"$60,000",0,1,0,0
7,x7,0.181818,1,68,155,23.6,0.0,Bachelor’s,"$100,000",0,0,1,0
8,x8,0.681818,4,63,100,17.7,0.210526,Master’s,"$50,000",1,0,0,0


### Education

Once again we are dealing with text values in the Education column. This time though since the categories can be ordered,  we are dealing with ordinal data instead. This allows us to encode the data differently in order to limit the number of columns in our dataset.

Instead of using One-Hot encoding which will have us add k-1 columns to our dataset, we will encode all the categories in one column.

**Ex.:** If our data was: Beginner, Intermediate, Advance. Instead of encoding as One-Hot, we can encode it as followed: Beginner = 0, Intermediate = 1, Advance = 2.

**Note:** There are still issues with encoding data like this as we are saying that the distance between each category is the same. Depending on the data this may not be true.

 **TODO**
 
 1.) Create a mapping from category to number. Keep in mind that since we are dealing with ordinal data we want to preserve the order of the values after the encoding
 
 2.) Use the mapping to convert the values in the Education column into numbers
 
 3.) Normalize the values

In [0]:
def return_mapping(string):
  mapping = {'High-School': 0, "Bachelor’s": 1, "Master’s": 2, "Ph.D.": 3}
  return mapping[string]

df['Education'] = df['Education'].apply(return_mapping)


In [310]:
df['Education'] = (df['Education'] - df['Education'].min())/(df['Education'].max() - df['Education'].min())
df

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Travel Distance,Education,Salary,Vehicle_Bus,Vehicle_Car,Vehicle_Foot,Vehicle_Taxi
0,x1,0.136364,0,65,140,23.3,1.0,0.333333,"$60,000",0,1,0,0
1,x2,0.0,2,72,175,23.7,0.298246,0.0,"$20,000",0,0,0,0
2,x3,0.909091,1,58,100,20.897741,0.649123,0.666667,"$75,000",1,0,0,0
3,x4,0.045455,6,58,130,23.8,0.824561,0.0,"$2,000,000",0,0,0,1
4,x5,1.0,0,68,180,27.4,0.561404,1.0,"$80,000",1,0,0,0
5,x5,1.0,0,68,180,27.4,0.561404,1.0,"$80,000",1,0,0,0
6,x1,0.136364,0,65,140,23.3,1.0,0.333333,"$60,000",0,1,0,0
7,x7,0.181818,1,68,155,23.6,0.0,0.333333,"$100,000",0,0,1,0
8,x8,0.681818,4,63,100,17.7,0.210526,0.666667,"$50,000",1,0,0,0


### Salary
The values in the Salary column contain special characters that we need to remove in order for our models to work

**TODO**

1.) Remove all the special characters

2.) Deal with any outliers

3.) Normalize the data

In [313]:
def regex(string):
  if isinstance(string, str):
    return int(''.join(ch for ch in string if ch.isdigit()))
  else:
    return string

df['Salary'] = df['Salary'].apply(regex)
df.set_value(3, 'Salary', int(round(df['Salary'].mean())))
df['Salary'] = (df['Salary'] - df['Salary'].min())/(df['Salary'].max() - df['Salary'].min())

  


280555.55555555556

## Our Clean Data

In [314]:
df.head(7)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Travel Distance,Education,Salary,Vehicle_Bus,Vehicle_Car,Vehicle_Foot,Vehicle_Taxi
0,x1,0.136364,0,65,140,23.3,1.0,0.333333,0.5,0,1,0,0
1,x2,0.0,2,72,175,23.7,0.298246,0.0,0.0,0,0,0,0
2,x3,0.909091,1,58,100,20.897741,0.649123,0.666667,0.6875,1,0,0,0
3,x4,0.045455,6,58,130,23.8,0.824561,0.0,0.868825,0,0,0,1
4,x5,1.0,0,68,180,27.4,0.561404,1.0,0.75,1,0,0,0
5,x5,1.0,0,68,180,27.4,0.561404,1.0,0.75,1,0,0,0
6,x1,0.136364,0,65,140,23.3,1.0,0.333333,0.5,0,1,0,0


In [264]:
print(df.describe())
print('============================')
print(df.dtypes)

            Age      Weight        BMI  ...  Vehicle_Car  Vehicle_Foot  Vehicle_Taxi
count  9.000000    9.000000   9.000000  ...     9.000000      9.000000      9.000000
mean   0.454545  144.444444  23.455305  ...     0.222222      0.111111      0.111111
std    0.433609   31.169340   2.979802  ...     0.440959      0.333333      0.333333
min    0.000000  100.000000  17.700000  ...     0.000000      0.000000      0.000000
25%    0.136364  130.000000  23.300000  ...     0.000000      0.000000      0.000000
50%    0.181818  140.000000  23.600000  ...     0.000000      0.000000      0.000000
75%    0.909091  175.000000  23.800000  ...     0.000000      0.000000      0.000000
max    1.000000  180.000000  27.400000  ...     1.000000      1.000000      1.000000

[8 rows x 10 columns]
Id                  object
Age                float64
Siblings            object
Height              object
Weight               int64
BMI                float64
Travel Distance    float64
Education          floa