# Data Preprocessing in Python

## Importing Dependencies and Loading Our Data

In [49]:
import pandas as pd
import numpy as np
import math

In [83]:
dataset = pd.read_csv(r'/Users/mimiyufanyou/Downloads/mock_data_preprocessing.csv')

## Data Exploration

In [84]:
dataset

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
5,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
6,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000"
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000"


### Note:
`.describe()` only summarizes the Age and BMI columns because they are the only columns that pandas parsed as numeric. Every other column has the `object` dtype (text), which is a sign that those columns will need some processing as well.

In [52]:
print(dataset.describe())
print('============================')
print(dataset.dtypes)

             Age         BMI
count   9.000000    9.000000
mean   30.000000  132.133333
std     9.539392  325.087242
min    20.000000   17.700000
25%    23.000000   23.300000
50%    24.000000   23.700000
75%    40.000000   27.400000
max    42.000000  999.000000
============================
Id                  object
Age                  int64
Siblings            object
Height              object
Weight              object
BMI                float64
Vehicle             object
Travel Distance     object
Education           object
Salary              object
dtype: object


## Data Preprocessing

### Duplicates

To reduce the amount of computation required to clean the data, we first remove any duplicate rows.

**TODO**

1.) Check if there are any duplicate rows

2.) Remove any duplicate rows

**Note:** Make sure to keep one copy of any duplicates found in the dataset

In [85]:
dataset.duplicated(keep='first')

0    False
1    False
2    False
3    False
4    False
5     True
6     True
7    False
8    False
dtype: bool

In [86]:
# Keep the first copy of each duplicated row
dataset = dataset[~dataset.duplicated(keep='first')]

# Sanity check: this should now be empty
dups = dataset[dataset.duplicated()]

In [87]:
dataset.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
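
The two TODO steps for duplicates can also be sketched with `drop_duplicates`. This is a minimal example on a made-up frame; `toy`, `n_duplicates` and `deduped` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Small illustrative frame with two repeated rows
toy = pd.DataFrame({
    "Id": ["x1", "x5", "x5", "x1"],
    "Age": [23, 42, 42, 23],
})

# Step 1: count the rows that repeat an earlier row
n_duplicates = int(toy.duplicated(keep="first").sum())

# Step 2: drop the repeats but keep the first copy of each
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

Resetting the index is optional; the notebook keeps the original row labels instead.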


### Age

To prepare our data for training, we will rescale every numeric column to the range 0 to 1 using min-max normalization.

**Note:** When normalizing a column, compute the minimum and maximum from that column's values only.

**TODO**

1.) Find the minimum age

2.) Find the maximum age

3.) Normalize all the age values

**Note**

The normalization formula is as follows:

$$ X_{new} = \frac{X - X_{min}}  {X_{max} - X_{min}}$$
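
As a rough sketch, the formula can be wrapped in a small helper and reused for each column. The function name `min_max_normalize` and the handling of a constant column are assumptions made for this example; the ages are the seven unique values from the dataset.

```python
import pandas as pd

def min_max_normalize(column: pd.Series) -> pd.Series:
    """Rescale a numeric Series to [0, 1] using its own min and max."""
    col_min = column.min()
    col_max = column.max()
    if col_max == col_min:
        # Assumption: a constant column has no spread, so map it to 0.0
        # instead of dividing by zero
        return pd.Series(0.0, index=column.index)
    return (column - col_min) / (col_max - col_min)

ages = pd.Series([23, 20, 40, 21, 42, 24, 35])
norm_ages = min_max_normalize(ages)
print(norm_ages.round(6).tolist())
```

The youngest person (20) maps to 0 and the oldest (42) maps to 1; every other age falls in between.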



In [88]:
minage = dataset['Age'].min()
maxage = dataset['Age'].max()

dataset['Norm_Age'] = (dataset['Age']-minage)/(maxage-minage)

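A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

**Note:** This is pandas' `SettingWithCopyWarning`. It appears because `dataset` was created by filtering the original DataFrame, so pandas cannot tell whether the new column is being written to a copy. The assignment still takes effect, as the next output shows; calling `.copy()` on the filtered DataFrame would silence the warning. The same warning is emitted by the later cells that assign to `dataset`.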


In [89]:
dataset

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818


### Siblings

In the Siblings column we notice a value of "None". This could be a data-entry error, or it could mean that the person has no siblings. Either way, we need to convert it to a number.

**TODO**

1.) Replace any None values in Siblings with 0

2.) Find the minimum value

3.) Find the maximum value

4.) Normalize the siblings values

**Note**

Only numerical data types (ints, floats) can be normalized.

In [90]:
# Use a single .loc assignment instead of chained indexing
dataset.loc[dataset['Siblings'] == 'None', 'Siblings'] = 0



In [91]:
dataset['Siblings'] = dataset['Siblings'].astype(int)



In [92]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 8
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               7 non-null      object 
 1   Age              7 non-null      int64  
 2   Siblings         7 non-null      int64  
 3   Height           7 non-null      object 
 4   Weight           7 non-null      object 
 5   BMI              7 non-null      float64
 6   Vehicle          7 non-null      object 
 7   Travel Distance  7 non-null      object 
 8   Education        7 non-null      object 
 9   Salary           7 non-null      object 
 10  Norm_Age         7 non-null      float64
dtypes: float64(2), int64(2), object(7)
memory usage: 672.0+ bytes
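
An alternative sketch for the replace-and-convert steps uses `pd.to_numeric`, which turns anything that is not a number into NaN so it can be filled in one pass. It assumes, as the section does, that a missing value means zero siblings; `raw_siblings` is made-up sample data.

```python
import pandas as pd

# Sample values: the text "None", numeric strings, and a true missing value
raw_siblings = pd.Series(["None", "2", "1", None, "6"])

# Non-numeric entries become NaN, then NaN is treated as "no siblings"
siblings = pd.to_numeric(raw_siblings, errors="coerce").fillna(0).astype(int)

print(siblings.tolist())
```

This handles both the string "None" and genuinely empty cells, whichever form the CSV reader produces.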


In [93]:
minsib = dataset['Siblings'].min()
maxsib = dataset['Siblings'].max()

dataset['Norm_Siblings'] = (dataset['Siblings']-minsib)/(maxsib-minsib)



In [95]:
dataset.head(10)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age,Norm_Siblings
0,x1,23,0,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,0.0
1,x2,20,2,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333
2,x3,40,1,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091,0.166667
3,x4,21,6,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455,1.0
4,x5,42,0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0
7,x7,24,1,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667
8,x8,35,4,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667


### Height, Weight and BMI
The BMI column allows us to deal with these three columns at the same time. We can use the following formula to fix any missing values as well:

$$BMI = \frac{703 \times Weight_{pounds}}{Height_{inches}^2}$$

**TODO**

1.) Convert the Height values to inches

2.) Remove 'lbs' from the Weight values

3.) Find and fix the outlier in the BMI column

4.) Find and fix the 0 value in Height

5.) Normalize all three columns

In [98]:
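# Remove the 'lbs' suffix explicitly; rstrip('lbs ') strips a set of characters, not a suffix
dataset['Weight'] = dataset['Weight'].str.replace('lbs', '', regex=False).str.strip()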



In [101]:
# Drop the rows with the BMI outlier (x3) and the zero Height (x4) rather than repairing them
dataset = dataset[~dataset['Id'].isin(['x3', 'x4'])]

In [102]:
dataset.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age,Norm_Siblings
0,x1,23,0,5’5,140,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,0.0
1,x2,20,2,6’0,175,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333
4,x5,42,0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0
7,x7,24,1,5’8,155,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667
8,x8,35,4,5’3,100,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667
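
The notebook drops the two problem rows. As a hedged alternative, the TODO steps could be sketched by repairing them with the BMI formula instead. The helper `height_to_inches`, the column names `Height_in` and `Weight_lb`, and the cutoff of 100 for an implausible BMI are all assumptions made for this example.

```python
import pandas as pd

def height_to_inches(height: str) -> float:
    """Convert a feet’inches string such as "5’5" to inches; anything else becomes NaN."""
    normalized = str(height).replace("’", "'")
    if "'" not in normalized:
        return float("nan")
    feet, inches = normalized.split("'", 1)
    return int(feet) * 12 + int(inches or 0)

# Three rows copied from the dataset, including the two problem rows
people = pd.DataFrame({
    "Id": ["x2", "x3", "x4"],
    "Height": ["6’0", "4’10", "0"],
    "Weight": ["175lbs", "100lbs", "130lbs"],
    "BMI": [23.7, 999.0, 23.8],
})

people["Height_in"] = people["Height"].apply(height_to_inches)
people["Weight_lb"] = people["Weight"].str.replace("lbs", "", regex=False).astype(float)

# Assumption: a BMI above 100 is an outlier; recompute it from height and weight
bad_bmi = (people["BMI"] > 100) & people["Height_in"].notna()
people.loc[bad_bmi, "BMI"] = (
    703 * people.loc[bad_bmi, "Weight_lb"] / people.loc[bad_bmi, "Height_in"] ** 2
)

# Recover a missing height by inverting the formula: height = sqrt(703 * weight / BMI)
missing_height = people["Height_in"].isna()
people.loc[missing_height, "Height_in"] = (
    703 * people.loc[missing_height, "Weight_lb"] / people.loc[missing_height, "BMI"]
) ** 0.5

print(people[["Id", "Height_in", "Weight_lb", "BMI"]])
```

Repairing keeps all seven unique people in the dataset, at the cost of trusting the other two columns in each repaired row.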


### Vehicle
Every entry in the Vehicle column is a text label. Since most ML models require numerical input, we need to encode these categorical values in numerical form. One-Hot encoding is a popular way of doing this.

One-Hot encoding works by creating one boolean variable per category and setting only the variable that matches the row's category to 1.

**Ex.:** If our categories were Cat, Dog and Bird, we could encode them as follows: Cat = 1 0 0, Dog = 0 1 0, Bird = 0 0 1

**TODO**

1.) Create a copy of the Vehicle column

2.) Convert the copy into a One-Hot representation

3.) Delete the original Vehicle column

4.) Add the One-Hot encoded representation back into the dataset

In [None]:
# One 0/1 indicator column per vehicle category left in the dataset
for vehicle in ['Car', 'Bicycle', 'Bus', 'Foot']:
    dataset[vehicle] = (dataset['Vehicle'] == vehicle).astype(int)

In [109]:
dataset.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age,Norm_Siblings,Car,Bicycle,Bus,Foot
0,x1,23,0,5’5,140,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,0.0,1,0,0,0
1,x2,20,2,6’0,175,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333,0,1,0,0
4,x5,42,0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0,0,0,1,0
7,x7,24,1,5’8,155,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667,0,0,0,1
8,x8,35,4,5’3,100,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667,0,0,1,0
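
The four TODO steps, including deleting the original Vehicle column, can also be sketched with `pd.get_dummies`. The frame `trips` below is made-up sample data, and the column order is whatever `get_dummies` produces (alphabetical by category).

```python
import pandas as pd

trips = pd.DataFrame({
    "Id": ["x1", "x2", "x5", "x7"],
    "Vehicle": ["Car", "Bicycle", "Bus", "Foot"],
})

# Steps 1-2: build the One-Hot representation from the Vehicle column
one_hot = pd.get_dummies(trips["Vehicle"], dtype=int)

# Steps 3-4: drop the original column and attach the encoded columns
encoded = pd.concat([trips.drop(columns="Vehicle"), one_hot], axis=1)

print(encoded)
```

Unlike a hand-written list of categories, this picks up every category present in the data, so a new vehicle type would get its own column automatically.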


### Travel Distance

Just as with the Weight column, we need to remove the units and normalize the values.

**TODO**

1.) Remove the km unit from all the values in the Travel Distance column

2.) Normalize all the values in the column

In [None]:
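# Remove the 'km' suffix explicitly; rstrip('km ') strips a set of characters, not a suffix
dataset['Travel Distance'] = dataset['Travel Distance'].str.replace('km', '', regex=False).str.strip()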
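
The second TODO step, normalization, could look roughly like the sketch below. It uses the five distances that remain in the dataset; the names `distance_km` and `norm_distance` are illustrative.

```python
import pandas as pd

distance = pd.Series(["60km", "20km", "35km", "3km", "15km"])

# Strip the unit and convert the text to numbers before doing arithmetic
distance_km = distance.str.replace("km", "", regex=False).astype(float)

# Min-max normalization using only this column's values
norm_distance = (distance_km - distance_km.min()) / (distance_km.max() - distance_km.min())

print(norm_distance.round(6).tolist())
```

The conversion to `float` matters: after the unit is removed the values are still strings, and subtracting strings would raise an error.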

### Education

Once again we are dealing with text values in the Education column. This time, though, the categories can be ordered, so we are dealing with ordinal data. This lets us encode the data differently and limit the number of columns in our dataset.

Instead of using One-Hot encoding, which replaces the one column with k indicator columns (a net increase of k-1), we will encode all the categories in a single column.

**Ex.:** If our data was Beginner, Intermediate, Advanced, then instead of One-Hot encoding it we could encode it as follows: Beginner = 0, Intermediate = 1, Advanced = 2.

**Note:** Encoding data like this still has a drawback: it implies that the distance between neighbouring categories is always the same. Depending on the data this may not be true.

**TODO**

1.) Create a mapping from category to number. Keep in mind that since we are dealing with ordinal data we want to preserve the order of the values after the encoding

2.) Use the mapping to convert the values in the Education column into numbers

3.) Normalize the values

In [133]:
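# NOTE: these codes rank Bachelor’s (0) below High-School (1), so they do not follow the order of the education levels
education = {"Bachelor’s": 0, "High-School":1, "Master’s":2, "Ph.D.":3}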
education

{'Bachelor’s': 0, 'High-School': 1, 'Master’s': 2, 'Ph.D.': 3}

In [134]:
dataset['Education'] = [ education.get(item, item) for item in dataset['Education']]



In [135]:
dataset.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age,Norm_Siblings,Car,Bicycle,Bus,Foot
0,x1,23,0,5’5,140,23.3,Car,60,0,"$60,000",0.136364,0.0,1,0,0,0
1,x2,20,2,6’0,175,23.7,Bicycle,20,1,"$20,000",0.0,0.333333,0,1,0,0
4,x5,42,0,5’8,180,27.4,Bus,35,3,"$80,000",1.0,0.0,0,0,1,0
7,x7,24,1,5’8,155,23.6,Foot,3,0,"$100,000",0.181818,0.166667,0,0,0,1
8,x8,35,4,5’3,100,17.7,Bus,15,2,"$50,000",0.681818,0.666667,0,0,1,0
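
The TODO asks for a mapping that preserves the order of the categories, followed by normalization. A minimal sketch, assuming the intended order is High-School < Bachelor’s < Master’s < Ph.D. (the names `education_order`, `education_codes` and `levels` are illustrative):

```python
import pandas as pd

# Assumed ordering, lowest to highest
education_order = ["High-School", "Bachelor’s", "Master’s", "Ph.D."]
education_codes = {level: rank for rank, level in enumerate(education_order)}

levels = pd.Series(["Bachelor’s", "High-School", "Ph.D.", "Bachelor’s", "Master’s"])

# Map each label to its rank, then scale the ranks to [0, 1]
encoded = levels.map(education_codes)
norm_education = encoded / (len(education_order) - 1)

print(encoded.tolist())
print(norm_education.round(6).tolist())
```

Dividing by the highest possible rank, rather than the highest observed one, keeps the scale stable even if a sample happens to contain no Ph.D. holders. With this sample the two choices give the same result.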


### Salary
The values in the Salary column contain special characters (the dollar sign and thousands separators) that we need to remove before our models can use them.

**TODO**

1.) Remove all the special characters

2.) Deal with any outliers

3.) Normalize the data

In [151]:
# Strip the dollar sign and the thousands separators; str.replace matches literal text, so the comma needs no escaping
dataset['Salary'] = dataset['Salary'].apply(lambda x: x.replace('$', '').replace(',', ''))



In [152]:
dataset.head()

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age,Norm_Siblings,Car,Bicycle,Bus,Foot
0,x1,23,0,5’5,140,23.3,Car,60,0,60000,0.136364,0.0,1,0,0,0
1,x2,20,2,6’0,175,23.7,Bicycle,20,1,20000,0.0,0.333333,0,1,0,0
4,x5,42,0,5’8,180,27.4,Bus,35,3,80000,1.0,0.0,0,0,1,0
7,x7,24,1,5’8,155,23.6,Foot,3,0,100000,0.181818,0.166667,0,0,0,1
8,x8,35,4,5’3,100,17.7,Bus,15,2,50000,0.681818,0.666667,0,0,1,0
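
The remaining TODO steps, outlier handling and normalization, could be sketched as follows. The 1.5 × IQR rule is one common convention for flagging outliers and is an assumption here, not something the notebook prescribes; the salaries are the seven unique values from the original data.

```python
import pandas as pd

salary = pd.Series(
    ["$60,000", "$20,000", "$75,000", "$2,000,000", "$80,000", "$100,000", "$50,000"]
)

# Step 1: remove '$' and ',' and convert the text to integers
salary_num = salary.str.replace(r"[$,]", "", regex=True).astype(int)

# Step 2: flag values outside 1.5 * IQR of the quartiles (assumed rule)
q1, q3 = salary_num.quantile(0.25), salary_num.quantile(0.75)
iqr = q3 - q1
is_outlier = (salary_num < q1 - 1.5 * iqr) | (salary_num > q3 + 1.5 * iqr)

# Step 3: min-max normalize the salaries that were kept
kept = salary_num[~is_outlier]
norm_salary = (kept - kept.min()) / (kept.max() - kept.min())

print(is_outlier.tolist())
print(norm_salary.round(6).tolist())
```

Removing the outlier before normalizing matters: with the $2,000,000 value left in, every other salary would be squeezed into a narrow band near 0.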


## Our Clean Data

In [153]:
dataset.head(7)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Norm_Age,Norm_Siblings,Car,Bicycle,Bus,Foot
0,x1,23,0,5’5,140,23.3,Car,60,0,60000,0.136364,0.0,1,0,0,0
1,x2,20,2,6’0,175,23.7,Bicycle,20,1,20000,0.0,0.333333,0,1,0,0
4,x5,42,0,5’8,180,27.4,Bus,35,3,80000,1.0,0.0,0,0,1,0
7,x7,24,1,5’8,155,23.6,Foot,3,0,100000,0.181818,0.166667,0,0,0,1
8,x8,35,4,5’3,100,17.7,Bus,15,2,50000,0.681818,0.666667,0,0,1,0


In [154]:
print(dataset.describe())
print('============================')
print(dataset.dtypes)

             Age  Siblings        BMI  Education  Norm_Age  Norm_Siblings  \
count   5.000000   5.00000   5.000000    5.00000  5.000000       5.000000   
mean   28.800000   1.40000  23.140000    1.20000  0.400000       0.233333   
std     9.311283   1.67332   3.474622    1.30384  0.423240       0.278887   
min    20.000000   0.00000  17.700000    0.00000  0.000000       0.000000   
25%    23.000000   0.00000  23.300000    0.00000  0.136364       0.000000   
50%    24.000000   1.00000  23.600000    1.00000  0.181818       0.166667   
75%    35.000000   2.00000  23.700000    2.00000  0.681818       0.333333   
max    42.000000   4.00000  27.400000    3.00000  1.000000       0.666667   

            Car   Bicycle       Bus      Foot  
count  5.000000  5.000000  5.000000  5.000000  
mean   0.200000  0.200000  0.400000  0.200000  
std    0.447214  0.447214  0.547723  0.447214  
min    0.000000  0.000000  0.000000  0.000000  
25%    0.000000  0.000000  0.000000  0.000000  
50%    0.000000  0