# Data Preprocessing in Python

## Importing Dependencies and Uploading Our Data

In [0]:
import pandas as pd
import numpy as np
from google.colab import files
import math

In [0]:
data = files.upload()

Saving mock_data_preprocessing.csv to mock_data_preprocessing (5).csv


### Sanity check to ensure our data was successfully uploaded

In [0]:
!ls

'mock_data_preprocessing (1).csv'  'mock_data_preprocessing (5).csv'
'mock_data_preprocessing (2).csv'   mock_data_preprocessing.csv
'mock_data_preprocessing (3).csv'   sample_data
'mock_data_preprocessing (4).csv'


## Data Exploration

In [0]:
dataset = pd.read_csv('mock_data_preprocessing.csv')

In [0]:
dataset.head(9)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
5,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
6,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000"
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000"


### Note:
`.describe()` summarizes only numeric columns, so it reports just Age and BMI. Every other column is stored as text (`object` dtype), which is a sign that those columns will need some processing as well.

In [0]:
print(dataset.describe())
print('============================')
print(dataset.dtypes)

             Age         BMI
count   9.000000    9.000000
mean   30.000000  132.133333
std     9.539392  325.087242
min    20.000000   17.700000
25%    23.000000   23.300000
50%    24.000000   23.700000
75%    40.000000   27.400000
max    42.000000  999.000000
============================
Id                  object
Age                  int64
Siblings            object
Height              object
Weight              object
BMI                float64
Vehicle             object
Travel Distance     object
Education           object
Salary              object
dtype: object


## Data Preprocessing

### Duplicates

To reduce the amount of computation required to clean the data we will first remove any existing duplicates.

**TODO**

1.) Check if there are any duplicate rows

2.) Remove any duplicate rows

**Note:** Make sure to keep one copy of any duplicates found in the dataset

In [0]:
# keep='first' (the default) retains one copy of each duplicated row.
dataset.drop_duplicates(inplace=True)
dataset.head(9)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000"
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000"
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000"
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000"
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000"
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000"
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000"
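The cell above removes duplicates without first checking for them (step 1 of the TODO). A minimal sketch of that check, using a small hypothetical frame rather than the tutorial's CSV, could look like this:

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical values, not the tutorial's CSV).
df = pd.DataFrame({"Id": ["x1", "x2", "x1"], "Age": [23, 20, 23]})

# duplicated() flags every repeat of an earlier row; the first occurrence stays False.
n_duplicates = df.duplicated().sum()
print(f"duplicate rows: {n_duplicates}")

# keep="first" (the default) retains one copy of each duplicated row.
deduped = df.drop_duplicates(keep="first")
print(deduped)
```

Counting the flagged rows before dropping them makes it easy to confirm that the number of rows removed matches expectations.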


### Age

To prepare our data for training, we will rescale every numeric column to the range 0 to 1 (min-max normalization).

**Note:** When normalizing a column, use only that column's own minimum and maximum.

**TODO**

1.) Find the minimum age

2.) Find the maximum age

3.) Normalize all the age values

**Note**

The normalization formula is as follows:

$$ X_{new} = \frac{X - X_{min}}  {X_{max} - X_{min}}$$
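Because the same formula is applied to several columns in this notebook, it can be wrapped in a small helper. The function name `min_max_normalize` is an assumption made for illustration, and the age values are the ones shown in the table above after duplicates are removed:

```python
import pandas as pd

def min_max_normalize(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series to [0, 1] using its own minimum and maximum."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return pd.Series(0.0, index=values.index)
    return (values - lo) / (hi - lo)

ages = pd.Series([23, 20, 40, 21, 42, 24, 35])
age_norm = min_max_normalize(ages)
print(age_norm.round(6).tolist())
```

The guard for a constant column is a design choice of this sketch; the notebook's cells do not need it because every column here has at least two distinct values.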



In [0]:
age_min = dataset['Age'].min()
age_max = dataset['Age'].max()
dataset['Age_norm'] = (dataset['Age'] - age_min) / (age_max - age_min)
dataset.head(9)

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Age_norm
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818


### Siblings

In the Siblings column, one value is "None". This could be a data-entry error, or it could mean that the person has no siblings. Either way, we need to convert it to a number.

**TODO**

1.) Replace any None values in Siblings with 0

2.) Find the minimum value

3.) Find the maximum value

4.) Normalize the siblings values

**Note**

Only numerical data types (ints, floats) can be normalized.

In [0]:
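# Treat the string 'None' as missing, then convert the column to a numeric dtype.
# Note: this leaves the missing value as NaN rather than replacing it with 0.
dataset['Siblings'] = pd.to_numeric(dataset['Siblings'].replace('None', np.nan))
Siblings_max = dataset['Siblings'].max()
Siblings_min = dataset['Siblings'].min()
# The column name 'Sibings_norm' (sic) is kept so that it matches the outputs shown.
dataset['Sibings_norm'] = (dataset['Siblings'] - Siblings_min) / (Siblings_max - Siblings_min)
dataset.head(9)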

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Age_norm,Sibings_norm
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091,0.166667
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455,1.0
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667
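The cell above converts "None" to NaN, so the first row still has no Siblings value. Step 1 of the TODO asks for 0 instead. A minimal sketch of that variant follows, assuming that a missing entry really does mean "no siblings" (if it is a data-entry error, another fill value such as the median may be more appropriate):

```python
import numpy as np
import pandas as pd

# Siblings values as shown in the table above, with the missing entry as NaN.
siblings = pd.Series([np.nan, 2.0, 1.0, 6.0, 0.0, 1.0, 4.0], name="Siblings")

# Assumption: a missing value means the person has no siblings.
siblings_filled = siblings.fillna(0)

s_min, s_max = siblings_filled.min(), siblings_filled.max()
siblings_norm = (siblings_filled - s_min) / (s_max - s_min)
print(siblings_norm.round(6).tolist())
```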


### Height, Weight and BMI
Height, Weight and BMI are linked by the formula below, so we can process the three columns together and use any two of them to recompute a missing or implausible third:

$$BMI = \frac{703 \times Weight_{pounds}}{Height_{inches}^2}$$

**TODO**

1.) Convert the Height values to inches

2.) Remove 'lbs' from the Weight values

3.) Find and fix the outlier in the BMI column

4.) Find and fix the 0 value in Height

5.) Normalize all three columns

In [0]:
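#@title Converting Height values to inches in new column Heightin

# Split "feet’inches" on the typographic apostrophe. A value without one (such as "0")
# has no inches part, so its Heightin becomes NaN.
height_parts = dataset['Height'].str.split("’", n=1, expand=True)
feet = pd.to_numeric(height_parts[0])
inches = pd.to_numeric(height_parts[1])
dataset['Heightin'] = 12 * feet + inches

dataset.head(9)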

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Age_norm,Sibings_norm,Heightin
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,,65.0
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333,72.0
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091,0.166667,58.0
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455,1.0,
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0,68.0
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667,68.0
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667,63.0
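The split above depends on every value using the typographic apostrophe (’). A more defensive parser is sketched below. The helper name `height_to_inches` is hypothetical, and accepting the ASCII apostrophe as well is an assumption about how such data might vary:

```python
import numpy as np
import pandas as pd

def height_to_inches(text) -> float:
    """Convert a feet’inches string such as "5’8" to inches; return NaN if malformed."""
    # Accept both the typographic (’) and the ASCII (') apostrophe.
    parts = str(text).replace("'", "’").split("’")
    if len(parts) != 2:
        return np.nan
    try:
        feet, inches = int(parts[0]), int(parts[1])
    except ValueError:
        return np.nan
    return float(12 * feet + inches)

heights = pd.Series(["5’5", "6’0", "4’10", "0", "5'8"])
height_in = heights.map(height_to_inches)
print(height_in.tolist())
```

Malformed entries such as "0" come back as NaN, which keeps them visible for a later repair step instead of silently turning them into a height of zero.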


In [0]:
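#@title Removing 'lbs' from the Weight values in WeightNew

# Remove the literal 'lbs' text; rstrip('lbs') would instead strip any trailing l, b or s characters.
dataset['WeightNew'] = pd.to_numeric(dataset['Weight'].str.replace('lbs', '', regex=False))
dataset.head(9)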

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,,65.0,140
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333,72.0,175
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091,0.166667,58.0,100
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455,1.0,,130
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0,68.0,180
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667,68.0,155
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667,63.0,100
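When removing a unit, note that `rstrip` treats its argument as a set of characters rather than as a suffix. The sketch below contrasts the two behaviours on a few illustrative strings (the "5 balls" string is a made-up example, not data from the tutorial):

```python
import pandas as pd

weights = pd.Series(["140lbs", "175lbs", "180"])

# Remove the suffix "lbs" only where it ends the string, then convert to numbers.
weight_numeric = pd.to_numeric(weights.str.replace(r"lbs$", "", regex=True))
print(weight_numeric.tolist())

# rstrip removes any trailing run of the characters l, b and s, not the word "lbs".
stripped = "5 balls".rstrip("lbs")
print(repr(stripped))
```

For values that are purely digits followed by a unit the two approaches give the same result, but the suffix-based version is safer if the text part can vary.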


In [0]:
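#@title Fixing outliers in BMI values
# Recompute BMI from weight and height, which replaces the implausible 999.0 value.
# Rows with a missing height produce NaN here.
dataset['BMIfixed'] = (dataset['WeightNew'] * 703) / (dataset['Heightin'] ** 2)
dataset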

Unnamed: 0,Id,Age,Siblings,Height,Weight,BMI,Vehicle,Travel Distance,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed
0,x1,23,,5’5,140lbs,23.3,Car,60km,Bachelor’s,"$60,000",0.136364,,65.0,140,23.294675
1,x2,20,2.0,6’0,175lbs,23.7,Bicycle,20km,High-School,"$20,000",0.0,0.333333,72.0,175,23.731674
2,x3,40,1.0,4’10,100lbs,999.0,Bus,40km,Master’s,"$75,000",0.909091,0.166667,58.0,100,20.897741
3,x4,21,6.0,0,130lbs,23.8,Taxi,50km,High-School,"$2,000,000",0.045455,1.0,,130,
4,x5,42,0.0,5’8,180,27.4,Bus,35km,Ph.D.,"$80,000",1.0,0.0,68.0,180,27.365917
7,x7,24,1.0,5’8,155lbs,23.6,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667,68.0,155,23.565095
8,x8,35,4.0,5’3,100lbs,17.7,Bus,15km,Master’s,"$50,000",0.681818,0.666667,63.0,100,17.71227
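Row x4 still has no height, so its recomputed BMI is missing as well (step 4 of the TODO). One way to repair it is to invert the BMI formula: height = sqrt(703 × weight / BMI). The sketch below assumes that the originally recorded BMI for that row (23.8) is trustworthy, which the data alone cannot confirm:

```python
import numpy as np
import pandas as pd

# Two rows taken from the tables above: x1 (complete) and x4 (height missing).
df = pd.DataFrame({
    "Heightin": [65.0, np.nan],
    "WeightNew": [140, 130],
    "BMI": [23.3, 23.8],
})

# Assumption: where the height is missing, the recorded BMI is correct.
missing = df["Heightin"].isna()
df.loc[missing, "Heightin"] = np.sqrt(
    703 * df.loc[missing, "WeightNew"] / df.loc[missing, "BMI"]
)

recovered = df.loc[1, "Heightin"]
print(round(recovered, 2))
```

With the height filled in, recomputing BMI for that row simply returns the recorded value, so all three normalized columns can then be calculated without gaps.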


In [0]:
# Min-max normalize each cleaned column using that column's own range.
for source, target in [('Heightin', 'Height_norm'),
                       ('WeightNew', 'Weight_norm'),
                       ('BMIfixed', 'BMI_norm')]:
    col_min = dataset[source].min()
    col_max = dataset[source].max()
    dataset[target] = (dataset[source] - col_min) / (col_max - col_min)

# Drop the raw columns now that cleaned versions exist.
dataset.drop(columns=['Height', 'Weight', 'BMI'], inplace=True)
dataset


Unnamed: 0,Id,Age,Siblings,Vehicle,Travel Distance,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed,Height_norm,Weight_norm,BMI_norm
0,x1,23,,Car,60km,Bachelor’s,"$60,000",0.136364,,65.0,140,23.294675,0.5,0.5,0.578269
1,x2,20,2.0,Bicycle,20km,High-School,"$20,000",0.0,0.333333,72.0,175,23.731674,1.0,0.9375,0.623537
2,x3,40,1.0,Bus,40km,Master’s,"$75,000",0.909091,0.166667,58.0,100,20.897741,0.0,0.0,0.329976
3,x4,21,6.0,Taxi,50km,High-School,"$2,000,000",0.045455,1.0,,130,,,0.375,
4,x5,42,0.0,Bus,35km,Ph.D.,"$80,000",1.0,0.0,68.0,180,27.365917,0.714286,1.0,1.0
7,x7,24,1.0,Foot,3km,Bachelor’s,"$100,000",0.181818,0.166667,68.0,155,23.565095,0.714286,0.6875,0.606281
8,x8,35,4.0,Bus,15km,Master’s,"$50,000",0.681818,0.666667,63.0,100,17.71227,0.357143,0.0,0.0


### Vehicle
Each entry in the Vehicle column is a text label. Since ML models require numerical data, we need to encode these categorical values in numerical form. One-Hot encoding is a popular way of dealing with categorical data.

One-Hot encoding works by creating one boolean (0/1) variable per category. Each row has a 1 in the variable for its own category and a 0 in all the others.

**Ex.:** If our categories were Cat, Dog and Bird, we could encode them as follows: Cat = 1 0 0, Dog = 0 1 0, Bird = 0 0 1
<br>
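The Cat, Dog, Bird example can be reproduced with `pd.get_dummies`. This is a small sketch on made-up data; note that pandas orders the new columns alphabetically (Bird, Cat, Dog), so the bit positions differ from the ordering used in the text:

```python
import pandas as pd

animals = pd.Series(["Cat", "Dog", "Bird"], name="Animal")

# dtype=int requests 0/1 columns instead of True/False.
one_hot = pd.get_dummies(animals, dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches its category.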

**TODO**

1.) Create a copy of the Vehicle column

2.) Convert the copy into a One-Hot representation

3.) Delete the original Vehicle column

4.) Add the One-Hot encoded representation back into the dataset

In [0]:
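# Work on a copy of the column, as the TODO asks.
dataset['Vehicle_'] = dataset['Vehicle']
# dtype=int yields 0/1 columns; pandas 2.0 and later default to boolean columns.
dummy = pd.get_dummies(dataset['Vehicle_'], dtype=int)
del dataset['Vehicle']
dataset = pd.concat([dataset, dummy], axis=1)
dataset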

Unnamed: 0,Id,Age,Siblings,Travel Distance,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed,Height_norm,Weight_norm,BMI_norm,Vehicle_,Bicycle,Bus,Car,Foot,Taxi
0,x1,23,,60km,Bachelor’s,"$60,000",0.136364,,65.0,140,23.294675,0.5,0.5,0.578269,Car,0,0,1,0,0
1,x2,20,2.0,20km,High-School,"$20,000",0.0,0.333333,72.0,175,23.731674,1.0,0.9375,0.623537,Bicycle,1,0,0,0,0
2,x3,40,1.0,40km,Master’s,"$75,000",0.909091,0.166667,58.0,100,20.897741,0.0,0.0,0.329976,Bus,0,1,0,0,0
3,x4,21,6.0,50km,High-School,"$2,000,000",0.045455,1.0,,130,,,0.375,,Taxi,0,0,0,0,1
4,x5,42,0.0,35km,Ph.D.,"$80,000",1.0,0.0,68.0,180,27.365917,0.714286,1.0,1.0,Bus,0,1,0,0,0
7,x7,24,1.0,3km,Bachelor’s,"$100,000",0.181818,0.166667,68.0,155,23.565095,0.714286,0.6875,0.606281,Foot,0,0,0,1,0
8,x8,35,4.0,15km,Master’s,"$50,000",0.681818,0.666667,63.0,100,17.71227,0.357143,0.0,0.0,Bus,0,1,0,0,0


### Travel Distance

Just as with the Weight column, we need to remove the units and normalize the values.

**TODO**

1.) Remove the km unit from all the values in the Travel Distance column

2.) Normalize all the values in the column

In [0]:
# Remove the literal 'km' text and convert the remainder to a number.
dataset['Travel Distance_'] = pd.to_numeric(dataset['Travel Distance'].str.replace('km', '', regex=False))
TravelDistance_min = dataset['Travel Distance_'].min()
TravelDistance_max = dataset['Travel Distance_'].max()
dataset['Travel Distance_norm'] = (dataset['Travel Distance_'] - TravelDistance_min) / (TravelDistance_max - TravelDistance_min)
dataset.drop(columns=['Travel Distance'], inplace=True)
dataset.head(9)

Unnamed: 0,Id,Age,Siblings,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed,Height_norm,Weight_norm,BMI_norm,Vehicle_,Bicycle,Bus,Car,Foot,Taxi,Travel Distance_,Travel Distance_norm
0,x1,23,,Bachelor’s,"$60,000",0.136364,,65.0,140,23.294675,0.5,0.5,0.578269,Car,0,0,1,0,0,60,1.0
1,x2,20,2.0,High-School,"$20,000",0.0,0.333333,72.0,175,23.731674,1.0,0.9375,0.623537,Bicycle,1,0,0,0,0,20,0.298246
2,x3,40,1.0,Master’s,"$75,000",0.909091,0.166667,58.0,100,20.897741,0.0,0.0,0.329976,Bus,0,1,0,0,0,40,0.649123
3,x4,21,6.0,High-School,"$2,000,000",0.045455,1.0,,130,,,0.375,,Taxi,0,0,0,0,1,50,0.824561
4,x5,42,0.0,Ph.D.,"$80,000",1.0,0.0,68.0,180,27.365917,0.714286,1.0,1.0,Bus,0,1,0,0,0,35,0.561404
7,x7,24,1.0,Bachelor’s,"$100,000",0.181818,0.166667,68.0,155,23.565095,0.714286,0.6875,0.606281,Foot,0,0,0,1,0,3,0.0
8,x8,35,4.0,Master’s,"$50,000",0.681818,0.666667,63.0,100,17.71227,0.357143,0.0,0.0,Bus,0,1,0,0,0,15,0.210526


### Education

Once again we are dealing with text values in the Education column. This time, though, the categories have a natural order, so we are dealing with ordinal data. This lets us encode the data differently and limit the number of columns in our dataset.

Instead of using One-Hot encoding, which replaces one column with k columns (a net increase of k-1), we will encode all the categories in a single column.

**Ex.:** If our categories were Beginner, Intermediate and Advanced, then instead of One-Hot encoding them we could encode them as follows: Beginner = 0, Intermediate = 1, Advanced = 2.

**Note:** There are still issues with encoding data like this as we are saying that the distance between each category is the same. Depending on the data this may not be true.
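As an illustration of ordinal encoding, the sketch below uses an ordered `pd.Categorical` on made-up skill levels. This is one possible approach rather than the one used in this notebook, which builds the mapping by hand with a dictionary:

```python
import pandas as pd

# Category order, from lowest to highest (illustrative values only).
levels = ["Beginner", "Intermediate", "Advanced"]
skill = pd.Series(["Intermediate", "Beginner", "Advanced", "Beginner"])

# An ordered categorical assigns integer codes that follow the given order.
ordered = pd.Categorical(skill, categories=levels, ordered=True)
codes = pd.Series(ordered.codes, index=skill.index)
print(codes.tolist())
```

A value that is not listed in `levels` would receive the code -1, which makes unexpected categories easy to spot.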

**TODO**

1.) Create a mapping from category to number. Keep in mind that since we are dealing with ordinal data we want to preserve the order of the values after the encoding

2.) Use the mapping to convert the values in the Education column into numbers

3.) Normalize the values

In [0]:
# Map each category to its rank so that the encoding preserves the order of the education levels.
codes = {'High-School': 0, 'Bachelor’s': 1, 'Master’s': 2, 'Ph.D.': 3}
dataset['Education'] = dataset['Education'].map(codes)
Education_max = dataset['Education'].max()
Education_min = dataset['Education'].min()
dataset['Education_norm'] = (dataset['Education'] - Education_min) / (Education_max - Education_min)
dataset

Unnamed: 0,Id,Age,Siblings,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed,Height_norm,Weight_norm,BMI_norm,Vehicle_,Bicycle,Bus,Car,Foot,Taxi,Travel Distance_,Travel Distance_norm,Education_norm
0,x1,23,,1,"$60,000",0.136364,,65.0,140,23.294675,0.5,0.5,0.578269,Car,0,0,1,0,0,60,1.0,0.333333
1,x2,20,2.0,0,"$20,000",0.0,0.333333,72.0,175,23.731674,1.0,0.9375,0.623537,Bicycle,1,0,0,0,0,20,0.298246,0.0
2,x3,40,1.0,2,"$75,000",0.909091,0.166667,58.0,100,20.897741,0.0,0.0,0.329976,Bus,0,1,0,0,0,40,0.649123,0.666667
3,x4,21,6.0,0,"$2,000,000",0.045455,1.0,,130,,,0.375,,Taxi,0,0,0,0,1,50,0.824561,0.0
4,x5,42,0.0,3,"$80,000",1.0,0.0,68.0,180,27.365917,0.714286,1.0,1.0,Bus,0,1,0,0,0,35,0.561404,1.0
7,x7,24,1.0,1,"$100,000",0.181818,0.166667,68.0,155,23.565095,0.714286,0.6875,0.606281,Foot,0,0,0,1,0,3,0.0,0.333333
8,x8,35,4.0,2,"$50,000",0.681818,0.666667,63.0,100,17.71227,0.357143,0.0,0.0,Bus,0,1,0,0,0,15,0.210526,0.666667


### Salary
The values in the Salary column contain special characters (the dollar sign and thousands separators) that we need to remove before the values can be treated as numbers.

**TODO**

1.) Remove all the special characters

2.) Deal with any outliers

3.) Normalize the data

In [0]:
# Strip the dollar sign and the thousands separators, then convert to a number.
dataset['Salary'] = pd.to_numeric(dataset['Salary'].str.replace(r'[$,]', '', regex=True))
# Note: the 2,000,000 outlier is not handled here, so it pushes every other Salary_norm value toward 0.
Salary_min = dataset['Salary'].min()
Salary_max = dataset['Salary'].max()
dataset['Salary_norm'] = (dataset['Salary'] - Salary_min) / (Salary_max - Salary_min)
dataset

Unnamed: 0,Id,Age,Siblings,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed,Height_norm,Weight_norm,BMI_norm,Vehicle_,Bicycle,Bus,Car,Foot,Taxi,Travel Distance_,Travel Distance_norm,Education_norm,Salary_norm
0,x1,23,,1,60000,0.136364,,65.0,140,23.294675,0.5,0.5,0.578269,Car,0,0,1,0,0,60,1.0,0.333333,0.020202
1,x2,20,2.0,0,20000,0.0,0.333333,72.0,175,23.731674,1.0,0.9375,0.623537,Bicycle,1,0,0,0,0,20,0.298246,0.0,0.0
2,x3,40,1.0,2,75000,0.909091,0.166667,58.0,100,20.897741,0.0,0.0,0.329976,Bus,0,1,0,0,0,40,0.649123,0.666667,0.027778
3,x4,21,6.0,0,2000000,0.045455,1.0,,130,,,0.375,,Taxi,0,0,0,0,1,50,0.824561,0.0,1.0
4,x5,42,0.0,3,80000,1.0,0.0,68.0,180,27.365917,0.714286,1.0,1.0,Bus,0,1,0,0,0,35,0.561404,1.0,0.030303
7,x7,24,1.0,1,100000,0.181818,0.166667,68.0,155,23.565095,0.714286,0.6875,0.606281,Foot,0,0,0,1,0,3,0.0,0.333333,0.040404
8,x8,35,4.0,2,50000,0.681818,0.666667,63.0,100,17.71227,0.357143,0.0,0.0,Bus,0,1,0,0,0,15,0.210526,0.666667,0.015152
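The cell above normalizes Salary without dealing with the 2,000,000 value (step 2 of the TODO), so every other salary ends up below 0.05. One common heuristic, shown here as a sketch and not as the notebook's prescribed method, flags values outside 1.5 × IQR of the quartiles and clips them to that fence before normalizing:

```python
import pandas as pd

# Salary values as shown in the table above.
salary = pd.Series([60000, 20000, 75000, 2000000, 80000, 100000, 50000])

# Interquartile-range fences (the 1.5 multiplier is a conventional choice).
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (salary < lower) | (salary > upper)
print(salary[is_outlier].tolist())

# Clip to the fences, then apply min-max normalization.
salary_clipped = salary.clip(lower=lower, upper=upper)
salary_norm = (salary_clipped - salary_clipped.min()) / (salary_clipped.max() - salary_clipped.min())
print(salary_norm.round(3).tolist())
```

Clipping keeps the row while limiting its influence. Dropping the row or replacing the value with the median are alternatives, and the right choice depends on whether the salary is believed to be an error.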


## Our Clean Data

In [0]:
dataset.head(7)

Unnamed: 0,Id,Age,Siblings,Education,Salary,Age_norm,Sibings_norm,Heightin,WeightNew,BMIfixed,Height_norm,Weight_norm,BMI_norm,Vehicle_,Bicycle,Bus,Car,Foot,Taxi,Travel Distance_,Travel Distance_norm,Education_norm,Salary_norm
0,x1,23,,1,60000,0.136364,,65.0,140,23.294675,0.5,0.5,0.578269,Car,0,0,1,0,0,60,1.0,0.333333,0.020202
1,x2,20,2.0,0,20000,0.0,0.333333,72.0,175,23.731674,1.0,0.9375,0.623537,Bicycle,1,0,0,0,0,20,0.298246,0.0,0.0
2,x3,40,1.0,2,75000,0.909091,0.166667,58.0,100,20.897741,0.0,0.0,0.329976,Bus,0,1,0,0,0,40,0.649123,0.666667,0.027778
3,x4,21,6.0,0,2000000,0.045455,1.0,,130,,,0.375,,Taxi,0,0,0,0,1,50,0.824561,0.0,1.0
4,x5,42,0.0,3,80000,1.0,0.0,68.0,180,27.365917,0.714286,1.0,1.0,Bus,0,1,0,0,0,35,0.561404,1.0,0.030303
7,x7,24,1.0,1,100000,0.181818,0.166667,68.0,155,23.565095,0.714286,0.6875,0.606281,Foot,0,0,0,1,0,3,0.0,0.333333,0.040404
8,x8,35,4.0,2,50000,0.681818,0.666667,63.0,100,17.71227,0.357143,0.0,0.0,Bus,0,1,0,0,0,15,0.210526,0.666667,0.015152


In [0]:
print(dataset.describe())
print('============================')
print(dataset.dtypes)

             Age  Siblings  ...  Education_norm  Salary_norm
count   7.000000  6.000000  ...        7.000000     7.000000
mean   29.285714  2.333333  ...        0.428571     0.161977
std     9.411239  2.250926  ...        0.370899     0.369753
min    20.000000  0.000000  ...        0.000000     0.000000
25%    22.000000  1.000000  ...        0.166667     0.017677
50%    24.000000  1.500000  ...        0.333333     0.027778
75%    37.500000  3.500000  ...        0.666667     0.035354
max    42.000000  6.000000  ...        1.000000     1.000000

[8 rows x 21 columns]
============================
Id                       object
Age                       int64
Siblings                float64
Education                 int64
Salary                    int64
Age_norm                float64
Sibings_norm            float64
Heightin                float64
WeightNew                 int64
BMIfixed                float64
Height_norm             float64
Weight_norm             float64
BMI_norm                float64
Vehicle_    