<H3> Importing Modules and Data-Frame</H3>

In [14]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

df = pd.read_csv('fdata.csv')
df

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,97.6,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130.0,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,97.6,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130.0,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,97.6,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152.0,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109.0,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136.0,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,141.0,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141.0,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,173.0,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145.0,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0


The ranges of the variables differ widely. For instance, 'stroke' is a float of roughly 2-4, whereas 'price' is roughly 10k-40k. On such different scales, the features cannot be compared meaningfully.

Therefore, we normalize the data.


<h2> Data Normalization</h2>
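Data normalization can be done in 3 ways:

1. Simple Feature Scaling: divide each value by the column maximum<br>
2. Min-Max Method: subtract the column minimum, then divide by the range (max - min)<br>
3. Z-Score Method: subtract the column mean, then divide by the standard deviation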


In [15]:
# 1.Simple feature scaling 
df['length'] = df['length']/df['length'].max()
df['length']

0      0.811148
1      0.811148
2      0.822681
3      0.848630
4      0.848630
         ...   
200    0.907256
201    0.907256
202    0.907256
203    0.907256
204    0.907256
Name: length, Length: 205, dtype: float64

In [16]:
# 2.Min-Max Method
df['width'] = (df['width'] - df['width'].min())/(df['width'].max()-df['width'].min())
df['width']

0      0.316667
1      0.316667
2      0.433333
3      0.491667
4      0.508333
         ...   
200    0.716667
201    0.708333
202    0.716667
203    0.716667
204    0.716667
Name: width, Length: 205, dtype: float64

In [18]:
# 3.Z-Score Method
df['height'] = (df['height']-df['height'].mean())/df['height'].std()
df['height']

0     -2.015483
1     -2.015483
2     -0.542200
3      0.235366
4      0.235366
         ...   
200    0.726460
201    0.726460
202    0.726460
203    0.726460
204    0.726460
Name: height, Length: 205, dtype: float64
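
The three rescalings above can be checked side by side on a small toy Series. This is only a sketch: the values and the names `s`, `simple`, `min_max` and `z_score` are made up for illustration and do not come from `fdata.csv`.

```python
import pandas as pd

# Hypothetical toy values, not taken from fdata.csv
s = pd.Series([2.0, 4.0, 6.0, 8.0], name="toy")

# 1. Simple feature scaling: divide by the maximum, so the largest value becomes 1
simple = s / s.max()

# 2. Min-max: subtract the minimum and divide by the range, giving values in [0, 1]
min_max = (s - s.min()) / (s.max() - s.min())

# 3. Z-score: subtract the mean and divide by the standard deviation
#    (pandas .std() uses the sample standard deviation, ddof=1, by default)
z_score = (s - s.mean()) / s.std()

print(simple.tolist())
print(min_max.tolist())
print(z_score.round(3).tolist())
```

Simple feature scaling and min-max both give results bounded above by 1, while z-scores are centred on 0 and can be negative, which matches the negative `height` values shown above.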

<h2> Binning</h2><br>
Binning groups data into categories. For example, we can group prices into low, medium and high.

In [23]:
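# 4 equally spaced edges give 3 equal-width bins; named "bins" to avoid shadowing the built-in bin()
bins = np.linspace(df['price'].min(), df['price'].max(), 4)
bin_names = ['low','medium','high']
df['price-binned'] = pd.cut(df['price'], bins, labels=bin_names, include_lowest=True)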
df['price-binned']

0         low
1         low
2         low
3         low
4         low
        ...  
200       low
201    medium
202    medium
203    medium
204    medium
Name: price-binned, Length: 205, dtype: category
Categories (3, object): [low < medium < high]
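
To see how the bin edges translate into labels, the same steps can be run on a few made-up prices. This is a sketch under stated assumptions: `prices`, `edges`, `labels` and `binned` are hypothetical names, and the numbers were chosen only to make the edges easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, chosen only to illustrate the bin edges
prices = pd.Series([5000.0, 12000.0, 20000.0, 31000.0, 45000.0])

# 4 equally spaced edges between the minimum and maximum define 3 equal-width bins
edges = np.linspace(prices.min(), prices.max(), 4)
labels = ["low", "medium", "high"]

# include_lowest=True keeps the minimum price inside the first bin;
# without it the left edge is exclusive and the minimum would become NaN
binned = pd.cut(prices, edges, labels=labels, include_lowest=True)

print(edges)
print(binned.value_counts(sort=False))
```

Because the bins have equal width rather than equal counts, a skewed column such as price can put most rows in the lowest bin.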

<h2>Changing Categorical Data to Numerical Data</h2>

<h3><u>One Hot Encoding:</u></h3> We create a dummy variable (a new 0/1 column) for each category and assign '1' to the rows that belong to that category.

In [29]:
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()

Unnamed: 0,diesel,gas
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


In [30]:
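# get_dummies already named the columns 'diesel' and 'gas', so these keys match nothing and the rename changes nothing
dummy_variable_1.rename(columns={'fuel-type-gas':'gas', 'fuel-type-diesel':'diesel'}, inplace=True)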
dummy_variable_1.head()

Unnamed: 0,diesel,gas
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


In [31]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)

In [32]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,price-binned,diesel,gas
0,3,97.6,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,2.68,9.0,111.0,5000.0,21,27,13495.0,low,0,1
1,3,97.6,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,2.68,9.0,111.0,5000.0,21,27,16500.0,low,0,1
2,1,97.6,alfa-romero,std,two,hatchback,rwd,front,94.5,0.822681,...,3.47,9.0,154.0,5000.0,19,26,16500.0,low,0,1
3,2,164.0,audi,std,four,sedan,fwd,front,99.8,0.84863,...,3.4,10.0,102.0,5500.0,24,30,13950.0,low,0,1
4,2,164.0,audi,std,four,sedan,4wd,front,99.4,0.84863,...,3.4,8.0,115.0,5500.0,18,22,17450.0,low,0,1


In [33]:
dummy_variable_1 = pd.get_dummies(df["aspiration"])
dummy_variable_1.rename(columns={'aspiration-std':'std', 'aspiration-turbo':'turbo'}, inplace=True)
dummy_variable_1.head()

Unnamed: 0,std,turbo
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


In [34]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "aspiration" from "df"
df.drop("aspiration", axis = 1, inplace=True)
df.head()

Unnamed: 0,symboling,normalized-losses,make,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,...,horsepower,peak-rpm,city-mpg,highway-mpg,price,price-binned,diesel,gas,std,turbo
0,3,97.6,alfa-romero,two,convertible,rwd,front,88.6,0.811148,0.316667,...,111.0,5000.0,21,27,13495.0,low,0,1,1,0
1,3,97.6,alfa-romero,two,convertible,rwd,front,88.6,0.811148,0.316667,...,111.0,5000.0,21,27,16500.0,low,0,1,1,0
2,1,97.6,alfa-romero,two,hatchback,rwd,front,94.5,0.822681,0.433333,...,154.0,5000.0,19,26,16500.0,low,0,1,1,0
3,2,164.0,audi,four,sedan,fwd,front,99.8,0.84863,0.491667,...,102.0,5500.0,24,30,13950.0,low,0,1,1,0
4,2,164.0,audi,four,sedan,4wd,front,99.4,0.84863,0.508333,...,115.0,5500.0,18,22,17450.0,low,0,1,1,0
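
The create, concat and drop steps above can also be collapsed into one call by passing `columns=` to `pd.get_dummies`. This is a sketch on a hypothetical three-row frame (`toy` and `encoded` are illustrative names). Unlike the cells above, this form prefixes each dummy column with its source column name, and `dtype=int` is passed so the result holds 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical rows that mimic the two categorical columns used above
toy = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "turbo"],
    "price": [13495.0, 22470.0, 19045.0],
})

# Encode both columns at once; the original columns are replaced by their dummies,
# which are named "<column>_<category>" by default
encoded = pd.get_dummies(toy, columns=["fuel-type", "aspiration"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The prefixed names avoid collisions when two categorical columns share a category label, at the cost of longer column names than the bare `diesel`, `gas`, `std` and `turbo` used in this notebook.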


In [35]:
df.to_csv('clean_df.csv',index=False)