<img src='data-wrangling.jpg' />   

<h1>Data Wrangling</h1>

<h2>Table of content</h2>

<ul>
    <li><a href="#identify_handle_missing_values">Identify and handle missing values</a>
        <ul>
            <li><a href="#identify_missing_values">Identify missing values</a></li>
            <li><a href="#deal_missing_values">Deal with missing values</a></li>
            <li><a href="#correct_data_format">Correct data format</a></li>
        </ul>
    </li>
    <li><a href="#data_standardization">Data standardization</a></li>
    <li><a href="#data_normalization">Data Normalization (centering/scaling)</a></li>
    <li><a href="#binning">Binning</a></li>
    <li><a href="#indicator">Indicator variable</a></li>
</ul>
    


 


<h2>What is the purpose of Data Wrangling?</h2>

Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.

The data to be used can be found in the UCI Machine Learning Repository:
following link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data. 

<h2>Loading the libraries<\h2>

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

I will create a list containing the names of the features

In [2]:
name_features = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [3]:
data = pd.read_csv("imports-85.data", names = name_features)
pd.set_option("display.max_column", 200)
data.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


<h2>Identify missing values<\2>

In [4]:
#data['normalized-losses'].value_counts()
for i in data.columns:
    values = data[i].value_counts()
    names = i
    print("=="*90)
    print(names)
    print("=="*90)
    print(values)

symboling
 0    67
 1    54
 2    32
 3    27
-1    22
-2     3
Name: symboling, dtype: int64
normalized-losses
?      41
161    11
91      8
150     7
128     6
134     6
104     6
94      5
85      5
95      5
102     5
74      5
103     5
168     5
65      5
122     4
93      4
148     4
106     4
118     4
125     3
115     3
137     3
154     3
101     3
83      3
81      2
113     2
164     2
192     2
110     2
153     2
89      2
108     2
194     2
87      2
158     2
119     2
129     2
145     2
197     2
188     2
186     1
256     1
77      1
142     1
107     1
90      1
121     1
231     1
78      1
98      1
Name: normalized-losses, dtype: int64
make
toyota           32
nissan           18
mazda            17
mitsubishi       13
honda            13
volkswagen       12
subaru           12
peugot           11
volvo            11
dodge             9
mercedes-benz     8
bmw               8
audi              7
plymouth          7
saab              6
porsche           5
isuzu

After running the command "value_counts" in the following features were found "?" Which is a missing value, but as in pandas the default is "NaN" pandas ends up not recognizing it as a missing value, the fetures that contain this inconsistency are:<wr>

- normalized-losses with 41 "?" values<br>
- num-of-doors with 2 "?" values<br>
- stroke with 4 "?" values<br>
- horsepower with 2 values "?"<br>
- peak-rpm with 2 values "?"<br>
- price with 4 values "?"<br>

Now I will replace all "?" Values with "NaN" so that I can know the amount of missing values that exist in this dataset.

In [5]:
data.replace("?", np.nan, inplace=True)
data.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Checking for missing values

In [6]:
data.isnull().sum()

symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

<h3 id="deal_missing_values">Deal with missing data</h3>

<b>Replace by mean:</b>
<ul>
    <li>"normalized-losses": 41 missing data, replace them with mean</li>
    <li>"stroke": 4 missing data, replace them with mean</li>
    <li>"bore": 4 missing data, replace them with mean</li>
    <li>"horsepower": 2 missing data, replace them with mean</li>
    <li>"peak-rpm": 2 missing data, replace them with mean</li>
</ul>

<b>Replace by frequency:</b>
<ul>
    <li>"num-of-doors": 2 missing data, replace them with "four". 
        <ul>
            <li>Reason: 84% sedans is four doors. Since four doors is most frequent, it is most likely to occur</li>
        </ul>
    </li>
</ul>

<b>Drop the whole row:</b>
<ul>
    <li>"price": 4 missing data, simply delete the whole row
        <ul>
            <li>Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row now without price data is not useful to us</li>
        </ul>
    </li>
</ul>

In [7]:
# Calculating the averages
avg_norm = data["normalized-losses"].astype("float").mean()
avg_stroke = data["stroke"].astype("float").mean()
avg_bore = data["bore"].astype("float").mean()
avg_horsepower = data["horsepower"].astype("float").mean()
avg_peak = data["peak-rpm"].astype("float").mean()

# Replacing the missing values with the average of the feature
data["normalized-losses"].replace(np.nan, avg_norm, inplace= True)
data["stroke"].replace(np.nan, avg_stroke, inplace = True)
data["bore"].replace(np.nan, avg_bore , inplace= True)
data["horsepower"].replace(np.nan, avg_horsepower , inplace = True)
data["peak-rpm"].replace(np.nan, avg_peak, inplace= True)

# Replace "num-of-doors" value with the most frequent value.
data['num-of-doors'].replace(np.nan, "four", inplace = True)

# Dropping missing values in "price"
data.dropna(subset= ["price"], axis=0 , inplace = True)

# Checking the result
data.isnull().sum()

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64

<h4>Convert data types to proper format</h4>

In [8]:
data[["bore", "stroke"]] = data[["bore", "stroke"]].astype("float")
data[["normalized-losses"]] = data[["normalized-losses"]].astype("int")
data[["price"]] = data[["price"]].astype("float")
data[["peak-rpm"]] = data[["peak-rpm"]].astype("float")

# Checking the result
data.dtypes

symboling              int64
normalized-losses      int32
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower            object
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

<h2 id="data_standardization">Data Standardization</h2>

In [9]:
# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
data['city-L/100km'] = 235/data["city-mpg"]

<h2 id="data_normalization">Data Normalization</h2>

In [10]:
# replace (original value) by (original value)/(maximum value)
data['length'] = data['length']/data['length'].max()
data['width'] = data['width']/data['width'].max()

# Checking the result
data.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km
0,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,0.811148,0.890278,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,27,13495.0,11.190476
1,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,0.811148,0.890278,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,27,16500.0,11.190476
2,1,122,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,0.822681,0.909722,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000.0,19,26,16500.0,12.368421
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,0.84863,0.919444,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500.0,24,30,13950.0,9.791667
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,0.84863,0.922222,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500.0,18,22,17450.0,13.055556


In [11]:
X = data.drop("price", axis= "columns")
y = data['price']

In [12]:
X.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,city-L/100km
0,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,0.811148,0.890278,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,27,11.190476
1,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,0.811148,0.890278,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,27,11.190476
2,1,122,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,0.822681,0.909722,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000.0,19,26,12.368421
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,0.84863,0.919444,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500.0,24,30,9.791667
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,0.84863,0.922222,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500.0,18,22,13.055556


In [13]:
from sklearn.compose import make_column_transformer
from category_encoders.one_hot import OneHotEncoder
from category_encoders.ordinal import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [14]:
enc = OneHotEncoder(cols=["make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels", 
                          "engine-location","engine-type","num-of-cylinders","fuel-system"], use_cat_names = True)
X_enc = enc.fit_transform(X)

In [15]:
Xtrain, Xtest , ytrain, ytest = train_test_split(X_enc,y, test_size = 0.3, random_state=22)
rfr = RandomForestRegressor(n_jobs= 4,n_estimators = 100, random_state=22)
rfr.fit(Xtrain, ytrain)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
                      oob_score=False, random_state=22, verbose=0,
                      warm_start=False)

In [16]:
yhat = rfr.predict(Xtest)
mean_absolute_error(ytest,yhat)

1365.6356830601096

In [17]:
enc_ord = OrdinalEncoder(cols=["make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels", 
                          "engine-location","engine-type","num-of-cylinders","fuel-system"])
X_ord = enc.fit_transform(X)

In [18]:
Xtrain, Xtest , ytrain, ytest = train_test_split(X_ord,y, test_size = 0.3, random_state=22)
mdl = RandomForestRegressor(n_jobs= 4,n_estimators = 100, random_state=22)
mdl.fit(Xtrain, ytrain)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
                      oob_score=False, random_state=22, verbose=0,
                      warm_start=False)

In [20]:
yhat = mdl.predict(Xtest)
mean_absolute_error(ytest,yhat)

1365.6356830601096

In [21]:
from category_encoders.target_encoder import TargetEncoder

In [23]:
enc = TargetEncoder(cols=["make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels", 
                          "engine-location","engine-type","num-of-cylinders","fuel-system"])
enc.fit(X,y)
X_target = enc.transform(X)

In [25]:
Xtrain1, Xtest1 , ytrain1, ytest1 = train_test_split(X_target,y, test_size = 0.3, random_state=22)
mdt = RandomForestRegressor(n_jobs= 4,n_estimators = 100, random_state=22)
mdt.fit(Xtrain1, ytrain1)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
                      oob_score=False, random_state=22, verbose=0,
                      warm_start=False)

In [27]:
yhat1 = mdt.predict(Xtest1)
mean_absolute_error(ytest1,yhat1)

1196.319262295082

In [30]:
mdt.score(Xtest1, ytest1)

0.9205827273542828