## CSE 422 Introduction to Data Preprocessing
---







### What are the advantages of preprocessing the data before applying a machine learning algorithm?

"The biggest advantage of pre-processing in ML is to improve **generalizability** of your model. Data for any ML application is collected through some ‘sensors’. These sensors can be physical devices, instruments, software programs such as web crawlers, manual surveys, etc. Due to hardware malfunctions, software glitches, instrument failures, and human errors, noise and erroneous information may creep in that can severely affect the performance of your model. Apart from **noise**, there are several **redundant information** that needs to be removed. For e.g. while predicting whether it rains tomorrow or not, age of the person is irrelevant. In terms of text processing, there are several stop words that may be redundant for the analysis. Lastly, there may be several **outliers** present in your data, due to the way data is collected that may need to be removed to improve the performance of the classifiers."
                                    
-Shehroz Khan, ML Researcher, Postdoc @ U of Toronto


Some Data Preprocessing Techniques:

* Deleting duplicate and null values
* Imputation for missing values
* Handling Categorical Features
* Feature Normalization/Scaling
* Feature Engineering
* Feature Selection
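
The first item in the list, deleting duplicate and null values, can be sketched on a small made-up table. The column names and values below are hypothetical and are not taken from the glass dataset; this is only a minimal illustration of `drop_duplicates()` and `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one missing value.
toy = pd.DataFrame({
    "Na": [13.0, 13.0, 14.2, 12.8],
    "Ca": [8.4, 8.4, np.nan, 9.1],
})

deduped = toy.drop_duplicates()   # drops the repeated (13.0, 8.4) row
cleaned = deduped.dropna()        # drops the row whose Ca is NaN

print(cleaned)
```

Dropping rows is the simplest option, but it discards data; imputation keeps the row and fills the gap instead.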

In [850]:
#importing necessary libraries
import pandas as pd
import numpy as np
import sklearn

## Removing Null Values / Handling Missing Data




In [851]:
df = pd.read_csv('/kaggle/input/glass-source-classification-datasetcsv/Glass Source Classification Dataset.csv')
df.sample(25)

Unnamed: 0.1,Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
186,186,1.51838,14.32,3.26,2.22,71.25,1.46,5.79,exists,Does not exist,headlamp glass
136,136,1.51806,13.0,3.8,1.08,73.07,0.56,8.38,Does not exist,exists,building_window glass
140,140,1.5169,13.33,3.54,1.61,72.54,0.68,8.11,Does not exist,Does not exist,building_window glass
34,34,1.51783,12.69,3.54,1.34,72.95,0.57,8.75,Does not exist,Does not exist,building_window glass
169,169,1.51994,13.27,0.0,1.76,73.03,0.47,11.32,Does not exist,Does not exist,container glass
58,58,1.51754,13.48,3.74,1.17,72.99,0.59,8.03,Does not exist,Does not exist,building_window glass
200,200,1.51508,15.15,0.0,2.25,73.5,0.0,8.34,exists,Does not exist,headlamp glass
177,177,1.51937,13.79,2.41,1.19,72.76,0.0,9.77,Does not exist,Does not exist,tableware glass
49,49,1.51898,13.58,3.35,1.23,72.08,0.59,8.91,Does not exist,Does not exist,building_window glass
137,137,1.51711,12.89,3.62,1.57,72.96,0.61,8.11,Does not exist,Does not exist,building_window glass


In [852]:
df.shape

(214, 11)

In [853]:
df.columns

Index(['Unnamed: 0', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe',
       'Type'],
      dtype='object')

In [854]:
df.isnull().sum()

Unnamed: 0    0
RI            0
Na            0
Mg            0
Al            0
Si            0
K             0
Ca            6
Ba            0
Fe            0
Type          0
dtype: int64

### Dropping columns

In [855]:
df = df.drop(['Unnamed: 0'], axis = 1)
df.shape

(214, 10)

In [856]:
df.sample(10)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
69,1.523,13.31,3.58,0.82,71.99,0.12,10.17,Does not exist,exists,building_window glass
141,1.51851,13.2,3.63,1.07,72.83,0.57,8.41,exists,exists,building_window glass
67,1.52152,13.05,3.65,0.87,72.32,0.19,9.85,Does not exist,exists,building_window glass
136,1.51806,13.0,3.8,1.08,73.07,0.56,8.38,Does not exist,exists,building_window glass
211,1.52065,14.36,0.0,2.02,73.42,0.0,8.44,exists,Does not exist,headlamp glass
71,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,Does not exist,exists,building_window glass
122,1.51687,13.23,3.54,1.48,72.84,0.56,8.1,Does not exist,Does not exist,building_window glass
199,1.51609,15.01,0.0,2.51,73.05,0.05,8.83,exists,Does not exist,headlamp glass
48,1.52223,13.21,3.77,0.79,71.99,0.13,10.02,Does not exist,Does not exist,building_window glass
29,1.51784,13.08,3.49,1.28,72.86,0.6,8.49,Does not exist,Does not exist,building_window glass


### Imputing missing Values

In [857]:
from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='median')

impute.fit(df[['Ca']])

df['Ca'] = impute.transform(df[['Ca']])
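
To make the effect of the median strategy concrete, here is a small sketch on a hypothetical column (the values are invented for illustration). It uses `fit_transform`, which combines the `fit` and `transform` steps shown above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"Ca": [8.0, np.nan, 9.0, 10.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
# fit_transform learns the median of the observed values (8, 9, 10)
# and fills every NaN with it; the result is 2-D, so flatten it.
toy["Ca"] = imputer.fit_transform(toy[["Ca"]]).ravel()

print(imputer.statistics_)   # learned median per column
print(toy["Ca"].tolist())
```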

In [858]:
df.sample(10)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
64,1.52172,13.48,3.74,0.9,72.01,0.18,9.61,Does not exist,exists,building_window glass
16,1.51784,12.68,3.67,1.16,73.11,0.61,8.7,Does not exist,Does not exist,building_window glass
86,1.51569,13.24,3.49,1.47,73.25,0.38,8.03,Does not exist,Does not exist,building_window glass
60,1.51905,13.6,3.62,1.11,72.64,0.14,8.76,Does not exist,Does not exist,building_window glass
83,1.51594,13.09,3.52,1.55,72.87,0.68,8.05,Does not exist,exists,building_window glass
28,1.51768,12.56,3.52,1.43,73.15,0.57,8.54,Does not exist,Does not exist,building_window glass
14,1.51763,12.61,3.59,1.31,73.29,0.58,8.5,Does not exist,Does not exist,building_window glass
112,1.52777,12.64,0.0,0.67,72.02,0.06,14.4,Does not exist,Does not exist,building_window glass
143,1.51709,13.0,3.47,1.79,72.72,0.66,8.18,Does not exist,Does not exist,building_window glass
211,1.52065,14.36,0.0,2.02,73.42,0.0,8.44,exists,Does not exist,headlamp glass


## Feature Engineering

### Encoding categorical variables - binary


In [859]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    object 
 8   Fe      214 non-null    object 
 9   Type    214 non-null    object 
dtypes: float64(7), object(3)
memory usage: 16.8+ KB


In [860]:
df['Ba'].unique()

array(['Does not exist', 'exists'], dtype=object)

In [861]:
df['Fe'].unique()

array(['Does not exist', 'exists'], dtype=object)

In [862]:
df['Type'].unique()

array(['building_window glass', 'vehicle_window glass', 'container glass',
       'tableware glass', 'headlamp glass'], dtype=object)

### Encoding categorical variables - mapping categories to integer codes

In [863]:
df['Ba'] = df['Ba'].map({'exists':1,'Does not exist':0}) 
df.sample(30)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
117,1.51708,13.72,3.68,1.81,72.06,0.64,7.88,0,Does not exist,building_window glass
57,1.51824,12.87,3.48,1.29,72.95,0.6,8.6,0,Does not exist,building_window glass
137,1.51711,12.89,3.62,1.57,72.96,0.61,8.11,0,Does not exist,building_window glass
108,1.52222,14.43,0.0,1.0,72.67,0.1,11.52,0,exists,building_window glass
63,1.52227,14.17,3.81,0.78,71.35,0.0,9.69,0,Does not exist,building_window glass
81,1.51593,13.25,3.45,1.43,73.17,0.61,7.86,0,Does not exist,building_window glass
142,1.51662,12.85,3.51,1.44,73.01,0.68,8.23,1,exists,building_window glass
190,1.51613,13.88,1.78,1.79,73.1,0.0,8.67,1,Does not exist,headlamp glass
120,1.51844,13.25,3.76,1.32,72.4,0.58,8.42,0,Does not exist,building_window glass
156,1.51655,13.41,3.39,1.28,72.64,0.52,8.65,0,Does not exist,vehicle_window glass


In [864]:
df['Fe'] = df['Fe'].map({'exists':1,'Does not exist':0}) 
df.sample(30)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
37,1.51797,12.74,3.48,1.35,72.96,0.64,8.68,0,0,building_window glass
16,1.51784,12.68,3.67,1.16,73.11,0.61,8.7,0,0,building_window glass
6,1.51743,13.3,3.6,1.14,73.09,0.58,8.17,0,0,building_window glass
143,1.51709,13.0,3.47,1.79,72.72,0.66,8.18,0,0,building_window glass
11,1.51763,12.8,3.66,1.27,73.01,0.6,8.56,0,0,building_window glass
50,1.5232,13.72,3.72,0.51,71.75,0.09,10.06,0,1,building_window glass
182,1.51916,14.15,0.0,2.09,72.74,0.0,10.88,0,0,tableware glass
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.6,0,0,building_window glass
85,1.51625,13.36,3.58,1.49,72.72,0.45,8.21,0,0,building_window glass
29,1.51784,13.08,3.49,1.28,72.86,0.6,8.49,0,0,building_window glass


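We can also map each class of a multi-class column to a specific integer code (e.g., 0/1/2) with the `map()` function. This is label encoding, not one-hot encoding: each class gets a single integer rather than its own indicator column.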

In [865]:
df['Type'] = df['Type'].map({
    'building_window glass': 0,
    'vehicle_window glass': 1,
    'container glass': 2,
    'tableware glass': 3,
    'headlamp glass': 4,
})
df.sample(30)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
56,1.51215,12.99,3.47,1.12,72.98,0.62,8.35,0,1,0
149,1.51643,12.16,3.52,1.35,72.89,0.57,8.53,0,0,1
63,1.52227,14.17,3.81,0.78,71.35,0.0,9.69,0,0,0
145,1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0,1,0
158,1.51776,13.53,3.41,1.52,72.04,0.58,8.79,0,0,1
25,1.51764,12.98,3.54,1.21,73.0,0.65,8.53,0,0,0
29,1.51784,13.08,3.49,1.28,72.86,0.6,8.49,0,0,0
137,1.51711,12.89,3.62,1.57,72.96,0.61,8.11,0,0,0
116,1.51829,13.24,3.9,1.41,72.33,0.55,8.31,0,1,0
204,1.51617,14.95,0.0,2.27,73.3,0.0,8.71,1,0,4
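
The `map()` calls above assign one integer per category. For comparison, a one-hot encoding creates one 0/1 indicator column per category, which is usually preferred when a multi-class column is used as an input feature rather than as the target. The sketch below uses `pd.get_dummies` on a hypothetical four-row column; it is an illustration, not a step of this notebook.

```python
import pandas as pd

# Hypothetical column with three distinct categories.
toy = pd.DataFrame({
    "Type": ["building_window glass", "headlamp glass",
             "container glass", "headlamp glass"],
})

# One indicator column per distinct category; dtype=int gives 0/1 values.
one_hot = pd.get_dummies(toy["Type"], prefix="Type", dtype=int)

print(one_hot)
```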


## Feature Engineering

In [866]:
df.columns

Index(['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type'], dtype='object')

In [867]:
data = df[['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']]

In [884]:
data.columns

Index(['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe'], dtype='object')

In [869]:
target = df[['Type']]

In [885]:
target.columns

Index(['Type'], dtype='object')

In [871]:
from sklearn.model_selection import train_test_split

# stratify on the target so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25, random_state=0, stratify=target)
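
What the `stratify` argument does can be seen on a small synthetic example. The labels below are hypothetical (80 samples of one class, 20 of another); the sketch only shows that both splits keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# With stratify=y, both splits keep the 80/20 class ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```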

In [872]:
X_train.shape

(160, 9)

In [873]:
y_train = y_train.values.ravel()
y_train.shape

(160,)

In [874]:
X_test.shape

(54, 9)

In [875]:
y_test = y_test.values.ravel()
y_test.shape

(54,)

## Standardizing Data

## Feature Scaling

## Why do we need to scale our data?
* If one feature's variance is orders of magnitude larger than that of the others, it can dominate them and prevent the estimator from learning from the remaining features. This matters most for distance-based learners such as k-nearest neighbors, where the feature with the largest numeric range contributes most to the distance.
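
A minimal sketch of this effect, using three hypothetical points with one large-range feature and one small-range feature: before scaling, the Euclidean distance is driven by the large feature; after min-max scaling, both features contribute on a comparable scale.

```python
import numpy as np

# Hypothetical samples: feature 0 spans hundreds, feature 1 spans [0, 1].
a = np.array([1000.0, 0.1])
b = np.array([1010.0, 0.9])   # close to a on feature 0, far on feature 1
c = np.array([1200.0, 0.1])   # far from a on feature 0, equal on feature 1

d_ab_raw = np.linalg.norm(a - b)
d_ac_raw = np.linalg.norm(a - c)
print(d_ab_raw, d_ac_raw)     # feature 0 decides which point looks closer

# Min-max scale each feature over these three points.
pts = np.vstack([a, b, c])
scaled = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))

d_ab_scaled = np.linalg.norm(scaled[0] - scaled[1])
d_ac_scaled = np.linalg.norm(scaled[0] - scaled[2])
print(d_ab_scaled, d_ac_scaled)   # now the two distances are comparable
```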

The following are a few different types of Scalers:

**MinMax Scaler:** 

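Scales each feature to a fixed range, which is [0, 1] by default in scikit-learn's `MinMaxScaler` (a different range such as [-1, 1] can be requested with `feature_range`). The formula below maps the feature's minimum to 0 and its maximum to 1, whether or not the feature contains negative values.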

$$\frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X$ is a feature value, and $X_{min}$ and $X_{max}$ are the corresponding feature's minimum and maximum values.
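
As a quick check of the formula, the sketch below applies it by hand with NumPy and compares the result with `MinMaxScaler`. The input values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])   # hypothetical values

# (X - X_min) / (X_max - X_min), computed per column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
from_sklearn = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, from_sklearn))
```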


**Standard Scaler:**

$$\frac{X - \mu}{\sigma}$$

where $\mu$ is the feature's mean and $\sigma$ is its standard deviation.
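
The same kind of check works for standardization. The sketch below uses hypothetical values and relies on the fact that `StandardScaler` uses the population standard deviation, which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])   # hypothetical values

# (X - mean) / standard deviation, computed per column
manual = (X - X.mean(axis=0)) / X.std(axis=0)
from_sklearn = StandardScaler().fit_transform(X)

print(manual.mean(axis=0))   # approximately 0 for every column
print(manual.std(axis=0))    # approximately 1 for every column
print(np.allclose(manual, from_sklearn))
```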

**Robust Scaler:**

Uses statistics that are robust to outliers.

$$\frac{X - \text{median}}{\text{IQR}}$$

where $\text{IQR} = Q_3 - Q_1$ is the interquartile range.
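
A small sketch with one deliberate outlier shows why the median and IQR are used. The values are hypothetical, and the comparison assumes `RobustScaler` with its default quantile range of the 25th to 75th percentile.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
manual = (X - median) / iqr
from_sklearn = RobustScaler().fit_transform(X)

# The outlier barely affects the median and IQR, so the four ordinary
# values stay within a small interval around 0 after scaling.
print(manual.ravel())
print(np.allclose(manual, from_sklearn))
```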


scikit-learn provides a class for each of these scalers, so we can scale our data in a few lines.

In [876]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(X_train)

MinMaxScaler()

In [877]:
X_train_scaled = scaler.transform(X_train)

In [878]:
X_test_scaled = scaler.transform(X_test)

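After Min-Max scaling, every training feature lies in the range [0, 1], as the minimum and maximum values printed below confirm. The test set is transformed with the training minimum and maximum, so its values can fall slightly outside this range.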

In [879]:
print("per-feature minimum before scaling:\n {}".format(X_train.min(axis=0)))
print("per-feature maximum before scaling:\n {}".format(X_train.max(axis=0)))

per-feature minimum before scaling:
 RI     1.51115
Na    10.73000
Mg     0.00000
Al     0.29000
Si    69.81000
K      0.00000
Ca     5.43000
Ba     0.00000
Fe     0.00000
dtype: float64
per-feature maximum before scaling:
 RI     1.53393
Na    17.38000
Mg     4.49000
Al     3.04000
Si    75.41000
K      6.21000
Ca    16.19000
Ba     1.00000
Fe     1.00000
dtype: float64


In [880]:
print("per-feature minimum after scaling:\n {}".format(X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n {}".format(X_train_scaled.max(axis=0)))

per-feature minimum after scaling:
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
per-feature maximum after scaling:
 [1. 1. 1. 1. 1. 1. 1. 1. 1.]
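
The [0, 1] guarantee holds only for the data the scaler was fitted on. The sketch below, with hypothetical one-feature arrays named `toy_train` and `toy_test`, shows test values landing outside [0, 1] when they lie beyond the training minimum or maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy_train = np.array([[2.0], [4.0], [6.0]])   # hypothetical training feature
toy_test = np.array([[1.0], [5.0], [8.0]])    # 1.0 and 8.0 lie outside [2, 6]

scaler_demo = MinMaxScaler().fit(toy_train)   # min and max come from toy_train only
toy_test_scaled = scaler_demo.transform(toy_test)

print(toy_test_scaled.ravel())
```

This is expected behavior: fitting on the training set alone keeps information about the test set out of the preprocessing step.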


## Effect of using MinMax Scaler:

### Accuracy without scaling

In [881]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))

Test set accuracy: 0.83


### Accuracy with MinMax scaling

In this run, Min-Max scaling does not improve accuracy: the test score drops from 0.83 on the unscaled data to 0.78.

In [882]:
knn.fit(X_train_scaled, y_train)

# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(knn.score(X_test_scaled, y_test)))

Scaled test set accuracy: 0.78


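### Effect of using Standard Scaler:
With StandardScaler the test accuracy is 0.81, which is slightly below the unscaled result (0.83) and above the MinMaxScaler result (0.78). With only 54 test samples, these gaps correspond to a handful of predictions, so this single split does not show that one scaler is reliably better than another for this problem.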

In [883]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fitting KNN on the standardized training data
knn.fit(X_train_scaled, y_train)

# scoring on the scaled test set
print("KNN test accuracy: {:.2f}".format(knn.score(X_test_scaled, y_test)))

KNN test accuracy: 0.81
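
Because a single train/test split can be noisy, one way to compare scalers more reliably is cross-validation with a pipeline. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in (not the glass dataset), and the `scalers` and `scores` dictionaries are names introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in data with 9 features (not the glass dataset).
X, y = make_classification(n_samples=200, n_features=9, n_informative=5,
                           random_state=0)

scalers = {"none": None, "minmax": MinMaxScaler(),
           "standard": StandardScaler(), "robust": RobustScaler()}

scores = {}
for name, scaler in scalers.items():
    steps = [KNeighborsClassifier()]
    if scaler is not None:
        steps.insert(0, scaler)
    model = make_pipeline(*steps)
    # 5-fold CV refits the scaler on each training fold,
    # so no information from the held-out fold leaks in.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Putting the scaler inside the pipeline means it is always fitted on training data only, which mirrors the `fit` on `X_train` followed by `transform` on `X_test` used above.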

