## CECS 456 - Dr. Wenlu Zhang

# Data Preprocessing Tools


1.   This lecture covers the data preprocessing tools you may need to apply to a dataset before feeding it to a machine learning model.
2.   Data preprocessing is the first, and a very important, step of ML: every time you build an ML model, you start with a data preprocessing phase.



## Importing the libraries

In [25]:
import numpy as np # number and matrix processing
from matplotlib import pyplot as plt # plotting library
import pandas as pd # data processing

## Importing the dataset

In [26]:
dataset = pd.read_csv("Data.csv") # import CSV
print(dataset)
X = dataset.iloc[:, :-1].values # Country, Age, and Salary (Matrix)
y = dataset.iloc[:,  -1].values # Purchased (Column vector)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
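
Before fixing missing data, it can help to count where the gaps are. The sketch below is an illustration, not part of the original notebook: it rebuilds the table printed above from an inline string (the `csv_text` variable is an assumption, used so the example runs without `Data.csv`) and counts the `NaN` entries per column.

```python
import io
import pandas as pd

# Inline copy of the table printed above, so the sketch runs without Data.csv.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Number of missing entries in each column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = dataset[dataset.isna().any(axis=1)]
print(rows_with_missing)
```

Only the Age and Salary columns contain gaps, which is why the next section imputes just those two columns.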


## Taking care of missing data

In [27]:
print(X[:, 1:3]) # Age and Salary only (skip Country; Purchased is in y, not X)
# note: X has three columns, so X[:, 1:3] selects the same columns as X[:, 1:]

[[44.0 72000.0]
 [27.0 48000.0]
 [30.0 54000.0]
 [38.0 61000.0]
 [40.0 nan]
 [35.0 58000.0]
 [nan 52000.0]
 [48.0 79000.0]
 [50.0 83000.0]
 [37.0 67000.0]]


In [28]:
# Option 1: delete invalid entries from data.
# Option 2: "classic method": replace missing data with average of all the values in a column
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean") # fix data using mean
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
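
Option 1 is mentioned above but not shown. A minimal sketch of it, assuming the data is still in a pandas DataFrame (the small `df` below is hypothetical, not the course dataset), next to a pandas-only version of the mean replacement for comparison:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same kind of gaps as Data.csv.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Option 1: drop every row that has at least one missing value.
df_dropped = df.dropna()
print(df_dropped)

# Option 2 (for comparison): fill each gap with the mean of its column.
df_filled = df.fillna(df.mean())
print(df_filled)
```

Dropping rows is simple, but on a small dataset it throws away a large share of the samples, which is one reason the mean replacement is used in this lecture.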


## Encoding categorical data

### Encoding the Independent Variable

In [29]:
# Use one-hot encoding for categorical data
# Let France = [1; 0; 0], Germany = [0; 1; 0], and Spain = [0; 0; 1]
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough")
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
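
To see which output column belongs to which country, the fitted encoder can be inspected. This is a small standalone sketch (the `countries` array is made up for illustration); it relies on `OneHotEncoder` sorting the categories, which is why France, Germany, and Spain end up in that order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

# categories_ lists the labels in the order used for the output columns.
print(encoder.categories_[0])
print(encoded)
```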


### Encoding the Dependent Variable

In [30]:
# No = 0, Yes = 1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]
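
The mapping chosen by `LabelEncoder` can be read back from the fitted object. A short sketch with a hypothetical `labels` array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_[i] is the original label that was mapped to integer i.
print(le.classes_)
print(encoded)

# inverse_transform maps the integers back to the original labels.
decoded = le.inverse_transform(encoded)
print(decoded)
```

`inverse_transform` is handy later for turning model predictions (0 or 1) back into readable labels.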


## Splitting the dataset into the Training set and Test set

Note: k-fold cross-validation will be covered in a later class.


***Most frequently asked question:***

Do we have to apply feature scaling before or after splitting the data into the training set and the test set?

We apply feature scaling after the split. The scaler's parameters (for standardization, the mean and standard deviation) must be computed from the training set only. Fitting the scaler before the split would let the test rows influence those parameters, so the test set would no longer be unseen data and the evaluation would be biased. This is called information leakage.
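
The effect of leakage can be made visible on a toy feature. The sketch below is only an illustration with made-up numbers: it compares the mean a scaler learns from all rows with the mean it learns from the training rows alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature; the last value is an extreme one.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0],
              [6.0], [7.0], [8.0], [9.0], [100.0]])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: the scaler sees every row, including the future test rows.
leaky_scaler = StandardScaler().fit(X)

# Correct: the scaler sees only the training rows.
clean_scaler = StandardScaler().fit(X_train)

print("mean from all rows:     ", leaky_scaler.mean_[0])
print("mean from training rows:", clean_scaler.mean_[0])
```

The two means differ, so a model trained on data scaled the leaky way has indirectly used information from the test rows.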

In [31]:
# Split dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# 80% in training dataset, 20% in testing (test_size)
# random_state is the seed
print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:\n", y_train)
print("y_test:\n", y_test)

X_train:
 [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
X_test:
 [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
y_train:
 [0 1 0 0 1 1 0 1]
y_test:
 [0 1]
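
A quick way to convince yourself of what `random_state` does: calling `train_test_split` twice with the same seed gives the same split. The arrays below are hypothetical placeholders, not the course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features
y = np.arange(10)

# Same seed -> identical split on every call.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))
```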


## Feature Scaling

Normalization (not preferred)
`from sklearn.preprocessing import MinMaxScaler`
$$x' = \frac{x - \operatorname{min}(x)}{\operatorname{max}(x) - \operatorname{min}(x)}$$

Standardization (preferred)
`from sklearn.preprocessing import StandardScaler`
$$x' = \frac{x - \bar{x}}{\sigma}$$
where $\bar{x}$ is the mean of $x$ and $\sigma$ is the standard deviation.

Normalization is mainly recommended when most of the features are roughly normally distributed; standardization works well in almost every case, which is why it is preferred here.

You won't have to use feature scaling all the time. For each machine learning model we implement, we will see whether feature scaling has to be applied or not.
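
The two formulas above can be checked numerically. This is a minimal sketch on a made-up column of ages; it computes both scalings by hand and compares them with scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[27.0], [35.0], [44.0], [50.0]])  # hypothetical ages

# Normalization by hand: (x - min) / (max - min), result lies in [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization by hand: (x - mean) / std, result has mean 0 and std 1.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
x_std = (x - x.mean()) / x.std()

print(np.allclose(x_norm, MinMaxScaler().fit_transform(x)))
print(np.allclose(x_std, StandardScaler().fit_transform(x)))
```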

In [32]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# We start from column 3 to skip the one-hot encoded columns.
# We skip them because they are categorical dummy variables: scaling would destroy their 0/1 meaning.
print("X_train before:\n", X_train[:, 3:])
print("X_test before:\n", X_test[:, 3:])
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # we only use transform to avoid training on the testing dataset
print("X_train after:\n", X_train[:, 3:])
print("X_test after:\n", X_test[:, 3:])

# sc.fit calculates \bar{x} and \sigma; sc.transform applies the formula

X_train before:
 [[38.77777777777778 52000.0]
 [40.0 63777.77777777778]
 [44.0 72000.0]
 [38.0 61000.0]
 [27.0 48000.0]
 [48.0 79000.0]
 [50.0 83000.0]
 [35.0 58000.0]]
X_test before:
 [[30.0 54000.0]
 [37.0 67000.0]]
X_train after:
 [[-0.19159184384578545 -1.0781259408412425]
 [-0.014117293757057777 -0.07013167641635372]
 [0.566708506533324 0.633562432710455]
 [-0.30453019390224867 -0.30786617274297867]
 [-1.9018011447007988 -1.420463615551582]
 [1.1475343068237058 1.232653363453549]
 [1.4379472069688968 1.5749910381638885]
 [-0.7401495441200351 -0.5646194287757332]]
X_test after:
 [[-1.4661817944830124 -0.9069571034860727]
 [-0.44973664397484414 0.2056403393225306]]
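
To make the fit/transform distinction concrete, the sketch below refits a scaler on the Age and Salary values printed above and reproduces the scaled test rows by hand from the training statistics. The variable names (`train`, `test`, `manual`) are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns copied from the split printed above.
train = np.array([
    [38.77777777777778, 52000.0],
    [40.0, 63777.77777777778],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
sc.fit(train)  # learns mean_ and scale_ (standard deviation) per column

print("training means:", sc.mean_)
print("training stds: ", sc.scale_)

# transform applies (x - mean) / std with the *training* statistics.
manual = (test - sc.mean_) / sc.scale_
print(np.allclose(manual, sc.transform(test)))
```

Because the test rows are scaled with the training mean and standard deviation, the scaled test columns do not necessarily have mean 0 or standard deviation 1.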


**One of the most frequently asked questions in the data science community**


> Do we have to apply feature scaling (standardization) to the dummy variables in the matrix of features?


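No. The dummy variables already take only the values 0 and 1, so they are on a scale comparable to the standardized features. Scaling them would only make them lose their interpretation as category indicators.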
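
An alternative to slicing with `[:, 3:]` is to let `ColumnTransformer` scale only the numeric columns. This is a sketch of one possible approach, not the method used in the lecture, and the rows below are a hypothetical subset of the encoded data.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: three one-hot columns, then Age and Salary.
X_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])

# Scale columns 3 and 4 only; pass the dummy columns through unchanged.
scaler = ColumnTransformer(
    transformers=[("num", StandardScaler(), [3, 4])],
    remainder="passthrough",
)
X_scaled = scaler.fit_transform(X_train)

# Caution: the output puts the transformed columns first,
# so the layout is now [Age, Salary, dummy, dummy, dummy].
print(X_scaled)
```

The column reordering is the main thing to watch for with this approach; the in-place slicing used in the lecture keeps the original column order.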


