
# Data Preprocessing Tools





## Importing the libraries

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [0]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Notes about `iloc`:  
*   it takes two arguments: the first selects the rows, the second selects the columns;
*   Python slices include the start index and exclude the end index, so `:-1` selects every column except the last (used for `X`), while `-1` selects only the last column (used for `y`).
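
As a minimal sketch of the slicing described above, the snippet below builds a small DataFrame and splits it the same way. The toy data and the names `df`, `X_demo` and `y_demo` are assumptions made for illustration; they are not the contents of `Data.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for Data.csv (not the real file)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last -> feature matrix
X_demo = df.iloc[:, :-1].values
# All rows, only the last column -> target vector
y_demo = df.iloc[:, -1].values

print(X_demo.shape)  # (3, 2)
print(y_demo.shape)  # (3,)
```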



In [3]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [0]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Notes about the roles of `fit` and `transform` when taking care of missing data:


*   `fit`: computes the mean of each selected column from its non-missing values
*   `transform`: replaces the missing values with those means
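
To see the two steps separately, here is a small sketch on made-up numbers. The array `X_num` and the name `imputer_demo` are hypothetical. After `fit`, the learned column means are available in the `statistics_` attribute, and `transform` then uses them to fill the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one missing value each
X_num = np.array([[44.0, 72000.0],
                  [27.0, np.nan],
                  [np.nan, 54000.0]])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit: learns one mean per column, ignoring the NaN entries
imputer_demo.fit(X_num)
print(imputer_demo.statistics_)

# transform: returns a copy where each NaN is replaced by its column mean
X_filled = imputer_demo.transform(X_num)
print(X_filled)
```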



In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

In [0]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

***Notes about encoding:***
*   `'encoder'` is just a name given to this transformer step
*   `OneHotEncoder()` is the transformer that does the encoding, and `[0]` lists the column indices it is applied to
*   `remainder='passthrough'` keeps the remaining columns of `X` unchanged while the categorical column is encoded
*   the return type of `fit_transform` is not guaranteed to be a NumPy array, so the result is wrapped in `np.array` here to get the array type expected by the later machine learning steps
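
The sketch below repeats the same call on a tiny made-up array so the column order is easier to follow. The names `X_cat`, `ct_demo` and `X_enc` are hypothetical. The one-hot columns come first, in the sorted order of the categories, followed by the passed-through column.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column and one numeric column
X_cat = np.array([['France', 44.0],
                  ['Spain', 27.0],
                  ['Germany', 30.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')
X_enc = np.array(ct_demo.fit_transform(X_cat))

# Categories learned by the step named 'encoder' (sorted alphabetically)
print(ct_demo.named_transformers_['encoder'].categories_[0])
# One-hot columns first, then the untouched numeric column
print(X_enc)
```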





In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [0]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
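
As a small sketch of how the integer codes map back to the original labels, the snippet below uses a made-up label array (`labels` and `le_demo` are hypothetical names). The fitted encoder stores the sorted labels in `classes_`, and `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not the real y from Data.csv
labels = np.array(['No', 'Yes', 'No', 'Yes'])

le_demo = LabelEncoder()
labels_enc = le_demo.fit_transform(labels)

print(le_demo.classes_)  # ['No' 'Yes']
print(labels_enc)        # [0 1 0 1]

# Map the integer codes back to the original strings
labels_back = le_demo.inverse_transform(labels_enc)
print(labels_back)
```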


## Splitting the dataset into the Training set and Test set

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [12]:
print(X_train)

[[1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 1.0 0.0 50.0 83000.0]]


In [13]:
print(X_test)

[[0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 1.0 27.0 48000.0]]


In [14]:
print(y_train)

[1 0 1 0 0 0 1 0]


In [15]:
print(y_test)

[1 1]
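
The sketch below illustrates, on made-up data, what `test_size` and `random_state` do: with 10 rows and `test_size=0.2`, 2 rows go to the test set, and repeating the call with the same `random_state` gives the same split. The arrays `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

# Same random_state -> identical split on both calls
same_split = all(np.array_equal(p, q) for p, q in zip(first, second))
print(same_split)  # True

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```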


## Feature Scaling

***Order reminder:***

Feature scaling should be done after splitting the dataset, to avoid information leakage. More specifically, the test set should not be involved in calculating the mean and standard deviation used for scaling; these are computed from the training set only.

In [0]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

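Notes about feature scaling:


*   there is no need to scale the dummy variables (the one-hot columns), so only columns `3:` are scaled
*   the mean and standard deviation are calculated from `X_train`
*   in the code, `fit_transform` is called on `X_train` and only `transform` on `X_test`, which is what makes the mean and standard deviation come from `X_train`
*   `fit` should be used only once, on `X_train`; fitting again on `X_test` would recompute the mean and standard deviation from the test set, and with only two test rows each scaled column would then contain exactly -1 and 1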
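
The notes above can be sketched on a single made-up column (the arrays `train` and `test` and the name `sc_demo` are hypothetical). The scaler learns its mean and standard deviation from the training rows only, and the test rows are scaled with those same values. For contrast, the last lines show what happens if a scaler is re-fitted on just two test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[30.0], [40.0], [50.0]])
test = np.array([[35.0], [45.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)  # mean/std learned from train
test_scaled = sc_demo.transform(test)        # reuses the train mean/std

print(sc_demo.mean_, sc_demo.scale_)  # statistics learned from train
print(test_scaled.ravel())

# For contrast only: re-fitting on the two test rows uses the test
# statistics instead, which always maps two distinct values to -1 and 1
refit = StandardScaler().fit_transform(test)
print(refit.ravel())  # [-1.  1.]
```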





In [17]:
print(X_train)

[[1.0 0.0 0.0 -0.8066752374797969 -0.7213204519277457]
 [1.0 0.0 0.0 0.6176450728387547 0.5817100418772143]
 [1.0 0.0 0.0 1.250676321869222 1.2332252887796944]
 [0.0 1.0 0.0 -1.597964298767881 -1.093614878729163]
 [0.0 0.0 1.0 -0.3319018007069463 -0.4420996318266829]
 [0.0 0.0 1.0 -0.20881239117324418 -1.2797620921298716]
 [1.0 0.0 0.0 -0.4901596129645631 0.11634200837544287]
 [0.0 1.0 0.0 1.5671919463844557 1.6055197155811116]]


In [18]:
print(X_test)

[[0.0 1.0 0.0 1.0 1.0]
 [0.0 0.0 1.0 -1.0 -1.0]]
