# Machine Learning A-Z Python

## Part 1. Data Preprocessing

In [26]:
import warnings
warnings.filterwarnings('ignore')

In [36]:
## Part 1_Data Preprocessing

# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

In [37]:
# Importing the dataset

dataset = pd.read_csv('Data.csv')
print(dataset)
print('------------------------------------')

# Create matrix of independent and dependent variable 
X = dataset.iloc[:,:-1].values # independent variables
y = dataset.iloc[:,-1].values # dependent variables
X.ndim, y.ndim

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
------------------------------------


(2, 1)

Here the dependent variable is `Purchased`; the remaining columns are independent variables. A machine learning (ML) model uses the independent variables to predict the dependent variable.

* Dependent variable: Purchased
* Independent variables: Country, Age, Salary

Data preprocessing is a common first step before training an ML model.

In [38]:
# Missing Data

# Strategy 1: Remove observations that contain NaN
# Strategy 2: Replace NaN with the mean of the relevant column (axis=0) or row (axis=1)

from sklearn.preprocessing import Imputer # Imputer class

# as we imported the class, we need to create an object of the class
imputer = Imputer(missing_values = 'NaN', 
                  strategy= 'mean', axis=0) # Imputer object

# Fit imputer to matrix X (matrix X has missing data)
imputer = imputer.fit(X[:,1:]) 
X[:,1:] = imputer.transform(X[:,1:]) #conversion of matrix X based on imputer
X # NaN has been replaced with mean values

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
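The `Imputer` class used above was removed from scikit-learn in version 0.22. On newer versions, a roughly equivalent sketch uses `SimpleImputer` from `sklearn.impute`. The small array below is an assumption made for illustration: it copies the Age and Salary columns from the printed dataset instead of re-reading `Data.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset printed above
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# SimpleImputer always works column-wise, so it has no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
numeric_filled = imputer.fit_transform(numeric)

print(numeric_filled[4, 1])  # imputed Salary for row 4
print(numeric_filled[6, 0])  # imputed Age for row 6
```

The two imputed values should match the ones in the array above, since both approaches fill each gap with the column mean.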

The dataset has 2 categorical variables: Country (independent) and
Purchased (dependent). Country contains 3 categories and
Purchased contains 2 categories. ML models need numbers, not text
(like the values in these categorical features), so we have to convert the
categories into numbers.

In [39]:
# Categorical variables

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X=LabelEncoder() # creating labelencoder object
# X[:,0] is categorical
X[:,0]=labelencoder_X.fit_transform(X[:,0])
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

The Country (categorical) column is now numerical.

**But** the 3 categories of the Country column are now 0, 1 and 2. Since 0 < 1 < 2, the equations in the model will treat one country as greater than another based on the number representing it. For example, the model will think Spain is greater than Germany because 2 > 1. BUT there is **no relational order between the countries**.

`LabelEncoder` encodes values without considering whether they have an order!

If the column instead held 3 categories such as small, medium and large, then integer codes would make sense because **there is an order between small, medium and large**. Countries have no such order, so we have to use something other than `LabelEncoder`.
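For an ordered category, the integer codes should follow the real order. One caveat worth checking: `LabelEncoder` sorts labels alphabetically, so it would not assign small < medium < large on its own. A minimal sketch, assuming a hypothetical size column that is not part of `Data.csv`, uses `OrdinalEncoder` with an explicit category order:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered column, not part of Data.csv
sizes = np.array(['small', 'large', 'medium', 'small'])

# LabelEncoder sorts the labels alphabetically: large=0, medium=1, small=2
alphabetical = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder accepts an explicit order: small=0, medium=1, large=2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered = encoder.fit_transform(sizes.reshape(-1, 1)).ravel()

print(alphabetical)
print(ordered)
```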

### Thus, we have to use dummy encoding for the Country column.
Dummy encoding will convert the 1 Country column into 3 columns because
there are 3 categories. Each dummy variable column
will represent 1 country. We will need `OneHotEncoder` for that.

In [40]:
onehotencoder= OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
X[0]

array([1.0e+00, 0.0e+00, 0.0e+00, 4.4e+01, 7.2e+04])

Previously X had 3 columns. Now the Country column is replaced by 3 dummy variables (columns),
so X has 3 - 1 + 3 = 5 columns. So `OneHotEncoder` worked.
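The `categorical_features` argument of `OneHotEncoder` was removed in scikit-learn 0.22. On newer versions, one possible equivalent is to wrap `OneHotEncoder` in a `ColumnTransformer`, which also accepts the country strings directly, so the `LabelEncoder` step on X is not needed. The sketch below is illustrative only and uses the first three rows of the dataset as a stand-in for the full X.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of X from the dataset above (Country, Age, Salary)
X_small = np.array([
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, 48000.0],
    ['Germany', 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 and pass the other columns through unchanged;
# sparse_threshold=0 forces a dense result
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X_small), dtype=float)

print(X_encoded.shape)  # 3 dummy columns + Age + Salary
print(X_encoded[0])     # France row
```

The dummy columns are ordered alphabetically by category (France, Germany, Spain) and come first, followed by the passed-through Age and Salary columns.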

* For y, the target (dependent) variable, `LabelEncoder` works fine because y has only two categories.

In [41]:
labelencoder_y=LabelEncoder()
y=labelencoder_y.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [42]:
# Splitting the dataset into Training and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X, y, 
                                                  test_size=0.2, 
                                                  random_state=0)

We have to be careful that the ML algorithm learns the actual relationship between X_train and
y_train. If it doesn't learn the logic linking X_train to y_train and just memorizes the training data,
it will have trouble predicting y_test from X_test. This is the case where overfitting happens. But there are
regularization techniques that help prevent this overfitting.

### Feature scaling
#### Required because the scales of the features are uneven and vary widely

Feature scaling is often necessary: here the Age range and the Salary range are widely different.


Many ML models rely on the Euclidean distance between data points. For 2 points (x1, y1) and (x2, y2), the
Euclidean distance is `sqrt((x2-x1)^2 + (y2-y1)^2)`. Here one variable (say y, the Salary)
has a much larger scale, and squaring amplifies it further before it is added to the square of x (the Age, whose
value is negligible even after squaring). The distance is then dominated by y (Salary), which has a far higher magnitude than Age, so the effect of Age is effectively ignored. We need to put the variables on comparable scales. That's where feature scaling is really helpful.
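A quick numeric check of this argument, using the (Age, Salary) values of rows 0 and 1 from the dataset above:

```python
import numpy as np

# (Age, Salary) for rows 0 and 1 of the dataset above
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

squared_diffs = (a - b) ** 2             # [age term, salary term]
distance = np.sqrt(squared_diffs.sum())  # Euclidean distance
age_share = squared_diffs[0] / squared_diffs.sum()

print(squared_diffs)  # age term is 289, salary term is 576000000
print(distance)       # barely above the salary difference of 24000
print(age_share)      # fraction of the squared distance due to Age
```

The Age term contributes less than one millionth of the squared distance, so without scaling the two people differ almost only by Salary.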

Feature scaling can be done by standardization or normalization:

1. `STANDARDIZATION: x_standardization = (x - mean(x)) / standard deviation(x)`

2. `NORMALIZATION: x_normalization = (x - min(x)) / (max(x) - min(x))`

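Standardization rescales each variable to mean 0 and standard deviation 1, so most values fall within a few units of 0. They are not strictly bounded by -1 and +1; normalization, by contrast, bounds values to the range 0 to 1.
With comparable scales, a feature with a large range no longer biases the ML model.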
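The two formulas can be checked by hand with NumPy. This is a minimal sketch on five salary values taken from the dataset above. `np.std` defaults to the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Five Salary values taken from the dataset above
salary = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (salary - salary.mean()) / salary.std()

# Normalization (min-max): rescale to the range 0 to 1
normalized = (salary - salary.min()) / (salary.max() - salary.min())

print(standardized)  # mean 0, standard deviation 1, values can exceed 1
print(normalized)    # smallest value maps to 0, largest to 1
```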


**CODE:**

`from sklearn.preprocessing import StandardScaler`

`sc_X = StandardScaler()`

`X_train = sc_X.fit_transform(X_train)`

`X_test = sc_X.transform(X_test)`

If we have to standardize y, add the following code to the above code (`StandardScaler` expects a 2D array, hence the `reshape`):

`sc_y = StandardScaler()`

`y_train = sc_y.fit_transform(y_train.reshape(-1, 1))`

For the test set, we don't need to fit the sc_X instance again because it's already fitted to the training set.
That means X_test is transformed by StandardScaler using the mean and standard deviation learned from X_train. So
both X_train and X_test are transformed on the same scale.

Here the dependent variable y is categorical (it takes the values 1 or 0), so we do NOT need to apply feature scaling to y for classification.

But for regression, where the dependent variable y can take a huge range of values
(unlike classification), we may need to apply feature scaling to y as well.
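A minimal sketch of scaling a regression target, assuming a hypothetical continuous `y_train` (the notebook's own y is binary, so this is not applied here). `StandardScaler` expects a 2D array, so the 1D target is reshaped into a single column, and `inverse_transform` maps scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical continuous regression target, not the notebook's y
y_train = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

sc_y = StandardScaler()
# Reshape the 1D target to one column, scale it, then flatten again
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

# Values on the scaled target can be mapped back to the original units
y_restored = sc_y.inverse_transform(y_train_scaled.reshape(-1, 1)).ravel()

print(y_train_scaled)
print(y_restored)
```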

In [43]:
from sklearn.preprocessing import StandardScaler # StandardScaler class
sc_X = StandardScaler() # creating sc_X, a StandardScaler object

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) 

In [44]:
print(X_train) # standardized: each column has mean 0 and unit variance (values are not limited to -1..+1)
print('--------------------------------------------------------------')
print(X_test) # scaled with the mean and standard deviation learned from X_train
print('--------------------------------------------------------------')
print(y_train) # No feature scaling for classification 
print('--------------------------------------------------------------')
print(y_test) # No feature scaling for classification 

[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
--------------------------------------------------------------
[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]
--------------------------------------------------------------
[1 1 1 0 1 0 0 1]
--------------------------------------------------------------
[0 0]
