# Data Preprocessing

## Importing Libraries

In [1]:
import numpy as np  # Arrays and numerical operations
import matplotlib.pyplot as plt  # Plotting graphs
import pandas as pd  # Loading and manipulating tabular data

## Importing Dataset

In [2]:
dataset = pd.read_csv('Data.csv')  # Creates a DataFrame holding the contents of our dataset
X = dataset.iloc[:, :-1].values  # iloc selects by integer position [rows, columns]: all rows, every column except the last (the 3 features)
y = dataset.iloc[:, -1].values  # All rows of the last column (the dependent variable); .values converts the selection to a NumPy array
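
The `iloc` slicing used above can be tried on a small frame without needing `Data.csv`. This is only a sketch: the three-row frame and the names `df`, `features` and `target` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as Data.csv
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values  # every row, every column except the last
target = df.iloc[:, -1].values     # every row, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```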

In [3]:
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [4]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [5]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking Care of Missing Values

Two ways:-
1. Ignoring (removing) the rows that contain missing values
2. Replacing each missing value with the average of its column
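
As a rough illustration of the two options, here is how they might look in pandas on a tiny made-up frame. The data and the names `df`, `dropped` and `filled` are assumptions for this sketch; the notebook itself uses scikit-learn for the second option.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns containing one missing value each
df = pd.DataFrame({
    "Age": [40.0, np.nan, 30.0],
    "Salary": [np.nan, 52000.0, 54000.0],
})

dropped = df.dropna()          # option 1: discard rows with any missing value
filled = df.fillna(df.mean())  # option 2: replace each NaN with its column mean

print(len(dropped))            # 1 row survives
print(filled["Age"].tolist())  # [40.0, 35.0, 30.0]
```

With only three rows, dropping loses two thirds of the data, which is why replacing is usually preferred on small datasets like this one.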

In [6]:
from sklearn.impute import SimpleImputer  # scikit-learn's imputation tool, used here to fill in our missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # Creating an instance/object of this class
# First argument - the placeholder that marks a missing value (NaN here)
# Second argument - replace each missing value with the mean of its column
imputer.fit(X[:, 1:3])  # fit calculates the mean of each column. The 'mean' strategy only works on numeric columns, so we pass Age and Salary only
X[:, 1:3] = imputer.transform(X[:, 1:3])  # transform replaces the missing values with the means calculated by fit

In [7]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
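
The two filled-in values can be checked by hand. The sketch below copies the Age and Salary columns from the printed dataset and averages the nine known entries in each; the variable names are chosen for this example only.

```python
import numpy as np

# Age and Salary columns copied from the printed dataset above
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

mean_age = np.nanmean(age)        # mean over the 9 known ages
mean_salary = np.nanmean(salary)  # mean over the 9 known salaries

print(mean_age, mean_salary)
```

These are the same numbers that appear in rows 6 and 4 of the imputed `X`.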


## Encoding Categorical Data

This dataset contains one categorical column => Country.
It would be difficult for the model to find correlations between the text values in this column and the dependent variable. So, we encode the categories as numbers.

One way is to encode France => 0, Spain => 1, Germany => 2.
But this imposes a numerical order on the 3 countries, and the model may treat that order as meaningful. That is not the case here: there is no natural order between the three.

A better method is One Hot Encoding => we create 3 columns (since there are 3 countries) and assign a vector to each country. France becomes 1 0 0, Germany => 0 1 0 and Spain => 0 0 1 (the encoder sorts the categories alphabetically). Now there is no numerical order between these countries, since we only have 0s and 1s.

We will also have to replace the Purchased column with 0s and 1s. This is fine for this column because it is a binary outcome (it can take only one of two values).
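
To make the difference between the two encodings concrete, here is a minimal NumPy-only sketch. It is not the scikit-learn code the notebook uses; the `countries` array is made up, and the integer codes follow alphabetical order rather than the France/Spain/Germany order used in the example above.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# Integer codes: one number per country, assigned in alphabetical order
categories, codes = np.unique(countries, return_inverse=True)

# One-hot: row i of the identity matrix stands for category i
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(codes)       # [0 2 1 2]
print(one_hot)
```

The integer codes suggest Spain (2) is "greater than" Germany (1), while every one-hot row is the same distance from every other, so no order is implied.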

### 1. Encoding the Independent Variable

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoding', OneHotEncoder(), [0])], remainder='passthrough')
# First argument => transformers - specifies the kind of transformation and the indexes of the columns to transform
# transformers = [(name of the step => 'encoding', transformer to apply => OneHotEncoder(), indexes of the columns to encode)]
# Second argument => remainder - what to do with the columns that are not transformed, i.e. age and salary
# 'passthrough' => leave them unchanged and keep them in the output
# without it (default 'drop') => only the 3 newly created columns are kept and the others are dropped

X = np.array(ct.fit_transform(X))
# The result is wrapped in np.array so that X is a NumPy array, which is what we will pass to the learning model.
# The model's training (fit) method expects a NumPy array
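
The effect of `remainder` is easiest to see by comparing output shapes. The sketch below uses three made-up rows in the same layout as `X`; the names `rows`, `keep` and `drop` are illustrative only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows in the same layout as X: country, age, salary
rows = np.array([["France", 44.0, 72000.0],
                 ["Spain", 27.0, 48000.0],
                 ["Germany", 30.0, 54000.0]], dtype=object)

keep = ColumnTransformer([("encoding", OneHotEncoder(), [0])], remainder="passthrough")
drop = ColumnTransformer([("encoding", OneHotEncoder(), [0])], remainder="drop")

kept = keep.fit_transform(rows)
dropped = drop.fit_transform(rows)

print(kept.shape)     # (3, 5): 3 dummy columns plus age and salary
print(dropped.shape)  # (3, 3): only the 3 dummy columns
```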

In [9]:
print(X)
print(type(X))

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
<class 'numpy.ndarray'>


### 2. Encoding the Dependent Variable

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
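
A fitted `LabelEncoder` remembers which label each number stands for, so the encoding can be reversed. A small sketch on made-up labels (the names `labels`, `encoder` and `encoded` are not from the notebook):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                    # learned classes, sorted: 'No' then 'Yes'
print(encoded)                             # [0 1 0 1]
print(encoder.inverse_transform(encoded))  # back to the original strings
```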

In [11]:
print(y)
print(type(y))

[0 1 0 0 1 1 0 1 0 1]
<class 'numpy.ndarray'>


#### Question
Should feature scaling be done before or after splitting the dataset?
Answer => Feature scaling must be applied AFTER splitting the dataset.

What:-
1. Splitting => Split the data into 2 sets: a training set to train the model on existing observations, and a test set to evaluate its performance on new observations. The test set stands in for the data the model will receive once it is deployed.

2. Feature scaling => Scale all variables so that they take values on the same scale. This prevents one feature from dominating another, which would lead the model to neglect the dominated feature.

Why:-
- The test set is supposed to be brand new data, just like the future observations the model will receive upon deployment.
- This means the model is not supposed to see the test set during training.
- Feature scaling requires the mean and standard deviation of each feature.
- If we apply feature scaling before splitting, those statistics are computed over all rows, including the test set.
- This results in information leakage from the test set into training, information you are not supposed to have.
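
A tiny numeric sketch of the leakage described above, using made-up salaries where the last row plays the role of the test set:

```python
import numpy as np

# Hypothetical salaries: the first four rows are training data, the last is the test row
salary = np.array([48000.0, 52000.0, 58000.0, 62000.0, 130000.0])
train, test = salary[:4], salary[4:]

leaky_mean = salary.mean()  # computed before splitting: includes the test row
clean_mean = train.mean()   # computed after splitting: training rows only

print(leaky_mean, clean_mean)  # 70000.0 55000.0
```

The mean computed before splitting has shifted by 15000 because of a row the model should never have seen, and every scaled training value would inherit that shift.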

### 3. Splitting the Dataset

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
# Third argument - test_size - fraction of rows held out for testing; an 80% / 20% split is recommended
# random_state = 1 - for educational purposes: it fixes the random seed so that we get the same split on every run

In [13]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [14]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [15]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [16]:
print(y_test)

[0 1]
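
The role of `random_state` can be checked by splitting the same data twice. This sketch uses made-up arrays of 10 rows; `features`, `target`, `first` and `second` are names invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

first = train_test_split(features, target, test_size=0.2, random_state=1)
second = train_test_split(features, target, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)           # (8, 2) (2, 2)
print(np.array_equal(y_te, second[3]))  # True: same seed, same split
```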


### 4. Feature Scaling

- We apply feature scaling so that some features do not dominate the others, which might result in the model considering only the dominant features for the prediction and ignoring the rest.
- Feature scaling does not need to be applied for every model.
- For example, in a regression model, y = b0 + b1.x1 + b2.x2 + ...., if a feature (x) takes very large values, its coefficient simply becomes very small to compensate, so scaling is not needed there.
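
The compensation effect in the regression example can be seen with a least-squares fit on made-up data. This is a sketch under the assumption of a single feature and noise-free data; `fit_line` is a helper written for this example, not a library function.

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

def fit_line(feature, target):
    """Least-squares fit of target = b0 + b1 * feature."""
    design = np.column_stack([np.ones_like(feature), feature])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

b0, b1 = fit_line(x, y)
b0_big, b1_big = fit_line(x * 1000, y)  # same feature, values 1000 times larger

print(b1, b1_big)  # the second coefficient is about 1000 times smaller
```

The intercept stays the same and the predictions are unchanged; only the coefficient shrinks to absorb the larger feature values.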

Two feature scaling methods:-

1. Standardisation - Works all the time

    - x_stand = ( x - mean(x) ) / standard_deviation(x)

    - Values mostly fall between -3 and +3


2. Normalization - Recommended when we have a normal distribution in most of the features

    - x_norm = ( x - min(x) ) / ( max(x) - min(x) )

    - Values fall between 0 and 1
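
Both formulas translate directly into NumPy. The sketch below applies them to a made-up array of ages:

```python
import numpy as np

x = np.array([27.0, 35.0, 38.0, 44.0, 50.0])  # hypothetical ages

x_stand = (x - x.mean()) / x.std()            # standardisation
x_norm = (x - x.min()) / (x.max() - x.min())  # normalization

print(x_stand)  # mean 0 and standard deviation 1
print(x_norm)   # smallest value maps to 0, largest to 1
```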

Next question: do we need to apply feature scaling to the dummy variables (the variables obtained after applying the one hot encoder)?
Answer : No

1. Feature scaling brings the values into the same range. Standardisation transforms values to lie roughly between -3 and +3. Since the dummy variables are already within this range (they take the values 0 and 1), we don't need to apply feature scaling to those columns.


2. Not only that, we would also lose interpretability. Right now we can tell which country a row belongs to just by looking at it. After feature scaling, we won't be able to tell easily.
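
To see what would be lost, here is a made-up dummy column standardised with the same formula. The column `france` is an assumption for this sketch.

```python
import numpy as np

france = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # hypothetical dummy column

scaled = (france - france.mean()) / france.std()

print(np.unique(scaled))  # two values, roughly -0.82 and 1.22
```

The column still carries the same information, but "1 means France" has turned into "about 1.22 means France", and that number would change with the proportion of French rows in the data.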

In [17]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

Fit => calculates the mean and standard deviation of each feature

Transform => applies the standardisation formula to the features using those values

- For the training data, we need to calculate the mean and SD and apply them (fit_transform).
- For the test data, we need to transform the values onto the same scale (i.e. roughly -3 to +3), but we must not calculate their own mean and SD, since they stand in for future data we don't know about. Moreover, the model is trained on features scaled by one particular scaler, so to get accurate predictions we must apply that same scaler to any data we give it (transform only).
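
The fit/transform split can be reproduced by hand for the Age column. The sketch below copies the ages from the `X_train` and `X_test` arrays printed earlier and uses plain NumPy in place of `StandardScaler`; like `StandardScaler`, it uses the population standard deviation.

```python
import numpy as np

# Age column of X_train and X_test, copied from the arrays printed after the split
age_train = np.array([38.77777777777778, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0])

# "fit": learn the statistics from the training rows only
mean, std = age_train.mean(), age_train.std()  # population standard deviation (ddof=0)

# "transform": apply those training statistics to both sets
age_train_scaled = (age_train - mean) / std
age_test_scaled = (age_test - mean) / std

print(age_train_scaled[0])  # about -0.1916
print(age_test_scaled)      # about [-1.4662 -0.4497]
```

These values agree with the Age column of the scaled `X_train` and `X_test` that the notebook prints.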

In [18]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [19]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
