# Data Preprocessing
---


## Import Libraries

In [53]:
import numpy as np # for converting or manipulating the data
import matplotlib.pyplot as plt # for plotting the data on graph to see
import pandas as pd # for importing the data

There are more tools that we will use as we advance.\
These are the basic tools for now so we can focus more on understanding what we are doing and why.\
In fact there are libraries that we will use to help in the preprocessing stage.\
More on that later.

## Import Dataset
Here we import Data.csv and store it in a variable called dataset.\
What read_csv returns is called a DataFrame.\
A DataFrame is a 2-dimensional data structure, like a 2-dimensional array with labeled rows and columns.

In [54]:
dataset = pd.read_csv('Data.csv')

In any dataset that you train a model with, you will have two distinct entities.\
The first is the set of features, the known characteristics of each observation, also called the independent variable(s).\
The second is the dependent variable, the outcome we want to predict.\
\
Let's take a look at our dataset.

In [55]:
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


We have 4 columns: country, age, salary, and whether the person bought a product.\
We can quickly make out that whether a person purchased the product is unlikely to affect the other three columns,\
while any of the other three may affect whether someone buys the product.\
We can safely say that (Country, Age, Salary) are the features and (Purchased) is the dependent variable.\
We also have two cells with missing data.  More on that later.

In [56]:
# NOTE .values turns the dataframe to a numpy array
x = dataset.iloc[:,:-1].values # All Rows, All Columns except the last / feature set / independent variables
y = dataset.iloc[:,-1].values # All Rows, Only last column / output vector / dependent variable

As already mentioned, dataset is a pandas DataFrame that represents a table of data.\
It's not an array, but for now let's think of it as an array of arrays.\
We know that in Python we slice a sequence like so: arr[ start : end : step ]\
We also know [ -1 ] indexes the last element.\
What we might not know is that chaining brackets as we would on an array of arrays, dataset[ : ][ : -1 ], does not select columns on a DataFrame; both slices act on rows.\
The pandas DataFrame has a built-in indexer for positional selection called iloc.\
dataset.iloc[ ] takes one indexer for the rows, or two separated by a comma: the first dimension (rows), then the second (columns).\
Each indexer can be either a range (slice) or a single index.
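
To make the iloc rules concrete, here is a small sketch on a hypothetical three-row stand-in for Data.csv (the frame `df` and its values are made up for illustration, they are not read from the file).

```python
import pandas as pd

# Hypothetical three-row stand-in for Data.csv
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values   # every row, every column except the last
target = df.iloc[:, -1].values      # every row, last column only
first_row = df.iloc[0]              # one indexer: selects a row by position
cell = df.iloc[1, 2]                # row 1, column 2: a single value

print(features.shape)  # (3, 3)
print(target)          # ['No' 'Yes' 'No']
print(cell)            # 48000.0
```

Both positions accept either a slice or a single integer, which is all the cell above relies on.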

## Handle Missing Data
As we saw above, some data is missing in our dataset.\
There are a couple of ways to handle missing data:
- First, if the dataset is very large and only a small percentage is missing, we can just delete the rows with missing data.
- A second way, and the one we'll use, is to replace each missing value with the average of that column.
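
As a rough side-by-side of the two options, the sketch below uses plain pandas on a hypothetical four-row frame (the values are invented for illustration). It is not the approach used in the next cell, just a way to see the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with one gap each
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

dropped = df.dropna()            # option 1: discard any row that has a missing value
filled = df.fillna(df.mean())    # option 2: replace each gap with its column mean

print(len(dropped))              # 2 rows survive out of 4
print(filled.loc[2, "Age"])      # the mean of 44, 27 and 38
```

On a frame this small, dropping rows throws away half the data, which is why filling is the more attractive option for our ten-row dataset.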

First we add to our data preprocessing tools.\
Normally we would import this with the rest of the imports at the top of the file.

In [57]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Above we imported SimpleImputer, a class from the sklearn library.\
Next we create a variable to store our new class instance and pass in a couple of arguments.\
First is missing_values: by passing np.nan we are saying any cell that doesn't have a value, or is "not a number".\
Second is strategy: 'mean' replaces each missing cell with the average of its column.\
\
SimpleImputer can do more than replace empty values with an average.\
You can also use the median, or the most common value if the column is categorical like Country.\
\
NOTE: this setup is for numerical values, so we only want to apply it to the Age and Salary columns.\
As good practice, include all numerical columns when doing this, since we won't always know where data is missing.
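
Here is a short sketch of the other strategies mentioned above. The arrays are hypothetical; the 90.0 is included on purpose to show how an outlier pulls the mean but not the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age column with one missing value and one outlier
ages = np.array([[44.0], [27.0], [np.nan], [38.0], [90.0]])
mean_filled = SimpleImputer(strategy="mean").fit_transform(ages)
median_filled = SimpleImputer(strategy="median").fit_transform(ages)

# Hypothetical categorical column: fill with the most common value
countries = np.array([["France"], ["Spain"], [np.nan], ["France"]], dtype=object)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(countries)

print(mean_filled[2, 0])    # 49.75
print(median_filled[2, 0])  # 41.0
print(mode_filled[2, 0])    # France
```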

In [58]:
imputer.fit(x[:,1:3])
x[:, 1:3] = imputer.transform(x[:,1:3])
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


We can see all the cells are now filled, and the two imputed values (the long decimals) are easy to spot.

## Encoding Categorical Data
### Encoding The Independent Variable
Looking at our dataset, most of the data is numerical, which is good.\
Learning models tend to have difficulty finding correlations between strings and the output vector,\
which is why we are going to encode categorical data such as Country.\
You may think that we would just assign countries numerical values,\
but we don't want to give the impression that there is an ordering relation between countries.\
In other words France isn't first, Spain isn't second, and so on.\
\
The solution we will use is "one hot encoding".\
This is where, instead of giving a numerical value to cells with "Germany", we give each unique entry its own column.\
Then a 1 is placed in the column of the row's country, while the other columns get a 0.\
So instead of it being:
- France = 0
- Germany = 1
- Spain = 2

it would be:
- France = [1,0,0,...]
- Germany = [0,1,0,...]
- Spain = [0,0,1,...]
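
A minimal sketch of the idea on a hypothetical four-row Country column. One detail worth knowing: OneHotEncoder orders its output columns by sorted category name, so Germany lands in the second column and Spain in the third.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical Country column (2-D, because the encoder expects columns)
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries).toarray()  # the result is sparse by default

print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(one_hot)                 # one row per input row, one column per country
```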

In [59]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x)) 

Let's talk about the above code quickly.\
The first two lines could be added to our imports at the top of the file.\
Sklearn is a very popular machine learning library that has many powerful data preprocessing tools.\
\
First we import ColumnTransformer, a class from sklearn that applies a transformer to selected columns and assembles the result into a new feature matrix.\
Next we import OneHotEncoder, the tool that converts "Country" into one-hot vectors.\
We create a variable and store an instance of the ColumnTransformer class.\
ct needs a few arguments passed to it. The transformers argument is a list of tuples, and each tuple has three parts:
- First - A name for the step. The string 'encoder' is just a label we chose; it does not select a behavior.
- Second - The transformer to apply. We want OneHotEncoder(), so we pass an instance of it.
- Third - The columns to apply it to. [0] selects the first column, Country.

The remainder argument says what to do with the columns we did not list. remainder='passthrough' keeps them as they are; the default is to drop them.

We then pass x into ct.fit_transform() to convert the feature set into the new set with the new columns.\
ct.fit_transform may return a sparse matrix rather than a NumPy array depending on the data, so the last line makes sure x ends up as a NumPy array.\
\
Let's take a quick look at the new dataset; we can see the first column has been replaced with three new ones containing either 1 or 0.

In [60]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]



### Encoding The Dependent Variable
In our case the dependent variable is categorical too, but with only two possible values we are not worried about the ordering concept as before.\
Like the "Country" column, it contains strings, so we want to convert them to numerical values.\
Unlike "Country", we can assign each value its own number.\
\
So the concept is the same, but instead of giving each option its own column like before,\
we will give "Yes" the value 1 and "No" the value 0.\
\
Since we aren't adding new columns we won't need ColumnTransformer like before.\
We also won't use OneHotEncoder; instead we use LabelEncoder because we are only working with one column of labels.\
We create a variable to store the LabelEncoder instance, and this time no arguments are required.\
We don't wrap the result in np.array this time, because y is already a NumPy array and le.fit_transform returns one.

In [61]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


As we can see, we successfully converted the dependent variable vector into numerical data the computer can now understand.
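
If you ever need to check which number stands for which label, the fitted encoder keeps that mapping. A small sketch on hypothetical labels (the name `le_demo` is made up so it does not clash with `le` above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])  # hypothetical Purchased values
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)
decoded = le_demo.inverse_transform(encoded)   # back to the original strings

print(le_demo.classes_)  # ['No' 'Yes'], so index 0 is 'No' and index 1 is 'Yes'
print(encoded)           # [0 1 0 1]
print(decoded)           # ['No' 'Yes' 'No' 'Yes']
```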

## Feature Scaling And Splitting The Data
There has been some discussion on whether we should split the data before or after feature scaling.\
I would like to take a minute and explain why we split before we scale the features.\
\
When we get our raw dataset it can have all kinds of values stored in it.\
Up to now we have converted categorical data into 1's and 0's\
and filled in the empty cells, but we could still have some columns with a large difference in values.\
For example, think of age and salary.  Age is under 100 while salary is in the tens of thousands.\
\
The data we pass to the training model needs each column changed so all the values are in a comparable range.\
Let's say values between -3 and +3, or between 0 and 1.\
\
We are going to talk about 2 of the main types of feature scaling.\
First I want to say I'm not a mathematician, but I will explain the best way I know how.\
We are about to see an X with a single quote after it, $X^{'}$; it signifies "the new value of the current x in a column of x's".\
Also, it's very important to remember that scaling is column based.
### Normalization
> #### $X^{'}=\frac{x-x_{min}}{x_{max}-x_{min}}$
This is the normalization formula used for scaling our data.\
What's being said here is: the new value of x is the current value of x minus the column minimum, divided by the column maximum minus the column minimum.\
Let's apply this to a few examples in the age column of our dataset.
- MIN: 27
- MAX: 50
>#### $X^{'}=\frac{27-27}{50-27}=\frac{0}{23}=0.0$
In this example we entered the min age into the formula, and 0 was the answer
>#### $X^{'}=\frac{50-27}{50-27}=\frac{23}{23}=1.0$
In this example we entered the max age into the formula, and we got 1 for the answer
>#### $X^{'}=\frac{38-27}{50-27}=\frac{11}{23}=0.478$
Here I grabbed a random age of 38. We got a value that was between 1 and 0.\
This will work with any age in the data set. It also works the same on the salary column.
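
The same arithmetic can be sketched in NumPy for the whole column at once. The array below holds the nine known ages from the table printed earlier (the missing one is left out); the variable names are just for this example.

```python
import numpy as np

# The nine known ages from the dataset printed earlier
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0])

# Normalization: subtract the column min, divide by (max - min)
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(normalized.round(3))  # 27 maps to 0.0, 50 maps to 1.0, 38 maps to 0.478
```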
### Standardization
> #### $X^{'}=\frac{x-\mu}{\sigma}$
Here is our standardization formula.\
This one is a little harder to explain.\
Here, the new value of x is the current value of x minus the mean (average) of the column, divided by the column's standard deviation.\
So what is the "standard deviation"?\
It's a measure of how dispersed the data is in relation to its mean.\
Mathematically: subtract the mean from each value and square the result, add those up, divide by the number of values in the column, then take the square root.
> #### $\sigma=\sqrt{\frac{\Sigma(x_{i}-\mu)^{2}}{N}}$
Substituting that into the standardization formula gives the expanded version below.\
So first we solve for the standard deviation, then solve for $X^{'}$.
> #### $X^{'}=\frac{x-\mu}{\sqrt{\frac{\Sigma(x_{i}-\mu)^{2}}{N}}}$
Needless to say, I won't be running examples of this one by hand in this section.

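
The arithmetic is tedious by hand but short in NumPy. This sketch follows the two formulas literally on the nine known ages from the table printed earlier; it divides by N (the population standard deviation), which is also what `np.std` does by default.

```python
import numpy as np

# The nine known ages from the dataset printed earlier
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0])

mu = ages.mean()                                        # the column mean
sigma = np.sqrt(((ages - mu) ** 2).sum() / len(ages))   # standard deviation, dividing by N
standardized = (ages - mu) / sigma                      # the standardization formula

print(round(mu, 3))                          # 38.778
print(np.isclose(sigma, ages.std()))         # True: matches NumPy's own std
print(np.isclose(standardized.std(), 1.0))   # True: the result has unit spread
```

After standardization the column has mean 0 and standard deviation 1, so most values fall roughly between -3 and +3.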
By now, I'm sure you're wondering why we go through all this.\
Well, as mentioned before, the age and salary columns in our dataset have values of very different sizes.\
We don't want the much larger values of the salary column to overpower the age column.\
Let's look at a visual example.\
![DataPrep](./img/data_prep(1).png)\
If you were asked which row to group row 2 with, or which person row 2 is closer to overall, who would you pick?\
\
Well, let's first look at the differences and then compare.\
![DataPrep](./img/data_prep(2).png)\
We can see the salary difference between rows 1 and 2 is \$10,000, while between rows 2 and 3 it is \$8,000.\
If we compare the differences there's a gap of 2000.\
For age, the difference between rows 1 and 2 is 1, while the difference between rows 2 and 3 is 4.\
This gives a gap of 3.\
So we would want to say person 2 is closer to person 3, because 2000 is so much larger than 3.\
\
Now let's see what happens after we apply normalization.\
![DataPrep](./img/data_prep(3).png)\
That looks different!\
Now we can easily see that person 2 sits about in the middle as far as salary is concerned.\
Age, on the other hand, is much closer to person 1.\
So now we can clearly and confidently group person 2 with person 1.\
The same applies to our learning model.
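
To put numbers on that, here is a sketch that measures "closeness" as straight-line (Euclidean) distance, which is an assumption about how a model might compare rows. The three people are hypothetical; their values were picked only so the differences match the ones described above (10,000 and 8,000 in salary, 1 and 4 in age).

```python
import numpy as np

# Hypothetical (salary, age) rows for persons 1, 2 and 3
people = np.array([
    [70000.0, 45.0],
    [60000.0, 44.0],
    [52000.0, 40.0],
])

def distances_from_person_2(rows):
    """Euclidean distance from person 2 (row index 1) to persons 1 and 3."""
    return np.linalg.norm(rows[0] - rows[1]), np.linalg.norm(rows[2] - rows[1])

raw_to_1, raw_to_3 = distances_from_person_2(people)

# Column-wise normalization, same formula as in the Normalization section
scaled = (people - people.min(axis=0)) / (people.max(axis=0) - people.min(axis=0))
scaled_to_1, scaled_to_3 = distances_from_person_2(scaled)

print(raw_to_3 < raw_to_1)        # True: unscaled, salary dominates and person 3 looks closer
print(scaled_to_1 < scaled_to_3)  # True: scaled, person 1 is closer
```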
### Ok so which do we perform first, split the data or feature scaling?
Now that we understand feature scaling, we can tackle this question.\
The idea of a test set is to hold back data that has both the independent and dependent variables.\
This way we can enter single observations, get the predicted outcome, and compare it to what actually happened.\
\
As we now know, when we apply feature scaling we are working with mins, maxes, or means.\
This means that if we scale before splitting, our test data contaminates the feature scaling results.\
This is called "data leakage"; remember, we don't want our model to have any contact with the test data.\
Another thing to consider is that once the model is deployed, new data can only be scaled with what was learned from the training data; it never gets to influence the min, max, or mean.\
\
I think the answer to which has to be done first is now obvious: splitting the data.
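
A small sketch of what leakage looks like in practice, using hypothetical age values and MinMaxScaler. When the scaler is fit on everything, the test row with age 50 changes the max the scaler learns; when it is fit on the training rows only, it does not.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages, already split
train_ages = np.array([[27.0], [35.0], [40.0], [48.0]])
test_ages = np.array([[30.0], [50.0]])

scaler = MinMaxScaler()
scaler.fit(train_ages)                     # min and max learned from the training rows only
scaled_test = scaler.transform(test_ages)  # test rows reuse those numbers

# The leaky version: the test rows take part in the fit
leaky = MinMaxScaler().fit(np.vstack([train_ages, test_ages]))

print(scaler.data_max_, leaky.data_max_)   # [48.] [50.]
print(scaled_test.ravel())                 # the second value is above 1.0
```

A test value outside the 0 to 1 range is expected and fine: it just means the model is seeing something it did not see in training, which is exactly what the test set is for.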
### Split Data Into Train/Test Sets
The next step is to split the data into 2 sets:\
a set to train our learning model with, and another much smaller set to test with.\
\
Well, actually 4 different sets of data.\
We still have the train/test split concept, but in reality each set has to be split into 2 separate matrices:\
a set for the independent variables/features, or "x", and a set for the dependent variable, or "y".\
\
Good news: sklearn has a tool for that. It will split our data into the 4 matrices that we need.\
This means we will have x_train, x_test, y_train, y_test sets all made for us.\
\
**Why do we need to break our data into four sets again?**\
Our learning model expects both x_train and y_train to train with, and both test sets for testing.

In [62]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=1)

That was painless! Let's cover what we just did.\
We import the train_test_split function from sklearn.\
train_test_split returns four elements. In Python, assigning to several variables separated by commas, as above, is called unpacking.\
So line 2 unpacks the 4 elements returned from the function and assigns them to variables with relevant names.\
train_test_split expects a couple of arguments, plus we set a couple extra:
- First - the array of features.
- Second - the array of dependent variable values.
- Third - test_size, how much of the dataset we want to reserve for testing.  An 80/20 split is a common choice.
- Fourth - random_state. The rows that go into the test set are picked at random; fixing this seed ensures the split is the same every time.
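
To see the effect of test_size and random_state in isolation, here is a sketch on hypothetical data: ten observations whose feature rows are built so that row i is always [2i, 2i+1] and its label is i, which makes it easy to check that features and labels stay paired.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)   # 10 hypothetical observations, 2 features each
labels = np.arange(10)                    # label i belongs to feature row i

# Two calls with the same seed
a_train, a_test, b_train, b_test = train_test_split(features, labels, test_size=0.2, random_state=1)
c_train, c_test, d_train, d_test = train_test_split(features, labels, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)                # (8, 2) (2, 2)
print(np.array_equal(a_test, c_test))             # True: same seed, same split
print(np.array_equal(a_test[:, 0] // 2, b_test))  # True: rows and labels stay aligned
```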

Lets see what we got.

In [63]:
print(f'=======\nX Train\n=======\n{x_train}')
print(f'=======\nX Test\n=======\n{x_test}')
print(f'=======\nY Train\n=======\n{y_train}')
print(f'=======\nY Test\n=======\n{y_test}')

X Train
[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
X Test
[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
Y Train
[0 1 0 0 1 1 0 1]
Y Test
[0 1]


### Feature Scaling
When scaling our dataset, should we also scale the data we already manipulated (the categorical data)?
The answer is no, for two reasons:
- That data is already represented by 0's and 1's, so it's already in the desired range.
- Changing that data may alter its representation.  It does not help improve our results, and it may even hurt them.

In [64]:
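from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: fit on the training rows only, then reuse that fit on the test rows
sc = StandardScaler()
x_train[:,-2:] = sc.fit_transform(x_train[:,-2:]) # scale only Age and Salary, leave the dummy columns alone
x_test[:,-2:] = sc.transform(x_test[:,-2:]) # transform only, the scaler must not be refit on the test set

# Normalization alternative (use instead of the three lines above, not in addition to them)
# sc = MinMaxScaler(feature_range=(0,1), copy=True) # these are the defaults and could be omitted
# x_train[:,-2:] = sc.fit_transform(x_train[:,-2:])
# x_test[:,-2:] = sc.transform(x_test[:,-2:])

print(x_train)
print(x_test)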

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


Let's recap what's going on in the above cell.\
In the first line we import both the StandardScaler (standardization) and MinMaxScaler (normalization) classes.\
The cell then contains three lines that standardize the data and three lines that normalize it; only one set should be active at a time, and the output shown above is the standardized result.\
\
To standardize the data we first create an instance of the StandardScaler class and store it in a variable. No arguments are needed.\
Next we apply the StandardScaler to both our x_train and x_test sets.\
Remember, we are not going to apply it to the "dummy data", the categorical data we already encoded, which also includes both y sets.\
\
Now, let's break these lines down.\
The first 3 columns of our feature set are the "Country" representation, so we do not want to risk altering them.\
With a NumPy array we can index both axes in a single pair of square brackets by separating the two indexers with a comma.\
x_train[:,-2:] - from x_train [take every row, select from the second-to-last column through the last column].\
We set this equal to sc.fit_transform, passing the same slice, x_train[:,-2:], as the argument.\
This says: take the last 2 columns, standardize them, and write them back while keeping the rest of the dataset.\
\
Cool, let's standardize the test set. NOTE: here we use sc.transform, not sc.fit_transform.\
When we call fit_transform on the training set, we compute the mean and standard deviation and build the scaler.\
When we call just transform on the test set, we apply the scaler that was built during training, without refitting it.\
That's it, the data is standardized.\
\
The normalization lines follow the same concept with MinMaxScaler.\
Both have their advantages in different situations.
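
One way to convince yourself of the fit_transform versus transform difference is to reproduce transform by hand from what the scaler stored. The arrays here are hypothetical (age, salary) rows, and `mean_` and `scale_` are the attributes StandardScaler fills in when it is fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows, already split
train = np.array([[27.0, 48000.0], [38.0, 61000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train, then scales it
test_scaled = scaler.transform(test)        # only applies what was learned

# transform is just the standardization formula with the stored training statistics
manual = (test - scaler.mean_) / scaler.scale_

print(np.allclose(test_scaled, manual))                   # True
print(np.allclose(scaler.mean_, train.mean(axis=0)))      # True: the mean came from train only
```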