# Data Preprocessing
---


## Import Libraries

In [20]:
import numpy as np # for converting or manipulating the data
import matplotlib.pyplot as plt # for plotting the data on a graph
import pandas as pd # for importing the data

There are more tools that we will use as we advance.\
These are the basic tools for now, so we can focus on understanding what we are doing and why.\
In fact, there are libraries that we will use to help in the preprocessing stage.\
More on that later.

## Import Dataset
Here we import Data.csv and store it in a variable called dataset.\
What read_csv returns is called a DataFrame.\
A DataFrame is a 2-dimensional data structure, like a 2-dimensional array with labeled rows and columns.

In [21]:
dataset = pd.read_csv('Data.csv')
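
As a quick illustration of what read_csv hands back, here is a minimal sketch. It assumes an in-memory stand-in for Data.csv (the `csv_text` string and the name `sample` are made up for this example) so it can run without the file.

```python
import io
import pandas as pd

# A small in-memory stand-in for Data.csv (first three rows only).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # the class of the returned object
print(sample.shape)           # (rows, columns)
print(list(sample.columns))   # column labels taken from the header row
```

The header row becomes the column labels, and each remaining line becomes one row of the DataFrame.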

In any dataset that you will train a model with, you will have two distinct entities.\
The set of features, which are the known characteristics of each observation,\
also called the independent variable(s).\
Then the dependent variable, which is the outcome we want to predict.\
\
Let's take a look at our dataset.

In [22]:
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


Here we have 4 columns: country, age, salary, and whether the person bought a product.\
Right away we can make out that whether a person purchased a product is unlikely to affect the other three columns,\
while any of the others may affect whether someone buys that product.\
We can safely say that (Country, Age, Salary) are the features and (Purchased) is the dependent variable.\
We also have two cells with missing data. More on that later.

In [23]:
# NOTE: .values turns the selected part of the dataframe into a numpy array
x = dataset.iloc[:, :-1].values # All rows, all columns except the last / feature set / independent variables
y = dataset.iloc[:, -1].values # All rows, only the last column / output vector / dependent variable

As already mentioned, dataset is a pandas DataFrame that represents a table of data.\
It's not an array, but for now let's think of it as an array of arrays.\
We know that in Python we slice a sequence like so: arr[ start : end : step ]\
We also know [ -1 ] is the same as indexing the last element.\
What we might not know is that chaining slices on a DataFrame, dataset[ : ][ : -1 ], does not select columns: both slices apply to the rows.\
The pandas DataFrame has a built-in indexer for selecting by position called iloc.\
dataset.iloc[ ] requires 1 argument but can take 2: the index for the first dimension (rows) and then for the second (columns).\
Each index can be either a range or a single position.
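
To make the iloc rules concrete, here is a small sketch. It assumes a toy frame named `toy` built from the first three rows of the dataset (the name and the shortened data are only for illustration).

```python
import pandas as pd

# Toy frame with the same column layout as the dataset.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1]  # two arguments: all rows, every column except the last
target = toy.iloc[:, -1]     # all rows, last column only (a Series)
first_row = toy.iloc[0]      # one argument: selects by row position
chained = toy[:][:-1]        # both slices act on rows, so this drops the last ROW

print(features.shape)
print(target.tolist())
print(first_row["Country"])
print(chained.shape)
```

Comparing `features.shape` with `chained.shape` shows the difference: iloc removed a column, while the chained slices removed a row.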

## Handle Missing Data
As we saw above, some data is missing in our dataset.\
There are a couple of ways to handle missing data:
- First, if the dataset is very large and we are only missing a small percentage, we can just delete the rows with missing data.
- A second way, and the way we'll do it, is to replace each missing value with the average of that column.
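
The two options can be sketched side by side. This example uses plain pandas (dropna and fillna) rather than the sklearn tool introduced next, and the frame `toy` is a made-up excerpt of the numeric columns.

```python
import numpy as np
import pandas as pd

# Toy numeric columns with one gap each.
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the mean of its own column.
# toy.mean() skips NaN values when computing each column mean.
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Dropping rows costs us half of this tiny table, which is why the second option is the better fit for a small dataset like ours.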

First we add to our data preprocessing tools.\
Normally we would import this with the rest of the imports at the top of the file.

In [24]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Above we imported SimpleImputer, a class from the sklearn library.\
Next we create a variable to store our new class object and pass in a couple of arguments.\
First is missing_values: by passing np.nan we are saying any cell that doesn't have a value, or is "not a number".\
Second is strategy: 'mean' replaces each missing value with the mean of its column.\
\
SimpleImputer can do more than replace empty values with an average.\
You can also use the median, or the most frequent value if the column is categorical, like Country.\
\
NOTE: this setup is for numerical values, so we only want to apply it to the Age and Salary columns.\
As good practice, include all numerical columns when doing this, since we won't always know where data is missing.
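
The other strategies mentioned above might look like the sketch below. The arrays `ages` and `countries` are made-up single-column examples, and each one has a gap in its third row.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column and one categorical column, each with a missing entry.
ages = np.array([[44.0], [27.0], [np.nan], [38.0]])
countries = np.array([["France"], ["Spain"], [np.nan], ["France"]], dtype=object)

# strategy='median' fills gaps with the column median.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_filled = median_imputer.fit_transform(ages)

# strategy='most_frequent' also works on string data.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mode_filled = mode_imputer.fit_transform(countries)

print(median_filled.ravel())
print(mode_filled.ravel())
```

The median is less sensitive to extreme values than the mean, which can matter for columns like Salary.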

In [25]:
imputer.fit(x[:,1:3])
x[:, 1:3] = imputer.transform(x[:,1:3])
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


We can see all the cells are now filled, and the two imputed values are easy to spot: 63777.78 for Salary and 38.78 for Age.

## Encoding Categorical Data
### Encoding The Independent Variable
Looking at our dataset, most of the data is numerical, which is good.\
Learning models tend to have difficulty finding correlations between strings and the output vector,\
which is why we are going to encode categorical data such as Country.\
You may think that we would just assign the countries numerical values,\
but we don't want to give the impression that there is an ordering relation between countries.\
In other words, France isn't first, Spain isn't second, and so on.\
\
The solution we will use is "one-hot encoding".\
This is where, instead of giving a numerical value to cells such as "Germany", we give each unique entry its own column.\
Then, for each row, a 1 is placed in the column of its country, while the other columns get a 0.\
So instead of it being:
- France = 0
- Germany = 1
- Spain = 2

it would be:
- France = [1,0,0]
- Germany = [0,1,0]
- Spain = [0,0,1]
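
Before reaching for sklearn, the idea can be sketched by hand with NumPy. This is only an illustration of the concept, not the method used in the next cell. The array `countries` is a made-up sample, and the columns come out in alphabetical order because np.unique sorts the labels.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted unique labels and, with return_inverse=True,
# the position of each entry's label within that sorted list.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing it with the codes gives one vector per original entry.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(codes)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no country ends up numerically "larger" than another.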

In [27]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x)) 

Let's talk about the above code quickly.\
The first two lines could be added to our imports at the top of the file.\
Sklearn is a very popular machine learning library that has many powerful data preprocessing tools.\
\
First we import ColumnTransformer, a class from sklearn that applies a transformation to selected columns and assembles the result into the new feature matrix.\
Next we import OneHotEncoder, the tool that converts "Country" into one-hot vectors.\
We create a variable and store an instance of the ColumnTransformer class.\
ct will need a few arguments passed to it:
- First - transformers - a list of (name, transformer, columns) tuples. The string 'encoder' is just a name we chose for this step.
- Second - The transformer to apply - we want OneHotEncoder(), so we pass just that, followed by [0], the index of the column to encode (Country).
- Third - What to do with the columns we didn't transform - remainder='passthrough' - this says to leave them in place; the default is to drop them.

We then pass x into ct.fit_transform() to convert the feature set into the new set with the new columns.\
ct.fit_transform() is not guaranteed to return a plain NumPy array (it can return a sparse matrix when the result is mostly zeros), so in the last line we wrap the result in np.array().\
\
Let's take a quick look at the new dataset: we can see the first column has been replaced with three new ones containing either 1 or 0.

In [28]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
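
To check which country each new column stands for, we can ask the fitted encoder for its categories. The sketch below assumes a three-row toy matrix, and the names `toy_x`, `toy_ct` and `toy_encoded` are made up for this example. Note that running the encoding cell a second time on an already encoded x would encode its first column again and add an extra column, so it should be run only once.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix with the same layout as x: Country, Age, Salary.
toy_x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

toy_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
toy_encoded = np.array(toy_ct.fit_transform(toy_x))

# The fitted encoder is stored under the name given in the tuple.
fitted_encoder = toy_ct.named_transformers_["encoder"]

print(fitted_encoder.categories_[0])  # category order = order of the new columns
print(toy_encoded.shape)              # 3 one-hot columns + Age + Salary
print(toy_encoded[0])
```

The categories are sorted, so the first new column marks France, the second Germany and the third Spain.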



### Encoding The Dependent Variable
In our case the dependent variable is categorical too, but this time we are not worried about the ordering concept as before.\
Like the "Country" column, it contains strings, so we will want to convert them to numerical values.\
Unlike "Country", though, we can assign each label its own numerical value.\
\
So the concept is the same, but instead of giving each option its own column like before,\
we will give "Yes" the value 1 and "No" the value 0.\
\
Since we aren't adding new columns we won't need ColumnTransformer like before.\
We will also not use OneHotEncoder, but instead LabelEncoder, because we are only working with one column.\
We create a variable to store the LabelEncoder instance, and this time no arguments are required.\
We don't wrap the result in np.array() this time, because le.fit_transform() already returns a NumPy array.

In [30]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


As we can see, we successfully converted the dependent variable vector into numerical data the computer can now understand.
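
If we ever need to know which number stands for which label, the fitted encoder keeps that mapping. A minimal sketch, assuming a short made-up label array `labels` and an encoder named `toy_le`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(labels)

print(toy_le.classes_)                    # sorted labels; a label's index is its code
print(encoded)                            # the numeric codes
print(toy_le.inverse_transform(encoded))  # back to the original strings
```

Because the classes are sorted, "No" maps to 0 and "Yes" maps to 1, and inverse_transform is handy later for turning predictions back into readable labels.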

## Split Data Into Train/Test Sets

---

## Feature Scaling