# Data Preprocessing

### Importing Essential Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint

### Importing CSV Data
- pandas.read_csv opens Data.csv (located in the same directory) as a DataFrame
- the dataset is then split into independent variables (X) and the dependent variable (y) using iloc, which selects by integer position (row indexes go before the comma, column indexes after it)
- finally, ".values" converts each selection into a numpy array

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
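
The iloc slicing above can be tried without Data.csv. The following is a minimal sketch on a small in-memory DataFrame; the name demo and its three rows are made up for illustration and only mirror the column layout of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same column layout as Data.csv
demo = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, np.nan],
    'Salary': [72000.0, np.nan, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, last column only

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
print(X_demo.dtype)  # object, because text and numbers share one array
```

The object dtype is why X prints with quoted strings next to floats: a numpy array holds a single dtype, so mixed columns fall back to generic Python objects.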

In [3]:
print('--------------dataset--------------')
print()
print(dataset)
print()
print('-----------------------------------')
print()
print(type(dataset))

--------------dataset--------------

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

-----------------------------------

<class 'pandas.core.frame.DataFrame'>


In [4]:
print('-------independent variables-------')
print()
pprint(X)
print()
print('-----------------------------------')
print()
print(type(X))

-------independent variables-------

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

-----------------------------------

<class 'numpy.ndarray'>


In [5]:
print('--------dependent variables--------')
print()
pprint(y)
print()
print('-----------------------------------')
print()
print(type(y))

--------dependent variables--------

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

-----------------------------------

<class 'numpy.ndarray'>


### Taking Care of Missing Data

In [6]:
from sklearn.impute import SimpleImputer

- first we create an object from the SimpleImputer class. It receives the value that marks missing entries (in this case NaN) and the strategy used to replace them (in this case the mean of the other values in the same column)
- then we call the fit function on the rows and columns we want to transform (in this case all rows of the numerical columns). fit computes the mean of each of those columns, ignoring the missing entries
- last but not least, imputer.transform uses the means learned by fit to replace every missing value, and we store the result back in the same columns of the X array

In [7]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])

In [8]:
pprint(X)

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
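
To check where the two filled-in numbers come from, the sketch below repeats the imputation on the Age and Salary values printed earlier and compares the means stored by the imputer with means computed directly with numpy. The names numeric, demo_imputer and filled are introduced here for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary values copied from the dataset printed earlier
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = demo_imputer.fit_transform(numeric)  # fit and transform in one call

print(demo_imputer.statistics_)     # per-column means learned by fit
print(np.nanmean(numeric, axis=0))  # the same means computed with numpy
print(filled[6, 0], filled[4, 1])   # the two cells that used to be NaN
```

The statistics_ attribute holds one value per column, so the missing Age is replaced by the Age mean and the missing Salary by the Salary mean.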


### Taking Care of Categorical Data

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

#### Nominal Categorical Data
- first we create an object from the ColumnTransformer class
    - this object receives 2 parameters, transformers and remainder
    - remainder says what to do with the columns we have not specified (in this case 'passthrough', which keeps those columns unchanged)
    - transformers is a little more complex: it receives a list of tuples, each describing an action to apply to a certain group of columns. The first element is the transformer name ('encoder'), the second is the transformer itself (OneHotEncoder()) and the third is the columns to apply it to (a list or a slice)
    - OneHotEncoder creates one column for each categorical value and treats it somewhat like a boolean. For example, there will be a column for Germany in which every row that had the "Germany" value is filled with 1 and every other row with 0

- then we fit and transform in one step, convert the result to a numpy array and store it back in X. The encoded columns come first in the result, followed by the passthrough columns

In [10]:
column_transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',
)
X = np.array(column_transformer.fit_transform(X))

In [11]:
pprint(X)

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
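
To see which of the three new columns belongs to which country, OneHotEncoder can be run on its own. This is a small sketch on a made-up countries array (not the full dataset); demo_encoder and encoded are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input; the encoder expects a 2D array
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

demo_encoder = OneHotEncoder()
# The default result is a sparse matrix, so convert it for printing
encoded = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_)  # categories in sorted order
print(encoded)                   # one row per input, one column per category
```

The categories are sorted, so the columns are France, Germany and Spain in that order, which matches the first three columns of X above.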


#### Boolean Categorical Data

- this time we use LabelEncoder, which is much simpler than ColumnTransformer with OneHotEncoder, and it is enough here since y has only two values ('Yes' and 'No'). LabelEncoder turns y into an array of zeros and ones in which each number stands for one categorical value (just like a boolean). The classes are numbered in sorted order, so 'No' becomes 0 and 'Yes' becomes 1

In [12]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [13]:
pprint(y)

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
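
The mapping between numbers and labels can be read back from the fitted encoder. A minimal sketch, using a made-up answers array and illustrative names (demo_label_encoder, codes):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels with the same two values as the Purchased column
answers = np.array(['No', 'Yes', 'No', 'Yes'])

demo_label_encoder = LabelEncoder()
codes = demo_label_encoder.fit_transform(answers)

print(demo_label_encoder.classes_)                  # position in this array is the code
print(codes)                                        # integer code for each label
print(demo_label_encoder.inverse_transform(codes))  # back to the original labels
```

Keeping the fitted encoder around makes it possible to turn numeric predictions back into 'Yes' and 'No' with inverse_transform.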
