# Data Preprocessing Tools

## Importing the libraries

In [None]:
import numpy as np # NumPy is used to create and manipulate arrays
import matplotlib.pyplot as plt # Matplotlib is used to plot graphs and other charts
import pandas as pd # Pandas is used to import and preprocess datasets such as CSV files

## Importing the dataset

In [None]:
dataset = pd.read_csv('Data.csv') # Reads the data from the csv file and creates a dataframe which we can manipulate
'''
X holds the independent variables (the features), which are used to
predict the value of the dependent variable.
Here the independent variables are country, age and salary.
'''
X = dataset.iloc[:, :-1].values # Selecting the country, age and salary columns
# iloc selects rows and columns by integer position.
# The first index selects the rows and the second index selects the columns
'''
Y is the dependent variable, which is predicted
based on the values in X. Here the dependent variable
is purchased.
'''
Y = dataset.iloc[:, -1].values # Selecting the purchased column

print(X)
print(Y)
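
As a rough illustration of what the two `iloc` slices select, the sketch below performs the same split on a few made-up rows using only the standard library. The rows and the names `sample_csv`, `features` and `target` are hypothetical and are not taken from `Data.csv`.

```python
import csv
import io

# Hypothetical rows in the same column layout as the notebook's dataset
sample_csv = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)  # first line holds the column names
rows = list(reader)    # remaining lines hold the data (all values are strings)

# Counterpart of dataset.iloc[:, :-1]: every row, every column except the last
features = [row[:-1] for row in rows]
# Counterpart of dataset.iloc[:, -1]: every row, only the last column
target = [row[-1] for row in rows]

print(features)
print(target)
```

The `:` before the comma keeps all rows, `:-1` keeps every column except the last, and `-1` keeps only the last column.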

## Taking care of missing data

In [None]:
'''
Importing the SimpleImputer class from the scikit-learn library
To impute means to fill in a missing value with a substitute value
Here, the missing values in the age and salary columns are replaced with the average value of the column,
i.e. the missing values are imputed with the column mean
'''
from sklearn.impute import SimpleImputer
# Creating a SimpleImputer object, missing values are identified as nan and mean column value is used to replace it
imputer = SimpleImputer(missing_values=np.nan, strategy="mean") 
# fit computes the mean of each column and transform replaces the missing values with it.
# fit_transform does both in one step and returns a new array
X[:, 1:] = imputer.fit_transform(X[:, 1:])
print(X)
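
To make the idea of mean imputation concrete, here is a minimal plain-Python sketch of the same computation on a single column. It is only an illustration of the technique, not scikit-learn's implementation, and the `ages` values are made up.

```python
import math

# Hypothetical age column with one missing entry, marked as NaN
ages = [44.0, 27.0, math.nan, 38.0]

# Mean of the observed (non-missing) values only
observed = [a for a in ages if not math.isnan(a)]
column_mean = sum(observed) / len(observed)

# Replace each missing entry with the column mean, keep the rest unchanged
imputed = [column_mean if math.isnan(a) else a for a in ages]

print(imputed)
```

The missing entry is filled with the mean of 44, 27 and 38, while the observed values are left as they were.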

## Encoding categorical data

Categorical data is data that takes its values from a set of labels rather than numbers, such as country names. These labels are encoded as numbers so that they can be processed conveniently.

One-hot encoding is used in this example: each distinct value becomes its own column, which holds 1 for the rows that have that value and 0 for all other rows.
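
The mechanics of one-hot encoding can be sketched in a few lines of plain Python. The `countries` list is made up for illustration. The columns are ordered alphabetically here, which is also how scikit-learn's `OneHotEncoder` orders categories by default.

```python
# Hypothetical country column
countries = ["France", "Spain", "Germany", "Spain"]

# One output column per distinct value, in sorted order
categories = sorted(set(countries))

# Each row gets a 1 in the column of its own category and 0 elsewhere
one_hot = [[1 if country == category else 0 for category in categories]
           for country in countries]

print(categories)
for row in one_hot:
    print(row)
```

Every row contains exactly one 1, so no ordering is implied between the categories.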

### Encoding Independent variable

In [None]:
from sklearn.compose import ColumnTransformer # ColumnTransformer applies a transformer to the specified columns

# OneHotEncoder encodes a categorical column as a set of binary (0/1) columns
# Categories are ordered alphabetically, so France is given the code 100, Germany the code 010 and Spain the code 001,
# where each digit represents a separate column
from sklearn.preprocessing import OneHotEncoder 

# Takes two arguments, transformers and remainder
# Each transformer is a tuple of three things: a name for the step, the transformer to use and the columns to transform
# remainder is either 'drop' or 'passthrough', i.e. drop the other columns or retain them
ColumnTransformerObj = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ColumnTransformerObj.fit_transform(X))

print(X)

### Encoding Dependent variable

In [None]:
# LabelEncoder encodes each distinct label (numerical or categorical) as a unique integer.
from sklearn.preprocessing import LabelEncoder

LabelEncoderObj = LabelEncoder()
Y = LabelEncoderObj.fit_transform(Y)

print(Y)
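
Label encoding amounts to a lookup table from labels to integers. The sketch below shows the idea in plain Python on made-up labels, assigning codes in sorted order. The names `labels`, `classes` and `code_of` are hypothetical.

```python
# Hypothetical values of the purchased column
labels = ["No", "Yes", "No", "Yes", "Yes"]

# Distinct labels in sorted order; the position of a label is its code
classes = sorted(set(labels))
code_of = {label: index for index, label in enumerate(classes)}

# Replace every label with its integer code
encoded = [code_of[label] for label in labels]

print(code_of)
print(encoded)
```

Unlike one-hot encoding, this produces a single column, which is suitable for a binary target such as purchased.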

## Splitting the dataset into training set and test set

The dataset is split into two parts, a training set and a test set. The training set is used to train the model and the test set is used to evaluate its accuracy on data it has not seen. Typically 80% of the data is used for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.2, random_state=1)

print(X_Train)
print(X_Test)

print(Y_Train)
print(Y_Test)
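
The following plain-Python sketch shows the idea behind a shuffled 80/20 split. It is not how `train_test_split` is implemented, and a given seed will not reproduce the rows scikit-learn selects. The `samples` list and the variable names are assumptions made for illustration.

```python
import random

# Hypothetical dataset of 10 samples, represented by their row indices
samples = list(range(10))
test_size = 0.2

# A seeded generator makes the shuffle reproducible from run to run
rng = random.Random(1)
shuffled = samples[:]
rng.shuffle(shuffled)

# The first 20% of the shuffled rows form the test set, the rest the training set
n_test = round(len(samples) * test_size)
test_rows = shuffled[:n_test]
train_rows = shuffled[n_test:]

print(train_rows)
print(test_rows)
```

Fixing the seed plays the same role as `random_state=1` above: the same rows land in each set every time the cell is run.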

## Feature Scaling

Feature scaling puts all the features on the same scale so that no feature dominates the others. It should be applied after splitting the dataset, with the scaler fitted on the training set only, to prevent information from the test set leaking into training.

The main feature scaling techniques are:
1) Standardization
2) Normalization

Standardization: 

Standardization produces values that mostly lie between -3 and +3. It can be used for any kind of data. It is not applied to the dummy (one-hot encoded) columns, as their values of 0 and 1 already lie within that range.

<img src="http://www.sciweavers.org/upload/Tex2Img_1642227428/render.png">

Normalization:

Normalization produces values between 0 and 1. It is recommended when the data follows a normal distribution.

<img src="http://www.sciweavers.org/upload/Tex2Img_1642227737/render.png">
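
Assuming the usual definitions, standardization computes `(x - mean) / standard deviation` and normalization computes `(x - min) / (max - min)`. The sketch below applies both to a made-up salary column using only the standard library.

```python
import statistics

# Hypothetical salary column
salaries = [48000.0, 54000.0, 61000.0, 72000.0, 83000.0]

# Standardization: subtract the mean, divide by the standard deviation
mean = statistics.mean(salaries)
std = statistics.pstdev(salaries)  # population standard deviation (divides by n)
standardized = [(x - mean) / std for x in salaries]

# Normalization: subtract the minimum, divide by the range
low, high = min(salaries), max(salaries)
normalized = [(x - low) / (high - low) for x in salaries]

print(standardized)
print(normalized)
```

After standardization the column has a mean of 0 and a standard deviation of 1. After normalization the smallest value maps to 0 and the largest to 1.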

In [None]:
from sklearn.preprocessing import StandardScaler # Applies the standardization technique
StandardScalerObj = StandardScaler()

# Calculates the mean and standard deviation of the training set
# and transforms the values using standardization
X_Train[:, 3:] = StandardScalerObj.fit_transform(X_Train[:, 3:])

# Transforms the test set using the mean and standard deviation of the training set
X_Test[:, 3:] = StandardScalerObj.transform(X_Test[:, 3:])

print(X_Train)
print(X_Test)
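
The difference between `fit_transform` on the training set and `transform` on the test set can be sketched in plain Python. This is a simplified illustration on a single made-up age column, not scikit-learn's implementation. The names `train_ages` and `test_ages` are hypothetical.

```python
import statistics

# Hypothetical age column, already split into training and test values
train_ages = [27.0, 30.0, 38.0, 44.0, 50.0]
test_ages = [35.0, 48.0]

# "fit": the statistics are learned from the training set only
train_mean = statistics.mean(train_ages)
train_std = statistics.pstdev(train_ages)  # population standard deviation

# "transform": both sets are scaled with the training statistics
scaled_train = [(x - train_mean) / train_std for x in train_ages]
scaled_test = [(x - train_mean) / train_std for x in test_ages]

print(scaled_train)
print(scaled_test)
```

Because the test values never contribute to `train_mean` or `train_std`, nothing about the test set leaks into the scaling. The scaled test column will generally not have a mean of exactly 0.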