## Importing the Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing Dataset

We load the data into a variable, dataset, which holds a pandas DataFrame

Matrix of features:- The matrix of features is a term used in machine learning for the set of columns that contain the independent variables, taken across all rows of the dataset. Each row of the dataset is called an observation.
It is denoted by the variable X.



Target variable vector:- The target variable vector is a term used in machine learning for the column that contains the dependent variable. It holds one value per observation.
It is denoted by Y.

To create our feature matrix and target vector we use the 'iloc' indexer, which is short for integer location and selects rows and columns by position.

Format of iloc method is:

     var = dataset.iloc[rows, columns].values

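To define the rows and columns we use the slice operator (:), which separates start_index : end_index

A slice includes start_index but excludes end_index

If the slice operator is used on its own, it selects everything from the first index to the last. A negative index counts from the end, so :-1 selects every column except the last and -1 selects only the last column.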
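
As a small illustration of these slicing rules, the sketch below applies iloc to a made-up three-column DataFrame. The name toy and its values are assumptions for this example and are not part of Data.csv.

```python
import pandas as pd

# Hypothetical toy data: two feature columns followed by one target column
toy = pd.DataFrame({
    'Age': [25, 32, 47],
    'Salary': [40000, 52000, 81000],
    'Purchased': ['No', 'Yes', 'Yes'],
})

# All rows, every column except the last -> feature matrix
features = toy.iloc[:, :-1].values
# All rows, only the last column -> target vector
target = toy.iloc[:, -1].values
# Rows 0 and 1 only: start index 0 is included, end index 2 is excluded
first_two_rows = toy.iloc[0:2, :].values

print(features.shape)        # (3, 2)
print(target.shape)          # (3,)
print(first_two_rows.shape)  # (2, 3)
```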

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

In [3]:
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [4]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [5]:
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of Missing Data

There are many ways of handling missing data in our feature matrix. A few of them are:

1. If the dataset is large and only a few values are missing, we can remove the rows that contain them.

2. We can replace the missing values with the mean of that column or feature.

3. We can replace the missing values with the median of that column or feature.
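
The first and third options can be sketched with pandas alone. The DataFrame below is a made-up example (the name toy and its values are assumptions), so treat this as an illustration rather than the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    'Age': [20.0, 30.0, np.nan, 50.0],
    'Salary': [30000.0, np.nan, 50000.0, 70000.0],
})

# Option 1: remove every row that contains a missing value
dropped = toy.dropna()

# Option 3: replace missing values with the median of each column
median_filled = toy.fillna(toy.median())

print(len(dropped))                    # 2 rows remain
print(median_filled.loc[2, 'Age'])     # 30.0
print(median_filled.loc[1, 'Salary'])  # 50000.0
```

Dropping rows is only reasonable when few observations are affected; in this toy example it would discard half of the data.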

To replace these missing values we use the SimpleImputer class from the impute module of the sklearn library.

1. First we create an instance of the SimpleImputer class, to which we pass two arguments,
  1. The value to be treated as missing, i.e missing_values
  2. The strategy used to replace the missing values, i.e strategy
  
2. After that we call the fit method of our imputer object on the numerical columns that may contain missing values. This computes the replacement value (here the mean) for each of those columns.

3. Now we call the transform method of our imputer object on the same columns. This returns a copy of the selected columns in which the missing values have been replaced.

4. At last we write the columns returned by the transform method back into our original feature matrix.

In [6]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
updated_features_of_X = imputer.transform(X[:,1:3])
X[:, 1:3] = updated_features_of_X

In [7]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
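
The two filled-in values can be checked by hand. The sketch below copies the Age and Salary columns from the dataset printed earlier and computes their means while ignoring the missing entries, using only numpy.

```python
import numpy as np

# Age and Salary columns as they appear in the dataset above
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan, 58000,
                   52000, 79000, 83000, 67000], dtype=float)

# nanmean skips the missing entries when averaging
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

print(age_mean)     # approximately 38.78
print(salary_mean)  # approximately 63777.78
```

These match the values that replaced nan in the Age and Salary columns of the output above.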


## Encoding categorical data 

Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a necessary step. 

We need to convert these categorical variables to numbers such that the model is able to understand and extract valuable information.

Categorical variables are usually represented as ‘strings’ or ‘categories’ and are finite in number. 

Further, there are two kinds of categorical data-

    1. Ordinal Data: The categories have an inherent order
    2. Nominal Data: The categories do not have an inherent order
    
When encoding ordinal data, one should retain the information about the order of the categories.
Example- A hypothetical rating feature with the levels low, medium and high (the given data has no ordinal column)

When encoding nominal data, we only record the presence or absence of each category, because no notion of order is present.
Example- Country in the given data

To encode a nominal feature such as Country we use a method called OneHot Encoding, which creates one 0/1 column per category and therefore does not imply any order.

The target Purchased has only two categories, Yes and No, so we encode it as 0 and 1 with a method called Label Encoding.

### Encoding the Independent Variable

To encode the categorical independent variable we use the ColumnTransformer class from the compose module of the sklearn library together with the OneHotEncoder class from the preprocessing module of the sklearn library

1. First we create an instance of the ColumnTransformer class, to which we pass two arguments
    A. a name, the encoder to be used and the index of the column to be transformed, specified within a tuple, i.e transformers
    B. what to do with the columns that are not transformed, i.e remainder. With 'passthrough' they are kept unchanged.
    
2. After that we fit our ColumnTransformer object to the feature matrix and transform it in one step with fit_transform, which encodes the specified categorical column.

3. The machine learning models we train later expect a numpy array, so we explicitly convert the encoded feature matrix to a numpy array and replace our feature matrix with it.

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[0])], remainder='passthrough')
encoded_feature_matrix = ct.fit_transform(X)
X = np.array(encoded_feature_matrix)

In [10]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
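
To see where the three new columns come from, here is a minimal numpy-only sketch of one-hot encoding. It is an illustration of the idea, not how OneHotEncoder is implemented, and the short countries array is an assumption made for the example.

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# np.unique returns the sorted categories and, for each value, its category index
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(one_hot)
```

With the categories in alphabetical order, France becomes 1 0 0, Germany 0 1 0 and Spain 0 0 1, which is consistent with the first three columns of the output above.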


### Encoding the Dependent Variable

To encode the dependent variable we use the LabelEncoder class from the preprocessing module of the sklearn library

1. First we create an instance of the LabelEncoder class, which takes no arguments
    
2. After that we fit our LabelEncoder object to the target vector and transform it in one step with fit_transform, which encodes the categorical values.

3. fit_transform already returns a numpy array, so no conversion is needed and we simply replace our target vector with the encoded one.

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_target_variable = le.fit_transform(Y)
Y = encoded_target_variable

In [12]:
print(Y)

[0 1 0 0 1 1 0 1 0 1]
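
The mapping between labels and integers can be inspected on the fitted encoder. The sketch below uses a short made-up labels array (an assumption for the example) and shows the classes_ attribute and the inverse_transform method of LabelEncoder.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the original labels; the position of each label is its code
print(le.classes_)  # ['No' 'Yes']
print(encoded)      # [0 1 0 0 1]

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(encoded)
print(decoded)      # ['No' 'Yes' 'No' 'No' 'Yes']
```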


## Splitting the dataset into Training set and Test set

The train-test split is a technique for evaluating the performance of a machine learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

    Train Dataset: Used to fit the machine learning model.
    Test Dataset: Used to evaluate the fitted machine learning model.
    
Each subset consists of two parts, one is the feature matrix and the other is the target vector.

Therefore after splitting the dataset we have four sets which are:
    1. Training feature matrix, i.e X_train
    2. Test feature matrix, i.e X_test
    3. Training target vector, i.e Y_train
    4. Testing target vector, i.e Y_test
    
The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.

A common split is 80% training data and 20% test data, but the best ratio depends on the size of the dataset and on the problem.

To split the data we use the train_test_split function from the model_selection module of the sklearn library

The train_test_split function returns four datasets and here takes four arguments
    1. Feature matrix, i.e X
    2. Target vector, i.e Y
    3. Fraction of the data used for the test set, i.e test_size
    4. Seed that makes the random split reproducible, i.e random_state 

In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=0.2, random_state=1)

In [38]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [24]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [25]:
print(Y_train)

[0 1 0 0 1 1 0 1]


In [26]:
print(Y_test)

[0 1]
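
The behaviour of test_size and random_state can be checked on synthetic data. In the sketch below the arrays features and target are made up for the example; each feature row is built so that it can be matched to its target value after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features each, row k is [2k, 2k+1]
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=1)

print(f_train.shape, f_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state gives the same split
f_train_again, _, _, _ = train_test_split(
    features, target, test_size=0.2, random_state=1)
print(np.array_equal(f_train, f_train_again))  # True

# Features and targets stay aligned after shuffling
print(np.array_equal(f_train[:, 0] // 2, t_train))  # True
```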


## Feature Scaling

Feature scaling is a method used to bring the independent variables of the feature matrix onto a comparable range.
Example- It is needed when features differ greatly in magnitude, for instance when some values lie between 100 and 200 and others between 1000 and 100000.

In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

A few advantages of normalizing the data are as follows:

    1. It usually makes training faster.
    2. It reduces the risk of getting stuck in poor local optima.
    3. It gives a better conditioned error surface.
    4. Weight decay and Bayesian optimization can be done more conveniently.
    
Feature scaling is done after splitting the dataset into training and test subsets because the methods used for scaling compute statistics, such as the mean, from the feature matrix.

If we scaled before splitting, these statistics would be computed from the training and test data together. The test data is supposed to be new data that the model only sees at prediction time, so it must not influence the scaler.

Scaling before splitting therefore leaks information from the test set into training, which makes the evaluation on the test set overly optimistic.

We scale both the training feature matrix X_train and the test feature matrix X_test.
To scale the test feature matrix we use the same scaler that we fitted on the training feature matrix, because predictions must be made on data transformed in the same way as the training data.
Using the same scaler means reusing the statistics (mean and standard deviation, or min and max, depending on the method) that were calculated from the training set.
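
The idea of reusing the training statistics can be sketched with plain numpy. The train and test arrays below are made-up values for a single feature, and the population standard deviation (numpy's default) is assumed.

```python
import numpy as np

# Hypothetical single feature, already split into training and test parts
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([50.0, 60.0])

# Statistics are computed from the training data only
train_mean = train.mean()
train_std = train.std()

# The same mean and standard deviation are applied to both parts
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled.mean())  # approximately 0
print(test_scaled)          # test values expressed in training units
```

The scaled test values do not have mean 0 themselves, and that is expected: they are measured against the training distribution.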

Feature scaling is not applied to the dummy features created while encoding the categorical data. They already take only the values 0 and 1, and scaling them would lose their interpretation as category indicators without much benefit.


There are two methods for scaling the feature matrix:

1. Normalization:- Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

Formula is,

$$X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

where

    Xnorm is the normalized value, which ranges between 0 and 1
    X is the feature
    min(X) is the minimum of the feature or column
    max(X) is the maximum of the feature or column

2. Standardization:- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Formula is,

$$X_{standard} = \frac{X - \text{mean}(X)}{\text{std}(X)}$$

where

    Xstandard is the standardized value; for roughly Gaussian data most values fall between -3 and 3
    X is the feature
    mean(X) is the mean of the feature or column
    std(X) is the standard deviation of the feature or column
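
Both formulas can be checked on a small made-up array. The values of x below are an assumption for the example.

```python
import numpy as np

x = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Min-Max normalization: (x - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_norm)             # [0.   0.25 0.5  0.75 1.  ]
print(x_standard.mean())  # approximately 0
print(x_standard.std())   # approximately 1
```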
    

Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks.

Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range. So outliers are not squeezed into a fixed interval together with the other values, although they still influence the mean and standard deviation used for scaling.
        
In most cases Standardization is preferred.

To scale the feature matrix we use the StandardScaler class of the preprocessing module of the sklearn library.

1. First we create an instance of the StandardScaler class.
2. Then we scale the selected features of X_train, the training feature matrix, excluding the dummy features, using the fit_transform method.
3. Now we update the X_train feature matrix with the scaled values.
4. Now we scale X_test, the test feature matrix, using the same scaler object. We therefore call only the transform method and not fit_transform, which would refit the scaler on the test data.

In [40]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
scaled_feature_train = sc.fit_transform(X_train[:, 3:5])
X_train[:, 3:5] = scaled_feature_train

scaled_feature_test = sc.transform(X_test[:, 3:5])
X_test[:, 3:5] = scaled_feature_test

In [41]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [42]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
