## Machine Learning Process

1.   Data Pre-Processing
2.   Modelling
3.   Evaluation


## Feature Scaling

*Important*: feature scaling is applied per column (per feature).

1.   Normalization - subtract the column minimum from every value in the column, then divide by the column range (max - min). The resulting values lie in [0, 1].


> X' = (X - xmin)/(xmax - xmin)


2. Standardization - similar to normalization, but subtract the column average and divide by the column standard deviation. Most values end up roughly within [-3, +3], although the range is not strictly bounded.


> X' = (X - avg)/std
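
To make the two formulas concrete, here is a minimal NumPy sketch. The `ages` array is a made-up column used only for illustration, and `std()` here is the population standard deviation (NumPy's default), which is an assumption about which variant is meant.

```python
import numpy as np

# Hypothetical column of ages, used only for illustration
ages = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# Normalization (min-max): rescales the column to the range [0, 1]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
# np.std defaults to the population standard deviation (ddof=0)
standardized = (ages - ages.mean()) / ages.std()

print(normalized)
print(standardized)
```

After normalization the smallest value maps to 0 and the largest to 1. After standardization the column has mean 0 and standard deviation 1.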

## Data Pre-Processing Tools

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



## Importing Dataset

In [2]:
dataset = pd.read_csv('../datasets/Data.csv')
# iloc selects by integer position: [rows, columns]
# X = every column except the last (features), y = the last column (outcome)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

## Handling Missing Data

In [3]:
# replace each missing value with the mean of its column
from sklearn.impute import SimpleImputer

# note to self: learn the pandas way too, but remember this scikit-learn way
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# columns 1 and 2 (age, salary) are the numeric ones; the slice 1:3 excludes index 3
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
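
For comparison, the same mean imputation can be sketched directly in pandas. This is only an illustration on a small made-up frame: the column names `Country`, `Age`, and `Salary` are assumptions, since the headers of `Data.csv` are not shown here.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; column names are assumed for illustration
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

numeric_cols = ["Age", "Salary"]

# mean() skips NaN by default; fillna matches each column to its own mean by name
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(df)
```

One design difference worth noting: `SimpleImputer` stores the learned means, so the same values can later be applied to new data, while this pandas one-liner recomputes them from whatever frame it is given.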


## One Hot Encoding

One-hot encoding turns each category (mostly strings) into a binary vector, so that the model does not read a numeric order or magnitude into the categories and infer a false relationship between them and the outcome.

Also convert the yes/no outcome to 1/0.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [5]:
# transformers: list of (name, transformer, columns) tuples
# remainder='passthrough' keeps the untransformed columns instead of dropping them
# [0] because we are encoding the country column
# Encoding the independent variable
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# make sure the result is stored as a NumPy array
X = np.array(ct.fit_transform(X))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [7]:
# Label Encoder for dependent variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)
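
As a rough cross-check of what the two encoders do, the same idea can be sketched in plain pandas. The mini-dataset and the column names `Country` and `Purchased` below are assumptions made for illustration, not taken from `Data.csv`.

```python
import pandas as pd

# Hypothetical mini-dataset; column names are assumed for illustration
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# One binary column per category; columns come out in sorted order
one_hot = pd.get_dummies(df["Country"], dtype=int)

# Map the yes/no outcome to 1/0 with an explicit dictionary
target = df["Purchased"].map({"No": 0, "Yes": 1})

print(one_hot)
print(target.tolist())
```

Each row has exactly one 1, in the column of its country, which matches the column order (France, Germany, Spain) seen in the `OneHotEncoder` output above.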

## Train Test Split

### Feature Scaling Before vs. After Splitting the Dataset

Feature scaling must be done after splitting the dataset into the training set and the test set.

Why?
The test set is supposed to be brand-new data. Scaling before the split would compute the mean and standard deviation over all rows, so information from the test set would end up in the training set, where it should not be.
**This is to prevent DATA LEAKAGE**


In [11]:
from sklearn.model_selection import train_test_split
# random_state fixes the random seed, so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1)
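
The scale-after-split rule can be sketched as follows. This is a minimal illustration, not the notebook's actual next cell: the arrays `X_num` and `y_demo` are made-up numeric data, and the names `X_tr`, `X_te`, and `scaler` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (age, salary) and labels, for illustration only
X_num = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0],
                  [38.0, 61000.0], [35.0, 58000.0], [48.0, 79000.0],
                  [50.0, 83000.0], [37.0, 67000.0], [40.0, 63000.0],
                  [29.0, 52000.0]])
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_num, y_demo, test_size=0.2, random_state=1)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only
X_tr_scaled = scaler.fit_transform(X_tr)
# transform reuses those training statistics on the test rows
X_te_scaled = scaler.transform(X_te)

print(scaler.mean_)
print(X_te_scaled)
```

Calling `fit` only on the training rows is what keeps the test set unseen: the test rows are scaled with statistics they did not contribute to.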