<a href="https://colab.research.google.com/github/Drumstick42/MachineLearningAZ/blob/main/DataPreprocessing/data_preprocessing_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

## Importing the libraries

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [7]:
# pandas loads the CSV into a DataFrame, similar to a data frame in R.
dataset = pd.read_csv('Data.csv')

# In any ML dataset, you have the features, and the dependent variable vector.
# Features are the data with which we are trying to do the prediction (independent variable)
# We want features in the first columns, and the dependent variable in the last column

# In Data.csv, the features are in the first 3 columns.
# NOTE: Slice ranges in Python exclude the upper bound.
# iloc selects rows/columns by integer position. `:` takes every row; `:-1` takes every column except the last.
features = dataset.iloc[:, :-1].values
dependent = dataset.iloc[:, -1].values
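
A minimal sketch of how the two `iloc` slices above behave, using a small made-up DataFrame in place of Data.csv. The column names and the name `toy` are assumptions for illustration; the real file's headers are not shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv (column names are assumed).
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

toy_features = toy.iloc[:, :-1].values   # all rows, every column except the last
toy_dependent = toy.iloc[:, -1].values   # all rows, only the last column

print(toy_features.shape)   # (3, 3)
print(toy_dependent.shape)  # (3,)
```

Because `:-1` stops before the last column, the same two lines keep working if more feature columns are added, as long as the dependent variable stays last.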

In [8]:
print(features)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [9]:
print(dependent)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

Options for missing data:

  1.) Ignore that observation (okay if not a lot of missing data)
  
  2.) Replace the missing value with the average of the other values in the same column.
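
The two options can be compared on a toy table. This is only a sketch with made-up numbers, using plain pandas (`dropna` and `fillna`) rather than the scikit-learn imputer used in the next cell.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Option 1: drop every observation (row) that has a missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its own column.
filled = toy.fillna(toy.mean())

print(len(dropped))  # 2 of the 4 rows survive
print(filled)
```

On a table this small, option 1 throws away half of the observations, which is why it is only reasonable when little data is missing.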



In [10]:
# scikit-learn provides SimpleImputer, which can replace missing values with the column mean.
from sklearn.impute import SimpleImputer
# pandas represents the missing entries as np.nan
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') # Creates an instance of SimpleImputer.
imputer.fit(features[:, 1:3]) # Computes the mean of each numeric column (columns 1 and 2)
features[:, 1:3] = imputer.transform(features[:, 1:3]) # Replaces each nan with the column mean computed by fit. The transform *could* act on a different matrix.

In [11]:
print(features)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
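
The imputer cell notes that `transform` could act on a different matrix than the one passed to `fit`. A small sketch of that, with made-up arrays named `seen` and `new` (both hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# `seen` is the data the imputer learns from; `new` is a different matrix with gaps.
seen = np.array([[20.0, 1000.0],
                 [40.0, 3000.0]])
new = np.array([[np.nan, 5000.0],
                [30.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(seen)                 # learns one mean per column: [30.0, 2000.0]
filled = imputer.transform(new)   # fills the gaps in `new` with the means from `seen`

print(imputer.statistics_)
print(filled)
```

The values that fill `new` come from `seen`, not from `new` itself. The same separation matters later, when statistics learned on the training set are applied to the test set.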


## Encoding categorical data

We could encode each country as a number. The problem is that the numbers could be misinterpreted: a numerical encoding implies that the countries have an order, when they clearly don't.
The better way is to create one column per country. Each column holds a 0 or 1 that denotes whether or not a particular customer is from that country. This is called one-hot encoding.
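
As a rough illustration of one-hot encoding on its own, before it is wired into a `ColumnTransformer`, the sketch below encodes a made-up country column. The array `countries` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of made-up country labels (OneHotEncoder expects a 2D input).
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories are sorted: France, Germany, Spain
print(one_hot)
```

Each row contains exactly one 1, and the column order follows the sorted category names, which is why France maps to `1 0 0` and Spain to `0 0 1`.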

### Encoding the Independent Variable

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

columnTransformer = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
features = np.array(columnTransformer.fit_transform(features)) # unlike above, performs fit and transform in same step


In [15]:
print(features)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [17]:
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
dependent = labelEncoder.fit_transform(dependent)


In [23]:
# Sidebar: what if we wanted Yes to be 0 and No to be 1?
# A first guess is that LabelEncoder just counts up from the first label it finds.
labelEncoder2 = LabelEncoder()
print(labelEncoder2.fit_transform(["Yes", "No"])) # Yes is still 1. Why?
# LabelEncoder sorts the unique labels before numbering them, so 'No' gets 0 and 'Yes' gets 1.

[1 0]
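
The sorted order can be confirmed through the encoder's `classes_` attribute. The sketch below also shows one possible way to get the opposite mapping, using a plain dictionary; `custom_mapping` is a hypothetical name, not a scikit-learn feature.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "No", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)
print(encoder.classes_)  # ['No' 'Yes'] -> position in this sorted array is the code
print(codes)             # [1 0 1]

# Hypothetical workaround: choose the codes yourself with a dictionary.
custom_mapping = {"Yes": 0, "No": 1}
custom_codes = np.array([custom_mapping[label] for label in labels])
print(custom_codes)      # [0 1 0]
```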


In [18]:
print(dependent)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

A common question is whether feature scaling should be done before or after splitting. The correct answer is after.
Training set - train the model on existing observations
Test set - test the model on future observations (we obviously already have them, but we treat them like future data.)

This is why we do feature scaling after the data sets are split. We need to treat the test set as if it were future data. If we apply scaling before the split, the test set influences the mean and standard deviation used for scaling. This is "information leakage" from the test set: the scaled training data would contain information that it shouldn't.

80/20 is a common ratio for the training/test split.
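
To make the leakage point concrete, the sketch below fits one scaler on all the data and another on the training rows only, then compares the means they learned. The single made-up feature `values` and the scaler names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One made-up feature with 10 observations: 0, 1, 4, 9, ..., 81
values = (np.arange(10, dtype=float) ** 2).reshape(-1, 1)

train, test = train_test_split(values, test_size=0.2, random_state=1)
print(len(train), len(test))  # 8 2

leaky_scaler = StandardScaler().fit(values)  # scaling before the split: sees the test rows
clean_scaler = StandardScaler().fit(train)   # scaling after the split: sees training rows only

print(leaky_scaler.mean_, clean_scaler.mean_)
```

The two means differ, so a model trained on data scaled with `leaky_scaler` would have been shaped, slightly, by observations it is supposed to have never seen.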


In [34]:
from sklearn.model_selection import train_test_split
xTr, xTst, yTr, yTst = train_test_split(features, dependent, test_size = 0.2, random_state = 1)

In [25]:
print(xTr)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [26]:
print(xTst)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [27]:
print(yTr)

[0 1 0 0 1 1 0 1]


In [28]:
print(yTst)

[0 1]


## Feature Scaling

Scale features to make sure that no feature dominates another, and that all features are taken into account by the algorithm. Scaling is not always needed (for example, multiple regression has a coefficient for each feature that implicitly compensates for its scale).

Standardization: x_stand = (x - mean(x)) / stddev(x), which puts most values between about -3 and +3

Normalization: x_norm = (x - min(x)) / (max(x) - min(x)), which puts all values between 0 and 1
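
Both formulas can be written directly in NumPy. This is a minimal sketch on a made-up array `x`. Note that `x.std()` is the population standard deviation (divides by n), which is also what `StandardScaler` uses.

```python
import numpy as np

x = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Normalization: subtract the minimum, divide by the range.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.mean(), standardized.std())  # approximately 0 and 1
print(normalized.min(), normalized.max())       # 0.0 1.0
```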

Normalization is recommended when most features follow a normal distribution. Standardization works well in all cases, so we're just going to use that.

NOTE: WE USE THE MEAN AND STDDEV FROM THE TRAINING SET TO SCALE THE TEST SET.
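
A short sketch of what that note means in practice: after `fit` on the training rows, `transform` on the test rows should match a manual calculation that uses the training mean and standard deviation. The arrays `train` and `test` are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [38.0], [44.0], [50.0]])
test = np.array([[30.0], [37.0]])

scaler = StandardScaler()
scaler.fit(train)                     # learns mean and stddev from the training rows only
scaled_test = scaler.transform(test)  # reuses those training statistics on the test rows

# Manual version: test values, but training mean and training stddev.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaled_test, manual))  # True
```

Calling `fit_transform` on the test set instead would compute a fresh mean and standard deviation from the test rows, which is exactly what the note warns against.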

In [38]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# No need to standardize the dummy variables (the columns created from the country labels). They only take the values 0 and 1, which already lie between -3 and +3.
xTr[:, 3:] = scaler.fit_transform(xTr[:, 3:])
xTst[:, 3:] = scaler.transform(xTst[:, 3:])



In [39]:
print(xTr)

[[0.0 0.0 1.0 -0.1915918438457856 -1.0781259408412427]
 [0.0 1.0 0.0 -0.014117293757057902 -0.07013167641635401]
 [1.0 0.0 0.0 0.5667085065333239 0.6335624327104546]
 [0.0 0.0 1.0 -0.3045301939022488 -0.30786617274297895]
 [0.0 0.0 1.0 -1.901801144700799 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237056 1.2326533634535488]
 [0.0 1.0 0.0 1.4379472069688966 1.5749910381638883]
 [1.0 0.0 0.0 -0.7401495441200352 -0.5646194287757336]]


In [40]:
print(xTst)

[[0.0 1.0 0.0 -1.4661817944830127 -0.9069571034860731]
 [1.0 0.0 0.0 -0.44973664397484425 0.20564033932253029]]
