# Machine Learning A-Z: Part 1 - Data Preprocessing

This section of the Machine Learning A-Z course covers how to preprocess data to prepare it for analysis.

## Step 1. Import the required libraries

In [1]:
import numpy as np # Libraries for fast linear algebra and array manipulation
import pandas as pd # Import and manage datasets
from plotly import __version__ as py__version__
import plotly.express as px # Library for plotting data
from sklearn import __version__ as skl__version__
from sklearn.impute import SimpleImputer # Library to impute replacement values for missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # Libraries to do encoding of categorical variables
from sklearn.compose import ColumnTransformer # Library to transform only certain columns/features at a time
from sklearn.model_selection import train_test_split # Library to split data into training and test sets.
from sklearn.preprocessing import StandardScaler # Library to do feature scaling

Library versions used in this code:

In [2]:
print('Numpy: ' + np.__version__)
print('Pandas: ' + pd.__version__)
print('Plotly: ' + py__version__)
print('Scikit-learn: ' + skl__version__)

Numpy: 1.16.4
Pandas: 0.25.1
Plotly: 4.0.0
Scikit-learn: 0.21.2


## Step 2. Load the Dataset

In [3]:
def LoadData():
    dataset = pd.read_csv('Data.csv')
    return dataset

dataset = LoadData()
print(dataset.head(3))
print()
print(dataset.info())

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes
None


We can see that the dataset contains 4 columns:

* Country - String
* Age - Float
* Salary - Float
* Purchased - String

This file appears to represent data about a company's customers, including their country, age, salary, and whether or not they bought the product.

Additionally, we can see that we have 10 records, but one value is missing in each of the Age and Salary columns (9 non-null values out of 10). We'll deal with that in Step 4.
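
To locate the gaps rather than just count the non-null entries, a quick check along these lines can help. This is only a sketch: the frame below is rebuilt by hand from the values printed in this notebook's outputs (an assumption, since `Data.csv` itself is not reproduced here), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Frame rebuilt by hand from the values printed in this notebook
# (assumption: the real Data.csv holds the same 10 rows).
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Number of missing values in each column
missing_per_column = dataset.isna().sum()
print(missing_per_column)

# The rows that contain at least one missing value
rows_with_missing = dataset[dataset.isna().any(axis=1)]
print(rows_with_missing)
```

`isna().sum()` gives the count per column, while the boolean mask from `isna().any(axis=1)` shows exactly which records are incomplete.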

## Step 3. Split the data into features and outputs (independent variables and dependent variables)

In [4]:
X = dataset.iloc[:,:-1].values # All the columns except the last are features
y = dataset.iloc[:,-1].values # The last column is the dependent variable

## Step 4. Handle Missing Data
As we noted earlier, we are missing data in both the age and salary variables.

One common way to handle missing data is to replace each missing value with the mean of its column. We can easily do this with the SimpleImputer class from scikit-learn.

*__Question:__ What do you do for a categorical variable?* The `strategy` options on SimpleImputer include:
* mean - Numeric only
* median - Numeric only
* most_frequent - The mode, which is the closest thing to an average for categorical data. It is likely to be more reliable on large datasets.
* constant - Fill with a value of your choosing (set through `fill_value`)
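
As a rough illustration of the two strategies that apply to categorical data, the sketch below imputes a made-up country column. The array and the `'Unknown'` placeholder are hypothetical and are not taken from `Data.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry (not from Data.csv)
countries = np.array([['France'], ['Spain'], [np.nan], ['France']], dtype=object)

# Replace the gap with the most common category
mode_imputer = SimpleImputer(strategy='most_frequent')
filled_mode = mode_imputer.fit_transform(countries)

# Replace the gap with an explicit placeholder value
constant_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
filled_constant = constant_imputer.fit_transform(countries)

print(filled_mode.ravel())
print(filled_constant.ravel())
```

Using a placeholder keeps "missing" visible as its own category, which may be preferable when the fact that a value is absent could itself be informative.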

*__Question:__ Doesn't filling the empty spots with mean values before splitting test and train datasets give the training set some information about the test set?*

*__Answer:__ Yes! This is called data snooping bias (a form of data leakage). In this case our dataset is so small that we don't have much choice.*
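
On a larger dataset, one way to avoid this leak is to split first and fit the imputer on the training rows only. The following is a minimal sketch of that ordering, not the approach used in this notebook; it uses the Age and Salary values printed in the outputs, with the two gaps written as NaN, and the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Age and Salary as printed in this notebook, with the two gaps as NaN
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

# Split BEFORE imputing so the test rows cannot influence the fill values
train, test = train_test_split(numeric, test_size=0.2, random_state=42)

leak_free_imputer = SimpleImputer(strategy='mean')
train_filled = leak_free_imputer.fit_transform(train)  # means come from training rows only
test_filled = leak_free_imputer.transform(test)        # test rows reuse the training means

print(leak_free_imputer.statistics_)  # the per-column means that were learned
```

The same fit-on-train, transform-on-test pattern is what Step 7 uses for feature scaling.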

In [5]:
imputer = SimpleImputer(strategy='mean') # 'mean' is the default strategy
imputer = imputer.fit(X[:, 1:3]) # Columns 1 and 2 hold Age and Salary
X[:,1:3] = imputer.transform(X[:, 1:3])

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Step 5. Encode Categorical Variables


In [6]:
labelencoder_X = LabelEncoder()
x = labelencoder_X.fit_transform(X[:,0])
print(x)

[0 2 1 2 1 0 2 0 1 0]


This code has converted our string labels into numbers. However, these numbers imply an order or hierarchy among the categories that does not actually exist. (Such an order does exist for ordinal variables, such as size: small, medium, large.) This is a problem because many machine learning algorithms will pick up on the implied order and perform poorly, since it is an artifact of how we prepared the data rather than a relationship that actually exists in the data. To fix this, we will one-hot encode this categorical variable.
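
For a variable that really is ordinal, integer codes are appropriate as long as they follow the true order. A possible sketch uses scikit-learn's `OrdinalEncoder` with an explicit category order; the size column here is hypothetical and is not part of `Data.csv`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (not part of Data.csv)
sizes = np.array([['medium'], ['small'], ['large'], ['small']], dtype=object)

# Spell out the order so that small < medium < large maps to 0 < 1 < 2.
# Without it, the categories would be ordered alphabetically.
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_codes = size_encoder.fit_transform(sizes)

print(size_codes.ravel())
```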

### One-hot Encoding

In one-hot encoding, the categorical variable is removed and a new variable (column) is added for each value of the original variable (in our example, 3 columns will be added). Each record then simply gets a one in the column that matches its value of the original variable and a zero in the other new columns. These new columns now represent whether a particular record belongs to a certain group.
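
To make the mechanics concrete, here is a hand-rolled sketch in plain NumPy. It is meant only to illustrate the idea and is not how scikit-learn implements it; the variable names are illustrative.

```python
import numpy as np

# Country values of the first four records in the dataset
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# One new column per distinct value; np.unique returns them sorted
categories = np.unique(countries)

# Compare every record against every category:
# 1.0 where they match, 0.0 everywhere else
one_hot = (countries[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Each row contains exactly one 1.0, located in the column of that record's country.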

In [7]:
columntransformer = ColumnTransformer(
    [('Country_Category', OneHotEncoder(), [0])],
    remainder = 'passthrough')

X = np.array(columntransformer.fit_transform(X))

print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


We also need to encode the dependent variable. Since it has only 2 values, a single column of zeros and ones is enough, so we only need to use a *LabelEncoder*.

In [8]:
labelencoder_Y = LabelEncoder()
y = labelencoder_Y.fit_transform(y)

print(y)

[0 1 0 0 1 1 0 1 0 1]
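
To check which label each number stands for, or to turn predictions back into readable labels later, the fitted encoder can be inspected. A small sketch using the first five Purchased values from the dataset (the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First five Purchased labels from the dataset
purchased = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

encoder = LabelEncoder()
codes = encoder.fit_transform(purchased)

# classes_ lists the labels in sorted order; the position is the code
print(encoder.classes_)

# inverse_transform maps the codes back to the original labels
print(encoder.inverse_transform(codes))
```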


## Step 6. Split the Data into Test and Training Sets

The training set is used to teach the machine learning model the relationships and patterns present in the data. The test set is used to evaluate the trained model and see how well it can perform on data it has never seen before. 

If the model performs well on both the training and test sets, we can expect it has done a good job of learning the patterns present in the data and will generalize well to new data. (This assumes that the original dataset is representative of the new data and that we have avoided all significant sources of bias.)

If the model performs well on the training data, but not on the test data, then it is likely that the model has overfit the training data. In other words, it has memorized the training set instead of learning the patterns within the data.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)
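
The effect of `test_size` and `random_state` can be checked on a stand-in array. In this sketch the feature matrix is a placeholder with 10 rows, and the labels are the encoded values printed in Step 5.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)               # placeholder features, 10 rows
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])   # encoded labels from Step 5

# test_size=0.2 on 10 rows holds out 2 rows for testing
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
# The same random_state reproduces exactly the same split
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(split_a[1], split_b[1]))
```

Fixing `random_state` makes the notebook reproducible; leaving it unset gives a different split on every run.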

## Step 7. Scale the features

Many machine learning algorithms and models rely on the __Euclidean distance__ (the square root of the sum of the squared differences) between observations. To make it easier for our models to learn and perform well, each feature should have a similar impact on the __Euclidean distance__, which means the features need to have similar scales. However, in our data the age feature is on a scale of 10s while salary is on a scale of 10,000s. The impact of even a large change in age will get lost in the impact of even a small change in salary. Essentially, we are implying to the model that salary is significantly more important than age, which can almost be ignored. That may not be true! To give age and salary equal importance, we need to rescale the features so they are on the same scale.
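
A small numeric illustration of the problem, using made-up customers (the values and the `euclidean` helper are hypothetical and exist only for this sketch):

```python
import numpy as np

def euclidean(a, b):
    """Square root of the sum of the squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

# Hypothetical customers given as (age, salary)
base = np.array([30.0, 54000.0])
much_older = np.array([50.0, 54000.0])       # 20 years older, same salary
slightly_richer = np.array([30.0, 55000.0])  # same age, 1,000 more in salary

d_age = euclidean(base, much_older)
d_salary = euclidean(base, slightly_richer)

print(d_age, d_salary)
```

A 20-year age gap produces a distance of 20, while a salary gap of only 1,000 (under 2% of the salary) produces a distance of 1000, so the unscaled distance is driven almost entirely by salary.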

In [10]:
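sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train) # Learn the mean and standard deviation from the training set only
X_test = sc_X.transform(X_test) # Apply the training statistics to the test set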

print(X_train)
print()
print(X_test)

[[ 1.         -0.57735027 -0.57735027 -0.7529426  -0.62603778]
 [ 1.         -0.57735027 -0.57735027  1.00845381  1.01304295]
 [ 1.         -0.57735027 -0.57735027  1.79129666  1.83258331]
 [-1.          1.73205081 -0.57735027 -1.73149616 -1.09434656]
 [ 1.         -0.57735027 -0.57735027 -0.36152118  0.42765698]
 [-1.          1.73205081 -0.57735027  0.22561096  0.05040824]
 [-1.         -0.57735027  1.73205081 -0.16581046 -0.27480619]
 [-1.         -0.57735027  1.73205081 -0.01359102 -1.32850095]]

[[-1.          1.73205081 -0.57735027  2.18271808  2.30089209]
 [-1.         -0.57735027  1.73205081 -2.3186283  -1.79680973]]
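
For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below checks that a manual computation agrees with `StandardScaler`, using the Age and Salary values of the first four records; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the first four records in the dataset
ages_salaries = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages_salaries)

# Manual z-score: (x - mean) / std, using the population standard
# deviation (ddof=0), which is NumPy's default
manual = (ages_salaries - ages_salaries.mean(axis=0)) / ages_salaries.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has a mean of approximately 0 and a standard deviation of 1, which puts age and salary on the same footing.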


Note that some machine learning libraries in Python do not require you to scale the features, as they do it automatically.