In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading The Data Set
Before we begin pre-processing the data set, we must load it.
- The data set we'll be using is the customers.csv file

In [12]:
# read the customers csv as a data set
customers_df = pd.read_csv("datasets/customers.csv")

customers_df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [13]:
# return a 2D Array excluding the last column
x = customers_df.iloc[:, :-1].values

x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [14]:
# return an Array of the last column (Purchased)
y = customers_df.iloc[:, -1].values

y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

### What is "x"?
The x (independent variable) is a 2D Array of the Country, Age, and Salary columns.
- Each column is an independent variable
- Each row is a sample

### What is "y"?
The y (dependent variable) is an Array of the Purchased column.

### So What's The Machine Learning (ML) Problem?
The Purchased column is dependent on the Country, Age, and Salary.

Therefore, can we make predictions on whether a customer will purchase an item based on their country, age, and salary?

# Handling Missing Data
The customers data set contains some "NaN" or undefined cells, so we must handle the missing data before we begin creating machine learning models.

### There are multiple ways to fix missing data:
1. Remove the rows (or columns) that contain missing data (not recommended: it throws away information the model could learn from)  
2. Replace each missing cell with the mean of its column (recommended)
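
To make the trade-off concrete, here is a minimal sketch on a small made-up frame (the `demo_df` values are hypothetical, not the customers data). It drops incomplete rows with pandas `dropna` and, separately, fills gaps with the column mean; `dropna(axis=1)` would drop columns instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
demo_df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 40.0],
    "Salary": [72000.0, 48000.0, 52000.0, np.nan],
})

# Option 1: drop every row that has at least one missing value
dropped = demo_df.dropna()

# Option 2: replace each missing value with the mean of its column
filled = demo_df.fillna(demo_df.mean())

print(len(demo_df), "rows originally,", len(dropped), "rows after dropping")
print(filled)
```

Dropping removes half of this tiny data set, while mean filling keeps every row.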

In [15]:
# import sklearn's SimpleImputer, a class that handles missing data
from sklearn.impute import SimpleImputer

In [16]:
# create an Imputer that replaces NaN cells with its column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit the imputer on, then transform, the age (col 1) and salary (col 2) columns
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### Fit
The fit step analyses the data passed to the object and learns the statistics that object needs (for example, the column means for an imputer, or the mean and standard deviation for a scaler).

### Transform
Once the object has learned those statistics through the fit method, the transform step uses them to change the data (for example, filling in missing cells or applying feature scaling).
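
As a rough illustration of the two steps, the sketch below calls `fit` and `transform` separately on a `SimpleImputer`. The `train` and `new_rows` arrays are made up for this example; `statistics_` is the attribute where the imputer stores what it learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Age and Salary columns with one gap each
train = np.array([[44.0, 72000.0],
                  [27.0, np.nan],
                  [np.nan, 48000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit: learn one mean per column, ignoring the NaN cells
imputer.fit(train)
print(imputer.statistics_)

# transform: fill gaps in any array with the same columns using those means
new_rows = np.array([[np.nan, 50000.0]])
print(imputer.transform(new_rows))
```

`fit_transform` simply performs both steps on the same array in one call.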

# Categorical Data
The "Country" and "Purchased" columns are categorical variables.
- The Country column contains 3 categories: France, Spain, Germany
- The Purchased column contains 2 categories: Yes, No

They are considered "categorical" because each value is one of a fixed set of labels rather than a numeric quantity.

### The Problem with Categorical Variables
In machine learning, categorical variables can cause problems because their values are non-numeric, and most models can only process numbers.

A solution is to encode categorical values.

In [17]:
# import a LabelEncoder and a OneHotEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [18]:
# create a LabelEncoder
labelencoder_x = LabelEncoder()

# set each country to its encoded value in the Country column
x[:, 0] = labelencoder_x.fit_transform(x[:, 0])

# looking at the output, it's either 0, 1, or 2 based on country
x

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

### The Problem with The "Country" Column's Encoding
If we trained a machine learning model on the country encodings of 0, 1, and 2, it would treat the country encoded as 2 as having a greater value than the countries encoded as 1 and 0.

This is not the case since the values are categorical.

### A Solution to The Column's Encoding Problem
We can create dummy variables (columns) for the three countries.
- Instead of having 1 column holding the 3 countries, create 3 columns, one per country
    - Each of these 3 columns holds a 1 if the row's country matches that column and a 0 otherwise
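
The idea can be previewed with pandas before using scikit-learn. This is only a sketch on a hypothetical `countries` series; `pd.get_dummies` is a different tool from the encoder used in this notebook, but it produces the same kind of columns.

```python
import pandas as pd

# Hypothetical Country column
countries = pd.Series(["France", "Spain", "Germany", "Spain"], name="Country")

# One column per category; each row has a single 1 marking its country
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
```

The columns come out in alphabetical order (France, Germany, Spain), and no column is numerically "larger" than another.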

In [19]:
from sklearn.compose import make_column_transformer

In [20]:
"""
a transformer to encode the Country column as dummy columns
- remainder="passthrough" guarantees that after calling fit_transform
on the transformer, x still contains the other variables (Age, Salary)
and not just the transformed one (Country)
"""
dummy_transformer = make_column_transformer((OneHotEncoder(), [0]),
                                            remainder="passthrough")
"""
replace the Country column with 3 columns with values of 0 or 1
- 0 indicates that country was not present in the cell
- 1 indicates that country was present in the cell
"""
x = dummy_transformer.fit_transform(x)

x

Note: OneHotEncoder can encode string categories directly, so running a LabelEncoder on the Country column beforehand is not required in recent versions of scikit-learn.


array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [21]:
"""
simply encode the Purchased column because the column's values
are only "Yes" or "No", so it doesn't need dummy columns
"""
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
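
To see which integer stands for which label, a fitted `LabelEncoder` exposes `classes_` and can map codes back with `inverse_transform`. The `labels` array below is a made-up stand-in for the Purchased column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical Purchased values
labels = np.array(["No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# The position of a label in classes_ is its integer code
print(encoder.classes_)
print(encoded)

# Map the integer codes back to the original strings
print(encoder.inverse_transform(encoded))
```

Because the classes are sorted, "No" becomes 0 and "Yes" becomes 1, which matches the output above.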

# Split into Training and Testing Data Sets
Let's split the customers data set into training and testing sets.

Training data sets are used by the machine learning model to learn.  
Testing data sets are held back and used to check how well the trained model predicts on data it has not seen.  

In [22]:
from sklearn.model_selection import train_test_split 

In [23]:
"""
create x, y training and testing sets
- x = independent variable
- y = dependent variable
- test_size = the fraction of the data held out for testing, 0.2 (20%) in our case
- random_state = the seed for the random shuffle, so the split is reproducible
"""
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

In [24]:
# print out the x training and testing data sets
print("The x-training data set:")
print(x_train)
print("\nThe x-testing data set:")
print(x_test)

# print out the y training and testing data sets
print("\nThe y-training data set:")
print(y_train)
print("\nThe y-testing data set:")
print(y_test)

The x-training data set:
[[0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 35.0 58000.0]]

The x-testing data set:
[[0.0 1.0 0.0 30.0 54000.0]
 [0.0 1.0 0.0 50.0 83000.0]]

The y-training data set:
[1 1 1 0 1 0 0 1]

The y-testing data set:
[0 0]
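
The roles of `test_size` and `random_state` can be checked on a small hypothetical array (the `X_demo` and `y_demo` names are made up for this sketch): the same seed should give the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)      # 8 training rows, 2 testing rows
print(np.array_equal(a_test, b_test))   # same seed, same split
```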


# Feature Scaling
The columns are not within the same scale, which causes issues when comparing values in the machine learning models.

For example, the "Age" column has values from 27 to 50, while the "Salary" column has values from 48,000 to 83,000, so the "Salary" column would dominate in magnitude.

### Why would these errors occur?
Many ML models rely on the Euclidean distance (the distance formula) between samples.

If one column has a much wider range of values than another, its differences dominate the distance and the column with the smaller range has almost no influence.
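
A quick sketch of this effect, using the (Age, Salary) values of two customers from the table above. The `share` variable is just a helper for this illustration.

```python
import numpy as np

# (Age, Salary) for the customers in rows 1 and 8
a = np.array([27.0, 48000.0])
b = np.array([50.0, 83000.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))

# Fraction of the squared distance contributed by each column
share = diff ** 2 / np.sum(diff ** 2)

print(distance)
print(share)
```

Even though these two customers have the largest age gap in the data set, almost all of the distance comes from the salary column.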

### Solving The Scaling Problem
We can solve this scaling problem with either of 2 feature scaling methods.

Whichever method you choose, the variables end up on a comparable scale, so no variable dominates another when they are compared.

#### Standardization
```stand_value(x) = [x - mean(x)] / standard_deviation(x)```  

Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance)

#### Normalization
```norm_value(x) = [x - min(x)] / [max(x) - min(x)]```

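Normalization rescales the values into a range of [0, 1].
- This might be useful in some cases where all parameters need to have the same positive scale

However, normalization is sensitive to outliers: a single extreme value becomes the new min or max and squeezes the remaining values into a narrow part of the range.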
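
To see how normalization behaves when an outlier is present, here is a small sketch with scikit-learn's `MinMaxScaler`. The salary values and the 1,000,000 outlier are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salaries, then the same salaries plus one extreme value
salaries = np.array([[48000.0], [52000.0], [54000.0], [58000.0]])
with_outlier = np.vstack([salaries, [[1_000_000.0]]])

plain = MinMaxScaler().fit_transform(salaries).ravel()
squeezed = MinMaxScaler().fit_transform(with_outlier).ravel()

print(plain)     # spread across the whole [0, 1] range
print(squeezed)  # the first four values are pushed close to 0
```

In the second result the outlier takes the value 1 and the four ordinary salaries end up within roughly the bottom 1% of the range.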

#### When Scaling This 2D Array Using Standardized Scaling:
```python
arr = [[1, 500, 1250],
       [0, 750, 1000],
       [0.5, 1000, 750]]
```

#### It Becomes:
```python
arr = [[ 1.22474487, -1.22474487,  1.22474487],
       [-1.22474487,  0.        ,  0.        ],
       [ 0.        ,  1.22474487, -1.22474487]]
```

As you can see, the individual columns are now all on the same scale.
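
The numbers above can be reproduced with a short sketch that applies the standardization formula by hand and compares it with `StandardScaler`. Both use the population standard deviation, which is NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

arr = np.array([[1, 500, 1250],
                [0, 750, 1000],
                [0.5, 1000, 750]])

# Formula applied column by column: (x - mean) / standard deviation
manual = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# The same result from scikit-learn
scaled = StandardScaler().fit_transform(arr)

print(scaled)
print(np.allclose(manual, scaled))
```

After scaling, every column has a mean of 0 and a standard deviation of 1.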

In [25]:
# import a Standardization Scaler for Feature Scaling
from sklearn.preprocessing import StandardScaler

In [26]:
# create a Standardization Scaler for the x sets
sc_X = StandardScaler()

# standardize the x-training set
x_train = sc_X.fit_transform(x_train)

"""
standardize the x-testing set using the same scaler.

we call transform (not fit_transform) so the testing data
set is scaled with the mean and standard deviation that
were learned from the training data set.

therefore, the feature scaling on the testing data set
is the same as the feature scaling on the training data
set, and nothing about the testing data influences the
fitted scaler.
"""
x_test = sc_X.transform(x_test)
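
As a small check of what the scaler reuses, the sketch below fits on a hypothetical training column and then transforms a hypothetical testing column. The `age_train` and `age_test` values are made up; `mean_` and `scale_` are the attributes where `StandardScaler` stores what it learned.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age column split into training and testing parts
age_train = np.array([[40.0], [37.0], [27.0], [48.0]])
age_test = np.array([[30.0], [50.0]])

scaler = StandardScaler()

# fit_transform learns mean_ and scale_ from the training rows only
train_scaled = scaler.fit_transform(age_train)

# transform reuses those values; nothing is learned from the testing rows
test_scaled = scaler.transform(age_test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The scaled testing values do not have to average to 0, because they are measured against the training mean rather than their own.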

