# Data Preparation

In [None]:
import numpy as np
import pandas as pd

## The data

In [None]:
data = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, None, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, None, 58000, 52000, 79000, 83000, 87000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
}

df = pd.DataFrame(data)
df

In [None]:
X = df.iloc[:, :-1].values # all rows for all feature columns
y = df.iloc[:, -1].values # all rows for label column
print(X)
print('---')
print(y)

## Dealing with missing values

In [None]:
from sklearn.impute import SimpleImputer

For the second and third columns (Age and Salary), we replace each missing number with the mean value of that feature.

This example uses scikit-learn, but the same result can be achieved with plain Python, NumPy or pandas.
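As a rough sketch of the pandas-only route, the cell below fills each missing value with its column mean. The names `df_num`, `col_means` and `df_filled` are made up for this illustration and are not used elsewhere in the notebook.

```python
import pandas as pd

# Same Age and Salary values as the dataset above; None becomes NaN in pandas.
df_num = pd.DataFrame({
    'Age': [44, 27, 30, 38, 40, 35, None, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, None, 58000, 52000, 79000, 83000, 87000],
})

# DataFrame.mean() skips NaN by default, so each mean uses only the known values.
col_means = df_num.mean()

# Passing a Series to fillna fills every column with its own mean.
df_filled = df_num.fillna(col_means)
print(df_filled)
```

This works on the DataFrame before it is converted to a NumPy array, whereas the `SimpleImputer` cell below works on the array `X`.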

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

## Encoding categorical data

### Encoding the Features (Independent variable)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

Let's encode the Country column as numbers so the data is ready for the machine learning model.

We create 3 new columns, one per country. In each row the column for that row's country gets the value 1 and the other 2 columns get 0, so every country is represented in the same way.

For example, if we encoded France=0, Germany=1 and Spain=2 and gave this data to the model, it might treat Spain as more important or assume the countries have some kind of ranking. For that reason it is better and easier for models/algorithms to have all countries represented equally: value 1 for the row's country and value 0 for the other countries.
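To make the difference concrete, here is a small sketch that uses pandas instead of scikit-learn and puts the two encodings side by side. The four-row `countries` series is a made-up sample for illustration only.

```python
import pandas as pd

countries = pd.Series(['France', 'Spain', 'Germany', 'Spain'], name='Country')

# Integer codes: categories are sorted alphabetically,
# so France=0, Germany=1, Spain=2. This implies an order that does not exist.
codes = countries.astype('category').cat.codes
print(codes.tolist())

# One-hot encoding: one 0/1 column per country, exactly one 1 in each row.
one_hot = pd.get_dummies(countries, dtype=int)
print(one_hot)
```

The notebook itself uses `OneHotEncoder` inside a `ColumnTransformer`, which produces the same kind of 0/1 columns and can later be applied unchanged to new data.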

In [None]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

### Encoding the Label (Dependent variable)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

## Train and Test Splits

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

## Scaling the data

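We should scale the data after the train and test split, not before! If the scaler is fitted on the full dataset, the test rows influence the mean and standard deviation it learns, and information about the test set leaks into training.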
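The following sketch illustrates the point on a hypothetical single-feature array (`ages` is invented for this example). The scaler is fitted on the training rows only, so the mean it stores comes from those rows and not from the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, used only for illustration.
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0],
                 [35.0], [48.0], [50.0], [37.0], [41.0]])

ages_train, ages_test = train_test_split(ages, test_size=0.2, random_state=42)

# Fit on the training rows only; the test rows reuse the same statistics.
demo_scaler = StandardScaler().fit(ages_train)
ages_test_scaled = demo_scaler.transform(ages_test)

print('mean learned from the training rows:', demo_scaler.mean_[0])
print('mean of the full dataset:', ages.mean())
```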

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

Skip the first 3 columns, which hold the encoded country values. They are already either 1 or 0, so scaling them gains nothing, and it would make it harder to see which country is which once the 1's and 0's are replaced by other numbers.
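As a quick illustration, the sketch below standardizes a hypothetical one-hot column (`is_france` is a made-up name). The result still has two distinct values, but they are no longer 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-hot column: 1 where the row's country is France.
is_france = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])

# Standardizing subtracts the column mean and divides by its standard deviation.
scaled = StandardScaler().fit_transform(is_france)
print(scaled.ravel())
```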

In [None]:
X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
X_test[:, 3:] = scaler.transform(X_test[:, 3:])

In [None]:
print(X_train)

In [None]:
print(X_test)