# This workflow shows the preprocessing of data for DAY 1

It includes:
1. Loading in the data
2. Handling Missing Data
3. Encoding Categorical Data
4. Splitting the Dataset into train and test sets
5. Feature Scaling

## 1. Loading in the Data

In [193]:
#import relevant modules
import numpy as np # contains numerical functions
import pandas as pd # manages datasets

In [194]:
# specify path to the csv file in a different folder
path = '/Users/Claudia/Documents/Personal Projects/Machine Learning/100-Days-Of-ML-Code/datasets/Data.csv'

# load in the data using pandas to create a dataframe
dataset = pd.read_csv(path)
X = dataset.iloc[ : , :-1].values # all rows, every column except the last (Country, Age, Salary)
Y = dataset.iloc[ : , 3].values # all rows, fourth column (index 3): Purchased

In [195]:
dataset.sample(10) # visualise 10 random rows to review the data

Unnamed: 0,Country,Age,Salary,Purchased
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
3,Spain,38.0,61000.0,No
7,France,48.0,79000.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
2,Germany,30.0,54000.0,No
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
4,Germany,40.0,,Yes


The dataset is separated into X, the independent variables, and Y, the dependent variable.
The independent variables (X) are 'Country', 'Age' and 'Salary', whereas the dependent variable (Y) is 'Purchased'.
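
As a minimal sketch of the same separation, the features and target can also be selected by column name instead of by position, which keeps working if the column order changes. The small frame below is hypothetical; it only reuses the first three rows shown above so that the example runs without Data.csv.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv, using the first three rows of the dataset
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# Select by name rather than by position
X_demo = df[["Country", "Age", "Salary"]].values  # independent variables
Y_demo = df["Purchased"].values                   # dependent variable

print(X_demo.shape)  # (3, 3)
print(Y_demo)        # ['No' 'Yes' 'No']
```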

## 2. Handling Missing Data

In [196]:
# Check general info
dataset.info() # column dtypes and non-null counts
dataset.describe() # basic statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


Unnamed: 0,Age,Salary
count,9.0,9.0
mean,38.777778,63777.777778
std,7.693793,12265.579662
min,27.0,48000.0
25%,35.0,54000.0
50%,38.0,61000.0
75%,44.0,72000.0
max,50.0,83000.0


In [197]:
# check for missing values
dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [198]:
# import SimpleImputer from scikit-learn, which fills in missing values
from sklearn.impute import SimpleImputer


In [199]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [200]:
# impute missing values with the mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # Initialize with SimpleImputer

imputer = imputer.fit(X[ : , 1:3]) # fit the imputer to the columns that have missing values, so it can calculate the average.

X[ : , 1:3] = imputer.transform(X[ : , 1:3]) # transform fills the missing values in Age and Salary (column indices 1 and 2) with each column's mean.

In [201]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
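
To check where the filled-in values come from, the sketch below recomputes the column means by hand with `np.nanmean` and compares them with what `SimpleImputer` writes into the gaps. The array repeats the Age and Salary values shown above; the variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset above, including the two missing entries
ages_salaries = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# Mean of each column, ignoring NaN (roughly 38.78 for Age and 63777.78 for Salary)
column_means = np.nanmean(ages_salaries, axis=0)

# SimpleImputer computes the same means in fit and writes them into the gaps in transform
filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(ages_salaries)

print(column_means)
print(filled[6, 0], filled[4, 1])  # the two values that were missing
```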

## 3. Encoding Categorical Data

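LabelEncoder: This tool converts text labels into integers. It turns categories like "Red," "Green," and "Blue" into numbers like 0, 1, and 2. Note that these integers imply an ordering (0 < 1 < 2) that the categories do not have, which is why dummy variables are created in the next step.

OneHotEncoder: This tool takes a categorical column (either the integers produced by LabelEncoder or, in recent scikit-learn versions, the original strings) and turns it into dummy variables, meaning multiple columns of 1s and 0s, one for each category.

ColumnTransformer is used alongside OneHotEncoder so that the encoding is applied only to the selected columns.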
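
The difference between the two encoders can be sketched on a tiny, made-up list of countries. The names `countries`, `codes` and `dummies` are illustrative, and `.toarray()` is used because OneHotEncoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Spain", "Germany", "Spain"])

# LabelEncoder: one integer per category, assigned in sorted order
codes = LabelEncoder().fit_transform(countries)
print(codes)  # [0 2 1 2]

# OneHotEncoder: one 0/1 column per category; it expects a 2-D input
encoder = OneHotEncoder()
dummies = encoder.fit_transform(countries.reshape(-1, 1)).toarray()
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(dummies)
```

Each row of `dummies` contains a single 1, in the column of that row's country.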

In [202]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder() # initialise an instance of the class LabelEncoder

In [203]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [204]:
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0]) # transforms the first column from labels into numerical values

In [205]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

Create dummy variables: one new column of 0s and 1s for each category of the categorical variable (here France, Germany and Spain), replacing the single column of codes 0, 1, 2.

In [206]:
from sklearn.compose import ColumnTransformer

Using ColumnTransformer:

The ColumnTransformer lets you specify which columns to transform and which to leave unchanged. In this example, OneHotEncoder is applied to the first column ([0]) while the other columns are left untouched (remainder='passthrough').

The OneHotEncoder:

You create an instance of OneHotEncoder() without the categorical_features argument, which older scikit-learn versions used to select columns and which has since been removed. The column selection is now handled by the ColumnTransformer, which organizes the transformations for you.

In [207]:
column_transformer = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), [0])  # Use OneHotEncoder on the first column (index 0)
    ],
    remainder='passthrough'  # Keep the rest of the columns unchanged
)

X = column_transformer.fit_transform(X)

In [208]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
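
As a hedged alternative to the two-step approach above, the sketch below applies OneHotEncoder directly to the string column through a ColumnTransformer, skipping the LabelEncoder step. It assumes a recent scikit-learn version; the small DataFrame is hypothetical and reuses the first three rows of the dataset, and `sparse_threshold=0` is set only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the feature columns of Data.csv
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
})

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["Country"])],  # encode Country only
    remainder="passthrough",   # keep Age and Salary unchanged
    sparse_threshold=0,        # always return a dense array
)
encoded = ct.fit_transform(df)

print(encoded.shape)  # (3, 5): three dummy columns, then Age and Salary
print(encoded[0])     # France row: dummies 1, 0, 0 followed by 44 and 72000
```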

## 4. Split dataset into train and test sets

Why split the data?

Training set (X_train, Y_train): This part of the data is used to train the machine learning model. The model learns the relationships between the features (X_train) and the target variable (Y_train).

Testing set (X_test, Y_test): After training the model, we use this part to evaluate how well the model performs. This helps us check if the model can make accurate predictions on new, unseen data.

Testing on a separate set is crucial for detecting overfitting, which occurs when a model learns too much detail from the training data and fails to generalize to new data.

Setting test_size = 0.2 holds out 20% of the rows for testing and leaves 80% for training; with 10 rows that is 8 training rows and 2 test rows. Setting random_state = 0 fixes the shuffle so the split is reproducible.

In [209]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
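
A quick way to confirm the proportions is to split a made-up array with the same number of rows and print the resulting shapes. The arrays below are hypothetical placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical rows, 2 features
y_demo = np.arange(10)                 # 10 hypothetical targets

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every row ends up in exactly one of the two sets, so the test rows are never seen during training.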

## 5. Feature Scaling

Standardization (or Z-score normalization) transforms each feature so that it has a mean of 0 and a standard deviation of 1. This is particularly important for algorithms that rely on distance calculations (like KNN and SVM) and for models trained with gradient descent, for two reasons:

It ensures that all features contribute equally to the distance computation, preventing features with larger ranges (such as Salary compared with Age) from dominating the results.

It helps gradient descent converge, for example when fitting linear regression or logistic regression iteratively, leading to faster training.
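
The transformation itself is z = (x - mean) / std for each feature. The sketch below applies that formula by hand to a few made-up ages and checks that it matches StandardScaler, which uses the population standard deviation (the NumPy default, ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[27.0], [30.0], [35.0], [38.0], [44.0]])  # hypothetical single feature

# Manual z-score: subtract the column mean, divide by the population std
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```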

In [210]:
X_train

array([[0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [211]:
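labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y) # encode 'No'/'Yes' as 0/1
# Y_train and Y_test were split off before this cell, so encode them as well
Y_train, Y_test = labelencoder_Y.transform(Y_train), labelencoder_Y.transform(Y_test)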

In [212]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler() # instance of the class

X_train = sc_X.fit_transform(X_train) # fit calculates the mean and standard deviation of the features and transform applies the standardization to the training set
X_test = sc_X.transform(X_test)

In [213]:
X_train

array([[-1.        ,  2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [ 1.        , -0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-1.        , -0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-1.        , -0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [ 1.        , -0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-1.        , -0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [ 1.        , -0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [ 1.        , -0.37796447, -0.77459667, -0.59788085, -0.48214934]])

You should only transform the testing data using the parameters (mean and standard deviation) obtained from the training data. 

Fitting the scaler on the testing data can lead to data leakage. This means that the model has had access to the test data during training, which can result in overly optimistic performance estimates. The testing data should only be used to evaluate the model after it has been trained.
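
The rule can be illustrated with a small sketch: the scaler is fitted on the training rows only, and the test rows are transformed with the stored training statistics. The ages below are hypothetical and chosen so that the training mean is easy to verify.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [35.0], [38.0], [44.0]])  # hypothetical training ages
test = np.array([[50.0], [30.0]])                   # hypothetical test ages

scaler = StandardScaler().fit(train)    # mean and std come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # reuses the training mean and std

print(scaler.mean_)  # [36.]
print(test_scaled)   # test ages expressed relative to the training distribution
```

Calling `fit` or `fit_transform` on `test` instead would replace the stored statistics with ones computed from the test rows, which is the leakage described above.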