# Data Preprocessing for ML - Test Python Model 1

This project walks through the main steps of data preprocessing: handling missing data, encoding categorical variables, splitting the data into training and test sets, and scaling the features, so that the data is ready for machine learning models.

## Importing the required libraries

In [2]:
# Data Preprocessing 
# Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler



## Importing the dataset

In [3]:
# Importing the dataset
data = pd.read_csv("C:/Users/aleksandar.dimitrov/Desktop/Python Tests/Data/CategoricalData.csv")

In [4]:
data

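|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |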
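
The CSV path above points to a local file, so that cell cannot be run elsewhere as-is. As a stand-in, the sketch below rebuilds the ten rows displayed above directly in pandas. The inline dictionary is an assumption made for reproducibility, not the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical inline copy of the ten rows displayed above (not the original CSV).
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Count the missing values per column before imputing anything.
print(data.isna().sum())
```

Age and Salary each contain one missing value, which is what the imputation step has to fill.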


## Handling Missing Data with Imputation

In [5]:
# Creating the matrix of features (independent variables): every column except the last
X = data.iloc[:, :-1].values
print(X)


[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [6]:
# Creating the dependent variable vector (the Purchased column)
Y = data.iloc[:, 3].values
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


In [7]:
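# Fill the missing Age and Salary values with the most frequent value of each column.
# Every value in these columns occurs only once, so SimpleImputer falls back to
# the smallest value (27.0 for Age, 48000.0 for Salary).
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')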
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)



[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 48000.0]
 ['France' 35.0 58000.0]
 ['Spain' 27.0 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
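
With most_frequent, a column in which every value occurs exactly once is filled with its smallest value, which is why the gaps above became 27.0 and 48000.0. The sketch below compares that with the mean strategy on the Age column alone. The ages array is copied from the dataset above, and the mean strategy is shown only for contrast; it is not what this notebook uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age column from the dataset above; row 6 is missing.
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0],
                 [35.0], [np.nan], [48.0], [50.0], [37.0]])

# Strategy used in this notebook: ties are resolved to the smallest value.
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(ages)

# Alternative for contrast: fill with the mean of the observed values.
mean = SimpleImputer(strategy="mean").fit_transform(ages)

print("most_frequent fill:", most_frequent[6, 0])
print("mean fill:", round(mean[6, 0], 2))
```

On a small numeric column with no repeated values, the mean (or median) is often a less arbitrary fill than the column minimum.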


## Encoding the categorical data

Some machine learning algorithms interpret integer labels as ordinal values and assume an order or magnitude between them. For example, if the countries are encoded as 0, 1, and 2, a model may treat the country labelled 2 as "greater than" the country labelled 1, even though the categories have no natural order. This can distort the model's predictions. For nominal features such as Country, OneHotEncoder is therefore usually preferable to LabelEncoder, depending on the nature of your data and the requirements of your machine learning algorithm.
For demonstration purposes, we first apply LabelEncoder to the Country column and then one-hot encode the result. LabelEncoder alone is used for the binary target, Purchased.

In [8]:
# Label encoding of categorical feature
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X[:, 0])




[0 2 1 2 1 0 2 0 1 0]
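
To see which integer stands for which country, the fitted encoder's classes_ attribute can be inspected. The sketch below is a minimal standalone version, assuming the Country column is passed as a plain list; the mapping dictionary is built here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany",
             "France", "Spain", "France", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ is sorted; a label's position in it is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

The codes follow alphabetical order of the labels, which is exactly the kind of arbitrary ordering a model could misread as meaningful.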


In [9]:
# Dummy Encoding (One-Hot Encoding) using ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
print(X)



[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 48000.0]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 27.0 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
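
OneHotEncoder can also work on the raw string column directly, so the LabelEncoder step is not strictly required before it. The sketch below assumes scikit-learn 0.20 or later and uses a four-row excerpt of the data. Setting sparse_threshold=0 is a choice made here only to force a dense result for printing. The fitted categories_ attribute shows which output column corresponds to which country.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Four-row excerpt of the feature matrix, with Country still as text.
X = np.array([["France", 44.0, 72000.0],
              ["Spain", 27.0, 48000.0],
              ["Germany", 30.0, 54000.0],
              ["Spain", 38.0, 61000.0]], dtype=object)

ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])],
                       remainder="passthrough",
                       sparse_threshold=0)  # always return a dense array
X_encoded = ct.fit_transform(X)

# One output column per category, in sorted order, followed by Age and Salary.
categories = ct.named_transformers_["encoder"].categories_[0].tolist()
print(categories)
print(X_encoded)
```

The first three output columns are therefore France, Germany, and Spain, in that order, and the passthrough columns (Age, Salary) are appended after them.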


In [10]:
# Encoding the dependent variable (No -> 0, Yes -> 1)
labelencoder_y = LabelEncoder()
Y = labelencoder_y.fit_transform(Y)
print(Y)

[0 1 0 0 1 1 0 1 0 1]
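
Once a model predicts 0 or 1, the same encoder can translate the codes back into the original labels. The sketch below assumes the Purchased column as a plain list, and the codes passed to inverse_transform are made-up stand-ins for model predictions.

```python
from sklearn.preprocessing import LabelEncoder

purchased = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

encoder = LabelEncoder()
y = encoder.fit_transform(purchased)

# Map integer codes (for example, model predictions) back to the original labels.
decoded = encoder.inverse_transform([0, 1, 1])
print(y.tolist())
print(decoded.tolist())
```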


Next, we split the data into a training set and a test set.

Training set:

1. The training set consists of 8 instances, one per customer, each with its features (attributes) and its label (target value).
2. The feature matrix, X_train, has shape (8, 5): 8 instances and 5 features (three one-hot country columns, Age, and Salary).
3. The label vector, Y_train, has shape (8,): 8 labels.

Test set:

1. The test set contains 2 instances with their features and labels.
2. The feature matrix, X_test, has shape (2, 5): 2 instances and 5 features.
3. The label vector, Y_test, has shape (2,): 2 labels.
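
These sizes follow from using test_size=0.2 on 10 rows. For a fractional test_size, train_test_split rounds the test count up and gives the remaining rows to the training set. The helper split_sizes below is a hypothetical illustration of that arithmetic, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: row counts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)   # test count is rounded up
    n_train = n_samples - n_test                # remainder goes to training
    return n_train, n_test

print(split_sizes(10, 0.2))
```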


## Splitting the dataset into the training set and test set

In [11]:
# Splitting the dataset into the training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("Train set shapes:")
print("X_train:", X_train.shape)
print("y_train:", Y_train.shape)

print("Test set shapes:")
print("X_test:", X_test.shape)
print("y_test:", Y_test.shape)


Train set shapes:
X_train: (8, 5)
y_train: (8,)
Test set shapes:
X_test: (2, 5)
y_test: (2,)
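
Fixing random_state makes the shuffle repeatable, so the same rows end up in the test set on every run. The sketch below splits row indices rather than the features themselves to show which rows are held out. The row_ids array is a stand-in introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(10)  # stand-in for the ten rows of the dataset

# Splitting row indices instead of features shows which rows land in each set.
train_a, test_a = train_test_split(row_ids, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(row_ids, test_size=0.2, random_state=0)

print("test rows:", sorted(test_a.tolist()))
print("same split on both calls:", np.array_equal(test_a, test_b))
```

With only two test rows, any evaluation on this dataset is illustrative at best, since a different random_state can give a very different test set.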


## Feature Scaling

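The code below applies z-score standardization (StandardScaler) to the features in the training and test sets. After scaling, every column has zero mean and unit variance on the training data, so features on very different scales, such as Age and Salary, contribute comparably to the model. The scaler is fitted on the training set only and then reused to transform the test set. Note that the one-hot country columns are scaled along with Age and Salary.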

In [49]:
# Feature scaling (standardization): fit on the training set, reuse on the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Print the scaled training set
print("Scaled Training Set:")
print(X_train)

# Print the scaled test set
print("Scaled Test Set:")
print(X_test)





Scaled Training Set:
[[-1.00000000e+00  2.64575131e+00 -7.74596669e-01  4.33012702e-01
  -1.18512280e+00]
 [ 1.00000000e+00 -3.77964473e-01 -7.74596669e-01 -2.08166817e-17
   5.98428342e-01]
 [-1.00000000e+00 -3.77964473e-01  1.29099445e+00 -1.44337567e+00
  -1.18512280e+00]
 [-1.00000000e+00 -3.77964473e-01  1.29099445e+00 -1.44337567e+00
  -8.09638345e-01]
 [ 1.00000000e+00 -3.77964473e-01 -7.74596669e-01  1.58771324e+00
   1.72488169e+00]
 [-1.00000000e+00 -3.77964473e-01  1.29099445e+00  1.44337567e-01
   3.52016672e-02]
 [ 1.00000000e+00 -3.77964473e-01 -7.74596669e-01  1.01036297e+00
   1.06778390e+00]
 [ 1.00000000e+00 -3.77964473e-01 -7.74596669e-01 -2.88675135e-01
  -2.46411670e-01]]
Scaled Test Set:
[[-1.          2.64575131 -0.77459667 -1.01036297 -0.62189612]
 [-1.          2.64575131 -0.77459667  1.87638837  2.10036614]]
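
StandardScaler computes z = (x - mean) / std for each column, using the mean and the population standard deviation of the training set. The sketch below reproduces the Age column by hand with NumPy. The training and test ages are read off the split implied by the outputs above (rows 2 and 8 in the test set), which is an inference from the printed values rather than something the notebook states.

```python
import numpy as np

# Ages after imputation; rows 2 and 8 are assumed to be the test rows.
ages_train = np.array([44.0, 27.0, 38.0, 40.0, 35.0, 27.0, 48.0, 37.0])
ages_test = np.array([30.0, 50.0])

mean = ages_train.mean()   # statistics come from the training set only
std = ages_train.std()     # population standard deviation (ddof=0)

z_train = (ages_train - mean) / std
z_test = (ages_test - mean) / std   # test data reuses the training statistics

print(np.round(z_test, 8))
```

The two test values agree with the fourth column of the scaled test set printed above, which confirms that the test rows were transformed with the training mean and standard deviation.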
