# Data Preprocessing
Data preprocessing is the process of transforming raw data into clean data suitable for training, testing, and analysis by an ML model.

## Installing and importing the required libraries
Before starting the preprocessing tasks, install the following libraries:
- NumPy: For working efficiently with multidimensional arrays.
```
pip install numpy
```
- Pandas: For loading raw data from a file or a DBMS, and for manipulating and analysing it.
```
pip install pandas
```
- Scikit-learn: For statistical modelling and for the imputing, encoding, splitting, and scaling steps used below.
```
pip install scikit-learn
```
### Importing the required libraries

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd

## Data Loading
Load the dataset using one of the `read_*` functions provided by the `pandas` library (here, `read_csv`):

In [15]:
df = pd.read_csv('data.csv')
print(df.head())

   Customer_Age Gender  Dependent_count Education_Level Marital_Status  \
0            45    NaN              3.0     High School        Married   
1            49      F              5.0        Graduate         Single   
2            51      M              3.0             NaN        Married   
3            40      F              4.0     High School        Unknown   
4            40      M              NaN      Uneducated        Married   

  Income_Category Card_Cateory  
0     $60K - $80K         Blue  
1  Less than $40K         Blue  
2    $80K - $120K         Blue  
3  Less than $40K         Blue  
4     $60K - $80K         Blue  


Summary of the dataset (high-level stats):

In [16]:
print(df.describe())

       Customer_Age  Dependent_count
count     10.000000         9.000000
mean      43.700000         2.888889
std        6.360468         1.452966
min       32.000000         0.000000
25%       40.000000         2.000000
50%       44.500000         3.000000
75%       48.750000         4.000000
max       51.000000         5.000000
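
By default, `describe()` summarises only numeric columns, which is why the text columns are absent from the output above. A minimal sketch of summarising the non-numeric columns as well, assuming a small hand-built frame (`sample` and its values are illustrative, not the full dataset):

```python
import numpy as np
import pandas as pd

# Hand-built sample mirroring a few columns of the dataset (illustrative only)
sample = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40],
    "Gender": [np.nan, "F", "M", "F"],
    "Education_Level": ["High School", "Graduate", np.nan, "High School"],
})

# Column types show which columns are numeric and which hold text
print(sample.dtypes)

# Summary of the non-numeric columns only: count, unique, top, freq
categorical_summary = sample.describe(exclude=[np.number])
print(categorical_summary)
```

The `count` row of this summary excludes missing values, so it also gives a quick first hint of which text columns have gaps.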


## Creating Variable Vectors
This is the process of separating the features (independent variables) from the target (dependent variable). Assuming the target is the last column, `Card_Cateory` (spelled this way in the dataset):

In [17]:
# Independent variable vector
x = df.iloc[:,:-1].values
print(x)

[[45 nan 3.0 'High School' 'Married' '$60K - $80K']
 [49 'F' 5.0 'Graduate' 'Single' 'Less than $40K']
 [51 'M' 3.0 nan 'Married' '$80K - $120K']
 [40 'F' 4.0 'High School' 'Unknown' 'Less than $40K']
 [40 'M' nan 'Uneducated' 'Married' '$60K - $80K']
 [44 nan 2.0 'Graduate' 'Married' '$40K - 60K']
 [51 'M' 4.0 'Unknown' 'Married' '$120K  +']
 [32 'M' 0.0 'High School' 'Unknown' '$60K - $80K']
 [37 'M' 3.0 'Uneducated' 'Single' '$60K - $80K']
 [48 'M' 2.0 'Graduate' 'Single' '$80K - $120K']]


In [18]:
# Dependent variable vector
y = df.iloc[:, -1].values
print(y)

['Blue' 'Blue' 'Blue' 'Blue' 'Blue' 'Blue' 'Gold' 'Silver' 'Blue' 'Blue']
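
Positional slicing with `iloc` relies on the target being the last column. As an alternative sketch, the same split can be done by column name, which keeps working if the column order changes. The `sample` frame and the `target_column` variable below are illustrative assumptions:

```python
import pandas as pd

# Hand-built sample with illustrative values
sample = pd.DataFrame({
    "Customer_Age": [45, 49, 51],
    "Marital_Status": ["Married", "Single", "Married"],
    "Card_Cateory": ["Blue", "Blue", "Gold"],  # spelled as in the dataset header
})

target_column = "Card_Cateory"

# Features: every column except the target
features = sample.drop(columns=[target_column]).to_numpy()

# Target: the single named column
target = sample[target_column].to_numpy()

print(features)
print(target)
```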


## Handling missing values
Some values may be missing from the dataset. Pandas normally indicates them with `NaN`. ML models may perform poorly when given missing values, so they have to be handled.

In [19]:
# Count the number of missing values in each column
print(df.isnull().sum())

Customer_Age       0
Gender             2
Dependent_count    1
Education_Level    1
Marital_Status     0
Income_Category    0
Card_Cateory       0
dtype: int64


### Dropping records with missing values

In [20]:
# Drop missing value records and retain the rest (inplace)
df.dropna(inplace=True)
print(df.to_string())

   Customer_Age Gender  Dependent_count Education_Level Marital_Status Income_Category Card_Cateory
1            49      F              5.0        Graduate         Single  Less than $40K         Blue
3            40      F              4.0     High School        Unknown  Less than $40K         Blue
6            51      M              4.0         Unknown        Married        $120K  +         Gold
7            32      M              0.0     High School        Unknown     $60K - $80K       Silver
8            37      M              3.0      Uneducated         Single     $60K - $80K         Blue
9            48      M              2.0        Graduate         Single    $80K - $120K         Blue


The drawback of this approach is that a lot of data can be lost: here, 4 of the 10 records are dropped because each has at least one null value.
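
When dropping every incomplete record costs too much data, `dropna` can be restricted. The sketch below, on a small illustrative frame (`sample` is an assumption, not the full dataset), drops rows only when a chosen column is missing (`subset`) or only when a row has too few non-missing values (`thresh`):

```python
import numpy as np
import pandas as pd

# Hand-built sample with illustrative values
sample = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40, 40],
    "Gender": [np.nan, "F", "M", "F", "M"],
    "Dependent_count": [3.0, 5.0, 3.0, 4.0, np.nan],
    "Education_Level": ["High School", "Graduate", np.nan, "High School", "Uneducated"],
})

# Fraction of missing values per column
missing_share = sample.isnull().mean()
print(missing_share)

# Drop a row only when 'Gender' is missing
no_missing_gender = sample.dropna(subset=["Gender"])

# Keep rows that have at least 3 non-missing values
mostly_complete = sample.dropna(thresh=3)

print(len(sample), len(no_missing_gender), len(mostly_complete))
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves `sample` untouched, so the different options can be compared side by side.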

### Replacing missing values
This technique is known as **imputing**: the missing values are replaced with substitute values, such as the most frequent value or the mean of the column.

The `SimpleImputer` class from `sklearn.impute` can be used:

In [21]:
# Replacing missing values
from sklearn.impute import SimpleImputer

# Imputer object with 'most_frequent' strategy
imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='most_frequent',
)

# Fit the imputer on columns 1 to 3 (Gender, Dependent_count, Education_Level) and fill in the missing values
imputer.fit(x[:, 1:4])
x[:, 1:4] = imputer.transform(x[:, 1:4])
print(x)

[[45 'M' 3.0 'High School' 'Married' '$60K - $80K']
 [49 'F' 5.0 'Graduate' 'Single' 'Less than $40K']
 [51 'M' 3.0 'Graduate' 'Married' '$80K - $120K']
 [40 'F' 4.0 'High School' 'Unknown' 'Less than $40K']
 [40 'M' 3.0 'Uneducated' 'Married' '$60K - $80K']
 [44 'M' 2.0 'Graduate' 'Married' '$40K - 60K']
 [51 'M' 4.0 'Unknown' 'Married' '$120K  +']
 [32 'M' 0.0 'High School' 'Unknown' '$60K - $80K']
 [37 'M' 3.0 'Uneducated' 'Single' '$60K - $80K']
 [48 'M' 2.0 'Graduate' 'Single' '$80K - $120K']]


The imputing strategy above is `most_frequent`. `strategy='mean'` cannot be used here because the selected columns contain non-numeric data.
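
If the mean is preferred for a numeric column, one option is to impute numeric and categorical columns separately. A minimal sketch under that assumption, with an illustrative two-column array (`data`, `mode_imputer`, and `mean_imputer` are names introduced here):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: Gender, Dependent_count (values are illustrative)
data = np.array([
    [np.nan, 3.0],
    ["F", 5.0],
    ["M", 3.0],
    ["M", 4.0],
    ["M", np.nan],
], dtype=object)

# Categorical column: fill with the most frequent value
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
data[:, 0:1] = mode_imputer.fit_transform(data[:, 0:1])

# Numeric column: cast to float first, then fill with the column mean
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
data[:, 1:2] = mean_imputer.fit_transform(data[:, 1:2].astype(float))

print(data)
```

Here the missing `Dependent_count` becomes the mean of the observed values (3.75), which is not a whole number. Whether that is acceptable for a count feature is a modelling decision.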

## Data Encoding
This is the process of converting non-numerical data to numerical values so that the ML model can process it.

### One-Hot Encoding
One-hot encoding converts a categorical feature into binary (0/1) columns. It creates one column for every category in the feature, which increases the dimensionality of the dataset.

In [22]:
# Column transformer and OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Indexes of the categorical columns (Gender, Education_Level, Marital_Status, Income_Category)
categorical_columns = [1, 3, 4, 5]
# ColumnTransformer to OneHotEncode the data
ct = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), categorical_columns),
    ],
    remainder="passthrough"
)

x = np.array(ct.fit_transform(x))
print(x)

[[0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 45 3.0]
 [1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 49 5.0]
 [0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 51 3.0]
 [1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 40 4.0]
 [0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 40 3.0]
 [0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 44 2.0]
 [0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 51 4.0]
 [0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 32 0.0]
 [0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 37 3.0]
 [0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 48 2.0]]


### Label Encoding
Label encoding replaces each distinct text value with an integer. It is commonly applied to the target.

In [23]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

y = le.fit_transform(y)
print(y)

[0 0 0 0 0 0 1 2 0 0]
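
The integer codes follow the sorted order of the distinct labels, and the fitted encoder can map codes back to the original text. A short sketch with illustrative labels (the `labels` array is an assumption, not the full target vector):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Blue", "Blue", "Gold", "Silver", "Blue"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the labels in sorted order; the position of a label is its code
print(le.classes_)
print(encoded)

# Map the integer codes back to the original labels
decoded = le.inverse_transform(encoded)
print(decoded)
```

Keeping the fitted `le` object around is useful later for turning model predictions back into readable category names.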


## Splitting data into Train and Test sets
Splitting is important so that we can assess the performance of the ML model on unseen data, tune hyperparameters, and prevent data leakage (test data influencing the training process).

In [24]:
# Split dataset for training and testing
from sklearn.model_selection import train_test_split

# Assign 25% of the dataset for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

`x_train` values:

In [25]:
# x_train
print(x_train)

[[0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 40 3.0]
 [0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 45 3.0]
 [1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 40 4.0]
 [1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 49 5.0]
 [0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 32 0.0]
 [0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 37 3.0]
 [0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 44 2.0]]


`x_test` values:

In [26]:
# x_test
print(x_test)

[[0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 51 3.0]
 [0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 48 2.0]
 [0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 51 4.0]]


`y_train` values:

In [27]:
# y_train
print(y_train)

[0 0 0 0 2 0 0]


`y_test` values:

In [28]:
# y_test
print(y_test)

[0 0 1]
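
With 10 records and `test_size=0.25`, the split yields 7 training and 3 test records, as seen above. The sketch below checks the sizes and shows that a fixed `random_state` makes the split reproducible. The `features` and `target` arrays are placeholder data, not the encoded dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 records with 2 features each
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=1
)
print(x_train.shape, x_test.shape)

# Repeating the split with the same random_state selects the same records
_, x_test_again, _, y_test_again = train_test_split(
    features, target, test_size=0.25, random_state=1
)
print(np.array_equal(x_test, x_test_again))
```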


## Feature Scaling
Feature scaling is a technique that transforms the values of the independent variables to a common scale so that they contribute comparably to the ML model.

### Why is feature scaling important?
1. Features with larger scales may have more impact on the model than features with smaller scales.
2. Larger values may make computation more expensive, so transforming them can improve algorithm performance.
3. Feature scaling helps prevent numerical instabilities that lead to problems such as overflow and underflow.
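
Standardization, the scaling method used in this section, rescales each feature as `z = (x - mean) / std`, so the result has mean 0 and standard deviation 1. A minimal NumPy sketch of that arithmetic, using the `Customer_Age` values from the training split shown earlier (the variable names are introduced here for illustration):

```python
import numpy as np

# Customer_Age values from the training split
ages = np.array([40, 45, 40, 49, 32, 37, 44], dtype=float)

mean = ages.mean()
std = ages.std()  # population standard deviation (ddof=0), as StandardScaler uses

standardized = (ages - mean) / std
print(mean, std)
print(standardized)
```

After the transformation the ages are expressed in units of standard deviations from the mean, which puts them on the same footing as any other standardized feature.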


### Feature Scaling with Standardization

In [38]:
# Standardization
from sklearn.preprocessing import StandardScaler

# StandardScaler object
sc = StandardScaler()

# Apply standard scaling
x_train[:,14:] = sc.fit_transform(x_train[:,14:])
x_test[:,14:] = sc.transform(x_test[:,14:])
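
The scaler is fitted on the training data only, and the test data is then transformed with the statistics learned from it, so no information from the test set leaks into training. A self-contained sketch of that pattern, assuming plain float arrays holding the `Customer_Age` and `Dependent_count` values from the split above (`train_numeric` and `test_numeric` are names introduced here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Customer_Age and Dependent_count values copied from the train/test split
train_numeric = np.array(
    [[40, 3], [45, 3], [40, 4], [49, 5], [32, 0], [37, 3], [44, 2]], dtype=float
)
test_numeric = np.array([[51, 3], [48, 2], [51, 4]], dtype=float)

sc = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
train_scaled = sc.fit_transform(train_numeric)

# transform reuses those training statistics on the test data
test_scaled = sc.transform(test_numeric)

print(sc.mean_)
print(test_scaled)
```

Because the test ages (48 to 51) lie above the training mean of 41, their standardized values are all positive. Scaled test data does not need to have mean 0.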

#### Outputs after standard scaling:

In [40]:
# x_train
print(x_train[:,14:])

[[-0.19296124624699 0.09805806756909202]
 [0.7718449849879598 0.09805806756909202]
 [-0.19296124624699 0.7844645405527362]
 [1.5436899699759197 1.4708710135363803]
 [-1.7366512162229095 -1.9611613513818404]
 [-0.7718449849879598 0.09805806756909202]
 [0.5788837387409699 -0.5883484054145522]]


In [41]:
# x_test
print(x_test[:,14:])

[[51.0 3.0]
 [48.0 2.0]
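 [51.0 4.0]]

Note that the `x_test` values shown here are on the original scale rather than standardized, so this output does not reflect the `sc.transform` call above. After that call, the test values are expressed relative to the mean and standard deviation learned from `x_train`.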
