# MLAA Assessment 2 Experiment 4



## Context
Experiment 4 explores the K Nearest Neighbors (KNN) classification model. The Synthetic Minority Over-sampling Technique (SMOTE) is used to address the class imbalance in the target variable.

### 0. Import Packages

In [15]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE

### 1. Load Dataset

In [3]:
df = pd.read_csv('/Users/ronikj/Downloads/df.csv')
df.head()

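|   | gender_no | car_segment | age_of_vehicle_years | sched_serv_warr | sched_serv_paid | total_paid_services | total_services | mth_since_last_serv | Target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 9 | 2 | 3 | 5 | 6 | 9 | 0 |
| 1 | 0.0 | 2 | 6 | 10 | 10 | 9 | 10 | 6 | 0 |
| 2 | 1.0 | 1 | 9 | 10 | 10 | 10 | 10 | 7 | 0 |
| 3 | 1.0 | 1 | 5 | 8 | 8 | 5 | 6 | 4 | 0 |
| 4 | 0.0 | 2 | 8 | 9 | 10 | 9 | 8 | 5 | 0 |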


### 2. Explore Dataset

In [4]:
df.shape

(131279, 9)

In [5]:
df.describe()

|   | gender_no | car_segment | age_of_vehicle_years | sched_serv_warr | sched_serv_paid | total_paid_services | total_services | mth_since_last_serv | Target |
|---|---|---|---|---|---|---|---|---|---|
| count | 131279.0 | 131279.0 | 131279.0 | 131279.0 | 131279.0 | 131279.0 | 131279.0 | 131279.0 | 131279.0 |
| mean | 0.59452 | 1.228117 | 5.494108 | 5.454071 | 5.453843 | 5.48336 | 5.456676 | 5.471355 | 0.026813 |
| std | 0.490987 | 0.74226 | 2.842468 | 2.883944 | 2.886114 | 2.879763 | 2.875332 | 2.859098 | 0.161538 |
| min | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 0.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 50% | 1.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 0.0 |
| 75% | 1.0 | 2.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 0.0 |
| max | 1.0 | 2.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 1.0 |


### 3. Explore Target variable

In [6]:
df['Target'].value_counts()

0    127759
1      3520
Name: Target, dtype: int64

The target is heavily imbalanced: only 3,520 of the 131,279 records (about 2.7%) belong to class 1. This imbalance is expected, because the overall business model depends on the repurchase of cars.
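To put a number on the imbalance, the class shares can be computed from the counts reported above. This is a minimal sketch that rebuilds the target column from those two counts rather than reading the real file; in the notebook itself the same call would be made on `df['Target']`.

```python
import pandas as pd

# Rebuild a target column from the class counts reported above
# (127,759 zeros and 3,520 ones); the notebook would use df['Target'] instead.
target = pd.Series([0] * 127759 + [1] * 3520, name="Target")

# normalize=True returns class shares instead of raw counts
class_share = target.value_counts(normalize=True)
print(class_share.round(4))

# Ratio of majority to minority records
imbalance_ratio = 127759 / 3520
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f} : 1")
```

The minority share matches the mean of `Target` shown in the `df.describe()` output.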

### 4. Explore Variables of Interest

The variables of interest are unchanged from Experiment 1. The business finalised the questions to be asked in Experiment 1 and is keeping to them so that the questionnaires do not become too lengthy.

The variables of interest are therefore: gender_no, car_segment, age_of_vehicle_years, sched_serv_warr, sched_serv_paid, total_paid_services, total_services, mth_since_last_serv

### 5. Prepare Data

The data has been prepared in Experiment 1 and data preparation will be kept constant throughout all experiments to ensure uniformity for accurate model assessment.

### 6. Split Dataset

In [8]:
X = df.drop(['Target'], axis=1)
y = df['Target']

In [9]:
X_train_base, X_temp_base, y_train_base, y_temp_base = train_test_split(X, y, test_size=0.3, random_state=42)
X_test_base, X_val_base, y_test_base, y_val_base = train_test_split(X_temp_base, y_temp_base, test_size=0.5, random_state=42)

In [10]:
print("Baseline Training set shape:", X_train_base.shape, y_train_base.shape)
print("Baseline Validation set shape:", X_val_base.shape, y_val_base.shape)
print("Baseline Testing set shape:", X_test_base.shape, y_test_base.shape)

Baseline Training set shape: (91895, 8) (91895,)
Baseline Validation set shape: (19692, 8) (19692,)
Baseline Testing set shape: (19692, 8) (19692,)
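The split above is purely random, so with a minority class this small the share of class 1 can drift slightly between the three subsets. One possible refinement, shown here only as a sketch on synthetic stand-in data (the `X_demo` and `y_demo` arrays are invented for illustration), is to pass `stratify` to `train_test_split` so each subset keeps the same class share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, roughly 2.7% positives
rng = np.random.default_rng(42)
X_demo = rng.integers(1, 11, size=(10_000, 8))
y_demo = (rng.random(10_000) < 0.027).astype(int)

# stratify keeps the class share (almost) equal in every resulting subset
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

for name, labels in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(f"{name}: {len(labels)} rows, positive share {labels.mean():.4f}")
```

The notebook's reported results were produced without stratification, so this is an alternative to consider rather than a description of what was run.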


### 7. Baseline Model
The baseline model in this experiment is a KNN model with the default number of neighbors (scikit-learn uses `n_neighbors=5` when none is specified), trained on the original, imbalanced dataset. This shows how well the model does before any class imbalance handling.

In [27]:
model = KNeighborsClassifier()
model.fit(X_train_base, y_train_base)
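Because the baseline is created with no arguments, it relies on scikit-learn's default settings. The short check below lists the defaults that matter most for KNN; it only inspects a fresh estimator and does not touch the data.

```python
from sklearn.neighbors import KNeighborsClassifier

# With no arguments, scikit-learn's KNN falls back to its documented defaults
default_knn = KNeighborsClassifier()
params = default_knn.get_params()

print("n_neighbors:", params["n_neighbors"])  # number of neighbours that vote
print("weights:", params["weights"])          # 'uniform' means every neighbour counts equally
print("metric:", params["metric"])            # 'minkowski' with p=2 is Euclidean distance
print("p:", params["p"])
```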

In [28]:
y_train_base_pred = model.predict(X_train_base)
y_val_base_pred = model.predict(X_val_base)
y_test_base_pred = model.predict(X_test_base)

print("Training Classification Report:")
print(classification_report(y_train_base, y_train_base_pred))

print("Validation Classification Report:")
print(classification_report(y_val_base, y_val_base_pred))

print("Test Classification Report:")
print(classification_report(y_test_base, y_test_base_pred))

Training Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     89427
           1       0.90      0.47      0.62      2468

    accuracy                           0.98     91895
   macro avg       0.94      0.73      0.80     91895
weighted avg       0.98      0.98      0.98     91895

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19181
           1       0.79      0.38      0.51       511

    accuracy                           0.98     19692
   macro avg       0.89      0.69      0.75     19692
weighted avg       0.98      0.98      0.98     19692

Test Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19151
           1       0.82      0.37      0.51       541

    accuracy                           0.98     19692
   macro avg       0.90      0.69 

A default K Nearest Neighbours model has been trained as the baseline. The trend matches the previous experiments: the model performs well on the majority class.

Scores drop considerably for the minority class (recall between 0.37 and 0.47 for class 1) because of the severe class imbalance.
This is addressed in the next steps.
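The 0.98 accuracy in the reports above needs to be read against the class imbalance. As a worked illustration, the sketch below scores a trivial rule that always predicts class 0, using the test-set class counts from the support column of the baseline report. No model is trained here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Test-set class counts taken from the support column of the baseline report
n_negative, n_positive = 19151, 541
y_true = np.array([0] * n_negative + [1] * n_positive)

# A trivial "model" that always predicts the majority class
y_always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_zero)
rec = recall_score(y_true, y_always_zero)  # recall for class 1

print(f"Accuracy of always predicting 0: {acc:.4f}")
print(f"Recall for class 1:              {rec:.4f}")
```

Accuracy alone therefore says little on this dataset, which is why the minority-class recall and F1 score are the figures to watch.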

### 8. Train Machine Learning Model

### 8.1 Class Imbalance Handling using SMOTE
The Synthetic Minority Oversampling Technique (SMOTE) has been used to oversample the underrepresented class of the target variable (class 1 in our case) until both classes have the same number of records.

In [29]:
sampling = SMOTE(random_state=42)

X_resampled, y_resampled = sampling.fit_resample(X, y)

y_resampled.value_counts()

0    127759
1    127759
Name: Target, dtype: int64
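The oversampling above creates new class 1 records rather than copying existing ones. The following is a simplified sketch of the core idea only, not imbalanced-learn's actual implementation: a synthetic point is placed at a random position on the line between a minority point and one of its nearest minority neighbours. The `minority` array and the `synthesize_one` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points with two features (illustration only)
minority = np.array([[2.0, 3.0], [3.0, 5.0], [4.0, 4.0], [8.0, 9.0]])


def synthesize_one(points, k, rng):
    """Create one synthetic point by interpolating towards a near neighbour."""
    # 1. Pick a random minority point as the base
    base = points[rng.integers(len(points))]
    # 2. Find its k nearest minority neighbours (index 0 is the base itself)
    distances = np.linalg.norm(points - base, axis=1)
    neighbour_idx = np.argsort(distances)[1 : k + 1]
    # 3. Pick one neighbour and step a random fraction of the way towards it
    neighbour = points[rng.choice(neighbour_idx)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base), base, neighbour


new_point, base, neighbour = synthesize_one(minority, k=2, rng=rng)
print("base:", base, "neighbour:", neighbour, "synthetic:", new_point)
```

Because every synthetic record sits between two real minority records, it is by construction very close to existing data, which matters for a distance-based model such as KNN.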

In [30]:
X_train, X_temp, y_train, y_temp = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (178862, 8) (178862,)
Validation set shape: (38328, 8) (38328,)
Testing set shape: (38328, 8) (38328,)


### 8.2 Hyperparameter Tuning (Number of Neighbors)

In [35]:
param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Number of Neighbors:", grid_search.best_params_['n_neighbors'])

Best Number of Neighbors: 3
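The grid search reports only the winning value, but the fitted `GridSearchCV` object also stores the cross-validated score for every candidate in `cv_results_`. The sketch below shows one way to view them side by side; it runs on a small synthetic dataset (`X_demo`, `y_demo` are stand-ins), so its scores and best value are not those of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset standing in for the resampled training data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate value of n_neighbors
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Looking at the full table shows how close the candidates are to each other, which helps judge whether the chosen value is a clear winner or one of several near-equivalent options.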


### 8.3 K Nearest Neighbors Model

In [37]:
knn_model = KNeighborsClassifier(n_neighbors=3)

In [38]:
knn_model.fit(X_train, y_train)

In [39]:
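# Predict with the tuned knn_model (not the baseline `model`), and name the
# training predictions y_train_pred so they match the report cell in section 9
y_train_pred = knn_model.predict(X_train)
y_val_pred = knn_model.predict(X_val)

In [40]:
y_test_pred = knn_model.predict(X_test)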

### 9. Assess Model Performance

In [42]:
print("Training Classification Report:")
print(classification_report(y_train, y_train_pred))

print("Validation Classification Report:")
print(classification_report(y_val, y_val_pred))

print("Testing Classification Report:")
print(classification_report(y_test, y_test_pred))

Training Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96     89534
           1       0.94      0.98      0.96     89328

    accuracy                           0.96    178862
   macro avg       0.96      0.96      0.96    178862
weighted avg       0.96      0.96      0.96    178862

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96     19013
           1       0.94      0.98      0.96     19315

    accuracy                           0.96     38328
   macro avg       0.96      0.96      0.96     38328
weighted avg       0.96      0.96      0.96     38328

Testing Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.96     19212
           1       0.94      0.98      0.96     19116

    accuracy                           0.96     38328
   macro avg       0.96      0.

In the final KNN model, the following is observed:

1. The F1 score and accuracy stay at 0.96 across all three classification reports.
2. Recall for class 0 drops slightly in the testing classification report (0.93, against 0.94 in the training and validation reports).
3. Overall, the KNN model performs very well, with consistently high scores across the three classification reports generated.
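As a check on how the reported figures relate to each other, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall in the validation report; because the inputs are already rounded to two decimals, a recomputed value can differ from a reported one in the last digit. The helper name `f1_from` is made up for this example.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall from the validation classification report
f1_class_0 = f1_from(0.98, 0.94)
f1_class_1 = f1_from(0.94, 0.98)

print(f"F1 for class 0: {f1_class_0:.2f}")
print(f"F1 for class 1: {f1_class_1:.2f}")
```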