# MLAA Assessment 2 Experiment 3



## Context
In this notebook, we explore the random forest model. Hyperparameter tuning is combined with a class sampling technique to examine the prediction scores that result.

### 0. Import Packages

In [58]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import ADASYN

### 1. Load Dataset

In [11]:
df = pd.read_csv('/Users/ronikj/Downloads/df.csv')
df.head()

Unnamed: 0,gender_no,car_segment,age_of_vehicle_years,sched_serv_warr,sched_serv_paid,total_paid_services,total_services,mth_since_last_serv,Target
0,1.0,0,9,2,3,5,6,9,0
1,0.0,2,6,10,10,9,10,6,0
2,1.0,1,9,10,10,10,10,7,0
3,1.0,1,5,8,8,5,6,4,0
4,0.0,2,8,9,10,9,8,5,0


### 2. Explore Dataset

In [12]:
df.dtypes

gender_no               float64
car_segment               int64
age_of_vehicle_years      int64
sched_serv_warr           int64
sched_serv_paid           int64
total_paid_services       int64
total_services            int64
mth_since_last_serv       int64
Target                    int64
dtype: object

### 3. Explore Target variable

In [13]:
target = df['Target']
target.value_counts()

0    127759
1      3520
Name: Target, dtype: int64
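
The counts above imply a heavily imbalanced target. As a quick worked check (a sketch that uses only the two counts printed above; the variable names are illustrative), the minority share and the accuracy of a model that always predicts class 0 can be computed as follows:

```python
# Class counts copied from the value_counts() output above.
n_negative = 127759  # Target == 0
n_positive = 3520    # Target == 1

n_total = n_negative + n_positive

# Share of the minority class in the full dataset.
minority_share = n_positive / n_total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = n_negative / n_total

print(f"Total rows: {n_total}")
print(f"Minority share: {minority_share:.4f}")
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.4f}")
```

Accuracy figures later in the notebook are easier to interpret against this trivial baseline of roughly 97%.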

### 4. Explore Variables of Interest

The variables of interest are kept the same as in Experiment 1. The business finalised the questions to be asked in Experiment 1 and is keeping them fixed so that the questionnaires do not become too lengthy.

The variables of interest are therefore: gender_no, car_segment, age_of_vehicle_years, sched_serv_warr, sched_serv_paid, total_paid_services, total_services, mth_since_last_serv.

### 5. Prepare Data

The data has been prepared in Experiment 1 and data preparation will be kept constant throughout all experiments to ensure uniformity for accurate model assessment.

### 6. Split Dataset

In Experiment 3, cross-validation is used to evaluate the model on the overall dataset. The dataset is split into 5 folds and accuracy is measured on each fold.

We will first go ahead with the general split for the baseline model.

In [28]:
X = df.drop(['Target'], axis=1)
y = df['Target']

In [71]:
X_train_base, X_temp, y_train_base, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test_base, X_val_base, y_test_base, y_val_base = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [72]:
print("Baseline Training set shape:", X_train_base.shape, y_train_base.shape)
print("Baseline Validation set shape:", X_val_base.shape, y_val_base.shape)
print("Baseline Testing set shape:", X_test_base.shape, y_test_base.shape)

Baseline Training set shape: (91895, 8) (91895,)
Baseline Validation set shape: (19692, 8) (19692,)
Baseline Testing set shape: (19692, 8) (19692,)
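
The two-step split above yields roughly 70/15/15 proportions. Because the positive class is rare, one option is to pass `stratify` so that each subset keeps about the same class ratio. The sketch below uses a small synthetic dataset (an assumption for illustration, not the notebook's `df`) and hypothetical variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 8 features, 3% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 8))
y_demo = np.zeros(1000, dtype=int)
y_demo[:30] = 1

# Step 1: hold out 30% of the rows, preserving the class ratio.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Step 2: split the held-out rows evenly into validation and test sets.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold
)

# Report the size and positive share of each subset.
for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(labels), round(labels.mean(), 3))
```

Without `stratify`, the number of positives in each subset varies with the random seed, which is why the validation and test supports for class 1 differ in the reports that follow.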


### 7. Baseline Model

In [74]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_base, y_train_base)

In [39]:
y_train_base_pred = model.predict(X_train_base)
y_val_base_pred = model.predict(X_val_base)
y_test_base_pred = model.predict(X_test_base)

In [75]:
print("Training Classification Report:")
print(classification_report(y_train_base, y_train_base_pred))

print("Validation Classification Report:")
print(classification_report(y_val_base, y_val_base_pred))

print("Test Classification Report:")
print(classification_report(y_test_base, y_test_base_pred))

Training Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     89427
           1       0.98      0.83      0.90      2468

    accuracy                           0.99     91895
   macro avg       0.99      0.91      0.95     91895
weighted avg       0.99      0.99      0.99     91895

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     19181
           1       0.79      0.56      0.65       511

    accuracy                           0.98     19692
   macro avg       0.89      0.78      0.82     19692
weighted avg       0.98      0.98      0.98     19692

Test Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     19151
           1       0.83      0.56      0.67       541

    accuracy                           0.98     19692
   macro avg       0.91      0.78 

The random forest classifier fits the training data very well, but the scores drop on the validation and test data. The drop is concentrated in class 1, where recall falls from 0.83 on the training set to 0.56 on both held-out sets, which points to the existing class imbalance as the main cause. The imbalance is addressed in the main random forest model that follows.
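
The class 1 F1 scores in the reports follow from the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). A short check using the rounded figures printed above (so the result can differ from the reported F1 in the last digit; the helper name is illustrative):

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded class 1 values copied from the baseline reports above.
val_f1 = f1_from_precision_recall(0.79, 0.56)
test_f1 = f1_from_precision_recall(0.83, 0.56)

print(f"Validation F1 (class 1): {val_f1:.3f}")
print(f"Test F1 (class 1): {test_f1:.3f}")
```

Because the harmonic mean is pulled towards the smaller value, the low recall of 0.56 dominates the class 1 F1 score even though precision stays near 0.8.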

### 8. Train Machine Learning Model
For the main machine learning model, we will first deal with the existing class imbalance using Adaptive Synthetic Sampling (ADASYN). 

In [76]:
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)

rf_classifier = RandomForestClassifier(random_state=42)

In [77]:
X_train, X_temp, y_train, y_temp = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (178509, 8) (178509,)
Validation set shape: (38253, 8) (38253,)
Testing set shape: (38252, 8) (38252,)
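
The cells above resample the full dataset before splitting, so the validation and test sets also contain synthetic minority samples. A common alternative is to split first and oversample only the training split, so that the held-out data keeps the original class ratio. The sketch below illustrates that ordering on synthetic data. It swaps in plain random oversampling via `sklearn.utils.resample` as a stand-in for ADASYN, and all variable names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data: 2,000 rows, 8 features, 5% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 8))
y_demo = np.zeros(2000, dtype=int)
y_demo[:100] = 1

# 1. Split first, so the held-out rows are untouched original data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# 2. Oversample the minority class in the training split only.
#    (Random oversampling here; ADASYN would be applied at this same step.)
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_tr_bal = np.vstack([X_maj, X_min_up])
y_tr_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])

print("Balanced train positive share:", y_tr_bal.mean())
print("Untouched test positive share:", y_te.mean())
```

With this ordering, scores on the held-out split describe performance at the original class ratio rather than on balanced, partly synthetic data.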


### 8.1 Hyperparameter Tuning

Hyperparameter tuning is performed with GridSearchCV to find the best-performing Random Forest parameters for this dataset.

In [67]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Hyperparameters:", grid_search.best_params_)

810 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
810 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/ronikj/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/ronikj/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/Users/ronikj/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/Users/ronikj/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 95, in v

Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
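
The failure count is consistent with the `'auto'` option. The grid has 3 × 3 × 3 × 3 × 2 × 2 = 324 combinations, or 1,620 fits with 5 folds, and exactly half of them use `max_features='auto'`. scikit-learn 1.3 and later no longer accept `'auto'` for `RandomForestClassifier`, which matches the parameter-validation frames in the traceback. A possible grid that avoids the failed fits is sketched below. Using `'log2'` as the second candidate is an assumption for illustration, not a value from the notebook:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, with the rejected 'auto' option replaced.
# 'log2' is an illustrative second candidate, not a value from the notebook.
param_grid_fixed = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
}

# Count the candidate combinations and the resulting number of fits.
n_candidates = len(ParameterGrid(param_grid_fixed))
n_folds = 5
print("Candidates:", n_candidates)
print("Total fits with 5-fold CV:", n_candidates * n_folds)
```

`param_grid_fixed` could be passed to `GridSearchCV` in place of `param_grid`. Setting `n_jobs=-1` on the search runs the fits in parallel.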


In [68]:
rf_classifier = RandomForestClassifier(bootstrap=False, max_depth=None, max_features='sqrt',
                                       min_samples_leaf=1, min_samples_split=2,
                                       n_estimators=200, random_state=42)

cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(rf_classifier, X, y, cv=cv_splitter, scoring='accuracy')

print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())

Cross-Validation Scores: [0.9847273  0.98392748 0.98328001 0.98339427 0.98400305]
Mean CV Score: 0.9838664229360683


### 8.2 Model Training

In [80]:
rf_classifier.fit(X_train, y_train)

y_train_pred = rf_classifier.predict(X_train)
y_val_pred = rf_classifier.predict(X_val)

In [81]:
# The model was already fitted in the previous cell, so only predict here.
y_pred_test = rf_classifier.predict(X_test)

### 9. Assess Model Performance

In [82]:
print("Training Classification Report:")
print(classification_report(y_train, y_train_pred))

print("Validation Classification Report:")
print(classification_report(y_val, y_val_pred))

print("Classification Report on Test Set:")
print(classification_report(y_test, y_pred_test))

Training Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.96      0.97     89619
           1       0.96      0.99      0.97     88890

    accuracy                           0.97    178509
   macro avg       0.97      0.97      0.97    178509
weighted avg       0.97      0.97      0.97    178509

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.97     19020
           1       0.96      0.98      0.97     19233

    accuracy                           0.97     38253
   macro avg       0.97      0.97      0.97     38253
weighted avg       0.97      0.97      0.97     38253

Classification Report on Test Set:
              precision    recall  f1-score   support

           0       0.98      0.95      0.97     19120
           1       0.96      0.98      0.97     19132

    accuracy                           0.97     38252
   macro avg       0.97    

The final random forest model classification reports have been generated. The observed characteristics are as follows:
1. The training classification report shows high precision, recall, and F1 scores for both Target classes.
2. Precision, recall, and F1 drop only slightly on the validation set (by at most 0.01), and the model still performs very well for both Target classes.
3. The scores on the final test set are also high and closely match the validation scores.

Overall, the model identifies both the 0 and 1 Target classes well on the ADASYN-resampled data, which indicates that Random Forest combined with minority class sampling and hyperparameter tuning is a good choice for this project.