<a href="https://colab.research.google.com/github/glasseyes/MM0001-Operation-Mind-Shield/blob/main/submissions/daniel-glassey/daniel_g_operation_mind_shield.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Operation Mind Shield: Decoding Alzheimer's
#### **Full Name:** Daniel Glassey
#### **Link to SDS Profile:** https://community.superdatascience.com/u/140115de

@misc{rabie_el_kharoua_2024,
  title={Alzheimer's Disease Dataset},
  url={https://www.kaggle.com/dsv/8668279},
  DOI={10.34740/KAGGLE/DSV/8668279},
  publisher={Kaggle},
  author={Rabie El Kharoua},
  year={2024}
  }

## The assignment

  Welcome to Operation Mind Shield, a data science challenge initiated by SuperDataScience and focused on predicting Alzheimer's Disease from patient data. This mission is designed for data scientists of all
  skill levels, with three Assignments tailored to different levels of expertise. Dive into the dataset, tackle the assignment that best matches your skills, and contribute to this collaborative project.







##  📜 Mission Brief

  In this mission, you will work with a dataset on Alzheimer's Disease patients. The goal is to build predictive models to determine the likelihood of a patient having Alzheimer's based on various features.
  You'll progress through three phases depending on the assignment level you choose:
  - Data Cleaning & Analysis
  - Data Preprocessing & Feature Selection
  - Model Selection & Fine-Tuning

 ## 🎯 Mission Objectives

*   Predict whether a patient has Alzheimer's based on the available features.
*   Clean and preprocess the data to improve the performance of your models.
*   Train models that best suit your experience level, from simple to highly advanced.
*   Evaluate your model's performance using appropriate validation techniques.
*   (Advanced Level Only) Deploy your model using a Streamlit App.

 ## 🏆 Assignments
  1. Assignment 1: The Initiate (Beginner Level)
    - Perform basic data cleaning.
    - Build a simple linear model (e.g., Logistic Regression).
    - Evaluate your model with a simple test set.

  2. Assignment 2: The Specialist (Intermediate Level)
    - Conduct elaborate data cleaning with 1 feature selection step.
    - Train more advanced models like Gradient Boosting.
    - Use k-fold cross-validation for evaluation.
    - Perform basic hyperparameter tuning.

  3. Assignment 3: The Operative (Advanced Level)
    - Engage in advanced data preprocessing (looping feature selection and feature extraction using PCA/LDA).
    - Build ensemble models using powerful frameworks (e.g., SageMaker, Azure Machine Learning).
    - Perform extensive hyperparameter tuning.
    - Evaluate using k-fold cross-validation.
    - Deploy your model through a Streamlit App.

 ## The Dataset

- PatientID: A unique identifier assigned to each patient (4751 to 6900).
- Age: The age of the patients ranges from 60 to 90 years.
- Gender: Gender of the patients, where 0 represents Male and 1 represents Female.
- Ethnicity: The ethnicity of the patients, coded as follows:
    - 0: Caucasian
    - 1: African American
    - 2: Asian
    - 3: Other
- EducationLevel: The education level of the patients, coded as follows:
    - 0: None
    - 1: High School
    - 2: Bachelor's
    - 3: Higher
- BMI: Body Mass Index of the patients, ranging from 15 to 40.
- Smoking: Smoking status, where 0 indicates No and 1 indicates Yes.
- AlcoholConsumption: Weekly alcohol consumption in units, ranging from 0 to 20.
- PhysicalActivity: Weekly physical activity in hours, ranging from 0 to 10.
- DietQuality: Diet quality score, ranging from 0 to 10.
- SleepQuality: Sleep quality score, ranging from 4 to 10.
- FamilyHistoryAlzheimers: Family history of Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
- CardiovascularDisease: Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes.
- Diabetes: Presence of diabetes, where 0 indicates No and 1 indicates Yes.
- Depression: Presence of depression, where 0 indicates No and 1 indicates Yes.
- HeadInjury: History of head injury, where 0 indicates No and 1 indicates Yes.
- Hypertension: Presence of hypertension, where 0 indicates No and 1 indicates Yes.
- SystolicBP: Systolic blood pressure, ranging from 90 to 180 mmHg.
- DiastolicBP: Diastolic blood pressure, ranging from 60 to 120 mmHg.
- CholesterolTotal: Total cholesterol levels, ranging from 150 to 300 mg/dL.
- CholesterolLDL: Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.
- CholesterolHDL: High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.
- CholesterolTriglycerides: Triglycerides levels, ranging from 50 to 400 mg/dL.
- MMSE: Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment.
- FunctionalAssessment: Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment.
- MemoryComplaints: Presence of memory complaints, where 0 indicates No and 1 indicates Yes.
- BehavioralProblems: Presence of behavioral problems, where 0 indicates No and 1 indicates Yes.
- ADL: Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment.
- Confusion: Presence of confusion, where 0 indicates No and 1 indicates Yes.
- Disorientation: Presence of disorientation, where 0 indicates No and 1 indicates Yes.
- PersonalityChanges: Presence of personality changes, where 0 indicates No and 1 indicates Yes.
- DifficultyCompletingTasks: Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes.
- Forgetfulness: Presence of forgetfulness, where 0 indicates No and 1 indicates Yes.
- Diagnosis: Diagnosis status for Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
- DoctorInCharge: This column contains confidential information about the doctor in charge, with "XXXConfid" as the value for all patients.

## 1. Data Preprocessing

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Global variables

In [None]:
kfold_cv = 10 # number of folds in K-Fold Cross Validation
my_random_state = 0 # set random state for models for consistent runs

### Import Data

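The PatientID and DoctorInCharge columns carry no predictive information (one is a unique identifier, the other holds the same placeholder value for every patient), so both are dropped.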

In [None]:
dataset = pd.read_csv('alzheimers_disease_data.csv')
dataset = dataset.drop(['PatientID', 'DoctorInCharge'], axis= 1)

In [None]:
dataset.head()

### Check for missing or duplicated data

In [None]:
# Check dataset info
dataset.info()

All the columns have 2149 non-null entries, so there is no missing data.

In [None]:
dataset = dataset.drop_duplicates()  # defaults: compare all columns, keep the first occurrence
dataset.info()

No rows have been removed, so there are no duplicated rows.
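
The two checks above are read off the `info()` output by eye. As a minimal sketch, the same questions can be answered with explicit counts. The small `toy` frame below is made up for illustration and is not the Alzheimer's dataset; in the notebook the calls would be applied to `dataset`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Age": [70, 82, 70, 65],
    "BMI": [22.5, 30.1, 22.5, None],
    "Diagnosis": [0, 1, 0, 1],
})

missing_per_column = toy.isnull().sum()      # number of missing values in each column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("Duplicated rows:", n_duplicates)
```

A count of zero in every column and zero duplicated rows would confirm the two statements above without comparing row counts manually.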

### List all the columns

In [None]:
# columns = list(dataset.columns)
# ['Age',
#  'Gender',
#  'Ethnicity',
#  'EducationLevel',
#  'BMI',
#  'Smoking',
#  'AlcoholConsumption',
#  'PhysicalActivity',
#  'DietQuality',
#  'SleepQuality',
#  'FamilyHistoryAlzheimers',
#  'CardiovascularDisease',
#  'Diabetes',
#  'Depression',
#  'HeadInjury',
#  'Hypertension',
#  'SystolicBP',
#  'DiastolicBP',
#  'CholesterolTotal',
#  'CholesterolLDL',
#  'CholesterolHDL',
#  'CholesterolTriglycerides',
#  'MMSE',
#  'FunctionalAssessment',
#  'MemoryComplaints',
#  'BehavioralProblems',
#  'ADL',
#  'Confusion',
#  'Disorientation',
#  'PersonalityChanges',
#  'DifficultyCompletingTasks',
#  'Forgetfulness',
#  'Diagnosis']

### Rearrange the columns to group categorical, scalable and binary ones together


To encode the categorical data and apply feature scaling by column position, the columns are reordered so that those of the same type sit next to each other. Diagnosis is placed last so that it can be split off as the target.

In [None]:
# columns 0:1 (one-hot encoded later)
categorical_columns = ['Ethnicity', 'EducationLevel']

# columns 2:16 (standardised later)
scalable_columns = ['Age', 'BMI', 'AlcoholConsumption', 'PhysicalActivity',
                    'DietQuality', 'SleepQuality', 'SystolicBP', 'DiastolicBP',
                    'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL',
                    'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment', 'ADL'
                   ]
# columns 17:32 (already 0/1; Diagnosis, the target, is last)
binary_columns = ['Gender', 'Smoking', 'FamilyHistoryAlzheimers', 'CardiovascularDisease',
                  'Diabetes', 'Depression', 'HeadInjury', 'Hypertension', 'MemoryComplaints',
                  'BehavioralProblems', 'Confusion', 'Disorientation', 'PersonalityChanges',
                  'DifficultyCompletingTasks', 'Forgetfulness', 'Diagnosis'
                 ]

assert len(categorical_columns) + len(scalable_columns) + len(binary_columns) == len(dataset.columns)

new_columns = categorical_columns + scalable_columns + binary_columns
dataset = dataset[new_columns]
dataset.head()

In [None]:
dataset.tail()

### Extract X and y

In [None]:
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, -1].values

### Encoding categorical data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1] )], remainder='passthrough')
X = np.array(ct.fit_transform(X))
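
The feature scaling step later in this notebook uses the slice `8:23`, which follows from this encoding: `ColumnTransformer` puts the transformed columns first, and each of the two categorical columns has four codes. A small sketch of that arithmetic, using made-up rows in `codes` rather than the real data, and assuming every code 0 to 3 actually occurs in both columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows covering all four codes of Ethnicity and EducationLevel.
codes = np.array([[0, 3], [1, 2], [2, 1], [3, 0], [0, 0]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(codes).toarray()   # dense copy of the sparse result

n_onehot = encoded.shape[1]        # 4 + 4 indicator columns
n_scalable = 15                    # length of scalable_columns
scaled_slice = (n_onehot, n_onehot + n_scalable)
print(n_onehot, scaled_slice)
```

If a code were missing from the data, the encoder would create fewer indicator columns and the hard-coded slice would no longer line up.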

### Splitting the dataset into the Training set and Test set

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = my_random_state)

In [None]:
X_train[0]

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 8:23] = sc.fit_transform(X_train[:, 8:23])
X_test[:, 8:23] = sc.transform(X_test[:, 8:23])

In [None]:
X_train[0]
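
The encoding and scaling steps above rely on column positions. As an alternative sketch (not the approach used in this notebook), a single `ColumnTransformer` can select columns by name, with the statistics learned from the training rows only. The frame below is synthetic and uses only a few of the real column names; `frame`, `target` and `preprocess` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame with a subset of the real column names (values are made up).
rng = np.random.default_rng(0)
n = 48
frame = pd.DataFrame({
    "Ethnicity": np.arange(n) % 4,
    "EducationLevel": (np.arange(n) // 4) % 4,
    "Age": rng.integers(60, 91, size=n),
    "BMI": rng.uniform(15, 40, size=n),
    "Smoking": rng.integers(0, 2, size=n),
})
target = np.arange(n) % 2

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Ethnicity", "EducationLevel"]),
        ("scale", StandardScaler(), ["Age", "BMI"]),
    ],
    remainder="passthrough",   # binary columns pass through untouched
    sparse_threshold=0,        # always return a dense array
)

train_df, test_df, y_tr, y_te = train_test_split(frame, target, test_size=0.2, random_state=0)
X_tr = preprocess.fit_transform(train_df)   # encoder and scaler fitted on training rows only
X_te = preprocess.transform(test_df)        # same encoding and scaling applied to test rows
print(X_tr.shape, X_te.shape)
```

Selecting by name means the slice boundaries never need to be recomputed when columns are added or reordered.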

## Classification models

Run all the models that are not gradient boosting models. The results are collected in the variables below and presented at the end of this section.

### Initialise result variables

In [None]:
confusion_matrices = []
accuracies = []
one_accuracy = []
models = ["Logistic Regression",\
          "K-NN",\
          "SVM",\
          "Kernel SVM",\
          "Naive Bayes",\
          "Decision Tree",\
          "Random Forest"]

### Logistic Regression model

#### Training the Logistic Regression model on the Training set

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = my_random_state)
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))

### K-NN model

#### Training the K-NN model on the Training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))

### SVM model

#### Training the SVM model on the Training set

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = my_random_state)
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))

### Kernel SVM model

#### Training the Kernel SVM model on the Training set

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = my_random_state)
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))

### Naive Bayes model

#### Training the Naive Bayes model on the Training set

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))

### Decision Tree Classification model

#### Training the Decision Tree Classification model on the Training set

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = my_random_state)
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))
print(accuracies[-1])

### Random Forest Classification model

#### Training the Random Forest Classification model on the Training set

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = my_random_state)
classifier.fit(X_train, y_train)

#### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
confusion_matrices.append(confusion_matrix(y_test, y_pred))
one_accuracy.append(accuracy_score(y_test, y_pred))

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies.append(cross_val_score(estimator=classifier,X=X_train, y=y_train, cv=kfold_cv))
print(accuracies[-1])

## Comparing the classification models

### Confusion Matrices

In [None]:
print("Confusion Matrices\n")
for i in range(len(models)):
  print(models[i])
  print(confusion_matrices[i])

### Accuracy scores

In [None]:
print("Accuracy Scores\n")
for i in range(len(models)):
  print(models[i].ljust(25), "{:.2f}%".format(one_accuracy[i]*100))

### K-Fold Validation Accuracies

In [None]:
print("K-Fold Validation\n")
for i in range(len(models)):
  print(models[i].ljust(25), "Accuracy: {:.2f}% Standard Deviation: {:.2f}%".format(accuracies[i].mean()*100, accuracies[i].std()*100))

### Summary

The best of these models are Decision Tree and Random Forest.
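
Each model in this section goes through the same three steps: fit, score on the test set, and cross-validate on the training set. A compact sketch of that pattern as a loop is shown here. It runs on synthetic data from `make_classification` and only three of the seven models; the helper `evaluate` and the `Xd_*`/`yd_*` names are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use its own X_train, X_test, y_train, y_test.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

def evaluate(classifier, folds=10):
    """Fit on the training split; return (confusion matrix, test accuracy, CV scores)."""
    classifier.fit(Xd_train, yd_train)
    predictions = classifier.predict(Xd_test)
    cv_scores = cross_val_score(estimator=classifier, X=Xd_train, y=yd_train, cv=folds)
    return confusion_matrix(yd_test, predictions), accuracy_score(yd_test, predictions), cv_scores

candidates = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}
results = {name: evaluate(clf) for name, clf in candidates.items()}

for name, (cm, acc, cv_scores) in results.items():
    print(name.ljust(25), "{:.2f}%".format(acc * 100),
          "CV mean: {:.2f}%".format(cv_scores.mean() * 100))
```

Keeping the results in a dictionary keyed by model name avoids relying on the order in which the cells were run.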

## Gradient Boosting models

### Initialise result variables

In [None]:
gb_confusion_matrices = [] # confusion matrices
gb_accuracies = [] # k-fold accuracies
gb_skf_accuracies = [] # k-fold accuracies using stratified folding
gb_rskf_accuracies = [] # k-fold accuracies using repeated stratified folding
gb_one_accuracy = [] # accuracy on test set
gb_models = ["CatBoost",\
             "XGBoost",\
             "LightGBM"\
            ]


### CatBoost

#### Building and training the model

##### Building the model

In [None]:
!pip install catboost

In [None]:
import catboost as cb
model = cb.CatBoostClassifier(verbose = 0, random_state = my_random_state)
#model = cb.CatBoostClassifier(verbose=0, task_type='GPU')

##### Training the model

In [None]:
model.fit(X_train, y_train)

In [None]:
# model.calc_feature_statistics(X_train,
#                                     y_train,
#                                     feature=10,
#                                     plot=False)

#### Inference

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

#### Evaluating the model

##### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
gb_confusion_matrices.append(cm)

In [None]:
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

##### Monkey accuracy

In [None]:
# Baseline: accuracy of a model that always predicts class 0 (row 0 of cm holds the true class-0 rows)
monkey = 100*cm[0].sum()/cm.sum()
print("Accuracy of 0 model: {:.1f}%".format(monkey))
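
The "monkey" figure is the accuracy of a model that ignores the features and always predicts 0. In a scikit-learn confusion matrix the rows are the true classes, so the sum of row 0 is the number of test rows that really are class 0. A worked example with a made-up matrix `cm_demo` (these are not the notebook's results):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows are true classes, columns are predictions.
cm_demo = np.array([[50, 5],
                    [10, 35]])

actual_negatives = cm_demo[0].sum()               # every test row whose true label is 0
baseline = actual_negatives / cm_demo.sum()       # accuracy of always predicting 0
model_accuracy = np.trace(cm_demo) / cm_demo.sum()  # correct predictions lie on the diagonal

print("Always-0 baseline: {:.1f}%".format(baseline * 100))
print("Model accuracy:    {:.1f}%".format(model_accuracy * 100))
```

A model is only useful to the extent that its accuracy exceeds this baseline; the comparison is most meaningful when class 0 is the majority class.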

##### Accuracy

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy on test set: {:.2f}%".format(acc*100))
gb_one_accuracy.append(acc)

##### k-Fold Cross Validation and Stratified

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
kf_accs = cross_val_score(estimator=model,X=X_train, y=y_train, cv=kfold_cv)
skf = StratifiedKFold(n_splits=kfold_cv, shuffle = True, random_state=my_random_state)
skf_accs = cross_val_score(estimator = model,
                             X = X_train,
                             y = y_train,
                             scoring = 'accuracy',
                             cv = skf)
gb_skf_accuracies.append(skf_accs)
gb_accuracies.append(kf_accs)

###### Output

In [None]:
print("K-Fold Accuracy: {:.2f} %".format(kf_accs.mean()*100))
print("Standard Deviation: {:.2f} %".format(kf_accs.std()*100))
print(kf_accs)
print("Stratified KF Accuracy: {:.2f} %".format(skf_accs.mean()*100))
print("Standard Deviation: {:.2f} %".format(skf_accs.std()*100))
print(skf_accs)

##### Repeated Stratified K-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=kfold_cv, n_repeats=3, random_state = my_random_state)
rskf_accs = cross_val_score(estimator=model, \
                     X=X_train, y=y_train, \
                     scoring='accuracy', \
                     cv=rskf)
gb_rskf_accuracies.append(rskf_accs)

###### Output

In [None]:
print("RSKF Mean Accuracies: {:.2f}%".format(rskf_accs.mean()*100))
print("Standard Deviation: {:.2f}%".format(rskf_accs.std()*100))
print(rskf_accs)
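
The difference between the validation schemes above lies in how the folds are drawn. A small sketch on toy labels (20 negatives and 10 positives, sizes chosen only for illustration) shows that stratified folds keep the class ratio in every fold, and that the repeated variant simply produces more splits:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# Toy labels: 20 negatives and 10 positives (made-up sizes).
y_toy = np.array([0] * 20 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))   # features play no role in how the folds are drawn

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y_toy[test_idx].sum())
                      for _, test_idx in skf_demo.split(X_toy, y_toy)]
print(positives_per_fold)

rskf_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
n_scores = rskf_demo.get_n_splits(X_toy, y_toy)   # one accuracy score per split
print(n_scores)
```

With `n_splits=kfold_cv` and `n_repeats=3` as in the cell above, `cross_val_score` therefore returns 30 scores, which is why the repeated scheme takes roughly three times as long.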

##### Grid Search option

In [None]:
# Tune the CatBoost model with the following hyperparameters. The hyperparameters
# that have the greatest effect on optimizing the CatBoost evaluation metrics are
# learning_rate, depth, l2_leaf_reg, and random_strength.
#
# Parameter Name   | Parameter Type            | Recommended Ranges
# learning_rate    | ContinuousParameterRanges | MinValue: 0.01, MaxValue: 0.1
# depth            | IntegerParameterRanges    | MinValue: 4, MaxValue: 10
# l2_leaf_reg      | IntegerParameterRanges    | MinValue: 2, MaxValue: 10
# random_strength  | ContinuousParameterRanges | MinValue: 0, MaxValue: 10


In [None]:
# from sklearn.model_selection import GridSearchCV
# # parameters = [{'learning_rate' : [0.01, 0.05, 0.10],\
# #                'depth' : [4, 7, 10],
# #                'l2_leaf_reg' : [2, 6, 10],
# #                'random_strength' : [0, 4, 8]}]

# parameters = [{'learning_rate' : [0.007, 0.008], # initially 0.01, 0.03
#                'depth' : [8, 9, 10], # initially 4, 10
#                'l2_leaf_reg' : [2, 3], # initially 2,6
#                'random_strength' : [0, 1]}] # initially 0,4
# grid_search = GridSearchCV(estimator = model,
#                            param_grid = parameters,
#                            scoring = 'accuracy',
#                            cv = 5,
#                            n_jobs = -1)

In [None]:
# # The long bit
# grid_search.fit(X, y)

In [None]:
# best_accuracy = grid_search.best_score_
# best_parameters = grid_search.best_params_
# print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
# print("Best Parameters:", best_parameters)

Best Accuracy: 95.80%

Best Parameters: {'depth': 10, 'l2_leaf_reg': 2, 'learning_rate': 0.01, 'random_strength': 0}

In [None]:
# model = cb.CatBoostClassifier(l2_leaf_reg = 2, \
#                              learning_rate = 0.008, \
#                              depth = 10, \
#                              random_strength = 0,
#                              verbose=0)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy: {:.2f}%".format(accuracy*100))
# accs = cross_val_score(estimator = model, X = X, y = y, cv=10, scoring='accuracy')
# print("Mean Accuracy: {:.2f}%".format(accs.mean()*100))
# print("Standard Deviation: {:.2f}%".format(accs.std()*100))

### XGBoost

#### Building and Training the model

##### Building the model

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(random_state = my_random_state)
# model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1, gamma=0)

##### Training the model

In [None]:
model.fit(X_train, y_train)

#### Inference

In [None]:
y_pred = model.predict(X_test)

#### Evaluating the model

##### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
gb_confusion_matrices.append(cm)

In [None]:
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

##### Monkey accuracy

In [None]:
# Baseline: accuracy of a model that always predicts class 0 (row 0 of cm holds the true class-0 rows)
monkey = 100*cm[0].sum()/cm.sum()
print("Accuracy of 0 model: {:.1f}%".format(monkey))

##### Accuracy

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy on test set: {:.2f}%".format(acc*100))
gb_one_accuracy.append(acc)

##### k-Fold Cross Validation and Stratified

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
kf_accs = cross_val_score(estimator=model,X=X_train, y=y_train, cv=kfold_cv)
skf = StratifiedKFold(n_splits=kfold_cv, shuffle = True, random_state=my_random_state)
skf_accs = cross_val_score(estimator = model,
                             X = X_train,
                             y = y_train,
                             scoring = 'accuracy',
                             cv = skf)
gb_skf_accuracies.append(skf_accs)
gb_accuracies.append(kf_accs)

###### Output

In [None]:
print("K-Fold Accuracy: {:.2f} %".format(kf_accs.mean()*100))
print("Standard Deviation: {:.2f} %".format(kf_accs.std()*100))
print(kf_accs)
print("Stratified KF Accuracy: {:.2f} %".format(skf_accs.mean()*100))
print("Standard Deviation: {:.2f} %".format(skf_accs.std()*100))
print(skf_accs)

##### Repeated Stratified K-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=kfold_cv, n_repeats=3, random_state = my_random_state)
rskf_accs = cross_val_score(estimator=model, \
                     X=X_train, y=y_train, \
                     scoring='accuracy', \
                     cv=rskf)
gb_rskf_accuracies.append(rskf_accs)

###### Output

In [None]:
print("RSKF Mean Accuracies: {:.2f}%".format(rskf_accs.mean()*100))
print("Standard Deviation: {:.2f}%".format(rskf_accs.std()*100))
print(rskf_accs)

#### Grid Search option

In [None]:
# from sklearn.model_selection import GridSearchCV
# classifier = XGBClassifier()
# parameters = [{'learning_rate': [0.09, 0.1, 0.11], \
#                'gamma': [0, 0.01], \
#                'n_estimators': [88, 90, 92], \
#                'max_depth': [3, 4, 5]}]
# #               'booster': ['gbtree', 'gblinear', 'dart'], \
# grid_search = GridSearchCV(estimator=classifier, \
#                            param_grid=parameters, \
#                            scoring = 'accuracy',\
#                            cv = 10,\
#                            n_jobs = -1)
# grid_search.fit(X, y)
# best_accuracy = grid_search.best_score_
# best_parameters = grid_search.best_params_
# print("Best Score(Accuracy): {:.2f} %".format(best_accuracy*100))
# print("Best Parameters:", best_parameters)

Find new parameters


```
Parameters: learning_rate=, max_depth=, n_estimators=
```



###### Re-run test set accuracy

Using the grid search's best parameters

In [None]:
# model = XGBClassifier(gamma = best_parameters['gamma'], \
#                       learning_rate = best_parameters['learning_rate'], \
#                       booster = 'gbtree', \
#                       max_depth = best_parameters['max_depth'], \
#                       n_estimators= best_parameters['n_estimators'])
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy: {:.2f}%".format(accuracy*100))
# accs = cross_val_score(estimator = model, X = X, y = y, cv=10, scoring='accuracy')
# print("Mean Accuracy: {:.2f}%".format(accs.mean()*100))
# print("Standard Deviation: {:.2f}%".format(accs.std()*100))

### LightGBM

#### Building and Training the model

##### Building the model

In [None]:
import lightgbm
model = lightgbm.LGBMClassifier(verbose=0, learning_rate=0.08, n_estimators=110, num_leaves=29, random_state = my_random_state )

##### Training the model

In [None]:
model.fit(X_train, y_train)

#### Inference

In [None]:
y_pred = model.predict(X_test)

#### Evaluating the model

##### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
gb_confusion_matrices.append(cm)

In [None]:
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

##### Monkey accuracy

In [None]:
# Baseline: accuracy of a model that always predicts class 0 (row 0 of cm holds the true class-0 rows)
monkey = 100*cm[0].sum()/cm.sum()
print("Accuracy of 0 model: {:.1f}%".format(monkey))

##### Accuracy

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy on test set: {:.2f}%".format(acc*100))
gb_one_accuracy.append(acc)

##### k-Fold Cross Validation and Stratified

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
kf_accs = cross_val_score(estimator=model,X=X_train, y=y_train, cv=kfold_cv)
skf = StratifiedKFold(n_splits=kfold_cv, shuffle = True, random_state=my_random_state)
skf_accs = cross_val_score(estimator = model,
                             X = X_train,
                             y = y_train,
                             scoring = 'accuracy',
                             cv = skf)
gb_skf_accuracies.append(skf_accs)
gb_accuracies.append(kf_accs)

###### Output

In [None]:
print("K-Fold Accuracy: {:.2f} %".format(kf_accs.mean()*100))
print("Standard Deviation: {:.2f} %".format(kf_accs.std()*100))
print(kf_accs)
print("Stratified KF Accuracy: {:.2f} %".format(skf_accs.mean()*100))
print("Standard Deviation: {:.2f} %".format(skf_accs.std()*100))
print(skf_accs)

##### Repeated Stratified K-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=kfold_cv, n_repeats=3, random_state = my_random_state)
rskf_accs = cross_val_score(estimator=model, \
                     X=X_train, y=y_train, \
                     scoring='accuracy', \
                     cv=rskf)
gb_rskf_accuracies.append(rskf_accs)

In [None]:
print("RSKF Mean Accuracies: {:.2f}%".format(rskf_accs.mean()*100))
print("Standard Deviation: {:.2f}%".format(rskf_accs.std()*100))
print(rskf_accs)

#### Grid Search option

In [None]:
# from sklearn.model_selection import GridSearchCV
# parameters = [{'num_leaves' : [29, 30, 31, 32, 33], \
#                'learning_rate' : [0.075, 0.08, 0.085, 0.09, 0.095],
#                'n_estimators' : [80, 90, 100, 110, 120, 130] \
#                }]
# grid = GridSearchCV(estimator=model,
#                     param_grid=parameters,
#                     scoring='accuracy',
#                     cv=10)
# grid.fit(X_train, y_train)
# best_parameters = grid.best_params_
# print("Best Score(Accuracy) : {:.3f}%".format(grid.best_score_*100))
# print("Best Parameters :", best_parameters)

Find new parameters


```
Parameters: learning_rate=0.08, n_estimators=110, num_leaves=29
```



##### Rerun test set accuracy

In [None]:
# model = lightgbm.LGBMClassifier(num_leaves = best_parameters['num_leaves'], \
#                       learning_rate = best_parameters['learning_rate'], \
#                       n_estimators= best_parameters['n_estimators'],
#                       verbose=0)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy: {:.2f}%".format(accuracy*100))
# accs = cross_val_score(estimator = model, X = X, y = y, cv=10, scoring='accuracy')
# print("Mean Accuracy: {:.2f}%".format(accs.mean()*100))
# print("Standard Deviation: {:.2f}%".format(accs.std()*100))

## Comparing Gradient Boosting models

### Confusion Matrices

In [None]:
print("Confusion Matrices\n")
for i in range(len(gb_models)):
  print(gb_models[i])
  print(gb_confusion_matrices[i])

### Test Set Accuracy scores

In [None]:
print("Accuracy Scores\n")
for i in range(len(gb_models)):
  print(gb_models[i].ljust(25), "{:.2f}%".format(gb_one_accuracy[i]*100))

### K-Fold Validation Accuracies

In [None]:
print("K-Fold Validation\n")
for i in range(len(gb_models)):
  print(gb_models[i].ljust(25), "Accuracy: {:.2f}% Standard Deviation: {:.2f}%".format(gb_accuracies[i].mean()*100, gb_accuracies[i].std()*100))

### Stratified K-Fold Validation Accuracies

In [None]:
print("Stratified K-Fold Validation\n")
for i in range(len(gb_models)):
  print(gb_models[i].ljust(25), "Accuracy: {:.2f}% Standard Deviation: {:.2f}%".format(gb_skf_accuracies[i].mean()*100, gb_skf_accuracies[i].std()*100))

### Repeated Stratified K-Fold Validation Accuracies

In [None]:
print("Repeated Stratified K-Fold Validation\n")
for i in range(len(gb_models)):
  print(gb_models[i].ljust(25), "Accuracy: {:.2f}% Standard Deviation: {:.2f}%".format(gb_rskf_accuracies[i].mean()*100, gb_rskf_accuracies[i].std()*100))

**CatBoost gives the highest accuracy, so the grid search below tunes CatBoost's hyperparameters.**

## Grid Search - optimise chosen model hyperparameters

In [None]:
# Tune the CatBoost model with the following hyperparameters. The hyperparameters
# that have the greatest effect on optimizing the CatBoost evaluation metrics are
# learning_rate, depth, l2_leaf_reg, and random_strength.
#
# Parameter Name   | Parameter Type            | Recommended Ranges
# learning_rate    | ContinuousParameterRanges | MinValue: 0.01, MaxValue: 0.1
# depth            | IntegerParameterRanges    | MinValue: 4, MaxValue: 10
# l2_leaf_reg      | IntegerParameterRanges    | MinValue: 2, MaxValue: 10
# random_strength  | ContinuousParameterRanges | MinValue: 0, MaxValue: 10


In [None]:
import catboost as cb
from sklearn.model_selection import GridSearchCV, StratifiedKFold
model = cb.CatBoostClassifier(verbose = 0, random_state = my_random_state)
# No fit is needed here: GridSearchCV clones the estimator and fits every candidate itself
# parameters = [{'learning_rate' : [0.01, 0.05, 0.10],\
#                'depth' : [4, 7, 10],
#                'l2_leaf_reg' : [2, 6, 10],
#                'random_strength' : [0, 4, 8]}]

parameters = [{'learning_rate' : [0.006, 0.007, 0.008, 0.009], # initially 0.01, 0.03
               'depth' : [8, 9, 10], # initially 4, 10
               'l2_leaf_reg' : [2, 3], # initially 2,6
               'random_strength' : [0, 0.5]}] # initially 0,4
skf = StratifiedKFold(n_splits=kfold_cv, shuffle = True, random_state=my_random_state)

grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = skf,
                           n_jobs = -1)

In [None]:
# The long bit
grid_search.fit(X_train, y_train)

In [None]:
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

Best Accuracy: 95.80%

Best Parameters: {'depth': 10, 'l2_leaf_reg': 2, 'learning_rate': 0.01, 'random_strength': 0}
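
The grid search cell is slow because every combination in the grid is trained once per fold. The arithmetic for the grid defined above can be sketched with the standard library alone; `folds = 10` mirrors `kfold_cv`, and the extra fit assumes `GridSearchCV` is left at its default `refit=True`.

```python
from itertools import product
from math import prod

# Same grid as in the grid search cell above.
grid = {
    "learning_rate": [0.006, 0.007, 0.008, 0.009],
    "depth": [8, 9, 10],
    "l2_leaf_reg": [2, 3],
    "random_strength": [0, 0.5],
}
folds = 10   # kfold_cv

n_candidates = prod(len(values) for values in grid.values())
n_fits = n_candidates * folds + 1   # +1 for the final refit on all training rows

combinations = list(product(*grid.values()))   # every candidate as a tuple of values
print(n_candidates, n_fits, combinations[0])
```

Adding one more value to any list multiplies the number of candidates, which is why the grid was narrowed around the values found in earlier runs.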

## Rerun test set accuracy with best hyperparameters

In [None]:
model = cb.CatBoostClassifier(l2_leaf_reg = best_parameters['l2_leaf_reg'], \
                             learning_rate = best_parameters['learning_rate'], \
                             depth = best_parameters['depth'], \
                             random_strength = best_parameters['random_strength'],\
                             random_state = my_random_state,\
                             verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))
accs = cross_val_score(estimator = model, X = X, y = y, cv=skf, scoring='accuracy')
print("Mean Accuracy: {:.2f}%".format(accs.mean()*100))
print("Standard Deviation: {:.2f}%".format(accs.std()*100))
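
The cell above rebuilds the model from `best_parameters` by hand. As an alternative sketch, a fitted `GridSearchCV` also exposes `best_estimator_`, which with the default `refit=True` has already been retrained on the full training set using the best parameters. To keep the example fast it uses synthetic data and a decision tree as a stand-in for CatBoost; `search`, `tuned` and the `Xd_*`/`yd_*` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a decision tree stand in for the notebook's data and CatBoost model.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(Xd_train, yd_train)

# With the default refit=True, best_estimator_ is already retrained on all of Xd_train.
tuned = search.best_estimator_
test_accuracy = accuracy_score(yd_test, tuned.predict(Xd_test))
print(search.best_params_, "{:.2f}%".format(test_accuracy * 100))
```

Using `best_estimator_` avoids copying parameter values between cells, so the evaluated model cannot drift from the one the search selected.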

#### Grid Search for LightGBM

In [None]:
from sklearn.model_selection import GridSearchCV
# No fit is needed here: GridSearchCV clones the estimator and fits every candidate itself
model = lightgbm.LGBMClassifier(verbose=0, random_state = my_random_state )
parameters = [{'num_leaves' : [31, 32, 33, 34], \
               'learning_rate' : [0.01, 0.09, 0.095, 0.1],
               'n_estimators' : [100, 105, 110, 115] \
               }]
grid = GridSearchCV(estimator=model,
                    param_grid=parameters,
                    scoring='accuracy',
                    cv=skf)
grid.fit(X_train, y_train)
best_parameters = grid.best_params_
print("Best Score(Accuracy) : {:.2f}%".format(grid.best_score_*100))
print("Best Parameters :", best_parameters)

Find new parameters


```
Parameters: learning_rate=0.08, n_estimators=110, num_leaves=29
```



#### Rerun test set accuracy

In [None]:
model = lightgbm.LGBMClassifier(num_leaves = best_parameters['num_leaves'], \
                      learning_rate = best_parameters['learning_rate'], \
                      n_estimators= best_parameters['n_estimators'],
                      random_state = my_random_state,
                      verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))
accs = cross_val_score(estimator = model, X = X, y = y, cv=skf, scoring='accuracy')
print("Mean Accuracy: {:.2f}%".format(accs.mean()*100))
print("Standard Deviation: {:.2f}%".format(accs.std()*100))