## **Business context**
Study Group specialises in providing educational services and resources to students and professionals across various fields. The company's primary focus is on enhancing learning experiences through a range of services, including online courses, tutoring, and educational consulting. By leveraging cutting-edge technology and a team of experienced educators, Study Group aims to bridge the gap between traditional learning methods and the evolving needs of today's learners.

Study Group serves its university partners by establishing strategic partnerships to enhance the universities’ global reach and diversity. It supports the universities in their efforts to attract international students, thereby enriching the cultural and academic landscape of their campuses. It works closely with university faculty and staff to ensure that the universities are prepared and equipped to welcome and support a growing international student body. Its partnership with universities also offers international students a seamless transition into their chosen academic environment. Study Group runs several International Study Centres across the UK and Dublin in partnership with universities with the aim of preparing a pipeline of talented international students from diverse backgrounds for degree study. These centres help international students adapt to the academic, cultural, and social aspects of studying abroad. This is achieved by improving conversational and subject-specific language skills and academic readiness before students progress to a full degree programme at university.

Through its comprehensive suite of services, Study Group supports learners and universities at every stage of their educational journey, from high school to postgraduate studies. Its approach is tailored to meet the unique needs of each learner, offering personalised learning paths and flexible scheduling options to accommodate various learning styles and commitments.

Study Group's services are designed to be accessible and affordable, making quality education a reality for many individuals. By focusing on the integration of technology and personalised learning, the company aims to empower learners to achieve their full potential and succeed in their academic and professional pursuits. Study Group is at the forefront of transforming how people learn and grow through its dedication to innovation and excellence.

Study Group has provided you with a course-level dataset.



## **Objective**
By the end of this mini-project, you will have developed the skills and knowledge to apply advanced machine learning techniques to create a predictive model for student dropout. This project involves comprehensive data exploration, preprocessing, and feature engineering to ensure high-quality input for the models. You will train and compare two predictive algorithms, XGBoost and a neural network, to determine which is more effective at predicting student dropout.

## Importing Libraries and Datasets

### Importing Libraries

In [None]:
%pip install keras-tuner

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score, recall_score,
                             roc_auc_score, classification_report, ConfusionMatrixDisplay, roc_curve)
from datetime import datetime
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import load_model
import keras_tuner as kt
from keras.optimizers import Adam, RMSprop, SGD

In [None]:
import warnings
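# silence noisy warnings, one documented call per warning category
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)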

### Importing Dataset

In [None]:
import gdown
destination = ''

# Construct the download URL
download_url = ''

# Download the file using gdown
gdown.download(download_url, destination, quiet=False)

## Exploratory Data Analysis

In [None]:
# view dataframe
df = pd.read_csv('CourseLevelDatasetVersion2.csv')
df

In [None]:
# Inspect Data Structure
df.info()

In [None]:
# Generate descriptive statistics for numerical columns.
df.describe()

In [None]:
# Summarize categorical columns with frequency counts.
df.describe(include='object')

In [None]:
df[['ContactHours', 'AttendancePercentage']]

### Data Cleaning

In [None]:
# saving a copy of dataset for future purposes
df_copy = df.copy()

In [None]:
# removing columns
columns= ['BookingId','BookingType', 'LeadSource', 'DiscountType', 'Nationality', 'HomeCountry', 'HomeState', 'HomeCity', 'PresentCount',
          'LateCount', 'AuthorisedAbsenceCount','ArrivedDate','NonCompletionReason', 'TerminationDate', 'CourseFirstIntakeDate', 'CourseStartDate',
          'CourseEndDate', 'AcademicYear', 'CourseName', 'LearnerCode', 'ProgressionDegree', 'EligibleToProgress', 'AssessedModules', 'PassedModules',
          'FailedModules', 'AttendancePercentage', 'ContactHours']
df=df.drop(columns=columns)
df

In [None]:
# checking for missing values
df.isnull().sum()

In [None]:
# removing rows with missing values
df = df.dropna()
df

In [None]:
# checking for duplicate rows.
df.duplicated().sum()


In [None]:
# removing duplicate rows
df = df.drop_duplicates()
df

### Feature Engineering

Converting date of birth column to age.
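Because the age is measured against the date on which the notebook is run, the resulting values shift from one run to the next. The sketch below shows one possible reproducible, vectorised variant that measures age against an explicit reference date. The helper name `age_on`, the sample dates, and the reference date are assumptions made for illustration and do not come from the dataset.

```python
import pandas as pd

def age_on(dob: pd.Series, reference_date: str) -> pd.Series:
    """Return age in whole years on reference_date (hypothetical helper)."""
    ref = pd.Timestamp(reference_date)
    # Subtract one year when the birthday falls after the reference day.
    before_birthday = (dob.dt.month > ref.month) | (
        (dob.dt.month == ref.month) & (dob.dt.day > ref.day)
    )
    return ref.year - dob.dt.year - before_birthday.astype(int)

# Made-up dates of birth in the same day/month/year format as the notebook.
dob = pd.to_datetime(
    pd.Series(["15/06/2000", "01/01/1999", "31/12/2001"]), format="%d/%m/%Y"
)
ages = age_on(dob, "2024-06-14")
print(ages.tolist())  # [23, 25, 22]
```

Fixing the reference date keeps the `Age` feature identical across reruns, which makes model results easier to compare.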

In [None]:
# Convert DOB column to datetime with the correct format
df['DateofBirth'] = pd.to_datetime(df['DateofBirth'], format='%d/%m/%Y')

# Function to calculate age
def calculate_age(dob):
    today = datetime.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age

# Apply the function to the DOB column
df['Age'] = df['DateofBirth'].apply(calculate_age)
df

In [None]:
# removing DateofBirth column
df = df.drop(columns=['DateofBirth'])
df

In [None]:
# Select numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the numerical columns
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df

Transforming 'IsFirstIntake' and 'CompletedCourse' columns into binary values.
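`Series.map` returns `NaN` for any value that is missing from the mapping, so an unexpected spelling such as `'yes'` would silently become a missing value. The following is a minimal sketch of a guarded mapping, assuming rows with missing values have already been dropped; the helper name `map_binary` and the sample values are hypothetical.

```python
import pandas as pd

def map_binary(series: pd.Series, mapping: dict) -> pd.Series:
    """Map a two-valued column to 0/1 and fail loudly on unexpected values."""
    mapped = series.map(mapping)
    # Values absent from the mapping come back as NaN; collect them for the error.
    unexpected = series[mapped.isna() & series.notna()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unmapped values: {list(unexpected)}")
    return mapped.astype(int)

# Made-up sample columns standing in for 'CompletedCourse' and 'IsFirstIntake'.
completed = map_binary(pd.Series(["Yes", "No", "Yes"]), {"Yes": 1, "No": 0})
first_intake = map_binary(pd.Series([True, False]), {True: 1, False: 0})
print(completed.tolist(), first_intake.tolist())
```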

In [None]:
# checking unique values in 'IsFirstIntake' column
df['IsFirstIntake'].unique()

In [None]:
# transforming 'IsFirstIntake' column to binary values
df['IsFirstIntake'] = df['IsFirstIntake'].map({True: 1, False: 0})
df

In [None]:
# checking unique values in 'CompletedCourse' column
df['CompletedCourse'].unique()

In [None]:
# transforming 'CompletedCourse' column to binary values
df['CompletedCourse'] = df['CompletedCourse'].map({'Yes': 1, 'No': 0})
df

Performing one-hot encoding on the categorical columns.
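One point to keep in mind with `pd.get_dummies` is that the output columns depend on the categories present in the data being encoded. If new records are scored later, they may lack some categories, so their columns need to be aligned with the training columns. The sketch below illustrates this with `reindex`; the category values and frame names are made up for the example.

```python
import pandas as pd

# Hypothetical category values; the real dataset may use different labels.
train = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "CourseLevel": ["Foundation", "Pre-Masters"],
})
new = pd.DataFrame({"Gender": ["Male"], "CourseLevel": ["Foundation"]})

cat_cols = ["Gender", "CourseLevel"]
train_enc = pd.get_dummies(train, columns=cat_cols, dtype=int)

# Encode the new rows, then add any missing dummy columns as zeros
# and put the columns in the same order as the training frame.
new_enc = pd.get_dummies(new, columns=cat_cols, dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)
print(new_enc)
```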

In [None]:
# one-hot encoding of categorical columns
df = pd.get_dummies(df, columns=['CentreName','Gender','CourseLevel','ProgressionUniversity'],dtype=int)
df

In [None]:
# splitting data into features and target
features = df.drop('CompletedCourse', axis=1)
target = df['CompletedCourse']

In [None]:
# Checking for imbalance in the dataset
print("Original target distribution:")
print(target.value_counts(normalize=True), end="\n\n")

# Create a list of class names ordered by class label (0, then 1)
target_names = ["not completed" if i == 0 else "completed" for i in sorted(target.unique())]
print("Target class names:")
print(target_names)

In [None]:
# Checking for the target variable histogram
plt.figure(figsize=(6, 4))
plt.hist(df['CompletedCourse'], bins=[-0.5, 0.5, 1.5], edgecolor='black', alpha=0.7, color='skyblue')
plt.xlabel('Completed Course')
plt.ylabel('Frequency')
plt.title('Histogram of Completed Course variable')
plt.xticks([0, 1])
plt.show()

Plotting boxplots of the input features, grouped by target variable.

In [None]:
# Plot boxplots of the input features, grouped by target variable.
for feature in df[numerical_cols].columns:
  plt.figure(figsize=(6, 4))
  sns.boxplot(x=target, y=feature, data=df)
  plt.title(f'Boxplot of {feature} by Completed Course')
  plt.show()

## XGBoost Model
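In this notebook the `StandardScaler` is fitted on the full dataset before the train/test split, which lets test-set statistics influence the scaled training features. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. The `demo` frame and the `_demo` variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (values are random, not real).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.integers(16, 40, size=200).astype(float),
    "CompletedCourse": rng.integers(0, 2, size=200),
})

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    demo[["Age"]], demo["CompletedCourse"],
    test_size=0.2, random_state=42, stratify=demo["CompletedCourse"],
)

demo_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
X_train_scaled = demo_scaler.fit_transform(X_train_demo)
# ...and reuse those statistics for the held-out rows.
X_test_scaled = demo_scaler.transform(X_test_demo)
```

Tree-based models such as XGBoost are largely insensitive to feature scaling, so this matters mostly for the neural network sections.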

In [None]:
# split the data into training and test sets (80/20, stratified on the target)
X_train, X_test, y_train, y_test= train_test_split(features, target, test_size=0.2, random_state=42,stratify=target)

In [None]:
# model instantiated and fitted
model = xgb.XGBClassifier(random_state=42)
model.fit(X_train, y_train)

In [None]:
# model prediction
y_pred = model.predict(X_test)
y_pred

### Performance Indicators
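Since `CompletedCourse` is encoded as 1 for completion, the dropout class is label 0, and scikit-learn's default precision and recall describe the completed class instead. A small sketch of scoring the dropout class directly with `pos_label=0` follows; the label arrays are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = completed, 0 = dropped out.
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 1, 1, 0]

# Treat dropout (label 0) as the class of interest.
dropout_recall = recall_score(y_true_demo, y_pred_demo, pos_label=0)
dropout_precision = precision_score(y_true_demo, y_pred_demo, pos_label=0)
print(f"Dropout recall: {dropout_recall:.3f}, precision: {dropout_precision:.3f}")
```

The same numbers appear in the row for class 0 of `classification_report`; computing them explicitly makes it easier to track the metric that matters for identifying at-risk students.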

In [None]:
# Printing performance indicators
print("XGBoost Model Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# calculating AUC score
y_prob = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
print("AUC score:", auc_score)

In [None]:
# roc curve for XGBoost model
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label='ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for XGBoost Model')
plt.legend()
plt.show()

### Hyperparameter tuning of XGBoost model
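With the default `refit=True`, `GridSearchCV` retrains the best configuration on the whole training set and exposes it as `best_estimator_`, so the winning parameters do not have to be copied by hand. The sketch below shows the pattern on synthetic data. To keep the example light it swaps in scikit-learn's `GradientBoostingClassifier` and a much smaller grid than the one used in this section; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the course-level features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),  # stand-in for XGBClassifier
    param_grid={"learning_rate": [0.1, 0.2], "max_depth": [2, 3]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on all of X_tr with the best parameters.
best_model = search.best_estimator_
demo_test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_, demo_test_accuracy)
```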

In [None]:
# grid search over learning rate, max depth, and number of estimators
model = xgb.XGBClassifier(random_state=42)
param_grid = {
  'learning_rate': [0.01, 0.1, 0.2, 0.3],
  'max_depth': [3, 5, 7, 9],
  'n_estimators': [100, 200, 300, 400]
}
grid_search = GridSearchCV(
  estimator=model,
  param_grid=param_grid,
  scoring='accuracy',
  cv=5,
  n_jobs=-1,
  verbose=2
)

grid_search.fit(X_train, y_train)
print('Best parameters found: ', grid_search.best_params_)
print('Best accuracy found: ', grid_search.best_score_)

In [None]:
# fitting model with best parameters from hyper-parameter tuning
model = xgb.XGBClassifier(
  learning_rate=0.2,
  max_depth=3,
  n_estimators=100,
  random_state=42
)
model.fit(X_train, y_train)

### Performance Indicators

In [None]:
# Printing performance indicators
y_pred = model.predict(X_test)

print("XGBoost Model Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# calculating AUC score
y_prob = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
print("AUC score:", auc_score)

In [None]:
# roc curve for XGBoost model
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label='ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for XGBoost Model')
plt.legend()
plt.show()

### Feature Importance

In [None]:
# to identify important features
plt.figure(figsize=(12, 17))
feature_importance = pd.Series(model.feature_importances_, index=X_train.columns).sort_values()
feature_importance.plot.barh()
plt.show()

In [None]:
# shows top 10 important features
plt.figure(figsize=(12, 6))
feature_importance.iloc[-10:].plot.barh()
plt.show()

### Adding two new features to dataset ('ContactHours', 'AttendancePercentage')

The two features ('ContactHours', 'AttendancePercentage') are added back into the dataset, and the XGBoost model is retrained on the extended dataset.
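Since the same cleaning steps are applied to each feature set, one option is to wrap them in a single function and vary only the list of dropped columns. The sketch below is a simplified illustration on a made-up frame; the helper name `prepare_dataset` and the sample values are hypothetical, and the age and scaling steps are omitted for brevity.

```python
import pandas as pd

def prepare_dataset(raw: pd.DataFrame, drop_columns: list, categorical: list) -> pd.DataFrame:
    """Drop columns, remove missing and duplicate rows, then encode (hypothetical helper)."""
    out = raw.drop(columns=drop_columns).dropna().drop_duplicates().copy()
    out["CompletedCourse"] = out["CompletedCourse"].map({"Yes": 1, "No": 0})
    return pd.get_dummies(out, columns=categorical, dtype=int)

# Tiny made-up frame using a few of the dataset's column names.
raw = pd.DataFrame({
    "BookingId": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", None],
    "ContactHours": [120.0, 80.0, 95.0, 60.0],
    "CompletedCourse": ["Yes", "No", "Yes", "No"],
})

# The two feature sets differ only in the columns that are dropped.
without_hours = prepare_dataset(raw, ["BookingId", "ContactHours"], ["Gender"])
with_hours = prepare_dataset(raw, ["BookingId"], ["Gender"])
print(list(without_hours.columns))
print(list(with_hours.columns))
```

Keeping the pipeline in one place reduces the risk that the XGBoost and neural network experiments drift apart in how the data is prepared.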

In [None]:
columns= ['BookingId','BookingType', 'LeadSource', 'DiscountType', 'Nationality', 'HomeCountry', 'HomeState', 'HomeCity', 'PresentCount',
          'LateCount', 'AuthorisedAbsenceCount','ArrivedDate','NonCompletionReason', 'TerminationDate', 'CourseFirstIntakeDate', 'CourseStartDate',
          'CourseEndDate', 'AcademicYear', 'CourseName', 'LearnerCode', 'ProgressionDegree', 'EligibleToProgress', 'AssessedModules', 'PassedModules',
          'FailedModules']

# performing data cleaning and feature engineering steps on new dataset.
df = df_copy.drop(columns=columns)
df = df.dropna()
df = df.drop_duplicates()
df['DateofBirth'] = pd.to_datetime(df['DateofBirth'], format='%d/%m/%Y')
df['Age'] = df['DateofBirth'].apply(calculate_age)
df = df.drop(columns=['DateofBirth'])
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df['IsFirstIntake'] = df['IsFirstIntake'].map({True: 1, False: 0})
df['CompletedCourse'] = df['CompletedCourse'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, columns=['CentreName', 'Gender','CourseLevel','ProgressionUniversity'],dtype=int)
df

### XGBoost model

In [None]:
# splitting data into training and test sets.
features = df.drop('CompletedCourse', axis=1)
target = df['CompletedCourse']
X_train, X_test, y_train, y_test= train_test_split(features, target, test_size=0.2, random_state=42,stratify=target)

In [None]:
# model instantiated and fitted
model = xgb.XGBClassifier(random_state=42)
model.fit(X_train, y_train)

In [None]:
# model prediction
y_pred = model.predict(X_test)
y_pred

### Performance Indicators

In [None]:
# Printing performance indicators
print("XGBoost Model Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# calculating AUC score
y_prob = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
print("AUC score:", auc_score)

In [None]:
# roc curve for XGBoost model
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label='ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for XGBoost Model')
plt.legend()
plt.show()

### Hyperparameter tuning of XGBoost model

In [None]:
# grid search over learning rate, max depth, and number of estimators
model = xgb.XGBClassifier(random_state=42)
param_grid = {
  'learning_rate': [0.01, 0.1, 0.2, 0.3],
  'max_depth': [3, 5, 7, 9],
  'n_estimators': [100, 200, 300, 400]
}
grid_search = GridSearchCV(
  estimator=model,
  param_grid=param_grid,
  scoring='accuracy',
  cv=5,
  n_jobs=-1,
  verbose=2
)

grid_search.fit(X_train, y_train)
print('Best parameters found: ', grid_search.best_params_)
print('Best accuracy found: ', grid_search.best_score_)

In [None]:
# fitting model with best parameters
model = xgb.XGBClassifier(
  learning_rate=0.1,
  max_depth=5,
  n_estimators=400,
  random_state=42
)
model.fit(X_train, y_train)

### Performance Indicators

In [None]:
# Printing performance indicators
y_pred = model.predict(X_test)

print("XGBoost Model Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
y_prob = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
print("AUC score:", auc_score)

In [None]:
# roc curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label='ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for XGBoost Model')
plt.legend()
plt.show()

### Feature Importance

In [None]:
# feature importance plotted after adding two new features back to dataset.
plt.figure(figsize=(12, 20))
feature_importance = pd.Series(model.feature_importances_, index=X_train.columns).sort_values()

feature_importance.plot.barh()
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
feature_importance.iloc[-10:].plot.barh()
plt.show()

## Neural Network Model

The dataset is preprocessed again for the neural network model, starting from the saved copy and excluding 'ContactHours' and 'AttendancePercentage'.

In [None]:
columns= ['BookingId','BookingType', 'LeadSource', 'DiscountType', 'Nationality', 'HomeCountry', 'HomeState', 'HomeCity', 'PresentCount',
          'LateCount', 'AuthorisedAbsenceCount','ArrivedDate','NonCompletionReason', 'TerminationDate', 'CourseFirstIntakeDate', 'CourseStartDate',
          'CourseEndDate', 'AcademicYear', 'CourseName', 'LearnerCode', 'ProgressionDegree', 'EligibleToProgress', 'AssessedModules', 'PassedModules',
          'FailedModules','AttendancePercentage', 'ContactHours']

# performing data cleaning and feature engineering steps on new dataset.
df=df_copy.drop(columns=columns)
df = df.dropna()
df = df.drop_duplicates()
df['DateofBirth'] = pd.to_datetime(df['DateofBirth'], format='%d/%m/%Y')
df['Age'] = df['DateofBirth'].apply(calculate_age)
df = df.drop(columns=['DateofBirth'])
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df['IsFirstIntake'] = df['IsFirstIntake'].map({True: 1, False: 0})
df['CompletedCourse'] = df['CompletedCourse'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, columns=['CentreName', 'Gender','CourseLevel','ProgressionUniversity'],dtype=int)

df

In [None]:
# splitting data into training and test sets.
features = df.drop('CompletedCourse', axis=1)
target = df['CompletedCourse']
X_train, X_test, y_train, y_test= train_test_split(features, target, test_size=0.2, random_state=42,stratify=target)

In [None]:
model = Sequential([
  Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
  Dropout(0.2),
  Dense(32, activation='relu'),
  Dropout(0.2),
  Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
patience = 5

early_stopping = EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True)
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint])

In [None]:
# checking test loss and test accuracy of best model.
model = load_model('best_model.keras')
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print('test_loss', test_loss)
print('test_accuracy', test_accuracy)

### Performance Indicators

In [None]:
y_pred = model.predict(X_test)

# For binary classification, assume a threshold of 0.5
y_pred = (y_pred > 0.5).astype(int).flatten()

print("Neural Network Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# Get predicted probabilities for the test data
ptest = model.predict(X_test)

# Calculate the AUC score
auc = roc_auc_score(y_test, ptest)

# Print the AUC score
print(f"AUC score: {auc}")

# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, ptest)

plt.plot(fpr, tpr, color="orange", label="ROC")
plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend()
plt.show()

In [None]:
# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Extract accuracy values
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Plot the loss curves
axs[0].plot( train_loss, label='Training Loss')
axs[0].plot( val_loss, label='Validation Loss')
axs[0].set_title('Training and Validation Loss')
axs[0].set_xlabel('Epochs')
axs[0].set_ylabel('Loss')
axs[0].legend()



# Optionally, plot the accuracy curves
axs[1].plot( train_accuracy, label='Training Accuracy')
axs[1].plot(val_accuracy, label='Validation Accuracy')
axs[1].set_title('Training and Validation Accuracy')
axs[1].set_xlabel('Epochs')
axs[1].set_ylabel('Accuracy')
axs[1].legend()

plt.show()

### Hyperparameter tuning of Neural Network using looping

Defining the candidate values of each hyperparameter for tuning the model.
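Every combination of these values is trained once, so the size of the search is the product of the three list lengths. A short sketch using `itertools.product` makes the combinations explicit; the variable name `grid` is an assumption for illustration.

```python
from itertools import product

num_neurons = [(128, 64), (64, 32), (32, 16), (128, 32), (64, 16)]
optimizers = ["adam", "rmsprop", "sgd"]
activation_fn = ["relu", "tanh", "leaky_relu"]

# One tuple per training run: (layer sizes, optimizer, activation).
grid = list(product(num_neurons, optimizers, activation_fn))
print(len(grid))  # 45
print(grid[0])    # ((128, 64), 'adam', 'relu')
```

Iterating over `grid` is equivalent to three nested loops and gives the number of runs up front, which helps when estimating training time.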

In [None]:
# Define the number of neurons.
num_neurons = [(128,64),(64,32),(32,16),(128,32),(64,16)]

# Define the optimizers.
optimizers = ['adam', 'rmsprop','sgd']

# Define the activation function.
activation_fn = ['relu', 'tanh', 'leaky_relu']

In [None]:
# build and compile a network with two hidden layers for a given hyperparameter combination
def create_model(num_neurons, optimizers, activation_fn):
    model = Sequential()
    model.add(Dense(num_neurons[0], activation=activation_fn, input_shape=(X_train.shape[1],)))
    model.add(Dropout(0.2))
    model.add(Dense(num_neurons[1], activation=activation_fn))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))

    if optimizers == 'adam':
        opt = Adam()
    elif optimizers == 'rmsprop':
        opt = RMSprop()
    elif optimizers == 'sgd':
        opt = SGD()

    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

    return model

In [None]:
# Train models with different number of neurons, optimizers and activation function.
results = {}
patience = 5
best_val_accuracy = 0
best_hyperparams = None

for neurons in num_neurons:
    for opt in optimizers:
        for activation in activation_fn:
            print(f"Training with num_neurons={neurons}, optimizer={opt}, activation_fn={activation}")
            model = create_model(num_neurons=neurons, optimizers=opt, activation_fn=activation)
            early_stopping = EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
            model_checkpoint = ModelCheckpoint('temp_model.keras', monitor='val_loss', save_best_only=True)
            history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint])

            # Score this run at its best epoch (lowest val_loss), which is the epoch
            # whose weights were checkpointed and restored by early stopping.
            best_epoch = int(np.argmin(history.history['val_loss']))
            val_accuracy = history.history['val_accuracy'][best_epoch]
            results[(neurons, opt, activation)] = val_accuracy
            print(f"Validation accuracy: {val_accuracy}")

            # Check if the current model is the best one
            if val_accuracy > best_val_accuracy:
                best_val_accuracy = val_accuracy
                best_hyperparams = (neurons, opt, activation)


In [None]:
print(f"Best hyperparameters: {best_hyperparams}")
print(f"Best validation accuracy: {best_val_accuracy}")

In [None]:
# fitting model with best parameters from hyper-parameter tuning
model = Sequential([
  Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
  Dropout(0.2),
  Dense(32, activation='relu'),
  Dropout(0.2),
  Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

In [None]:
patience = 5

early_stopping = EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True)
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint])

In [None]:
# best model
model = load_model('best_model.keras')
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print('test_loss', test_loss)
print('test_accuracy', test_accuracy)

### Performance Indicators

In [None]:
y_pred = model.predict(X_test)


# For binary classification, assume a threshold of 0.5
y_pred = (y_pred > 0.5).astype(int).flatten()

print("Neural Network Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# Get predicted probabilities for the test data
ptest = model.predict(X_test)

# Calculate the AUC score
auc = roc_auc_score(y_test, ptest)

# Print the AUC score
print(f"AUC score: {auc}")

# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, ptest)

plt.plot(fpr, tpr, color="orange", label="ROC")
plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend()
plt.show()

In [None]:
# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Extract accuracy values (optional, for completeness)
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Plot the loss curves
axs[0].plot( train_loss, label='Training Loss')
axs[0].plot( val_loss, label='Validation Loss')
axs[0].set_title('Training and Validation Loss')
axs[0].set_xlabel('Epochs')
axs[0].set_ylabel('Loss')
axs[0].legend()

#plot the accuracy curves
axs[1].plot( train_accuracy, label='Training Accuracy')
axs[1].plot(val_accuracy, label='Validation Accuracy')
axs[1].set_title('Training and Validation Accuracy')
axs[1].set_xlabel('Epochs')
axs[1].set_ylabel('Accuracy')
axs[1].legend()
plt.show()

### Adding two new features to dataset ('ContactHours', 'AttendancePercentage')

The two features ('ContactHours', 'AttendancePercentage') are added back into the dataset, and the neural network model is retrained on the extended dataset.

In [None]:
columns= ['BookingId','BookingType', 'LeadSource', 'DiscountType', 'Nationality', 'HomeCountry', 'HomeState', 'HomeCity', 'PresentCount',
          'LateCount', 'AuthorisedAbsenceCount','ArrivedDate','NonCompletionReason', 'TerminationDate', 'CourseFirstIntakeDate', 'CourseStartDate',
          'CourseEndDate', 'AcademicYear', 'CourseName', 'LearnerCode', 'ProgressionDegree', 'EligibleToProgress', 'AssessedModules', 'PassedModules',
          'FailedModules']

# performing data cleaning and feature engineering steps on new dataset.
df = df_copy.drop(columns=columns)
df = df.dropna()
df = df.drop_duplicates()
df['DateofBirth'] = pd.to_datetime(df['DateofBirth'], format='%d/%m/%Y')
df['Age'] = df['DateofBirth'].apply(calculate_age)
df = df.drop(columns=['DateofBirth'])
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df['IsFirstIntake'] = df['IsFirstIntake'].map({True: 1, False: 0})
df['CompletedCourse'] = df['CompletedCourse'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, columns=['CentreName', 'Gender','CourseLevel','ProgressionUniversity'],dtype=int)

df

In [None]:
# splitting data into training and test sets.
features = df.drop('CompletedCourse', axis=1)
target = df['CompletedCourse']
X_train, X_test, y_train, y_test= train_test_split(features, target, test_size=0.2, random_state=42,stratify=target)

In [None]:
model = Sequential([
  Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
  Dropout(0.2),
  Dense(32, activation='relu'),
  Dropout(0.2),
  Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
patience = 5

early_stopping = EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True)
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint])

In [None]:
# best model
model = load_model('best_model.keras')
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print('test_loss', test_loss)
print('test_accuracy', test_accuracy)

### Performance Indicators

In [None]:
y_pred = model.predict(X_test)

# For binary classification, assume a threshold of 0.5
y_pred = (y_pred > 0.5).astype(int).flatten()

print("Neural Network Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# Get predicted probabilities for the test data
ptest = model.predict(X_test)

# Calculate the AUC score
auc = roc_auc_score(y_test, ptest)

# Print the AUC score
print(f"AUC score: {auc}")

# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, ptest)

plt.plot(fpr, tpr, color="orange", label="ROC")
plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend()
plt.show()

In [None]:
# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Extract accuracy values (optional, for completeness)
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Plot the loss curves
axs[0].plot( train_loss, label='Training Loss')
axs[0].plot( val_loss, label='Validation Loss')
axs[0].set_title('Training and Validation Loss')
axs[0].set_xlabel('Epochs')
axs[0].set_ylabel('Loss')
axs[0].legend()



# Optionally, plot the accuracy curves
axs[1].plot( train_accuracy, label='Training Accuracy')
axs[1].plot(val_accuracy, label='Validation Accuracy')
axs[1].set_title('Training and Validation Accuracy')
axs[1].set_xlabel('Epochs')
axs[1].set_ylabel('Accuracy')
axs[1].legend()

plt.show()

### Hyperparameter tuning of Neural Network using looping

Defining the candidate values of each hyperparameter for tuning the model.
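The tuning loop in this notebook uses the test set as validation data, so the same rows both select the hyperparameters and report the final score, which tends to make that score optimistic. One common alternative, sketched here on synthetic arrays, is a three-way split in which a separate validation set drives early stopping and model selection. The `_demo` names and the 60/20/20 proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and balanced binary labels.
X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.array([0, 1] * 250)

# First hold out 20% as the final test set.
X_trainval_demo, X_test_demo, y_trainval_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Then take 25% of the remainder (20% of the total) as the validation set.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval_demo, y_trainval_demo,
    test_size=0.25, random_state=42, stratify=y_trainval_demo,
)
print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 300 100 100
```

With this layout, `validation_data` would receive the validation arrays, and the test arrays would be used only once for the final evaluation.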

In [None]:
# Define the number of neurons.
num_neurons = [(128,64),(64,32),(32,16),(128,32),(64,16)]

# Define the optimizers.
optimizers = ['adam', 'rmsprop','sgd']

# Define the activation function.
activation_fn = ['relu', 'tanh', 'leaky_relu']

In [None]:
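# Same builder as in the earlier tuning section, except that the first hidden layer
# is fixed to ReLU; activation_fn is applied to the second hidden layer only.
def create_model(num_neurons, optimizers, activation_fn):
    model = Sequential()
    model.add(Dense(num_neurons[0], activation='relu', input_shape=(X_train.shape[1],)))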
    model.add(Dropout(0.2))
    model.add(Dense(num_neurons[1], activation=activation_fn))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))

    if optimizers == 'adam':
        opt = Adam()
    elif optimizers == 'rmsprop':
        opt = RMSprop()
    elif optimizers == 'sgd':
        opt = SGD()

    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

    return model

In [None]:
# Train models with different number of neurons, optimizers and activation function.
results = {}
patience = 5
best_val_accuracy = 0
best_hyperparams = None

for neurons in num_neurons:
    for opt in optimizers:
        for activation in activation_fn:
            print(f"Training with num_neurons={neurons}, optimizer={opt}, activation_fn={activation}")
            model = create_model(num_neurons=neurons, optimizers=opt, activation_fn=activation)
            early_stopping = EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
            model_checkpoint = ModelCheckpoint('temp_model.keras', monitor='val_loss', save_best_only=True)
            history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint])

            # Score this run at its best epoch (lowest val_loss), which is the epoch
            # whose weights were checkpointed and restored by early stopping.
            best_epoch = int(np.argmin(history.history['val_loss']))
            val_accuracy = history.history['val_accuracy'][best_epoch]
            results[(neurons, opt, activation)] = val_accuracy
            print(f"Validation accuracy: {val_accuracy}")

            # Check if the current model is the best one
            if val_accuracy > best_val_accuracy:
                best_val_accuracy = val_accuracy
                best_hyperparams = (neurons, opt, activation)


In [None]:
print(f"Best hyperparameters: {best_hyperparams}")
print(f"Best validation accuracy: {best_val_accuracy}")

In [None]:
# fitting model with best parameters from hyper-parameter tuning
model = Sequential([
  Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
  Dropout(0.2),
  Dense(64, activation='tanh'),
  Dropout(0.2),
  Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
patience = 5

early_stopping = EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True)
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint])

In [None]:
model = load_model('best_model.keras')
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print('test_loss', test_loss)
print('test_accuracy', test_accuracy)

### Performance Indicators

In [None]:
y_pred = model.predict(X_test)


# For binary classification, assume a threshold of 0.5
y_pred = (y_pred > 0.5).astype(int).flatten()

print("Neural Network Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
cfmd = ConfusionMatrixDisplay(confusion_matrix=cm)
cfmd.plot()
plt.show()

In [None]:
# Get predicted probabilities for the test data
ptest = model.predict(X_test)

# Calculate the AUC score
auc = roc_auc_score(y_test, ptest)

# Print the AUC score
print(f"AUC score: {auc}")

# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, ptest)

plt.plot(fpr, tpr, color="orange", label="ROC")
plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend()
plt.show()

In [None]:
# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Extract accuracy values (optional, for completeness)
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Plot the loss curves
axs[0].plot( train_loss, label='Training Loss')
axs[0].plot( val_loss, label='Validation Loss')
axs[0].set_title('Training and Validation Loss')
axs[0].set_xlabel('Epochs')
axs[0].set_ylabel('Loss')
axs[0].legend()



#plot the accuracy curves
axs[1].plot( train_accuracy, label='Training Accuracy')
axs[1].plot(val_accuracy, label='Validation Accuracy')
axs[1].set_title('Training and Validation Accuracy')
axs[1].set_xlabel('Epochs')
axs[1].set_ylabel('Accuracy')
axs[1].legend()
plt.show()