
**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# Mini-project 6.3 Applying supervised learning to predict student dropout rate

**Welcome to your Mini-project: Applying supervised learning to predict student dropout rate!**

In this project, we will examine student data and use supervised learning techniques to predict whether a student will drop out. In the education sector, retaining students is vital for the institution's financial stability and for students’ academic success and personal development. A high dropout rate can lead to significant revenue loss, diminished institutional reputation, and lower overall student satisfaction.

Please set aside approximately **12 hours** to complete the mini-project.

<br></br>

## **Business context**
Study Group specialises in providing educational services and resources to students and professionals across various fields. The company's primary focus is on enhancing learning experiences through a range of services, including online courses, tutoring, and educational consulting. By leveraging cutting-edge technology and a team of experienced educators, Study Group aims to bridge the gap between traditional learning methods and the evolving needs of today's learners.

Study Group serves its university partners by establishing strategic partnerships to enhance the universities’ global reach and diversity. It supports the universities in their efforts to attract international students, thereby enriching the cultural and academic landscape of their campuses. It works closely with university faculty and staff to ensure that the universities are prepared and equipped to welcome and support a growing international student body. Its partnership with universities also offers international students a seamless transition into their chosen academic environment. Study Group runs several International Study Centres across the UK and Dublin in partnership with universities with the aim of preparing a pipeline of talented international students from diverse backgrounds for degree study. These centres help international students adapt to the academic, cultural, and social aspects of studying abroad. This is achieved by improving conversational and subject-specific language skills and academic readiness before students progress to a full degree programme at university.

Through its comprehensive suite of services, it supports learners and universities at every stage of their educational journey, from high school to postgraduate studies. Its approach is tailored to meet the unique needs of each learner, offering personalised learning paths and flexible scheduling options to accommodate various learning styles and commitments.

Study Group's services are designed to be accessible and affordable, making quality education a reality for many individuals. By focusing on the integration of technology and personalised learning, the company aims to empower learners to achieve their full potential and succeed in their academic and professional pursuits. Study Group is at the forefront of transforming how people learn and grow through its dedication to innovation and excellence.

Study Group has provided you with a course-level data set.


<br></br>

## **Objective**
By the end of this mini-project, you will have developed the skills and knowledge to apply advanced machine learning techniques to create a predictive model for student dropout. This project will involve comprehensive data exploration, preprocessing, and feature engineering to ensure high-quality input for the models. You will employ and compare multiple predictive algorithms, namely XGBoost and a neural network-based model, to determine the most effective model for predicting student dropout.

In the Notebook, you will:
- explore the data set
- preprocess the data and conduct feature engineering
- predict dropout using XGBoost and a neural network-based model
- identify the most important predictors of dropout.


You will also write a report summarising the results of your findings and recommendations.

<br></br>

## **Assessment criteria**
By completing this project, you will be able to provide evidence that you can:
- develop accurate predictions across diverse organisational scenarios by building and testing advanced machine learning models
- inform data-driven decision-making with advanced machine learning algorithms and models
- propose and present effective solutions to organisational problems using data preprocessing, model selection, and insightful analysis techniques.

<br></br>

## **Project guidance**

Data preparation
1. Import the required libraries and data set with the provided URL.
  - Data set drive: https://drive.google.com/drive/folders/130AVMFxTOtRiC7GOl7QmSo0I7B0iChv5
2. Read the course-level CSV file and load it into a dataframe.

3. From the dataframe, remove the following columns:

columns = ['BookingId', 'BookingType', 'LeadSource', 'DiscountType',
           'Nationality', 'HomeCountry', 'HomeState', 'HomeCity',
           'PresentCount', 'LateCount', 'AuthorisedAbsenceCount',
           'ArrivedDate', 'NonCompletionReason', 'TerminationDate',
           'CourseFirstIntakeDate', 'CourseStartDate', 'CourseEndDate',
           'AcademicYear', 'CourseName', 'LearnerCode',
           'ProgressionDegree', 'EligibleToProgress',
           'AssessedModules', 'PassedModules', 'FailedModules',
           'AttendancePercentage', 'ContactHours']

From here on, you will perform the rest of the activities mentioned in the rubric using the smaller set of features that remains after the above step.
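
The column removal in step 3 could be sketched as follows. This is a minimal illustration, not the required solution: the helper name `drop_listed_columns`, the `errors="ignore"` choice, and the tiny demo frame (including its `Age` column) are assumptions made for the example and are not part of the provided data set.

```python
import pandas as pd

# Columns listed in step 3 of the project guidance.
COLUMNS_TO_DROP = [
    "BookingId", "BookingType", "LeadSource", "DiscountType",
    "Nationality", "HomeCountry", "HomeState", "HomeCity",
    "PresentCount", "LateCount", "AuthorisedAbsenceCount", "ArrivedDate",
    "NonCompletionReason", "TerminationDate", "CourseFirstIntakeDate",
    "CourseStartDate", "CourseEndDate", "AcademicYear", "CourseName",
    "LearnerCode", "ProgressionDegree", "EligibleToProgress",
    "AssessedModules", "PassedModules", "FailedModules",
    "AttendancePercentage", "ContactHours",
]


def drop_listed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new dataframe without the columns listed above (hypothetical helper)."""
    # errors="ignore" skips any listed column that is absent from the file;
    # remove it if you would rather get a KeyError for a misspelt name.
    return df.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# Tiny made-up frame for illustration only; "Age" is a hypothetical column.
demo = pd.DataFrame({"BookingId": [1, 2], "CourseName": ["A", "B"], "Age": [18, 21]})
reduced = drop_listed_columns(demo)
print(reduced.columns.tolist())  # ['Age']
```

Because `drop` returns a new dataframe, the original stays available if you want to compare column counts before and after.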

General instructions that apply throughout this project activity:
  - Use the standard scaler to scale your numeric input features.
  - Split the data into train and test sets. Apply 80-20 split.
  - Print accuracy, confusion matrix, precision, recall and AUC on the test set
    for all your models.
  - Compare the test-set performance of the non-optimised model with that of
    the best-performing model. Record your observations. What differences do you see, and which metrics improve and which do not?
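
The general instructions above (80-20 split, standard scaling, and the five test-set metrics) could be sketched as below. This is only an illustration under stated assumptions: the data is synthetic, `LogisticRegression` is a stand-in for the XGBoost and neural network models required by the rubric, and the `evaluate` helper, the stratified split, and the fixed random seed are choices made for the example rather than project requirements.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 rows, 4 numeric features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 80-20 split; stratify keeps the class balance similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training set only, then apply it to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Placeholder model; swap in your own classifier here.
model = LogisticRegression().fit(X_train_scaled, y_train)


def evaluate(fitted_model, X_eval, y_eval):
    """Collect the metrics requested in the guidance (hypothetical helper)."""
    y_pred = fitted_model.predict(X_eval)
    y_prob = fitted_model.predict_proba(X_eval)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_eval, y_pred),
        "confusion_matrix": confusion_matrix(y_eval, y_pred),
        "precision": precision_score(y_eval, y_pred),
        "recall": recall_score(y_eval, y_pred),
        "auc": roc_auc_score(y_eval, y_prob),
    }


results = evaluate(model, X_test_scaled, y_test)
for name, value in results.items():
    print(name, value, sep=": ")
```

Fitting the scaler on the training rows only avoids leaking test-set statistics into training. AUC is computed from predicted probabilities rather than hard labels; a Keras model has no `predict_proba`, so for the neural network you would typically take the output of `predict` as the probability and threshold it to obtain class labels.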

## Please refer to the Rubric for specific steps to be performed as part of the project activity. Every step mentioned in the rubric will be assessed separately.

Report
1. Document your approach and major inferences from the data analysis and describe which method provided the best results and why.
  - Please ensure you include a discussion of which features predict student dropout.
2. When you’ve completed the project:
  - Download your completed Notebook as an IPYNB (Jupyter Notebook) or PY (Python) file. Save the file as follows: **LastName_FirstName_CAM_C201_Week_6_Mini-project**.
  - Prepare a detailed report (800 to 1,000 words) that includes:
    - an overview of your approach
    - a description of your analysis
    - an explanation of the insights you identified
    - a summary of which method gave the best results
    - an explanation of visualisations you created.
  - Save the document as a PDF named according to the following convention: **LastName_FirstName_CAM_C201_Week_6_Mini-project.pdf**.
  


<br></br>
> **Declaration**
>
> By submitting your project, you indicate that the work is your own and has been created with academic integrity. Refer to the Cambridge plagiarism regulations.

In [None]:
# Standard libraries (if applicable)
import numpy as np
import pandas as pd
import io
import requests

# Machine Learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
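# Metrics requested in the guidance: accuracy, confusion matrix, precision, recall, AUC
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)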

# TensorFlow / Keras libraries
import tensorflow as tf
# Import everything from tensorflow.keras to avoid mixing it with standalone keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

In [None]:
from google.colab import files
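# Opens a file picker in Colab; returns a dict mapping file names to file contents
uploaded = files.upload()

# Take the name of the first (and normally only) uploaded file
filename = next(iter(uploaded))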

df = pd.read_csv(filename)

df.head()