# Introduction
Last but not least, with all the pieces in place, we can finally implement machine learning to predict student dropouts!

## Machine Learning

![MachineLearningProcess.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/MachineLearningProcess.png)

We put this section on all of the projects so bear with us if you've seen this before. 

Generally, the machine learning process has five parts:
1. <strong>Split your data into train and test sets</strong>
2. <strong>Model creation</strong>
<br>
Import your models from sklearn and instantiate them (assign model object to a variable)
3. <strong>Model fitting</strong>
<br>
Fit the model to your training data so that it learns from it
4. <strong>Model prediction</strong>
<br>
Make a set of predictions using your test data
5. <strong>Model assessment</strong>
<br>
Compare your predictions with ground truth in test data
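
The five parts above can be sketched end to end as follows. This is only a minimal illustration, assuming synthetic data from `make_classification` rather than the project dataset; the variable names and the choice of `LogisticRegression` are ours, picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the project dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 1. Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Model creation: instantiate the model and assign it to a variable
model = LogisticRegression(max_iter=1000)

# 3. Model fitting: the model learns from the training data
model.fit(X_train, y_train)

# 4. Model prediction: predict labels for the unseen test data
y_pred = model.predict(X_test)

# 5. Model assessment: compare predictions with the ground truth
print(f1_score(y_test, y_pred))
```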

Highly recommended readings:
1. [Important] https://scipy-lectures.org/packages/scikit-learn/index.html
2. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
3. https://scikit-learn.org/stable/tutorial/basic/tutorial.html

### Step 1: Import your libraries
We will be using models from sklearn - a popular machine learning library. However, we won't import everything from sklearn; we'll take just what we need. 

We'll need to import plotting libraries to plot our predictions against the ground truth (test data). 

Import the following:
1. pandas
2. matplotlib.pyplot as plt
3. seaborn
4. numpy

In [None]:
# Step 1: Import the libraries

### Step 2: Read the CSV from Part IV Step 
In this section, we will read the CSV that we prepared from Part IV. 

Sanity check:
1. 28,875 rows
2. 42 columns
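
A sanity check like this can be done by comparing `df.shape` with the expected numbers. The sketch below reads a tiny in-memory CSV as a stand-in, since the file name from Part IV is yours to fill in; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the Part IV CSV (assumption: in the notebook you would pass
# the path of your own file to pd.read_csv instead of this buffer)
csv_text = "feature_a,feature_b,final_result\n1,0.5,1\n2,0.1,0\n3,0.9,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Sanity check: compare the shape against what you expect.
# For the project data the expected shape would be (28875, 42).
expected_shape = (3, 3)
print(df.shape)
assert df.shape == expected_shape, f"unexpected shape: {df.shape}"
```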

In [None]:
# Step 2: Read your CSV 

### Step 3: Prepare your independent and dependent variables
Before we jump into training, we will split our DataFrame into two parts - independent and dependent variables. 

We'll be preparing a DataFrame containing our independent variables, and a separate list containing the "final_result".

1. Declare a variable, and assign your independent variables to it, i.e. drop "final_result" from the DataFrame
2. Declare a variable, and assign only values from "final_result"
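
One possible way to do this is sketched below on a toy DataFrame. Only the "final_result" column name comes from the project; the other columns and the variable names `X` and `y` are assumptions for illustration. Here `y` is a pandas Series, which sklearn accepts just like a list.

```python
import pandas as pd

# Toy DataFrame standing in for the project data (column names other than
# "final_result" are made up for illustration)
df = pd.DataFrame(
    {
        "feature_a": [1, 2, 3, 4],
        "feature_b": [0.5, 0.1, 0.9, 0.3],
        "final_result": [1, 0, 1, 0],
    }
)

# Independent variables: everything except "final_result"
X = df.drop(columns=["final_result"])

# Dependent variable: only the values of "final_result"
y = df["final_result"]

print(X.shape, y.shape)
```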

In [None]:
# Step 3: Prepare your independent and dependent variables

### Step 4: Import machine learning libraries
Time to import other libraries. We hope you've taken a look at the recommended readings at the start of this notebook because they'll be useful. 

Import the following libraries and methods:
1. train_test_split - sklearn.model_selection
2. DummyClassifier - sklearn.dummy
3. LogisticRegression - sklearn.linear_model
4. DecisionTreeClassifier - sklearn.tree
5. RandomForestClassifier - sklearn.ensemble
6. GradientBoostingClassifier - sklearn.ensemble
7. f1_score - sklearn.metrics
8. confusion_matrix - sklearn.metrics
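
Written out, the imports could look like the block below. Note that the gradient boosting class in scikit-learn is named `GradientBoostingClassifier`.

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
```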

In [None]:
# Step 4: Import the machine learning libraries

### Step 5: Split your independent and dependent variables into train and test sets
We'll be using an 80/20 split for the train and test sets respectively, using the train_test_split function. We will also stratify by y so that the class proportions of our dependent variable are the same in both sets.
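
To see what stratifying does, here is a small sketch on made-up, imbalanced data (70% class 0, 30% class 1). The data and the `random_state` value are assumptions for illustration; in the notebook you would pass your own variables from Step 3.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 70% class 0, 30% class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# 80/20 split, stratified on y so both sets keep the 70/30 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The share of class 1 is the same in the train and test sets
print(y_train.mean(), y_test.mean())
```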

In [None]:
# Step 5: Split your data into train and test

### Step 6: Train your machine learning model
Once you've split your data, machine learning begins. 

This is what you'll need to do:
1. Start with a model
2. Declare a variable, and store your model in it (don't forget to use brackets)
3. Fit your training data into the instantiated model
4. Declare a variable that contains predictions from the model you just trained, using the test dataset (X_test)
5. Compare the prediction with the actual result (y_test) with f1_score and confusion matrix

We will start with DummyClassifier to establish a baseline for your predictions. 

Also, the recommended readings will be very helpful.
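
As a rough sketch of the steps above, here is a baseline with DummyClassifier on synthetic data. The `strategy="stratified"` setting is our assumption to mimic random guessing; the default strategy depends on your sklearn version. Also note that `f1_score` with its default settings assumes binary 0/1 labels, so you may need the `average` or `pos_label` argument if yours are encoded differently.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the project dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# strategy="stratified" guesses at random following the class proportions
dummy_model = DummyClassifier(strategy="stratified", random_state=0)
dummy_model.fit(X_train, y_train)

# Predict on the test set and compare with the ground truth
dummy_pred = dummy_model.predict(X_test)
dummy_cm = confusion_matrix(y_test, dummy_pred)
print(f1_score(y_test, dummy_pred))
print(dummy_cm)
```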

In [None]:
# Step 6a: Declare a variable to store the dummy model

# Step 6b: Fit your train dataset

# Step 6c: Declare a variable and store your predictions that you make with your model using X test data

# Step 6d: Print the f1_score between y_test and your prediction

# Step 6e: Print the confusion matrix between y_test and your prediction

### Step 7: Repeat Step 6 with LogisticRegression
The performance for DummyClassifier is bad, and expectedly so because it's what you'd see if you randomly guess. 

Now, let's use other models to perform the classification.
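
Since the fit, predict and assess pattern repeats for every model, one option is to wrap it in a small helper. The function name `evaluate_model` and the synthetic data are our own assumptions for illustration, not part of sklearn or the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model, predict on the test set, return (f1, confusion matrix)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return f1_score(y_test, y_pred), confusion_matrix(y_test, y_pred)


# Synthetic stand-in data (assumption: not the project dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Any sklearn classifier can be dropped into this dictionary
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    f1, cm = evaluate_model(model, X_train, X_test, y_train, y_test)
    results[name] = f1
    print(name, round(f1, 3))
    print(cm)
```

Writing each step out by hand, as the cells below ask, is still good practice the first few times.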

In [None]:
# Step 7a: Declare a variable to store the model

# Step 7b: Fit your train dataset

# Step 7c: Declare a variable and store your predictions that you make with your model using X test data

# Step 7d: Print the f1_score between y_test and your prediction

# Step 7e: Print the confusion matrix between y_test and your prediction


### Step 8: Repeat Step 6 with DecisionTreeClassifier
Using LogisticRegression is not bad at all - we see the f1_score jump from 0.4+ to 0.7-0.8.

What happens when we use other models? Let's find out!

In [None]:
# Step 8a: Declare a variable to store the model

# Step 8b: Fit your train dataset

# Step 8c: Declare a variable and store your predictions that you make with your model using X test data

# Step 8d: Print the f1_score between y_test and your prediction

# Step 8e: Print the confusion matrix between y_test and your prediction


### Step 9: Repeat Step 6 with RandomForestClassifier
Not bad, not bad - we should expect either slight improvements or on par performance. 

Next up, we will use a RandomForestClassifier. 

In [None]:
# Step 9a: Declare a variable to store the model

# Step 9b: Fit your train dataset

# Step 9c: Declare a variable and store your predictions that you make with your model using X test data

# Step 9d: Print the f1_score between y_test and your prediction

# Step 9e: Print the confusion matrix between y_test and your prediction


### [Important] Step 10: Repeat Step 6 with GradientBoostingClassifier
The performance will improve once again!

There are many models out there, and different models work differently depending on the dataset. 

Last one - we'll give GradientBoostingClassifier a try.

In [None]:
# Step 10a: Declare a variable to store the model

# Step 10b: Fit your train dataset

# Step 10c: Declare a variable and store your predictions that you make with your model using X test data

# Step 10d: Print the f1_score between y_test and your prediction

# Step 10e: Print the confusion matrix between y_test and your prediction


### Step 11: Get feature_importances of your Step 10 model
Now that we have a great model, we can take a closer look at the feature importances to identify which features matter, not just for the model but for the business context as well.

We can use the .feature_importances_ attribute of our model to get a list containing the importance of each feature. 

However, this list contains only numbers, so we can create a DataFrame that pairs each importance with its feature name.

![FeatureImportances.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/FeatureImportances.png)

Your results will differ from ours so don't be alarmed. 
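
A minimal sketch of that DataFrame is shown below, assuming synthetic data with made-up feature names. The column names "feature" and "importance" are our own choice; in the notebook you would use your trained Step 10 model and the columns of your independent variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with hypothetical feature names
X_array, y = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]
X = pd.DataFrame(X_array, columns=feature_names)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Pair each feature name with its importance, most important first
importances = pd.DataFrame(
    {"feature": X.columns, "importance": model.feature_importances_}
).sort_values("importance", ascending=False)
print(importances)
```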

In [None]:
# Step 11: Create a DataFrame containing feature importances

## Hyperparameter tuning
The GradientBoostingClassifier model seems to perform the best so far. 

Is that the end? More or less.

But can we improve it even further? Quite possibly, yes - with hyperparameter tuning. 

In this section, we will perform hyperparameter tuning to get the best parameters and improve our GradientBoostingClassifier. 

### Step 12: Import GridSearchCV
We're tuning our model, so let's import:
1. GridSearchCV from sklearn.model_selection

Useful reading: https://scikit-learn.org/stable/modules/grid_search.html

In [None]:
# Step 12: Import GridSearchCV

### Step 13: Define a parameter grid
What is a parameter grid? A parameter grid contains a set of parameters that you'd like to explore, along with the values. 

For our own GridSearchCV, let's explore the following:
1. n_estimators - 100, 250, 500, 750, 1000, 1250, 1500, 1750
2. max_depth - 2, 3, 4, 5, 6, 7
3. learning_rate - 0.15, 0.1, 0.05, 0.01, 0.005, 0.001

You can choose any combination of the above, though take note that if you use all three sets of parameters to explore, you will need to run this all night.
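
To get a feel for why the full grid takes so long, the sketch below writes the grid as a dictionary and counts the work involved. It assumes 5-fold cross-validation; the variable names are ours.

```python
# Parameter grid as a dictionary: parameter name -> list of values to try
param_grid = {
    "n_estimators": [100, 250, 500, 750, 1000, 1250, 1500, 1750],
    "max_depth": [2, 3, 4, 5, 6, 7],
    "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
}

# Number of parameter combinations the grid search will evaluate
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)

# With 5-fold cross-validation, every combination is fitted 5 times
n_fits = n_combinations * 5
print(n_combinations, n_fits)  # 288 1440
```

That is 8 x 6 x 6 = 288 combinations and 1,440 model fits, which is why picking a smaller subset of values is a sensible first run.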

In [None]:
# Step 13: Define your parameter grid

### Step 14: Create a GridSearchCV object
Declare a variable containing your GridSearchCV with the following parameters:
1. estimator - GradientBoostingClassifier()
2. param_grid - your Step 13 variable
3. scoring - 'accuracy'
4. n_jobs - 4*
5. cv - 5

*this will speed your run up slightly
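
Put together, the object could be declared as in this sketch. The small `param_grid` here is an assumption to keep the example short; you would pass the grid you defined in Step 13.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# A deliberately small grid (assumption: a subset chosen to keep the run short)
param_grid = {
    "n_estimators": [100, 250],
    "max_depth": [2, 3],
}

# Nothing is trained yet; this only describes the search to run
grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=param_grid,
    scoring="accuracy",
    n_jobs=4,
    cv=5,
)
```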

In [None]:
# Step 14: Declare a GridSearchCV with the parameters specified

### [Caution] Step 15: Fit your GridSearchCV object with training data
We put caution here because of the amount of time it will take for your grid search. 

Allocate a fair amount of time, i.e. a few hours, for the grid search to finish.

In [None]:
# Step 15: Fit your X_train and y_train into your GridSearchCV object

### Step 16: Get the best parameters from the grid search
Once you're done, get the best parameters from your GridSearchCV object using the .best_params_ attribute.
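
Here is a small sketch of fitting a grid search and reading the result. It assumes synthetic data, a tiny grid, and 3-fold cross-validation so that it finishes in seconds; your own search will use the settings from Step 14.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X_train, y_train = make_classification(
    n_samples=200, n_features=5, random_state=0
)

# Tiny grid and cv=3 so the sketch runs quickly
grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 3]},
    scoring="accuracy",
    cv=3,
)
grid_search.fit(X_train, y_train)

# Dictionary with the best value found for each parameter in the grid
print(grid_search.best_params_)
# Mean cross-validated score of that best combination
print(grid_search.best_score_)
```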

In [None]:
# Step 16: Get the best parameters from the grid search

### Step 17: Use your model to make predictions
Now that you've identified the best parameters for your model, go ahead and use the GridSearchCV object like a model.

Repeat Step 6 and see how much your tuned model has improved.
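
The sketch below shows the idea on synthetic data with a tiny grid, both assumptions made to keep the run short. With the default `refit=True`, a fitted GridSearchCV retrains the best combination on the whole training set, so calling `.predict` on it uses that best model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the project dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 3]},
    scoring="accuracy",
    cv=3,
)
grid_search.fit(X_train, y_train)

# predict() is delegated to the refitted best estimator
tuned_pred = grid_search.predict(X_test)
tuned_cm = confusion_matrix(y_test, tuned_pred)
print(f1_score(y_test, tuned_pred))
print(tuned_cm)
```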

In [None]:
# Step 17a: Declare a variable and store your predictions that you make with your GridSearchCV using X test data

# Step 17b: Print the f1_score between y_test and your prediction

# Step 17c: Print the confusion matrix between y_test and your prediction


With tuning, you can see a decent improvement in your model performance, which is great! 

### Step 18: Get feature_importances of your best performing tuned model
Now that we're done with our hyperparameter tuning, we can take another look at the best performing tuned model and assess the feature importances.

Unlike Step 11, you'll have to get the best performing model/estimator first before you can retrieve the feature importances. 

<strong>Hint: Google "feature importance gridsearchcv"</strong>
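
If you get stuck, one possible approach is sketched below: a fitted GridSearchCV exposes the refitted best model through its `.best_estimator_` attribute, and that model has `.feature_importances_` as before. The synthetic data, feature names and tiny grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with hypothetical feature names
X_array, y_train = make_classification(
    n_samples=200, n_features=5, random_state=0
)
X_train = pd.DataFrame(X_array, columns=[f"feature_{i}" for i in range(5)])

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 3]},
    scoring="accuracy",
    cv=3,
)
grid_search.fit(X_train, y_train)

# The best model, refitted on the whole training set
best_model = grid_search.best_estimator_

# Pair each feature name with its importance, most important first
tuned_importances = pd.DataFrame(
    {"feature": X_train.columns, "importance": best_model.feature_importances_}
).sort_values("importance", ascending=False)
print(tuned_importances)
```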

In [None]:
# Step 18: Get the feature importance from the best estimator in your GridSearchCV

# The end
And that's the end! To recap, you've:
1. Retrieved learning analytics data from Open University
2. Combined disparate pieces of data into a more coherent and complete dataset
3. Explored the data through visualization
4. Engineered new features for machine learning modelling
5. Trained a machine learning model to predict student pass/fail
6. Performed hyperparameter tuning to improve the best performing model even more

Go on, give yourself a pat on the back. We hope this project series has given you more confidence in coding and machine learning. 

You have successfully implemented machine learning to predict students' pass/fail outcomes for their courses. There's always room for improvement, such as getting more data from Open University and updating the model.

That is the fate of a data scientist, to pursue better models that can help model the world out there.  

Whatever you learn here is but the tip of the iceberg, and a launchpad for bigger and better things to come. Come join us in our Telegram community over at https://bit.ly/UpLevelSG and our Facebook page at https://fb.com/UpLevelSG