# Predicting student performance with a random forest

In this notebook, I build a student performance predictor using a random forest classifier and data on student performance ([data source](https://www.kaggle.com/aljarah/xAPI-Edu-Data)) collected from a learning management system (LMS) called Kalboard 360.

The random forest's hyperparameters are tuned with GridSearchCV.

### Key findings

1) Based on the data of 480 students and 16 features, **we are able to train a model that can predict a student's performance (i.e. Low, Medium, or High) with an average accuracy of 94%. Specifically, it's able to predict Low performers with 98% accuracy.**

2) While the data is sparse, our model is able to learn from the data and tell us that the following features play a bigger part in determining student performance in this learning management system:
- Number of times the student raised their hand
- Number of times the student visited course content
- Number of times the student checked new announcements
- Number of times the student participated in discussion groups
- Whether or not the student was absent for more than 7 days

3) Other factors, such as nationality, gender, topic, and parent participation, are less important in predicting a student's performance.

4) There are 2 key limitations to this dataset and model:
- The number of features is quite limited, and several factors that probably play a big part in a student's performance are not captured. Examples include the student's level of interest in a topic, scores on quizzes and assignments, and some measure of teaching quality.

- Because the data is drawn from a single learning management system, the findings generalize only to the population of Kalboard 360 users. **Better research and a more generalizable predictive model of student performance would be possible if similar data could be collected at scale from public schools.**

### Dataset Information

This educational dataset was collected from Kalboard 360, a multi-agent learning management system (LMS). The data was collected using a learner activity tracker called the experience API (xAPI). The dataset consists of 480 student records and 16 features. The features fall into four major categories:

1. Demographic features such as gender and nationality.
2. Academic background features such as educational stage, grade level, and section.
3. Behavioral features such as raising a hand in class, opening resources, absences, and school satisfaction.
4. Parent-related features such as parent participation in the educational process. Participation in surveys and school satisfaction are used as proxies for parent participation.

### Load data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.model_selection import train_test_split

# Render plots inline and show all columns when displaying wide DataFrames
%matplotlib inline
pd.options.display.max_columns = 150

In [2]:
df = pd.read_csv('./data/xAPI-Edu-Data.csv')

FileNotFoundError: File b'./data/xAPI-Edu-Data.csv' does not exist

In [3]:
df.head()

NameError: name 'df' is not defined

In [4]:
df['Class'].value_counts()

NameError: name 'df' is not defined

In the dataset, we have 127 students that scored Low, 211 that scored Medium, and 142 that scored High.
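The class counts above give a useful reference point for any accuracy figure. The following is a minimal sketch that rebuilds the counts by hand (the `class_counts` Series is typed in from the text, not loaded from the CSV) and computes the accuracy of a trivial model that always predicts the most common class.

```python
import pandas as pd

# Class counts as reported in the text above (L = Low, M = Medium, H = High)
class_counts = pd.Series({'L': 127, 'M': 211, 'H': 142}, name='count')

# Share of each class among the 480 records
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the most common class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Always predicting Medium would be right for 211 of 480 students, or about 44% of the time, so a model needs to clear that bar comfortably to be useful.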

### Preparing and cleaning data for modeling

First, let's convert the Class column from string labels (L, M, H for Low, Medium, High) to integer codes (0, 1, 2).

In [5]:
df.head()

NameError: name 'df' is not defined

In [6]:
grade_map = {'L': 0, 'M': 1, 'H': 2}
df = df.replace({'Class': grade_map})
df.head()

NameError: name 'df' is not defined

Let's one-hot encode the remaining categorical X variables.

In [7]:
df.columns

NameError: name 'df' is not defined

In [8]:
# One-hot encode string columns
columns_to_one_hot_encode = ['gender', 'NationalITy', 'PlaceofBirth', 'StageID', 'GradeID', 'SectionID', 'Topic', 'Semester', 'Relation', 'ParentAnsweringSurvey', 'ParentschoolSatisfaction',
'StudentAbsenceDays']

df_one_hot_encoded = pd.get_dummies(df, columns = columns_to_one_hot_encode)
df_one_hot_encoded.head()

NameError: name 'df' is not defined
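To make the one-hot encoding step concrete, here is a small sketch on a toy frame. The values are made up, and the column names are only illustrative stand-ins for one categorical and one numeric column.

```python
import pandas as pd

# Toy frame with hypothetical values (not rows from the real dataset)
toy = pd.DataFrame({
    'gender': ['M', 'F', 'F'],
    'raisedhands': [15, 70, 40],
})

# Each distinct value of 'gender' becomes its own indicator column;
# columns that are not listed pass through unchanged
toy_encoded = pd.get_dummies(toy, columns=['gender'])
print(toy_encoded.columns.tolist())
# ['raisedhands', 'gender_F', 'gender_M']
```

Untouched columns come first in the result, followed by the new indicator columns, which is why the position of `Class` matters when column indices are later matched to feature importances.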

In [9]:
df_one_hot_encoded.describe()

NameError: name 'df_one_hot_encoded' is not defined

In [10]:
df_one_hot_encoded.columns

NameError: name 'df_one_hot_encoded' is not defined

In [11]:
# .ix has been removed from pandas; select the features and target by column name instead
X = df_one_hot_encoded.drop(columns='Class')
y = df_one_hot_encoded['Class'].values

NameError: name 'df_one_hot_encoded' is not defined

Let's split our data into training and test sets.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

NameError: name 'X' is not defined
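The split above uses the default 75/25 proportions without stratification. With three classes of unequal size, one optional variation is to pass `stratify` so that both splits keep roughly the same class mix. The sketch below uses synthetic labels with the class counts reported earlier; `X_demo` and `y_demo` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts (0 = Low, 1 = Medium, 2 = High)
y_demo = np.repeat([0, 1, 2], [127, 211, 142])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Whether stratification changes the results noticeably on this dataset is not something the notebook tests; it is simply a cheap safeguard when classes are imbalanced.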

## Train our model

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
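# Baseline forest with default hyperparameters; random_state is fixed for reproducibility
model_1 = RandomForestClassifier(random_state=0)
model_1.fit(X_train, y_train)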

NameError: name 'X_train' is not defined

In [15]:
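# Note: predictions are made on the full dataset (X, y), which includes the rows
# used for training, so these metrics are optimistic. Use X_test and y_test for
# a hold-out estimate.
expected = y
predicted = model_1.predict(X)

print(metrics.confusion_matrix(expected, predicted))
print(metrics.classification_report(expected, predicted))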

NameError: name 'y' is not defined
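The evaluation cell above scores the model on `X` and `y`, which contain the training rows. To illustrate why this matters, the sketch below compares training accuracy with hold-out accuracy. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the xAPI data), so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=480, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Scoring on rows the model was trained on is optimistic
acc_train = accuracy_score(y_tr, clf.predict(X_tr))
# Scoring on held-out rows estimates performance on unseen students
acc_test = accuracy_score(y_te, clf.predict(X_te))
print(f"train accuracy: {acc_train:.2f}, test accuracy: {acc_test:.2f}")
```

In the notebook itself, the equivalent change would be to call `model_1.predict(X_test)` and compare against `y_test`.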

In [16]:
df_X_vars_only = df_one_hot_encoded
del df_X_vars_only['Class']

NameError: name 'df_one_hot_encoded' is not defined

In [17]:
plt.figure(figsize=(16,8))
plt.plot(model_1.feature_importances_, 'o')
plt.xticks(range(len(df_X_vars_only.columns)), df_X_vars_only.columns.values, rotation=90);

NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.


#### Top 5 most important features in determining student performance

In [18]:
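# Print every feature whose importance exceeds 0.05.
# X.columns lines up with feature_importances_ because the model was fit on X's columns.
for name, importance in zip(X.columns, model_1.feature_importances_):
    if importance > 0.05:
        print(name, importance)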

NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
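The loop above prints features whose importance exceeds a fixed threshold, so the number of features it returns depends on the fitted model. If a fixed-size ranking is wanted instead, one option is to put the importances in a labeled Series and sort it. The sketch below does this on synthetic data; `feature_names` and the data are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, standing in for the encoded frame
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(8)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name, then rank from most to least important
importances = pd.Series(clf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

top_5 = importances.head(5)
print(top_5)
```

Labeling the importances by name also avoids any mismatch between positional indices and column order.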

## Iteration 2: Random Forest Classifier (optimized with GridSearchCV)

In [19]:
from sklearn.model_selection import GridSearchCV

In [20]:
random_forest_classifier_model = RandomForestClassifier(random_state=0)

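# 'auto' is omitted: for classifiers it was equivalent to 'sqrt', and it has been
# removed from recent scikit-learn releases
param_grid = {'max_features': [None, 'sqrt', 'log2'],
              'n_estimators': [1, 2, 4, 8, 10, 20, 30, 50],
              'min_samples_leaf': [1, 5, 10, 50]}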

model_2 = GridSearchCV(estimator=random_forest_classifier_model, 
                       param_grid=param_grid, cv=5)
model_2.fit(X_train, y_train)

NameError: name 'X_train' is not defined
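After a grid search has been fit, the chosen hyperparameters and their cross-validated score can be read from `best_params_` and `best_score_`. The sketch below shows this on synthetic data with a deliberately small grid (`small_grid` is a hypothetical reduction of the grid above so the example runs quickly).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Reduced grid for illustration only
small_grid = {'n_estimators': [10, 30], 'min_samples_leaf': [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=small_grid, cv=5
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.2f}")
# score() uses the refit best estimator on data the search never saw
print(f"hold-out accuracy: {search.score(X_te, y_te):.2f}")
```

Because `best_score_` is averaged over validation folds, it is a fairer guide to expected performance than accuracy measured on the training rows.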

In [21]:
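# Note: as with model_1, this scores on the full dataset (training rows included),
# so the metrics are optimistic. Use X_test and y_test for a hold-out estimate.
expected_2 = y
predicted_2 = model_2.predict(X)

print(metrics.confusion_matrix(expected_2, predicted_2))
print(metrics.classification_report(expected_2, predicted_2))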

NameError: name 'y' is not defined

In [22]:
plt.figure(figsize=(16,8))
plt.plot(model_2.best_estimator_.feature_importances_, 'o')
plt.xticks(range(len(df_X_vars_only.columns)), df_X_vars_only.columns.values, rotation=90);

AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'


#### Top 6 most important features in determining student performance

In [23]:
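# Print every feature of the tuned model whose importance exceeds 0.05.
# X.columns lines up with feature_importances_ because the model was fit on X's columns.
for name, importance in zip(X.columns, model_2.best_estimator_.feature_importances_):
    if importance > 0.05:
        print(name, importance)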

AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'
