# Project Overview

<b>Competition Description from Kaggle:</b>

Think back to your favorite teacher. They motivated and inspired you to learn. And they knew your strengths and weaknesses. The lessons they taught were based on your ability. For example, teachers would make sure you understood algebra before advancing to calculus. Yet, many students don’t have access to personalized learning. In a world full of information, data scientists like you can help. Machine learning can offer a path to success for young people around the world, and you are invited to be part of this mission.



In 2018, 260 million children weren't attending school. At the same time, more than half of these young students didn't meet minimum reading and math standards. Education was already in a tough place when COVID-19 forced most countries to temporarily close schools. This further delayed learning opportunities and intellectual development. The equity gaps in every country could grow wider. We need to re-think the current education system in terms of attendance, engagement, and individualized attention.

Riiid Labs, an AI solutions provider delivering creative disruption to the education market, empowers global education players to rethink traditional ways of learning leveraging AI. With a strong belief in equal opportunity in education, Riiid launched an AI tutor based on deep-learning algorithms in 2017 that attracted more than one million South Korean students. This year, the company released EdNet, the world’s largest open database for AI education containing more than 100 million student interactions.

In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions. You will pair your machine learning skills using Riiid’s EdNet data.

Your innovative algorithms will help tackle global challenges in education. If successful, it’s possible that any student with an Internet connection can enjoy the benefits of a personalized learning experience, regardless of where they live. With your participation, we can build a better and more equitable model for education in a post-COVID-19 world.

# Import Libraries

In [7]:
import zipfile
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, plot_confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import Lasso, Ridge, LogisticRegression

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
import xgboost as xgb

from dask.distributed import Client, LocalCluster
import joblib

In [8]:
cluster = LocalCluster()
client = Client(cluster) # start a local Dask client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 56328 instead
  http_address["port"], self.http_server.port


In [9]:
cluster


# Data

#### Download Data

In [3]:
# Download the competition data from kaggle
# ! kaggle competitions download -c riiid-test-answer-prediction -p ../../data

In [4]:
# Unzip the data
# with zipfile.ZipFile('../../data/riiid-test-answer-prediction.zip', 'r') as zip_ref:
#     zip_ref.extractall('../../data')

#### Import Data

In [10]:
# Import the data as Pandas DataFrames
train = pd.read_csv('../../data/train.csv')
questions = pd.read_csv('../../data/questions.csv')
lectures = pd.read_csv('../../data/lectures.csv')

#### Explore Data

In [9]:
train.shape
# 101,230,332 rows

(101230332, 10)

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101230332 entries, 0 to 101230331
Data columns (total 10 columns):
row_id                            int64
timestamp                         int64
user_id                           int64
content_id                        int64
content_type_id                   int64
task_container_id                 int64
user_answer                       int64
answered_correctly                int64
prior_question_elapsed_time       float64
prior_question_had_explanation    object
dtypes: float64(1), int64(8), object(1)
memory usage: 7.5+ GB
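
The `info()` output above shows every integer column stored as `int64` and the explanation flag stored as `object`, which accounts for much of the 7.5+ GB footprint. Below is a minimal sketch of passing explicit dtypes to `pd.read_csv`. The narrower types in `TRAIN_DTYPES` are assumptions that have not been checked against the full file, and the five rows from `train.head()` stand in for `train.csv`.

```python
import io

import pandas as pd

# Assumed narrower dtypes; verify the value ranges on the real data before relying on them.
TRAIN_DTYPES = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32',
    # pandas nullable boolean keeps missing values as <NA> instead of falling back to object
    'prior_question_had_explanation': 'boolean',
}

# Stand-in for train.csv: the five rows shown by train.head()
SAMPLE_CSV = """row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,115,5692,0,1,3,1,,
1,56943,115,5716,0,2,2,1,37000.0,False
2,118363,115,128,0,0,0,1,55000.0,False
3,131167,115,7860,0,3,0,1,19000.0,False
4,137965,115,7922,0,4,1,1,11000.0,False
"""

default = pd.read_csv(io.StringIO(SAMPLE_CSV))                      # pandas picks the dtypes
compact = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=TRAIN_DTYPES)  # explicit dtypes

print(default.memory_usage(deep=True).sum(), 'bytes with default dtypes')
print(compact.memory_usage(deep=True).sum(), 'bytes with explicit dtypes')
```

On the real file the same `dtype=TRAIN_DTYPES` argument would go into the existing `pd.read_csv('../../data/train.csv')` call.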


In [11]:
train.head()

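| | row_id | timestamp | user_id | content_id | content_type_id | task_container_id | user_answer | answered_correctly | prior_question_elapsed_time | prior_question_had_explanation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 115 | 5692 | 0 | 1 | 3 | 1 | NaN | NaN |
| 1 | 1 | 56943 | 115 | 5716 | 0 | 2 | 2 | 1 | 37000.0 | False |
| 2 | 2 | 118363 | 115 | 128 | 0 | 0 | 0 | 1 | 55000.0 | False |
| 3 | 3 | 131167 | 115 | 7860 | 0 | 3 | 0 | 1 | 19000.0 | False |
| 4 | 4 | 137965 | 115 | 7922 | 0 | 4 | 1 | 1 | 11000.0 | False |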


In [11]:
# Split data into features and target
X = train.drop('answered_correctly', axis=1)
y = train['answered_correctly']

In [None]:
# Class Imbalance
sns.set(context = 'notebook', style = 'whitegrid')
fig, ax = plt.subplots(figsize = (10,4)) 
ax.hist(y, color = 'red', alpha = .5, bins = 5)
ax.set_title('Class distribution')
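
A histogram only gives a rough picture, and in the Riiid data lecture rows (`content_type_id == 1`) carry `answered_correctly == -1` as a placeholder rather than a real class. Here is a minimal sketch of counting the labels directly and then restricting to question rows. It uses a made-up `toy` frame; on the real data `train` would take its place.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    'content_type_id':    [0, 0, 0, 1, 0, 1],
    'answered_correctly': [1, 0, 1, -1, 1, -1],
})

# Raw label counts, including the -1 placeholder used for lecture rows
counts_all = toy['answered_correctly'].value_counts().sort_index()

# Keep only question rows, then look at the share of each real class
questions_only = toy[toy['content_type_id'] == 0]
share = questions_only['answered_correctly'].value_counts(normalize=True).sort_index()

print(counts_all)
print(share)
```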

### Prepare data for modeling

In [12]:
# Train Test Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

In [13]:
# Pipeline for preprocessing numeric features
numeric_features = list(X.select_dtypes(exclude='object').columns)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Pipeline for preprocessing categorical features
categorical_features = list(X.select_dtypes(include='object').columns)
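categorical_transformer = Pipeline(steps=[
    # fill_value is only used when strategy='constant'
    ('imputer', SimpleImputer(strategy='constant', fill_value=False)),
    # On scikit-learn >= 1.2 this argument is named sparse_output
    ('one_hot_encoder', OneHotEncoder(sparse=False))])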

# Pipeline for preprocessing combined
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [14]:
# Use preprocessing pipeline to transform the data
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
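
To make the preprocessing concrete, here is a small sketch of the same two-branch `ColumnTransformer` run on a four-row toy frame. `SimpleImputer` only honours `fill_value` when `strategy='constant'`, so the sketch sets both. The toy frame and the `sparse_threshold=0` setting (used here to get a dense array regardless of scikit-learn version) are additions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one object column, each with a missing value in row 0
toy = pd.DataFrame({
    'prior_question_elapsed_time': [np.nan, 37000.0, 55000.0, 19000.0],
    'prior_question_had_explanation': pd.Series([np.nan, False, True, False], dtype=object),
})

numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),   # NaN -> median of the column
    ('scaler', StandardScaler())])

categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=False)),  # NaN -> False
    ('one_hot_encoder', OneHotEncoder())])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, ['prior_question_elapsed_time']),
        ('cat', categorical_pipe, ['prior_question_had_explanation'])],
    sparse_threshold=0)  # always return a dense array

out = toy_preprocessor.fit_transform(toy)
print(out.shape)  # one scaled numeric column plus two one-hot columns
print(out)
```

The first column is the scaled elapsed time; the remaining two are the one-hot indicators for `False` and `True`, in that order.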

# Modeling

#### First Simple Model

In [None]:
with joblib.parallel_backend('dask'):
    fsm = LogisticRegression()
    fsm.fit(X_train_transformed, y_train)

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x000001B167FC2E18>, <Task finished coro=<f() done, defined at C:\Users\Sin City\anaconda3\envs\learn-env\lib\site-packages\joblib\_dask.py:316> exception=CommClosedError('in <closed TCP>: Stream is closed',)>)
Traceback (most recent call last):
  File "C:\Users\Sin City\anaconda3\envs\learn-env\lib\site-packages\distributed\comm\tcp.py", line 187, in read
    n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Sin City\anaconda3\envs\learn-env\lib\site-packages\tornado\ioloop.py", line 758, in _run_callback
    ret = callback()
  File "C:\Users\Sin City\anaconda3\envs\learn-env\lib\site-packages\tornado\stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Users\Sin City\anaconda3\envs
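
The traceback above shows the connection to a Dask worker closing during the fit; the notebook does not establish the cause. One way to get a first baseline anyway is to fit on a random subsample of the rows. The sketch below swaps in that approach on synthetic data; `SAMPLE_SIZE`, `X_demo`, and `y_demo` are hypothetical names and values, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-in for the transformed training data
X_demo = rng.normal(size=(10_000, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=10_000) > 0).astype(int)

SAMPLE_SIZE = 1_000  # hypothetical; choose what fits in memory on the real data

# Draw row indices without replacement and fit only on those rows
idx = rng.choice(len(X_demo), size=SAMPLE_SIZE, replace=False)
fsm_demo = LogisticRegression(max_iter=1000)
fsm_demo.fit(X_demo[idx], y_demo[idx])

# Score against all rows to see how the subsample model generalises
acc = fsm_demo.score(X_demo, y_demo)
print(f'accuracy on all rows: {acc:.3f}')
```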

In [None]:
# Score on our training data
with joblib.parallel_backend('dask'):
    y_pred = fsm.predict(X_train_transformed)
    print(classification_report(y_train, y_pred))

In [None]:
# Score on our testing data
with joblib.parallel_backend('dask'):
    y_pred = fsm.predict(X_test_transformed)
    print(classification_report(y_test, y_pred))
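
The Kaggle competition scores submissions by area under the ROC curve on predicted probabilities, so it may be worth reporting that alongside the classification report. Here is a minimal sketch on synthetic data, using `predict_proba` rather than hard predictions; the data and variable names are placeholders for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary problem standing in for the transformed features and target
X_demo = rng.normal(size=(2000, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_tr, y_tr)

# AUC needs a score per row: take the probability of the positive class
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print(f'test ROC AUC: {auc:.3f}')
```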

In [None]:
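# Confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay.from_estimator)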
with joblib.parallel_backend('dask'):
    plot_confusion_matrix(fsm, X_test_transformed, y_test)
    plt.title('FSM')
    plt.xticks(rotation='vertical');

### Further Data Preprocessing

#### Using SMOTE to Deal with Class Imbalance

In [None]:
print("Before OverSampling, counts of label -1: {}".format((y_train == -1).sum()))
print("Before OverSampling, counts of label 0: {}".format((y_train == 0).sum()))
print("Before OverSampling, counts of label 1: {} \n".format((y_train == 1).sum()))

with joblib.parallel_backend('dask'):
    # SMOTE is imported from imblearn.over_sampling at the top of the notebook
    # (pip install imbalanced-learn if it is missing)
    sm = SMOTE(random_state = 42)
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases removed
    X_train_res, y_train_res = sm.fit_resample(X_train_transformed, y_train.to_numpy())

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label -1: {}".format((y_train_res == -1).sum()))
print("After OverSampling, counts of label 0: {}".format((y_train_res == 0).sum()))
print("After OverSampling, counts of label 1: {}".format((y_train_res == 1).sum()))
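
For intuition about what the oversampling step does: SMOTE creates each synthetic minority sample by picking a minority point, picking one of its nearest minority neighbours, and interpolating a random distance along the segment between them. The sketch below illustrates that idea with NumPy and `NearestNeighbors`. It is a simplified illustration, not imbalanced-learn's implementation, and `smote_like_samples` is a hypothetical helper name.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Interpolate n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other points

    # Ask for k + 1 neighbours because each point is normally its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)              # starting points
    pick = neigh[base, rng.integers(1, k + 1, size=n_new)]      # one neighbour each (skip column 0)
    gap = rng.random((n_new, 1))                                # position along the segment

    return X_min[base] + gap * (X_min[pick] - X_min[base])


# Toy minority class: 20 random points in the unit square
X_min = np.random.default_rng(0).random((20, 2))
synthetic = smote_like_samples(X_min, n_new=50)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.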

In [None]:
sns.set(context = 'notebook', style = 'whitegrid')
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (10,4)) 
# Compare like with like: SMOTE was fitted on y_train, not on the full y
ax1.hist(y_train, color = 'red', alpha = .5, bins = 5)
ax1.set_title('Class distribution before SMOTE')
ax2.hist(y_train_res, color = 'blue', alpha = .5, bins = 5)
ax2.set_title('Class distribution after SMOTE')
fig.tight_layout()
plt.savefig('../../reports/figures/Fixing_class_imbalance.jpg', bbox_inches='tight');