# Capstone 1: Recommender System In-Depth Analysis

#### Kenneth Liao

Original datasource: https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine/#

In [109]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# enable offline plotting in plotly
init_notebook_mode(connected=True)

In [73]:
# load our 3 datasets
users = pd.read_csv('data/user_features.csv')
problems = pd.read_csv('data/problem_features.csv')
submissions = pd.read_csv('data/train_submissions.csv')

## Background & Problem Statement 

Ultimately, our goal is to recommend practice problems to users given some information about the problems they have already solved. There are many criteria on which we could base these recommendations. For the purpose of this model, I will keep the criteria simple. The criteria are as follows:

1. The problem has not yet been attempted by the user.
2. The predicted number of attempts the user will require to solve the problem is equal to 2 or 3.
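
As a rough illustration, the two criteria could be applied as a filter once predictions exist. This is only a sketch: the `recommend` helper, the `predicted` and `attempted` frames, and the toy IDs are hypothetical, and it assumes that "2 or 3 attempts" corresponds to an `attempts_range` of 2 in the datasource's encoding.

```python
import pandas as pd

def recommend(predicted, attempted, user_id, target_range=2):
    """Return unattempted problems whose predicted attempts_range equals target_range."""
    # criterion 1: collect the problems this user has already attempted
    seen = set(attempted.loc[attempted['user_id'] == user_id, 'problem_id'])
    # criterion 2: keep predictions for this user that fall in the target range
    candidates = predicted[(predicted['user_id'] == user_id)
                           & (predicted['attempts_range'] == target_range)]
    return [p for p in candidates['problem_id'] if p not in seen]

# toy data (hypothetical)
attempted = pd.DataFrame({'user_id': ['user_1', 'user_1'],
                          'problem_id': ['prob_1', 'prob_2']})
predicted = pd.DataFrame({'user_id': ['user_1'] * 4,
                          'problem_id': ['prob_1', 'prob_3', 'prob_4', 'prob_5'],
                          'attempts_range': [2, 2, 1, 2]})

recommendations = recommend(predicted, attempted, 'user_1')
```

Here `prob_1` is excluded because it was already attempted, and `prob_4` is excluded because its predicted range is 1.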

Given the criteria defined above, we must first be able to predict how many attempts a user will require to solve a problem they've never attempted before. I will perform this prediction using two very different models. 

The first model will be a Random Forest Classifier. For this model, I will use the metadata available for users and problems. The goal is to find patterns in the user and problem features that accurately predict the number of attempts for a given user-problem combination.

The second model will be a collaborative filtering model. This model will employ stochastic gradient descent (SGD) to find an approximate solution to the singular value decomposition (SVD) of our user-problem matrix. In this case, we will not use any user or problem features. Predictions will be made exclusively using the history of users and problems solved.
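
To make the idea concrete, here is a minimal NumPy sketch of SVD-style matrix factorization fitted with SGD. It is not the implementation used in this notebook: the function names, the toy triples, the global-mean offset, and the hyperparameter values (`lr`, `reg`, `n_factors`, `n_epochs`) are all assumptions chosen for illustration.

```python
import numpy as np

def factorize_sgd(triples, n_users, n_problems, n_factors=2,
                  lr=0.02, reg=0.02, n_epochs=500, seed=0):
    """Fit user and problem factors to (user, problem, value) triples with SGD."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))     # user factors
    Q = rng.normal(scale=0.1, size=(n_problems, n_factors))  # problem factors
    mu = float(np.mean([v for _, _, v in triples]))           # global mean offset
    for _ in range(n_epochs):
        # visit the observed entries in a fresh random order each epoch
        for idx in rng.permutation(len(triples)):
            u, i, v = triples[idx]
            err = v - (mu + P[u] @ Q[i])
            p_old = P[u].copy()
            # gradient step on the squared error with L2 regularization
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, P, Q

def rmse(triples, mu, P, Q):
    """Root mean squared error over the observed triples."""
    errors = [v - (mu + P[u] @ Q[i]) for u, i, v in triples]
    return float(np.sqrt(np.mean(np.square(errors))))

# toy data: (user index, problem index, attempts_range)
toy = [(0, 0, 1), (0, 1, 2), (1, 0, 1), (1, 2, 3), (2, 1, 2),
       (2, 2, 4), (3, 0, 2), (3, 3, 6), (1, 3, 5), (2, 3, 6)]

mu, P, Q = factorize_sgd(toy, n_users=4, n_problems=4)
rmse_baseline = rmse(toy, mu, np.zeros_like(P), np.zeros_like(Q))  # predict the mean everywhere
rmse_fitted = rmse(toy, mu, P, Q)
predicted_unseen = mu + P[0] @ Q[3]  # user 0 has no recorded attempt at problem 3
```

Only the observed user-problem pairs contribute to the updates, which is why this approach works on a sparse matrix where an exact SVD is not defined.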

Let's take a quick look at the submissions dataset. This dataset has 3 columns: user_id, problem_id, and attempts_range. Attempts_range gives the range of attempts that the user_id took to solve the problem_id and is defined in the original datasource as shown below.
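
The bucketing that the datasource defines (quoted below the sample rows) can also be written as a small function. This is only a sketch for intuition: the name `to_attempts_range` is hypothetical, and the dataset already ships with the bucketed values.

```python
def to_attempts_range(n_attempts):
    """Map a raw attempt count to the datasource's attempts_range bucket (1 to 6)."""
    if n_attempts < 1:
        raise ValueError("a solved problem needs at least one attempt")
    if n_attempts >= 10:
        return 6
    # 1 -> 1, 2-3 -> 2, 4-5 -> 3, 6-7 -> 4, 8-9 -> 5
    return n_attempts // 2 + 1

buckets = [to_attempts_range(n) for n in range(1, 12)]
```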

In [75]:
submissions.head()

Unnamed: 0,user_id,problem_id,attempts_range
0,user_232,prob_6507,1
1,user_3568,prob_2994,3
2,user_1600,prob_5071,1
3,user_2256,prob_703,1
4,user_2321,prob_356,1


>We have used following criteria to define the attempts_range :-
>
>            attempts_range            No. of attempts lies inside
>
>            1                                         1-1
>
>            2                                         2-3
>
>            3                                         4-5
>
>            4                                         6-7
>
>            5                                         8-9
>
>            6                                         >=10

## Random Forest Model

### Preparing data for random forest 

The first thing we need to do to prepare the data for the random forest model is convert categorical, string columns into dummy variables. We do this for both the user and problem features.
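
A toy example shows what this encoding does. The frame below is made up for illustration; only the pattern `pd.get_dummies(df.set_index(...)).reset_index()` mirrors the cells that follow. Moving the ID column into the index first keeps it from being expanded into one dummy column per ID.

```python
import pandas as pd

# toy user table (hypothetical values)
toy_users = pd.DataFrame({'user_id': ['user_a', 'user_b', 'user_c'],
                          'rating': [499.7, 313.4, 385.9],
                          'rank': ['advanced', 'intermediate', 'intermediate']})

# numeric columns pass through unchanged; each string column becomes
# one indicator column per distinct value
encoded = pd.get_dummies(toy_users.set_index('user_id')).reset_index()
encoded_columns = list(encoded.columns)
```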

In [76]:
users = pd.get_dummies(users.set_index('user_id')).reset_index()
users.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,...,country_Ukraine,country_United Kingdom,country_United States,country_Uzbekistan,country_Venezuela,country_Vietnam,rank_advanced,rank_beginner,rank_expert,rank_intermediate
0,user_1,84,73,10,120,1505162220,502.007,499.713,1469108674,1.0,...,0,0,0,0,0,0,1,0,0,0
1,user_10,246,211,0,30,1505079658,326.548,313.36,1472038187,1.0,...,0,0,0,0,0,0,0,0,0,1
2,user_100,642,574,27,106,1505073569,458.429,385.894,1323974332,1.0,...,0,0,0,0,0,0,0,0,0,1
3,user_1000,259,235,0,41,1505579889,371.273,336.583,1450375392,1.0,...,0,0,0,0,0,0,0,0,0,1
4,user_1001,554,492,-6,55,1504521879,472.19,450.975,1423399585,1.0,...,0,0,0,0,0,0,0,0,0,1


In [79]:
problems = pd.get_dummies(problems.set_index('problem_id')).reset_index()
problems.head()

Unnamed: 0,problem_id,problem_attempts_median,problem_attempts_min,problem_attempts_max,problem_attempts_count,problem_attempts_iqr,algorithms,and,binary,bitmasks,...,string,strings,structures,suffix,ternary,the,theorem,theory,trees,two
0,prob_1,1.5,1.0,2.0,2.0,0.005,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,prob_10,6.0,6.0,6.0,1.0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,prob_100,1.0,1.0,1.0,1.0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,prob_1000,1.0,1.0,6.0,246.0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,prob_1001,1.0,1.0,2.0,10.0,0.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we will prepare a single dataframe that joins the user and problem features with the submissions data.

In [82]:
X = submissions.merge(users, on='user_id').merge(problems, on='problem_id')

# drop columns that contain any null values (the boolean mask selects columns, not rows)
X = X.loc[:,X.notnull().all()]

y = X.set_index(['user_id', 'problem_id'])['attempts_range']
X = X.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

X.head()

user_id,problem_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,user_attempts_min,...,string,strings,structures,suffix,ternary,the,theorem,theory,trees,two
user_232,prob_6507,53,47,0,1,1503633778,307.913,206.709,1432110935,2.0,1.0,...,0,1,0,0,0,0,0,0,0,0
user_1910,prob_6507,240,218,0,50,1505252563,319.954,291.284,1385471472,1.0,1.0,...,0,1,0,0,0,0,0,0,0,0
user_1824,prob_6507,370,336,-10,30,1505395587,307.339,295.585,1471685215,1.0,1.0,...,0,1,0,0,0,0,0,0,0,0
user_895,prob_6507,318,286,0,20,1505511056,304.186,191.514,1475529522,2.0,1.0,...,0,1,0,0,0,0,0,0,0,0
user_779,prob_6507,463,410,0,39,1504799078,374.713,374.713,1437245990,1.0,1.0,...,0,1,0,0,0,0,0,0,0,0


In [83]:
y.head()

user_id    problem_id
user_232   prob_6507     1
user_1910  prob_6507     2
user_1824  prob_6507     2
user_895   prob_6507     1
user_779   prob_6507     1
Name: attempts_range, dtype: int64

Dataframe X now contains all of the user and problem feature data for each combination of user_id and problem_id. Thus, for each training sample or row, we will use the combination of user and problem features to predict the attempts_range. The attempts_range for each user-problem combination is saved in y.
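
One detail of the merge cell is worth illustrating: `X.loc[:, X.notnull().all()]` filters columns, because `notnull().all()` is evaluated per column. The toy frame below (made up for illustration) contrasts this with row-wise removal via `dropna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                    'b': [4.0, np.nan, 6.0],
                    'c': [7.0, 8.0, 9.0]})

# column-wise: keeps every row, drops column 'b' because it contains a null
without_null_columns = toy.loc[:, toy.notnull().all()]

# row-wise alternative: keeps every column, drops the second row
without_null_rows = toy.dropna(axis=0)
```

The column-wise form keeps every training sample at the cost of discarding any feature with missing values.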

Next, we'll split the data into train and test sets.

In [84]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Training

We will start by building a baseline, out-of-the-box model and try to improve it from there.

#### Baseline model

In [95]:
clf = RandomForestClassifier(n_jobs=-1)

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [101]:
y_pred = clf.predict(X_test)

f1_score(y_test, y_pred, average='weighted')

0.49979545702473027

The baseline model produces a weighted F1 score of about 0.50, far from the ideal value of 1. We can use GridSearchCV to try to tune the hyperparameters of the model. Rather than passing a large dictionary of all the hyperparameters we want to tune at once, I will explore each hyperparameter individually. This makes it more straightforward to interpret the effect of each one. At the end, I will pass all of the hyperparameters to GridSearchCV to find the optimal combination.

#### n_estimators

n_estimators defines how many trees the forest contains. Generally, the more trees, the better the model generalizes, although with diminishing returns. However, more trees mean more computation, so we want to strike a balance between the cross-validated score and the train + test time.

With GridSearchCV, we can define the scoring function. Since we want to maximize the f1_score function with "weighted" averaging from sklearn.metrics, we pass this same scoring function to GridSearchCV.
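
For intuition about what "weighted" averaging means, the sketch below computes the per-class F1 scores on made-up labels and averages them by class support, then compares the result with `f1_score(..., average='weighted')`. The toy arrays are assumptions; `zero_division=0` only silences the warning for the class that is never predicted.

```python
import numpy as np
from sklearn.metrics import f1_score

# toy labels (hypothetical): class 3 is never predicted
y_true_toy = np.array([1, 1, 1, 1, 2, 2, 3])
y_pred_toy = np.array([1, 1, 1, 2, 2, 1, 1])

# support = number of true samples per class
labels, support = np.unique(y_true_toy, return_counts=True)

# one F1 score per class, then an average weighted by support
per_class_f1 = f1_score(y_true_toy, y_pred_toy, labels=labels,
                        average=None, zero_division=0)
manual_weighted = float(np.sum(per_class_f1 * support) / support.sum())

sklearn_weighted = float(f1_score(y_true_toy, y_pred_toy,
                                  average='weighted', zero_division=0))
```

Because the average is weighted by support, frequent classes dominate the score, which matters for an imbalanced target.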

In [132]:
%%time
param_grid = {'n_estimators': [5,10,50,100,150,200,250]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 3min 56s


The results of the search are shown below. 

In [153]:
results = pd.DataFrame({'n_estimators' : [5,10,50,100,150,200,250],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,n_estimators,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,5,0.754767,0.490222,0.936031
1,10,0.981347,0.502005,0.974539
2,50,7.770343,0.511544,0.998722
3,100,64.798515,0.516341,0.999034
4,150,109.124288,0.51661,0.999039
5,200,72.066335,0.516317,0.999041
6,250,57.751733,0.516768,0.999041


Let's plot the train and test scores as a function of n_estimators.

In [160]:
trace1 = go.Scattergl(name='Mean Test Score',
                      x=results['n_estimators'],
                      y=results['mean_test_score'], 
                      mode='lines+markers',
                     yaxis='y2')
trace2 = go.Scattergl(name='Mean Train Score',
                      x=results['n_estimators'],
                      y=results['mean_train_score'], 
                      mode='lines+markers')

layout = go.Layout(title='Mean Train & Test Scores vs N_estimators',
               xaxis=dict(title='N_estimators'),
               yaxis=dict(title='Mean Train Score'), 
                   yaxis2=dict(title='Mean Test Score',
                              side='right'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace1, trace2], layout=layout)

iplot(fig, filename='train-test-scores.html')

We can see both scores increase in going from 5 to 100 estimators but quickly plateau after that. The train and test scores are plotted on separate axes above so we can distinguish the knees of both curves. The training score is already 0.94 at n_estimators=5 and exceeds 0.99 from 50 estimators onward, while the test score stays near 0.5, which suggests the forest is overfitting the training data. The more important score of course is the test score. Let's now look at the tradeoff between the test score and the time required to train and test the model.

In [163]:
trace0 = go.Scattergl(name='Combined Mean Train+Test Time',
                      x=results['n_estimators'],
                      y=results['combined_mean_fit-test_time'], 
                      mode='lines+markers',)
trace1 = go.Scattergl(name='Mean Test Score',
                      x=results['n_estimators'],
                      y=results['mean_test_score'], 
                      mode='lines+markers',
                     yaxis='y2')

layout = go.Layout(title='Model Train+Test Time & Test Score vs N_estimators',
               xaxis=dict(title='N_estimators'),
               yaxis=dict(title='Combined Train+Test Time'), 
                   yaxis2=dict(title='Mean Test Score',
                              side='right'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace0, trace1], layout=layout)

iplot(fig, filename='n_estimators.html')

Here we see that the combined time for training and testing the model increases significantly, up to 109 seconds at n_estimators=150. At n_estimators=100, the train+test time is 65 seconds, but the difference in test score between the two is negligible (0.5163 vs. 0.5166). The times for 200 and 250 estimators are lower than for 150, so these timings are noisy and should be read as a rough guide only. We can thus save a lot of computational resources and time by choosing n_estimators=100.
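
The same choice can be expressed as a simple rule: take the smallest n_estimators whose mean test score is within a small tolerance of the best one. The scores below are copied from the results table above; the tolerance of 0.001 and the variable names are assumptions for illustration.

```python
import pandas as pd

# mean test scores copied from the n_estimators results table
n_est_results = pd.DataFrame({
    'n_estimators': [5, 10, 50, 100, 150, 200, 250],
    'mean_test_score': [0.490222, 0.502005, 0.511544, 0.516341,
                        0.51661, 0.516317, 0.516768],
})

tolerance = 0.001  # assumed threshold for a "negligible" score difference
best_score = n_est_results['mean_test_score'].max()

# keep the settings that score within the tolerance of the best, then take the cheapest
good_enough = n_est_results[n_est_results['mean_test_score'] >= best_score - tolerance]
chosen_n_estimators = int(good_enough['n_estimators'].min())
```

With this tolerance the rule selects 100, in line with the reading of the plot.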

#### max_depth

In [164]:
%%time
param_grid = {'max_depth': [5,10,50,100,150]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 23.6 s


In [165]:
results = pd.DataFrame({'max_depth' : [5,10,50,100,150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,max_depth,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,5,0.47925,0.412228,0.413028
1,10,0.642642,0.482578,0.500847
2,50,1.109137,0.501242,0.974375
3,100,1.062214,0.501708,0.974543
4,150,1.133677,0.503467,0.974092


#### min_samples_split

In [187]:
%%time
param_grid = {'min_samples_split': [2,3,4,5,10,25,50,100]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 33.7 s


In [188]:
results = pd.DataFrame({'min_samples_split' : [2,3,4,5,10,25,50,100],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,min_samples_split,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,2,1.088555,0.501475,0.974483
1,3,0.999283,0.508273,0.943047
2,4,1.140367,0.509025,0.911239
3,5,1.100779,0.510806,0.883676
4,10,1.083504,0.515146,0.782923
5,25,0.982962,0.517524,0.666035
6,50,1.668919,0.516062,0.604858
7,100,1.956197,0.512582,0.563801


#### min_samples_leaf

In [117]:
f1_score(y_test, cv.predict(X_test), average='weighted')

0.5130431607694047

In [105]:
pd.crosstab(y_pred, y_test, rownames=['Predicted attempts_range'], 
            colnames=['Actual attempts_range'])

Predicted attempts_range \ Actual attempts_range,1,2,3,4,5,6
1,21520,9401,2234,751,332,332
2,4890,5376,1848,749,360,377
3,571,691,450,155,78,132
4,118,150,85,68,36,40
5,27,46,25,19,15,18
6,54,97,52,34,21,96
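
As a rough way to read this table, per-class recall and precision can be computed directly from the counts. The sketch below copies the counts from the crosstab above (rows are predicted ranges, columns are actual ranges); the variable names are assumptions.

```python
import numpy as np

# rows: predicted attempts_range 1-6, columns: actual attempts_range 1-6
confusion = np.array([
    [21520, 9401, 2234, 751, 332, 332],
    [4890, 5376, 1848, 749, 360, 377],
    [571, 691, 450, 155, 78, 132],
    [118, 150, 85, 68, 36, 40],
    [27, 46, 25, 19, 15, 18],
    [54, 97, 52, 34, 21, 96],
])

correct = np.diag(confusion)                  # predictions that match the actual range
recall = correct / confusion.sum(axis=0)      # share of each actual range that was found
precision = correct / confusion.sum(axis=1)   # share of each predicted range that was right
accuracy = correct.sum() / confusion.sum()
```

Recall is highest for range 1, the most frequent class, and considerably lower for the other ranges.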
