# Capstone 1: Recommender System In-Depth Analysis

#### Kenneth Liao

Original datasource: https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine/#

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from scipy.optimize import minimize, fmin_cg

# enable offline plotting in plotly
init_notebook_mode(connected=True)

In [2]:
# load our 3 datasets
users = pd.read_csv('data/user_features.csv')
problems =  pd.read_csv('data/problem_features.csv')
submissions = pd.read_csv('data/train_submissions.csv')

## Background & Problem Statement 

Ultimately, our goal is to recommend practice problems to users given some information about the problems they have already solved. There are many criteria on which we could base these recommendations. For the purpose of this model, I will keep the criteria simple. The criteria are as follows:

1. The problem has not yet been attempted by the user.
2. The predicted number of attempts the user will require to solve the problem is equal to 2 or 3.
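
As a rough illustration of how the two criteria could be applied once predictions exist, here is a minimal sketch. The `recommend` helper, the `predicted_range` column, and the toy data are all hypothetical. It also assumes that "2 or 3 attempts" corresponds to an attempts_range of 2, which is my reading of the datasource's definition rather than something stated here.

```python
import pandas as pd

def recommend(predictions, attempted, user_id, target_range=2):
    """Hypothetical helper: return problem_ids predicted to fall in
    target_range for user_id, excluding problems already attempted."""
    # criterion 1: collect the problems this user has already attempted
    seen = set(attempted.loc[attempted['user_id'] == user_id, 'problem_id'])
    # criterion 2: keep only predictions in the target attempts_range
    keep = ((predictions['user_id'] == user_id)
            & (predictions['predicted_range'] == target_range)
            & (~predictions['problem_id'].isin(seen)))
    return sorted(predictions.loc[keep, 'problem_id'])

# toy data, made up for illustration only
attempted = pd.DataFrame({'user_id': ['user_1', 'user_1'],
                          'problem_id': ['prob_1', 'prob_2'],
                          'attempts_range': [1, 2]})
predictions = pd.DataFrame({'user_id': ['user_1'] * 4,
                            'problem_id': ['prob_2', 'prob_3', 'prob_4', 'prob_5'],
                            'predicted_range': [2, 2, 1, 2]})

recs = recommend(predictions, attempted, 'user_1')
print(recs)
```

Here prob_2 is dropped because it was already attempted, and prob_4 is dropped because its predicted range is 1.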

Given the criteria defined above, we must first be able to predict how many attempts a user will require to solve a problem they've never attempted before. I will perform this prediction using two very different models. 

The first model will be a Random Forest Classifier. For this model, I will use meta data available for users and problems. The goal is to find patterns in the user and problem features that predict well the number of attempts for a given user-problem combination.

The second model will be a collaborative filtering model. This model will employ stochastic gradient descent (SGD) to find an approximate solution to the singular value decomposition (SVD) of our user-problem matrix. In this case, we will not use any user or problem features. Predictions will be made exclusively using the history of users and problems solved.
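
To make the collaborative filtering idea concrete, the sketch below factorizes a tiny user-problem matrix into two low-rank factors with SGD, using only the observed entries. It is a generic illustration and not the implementation used later in this analysis. The function names, the learning rate, the regularization strength, and the toy matrix are all assumptions chosen for the example.

```python
import numpy as np

def sgd_factorize(R, k=2, lr=0.02, reg=0.05, epochs=500, seed=0):
    """Approximate R ~ U @ V.T using only the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    # small random starting factors
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    observed = np.argwhere(~np.isnan(R))  # (row, col) index pairs
    for _ in range(epochs):
        rng.shuffle(observed)  # visit the observed entries in random order
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            # gradient step on squared error with L2 regularization
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

def rmse_observed(R, U, V):
    """Root-mean-squared error over the observed entries only."""
    mask = ~np.isnan(R)
    return float(np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2)))

# toy user-problem matrix; NaN marks problems a user has not attempted
R = np.array([[1, 2, np.nan, 1],
              [1, np.nan, 3, 1],
              [np.nan, 2, 3, 1],
              [1, 2, np.nan, np.nan]], dtype=float)

U, V = sgd_factorize(R)
predicted = U @ V.T  # includes estimates for the missing entries
print(round(rmse_observed(R, U, V), 3))
```

The product `U @ V.T` fills in every cell of the matrix, so the entries that were NaN in `R` become predictions for unseen user-problem pairs.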

Let's take a quick look at the submissions dataset. This dataset has 3 columns: user_id, problem_id, and attempts_range. Attempts_range gives the range of attempts that the user_id took to solve the problem_id and is defined in the original datasource as shown below.

In [3]:
submissions.head()

Unnamed: 0,user_id,problem_id,attempts_range
0,user_232,prob_6507,1
1,user_3568,prob_2994,3
2,user_1600,prob_5071,1
3,user_2256,prob_703,1
4,user_2321,prob_356,1


>We have used following criteria to define the attempts_range :-
>
>| attempts_range | No. of attempts lies inside |
>|---|---|
>| 1 | 1-1 |
>| 2 | 2-3 |
>| 3 | 4-5 |
>| 4 | 6-7 |
>| 5 | 8-9 |
>| 6 | >=10 |
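
The binning defined above can be written as a small lookup. This is only a sketch for intuition: the `attempts_to_range` helper is hypothetical, and the dataset already ships with attempts_range precomputed.

```python
import numpy as np

def attempts_to_range(n_attempts):
    """Map a raw attempt count (>= 1) to its attempts_range bucket (1-6)."""
    upper_edges = [1, 3, 5, 7, 9]  # inclusive upper bounds of ranges 1..5
    # searchsorted finds the first bucket whose upper bound is >= n_attempts;
    # anything above 9 falls past the last edge and lands in range 6
    return int(np.searchsorted(upper_edges, n_attempts, side='left')) + 1

for n in [1, 2, 3, 4, 7, 9, 10, 25]:
    print(n, attempts_to_range(n))
```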

## Random Forest Model

### Preparing data for random forest 

The first thing we need to do to prepare the data for the random forest model is convert categorical, string columns into dummy variables. We do this for both the user and problem features.
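
For readers unfamiliar with dummy variables, here is a toy sketch of the same pattern on a made-up frame. Setting the ID column as the index first keeps it from being one-hot encoded along with the other string columns. The `toy_users` data is invented for illustration.

```python
import pandas as pd

# made-up users with one numeric and one categorical feature
toy_users = pd.DataFrame({'user_id': ['user_1', 'user_2', 'user_3'],
                          'rating': [499.7, 313.4, 385.9],
                          'rank': ['advanced', 'intermediate', 'intermediate']})

# move the ID into the index so only 'rank' is expanded into dummy columns
encoded = pd.get_dummies(toy_users.set_index('user_id')).reset_index()
print(encoded.columns.tolist())
```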

In [4]:
users = pd.get_dummies(users.set_index('user_id')).reset_index()
users.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,...,country_Ukraine,country_United Kingdom,country_United States,country_Uzbekistan,country_Venezuela,country_Vietnam,rank_advanced,rank_beginner,rank_expert,rank_intermediate
0,user_1,84,73,10,120,1505162220,502.007,499.713,1469108674,1.0,...,0,0,0,0,0,0,1,0,0,0
1,user_10,246,211,0,30,1505079658,326.548,313.36,1472038187,1.0,...,0,0,0,0,0,0,0,0,0,1
2,user_100,642,574,27,106,1505073569,458.429,385.894,1323974332,1.0,...,0,0,0,0,0,0,0,0,0,1
3,user_1000,259,235,0,41,1505579889,371.273,336.583,1450375392,1.0,...,0,0,0,0,0,0,0,0,0,1
4,user_1001,554,492,-6,55,1504521879,472.19,450.975,1423399585,1.0,...,0,0,0,0,0,0,0,0,0,1


In [5]:
problems = pd.get_dummies(problems.set_index('problem_id')).reset_index()
problems.head()

Unnamed: 0,problem_id,points,problem_attempts_median,problem_attempts_min,problem_attempts_max,problem_attempts_count,problem_attempts_iqr,algorithms,and,binary,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
0,prob_1,500.0,1.5,1.0,2.0,2.0,0.005,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,prob_10,4500.0,6.0,6.0,6.0,1.0,0.0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,prob_100,1000.0,1.0,1.0,1.0,1.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,prob_1000,500.0,1.0,1.0,6.0,246.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,prob_1001,2000.0,1.0,1.0,2.0,10.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


I'll start by splitting the whole dataset into a training set (train) and a test set (R_test). I'll further split train into a smaller training set (R_train) and a cross-validation set (R_cv) for hyperparameter tuning. This split must be done on the original submissions dataset before pivoting the data into a sparse matrix: once the data is in sparse matrix format, sampling it would also sample the null values.
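
The toy sketch below shows why the split happens on the long-format rows. Every row in the long format is a real observation, so each split contains only observed values. Pivoting afterwards introduces NaN for every user-problem pair that was never attempted. The toy frame and variable names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up submissions in long format: one row per observed user-problem pair
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u3', 'u1'],
    'problem_id': ['p1', 'p2', 'p1', 'p3', 'p2', 'p3', 'p4', 'p4'],
    'attempts_range': [1, 2, 1, 3, 1, 1, 2, 6],
})

# split the observed rows first
train_rows, test_rows = train_test_split(toy, test_size=0.25, random_state=42)

# pivot afterwards: unobserved user-problem pairs become NaN
R_toy = train_rows.pivot(index='user_id', columns='problem_id',
                         values='attempts_range')
print(R_toy)
```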

In [13]:
train, R_test = train_test_split(submissions, test_size=0.25, random_state=42)

R_train, R_cv = train_test_split(train, test_size=0.33, random_state=42)

In [14]:
R_train.head()

Unnamed: 0,user_id,problem_id,attempts_range
66107,user_2579,prob_5765,2
30619,user_2646,prob_4503,1
73139,user_3160,prob_506,4
152423,user_2213,prob_3331,1
118307,user_3040,prob_617,2


Next, we will prepare a single dataframe that joins the user and problem features with the submissions data.

In [50]:
X_train = R_train.merge(users, on='user_id').merge(problems, on='problem_id')
X_cv = R_cv.merge(users, on='user_id').merge(problems, on='problem_id')

# drop columns (features) that contain any null values
X_train = X_train.loc[:,X_train.notnull().all()]
X_cv = X_cv.loc[:,X_cv.notnull().all()]

y_train = X_train.set_index(['user_id', 'problem_id'])['attempts_range']
X_train = X_train.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

y_cv = X_cv.set_index(['user_id', 'problem_id'])['attempts_range']
X_cv = X_cv.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

X_train.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,user_attempts_min,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
user_id,problem_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
user_2579,prob_5765,676,636,39,90,1505150936,487.959,487.959,1416244222,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_3138,prob_5765,1333,1280,0,114,1505506845,499.14,489.679,1438061830,2.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_1861,prob_5765,498,436,0,26,1505583689,489.679,463.876,1380916526,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_262,prob_5765,98,79,3,44,1496108464,524.656,519.209,1413513739,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_374,prob_5765,150,136,0,71,1499900942,458.429,458.429,1314604890,2.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
y_train.head()

user_id    problem_id
user_2579  prob_5765     2
user_3138  prob_5765     1
user_1861  prob_5765     1
user_262   prob_5765     1
user_374   prob_5765     4
Name: attempts_range, dtype: int64

The dataframe X_train now contains all of the user and problem feature data for each combination of user_id and problem_id in the training set. Thus, for each training sample (row), we will use the combination of user and problem features to predict the attempts_range. The attempts_range for each user-problem combination is saved in y_train.

### Training

#### Simplest Baseline Model

We know from our previous exploratory analysis of this data that 1 is by far the most common attempts_range. A very simple prediction model we can make is just to predict the most common value for all missing values. Let's see how such a model would do.

To benchmark our models, we'll be using sklearn's f1_score function with the average argument set to "weighted". This function will compute the f1-score for each of the labels in the dataset and then take a weighted average of the scores depending on how many samples are in each label. Thus, we will simply get one overall f1-score.
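
To make the weighting explicit, the sketch below recomputes the weighted F1 score by hand on a made-up set of labels and compares it with sklearn's result. The label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# made-up true and predicted labels
y_true = np.array([1, 1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 1, 2, 2, 1, 3])
labels = [1, 2, 3]

# one F1 score per label
per_class = f1_score(y_true, y_pred, average=None, labels=labels)

# support = number of true samples in each label
support = np.array([(y_true == c).sum() for c in labels])

# weighted average of the per-label scores
manual = (per_class * support).sum() / support.sum()
weighted = f1_score(y_true, y_pred, average='weighted', labels=labels)

print(per_class, manual, weighted)
```

Because label 1 has the most samples, its per-label score dominates the average, which is worth keeping in mind given how common attempts_range 1 is in this dataset.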

In [52]:
def f1(Ytrue, Ypred, average='weighted'):
    """Compute the f1_score between a matrix with actual
    values (Ytrue) and a matrix with predictions (Ypred).
    Ytrue and Ypred are required to have the same 
    dimensions.
    """
    # get indices of non-NaN values in Ytrue
    mask = ~np.isnan(np.array(Ytrue).flatten(order='C'))
    
    # flatten matrices to 1D arrays
    # use the mask to get only non-NaN values
    ytrue = np.array(Ytrue).flatten(order='C')[mask]
    ypred = np.array(Ypred).flatten(order='C')[mask]
    
    return f1_score(ytrue, ypred, average=average, labels=[1.0,2.0,3.0,4.0,5.0,6.0])

In [53]:
y_pred = np.ones(len(y_train))

print('F1 score for predicting all ones on training data: %s' % round(f1(y_train, y_pred), 4))

F1 score for predicting all ones on training data: 0.3708


The F1 score we got for predicting 1 for all of the training samples is 0.371. How does this compare in the CV dataset?

In [54]:
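# NOTE: this rebinds y_cv in R_cv's row order. The y_cv built in In [50] follows
# the merged X_cv row order, which may differ. Order is irrelevant for this
# constant baseline but matters when scoring model predictions made on X_cv.
y_cv = R_cv.set_index(['user_id', 'problem_id'])['attempts_range']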

y_pred = np.ones(len(y_cv))
print('F1 score for predicting all ones on cv data: %s' % round(f1(y_cv, y_pred), 4))

F1 score for predicting all ones on cv data: 0.3711


We get a similar f1 score for predicting all ones on the CV dataset. This is a good indication that there was minimal selection bias in our splitting.

#### Out-of-box Random Forest

We will start by building an out-of-box model and try to improve it from there.

In [55]:
clf = RandomForestClassifier(n_jobs=-1)

clf.fit(X_train, y_train)


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [59]:
y_pred = clf.predict(X_train)

f1(y_train, y_pred, average='weighted')

0.9765422743464437

In [60]:
y_pred = clf.predict(X_cv)

f1(y_cv, y_pred, average='weighted')

0.4138444028234788

The out-of-box random forest model gives an F1 score of 0.977 on the training data and 0.414 on the cross-validation data. The CV score is an improvement over the 0.371 of the baseline model, but it is still far from 1, and the large gap between the training and CV scores indicates that the model is overfitting. During my exploratory analysis of the data, it was clear that many features were correlated with one another. Before diving into model optimization through hyperparameter tuning, I want to see if removing some of this collinearity between features improves the model's performance.

#### Dimensionality Reduction

Let's start by performing PCA on the full dataset to see how many features we can safely remove. Performing PCA on the full dataset has two benefits.

1. The dimensionality of the training data is reduced and therefore takes less computation to train the model on.
2. Collinear features are removed. The principal components returned by PCA are all orthogonal.
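
The second point can be checked numerically. The sketch below builds a small synthetic dataset with two strongly correlated columns, then verifies that the PCA components are orthonormal and that the transformed features are uncorrelated. The synthetic data is an assumption made for illustration and is unrelated to the user and problem features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.standard_normal(200)
# the second column is almost a multiple of the first (strongly collinear)
X = np.column_stack([a,
                     2 * a + 0.1 * rng.standard_normal(200),
                     rng.standard_normal(200)])

pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)

# rows of components_ are the principal axes; their Gram matrix is the identity
gram = pca.components_ @ pca.components_.T

# covariance of the transformed data: off-diagonal entries are ~0
cov_Z = np.cov(Z, rowvar=False)

print(np.round(gram, 6))
print(np.round(cov_Z, 6))
```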

In [61]:
pca = PCA()
pca.fit(X_train)

x = list(range(1, len(pca.explained_variance_)+1))
y = pca.explained_variance_

trace0 = go.Scatter(x=x, y=y, mode='lines+markers')

layout = go.Layout(title='Explained Variance vs # of Dimensions',
                  xaxis=dict(title='# of Dimensions'),
                  yaxis=dict(title='Explained Variance', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='explained-var_vs_N-dimensions.html')

In [72]:
n_components=[1,2,5,10,25,50,100]

f1_scores = []
for n in n_components:

    pca = PCA(n_components=n)
    
    X_train_r = pca.fit_transform(X_train)
    
    clf = RandomForestClassifier(n_jobs=-1)

    clf.fit(X_train_r, y_train)

    # note: predictions are scored on the same data the model was trained on
    y_pred = clf.predict(X_train_r)

    f1_scores.append(f1(y_train, y_pred, average='weighted'))





In [73]:
trace0 = go.Scatter(x=n_components, y=f1_scores, mode='lines+markers')

layout = go.Layout(title='F1 Score vs Principal Components',
                  xaxis=dict(title='Principal Components'),
                  yaxis=dict(title='F1 Score', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='f1_score-vs-principal_components.html')

We can see that with fewer than 25 principal components, there is a significant hit in the F1 score, which in this loop is computed on the training data. Above 25 principal components, the difference is negligible. Overall, using PCA to remove collinear features and reduce the dataset's dimensionality gives no improvement over the baseline out-of-box random forest.
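
Since the loop above scores each model on the data it was trained on, a held-out comparison can be a useful complement. The sketch below shows one way to do that on synthetic data: the PCA is fitted on the training split only and then applied to the held-out split through a pipeline. The dataset, the component counts, and the forest size are assumptions for illustration, and no results for the real data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# synthetic stand-in for the feature matrix and labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.33,
                                          random_state=0)

scores = {}
for n in [2, 5, 10]:
    # the pipeline fits PCA on the training split only,
    # then reuses that projection for the held-out split
    model = make_pipeline(PCA(n_components=n),
                          RandomForestClassifier(n_estimators=20,
                                                 random_state=0))
    model.fit(X_tr, y_tr)
    scores[n] = (f1_score(y_tr, model.predict(X_tr), average='weighted'),
                 f1_score(y_ho, model.predict(X_ho), average='weighted'))

for n, (train_f1, heldout_f1) in scores.items():
    print(n, round(train_f1, 3), round(heldout_f1, 3))
```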

We can use GridSearchCV to try to tune the hyperparameters of the model. Rather than passing a large dictionary object of all the hyperparameters we want to tune at once, I will explore each of the hyperparameters individually. This will make it more straightforward to interpret the effect of each hyperparameter. At the end, I will pass all of the hyperparameters to GridSearchCV to find the optimal combination.
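
For reference, a combined search has the shape sketched below: a single dictionary with several hyperparameters, where GridSearchCV evaluates every combination. This uses synthetic data and a deliberately small, hypothetical grid, so the values are not the ones selected in this analysis. The `iid` argument used elsewhere in this notebook is omitted because it was removed in later scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for X_train and y_train
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)

# hypothetical grid: 2 x 2 x 2 = 8 combinations
param_grid = {'n_estimators': [10, 25],
              'max_depth': [5, 10],
              'min_samples_leaf': [1, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid,
                      scoring='f1_weighted', cv=3,
                      return_train_score=True)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The number of fits grows multiplicatively with each added hyperparameter, which is one practical reason to narrow the ranges individually first.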

#### n_estimators

n_estimators defines how many trees the model will have. Generally, the more trees, the better the model will generalize. However, more trees mean more computation, so we want to strike a balance between the test score and the combined train + test time.

With GridSearchCV, we can define the scoring function. Since we want to maximize the f1_score function with "weighted" averaging from sklearn.metrics, we pass this same scoring function to GridSearchCV.
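
As a quick sanity check, the sketch below confirms on synthetic data that the built-in 'f1_weighted' scorer matches a scorer built explicitly from f1_score with average='weighted'. The data and the small forest are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, get_scorer, make_scorer

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# scorers are called as scorer(estimator, X, y)
by_name = get_scorer('f1_weighted')(clf, X, y)
by_hand = make_scorer(f1_score, average='weighted')(clf, X, y)
direct = f1_score(y, clf.predict(X), average='weighted')

print(by_name, by_hand, direct)
```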

In [74]:
%%time
param_grid = {'n_estimators': [5,10,50,100,150,200,250]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 1min 58s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid=True, n_jobs=-1,
             param_grid={'n_esti

The results of the search are shown below. 

In [75]:
results = pd.DataFrame({'n_estimators' : [5,10,50,100,150,200,250],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,n_estimators,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,5,6.869155,0.445677,0.939287
1,10,15.427733,0.45469,0.977563
2,50,45.935924,0.458341,0.999641
3,100,49.879743,0.461532,0.999872
4,150,67.730156,0.461941,0.999891
5,200,71.066319,0.462222,0.999891
6,250,69.040377,0.461822,0.999891


Let's plot the train and test scores as a function of N_estimators.

In [76]:
trace1 = go.Scattergl(name='Mean Test Score',
                      x=results['n_estimators'],
                      y=results['mean_test_score'], 
                      mode='lines+markers',
                     yaxis='y2')
trace2 = go.Scattergl(name='Mean Train Score',
                      x=results['n_estimators'],
                      y=results['mean_train_score'], 
                      mode='lines+markers')

layout = go.Layout(title='Mean Train & Test Scores vs N_estimators',
               xaxis=dict(title='N_estimators'),
               yaxis=dict(title='Mean Train Score'), 
                   yaxis2=dict(title='Mean Test Score',
                              side='right'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace1, trace2], layout=layout)

iplot(fig, filename='train-test-scores.html')

We can see both scores increase in going from 5 to 100 estimators and then plateau. The train and test scores are plotted on separate axes above so we can distinguish the knees of both curves. The training score is already 0.939 at n_estimators=5 and exceeds 0.999 from n_estimators=50 onward, while the test score stays near 0.46. The more important score, of course, is the test score. Let's now look at the tradeoff between the test score and the time required to train and test the model.

In [77]:
def plot_cv(df, param):
    """Plot the combined train+test time and the mean test score
    against the values of one hyperparameter (param) in df.
    """
    # combined fit + score time on the left axis
    trace0 = go.Scattergl(name='Combined Mean Train+Test Time',
                          x=df[param],
                          y=df['combined_mean_fit-test_time'],
                          mode='lines+markers')
    # mean test score on the right axis (y2)
    trace1 = go.Scattergl(name='Mean Test Score',
                          x=df[param],
                          y=df['mean_test_score'],
                          mode='lines+markers',
                          yaxis='y2')

    layout = go.Layout(title='Model Train+Test Time & Test Score vs %s' % param,
                       xaxis=dict(title=param),
                       yaxis=dict(title='Combined Train+Test Time'),
                       yaxis2=dict(title='Mean Test Score',
                                   side='right'),
                       legend=dict(orientation='h', y=1.12),
                       margin=dict(t=120))

    fig = go.Figure([trace0, trace1], layout=layout)

    iplot(fig, filename='%s.html' % param)

In [78]:
plot_cv(results, param='n_estimators')

Here we see that the combined time for training and testing the model increases significantly with n_estimators, reaching about 68 seconds at n_estimators=150. At n_estimators=100, the train+test time is about 50 seconds, but the difference in test score between the two is negligible (0.4615 vs. 0.4619). We can thus save computational resources and time by choosing n_estimators=100.
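
The same tradeoff can be expressed as a simple rule: among the settings whose test score is within a small tolerance of the best score, pick the cheapest one. The sketch below applies that rule to the values from the results table above. The tolerance of 0.001 and the `table`, `time_s`, and `choice` names are assumptions chosen for illustration.

```python
import pandas as pd

# values copied from the n_estimators results table
table = pd.DataFrame({
    'n_estimators': [5, 10, 50, 100, 150, 200, 250],
    'time_s': [6.869155, 15.427733, 45.935924, 49.879743,
               67.730156, 71.066319, 69.040377],
    'mean_test_score': [0.445677, 0.45469, 0.458341, 0.461532,
                        0.461941, 0.462222, 0.461822],
})

tol = 0.001  # hypothetical tolerance on the test score
best = table['mean_test_score'].max()

# settings that score within tol of the best
good_enough = table[table['mean_test_score'] >= best - tol]

# cheapest of those by combined train+test time
choice = good_enough.sort_values('time_s').iloc[0]
print(int(choice['n_estimators']), round(choice['time_s'], 1))
```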

#### max_depth

In [79]:
%%time
param_grid = {'max_depth': [5,10,50,100,150]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)





Wall time: 11.3 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid=True, n_jobs=-1,
             param_grid={'max_de

In [80]:
results = pd.DataFrame({'max_depth' : [5,10,50,100,150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,max_depth,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,5,1.184245,0.386043,0.395494
1,10,2.153944,0.457967,0.5011
2,50,3.667719,0.451553,0.97757
3,100,3.872194,0.45469,0.977563
4,150,4.0821,0.45469,0.977563


In [81]:
plot_cv(results, param='max_depth')

#### min_samples_split

In [82]:
%%time
param_grid = {'min_samples_split': [2,3,4,5,10,25,50,100]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)





Wall time: 18.4 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid=True, n_jobs=-1,
             param_grid={'min_sa

In [83]:
results = pd.DataFrame({'min_samples_split' : [2,3,4,5,10,25,50,100],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,min_samples_split,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,2,4.979714,0.45469,0.977563
1,3,4.614216,0.456383,0.953491
2,4,7.281724,0.457113,0.925452
3,5,6.423782,0.459216,0.898678
4,10,6.836548,0.461343,0.797194
5,25,7.482361,0.469938,0.673634
6,50,5.751819,0.469966,0.609777
7,100,2.55655,0.472612,0.567468


In [84]:
plot_cv(results, param='min_samples_split')

#### min_samples_leaf

In [85]:
%%time
param_grid = {'min_samples_leaf': [2,3,4,5,10,25,50,100]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)





Wall time: 9.52 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid=True, n_jobs=-1,
             param_grid={'min_sa

In [86]:
results = pd.DataFrame({'min_samples_leaf' : [2,3,4,5,10,25,50,100],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,min_samples_leaf,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,2,3.406829,0.463485,0.800847
1,3,4.886431,0.467943,0.707209
2,4,4.875672,0.473468,0.660642
3,5,5.841019,0.472296,0.631772
4,10,6.360712,0.475734,0.570989
5,25,6.272631,0.476157,0.530147
6,50,3.475947,0.469165,0.505283
7,100,1.769653,0.458407,0.489062


#### criterion

In [87]:
%%time
param_grid = {'criterion': ["gini", "entropy"]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)





Wall time: 4.64 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid=True, n_jobs=-1, param_grid={'criterion': ['gini'

In [88]:
results = pd.DataFrame({'criterion' : ['gini', 'entropy'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,criterion,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,gini,2.810248,0.45469,0.977563
1,entropy,3.061918,0.457913,0.977492


#### max_features

In [89]:
%%time
param_grid = {'max_features': [2, 10, 25, 50, 100, 150]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)





Wall time: 18.9 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid=T

In [90]:
results = pd.DataFrame({'max_features': [2, 10, 25, 50, 100, 150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,max_features,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,2,4.848548,0.442138,0.977411
1,10,7.120961,0.451855,0.977219
2,25,11.454638,0.454837,0.977082
3,50,11.92717,0.459322,0.976144
4,100,12.584168,0.462525,0.97542
5,150,12.967173,0.459076,0.975125


In [91]:
plot_cv(results, 'max_features')

#### oob_score

In [92]:
%%time
param_grid = {'oob_score': [True, False]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)





Wall time: 4.78 s



Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.


invalid value encountered in true_divide



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid=T

In [93]:
results = pd.DataFrame({'oob_score': [True, False],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,oob_score,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,True,2.519834,0.453303,0.977412
1,False,2.538115,0.452112,0.977804


#### warm_start

In [94]:
%%time
param_grid = {'warm_start': [True, False]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



Wall time: 4.28 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid=T

In [95]:
results = pd.DataFrame({'warm_start': [True, False],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,warm_start,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,True,2.159243,0.455958,0.976786
1,False,2.663219,0.45324,0.977428


#### class_weight

In [96]:
%%time
param_grid = {'class_weight': [None, 'balanced', 'balanced_subsample']}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



Wall time: 10.9 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid=T

In [97]:
results = pd.DataFrame({'class_weight': [None, 'balanced', 'balanced_subsample'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,class_weight,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,,4.673367,0.449082,0.977245
1,balanced,5.801716,0.456594,0.97785
2,balanced_subsample,2.735734,0.454496,0.978093


#### max_leaf_nodes

In [98]:
%%time
param_grid = {'max_leaf_nodes': [None, 10, 25, 50, 100, 150]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



Wall time: 7.51 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid=T

In [99]:
results = pd.DataFrame({'max_leaf_nodes': [None, 10, 25, 50, 100, 150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,max_leaf_nodes,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,,2.672754,0.451088,0.977471
1,10.0,2.443652,0.375279,0.385786
2,25.0,3.029357,0.417575,0.443285
3,50.0,2.87991,0.446965,0.467245
4,100.0,4.479641,0.45701,0.487207
5,150.0,4.730571,0.45546,0.494722


#### Putting it all together

In [100]:
clf = RandomForestClassifier(n_estimators=100, 
                             max_depth=50, 
                             min_samples_split=25, 
                             min_samples_leaf=25,
                             max_features=50,
                             n_jobs=-1)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

f1_score(y_test, y_pred, average='weighted')

ValueError: could not convert string to float: 'user_2932'

## Collaborative & Content Filtering Models

In [None]:
# number of unique users
n_u = submissions['user_id'].nunique()
# number of unique items (problems)
n_i = submissions['problem_id'].nunique()

In [None]:
print('Number of unique users: %s' % n_u)
print('Number of unique problems: %s' % n_i)

In [None]:
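# fraction of user x problem pairs with an observed attempts_range
# (this is the density of the matrix; sparsity is 1 - density)
density = len(submissions)/(n_u*n_i)
print('Density of attempts_range: %s%%' % round(density*100, 2))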

The full submissions dataset contains 3529 unique users and 5776 unique problems. We have attempts_range data for only 0.76% of all user x problem combinations, so this data is incredibly sparse. Even the Netflix prize dataset had ratings for over 1% of its user x movie combinations. This will likely make it much harder for collaborative filtering models to produce good results, since they infer a missing attempts_range from the values recorded for other users and/or items.

After sampling, we can pivot both R_train and R_cv into user x problem matrices, which are sparse in the sense that most entries are missing (NaN).

In [None]:
R_train = R_train.set_index(['user_id','problem_id']).unstack(level=-1)
R_cv = R_cv.set_index(['user_id','problem_id']).unstack(level=-1)

R_train.columns = R_train.columns.droplevel()
R_cv.columns = R_cv.columns.droplevel()

R_train.head()

Since I will be building several types of models that use very different methods to fill missing attempts_range values, I'll start by creating an empty matrix with all user ids as the index and all problem ids as the columns. This matrix is constructed using the full list of users and problems from the users and problems datasets, not the submissions dataset, because there are many users and problems for which we have metadata but no history of submissions.

In [None]:
u_diff = len(set(users.user_id.unique()).difference(submissions.user_id.unique()))
print('Number of users from users dataset, not present in submissions: %s' % u_diff)

In [None]:
p_diff = len(set(problems.problem_id.unique()).difference(submissions.problem_id.unique()))
print('Number of problems from problems dataset, not present in submissions: %s' % p_diff)

In [None]:
empty_sub = pd.DataFrame(np.nan, index=users.user_id.unique(), 
                         columns=problems.problem_id.unique())

We'll fill in the R_train and R_cv data into the empty_sub matrix to have all data and predictions in the same format.

In [None]:
R_train = empty_sub.fillna(R_train)
R_cv = empty_sub.fillna(R_cv)
R_train.head()
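
To make the alignment step concrete, here is a minimal sketch on made-up data (the user and problem labels are hypothetical, not taken from the dataset). When `fillna` is given another DataFrame, it fills each NaN with the value found at the same index and column label, so users and problems that are absent from the pivoted matrix simply stay NaN.

```python
import numpy as np
import pandas as pd

# hypothetical full lists of users and problems
all_users = ['u1', 'u2', 'u3']
all_problems = ['p1', 'p2', 'p3']

# empty user x problem frame, analogous to empty_sub
empty = pd.DataFrame(np.nan, index=all_users, columns=all_problems)

# pivoted submissions that only cover some users and problems
observed = pd.DataFrame({'p1': [1.0, 3.0], 'p3': [2.0, np.nan]},
                        index=['u1', 'u3'])

# values are matched on index and column labels, not on position
filled = empty.fillna(observed)
print(filled)
```

Only the three observed values are copied over; user u2 and problem p2 remain entirely NaN.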

Recall that we created a baseline model before building our random forest models by simply predicting 1 for all missing attempts_range. Let's start by doing the same here.

In [None]:
f1(R_train, np.ones((R_train.shape[0], R_train.shape[1])))

In [None]:
f1(R_cv, np.ones((R_cv.shape[0], R_cv.shape[1])))

We get a similar value as before, only slightly smaller since our sample is different. While we will be using the F1 score as the final metric to compare models, I will use root mean squared error (RMSE) to optimize the fit of our model to the training data. Below I define a function that calculates the RMSE between two matrices, one with the ground truth values and the second with the predicted values.

In [None]:
def rmse(R_true, R_pred):
    """Calculate the RMSE between two matrices, one
    containing the ground truth, and the other a
    model's predictions"""
    
    # number of total, non-null samples
    n = R_true.count().sum()
    
    # square of the residuals
    res_squared = (R_true - R_pred)**2
    
    RMSE = np.sqrt(np.sum(np.sum(res_squared))/(n))
    
    return RMSE
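
As a quick sanity check of this definition, the sketch below repeats the same calculation on a small made-up matrix (the helper name `rmse_demo` and the data are for illustration only). Entries that are NaN in the ground truth drop out of both the sum and the sample count.

```python
import numpy as np
import pandas as pd

def rmse_demo(R_true, R_pred):
    """Same calculation as rmse, repeated so this example is self-contained."""
    n = R_true.count().sum()              # number of non-null samples
    res_squared = (R_true - R_pred)**2    # NaN wherever R_true is NaN
    return np.sqrt(res_squared.sum().sum() / n)

# made-up ground truth with one missing entry
R_true_demo = pd.DataFrame([[1.0, np.nan], [3.0, 2.0]])
R_pred_demo = np.ones(R_true_demo.shape)

# residuals on the 3 observed entries are 0, 2 and 1
score = rmse_demo(R_true_demo, R_pred_demo)
print(score)
```

The result equals sqrt((0 + 4 + 1)/3), which confirms that the missing entry is ignored rather than counted as an error.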

Let's calculate the RMSE for a prediction of all ones.

In [None]:
R_pred = np.ones((R_train.shape[0], R_train.shape[1]))
rmse(R_train, R_pred)

So our baseline model of predicting all ones gives a starting F1_score of 0.371 and an RMSE of 1.31. Let's see how much we can improve on this!

### User-mean Recommender

The first type of collaborative filtering model I'll build is a user-mean collaborative filtering model. This simple model fills each missing attempts_range value with the mean attempts_range of that problem, averaged across all users who attempted it.

We need a method for dealing with edge cases where we have no data to make a prediction. For example, since we'll be calculating the mean of each problem and using it to make predictions for all users, a problem that no user attempted in the training data has no mean, so no prediction is made for that column. The CV and test datasets could still contain data in those columns that we should have predicted. The easiest way to deal with this is to simply predict 1 when we don't have data, since this is by far the most common value of attempts_range across all problems and users.

In [None]:
# compute the mean of each problem across all users
# round to nearest int
user_means = np.round(R_train.mean())

# fill the empty_sub for scoring
R_pred = empty_sub.fillna(user_means)

# fill all missing values with 1
R_pred = R_pred.fillna(1)

R_pred.head()
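
The fallback can be seen on a small made-up matrix (the labels and values below are hypothetical). Problem p3 has no training data, so its mean is NaN and the first `fillna` leaves that column empty; the second `fillna` then supplies the default of 1.

```python
import numpy as np
import pandas as pd

# made-up training matrix: rows are users, columns are problems
R_demo = pd.DataFrame({'p1': [1.0, 3.0, np.nan],
                       'p2': [np.nan, 2.0, 2.0],
                       'p3': [np.nan, np.nan, np.nan]},  # never attempted
                      index=['u1', 'u2', 'u3'])

# per-problem mean across users; p3 has no data, so its mean is NaN
col_means = np.round(R_demo.mean())

# empty frame of the same shape, filled column by column
empty_demo = pd.DataFrame(np.nan, index=R_demo.index, columns=R_demo.columns)
pred_demo = empty_demo.fillna(col_means)

# fall back to 1 wherever no mean was available
pred_demo = pred_demo.fillna(1)
print(pred_demo)
```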

In [None]:
rmse(R_train, R_pred)

In [None]:
f1(R_train, R_pred)

So this simple model produces an F1 score that's much better than the baseline, but still worse than our best random forest model. Let's see how this compares to the item-mean recommender.

### Item-mean Recommender

In [None]:
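# compute the mean of each user across all problems
# round to nearest int
problem_means = np.round(R_train.mean(axis=1))

# fill row-wise: transpose so users become columns,
# fill each with that user's mean, then transpose back
R_pred = empty_sub.T.fillna(problem_means).T

# fill all missing values with 1
R_pred = R_pred.fillna(1)
R_pred.head()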

In [None]:
rmse(R_train, R_pred)

In [None]:
f1(R_train, R_pred)

The item-mean model does considerably worse than the user-mean model. In fact, it does worse than even our baseline model, where we predicted 1 for all missing attempts_range values! Here we get an F1 score of 0.34, whereas the baseline model scored 0.37.

### User-based vs Item-based Collaborative Filtering

In [None]:
def cos_sim(attempts, kind='user', epsilon=1e-9):
    # fill all NaN values with 0 so that missing entries
    # contribute nothing to the dot products. The second
    # positional argument of np.nan_to_num is `copy`, so the
    # fill value must be passed by keyword (NumPy >= 1.17).
    attempts = np.nan_to_num(attempts, nan=0.0)
    
    # compute the dot product between each user
    # and all other users.
    if kind == 'user':
        sim = np.dot(attempts, attempts.T) + epsilon
    # compute the dot product between each item
    # and all other items
    elif kind == 'item':
        sim = np.dot(attempts.T, attempts) + epsilon
    else:
        raise ValueError("kind must be 'user' or 'item'")
    
    # compute the denominator of the cosine similarity
    # metric
    norms = np.array([np.sqrt(np.diagonal(sim))])
    
    # the returned matrix is user x user for kind='user'
    # and item x item for kind='item'.
    return sim/norms/norms.T
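
A minimal sketch of the same idea on a made-up 3 x 2 attempts matrix is shown below. It computes user x user cosine similarity directly, without the epsilon term, which is safe here only because every user in this toy example has at least one attempt.

```python
import numpy as np

# made-up attempts matrix: 3 users x 2 problems, NaN = not attempted
A = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [np.nan, 3.0]])

# missing entries become 0 (fill value passed by keyword)
A0 = np.nan_to_num(A, nan=0.0)

# user x user dot products, then divide by the product of the row norms
dots = A0 @ A0.T
norms = np.sqrt(np.diag(dots))
sim_demo = dots / np.outer(norms, norms)
print(np.round(sim_demo, 3))
```

The first two users attempted the same problem and get a similarity of 1 even though their attempts_range values differ, while the third user shares no problems with them and gets a similarity of 0.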

In [None]:
similarity_u = cos_sim(R_train, kind='user')
similarity_i = cos_sim(R_train, kind='item')

In [None]:
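def predict(attempts, similarity, kind='user'):
    # fill NaN values with 0 (the fill value must be passed
    # by keyword; the second positional argument is `copy`)
    attempts_fill = np.nan_to_num(attempts, nan=0.0)
    
    # similarity-weighted average of the attempts, normalized by
    # the total absolute similarity and rounded to the nearest int
    if kind == 'user':
        return np.round(similarity.dot(attempts_fill) / np.array([np.abs(similarity).sum(axis=1)]).T)
    elif kind == 'item':
        return np.round(attempts_fill.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)]))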
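
The user-based branch can be traced by hand on made-up numbers. In the sketch below both the attempts matrix and the similarity matrix are hypothetical, and the final rounding step is left out so the weighted averages stay visible.

```python
import numpy as np

# made-up data: 3 users x 2 problems, 0 = not attempted
attempts_demo = np.array([[1.0, 0.0],
                          [3.0, 2.0],
                          [0.0, 2.0]])

# hypothetical user x user similarity matrix
sim_users = np.array([[1.0, 0.5, 0.0],
                      [0.5, 1.0, 0.5],
                      [0.0, 0.5, 1.0]])

# similarity-weighted sum of every user's attempts ...
weighted = sim_users.dot(attempts_demo)
# ... divided by the total similarity weight of each user
weights = np.abs(sim_users).sum(axis=1, keepdims=True)
pred_users = weighted / weights
print(pred_users)
```

Because missing entries are filled with 0 and still count toward the denominator, a prediction can fall below 1, as it does for the first user on the second problem. This is worth keeping in mind when interpreting the scores of this model.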

In [None]:
R_pred = predict(R_train, similarity_u, kind='user')
f1(R_train, R_pred)

In [None]:
R_pred = predict(R_train, similarity_i, kind='item')
f1(R_train, R_pred)

We can also look at the similarity of users and items using the features datasets rather than the attempts_range values themselves.

In [None]:
user_features_norm = users.set_index('user_id')/users.set_index('user_id').max()
user_features_norm.head()

In [None]:
R_pred = predict(R_train, cos_sim(user_features_norm))
f1(R_train, R_pred)

In [None]:
problem_features_norm = problems.set_index('problem_id')/problems.set_index('problem_id').max()
problem_features_norm.head()

In [None]:
R_pred = predict(R_train, cos_sim(problem_features_norm), kind='item')
f1(R_train, R_pred)

### Latent Factor Collaborative Filtering Model

In [None]:
def unroll(P, Q, order='C'):
    """Flatten two matrices and stack them on
    top of each other in a single array."""
    P = np.array(P)
    Q = np.array(Q)

    x = np.concatenate((P.flatten(order=order),
                        Q.flatten(order=order)), axis=0)

    return x

In [None]:
def roll(x, n_u, n_i, f):
    """Reshape a single array into the two
    original matrices."""
    
    P = np.reshape(x[0:n_u*f], (n_u, f))
    Q = np.reshape(x[n_u*f:], (n_i, f))

    return P, Q
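
The optimizers in scipy work on a single flat parameter vector, which is why P and Q are flattened and later reshaped. The sketch below repeats both steps inline on made-up sizes to show that the round trip recovers the original matrices.

```python
import numpy as np

n_users, n_items, n_factors = 3, 4, 2   # made-up sizes

P_demo = np.arange(n_users * n_factors, dtype=float).reshape(n_users, n_factors)
Q_demo = -np.arange(n_items * n_factors, dtype=float).reshape(n_items, n_factors)

# flatten both matrices into one parameter vector (as unroll does)
x_demo = np.concatenate((P_demo.flatten(), Q_demo.flatten()))

# split the vector and reshape each part back (as roll does)
P_back = np.reshape(x_demo[:n_users * n_factors], (n_users, n_factors))
Q_back = np.reshape(x_demo[n_users * n_factors:], (n_items, n_factors))

print(x_demo.shape)
print(np.array_equal(P_back, P_demo), np.array_equal(Q_back, Q_demo))
```

Note that `np.reshape` fills in row-major (C) order by default, so the round trip only holds when the flattening also uses `order='C'`.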

In [None]:
def cost_f(x, y, L, n_u, n_i, f):
    P, Q = roll(x, n_u, n_i, f)

    hyp = np.dot(P, Q.T)
    error = hyp - y
    error[np.isnan(error)] = 0 # Sets all missing values to 0s

    # Compute the COST FUNCTION with REGULARIZATION
    Q_reg = (L/2) * np.nansum(Q*Q)
    P_reg = (L/2) * np.nansum(P*P)

    J = (1/2) * np.nansum(error*error) + Q_reg + P_reg

    return J
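
Written out, the quantity that cost_f computes is a regularized squared error over the observed entries only. The notation below is introduced here for illustration: $r_{ui}$ is the observed attempts_range for user $u$ and problem $i$, $p_u$ and $q_i$ are the corresponding rows of P and Q, $\lambda$ is the regularization parameter L, and $\mathcal{K}$ is the set of user-problem pairs with data.

$$
J(P, Q) = \frac{1}{2} \sum_{(u,i) \in \mathcal{K}} \left( p_u^\top q_i - r_{ui} \right)^2 + \frac{\lambda}{2} \left( \lVert P \rVert_F^2 + \lVert Q \rVert_F^2 \right)
$$

Setting the error to 0 wherever $r_{ui}$ is missing is what restricts the sum to $\mathcal{K}$.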

In [None]:
def grad_f(x, y, L, n_u, n_i, f):
    P, Q = roll(x, n_u, n_i, f)

    hyp = np.dot(P, Q.T)
    error = hyp - y
    error[np.isnan(error)] = 0 # Sets all missing values to 0s

    P_grad = np.dot(error, Q) + L*P
    Q_grad = np.dot(error.T, P) + L*Q

    grad = unroll(P_grad, Q_grad)

    return grad
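
Before handing an analytic gradient to an optimizer, it can be worth comparing it against a finite-difference approximation. The sketch below does this with `scipy.optimize.check_grad` on a small made-up problem; the helper names (`split`, `cost_small`, `grad_small`) and the data are assumptions for illustration, written to mirror cost_f and grad_f rather than call them.

```python
import numpy as np
from scipy.optimize import check_grad

rng = np.random.default_rng(0)
n_users, n_items, n_factors, lam = 4, 5, 2, 1.0   # made-up sizes

# made-up ratings with roughly half of the entries missing
R_small = rng.integers(1, 7, size=(n_users, n_items)).astype(float)
R_small[rng.random(R_small.shape) < 0.5] = np.nan

def split(x):
    """Reshape the flat parameter vector into P and Q."""
    P = x[:n_users * n_factors].reshape(n_users, n_factors)
    Q = x[n_users * n_factors:].reshape(n_items, n_factors)
    return P, Q

def cost_small(x):
    P, Q = split(x)
    # error is 0 wherever the rating is missing
    err = np.nan_to_num(P @ Q.T - R_small, nan=0.0)
    return 0.5 * np.sum(err**2) + 0.5 * lam * (np.sum(P**2) + np.sum(Q**2))

def grad_small(x):
    P, Q = split(x)
    err = np.nan_to_num(P @ Q.T - R_small, nan=0.0)
    return np.concatenate(((err @ Q + lam * P).ravel(),
                           (err.T @ P + lam * Q).ravel()))

x_small = rng.random((n_users + n_items) * n_factors) - 0.5

# norm of the difference between analytic and finite-difference gradients
diff = check_grad(cost_small, grad_small, x_small)
print(diff)
```

A value that is tiny relative to the size of the gradient suggests the two agree; a large value usually points to a sign or transpose mistake in the gradient.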

In [None]:
n_u = R_train.shape[0] # number of users
n_i = R_train.shape[1] # number of items
f = 10 # number of latent factors
L = 1 # regularization parameter

# initial random guess at P & Q
P0 = np.random.rand(n_u, f) - 0.5
Q0 = np.random.rand(n_i, f) - 0.5

x0 = unroll(P0, Q0)

cost_f(x0, y=R_train, L=L, n_u=n_u, n_i=n_i, f=f)

In [None]:
%%time
args = (R_train, L, n_u, n_i, f)
options={'maxiter':2, 'disp':True}

result = minimize(cost_f, x0, args=args, jac=grad_f, method='CG', options=options)

In [None]:
# recover P & Q matrices
P, Q = roll(result.x, n_u, n_i, f)

# compute predictions from P & Q
R_pred = pd.DataFrame(np.dot(P, Q.T), index=R_train.index, columns=R_train.columns)

# set all predictions below 1 to 1 (bottom limit)
R_pred[R_pred < 1] = 1

# round all values
R_pred = R_pred.round()

R_pred.head()

In [None]:
f1(R_train, R_pred)

In [None]:
import time

args = (R_train, L, n_u, n_i, f)
options={'maxiter':5, 'disp':True}

optimizers = ['CG','Newton-CG','L-BFGS-B']

results = {}
for optimizer in optimizers:
    t0=time.time()
    result = minimize(cost_f, x0, args=args, jac=grad_f, method=optimizer, options=options)
    training_time=time.time()-t0
    results[optimizer] = {'result':result, 'training_time':training_time}
    print('Training time: %s seconds' % round(training_time,1))
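
The same kind of comparison can be tried on a problem small enough to run instantly. The sketch below is not the notebook's setup: the data are made up, the sizes and regularization value are arbitrary, and the cost and gradient are returned from a single hypothetical function (`cost_and_grad`) with `jac=True` instead of being passed as two separate functions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_users, n_items, n_factors, lam = 6, 8, 2, 0.1   # made-up sizes

# made-up ratings with some entries missing
R_toy = rng.integers(1, 7, size=(n_users, n_items)).astype(float)
R_toy[rng.random(R_toy.shape) < 0.4] = np.nan
observed = ~np.isnan(R_toy)

def cost_and_grad(x):
    P = x[:n_users * n_factors].reshape(n_users, n_factors)
    Q = x[n_users * n_factors:].reshape(n_items, n_factors)
    # error on observed entries only
    err = np.where(observed, P @ Q.T - R_toy, 0.0)
    cost = 0.5 * np.sum(err**2) + 0.5 * lam * (np.sum(P**2) + np.sum(Q**2))
    grad = np.concatenate(((err @ Q + lam * P).ravel(),
                           (err.T @ P + lam * Q).ravel()))
    return cost, grad

x_start = rng.random((n_users + n_items) * n_factors) - 0.5

fits = {}
for method in ['CG', 'L-BFGS-B']:
    # jac=True tells minimize that the function returns (cost, gradient)
    fits[method] = minimize(cost_and_grad, x_start, jac=True, method=method,
                            options={'maxiter': 50})
    print(method, round(fits[method].fun, 3), fits[method].nit)
```

Comparing the final cost and iteration count across methods on a toy problem like this is only a rough guide; timings on the full matrix can rank the optimizers differently.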

In [None]:
results['L-BFGS-B']

In [None]:
%%time
options={'maxiter':10, 'disp':True}

result = minimize(cost_f, x0, args=args, jac=grad_f, method='Newton-CG', options=options)

In [None]:
# recover P & Q matrices
P, Q = roll(result.x, n_u, n_i, f)

# compute predictions from P & Q
R_pred = pd.DataFrame(np.dot(P, Q.T), index=R_train.index, columns=R_train.columns)

# set all predictions below 1 to 1 (bottom limit)
R_pred[R_pred < 1] = 1

# round all values
R_pred = R_pred.round()

f1(R_train, R_pred)

In [None]:
f1(R_cv, R_pred)