# Capstone 1: Recommender System In-Depth Analysis

#### Kenneth Liao

Original datasource: https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine/#

In [119]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# enable offline plotting in plotly
init_notebook_mode(connected=True)

In [120]:
# load our 3 datasets
users = pd.read_csv('data/user_features.csv')
problems =  pd.read_csv('data/problem_features.csv')
submissions = pd.read_csv('data/train_submissions.csv')

## Background & Problem Statement 

Ultimately, our goal is to recommend practice problems to users given some information about the problems they have already solved. There are many possible criteria for recommending a problem. For the purpose of this model, I will keep the criteria simple. The criteria are as follows:

1. The problem has not yet been attempted by the user.
2. The predicted number of attempts the user will require to solve the problem is equal to 2 or 3 (that is, a predicted attempts_range of 2).

Given the criteria defined above, we must first be able to predict how many attempts a user will require to solve a problem they've never attempted before. I will perform this prediction using two very different models. 
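
As a rough illustration of how the two criteria could be applied once predictions exist, here is a minimal sketch. The `predictions` frame, its `predicted_range` column, and the `recommend` helper are hypothetical names introduced only for this example; they are not part of the dataset or of the models built below.

```python
import pandas as pd

def recommend(predictions, attempted, target_range=2):
    """Return {user_id: [problem_id, ...]} of unattempted problems predicted in target_range.

    predictions: DataFrame with user_id, problem_id, predicted_range (hypothetical layout).
    attempted:   DataFrame with user_id, problem_id for past submissions.
    """
    seen = attempted[['user_id', 'problem_id']].drop_duplicates()
    # Left join with an indicator column so pairs without a past submission can be isolated.
    merged = predictions.merge(seen, on=['user_id', 'problem_id'],
                               how='left', indicator=True)
    unseen = merged[merged['_merge'] == 'left_only']           # criterion 1
    hits = unseen[unseen['predicted_range'] == target_range]   # criterion 2
    return hits.groupby('user_id')['problem_id'].apply(list).to_dict()

# Toy data, made up for the example.
predictions = pd.DataFrame({
    'user_id': ['user_1', 'user_1', 'user_1', 'user_2', 'user_2'],
    'problem_id': ['prob_1', 'prob_2', 'prob_3', 'prob_1', 'prob_2'],
    'predicted_range': [2, 1, 2, 2, 2],
})
attempted = pd.DataFrame({
    'user_id': ['user_1', 'user_2'],
    'problem_id': ['prob_1', 'prob_2'],
})

recs = recommend(predictions, attempted)
print(recs)
```

The anti-join keeps the filter independent of how the predictions were produced, so the same helper could sit behind either model.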

The first model will be a Random Forest Classifier. For this model, I will use the metadata available for users and problems. The goal is to find patterns in the user and problem features that predict the number of attempts well for a given user-problem combination.

The second model will be a collaborative filtering model. This model will employ stochastic gradient descent (SGD) to find an approximate solution to the singular value decomposition (SVD) of our user-problem matrix. In this case, we will not use any user or problem features. Predictions will be made exclusively using the history of users and problems solved.
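
For reference, a common way to write this kind of factorization is sketched below. The notation is an assumption made for this note: $r_{ui}$ is the observed attempts_range for user $u$ and problem $i$, $p_u$ and $q_i$ are latent factor vectors, $\mathcal{K}$ is the set of observed user-problem pairs, $\alpha$ is the learning rate and $\lambda$ the regularization strength.

$$
\min_{P,\,Q} \sum_{(u,i)\in\mathcal{K}} \left[ \left(r_{ui} - p_u^\top q_i\right)^2 + \lambda\left(\lVert p_u\rVert^2 + \lVert q_i\rVert^2\right) \right]
$$

For each observed pair, SGD computes the prediction error and nudges both factor vectors against the gradient (the constant factor of 2 is absorbed into $\alpha$):

$$
e_{ui} = r_{ui} - p_u^\top q_i, \qquad
p_u \leftarrow p_u + \alpha\left(e_{ui}\,q_i - \lambda\,p_u\right), \qquad
q_i \leftarrow q_i + \alpha\left(e_{ui}\,p_u - \lambda\,q_i\right)
$$

These updates have the same form as the ones in the training loop of the SGD section, where $\alpha$ and $\lambda$ appear as `alpha` and `lmbda`.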

Let's take a quick look at the submissions dataset. This dataset has 3 columns: user_id, problem_id, and attempts_range. Attempts_range gives the range of attempts that the user_id took to solve the problem_id and is defined in the original datasource as shown below.

In [121]:
submissions.head()

Unnamed: 0,user_id,problem_id,attempts_range
0,user_232,prob_6507,1
1,user_3568,prob_2994,3
2,user_1600,prob_5071,1
3,user_2256,prob_703,1
4,user_2321,prob_356,1


>We have used following criteria to define the attempts_range :-
>
>| attempts_range | No. of attempts lies inside |
>|---|---|
>| 1 | 1-1 |
>| 2 | 2-3 |
>| 3 | 4-5 |
>| 4 | 6-7 |
>| 5 | 8-9 |
>| 6 | >=10 |
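
The bucket definitions quoted above can be expressed as a small helper. This is only an illustrative sketch: `to_attempts_range` is a hypothetical name, and the dataset already ships with attempts_range precomputed.

```python
import numpy as np

# Inclusive upper bounds of buckets 1..5, taken from the quoted table:
# 1, 2-3, 4-5, 6-7, 8-9; anything from 10 upward falls into bucket 6.
EDGES = [1, 3, 5, 7, 9]

def to_attempts_range(n_attempts):
    """Map raw attempt counts (>= 1) to attempts_range buckets (1..6)."""
    n = np.asarray(n_attempts)
    # right=True treats each edge as the inclusive upper bound of its bucket.
    return np.digitize(n, EDGES, right=True) + 1

buckets = to_attempts_range([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25])
print(buckets)
```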

## Random Forest Model

### Preparing data for random forest 

The first thing we need to do to prepare the data for the random forest model is convert categorical, string columns into dummy variables. We do this for both the user and problem features.

In [122]:
users = pd.get_dummies(users.set_index('user_id')).reset_index()
users.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,...,country_Ukraine,country_United Kingdom,country_United States,country_Uzbekistan,country_Venezuela,country_Vietnam,rank_advanced,rank_beginner,rank_expert,rank_intermediate
0,user_1,84,73,10,120,1505162220,502.007,499.713,1469108674,1.0,...,0,0,0,0,0,0,1,0,0,0
1,user_10,246,211,0,30,1505079658,326.548,313.36,1472038187,1.0,...,0,0,0,0,0,0,0,0,0,1
2,user_100,642,574,27,106,1505073569,458.429,385.894,1323974332,1.0,...,0,0,0,0,0,0,0,0,0,1
3,user_1000,259,235,0,41,1505579889,371.273,336.583,1450375392,1.0,...,0,0,0,0,0,0,0,0,0,1
4,user_1001,554,492,-6,55,1504521879,472.19,450.975,1423399585,1.0,...,0,0,0,0,0,0,0,0,0,1


In [123]:
problems = pd.get_dummies(problems.set_index('problem_id')).reset_index()
problems.head()

Unnamed: 0,problem_id,points,problem_attempts_median,problem_attempts_min,problem_attempts_max,problem_attempts_count,problem_attempts_iqr,algorithms,and,binary,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
0,prob_1,500.0,1.5,1.0,2.0,2.0,0.005,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,prob_10,4500.0,6.0,6.0,6.0,1.0,0.0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,prob_100,1000.0,1.0,1.0,1.0,1.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,prob_1000,500.0,1.0,1.0,6.0,246.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,prob_1001,2000.0,1.0,1.0,2.0,10.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we will prepare a single dataframe that joins the user and problem features with the submissions data.

In [205]:
X = submissions.merge(users, on='user_id').merge(problems, on='problem_id')

# drop columns that contain any null values
X = X.loc[:,X.notnull().all()]

y = X.set_index(['user_id', 'problem_id'])['attempts_range']
X = X.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

X.head()

user_id,problem_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,user_attempts_min,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
user_232,prob_6507,53,47,0,1,1503633778,307.913,206.709,1432110935,2.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_1910,prob_6507,240,218,0,50,1505252563,319.954,291.284,1385471472,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_1824,prob_6507,370,336,-10,30,1505395587,307.339,295.585,1471685215,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_895,prob_6507,318,286,0,20,1505511056,304.186,191.514,1475529522,2.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_779,prob_6507,463,410,0,39,1504799078,374.713,374.713,1437245990,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [206]:
y.head()

user_id    problem_id
user_232   prob_6507     1
user_1910  prob_6507     2
user_1824  prob_6507     2
user_895   prob_6507     1
user_779   prob_6507     1
Name: attempts_range, dtype: int64

Dataframe X now contains all of the user and problem feature data for each combination of user_id and problem_id. Thus, for each training sample or row, we will use the combination of user and problem features to predict the attempts_range. The attempts_range for each user-problem combination is saved in y.

Next, we'll split the data into train and test sets.

In [207]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Training

We will start by building a baseline, out-of-box model and try to improve it from there.

#### Baseline model

In [9]:
clf = RandomForestClassifier(n_jobs=-1)

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [10]:
y_pred = clf.predict(X_test)

f1_score(y_test, y_pred, average='weighted')

0.5041225226063352

The baseline model produces a weighted F1 score of about 0.50, which is far from the ideal value of 1. During my exploratory analysis of the data, it was clear that many features were correlated with one another. Before diving into model optimization through hyperparameter tuning, I want to see if removing some of this collinearity between features improves the model's performance.

#### Dimensionality Reduction

Let's start by performing PCA on the full dataset to see how many features we can safely remove. Performing PCA on the full dataset has two benefits.

1. The dimensionality of the training data is reduced, so the model takes less computation to train.
2. Collinear features are removed. The principal components returned by PCA are all orthogonal.
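
Both points can be checked on a small synthetic example. The sketch below uses made-up data with deliberately duplicated features; it also uses `explained_variance_ratio_` to pick a component count, which is a different (normalized) quantity from the `explained_variance_` plotted in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
base = rng.normal(size=(200, 3))
# Six features: the last three are near-copies of the first three (strong collinearity).
X_toy = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

pca_toy = PCA().fit(X_toy)

# The principal axes are orthonormal, so this Gram matrix is the identity.
gram = pca_toy.components_ @ pca_toy.components_.T

# Smallest number of components whose cumulative variance ratio reaches 99%.
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_keep)
```

Because half of the toy features are redundant, about half of the components carry almost no variance and can be dropped.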

In [175]:
pca = PCA()
pca.fit(X)

x = list(range(1, len(pca.explained_variance_)+1))
# use a separate name so the target vector y is not overwritten
explained_var = pca.explained_variance_

trace0 = go.Scatter(x=x, y=explained_var, mode='lines+markers')

layout = go.Layout(title='Explained Variance vs # of Dimensions',
                  xaxis=dict(title='# of Dimensions'),
                  yaxis=dict(title='Explained Variance', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='explained-var_vs_N-dimensions.html')

In [209]:
X_ = submissions.merge(users, on='user_id').merge(problems, on='problem_id')

# drop columns that contain any null values
X_ = X_.loc[:,X_.notnull().all()]

y_ = X_.set_index(['user_id', 'problem_id'])['attempts_range']
X_ = X_.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

n_components=[1,2,5,10,25,50,100]

f1_scores = []
for n in n_components:

    pca = PCA(n_components=n)
    X_reduced = pca.fit_transform(X_)

    X_train_, X_test_, y_train_, y_test_ = train_test_split(X_reduced, y_, test_size=0.33, random_state=42)

    clf = RandomForestClassifier(n_jobs=-1)

    clf.fit(X_train_, y_train_)

    y_pred_ = clf.predict(X_test_)

    f1_scores.append(f1_score(y_test_, y_pred_, average='weighted'))

In [210]:
trace0 = go.Scatter(x=n_components, y=f1_scores, mode='lines+markers')

layout = go.Layout(title='F1 Score vs Principal Components',
                  xaxis=dict(title='Principal Components'),
                  yaxis=dict(title='F1 Score', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='f1_score-vs-principal_components.html')

We can see that with fewer than 25 principal components, there is a significant drop in the F1 score. Above 25 principal components, there seems to be no difference. Overall, using PCA to remove collinear features and reduce the dataset's dimensionality gives no improvement over the baseline model.
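
Note that the loop above fits PCA on the full dataset before splitting, so the test rows influence the projection. One possible refinement, shown here only as a sketch on synthetic data, is to wrap PCA and the classifier in a scikit-learn `Pipeline` so that PCA is fit on the training split alone. The data, component count, and forest size below are placeholders, not values from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the merged user/problem feature matrix.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=42)

pipe = Pipeline([
    ('pca', PCA(n_components=10)),
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipe.fit(X_tr, y_tr)  # PCA only ever sees the training rows

score = f1_score(y_te, pipe.predict(X_te), average='weighted')
print(round(score, 3))
```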

We can use GridSearchCV to try to tune the hyperparameters of the model. Rather than passing a large dictionary of all the hyperparameters we want to tune at once, I will explore each hyperparameter individually. This makes it more straightforward to interpret the effect of each one. At the end, I will then pass all of the hyperparameters to GridSearchCV to find the optimal combination of all hyperparameters.

#### n_estimators

n_estimators defines how many trees the model will have. Generally, the more trees, the better the model will generalize. However, more trees mean more computation, so we want to strike a balance between the test score and the time needed to train and test the model.

With GridSearchCV, we can define the scoring function. Since we want to maximize the f1_score function with "weighted" averaging from sklearn.metrics, we pass this same scoring function to GridSearchCV.
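
To make the "weighted" averaging concrete, the sketch below computes it by hand on a made-up set of labels: each class gets its own F1 score, and the scores are averaged with weights equal to the number of true samples in each class.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for the example (classes 1, 2 and 3).
y_true = np.array([1, 1, 1, 1, 2, 2, 3, 3])
y_hat = np.array([1, 1, 1, 2, 2, 1, 3, 3])

per_class = f1_score(y_true, y_hat, average=None)  # one F1 per class, in sorted label order
support = np.bincount(y_true)[1:]                  # true sample counts for classes 1..3
manual = np.sum(per_class * support) / support.sum()

weighted = f1_score(y_true, y_hat, average='weighted')
print(manual, weighted)
```

Because the weights follow the class frequencies, the score is dominated by the most common attempts_range values.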

In [52]:
%%time
param_grid = {'n_estimators': [5,10,50,100,150,200,250]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 6min 3s


The results of the search are shown below. 

In [53]:
results = pd.DataFrame({'n_estimators' : [5,10,50,100,150,200,250],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,mean_test_score,mean_train_score,n_estimators
0,0.804431,0.492844,0.939287,5
1,4.254232,0.50299,0.977063,10
2,23.390794,0.51561,0.999707,50
3,53.823086,0.518985,0.999916,100
4,84.666681,0.5195,0.999921,150
5,119.281836,0.520998,0.999921,200
6,116.85595,0.520498,0.999921,250
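
The results tables in this section are built by retyping the parameter values. A small helper can read them from `cv_results_` instead, which avoids the two lists drifting apart. This is a sketch on synthetic data; `summarize_cv` is a hypothetical name, and the tiny grid is only there to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def summarize_cv(cv, param):
    """Collect timing and scores for one tuned hyperparameter from a fitted GridSearchCV."""
    r = cv.cv_results_
    return pd.DataFrame({
        param: list(r['param_' + param]),  # values as searched, in search order
        'combined_mean_fit-test_time': r['mean_fit_time'] + r['mean_score_time'],
        'mean_test_score': r['mean_test_score'],
        'mean_train_score': r['mean_train_score'],
    })

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
cv_demo = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid={'n_estimators': [5, 10]},
                       scoring='f1_weighted', cv=3,
                       return_train_score=True)
cv_demo.fit(X_demo, y_demo)

summary = summarize_cv(cv_demo, 'n_estimators')
print(summary)
```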


Let's plot the train and test scores as a function of N_estimators.

In [54]:
trace1 = go.Scattergl(name='Mean Test Score',
                      x=results['n_estimators'],
                      y=results['mean_test_score'], 
                      mode='lines+markers',
                     yaxis='y2')
trace2 = go.Scattergl(name='Mean Train Score',
                      x=results['n_estimators'],
                      y=results['mean_train_score'], 
                      mode='lines+markers')

layout = go.Layout(title='Mean Train & Test Scores vs N_estimators',
               xaxis=dict(title='N_estimators'),
               yaxis=dict(title='Mean Train Score'), 
                   yaxis2=dict(title='Mean Test Score',
                              side='right'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace1, trace2], layout=layout)

iplot(fig, filename='train-test-scores.html')

We can see both scores increase in going from 5 to 100 estimators but quickly plateau after that. The train and test scores are plotted on separate axes above so we can distinguish the knees of both curves. The training score is already 0.94 at n_estimators=5 and is essentially 1 from 50 estimators onward. The more important score, of course, is the test score. Let's now look at the tradeoff between the test score and the time required to train and test the model.

In [139]:
def plot_cv(df, param):
    # plot from the dataframe passed in, not from the global `results`
    trace0 = go.Scattergl(name='Combined Mean Train+Test Time',
                      x=df[param],
                      y=df['combined_mean_fit-test_time'], 
                      mode='lines+markers',)
    trace1 = go.Scattergl(name='Mean Test Score',
                          x=df[param],
                          y=df['mean_test_score'], 
                          mode='lines+markers',
                         yaxis='y2')

    layout = go.Layout(title='Model Train+Test Time & Test Score vs %s' % param,
                   xaxis=dict(title=param),
                   yaxis=dict(title='Combined Train+Test Time'), 
                       yaxis2=dict(title='Mean Test Score',
                                  side='right'),
                      legend=dict(orientation='h', y=1.12),
                      margin=dict(t=120))

    fig = go.Figure([trace0, trace1], layout=layout)

    iplot(fig, filename='%s.html' % param)

In [56]:
plot_cv(results, param='n_estimators')

Here we see that the combined time for training and testing the model grows steadily with the number of trees, reaching about 85 seconds at n_estimators=150 and about 119 seconds at n_estimators=200 in the table above. At n_estimators=100, the train+test time is about 54 seconds, and the test score (0.519) is within about 0.002 of the best score in the table (0.521). We can thus save a lot of computational resources and time by choosing n_estimators=100.

#### max_depth

In [61]:
%%time
param_grid = {'max_depth': [5,10,50,100,150]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 25.8 s


In [62]:
results = pd.DataFrame({'max_depth' : [5,10,50,100,150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,max_depth,mean_test_score,mean_train_score
0,0.538286,5,0.391628,0.392864
1,1.55723,10,0.482226,0.500286
2,4.712002,50,0.504457,0.976816
3,5.501036,100,0.50299,0.977063
4,5.08066,150,0.50299,0.977063


In [63]:
plot_cv(results, param='max_depth')

#### min_samples_split

In [64]:
%%time
param_grid = {'min_samples_split': [2,3,4,5,10,25,50,100]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 42.6 s


In [65]:
results = pd.DataFrame({'min_samples_split' : [2,3,4,5,10,25,50,100],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,mean_test_score,mean_train_score,min_samples_split
0,1.473731,0.50299,0.977063,2
1,3.979908,0.508514,0.950357,3
2,5.731563,0.511425,0.922482,4
3,5.137546,0.512388,0.894911,5
4,4.976796,0.516058,0.791177,10
5,5.364788,0.516848,0.669401,25
6,4.638996,0.514864,0.606484,50
7,3.893256,0.51425,0.564602,100


In [66]:
plot_cv(results, param='min_samples_split')

#### min_samples_leaf

In [71]:
%%time
param_grid = {'min_samples_leaf': [2,3,4,5,10,25,50,100]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 34.4 s


In [72]:
results = pd.DataFrame({'min_samples_leaf' : [2,3,4,5,10,25,50,100],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,mean_test_score,mean_train_score,min_samples_leaf
0,1.081507,0.514596,0.794307,2
1,2.887071,0.514521,0.700129,3
2,4.853926,0.513977,0.65319,4
3,4.244852,0.513873,0.626746,5
4,3.542787,0.509917,0.569166,10
5,3.508213,0.501484,0.527091,25
6,2.928419,0.494499,0.507405,50
7,2.645871,0.485798,0.492205,100


#### criterion

In [73]:
%%time
param_grid = {'criterion': ["gini", "entropy"]}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 18.7 s


In [74]:
results = pd.DataFrame({'criterion' : ['gini', 'entropy'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,criterion,mean_test_score,mean_train_score
0,1.550646,gini,0.50299,0.977063
1,2.316627,entropy,0.50109,0.976539


#### max_features

In [75]:
%%time
param_grid = {'max_features': [2, 10, 25, 50, 100, 150]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 1min 24s


In [76]:
results = pd.DataFrame({'max_features': [2, 10, 25, 50, 100, 150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,max_features,mean_test_score,mean_train_score
0,1.131247,2,0.486668,0.976429
1,4.126612,10,0.499394,0.977091
2,8.166927,25,0.507646,0.976184
3,15.851055,50,0.511771,0.975862
4,25.009694,100,0.51296,0.975006
5,26.563426,150,0.51304,0.974943


In [77]:
plot_cv(results, 'max_features')

#### oob_score

In [78]:
%%time
param_grid = {'oob_score': [True, False]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 17.7 s


In [79]:
results = pd.DataFrame({'oob_score': [True, False],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,mean_test_score,mean_train_score,oob_score
0,1.90168,0.503503,0.97688,True
1,2.120224,0.504003,0.977537,False


#### warm_start

In [80]:
%%time
param_grid = {'warm_start': [True, False]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 18.2 s


In [81]:
results = pd.DataFrame({'warm_start': [True, False],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,mean_test_score,mean_train_score,warm_start
0,1.335766,0.50262,0.976584,True
1,2.241348,0.505112,0.976516,False


#### class_weight

In [82]:
%%time
param_grid = {'class_weight': [None, 'balanced', 'balanced_subsample']}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 26.1 s


In [83]:
results = pd.DataFrame({'class_weight': [None, 'balanced', 'balanced_subsample'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,class_weight,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,,1.538506,0.503074,0.976677
1,balanced,4.328634,0.497881,0.977521
2,balanced_subsample,6.72163,0.497269,0.97708


#### max_leaf_nodes

In [84]:
%%time
param_grid = {'max_leaf_nodes': [None, 10, 25, 50, 100, 150]}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

Wall time: 22.8 s


In [85]:
results = pd.DataFrame({'max_leaf_nodes': [None, 10, 25, 50, 100, 150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Unnamed: 0,combined_mean_fit-test_time,max_leaf_nodes,mean_test_score,mean_train_score
0,1.33159,,0.502644,0.977297
1,0.853774,10.0,0.417141,0.415954
2,1.552411,25.0,0.44147,0.441634
3,2.180825,50.0,0.467904,0.4703
4,2.47259,100.0,0.477493,0.482593
5,2.354054,150.0,0.483904,0.490865


#### Putting it all together

In [143]:
clf = RandomForestClassifier(n_estimators=100, 
                             max_depth=50, 
                             min_samples_split=25, 
                             min_samples_leaf=25,
                             max_features=50,
                             n_jobs=-1)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

f1_score(y_test, y_pred, average='weighted')

0.5300938646563604
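
The final model above combines values read off the individual searches. The joint search mentioned earlier could look roughly like the sketch below. It runs on synthetic data with a deliberately tiny grid, so the parameter values and the resulting best combination are placeholders rather than results for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Every combination is evaluated: 2 x 2 x 2 = 8 candidates, each with 3-fold CV.
param_grid = {
    'n_estimators': [10, 25],
    'max_depth': [5, 10],
    'min_samples_split': [2, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid,
                      scoring='f1_weighted', cv=3)
search.fit(X_demo, y_demo)

best = search.best_params_
print(best, round(search.best_score_, 3))
```

The cost grows multiplicatively with each added hyperparameter, which is one reason to narrow the ranges with the one-at-a-time searches first.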

## SGD

In [260]:
# number of unique users
n_u = submissions['user_id'].nunique()
# number of unique items (problems)
n_i = submissions['problem_id'].nunique()

In [268]:
print('Number of unique users: %s' % n_u)
print('Number of unique problems: %s' % n_i)

Number of unique users: 3529
Number of unique problems: 5776


In [269]:
sparsity = len(submissions)/(n_u*n_i)
print('Sparsity of attempts_range: %s%%' % round(sparsity*100, 2))

Sparsity of attempts_range: 0.76%


The full submissions dataset contains 3529 unique users and 5776 unique problems. We have attempts_range data for only 0.76% of all user-problem combinations, so the user-problem matrix is extremely sparse. (Strictly speaking, the value computed above is the fraction of filled entries.) Even the Netflix Prize dataset had at least 1% of its entries filled. This type of collaborative filtering model may therefore predict very poorly.

In [271]:
R = submissions.set_index(['user_id','problem_id']).unstack(level=-1)
R.head()

user_id,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,1.0,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


In [292]:
from scipy.sparse import csr_matrix

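# Split observed entries, not whole user rows, so both matrices keep the
# n_u x n_i shape of p and q. Missing entries become 0 (attempts_range >= 1).
R_dense = R.fillna(0).values
obs_u, obs_i = R_dense.nonzero()
k_train, k_test = train_test_split(np.arange(len(obs_u)), test_size=0.25)
def to_csr(k):
    return csr_matrix((R_dense[obs_u[k], obs_i[k]], (obs_u[k], obs_i[k])), shape=R_dense.shape)
R_train, R_test = to_csr(k_train), to_csr(k_test)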

In [293]:
n_features = 50
p = np.random.rand(n_u, n_features) - 0.5
q = np.random.rand(n_i, n_features) - 0.5

In [294]:
# get only nonzero entries of the sparse matrix
idx_u, idx_i = R_train.nonzero()

In [295]:
def rmse_score(R, q, p):
    """Root-mean-squared error over the observed (nonzero) entries of sparse R."""
    u, i = R.nonzero()  # indices of the observed entries
    predicted = np.sum(p[u, :] * q[i, :], axis=1)  # row-wise dot products p_u . q_i
    actual = np.asarray(R[u, i]).ravel()
    return np.sqrt(np.mean((actual - predicted)**2))

In [298]:
alpha = 0.01
lmbda = 1
n_epochs=10

train_errors = []
test_errors = []

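for epoch in range(n_epochs):
    for u, i in zip(idx_u, idx_i):
        e = R_train[u, i] - np.dot(p[u, :], q[i, :])  # prediction error
        p_u = p[u, :].copy()  # keep the old user factors for the item update
        p[u, :] += alpha * (e * q[i, :] - lmbda * p[u, :])
        q[i, :] += alpha * (e * p_u - lmbda * q[i, :])
    # evaluate on the training entries, not on the full matrix R
    train_errors.append(rmse_score(R_train, q, p))
    test_errors.append(rmse_score(R_test, q, p))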

KeyboardInterrupt: 
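
Since the run above was interrupted, it can help to see the same per-entry update converge on a tiny problem first. The sketch below is self-contained and uses made-up (user, item, value) triples; `sgd_factorize` and its default hyperparameters are assumptions for illustration, not tuned values for this dataset.

```python
import numpy as np

def sgd_factorize(triples, n_users, n_items, n_features=2,
                  alpha=0.05, lmbda=0.01, n_epochs=200, seed=0):
    """Fit user factors p and item factors q to (user, item, value) triples with SGD."""
    rng = np.random.RandomState(seed)
    p = rng.rand(n_users, n_features) - 0.5
    q = rng.rand(n_items, n_features) - 0.5
    history = []  # training RMSE after each epoch
    for _ in range(n_epochs):
        for u, i, r in triples:
            e = r - p[u] @ q[i]   # error on this observed entry
            p_u = p[u].copy()     # use the pre-update user factors for the item step
            p[u] += alpha * (e * q[i] - lmbda * p[u])
            q[i] += alpha * (e * p_u - lmbda * q[i])
        sq_err = [(r - p[u] @ q[i]) ** 2 for u, i, r in triples]
        history.append(float(np.sqrt(np.mean(sq_err))))
    return p, q, history

# Six observed entries of a 3 x 3 matrix (values play the role of attempts_range).
toy = [(0, 0, 1), (0, 1, 2), (1, 0, 1), (1, 2, 3), (2, 1, 2), (2, 2, 3)]
p_toy, q_toy, history = sgd_factorize(toy, n_users=3, n_items=3)
print(history[0], history[-1])
```

Iterating over a plain list of triples avoids indexing the sparse matrix element by element inside the inner loop, which is slow for scipy sparse matrices.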

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)  # sparse matrices cannot be centered
scaler.fit(R_train)  # Don't cheat - fit only on training data
R_train_scaled = scaler.transform(R_train)
R_test_scaled = scaler.transform(R_test)  # apply same transformation to test data

In [230]:
R = submissions.set_index(['user_id','problem_id']).unstack(level=-1)
R.head()

user_id,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,1.0,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


In [None]:
def get_args(X, n_features):

    X = np.array(X, dtype=float)
    
    # get mean for each column (problem), ignoring missing values
    mean = np.nanmean(X, axis=0)

    # subtract mean from each column (mean normalization)
    y = X - mean # (n_users x n_items)

    n_users, n_items = X.shape

    # Initialize two random matrices to make our initial predictions:
    # X_init holds the item (problem) features, theta_init the user features
    X_init = np.random.rand(n_items, n_features) - 0.5
    theta_init = np.random.rand(n_users, n_features) - 0.5

    return X_init, y, theta_init, n_users, n_items, mean

In [252]:
X_init, y, theta_init, n_users, n_items, mean = get_args(R, n_features=10)


Mean of empty slice



In [253]:
def unroll_params(X, theta, order='C'):

    X = np.array(X)
    theta = np.array(theta)

    parameters = np.concatenate((X.flatten(order=order),
                                 theta.flatten(order=order)), axis=0)

    return parameters

In [254]:
params = unroll_params(X_init, theta_init)

In [255]:
def roll_params(parameters, n_users, n_items, n_features):

    dim1 = n_items*n_features

    X = np.reshape(parameters[0:dim1], (n_items, n_features))
    theta = np.reshape(parameters[dim1:], (n_users, n_features))

    return X, theta

In [256]:
def cost_f(parameters, y, n_users, n_items, n_features, Lambda):
    X, theta = roll_params(parameters, n_users, n_items, n_features)

    hyp = np.dot(theta,X.T)
    error = hyp - y
    error_factor = error.copy() # dimensions (n_users x n_items)
    error_factor[np.isnan(error)] = 0 # Sets all missing values to 0s

    # Compute the COST FUNCTION with REGULARIZATION
    theta_reg = (Lambda/2) * np.nansum(theta*theta)
    X_reg = (Lambda/2) * np.nansum(X*X)

    J = (1/2) * np.nansum(error_factor*error_factor) + theta_reg + X_reg

    return J

# grad_f calculates the gradients of the cost function w.r.t. X and theta

def grad_f(parameters, y, n_users, n_items, n_features, Lambda):
    X, theta = roll_params(parameters, n_users, n_items, n_features)

    hyp = np.dot(theta,X.T)
    error = hyp - y
    error_factor = error.copy() # dimensions (n_users x n_items)
    error_factor[np.isnan(error)] = 0 # Sets all missing values to 0s

    X_grad = np.dot(error_factor.T, theta) + Lambda*X
    theta_grad = np.dot(error_factor, X) + Lambda*theta

    grad = unroll_params(X_grad, theta_grad)

    return grad

In [259]:
from scipy.optimize import fmin_cg

# extra arguments forwarded to cost_f and grad_f:
# y, n_users, n_items, n_features (10, as in get_args above), Lambda
results = fmin_cg(cost_f, params, grad_f, args=(y, n_users, n_items, 10, 1), 
                 full_output=True, maxiter=100)

SyntaxError: invalid syntax (<ipython-input-259-9c015ed97b05>, line 1)

In [239]:


# unroll_params takes two matrices (X and Theta), and unrolls them into a
# single end-to-end vector (params).



# roll_params takes an unrolled vector, params, and reshapes it into the
# matrices X and Theta.



# cost_f takes in the parameters vector and computes the model's predictions.
# It then compares the model's predictions to the actual ratings and computes
# a cost associated with the model's current parameters.

def cost_f(parameters, *args):

    Y = args[1]
    Lambda = args[3]

    X, Theta = roll_params(parameters, *args)

    hyp = np.dot(Theta,X.T)
    error = hyp - Y
    error_factor = error.copy() # dimensions (N_users x N_problems)
    error_factor[np.isnan(error)] = 0 # Sets all missing values to 0s

    # Compute the COST FUNCTION with REGULARIZATION
    Theta_reg = (Lambda/2) * np.nansum(Theta*Theta)
    X_reg = (Lambda/2) * np.nansum(X*X)

    J = (1/2) * np.nansum(error_factor*error_factor) + Theta_reg + X_reg

    return J

# grad_f calculates the gradients of the cost function w.r.t. X and Theta

def grad_f(parameters, *args):

    Y = args[1]
    Lambda = args[3]

    X, Theta = roll_params(parameters, *args)

    hyp = np.dot(Theta,X.T)
    error = hyp - Y
    error_factor = error.copy() # dimensions (N_users x N_problems)
    error_factor[np.isnan(error)] = 0 # Sets all missing values to 0s

    X_grad = np.dot(error_factor.T, Theta) + Lambda*X
    Theta_grad = np.dot(error_factor, X) + Lambda*Theta

    grad = unroll_params(X_grad, Theta_grad)

    return grad
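
Written out, the quantities computed by `cost_f` and `grad_f` are as follows. The notation is introduced here for illustration: $\theta_u$ is the row of `Theta` for user $u$, $x_i$ is the row of `X` for problem $i$, $y_{ui}$ is the observed rating, and $\lambda$ is `Lambda`. Sums over ratings run only over observed (non-NaN) entries, which is what zeroing `error_factor` achieves.

$$
J(X, \Theta) = \frac{1}{2} \sum_{(u,i)\,:\,y_{ui}\ \text{observed}} \left(\theta_u^{\top} x_i - y_{ui}\right)^2
+ \frac{\lambda}{2} \sum_{u} \lVert \theta_u \rVert^2
+ \frac{\lambda}{2} \sum_{i} \lVert x_i \rVert^2
$$

$$
\frac{\partial J}{\partial x_i} = \sum_{u\,:\,y_{ui}\ \text{observed}} \left(\theta_u^{\top} x_i - y_{ui}\right)\theta_u + \lambda x_i,
\qquad
\frac{\partial J}{\partial \theta_u} = \sum_{i\,:\,y_{ui}\ \text{observed}} \left(\theta_u^{\top} x_i - y_{ui}\right)x_i + \lambda \theta_u
$$

Stacking the first expression over all problems gives `X_grad`, and stacking the second over all users gives `Theta_grad`.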


In [240]:
from scipy.optimize import fmin_cg

# Fit the collaborative filtering model with conjugate gradient (at most 100
# iterations), then predict attempts_range for every user-problem pair.
args = get_args(R, N_features=150, Lambda=1)
parameters = unroll_params(args[0], args[2])

results = fmin_cg(cost_f, parameters, grad_f, args=args, 
                 full_output=True, maxiter=100)

X_opt, Theta_opt = roll_params(results[0], *args)

predictions = np.dot(Theta_opt, X_opt.T)
predictions = pd.DataFrame(predictions, index=R.index, columns=R.columns)
predictions = predictions.add(args[7], axis=1)
predictions.head()


Mean of empty slice



         Current function value: 7503.733864
         Iterations: 100
         Function evaluations: 153
         Gradient evaluations: 153


attempts_range (rows: user_id, columns: problem_id)

| user_id | prob_1 | prob_10 | prob_100 | prob_1000 | prob_1001 | prob_1002 | prob_1003 | prob_1004 | prob_1005 | prob_1006 | ... | prob_990 | prob_991 | prob_992 | prob_993 | prob_994 | prob_995 | prob_996 | prob_997 | prob_998 | prob_999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_1 | NaN | NaN | 1.000888 | 1.391362 | 1.030485 | 1.909976 | 2.497523 | 0.999289 | 1.60116 | 1.77772 | ... | NaN | 1.509982 | 2.050654 | 2.000307 | 2.999014 | 1.000568 | 1.999924 | 1.008984 | 1.999767 | 4.21358 |
| user_10 | NaN | NaN | 0.999922 | 0.792726 | 1.101327 | 1.820148 | 2.562085 | 1.000163 | 1.626114 | 1.724674 | ... | NaN | 1.502891 | 1.831747 | 2.000264 | 2.999466 | 1.000351 | 2.000106 | 1.218786 | 1.99993 | 4.406965 |
| user_100 | NaN | NaN | 1.000285 | 1.226508 | 1.05879 | 1.941849 | 2.501446 | 1.000097 | 1.896022 | 1.862571 | ... | NaN | 1.553863 | 1.812087 | 2.000139 | 3.000414 | 1.000081 | 2.00073 | 1.18915 | 2.000231 | 4.51271 |
| user_1000 | NaN | NaN | 1.000422 | 1.020091 | 1.183453 | 2.002745 | 2.597639 | 0.999034 | 1.589689 | 1.683946 | ... | NaN | 1.564693 | 1.921421 | 2.000174 | 2.999734 | 1.000566 | 1.999734 | 1.125296 | 1.999941 | 4.408706 |
| user_1001 | NaN | NaN | 1.000567 | 0.846059 | 1.094228 | 1.93956 | 2.574057 | 1.000007 | 2.087727 | 1.690446 | ... | NaN | 1.498363 | 1.711499 | 1.999883 | 2.999942 | 0.999931 | 2.000678 | 0.974653 | 2.000149 | 4.547452 |
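
The recommendation criteria from the problem statement (problem not yet attempted, predicted attempts range of 2 or 3) can be applied directly to a prediction matrix like this one. The sketch below uses small hypothetical DataFrames (`observed`, `predicted`) and a hypothetical helper `recommend`. Rounding the continuous predictions to the nearest class and clipping to the 1 to 6 range is an assumption made for illustration, not necessarily how the predictions are discretized elsewhere in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are problems.
# NaN in `observed` means the user has not attempted that problem.
observed = pd.DataFrame(
    [[1.0, np.nan, 3.0],
     [np.nan, 2.0, np.nan]],
    index=["user_a", "user_b"],
    columns=["prob_x", "prob_y", "prob_z"],
)
predicted = pd.DataFrame(
    [[1.1, 2.4, 2.9],
     [3.2, 2.0, 5.6]],
    index=observed.index,
    columns=observed.columns,
)

def recommend(observed, predicted, targets=(2, 3)):
    # Round to the nearest attempts_range class and keep it within 1..6.
    rounded = predicted.round().clip(lower=1, upper=6)
    # Keep only unattempted problems whose predicted class is a target.
    mask = observed.isna() & rounded.isin(list(targets))
    return {user: list(mask.columns[mask.loc[user].to_numpy()])
            for user in mask.index}

recommendations = recommend(observed, predicted)
```

Problems whose prediction is NaN, such as those with no training submissions, never match a target class and so are never recommended.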


In [105]:
pd.crosstab(y_pred, y_test, rownames=['Predicted attempts_range'], 
            colnames=['Actual attempts_range'])

| Predicted attempts_range \ Actual attempts_range | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 21520 | 9401 | 2234 | 751 | 332 | 332 |
| 2 | 4890 | 5376 | 1848 | 749 | 360 | 377 |
| 3 | 571 | 691 | 450 | 155 | 78 | 132 |
| 4 | 118 | 150 | 85 | 68 | 36 | 40 |
| 5 | 27 | 46 | 25 | 19 | 15 | 18 |
| 6 | 54 | 97 | 52 | 34 | 21 | 96 |
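
Per-class precision, recall, and F1 can be read straight off this crosstab. The sketch below copies the counts into a NumPy array (rows are predicted classes, columns are actual classes, matching the `pd.crosstab` call) and derives the metrics with plain arithmetic. The variable names are illustrative.

```python
import numpy as np

# Rows: predicted attempts_range 1..6; columns: actual attempts_range 1..6.
cm = np.array([
    [21520, 9401, 2234, 751, 332, 332],
    [ 4890, 5376, 1848, 749, 360, 377],
    [  571,  691,  450, 155,  78, 132],
    [  118,  150,   85,  68,  36,  40],
    [   27,   46,   25,  19,  15,  18],
    [   54,   97,   52,  34,  21,  96],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=1)   # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=0)      # correct / everything actually in the class
f1 = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=0)                # number of actual samples per class
accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()                    # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by support

for label, p, r, f in zip(range(1, 7), precision, recall, f1):
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Because class 1 dominates the test set, the support-weighted F1 is pulled toward that class's score, while the macro average exposes how weak the rarer classes are.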
