# Capstone 1: In-Depth Analysis

#### Kenneth Liao

Original datasource: https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine/#

In [1]:
import pandas as pd
import numpy as np
import time
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from scipy.optimize import minimize, fmin_cg
import pickle as pkl

# enable offline plotting in plotly
init_notebook_mode(connected=True)

The functions defined below are for saving and loading results to and from a pickle object. 

In [2]:
def save_obj(obj, name ):
    with open('results/'+ name + '.pkl', 'wb') as f:
        pkl.dump(obj, f, pkl.HIGHEST_PROTOCOL)

def load_obj(name ):
    with open('results/' + name + '.pkl', 'rb') as f:
        return pkl.load(f)

In [3]:
# load our 3 datasets
users = pd.read_csv('data/user_features.csv')
problems =  pd.read_csv('data/problem_features.csv')
submissions = pd.read_csv('data/train_submissions.csv')

## Background & Problem Statement 

Ultimately, the goal of this project is to recommend practice problems to users given some information about the problems they have already solved. There are many criteria we could choose to base how we recommend problems. For the purpose of this model, I will keep the criteria simple. The criteria are as follows:

1. The problem has not yet been attempted by the user.
2. The predicted number of attempts the user will require to solve the problem is equal to 2 or 3 (attempts_range=2).

Given the criteria defined above, we must first be able to predict how many attempts a user will require to solve a problem they've never attempted before. I will perform this prediction using two very different models. 

The first model will be a Random Forest Classifier. For this model, I will use meta data available for users and problems. The goal is to find patterns in the user and problem features that predict well the number of attempts for a given user-problem combination.

The second model will be a collaborative filtering model. This model will employ stochastic gradient descent (SGD) to find an approximate solution to the single value decomposition (SVD) of our user-problem matrix. In this case, we will not utilize the user and problem datasets. Predictions will be made exclusively using the history of problem submissions.

Let's take a quick look at the submissions dataset. This dataset has 3 columns: user_id, problem_id, and attempts_range. Attempts_range gives the range of attempts that the user_id took to solve the problem_id and is defined in the original datasource as shown below.

In [4]:
submissions.head()

Unnamed: 0,user_id,problem_id,attempts_range
0,user_232,prob_6507,1
1,user_3568,prob_2994,3
2,user_1600,prob_5071,1
3,user_2256,prob_703,1
4,user_2321,prob_356,1


>We have used following criteria to define the attempts_range :-
>
>            attempts_range            No. of attempts lies inside
>
>            1                                         1-1
>
>            2                                         2-3
>
>            3                                         4-5
>
>            4                                         6-7
>
>            5                                         8-9
>
>            6                                         >=10

## Train Test Split

I'll start by randomly sampling 25% of the full submissions dataset to create a test set, `S_test`. I'll use sklearn's train_test_split while passing in the `attempts_range` column to the stratify argument. This will ensure that the proportions of the 6 attempts_range labels are preserved through the split. I'll then split the remaining 75%, again using the stratify argument, into 75% for training (`S_train`) and 25% for cross-validation (`S_cv`). Setting the `random_state` to 42 will ensure reproducibility.

In [5]:
train_cv, S_test = train_test_split(submissions, test_size=0.25, 
                                 stratify=submissions['attempts_range'], random_state=42)

S_train, S_cv = train_test_split(train_cv, test_size=0.25, 
                                 stratify=train_cv['attempts_range'], random_state=42)

In [6]:
S_train.head()

Unnamed: 0,user_id,problem_id,attempts_range
24964,user_161,prob_4719,1
2559,user_2878,prob_53,2
68733,user_1067,prob_6434,3
87410,user_2501,prob_6293,3
33117,user_1072,prob_3705,1


These splits will be now be used to compare all models. We can plot the proportions of the 6 labels in each split to check that they are indeed equal. The plot below confirms this.

In [7]:
trace0 = go.Histogram(x=S_train.attempts_range, histnorm='probability', name='S_train')
trace1 = go.Histogram(x=S_cv.attempts_range, histnorm='probability', name='S_cv')
trace2 = go.Histogram(x=S_test.attempts_range, histnorm='probability', name='S_test')

layout = go.Layout(title='Attempts_range Distributions',
               xaxis=dict(title='Attempts_range'),
               yaxis=dict(title='% of sample population'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace0, trace1, trace2], layout=layout)

iplot(fig, filename='split-distributions.html')

## Random Forest Model

**Problem Statement:** For a given user and a problem the user has never attempted before, can we use meta data on the user and problem to predict the `attempts_range` it will take the user to solve the problem?

### Preparing data for random forest 

The random forest models will be trained using the users and problems datasets. These datasets contain information about specific problems and users. The first thing we need to do to prepare the data for the random forest model is convert categorical, string columns into dummy variables. We do this for both the user and problem features.

In [8]:
users = pd.get_dummies(users.set_index('user_id')).reset_index()
users.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,...,country_Ukraine,country_United Kingdom,country_United States,country_Uzbekistan,country_Venezuela,country_Vietnam,rank_advanced,rank_beginner,rank_expert,rank_intermediate
0,user_1,84,73,10,120,1505162220,502.007,499.713,1469108674,1.0,...,0,0,0,0,0,0,1,0,0,0
1,user_10,246,211,0,30,1505079658,326.548,313.36,1472038187,1.0,...,0,0,0,0,0,0,0,0,0,1
2,user_100,642,574,27,106,1505073569,458.429,385.894,1323974332,1.0,...,0,0,0,0,0,0,0,0,0,1
3,user_1000,259,235,0,41,1505579889,371.273,336.583,1450375392,1.0,...,0,0,0,0,0,0,0,0,0,1
4,user_1001,554,492,-6,55,1504521879,472.19,450.975,1423399585,1.0,...,0,0,0,0,0,0,0,0,0,1


In [9]:
problems = pd.get_dummies(problems.set_index('problem_id')).reset_index()
problems.head()

Unnamed: 0,problem_id,points,problem_attempts_median,problem_attempts_min,problem_attempts_max,problem_attempts_count,problem_attempts_iqr,algorithms,and,binary,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
0,prob_1,500.0,1.5,1.0,2.0,2.0,0.005,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,prob_10,4500.0,6.0,6.0,6.0,1.0,0.0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,prob_100,1000.0,1.0,1.0,1.0,1.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,prob_1000,500.0,1.0,1.0,6.0,246.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,prob_1001,2000.0,1.0,1.0,2.0,10.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, I'll join the user and problem features into a single dataframe. This dataframe will contain our predictor variables. Each row contains the user and problem features for a given user-problem combination. This dataset contains information on all users and problems. However, the submissions dataset does not contain actual attempt values for all user-problem combinations since not all problems have been attempted by all users yet. This is what we're trying to predict. 

In [10]:
X_train = S_train.merge(users, on='user_id').merge(problems, on='problem_id')
X_cv = S_cv.merge(users, on='user_id').merge(problems, on='problem_id')

# remove rows with any null values
X_train = X_train.loc[:,X_train.notnull().all()]
X_cv = X_cv.loc[:,X_cv.notnull().all()]

y_train = X_train.set_index(['user_id', 'problem_id'])['attempts_range']
X_train = X_train.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

y_cv = X_cv.set_index(['user_id', 'problem_id'])['attempts_range']
X_cv = X_cv.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

X_train.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,user_attempts_min,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
user_id,problem_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
user_161,prob_4719,1102,1003,0,683,1505515170,594.323,551.892,1351548748,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_2755,prob_4719,575,567,0,15,1505586992,349.197,293.865,1422206263,2.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_265,prob_4719,131,118,0,10,1500035660,520.642,483.372,1454579098,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_1209,prob_4719,840,782,157,959,1505391741,735.665,663.704,1369775813,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_3430,prob_4719,402,349,21,22,1505581791,430.906,417.144,1331658494,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
y_train.head()

user_id    problem_id
user_161   prob_4719     1
user_2755  prob_4719     3
user_265   prob_4719     2
user_1209  prob_4719     2
user_3430  prob_4719     2
Name: attempts_range, dtype: int64

Dataframe X now contains all of the user and problem feature data for each combination of user_id and problem_id. Thus, for each training sample or row, we will use the combination of user and problem features to predict the attempts_range. The attempts_range for each user-problem combination is saved in y.

### Simplest Baseline Model

To benchmark our models, we'll be using sklearn's f1_score function with the average argument set to "weighted". This function will compute the f1-score for each of the labels in the dataset and then take a weighted average of the scores depending on how many samples are in each label. Thus, we will simply get one overall f1-score. the function `f1` below is a wrapper function to make it more convenient for scoring our predictions from both random forest and the matrix factorization models.

In [12]:
def f1(Y_true, Y_predicted, average='weighted', labels=[1.0,2.0,3.0,4.0,5.0,6.0]):
    
    """Compute the f1_score between a true values
    matrix and predicted values array.
    
    Parameters
    ----------
    Y_true : 1D or 2D numpy array
        Matrix of true values.
    Y_predicted : 1D or 2D numpy array
        Matrix of predictions.
    average : str
        Method for weighting f1-score which is computed
        for each label.
    labels : list
        List of labels to compute f1-scores over.
        
    Returns
    -------
    f1 : float
        f1-score computed using `average` method 
        across specified `labels`.
    """
    
    Y_true_ = np.array(Y_true)
    Y_predicted_ = np.array(Y_predicted)
    
    # get indices of non-NaN values
    mask = ~np.isnan(Y_true_)
    
    # flatten mask into 1D array
    mask = mask.flatten(order='C')
    
    # flatten matrices into 1D arrays and filter with mask
    y_true = Y_true_.flatten(order='C')[mask]
    y_predicted = Y_predicted_.flatten(order='C')[mask]
    
    # compute f1-score
    f1 = f1_score(y_true, y_predicted, average=average, labels=labels)
    
    return f1

We know from our previous exploratory analysis of this data that 1 is by far the most common attempts_range. A very simple prediction model we can make is just to predict 1 for all missing values. Let's see how such a model would do.

In [13]:
y_predicted = np.ones(len(y_train))

print('F1 score for predicting all ones on training data: %s' % round(f1(y_train, y_predicted), 4))

F1 score for predicting all ones on training data: 0.3709



F-score is ill-defined and being set to 0.0 in labels with no predicted samples.



The F1 score we got for predicting 1 for all of the training samples is 0.371. How does this compare in the CV dataset?

In [14]:
y_predicted = np.ones(len(y_cv))

print('F1 score for predicting all ones on cv data: %s' % round(f1(y_cv, y_predicted), 4))

F1 score for predicting all ones on cv data: 0.3709


We get the same f1 score for predicting all ones on the CV dataset. This is a good indication that there was minimal selection bias in our splitting. Of course this is what we would expect since we stratified the train test splits.

### Out-of-box Random Forest

Next, I'll start with an out-of-box random forest model. I'll then tune the model hyperparameters to optimize the f1-score for predictions on the CV dataset.

In [15]:
clf = RandomForestClassifier(n_estimators=100, n_jobs=12, random_state=42)

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=12, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [16]:
y_predicted = clf.predict(X_train)

score = f1(y_train, y_predicted, average='weighted')
print('f1-score on training data: %s' % round(score, 4))

f1-score on training data: 0.9999


In [17]:
y_predicted = clf.predict(X_cv)

score = f1(y_cv, y_predicted, average='weighted')
print('f1-score on CV data: %s' % round(score, 4))

f1-score on CV data: 0.5191


We get an f1-score of 0.9999 on the training data and 0.5191 on the cross-validation data. This is quite surprising because since the distribution of labels is the same in both train and cv datasets, I would expect the model to fit both equally well. Clearly the model has learned how to fit the training data extremely well, but it may be suffering from overfitting and therefore doesn't generalize well to the cv data.

In [18]:
feature_importances = pd.DataFrame({'feature': X_train.columns, 
                                    'importance': clf.feature_importances_})

feature_importances.sort_values('importance', ascending=False).head()

Unnamed: 0,feature,importance
100,problem_attempts_count,0.09311
97,problem_attempts_median,0.057118
6,rating,0.055744
7,registration_time_seconds,0.054861
5,max_rating,0.054826


The table above shows the top 5 most important features for the model's predictions. The best predictor for this model is `problem_attempts_count`, the number of attempts a problem has received from all users. It's importance score is much higher than the rest. Let's look at the relationship between `attempts_range` and this feature, `problem_attempts_count`.

In [19]:
problem_attempts = X_train.join(y_train)[['problem_attempts_count', 'attempts_range']]
problem_attempts_grp = problem_attempts.groupby('attempts_range').mean()

trace0 = go.Scattergl(name='problem_attempts_count',
                      x=problem_attempts_grp.index,
                      y=problem_attempts_grp['problem_attempts_count'], 
                      mode='lines+markers')

layout = go.Layout(title='Mean problem_attempts_count vs attempts_range',
               xaxis=dict(title='attempts_range'),
               yaxis=dict(title='Mean problem_attempts_count'),
                  legend=dict(orientation='h', y=1.12))

fig = go.Figure([trace0], layout=layout)

iplot(fig, filename='train-test-scores.html')

The plot above shows the relationship between the mean `problem_attempts_count` and `attempts_range`. The trend shows a decrease in the mean `problem_attempts_count` as `attempts_range` increases. This is intuitively easy to understand since we can think of problems with a higher `attempts_range` as harder, and harder problems will be attempted by fewer students while easier problems will be attempted by the most students.

So the out-of-box random forest model gave an f1-score on the cross-validation data of 0.5191. This is already a great improvement over the baseline model with an f1-score of 0.371! 

During my exploratory analysis of the data, it was clear that many features were correlated with one another. Before diving into model optimization through hyperparameter tuning, I want to see if removing some of this colinearity between features improves the model's performance.

### Dimensionality Reduction

Let's start by performing PCA on the full dataset to see how many features we can safely remove. Performing PCA on the full dataset has two benefits.

1. The dimensionality of the training data is reduced and therefore takes less computation resources and time to train the model on.
2. Colinear features are removed. The principal components returned by PCA are all orthogonal.

In [20]:
pca = PCA()
pca.fit(X_train)

x = list(range(1, len(pca.explained_variance_)+1))
y = pca.explained_variance_

trace0 = go.Scattergl(x=x, y=y, mode='lines+markers')

layout = go.Layout(title='Explained Variance vs # of Dimensions',
                  xaxis=dict(title='# of Dimensions'),
                  yaxis=dict(title='Explained Variance', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='explained-var_vs_N-dimensions.html')

The plot above shows the explained variance of the data as a function of the number of principle components or dimensions found by PCA. There is a steep drop between 1 and 12 principle components. Above 12 principle components, there is very little contribution to the explained variance of the data. Let's now test how the number of principle components affects the model's predictions.

In [None]:
n_components=[1,2,3,4,5,10,15,20,25,30,40,50,100]

f1_scores = []
for n in n_components:

    pca = PCA(n_components=n)
    
    X_train_r = pca.fit_transform(X_train)
    X_cv_r = pca.fit_transform(X_cv)
    
    clf = RandomForestClassifier(n_estimators=100, n_jobs=12, random_state=42)

    clf.fit(X_train_r, y_train)

    y_predicted = clf.predict(X_cv_r)

    f1_scores.append(f1(y_cv, y_predicted, average='weighted'))

In [None]:
save_obj(f1_scores, 'f1_cv-n_components')

In [21]:
f1_n_components = load_obj('f1_cv-n_components')

In [22]:
n_components=[1,2,3,4,5,10,15,20,25,30,40,50,100]
trace0 = go.Scatter(x=n_components, y=f1_n_components, mode='lines+markers')

layout = go.Layout(title='CV F1 Score vs Principal Components',
                  xaxis=dict(title='Principal Components'),
                  yaxis=dict(title='CV F1 Score', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='f1_score-vs-principal_components.html')

We can see that at a number of principal components less than 25, there is a significant hit in the F1 score. Above 25 principal components, the F1 score also decreases monotonically. In general, there is no improvement over the baseline model when using PCA to remove colinear features and reduce the dataset's dimensionality. The lack of improvement in performance may be attributed to the fact that random forest uses bootstrapping to generate smaller samples for each tree. These smaller samples choose only a subset of the available features and may therefore break up any collinearity between features.

### Hyperparameter Tuning

We can now use `GridSearhCV` to try to tune the hyperparameters of the model. Rather than passing a large dictionary object of all the hyperparameters we want to tune at once, I will explore some important hyperparameters individually. This will make interpretting the effects of each hyperparameter easier. At the end, I will then pass all of the hyperparameters to `GridSearchCV` to find the optimal combination of all hyperparameters.

#### n_estimators

n_estimators defines how many trees the model will create, before averaging their results. Generally, the more trees the better the model will generalize. However more trees equals more computation and therefore we want to strike a balance between fit to the test data and train + test times.

With `GridSearchCV`, we can define the scoring function. Since we want to maximize the f1_score function with "weighted" averaging from `sklearn.metrics`, we pass this same scoring function to `GridSearchCV`.

In [None]:
%%time
param_grid = {'n_estimators': [10,50,100,150,200,300,500,1000]}

clf = RandomForestClassifier(n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

The results of the search are shown below. 

In [None]:
results = pd.DataFrame({'n_estimators' : [10,50,100,150,200,300,500,1000],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
save_obj(results, 'gridsearch-n_estimators')

In [23]:
n_components = load_obj('gridsearch-n_estimators')

Let's plot the train and test scores as a function of `n_estimators`.

In [24]:
trace2 = go.Scattergl(name='Mean CV Score',
                      x=n_components['n_estimators'],
                      y=n_components['mean_test_score'], 
                      mode='lines+markers',
                     yaxis='y2')
trace1 = go.Scattergl(name='Mean Train Score',
                      x=n_components['n_estimators'],
                      y=n_components['mean_train_score'], 
                      mode='lines+markers')

layout = go.Layout(title='Mean Train & Test Scores vs n_estimators',
               xaxis=dict(title='n_estimators'),
               yaxis=dict(title='Mean Train Score'), 
                   yaxis2=dict(title='Mean CV Score',
                              side='right', overlaying='y'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace1, trace2], layout=layout)

iplot(fig, filename='train-test-scores.html')

The F1 score on the training data increases as n_estimators goes from 10 to 50, but quickly plateaus after that. The training score is very close to 1, even for n_estimators=5. The train and cv scores are plotted on separate axes above so we can distinguish the knees of both curves. The more important score of course is the cv score, which continues to increase with increasing n-estimators. The CV score peaks at 500 `n_estimators`

Let's now look at the tradeoff between the cv score and the time required to train and test the model.

In [25]:
def plot_cv(df, param):
    trace0 = go.Scattergl(name='Combined Mean Train+Test Time',
                      x=df[param],
                      y=df['combined_mean_fit-test_time'], 
                      mode='lines+markers',)
    trace1 = go.Scattergl(name='Mean CV Score',
                          x=df[param],
                          y=df['mean_test_score'], 
                          mode='lines+markers',
                         yaxis='y2')

    layout = go.Layout(title='Model Train+Test Time & Test Score vs %s' % param,
                   xaxis=dict(title=param),
                   yaxis=dict(title='Combined Train+Test Time'), 
                       yaxis2=dict(title='Mean CV Score',
                                  side='right', overlaying='y'),
                      legend=dict(orientation='h', y=1.12),
                      margin=dict(t=120))

    fig = go.Figure([trace0, trace1], layout=layout)

    iplot(fig, filename='%s.html' % param)

In [26]:
plot_cv(n_components, param='n_estimators')

The combined time for training and testing the model increases significantly up to 160 seconds at n_estimators=300. Clearly, the larger `n_estimators`, the better the model generalizes to the cross-validation data.

#### max_depth

`max_depth` defines how many levels each decision tree can have. This essentially gives an upper limit to how many total decision nodes a tree can use to split and categorize the data.

In [None]:
%%time
param_grid = {'max_depth': [1,2,3,4,5,10,15,20,30]}

clf = RandomForestClassifier(n_estimators=100, n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'max_depth' : [1,2,3,4,5,10,15,20,30],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
save_obj(results, 'gridsearch-max_depth')

In [27]:
max_depth = load_obj('gridsearch-max_depth')

In [28]:
plot_cv(max_depth, param='max_depth')

The CV score is maximized at a `max_depth` of 15. Beyond 15, the CV score begins to suffer, and of course the model takes longer to train and predict since it has more decision nodes.

#### min_samples_leaf

`min_samples_leaf` defines the minimum number of samples required to be at a leaf node. This hyperparameter should help with overfitting since it can't have leaf nodes with only a few samples.

In [None]:
%%time
param_grid = {'min_samples_leaf': np.arange(5,100,5)}

clf = RandomForestClassifier(n_estimators=100, n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'min_samples_leaf' : np.arange(5,100,5),
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
save_obj(results, 'gridsearch-min_samples_leaf')

In [29]:
min_samples_leaf = load_obj('gridsearch-min_samples_leaf')

In [30]:
plot_cv(min_samples_leaf, param='min_samples_leaf')

`min_samples_leaf` seems to be quite a particular hyperparameter. The best CV score is achieved with a `min_samples_leaf` value of 10.

#### max_features

`max_features` sets the maximum number of feature columns used when creating bootstrap samples for the different trees in the forest. This can be specified as an integer number of features, or as a percentage of the total number of features if passing a float value.

In [None]:
%%time
param_grid = {'max_features': [0.01, 0.05, 0.1, 0.2, 0.3, 0.5]}

clf = RandomForestClassifier(n_estimators=100, n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'max_features': [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
save_obj(results, 'gridsearch-max_features')

In [31]:
max_features = load_obj('gridsearch-max_features')

In [32]:
plot_cv(max_features, 'max_features')

`max_features` maximizes the CV score at 30% of the total features. Increasing `max_features` to 50% of all features actually decreases the CV score.

#### class_weight

`class_weight` defines how we weight the importance of each class when training the model. The default is 'None' which gives an equal weight to all classes. "balanced" uses the proportion of samples in each class to weight them whereas "balanced_subsample" does the same but computes the proportions for each bootstrap sample separately. 

In [None]:
%%time
param_grid = {'class_weight': [None, 'balanced', 'balanced_subsample']}

clf = RandomForestClassifier(n_estimators=100, n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'class_weight': [None, 'balanced', 'balanced_subsample'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

In [None]:
save_obj(results, 'gridsearch-class_weight')

In [33]:
class_weight = load_obj('gridsearch-class_weight')
class_weight

Unnamed: 0,class_weight,combined_mean_fit-test_time,mean_test_score,mean_train_score
0,,23.086714,0.468866,0.99992
1,balanced,30.367116,0.472914,0.999923
2,balanced_subsample,37.279363,0.472727,0.99992


The best CV score was achieved for a "balanced" `class_weight`.

#### Putting it all together

Note that the random forest classifier from sklearn has around 15 tunable hyperparameters! Here I focused on the top 5 that I think will impact the model's overfitting the most. I looked at the model's performance dependence on the individual parameters but there may be some interdependence between them. Therefore, the final step to optimizing the random forest model will be to do a full parameter grid search with the 5 hyperparameters we looked at. I will of course reduce the search range for each parameter around their best individual results to reduce training time.

In [None]:
%%time
param_grid = {'n_estimators': [300, 500],
             'max_depth': [10,15,20],
             'min_samples_split': [5,10,15],
             'max_features': [0.2,0.3,0.4],
             'class_weight': ['balanced', 'balanced_subsample']}

clf = RandomForestClassifier(n_jobs=-1, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

I'll save the results of this gridsearch due to the long training time to obtain them. Loading the results, we can explore the optimal model.

In [None]:
save_obj(cv, 'full-grid-search')

In [34]:
full_results = load_obj('full-grid-search')

In [35]:
settings = []
for i in full_results.cv_results_['params']:
    params = tuple(i[k] for k in i)
    settings.append(params)

In [36]:
cv_scores = full_results.cv_results_['mean_test_score']
print('Maximum CV score obtained: %s' % round(cv_scores.max(), 4))

Maximum CV score obtained: 0.4995


Surprisingly, the best CV f1-score obtained on the full grid search is worse than the out-of-box model... go figure. So the best random forest model we created gave an f1-score on the CV data of 0.5191. Let's see how other models fare.

## Collaborative Filtering Models

In [37]:
# number of unique users
n_users = submissions['user_id'].nunique()
# number of unique items (problems)
n_items = submissions['problem_id'].nunique()

In [38]:
print('Number of unique users: %s' % n_users)
print('Number of unique problems: %s' % n_items)

Number of unique users: 3529
Number of unique problems: 5776


In [39]:
sparsity = len(submissions)/(n_users*n_items)
print('Sparsity of attempts_range: %s%%' % round(sparsity*100, 2))

Sparsity of attempts_range: 0.76%


The full submissions dataset contains 3529 unique users and 5776 unique problems. We have attempts_range data for only 0.76% of all user x problem combinations!! This data is incredibly sparse. Even the Netflix prize dataset had over 1% ratings. This will likely make it much harder for collaborative filtering models to produce good results, as they depend on inferring the attempts_range from the other users and/or items.

In [40]:
S_train = S_train.set_index(['user_id','problem_id']).unstack(level=-1)
S_cv = S_cv.set_index(['user_id','problem_id']).unstack(level=-1)

S_train.columns = S_train.columns.droplevel()
S_cv.columns = S_cv.columns.droplevel()

S_train.head()

problem_id,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1005,prob_1006,prob_1007,...,prob_99,prob_990,prob_991,prob_992,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


Since I will be building several types of models using different methods to predict missing attempt_range values, I'll start by creating an empty matrix that contains all user_ids as the index and all problem ids as columns. This matrix is constructed using the full list of users and problems from the users and problems datasets and not the submissions dataset. This is because there are many users and problems for which we have meta data but no history of submissions.

In [41]:
u_diff = len(set(users.user_id.unique()).difference(submissions.user_id.unique()))
print('Number of users from users dataset, not present in submissions: %s' % u_diff)

Number of users from users dataset, not present in submissions: 42


In [42]:
p_diff = len(set(problems.problem_id.unique()).difference(submissions.problem_id.unique()))
print('Number of problems from problems dataset, not present in submissions: %s' % p_diff)

Number of problems from problems dataset, not present in submissions: 768


In [43]:
empty_sub = pd.DataFrame(np.nan, index=users.user_id.unique(), 
                         columns=problems.problem_id.unique())
empty_sub_ = np.array(empty_sub)

We'll fill in the S_train and S_cv data into the empty_sub matrix to have all data and predictions in the same format.

In [44]:
S_train = empty_sub.fillna(S_train)
S_cv = empty_sub.fillna(S_cv)

S_train.head()

Unnamed: 0,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


### User-mean Recommender

The first type of collaborative filtering model I'll build is a user-mean collaborative filtering model. This simple model fills all missing attempts_range values with the averages across all users. This can be considered a collaborative filtering model since we're using information from other users to generate the predictions.

We need a method for dealing with edge cases where we may not have data to make a prediction. For example, since we'll be calculating the mean of each problem and using that to make predictions for all users, we could have problems that were never solved in the training data and therefore not have any predictions made for those columns. We wouldn't be able to calculate a mean for that problem and therefore not be able to make a prediction on that problem for any user. The easiest way to deal with this is to simply predict 1 when we don't have data, since this is by far the most common value of attempts_range across all problems and users.

In [45]:
# compute the mean of each problem across all users
# round to nearest int
user_means = np.round(S_train.mean())

# fill the empty_sub for scoring
S_predicted = empty_sub.fillna(user_means)

# fill all missing values with 1
S_predicted = S_predicted.fillna(1)

S_predicted.head()

Unnamed: 0,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,2.0,6.0,1.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,...,4.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0
user_10,2.0,6.0,1.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,...,4.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0
user_100,2.0,6.0,1.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,...,4.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0
user_1000,2.0,6.0,1.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,...,4.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0
user_1001,2.0,6.0,1.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,...,4.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0


In [46]:
print('User-mean f1-score on CV data: %s' % round(f1(S_cv, S_predicted), 4))

User-mean f1-score on CV data: 0.4688


This very simple user-mean model gives an f1-score on the CV data of 0.4688, much better than the baseline model of guessing all ones with an f1-score of 0.3709! Not surprisingly, this is a bit worse than our best random forest model.

### Item-mean Recommender

In [47]:
problem_means = np.round(S_train.mean(axis=1))

S_predicted = empty_sub.T.fillna(problem_means).T

S_predicted = S_predicted.fillna(1)

S_predicted.head()

Unnamed: 0,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
user_10,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
user_100,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
user_1000,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
user_1001,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


In [48]:
print('User-mean f1-score on CV data: %s' % round(f1(S_cv, S_predicted), 4))

User-mean f1-score on CV data: 0.3264



F-score is ill-defined and being set to 0.0 in labels with no predicted samples.



The item-mean model does considerably worse than the user-based model. In fact, this does worse than even our baseline model where we predicted 1 for all missing attempts_ranges! Here we get an f1-score of 0.3264.

Considering the results of the user-mean and item-mean recommenders, we can safely say that the mean user history is a much better predictor of future user behavior than the mean problem history.

### User-based vs Item-based Collaborative Filtering

Next, I'll use a very common type of collaborative filtering model. This model uses the cosine similarity between users to calculated a weighted average of user attempts_ranges. For example, if user 44 has a similarity score of 0.5 to our user of interest, and their attempts_range for a given problem was 1, their contribution to the average attempts_range would be 0.5.

We start by defining a function to calculat the cosine similarity between all users.

In [49]:
def cos_sim(attempts, kind='user', epsilon=1e-9):
    """
    Compute the cosine similarity between all users.
    
    Parameters
    ----------
    attempts : 2D numpy array
        Matrix containing all attempts_range with
        users on the index and problems on the columns.
    kind : str
        Specifices cosine similarity calculation between
        users or items (problems).
    epsilon : float
        Small float to prevent errors with dividing by zero.
        
    Returns
    -------
    similarity : 2D numpy array
        Matrix containing computed cosine similarity 
        between all users or all problems.
    """
    # fill all NaN values with 0. This does not affect
    # the cosine similarity metric.
    attempts = np.nan_to_num(attempts, 0)
    
    # compute the dot product between each user
    # and all other users.
    if kind == 'user':
        sim = np.dot(attempts, attempts.T) + epsilon
    # compute the dot product between each item
    # and all other items
    if kind == 'item':
        sim = np.dot(attempts.T, attempts) + epsilon
    
    # compute the denominator of the cosine similarity
    # metric
    norms = np.array([np.sqrt(np.diagonal(sim))])
    
    # compute the cosine similarity
    similarity = sim/norms/norms.T
    
    return similarity

In [50]:
similarity_users = cos_sim(S_train, kind='user')
similarity_items = cos_sim(S_train, kind='item')

In [51]:
def predict(attempts, similarity, kind='user'):
    """
    Predict attempts_range using cosine similarity scores
    as weights.
    
    Parameters
    ----------
    attempts : 2D numpy array
        Matrix containing all attempts_range with
        users on the index and problems on the columns.
    kind : str
        Specifices cosine similarity calculation between
        users or items (problems).
    similarity : 2D numpy array
        Matrix containing computed cosine similarity 
        between all users or all problems.
        
    Returns
    -------
    predictions : 2D numpy array
        Predictions on attempts_range for all user-
        problem combinations.
    """
    
    # fill NaN values with 0
    attempts_fill = np.nan_to_num(attempts, 0)
    
    if kind == 'user':
        predictions = np.round(similarity.dot(attempts_fill) / np.array([np.abs(similarity).sum(axis=1)]).T)
        return predictions
    elif kind == 'item':
        predictions = np.round(attempts_fill.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)]))
        return predictions

In [52]:
S_predicted = predict(S_train, similarity_users, kind='user')
print('User-cosine similarity f1-score: %s' % round(f1(S_cv, S_predicted), 4))

User-cosine similarity f1-score: 0.0007


In [53]:
S_predicted = predict(S_train, similarity_items, kind='item')
print('Item-cosine similarity f1-score: %s' % round(f1(S_cv, S_predicted), 4))

Item-cosine similarity f1-score: 0.0001


The two types of cosine similarity models are terrible! Because we're working with such sparse data, the overlap between problems solved by users is probably poor, leading to weak measures of similarity between users and even weaker similiarty between problems.

We can also look at the similarity of users and items using the features datasets rather than the attempt_ranges themselves. Perhaps, the meta data collected for each user and problem gives a better measure of similarity between them. Because the values for different features are so different, I need to first normalize the data so that each feature is on the same scale, between 0 and 1.

In [54]:
user_features_norm = users.set_index('user_id')/users.set_index('user_id').max()
user_features_norm.head()

Unnamed: 0_level_0,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,user_attempts_min,...,country_Ukraine,country_United Kingdom,country_United States,country_Uzbekistan,country_Venezuela,country_Vietnam,rank_advanced,rank_beginner,rank_expert,rank_intermediate
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
user_1,0.018381,0.016309,0.05848,0.011348,0.999713,0.510645,0.548458,0.989808,0.166667,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
user_10,0.053829,0.04714,0.0,0.002837,0.999658,0.332167,0.343927,0.991782,0.166667,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
user_100,0.140481,0.128239,0.157895,0.010024,0.999654,0.466317,0.423536,0.892024,0.166667,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
user_1000,0.056674,0.052502,0.0,0.003877,0.99999,0.377661,0.369415,0.977186,0.166667,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
user_1001,0.121225,0.10992,-0.035088,0.005201,0.999287,0.480315,0.494966,0.959012,0.166667,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [55]:
S_predicted = predict(S_train, cos_sim(user_features_norm, kind='user'), kind='user')
f1(S_cv, S_predicted)

0.0

In [56]:
problem_features_norm = problems.set_index('problem_id')/problems.set_index('problem_id').max()
problem_features_norm.head()

Unnamed: 0_level_0,points,problem_attempts_median,problem_attempts_min,problem_attempts_max,problem_attempts_count,problem_attempts_iqr,algorithms,and,binary,bitmasks,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
problem_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
prob_1,0.071429,0.25,0.166667,0.333333,0.001465,0.047619,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
prob_10,0.642857,1.0,1.0,1.0,0.000733,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
prob_100,0.142857,0.166667,0.166667,0.166667,0.000733,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
prob_1000,0.071429,0.166667,0.166667,1.0,0.18022,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
prob_1001,0.285714,0.166667,0.166667,0.333333,0.007326,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
S_predicted = predict(S_train, cos_sim(problem_features_norm), kind='item')
f1(S_train, S_predicted)

0.0

These models are even worse! So far, the random forest model is by far the best.

### Latent Factor Collaborative Filtering Model From Scratch

#### Building the model

Now for the really fun part! I'm going to build a collaborative filtering model from scratch. In this type of model, latent features will be learned from the user-problem submission history. No data will be used from the users and problems datasets. We will define how many latent features we want the model to train on, this is a hyperparameter that we can tune later on. We will start by initializing a random guess of these latent features and then train the model by minimizing the error the model produces when predicting the attempts_ranges for user-problem combinations, from the learned latent features. To minimize the cost or error, J, we will utilize stochastic gradient descent. I'll start by defining some useful functions to help build the pipeline for the model. Keep in mind that we're starting with a single matrix with users on the row index and items (in this case problems) on the column index.

In [59]:
S_train[S_train==0]=np.nan
S_train.head()

Unnamed: 0,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


#### unroll

The first function takes two matrices and flattens them into a single 1D array. It does so by first sequentially stacking each column on top of each other for each matrix, producing two 1D arrays. It then stacks those two 1D arrays on top of each other to form a single 1D array that contains all of the latent features for users and items.

In [60]:
def unroll(M_users, M_items):
    
    """Reshape 2 matrices into a single 1D array. 
    Inverse function of `roll`.
    
    Parameters
    ----------
    M_users : 2D numpy array
        Matrix of user latent features. Has
        dimensions (n_users, n_features).
    M_items : 2D numpy array
        Matrix of item latent features. Has
        dimensions (n_items, n_features).
        
    Returns
    -------
    x_users_items : 1D numpy array
        User and item latent features.
    """
    
    # convert matrices to np arrays
    M_users = np.array(M_users)
    M_items = np.array(M_items)

    # flatten 2D arrays into 1D arrays
    x_users = M_users.flatten(order='C')
    x_items = M_items.flatten(order='C')
    
    # concatenate user and item 1D arrays
    x_users_items = np.concatenate((x_users, x_items), axis=0)

    return x_users_items

In [61]:
%%time

# define model parameters
n_users = S_train.shape[0]
n_items = S_train.shape[1]
n_features = 100

# initialize random latent user and item features
M_users = np.random.rand(n_users, n_features)
M_items = np.random.rand(n_items, n_features)

# call unroll to flatten matrices into single array
x_users_items = unroll(M_users, M_items)

print(n_users*n_features + n_items*n_features)
print(x_users_items.shape)

1011500
(1011500,)
Wall time: 18 ms


This function is quite fast, taking only 20 ms to generate two random matrices and pass them into unroll.

#### roll

The next function does the inverse of unroll. It takes a single 1D array of user and item latent features and reshapes them into their original matrix forms.

In [62]:
def roll(x_users_items, n_users, n_items, n_features):
    
    """Reshape a 1D array of user and item latent
    features into their original 2D array format.
    Inverse function of `unroll`.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
        
    Returns
    -------
    M_users : 2D numpy array
        Matrix of user latent features. Has
        dimensions (n_users, n_features).
    M_items : 2D numpy array
        Matrix of item latent features. Has
        dimensions (n_items, n_features).
    """
    
    # retrieve user and item 1D arrays
    x_users = x_users_items[0:n_users*n_features]
    x_items = x_users_items[n_users*n_features:]
    
    # reshape 1D arrays into original matrices
    M_users = np.reshape(x_users, (n_users, n_features))
    M_items = np.reshape(x_items, (n_items, n_features))

    return M_users, M_items

In [63]:
%%time

M_users, M_items = roll(x_users_items, n_users, n_items, n_features)
print(M_users.shape, M_items.shape)

(3571, 100) (6544, 100)
Wall time: 992 µs


This function is also quite fast.

#### cost

The cost function compute the cost or error, J, for a prediction made using the values in x_users_items, against the true values in Y_true. The cost is calculated as the sum of squared errors plus a regularization penatly for both user and item 2nd order latent features.

In [64]:
def cost(x_users_items, Y_true, Lambda, n_users, n_items, n_features):
    
    """Compute cost (error) J from predictions on 
    Y_true using learned features `x_users_items`. J 
    is defined as the sum of squared errors plus
    regularization penatlies on user and item 
    latent features.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    Y_true : 2D numpy array
        Matrix containing true ratings.
    Lambda : int
        Regularization coefficient.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
        
    Returns
    -------
    J : float
        Cost associated with prediction on `Y_true` 
        using learned latent features `x_users_items`.
    """
    
    # recover 2D user and item feature matrices
    M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

    # compute the prediction
    Y_predicted = np.dot(M_users, M_items.T)
    
    # compute the error in the prediction
    error = Y_true - Y_predicted
    # replace all NaN values with 0
    error[np.isnan(error)] = 0

    # compute the regularization penalties
    User_regularization = (Lambda/2) * np.nansum(M_users * M_users)
    Item_regularization = (Lambda/2) * np.nansum(M_items * M_items)

    # compute the cost J with regularization
    J = (1/2) * np.nansum(error*error) + User_regularization + Item_regularization

    return J

In [65]:
%%time

Lambda=0.1 # regularization coefficient

J = cost(x_users_items, S_train, Lambda, n_users, n_items, n_features)
print(J)

23845840.845111348
Wall time: 966 ms


Computing the cost takes a bit longer at 1 second. We can see the very large cost J for our initial random guess.

#### gradient

As the name suggests, this function computes the gradient of the cost function, evaluated independently for the user and item latent features. The gradient is what we use to decide in which direction to step in during each update or iteration.

In [66]:
def gradient(x_users_items, Y_true, Lambda, n_users, n_items, n_features):
    
    """Compute gradient function on `x_users_items`.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    Y_true : 2D numpy array
        Matrix containing true ratings.
    Lambda : int
        Regularization coefficient.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
        
    Returns
    -------
    gradient : 1D numpy array
        Gradient of cost J w.r.t user and item
        latent features.
    """
    
    # recover 2D user and item feature matrices
    M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

    # compute the prediction
    Y_predicted = np.dot(M_users, M_items.T)
    
    # compute the error in the prediction
    error = Y_true - Y_predicted
    # replace all NaN values with 0
    error[np.isnan(error)] = 0 

    # the gradients of user & item features
    M_user_gradient = np.dot(error, M_items) + Lambda*M_users
    M_item_gradient = np.dot(error.T, M_users) + Lambda*M_items

    # reshape gradients into 1D array
    gradient = unroll(M_user_gradient, M_item_gradient)

    return gradient

In [67]:
%%time

grad = gradient(x_users_items, S_train, Lambda, n_users, n_items, n_features)
grad

Wall time: 908 ms


array([-498.031751  , -369.27539972, -428.48579304, ...,  -16.55631096,
        -12.97203733,  -11.28810696])

This also takes close to 1 second to compute.

#### predict

As the name implies, this function takes in the trained latent features in x_users_items and computes the predicted attempts_range from them.

In [68]:
def predict(x_users_items, n_users, n_items, n_features):
    
    """Compute prediction on ratings. Predictions
    are computed from learned user and item 
    latent features in `x_users_items`.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
    
    Return
    ------
    Y_predicted : pandas DataFrame
        Predictions.
    """
    
    # recover 2D user and item feature matrices
    M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

    # compute predictions from P & Q
    Y_predicted = np.dot(M_users, M_items.T) 
    
    # set all predictions less than 1, equal to 1
    Y_predicted[Y_predicted < 1] = 1
    
    # set all predictions greater than 6, equal to 6
    Y_predicted[Y_predicted > 6] = 6
    
    Y_predicted = Y_predicted.astype(int)
    
    return Y_predicted

In [69]:
%%time

predict(x_users_items, n_users, n_items, n_features=100)

Wall time: 329 ms


array([[6, 6, 6, ..., 6, 6, 6],
       [6, 6, 6, ..., 6, 6, 6],
       [6, 6, 6, ..., 6, 6, 6],
       ...,
       [6, 6, 6, ..., 6, 6, 6],
       [6, 6, 6, ..., 6, 6, 6],
       [6, 6, 6, ..., 6, 6, 6]])

Using n_features=100, the prediction step only takes 320 ms. When using the `predict` function, we get back an array of predictions that seem to all have a value of 6. The `predict` function has two ways to deal with edge cases. 

>    Y_predicted[Y_predicted < 1] = 1

>    Y_predicted[Y_predicted > 6] = 6

If the predicted value is less than 1, we replace the value with 1, and if the prediction is greater than 6, we replace the prediction with 6. This is one way to put boundary conditions on our model's predictions since we know we can't have values outside of this range. Below is what the model actually predicts, before we apply the boundary conditions.

In [70]:
# recover 2D user and item feature matrices
M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

# compute predictions from P & Q
Y_predicted = np.dot(M_users, M_items.T) 
Y_predicted

array([[25.78507382, 24.65458008, 24.11325677, ..., 26.10493151,
        24.80762662, 28.04366034],
       [25.15239518, 25.98095372, 24.86192013, ..., 26.61571934,
        24.87409884, 27.37180665],
       [25.92598663, 24.20472953, 25.85463325, ..., 25.23032756,
        26.69573521, 29.72480884],
       ...,
       [24.4504039 , 23.51367764, 22.13100841, ..., 24.39737622,
        24.08475808, 24.78835535],
       [24.90495798, 23.28108105, 25.39845024, ..., 24.97260922,
        26.27618462, 26.0533431 ],
       [25.11324182, 24.16390099, 23.75161446, ..., 26.25808634,
        23.65658502, 26.56647784]])

Now we can clearly see that all of the predicted values are above 6, hence the array of 6s returned by the `predict` function. These are the predicted values we obtain from simply initializing random matrices for the latent user and problem features. The next steps are coded into the stochastic gradient descent (SGD) algorithm:

1. Compute the error between these predictions and actual values in S_train.
2. Compute the gradient of the cost function with respect to the latent features.
3. Update the latent features in the direction of decreasing cost (direction of gradient).
4. Make another prediction and repeat.

#### SGD_e

Here is the meat of the training algorithm. SGD stands for stochastic gradient descent. It's an iterative agorithm that computes the error using a cost function J which we defined above, and steps in the direction (given by the gradient) that will minimize J in the next step. The latent feature values x_users_items are updated at each step until a convergence criteria is met. In this case, the e in SGD_e stands for epoch. This implementation will execute for a fixed number of epochs. Let's go ahead and take it for a spin!

In [71]:
def SGD_e(X_train, n_users, n_items, n_features, Lambda, 
          epochs, alpha, compute_f1=False, verbose=1, seed=42, **kwargs):
    
    """Stochastic gradient descent algorithm. Searches
    for the optimal values of user and item latent
    features in x_users_items, that minimize the cost J.
    Updates are calculated for `epochs` iterations using a 
    learning rate `alpha`.
    
    Parameters
    ----------
    X_train : 2D numpy array
        Ratings matrix for training dataset.
    X_cv : 2D numpy array
        Ratings matrix for cross-validation dataset.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
    Lambda : int
        Regularization coefficient.
    epochs : int
        Number of iterations to run.
    alpha : float
        Learning rate.
    compute_f1 : bool (default False)
        If true, computes the f1-scores for predictions
        on X_train and X_cv at each epoch.
    seed : int (default 42)
        Seed for numpy's pseudo-random number
        generator.
    
    Returns
    -------
    results : dict
        Nested dictionary containing epoch as keys. Values
        associated with each `epochs` are cost J, f1-scores 
        for training and CV datasets, and optimized 
        parameters `x_users_items`.
        
    """
    
    # get cross-validation data if given
    X_cv = kwargs.get('X_cv')
    
    # set random seed
    np.random.seed(seed)
    
    # intial random guess of user and item
    # latent features
    M_users = np.random.rand(n_users, n_features) - 0.5
    M_items = np.random.rand(n_items, n_features) - 0.5
    
    # reshape matrices into 1D array of
    # user and item latent features
    x_users_items = unroll(M_users, M_items)
    
    # initialize empty dict to store training results
    results = {}
    
    # loop through `epochs` iterations
    for e in range(1,epochs+1):
        
        # compute the cost
        J = cost(x_users_items, X_train, Lambda, 
                   n_users, n_items, n_features)
        
        # compute the gradient function
        gradient_ = gradient(x_users_items, X_train, Lambda, 
                      n_users, n_items, n_features)
        
        # update `x_users_items`
        x_users_items = x_users_items + alpha * gradient_
        
        if compute_f1:
            # make prediction
            Y_predicted = predict(x_users_items, n_users, n_items, n_features)

            # store cost J, f1-scores on training and CV data,
            # and the optimized parameters `x_users_items`.
            results[e] = {'J': J, 'f1-train': f1(X_train, Y_predicted), 
                          'f1-cv': f1(X_cv, Y_predicted), 'x_users_items': x_users_items}
            
        else:
            # store cost J and optimized parameters `x_users_items`
            results[e] = {'J': J, 'x_users_items': x_users_items}
        
        # print current epoch and cost
        if verbose==1:
            print('Epoch %s' % e + ' | ' + 'J : %s' % round(J))
        
        # logic for stop condition
        if e > 1:
            # compute delta for current iteration
            delta = (results[e-1]['J'] - results[e]['J'])
            
            # if J increases from last iteration (delta < 0)
            # end updates and return results
            if delta < 0:
                print('Gradient diverging! Ending training...')
                return results
        else:
            pass
    
    # indicate completion and print final J
    print('Training complete, final J: %s' % round(results[epochs]['J']))
    
    return results

In [72]:
%%time

# data parameters
n_users, n_items = S_train.shape

# hyper parameters
n_features = 10
Lambda=0.1
epochs=10
alpha=0.01

results = SGD_e(S_train, n_users, n_items, n_features, Lambda, 
                epochs, alpha, compute_f1=True, seed=42, X_cv=S_cv)

Epoch 1 | J : 188718.0
Epoch 2 | J : 182383.0
Epoch 3 | J : 176439.0
Epoch 4 | J : 167814.0
Epoch 5 | J : 151808.0
Epoch 6 | J : 125204.0
Epoch 7 | J : 97585.0
Epoch 8 | J : 77293.0
Epoch 9 | J : 64506.0
Epoch 10 | J : 62212.0
Training complete, final J: 62212.0
Wall time: 27.9 s


In [73]:
x = np.array(list(results.keys()))
y = np.array([results[i]['J'] for i in results.keys()])

trace0=go.Scattergl(x=x, y=y, mode='lines+markers')

layout=go.Layout(title='Cost Function vs epoch',
                yaxis=dict(title='Cost Function',
                          type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='training.html')

The results above show the cost J vs epoch for the first 10 epochs. The error decreases steadily and then begins to drop off more quickly at epoch 5, before it appears to start slowing down at epoch 10. Let's look at the results from the final iteration, epoch 10.

In [74]:
results[10]

{'J': 62212.313323638424,
 'f1-train': 0.4970287972758789,
 'f1-cv': 0.44619616023589637,
 'x_users_items': array([ 0.07293579, -0.26029031,  0.43814448, ...,  0.09574311,
        -0.10079742,  0.33853885])}

The model already achieves an f1-score of 0.4462 on the training data after only 10 iterations and no tuning! This is promising. Note that the f1-scores for the train and cv data sets are also similar, much loser to eachother than in the random forest models. This could imply that the model will generalize better to unseen data than the random forest model. Let's look at this model's predictions.

In [75]:
p = predict(results[10]['x_users_items'], n_users, n_items, n_features)
p

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

Looks familiar! Here, we only gave the model 10 latent features to work with. Let's see if we can optimize the model further. The hyperparameters we have to play with are the number of latent features `n_features`, the regularization coefficient `Lambda`, and the learning rate `alpha`. We'll look at the effect of changing each of these hyperparameters, one at a time.

#### n_features

In [None]:
%%time
# data parameters
n_users, n_items = S_train.shape

# hyperparameters
N_features = [1,2,3,4,5,10,20]
Lambda = 0.1
epochs=25
alpha=0.005

results_features = {}

for n_features in N_features:
    
    print('Training with n_features=%s' % n_features)
    
    results = SGD_e(S_train, n_users, n_items, n_features, Lambda, 
                    epochs, alpha, compute_f1=True, verbose=0, seed=42, X_cv=S_cv)
    
    results_features[n_features] = results

In [None]:
save_obj(results_features, 'n_feature-search')

In [76]:
results_features = load_obj('n_feature-search')

In [77]:
traces = []
for n_features in results_features:
    results = results_features[n_features]
    
    x = np.array(list(results.keys()))
    y = np.array([results[i]['J'] for i in results.keys()])

    traces.append(go.Scattergl(name=str(n_features),x=x, y=y, mode='lines+markers'))
    
layout=go.Layout(title='Cost Function vs epoch by n_features',
                yaxis=dict(title='Cost Function', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure(traces, layout)

iplot(fig, filename='training-n_features.html')

The cost function starts out with very similar initial errors in all cases, with slightly higher error as `n_features` increases. What's really interesting to me is the behavior of `n_features`=2 (orange). The error decreases much slower than all other values of `n_features`. While for values of 1,3,4, and 5, the curves are all very similar. At 10 latent features the error drops significantly as a function of epoch, and even moreso with 20 latent features. This is what one would expect since it should be easier to describe the data with more features. Let's look at the f1-scores on the training dataset.

In [78]:
traces = []
for n_features in results_features:
    results = results_features[n_features]
    
    x = np.array(list(results.keys()))
    y = np.array([results[i]['f1-train'] for i in results.keys()])
    
    traces.append(go.Scattergl(name=str(n_features), x=x, y=y, mode='lines+markers'))

layout=go.Layout(title='Training F1-score by n_features',
                yaxis=dict(title='F1-score', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure(traces, layout)

iplot(fig, filename='F1-train-n_features.html')

The f1-scores on the training data are shown above. The trend follows what we saw in the cost function plots. With high values of `n_features`, the model is able to fit the training data better. The f1-score increases sooner, faster, and to higher values with increasing `n_features`. `n_features`=2 shows the same outlier behavior as with the cost function plots.

In [79]:
traces = []
for n_features in results_features:
    results = results_features[n_features]
    
    x = np.array(list(results.keys()))
    y = np.array([results[i]['f1-cv'] for i in results.keys()])

    traces.append(go.Scattergl(name=str(n_features), x=x, y=y, mode='lines+markers'))

layout=go.Layout(title='Cross Validation F1-score by n_features',
                yaxis=dict(title='F1-score', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure(traces, layout)

iplot(fig, filename='F1-cv-n_features.html')

The f1-scores shown above are computed on the cross-validation dataset. This is an example of overfitting. As `n_features` increases, the f1-score on the CV dataset decreases. With a higher number of latent features, the model can fit the training data more accurately, but begins to suffer from high variance and thus doesn't generalize well to the CV data. Surprisingly, the best model performance is obtained with only a single latent feature!!

#### Lambda

`Lambda` is the regularization coefficient that we use to penalize higher order model parameters. The higher the value of Lambda, the higher the penalty. This is a common method for reducing overfitting.

In [None]:
%%time
# data parameters
n_users, n_items = S_train.shape

# hyperparameters
n_features = 1
Lambdas = [10,5,2,1,0.5,0.1,0.01]
epochs=25
alpha=0.005

results_lambda = {}

for Lambda in Lambdas:
    
    print('Training with Lambda=%s' % Lambda)
    
    results = SGD_e(S_train, n_users, n_items, n_features, Lambda, 
                    epochs, alpha, compute_f1=True, verbose=0, seed=42, X_cv=S_cv)
    
    results_lambda[Lambda] = results

In [None]:
save_obj(results_lambda, 'lambda-search')

In [80]:
results_lambda = load_obj('lambda-search')

In [81]:
traces = []
for Lambda in results_lambda:
    results = results_lambda[Lambda]
    
    x = np.array(list(results.keys()))
    y = np.array([results[i]['J'] for i in results.keys()])

    traces.append(go.Scattergl(name=str(Lambda),x=x, y=y, mode='lines+markers'))
    
layout=go.Layout(title='Cost Function vs epoch by Lambda',
                yaxis=dict(title='Cost Function', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure(traces, layout)

iplot(fig, filename='training-Lambda.html')

For a `Lambda` value of 10, the gradient immediately diverges, thus only 2 points are plotted. The lower the `Lambda`, the lower the cost function is able to go within the epoch range we tested. The difference in the cost functions increases with increasing epoch. `Lambda`=5 also diverged, but not until epoch 24.

In [82]:
traces = []
for Lambda in results_lambda:
    results = results_lambda[Lambda]
    
    x = np.array(list(results.keys()))
    y = np.array([results[i]['f1-train'] for i in results.keys()])
    
    traces.append(go.Scattergl(name=str(Lambda), x=x, y=y, mode='lines+markers'))

layout=go.Layout(title='Training F1-score by Lambda',
                yaxis=dict(title='F1-score', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure(traces, layout)

iplot(fig, filename='F1-train-Lambda.html')

The model clearly fits the training data better with larger values of `Lambda`. For `Lambda`=5, the f1-score on the training data appears to have started saturating at the end, while the small values of `Lambda` seem to be increasing at similar rates. Let's compare this plot with the previous plot of cost functions. With increasing `Lambda`, the cost function (error) increases, yet we fit the training data better so the f1-score also increases. But how can the model both fit the data better and have a higher error? In the limit where `Lambda`=0, the cost function is simply the sum of squared errors of our predictions on the training data. With regularization, we introduce an additional source of error that goes as the square of the prediction. `Lambda` is the scalar coefficient to this additional error. Thus with increasing `Lambda`, the contribution of the regularization error increases. However, since regularization also helps reduce overfitting, the model ends up fitting the data better.

In [83]:
traces = []
for Lambda in results_lambda:
    results = results_lambda[Lambda]
    
    x = np.array(list(results.keys()))
    y = np.array([results[i]['f1-cv'] for i in results.keys()])

    traces.append(go.Scattergl(name=str(Lambda), x=x, y=y, mode='lines+markers'))

layout=go.Layout(title='Cross Validation F1-score by Lambda',
                yaxis=dict(title='F1-score', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure(traces, layout)

iplot(fig, filename='F1-cv-Lambda.html')

Here we can see that regularization makes a big difference in the model's overfitting! With increasing `Lambda`, the model generalizes better to the cross-validation data, giving a higher f1-score.

Now we can combine our learnings of `n_features` and `Lambda` and try to create an optimal model by using `n_features`=1 and increasing `Lambda` while decreasing `alpha` to prevent the search from diverging.

In [None]:
%%time

# data parameters
n_users, n_items = S_train.shape

# hyperparameters
n_features = 1
Lambda = 5
epochs=300
alpha=0.0005

results = SGD_e(S_train, n_users, n_items, n_features, Lambda, 
                epochs, alpha, compute_f1=True, verbose=1, seed=42, X_cv=S_cv)

In [None]:
save_obj(results, 'n1_L5_e300_a0005')

In [84]:
optimal = load_obj('n1_L5_e300_a0005')

In [85]:
x = np.array(list(optimal.keys()))
y = np.array([optimal[i]['J'] for i in optimal.keys()])

trace0=go.Scattergl(x=x, y=y, mode='lines+markers')

layout=go.Layout(title='Cost Function vs epoch',
                yaxis=dict(title='Cost Function',
                          type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='training.html')

In [86]:
x = np.array(list(optimal.keys()))
y1 = np.array([optimal[i]['f1-train'] for i in optimal.keys()])
y2 = np.array([optimal[i]['f1-cv'] for i in optimal.keys()])
    
trace0 = go.Scattergl(name='Training f1', x=x, y=y1, mode='lines+markers')
trace1 = go.Scattergl(name='CV f1', x=x, y=y2, mode='lines+markers')

layout=go.Layout(title='F1-score vs epoch',
                yaxis=dict(title='F1-score', type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure([trace0, trace1], layout)

iplot(fig, filename='F1-train-Lambda.html')

After trying several combinations of `n_features`, `Lambda`, and `alpha`, the best model I found was:

`n_features`=1, `Lambda`=5, `alpha`=0.0005

These hyperparameters gave an f1-score on the CV data of 0.4922. Comparing these results to the best random forest model which got a score of 0.5191, the collaborative filtering model performs slightly worse. 

## Conclusions & Next Steps

The goal of this project was to build a recommender system for an online agent to recommend problems to students. The critical task of this recommender was to predict how many attempts a user would take to solve a problem the user has never seen before. I compared several models including a random forest model and several types of collaborative filtering models. The out-of-box random forest model performed the best with an f1-score on the cross-validation data of 0.5191. The best collaborative filtering model produced an f1-score of 0.4922. These f1-scores are relatively low for a good recommender system, however this particular dataset was quite sparse with only 0.76% of the attempts data available! This made predicting the missing attempts very challenging.

There are several ways to further improve the collaborative filtering model that I created here. The first is to add bias terms to the cost function. While regularization helps reduce the model's overfitting by penalizing higher order terms, including bias penalties helps take into account the fact that users will have a different average number of attempts across all problems. For example, a beginner may require on average 8 attempts to complete most problems while someone with a strong background may require only 3 attempts on average.

The next step for this project would be to actually recommend the practice problems. Once we are able to accurately predict the number of attempts a user will require to solve a problem they've never seen before, we can create some simple rules for a recommendation. For example, we could first find all problems that the user has not solved before and filter them by the `tag` or type of problem relevant to the topic they're currently studying. From the filtered set of problems we could then recommend problems that the user is predicted to take 2-3 or 3-5 attempts to solve.