# Capstone 1: In-Depth Analysis

#### Kenneth Liao

Original datasource: https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine/#

In [1]:
import pandas as pd
import numpy as np
import time
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from scipy.optimize import minimize, fmin_cg

# enable offline plotting in plotly
init_notebook_mode(connected=True)

In [70]:
def save_obj(obj, name ):
    with open('results/'+ name + '.pkl', 'wb') as f:
        pkl.dump(obj, f, pkl.HIGHEST_PROTOCOL)

def load_obj(name ):
    with open('results/' + name + '.pkl', 'rb') as f:
        return pkl.load(f)

In [2]:
# load our 3 datasets
users = pd.read_csv('data/user_features.csv')
problems =  pd.read_csv('data/problem_features.csv')
submissions = pd.read_csv('data/train_submissions.csv')

## Background & Problem Statement 

Ultimately, the goal of this project is to recommend practice problems to users given some information about the problems they have already solved. There are many criteria we could choose to base how we recommend problems. For the purpose of this model, I will keep the criteria simple. The criteria are as follows:

1. The problem has not yet been attempted by the user.
2. The predicted number of attempts the user will require to solve the problem is equal to 2 or 3 (attempts_range=2).

Given the criteria defined above, we must first be able to predict how many attempts a user will require to solve a problem they've never attempted before. I will perform this prediction using two very different models. 

The first model will be a Random Forest Classifier. For this model, I will use meta data available for users and problems. The goal is to find patterns in the user and problem features that predict well the number of attempts for a given user-problem combination.

The second model will be a collaborative filtering model. This model will employ stochastic gradient descent (SGD) to find an approximate solution to the single value decomposition (SVD) of our user-problem matrix. In this case, we will not utilize the user and problem datasets. Predictions will be made exclusively using the history of problem submissions.

Let's take a quick look at the submissions dataset. This dataset has 3 columns: user_id, problem_id, and attempts_range. Attempts_range gives the range of attempts that the user_id took to solve the problem_id and is defined in the original datasource as shown below.

In [3]:
submissions.head()

Unnamed: 0,user_id,problem_id,attempts_range
0,user_232,prob_6507,1
1,user_3568,prob_2994,3
2,user_1600,prob_5071,1
3,user_2256,prob_703,1
4,user_2321,prob_356,1


>We have used following criteria to define the attempts_range :-
>
>            attempts_range            No. of attempts lies inside
>
>            1                                         1-1
>
>            2                                         2-3
>
>            3                                         4-5
>
>            4                                         6-7
>
>            5                                         8-9
>
>            6                                         >=10

## Random Forest Model

### Preparing data for random forest 

The first thing we need to do to prepare the data for the random forest model is convert categorical, string columns into dummy variables. We do this for both the user and problem features.

In [4]:
users = pd.get_dummies(users.set_index('user_id')).reset_index()
users.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,...,country_Ukraine,country_United Kingdom,country_United States,country_Uzbekistan,country_Venezuela,country_Vietnam,rank_advanced,rank_beginner,rank_expert,rank_intermediate
0,user_1,84,73,10,120,1505162220,502.007,499.713,1469108674,1.0,...,0,0,0,0,0,0,1,0,0,0
1,user_10,246,211,0,30,1505079658,326.548,313.36,1472038187,1.0,...,0,0,0,0,0,0,0,0,0,1
2,user_100,642,574,27,106,1505073569,458.429,385.894,1323974332,1.0,...,0,0,0,0,0,0,0,0,0,1
3,user_1000,259,235,0,41,1505579889,371.273,336.583,1450375392,1.0,...,0,0,0,0,0,0,0,0,0,1
4,user_1001,554,492,-6,55,1504521879,472.19,450.975,1423399585,1.0,...,0,0,0,0,0,0,0,0,0,1


In [5]:
problems = pd.get_dummies(problems.set_index('problem_id')).reset_index()
problems.head()

Unnamed: 0,problem_id,points,problem_attempts_median,problem_attempts_min,problem_attempts_max,problem_attempts_count,problem_attempts_iqr,algorithms,and,binary,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
0,prob_1,500.0,1.5,1.0,2.0,2.0,0.005,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,prob_10,4500.0,6.0,6.0,6.0,1.0,0.0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,prob_100,1000.0,1.0,1.0,1.0,1.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,prob_1000,500.0,1.0,1.0,6.0,246.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,prob_1001,2000.0,1.0,1.0,2.0,10.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


I'll start by splitting the whole dataset into a train (X_train) and test (X_test) set. I'll further split the X_train data into a smaller training set (R_train) and a cross-validation set (R_cv) for hyperparameter tuning. This split must be done on the original submissions dataset before pivoting the data into a sparse matrix. Once in sparse matrix format, sampling the dataset would also sample the null values in the dataset.

In [6]:
train, R_test = train_test_split(submissions, test_size=0.25, random_state=42)

R_train, R_cv = train_test_split(train, test_size=0.25, random_state=42)

In [7]:
R_train.head()

Unnamed: 0,user_id,problem_id,attempts_range
2573,user_3506,prob_3882,1
110245,user_1732,prob_3373,1
78649,user_2502,prob_2421,1
72195,user_2653,prob_6016,1
16731,user_1399,prob_6434,2


Next, I'll prepare a single dataframe that joins the user and problem features with the submissions data.

In [8]:
X_train = R_train.merge(users, on='user_id').merge(problems, on='problem_id')
X_cv = R_cv.merge(users, on='user_id').merge(problems, on='problem_id')

# remove rows with any null values
X_train = X_train.loc[:,X_train.notnull().all()]
X_cv = X_cv.loc[:,X_cv.notnull().all()]

y_train = X_train.set_index(['user_id', 'problem_id'])['attempts_range']
X_train = X_train.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

y_cv = X_cv.set_index(['user_id', 'problem_id'])['attempts_range']
X_cv = X_cv.set_index(['user_id', 'problem_id']).loc[:,'submission_count':]

X_train.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,submission_count,problem_solved,contribution,follower_count,last_online_time_seconds,max_rating,rating,registration_time_seconds,user_attempts_median,user_attempts_min,...,level_type_E,level_type_F,level_type_G,level_type_H,level_type_I,level_type_J,level_type_K,level_type_L,level_type_M,level_type_N
user_id,problem_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
user_3506,prob_3882,107,77,0,0,1501774775,305.333,302.466,1476642256,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_1669,prob_3882,430,392,0,4,1504392528,340.31,314.22,1454321387,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_466,prob_3882,163,127,-4,8,1503050499,432.626,399.943,1461554180,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
user_2661,prob_3882,45,36,2,6,1505058916,315.367,260.894,1431933069,1.5,1.0,...,0,0,0,0,0,0,0,0,0,0
user_1416,prob_3882,267,263,-12,0,1502201891,308.2,227.924,1459617061,2.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
y_train.head()

user_id    problem_id
user_3506  prob_3882     1
user_1669  prob_3882     1
user_466   prob_3882     1
user_2661  prob_3882     2
user_1416  prob_3882     2
Name: attempts_range, dtype: int64

Dataframe X now contains all of the user and problem feature data for each combination of user_id and problem_id. Thus, for each training sample or row, we will use the combination of user and problem features to predict the attempts_range. The attempts_range for each user-problem combination is saved in y.

### Training

#### Simplest Baseline Model

We know from our previous exploratory analysis of this data that 1 is by far the most common attempts_range. A very simple prediction model we can make is just to predict the most common value for all missing values. Let's see how such a model would do.

To benchmark our models, we'll be using sklearn's f1_score function with the average argument set to "weighted". This function will compute the f1-score for each of the labels in the dataset and then take a weighted average of the scores depending on how many samples are in each label. Thus, we will simply get one overall f1-score.

In [10]:
def f1(Y_true, Y_predicted, average='weighted'):
    """Compute the f1_score between actual values 
    and predictions.
    """
    
    # convert matrices into numpy arrays
    Y_true_ = np.array(Y_true)
    Y_predicted_ = np.array(Y_predicted)
    
    # get indices of non-NaN values in Y_true
    mask = ~np.isnan(Y_true_.flatten(order='C'))
    
    # flatten matrices to 1D arrays
    # use the mask to get only non-NaN values
    y_true = Y_true_.flatten(order='C')[mask]
    y_predicted = Y_predicted_.flatten(order='C')[mask]
    
    return f1_score(y_true, y_predicted, average=average, labels=[1.0,2.0,3.0,4.0,5.0,6.0])

In [None]:
y_predicted = np.ones(len(y_train))

print('F1 score for predicting all ones on training data: %s' % round(f1(y_train, y_predicted), 4))

The F1 score we got for predicting 1 for all of the training samples is 0.371. How does this compare in the CV dataset?

In [None]:
y_cv = R_cv.set_index(['user_id', 'problem_id'])['attempts_range']

y_predicted = np.ones(len(y_cv))
print('F1 score for predicting all ones on cv data: %s' % round(f1(y_cv, y_predicted), 4))

We get a similar f1 score for predicting all ones on the CV dataset. This is a good indication that there was minimal selection bias in our splitting.

#### Out-of-box Random Forest

We will start by building an out-of-box model and try to improve it from there.

In [None]:
clf = RandomForestClassifier(n_estimators=100, n_jobs=12)

clf.fit(X_train, y_train)

In [None]:
y_predicted = clf.predict(X_train)

f1(y_train, y_predicted, average='weighted')

In [None]:
y_predicted = clf.predict(X_cv)

f1(y_cv, y_predicted, average='weighted')

The out-of-box random forest models gives an f1 score of 0.977 on the training data and 0.414 on the cross-validation data. This is already much better than the baseline model! But we're still far from 1. During my exploratory analysis of the data, it was clear that many features were correlated with one another. Before diving into model optimization through hyperparameter tuning, I want to see if removing some of this colinearity between features improves the model's performance.

#### Dimensionality Reduction

Let's start by performing PCA on the full dataset to see how many features we can safely remove. Performing PCA on the full dataset has two benefits.

1. The dimensionality of the training data is reduced and therefore takes less computation to train the model on.
2. Colinear features are removed. The principal components returned by PCA are all orthogonal.

In [None]:
pca = PCA()
pca.fit(X_train)

x = list(range(1, len(pca.explained_variance_)+1))
y = pca.explained_variance_

trace0 = go.Scatter(x=x, y=y, mode='lines+markers')

layout = go.Layout(title='Explained Variance vs # of Dimensions',
                  xaxis=dict(title='# of Dimensions'),
                  yaxis=dict(title='Explained Variance', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='explained-var_vs_N-dimensions.html')

In [None]:
n_components=[1,2,5,10,25,50,100]

f1_scores = []
for n in n_components:

    pca = PCA(n_components=n)
    
    X_train_r = pca.fit_transform(X_train)
    
    clf = RandomForestClassifier(n_estimators=100, n_jobs=12)

    clf.fit(X_train_r, y_train)

    y_predicted = clf.predict(X_train_r)

    f1_scores.append(f1(y_train, y_predicted, average='weighted'))

In [None]:
trace0 = go.Scatter(x=n_components, y=f1_scores, mode='lines+markers')

layout = go.Layout(title='F1 Score vs Principal Components',
                  xaxis=dict(title='Principal Components'),
                  yaxis=dict(title='F1 Score', type='log'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='f1_score-vs-principal_components.html')

We can see that at a number of principal components less than 25, there is a significant hit in the F1 score. Above 25 principal components, there seems to be a negligible difference. In general, there is no improvement over the baseline model when using PCA to remove colinear features and reduce the dataset's dimensionality.

We can use GridSearhCV to try to tune the hyperparameters of the model. Rather than passing a large dictionary object of all the hyperparameters we want to tune at once, I will explore each of the hyperparameters individually. This will make it more straightforward when interpretting the effects of each hyperparameter. At the end, I will then pass all of the hyperparameters to GridSearchCV to find the optimal combination of all hyperparameters.

#### n_estimators

n_estimators defines how many trees the model will have. Generally, the more trees the better the model will generalize. However more trees equals more computation and therefore we want to strike a balance between fit to the test data and train + test times.

With GridSearchCV, we can define the scoring function. Since we want to maximize the f1_score function with "weighted" averaging from sklearn.metrics, we pass this same scoring function to GridSearchCV.

In [None]:
%%time
param_grid = {'n_estimators': [5,10,50,100,150,200,250]}

clf = RandomForestClassifier(n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

The results of the search are shown below. 

In [None]:
results = pd.DataFrame({'n_estimators' : [5,10,50,100,150,200,250],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

Let's plot the train and test scores as a function of N_estimators.

In [None]:
trace2 = go.Scattergl(name='Mean Test Score',
                      x=results['n_estimators'],
                      y=results['mean_test_score'], 
                      mode='lines+markers',
                     yaxis='y2')
trace1 = go.Scattergl(name='Mean Train Score',
                      x=results['n_estimators'],
                      y=results['mean_train_score'], 
                      mode='lines+markers')

layout = go.Layout(title='Mean Train & Test Scores vs N_estimators',
               xaxis=dict(title='N_estimators'),
               yaxis=dict(title='Mean Train Score'), 
                   yaxis2=dict(title='Mean Test Score',
                              side='right'),
                  legend=dict(orientation='h', y=1.12),
                  margin=dict(t=120))

fig = go.Figure([trace1, trace2], layout=layout)

iplot(fig, filename='train-test-scores.html')

We can see both scores increase in going from 5 to 100 estimators but quickly plateau after that. The train and test scores are plotted on separate axes above so we can distinguish the knees of both curves. We can see that the training score is very close to 1, even for n_estimators=5. The more important score of course is the test score. Let's now look at the tradeoff between the test score and the time required to train and test the model.

In [None]:
def plot_cv(df, param):
    trace0 = go.Scattergl(name='Combined Mean Train+Test Time',
                      x=results[param],
                      y=results['combined_mean_fit-test_time'], 
                      mode='lines+markers',)
    trace1 = go.Scattergl(name='Mean Test Score',
                          x=results[param],
                          y=results['mean_test_score'], 
                          mode='lines+markers',
                         yaxis='y2')

    layout = go.Layout(title='Model Train+Test Time & Test Score vs %s' % param,
                   xaxis=dict(title=param),
                   yaxis=dict(title='Combined Train+Test Time'), 
                       yaxis2=dict(title='Mean Test Score',
                                  side='right'),
                      legend=dict(orientation='h', y=1.12),
                      margin=dict(t=120))

    fig = go.Figure([trace0, trace1], layout=layout)

    iplot(fig, filename='%s.html' % param)

In [None]:
plot_cv(results, param='n_estimators')

Here we see that the combined time for training and testing the model increases significantly up to 109 seconds at N_estimators=150. At N_estimators=100, the train+test time is 65 seconds but the difference in test score between the two is negligible. We can thus save a lot of computational resources and time by choosing N_estimators=100.

#### max_depth

In [None]:
%%time
vals = np.arange(5,100,5)
param_grid = {'max_depth': vals}

clf = RandomForestClassifier(n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'max_depth' : vals,
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
plot_cv(results, param='max_depth')

#### min_samples_split

In [None]:
%%time
vals = np.arange(10,300,10)
param_grid = {'min_samples_split': vals}

clf = RandomForestClassifier(n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'min_samples_split' : vals,
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
plot_cv(results, param='min_samples_split')

#### min_samples_leaf

In [None]:
%%time
vals = np.arange(5,100,5)
param_grid = {'min_samples_leaf': vals}

clf = RandomForestClassifier(n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'min_samples_leaf' : vals,
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
plot_cv(results, param='min_samples_leaf')

#### criterion

In [None]:
%%time
param_grid = {'criterion': ["gini", "entropy"]}

clf = RandomForestClassifier(n_jobs=12, random_state=42)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'criterion' : ['gini', 'entropy'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

#### max_features

In [None]:
%%time
vals = np.arange(5,150,10)
param_grid = {'max_features': vals}

clf = RandomForestClassifier(n_jobs=12)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'max_features': vals,
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

In [None]:
plot_cv(results, 'max_features')

#### oob_score

In [None]:
%%time
param_grid = {'oob_score': [True, False]}

clf = RandomForestClassifier(n_jobs=12)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'oob_score': [True, False],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

#### warm_start

In [None]:
%%time
param_grid = {'warm_start': [True, False]}

clf = RandomForestClassifier(n_jobs=12)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'warm_start': [True, False],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

#### class_weight

In [None]:
%%time
param_grid = {'class_weight': [None, 'balanced', 'balanced_subsample']}

clf = RandomForestClassifier(n_jobs=12)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'class_weight': [None, 'balanced', 'balanced_subsample'],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

#### max_leaf_nodes

In [None]:
%%time
param_grid = {'max_leaf_nodes': [None, 10, 25, 50, 100, 150]}

clf = RandomForestClassifier(n_jobs=12)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

In [None]:
results = pd.DataFrame({'max_leaf_nodes': [None, 10, 25, 50, 100, 150],
                        'combined_mean_fit-test_time': cv.cv_results_['mean_fit_time'] + cv.cv_results_['mean_score_time'],
                        'mean_test_score': cv.cv_results_['mean_test_score'],
                       'mean_train_score': cv.cv_results_['mean_train_score']})

results

#### Putting it all together

In [None]:
%%time
param_grid = {'n_estimators': np.arange(10,200,10),
             'max_depth': np.arange(5,40,5),
             'min_samples_split': np.arange(5,40,5),
             'max_features': np.arange(10,100,10)}

clf = RandomForestClassifier(n_jobs=-1)

cv = GridSearchCV(clf, param_grid=param_grid, 
                  scoring='f1_weighted', cv=5, 
                  iid=True, n_jobs=-1, 
                  return_train_score=True)

cv.fit(X_train,y_train)

## Collaborative & Content Filtering Models

In [11]:
def f1_matrix(Y_true, Y_predicted, average='weighted', labels=[1.0,2.0,3.0,4.0,5.0,6.0]):
    
    """Compute the f1_score between an actual values
    matrix and predicted values matrix.
    
    Parameters
    ----------
    Y_true : 2D numpy array
        Matrix of true values.
    Y_predicted : 2D numpy array
        Matrix of predictions.
    average : str
        Method for weighting f1-score which is computed
        for each label.
    labels : list
        List of labels to compute f1-scores over.
        
    Returns
    -------
    f1 : float
        f1-score computed using `average` method 
        across specified `labels`.
    """
    
    Y_true_ = np.array(Y_true)
    Y_predicted_ = np.array(Y_predicted)
    
    # get indices of non-NaN values
    mask = ~np.isnan(Y_true_)
    
    # flatten mask into 1D array
    mask = mask.flatten(order='C')
    
    # flatten matrices into 1D arrays
    y_true = Y_true_.flatten(order='C')
    y_predicted = Y_predicted_.flatten(order='C')
    
    # filter the arrays using the mask
    y_true = y_true[mask]
    y_predicted = y_predicted[mask]
    
    # compute f1-score
    f1 = f1_score(y_true, y_predicted, average=average, labels=labels)
    
    return f1

In [12]:
# number of unique users
n_users = submissions['user_id'].nunique()
# number of unique items (problems)
n_items = submissions['problem_id'].nunique()

In [13]:
print('Number of unique users: %s' % n_users)
print('Number of unique problems: %s' % n_items)

Number of unique users: 3529
Number of unique problems: 5776


In [14]:
sparsity = len(submissions)/(n_users*n_items)
print('Sparsity of attempts_range: %s%%' % round(sparsity*100, 2))

Sparsity of attempts_range: 0.76%


The full submissions dataset contains 3529 unique users and 5776 unique problems. We have attempts_range data for only 0.76% of all user x problem combinations!! This data is incredibly sparse. Even the Netflix prize dataset had over 1% ratings. This will likely make it much harder for collaborative filtering models to produce good results, as they depend on inferring the attempts_range from the other users and/or items.

After sampling we can pivot both R_train and R_cv into sparse matrices.

In [15]:
R_train = R_train.set_index(['user_id','problem_id']).unstack(level=-1)
R_cv = R_cv.set_index(['user_id','problem_id']).unstack(level=-1)

R_train.columns = R_train.columns.droplevel()
R_cv.columns = R_cv.columns.droplevel()

R_train.head()

problem_id,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,prob_1007,prob_101,...,prob_989,prob_99,prob_991,prob_992,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,1.0,,,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


Since I will be building several types of models using very different types of methods to fill missing attempt_range values, I'll start by creating an empty matrix that contains all user_ids as the index and all problem ids as columns. This matrix is constructed using the full list of users and problems from the users and problems datasets and not the submissions dataset. This is because there are many users and problems for which we have meta data but no history of submissions.

In [16]:
u_diff = len(set(users.user_id.unique()).difference(submissions.user_id.unique()))
print('Number of users from users dataset, not present in submissions: %s' % u_diff)

Number of users from users dataset, not present in submissions: 42


In [17]:
p_diff = len(set(problems.problem_id.unique()).difference(submissions.problem_id.unique()))
print('Number of problems from problems dataset, not present in submissions: %s' % p_diff)

Number of problems from problems dataset, not present in submissions: 768


In [18]:
empty_sub = pd.DataFrame(np.nan, index=users.user_id.unique(), 
                         columns=problems.problem_id.unique())
empty_sub_ = np.array(empty_sub)

We'll fill in the R_train and R_cv data into the empty_sub matrix to have all data and predictions in the same format.

In [21]:
R_train = empty_sub.fillna(R_train)
R_cv = empty_sub.fillna(R_cv)

R_train_ = np.array(R_train)
R_cv_ = np.array(R_cv)

R_train.head()

Unnamed: 0,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,1.0,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


Recall that we created a baseline model before building our random forest models by simply predicting 1 for all missing attempts_range. Let's start by doing the same here.

In [22]:
f1_matrix(R_train_, np.ones((R_train_.shape[0], R_train_.shape[1])))


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.



0.3704453520558025

In [23]:
f1_matrix(R_cv_, np.ones((R_cv_.shape[0], R_cv_.shape[1])))

0.3721755273659568

So our baseline model of predicting all ones gives a starting F1_score of 0.370 on the training data and 0.372 on the cross-validation data.

### User-mean Recommender

The first type of collaborative filtering model I'll build is a user-mean collaborative filtering model. This simple model fills all missing attempts_range values with the averages across all users.

We need a method for dealing with edge cases where we may not have data to make a prediction. For example, since we'll be calculating the mean of each problem and using that to make predictions for all users, we could have problems that were never solved in the training data and therefore not have any predictions made for those columns. Then, in the CV and test datasets, those columns could have data that should've been predicted on. The easiest way to deal with this is to simply predict 1 when we don't have data, since this is by far the most common value of attempts_range across all problems and users.

In [None]:
# compute the mean of each problem across all users
# round to nearest int
user_means = np.round(R_train.mean())

# fill the empty_sub for scoring
R_predicted = empty_sub.fillna(user_means)

# fill all missing values with 1
R_predicted = R_predicted.fillna(1)

R_predicted.head()

In [None]:
f1_matrix(R_train, R_predicted)

So this simple model produces an F1 score that's much better than the baseline, but still worse than our best random forest model. Let's see how this compares to item-based collaborative filtering.

### Item-mean Recommender

In [None]:
problem_means = np.round(R_train.mean(axis=1))

R_predicted = empty_sub.T.fillna(problem_means).T

R_predicted = R_predicted.fillna(1)

R_predicted.head()

In [None]:
f1_matrix(R_train, R_predicted)

The Item-based Collaborative filtering model does considerably worse than the user-based model. In fact, this does worse than even our baseline model where we predicted 1 for all missing attempts_ranges! Here we get an F1 score of 0.34 whereas the baseline model was 0.37.

### User-based vs Item-based Collaborative Filtering

In [None]:
def cos_sim(attempts, kind='user', epsilon=1e-9):
    # fill all NaN values with 0. This does not affect
    # the cosine similarity metric.
    attempts = np.nan_to_num(attempts, 0)
    
    # compute the dot product between each user
    # and all other users.
    if kind == 'user':
        sim = np.dot(attempts, attempts.T) + epsilon
    # compute the dot product between each item
    # and all other items
    if kind == 'item':
        sim = np.dot(attempts.T, attempts) + epsilon
    
    # compute the denominator of the cosine similarity
    # metric
    norms = np.array([np.sqrt(np.diagonal(sim))])
    
    # the dimensions of the returned matrix is 
    # userxuser.
    return sim/norms/norms.T

In [None]:
similarity_users = cos_sim(R_train_, kind='user')
similarity_items = cos_sim(R_train_, kind='item')

In [None]:
def predict(attempts, similarity, kind='user'):
    # fill NaN values with 0
    attempts_fill = np.nan_to_num(attempts, 0)
    
    if kind == 'user':
        return np.round(similarity.dot(attempts_fill) / np.array([np.abs(similarity).sum(axis=1)]).T)
    elif kind == 'item':
        return np.round(attempts_fill.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)]))

In [None]:
R_predicted = predict(R_train, similarity_users, kind='user')
f1_matrix(R_train, R_predicted)

In [None]:
R_predicted = predict(R_train, similarity_items, kind='item')
f1_matrix(R_train, R_predicted)

We can also look at the similarity of users and items using the features datasets rather than the attempt_ranges themselves.

In [None]:
user_features_norm = users.set_index('user_id')/users.set_index('user_id').max()
user_features_norm.head()

In [None]:
R_predicted = predict(R_train, cos_sim(user_features_norm))
f1_matrix(R_train, R_predicted)

In [None]:
problem_features_norm = problems.set_index('problem_id')/problems.set_index('problem_id').max()
problem_features_norm.head()

In [None]:
R_predicted = predict(R_train, cos_sim(problem_features_norm), kind='item')
f1_matrix(R_train, R_predicted)

This type of model is by far the worst for this particular dataset...

### Latent Factor Collaborative Filtering Model From Scratch

#### Building the model

Now for the really fun part! I'm going to build a collaborative filtering model from scratch. In this type of model, latent features will be learned from the user-problem submission history. No data will be used from the users and problems datasets. We will define how many latent features we want the model to train on, this is a hyperparameter that we can tune later on. We will start by initializing a random guess of these latent features and then train the model by minimizing the error the model produces when predicting the attempts_ranges for user-problem combinations, from the learned latent features. To minimize the cost or error, J, we will utilize stochastic gradient descent. I'll start by defining some useful functions to help build the pipeline for the model. Keep in mind that we're starting with a single matrix with users on the row index and items (in this case problems) on the column index.

In [57]:
R_train.head()

Unnamed: 0,prob_1,prob_10,prob_100,prob_1000,prob_1001,prob_1002,prob_1003,prob_1004,prob_1005,prob_1006,...,prob_990,prob_991,prob_992,prob_993,prob_994,prob_995,prob_996,prob_997,prob_998,prob_999
user_1,,,,,,,,,,,...,,,,,,,,,,
user_10,,,,,,,,,,,...,,,,,,,,,,
user_100,,,,,,,,,,,...,,,,,,,,,,
user_1000,,,,1.0,,,,,,,...,,,,,,,,,,
user_1001,,,,,,,,,,,...,,,,,,,,,,


#### unroll

The first function takes two matrices and flattens them into a single 1D array. It does so by first sequentially stacking each column on top of each other for each matrix, producing two 1D arrays. It then stacks those two 1D arrays on top of each other to form a single 1D array that contains all of the latent features for users and items.

In [58]:
def unroll(M_users, M_items):
    
    """Reshape 2 matrices into a single 1D array. 
    Inverse function of `roll`.
    
    Parameters
    ----------
    M_users : 2D numpy array
        Matrix of user latent features. Has
        dimensions (n_users, n_features).
    M_items : 2D numpy array
        Matrix of item latent features. Has
        dimensions (n_items, n_features).
        
    Returns
    -------
    x_users_items : 1D numpy array
        User and item latent features.
    """
    
    # convert matrices to np arrays
    M_users = np.array(M_users)
    M_items = np.array(M_items)

    # flatten 2D arrays into 1D arrays
    x_users = M_users.flatten(order='C')
    x_items = M_items.flatten(order='C')
    
    # concatenate user and item 1D arrays
    x_users_items = np.concatenate((x_users, x_items), axis=0)

    return x_users_items

In [59]:
%%time

n_users = R_train.shape[0]
n_items = R_train.shape[1]
n_features = 100

M_users = np.random.rand(n_users, n_features)
M_items = np.random.rand(n_items, n_features)

x_users_items = unroll(M_users, M_items)

print(n_users*n_features + n_items*n_features)
print(x_users_items.shape)

1011500
(1011500,)
Wall time: 19 ms


This function is quite fast, taking only 20 ms to generate two random matrices and pass them into unroll.

#### roll

The next function does the inverse of unroll. It takes a single 1D array of user and item latent features and reshapes them into their original matrix forms.

In [60]:
def roll(x_users_items, n_users, n_items, n_features):
    
    """Reshape a 1D array of user and item latent
    features into their original 2D array format.
    Inverse function of `unroll`.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
        
    Returns
    -------
    M_users : 2D numpy array
        Matrix of user latent features. Has
        dimensions (n_users, n_features).
    M_items : 2D numpy array
        Matrix of item latent features. Has
        dimensions (n_items, n_features).
    """
    
    # retrieve user and item 1D arrays
    x_users = x_users_items[0:n_users*n_features]
    x_items = x_users_items[n_users*n_features:]
    
    # reshape 1D arrays into original matrices
    M_users = np.reshape(x_users, (n_users, n_features))
    M_items = np.reshape(x_items, (n_items, n_features))

    return M_users, M_items

In [61]:
%%time

n_users = R_train.shape[0]
n_items = R_train.shape[1]
n_features = 100

M_users, M_items = roll(x_users_items, n_users, n_items, n_features)
print(M_users.shape, M_items.shape)

(3571, 100) (6544, 100)
Wall time: 1 ms


This function is also quite fast.

#### cost

The cost function compute the cost or error, J, for a prediction made using the values in x_users_items, against the true values in Y_true. The cost is calculated as the sum of squared errors plus a regularization penatly for both user and item 2nd order latent features.

In [62]:
def cost(x_users_items, Y_true, Lambda, n_users, n_items, n_features):
    
    """Compute cost (error) J from predictions on 
    Y_true using learned features `x_users_items`. J 
    is defined as the sum of squared errors plus
    regularization penatlies on user and item 
    latent features.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    Y_true : 2D numpy array
        Matrix containing true ratings.
    Lambda : int
        Regularization coefficient.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
        
    Returns
    -------
    J : float
        Cost associated with prediction on `Y_true` 
        using learned latent features `x_users_items`.
    """
    
    # recover 2D user and item feature matrices
    M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

    # compute the prediction
    Y_predicted = np.dot(M_users, M_items.T)
    
    # compute the error in the prediction
    error = Y_true - Y_predicted
    # replace all NaN values with 0
    error[np.isnan(error)] = 0

    # compute the regularization penalties
    User_regularization = (Lambda/2) * np.nansum(M_users * M_users)
    Item_regularization = (Lambda/2) * np.nansum(M_items * M_items)

    # compute the cost J with regularization
    J = (1/2) * np.nansum(error*error) + User_regularization + Item_regularization

    return J

In [63]:
%%time

n_users = R_train.shape[0]
n_items = R_train.shape[1]
n_features = 100
Lambda=0.1 # regularization coefficient

J = cost(x_users_items, R_train, Lambda, n_users, n_items, n_features)
print(J)

23838459.908534266
Wall time: 1.1 s


Computing the cost takes a bit longer at 1 second. We can see the very large cost J for our initial random guess.

#### gradient

As the name suggests, this function computes the gradient of the cost function, evaluated independently for the user and item latent features. The gradient is what we use to decide in which direction to step in during each update or iteration.

In [64]:
def gradient(x_users_items, Y_true, Lambda, n_users, n_items, n_features):
    
    """Compute gradient function on `x_users_items`.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    Y_true : 2D numpy array
        Matrix containing true ratings.
    Lambda : int
        Regularization coefficient.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
        
    Returns
    -------
    gradient : 1D numpy array
        Gradient of cost J w.r.t user and item
        latent features.
    """
    
    # recover 2D user and item feature matrices
    M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

    # compute the prediction
    Y_predicted = np.dot(M_users, M_items.T)
    
    # compute the error in the prediction
    error = Y_true - Y_predicted
    # replace all NaN values with 0
    error[np.isnan(error)] = 0 

    # the gradients of user & item features
    M_user_gradient = np.dot(error, M_items) + Lambda*M_users
    M_item_gradient = np.dot(error.T, M_users) + Lambda*M_items

    # reshape gradients into 1D array
    gradient = unroll(M_user_gradient, M_item_gradient)

    return gradient

In [65]:
%%time

n_users = R_train.shape[0]
n_items = R_train.shape[1]
n_features = 100
Lambda=0.1 # regularization coefficient

grad = gradient(x_users_items, R_train, Lambda, n_users, n_items, n_features)
grad

Wall time: 1 s


array([-3.64724205e+02, -4.85851599e+02, -4.66740767e+02, ...,
       -4.43565122e-01, -8.28989529e+00, -7.97167493e+00])

This also takes close to 1 second to compute.

#### predict

As the name implies, this function takes in the trained latent features in x_users_items and computes the predicted attempts_range from them.

In [66]:
def predict(x_users_items, n_users, n_items, n_features):
    
    """Compute prediction on ratings. Predictions
    are computed from learned user and item 
    latent features in `x_users_items`.
    
    Parameters
    ----------
    x_users_items : 1D numpy array
        User and item latent features.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
    
    Return
    ------
    Y_predicted : pandas DataFrame
        Predictions.
    """
    
    # recover 2D user and item feature matrices
    M_users, M_items = roll(x_users_items, n_users, n_items, n_features)

    # compute predictions from P & Q
    Y_predicted = np.dot(M_users, M_items.T) 
    
    # set all negative predictions to 1 (bottom limit)
    Y_predicted[Y_predicted < 1] = 1
    
    Y_predicted = Y_predicted.astype(int)
    
    return Y_predicted

In [67]:
%%time

n_users = R_train.shape[0]
n_items = R_train.shape[1]
n_features = 100

predict(x_users_items, n_users, n_items, n_features)

Wall time: 282 ms


array([[29, 22, 28, ..., 24, 29, 26],
       [27, 24, 27, ..., 24, 29, 26],
       [27, 22, 28, ..., 23, 29, 26],
       ...,
       [28, 25, 26, ..., 24, 29, 27],
       [26, 22, 25, ..., 22, 27, 23],
       [26, 22, 24, ..., 21, 25, 25]])

The prediction step only takes 262 ms. We can see the predicted values are no where even close to what the attempts_ranges should be. Of course, the initial prediction is expected to be way off since we just randomly initialized the latent feature values. We will see how these numbers evolve as we train the model.

#### SGD_e

Here is the meat of the training algorithm. SGD stands for stochastic gradient descent. It's an iterative agorithm that computes the error using a cost function J which we defined above, and steps in the direction (given by the gradient) that will minimize J in the next step. The latent feature values x_users_items are updated at each step until a convergence criteria is met. In this case, the e in SGD_e stands for epoch. This implementation will execute for a fixed number of epochs. Let's go ahead and take it for a spin!

In [68]:
def SGD_e(R_train, n_users, n_items, n_features, Lambda, 
          epochs, alpha, compute_f1=False, seed=42, **kwargs):
    
    """Stochastic gradient descent algorithm. Searches
    for the optimal values of user and item latent
    features in x_users_items, that minimize the cost J.
    Updates are calculated for `epochs` iterations using a 
    learning rate `alpha`.
    
    Parameters
    ----------
    R_train : 2D numpy array
        Ratings matrix for training dataset.
    R_cv : 2D numpy array
        Ratings matrix for cross-validation dataset.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
    Lambda : int
        Regularization coefficient.
    epochs : int
        Number of iterations to run.
    alpha : float
        Learning rate.
    compute_f1 : bool (default False)
        If true, computes the f1-scores for predictions
        on R_train and R_cv at each epoch.
    seed : int (default 42)
        Seed for numpy's pseudo-random number
        generator.
    
    Returns
    -------
    results : dict
        Nested dictionary containing epoch as keys. Values
        associated with each `epochs` are cost J, f1-scores 
        for training and CV datasets, and optimized 
        parameters `x_users_items`.
        
    """
    
    # get cross-validation data if given
    R_cv = kwargs.get('R_cv')
    
    # set random seed
    np.random.seed(seed)
    
    # intial random guess of user and item
    # latent features
    M_users = np.random.rand(n_users, n_features) - 0.5
    M_items = np.random.rand(n_items, n_features) - 0.5
    
    # reshape matrices into 1D array of
    # user and item latent features
    x_users_items = unroll(M_users, M_items)
    
    # initialize empty dict to store training results
    results = {}
    
    # loop through `epochs` iterations
    for e in range(1,epochs+1):
        
        # compute the cost
        J = cost(x_users_items, R_train, Lambda, 
                   n_users, n_items, n_features)
        
        # compute the gradient function
        gradient_ = gradient(x_users_items, R_train, Lambda, 
                      n_users, n_items, n_features)
        
        # update `x_users_items`
        x_users_items = x_users_items + alpha * gradient_
        
        if compute_f1:
            # make prediction
            Y_predicted = predict(x_users_items, n_users, n_items, n_features)

            # store cost J, f1-scores on training and CV data,
            # and the optimized parameters `x_users_items`.
            results[e] = {'J': J, 'f1-train': f1_matrix(R_train, Y_predicted), 
                          'f1-cv': f1_matrix(R_cv, Y_predicted), 'x_users_items': x_users_items}
            
        else:
            # store cost J and optimized parameters `x_users_items`
            results[e] = {'J': J, 'x_users_items': x_users_items}
        
        # print current epoch and cost
        print('Epoch %s' % e + ' | ' + 'J : %s' % round(J))
        
        # logic for stop condition
        if e > 1:
            # compute delta for current iteration
            delta = (results[e-1]['J'] - results[e]['J'])
            
            # if J increases from last iteration (delta < 0)
            # end updates and return results
            if delta < 0:
                print('Gradient diverging! Ending training...')
                return results
        else:
            pass
    
    # indicate completion and print final J
    print('Training complete, final J: %s' % round(results[epochs]['J']))
    
    return results

In [53]:
%%time

# data parameters
n_users = R_train_.shape[0]
n_items = R_train_.shape[1]

# hyper parameters
n_features = 10
Lambda=0.01
epochs=20
alpha=0.01

results = SGD_e(R_train_, n_users, n_items, n_features, Lambda, 
                epochs, alpha, compute_f1=True, seed=42, R_cv=R_cv_)

Epoch 1 | J : 187802.0
Epoch 2 | J : 181496.0
delta: 6305.159405808954
Epoch 3 | J : 175672.0
delta: 5824.303110547073
Epoch 4 | J : 167338.0
delta: 8334.293891633191
Epoch 5 | J : 151792.0
delta: 15545.914596866089
Epoch 6 | J : 124942.0
delta: 26850.409929636604
Epoch 7 | J : 96935.0
delta: 28006.609274381073
Epoch 8 | J : 76578.0
delta: 20356.440915378058
Epoch 9 | J : 64031.0
delta: 12547.92035841605
Epoch 10 | J : 63986.0
delta: 44.15742871346447
Epoch 11 | J : 126637.0
delta: -62650.6762465129
Gradient diverging! Ending training...
Wall time: 23.7 s


In [54]:
x = np.array(list(results.keys()))
y = np.array([results[i]['J'] for i in results.keys()])

trace0=go.Scattergl(x=x, y=y, mode='lines+markers')

layout=go.Layout(title='Cost Function vs epoch',
                yaxis=dict(title='Cost Function',
                          type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='training.html')

In [69]:
results[10]

{'J': 63986.38632891589,
 'f1-train': 0.493357831544373,
 'f1-cv': 0.4379398584326611,
 'x_users_items': array([ 0.16602717, -0.32074901,  0.40938191, ...,  0.10771847,
        -0.26563553,  0.28468026])}

In [56]:
predict(results[10]['x_users_items'], n_users, n_items, n_features)

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

In [75]:
import pickle as pkl
results_100epochs = load_obj('initial_100epochs')

In [76]:
x = np.array(list(results_100epochs.keys()))
y = np.array([results_100epochs[i]['J'] for i in results_100epochs.keys()])

trace0=go.Scattergl(x=x, y=y, mode='lines+markers')

layout=go.Layout(title='Cost Function vs epoch',
                yaxis=dict(title='Cost Function',
                          type='log'),
                xaxis=dict(title='epoch'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='training.html')

In [77]:
results_100epochs[100]

{'J': 50589.926677060714,
 'F1-train': 0.5178311985749524,
 'F1-cv': 0.4516169439596815,
 'xopt': array([ 0.1384063 , -0.11193197, -0.01139779, ..., -0.83185053,
         0.50665992, -0.281704  ])}

In [None]:
def SGD_t(R_train, n_users, n_items, n_features, Lambda, 
          epsilon, alpha, compute_f1=False, seed=42, **kwargs):
    
    """Stochastic gradient descent algorithm. Searches
    for the optimal values of user and item latent
    features in x_users_items, that minimize the cost J.
    Updates are calculated until the change (delta) in J, 
    between iterations, is less than epsilon.
    
    Parameters
    ----------
    R_train : 2D numpy array
        Ratings matrix for training dataset.
    R_cv : 2D numpy array (optional, must be given if
    `compute_f1` is True)
        Ratings matrix for cross-validation dataset.
    n_users : int
        Number of users.
    n_items : int
        Number of items.
    n_features: int
        Number of latent features to learn. 
        Determines the overall size of M_users 
        and M_items.
    Lambda : int
        Regularization coefficient.
    epsilon : float
        Training threshold. Training stops when the cost
        J is less than epsilon.
    alpha : float
        Learning rate.
    compute_f1 : bool (default False)
        If true, computes the f1-scores for predictions
        on R_train and R_cv at each epoch.
    seed : int (default 42)
        Seed for numpy's pseudo-random number
        generator.
    
    Returns
    -------
    results : dict
        Nested dictionary containing epoch as keys. Values
        associated with each `epochs` are cost J, f1-scores 
        for training and CV datasets, and optimized 
        parameters `x_users_items`.
        
    """
    
    # get cross-validation data if given
    R_cv = kwargs.get('R_cv')
    
    # set random seed
    np.random.seed(seed)
    
    # intial random guess of user and item
    # latent features
    M_users = np.random.rand(n_users, n_features) - 0.5
    M_items = np.random.rand(n_items, n_features) - 0.5
    
    # reshape matrices into 1D array of
    # user and item latent features
    x_users_items = unroll(M_users, M_items)
    
    # initialize empty dict to store training results
    results = {}
    
    e = 1 # counter for training iteration
    
    # large, arbitrary initial value for delta in J
    delta = 1000
    
    # iterate until the delta in J is less than epsilon
    while delta > epsilon:
        
        # compute the cost
        J = cost(x_users_items, R_train, Lambda, 
                   n_users, n_items, n_features)
        
        # compute the gradient function
        gradient_ = gradient(x_users_items, R_train, Lambda, 
                      n_users, n_items, n_features)
        
        # update `x_users_items`
        x_users_items = x_users_items + alpha * gradient_
        
        if compute_f1:
            # make prediction
            Y_predicted = predict(x_users_items, n_users, n_items, n_features)

            # store cost J, f1-scores on training and CV data,
            # and the optimized parameters `x_users_items`.
            results[e] = {'J': J, 'f1-train': f1_matrix(R_train, Y_predicted), 
                          'f1-cv': f1_matrix(R_cv, Y_predicted), 'x_users_items': x_users_items}
            
        else:
            # store cost J and optimized parameters `x_users_items`
            results[e] = {'J': J, 'x_users_items': x_users_items}
        
        # print current epoch and cost
        print('Epoch %s' % e + ' | ' + 'J : %s' % round(J))
        
        # logic for stop condition
        if e > 1:
            # compute delta for current iteration
            delta = (results[e-1]['J'] - results[e]['J'])
            
            # if J increases from last iteration (delta < 0)
            # end updates and return results
            if delta < 0:
                print('Gradient diverging! Ending training...')
                return results
        else:
            pass
        
        print('Cost delta: %s' % round(delta))
        print()
        
        e += 1
    
    # indicate completion and print final J
    print('Stopping criteria met: delta < epsilon.')
    print('Final J: %s' % round(results[epochs]['J']))
        
    return results