Post recommendation from implicit feedback
====

<i>Challenge Data 2016.</i>
<b>N.B. the code uses Python 2 </b>

Based on the history of user "likes" over posts, the goal of the challenge is to recommend new posts that a user might like.

The dataset is split into four matrices: `input_train`, `output_train`, `input_test` and `output_test`.

There are three phases: 

1) You design and tune your algorithm by alternating between two steps:

    a. You train your algorithm on `input_train`, and predict scores `train_scores`.
    b. You evaluate `train_scores` against the ground truth `output_train`
    
2) Once you have good parameters, you export recommendations for the challenge:

    a. You train your algorithm on `input_test`, and predict scores `test_scores`.
    b. You call `binarize`, which binarizes your scores and removes recommendations for items in the training set, then you export the resulting `test_predict` in a file `test_pred.csv` and submit it to the ChallengeData platform.
    
3) We (organisers of the challenge) will evaluate `test_predict` by computing its precision@5 with respect to the (hidden) `output_test`

In this tutorial we show how to do phases 1) and 2) using a simple algorithm, the popularity baseline, which recommends the 5 most popular items not already liked to each user.


The evaluation strategy is the following. For each user:
- All likes are split between an input set and an output set (called `input` and `output` in this notebook).
- The algorithm assigns scores of preference to all items. In this notebook the scores predicted by the popularity baseline are called `pop_scores`.
- For each user validation data is given.
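
To make the metric concrete, here is a minimal sketch of precision@k for a single user. The function name `precision_at_k` and its arguments are illustrative and are not part of `helpers`. Dividing by `k` even when the user has fewer than `k` held-out likes is an assumption, and the challenge's own implementation may differ in such details.

```python
def precision_at_k(recommended, held_out, k=5):
    """Fraction of the top-k recommended items that the user actually liked.

    recommended: item ids ordered from highest to lowest score, with items
                 from the user's input set already removed
    held_out:    set of item ids in the user's output set
    """
    top = recommended[:k]
    hits = sum(1 for item in top if item in held_out)
    return hits / float(k)  # assumption: always divide by k


# Toy example with made-up item ids: items 4 and 3 are hits, so 2 / 5 = 0.4
toy_precision = precision_at_k([10, 4, 7, 1, 3], {4, 3, 8}, k=5)
```

The reported score would then be the average of this quantity over all users.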



Loading the data
-----

Before running the following, ensure that the data files are next to this notebook.

In [1]:
import helpers
reload(helpers)  # Python 2 builtin: picks up edits to helpers.py without restarting the kernel
from helpers import load_csr, save_csr, print_metrics, binarize

# Load scipy.sparse.csr_matrix data
input_train = load_csr('input_train.csv')
output_train = load_csr('output_train.csv')
input_test = load_csr('input_test.csv')

# Densify to np.array for convenience
input_train_dense = input_train.toarray()
output_train_dense = output_train.toarray()
input_test_dense = input_test.toarray()

Popularity baseline
----

The simplest algorithm is to sort the items by popularity and to recommend to each user the top 5 items that they have not already liked.

In [2]:
import numpy as np

def popularity_baseline(input_dense):
    # Get item popularities
    per_item = input_dense.sum(axis=0)

    # Compute the (same) scores for each user
    pop_scores = np.outer(np.ones(input_dense.shape[0]), per_item)
    
    return pop_scores
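
As a quick sanity check, the same construction can be run on a tiny made-up matrix. The data below is invented for illustration (3 users, 4 items); the two computation lines mirror the body of `popularity_baseline`.

```python
import numpy as np

# Toy "likes" matrix: 3 users (rows) x 4 items (columns), invented for illustration
toy_likes = np.array([[1, 0, 1, 0],
                      [1, 1, 0, 0],
                      [1, 0, 0, 0]])

# Item popularity = number of users who liked each item
per_item = toy_likes.sum(axis=0)  # array([3, 1, 1, 0])

# Same construction as popularity_baseline: one copy of per_item per user
toy_scores = np.outer(np.ones(toy_likes.shape[0]), per_item)

# Every row is identical, so the raw ranking is the same for all users
assert toy_scores.shape == toy_likes.shape
assert (toy_scores == per_item).all()
```

Note that item 0 gets the highest score for every user even though all three users already like it. This is why already-liked items are ignored at evaluation time and removed before export.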

Train your algorithm on `input_train`
----

In [3]:
train_scores = popularity_baseline(input_train_dense)

Evaluate with Precision@5 on `output_train`
----

Now we compute the precision@5 score using the `print_metrics` routine, which for each user will ignore the scores of items already present in `input_train`.

In [4]:
input_scores, output_scores = print_metrics(train_scores,
                                            input_train,
                                            output_train,
                                            k=5)

test_precision = output_scores[0]
print '\nTest precision', test_precision

Predictions shape (1065L, 905L)
K=5
Metrics ['precision@k', 'recall@k', 'f1@k']
Train [ 0.02835681  0.04693821  0.03012455]
Test [ 0.02929577  0.04659194  0.03232149]
Shapes. Train (1065L, 3L) Test (1065L, 3L)

Test precision 0.0292957746479


(Here you should do some tuning)
----

In our simple example the baseline popularity method has no parameters to tune, but typically you would choose your parameters by evaluating your predictions against `output_train`.

In [5]:
# some tuning
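
For an algorithm that does have parameters, the tuning loop could look like the sketch below. Everything here is hypothetical: `tune`, `score_fn` and `eval_fn` are illustrative names, not part of `helpers`, and the sketch assumes a single parameter and a metric where higher is better.

```python
def tune(score_fn, eval_fn, candidates):
    """Try each candidate parameter and keep the one with the best metric.

    score_fn:   maps a parameter value to predicted scores (hypothetical)
    eval_fn:    maps predicted scores to a single number, higher is better
    candidates: iterable of parameter values to try
    """
    best_param, best_metric = None, float('-inf')
    for param in candidates:
        metric = eval_fn(score_fn(param))
        if metric > best_metric:
            best_param, best_metric = param, metric
    return best_param, best_metric


# Toy check with stand-in functions: the metric peaks at param == 2
toy_best = tune(lambda p: p, lambda s: -(s - 2) ** 2, [0, 1, 2, 3])
```

In this notebook, `score_fn` would train on `input_train` with the given parameter, and `eval_fn` could wrap `print_metrics` and return `output_scores[0]`, as in cell In [4].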

Train your algorithm on `input_test`
----

Now we produce the predictions for the final evaluation by training on `input_test`.
Then we binarize the scores and export the result.

In [6]:
test_scores = popularity_baseline(input_test_dense)

In [7]:
test_predict = binarize(test_scores, input_test, k=5)

# Export the binarized predictions
save_csr('test_pred', test_predict)
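
The implementation of `binarize` is not shown in this notebook. Based on the description above (keep the top-k scores per user, excluding items already in the input set), a minimal dense sketch might look as follows. The name `binarize_top_k` is hypothetical, it returns a dense array rather than a sparse matrix, and it assumes every user has at least `k` items not yet liked.

```python
import numpy as np

def binarize_top_k(scores, input_likes, k=5):
    """Illustrative stand-in for helpers.binarize, not its actual code."""
    scores = np.asarray(scores, dtype=float)
    seen = np.asarray(input_likes) > 0

    # Already-liked items get -inf so they cannot enter the top k
    # (assumes each user has at least k unseen items)
    masked = np.where(seen, -np.inf, scores)

    # Column indices of the k highest-scored remaining items per user
    top_k = np.argsort(-masked, axis=1)[:, :k]

    # 0/1 matrix with exactly k ones per row
    predictions = np.zeros(scores.shape, dtype=int)
    rows = np.arange(scores.shape[0])[:, None]
    predictions[rows, top_k] = 1
    return predictions


# Toy data invented for illustration: 2 users, 4 items, k=2
toy_scores = np.array([[3, 2, 1, 0],
                       [3, 2, 1, 0]])
toy_input = np.array([[1, 0, 0, 0],
                      [0, 0, 0, 1]])
toy_pred = binarize_top_k(toy_scores, toy_input, k=2)
```

Although both users have identical scores, user 0 is recommended items 1 and 2 because item 0 is already liked, while user 1 is recommended items 0 and 1.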

