# Demo Recpack

This is an end-to-end demo of how to set up an experimental pipeline with Recpack.

It covers:
- Loading a dataset
- Preprocessing the dataset
- Transforming it to the appropriate data format
- Splitting the dataset into a training set, an optional validation set, and a test set
- Training the algorithm(s)
- Making predictions
- Evaluating the performance of the algorithm(s) using metrics

### Dataset

Recpack provides a set of datasets you can use out of the box. These include:
- MovieLens 25M
- RecSys Challenge 2015
- CiteULike

If your dataset of choice is not included in this list, you can choose to either:
- Create your own Dataset child class for this dataset.
- Perform the transformation to an InteractionMatrix yourself.
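
For the second option, the core of the work is mapping raw user and item identifiers to contiguous row and column indices. The sketch below shows that idea in plain Python on made-up data. It is not Recpack's implementation, and `build_index` is a hypothetical helper introduced only for illustration.

```python
# Toy (user, item, timestamp) tuples; all values are invented for illustration.
raw = [("alice", "m10", 1000), ("bob", "m10", 1001), ("alice", "m7", 1002)]


def build_index(values):
    """Map each distinct value to a contiguous integer, in order of first appearance."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index


user_index = build_index(user for user, _, _ in raw)
item_index = build_index(item for _, item, _ in raw)

# Interactions expressed as (row, column, timestamp), ready for a sparse matrix.
indexed = [(user_index[u], item_index[i], ts) for u, i, ts in raw]
shape = (len(user_index), len(item_index))

print(indexed)
print(shape)
```

Keeping the two dictionaries around lets you translate recommendations back to the original identifiers afterwards.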

In this example we use the MovieLens 25M dataset which contains user-item rating tuples and timestamp information.

In [None]:
import os
import pandas as pd

from recpack.data.datasets import MovieLens25M

In [None]:
# Create the data directory if it does not exist yet
os.makedirs('data', exist_ok=True)
# We set preprocess_default to False, so we can highlight the selection and application of filters
dataset = MovieLens25M("data/ml25.csv", preprocess_default=False)
# Download the dataset if not present at "data/ml25.csv"
dataset.fetch_dataset()

In [None]:
# Load all interactions as a pandas DataFrame
df = dataset.load_dataframe()

In [None]:
df.nunique()

### Preprocessing

Recpack makes it easy to perform some classic preprocessing steps.
When using a Dataset, the preprocessing steps are performed when `load_interaction_matrix()` is called.

Every Dataset comes with a set of default filters, i.e. typical preprocessing steps applied to this dataset in the literature.
To override these, pass the `preprocess_default=False` keyword argument when constructing the object, then define your own filters.

In this example we won't use the default ML25M filters, but instead define our own.

* MinRating = 3 -> Any rating of 3 and above is considered a positive interaction.
* MinUsersPerItem = 10 -> Remove all items that fewer than 10 users interacted with, otherwise computation might run out of RAM.
* MinItemsPerUser = 5 -> Remove all users who interacted with fewer than 5 items.
* NMostPopular = 1000 -> Keep only the 1000 most popular items.

The order in which these are added is important, as they are applied in order:
first the rating filter, then users per item, then items per user, and finally the most-popular-items filter.

We usually apply the strictest filter first. 
Otherwise we might count interactions with rating < 3 in our min user per item filter, but then throw them away again later on.
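
The effect of filter order can be seen on a toy example. The sketch below is plain Python rather than Recpack code. The helpers `min_rating` and `min_users_per_item` and the ratings are made up for illustration and only mimic the idea of the corresponding filters.

```python
from collections import defaultdict

# Toy (user, item, rating) tuples; the values are invented for illustration.
ratings = [
    ("u1", "i1", 5), ("u2", "i1", 4), ("u3", "i1", 1),
    ("u1", "i2", 2), ("u2", "i2", 1), ("u3", "i2", 1),
    ("u1", "i3", 4), ("u2", "i3", 5), ("u3", "i3", 3),
]


def min_rating(rows, threshold):
    """Keep only interactions with a rating of at least `threshold`."""
    return [row for row in rows if row[2] >= threshold]


def min_users_per_item(rows, threshold):
    """Keep only items that at least `threshold` distinct users interacted with."""
    users_per_item = defaultdict(set)
    for user, item, _ in rows:
        users_per_item[item].add(user)
    return [row for row in rows if len(users_per_item[row[1]]) >= threshold]


# Rating filter first: i1 has only two positive users left, so it is removed.
rating_first = min_users_per_item(min_rating(ratings, 3), 3)
# Count filter first: i1 passes on raw counts, then loses one of its three users.
count_first = min_rating(min_users_per_item(ratings, 3), 3)

items_rating_first = sorted({item for _, item, _ in rating_first})
items_count_first = sorted({item for _, item, _ in count_first})
print(items_rating_first)
print(items_count_first)
```

With the count filter applied first, item `i1` survives even though only two users rated it positively, which defeats the purpose of the minimum-users threshold.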

In [None]:
from recpack.preprocessing.filters import MinItemsPerUser, MinRating, MinUsersPerItem, NMostPopular

In [None]:
dataset.add_filter(MinRating(3, rating_ix="rating"))
dataset.add_filter(MinUsersPerItem(10, item_ix="movieId", user_ix="userId", count_duplicates=False))
dataset.add_filter(MinItemsPerUser(5, item_ix="movieId", user_ix="userId", count_duplicates=False))
dataset.add_filter(NMostPopular(1000, item_ix="movieId"))

In [None]:
# Applies filters, and loads filtered data into an InteractionMatrix
data = dataset.load_interaction_matrix()

In [None]:
original_users = df.userId.nunique()
original_items = df.movieId.nunique()
users, items = data.shape

print(f"We have {users} users, {items} items and {data.num_interactions} interactions left")
print(f"preprocessing removed {original_users - users} users")
print(f"preprocessing removed {original_items - items} items")

### Scenario

A scenario describes a situation in which to evaluate the performance of our algorithm.
Examples are predicting for users that were not in the training set (StrongGeneralization),
or predicting future interactions of the users in the training dataset (Timed).

We're not optimizing any hyperparameters here, so we split into only a training and test set.
If you do wish to optimize hyperparameters, add `validation=True` to the constructor of your scenario,
and use this validation data to determine the optimal hyperparameters.

Here we choose the StrongGeneralization scenario, which means we won't use the timestamp information for now.
StrongGeneralization splits users into two groups; only the training users' interactions are used for training.
Test users' interactions are split again: one part is used as history to base predictions on,
the other part consists of the held-out interactions we will try to predict.

As parameters this scenario allows you to select the fraction of users to use for training, 
and the fraction of user interactions to use as history for your predictions.

We will use 70% of users as training data.
For prediction we will use 80% of the test users' interactions as history and predict the remaining 20%.
`validation` is set to False because we do not require validation data.
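
To make the two fractions concrete, here is a minimal sketch of such a split in plain Python. It is an illustration under simplifying assumptions (random shuffling, rounding to whole users and interactions), not Recpack's implementation, and `strong_generalization_split` is a hypothetical helper.

```python
import random


def strong_generalization_split(interactions, frac_users_train, frac_interactions_in, seed=42):
    """Split a dict of user -> list of items into train, test-in and test-out parts."""
    rng = random.Random(seed)
    users = sorted(interactions)
    rng.shuffle(users)

    # Users are assigned to either the training group or the test group.
    n_train = round(len(users) * frac_users_train)
    train_users, test_users = users[:n_train], users[n_train:]
    train = {user: list(interactions[user]) for user in train_users}

    # Each test user's interactions are split into history and held-out items.
    test_in, test_out = {}, {}
    for user in test_users:
        items = list(interactions[user])
        rng.shuffle(items)
        n_in = round(len(items) * frac_interactions_in)
        test_in[user] = items[:n_in]
        test_out[user] = items[n_in:]
    return train, test_in, test_out


# Ten toy users who each interacted with the same ten items.
toy = {f"u{n}": [f"i{m}" for m in range(10)] for n in range(10)}
train, test_in, test_out = strong_generalization_split(toy, 0.7, 0.8)
print(len(train), len(test_in), len(test_out))
```

No test user appears in the training data, so the model has to generalize to users it has never seen, using only their history.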

In [None]:
from recpack.splitters.scenarios import StrongGeneralization

In [None]:
scenario = StrongGeneralization(frac_users_train=0.7, frac_interactions_in=0.8, validation=False)

In [None]:
scenario.split(data)

## Pipeline

We'll use a pipeline to do the heavy lifting for us.

We construct the pipeline using the pipeline builder. In the next steps we will add the data, select the algorithms, set the metrics, and finally run the pipeline.

In [None]:
import recpack.pipelines

In [None]:
pipeline_builder = recpack.pipelines.PipelineBuilder('demo')
pipeline_builder.set_train_data(scenario.training_data)
pipeline_builder.set_test_data(scenario.test_data)

### Algorithms

We now choose two algorithms to evaluate:

* Item KNN
* Popularity 

You can also add EASE, but make sure you have enough RAM available: at least 32 GB is needed.

Each algorithm has a set of parameters, so in practical settings, you would optimise them before comparison. 
Here we don't care as much about optimality, and so we just picked defaults that made some sense.
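
As rough intuition for what the two baselines compute, the sketch below implements a popularity count and an item-based nearest-neighbour recommender in plain Python. It assumes cosine similarity on binary interactions and takes `k` to be the number of neighbours kept per item. These are assumptions for illustration, not a description of Recpack's implementations, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training data: user -> set of items the user interacted with.
train = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"a", "c"}, "u4": {"d"}}


def popularity_scores(train):
    """Count how many users interacted with each item."""
    return Counter(item for items in train.values() for item in items)


def item_similarities(train, k):
    """For each item, keep its k most similar items by cosine similarity."""
    users_per_item = defaultdict(set)
    for user, items in train.items():
        for item in items:
            users_per_item[item].add(user)
    sims = {}
    for i, users_i in users_per_item.items():
        scores = {}
        for j, users_j in users_per_item.items():
            overlap = len(users_i & users_j)
            if i != j and overlap:
                scores[j] = overlap / math.sqrt(len(users_i) * len(users_j))
        # Sort by descending similarity, breaking ties by item name.
        sims[i] = dict(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k])
    return sims


def recommend(history, sims, n):
    """Score unseen items by summing similarities to the items in the history."""
    scores = defaultdict(float)
    for item in history:
        for neighbour, sim in sims.get(item, {}).items():
            if neighbour not in history:
                scores[neighbour] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


most_popular = popularity_scores(train).most_common(1)
knn_recs = recommend({"b"}, item_similarities(train, k=2), n=2)
print(most_popular)
print(knn_recs)
```

Popularity gives every user the same ranking, while the nearest-neighbour approach personalizes the ranking based on each user's history.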

It is also entirely possible to create your own algorithm. To learn how to do so, check out the Getting Started guide in the Recpack documentation.

In [None]:
pipeline_builder.add_algorithm('ItemKNN', params={'K': 200})
# pipeline_builder.add_algorithm('ItemKNN', params={'K': 300})
pipeline_builder.add_algorithm('Popularity', params={'K': 50})
# pipeline_builder.add_algorithm('EASE', params={'l2': 100})

### Metrics
We now select the metrics on which the algorithms will be evaluated:

* CoverageK
* CalibratedRecallK
* NormalizedDiscountedCumulativeGainK
* HitK
* WeightedByInteractionsHitK

As K value we will use 10, as if we recommend a box of 10 items.
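
For readers unfamiliar with these metrics, the sketch below gives common textbook definitions for a single user in plain Python. It assumes binary relevance. Recpack's own definitions may differ in normalization or in how values are aggregated over users, so treat this as an illustration only.

```python
import math


def hit_k(recommended, relevant, k):
    """Number of relevant items among the top-k recommendations."""
    return sum(1 for item in recommended[:k] if item in relevant)


def calibrated_recall_k(recommended, relevant, k):
    """Hits divided by the best achievable number of hits, min(k, |relevant|)."""
    if not relevant:
        return 0.0
    return hit_k(recommended, relevant, k) / min(k, len(relevant))


def ndcg_k(recommended, relevant, k):
    """Discounted gain of the hits, normalized by the gain of an ideal ranking."""
    # rank is 0-based, so the discount for the first position is log2(2) = 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_k(all_recommendations, n_items, k):
    """Fraction of the catalogue that appears in at least one top-k list."""
    covered = {item for recs in all_recommendations for item in recs[:k]}
    return len(covered) / n_items


recs = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}
print(hit_k(recs, relevant, 3))
print(calibrated_recall_k(recs, relevant, 3))
print(ndcg_k(recs, relevant, 3))
print(coverage_k([recs, ["b", "c", "e", "f"]], n_items=10, k=3))
```

Coverage is computed over all users' recommendation lists at once, whereas the other three are computed per user and then aggregated.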

In [None]:
pipeline_builder.add_metric('CoverageK', 10)
pipeline_builder.add_metric('CalibratedRecallK', 10)
pipeline_builder.add_metric('NormalizedDiscountedCumulativeGainK', 10)
pipeline_builder.add_metric('HitK', 10)
pipeline_builder.add_metric('WeightedByInteractionsHitK', 10)

### Run

To run the experiment, we build the pipeline and call `run()`.

In [None]:
pipeline = pipeline_builder.build()

In [None]:
pipeline.run()

In [None]:
pd.DataFrame.from_dict(pipeline.get_metrics())