# Demo Recpack

This is an end-to-end demo of Recpack functionality.

### Dataset

We use the MovieLens 25M dataset, which contains timestamped user-item rating tuples.

In [None]:
from recpack.data.datasets import MovieLens25M

In [None]:
# preprocess_default is set to False so we can demonstrate adding our own filters to a dataset
dataset = MovieLens25M("data/ml25.csv", preprocess_default=False)
# Download the dataset if not present on hard disk
dataset.fetch_dataset()

In [None]:
# Load all interactions as a pandas DataFrame
df = dataset.load_dataframe()

In [None]:
df.nunique()

### Preprocessing

When using a dataset, preprocessing happens when its `load_interaction_matrix()` function is called.

Datasets come with default filters. To skip them (if you want to), pass the `preprocess_default=False` keyword argument at initialization, as we did when creating the dataset.

In this example we don't use the default ML25M filters, but add our own instead:

* MinRating = 3 -> Any rating of 3 or above is considered a positive interaction
* MinUsersPerItem = 30 -> Keeps only the most interacted-with items; otherwise the data needs too much RAM
* MinItemsPerUser = 5 -> Keeps users with enough interactions to allow for prediction

The order in which the filters are added matters, because they are applied to the data in that order.

So the rating filter is applied first, then users per item, and finally items per user.

We usually apply the strictest filters first.
Otherwise we might count interactions with rating < 3 in the MinUsersPerItem filter, only to throw them away again later on.
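
To make the effect of filter order concrete, here is a minimal pure-Python sketch. It is not RecPack's implementation: the helpers `min_rating` and `min_users_per_item` and the toy interactions are hypothetical and only mimic the behaviour described above.

```python
from collections import defaultdict

# Toy (user, item, rating) interactions; values are made up for illustration.
interactions = [
    ("u1", "A", 5), ("u2", "A", 1), ("u3", "A", 2),
    ("u1", "B", 4), ("u2", "B", 5), ("u3", "B", 3),
]


def min_rating(rows, threshold):
    """Keep interactions whose rating is at least `threshold`."""
    return [row for row in rows if row[2] >= threshold]


def min_users_per_item(rows, threshold):
    """Keep interactions with items that have at least `threshold` distinct users."""
    users = defaultdict(set)
    for user, item, _ in rows:
        users[item].add(user)
    return [row for row in rows if len(users[row[1]]) >= threshold]


# Rating filter first: item A keeps a single positive interaction, so it is dropped.
rating_first = min_users_per_item(min_rating(interactions, 3), 3)

# Item filter first: item A passes thanks to low ratings that are removed afterwards.
item_first = min_rating(min_users_per_item(interactions, 3), 3)

items_rating_first = sorted({item for _, item, _ in rating_first})
items_item_first = sorted({item for _, item, _ in item_first})
print(items_rating_first)
print(items_item_first)
```

With the rating filter first, only item B survives. With the item filter first, item A is kept even though just one of its interactions is positive.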

In [None]:
from recpack.preprocessing.filters import MinItemsPerUser, MinRating, MinUsersPerItem

In [None]:
dataset.add_filter(MinRating(3, rating_ix="rating"))
dataset.add_filter(MinUsersPerItem(30, item_ix="movieId", user_ix="userId", count_duplicates=False))
dataset.add_filter(MinItemsPerUser(5, item_ix="movieId", user_ix="userId", count_duplicates=False))

In [None]:
# Applies filters, and loads filtered data into an InteractionMatrix
data = dataset.load_interaction_matrix()

In [None]:
original_users = df.userId.nunique()
original_items = df.movieId.nunique()
users, items = data.shape

print(f"We have {users} users and {items} items left")
print(f"preprocessing removed {original_users - users} users")
print(f"preprocessing removed {original_items - items} items")

### Scenario

Scenarios are used to choose how we want to evaluate our algorithms. In literature and practice different scenarios are used for different use cases.

We don't use validation data in this example, as we don't do any parameter optimisation.

As scenario we choose Strong Generalization, so we won't use the timestamp information for now.
Strong Generalization splits the users into two groups. All interactions of the training users are used for training.
The interactions of each test user are split in two: one part is used as history, and the goal of the recommender is to predict the held-out part.

The scenario's parameters control how many users are used for training, and what fraction of a test user's interactions is used as history during prediction.

We will use 70% of users as training data. For prediction we will use 80% of each test user's interactions as history, to predict the remaining 20%.
`validation` is set to False, so we don't generate validation data.
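
The split described above can be sketched in plain Python. This is an illustrative assumption about the general technique, not RecPack's code: the function `strong_generalization_split`, its `seed` argument, and the toy users are hypothetical.

```python
import random


def strong_generalization_split(user_items, frac_users_train, frac_interactions_in, seed=42):
    """Split users into train/test, then split each test user's items into history and held-out."""
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)

    # Users are assigned to exactly one of the two groups.
    n_train = round(len(users) * frac_users_train)
    train_users, test_users = users[:n_train], users[n_train:]
    train = {user: list(user_items[user]) for user in train_users}

    test_in, test_out = {}, {}
    for user in test_users:
        items = list(user_items[user])
        rng.shuffle(items)
        n_in = round(len(items) * frac_interactions_in)
        test_in[user] = items[:n_in]    # history shown to the model
        test_out[user] = items[n_in:]   # held-out items to predict
    return train, test_in, test_out


# 10 hypothetical users with 5 items each.
user_items = {f"u{i}": [f"i{j}" for j in range(5)] for i in range(10)}
train, test_in, test_out = strong_generalization_split(user_items, 0.7, 0.8)
print(len(train), len(test_in), len(test_out))
```

No test user appears in the training data, which is what makes the generalization "strong": the model has to recommend for users it never saw during training.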

In [None]:
from recpack.splitters.scenarios import StrongGeneralization

In [None]:
scenario = StrongGeneralization(frac_users_train=0.7, frac_interactions_in=0.8, validation=False)

In [None]:
scenario.split(data)

### Algorithms

We compare 2 different algorithms:

* Item KNN
* Popularity 

You can also add EASE, but make sure you have enough RAM available: at least 32GB is needed.

Each algorithm has a set of parameters, so in practical settings you would optimise them before comparing algorithms.
Here we don't care as much about optimality, so we just picked values that made some sense.

In [None]:
from recpack.algorithms import ItemKNN, Popularity, EASE

In [None]:
algorithms = [
    ItemKNN(K=200),
    Popularity(),
    # EASE(l2=100),  # uncomment only if enough RAM is available
]

### Metrics
We select a handful of metrics to evaluate the algorithms on:

* CoverageK
* CalibratedRecallK
* NormalizedDiscountedCumulativeGainK
* HitK
* WeightedByInteractionsHitK

As K value we will use 10, as if we recommend a box of 10 items.

We will allow the pipeline to construct the metrics, so we only need their names for now.
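
For intuition about what some of these metrics measure, here is a small sketch using common textbook definitions with binary relevance. The function names are hypothetical and RecPack's own implementations may differ in details, for example in how per-user values are aggregated.

```python
import math


def hits_at_k(recommended, relevant, k):
    """Number of top-k recommended items that are in the held-out set."""
    return sum(1 for item in recommended[:k] if item in relevant)


def calibrated_recall_at_k(recommended, relevant, k):
    """Hits divided by the best achievable number of hits, min(k, |relevant|)."""
    return hits_at_k(recommended, relevant, k) / min(k, len(relevant))


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the hits, normalised by the gain of an ideal ranking."""
    dcg = sum(
        1 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / idcg


def coverage_at_k(recommendations_per_user, n_items, k):
    """Fraction of the catalogue that appears in at least one user's top-k."""
    recommended = {item for recs in recommendations_per_user for item in recs[:k]}
    return len(recommended) / n_items


# One hypothetical user: ranked recommendations and held-out items.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}

print(hits_at_k(recommended, relevant, 3))
print(calibrated_recall_at_k(recommended, relevant, 3))
print(ndcg_at_k(recommended, relevant, 3))
print(coverage_at_k([["a", "b"], ["b", "c"]], n_items=10, k=2))
```

Hit-based metrics only ask whether relevant items are in the top K, while NDCG also rewards placing them higher in the list. Coverage looks at the recommendations of all users together rather than at a single user.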

In [None]:
metrics = [
    'CoverageK',
    'CalibratedRecallK',
    'NormalizedDiscountedCumulativeGainK',
    'HitK',
    'WeightedByInteractionsHitK'
]

K_values = [10]

### Pipeline

We'll use a pipeline to do the heavy lifting for us.

We provide it with the algorithms to compare, the metrics to compare on, and the K values to compute at.

We call `run` with the training data and the tuple of test data. The pipeline trains all models and evaluates them on the test data.

To get the results we use `pipeline.get()`, which returns a nested dict of the form <metric_name, <algorithm, score>>.

For easy reading, we render it using pandas.
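
As a sketch of the nested structure described above, the snippet below regroups a dict of that shape by algorithm. The keys `metric_a`, `algo_1`, and so on, the helper `by_algorithm`, and all scores are placeholders, not actual pipeline output.

```python
# Placeholder dict of the form <metric_name, <algorithm, score>>.
# The scores are dummy values, not results of any experiment.
results = {
    "metric_a": {"algo_1": 0.1, "algo_2": 0.2},
    "metric_b": {"algo_1": 0.3, "algo_2": 0.4},
}


def by_algorithm(nested):
    """Regroup <metric, <algorithm, score>> into <algorithm, <metric, score>>."""
    regrouped = {}
    for metric, scores in nested.items():
        for algorithm, score in scores.items():
            regrouped.setdefault(algorithm, {})[metric] = score
    return regrouped


per_algorithm = by_algorithm(results)
for algorithm, scores in per_algorithm.items():
    print(algorithm, scores)
```

`pd.DataFrame.from_dict` with its default `orient="columns"` gives a similar view: the outer keys become the columns and the inner keys become the index, so a dict of this shape renders with one row per algorithm and one column per metric.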

In [None]:
from recpack.pipeline import Pipeline
import pandas as pd

In [None]:
pipeline = Pipeline(algorithms, metrics, K_values=K_values)

In [None]:
# Run the pipeline, this can take a while.
# Go grab a coffee, and enjoy :D
pipeline.run(scenario.training_data, scenario.test_data)

In [None]:
pd.DataFrame.from_dict(pipeline.get())