# Demo Recpack

This is an end-to-end demo of Recpack functionality.

### Dataset

We use the MovieLens 25M dataset, which contains user-item rating tuples together with timestamps.

In [None]:
from recpack.data.datasets import MovieLens25M

In [None]:
# preprocess_default is set to False so that we can demonstrate adding filters to a dataset
dataset = MovieLens25M("data/ml25.csv", preprocess_default=False)
# Download the dataset if not present in the data module
dataset.fetch_dataset()

In [None]:
df = dataset.load_dataframe()

In [None]:
df.nunique()

### Preprocessing

During preprocessing we add filters to the dataset object:

* MinRating = 3 -> A rating of 3 or higher is considered a positive interaction.
* MinUsersPerItem = 30 -> Keeps only the most frequently interacted-with items; otherwise the demo needs too much RAM.
* MinItemsPerUser = 5 -> Keeps only users with enough interactions to allow for prediction.

The order in which the filters are added is important, because the data is passed through them in that order.

So the rating filter is applied first, then the users-per-item filter, and finally the items-per-user filter.
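To see why the order matters, here is a minimal sketch in plain Python. It does not use Recpack's filter classes; the helper functions `min_rating` and `min_users_per_item` and the toy data are hypothetical stand-ins that only mimic the idea of each filter.

```python
from collections import defaultdict

# Toy (user, item, rating) tuples; purely illustrative data.
interactions = [
    ("u1", "a", 5), ("u1", "b", 4),
    ("u2", "a", 4), ("u2", "b", 2),
    ("u3", "a", 1), ("u3", "c", 5),
]


def min_rating(rows, threshold):
    """Keep rows whose rating is at least `threshold`."""
    return [row for row in rows if row[2] >= threshold]


def min_users_per_item(rows, threshold):
    """Keep rows whose item has at least `threshold` distinct users."""
    users_per_item = defaultdict(set)
    for user, item, _ in rows:
        users_per_item[item].add(user)
    return [row for row in rows if len(users_per_item[row[1]]) >= threshold]


# Rating filter first, then the item filter (the order used in this demo).
rating_first = min_users_per_item(min_rating(interactions, 3), 2)

# The same two filters applied in the opposite order.
count_first = min_rating(min_users_per_item(interactions, 2), 3)

print(rating_first)
print(count_first)
```

With the rating filter first, low ratings no longer count towards an item's user total, so item `b` is dropped; in the opposite order it survives. The two results therefore differ.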

In [None]:
from recpack.preprocessing.filters import MinItemsPerUser, MinRating, MinUsersPerItem

In [None]:
dataset.add_filter(MinRating(3, rating_ix="rating"))
dataset.add_filter(MinUsersPerItem(30, item_ix="movieId", user_ix="userId", count_duplicates=False))
dataset.add_filter(MinItemsPerUser(5, item_ix="movieId", user_ix="userId", count_duplicates=False))

In [None]:
# Load our data into an InteractionMatrix
# This will apply the preprocessing filters as well
data = dataset.load_interaction_matrix()

In [None]:
original_users = df.userId.nunique()
original_items = df.movieId.nunique()
users, items = data.shape

print(f"We have {users} users and {items} items left")
print(f"Preprocessing removed {original_users - users} users")
print(f"Preprocessing removed {original_items - items} items")

### Scenario

We'll choose a scenario to generate training data and test data.
We'll skip the validation data, because we won't do parameter optimisation in this example.

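We use the Strong Generalization scenario: users are split into disjoint training and test groups, and each test user's interactions are split into a part used as input and a part held out for evaluation. This scenario does not rely on timestamps, so we won't use the timestamp information for now.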
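The idea behind a strong generalization split can be sketched in plain Python as follows. This is an illustration under assumptions, not Recpack's implementation: the function `strong_generalization_split` is hypothetical, its parameter names merely mirror `frac_users_train` and `frac_interactions_in` used in this demo, and Recpack may sample users and interactions differently.

```python
import random


def strong_generalization_split(user_items, frac_users_train, frac_interactions_in, seed=0):
    """Split {user: [items]} into training users and fold-in / held-out test data.

    Training and test users are disjoint. For every test user, a fraction
    of the interactions is kept as model input and the rest is held out.
    """
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)

    n_train = round(len(users) * frac_users_train)
    train = {u: list(user_items[u]) for u in users[:n_train]}

    test_in, test_out = {}, {}
    for u in users[n_train:]:
        items = list(user_items[u])
        rng.shuffle(items)
        n_in = round(len(items) * frac_interactions_in)
        test_in[u], test_out[u] = items[:n_in], items[n_in:]
    return train, test_in, test_out


# Ten toy users with five interactions each.
toy = {f"u{i}": [f"i{j}" for j in range(5)] for i in range(10)}
train, test_in, test_out = strong_generalization_split(toy, 0.7, 0.8)

print(len(train), "training users,", len(test_in), "test users")
```

A model is fitted on `train`, receives `test_in` as the history of users it has never seen, and is scored on how well it recovers `test_out`.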

In [None]:
from recpack.splitters.scenarios import StrongGeneralization

In [None]:
scenario = StrongGeneralization(frac_users_train=0.7, frac_interactions_in=0.8, validation=False)

In [None]:
scenario.split(data)

### Algorithms

We will evaluate two algorithms:

* Item KNN
* Popularity

You can also add EASE, but make sure enough RAM is available: at least 32 GB is needed.
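As a rough illustration of the simpler baseline, a popularity recommender can be sketched in plain Python. This is a textbook version under assumptions, not Recpack's `Popularity` class: the helpers `fit_popularity` and `recommend_top_k` are hypothetical, and details such as how duplicates or already-seen items are handled may differ in the library.

```python
from collections import Counter


def fit_popularity(interactions):
    """Count how often each item occurs in (user, item) pairs."""
    return Counter(item for _, item in interactions)


def recommend_top_k(popularity, k):
    """Return the k most frequently interacted-with items, the same for every user."""
    return [item for item, _ in popularity.most_common(k)]


# Toy (user, item) pairs; purely illustrative data.
toy_interactions = [
    ("u1", "a"), ("u2", "a"), ("u3", "a"),
    ("u1", "b"), ("u2", "b"),
    ("u3", "c"),
]

popularity = fit_popularity(toy_interactions)
top_2 = recommend_top_k(popularity, 2)
print(top_2)
```

Because every user receives the same list, a popularity baseline is a useful reference point for personalised algorithms such as Item KNN.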

In [None]:
from recpack.algorithms import ItemKNN, Popularity, EASE

In [None]:
algorithms = [
    ItemKNN(K=200),
    Popularity(),
    # EASE(l2=100),  # uncomment only if enough RAM is available
]

### Metrics
We select several metrics to evaluate:

* CoverageK
* CalibratedRecallK
* NDCGK
* HitK
* WeightedHitK

We use K = 10, as if we were recommending a box of 10 items.

The pipeline constructs the metrics for us, so for now we only need their names.
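To give a feel for what some of these top-K metrics measure, here is a sketch for a single user using common textbook definitions. The functions `hit_k`, `calibrated_recall_k` and `ndcg_k` are hypothetical helpers written for this illustration; Recpack's own metric classes may differ in details such as aggregation over users.

```python
import math


def hit_k(ranked, relevant, k):
    """Number of relevant items among the top-k recommendations."""
    return sum(1 for item in ranked[:k] if item in relevant)


def calibrated_recall_k(ranked, relevant, k):
    """Hits divided by the best achievable number of hits, min(k, |relevant|).

    Assumes `relevant` is non-empty.
    """
    return hit_k(ranked, relevant, k) / min(k, len(relevant))


def ndcg_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal discounted gain.

    Assumes `relevant` is non-empty.
    """
    dcg = sum(
        1 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1 / math.log2(position + 2) for position in range(min(k, len(relevant))))
    return dcg / ideal


# One user's ranked recommendations and held-out items (toy data).
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}

print(hit_k(ranked, relevant, 3))
print(calibrated_recall_k(ranked, relevant, 3))
print(ndcg_k(ranked, relevant, 3))
```

Hit counts ignore where in the top K a relevant item appears, whereas NDCG rewards placing relevant items closer to the top of the list.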

In [None]:
metrics = [
    'CoverageK',
    'CalibratedRecallK',
    'NDCGK',
    'HitK',
    'WeightedHitK'
]

K_values = [10]

### Pipeline

We'll use a pipeline to do the heavy lifting for us: it trains each algorithm on the training data and computes each metric on the test data.

In [None]:
from recpack.pipeline import Pipeline
import pandas as pd

In [None]:
pipeline = Pipeline(algorithms, metrics, K_values=K_values)

In [None]:
pipeline.run(scenario.training_data, scenario.test_data)

In [None]:
pd.DataFrame.from_dict(pipeline.get())