## School project - 5MLRE
The following notebook was created for a school project to create an anime recommendation system. The subject and the questions are available in the appendix.

The group members who participated in this project are:
- AMIMI Lamine
- BEZIN Théo
- LECOMTE Alexis
- PAWLOWSKI Maxence

### Main index
1. Data analysis
2. **Collaborative filtering (you are here)**
3. Content-based filtering
4. _Appendix_

# 2 - Collaborative filtering
In the previous notebook, we loaded, cleaned and studied the [MyAnimeList](https://myanimelist.net/) datasets. Now that we know them better, we will start to create the recommendation system using collaborative filtering. Collaborative filtering is a technique that filters out items that a user might like based on feedback from similar users. There are two sub-techniques: User-based collaborative filtering and article-based collaborative filtering.

### Index
<ol type="A">
  <li>Notebook initialization</li>
  <li>Collaborative filtering: unfiltered training</li>
  <li>Collaborative filtering: filtered training</li>
  <li>Conclusion of the collaborative filtering</li>
</ol>

## A - Notebook initialization
### A.1 - Imports

In [1]:
# OS and filesystem
import os
import sys
from pathlib import Path

# Data
import pandas
from matplotlib import pyplot
import matplotx

# Model processing
import surprise

# Misc.
from ast import literal_eval

# Local files
sys.path.append(os.path.join(os.pardir, os.pardir))
import helpers

### A.2 - Package initialization

In [2]:
pyplot.rcParams.update(pyplot.rcParamsDefault)
pyplot.style.use(matplotx.styles.dracula)  # Set the matplotlib style

### A.3 - Constants

In [3]:
# Filesystem paths
PARENT_FOLDER = Path.cwd()
DATA_FOLDER = (PARENT_FOLDER / ".." / ".." / "data").resolve()
MODELS_FOLDER = (PARENT_FOLDER / ".." / ".." / "models").resolve()
TEMP_FOLDER = (PARENT_FOLDER / ".." / ".." / "temp").resolve()

# Plots
FIG_SIZE = (12, 7)

# Misc.
RANDOM_STATE = 2077

### A.4 - Datasets loading

In [4]:
# data_reader = surprise.Reader(line_format="user item rating", sep=",", rating_scale=(1, 10), skip_lines=1)
# data = surprise.Dataset.load_from_file(file_path=(DATA_FOLDER / "rating_cleaned.csv"), reader=data_reader)

# Load a smaller sample of the dataset instead of the 8M rows
data_cleaned = pandas.read_csv((DATA_FOLDER / "rating_cleaned.csv"))
data_filtered = data_cleaned[data_cleaned["rating"] >= 0.0]
data_shortened = data_filtered.sample(n=35_000)

data_reader = surprise.Reader(rating_scale=(1, 10))
data = surprise.Dataset.load_from_df(df=data_shortened[["user_id", "anime_id", "rating"]], reader=data_reader)

# B - Collaborative filtering: unfiltered training
expliquer collaborative filtering
expliquer diff. user-based/item-based

## B.2 - Slope One
expliquer slope one

In [5]:
model_slope_one, _, top_n_slope_one = helpers.ml.run_model(
    name="Slope One",
    model=surprise.SlopeOne,
    dataset=data,
    hyper_params=None,
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "Slope One".[39m
[0m
[1mBest params:[22m [2m[37m{}[0m
[1mRMSE:[22m [ train = 0.0426 | test = 1.7929 ]
[1mMAE:[22m [ train = 0.0029 | test = 1.3266 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 91.963143%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	50.000000%
2.0	76.923077%
3.0	75.806452%
4.0	84.000000%
5.0	84.866469%
6.0	87.604291%
7.0	90.574713%
8.0	94.204981%
9.0	94.515050%
10.0	94.833948%
[1mCumulative hit rate (min_rating=4.0):[22m 92.241827%
[1mAverage reciprocal hit rate:[22m 0.004894340707565335
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 99.705657%

Testing of the "Slope One" model successfully completed in 0:03:04.171860.
Grid search: N/A
Training and testing: 0:02:22.854174


## B.2 - KNN Basic
expliquer KNN Basic

In [6]:
model_knn_basic, _, top_n_knn_basic = helpers.ml.run_model(
    name="KNN Basic",
    model=surprise.KNNBasic,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "KNN Basic".[39m
Running GridSearchCV...[37m[2m
Computing the cosine similarity matrix...
Done computing similarity matrix.
[0m
[1mBest params:[22m [2m[37m{'k': 20, 'min_k': 2, 'sim_options': {'name': 'cosine', 'user_based': False}}[0m
[1mRMSE:[22m [ train = 1.2719 | test = 1.5596 ]
[1mMAE:[22m [ train = 0.9630 | test = 1.2457 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 92.948554%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	66.666667%
2.0	88.461538%
3.0	79.032258%
4.0	85.600000%
5.0	87.240356%
6.0	89.034565%
7.0	91.436782%
8.0	94.827586%
9.0	95.317726%
10.0	95.479705%
[1mCumulative hit rate (min_rating=4.0):[22m 93.137001%
[1mAverage reciprocal hit rate:[22m 0.00461090785682634
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 100.000000%

Testing of the "KNN Basic" model successfully completed in 1:07:14.096534.
Grid search: 1:02:34.144426
Training and testing: 0:03:59.709877


## B.3 - KNN With Means
expliquer KNN With Means

In [7]:
model_knn_with_means, _, top_n_knn_with_means = helpers.ml.run_model(
    name="KNN With Means",
    model=surprise.KNNWithMeans,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "KNN With Means".[39m
Running GridSearchCV...[37m[2m
Computing the cosine similarity matrix...
Done computing similarity matrix.
[0m
[1mBest params:[22m [2m[37m{'k': 20, 'min_k': 2, 'sim_options': {'name': 'cosine', 'user_based': False}}[0m
[1mRMSE:[22m [ train = 1.0087 | test = 1.5933 ]
[1mMAE:[22m [ train = 0.7474 | test = 1.2301 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 92.679805%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	61.111111%
2.0	84.615385%
3.0	79.032258%
4.0	85.600000%
5.0	85.756677%
6.0	88.557807%
7.0	91.149425%
8.0	94.683908%
9.0	95.250836%
10.0	95.387454%
[1mCumulative hit rate (min_rating=4.0):[22m 92.890503%
[1mAverage reciprocal hit rate:[22m 0.001670684996231342
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 100.000000%

Testing of the "KNN With Means" model successfully completed in 1:09:05.540631.
Grid search: 1:04:04.595112
Training and testing: 0:04:05.852281


## B.4 - KNN With Z-Score
expliquer KNN With Z-Score

In [8]:
model_knn_with_z_score, _, top_n_knn_with_z_score = helpers.ml.run_model(
    name="KNN With Z-Score",
    model=surprise.KNNWithZScore,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "KNN With Z-Score".[39m
Running GridSearchCV...[37m[2m
Computing the cosine similarity matrix...
Done computing similarity matrix.
[0m
[1mBest params:[22m [2m[37m{'k': 20, 'min_k': 2, 'sim_options': {'name': 'cosine', 'user_based': False}}[0m
[1mRMSE:[22m [ train = 1.0080 | test = 1.5930 ]
[1mMAE:[22m [ train = 0.7437 | test = 1.2299 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 92.679805%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	61.111111%
2.0	84.615385%
3.0	79.032258%
4.0	85.600000%
5.0	85.756677%
6.0	88.557807%
7.0	91.149425%
8.0	94.683908%
9.0	95.250836%
10.0	95.387454%
[1mCumulative hit rate (min_rating=4.0):[22m 92.890503%
[1mAverage reciprocal hit rate:[22m 0.001669408327987507
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 100.000000%

Testing of the "KNN With Z-Score" model successfully completed in 1:19:26.439125.
Grid search: 1:13:43.924272
Training and testing: 0:04:44.935616


## B.5 - KNN Baseline
expliquer KNN Baseline

In [9]:
model_knn_baseline, _, top_n_knn_baseline = helpers.ml.run_model(
    name="KNN Baseline",
    model=surprise.KNNBaseline,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        },
        "bsl_options": {
            "method": ["als"],
            "n_epochs": [5, 10, 15],
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "KNN Baseline".[39m
Running GridSearchCV...[37m[2m
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
[0m
[1mBest params:[22m [2m[37m{'k': 20, 'min_k': 2, 'sim_options': {'name': 'cosine', 'user_based': False}, 'bsl_options': {'method': 'als', 'n_epochs': 15}}[0m
[1mRMSE:[22m [ train = 1.1064 | test = 1.4577 ]
[1mMAE:[22m [ train = 0.8375 | test = 1.1384 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 92.948554%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	66.666667%
2.0	88.461538%
3.0	79.032258%
4.0	85.600000%
5.0	87.240356%
6.0	89.034565%
7.0	91.436782%
8.0	94.827586%
9.0	95.317726%
10.0	95.479705%
[1mCumulative hit rate (min_rating=4.0):[22m 93.137001%
[1mAverage reciprocal hit rate:[22m 0.011540893716711565
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 100.000000%

Testing of the "KNN Baseline" model successfully completed in 3:51:05.265563.
Grid search

## B.6 - Non-negative Matrix Factorization
expliquer Non-negative Matrix Factorization

In [10]:
model_nmf, _, top_n_nmf = helpers.ml.run_model(
    name="Non-negative Matrix Factorization",
    model=surprise.NMF,
    dataset=data,
    hyper_params={
        "n_factors": [5, 15, 25],
        "n_epochs": [25, 50, 75],
        "biased": [True, False]
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "Non-negative Matrix Factorization".[39m
Running GridSearchCV...[37m[2m
[0m
[1mBest params:[22m [2m[37m{'n_factors': 5, 'n_epochs': 25, 'biased': True}[0m
[1mRMSE:[22m [ train = 0.8608 | test = 1.4859 ]
[1mMAE:[22m [ train = 0.5341 | test = 1.1478 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 92.948554%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	66.666667%
2.0	88.461538%
3.0	79.032258%
4.0	85.600000%
5.0	87.240356%
6.0	89.034565%
7.0	91.436782%
8.0	94.827586%
9.0	95.317726%
10.0	95.479705%
[1mCumulative hit rate (min_rating=4.0):[22m 93.137001%
[1mAverage reciprocal hit rate:[22m 0.0035244116463346293
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 100.000000%

Testing of the "Non-negative Matrix Factorization" model successfully completed in 0:30:26.801538.
Grid search: 0:05:46.304963
Training and testing: 0:23:37.449606


## B.7 - Co-clustering
expliquer Co-clustering

In [11]:
model_co_clustering, _, top_n_co_clustering = helpers.ml.run_model(
    name="Co-clustering",
    model=surprise.CoClustering,
    dataset=data,
    hyper_params={
        "n_cltr_u": [1, 3, 5],
        "n_cltr_i": [1, 3, 5],
        "n_epochs": [10, 20, 30],
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

[32mTesting "Co-clustering".[39m
Running GridSearchCV...[37m[2m
[0m
[1mBest params:[22m [2m[37m{'n_cltr_u': 5, 'n_cltr_i': 3, 'n_epochs': 30}[0m
[1mRMSE:[22m [ train = 0.7790 | test = 1.7352 ]
[1mMAE:[22m [ train = 0.5423 | test = 1.3126 ]

Built top-N for each user (n=10, min_rating=4.0)
[1mHit rate:[22m 91.118505%
[1mHit rate per rating value:[22m
Rating	Hit rate
1.0	44.444444%
2.0	65.384615%
3.0	77.419355%
4.0	80.800000%
5.0	81.602374%
6.0	85.578069%
7.0	89.540230%
8.0	93.726054%
9.0	94.448161%
10.0	94.649446%
[1mCumulative hit rate (min_rating=4.0):[22m 91.424494%
[1mAverage reciprocal hit rate:[22m 0.002823389679550613
[1mUser coverage (num_users=7814, min_rating=4.0):[22m 99.833632%

Testing of the "Co-clustering" model successfully completed in 1:00:47.418472.
Grid search: 0:14:04.900818
Training and testing: 0:45:23.661188


## B.8 - Comparing performance
**TODO**

## B.9 - Getting the Top-N
**TODO**