## School project - 5MLRE
This notebook is part of a school project to build an anime recommendation system. The subject and the questions are available in the appendix.

The group members who participated in this project are:
- AMIMI Lamine
- BEZIN Théo
- LECOMTE Alexis
- PAWLOWSKI Maxence

### Main index
1. Data analysis
2. **Collaborative filtering (you are here)**
3. Content-based filtering
4. _Appendix_

# 2 - Collaborative filtering
In the previous notebook, we loaded, cleaned and studied the [MyAnimeList](https://myanimelist.net/) datasets. Now that we know them better, we can start building the recommendation system using collaborative filtering. Collaborative filtering recommends items that a user might like based on the feedback of users with similar tastes. There are two variants: user-based collaborative filtering and item-based collaborative filtering.

### Index
<ol type="A">
  <li>Notebook initialization</li>
  <li>Collaborative filtering: unfiltered training</li>
  <li>Collaborative filtering: filtered training</li>
  <li>Conclusion of the collaborative filtering</li>
</ol>

## A - Notebook initialization
### A.1 - Imports

In [1]:
# OS and filesystem
import os
import sys
from pathlib import Path

# Data
import pandas
from matplotlib import pyplot
import matplotx

# Model processing
import surprise

# Misc.
from ast import literal_eval

# Local files
sys.path.append(os.path.join(os.pardir, os.pardir))
import helpers

### A.2 - Package initialization

In [2]:
pyplot.rcParams.update(pyplot.rcParamsDefault)
pyplot.style.use(matplotx.styles.dracula)  # Set the matplotlib style

### A.3 - Constants

In [3]:
# Filesystem paths
PARENT_FOLDER = Path.cwd()
DATA_FOLDER = (PARENT_FOLDER / ".." / ".." / "data").resolve()
MODELS_FOLDER = (PARENT_FOLDER / ".." / ".." / "models").resolve()
TEMP_FOLDER = (PARENT_FOLDER / ".." / ".." / "temp").resolve()

# Plots
FIG_SIZE = (12, 7)

# Misc.
RANDOM_STATE = 2077

### A.4 - Datasets loading

In [4]:
# data_reader = surprise.Reader(line_format="user item rating", sep=",", rating_scale=(1, 10), skip_lines=1)
# data = surprise.Dataset.load_from_file(file_path=(DATA_FOLDER / "rating_cleaned.csv"), reader=data_reader)

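# Load a smaller sample of the dataset instead of the 8M rows
data_cleaned = pandas.read_csv(DATA_FOLDER / "rating_cleaned.csv")
# Drop the rows that carry a negative rating value
data_filtered = data_cleaned[data_cleaned["rating"] >= 0.0]
# Seed the sampling so that the notebook gives the same sample on every run
data_shortened = data_filtered.sample(n=35_000, random_state=RANDOM_STATE)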

data_reader = surprise.Reader(rating_scale=(1, 10))
data = surprise.Dataset.load_from_df(df=data_shortened[["user_id", "anime_id", "rating"]], reader=data_reader)

# B - Collaborative filtering: unfiltered training
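Collaborative filtering predicts how a user would rate an anime from the ratings that other users have already given, without using any information about the anime itself. The user-based variant looks for users whose past ratings resemble those of the target user and relies on their ratings of the item. The item-based variant looks for items that received ratings similar to those of the target item and relies on the target user's ratings of those items.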
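
To make the difference between the two variants concrete, here is a minimal sketch in plain Python. The ratings, the user names and the helpers `cosine` and `transpose` are made up for illustration and are not part of the project code. The user-based variant compares the rows of the rating matrix, the item-based variant compares its columns.

```python
from math import sqrt

# Toy ratings (made up for illustration): user -> {anime -> rating}
ratings = {
    "alice": {"Naruto": 8, "Bleach": 7, "Monster": 3},
    "bob": {"Naruto": 9, "Bleach": 8, "Monster": 2},
    "carol": {"Naruto": 2, "Bleach": 3, "Monster": 9},
}


def cosine(a, b):
    """Cosine similarity of two rating dicts, restricted to their common keys."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[key] * b[key] for key in common)
    norm_a = sqrt(sum(a[key] ** 2 for key in common))
    norm_b = sqrt(sum(b[key] ** 2 for key in common))
    return dot / (norm_a * norm_b)


def transpose(user_ratings):
    """Turn user -> {item -> rating} into item -> {user -> rating}."""
    by_item = {}
    for user, items in user_ratings.items():
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item


# User-based: how alike are two users, judged on the items both rated?
user_sim = cosine(ratings["alice"], ratings["bob"])

# Item-based: how alike are two items, judged on the users who rated both?
by_item = transpose(ratings)
item_sim = cosine(by_item["Naruto"], by_item["Bleach"])
```

In this toy data, "alice" is closer to "bob" than to "carol", and "Naruto" is closer to "Bleach" than to "Monster".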

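## B.1 - Slope One
Slope One is an item-based algorithm. For each pair of items, it computes the average difference between the ratings given by the users who rated both. To predict a rating, it combines these average differences with the ratings the user has already given. It has no hyperparameters to tune, which is why `hyper_params` is `None` in the cell below.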
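
The idea behind Slope One can be sketched in a few lines of plain Python. The ratings and the helpers `deviation` and `predict_slope_one` are hypothetical. The prediction uses the "user mean plus average deviation" form, which is, as far as we understand it, the one described in the Surprise documentation.

```python
# Toy ratings (made up for illustration): user -> {item -> rating}
ratings = {
    "u1": {"A": 5, "B": 3, "C": 2},
    "u2": {"A": 3, "B": 4},
    "u3": {"B": 2, "C": 5},
}


def deviation(ratings, i, j):
    """Average of (rating of i - rating of j) over the users who rated both, or None."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    return sum(diffs) / len(diffs) if diffs else None


def predict_slope_one(ratings, user, item):
    """User mean plus the average deviation of `item` against the items the user rated."""
    user_ratings = ratings[user]
    user_mean = sum(user_ratings.values()) / len(user_ratings)
    devs = [deviation(ratings, item, j) for j in user_ratings if j != item]
    devs = [d for d in devs if d is not None]  # keep only pairs with a common user
    if not devs:
        return user_mean
    return user_mean + sum(devs) / len(devs)


# "u2" has not rated "C": mean 3.5, deviations dev(C, A) = -3 and dev(C, B) = 1
prediction = predict_slope_one(ratings, "u2", "C")
```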

In [5]:
model_slope_one, _, top_n_slope_one = helpers.ml.run_model(
    name="Slope One",
    model=surprise.SlopeOne,
    dataset=data,
    hyper_params=None,
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

Testing "Slope One".

Best params: {}
RMSE: [ train = 0.0000 | test = 1.8406 ]
MAE: [ train = 0.0000 | test = 1.5102 ]

Built top-N for each user (n=10, min_rating=4.0)
Hit rate: 0.000000%
Hit rate per rating value:
Rating	Hit rate
Cumulative hit rate (min_rating=4.0): 0.000000%
Average reciprocal hit rate: 0.0
User coverage (num_users=7, min_rating=4.0): 100.000000%


ValueError: Item 8795 is not part of the trainset.

## B.2 - KNN Basic
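KNN Basic predicts a rating as the similarity-weighted average of the ratings given by the `k` nearest neighbours. Depending on `user_based`, the neighbours are either similar users who rated the item or similar items rated by the user. `min_k` is the minimum number of neighbours required to make a prediction, and `name` selects the similarity measure.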
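
The following plain-Python sketch shows a user-based KNN Basic prediction, assuming the similarities are already known. The data, the `sims` dictionary and the function `predict_knn_basic` are hypothetical. Returning `None` when there are too few neighbours is a simplification of this sketch.

```python
import heapq


def predict_knn_basic(ratings, sims, user, item, k=2, min_k=1):
    """Similarity-weighted average of the ratings given to `item` by the k nearest users."""
    # Candidate neighbours: other users who rated the item, with a positive similarity
    candidates = [
        (sims[user][other], user_ratings[item])
        for other, user_ratings in ratings.items()
        if other != user and item in user_ratings and sims[user].get(other, 0.0) > 0
    ]
    nearest = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    if len(nearest) < min_k:
        return None  # too few neighbours to predict
    total_weight = sum(sim for sim, _ in nearest)
    return sum(sim * rating for sim, rating in nearest) / total_weight


# Toy data (made up): "me" has not rated anime "A" yet
ratings = {"me": {}, "u1": {"A": 8}, "u2": {"A": 6}, "u3": {"A": 2}}
sims = {"me": {"u1": 0.9, "u2": 0.6, "u3": 0.1}}

# With k=2, only "u1" and "u2" take part: (0.9 * 8 + 0.6 * 6) / (0.9 + 0.6)
prediction = predict_knn_basic(ratings, sims, "me", "A", k=2)
```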

In [None]:
model_knn_basic, _, top_n_knn_basic = helpers.ml.run_model(
    name="KNN Basic",
    model=surprise.KNNBasic,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

## B.3 - KNN With Means
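KNN With Means extends KNN Basic by taking the mean rating of each user (or item) into account. Each neighbour contributes its deviation from its own mean, and the weighted deviation is added to the mean of the target user (or item). This compensates for users who rate systematically high or low.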
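
As a minimal sketch of the mean-centring step (user-based, plain Python), the example below uses made-up ratings and similarities. The names `mean` and `predict_knn_with_means` are hypothetical.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def predict_knn_with_means(ratings, sims, user, item, k=2):
    """User mean plus the similarity-weighted average of the neighbours' centred ratings."""
    user_mean = mean(ratings[user].values())
    # For each neighbour, keep (similarity, rating of the item minus the neighbour's mean)
    candidates = [
        (sims[user][other], r[item] - mean(r.values()))
        for other, r in ratings.items()
        if other != user and item in r and sims[user].get(other, 0.0) > 0
    ]
    nearest = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    if not nearest:
        return user_mean
    total_weight = sum(sim for sim, _ in nearest)
    return user_mean + sum(sim * offset for sim, offset in nearest) / total_weight


# Toy data (made up): "u1" rates high on average, "u2" rates low
ratings = {
    "me": {"A": 4, "B": 6},
    "u1": {"A": 8, "B": 10, "C": 9},
    "u2": {"A": 2, "B": 4, "C": 6},
}
sims = {"me": {"u1": 0.5, "u2": 0.5}}

# "u1" gave "C" exactly its mean (offset 0), "u2" gave it 2 points above its mean
prediction = predict_knn_with_means(ratings, sims, "me", "C")
```

A plain weighted average of the raw ratings would give 7.5 here. Centring on the means gives 6.0, because only the deviations of the neighbours are transferred to the target user.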

In [None]:
model_knn_with_means, _, top_n_knn_with_means = helpers.ml.run_model(
    name="KNN With Means",
    model=surprise.KNNWithMeans,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

## B.4 - KNN With Z-Score
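KNN With Z-Score goes one step further than KNN With Means. Each centred rating is also divided by the standard deviation of the ratings of the user (or item), so that users who use a narrow range of the scale and users who use a wide one become comparable.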
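
The z-score normalisation can be sketched as follows (user-based, plain Python). The data and the names `z_stats` and `predict_knn_z_score` are hypothetical, and replacing a zero standard deviation by 1.0 is a choice made for this sketch only.

```python
from statistics import mean, pstdev


def z_stats(user_ratings):
    """Mean and population standard deviation of one user's ratings.

    A zero deviation is replaced by 1.0 (sketch-only choice) to avoid dividing by zero.
    """
    values = list(user_ratings.values())
    return mean(values), (pstdev(values) or 1.0)


def predict_knn_z_score(ratings, sims, user, item, k=2):
    """User mean plus user deviation times the weighted average of neighbours' z-scores."""
    mu_u, sigma_u = z_stats(ratings[user])
    scored = []
    for other, r in ratings.items():
        sim = sims[user].get(other, 0.0)
        if other == user or item not in r or sim <= 0:
            continue
        mu_v, sigma_v = z_stats(r)
        scored.append((sim, (r[item] - mu_v) / sigma_v))
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    if not nearest:
        return mu_u
    z = sum(sim * z_v for sim, z_v in nearest) / sum(sim for sim, _ in nearest)
    return mu_u + sigma_u * z


# Toy data (made up): "u1" spreads its ratings more widely than "u2"
ratings = {
    "me": {"A": 4, "B": 6},
    "u1": {"A": 6, "C": 10},
    "u2": {"B": 5, "C": 3},
}
sims = {"me": {"u1": 0.75, "u2": 0.25}}

# z-score of "C" is +1 for "u1" and -1 for "u2", so the weighted z-score is 0.5
prediction = predict_knn_z_score(ratings, sims, "me", "C")
```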

In [None]:
model_knn_with_z_score, _, top_n_knn_with_z_score = helpers.ml.run_model(
    name="KNN With Z-Score",
    model=surprise.KNNWithZScore,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

## B.5 - KNN Baseline
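KNN Baseline replaces the mean with a baseline estimate, which combines the global mean rating with a bias for the user and a bias for the item. The `bsl_options` control how these biases are estimated, here with alternating least squares (`als`) over several values of `n_epochs`.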
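
The baseline part can be illustrated separately from the neighbourhood part. The sketch below estimates user and item biases by alternating between the two, in plain Python. The function names, the toy ratings and the regularisation values `reg_u` and `reg_i` are assumptions made for this example, not the settings used by the project or a description of the library internals.

```python
def fit_baselines(triples, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Alternately re-estimate item and user biases around the global mean."""
    global_mean = sum(r for _, _, r in triples) / len(triples)
    b_u = {u: 0.0 for u, _, _ in triples}
    b_i = {i: 0.0 for _, i, _ in triples}
    for _ in range(n_epochs):
        # Item step: user biases are held fixed
        for item in b_i:
            residuals = [r - global_mean - b_u[u] for u, i, r in triples if i == item]
            b_i[item] = sum(residuals) / (reg_i + len(residuals))
        # User step: item biases are held fixed
        for user in b_u:
            residuals = [r - global_mean - b_i[i] for u, i, r in triples if u == user]
            b_u[user] = sum(residuals) / (reg_u + len(residuals))
    return global_mean, b_u, b_i


def baseline(global_mean, b_u, b_i, user, item):
    """Baseline estimate: global mean + user bias + item bias (0 for unknown ids)."""
    return global_mean + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Toy (user, item, rating) triples, made up: "u1" rates higher, "A" is rated higher
triples = [("u1", "A", 9), ("u1", "B", 7), ("u2", "A", 6), ("u2", "B", 4)]
global_mean, b_u, b_i = fit_baselines(triples)
estimate = baseline(global_mean, b_u, b_i, "u1", "A")
```

KNN Baseline then applies the same neighbourhood formula as KNN With Means, with this baseline estimate in place of the mean.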

In [None]:
model_knn_baseline, _, top_n_knn_baseline = helpers.ml.run_model(
    name="KNN Baseline",
    model=surprise.KNNBaseline,
    dataset=data,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        },
        "bsl_options": {
            "method": ["als"],
            "n_epochs": [5, 10, 15],
        }
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

## B.6 - Non-negative Matrix Factorization
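Non-negative Matrix Factorization (NMF) approximates the rating matrix by the product of two smaller matrices, one of user factors and one of item factors, all constrained to be non-negative. A rating is predicted as the dot product of the factor vectors of the user and of the item. `n_factors` sets the number of factors, `n_epochs` the number of training iterations, and `biased` adds baseline biases to the prediction.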
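
Below is a minimal plain-Python sketch of NMF on observed ratings only, using the classic multiplicative update rule. It is meant to illustrate why the factors stay non-negative, not to reproduce the library implementation, which to our knowledge also handles regularisation and optional biases. The function names and the toy ratings are hypothetical.

```python
import random


def fit_nmf(triples, n_factors=2, n_epochs=50, seed=0):
    """Factorise the observed ratings into non-negative user and item factors."""
    rng = random.Random(seed)
    # Strictly positive start: multiplying by positive ratios can never flip a sign
    p = {u: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for u, _, _ in triples}
    q = {i: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for _, i, _ in triples}

    def dot(u, i):
        return sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        # User step: rescale each factor by the ratio of observed to estimated ratings
        for u in p:
            seen = [(i, r, dot(u, i)) for v, i, r in triples if v == u]
            p[u] = [
                p[u][f] * sum(q[i][f] * r for i, r, _ in seen)
                / sum(q[i][f] * est for i, _, est in seen)
                for f in range(n_factors)
            ]
        # Item step: same rule with the roles of users and items swapped
        for i in q:
            seen = [(u, r, dot(u, i)) for u, j, r in triples if j == i]
            q[i] = [
                q[i][f] * sum(p[u][f] * r for u, r, _ in seen)
                / sum(p[u][f] * est for u, _, est in seen)
                for f in range(n_factors)
            ]
    return p, q


def rmse(triples, p, q):
    """Root mean squared error of the factorisation on the given ratings."""
    errors = [(r - sum(a * b for a, b in zip(p[u], q[i]))) ** 2 for u, i, r in triples]
    return (sum(errors) / len(errors)) ** 0.5


# Toy (user, item, rating) triples, made up for illustration
triples = [
    ("u1", "A", 9), ("u1", "B", 8), ("u2", "A", 8),
    ("u2", "C", 2), ("u3", "B", 3), ("u3", "C", 9),
]
p_start, q_start = fit_nmf(triples, n_epochs=0)  # random start, no training
p_fit, q_fit = fit_nmf(triples, n_epochs=50)
```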

In [None]:
model_nmf, _, top_n_nmf = helpers.ml.run_model(
    name="Non-negative Matrix Factorization",
    model=surprise.NMF,
    dataset=data,
    hyper_params={
        "n_factors": [5, 15, 25],
        "n_epochs": [25, 50, 75],
        "biased": [True, False]
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

## B.7 - Co-clustering
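Co-clustering assigns each user and each item to a cluster, which also groups the ratings into co-clusters (one per pair of user cluster and item cluster). A rating is predicted from the average rating of the co-cluster, corrected by how far the user and the item deviate from the averages of their own clusters. `n_cltr_u` and `n_cltr_i` set the number of user and item clusters.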
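
The prediction step of co-clustering can be sketched in plain Python, assuming the cluster assignments are already known (the fitting step that finds them is left out). The ratings, the assignments and the function `predict_co_clustering` are hypothetical.

```python
from statistics import mean


def predict_co_clustering(triples, user_cluster, item_cluster, user, item):
    """Co-cluster average, corrected by the user's and the item's own deviations."""
    cu, ci = user_cluster[user], item_cluster[item]
    # Ratings that fall in the same (user cluster, item cluster) block
    co_cluster = [
        r for u, i, r in triples if user_cluster[u] == cu and item_cluster[i] == ci
    ]
    in_user_cluster = [r for u, _, r in triples if user_cluster[u] == cu]
    in_item_cluster = [r for _, i, r in triples if item_cluster[i] == ci]
    user_mean = mean(r for u, _, r in triples if u == user)
    item_mean = mean(r for _, i, r in triples if i == item)
    return (
        mean(co_cluster)
        + (user_mean - mean(in_user_cluster))
        + (item_mean - mean(in_item_cluster))
    )


# Toy (user, item, rating) triples and cluster assignments, made up for illustration
triples = [
    ("u1", "A", 8), ("u1", "B", 10), ("u2", "A", 6),
    ("u2", "C", 2), ("u3", "C", 4), ("u3", "D", 6),
]
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
item_cluster = {"A": 0, "B": 0, "C": 1, "D": 1}

# Co-cluster average 8, "u2" rates 2.5 below its cluster, "B" is rated 2 above its cluster
prediction = predict_co_clustering(triples, user_cluster, item_cluster, "u2", "B")
```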

In [None]:
model_co_clustering, _, top_n_co_clustering = helpers.ml.run_model(
    name="Co-clustering",
    model=surprise.CoClustering,
    dataset=data,
    hyper_params={
        "n_cltr_u": [1, 3, 5],
        "n_cltr_i": [1, 3, 5],
        "n_epochs": [10, 20, 30],
    },
    models_folder=MODELS_FOLDER,
    seed=RANDOM_STATE
)

## B.8 - Comparing performance
**TODO**
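
One way to compare the models is on the RMSE and MAE that each run reports. As a minimal sketch of how such a comparison could be tabulated, the example below computes both metrics in plain Python. The model names and the (true rating, estimated rating) pairs are made up and are not results of this notebook.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true rating, estimated rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true rating, estimated rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical predictions per model, made up for illustration
predictions = {
    "model_a": [(8, 7.0), (5, 7.0), (9, 9.0)],
    "model_b": [(8, 7.5), (5, 5.5), (9, 8.0)],
}

# One row per model, best (lowest) RMSE first
comparison = sorted(
    ((name, rmse(pairs), mae(pairs)) for name, pairs in predictions.items()),
    key=lambda row: row[1],
)
for name, model_rmse, model_mae in comparison:
    print(f"{name}: RMSE = {model_rmse:.4f} | MAE = {model_mae:.4f}")
```

RMSE penalises large errors more than MAE does, so the two metrics can rank models differently.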

In [7]:
_, model_nmf = surprise.dump.load(file_name=str(MODELS_FOLDER / "Non-negative Matrix Factorization"))

## B.9 - Getting the Top-N
**TODO**
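
A top-N list keeps, for each user, the `n` items with the highest estimated rating above a minimum threshold. The sketch below shows this step in plain Python. The function `build_top_n` and the predictions are hypothetical, and the defaults `n=10` and `min_rating=4.0` simply mirror the values printed in the Slope One output above.

```python
import heapq
from collections import defaultdict


def build_top_n(predictions, n=10, min_rating=4.0):
    """Map each user to their n best (anime_id, estimate) pairs, best first."""
    per_user = defaultdict(list)
    for user_id, anime_id, estimate in predictions:
        if estimate >= min_rating:  # ignore items the user is unlikely to enjoy
            per_user[user_id].append((anime_id, estimate))
    return {
        user_id: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for user_id, items in per_user.items()
    }


# Hypothetical (user_id, anime_id, estimated rating) predictions, made up
predictions = [(1, 10, 8.5), (1, 11, 3.0), (1, 12, 9.1), (2, 10, 6.0)]
top_n = build_top_n(predictions, n=2)
```

A user whose estimates all fall below `min_rating` gets no entry at all, which is one reason why user coverage can be lower than 100%.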

In [8]:
data_anime = pandas.read_csv(DATA_FOLDER / "anime_cleaned.csv", converters={"genre_split": literal_eval})