## School project - 5MLRE
This notebook was created for a school project whose goal is to build an anime recommendation system. The subject and the questions are available in the appendix.

The group members who participated in this project are:
- AMIMI Lamine
- BEZIN Théo
- LECOMTE Alexis
- PAWLOWSKI Maxence

### Main index
1. Data analysis
2. **Collaborative filtering (you are here)**
3. Content-based filtering
4. _Appendix_

# 2 - Collaborative filtering
In the previous notebook, we loaded, cleaned and studied the [MyAnimeList](https://myanimelist.net/) datasets. Now that we know them better, we can start building the recommendation system using collaborative filtering. Collaborative filtering is a technique that selects items a user might like based on feedback from similar users. There are two sub-techniques: user-based collaborative filtering and item-based collaborative filtering.

### Index
<ol type="A">
  <li>Notebook initialization</li>
  <li>Data preparation</li>
  <li>Collaborative filtering: user-based</li>
  <li>Collaborative filtering: item-based</li>
  <li>Collaborative filtering: others</li>
  <li>Conclusion of the collaborative filtering</li>
</ol>

## A - Notebook initialization
### A.1 - Imports

In [1]:
# OS and filesystem
import os
import sys
from pathlib import Path
from timeit import default_timer as timer
from datetime import timedelta

# Data
import pandas
from matplotlib import pyplot
import matplotx

# Model processing
import surprise
from surprise import accuracy as surp_acc

# Console output
from colorama import Fore, Style

# Jupyter output
from IPython.utils import io

# Local files
sys.path.append(os.path.join(os.pardir, os.pardir))
import helpers

### A.2 - Package initialization

In [2]:
pyplot.rcParams.update(pyplot.rcParamsDefault)
pyplot.style.use(matplotx.styles.dracula)  # Set the matplotlib style

### A.3 - Constants

In [3]:
# Filesystem paths
PARENT_FOLDER = Path.cwd()
DATA_FOLDER = (PARENT_FOLDER / ".." / ".." / "data").resolve()
MODELS_FOLDER = (PARENT_FOLDER / ".." / ".." / "models").resolve()
TEMP_FOLDER = (PARENT_FOLDER / ".." / ".." / "temp").resolve()

# Plots
FIG_SIZE = (12, 7)

# Misc.
RANDOM_STATE = 2077

### A.4 - Datasets loading

In [4]:
# data_reader = surprise.Reader(line_format="user item rating", sep=",", rating_scale=(1, 10), skip_lines=1)
# data = surprise.Dataset.load_from_file(file_path=(DATA_FOLDER / "rating_cleaned.csv"), reader=data_reader)

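data_cleaned = pandas.read_csv(DATA_FOLDER / "rating_cleaned.csv")
data_filtered = data_cleaned[data_cleaned["rating"] >= 0.0]  # Keep only rows with a non-negative rating
data_shortened = data_filtered.sample(n=100_000, random_state=RANDOM_STATE)  # Fixed seed so the subsample is reproducible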

data_reader = surprise.Reader(rating_scale=(1, 10))
data = surprise.Dataset.load_from_df(df=data_shortened[["user_id", "anime_id", "rating"]], reader=data_reader)

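## B - Data preparation
This section covers how the ratings are split between training and testing, and which cross-validation iterator is used during the hyperparameter search.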

### B.1 - Splitting the dataset
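No explicit split is performed at this stage: splitting is delegated to the cross-validation iterators used later (`KFold` for the grid search, `LeaveOneOut` for the final evaluation). The alternatives we considered are kept below for reference.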

In [5]:
# Splitting alternatives kept for reference (not executed):
# data_train, data_test = surprise.model_selection.train_test_split(data, test_size=0.2, shuffle=True, random_state=RANDOM_STATE)
# data_train_full = data.build_full_trainset()
# data_test = data_train_full.build_testset()
# data_anti_test = data_train_full.build_anti_testset()

### B.2 - Choosing the data iterator
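We use a shuffled 10-fold cross-validation iterator with a fixed random state. It is passed to the grid search in the next section so that every candidate model is evaluated on the same folds.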

In [6]:
data_iterator = surprise.model_selection.KFold(n_splits=10, random_state=RANDOM_STATE, shuffle=True)
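
To make the role of the iterator more concrete, here is a minimal pure-Python sketch of how a shuffled k-fold split partitions a set of ratings. It is an illustration only and not Surprise's implementation: the `k_fold_indices` helper and the toy size of 25 ratings are assumptions made for this example.

```python
import random


def k_fold_indices(n_items, n_splits, seed):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split.

    Hypothetical helper for illustration; Surprise's KFold works on
    rating tuples and builds Trainset objects instead of index lists.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds so sizes differ by at most 1
        size = n_items // n_splits + (1 if fold < n_items % n_splits else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


folds = list(k_fold_indices(n_items=25, n_splits=10, seed=2077))
for train_idx, test_idx in folds[:3]:
    print(len(train_idx), len(test_idx))
```

Each rating lands in exactly one test fold, so with 10 splits every candidate model is trained on roughly 90% of the ratings and scored on the remaining 10%.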

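## C - Collaborative filtering
User-based collaborative filtering looks for users whose rating history resembles that of the target user and recommends what those neighbours rated highly. Item-based collaborative filtering instead looks for items that received similar ratings from the same users and recommends items close to those the target user already liked. In the KNN models below, the `user_based` flag of `sim_options` switches between the two approaches, and the grid search tries both.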
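
The difference between the user-based and item-based views can be sketched on a toy example. The ratings, names and helper functions below are invented for illustration and are not taken from the MyAnimeList data. The similarity is a plain cosine over shared keys, which only approximates what a KNN model would compute.

```python
from math import sqrt

# Toy ratings, purely illustrative (not taken from the MyAnimeList data)
ratings = {
    "alice": {"naruto": 9, "bleach": 8, "monster": 3},
    "bob": {"naruto": 8, "bleach": 9, "monster": 2},
    "carol": {"naruto": 2, "bleach": 3, "monster": 9},
}


def cosine(vec_a, vec_b):
    """Cosine similarity computed over the keys both vectors share."""
    common = set(vec_a) & set(vec_b)
    if not common:
        return 0.0
    dot = sum(vec_a[key] * vec_b[key] for key in common)
    norm_a = sqrt(sum(vec_a[key] ** 2 for key in common))
    norm_b = sqrt(sum(vec_b[key] ** 2 for key in common))
    return dot / (norm_a * norm_b)


def transpose(user_ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    item_ratings = {}
    for user, items in user_ratings.items():
        for item, rating in items.items():
            item_ratings.setdefault(item, {})[user] = rating
    return item_ratings


# User-based: compare users through the items they both rated
user_sim = cosine(ratings["alice"], ratings["bob"])
user_dissim = cosine(ratings["alice"], ratings["carol"])

# Item-based: compare items through the users who rated both
by_item = transpose(ratings)
item_sim = cosine(by_item["naruto"], by_item["bleach"])
item_dissim = cosine(by_item["naruto"], by_item["monster"])

print(f"alice~bob: {user_sim:.3f}, alice~carol: {user_dissim:.3f}")
print(f"naruto~bleach: {item_sim:.3f}, naruto~monster: {item_dissim:.3f}")
```

The same similarity function serves both views: only the axis of the rating matrix changes. Switching `user_based` in `sim_options` expresses the same choice in the model definitions.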

In [7]:
model_definitions = {
    "slope_one": {
        "name": "Slope One",
        "algo_class": surprise.SlopeOne,
        "hyper_params": None
    },
    "knn_basic": {
        "name": "KNN Basic",
        "algo_class": surprise.KNNBasic,
        "hyper_params": {
            "k": [20, 40, 60],
            "min_k": [1, 2, 3, 5],
            "sim_options": {
                "name": ["cosine", "msd", "pearson", "pearson_baseline"],
                "user_based": [True, False]
            }
        }
    },
    "knn_with_means": {
        "name": "KNN With Means",
        "algo_class": surprise.KNNWithMeans,
        "hyper_params": {
            "k": [20, 40, 60],
            "min_k": [1, 2, 3, 5],
            "sim_options": {
                "name": ["cosine", "msd", "pearson", "pearson_baseline"],
                "user_based": [True, False]
            }
        }
    },
    "knn_with_z-score": {
        "name": "KNN With Z-Score",
        "algo_class": surprise.KNNWithZScore,
        "hyper_params": {
            "k": [20, 40, 60],
            "min_k": [1, 2, 3, 5],
            "sim_options": {
                "name": ["cosine", "msd", "pearson", "pearson_baseline"],
                "user_based": [True, False]
            }
        }
    },
    "knn_baseline": {
        "name": "KNN Baseline",
        "algo_class": surprise.KNNBaseline,
        "hyper_params": {
            "k": [20, 40, 60],
            "min_k": [1, 2, 3, 5],
            "sim_options": {
                "name": ["cosine", "msd", "pearson", "pearson_baseline"],
                "user_based": [True, False]
            },
            "bsl_options": {
                "method": ["als"],
                "n_epochs": [5, 10, 15],
            }
        }
    },
    "non-negative_matrix_factorization": {
        "name": "Non-negative Matrix Factorization",
        "algo_class": surprise.NMF,
        "hyper_params": {
            "n_factors": [5, 15, 25],
            "n_epochs": [25, 50, 75],
            "biased": [True, False]
        }
    },
    "co-clustering": {
        "name": "Co-clustering",
        "algo_class": surprise.CoClustering,
        "hyper_params": {
            "n_cltr_u": [1, 3, 5],
            "n_cltr_i": [1, 3, 5],
            "n_epochs": [10, 20, 30],
        }
    },
}
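
To get a feel for the cost of these grids, the sketch below counts how many parameter combinations a grid expands to. It assumes every list, including those nested in `sim_options` and `bsl_options`, is combined with every other one. The `count_combinations` helper is hypothetical and only approximates how the search enumerates its candidates.

```python
def count_combinations(param_grid):
    """Count the combinations of a grid of lists, recursing into nested dicts.

    Hypothetical helper for illustration; it assumes every list is crossed
    with every other one, nested option dicts included.
    """
    total = 1
    for values in param_grid.values():
        if isinstance(values, dict):
            total *= count_combinations(values)
        else:
            total *= len(values)
    return total


# Same grid as the "knn_basic" entry above
knn_basic_grid = {
    "k": [20, 40, 60],
    "min_k": [1, 2, 3, 5],
    "sim_options": {
        "name": ["cosine", "msd", "pearson", "pearson_baseline"],
        "user_based": [True, False],
    },
}

n_splits = 10  # Same value as the KFold iterator defined in B.2
n_candidates = count_combinations(knn_basic_grid)
print(f"{n_candidates} candidates x {n_splits} folds = {n_candidates * n_splits} fits")
# 96 candidates x 10 folds = 960 fits
```

Under that assumption, the KNN grids alone require several hundred fits each, which is worth keeping in mind since the search below runs with `n_jobs=1`.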

In [8]:
print(f"{Style.BRIGHT}Testing multiple models...{Style.RESET_ALL}")
model_sep = f"{Style.DIM}{Fore.WHITE}{'=' * 25}{Style.RESET_ALL}"
measure_key = "rmse"
models = {}

for model_key in model_definitions:
    # Initialize the model processing
    iteration_start_time = timer()
    model_settings = model_definitions[model_key]
    print(f"{model_sep}\n{Fore.GREEN}Testing \"{model_settings['name']}\".{Fore.RESET}")

    # Train the model
    if model_settings["hyper_params"] is not None:  # If available, search the best estimator with GridSearch
        print(f"Running GridSearchCV...{Fore.WHITE}{Style.DIM}")
        grid_search_start_time = timer()
        with io.capture_output():
            grid_search = surprise.model_selection.GridSearchCV(
                algo_class=model_settings["algo_class"],
                param_grid=model_settings["hyper_params"],
                measures=["rmse", "mae"],
                cv=data_iterator,
                refit=False,
                n_jobs=1,
                joblib_verbose=0
            )
            grid_search.fit(data)

        best_model = grid_search.best_estimator[measure_key]
        best_params = grid_search.best_params[measure_key]
        grid_search_end_time = timer()
    else:
        best_model = model_settings["algo_class"]()
        best_params = []
        grid_search_start_time = grid_search_end_time = None

    # Save the best model
    models[model_key] = best_model

    # Accuracy calculation
    model_start_time = model_end_time = None
    LOOCV = surprise.model_selection.LeaveOneOut(n_splits=1, min_n_ratings=1, random_state=RANDOM_STATE)

    for data_train_LOOCV, data_test_LOOCV in LOOCV.split(data):
        model_start_time = timer()
        best_model.fit(data_train_LOOCV)
        train_prediction = best_model.test(data_train_LOOCV.build_testset())
        left_out_predictions = test_prediction = best_model.test(data_test_LOOCV)
        all_predictions = best_model.test(data_train_LOOCV.build_anti_testset())
        model_end_time = timer()

        print(f"{Style.RESET_ALL}")
        print(f"{Style.BRIGHT}Best params:{Style.NORMAL} {Style.DIM}{Fore.WHITE}{best_params}{Style.RESET_ALL}")
        print((
            f"{Style.BRIGHT}RMSE:{Style.NORMAL} "
            f"[ train = {surp_acc.rmse(train_prediction, verbose=False):.4f} | "
            f"test = {surp_acc.rmse(test_prediction, verbose=False):.4f} ]"
        ))
        print((
            f"{Style.BRIGHT}MAE:{Style.NORMAL} "
            f"[train = {surp_acc.mae(train_prediction, verbose=False):.4f} | "
            f"test = {surp_acc.mae(test_prediction, verbose=False):.4f} ]"
        ))
        print("")

        top_n = helpers.metrics.get_top_n(predictions=all_predictions, n=10, min_rating=4.0, verbose=True)
        helpers.metrics.get_hit_rate(top_n=top_n, left_out_predictions=left_out_predictions, auto_print=True)
        helpers.metrics.get_rating_hit_rate(top_n=top_n, left_out_predictions=left_out_predictions, auto_print=True)
        helpers.metrics.get_cumulative_hit_rate(top_n=top_n, left_out_predictions=left_out_predictions, min_rating=4.0, auto_print=True)
        helpers.metrics.get_average_reciprocal_hit_rank(top_n=top_n, left_out_predictions=left_out_predictions, auto_print=True)
        helpers.metrics.get_user_coverage(top_n=top_n, num_users=data_train_LOOCV.n_users, min_rating=4.0, auto_print=True)

    # Save the model to the disk
    surprise.dump.dump(file_name=str(MODELS_FOLDER / model_settings["name"]), algo=best_model)

    # Final output
    iteration_end_time = timer()
    iteration_elapsed_time = timedelta(seconds=iteration_end_time - iteration_start_time)
    grid_search_elapsed_time = timedelta(seconds=grid_search_end_time - grid_search_start_time) if grid_search_start_time is not None else None
    model_elapsed_time = timedelta(seconds=model_end_time - model_start_time) if model_start_time is not None else None
    print((
        f"\nTesting of the \"{model_settings['name']}\" model successfully completed in {iteration_elapsed_time}."
        f"\nGrid search: {'N/A' if grid_search_elapsed_time is None else grid_search_elapsed_time}"
        f"\nTraining and testing: {'N/A' if model_elapsed_time is None else model_elapsed_time}"
    ))

Testing multiple models...
Testing "Slope One".

Best params: []
RMSE: [train = 0.0000 | test = 1.6306 ]
MAE: [train = 0.0000 | test = 1.2816 ]

Built top-N for each user (n=10, min_rating=4.0)
Hit rate: 25.586854%
Hit rate per rating value:
Rating	Hit rate
2.0	50.000000%
4.0	28.571429%
5.0	36.363636%
6.0	10.909091%
7.0	17.894737%
8.0	33.644860%
9.0	29.545455%
10.0	27.659574%
Cumulative hit rate (min_rating=4.0): 25.653207%
Average reciprocal hit rate: 0.0034823443310215306
User coverage (num_users=426, min_rating=4.0): 0.9812206572769953

Testing of the "Slope One" model successfully completed in 0:00:01.474861.
Grid search: N/A
Training and testing: 0:00:01.208360
Testing "KNN Basic".
Running GridSearchCV...
Computing the cosine similarity matrix...
Done computing similarity matrix.

Best params: {'k': 20, 'min_k': 1, 'sim_options': {'nam