## School project - 5MLRE
This notebook was written for a school project whose goal is to build an anime recommendation system. The subject and the questions are available in the appendix.

The group members who participated in this project are:
- AMIMI Lamine
- BEZIN Théo
- LECOMTE Alexis
- PAWLOWSKI Maxence

### Main index
1. Data analysis
2. **Collaborative filtering (you are here)**
3. Content-based filtering
4. _Appendix_

# 2 - Collaborative filtering
In the previous notebook, we loaded, cleaned and studied the [MyAnimeList](https://myanimelist.net/) datasets. Now that we know them better, we can start building the recommendation system using collaborative filtering. Collaborative filtering is a technique that selects items a user might like based on feedback from similar users. There are two sub-techniques: user-based collaborative filtering and item-based collaborative filtering.

### Index
<ol type="A">
  <li>Notebook initialization</li>
  <li>Collaborative filtering: unfiltered training</li>
  <li>Collaborative filtering: filtered training</li>
  <li>Getting the Top-N</li>
  <li>Conclusion of the collaborative filtering</li>
</ol>

## A - Notebook initialization
### A.1 - Imports

In [7]:
# OS and filesystem
import os
import sys
from pathlib import Path
import random

# Data
import pandas
from matplotlib import pyplot
import matplotx

# Model processing
import surprise

# Misc.
from ast import literal_eval

# Local files
sys.path.append(os.path.join(os.pardir, os.pardir))
import helpers

### A.2 - Package initialization

In [2]:
pyplot.rcParams.update(pyplot.rcParamsDefault)
pyplot.style.use(matplotx.styles.dracula)  # Set the matplotlib style

### A.3 - Constants

In [3]:
# Filesystem paths
PARENT_FOLDER = Path.cwd()
DATA_FOLDER = (PARENT_FOLDER / ".." / ".." / "data").resolve()
MODELS_FOLDER = (PARENT_FOLDER / ".." / ".." / "models").resolve()
TEMP_FOLDER = (PARENT_FOLDER / ".." / ".." / "temp").resolve()

# Plots
FIG_SIZE = (12, 7)

# Misc.
RANDOM_STATE = 2077

### A.4 - Datasets loading

In [4]:
data_reader = surprise.Reader(line_format="user item rating", sep=",", rating_scale=(-1, 10), skip_lines=1)
data = surprise.Dataset.load_from_file(file_path=(DATA_FOLDER / "rating2.csv"), reader=data_reader)

# Load a smaller sample of the dataset instead of the 8M rows
# data2 = pandas.read_csv((DATA_FOLDER / "rating.csv"), dtype={"user_id": str, "anime_id": str})
# data = data2[data2["rating"] >= 0.0]
# data = data.sample(n=500_000)

# data_reader = surprise.Reader(rating_scale=(1, 10))
# data = surprise.Dataset.load_from_df(df=data[["user_id", "anime_id", "rating"]], reader=data_reader)

In [5]:
data_anime = pandas.read_csv(DATA_FOLDER / "anime_cleaned.csv", converters={"genre_split": literal_eval})

In [6]:
rankings = pandas.Series(data=data_anime["rank_num_ratings"].values, index=data_anime["anime_id"]).to_dict()

In [7]:
evaluator = helpers.ml.ModelEvaluator(dataset=data, rankings=rankings, models_folder=MODELS_FOLDER, seed=RANDOM_STATE)

Constructing sets. This can take a while...
   > Building train/test sets...
   > Building LeaveOneOut sets...
   > Building full sets...
   > Preparing the similarities model...
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


## B - Collaborative filtering: unfiltered training
As explained earlier, collaborative filtering makes personalized recommendations by analyzing a user's past preferences or behaviors and comparing them to those of similar users. There are two sub-techniques: user-based and item-based collaborative filtering.

- User-based: finds users who resemble the target user in terms of preferences, liked items and navigation, then recommends what those users liked.
- Item-based: finds items similar to those the user has already interacted with.

User-based filtering tends to be more relevant for entertainment-related items, where individual tastes carry many nuances: it recommends items that users with similar preferences have liked. Item-based filtering is more pertinent to online shops, which recommend products based on their characteristics.

We therefore expect user-based filtering to give better results in our case, but we will test both methods in this notebook.
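To make the difference between the two approaches concrete, here is a minimal sketch in plain Python. The toy ratings and the `cosine` and `transpose` helpers are invented for illustration and are not part of this project's code. The same similarity function is applied either to two users (compared through the items they both rated) or to two items (compared through the users who rated both).

```python
from math import sqrt

# Toy ratings, invented for illustration: {user: {item: rating}}
ratings = {
    "alice": {"A": 9, "B": 7, "C": 8},
    "bob": {"A": 8, "B": 6, "C": 9},
    "carol": {"A": 2, "B": 9},
}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity computed over the keys shared by two rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[key] * b[key] for key in common)
    norm_a = sqrt(sum(a[key] ** 2 for key in common))
    norm_b = sqrt(sum(b[key] ** 2 for key in common))
    return dot / (norm_a * norm_b)


def transpose(user_ratings: dict) -> dict:
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    by_item = {}
    for user, items in user_ratings.items():
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item


# User-based: compare two users through the items they both rated
user_sim = cosine(ratings["alice"], ratings["bob"])

# Item-based: compare two items through the users who rated both
by_item = transpose(ratings)
item_sim = cosine(by_item["A"], by_item["B"])
```

In the models below, this choice corresponds to the `user_based` entry of `sim_options`.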

### B.1 - Slope One
Slope One is a lightweight and simple collaborative filtering algorithm. For each pair of items, it computes the average difference between the ratings given by users who rated both. It then uses these average differences, together with the ratings the target user has already given, to predict that user's rating for an unseen item.
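The idea can be sketched in a few lines of plain Python. This is a simplified illustration on invented toy data, not the implementation used by `surprise.SlopeOne`; the function name `slope_one_predict` is hypothetical. The prediction is the user's mean rating plus the mean of the item-to-item deviations.

```python
from collections import defaultdict
from typing import Optional


def slope_one_predict(ratings: dict, user: str, target: str) -> Optional[float]:
    """Predict `user`'s rating of `target` from average item-to-item deviations."""
    dev_sum = defaultdict(float)  # sum of (rating of target - rating of j) per item j
    dev_count = defaultdict(int)  # number of users who rated both target and j

    for other_ratings in ratings.values():
        if target not in other_ratings:
            continue
        for item, rating in other_ratings.items():
            if item != target:
                dev_sum[item] += other_ratings[target] - rating
                dev_count[item] += 1

    # Keep the items rated by the user that share at least one rater with the target
    relevant = [item for item in ratings[user] if dev_count[item] > 0]
    if not relevant:
        return None  # not enough information to predict

    user_mean = sum(ratings[user].values()) / len(ratings[user])
    mean_deviation = sum(dev_sum[j] / dev_count[j] for j in relevant) / len(relevant)
    return user_mean + mean_deviation


# Toy ratings, invented for illustration
toy_ratings = {
    "u1": {"A": 8, "B": 6},
    "u2": {"A": 9, "B": 7},
    "u3": {"B": 5},
}
prediction = slope_one_predict(toy_ratings, user="u3", target="A")
```

Here, users who rated both items scored "A" two points above "B" on average, so the user who gave "B" a 5 is predicted to give "A" a 7.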

In [8]:
evaluator.run_model(name="Slope One", model=surprise.SlopeOne, hyper_params=None, measure_key="rmse", override=False)

Testing "Slope One".
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Building the top-N...
   > Fitting on the LOOCV...
   > Fitting on the full set...
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {}
RMSE: 2.187853
MAE: 1.397903

Hit rate: 0.402010%
Hit rate per rating value:
Rating	Hit rate
10.0	2.139037%
Cumulative hit rate (min_rating=3.0): 0.481928%
Average reciprocal hit rate: 0.0021775544388609714
User coverage (num_users=995, min_rating=3.0): 100.000000%
Diversity: 0.666667
Novelty: 3181.175146

Testing of the "Slope One" model successfully completed in 0:08:08.516542.
Grid search: N/A
Training and testing: 0:00:03.649915
Top-N building: 0:08:00.837321


### B.2 - KNN Basic
KNN Basic (K-Nearest Neighbors) is another collaborative filtering algorithm used in recommendation systems. It finds the "K" users (or items) most similar to the target according to a similarity metric. It then predicts the user's rating for the target item as the similarity-weighted average of the neighbors' ratings.
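As a rough illustration of the weighted average, here is a minimal sketch in plain Python. The function `knn_basic_predict` and the toy neighbors are hypothetical; the `k` and `min_k` arguments only mimic the role of the hyper-parameters tuned in the next cell, under the assumption that a prediction is refused when fewer than `min_k` neighbors are available.

```python
from typing import Optional


def knn_basic_predict(neighbors: list, k: int = 40, min_k: int = 1) -> Optional[float]:
    """Similarity-weighted average of the neighbors' ratings for the target item.

    `neighbors` holds (similarity, rating) pairs, one per neighbor who rated the item.
    """
    # Keep the k most similar neighbors, ignoring non-positive similarities
    top = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    if not top or len(top) < min_k:
        return None  # too few neighbors: no prediction can be made

    weighted_sum = sum(sim * rating for sim, rating in top)
    sim_sum = sum(sim for sim, _ in top)
    return weighted_sum / sim_sum


# Toy neighbors, invented for illustration: (similarity, rating of the target item)
toy_neighbors = [(0.9, 8.0), (0.5, 6.0), (0.1, 2.0)]
prediction = knn_basic_predict(toy_neighbors, k=2)
no_prediction = knn_basic_predict(toy_neighbors, k=2, min_k=3)
```

With `k=2`, only the two closest neighbors contribute, and the result sits closer to the rating of the most similar one.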

In [9]:
evaluator.run_model(
    name="KNN Basic",
    model=surprise.KNNBasic,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    measure_key="rmse",
    override=False
)

Testing "KNN Basic".
Running GridSearchCV...
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Building the top-N...
   > Fitting on the LOOCV...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
   > Fitting on the full set...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {'k': 40, 'min_k': 1, 'sim_options': {'name': 'pearson_baseline', 'user_based': False}}
RMSE: 2.220969
MAE: 1.376290

Hit rate: 0.402010%
Hit rate per rating value:
Rating	Hit rate
-1.0	

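### B.3 - KNN With Means
KNN With Means is a variant of the KNN Basic algorithm that takes the mean rating of each user (or item) into account. The weighted average is computed on mean-centered ratings, that is, on each neighbor's deviation from its own mean rating. The mean rating of the target user (or item) is then added back to obtain the prediction.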
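In the user-based case, this prediction can be written as follows. The notation is the one commonly used in the Surprise documentation and is an assumption here, since the notebook does not define it: $\mu_u$ is the mean rating of user $u$, $N_i^k(u)$ is the set of the $k$ nearest neighbors of $u$ who rated item $i$, and $\text{sim}(u, v)$ is the chosen similarity.

$$
\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}
$$

Removing the $\mu$ terms gives back the plain weighted average of KNN Basic.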

In [10]:
evaluator.run_model(
    name="KNN With Means",
    model=surprise.KNNWithMeans,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    measure_key="rmse",
    override=False
)

Testing "KNN With Means".
Running GridSearchCV...
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Building the top-N...
   > Fitting on the LOOCV...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
   > Fitting on the full set...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {'k': 40, 'min_k': 5, 'sim_options': {'name': 'pearson_baseline', 'user_based': True}}
RMSE: 2.131681
MAE: 1.324676

Hit rate: 2.613065%
Hit rate per rating value:
Rating	Hit rate
4

### B.4 - KNN With Z-Score
KNN With Z-Score is another variant of the KNN algorithm that takes into account both the mean and the standard deviation of each user's (or item's) ratings. Each neighbor's rating is converted to a Z-Score by subtracting the neighbor's mean rating and dividing the result by the neighbor's standard deviation. The weighted average of these Z-Scores is then rescaled with the mean and standard deviation of the target user (or item). This normalizes the ratings for both rating level and rating spread, which can improve the accuracy of the predictions and recommendations.
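The sketch below illustrates why the normalization helps, using invented toy data and hypothetical helper names (`to_z_scores`, `knn_z_score_predict`). A harsh rater and a generous rater with the same relative preferences end up with identical Z-Scores. The population standard deviation is used here as a simplifying assumption.

```python
from statistics import mean, pstdev


def to_z_scores(user_ratings: dict) -> dict:
    """Express each rating as a number of standard deviations from the user's mean."""
    values = list(user_ratings.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A user who always gives the same rating carries no spread information
        return {item: 0.0 for item in user_ratings}
    return {item: (rating - mu) / sigma for item, rating in user_ratings.items()}


def knn_z_score_predict(target_mean: float, target_std: float, neighbors: list) -> float:
    """Rescale the weighted average of the neighbors' Z-Scores to the target's scale.

    `neighbors` holds (similarity, Z-Score of the target item) pairs.
    """
    weighted_sum = sum(sim * z for sim, z in neighbors)
    sim_sum = sum(sim for sim, _ in neighbors)
    return target_mean + target_std * weighted_sum / sim_sum


# Toy users, invented for illustration: same tastes, different rating habits
harsh = {"A": 2, "B": 4, "C": 6}
generous = {"A": 6, "B": 8, "C": 10}
z_harsh = to_z_scores(harsh)
z_generous = to_z_scores(generous)

# Target user with mean 7.0 and standard deviation 2.0
prediction = knn_z_score_predict(7.0, 2.0, [(1.0, 1.0), (0.5, -0.5)])
```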

In [11]:
evaluator.run_model(
    name="KNN With Z-Score",
    model=surprise.KNNWithZScore,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        }
    },
    measure_key="rmse",
    override=False
)

Testing "KNN With Z-Score".
Running GridSearchCV...
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Building the top-N...
   > Fitting on the LOOCV...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
   > Fitting on the full set...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {'k': 40, 'min_k': 5, 'sim_options': {'name': 'pearson_baseline', 'user_based': True}}
RMSE: 2.130252
MAE: 1.299102

Hit rate: 2.311558%
Hit rate per rating value:
Rating	Hit rate

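### B.5 - KNN Baseline
KNN Baseline follows the same idea as KNN With Means, but it centers the ratings on a baseline estimate instead of a plain mean. The baseline combines the global mean rating with a bias for each user and each item, estimated here with alternating least squares (`als`). The similarity-weighted average of the neighbors' deviations from their own baselines is then added to the baseline of the target user and item.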
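In the user-based case, and with notation borrowed from the Surprise documentation (an assumption, since the notebook does not define it), the baseline and the prediction can be written as:

$$
b_{ui} = \mu + b_u + b_i
$$

$$
\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $N_i^k(u)$ is the set of the $k$ nearest neighbors of user $u$ who rated item $i$. The `bsl_options` entry of the grid below controls how the biases are estimated.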

In [12]:
evaluator.run_model(
    name="KNN Baseline",
    model=surprise.KNNBaseline,
    hyper_params={
        "k": [20, 40, 60],
        "min_k": [1, 2, 3, 5],
        "sim_options": {
            "name": ["cosine", "msd", "pearson", "pearson_baseline"],
            "user_based": [True, False]
        },
        "bsl_options": {
            "method": ["als"],
            "n_epochs": [5, 10, 15],
        }
    },
    measure_key="rmse",
    override=False
)

Testing "KNN Baseline".
Running GridSearchCV...
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Building the top-N...
   > Fitting on the LOOCV...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
   > Fitting on the full set...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {'k': 20, 'min_k': 3, 'sim_options': {'name': 'pearson_baseline', 'user_based': False}, 'bsl_options': {'method': 'als', 'n_epochs': 15}}
RMSE: 2.021351
MAE: 1.246453

Hit rate: 2.814070%

### B.6 - Non-negative Matrix Factorization
Non-negative Matrix Factorization (NMF) approximates the user-item rating matrix by the product of two smaller non-negative matrices*¹*: one describing the users and one describing the items, both in terms of a small number of latent factors (`n_factors`). Because all factors are non-negative, they are easier to interpret, and the predicted rating is obtained from the dot product of the user's and the item's factor vectors.

*1: A non-negative matrix is a matrix where all elements are greater than or equal to zero.*
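Written as a formula, with $p_u$ and $q_i$ denoting the non-negative factor vectors of user $u$ and item $i$ (notation borrowed from the Surprise documentation, not defined in this notebook), the prediction is:

$$
\hat{r}_{ui} = q_i^T p_u, \qquad p_u \geq 0, \; q_i \geq 0
$$

When `biased` is `True`, as far as we understand the library, the global mean and the user and item biases are added to the dot product:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u
$$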

In [13]:
evaluator.run_model(
    name="Non-negative Matrix Factorization",
    model=surprise.NMF,
    hyper_params={
        "n_factors": [5, 15, 25],
        "n_epochs": [25, 50, 75],
        "biased": [True, False]
    },
    measure_key="rmse",
    override=False
)

Testing "Non-negative Matrix Factorization".
Running GridSearchCV...
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Building the top-N...
   > Fitting on the LOOCV...
   > Fitting on the full set...
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {'n_factors': 5, 'n_epochs': 25, 'biased': True}
RMSE: 2.542902
MAE: 1.641667

Hit rate: 0.301508%
Hit rate per rating value:
Rating	Hit rate
10.0	1.604278%
Cumulative hit rate (min_rating=3.0): 0.361446%
Average reciprocal hit rate: 0.0018425460636515912
User coverage (num_users=995, min_rating=3.0): 100.000000%
Diversity: 0.127392
Novelty: 882.962780

Testing of the "Non-negative Matrix Factorization" model successfully completed in 0:06:50.143506.
Grid sear

### B.7 - Co-clustering
The goal of Co-clustering is to group similar rows and columns of the rating matrix (users and items) at the same time, so that shared patterns become more apparent. This method is useful for datasets with complex row and column relationships. Each user is assigned to a user cluster and each item to an item cluster, using an iterative procedure similar to k-means, and each pair of user cluster and item cluster forms a co-cluster. A rating is then predicted from the average rating of the co-cluster, corrected by how much the user and the item deviate from the averages of their own clusters.
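Using the notation of the Surprise documentation (an assumption, since the notebook does not define it), the prediction can be written as:

$$
\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i})
$$

where $\overline{C_{ui}}$ is the average rating of the co-cluster, $\overline{C_u}$ and $\overline{C_i}$ are the average ratings of the user's and the item's clusters, and $\mu_u$ and $\mu_i$ are the mean ratings of the user and the item. With a single user cluster and a single item cluster (`n_cltr_u=1`, `n_cltr_i=1`), every cluster average equals the global mean $\mu$, so the formula reduces to $\hat{r}_{ui} = \mu_u + \mu_i - \mu$.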

In [14]:
evaluator.run_model(
    name="Co-clustering",
    model=surprise.CoClustering,
    hyper_params={
        "n_cltr_u": [1, 3, 5],
        "n_cltr_i": [1, 3, 5],
        "n_epochs": [10, 20, 30],
    },
    measure_key="rmse",
    override=False
)

Testing "Co-clustering".
Running GridSearchCV...
Computing metrics...
Calculating the accuracy (RMSE, MAE)...
Building the top-N...
   > Fitting on the LOOCV...
   > Fitting on the full set...
Built top-N for each user (n=10, min_rating=3.0)
Built top-N for each user (n=10, min_rating=3.0)
Best params: {'n_cltr_u': 1, 'n_cltr_i': 1, 'n_epochs': 10}
RMSE: 2.295664
MAE: 1.489803

Hit rate: 2.010050%
Hit rate per rating value:
Rating	Hit rate
6.0	1.851852%
7.0	0.632911%
8.0	1.415094%
9.0	2.185792%
10.0	5.882353%
Cumulative hit rate (min_rating=3.0): 2.409639%
Average reciprocal hit rate: 0.005360134003350083
User coverage (num_users=995, min_rating=3.0): 93.165829%
Diversity: 0.977778
Novelty: 2220.936524

Testing of the "Co-clustering" model successfully completed in 0:12:24.6888

### B.8 - Comparing performance
Before we compare all the previous results, let's define a few important terms that we need to understand to properly compare the performance of our models.

<u>Machine Learning metrics:</u>
- Root Mean Squared Error (RMSE): the square root of the average squared difference between the predicted and the true ratings. It penalizes large errors more heavily. The lower, the better.
- Mean Absolute Error (MAE): the average absolute difference between the predicted and the true ratings. The lower, the better.
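Both metrics are easy to compute by hand. The sketch below uses invented (predicted, true) rating pairs and hypothetical helper names to show how a single large error weighs more on the RMSE than on the MAE.

```python
from math import sqrt


def rmse(pairs: list) -> float:
    """Root mean squared error over (predicted, true) rating pairs."""
    return sqrt(sum((predicted - true) ** 2 for predicted, true in pairs) / len(pairs))


def mae(pairs: list) -> float:
    """Mean absolute error over (predicted, true) rating pairs."""
    return sum(abs(predicted - true) for predicted, true in pairs) / len(pairs)


# Toy predictions, invented for illustration: two good ones and one large miss
toy_pairs = [(8.0, 9.0), (6.0, 6.0), (7.0, 3.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
```

On this toy data the RMSE is noticeably higher than the MAE, because the four-point miss is squared before averaging.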

<u>Recommendation systems metrics:</u>
- Hit rate: the proportion of users for whom the item left out of training (leave-one-out) appears in their top-N recommendations, expressed in percent.
- Hit rate per rating value: the hit rate calculated independently for each possible rating of the left-out item (from one to ten in our case).
- Cumulative hit rate: the hit rate calculated only on left-out items whose actual rating is at least a given threshold (`min_rating=3.0` here).
- Average reciprocal hit rank (ARHR): similar to the hit rate, but each hit is weighted by the reciprocal of its rank in the top-N, so hits near the top of the list count more. It is labeled "Average reciprocal hit rate" in the outputs above.
- User coverage: the proportion of users for whom the system is able to make recommendations.
- Diversity: the variety or dissimilarity of the items recommended to users.
- Novelty: the degree to which recommended items are new or unexpected to the user.
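The three hit-based metrics can be sketched as follows. This is a simplified illustration on invented data, not the code of `helpers.ml.ModelEvaluator`, whose implementation is not shown in this notebook; the function name `hit_metrics` and its input shapes are assumptions.

```python
def hit_metrics(top_n: dict, left_out: dict, min_rating: float = 3.0) -> dict:
    """Compute hit rate, cumulative hit rate and ARHR.

    top_n: {user: [item, ...]} ranked recommendations, best first.
    left_out: {user: (item, actual_rating)} item held out for each user.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    cumulative_hits = 0
    cumulative_total = 0

    for user, (item, rating) in left_out.items():
        recommended = top_n.get(user, [])
        is_hit = item in recommended
        if is_hit:
            hits += 1
            reciprocal_rank_sum += 1.0 / (recommended.index(item) + 1)
        # The cumulative hit rate ignores left-out items rated below the threshold
        if rating >= min_rating:
            cumulative_total += 1
            cumulative_hits += int(is_hit)

    num_users = len(left_out)
    return {
        "hit_rate": hits / num_users,
        "cumulative_hit_rate": cumulative_hits / cumulative_total if cumulative_total else 0.0,
        "arhr": reciprocal_rank_sum / num_users,
    }


# Toy data, invented for illustration
toy_top_n = {1: ["A", "B", "C"], 2: ["D", "E", "F"], 3: ["G", "H", "I"], 4: ["J", "K", "L"]}
toy_left_out = {1: ("B", 9.0), 2: ("Z", 8.0), 3: ("G", 2.0), 4: ("Q", 7.0)}
metrics = hit_metrics(toy_top_n, toy_left_out)
```

In this toy example, two of the four users get a hit, but one of those hits concerns an item rated below the threshold, so it does not count toward the cumulative hit rate.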

We can now compare our models. We will look at which model is best for each metric, and then conclude on the overall best model.

The model with the best overall metrics is KNN Baseline, thanks to its good performance on RMSE, MAE and hit rate. Its training time of seven seconds is reasonable. The grid search takes three hours, but since we already have the best parameters for this model, we do not need to run it again. On the other hand, KNN With Means shows an interesting ARHR and trains in three seconds. KNN Basic is excluded because 35% of its predictions were impossible (one of the parameters was unknown to it).

Overall, based on the metrics and results available at this stage of the training, KNN Baseline appears to be the most efficient model.

## D - Getting the Top-N
The final step is to display the top-N recommendations for a user. We start by loading the previously saved top-N of our best model.

In [11]:
top_n_knn_baseline = helpers.ml.Model.load_top_n(filepath=(MODELS_FOLDER / "KNN_Baseline__topN-full.pkl"))

We then pick a random user from our dataset.

In [8]:
random_user_id = int(random.choice(list(set([r[0] for r in data.raw_ratings]))))
random_user_id

263

We define a function that builds a human-readable table from the top-N.

In [17]:
def get_top_n_of(user_id: int, top_n: dict[int, list], items_df: pandas.DataFrame, auto_print: bool = False) -> pandas.DataFrame:
    """ Returns the formatted top-N recommendation for a specific user. """
    user_top = []

    # Each top-N entry is unpacked as (item ID, estimated rating, <unused value>)
    for top_item_id, estimated_rating, _ in top_n[user_id]:
        # Look up the metadata of the recommended item by its ID
        item = items_df[items_df["anime_id"] == top_item_id].iloc[0]
        user_top.append({
            "Name": item["name"],
            "Genre": item["genre"],
            "Num. ratings": item["num_ratings"],
            "Mean ratings": item["rating"],
            "User estimated rating": estimated_rating
        })

    result = pandas.DataFrame(data=user_top)

    # Optionally print the table in addition to returning it
    if auto_print:
        print(result)

    return result

In [18]:
get_top_n_of(user_id=random_user_id, top_n=top_n_knn_baseline, items_df=data_anime, auto_print=False)

| | Name | Genre | Num. ratings | Mean ratings | User estimated rating |
|---|---|---|---|---|---|
| 0 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | 21494 | 9.26 | 9.781889 |
| 1 | Hajime no Ippo | Comedy, Drama, Shounen, Sports | 4273 | 8.83 | 9.670955 |
| 2 | Rainbow: Nisha Rokubou no Shichinin | Drama, Historical, Seinen, Thriller | 2716 | 8.64 | 9.654531 |
| 3 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | 7477 | 9.13 | 9.64759 |
| 4 | Clannad: After Story | Drama, Fantasy, Romance, Slice of Life, Supern... | 15518 | 9.06 | 9.625787 |
| 5 | Monster | Drama, Horror, Mystery, Police, Psychological,... | 4079 | 8.72 | 9.567637 |
| 6 | Gintama': Enchousen | Action, Comedy, Historical, Parody, Samurai, S... | 2126 | 9.11 | 9.521137 |
| 7 | Tengen Toppa Gurren Lagann | Action, Adventure, Comedy, Mecha, Sci-Fi | 16955 | 8.78 | 9.482833 |
| 8 | Shigatsu wa Kimi no Uso | Drama, Music, Romance, School, Shounen | 8271 | 8.92 | 9.426137 |
| 9 | Clannad | Comedy, Drama, Romance, School, Slice of Life,... | 18746 | 8.3 | 9.384167 |


The table above shows the top-10 recommendations for the randomly selected user.

## E - Conclusion of the collaborative filtering
**TODO: Add text**