## School project - 5MLRE
This notebook was created for a school project whose goal is to build an anime recommendation system. The subject and the questions are available in the appendix.

The group members who participated in this project are:
- AMIMI Lamine
- BEZIN Théo
- LECOMTE Alexis
- PAWLOWSKI Maxence

### Main index
1. Data analysis
2. Collaborative filtering
3. **Content-based filtering (you are here)**
4. _Appendix_

# 3 - Content-based filtering
In the previous notebook, we tested different collaborative filtering models, and we saw that they scored poorly. We will now try another filtering technique. Content-based filtering uses item features to recommend other items similar to what the user likes. In our case, we use a user's previous ratings and suggest items that are similar to the anime they rated the highest.

### Index
<ol type="A">
  <li>Notebook initialization</li>
  <li>Data preparation</li>
  <li>The "Nearest Neighbors" model</li>
  <li>Conclusion of the content-based filtering</li>
</ol>

## A - Notebook initialization
### A.1 - Imports

In [2]:
# OS and filesystem
from pathlib import Path

# Math
import numpy

# Data
import pandas
from matplotlib import pyplot
import matplotx

# Model processing
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Console output
from colorama import Fore

# Misc.
from ast import literal_eval

### A.2 - Package initialization

In [3]:
pyplot.rcParams.update(pyplot.rcParamsDefault)
pyplot.style.use(matplotx.styles.dracula)  # Set the matplotlib style

### A.3 - Constants

In [4]:
# Filesystem paths
PARENT_FOLDER = Path.cwd()
DATA_FOLDER = (PARENT_FOLDER / ".." / ".." / "data").resolve()
MODELS_FOLDER = (PARENT_FOLDER / ".." / ".." / "models").resolve()
TEMP_FOLDER = (PARENT_FOLDER / ".." / ".." / "temp").resolve()

# Plots
FIG_SIZE = (12, 7)

# Misc.
RANDOM_STATE = 2077

### A.4 - Datasets loading

In [5]:
data_anime = pandas.read_csv(DATA_FOLDER / "anime_cleaned.csv", converters={"genre_split": literal_eval})

## B - Data preparation
The previous systems used a collaborative filtering technique. These models were trained on a dataset containing a user identifier, an item identifier and a rating, and no further processing was required. In this notebook, we use the content-based technique instead.

This technique uses the characteristics of the item itself to build recommendations, instead of the interactions between users and items. While it cannot be used to surface the latest trending items, it is very useful for finding similar items.

For this technique to work, we must first translate some values into a format that the model can interpret. In this section, we go through the preparation of the data and explain why each step is needed. These transformations were not done during the first pre-processing, as some of them (e.g. one-hot encoding) make the data harder to use with the plotting libraries.

### B.1 - Filtering out some columns
We start the preparation by filtering out the columns that are not used in the similarity search, such as identifiers and ranks. Usually, we would keep the name of the anime as it constitutes the `y`, the label that we want to predict. But in this case, we are simply searching for close items, so the name serves no purpose.

In [6]:
data_preprocessed = data_anime.drop(labels=["anime_id", "name", "genre_split", "rank_avg_rating", "rank_num_ratings"], axis=1, inplace=False)
data_preprocessed

Unnamed: 0,genre,type,episodes,rating,members,num_ratings
0,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,1961
1,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665,21494
2,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262,1188
3,"Sci-Fi, Thriller",TV,24,9.17,673572,17151
4,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266,3115
...,...,...,...,...,...,...
12289,Hentai,OVA,1,4.15,211,2
12290,Hentai,OVA,1,4.28,183,2
12291,Hentai,OVA,4,4.88,219,1
12292,Hentai,OVA,1,4.98,175,1


### B.2 - Encoding and standardization
The next step is to encode the text values.

Usually we would use `scikit-learn` and its built-in encoders to do the job, but they do not fit the `genre` column, where each cell holds a comma-separated list of genres rather than a single category. So we will encode this column using `pandas`, and `scikit-learn` will take care of the rest.

There are two common techniques to encode strings: one-hot encoding and label encoding.

Label encoding is a technique for handling categorical variables. In this technique, each label is assigned a unique integer number based on alphabetical order. The main problem with this technique is that it creates a kind of ranking between categories. The model might interpret a higher integer as a better value. This type of encoding works well with ordinal features or when there are a large number of categories.

One-hot encoding is another technique for transforming categorical variables. It creates additional features (columns in the case of a dataframe), one per unique value of the categorical feature. Each new feature has two possible values: 0 or 1. This technique solves the ranking problem of label encoding, but it creates another one: the dummy variable trap. The trap occurs when the encoded columns are perfectly collinear, i.e. when the value of one column can be deduced from the others. For example, if a marital status is encoded as three columns "single", "married" and "divorced", a row with 0 in the first two necessarily has 1 in the third. This mainly affects linear models, where it is usually avoided by dropping one of the columns; it is generally not an issue for distance-based models such as the one used in this notebook. In contrast to label encoding, one-hot encoding performs better on non-ordinal features and when the number of categories remains low.

The most sensible choice in our case is the one-hot encoding.
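
To make the difference concrete, here is a minimal sketch on a toy `type` column (the three values and all variable names are illustrative, not taken from the dataset). It compares the pairwise distances produced by each encoding: label encoding places "Movie" and "TV" further apart than "Movie" and "OVA", while one-hot encoding keeps every pair of categories at the same distance.

```python
import numpy
import pandas

# Toy column (illustrative values only)
types = pandas.Series(["TV", "Movie", "OVA"], name="type")

# Label encoding: integers assigned in alphabetical order (Movie=0, OVA=1, TV=2)
labels = types.astype("category").cat.codes.to_numpy()

# One-hot encoding: one 0/1 column per category
one_hot = pandas.get_dummies(types).to_numpy(dtype=float)

# Pairwise distances between the three rows (TV, Movie, OVA)
label_dist = numpy.abs(labels[:, None] - labels[None, :])
one_hot_dist = numpy.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)

print(label_dist)    # distances depend on the arbitrary alphabetical order
print(one_hot_dist)  # every pair of distinct categories is equally distant
```

The artificial ordering introduced by the integers is exactly what a distance-based model would pick up, which is why one-hot encoding is the safer option for nominal features such as `genre` and `type`.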

In [7]:
genre_col_idx = data_preprocessed.columns.get_loc("genre")

# We use this technique to preserve the column order
data_preprocessed = pandas.concat(objs=[
    data_preprocessed.iloc[:, :genre_col_idx],  # All columns before the `genre` column
    data_preprocessed["genre"].str.get_dummies(sep=", ").add_prefix("genre_"),  # One-hot encoded genres
    data_preprocessed.iloc[:, (genre_col_idx + 1):]  # All columns after the `genre` column
], axis=1, ignore_index=False, sort=False)
data_preprocessed

Unnamed: 0,genre_Action,genre_Adventure,genre_Cars,genre_Comedy,genre_Dementia,genre_Demons,genre_Drama,genre_Ecchi,genre_Fantasy,genre_Game,...,genre_Thriller,genre_Unknown,genre_Vampire,genre_Yaoi,genre_Yuri,type,episodes,rating,members,num_ratings
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,Movie,1,9.37,200630,1961
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,TV,64,9.26,793665,21494
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,TV,51,9.25,114262,1188
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,TV,24,9.17,673572,17151
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,TV,51,9.16,151266,3115
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,OVA,1,4.15,211,2
12290,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,OVA,1,4.28,183,2
12291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,OVA,4,4.88,219,1
12292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,OVA,1,4.98,175,1
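
For readers unfamiliar with `Series.str.get_dummies`, the following small sketch shows what it does on a toy `genre` column (the three rows are made up for illustration). Each cell is split on the separator, and every distinct genre becomes its own 0/1 column, so a single anime can have several columns set to 1.

```python
import pandas

# Toy data (illustrative rows, not taken from the dataset)
toy = pandas.DataFrame({"genre": ["Drama, Romance", "Action, Drama", "Sci-Fi"]})

# Split each cell on ", " and create one indicator column per distinct genre
genre_dummies = toy["genre"].str.get_dummies(sep=", ").add_prefix("genre_")
print(genre_dummies)
```

The first row ends up with a 1 in both `genre_Drama` and `genre_Romance`, which a single-category encoder applied to the raw string could not express.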


It is now possible to use `scikit-learn` to encode and standardize the rest of the dataset. We first complete the one-hot encoding with the last text column.

In [8]:
categorical_features = ["type"]
categorical_transformer = Pipeline(steps=[
    # Put the SimpleImputer here in case of missing data
    ("encoder", OneHotEncoder(categories="auto", handle_unknown="error"))
], verbose=False)

Then we standardize numerical columns.

Standardization rescales each numerical feature so that its mean is zero and its standard deviation is one, without changing the shape of its distribution. This pre-processing matters for distance-based models such as nearest neighbors, where features with large values dominate the distance computation. In our case, without it the `num_ratings` column would weigh much more on the result than the `rating` column, even though a large number of ratings does not necessarily make an anime a better recommendation.
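
The effect of scale can be illustrated with a small sketch. The three rows below are hypothetical `[rating, num_ratings]` pairs, and the Euclidean distance is used only because it is easy to read; it is not the metric used later in this notebook.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [rating, num_ratings]
toy = numpy.array([
    [9.0, 20000.0],  # row 0
    [5.0, 19000.0],  # row 1: very different rating, similar num_ratings
    [8.9, 100.0],    # row 2: similar rating, very different num_ratings
])


def euclidean(a: numpy.ndarray, b: numpy.ndarray) -> float:
    return float(numpy.linalg.norm(a - b))


# Raw values: the distance is driven almost entirely by num_ratings
raw_01 = euclidean(toy[0], toy[1])  # roughly 1000
raw_02 = euclidean(toy[0], toy[2])  # roughly 19900

# Standardized values: each column has mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(toy)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, row 2 looks about twenty times further from row 0 than row 1 does, purely because of the size of `num_ratings`. After scaling, a large gap in `rating` counts about as much as a large gap in `num_ratings`.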

In [9]:
numeric_features = ["episodes", "rating", "members", "num_ratings"]
numeric_transformer = Pipeline(steps=[
    # Put the SimpleImputer here in case of missing data
    ("scaler", StandardScaler())
], verbose=False)

Usually, the pipeline would have a simple imputer to fill in the missing data. But we have already solved this problem before during the data exploration, so we don't need it.

The final step is to initialize the column transformer and to fit it on the dataset.

In [10]:
# Initialize the ColumnTransformer
genre_cols = [column for column in data_preprocessed if column.startswith("genre_")]
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", categorical_transformer, categorical_features),
        ("numeric", numeric_transformer, numeric_features),
        ("skipped", "passthrough", genre_cols)  # We skip the pre-processing of the genre columns but keep them.
    ],
    remainder="drop",
    verbose=True
)

# Fit and transform the dataset
features_x = preprocessor.fit_transform(data_preprocessed)

[ColumnTransformer] ... (1 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ....... (2 of 3) Processing numeric, total=   0.0s
[ColumnTransformer] ....... (3 of 3) Processing skipped, total=   0.0s
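
As a reading aid, here is a minimal sketch of the same kind of `ColumnTransformer` on a three-row toy frame (the data and the `toy_` names are illustrative). It shows that the output columns follow the order of the `transformers` list, not the order of the input columns. The `sparse_threshold=0` argument is an addition for this example only: it forces a dense array, which keeps the result easy to print.

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (illustrative values only)
toy = pandas.DataFrame({
    "genre_Drama": [1, 0, 1],
    "type": ["TV", "Movie", "TV"],
    "episodes": [12, 1, 24],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(), ["type"]),      # -> columns 0-1 (Movie, TV)
        ("numeric", StandardScaler(), ["episodes"]),     # -> column 2
        ("skipped", "passthrough", ["genre_Drama"]),     # -> column 3, left unchanged
    ],
    remainder="drop",
    sparse_threshold=0,  # Always return a dense array (choice made for this sketch)
)

toy_x = toy_preprocessor.fit_transform(toy)
print(toy_x)
```

This ordering explains why the first columns of the transformed matrix correspond to the one-hot encoded `type`, followed by the standardized numeric columns and finally the genre indicators.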


We can save the transformed dataset to disk for later use.

In [11]:
numpy.save(file=str(DATA_FOLDER / "x-anime_16-03-23_11-25"), arr=features_x)

And reload it with this block of code.

In [13]:
features_x = numpy.load(file=str(DATA_FOLDER / "x-anime_16-03-23_11-25.npy"))
features_x

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

As explained before, data in this form is nearly unusable for data exploration. This is why we only apply these transformations now.

## C - The "Nearest Neighbors" model
In this section, we define a model that returns the ten closest items for each anime in the dataset.

### C.1 - Model definition and similarities computation
From the [sklearn docs](https://scikit-learn.org/stable/modules/neighbors.html), [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: [BallTree](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree), [KDTree](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree), and a brute-force algorithm based on routines in `sklearn.metrics.pairwise`. When the algorithm is set to `"auto"`, the module attempts to determine the best `algorithm` from the training data.

In [14]:
model = NearestNeighbors(n_neighbors=11, radius=1.0, algorithm="auto", metric="cosine")
model.fit(features_x)

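You may notice that we set the `n_neighbors` parameter to `11`. When the model is queried with the data it was fitted on, the nearest neighbor of each item is normally the item itself (at distance zero), so we ask for one extra neighbor. We will filter the first one out later, which leaves ten other items. We use the `cosine` metric because it compares the direction of the feature vectors rather than their magnitude, which makes it a common choice for computing similarities between items. The `radius` parameter is only used by radius-based queries and has no effect on `kneighbors`.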
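
For intuition, the cosine distance can be written in a few lines of plain Python. This is only a sketch of the formula (one minus the cosine similarity) on made-up genre vectors; the `cosine_distance` helper is hypothetical and is not the routine `scikit-learn` uses internally.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up indicator vectors: [Drama, Romance, Action]
drama_romance = [1.0, 1.0, 0.0]
drama_only = [1.0, 0.0, 0.0]
action_only = [0.0, 0.0, 1.0]

same = cosine_distance(drama_romance, drama_romance)   # (almost) zero
close = cosine_distance(drama_romance, drama_only)     # one shared genre
far = cosine_distance(drama_romance, action_only)      # no shared genre
print(same, close, far)
```

An item compared with itself has a distance of (almost) zero, which is why it shows up first in its own neighbors list, while two items with no feature in common are at distance one.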

We can now compute the similarities.

In [15]:
distances, indices = model.kneighbors(features_x)
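
The shape of the two returned arrays is easier to see on a toy example. The sketch below fits the same estimator on four made-up feature vectors (the `toy_` names are illustrative) and asks for two neighbors per row.

```python
import numpy
from sklearn.neighbors import NearestNeighbors

# Four made-up items forming two obvious pairs: (0, 1) and (2, 3)
toy_features = numpy.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 1.0],
])

toy_model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(toy_features)
toy_distances, toy_indices = toy_model.kneighbors(toy_features)

# One row per queried item, one column per neighbor, sorted by increasing distance
print(toy_indices)
print(toy_distances)
```

Both arrays have one row per item and `n_neighbors` columns. Here the first column of `toy_indices` holds the item's own position and the second column holds its closest other item, which is the reason the first neighbor is dropped when building the top ten.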

### C.2 - Getting the Top-N
A content-based model does not normally need the ratings dataset. But in our case, we want to build the top ten of a user from their identifier.

In [18]:
data_ratings = pandas.read_csv((DATA_FOLDER / "rating.csv"), dtype={"user_id": int, "anime_id": int, "rating": float})

So we use the ratings dataset to pick one of the user's best-rated anime and display the other anime closest to it.

In [63]:
def get_best_rated_anime(user_id: int, df_ratings: pandas.DataFrame, df_animes: pandas.DataFrame) -> int | None:
    """
        Returns the index of the best rated anime by this user.
        **It does NOT return the anime identifier, only the index relative to the DataFrame.**
        It may return None if the user has no rating.
    """
    user_ratings = df_ratings.loc[df_ratings["user_id"] == user_id]
    user_ratings = user_ratings[user_ratings["rating"] >= 0]  # Keep only non-negative ratings

    if len(user_ratings) == 0:  # If the user hasn't rated any item
        return None

    # Keep the items with the highest rating and pick one of them at random
    best_ratings = user_ratings.loc[user_ratings["rating"] == user_ratings["rating"].max()]
    random_item = best_ratings.sample(n=1).iloc[0]

    return int(df_animes.index[df_animes["anime_id"] == random_item["anime_id"]].tolist()[0])


def get_top_n_of(user_id: int, df: pandas.DataFrame, df_ratings: pandas.DataFrame, indices_arr: numpy.ndarray) -> pandas.DataFrame | None:
    anime_idx = get_best_rated_anime(user_id=user_id, df_ratings=df_ratings, df_animes=df)

    if anime_idx is None:
        print(f"{Fore.YELLOW}Cannot build the Top-N: The user has not rated any anime.{Fore.RESET}")
        return None

    # Skip the first neighbor (the anime itself) and return the others
    return df.iloc[indices_arr[anime_idx][1:]].reset_index(drop=True)

We pick a random user from our dataset.

In [64]:
random_user_id = int(data_ratings.sample(n=1).iloc[0]["user_id"])
random_user_id

9754

And we display his top-N.

In [65]:
get_top_n_of(user_id=random_user_id, df=data_anime, df_ratings=data_ratings, indices_arr=indices)

Unnamed: 0,anime_id,name,genre,genre_split,type,episodes,rating,members,num_ratings,rank_avg_rating,rank_num_ratings
0,1698,Nodame Cantabile,"Comedy, Drama, Josei, Music, Romance, Slice of...","[Comedy, Drama, Josei, Music, Romance, Slice o...",TV,23,8.46,157025,4616,156.0,318.0
1,21995,Ao Haru Ride,"Comedy, Drama, Romance, School, Shoujo, Slice ...","[Comedy, Drama, Romance, School, Shoujo, Slice...",TV,12,7.89,227417,5547,790.5,239.0
2,4722,Skip Beat!,"Comedy, Drama, Romance, Shoujo","[Comedy, Drama, Romance, Shoujo]",TV,25,8.28,134818,4893,285.0,290.0
3,1222,Bokura ga Ita,"Drama, Romance, Shoujo, Slice of Life","[Drama, Romance, Shoujo, Slice of Life]",TV,26,7.54,125051,3912,1598.5,395.0
4,120,Fruits Basket,"Comedy, Drama, Fantasy, Romance, Shoujo, Slice...","[Comedy, Drama, Fantasy, Romance, Shoujo, Slic...",TV,26,7.8,242553,9171,955.0,93.0
5,57,Beck,"Comedy, Drama, Music, Shounen, Slice of Life","[Comedy, Drama, Music, Shounen, Slice of Life]",TV,26,8.4,148328,4963,189.5,285.0
6,3731,Itazura na Kiss,"Comedy, Romance, Shoujo","[Comedy, Romance, Shoujo]",TV,25,7.76,136279,4873,1040.0,292.0
7,18671,Chuunibyou demo Koi ga Shitai! Ren,"Comedy, Drama, Romance, School, Slice of Life","[Comedy, Drama, Romance, School, Slice of Life]",TV,12,7.6,208885,5817,1426.0,223.0
8,11433,Ano Natsu de Matteru,"Comedy, Drama, Romance, Sci-Fi, Slice of Life","[Comedy, Drama, Romance, Sci-Fi, Slice of Life]",TV,12,7.7,169718,5171,1176.0,271.0
9,2034,Lovely★Complex,"Comedy, Romance, Shoujo","[Comedy, Romance, Shoujo]",TV,24,8.23,235003,8293,332.5,115.0


## D - Conclusion of the content-based filtering
This type of filtering gave us better results than the models of the previous notebook. The system is able to recommend a list of anime to a user based on one they already watched and rated highly.

The next notebook is the appendix. It contains the list of sources used in our research and the questions from the school subject.