## School project - 5MLRE
This notebook was created for a school project whose goal is to build an anime recommendation system. The subject and the questions are available in the appendix.

The group members who participated in this project are:
- AMIMI Lamine
- BEZIN Théo
- LECOMTE Alexis
- PAWLOWSKI Maxence

### Main index
1. Data analysis
2. Collaborative filtering
3. **Content-based filtering (you are here)**
4. _Appendix_

# 3 - Content-based filtering
**TODO: In the previous notebook...**
Content-based filtering uses item features to recommend items similar to those the user already likes. In our case, we take a user's previous ratings and suggest animes that are similar to the ones they rated the highest.
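The idea can be sketched on a tiny example. The catalogue, the feature columns and the ratings below are hypothetical and do not come from the project's dataset; the snippet only illustrates the principle (find the user's highest-rated item, then rank the unrated items by cosine similarity to it).

```python
import numpy

# Hypothetical toy catalogue: one row per item, one binary column per feature
# (here: Action, Comedy, Romance). These values are NOT taken from the dataset.
item_names = ["A", "B", "C", "D"]
item_features = numpy.array([
    [1.0, 0.0, 0.0],  # A: Action
    [1.0, 1.0, 0.0],  # B: Action, Comedy
    [0.0, 0.0, 1.0],  # C: Romance
    [0.0, 1.0, 1.0],  # D: Comedy, Romance
])

# Hypothetical ratings given by one user (numpy.nan means "not rated")
user_ratings = numpy.array([9.0, numpy.nan, 4.0, numpy.nan])

# Item the user rated the highest
favorite_idx = int(numpy.nanargmax(user_ratings))

# Cosine similarity between the favorite item and every item of the catalogue
norms = numpy.linalg.norm(item_features, axis=1)
similarities = (item_features @ item_features[favorite_idx]) / (norms * norms[favorite_idx])

# Rank the items the user has not rated yet, most similar first
unrated = numpy.isnan(user_ratings)
candidates = [i for i in numpy.argsort(-similarities) if unrated[i]]
recommendations = [item_names[i] for i in candidates]
print(recommendations)
```

Here the user's favorite is item A, so item B (which shares the Action feature) is ranked before item D.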

### Index
<ol type="A">
  <li>Notebook initialization</li>
  <li>Data preparation</li>
  <li>The "Nearest Neighbors" model</li>
</ol>

## A - Notebook initialization
### A.1 - Imports

In [1]:
# OS and filesystem
import os
import sys
from pathlib import Path

# Math
import numpy

# Data
import pandas
from matplotlib import pyplot
import matplotx

# Model processing
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Misc.
from ast import literal_eval

# Local files
sys.path.append(os.path.join(os.pardir, os.pardir))
import helpers

### A.2 - Package initialization

In [2]:
pyplot.rcParams.update(pyplot.rcParamsDefault)
pyplot.style.use(matplotx.styles.dracula)  # Set the matplotlib style

### A.3 - Constants

In [3]:
# Filesystem paths
PARENT_FOLDER = Path.cwd()
DATA_FOLDER = (PARENT_FOLDER / ".." / ".." / "data").resolve()
MODELS_FOLDER = (PARENT_FOLDER / ".." / ".." / "models").resolve()
TEMP_FOLDER = (PARENT_FOLDER / ".." / ".." / "temp").resolve()

# Plots
FIG_SIZE = (12, 7)

# Misc.
RANDOM_STATE = 2077

### A.4 - Datasets loading

In [4]:
data_anime = pandas.read_csv(DATA_FOLDER / "anime_cleaned.csv", converters={"genre_split": literal_eval})

## B - Data preparation
**TODO: Explain why a second pre-processing is needed**

*(Now that we have built and studied the graphs, it is time to start building our recommendation system. Before that, we need to transform the dataset into a format that our future models can read. These transformations were not part of the first pre-processing because some of them (e.g. one-hot encoding) make the data harder to use with the plotting libraries.)*

### B.1 - Filtering out some columns
*(We start the second pre-processing by dropping the columns that were only needed for the data analysis.)*

In [5]:
data_preprocessed = data_anime.drop(labels=["anime_id", "genre_split", "rank_avg_rating", "rank_num_ratings"], axis=1, inplace=False)
data_preprocessed

| Unnamed: 0 | name | genre | type | episodes | rating | members | num_ratings |
|---|---|---|---|---|---|---|---|
| 0 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 1961 |
| 1 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 | 21494 |
| 2 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 | 1188 |
| 3 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 | 17151 |
| 4 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 | 3115 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12289 | Toushindai My Lover: Minami tai Mecha-Minami | Hentai | OVA | 1 | 4.15 | 211 | 2 |
| 12290 | Under World | Hentai | OVA | 1 | 4.28 | 183 | 2 |
| 12291 | Violence Gekiga David no Hoshi | Hentai | OVA | 4 | 4.88 | 219 | 1 |
| 12292 | Violence Gekiga Shin David no Hoshi: Inma Dens... | Hentai | OVA | 1 | 4.98 | 175 | 1 |


### B.2 - Encoding and standardization
*(The next step is to encode the text values. We leave the `name` column untouched because it holds our labels, our Y set.)*

*(Usually we would use `scikit-learn` and its built-in encoders to do the job, but the `genre` column is incompatible with them: each cell holds several comma-separated genres instead of a single category. So we will encode this column using `pandas`, and `scikit-learn` will take care of the rest.)*

*(There are two common techniques to encode strings: one-hot encoding and label encoding.)*

*(Label encoding is a technique for handling categorical variables. In this technique, each label is assigned a unique integer number based on alphabetical order. The main problem with this technique is that it creates a kind of ranking between categories. The model might interpret a higher integer as a better value. This type of encoding works well with ordinal features or when there are a large number of categories.)*

*(One-hot encoding is another technique for transforming categorical variables. It creates one additional feature (a column, in the case of a dataframe) per unique value of the categorical feature. Each new feature takes one of two values: 0 or 1. This technique solves the ranking problem of label encoding, but it creates another one: the dummy variable trap. The trap occurs when one of the new columns can be deduced from the others, which makes them perfectly collinear. For example, if a marital status is encoded as "single", "married" and "divorced", a row with 0 in the first two columns is necessarily "divorced", so the third column carries no new information. This mainly affects linear models and is usually avoided by dropping one of the columns. In contrast to label encoding, one-hot encoding performs better on non-ordinal features and when the number of categories remains low.)*

*(The most sensible choice in our case is one-hot encoding.)*
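The difference between the two encodings can be seen on a small example. The values below are hypothetical toy data, not rows of the anime dataset, and the variable names are only illustrative.

```python
import pandas

# Hypothetical toy column with one category per cell
toy = pandas.DataFrame({"type": ["TV", "Movie", "OVA", "TV"]})

# Label encoding: one integer per category, assigned in alphabetical order
label_encoded = toy["type"].astype("category").cat.codes
print(label_encoded.tolist())

# One-hot encoding: one 0/1 column per category
one_hot = pandas.get_dummies(toy["type"], prefix="type").astype(int)
print(one_hot)

# Cells holding several comma-separated values need `Series.str.get_dummies`,
# which splits each cell before creating the 0/1 columns
multi = pandas.Series(["Action, Comedy", "Drama", "Action, Drama"])
multi_hot = multi.str.get_dummies(sep=", ")
print(multi_hot)
```

With label encoding, "TV" becomes 2 and "Movie" becomes 0, which suggests an order that does not exist. The one-hot version keeps the categories independent, at the cost of one column per category.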

In [6]:
genre_col_idx = data_preprocessed.columns.get_loc("genre")

# We use this technique to preserve the column order
data_preprocessed = pandas.concat(objs=[
    data_preprocessed.iloc[:, :genre_col_idx],  # All columns before the `genre` column
    data_preprocessed["genre"].str.get_dummies(sep=", ").add_prefix("genre_"),  # One-hot encoded genres
    data_preprocessed.iloc[:, (genre_col_idx + 1):]  # All columns after the `genre` column
], axis=1, ignore_index=False, sort=False)
data_preprocessed

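| Unnamed: 0 | name | genre_Action | genre_Adventure | genre_Cars | genre_Comedy | genre_Dementia | genre_Demons | genre_Drama | genre_Ecchi | genre_Fantasy | ... | genre_Thriller | genre_Unknown | genre_Vampire | genre_Yaoi | genre_Yuri | type | episodes | rating | members | num_ratings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kimi no Na wa. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | Movie | 1 | 9.37 | 200630 | 1961 |
| 1 | Fullmetal Alchemist: Brotherhood | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | TV | 64 | 9.26 | 793665 | 21494 |
| 2 | Gintama° | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | TV | 51 | 9.25 | 114262 | 1188 |
| 3 | Steins;Gate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | TV | 24 | 9.17 | 673572 | 17151 |
| 4 | Gintama&#039; | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | TV | 51 | 9.16 | 151266 | 3115 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12289 | Toushindai My Lover: Minami tai Mecha-Minami | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | OVA | 1 | 4.15 | 211 | 2 |
| 12290 | Under World | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | OVA | 1 | 4.28 | 183 | 2 |
| 12291 | Violence Gekiga David no Hoshi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | OVA | 4 | 4.88 | 219 | 1 |
| 12292 | Violence Gekiga Shin David no Hoshi: Inma Dens... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | OVA | 1 | 4.98 | 175 | 1 |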


*(It is now possible to use `scikit-learn` to encode and standardize the rest of the dataset. We first complete the one-hot encoding with the last text column, `type`.)*

In [7]:
categorical_features = ["type"]
categorical_transformer = Pipeline(steps=[
    # Put the SimpleImputer here in case of missing data
    ("encoder", OneHotEncoder(categories="auto", handle_unknown="error"))
], verbose=False)

*(Then we standardize numerical columns.)*

*(Standardization rescales each numerical feature so that its mean is zero and its standard deviation is one, without changing the shape of its distribution. This pre-processing matters for distance-based models such as the one we use: without it, columns with large values such as `num_ratings` would weigh much more on the predictions than the `rating` column. But a large number of ratings does not necessarily make an anime the best recommendation.)*
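As a quick illustration, the sketch below applies `StandardScaler` to a hypothetical toy array whose two columns live on very different scales (the numbers are made up and only mimic a rating and a number of ratings).

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Hypothetical values: first column looks like a rating, second like a count
toy = numpy.array([
    [9.0, 20000.0],
    [8.0, 1000.0],
    [7.0, 3000.0],
    [4.0, 2.0],
])

# Each column is centered on its mean and divided by its standard deviation
scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns have comparable magnitudes, so neither dominates a distance computation simply because of its unit.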

In [8]:
numeric_features = ["episodes", "rating", "members", "num_ratings"]
numeric_transformer = Pipeline(steps=[
    # Put the SimpleImputer here in case of missing data
    ("scaler", StandardScaler())
], verbose=False)

*(Usually, the pipeline would include a `SimpleImputer` to fill in the missing data. But we already handled missing values earlier, so we don't need it here.)*

*(The final step is to initialize the column transformer and to fit it on the dataset.)*

In [9]:
# Split the features from the labels
features_x = data_preprocessed.drop(columns=["name"], inplace=False)
# features_y = numpy.array(data_preprocessed["name"].tolist())

# Initialize the ColumnTransformer
genre_cols = [column for column in features_x if column.startswith("genre_")]
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", categorical_transformer, categorical_features),
        ("numeric", numeric_transformer, numeric_features),
        ("skipped", "passthrough", genre_cols)  # We skip the pre-processing of the genre columns but keep them.
    ],
    remainder="drop",
    verbose=True
)

# Fit and transform the dataset
features_x = preprocessor.fit_transform(features_x)

[ColumnTransformer] ... (1 of 3) Processing categorical, total=   0.0s
[ColumnTransformer] ....... (2 of 3) Processing numeric, total=   0.0s
[ColumnTransformer] ....... (3 of 3) Processing skipped, total=   0.0s


*(We can save the NumPy array to disk for later use.)*

In [10]:
numpy.save(file=str(DATA_FOLDER / "x-anime_16-03-23_11-25"), arr=features_x)
# numpy.save(file=str(DATA_FOLDER / "y-anime_16-03-23_11-25"), arr=features_y)

*(And reload it with this block of code.)*

In [11]:
features_x = numpy.load(file=str(DATA_FOLDER / "x-anime_16-03-23_11-25.npy"))
# features_y = numpy.load(file=str(DATA_FOLDER / "y-anime_16-03-23_11-25.npy"))

In [12]:
features_x

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

**TODO: Show the visual difference with the original data**

## C - The "Nearest Neighbors" model
**TODO: Explain `NearestNeighbors`**

In [13]:
model = NearestNeighbors(n_neighbors=11, radius=1.0, algorithm="auto", metric="cosine")
model.fit(features_x)

**TODO: Explain why there is no `y`**
**TODO: Explain why cosine and not Minkowski**
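As an illustration of how the two metrics behave (and not a statement of the reasoning behind the choice made above), the sketch below compares the cosine distance with the Euclidean distance, which is the Minkowski distance with `p=2`. The three vectors are hypothetical.

```python
import numpy
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical feature vectors
a = numpy.array([[1.0, 1.0, 0.0]])
b = numpy.array([[3.0, 3.0, 0.0]])  # Same direction as `a`, larger magnitude
c = numpy.array([[1.0, 0.0, 1.0]])  # Same magnitude as `a`, different direction

cos_ab = cosine_distances(a, b)[0, 0]
cos_ac = cosine_distances(a, c)[0, 0]
euc_ab = euclidean_distances(a, b)[0, 0]
euc_ac = euclidean_distances(a, c)[0, 0]

print(f"cosine:    a-b={cos_ab:.2f}  a-c={cos_ac:.2f}")
print(f"euclidean: a-b={euc_ab:.2f}  a-c={euc_ac:.2f}")
```

The cosine distance only looks at the angle between the vectors, so `b` is considered identical to `a` even though its values are three times larger. The Euclidean distance ranks `c` closer to `a` than `b`, because it is sensitive to magnitude.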

...

In [14]:
distances, indices = model.kneighbors(features_x)
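The `indices` array holds, for each row, the positions of its nearest rows, ordered by increasing distance. The sketch below shows one possible way to turn such an array into recommendations. It uses hypothetical toy data, and the `recommend` helper is an assumption made for illustration, not part of the project. Because the training data is also used as the query, each row normally comes back as its own closest neighbor, so the helper filters it out.

```python
import numpy
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy items and features (not taken from the anime dataset)
toy_names = numpy.array(["A", "B", "C", "D"])
toy_features = numpy.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.3],
])

# Ask for 2 neighbors: the item itself plus 1 actual recommendation
toy_model = NearestNeighbors(n_neighbors=2, metric="cosine")
toy_model.fit(toy_features)
toy_distances, toy_indices = toy_model.kneighbors(toy_features)


def recommend(name):
    """Return the names of the nearest neighbors of `name`, excluding itself."""
    row = int(numpy.where(toy_names == name)[0][0])
    neighbors = [int(i) for i in toy_indices[row] if i != row]
    return toy_names[neighbors].tolist()


print(recommend("A"))
print(recommend("C"))
```

In this toy example, item B is recommended for item A, and item D for item C, since their feature vectors point in almost the same direction.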