# 🏯 Anime Recommendations Database

Welcome!

#### Project overview

The goal of this project is to apply collaborative filtering and content-based filtering techniques to data from https://myanimelist.net/.

The variables have the following meaning:

    For the "anime.csv" dataset:
        anime_id - myanimelist.net's unique id identifying an anime.
        name - full name of anime.
        genre - comma separated list of genres for this anime.
        type - movie, TV, OVA, etc.
        episodes - how many episodes in this show. (1 if movie).
        rating - average rating out of 10 for this anime.
        members - number of community members that are in this anime's "group".
    For the "rating.csv" dataset:
        user_id - non identifiable randomly generated user id.
        anime_id - the anime that this user has rated.
        rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).


#### Importing libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import re

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
ratings = pd.read_csv('/kaggle/input/anime-recommendations-database/rating.csv')
anime = pd.read_csv('/kaggle/input/anime-recommendations-database/anime.csv')

In [None]:
def headDict(dictionnary: dict, n: int = 10) -> None:
    """Print the first n items of a dict.

    Args:
        dictionnary (dict): Dictionary to display
        n (int): Number of items to show
    """
    for i, (key, value) in enumerate(dictionnary.items()):
        # Stop once n items have been printed
        if i >= n:
            break
        print(str(key) + " : " + str(value))

# Exploratory data analysis
## Ratings dataset

In [None]:
ratings.head()

In [None]:
n_unrated = (ratings['rating'] == -1).sum()
print(f"The ratings dataset contains negative ratings (-1), which mean that a user watched an anime but did not rate it.\nThis concerns {n_unrated} rows.")

In [None]:
ratings.describe()

The dataset contains almost 8 million rows.

With the -1 ratings included, the mean of the rating variable is 6.14.

In [None]:
ratings.isnull().sum()

The dataset contains no null values.

In [None]:
ratings = ratings.drop(ratings[ratings['rating'] == -1].index)

In [None]:
ratings.describe()

After removing the -1 ratings, the dataset has more than 6.3 million rows and the mean of the rating variable is 7.81.
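
To illustrate why the mean rises once the -1 placeholders are removed, here is a small sketch on made-up ratings. The values in `toy_ratings` are hypothetical and are not taken from the dataset.

```python
from statistics import mean

# Hypothetical ratings for illustration; -1 means "watched but not rated"
toy_ratings = [8, -1, 7, 10, -1, 9]

# Mean over every row, placeholders included
mean_with_placeholder = mean(toy_ratings)

# Mean over actual ratings only
rated_only = [r for r in toy_ratings if r != -1]
mean_rated_only = mean(rated_only)

print(mean_with_placeholder, mean_rated_only)
```

Each -1 pulls the average down as if it were a very low score, which is why the placeholders are dropped before any statistic is computed.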

## How many anime are covered by this study?

In [None]:
print(f"{ratings['anime_id'].nunique()} anime are covered by this study")

## How many users are covered by this study?

In [None]:
print(f"{ratings['user_id'].nunique()} users are covered by this study")

## Adding rank variables (by mean rating and by number of ratings)

In [None]:
ratings_ranks = ratings[["anime_id", "rating"]].groupby(by=["anime_id"]).agg(['mean','count']).reset_index()
ratings_ranks.columns=["anime_id","mean_rating","count_rating"]
ratings_ranks

## Rank variable by number of ratings (summary statistics and plots)

In [None]:
ratings_ranks['count_rating'].describe()

The mean number of ratings per anime is 638.

The first quartile is 1, the median is 9 and the third quartile is 57.

The minimum is 1 and the maximum is 34,224.

The standard deviation is 1,795, which corresponds to a coefficient of variation of about 281.5%.

The number of ratings per anime is therefore very unbalanced.
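
The relative dispersion quoted above is the standard deviation divided by the mean (coefficient of variation). A minimal sketch of that calculation follows. It uses the rounded figures from the summary, so the last digit may differ slightly from the value computed on the full data. The `toy_counts` list and the helper name are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Rounded figures quoted in the summary: std of about 1,795 for a mean of about 638
cv_from_summary = 1795 / 638
print(f"{cv_from_summary:.1%}")

# Hypothetical counts with one very popular title, to show the same effect
toy_counts = [1, 1, 9, 57, 3000]
print(f"{coefficient_of_variation(toy_counts):.1%}")
```

A ratio well above 100% means the spread is larger than the mean itself, which is typical of a long-tailed count distribution.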

In [None]:
facecolor = '#eaeaf2'
fig, ax = plt.subplots(figsize=(10, 6), facecolor=facecolor)
title = 'Distribution of the number of ratings per anime'
plt.title(title, fontsize=18, pad=10)
plt.subplots_adjust(top=0.85)
plt.hist(ratings_ranks['count_rating'], bins=100)
plt.axvline(ratings_ranks['count_rating'].mean(), color='k', linestyle='solid', linewidth=1)
plt.show()

This histogram shows the distribution of the number of ratings per anime.

The distribution is positively skewed (right-skewed): very few anime have many ratings.

This matches the earlier observations: most anime have only a small number of ratings.

In [None]:
sns.set(style='whitegrid')
facecolor = '#eaeaf2'
fig, ax = plt.subplots(figsize=(10, 6), facecolor=facecolor)
sns.boxplot(x=ratings_ranks["count_rating"])
title = 'Distribution of the number of ratings per anime'
plt.title(title, fontsize=18, pad=10)
plt.subplots_adjust(top=0.85)
plt.show()

This box plot shows the extreme values that unbalance the dataset.

## Rank variable by mean rating (summary statistics and plots)

In [None]:
ratings_ranks['mean_rating'].describe()

The mean of the rating variable per anime is 6.64.

The first quartile is 1, the median is 6.89 and the third quartile is 7.49.

The minimum is 1 and the maximum is 10.

The standard deviation is 1.30, which corresponds to a coefficient of variation of about 20%.

The mean rating per anime is therefore slightly unbalanced.

In [None]:
facecolor = '#eaeaf2'
fig, ax = plt.subplots(figsize=(10, 6), facecolor=facecolor)
title = 'Distribution of the mean rating per anime'
plt.title(title, fontsize=18, pad=10)
plt.subplots_adjust(top=0.85)
plt.hist(ratings_ranks['mean_rating'], bins=50)
plt.axvline(ratings_ranks['mean_rating'].mean(), color='k', linestyle='solid', linewidth=1)
plt.show()

The histogram shows a slight left skew.

In [None]:
sns.set(style='whitegrid')
facecolor = '#eaeaf2'
fig, ax = plt.subplots(figsize=(10, 6), facecolor=facecolor)
sns.boxplot(x=ratings_ranks["mean_rating"])
title = 'Distribution of the mean rating per anime'
plt.title(title, fontsize=18, pad=10)
plt.subplots_adjust(top=0.85)
plt.show()

## Anime dataset

In [None]:
anime

In [None]:
anime.describe()

The dataset contains 12,294 different anime.

It contains 12,064 anime ratings.

The mean rating is 6.47, with a minimum of 1.67 and a maximum of 10.

The standard deviation is 1.03, which is slightly lower than in the ratings dataset.

In [None]:
anime['type'].unique()

The type variable has 6 distinct values.

Some anime appear to have no type, since the value **nan** is present.

In [None]:
anime['episodes'].unique()

The number of episodes per anime varies.

The column also contains the value **Unknown**.

In [None]:
anime[anime['episodes'] == 'Unknown']

In [None]:
anime = anime.drop(anime[anime['episodes'] == 'Unknown'].index)

In [None]:
anime.isnull().sum()

In addition to the information above, there are 51 null values for the genre variable and 78 for the rating variable.

In [None]:
anime.dropna(inplace=True)
anime

After removing the missing data, the dataset has 12,017 rows.

In [None]:
anime.describe()

The dataset now contains 12,017 different anime.

It contains 12,017 anime ratings.

The mean rating is 6.48, with a minimum of 1.67 and a maximum of 10.

The standard deviation is 1.02, which is slightly lower than before.

In [None]:
def name_cleaning(text):
    # Strip HTML entities and special markers left in the titles
    text = re.sub(r'&quot;', '', text)
    text = re.sub(r'\.hack//', '', text)
    # Specific apostrophe patterns must run before the generic &#039; removal,
    # otherwise they can never match
    text = re.sub(r'A&#039;s', '', text)
    text = re.sub(r'I&#039;', 'I\'', text)
    text = re.sub(r'&#039;', '', text)
    text = re.sub(r'&amp;', 'and', text)
    text = re.sub(r'°', 'and', text)
    
    return text

anime['name'] = anime['name'].apply(name_cleaning)
anime

The names are cleaned to make content-based filtering easier.

In [None]:
anime[anime.duplicated()]

No rows appear to be duplicated.

### Anime types

In [None]:
def showPiePlotQualitativeVariable(df, qualitative_variable_column, graph_title):
    facecolor = '#eaeaf2'
    labels = df[qualitative_variable_column].value_counts().index
    values = df[qualitative_variable_column].value_counts().values

    fig, ax = plt.subplots(figsize=(20, 8), facecolor=facecolor)
    plt.pie(values, labels = labels, autopct='%1.1f%%')
    plt.title(graph_title, fontsize=18, pad=10)
    ax.legend(labels, title=qualitative_variable_column, loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
    plt.show() 

In [None]:
showPiePlotQualitativeVariable(anime, 'type', 'Distribution of anime types')

The TV and OVA types together account for more than 50% of the anime.

### Genres

Collect all unique genres and add one column per genre to the dataset.

In [None]:
anime['genre'] = anime['genre'].astype('string')
anime_genres = anime
anime_genres = anime_genres.reset_index()  # make sure indexes pair with number of rows

genres = []
for index, row in anime_genres.iterrows():
    list_genre = row['genre'].split(', ')
    for genre in list_genre:
        if genre not in genres:
            anime_genres[genre] = 0
            genres.append(genre)
        anime_genres.at[index, genre] = 1
genres
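
The row-by-row loop above can be slow on large tables. As an alternative sketch, pandas can build the same kind of 0/1 genre columns in one call with `Series.str.get_dummies`. The miniature `toy` table below is hypothetical and only stands in for the anime dataset.

```python
import pandas as pd

# Hypothetical miniature of the anime table
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Drama", "Drama", "Comedy, Action"],
})

# One 0/1 column per genre, split on the same ", " separator
genre_dummies = toy["genre"].str.get_dummies(sep=", ")
toy_genres = pd.concat([toy, genre_dummies], axis=1)

print(list(genre_dummies.columns))
print(toy_genres)
```

Note that this variant orders the genre columns alphabetically, whereas the loop keeps them in order of first appearance.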

In [None]:
anime_genres

In [None]:
anime_names = anime['name'].to_list()

In [None]:
def getTopGenre(df, genres, rating_column_name):
    # Split the genre string so that each genre is matched as a whole name
    # (a substring match would also count "Shounen Ai" under "Shounen")
    genre_lists = df["genre"].str.split(", ")
    masks = [genre_lists.apply(lambda gs, x=x: x in gs) for x in genres]
    mean = [df[m][rating_column_name].mean() for m in masks]
    count = [df[m][rating_column_name].count() for m in masks]
    summaryGenres = pd.DataFrame({"genre": genres, "mean_rating": mean, 'count_rating':count})
    
    # Top 10 genres by mean rating
    plt.figure(figsize=(15,6))
    sns.barplot(data = summaryGenres.sort_values(by="mean_rating", ascending=False).head(10), x="mean_rating", y = "genre")
    plt.show()
    
    # Top 10 genres by number of ratings
    plt.figure(figsize=(15,6))
    sns.barplot(data=summaryGenres.sort_values(by="count_rating", ascending=False).head(10), x="count_rating", y = "genre")
    plt.show()

In [None]:
def getTopAnime(df, rank_rating_name, rank_count_name = None):
    # Keep one row per anime with only the columns needed for the plots
    columns = ['name', rank_rating_name]
    if rank_count_name is not None:
        columns.append(rank_count_name)
    animes = df[columns].drop_duplicates()
    
    # Top 10 anime by rating
    plt.figure(figsize=(15,6))
    sns.barplot(data = animes.sort_values(by=rank_rating_name, ascending=False).head(10), x=rank_rating_name, y = "name")
    plt.show()

    # Top 10 anime by count, when a count column is given
    if rank_count_name is not None:
        plt.figure(figsize=(15,6))
        sns.barplot(data=animes.sort_values(by=rank_count_name, ascending=False).head(10), x=rank_count_name, y = "name")
        plt.show()

In [None]:
getTopGenre(anime, genres, 'rating')

In [None]:
getTopAnime(anime, 'rating', 'members')

## Lookup dictionaries

In [None]:
titleToAnimeId = pd.Series(anime.anime_id.values, index=anime.name).to_dict()

In [None]:
animeIdToTitle = pd.Series(anime.name.values, index=anime.anime_id).to_dict()

## Adding rank variables

In [None]:
anime_ranks =  anime.join(ratings_ranks.set_index('anime_id'), on='anime_id', rsuffix='_user')
anime_ranks.rename(columns={"mean_rating": "user_mean_rating", "count_rating": "user_count_rating"}, inplace=True)

In [None]:
anime_ranks['rating_rank'] = anime_ranks['rating'].rank(method='max', ascending=False)
anime_ranks.sort_values(by=['rating_rank']).head(20)

In [None]:
anime_ranks['user_count_rating_rank'] = anime_ranks['user_count_rating'].rank(ascending=False)
anime_ranks.sort_values(by=['user_count_rating_rank']).head(20)

In [None]:
anime_ranks['user_mean_rating_rank'] = anime_ranks['user_mean_rating'].rank(method='max', ascending=False)
anime_ranks.sort_values(by=['user_mean_rating_rank']).head(20)
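
The cells above use `rank(method='max')` for the two rating ranks and the default method for the count rank. The difference only shows up when values are tied. The sketch below illustrates it on a hypothetical `scores` series.

```python
import pandas as pd

# Hypothetical scores with a tie between the two best items
scores = pd.Series([9.0, 9.0, 8.0, 7.0], index=["a", "b", "c", "d"])

# method='max': tied items all receive the largest rank number of their group
rank_max = scores.rank(method="max", ascending=False)

# default method='average': tied items share the mean of their rank positions
rank_average = scores.rank(ascending=False)

print(pd.DataFrame({"score": scores, "max": rank_max, "average": rank_average}))
```

With `method='max'`, no item is ranked 1 when the best score is shared, which is worth keeping in mind when reading the sorted tables.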

In [None]:
genre_user_rating = anime_ranks[['user_mean_rating','genre']]
genre_user_rating

### Genres

Comparison with the rating variable of the anime dataset.

In [None]:
getTopGenre(anime_ranks, genres, 'user_mean_rating')

Compared with the values obtained when studying the rating variable, we can note that:
* The top 5 best-rated genres are still Josei, Thriller, Mystery, Police and Shounen.
* In the top 10, only Supernatural and Romance are common to both bar charts.

In [None]:
getTopAnime(anime_ranks, 'user_mean_rating', 'user_count_rating')

In [None]:
anime_full_data = anime_ranks.join(ratings.set_index('anime_id'), on='anime_id', rsuffix='_user')
anime_full_data.head(20)

In [None]:
showPiePlotQualitativeVariable(anime_full_data, 'type', 'Distribution of anime types')

In [None]:
users_n_rate = anime_full_data[['user_id','rating_user']].groupby(by=["user_id"]).agg(['count']).reset_index()
users_n_rate.columns=["user_id","n_rate"]
users_n_rate

In [None]:
def getTopNUser(n_rate = 100):
    return users_n_rate[users_n_rate['n_rate'] >= n_rate]

In [None]:
getTopNUser(200)

In [None]:
anime_n_rate = anime_full_data[['anime_id','rating_user']].groupby(by=["anime_id"]).agg(['count']).reset_index()
anime_n_rate.columns=["anime_id","n_rate"]
anime_n_rate

In [None]:
def getTopNAnime(n_rate = 100):
    return anime_n_rate[anime_n_rate['n_rate'] >= n_rate]

In [None]:
getTopNAnime(200)

In [None]:
def getRatingByNRate(n_rate_user = 100, n_rate_anime = 100):
    topNUsers = getTopNUser(n_rate_user)
    topNAnimes = getTopNAnime(n_rate_anime)
    
    ratings_top_users = pd.merge(left=ratings, right=topNUsers['user_id'], left_on='user_id', right_on='user_id')
    ratings_top_users_animes = pd.merge(left=ratings_top_users, right=topNAnimes['anime_id'], left_on='anime_id', right_on='anime_id')
    return ratings_top_users_animes

In [None]:
getRatingByNRate(200,200)

# Collaborative filtering

In [None]:
import surprise
import random
from surprise import Reader, Dataset, accuracy, dataset, SVD
from surprise.model_selection import train_test_split, LeaveOneOut
from surprise.prediction_algorithms.co_clustering import CoClustering
from surprise.prediction_algorithms.random_pred import NormalPredictor
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.prediction_algorithms.knns import KNNBasic
from surprise.prediction_algorithms.slope_one import SlopeOne
from surprise.model_selection.search import GridSearchCV

### Helper functions

In [None]:
def hitrate(topNpredictions,leftoutpredictions):
    hits=0
    total=0
    for leftout in leftoutpredictions:
        uid=leftout[0]
        leftoutmovieid=leftout[1]
        hit=False
        for movieId ,predictedRating in topNpredictions[uid]:
            if(leftoutmovieid == movieId):
                hit=True
        if(hit):
            hits+=1
        total+=1 

    return hits/total

In [None]:
def cumulativeHitRate(topNpredictions,leftOutPredictions, minRate):
    hits=0
    total=0
    for uid, leftOutAnimeID, actualRating, estimatedRating, _ in leftOutPredictions: 
        if(actualRating >= minRate):
            
            hit=False
            for animeId ,predictedRating in topNpredictions[uid]:
                if(leftOutAnimeID == animeId):
                    hit=True
            if(hit):
                hits+=1
            total+=1 

    return hits/total

In [None]:
def averageReciprocalHitRate(topNpredictions, leftOutPredictions):
    sumRanking = 0    
    for uid, leftOutAnimeID, actualRating, estimatedRating, _ in leftOutPredictions:
        rank = 0        
        for animeId, predictedRating in topNpredictions[uid]:
            rank += 1            
            if leftOutAnimeID == animeId:
                sumRanking += 1.0 / rank    
    return sumRanking / len(leftOutPredictions)
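
As a worked illustration of the hit rate and the average reciprocal hit rate, here is a simplified, self-contained sketch on hand-written data. The users, items, and the two helper functions are hypothetical stand-ins, not the notebook's own functions, and left-out entries are reduced to `(user, item)` pairs.

```python
# Hypothetical top-N lists: user -> [(item, estimated rating), ...], best first
top_n = {
    "u1": [("i1", 9.1), ("i2", 8.7), ("i3", 8.0)],
    "u2": [("i4", 9.5), ("i5", 7.9)],
    "u3": [("i6", 8.8)],
}
# Hypothetical left-out items, one per user: (user, item)
left_out = [("u1", "i2"), ("u2", "i9"), ("u3", "i6")]

def hit_rate(top_n, left_out):
    """Fraction of left-out items that appear in their user's top-N list."""
    hits = 0
    for uid, item in left_out:
        if item in [candidate for candidate, _ in top_n.get(uid, [])]:
            hits += 1
    return hits / len(left_out)

def average_reciprocal_hit_rate(top_n, left_out):
    """Like the hit rate, but each hit counts 1 / rank instead of 1."""
    total = 0.0
    for uid, item in left_out:
        for rank, (candidate, _) in enumerate(top_n.get(uid, []), start=1):
            if candidate == item:
                total += 1.0 / rank
                break
    return total / len(left_out)

print(hit_rate(top_n, left_out))
print(average_reciprocal_hit_rate(top_n, left_out))
```

Here two of the three left-out items are found, one at rank 2 and one at rank 1, so the reciprocal variant rewards the list that places the item first.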

In [None]:
def userCoverage(topNPredicted, minRate):
    """Share of users with at least one top-N item predicted at or above minRate."""
    hits = 0
    total = 0
    for uid in topNPredicted.keys():
        hit = False
        for animeId, predictedRating in topNPredicted[uid]:
            if (predictedRating >= minRate):
                hit = True
                break
        if (hit):
            hits += 1
        total += 1

    return hits / total

In [None]:
from collections import defaultdict
def get_top_n(predictions, minRate, n=10):

    # First map the predictions to each user.
    topN = defaultdict(list)
    for uid, animeId, actualRating, estimatedRating, _ in predictions:
        if(estimatedRating >= minRate):
            topN[uid].append((animeId, estimatedRating))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, userRatings in topN.items():
        userRatings.sort(key=lambda x: x[1], reverse=True)
        topN[uid] = userRatings[:n]

    return topN
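
To see the effect of the `minRate` threshold and of `n`, the same logic can be run on a few hand-written predictions. This is a sketch under assumptions: `Prediction` below is a local stand-in that mirrors the field order of Surprise's prediction tuples, and `top_n_sketch` is a hypothetical re-implementation for illustration.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the same field order as a Surprise prediction
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

predictions = [
    Prediction("u1", "i1", 8, 7.5, {}),
    Prediction("u1", "i2", 6, 9.0, {}),
    Prediction("u1", "i3", 4, 3.2, {}),
    Prediction("u2", "i1", 9, 8.1, {}),
]

def top_n_sketch(predictions, min_rate, n=10):
    """Keep estimates >= min_rate, then the n best per user, best first."""
    top = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        if est >= min_rate:
            top[uid].append((iid, est))
    for uid in top:
        top[uid] = sorted(top[uid], key=lambda pair: pair[1], reverse=True)[:n]
    return top

result = top_n_sketch(predictions, min_rate=4, n=2)
print(dict(result))
```

The item estimated at 3.2 is filtered out by the threshold before the sort, so it can never enter a top-N list regardless of `n`.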

In [None]:
def getTopNByUserId(prediction_test, user_id):
    """Print the top 10 recommendations of a user (title, mean user rating, number of ratings)."""
    top_n = get_top_n(prediction_test, 4, n=10)
    # top_n[user_id] is a list of (anime_id, estimated rating) tuples
    for anime_id, _ in top_n[user_id]:
        anime = anime_ranks[anime_ranks['anime_id'] == anime_id]
        if anime.empty:
            continue  # anime removed from the anime dataset during cleaning
        row = anime.iloc[0]
        print(f"{row['name']} - {row['user_mean_rating']} - {row['user_count_rating']}")

In [None]:
def manual_train_test_split(data, test_size = 0.5):
    """Shuffle the raw ratings and split them into a train dataset and a test set.

    Note: data is modified in place (its raw_ratings are shuffled and truncated).
    """
    train_data = data
    raw_ratings = train_data.raw_ratings
    random.shuffle(raw_ratings)
    threshold = int((1 - test_size) * len(raw_ratings))
    A_raw_ratings = raw_ratings[:threshold]
    B_raw_ratings = raw_ratings[threshold:]
    train_data.raw_ratings = A_raw_ratings
    test_data = train_data.construct_testset(B_raw_ratings)
    return train_data, test_data
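
Because the helper above shuffles and truncates the dataset it receives, calling it twice on the same object shrinks the training data further. The core shuffle-and-cut step can be sketched without that side effect as follows. The function name, the fixed `seed`, and the toy tuples are assumptions made for illustration.

```python
import random

def split_raw_ratings(raw_ratings, test_size=0.2, seed=0):
    """Shuffle a copy of the ratings and cut it into train and test parts."""
    shuffled = list(raw_ratings)            # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded generator for reproducibility
    threshold = int((1 - test_size) * len(shuffled))
    return shuffled[:threshold], shuffled[threshold:]

# Hypothetical (user, item, rating) tuples
toy = [(u, i, 5) for u in range(5) for i in range(2)]
train, test = split_raw_ratings(toy, test_size=0.2)

print(len(train), len(test))
```

Working on a copy keeps the original ratings available for a second split, for instance after filtering users and items.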

In [None]:
def train_grid_search(train_data, test_data, model_class, hyperparameters):
    grid_search = GridSearchCV(model_class, param_grid=hyperparameters, measures=["rmse", "mae"], cv=5)
    print('Fitting the grid search')
    grid_search.fit(train_data)
    best_model = grid_search.best_estimator["mae"]
    print('Fitting the best model')
    # Refit on the train data passed as argument (not on a global variable)
    best_model.fit(train_data.build_full_trainset())
    pred = best_model.test(test_data)
    
    print('Computing top N')
    top_n = get_top_n(pred, 6, n=10)
    
    getTopNByUserId(pred, 8)
    print('Computing hit rate')
    hitrateGS = hitrate(top_n, pred)
    print('Computing cumulative hit rate')
    cumuHitrateGS = cumulativeHitRate(top_n, pred, minRate=6)
    print('Computing average reciprocal hit rate')
    averageReciproHitrateGS = averageReciprocalHitRate(top_n, pred)
    print('Computing user coverage')
    userCoverageGS = userCoverage(top_n, minRate=6)
    
    print(f"Hitrate : {hitrateGS}")
    print(f"Cumulative Hitrate : {cumuHitrateGS}")
    print(f"Average Reciprocal Hitrate : {averageReciproHitrateGS}")
    print(f"User Coverage : {userCoverageGS}")
    print(f"RMSE : {accuracy.rmse(pred)}")
    print(f"MAE : {accuracy.mae(pred)}")
    # Report the parameters of the model actually selected above (best MAE)
    print(f"Best Params : {grid_search.best_params['mae']}")
    
    return best_model

## 1 - Full dataset

In [None]:
# Only rating_scale is used by load_from_df; ratings range from 1 to 10 once the -1 values are removed
reader = Reader(rating_scale=(1, 10))
dataRatings = Dataset.load_from_df(ratings[["user_id", "anime_id", "rating"]], reader)

In [None]:
train_dataset, test_dataset = manual_train_test_split(dataRatings, test_size = 0.2)

In [None]:
slopeone_hyperparameters_all = { }
best_model_slope_one = train_grid_search(train_dataset, test_dataset, SlopeOne, slopeone_hyperparameters_all)

In [None]:
coc_hyperparameters = { 'n_cltr_u': [3,4,5], 'n_cltr_i': [3,4,5], 'n_epochs': [10,20] }
best_model_coc = train_grid_search(train_dataset, test_dataset, CoClustering, coc_hyperparameters)

In [None]:
baseon_hyperparamaters = {
    'bsl_options': {
        'method': ['als', 'sgd'],
        'reg_u': [5, 10],
        'reg_i': [5, 10],
        "n_epochs": [5, 10, 20],
        'learning_rate': [0.00005, 0.0001, 0.0005, 0.001],
    },
}
best_model_baseone = train_grid_search(train_dataset, test_dataset, BaselineOnly, baseon_hyperparamaters)

In [None]:
svd_hyperparameters = {
    'n_factors': [80],
    'reg_all': [0.06],
    'n_epochs': [50],
    'lr_all': [0.005]
}
best_model_svd = train_grid_search(train_dataset, test_dataset, SVD, svd_hyperparameters)

## 2 - Reducing the data by keeping only users who rated a minimum number of items and items that received a minimum number of ratings

In [None]:
ratings_n_rate = getRatingByNRate(300,300)
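
The filtering step can also be sketched directly with pandas, counting ratings per user and per item on the original table and keeping rows that pass both thresholds. The `toy` table and the helper name `filter_by_min_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature ratings table
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "anime_id": [10, 20, 30, 10, 20, 10],
    "rating":   [8, 7, 9, 6, 8, 10],
})

def filter_by_min_counts(df, min_user_ratings, min_anime_ratings):
    """Keep rows whose user and anime both reach a minimum number of ratings."""
    # transform("count") returns, for every row, the size of its group
    user_counts = df.groupby("user_id")["rating"].transform("count")
    anime_counts = df.groupby("anime_id")["rating"].transform("count")
    return df[(user_counts >= min_user_ratings) & (anime_counts >= min_anime_ratings)]

filtered = filter_by_min_counts(toy, min_user_ratings=2, min_anime_ratings=2)
print(filtered)
```

The counts are taken before filtering, in a single pass, so a user kept here may end up with fewer ratings than the threshold in the reduced table.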

In [None]:
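# Build the Surprise dataset from the filtered ratings computed above
dataRatingsFiltered = Dataset.load_from_df(ratings_n_rate[["user_id", "anime_id", "rating"]], reader)
train_dataset, test_dataset = manual_train_test_split(dataRatingsFiltered, test_size = 0.2)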

### SlopeOne algorithm

In [None]:
slopeone_hyperparameters_all = { }
best_model_slope_one = train_grid_search(train_dataset, test_dataset, SlopeOne, slopeone_hyperparameters_all)

### CoClustering algorithm

In [None]:
coc_hyperparameters = { 'n_cltr_u': [3,4,5], 'n_cltr_i': [3,4,5], 'n_epochs': [10,20] }
best_model_coc = train_grid_search(train_dataset, test_dataset, CoClustering, coc_hyperparameters)

### BaselineOnly algorithm

In [None]:
baseon_hyperparamaters = {
    'bsl_options': {
        'method': ['als', 'sgd'],
        'reg_u': [5, 10],
        'reg_i': [5, 10],
        "n_epochs": [5, 10, 20],
        'learning_rate': [0.00005, 0.0001, 0.0005, 0.001],
    },
}
best_model_baseone = train_grid_search(train_dataset, test_dataset, BaselineOnly, baseon_hyperparamaters)

### SVD algorithm

In [None]:
svd_hyperparameters = {
    'n_factors': [80],
    'reg_all': [0.06],
    'n_epochs': [50],
    'lr_all': [0.005]
}
best_model_svd = train_grid_search(train_dataset, test_dataset, SVD, svd_hyperparameters)

### KNNBasic algorithm

In [None]:
# KNNBasic does not use baseline estimates, so bsl_options is left out of the grid
knnb_hyperparameters= {
    'sim_options' : {
        'name': ['cosine'],
        'user_based': [False], 
    },
    'k': [40,50,60],
    'min_k': [1, 5 ,10],
}
train_grid_search(train_dataset, test_dataset, KNNBasic, knnb_hyperparameters)

# Content based filtering

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,MinMaxScaler,OneHotEncoder,LabelEncoder
from sklearn.compose import ColumnTransformer

In [None]:
anime_genres

In [None]:
anime_genres.columns

In [None]:
toBeRemoved = ['genre','index','anime_id']

In [None]:
quantitativeVariables = ['episodes','rating','members']
quantitativeTransformer = Pipeline([('imp', SimpleImputer(strategy='median')), ('scaler', MinMaxScaler())])

In [None]:
categoricalVariables = ['type', 'name']
categoricalTransformer = Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(categories='auto', drop='first', handle_unknown='error'))])

In [None]:
preprocessor = ColumnTransformer(remainder='passthrough', transformers= [
    ('quantitative',quantitativeTransformer, quantitativeVariables),
    ('categorical', categoricalTransformer, categoricalVariables),
    ('remove', 'drop', toBeRemoved)])

In [None]:
animes_transform = preprocessor.fit_transform(anime_genres)

In [None]:
from sklearn.neighbors import NearestNeighbors
NN = NearestNeighbors(n_neighbors=11)

In [None]:
NN.fit(animes_transform)

In [None]:
distances, indices = NN.kneighbors(animes_transform)

In [None]:
indiceToTitle = dict()
for i in range(len(anime_genres)):
    indiceToTitle[i] = anime_genres.loc[i, "name"]
headDict(indiceToTitle, 10)

In [None]:
titleToIndice = { title : indice for indice, title in indiceToTitle.items()}
headDict(titleToIndice, 10)

In [None]:
NN2 = NearestNeighbors(n_neighbors=11, metric = 'cosine')

In [None]:
NN2.fit(animes_transform)

In [None]:
distances2, indices2 = NN2.kneighbors(animes_transform)
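
The two neighbour searches differ only in the distance used: the default is Euclidean, the second one is cosine. A small pure-Python sketch shows how the two can rank the same neighbours differently. The vectors `a`, `b`, `c` are hypothetical feature vectors, not rows of the transformed dataset.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: depends on direction only, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# b points in the same direction as a but is three times as long;
# c is closer in space but points elsewhere
a = [1.0, 2.0, 0.0]
b = [3.0, 6.0, 0.0]
c = [1.0, 0.0, 2.0]

print(euclidean(a, b), euclidean(a, c))
print(cosine_distance(a, b), cosine_distance(a, c))
```

With Euclidean distance `c` is the nearer neighbour of `a`, while with cosine distance `b` is, which is why the two recommendation lists below can differ.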

In [None]:
def getAnimeTop10Neighbors(indices, animeName: str) -> list:
    print('Top 10 recommended anime based on ' + animeName + ' : (Title, Mean rating, Number of ratings)\n')
    animesList = []
    # The first neighbour is normally the anime itself, so it is skipped below
    print("Row index of the anime: " + str(indices[titleToIndice[animeName]][0]))
    for indice in indices[titleToIndice[animeName]][1:]:
        animeId = titleToAnimeId[indiceToTitle[indice]]
        animeRow = anime_ranks[anime_ranks['anime_id'] == animeId]
        animeInfo = animeRow[['name', 'user_mean_rating', 'user_count_rating']].values.flatten().tolist()
        animesList.append(animeInfo)
        print(animeInfo[0] + ' - ' + str(animeInfo[1]) + ' - ' +str(animeInfo[2]))

    return animesList

In [None]:
topTenAnimes = getAnimeTop10Neighbors(indices, "Fullmetal Alchemist")

In [None]:
topTenAnimes = getAnimeTop10Neighbors(indices2, "Fullmetal Alchemist")

In [None]:
topTenAnimes = getAnimeTop10Neighbors(indices, "Death Note")

In [None]:
topTenAnimes = getAnimeTop10Neighbors(indices2, "Death Note")