# BHT Data Applications project
# Automatic Anime Recommendation Algorithm
### This project aims to create an algorithm that can determine what anime to recommend to a user.
##### Authors: Rashmi Di Michino and Antonin Mathubert

The dataset, which covers about 320,000 users and 16,000 animes, was taken from https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020 <br>
We use this dataset to build a model that recommends an anime based on the animes that the user is watching, has dropped, has put on hold, or has added to their watch list.

### 1. Importing and parsing the data
First, we import all of the available data in a form that is convenient for the next steps of the project.<br><br>
To keep the loading efficient, we read the CSV file in chunks rather than in a single pass. We then change the default column types to smaller ones to reduce memory usage.

In [24]:
from mlxtend.frequent_patterns import apriori, association_rules
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import itertools
import random

In [2]:
path = "C:/Users/rashm/OneDrive/Desktop/data_applications_project/julius/anime_dataset/"

In [3]:
path = "dataset/anime/"

In [4]:
dataset_chunks = pd.read_csv(path+"animelist.csv", chunksize=20000)

chunks = []
for chunk in dataset_chunks:
    chunks.append(chunk)
    
dataset = pd.concat(chunks, ignore_index=True)
dataset = dataset.astype({'user_id': "int32", 'anime_id': 'int32', "watching_status": "int16"})

dataset_chunks = None
chunks = None
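
As a minimal sketch of the same idea, the column types can also be passed to `pd.read_csv` through its `dtype` parameter, so that each chunk is already compact when it is read. The in-memory CSV below is made-up data that only mimics the columns of `animelist.csv`; it is an assumption for illustration, not part of the project.

```python
import io
import pandas as pd

# Made-up sample with the same columns as animelist.csv (assumption for illustration).
csv_text = """user_id,anime_id,rating,watching_status,watched_episodes
0,67,9,1,1
0,6702,7,1,4
1,242,10,2,4
1,4898,0,6,0
2,21,10,1,12
"""

dtypes = {"user_id": "int32", "anime_id": "int32", "watching_status": "int16"}

# With chunksize, read_csv returns an iterator of DataFrames instead of one frame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=dtypes)
sample = pd.concat([chunk for chunk in reader], ignore_index=True)

print(sample.dtypes)
print(len(sample))
```

Downcasting at read time avoids ever holding the full table with the default 64-bit integer columns.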

### 2. Recommendation system based on the watched animes
In this first version we implement a recommendation system based on which animes the users have seen. For example, someone who has watched Cowboy Bebop might be recommended Death Note.
#### Reducing the dataset
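The dataset we're working with is too large to process in full, so we reduce it: we keep only the anime ids below 10000 and the user ids below 20000, and we discard the entries whose watching status is 4.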

In [5]:
dataset.drop(['watched_episodes'], axis=1, inplace=True)
dataset = dataset[(dataset['anime_id'] < 10000) & (dataset['user_id'] < 20000)]
dataset = dataset[(dataset['user_id'] != 61960) & (dataset['watching_status'] != 4)]
dataset = dataset.drop("watching_status", axis=1)

Here we can see a sample of how the dataset is structured

In [6]:
display(dataset.head(10))
len(dataset)

Unnamed: 0,user_id,anime_id,rating
0,0,67,9
1,0,6702,7
2,0,242,10
3,0,4898,0
4,0,21,10
5,0,24,9
6,0,2104,0
7,0,4722,8
8,0,6098,6
9,0,3125,9


2509211

#### Pivot the dataset
The next step is pivoting the dataset: we construct the matrix that will be used to build the recommendation system, where the rows are the user ids, the columns are the anime ids, and each cell holds the rating the user gave to that anime (or is missing when the user has no entry for it).

In [7]:
dataset = dataset.pivot(index='user_id', columns='anime_id', values='rating')

We now convert the matrix into a binary matrix so that the association rules can be mined: only the ratings above 3 count as True. Everything else, including missing entries, becomes False.

In [8]:
dataset = dataset > 3
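
The pivot and the thresholding can be illustrated on a tiny, made-up ratings table (the values below are assumptions chosen for the example). Note how a missing entry (NaN after the pivot) and a rating of 0 both end up as `False`.

```python
import pandas as pd

# Made-up ratings: each row is one (user, anime, rating) entry.
ratings = pd.DataFrame({
    "user_id":  [0, 0, 1, 1],
    "anime_id": [1, 20, 1, 30],
    "rating":   [9, 2, 0, 8],
})

# Rows become users, columns become animes; absent pairs are filled with NaN.
matrix = ratings.pivot(index="user_id", columns="anime_id", values="rating")

# Comparisons with NaN are False, so missing entries are treated as "not liked".
binary = matrix > 3

print(binary)
```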

#### Retrieving the association rules
Finally, we use the mlxtend library to mine the frequent itemsets with the Apriori algorithm and to derive the association rules from them.

In [17]:
# A support below 0.1 takes too much computation time and memory, and the resulting itemsets are less meaningful.
frequent_itemsets = apriori(dataset, use_colnames=True, min_support=0.1)

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
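
To make the rule metrics concrete, here is a small sketch that computes support, confidence and lift by hand on made-up data. The `baskets` list and the helper names are hypothetical; mlxtend computes the same quantities on the binary matrix.

```python
# Each set holds the animes one (made-up) user liked.
baskets = [
    {1, 30},
    {1, 30, 20},
    {1, 20},
    {30},
    {20},
]

def support(items):
    """Fraction of users whose liked set contains every item."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent))
    confidence = both / support(antecedent)   # P(consequent | antecedent)
    lift = confidence / support(consequent)   # > 1 means a positive association
    return both, confidence, lift

s, c, l = rule_metrics({30}, {1})
print(f"support={s:.2f} confidence={c:.3f} lift={l:.3f}")
```

The same relation can be checked on the real output: the rule [30] → [1] in the rules table of this section has support 0.161954 and antecedent support 0.299863, and 0.161954 / 0.299863 ≈ 0.5401, which is its listed confidence.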

In [18]:
# Convert the frozensets returned by mlxtend into plain lists.
rules["antecedents"] = rules["antecedents"].apply(list)
rules["consequents"] = rules["consequents"].apply(list)
# Keep only the rules with a confidence above 0.5.
rules = rules[rules["confidence"] > 0.5]

Here are some of the rules detected by the algorithm.

In [19]:
display(rules.head(20))
print(f"{len(rules)} rules found.")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1,[6],[1],0.15317,0.266154,0.108153,0.706093,2.652947,0.067386,2.496865,0.735757
2,[1],[20],0.266154,0.409113,0.140708,0.528672,1.292237,0.031821,1.253662,0.308169
4,[1],[30],0.266154,0.299863,0.161954,0.608498,2.029256,0.082145,1.788338,0.691165
5,[30],[1],0.299863,0.266154,0.161954,0.540095,2.029256,0.082145,1.595647,0.724442
6,[32],[1],0.186714,0.266154,0.109745,0.587768,2.208375,0.06005,1.780178,0.6728
9,[43],[1],0.158111,0.266154,0.105298,0.665972,2.502204,0.063216,2.19696,0.713102
11,[47],[1],0.177656,0.266154,0.114411,0.644005,2.419668,0.067127,2.061393,0.713473
12,[1],[121],0.266154,0.315399,0.14252,0.535479,1.697779,0.058575,1.473776,0.560057
14,[1],[164],0.266154,0.274115,0.137744,0.517533,1.888016,0.064787,1.504528,0.64093
15,[164],[1],0.274115,0.266154,0.137744,0.502504,1.888016,0.064787,1.475077,0.647958


36206 rules found.


#### Parsing the rules
These functions parse and filter the detected rules so that they are easier to interpret.

```find_recommendations_precise``` tries every ordered arrangement of the watched anime ids (the antecedents are stored as lists, so the order matters when comparing them) and looks for rules whose antecedents match one of these arrangements exactly.

```find_recommendations_free``` looks for every occurrence of each anime id in the rules, even when the antecedents contain other ids as well. (*__This needs improvement__*)

In [30]:
animes_df = pd.read_csv(path+"anime.csv")
animes_df = animes_df[animes_df["MAL_ID"] < 10000]

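def generate_combinations(ids):
	# Despite the name, this returns every ordered arrangement (permutation) of every
	# non-empty subset of ids, because the rule antecedents are compared as ordered lists.
	result = []
	for r in range(1, len(ids) + 1):
		for p in itertools.permutations(ids, r):
			result.append(list(p))

	print(f"Found {len(result)} possible combinations.")
	return result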

def find_recommendations_precise(anime_ids):
	recommendations = []
	# Rules whose consequents contain none of the already-seen animes.
	unseen = rules["consequents"].apply(lambda c: all(a not in c for a in anime_ids))

	for combination in tqdm(generate_combinations(anime_ids), desc="Trying every possible combination..."):
		# The antecedents are lists, so this comparison is order-sensitive.
		mask = rules["antecedents"].apply(lambda x: x == combination) & unseen
		if not mask.any():
			continue
		matched = rules[mask]
		recommendation = (combination, matched["consequents"].values, matched["confidence"].values, matched["lift"].values)
		recommendations.append(recommendation)

	return recommendations

def find_recommendations_free(anime_ids):
	recommendations = []
	# Rules whose consequents contain none of the already-seen animes.
	unseen = rules["consequents"].apply(lambda c: all(a not in c for a in anime_ids))

	for anime_id in anime_ids:
		# Keep every rule that mentions this anime anywhere in its antecedents.
		matched = rules[rules["antecedents"].apply(lambda x: anime_id in x) & unseen]
		if matched.empty:
			continue

		recommendation = pd.DataFrame({"source": anime_id, "antecedents": matched["antecedents"].values, "consequents": matched["consequents"].values, "confidence": matched["confidence"].values, "lift": matched["lift"].values})
		recommendations.append(recommendation)

	recommendations = pd.concat(recommendations)
	# For each source anime, sum the confidence of every rule that recommends a given anime.
	recommendations_dict = {anime: {} for anime in anime_ids}
	recommendations_df = []
	for anime in recommendations_dict:
		rows = recommendations[recommendations["source"] == anime]
		for _, row in rows.iterrows():
			for x in row["consequents"]:
				recommendations_dict[anime][x] = recommendations_dict[anime].get(x, 0) + row["confidence"]

		# Replace the ids with the anime names for display.
		for anime_recommended, weight in recommendations_dict[anime].items():
			source_name = animes_df[animes_df["MAL_ID"] == anime]["Name"].values[0]
			recommended_name = animes_df[animes_df["MAL_ID"] == anime_recommended]["Name"].values[0]
			recommendations_df.append([source_name, recommended_name, weight])

	# Sort by weight, then group by source anime so each group can be displayed separately.
	recommendations_df = pd.DataFrame(recommendations_df, columns=['source', 'recommended_id', 'weight']).sort_values(by="weight", ascending=False).groupby("source")

	return recommendations_df
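
One possible alternative to enumerating ordered arrangements, sketched here under the assumption that the antecedents are kept as `frozenset`s (the form in which mlxtend returns them) instead of being converted to lists: a rule applies whenever its antecedents are a subset of the watched ids, which takes a single pass over the rules. The `toy_rules` list and the `match_rules` helper are hypothetical and only serve the illustration.

```python
import math

seen = frozenset([30, 32, 42, 43, 47, 1535])

# Ordered arrangements tried by a permutation-based search vs. unordered subsets.
n = len(seen)
n_permutations = sum(math.perm(n, r) for r in range(1, n + 1))
n_subsets = 2 ** n - 1
print(n_permutations, n_subsets)

# Hypothetical rules: (antecedents, consequents, confidence).
toy_rules = [
    (frozenset([30]), frozenset([1]), 0.54),
    (frozenset([30, 32]), frozenset([1575]), 0.70),
    (frozenset([30, 5]), frozenset([20]), 0.60),  # needs anime 5, which was not seen
    (frozenset([43]), frozenset([30]), 0.65),     # recommends an anime already seen
]

def match_rules(rules, seen):
    """Keep rules whose antecedents were all seen and whose consequents are all new."""
    return [r for r in rules if r[0] <= seen and r[1].isdisjoint(seen)]

matched = match_rules(toy_rules, seen)
for antecedents, consequents, confidence in matched:
    print(sorted(antecedents), "->", sorted(consequents), confidence)
```

With six watched animes the permutation count is already in the thousands while the subset test never enumerates anything, which is why this form scales better as the watch list grows.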


Here we use the previously defined functions and format their results, linking the ids with the anime information dataset so that readable names can be printed.

In [40]:
seen_animes = [30, 32, 42, 43, 47, 1535]  # More than 7 ids at a time takes too long, since every ordered arrangement is tried.
print(seen_animes)

[30, 32, 42, 43, 47, 1535]


In [41]:
for recommendations in find_recommendations_precise(seen_animes):
	for i in range(len(recommendations[1])):
		recommendation = (recommendations[0], recommendations[1][i], recommendations[2][i], recommendations[3][i])
		print("Because you have seen %s, we think you would like %s with %.3f%% confidence. You are also %.3f%% more likely to watch this/these anime(s)." % (
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[0]]), 
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[1]]), 
			recommendation[2] * 100,
			recommendation[3] * 100 - 100)
		)

Found 1956 possible combinations.



Because you have seen Neon Genesis Evangelion (30), we think you would like Cowboy Bebop (1) with 54.010% confidence. You are also 102.926% more likely to watch this/these anime(s).
Because you have seen Neon Genesis Evangelion (30), we think you would like Naruto (20) with 50.732% confidence. You are also 24.006% more likely to watch this/these anime(s).
Because you have seen Neon Genesis Evangelion (30), we think you would like Sen to Chihiro no Kamikakushi (199) with 60.381% confidence. You are also 52.649% more likely to watch this/these anime(s).
Because you have seen Neon Genesis Evangelion (30), we think you would like Elfen Lied (226) with 55.547% confidence. You are also 55.517% more likely to watch this/these anime(s).
Because you have seen Neon Genesis Evangelion (30), we think you would like Code Geass: Hangyaku no Lelouch (1575) with 67.851% confidence. You are also 53.681% more likely to watch this/these anime(s).
[output truncated]
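
The two percentages in each sentence come directly from the rule metrics. The sketch below shows the conversion, using the rounded values of the rule [30] → [1] from the rules table; the `describe_rule` helper is hypothetical and only mirrors the arithmetic done in the print statement.

```python
def describe_rule(confidence, lift):
    """Return the two percentages printed in the recommendation sentence."""
    confidence_pct = confidence * 100
    # A lift of 1.0 means "no more likely than average", so only the excess is reported.
    extra_likelihood_pct = (lift - 1) * 100
    return confidence_pct, extra_likelihood_pct

# Rounded values of the rule [30] -> [1] as displayed in the rules table.
conf_pct, extra_pct = describe_rule(confidence=0.540095, lift=2.029256)
print(f"{conf_pct:.2f}% confidence, {extra_pct:.2f}% more likely")
```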

In [42]:
# This part needs improvement, because the results are not concise enough.
# We could build a list of every anime watched by the people who have seen a given anime, and return only the most common ones instead of all of them.
for index, recommendation_df in find_recommendations_free(seen_animes):
	display(recommendation_df.head(5))

Unnamed: 0,source,recommended_id,weight
49,Akira,Code Geass: Hangyaku no Lelouch,2.93287
51,Akira,Code Geass: Hangyaku no Lelouch R2,2.035737
46,Akira,Sen to Chihiro no Kamikakushi,1.453198
52,Akira,Fullmetal Alchemist: Brotherhood,1.404385
44,Akira,Cowboy Bebop,0.644005


Unnamed: 0,source,recommended_id,weight
57,Death Note,Code Geass: Hangyaku no Lelouch,1468.709617
58,Death Note,Code Geass: Hangyaku no Lelouch R2,1156.088166
61,Death Note,Angel Beats!,889.973223
60,Death Note,Fullmetal Alchemist: Brotherhood,837.759042
62,Death Note,Steins;Gate,667.665402


Unnamed: 0,source,recommended_id,weight
41,Koukaku Kidoutai,Sen to Chihiro no Kamikakushi,0.696875
40,Koukaku Kidoutai,Cowboy Bebop,0.665972
42,Koukaku Kidoutai,Code Geass: Hangyaku no Lelouch,0.662153
43,Koukaku Kidoutai,Fullmetal Alchemist: Brotherhood,0.659375


Unnamed: 0,source,recommended_id,weight
4,Neon Genesis Evangelion,Code Geass: Hangyaku no Lelouch,138.449764
6,Neon Genesis Evangelion,Code Geass: Hangyaku no Lelouch R2,115.050989
9,Neon Genesis Evangelion,Fullmetal Alchemist: Brotherhood,69.012855
11,Neon Genesis Evangelion,Steins;Gate,57.188884
10,Neon Genesis Evangelion,Angel Beats!,48.9072


Unnamed: 0,source,recommended_id,weight
30,Neon Genesis Evangelion: The End of Evangelion,Code Geass: Hangyaku no Lelouch,9.55262
33,Neon Genesis Evangelion: The End of Evangelion,Code Geass: Hangyaku no Lelouch R2,9.081726
36,Neon Genesis Evangelion: The End of Evangelion,Fullmetal Alchemist: Brotherhood,2.803305
38,Neon Genesis Evangelion: The End of Evangelion,Steins;Gate,1.970355
27,Neon Genesis Evangelion: The End of Evangelion,Sen to Chihiro no Kamikakushi,1.298458
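
The weights above are sums of confidences, so they grow with the number of matching rules, which explains why the values for Death Note are orders of magnitude larger than those for Akira. One way to get a shorter, comparable list, sketched here as an assumption rather than as the project's method, is to merge the matches of all watched animes and rank each recommendation by its mean confidence. The `matches` data and the `top_recommendations` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical (source anime, recommended anime, rule confidence) triples.
matches = [
    ("Akira", "Cowboy Bebop", 0.64),
    ("Akira", "Code Geass", 0.58),
    ("Akira", "Code Geass", 0.71),
    ("Death Note", "Code Geass", 0.80),
    ("Death Note", "Steins;Gate", 0.66),
]

def top_recommendations(matches, n=2):
    """Rank recommended animes by mean confidence over all matching rules."""
    totals = defaultdict(lambda: [0.0, 0])  # name -> [sum of confidences, rule count]
    for _source, recommended, confidence in matches:
        totals[recommended][0] += confidence
        totals[recommended][1] += 1
    means = {name: total / count for name, (total, count) in totals.items()}
    # Highest mean confidence first, truncated to the n best entries.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:n]

top = top_recommendations(matches)
print(top)
```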


### 3. Graphs
