# BHT Data Applications project
# Automatic Anime Recommendation Algorithm
### This project aims to create an algorithm that can determine what anime to recommend to a user.
##### Authors: Rashmi Di Michino and Antonin Mathubert

The dataset, which covers about 320,000 users and 16,000 animes, was taken from https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020 <br>
We use this dataset to build a model that recommends an anime based on the animes that the user is watching, has dropped, has put on hold, or has added to their watch list.

### 1. Importing and parsing the data
First, we import all of the available data in a form that is easy to work with in the next steps of the project.<br><br>
To load the data efficiently, we read the CSV file in chunks. We then cast the columns to smaller integer types to reduce memory usage.

In [1]:
from mlxtend.frequent_patterns import apriori, association_rules
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import itertools

In [2]:
path = "C:/Users/rashm/OneDrive/Desktop/data_applications_project/julius/anime_dataset/"

In [3]:
path = "dataset/anime/"

In [4]:
dataset_chunks = pd.read_csv(path+"animelist.csv", chunksize=20000)

chunks = []
for chunk in dataset_chunks:
    chunks.append(chunk)
    
dataset = pd.concat(chunks, ignore_index=True)
dataset = dataset.astype({'user_id': "int32", 'anime_id': 'int32', "watching_status": "int16"})

dataset_chunks = None
chunks = None
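
As a possible variation on the chunked loading above, the types can be set while the file is parsed rather than after concatenation, which avoids holding a wider copy of the data in memory. This is only a minimal sketch: the CSV text is a hypothetical stand-in for `animelist.csv`, and the `int8` type for `rating` is an assumption that holds only if ratings stay between 0 and 10.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for animelist.csv, using the same column names.
csv_text = """user_id,anime_id,rating,watching_status,watched_episodes
0,67,9,1,1
0,6702,7,1,4
1,242,10,2,26
"""

# int32 / int16 mirror the cast in the cell above; int8 for rating is an assumption.
dtypes = {"user_id": "int32", "anime_id": "int32", "rating": "int8", "watching_status": "int16"}

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=list(dtypes),  # skip watched_episodes at parse time
    dtype=dtypes,          # cast while parsing instead of after pd.concat
    chunksize=2,
)
small = pd.concat(reader, ignore_index=True)  # the reader yields one DataFrame per chunk
print(small.dtypes)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path used in cell 4.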

### 2. Recommendation system based on the watched animes
In this first version, we implement a recommendation system based on which animes the users have seen. For example, someone who has watched Cowboy Bebop will be recommended Death Note.
#### Reducing the dataset
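The dataset is too large to process as is, so we reduce it: we keep only the rows with an `anime_id` below 10000 and a `user_id` below 20000, and we discard the entries whose `watching_status` is 4.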

In [5]:
dataset.drop(['watched_episodes'], axis=1, inplace=True)
dataset = dataset[(dataset['anime_id'] < 10000) & (dataset['user_id'] < 20000)]
dataset = dataset[(dataset['user_id'] != 61960) & (dataset['watching_status'] != 4)]
dataset = dataset.drop("watching_status", axis=1)

Here is a sample showing how the dataset is structured, followed by its number of rows:

In [6]:
display(dataset.head(100))
len(dataset)

Unnamed: 0,user_id,anime_id,rating
0,0,67,9
1,0,6702,7
2,0,242,10
3,0,4898,0
4,0,21,10
...,...,...,...
176,1,9253,10
183,1,22,9
184,1,995,8
185,1,4053,9


2509211

#### Pivot the dataset
The next step is pivoting the dataset: we're constructing a matrix that will be used to build the recommendation system, where the rows are the users' ids and the columns are the animes' ids.

In [7]:
dataset = dataset.pivot(index='user_id', columns='anime_id', values='rating')

We are now converting our matrix into a binary matrix in order to be able to retrieve the association rules: we only take into account the ratings that are above 3.

In [8]:
dataset = dataset > 3
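
To make the pivot and binarization steps concrete, here is a small sketch on hypothetical toy ratings (the values are made up, and a rating of 0 is assumed to mean "not rated"). It shows that user/anime pairs missing from the data become `NaN` after the pivot and then `False` after the comparison.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as the reduced dataset.
toy = pd.DataFrame({
    "user_id":  [0, 0, 1, 1, 2],
    "anime_id": [1, 20, 1, 1535, 20],
    "rating":   [9, 0, 8, 10, 2],
})

# One row per user, one column per anime; missing pairs are NaN.
matrix = toy.pivot(index="user_id", columns="anime_id", values="rating")

# NaN > 3 evaluates to False, so unseen animes count as "not liked".
liked = matrix > 3
print(liked)
```

Note that `pivot` raises a `ValueError` if the same `(user_id, anime_id)` pair appears more than once, so duplicates would need to be removed beforehand.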

#### Retrieving the association rules
Finally, we use the mlxtend library to mine the frequent itemsets and derive the association rules that the recommendation system is built on.

In [9]:
frequent_itemsets = apriori(dataset, use_colnames=True, min_support=0.1)  # Lower support thresholds take too much computation time / memory.

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

In [10]:
# Convert the frozensets to plain lists, then keep only the high-confidence rules.
rules["antecedents"] = rules["antecedents"].apply(list)
rules["consequents"] = rules["consequents"].apply(list)
rules = rules[rules["confidence"] > 0.75]
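
The metrics used to filter the rules can be computed by hand on a small example. The sketch below uses a hypothetical boolean matrix (five made-up users, two anime ids) and plain pandas rather than mlxtend, so it only illustrates the standard definitions of support, confidence and lift.

```python
import pandas as pd

# Hypothetical boolean matrix: rows are users, columns are anime ids.
liked = pd.DataFrame({
    1:    [True, True, True, False, False],
    1535: [True, True, False, True, False],
})

antecedent, consequent = 1, 1535

support_a = liked[antecedent].mean()                         # share of users who liked A
support_c = liked[consequent].mean()                         # share of users who liked C
support_ac = (liked[antecedent] & liked[consequent]).mean()  # share who liked both

confidence = support_ac / support_a  # estimated P(C | A)
lift = confidence / support_c        # above 1: A and C co-occur more often than by chance

print(f"support={support_ac:.2f} confidence={confidence:.3f} lift={lift:.3f}")
```

The rule `[1] -> [1535]` reported by the notebook is consistent with these definitions: 0.208125 / 0.266154 ≈ 0.782 for the confidence, and 0.781972 / 0.591106 ≈ 1.323 for the lift.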

Here are some of the rules detected by the algorithm.

In [11]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
38,[1],[1535],0.266154,0.591106,0.208125,0.781972,1.322896,0.050800,1.875418,0.332607
84,[6],[1535],0.153170,0.591106,0.122317,0.798566,1.350969,0.031777,2.029917,0.306780
90,[19],[1535],0.119956,0.591106,0.107384,0.895195,1.514439,0.036477,3.901454,0.385992
120,[442],[20],0.109086,0.409113,0.100247,0.918973,2.246256,0.055619,7.292495,0.622748
138,[20],[1535],0.409113,0.591106,0.321658,0.786232,1.330103,0.079829,1.912791,0.420010
...,...,...,...,...,...,...,...,...,...,...
69988,"[4224, 9253, 2904, 5114, 1535]","[6547, 1575]",0.120999,0.276860,0.101839,0.841652,3.039993,0.068339,4.566765,0.763426
69991,"[4224, 1575, 6547, 5114, 1535]","[2904, 9253]",0.130222,0.245622,0.101839,0.782040,3.183922,0.069854,3.461093,0.788618
69993,"[4224, 6547, 2904, 5114, 1535]","[9253, 1575]",0.122591,0.268186,0.101839,0.830721,3.097561,0.068962,4.323126,0.771779
69998,"[9253, 6547, 2904, 5114, 1535]","[4224, 1575]",0.134450,0.264233,0.101839,0.757452,2.866609,0.066313,3.033491,0.752303


#### Parsing the rules
These functions parse and filter the detected rules so that they are easier to interpret.

```find_recommendations_precise``` generates every ordered arrangement (permutation) of every non-empty subset of the watched anime ids and looks each one up among the rule antecedents. The orderings are needed because the antecedents are stored as lists, and list comparison depends on element order.

```find_recommendations_free``` looks for every rule whose antecedents contain a given anime id, even when the antecedents also contain other ids. (*__This needs improvement__*)

In [12]:
animes_df = pd.read_csv(path+"anime.csv")

def generate_combinations(ids):
	# Rule antecedents are stored as lists, so every ordering of every
	# non-empty subset of ids is generated to allow a comparison with ==.
	result = []
	for r in range(1, len(ids) + 1):
		for p in itertools.permutations(ids, r):
			result.append(list(p))

	print(f"Found {len(result)} possible combinations.")
	return result

def find_recommendations_precise(anime_ids):
	recommendations = []
	# Only keep rules whose consequents contain none of the animes already seen.
	unseen = rules["consequents"].apply(lambda x: all(anime_id not in x for anime_id in anime_ids))

	for combination in tqdm(generate_combinations(anime_ids), desc="Trying every possible combination..."):
		filter_df = rules["antecedents"].apply(lambda x: x == combination) & unseen
		if not filter_df.any():
			continue
		matches = rules[filter_df]
		recommendation = (combination, matches["consequents"].values, matches["confidence"].values, matches["lift"].values)
		recommendations.append(recommendation)

	return recommendations

def find_recommendations_free(anime_ids):
	recommendations = []
	# Only keep rules whose consequents contain none of the animes already seen.
	unseen = rules["consequents"].apply(lambda x: all(anime_id not in x for anime_id in anime_ids))

	for anime_id in anime_ids:
		filter_df = rules["antecedents"].apply(lambda x: anime_id in x) & unseen
		if not filter_df.any():
			continue

		matches = rules[filter_df]
		recommendation = pd.DataFrame({"source": anime_id, "antecedents": matches["antecedents"].values, "consequents": matches["consequents"].values, "confidence": matches["confidence"].values, "lift": matches["lift"].values})
		recommendations.append(recommendation)

	# pd.concat raises a ValueError on an empty list, so return an empty frame when nothing matched.
	if not recommendations:
		return pd.DataFrame(columns=["source", "antecedents", "consequents", "confidence", "lift"])

	return pd.concat(recommendations)
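
The number of permutations grows factorially with the number of watched animes, which is why long lists take so long. One possible alternative, sketched below, is to compare sets instead of ordered lists: a rule is kept when all of its antecedents have been seen and none of its consequents have. The function `find_recommendations_subset` and the `toy_rules` table are hypothetical and not part of the notebook. Under the assumption that antecedent lists contain no duplicate ids, this should select the same rules as the permutation-based lookup in a single pass over the rules.

```python
import pandas as pd

# Hypothetical rules table in the same shape as `rules` after cell 10
# (antecedents and consequents stored as lists of anime ids).
toy_rules = pd.DataFrame({
    "antecedents": [[1], [20, 1], [32, 1], [442]],
    "consequents": [[1535], [5114], [30], [20]],
    "confidence":  [0.78, 0.77, 0.98, 0.92],
    "lift":        [1.32, 1.72, 3.28, 2.25],
})

def find_recommendations_subset(anime_ids, rules_df):
    """Hypothetical helper: keep rules whose antecedents were all seen
    and whose consequents are all unseen, ranked by lift."""
    seen = set(anime_ids)
    all_seen = rules_df["antecedents"].apply(lambda a: set(a) <= seen)
    all_new = rules_df["consequents"].apply(lambda c: seen.isdisjoint(c))
    return rules_df[all_seen & all_new].sort_values("lift", ascending=False)

result = find_recommendations_subset([1, 20, 442], toy_rules)
print(result)
```

In this toy example, the rule with antecedents `[32, 1]` is rejected because anime 32 has not been seen, and the rule `[442] -> [20]` is rejected because anime 20 has already been seen.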


Here we use the previously defined function and parse the results to print them and link them with the anime infos dataset.

In [13]:
seen_animes = [1, 6, 19, 20, 442] #More than 7 at a time takes forever.

In [14]:
for recommendations in find_recommendations_precise(seen_animes):
	for i in range(len(recommendations[1])):
		recommendation = (recommendations[0], recommendations[1][i], recommendations[2][i], recommendations[3][i])
		print("Because you have seen %s, we think you would like %s with %.3f%% confidence. You are also %.3f%% more likely to watch this/these anime(s)." % (
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[0]]), 
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[1]]), 
			recommendation[2] * 100,
			recommendation[3] * 100 - 100)
		)

Found 325 possible combinations.


Because you have seen Cowboy Bebop (1), we think you would like Death Note (1535) with 78.197% confidence. You are also 32.290% more likely to watch this/these anime(s).
Because you have seen Trigun (6), we think you would like Death Note (1535) with 79.857% confidence. You are also 35.097% more likely to watch this/these anime(s).
Because you have seen Monster (19), we think you would like Death Note (1535) with 89.519% confidence. You are also 51.444% more likely to watch this/these anime(s).
Because you have seen Naruto (20), we think you would like Death Note (1535) with 78.623% confidence. You are also 33.010% more likely to watch this/these anime(s).
Because you have seen Cowboy Bebop (1) and Naruto (20), we think you would like Death Note (1535) with 87.515% confidence. You are also 48.052% more likely to watch this/these anime(s).
Because you have seen Cowboy Bebop (1) and Naruto (20), we think you would like Fullmetal Alchemist: Brotherhood (5114) with 77.409% confidence. You 

In [None]:
# This part needs improvement, because the results are not concise enough.
# We could build a list of every anime that has been watched by people who have seen a certain anime, and return only the most common ones instead of all of them.
# find_recommendations_free returns a DataFrame, so we iterate over its rows.
for recommendation in find_recommendations_free(seen_animes).itertuples(index=False):
	print("People who have seen %s also liked watching %s with %.3f%% confidence." % (
		" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation.antecedents]), 
		" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation.consequents]), 
		recommendation.confidence * 100)
	)
	print("************************************************************************************************")

In [15]:
find_recommendations_free(seen_animes)

Unnamed: 0,source,antecedents,consequents,confidence,lift
0,1,[1],[1535],0.781972,1.322896
1,1,"[1, 20]",[1535],0.875146,1.480523
2,1,"[1, 20]",[5114],0.774093,1.719105
3,1,"[32, 1]",[30],0.982491,3.276470
4,1,"[1, 30]",[1535],0.819661,1.386656
...,...,...,...,...,...
589,20,"[1535, 20, 9253, 1575]","[2904, 5114]",0.795661,2.980864
590,20,"[2904, 5114, 20, 9253]","[1535, 1575]",0.921425,2.584105
591,20,"[2904, 20, 9253, 1535]","[5114, 1575]",0.856476,2.923122
592,20,"[5114, 20, 9253, 1535]","[2904, 1575]",0.766920,2.018123
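
The improvement suggested in the comment of the previous cell (returning only the most common recommendations) could be sketched as follows. The helper `top_recommendations` and the `free` frame are hypothetical; the frame only mimics the shape returned by `find_recommendations_free`, and ranking by rule count and then by best confidence is one possible choice among many.

```python
import pandas as pd

# Hypothetical frame in the shape returned by find_recommendations_free.
free = pd.DataFrame({
    "source":      [1, 1, 1, 20],
    "antecedents": [[1], [1, 20], [32, 1], [20]],
    "consequents": [[1535], [1535], [30], [1535, 5114]],
    "confidence":  [0.78, 0.88, 0.98, 0.80],
    "lift":        [1.32, 1.48, 3.28, 1.40],
})

def top_recommendations(free_df, n=2):
    """Hypothetical helper: one row per recommended anime, ranked by how many
    rules suggest it, then by the best confidence among those rules."""
    # explode turns each list of consequents into one row per anime id.
    exploded = free_df.explode("consequents").rename(columns={"consequents": "anime_id"})
    summary = exploded.groupby("anime_id").agg(
        n_rules=("confidence", "size"),
        best_confidence=("confidence", "max"),
    )
    return summary.sort_values(["n_rules", "best_confidence"], ascending=False).head(n)

top = top_recommendations(free)
print(top)
```

The resulting anime ids can then be looked up in `animes_df` through the `MAL_ID` column, as in the printing cells above.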


### 3. Graphs
