# BHT Data Applications project
# Automatic Anime recommendation Algorithm
### This project aims to create an algorithm that can determine what anime to recommend to a user.
##### Authors: Rashmi Di Michino and Antonin Mathubert

The dataset, which covers about 320,000 users and 16,000 animes, was taken from https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020 <br>
We use this dataset to build a model that recommends an anime based on the animes a user is watching, has dropped, has put on hold, or has added to their watch list.

### 1. Importing and parsing the data
First, we import the available data in a form that is easy to work with in the next steps of the project.<br><br>
To load the data efficiently, we read the CSV file in chunks. We then cast the columns to smaller integer types to reduce memory usage.

In [10]:
from mlxtend.frequent_patterns import apriori, association_rules
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import itertools

In [11]:
path = "C:/Users/rashm/OneDrive/Desktop/data_applications_project/julius/anime_dataset/"

In [None]:
path = "dataset/anime/"

In [12]:
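# Read the CSV in chunks of 20,000 rows rather than all at once.
dataset_chunks = pd.read_csv(path+"animelist.csv", chunksize=20000)

chunks = []
for chunk in dataset_chunks:
    chunks.append(chunk)

# Assemble the chunks and cast the id and status columns to smaller integer types.
dataset = pd.concat(chunks, ignore_index=True)
dataset = dataset.astype({'user_id': "int32", 'anime_id': 'int32', "watching_status": "int16"})

# Release the references to the intermediate objects.
dataset_chunks = None
chunks = None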
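
The same load can also be sketched with the dtypes applied while each chunk is parsed, so the columns never exist in their wider default types. This is a minimal sketch rather than the notebook's code: `load_animelist` is a hypothetical helper, and it assumes the CSV contains the `user_id`, `anime_id` and `watching_status` columns used in this notebook.

```python
import pandas as pd

# Compact dtypes for the columns this notebook relies on.
DTYPES = {"user_id": "int32", "anime_id": "int32", "watching_status": "int16"}

def load_animelist(csv_path, chunksize=20000):
    """Read the CSV chunk by chunk, applying the compact dtypes during parsing."""
    reader = pd.read_csv(csv_path, dtype=DTYPES, chunksize=chunksize)
    # pd.concat consumes the chunk iterator and builds one DataFrame.
    return pd.concat(reader, ignore_index=True)
```

Calling `load_animelist(path + "animelist.csv")` would then return a single DataFrame with the reduced column types.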

### 2. Recommendation system based on the watched animes
In this first version, we implement a recommendation system based on which animes the users have seen. For example, someone who has watched Cowboy Bebop would be recommended Death Note.
#### Reducing the dataset
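The full dataset is too large to work with, so we reduce it: we drop the `rating` and `watched_episodes` columns, keep only anime ids below 10000 and user ids below 20000, and discard the entries whose `watching_status` is 4.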

In [13]:
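# Drop the columns that are not used by the recommendation system.
dataset.drop(['rating', 'watched_episodes'], axis=1, inplace=True)
# Keep a subset of animes and users to reduce the dataset size.
dataset = dataset[(dataset['anime_id'] < 10000) & (dataset['user_id'] < 20000)]
# Discard entries with watching_status 4 (user 61960 is already excluded by the user_id filter above).
dataset = dataset[(dataset['user_id'] != 61960) & (dataset['watching_status'] != 4)]
dataset = dataset.drop("watching_status", axis=1)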

Here we can see a sample of how the dataset is structured

In [14]:
display(dataset.head(100))
len(dataset)

| | user_id | anime_id |
|---|---|---|
| 0 | 0 | 67 |
| 1 | 0 | 6702 |
| 2 | 0 | 242 |
| 3 | 0 | 4898 |
| 4 | 0 | 21 |
| ... | ... | ... |
| 176 | 1 | 9253 |
| 183 | 1 | 22 |
| 184 | 1 | 995 |
| 185 | 1 | 4053 |


2509211

The next step is pivoting the dataset: we're constructing a matrix that will be used to build the recommendation system, where the rows are the users' ids and the columns are the animes' ids.

In [15]:
dataset = dataset.pivot(index='user_id', columns='anime_id', values='anime_id')

We are now converting our matrix into a binary matrix in order to be able to retrieve the association rules

In [16]:
# True where the user has the anime on their list, False elsewhere.
dataset = dataset.notnull()
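
To make the pivot and the binary conversion concrete, here is a toy run on a handful of made-up `(user_id, anime_id)` pairs. The data is illustrative only, and `.notnull()` is used here as a shortcut that yields boolean columns directly.

```python
import pandas as pd

# Toy (user_id, anime_id) pairs standing in for the reduced dataset.
pairs = pd.DataFrame({
    "user_id":  [0, 0, 1, 1, 2],
    "anime_id": [1, 6, 1, 1535, 6],
})

# One row per user, one column per anime; missing pairs become NaN.
wide = pairs.pivot(index="user_id", columns="anime_id", values="anime_id")

# True where the user has the anime on their list, False elsewhere.
binary = wide.notnull()

print(binary)
```

Each row of `binary` is one user's "basket" of animes, which is the input shape that frequent itemset mining works on.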

Finally, we are exploiting the mlxtend library to build the recommendation system and we're retrieving the association rules

In [17]:
# Going below 0.175 support takes too much computation time and memory.
frequent_itemsets = apriori(dataset, use_colnames=True, min_support=0.175)

# With the default arguments, rules are kept when their confidence is at least 0.8.
rules = association_rules(frequent_itemsets)

# Convert the frozensets to lists so they are easier to compare and print.
rules["antecedents"] = rules["antecedents"].apply(list)
rules["consequents"] = rules["consequents"].apply(list)

Here are some of the rules detected by the algorithm.

In [21]:
rules.head(30)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [6] | [1] | 0.266044 | 0.44134 | 0.216195 | 0.812629 | 1.841278 | 0.098779 | 2.981573 | 0.622516 |
| 1 | [47] | [1] | 0.253692 | 0.44134 | 0.205051 | 0.808267 | 1.831394 | 0.093086 | 2.913736 | 0.608285 |
| 2 | [1] | [1535] | 0.44134 | 0.715784 | 0.372989 | 0.84513 | 1.180706 | 0.057086 | 1.835193 | 0.273957 |
| 3 | [6] | [1535] | 0.266044 | 0.715784 | 0.228109 | 0.857408 | 1.197859 | 0.037678 | 1.993216 | 0.225051 |
| 4 | [6] | [1575] | 0.266044 | 0.5888 | 0.214494 | 0.806232 | 1.369279 | 0.057846 | 2.122123 | 0.367445 |
| 5 | [6] | [5114] | 0.266044 | 0.619215 | 0.216086 | 0.812216 | 1.311687 | 0.051347 | 2.027785 | 0.323757 |
| 6 | [19] | [1535] | 0.285095 | 0.715784 | 0.259292 | 0.909494 | 1.270626 | 0.055226 | 3.140288 | 0.297923 |
| 7 | [19] | [1575] | 0.285095 | 0.5888 | 0.233105 | 0.817639 | 1.388652 | 0.065241 | 2.254867 | 0.391489 |
| 8 | [19] | [5114] | 0.285095 | 0.619215 | 0.24123 | 0.846139 | 1.366471 | 0.064695 | 2.474865 | 0.375137 |
| 9 | [19] | [9253] | 0.285095 | 0.58891 | 0.237387 | 0.832659 | 1.413899 | 0.069492 | 2.456604 | 0.409475 |
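
To read these columns, it helps to recompute a few of them by hand. The sketch below takes the three support values of rule 0 (`[6]` to `[1]`) from the output above and derives confidence, lift, leverage and conviction with their standard definitions. The variable names are illustrative, and small differences in the last digits come from the rounding of the printed values.

```python
# Values copied from rule 0 above: [6] -> [1]
antecedent_support = 0.266044   # share of users whose list contains anime 6
consequent_support = 0.44134    # share of users whose list contains anime 1
support = 0.216195              # share of users whose list contains both

# Among users who have the antecedent, how many also have the consequent.
confidence = support / antecedent_support

# How much more likely the consequent is, compared with its overall frequency.
lift = confidence / consequent_support

# Difference between the observed co-occurrence and the one expected under independence.
leverage = support - antecedent_support * consequent_support

# Ratio of expected to observed frequency of the rule being wrong.
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```

A lift above 1, as for every rule listed here, means the two animes appear together more often than they would if users picked them independently.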


These functions parse and filter the detected rules so that the results are easier to read.

```find_recommendations_precise``` generates every ordered combination (permutation) of the watched anime ids and looks for rules whose antecedents match one of them exactly.

```find_recommendations_free``` looks for every rule whose antecedents contain a given anime id, even when the antecedents also include other ids. (*__This needs improvement__*)

In [18]:
animes_df = pd.read_csv(path+"anime.csv")

def generate_combinations(ids):
	"""Return every ordered arrangement (permutation) of every non-empty subset of ids."""
	result = []
	for r in range(1, len(ids) + 1):
		# Rule antecedents are lists in arbitrary order, so every ordering is tried.
		for p in itertools.permutations(ids, r):
			result.append(list(p))

	print(f"Found {len(result)} possible combinations.")
	return result

def find_recommendations_precise(anime_ids):
	"""Return (antecedents, consequents, confidences) for rules whose antecedents match exactly."""
	recommendations = []

	for combination in tqdm(generate_combinations(anime_ids), desc="Trying every possible combination..."):
		mask = rules["antecedents"].apply(lambda x: x == combination)
		if not mask.any():
			continue
		recommendation = (combination, rules[mask]["consequents"].values, rules[mask]["confidence"].values)
		recommendations.append(recommendation)

	return recommendations

def find_recommendations_free(anime_ids):
	"""For each given id, collect the rules whose antecedents contain it."""
	recommendations = []

	for anime_id in anime_ids:
		mask = rules["antecedents"].apply(lambda x: anime_id in x)
		if not mask.any():
			continue

		recommendation = (rules[mask]["antecedents"].values, rules[mask]["consequents"].values, rules[mask]["confidence"].values)
		recommendations.append(recommendation)

	return recommendations
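
Trying every permutation grows very quickly with the number of watched animes. One possible alternative, sketched below under stated assumptions, is to treat the antecedents as a set and keep a rule whenever all of its antecedents are among the watched ids. That selects the same rules as the exact match in a single pass over the rules table. `find_recommendations_subset` is a hypothetical helper, not part of the notebook, and the toy table only reuses confidence values printed elsewhere in this notebook.

```python
import pandas as pd

def find_recommendations_subset(rules_df, anime_ids):
    """Return the rules whose antecedents are all contained in anime_ids."""
    seen = set(anime_ids)
    # A rule applies when its antecedent set is a subset of the watched ids.
    mask = rules_df["antecedents"].apply(lambda antecedents: set(antecedents) <= seen)
    return rules_df.loc[mask, ["antecedents", "consequents", "confidence"]]

# Toy rules table with the three columns used by the functions above.
toy_rules = pd.DataFrame({
    "antecedents": [[6], [47], [1], [19], [6, 1]],
    "consequents": [[1], [1], [1535], [1535], [1535]],
    "confidence": [0.812629, 0.808267, 0.84513, 0.909494, 0.87684],
})

matches = find_recommendations_subset(toy_rules, [1, 6, 47])
print(matches)
```

Because the comparison is done on sets, the order of the ids inside a rule no longer matters, so no permutations need to be generated.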


Here we call the previously defined function and format its results, using the anime information dataset to turn the ids into titles.

In [19]:
seen_animes = [1, 242, 22, 995, 6, 47, 33] #More than 7 at a time takes forever.

for recommendations in find_recommendations_precise(seen_animes):
	for i in range(len(recommendations[1])):
		recommendation = (recommendations[0], recommendations[1][i], recommendations[2][i])
		print("Because you have seen %s, we think you would like %s with %.3f%% confidence." % (
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[0]]), 
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[1]]), 
			recommendation[2] * 100)
		)

Found 13699 possible combinations.



Because you have seen Cowboy Bebop (1), we think you would like Death Note (1535) with 84.513% confidence.
Because you have seen Trigun (6), we think you would like Cowboy Bebop (1) with 81.263% confidence.
Because you have seen Trigun (6), we think you would like Death Note (1535) with 85.741% confidence.
Because you have seen Trigun (6), we think you would like Code Geass: Hangyaku no Lelouch (1575) with 80.623% confidence.
Because you have seen Trigun (6), we think you would like Fullmetal Alchemist: Brotherhood (5114) with 81.222% confidence.
Because you have seen Akira (47), we think you would like Cowboy Bebop (1) with 80.827% confidence.
Because you have seen Akira (47), we think you would like Death Note (1535) with 85.068% confidence.
Because you have seen Akira (47), we think you would like Fullmetal Alchemist: Brotherhood (5114) with 80.264% confidence.
Because you have seen Kenpuu Denki Berserk (33), we think you would like Death Note (1535) with 88.560% confidence.
Because
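
The print statements above filter `animes_df` once per id, which repeats the same scan many times. A minimal sketch of an alternative is to build the id-to-title mapping once and reuse it. The `titles` dictionary and the `describe` helper are hypothetical names, and the toy DataFrame only stands in for `anime.csv`, assuming its `MAL_ID` and `Name` columns.

```python
import pandas as pd

# Toy stand-in for anime.csv, limited to the two columns used above.
animes_df = pd.DataFrame({
    "MAL_ID": [1, 6, 1535],
    "Name": ["Cowboy Bebop", "Trigun", "Death Note"],
})

# Build the id -> title mapping once instead of filtering animes_df per id.
titles = animes_df.set_index("MAL_ID")["Name"].to_dict()

def describe(anime_ids):
    """Format ids the same way as the print statements above: 'Title (id)'."""
    return " and ".join(f"{titles[anime_id]} ({anime_id})" for anime_id in anime_ids)

print(describe([1, 6]))  # Cowboy Bebop (1) and Trigun (6)
```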

In [20]:
# This part needs improvement, because the results are not concise enough.
# We could build a list of every anime watched by people who have seen a given anime, and return only the most common ones instead of all of them.
for recommendations in find_recommendations_free(seen_animes):
	for i in range(len(recommendations[1])):
		recommendation = (recommendations[0][i], recommendations[1][i], recommendations[2][i])
		print("People who have seen %s, also liked watching %s with %.3f%% confidence." % (
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[0]]), 
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["Name"].values[0] + f" ({str(x)})" for x in recommendation[1]]), 
			recommendation[2] * 100)
		)
		print("************************************************************************************************")

People who have seen Cowboy Bebop (1), also liked watching Death Note (1535) with 84.513% confidence.
************************************************************************************************
People who have seen Cowboy Bebop (1) and Trigun (6), also liked watching Death Note (1535) with 87.684% confidence.
************************************************************************************************
People who have seen Cowboy Bebop (1) and Trigun (6), also liked watching Code Geass: Hangyaku no Lelouch (1575) with 83.113% confidence.
************************************************************************************************
People who have seen Cowboy Bebop (1) and Trigun (6), also liked watching Tengen Toppa Gurren Lagann (2001) with 81.107% confidence.
************************************************************************************************
People who have seen Cowboy Bebop (1) and Trigun (6), also liked watching Fullmetal Alchemist: Brotherhood (5114) with 84.
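
One way to make this output more concise is to collapse the matching rules into a short ranked list. The sketch below is only one possible reading of the improvement mentioned in the code comment: it keeps, for each anime the user has not seen yet, the highest confidence among the rules that involve a watched anime, and returns the top `n`. `top_recommendations` is a hypothetical helper, and the toy table reuses confidence values printed in this notebook.

```python
import pandas as pd

def top_recommendations(rules_df, anime_ids, n=5):
    """Rank unseen consequents by the best confidence of any rule that mentions a seen id."""
    seen = set(anime_ids)
    best = {}
    for antecedents, consequents, confidence in zip(
        rules_df["antecedents"], rules_df["consequents"], rules_df["confidence"]
    ):
        if not seen.intersection(antecedents):
            continue  # the rule does not involve any watched anime
        for anime_id in consequents:
            if anime_id not in seen:
                # Keep only the strongest rule for each candidate anime.
                best[anime_id] = max(best.get(anime_id, 0.0), confidence)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Toy rules table with the three columns used by the functions above.
toy_rules = pd.DataFrame({
    "antecedents": [[1], [6], [6], [1, 6], [1, 6]],
    "consequents": [[1535], [1], [1535], [1535], [1575]],
    "confidence": [0.84513, 0.812629, 0.857408, 0.87684, 0.83113],
})

print(top_recommendations(toy_rules, [1, 6]))
```

Each candidate anime then appears once, and animes the user has already seen are left out of the list.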

In [None]:
dataset = np.array(dataset.values)

In [None]:
dataset = np.nan_to_num(dataset, nan=0)

In [None]:
dataset = np.where(dataset != 0, 1, dataset)