# BHT Data Applications project
# Automatic Anime recommendation Algorithm
### This project aims to create an algorithm that can determine what anime to recommend to a user.
##### Authors: Rashmi Di Michino and Antonin Mathubert

The dataset, which covers about 320000 users and 16000 animes, was taken from https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020 <br>
We are going to use this dataset to build a model that can recommend an anime based on the animes that the user is watching, has dropped, has kept on hold or put on their watching list.

### 1. Importing and parsing the data
First, we import all of the available data in a form that is easy to process in the next steps of the project.<br><br>
To load the data efficiently, we read the CSV file in chunks. We then change the default column types to smaller integer types to save memory.
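
As a minimal sketch of this idea, assuming a small in-memory CSV (`csv_text`, a hypothetical stand-in for `animelist.csv`), each chunk can be downcast as soon as it is read, which may lower peak memory compared with casting after concatenation. The column names mirror the dataset; the chunk size and variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for animelist.csv (same column names, three rows).
csv_text = (
    "user_id,anime_id,rating,watching_status,watched_episodes\n"
    "0,67,9,1,1\n"
    "0,6702,7,1,4\n"
    "1,242,10,2,4\n"
)

# Smaller integer types for the columns used later on.
dtypes = {
    "user_id": "int32",
    "anime_id": "int32",
    "rating": "int16",
    "watching_status": "int16",
}

# Read two rows at a time and downcast each chunk before keeping it.
chunks = [
    chunk.astype(dtypes)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
small_df = pd.concat(chunks, ignore_index=True)
print(small_df.dtypes)
```

Whether this actually helps depends on the chunk size and on how much memory the machine has, so it is worth measuring before adopting it.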

In [1]:
from mlxtend.frequent_patterns import apriori, association_rules
from itertools import combinations
import plotly.graph_objs as go
from tqdm.notebook import tqdm
import networkx as nx
import pandas as pd
import numpy as np
import itertools

The next two cells set the path to the data. Keep the one that matches your file layout.

In [2]:
path = "C:/Users/rashm/OneDrive/Desktop/data_applications_project/julius/anime_dataset/"

In [3]:
path = "dataset/anime/"

We load all of the data from the CSV file into a DataFrame, then set the intermediate variables to ```None``` to free memory.

The file containing the users' watching status and ratings of the animes is named ```animelist.csv```.

The file containing the details about the animes (ID, name, English name, production, genre...) is named ```anime.csv```.

In [4]:
dataset_chunks = pd.read_csv(path+"animelist.csv", chunksize=20000)
animes_df = pd.read_csv(path+"anime.csv")

chunks = []
for chunk in dataset_chunks:
    chunks.append(chunk)
    
dataset = pd.concat(chunks, ignore_index=True)
dataset = dataset.astype({'user_id': "int32", 'anime_id': 'int32', "watching_status": "int16", "rating": "int16"})

dataset_chunks = None
chunks = None

This is what the main data looks like before altering it.

In [5]:
dataset.head()

Unnamed: 0,user_id,anime_id,rating,watching_status,watched_episodes
0,0,67,9,1,1
1,0,6702,7,1,4
2,0,242,10,1,4
3,0,4898,0,1,1
4,0,21,10,1,0


And this is what the anime data looks like (we never change it).

In [6]:
animes_df.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0


### 2. Cleaning the main dataset

#### Reducing the dataset
Because the dataset is too large to work with in full, we reduce it and keep only the first 20000 users (out of 320000).

We also remove the rows where ```watching_status == 4```, because this means the user dropped the anime, so it shouldn't be included in the data used to train the recommendation system.

In [7]:
dataset.drop(['watched_episodes'], axis=1, inplace=True)
dataset = dataset[(dataset['watching_status'] != 4) & (dataset['user_id'] < 20000)]
dataset = dataset.drop("watching_status", axis=1)

Here we can see a sample of how the dataset is now structured.

In [8]:
display(dataset.head(10))
len(dataset)

Unnamed: 0,user_id,anime_id,rating
0,0,67,9
1,0,6702,7
2,0,242,10
3,0,4898,0
4,0,21,10
5,0,24,9
6,0,2104,0
7,0,4722,8
8,0,6098,6
9,0,3125,9


5858482

### 3. Understanding the data with graphs
In order to get a better look at the data and how it is structured, we are going to use graphs.

This will also allow us to get a first idea of what to recommend to the users.


#### Creating the edges and the nodes
We begin by computing edges and nodes to show on the graphs. 

The size of a node is proportional to the number of times an anime has been watched overall.

The thickness of an edge is proportional to the strength of the link between its two nodes: the more often two animes are watched by the same users, and the higher those users rate them, the stronger the link.
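
To make the weighting concrete, here is a small plain-Python sketch of the same idea on a hypothetical toy dataset (`toy_watchlists` is invented for illustration). Every pair of animes rated by the same user adds the mean of the two ratings, rescaled to the 0 to 1 range, to the weight of the edge between them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy watchlists: user -> {anime_id: rating on a 0-10 scale}.
toy_watchlists = {
    "u1": {1: 10, 2: 8},
    "u2": {1: 6, 2: 10, 3: 4},
}

edge_weights = defaultdict(float)
for ratings in toy_watchlists.values():
    # Sorting keeps each pair in a fixed (smaller_id, larger_id) order.
    for (a, rating_a), (b, rating_b) in combinations(sorted(ratings.items()), 2):
        edge_weights[(a, b)] += (rating_a + rating_b) / 20

for pair, weight in sorted(edge_weights.items()):
    print(pair, round(weight, 2))
```

The pair (1, 2) is shared by both users, so it ends up with the largest weight.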

#### Computation times
This method is too inefficient to display every link in the dataset, because there are too many animes to analyze. If we kept all of the animes, the visual representation would be unreadable, and computing all the visual features would take hours, if not days. This is why we apply this method only to the 250 most watched animes.
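
A rough way to see the difference in scale is to count how many distinct anime pairs (potential edges) exist in each case. The sketch below only counts pairs and uses the approximate figure of 16000 animes quoted above. It is an upper bound on the number of edges, not a measurement of the actual running time.

```python
import math

def pair_count(n_animes):
    """Number of distinct unordered pairs among n_animes items."""
    return math.comb(n_animes, 2)

pairs_top_250 = pair_count(250)    # 31125 potential edges
pairs_all = pair_count(16000)      # 127992000 potential edges

print(pairs_top_250, pairs_all, round(pairs_all / pairs_top_250))
```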

In [None]:
# Filter the dataset to include only the 250 most watched animes
top_animes = dataset['anime_id'].value_counts().nlargest(250).index 
df = dataset[dataset['anime_id'].isin(top_animes)]
anime_counts = df["anime_id"].value_counts()

# Create a graph
G = nx.Graph()

# Group by user_id
user_groups = df.groupby('user_id')

# Create edges for each user's watched animes with ratings as weights
for user_id, group in tqdm(user_groups):
	animes = group['anime_id'].tolist()
	ratings = group['rating'].tolist()
	for (anime1, rating1), (anime2, rating2) in combinations(zip(animes, ratings), 2): # We use the ratings given to the anime by the users to compute how strong a link is.
		if G.has_edge(anime1, anime2):
			G[anime1][anime2]['weight'] += (rating1 + rating2) / 20  # Weight is how strong a link is.
		else:
			G.add_edge(anime1, anime2, weight=(rating1 + rating2) / 20)

# Get the top 250 most frequent links
top_edges = sorted(G.edges(data=True), key=lambda x: x[2]['weight'], reverse=True)[:250]

# Create a new graph with only the top 250 edges
top_G = nx.Graph()
top_G.add_edges_from([(u, v, {'weight': d['weight']}) for u, v, d in top_edges]) # Keep only the weight attribute; a normalized copy used for drawing is added below.

# Normalize the weights between 0 and 5 for visual representation reasons. Otherwise, the edges are too big and the graphs are unreadable.
weights = [d['weight'] for u, v, d in top_G.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)

min_count = anime_counts.min()
max_count = anime_counts.max()

node_sizes = [
	10 + (anime_counts[node] - min_count) * (50 - 10) / (max_count - min_count)
	for node in top_G.nodes()
]

for u, v, d in top_G.edges(data=True):
	d['normalized_weight'] = 5 * (d['weight'] - min_weight) / (max_weight - min_weight) # Rescaled weight (0 to 5), used later as the edge width

#### Spring layout
For our representation, we chose to use the spring layout.

In [10]:
# Get positions for all nodes
pos = nx.spring_layout(top_G, k=10, iterations=100)

We make the graphs more understandable by adding some context when hovering over the nodes.

In [11]:
hover_texts = []
for node in top_G.nodes():
    neighbors = list(top_G[node])
    weights = [top_G[node][neighbor]['weight'] for neighbor in neighbors]
    total_weight = sum(weights)
    percentages = [(neighbor, top_G[node][neighbor]['weight'] / total_weight * 100) for neighbor in neighbors]
    percentages = sorted(percentages, key=lambda x: x[1], reverse=True)[:5]
    
    anime_name = animes_df[animes_df['MAL_ID'] == node]['English name'].values[0]
    neighbors = [f"{animes_df[animes_df['MAL_ID'] == neighbor]['English name'].values[0]}: {weight:.2f}%" for neighbor, weight in percentages] # Each neighbor string has the format "{neighbor_name}: {share_of_the_node's_total_link_weight}%"
    hover_text = f"{anime_name}<br>" + "<br>".join(neighbors)
    hover_texts.append(hover_text)

In [12]:
# Create edge traces
edge_trace = []
for edge in top_G.edges(data=True):
	x0, y0 = pos[edge[0]]
	x1, y1 = pos[edge[1]]
	trace = go.Scatter(
		x=[x0, x1, None],
		y=[y0, y1, None],
		line=dict(width=edge[2]['normalized_weight'], color='gray'),  # We use the normalized weights computed before
		hoverinfo="none",
		mode='lines'
	)
	edge_trace.append(trace)

# Create node trace
node_trace = go.Scatter(
	x=[pos[node][0] for node in top_G.nodes()],
	y=[pos[node][1] for node in top_G.nodes()],
	hovertext=hover_texts,
	text=[animes_df[animes_df['MAL_ID'] == node]['English name'].values[0] for node in top_G.nodes()], # This is converting the id into a name
	mode='markers+text',
	textposition='top center',
	marker=dict(
		size=node_sizes,
		color='skyblue',
		line=dict(width=2, color='black')
	)
)

In [13]:
# Create the figure
fig = go.Figure(
	data=edge_trace + [node_trace],
	layout=go.Layout(
		title='Top 250 most frequent anime watching links',
		title_font_size=16,  # `titlefont` is deprecated in recent Plotly versions
		showlegend=False,
		hovermode='closest',
		height=800,
		margin=dict(b=20, l=5, r=5, t=40),
		annotations=[dict(
			text="",
			showarrow=False,
			xref="paper", yref="paper"
		)],
		xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
		yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
	)
)

# Show the figure
fig.show()

### 4. Computing association rules

#### Binary matrix

To create association rules, we need to have a binary matrix where the rows represent the watchlist of a user and the columns are the animes.

In our case, we get a 20000 x 16000 matrix because we only kept the first 20000 users and there are approx. 16000 different animes in the dataset.

In [14]:
dataset = dataset.pivot(index='user_id', columns='anime_id', values='rating')

We now convert the matrix into a binary matrix so that we can mine association rules: only ratings above 3 count as a positive entry.

In [15]:
dataset = dataset > 3
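
The effect of these two steps can be checked on a tiny example. The sketch below uses a hypothetical four-row table (`toy`) with the same column names. Note that both missing pairs (NaN after the pivot) and a rating of 0 (which, as far as we understand the dataset description, marks an unrated entry) become `False`.

```python
import pandas as pd

# Hypothetical ratings: user 0 left anime 242 unrated, user 1 disliked anime 67.
toy = pd.DataFrame({
    "user_id":  [0, 0, 1, 1],
    "anime_id": [67, 242, 67, 21],
    "rating":   [9, 0, 2, 10],
})

# One row per user, one column per anime, NaN where the user has no entry.
toy_matrix = toy.pivot(index="user_id", columns="anime_id", values="rating")

# Comparisons with NaN evaluate to False, so missing entries are negatives.
toy_binary = toy_matrix > 3
print(toy_binary)
```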

#### Retrieving the association rules
Finally, we use the mlxtend library to mine the frequent itemsets and retrieve the association rules.

We chose the lift metric because it measures how much more often the antecedents and the consequents occur together than they would if they were independent, which we found to be the most relevant criterion in this case.
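
For reference, the three standard quantities involved can be computed by hand. The sketch below uses a hypothetical list of toy watchlists (`transactions`) and plain Python rather than mlxtend. It then recomputes confidence and lift from the supports reported in the first row of the rules table shown later in this notebook.

```python
# Hypothetical toy watchlists, one set of liked animes per user.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
    {"C"},
]

def support(itemset):
    """Fraction of watchlists that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Share of watchlists with the antecedent that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support (1 = independent)."""
    return confidence(antecedent, consequent) / support(consequent)

print(lift({"A"}, {"B"}))  # above 1: A and B co-occur more than by chance

# Same formulas applied to the supports of the rule [28891] -> [20583, 32935].
table_confidence = 0.157786 / 0.191679
table_lift = table_confidence / 0.160285
print(round(table_confidence, 3), round(table_lift, 2))
```

The recomputed values agree with the `confidence` and `lift` columns of that row up to rounding.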

In [16]:
frequent_itemsets = apriori(dataset, use_colnames=True, min_support=0.15) # Going below 0.1 support takes too much computation time and memory, and the resulting rules are less meaningful.

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.1)

Now that we have a lot of rules, we filter out the ones with a confidence of 0.2 or less, since they rarely hold in the dataset.

In [17]:
rules["antecedents"] = rules["antecedents"].apply(list)  # frozenset -> list
rules["consequents"] = rules["consequents"].apply(list)  # frozenset -> list
rules = rules[rules["confidence"] > 0.2].reset_index(drop=True).sort_values("lift", ascending=False)

Here are some of the rules detected by the algorithm.

In [18]:
display(rules.head(20))
print(f"{len(rules)} rules found.")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
23346,[28891],"[20583, 32935]",0.191679,0.160285,0.157786,0.823179,5.135735,0.127063,4.748967,0.996245
23343,"[20583, 32935]",[28891],0.160285,0.191679,0.157786,0.984412,5.135735,0.127063,51.855557,0.958999
73383,"[35760, 33486]","[25777, 31964, 38524]",0.178372,0.165499,0.150997,0.846529,5.115011,0.121476,5.437503,0.979149
73374,"[25777, 31964, 38524]","[35760, 33486]",0.165499,0.178372,0.150997,0.912373,5.115011,0.121476,9.376411,0.964045
23345,[32935],"[28891, 20583]",0.164521,0.187822,0.157786,0.959062,5.106217,0.126885,19.8394,0.962514
23344,"[28891, 20583]",[32935],0.187822,0.164521,0.157786,0.840081,5.106217,0.126885,5.224386,0.990129
73370,"[25777, 38524, 33486]","[35760, 31964]",0.157949,0.187497,0.150997,0.955983,5.098671,0.121382,18.459061,0.954658
73387,"[35760, 31964]","[25777, 38524, 33486]",0.187497,0.157949,0.150997,0.80533,5.098671,0.121382,4.325536,0.989375
73375,"[35760, 25777, 33486]","[31964, 38524]",0.174841,0.16979,0.150997,0.863622,5.08642,0.12131,6.087578,0.973628
73382,"[31964, 38524]","[35760, 25777, 33486]",0.16979,0.174841,0.150997,0.889315,5.08642,0.12131,7.455048,0.967704


74848 rules found.


#### Parsing the rules
These functions are designed to parse and filter the results of the detected rules, so we can understand them and use them more easily.

<hr>

```find_recommendations_precise``` will compute every possible combination of the watched anime ids, and try to find them in the rules dataset.
<hr>

```find_recommendations_free``` will look for every occurrence of each anime id in the rules, even if the antecedents contain other ids as well.

It will return one dataset per seen anime, ordered from highest to lowest weight. The weight of a recommended anime is the mean of lift × confidence over the rules whose consequents contain that anime.
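
The weighting scheme can be illustrated in isolation. The sketch below uses a hypothetical list of rules for a single source anime (`toy_rules`, with invented ids and values) and plain Python instead of DataFrames.

```python
from collections import defaultdict

# Hypothetical rules for one source anime: (consequents, confidence, lift).
toy_rules = [
    ([10, 20], 0.5, 2.0),
    ([10], 0.8, 1.5),
    ([30], 0.4, 3.0),
]

# For each recommended anime, accumulate [sum of lift * confidence, rule count].
totals = defaultdict(lambda: [0.0, 0])
for consequents, conf, lift in toy_rules:
    for anime_id in consequents:
        totals[anime_id][0] += lift * conf
        totals[anime_id][1] += 1

# The weight is the mean over the rules that recommend the anime.
average_weight = {anime_id: s / n for anime_id, (s, n) in totals.items()}
ranked = sorted(average_weight.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Anime 10 appears in two rules, so its weight is the average of both products rather than their sum. This keeps frequently recommended animes from dominating only because they appear in many rules.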

In [19]:
def generate_combinations(ids): #This is used to compute every existing permutation of the elements of an array.
	result = []
	for r in range(1, len(ids) + 1):
		permutations = itertools.permutations(ids, r)
		for p in permutations:
			result.append(list(p))

	print(f"Found {len(result)} possible combinations.")
	return result

def get_linked_ids(ids): # This is used to find animes that are almost the same (for example, in the dataset id 20000 could represent Naruto and id 21000 Naruto Shippuden; this will be useful later).
	linked = []
	for id in ids:
		anime_name = animes_df[animes_df["MAL_ID"] == id]["Name"].values[0]
		for value in animes_df[animes_df["Name"].apply(lambda x: anime_name.lower() in x.lower())]["MAL_ID"].values:
			linked.append(value)

	return list(set(linked))

def find_recommendations_precise(anime_ids): #The description is available in the markdown above
	recommendations = []
	
	for combination in tqdm(generate_combinations(anime_ids), desc="Trying every possible combination..."):
		# We look for the rows whose antecedents column matches this exact combination, but whose consequents column (the recommendation) doesn't contain an anime that has already been watched (i.e. is in anime_ids)
		filter_df = rules["antecedents"].apply(lambda x: x == combination) & rules["consequents"].apply(lambda x: np.all([id not in x for id in anime_ids])) 
		if filter_df.apply(lambda x: x != False).sum() < 1:
			continue
		recommendation = (combination, rules[filter_df]["consequents"].values, rules[filter_df]["confidence"].values, rules[filter_df]["lift"].values)
		recommendations.append(recommendation)

	return recommendations

def find_recommendations_free(anime_ids): #The description is available in the markdown above
	recommendations = []

	for id in anime_ids:
		# We look for the rows whose antecedents column contains this anime, but whose consequents column (the recommendation) doesn't contain an anime that has already been watched (i.e. is in anime_ids)
		filter_df = rules["antecedents"].apply(lambda x: id in x) & rules["consequents"].apply(lambda x: np.all([id not in x for id in anime_ids]))
		if filter_df.apply(lambda x: x != False).sum() < 1:
			continue

		recommendation = pd.DataFrame({"source": id, "antecedents": rules[filter_df]["antecedents"].values, "consequents": rules[filter_df]["consequents"].values, "confidence": rules[filter_df]["confidence"].values, "lift": rules[filter_df]["lift"].values})
		recommendations.append(recommendation)

	recommendations = pd.concat(recommendations)
	recommendations_dict = {anime: {}  for anime in anime_ids}
	recommendations_df = []
	for anime in recommendations_dict:
		rows = recommendations[recommendations["source"] == anime]
		for _, row in rows.iterrows():
			for x in row["consequents"]: 
				if x in recommendations_dict[anime]: # For each source anime, we compute the weight of every anime found in the consequents column of its rules.
					# We sum up the product of the lift and confidence for the given rule.
					# For example, if the anime with id 1 is present in the consequents column of 3 different rows for an input anime, we sum up the product of the lift and confidence values of these 3 rows and take the mean at the end.
					recommendations_dict[anime][x][0] += row["lift"] * row["confidence"] 
					recommendations_dict[anime][x][1] += 1
				else:
					recommendations_dict[anime][x] = [row["lift"]* row["confidence"], 1]
		
		for anime_recommended in recommendations_dict[anime]:
			recommendations_df.append(
				[
					animes_df[animes_df["MAL_ID"] == anime]["English name"].values[0],  # The name of the anime this rule is for
					animes_df[animes_df["MAL_ID"] == anime_recommended]["English name"].values[0],  # The name of the recommended anime
					recommendations_dict[anime][anime_recommended][0] / recommendations_dict[anime][anime_recommended][1] # The mean of the weights of the found rules
				]
			)

	recommendations_df = pd.DataFrame(recommendations_df, columns=['source', 'recommended_id', 'average weight']).sort_values(by="average weight", ascending=False).groupby("source")

	return recommendations_df

Here we use the previously defined functions and format the results, linking them with the anime information dataset to print readable names.

In [20]:
seen_animes = [23273] #More than 7 at a time takes forever.

In [21]:
for recommendations in find_recommendations_precise(seen_animes):
	for i in range(len(recommendations[1])):
		recommendation = (recommendations[0], recommendations[1][i], recommendations[2][i], recommendations[3][i])
		print("Because you have seen %s, we think you would like %s with %.3f%% confidence. You are also %.3f%% more likely to watch this/these anime(s)." % (
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["English name"].values[0] + f" ({str(x)})" for x in recommendation[0]]), 
			" and ".join([animes_df[animes_df["MAL_ID"] == x]["English name"].values[0] + f" ({str(x)})" for x in recommendation[1]]), 
			recommendation[2] * 100,
			recommendation[3] * 100 - 100)
		)

Found 1 possible combinations.


Trying every possible combination...:   0%|          | 0/1 [00:00<?, ?it/s]

Because you have seen Your Lie in April (23273), we think you would like Your Name. (32281) and anohana:The Flower We Saw That Day (9989) with 50.480% confidence. You are also 142.028% more likely to watch this/these anime(s).
Because you have seen Your Lie in April (23273), we think you would like One Punch Man (30276) and anohana:The Flower We Saw That Day (9989) with 49.644% confidence. You are also 133.343% more likely to watch this/these anime(s).
Because you have seen Your Lie in April (23273), we think you would like anohana:The Flower We Saw That Day (9989) and No Game, No Life (19815) with 49.253% confidence. You are also 132.453% more likely to watch this/these anime(s).
Because you have seen Your Lie in April (23273), we think you would like Toradora! (4224) and Your Name. (32281) with 52.543% confidence. You are also 128.799% more likely to watch this/these anime(s).
Because you have seen Your Lie in April (23273), we think you would like ERASED (31043) and Angel Beats! (65

In [22]:
for index, recommendation_df in find_recommendations_free(seen_animes):
	display(recommendation_df.head(5))

Unnamed: 0,source,recommended_id,average weight
2,Your Lie in April,My Hero Academia 2,1.501425
0,Your Lie in April,Noragami Aragoto,1.476001
6,Your Lie in April,A Silent Voice,1.4497
7,Your Lie in April,Your Name.,1.430774
3,Your Lie in April,My Hero Academia,1.408307


### Results for real users

In [23]:
# This is a list of what Antonin's little sister, Clotilde, has watched.
# We used it to see whether we would get recommendations for animes she wanted to watch.
# It turns out we did, which suggests that the recommendation system works.

last_seen_clo = [
	6702,
	16498,
	33255,
	20,
	28755,
	24833,
	30911,
	136,
	34176,
	28999,
	38000,
	9919,
	36039,
	31043,
	23755,
	35500,
	22319,
	38777,
	38793,
	4898,
	269,
	14227,
	28171,
	31741,
	32282,
	164,
	199,
	431,
	32281,
	513,
	2890,
	512,
	21557,
	39235,
	16662,
	572,
	36098,
	10029,
	34541,
	35098,
	578,
	12355,
	33352,
	11757,
	28851,
	39533,
	35222,
	11771,
	35507,
	34572,
	46352,
	21995,
	37208,
	26243,
	37396,
	39535,
	42249,
	41094,
	46569,
	31964,
	23273,
	523,
	20583,
	42897,
	31478,
]

```recommend_user``` is a new function that takes a full list of watched animes and applies a modified version of the algorithms ```find_recommendations_precise``` and ```find_recommendations_free```.

The main difference is that instead of computing every possible combination of the watched animes (which would take years), it computes the permutations of the last 5 animes seen and returns every rule whose consequents column does not contain any already seen anime, sorted by descending lift.
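
The number of candidates reported by `generate_combinations` ("Found 325 possible combinations." for 5 animes) follows from counting ordered selections. The sketch below reproduces that count and, as a possible alternative that is not what the notebook does, shows how comparing antecedents as sets would make the order irrelevant and shrink the search to the non-empty subsets. The helper names are hypothetical.

```python
import math

def ordered_candidates(n):
    """Ordered selections (permutations) of 1 to n items out of n."""
    return sum(math.perm(n, r) for r in range(1, n + 1))

def unordered_candidates(n):
    """Non-empty subsets of n items."""
    return 2 ** n - 1

print(ordered_candidates(5), unordered_candidates(5))  # 325 vs 31
print(ordered_candidates(7))                           # 13699

# Comparing antecedents as sets ignores the order of the ids.
rule_antecedents = frozenset([20583, 32935])
matches = rule_antecedents == frozenset([32935, 20583])
print(matches)
```

Using this alternative would require keeping the antecedents as frozensets instead of converting them to lists, so it is a design option to evaluate rather than a drop-in change.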

In another cell, we retrieve the names of all the recommended animes and keep the first 10.

When we gave the recommendations made by this system to Antonin's little sister, she said that some of the animes were already on her watchlist.

In [26]:
def recommend_user(anime_ids): # We only keep the last 5 watched animes, but we use the full list to make sure we don't recommend some animes that they have already seen.
	recommendations = []
	linked_ids = get_linked_ids(anime_ids) # This gives us the ids of linked animes. For example, if anime id=1 has the name "Naruto", every anime whose name contains "Naruto" is added to the list.
	# This is useful here because when Clotilde gave us her anime list, she didn't detail every entry of the Naruto series that she watched, although she did watch all the spin-offs.
	# We wrote this function to obtain a more precise list faster.
	
	for combination in tqdm(generate_combinations(anime_ids[-5:]), desc="Trying every possible combination..."):
		filter_df = rules["antecedents"].apply(lambda x: x == combination) & rules["consequents"].apply(lambda x: np.all([id not in x for id in linked_ids]))
		if filter_df.apply(lambda x: x != False).sum() < 1:
			continue
		for i in range(len(rules[filter_df]["consequents"].values)):
			recommendation = (
				" & ".join([animes_df[animes_df["MAL_ID"] == x]["English name"].values[0] for x in combination]), # Converting the ids of a rule to a string enumeration
				[animes_df[animes_df["MAL_ID"] == x]["English name"].values[0] for x in rules[filter_df]["consequents"].values[i]], # The list of the recommended animes
				rules[filter_df]["confidence"].values[i], # Confidence
				rules[filter_df]["lift"].values[i] # Lift
			)
			recommendations.append(recommendation)

	df_recommendations = pd.DataFrame(recommendations, columns=['combination', 'consequents', 'confidence', 'lift']).sort_values("lift", ascending=False) # We sort with lift descending
	return df_recommendations

In [27]:
recommendations_df = recommend_user(last_seen_clo)
display(recommendations_df.head(10))

Found 325 possible combinations.


Trying every possible combination...:   0%|          | 0/325 [00:00<?, ?it/s]

Unnamed: 0,combination,consequents,confidence,lift
0,Your Lie in April,"[One Punch Man, anohana:The Flower We Saw That...",0.496444,2.333425
1,Your Lie in April,"[anohana:The Flower We Saw That Day, No Game, ...",0.492532,2.324534
2,Your Lie in April,"[Re:ZERO -Starting Life in Another World-, Ang...",0.492888,2.218717
3,Your Lie in April,"[Toradora!, Noragami:Stray God]",0.491643,2.173784
4,Your Lie in April,"[Noragami:Stray God, Angel Beats!]",0.504445,2.120882
5,Your Lie in April,"[Toradora!, One Punch Man]",0.538229,2.116475
6,Your Lie in April,"[Toradora!, No Game, No Life]",0.541607,2.109931
7,Your Lie in April,"[Toradora!, anohana:The Flower We Saw That Day]",0.495733,2.092853
8,Your Lie in April,"[Angel Beats!, anohana:The Flower We Saw That ...",0.513869,2.079764
9,Your Lie in April,"[Noragami:Stray God, Noragami Aragoto]",0.496799,2.075465


In [28]:
recommended_animes = []
for consequents in recommendations_df["consequents"]:
    for consequent in consequents:
        if consequent not in recommended_animes:
            recommended_animes.append(consequent)

print(recommended_animes[:10])

['One Punch Man', 'anohana:The Flower We Saw That Day', 'No Game, No Life', 'Re:ZERO -Starting Life in Another World-', 'Angel Beats!', 'Toradora!', 'Noragami:Stray God', 'Noragami Aragoto', 'Death Note', 'Steins;Gate']


In this last cell, we print the 5 most relevant recommendations for each input combination to get more insight into the data obtained.

In [29]:
for _, recommendation_df in recommendations_df.groupby("combination"):
	display(recommendation_df.head(5))

Unnamed: 0,combination,consequents,confidence,lift
66,Haikyu!!,[Noragami:Stray God],0.692204,1.958533
67,Haikyu!!,"[One Punch Man, Death Note]",0.639561,1.9405
68,Haikyu!!,[One Punch Man],0.782034,1.776657
69,Haikyu!!,"[No Game, No Life]",0.692652,1.718192
70,Haikyu!!,[Fullmetal Alchemist:Brotherhood],0.678539,1.523115


Unnamed: 0,combination,consequents,confidence,lift
65,My Neighbor Totoro,[Death Note],0.756698,1.293914


Unnamed: 0,combination,consequents,confidence,lift
0,Your Lie in April,"[One Punch Man, anohana:The Flower We Saw That...",0.496444,2.333425
1,Your Lie in April,"[anohana:The Flower We Saw That Day, No Game, ...",0.492532,2.324534
2,Your Lie in April,"[Re:ZERO -Starting Life in Another World-, Ang...",0.492888,2.218717
3,Your Lie in April,"[Toradora!, Noragami:Stray God]",0.491643,2.173784
4,Your Lie in April,"[Noragami:Stray God, Angel Beats!]",0.504445,2.120882


### Conclusion

It was really interesting to use a huge anime dataset to create a recommender system that could be used in real-case scenarios. We tried to keep a critical eye on our work throughout the whole process of writing this program, in order to understand its limits as well as its strengths.

Using graphs to explore our data was a very efficient way for us to understand the global structure of this dataset and to visualize it.

We understood how to use association rules for a case that is more relatable than just doing groceries, on a large dataset. We discovered that a lot of small things had to be taken into account to make the generated rules really relevant, but we think we did a good job in the end. Of course, this doesn't mean there are no improvements to be made.

#### Possible improvements

1. Instead of using the animes in a user's watchlist to train the system, we could use them to evaluate how good our system is. The general idea would be: __Are most of the recommended animes in the user's watchlist?__

2. With more computing power, we could also compute more rules to find less common patterns that apply only to niche anime fans.
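
The first improvement above could be sketched as a simple hit rate: the share of recommended animes that the user had already put on their watchlist. The function name and the ids below are hypothetical, and a real evaluation would have to exclude those watchlist entries from the data used to build the rules.

```python
def hit_rate(recommended_ids, watchlist_ids):
    """Fraction of recommended animes that appear in the user's watchlist."""
    if not recommended_ids:
        return 0.0
    watchlist = set(watchlist_ids)
    hits = sum(1 for anime_id in recommended_ids if anime_id in watchlist)
    return hits / len(recommended_ids)

# Hypothetical example: 2 of the 4 recommendations are on the watchlist.
example_score = hit_rate([1, 2, 3, 4], [2, 4, 9])
print(example_score)
```

Averaging this score over many users would give a single number to compare different support and confidence thresholds.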