# BHT Data Applications project
# Automatic Anime Recommendation Algorithm
### This project aims to create an algorithm that can determine what anime to recommend to a user.
##### Authors: Rashmi Di Michino and Antonin Mathubert

The dataset, which covers about 320,000 users and 16,000 animes, was taken from https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020 <br>
We use this dataset to build a model that recommends an anime based on the animes that the user is watching, has dropped, has put on hold, or has added to their watch list.

### 1. Importing and parsing the data
First, we import the available data in a form that is easy to work with in the next steps of the project.<br><br>
We read the CSV file in chunks of 10,000 rows so that it is parsed piece by piece, then cast the columns from their default types to smaller integer types to reduce the memory footprint.

In [2]:
from mlxtend.frequent_patterns import apriori, association_rules
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
#import cupy as cp
import random
import time
import os
import re

In [None]:
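# Read the CSV in chunks of 10,000 rows (relative path), then concatenate them.
dataset_chunks = pd.read_csv("dataset/anime/animelist.csv", chunksize=10000)
chunks = []
for chunk in dataset_chunks:
    chunks.append(chunk)

dataset = pd.concat(chunks, ignore_index=True)
# Downcast the id and status columns to smaller integer types to save memory.
dataset = dataset.astype({'user_id': "int32", 'anime_id': 'int32', "watching_status": "int16"})

# Drop the references to the intermediate objects so they can be freed.
dataset_chunks = None
chunks = None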

In [3]:
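# Same chunked load, using an absolute local path to animelist.csv.
dataset_chunks = pd.read_csv("C:/Users/rashm/OneDrive/Desktop/data_applications_project/julius/anime_dataset/animelist.csv", chunksize=10000)

chunks = []
for chunk in dataset_chunks:
    chunks.append(chunk)

dataset = pd.concat(chunks, ignore_index=True)
# Downcast the id and status columns to smaller integer types to save memory.
dataset = dataset.astype({'user_id': "int32", 'anime_id': 'int32', "watching_status": "int16"})

# Drop the references to the intermediate objects so they can be freed.
dataset_chunks = None
chunks = None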
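
As an alternative sketch, the target dtypes can be passed to `pd.read_csv` directly, so the columns are never held in memory with the default 64-bit types. The example below runs on a few made-up rows in an in-memory CSV rather than on the real `animelist.csv`; the column names are the ones used in this notebook, and `usecols` additionally skips the `rating` and `watched_episodes` columns, which this notebook does not use for the rules.

```python
import io

import pandas as pd

# Made-up toy rows standing in for animelist.csv (same column names).
csv_text = (
    "user_id,anime_id,rating,watched_episodes,watching_status\n"
    "0,67,9,1,1\n"
    "0,6702,7,4,1\n"
    "1,22,0,0,6\n"
)

# Target dtypes, identical to the ones used in the astype() call.
dtypes = {"user_id": "int32", "anime_id": "int32", "watching_status": "int16"}

# Read only the needed columns, already downcast, in chunks of 2 rows.
reader = pd.read_csv(
    io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes, chunksize=2
)
sample = pd.concat(reader, ignore_index=True)

print(sample.dtypes)
print(sample.memory_usage(deep=True))
```

Whether this is noticeably faster on the full file has not been measured here; the main effect is that no 64-bit copy of the columns is created first.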

### 2. Recommendation system based on the watched animes
In this first version, we implement a recommendation system based on which animes the users have seen. For example, someone who has watched Cowboy Bebop is recommended Death Note.
#### Reducing the dataset
The full dataset is too large to work with directly, so we reduce it: we keep only the columns we need and restrict it to a subset of users and animes.

In [4]:
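# Drop the columns that are not used for the association rules.
dataset.drop(['rating', 'watched_episodes'], axis=1, inplace=True)
# Keep only animes with id below 10000 and users with id below 20000.
dataset = dataset[(dataset['anime_id'] < 10000) & (dataset['user_id'] < 20000)]
# Exclude all entries with watching_status 4
# (user 61960 is already excluded by the user_id filter above).
dataset = dataset[(dataset['user_id'] != 61960) & (dataset['watching_status'] != 4)]
# The status column is no longer needed once the filtering is done.
dataset = dataset.drop("watching_status", axis=1)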

Here is a sample of the reduced dataset, followed by its number of rows:

In [5]:
display(dataset.head(100))
len(dataset)

| | user_id | anime_id |
|---|---|---|
| 0 | 0 | 67 |
| 1 | 0 | 6702 |
| 2 | 0 | 242 |
| 3 | 0 | 4898 |
| 4 | 0 | 21 |
| ... | ... | ... |
| 176 | 1 | 9253 |
| 183 | 1 | 22 |
| 184 | 1 | 995 |
| 185 | 1 | 4053 |


2509211

The next step is to pivot the dataset into the matrix used to build the recommendation system, where the rows are the users' ids and the columns are the animes' ids. A cell holds the anime id when the user has that anime in their list and NaN otherwise.

In [6]:
dataset = dataset.pivot(index='user_id', columns='anime_id', values='anime_id')

We now convert the matrix into a binary matrix (True where the user has the anime in their list, False otherwise) so that we can mine the association rules from it.

In [7]:
# True where the pivot produced a value, False where it left NaN.
dataset = dataset.notnull()
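
The pivot and binarisation steps can be illustrated on a toy dataset. The following is a minimal sketch with made-up `(user_id, anime_id)` pairs, not the real data. It also shows that the column means of the boolean matrix are the supports of the single animes, which is the quantity a minimum-support threshold is compared against.

```python
import pandas as pd

# Made-up (user, anime) pairs; each pair must be unique for pivot() to work.
toy = pd.DataFrame({
    "user_id":  [0, 0, 1, 1, 2],
    "anime_id": [1, 1535, 1, 6, 1535],
})

# Rows are users, columns are animes; missing pairs become NaN, then False.
matrix = toy.pivot(index="user_id", columns="anime_id", values="anime_id").notnull()
print(matrix)

# Share of users that have each anime in their list (single-item support).
item_support = matrix.mean()
print(item_support)
```

In this toy matrix, animes 1 and 1535 each appear in two of the three lists, so both have a support of 2/3, while anime 6 has a support of 1/3.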

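Finally, we use the mlxtend library to build the recommendation system: the Apriori algorithm finds the sets of animes that appear together in at least 20% of the users' lists (`min_support=0.2`), and the association rules are derived from these frequent itemsets. With its default settings, `association_rules` keeps the rules whose confidence is at least 0.8.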

In [8]:
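# Itemsets that appear in at least 20% of the user rows.
frequent_itemsets = apriori(dataset, use_colnames=True, min_support=0.2)

frequent_itemsets

# Rules derived from the frequent itemsets (default metric and threshold).
rules = association_rules(frequent_itemsets)

rules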

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (6) | (1) | 0.266044 | 0.441340 | 0.216195 | 0.812629 | 1.841278 | 0.098779 | 2.981573 | 0.622516 |
| 1 | (47) | (1) | 0.253692 | 0.441340 | 0.205051 | 0.808267 | 1.831394 | 0.093086 | 2.913736 | 0.608285 |
| 2 | (1) | (1535) | 0.441340 | 0.715784 | 0.372989 | 0.845130 | 1.180706 | 0.057086 | 1.835193 | 0.273957 |
| 3 | (6) | (1535) | 0.266044 | 0.715784 | 0.228109 | 0.857408 | 1.197859 | 0.037678 | 1.993216 | 0.225051 |
| 4 | (6) | (1575) | 0.266044 | 0.588800 | 0.214494 | 0.806232 | 1.369279 | 0.057846 | 2.122123 | 0.367445 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13622 | (4224, 6547, 2904, 5114, 1535) | (9253, 1575) | 0.223772 | 0.432720 | 0.201921 | 0.902355 | 2.085308 | 0.105091 | 5.809628 | 0.670492 |
| 13623 | (9253, 6547, 2904, 5114, 1535) | (4224, 1575) | 0.241779 | 0.414329 | 0.201921 | 0.835150 | 2.015669 | 0.101746 | 3.552749 | 0.664564 |
| 13624 | (4224, 2904, 6547, 9253) | (5114, 1535, 1575) | 0.249849 | 0.407521 | 0.201921 | 0.808174 | 1.983146 | 0.100103 | 3.088626 | 0.660868 |
| 13625 | (4224, 5114, 2904, 9253) | (1535, 6547, 1575) | 0.248531 | 0.368158 | 0.201921 | 0.812459 | 2.206820 | 0.110423 | 3.369079 | 0.727721 |
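
To make the metric columns easier to read, the values of one rule can be recomputed by hand from its three support figures. The sketch below uses the numbers of rule 2, (1) -> (1535), copied from the table above, and the standard definitions of confidence, lift, leverage and conviction. It is a plain arithmetic check and does not call mlxtend.

```python
# Values copied from rule 2 of the table: (1) -> (1535).
antecedent_support = 0.441340  # share of users with anime 1 in their list
consequent_support = 0.715784  # share of users with anime 1535 in their list
support = 0.372989             # share of users with both animes

# Share of the users with anime 1 who also have anime 1535.
confidence = support / antecedent_support
# How much more frequent anime 1535 is among them than among all users.
lift = confidence / consequent_support
# Observed joint support minus the support expected under independence.
leverage = support - antecedent_support * consequent_support
# Ratio of the expected to the observed frequency of the rule failing.
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```

The results agree with the table up to rounding. The lift of about 1.18 is a useful caveat: anime 1535 is already in roughly 72% of all lists, so having anime 1 raises that share only moderately, even though the confidence is high.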


Running the next cell shows that users who have seen Cowboy Bebop (anime id 1) are recommended Death Note (anime id 1535):

In [10]:
rules[rules["antecedents"]==frozenset({1})]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | (1) | (1535) | 0.44134 | 0.715784 | 0.372989 | 0.84513 | 1.180706 | 0.057086 | 1.835193 | 0.273957 |
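
The lookup above matches a single antecedent exactly. One possible way to turn the rules table into recommendations for a whole watch list is sketched below. The `recommend` helper and its `top_n` parameter are hypothetical and not part of this notebook or of mlxtend; the toy rules reuse the antecedents, consequents and confidences of rules 0, 2 and 3 from the rules table.

```python
import pandas as pd


def recommend(rules, watched, top_n=5):
    """Hypothetical helper: rank animes suggested by the applicable rules."""
    watched = frozenset(watched)
    # A rule applies when all of its antecedents are in the watch list.
    applicable = rules[rules["antecedents"].apply(lambda a: a <= watched)]
    scores = {}
    for _, row in applicable.iterrows():
        # Only suggest animes the user does not already have.
        for anime_id in row["consequents"] - watched:
            scores[anime_id] = max(scores.get(anime_id, 0.0), row["confidence"])
    # Highest-confidence suggestions first.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy rules table with the same column names as the mlxtend output.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({6}), frozenset({1}), frozenset({6})],
    "consequents": [frozenset({1}), frozenset({1535}), frozenset({1535})],
    "confidence": [0.812629, 0.845130, 0.857408],
})

print(recommend(toy_rules, watched={1}))  # [1535]
print(recommend(toy_rules, watched={6}))  # [1535, 1]
```

Scoring each candidate by the best confidence among the rules that suggest it is only one of several reasonable choices; ranking by lift would favour less universally popular titles.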


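In [None]:
# Convert the matrix to a NumPy array.
dataset = np.array(dataset.values)

In [None]:
# Replace any NaN entries with 0.
dataset = np.nan_to_num(dataset, nan=0)

In [None]:
# Set every non-zero entry to 1 and leave the zeros unchanged.
dataset = np.where(dataset != 0, 1, dataset)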