# AnAdvisor
AnAdvisor is an anime recommendation service based on the user's preferences

Once users submit their own list, the system automatically profiles them from what they have already watched, and provides recommendations based on that profile

The approach I will follow is to cluster similar users together based on the mean score they gave to each genre, and then recommend the best anime seen by people in the same cluster as the user

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import random as rm
import math
import gc

from matplotlib import pyplot as plt

root = './Datasets'
animeFile = root + '/anime.csv'

In [None]:
anime = pd.read_csv(animeFile, na_values=['Unknown'])
anime.head(5)

In [None]:
#----------MEMORY CLEANING------------
del animeFile
#-------------------------------------

### Datasets details
The following datasets have been provided by MAL.net

  - Anime
    - `MAL_ID`: ***Renamed to `anime_id`***
    - `Name`
    - `Score`: ***Mean score***
    - `Genres`: ***List of genres***
    - `Type`: ***TV, Film, ...***
    - `Episodes`: ***Number of episodes***
    - `Members`: ***Number of members of the anime 'group'***
    - `Favorites`: ***Number of users who have the anime as 'Favorite'***
    - `Other minor attributes`

  - Animelist
    - `user_id`
    - `anime_id`
    - `score`: ***Vote given by `user_id` to `anime_id`***
    - `watching_status`: ***Whether `anime_id` is in Watching, Planning, Dropped (and so on) status for `user_id`***
    - `watched_episodes`: ***Number of episodes watched***

# Anime
This dataset contains various information about all available anime (until early 2020)

I will do some exploration and pre-processing to make the dataset usable later, if needed

I decided to remove all the attributes that are unnecessary for the analysis

In [None]:
dropCList = ['English name', 'Japanese name', 'Aired', 'Premiered',
             'Producers', 'Licensors', 'Studios', 'Duration', 'Rating',
             'Source', 'Watching', 'Completed', 'Ranked',
             'On-Hold', 'Dropped', 'Plan to Watch', 'Score-10',
             'Score-9', 'Score-8', 'Score-7', 'Score-6', 'Score-5',
             'Score-4', 'Score-3', 'Score-2', 'Score-1']

anime.drop(columns = dropCList, inplace = True)
anime.rename(columns = {'MAL_ID': 'anime_id', 'Name': 'name', 'Score': 'score', 
                     'Genres': 'genres', 'Episodes': 'episodes', 'Type': 'type',
                     'Members': 'members', 'Favorites': 'favorites', 'Popularity': 'popularity'},
                     inplace = True)

I decided to handle the NaN values in different ways, depending on the attribute

In [None]:
anime.info()

In [None]:
anime.isna().sum()

Since genres are the most important feature for the similarity calculation, I drop all anime without any. I will also drop anime without a type, since it's impossible to manually check and fill in all of them
  - There are some major shows recently added which have no information, but I'm not interested in keeping them: since they have not been released yet, I can't rely on them

I also drop the 'Music' type because it's out of my interest
  - This type contains only the best soundtracks of different anime

In [None]:
# ------------REMOVE NO GENRE FROM ANIMES---------------
mask = anime['genres'].isna()
anime.drop(anime.index[mask], inplace = True)
# ------------REMOVE NO TYPE FROM ANIMES----------------
mask = anime['type'].isna()
anime.drop(anime.index[mask], inplace = True)
    
# ----------REMOVE MUSIC TYPE FROM ANIMES---------------
mask = anime['type'] == 'Music'
anime.drop(anime.index[mask], inplace = True)
# ----------REMOVE THE ONE WITH POPULARITY == 0--------- 
# It's a dataset error
mask = anime['popularity'] == 0
anime.drop(anime.index[mask], inplace = True)

In [None]:
anime.isna().sum()

Let's see with a boxplot the distribution of the mean score, and how it changes with different ways of filling the missing values

In [None]:
fig, plot = plt.subplots(1,1, figsize=(2,5), sharey = True)
sb.boxplot(y = 'score', data = anime, ax=plot)
plot.set_title('Score distribution')
plot.xaxis.set_visible(False)
plot.grid()
plt.show()

In [None]:
fig, plot = plt.subplots(1,4, figsize=(15,10), sharey = True)

sb.boxplot(y = 'score', data = anime, ax=plot[0])
sb.boxplot(data = anime.score.fillna(5), ax=plot[1])
sb.boxplot(data = anime.score.fillna(anime.score.mean()), ax=plot[2])
sb.boxplot(data = anime.score.fillna(anime.score.median()), ax=plot[3])
plot[0].set_title('Original')
plot[1].set_title('Replace with 5')
plot[2].set_title('Replace with mean')
plot[3].set_title('Replace with median')
plot[0].xaxis.set_visible(False)
plot[1].xaxis.set_visible(False)
plot[2].xaxis.set_visible(False)
plot[3].xaxis.set_visible(False)
plot[0].grid()
plot[1].grid()
plot[2].grid()
plot[3].grid()
plt.show()

print(f'\nScore mean: {anime.score.mean()}\nScore median: {anime.score.median()}')

Most of those have not aired yet (I removed 'Aired', but I can still tell because most of them don't have episodes either)

In [None]:
anime.loc[anime.score.isna()].sort_values(by = 'popularity', ascending = True).head(20)

I decided to fill all unrated anime with the median
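As a small illustration of this choice, the sketch below fills missing values with the median on a toy series. The values are made up and are not taken from `anime.csv`. It shows that the fill removes every NaN while leaving the median of the column unchanged.

```python
import pandas as pd
import numpy as np

# Toy scores (hypothetical values, not from anime.csv)
toy = pd.Series([6.0, 7.0, 8.0, 9.0, np.nan, np.nan])

# Fill the missing values with the median of the known scores
filled = toy.fillna(toy.median())

print(f"Median before: {toy.median()} | after: {filled.median()}")
print(f"Missing after fill: {filled.isna().sum()}")
```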

In [None]:
# Assign the result back: an inplace fillna on a selected column may not update the DataFrame
anime['score'] = anime['score'].fillna(anime['score'].median())

In [None]:
anime.isna().sum()

I'm also considering removing any anime with a score lower than 5: I want the system to recommend only decently scored anime

In [None]:
print(f"Total anime count: ({anime.shape[0]})")
print(f"With score of at least 5: ({anime.loc[anime.score >= 5].shape[0]})")

Since there are not many anime with a score lower than 5, I will remove them

In [None]:
anime = anime[anime['score'] >= 5]

Now I will create a new attribute for each genre: an anime belongs to a given genre if the corresponding attribute has value 1
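The same encoding can be sketched on toy data with `Series.str.get_dummies`, as an alternative to a row-by-row function. The frame below is made up and only mimics the `anime_id` and `genres` columns. The separator is normalised first, so that both `', '` and `','` are handled.

```python
import pandas as pd

# Toy frame mimicking the 'anime_id' and 'genres' columns (values are made up)
toy = pd.DataFrame({
    'anime_id': [1, 2, 3],
    'genres': ['Action, Comedy', 'Drama', 'Action,Drama'],
})

# Normalise the separator, then let pandas build one 0/1 column per genre
flags = (toy['genres']
         .str.replace(r'\s*,\s*', ',', regex=True)
         .str.get_dummies(sep=','))
toy_unpacked = pd.concat([toy[['anime_id']], flags], axis=1)
print(toy_unpacked)
```

This avoids calling a Python function once per row, which can matter on larger tables.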

In [None]:
def unpack(x):
  glist = x['genres'].split(',')
  for g in glist:
    x[g.strip()] = 1
  return x

In [None]:
genres = anime[['anime_id', 'genres']]
genresUnpacked = genres.apply(unpack, axis=1).drop(columns = ['genres'])
anime.drop(columns = ['genres'], inplace = True)
genresUnpacked.fillna(0, inplace = True)
genresUnpacked = genresUnpacked.astype(int)
genresUnpacked

I want to see whether there is some correlation between genres, in order to perform some dimensionality reduction and get rid of some of them

In [None]:
fig, plot = plt.subplots(figsize = (20,20))

sb.heatmap(genresUnpacked.drop(columns = 'anime_id').corr(), linewidth=3, square = True, ax = plot, annot = True)
plt.title('Correlation Matrix', fontsize=16)
plt.show()

Unfortunately, it seems that there is no pair of genres with a strong correlation, so I can't remove any genre as redundant

I still want to remove some minor genres, those which don't appear many times in the entire dataset

In [None]:
genresUnpacked.sum()

Looking at how many times each genre occurs, I decided to remove every genre that appears fewer than 400 times
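A minimal sketch of this threshold-based removal on a made-up one-hot table follows. Here `min_count` is a hypothetical stand-in for the 400 threshold. The last line also counts the anime that are left without any genre once the rare genres are gone.

```python
import pandas as pd

# Toy one-hot genre table (made-up values)
toy = pd.DataFrame({
    'anime_id': [1, 2, 3, 4],
    'Action':   [1, 1, 1, 0],
    'Comedy':   [0, 1, 1, 0],
    'Yuri':     [0, 0, 0, 1],
})
min_count = 2  # hypothetical threshold, plays the role of 400

# Count occurrences per genre, leaving the id column out of the sum
genre_counts = toy.drop(columns='anime_id').sum()
minor = genre_counts.index[genre_counts < min_count]
toy_major = toy.drop(columns=minor)

# Anime whose only genres were rare ones now have no genre at all
orphans = (toy_major.drop(columns='anime_id').sum(axis=1) == 0).sum()
print(f"Dropped genres: {list(minor)} | anime left without genre: {orphans}")
```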

In [None]:
print(f"Genres before removal:\t{genresUnpacked.shape[1]-1}")
minorGenres = genresUnpacked.columns[~(genresUnpacked.sum() >= 400)]
genresUnpacked.drop(columns = minorGenres, inplace = True)
print(f"Genres after removal:\t{genresUnpacked.shape[1]-1}")

In [None]:
# Finally join each anime with genres
anime.set_index('anime_id', inplace = True)
anime = anime.join(genresUnpacked.set_index('anime_id'))
anime.head(1)

In [None]:
genres = genresUnpacked.drop(columns = ['anime_id']).columns.to_numpy()

In [None]:
# Check how many anime are now without a genre
(genresUnpacked.set_index('anime_id').T.sum() < 1).value_counts()

As I removed some genres, I also have to remove the anime which are now left without any genre

In [None]:
anime = anime.loc[anime.index[genresUnpacked.set_index('anime_id').T.sum() > 0]]

In [None]:
#-------------MEMORY CLEANING-------------
del dropCList
del genresUnpacked
del plot
del fig
del mask
del minorGenres

gc.collect()
#-------------MEMORY CLEANING-------------

Anime without episodes fall into two categories:
  - Those still airing, for which it is not known when they will end
  - Those not yet aired, or for which it is unknown how many episodes they will have

Those outside the ordinary schemes (i.e. with an unusual number of total episodes) must be manually updated to the last known episode, while the others will be assigned a value based on the anime type (Movies have 1 episode, while non-Movies usually have 12 or 24 episodes: I will assign the one closest to the mean)
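The type-based fill can also be sketched in a vectorised way on toy data (the rows below are made up). It assumes the same rule described above: 1 episode for movies, 12 for every other type, and known counts are left untouched.

```python
import pandas as pd
import numpy as np

# Toy frame (made-up rows) with missing episode counts
toy = pd.DataFrame({
    'type': ['Movie', 'TV', 'OVA', 'TV'],
    'episodes': [np.nan, np.nan, 2.0, 24.0],
})

# Default value per row: 1 for movies, 12 for every other type
default = pd.Series(np.where(toy['type'] == 'Movie', 1, 12), index=toy.index)

# Only the missing entries take the default; known counts are kept
toy['episodes'] = toy['episodes'].fillna(default)
print(toy)
```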

In [None]:
# Anime with NaN episode
anime[anime.episodes.isna()].sort_values(by = 'members', ascending = False).head(5)

In [None]:
most_famous = {21:1044, 34566:279, 235:1067, 42205:64}
for k,v in most_famous.items():
  anime.loc[k, 'episodes'] = v

In [None]:
print(f"Movie episodes mean: {anime[anime['type'] == 'Movie'].episodes.mean()}")
print(f"Special episodes mean: {anime[anime['type'] == 'Special'].episodes.mean()}")
print(f"TV episodes mean: {anime[anime['type'] == 'TV'].episodes.mean()}")
print(f"OVA episodes mean: {anime[anime['type'] == 'OVA'].episodes.mean()}")
print(f"ONA episodes mean: {anime[anime['type'] == 'ONA'].episodes.mean()}")

In [None]:
def fillEpisode(x):
  if not(pd.isna(x.episodes)):
    return x
  if x.type == 'Movie':
    x.episodes = 1
  else:
    # For each non-Movie i will assign 12
    x.episodes = 12
  return x

In [None]:
anime = anime.apply(fillEpisode, axis = 1)

In [None]:
anime.isna().sum()

In [None]:
anime.info()

# Some Analytics on Anime dataset

In [None]:
# DataFrame.append was removed in pandas 2.0: collect the rows in a list first
rows = []
for x in genres:
  rows.append({'Genre': x, 'Count': anime[anime[x] == 1].name.count()})
genresCount = pd.DataFrame(rows, columns = ['Genre', 'Count'])
fig, plot = plt.subplots(1,1, figsize = (15,10))
sb.barplot(data = genresCount.sort_values(by = 'Count', ascending = False).head(15), x = 'Genre', y = 'Count', ax = plot)
sb.set(font_scale=1.8)
plt.tick_params(axis='x', rotation=65)
plt.title('Number of anime per genre')
plot.grid()
plt.show()

In [None]:
fig, plot = plt.subplots(1,1, figsize = (15,10))
sb.barplot(data = anime.sort_values(by = 'members', ascending = False).head(15), x = 'name', y = 'members', ax = plot)
sb.set(font_scale=1.5)
plt.tick_params(axis='x', rotation=90)
plt.title('Top 15 watched animes')
plot.grid()
plt.show()

In [None]:
scoreXType = anime[['type', 'score']].copy()  # copy, so the rounding does not touch a view of anime
# I will round scores to the nearest integer: round up if the decimal part is greater than 0.5, round down otherwise
scoreXType['score'] = scoreXType['score'].apply(lambda x: np.ceil(x) if(x - np.floor(x) > 0.5) else np.floor(x))

In [None]:
fig, plot = plt.subplots(1,figsize=(15,5))
sb.countplot(x='score', hue = 'type', data = scoreXType, ax = plot)
plot.set_title('Score count per type')
plt.legend(loc = 'upper left')
plt.show()

TV has the highest counts on scores 8 and 9 compared to the others. Conversely, OVA is mostly concentrated on scores 5, 6 and 7

In [None]:
#-------------MEMORY CLEANING-------------
del fig
del genresCount
del k
del most_famous
del plot
del v
del x
del scoreXType

gc.collect()
#-------------MEMORY CLEANING-------------

In [None]:
#----------------CHECKPOINT---------------
anime.to_csv(root + '/animeCheckpoint.csv')
#-----------------------------------------

# Ratings

To stay consistent with the pre-processing I did on the Anime dataset, I have to remove all the ratings with an `anime_id` that is not in the previous dataset

I'm also removing the ratings without a vote (rating of 0), and those with a vote but set as Planning
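The three filters can be sketched on a tiny in-memory CSV read in chunks. The rows and the `known_anime` index below are made up, and `io.StringIO` only stands in for the real `animelist.csv` file.

```python
import io
import pandas as pd

# In-memory stand-in for animelist.csv (made-up rows)
csv_text = """user_id,anime_id,rating,watching_status
1,10,8,2
1,11,0,2
2,10,7,6
2,99,9,2
3,11,6,1
"""
known_anime = pd.Index([10, 11])  # hypothetical ids that survived the Anime pre-processing

kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    keep = (chunk['anime_id'].isin(known_anime)   # anime still in the dataset
            & (chunk['rating'] > 0)               # an actual vote
            & (chunk['watching_status'] < 6))     # everything but Planning
    kept.append(chunk[keep])

# Concatenate the filtered chunks once, at the end
toy_ratings = pd.concat(kept, ignore_index=True)
print(toy_ratings)
```

Filtering each chunk before concatenating keeps only the surviving rows in memory at any time.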

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import random as rm
import math
import gc

from matplotlib import pyplot as plt

root = './Datasets'
anime = pd.read_csv(root + '/animeCheckpoint.csv', index_col='anime_id')

In [None]:
cols = ['user_id','anime_id','rating','watching_status']
size = 0
# DataFrame.append was removed in pandas 2.0: collect the filtered chunks and concatenate them once
chunks = []
# Since animelist.csv is way too big, I have to perform numerosity reduction in chunks
for chunk in pd.read_csv(root + '/animelist.csv', chunksize = 1_000_000, usecols = cols):
  size += chunk.shape[0]
  # Mask of ratings of anime not in Anime dataset
  mask = (~chunk.anime_id.isin(anime.index))
  chunk.drop(chunk.index[mask], inplace = True)
  # Keep only ratings with score greater than 0
  chunk = chunk[chunk['rating'] > 0]
  # Watching_status encoding {1: 'Watching', 2: 'Completed', 3: 'On-Hold', 4: 'Dropped', 6: 'Planning'}
  # Keep only ratings with a watching_status < 6 (everything but Planning)
  chunk = chunk[chunk['watching_status'] < 6]
  chunks.append(chunk)
ratings = pd.concat(chunks, ignore_index = True)
ratings.info()

In [None]:
#----------------CHECKPOINT---------------
ratings.to_csv(root + '/ratingsCheckpoint.csv')
#-----------------------------------------

In [None]:
#-------------MEMORY CLEANING-------------
del chunk
del cols
del mask

gc.collect()
#-------------MEMORY CLEANING-------------

---------------------------------------
AT THIS POINT IT'S BETTER TO RESTART THE RUNTIME AND START OVER FROM THE CHECKPOINT DATASETS (MEMORY LEAK PROBLEMS)

---------------------------------------

In [None]:
%pip install pyclustertend

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
import seaborn as sb
import random as rm
import math
import gc

from matplotlib import pyplot as plt
plt.rcParams['axes.grid'] = True

from sklearn.preprocessing import StandardScaler, Normalizer

from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, DBSCAN, OPTICS, AgglomerativeClustering

from sklearn.metrics import silhouette_score
from pyclustertend import hopkins
from scipy.cluster.hierarchy import dendrogram

# root must be defined again here, since the runtime has been restarted
root = './Datasets'
anime = pd.read_csv(root + '/animeCheckpoint.csv', index_col='anime_id')
ratings = pd.read_csv(root + '/ratingsCheckpoint.csv')
ratings.drop(columns = ['Unnamed: 0'], inplace = True)
# {1: 'Watching', 2: 'Completed', 3: 'On-Hold', 4: 'Dropped', 6: 'Planning'}
genres = anime.drop(columns = ['episodes', 'type','name','score','popularity','members','favorites']).columns.to_numpy()

Let's do some data cleaning on the ratings dataset:
  - We are not interested in keeping users who have rated only 1 anime
  - We are also not interested in keeping users who have rated tons of anime: this would bias the users' clusters too much


After cleaning, I will try to see if users can be split into 'groups' of preferred genres

In [None]:
fig, plot = plt.subplots(1,1, figsize = (15,5))

ratingsPerUser = pd.DataFrame(ratings.groupby(['user_id'])['anime_id'].count()).rename(columns = {'anime_id': 'count'})
sb.histplot(ratingsPerUser.sort_values(by = 'count', ascending = False), x = 'count', kde = True, ax = plot)
plt.xlabel('# Anime rated')
plt.tick_params(axis = 'x', rotation = 45)
plt.show()

It seems that some users rated 12k to 14k anime

In [None]:
fig, plot = plt.subplots(1,2, figsize = (15,5))

sb.histplot(ratingsPerUser.loc[(ratingsPerUser['count'] < 50)], x = 'count', kde = True, ax = plot[0])
sb.histplot(ratingsPerUser.loc[(ratingsPerUser['count'] > 1500)], x = 'count', kde = True, ax = plot[1])
plot[0].set_xlabel('# Anime rated')
plot[1].set_xlabel('# Anime rated')

fig.suptitle('Head and tail ratings count', fontsize=16)
plt.show()

First, I will remove people with more than 2000 ratings or fewer than 250:
  - People with a lot of ratings are considered noise because they watch a lot of things and don't have a preferred genre; there are also accounts that just vote randomly on every anime
  - People with few ratings are not included in the model either: their anime list may be a 'subset' of other users' lists. They are also the target audience of the recommendation system
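The filter on the number of ratings per user can be sketched as follows. The data is made up, and `lower` and `upper` are hypothetical bounds that play the role of the 250 and 2000 offsets.

```python
import pandas as pd

# Toy ratings (made-up): user 1 has 3 ratings, user 2 has 1, user 3 has 5
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 3, 3, 3, 3, 3],
    'anime_id': [10, 11, 12, 10, 10, 11, 12, 13, 14],
})
lower, upper = 1, 5  # hypothetical bounds

# Number of ratings per user, aligned back to every rating row
per_user = toy.groupby('user_id')['anime_id'].transform('count')

# Keep only the ratings of users strictly inside the bounds
toy_kept = toy[(per_user > lower) & (per_user < upper)]
print(toy_kept)
```

Using `transform` keeps the counts aligned with the rating rows, so the ratings can be filtered directly without an intermediate per-user table.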

In [None]:
fig, plot = plt.subplots(1,1, figsize = (15,5))
higherOffset = 2000
lowerOffset = 250

print(f"Initial users: {ratingsPerUser.shape[0]}\n")
ratingsPerUser = ratingsPerUser[ratingsPerUser['count'] < higherOffset]
ratingsPerUser = ratingsPerUser[ratingsPerUser['count'] > lowerOffset]
sb.histplot(ratingsPerUser.sort_values(by = 'count', ascending = False), x = 'count',kde = True, ax = plot)
plt.xlabel('# Anime rated')
plt.ylabel('Users count')

fig.suptitle('How many users per ratings?', fontsize=16)

plt.show()
print(f"\nRemaining users: {ratingsPerUser.shape[0]}")

In [None]:
print(f"Ratings before:\t{ratings.shape[0]}")
ratings = ratings[ratings.user_id.isin(ratingsPerUser.index)]
print(f"Ratings after:\t{ratings.shape[0]}")

We almost halved the Ratings dataset

Repeat the same process for anime:
  - In this step, unpopular anime and those which still have to be released are removed, as they have a low number of ratings


In [None]:
fig, plot = plt.subplots(1,1, figsize = (15,5))

ratingsPerAnime = pd.DataFrame(ratings.groupby(['anime_id'])['user_id'].count()).rename(columns = {'user_id': 'count'})
sb.histplot(ratingsPerAnime.sort_values(by = 'count').head(7000), x = 'count', kde = True, ax = plot)
plt.xlabel('Number of Ratings')
plt.ylabel('Anime count')

fig.suptitle('Number of anime per ratings count', fontsize=16)
plt.show()

In [None]:
fig, plot = plt.subplots(1,1, figsize = (15,5))
offset = 50

print(f"Initial animes: {ratingsPerAnime.shape[0]}\n")
ratingsPerAnime = ratingsPerAnime[ratingsPerAnime['count'] >= offset]
sb.histplot(ratingsPerAnime.sort_values(by = 'count').head(7000), x = 'count', kde = True, ax = plot)
plt.xlabel('Anime rated')
plt.tick_params(axis = 'x', rotation = 45)
plt.show()
print(f"\nRemaining animes: {ratingsPerAnime.shape[0]}")

Definitively remove the discarded users and anime from the Anime and Ratings datasets

In [None]:
print(f"Anime before:\t{anime.shape[0]} | Ratings before:\t{ratings.shape[0]}")
anime = anime[anime.index.isin(ratingsPerAnime.index)]
ratings = ratings[ratings.anime_id.isin(ratingsPerAnime.index)]
print(f"Anime after:\t{anime.shape[0]} | Ratings after:\t{ratings.shape[0]}")

In [None]:
#-------------MEMORY CLEANING-------------
del fig
del higherOffset
del lowerOffset
del offset
del plot
del ratingsPerAnime
del ratingsPerUser

gc.collect()
#-------------MEMORY CLEANING-------------

Since the ratings dataset is too big even after numerosity reduction, I have to build the table I will use for clustering in chunks
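Before doing it in chunks, the computation itself can be sketched on toy data: each rating is joined with the genre flags of its anime, the flags are weighted by the rating, and the zeros (anime not belonging to the genre) are hidden before averaging. All values below are made up.

```python
import pandas as pd
import numpy as np

# Toy data (made-up): two genres, three ratings
toy_anime = pd.DataFrame({'Action': [1, 0], 'Comedy': [1, 1]},
                         index=pd.Index([10, 11], name='anime_id'))
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2],
                            'anime_id': [10, 11, 10],
                            'rating': [8, 6, 9]})

# Attach the genre flags to each rating, then weight them by the rating
joined = toy_ratings.join(toy_anime, on='anime_id')
genre_cols = toy_anime.columns
scores = joined[genre_cols].multiply(joined['rating'], axis=0)

# A 0 means "the anime is not of this genre": hide it so it does not lower the mean
scores = scores.replace(0, np.nan)
user_means = scores.groupby(joined['user_id']).mean().fillna(0)
print(user_means)
```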

In [None]:
animeReduced = anime.drop(columns = ['name', 'score','type', 'episodes', 'popularity', 'members', 'favorites'])
# DataFrame.append was removed in pandas 2.0: collect the per-fold tables and concatenate them once
genreTables = []
supportTables = []

users = len(ratings.user_id.unique().tolist())
# Smallest number of folds (>= 6) that evenly divides the users
for i in range(6, users):
  if users % i == 0:
    break
# array_split also copes with folds of unequal size, should no exact divisor exist
usersFolds = np.array_split(ratings.user_id.unique(), i)

for i in usersFolds:
  # Join each rating with the related anime
  genreTable = ratings[ratings.user_id.isin(i.tolist())].join(animeReduced, on ='anime_id').drop(columns = 'watching_status')
  supportTable = ratings[ratings.user_id.isin(i.tolist())].drop(columns = ['watching_status', 'rating']).join(animeReduced, on = 'anime_id')
  # Multiply presence on that genre with the rating given by user
  genreTable[genres] = genreTable[genres].T.multiply(genreTable['rating']).T
  supportTable = supportTable.drop(columns = 'anime_id').groupby(by = 'user_id').mean()
  # Mean on genre for each user
  genreTable.drop(columns = ['rating', 'anime_id'], inplace = True)
  genreTable.replace(0, np.nan, inplace = True)
  genreTable = genreTable.groupby(by = 'user_id').mean()
  genreTable.replace(np.nan, 0, inplace = True)

  genreTables.append(genreTable)
  supportTables.append(supportTable)

ratingsReduced = pd.concat(genreTables)
meanTable = pd.concat(supportTables)

In [None]:
#-------------MEMORY CLEANING-------------
del i
del users
del usersFolds
del genreTable
del animeReduced
del supportTable

gc.collect()
#-------------MEMORY CLEANING-------------

This is the resulting table: each user has their mean score for each genre

In [None]:
ratingsReduced.head(5)

This is the table I will use to see the actual distribution of genres in each list: each value represents the presence of that genre in the user's list (i.e. Action 0.37 for user 3 means that 37% of their list is Action anime)
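The presence value is simply the mean of the 0/1 genre flags over a user's list. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (made-up): genre flags per anime and each user's list
toy_anime = pd.DataFrame({'Action': [1, 0, 1, 0], 'Comedy': [1, 1, 0, 0]},
                         index=pd.Index([10, 11, 12, 13], name='anime_id'))
toy_lists = pd.DataFrame({'user_id': [1, 1, 1, 1, 2, 2],
                          'anime_id': [10, 11, 12, 13, 11, 13]})

# The mean of a 0/1 flag is the fraction of the list that has that genre
presence = (toy_lists.join(toy_anime, on='anime_id')
            .drop(columns='anime_id')
            .groupby('user_id').mean())
print(presence)
```

In this toy case half of user 1's list is Action, while user 2 has no Action anime at all.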

In [None]:
meanTable

In [None]:
def shuffleTrainTest(table = ratingsReduced, ttFactor = 0.6, verbose = True):

  """Shuffle unique users and split ratings in train and test set

  Parameters:
   > ttFactor (double) : default 0.6
      How many users go in the training set (1 - ttFactor are in test set)

  Returns:
   > trainingSet, testSet
  
  """
  if((ttFactor > 0.8) or (ttFactor < 0)):
    ttFactor = 0.6

  uniqueUsers = table.index.unique().tolist()
  rm.shuffle(uniqueUsers)

  trainingUsers = []
  testUsers = []

  trainingSize = math.ceil(len(uniqueUsers) * ttFactor)
  testSize = len(uniqueUsers) - trainingSize

  for i in range(0,trainingSize):
    trainingUsers.append(uniqueUsers[i])
  for i in range(trainingSize, trainingSize + testSize):
    testUsers.append(uniqueUsers[i])

  trainingRatings = table[table.index.isin(trainingUsers)]
  testRatings = table[table.index.isin(testUsers)]
  if verbose:
    print(f"Training set: {len(trainingUsers)} users\nTest set: {len(testUsers)} users")
  return trainingRatings, testRatings

def scale(table):
  return pd.DataFrame(
      StandardScaler().fit_transform(table.T),
      index = table.columns, columns = table.index).T

## Average vote

In [None]:
fig, plot = plt.subplots(4,4, figsize = (17,17))
#----SOME MAJOR GENRES----
sb.histplot(meanTable, x = 'Action', kde = True, stat = 'percent', ax = plot[0][0])
sb.histplot(meanTable, x = 'Adventure', kde = True, stat = 'percent', ax = plot[0][1])
sb.histplot(meanTable, x = 'Comedy', kde = True, stat = 'percent', ax = plot[0][2])
sb.histplot(meanTable, x = 'Drama', kde = True, stat = 'percent', ax = plot[0][3])
sb.histplot(meanTable, x = 'Fantasy', kde = True, stat = 'percent', ax = plot[1][0])
sb.histplot(meanTable, x = 'Romance', kde = True, stat = 'percent', ax = plot[1][1])
sb.histplot(meanTable, x = 'School', kde = True, stat = 'percent', ax = plot[1][2])
sb.histplot(meanTable, x = 'Sci-Fi', kde = True, stat = 'percent', ax = plot[1][3])
#----SOME MINOR GENRES----
sb.histplot(meanTable, x = 'Demons', kde = True, stat = 'percent', ax = plot[2][0])
sb.histplot(meanTable, x = 'Historical', kde = True, stat = 'percent', ax = plot[2][1])
sb.histplot(meanTable, x = 'Horror', kde = True, stat = 'percent', ax = plot[2][2])
sb.histplot(meanTable, x = 'Magic', kde = True, stat = 'percent', ax = plot[2][3])
sb.histplot(meanTable, x = 'Mecha', kde = True, stat = 'percent', ax = plot[3][0])
sb.histplot(meanTable, x = 'Music', kde = True, stat = 'percent', ax = plot[3][1])
sb.histplot(meanTable, x = 'Parody', kde = True, stat = 'percent', ax = plot[3][2])
sb.histplot(meanTable, x = 'Super Power', kde = True, stat = 'percent', ax = plot[3][3])
fig.suptitle('Type presence in lists frequency distribution', fontsize=15)

plt.show()

In these plots we can see the frequency distribution of genre presence per user: for instance, 'Action' is present at 40% on average
  - For 2.5% of users, 40 out of every 100 anime seen are Action

# First step: Clustering

In this step I will find clusters of users based on their preferred genres.
If I manage to find similar users in this way, I can put any test user into a category and recommend with a more qualitative measure
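Putting a new user into a category can be sketched as a nearest-centroid assignment, which is also how K-Means assigns points to clusters. The centroids and the user profile below are hypothetical values, not results from this notebook.

```python
import numpy as np

# Hypothetical centroids of 3 user clusters over 3 scaled genre scores
centroids = np.array([[ 1.0, -0.5, -0.5],
                      [-0.5,  1.0, -0.5],
                      [-0.5, -0.5,  1.0]])
# Hypothetical scaled profile of a new user
new_user = np.array([0.9, -0.3, -0.6])

# The user joins the cluster whose centroid is closest (Euclidean distance)
distances = np.linalg.norm(centroids - new_user, axis=1)
cluster = int(distances.argmin())
print(f"Assigned cluster: {cluster}")
```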

Each user has a different way of voting:
  - Some users keep their votes around 5~6
  - Others prefer to give ratings around 8~9

I will use the StandardScaler on rows to bring everyone to the same scale
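To make the effect of row-wise scaling visible, the sketch below computes the z-score per user with plain pandas on made-up data. `StandardScaler` applied to the transposed table computes the same values, since it also uses the population standard deviation (`ddof=0`). Two users with the same taste but a different voting baseline end up with the same profile.

```python
import pandas as pd
import numpy as np

# Toy genre means (made-up): same taste, different voting baseline
toy = pd.DataFrame({'Action': [5.0, 8.0], 'Comedy': [6.0, 9.0], 'Drama': [5.0, 8.0]},
                   index=pd.Index(['strict_user', 'generous_user'], name='user_id'))

# Row-wise z-score: subtract each user's mean, divide by each user's std (ddof=0)
row_mean = toy.mean(axis=1)
row_std = toy.std(axis=1, ddof=0)
toy_scaled = toy.sub(row_mean, axis=0).div(row_std, axis=0)
print(toy_scaled)
```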

In [None]:
trainingSet, testSet = shuffleTrainTest(ttFactor = 0.7)
testJoin = scale(trainingSet)
trainingSet2, testSet2 = shuffleTrainTest(table = meanTable, ttFactor = 0.7)
testJoin.head(3)

In [None]:
fig, plot = plt.subplots(figsize = (20,20))

sb.heatmap(trainingSet2.corr(), linewidth=3, square = True, ax = plot, annot = True)
plt.title('Correlation Matrix for genres presence', fontsize=16);
plt.show()

On the genres presence matrix there are some strongly correlated genres:
  - Sci-Fi and Mecha have 0.87
  - Mecha and Space have 0.87
  - Mecha and Military have 0.82

I can decide to remove two among Sci-Fi, Mecha and Space (for instance the two with fewer occurrences)

Concerning Mecha and Military, I considered 0.82 not enough: in any case, I can choose Mecha in the first removal
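Instead of reading the pairs off the heatmap, they can also be listed programmatically. This is a minimal sketch on a made-up presence table, where `threshold` is a hypothetical cut-off chosen for the example.

```python
import pandas as pd
import numpy as np

# Toy genre-presence table (made-up values)
toy = pd.DataFrame({
    'Sci-Fi': [0.10, 0.20, 0.30, 0.40, 0.50],
    'Mecha':  [0.05, 0.10, 0.16, 0.20, 0.25],
    'Comedy': [0.50, 0.10, 0.40, 0.20, 0.30],
})
threshold = 0.85  # hypothetical cut-off

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs > threshold]
print(strong_pairs)
```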

In [None]:
anime[['Sci-Fi', 'Mecha', 'Space']].sum()

I remove the Space and Mecha columns because they have fewer occurrences than Sci-Fi

In [None]:
genresSupp = genres
genresSupp = genresSupp[genresSupp != 'Mecha']
genresSupp = genresSupp[genresSupp != 'Space']
# Remove anime with now no genre
animeSupp = anime.loc[(anime.drop(columns = ['Mecha','Space'])[genresSupp].T.sum() > 0)]

In [None]:
fig, plot = plt.subplots(figsize = (20,20))

sb.heatmap(testJoin.corr(), linewidth = 3, square = True, ax = plot, annot = True)
plt.title('Correlation Matrix for scaled genres mean', fontsize=16);
plt.show()

The scaled genres mean dataset has no correlation between variables

In [None]:
fig, plot = plt.subplots(figsize = (20,20))

sb.heatmap(trainingSet.corr(), linewidth = 3, square = True, ax = plot, annot = True)
plt.title('Correlation Matrix for non-Scaled genres mean', fontsize=16);
plt.show()

On the other hand, the non-scaled genre means seem to be highly correlated with each other
  - This is due to the fact that genres with many occurrences (Action, Adventure, Comedy and so on) tend to have a mean score equal to the mean score of the general dataset, that is from 6 to 8, as we saw previously with the boxplot on the score

Let's compute the Hopkins statistic on the scaled trainingSet and on the genre presence dataset, to see if the latter is better than the genre mean

In [None]:
x = int(testJoin.shape[0]*(1/10))
y = int(trainingSet2.shape[0]*(1/10))
print(f"Scaled genres mean\t\t- {x}\t samples: {1 - hopkins(testJoin,x)}")
print(f"non-Scaled genres presence\t- {y}\t samples: {1 - hopkins(trainingSet2,y)}")

Fortunately, the datasets have a high cluster tendency, because the **Hopkins** score is higher than `0.75` (a Hopkins score above 0.75 is considered to indicate cluster tendency at the 90% confidence level)
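For reference, here is a minimal NumPy sketch of the Hopkins statistic in the convention used in the text, where values close to 1 indicate cluster tendency and values around 0.5 indicate uniformly random data. It is an illustrative implementation, not the code of `pyclustertend`: the function name `hopkins_sketch`, the seeds and the toy data are assumptions. It uses plain distances, while some definitions raise them to the power of the number of dimensions.

```python
import numpy as np

def hopkins_sketch(X, m, seed=0):
    """Hopkins statistic: ~0.5 for uniform data, close to 1 for clustered data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # m real points and m uniform points drawn inside the bounding box of X
    sample_idx = rng.choice(n, size=m, replace=False)
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    # u: distance from each uniform point to its nearest real point
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # w: distance from each sampled real point to its nearest *other* real point
    dist = np.linalg.norm(X[sample_idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m), sample_idx] = np.inf  # ignore the distance to itself
    w = dist.min(axis=1)

    return u.sum() / (u.sum() + w.sum())

# Toy data (made-up): two tight blobs versus uniformly scattered points
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.05, size=(100, 2)),
                   rng.normal(10, 0.05, size=(100, 2))])
flat = rng.uniform(0, 1, size=(200, 2))

h_blobs = hopkins_sketch(blobs, 20)
h_flat = hopkins_sketch(flat, 20)
print(f"Clustered data: {h_blobs:.2f}")
print(f"Uniform data:   {h_flat:.2f}")
```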

I have to decide whether to use the genres mean dataset or the genres presence dataset (removing highly correlated genres)

I will test both with K-Means and choose the one with the best trade-off between the value of K and the Silhouette score

In [None]:
trainingSet2.drop(columns = ['Mecha', 'Space'], inplace = True)

### KMeans

As I am using K-Means, I need to find the best K, so I'm going to test both the Elbow method and the Silhouette score for different values of K
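The selection procedure can be sketched on synthetic data where the right answer is known. The three well separated groups below are made up, so the Silhouette score is expected to peak at K = 3, and the inertia (the quantity plotted for the Elbow method) stops dropping sharply after that point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (made-up): three well separated groups of 2D points
rng = np.random.default_rng(77)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

inertia = {}
silhouette = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=77, n_init=10).fit(X)
    inertia[k] = model.inertia_                         # Elbow: within-cluster sum of squares
    silhouette[k] = silhouette_score(X, model.labels_)  # higher is better

best_k = max(silhouette, key=silhouette.get)
print(f"Best K by Silhouette: {best_k}")
```

On real data the two criteria may disagree, as happens later in this notebook, and the final choice remains a trade-off.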

In [None]:
silC = []
silE = []
elbow = []
minK = 2
maxK = 21

for i in range(minK,maxK):
  kmeans = KMeans(n_clusters = i, random_state = 77, n_init = 10)
  y_kmeans = kmeans.fit_predict(testJoin)
  silC.append(silhouette_score(testJoin, y_kmeans, metric = 'cosine'))
  silE.append(silhouette_score(testJoin, y_kmeans, metric = 'euclidean'))
  elbow.append(kmeans.inertia_)

fig, plotSC = plt.subplots(2,1,figsize=(10,10))

plotSC[0].plot(range(minK,maxK), silC, 'rx-')
plotSC[0].set_ylabel('Silhouette Cosine score', color = 'r')
plotE = plotSC[0].twinx()
plotE.plot(range(minK,maxK), silE, 'bx-')
plotE.set_ylabel('Silhouette Euclidean score', color = 'b')

plotSC[1].plot(range(minK,maxK), elbow, 'gx-')
plotSC[1].set_ylabel('Elbow method', color = 'g')

plt.setp(plotSC, xticks=np.arange(minK,maxK,step=1))
plotSC[0].set_xlabel('Value of K')
plotSC[1].set_xlabel('Value of K')

plt.title('Evaluation scores for K-Means', fontsize = 16)
plt.show()

In [None]:
from mpl_toolkits import mplot3d

fig = plt.figure(figsize = (15,15))
plot= plt.axes(projection = '3d')

pca = PCA(n_components = 3).fit_transform(testJoin)
clusters = KMeans(n_clusters = 12, random_state = 77, n_init = 10).fit_predict(testJoin)
scatter = plot.scatter(pca[:,0], pca[:,1], pca[:,2], c = clusters, s=10, cmap='inferno_r')
plt.legend(*scatter.legend_elements())
plt.grid()
plt.show()

print(f"\nClusters presence:\n{pd.DataFrame(clusters).value_counts()}")

The genre presence dataset has a really bad Silhouette score, so I will reject it

Results on scaled mean genres:
  - At first glance, **Hopkins** suggests that, with a score of `~0.85`, our dataset tends to have clusters

  - **Elbow** is not very useful on this dataset, since there isn't a well marked elbow point (the best seems to be 5)

  - For the **Silhouette score**, I have tested both Euclidean and Cosine distance: they suggest an optimal `k = 4`. Cosine distance has a greater score compared to Euclidean, with a value of `~0.45`

In the end I chose `k = 12` because the Silhouette score is still high and I want more variety in clusters than with only `k = 4`

In any case the 3D scatter plot, reduced with PCA, seems to show some natural clusters, even if there is a huge disparity between the biggest one and the others

In [None]:
#-------------MEMORY CLEANING-------------
del clusters
del fig
del kmeans
del elbow
del maxK
del minK
del pca
del plotSC
del plotE
del plot
del silC
del silE
del scatter
del genres
del y_kmeans
del x
del i
del y
del animeSupp
del genresSupp
del testSet2
del trainingSet2

gc.collect()
#-------------MEMORY CLEANING-------------

### Agglomerative clustering

In [None]:
trainingSet, testSet = shuffleTrainTest(ttFactor = 0.3)
testJoin = scale(trainingSet)

In [None]:
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
k = 13 # Best selected K according to K-Means 

for l in ['single','complete','average','ward']:
    fig, plot = plt.subplots(figsize=(15,15))
    agnes = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage = l)
    agnes = agnes.fit(testJoin)
    plt.title(f"Hierarchical Clustering Dendrogram - {l} linkage")
    # plot the top three levels of the dendrogram
    plot_dendrogram(agnes, truncate_mode = "level", p = 3)
    plt.xlabel("Number of points in node (or index of point if no parenthesis).")
    plt.show()

    agnes = AgglomerativeClustering(n_clusters=k, linkage = l)
    y_agnes = agnes.fit_predict(testJoin)
    a = silhouette_score(testJoin, y_agnes, metric = 'cosine')
    b = silhouette_score(testJoin, y_agnes, metric = 'euclidean')
    print(f"{l} linkage - k = {k} - Cosine Silhouette = {a}, Euclidean Silhouette = {b}\n\n")

I computed dendrograms for all linkage types, along with the Silhouette score for n_clusters = 13, since it is the best k I selected with K-Means

I will compute the best K later, once I have chosen the best linkage method for this dataset:
  - **Single linkage**: it just creates 1 big cluster and leaves fewer than 10 samples alone in their own clusters - `Rejected`
  - **Average linkage**: even if the Silhouette score is pretty decent, it creates the same disparity that Single linkage does - `Rejected`

**Ward linkage** seems to be the best, as it provides both a decent Silhouette score and a decent distribution of points into clusters

I will give **Complete linkage** a chance anyway and test both of them for some values of K
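The balance check behind these decisions can be sketched by printing the sorted cluster sizes for each linkage. The data below is made up (two dense groups plus one isolated point), so the resulting sizes say nothing about the real dataset; the sketch only shows how the disparity can be measured.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data (made-up): two dense groups plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(40, 2)),
               rng.normal(6, 1, size=(40, 2)),
               [[30, 30]]])

sizes = {}
for linkage in ['single', 'complete', 'average', 'ward']:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    # Sorted cluster sizes: a very unbalanced list hints at "one big cluster plus leftovers"
    sizes[linkage] = sorted(np.bincount(labels).tolist(), reverse=True)
    print(f"{linkage:>8} linkage: {sizes[linkage]}")
```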

In [None]:
# I do it on a small portion of the whole dataset or it would take hours
# Unfortunately the result will not be very reliable
trainingSet, testSet = shuffleTrainTest(ttFactor = 0.4)
testJoin = scale(trainingSet)

In [None]:
minRange = 2
maxRange = 21

silCC = []
silCE = []
silWC = []
silWE = []

for l in ['ward', 'complete']:
  for n in range(minRange,maxRange):
    y_agnes = AgglomerativeClustering(n_clusters = n, linkage = l).fit_predict(testJoin)
    a = silhouette_score(testJoin, y_agnes, metric = 'cosine')
    b = silhouette_score(testJoin, y_agnes, metric = 'euclidean')
    if l == 'ward':
      silWC.append(a) # Ward Cosine
      silWE.append(b) # Ward Euclidean
    else:
      silCC.append(a) # Complete Cosine
      silCE.append(b) # Complete Euclidean


fig, plot = plt.subplots(2,1,figsize=(10,10))
fig.suptitle('Evaluation scores for AGNES', fontsize = 16)

plt.setp(plot, xticks=np.arange(minRange,maxRange,step=1))
plot[0].plot(range(minRange,maxRange), silWC, 'rx-')
plot[0].set_ylabel('Cosine score', color = 'r')
plot[0].set_title('Ward linkage')
plotW = plot[0].twinx()
plotW.plot(range(minRange,maxRange), silWE, 'bx-')
plotW.set_ylabel('Euclidean score', color = 'b')

plot[1].plot(range(minRange,maxRange), silCC, 'rx-')
plot[1].set_ylabel('Cosine score', color = 'r')
plot[1].set_title('Complete linkage')
plotC = plot[1].twinx()
plotC.plot(range(minRange,maxRange), silCE, 'bx-')
plotC.set_ylabel('Euclidean score', color = 'b')

plot[0].set_xlabel('Value of K')
plot[1].set_xlabel('Value of K')

plt.show()

How they behave over the range of K:
  - **Both** seem to perform best with `K = 10-15`, after which the score drops
  - **Ward linkage** has the better Silhouette score, so I will choose it between the two

I will also choose K = 14
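
The value of K above was chosen by reading the plot. As a complementary check, the peak of a score list can also be read off programmatically. This is only a sketch: the helper name `best_k` and the example scores are made up for illustration, and the highest Silhouette score is not necessarily the only criterion worth considering.

```python
import numpy as np

def best_k(scores, min_k):
    """Return the k with the highest score; scores[i] belongs to k = min_k + i."""
    return min_k + int(np.argmax(scores))

# Made-up Silhouette scores for k = 2..6, for illustration only
example_scores = [0.21, 0.30, 0.34, 0.29, 0.25]
print(best_k(example_scores, min_k=2))  # 4
```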

In [None]:
trainingSet, testSet = shuffleTrainTest(ttFactor = 0.4)
testJoin = scale(trainingSet)

In [None]:
from mpl_toolkits import mplot3d

fig = plt.figure(figsize = (15,15))
plot= plt.axes(projection = '3d')

pca = PCA(n_components = 3).fit_transform(testJoin)
clusters = AgglomerativeClustering(n_clusters = 14, linkage = 'ward').fit_predict(testJoin)
scatter = plot.scatter(pca[:,0], pca[:,1], pca[:,2], c = clusters, s=10, cmap='inferno')
plt.legend(*scatter.legend_elements())
plt.show()

print(f"\nClusters presence:\n{pd.DataFrame(clusters).value_counts()}")

As a result, **Agglomerative Clustering** with Ward linkage works reasonably well, giving results similar to **K-Means** in both cluster quality and cluster balance

However, K-Means is faster and gives a better Silhouette score and cluster balance, so I will use it as the clustering method

In [None]:
#-------------MEMORY CLEANING-------------
del a
del agnes
del b
del fig
del k
del l
del maxRange
del minRange
del n
del plot
del silCE
del silCC
del silWE
del silWC
del y_agnes
del scatter
del plotC
del plotW
del pca
del clusters

gc.collect()
#-------------MEMORY CLEANING-------------

### DBSCAN

I will also give DBSCAN a chance, to check whether the dataset is made of non-convex clusters

First of all, I will use OPTICS with different MinPts values to find the best Eps for this dataset

In [None]:
np.seterr(divide='ignore', invalid='ignore')

I will run **OPTICS** with minPts = 5 on the trainingSet to get a value for Eps and see whether the dataset contains natural clusters of similar density

In [None]:
trainingSet, testSet = shuffleTrainTest(ttFactor = 0.6)
trainingSet2, testSet2 = shuffleTrainTest(meanTable, ttFactor = 0.2, verbose = False)
testJoin = scale(trainingSet)

In [None]:
fig, plot = plt.subplots(1,1,figsize=(30,20))

optic = OPTICS(min_samples = 5, metric = 'euclidean')
optic.fit(testJoin)
plot.plot(np.arange(testJoin.shape[0]), (optic.reachability_[optic.ordering_]), c = 'k')

# plt.ylim((0,3))
# plt.setp(plot, yticks=np.arange(0,0.30,0.01))
plt.show()

**OPTICS** shows that there are some natural clusters, despite the huge difference in density. We can see roughly 6 major natural clusters at Eps = 1, while the remaining points are very far from everything else

I will give DBSCAN a try with minPts = 5 and Eps = 1 and see what I get
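
Once OPTICS has been fitted, scikit-learn can also derive a DBSCAN-like labelling at a chosen Eps directly from the reachability data through `cluster_optics_dbscan`, without a second clustering run. The sketch below uses synthetic `make_blobs` data as a stand-in for the scaled profiles, so the data and the resulting numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled user profiles (hypothetical data)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
optics = OPTICS(min_samples=5, metric='euclidean').fit(X)

# Cut the reachability plot at eps = 1.0, as a DBSCAN run with that eps would
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=1.0,
)

# Noise points are labelled -1 and do not count as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, noise points: {int(np.sum(labels == -1))}")
```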

In [None]:
from mpl_toolkits import mplot3d

fig = plt.figure(figsize = (15,15))
plot= plt.axes(projection = '3d')

pca = PCA(n_components = 3).fit_transform(testJoin)
clusters = DBSCAN(eps = 1, min_samples = 5, metric='euclidean').fit_predict(testJoin)
scatter = plot.scatter(pca[:,0], pca[:,1], pca[:,2], c = clusters, s=2, cmap='inferno_r') # DBSCAN labels the points outside every cluster as -1, so invert colors
plt.legend(*scatter.legend_elements())
plt.grid()
plt.show()

print(f"\nClusters presence:\n{pd.DataFrame(clusters).value_counts()}")

As expected, many of the points (those labelled -1) are considered outliers because of the different densities shown by OPTICS

Still, most of the remaining points fall in the first 1-2 clusters

In [None]:
trainingSet, testSet = shuffleTrainTest(ttFactor = 0.7)
testJoin = scale(trainingSet)

In [None]:
#-------------MEMORY CLEANING-------------
del optic
del fig
del plot
del clusters
del pca
del scatter

gc.collect()
#-------------MEMORY CLEANING-------------

## Ending results
*Runtime calculated on 70% of the users for K-Means and DBSCAN, 30% for Agglomerative Clustering*

**K-Means**
  - `K`: 12
  - `Silhouette score`: 0.45 (Cosine distance)
  - `Runtime in seconds`: 4.1s

**Agglomerative clustering**
  - `K`: 14
  - `Silhouette score`: 0.375 (Cosine distance)
  - `Runtime in seconds`: 44.4s

**DBSCAN**
  - `Eps`: 1
  - `Minpts`: 5
  - `Clusters found`: 6 with reasonable number of members
  - `Outliers`: More than 50% of the points
  - `Runtime in seconds`: 14.7s


K-Means is the best algorithm found for this dataset: it offers the best trade-off between Silhouette score, cluster balance and runtime
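
Figures like the ones above can be collected with a small helper that times the fit and computes the cosine Silhouette score in one place. This is a sketch under assumptions: the helper name `evaluate` and the synthetic data are hypothetical, and the numbers it prints are unrelated to the results reported above.

```python
import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def evaluate(estimator, X):
    """Fit a clustering estimator; return (cosine Silhouette score, runtime in seconds)."""
    start = time.perf_counter()
    labels = estimator.fit_predict(X)
    elapsed = time.perf_counter() - start
    return silhouette_score(X, labels, metric='cosine'), elapsed

# Synthetic stand-in for the scaled user profiles (hypothetical data)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
score, seconds = evaluate(KMeans(n_clusters=4, n_init=10, random_state=0), X)
print(f"silhouette = {score:.3f}, runtime = {seconds:.2f}s")
```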

In [None]:
#----------------CHECKPOINT---------------
anime.to_csv(root + '/animeCheckpoint2.csv')
ratings.to_csv(root + '/ratingsCheckpoint2.csv')
ratingsReduced.to_csv(root + '/ratingsReducedCheckpoint2.csv')
#-----------------------------------------

---------------------------------------
AT THIS POINT IT IS BETTER TO RESTART THE RUNTIME AND START OVER WITH THE CHECKPOINT DATASETS (PART 2)

---------------------------------------

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
import seaborn as sb
import random as rm
import math
import gc

from matplotlib import pyplot as plt
plt.rcParams['axes.grid'] = True

from sklearn.preprocessing import StandardScaler, Normalizer

from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

root = './Datasets'
anime = pd.read_csv(root + '/animeCheckpoint2.csv', index_col='anime_id')
ratings = pd.read_csv(root + '/ratingsCheckpoint2.csv').drop(columns = 'watching_status')
ratingsReduced = pd.read_csv(root + '/ratingsReducedCheckpoint2.csv', index_col = 'user_id')

ratings.drop(columns = ['Unnamed: 0'], inplace = True)
# {1: 'Watching', 2: 'Completed', 3: 'On-Hold', 4: 'Dropped', 6: 'Planning'}
genres = anime.drop(columns = ['episodes', 'type','name','score','popularity','members','favorites']).columns.to_numpy()
animeReduced = anime.drop(columns = ['name','score','type','episodes','members','favorites','popularity'])

In [None]:
def shuffleTrainTest(table = ratingsReduced, ttFactor = 0.6, verbose = True):

  """Shuffle unique users and split ratings in train and test set

  Parameters:
   > ttFactor (double) : default 0.6
      How many users go in the training set (1 - ttFactor are in test set)

  Returns:
   > trainingSet, testSet
  
  """
  if((ttFactor > 0.8) or (ttFactor < 0)):
    ttFactor = 0.6

  uniqueUsers = table.index.unique().tolist()
  rm.shuffle(uniqueUsers)

  trainingUsers = []
  testUsers = []

  trainingSize = math.ceil(len(uniqueUsers) * ttFactor)
  testSize = len(uniqueUsers) - trainingSize

  for i in range(0,trainingSize):
    trainingUsers.append(uniqueUsers[i])
  for i in range(trainingSize, trainingSize + testSize):
    testUsers.append(uniqueUsers[i])

  trainingRatings = table[table.index.isin(trainingUsers)]
  testRatings = table[table.index.isin(testUsers)]
  if verbose:
    print(f"Training set: {len(trainingUsers)} users\nTest set: {len(testUsers)} users")
    
  return trainingRatings, testRatings

def scale(table):
  """Scale the given table using StandardScaler
    on the rows
  """
  return pd.DataFrame(
      StandardScaler().fit_transform(table.T),
      index = table.columns, columns = table.index).T

def mindiv(n, startn = 2):
  for i in range(startn, n):
    if n%i==0:
      return i

# Second step: Collaborative-filtering

Once I have the clusters of similar users, I will give the test user a recommendation, trying two methods:
   - First, I will take the anime lists of the users in the same cluster, remove the anime already seen by the user under test, and then sort by occurrence: an anime seen by 6 users in this cluster ranks higher than one seen by 4 users. In case of a tie, I will sort by overall popularity

   - Second, I will try to implement a recommendation system like those used by major companies such as Amazon or Netflix. It uses a table like the one below to find the best item to watch next, based on the K users most similar to me

Example:
```
        | Show X | Show Y | Show Z |
--------+--------+--------+--------+
User A  |    0   |    5   |   8    |
--------+--------+--------+--------+
User B  |    7   |    5   |   2    |
--------+--------+--------+--------+
User C  |    8   |    0   |   0    |
--------+--------+--------+--------+
```

Users B and C may be more similar to each other than A and B or A and C
 - I can suggest to B something that C has already seen but B has not
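
The similarity between the users of the example table can be checked numerically. The sketch below uses cosine similarity, which is one possible choice and an assumption here; it also treats a 0 as an ordinary value, although in the table it stands for "not rated".

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# The example table from above
table = pd.DataFrame(
    [[0, 5, 8], [7, 5, 2], [8, 0, 0]],
    index=['User A', 'User B', 'User C'],
    columns=['Show X', 'Show Y', 'Show Z'],
)

# Pairwise cosine similarity between the rows (users)
similarity = pd.DataFrame(cosine_similarity(table), index=table.index, columns=table.index)
print(similarity.round(2))
```

With these numbers, B and C are the closest pair (about 0.79), A and B follow (about 0.49), and A and C share nothing (0).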

For the moment, I train K-Means and create a table where each `user_id` in the trainingSet has its own cluster

In [None]:
cumulativePrecision = 0
maxRounds = 10
folds = np.array_split(ratingsReduced.sample(frac=1, random_state = 619), 10)

for t in range(maxRounds):
  cumulativeHit = 0
  cumulativeCount = 0
  
  # Fold t is the test set, the other folds form the training set
  # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
  testSet = folds[t]
  trainingSet = pd.concat([folds[j] for j in range(maxRounds) if j != t])

  print(f"Fold {t+1} of {maxRounds}")

  # Scale the training set
  scaledTrainingSet = scale(trainingSet)
  # Build clusters
  kmeans = KMeans(n_clusters = 13, random_state = 523, n_init = 10).fit(scaledTrainingSet)
  # Take the cluster for each training sample
  y_kmeans = kmeans.predict(scaledTrainingSet)
  # Create a table with user_id-cluster pairs
  clusters = pd.DataFrame(y_kmeans, columns = ['cluster'], index = scaledTrainingSet.index)
  # Take id of test users
  testBase = testSet.index.tolist()
  # Shuffle them
  rm.shuffle(testBase)
  # Test this round on the first 150 random users
  for x in testBase[:150]:
    # Retrieve ratings of user under test
    userList = ratings[ratings.user_id == x].anime_id.unique().tolist()
    # Randomly sample 3/5 of those to associate the user under test to a cluster
    sampledList = rm.sample(userList, int(len(userList)*3/5))
    testList = []
    # The remaining 2/5 will be used as test of precision
    for i in userList:
      if i not in sampledList:
        testList.append(i)
    # Retrieve sampled ratings of the user under test, join with anime, multiply genre presence by rating and mean on user_id
    temp = ratings.loc[(ratings['user_id'] == x) & (ratings.anime_id.isin(sampledList))].join(animeReduced, on = 'anime_id').drop(columns = ['anime_id'])
    temp[genres] = temp[genres].T.multiply(temp['rating']).T
    temp = temp.drop(columns = 'rating').replace(0,np.nan).groupby(by = 'user_id').mean().replace(np.nan,0)
    # Also scale the user rating list
    scaledTest = scale(temp)
    # Cluster label for tested testUser
    label = kmeans.predict(scaledTest)
    # Initialize the KNN algorithm for 15 users
    knn = NearestNeighbors(n_neighbors = 15, metric = 'cosine', algorithm='brute')
    # Training users in the same cluster as the user under test
    clusterUsers = scaledTrainingSet[scaledTrainingSet.index.isin(clusters[clusters.cluster == label[0]].index)]
    # Fit the model on them
    knn.fit(clusterUsers)
    # Retrieve the 15 nearest neighbors (kneighbors returns row positions, not user ids)
    indexes = knn.kneighbors(scaledTest, return_distance = False)
    # Map the row positions back to user ids
    neighborIds = clusterUsers.index[indexes[0]]
    # Pick KNN ratings list (group by anime_id to remove duplicates)
    neighborsList = ratings[ratings.user_id.isin(neighborIds)].groupby(by = 'anime_id').mean().drop(columns = ['user_id'])
    # Now remove those already present in the sampled list of the user under test
    recommandationList = neighborsList[~neighborsList.index.isin(sampledList)].rename(columns = {'rating': 'vote'})
    # Only use the topTen anime sorted by members
    topTen = recommandationList.join(anime, on = 'anime_id').sort_values(by = ['members'], ascending = False).head(10)
    # Precision is HOW MANY OF THOSE 10 ARE PRESENT IN THE REMAINING 2/5 divided by 10
    cumulativeHit = cumulativeHit + topTen[topTen.index.isin(testList)].shape[0]/max(1, topTen.shape[0])
    cumulativeCount = cumulativeCount + 1
  print(f"\tStage accuracy: {int((cumulativeHit/cumulativeCount)*100)}%\n")
  cumulativePrecision += int((cumulativeHit/cumulativeCount)*100)
print(f"\nOverall accuracy: {int(cumulativePrecision / 10)}%")

`50+%` overall accuracy means that if we give a user a list of 10 recommended anime, on average at least 5 of them are interesting to the user
  - This score is calculated by hiding some anime from a user's list and checking whether the system recommends any of the hidden ones back to that user
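
The score described above is a precision over the top ten recommendations. A minimal standalone sketch of it follows; the function name `precision_at_k` and the example ids are hypothetical and only illustrate the computation.

```python
def precision_at_k(recommended, held_out, k=10):
    """Fraction of the top-k recommendations that appear in the held-out list."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    held_out_set = set(held_out)
    hits = sum(1 for item in top_k if item in held_out_set)
    return hits / len(top_k)

# Made-up anime ids: 5 of the 10 recommendations were hidden from the user's list
recommended = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
held_out = [2, 4, 6, 8, 10, 99]
print(precision_at_k(recommended, held_out))  # 0.5
```
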
---
---
I compared this score with other methods I tried (not in this notebook), which had an overall score of 30-40%, so this result is not that bad

I will try another method anyway, one I found in a paper: it is the method Netflix.com used some years ago to recommend films to users
  - I will still use the clustering step from before to pick similar users (it acts as a numerosity reduction)

In [None]:
#-------------MEMORY CLEANING-------------
del clusters
del cumulativeCount
del cumulativeHit
del indexes
del kmeans
del knn
del label
del maxRounds
del neighborsList
del recommandationList
del sampledList
del scaledTest
del scaledTrainingSet
del t
del temp
del testBase
del testList
del testSet
del topTen
del trainingSet
del userList
del x
del i
del y_kmeans

gc.collect()
#-------------MEMORY CLEANING-------------

In [None]:
cumulativePrecision = 0
maxRounds = 10
folds = np.array_split(ratingsReduced, 10)

for t in range(maxRounds):
  cumulativeHit = 0
  cumulativeCount = 0
  
  # Fold t is the test set, the other folds form the training set
  # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
  testSet = folds[t]
  trainingSet = pd.concat([folds[j] for j in range(maxRounds) if j != t])

  print(f"Fold {t+1} of {maxRounds}")

  # Scale the training set
  scaledTrainingSet = scale(trainingSet)
  # Build clusters
  kmeans = KMeans(n_clusters = 13, random_state = 523, n_init = 10).fit(scaledTrainingSet)
  # Take the cluster for each training sample
  y_kmeans = kmeans.predict(scaledTrainingSet)
  # Create a table with user_id-cluster pairs
  clusters = pd.DataFrame(y_kmeans, columns = ['cluster'], index = scaledTrainingSet.index)
  # Take id of test users
  testBase = testSet.index.tolist()
  # Shuffle them
  rm.shuffle(testBase)
  # Test this round on the first 50 random users
  for x in testBase[:50]:
    # Retrieve ratings of user under test
    userList = ratings[ratings.user_id == x].anime_id.unique().tolist()
    # Randomly sample 3/5 of those to associate the user under test to a cluster
    sampledList = rm.sample(userList, int(len(userList)*3/5))
    testList = []
    # The remaining 2/5 will be used as test of precision
    for i in userList:
      if i not in sampledList:
        testList.append(i)
    # Retrieve sampled ratings of the user under test, join with anime, multiply genre presence by rating and mean on user_id
    temp = ratings.loc[(ratings['user_id'] == x) & (ratings.anime_id.isin(sampledList))].join(animeReduced, on = 'anime_id').drop(columns = ['anime_id'])
    temp[genres] = temp[genres].T.multiply(temp['rating']).T
    temp = temp.drop(columns = 'rating').replace(0,np.nan).groupby(by = 'user_id').mean().replace(np.nan,0)
    # Also scale the user rating list
    scaledTest = scale(temp)
    # Cluster label for tested testUser
    label = kmeans.predict(scaledTest)
    # Pick similar users
    similarUsers = clusters[clusters['cluster'] == label[0]].index
    # Ratings of sampled user and test and all users in same cluster
    knnTable = ratings.loc[((ratings.user_id == x)&(ratings.anime_id.isin(sampledList)))|(ratings.user_id.isin(similarUsers))]
    # Build the similarity matrix
    similarityMatrix = knnTable.pivot(index = 'user_id', columns = 'anime_id', values = 'rating').fillna(0).astype(np.uint8)
    # Initialize the KNN algorithm for 15 users
    knn = NearestNeighbors(n_neighbors = 15, metric = 'euclidean', algorithm = 'brute')
    # Candidate neighbors: every user in the matrix except the user under test
    candidates = similarityMatrix.loc[~(similarityMatrix.index == x)]
    knn.fit(candidates)
    # Retrieve the 15 nearest neighbors (kneighbors returns row positions, not user ids)
    indexes = knn.kneighbors(similarityMatrix.loc[(similarityMatrix.index == x)], return_distance = False)
    # Map the row positions back to user ids, then proceed as in the previous method
    neighborsList = ratings[ratings.user_id.isin(candidates.index[indexes[0]])].groupby(by = 'anime_id').mean().drop(columns = ['user_id'])
    # Now remove those already present in the sampled list of the user under test
    recommandationList = neighborsList[~neighborsList.index.isin(sampledList)].rename(columns = {'rating': 'vote'})
    # Only use the topTen anime sorted by members
    topTen = recommandationList.join(anime, on = 'anime_id').sort_values(by = ['members'], ascending = False).head(10)
    # Precision is HOW MANY OF THOSE 10 ARE PRESENT IN THE REMAINING 2/5 divided by 10
    cumulativeHit = cumulativeHit + topTen[topTen.index.isin(testList)].shape[0]/max(1, topTen.shape[0])
    cumulativeCount = cumulativeCount + 1

  print(f"\tStage accuracy: {int((cumulativeHit/cumulativeCount)*100)}%\n")
  cumulativePrecision += int((cumulativeHit/cumulativeCount)*100)
print(f"\nOverall accuracy: {int(cumulativePrecision / 10)}%")

In [None]:
#-------------MEMORY CLEANING-------------
del clusters
del cumulativeCount
del cumulativeHit
del indexes
del kmeans
del knn
del label
del maxRounds
del neighborsList
del recommandationList
del sampledList
del scaledTest
del scaledTrainingSet
del t
del temp
del testBase
del testList
del testSet
del topTen
del trainingSet
del userList
del x
del i
del y_kmeans

gc.collect()
#-------------MEMORY CLEANING-------------

The following is a pure User-Based Collaborative Filtering

I have implemented it as explained in the literature, without using the clustering
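
The core step of user-based collaborative filtering, finding the users whose rating rows are closest to the target user, can be sketched on a toy table. The ratings below are made up for illustration. The sketch also shows that `kneighbors` returns row positions in the fitted matrix, so they have to be mapped back to user ids.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy long-format ratings (made-up values)
toy = pd.DataFrame({
    'user_id':  [10, 10, 20, 20, 30, 30, 40],
    'anime_id': [1, 2, 1, 2, 3, 4, 1],
    'rating':   [9, 8, 9, 7, 6, 8, 2],
})
target = 10

# One row per user, one column per anime, 0 where the user gave no rating
matrix = toy.pivot(index='user_id', columns='anime_id', values='rating').fillna(0)
others = matrix.drop(index=target)

knn = NearestNeighbors(n_neighbors=2, metric='euclidean', algorithm='brute').fit(others.to_numpy())
_, positions = knn.kneighbors(matrix.loc[[target]].to_numpy())

# Map the row positions in `others` back to user ids
neighbour_ids = others.index[positions[0]].tolist()
print(neighbour_ids)  # [20, 40]
```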

In [None]:
cumulativePrecision = 0
maxRounds = 10
folds = np.array_split(ratingsReduced, 10)

for t in range(maxRounds):
  cumulativeHit = 0
  cumulativeCount = 0
  
  # Fold t is the test set, the other folds form the training set
  # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
  testSet = folds[t]
  trainingSet = pd.concat([folds[j] for j in range(maxRounds) if j != t])

  print(f"Fold {t+1} of {maxRounds}")
  # Take id of training users
  trainingBase = trainingSet.index.tolist()
  # Take id of test users
  testBase = testSet.index.tolist()
  # Shuffle test users
  rm.shuffle(testBase)
  # Test this round on the first 50 random users
  for x in testBase[:50]:
    # Retrieve ratings of user under test
    userList = ratings[ratings.user_id == x].anime_id.unique().tolist()
    # Randomly sample 3/5 of those to associate the user under test to a cluster
    sampledList = rm.sample(userList, int(len(userList)*3/5))
    testList = []
    # The remaining 2/5 will be used as test of precision
    for i in userList:
      if i not in sampledList:
        testList.append(i)
    # Initialize lists to collect similar users chunk by chunk
    dist = []
    indx = []
    # For all users in each chunk
    for chunk in np.split(np.array(trainingBase), mindiv(len(trainingBase))):
      # Retrieve all ratings for those users, plus the sampled list of the user under test
      knnTable = ratings.loc[((ratings.user_id == x) & (ratings.anime_id.isin(sampledList))) | (ratings.user_id.isin(chunk))]
      # Build the similarity matrix (aka pivot table)
      similarityMatrix = knnTable.pivot(index = 'user_id', columns = 'anime_id', values = 'rating').fillna(0).astype(np.uint8)
      # Candidate neighbors: every user of the chunk (the user under test is excluded)
      candidates = similarityMatrix.loc[~(similarityMatrix.index == x)]
      # Initialize KNN and find 5 neighbors in each chunk
      knn = NearestNeighbors(n_neighbors = 5, metric = 'euclidean', algorithm = 'brute').fit(candidates)
      distances, indexes = knn.kneighbors(similarityMatrix.loc[(similarityMatrix.index == x)])
      dist.append(distances[0])
      # kneighbors returns row positions, so map them back to user ids
      indx.append(candidates.index[indexes[0]].to_numpy())
    # Distance table for all chunks, ordered by distance (no hard-coded number of neighbors)
    distanceTab = pd.DataFrame(data = np.concatenate(dist), index = np.concatenate(indx), columns = ['distance']).sort_values(by = 'distance', ascending = True)
    # Get the 10 nearest users
    nearestUsers = distanceTab.head(10).index.tolist()
    # Get their rating list
    neighborsList = ratings[ratings.user_id.isin(nearestUsers)].groupby(by = 'anime_id').mean().drop(columns = ['user_id'])
    # recommandation list without already seen anime
    recommandationList = neighborsList.loc[~(neighborsList.index.isin(sampledList))].rename(columns = {'rating': 'vote'}).join(anime, on = 'anime_id').drop(columns = genres)
    # topTen recommended anime
    topTen = recommandationList.sort_values(by = ['members'], ascending = False).head(10)
    # Precision is HOW MANY OF THOSE 10 ARE PRESENT IN THE REMAINING 2/5 divided by 10
    cumulativeHit = cumulativeHit + topTen[topTen.index.isin(testList)].shape[0]/max(1, topTen.shape[0])
    cumulativeCount = cumulativeCount + 1

  print(f"\tStage accuracy: {int((cumulativeHit/cumulativeCount)*100)}%\n")
  cumulativePrecision += int((cumulativeHit/cumulativeCount)*100)
print(f"\nOverall accuracy: {int(cumulativePrecision / 10)}%")

# Conclusions

Trying to predict what a user may like is an extremely difficult task: leaving aside the fact that we are not rational beings, every human is different from every other, and suggesting something based on similar users is not very accurate

In this project I tried to give every user a profile based on the mean score they gave to each genre. Because of the high dimensionality and numerosity, I had to reduce the size of every dataset:
  - High dimensionality causes problems in the clustering phase: K-Means is based on the distance between objects, and distance becomes less meaningful when there are many features. Because a user profile is based on the mean score for each genre, I decided to remove the genres with a low presence in the dataset
  - High numerosity, instead, causes problems in every stage of the project: because of the size of the datasets, it was almost impossible to work with several gigabytes of data, as memory was not enough to perform certain operations

After giving every user a profile, there was another major problem: users do not all rate in the same way. Some are balanced and use the whole range of votes, some tend to rate only from 8 to 10, and others never use those votes:
  - Telling these cases apart is impossible, so I assumed that scaling everyone to the same distribution of votes was the most reliable thing to do
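
The effect of this per-user scaling can be seen on two made-up profiles. The sketch below standardises each row with the user's own mean and population standard deviation, which is what applying `StandardScaler` to the transposed table amounts to. The genre names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical users: one rates only high, one uses the whole range
profiles = pd.DataFrame(
    {'Action': [9.0, 5.0], 'Comedy': [10.0, 1.0], 'Drama': [8.0, 9.0]},
    index=['generous_rater', 'wide_rater'],
)

# Standardise each row: subtract the user's own mean, divide by the user's own std
row_mean = profiles.mean(axis=1)
row_std = profiles.std(axis=1, ddof=0)
scaled = profiles.sub(row_mean, axis=0).div(row_std, axis=0)
print(scaled.round(2))
```

After scaling, both rows have mean 0 and standard deviation 1, so only the relative preference between genres is left.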


Then I compared K-Means, Agglomerative Clustering and DBSCAN supported by OPTICS:
  - I tested all of them on the same data and picked the one that gave the best cluster quality score (the Silhouette score in this case)
  - For every method I tested, the quality score was not very good. In my opinion this is because people's tastes are almost unique: some may like various genres, while others may dislike a genre and still like a particular anime belonging to it. Because of this, I think a fuzzy clustering method may work better for this kind of task

In the end, after choosing K-Means (as it had the best results), I tested the users with 10-fold cross-validation. The testing script works this way:
  - Split the dataset of users with a generated profile into 10 folds
  - For each iteration, use one fold as the test set and the others as the training set
  - After training the model, remove some anime from the list of each user in the test set
  - Use the remaining anime to profile the user, then feed the profile to the model to find the cluster where they fit best
  - Combine the anime lists of the training users in the same cluster and order them by count and mean score
  - Count how many of the top ten anime in this list appear among the removed anime of the user under test (from 0 to 10), which gives a precision from 0% to 100%
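
The step that removes some anime from a test user's list can be sketched as a small helper. The name `split_user_list` and the example ids are hypothetical; the 3/5 proportion follows the one used in the test loops of this notebook.

```python
import random

def split_user_list(anime_ids, seed=None):
    """Split a user's list: 3/5 is kept for profiling, the rest is held out."""
    rng = random.Random(seed)
    kept = rng.sample(anime_ids, len(anime_ids) * 3 // 5)
    kept_set = set(kept)
    held_out = [a for a in anime_ids if a not in kept_set]
    return kept, held_out

# Made-up list of 10 anime ids
kept, held_out = split_user_list(list(range(100, 110)), seed=1)
print(len(kept), len(held_out))  # 6 4
```

The kept part builds the genre profile, and the held-out part is what the top ten recommendations are compared against.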

This step had some problems too: if none of the anime removed from the list of the user under test is in the top ten, the precision drops, and this depends on how many anime the user has seen.
The precision of each fold lies between 40% and 60%, with a mean of 50%:
this means that, on average, about 5 of the anime in the recommended list are really of interest to the user

I was pretty satisfied with that score, also because it is 10 to 20 percentage points better than the other methods.

There is also something that may be improved:
  - For instance, the most famous anime are seen by everybody in all clusters, so they tend to be recommended no matter which cluster you are in
  - Another problem is the quality of the clusters: the dataset could be explored further to find better ways of clustering users. The dataset I used is 2 years old. The website was updated some months ago and now distinguishes between Genres and Themes, while in my dataset they are all considered Genres. This update could be exploited to cluster users first by Genres and then by Themes, and combine the two to see if there are major differences in cluster quality and final accuracy score

I made a simple GUI program that uses the model and the API of the anilist.co website:
  - Given your username, it retrieves your list, builds your genre profile and recommends ten anime you have not seen
  - In order for it to work, you have to rate some anime on the website
  - Because the internal anime dataset is 2 years old, newer anime don't count and are not recommended either

------------------------
The following is an implementation of the algorithm so that users can use it

In this part of the code, the datasets for the recommender are created and saved in a dedicated folder

In [None]:
from joblib import dump, load

import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
import seaborn as sb
import random as rm
import math
import gc

from matplotlib import pyplot as plt
plt.rcParams['axes.grid'] = True

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

root = './Datasets'
anime = pd.read_csv(root + '/animeCheckpoint2.csv', index_col='anime_id')
ratings = pd.read_csv(root + '/ratingsCheckpoint2.csv').drop(columns = 'watching_status')
ratingsReduced = pd.read_csv(root + '/ratingsReducedCheckpoint2.csv', index_col = 'user_id')

ratings.drop(columns = ['Unnamed: 0'], inplace = True)
# {1: 'Watching', 2: 'Completed', 3: 'On-Hold', 4: 'Dropped', 6: 'Planning'}
genres = anime.drop(columns = ['episodes', 'type','name','score','popularity','members','favorites']).columns.to_numpy()
animeReduced = anime.drop(columns = ['name','score','type','episodes','members','favorites','popularity'])

In [None]:
def shuffleTrainTest(table = ratingsReduced, ttFactor = 0.6, verbose = True):

  """Shuffle unique users and split ratings in train and test set

  Parameters:
   > ttFactor (double) : default 0.6
      How many users go in the training set (1 - ttFactor are in test set)

  Returns:
   > trainingSet, testSet
  
  """
  if((ttFactor > 0.8) or (ttFactor < 0)):
    ttFactor = 0.6

  uniqueUsers = table.index.unique().tolist()
  rm.shuffle(uniqueUsers)

  trainingUsers = []
  testUsers = []

  trainingSize = math.ceil(len(uniqueUsers) * ttFactor)
  testSize = len(uniqueUsers) - trainingSize

  for i in range(0,trainingSize):
    trainingUsers.append(uniqueUsers[i])
  for i in range(trainingSize, trainingSize + testSize):
    testUsers.append(uniqueUsers[i])

  trainingRatings = table[table.index.isin(trainingUsers)]
  testRatings = table[table.index.isin(testUsers)]
  if verbose:
    print(f"Training set: {len(trainingUsers)} users\nTest set: {len(testUsers)} users")
    
  return trainingRatings, testRatings

def scale(table):
  """Scale the given table using StandardScaler
    on the rows
  """
  return pd.DataFrame(
      StandardScaler().fit_transform(table.T),
      index = table.columns, columns = table.index).T

def mindiv(n, startn = 2):
  for i in range(startn, n):
    if n%i==0:
      return i

In [None]:
# Take the whole dataset
trainingSet = ratingsReduced
# Scale it
scaledTrainingSet = scale(trainingSet)
# Build clusters
kmeans = KMeans(n_clusters = 13, random_state = 523, n_init = 10).fit(scaledTrainingSet)
# Take the cluster for each training sample
y_kmeans = kmeans.predict(scaledTrainingSet)
# Create a table with user_id-cluster pairs
clusters = pd.DataFrame(y_kmeans, columns = ['cluster'], index = scaledTrainingSet.index)

In [None]:
dump(kmeans, root + '/recommenderFiles/model.joblib')
anime.to_csv(root + '/recommenderFiles/anime.csv')
clusters.to_csv(root + '/recommenderFiles/clusters.csv')
ratings.to_csv(root + '/recommenderFiles/ratings.csv')

# RUN THE PROGRAM FROM HERE

In this part of the code, the algorithm takes your list from `Anilist.co` and finds suggestions for you

In [None]:
from joblib import load
import requests
import threading

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
root = './Datasets'
def loadTask():
  global ratings
  ratings = pd.read_csv(root + '/recommenderFiles/ratings.csv').drop(columns = 'Unnamed: 0')
  return

t = threading.Thread(target = loadTask)
t.start()

ratings = None
anime = pd.read_csv(root + '/recommenderFiles/anime.csv', index_col='anime_id')
clusters = pd.read_csv(root + '/recommenderFiles/clusters.csv')
genres = anime.drop(columns = ['episodes', 'type','name','score','popularity','members','favorites']).columns.to_numpy()
animeReduced = anime.drop(columns = ['name','score','type','episodes','members','favorites','popularity'])

kmeans = load(root + '/recommenderFiles/model.joblib')

def scale(table):
  """Scale the given table using StandardScaler
    on the rows
  """
  return pd.DataFrame(
      StandardScaler().fit_transform(table.T),
      index = table.columns, columns = table.index).T

In [None]:
nick = 'PatataAliena' # my nickname on anilist.co site used to get the list

query = '''
query ($nickname: String) {
    User (name: $nickname) {
        id
    }
}
'''
variables = {
    'nickname': nick
}

url = 'https://graphql.anilist.co'
# API request
response = requests.post(url, json={'query': query, 'variables': variables})

if response.status_code == 404:
  print("User not found")
elif response.status_code != 200:
  print(f"Error occurred: {response.status_code}")
else:
  userID = response.json()['data']['User']['id']
  print(f"Found user: {userID}")

In [None]:
query = '''
query ($id: Int) {
  MediaListCollection(userId: $id, type: ANIME) {
    lists {
      entries{
        media {
          idMal
        }
        score
      }
    }
  }
}
'''
variables = {
    'id': userID
}

url = 'https://graphql.anilist.co'

response = requests.post(url, json={'query': query, 'variables': variables})

edit = None # stays None if the request fails
if response.status_code != 200:
  print(f"Error occurred: {response.json()}")
else:
  edit = response.json()

In [None]:
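# Flatten the AniList response into (user_id, anime_id, rating) rows.
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
rows = []
for LIST in edit['data']['MediaListCollection']['lists']:
  for ITEM in LIST['entries']:
    rows.append({'user_id': 0, 'anime_id': ITEM['media']['idMal'], 'rating': ITEM['score']})
myRatings = pd.DataFrame(rows, columns = ['user_id', 'anime_id', 'rating'])

# Keep only the anime present in the dataset (this also drops entries without a MAL id)
myRatings = myRatings[myRatings.anime_id.isin(anime.index)].copy()
myRatings['user_id'] = myRatings['user_id'].astype(int)
myRatings['anime_id'] = myRatings['anime_id'].astype(int)
# Everything on my list, rated or not, is excluded from the recommendations later
myList = myRatings.anime_id.tolist()
# Drop the entries without a score
myRatings = myRatings[myRatings['rating'] > 0]

# Wait for the background load of the ratings before using them
t.join()
ratingsSupp = pd.concat([ratings, myRatings], ignore_index = True)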
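
The flattening of the nested `lists` / `entries` structure can be tried without calling the API. The following is a minimal sketch on a hypothetical payload that mirrors the fields requested by the query above (`idMal`, `score`); the values are invented for illustration. Rows are collected in a list and turned into a DataFrame once, which avoids growing the table row by row.

```python
import pandas as pd

# Hypothetical payload, shaped like the MediaListCollection query above
sample = {'data': {'MediaListCollection': {'lists': [
    {'entries': [{'media': {'idMal': 1}, 'score': 9},
                 {'media': {'idMal': 20}, 'score': 0}]},
    {'entries': [{'media': {'idMal': None}, 'score': 7}]},
]}}}

rows = [
    {'user_id': 0, 'anime_id': entry['media']['idMal'], 'rating': entry['score']}
    for lst in sample['data']['MediaListCollection']['lists']
    for entry in lst['entries']
]
flat = pd.DataFrame(rows, columns=['user_id', 'anime_id', 'rating'])

# Entries without a MAL id cannot be matched to the dataset
flat = flat.dropna(subset=['anime_id']).astype({'anime_id': int})
# Keep only the entries that carry a score
rated = flat[flat['rating'] > 0]
print(rated)
```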

In [None]:
# UserID
x = 0

# Build the genre profile for the user: the mean rating given to each genre
temp = myRatings.join(animeReduced, on = 'anime_id').drop(columns = ['anime_id'])
# Weight each genre flag by the rating of its anime
temp[genres] = temp[genres].multiply(temp['rating'], axis = 0)
# A 0 means the anime is not in that genre: mark it as missing so it does not lower the mean
temp[genres] = temp[genres].replace(0, np.nan)
temp = temp.drop(columns = 'rating').groupby(by = 'user_id').mean().replace(np.nan,0)

# Also scale the user's genre profile (row-wise)
scaledTest = scale(temp)
# Cluster label predicted for the user
label = kmeans.predict(scaledTest)

# Users in the same cluster
nearestUsers = clusters[clusters['cluster'] == label[0]].user_id
temp
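
To make the profile construction easier to follow, here is a minimal sketch on toy data: each genre flag is multiplied by the rating, zeros are treated as missing, and the remaining values are averaged per user. The tables `genre_flags` and `my_ratings` are hypothetical stand-ins for `animeReduced` and `myRatings`.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot genre table, indexed by anime_id
genre_flags = pd.DataFrame(
    {'Action': [1, 1, 0], 'Comedy': [0, 1, 1]},
    index=pd.Index([10, 20, 30], name='anime_id'))

# Hypothetical ratings of a single user (user_id 0)
my_ratings = pd.DataFrame(
    {'user_id': [0, 0, 0], 'anime_id': [10, 20, 30], 'rating': [8, 6, 9]})

genre_cols = genre_flags.columns
joined = my_ratings.join(genre_flags, on='anime_id').drop(columns='anime_id')
# Rating where the anime has the genre, NaN elsewhere
weighted = joined[genre_cols].multiply(joined['rating'], axis=0).replace(0, np.nan)
weighted['user_id'] = joined['user_id']
# Mean rating per genre; genres never watched get 0
profile = weighted.groupby('user_id').mean().fillna(0)
print(profile)
```

In this toy case the Action mean is taken over anime 10 and 20 only, and the Comedy mean over anime 20 and 30.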

In [None]:
# Anime ids I am not interested in, hidden from the results
hiddenList = [121,5081,2167,4181]

# Ratings given by the users in my cluster
nearestUsers = clusters[clusters['cluster'] == label[0]].user_id
usersList = ratings.loc[ratings.user_id.isin(nearestUsers)].drop(columns = 'user_id')
# Number of ratings and mean rating per anime
usersList['count'] = 1
meanList = usersList.groupby(by = 'anime_id').sum()
meanList['rating'] = meanList['rating'] / meanList['count']
finalList = meanList.join(anime[['name', 'members']])
# Rank by number of ratings, then mean rating, then members
topTen = finalList[['name', 'count', 'rating', 'members']].sort_values(by = ['count', 'rating', 'members'], ascending = False)
# Keep only the anime that are neither on my list nor hidden
mask = (~(ratingsSupp.anime_id.isin(myList)) & ~(ratingsSupp.anime_id.isin(hiddenList)) & (ratingsSupp.user_id.isin(nearestUsers)))
result = topTen[topTen.index.isin(ratingsSupp.loc[mask].anime_id)].head(10)
result
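
The ranking step can be reproduced on a small example. The sketch below uses a named aggregation to get the count and the mean in one pass, as an alternative to the helper `count` column, and then removes what the user has already seen. All data and names (`cluster_ratings`, `already_seen`) are hypothetical.

```python
import pandas as pd

# Hypothetical ratings from users in the same cluster
cluster_ratings = pd.DataFrame({
    'anime_id': [1, 1, 1, 2, 2, 3],
    'rating':   [9, 8, 10, 10, 9, 10],
})
already_seen = [2]  # hypothetical: anime already on the target user's list

# Count and mean rating per anime, most rated first, ties broken by mean rating
ranking = (cluster_ratings.groupby('anime_id')['rating']
           .agg(count='count', rating='mean')
           .sort_values(by=['count', 'rating'], ascending=False))
recommended = ranking[~ranking.index.isin(already_seen)].head(10)
print(recommended)
```

Sorting by `count` first favours titles that many similar users have rated, so a single perfect score does not outrank a widely watched anime.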