# ANIME RECOMMENDER
---

`DESIGN`

The idea is to use empath themes to help the user choose the kind of anime they want recommendations for.

**STEPS**
- Load the pickled data.
- Finalise the data to be used.
- Create an empath search function that returns the single highest-rated anime matching the chosen empath theme.
- Use the title of that anime as the seed for the recommendation.
- Give the user the top 10 anime most similar to the seed, which should reflect the chosen theme.

**features to use**

`anime data`: 
anime id (MAL_ID), score, name, genre, type, episodes, rating, members, popularity

`user rating data`: user id, anime id, rating

---

## Modules import

In [16]:
# Importing needed modules and setting parameters
import pandas as pd
import os
import numpy as np
import json
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import matplotlib.pylab as pylab
import inflect
import ipywidgets as widgets
import warnings
from IPython.display import HTML, display
from nltk.corpus import wordnet
from clustering import *

warnings.filterwarnings('ignore')
nltk.download('wordnet', quiet=True)

with open('config.json') as json_data_file:
    config = json.load(json_data_file)

## Data import and curation

In [17]:
anime_rec_data = pd.read_pickle(
    os.path.join(config['Data_path'], 'final_recommendation_anime_data.pck'))
anime_rec_data

Unnamed: 0,anime_id,title,english_title,genres,anime_rating,synopsis,empath_themes,type,episodes,members,popularity,user_id,rating
0,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,3,9
1,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,6,6
2,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,14,9
3,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,19,8
4,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,22,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...
56726856,48456,SK∞: Crazy Rock Jam,Unknown,"Comedy, Sports",6.52,cap of the first 9 episodes of .,,Special,1,10722,4830,342067,5
56726857,48456,SK∞: Crazy Rock Jam,Unknown,"Comedy, Sports",6.52,cap of the first 9 episodes of .,,Special,1,10722,4830,347462,4
56726858,48456,SK∞: Crazy Rock Jam,Unknown,"Comedy, Sports",6.52,cap of the first 9 episodes of .,,Special,1,10722,4830,348266,5
56726859,48456,SK∞: Crazy Rock Jam,Unknown,"Comedy, Sports",6.52,cap of the first 9 episodes of .,,Special,1,10722,4830,348321,6


In [18]:
anime_rec_data_copy = anime_rec_data.copy()
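# Replace the -1 placeholder rating with NaN so those rows are dropped with the other missing values.
# Assigning the result back avoids relying on an in-place update of a column selection.
anime_rec_data_copy['rating'] = anime_rec_data_copy['rating'].replace(-1, np.nan)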
anime_rec_data_copy.head()

Unnamed: 0,anime_id,title,english_title,genres,anime_rating,synopsis,empath_themes,type,episodes,members,popularity,user_id,rating
0,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,3,9
1,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,6,6
2,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,14,9
3,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,19,8
4,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39,22,9


In [19]:
anime_rec_data_copy = anime_rec_data_copy.dropna(axis = 0, how = 'any')
anime_rec_data_copy.isnull().sum()

anime_id         0
title            0
english_title    0
genres           0
anime_rating     0
synopsis         0
empath_themes    0
type             0
episodes         0
members          0
popularity       0
user_id          0
rating           0
dtype: int64

In [20]:
anime_rec_data_copy['user_id'].value_counts()

189037    10484
68042     10063
283786    10057
162615    10028
259790     8421
          ...  
99842         1
140184        1
258522        1
247751        1
123349        1
Name: user_id, Length: 309480, dtype: int64

In [21]:
counts = anime_rec_data_copy['user_id'].value_counts()
anime_rec_data_copy = anime_rec_data_copy[anime_rec_data_copy['user_id'].isin(
    counts[counts >= 800].index)]

## Using empath themes to select seed anime title for recommendation

In [22]:
anime_seed_data = pd.read_pickle(
    os.path.join(config['Data_path'], 'final_empath_recommendation_anime_data.pck'))
anime_seed_data

Unnamed: 0,anime_id,title,english_title,genres,anime_rating,synopsis,empath_themes,type,episodes,members,popularity
0,1,Cowboy Bebop,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",8.78,"In the year 2071, humanity has colonized sever...",superhero music fun musical stealing crime art...,TV,26,1251960,39
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop:The Movie,"Action, Drama, Mystery, Sci-Fi, Space",8.39,"other day, another bounty—such is the life of ...",business surprise attractive art appearance mo...,Movie,1,273145,518
2,6,Trigun,Trigun,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",8.24,"the is the man with a $$60,000,000,000 bounty ...",suffering superhero business stealing crime mo...,TV,26,558913,201
3,7,Witch Hunter Robin,Witch Hunter Robin,"Action, Mystery, Police, Supernatural, Drama, ...",7.27,ches are individuals with special powers like ...,business crime prison art order government con...,TV,26,94683,1467
4,8,Bouken Ou Beet,Beet the Vandel Buster,"Adventure, Fantasy, Shounen, Supernatural",6.98,It is the dark century and the people are suff...,suffering fun strength order traveling governm...,TV,52,13224,4369
...,...,...,...,...,...,...,...,...,...,...,...
16209,48481,Daomu Biji Zhi Qinling Shen Shu,Unknown,"Adventure, Mystery, Supernatural",,No synopsis information has been added to this...,computer communication internet reading meetin...,ONA,Unknown,354,13116
16210,48483,Mieruko-chan,Unknown,"Comedy, Horror, Supernatural",,ko is a typical high school student whose life...,hygiene suffering business surprise crime orde...,TV,Unknown,7010,17562
16211,48488,Higurashi no Naku Koro ni Sotsu,Higurashi:When They Cry – SOTSU,"Mystery, Dementia, Horror, Psychological, Supe...",,Sequel to no ni .,,TV,Unknown,11309,17558
16212,48491,Yama no Susume: Next Summit,Unknown,"Adventure, Slice of Life, Comedy",,no anime.,,TV,Unknown,1386,17565


In [34]:
anime_seed_data['anime_rating'].isnull().sum()

5123

In [23]:
import re


def find_synonym(word_entry):
    """Return the unique WordNet lemma names across all synsets of word_entry."""
    found_synonyms = set()
    for synset in wordnet.synsets(word_entry):
        for lemma in synset.lemmas():
            found_synonyms.add(lemma.name())
    return list(found_synonyms)


def anime_search(value, df):
    """Return the rows whose synopsis matches value or its synonyms, plus the row count."""
    # An empty query matches every title.
    if value == '':
        return df, df.shape[0]

    # Collect the WordNet synonyms of every word in the query.
    synonyms = []
    for word in value.split():
        synonyms.extend(find_synonym(word))

    if not synonyms:
        # No synonyms found: fall back to a literal, case-insensitive match.
        df_out = df.loc[df['synopsis'].str.contains(value,
                                                    regex=False,
                                                    case=False,
                                                    na=False)]
    else:
        # Match any synonym preceded by a space in the synopsis.
        # re.escape stops punctuation in lemma names being read as regex syntax.
        reg = ' ' + ' | '.join(re.escape(s) for s in synonyms)
        df_out = df.loc[df['synopsis'].str.contains(reg,
                                                    regex=True,
                                                    case=False,
                                                    na=False)]

#     if toggle == 'Tags':
#         x = value.split(' ')
#         reg = ' | '.join(x)
#         df_out = df.loc[df['empath_themes'].str.contains(' ' + reg,
#                                                          regex=True,
#                                                          case=False,
#                                                          na=False)]

    return df_out, df_out.shape[0]

In [24]:
search = widgets.Text(
    placeholder = 'Type Something',
    description = 'Search:',
    disabled = False
)
print('''
What do you look for in anime that you want to watch?
Type a word describing that.

Example: Love, Action, Food, Heroes, Fight, Magic, Powers, Anger, Sadness, etc.
''')
display(search)
print('Enter the keyword and run the next code block')


What do you look for in anime that you want to watch?
Type a word describing that.

Example: Love, Action, Food, Heroes, Fight, Magic, Powers, Anger, Sadness, etc.



Text(value='', description='Search:', placeholder='Type Something')

Enter the keyword and run the next code block


In [25]:
fr_sh, size = anime_search(search.value, anime_seed_data)
df = fr_sh[[
    'title', 'english_title', 'genres', 'synopsis', 'episodes', 'anime_rating'
]]
# Rank the matches by rating; unrated (NaN) titles sort last.
df = df.sort_values(by='anime_rating', ascending=False)
if df.empty:
    raise ValueError('No anime matched the keyword: ' + repr(search.value))
# The highest-rated match becomes the seed for the recommendation.
seed_title = df.iloc[0]['title']
seed_html = df[:1].to_html(escape=False, index=False)
print('Number of anime titles about', search.value, 'is: ' + str(size))
print(
    'However, the seed anime content that will be used for the recommendation is',
    seed_title, 'because it is the highest rated anime.')
display(HTML(seed_html))

Number of anime titles about Explosions is: 29
However, the seed anime content that will be used for the recommendation is Mob Psycho 100 because it is the highest rated anime.


title,english_title,genres,synopsis,episodes,anime_rating
Mob Psycho 100,Mob Psycho 100,"Action, Slice of Life, Comedy, Supernatural","has tapped into his inner wellspring of psychic prowess at a young age. But the power quickly proves to be a liability when he realizes the potential danger in his skills. Choosing to suppress his power, only present use for his ability is to impress his longtime crush, who soon grows bored of the same tricks. In order to effectuate control on his skills, enlists himself under the wing of a con artist claiming to be a psychic, who exploits powers for pocket change. exorcising evil spirits on command has become a part of daily, monotonous life. the psychic energy he exerts is barely the tip of the iceberg; if his vast potential and unrestrained emotions run berserk, a cataclysmic event that would render him completely unrecognizable will be triggered. The progression toward explosion is rising and attempting to stop it is futile.",12,8.49


The search code takes the input and looks through either the synopsis or the empath themes for content that contains it. This is a fairly naive way to search, and it leaves me handicapped by the empath themes. A better way to handle the empath themes would be to adjust the code to return the top 5 (or top n) empath themes for each title and use that list to create categories. The empath library can create categories from words, so the idea is to create a category for each empath theme and then generate a score for how many of those category words appear in the synopsis.

Another option is to use synonyms. The synonyms of the input words form a list, which can be scored against a new column that combines the synopsis, the categories, and the empath themes. The score would be the percentage of the synonyms that appear in this new column. Once the scores are generated, the title with the highest percentage is used as the seed content for the recommendation.
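
The synonym-coverage idea can be sketched with the standard library alone. This is only an illustration under assumptions: `coverage_score` and `pick_seed` are hypothetical helper names, the synonym list is hard-coded where the notebook would call `find_synonym`, and each record's `text` field stands in for the proposed combined column.

```python
import re


def coverage_score(synonyms, text):
    """Return the fraction of synonyms that occur as whole words in text."""
    if not synonyms:
        return 0.0
    # Tokenise into lowercase words so matching ignores case and punctuation.
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    # Single-word synonyms only: multi-word lemma names would need extra handling.
    hits = sum(1 for synonym in synonyms if synonym.lower() in words)
    return hits / len(synonyms)


def pick_seed(synonyms, records):
    """Return the (title, score) pair with the highest coverage score."""
    scored = [(r["title"], coverage_score(synonyms, r["text"])) for r in records]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data standing in for the combined column.
synonyms = ["explosion", "blast", "detonation", "burst"]
records = [
    {"title": "A", "text": "A quiet school comedy about friendship."},
    {"title": "B", "text": "An explosion and a blast shake the city."},
    {"title": "C", "text": "The blast was only the beginning."},
]
seed_title, seed_score = pick_seed(synonyms, records)
```

Record "B" contains two of the four synonyms, so it is chosen as the seed. Ties go to the first record in the list; a real implementation might break ties by rating instead.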

A shortlist should be created first, however, so that the code does not comb through all the data. The previous idea can supply part of this step: the synonyms of the input word first select the titles whose synopsis contains any of them, and the scores are then generated only for that shortlist.

The categories for the top 5 empath themes could also be precomputed and stored. That would make the process faster, because at query time we would only need to look up the synonyms of the input and score them against the cell that combines the synopsis, categories, and empath themes.

Investigate using fuzzy matching for this measure creation.
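
One way to start exploring fuzzy matching is the standard library's `difflib`. This is a sketch rather than a benchmarked choice: the helper name `fuzzy_coverage` and the 0.8 cutoff are assumptions, and a dedicated fuzzy-matching library may be a better fit for the full dataset.

```python
import re
from difflib import get_close_matches


def fuzzy_coverage(synonyms, text, cutoff=0.8):
    """Fraction of synonyms with at least one approximately matching word in text."""
    if not synonyms:
        return 0.0
    words = sorted(set(re.findall(r"[a-z0-9']+", text.lower())))
    hits = 0
    for synonym in synonyms:
        # get_close_matches returns the words whose similarity ratio is >= cutoff.
        if get_close_matches(synonym.lower(), words, n=1, cutoff=cutoff):
            hits += 1
    return hits / len(synonyms)


# Hypothetical synopsis and synonym list.
synopsis = "Explosions rock the city as rival psychics fight."
score = fuzzy_coverage(["explosion", "psychic", "romance"], synopsis)
```

Here the plural forms "explosions" and "psychics" are close enough to count, so two of the three synonyms match, whereas an exact word match would find none of them.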

In [26]:
seed_data_matrix = tfidf_empath(anime_seed_data)
seed_data_matrix.shape

(16214, 1970)

In [27]:
from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
sig = sigmoid_kernel(seed_data_matrix, seed_data_matrix)
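
For reference, here is a pure-Python sketch of what `sigmoid_kernel` computes for a single pair of rows: the hyperbolic tangent of a scaled dot product. The defaults shown (`gamma = 1 / n_features`, `coef0 = 1`) follow the scikit-learn documentation; the function name `sigmoid_similarity` and the toy vectors are hypothetical.

```python
import math


def sigmoid_similarity(x, y, gamma=None, coef0=1.0):
    """Return tanh(gamma * <x, y> + coef0) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if gamma is None:
        # Assumed default, matching scikit-learn: one over the number of features.
        gamma = 1.0 / len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


# Toy TF-IDF-like rows: a and b share terms, a and c share none.
a = [0.0, 0.6, 0.8]
b = [0.0, 0.8, 0.6]
c = [1.0, 0.0, 0.0]
sim_ab = sigmoid_similarity(a, b)
sim_ac = sigmoid_similarity(a, c)
```

Because `tanh` is monotonically increasing, ranking titles by this score gives the same order as ranking them by the raw dot product of their rows.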

In [28]:
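# Map each title to its row index.
seed_data_indices = pd.Series(anime_seed_data.index, index=anime_seed_data['title'])
# drop_duplicates() compares the values (the row indices), which are already unique,
# so repeated titles are removed by checking the index labels instead.
seed_data_indices = seed_data_indices[~seed_data_indices.index.duplicated(keep='first')]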

In [39]:
def give_rec(title, sig=sig):
    # Get the index corresponding to the seed title
    idx = seed_data_indices[title]

    # Get the pairwise similarity scores between the seed and every anime
    sig_scores = list(enumerate(sig[idx]))

    # Sort the anime from most to least similar
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar anime, skipping the first entry (normally the seed itself)
    sig_scores = sig_scores[1:11]

    # Positional indices of the recommended anime
    anime_indices = [i[0] for i in sig_scores]

    # Build the recommendation table and order it by rating
    anime_rec = pd.DataFrame({
        'Anime Title':
        anime_seed_data['title'].iloc[anime_indices].values,
        'English Title':
        anime_seed_data['english_title'].iloc[anime_indices].values,
        'Rating':
        anime_seed_data['anime_rating'].iloc[anime_indices].values
    })
    anime_rec = anime_rec.sort_values(by='Rating', ascending=False)

    return anime_rec

In [40]:
display(HTML(give_rec(seed_title).to_html()))

Unnamed: 0,Anime Title,English Title
0,Code Geass: Boukoku no Akito 2 - Hikisakareshi Yokuryuu,Code Geass:Akito the Exiled - The Wyvern Divided
1,Detective Conan Movie 16: The Eleventh Striker,Unknown
2,Taimadou Gakuen 35 Shiken Shoutai,AntiMagic Academy 35th Test Platoon
3,Kekkai Sensen & Beyond,Blood Blockade Battlefront & Beyond
4,Akira,AKIRA
5,Makyou Gaiden Le Deus,Ladius
6,Carol,Unknown
7,Bucket no Naka no Kaka,Kaka in the bucket.
8,Kakumeiki Valvrave 2nd Season,Valvrave the Liberator 2nd Season
9,Speed Grapher,Speed Grapher


`TO DO:`
- Remove titles with 'Unknown' in them.
- Remove titles with a rating of NaN.
- Change the display names in the cluster to English names.
