# Anime Classifier with Naïve Bayes
## Goal
Classify animes into genres using their synopses.
## Process
* Fetch anime data using an API
* Parse this data into class instances to facilitate processing
* Create a dataframe using our data, adding, for each word of the vocabulary, its number of occurrences in each synopsis
* Implement the Multinomial Naïve Bayes algorithm -> calculate constants + classify
* Compute success rate for each genre
* Refine the classification system to raise the success rate

## Fetching Anime Data using an API
The API we will use is [Jikan](https://jikan.docs.apiary.io/), an unofficial API for the website [MyAnimeList](https://myanimelist.net/).
It does not require any authentication.

The only type of request we will use fetches the most popular animes of a specific genre, one page of 99 animes at a time ([doc](https://jikan.docs.apiary.io/#reference/0/genre)).


In [24]:
import requests
import json
import pandas as pd
import time
import re
import os.path
from os import path

class Constants:
    FORCE_API_GET = False
    FORCE_COMPUTE_PARAMETERS = False
    ANIMES_CSV_FILENAME = "animes_from_api.csv"

In [15]:
# gets the first page of the most popular animes of genre of id 8 (Drama).
# then converts it to a python dictionary
requests.get("https://api.jikan.moe/v3/genre/anime/8/1").json()

{'request_hash': 'request:genre:903a543f9a6ca027a1ea4eb619791bb1776ca2cf',
 'request_cached': True,
 'request_cache_expiry': 20845,
 'mal_url': {'mal_id': 8,
  'type': 'anime',
  'name': 'Drama Anime',
  'url': 'https://myanimelist.net/anime/genre/8/Drama'},
 'item_count': 2499,
 'anime': [{'mal_id': 16498,
   'url': 'https://myanimelist.net/anime/16498/Shingeki_no_Kyojin',
   'title': 'Shingeki no Kyojin',
   'image_url': 'https://cdn.myanimelist.net/images/anime/10/47347.jpg',
   'synopsis': "Centuries ago, mankind was slaughtered to near extinction by monstrous humanoid creatures called titans, forcing humans to hide in fear behind enormous concentric walls. What makes these giants truly terrifying is that their taste for human flesh is not born out of hunger but what appears to be out of pleasure. To ensure their survival, the remnants of humanity began living within defensive barriers, resulting in one hundred years without a single titan encounter. However, that fragile calm is s

In [2]:
# get the anime of id 1
response = requests.get("https://api.jikan.moe/v3/anime/1").json()
# URL of page 1 of the animes of genre id 1 (99 animes?)
"https://api.jikan.moe/v3/genre/anime/1/1"

'https://api.jikan.moe/v3/genre/anime/1/1'

In [16]:
genres_name = {
    4: "Comedy",
    8: "Drama"
}
# genres gathered by requesting 500+ animes
genres = ['Romance',
 'Historical',
 'Space',
 'Cars',
 'Game',
 'Supernatural',
 'Fantasy',
 'Psychological',
 'Sci-Fi',
 'Seinen',
 'Parody',
 'Drama',
 'Action',
 'Josei',
 'Police',
 'Super Power',
 'Sports',
 'Military',
 'Demons',
 'Vampire',
 'Adventure',
 'Shoujo Ai',
 'Mecha',
 'Shounen',
 'Horror',
 'Kids',
 'Dementia',
 'Samurai',
 'Shounen Ai',
 'Slice of Life',
 'Comedy',
 'Magic',
 'Shoujo',
 'Mystery',
 'Music',
 'Thriller',
 'Martial Arts',
 'School',
 'Harem',
 'Ecchi']


In [17]:
response["anime"][0]

NameError: name 'response' is not defined

In [18]:
# We get genres as a list of dictionaries, each dictionary having several keys.
# The name of the genre is the only one we are interested in.
def simplify_genres(genres):
    clean_genres = []
    for genre in genres:
        clean_genres.append(genre["name"])
    return clean_genres

# takes in the response of the request to get animes of a genre, and returns a list of anime objects
def simplify(animes_of_genre_json):
    animes_json = animes_of_genre_json["anime"]
    animes_objs = []
    for anime_json in animes_json:
        animes_objs.append(Anime(anime_json))
    return animes_objs


# we will ignore the words that refer to genres or column names in our synopses to avoid conflicts
banned_words = set([genre.lower() for genre in genres] + ["title", "synopsis", "genres"]) 
# in order to exclude meaningless words, we only keep words longer than 5 characters.
def is_valid(word):
    return word not in banned_words and len(word) > 5

# formats a string: replaces punctuation with spaces, lowers it, removes 'written by mal rewrite',
# then returns the list of valid words
def clean_synopsis(string):
    # lower() must come before replace(), otherwise "Written by MAL Rewrite" never matches
    words = re.sub(r'\W', ' ', string).lower().replace("written by mal rewrite", "").split()
    valid_words = [word for word in words if is_valid(word)]
    return valid_words

# We create our own structure to simplify the program and only work with the data we need
class Anime:
    
    def __init__(self, anime_from_json):
        self.title = anime_from_json["title"]
        self.synopsis = clean_synopsis(anime_from_json["synopsis"])
        self.genres = simplify_genres(anime_from_json["genres"])

        
# Returns a df from anime objects
def get_df(animes):
    titles = []
    synopses = []
    genres = []
    for anime in animes:
        titles.append(anime.title)
        synopses.append(anime.synopsis)
        genres.append(anime.genres)
    return pd.DataFrame(data = {"title": titles, "synopsis": synopses, "genres": genres})
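
To see what the cleaning step keeps, here is a small self-contained sketch of the same idea on a made-up sentence. `BANNED` and `tokenize` are illustrative names, not part of the notebook. The sketch assumes the intended behaviour is to lowercase the text before removing the "written by mal rewrite" marker, and it keeps only words longer than 5 characters, as `is_valid` does.

```python
import re

# Illustrative stand-ins for the notebook's banned_words and clean_synopsis
BANNED = {"action", "comedy", "title", "synopsis", "genres"}

def tokenize(text):
    # replace every non-word character with a space, then lowercase
    text = re.sub(r"\W", " ", text).lower()
    # drop the MyAnimeList rewrite marker once the text is lowercase
    text = text.replace("written by mal rewrite", "")
    # keep words longer than 5 characters that are not banned
    return [w for w in text.split() if len(w) > 5 and w not in BANNED]

sample = "Centuries ago, mankind hid behind enormous walls. [Written by MAL Rewrite]"
tokens = tokenize(sample)
print(tokens)  # ['centuries', 'mankind', 'behind', 'enormous']
```

Note that "walls" is dropped because it has exactly 5 characters.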

In [112]:
# anime_objs = simplify(response)
# df = get_df(anime_objs)

In [19]:
# GET request for one page of the animes of a genre
def request_genre(genre_id, page=1):
    url = "https://api.jikan.moe/v3/genre/anime/" + str(genre_id) + "/" + str(page)
    return requests.get(url).json()

# remove duplicates
def remove_duplicates(animes_list):
    singleton = []
    anime_titles = set()
    for i, anime in enumerate(animes_list):
        if anime.title not in anime_titles:
            anime_titles.add(anime.title)
            singleton.append(anime)
    return singleton

# gets the animes of all specified genres (pages 1 to pages_per_genre) as a dataframe, without duplicates
def fetch_data(genres_dictionary, pages_per_genre=3):
    anime_groups = []
    for genre_id in genres_dictionary.keys():
        for page in range(1, pages_per_genre + 1):
            anime_groups.append(simplify(request_genre(genre_id, page)))
            time.sleep(4)    # as we request a lot of data each request, we wait a large amount of time to not flood the API
        
    animes = []    
    for group in anime_groups:
        for anime in group:
            animes.append(anime)
    animes = remove_duplicates(animes)
    return get_df(animes)      

In [65]:
import ast

# We only fetch API data if we did not already do it and saved it, or if we set it to do it regardless (new data?)
if not path.isfile(Constants.ANIMES_CSV_FILENAME) or Constants.FORCE_API_GET:
    print("Fetching & saving anime data")
    animes = fetch_data(genres_name)
    animes.to_csv(Constants.ANIMES_CSV_FILENAME, index = False)
else:
    print("Loading anime data")
    animes = pd.read_csv(Constants.ANIMES_CSV_FILENAME)
    # the synopsis and genres columns are saved and loaded as strings, not lists. We need to convert them back to lists
    animes["synopsis"] = animes["synopsis"].apply(ast.literal_eval)
    animes["genres"] = animes["genres"].apply(ast.literal_eval)


print("Loaded " + str(animes.shape[0]) + " animes")
animes.head()

Loading anime data
Loaded 549 animes


Unnamed: 0,title,synopsis,genres
0,Fullmetal Alchemist: Brotherhood,"[something, obtained, something, alchemy, equi...","['Action', 'Military', 'Adventure', 'Comedy', ..."
1,One Punch Man,"[seemingly, ordinary, unimpressive, saitama, r...","['Action', 'Sci-Fi', 'Comedy', 'Parody', 'Supe..."
2,No Game No Life,"[surreal, follows, siblings, online, behind, l...","['Game', 'Adventure', 'Comedy', 'Supernatural'..."
3,Naruto,"[moments, naruto, uzumaki, kyuubi, tailed, att...","['Action', 'Adventure', 'Comedy', 'Super Power..."
4,Boku no Hero Academia,"[appearance, quirks, discovered, powers, stead...","['Action', 'Comedy', 'School', 'Shounen', 'Sup..."


In [66]:
# create a column per genre, and set it to True if the anime is of this genre, False otherwise
def clean_genres(df, genres):
    for genre in genres:
        df[genre] = df["genres"].apply(lambda genre_list: genre in genre_list)

clean_genres(animes, genres)
animes.head()

Unnamed: 0,title,synopsis,genres,Romance,Historical,Space,Cars,Game,Supernatural,Fantasy,...,Comedy,Magic,Shoujo,Mystery,Music,Thriller,Martial Arts,School,Harem,Ecchi
0,Fullmetal Alchemist: Brotherhood,"[something, obtained, something, alchemy, equi...","['Action', 'Military', 'Adventure', 'Comedy', ...",False,False,False,False,False,False,True,...,True,True,False,False,False,False,False,False,False,False
1,One Punch Man,"[seemingly, ordinary, unimpressive, saitama, r...","['Action', 'Sci-Fi', 'Comedy', 'Parody', 'Supe...",False,False,False,False,False,True,False,...,True,False,False,False,False,False,False,False,False,False
2,No Game No Life,"[surreal, follows, siblings, online, behind, l...","['Game', 'Adventure', 'Comedy', 'Supernatural'...",False,False,False,False,True,True,True,...,True,False,False,False,False,False,False,False,False,True
3,Naruto,"[moments, naruto, uzumaki, kyuubi, tailed, att...","['Action', 'Adventure', 'Comedy', 'Super Power...",False,False,False,False,False,False,False,...,True,False,False,False,False,False,True,False,False,False
4,Boku no Hero Academia,"[appearance, quirks, discovered, powers, stead...","['Action', 'Comedy', 'School', 'Shounen', 'Sup...",False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,True,False,False
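
One thing worth checking in the membership test used by `clean_genres`: when a list column has been read back from CSV without being parsed, each cell is a single string, and `in` becomes a substring test. The sketch below uses a made-up cell value to show the difference. Parsing with `ast.literal_eval` is one possible way to get exact matching, not necessarily the only one.

```python
import ast

# A genres cell as it comes back from a CSV round trip: one string, not a list
raw = "['Comedy', 'Shoujo Ai', 'School']"

# On a string, `in` is a substring test, so 'Shoujo' matches inside 'Shoujo Ai'
substring_hit = "Shoujo" in raw

# Parsing the string back into a list makes `in` an exact element test
parsed = ast.literal_eval(raw)
exact_hit = "Shoujo" in parsed

print(substring_hit, exact_hit)  # True False
```

The same applies to 'Shounen' and 'Shounen Ai'.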


## Training & Test Set

In [67]:
# Randomize the dataset
data_randomized = animes.sample(frac=1, random_state=1)

# Calculate index for split (80% 20%)
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print("Training shape: " + str(training_set.shape))
print("Test shape: " + str(test_set.shape))

Training shape: (439, 43)
Test shape: (110, 43)


In [68]:
# create a list of all the words present in synopses
# synopses are already curated to only contain valid words
def get_vocabulary(series):
    vocab = []
    for synopsis in series:
        for word in synopsis:
            vocab.append(word)
    return list(set(vocab))

vocabulary = get_vocabulary(training_set["synopsis"])
len(vocabulary)

6890

### Adding word occurrence count columns
For each word, we will add a column counting the occurrences of this word in the synopsis of each anime

In [69]:
# Creates an initial dictionary associating each word with a list of n zeros, n being the number of rows (animes)
word_counts_per_synopsis = {unique_word: [0] * len(training_set["synopsis"]) for unique_word in vocabulary}

# We then populate our dictionary with the number of occurrences
for index, synopsis in enumerate(training_set["synopsis"]):
    for word in synopsis:
        word_counts_per_synopsis[word][index] += 1

# We finally convert it to a dataframe for concat purposes
word_counts = pd.DataFrame(word_counts_per_synopsis)
word_counts.head()

Unnamed: 0,questioning,theories,otonashi,takeru,version,proclaiming,runaway,potential,prolong,coastal,...,uneventful,departure,pachinko,fictional,fumbling,hidekaz,advancement,shunned,superpowers,pureblood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
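
The same count table can be sketched without pandas, which makes the structure easier to see: one row per synopsis, one column per vocabulary word. The toy synopses below are made up, and using `collections.Counter` is an alternative shown for illustration rather than the notebook's approach.

```python
from collections import Counter

# Toy stand-ins for cleaned synopses (lists of valid words)
synopses = [
    ["titans", "humanity", "titans"],
    ["academy", "students", "humanity"],
]

# vocabulary: every distinct word, sorted for a stable column order
vocabulary = sorted({word for synopsis in synopses for word in synopsis})

# one Counter per synopsis, expanded into a row of counts per vocabulary word
rows = []
for synopsis in synopses:
    counts = Counter(synopsis)
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['academy', 'humanity', 'students', 'titans']
for row in rows:
    print(row)     # [0, 1, 0, 2] then [1, 1, 1, 0]
```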


In [70]:
# We can now concatenate our title, synopsis and genres data with our new word occurrence data
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,title,synopsis,genres,Romance,Historical,Space,Cars,Game,Supernatural,Fantasy,...,uneventful,departure,pachinko,fictional,fumbling,hidekaz,advancement,shunned,superpowers,pureblood
0,Rosario to Vampire Capu2,"[tsukune, enrolled, youkai, academy, interesti...","['Comedy', 'Ecchi', 'Fantasy', 'Harem', 'Roman...",True,False,False,False,False,False,True,...,0,0,0,0,0,0,0,0,0,0
1,Akame ga Kill!,"[covert, assassination, branch, revolutionary,...","['Action', 'Adventure', 'Drama', 'Fantasy', 'S...",False,False,False,False,False,False,True,...,0,0,0,0,0,0,0,0,0,0
2,Kekkai Sensen,"[supersonic, monkeys, vampires, talking, fishm...","['Action', 'Comedy', 'Super Power', 'Supernatu...",False,False,False,False,False,True,True,...,0,0,0,0,0,0,0,0,0,0
3,Made in Abyss,"[gaping, stretching, depths, filled, mysteriou...","['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...",False,False,False,False,False,False,True,...,0,0,0,0,0,0,0,0,0,0
4,Saenai Heroine no Sodatekata,"[tomoya, obsessed, collecting, novels, attachi...","['Harem', 'Comedy', 'Romance', 'Ecchi', 'School']",True,False,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0


### Calculating Constants

In [71]:
# P(genre) -> proportion of animes of this genre among all animes (genre_not_p below is its complement)
genre_p = {}
for genre in genres:
    genre_p[genre] = (training_set_clean[genre] == True).sum() / len(training_set_clean)

genre_not_p = {}
for genre in genres:
    genre_not_p[genre] = (training_set_clean[genre] == False).sum() / len(training_set_clean)

# Associates genres to the number of words in all synopses of this genre
genre_n_words = {}    
for genre in genres:
    rows_of_this_genre = training_set_clean[training_set_clean[genre] == True]
    genre_n_words[genre] = rows_of_this_genre["synopsis"].apply(len).sum()
    
# Associates genres to the number of words in all synopses not of this genre
total_words = training_set_clean["synopsis"].apply(len).sum()
not_genre_n_words = {}    
for genre in genres:
    not_genre_n_words[genre] = total_words - genre_n_words[genre]
    
alpha = 1    # additive smoothing parameter
n_vocabulary = len(vocabulary)
genre_n_words

{'Romance': 9736,
 'Historical': 1311,
 'Space': 573,
 'Cars': 58,
 'Game': 500,
 'Supernatural': 5536,
 'Fantasy': 5068,
 'Psychological': 2227,
 'Sci-Fi': 3888,
 'Seinen': 2415,
 'Parody': 1056,
 'Drama': 12554,
 'Action': 7438,
 'Josei': 333,
 'Police': 214,
 'Super Power': 1823,
 'Sports': 1229,
 'Military': 1269,
 'Demons': 958,
 'Vampire': 569,
 'Adventure': 3366,
 'Shoujo Ai': 167,
 'Mecha': 1132,
 'Shounen': 5675,
 'Horror': 927,
 'Kids': 203,
 'Dementia': 399,
 'Samurai': 369,
 'Shounen Ai': 125,
 'Slice of Life': 4793,
 'Comedy': 14074,
 'Magic': 1706,
 'Shoujo': 1816,
 'Mystery': 3073,
 'Music': 598,
 'Thriller': 961,
 'Martial Arts': 542,
 'School': 7313,
 'Harem': 2569,
 'Ecchi': 2346}
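
For reference, these constants are the ingredients of the multinomial Naïve Bayes rule with additive smoothing. Written out in the standard formulation (the symbols $N_{w,g}$, $N_g$ and $V$ are notation introduced here, mapped to the notebook's names below):

$$
P(g \mid w_1, \dots, w_n) \propto P(g) \prod_{i=1}^{n} P(w_i \mid g),
\qquad
P(w \mid g) = \frac{N_{w,g} + \alpha}{N_g + \alpha V}
$$

Here $P(g)$ is `genre_p[g]`, $N_{w,g}$ is the number of occurrences of word $w$ in synopses of genre $g$, $N_g$ is `genre_n_words[g]`, $V$ is `n_vocabulary` and $\alpha$ is `alpha`. The same two expressions with `genre_not_p` and `not_genre_n_words` give the score for not being of genre $g$. A genre is selected when the first score is larger than the second.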

### Calculating Parameters

In [72]:
import time

def save_parameters():
#     filename = "parameters_" + str(len(clean_training_data)) + "_animes_" + str(n_vocabulary)  + "_words_" + str(len(genres)) + "_genres.json"
    with open("parameters.json", 'w', encoding='utf-8') as f:
        json.dump(parameters, f, ensure_ascii=False, indent=4)
    with open("parameters_no.json", 'w', encoding='utf-8') as f:
        json.dump(parameters_no, f, ensure_ascii=False, indent=4)
        
def load_parameters(filename="parameters.json", filename_no="parameters_no.json"):
    params, params_no = None, None
    with open(filename, 'r') as f:
        params =  json.load(f)
    with open(filename_no, 'r') as n:
        params_no =  json.load(n)
    return params, params_no

def compute_parameters():
    task_start_time = time.perf_counter()

    parameters = {genre: {unique_word:0 for unique_word in vocabulary} for genre in genres}
    parameters_no = {genre: {unique_word:0 for unique_word in vocabulary} for genre in genres} # P(word|not_genre)
    for word in vocabulary:
        for genre in genres:

            # P(word|genre), with additive smoothing
            rows_of_this_genre = training_set_clean[training_set_clean[genre] == True]
            n_word_given_genre = rows_of_this_genre[word].sum()    # number of occurrences of the current word in synopses of this genre
            p_word_given_genre = (n_word_given_genre + alpha) / (genre_n_words[genre] + alpha*n_vocabulary)
            parameters[genre][word] = p_word_given_genre

            # P(word|not_genre), with additive smoothing
            rows_not_of_this_genre = training_set_clean[training_set_clean[genre] == False]
            n_word_given_not_genre = rows_not_of_this_genre[word].sum()
            p_word_given_not_genre = (n_word_given_not_genre + alpha) / (not_genre_n_words[genre] + alpha*n_vocabulary)
            parameters_no[genre][word] = p_word_given_not_genre

    task_duration = time.perf_counter() - task_start_time
    total_operations = n_vocabulary * len(genres)
    duration_per_operation = task_duration / total_operations
    print("Computing the parameters took " + str(round(task_duration, 2)) + " seconds for " + str(total_operations) + " operations") 
    print(str(round(duration_per_operation * 1000, 2)) + " milliseconds per operation")
    
    return parameters, parameters_no

# If we already computed our parameters and saved them locally, we load them (unless a recomputation is forced).
# Otherwise we compute them; uncomment save_parameters() to also save them
if path.isfile('parameters.json') and path.isfile('parameters_no.json') and not Constants.FORCE_COMPUTE_PARAMETERS:
    parameters, parameters_no = load_parameters()
else:
    parameters, parameters_no = compute_parameters()
    #save_parameters()


parameters["Action"]

{'illegal': 0.00013905304873809357,
 'masochist': 6.952652436904678e-05,
 'initially': 0.00034763262184523395,
 'succeeding': 0.00013905304873809357,
 'scorned': 6.952652436904678e-05,
 'princesses': 0.00013905304873809357,
 'wannabe': 6.952652436904678e-05,
 'graceful': 6.952652436904678e-05,
 'inhumane': 0.00020857957310714038,
 'abnormal': 0.00013905304873809357,
 'reverted': 6.952652436904678e-05,
 'severely': 6.952652436904678e-05,
 'retells': 6.952652436904678e-05,
 'remembers': 6.952652436904678e-05,
 'provoking': 0.00013905304873809357,
 'pleading': 6.952652436904678e-05,
 'retreat': 0.00013905304873809357,
 'suzutsuki': 6.952652436904678e-05,
 'hacked': 0.00020857957310714038,
 'identity': 0.00027810609747618714,
 'bloodstone': 6.952652436904678e-05,
 'anteiku': 0.00020857957310714038,
 'studio': 6.952652436904678e-05,
 'bravest': 6.952652436904678e-05,
 'festas': 0.00020857957310714038,
 'sakuranomori': 6.952652436904678e-05,
 'domain': 0.00013905304873809357,
 'stalker': 0.0
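
The loop in `compute_parameters` filters the dataframe once per (word, genre) pair. A cheaper route, sketched here on toy data with plain dictionaries, accumulates word counts in a single pass over the synopses and then applies the same smoothing formula. The function name `word_likelihoods` and the toy data are illustrative assumptions, not the notebook's code.

```python
from collections import Counter

def word_likelihoods(synopses, labels, vocabulary, alpha=1):
    """Return smoothed P(word | genre) and P(word | not genre) for one genre.

    synopses: list of word lists; labels: list of booleans (True = has the genre).
    """
    in_counts, out_counts = Counter(), Counter()
    for words, has_genre in zip(synopses, labels):
        # a single pass: each synopsis adds to exactly one of the two counters
        (in_counts if has_genre else out_counts).update(words)

    n_in = sum(in_counts.values())
    n_out = sum(out_counts.values())
    n_vocab = len(vocabulary)

    p_in = {w: (in_counts[w] + alpha) / (n_in + alpha * n_vocab) for w in vocabulary}
    p_out = {w: (out_counts[w] + alpha) / (n_out + alpha * n_vocab) for w in vocabulary}
    return p_in, p_out

synopses = [["titans", "humanity", "titans"], ["academy", "students"], ["titans", "academy"]]
is_action = [True, False, True]
vocabulary = ["academy", "humanity", "students", "titans"]

p_word_action, p_word_not_action = word_likelihoods(synopses, is_action, vocabulary)
print(p_word_action["titans"])      # (3 + 1) / (5 + 4)
print(p_word_not_action["titans"])  # (0 + 1) / (2 + 4)
```

With smoothing applied to every vocabulary word, each of the two distributions sums to 1, which is a quick sanity check on the result.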

In [144]:
import re

def sort_dictio(d, descending=True):
    return {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse = descending)}

def classify(synopsis):    
    # synopsis is already cleaned: it's a list of valid words
    
    p_genre_given_synopsis = {genre: genre_p[genre] for genre in genres} # starts as a copy of genre_p
    p_not_genre_given_synopsis = {genre: genre_not_p[genre] for genre in genres} # starts as a copy of genre_not_p
    for word in synopsis:
        for genre in genres:
            if word in parameters[genre]:
                proba = parameters[genre][word]
                p_genre_given_synopsis[genre] *= proba
            if word in parameters_no[genre]:
                proba = parameters_no[genre][word]
                p_not_genre_given_synopsis[genre] *= proba
                
    return p_genre_given_synopsis, p_not_genre_given_synopsis
 
# Returns a dictionary associating each selected genre (P(genre|synopsis) > P(not_genre|synopsis)) with its confidence
# The confidence is by how much (in %) the score for being of this genre exceeds the score for not being of it, given the synopsis
def extract_best_guesses(classification):
    p, no_p = classification
    genres_classified = {}
    for k, v in p.items():
        if v > no_p[k]:        
            confidence = (v - no_p[k]) / (no_p[k]) * 100   
            genres_classified[k] = confidence
#     genres_selected_by_confidence = list(sort_dictio(genres_classified).keys())
    return sort_dictio(genres_classified)

# returns a list of predicted genres, ordered by confidence
# use it as df["predicted_genres"] = df["synopsis"].apply(get_predicted_genres)
def get_predicted_genres(synopsis):
    classification = classify(synopsis)
    selected_genres_by_confidence = extract_best_guesses(classification)
    selected_genres_array = list(selected_genres_by_confidence.keys())
    return selected_genres_array

# creates two boolean columns per genre:
# - "[genre]_prediction" is True if the genre was predicted as present, False otherwise
# - "[genre]_prediction_is_accurate" is True if the genre prediction is accurate, False otherwise
def predict_all(df):
    df["predicted_genres"] = df["synopsis"].apply(get_predicted_genres)
    for genre in genres:
        df[genre + "_prediction"] = (df["predicted_genres"].apply(lambda predicted_genres: genre in predicted_genres))
        df[genre + "_prediction_is_accurate"] = df[genre + "_prediction"] == df[genre]

# Adds the column "accuracy" to the df, specifying the accuracy (%) of the predicted genres
# This accuracy is the percentage of genres predicted accurately (N_correct_boolean_values / len(genres))
def compute_accuracy_per_anime(df):    
    predicted_col_names = [g + "_prediction_is_accurate" for g in genres]
    # we sum the boolean values on each row, selecting all the columns relative to a genre's prediction status
    df["accuracy"] = df[predicted_col_names].sum(axis = 1) / len(genres) * 100

# Returns a series associating each genre with its overall accuracy
def get_accuracy_per_genre_df(df):
    # we sum the boolean values on each column, selecting all the columns relative to a genre's prediction status
    predicted_col_names = [g + "_prediction_is_accurate" for g in genres]
    accuracy_per_genre = (df[predicted_col_names].sum(axis = 0) / len(df)) * 100
    
    # breaking down the successful and failed predictions: true positives, false positives, true negatives, and false negatives
    true_positive_proportions = []
    false_positive_proportions = []
    true_negative_proportions = []
    false_negative_proportions = []
    for genre in genres:
        true_positive_proportions.append(((df[genre + "_prediction"] == True) & (df[genre] == True)).sum() / len(df) * 100)
        false_positive_proportions.append(((df[genre + "_prediction"] == True) & (df[genre] == False)).sum() / len(df) * 100)
        true_negative_proportions.append(((df[genre + "_prediction"] == False) & (df[genre] == False)).sum() / len(df) * 100)
        false_negative_proportions.append(((df[genre + "_prediction"] == False) & (df[genre] == True)).sum() / len(df) * 100)

    genre_stats = pd.DataFrame({
        "accuracy": list(accuracy_per_genre),
        "true_positive_proportion": true_positive_proportions,
        "false_positive_proportion": false_positive_proportions,
        "true_negative_proportion": true_negative_proportions,
        "false_negative_proportion": false_negative_proportions
    })
    genre_stats.index = genres
     
    return genre_stats.sort_values("accuracy", ascending = False)

test_anime = test_set.iloc[19]
print(test_anime["title"])
print(test_anime["genres"])
synopsis = test_anime["synopsis"]
classification = classify(synopsis)
p_to_be_of_genre, p_not_to_be_of_genre = sort_dictio(classification[0]), sort_dictio(classification[1])
print(p_to_be_of_genre)
print(p_not_to_be_of_genre)
extract_best_guesses(classification)

Koutetsujou no Kabaneri
['Action', 'Horror', 'Supernatural', 'Drama', 'Fantasy']
{'Fantasy': 1.3156059070117907e-215, 'Action': 2.2093540445631446e-217, 'Supernatural': 9.26372418212551e-218, 'Horror': 5.810137623738171e-218, 'Drama': 6.588123418410472e-219, 'Shounen': 2.3681243786460564e-228, 'Sci-Fi': 1.1291984122596291e-229, 'Psychological': 6.871419330230315e-231, 'Romance': 3.758620476141778e-231, 'Mystery': 6.208033561015271e-232, 'Harem': 5.211410038819293e-232, 'Comedy': 3.6514963885651564e-232, 'Adventure': 3.214402638510496e-232, 'Seinen': 1.2488877032387146e-232, 'Ecchi': 2.92358934778994e-233, 'Magic': 1.6593707917238544e-233, 'Military': 7.61939620430469e-234, 'Mecha': 6.126867714794904e-234, 'School': 4.776549606822182e-234, 'Super Power': 4.716657887324622e-234, 'Sports': 6.77538538986187e-235, 'Demons': 4.108263098627659e-236, 'Slice of Life': 3.1468884161985982e-236, 'Shoujo': 4.4799993094027416e-237, 'Martial Arts': 1.500649207323808e-237, 'Vampire': 9.605877919509009

{'Fantasy': 6.84450881107442e+17,
 'Action': 1.8059363552214988e+16,
 'Horror': 1153503019328763.8,
 'Supernatural': 973476749571869.1,
 'Drama': 596873459472251.9}
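
The raw products above are already around 1e-215, not far from the smallest magnitudes a double can hold (roughly 1e-308). A longer synopsis would underflow to 0.0 on both sides and make the comparison meaningless. A common workaround, sketched here as an assumption rather than as the notebook's method, is to sum logarithms instead of multiplying probabilities. The name `log_score` and the toy parameter tables are hypothetical.

```python
import math

def log_score(prior, word_probs, synopsis):
    """Return log(prior) + sum of log P(word | class) for the known words."""
    score = math.log(prior)
    for word in synopsis:
        if word in word_probs:  # unknown words are skipped, as in classify()
            score += math.log(word_probs[word])
    return score

# Toy parameters for one genre and its complement
p_genre, p_not_genre = 0.3, 0.7
p_word_genre = {"titans": 0.02, "academy": 0.001}
p_word_not_genre = {"titans": 0.001, "academy": 0.01}

# A long synopsis: the plain product underflows, the log sum does not
synopsis = ["titans"] * 400
plain_product = p_genre * (0.02 ** 400)
log_in = log_score(p_genre, p_word_genre, synopsis)
log_out = log_score(p_not_genre, p_word_not_genre, synopsis)

print(plain_product)     # 0.0 (underflow)
print(log_in > log_out)  # True: the genre is still selected
```

With log scores, the difference `log_in - log_out` could play the role of the confidence computed in `extract_best_guesses`, though it would be on a different scale than the current percentage.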

In [21]:
test_set.iloc[8][["title", "synopsis", "genres"]]

title                        Boku no Hero Academia 3rd Season
synopsis    [summer, arrives, students, academy, superhero...
genres         [Action, Comedy, School, Shounen, Super Power]
Name: 8, dtype: object

In [None]:
{'mal_id': 16498,
 'url': 'https://myanimelist.net/anime/16498/Shingeki_no_Kyojin',
 'title': 'Shingeki no Kyojin',
 'image_url': 'https://cdn.myanimelist.net/images/anime/10/47347.jpg',
 'synopsis': "Centuries ago, mankind was slaughtered to near extinction by monstrous humanoid creatures called titans, forcing humans to hide in fear behind enormous concentric walls. What makes these giants truly terrifying is that their taste for human flesh is not born out of hunger but what appears to be out of pleasure. To ensure their survival, the remnants of humanity began living within defensive barriers, resulting in one hundred years without a single titan encounter. However, that fragile calm is soon shattered when a colossal titan manages to breach the supposedly impregnable outer wall, reigniting the fight for survival against the man-eating abominations.\r\n\r\nAfter witnessing a horrific personal loss at the hands of the invading creatures, Eren Yeager dedicates his life to their eradication by enlisting into the Survey Corps, an elite military unit that combats the merciless humanoids outside the protection of the walls. Based on Hajime Isayama's award-winning manga, Shingeki no Kyojin follows Eren, along with his adopted sister Mikasa Ackerman and his childhood friend Armin Arlert, as they join the brutal war against the titans and race to discover a way of defeating them before the last walls are breached.\r\n\r\n[Written by MAL Rewrite]",
 'type': 'TV',
 'airing_start': '2013-04-06T16:58:00+00:00',
 'episodes': 25,
 'members': 1808210,
 'genres': [{'mal_id': 1,
   'type': 'anime',
   'name': 'Action',
   'url': 'https://myanimelist.net/anime/genre/1/Action'},
  {'mal_id': 38,
   'type': 'anime',
   'name': 'Military',
   'url': 'https://myanimelist.net/anime/genre/38/Military'},
  {'mal_id': 7,
   'type': 'anime',
   'name': 'Mystery',
   'url': 'https://myanimelist.net/anime/genre/7/Mystery'},
  {'mal_id': 31,
   'type': 'anime',
   'name': 'Super Power',
   'url': 'https://myanimelist.net/anime/genre/31/Super_Power'},
  {'mal_id': 8,
   'type': 'anime',
   'name': 'Drama',
   'url': 'https://myanimelist.net/anime/genre/8/Drama'},
  {'mal_id': 10,
   'type': 'anime',
   'name': 'Fantasy',
   'url': 'https://myanimelist.net/anime/genre/10/Fantasy'},
  {'mal_id': 27,
   'type': 'anime',
   'name': 'Shounen',
   'url': 'https://myanimelist.net/anime/genre/27/Shounen'}],
 'source': 'Manga',
 'producers': [{'mal_id': 858,
   'type': 'anime',
   'name': 'Wit Studio',
   'url': 'https://myanimelist.net/anime/producer/858/Wit_Studio'}],
 'score': 8.45,
 'licensors': ['Funimation'],
 'r18': False,
 'kids': False}

In [6]:
# parameters, parameters_no = load_parameters()

In [183]:
training_set_clean[training_set_clean["synopsis"].apply(len) == 0]

Unnamed: 0,title,synopsis,genres,Romance,Historical,Space,Cars,Game,Supernatural,Fantasy,...,envelope,corrupted,suspected,asahina,membership,collectors,transgression,illusions,anthropomorphic,kishou
48,Made in Abyss Movie 3: Fukaki Tamashii no Reimei,[],"[Sci-Fi, Adventure, Mystery, Drama, Fantasy]",False,False,False,False,False,False,True,...,0,0,0,0,0,0,0,0,0,0


In [145]:
isolated_test_set = test_set.copy()
predict_all(isolated_test_set)
compute_accuracy_per_anime(isolated_test_set)
isolated_test_set.head(10)
# (isolated_test_set["accuracy"] >= 97).sum() / len(isolated_test_set)
get_accuracy_per_genre_df(isolated_test_set)

Unnamed: 0,accuracy,true_positive_proportion,false_positive_proportion,true_negative_proportion,false_negative_proportion
Police,100.0,0.0,0.0,100.0,0.0
Cars,100.0,0.0,0.0,100.0,0.0
Shoujo Ai,100.0,0.0,0.0,100.0,0.0
Parody,100.0,0.0,0.0,100.0,0.0
Space,99.090909,0.909091,0.0,98.181818,0.909091
Sports,99.090909,1.818182,0.0,97.272727,0.909091
Dementia,99.090909,0.0,0.0,99.090909,0.909091
Kids,98.181818,0.0,0.0,98.181818,1.818182
Music,98.181818,0.0,0.0,98.181818,1.818182
Shounen Ai,98.181818,0.0,0.0,98.181818,1.818182
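
Accuracy alone can flatter rare genres: Police, Cars, Shoujo Ai and Parody score 100% with a true positive proportion of 0, which only means they never occur in this test set and are never predicted. Precision and recall, computed from the same proportions, are a possible complement. The helper below is an illustrative sketch (its name and the choice to return `None` for undefined ratios are assumptions), applied to two rows copied from the table above.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from counts or proportions.

    Returns None for a ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# (true positive, false positive, false negative) proportions from the table
rows = {
    "Sports": (1.818182, 0.0, 0.909091),
    "Kids": (0.0, 0.0, 1.818182),
}

results = {genre: precision_recall(tp, fp, fn) for genre, (tp, fp, fn) in rows.items()}
for genre, (precision, recall) in results.items():
    print(genre, precision, recall)
```

For Kids, the 98.18% accuracy hides a recall of 0: the genre is present in the test set but never predicted.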


In [107]:
pc = [g + "_prediction_is_accurate" for g in genres]
accuracy_per_genre = (isolated_test_set[pc].sum(axis = 0) / len(isolated_test_set)) * 100
accuracy_per_genre.index = genres
accuracy_per_genre.sort_values(ascending = False).describe()

count     40.000000
mean      91.636364
std        7.708946
min       71.818182
25%       86.363636
50%       94.090909
75%       98.181818
max      100.000000
dtype: float64

In [113]:
isolated_test_set.head()


Unnamed: 0,title,synopsis,genres,Romance,Historical,Space,Cars,Game,Supernatural,Fantasy,...,Magic_predicted_accurately,Shoujo_predicted_accurately,Mystery_predicted_accurately,Music_predicted_accurately,Thriller_predicted_accurately,Martial Arts_predicted_accurately,School_predicted_accurately,Harem_predicted_accurately,Ecchi_predicted_accurately,accuracy_per_anime
0,Grisaia no Kajitsu,"[kazami, transfer, student, admitted, mihama, ...","['Drama', 'Harem', 'Psychological', 'Romance',...",True,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,100.0
1,Kaichou wa Maid-sama!,"[female, student, council, president, especial...","['Comedy', 'Romance', 'School', 'Shoujo']",True,False,False,False,False,False,False,...,True,False,True,True,True,True,True,True,True,97.5
2,Sakura-sou no Pet na Kanojo,"[abandoned, kittens, conscience, second, sorat...","['Slice of Life', 'Comedy', 'Drama', 'Romance'...",True,False,False,False,False,False,False,...,True,True,True,True,True,True,False,True,True,95.0
3,Ansatsu Kyoushitsu,"[mysterious, creature, permanent, crescent, st...","['Action', 'Comedy', 'School', 'Shounen']",False,False,False,False,False,False,False,...,True,True,True,True,True,True,False,True,True,90.0
4,Kaze ga Tsuyoku Fuiteiru,"[former, runner, sendai, kakeru, kurahara, cha...","['Comedy', 'Sports', 'Drama']",False,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,92.5
