# Predicting Tearjerker Anime

One of the few pieces of media to have ever coaxed tears from me is Kyoto Animation's devastating 2008 anime series *Clannad: After Story*. Despite its potential for crushing your soul, the show still gets an enthusiastic recommendation from me. If anything, I consider an anime's ability to turn on the waterworks to be an indicator of its quality. Even if I don't tear up at an anime, a decent attempt on the part of its creators still gets credit from me. Most of my favorite anime are considered tearjerkers; *A Place Further than the Universe*, *Your Name*, and *K-On* come to mind as notable examples. Given my particular affinity for tearjerker anime, wouldn't it be great if I could somehow predict, before a show even airs, whether it will trigger this emotional reaction from the audience?

In [None]:
# Import libraries 
# https://pub.towardsai.net/emoticon-and-emoji-in-text-mining-7392c49f596a
# https://medium.com/geekculture/text-preprocessing-how-to-handle-emoji-emoticon-641bbfa6e9e7
from emot.emo_unicode import EMOTICONS_EMO, EMOJI_UNICODE 
import pandas as pd
import string
import re

# Wrangle Anime Data

In [None]:
# Load data
# https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?select=animes.csv
animes = pd.read_csv("data/animes.csv") 
# https://www.kaggle.com/datasets/azathoth42/myanimelist?select=AnimeList.csv
AnimeList = pd.read_csv("data/AnimeList.csv")

In [None]:
AnimeList = AnimeList[["anime_id", "title_english", "title_synonyms", "type", "source", "producer", "licensor", "studio"]]
animes = animes.merge(AnimeList, how = "left", left_on = "uid", right_on = "anime_id")
anime = animes[["uid", 
                "title", 
                "synopsis", 
                "genre", 
                "type",
                "episodes", 
                "source", 
                "producer", 
                "licensor", 
                "studio"]]
anime = anime.drop_duplicates()
anime.head()

# Construct Target Field

My target field comes from a dataset of anime reviews.

In [None]:
# reviews = pd.read_csv("data/reviews.csv")
# reviews = reviews.drop_duplicates()

In [None]:
# import math

# total_rows = len(reviews)
# max_rows = 10000
# num_files = math.ceil(total_rows/max_rows)

# start = 0
# end = 9999

# for i in range(1, num_files + 1):
#     i = str(i)
#     print("Writing file #" + i)
#     reviews.iloc[start:(end + 1), :].to_csv("data/reviews/reviews" + i + ".csv", index = False)
#     start += 10000
#     end += 10000

In [None]:
# Read in data
reviews = pd.read_csv("data/reviews/reviews1.csv")
for i in range(2, 15):
    i = str(i)
    print("Concatenating data file #" + i)
    addition = pd.read_csv("data/reviews/reviews" + i + ".csv")
    reviews = pd.concat([reviews, addition])

In [None]:
# Keep only relevant fields
reviews = reviews[["uid", "anime_uid", "link", "text"]]

## Remove front and back matter

In [None]:
reviews = reviews.reset_index(drop = True)
filler = re.compile(r"^[\s\w]*Enjoyment[\s\d]*|\s*Helpful\s*$")
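# A compiled pattern must be applied as a regex; recent pandas versions default to regex = False
reviews["text"] = reviews["text"].str.replace(filler, "", regex = True)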
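
To make the effect of the `filler` pattern concrete, here is a minimal sketch using plain `re` on a single made-up review. The sample text and its layout (a block of scores ending in "Enjoyment", and a trailing "Helpful" label) are assumptions for illustration; the real scraped reviews may look different.

```python
import re

# Same pattern as in the cell above: a leading block of words and digits
# ending in "Enjoyment <score>", or a trailing "Helpful" label.
filler = re.compile(r"^[\s\w]*Enjoyment[\s\d]*|\s*Helpful\s*$")

# Hypothetical scraped review; the actual layout in the dataset may differ.
raw_review = (
    "more pics\nOverall 8\nStory 7\nAnimation 9\nSound 8\n"
    "Character 8\nEnjoyment 9\nGreat show. Bring tissues.\n\nHelpful\n"
)

cleaned = filler.sub("", raw_review)
print(repr(cleaned))
```

Only the body of the review should survive; the score block and the trailing label are stripped.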

## Replace emoticons

The reviews make heavy use of emoticons. These symbols let their users succinctly communicate an emotional reaction through pictograms composed of punctuation, letters, numbers, etc. Because they convey information about emotion, I want to retain them to help me determine whether an anime is a tearjerker. However, since they include punctuation, which is removed from the text later on when I construct the target field, I opted to replace them with verbal descriptions.

I used a dictionary of emoticons from the `emo_unicode` module of the `emot` library . . .

In [None]:
# http://introtopython.org/dictionaries.html#General-Syntax
emoticons = {}
for symbol, meaning in EMOTICONS_EMO.items():
    emoticons[symbol] = "_".join(meaning.lower().replace(",", "").split())
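
As a small illustration of what this reformatting does, the sketch below applies the same expression to a few made-up entries. The entries are hypothetical stand-ins shaped like `EMOTICONS_EMO` items, not values copied from the library.

```python
# Hypothetical entries in the same symbol -> description shape as EMOTICONS_EMO.
sample_emoticons = {
    ":-)": "Happy face, smiley",
    "T_T": "Sad or Crying",
}

def to_token(meaning):
    # Lowercase, drop commas, and join the words with underscores so the
    # description becomes a single token.
    return "_".join(meaning.lower().replace(",", "").split())

tokens = {symbol: to_token(meaning) for symbol, meaning in sample_emoticons.items()}
print(tokens)
```

Each description becomes one underscore-joined word, so it can later be handled like any other token.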

. . . and added a few missing emoticons.

In [None]:
emoticons["(^—^)"] = "normal_laugh"
# https://en.wikipedia.org/wiki/List_of_emoticons
emoticons[">W<"] = "troubled"
emoticons["-_-'"] = "troubled"
# https://www.urbandictionary.com/define.php?term=%3E_%3E
emoticons[">_>"] = "right_sideways_look"
emoticons["<_<"] = "left_sideways_look"

Since many long emoticons are simply short ones with additional characters, I sorted the emoticons in descending order according to their length so that, later on, the longer emoticons would be matched prior to shorter emoticons. 

In [None]:
# https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
# https://www.w3schools.com/python/ref_func_sorted.asp
emoticons = dict(sorted(emoticons.items(), key = lambda item: -len(item[0])))
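
Why the ordering matters: in a regex alternation, the first alternative that matches wins, so a short emoticon listed first can shadow a longer one that starts the same way. The following sketch shows this with a tiny, hypothetical emoticon dictionary.

```python
import re

# Hypothetical emoticons; ":)" is a prefix of ":))".
sample = {":)": "smile", ":))": "big_smile", "T_T": "crying"}

def build_pattern(symbols):
    # Escape each symbol and join them into one alternation.
    return re.compile("|".join(re.escape(symbol) for symbol in symbols))

text = "so happy :)) and then T_T"

insertion_order = build_pattern(sample)
longest_first = build_pattern(sorted(sample, key=len, reverse=True))

matches_unsorted = insertion_order.findall(text)
matches_sorted = longest_first.findall(text)
print(matches_unsorted)
print(matches_sorted)
```

With the unsorted pattern, `:))` is only partially matched as `:)`; with the longest-first pattern the full emoticon is found.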

I then assembled a list of the emoticons from the dictionary that were used in the reviews. I managed to get , but I'm sure I missed some T_T.

In [None]:
# https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
# https://www.geeksforgeeks.org/python-accessing-key-value-in-dictionary/
# https://stackoverflow.com/questions/4202538/escape-special-characters-in-a-python-string
# Raw strings keep "\s" and "\W" from being read as (invalid) string escape sequences
pattern = r"\s(" + "|".join([re.escape(emoticon) for emoticon in emoticons]) + r")\W?"
used_emoticons = reviews["text"].str.extractall(pattern)

Here, I'm converting the dataframe of used emoticons into a dictionary.

In [None]:
# https://www.digitalocean.com/community/tutorials/python-convert-numpy-array-to-list
used_emoticons = used_emoticons.dropna().drop_duplicates().values.reshape(1, -1)[0].tolist()
# https://stackoverflow.com/questions/5352546/extract-subset-of-key-value-pairs-from-dictionary
used_emoticons = {symbol: emoticons[symbol] for symbol in used_emoticons}

Again, I'm sorting the emoticons according to their length.

In [None]:
used_emoticons = dict(sorted(used_emoticons.items(), key = lambda item: -len(item[0])))

Now, I'm looping through each emoticon and replacing it with its respective value in the dictionary if it's preceded by whitespace and followed by a non-word character. These conditions reduce the likelihood of false positives.

In [None]:
for symbol, meaning in used_emoticons.items():
    print("Replacing " + symbol + " with " + meaning)
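    # Match only when preceded by whitespace and followed by a non-word character or the end of the text
    reviews["text"] = reviews["text"].str.replace(r"(?<=\s)" + re.escape(symbol) + r"(?=\W|$)", meaning, regex = True)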
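
The rule described above can be sketched on a single string with plain `re`. The emoticon dictionary and the sample sentence are made up, and the lookahead `(?=\W|$)` is my reading of "followed by a non-word character", extended to also accept the end of the text.

```python
import re

# Hypothetical subset of emoticons and their descriptions.
sample_emoticons = {"T_T": "crying", ":D": "laughing"}

def replace_emoticons(text, mapping):
    # Longest symbols first, so short emoticons cannot shadow longer ones.
    for symbol in sorted(mapping, key=len, reverse=True):
        # Lookbehind: whitespace before. Lookahead: non-word character or end after.
        pattern = r"(?<=\s)" + re.escape(symbol) + r"(?=\W|$)"
        # A function replacement inserts the description literally.
        text = re.sub(pattern, lambda _match, word=mapping[symbol]: word, text)
    return text

replaced = replace_emoticons("That ending T_T. Rated:D by me :D", sample_emoticons)
print(replaced)
```

Note that the `:D` inside `Rated:D` is left alone because it is not preceded by whitespace, which is the kind of false positive the conditions are meant to avoid.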

In [None]:
# reviews.to_csv("data/intermediates/replace_emoticons.csv", index = False)
reviews = pd.read_csv("data/intermediates/replace_emoticons.csv")

## Replace emojis

In [None]:
emojis = {}
for meaning, symbol in EMOJI_UNICODE.items():
    emojis[symbol] = "_".join(meaning.lower().replace(",", "").replace(":", "").split())
emojis['❤️'] = "heart"
emojis['♥️'] = "heart"

In [None]:
pattern = "(" + "|".join([re.escape(emoji) for emoji in emojis]) + ")"
used_emojis = reviews["text"].str.extractall(pattern)

In [None]:
reviews["text"].iloc[[19803]].str.contains('☺')

In [None]:
print(reviews["link"].iloc[19803])

In [None]:
used_emojis = used_emojis.dropna().drop_duplicates().values.reshape(1, -1)[0].tolist()
# used_emojis = {symbol: emojis[symbol] for symbol in used_emojis}
# for symbol, meaning in used_emojis.items():
#     print("Replacing " + symbol + " with " + meaning)
#     reviews["text"] = reviews["text"].str.replace(symbol, " " + meaning + " ", regex = False)
used_emojis

In [None]:
final_characters = reviews["text"].str.extract(r"(?P<final_character>.$)", expand = False)

In [None]:
pd.set_option("display.max_rows", None)
final_characters = reviews["text"].str.extract(r"(?P<final_character>.$)")
counts = final_characters.groupby("final_character")["final_character"].count()

In [None]:
list(counts.index)

In [None]:
reviews["text"][final_characters['final_character'] == '️']

In [None]:
counts[counts.index == '️']
# '⠀', '️', '̿'

## Tokenize text

In [None]:
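# Normalize surrounding whitespace and case
reviews["text"] = reviews["text"].str.strip()
reviews["text"] = reviews["text"].str.lower()
# Treat slashes as word separators (literal matches; a lone backslash is not a valid regex)
reviews["text"] = reviews["text"].str.replace("\\", " ", regex = False)
reviews["text"] = reviews["text"].str.replace("/", " ", regex = False)
# Straighten curly quotes
reviews["text"] = reviews["text"].str.replace("‘", "'").str.replace("’", "'")
reviews["text"] = reviews["text"].str.replace("“", '"').str.replace("”", '"')
# Remove all punctuation except apostrophes and hyphens; escape it so it is safe inside a character class
pattern = "[" + re.escape(string.punctuation.replace("'", "").replace("-", "") + "–" + "…") + "]"
# Also remove apostrophes used as quotation marks at the start or end of a word
pattern = pattern + r"|(?<=\s)'(?=\w)|(?<=\w)'(?=\s)"
reviews["text"] = reviews["text"].str.replace(pattern, "", regex = True)
reviews["text"] = reviews["text"].str.split()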
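
For a quick check of what these steps do to one review, here is a rough single-string equivalent using `re` rather than pandas. The helper name `tokenize` and the sample sentence are hypothetical; the pattern is built the same way as in the cell above, with `re.escape` applied to the punctuation.

```python
import re
import string

# Punctuation to strip: everything except apostrophes and hyphens,
# plus the en dash and the ellipsis character.
removable = string.punctuation.replace("'", "").replace("-", "") + "–" + "…"
pattern = "[" + re.escape(removable) + "]"
# Apostrophes acting as quotation marks at the start or end of a word.
pattern = pattern + r"|(?<=\s)'(?=\w)|(?<=\w)'(?=\s)"

def tokenize(text):
    text = text.strip().lower()
    text = text.replace("\\", " ").replace("/", " ")
    text = re.sub(pattern, "", text)
    return text.split()

tokens = tokenize("I didn't cry... 'much' at the bitter-sweet end!")
print(tokens)
```

Contractions and hyphenated words stay intact, while sentence punctuation and quoting apostrophes disappear.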

In [None]:
reviews = reviews.explode("text")

## Replace non-word characters except for emojis and select punctuation

First, I create a list of the unique non-word characters present in the reviews.

In [None]:
unique_tokens = reviews["text"].drop_duplicates()
non_word = unique_tokens.str.extractall(r"(\W)").drop_duplicates()
non_word = non_word.values.reshape(1, -1)[0].tolist()

Next, I reformat the definitions from the emoji dictionary that I obtained from the `emo_unicode` module of the `emot` library.

In [None]:
emojis = {}
for meaning, symbol in EMOJI_UNICODE.items():
    emojis[symbol] = "_".join(meaning.lower().replace(",", "").replace(":", "").split())
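
Unlike the emoticon dictionary, the loop above reads `EMOJI_UNICODE` as description-to-symbol pairs, so the reformatting also inverts the mapping. A minimal sketch with two made-up entries (hypothetical stand-ins, not copied from the library):

```python
# Hypothetical entries in the description -> symbol shape used by the loop above.
sample_emoji_unicode = {
    ":loudly_crying_face:": "😭",
    ":Red Heart:": "❤",
}

sample_emojis = {}
for meaning, symbol in sample_emoji_unicode.items():
    # Strip commas and colons, lowercase, and join words with underscores.
    cleaned = "_".join(meaning.lower().replace(",", "").replace(":", "").split())
    sample_emojis[symbol] = cleaned

print(sample_emojis)
```

The result is keyed by the emoji itself, which is what the membership tests in the next cell rely on.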

In [None]:
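keep = [emoji for emoji in emojis if emoji in non_word]
# Drop every non-word character that is not an emoji, an apostrophe, or a hyphen
drop = [char for char in non_word if char not in emojis and char not in ("'", "-")]
# Create regex pattern, escaping each character so metacharacters such as "|" or "(" match literally
drop = "|".join(re.escape(char) for char in drop)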
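
The escaping step matters because many non-word characters are regex metacharacters. The sketch below builds a drop pattern from a small, made-up character list and applies it to one string; the list, the emoji dictionary, and the sample text are all hypothetical.

```python
import re

# Hypothetical non-word characters found in the tokens, and a one-entry emoji dictionary.
sample_non_word = ["'", "-", "*", "(", ")", "|", "😭"]
sample_emojis = {"😭": "loudly_crying_face"}

# Characters to protect: emojis plus apostrophes and hyphens.
protected = set(sample_emojis) | {"'", "-"}
to_drop = [char for char in sample_non_word if char not in protected]

# Without re.escape, "*", "(", ")" and "|" would be read as regex syntax.
drop_pattern = "|".join(re.escape(char) for char in to_drop)

cleaned = re.sub(drop_pattern, "", "*so-so* (it's sad) 😭")
print(cleaned)
```

The emoji, the apostrophe, and the hyphen survive; everything else in the drop list is removed.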

In [None]:
reviews["text"] = reviews["text"].str.replace(drop, "", regex = True)
reviews[reviews["text"] != ""]

In [None]:
# reviews["text"].loc[168].values
# reviews["text"].loc[281].values.tolist()

## Create target field

In [None]:
["cry", "cried", "crying", "sob", "sobbed", "sobbing", "bawl", "bawled", "bawling", "tear", "tears", "😭"]

In [None]:
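# After explode, each row of "text" holds a single token, so tokens can be compared directly.
# Dropping duplicates keeps one row per review (uid) that mentions "cry".
cry = reviews.loc[reviews["text"] == "cry", ["uid", "anime_uid", "text"]].drop_duplicates()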

In [None]:
cry2 = cry.groupby("anime_uid")["anime_uid"].count()

In [None]:
cry2.plot(kind = "hist", bins = 100)
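
One possible way to extend this count from the single word "cry" to the whole term list is sketched below in plain Python. It is only an illustration of the idea, not the finished target field: the tokenized reviews are made up, and counting each review at most once per anime is an assumption on my part.

```python
from collections import Counter

# Term list from the cell above.
cry_terms = {"cry", "cried", "crying", "sob", "sobbed", "sobbing",
             "bawl", "bawled", "bawling", "tear", "tears", "😭"}

# Hypothetical tokenized reviews: (review uid, anime uid, tokens).
sample_reviews = [
    (1, 100, ["i", "cried", "so", "much", "😭"]),
    (2, 100, ["tears", "everywhere"]),
    (3, 200, ["fun", "action", "show"]),
    (4, 200, ["made", "me", "cry"]),
]

# Count, per anime, the reviews containing at least one crying-related term.
crying_reviews = Counter(
    anime_uid
    for _review_uid, anime_uid, tokens in sample_reviews
    if cry_terms.intersection(tokens)
)
print(dict(crying_reviews))
```

A review that uses several of the terms still counts once, so long reviews do not dominate the tally.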