# When Computers Cry: Predicting Tearjerker Anime Using Scikit-learn

One of the few pieces of media to have ever coaxed tears from me is Kyoto Animation's devastating 2008 anime series *Clannad: After Story*, a show that I enthusiastically recommend. And I do so not in spite of the fact that it crushed my soul but because of it. If anything, I consider an anime's ability to turn on the waterworks to be a marker of quality. Most of my favorite anime are considered tearjerkers: *A Place Further than the Universe*, *Your Name*, and *K-On* come to mind as notable examples. Even if I don't tear up while watching them, a decent attempt by an anime's creators still earns enormous credit from me. Given my interest in anime, I decided to take on the task of predicting tearjerker anime as a starter project to clarify and advance my understanding of machine learning and Scikit-learn.

Using data scraped off of the website MyAnimeList courtesy of , here's what I did . . .

1. Determined whether each anime is a tearjerker based on the frequency of key words ("cry," "tears," etc.) in its reviews.
2. Used the synopsis of the anime from MyAnimeList along with to predict 

In [1]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import re
# https://pub.towardsai.net/emoticon-and-emoji-in-text-mining-7392c49f596a
# https://medium.com/geekculture/text-preprocessing-how-to-handle-emoji-emoticon-641bbfa6e9e7
from emot.emo_unicode import EMOTICONS_EMO, EMOJI_UNICODE
# https://www.kaggle.com/code/shsagar/finding-anime-genre-based-on-synopsis-logistic-reg
import nltk
from nltk.corpus import stopwords

# Wrangle Anime Data

In [None]:
# Load data
# https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?select=animes.csv
animes = pd.read_csv("data/animes.csv") 
# https://www.kaggle.com/datasets/azathoth42/myanimelist?select=AnimeList.csv
AnimeList = pd.read_csv("data/AnimeList.csv")

In [None]:
AnimeList = AnimeList[["anime_id", 
                       "title_english", 
                       "title_synonyms", 
                       "type", 
                       "source", 
                       "producer", 
                       "licensor", 
                       "studio"]]
animes = animes.merge(AnimeList, how = "left", left_on = "uid", right_on = "anime_id")
anime = animes[["uid", 
                "title", 
                "title_english", 
                "title_synonyms",
                "score",
                "members",
                "type",
                "episodes",
                "synopsis", 
                "genre", 
                "source", 
                "studio",
                "producer", 
                "licensor"]]
anime = anime.drop_duplicates()
anime.head()

# What is a tearjerker anime?

In order to predict tearjerker anime, I must first define what one is. To construct this target vector, I used a dataset of anime reviews. Thus, whether an anime counts as a tearjerker is based on reviewers' reactions.

In [None]:
# reviews = pd.read_csv("data/reviews.csv")
# reviews = reviews.drop_duplicates()

In [None]:
# import math

# total_rows = len(reviews)
# max_rows = 10000
# num_files = math.ceil(total_rows/max_rows)

# start = 0
# end = 9999

# for i in range(1, num_files + 1):
#     i = str(i)
#     print("Writing file #" + i)
#     reviews.iloc[start:(end + 1), :].to_csv("data/reviews/reviews" + i + ".csv", index = False)
#     start += 10000
#     end += 10000

In [None]:
# Read in data
reviews = pd.read_csv("data/reviews/reviews1.csv")
for i in range(2, 15):
    i = str(i)
    print(f"Concatenating data file #{i}")
    addition = pd.read_csv(f"data/reviews/reviews{i}.csv")
    reviews = pd.concat([reviews, addition])

In [None]:
# Keep only relevant fields
reviews = reviews[["uid", "anime_uid", "link", "text"]]

In [None]:
review_counts = reviews.groupby("anime_uid")["anime_uid"].count() 
keep_anime = review_counts[review_counts >= 10].index.tolist()
reviews = reviews[reviews["anime_uid"].isin(keep_anime)]
reviews = reviews.reset_index(drop = True)

In [None]:
# n = reviews.groupby("anime_uid")["uid"].nunique().rename("n")
# reviews = reviews.sort_values(["anime_uid", "uid"]).groupby("anime_uid").head(1)
# reviews = reviews.merge(n, how = "left", left_on = "anime_uid", right_index = True)

## Remove front and back matter

In [None]:
filler = re.compile(r"^[\s\w]*Enjoyment[\s\d]*|\s*Helpful\s*$") # Assumes reviews never start with a number
# A compiled pattern has to be passed with regex = True in recent versions of pandas
reviews["text"] = reviews["text"].str.replace(filler, "", regex = True)
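
To see what the pattern strips, here is a minimal sketch that applies it to a made-up review string. The sample layout (a block of category scores ending in "Enjoyment" plus a trailing "Helpful" label) is an assumption inferred from what the regex targets, not a row taken from the dataset.

```python
import re

# Same pattern as above: a leading score block up to "Enjoyment <n>", or a trailing "Helpful"
filler = re.compile(r"^[\s\w]*Enjoyment[\s\d]*|\s*Helpful\s*$")

# Hypothetical review text, invented for illustration
sample = (
    "more pics \n Overall 9 Story 8 Animation 9 Sound 9 Character 9 Enjoyment 10 \n "
    "This show wrecked me. \n Helpful"
)

cleaned = filler.sub("", sample)
print(repr(cleaned))
```

Only the body of the review should survive; the first alternative stops at the first character that is neither a word character nor whitespace, which is why the pattern assumes the review body does not start with a number.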

## Replace emoticons

The reviews feature heavy use of emoticons. These symbols allow users to succinctly communicate an emotional reaction through pictograms composed of punctuation, letters, numbers, etc. Because they convey information on emotion, I want to retain them to help me determine whether an anime is a tearjerker or not. However, since they include punctuation, which will be removed from the text later on in the process of constructing the target vector, I opted to replace them with verbal descriptions.

I used a dictionary of emoticons from the `emot` library ...

In [None]:
# http://introtopython.org/dictionaries.html#General-Syntax
emoticons = {}
for symbol, meaning in EMOTICONS_EMO.items():
    emoticons[symbol] = "".join([word.capitalize() for word in meaning.replace(",", "").split()])

... and added a few missing emoticons.

In [None]:
emoticons["(^—^)"] = "NormalLaugh"
emoticons["-_-“"] = "Troubled" 
emoticons[":s)"] = "HappyFaceOrSmiley" 
emoticons[":S)"] = "HappyFaceOrSmiley" 
# https://en.wikipedia.org/wiki/List_of_emoticons
emoticons[">W<"] = "Troubled"
emoticons["-_-'"] = "Troubled"
# https://www.urbandictionary.com/define.php?term=%3E_%3E
emoticons[">_>"] = "RightSidewaysLook"
emoticons["<_<"] = "LeftSidewaysLook"

Since many long emoticons are simply short ones with additional characters, I sorted the emoticons in descending order according to their length so that, later on, the longer emoticons would be matched prior to shorter emoticons. 

In [None]:
# https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
# https://www.w3schools.com/python/ref_func_sorted.asp
emoticons = dict(sorted(emoticons.items(), key = lambda item: -len(item[0])))
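
The ordering matters because regex alternation tries its branches from left to right and stops at the first one that matches. A small sketch with a hypothetical two-entry dictionary (the labels are made up, not taken from `emot`) shows the difference:

```python
import re

# Hypothetical two-entry dictionary; the labels are made up for illustration
emoticons = {":)": "HappyFace", ":))": "VeryHappyFace"}

def build_pattern(symbols):
    """Join escaped emoticons into one alternation, in the order given."""
    return "|".join(re.escape(symbol) for symbol in symbols)

insertion_order = build_pattern(emoticons)
longest_first = build_pattern(sorted(emoticons, key = len, reverse = True))

text = "what a finale :))"
short_match = re.search(insertion_order, text).group()
long_match = re.search(longest_first, text).group()
print(short_match, long_match)
```

With the shorter emoticon listed first, only `:)` is matched and a stray `)` is left behind; listing the longer one first captures the whole symbol.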

I then assembled a list of emoticons that feature in both the dictionary and the reviews. I managed to get 97 unique emoticons, but I'm sure I missed some T_T.

In [None]:
# https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
# https://www.geeksforgeeks.org/python-accessing-key-value-in-dictionary/
# https://stackoverflow.com/questions/4202538/escape-special-characters-in-a-python-string
pattern = r"\s(" + "|".join([re.escape(emoticon) for emoticon in emoticons]) + r")\W?"
used_emoticons = reviews["text"].str.extractall(pattern)

Here, I'm converting the dataframe of used emoticons into a dictionary.

In [None]:
# https://www.digitalocean.com/community/tutorials/python-convert-numpy-array-to-list
used_emoticons = used_emoticons.dropna().drop_duplicates().iloc[:, 0].tolist()
# https://stackoverflow.com/questions/5352546/extract-subset-of-key-value-pairs-from-dictionary
used_emoticons = {symbol: emoticons[symbol] for symbol in used_emoticons}

Again, I'm sorting the emoticons according to their length.

In [None]:
used_emoticons = dict(sorted(used_emoticons.items(), key = lambda item: -len(item[0])))

Now, I'm looping through each emoticon and replacing it with its respective value in the dictionary if it's preceded by whitespace and followed by a non-word character. These conditions are meant to reduce the likelihood of false positives.

In [None]:
for symbol, meaning in used_emoticons.items():
    print(f"Replacing {symbol} with {meaning}")
    # The lookahead requires a non-word character or the end of the text after the emoticon
    reviews["text"] = reviews["text"].str.replace(r"(?<=\s)" + re.escape(symbol) + r"(?=\W|$)", meaning, regex = True)
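
As a rough illustration of the two conditions, the sketch below applies the same kind of pattern to plain strings with `re.sub`. The helper name `replace_emoticon` and the label "Crying" are hypothetical, and the lookahead here treats the end of the text as an acceptable boundary.

```python
import re

def replace_emoticon(text, symbol, meaning):
    """Replace `symbol` only when it follows whitespace and precedes a non-word character or the end."""
    pattern = r"(?<=\s)" + re.escape(symbol) + r"(?=\W|$)"
    return re.sub(pattern, meaning, text)

# Standalone emoticon at the end of a sentence: replaced
standalone = replace_emoticon("that ending T_T", "T_T", "Crying")

# The same characters embedded in a longer token: left alone
embedded = replace_emoticon("the CAST_TV special", "T_T", "Crying")

print(standalone)
print(embedded)
```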

In [None]:
reviews.to_csv("data/intermediates/replace_emoticons.csv", index = False)
reviews = pd.read_csv("data/intermediates/replace_emoticons.csv")

## Drop titles

I use the appearance of key words such as "cry" and "tears" in user reviews to determine whether an anime is a tearjerker or not. Some anime include these key words in their titles, meaning users really can't help but mention them in their reviews. Thus, to ensure that I'll only be counting authentic appearances of these terms, I exclude the title of the anime being discussed from the text of each review.

*True Tears* is a peculiar case because one of its characters' ability to cry is a central plot point. Thus, reviewers who are summarizing its story mention crying quite a lot even after scrubbing its reviews of any mention of the title.

In [None]:
reviews = reviews.merge(
    anime[["uid", "title", "title_english", "title_synonyms"]], 
    how = "left", left_on = "anime_uid", right_on = "uid"
)
reviews = reviews.drop(columns = "uid_y").rename(columns = {"uid_x": "uid"})
reviews["text"] = reviews["text"].str.lower()
reviews["title"] = reviews["title"].str.lower()
reviews["title_english"] = reviews["title_english"].str.lower()
reviews["title_synonyms"] = reviews["title_synonyms"].str.lower()

In [None]:
text_col = reviews.columns.get_loc("text")
for field in ["title", "title_english", "title_synonyms"]:
    for i in range(len(reviews)):
        print(f"Processing row #{i} for {field}")
        title = reviews[field].iloc[i]
        text = reviews.iloc[i, text_col]
        if isinstance(title, str) and isinstance(text, str):
            unigrams = title.split()
            # http://www.locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
            bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
            pattern = "|".join([re.escape(title)] + [re.escape(bigram) for bigram in bigrams])
            # Write back through a single .iloc call so the change lands on `reviews` itself
            reviews.iloc[i, text_col] = re.sub(pattern, "", text)
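
The title-scrubbing step can be sketched on a single string. The helper `scrub_title` and the sample review are hypothetical; the pattern is built the same way as in the loop (the full title plus every pair of adjacent title words).

```python
import re

def scrub_title(text, title):
    """Remove the full title and any two-word fragment of it from a lowercased review."""
    unigrams = title.split()
    bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
    pattern = "|".join([re.escape(title)] + [re.escape(bigram) for bigram in bigrams])
    return re.sub(pattern, "", text)

# Hypothetical review text, invented for illustration
scrubbed = scrub_title("true tears made me shed real tears", "true tears")
print(repr(scrubbed))
```

The "tears" in the title disappears while the reviewer's own "tears" stays, which is the distinction the key-word count depends on.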

In [None]:
# reviews = reviews.drop(columns = ["title", "title_english", "title_synonyms"])
# reviews.to_csv("data/intermediates/drop_titles.csv", index = False)
reviews = pd.read_csv("data/intermediates/drop_titles.csv")

## Tokenize text

I turn the table of reviews where each row is one review into a table of tokens where each row is one token from a review.

In [None]:
reviews["text"] = reviews["text"].str.strip()
reviews["text"] = reviews["text"].str.replace("\\", " ", regex = False) # A lone backslash is not a valid regex
reviews["text"] = reviews["text"].str.replace("/", " ")                                      
reviews["text"] = reviews["text"].str.replace("‘", "'").str.replace("’", "'")
reviews["text"] = reviews["text"].str.replace("“", '"').str.replace("”", '"')
# "–" is used as punctuation while "-" is used to create phrases
pattern = "[" + string.punctuation.replace("'", "").replace("-", "") + "–" + "…" + "]" 
pattern = pattern + r"|(?<=\s)'(?=\w)|(?<=\w)'(?=\s)"
reviews["text"] = reviews["text"].str.replace(pattern, "", regex = True)
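
Here is a small sketch of what the punctuation pattern keeps and drops, using the same construction on a made-up sentence (the sample text is an assumption, not a review from the dataset):

```python
import re
import string

# Same construction as above: all punctuation except apostrophes and hyphens, plus "–" and "…"
pattern = "[" + string.punctuation.replace("'", "").replace("-", "") + "–" + "…" + "]"
# Also drop apostrophes used as quotation marks, i.e. at the start or end of a word
pattern = pattern + r"|(?<=\s)'(?=\w)|(?<=\w)'(?=\s)"

# Hypothetical review sentence, invented for illustration
sample = "i didn't cry… 'much' – a well-written finale!"
tokens = re.sub(pattern, "", sample).split()
print(tokens)
```

Contractions and hyphenated phrases stay in one piece, while quotation marks, the en dash, and the ellipsis are removed before splitting.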

In [None]:
reviews["text"] = reviews["text"].str.split()
reviews = reviews.explode("text")

In [None]:
reviews.to_csv("data/intermediates/tokenize_text.csv", index = False)
reviews = pd.read_csv("data/intermediates/tokenize_text.csv", na_filter = False)

## Replace non-word characters except for emojis and select punctuation

To further cleanse the tokens, I want to drop all non-word characters that aren't emojis. For the same reason as emoticons, I want to use emojis to create my target vector because they convey emotional information. First, I create a list of the unique non-word characters present in the reviews.

In [None]:
unique_tokens = reviews["text"].drop_duplicates()
non_word = unique_tokens.str.extractall(r"(\W)").drop_duplicates()
non_word = non_word.iloc[:, 0].tolist()

Next, I reformat the definitions from the emoji dictionary that I obtained from the `emot` library.

In [None]:
emojis = {}
for meaning, symbol in EMOJI_UNICODE.items():
    emojis[symbol] = meaning 

I also create a regex pattern that joins all the characters that I want to drop with a pipe, `|`.

In [None]:
# https://stackoverflow.com/questions/51976328/best-way-to-remove-xad-in-python
# https://stackoverflow.com/questions/31522361/python-getting-rid-of-u200b-from-a-string-using-regular-expressions
# https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
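# Keep emojis, apostrophes, and hyphens; everything else gets dropped
drop = [char for char in non_word if char not in emojis and char not in ("'", "-")]
# Escape each character so that symbols such as "(" or "|" are matched literally
drop = "|".join(re.escape(char) for char in drop) # Create regex pattern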

Using the pattern created above, I replace the specified non-word characters with empty strings. I drop these empty strings along with tokens that are made up of consecutive hyphens. These tokens are used by reviewers as horizontal lines to format their pieces. However, I have no use for them.

In [None]:
reviews["text"] = reviews["text"].str.replace(drop, "", regex = True)
reviews = reviews[(reviews["text"] != "") & ~(reviews["text"].str.fullmatch(r"(-+)"))]

## Remove stopwords

I drop the rows of the table whose tokens are stop words.

In [2]:
nltk.download("stopwords")
stopwords = list(stopwords.words("english")) # The corpus file is named "english" (lowercase)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/peiyizhuo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
reviews = reviews[~reviews["text"].isin(stopwords)]

In [None]:
# reviews.to_csv("data/intermediates/cleaned_text.csv", index = False)
reviews = pd.read_csv("data/intermediates/cleaned_text.csv")

## Add target field

Here I define what I mean by a "tearjerker" anime. An anime is considered a tearjerker if at least one of its reviews features at least one word from the `sad_words` list in addition to at least one from `cry_words`. This dual criterion serves as an indicator of whether an anime prompted its audience to weep from sadness.

In [None]:
sad_words = ["sad",
             "saddest",
             "emotion",
             "emotions",
             "emotional",
             "emotionally",
             "depressed",
             "depressing",
             "depressingly",
             "tragic",
             "tragedy",
             "sentimental"]

cry_words = ["cry", 
             "cried", 
             "crying", 
             "sob", 
             "sobbed", 
             "sobbing", 
             "bawl", 
             "bawled", 
             "bawling", 
             "tear", 
             "tears", 
             "teared", 
             "tearing", # as in "tearing up"
             "sadorcrying",
             "tearsofhappiness",
             "sadofcrying",
             "😭"]

reviews["sad"] = reviews["text"].isin(sad_words)
reviews["cry"] = reviews["text"].isin(cry_words)

In [None]:
cry_vote = reviews.groupby(["anime_uid", "uid"]).agg(
    sad = pd.NamedAgg(column = "sad", aggfunc = "sum"),
    cry = pd.NamedAgg(column = "cry", aggfunc = "sum")
)
cry_vote["cry"] = (cry_vote["sad"] > 0) & (cry_vote["cry"] > 0)
cry_vote = cry_vote["cry"].reset_index()
cry_vote = cry_vote.groupby("anime_uid").agg(
    cry = pd.NamedAgg(column = "cry", aggfunc = "mean"),
    reviews = pd.NamedAgg(column = "uid", aggfunc = "nunique")
)
cry_vote["cry"] = cry_vote["cry"] > 0
cry_vote = cry_vote.reset_index()
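
The aggregation above boils down to a two-level rule: a review "votes" for crying if it contains at least one sad word and at least one cry word, and an anime is labeled a tearjerker if any of its reviews votes that way. A plain-Python restatement on hypothetical data (the word sets are abbreviated from the lists above, and the token lists are invented) might look like this:

```python
# Abbreviated subsets of the word lists defined above
sad_words = {"sad", "emotional", "tragic"}
cry_words = {"cry", "cried", "tears"}

# Hypothetical tokenized reviews: {anime_uid: {review_uid: [tokens]}}
reviews_by_anime = {
    1: {101: ["sad", "ending", "cried"], 102: ["great", "fights"]},
    2: {201: ["sad", "but", "boring"], 202: ["cried", "laughing"]},
}

def is_tearjerker(anime_reviews):
    """True if any single review contains both a sad word and a cry word."""
    return any(
        bool(sad_words & set(tokens)) and bool(cry_words & set(tokens))
        for tokens in anime_reviews.values()
    )

labels = {uid: is_tearjerker(anime_reviews) for uid, anime_reviews in reviews_by_anime.items()}
print(labels)
```

The second anime has a sad word in one review and a cry word in another, so it is not labeled a tearjerker: both kinds of word have to appear in the same review.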

In [None]:
cry_vote[(cry_vote["anime_uid"] == 30276) & (cry_vote["cry"])]

In [None]:
reviews

In [None]:
reviews[reviews["uid"] == 206604]["link"].iloc[0]

In [None]:
anime = cry_vote.merge(anime, how = "left", left_on = "anime_uid", right_on = "uid").drop(columns = "anime_uid")

In [None]:
anime.to_csv("data/intermediates/anime.csv", index = False)
anime = pd.read_csv("data/intermediates/anime.csv")

# Create test set

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(anime, test_size = 0.2, random_state = 500)

In [3]:
# train_set.to_csv("data/intermediates/train.csv", index = False)
train_set = pd.read_csv("data/intermediates/train.csv")

# EDA

In [4]:
train_set.head()

| | cry | reviews | uid | title | title_english | title_synonyms | score | members | type | episodes | synopsis | genre | source | studio | producer | licensor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 20 | 32271 | Dies Irae | Dies irae | Day of Wrath | 5.46 | 97796 | TV | 11.0 | On May 1, 1945 in Berlin, as the Red Army rais... | ['Action', 'Military', 'Super Power', 'Magic'] | Visual novel | A.C.G.T. | Genco, DMM pictures, Greenwood, My Theater D.D... | Funimation |
| 1 | True | 45 | 856 | Utawarerumono | Utawarerumono | The One Being Sung | 7.71 | 126612 | TV | 26.0 | An injured man is found in the woods by a girl... | ['Action', 'Drama', 'Fantasy', 'Sci-Fi'] | Visual novel | OLM | Lantis, Half H.P Studio, AQUAPLUS | ADV Films, Funimation |
| 2 | False | 20 | 17821 | Stella Jogakuin Koutou-ka C³-bu | Stella Women&#039;s Academy, High School Divis... | Stella Jogakuin Koutouka C3-bu, Stella Jogakui... | 6.57 | 46285 | TV | 13.0 | Yura Yamato has just arrived at the high schoo... | ['Military', 'School', 'Sports'] | Manga | Gainax | Pony Canyon, TBS, RAY | Sentai Filmworks |
| 3 | False | 14 | 2969 | Appleseed Saga Ex Machina | Appleseed: Ex Machina | Appleseed 2, Appleseed 2007 | 7.41 | 33606 | Movie | 1.0 | Deunan, a young female warrior, and Briareos, ... | ['Action', 'Mecha', 'Military', 'Sci-Fi'] | Unknown | Digital Frontier | Sega | ADV Films, Warner Bros. |
| 4 | False | 30 | 257 | Ikkitousen | Ikki Tousen | Ikki-Tosen: Legendary Fighter, Battle Vixens | 6.53 | 118837 | TV | 13.0 | In Ikkitousen , the Kanto region of Japan is ... | ['Ecchi', 'Super Power', 'Martial Arts', 'Scho... | Manga | J.C.Staff | Genco, Cosmic Ray, Eye Move, Bushiroad | Funimation, Geneon Entertainment USA |


In [None]:
train_set.info()

In [None]:
print(f"Proportion of tearjerker anime: {(train_set['cry']).mean()}")

In [None]:
train_set[train_set["cry"]].sort_values(by = "members", ascending = False).head()

I was surprised to see *One Punch Man* on this list since, in my view, it was as far from a tearjerker as one could get. However, one reviewer with the username taikuroki disagreed.
>This anime is one for the books. It was so enjoyable it has gotten to the point where I am literally pre-ordering the English manga sets, and potential Blu-ray release of the anime. It’s a definite GOAT for me; on the top 5 of my list. It had me laughing, **crying**, yelling, and pretending all at once while watching. It’s been a while since I have seen an anime so great that I am out here praising it, unbiased of course.

In [None]:
train_set[~train_set["cry"]].sort_values(by = "members", ascending = False).head()

In [None]:
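# The "seaborn" style was renamed "seaborn-v0_8" in Matplotlib 3.6
plt.style.use("seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn")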

In [None]:
plt.figure(figsize = (16, 4))

genres = train_set[["cry", "genre"]].copy()
genres["genre"] = genres["genre"].str.replace(r"[\[\]']", "", regex = True).str.split(", ")
genres = genres.explode("genre")
cry = genres[genres["cry"]].groupby("genre")["genre"].count()
cry = cry.rename("count").reset_index().sort_values(by = "count").tail()
no_cry = genres[~genres["cry"]].groupby("genre")["genre"].count()
no_cry = no_cry.rename("count").reset_index().sort_values(by = "count").tail()

plt.subplot(1, 2, 1)
plt.barh(cry["genre"], cry["count"])
plt.title("Tearjerker")
plt.xlabel("Animes")

plt.subplot(1, 2, 2)
plt.barh(no_cry["genre"], no_cry["count"])
plt.title("Non-Tearjerker")
plt.xlabel("Animes")

plt.subplots_adjust(wspace = 0.3)

plt.tight_layout()
plt.savefig("plots/genres.png", dpi = 300)

In [None]:
plt.figure(figsize = (16, 4))

cry = train_set[train_set["cry"]][["source"]].groupby("source")["source"].count()
cry = cry.rename("count").reset_index().sort_values(by = "count").tail()

no_cry = train_set[~train_set["cry"]][["source"]].groupby("source")["source"].count()
no_cry = no_cry.rename("count").reset_index().sort_values(by = "count").tail()

plt.subplot(1, 2, 1)
plt.barh(cry["source"], cry["count"])
plt.title("Tearjerker")
plt.xlabel("Animes")

plt.subplot(1, 2, 2)
plt.barh(no_cry["source"], no_cry["count"])
plt.title("Non-Tearjerker")
plt.xlabel("Animes")

plt.tight_layout()
plt.savefig("plots/sources.png", dpi = 300)

In [None]:
plt.figure(figsize = (16, 4))

types = train_set[["cry", "type"]]
cry = types[types["cry"]][["type"]].groupby("type")["type"].count().sort_values()
no_cry = types[~types["cry"]][["type"]].groupby("type")["type"].count().sort_values()

plt.subplot(1, 2, 1)
plt.barh(cry.index, cry.values)
plt.title("Tearjerker")
plt.xlabel("Animes")

plt.subplot(1, 2, 2)
plt.barh(no_cry.index, no_cry.values)
plt.title("Non-Tearjerker")
plt.xlabel("Animes")

plt.tight_layout()
plt.savefig("plots/types.png", dpi = 300)

In [None]:
plt.figure(figsize = (16, 8))

plot_num = 1
for var in ["score", "members", "episodes", "reviews"]:
    plt.subplot(2, 2, plot_num)
    cry = train_set[train_set["cry"]][var].dropna()
    no_cry = train_set[~train_set["cry"]][var].dropna()
    plt.boxplot([cry, no_cry], vert = False, widths = 0.6,
                showfliers = False, labels = ["Tearjerker", "Non-Tearjerker"])
    plt.xlabel(var.capitalize())
    plot_num += 1

# https://www.geeksforgeeks.org/how-to-set-the-spacing-between-subplots-in-matplotlib-in-python/
plt.subplots_adjust(wspace = 0.4, hspace = 0.4)

plt.tight_layout()
plt.savefig("plots/numeric.png", dpi = 300)

# Model training

## Select desired rows and columns

In [5]:
train_set = train_set[["cry", "score", "members", "type", "episodes", "synopsis", "genre", "source", "studio"]]
train_set = train_set.dropna().reset_index(drop = True)

In [6]:
y_train = train_set["cry"]
# Copy so that the column assignments below modify X_train rather than a slice of train_set
X_train = train_set[["score", "members", "type", "episodes", "synopsis", "genre", "source", "studio"]].copy()

## Clean `genre`, `studio`, and `synopsis` for `CountVectorizer`

In [7]:
X_train["genre"] = X_train["genre"].str.replace(r"[\[\]']", "", regex = True).str.split(", ")
X_train["genre"] = X_train["genre"].apply(lambda genres: [genre.replace(" ", "") for genre in genres])
X_train["genre"] = X_train["genre"].apply(lambda genres: " ".join(genres))
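
The three steps can be traced on a single value. The helper `clean_genre` below is a hypothetical plain-string version of the same transformation, applied to the genre string from the first row of the training set shown earlier.

```python
import re

def clean_genre(raw):
    """Turn "['Action', 'Super Power']" into "Action SuperPower"."""
    # Strip the brackets and quotes, then split into individual genres
    genres = re.sub(r"[\[\]']", "", raw).split(", ")
    # Remove spaces inside each genre so a multi-word genre becomes one token
    return " ".join(genre.replace(" ", "") for genre in genres)

cleaned_genre = clean_genre("['Action', 'Military', 'Super Power', 'Magic']")
print(cleaned_genre)
```

Collapsing "Super Power" into "SuperPower" matters because `CountVectorizer` splits on whitespace and punctuation by default; without this step, "Super" and "Power" would be counted as two separate features.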


In [8]:
X_train["studio"] = X_train["studio"].str.split(", ")
X_train["studio"] = X_train["studio"].apply(lambda studios: [studio.replace(" ", "") for studio in studios])
X_train["studio"] = X_train["studio"].apply(lambda studios: " ".join(studios))


In [9]:
filler = re.compile(r"(?<=\n)[^\n]*$")
# Flag synopses whose last character is a closing bracket or parenthesis
has_filler = X_train["synopsis"].str.contains(r"[\]\)]$")
# Assign through .loc in a single step instead of chained indexing
X_train.loc[has_filler, "synopsis"] = X_train.loc[has_filler, "synopsis"].str.replace(filler, "", regex = True).str.strip()



## Build data transformation pipeline

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

pipeline = ColumnTransformer([
    ("normalized", StandardScaler(), ["score", "members", "episodes"]),
    # https://stackoverflow.com/questions/65242617/sklearn-pipeline-with-countvectorizer-and-category-on-a-pandas-dataframe
    # https://stackoverflow.com/questions/58772181/columntransformer-fails-with-countvectorizer-in-a-pipeline
    ("synopsis", CountVectorizer(stop_words = stopwords), "synopsis"),
    ("genre", CountVectorizer(stop_words = None), "genre"),
    ("studio", CountVectorizer(stop_words = None), "studio"),
    # https://www.learndatasci.com/glossary/dummy-variable-trap/
    ("one_hot", OneHotEncoder(drop = "first"), ["type", "source"])
])
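
To make the column handling concrete, here is a minimal sketch of the same kind of `ColumnTransformer` on a three-row toy table (the data and the name `toy_pipeline` are invented for illustration). Note that the text column is passed as a plain string, because `CountVectorizer` expects a one-dimensional sequence of documents, while the scaler and the encoder receive lists of column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "score": [7.0, 8.0, 6.0],
    "synopsis": ["a girl cries", "a boy fights", "robots fight"],
    "type": ["TV", "Movie", "TV"],
})

toy_pipeline = ColumnTransformer([
    ("normalized", StandardScaler(), ["score"]),           # list -> 2-D input
    ("synopsis", CountVectorizer(), "synopsis"),           # string -> 1-D input
    ("one_hot", OneHotEncoder(drop = "first"), ["type"]),  # list -> 2-D input
])

X_toy = toy_pipeline.fit_transform(toy)
print(X_toy.shape)
```

The result has one scaled column, six word-count columns (the default tokenizer ignores single-character tokens such as "a"), and one dummy column for `type`, for eight columns in total.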

In [11]:
# https://stackoverflow.com/questions/23838056/what-is-the-difference-between-transform-and-fit-transform-in-sklearn
X_train = pipeline.fit_transform(X_train)

## Select model type

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [13]:
logistic_reg = LogisticRegression()
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()

models = {"Logistic Regression": logistic_reg,
          "Decision Tree": decision_tree, 
          "Random Forest": random_forest}.items()
for name, model in models:
    scores = cross_val_score(model, X_train, y_train)
    accuracy = np.round(np.mean(scores), decimals = 4)
    std_dev = np.round(np.std(scores), decimals = 4)
    print(f"{name}: {str(accuracy)} ({str(std_dev)})")

Logistic Regression: 0.6822 (0.0209)
Decision Tree: 0.6367 (0.0227)
Random Forest: 0.6734 (0.0219)


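Logistic regression has the highest mean cross-validated accuracy (0.6822), though it is within one standard deviation of the random forest's (0.6734). I'll go with logistic regression for this exercise.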

## Feature selection and parameter tuning

In [14]:
from sklearn.feature_selection import SelectFromModel
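
Before running the search, it may help to see what a single selection step does. The sketch below uses synthetic data from `make_classification` (an assumption made for illustration, not the anime features) and keeps the features whose absolute coefficient is at least some multiple of the standard deviation of the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples = 200, n_features = 20, n_informative = 5, random_state = 0
)

model = LogisticRegression(C = 1.0, max_iter = 1000).fit(X_demo, y_demo)

# Threshold expressed as a multiple of the spread of the fitted coefficients
multiple = 1.0
threshold = np.std(model.coef_[0]) * multiple

# prefit = True reuses the fitted model; transform keeps features with |coef| >= threshold
selector = SelectFromModel(model, threshold = threshold, prefit = True)
X_selected = selector.transform(X_demo)

kept = int((np.abs(model.coef_[0]) >= threshold).sum())
print(f"Kept {X_selected.shape[1]} of {X_demo.shape[1]} features")
```

Raising `multiple` raises the bar a coefficient must clear, so fewer features survive; the search in this section varies that multiple together with `C`.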

In [15]:
# https://stackoverflow.com/questions/477486/how-do-i-use-a-decimal-step-value-for-range
c_values = []
thresholds = []
accuracies = []
# https://www.knime.com/blog/regularization-for-logistic-regression-l1-l2-gauss-or-laplace
# https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
inv_regs = np.linspace(1, 0.9, 11)
multiples = np.linspace(0.5, 3, 251)

for inv_reg in inv_regs:
    logistic_reg_2 = LogisticRegression(C = inv_reg)
    logistic_reg_2.fit(X_train, y_train)
    for multiple in multiples:
        print(f"Fitting model where C = {inv_reg} and threshold = {multiple}")
        threshold = np.std(logistic_reg_2.coef_[0]) * multiple
        selector = SelectFromModel(logistic_reg_2, threshold = threshold, prefit = True)
        X_train_2 = selector.transform(X_train)
        scores = cross_val_score(logistic_reg_2, X_train_2, y_train)
        c_values.append(inv_reg)
        thresholds.append(multiple)
        accuracies.append(np.mean(scores))

Fitting model where C = 1.0 and threshold = 0.5
Fitting model where C = 1.0 and threshold = 0.51
Fitting model where C = 1.0 and threshold = 0.52
Fitting model where C = 1.0 and threshold = 0.53
Fitting model where C = 1.0 and threshold = 0.54
Fitting model where C = 1.0 and threshold = 0.55
Fitting model where C = 1.0 and threshold = 0.56
Fitting model where C = 1.0 and threshold = 0.5700000000000001
Fitting model where C = 1.0 and threshold = 0.58
Fitting model where C = 1.0 and threshold = 0.59
Fitting model where C = 1.0 and threshold = 0.6
Fitting model where C = 1.0 and threshold = 0.61
Fitting model where C = 1.0 and threshold = 0.62
Fitting model where C = 1.0 and threshold = 0.63
Fitting model where C = 1.0 and threshold = 0.64
Fitting model where C = 1.0 and threshold = 0.65
Fitting model where C = 1.0 and threshold = 0.66
Fitting model where C = 1.0 and threshold = 0.67
Fitting model where C = 1.0 and threshold = 0.6799999999999999
Fitting model where C = 0.97 and threshold = 2.5100000000000002
Fitting model where C = 0.97 and threshold = 2.52
Fitting model where C = 0.97 and threshold = 2.5300000000000002
Fitting model where C = 0.97 and threshold = 2.54
Fitting model where C = 0.97 and threshold = 2.55
Fitting model where C = 0.97 and threshold = 2.56
Fitting model where C = 0.97 and threshold = 2.57
Fitting model where C = 0.97 and threshold = 2.58
Fitting model where C = 0.97 and threshold = 2.59
Fitting model where C = 0.97 and threshold = 2.6
Fitting model where C = 0.97 and threshold = 2.61
Fitting model where C = 0.97 and threshold = 2.62
Fitting model where C = 0.97 and threshold = 2.63
Fitting model where C = 0.97 and threshold = 2.64
Fitting model where C = 0.97 and threshold = 2.65
Fitting mo

Fitting model where C = 0.96 and threshold = 1.53
Fitting model where C = 0.96 and threshold = 1.54
Fitting model where C = 0.96 and threshold = 1.55
Fitting model where C = 0.96 and threshold = 1.56
Fitting model where C = 0.96 and threshold = 1.57
Fitting model where C = 0.96 and threshold = 1.58
Fitting model where C = 0.96 and threshold = 1.59
Fitting model where C = 0.96 and threshold = 1.6
Fitting model where C = 0.96 and threshold = 1.61
Fitting model where C = 0.96 and threshold = 1.62
Fitting model where C = 0.96 and threshold = 1.6300000000000001
Fitting model where C = 0.96 and threshold = 1.6400000000000001
Fitting model where C = 0.96 and threshold = 1.6500000000000001
Fitting model where C = 0.96 and threshold = 1.66
Fitting model where C = 0.96 and threshold = 1.67
Fitting model where C = 0.96 and threshold = 1.68
Fitting model where C = 0.96 and threshold = 1.69
Fitting model where C = 0.96 and threshold = 1.7
Fitting model where C = 0.96 and threshold = 1.71
Fitting mo

Fitting model where C = 0.95 and threshold = 0.6
Fitting model where C = 0.95 and threshold = 0.61
Fitting model where C = 0.95 and threshold = 0.62
Fitting model where C = 0.95 and threshold = 0.63
Fitting model where C = 0.95 and threshold = 0.64
Fitting model where C = 0.95 and threshold = 0.65
Fitting model where C = 0.95 and threshold = 0.66
Fitting model where C = 0.95 and threshold = 0.67
Fitting model where C = 0.95 and threshold = 0.6799999999999999
Fitting model where C = 0.95 and threshold = 0.69
Fitting model where C = 0.95 and threshold = 0.7
Fitting model where C = 0.95 and threshold = 0.71
Fitting model where C = 0.95 and threshold = 0.72
Fitting model where C = 0.95 and threshold = 0.73
Fitting model where C = 0.95 and threshold = 0.74
Fitting model where C = 0.95 and threshold = 0.75
Fitting model where C = 0.95 and threshold = 0.76
Fitting model where C = 0.95 and threshold = 0.77
Fitting model where C = 0.95 and threshold = 0.78
Fitting model where C = 0.95 and thres

Fitting model where C = 0.95 and threshold = 2.17
Fitting model where C = 0.95 and threshold = 2.1799999999999997
Fitting model where C = 0.95 and threshold = 2.19
Fitting model where C = 0.95 and threshold = 2.2
Fitting model where C = 0.95 and threshold = 2.21
Fitting model where C = 0.95 and threshold = 2.2199999999999998
Fitting model where C = 0.95 and threshold = 2.23
Fitting model where C = 0.95 and threshold = 2.24
Fitting model where C = 0.95 and threshold = 2.25
Fitting model where C = 0.95 and threshold = 2.26
Fitting model where C = 0.95 and threshold = 2.27
Fitting model where C = 0.95 and threshold = 2.2800000000000002
Fitting model where C = 0.95 and threshold = 2.29
Fitting model where C = 0.95 and threshold = 2.3
Fitting model where C = 0.95 and threshold = 2.31
Fitting model where C = 0.95 and threshold = 2.3200000000000003
Fitting model where C = 0.95 and threshold = 2.33
Fitting model where C = 0.95 and threshold = 2.34
Fitting model where C = 0.95 and threshold = 2

Fitting model where C = 0.9400000000000001 and threshold = 1.08
Fitting model where C = 0.9400000000000001 and threshold = 1.0899999999999999
Fitting model where C = 0.9400000000000001 and threshold = 1.1
Fitting model where C = 0.9400000000000001 and threshold = 1.1099999999999999
Fitting model where C = 0.9400000000000001 and threshold = 1.12
Fitting model where C = 0.9400000000000001 and threshold = 1.13
Fitting model where C = 0.9400000000000001 and threshold = 1.1400000000000001
Fitting model where C = 0.9400000000000001 and threshold = 1.15
Fitting model where C = 0.9400000000000001 and threshold = 1.1600000000000001
Fitting model where C = 0.9400000000000001 and threshold = 1.17
Fitting model where C = 0.9400000000000001 and threshold = 1.1800000000000002
Fitting model where C = 0.9400000000000001 and threshold = 1.19
Fitting model where C = 0.9400000000000001 and threshold = 1.2000000000000002
Fitting model where C = 0.9400000000000001 and threshold = 1.21
Fitting model where C

Fitting model where C = 0.9400000000000001 and threshold = 2.31
Fitting model where C = 0.9400000000000001 and threshold = 2.3200000000000003
Fitting model where C = 0.9400000000000001 and threshold = 2.33
Fitting model where C = 0.9400000000000001 and threshold = 2.34
Fitting model where C = 0.9400000000000001 and threshold = 2.35
Fitting model where C = 0.9400000000000001 and threshold = 2.3600000000000003
Fitting model where C = 0.9400000000000001 and threshold = 2.37
Fitting model where C = 0.9400000000000001 and threshold = 2.38
Fitting model where C = 0.9400000000000001 and threshold = 2.39
Fitting model where C = 0.9400000000000001 and threshold = 2.4000000000000004
Fitting model where C = 0.9400000000000001 and threshold = 2.41
Fitting model where C = 0.9400000000000001 and threshold = 2.42
Fitting model where C = 0.9400000000000001 and threshold = 2.4299999999999997
Fitting model where C = 0.9400000000000001 and threshold = 2.44
Fitting model where C = 0.9400000000000001 and t

Fitting model where C = 0.93 and threshold = 1.19
Fitting model where C = 0.93 and threshold = 1.2000000000000002
Fitting model where C = 0.93 and threshold = 1.21
Fitting model where C = 0.93 and threshold = 1.22
Fitting model where C = 0.93 and threshold = 1.23
Fitting model where C = 0.93 and threshold = 1.24
Fitting model where C = 0.93 and threshold = 1.25
Fitting model where C = 0.93 and threshold = 1.26
Fitting model where C = 0.93 and threshold = 1.27
Fitting model where C = 0.93 and threshold = 1.28
Fitting model where C = 0.93 and threshold = 1.29
Fitting model where C = 0.93 and threshold = 1.3
Fitting model where C = 0.93 and threshold = 1.31
Fitting model where C = 0.93 and threshold = 1.32
Fitting model where C = 0.93 and threshold = 1.33
Fitting model where C = 0.93 and threshold = 1.3399999999999999
Fitting model where C = 0.93 and threshold = 1.35
Fitting model where C = 0.93 and threshold = 1.3599999999999999
Fitting model where C = 0.93 and threshold = 1.37
Fitting m

Fitting model where C = 0.93 and threshold = 2.77
Fitting model where C = 0.93 and threshold = 2.7800000000000002
Fitting model where C = 0.93 and threshold = 2.79
Fitting model where C = 0.93 and threshold = 2.8000000000000003
Fitting model where C = 0.93 and threshold = 2.81
Fitting model where C = 0.93 and threshold = 2.82
Fitting model where C = 0.93 and threshold = 2.83
Fitting model where C = 0.93 and threshold = 2.84
Fitting model where C = 0.93 and threshold = 2.85
Fitting model where C = 0.93 and threshold = 2.86
Fitting model where C = 0.93 and threshold = 2.87
Fitting model where C = 0.93 and threshold = 2.88
Fitting model where C = 0.93 and threshold = 2.89
Fitting model where C = 0.93 and threshold = 2.9
Fitting model where C = 0.93 and threshold = 2.91
Fitting model where C = 0.93 and threshold = 2.92
Fitting model where C = 0.93 and threshold = 2.93
Fitting model where C = 0.93 and threshold = 2.94
Fitting model where C = 0.93 and threshold = 2.95
Fitting model where C =

Fitting model where C = 0.92 and threshold = 1.84
Fitting model where C = 0.92 and threshold = 1.85
Fitting model where C = 0.92 and threshold = 1.86
Fitting model where C = 0.92 and threshold = 1.87
Fitting model where C = 0.92 and threshold = 1.8800000000000001
Fitting model where C = 0.92 and threshold = 1.8900000000000001
Fitting model where C = 0.92 and threshold = 1.9000000000000001
Fitting model where C = 0.92 and threshold = 1.91
Fitting model where C = 0.92 and threshold = 1.92
Fitting model where C = 0.92 and threshold = 1.93
Fitting model where C = 0.92 and threshold = 1.94
Fitting model where C = 0.92 and threshold = 1.95
Fitting model where C = 0.92 and threshold = 1.96
Fitting model where C = 0.92 and threshold = 1.97
Fitting model where C = 0.92 and threshold = 1.98
Fitting model where C = 0.92 and threshold = 1.99
Fitting model where C = 0.92 and threshold = 2.0
Fitting model where C = 0.92 and threshold = 2.01
Fitting model where C = 0.92 and threshold = 2.02
Fitting m

Fitting model where C = 0.91 and threshold = 0.9
Fitting model where C = 0.91 and threshold = 0.91
Fitting model where C = 0.91 and threshold = 0.9199999999999999
Fitting model where C = 0.91 and threshold = 0.9299999999999999
Fitting model where C = 0.91 and threshold = 0.94
Fitting model where C = 0.91 and threshold = 0.95
Fitting model where C = 0.91 and threshold = 0.96
Fitting model where C = 0.91 and threshold = 0.97
Fitting model where C = 0.91 and threshold = 0.98
Fitting model where C = 0.91 and threshold = 0.99
Fitting model where C = 0.91 and threshold = 1.0
Fitting model where C = 0.91 and threshold = 1.01
Fitting model where C = 0.91 and threshold = 1.02
Fitting model where C = 0.91 and threshold = 1.03
Fitting model where C = 0.91 and threshold = 1.04
Fitting model where C = 0.91 and threshold = 1.05
Fitting model where C = 0.91 and threshold = 1.06
Fitting model where C = 0.91 and threshold = 1.07
Fitting model where C = 0.91 and threshold = 1.08
Fitting model where C = 

Fitting model where C = 0.91 and threshold = 2.46
Fitting model where C = 0.91 and threshold = 2.4699999999999998
Fitting model where C = 0.91 and threshold = 2.48
Fitting model where C = 0.91 and threshold = 2.49
Fitting model where C = 0.91 and threshold = 2.5
Fitting model where C = 0.91 and threshold = 2.5100000000000002
Fitting model where C = 0.91 and threshold = 2.52
Fitting model where C = 0.91 and threshold = 2.5300000000000002
Fitting model where C = 0.91 and threshold = 2.54
Fitting model where C = 0.91 and threshold = 2.55
Fitting model where C = 0.91 and threshold = 2.56
Fitting model where C = 0.91 and threshold = 2.57
Fitting model where C = 0.91 and threshold = 2.58
Fitting model where C = 0.91 and threshold = 2.59
Fitting model where C = 0.91 and threshold = 2.6
Fitting model where C = 0.91 and threshold = 2.61
Fitting model where C = 0.91 and threshold = 2.62
Fitting model where C = 0.91 and threshold = 2.63
Fitting model where C = 0.91 and threshold = 2.64
Fitting mo

Fitting model where C = 0.9 and threshold = 1.54
Fitting model where C = 0.9 and threshold = 1.55
Fitting model where C = 0.9 and threshold = 1.56
Fitting model where C = 0.9 and threshold = 1.57
Fitting model where C = 0.9 and threshold = 1.58
Fitting model where C = 0.9 and threshold = 1.59
Fitting model where C = 0.9 and threshold = 1.6
Fitting model where C = 0.9 and threshold = 1.61
Fitting model where C = 0.9 and threshold = 1.62
Fitting model where C = 0.9 and threshold = 1.6300000000000001
Fitting model where C = 0.9 and threshold = 1.6400000000000001
Fitting model where C = 0.9 and threshold = 1.6500000000000001
Fitting model where C = 0.9 and threshold = 1.66
Fitting model where C = 0.9 and threshold = 1.67
Fitting model where C = 0.9 and threshold = 1.68
Fitting model where C = 0.9 and threshold = 1.69
Fitting model where C = 0.9 and threshold = 1.7
Fitting model where C = 0.9 and threshold = 1.71
Fitting model where C = 0.9 and threshold = 1.72
Fitting model where C = 0.9 a

In [None]:
models = pd.DataFrame({"c_value": c_values, "threshold": thresholds, "accuracy": accuracies})

In [None]:
subset = models[np.abs(models["c_value"] - 0.96) < 1e-6]
subset2 = models[np.abs(models["c_value"] - 1) < 1e-6]
max_score = models["accuracy"].max()
max_multiple = models[models["accuracy"] == max_score]["threshold"].values

plt.figure(figsize = (12, 5.5))
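# Mark every threshold that reaches the best accuracy; axvline takes one x value at a time
for best_threshold in np.unique(max_multiple):
    plt.axvline(x = best_threshold, color = "red", linestyle = "--", alpha = 0.5)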
plt.plot(subset["threshold"], subset["accuracy"], color = "black", label = "Regularization parameter = 0.96")
plt.plot(subset2["threshold"], subset2["accuracy"], color = "orange", label = "Regularization parameter = 1")
plt.legend(loc = (1.02, 0.35))
plt.xlabel("Threshold standard deviation")
plt.ylabel("Average accuracy")

plt.tight_layout()
plt.savefig("plots/feature_selection.png", dpi = 300)

In [None]:
models[models["accuracy"] == max_score]

In [None]:
# Code from https://johaupt.github.io/blog/columnTransformer_feature_names.html

import warnings
import sklearn.pipeline  # import the submodule explicitly so sklearn.pipeline.Pipeline resolves

def get_feature_names(column_transformer):
    """Get feature names from all transformers.
    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    # Remove the internal helper function
    #check_is_fitted(column_transformer)
    
    # Turn lookup into function for better handling with pipeline later
    def get_names(trans):
        # >> Original get_feature_names() method
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            return []
        if trans == 'passthrough':
            if hasattr(column_transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    return column
                else:
                    return column_transformer._df_columns[column]
            else:
                indices = np.arange(column_transformer._n_features)
                return ['x%d' % i for i in indices[column]]
        if not hasattr(trans, 'get_feature_names'):
            # >>> Change: Return input column names if no method available
            # Turn error into a warning
            warnings.warn("Transformer %s (type %s) does not "
                                 "provide get_feature_names. "
                                 "Will return input column names if available"
                                 % (str(name), type(trans).__name__))
            # For transformers without a get_feature_names method, use the input
            # names to the column transformer
            if column is None:
                return []
            else:
                return [name + "__" + f for f in column]

        return [name + "__" + f for f in trans.get_feature_names()]
    
    ### Start of processing
    feature_names = []
    
    # Allow transformers to be pipelines. Pipeline steps are named differently, so preprocessing is needed
    if type(column_transformer) == sklearn.pipeline.Pipeline:
        l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
    else:
        # For column transformers, follow the original method
        l_transformers = list(column_transformer._iter(fitted=True))
    
    
    for name, trans, column, _ in l_transformers: 
        if type(trans) == sklearn.pipeline.Pipeline:
            # Recursive call on pipeline
            _names = get_feature_names(trans)
            # if pipeline has no transformer that returns names
            if len(_names)==0:
                _names = [name + "__" + f for f in column]
            feature_names.extend(_names)
        else:
            feature_names.extend(get_names(trans))
    
    return feature_names

In [None]:
logistic_reg_final = LogisticRegression(C = 0.96)
logistic_reg_final.fit(X_train, y_train)
coefs = pd.DataFrame(
    {"feature_names": get_feature_names(pipeline),
     "coef": logistic_reg_final.coef_[0]}
)

In [None]:
coefs.sort_values(by = "coef", ascending = False).reset_index(drop = True)

The most important predictors of tearjerker anime are the number of members who have the anime on their MyAnimeList list and whether the anime is a drama or not. This finding coincides with the visualizations from the EDA.
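
Sorting by the raw coefficient puts the strongest positive predictors at the top but buries the strongest negative ones at the bottom. The sketch below ranks by absolute value instead. It uses a made-up stand-in table (`coefs_demo`, with hypothetical feature names and values) rather than the real `coefs`, and it assumes the features were put on comparable scales before fitting, since coefficient magnitudes are only comparable in that case.

```python
import pandas as pd

# Toy stand-in for the `coefs` table; names and values are invented for illustration
coefs_demo = pd.DataFrame({
    "feature_names": ["members", "genre__Drama", "genre__Comedy", "synopsis__school"],
    "coef": [1.4, 1.1, -0.9, 0.05],
})

# Rank by magnitude so that large negative coefficients are not hidden at the bottom
ranked = (
    coefs_demo
    .assign(abs_coef=coefs_demo["coef"].abs())
    .sort_values(by="abs_coef", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```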

In [None]:
threshold = np.std(coefs["coef"].values) * 1.88

plt.figure(figsize = (8, 5.5))
coefs["coef"].hist(bins = 100, edgecolor = "black")
plt.xlabel("Coefficient")
plt.axvline(x = threshold, color = "red", linestyle = "--")
plt.axvline(x = -threshold, color = "red", linestyle = "--")

plt.tight_layout()
plt.savefig("plots/coef_hist.png", dpi = 300)
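
The red lines in the histogram sit at a multiple of the standard deviation of the coefficients. A minimal sketch of how such a cutoff could be turned into a feature mask is shown below. The helper `select_by_coef` and the demo array are hypothetical, and treating "keep features whose coefficient lies outside the red lines" as the selection rule is an assumption on my part about how the threshold is meant to be read.

```python
import numpy as np

def select_by_coef(coef, n_std):
    """Boolean mask of coefficients whose magnitude exceeds n_std standard deviations."""
    coef = np.asarray(coef, dtype=float)
    cutoff = np.std(coef) * n_std  # same form as `threshold` in the cell above
    return np.abs(coef) > cutoff

# Invented coefficients: mostly near zero, with two large ones
demo_coef = np.array([0.01, -0.02, 0.03, 1.5, -1.2, 0.0])
mask = select_by_coef(demo_coef, n_std=1.0)
print(mask)
```

With a real fitted model, the mask could be used to index both the coefficient vector and the columns of the training matrix before refitting.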

# Takeaways

I used Aurélien Géron's *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* as a reference text during this project. In it, Géron emphasizes the importance of setting aside a test set before any modeling so that it can be used to honestly assess how well the final model generalizes. I initially failed to follow this rule, so I had to generate a brand new test set by randomly resampling 25% of the data set after already fitting my models on a previously sampled 25%. (Although I generated a new test set, overfitting might still be an issue, since I had been modifying my target vector to improve performance on the previous test set, and some of the data from the previous test set is still present in the new one.) Indeed, I was so enthusiastic about modeling that I didn't give every step of the machine learning process its due, specifically EDA. After starting over with a new test set, I decided to spend additional time on EDA *before* proceeding to modeling. Having learned these lessons, I'll allot the appropriate amount of time to EDA and be more conscious of data leakage in the future.
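
As a reminder to myself, here is a minimal sketch of setting the test set aside first, using Scikit-learn's `train_test_split` on synthetic data. The arrays `X_demo` and `y_demo` are invented, and the `stratify` and `random_state` arguments are my own assumptions rather than settings used elsewhere in this notebook. The only part taken from the project is the 25% test size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real feature matrix and tearjerker labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (rng.random(200) < 0.3).astype(int)

# Split once, before any EDA-driven decisions or modeling, and leave the test set alone
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # hold out a quarter of the data
    stratify=y_demo,     # keep the class balance similar in both splits
    random_state=42,     # fixed seed so the split is reproducible
)
print(X_train_demo.shape, X_test_demo.shape)
```

Fixing the seed matters here: re-running the split with a different seed after tuning would mix previously seen test rows into the new test set, which is the leakage described above.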

# Note on potential overfitting

# Acknowledgements

These projects were especially helpful as I built my own.

- [Finding Anime Genre Based on Synopsis(Logistic Reg](https://www.kaggle.com/code/shsagar/finding-anime-genre-based-on-synopsis-logistic-reg)
- [A Beginner's Guide to Sentiment Analysis with Python](https://towardsdatascience.com/a-beginners-guide-to-sentiment-analysis-in-python-95e354ea84f6)

I also relied on this [regular expression cheat sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/).