In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import pandas as pd
import numpy as np
import matplotlib
from IPython.display import *

import AssembleData

# Anime Trends Based on MyAnimeList

**Authors**: Stephen Chen, Sean Reidy, Roger Liu

## Introduction
**Anime**, or Japanese animation, commonly refers to animated television shows which originate from Japan. The medium spans a vast number of genres, from shows geared towards children, such as *Pokemon*, to content aimed at young adults, such as *Fullmetal Alchemist*.

![](images/anime.jpg)

[**MyAnimeList**](https://myanimelist.net/) (abbreviated MAL) is an information database website about anime, much like how *IMDb* is a database for movies. It provides metadata about anime titles, such as synopses, the involved production companies, and genres (e.g. “Action” or “Comedy”). Registered users of MAL can rate and review the animes they’ve watched, which MAL then aggregates into a public score from 1 to 10. As such, MAL serves as a useful indicator of which shows are “good” in the community’s eyes, as well as which shows are “famous.”

![](images/mal_example.jpg)

In recent years, with the rise of the Internet, the worldwide community of anime watchers grew larger and more defined, while the anime industry became increasingly aware of its audience. Tropes, or patterns in storytelling and character archetypes, developed to fit fan expectations. But as these tropes were reused over many titles, viewers became increasingly wary of “cookie-cutter” shows, many of which ended up forgotten in the mass of “average” shows. In this context, certain animation studios made their claim to fame by either breaking traditions or producing similar shows of better technical quality. Long-time watchers of anime eventually developed a natural intuition for which shows would be “average” or “good” just from the shows’ descriptions and production studios. These notions motivated the idea that there are some underlying trends in the anime medium.


Our project focused on discovering the underlying patterns of anime and of the anime community by performing data analysis on MAL. We first developed a web scraper which retrieved most of the metadata available on an anime title’s webpage. We then performed data exploration with the purpose of identifying trends within the metadata and exploring relationships between metadata and user scores. We explored trends within each genre, identified patterns within synopsis text, and analyzed the impact of the production studio and source material on popularity. Finally, we addressed the question: “Can we predict the MAL score for an anime title given its metadata?” Our investigation showed that predicting the exact score from this metadata was not feasible with the models we tried. Instead, we discovered that we can classify which titles are “above average” just from a title’s production studio, genres, source material, and length.

In [None]:
'''
We occasionally have personal interpretations of trends in the
context of the anime industry and community. Since they are not essential
to the analysis, we have separated them into "trivia" boxes like this one.
''';

## Data Collection

![](images/data_examples.jpg)

The scraper we wrote pulled information from all shows that were airing between 1998 and 2015. We did this by iterating through each broadcasting season (winter, spring, summer, fall) between those years and pulling metadata for each anime still airing during that time frame. The information was saved as JSON, with each title in its own file identified by its MAL ID.
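
The season-by-season iteration and per-title file layout described above can be sketched as follows. This is a minimal illustration rather than the scraper itself: the names `iter_seasons` and `save_title` and the `data` output directory are hypothetical, and the network requests that fetch each title's page are omitted.

```python
import json
import os

SEASONS = ("winter", "spring", "summer", "fall")

def iter_seasons(first_year=1998, last_year=2015):
    """Yield (year, season) pairs in broadcast order, inclusive of both years."""
    for year in range(first_year, last_year + 1):
        for season in SEASONS:
            yield year, season

def save_title(record, out_dir="data"):
    """Write one title's metadata to <out_dir>/<MAL id>.json and return the path."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    path = os.path.join(out_dir, "{}.json".format(record["id"]))
    with open(path, "w") as f:
        json.dump(record, f)
    return path
```

One consequence of keying files by ID in a layout like this is that a title seen in several consecutive seasons would overwrite its earlier copy instead of producing a duplicate file.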

User contributions to MAL work as follows. Registered users each have an *AnimeList*, which is an individual profile that allows them to keep track of animes they have seen or plan to watch. Each anime entry added to the list can be labelled as “Completed”, “Watching”, “Plan to Watch”, “On-Hold”, or “Dropped”. Users can score an anime on an integer scale of 1 to 10 if the title is listed as “Completed” or “Watching” on their AnimeList. A page on MAL for an anime title displays two user aggregates: **score** and **members**. Score is the average of the scores users have given a particular title. Members is the total number of users who have added the anime to their AnimeList; this includes titles listed under “Plan to Watch”, “On-Hold”, and “Dropped”.

Intuitively, score represents how “good” an anime is, as judged by MAL users. Members is an approximation for how “famous” a show is. 

The following are all the data fields that appear in the JSON:

![“Fields”](images/fields.png)

### Additional Fields
We computed some additional fields during our data exploration:

![“More Fields”](images/more_fields.png)

## Loading and Cleaning the Data

In [None]:
# Build Dataframe from the data/ folder
# MAL_df, all_genre  = AssembleData.read_files_all(write_csv=True, verbose=True)
# MAL_df = MAL_df.drop_duplicates('id')

In [None]:
# Alternatively save and load to a pickle file for faster loading.
import pickle

# MAL_df = MAL_df.drop_duplicates('id')
# with open('augmented_1998_2015_data.pkl', 'wb') as f:
#     pickle.dump(MAL_df, f)
#     pickle.dump(all_genre, f)

# Pickle data is binary, so the file is opened in 'rb' mode.
with open('augmented_1998_2015_data.pkl', 'rb') as f:
    MAL_df = pickle.load(f)
    all_genre = pickle.load(f)

In [None]:
# Sanity Check
MAL_df.drop_duplicates('id', inplace=True)

# Convert studio list to usable form
import ast

def fix_studios(l):
    """
        Fixes some of the studio names in list l
    """
    # Collapse AIC companies under AIC
    # Collapse Xebec Zwei into Xebec
    for i, name in enumerate(l):
        if name.startswith('AIC'):
            l[i] = u'AIC'
        elif name.startswith('Xebec'):
            l[i] = u'Xebec'
    return l

# The studios column holds the string repr of a list; literal_eval parses it
# without executing arbitrary code the way eval would.
MAL_df['studios'] = MAL_df['studios'].apply(ast.literal_eval)
MAL_df['studios'] = MAL_df['studios'].apply(fix_studios)

# Length computation
MAL_df['length'] = MAL_df['duration'] * MAL_df['episodes']
# Normalize Length to mean 0, std 1 
MAL_df['norm_length'] = (MAL_df['length'] - np.mean(MAL_df['length'])) / np.std(MAL_df['length'])

# Convert start date to usable form
MAL_df['aired_start'] = pd.to_datetime(MAL_df['aired_start'])

# Compute bayes_scores
MAL_df['norm_score'] = (MAL_df['score'] - 1.0) / 9.0
MAL_df['bayes_score'] = (MAL_df['norm_score']*MAL_df['score_users'] + 1.0) / (MAL_df['score_users'] + 2.0)

# Finally index it properly.
MAL_df.reset_index(drop=True, inplace=True)

In [None]:
# The following code block builds a set of all the studios.
# split_field is also needed for later code blocks.

def split_field(df, field, new_field_name=None):
    new_rows = []
    for index, row in df.iterrows():
        for item in row[field]:
            new_rows.append((row['id'], item))
            
    needs_replace = False
    if new_field_name is None or len(new_field_name) == 0:
        new_field_name = "__temp_" + field
        needs_replace = True
        
    right_df = pd.DataFrame(new_rows, columns=['id', new_field_name])
    new_df = pd.merge(df, right_df, how='left', left_on='id', right_on='id')
    if needs_replace:
        del new_df[field]
        new_df[field] = new_df[new_field_name]
        del new_df[new_field_name]
        
    return new_df

def build_studios_set(df):
    split = split_field(df[['id', 'studios']], 'studios')
    split.dropna(axis=0, subset=['studios'], inplace=True)
    return split['studios'].unique()

all_studios = build_studios_set(MAL_df)
print "Number of studios:", all_studios.shape[0]
all_studios[:5]

In [None]:
# The following code block builds the set of all sources

def build_source_set(df):
    data = df['source'].copy()
    data.dropna(axis=0, inplace=True)
    return data.unique()

all_sources = build_source_set(MAL_df)
print "Number of sources:", all_sources.shape[0]
all_sources

In [None]:
# The list of genres is loaded with the pickle file.
# Check to make sure there are 44 genres here.
print len(all_genre)

## Exploration

### Initial Exploratory Data Analysis

After we collected our data into an organized pandas dataframe, we began the process of exploratory data analysis, in which we looked at the distributions of both the categorical and continuous variables.

This process provided insight into the effect a work's genre had on its score distribution. As depicted by the histogram below, genres including psychological, shounen, and drama have score distributions that skew towards higher scores compared to genres like kids, where the distribution exhibits a heavy tail of low scores. From this we gathered that genre was a potentially informative feature for estimating the expected score of a work.

We are most interested in the user score of a given anime, as we consider it a good metric for the quality of an anime. The scores follow a unimodal, roughly normal distribution centered approximately around 7.5 out of 10. This distribution exhibits a tail towards the lower scores, evidence of some low-scoring entries pulling down the mean.

!["Score Density Plot"](Rcode/scoreDen.png)

This scatter plot matrix illustrates the bivariate relationships among all the continuous variables in the data frame. Each cell in the matrix is a scatter plot whose x and y variables are determined by its position in the matrix. One interesting correlation is that longer shows tend to have higher scores. This is likely because companies only invest in producing a movie when they know it will be well received.

![Scatterplot Mat](Rcode/scatterplotMatrix.png)

### Score and Members
Scores and member count are intuitively correlated. The better a show, the more people are likely to have heard about it (and add it to their AnimeList).

In [None]:
matplotlib.rcParams['figure.figsize'] = (14.0, 9.0)
plt.scatter(MAL_df['members'], MAL_df['score'])
plt.xlabel("members")
plt.ylabel("score")
plt.title('score vs members')
plt.show()

Our graphic here shows a few interesting points. The major point related to our analysis is that there are a few data points with a score of exactly 0 or 10. This highlights the problem that scores of titles with few user ratings can be heavily skewed. We address this in later analysis by using a new “bayes_score” field, which normalizes the score to [0.0, 1.0] and adds one extra rating of 0 and one extra rating of 1 to balance out the low number of ratings.
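
To make the adjustment concrete, the sketch below applies the same `bayes_score` formula used in the cleaning step to plain numbers. The standalone function and the example inputs are illustrative assumptions, not values taken from the data set.

```python
def bayes_score(score, n_users):
    """Normalize a 1-10 MAL score to [0, 1], then add one phantom 0 and one phantom 1 rating."""
    norm = (score - 1.0) / 9.0
    return (norm * n_users + 1.0) / (n_users + 2.0)

# A perfect 10 from a single user is pulled strongly towards the middle.
few = bayes_score(10.0, 1)        # (1 * 1 + 1) / (1 + 2) = 2/3
# The same score from 10,000 users barely moves.
many = bayes_score(10.0, 10000)   # 10001 / 10002
```

With no ratings at all the adjusted score is exactly 0.5, so unrated titles sit in the middle of the scale rather than at an extreme. The idea is the same as add-one (Laplace) smoothing of a proportion.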

In [None]:
matplotlib.rcParams['figure.figsize'] = (14.0, 9.0)
plt.scatter(MAL_df['members'], MAL_df['bayes_score'])
plt.xlabel("members")
plt.ylabel("bayes score")
plt.title('bayes score vs members')
plt.show()

Curiously, there seems to be a score threshold around 7.0 (.667 bayes_score): an anime that does not score at least 7.0 almost never becomes majorly popular on MAL. Equivalently, if an anime is popular on MAL, it very likely has a score of at least 7.0.

In [None]:
'''
There are a few interesting outliers which have high member counts despite low scores.
These are School Days and Pupa, which are notoriously violent animes.
The extreme violence has drawn much attention despite their lack of quality.
'''

### Genre Analysis

![“Genre user plot”](images/genre_show_score_dist.png)

### Standard Deviations in Genres

![“Genre std”](images/genre_top_std.png) ![“Genre std”](images/genre_bot_std.png)


![“Genre user plot”](images/genre_user_score_dist.png)
Something interesting to note is that while the distribution of scores among shows is fairly normal, the distribution among users is strongly heavy tailed. Also interesting to note is that the genres which we found to be the most highly rated have a notably large share of 9 and 10 scores rather than simply a peak of 8’s and 9’s.

### Impact of Genre on Score

Before even reading the synopsis of a show, a viewer will want to know what **genre** the show is in. While genres are intended to be a straightforward description of a show's story, many genres carry loaded definitions due to other popular shows in that genre. For example, the tag **shounen**, which signifies that the show is intended for the boys' demographic, has many tropes associated with it due to the popularity of shows like *Dragon Ball* and *One Piece*.

Observing this, we did a first-pass linear regression of score and member count on the genre tags.

<table>
<tr>
    <td> <img src="images/genre_score_reg.png" width="200"/> </td>
    <td> <img src="images/genre_member_reg.png" width="200"/> </td>
</tr>
</table>

Though our $R^2$ value was only around 0.2, the fact that the tags mostly aligned with our notions of generally "good" genres suggested that there was a link between genre and score.
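
For readers unfamiliar with the metric, $R^2$ measures the fraction of score variance that a fit explains. The sketch below computes it for the simplest case of a single genre indicator, where the least-squares prediction is just the mean score of each group. The scores and tags are made-up toy values, and `r_squared_single_indicator` is a hypothetical helper that is not part of our analysis code.

```python
def r_squared_single_indicator(scores, has_tag):
    """R^2 of a least-squares fit of score on one 0/1 indicator (with intercept).

    For a single binary predictor the fitted value is the mean score of the
    title's group, so no matrix algebra is needed. Assumes the scores are
    not all identical (otherwise the total sum of squares is zero).
    """
    overall_mean = sum(scores) / float(len(scores))
    groups = {0: [], 1: []}
    for s, t in zip(scores, has_tag):
        groups[t].append(s)
    means = {t: sum(v) / float(len(v)) for t, v in groups.items() if v}
    ss_res = sum((s - means[t]) ** 2 for s, t in zip(scores, has_tag))
    ss_tot = sum((s - overall_mean) ** 2 for s in scores)
    return 1.0 - ss_res / ss_tot

# Toy data: four titles without the tag, four with it.
toy_scores = [6.0, 7.0, 6.5, 7.5, 7.0, 8.0, 7.5, 8.5]
toy_tag = [0, 0, 0, 0, 1, 1, 1, 1]
r2 = r_squared_single_indicator(toy_scores, toy_tag)
```

Read this way, a value near 0.2 means the genre tags leave roughly 80% of the variance in score unexplained.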

### Impact of Studios on Score

The *studios* field consists of the animation studio(s) that are primarily responsible for creating an anime. Some studios are recognizable by name in the community either because of how many titles they have released or how well received their titles have been.

Our data set features animes from 302 unique studios after collapsing branches of companies into their parent branch. Our first step was to rank the studios by the average scores of the animes they released.

In [None]:
# Some animes have more than one studio listed.
# Need to split each studio into its individual row.
split = split_field(MAL_df, 'studios')
by_studios = split.groupby('studios', as_index=False)

# Wrestling with merges in Pandas.
bigger_studios = by_studios.filter(lambda data: len(data) > 2)
bigger_studios_groups = bigger_studios.groupby('studios', as_index=False)
bigger_studios_size = bigger_studios_groups.size().to_frame('num_titles')
bigger_studios_scores_members = bigger_studios_groups[['score','members']].agg(np.mean)
bigger_studios_members_sorted = bigger_studios.sort_values('members', ascending=False).groupby('studios')
bigger_studios_max = bigger_studios_members_sorted['members'].max().to_frame('max')
bigger_studios_2nd = bigger_studios_members_sorted.nth(1)['members'].to_frame('2nd')
bigger_studios_3rd = bigger_studios_members_sorted.nth(2)['members'].to_frame('3rd')
merged = pd.merge(bigger_studios_scores_members, bigger_studios_size, how='inner', right_index=True, left_on='studios')
merged = pd.merge(merged, bigger_studios_max, how='left', left_on='studios', right_index=True)
merged = pd.merge(merged, bigger_studios_2nd, how='left', left_on='studios', right_index=True)
merged = pd.merge(merged, bigger_studios_3rd, how='left', left_on='studios', right_index=True)


# Display the top 20 scoring studios.
merged = merged.sort_values(['score', 'members'], ascending=False)
merged = merged.reset_index(drop=True)
merged[['studios', 'score', 'members', 'num_titles']].head(20)

These results did not match our personal experience. This is because studios often produce small animations to advertise their mainline series, and these short episodes are separate entries on MAL. We decided to filter these short entries out because they do not represent a studio as well as their main counterparts. We thus removed all entries whose total length was 30 minutes or less.

In [None]:
'''
Specifically, many of these are Original Video Animations (OVAs)
which are special, unaired single episodes of an anime shipped
with the anime’s DVD release.  These typically score worse than
their parent title on MAL since they generally do not advance
the parent story’s plot.
''';

In [None]:
# Get only entries that have total length greater than 30 minutes
sig_length = MAL_df[MAL_df['length'] > 30]
print "Total entries:", len(MAL_df)
print "Significant:", len(sig_length)

In [None]:
# Split each studio into its own row
split = split_field(sig_length, 'studios')
by_studios = split.groupby('studios', as_index=False)

# Wrestle with merges and groups in Pandas
bigger_studios = by_studios.filter(lambda data: len(data) > 2)
bigger_studios_groups = bigger_studios.groupby('studios', as_index=False)
bigger_studios_size = bigger_studios_groups.size().to_frame('num_titles')
bigger_studios_scores_members = bigger_studios_groups[['score','members']].agg(np.mean)
bigger_studios_members_sorted = bigger_studios.sort_values('members', ascending=False).groupby('studios')
bigger_studios_max = bigger_studios_members_sorted['members'].max().to_frame('max')
bigger_studios_2nd = bigger_studios_members_sorted.nth(1)['members'].to_frame('2nd')
bigger_studios_3rd = bigger_studios_members_sorted.nth(2)['members'].to_frame('3rd')
merged = pd.merge(bigger_studios_scores_members, bigger_studios_size, how='inner', right_index=True, left_on='studios')
merged = pd.merge(merged, bigger_studios_max, how='left', left_on='studios', right_index=True)
merged = pd.merge(merged, bigger_studios_2nd, how='left', left_on='studios', right_index=True)
merged = pd.merge(merged, bigger_studios_3rd, how='left', left_on='studios', right_index=True)

# Display top 20 studios
merged = merged.sort_values(['score', 'members'], ascending=False)
merged = merged.reset_index(drop=True)
merged[['studios', 'score', 'members', 'num_titles']].head(20)

This was more in line with our expectations. Particularly interesting are the studios which have high average score despite producing more than 5 titles, such as *Studio Ghibli* (average 7.93 over 13 titles), *White Fox* (average 7.816 over 13 titles), and *Kyoto Animation* (average 7.774 over 42 titles). This indicates consistent quality from these studios, and suggests that the scores of their productions can be inferred just based on the prestige of the studio.

In [None]:
'''
Curiously, this is not exactly the list of names that fans
in the anime community will immediately recognize. A better
representation of renown is captured by the membership.
Specifically, of the studio’s produced titles, the title with
the second largest membership is a good representation of renown.
Intuitively, most fans will only start recognizing a studio’s name
after it has produced at least two well known titles. When sorting
the studios in this fashion, the top 20 list quickly becomes saturated
with commonly recognized studios.
'''

# Sort by studio's anime with the 2nd largest membership
merged = merged.sort_values(['2nd'], ascending=False)
merged = merged.reset_index(drop=True)
merged[['studios', 'score', 'members', 'num_titles']].head(20)

### Impact of Source Material on Score

Some animes are created directly for television. Many others are based on some source material, such as an existing manga or novel.

In [None]:
def make_source_features(df, source_names):
    new_df = df[['id', 'source']].copy()
    for source in source_names:
        new_df[source] = new_df['source'].apply(lambda s: s == source).astype(float)
    return new_df[source_names]

# Generate a DataFrame with boolean indicators of Source attached.
with_sources = pd.merge(MAL_df, make_source_features(MAL_df, all_sources), how='inner', left_index=True, right_index=True)

In [None]:
# Plot the score frequencies by Source Material.
matplotlib.rcParams['figure.figsize'] = (30.0, 40.0)
i = 1
for source in all_sources:
    plt.subplot(6, 3, i)
    plt.hist(with_sources[with_sources[source] == 1.0]['score'], bins=20, range=[0.0,10.0])
    plt.xlabel('score')
    plt.ylabel('count')
    plt.title('score frequency of "%s"' % source)
    i += 1

Our histograms tell us that anime-original shows tend to hover around 7.0, shows based on *manga* (Japanese comics) or *light novels* (chapter books) hover higher than 7.0, and shows based on *visual novels* (interactive storytelling video games) generally sit below 7.0. This variation is significant considering that scores from 6.7 to 8.0 account for about 45% of all titles. Thus, source material is likely a factor in predicting MAL score.

In [None]:
'''
Contextually, we interpret the correlation as follows. Light novel
adaptations are highly rated because each novel in the original series
tells a complete, self-sustaining story. This translates well into
the short timeframe of a 13-episode anime adaptation, as these adaptations
can usually end with all plot points addressed and without cliffhangers,
which leaves viewers satisfied with the anime. Meanwhile, manga
adaptations often occur because the source manga is already popular.
Thus the anime is also likely to be well received. In contrast,
visual novel adaptations tend to suffer because they remove the interactive
aspect of the original storytelling game. Moreover, visual novel titles
typically are not as popular as mangas. Companies only produce these
adaptations because the fanbase, although small, is usually dedicated
and willing to invest in subsequent merchandise.
''';

### Synopsis Text Analysis 

##### Code for this section was written in R and can be found in the appendix

!["Main Word Cloud"](Rcode/wordclouds/all_genre.png)


The synopsis of a given work provides a brief explanation of the setting, plot, and character details that a viewer should expect in the anime. The free-form nature of this text could make it a powerful tool for classifying works into clusters and groups based on the vocabulary used within each synopsis. We built a large document term matrix of all the synopses, separated into different documents by their genre. After removing punctuation, numbers/symbols, and commonly used English words (stop words), we created a frequency table of the more common words.
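
Our own pipeline for this step was written in R. Purely as an illustration of the idea, the Python sketch below builds per-genre term counts with the same cleaning steps (lower-casing, dropping punctuation and numbers, removing stop words). The tiny stop-word list, the function names, and the two toy synopses are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "at"}

def tokenize(text):
    """Lower-case the text, keep alphabetic words only, and drop stop words."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def term_counts_by_genre(titles):
    """Map each genre to a Counter of terms over all synopses tagged with it."""
    counts = {}
    for synopsis, genres in titles:
        tokens = tokenize(synopsis)
        for genre in genres:
            counts.setdefault(genre, Counter()).update(tokens)
    return counts

toy_titles = [
    ("A girl transfers to a new school in 2015.", ["School", "Comedy"]),
    ("The student council of the school is at war!", ["School", "Action"]),
]
dtm = term_counts_by_genre(toy_titles)
```

Each genre's `Counter` plays the role of one row of the document term matrix; `dtm["School"].most_common(10)` would list that genre's most frequent terms.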

!["Doc Term Mat"](images/documentTermMat2.png)

The most frequently used words across all genres included: one, world, will, new, school, life, however, girl, friends, two, day, and now. This points to larger tropes and trends commonly found in anime, for example stories where the setting is a “school” and the twist is signaled by the word “however.”

!["Word Frequency"](images/wordFreq.png)

Perhaps the most interesting results were the word associations found within the document term matrix. For example, the word “school” was closely associated with the words “boys”, “doesn't”, “student” and “high”.


Separating by genre, we find that there are some defining words that represent some genres, but this proved to be rather inconclusive, as synopses tend to be similar across all the works.

#### Action 

!["Action Word Cloud"](Rcode/wordclouds/MAL_action .png)

#### Shounen 

!["Shounen Word Cloud"](Rcode/wordclouds/MAL_shounen .png)

#### Slice of Life

!["Slice of Life Word Cloud"](Rcode/wordclouds/MAL_sliceoflife .png)

More word clouds for each genre can be found here.  
https://github.com/FourSwordKirby/MALDataScience/tree/master/Rcode/wordclouds

### Collecting Sequels and Related Entries into Their Parent Title
When browsing various shows on MyAnimeList, we found that many shows had sequels and related works listed underneath them. In addition, we noticed that multiple works from the same overall series ended up near the top of our various metrics. With this in mind, we decided to see what would happen if we were to collapse works into their parent series.


To do this, we took all of the works we collected and sorted them by number of members, so that each series would be denoted by its most popular work. We then iterated through this list. For each work, we recursively found the IDs of its related works to form a series that encapsulated all of those works, and put these series into a new dataframe. A show was deemed to be related if its parent has a link pointing to it and it has a link pointing to its parent. We made sure to never include the ID of a show that already belonged to another series.
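
The grouping procedure above can be sketched roughly as follows. This is a simplified stand-in rather than our actual code: it works on plain dictionaries instead of the dataframe, uses an explicit stack in place of recursion, and the name `collapse_into_series` and the toy IDs are hypothetical.

```python
def collapse_into_series(members, related):
    """Group works into series keyed by their most popular member.

    members: dict mapping work id -> member count
    related: dict mapping work id -> set of ids that the work's page links to
    Returns a dict mapping the parent id -> sorted list of ids in that series.
    """
    assigned = set()
    series = {}
    # Visit the most popular works first so each series is named after one.
    for root in sorted(members, key=lambda i: members[i], reverse=True):
        if root in assigned:
            continue
        group, stack = [], [root]
        assigned.add(root)
        while stack:
            current = stack.pop()
            group.append(current)
            for other in related.get(current, ()):
                # Only follow links that point in both directions.
                mutual = current in related.get(other, ())
                if mutual and other in members and other not in assigned:
                    assigned.add(other)
                    stack.append(other)
        series[root] = sorted(group)
    return series

# Toy example: 1, 2 and 3 link to each other in a chain; 3 links to 4,
# but 4 does not link back, so 4 stays its own series.
toy_members = {1: 500, 2: 300, 3: 50, 4: 400}
toy_related = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: set()}
result = collapse_into_series(toy_members, toy_related)
```

The `assigned` set is what enforces the rule that an ID already claimed by one series is never added to another.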


In collapsing the data we found the following results. The number of entries dropped from __7053__ to __3915__.
<table>
<tr>
    <td> <img src="images/series_count_30.png" width="300"/> </td>
    <td> <img src="images/series_score_30.png" width="300"/> </td>
</tr>
</table>
When removing all “insignificant” works (defined previously), we get the following. The number of entries dropped from __4853__ to __3121__.

<table>
<tr>
    <td> <img src="images/series_filter_count_30.png" width="300"/> </td>
    <td> <img src="images/series_filter_score_30.png" width="300"/> </td>
</tr>
</table>


The degree to which shows were collapsed is not too surprising, as it suggests that about 1 in 3 works gets a second season, movie, or a significant set of DVD extras. 


## Learning MAL Scores

### Features
Our main motivation was to determine if we can predict an anime’s score based on its metadata.

Based on our data exploration, we determined that a title’s total length, genres, studios, and source material are all well correlated with its MAL score. Comparatively, our synopsis text analysis demonstrated the synopsis was indicative of the genre, but not necessarily of the quality of the title. We deemed this a reasonable conclusion when we manually read the synopses, as many high scoring titles had synopses similar to those of low scoring titles.

Our learning features thus included 363 features consisting of:

* An indicator variable {0, 1} for each of the 302 possible studios.
* An indicator variable {0, 1} for each of the 44 genres.
* An indicator variable {0, 1} for each of the 16 source material types.
* `norm_length` = title’s total length normalized to mean 0, standard deviation 1.

In [None]:
def make_studio_features(df, studio_names):
    new_df = df[['id', 'studios']].copy()
    for studio in studio_names:
        new_df[studio] = new_df['studios'].apply(lambda l: studio in l).astype(float)
    return new_df[studio_names]

def make_source_features(df, source_names):
    new_df = df[['id', 'source']].copy()
    for source in source_names:
        new_df[source] = new_df['source'].apply(lambda s: s == source).astype(float)
    return new_df[source_names]

def make_features(df):
    p1 = make_studio_features(df, all_studios)
    p2 = df[all_genre + ['norm_length']]
    p3 = make_source_features(df, all_sources)
    join = pd.concat([p1, p2, p3], axis=1, join='inner')
    return join

make_features(MAL_df.head(5)).shape

### Regression


We began by fitting a Linear Regression to these features to predict the MAL score. Our preliminary results were not promising, as the fits produced coefficients of correlation that were either incredibly low or absurd values. This was likely because of the abundance of categorical variables, which translated into boolean indicator variables of 0 or 1 and made accurate fits difficult. However, the learned parameters of these models showed that the high scoring genres, studios, and source materials that we discovered during data exploration did tend to have larger parameter values.

### Classifying “Above Average”


We then simplified our question. Rather than predicting MAL scores exactly, we wanted to predict whether an anime was “above average” or “below average.” Anime enthusiasts can intuitively make this prediction based on an anime’s metadata and description, as they can typically identify “cookie-cutter” shows which fall into the “average” or “below average” spectrum. This motivated our belief that learning algorithms could do the same.


Our data was split into a training, validation, and test set based on the starting air date: before 2012 for training, 2012-2014 for validation, and 2015 for test.

In [None]:
def split_data_by_date(df):
    df = df.sort_values('aired_start')
    t_date = pd.to_datetime('2012-01-01')
    v_date = pd.to_datetime('2015-01-01')
    train = df[df['aired_start'] < t_date]
    val   = df[(df['aired_start'] >= t_date) & (df['aired_start'] < v_date)]
    test  = df[df['aired_start'] >= v_date]
    return train, val, test

data = MAL_df

train_df, val_df, test_df = split_data_by_date(data)
print "available", data.shape[0]
print "train", train_df.shape[0]
print "validation", val_df.shape[0]
print "test", test_df.shape[0]

X_tr = make_features(train_df)
X_cv = make_features(val_df)   
X_test = make_features(test_df)

print "X_tr", X_tr.shape

The score we used for this section is the `bayes_score` field, which is the MAL score normalized to [0.0, 1.0] with an extra 0 and 1 rating. We interpreted “above average” in two ways: first, above the median score across all of the entries in our MAL data set, and second, above a MAL score of 7.2 for practical purposes.

### Above Median
For this situation, we labeled each entry according to whether its score was above or below the median MAL score (6.74 unnormalized, 0.637 normalized).

In [None]:
median_bayes_score = np.median(MAL_df['bayes_score'])
print "Median bayes_score:", median_bayes_score
print "Median Score", median_bayes_score * 9 + 1

def make_labels(df):
    return df['bayes_score'] > median_bayes_score

y_tr = make_labels(train_df)
y_cv = make_labels(val_df)
y_test = make_labels(test_df)

Using `sklearn`’s `GradientBoostingClassifier`, we scored 73.5% accuracy on the validation set, which we consider a large improvement over the roughly 50% baseline of random guessing on a median split.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()
clf.fit(X_tr, y_tr)

score = clf.score(X_cv, y_cv)
print "score", score

We also tried Logistic Regression, sweeping the inverse regularization strength `C` over a log-spaced range and settling on `C = 1.8`.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

def get_err_reg(clf, X, y, X_cv, y_cv, C):
    clf.C = C
    clf.fit(X, y)
    return np.array([clf.score(X, y), clf.score(X_cv, y_cv)])

C = np.logspace(-3, 4, 50)
errors = np.array([get_err_reg(clf, X_tr, y_tr, X_cv, y_cv, c) for c in C])

matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)
plt.semilogx(C, errors[:,0], C, errors[:,1]);

In [None]:
clf = LogisticRegression(C=1.8) # Use this when considering all data, not just extremities
clf.fit(X_tr, y_tr)
print clf.score(X_tr, y_tr)
print clf.score(X_cv, y_cv)
print clf.score(X_test, y_test)

Our classifier scored 75.7% accuracy on the 2015 test set. These test results broke down into 78.0% accuracy on predicting above median anime, and 73.4% accuracy on predicting below median anime.

In [None]:
# Accuracy on positives
pos_idx = y_test > 0
print "Positive accuracy", clf.score(X_test[pos_idx], y_test[pos_idx])

neg_idx = y_test < 1
print "Negative accuracy", clf.score(X_test[neg_idx], y_test[neg_idx])

### Above Score 7.2
Practically, predicting above median is not worthwhile since the majority of the anime community has never heard of shows with scores below the median of 6.74 (evidenced by [large membership in animes which score above 7.0](#Score-and-Members)). In practice, it is more useful to know if an anime scores above the average of commonly known shows. Our personal experience suggested that a MAL score of 7.2 more accurately represents a “practical average.” This translates to roughly the 70th percentile, which is reasonable when assuming that most known shows score approximately above the median.

In [None]:
# The following code block prints the percentile of works per category
intuitive_thresholds = [
    ('barely_watchable', 6.7),
    ('average', 7.2),
    ('above_average', 7.6),
    ('recommendable', 8.0),
    ('amazing', 8.5),
    ('gods', 9.0)
]

def to_bayes_score(score):
    return ((score - 1.0) / 9.0)

def percentages(df):
    print "Using regular score"
    for label, val in intuitive_thresholds:
        print "{} ({}): {}".format(
            label,
            val,
            1. - float(sum(df['score'] > val)) / df.shape[0])
        
    print ""
    print "Using bayes scores"
    for label, val in intuitive_thresholds:
        thresh = ((val - 1.0) / 9.0)
        print "{} ({}): {}".format(
            label,
            thresh,
            1. - float(sum( df['bayes_score'] > thresh )) / df.shape[0])
        
def percentage_bayes_score(df, score):
    return float(sum( df['bayes_score'] > score )) / df.shape[0]

print "==== Percentiles of Titles ===="
percentages(MAL_df)

In [None]:
def make_labels(df):
    return df['bayes_score'] > to_bayes_score(7.2) # ~70 percentile

y_tr = make_labels(train_df)
y_cv = make_labels(val_df)
y_test = make_labels(test_df)

Running a `GradientBoostingClassifier`, we scored 78.6% accuracy on the validation set.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()
clf.fit(X_tr, y_tr)

score = clf.score(X_cv, y_cv)
print "score", score

Again, we tried Logistic Regression, sweeping the regularization hyperparameter `C` and then using `C = 1.0`.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

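def get_err_reg(clf, X, y, X_cv, y_cv, C):
    # Despite the name, this returns accuracies (not errors):
    # [training accuracy, validation accuracy] for the given C.
    clf.C = C
    clf.fit(X, y)
    return np.array([clf.score(X, y), clf.score(X_cv, y_cv)])

# Sweep C (the inverse of regularization strength) over a log-spaced grid.
C = np.logspace(-3, 4, 50)
errors = np.array([get_err_reg(clf, X_tr, y_tr, X_cv, y_cv, c) for c in C])

matplotlib.rcParams['figure.figsize'] = (10.0, 7.0)
plt.semilogx(C, errors[:,0], label='training accuracy')
plt.semilogx(C, errors[:,1], label='validation accuracy')
plt.xlabel('C')
plt.ylabel('accuracy')
plt.legend();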

In [None]:
clf = LogisticRegression(C=1.) # Use this when considering all data for above/below 70 percentile
clf.fit(X_tr, y_tr)
# Accuracy on the training, validation, and test sets
print "train", clf.score(X_tr, y_tr)
print "validation", clf.score(X_cv, y_cv)
print "test", clf.score(X_test, y_test)

We scored 79.8% on the test set, which breaks down into 59.8% accuracy on anime above the practical average and 88.5% accuracy on anime below it.
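
As a sanity check on how these three numbers relate: overall accuracy is the class-share-weighted average of the per-class accuracies, so the share of positive test titles can be backed out from the reported figures. The sketch below uses only the percentages quoted above; the variable names are ours and are not part of the notebook.

```python
# Reported figures from the text above, expressed as fractions.
overall_acc = 0.798
pos_acc = 0.598   # accuracy on titles above the practical average
neg_acc = 0.885   # accuracy on titles below the practical average

# overall_acc = p * pos_acc + (1 - p) * neg_acc, solved for p.
pos_share = (neg_acc - overall_acc) / (neg_acc - pos_acc)

print("implied share of positive test titles: {:.3f}".format(pos_share))
```

The implied share of roughly 30% is consistent with a threshold placed near the 70th percentile.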

In [None]:
# Accuracy on positives
pos_idx = y_test > 0
print "Positive accuracy", clf.score(X_test[pos_idx], y_test[pos_idx])

neg_idx = y_test < 1
print "Negative accuracy", clf.score(X_test[neg_idx], y_test[neg_idx])

This is about 10 percentage points more accurate than always predicting “below average” (which is right roughly 70% of the time, since the threshold sits near the 70th percentile), and significantly better than random guessing.
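
The two baselines in this comparison can be reproduced with scikit-learn's `DummyClassifier`. The sketch below is illustrative only: it runs on synthetic labels with about 30% positives (an assumption standing in for the MAL test set), and the names `X_demo`, `y_demo`, `majority`, and `coin_flip` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in data (NOT the MAL dataset): about 30% positives,
# mimicking a label defined by a 70th-percentile threshold.
n = 1000
X_demo = rng.normal(size=(n, 3))
y_demo = rng.uniform(size=n) < 0.3

# Baseline 1: always predict the most frequent class ("below average").
majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
# Baseline 2: predict either class uniformly at random.
coin_flip = DummyClassifier(strategy="uniform", random_state=0).fit(X_demo, y_demo)

majority_acc = majority.score(X_demo, y_demo)
coin_flip_acc = coin_flip.score(X_demo, y_demo)

print("majority-class baseline: {:.3f}".format(majority_acc))
print("uniform-random baseline: {:.3f}".format(coin_flip_acc))
```

On the real data, the same two calls on `X_test` and `y_test` would give the reference points against which the classifier's test accuracy is judged.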


When examining the Logistic Regression coefficients, we confirmed our original expectation that high-scoring studios, genres, and source material would have a large influence on prediction.
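
One way to read such coefficients: in logistic regression, a one-unit increase in a feature multiplies the odds of the positive class by `exp(coefficient)`, so for a 0/1 indicator feature this is the odds ratio between titles with and without that attribute. The sketch below illustrates this on synthetic data; the feature names (`studio_a`, `studio_b`, `genre_x`) and the generating model are hypothetical and are not taken from the MAL dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic indicator features (hypothetical names, NOT MAL data).
n = 2000
X_demo = pd.DataFrame({
    "studio_a": rng.binomial(1, 0.3, size=n),
    "studio_b": rng.binomial(1, 0.3, size=n),
    "genre_x": rng.binomial(1, 0.5, size=n),
})

# Assumed ground truth: studio_a raises the odds of a positive label,
# studio_b lowers them, and genre_x has no effect.
logits = (-0.5 + 2.0 * X_demo["studio_a"] - 2.0 * X_demo["studio_b"]).values
y_demo = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))

demo_clf = LogisticRegression(C=1.0).fit(X_demo, y_demo)

# coef_[0] is ordered like the columns of X_demo.
summary = pd.DataFrame({
    "coefficient": demo_clf.coef_[0],
    "odds_ratio": np.exp(demo_clf.coef_[0]),
}, index=X_demo.columns)
print(summary)
```

An odds ratio above 1 pushes predictions toward “above average,” below 1 pushes them away, and values near 1 indicate little influence. Comparing magnitudes across features is most meaningful when the features share a scale, as indicator columns do.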

In [None]:
feature_names = np.array(X_tr.columns.values)
feature_coeff = pd.DataFrame(sorted(zip(feature_names, clf.coef_[0]), key=lambda x : np.abs(x[1]), reverse=True),
                             columns=['Feature Name', 'Coefficient'])

In [None]:
feature_coeff[feature_coeff['Feature Name'].apply(lambda s: s in all_studios)].head(15)

In [None]:
feature_coeff[feature_coeff['Feature Name'].apply(lambda s: s in all_sources)].head(5)

### Visualization of Learned Classifier

A majority of shows are properly classified, as evidenced by the visible division at score 7.2.

In [None]:
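# Row i is the RGB color for predicted class i:
# 0 (predicted below threshold) -> red, 1 (predicted above) -> blue, 2 -> black.
colors = np.array([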
            [1, 0, 0],
            [0, 0, 1],
            [0, 0, 0]
        ])

classes = pd.Series(clf.predict(make_features(MAL_df)).astype(int))
#classes[MAL_df['hentai'] == 1] = 2
colored_classes = colors[classes]

matplotlib.rcParams['figure.figsize'] = (14.0, 9.0)
plt.scatter(MAL_df['members'], MAL_df['score'], c=colored_classes)
plt.xlabel("members")
plt.ylabel("score")
plt.title('score vs members')
plt.show()

In [None]:
# This charts the scores of animes as they first aired over the years.
# The size of each dot is proportional to the number of members an anime has.
# NOTE: The time axis cannot be plotted as a date.
#       The time span stretches from 1998 -> 2016 though.
newer_shows = MAL_df[MAL_df['aired_start'] > pd.to_datetime('1997')]

colors = np.array([
            [1, 0, 0],
            [0, 0, 1],
            [.8, .8, .8]
        ])
classes = pd.Series(clf.predict(make_features(newer_shows)).astype(int))
colored_classes = colors[classes]

sizes = (np.log(newer_shows['members']) + 1) * 5

x_axis = newer_shows['aired_start'].apply(matplotlib.dates.date2num)

matplotlib.rcParams['figure.figsize'] = (14.0, 9.0)
plt.scatter(x_axis, newer_shows['score'], c=colored_classes, s=sizes)
plt.xlabel("starting air date")
plt.ylabel("score")
plt.title('scores of anime released from 1998-2015')
plt.show()

In [None]:
'''
There are a few curious outliers, specifically the “above average”
show with high member count and low score that aired recently.
This is "Pupa", as previously mentioned, an incredibly violent show.
"Pupa" was probably classified highly because of its psychological
genre and its owning studio, Studio Deen, which is well known for some
other works. Ironically, the show was originally highly anticipated,
but quickly fell through the rankings as the content turned out far
below expectations. This conflict between metadata and actual content
seems to have been reflected in our classifier.
''';

In [None]:
# Feel free to try seeing what the classifier does on certain titles.
title_to_predict = 'Amagi Brilliant Park'
clf.decision_function(make_features(MAL_df[MAL_df['title'] == title_to_predict]))

## Conclusion

![outliers](images/outliers.png)

## Future Work
   
Through our analysis, we discovered interesting trends underlying anime and the community that surrounds it. In addition, the classifier we developed was able to distinguish good anime from bad anime with about 80% accuracy on the test set. Future work could draw on the large volume of user recommendations and community data available on MyAnimeList. By leveraging these data sources, we could potentially determine which works are derivative and which are “original.” MyAnimeList also stores information about the characters that appear in shows and lets users add them to a list of favorites. Using the information stored in these profiles, we could potentially determine the traits that make characters “popular.”