# 15-388 Final Project 

#### Group Members 
Stephen Chen, Sean Reidy, Roger Liu

## Introduction
**Anime**, or Japanese animation, commonly refers to animated television shows which originate from Japan. The medium spans a vast number of genres, from shows geared towards children, such as *Pokemon*, to content aimed at young adults, such as *Fullmetal Alchemist*. Some titles are produced directly as original television shows, while many others are adaptations of existing source material, such as Japanese novels, manga (comics), or even games. Anime titles often run for one or two “seasons,” airing one episode a week, for a total of a 13- or 26-episode “series.” Unlike many American television series, animes are usually announced upfront as running for only one or two seasons, rather than running until discontinued by the broadcaster.


**MyAnimeList** (abbreviated MAL) is an information database website about anime, much like how *IMDb* is a database for movies. It provides metadata about anime titles, such as synopses, the production companies involved, and genres (e.g. “Action” or “Comedy”). Registered users of MAL can rate and review the animes they’ve watched, which MAL then aggregates into a public score from 1 to 10. As such, MAL serves as a useful indicator of which shows are “good” in the community’s eyes, as well as which shows are “famous.”


In recent years, with the rise of the Internet, the worldwide community of anime watchers grew larger and more defined, while the anime industry became increasingly aware of its audience. Tropes, or patterns in storytelling and character archetypes, developed to fit fan expectations. But as these tropes were reused over many titles, viewers became increasingly wary of “cookie-cutter” shows, many of which ended up forgotten in the mass of “average” shows. In this context, certain animation studios made their claim to fame by either breaking traditions or producing similar shows of better technical quality. Longtime watchers of anime eventually developed a natural intuition for which shows would be “average” or “good” just from the shows’ descriptions and production studios. These notions motivated the idea that there are some underlying trends in the anime medium.


Our project focused on discovering the underlying patterns of anime and of the anime community by performing data analysis on MAL. We first developed a web scraper which retrieved most of the metadata available on an anime title’s webpage. We then performed data exploration with the purpose of identifying trends within the metadata and exploring relationships between metadata and user scores. We explored trends within each genre, identified patterns within synopsis text, and analyzed the impact of the production studio and source material on popularity. Finally, we addressed the question: “Can we predict the MAL score for an anime title given its metadata?” Our investigation suggested that accurately predicting the exact score from this metadata is not feasible. Instead, we found that we can classify which titles are “above average” from just a title’s production studio, genres, source material, and length.



## Data Collection
The scraper we wrote pulled information for all shows that were airing between 1998 and 2015. We did this by iterating through each broadcasting season (winter, spring, summer, fall) between those years and pulling metadata for each anime airing during that time frame. The information was saved as JSON, with each title given its own file identified by its MAL ID.
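The season loop and per-title JSON layout described above can be sketched as follows. This is only an illustration, not our scraper: the `id` and `title` keys and the helper names `iter_seasons` and `save_title` are hypothetical, and the actual page fetching and parsing are omitted.

```python
import json
import tempfile
from itertools import product
from pathlib import Path

SEASONS = ("winter", "spring", "summer", "fall")


def iter_seasons(first_year=1998, last_year=2015):
    """Yield (year, season) pairs in chronological order, both years inclusive."""
    for year, season in product(range(first_year, last_year + 1), SEASONS):
        yield year, season


def save_title(metadata, out_dir):
    """Write one title's metadata to <out_dir>/<MAL ID>.json and return the path.

    Assumes the record stores its MAL ID under the (hypothetical) key "id".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{metadata['id']}.json"
    path.write_text(json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


seasons = list(iter_seasons())

# Round-trip a placeholder record through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_title({"id": 1, "title": "Example"}, tmp)
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
```

Keeping one file per MAL ID means a re-run of the scraper overwrites a title's record instead of duplicating it when the same show appears in several seasons.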

User contributions to MAL work as follows. Registered users each have an *AnimeList*, which is an individual profile that allows them to keep track of animes they have seen or plan to watch. Each anime entry added to the list can be labelled as “Completed”, “Watching”, “Plan to Watch”, “On-Hold”, or “Dropped”. Users can score an anime on an integer scale of 1 to 10 if the title is listed as “Completed” or “Watching” on their AnimeList. A page on MAL for an anime title shows two user aggregates: **score** and **members**. Score is the average of the scores users have given a particular title. Members is the total number of users who have added the anime to their AnimeList; this includes titles listed under “Plan to Watch”, “On-Hold”, and “Dropped”.

Intuitively, score represents how “good” an anime is, as judged by MAL users. Members is an approximation for how “famous” a show is. 

The following data fields appear in the JSON:

![“Fields”](images/fields.png)

### Additional Fields
We computed some additional fields during our data exploration:

![“More Fields”](images/more_fields.png)

## Exploration

### Initial Exploratory Data Analysis

After we collected our data into an organized pandas dataframe, we began the process of exploratory data analysis, in which we looked at the distributions of both the categorical and the continuous variables.

This process provided necessary insights into the effect a work's genre has on its score distribution. As depicted by the histogram below, genres including psychological, shounen, and drama have score distributions that skew towards higher scores compared to genres like kids, where the distribution exhibits a heavy tail of low scores. From this we gathered that genre is a useful feature for estimating the expected score of a work.

We are most interested in the user score of a given anime, as we consider it a good metric for the quality of an anime. The scores follow a unimodal, roughly normal distribution centered at approximately 7.5 out of 10. The distribution has a tail towards the lower scores, indicating that some low-scoring entries pull down the mean.

!["Score Density Plot"](Rcode/scoreDen.png)

This scatter plot matrix illustrates the bivariate relationships among all the continuous variables in the data frame. Each cell in the matrix is a scatter plot whose x and y variables are determined by its position. One interesting correlation is that longer shows tend to have higher scores. This is likely because companies only invest in producing a longer work, such as a movie, when they expect it to be well received.

![Scatterplot Mat](Rcode/scatterplotMatrix.png)

### Score and Members
Scores and member count are intuitively correlated. The better a show, the more people are likely to have heard about it (and add it to their AnimeList).


![“Scores vs Members”](images/scores_v_members.png)


Our graphic here shows a few interesting points. The major point for our analysis is that a few data points have a score of exactly 0 or 10. This illustrates the problem that the scores of titles with few user ratings can be heavily skewed. We address this in later analysis with a new “bayes_score” field, which normalizes the score to [0.0, 1.0] and adds one additional score of 0 and one of 1 to balance out the low number of scores.
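As a rough sketch of this kind of smoothing, the function below treats the two extra scores as pseudo-votes. The exact normalization is an assumption on our part here (dividing the mean by the scale maximum of 10), as are the function and parameter names.

```python
def bayes_score(mean_score, num_votes, scale_max=10.0):
    """Return a smoothed score in [0, 1].

    The mean is normalized by dividing by scale_max (an assumed choice),
    then one pseudo-vote of 0 and one pseudo-vote of 1 are added.
    """
    if num_votes < 0:
        raise ValueError("num_votes must be non-negative")
    normalized_total = (mean_score / scale_max) * num_votes
    return (normalized_total + 0.0 + 1.0) / (num_votes + 2)


single_perfect_vote = bayes_score(10.0, 1)    # one user gave a 10
well_rated_title = bayes_score(8.0, 1000)     # many users, mean of 8
no_votes = bayes_score(0.0, 0)                # nothing but the two pseudo-votes
```

With a single vote of 10 the smoothed value is pulled down to 2/3, while a title with 1000 votes barely moves, which is the behavior we want for entries with few ratings.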


Within the anime context, there are a few curious outliers which have high member counts despite low scores. These are *School Days* and *Pupa*, which are notoriously violent animes. Their extreme violence has drawn much attention despite their lack of substance. In addition, there seems to be a score threshold around 7.0: in our data, an anime that does not score at least 7.0 does not become majorly popular on MAL. Equivalently, an anime that is popular on MAL has a score of at least 7.0.

### Genre Analysis

![“Genre user plot”](images/genre_show_score_dist.png)

### Standard Deviations in Genres

![“Genre std”](images/genre_top_std.png) ![“Genre std”](images/genre_bot_std.png)


![“Genre user plot”](images/genre_user_score_dist.png)
Something interesting to note is that while the distribution of scores among shows is fairly normal, the distribution among users is strongly heavy-tailed. Also interesting is that the genres we found to be the most highly rated have a noticeably large share of 9 and 10 scores rather than simply a peak at 8 and 9.

### Impact of Genre on Score

Before even reading the synopsis of a show, a viewer will want to know what **genre** the show is in. While genres are intended to be a straightforward description of a show's story, many genres carry loaded definitions due to other popular shows in that genre. For example, the tag **shounen**, which signifies that the show is intended for the young male demographic, has many tropes associated with it due to the popularity of shows like *Dragon Ball* and *One Piece*.

Observing this, we ran a first-pass linear regression of score and of member count on the genre tags:

<table>
<tr>
    <td> <img src="images/genre_score_reg.png" width="200"/> </td>
    <td> <img src="images/genre_member_reg.png" width="200"/> </td>
</tr>
</table>

Though our R^2 value was only around 0.2, the fact that the tags mostly aligned with our notions of generally "good" genres suggested that there is a link between genre and score.
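For readers who want to see the shape of such a regression, here is a minimal sketch using NumPy's least-squares solver on genre indicator columns. The toy records and their scores are made up for illustration and are not taken from our dataset; the R^2 they produce has no relation to the value reported above.

```python
import numpy as np

# Made-up toy records; each title may carry several genre tags.
titles = [
    {"score": 8.4, "genres": ["Drama"]},
    {"score": 7.9, "genres": ["Action", "Drama"]},
    {"score": 6.1, "genres": ["Kids"]},
    {"score": 7.2, "genres": ["Action"]},
    {"score": 5.8, "genres": ["Kids"]},
    {"score": 8.0, "genres": ["Drama"]},
    {"score": 6.9, "genres": ["Action", "Kids"]},
    {"score": 7.4, "genres": ["Action"]},
]

genres = sorted({g for t in titles for g in t["genres"]})

# Design matrix: a leading intercept column, then one 0/1 indicator per genre.
X = np.array([[1.0] + [float(g in t["genres"]) for g in genres] for t in titles])
y = np.array([t["score"] for t in titles])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
pred = X @ coef
r_squared = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Estimated score shift associated with each genre tag, relative to the intercept.
genre_effects = dict(zip(genres, coef[1:]))
```

Sorting `genre_effects` by value gives the kind of genre ordering that can be compared against intuition, even when R^2 itself is low.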

### Impact of Studios on Score

The *studios* field consists of the animation studio(s) that are primarily responsible for creating an anime. Some studios are recognizable by name in the community either because of how many titles they have released or how well received their titles have been.


Our data set features animes from 301 unique producers after collapsing branches of companies into their parent company. Our first step was to rank the studios by the average scores of the animes they released:


![Top 20](images/top_20_studios_all.png)


These results were not entirely in line with our personal experience. This is because studios often produce small video animations to advertise their mainline series, and these animations are counted as separate entries on MAL. Many of these are Original Video Animations (OVAs) which are special, unaired single episodes of a title shipped with an anime’s DVD release. These typically score worse than their parent title on MAL since they generally do not advance the parent story’s plot. We decided to filter these short entries out because they do not represent a studio as well as their parent counterparts. We did this by removing all entries that were less than 30 minutes long.
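The filter-then-rank step can be sketched as below. The field names `duration_min`, `studios`, and `score`, along with the studio names, are hypothetical placeholders, and whether the 30-minute cutoff applies to total or per-episode running time is left as an assumption of the sketch.

```python
from collections import defaultdict
from statistics import mean


def rank_studios(titles, min_minutes=30, top_n=20):
    """Rank studios by mean score, ignoring entries shorter than min_minutes.

    Returns a list of (studio, mean_score, title_count) tuples, best first.
    """
    scores_by_studio = defaultdict(list)
    for title in titles:
        if title["duration_min"] < min_minutes:
            continue  # skip short promotional entries and OVAs
        for studio in title["studios"]:
            scores_by_studio[studio].append(title["score"])
    ranked = sorted(
        ((studio, mean(scores), len(scores)) for studio, scores in scores_by_studio.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:top_n]


sample = [
    {"studios": ["Studio A"], "score": 8.0, "duration_min": 312},
    {"studios": ["Studio A"], "score": 5.0, "duration_min": 5},    # filtered out
    {"studios": ["Studio B"], "score": 7.0, "duration_min": 120},
    {"studios": ["Studio B"], "score": 7.5, "duration_min": 300},
]
ranked = rank_studios(sample)
```

Returning the title count alongside the mean makes it easy to tell a studio with one strong release apart from one that is consistently well rated.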


![Top 20](images/top_20_studios_sig.png)


This was more in line with our expectations. Some studios’ scores even rose as a result. *Studio Ghibli*, which some western fans recognize as the creators of *Spirited Away* and *Princess Mononoke*, also enters the list. Particularly interesting are the studios which have a high average score despite producing more than 5 titles. This indicates consistent quality from a studio, and suggests that an anime’s score can be partly inferred from the prestige of the studio making it.


### Impact of Source Material on Score
Some animes are created directly for television. Many others are based on some source material, such as an existing manga or novel.


![source](images/source_hist.png)


Our histogram tells us that anime-original shows tend to hover around 7, shows based on manga (Japanese comics) or light novels (chapter books) hover slightly above 7, and shows based on visual novels (interactive storytelling video games) generally sit below 7. Considering that scores from 6.7 to 8.0 account for about 45% of all titles, the source material is likely to be a telling factor in how well a show is received on MAL.

Contextually, we interpret the correlation as follows. Light novel adaptations are highly rated because each novel in a series tells a complete, self-contained story which converts well into the short timeframe of a 13-episode anime adaptation. These animes usually end with all plot points addressed and without cliffhangers, which leaves viewers satisfied with the anime. Meanwhile, manga adaptations often occur because the source manga is already popular, so the anime is also likely to be well received. In contrast, visual novel adaptations tend to suffer because they remove the interactive aspect of the original storytelling game. Moreover, visual novel titles are typically not as popular as manga. Companies only produce these adaptations because the fanbase, although small, is usually dedicated and willing to invest in subsequent merchandise.


### Synopsis Text Analysis 

##### Code for this section was written in R and can be found in the appendix

!["Main Word Cloud"](Rcode/wordclouds/all_genre.png)


The synopsis of a given work provides a brief explanation of the setting, plot, and character details that a viewer should expect in the anime. The free-form nature of this text could make it a powerful tool for classifying works into clusters and groups based on the vocabulary used within each synopsis. We built a large document-term matrix of all the synopses, separated into one document per genre. After removing punctuation, numbers and symbols, and commonly used English words (stop words), we created a frequency table of the most common words.
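Our implementation of this step is in R; purely to illustrate the idea, the following Python sketch builds one term-count document per genre after the same kind of cleaning. The stop-word list is a tiny placeholder rather than the list we used, and the sample synopses are invented.

```python
import re
from collections import Counter

# Tiny placeholder stop-word list; a real run would use a full English list.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "at", "with"})


def tokenize(text):
    """Lowercase the text, keep alphabetic words only, and drop stop words."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def document_term_matrix(synopses_by_genre):
    """Return {genre: Counter(term -> count)}, treating each genre as one document."""
    return {
        genre: Counter(term for synopsis in synopses for term in tokenize(synopsis))
        for genre, synopses in synopses_by_genre.items()
    }


corpus = {
    "Action": ["A new world awaits the hero.", "Two friends fight in a new war!"],
    "School": ["Life at school changes in 2015; however, a girl arrives."],
}
dtm = document_term_matrix(corpus)

# Frequency table across all genres.
overall = sum(dtm.values(), Counter())
```

Calling `overall.most_common(n)` then gives the top-n words across the whole corpus, and `dtm[genre].most_common(n)` does the same for a single genre.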

!["Doc Term Mat"](images/documentTermMat2.png)

The most frequently used words across all the genres included: one, world, will, new, school, life, however, girl, friends, two, day, and now. This alludes to larger tropes and trends commonly found in anime, for example where the setting is a “school” and the story’s twist is signaled by the word “however.”

!["Word Frequency"](images/wordFreq.png)

Perhaps the most interesting results were the word associations found within the document-term matrix. For example, the word “school” was closely associated with the words “boys”, “doesn't”, “student”, and “high”.


Separating by genre, we find that some genres have defining words, but this proved to be rather inconclusive, as synopses tend to be similar across all the works. Below are three genre word clouds illustrating the top 150 words, where each word's size is proportional to its frequency in the corpus.

#### Action 

!["Action Word Cloud"](Rcode/wordclouds/MAL_action .png)

#### Shounen 

!["Shounen Word Cloud"](Rcode/wordclouds/MAL_shounen .png)

#### Slice of Life

!["Slice of Life Word Cloud"](Rcode/wordclouds/MAL_sliceoflife .png)

More word clouds for each genre can be found here.  
https://github.com/FourSwordKirby/MALDataScience/tree/master/Rcode/wordclouds

### Collecting Sequels and Related Entries into Their Parent Title
When perusing various shows on MyAnimeList, we found that many shows had sequels and related works listed underneath them. In addition, we noticed that multiple works from the same overall series ended up near the top of our various metrics. With this in mind, we decided to see what would happen if we were to collapse works into a single parent series.


To do this, we took all of the works we collected and sorted them by number of members, so that each series would be denoted by its most popular work. We then iterated through this list. For each work, we recursively found the IDs of its related works to form a series that encapsulated all of those works. We put these series into a new dataframe. A show was deemed to be related if its parent has a link pointing to it and it has a link pointing back to its parent. We made sure never to include the ID of a show that already belonged to another series.
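The procedure above can be sketched roughly as follows. The `members` and `related` field names and the sample IDs are hypothetical, and the sketch uses an explicit stack in place of recursion, which visits the same set of works.

```python
def collapse_series(works):
    """Group works into series keyed by their most popular member.

    works maps an ID to {"members": int, "related": [IDs]}.
    Two works are merged only when each links to the other.
    Returns {parent_id: [IDs in that series, parent first]}.
    """
    assigned = set()  # IDs that already belong to some series
    series = {}
    by_popularity = sorted(works, key=lambda i: works[i]["members"], reverse=True)
    for work_id in by_popularity:
        if work_id in assigned:
            continue
        members = []
        stack = [work_id]
        while stack:
            current = stack.pop()
            if current in assigned:
                continue
            assigned.add(current)
            members.append(current)
            for other in works[current]["related"]:
                mutual = other in works and current in works[other]["related"]
                if mutual and other not in assigned:
                    stack.append(other)
        series[work_id] = members
    return series


sample = {
    1: {"members": 500, "related": [2, 3]},
    2: {"members": 900, "related": [1]},
    3: {"members": 100, "related": []},   # only a one-way link from 1, so not merged
    4: {"members": 300, "related": [5]},
    5: {"members": 50, "related": [4]},
}
collapsed = collapse_series(sample)
```

Because the outer loop runs in descending order of members, the key of each series is always its most popular work, matching the naming convention described above.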


In collapsing the data we found the following results. The number of entries dropped from __7053__ to __3915__.
<table>
<tr>
    <td> <img src="images/series_count_30.png" width="300"/> </td>
    <td> <img src="images/series_score_30.png" width="300"/> </td>
</tr>
</table>
When removing all “insignificant” works (entries shorter than 30 minutes, as filtered out previously), we get the following. The number of entries dropped from __4853__ to __3121__.

<table>
<tr>
    <td> <img src="images/series_filter_count_30.png" width="300"/> </td>
    <td> <img src="images/series_filter_score_30.png" width="300"/> </td>
</tr>
</table>


The degree to which shows were collapsed is not too surprising; it suggests that about 1 in 3 works gets a second season, a movie, or a significant set of DVD extras.


## Conclusion

![outliers](images/outliers.png)

## Future Work:
   
Through our analysis, we were able to discover interesting trends that underlie anime and the community that surrounds it. In addition, the classifier we developed was able to distinguish good anime from bad anime. Future work in this field could utilize the vast amounts of user recommendations and communities that exist on MyAnimeList. By leveraging these data sources, we could potentially figure out which works are derivative and which works are “original”. MyAnimeList also stores information about the characters that appear in shows, allowing users to add them to a list of favorites. By leveraging the information stored in these profiles, we could potentially determine traits that make characters “popular”.


One potential avenue for future work is to expand the text analysis beyond the synopsis alone and look at some of the community features found on MAL. Scraping the discussion forums could provide a more diverse corpus of text that would be more specific to each individual genre, as compared to the generic and similarly worded synopses currently found in our dataset.

## Code Appendix 