<a href="https://colab.research.google.com/github/HarryChenn/Animation-Prediction/blob/main/Animation_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Question
Our main objective is to identify the factors that contribute to the difference in the popularity of different animations. Popularity is a perfect indicator to measure an anime's success. There are multiple factors that affect anime's popularity. The production studio, the genre, and previously published reviews, for instance, could all have an effect. 







# Data filtering/Cleaning
As we are not creating a original dataset initially, the existing dataset has already been preprocessed, cleaned, and organized. The dataset, created by Marlesson Santana from Kaggle.com, was categorized into three files including the files of the review, profile, and anime. As shown above, the anime.csv contains lists including uid(user ID), titles, correspond synopsis, genre, aired(first broadcast period), episodes, members, popularity(popularity ranking), score, and finally ranked(a ranking according to score). As a result, there is no method needed to clean the data.


#Import Required Packages
We first import the packages we need such as gensim, pandas, nltk tokenizers, glob....





In [1]:
import gensim 

from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
import nltk
nltk.download('punkt')
import glob
from pathlib import Path
from bs4 import BeautifulSoup 
import re 
from sklearn import feature_extraction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib Inline
import seaborn as sns
import os

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Load anime data
Then we will load our data into the colab

In [2]:
! pip install -q kaggle

from google.colab import files

files.upload()

! mkdir ~/.kaggle

! cp kaggle.json ~/.kaggle/

! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [3]:
! mkdir reviews

!kaggle datasets download -d marlesson/myanimelist-dataset-animes-profiles-reviews -p /reviews --unzip


Downloading myanimelist-dataset-animes-profiles-reviews.zip to /reviews
 94% 204M/217M [00:03<00:00, 79.6MB/s]
100% 217M/217M [00:03<00:00, 63.8MB/s]


In [4]:
df_anime = pd.read_csv('/reviews/animes.csv')

# Brief look of each data file
Now we can take a look at the head of both "animes.csv" and "reviews.csv" in the dataset

In [None]:
df_anime.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...


# Grouping the data into three group
We will separate the animes into three group based on their popularity

In [None]:
# Anime's synopsis from 0 to 5000
pop1 = []
# Anime's synopsis from 5000 to 10000
pop2 = []
# Anime's synopsis from 10000 to 150000
pop3 = []

for i in df_anime['popularity']:
  if(i < 5000):
    pop1.append(df_anime.synopsis[i])
  if(5000 < i < 10000):
    pop2.append(df_anime.synopsis[i])
  if(10000 < i):
    pop3.append(df_anime.synopsis[i])

#Corpus Pre-processing


In [None]:
import logging 
import itertools
import gensim
# configure logging, since topic modeling takes a while and it's good to know what's going on 
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  
def head(stream, n=10):
    return list(itertools.islice(stream, n))

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

In [None]:
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

In [None]:
# list for tokenized documents in loop
pop1texts = []

for i in pop1:
    # clean and tokenize document string
    raw = str(i).lower()
    tokens = tokenize(raw)
    # add tokens to list
    pop1texts.append(tokens)

In [None]:
pop2texts = []

for i in pop2:
  raw = str(i).lower()
  tokens = tokenize(raw)
  pop2texts.append(tokens)

In [None]:
pop3texts = []

for i in pop3:
  raw = str(i).lower()
  tokens = tokenize(raw)
  pop3texts.append(tokens)

In [None]:
print(pop1texts[1])

['izayoi', 'inu', 'taishou', 'inuyasha', 'parents', 'having', 'problems', 'human', 'named', 'setsuna', 'takemaru', 'sou', 'unga', 'magical', 'sword', 'sealed', 'away', 'years', 'sword', 'powers', 'sword', 'mind', 'source', 'ann']


In [None]:
print(pop2texts[1])

['tomato', 'knights', 'salad', 'kindom', 'young', 'clumsy', 'ends', 'saving', 'day', 'admired', 'pretty', 'princess', 'peach', 'desperately', 'falls', 'love', 'episode', 'embark', 'new', 'journey', 'solve', 'new', 'mission', 'easier', 'insect', 'band', 'way', 'time', 'source', 'ann']


In [None]:
print(pop3texts[2])

['nan']


#Run the Method For Popularity Group One


In [None]:
## instead of this, need to pass in an iterator of "texts" rather than just the "texts" list

# id2word_synopsis = gensim.corpora.Dictionary(texts) 
# print(id2word_synopsis)

def iter_docs(synopsis_list):
  for id, synopsis in enumerate(synopsis_list):
    # in the above for loop, the id is just a number created by the 
    # enumerate function. However, if you wanted more legible docuemnt
    # IDs, you would probably use one of the other columns fro the original
    # dataframe, like uid. So, you'd remove the enumerate function and the "id"
    # variable, and write additional code here to map the synopsis at that list
    # item location to the uid at that same location in the dataframe 
    yield id, tokenize(synopsis)




In [None]:
# then, you need to make use of the function that you've just defined

# We found that instead of using the iter_docs, because we alreay iterated through our first popularity group and append them to the list
# So we can simply use the gensim function for the list 'texts'

# outstream = (tokens for _,tokens in iter_docs(pop1))

# this is your same code from above, but you are passing in the outstream
# not the synposis_list
pop1id2word_synopsis = gensim.corpora.Dictionary(pop1texts) 
print(pop1id2word_synopsis)


INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22677 unique tokens: ['aboard', 'actually', 'advanced', 'adventure', 'alongside']...) from 6378 documents (total 256261 corpus positions)


Dictionary(22677 unique tokens: ['aboard', 'actually', 'advanced', 'adventure', 'alongside']...)


In [None]:
pop1id2word_synopsis.token2id

{'aboard': 0,
 'actually': 1,
 'advanced': 2,
 'adventure': 3,
 'alongside': 4,
 'answers': 5,
 'bounty': 6,
 'brave': 7,
 'bukyou': 8,
 'catgirls': 9,
 'changes': 10,
 'come': 11,
 'corner': 12,
 'crew': 13,
 'dangerous': 14,
 'days': 15,
 'dreamed': 16,
 'dreams': 17,
 'duo': 18,
 'encountering': 19,
 'exhilarating': 20,
 'final': 21,
 'follows': 22,
 'frontier': 23,
 'gene': 24,
 'girl': 25,
 'great': 26,
 'hawking': 27,
 'highly': 28,
 'hunters': 29,
 'hunting': 30,
 'iii': 31,
 'instead': 32,
 'irrevocably': 33,
 'james': 34,
 'job': 35,
 'jobs': 36,
 'known': 37,
 'lives': 38,
 'mages': 39,
 'mal': 40,
 'meeting': 41,
 'melfina': 42,
 'mysteries': 43,
 'mysterious': 44,
 'navigating': 45,
 'odd': 46,
 'outlaw': 47,
 'pair': 48,
 'partner': 49,
 'piloting': 50,
 'pirates': 51,
 'planet': 52,
 'protecting': 53,
 'rachel': 54,
 'ragtag': 55,
 'rewrite': 56,
 'sea': 57,
 'search': 58,
 'seihou': 59,
 'sent': 60,
 'sentinel': 61,
 'ship': 62,
 'small': 63,
 'space': 64,
 'spends': 65,

In [None]:
pop1id2word_synopsis.filter_extremes(no_below=2, no_above=1.0)


pop1id2word_synopsis

INFO:gensim.corpora.dictionary:discarding 6748 tokens: [('deja', 1), ('essays', 1), ('kiyakiya', 1), ('nostalgic', 1), ('tatsuhiko', 1), ('vu', 1), ('horishita', 1), ('katsuya', 1), ('mochizuki', 1), ('naoya', 1)]...
INFO:gensim.corpora.dictionary:keeping 15929 tokens which were in no less than 2 and no more than 6378 (=100.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(15929 unique tokens: ['aboard', 'actually', 'advanced', 'adventure', 'alongside']...)


<gensim.corpora.dictionary.Dictionary at 0x7f9b75a012b0>

In [None]:
class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
    def __len__(self):
        return self.clip_docs

In [None]:
pop1synopsis = ''.join(pop1texts[0])

In [None]:
pop1anime_corpus = Corpus(pop1synopsis, pop1id2word_synopsis)

In [None]:
from gensim import corpora, models
pop1dictionary = corpora.Dictionary(pop1texts)
# convert tokenized documents into a document-term matrix
pop1corpus = [pop1dictionary.doc2bow(text) for text in pop1texts]

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22677 unique tokens: ['aboard', 'actually', 'advanced', 'adventure', 'alongside']...) from 6378 documents (total 256261 corpus positions)


In [None]:
pop1ldamodel = gensim.models.ldamodel.LdaModel(pop1corpus, num_topics=20, id2word = pop1dictionary, passes=50)
import pprint
pprint.pprint(pop1ldamodel.top_topics(pop1corpus,topn=5))

INFO:gensim.models.ldamodel:using symmetric alpha at 0.05
INFO:gensim.models.ldamodel:using symmetric eta at 0.05
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 20 topics, 50 passes over the supplied corpus of 6378 documents, updating model once every 2000 documents, evaluating perplexity every 6378 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #2000/6378
INFO:gensim.models.ldamodel:merging changes from 2000 documents into a model of 6378 documents
INFO:gensim.models.ldamodel:topic #14 (0.050): 0.008*"source" + 0.005*"rewrite" + 0.005*"life" + 0.005*"day" + 0.005*"mal" + 0.004*"written" + 0.004*"new" + 0.004*"sword" + 0.004*"nan" + 0.004*"school"
INFO:gensim.models.ldamodel:topic #3 (0.050): 0.010*"source" + 0.006*"school" + 0.005*"world" + 0.005*"mal" + 0.005*"ann" + 0.005*"story" + 0.004*"high" + 0.004*"new" + 0.004*"wr

[([(0.01335983, 'world'),
   (0.010113467, 'earth'),
   (0.006919774, 'mal'),
   (0.0068626315, 'written'),
   (0.00685462, 'rewrite')],
  -0.837568592033062),
 ([(0.008650104, 'school'),
   (0.008198599, 'mal'),
   (0.008125628, 'written'),
   (0.008044874, 'rewrite'),
   (0.0061038746, 'friends')],
  -0.9764628117996189),
 ([(0.011851273, 'school'),
   (0.009814729, 'new'),
   (0.008617474, 'year'),
   (0.008399432, 'mal'),
   (0.008338226, 'rewrite')],
  -0.9935534634174873),
 ([(0.009860758, 'school'),
   (0.009032701, 'high'),
   (0.008971622, 'love'),
   (0.008632404, 'mal'),
   (0.008280042, 'written')],
  -1.0267376260652479),
 ([(0.007333232, 'mal'),
   (0.0071506808, 'rewrite'),
   (0.007064261, 'written'),
   (0.0062934724, 'school'),
   (0.0055562835, 'girls')],
  -1.2436061117352621),
 ([(0.009496267, 'life'),
   (0.009128792, 'mal'),
   (0.008971411, 'written'),
   (0.008897052, 'rewrite'),
   (0.0068501085, 'war')],
  -1.2440412957921851),
 ([(0.01375859, 'source'),
   (

In [None]:
pop1topics = pop1ldamodel.show_topics(20, 20, formatted=False)

for topic in pop1topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

T0: dvd, source, th, included, manga, volume, ray, blu, anime, episode, years, home, woman, day, special, takashi, bundled, ann, specials, school, 
T1: source, game, ann, new, based, girl, day, evil, named, train, human, elves, humans, anidb, secret, known, year, stop, fate, world, 
T2: school, mal, written, rewrite, friends, high, girls, girl, adventures, class, idol, summer, work, band, time, country, life, maybe, hope, little, 
T3: source, world, anime, island, mal, ann, power, named, young, written, rewrite, life, japan, mysterious, story, friends, san, city, new, order, 
T4: lupin, world, source, story, girls, iii, princess, kingdom, called, week, family, yasuzu, depicted, hidden, dance, demi, merchant, knights, rain, dr, 
T5: source, school, love, girl, day, family, father, life, sex, new, time, girls, world, friend, sister, home, mother, mysterious, sexual, student, 
T6: school, high, love, mal, written, rewrite, new, king, music, video, friends, year, source, club, festival, re

#Run Method for Popularity Group Two



In [None]:
pop2id2word_synopsis = gensim.corpora.Dictionary(pop2texts) 
print(pop2id2word_synopsis)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22388 unique tokens: ['according', 'actually', 'asahi', 'beginning', 'beings']...) from 6292 documents (total 220762 corpus positions)


Dictionary(22388 unique tokens: ['according', 'actually', 'asahi', 'beginning', 'beings']...)


In [None]:
pop2id2word_synopsis.token2id

{'according': 0,
 'actually': 1,
 'asahi': 2,
 'beginning': 3,
 'beings': 4,
 'bring': 5,
 'challenges': 6,
 'consequences': 7,
 'crush': 8,
 'day': 9,
 'deals': 10,
 'desperately': 11,
 'easier': 12,
 'easy': 13,
 'enter': 14,
 'face': 15,
 'family': 16,
 'father': 17,
 'follows': 18,
 'forced': 19,
 'friends': 20,
 'girl': 21,
 'heart': 22,
 'identities': 23,
 'jitsu': 24,
 'kuromine': 25,
 'life': 26,
 'mal': 27,
 'man': 28,
 'mouth': 29,
 'nature': 30,
 'new': 31,
 'order': 32,
 'process': 33,
 'promises': 34,
 'protect': 35,
 'quit': 36,
 'read': 37,
 'rewrite': 38,
 'rules': 39,
 'safe': 40,
 'said': 41,
 'school': 42,
 'secret': 43,
 'secrets': 44,
 'shiragami': 45,
 'shut': 46,
 'struggles': 47,
 'stumbles': 48,
 'supernatural': 49,
 'tries': 50,
 'troubles': 51,
 'true': 52,
 'truth': 53,
 'turns': 54,
 'unable': 55,
 'unfortunately': 56,
 'unique': 57,
 'vampire': 58,
 'wa': 59,
 'want': 60,
 'watashi': 61,
 'win': 62,
 'written': 63,
 'youko': 64,
 'admired': 65,
 'ann': 66,

In [None]:
pop2id2word_synopsis.filter_extremes(no_below=2, no_above=1.0)


pop2id2word_synopsis

INFO:gensim.corpora.dictionary:discarding 0 tokens: []...
INFO:gensim.corpora.dictionary:keeping 15323 tokens which were in no less than 2 and no more than 6292 (=100.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(15323 unique tokens: ['according', 'actually', 'asahi', 'beginning', 'beings']...)


<gensim.corpora.dictionary.Dictionary at 0x7f06d1d57f40>

In [None]:
pop2synopsis = ''.join(pop2texts[0])

In [None]:
pop2anime_corpus = Corpus(pop2synopsis, pop2id2word_synopsis)

In [None]:
from gensim import corpora, models
pop2dictionary = corpora.Dictionary(pop2texts)
# convert tokenized documents into a document-term matrix
pop2corpus = [pop2dictionary.doc2bow(text) for text in pop2texts]

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22388 unique tokens: ['according', 'actually', 'asahi', 'beginning', 'beings']...) from 6292 documents (total 220762 corpus positions)


In [None]:
pop2ldamodel = gensim.models.ldamodel.LdaModel(pop2corpus, num_topics=20, id2word = pop2dictionary, passes=50)
import pprint
pprint.pprint(pop2ldamodel.top_topics(pop2corpus,topn=5))

INFO:gensim.models.ldamodel:using symmetric alpha at 0.05
INFO:gensim.models.ldamodel:using symmetric eta at 0.05
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 20 topics, 50 passes over the supplied corpus of 6292 documents, updating model once every 2000 documents, evaluating perplexity every 6292 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #2000/6292
INFO:gensim.models.ldamodel:merging changes from 2000 documents into a model of 6292 documents
INFO:gensim.models.ldamodel:topic #12 (0.050): 0.007*"source" + 0.007*"world" + 0.006*"school" + 0.005*"series" + 0.005*"girl" + 0.005*"city" + 0.005*"year" + 0.004*"story" + 0.004*"mal" + 0.004*"love"
INFO:gensim.models.ldamodel:topic #0 (0.050): 0.008*"source" + 0.008*"world" + 0.006*"video" + 0.006*"nan" + 0.006*"music" + 0.005*"song" + 0.005*"school" + 0.004*"life" + 0.004*

[([(0.011454891, 'girl'),
   (0.0101594385, 'mal'),
   (0.009392357, 'school'),
   (0.00934733, 'rewrite'),
   (0.009263999, 'written')],
  -0.8128324951790022),
 ([(0.008857386, 'school'),
   (0.007292669, 'new'),
   (0.007038658, 'world'),
   (0.006964497, 'rewrite'),
   (0.006839888, 'written')],
  -1.0806898207316202),
 ([(0.05207199, 'video'),
   (0.051524002, 'music'),
   (0.035839256, 'song'),
   (0.016970558, 'official'),
   (0.0127456365, 'source')],
  -1.3493006886779542),
 ([(0.016891697, 'world'),
   (0.007570044, 'city'),
   (0.007565685, 'mal'),
   (0.007542698, 'years'),
   (0.0074225753, 'rewrite')],
  -1.3713753162056859),
 ([(0.021330643, 'school'),
   (0.010096071, 'high'),
   (0.009944903, 'student'),
   (0.008638129, 'girls'),
   (0.0074777873, 'class')],
  -1.731881011621223),
 ([(0.014635832, 'animation'),
   (0.014600627, 'nhk'),
   (0.013637762, 'uta'),
   (0.012348862, 'minna'),
   (0.009746077, 'program')],
  -1.7610405291026692),
 ([(0.031295937, 'episode'),

In [None]:
pop2topics = pop2ldamodel.show_topics(20, 20, formatted=False)

for topic in pop2topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

T0: nan, source, world, life, special, youkai, family, naruto, franchise, fictional, mal, rewrite, zenon, power, ann, written, dark, irodorimidori, people, young, 
T1: short, film, source, animation, movie, episode, ann, story, chibi, recap, taku, precure, furukawa, keiichi, new, anidb, scenes, battle, shougo, ling, 
T2: tank, kun, commercial, source, girl, dragon, movie, megumi, end, island, tooru, lord, new, mbt, gun, ann, live, kobayashi, school, way, 
T3: source, world, demons, new, life, student, kid, home, anidb, series, girl, country, way, basara, named, friends, year, time, decides, strange, 
T4: source, earth, father, ann, war, years, son, old, planet, city, day, year, forces, young, death, time, pig, school, power, pilot, 
T5: source, new, love, girl, story, anime, ann, life, city, school, company, time, girls, academy, year, comedy, mariya, like, based, shinesman, 
T6: girl, mal, school, rewrite, written, life, time, day, world, source, new, family, named, friends, young, mo

#Run Method for Popularity Group Three

In [None]:
pop3id2word_synopsis = gensim.corpora.Dictionary(pop3texts) 
print(pop3id2word_synopsis)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23950 unique tokens: ['nan', 'animated', 'band', 'featured', 'itai']...) from 6639 documents (total 191593 corpus positions)


Dictionary(23950 unique tokens: ['nan', 'animated', 'band', 'featured', 'itai']...)


In [None]:
pop3id2word_synopsis.token2id

{'nan': 0,
 'animated': 1,
 'band': 2,
 'featured': 3,
 'itai': 4,
 'japanese': 5,
 'kara': 6,
 'kimi': 7,
 'minna': 8,
 'music': 9,
 'nhk': 10,
 'ni': 11,
 'program': 12,
 'shishamo': 13,
 'song': 14,
 'tonari': 15,
 'uekusa': 16,
 'uta': 17,
 'video': 18,
 'wataru': 19,
 'akane': 20,
 'arrival': 21,
 'blasters': 22,
 'boy': 23,
 'break': 24,
 'chizuru': 25,
 'embarrassed': 26,
 'enjoying': 27,
 'ensues': 28,
 'flirts': 29,
 'fox': 30,
 'friends': 31,
 'gets': 32,
 'girl': 33,
 'good': 34,
 'isn': 35,
 'kota': 36,
 'lazily': 37,
 'lingers': 38,
 'media': 39,
 'meets': 40,
 'nozomu': 41,
 'pines': 42,
 'prey': 43,
 'quo': 44,
 'rebuffs': 45,
 'rest': 46,
 'school': 47,
 'sexy': 48,
 'snared': 49,
 'source': 50,
 'status': 51,
 'summer': 52,
 'tale': 53,
 'tayura': 54,
 'time': 55,
 'timeless': 56,
 'warm': 57,
 'way': 58,
 'weather': 59,
 'agricultural': 60,
 'brand': 61,
 'club': 62,
 'collective': 63,
 'commercials': 64,
 'connects': 65,
 'cooperative': 66,
 'dairy': 67,
 'farmers': 

In [None]:
pop3id2word_synopsis.filter_extremes(no_below=2, no_above=1.0)


pop3id2word_synopsis

INFO:gensim.corpora.dictionary:discarding 11103 tokens: [('rubella', 1), ('vaccines', 1), ('angelique', 1), ('catargena', 1), ('limoges', 1), ('rosalia', 1), ('battleships', 1), ('dita', 1), ('karitori', 1), ('mejeiru', 1)]...
INFO:gensim.corpora.dictionary:keeping 12847 tokens which were in no less than 2 and no more than 6639 (=100.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(12847 unique tokens: ['nan', 'animated', 'band', 'featured', 'itai']...)


<gensim.corpora.dictionary.Dictionary at 0x7f06d3fca970>

In [None]:
pop3synopsis = ''.join(pop3texts[0])

In [None]:
pop3anime_corpus = Corpus(pop3synopsis, pop3id2word_synopsis)

In [None]:
from gensim import corpora, models
pop3dictionary = corpora.Dictionary(pop3texts)
# convert tokenized documents into a document-term matrix
pop3corpus = [pop3dictionary.doc2bow(text) for text in pop3texts]

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23950 unique tokens: ['nan', 'animated', 'band', 'featured', 'itai']...) from 6639 documents (total 191593 corpus positions)


In [None]:
pop3ldamodel = gensim.models.ldamodel.LdaModel(pop3corpus, num_topics=20, id2word = pop3dictionary, passes=50)
import pprint
pprint.pprint(pop3ldamodel.top_topics(pop3corpus,topn=5))

INFO:gensim.models.ldamodel:using symmetric alpha at 0.05
INFO:gensim.models.ldamodel:using symmetric eta at 0.05
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 20 topics, 50 passes over the supplied corpus of 6639 documents, updating model once every 2000 documents, evaluating perplexity every 6639 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #2000/6639
INFO:gensim.models.ldamodel:merging changes from 2000 documents into a model of 6639 documents
INFO:gensim.models.ldamodel:topic #15 (0.050): 0.011*"school" + 0.011*"source" + 0.007*"series" + 0.005*"story" + 0.005*"family" + 0.004*"world" + 0.004*"old" + 0.004*"tv" + 0.004*"girls" + 0.004*"japanese"
INFO:gensim.models.ldamodel:topic #13 (0.050): 0.014*"source" + 0.007*"life" + 0.006*"music" + 0.006*"ann" + 0.005*"video" + 0.004*"friends" + 0.004*"song" + 0.004*"home" + 

[([(0.04198977, 'video'),
   (0.03958262, 'music'),
   (0.03146528, 'song'),
   (0.015348386, 'program'),
   (0.013801916, 'uta')],
  -1.0220079423685617),
 ([(0.033473007, 'school'),
   (0.013347808, 'high'),
   (0.012183223, 'girls'),
   (0.011938854, 'mal'),
   (0.011786187, 'student')],
  -1.5858522715614363),
 ([(0.036144044, 'episode'),
   (0.019782206, 'dvd'),
   (0.018783465, 'included'),
   (0.01496509, 'episodes'),
   (0.013597968, 'series')],
  -1.6398167208274874),
 ([(0.00818772, 'follows'),
   (0.0071121072, 'book'),
   (0.0068011642, 'named'),
   (0.0063131084, 'family'),
   (0.005948719, 'source')],
  -1.7465369195357272),
 ([(0.018331954, 'source'),
   (0.010782521, 'story'),
   (0.007442445, 'girl'),
   (0.0071802293, 'ann'),
   (0.006448769, 'young')],
  -1.8809623634695865),
 ([(0.012421963, 'source'),
   (0.009390911, 'life'),
   (0.007968681, 'girl'),
   (0.007384901, 'day'),
   (0.006365161, 'friends')],
  -1.9554392511422232),
 ([(0.020210464, 'school'),
   (0.0

In [None]:
pop3topics = pop3ldamodel.show_topics(20, 20, formatted=False)

for topic in pop3topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

T0: world, source, earth, human, mysterious, time, power, called, life, mal, battle, years, people, war, humans, evil, rewrite, young, girl, ann, 
T1: nan, source, young, kingdom, com, wikipedia, film, ghost, country, discrimination, patarillo, tries, process, south, shin, number, elves, bishounen, goes, tour, 
T2: source, life, girl, day, friends, island, ann, cats, father, dog, people, kenichi, boy, old, town, meets, lives, live, sword, lost, 
T3: source, story, girl, ann, young, mother, old, boy, father, time, living, world, help, day, new, years, place, home, life, family, 
T4: episode, dvd, included, episodes, series, special, season, tv, released, ray, blu, release, th, second, specials, volume, bundled, aired, youtube, short, 
T5: source, new, hell, girl, series, ai, soul, war, battle, time, special, adventures, added, information, ii, planet, episodes, come, world, future, 
T6: source, robot, earth, ann, machine, series, planet, ninja, men, world, war, years, super, japanese, a

#Result Interpretation

*   As for the previous lists of topic word choices shown, words like "young", "trying", "gain", and "game"are shown to be related to Shonen manga, which is a Japanese manga category with an target audience of adolescent boys. 
*   Next, to determine whether there is a correlation between the choice of topic and the popularity of an anime, we will then use the same method to extract the keyword topics for the remaining two popularity groups and compare their results.
*   Furthermore, we will continue explore the factors that may have an imnpact on the success of an animation. Following the collection of the trends for the animations, we will web-scraping additional variables. By combining metadata, we hope to investigate whether popular anime production studio are generally well-liked for all of their animations, whether anime with a school setting receives more airtime, or whether action movies in general might get better reviews, by combining metadata.


# Web Scraping

We first will import the required packages for web scraping

In [6]:
from bs4 import BeautifulSoup
import requests 
from collections import Counter

Then we will focus on the gathering the list of studios that made top 200 anime because those are so-called "best" anime

In [5]:
#Create a list to store the link to anime
url = []

#We will select the anime also based on their popularity
for i in df_anime['popularity']:
  #Top 200 animes
  if(i < 200):
    #Add their link into the list 'url'
    url.append(df_anime.link[i])

Now we create a function to do web scraping. We find the corresponding website prefix for each producer and return the last producers because the last producer is the production studio

In [7]:
#We will create a method named 'scrape' that has the anime link as parameter
def scrape(url):
  #We then use buautiful soup to web srape the corresponding link
  response = requests.get(url)
  html_str = response.text
  document = BeautifulSoup(html_str, "html.parser")
  #Then we use the select method in beautiful soup to find the producers using prefix "/anime/producer"
  tags = []
  result = []
  tags = document.select('a[href^="/anime/producer"]')
  #We need to consider the edge case when there is no specified producer for an anime
  if len(tags) == 0:
    return
  for ele in tags:
    #Then we will add all the producers we got to the list 'result'
      result.append(ele.string)
  #Because the production studio is always listed at the end of all producers, we will return the last element
  return result[len(result)-1]

In [8]:
#Now we will call the method we created to run through the 'url' list that we created
tags = []
for i in url:
  tags.append(scrape(i))
tags

['Sunrise',
 'Sunrise',
 'Shin-Ei Animation',
 'Shaft',
 'Production I.G',
 'Zexcs',
 'TMS Entertainment',
 'Kyoto Animation',
 'ufotable',
 'Wit Studio',
 'Bones',
 'Sparkly Key Animation Studio',
 'Xebec',
 'Production I.G',
 'Shaft',
 'TMS Entertainment',
 'Gonzo',
 'ufotable',
 'Science SARU',
 'Sunrise',
 'Studio Ghibli',
 'Sunrise',
 'Toei Animation',
 'Studio Ghibli',
 'Production I.G',
 'Marvy Jack',
 'Trigger',
 'Studio Deen',
 'Studio Deen',
 'AIC ASTA',
 'Production I.G',
 'Toei Animation',
 'Gainax',
 'Xebec',
 'Toei Animation',
 'GoHands',
 'Tatsunoko Production',
 'Madhouse',
 'Carp Studio',
 'Phoenix Entertainment',
 'TMS Entertainment',
 'A-1 Pictures',
 'Studio Ghibli',
 'Shaft',
 'TMS Entertainment',
 'P.A. Works',
 'Sunrise',
 'A-1 Pictures',
 'Production I.G',
 'Production I.G',
 'Studio Deen',
 'Ruo Hong Culture',
 'Shaft',
 'Studio Ghibli',
 'J.C.Staff',
 'Bridge',
 'Group TAC',
 'Toei Animation',
 'Lerche',
 'J.C.Staff',
 'Madhouse',
 'CoMix Wave Films',
 'Produc

Now we count the number of times each production studio appears in the list we made

In [None]:
#The following code will count the appearences of each production studio in our list of anime
counts = dict(Counter(tags))
duplicates = {key:value for key, value in counts.items() if value > 1}
print(duplicates)

Then, we end up getting a list of top 200 animes to know what production studio  makes good animation.

Sunrise: 31

Production I.G: 26

Shaft: 18

A-1 Pictures: 16

Kyoto Animation: 15

TMS Entertainment: 14

Bones: 12

Studio Deen: 11