# DA210 Project 2 (data acquisition)
## Tran Ong, Tieu Ngan, Anh Le
1. Data sources:
Attached links:

  https://en.wikipedia.org/api/rest_v1/page/html/List_of_anime_franchises_by_episode_count

  https://en.wikipedia.org/wiki/List_of_highest-grossing_Japanese_films#Highest-grossing_Japanese_films_worldwide

2. The data from both sources is already in HTML format.

3. How we accessed and parsed the data is explained below.

4. Central questions:
Which factors predict a high-grossing anime (box office earnings only) among rankings, number of views, ratings, franchise revenue, and anime format? And are there relationships between any two of release time, ending time, runtime, episode count, OVA count, and total episode count?


In [None]:
import datetime
import requests

In [None]:
import base64

In [None]:
from lxml import etree
import io

In [None]:
import numpy as np

As a first step (after importing all the libraries we need), we store the URLs in variables and scrape the data with 'requests'.

In [None]:
animeUrl = 'https://en.wikipedia.org/api/rest_v1/page/html/List_of_anime_franchises_by_episode_count'
res = requests.get(animeUrl)

wiki_html = res.text
# print (wiki_html)
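
The request above assumes the download always succeeds. A minimal sketch of a more defensive fetch is shown here, assuming a hypothetical helper named `fetch_html` and a made-up User-Agent string (Wikimedia's guidelines ask clients to identify themselves). The session is passed in so that the helper can be tried without network access.

```python
def fetch_html(url, session, timeout=30):
    """Return the body of `url` as text, raising if the request failed.

    `session` is any object with a requests-style `get` method,
    for example `requests.Session()`.
    """
    # Hypothetical identification string; replace with your own contact details.
    headers = {"User-Agent": "DA210-project2/0.1 (course project)"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # turn HTTP 4xx/5xx responses into an exception
    return response.text


# Usage with the real library (needs network access):
#   import requests
#   wiki_html = fetch_html(animeUrl, requests.Session())
```

With this shape, a failed download stops the notebook at the fetch step instead of surfacing later as an empty table.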

Next, we wrap the HTML text in 'io.StringIO', parse it, and get the root of the resulting tree.

In [None]:
htmlparser = etree.HTMLParser()
try:
    # This one should work
    tree = etree.parse(io.StringIO(wiki_html), parser=htmlparser)
    # root of the whole wiki page in HTML
    root = tree.getroot()
except Exception as exc:
    # Report the reason and stop: later cells need `root` to exist
    print("Failed to parse as HTML:", exc)
    raise

The 'performXML' function converts the parsed tree into plain Python lists (rows, column names, and an index) that we can then load into pandas.

In [None]:
def performXML(node,pos):
  """Extract the rows and column names of the table at node[pos]."""
  table = []
  index = []
  columns = []
  for x in node[pos].getchildren():
    for y in x.getchildren():
      # y: each row in the table
      row = []
      for c in y.getchildren():
        # c: each cell of one row in the table
        if c.tag == 'td':
          if c.text is not None:
            row.append(c.text.strip('\n'))
          else:
            # No direct text: fall back to the link inside <i>, else 0
            a = c.find('i')
            if a is not None:
              row.append(a.find('a').text)
            else:
              row.append(0)
        else:
          # Header cell: plain text is stored as a column name,
          # an italic link is treated as a value of the row
          if c.text is not None:
            columns.append(c.text)
          else:
            a = c.find('i')
            if a is not None:
              row.append(a.find('a').text)
      if row != []: table.append(row)
  return table, columns, index
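
'performXML' reads `c.text` and otherwise looks for an `<i><a>` child, so a cell whose text sits inside any other tag can be recorded as 0. A sketch of a more general way to read a cell is to join all the text inside it with `itertext()`. To stay self-contained, the example below uses the standard library's `xml.etree.ElementTree` on a small hand-written snippet (the snippet and its `href` values are illustrative, not copied from the page). lxml elements offer the same `itertext()` method, and the helper names `cell_text` and `table_rows` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for a wikitable (illustrative only).
SNIPPET = """
<table>
  <tbody>
    <tr><th>No.</th><th>Title</th><th>Finished</th></tr>
    <tr><td>1</td><td><i><a href="/wiki/Doraemon">Doraemon</a></i></td><td></td></tr>
    <tr><td>5</td><td><i><a href="/wiki/Oyako_Club">Oyako Club</a></i></td><td>30 Mar 2013</td></tr>
  </tbody>
</table>
"""


def cell_text(cell):
    """Join every piece of text inside a cell, however deeply it is nested."""
    return "".join(cell.itertext()).strip()


def table_rows(table):
    """Return (header, rows) for a table element."""
    header, rows = [], []
    for tr in table.iter("tr"):
        cells = list(tr)
        values = [cell_text(c) for c in cells]
        if cells and all(c.tag == "th" for c in cells):
            header = values  # a row made only of <th> cells is the header
        elif values:
            rows.append(values)
    return header, rows


header, rows = table_rows(ET.fromstring(SNIPPET))
print(header)
print(rows)
```

Empty cells come back as empty strings here rather than 0, which keeps every value in a column the same type.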

In [None]:
node = root.xpath("//table[contains(@class,'wikitable')]")
# use XPath to extract exactly the node we will use later
table, columns, index = performXML(node,0)
# 0 is the index of the first and only table of the given path. 

In [None]:
# columns.pop(0)
print (columns)
print (table)

['No.', 'Title', 'Started Broadcasting', 'Finished Broadcasting', 'Runtime (TV)', 'Episode(s)(TV)', 'Other(s)', 'Total count']
[['1', 'Doraemon', '1 Apr 1973', 0, '22-24 minutes', '3,003', '67', '3,070'], ['2', 'Sazae-san', '5 Oct 1969', 0, '22-24 minutes', '2,640', 0, '2,640+'], ['3', 'Nintama Rantarō', '10 Apr 1993', 0, '10 minutes', '2,199', '3', '2,202'], ['4', 'Ojarumaru', '5 Oct 1998', 0, '10 minutes', 0, '4', '1,831'], ['5', 'Oyako Club', '3 Oct 1994', '30 Mar 2013', '5 minutes', '1,818', 0, '1,818'], ['6', 'Soreike! Anpanman', '3 Oct 1988', 0, '24 minutes', 0, '113', '1,700'], ['7', 'Kirin Monoshiri Yakata', '1 Jan 1975', '31 Dec 1979', '5 minutes', '1,565', 0, '1,565'], ['8', 'Chibi Maruko-chan', '7 Jan 1990', 0, '24 minutes', '1,493', '64', '1,557'], ['9', 'Kirin Ashita no Calendar', '1 Jan 1980', '30 Dec 1984', '5 minutes', '1,498', 0, '1,498'], ['10', 'Manga Nippon Mukashi Banashi', '7 Jan 1975', '2 Jan 1995', '24 minutes', '1,494', 0, '1,494'], ['11', 'Hoka Hoka Kazoku', '

In [None]:
for row in table:
  pop = row.pop(0)
  index.append(pop)
  if row[2] == 0: row[2] = 'Ongoing'
print (table)

[['Doraemon', '1 Apr 1973', 'Ongoing', '22-24 minutes', '3,003', '67', '3,070'], ['Sazae-san', '5 Oct 1969', 'Ongoing', '22-24 minutes', '2,640', 0, '2,640+'], ['Nintama Rantarō', '10 Apr 1993', 'Ongoing', '10 minutes', '2,199', '3', '2,202'], ['Ojarumaru', '5 Oct 1998', 'Ongoing', '10 minutes', 0, '4', '1,831'], ['Oyako Club', '3 Oct 1994', '30 Mar 2013', '5 minutes', '1,818', 0, '1,818'], ['Soreike! Anpanman', '3 Oct 1988', 'Ongoing', '24 minutes', 0, '113', '1,700'], ['Kirin Monoshiri Yakata', '1 Jan 1975', '31 Dec 1979', '5 minutes', '1,565', 0, '1,565'], ['Chibi Maruko-chan', '7 Jan 1990', 'Ongoing', '24 minutes', '1,493', '64', '1,557'], ['Kirin Ashita no Calendar', '1 Jan 1980', '30 Dec 1984', '5 minutes', '1,498', 0, '1,498'], ['Manga Nippon Mukashi Banashi', '7 Jan 1975', '2 Jan 1995', '24 minutes', '1,494', 0, '1,494'], ['Hoka Hoka Kazoku', '1 Oct 1976', '31 Mar 1982', '5 minutes', '1,428', 0, '1,428'], ['Shimajirō', '13 Dec 1993', 'Ongoing', '24 minutes', '1,403', '8', '1,41

In [None]:
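# Row layout at this point: [title, start, end, runtime, episodes, others, total]
# For a runtime range we keep the lower bound
runtime_convert = {'22-24 minutes': 22, '24 minutes': 24, '5 minutes': 5, '10 minutes': 10, '12 minutes': 12, '12-24 minutes': 12}
for x in table:
  x[5] = int(x[5])
  # Episode counts such as '3,003' are 5 characters long: drop the comma
  if type(x[4]) is str and len(x[4]) == 5: x[4] = int(x[4][0:1] + x[4][2:])
  else: 
    if x[4] == '62+148': x[4] = 62 + 148
    else: x[4] = int(x[4])
  # Episode counts set by hand for three series
  if x[0] == 'Ojarumaru': x[4] = 1827
  if x[0] == 'Soreike! Anpanman': x[4] = 1587
  if x[0] == 'Pokémon': x[4] = 1220
  # Keep only the year of each date; ongoing series get 2022
  x[1] = int(x[1][-4:])
  if x[2] != 'Ongoing': x[2] = int(x[2][-4:])
  else: x[2] = 2022
  if type(x[3]) is str: x[3] = int(runtime_convert[x[3]])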
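
The cell above converts counts by string length and treats '62+148' as a special case. A sketch of a more general conversion is given below, assuming two hypothetical helpers, `parse_count` and `parse_year`, that are not part of the notebook. It uses the same conventions as the cell: parts joined by '+' are summed, and an ongoing series is assigned 2022.

```python
import re


def parse_count(value):
    """Convert strings such as '3,003', '2,640+' or '62+148' to an int.

    Non-string values (for example the 0 placeholder) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # Split on '+', drop thousands separators, and add up the non-empty parts.
    parts = [part.replace(",", "").strip() for part in value.split("+")]
    return sum(int(part) for part in parts if part)


def parse_year(value, ongoing_year=2022):
    """Return the trailing four-digit year of a date such as '1 Apr 1973'."""
    if value == "Ongoing":
        return ongoing_year
    match = re.search(r"(\d{4})\s*$", value)
    if match is None:
        raise ValueError(f"no year found in {value!r}")
    return int(match.group(1))


print(parse_count("3,003"), parse_count("2,640+"), parse_count("62+148"))
print(parse_year("1 Apr 1973"), parse_year("Ongoing"))
```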



From the URL above, we create a DataFrame for *the list of anime series by franchise total episode count*. We only take a sample of 50 series. This data will also be stored as an SQLite table called top_series.


In [None]:
import pandas as pd

# 50 Anime series by franchise series total episode count
df = pd.DataFrame(table)
df.columns = ['Series name', 'Year release', 'End Broadcasting Year', 'Runtime (min on TV)', 'Episode count', 'Others', 'Total count']

In [None]:
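# Assign the result back: np.NaN was removed in NumPy 2.0, and in-place
# replace on a selected column is deprecated in recent pandas versions
df['Runtime (min on TV)'] = df['Runtime (min on TV)'].replace(0, np.nan)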

In [None]:
df

Unnamed: 0,Series name,Year release,End Broadcasting Year,Runtime (min on TV),Episode count,Others,Total count
0,Doraemon,1973,2022,22.0,3003,67,3070
1,Sazae-san,1969,2022,22.0,2640,0,"2,640+"
2,Nintama Rantarō,1993,2022,10.0,2199,3,2202
3,Ojarumaru,1998,2022,10.0,1827,4,1831
4,Oyako Club,1994,2013,5.0,1818,0,1818
...,...,...,...,...,...,...,...
83,0,2002,2006,,0,0,209
84,Katekyō Hitman Reborn!,2006,2010,,0,0,204
85,The Kindaichi Case Files,1997,2016,,0,0,200
86,Holly the Ghost,1991,1993,,0,0,200


In [None]:
def convert2int(s):
  # print (type(s))
  if type(s) is str:
    s = s.replace(',', '')
    s = s.replace('+', '')
    return int(s)
  return s
total = df['Total count']
total_count = [convert2int(x) for x in total]
table_v1 = {'Series name': list(df['Series name']),'Year release': list(df['Year release']), 'End Broadcasting Year': list(df['End Broadcasting Year']), 'Runtime (min on TV)': list(df['Runtime (min on TV)']), 'Others': list(df['Others']), 'Total count': total_count}
df_v1 = pd.DataFrame(table_v1)
df_v1

Unnamed: 0,Series name,Year release,End Broadcasting Year,Runtime (min on TV),Others,Total count
0,Doraemon,1973,2022,22.0,67,3070
1,Sazae-san,1969,2022,22.0,0,2640
2,Nintama Rantarō,1993,2022,10.0,3,2202
3,Ojarumaru,1998,2022,10.0,4,1831
4,Oyako Club,1994,2013,5.0,0,1818
...,...,...,...,...,...,...
83,0,2002,2006,,0,209
84,Katekyō Hitman Reborn!,2006,2010,,0,204
85,The Kindaichi Case Files,1997,2016,,0,200
86,Holly the Ghost,1991,1993,,0,200
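
The top_series SQLite table mentioned earlier could be created along these lines. This is a minimal sketch using the standard `sqlite3` module, an in-memory database, and four rows copied from the output above. The SQL column names are assumptions chosen for the example.

```python
import sqlite3

# A few rows taken from the df_v1 output above; None marks a missing runtime.
rows = [
    ("Doraemon", 1973, 2022, 22.0, 67, 3070),
    ("Sazae-san", 1969, 2022, 22.0, 0, 2640),
    ("Nintama Rantarō", 1993, 2022, 10.0, 3, 2202),
    ("Holly the Ghost", 1991, 1993, None, 0, 200),
]

conn = sqlite3.connect(":memory:")  # use a file path to keep the table on disk
conn.execute(
    """
    CREATE TABLE top_series (
        series_name  TEXT PRIMARY KEY,
        year_release INTEGER,
        end_year     INTEGER,
        runtime_min  REAL,
        others       INTEGER,
        total_count  INTEGER
    )
    """
)
conn.executemany("INSERT INTO top_series VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Example query: franchises with at least 2,000 episodes in total.
long_runners = conn.execute(
    "SELECT series_name, total_count FROM top_series "
    "WHERE total_count >= 2000 ORDER BY total_count DESC"
).fetchall()
print(long_runners)
```

With pandas available, `df_v1.to_sql("top_series", conn, index=False)` would write the whole DataFrame in one call, although the column names would then keep their spaces and parentheses.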


For the DataFrame df (50 anime series by franchise total episode count), we can look for a relationship between any two of the following columns: Year release, End Broadcasting Year, Runtime (min on TV), Episode count, Others, Total count.
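
One simple way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it by hand for the release year and total count of the first five rows shown above. The `pearson` helper is hypothetical, and five rows are far too few to support a conclusion, so this only illustrates the calculation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# First five rows of the table above: (Year release, Total count)
year_release = [1973, 1969, 1993, 1998, 1994]
total_count = [3070, 2640, 2202, 1831, 1818]

r = pearson(year_release, total_count)
print(round(r, 3))
```

On the full DataFrame, `df_v1['Year release'].corr(df_v1['Total count'])` should give the same kind of figure without the manual helper.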

In [None]:
# Grossing of Japanese Films
path = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_Japanese_films#Highest-grossing_Japanese_films_worldwide'

In [None]:
res = requests.get(path)

wiki_html = res.text
htmlparser = etree.HTMLParser()
try:
    # This one should work
    tree = etree.parse(io.StringIO(wiki_html), parser=htmlparser)
    # root of the whole wiki page in HTML
    root0 = tree.getroot()
except Exception as exc:
    # Report the reason and stop: the next line needs `root0` to exist
    print("Failed to parse as HTML:", exc)
    raise
node0 = root0.xpath("//table[contains(@class,'plainrowheaders')]")
# node0 = root0.xpath("//table[position()=1]")

Based on the data from this URL, we create three DataFrames for three lists of top Japanese films:
* Top 50 highest-grossing Japanese films worldwide - df0
* Top 50 highest-grossing Japanese films in Japan - df1
* Top 50 films by box office ticket sales of all time - df2

These lists help us answer the main question: What are the predictors for a high-grossing anime (only box office earnings), among rankings, number of views, ratings, franchise revenues, and anime formats?

In [None]:
#Top 50 highest gross Japanese films worldwide
table0, columns0, index0 = performXML(node0,0)
table0[0][3] = 'Anime'
table0[3][0] = "Howl's Moving Castle"
for x in table0:
  a = x[1][1:]
  a = a.replace(',', '')
  a = a.replace('$', '')
  x[1] = int(a)
print (table0)

[['Demon Slayer the Movie: Mugen Train', 506523013, '2020', 'Anime', 0], ['Spirited Away', 395580000, '2001', 'Anime', 0], ['Your Name', 380140500, '2016', 'Anime', 0], ["Howl's Moving Castle", 236323601, '2004', 'Anime', 0], ['One Piece Film: Red', 206740295, '2022', 'Anime', 0], ['Ponyo', 204826668, '2008', 'Anime', 0], ['Jujutsu Kaisen 0', 195870885, '2021', 'Anime', 0], ['Weathering with You', 193715360, '2019', 'Anime', 0], ['Stand by Me Doraemon', 183442714, '2014', 'Anime', 0], ['Pokémon: The First Movie', 172744662, '1998', 'Anime', 0], ['Princess Mononoke', 170005875, '1997', 'Anime', 0], ['Bayside Shakedown 2', 164450000, '2003', 0, 0], ['The Secret World of Arrietty', 149411550, '2010', 'Anime', 0], ['The Wind Rises', 136533257, '2013', 'Anime', 0], ['Pokémon: The Movie 2000', 133949270, '1999', 'Anime', 0], ['Dragon Ball Super: Broly', 122747755, '2018', 'Anime', 0], ['Detective Conan: The Fist of Blue Sapphire', 115570314, '2019', 'Anime', 0], ['Bayside Shakedown: The Movi

In [None]:
df0 = pd.DataFrame(table0)
df0.columns = ['Movie name', 'Revenue (USD)', 'Year release', 'Genre', 'a']
df0

Unnamed: 0,Movie name,Revenue (USD),Year release,Genre,a
0,Demon Slayer the Movie: Mugen Train,506523013,2020,Anime,0
1,Spirited Away,395580000,2001,Anime,0
2,Your Name,380140500,2016,Anime,0
3,Howl's Moving Castle,236323601,2004,Anime,0
4,One Piece Film: Red,206740295,2022,Anime,0
5,Ponyo,204826668,2008,Anime,0
6,Jujutsu Kaisen 0,195870885,2021,Anime,0
7,Weathering with You,193715360,2019,Anime,0
8,Stand by Me Doraemon,183442714,2014,Anime,0
9,Pokémon: The First Movie,172744662,1998,Anime,0


For the DataFrame df0 (top 50 highest-grossing Japanese films worldwide), we have three predictors: the film's release year, its revenue, and its genre. These data will be compared with the same three fields for films within Japan.

In [None]:
# Top 50 highest gross japanese films in Japan
table1, columns1, index1 = performXML(node0,1)
table1[0][1] = '¥40,430,000,000'
table1[0][3] = 'Anime'
table1[6][3] = 'Live-action'
for x in table1:
  a = x[1][1:]
  x[1] = int(a.replace(',', ''))
  
df1 = pd.DataFrame(table1)
print (len(table1))
# df1.rename(columns={df.columns[0]:'Movie name',df.columns[1]:'Revenue',df.columns[2]:'Year release',df.columns[3]:'Genre'},inplace=True)
df1.columns = ['Movie name', 'Revenue (Yen)', 'Year release', 'Genre', 'Other']
df1

26


Unnamed: 0,Movie name,Revenue (Yen),Year release,Genre,Other
0,Demon Slayer the Movie: Mugen Train,40430000000,2020,Anime,0
1,Spirited Away,31680000000,2001,Anime,0
2,Your Name,25030000000,2016,Anime,0
3,Princess Mononoke,20180000000,1997,Anime,0
4,Howl's Moving Castle,19600000000,2004,Anime,0
5,One Piece Film: Red,18670000000,2022,Anime,0
6,Bayside Shakedown 2,17350000000,2003,Live-action,0
7,Ponyo,15500000000,2008,Anime,0
8,Weathering with You,14190000000,2019,Anime,0
9,Jujutsu Kaisen 0,13750000000,2021,Anime,0


For the DataFrame df1 (top 50 highest-grossing Japanese films in Japan), the predictors stay the same: the film's release year, its revenue, and its genre. From these two DataFrames, we can analyze what kind of Japanese film is popular both in Japan and worldwide, and from there infer which factors contributed to that success.
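
A first look at genre as a predictor could group revenue by genre. The sketch below does this with the standard library for five rows copied from the df1 output above. The `summary` structure is an assumption made for the example, not part of the notebook.

```python
from collections import defaultdict
from statistics import mean

# (Movie name, Revenue in yen, Year release, Genre), copied from df1 above
films = [
    ("Demon Slayer the Movie: Mugen Train", 40430000000, 2020, "Anime"),
    ("Spirited Away", 31680000000, 2001, "Anime"),
    ("Your Name", 25030000000, 2016, "Anime"),
    ("Princess Mononoke", 20180000000, 1997, "Anime"),
    ("Bayside Shakedown 2", 17350000000, 2003, "Live-action"),
]

# Collect the revenues of each genre, then summarise them.
revenue_by_genre = defaultdict(list)
for _title, revenue, _year, genre in films:
    revenue_by_genre[genre].append(revenue)

summary = {
    genre: {"films": len(values), "mean_revenue_yen": mean(values)}
    for genre, values in revenue_by_genre.items()
}
print(summary)
```

On the full DataFrame the same grouping can be written as `df1.groupby('Genre')['Revenue (Yen)'].mean()`.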

In [None]:
table2, columns2, index2 = performXML(node0,4)

In [None]:
print(table2)

[['Kimi yo Fundo no Kawa o Watare', '1976', '720,000', '433,700,000', '434,400,000', 'Live-action'], ['1978', '2,400,000', '100,000,000', '102,400,000', 'Live-action'], ['Sandakan No. 8', '1974', 'Un\xadknown', '100,000,000', '100,000,000', 'Live-action'], ['Legend of Dinosaurs & Monster Birds', '1977', '650,000', '48,700,000', '49,400,000', 'Live-action'], ['Spirited Away', '2001', '24,280,000', '22,095,221', '46,375,221', 'Anime'], ['Your Name', '2016', '19,300,000', '25,878,383', '45,178,383', 'Anime'], ['Demon Slayer the Movie: Mugen Train', '2020', '29,151,041', '15,354,900', '44,505,941', 'Anime'], ['Who Are You, Mr. Sorge?', '1961', 'Un\xadknown', '39,200,000', '39,200,000', 'Live-action'], ['Pokémon: The First Movie', '1998', '6,650,000', '30,187,706', '36,837,706', 'Anime'], ['Koi no Kisetsu', '1969', 'Un\xadknown', '27,600,000', '27,600,000', 'Live-action'], ['The Bullet Train', '1975', 'Un\xadknown', '25,440,638', '25,440,638', 'Live-action'], ['Tokyo Olympiad', '1965', '23,

In [None]:
table2[1] = ['Kitakitsune Monogatari'] + table2[1]
for x in table2:
  a,b,c = x[2],x[3],x[4]
  if a != 'Un\xadknown': a = int(a.replace(',',''))
  if b!= 'Un\xadknown' and b != '—': b = int(b.replace(',', ''))
  if c != 'Un\xadknown': c = int(c.replace(',', ''))
  x[2],x[3],x[4] = a,b,c
print (table2)

[['Kimi yo Fundo no Kawa o Watare', '1976', 720000, 433700000, 434400000, 'Live-action'], ['Kitakitsune Monogatari', '1978', 2400000, 100000000, 102400000, 'Live-action'], ['Sandakan No. 8', '1974', 'Un\xadknown', 100000000, 100000000, 'Live-action'], ['Legend of Dinosaurs & Monster Birds', '1977', 650000, 48700000, 49400000, 'Live-action'], ['Spirited Away', '2001', 24280000, 22095221, 46375221, 'Anime'], ['Your Name', '2016', 19300000, 25878383, 45178383, 'Anime'], ['Demon Slayer the Movie: Mugen Train', '2020', 29151041, 15354900, 44505941, 'Anime'], ['Who Are You, Mr. Sorge?', '1961', 'Un\xadknown', 39200000, 39200000, 'Live-action'], ['Pokémon: The First Movie', '1998', 6650000, 30187706, 36837706, 'Anime'], ['Koi no Kisetsu', '1969', 'Un\xadknown', 27600000, 27600000, 'Live-action'], ['The Bullet Train', '1975', 'Un\xadknown', 25440638, 25440638, 'Live-action'], ['Tokyo Olympiad', '1965', 23500000, 993555, 24493555, 'Live-action'], ['The Ballad of Narayama', '1983', 1600000, 2194

In [None]:
# Top 50 box ticket sales all the time
df2 = pd.DataFrame(table2)
# df2.rename(columns={df.columns[0]:'Year released',df.columns[1]:'Ticket sale in Japan',df.columns[2]:'Ticket sale oversea',df.columns[3]:'Ticket sale worldwide', df.columns[4]:'Movie genre'},inplace=True)
df2

Unnamed: 0,0,1,2,3,4,5
0,Kimi yo Fundo no Kawa o Watare,1976,720000,433700000,434400000,Live-action
1,Kitakitsune Monogatari,1978,2400000,100000000,102400000,Live-action
2,Sandakan No. 8,1974,Un­known,100000000,100000000,Live-action
3,Legend of Dinosaurs & Monster Birds,1977,650000,48700000,49400000,Live-action
4,Spirited Away,2001,24280000,22095221,46375221,Anime
5,Your Name,2016,19300000,25878383,45178383,Anime
6,Demon Slayer the Movie: Mugen Train,2020,29151041,15354900,44505941,Anime
7,"Who Are You, Mr. Sorge?",1961,Un­known,39200000,39200000,Live-action
8,Pokémon: The First Movie,1998,6650000,30187706,36837706,Anime
9,Koi no Kisetsu,1969,Un­known,27600000,27600000,Live-action


In [None]:
table4 = []
for x in table2:
  for y in table0:
    if x[0] == y[0]:
      table4.append(x + [y[1]])
      break

print (table4)

[['Spirited Away', '2001', 24280000, 22095221, 46375221, 'Anime', 395580000], ['Your Name', '2016', 19300000, 25878383, 45178383, 'Anime', 380140500], ['Demon Slayer the Movie: Mugen Train', '2020', 29151041, 15354900, 44505941, 'Anime', 506523013], ['Pokémon: The First Movie', '1998', 6650000, 30187706, 36837706, 'Anime', 172744662], ['Stand by Me Doraemon', '2014', 6250000, 16833251, 23083251, 'Anime', 183442714], ['Weathering with You', '2019', 10539869, 12147184, 22687053, 'Anime', 193715360], ["Howl's Moving Castle", '2004', 15500000, 6443336, 21943336, 'Anime', 236323601], ['Pokémon: The Movie 2000', '1999', 5600000, 13968660, 19568660, 'Anime', 133949270], ['Ponyo', '2008', 12900000, 5202354, 18102354, 'Anime', 204826668], ['Dragon Ball Super: Broly', '2018', 3070000, 14417445, 17487445, 'Anime', 122747755], ['Princess Mononoke', '1997', 14970000, 2014830, 16984830, 'Anime', 170005875], ['Detective Conan: The Fist of Blue Sapphire', '2019', 7400000, 7659418, 15059418, 'Anime', 1

For the DataFrame df2 (all-time box office ticket sales), we use a different factor, ticket sales, to compare the gap between sales in Japan, overseas, and worldwide. This helps us see which films were popular over time and analyze what makes a film an enduring classic.
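
The gap between domestic and overseas sales can be expressed as the overseas share of worldwide tickets. The sketch below uses four rows from table2 above and a hypothetical helper, `overseas_share`. It returns None when any figure is missing, so that a film with unknown sales in Japan is not reported as 100% overseas.

```python
UNKNOWN = "Un\u00adknown"  # the page writes "Unknown" with a soft hyphen

# (title, tickets in Japan, tickets overseas, tickets worldwide), from table2
ticket_rows = [
    ("Kimi yo Fundo no Kawa o Watare", 720000, 433700000, 434400000),
    ("Sandakan No. 8", UNKNOWN, 100000000, 100000000),
    ("Spirited Away", 24280000, 22095221, 46375221),
    ("Your Name", 19300000, 25878383, 45178383),
]


def overseas_share(japan, overseas, worldwide):
    """Fraction of worldwide tickets sold outside Japan, or None if unknown."""
    figures = (japan, overseas, worldwide)
    if not all(isinstance(value, int) for value in figures) or worldwide == 0:
        return None
    return overseas / worldwide


shares = {title: overseas_share(j, o, w) for title, j, o, w in ticket_rows}
for title, share in shares.items():
    print(title, "unknown" if share is None else f"{share:.1%}")
```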

In [None]:
ref_url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_Japanese_films#cite_ref-153'
res = requests.get(ref_url)

wiki_html = res.text

htmlparser = etree.HTMLParser()
try:
    # This one should work
    tree = etree.parse(io.StringIO(wiki_html), parser=htmlparser)
    # root of the whole wiki page in HTML
    root = tree.getroot()
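except Exception as exc:
    # Report the reason and stop: the XPath query below needs `root` to exist
    print("Failed to parse as HTML:", exc)
    raise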

node = root.xpath("//ol[contains(@class,'references')]")

table3 = []
for x in node[0].getchildren():
  a = []
  for y in x.getchildren():
    for z in y.getchildren():
      if z.tag == 'i':
        # print (z.text)
        a.append(z.text)
      if z.tag == 'ul':
        for t in z.getchildren():
          if t.tag == 'li':
            a.append(t.text)
  if len(a) > 0: table3.append(a)
  # print (etree.tostring(z, pretty_print = True).decode())

print (table3)

[['Kimi yo Fundo no Kawa o Watare', 'Manhunt', None, None], ['Mothra vs. Godzilla'], ['Spirited Away', 'China – 15,722,581', 'Europe – 2,995,684', 'United States and Canada – 1,961,800', 'South Korea (', 'Brazil and Argentina – 477,697'], ['Your Name', 'China – 20,728,000', 'South Korea – 3,712,891', 'Europe (excluding Russia and Spain) – 627,384', 'United States and Canada – 580,000', 'Russia and Argentina – 169,600', 'Spain – 60,508'], ['Demon Slayer the Movie: Mugen Train', 'Taiwan, Hong Kong, Macau, Southeast Asia, Latin America, Australia, New Zealand, Middle East, Africa – 6,340,000', 'United States and Canada – 5,090,000', 'South Korea – 2,151,861', 'Europe (excluding Russia) – 1,259,895', 'Russia – 513,144'], ['Pokémon: The First Movie', 'United States and Canada – 16,858,300', 'Europe – 12,679,046', 'Brazil – 468,000', 'South Korea (Seoul City) – 182,360'], ['The Bullet Train', 'France – 440,638', 'Soviet Union – 25,000,000+'], ['The Ballad of Narayama', None, 'France – 844,07
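
The entries in table3 are still raw strings such as 'China – 15,722,581'. A sketch of how they could be split into a region and a ticket count follows, assuming two hypothetical helpers, `parse_entry` and `parse_reference`. Entries that do not match the pattern (None values, alternative titles, or fragments such as 'South Korea (') are skipped rather than guessed.

```python
import re

# "<region> – <count>" with an en dash (U+2013) and an optional trailing '+'
ENTRY = re.compile(r"^(?P<region>.+?)\s+\u2013\s+(?P<count>[\d,]+)\+?$")


def parse_entry(text):
    """Return (region, tickets) for one entry, or None if it does not match."""
    if text is None:
        return None
    match = ENTRY.match(text.strip())
    if match is None:
        return None
    return match.group("region"), int(match.group("count").replace(",", ""))


def parse_reference(item):
    """Split one table3 item into its title and the parsed regional entries."""
    title, *entries = item
    parsed = [parse_entry(entry) for entry in entries]
    return title, [pair for pair in parsed if pair is not None]


# Items copied from the table3 output above
sample = [
    ["Kimi yo Fundo no Kawa o Watare", "Manhunt", None, None],
    ["Spirited Away", "China – 15,722,581", "Europe – 2,995,684",
     "United States and Canada – 1,961,800", "South Korea (",
     "Brazil and Argentina – 477,697"],
    ["The Bullet Train", "France – 440,638", "Soviet Union – 25,000,000+"],
]

regional = dict(parse_reference(item) for item in sample)
print(regional["The Bullet Train"])
```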