The following packages are used:  
**Pandas**: Builds the data frame and exports the data in CSV format.  
**NumPy**: Splits the scraped values into row-sized arrays that are passed to pandas to build the data frame.  
**Requests**: Makes GET requests to websites and APIs to retrieve the necessary data.  
**BeautifulSoup**: Parses HTML files and extracts text.  
**json**: Reads the JSON file that stores the API key.

In [19]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs
import json

The website used to gather box office data is 'The Numbers'. The top 500 movies are gathered from the website across five pages, so 'pages' stores a list of the URL suffixes needed to reach each page.  
  
The for loop iterates over the pages list. For each suffix, the loop requests the page, parses the HTML, and appends the first 'tbody' element to the tables list. After it has iterated over all the items, the function returns the tables list.

In [20]:
def getTables():
    tables = []
    url = 'https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/all-time'
    pages = ['','/101','/201','/301','/401']
    
    for i in pages:
        page = requests.get(url+i).content
        soup = bs(page,'html.parser')
        table = soup.find('tbody')
        tables.append(table)
    
    return tables
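
The suffix list follows a regular pattern, so it could also be generated instead of typed out. The sketch below is one way to do that, assuming the site keeps paging in steps of 100; `build_page_urls` is a hypothetical helper and not part of the original notebook.

```python
BASE_URL = 'https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/all-time'

def build_page_urls(total=500, per_page=100):
    """Return one URL per results page (hypothetical helper)."""
    urls = []
    for start in range(1, total + 1, per_page):
        # The first page has no suffix; later pages use '/101', '/201', ...
        suffix = '' if start == 1 else '/{}'.format(start)
        urls.append(BASE_URL + suffix)
    return urls

urls = build_page_urls()
```

With the defaults this yields the same five addresses that getTables() builds from its hard-coded list.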

This function takes a tables argument, which is the list returned by the getTables() function. The data variable starts as an empty list that collects the extracted text. A for loop iterates over the tables list, finds all the 'td' tags in each table, and appends the text of each tag to data. The data list is returned.

In [21]:
def getText(tables):
    data = []
    for i in tables:
        section = i.find_all('td')
        for cell in section:  # separate name so the outer loop variable is not shadowed
            text = cell.get_text()
            data.append(text)
            
    return data
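
To see what the extraction produces without hitting the website, the same calls can be run on a small inline table. This is only a sketch: the HTML below is made up for illustration and is much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for one page of results
sample_html = """
<table>
  <tbody>
    <tr><td>1</td><td>2009</td><td>Example Movie A</td></tr>
    <tr><td>2</td><td>2019</td><td>Example Movie B</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
tbody = soup.find('tbody')  # first 'tbody' element, as in getTables()
cells = [td.get_text() for td in tbody.find_all('td')]  # as in getText()
print(cells)
```

The result is one flat list of strings, with the cells of each row following one another in order.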

This function takes the list of extracted text returned by the getText() function. It uses numpy to split the list into 500 equal parts, one per movie, each holding the six values of a table row. The result is passed into pandas to build a dataframe, and the index column that came from scraping the data is dropped.

In [22]:
def buildDataFrame(data):
    data = np.array_split(data,500)
    df = pd.DataFrame(data,columns=['index','year','movie','worldwide','domestic','international'])
    df.drop('index',axis=1,inplace=True)
    return df
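
The split only lines up if every movie contributed exactly six cells. The sketch below illustrates the same steps on two made-up rows and adds a length check before splitting; the check and the sample values are assumptions added for illustration, not part of the original function.

```python
import numpy as np
import pandas as pd

COLUMNS = ['index', 'year', 'movie', 'worldwide', 'domestic', 'international']

# Made-up values standing in for the scraped cell text of two movies
flat = ['1', '2001', 'Example Movie A', '$300', '$100', '$200',
        '2', '2002', 'Example Movie B', '$250', '$150', '$100']

# Guard: a leftover cell would shift every following row by one column
if len(flat) % len(COLUMNS) != 0:
    raise ValueError('cell count is not a multiple of the column count')

n_rows = len(flat) // len(COLUMNS)
rows = np.array_split(flat, n_rows)  # one array of six strings per movie
small_df = pd.DataFrame(rows, columns=COLUMNS).drop('index', axis=1)
```

All values are still strings at this point, so the currency columns would need cleaning before any numeric analysis.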

This function reads the API key for OMDb from a JSON file at the given path and returns the value stored under 'omdb_key'.

In [23]:
def get_keys(path):
    with open(path) as f:
        key = json.load(f)
    return key['omdb_key']
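
The key file is expected to be a JSON object with an 'omdb_key' entry. A minimal round trip, using a temporary file and a placeholder value rather than a real key, might look like this:

```python
import json
import os
import tempfile

def get_keys(path):
    with open(path) as f:
        key = json.load(f)
    return key['omdb_key']

# Write a throwaway key file with a placeholder value, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'keys.json')
    with open(path, 'w') as f:
        json.dump({'omdb_key': 'not-a-real-key'}, f)
    loaded = get_keys(path)
```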

This function takes two arguments. The first argument, key, is the API key for OMDb, and the second is the dataframe returned by buildDataFrame(). It creates six empty lists to store different attributes of the 500 movies. The for loop iterates over the dataframe's movie column, uses the get method from the requests library to fetch a JSON response from OMDb, and checks whether the response returned data. If the call returned data, each attribute is appended to the corresponding list. It also checks whether there is a rating from Rotten Tomatoes; if there is no rating, it appends 'NaN' to the list. If the call did not return data, it appends 'NaN' to every list. Afterwards, the lists are added to the dataframe as new columns.

In [24]:
def getInfo(key,df):
    rated = []
    genre = []
    director = []
    writer = []
    critics = []
    production = []
    
    for i in df.movie:
        # Pass the title as a query parameter so requests URL-encodes it (spaces, '&', etc.)
        params = {'apikey': key, 't': i}
        data = requests.get('http://www.omdbapi.com/', params=params).json()
        
        if data.get('Response') == 'True':
            rated.append(data.get('Rated'))
            genre.append(data.get('Genre'))
            director.append(data.get('Director'))
            writer.append(data.get('Writer'))
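            try:
                # The second entry in 'Ratings' is taken to be the Rotten Tomatoes score
                critics.append(data.get('Ratings')[1]['Value'])
            except (IndexError, KeyError, TypeError):
                critics.append('NaN')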
            production.append(data.get('Production'))
        else:
            rated.append('NaN')
            genre.append('NaN')
            director.append('NaN')
            writer.append('NaN')
            critics.append('NaN')
            production.append('NaN')
          
    df['rated'] = rated
    df['genre'] = genre
    df['director'] = director
    df['writer'] = writer
    df['critics'] = critics
    df['production'] = production
    
    return df
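
Reading the rating by position relies on Rotten Tomatoes always being the second entry. A possible alternative is to look the entry up by its source name. The sketch below assumes each entry in 'Ratings' is a dictionary with 'Source' and 'Value' keys; `rotten_tomatoes_rating` is a hypothetical helper and the sample payload is made up.

```python
def rotten_tomatoes_rating(payload, missing='NaN'):
    """Return the Rotten Tomatoes value from an OMDb-style payload, or `missing`."""
    # 'Ratings' may be absent or None, so fall back to an empty list
    for rating in payload.get('Ratings') or []:
        if rating.get('Source') == 'Rotten Tomatoes':
            return rating.get('Value', missing)
    return missing

# Made-up payload shaped like the assumed response
sample = {
    'Response': 'True',
    'Ratings': [
        {'Source': 'Internet Movie Database', 'Value': '7.0/10'},
        {'Source': 'Rotten Tomatoes', 'Value': '80%'},
    ],
}

found = rotten_tomatoes_rating(sample)
not_found = rotten_tomatoes_rating({'Response': 'True'})
```

Matching on the source name avoids recording another site's score when the Rotten Tomatoes entry is missing and the list order shifts.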

This part of the code runs all the functions in order and uses the pandas to_csv() method to save a copy of the dataframe for later analysis.

In [25]:
tables = getTables()
data = getText(tables)
df = buildDataFrame(data)
key = get_keys('.secret/keys.json')
finalDF = getInfo(key,df)
finalDF.to_csv('data/movie_data.csv',index=False)
print('done')