# 2. Scraping yearly market charts of top-grossing movies
This notebook scrapes the list of top-grossing movies for each year from 1995 to 2020.
The data comes from https://www.the-numbers.com/


In [1]:
import requests
import pandas as pd
import os 
import re
import hashlib
import numpy as np
from bs4 import BeautifulSoup 

from typing import List,Set,Dict

## 2.1 Defining globals and functions
We will use the following definitions: <br>
- ``base_url: str`` - the base URL of the website.
- ``dir: str`` - the directory in which we save the HTML files.
- ``years: List[int]`` - the range of years we need to pull.
- ``getYearURL(year: int) -> str`` returns the URL of the chart page for the given year.
- ``getYearHTMLPath(year: int) -> str`` returns the local path of the HTML file for the given year.
- ``createDirsRecursive(cur: str, next: list)`` creates the directories listed in ``next`` under ``cur``, one level at a time, whether or not some of them already exist.

In [2]:
base_url: str = "https://www.the-numbers.com"
dir: str = os.path.join("src_data", "the-numbers-html")  # note: shadows the built-in dir()
years: List[int] = list(range(1995, 2021))  # 1995..2020 inclusive

def getYearURL(year: int) -> str:
    return f"{base_url}/market/{year}/top-grossing-movies"

def getYearHTMLPath(year: int) -> str:
    return os.path.join(dir, f"{year}.html")

def createDirsRecursive(cur: str, next: List[str]) -> None:
    # Create each directory named in `next` under `cur`, one level at a time
    if not next:
        return
    head, *rest = next
    cur = os.path.join(cur, head)
    if not os.path.exists(cur):
        os.mkdir(cur)
    createDirsRecursive(cur, rest)
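
As a point of comparison, the standard library can create a nested directory in a single call. The sketch below is only an illustration: ``ensure_dir`` is a hypothetical helper name that is not part of this notebook, and the demonstration runs in a temporary directory so it does not touch the project folders.

```python
import os
import tempfile

def ensure_dir(path: str) -> str:
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstration in a throwaway directory (hypothetical layout mirroring `dir`)
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "src_data", "the-numbers-html")
    ensure_dir(target)
    ensure_dir(target)  # calling it again is harmless
    created = os.path.isdir(target)
```

Either approach works here; the recursive helper makes each step explicit, while ``os.makedirs`` handles the missing-parent case without extra code.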

## 2.2 Downloading the data
Here we create the required subdirectories and download the HTML files, so that later steps can read them from disk instead of repeating the network requests.

In [3]:
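# Pass every path component so that the parent directory is created first;
# os.mkdir fails if the parent of the directory it creates does not exist
createDirsRecursive("", dir.split(os.sep))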

for year in years:
    response = requests.get(getYearURL(year))
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(getYearHTMLPath(year), "wb") as f:
        f.write(response.content)
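
Since the point of saving the pages is to avoid repeated requests, one possible refinement is to skip years whose file is already on disk. The following is a minimal sketch under stated assumptions: ``download_if_missing`` and ``fake_fetch`` are hypothetical names, and the fetch function is passed in as a parameter so the example runs without network access.

```python
import os
import tempfile
from typing import Callable, List

def download_if_missing(url: str, path: str, fetch: Callable[[str], bytes]) -> bool:
    """Write fetch(url) to `path` unless the file already exists.

    Returns True if a download happened, False if the cached file was kept.
    """
    if os.path.exists(path):
        return False
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return True

# Stand-in for a real HTTP call; it records which URLs were requested
calls: List[str] = []
def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"<html></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1995.html")
    url = "https://www.the-numbers.com/market/1995/top-grossing-movies"
    first = download_if_missing(url, path, fake_fetch)   # downloads
    second = download_if_missing(url, path, fake_fetch)  # uses the cached file
```

In this notebook the fetch argument could be something like ``lambda u: requests.get(u).content``, with ``getYearURL(year)`` and ``getYearHTMLPath(year)`` supplying the URL and path.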


## 2.3 Creating soups and tables

### First, we define some helper functions to compile a table from the HTML file.
- ``convertStringToInt(string: str, sep: str, prefix: str)`` - Takes a number written as a string, with ``sep`` as the digit separator and an optional leading ``prefix``, removes both, and returns the value as an int.
- ``fixMovieName(name: str)`` - Some movie names contain special characters such as <b><u>’</u></b> instead of <b><u>'</u></b>. <br>
Some names are also truncated, so only the beginning is present and the rest is replaced by <b><u>...</u></b>. <br>
To fix this, we replace <b><u>’</u></b> with <b><u>'</u></b>. We keep the <b><u>...</u></b> so that truncated names can be matched by substring later.
- ``createRow(...)`` - Builds a row (a dict) with the specific columns we need, parsing each cell and filling in ``None`` where data is missing.
- ``createTableFromSoup(soup: BeautifulSoup)`` - Creates a dataframe from a given HTML soup using the functions above.
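
The parsing and name-handling ideas above can be sketched in isolation. This is an illustration rather than the notebook's own code: ``parse_number``, ``normalize_name`` and ``matches_truncated`` are hypothetical names, the sample strings are made up, and the prefix check for truncated names is one assumed way to carry out the substring matching mentioned above.

```python
def parse_number(text: str, sep: str = ",", prefix: str = "") -> int:
    """Turn a string such as "$1,234,567" into the int 1234567."""
    text = text.strip()
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    return int(text.replace(sep, ""))

def normalize_name(name: str) -> str:
    """Replace a typographic apostrophe and a long dash with ASCII characters."""
    return name.replace("\u2019", "'").replace("\u2014", "-")

def matches_truncated(listed: str, full: str) -> bool:
    """True if `listed` equals `full`, or is a prefix of it ending in an ellipsis."""
    listed, full = normalize_name(listed), normalize_name(full)
    for ellipsis in ("...", "\u2026"):
        if listed.endswith(ellipsis):
            return full.startswith(listed[:-len(ellipsis)])
    return listed == full

gross = parse_number("$1,234,567", prefix="$")
name = normalize_name("An Example\u2019s Title")
same = matches_truncated("A Very Long Example Mov...", "A Very Long Example Movie Title")
```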

In [4]:
# Takes a number represented as a string with a separator and an optional prefix,
# e.g. "$1,234" with sep="," and prefix="$", and returns it as an int.
def convertStringToInt(string: str, sep: str, prefix: str = "") -> int:
    parts = string.strip().split(sep)
    parts[0] = parts[0][len(prefix):]  # drop the leading prefix characters
    return int("".join(parts))

# Converts special non-standard characters in a movie name to standard ones
def fixMovieName(name: str) -> str:
    return name.replace("’", "'").replace("—", "-")


#This function will create a row in the final table
def createRow(row,year):
    rank,movie,date,dist,genre,gross,tickets = row
    movie_name = fixMovieName(movie.text)
    link = movie.a['href'].split("#")[0]
    id = hashlib.md5(movie.a["href"].encode("utf-8")).hexdigest()
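    # Any cell may be missing or malformed; fall back to None in that case.
    # `except Exception` is used instead of a bare `except` so that
    # KeyboardInterrupt and SystemExit are not swallowed.
    try: dist = dist.text
    except Exception: dist = None

    try: date = date.a['href'].split('daily/')[1]
    except Exception: date = None

    try: gross = convertStringToInt(gross.text, ",", "$")
    except Exception: gross = None

    try: tickets = convertStringToInt(tickets.text, ",")
    except Exception: tickets = None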
    
    return {
        "id":id,
        "link":link,
        "movie":movie_name,
        "date":date,
        "dist":dist,
        f"gross_{year}":gross,
        f"tickets_{year}":tickets
    }
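
The id is an MD5 digest of the movie's link, which gives a fixed-length key that is the same every time the same link appears. The sketch below illustrates the idea with made-up links. Note one deliberate difference, which is an assumption of this example and not what ``createRow`` does: here the ``#...`` fragment is stripped before hashing, so two links to the same movie page with different fragments get the same id. ``movie_id`` is a hypothetical helper name.

```python
import hashlib

def movie_id(href: str) -> str:
    """Return a stable 32-character hex id for a movie link, ignoring any #fragment."""
    path = href.split("#")[0]
    return hashlib.md5(path.encode("utf-8")).hexdigest()

# Hypothetical links: same page with different fragments, plus a different page
a = movie_id("/movie/Example-Movie#tab=summary")
b = movie_id("/movie/Example-Movie#tab=box-office")
c = movie_id("/movie/Another-Example")
```

MD5 is used here only as a compact identifier, not for security.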


### Creating soup objects for each year
We iterate over the years, create a soup object for each one, and store the results in a dict
where the key is the year and the value is the soup.

In [5]:
soups = {}
for year in years:
    with open(getYearHTMLPath(year), "r", encoding="utf-8") as file:
        # The with-block closes the file automatically
        soups[year] = BeautifulSoup(file.read(), 'lxml')


### Creating a full dictionary of data
We create a dictionary whose keys are the unique ids generated for each movie and whose values are dicts holding the data for that movie.


In [6]:
movies_dict={}

def updateMoviesDictFromSoup(soup: BeautifulSoup, year: int):
    table = soup.find("div", {"id": "main"}).table
    # The first row holds the column names and the last two rows are skipped; the rest are data
    rows = table.find_all("tr")[1:-2]
    for row in rows:
        row = createRow(row.find_all("td"), year)
        id = row.pop("id")
        # Merge with any data already stored for this movie from another year
        if id in movies_dict: movies_dict[id] = {**movies_dict[id], **row}
        else: movies_dict[id] = row
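
The merge step relies on dict unpacking: when a movie appears in the charts of two different years, the per-year keys accumulate in a single entry. A minimal sketch with made-up values (``merge_row`` and the sample data are hypothetical):

```python
from typing import Dict

def merge_row(movies: Dict[str, dict], movie_id: str, row: dict) -> None:
    """Merge `row` into the entry for `movie_id`, creating the entry if needed."""
    movies[movie_id] = {**movies.get(movie_id, {}), **row}

movies: Dict[str, dict] = {}
# The same movie charting in two consecutive years
merge_row(movies, "abc", {"movie": "Example", "gross_1995": 100, "tickets_1995": 20})
merge_row(movies, "abc", {"movie": "Example", "gross_1996": 50, "tickets_1996": 10})
```

For keys present in both dicts, such as ``movie``, the value from the later row wins.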


### Creating a dataframe for the 'the-numbers' table
We will create an empty dataframe with the following columns:
- ``id`` - a unique MD5 hash generated from the href of each movie
- ``movie`` - the movie's name
- ``link`` - the link to the movie's page
- ``date`` - the movie's release date
- ``dist`` - the movie's distributor
- ``gross_{year}`` - one column per year, storing how much the movie grossed in that year (may be missing)
- ``tickets_{year}`` - one column per year, storing the number of tickets sold in that year (may be missing)

In [7]:
# Creates a list of column names of the form prefix_year, e.g. ["gross_1995", "gross_1996", ...]
def createYearPrefixes(prefix, years):
    return [f"{prefix}_{year}" for year in years]

df_fields=["id","movie","link","date","dist",*createYearPrefixes("gross",years),*createYearPrefixes("tickets",years)]

the_numbers_df = pd.DataFrame(columns=df_fields)


### Filling the dataframe and sorting by date
To fill the dataframe, we iterate over the ids of all movies and create one row per movie.<br>
(NOTE: some values, mostly gross and tickets, will be NaN.)

In [8]:
for year,soup in soups.items():
    updateMoviesDictFromSoup(soup,year)

# DataFrame.append was removed in pandas 2.0, so collect all rows and build the frame once
rows = [{"id": id, **data} for id, data in movies_dict.items()]
the_numbers_df = pd.DataFrame(rows, columns=df_fields)

the_numbers_df = the_numbers_df.sort_values("date")
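
The ``date`` column holds strings taken from the chart links, so the sort above is a string sort. If those strings follow a year/month/day layout such as ``1995/11/22`` (an assumption about the link format that this notebook does not confirm), string order and chronological order coincide. The sketch below makes the ordering explicit by parsing the strings and placing missing dates last; ``parse_release_date`` and the sample values are hypothetical.

```python
from datetime import date, datetime
from typing import List, Optional

def parse_release_date(text: Optional[str]) -> Optional[date]:
    """Parse a 'YYYY/MM/DD' string; return None for missing or malformed input."""
    if not text:
        return None
    try:
        return datetime.strptime(text, "%Y/%m/%d").date()
    except ValueError:
        return None

raw: List[Optional[str]] = ["2001/12/19", None, "1995/11/22", "not-a-date"]
parsed = [parse_release_date(r) for r in raw]
# Sort chronologically, with missing dates at the end
ordered = sorted(parsed, key=lambda d: (d is None, d or date.min))
```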

## 2.4 Saving the dataframe
We save the dataframe here. In the next notebook we will compile a final dataset by cross-referencing 'the-numbers' with IMDB.

In [9]:
os.makedirs("output_data", exist_ok=True)  # make sure the output directory exists
the_numbers_df.to_csv("output_data/the_numbers.csv")