1. [Modules](#chapter1)
2. [Scraping the film IDs and the URLs](#chapter2)
3. [API calls](#chapter3)
4. [Scraping additional data](#chapter4)
    1. [Custom functions](#subparagraph1)
    2. [Creating the dataframe](#subparagraph2)
5. [Scraping people data](#chapter5)
    1. [Custom functions](#subparagraph3)
    2. [Creating the dataframe](#subparagraph4)

# Modules <a class="anchor" id="chapter1"></a>

In [1]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import time
import os
import pickle

# Scraping the film IDs and the URLs <a class="anchor" id="chapter2"></a>

IMDB doesn't have an API. However, there are two unofficial IMDB APIs: The Open Movie Database (OMDb) and The Movie Database (TMDB). In this project we'll rely on the latter, as it doesn't enforce a rate limit, and we'll supplement the API data with scraped data.

The TMDB API works by supplying the ID of the film whose info we'd like to retrieve. An ID consists of a double "t" followed by seven or eight digits, and it can be found inside the film's URL. For example, "tt0050782" is the ID of this film: https://www.imdb.com/title/tt0050782/.
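
As a quick illustration of the ID format, the snippet below pulls the ID out of a film URL with a regular expression. It's only a sketch: the helper name `extract_film_id` is made up for this example, and the pattern assumes the "tt" prefix followed by seven or eight digits described above.

```python
import re

# Hypothetical helper, not part of the notebook's pipeline
def extract_film_id(url):
    """Return the first IMDB-style ID found in `url`, or None if there is none."""
    match = re.search(r"tt\d{7,8}", url)
    return match.group(0) if match else None

print(extract_film_id("https://www.imdb.com/title/tt0050782/"))
```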

The first step will thus be retrieving the IDs of the 5 000 most popular films on IMDB, which can be found in this [playlist](https://www.imdb.com/search/keyword/?mode=detail&page=1&title_type=movie&ref_=kw_ref_rt_vt&num_votes=5000%2C&sort=num_votes,desc).

To scrape the film IDs of our 5 000 films we'll need to loop over the first 100 pages of the playlist (each containing 50 films). The film ID is part of the 'href' attribute of the link inside each h3 element with the "lister-item-header" class, so we'll write a simple regex to extract it. We'll also store the entire URL: it's going to come in handy later when we do the scraping.

In [5]:
film_ids = []
film_urls = []

for i in range(1, 101):
  
  content = requests.get(f"https://www.imdb.com/search/keyword/?mode=detail&page={i}&title_type=movie&ref_=kw_ref_rt_vt&num_votes=5000%2C&sort=num_votes,desc").content
  
  soup = BeautifulSoup(content)

  for film in soup.find_all("h3", class_="lister-item-header"):

    link = film.find("a").get("href")
    film_id = re.findall(pattern = r"tt\d+", string = link)[0]
    film_ids.append(film_id)

    film_url = "https://www.imdb.com" + link
    film_urls.append(film_url)
    

# API calls <a class="anchor" id="chapter3"></a>

We'll now write a function to request the data from TMDB. The function takes a list of film IDs as input and returns a dataframe. For each ID it sends a GET request to the API. If the status code is 200 (i.e. the request was successful) it adds a row with the film's data to the dataframe. Otherwise, it adds only the film ID and leaves the remaining columns null.

In [4]:
def build_film_df(film_ids):
    
  api_key = os.environ.get("tmdb_api_key")

  columns = ["id", "title", "release_date", "runtime", "country", "language", 
             "genre", "studios", "budget", "revenue"]

  # Collect one dict per film; DataFrame.append was removed in pandas 2.0
  rows = []

  for film_id in film_ids:
      
      response = requests.get(f"https://api.themoviedb.org/3/movie/{film_id}?api_key={api_key}")

      if response.status_code == 200:  

        response_json = response.json()

        # Nested lists are collapsed into ';'-separated strings
        rows.append({"id": response_json["imdb_id"],
                     "title": response_json["title"],
                     "release_date": response_json["release_date"],
                     "runtime": response_json["runtime"],
                     "country": ';'.join([country['name'] for country in response_json["production_countries"]]),
                     "language": ';'.join([language["english_name"] for language in response_json["spoken_languages"]]),
                     "genre": ';'.join([genre["name"] for genre in response_json["genres"]]),
                     "studios": ';'.join([company["name"] for company in response_json['production_companies']]),
                     "budget": response_json['budget'],
                     "revenue": response_json["revenue"]})
        
      else:
        # Failed request: keep the ID, the other columns are filled with NaN
        rows.append({"id": film_id})

  return pd.DataFrame(rows, columns = columns)
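
To make the row-building step easier to follow in isolation, here is a minimal sketch that flattens one response into a single row. The `sample` dictionary is hand-written for illustration (it only mimics some of the keys the function above reads and isn't a real API response), and `flatten_film` is a hypothetical helper name.

```python
# Hypothetical helper mirroring the row-building step of build_film_df
def flatten_film(response_json):
    """Collapse the nested lists of a film payload into ';'-separated strings."""
    return {
        "id": response_json["imdb_id"],
        "title": response_json["title"],
        "country": ";".join(c["name"] for c in response_json["production_countries"]),
        "genre": ";".join(g["name"] for g in response_json["genres"]),
    }

# Hand-written sample with the same shape as the fields used above (not real API output)
sample = {
    "imdb_id": "tt0000000",
    "title": "Example Film",
    "production_countries": [{"name": "Country A"}, {"name": "Country B"}],
    "genres": [{"name": "Drama"}, {"name": "Crime"}],
}

row = flatten_film(sample)
print(row["country"])  # Country A;Country B
print(row["genre"])    # Drama;Crime
```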

Let's run it on the film IDs we just scraped.

In [5]:
df = build_film_df(film_ids)
df.head()

| | id | title | release_date | runtime | country | language | genre | studios | budget | revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | The Shawshank Redemption | 1994-09-23 | 142 | United States of America | English | Drama;Crime | Castle Rock Entertainment | 25000000 | 28341469 |
| 1 | tt0468569 | The Dark Knight | 2008-07-14 | 152 | United Kingdom;United States of America | English;Mandarin | Drama;Action;Crime;Thriller | DC Comics;Legendary Pictures;Syncopy;Isobel Gr... | 185000000 | 1004558444 |
| 2 | tt1375666 | Inception | 2010-07-15 | 148 | United Kingdom;United States of America | English;Japanese | Action;Science Fiction;Adventure | Legendary Pictures;Syncopy;Warner Bros. Pictures | 160000000 | 825532764 |
| 3 | tt0137523 | Fight Club | 1999-10-15 | 139 | Germany;United States of America | English | Drama | Regency Enterprises;Fox 2000 Pictures;Taurus F... | 63000000 | 100853753 |
| 4 | tt0109830 | Forrest Gump | 1994-07-06 | 142 | United States of America | English | Comedy;Drama;Romance | Paramount;The Steve Tisch Company | 55000000 | 677387716 |


Let's save it as a .csv file.

In [12]:
df.to_csv("data/df_api.csv")

# Scraping additional data <a class="anchor" id="chapter4"></a>

The TMDB API has some limits: it doesn't provide data about the people who worked on a film (directors, writers, actors etc.) and its rating data is inaccurate. We'll thus supplement the data we just got by scraping some more information.

## Custom functions <a class="anchor" id="subparagraph1"></a>

To keep the code modular and easy to test, we'll build a separate function for each type of data we need to scrape.

### Film ID

To scrape the film ID, we'll first access the 'meta' tag with property equal to 'imdb:pageConst' and then get the value of its 'content' attribute.

In [13]:
def scrape_film_id(soup):
    
    try:
        film_id = soup.find("meta", {"property": "imdb:pageConst"}).get("content")
    except:
        return np.nan
    else:
        return film_id

### Directors

To scrape the directors we'll access all the tags whose 'href' attribute matches the 'tt_ov_dr' regex and whose classes are "ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link". The resulting list contains duplicates, so we'll turn it into a set to keep only distinct values. Finally, we'll turn it back into a list and join it into a single semicolon-separated string.

In [14]:
def scrape_director(soup):
    
    try:  
        a_tags = soup.find_all(href = re.compile("tt_ov_dr"), class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")
    
        directors = list(set([a.text for a in a_tags]))
    
        directors = ';'.join(directors)
        
    except:
        return np.nan
        
    else:
        return directors
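
One side effect of going through a set is that the original order of the names is lost. If the order matters, an order-preserving alternative is `dict.fromkeys`, sketched below on a hand-written list of placeholder names. This is a variation for illustration, not what the function above does.

```python
# Hand-written list standing in for the texts of the scraped <a> tags
names = ["Person A", "Person B", "Person A"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_names = list(dict.fromkeys(names))

directors = ";".join(unique_names)
print(directors)  # Person A;Person B
```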

### Writers

Same as before, but this time we're looking for tags whose 'href' attribute matches the 'tt_ov_wr' regex.

In [15]:
def scrape_writer(soup):
    
    try:
        a_tags = soup.find_all(href = re.compile("tt_ov_wr"), class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")
        
        writers = list(set([a.text for a in a_tags]))
        
        writers = ';'.join(writers)
        
    except:
        return np.nan
        
    else:
        return writers

### IMDB average rating

The rating is the text of the first 'span' tag with class equal to 'AggregateRatingButton__RatingScore-sc-1ll29m0-1 iTLWoV'.

In [16]:
def scrape_imdb_rating(soup):
    
    try:   
        span = soup.find_all("span", class_="AggregateRatingButton__RatingScore-sc-1ll29m0-1 iTLWoV")[0]
    
        imdb_rating = span.text
        
    except:
        return np.nan
    
    else:
        return imdb_rating
        

### IMDB rating count

To scrape the rating count we need to get a little creative. This value can be found in the 'script' tag with type equal to 'application/ld+json', where it's preceded by '"ratingCount":'. If we treat the whole tag as a string, we can extract it with a positive lookbehind.

In [17]:
def scrape_rating_count(soup):
    
    try:
        pattern = r'(?<="ratingCount":)[\d.]+'
        string = str(soup.find("script", {"type": "application/ld+json"}))
        rating_count = re.findall(pattern = pattern, string = string)[0]
        
    except:
        return np.nan
    
    else:
        return rating_count
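
The lookbehind is easier to see on a short string. The snippet below applies the same pattern to a hand-written fragment that imitates the JSON inside the tag; the fragment and its numbers are made up for this example.

```python
import re

# Made-up fragment imitating the JSON-LD content of the 'script' tag
string = '{"aggregateRating":{"ratingCount":12345,"ratingValue":8.1}}'

# (?<=...) requires '"ratingCount":' right before the match without including it
pattern = r'(?<="ratingCount":)[\d.]+'

rating_count = re.findall(pattern, string)[0]
print(rating_count)  # 12345
```

Note that the extracted value is a string, so it still needs to be converted before any numeric analysis.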

### Metascore

The 'span' tag with the 'score-meta' class contains the Metascore of the film.

In [18]:
def scrape_metascore(soup):
    
    try:
        metascore = soup.find("span", class_="score-meta").text 
    
    except:
        return np.nan
    
    else:
        return metascore

### User review count

We'll extract the number of user reviews by treating the 'script' tag with id equal to '__NEXT_DATA__' as a string and by using a regex with both a lookbehind and a lookahead.

In [19]:
def scrape_user_review_count(soup):
    
    try: 
        pattern = r'(?<="total":)\d+(?=,"__typename":"ReviewsConnection"},"criticReviewsTotal":)'
  
        string = str(soup.find("script", {'id': '__NEXT_DATA__'}))
    
        user_review_count = re.findall(pattern = pattern, string = string)[0]
        
    except:
        return np.nan
    
    else:
        return user_review_count
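
Here the lookahead is what tells the user review total apart from other "total" fields in the same JSON. A small sketch on a made-up string (the structure imitates the fragment targeted by the pattern, and the numbers are invented):

```python
import re

# Made-up fragment with two "total" fields; only the first one is the user review count
string = ('{"reviews":{"total":321,"__typename":"ReviewsConnection"},'
          '"criticReviewsTotal":{"total":45}}')

# Lookbehind anchors on '"total":', lookahead on the text that must follow the number
pattern = r'(?<="total":)\d+(?=,"__typename":"ReviewsConnection"},"criticReviewsTotal":)'

matches = re.findall(pattern, string)
print(matches)  # ['321']
```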

### Critic review count

To retrieve the number of critic reviews we'll first access all the 'span' tags assigned to a class containing the 'three-Elements' regex and then filter the resulting list by only keeping the 'span' tag containing the word 'Critic'. Finally we'll use a very simple regex to extract the count.

In [20]:
def scrape_critic_review_count(soup):
    
    try:   
        spans = soup.find_all("span", class_= re.compile("three-Elements")) 
        
        string = list(filter(lambda x: 'Critic' in str(x), spans))[0].text
        
        critic_review_count = re.findall(r'\d+', string)[0]
    
    except:
        return np.nan
        
    else:
        return critic_review_count
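
The filter-then-extract logic can be tried out without any HTML. In the sketch below the span texts are made up and stand in for the 'span' tags returned by `find_all`.

```python
import re

# Made-up texts standing in for the 'span' tags found on the page
span_texts = ["321 User reviews", "45 Critic reviews", "80 Metascore"]

# Keep only the text mentioning 'Critic', then take its first run of digits
critic_text = [text for text in span_texts if "Critic" in text][0]
critic_review_count = re.findall(r"\d+", critic_text)[0]

print(critic_review_count)  # 45
```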

### Color

Similarly to what we did for the rating count and the user review count, we're going to use a positive lookbehind and a positive lookahead to scrape the 'Color' value.

In [21]:
def scrape_color(soup):
    
    try:
        pattern = r'(?<="text":")\w+(?=","attributes":\[\],"__typename":"Coloration"})'
        
        string = str(soup.find("script", {"type": "application/json"}))
    
        color = re.findall(pattern = pattern, string = string)[0]
        
    except:
        return np.nan
    
    else:
        return color

### Aspect ratio

The aspect ratio will be scraped similarly by using a positive lookbehind.

In [22]:
def scrape_aspect_ratio(soup):
    try:
        pattern = r'(?<="aspectRatio":")[\d.\s:]+'
    
        string = str(soup.find("script", {"type": "application/json"}))
    
        aspect_ratio = re.findall(pattern = pattern, string = string)[0] 
        
    except:
        return np.nan
        
    else:
        return aspect_ratio

## Creating the dataframe <a class="anchor" id="subparagraph2"></a>

We'll create a dataframe of scraped data by iterating over a list of URLs and adding one row of scraped values per URL. Should a request fail, we'll retry it up to 30 times, with an exponentially increasing wait between attempts.

In [23]:
def build_scraped_df(urls):

  columns = ["id", "director", "writer", "imdb_rating", "imdb_rating_count", "metascore", 
             "user_review_count", "critic_review_count", "color", "aspect_ratio"]

  # One session (and one retry policy) is enough for all the requests
  session = requests.Session()
  retry = Retry(total = 30, backoff_factor = 0.000001)
  adapter = HTTPAdapter(max_retries=retry)
  session.mount('http://', adapter)
  session.mount('https://', adapter)

  # Collect one dict per film; DataFrame.append was removed in pandas 2.0
  rows = []

  for url in urls:
        
        content = session.get(url).content
        soup = BeautifulSoup(content)
        
        rows.append({"id": scrape_film_id(soup),
                     "director": scrape_director(soup),
                     "writer": scrape_writer(soup),
                     "imdb_rating": scrape_imdb_rating(soup),
                     "imdb_rating_count": scrape_rating_count(soup),
                     "metascore": scrape_metascore(soup),
                     "user_review_count": scrape_user_review_count(soup),
                     "critic_review_count": scrape_critic_review_count(soup),
                     "color": scrape_color(soup),
                     "aspect_ratio": scrape_aspect_ratio(soup)})

  return pd.DataFrame(rows, columns = columns)
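
To get a feel for what these retry settings imply, the sketch below computes the waiting times under the assumption that the n-th consecutive retry sleeps for `backoff_factor * 2 ** (n - 1)` seconds, capped at 120 seconds. Both the formula and the cap are assumptions based on urllib3's documented backoff behaviour, and the details differ between urllib3 versions (some skip the sleep before the first retry), so treat the numbers as indicative only.

```python
# Assumed backoff rule: factor * 2 ** (n - 1) seconds, capped at BACKOFF_MAX
BACKOFF_FACTOR = 0.000001
BACKOFF_MAX = 120  # assumed default cap, in seconds
TOTAL_RETRIES = 30

def backoff_time(n):
    """Sleep time in seconds before the n-th consecutive retry (n starts at 1)."""
    return min(BACKOFF_MAX, BACKOFF_FACTOR * 2 ** (n - 1))

schedule = [backoff_time(n) for n in range(1, TOTAL_RETRIES + 1)]

for n in (1, 10, 20, 30):
    print(f"retry {n:2d}: {schedule[n - 1]:.6f} s")
```

Under these assumptions, each of the first twenty retries waits about half a second or less, and only the last few involve a noticeable pause.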


Let's run it on the film URLs we scraped in section 2.

In [24]:
df2 = build_scraped_df(film_urls)

Let's have a look at the dataframe.

In [31]:
df2.head()

| | id | director | writer | imdb_rating | imdb_rating_count | metascore | user_review_count | critic_review_count | color | aspect_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | Frank Darabont | Stephen King;Frank Darabont | 9.3 | 2506833 | 80 | 9750 | 190 | Color | 1.85 : 1 |
| 1 | tt0468569 | Christopher Nolan | David S. Goyer;Jonathan Nolan;Christopher Nolan | 9.0 | 2456425 | 84 | 7764 | 427 | | 2.39 : 1 |
| 2 | tt1375666 | Christopher Nolan | Christopher Nolan | 8.8 | 2203914 | 74 | 4466 | 479 | Color | 2.39 : 1 |
| 3 | tt0137523 | David Fincher | Jim Uhls;Chuck Palahniuk | 8.8 | 1971788 | 66 | 4127 | 366 | Color | 2.39 : 1 |
| 4 | tt0109830 | Robert Zemeckis | Winston Groom;Eric Roth | 8.8 | 1934719 | 82 | 2807 | 164 | Color | 2.39 : 1 |


Let's save the dataframe.

In [32]:
df2.to_csv("data/scraped_df_1.csv")

# Scraping people data <a class="anchor" id="chapter5"></a>

As mentioned previously, the TMDB API doesn't provide much data about the people who worked on a film, so we'll need to scrape that data ourselves. This information isn't located in the regular film pages but in the "Full cast and crew" pages, so before scraping we first need to build the URLs of all these pages.

In [6]:
# Build one "Full cast and crew" URL per film ID
crew_pages = [f'https://www.imdb.com/title/{film_id}/fullcredits/?ref_=tt_cl_sm' for film_id in film_ids]

## Custom functions <a class="anchor" id="subparagraph3"></a>

### Full cast

The names of the actors can be found inside the table with the "cast_list" class: they're the alternative text of the images inside the table.

In [2]:
def scrape_actors(soup):
    
    try:   
        images = soup.find("table", class_="cast_list").find_all("img") 
        
        actors = [img.get("alt") for img in images]  
        
        actors = ';'.join(actors)
    
    except:
        return np.nan
    
    else:
        return actors

### Cinematographer, editor, composer, producers, production designer, art director, costume designer

The following function scrapes the names of different types of artists. The type must be specified in the 'artist' argument. 

In the "Full Cast & Crew" pages each artist type has its own table element containing the names of the artists. Each of these tables is preceded by an h4 element with an id equal to the artist type. For example, the name of the cinematographer is contained in the table preceded by the h4 element with id equal to "cinematographer". To scrape this data, we'll first access the h4 element, then we'll access the table next to it and finally we'll get all the names.

In [3]:
def scrape_artist(soup, artist):
    
    try:
        h = soup.find("h4", id = artist)
        
        a_tags = h.find_next("table").find_all("a")
        
        artists = [a.text.lstrip().replace('\n', '') for a in a_tags]
        
        artists = ';'.join(artists)
    
    except:
        return np.nan
        
    else:     
        return artists

## Creating the dataframe <a class="anchor" id="subparagraph4"></a>

To build the second dataframe of scraped data, we'll build a function very similar to the one we used before.

In [4]:
def build_scraped_df_2(urls):
    
    columns = ["id", "actors", "cinematographer", "editor", "composer", "producers", "production_designer",
               "art_director", "costume_designer"]
    
    # One session (and one retry policy) is enough for all the requests
    session = requests.Session()
    retry = Retry(total = 30, backoff_factor = 0.000001)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    # Collect one dict per film; DataFrame.append was removed in pandas 2.0
    rows = []
    
    for url in urls:
        content = session.get(url).content
        soup = BeautifulSoup(content)
            
        rows.append({"id": re.findall(r"tt\d+", url)[0],
                     "actors": scrape_actors(soup),
                     "cinematographer": scrape_artist(soup, "cinematographer"),
                     "editor": scrape_artist(soup, "editor"),
                     "composer": scrape_artist(soup, "composer"),
                     "producers": scrape_artist(soup, "producer"),
                     "production_designer": scrape_artist(soup, "production_designer"),
                     "art_director": scrape_artist(soup, "art_director"),
                     "costume_designer": scrape_artist(soup, "costume_designer")})
        
    return pd.DataFrame(rows, columns = columns)

Let's run it on our list of "Full cast and crew" pages.

In [7]:
df3 = build_scraped_df_2(crew_pages)

Let's have a look at the dataframe.

In [11]:
df3.head()

| | id | actors | cinematographer | editor | composer | producers | production_designer | art_director | costume_designer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | Tim Robbins;Morgan Freeman;Bob Gunton;William ... | Roger Deakins | Richard Francis-Bruce | Thomas Newman | Liz Glotzer;David V. Lester;Niki Marvin | Terence Marsh;Soheil | Peter Landsdown Smith | Elizabeth McBride |
| 1 | tt0468569 | Christian Bale;Heath Ledger;Aaron Eckhart;Mich... | Wally Pfister | Lee Smith | James Newton Howard;Hans Zimmer | Kevin de la Noy;Jordan Goldberg;Philip Lee;Ben... | Nathan Crowley | Mark Bartholomew;James Hambidge;Craig Jackson;... | Lindy Hemming |
| 2 | tt1375666 | Leonardo DiCaprio;Joseph Gordon-Levitt;Elliot ... | Wally Pfister | Lee Smith | Hans Zimmer | Zakaria Alaoui;John Bernard;Chris Brigham;Jord... | Guy Hendrix Dyas | Luke Freeborn;Matthew Gray;Brad Ricker;Dean Wo... | Jeffrey Kurland |
| 3 | tt0137523 | Edward Norton;Brad Pitt;Meat Loaf;Zach Grenier... | Jeff Cronenweth | James Haygood | Dust Brothers;John King;Michael Simpson | Ross Grayson Bell;Ceán Chaffin;John S. Dorsey;... | Alex McDowell | Melique Berger;Chris Gorak | Michael Kaplan |
| 4 | tt0109830 | Tom Hanks;Rebecca Williams;Sally Field;Michael... | Don Burgess | Arthur Schmidt | Alan Silvestri | Wendy Finerman;Charles Newirth;Steve Starkey;S... | Rick Carter | Leslie McDonald;William James Teegarden | Joanna Johnston |


Let's save it.

In [17]:
df3.to_csv('data/scraped_df_2.csv')