# Modules and scripts <a class="anchor" id="chapter1"></a>

In [18]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
import pickle
import scraping_functions

# Scraping the film IDs and the URLs <a class="anchor" id="chapter2"></a>

IMDB doesn't have an API. However, there are two unofficial IMDB APIs: The Open Movie Database (OMDb) and The Movie Database (TMDB). In this project we'll rely on the latter, as it doesn't enforce a rate limit. We'll also supplement the API data with scraped data.

The TMDB API is queried with the ID of the film(s) whose info we'd like to retrieve. Each ID consists of the prefix "tt" followed by seven or eight digits, and it can be found inside the film's URL. For example, "tt0050782" is the ID for this film: https://www.imdb.com/title/tt0050782/.
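
As a minimal sketch of the ID format described above, the snippet below pulls the ID out of a film URL with a regular expression. The helper name `extract_film_id` is hypothetical, and the pattern assumes the "tt plus seven or eight digits" shape stated above.

```python
import re

# Assumed ID shape: "tt" followed by seven or eight digits.
FILM_ID_PATTERN = re.compile(r"tt\d{7,8}")


def extract_film_id(url):
    """Return the first IMDB-style film ID found in `url`, or None."""
    match = FILM_ID_PATTERN.search(url)
    return match.group(0) if match else None


print(extract_film_id("https://www.imdb.com/title/tt0050782/"))  # tt0050782
```

Returning `None` instead of raising makes it easy to spot links that don't contain an ID before they reach the API.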

The first step will thus be retrieving the IDs of the 5 000 most popular films on IMDB, which can be found in this [playlist](https://www.imdb.com/search/keyword/?mode=detail&page=1&title_type=movie&ref_=kw_ref_rt_vt&num_votes=5000%2C&sort=num_votes,desc).

To scrape the film IDs of our 5 000 films we'll loop over the first 100 pages of the playlist (each containing 50 films). Each film's link is the 'href' attribute of the `a` tag inside an h3 element with the "lister-item-header" class, and the film ID is embedded in that link. We'll write a simple regex to extract the film ID. We'll also store the entire URL: it's going to come in handy later when we scrape the film pages.

In [2]:
film_ids = []
film_urls = []

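for i in range(1, 101):

    # Fetch one page of results (50 films per page)
    content = requests.get(f"https://www.imdb.com/search/keyword/?mode=detail&page={i}&title_type=movie&ref_=kw_ref_rt_vt&num_votes=5000%2C&sort=num_votes,desc").content

    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(content, "html.parser")

    for film in soup.find_all("h3", class_="lister-item-header"):

        # The href contains the film ID; extract it with a regex
        link = film.find("a").get("href")
        film_id = re.findall(pattern=r"tt\d+", string=link)[0]
        film_ids.append(film_id)

        # Keep the full URL too, for the scraping step later on
        film_url = "https://www.imdb.com" + link
        film_urls.append(film_url)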

Saving both the film IDs and the URLs

In [21]:
with open('data/film_ids.txt', 'wb') as f:
    pickle.dump(film_ids, f)
    
with open('data/film_urls.txt', 'wb') as f:
    pickle.dump(film_urls, f)
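
Because the lists are written with `pickle`, the files are binary despite their `.txt` extension and have to be read back with `pickle.load` in `'rb'` mode. Here is a minimal round-trip sketch; it uses a temporary directory and a two-item sample list rather than the real files in `data/`.

```python
import os
import pickle
import tempfile

sample_ids = ["tt0111161", "tt0468569"]  # sample values, not the full list

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "film_ids.txt")

    # Write the list in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(sample_ids, f)

    # Read it back: binary mode again, since the file is not plain text.
    with open(path, "rb") as f:
        restored_ids = pickle.load(f)

print(restored_ids == sample_ids)  # True
```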

# API calls <a class="anchor" id="chapter3"></a>

Let's build the dataframe with the API data by feeding the film IDs to the `build_film_df` function from our `scraping_functions` script.

In [7]:
df = scraping_functions.build_film_df(film_ids)
df.head()

| | id | title | release_date | runtime | country | language | genre | studios | budget | revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | The Shawshank Redemption | 1994-09-23 | 142 | United States of America | English | Drama;Crime | Castle Rock Entertainment | 25000000 | 28341469 |
| 1 | tt0468569 | The Dark Knight | 2008-07-14 | 152 | United Kingdom;United States of America | English;Mandarin | Drama;Action;Crime;Thriller | DC Comics;Legendary Pictures;Syncopy;Isobel Gr... | 185000000 | 1004558444 |
| 2 | tt1375666 | Inception | 2010-07-15 | 148 | United Kingdom;United States of America | English;Japanese | Action;Science Fiction;Adventure | Legendary Pictures;Syncopy;Warner Bros. Pictures | 160000000 | 825532764 |
| 3 | tt0137523 | Fight Club | 1999-10-15 | 139 | Germany;United States of America | English | Drama | Regency Enterprises;Fox 2000 Pictures;Taurus F... | 63000000 | 100853753 |
| 4 | tt0109830 | Forrest Gump | 1994-07-06 | 142 | United States of America | English | Comedy;Drama;Romance | Paramount;The Steve Tisch Company | 55000000 | 677387716 |
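
Several columns in the output above (country, language, genre, studios) pack multiple values into a single semicolon-separated string. The snippet below is a minimal sketch of splitting such a field into a list in plain Python. The helper name `split_multi` is hypothetical, and the sample row is copied from the output above.

```python
def split_multi(value, sep=";"):
    """Split a semicolon-joined field into a list of stripped values."""
    if not value:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


# Sample row taken from the dataframe preview above.
row = {
    "id": "tt1375666",
    "title": "Inception",
    "country": "United Kingdom;United States of America",
    "genre": "Action;Science Fiction;Adventure",
}

print(split_multi(row["genre"]))  # ['Action', 'Science Fiction', 'Adventure']
```

On the dataframe itself, `df["genre"].str.split(";")` should give the same result for a whole column.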


In [8]:
df.shape

(5000, 10)

Let's save it as a .csv file

In [12]:
df.to_csv("data/df_api.csv")

# Scraping additional data <a class="anchor" id="chapter4"></a>

The TMDB API has some limits: it doesn't provide data about the people who worked on a film (directors, writers, actors, etc.) and its rating data is inaccurate. We'll therefore supplement the data we just retrieved by scraping some more information.

To build the dataframe of scraped data we'll use another function from our functions script and run it on the film URLs we scraped in section 2.

In [13]:
df2 = scraping_functions.build_scraped_df(film_urls)





Let's have a look at the dataframe

In [14]:
df2.head()

| | id | director | writer | imdb_rating | imdb_rating_count | metascore | user_review_count | critic_review_count | color | aspect_ratio | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | Frank Darabont | Frank Darabont;Stephen King | 9.3 | 2508904 | 80 | 9750 | 190 | Color | 1.85 : 1 | 2021-12-22 |
| 1 | tt0468569 | Christopher Nolan | David S. Goyer;Christopher Nolan;Jonathan Nolan | 9.0 | 2458874 | 84 | 7767 | 427 | | 2.39 : 1 | 2021-12-22 |
| 2 | tt1375666 | Christopher Nolan | Christopher Nolan | 8.8 | 2205560 | 74 | 67 | 479 | Color | 1.85 : 1 | 2021-12-22 |
| 3 | tt0137523 | David Fincher | Chuck Palahniuk;Jim Uhls | 8.8 | 1973247 | 66 | 4131 | 366 | Color | 2.39 : 1 | 2021-12-22 |
| 4 | tt0109830 | Robert Zemeckis | Eric Roth;Winston Groom | 8.8 | 1936080 | 82 | 2811 | 164 | Color | 2.39 : 1 | 2021-12-22 |


Let's save the dataframe

In [17]:
df2.to_csv("data/scraped_df_1.csv")

# Scraping people data <a class="anchor" id="chapter5"></a>

As mentioned previously, the TMDB API doesn't provide much data about the people who worked on a film, so we'll need to scrape that data ourselves. This information isn't located on the regular film pages but on the "Full cast and crew" pages, so before scraping the data we first need to build the URLs of all these pages.

In [None]:
# Build the "Full cast and crew" URL for each film ID
crew_pages = [
    f"https://www.imdb.com/title/{film_id}/fullcredits/?ref_=tt_cl_sm"
    for film_id in film_ids
]

## Creating the dataframe <a class="anchor" id="subparagraph4"></a>

Let's run the `build_scraped_df_2` function from our functions script on our list of 'Full cast and crew' pages.

In [None]:
df3 = scraping_functions.build_scraped_df_2(crew_pages)

Let's have a look at the dataframe

In [None]:
df3.head()

Let's save it

In [None]:
df3.to_csv('data/scraped_df_2.csv')