- scrape the full-cast link;
- create a function for cinematographers, writers, editors and composers;
- create a function to scrape people;
- join df, df2, df3;
- clean the data (count nulls, casting);
- hide the API key

# Project description

The goal of this project is to build a database of the 10 000 most popular feature films on IMDB. We'll define "popularity" as the number of ratings a film has received. 

The project will comprise the following steps:

- we'll collect the data through scraping and API requests;
- we'll load the data into a PostgreSQL database;
- we'll query the data with SQL;
- we'll visualize the results of the queries.

# Modules

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import time

# Data collection

IMDB doesn't have an official API. However, there are two unofficial APIs providing access to IMDB data: The Open Movie Database (OMDb) and The Movie Database (TMDB). Both are valid options, though OMDb provides more fields and more accurate rating data. Unfortunately, OMDb enforces a strict rate limit of 1000 requests per day, which is too low for 10 000 films. We will thus rely on TMDB and supplement it with additional scraped data.

We query the TMDB API by supplying the ID of the film whose data we'd like to retrieve. The IDs are IMDB's own title identifiers: they consist of "tt" followed by seven or eight digits and can be found inside the film URL. For example, "tt0050782" is the ID for this film: https://www.imdb.com/title/tt0050782/.
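
As a minimal sketch of the ID format described above, the snippet below pulls the ID out of a film URL with a regular expression. The helper name `extract_film_id` is ours and purely illustrative; the pattern simply encodes "tt followed by seven or eight digits".

```python
import re

# "tt" followed by seven or eight digits, as described above
FILM_ID_PATTERN = re.compile(r"tt\d{7,8}")

def extract_film_id(url):
    """Return the film ID found in an IMDB URL, or None if there is none."""
    match = FILM_ID_PATTERN.search(url)
    return match.group(0) if match else None

print(extract_film_id("https://www.imdb.com/title/tt0050782/"))  # tt0050782
```

Returning `None` rather than raising makes it easy to spot URLs that don't point to a film page.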

The first step will thus be retrieving all the IDs of the 10 000 most popular films on IMDB, which can be found at this [link](https://www.imdb.com/search/keyword/?mode=detail&page=1&title_type=movie&ref_=kw_ref_rt_vt&num_votes=5000%2C&sort=num_votes,desc).

## Scraping the film IDs and the URLs

To scrape the IDs of our 10 000 films we'll loop over the first 200 pages of the search results (each containing 50 films). The film ID is part of the 'href' attribute of the link inside each h3 element with the "lister-item-header" class, and we'll extract it with a simple regex. We'll also store the full URL: it's going to come in handy later when we scrape the individual film pages.

In [24]:
film_ids = []
film_urls = []

# The search results are paginated: 200 pages of 50 films each
for i in range(1, 201):

  content = requests.get("https://www.imdb.com/search/keyword/?mode=detail&page=" + str(i) + "&title_type=movie&ref_=kw_ref_rt_vt&num_votes=5000%2C&sort=num_votes,desc").content

  # Name the parser explicitly so the result doesn't depend on which parsers are installed
  soup = BeautifulSoup(content, "html.parser")

  for film in soup.find_all("h3", class_="lister-item-header"):

    # The href is a relative link that contains the film ID
    link = film.find("a").get("href")
    film_id = re.findall(pattern = r"tt\d+", string = link)[0]
    film_ids.append(film_id)

    film_url = "https://www.imdb.com" + link
    film_urls.append(film_url)

## API calls

We'll now write a function to request the data from TMDB. The function takes a list of film IDs and sends a GET request to the API for each of them. If the status code is 200 (i.e. the request was successful) we'll add the data to a Pandas dataframe. Otherwise, we'll store only the film ID and leave the remaining columns null.

In [None]:
import os

def build_film_df(film_ids):

  columns = ["id", "title", "release_date", "runtime", "country", "language", "genre", "studios", "budget", "revenue"]

  # Read the key from the environment instead of hardcoding it
  # (assumes it has been exported as TMDB_API_KEY)
  api_key = os.environ["TMDB_API_KEY"]

  # Collect one dict per film: DataFrame.append was removed in pandas 2.0
  rows = []

  for film_id in film_ids:

    response = requests.get("https://api.themoviedb.org/3/movie/" + film_id, params = {"api_key": api_key})

    if response.status_code == 200:

      response_json = response.json()

      # List-valued fields are flattened into semicolon-separated strings
      rows.append({"id": response_json["imdb_id"],
                   "title": response_json["title"],
                   "release_date": response_json["release_date"],
                   "runtime": response_json["runtime"],
                   "country": ';'.join([country['name'] for country in response_json["production_countries"]]),
                   "language": ';'.join([language["english_name"] for language in response_json["spoken_languages"]]),
                   "genre": ';'.join([genre["name"] for genre in response_json["genres"]]),
                   "studios": ';'.join([company["name"] for company in response_json['production_companies']]),
                   "budget": response_json['budget'],
                   "revenue": response_json["revenue"]})

    else:
      # Keep the ID so that failed requests show up as rows of nulls
      rows.append({"id": film_id})

  return pd.DataFrame(rows, columns = columns)

In [None]:
df = build_film_df(film_ids)
df.head()
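
To make the flattening step in `build_film_df` easier to follow, here is a small offline sketch. The `sample_response` dict is a hypothetical, heavily trimmed stand-in for a TMDB response (only the list-valued fields the function reads), and `join_names` is an illustrative helper that is not part of the notebook.

```python
def join_names(items, key="name"):
    """Join the `key` field of each dict in `items` with semicolons."""
    return ";".join(item[key] for item in items)

# Hypothetical, trimmed response: only the list-valued fields read above
sample_response = {
    "production_countries": [{"name": "Country A"}, {"name": "Country B"}],
    "spoken_languages": [{"english_name": "English"}, {"english_name": "French"}],
    "genres": [{"name": "Drama"}],
    "production_companies": [],
}

flat = {
    "country": join_names(sample_response["production_countries"]),
    "language": join_names(sample_response["spoken_languages"], key="english_name"),
    "genre": join_names(sample_response["genres"]),
    "studios": join_names(sample_response["production_companies"]),
}
print(flat)
```

Note that an empty list becomes an empty string rather than a null, which is worth keeping in mind when counting missing values later.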

Let's have a look at the films for which the request failed, i.e. those with a null title:

In [None]:
df[df['title'].isnull()]
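
Beyond listing the failed rows, it can help to count the nulls in every column. The sketch below does this on a toy frame; `toy` is made-up data standing in for the output of `build_film_df`, not real results.

```python
import pandas as pd

# Toy stand-in for the frame returned by build_film_df: the second request "failed"
toy = pd.DataFrame([
    {"id": "tt0000001", "title": "Film A", "runtime": 120},
    {"id": "tt0000002"},
    {"id": "tt0000003", "title": "Film C", "runtime": float("nan")},
])

# Number of nulls in each column
null_counts = toy.isna().sum()
print(null_counts)

# IDs of the films whose request failed (no title came back)
failed_ids = toy.loc[toy["title"].isna(), "id"].tolist()
print(failed_ids)
```

The list of failed IDs could then be passed to `build_film_df` again to retry only those requests.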

## Scraping additional data

TMDB doesn't provide data about the people who worked on a film (directors, writers, actors etc.) and its rating data is inaccurate. We'll thus supplement it by scraping some more information from the IMDB film pages.

### Custom functions

To keep the code modular and easy to test, we'll write a separate function for each piece of data we need. Each function takes the parsed page (a BeautifulSoup object) and returns a single value.

#### Film ID

In [77]:
def scrape_film_id(soup):
  
  return soup.find("meta", {"property": "imdb:pageConst"}).get("content")

#### Directors

In [78]:
def scrape_director(soup):

  links = soup.find_all(href = re.compile("tt_ov_dr"), class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")

  # Drop duplicate names while keeping the order in which they appear on the page
  directors = list(dict.fromkeys(a.text for a in links))

  return ';'.join(directors)

#### IMDB average rating

In [79]:
def scrape_imdb_rating(soup):
  
  pattern = r'(?<="ratingValue":)[\d.]+(?=},"contentRating")'
  string = str(soup.find("script", {"type": "application/ld+json"}))

  return re.findall(pattern = pattern,
                    string = string)[0]

#### IMDB rating count

In [80]:
def scrape_rating_count(soup):

  pattern = r'(?<="ratingCount":)[\d.]+'
  string = str(soup.find("script", {"type": "application/ld+json"}))

  return re.findall(pattern = pattern,
                    string = string)[0]
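
The two functions above read values out of the page's `application/ld+json` script with regexes. Since that script holds JSON, an alternative (not what the functions above do) is to parse it with `json.loads` and look the values up by key. The sample string below is a hypothetical, trimmed payload, and the `aggregateRating` key is an assumption about the page structure that should be checked against a real page.

```python
import json

# Hypothetical, trimmed stand-in for the text of the ld+json script tag
sample_ld_json = (
    '{"@type":"Movie",'
    '"aggregateRating":{"ratingCount":2505291,"ratingValue":9.3},'
    '"contentRating":"R"}'
)

def parse_ratings(ld_json_text):
    """Return (average rating, rating count), with None for missing values."""
    data = json.loads(ld_json_text)
    rating = data.get("aggregateRating", {})  # assumed key, see the note above
    return rating.get("ratingValue"), rating.get("ratingCount")

print(parse_ratings(sample_ld_json))  # (9.3, 2505291)
```

Parsing also returns numbers instead of strings, which would save a casting step during data cleaning.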

#### Metascore

In [81]:
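def scrape_metascore(soup):

  # Not every film has a Metascore, so fall back to NaN when the element is missing
  tag = soup.find("span", class_="score-meta")

  return tag.text if tag is not None else np.nan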

#### User reviews count

In [82]:
def scrape_user_review_count(soup):
  
  pattern = r'(?<="total":)\d+(?=,"__typename":"ReviewsConnection"})'
  
  string = str(soup.find("script", {'id': '__NEXT_DATA__'}))
  
  return re.findall(pattern = pattern, string = string)[0] if re.findall(pattern = pattern, string = string) else np.nan

#### Critic reviews count

In [83]:
def scrape_critic_review_count(soup):

  pattern = r'(?<=n"},"criticReviewsTotal":{"total":)\d+'
  
  string = str(soup.find("script", {'id': '__NEXT_DATA__'}))
  
  return re.findall(pattern = pattern, string = string)[0] if re.findall(pattern = pattern, string = string) else np.nan

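A later revision of the same function reads the count from the page markup instead of the embedded JSON. Since it is defined last, this is the version the scraper uses.

In [98]:
def scrape_critic_review_count(soup):
    
    # Keep the first span whose content mentions "Critic", then extract the number from its text
    string = list(filter(lambda x: 'Critic' in str(x), soup.find_all("span", class_="three-Elements")))[0].text
    
    return re.findall(r'\d+', string)[0]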

#### Color

In [84]:
def scrape_color(soup):
    
    pattern = r'(?<="text":")\w+(?=","attributes":\[\],"__typename":"Coloration"})'
    
    string = str(soup.find("script", {"type": "application/json"}))
    
    return re.findall(pattern = pattern, string = string)[0] if re.findall(pattern = pattern, string = string) else np.nan

#### Aspect ratio

In [85]:
def scrape_aspect_ratio(soup):
    
    pattern = r'(?<="aspectRatio":")[\d.\s:]+'
    
    string = str(soup.find("script", {"type": "application/json"}))
    
    return re.findall(pattern = pattern, string = string)[0] if re.findall(pattern = pattern, string = string) else np.nan
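
Several of the functions above share the same "first match or NaN" idiom and run the regex twice to achieve it. One possible refactoring, shown here only as a sketch, is a small helper that searches once. The name `first_match` is ours, and the sample string is a made-up fragment in the shape the aspect-ratio pattern expects.

```python
import math
import re

def first_match(pattern, string, default=math.nan):
    """Return the first regex match in `string`, or `default` if there is none."""
    match = re.search(pattern, string)
    return match.group(0) if match else default

ASPECT_RATIO_PATTERN = r'(?<="aspectRatio":")[\d.\s:]+'

# Hypothetical fragment in the shape the aspect-ratio pattern expects
sample = '{"aspectRatio":"1.85 : 1"}'

print(first_match(ASPECT_RATIO_PATTERN, sample))  # 1.85 : 1
print(first_match(ASPECT_RATIO_PATTERN, "{}"))    # nan
```

This behaves like `re.findall(...)[0]` only for patterns without capture groups, which is the case for all the patterns above.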

### Scraping

If we send 10 000 requests in one go we're likely to be throttled or blocked by IMDB, so let's split the film_urls list into 5 equally-sized chunks and scrape one chunk at a time, waiting five minutes before moving on to the next one.

In [None]:
def build_scraped_df(urls):

  columns = ["id", "director", "imdb_rating", "imdb_rating_count", "metascore", 
             "user_review_count", "critic_review_count", "color", "aspect_ratio"]

  # Collect one dict per film: DataFrame.append was removed in pandas 2.0
  rows = []

  chunks = np.array_split(urls, 5)

  for chunk in chunks:
    for url in chunk:
      content = requests.get(url).content
      soup = BeautifulSoup(content, "html.parser")

      rows.append({"id": scrape_film_id(soup),
                   "director": scrape_director(soup),
                   "imdb_rating": scrape_imdb_rating(soup),
                   "imdb_rating_count": scrape_rating_count(soup),
                   "metascore": scrape_metascore(soup),
                   "user_review_count": scrape_user_review_count(soup),
                   "critic_review_count": scrape_critic_review_count(soup),
                   "color": scrape_color(soup),
                   "aspect_ratio": scrape_aspect_ratio(soup)})

    # Pause for five minutes after each chunk
    time.sleep(60*5)

  return pd.DataFrame(rows, columns = columns)


In [None]:
df2 = build_scraped_df(film_urls)
df2.head()
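
The chunk-and-pause logic can be tried out without touching the network. The sketch below is a standard-library version of the same idea; `split_into_chunks`, `process_in_chunks` and the injectable `sleep` parameter are illustrative names of ours, and unlike the function above it skips the pause after the last chunk.

```python
import time

def split_into_chunks(items, n_chunks):
    """Split `items` into `n_chunks` consecutive chunks of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks get one extra item
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_in_chunks(items, handle, n_chunks=5, pause_seconds=300, sleep=time.sleep):
    """Call `handle` on every item, pausing between chunks but not after the last."""
    results = []
    chunks = split_into_chunks(items, n_chunks)
    for index, chunk in enumerate(chunks):
        results.extend(handle(item) for item in chunk)
        if index < len(chunks) - 1:
            sleep(pause_seconds)
    return results

# Dry run: record the pauses instead of actually sleeping
pauses = []
doubled = process_in_chunks(list(range(10)), lambda x: x * 2, sleep=pauses.append)
print(doubled)
print(pauses)
```

Passing `pauses.append` as `sleep` lets us check how many pauses would happen (four for five chunks) without waiting twenty minutes.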

To test the scraping functions on a small sample, we'll use a simpler version of the function without the chunking and the pauses.

In [99]:
def build_scraped_df(urls):

  columns = ["id", "director", "imdb_rating", "imdb_rating_count", "metascore", 
             "user_review_count", "critic_review_count", "color", "aspect_ratio"]

  # Collect one dict per film: DataFrame.append was removed in pandas 2.0
  rows = []

  for url in urls:
    content = requests.get(url).content
    soup = BeautifulSoup(content, "html.parser")

    rows.append({"id": scrape_film_id(soup),
                 "director": scrape_director(soup),
                 "imdb_rating": scrape_imdb_rating(soup),
                 "imdb_rating_count": scrape_rating_count(soup),
                 "metascore": scrape_metascore(soup),
                 "user_review_count": scrape_user_review_count(soup),
                 "critic_review_count": scrape_critic_review_count(soup),
                 "color": scrape_color(soup),
                 "aspect_ratio": scrape_aspect_ratio(soup)})

  return pd.DataFrame(rows, columns = columns)

In [100]:
test = build_scraped_df(film_urls[:50])

In [101]:
test

| | id | director | imdb_rating | imdb_rating_count | metascore | user_review_count | critic_review_count | color | aspect_ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0111161 | Frank Darabont | 9.3 | 2505291 | 80 | 9741 | 193 | Color | 1.85 : 1 |
| 1 | tt0468569 | Christopher Nolan | 9.0 | 2454761 | 84 | 7762 | 433 | | 2.39 : 1 |
| 2 | tt1375666 | Christopher Nolan | 8.8 | 2202866 | 74 | 4465 | 485 | Color | 2.39 : 1 |
| 3 | tt0137523 | David Fincher | 8.8 | 1970870 | 66 | 4127 | 384 | Color | 2.39 : 1 |
| 4 | tt0109830 | Robert Zemeckis | 8.8 | 1933770 | 82 | 2806 | 171 | Color | 2.39 : 1 |
| 5 | tt0110912 | Quentin Tarantino | 8.9 | 1932952 | 94 | 3377 | 302 | Color | 2.39 : 1 |
| 6 | tt0133093 | Lana Wachowski;Lilly Wachowski | 8.7 | 1789160 | 73 | 4650 | 235 | Color | 2.39 : 1 |
| 7 | tt0120737 | Peter Jackson | 8.8 | 1751884 | 92 | 5553 | 203 | Color | 2.39 : 1 |
| 8 | tt0167260 | Peter Jackson | 8.9 | 1730556 | 94 | 3939 | 361 | Color | 2.39 : 1 |
| 9 | tt0068646 | Francis Ford Coppola | 9.2 | 1726752 | 100 | 0 | 269 | Color | 1.85 : 1 |


#### Full cast

In [None]:
def scrape_actors(soup):
  
  actors = [img.get("alt") for img in soup.find("table", class_="cast_list").find_all("img")]
  return ';'.join(actors)

# Load the data onto a database