## IMDb Movie Scraper

Author: **Michael B (MSB46)**

Additional Credit to **Brian Sheehy** for [JSON API storage code](https://towardsdatascience.com/store-api-credentials-easily-and-securely-in-jupyter-notebooks-50411e98e81c)


## Objective:

The purpose of this notebook is to scrape information about the most popular and top-rated animated movies according to IMDb. After scraping the data, I convert it into a more readable format, a DataFrame, which will be cleaned and modeled later.

**_Note: The scrapers used were for educational purposes only._**


In [1]:
import os.path
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

links = []
driver = webdriver.Chrome() 

value = 1
next_page = ""

#### The first step is to gather all of the links that lead to the IMDb page of each animated movie. Note that the list of the "top" movies spans 7 pages as of October 2022.
#### To achieve this, I used Selenium's web driver to search for every element that leads to a movie page. Fortunately, IMDb's lists are consistently structured, which means I can get away with using a single XPath to gather each link.

In [2]:
if not os.path.isfile("links.txt"):
    with open("links.txt", "w") as l:
        for _ in range(10):
            driver.get('https://www.imdb.com/search/title/?title_type=feature&num_votes=10000,&genres=animation&sort=alpha,asc'+next_page+'&view=advanced')
            time.sleep(3)  # give the results page time to load
            movie_we = driver.find_elements(By.XPATH, "/html/body/div[2]/div/div[2]/div[3]/div[1]/div/div[3]/div/div/div[2]/a")

            # Record each movie's URL both in memory and in links.txt
            for element in movie_we:
                href = element.get_attribute('href')
                links.append(href)
                l.write(str(href) + "\n")

            # Advance the start index by 50 to request the next results page
            value += 50
            next_page = "&start=" + str(value)
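
The pagination used in the cell above can also be sketched without a browser. This is a minimal illustration, assuming that the `start` query parameter selects the first entry of a page and that pages advance 50 entries at a time, as the loop's arithmetic implies. `BASE_URL` and `page_urls` are hypothetical names introduced only for this example.

```python
BASE_URL = (
    "https://www.imdb.com/search/title/"
    "?title_type=feature&num_votes=10000,&genres=animation&sort=alpha,asc"
)

def page_urls(n_pages, page_size=50):
    """Build the search URL for each results page, mirroring the loop above."""
    urls = []
    for page in range(n_pages):
        # The first page carries no start parameter; later pages begin at 51, 101, ...
        start = "" if page == 0 else "&start=" + str(1 + page * page_size)
        urls.append(BASE_URL + start + "&view=advanced")
    return urls

for url in page_urls(3):
    print(url)
```

Building the URLs up front makes it easy to inspect them before handing each one to `driver.get`.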

#### Before traversing each link, I thought it'd be easier to scrape some of the data that is previewed in the list. This includes important information like the title, release year, and scores.

#### To perform this task, I used ParseHub because it streamlines scraping data, especially from a list. ParseHub also lets me efficiently convert and import the scraped data using an API key.

In [3]:
import requests
from tkinter import filedialog
from tkinter import Tk
import json

# Quickly get rid of the root window popup
root = Tk()
root.withdraw()

# Use filedialog.askopenfilename to open a dialog window where you can select your credentials file.
filepath = filedialog.askopenfilename()

# Open the credentials file with json.load; the with block closes the file automatically
with open(filepath, 'r') as file:
    credentials = json.load(file)

token = credentials['project_token_key']
api_key = credentials['api_key']

params = {
  "api_key": api_key,
  "format": "json"
}
r = requests.get('https://parsehub.com/api/v2/projects/'+token+'/last_ready_run/data', params=params)
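
For reference, the credentials handling can be tried without a file dialog or a network call. The sketch below assumes the credentials file is a flat JSON object holding the two keys read above. The helper names `load_credentials` and `run_data_url` and the placeholder values are hypothetical, and the example stops short of sending the request.

```python
import json
import os
import tempfile

def load_credentials(path):
    """Read the ParseHub credentials from a JSON file, closing it afterwards."""
    with open(path, "r") as f:
        credentials = json.load(f)
    return credentials["project_token_key"], credentials["api_key"]

def run_data_url(token):
    """Build the endpoint that the cell above queries for the last ready run."""
    return "https://parsehub.com/api/v2/projects/" + token + "/last_ready_run/data"

# Demonstrate with a throwaway file holding placeholder values
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credentials.json")
    with open(path, "w") as f:
        json.dump({"project_token_key": "PROJECT_TOKEN", "api_key": "API_KEY"}, f)
    token, api_key = load_credentials(path)

params = {"api_key": api_key, "format": "json"}
print(run_data_url(token))
```

Keeping the keys in a separate file means they never appear in the notebook itself.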

#### I don't want to stop just yet. I decided to add another column I thought could be interesting: budget. However, the budget can only be seen on a movie's own page and not in the list preview. This is where the list of links comes into play. I intend to traverse each link to scrape the available data on a movie's budget.

In [4]:
r.status_code

200

In [5]:
j = json.loads(r.text)
df = pd.DataFrame(j['movie'])
copy_df = df.copy()  # independent copy, so df keeps the raw ParseHub data
copy_df.insert(9,"budget_est","",True)
copy_df.insert(3,"story_desc","",True)
copy_df.insert(8,"country","",True)
copy_df.insert(9,"languages","",True)
copy_df.insert(7,"production_companies","",True)
copy_df.insert(9,"aspect_ratio","",True)
copy_df.insert(15,"writers","",True)

This is the part that gave me the most trouble. Unfortunately, the movie pages, while seemingly similar in layout, are not fully consistent with one another. For example, on most pages, the budget of a movie is the first span element in the Box Office section. However, some movie pages like [_An American Tail_](https://www.imdb.com/title/tt0090633/?ref_=adv_li_tt) don't include the budget. This means that using an element's class name or XPath to find the budget (and similar elements) works without a problem on some pages but fails on others. As a temporary "solution", I decided to label any unknown budget as "-1" so I can make note of it in the data cleaning process.

There are also some cases where something else is unexpectedly scraped instead of the budget even though the budget is there (for some movies I'm just given the U.S./Canada gross, for example). Due to the inconsistent layout of the pages, a "one size fits all" solution that perfectly parses the budget of each page isn't in the cards at the moment. On the bright side, the latter issue can easily be fixed thanks to an observation of mine regarding how the data in the budget column is presented (perhaps you might notice it too). But let's worry about that in the data cleaning process.
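
One way the cleaning step might separate real budgets from stray values is sketched below. It assumes, based only on the sample rows shown later in this notebook, that genuine budget strings end in `(estimated)`. The function name `parse_budget` is hypothetical, and the currency symbol is simply discarded, which would need revisiting for budgets not quoted in dollars.

```python
import re

def parse_budget(text):
    """Return the budget as an int, or None when the text is not a budget."""
    # Assumption: real budgets look like "$30,000,000 (estimated)"
    match = re.fullmatch(r"\D*(\d[\d,]*)\s*\(estimated\)", text.strip())
    if match is None:
        return None  # covers the "-1" placeholder and mis-scraped values
    return int(match.group(1).replace(",", ""))

for sample in ["$30,000,000 (estimated)", "-1", "$162,798,565"]:
    print(sample, "->", parse_budget(sample))
```

Returning `None` rather than a sentinel number keeps missing budgets from being mistaken for real amounts later on.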

In [6]:
def get_list_elements(l):
    # Join the text of every non-empty element into one comma-separated string.
    # Using join avoids a trailing ', ' when the last element is empty.
    return ', '.join(e.text for e in l if e.text != "")
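
A helper like this can be checked without launching a browser, since it only reads each element's `.text` attribute. The sketch below is an illustration rather than the notebook's own code: `join_texts` is a hypothetical stand-alone variant, and `SimpleNamespace` objects stand in for Selenium elements.

```python
from types import SimpleNamespace

def join_texts(elements):
    """Join the non-empty .text values of the given elements with ', '."""
    return ", ".join(e.text for e in elements if e.text != "")

# Stand-ins for WebElement objects; only the .text attribute is needed
fake = [SimpleNamespace(text=t) for t in ["United States", "", "Canada", ""]]
print(join_texts(fake))
```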

In [7]:
def read_random(file):
    lines = file.read().splitlines()
    return random.choice(lines)

In [8]:
from tqdm.notebook import tqdm
import random
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
x = 0

with open("links.txt") as file:
    for link in file:
        links.append(link)
        
    for link in tqdm(links, desc="Getting movie data...",colour='green',position=0, leave=True):
        driver.get(link)
        time.sleep(0.25)
#Budget
        try:
            b = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '.fJEELB li:first-of-type,  ipc-metadata-list-item__list-content-item'))).text
        
        except TimeoutException:
            print(link + ": Oops! Budget not Found!")
            b = "-1"
            
#Country
        try:
            c = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-details-origin']/div/ul/li")))
            
            countries = get_list_elements(c)

        except TimeoutException:
            print(link + ": Oops! Country not Found!")
            countries = "na"
           
#Languages
        try:
            l = WebDriverWait(driver, 5).until(
                    EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-details-languages']/div/ul/li")))
            
            languages = get_list_elements(l)
            
        except TimeoutException:
            print(link + ": Oops! Languages not Found!")
            languages = "na"
        
#Production Companies
        try:
            p = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-details-companies']/div/ul/li")))
            
            companies = get_list_elements(p)
            
        except TimeoutException:
            print(link + ": Oops! Companies not Found!")
            companies = "na"
    
#Aspect Ratio
        try:
            ratio = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH,"//*[@data-testid='title-techspec_aspectratio']/div/ul/li/label"))).text

        except TimeoutException:
            print(link + " Oops! No Aspect Ratio found!")
            ratio = "na"
    
#Director(s)
        try:
            driver.find_element(By.XPATH,'/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[4]/button').click()
            d = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-pc-principal-credit'][1]/div/ul/li")))
            
            directors = get_list_elements(d)
            
        except (NoSuchElementException, TimeoutException):
            print(link + ": Oops! Directors not Found!")
            directors = "na"

#Writer(s)
        try:
            w = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[4]/div/div/ul/li[2]/div/ul/li")))
            
            writers = get_list_elements(w)
        except TimeoutException:
            print(link + ": Oops! Writers not Found!")
            writers = "na"
#Story
        try:
            s = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[1]/div[2]/span[1]'))).text
            if s.endswith("... Read all"):
                driver.find_element(By.XPATH,'/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[1]/div[2]/span[1]/a').click()
                s = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[3]/div[1]/section/ul[1]/li[1]/p').text
            
        except (NoSuchElementException, TimeoutException):
            print(link + ": Oops! Story not Found!")
            s = "na"

        finally:
            copy_df['budget_est'][x] = b
            copy_df['story_desc'][x] = s
            copy_df['country'][x] = countries
            copy_df['languages'][x] = languages
            copy_df['production_companies'][x] = companies
            copy_df['aspect_ratio'][x] = ratio
            
            if not(directors == "na"):
                copy_df['director'][x] = directors
                
            copy_df['writers'][x] = writers
            
            x+=1
        
driver.quit()

Getting movie data...:   0%|          | 0/469 [00:00<?, ?it/s]

https://www.imdb.com/title/tt0103639/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt0090633/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt0047834/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0047834/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt0090667/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt1753496/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt7451284/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt7167630/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0106364/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt14324650/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt14402926/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0101414/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt14145426/
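
Each field in the loop follows the same pattern: wait for an element, and fall back to a placeholder when the wait fails. A minimal sketch of factoring that pattern into one helper follows. `scrape_or_default` is a hypothetical name, and the built-in `TimeoutError` stands in for Selenium's `TimeoutException` so that the example runs without a browser.

```python
def scrape_or_default(fetch, default, label, link, errors=(TimeoutError,)):
    """Call fetch(); on one of the given errors, log the miss and return default."""
    try:
        return fetch()
    except errors:
        print(link + ": Oops! " + label + " not Found!")
        return default

def missing_budget():
    # Simulates a wait that expires because the page has no budget entry
    raise TimeoutError

link = "https://www.imdb.com/title/tt0090633/"
budget = scrape_or_default(missing_budget, "-1", "Budget", link)
ratio = scrape_or_default(lambda: "1.85 : 1", "na", "Aspect Ratio", link)
print(budget, ratio)
```

In the notebook, `fetch` would wrap the `WebDriverWait(...).until(...)` call, and `errors` would be `(TimeoutException, NoSuchElementException)`.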

In [9]:
copy_df

| | name | rating | runtime | story_desc | genre | votescore | metacritic | production_companies | year | aspect_ratio | country | languages | votes | gross | budget_est | writers | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | PG-13 | 79 min | A rag doll that awakens in a postapocalyptic f... | Animation, Action, Adventure | 7.0 | 60 | Focus Features, Relativity Media, Arc Productions | (I) (2009) | 1.85 : 1 | United States, Canada, Luxembourg | English | 140989 | $31.74M | $30,000,000 (estimated) | Pamela Pettler(screenplay by), Shane Acker(sto... | Shane Acker |
| 1 | A Bug's Life | G | 95 min | A misfit ant, looking for "warriors" to save h... | Animation, Adventure, Comedy | 7.2 | 77 | Pixar Animation Studios, Walt Disney Pictures | (I) (1998) | 2.39 : 1 | United States | English | 293639 | $162.80M | $120,000,000 (estimated) | John Lasseter(original story by), Andrew Stant... | John Lasseter, Andrew Stanton(co-director) |
| 2 | A Christmas Carol | PG | 96 min | An animated retelling of Charles Dickens' clas... | Animation, Adventure, Comedy | 6.8 | 55 | Walt Disney Pictures, ImageMovers Digital, Ima... | (2009) | 2.39 : 1 | United States | English | 115942 | $137.86M | $200,000,000 (estimated) | Charles Dickens(based on the classic story by)... | Robert Zemeckis |
| 3 | A Goofy Movie | G | 78 min | When Max makes a preposterous promise to a gir... | Animation, Adventure, Comedy | 6.9 | 53 | Walt Disney Pictures, Disney Television Animat... | (1995) | 1.85 : 1 | United States, Australia, France, Canada | English | 55556 | $35.35M | $18,000,000 (estimated) | Jymn Magon(story by), Chris Matheson(screenpla... | Kevin Lima |
| 4 | A Scanner Darkly | R | 100 min | An undercover cop in a not-too-distant future ... | Animation, Comedy, Crime | 7.0 | 73 | Warner Independent Pictures (WIP), Thousand Wo... | (2006) | 1.85 : 1 | United States | English | 112638 | $5.50M | $8,700,000 (estimated) | Philip K. Dick(novel "A Scanner Darkly"), Rich... | Richard Linklater |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 316 | Wolfwalkers | PG | 103 min | After being snubbed by the royal family, a mal... | Animation, Adventure, Family | 8.0 | 87 | Walt Disney Animation Studios, Walt Disney Pro... | (2020) | na | United States | English | 32505 | | $6,000,000 (estimated) | Erdman Penner(story adaptation), Charles Perra... | Les Clark(sequence director), Clyde Geronimi(s... |
| 317 | Wreck-It Ralph | PG | 101 min | High up on a mountain peak surrounded by cloud... | Animation, Adventure, Comedy | 7.7 | 72 | Warner Bros., Cartoon Network Movies, Warner A... | (2012) | 2.35 : 1 | United States | English, Mandarin | 422751 | $189.42M | $80,000,000 (estimated) | Karey Kirkpatrick(screenplay by), Clare Sera(s... | Karey Kirkpatrick, Jason Reisig(co-director) |
| 318 | Yellow Submarine | G | 85 min | In this fully animated, all-new take on the Sm... | Animation, Adventure, Comedy | 7.4 | 79 | Bontonfilm Studios, Columbia Pictures, Kerner ... | (1968) | 1.85 : 1 | United States, Hong Kong, China, Czech Republic | English | 26576 | $0.99M | $60,000,000 (estimated) | Stacey Harman, Pamela Ribon, Peyo(based on the... | Kelly Asbury |
| 319 | Zootopia | PG | 108 min | Exiled into the dangerous forest by her wicked... | Animation, Adventure, Comedy | 8.0 | 78 | Walt Disney Animation Studios | (2016) | 1.37 : 1 | United States | English | 497059 | $341.27M | $1,499,000 (estimated) | Jacob Grimm(fairy tales), Wilhelm Grimm(fairy ... | William Cottrell(sequence director), David Han... |


#### I probably could've used ParseHub all the way instead of relying on Selenium for a few parts. Unfortunately, going through 300+ movie pages, as opposed to only the 7 pages of the list, requires a lot of traversing, something the free version of ParseHub doesn't care much for. Oh well.

In [10]:
copy_df.to_csv("imdb_animated_movies.csv", index = False)

In [11]:
copy_df[copy_df['country'] == 'na']

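| | name | rating | runtime | story_desc | genre | votescore | metacritic | production_companies | year | aspect_ratio | country | languages | votes | gross | budget_est | writers | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|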


With the table exported to a readable CSV file, the data scraping process is concluded. The next step is to clean the data, using the newly created CSV file as a base.