## IMDb Movie Scraper

Author: **Michael B (MSB46)**

Additional Credit to **Brian Sheehy** for [JSON API storage code](https://towardsdatascience.com/store-api-credentials-easily-and-securely-in-jupyter-notebooks-50411e98e81c)


## Objective:

The purpose of this notebook is to scrape information about the most popular and top-rated animated movies on IMDb. Once scraped, the data is loaded into a DataFrame, a more readable format that will be cleaned and modeled later.

**_Note: The scrapers used were for educational purposes only._**


In [1]:
import os.path
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

links = []
driver = webdriver.Chrome() 

value = 1
next_page = ""

#### The first step is to gather the links that lead to the IMDb page of each animated movie. Note that the list of "top" movies spans 7 pages as of October 2022.
#### To achieve this, I used Selenium's web driver to find every element that links to a movie page. Fortunately, IMDb's lists are consistently structured, so a single XPath is enough to gather each link.

In [2]:
if not os.path.isfile("links.txt"):
    with open("links.txt", "w") as link_file:
        for _ in range(10):
            driver.get('https://www.imdb.com/search/title/?title_type=feature&num_votes=10000,&genres=animation&sort=alpha,asc'+next_page+'&view=advanced')
            time.sleep(3)  # give the list page time to render
            movie_we = driver.find_elements(By.XPATH, "/html/body/div[2]/div/div[2]/div[3]/div[1]/div/div[3]/div/div/div[2]/a")

            # Record every movie link both in memory and on disk
            for element in movie_we:
                href = element.get_attribute('href')
                links.append(href)
                link_file.write(str(href) + "\n")

            # Advance the start offset by 50 for the next list page
            value += 50
            next_page = "&start=" + str(value)
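
The paging logic above can be isolated into a small pure function, which makes it easy to check the offsets before launching the browser. This is only a minimal sketch: `build_list_urls`, `BASE_URL`, and the `page_size` parameter are hypothetical names, and the step of 50 is taken from the `value += 50` increment in the loop.

```python
BASE_URL = (
    "https://www.imdb.com/search/title/"
    "?title_type=feature&num_votes=10000,&genres=animation&sort=alpha,asc"
)


def build_list_urls(pages, page_size=50):
    """Return one list-page URL per page, mirroring the loop above."""
    urls = []
    for page in range(pages):
        # The first page has no start parameter; later pages start at 51, 101, ...
        start = "" if page == 0 else f"&start={page * page_size + 1}"
        urls.append(f"{BASE_URL}{start}&view=advanced")
    return urls


for url in build_list_urls(3):
    print(url)
```

Each URL could then be handed to `driver.get` in turn, which keeps the URL construction separate from the browser work.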

#### Before traversing each link, I thought it would be easier to scrape some of the data that is previewed in the list. This includes important information like the title, release year, and scores.

#### To perform this task, I used ParseHub because it streamlines scraping data, especially from a list. ParseHub also lets me efficiently convert and import the scraped data using an API key.

In [3]:
import requests
from tkinter import filedialog
from tkinter import Tk
import json

# Quickly get rid of the root window popup
root = Tk()
root.withdraw()

# Use filedialog.askopenfilename to open a dialog window where you can select your credentials file.
filepath = filedialog.askopenfilename()

# Read the credentials file with json.load; the with block closes the file automatically
with open(filepath, 'r') as file:
    credentials = json.load(file)

token = credentials['project_token_key']
api_key = credentials['api_key']

params = {
  "api_key": api_key,
  "format": "json"
}
# Request the data from the project's last ready ParseHub run
r = requests.get('https://parsehub.com/api/v2/projects/'+token+'/last_ready_run/data', params=params)
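
If the credentials file is missing one of the two keys, the cell above fails with a bare `KeyError` only after the dialog has closed. As a hedged sketch (the helper names `load_credentials` and `build_run_url` are hypothetical and not part of the original storage code), the loading step could validate the file up front. The key names and the endpoint are the ones used in this notebook.

```python
import json

REQUIRED_KEYS = ("project_token_key", "api_key")


def load_credentials(path):
    """Read the credentials JSON and fail early if a key is missing."""
    with open(path, "r", encoding="utf-8") as fh:
        credentials = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"credentials file is missing: {', '.join(missing)}")
    return credentials


def build_run_url(project_token):
    """Build the ParseHub endpoint used above for the last ready run."""
    return f"https://parsehub.com/api/v2/projects/{project_token}/last_ready_run/data"
```

The path passed to `load_credentials` could still come from `filedialog.askopenfilename()`, so the interactive file picker stays the same.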

#### I don't want to stop just yet. I decided to add another column I thought could be interesting: budget. However, the budget can only be seen on a movie's page and not in the preview. This is where the list of links comes into play. I intend to visit each link to scrape the available data on a movie's budget.

In [4]:
r.status_code

200

In [5]:
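j = json.loads(r.text)
df = pd.DataFrame(j['movie'])
# Work on an independent copy so the raw ParseHub data in df stays untouched
copy_df = df.copy()
# Add empty columns to be filled in from each movie page
copy_df.insert(7,"budget_est","",True)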
copy_df.insert(7,"opening_weekend","",True)
copy_df.insert(7,"na_gross","",True)
copy_df.insert(7,"worldwide_gross","",True)
copy_df.insert(3,"story_desc","",True)
copy_df.insert(3,"genres","",True)
copy_df.insert(8,"country","",True)
copy_df.insert(9,"languages","",True)
copy_df.insert(7,"production_companies","",True)
copy_df.insert(9,"aspect_ratio","",True)
copy_df.insert(17,"writers","",True)
copy_df.insert(17,"director","",True)

This is the part that gave me the most trouble. Unfortunately, the movie pages look similar but are not laid out identically. For example, on most pages the budget is the first span element in the Box Office section of the page. However, some movie pages, like [_An American Tail_](https://www.imdb.com/title/tt0090633/?ref_=adv_li_tt), don't include the budget. As a result, locating the budget (and similar elements) by class name or XPath works without a problem on some pages but fails on others. As a temporary "solution", I decided to label any unknown budget as "-1" so I can make note of it in the data cleaning process.

There are also some cases where something else is unexpectedly scraped instead of the budget even though the budget is there (for some movies I'm given just the U.S./Canada gross, for example). Due to the inconsistent layout of the pages, a "one size fits all" solution that perfectly parses the budget on every page isn't in the cards at the moment. On the bright side, this second problem can be fixed easily thanks to an observation of mine about how the data in the budget column is presented (perhaps you might notice it too). But let's worry about that in the data cleaning process.
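
As a rough sketch of how the "-1" placeholder might be handled later, the snippet below splits a scraped money string into a currency symbol and an integer amount, and returns `None` for the placeholder. `parse_money` and `MISSING` are hypothetical names that are not part of this notebook, and the sample strings simply follow the format seen in the scraped table (for example `$30,000,000 (estimated)`).

```python
import re

MISSING = "-1"  # placeholder written by the scraper when a field was not found


def parse_money(raw):
    """Return (currency_symbol, amount), or None when the value is missing."""
    if raw is None or raw.strip() in ("", MISSING):
        return None
    # Drop a trailing parenthesised note such as "(estimated)".
    text = re.sub(r"\s*\(.*?\)\s*$", "", raw.strip())
    match = re.match(r"^(\D*)([\d,]+)$", text)
    if match is None:
        return None
    symbol = match.group(1).strip()
    amount = int(match.group(2).replace(",", ""))
    return symbol, amount


print(parse_money("$30,000,000 (estimated)"))
print(parse_money("¥1,100,000,000 (estimated)"))
print(parse_money("-1"))
```

Keeping the currency symbol alongside the amount matters because not every budget is reported in U.S. dollars.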

In [6]:
def get_list_elements(l):
    # Join the text of every non-empty element into one comma-separated string.
    # Using join avoids a dangling ", " when the last element is empty.
    return ', '.join(element.text for element in l if element.text != "")

In [7]:
import random

def read_random(file):
    # Return one randomly chosen line from an open text file
    lines = file.read().splitlines()
    return random.choice(lines)

In [8]:
from tqdm.notebook import tqdm
import random
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
x = 0

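with open("links.txt") as file:
    # Rebuild the list from the file so links are not duplicated if In [2] also ran,
    # and strip the trailing newline from each line
    links = [link.strip() for link in file if link.strip()]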
        
    for link in tqdm(links, desc="Getting movie data...",colour='green',position=0, leave=True):
        driver.get(link)
        time.sleep(0.25)

        
#Genres
        try:
            g = WebDriverWait(driver, 15).until(
                EC.presence_of_all_elements_located((By.XPATH,"//li[@data-testid='storyline-genres']/div/ul/li")))
            genres = get_list_elements(g)

        except TimeoutException:
            print(link + ": Oops! Genres not Found!")
            genres = "n/a"
#Budget
        try:
            b = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, "//*[@data-testid='title-boxoffice-budget']/div/ul/li/label"))).text
        
        except TimeoutException:
            print(link + ": Oops! Budget not Found!")
            b = "-1"

#Opening Weekend
        try:
            op = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, "//*[@data-testid='title-boxoffice-openingweekenddomestic']/div/ul/li/label"))).text
        
        except TimeoutException:
            print(link + ": Oops! Opening Weekend not Found!")
            op = "-1"            

#North American Gross
        try:
            nag = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, "//*[@data-testid='title-boxoffice-grossdomestic']/div/ul/li/label"))).text
        
        except TimeoutException:
            print(link + ": Gross! Domestic Gross not Found!")
            nag = "-1"
            
#Worldwide Gross
        try:
            ww = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, "//*[@data-testid='title-boxoffice-cumulativeworldwidegross']/div/ul/li/label"))).text
        
        except TimeoutException:
            print(link + ": Gross! Worldwide Gross not Found!")
            ww = "-1"


#Country
        try:
            c = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-details-origin']/div/ul/li")))
            
            countries = get_list_elements(c)

        except TimeoutException:
            print(link + ": Oops! Country not Found!")
            countries = "n/a"
           
#Languages
        try:
            l = WebDriverWait(driver, 5).until(
                    EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-details-languages']/div/ul/li")))
            
            languages = get_list_elements(l)
            
        except TimeoutException:
            print(link + ": Oops! Languages not Found!")
            languages = "n/a"
        
#Production Companies
        try:
            p = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-details-companies']/div/ul/li")))
            
            companies = get_list_elements(p)
            
        except TimeoutException:
            print(link + ": Oops! Companies not Found!")
            companies = "n/a"
    
#Aspect Ratio
        try:
            ratio = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH,"//*[@data-testid='title-techspec_aspectratio']/div/ul/li/label"))).text

        except TimeoutException:
            print(link + " Oops! No Aspect Ratio found!")
            ratio = "n/a"
    
#Director(s)
        try:
            driver.find_element(By.XPATH,'/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[4]/button').click()
            d = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//*[@data-testid='title-pc-principal-credit'][1]/div/ul/li")))
            
            directors = get_list_elements(d)
            
        except (TimeoutException, NoSuchElementException):
            print(link + ": Oops! Directors not Found!")
            directors = "n/a"

#Writer(s)
        try:
            w = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH,"//div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[4]/div/div/ul/li[2]/div/ul/li")))
            
            writers = get_list_elements(w)
        except TimeoutException:
            print(link + ": Oops! Writers not Found!")
            writers = "n/a"
#Story
        try:
            s = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[1]/div[2]/span[1]'))).text
            if s.endswith("... Read all"):
                driver.find_element(By.XPATH,'/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[1]/div[2]/span[1]/a').click()
                s = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[3]/div[1]/section/ul[1]/li[1]/p').text
            
        except (NoSuchElementException, TimeoutException):
            print(link + ": Oops! Story not Found!")
            s = "n/a"

        finally:
            # Write each value with .at to avoid pandas chained assignment
            copy_df.at[x, 'budget_est'] = b
            copy_df.at[x, 'opening_weekend'] = op
            copy_df.at[x, 'na_gross'] = nag
            copy_df.at[x, 'worldwide_gross'] = ww

            copy_df.at[x, 'story_desc'] = s
            copy_df.at[x, 'genres'] = genres
            copy_df.at[x, 'country'] = countries
            copy_df.at[x, 'languages'] = languages
            copy_df.at[x, 'production_companies'] = companies
            copy_df.at[x, 'aspect_ratio'] = ratio

            copy_df.at[x, 'writers'] = writers
            copy_df.at[x, 'director'] = directors

            # Move on to the next DataFrame row
            x += 1
        
driver.quit()
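
The cell above repeats the same try/except pattern for every field. One way to shorten it, shown here as a sketch only, is a small wrapper that runs a lookup and falls back to a default when an expected error occurs. `scrape_field` is a hypothetical helper; with Selenium, the `expected` argument would presumably be `(TimeoutException, NoSuchElementException)` and `fetch` would wrap the `WebDriverWait` call. The stub functions below stand in for the browser so the example runs on its own.

```python
def scrape_field(fetch, default, label, expected=(LookupError, TimeoutError)):
    """Call fetch(); on an expected failure, report it and return default."""
    try:
        return fetch()
    except expected as error:
        print(f"Oops! {label} not found ({type(error).__name__})")
        return default


# Stand-ins for real page lookups (assumptions, for illustration only).
def fake_budget():
    return "$30,000,000 (estimated)"


def fake_missing():
    raise TimeoutError("element never appeared")


budget = scrape_field(fake_budget, "-1", "Budget")
gross = scrape_field(fake_missing, "-1", "Worldwide gross")
print(budget, gross)
```

Errors outside `expected` still propagate, so genuine bugs are not silently turned into placeholder values.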

Getting movie data...:   0%|          | 0/469 [00:00<?, ?it/s]

https://www.imdb.com/title/tt6193408/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt6193408/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt6193408/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0103639/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt0043274/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt0043274/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0043274/?ref_=adv_li_i
: Gross! Worldwide Gross not Found!
https://www.imdb.com/title/tt0090633/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0090633/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt0101329/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0047834/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0047834/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
h

https://www.imdb.com/title/tt3513500/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt3513500/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0275277/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt2263944/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0142236/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0142242/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0099472/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0033563/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt0033563/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0923811/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0860907/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0860906/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/t

https://www.imdb.com/title/tt2591814/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt2591814/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0381348/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0381348/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt0381348/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0104652/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt2668134/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt2668134/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0090248/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0090248/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt0090248/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0090248/?ref_=adv_li_i
: Gross! Worldwide

https://www.imdb.com/title/tt3152592/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt1217213/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt1217213/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0291350/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt2458948/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt2458948/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt2458948/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0169858/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0169858/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt0169858/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0169880/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0169880/?ref_=adv_li_i
: Oops! Opening Weekend not

https://www.imdb.com/title/tt6587640/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt8097030/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt8097030/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt0388473/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0961097/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt0961097/?ref_=adv_li_i
: Gross! Domestic Gross not Found!
https://www.imdb.com/title/tt1673702/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt1673702/?ref_=adv_li_i
 Oops! No Aspect Ratio found!
https://www.imdb.com/title/tt0216651/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt6338498/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt6338498/?ref_=adv_li_i
: Oops! Opening Weekend not Found!
https://www.imdb.com/title/tt6338498/?ref_=adv_li_i
: Gross! Domestic Gross not Foun

In [9]:
copy_df[:11]

Unnamed: 0,name,rating,runtime,genres,story_desc,votescore,metacritic,production_companies,year,aspect_ratio,country,languages,votes,worldwide_gross,na_gross,opening_weekend,budget_est,director,writers
0,9,PG-13,79 min,"Animation, Action, Adventure, Drama, Fantasy, ...",A rag doll that awakens in a postapocalyptic f...,7.0,60,"Focus Features, Relativity Media, Arc Productions",(I) (2009),1.85 : 1,"United States, Canada, Luxembourg",English,141104,"$48,428,063","$31,749,894","$10,740,446","$30,000,000 (estimated)",Shane Acker,"Pamela Pettler(screenplay by), Shane Acker(sto..."
1,A Bug's Life,G,95 min,"Animation, Adventure, Comedy, Family, Fantasy","A misfit ant, looking for ""warriors"" to save h...",7.2,77,"Pixar Animation Studios, Walt Disney Pictures",(I) (1998),2.39 : 1,United States,English,293994,"$363,258,859","$162,798,565","$291,121","$120,000,000 (estimated)","John Lasseter, Andrew Stanton(co-director)","John Lasseter(original story by), Andrew Stant..."
2,A Christmas Carol,PG,96 min,"Animation, Adventure, Comedy, Drama, Family, F...",An animated retelling of Charles Dickens' clas...,6.8,55,"Walt Disney Pictures, ImageMovers Digital, Ima...",(2009),2.39 : 1,United States,English,116224,"$325,286,646","$137,855,863","$30,051,075","$200,000,000 (estimated)",Robert Zemeckis,Charles Dickens(based on the classic story by)...
3,A Goofy Movie,G,78 min,"Animation, Adventure, Comedy, Family, Musical,...",When Max makes a preposterous promise to a gir...,6.9,53,"Walt Disney Pictures, Disney Television Animat...",(1995),1.85 : 1,"United States, Australia, France, Canada",English,55766,"$35,348,597","$35,348,597","$6,129,557","$18,000,000 (estimated)",Kevin Lima,"Jymn Magon(story by), Chris Matheson(screenpla..."
4,A Scanner Darkly,R,100 min,"Animation, Comedy, Crime, Drama, Mystery, Sci-...",An undercover cop in a not-too-distant future ...,7.0,73,"Warner Independent Pictures (WIP), Thousand Wo...",(2006),1.85 : 1,United States,English,112719,"$7,659,918","$5,501,616","$391,672","$8,700,000 (estimated)",Richard Linklater,"Philip K. Dick(novel ""A Scanner Darkly""), Rich..."
5,A Shaun the Sheep Movie: Farmageddon,G,86 min,"Animation, Adventure, Comedy, Family, Fantasy,...",When an alien with amazing powers crash-lands ...,6.8,79,"Aardman Animations, Amazon Prime Video, Anton",(2019),2.35 : 1,"United Kingdom, France, Belgium, United States",English,13951,"$43,121,792",-1,-1,-1,"Will Becher, Richard Phelan","Mark Burton, Jon Brown, Richard Starzak(based ..."
6,Abominable,PG,97 min,"Animation, Adventure, Comedy, Family, Fantasy",Three teenagers must help a Yeti return to his...,7.0,61,"DreamWorks Animation, Pearl Studio, China Film...",(2019),1.85 : 1,"United States, China, Japan","English, Mandarin",38879,"$190,514,622","$61,270,390","$20,612,100","$75,000,000 (estimated)","Jill Culton, Todd Wilderman(co-director)",Jill Culton
7,Akira,R,124 min,"Animation, Action, Drama, Fantasy, Sci-Fi, Thr...",A secret military project endangers Neo-Tokyo ...,8.0,67,"Akira Committee Company Ltd., Akira Studio, TM...",(1988),1.85 : 1,Japan,Japanese,187247,"$2,534,069","$553,171","$11,263","¥1,100,000,000 (estimated)",Katsuhiro Ôtomo(supervising director),"Katsuhiro Ôtomo(screenplay), Izô Hashimoto(scr..."
8,Aladdin,G,90 min,"Animation, Adventure, Comedy, Family, Fantasy,...",A kindhearted street urchin and a power-hungry...,8.0,86,"Walt Disney Pictures, Silver Screen Partners I...",(1992),,United States,English,419561,"$504,050,219","$217,350,219","$196,664","$28,000,000 (estimated)","Ron Clements, John Musker","Ron Clements(screenplay by), John Musker(scree..."
9,Alice in Wonderland,G,75 min,"Animation, Adventure, Comedy, Family, Fantasy,...",Alice stumbles into the world of Wonderland. W...,7.4,68,Walt Disney Animation Studios,(1951),1.37 : 1,United States,English,142935,-1,-1,-1,"$3,000,000 (estimated)","Clyde Geronimi, Wilfred Jackson, Hamilton Luske","Lewis Carroll(adaptation: of ""The Adventures o..."


In [10]:
copy_df.iloc[10]

name                                                All Dogs Go to Heaven
rating                                                                  G
runtime                                                            84 min
genres                  Animation, Adventure, Comedy, Drama, Family, F...
story_desc              A canine angel, Charlie, sneaks back to earth ...
votescore                                                             6.7
metacritic                                                             50
production_companies    Goldcrest Films International, Don Bluth Produ...
year                                                               (1989)
aspect_ratio                                                     1.37 : 1
country                            Ireland, United Kingdom, United States
languages                                                         English
votes                                                              42,514
worldwide_gross                       

#### I probably could have used ParseHub all the way instead of relying on Selenium for a few parts. Unfortunately, going through 300+ movie pages, as opposed to only the 7 pages of the list, requires a lot of traversing, something the free version of ParseHub doesn't care much for. Oh well.

In [11]:
copy_df.to_csv("imdb_animated_movies.csv", index = False)

In [12]:
# The scraper writes "n/a" when a country is not found
copy_df[copy_df['country'] == 'n/a']

Unnamed: 0,name,rating,runtime,genres,story_desc,votescore,metacritic,production_companies,year,aspect_ratio,country,languages,votes,worldwide_gross,na_gross,opening_weekend,budget_est,director,writers


After saving this table as a readable CSV, the data scraping process is complete. The next step is cleaning the data, using the newly created CSV file as a base.