## IMDb Movie Scraper

Author: **Michael B (MSB46)**

Additional credit to **Brian Sheehy** for the [JSON API storage code](https://towardsdatascience.com/store-api-credentials-easily-and-securely-in-jupyter-notebooks-50411e98e81c)


## Objective:

The purpose of this notebook is to scrape information about the most popular and top-rated animated movies according to IMDb. Once the data is scraped, I will convert it into a more readable format as a DataFrame, which will be cleaned and modeled later.

**_Note: The scrapers used were for educational purposes only._**


In [1]:
import os.path
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

links = []
driver = webdriver.Chrome() 

value = 1
next_page = ""

#### The first step is to gather all of the links that lead to the IMDb page of each animated movie. Note that the list of "top" movies spans 7 pages as of October 2022.
#### To achieve this, I used Selenium's web driver to search for any element that leads to a movie page. Fortunately, IMDb's lists are consistently structured, which means I can get away with using a single XPath to gather each link.

In [2]:
# Only scrape the list pages if the links have not already been saved.
if not os.path.isfile("links.txt"):
    with open("links.txt", "w") as l:
        for _ in range(7):  # 7 list pages, 50 movies per page
            driver.get('https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=animation&sort=alpha,asc'+next_page+'&view=advanced')
            time.sleep(3)  # give the page time to load
            movie_we = driver.find_elements(By.XPATH, "/html/body/div[2]/div/div[2]/div[3]/div[1]/div/div[3]/div/div/div[2]/a")

            # Save every movie link both in memory and in links.txt.
            for element in movie_we:
                href = element.get_attribute('href')
                links.append(href)
                l.write(str(href) + "\n")

            # Advance to the next page of results.
            value += 50
            next_page = "&start=" + str(value)
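The paging logic above can also be sketched on its own, without a browser. This is a minimal sketch, assuming the same query parameters as the URL in the cell above; `list_page_urls`, `BASE_URL`, and `page_size` are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlencode

# Assumption: same search endpoint and query parameters as the cell above.
BASE_URL = "https://www.imdb.com/search/title/"


def list_page_urls(pages=7, page_size=50):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(pages):
        query = {
            "title_type": "feature",
            "num_votes": "25000,",
            "genres": "animation",
            "sort": "alpha,asc",
            "view": "advanced",
        }
        # The first page needs no offset; later pages start at 51, 101, ...
        if page > 0:
            query["start"] = 1 + page * page_size
        # safe="," keeps the commas readable instead of encoding them as %2C.
        urls.append(BASE_URL + "?" + urlencode(query, safe=","))
    return urls


for url in list_page_urls():
    print(url)
```

Building the URLs up front makes it easy to check the offsets before starting the driver.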

#### Before traversing each link, I thought it'd be easier to scrape some of the data that is previewed in the list. This includes important information like the title, release year, and scores.

#### To perform this task, I used ParseHub because it streamlines scraping data, especially from a list. ParseHub also lets me efficiently convert and import the scraped data using an API key.

In [3]:
import requests
from tkinter import filedialog
from tkinter import Tk
import json

# Quickly get rid of the root window popup
root = Tk()
root.withdraw()

# Use filedialog.askopenfilename to open a dialog window where you can select your credentials file.
filepath = filedialog.askopenfilename()

# Load the credentials with json.load; the with block closes the file afterwards.
with open(filepath, 'r') as file:
    credentials = json.load(file)

token = credentials['project_token_key']
api_key = credentials['api_key']

params = {
  "api_key": api_key,
  "format": "json"
}
# Request the data from the project's most recent completed run.
r = requests.get('https://parsehub.com/api/v2/projects/'+token+'/last_ready_run/data', params=params)
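If the file dialog is not needed, the credentials step can be reduced to a small function. This is a minimal sketch, assuming the credentials file is a JSON object with the same two keys used above (`project_token_key` and `api_key`); `load_credentials` and `REQUIRED_KEYS` are hypothetical names, not part of ParseHub or the credited article.

```python
import json

# Assumption: the credentials file contains exactly these two keys.
REQUIRED_KEYS = ("project_token_key", "api_key")


def load_credentials(path):
    """Read the credentials file and return (project_token, api_key)."""
    with open(path, "r", encoding="utf-8") as handle:
        credentials = json.load(handle)

    # Fail early with a clear message instead of a bare KeyError later on.
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"credentials file is missing: {', '.join(missing)}")

    return credentials["project_token_key"], credentials["api_key"]
```

Checking the keys before making the request keeps a malformed credentials file from turning into a confusing API error.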

#### I don't want to stop just yet. I decided to add another column I thought could be interesting: budget. However, the budget can only be seen on a movie's page, not in the list preview. This is where the list of links comes into play. I intend to traverse each link to scrape the available budget data for each movie.

In [4]:
r.status_code

200

In [5]:
j = json.loads(r.text)
df = pd.DataFrame(j['movie'])
copy_df = df.copy()  # independent copy, so df keeps the raw ParseHub data
copy_df.insert(9,"budget_est","",True)
copy_df.insert(3,"story_desc","",True)

This is the part that gave me the most trouble. Unfortunately, the movie pages, while seemingly similar in layout, are not fully consistent with one another. For example, on most pages the budget is the first span element in the Box Office section. However, some movie pages, like [_An American Tail_](https://www.imdb.com/title/tt0090633/?ref_=adv_li_tt), don't include the budget. This means that using an element's class name or XPath to find the budget (and similar elements) works without a problem on some pages but fails on others. As a temporary "solution", I decided to label any unknown budget as "-1" so I can make note of it in the data cleaning process.

There are also some cases where something else is unexpectedly scraped instead of the budget even though the budget is there (for some movies I'm just given the U.S./Canada gross, for example). Due to the inconsistent layout of the pages, a "one size fits all" solution that perfectly parses the budget on every page isn't in the cards at the moment. On the bright side, the latter problem can easily be fixed thanks to an observation of mine about how the data in the budget column is presented (perhaps you might notice it too). But let's worry about that in the data cleaning process.
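As a rough illustration of how such a check could look during cleaning, here is a minimal sketch. It assumes that a genuine budget value ends in "(estimated)", as the budget values in this notebook's output do; whether that is the observation meant above is my assumption. `parse_budget` is a hypothetical helper and is not used anywhere else in this notebook.

```python
import re

# Assumption: a real budget looks like "<currency><digits with commas> (estimated)".
_BUDGET = re.compile(r"^(\D*?)\s*(\d[\d,]*)\s*\(estimated\)$")


def parse_budget(raw):
    """Return (currency, amount) for a budget string, or None if it is not one."""
    match = _BUDGET.match(raw.strip())
    if match is None:
        # Covers the "-1" placeholder and values such as a gross figure.
        return None
    currency, digits = match.groups()
    return currency.strip(), int(digits.replace(",", ""))


print(parse_budget("$30,000,000 (estimated)"))  # ('$', 30000000)
print(parse_budget("-1"))                        # None
```

Keeping the currency symbol separate matters because not every budget is reported in U.S. dollars.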

In [6]:
from tqdm import tqdm
from selenium.common.exceptions import NoSuchElementException
x = 0

with open("links.txt") as file:
    for link in file:
        driver.get(link)
        time.sleep(2)

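        try:
            # NOTE: the second selector has no leading ".", so CSS reads it as a tag name rather than a class name.
            b = driver.find_element(By.CSS_SELECTOR, '.fJEELB li:first-of-type,  ipc-metadata-list-item__list-content-item').text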
        
        except NoSuchElementException:
            print(link + ": Oops! Budget not Found!")
            b = "-1"
            
        try:
            s = driver.find_element(By.XPATH, '/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[1]/div[2]/span[1]').text
            if s.endswith("... Read all"):
                driver.find_element(By.XPATH,'/html/body/div[2]/main/div/section[1]/section/div[3]/section/section/div[3]/div[2]/div[1]/div[1]/div[2]/span[1]/a').click()
                s = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[3]/div[1]/section/ul[1]/li[1]/p').text
#             print(s)
            
        except NoSuchElementException:
            print(link + ": Oops! Story not Found!")
            s = "na"

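        # Store the results with .loc; chained indexing (copy_df[col][x] = ...) may write to a temporary copy.
        copy_df.loc[x, 'budget_est'] = b
        copy_df.loc[x, 'story_desc'] = s
        x += 1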
        
driver.quit()

https://www.imdb.com/title/tt0385700/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt9848626/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0076363/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt7979580/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt9288046/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt11657662/?ref_=adv_li_i
: Oops! Budget not Found!
https://www.imdb.com/title/tt0078480/?ref_=adv_li_i
: Oops! Budget not Found!
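Since the page layouts differ, one way to organise the lookups is to try several candidates in order and fall back to a placeholder only when all of them fail. This is a minimal sketch and not the approach used in the cell above; `first_available` is a hypothetical helper, and the plain dictionary stands in for a web page so the example runs without a browser. With Selenium, each lookup would be a function that calls `driver.find_element(...).text`, and `missing` would be `(NoSuchElementException,)`.

```python
def first_available(lookups, default, missing=(LookupError,)):
    """Return the result of the first lookup that succeeds, else the default."""
    for lookup in lookups:
        try:
            return lookup()
        except missing:
            # This layout variant is not present; try the next candidate.
            continue
    return default


# Stand-in for a movie page that has a gross figure but no budget.
page = {"box_office": {"gross": "$31.74M"}}

budget = first_available(
    [lambda: page["box_office"]["budget"], lambda: page["budget"]],
    default="-1",
)
print(budget)  # -1
```

Listing the candidates explicitly also documents which layouts have been seen so far.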


#### I probably could've used ParseHub all the way instead of relying on Selenium for a few parts. Unfortunately, going through 300+ movie pages, as opposed to only the 7 list pages, requires a lot of traversing, which the free version of ParseHub doesn't care much for. Oh well.

In [7]:
copy_df

| | name | rating | runtime | story_desc | genre | votescore | metacritic | year | votes | gross | budget_est | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | PG-13 | 79 min | A rag doll that awakens in a postapocalyptic f... | Animation, Action, Adventure | 7.0 | 60 | (I) (2009) | 140989 | $31.74M | $30,000,000 (estimated) | Shane Acker |
| 1 | A Bug's Life | G | 95 min | A misfit ant, looking for "warriors" to save h... | Animation, Adventure, Comedy | 7.2 | 77 | (I) (1998) | 293639 | $162.80M | $120,000,000 (estimated) | John Lasseter |
| 2 | A Christmas Carol | PG | 96 min | An animated retelling of Charles Dickens' clas... | Animation, Adventure, Comedy | 6.8 | 55 | (2009) | 115942 | $137.86M | $200,000,000 (estimated) | Robert Zemeckis |
| 3 | A Goofy Movie | G | 78 min | When Max makes a preposterous promise to a gir... | Animation, Adventure, Comedy | 6.9 | 53 | (1995) | 55556 | $35.35M | $18,000,000 (estimated) | Kevin Lima |
| 4 | A Scanner Darkly | R | 100 min | An undercover cop in a not-too-distant future ... | Animation, Comedy, Crime | 7.0 | 73 | (2006) | 112638 | $5.50M | $8,700,000 (estimated) | Richard Linklater |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 316 | Wolfwalkers | PG | 103 min | A young apprentice hunter and her father journ... | Animation, Adventure, Family | 8.0 | 87 | (2020) | 32505 | | $12,000,000 (estimated) | Tomm Moore |
| 317 | Wreck-It Ralph | PG | 101 min | A video game villain wants to be a hero and se... | Animation, Adventure, Comedy | 7.7 | 72 | (2012) | 422751 | $189.42M | $165,000,000 (estimated) | Rich Moore |
| 318 | Yellow Submarine | G | 85 min | The Beatles agree to accompany Captain Fred in... | Animation, Adventure, Comedy | 7.4 | 79 | (1968) | 26576 | $0.99M | £250,000 (estimated) | George Dunning |
| 319 | Zootopia | PG | 108 min | In a city of anthropomorphic animals, a rookie... | Animation, Adventure, Comedy | 8.0 | 78 | (2016) | 497059 | $341.27M | $150,000,000 (estimated) | Byron Howard |


In [8]:
copy_df.to_csv("imdb_animated_movies.csv", index = False)

With the table saved as a CSV file, the data scraping process is complete. The next step is cleaning the data, using this CSV file as the base.