IMDb Movie Scraper
Author: **Michael Saulon B (MSB46)**


Objective:

The purpose of this notebook is to scrape information about the most popular and top-rated animated movies on IMDb. Once the data is scraped, I convert it into a more readable format, a DataFrame, which will be cleaned and modeled later.

**_Note: The scrapers used were for educational purposes only._**

In [None]:
import os.path
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import warnings

from tqdm.notebook import tqdm
import random
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException

from bs4 import BeautifulSoup
import requests
from tkinter import filedialog
from tkinter import Tk
import json

In [None]:
warnings.filterwarnings('ignore')

links = []

options = webdriver.ChromeOptions()
options.add_extension('/content/selenium_extension/ublock.crx')

#### Approach
The first step is to gather the links to each animated movie's IMDb page. Since the last time I extracted movie data, IMDb's layout has changed drastically, so I had to update the selectors used to find the elements holding the data I needed. This was especially the case for the page that lists movies in a top-down layout. Instead of using a third-party API as I did previously, I used Selenium to extract the IMDb link to every movie page. Selenium was also useful for scrolling down and automatically clicking "Show 50 more" at the bottom of the page to get as many links as possible.

Previously, I used Selenium's web driver on each movie's page to search for that movie's specifics. This time I decided to use BeautifulSoup instead, because Selenium's page load times are too volatile, which can cause some elements to not be found. BS4 is also much quicker at returning content, which means less waiting.


In [None]:
# Get links

In [None]:
# Use the same relative path that is written and read elsewhere in the notebook
if not os.path.isfile("links.txt"):
    driver = webdriver.Chrome(options=options)
    with open("links.txt", "w") as l:
        driver.get('https://www.imdb.com/search/title/?title_type=feature&num_votes=5000,&genres=animation&sort=alpha,asc&view=advanced')
        while True:
            try:
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
                btn_show_more = WebDriverWait(driver, 6).until(EC.element_to_be_clickable((By.CSS_SELECTOR,".ipc-see-more__text")))
                driver.execute_script("arguments[0].click()",btn_show_more)

            except TimeoutException:
                selector = '/html/body/div[2]/main/div[2]/div[3]/section/section/div/section/section/div[2]/div/section/div[2]/div[2]/ul/li/div[1]/div/div/div[1]/div[2]/div[1]/a'
                movie_we = driver.find_elements(By.XPATH, selector)
                print(f"Found items: {len(movie_we)}")

                for movie in movie_we:
                    href = movie.get_attribute('href')
                    links.append(href)
                    l.write(str(href) + "\n")
                    # print(href)
                driver.close()
                break

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /root/.cache/selenium/chrome/linux64/120.0.6099.109/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x56158ea5df83 <unknown>
#1 0x56158e716cf7 <unknown>
#2 0x56158e74e60e <unknown>
#3 0x56158e74b26e <unknown>
#4 0x56158e79b80c <unknown>
#5 0x56158e78fe53 <unknown>
#6 0x56158e757dd4 <unknown>
#7 0x56158e7591de <unknown>
#8 0x56158ea22531 <unknown>
#9 0x56158ea26455 <unknown>
#10 0x56158ea0ef55 <unknown>
#11 0x56158ea270ef <unknown>
#12 0x56158e9f299f <unknown>
#13 0x56158ea4b008 <unknown>
#14 0x56158ea4b1d7 <unknown>
#15 0x56158ea5d124 <unknown>
#16 0x7f75d17e3ac3 <unknown>
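
The `DevToolsActivePort file doesn't exist` error above is commonly seen when Chrome is launched as root inside a container with no display. That this applies here is an assumption, based on the `/root/.cache` and `/content` paths. Below is a minimal sketch of Chrome flags that often resolve it. `CONTAINER_CHROME_FLAGS` and `build_chrome_options` are hypothetical names, not part of the original notebook.

```python
# Flags often needed when Chrome runs as root inside a container
# (assumption: the notebook was executed on Colab or a similar sandbox).
CONTAINER_CHROME_FLAGS = [
    "--headless=new",           # no display server is available
    "--no-sandbox",             # Chrome's sandbox refuses to start as root
    "--disable-dev-shm-usage",  # /dev/shm is often too small in containers
]


def build_chrome_options(flags=CONTAINER_CHROME_FLAGS):
    """Return ChromeOptions with the given command-line flags applied."""
    # Imported lazily so the flag list can be inspected without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in flags:
        options.add_argument(flag)
    return options
```

The result could be passed as `webdriver.Chrome(options=build_chrome_options())`. The uBlock extension would still need to be added to the returned options, and whether it loads in headless mode may depend on the Chrome version.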


In [None]:
# r.status_code

In [None]:
# j = json.loads(r.text)
# df = pd.DataFrame(j['movie'])
df_columns = [
    "title",
    "year",
    "rating",
    "runtime",
    "votes",
    "votescore",
    "metacritic",
    "budget",
    "opening_na",
    "worldwide",
    "story",
    "genres",
    "origin",
    "languages",
    "companies",
    "release_date",
    "cast",
    "crew_count",
    "writers",
    "director",
]
df = pd.DataFrame(columns=df_columns)

### Some more hurdles to overcome
While most pages are consistent in their layout, some are structured differently, so copying an element's CSS selector path isn't enough on its own. As a solution, I noticed that most elements I pull data from have a parent element with the attribute "data-testid", which conveniently has a distinct value like "title-boxoffice-budget" or "genres", making the data much easier to extract.
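
To illustrate the idea, here is a small sketch that looks up a value by its `data-testid` and ignores the generated class names. The HTML fragment and the helper `text_by_testid` are made up for this example; only the `data-testid` value is taken from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a box-office entry on a title page.
# The hashed class name is invented to show why it is not relied upon.
SAMPLE_HTML = """
<ul>
  <li data-testid="title-boxoffice-budget" class="sc-abc123-0 xYzQw">
    <span>Budget</span><div>$30,000,000 (estimated)</div>
  </li>
</ul>
"""


def text_by_testid(soup, testid, default=""):
    """Return the text of the <div> directly under the <li> with this data-testid."""
    node = soup.select_one(f"li[data-testid='{testid}'] > div")
    return node.get_text(strip=True) if node else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
budget = text_by_testid(soup, "title-boxoffice-budget")
# A section that is absent from the page falls back to the default.
missing = text_by_testid(soup, "title-boxoffice-cumulativeworldwidegross")
```

Returning a default for missing sections keeps a single absent field from raising an `AttributeError` and discarding the whole movie.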

I originally wanted to use a page's 'Storyline' section to extract things like the rating and tagline of a movie. But when using an HTML parser on the page's HTML content, some sections are missing. My assumption is that certain sections, like the storyline section and possibly others, aren't loaded or included until the page is actually opened in a browser. Whether I find a workaround is yet to be determined.

As of now, I have found an alternate way to collect the genres, story description, and rating of a movie without using the storyline section as a selector. Unfortunately, I wasn't able to find a way to extract a movie's tagline (if it has one). Luckily, it's probably one of the less important things to extract from a movie, in my view, so it's no big deal in the end.

In [None]:
def separated_by_commas(l):
    return ", ".join(l)

In [None]:
def read_random(file):
    lines = file.read().splitlines()
    return random.choice(lines)

In [None]:
x = 0
with open("links.txt") as file:
    for link in file:
        links.append(link.strip())  # drop the trailing newline from each line
    for link in tqdm(links, desc="Getting movie data..."):
        try:
            r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    #         print(r)
            content = r.text
            soup = BeautifulSoup(content, "html.parser")
    #         print(soup.prettify())

            # Title, Year
            title = soup.select_one("h1[data-testid='hero__pageTitle'] > span").get_text()
            year = soup.select_one("div.sc-e226b0e3-3.dwkouE > div.sc-69e49b85-0.jqlHBQ > ul > li:nth-child(1)").get_text()
            print(title, year)

            # Rating, Runtime
            runtime = soup.select_one("li[data-testid='title-techspec_runtime'] > div").get_text()

            # The order of details underneath each page's title is as follows: Year, Rating, Runtime
            # Sometimes rating isn't present and instead lists the year and runtime only.
            # If there is only two items underneath, I will assume that the film has no rating.
            if len(soup.select(".sc-d8941411-2 > li")) < 3:
                rating = "Not Rated"
            else:
                rating = soup.select_one("div.sc-e226b0e3-3.dwkouE > div.sc-69e49b85-0.jqlHBQ > ul > li:nth-child(2)").get_text()
    #         rating = soup.select_one("section[data-testid='Storyline']").get_text()

            print(runtime, rating)

            # Votes, Votescore, Metacritic
            votes = soup.select_one("div.sc-bde20123-0.dLwiNw > div.sc-bde20123-3.gPVQxL").get_text()
            votescore = soup.select_one("div.sc-bde20123-0.dLwiNw > div.sc-bde20123-2.cdQqzc > span.sc-bde20123-1.cMEQkK").get_text()
            metacritic = soup.select_one(".sc-b0901df4-0")
            # Fall back to an empty string when the page has no Metacritic score
            metacritic = metacritic.get_text() if metacritic else ""

            print(votes,votescore, metacritic)

            # Director, Writers
            directors_elmt = soup.select("div.sc-69e49b85-3.dIOekc > div > ul > li:nth-child(1) > div > ul > li")
            directors = separated_by_commas([d.get_text() for d in directors_elmt])
            writers_elmt = soup.select("div.sc-69e49b85-3.dIOekc > div > ul > li:nth-child(2) > div > ul > li")
            writers = separated_by_commas([w.get_text() for w in writers_elmt])

            print(directors)
            print(writers)


            # Genre
            genres_elmt = soup.select("div[data-testid='genres'] > div > a > span")
    #         genres_elmt = soup.select('li[@data-testid="storyline-genres"] > div > ul > li')
            genres = separated_by_commas([g.get_text() for g in genres_elmt])
            print(genres)

            # Budget, Opening weekend (domestic), Worldwide gross
            budget = soup.select_one("li[data-testid='title-boxoffice-budget'] > div")
            budget = budget.get_text() if budget else ""

            opening_na = soup.select_one("li[data-testid='title-boxoffice-openingweekenddomestic'] > div > ul > li")
            opening_na = opening_na.get_text() if opening_na else ""

            worldwide = soup.select_one("li[data-testid='title-boxoffice-cumulativeworldwidegross'] > div")
            worldwide = worldwide.get_text() if worldwide else ""

            print(worldwide, budget)

            # Release date, Country/Origin, Languages
            releasedate = soup.select_one("li[data-testid='title-details-releasedate'] > div").get_text()
            origin_elmt = soup.select("li[data-testid='title-details-origin'] > div > ul > li")
            # Always assign origin so a value from a previous movie is never reused
            origin = separated_by_commas([c.get_text() for c in origin_elmt]) if origin_elmt else ""

            lang_elmt = soup.select("li[data-testid='title-details-languages'] > div > ul > li")
            language = separated_by_commas([l.get_text() for l in lang_elmt]) if lang_elmt else ""
            print(releasedate, origin, language)

            # Companies

            company_elmt = soup.select("li[data-testid='title-details-companies'] > div > ul > li")
            company = separated_by_commas([c.get_text() for c in company_elmt]) if company_elmt else ""
            print(company)

            # Story

            story = soup.select_one("p[data-testid='plot'] > span.sc-466bb6c-2").get_text()
            print(story)


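
            # Cast
            # NOTE: this re-requests the same title URL as above. The
            # "table.cast_list" selector below appears to target the full
            # credits page, so the cast list may come back empty here.
            r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
            content = r.text
            soup = BeautifulSoup(content, "html.parser")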

            cast_elmt = soup.select("table.cast_list > tbody > tr> td:nth-child(2) > a")
            cast = separated_by_commas([a.get_text() for a in cast_elmt])

            crew_elmt = soup.select("tbody > tr > td.name > a")
            crew_count = len(list(set(crew_elmt)))

            print(cast)
            print(crew_count)

            values = [title, year, rating, runtime, votes, votescore,
                      metacritic, budget, opening_na, worldwide, story,
                      genres, origin, language, company, releasedate,
                      cast, crew_count, writers, directors]

            df.loc[x] = values

            x += 1
            # Year

        except Exception as e:
            print(f"Index {x}\n{e}")
            continue
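
The cast and crew selectors in the loop (`table.cast_list`, `td.name`) look like they belong to a title's full credits page rather than its main page. Assuming that is the case, and that the saved links have the form `https://www.imdb.com/title/<id>/?ref_=...`, one way to derive that URL is sketched below. `fullcredits_url` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def fullcredits_url(title_url):
    """Turn a title page URL into the URL of its full credits page."""
    parts = urlsplit(title_url.strip())
    # Drop the query string (e.g. ?ref_=...) and append the credits path.
    path = parts.path.rstrip("/") + "/fullcredits"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


example = fullcredits_url("https://www.imdb.com/title/tt0114709/?ref_=sr_t_1\n")
```

The second `requests.get` call in the loop could then fetch `fullcredits_url(link)` instead of `link`. Whether the selectors still match that page's current layout would need to be checked.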

In [None]:
len(links)

In [None]:
df

In [None]:
df.to_csv("imdb_animated_movies.csv", index = False)

After saving this table as a CSV file, the data scraping process is complete. The next step is to clean the data, using the newly created CSV file as a base.