# The Movie Data Base - Exploratory Data Analysis
---

## Imports

In [62]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
import re
import pandas as pd

## Data Scraping

Since IMDb renders much of its content with JavaScript rather than static HTML, we'll use the Selenium package to drive a real browser and interact with the search box.

In [151]:
browser = webdriver.Safari()
link = "http://www.imdb.com/"
browser.get(link)

SessionNotCreatedException: Message: Could not create a session: The Safari instance is already paired with another WebDriver session.
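The exception above says Safari is already paired with another WebDriver session. A likely cause is a `browser` from an earlier cell that was never closed. Below is a minimal sketch of one way to guarantee cleanup, assuming the driver exposes `quit()` as Selenium drivers do. `managed_browser` is a hypothetical helper name, and `FakeDriver` is only a stand-in for `webdriver.Safari` so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a driver with `factory` and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Releases the session even if the body raised an exception.
        driver.quit()


class FakeDriver:
    """Stand-in for webdriver.Safari, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False
        self.visited = []

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


with managed_browser(FakeDriver) as demo:
    demo.get("http://www.imdb.com/")

print(demo.closed)
```

In the notebook itself the real call would be `managed_browser(webdriver.Safari)`. Because the cells here share one `browser` object, calling `browser.quit()` before re-running the cell should have the same effect.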


We will explore the IMDb website by inspecting the relevant HTML elements directly on the page. Let's start by having Selenium type a query into the search box, which we locate by its ID:

In [51]:
searchbox = browser.find_element(By.ID, value="nav-search-form")
searchbox.click()
searchbox.send_keys("The Thing")

We'll use `By.XPATH` to print out the first result of the suggestion box:

In [49]:
suggestion_list = browser.find_element(By.XPATH, '//*[@id="nav-search-form"]/div[2]/div/div/div/ul/li')

suggestion_list.get_attribute('innerHTML')

'<a class="sc-bqyKva ehfErK searchResult searchResult--const" data-testid="search-result--const" href="/title/tt13984270/?ref_=nv_sr_srsg_0"><div class="searchResult--const__img"><div class="ipc-media ipc-media--poster-27x40 ipc-image-media-ratio--poster-27x40 ipc-media--baseAlt ipc-media--custom ipc-media__img" style="width: 100%;"><img alt="The Thing About Pam" class="ipc-image" loading="lazy" src="https://m.media-amazon.com/images/M/MV5BZDkzOTNmZWUtNWMyMS00NTQyLTk0YTEtNGY2YTIyYjRjNzA5XkEyXkFqcGdeQXVyODUxOTU0OTg@._V1_QL75_UX50_CR0,0,50,74_.jpg" srcset="https://m.media-amazon.com/images/M/MV5BZDkzOTNmZWUtNWMyMS00NTQyLTk0YTEtNGY2YTIyYjRjNzA5XkEyXkFqcGdeQXVyODUxOTU0OTg@._V1_QL75_UX50_CR0,0,50,74_.jpg 50w, https://m.media-amazon.com/images/M/MV5BZDkzOTNmZWUtNWMyMS00NTQyLTk0YTEtNGY2YTIyYjRjNzA5XkEyXkFqcGdeQXVyODUxOTU0OTg@._V1_QL75_UX75_CR0,0,75,111_.jpg 75w, https://m.media-amazon.com/images/M/MV5BZDkzOTNmZWUtNWMyMS00NTQyLTk0YTEtNGY2YTIyYjRjNzA5XkEyXkFqcGdeQXVyODUxOTU0OTg@._V1_QL75_UX100_
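The raw `innerHTML` is hard to read, but the two useful pieces (the suggested title and its link) sit in the `alt` attribute of the `<img>` and the `href` of the `<a>`. Here is a minimal sketch of pulling them out with the standard-library `html.parser`. `SuggestionParser` is a hypothetical helper, and `sample_html` is a shortened copy of the output above with class lists and image URLs trimmed. BeautifulSoup, already imported, would likely work just as well.

```python
from html.parser import HTMLParser


class SuggestionParser(HTMLParser):
    """Collect the first link target and the first image alt text."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.title = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.href is None:
            self.href = attrs.get("href")
        elif tag == "img" and self.title is None:
            self.title = attrs.get("alt")


# Shortened copy of the innerHTML printed above (assumption: trimmed for readability).
sample_html = (
    '<a class="searchResult" data-testid="search-result--const" '
    'href="/title/tt13984270/?ref_=nv_sr_srsg_0">'
    '<div class="searchResult--const__img">'
    '<img alt="The Thing About Pam" class="ipc-image" loading="lazy">'
    '</div></a>'
)

parser = SuggestionParser()
parser.feed(sample_html)
print(parser.title, parser.href)
```

With the real page, the string returned by `suggestion_list.get_attribute('innerHTML')` would be passed to `parser.feed` instead of `sample_html`.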

As we can see, the search box is not the most convenient route: the first suggestion for "The Thing" is "The Thing About Pam", so we would need to know exactly what we are looking for. Since we want a more exhaustive list of movies, we'll take another path: let's tell Selenium to open the menu and navigate to the Top 250 movies page.

In [149]:
wait = WebDriverWait(browser, 10)

wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="iconContext-menu"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="imdbHeader"]/div[2]/aside/div/div[2]/div/div[1]/span/label/span[2]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="imdbHeader"]/div[2]/aside/div/div[2]/div/div[1]/span/div/div/ul/a[3]/span'))).click()
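Each `wait.until(...)` call above polls its condition until it returns something truthy or the 10-second timeout expires, which is what lets the script survive menus that animate open. The sketch below shows that polling idea in plain Python. It is a simplified stand-in, not Selenium's implementation; `wait_until`, `menu_is_clickable`, and the default polling interval are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated element that only becomes "clickable" on the third check.
attempts = {"count": 0}


def menu_is_clickable():
    attempts["count"] += 1
    return "menu-button" if attempts["count"] >= 3 else None


found = wait_until(menu_is_clickable, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

Returning the condition's value rather than `True` mirrors how the cell above can chain `.click()` directly onto the result of `wait.until(...)`.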

Now that we are on a listing page, we can extract the title and the page link of each movie from their respective HTML elements.

In [152]:
# Each Top 250 entry sits in a table cell with the class "titleColumn".
movie_titles = []
movie_links = []

block = browser.find_elements(By.CLASS_NAME, 'titleColumn')
print(block)
for cell in block:
    # The first <a> in the cell carries both the title text and the movie URL.
    anchor = cell.find_element(By.TAG_NAME, "a")
    movie_titles.append(anchor.text)
    movie_links.append(anchor.get_attribute("href"))

print(f"Length of movie_titles is {len(movie_titles)}")
print(f"movie_links is of same length: {len(movie_titles) == len(movie_links)}")

[<selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE3-4C559C642A63", element="node-641FB3CA-FD51-412C-B281-975D3531F159")>, <selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE3-4C559C642A63", element="node-17C1E7AA-F111-418B-8228-8F03C6784A7E")>, <selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE3-4C559C642A63", element="node-344DE9E1-AD51-43F7-97B9-065BBF4AB74F")>, <selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE3-4C559C642A63", element="node-1873DF96-87C4-436B-B725-2AAB6B799202")>, <selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE3-4C559C642A63", element="node-00C8BA2A-C847-46AF-B1CB-8B1A2BFB5C3D")>, <selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE3-4C559C642A63", element="node-D34BFCC8-AC9C-4627-A115-E2635A7FF530")>, <selenium.webdriver.remote.webelement.WebElement (session="041AF42B-9FD9-4E27-ABE

Length of movie_titles is 250
movie_links is of same length: True


In [153]:
print(movie_titles[0], movie_links[0])

Les Évadés https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=VXGXC5X0BY5JA1J4HWDZ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1
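The scraped link carries a long tail of tracking parameters (`pf_rd_*`, `ref_`) that are not needed to reach the movie page. Below is a minimal sketch of normalising such links and extracting the `tt…` identifier with the standard library. `clean_imdb_link` and `imdb_title_id` are hypothetical helper names, and `raw_link` is a shortened copy of the URL printed above.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def clean_imdb_link(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def imdb_title_id(url):
    """Return the 'tt...' identifier from a title URL, or None if absent."""
    match = re.search(r"/title/(tt\d+)", url)
    return match.group(1) if match else None


# Shortened copy of the link printed above (assumption: some parameters omitted).
raw_link = (
    "https://www.imdb.com/title/tt0111161/"
    "?pf_rd_m=A2FGELUUNOQJNL&pf_rd_s=center-1&ref_=chttp_tt_1"
)

print(clean_imdb_link(raw_link))
print(imdb_title_id(raw_link))
```

Applied to the whole list, `[clean_imdb_link(u) for u in movie_links]` would give one short URL per movie, and the identifier could serve as a stable key if the data is later loaded into a DataFrame.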
