# Web Scraping using Selenium

# Scraping imdb.com for the Best Movies and retrieving extra information on the Top 10 as well as the top romance movies

1. Initially, to showcase the ability to navigate a website and its individual webpages using Selenium, the project begins at the IMDb homepage. From there, navigation proceeds towards the Top 250 movies webpage. 

2. Basic information on the Top 250 Movies, as rated by regular IMDb voters, is then retrieved using BeautifulSoup (bs4) and saved into a dataframe. This dataframe is subsequently exported as both a CSV and an XLSX for further analysis.

3. To further showcase mastery of Selenium and web scraping, detailed information on the Top 10 movies is gathered. This involves navigating the Selenium Driver to the individual movie webpages and using BeautifulSoup (bs4) to scrape additional information. The more detailed information for the Top 10 is stored in a separate dataframe and exported into a separate CSV and XLSX. Retrieving detailed information on all movies would be too time-consuming for a demonstration such as this. However, the code could be easily adapted to retrieve information on more movies, as the retrieval code is encapsulated within functions. These functions may, unfortunately, need adjustments if IMDb alters the structure of their website.

4. Lastly, to further showcase abilities with Selenium, the filter window on IMDb is utilized to retrieve the top romance movies. It is demonstrated that previously written functions can be reused for additional scraping tasks.

**Warning! Throughout the project, IMDb has occasionally changed class and ID names of different items. Therefore, functions were written in such a manner where classes and IDs are defined only once. This ensures that if a naming change occurs, only one variable or function needs adaptation. However, this does not guard against larger structural changes to the IMDb website, implying the program requires maintenance and has a limited shelf life.**
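
One way to keep class and ID names defined in a single place, as the warning above describes, is a small lookup table. The sketch below is illustrative only: the `SELECTORS` dictionary and the `get_selector` helper are hypothetical names that the notebook itself does not use, and the selector strings are simply copied from the cells further down.

```python
# Hypothetical central registry of the selectors used in this notebook.
# The values are copied from the later cells; the structure is an assumption.
SELECTORS = {
    "cookie_accept_xpath": '//button[@data-testid="accept-button"]',
    "menu_dropdown_id": "imdbHeader-navDrawerOpen",
    "top_250_link_xpath": '//a[@href="/chart/top/?ref_=nv_mv_250"]',
    "filter_button_css": 'button[data-testid="filter-menu-button"]',
    "romance_chip_css": 'button[data-testid="filter-genre-chip-Romance"]',
}


def get_selector(name):
    """Return the selector registered under `name`, failing loudly if it is unknown."""
    try:
        return SELECTORS[name]
    except KeyError:
        raise KeyError(f"No selector registered under {name!r}") from None


print(get_selector("menu_dropdown_id"))
```

With such a table, a renamed class or ID on IMDb would only require changing one dictionary entry.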

*Note: By default, the program saves the resulting CSVs and XLSXs to the working directory. Depending on the setup and Integrated Development Environment (IDE), this may be the directory in which the Jupyter notebook file is located or it may be the base directory. If the CSVs or XLSXs are not found, please check the base directory.*

In [None]:
import os

# Show the current working directory on this computer.
# The CSV and XLSX files created by this notebook are saved there.

print(f"The CSV-files and XLSX-files, that are produced, will be saved in the current working directory.\nThe following is the current working directory: {os.getcwd()} \nPlease navigate to the current working directory in order to find the produced CSV-files and XLSX-files.")
print("Please check your base directory if the CSV-files and XLSX-files are not in your current working directory.")
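
If the working directory is hard to predict, the export location could be made explicit instead. The following is a minimal sketch, assuming a hypothetical helper named `output_path` that is not part of the notebook; it falls back to the current working directory, which mirrors the default behaviour described above.

```python
from pathlib import Path


def output_path(filename, folder=None):
    """Build an absolute path for an export file.

    `folder` defaults to the current working directory, which mirrors
    the notebook's default behaviour.
    """
    base = Path(folder) if folder is not None else Path.cwd()
    # Create the target folder if it does not exist yet.
    base.mkdir(parents=True, exist_ok=True)
    return (base / filename).resolve()


print(output_path("top_250_movies.csv"))
```

The returned path could then be passed to the export calls, for example `df.to_csv(output_path('top_250_movies.csv'), index=False)`.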

### Importing libraries and Setting up Selenium Webdriver

1. We start by importing all necessary libraries for this code. 
2. Next, we set up our Selenium driver and set the homepage of IMDb (https://www.imdb.com/?ref_=nv_home) as our starting point.


In [None]:
# import libraries

from bs4 import BeautifulSoup

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

import re
import pandas as pd

In [None]:
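# Set up the Selenium WebDriver
# Selenium 4 expects the driver path to be wrapped in a Service object
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))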

# Set homepage as starting point
url = "https://www.imdb.com/?ref_=nv_home"

### Navigating from the Homepage to the Top 250 Movies

1. First we start our driver.
2. To avoid issues with the driver, we await the cookies popup and click the "accept button" once it appears.
3. Next, we open up the dropdown menu as a first step to navigate towards the Top 250 movies.
4. We await the dropdown menu to load and click the link leading us to the Top 250 movies.

In [None]:
# Navigate driver to starting point
driver.get(url)

In [None]:
# Accept cookies

def accept_cookies(driver, xpath):
    # Set up a wait
    wait = WebDriverWait(driver, 10)

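    try:
        # Wait for the cookies popup to appear and click the "Accept" button
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))
        accept_button.click()
    except TimeoutException:
        # The popup may not be shown in every session, so continue without it
        print("TimeoutException: No cookies popup appeared within the time limit")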

# Usage
accept_cookies(driver, '//button[@data-testid="accept-button"]')

In [None]:
# Opening up the dropdown menu

def click_dropdown_menu(driver, imdb_menu_dropdown_id):
    # Set up a wait
    wait = WebDriverWait(driver, 10)

    try:
        # Wait until the dropdown menu is clickable
        dropdown_menu = wait.until(EC.element_to_be_clickable((By.ID, imdb_menu_dropdown_id)))

        # Click the dropdown menu
        dropdown_menu.click()
    except TimeoutException:
        print("TimeoutException: Element not found or not clickable within the time limit")

# Usage
click_dropdown_menu(driver, 'imdbHeader-navDrawerOpen')

In [None]:
# Navigating to the Top 250 Movies Page

def navigate_to_top_movies(driver, xpath):
    # Set up a wait
    wait = WebDriverWait(driver, 10)

    try:
        # Wait until the link is clickable
        top_movies_link = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))

        # Click the link
        top_movies_link.click()
    except TimeoutException:
        print("TimeoutException: Element not found or not clickable within the time limit")

# Usage
navigate_to_top_movies(driver, '//a[@href="/chart/top/?ref_=nv_mv_250"]')

### Retrieving the Top 250 Movies

1. First, we wait for the webpage of Top 250 Movies to load fully before we proceed.
2. We define functions to retrieve the information in a structured way. We use functions to allow for easy future editing and to avoid "spaghetti coding".
3. We use our functions to retrieve our desired information and save it to a dataframe.
4. We print our dataframe to check the content and save it to csv and xlsx for further use.

In [None]:
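# Wait for the Top 250 list to load

# Set up a wait and block until the movie list items are present on the page
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li.ipc-metadata-list-summary-item')))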

In [None]:
# Define the functions to extract the information on the movies

# Function to extract details of a movie made up of sub functions
def extract_movie_data(movie, ranking):
    # Extract title, year, rating, and link of the movie
    title = extract_title(movie)
    year = extract_year(movie)
    rating = extract_rating(movie)
    link = extract_link(movie)
    # Return the details as a dictionary
    return {'Ranking': ranking, 'Title': title, 'Year': year, 'Rating': rating, 'Link': link}

# Function to extract the title of a movie
def extract_title(movie):
    # Find the title tag
    title_tag = movie.find('h3', {'class': 'ipc-title__text'})
    # Return the title text, or 'None' if the tag is not found
    # Use split function to remove the rank number
    return title_tag.text.split('.', 1)[1].strip() if title_tag else 'None'

# Function to extract the year of a movie
def extract_year(movie):
    # Find the year tag
    year_tag = movie.find('span', {'class': 'sc-43986a27-8 jHYIIK cli-title-metadata-item'})
    # Return the year text, or 'None' if the tag is not found
    return year_tag.text if year_tag else 'None'

# Function to extract the rating of a movie
def extract_rating(movie):
    # Find the rating tag
    rating_tag = movie.find('span', {'class': 'ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating'})
    # Return the rating text, or 'None' if the tag is not found
    return rating_tag.text.strip() if rating_tag else 'None'

# Function to extract the link of a movie
def extract_link(movie):
    # Find the link tag
    link_tag = movie.find('a', {'class': 'ipc-title-link-wrapper'})
    # Return the full link, or 'None' if the tag is not found
    return "https://www.imdb.com" + link_tag['href'] if link_tag else 'None'

# Function to create a DataFrame from a list of movies
def create_movie_dataframe(movies):
    # Extract details of each movie and create a list of movies
    movie_list = [extract_movie_data(movie, i+1) for i, movie in enumerate(movies)]
    # Return a DataFrame created from the list of movies
    return pd.DataFrame(movie_list, columns=['Ranking', 'Title', 'Year', 'Rating', 'Link'])
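
The title extraction above relies on the heading text starting with the rank followed by a period. The sketch below shows the same idea in isolation, using a hypothetical helper named `split_rank_and_title` and made-up sample strings. It splits at the first period only, so periods inside a title are kept, and it returns the text unchanged when no leading rank is found.

```python
def split_rank_and_title(text):
    """Split '12. Some Title' into (12, 'Some Title').

    str.partition splits at the first period only, so periods that
    belong to the title itself are preserved.
    """
    rank, sep, title = text.partition(".")
    if sep and rank.strip().isdigit():
        return int(rank), title.strip()
    # No leading rank found: return the text unchanged.
    return None, text.strip()


# Illustrative inputs, not scraped data
print(split_rank_and_title("1. The Shawshank Redemption"))
print(split_rank_and_title("2. E.T. the Extra-Terrestrial"))
print(split_rank_and_title("Se7en"))
```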

In [None]:
# Extracting the information on the movies using our functions

# Parse the page content
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find all movie list items and use find_all function to retrieve the information
# old class tag from 08.12.2023
# movies = soup.find_all('li', {'class': 'ipc-metadata-list-summary-item sc-59b6048d-0 cuaJSp cli-parent'})
movies = soup.find_all('li', {'class': 'ipc-metadata-list-summary-item sc-3f724978-0 enKyEL cli-parent'})

# Create a DataFrame from the list of movies
df_top_250_movies = create_movie_dataframe(movies)
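
Because `find_all` returns an empty list when a class name no longer matches, a changed class name on IMDb would silently produce an empty dataframe. A small sanity check can surface this early. The helper below, `check_scrape`, is a hypothetical addition and not part of the original notebook; the expected count of 250 is taken from the name of the chart.

```python
def check_scrape(items, expected=250, label="Top 250"):
    """Raise a descriptive error when nothing was scraped and warn on an unexpected count."""
    found = len(items)
    if found == 0:
        raise ValueError(
            f"{label}: no list items matched; the class names may have changed."
        )
    if found != expected:
        print(f"{label}: expected {expected} items but found {found}.")
    return found


# Illustrative call with placeholder items instead of scraped tags
print(check_scrape(["placeholder"] * 250))
```

In the notebook, the check could be called as `check_scrape(movies)` directly after the `find_all` call.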

In [None]:
df_top_250_movies

In [None]:
# Save the DataFrame to a csv and xlsx file

df_top_250_movies.to_csv('top_250_movies.csv', index=False)
df_top_250_movies.to_excel('top_250_movies.xlsx', index=False)
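
For further analysis, the scraped rating text usually needs to be converted to numbers. The sketch below assumes, without confirmation from the page itself, that the rating text looks like `9.3 (2.9M)`, that is, a score followed by an abbreviated vote count in parentheses. The helper name `parse_rating` and the sample strings are hypothetical.

```python
import re

# Multipliers for abbreviated vote counts (assumed format, e.g. '2.9M' or '120K')
_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_rating(text):
    """Return (rating, votes) parsed from text such as '9.3 (2.9M)'.

    Votes are None when no parenthesised count is present, and both
    values are None when no number is found at all.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:\((\d+(?:\.\d+)?)([KM]?)\))?", text)
    if not match:
        return None, None
    rating = float(match.group(1))
    votes = None
    if match.group(2):
        votes = int(round(float(match.group(2)) * _SUFFIXES.get(match.group(3), 1)))
    return rating, votes


# Illustrative inputs, not scraped data
print(parse_rating("9.3 (2.9M)"))
print(parse_rating("8.5"))
```

If the assumption about the text format holds, the helper could be applied to the `Rating` column with `df_top_250_movies['Rating'].apply(parse_rating)`.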

### Retrieving more detailed info on the Top 10 Movies

1. We create a copy of our previous dataframe with only the Top 10 rated movies. For these movies, we would like to extract more information from their individual webpages.
2. We define our functions that allow us to navigate to the individual movie webpages and to retrieve our desired information.
3. We iterate our retrieval over the movies in our Top 10 dataframe.
4. We print our dataframe to check the content and save it to csv and xlsx for further use.

In [None]:
# Copy the Top 10 rows from the Top 250 DataFrame

df_top_10_movies_detailed = df_top_250_movies.head(10).copy()

In [None]:
# Function to navigate to a specific movie link
def navigate_to_link(driver, link):
    # Open the movie link
    driver.get(link)
    # Wait until the desired element has loaded
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.ipc-chip-list__scroller')))

# Function to parse the content of the page the driver is currently on
def parse_page(driver):
    return BeautifulSoup(driver.page_source, 'html.parser')

# Function to extract the description of a movie
def extract_description(soup):
    # Find the description tag
    description_tag = soup.find('span', {'data-testid': 'plot-l'})
    # Return the description text, or 'None' if the tag is not found
    return description_tag.text if description_tag else 'None'

# Function to extract the director of a movie
def extract_director(soup):
    # Find the director tag
    director_tag = soup.find('a', {'href': re.compile(r'/name/nm\d+/\?ref_=tt_ov_dr')})
    # Return the director text, or 'None' if the tag is not found
    return director_tag.text if director_tag else 'None'

# Function to extract the writers of a movie
def extract_writers(soup):
    # Find all writer tags
    writer_tags = soup.find_all('a', {'href': re.compile(r'/name/nm\d+/\?ref_=tt_ov_wr')})
    # Return a string of writers separated by commas, or 'None' if no tags are found
    # Use list comprehension to extract the text from each tag
    return ', '.join([tag.text for tag in writer_tags]) if writer_tags else 'None'

# Function to extract the stars of a movie
def extract_stars(soup):
    # Find all star tags
    star_tags = soup.find_all('a', {'href': re.compile(r'/name/nm\d+/\?ref_=tt_ov_st')})
    # Return a string of stars separated by commas, or 'None' if no tags are found
    # Use list comprehension to extract the text from each tag
    return ', '.join([tag.text for tag in star_tags]) if star_tags else 'None'

# Function to extract the genres of a movie
def extract_genres(soup):
    # Find all genre tags
    genre_tags = soup.find_all('span', {'class': 'ipc-chip__text'})
    # Use a list comprehension to extract the text from each tag, filtering out the 'Back to top' tag
    genres = [tag.text for tag in genre_tags if tag.text != 'Back to top']
    # Return a string of genres separated by commas, or 'None' if no genres remain
    return ', '.join(genres) if genres else 'None'

# Function to extract all the details of a movie
def extract_movie_details(link):
    # Navigate to the link and parse the page content
    navigate_to_link(driver, link)
    soup = parse_page(driver)
    # Extract the description, director, writers, stars, and genres of the movie
    description = extract_description(soup)
    director = extract_director(soup)
    writers = extract_writers(soup)
    stars = extract_stars(soup)
    genres = extract_genres(soup)
    # Return the extracted data as a dictionary
    return {
        'Link': link,
        'Description': description,
        'Director': director,
        'Writers': writers,
        'Stars': stars,
        'Genres': genres
    }

In [None]:
# Use a list comprehension to create a list of dictionaries containing the extracted data for each Top 10 movie
movie_details = [extract_movie_details(link) for link in df_top_10_movies_detailed['Link']]

# Convert the list of dictionaries into a DataFrame
df_new_info = pd.DataFrame(movie_details)

# Drop the 'Link' column from the new DataFrame so that it is not duplicated
df_new_info = df_new_info.drop(columns=['Link'])

# Concatenate the new information with the existing DataFrame along the columns axis
df_top_10_movies_detailed = pd.concat([df_top_10_movies_detailed, df_new_info], axis=1)

In [None]:
df_top_10_movies_detailed

In [None]:
# Export the df

# Export df_top_10_movies_detailed to a CSV file
df_top_10_movies_detailed.to_csv('top_10_movies_detailed.csv', index=False)

# Export df_top_10_movies_detailed to an Excel file
df_top_10_movies_detailed.to_excel('top_10_movies_detailed.xlsx', index=False)

### Using a filter to receive the top "Romance" movies

1. To further demonstrate our mastery of Selenium, we apply the "Romance" filter. We first move back to the Top 250 webpage.
2. We access the filter window by clicking the filter button.
3. We click the "Romance" filter.
4. We close the filter window.
5. We scrape the data using our previous functions.
6. We check and save the data.
7. We close our selenium driver and end our scraping exercise.

In [None]:
# Move back to top 250 movies page

navigate_to_link(driver, "https://www.imdb.com/chart/top/?ref_=nv_mv_250")

In [None]:
# Click the filter button

def click_filter_button(driver, css_selector):
    # Set up a wait
    wait = WebDriverWait(driver, 10)

    try:
        # Wait until the filter button is clickable
        filter_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))

        # Click the filter button
        filter_button.click()
    except TimeoutException:
        print("TimeoutException: Element not found or not clickable within the time limit")

# Usage
click_filter_button(driver, 'button[data-testid="filter-menu-button"]')

In [None]:
# Click the filter for "Romance"

def click_genre_button(driver, css_selector):
    # Set up a wait
    wait = WebDriverWait(driver, 10)

    try:
        # Wait until the genre button is clickable
        genre_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))

        # Click the genre button
        genre_button.click()
    except TimeoutException:
        print("TimeoutException: Element not found or not clickable within the time limit")

# Usage
click_genre_button(driver, 'button[data-testid="filter-genre-chip-Romance"]')

In [None]:
# Close filter window

def close_filter_window(driver, css_selector):
    # Set up a wait
    wait = WebDriverWait(driver, 10)

    try:
        # Wait until the close button is clickable
        close_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))

        # Click the close button
        close_button.click()
    except TimeoutException:
        print("TimeoutException: Element not found or not clickable within the time limit")

# Usage
close_filter_window(driver, 'button[aria-label="Close Prompt"]')

In [None]:
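# Wait for the filter window to close

# Set up a wait and block until the close button of the filter window is no longer visible
wait = WebDriverWait(driver, 10)
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, 'button[aria-label="Close Prompt"]')))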

In [None]:
# Extracting the information on the romance movies using our functions

# Parse the page content
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find all movie list items and use find_all function to retrieve the information
movies = soup.find_all('li', {'class': 'ipc-metadata-list-summary-item sc-3f724978-0 enKyEL cli-parent'})

# Create a DataFrame from the list of movies
df_top_romance_movies = create_movie_dataframe(movies)

In [None]:
df_top_romance_movies

In [None]:
# Export the df

# Export df_top_romance_movies to a CSV file
df_top_romance_movies.to_csv('top_romance_movies.csv', index=False)

# Export df_top_romance_movies to an Excel file
df_top_romance_movies.to_excel('top_romance_movies.xlsx', index=False)

In [None]:
# Close the driver
driver.quit()