selenium: I used selenium as the core library to automate the web browser, allowing the script to act like a robot user.

webdriver_manager: I employed webdriver_manager to automatically download and install the correct version of the Google Chrome driver so I didn't have to do it manually.

time: I utilized the time module to pause the script, giving the website sufficient time to load data.

pandas: I used pandas to organize the scraped data into a table (DataFrame) and save it as a CSV file.

BeautifulSoup: I implemented BeautifulSoup as a parsing tool to easily extract specific text, like titles or ratings, from the raw HTML code.

By: I used By to help locate elements on the page, allowing me to search by attributes like CLASS_NAME and XPATH.

ActionChains: I applied ActionChains to perform complex actions, such as clicking buttons that were difficult to reach or simulating mouse movements.

In [1]:
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


start_url: I defined this variable with the specific link needed to access the IMDb website and load the list of movies available for scraping.

webdriver.Chrome(...): I initialized the Chrome WebDriver using ChromeDriverManager to automatically handle the installation and setup of the browser service.

driver.get(start_url): I instructed the automated driver to navigate the browser window (already opened by webdriver.Chrome) to the provided link to begin the process.

In [2]:
start_url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1950-01-01,2012-12-31&sort=num_votes,desc"

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(start_url)

Function Definition: I defined scrape_list_page to handle the extraction of movie data from the currently loaded page.

Parsing: I used BeautifulSoup to parse the raw HTML source code retrieved by the Selenium driver, converting it into a format I could search through.

Looping: I iterated through each movie card (identified by the li tag with a specific class) to process the films one by one.

Data Extraction:
 Title & Rating: I extracted the title and rating by targeting their specific HTML class names.
 Year & Duration: I used lambda functions as custom filters to find the release year (a span whose text is exactly 4 digits) and the duration (a span whose text contains the letter "h").
 URL: I captured the relative link for each movie and combined it with "https://www.imdb.com" to create a complete, clickable URL for the next stage.

Return: I stored the details for each movie in a dictionary, appended them to a list, and returned the collection of data.

In [3]:
def scrape_list_page(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    movies = []

    for block in soup.find_all("li", attrs={"class": "ipc-metadata-list-summary-item"}):
        title_tag = block.find("h3", attrs={"class": "ipc-title__text"})
        title = title_tag.text.strip() if title_tag else None

        year_tag = block.find("span", string=lambda x: x and x.isdigit() and len(x) == 4)
        year = year_tag.text if year_tag else None

        rating_tag = block.find("span", attrs={"class": "ipc-rating-star--rating"})
        rating = rating_tag.text if rating_tag else None

        duration_tag = block.find("span", string=lambda x: x and "h" in x)
        duration = duration_tag.text if duration_tag else None

        link_tag = block.find("a", attrs={"class": "ipc-title-link-wrapper"})
        url = "https://www.imdb.com" + link_tag["href"] if link_tag else None

        movies.append({
            "Title": title,
            "Year": year,
            "Rating": rating,
            "Duration": duration,
            "URL": url
        })

    return movies
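
As a quick check of the two lambda filters, the sketch below applies the same conditions to plain strings instead of HTML. The function names and sample values are made up for illustration; only the conditions themselves come from scrape_list_page.

```python
# The same conditions scrape_list_page passes to BeautifulSoup's `string=` argument,
# written as named functions so they can be tried on plain strings.
def looks_like_year(text):
    return bool(text) and text.isdigit() and len(text) == 4

def looks_like_duration(text):
    return bool(text) and "h" in text

# Hypothetical span texts from one movie card, plus a short film and an empty span.
samples = ["1994", "2h 22m", "R", "9.3", "45m", None]

years = [s for s in samples if looks_like_year(s)]
durations = [s for s in samples if looks_like_duration(s)]

print(years)      # ['1994']
print(durations)  # ['2h 22m']
```

Note that a runtime under one hour, such as "45m", contains no "h" and would be skipped by the duration filter.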


Function Purpose: I defined scrape_movie_page to perform a "deep dive" into each individual movie's URL, extracting granular data that wasn't available on the main list page.

Navigation & Parsing: I navigated the driver to the specific movie URL, waited for the content to render, and parsed the HTML using BeautifulSoup to prepare for extraction.

Financial Data: I targeted specific data-testid attributes to robustly extract financial metrics, including Budget, Opening Weekend, Gross US & Canada, and Gross Worldwide revenue.

Metadata Extraction: I retrieved the Plot summary, Production Companies, Country, and Languages, using list comprehensions to join multiple values (like languages) into single comma-separated strings for cleaner data.

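Crew Information: I implemented logic to locate the Directors entry for the film. As written, the extracted value also includes the row label, which is why the output further down shows values such as "Director, Frank Darabont".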

Error Handling: I used if/else statements for every field to ensure the script assigns None instead of crashing if a specific piece of information (like "Budget") is missing from the page.

IMDb ID: I parsed the URL string directly to isolate the IMDb ID (for example, tt0111161), which uniquely identifies each film.
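
The ID step can be illustrated on its own. The sketch below compares the positional split used in this notebook with a regular-expression alternative; extract_imdb_id is a hypothetical helper, not part of the original script.

```python
import re

def extract_imdb_id(url):
    """Return the 'tt' identifier from an IMDb title URL, or None if there is none."""
    match = re.search(r"/title/(tt\d+)", url)
    return match.group(1) if match else None

url = "https://www.imdb.com/title/tt0111161/?ref_=sr_t_1"

split_id = url.split("/")[4]       # positional approach used in scrape_movie_page
regex_id = extract_imdb_id(url)    # pattern-based alternative

print(split_id)                                  # tt0111161
print(regex_id)                                  # tt0111161
print(extract_imdb_id("https://www.imdb.com/"))  # None
```

The positional split raises an IndexError on a URL with fewer path segments, whereas the pattern-based version returns None, which fits the "assign None instead of crashing" approach used for the other fields.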

In [4]:
def scrape_movie_page(driver, url):
    driver.get(url)
    time.sleep(1)  # fixed pause to let the page render

    soup = BeautifulSoup(driver.page_source, "html.parser")
    data = {}

    # Plot (matched by a hashed-looking class name, which may be brittle)
    plot_tag = soup.find("span", attrs={"class": "sc-bf30a0e-0 iOCbqI"})
    data["Plot"] = plot_tag.text.strip() if plot_tag else None

    # Budget
    budget_tag = soup.find("li", attrs={"data-testid": "title-boxoffice-budget"})
    if budget_tag:
        val = budget_tag.find("span", attrs={"class": "ipc-metadata-list-item__list-content-item"})
        data["Budget"] = val.text.strip() if val else None
    else:
        data["Budget"] = None

    # Opening Weekend
    opening_tag = soup.find("li", attrs={"data-testid": "title-boxoffice-openingweekenddomestic"})
    if opening_tag:
        val = opening_tag.find("span", class_="ipc-metadata-list-item__list-content-item")
        data["Opening_Weekend"] = val.text.strip() if val else None
    else:
        data["Opening_Weekend"] = None

    # Gross US & Canada
    gross_us_tag = soup.find("li", attrs={"data-testid": "title-boxoffice-grossdomestic"})
    if gross_us_tag:
        val = gross_us_tag.find("span", class_="ipc-metadata-list-item__list-content-item")
        data["Gross_US_Canada"] = val.text.strip() if val else None
    else:
        data["Gross_US_Canada"] = None

    # Gross Worldwide
    gross_world_tag = soup.find("li", attrs={"data-testid": "title-boxoffice-cumulativeworldwidegross"})
    if gross_world_tag:
        val = gross_world_tag.find("span", class_="ipc-metadata-list-item__list-content-item")
        data["Gross_Worldwide"] = val.text.strip() if val else None
    else:
        data["Gross_Worldwide"] = None

    # Directors: iterating over the matched <li> yields its direct children,
    # so the joined string includes the row label (e.g. "Director, Frank Darabont")
    directors = soup.find("li", attrs={"class": "ipc-metadata-list__item ipc-metadata-list__item--align-end"})
    if directors:
        data["Directors"] = ", ".join([d.text.strip() for d in directors])
    else:
        data["Directors"] = None

    # Production Companies
    prod_tag = soup.find("li", attrs={"data-testid": "title-details-companies"})
    if prod_tag:
        comps = prod_tag.find_all("a", attrs={"class": "ipc-metadata-list-item__list-content-item"})
        data["Production_Companies"] = ", ".join([c.text.strip() for c in comps])
    else:
        data["Production_Companies"] = None

    # Country
    country_tag = soup.find("li", attrs={"data-testid": "title-details-origin"})
    if country_tag:
        val = country_tag.find("a", attrs={"class": "ipc-metadata-list-item__list-content-item"})
        data["Country"] = val.text.strip() if val else None
    else:
        data["Country"] = None

    # Languages
    lang_tag = soup.find("li", attrs={"data-testid": "title-details-languages"})
    if lang_tag:
        langs = lang_tag.find_all("a", attrs={"class": "ipc-metadata-list-item__list-content-item"})
        data["Languages"] = ", ".join([l.text.strip() for l in langs])
    else:
        data["Languages"] = None

    # IMDb ID: the fifth "/"-separated piece of the URL, e.g. "tt0111161"
    data["IMDb_ID"] = url.split("/")[4]

    return data
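
Most fields in scrape_movie_page follow the same pattern: find the li row by its data-testid, then read the content items inside it. The sketch below shows one way that pattern could be factored into a helper. get_items and join_or_none are hypothetical names, and the HTML snippet is a hand-written stand-in rather than real IMDb markup.

```python
from bs4 import BeautifulSoup

ITEM_CLASS = "ipc-metadata-list-item__list-content-item"

def get_items(soup, testid, tag):
    """Return the stripped text of every content item under the <li> with this data-testid."""
    row = soup.find("li", attrs={"data-testid": testid})
    if row is None:
        return []
    return [el.text.strip() for el in row.find_all(tag, class_=ITEM_CLASS)]

def join_or_none(values):
    """Join values with ', ' or return None when the list is empty."""
    return ", ".join(values) if values else None

# Hand-written stand-in for a fragment of a movie page.
html = """
<ul>
  <li data-testid="title-boxoffice-budget">
    <span class="ipc-metadata-list-item__list-content-item">$25,000,000 (estimated)</span>
  </li>
  <li data-testid="title-details-languages">
    <a class="ipc-metadata-list-item__list-content-item">English</a>
    <a class="ipc-metadata-list-item__list-content-item">Mandarin</a>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

budget = join_or_none(get_items(soup, "title-boxoffice-budget", "span"))
languages = join_or_none(get_items(soup, "title-details-languages", "a"))
missing = join_or_none(get_items(soup, "title-boxoffice-grossdomestic", "span"))

print(budget)     # $25,000,000 (estimated)
print(languages)  # English, Mandarin
print(missing)    # None
```

One behavioural difference from the function above: when a row exists but contains no matching items, this helper yields None rather than an empty string.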
    


Function Purpose: I created the scrape_all_movies function to act as the main controller, connecting the list expansion process with the detailed data extraction.

Controlling Data Volume: I included a max_clicks parameter to set a precise limit on how many times the script loads more movies, giving me control over the dataset size.

Phase 1: I used a for loop to automate clicking the "Load More" button multiple times using ActionChains. This simulated a user manually expanding the list to reveal more content while time.sleep ensured the new data loaded correctly between clicks.

Phase 2: Instead of scraping repeatedly, I executed scrape_list_page exactly once after the loading phase was finished. This captured all visible movies in a single pass and avoided the duplicate entries that re-scraping after every click would produce.

Phase 3: I iterated through the complete list of movies, calling scrape_movie_page for each item to fetch the hidden financial and crew details.

Data Merging: I used the dictionary .update() method to seamlessly merge the newly fetched details (like budget and directors) into the original movie record, creating a single comprehensive dataset.

Final Output: I converted the final list of enriched movie dictionaries into a pandas DataFrame, which the function returns. The export to a CSV file happens in a later cell.
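
Phase 1 relies on fixed time.sleep pauses, which wait the full interval even when the page is ready sooner and may be too short when it is slow. Selenium ships its own explicit-wait helper (WebDriverWait) for this purpose. The plain-Python sketch below only illustrates the underlying idea of polling a condition until a deadline; wait_until and list_has_grown are hypothetical names, and the condition is a stand-in for a real check such as "the number of movie cards has increased".

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in for "has the list grown yet?": becomes true on the third check.
checks = {"count": 0}

def list_has_grown():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(list_has_grown, timeout=2.0, interval=0.01)
print(checks["count"])  # 3
```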

In [5]:
def scrape_all_movies(driver, start_url, max_clicks=10):
    driver.get(start_url)
    
    # PHASE 1: LOAD A SPECIFIC AMOUNT
    print(f"Starting to load... Limit set to {max_clicks} clicks.")
    
    for i in range(max_clicks):
        try:
            button = driver.find_element(By.CLASS_NAME, "ipc-see-more__button") 
            ActionChains(driver).click(button).perform() 
            
            print(f"Click {i+1}/{max_clicks} successful.")
            time.sleep(3) 
        except Exception:
            print("Button not found (or end of list). Stopping loading.")
            break

    # PHASE 2: SCRAPE EVERYTHING ONCE
    print("Loading finished. Scraping visible movies...")
    all_movies = scrape_list_page(driver)
    print(f"Total movies found: {len(all_movies)}")

    # PHASE 3: GET DETAILS
    for movie in all_movies:
        try:
            details = scrape_movie_page(driver, movie["URL"])
            movie.update(details) 
        except Exception as e:
            print(f"Could not get details for {movie['URL']}: {e}")

    return pd.DataFrame(all_movies)
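
The merge in Phase 3 can be sketched in isolation. When a detail page fails, the record keeps only its list-page keys; pandas fills the absent columns with NaN when it builds the DataFrame, so the explicit defaults below are optional. merge_details and DETAIL_KEYS are hypothetical names introduced for this illustration, and the key list is only a subset.

```python
DETAIL_KEYS = ["Budget", "Gross_Worldwide", "Directors"]  # subset, for illustration

def merge_details(movie, details):
    """Merge detail fields into a list-page record, filling any missing field with None."""
    merged = dict(movie)                               # copy so the input is not mutated
    merged.update({key: None for key in DETAIL_KEYS})  # defaults first
    merged.update(details or {})                       # real values overwrite the defaults
    return merged

movie = {"Title": "1. The Shawshank Redemption", "Year": "1994"}

ok = merge_details(movie, {"Budget": "$25,000,000 (estimated)"})
failed = merge_details(movie, None)  # e.g. the detail page could not be scraped

print(ok["Budget"])                  # $25,000,000 (estimated)
print(failed["Budget"])              # None
print(sorted(ok) == sorted(failed))  # True
```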

df = scrape_all_movies(...): I initiated the full scraping pipeline by calling the master function, passing in the active driver and the start URL. This executed the entire workflow (expanding the list, scraping the movies, and fetching the detailed metrics) and stored the final result in the variable df.

df.head(): I called this method to immediately display the first five rows of the newly created DataFrame, allowing me to visually verify that the data was extracted correctly and structured properly before saving it.

In [6]:
df = scrape_all_movies(driver, start_url)
df.head()

Starting to load... Limit set to 10 clicks.
Click 1/10 successful.
Click 2/10 successful.
Click 3/10 successful.
Click 4/10 successful.
Click 5/10 successful.
Click 6/10 successful.
Click 7/10 successful.
Click 8/10 successful.
Click 9/10 successful.
Click 10/10 successful.
Loading finished. Scraping visible movies...
Total movies found: 550


,Title,Year,Rating,Duration,URL,Plot,Budget,Opening_Weekend,Gross_US_Canada,Gross_Worldwide,Directors,Production_Companies,Country,Languages,IMDb_ID
0,1. The Shawshank Redemption,1994,9.3,2h 22m,https://www.imdb.com/title/tt0111161/?ref_=sr_t_1,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)","$727,327","$28,767,189","$29,334,033","Director, Frank Darabont",Castle Rock Entertainment,United States,English,tt0111161
1,2. The Dark Knight,2008,9.1,2h 32m,https://www.imdb.com/title/tt0468569/?ref_=sr_t_2,When a menace known as the Joker wreaks havoc ...,"$185,000,000 (estimated)","$158,411,483","$534,987,076","$1,008,287,756","Director, Christopher Nolan","Warner Bros., Legendary Pictures, DC Comics",United States,"English, Mandarin",tt0468569
2,3. Inception,2010,8.8,2h 28m,https://www.imdb.com/title/tt1375666/?ref_=sr_t_3,A thief who steals corporate secrets through t...,"$160,000,000 (estimated)","$62,785,337","$292,587,330","$839,786,473","Director, Christopher Nolan","Warner Bros., Legendary Pictures, Syncopy",United States,"English, Japanese, French",tt1375666
3,4. Fight Club,1999,8.8,2h 19m,https://www.imdb.com/title/tt0137523/?ref_=sr_t_4,An insomniac office worker and a devil-may-car...,"$63,000,000 (estimated)","$11,035,485","$37,030,102","$101,321,009","Director, David Fincher","Fox 2000 Pictures, New Regency Productions, Li...",United States,English,tt0137523
4,5. Forrest Gump,1994,8.8,2h 22m,https://www.imdb.com/title/tt0109830/?ref_=sr_t_5,The history of the United States from the 1950...,"$55,000,000 (estimated)","$24,450,602","$330,455,270","$678,226,465","Director, Robert Zemeckis","Paramount Pictures, The Steve Tisch Company, W...",United States,English,tt0109830
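
The preview shows that titles still carry their list rank ("1. ") and that durations are stored as text ("2h 22m"). A possible clean-up step, not part of the original pipeline, is sketched below; strip_rank and duration_to_minutes are hypothetical helper names.

```python
import re

def strip_rank(title):
    """Remove a leading list rank such as '1. ' from a scraped title."""
    return re.sub(r"^\d+\.\s+", "", title)

def duration_to_minutes(text):
    """Convert '2h 22m', '2h', or '45m' to minutes; return None if nothing matches."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(strip_rank("1. The Shawshank Redemption"))  # The Shawshank Redemption
print(duration_to_minutes("2h 22m"))              # 142
print(duration_to_minutes("45m"))                 # 45
```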


len(df): I used the len() function to calculate the total number of rows in the DataFrame. This served as a quick verification step to confirm exactly how many movies were successfully scraped and stored in the final dataset before saving it.

In [7]:
len(df)

550

df.to_csv(...): I used the .to_csv() method to export the final DataFrame to a file named 'my_scraped_movies.csv', making the data easy to share or open in Excel.

index=False: I specifically set this parameter to ensure the output file contained only the actual movie data, preventing pandas from adding an unnecessary extra column of row numbers (0, 1, 2...).

print(...): I included a print statement to provide immediate visual confirmation that the scraping process finished and the file was successfully created on the computer.

In [8]:
df.to_csv('my_scraped_movies.csv', index=False)

print("File saved successfully!")

File saved successfully!
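
The effect of index=False can be seen on a tiny DataFrame. This is a minimal sketch with made-up data (df_demo) that writes to an in-memory string instead of a file.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"Title": ["Inception"], "Year": [2010]})

without_index = df_demo.to_csv(index=False)
with_index = df_demo.to_csv()  # default: the row index is written as a first column

print(without_index.splitlines()[0])  # Title,Year
print(with_index.splitlines()[0])     # ,Title,Year

# Reading the indexed version back turns the unnamed first column into "Unnamed: 0".
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))         # ['Unnamed: 0', 'Title', 'Year']
```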


In this project, I designed and implemented a web scraping pipeline to harvest structured movie data from IMDb. By integrating Selenium for browser automation and BeautifulSoup for HTML parsing, I overcame the challenge of dynamic content loading, specifically the 'Load More' mechanism that limits standard scraping methods. The final script extracts key metrics, including financial data like budget and revenue, and consolidates them into a pandas DataFrame. The resulting dataset offers a valuable resource for analyzing industry trends, such as the correlation between production budgets and box office success. This project demonstrates practical proficiency in data extraction, dynamic element interaction, and automated workflow management.
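
Before any budget-versus-revenue analysis, the money columns would need to be numeric, since they are scraped as strings such as "$25,000,000 (estimated)". The sketch below shows one possible conversion; parse_money is a hypothetical helper and it ignores the currency symbol, so values in other currencies would be treated as plain numbers.

```python
import re

def parse_money(text):
    """Turn a string such as '$25,000,000 (estimated)' into an int, or None if no digits are found."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None

print(parse_money("$25,000,000 (estimated)"))  # 25000000
print(parse_money("$1,008,287,756"))           # 1008287756
print(parse_money(None))                       # None
```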

In [15]:
df = pd.read_csv('my_scraped_movies.csv')
df.head()



,Title,Year,Rating,Duration,URL,Plot,Budget,Opening_Weekend,Gross_US_Canada,Gross_Worldwide,Directors,Production_Companies,Country,Languages,IMDb_ID
0,1. The Shawshank Redemption,1994,9.3,2h 22m,https://www.imdb.com/title/tt0111161/?ref_=sr_t_1,A banker convicted of uxoricide forms a friend...,"$25,000,000 (estimated)","$727,327","$28,767,189","$29,334,033","Director, Frank Darabont",Castle Rock Entertainment,United States,English,tt0111161
1,2. The Dark Knight,2008,9.1,2h 32m,https://www.imdb.com/title/tt0468569/?ref_=sr_t_2,When a menace known as the Joker wreaks havoc ...,"$185,000,000 (estimated)","$158,411,483","$534,987,076","$1,008,287,756","Director, Christopher Nolan","Warner Bros., Legendary Pictures, DC Comics",United States,"English, Mandarin",tt0468569
2,3. Inception,2010,8.8,2h 28m,https://www.imdb.com/title/tt1375666/?ref_=sr_t_3,A thief who steals corporate secrets through t...,"$160,000,000 (estimated)","$62,785,337","$292,587,330","$839,786,473","Director, Christopher Nolan","Warner Bros., Legendary Pictures, Syncopy",United States,"English, Japanese, French",tt1375666
3,4. Fight Club,1999,8.8,2h 19m,https://www.imdb.com/title/tt0137523/?ref_=sr_t_4,An insomniac office worker and a devil-may-car...,"$63,000,000 (estimated)","$11,035,485","$37,030,102","$101,321,009","Director, David Fincher","Fox 2000 Pictures, New Regency Productions, Li...",United States,English,tt0137523
4,5. Forrest Gump,1994,8.8,2h 22m,https://www.imdb.com/title/tt0109830/?ref_=sr_t_5,The history of the United States from the 1950...,"$55,000,000 (estimated)","$24,450,602","$330,455,270","$678,226,465","Director, Robert Zemeckis","Paramount Pictures, The Steve Tisch Company, W...",United States,English,tt0109830
