# Web Scraping Tutorial: Basketball Films from Wikipedia

This tutorial is designed for undergraduate students, particularly those in the iSchool or affiliated with data-focused groups such as the Data Science Club at UIUC. It's intended for learners who are comfortable with Python fundamentals (loops, lists, dictionaries) and have completed at least one semester of Python instruction, but are entirely new to web scraping.

If you've ever worked on a data science or digital humanities project, you've likely encountered a frustrating challenge: the data you need exists, but it's trapped inside a webpage instead of a downloadable CSV. That's where web scraping comes in.

In this hands-on workshop, you'll learn how to scrape and clean real-world data from Wikipedia, specifically from a saved HTML copy of the [List of Basketball Films](https://en.wikipedia.org/wiki/List_of_basketball_films) article. The project will walk you through extracting movie metadata, removing footnote clutter, enriching the dataset with additional information like directors and producers, and exporting everything into a usable CSV file for further analysis.

We'll use an offline HTML file rather than scraping Wikipedia live (which can trigger rate limits or vary over time). This ensures the process is repeatable and stable, while also respecting the source material.

## What Will You Learn?

By the end of this project, you should know how to:

- Open and parse an HTML file using BeautifulSoup  
- Locate the correct `<table>` tag and extract its rows  
- Remove clutter such as footnote markers ([1]) and match text patterns using regular expressions (`re`)  
- Extract and organize metadata:  
  - Director, Producer, Writer, Cast, Production Company, Country, Budget, and Running Time  
- Add additional columns to the dataset that are helpful:  
  - `movie_link`: the film's Wikipedia URL, taken from the saved HTML copy  
  - `footnote_text`: readable footnote content to show necessary references  

Completing **ALL** of these steps organizes the references and in-text citations and gathers the important data for each movie into a well-structured format.

## How to Retrieve the HTML File

- **Location of HTML File**  
  - Click this [Wikipedia link](https://en.wikipedia.org/wiki/List_of_basketball_films) to open the **List of basketball films** page. 
- **Instructions**  
  - Right-click anywhere on the page and select **"Save As"** (or **"Save Page As"**) to download the full HTML file.  
  - Save the file in the same folder as your Python script.

## Libraries You'll Use

First, if you haven't already, you must install and import the essential tools for this tutorial. The tools we will be using today are:

- **Import `os`**  
  - This is not essential to the project; it just makes it easier to handle file paths in your file directory.  
  - If you keep everything for this project in one folder, you can skip this import entirely.  
  - No installation needed, `os` is built into Python’s standard library.

- **Import `re`**  
  - Used for matching patterns using regular expressions (e.g., finding headers in text).  
  - No installation needed, `re` is built into Python’s standard library.

- **Import `time`**  
  - Used to slow down scraping requests to avoid overloading servers (such as Wikipedia).  
  - This is essential for ethical scraping and preventing IP bans.  
  - **Warning:** If you skip this, you may overload the server and get IP banned.  
  - No installation needed, `time` is built into Python’s standard library.

- **Import `requests`**  
  - Allows your script to fetch live HTML pages from Wikipedia.  
  - Used in this tutorial to fetch each listed movie's own Wikipedia page, so we can read details such as the film's budget.  
  - **Code to install:** 
    - Open the terminal, type the following command, and press Enter. Having trouble? See this [detailed guide](https://www.geeksforgeeks.org/how-to-install-requests-in-python-for-windows-linux-mac/).
      ```bash
      pip install requests
      ```
- **From `bs4` import `BeautifulSoup`**  
  - Parses raw HTML into a searchable tree, which is what lets us pull out the table data for the CSV export.  
  - It can also parse XML when an XML-capable parser such as `lxml` is installed, but this tutorial only needs Python's built-in `html.parser`.  
  - **Code to install:**
    - Open the terminal, type the following command, and press Enter. Having trouble? See this [detailed guide](https://www.geeksforgeeks.org/beautifulsoup-installation-python/).
      ```bash
      pip install beautifulsoup4
      ```
  

This tutorial will introduce you to web scraping and show you how to build structured and cleaned datasets from unstructured HTML files. This process is an essential skill in data science, digital journalism, research, and data management.


---

## Step-by-Step Breakdown of the Scraping Process

Let's begin building our scraper! Below is the first chunk of our program that parses the Wikipedia HTML, collects footnotes, and identifies the table of basketball films. I will walk you through each line and concept to enable you to create an amazing project. Let's go!

### Step 1: Load the Saved HTML File from Wikipedia

Use this code block to open the HTML file and parse it with the BeautifulSoup library, using Python's built-in `html.parser`. Make sure the file name in the code matches your saved file exactly and that the file is an HTML file, not XML. 

**Follow this code syntax:**


In [None]:
import os
import re
import time
import csv
import requests
from bs4 import BeautifulSoup

# STEP 1: Load the saved Wikipedia HTML file
with open("List of basketball films - Wikipedia.html", "rb") as file:
    page = BeautifulSoup(file, "html.parser")

##### **What does this do?**
- We open the `.html` file in **binary mode** (`rb`). This hands the raw bytes to BeautifulSoup, which detects the file's character encoding itself, so special characters are read properly even if the file is not UTF-8.  
- We parse the HTML file using BeautifulSoup with the `html.parser`.  
- The result is stored in a BeautifulSoup object called `page`, which we'll use to navigate and piece together our CSV.

**After this step**, we now have a fully structured HTML file from Wikipedia loaded into memory.


### Step 2: Extract the Footnotes from the Bottom of the Page

**Let's break this down:**  
In this step, we want to extract the footnote texts that correspond to the footnote markers in the **Notes** column on the webpage.  
First:
- We create an empty dictionary called `footnotes`. This will store ID–text pairs from the file.  
- We use CSS selectors via `page.select()` to grab all the `<li>` elements directly inside the `<ol class="references">` list. This is where Wikipedia stores its citations.  
- `li[id]` makes sure we only grab items that actually have an `id` attribute, which is how each footnote such as [1] is identified.  
- Inside the loop we create, we store:  
    - **Key**: This is the footnote's ID tag  
    - **Value**: The text of the footnote itself, using `get_text(" ", strip=True)`.  
        - This is done to:  
            - Flatten the nested tags into plain text  
            - Strip the extra whitespace in the text  
            - Separate the text with spaces so it is more readable

In [None]:
footnotes = {}
reference_list = page.select("ol.references > li[id]")
for ref in reference_list:
    footnotes[ref["id"]] = ref.get_text(" ", strip=True)

**Why do this?**

- The footnotes in this specific Wiki file appear inline in the Notes column (e.g., [1]). Our goal is to **remove them from the main content in the Notes column** while also keeping them organized in a separate column or file, like `footnote_text`.
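
To see what the selector and `get_text(" ", strip=True)` produce, here is a minimal sketch run on a few lines of made-up HTML. The markup, the `cite_note-…` IDs, and the `sample_…` names are assumptions for illustration only; your saved file will contain the real reference list.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the shape of Wikipedia's reference list.
sample_html = """
<ol class="references">
<li id="cite_note-1"><span class="reference-text">Example source A.</span></li>
<li id="cite_note-2"><span class="reference-text">Example <i>source</i> B.</span></li>
<li>An item without an id, which the selector skips.</li>
</ol>
"""

sample_page = BeautifulSoup(sample_html, "html.parser")

sample_footnotes = {}
for ref in sample_page.select("ol.references > li[id]"):
    # Nested tags such as <i> are flattened and joined with single spaces.
    sample_footnotes[ref["id"]] = ref.get_text(" ", strip=True)

print(sample_footnotes)
```

Only the two items with an `id` end up in the dictionary, and the `<i>` tag inside the second one disappears from the stored text.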

### Step 3: Find the Main Movie Table

In [None]:
# STEP 3: Find the table with all the movies listed
table = page.find("table", class_="wikitable sortable")
rows = table.find_all("tr")

**What's going on here?**

- First, we use `page.find()` to find the **table** that has the class `"wikitable sortable"`. This is the class combination Wikipedia uses for its sortable data tables.  
- This **table** contains **ALL** the basketball film data that we want:  
    - Title, year, genre, and notes  
- We can then use `find_all("tr")` on the **table** to grab all of its rows (table rows).  

**FYI:** The first row is usually the header (column name(s)), and the rest are the data within their respective rows.
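
One thing to watch for: when `class_` is given a string with two class names, BeautifulSoup compares it against the whole `class` attribute, so a table with any extra class will not be found. The sketch below shows a more forgiving lookup with a CSS selector. The extra `jquery-tablesorter` class and the table contents are assumptions about what a browser-saved copy might contain, not something taken from your file.

```python
from bs4 import BeautifulSoup

# Made-up table whose class attribute carries one extra class.
sample_html = """
<table class="wikitable sortable jquery-tablesorter">
<tr><th>Title</th><th>Year</th></tr>
<tr><td>Film A</td><td>1990</td></tr>
</table>
"""
sample_page = BeautifulSoup(sample_html, "html.parser")

# A two-word class_ string is compared against the whole attribute value,
# so the extra class makes this lookup come back empty.
exact_match = sample_page.find("table", class_="wikitable sortable")

# A CSS selector only requires that both classes are present.
flexible_match = sample_page.select_one("table.wikitable.sortable")

sample_rows = flexible_match.find_all("tr")
print(exact_match is None, len(sample_rows))
```

If `page.find(...)` returns `None` on your own file, swapping in `page.select_one("table.wikitable.sortable")` is one option to try.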

### Step 4: Building Up the Full Scraper

Now that we have all the table rows from the HTML file, we can move on to the next steps! The next task is to loop over the collected table rows, extract the data from each cell, clean the footnotes, create new data columns to hold upcoming information, and store the footnote text.

**Here is how we do it step by step:**

In [None]:
# STEP 4: Set up the columns we want in our final CSV
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
column_names += [
    "movie_link", "director", "producer", "writer", "cast",
    "production_company", "country", "budget", "running_time", "footnote_text"
]

**Why are we doing this?**  
- The first row in the table (row 0) contains the column headers as specified earlier: **Title, Year, Genre**, and **Notes**.  
- We use a list comprehension to pull the clean text out of each `<th>` header cell. These names become the header row of the CSV and let us look up columns by name later.  
    - Use `.get_text(strip=True)` to remove whitespace and unnecessary HTML tags.  
- We then **extend** this list with additional column names that we will fill in later.  
    - These are the ten names appended to the `column_names` list in the code above.

**Why is this all useful?**  
- This helps make our final CSV structured and even more comprehensive, with new and more valuable information.

### Step 5: Looping Over the Table Rows and Extracting the Data

Since this looping process is complex, I’ve split it into three manageable parts: Step 5a to Step 5c for your convenience.

#### Step 5a

Before we can process each row in the movie table, we want to grab all the footnote citations that appear next to the text in the **Notes** column. They appear as numbers after the text, like this: [1].  
- We must first initialize an empty list called `all_data`. This will hold all the cleaned and structured movie rows.  
- Additionally, we want to create a base Wikipedia URL (`"https://en.wikipedia.org"`). This will help us later in the process to create appropriate full links to each individual movie page. This ensures that there won’t be any incomplete URLs and that the format remains clean.

In [None]:
all_data = []
base_url = "https://en.wikipedia.org"

**Now we begin the looping process through each movie row:**

In [None]:
# STEP 5: Go through each movie row
for row in rows[1:]:

- `rows[1:]` skips the first row because it holds the column headers.
- Inside each row, we can find all the `td` (table data) elements:

In [None]:
cells = row.find_all("td")
if not cells:
    continue

- To make this loop perform better, it’s best to add a conditional that skips empty rows. If a row is empty (e.g., due to spacing or formatting), we skip it using `continue`.  

#### Extracting the Footnotes  
This is an essential step in our web scraping process. We need to collect the contents of any footnotes before they are removed from the HTML. This is **very** important to keep in this order because once a tag is removed using `.decompose()`, it is permanently deleted from the HTML tree. This means we can no longer access it or its contents.  
- In wiki tables, footnotes appear as small numbers [1], [2], etc., inside the `sup` tags. These `sup` tags contain hyperlinks that direct users to the references section at the bottom of the page. This is why extracting the footnotes now is an important step.

In [None]:
# Step 5a: Extract footnote references before decomposing
footnote_parts = []
for sup in row.find_all("sup"):
    a_tag = sup.find("a")
    if a_tag and "href" in a_tag.attrs:
        foot_id = a_tag["href"].replace("#", "")
        if foot_id in footnotes:
            footnote_parts.append(footnotes[foot_id])

- We loop through each `sup` tag in the row, since Wiki holds these footnotes as links/markers on their website.  
- If the `sup` contains a hyperlink (`href`), we extract its footnote ID by removing the corresponding `#`.  
- We then look up that ID in our **footnotes** dictionary (created in Step 2) and add the full footnote text to our `footnote_parts` list.  

Finally, we combine the footnote parts into a single string. We separate them using a vertical bar (`|`). Here it is just a separator character inside a string (it has nothing to do with Python's `|` operator), and it keeps multiple footnotes clean and readable within one column.

In [None]:
footnote_text = " | ".join(footnote_parts)

**This string above** will be added later into our **final CSV** under the `footnote_text` column in Step 5c.

#### Step 5b: Clean the Visible Table Text by Removing Footnotes

After we've successfully grabbed and stored the footnotes and their contents in Step 5a, we no longer need the visible footnote markers [1]. This reduces clutter, and the footnote no longer serves a purpose after being scraped.

These footnote markers are usually embedded directly into the text within the `td` element and wrapped in the `sup` tag. If we do not remove these, they’ll be included in the final CSV file, which is not our intended goal.

**Please add the following to your script**

In [None]:
for sup in row.find_all("sup"):
    sup.decompose()

The `row.find_all("sup")` scans the entire HTML row for any remaining `sup` tags (footnote markers). The `sup.decompose()` method **permanently deletes** the tag ([1]) from the HTML tree, including **ALL** of its contents.
- Think of it this way: we’ve already asked BeautifulSoup to grab the content we need, the footnote text. Now that we no longer need the footnote tags themselves, we use `decompose()` to delete any remaining ones.
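
The order of Steps 5a and 5b can be checked on a single made-up row. This is only a sketch: the row, the footnote dictionary, and the `sample_…` names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up footnote dictionary and table row for illustration.
sample_footnotes = {"cite_note-1": "Example source A."}
sample_html = (
    '<table><tr><td>Film A</td>'
    '<td>A short note<sup><a href="#cite_note-1">[1]</a></sup></td></tr></table>'
)
sample_row = BeautifulSoup(sample_html, "html.parser").find("tr")

# 1) Collect the footnote text while the <sup> tags still exist.
parts = []
for sup in sample_row.find_all("sup"):
    a_tag = sup.find("a")
    if a_tag and "href" in a_tag.attrs:
        foot_id = a_tag["href"].replace("#", "")
        if foot_id in sample_footnotes:
            parts.append(sample_footnotes[foot_id])
sample_footnote_text = " | ".join(parts)

text_before = sample_row.get_text(" ", strip=True)

# 2) Only now delete the markers from the tree.
for sup in sample_row.find_all("sup"):
    sup.decompose()

sample_cells = [td.get_text(" ", strip=True) for td in sample_row.find_all("td")]
print(text_before)
print(sample_cells, sample_footnote_text)
```

Before `decompose()` the row text still ends in `[1]`; afterwards the cells are clean, and the footnote text has already been saved.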


#### Step 5c: Extracting and Building the Final Cleaned Row

Now that we have: 
- Collected Footnotes (Step 5a)
- Removed the visible footnote tag/markers (Step 5b)

We are ready to extract the **clean text** from the row and build a structured list of data. This list will represent one complete movie entry in the dataset. 

**Here is how to Extract the Clean Cell Text**

In [None]:
row_info = []
for cell in cells:
    row_info.append(cell.get_text(" ", strip=True))

Our next steps:
- We loop through each `td` cell in the row.  
- Use `get_text(" ", strip=True)` to extract only the text using:  
    - `strip=True`: This removes all leading or trailing whitespace.  
    - `" "`: This joins the pieces of text from nested tags and line breaks with a single space.  
- Since we removed all `sup` markers/tags in Step 5b, this makes the entire process a lot easier, as the text is now **free of footnote markers**.  

**Padding for Any Missing Cells**  
- Some rows may have fewer cells than the expected number of columns if the `Notes` column field is blank.  
- To avoid index errors and keep everything positioned correctly when writing into the CSV, we **pad** the row with empty strings (`""`) until it has the correct number of cells.  
- **Important:** the target length is the number of original table columns, which is `len(column_names)` minus `10`, because `column_names` already includes the 10 extra columns added in Step 4.  
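
The padding step is described here without its code, so the following is a minimal sketch of one way it could look, assuming the 10 extra column names from Step 4. The sample row and the name `table_column_count` are made up for illustration.

```python
# Column names as built in Step 4: 4 table columns plus 10 extra columns.
column_names = ["Title", "Year", "Genre", "Notes"] + [
    "movie_link", "director", "producer", "writer", "cast",
    "production_company", "country", "budget", "running_time", "footnote_text",
]

# Made-up row whose Notes cell is missing.
row_info = ["Film A", "1990", "Drama"]

# Number of columns that come from the table itself.
table_column_count = len(column_names) - 10

# Pad with empty strings until the row has one value per table column.
while len(row_info) < table_column_count:
    row_info.append("")

print(row_info)
```

After padding, the row has four values, so the ten values appended later in Step 5c line up with their column names.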

**Extract the Movie Link**

In [None]:
# Get movie Wikipedia link
link = ""
link_tag = cells[0].find("a")
if link_tag and "href" in link_tag.attrs:
    link = base_url + link_tag["href"]

**Process Explanation:**
- We check if the first cell has a **hyperlink**. In this case, it contains the *movie title*.  
- If yes, we create the **full Wikipedia link** by appending the `href` to the base URL.  
- This lets us visit each individual Wiki page for additional scraping in Step 7.

**Add Placeholders**

In [None]:
row_info += [link, "", "", "", "", "", "", "", "", footnote_text]
all_data.append(row_info)

- We add placeholders (`""`) for the 8 fields that we will fill in later when scraping the individual movie pages:  
    - `director, producer, writer, cast, company, country, budget, runtime`
- We also add the cleaned `footnote_text` we gathered in earlier steps.  
- Finally, we add this completed row to `all_data`, which is the final list that will be written into the **final CSV** file.  

**Why This Matters:**  
This step helps create a fully cleaned and structured version of the movie row with:  
- Cleaned and properly positioned text  
- A working Wiki link  
- A place to hold the scraped values  
- Any footnote references formatted to make the Notes column easier to read and understand

### Step 6: Filter Out the Rows Without Wikipedia Links

Now that we have created a list that is both structured and cleaned, we need to do a little more cleanup. Some rows might not include a workable Wikipedia link, and we want to remove any of those before we move on to the final web scraping in Step 7. 

**Code Explanation** 

In [None]:
all_data = [row for row in all_data if row[column_names.index("movie_link")].startswith("http")]

This line uses a **list comprehension** - [What is List Comprehension?](https://www.w3schools.com/python/python_lists_comprehension.asp)  
It lets you take an existing list and build a new one by applying some operation or change to each item, optionally filtering along the way.  

Here is the process we will be doing:
- Loop through every **row** in the `all_data` list.  
- Verify whether the value in the `movie_link` column starts with `http`. This indicates that it is a usable link.  
- Only **keep rows** that pass this verification process.

#### **Why is this Necessary?**
- Not every movie in the Wikipedia table has its own page. Some might be plain text without a hyperlink, or the `<a>` tag might be missing. This varies case by case depending on the HTML file used.  
- If we try to request a Wikipedia page without a valid link, the request will fail, and the time spent on that part of the run is wasted.  
- By doing this check before Step 7 (web scraping), we can guarantee that:  
    - Every remaining row we choose has a real and clickable Wiki link.  
    - The script will only try to scrape actual pages, not blank or missing entries.

**Code We Use**
`column_names.index("movie_link")`  
We use this to find the index of the `movie_link` column so we can check each row's link field dynamically, meaning we don’t assume the link is always in a fixed position.  

This matters because if your column order ever changes (e.g., someone moves the `movie_link` column to the end of the column list), a hardcoded index like `row[4]` will break or return an incorrect value.  

By finding the position based on the column name, the code will always point to the correct link field, no matter where that column appears in the list.  

**In short:** We are checking the link field dynamically so the code automatically adapts if the column order changes in the future.
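
Here is a small, self-contained sketch of the same filter on made-up rows. The film names and the example URL are placeholders; the only change from the line above is that the index is looked up once and stored in a hypothetical `link_index` variable.

```python
# Made-up columns and rows for illustration.
column_names = ["Title", "Year", "movie_link"]
all_data = [
    ["Film A", "1990", "https://en.wikipedia.org/wiki/Example_Film"],
    ["Film B", "1991", ""],  # no link was found for this row
]

# Look up the position by name instead of hardcoding a number.
link_index = column_names.index("movie_link")

# Keep only the rows whose link field starts with "http".
all_data = [row for row in all_data if row[link_index].startswith("http")]

print(len(all_data), all_data[0][0])
```

Only the first row survives, because the second row's link field is an empty string.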

**End Result:**
After this step: 
- `all_data` only contains valid, usable movie rows. 
- Each one has a working Wiki link for scraping in the next step. 

### Step 7: Scraping the Movie(s) Information

Now that Step 6 is complete, we can use those links to **visit each movie's page** and extract even more information. Specifically, we are scraping Wikipedia's infobox, which is the table usually found at the top right of each article. 

#### **The Function:** `scrape_movie_page(url)`
This function takes a single input, a URL. It requests that movie's Wikipedia page and tries to return 8 values: 
- `director, producer, writer, cast, company, country, budget,` and `runtime`. 

If anything fails in this process, or if one of these 8 values cannot be found on the page, the function returns an empty string for that value, so the corresponding field in the row stays blank. 

#### **Part 1: Requesting the Web Page**

In [None]:
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

This line of code:
- Sends a request to the URL using `requests.get()`.
- Sets a `User-Agent` header to mimic a real browser. This helps prevent Wikipedia from blocking the request.
- Specifies a timeout with `timeout=10`, which ensures that the request doesn't hang indefinitely. You can extend this time if needed, but 10 seconds is a reasonable limit to assume the page isn't responding.

#### **Part 2: Parsing the Page with BeautifulSoup**

In [None]:
soup = BeautifulSoup(response.text, "html.parser")

This will convert the **raw** HTML of the page into a BeautifulSoup object, which makes it easy to search and extract information.

#### **Part 3: Locate the InfoBox in the Webpage**

In [None]:
info = soup.find("table", class_="infobox")
if not info:
    return [""] * 8

Above: 
- Looks for the `<table>` with the class **"infobox"**, where the key information we need is located (director, cast, etc.).
- If there is **no infobox**, the function returns a list of 8 empty strings, which avoids crashing the script. 

#### **Part 4: Define the Helper Functions**

In [None]:
def get_text(label):
    header = info.find("th", string=re.compile(label, re.IGNORECASE))
    if header:
        cell = header.find_next_sibling("td")
        if cell:
            return cell.get_text(" ", strip=True)
    return ""

`get_text(label)`:
- This **helper** searches the infobox for the `<th>` (table header) that matches a specific label, such as **"Directed by"**, ignoring case.
- Then it grabs the `<td>` next to it.
- `.get_text(" ", strip=True)` extracts the clean text and joins the pieces with spaces.
- Finally, it returns an empty string if the label or value isn’t found.

#### **Next**

In [None]:
def get_cast():
    header = info.find("th", string=re.compile("starring|cast", re.IGNORECASE))
    if header:
        cell = header.find_next_sibling("td")
        if cell:
            list_items = cell.find_all("li")
            if list_items:
                return ", ".join([li.get_text(strip=True) for li in list_items])
            return cell.get_text(", ", strip=True)
    return ""

`get_cast()`: 
- Handles the **"Starring"** or **"Cast"** row, which may use either plain text or list items (`<li>`).
- If `<li>` tags exist, it joins them with commas.
- Otherwise, it extracts the text from the `<td>` directly.
- If nothing is found, it returns an empty string.

#### **Part 5: Extract Values from the Infobox**

In [None]:
director = get_text("Directed by")
producer = get_text("Produced by|Production")
writer = get_text("Written by")
cast = get_cast()
company = get_text("Production company|Production companies|Studio")
country = get_text("Country")
budget = get_text("Budget")
runtime = get_text("Running time")

- Each of these lines uses the `get_text()` or `get_cast()` function to find the correct field.
- Notice some labels use multiple terms separated by **`|`**, which acts as a logical **OR operator** in regular expressions (not standard Python booleans). This allows the pattern to match **any one** of the given options.
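
The effect of `|` inside a pattern can be checked without any HTML at all. This is a minimal sketch using the company pattern from Part 5; the list of labels is made up for illustration.

```python
import re

# The same alternation used for the production company lookup.
pattern = re.compile("Production company|Production companies|Studio", re.IGNORECASE)

# Made-up infobox labels to test against.
labels = ["Directed by", "Production companies", "studio", "Country"]

# Keep the labels where any one of the alternatives is found.
matches = [label for label in labels if pattern.search(label)]
print(matches)
```

Two of the four labels match: one through the second alternative, and one through the third thanks to `re.IGNORECASE`.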

#### **Part 6: Return the Results** 

In [None]:
return [director, producer, writer, cast, company, country, budget, runtime]

- This function returns a list of all 8 values in the order they will be inserted into the CSV.

#### **Part 7 - EXTRA: Error Handling**

In [None]:
except Exception as e:
    print("Error scraping:", url)
    return [""] * 8

- This step is important because Parts 1 through 6 sit inside a `try:` block, and this `except` clause catches anything that goes wrong (bad URL, timeout, missing tags), so the script continues without crashing.
- It prints which URL failed and safely returns 8 empty fields.
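
Parts 1 through 7 are shown as separate fragments, so here is one way they might fit together. This is a sketch under assumptions, not the definitive script: splitting the parsing into a separate `parse_infobox(html)` function is a choice made here so the logic can be tried offline (that name is hypothetical), and the sample HTML at the bottom is made up.

```python
import re

import requests
from bs4 import BeautifulSoup


def parse_infobox(html):
    """Return the 8 infobox fields from one film page's HTML text."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("table", class_="infobox")
    if not info:
        return [""] * 8

    def get_text(label):
        # Find the header cell matching the label, then read the cell beside it.
        header = info.find("th", string=re.compile(label, re.IGNORECASE))
        if header:
            cell = header.find_next_sibling("td")
            if cell:
                return cell.get_text(" ", strip=True)
        return ""

    def get_cast():
        header = info.find("th", string=re.compile("starring|cast", re.IGNORECASE))
        if header:
            cell = header.find_next_sibling("td")
            if cell:
                list_items = cell.find_all("li")
                if list_items:
                    return ", ".join(li.get_text(strip=True) for li in list_items)
                return cell.get_text(", ", strip=True)
        return ""

    return [
        get_text("Directed by"),
        get_text("Produced by|Production"),
        get_text("Written by"),
        get_cast(),
        get_text("Production company|Production companies|Studio"),
        get_text("Country"),
        get_text("Budget"),
        get_text("Running time"),
    ]


def scrape_movie_page(url):
    """Fetch one film page and return its 8 fields, or 8 blanks on any error."""
    try:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        return parse_infobox(response.text)
    except Exception as e:
        print("Error scraping:", url, e)
        return [""] * 8


# Offline check with a made-up infobox (no network request is sent).
sample_infobox_html = """
<table class="infobox vevent">
<tr><th>Directed by</th><td>Jane Doe</td></tr>
<tr><th>Starring</th><td><ul><li>Actor A</li><li>Actor B</li></ul></td></tr>
<tr><th>Running time</th><td>100 minutes</td></tr>
</table>
"""
sample_details = parse_infobox(sample_infobox_html)
print(sample_details)
```

Fields that are missing from the sample infobox come back as empty strings, which is the same behavior the fragments above describe.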

### Step 8: Update Movie Rows with Scraped Data

At this point, we have: 
- Collected each movie’s basic info from the Wikipedia table.
- Saved the full link to each movie’s individual Wikipedia page.
- Written a function (`scrape_movie_page()`) to visit that page and extract detailed information.

**Finally**, we use `scrape_movie_page()` inside a loop over the rows: for each row, we extract the data from the individual movie page and insert it into the right place in our dataset.

In [None]:
for row in all_data:

- We go row by row through `all_data`, which holds every movie and its basic info.

- Each row is a list of values (e.g., `title, year, genre, notes, link,` and `empty placeholders`).

#### **Get the Movie Link**

In [None]:
url = row[column_names.index("movie_link")]

- This will retrieve the Wiki link for that specific movie. 
- Use `column_names.index("movie_link")` to find the correct position in the list **dynamically**, in case the column order ever changes.

#### **We Print Progress**

In [None]:
print("Scraping:", url)

- This prints out the current movie link being scraped.

- It’s helpful for monitoring progress and for debugging: if a specific link causes an error, we can see exactly when and where it went wrong. 

#### **Scrape the Data**

In [None]:
details = scrape_movie_page(url)

- Calls the `scrape_movie_page()` function (defined in Step 7)

- Sends the current URL as input

- Returns a list of 8 values we specified:
    - `[director, producer, writer, cast, company, country, budget, runtime]`

#### **Insert the Data into the Row**

In [None]:
row[-9:-1] = details  # Fill in the 8 extra fields

**Breakdown**
- `row[-9:-1]` targets the 8 empty fields we originally reserved in Step 5c.
- The last 10 columns are: [movie_link, director, ..., running_time, footnote_text]
- We want to fill everything except the link (position -10) and the footnote (the last column, -1).
- A slice stops just before its end index, so `-9:-1` covers positions -9 through -2. This line replaces those 8 empty fields with the new scraped data.
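
Negative slices are easy to get off by one, so it is worth checking on a made-up row which positions are covered. In the sketch below, the 8 placeholders sit between the link (position -10) and the footnote (position -1), so the slice that covers exactly them is `row[-9:-1]`. All values are invented for illustration.

```python
# Made-up row: 4 table columns, the link, 8 placeholders, and the footnote text.
sample_link = "https://en.wikipedia.org/wiki/Example_Film"
sample_row = ["Film A", "1990", "Drama", "A note", sample_link,
              "", "", "", "", "", "", "", "", "Example source A."]

# Made-up result in the order returned by scrape_movie_page().
sample_details = ["Director", "Producer", "Writer", "Cast",
                  "Company", "Country", "Budget", "Runtime"]

# Positions -9 through -2 are the 8 placeholders; the end index -1 is excluded.
sample_row[-9:-1] = sample_details

print(sample_row[-10], sample_row[-9], sample_row[-2], sample_row[-1])
```

The link and the footnote text keep their places, and the row still has 14 values.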

#### **Pause Between Requests**

In [None]:
time.sleep(1)

**Breakdown**
- This adds a 1-second delay between each request.
- It's a respectful and safe practice when scraping multiple pages, especially from public sites like Wikipedia.
- It prevents your script from overloading the server or getting rate-limited or blocked.

### Step 9: Write the Final Data to a CSV File

At this point, we’ve:

- Extracted all relevant movie info from the Wikipedia table,
- Scraped extra details from each movie’s page (like director, cast, and budget), 
- Stored everything in a structured list called `all_data`.

Now we write all of that cleaned, processed data to a CSV file so it can be used for analysis, sharing, or importing into a spreadsheet or database.

##### **Open a CSV File for Writing**

In [None]:
with open("basketball_films.csv", "w", newline="", encoding="utf-8") as f:

- This opens (or creates) a file named `basketball_films.csv` in **write mode** `("w")`.
- `newline=""` prevents blank rows from appearing between entries (a common issue on Windows).
- `encoding="utf-8"` makes sure special characters (like accented letters) are written correctly.
- `f` is the file object used within the `with` block, and the file is automatically closed when the block ends.

#### **Create a CSV Writer, Write the Header Row & Write All Movie Rows**
In this final step, we export our cleaned and enriched dataset to a `.csv` file for use in tools like Excel, Google Sheets, or pandas.

We open the file using a `with` block to ensure it closes properly after writing. Then we create a CSV writer, write the column headers as the first row, and output all the movie data we’ve stored in `all_data`.

In [None]:
writer = csv.writer(f)
writer.writerow(column_names)
writer.writerows(all_data)
print("Saved: basketball_films.csv")
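
The writer lines belong inside the `with` block opened earlier. Here is a minimal sketch of the whole write step on made-up data. Writing into a temporary folder and reading the file back are additions for illustration, and the `sample_…` names are placeholders.

```python
import csv
import os
import tempfile

# Made-up header and rows for illustration.
sample_columns = ["Title", "Year", "Genre"]
sample_data = [
    ["Film A", "1990", "Drama, Sports"],  # the comma forces quoting in the file
    ["Film B", "1991", "Comedy"],
]

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "basketball_films_demo.csv")

    # Write the header row first, then every data row.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(sample_columns)
        writer.writerows(sample_data)

    # Read the file back to confirm the rows survived the round trip.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows_read_back = list(csv.reader(f))

print(rows_read_back)
```

The value containing a comma is quoted automatically by `csv.writer`, so it comes back as a single field.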

### Final Thoughts

You’ve just completed a full data scraping and processing project using Python. Here’s a summary of what your code accomplished from start to finish:

- Loaded a saved HTML file from Wikipedia that contains a table of basketball films.
- Parsed and extracted the table rows and headers, identifying key fields like Title, Year, Genre, and Notes.
- Collected footnote references [1] from the bottom of the page and stored them separately.
- Removed the footnote markers from the visible text in the table so the data is clean.
- Extracted and cleaned each movie’s row of information, including its title and notes, and attached a working Wikipedia link if it was available in the original HTML file.
- Skipped any rows that didn’t include a real link to a Wikipedia page.
- Defined a custom function (`scrape_movie_page()`) to visit each movie’s Wikipedia page and scrape detailed information like director, cast, budget, and runtime from the infobox.
- Filled in those extra details into the correct columns for each movie.
- Wrote all the structured and cleaned data into a CSV file so it can be used for analysis, opened in Excel, or imported into a data science project.

This project introduced key concepts like:
- HTML parsing with BeautifulSoup
- Using regular expressions to match patterns in text
- Dynamic indexing to make your code flexible
- Writing structured data to a CSV file
- Adding delays between web requests to avoid overloading a server
- Handling missing data and potential errors safely

By doing this project, you’ve worked through a real-world example of taking messy web data and turning it into something usable and well-organized. This kind of skill is valuable in data science, research, and even journalism.

#### **Next Steps:**
If you want to go even further with this project, here are some ways you can build on what you’ve done:

- **Clean the data even more**
  - Right now, the `budget` column is mixed. Some values are written like “$1 million” while others say just “1 million.” You could write code to:
    - Remove dollar signs (`$`) and the word “million,”
    - Convert values to pure numbers (e.g., "$1 million" becomes `1000000`),
    - And leave the field blank if the value is unknown.
  - The `runtime` column also includes extra text like “167 mins.” You can clean this by:
    - Removing the word “mins” or “minutes,”
    - And converting the value to an integer, like `167`.
- **Add new features**
  - Scrape more information, like release dates, box office totals, or Rotten Tomatoes scores (if available on the Wikipedia page).
- **Make the script reusable**
  - Turn your code into a tool that could work on other Wikipedia film lists, like football movies.
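
As a starting point for the budget and runtime cleanup suggested above, here is a minimal sketch using `re`. The function names `parse_budget` and `parse_runtime` are hypothetical, and the sketch assumes budgets are written in millions and runtimes in minutes; real infobox values vary more than this, so treat it as a first draft.

```python
import re


def parse_budget(text):
    """Turn text like '$1 million' into whole dollars; None if nothing matches."""
    cleaned = text.replace(",", "")
    match = re.search(r"(\d+(?:\.\d+)?)\s*million", cleaned, re.IGNORECASE)
    if match:
        return int(round(float(match.group(1)) * 1_000_000))
    return None


def parse_runtime(text):
    """Turn text like '167 mins' or '167 minutes' into an integer; None if unknown."""
    match = re.search(r"(\d+)\s*min", text, re.IGNORECASE)
    if match:
        return int(match.group(1))
    return None


print(parse_budget("$1 million"), parse_budget("1 million"), parse_budget(""))
print(parse_runtime("167 mins"), parse_runtime("167 minutes"))
```

Returning `None` for unknown values works well with the CSV step, because `csv.writer` writes `None` as an empty field.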

By doing any of these extensions to the project, you’ll get more experience with data cleaning, web scraping, and real-world problem solving, while making your project even stronger.