# Week 1 Lab: Data Collection for Machine Learning

**CS 203: Software Tools and Techniques for AI**

---

## Lab Overview

In this lab, you will learn to collect data from the web using:

1. **HTTP fundamentals** - Understanding how the web works
2. **curl** - Command-line HTTP client
3. **Python requests** - Programmatic API calls
4. **BeautifulSoup** - Web scraping when APIs don't exist

**Goal**: Build a movie data collection pipeline for Netflix-style movie prediction.

---

## Setup

First, let's install and import the required libraries.

In [None]:
# Install required packages (uncomment if needed)
# !pip install requests beautifulsoup4 pandas

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time

print("All imports successful!")

All imports successful!


---

# Part 1: HTTP Fundamentals

Before we start collecting data, we need to understand how the web works.

## 1.1 Understanding URLs

A URL (Uniform Resource Locator) has several components:

```
https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010#details
└─┬──┘ └──────┬───────┘└┬─┘└───┬───┘└─────────┬────────┘└───┬───┘
  │           │         │      │              │             │
Protocol    Host      Port   Path          Query        Fragment
```
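The same breakdown can be checked in code. The sketch below parses the diagram's URL with `urllib.parse.urlsplit`. This URL is only an illustration: the `/v1/movies` path and `#details` fragment are assumed for the diagram and are not real OMDb endpoints.

```python
from urllib.parse import urlsplit

# Illustrative URL from the diagram (not a real OMDb endpoint)
url = "https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010#details"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # api.omdbapi.com
print(parts.port)      # 443
print(parts.path)      # /v1/movies
print(parts.query)     # t=Inception&y=2010
print(parts.fragment)  # details
```

Note that `hostname` and `port` are reported separately, while `netloc` would return both joined as `api.omdbapi.com:443`.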

### Question 1.1 (Solved): Parse a URL

Use Python's `urllib.parse` to break down a URL into its components.

In [None]:
# SOLVED EXAMPLE
from urllib.parse import urlparse, parse_qs

url = "https://api.omdbapi.com/?apikey=demo&t=Inception&y=2010"

parsed = urlparse(url)

print(f"Scheme (protocol): {parsed.scheme}")
print(f"Host (domain): {parsed.netloc}")
print(f"Path: {parsed.path}")
print(f"Query string: {parsed.query}")

# Parse query parameters into a dictionary
params = parse_qs(parsed.query)
print(f"\nParsed parameters: {params}")

Scheme (protocol): https
Host (domain): api.omdbapi.com
Path: /
Query string: apikey=demo&t=Inception&y=2010

Parsed parameters: {'apikey': ['demo'], 't': ['Inception'], 'y': ['2010']}


### Question 1.2: Parse a Different URL

Parse the following GitHub API URL and extract:
1. The host
2. The path
3. All query parameters as a dictionary

URL: `https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc`

In [None]:
# YOUR CODE HERE
url2 = "https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc"

# Parse the URL
parsed2 = urlparse(url2)
# Print the host
print(f"URL Host: {parsed2.netloc}")
# Print the path
print(f"URL Path: {parsed2.path}")
# Print the query parameters as a dictionary
parameters = parse_qs(parsed2.query)
print(f"Query Parameters: {parameters}")


URL Host: api.github.com
URL Path: /search/repositories
Query Parameters: {'q': ['machine learning'], 'sort': ['stars'], 'order': ['desc']}


---

## 1.2 HTTP Status Codes

HTTP status codes tell you what happened with your request:

| Range | Category | Common Examples |
|-------|----------|----------------|
| 2xx | Success | 200 OK, 201 Created |
| 3xx | Redirect | 301 Moved Permanently, 302 Found |
| 4xx | Client Error | 400 Bad Request, 401 Unauthorized, 404 Not Found |
| 5xx | Server Error | 500 Internal Server Error, 503 Service Unavailable |
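As a rough sketch, the ranges in the table can be expressed as a small helper. The function name `status_category` is a hypothetical choice for this example; `http.HTTPStatus` is part of the standard library and supplies the reason phrase for each code.

```python
from http import HTTPStatus

def status_category(code):
    """Map an HTTP status code to the category names used in the table."""
    if 200 <= code <= 299:
        return "Success"
    if 300 <= code <= 399:
        return "Redirect"
    if 400 <= code <= 499:
        return "Client Error"
    if 500 <= code <= 599:
        return "Server Error"
    return "Other"

for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```

In practice you rarely need such a helper with `requests`, since `response.ok` and `response.raise_for_status()` cover the common cases, but it makes the ranges concrete.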

### Question 1.3: Match Status Codes

Match each scenario to the most likely HTTP status code:

1. You requested a movie that doesn't exist in the database
2. You made too many requests and hit the rate limit
3. Your API key is invalid
4. The request was successful and data was returned
5. The server crashed while processing your request

Status codes to choose from: `200`, `401`, `404`, `429`, `500`

In [None]:
# YOUR ANSWERS HERE
answers = {
    "movie_not_found": 404,      # 404 Not Found: the resource does not exist
    "rate_limited": 429,
    "invalid_api_key": 401,
    "success": 200,
    "server_crashed": 500
}

print(answers)

{'movie_not_found': 404, 'rate_limited': 429, 'invalid_api_key': 401, 'success': 200, 'server_crashed': 500}


---

# Part 2: Making Requests with `curl`

`curl` is a command-line tool for making HTTP requests. It's essential for quick testing.

## 2.1 Basic curl Commands

You can run shell commands in Jupyter using the `!` prefix.

### Question 2.1 (Solved): Your First API Call

Let's call a simple public API that requires no authentication.

In [None]:
# SOLVED EXAMPLE
# JSONPlaceholder is a free fake API for testing
!curl -s "https://jsonplaceholder.typicode.com/posts/1"

{
  "userId": 1,
  "id": 1,
  "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
  "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}

### Question 2.2: Pretty Print with jq

The output above is hard to read. Use `jq` to format it nicely.

**Hint**: Pipe the curl output to jq: `curl ... | jq .`

In [None]:
# YOUR CODE HERE
# Fetch the same post but format the output with jq
!curl -s "https://jsonplaceholder.typicode.com/posts/1" | jq .

{
  "userId": 1,
  "id": 1,
  "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
  "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}


### Question 2.3: Extract Specific Fields with jq

Fetch all posts from `https://jsonplaceholder.typicode.com/posts` and extract only the `title` field from each post.

**Hint**: Use `jq '.[].title'` to get the title from each element in the array.

In [None]:
# YOUR CODE HERE
!curl -s "https://jsonplaceholder.typicode.com/posts" | jq '.[].title'

"sunt aut facere repellat provident occaecati excepturi optio reprehenderit"
"qui est esse"
"ea molestias quasi exercitationem repellat qui ipsa sit aut"
"eum et est occaecati"
"nesciunt quas odio"
"dolorem eum magni eos aperiam quia"
"magnam facilis autem"
"dolorem dolore est ipsam"
"nesciunt iure omnis dolorem tempora et accusantium"
"optio molestias id quia eum"
"et ea vero quia laudantium autem"
"in quibusdam tempore odit est dolorem"
"dolorum ut in voluptas mollitia et saepe quo animi"
"voluptatem eligendi optio"
"eveniet quod temporibus"
"sint suscipit perspiciatis velit dolorum rerum ipsa laboriosam odio"
"fugit voluptas sed molestias voluptatem provident"
"voluptate et itaque vero tempora molestiae"
"adipisci placeat illum aut reiciendis qui"
"doloribus ad provident

### Question 2.4: View Response Headers

Use the `-I` flag to fetch only the response headers (no body) from:
`https://api.github.com`

What is the value of the `X-RateLimit-Limit` header?
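Once you have the headers, the rate-limit values can be read like any other dictionary entries. The sketch below uses hypothetical header values (they are not taken from a real response) and a hypothetical helper name, `rate_limit_summary`. It assumes the reset header holds a UTC epoch timestamp in seconds, which is how GitHub documents it.

```python
from datetime import datetime, timezone

# Hypothetical header values for illustration; real ones come from the response
sample_headers = {
    "x-ratelimit-limit": "100",
    "x-ratelimit-remaining": "97",
    "x-ratelimit-reset": "1700000000",
}

def rate_limit_summary(headers):
    """Summarize rate-limit headers as numbers plus a reset datetime."""
    limit = int(headers["x-ratelimit-limit"])
    remaining = int(headers["x-ratelimit-remaining"])
    # Assumed to be seconds since the Unix epoch, in UTC
    reset_at = datetime.fromtimestamp(int(headers["x-ratelimit-reset"]), tz=timezone.utc)
    return {"limit": limit, "remaining": remaining,
            "used": limit - remaining, "reset_at": reset_at}

summary = rate_limit_summary(sample_headers)
print(summary)
```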

In [None]:
# YOUR CODE HERE
!curl -I "https://api.github.com"

HTTP/2 200
date: Mon, 12 Jan 2026 15:49:07 GMT
content-type: application/json; charset=utf-8
cache-control: public, max-age=60, s-maxage=60
vary: Accept,Accept-Encoding, Accept, X-Requested-With
etag: W/"4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"
x-github-media-type: github.v3; format=json
x-github-api-version-selected: 2022-11-28
access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset
access-control-allow-origin: *
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 0
referrer-policy: origin-w

### Question 2.5: Add Custom Headers

Make a request to `https://httpbin.org/headers` with the following custom headers:
- `User-Agent: CS203-Lab/1.0`
- `Accept: application/json`

**Hint**: Use `-H "Header-Name: value"` for each header.

In [None]:
# YOUR CODE HERE
! curl -s -H "User-Agent: CS203-Lab/1.0" -H "Accept: application/json" "https://httpbin.org/headers"

{
  "headers": {
    "Accept": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "CS203-Lab/1.0", 
    "X-Amzn-Trace-Id": "Root=1-696517f4-13fc41c5263224e27e65ebe8"
  }
}


---

# Part 3: Python `requests` Library

While `curl` is great for testing, we need Python for automation.

## 3.1 Basic GET Requests

### Question 3.1 (Solved): Simple GET Request

Make a GET request and inspect the response object.

In [None]:
# SOLVED EXAMPLE
import requests

response = requests.get("https://jsonplaceholder.typicode.com/posts/1")

print(f"Status Code: {response.status_code}")
print(f"Content-Type: {response.headers['Content-Type']}")
print(f"Response OK: {response.ok}")
print(f"\nJSON Data:")
print(response.json())

Status Code: 200
Content-Type: application/json; charset=utf-8
Response OK: True

JSON Data:
{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}


### Question 3.2: Fetch Multiple Posts

Fetch posts from `https://jsonplaceholder.typicode.com/posts` and:
1. Print the total number of posts
2. Print the titles of the first 5 posts

In [None]:
# YOUR CODE HERE
response2=requests.get("https://jsonplaceholder.typicode.com/posts")

posts=response2.json()

print(f"Number of total posts {len(posts)}")

print("TITLES OF THE FIRST FIVE: \n")

for post in posts[:5]:
    print(post['title']," ")

Number of total posts 100
TITLES OF THE FIRST FIVE: 

sunt aut facere repellat provident occaecati excepturi optio reprehenderit  
qui est esse  
ea molestias quasi exercitationem repellat qui ipsa sit aut  
eum et est occaecati  
nesciunt quas odio  


### Question 3.3 (Solved): Using Query Parameters

The proper way to add query parameters is using the `params` argument.

In [None]:
# SOLVED EXAMPLE
import requests

# Bad way (manual string building)
# url = "https://jsonplaceholder.typicode.com/posts?userId=1"

# Good way (using params)
response = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params={"userId": 1}
)

posts = response.json()
print(f"User 1 has {len(posts)} posts")
print(f"\nActual URL used: {response.url}")

User 1 has 10 posts

Actual URL used: https://jsonplaceholder.typicode.com/posts?userId=1
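One reason to prefer `params` is that values are percent-encoded for you. The standard-library sketch below shows the kind of escaping involved, using `urllib.parse.urlencode`, which is similar in spirit to what `requests` does internally. The `title` value is a made-up example chosen because it contains a space and an `&`.

```python
from urllib.parse import urlencode, parse_qs

base = "https://jsonplaceholder.typicode.com/posts"
params = {"userId": 1, "title": "rock & roll"}  # hypothetical values

# Naive concatenation leaves the space and '&' unescaped, corrupting the query
naive = f"{base}?userId={params['userId']}&title={params['title']}"

# urlencode escapes reserved characters so the value survives the round trip
safe = f"{base}?{urlencode(params)}"

print(naive)
print(safe)
print(parse_qs(safe.split("?", 1)[1]))
```

Parsing the naive URL back would split the title at the `&`, while the encoded version round-trips to the original value.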


### Question 3.4: Filter Posts by User

Fetch all posts by user 5 and user 7. Compare how many posts each user has.

**Hint**: Make two separate requests with different `userId` values.

In [None]:
# YOUR CODE HERE
import requests

response=requests.get("https://jsonplaceholder.typicode.com/posts",
    params={"userId": 5})

posts5=response.json()

response2=requests.get("https://jsonplaceholder.typicode.com/posts", params={"userId":7})

posts7=response2.json()

print(f"user 5 posted {len(posts5)} while user 7  posted {len(posts7)}")

user 5 posted 10 while user 7  posted 10


---

## 3.2 Working with Real APIs

Let's work with some real-world APIs.

### Question 3.5 (Solved): GitHub API - Public Repositories

The GitHub API is free to use (with rate limits) and doesn't require authentication for public data.

In [None]:
# SOLVED EXAMPLE
import requests

# Fetch information about a popular repository
response = requests.get(
    "https://api.github.com/repos/pandas-dev/pandas",
    headers={"Accept": "application/vnd.github.v3+json"}
)

if response.ok:
    repo = response.json()
    print(f"Repository: {repo['full_name']}")
    print(f"Description: {repo['description']}")
    print(f"Stars: {repo['stargazers_count']:,}")
    print(f"Forks: {repo['forks_count']:,}")
    print(f"Language: {repo['language']}")
else:
    print(f"Error: {response.status_code}")

Repository: pandas-dev/pandas
Description: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Stars: 47,555
Forks: 19,504
Language: Python


### Question 3.6: Compare Popular ML Libraries

Fetch information about these ML-related repositories and create a comparison table:
- `scikit-learn/scikit-learn`
- `pytorch/pytorch`
- `tensorflow/tensorflow`

Show: name, stars, forks, and primary language.

**Hint**: Loop through the repos and collect data into a list of dictionaries, then create a DataFrame.

In [None]:
# YOUR CODE HERE
import pandas as pd
repos = [
    "scikit-learn/scikit-learn",
    "pytorch/pytorch",
    "tensorflow/tensorflow"
]

data=[]

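# Fetch data for each repo
for repo_name in repos:
    response = requests.get(
        f"https://api.github.com/repos/{repo_name}",
        headers={"Accept": "application/vnd.github.v3+json"},
        timeout=10,
    )
    if not response.ok:
        # Skip repos that fail (for example when the rate limit is exhausted)
        print(f"Error fetching {repo_name}: {response.status_code}")
        continue
    repo = response.json()
    data.append({
        "Name": repo["full_name"],
        "Stars": repo["stargazers_count"],
        "Forks": repo["forks_count"],
        "Primary Language": repo["language"],
    })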


# Create a DataFrame

df=pd.DataFrame(data)
df=df.set_index('Name')
# Display the comparison
display(df)


| Name | Stars | Forks | Primary Language |
|------|-------|-------|------------------|
| scikit-learn/scikit-learn | 64601 | 26595 | Python |
| pytorch/pytorch | 96560 | 26488 | Python |
| tensorflow/tensorflow | 193309 | 75155 | C++ |


### Question 3.7: Search GitHub Repositories

Use the GitHub search API to find the top 10 most starred repositories with "machine learning" in their description.

API endpoint: `https://api.github.com/search/repositories`

Parameters:
- `q`: search query (e.g., "machine learning")
- `sort`: "stars"
- `order`: "desc"
- `per_page`: 10

Print the name and star count of each repository.

In [None]:
# YOUR CODE HERE
import requests

parameters={"q":"machine learning", "sort":"stars", "order":"desc", "per_page":10}

response=requests.get("https://api.github.com/search/repositories", params=parameters)

top10=response.json()

for repo in top10["items"]:
  print(f"Name - {repo['full_name']} & star count - {repo['stargazers_count']}")

Name - tensorflow/tensorflow & star count - 193309
Name - huggingface/transformers & star count - 154950
Name - microsoft/ML-For-Beginners & star count - 83013
Name - fighting41love/funNLP & star count - 78381
Name - josephmisiti/awesome-machine-learning & star count - 71299
Name - scikit-learn/scikit-learn & star count - 64601
Name - gradio-app/gradio & star count - 41279
Name - TheAlgorithms/C-Plus-Plus & star count - 33669
Name - lutzroeder/netron & star count - 32168
Name - ashishpatel26/500-AI-Machine-learning-Deep-learning-Computer-vision-NLP-Projects-with-code & star count - 30823


---

## 3.3 Error Handling

Real-world APIs fail. We need to handle errors gracefully.

### Question 3.8 (Solved): Handling HTTP Errors

In [None]:
# SOLVED EXAMPLE
import requests

def fetch_with_error_handling(url):
    """Fetch URL with proper error handling."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises exception for 4xx/5xx
        return response.json()
    except requests.exceptions.Timeout:
        print(f"Timeout: Request took too long")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e.response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None

# Test with valid URL
print("Valid URL:")
data = fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/1")
if data:
    print(f"  Got post: {data['title'][:50]}...")

# Test with invalid URL (404)
print("\nInvalid URL (404):")
fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/99999")

Valid URL:
  Got post: sunt aut facere repellat provident occaecati excep...

Invalid URL (404):
HTTP Error: 404


### Question 3.9: Robust Fetcher Function

Write a function `safe_fetch(url, max_retries=3)` that:

1. Attempts to fetch the URL
2. If it fails with a 5xx error, retries up to `max_retries` times
3. Waits 1 second between retries
4. Returns the JSON data if successful, None otherwise

Test it with `https://httpbin.org/status/500` (always returns 500) and `https://jsonplaceholder.typicode.com/posts/1` (always works).

In [None]:
# YOUR CODE HERE
import time

def safe_fetch(url, max_retries=3):
    """Fetch URL with retry logic for server errors."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            # Retry on 5xx unless this was the last allowed attempt
            if 500 <= response.status_code <= 599 and attempt < max_retries - 1:
                time.sleep(1)
                print("testing\n")  # Debug message printed before each retry
                continue
            response.raise_for_status()  # Raises exception for 4xx/5xx
            return response.json()
        except requests.exceptions.Timeout:
            print("Timeout: Request took too long")
        except requests.exceptions.HTTPError as e:
            print(f"HTTP Error: {e.response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
        # Any exception ends the loop: only 5xx responses are retried
        return None
    return None




# Test your function
print("Testing with working URL:")
result = safe_fetch("https://jsonplaceholder.typicode.com/posts/1")
print(f"Result: {result}")

print("\nTesting with failing URL (500):")
result = safe_fetch("https://httpbin.org/status/500")
print(f"Result: {result}")

Testing with working URL:
Result: {'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}

Testing with failing URL (500):
testing

testing

HTTP Error: 500
Result: None
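The retry logic itself can be studied without touching the network. The sketch below separates it from `requests`: `retry_on_server_error` and the fake endpoint are hypothetical names for illustration, and `fetch` is assumed to be any zero-argument callable that returns a `(status_code, payload)` pair.

```python
import time

def retry_on_server_error(fetch, max_retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-5xx status or attempts run out."""
    for attempt in range(1, max_retries + 1):
        status, payload = fetch()
        if status < 500:
            # 2xx/3xx: hand back the payload; 4xx: give up without retrying
            return payload if status < 400 else None
        if attempt < max_retries:
            sleep(delay)  # Wait before the next attempt
    return None

# Fake endpoint: fails twice with 503, then succeeds
responses = iter([(503, None), (503, None), (200, {"id": 1})])
result = retry_on_server_error(lambda: next(responses), sleep=lambda s: None)
print(result)
```

Passing `sleep` as a parameter is a design choice that lets the example (and any tests) run instantly; in real use the default `time.sleep` applies.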


---

# Part 4: The OMDb Movie API

Now let's work with the OMDb API - our main data source for the Netflix project.

**Note**: You need an API key from https://www.omdbapi.com/apikey.aspx (free tier available).

For this lab, we'll use a demo key that has limited functionality.

In [None]:
# Set your API key here
# Get a free key from: https://www.omdbapi.com/apikey.aspx
OMDB_API_KEY = "f2d40084"  # Replace with your actual key

# For demo purposes, you can try with key "demo" but it's very limited
# OMDB_API_KEY = "demo"

### Question 4.1 (Solved): Fetch a Single Movie

In [None]:
# SOLVED EXAMPLE
import requests

def fetch_movie(title, year=None, api_key=OMDB_API_KEY):
    """Fetch movie data from OMDb API."""
    params = {
        "apikey": api_key,
        "t": title,  # Search by title
        "type": "movie"
    }
    if year:
        params["y"] = year

    response = requests.get("https://www.omdbapi.com/", params=params)

    if response.ok:
        data = response.json()
        if data.get("Response") == "True":
            return data
        else:
            print(f"Movie not found: {data.get('Error')}")
    return None

# Fetch Inception
movie = fetch_movie("Inception", 2010)
if movie:
    print(f"Title: {movie['Title']}")
    print(f"Year: {movie['Year']}")
    print(f"Director: {movie['Director']}")
    print(f"IMDB Rating: {movie['imdbRating']}")
    print(f"Genre: {movie['Genre']}")

Title: Inception
Year: 2010
Director: Christopher Nolan
IMDB Rating: 8.8
Genre: Action, Adventure, Sci-Fi


### Question 4.2: Explore the Response

Fetch data for "The Dark Knight" and print ALL available fields in the response.

Which fields might be useful for predicting movie success?

In [None]:
# YOUR CODE HERE
movie=fetch_movie("The Dark Knight")

if movie:
  print(movie)

{'Title': 'The Dark Knight', 'Year': '2008', 'Rated': 'PG-13', 'Released': '18 Jul 2008', 'Runtime': '152 min', 'Genre': 'Action, Crime, Drama', 'Director': 'Christopher Nolan', 'Writer': 'Jonathan Nolan, Christopher Nolan, David S. Goyer', 'Actors': 'Christian Bale, Heath Ledger, Aaron Eckhart', 'Plot': 'When a menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman, James Gordon and Harvey Dent must work together to put an end to the madness.', 'Language': 'English, Mandarin', 'Country': 'United States, United Kingdom', 'Awards': 'Won 2 Oscars. 163 wins & 165 nominations total', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '9.1/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '85/100'}], 'Metascore': '85', 'imdbRating': '9.1', 'imdbVotes': '3,115,102', 'imdbID': 'tt0468569', 'Type': 'movie', 'DVD':

### Question 4.3: Fetch Multiple Movies

Create a function `fetch_movies(titles)` that:
1. Takes a list of movie titles
2. Fetches data for each movie
3. Returns a list of movie dictionaries (only successful fetches)
4. Adds a 0.5 second delay between requests (to respect rate limits)

Test it with: `["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]`

In [None]:
# YOUR CODE HERE
import time

def fetch_movies(titles):
    """Fetch multiple movies from OMDb API."""
    output = []

    for i, title in enumerate(titles):
        if i > 0:
            time.sleep(0.5)  # Pause between requests to respect rate limits

        params = {
            "apikey": OMDB_API_KEY,
            "t": title,  # Search by title
            "type": "movie"
        }

        response = requests.get("https://www.omdbapi.com/", params=params)

        # Keep only successful lookups; OMDb signals failure in the JSON body
        if response.ok:
            data = response.json()
            if data.get("Response") == "True":
                output.append(data)

    return output


# Test
test_titles = ["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]
movies = fetch_movies(test_titles)
print(f"Successfully fetched {len(movies)} out of {len(test_titles)} movies")

Successfully fetched 3 out of 4 movies


### Question 4.4: Create a Movie DataFrame

Using the movies you fetched, create a pandas DataFrame with these columns:
- title
- year (as integer)
- genre
- director
- imdb_rating (as float)
- imdb_votes (as integer, remove commas)
- runtime_minutes (as integer, extract from "148 min")
- box_office (keep as string for now)

**Hint**: You'll need to clean the data types.

In [None]:
# YOUR CODE HERE
import requests
import pandas as pd

def fetch__movies(titles):
    output=[]

    for title in titles:
      params = {
        "apikey": OMDB_API_KEY,
        "t": title,  # Search by title
        "type": "movie"
      }

      response = requests.get("https://www.omdbapi.com/", params=params)

      if response.ok:
          data = response.json()
          if data["Response"] == "True":
              output.append({
                  "Title": data["Title"],
                  "Year": data["Year"],
                  "Genre": data["Genre"],
                  "Director": data["Director"],
                  "IMDB Rating": data["imdbRating"],
                  "IMDB votes": data["imdbVotes"],
                  "Runtime minutes": data["Runtime"],
                  "Box Office": data["BoxOffice"],
              })

    return output

test_titles = ["Inception", "The Matrix", "Interstellar"]
movies = fetch__movies(test_titles)

df=pd.DataFrame(movies)

df["Year"] = df["Year"].astype(int)

df["IMDB Rating"] = pd.to_numeric(df["IMDB Rating"], errors="coerce")


df["IMDB votes"] = (
    df["IMDB votes"]
    .str.replace(",", "")
    .astype(int)
)

df["Runtime minutes"] = (
    df["Runtime minutes"]
    .str.replace(" min", "")
    .astype(int)
)

print(df)



          Title  Year                      Genre  \
0     Inception  2010  Action, Adventure, Sci-Fi   
1    The Matrix  1999             Action, Sci-Fi   
2  Interstellar  2014   Adventure, Drama, Sci-Fi   

                          Director  IMDB Rating  IMDB votes  Runtime minutes  \
0                Christopher Nolan          8.8     2767518              148   
1  Lana Wachowski, Lilly Wachowski          8.7     2217731              136   
2                Christopher Nolan          8.7     2454660              169   

     Box Office  
0  $292,587,330  
1  $177,559,005  
2  $203,227,580  
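The same cleaning steps can be sketched in plain Python, which is handy when a field is missing. The helper names `parse_int` and `parse_float` are hypothetical, and the `"N/A"` box-office value is an assumption about how a missing field might appear; the other values are taken from the Inception row above.

```python
def parse_int(text):
    """Keep only the digits: '148 min' -> 148, '2,767,518' -> 2767518.

    Returns None when the text contains no digits (for example 'N/A').
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None

def parse_float(text):
    """Convert '8.8' -> 8.8, returning None for non-numeric text."""
    try:
        return float(text)
    except ValueError:
        return None

record = {
    "Runtime": "148 min",
    "imdbVotes": "2,767,518",
    "imdbRating": "8.8",
    "BoxOffice": "N/A",  # assumed placeholder for a missing value
}

cleaned = {
    "runtime_minutes": parse_int(record["Runtime"]),
    "imdb_votes": parse_int(record["imdbVotes"]),
    "imdb_rating": parse_float(record["imdbRating"]),
    "box_office": parse_int(record["BoxOffice"]),
}
print(cleaned)
```

Returning `None` instead of raising keeps one bad record from stopping a whole batch; pandas offers a similar option through `pd.to_numeric(..., errors="coerce")`.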


### Question 4.5: Search Movies by Title

OMDb also has a search endpoint that returns multiple results.

Use the `s` parameter instead of `t` to search for movies containing "Star Wars".

API endpoint: `https://www.omdbapi.com/?apikey=YOUR_KEY&s=Star Wars&type=movie`

Print the title and year of each result.

In [None]:
# YOUR CODE HERE
import requests

params={
    "apikey": OMDB_API_KEY,
    "s": "Star Wars",
    "type": "movie"
}

response=requests.get("https://www.omdbapi.com/", params=params)

data=response.json()

for it in data['Search']:
  print(f"title: {it['Title']} , year: {it['Year']}\n")

{'Search': [{'Title': 'Star Wars: Episode IV - A New Hope', 'Year': '1977', 'imdbID': 'tt0076759', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BOGUwMDk0Y2MtNjBlNi00NmRiLTk2MWYtMGMyMDlhYmI4ZDBjXkEyXkFqcGc@._V1_SX300.jpg'}, {'Title': 'Star Wars: Episode V - The Empire Strikes Back', 'Year': '1980', 'imdbID': 'tt0080684', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTkxNGFlNDktZmJkNC00MDdhLTg0MTEtZjZiYWI3MGE5NWIwXkEyXkFqcGc@._V1_SX300.jpg'}, {'Title': 'Star Wars: Episode VI - Return of the Jedi', 'Year': '1983', 'imdbID': 'tt0086190', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BNWEwOTI0MmUtMGNmNy00ODViLTlkZDQtZTg1YmQ3MDgyNTUzXkEyXkFqcGc@._V1_SX300.jpg'}, {'Title': 'Star Wars: Episode VII - The Force Awakens', 'Year': '2015', 'imdbID': 'tt2488496', 'Type': 'movie', 'Poster': 'https://m.media-amazon.com/images/M/MV5BOTAzODEzNDAzMl5BMl5BanBnXkFtZTgwMDU1MTgzNzE@._V1_SX300.jpg'}, {'Title': 'Star Wars: Episode I - The Phanto

### Question 4.6: Handle Pagination

The OMDb search API returns 10 results per page and includes a `totalResults` field.

Write a function `search_all_movies(query)` that:
1. Searches for movies matching the query
2. Fetches ALL pages of results (use the `page` parameter)
3. Returns a list of all movies found

**Hint**: `totalResults` tells you how many movies exist. Divide by 10 and round up to get the number of pages.

Test with a query that has many results like "Batman".

In [None]:
import requests
import math

# YOUR CODE HERE
def search_all_movies(query, api_key=OMDB_API_KEY):
    """Search OMDb and return ALL matching movies across all pages."""
    params = {
        "apikey": api_key,
        "s": query,
        "type": "movie"
    }

    # The first request returns page 1 plus the total result count
    response = requests.get("https://www.omdbapi.com/", params=params)
    data = response.json()
    if data.get("Response") != "True":
        print(f"Search failed: {data.get('Error')}")
        return []

    all_movies = list(data["Search"])
    pages = math.ceil(int(data["totalResults"]) / 10)

    # Page 1 is already collected, so continue from page 2
    for page in range(2, pages + 1):
        params["page"] = page
        response = requests.get("https://www.omdbapi.com/", params=params)
        data = response.json()
        all_movies.extend(data.get("Search", []))

    return all_movies


# Test
all_batman = search_all_movies("Batman")
print(f"Found {len(all_batman)} Batman movies")

Found 516 Batman movies
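The page arithmetic is worth checking on its own, because plain integer division drops the final partial page. A minimal sketch, with `page_count` as a hypothetical helper name:

```python
import math

def page_count(total_results, page_size=10):
    """Number of pages needed to hold total_results items."""
    return math.ceil(total_results / page_size)

print(page_count(516))  # 52
print(page_count(10), page_count(11), page_count(0))  # 1 2 0
```

With 516 results, `516 // 10` would give 51 and silently skip the last 6 movies, while rounding up gives the 52 pages actually needed.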


---

# Part 5: Web Scraping with BeautifulSoup

When APIs don't exist or don't have what we need, we scrape.

## 5.1 HTML Basics

### Question 5.1 (Solved): Parse HTML

In [None]:
# SOLVED EXAMPLE
from bs4 import BeautifulSoup

html = """
<html>
<body>
    <div class="movie" id="movie-1">
        <h2 class="title">Inception</h2>
        <span class="year">2010</span>
        <span class="rating">8.8</span>
        <a href="/movies/inception">More Info</a>
    </div>
    <div class="movie" id="movie-2">
        <h2 class="title">The Matrix</h2>
        <span class="year">1999</span>
        <span class="rating">8.7</span>
        <a href="/movies/matrix">More Info</a>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find all movie divs
movies = soup.find_all('div', class_='movie')
print(f"Found {len(movies)} movies\n")

# Extract data from each
for movie in movies:
    title = movie.find('h2', class_='title').text
    year = movie.find('span', class_='year').text
    rating = movie.find('span', class_='rating').text
    link = movie.find('a')['href']

    print(f"{title} ({year}) - Rating: {rating} - Link: {link}")

Found 2 movies

Inception (2010) - Rating: 8.8 - Link: /movies/inception
The Matrix (1999) - Rating: 8.7 - Link: /movies/matrix
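The extracted links are relative paths, so they need a base address before they can be requested. A small sketch using `urllib.parse.urljoin`; the base URL here is a hypothetical page address, since the HTML above is an inline string with no real site behind it.

```python
from urllib.parse import urljoin

base_url = "https://example.com/catalog/"  # hypothetical page address

for href in ["/movies/inception", "/movies/matrix"]:
    # A leading slash resolves against the site root, not the current folder
    print(urljoin(base_url, href))

# Without a leading slash, the link resolves relative to the current folder
print(urljoin(base_url, "page/2/"))
```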


### Question 5.2: CSS Selectors

Rewrite the above extraction using CSS selectors (`.select()` and `.select_one()`) instead of `.find()` and `.find_all()`.

**Hint**:
- `.movie` selects elements with class "movie"
- `.movie .title` selects elements with class "title" inside class "movie"

In [None]:
# YOUR CODE HERE
# Use the same 'soup' from above
movies = soup.select('.movie')
print(f"Found {len(movies)} movies\n")

# Extract using CSS selectors
for movie in movies:
  title=movie.select_one('.title').text
  year=movie.select_one('.year').text
  rating=movie.select_one('.rating').text
  link=movie.select_one('a')['href']

  print(f"{title} ({year}) - Rating: {rating} - Link: {link}")


Found 2 movies

Inception (2010) - Rating: 8.8 - Link: /movies/inception
The Matrix (1999) - Rating: 8.7 - Link: /movies/matrix


### Question 5.3: Scrape a Real Website

Let's scrape the example website `http://quotes.toscrape.com/` which is designed for scraping practice.

Extract all quotes from the first page, including:
- The quote text
- The author name
- The tags

Return the results as a list of dictionaries.

In [None]:
# YOUR CODE HERE
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "http://quotes.toscrape.com/"

response=requests.get(url)


# Parse the HTML
html=response.text
soup=BeautifulSoup(html, 'html.parser')

quotes=soup.select('.quote')
ans=[]
# Extract quotes
for quote in quotes:
  quoteline=quote.select_one('.text').text
  author=quote.select_one('.author').text
  tags=[tag.text for tag in quote.select('.tag')]
  ans.append({"Quote": quoteline,
              "Author": author,
              "Tags": tags})

# Print results
for it in ans:
  print(it)

{'Quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'Author': 'Albert Einstein', 'Tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'Quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'Author': 'J.K. Rowling', 'Tags': ['abilities', 'choices']}
{'Quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'Author': 'Albert Einstein', 'Tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'Quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'Author': 'Jane Austen', 'Tags': ['aliteracy', 'books', 'classic', 'humor']}
{'Quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'Author': 'Marilyn Monroe', 'Tags': ['be-yourself', 'inspirationa

### Question 5.4: Handle Pagination in Scraping

The quotes website has multiple pages. Scrape the first 3 pages and collect all quotes.

Pages follow the pattern:
- Page 1: `http://quotes.toscrape.com/page/1/`
- Page 2: `http://quotes.toscrape.com/page/2/`
- etc.

**Remember**: Add a delay between requests to be polite!

In [None]:
# YOUR CODE HERE
import requests
from bs4 import BeautifulSoup

import time

# Base URL; each page lives at /page/<n>/
url = "http://quotes.toscrape.com/"
ans=[]

for page in range(1, 4):
  if page > 1:
    time.sleep(1)  # Be polite: pause between requests
  response=requests.get(f"{url}page/{page}/")


  # Parse the HTML
  html=response.text
  soup=BeautifulSoup(html, 'html.parser')

  quotes=soup.select('.quote')
  # Extract quotes
  for quote in quotes:
    quoteline=quote.select_one('.text').text
    author=quote.select_one('.author').text
    tags=[tag.text for tag in quote.select('.tag')]
    ans.append({"Quote": quoteline,
                "Author": author,
                "Tags": tags})

# Print results
for it in ans:
  print(it)

{'Quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'Author': 'Albert Einstein', 'Tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'Quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'Author': 'J.K. Rowling', 'Tags': ['abilities', 'choices']}
{'Quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'Author': 'Albert Einstein', 'Tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'Quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'Author': 'Jane Austen', 'Tags': ['aliteracy', 'books', 'classic', 'humor']}
{'Quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'Author': 'Marilyn Monroe', 'Tags': ['be-yourself', 'inspirationa

### Question 5.5: Extract Table Data

Scrape the table from `https://www.w3schools.com/html/html_tables.asp`.

The table contains company data. Extract all rows and create a pandas DataFrame.

**Hint**: Look for `<table>`, `<tr>` (table row), `<th>` (header), and `<td>` (data cell) elements.
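
One way to read the hint, sketched on an inline HTML string so it runs offline: `<th>` cells become the column names and each `<tr>` of `<td>` cells becomes a row. The `sample_html` string and the `table_to_dataframe` helper are assumptions made for illustration; the real page has more markup around the table.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for a downloaded page
sample_html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>
</table>
"""


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first <table> in html; <th> cells become column names."""
    table = BeautifulSoup(html, "html.parser").find("table")
    header, records = [], []
    for tr in table.find_all("tr"):
        th_cells = tr.find_all("th")
        if th_cells:
            header = [th.get_text(strip=True) for th in th_cells]
        else:
            records.append([td.get_text(strip=True) for td in tr.find_all("td")])
    # Fall back to default integer column names if no header row was found
    return pd.DataFrame(records, columns=header or None)


table_df = table_to_dataframe(sample_html)
print(table_df)
```

Using the header row as column names, rather than as the first data row, keeps the DataFrame ready for column lookups such as `table_df["Country"]`.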

In [None]:
# YOUR CODE HERE
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.w3schools.com/html/html_tables.asp"

response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
table = soup.select_one('.ws-table-all')
rows = table.select('tr')
elements = []

for row in rows:
    # Header rows use <th> cells; data rows use <td> cells
    cells = row.select('th') or row.select('td')
    elements.append([cell.text for cell in cells])

# The header row ends up as row 0 here, not as column names
df = pd.DataFrame(elements)

print(df)
print('\n')

# Hint: pandas has a read_html() function that can do this automatically,
# but it is worth doing it manually first to understand the process.
tables = pd.read_html(url)
print(tables[0])


                              0                 1        2
0                       Company           Contact  Country
1           Alfreds Futterkiste      Maria Anders  Germany
2    Centro comercial Moctezuma   Francisco Chang   Mexico
3                  Ernst Handel     Roland Mendel  Austria
4                Island Trading     Helen Bennett       UK
5  Laughing Bacchus Winecellars   Yoshi Tannamuri   Canada
6  Magazzini Alimentari Riuniti  Giovanni Rovelli    Italy


                        Company           Contact  Country
0           Alfreds Futterkiste      Maria Anders  Germany
1    Centro comercial Moctezuma   Francisco Chang   Mexico
2                  Ernst Handel     Roland Mendel  Austria
3                Island Trading     Helen Bennett       UK
4  Laughing Bacchus Winecellars   Yoshi Tannamuri   Canada
5  Magazzini Alimentari Riuniti  Giovanni Rovelli    Italy


---

# Part 6: Building the Movie Data Pipeline

Now let's put everything together to build a complete data collection pipeline for our Netflix project.

## 6.1 The Complete Pipeline

### Question 6.1 (Solved): Movie Data Collector Class

In [None]:
# SOLVED EXAMPLE
import requests
import pandas as pd
import time
from typing import List, Dict, Optional

class MovieDataCollector:
    """Collect movie data from OMDb API."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "http://www.omdbapi.com/"
        self.delay = 0.5  # Seconds between requests

    def fetch_movie(self, title: str, year: Optional[int] = None) -> Optional[Dict]:
        """Fetch a single movie by title."""
        params = {
            "apikey": self.api_key,
            "t": title,
            "type": "movie"
        }
        if year:
            params["y"] = year

        try:
            response = requests.get(self.base_url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()

            if data.get("Response") == "True":
                return data
        except Exception as e:
            print(f"Error fetching {title}: {e}")

        return None

    def fetch_movies(self, titles: List[str]) -> List[Dict]:
        """Fetch multiple movies."""
        movies = []

        for i, title in enumerate(titles):
            print(f"Fetching {i+1}/{len(titles)}: {title}")
            movie = self.fetch_movie(title)

            if movie:
                movies.append(movie)

            time.sleep(self.delay)

        return movies

    def to_dataframe(self, movies: List[Dict]) -> pd.DataFrame:
        """Convert movie data to cleaned DataFrame."""
        if not movies:
            return pd.DataFrame()

        # Extract relevant fields
        rows = []
        for m in movies:
            rows.append({
                "title": m.get("Title"),
                "year": m.get("Year"),
                "genre": m.get("Genre"),
                "director": m.get("Director"),
                "actors": m.get("Actors"),
                "imdb_rating": m.get("imdbRating"),
                "imdb_votes": m.get("imdbVotes"),
                "runtime": m.get("Runtime"),
                "box_office": m.get("BoxOffice"),
                "imdb_id": m.get("imdbID")
            })

        df = pd.DataFrame(rows)

        # Clean data types
        df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
        df["imdb_rating"] = pd.to_numeric(df["imdb_rating"], errors="coerce")
        df["imdb_votes"] = df["imdb_votes"].str.replace(",", "").pipe(pd.to_numeric, errors="coerce").astype("Int64")
        # Fix: str.extract returns a DataFrame, we need column 0 to get a Series
        df["runtime_min"] = df["runtime"].str.extract(r"(\d+)")[0].pipe(pd.to_numeric, errors="coerce").astype("Int64")

        return df

# Usage example
collector = MovieDataCollector(OMDB_API_KEY)
movies = collector.fetch_movies(["Inception", "The Matrix"])
df = collector.to_dataframe(movies)
print(df)

Fetching 1/2: Inception
Fetching 2/2: The Matrix
        title  year                      genre  \
0   Inception  2010  Action, Adventure, Sci-Fi   
1  The Matrix  1999             Action, Sci-Fi   

                          director  \
0                Christopher Nolan   
1  Lana Wachowski, Lilly Wachowski   

                                              actors  imdb_rating  imdb_votes  \
0  Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...          8.8     2767518   
1  Keanu Reeves, Laurence Fishburne, Carrie-Anne ...          8.7     2217731   

   runtime    box_office    imdb_id  runtime_min  
0  148 min  $292,587,330  tt1375666          148  
1  136 min  $177,559,005  tt0133093          136  


### Question 6.2: Add Search Functionality

Extend the `MovieDataCollector` class to add a `search_movies(query, max_results=50)` method that:
1. Searches for movies matching the query
2. Handles pagination to get up to `max_results` movies
3. For each search result, fetches the full movie details
4. Returns the detailed movie data

**Hint**: Search results only contain basic info (title, year, poster, imdbID). You need to use the imdbID to fetch full details.
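
The pagination arithmetic can be sketched on its own. This assumes the search endpoint returns 10 results per page and reports the total in a `totalResults` field; `pages_to_fetch` is a hypothetical helper, and the totals used below are made-up numbers.

```python
import math

PAGE_SIZE = 10  # assumed number of search results per page


def pages_to_fetch(total_results: int, max_results: int,
                   page_size: int = PAGE_SIZE) -> int:
    """Number of search pages needed to collect min(total_results, max_results) items."""
    wanted = min(total_results, max_results)
    return math.ceil(wanted / page_size)


print(pages_to_fetch(total_results=583, max_results=50))  # 5
print(pages_to_fetch(total_results=583, max_results=5))   # 1
print(pages_to_fetch(total_results=7, max_results=50))    # 1
```

Capping the page count by `max_results` avoids requesting search pages whose results would be thrown away.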

In [None]:
# YOUR CODE HERE
# Extends the MovieDataCollector class from Question 6.1 with search_movies()
import requests
import pandas as pd
import time
from typing import List, Dict, Optional
import math

class MovieDataCollector:
    """Collect movie data from OMDb API."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "http://www.omdbapi.com/"
        self.delay = 0.3  # Seconds between requests

    def fetch_movie_byid(self, imdb_id: str) -> Optional[Dict]:
        """Fetch a single movie by IMDb ID."""
        params = {
            "apikey": self.api_key,
            "i": imdb_id,
            "type": "movie"
        }

        try:
            response = requests.get(self.base_url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()

            if data.get("Response") == "True":
                return data
        except Exception as e:
            print(f"Error fetching movie with IMDb ID {imdb_id}: {e}")

        return None

    def search_movies(self, query: str, max_results: int = 50) -> List[Dict]:
        """Search by title and return full details for up to max_results movies."""
        params = {
            "apikey": self.api_key,
            "s": query,
            "type": "movie"
        }

        all_movies = []
        page = 1
        total_pages = 1  # Updated from totalResults after the first response

        # Stop as soon as we have enough movies or run out of pages
        while page <= total_pages and len(all_movies) < max_results:
            params["page"] = page

            try:
                response = requests.get(self.base_url, params=params, timeout=10)
                response.raise_for_status()
                data = response.json()
            except Exception as e:
                print(f"Error searching for {query} (page {page}): {e}")
                break

            if data.get("Response") != "True" or "Search" not in data:
                break

            # Each search page holds 10 results
            total_pages = math.ceil(int(data["totalResults"]) / 10)

            for result in data["Search"]:
                if len(all_movies) >= max_results:
                    break

                # Search results only carry basic info; fetch the full record
                movie = self.fetch_movie_byid(result["imdbID"])
                if movie:
                    all_movies.append(movie)

                time.sleep(self.delay)

            page += 1

        return all_movies



    def to_dataframe(self, movies: List[Dict]) -> pd.DataFrame:
        """Convert movie data to cleaned DataFrame."""
        if not movies:
            return pd.DataFrame()

        # Extract relevant fields
        rows = []
        for m in movies:
            rows.append({
                "title": m.get("Title"),
                "year": m.get("Year"),
                "genre": m.get("Genre"),
                "director": m.get("Director"),
                "actors": m.get("Actors"),
                "imdb_rating": m.get("imdbRating"),
                "imdb_votes": m.get("imdbVotes"),
                "runtime": m.get("Runtime"),
                "box_office": m.get("BoxOffice"),
                "imdb_id": m.get("imdbID")
            })

        df = pd.DataFrame(rows)

        # Clean data types
        df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
        df["imdb_rating"] = pd.to_numeric(df["imdb_rating"], errors="coerce")
        df["imdb_votes"] = df["imdb_votes"].str.replace(",", "").pipe(pd.to_numeric, errors="coerce").astype("Int64")
        # Fix: str.extract returns a DataFrame, we need column 0 to get a Series
        df["runtime_min"] = df["runtime"].str.extract(r"(\d+)")[0].pipe(pd.to_numeric, errors="coerce").astype("Int64")

        return df

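# Usage example (OMDB_API_KEY is the key defined earlier in the notebook;
# avoid hard-coding keys in code you submit or share)
collector = MovieDataCollector(OMDB_API_KEY)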
movies = collector.search_movies("Batman", 5)
df = collector.to_dataframe(movies)
print(df)

                                title  year                      genre  \
0                       Batman Begins  2005       Action, Crime, Drama   
1                          The Batman  2022       Action, Crime, Drama   
2  Batman v Superman: Dawn of Justice  2016  Action, Adventure, Sci-Fi   
3  Batman v Superman: Dawn of Justice  2016  Action, Adventure, Sci-Fi   
4                              Batman  1989          Action, Adventure   

            director                                         actors  \
0  Christopher Nolan    Christian Bale, Michael Caine, Ken Watanabe   
1        Matt Reeves  Robert Pattinson, Zoë Kravitz, Jeffrey Wright   
2        Zack Snyder           Ben Affleck, Henry Cavill, Amy Adams   
3        Zack Snyder           Ben Affleck, Henry Cavill, Amy Adams   
4         Tim Burton   Michael Keaton, Jack Nicholson, Kim Basinger   

   imdb_rating  imdb_votes  runtime    box_office    imdb_id  runtime_min  
0          8.2     1688565  140 min  $206,863,479  t

### Question 6.3: Build a Genre-Based Dataset

Use your collector to build a dataset of popular movies from different genres:

1. Search for 10 movies each for: "action", "comedy", "drama", "horror", "sci-fi"
2. Combine all results into a single DataFrame
3. Remove any duplicates (some movies might appear in multiple searches)
4. Save to CSV

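**Note**: This might take a while due to rate limiting. Start with fewer movies for testing. Also keep in mind that OMDb's search matches the query against movie titles, so these are title keyword searches rather than true genre filters.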
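
The combine-and-deduplicate steps can be sketched on made-up data. The two small DataFrames below stand in for search results, and the `search_term` column is an assumption added for illustration; it records which search produced each row so that per-term statistics do not have to rely on row positions later.

```python
import pandas as pd

# Made-up rows standing in for two search results
action_df = pd.DataFrame({"imdb_id": ["tt001", "tt002"], "title": ["A", "B"]})
drama_df = pd.DataFrame({"imdb_id": ["tt002", "tt003"], "title": ["B", "C"]})

frames = []
for term, frame in [("action", action_df), ("drama", drama_df)]:
    # Remember which search found the row
    frames.append(frame.assign(search_term=term))

combined = pd.concat(frames, ignore_index=True)

# Keep the first occurrence of each movie and renumber the rows
deduped = combined.drop_duplicates(subset="imdb_id").reset_index(drop=True)
print(deduped)
```

Deduplicating on `imdb_id` rather than on the title is the safer choice, since different movies can share a title.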

In [None]:
import pandas as pd

GENRES = ["action", "comedy", "drama", "horror", "sci-fi"]
MOVIES_PER_GENRE = 5  # Start small for testing; raise to 10 for the full dataset

# OMDB_API_KEY is the key defined earlier in the notebook
collector = MovieDataCollector(OMDB_API_KEY)

all_genre_dfs = []

for genre in GENRES:
    movies = collector.search_movies(genre, max_results=MOVIES_PER_GENRE)

    df = collector.to_dataframe(movies)
    # Skip searches that returned nothing so the empty check below works
    if not df.empty:
        all_genre_dfs.append(df)

if not all_genre_dfs:
    print("No movies fetched.")
else:
    final_df = pd.concat(all_genre_dfs, ignore_index=True)
    final_df = final_df.drop_duplicates(subset="imdb_id")
    final_df.to_csv("popular_genre_movies.csv", index=False)

    print(final_df)



                                         title  year  \
0                             Last Action Hero  1993   
1                               Back in Action  2025   
2                 Looney Tunes: Back in Action  2003   
3                               An Action Hero  2022   
4                               A Civil Action  1998   
5                           The King of Comedy  1982   
6               A Midsummer Night's Sex Comedy  1982   
7             Fear City: A Family-Style Comedy  1994   
8                               King of Comedy  1999   
9    The Broken Hearts Club: A Romantic Comedy  2000   
10        Confessions of a Teenage Drama Queen  2004   
11           A Woman of Paris: A Drama of Fate  1923   
12                  Kim Possible: So the Drama  2005   
13                      Eating Out: Drama Camp  2011   
14                           Love Action Drama  2019   
15               The Rocky Horror Picture Show  1975   
16                       The Amityville Horror  

### Question 6.4: Data Quality Analysis

Using the dataset you created:

1. How many movies have missing IMDb ratings?
2. How many movies have missing box office data?
3. What's the distribution of ratings? (min, max, mean, median)
4. Which directors appear most frequently?
5. What's the average runtime by genre?

These quality checks will be important for Week 2 (Data Validation)!
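
A minimal sketch of two of these checks on a made-up DataFrame, assuming the same column names the collector produces. It treats the string `"N/A"` as missing for box office, and it computes runtime per genre by splitting the comma-separated `genre` column so that a movie counts toward every genre it lists.

```python
import pandas as pd

# Made-up sample in the same shape as the collector's DataFrame
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Action, Sci-Fi", "Drama", "Action"],
    "imdb_rating": [8.0, None, 6.0],
    "box_office": ["$1,000", "N/A", "N/A"],
    "runtime_min": [120, 90, 100],
})

missing_ratings = int(sample["imdb_rating"].isna().sum())

# "N/A" strings are not NaN, so count them explicitly alongside real nulls
box_office_missing = sample["box_office"].isna() | sample["box_office"].eq("N/A")
missing_box_office = int(box_office_missing.sum())

# One row per (movie, genre) pair, then average runtime per genre
per_genre = (
    sample.assign(genre=sample["genre"].str.split(", "))
    .explode("genre")
    .groupby("genre")["runtime_min"]
    .mean()
)

print(missing_ratings, missing_box_office)
print(per_genre)
```

Grouping on the actual genre labels is one design choice; grouping on the search term used to find each movie is another, and the two can give different averages.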

In [None]:
# YOUR CODE HERE
missing_imdb_count = final_df['imdb_rating'].isnull().sum()
print("Missing IMDb ratings:", missing_imdb_count)

# OMDb reports missing box office figures as the string "N/A"
missing_boxoffice = (final_df['box_office'] == "N/A").sum()
print("Missing box office:", missing_boxoffice)

max_rating = final_df['imdb_rating'].max()
min_rating = final_df['imdb_rating'].min()
mean_rating = final_df['imdb_rating'].mean()
median_rating = final_df['imdb_rating'].median()
print(f"Max: {max_rating}, Min: {min_rating}, Mean: {mean_rating}, Median: {median_rating} \n")

top_directors = final_df['director'].value_counts().head(5)
print(f"Top Directors: {top_directors} \n")

# Rows are still in search order (5 per search term), so slice by position.
# This only holds if every search returned 5 movies and no duplicates were dropped.
action_avgmin = final_df.iloc[0:5]['runtime_min'].mean()
comedy_avgmin = final_df.iloc[5:10]['runtime_min'].mean()
drama_avgmin = final_df.iloc[10:15]['runtime_min'].mean()
horror_avgmin = final_df.iloc[15:20]['runtime_min'].mean()
scifi_avgmin = final_df.iloc[20:25]['runtime_min'].mean()

print(f"Action Avg Min: {action_avgmin} \nComedy Avg Min: {comedy_avgmin} \nDrama Avg Min: {drama_avgmin} \nHorror Avg Min: {horror_avgmin} \nSci-Fi Avg Min: {scifi_avgmin} \n")

Missing IMDb ratings: 0
Missing box office: 15
Max: 8.3, Min: 2.5, Mean: 6.544, Median: 6.9 

Top Directors: director
N/A                2
John McTiernan     1
Joe Dante          1
Seth Gordon        1
Steven Zaillian    1
Name: count, dtype: int64 

Action Avg Min: 116.0 
Comedy Avg Min: 96.0 
Drama Avg Min: 95.8 
Horror Avg Min: 96.6 
Sci-Fi Avg Min: 76.66666666666667 



---

# Part 7: Challenge Problems

These are optional advanced exercises for those who finish early.

### Challenge 7.1: Rate Limit Handler

Create a `RateLimiter` class that:
1. Tracks how many requests have been made
2. Automatically adds delays to stay under a rate limit
3. Handles 429 (Too Many Requests) responses by waiting and retrying

```python
limiter = RateLimiter(requests_per_minute=30)
response = limiter.get("https://api.example.com/data")
```
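
One possible shape for such a class is sketched below, not as a reference solution. To keep it runnable offline, the HTTP call is passed in as a `send` callable (in real use this could be `requests.get`), and `sleep` and `clock` are injectable too; all three parameters, `max_retries`, and the `FakeResponse` class are assumptions made for this illustration.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Space out calls to `send` and retry on HTTP 429."""

    def __init__(self, requests_per_minute: int, send: Callable,
                 max_retries: int = 3,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic):
        self.min_interval = 60.0 / requests_per_minute
        self.send = send  # hypothetical hook, e.g. requests.get
        self.max_retries = max_retries
        self.sleep = sleep
        self.clock = clock
        self.request_count = 0
        self._last_request: Optional[float] = None

    def _wait_for_slot(self) -> None:
        """Sleep until min_interval has passed since the last request."""
        if self._last_request is not None:
            remaining = self.min_interval - (self.clock() - self._last_request)
            if remaining > 0:
                self.sleep(remaining)

    def get(self, url: str, **kwargs):
        """Send a request, waiting for a free slot and retrying on 429."""
        attempt = 0
        while True:
            self._wait_for_slot()
            response = self.send(url, **kwargs)
            self._last_request = self.clock()
            self.request_count += 1
            if response.status_code != 429 or attempt >= self.max_retries:
                return response
            # Use Retry-After when it is a number of seconds, else back off
            retry_after = str(response.headers.get("Retry-After", ""))
            wait = float(retry_after) if retry_after.isdigit() else 2.0 ** attempt
            self.sleep(wait)
            attempt += 1


class FakeResponse:
    """Minimal stand-in for a response object (status code and headers only)."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


# Simulated server: first reply is 429, second is 200
replies = [FakeResponse(429, {"Retry-After": "2"}), FakeResponse(200)]
slept = []
now = [0.0]


def fake_sleep(seconds: float) -> None:
    """Record the delay and advance the fake clock instead of really waiting."""
    slept.append(seconds)
    now[0] += seconds


limiter = RateLimiter(requests_per_minute=30,
                      send=lambda url, **kw: replies.pop(0),
                      sleep=fake_sleep, clock=lambda: now[0])
response = limiter.get("https://api.example.com/data")
print(response.status_code, limiter.request_count, slept)  # 200 2 [2.0]
```

Injecting `sleep` and `clock` is a testing convenience: the retry logic can be checked instantly without real waiting.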

In [None]:
# YOUR CODE HERE


### Challenge 7.2: Async Movie Collector

The synchronous approach is slow because we wait for each request to complete.

Create an async version using `aiohttp` that can fetch multiple movies concurrently (while still respecting rate limits).

Compare the time it takes to fetch 20 movies with the sync and async approaches.
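
The concurrency pattern can be sketched with the standard library alone. To stay runnable offline, this sketch replaces the `aiohttp` request with a simulated `fake_fetch` that just sleeps; `fake_fetch`, `fetch_all`, and the `MAX_CONCURRENT` cap are assumptions for illustration, and a real version would perform the HTTP call inside `fetch_one`.

```python
import asyncio
from typing import List

MAX_CONCURRENT = 3  # assumed cap on simultaneous requests


async def fake_fetch(title: str) -> dict:
    """Stand-in for a network request; sleeps briefly instead."""
    await asyncio.sleep(0.05)
    return {"Title": title}


async def fetch_all(titles: List[str]) -> List[dict]:
    """Fetch all titles concurrently, never more than MAX_CONCURRENT at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def fetch_one(title: str) -> dict:
        async with semaphore:
            return await fake_fetch(title)

    # gather() returns results in the same order as its inputs
    return await asyncio.gather(*(fetch_one(t) for t in titles))


results = asyncio.run(fetch_all(["Inception", "The Matrix", "Batman", "Alien"]))
print([r["Title"] for r in results])
```

Inside a Jupyter notebook an event loop is already running, so `asyncio.run()` raises an error there; use `results = await fetch_all([...])` in a cell instead.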

In [None]:
# YOUR CODE HERE
# Hint: You'll need to install aiohttp: pip install aiohttp
# And use asyncio to run the async code


### Challenge 7.3: Multi-Source Data Fusion

Create a data collection pipeline that:
1. Fetches basic movie data from OMDb
2. Enriches it with additional data from another source (e.g., Wikipedia API for plot summaries)
3. Merges the data based on movie title/year
4. Handles cases where data is missing from one source

Wikipedia API example:
```
https://en.wikipedia.org/api/rest_v1/page/summary/Inception_(film)
```
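
The merge step (items 3 and 4) might look like the sketch below, which uses made-up rows in place of real API responses. The `summary` column and the `has_summary` flag are assumptions for illustration; a left join keeps every OMDb row even when the second source has nothing for it.

```python
import pandas as pd

# Made-up rows standing in for the two sources
omdb = pd.DataFrame({
    "title": ["Inception", "The Matrix"],
    "year": [2010, 1999],
    "imdb_rating": [8.8, 8.7],
})
wiki = pd.DataFrame({
    "title": ["Inception"],
    "year": [2010],
    "summary": ["placeholder summary text"],
})

# Left join keeps every OMDb row; the indicator column records where each row was found
merged = omdb.merge(wiki, on=["title", "year"], how="left", indicator=True)
merged["has_summary"] = merged["_merge"].eq("both")
merged = merged.drop(columns="_merge")
print(merged)
```

Rows without a match get a missing `summary` value rather than being dropped, so the gap stays visible for later validation.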

In [None]:
# YOUR CODE HERE


---

# Summary

In this lab, you learned:

1. **HTTP Fundamentals**: URLs, status codes, headers
2. **curl**: Command-line HTTP requests
3. **Python requests**: Programmatic data collection
4. **Error handling**: Timeouts, retries, status codes
5. **OMDb API**: Real-world movie data
6. **BeautifulSoup**: Web scraping when APIs don't exist
7. **Data pipelines**: Building reusable collection code

## Next Week

**Week 2: Data Validation & Quality**

The data we collected today is messy! Next week we'll learn:
- Schema validation with Pydantic
- Data type cleaning
- Handling missing values
- Quality metrics

---

## Submission

Save your completed notebook and submit:
1. This notebook with all cells executed
2. The CSV file of movies you collected
3. A brief summary (1 paragraph) of what you learned