# Fetching Data from The Movie Database API

This notebook demonstrates how to:
1. **Connect to a REST API** using the requests library
2. **Handle API rate limiting** to avoid connection errors
3. **Gather paginated data** across multiple API requests
4. **Transform raw JSON data** into a structured DataFrame

We'll fetch the top-rated movies from The Movie Database (TMDB) API across all available pages.

In [None]:
# Import required libraries for API requests and data manipulation
import requests
import pandas as pd

# API Key for authentication with The Movie Database API
# Free tier API key from https://www.themoviedb.org/settings/api
API_KEY = ''

# Make a single test request to page 2 to verify API connection
url = f"https://api.themoviedb.org/3/movie/top_rated?api_key={API_KEY}&page=2"
response = requests.get(url)
response.json()

{'page': 2,
 'results': [{'adult': False,
   'backdrop_path': '/5TiwfWEaPSwD20uwXjCTUqpQX70.jpg',
   'genre_ids': [18, 53],
   'id': 550,
   'original_language': 'en',
   'original_title': 'Fight Club',
   'overview': 'A ticking-time-bomb insomniac and a slippery soap salesman channel primal male aggression into a shocking new form of therapy. Their concept catches on, with underground "fight clubs" forming in every town, until an eccentric gets in the way and ignites an out-of-control spiral toward oblivion.',
   'popularity': 29.4107,
   'poster_path': '/pB8BM7pdSp6B6Ih7QZ4DrQ3PmJK.jpg',
   'release_date': '1999-10-15',
   'title': 'Fight Club',
   'video': False,
   'vote_average': 8.438,
   'vote_count': 31387},
  {'adult': False,
   'backdrop_path': '/e3hG3uadtcP0pYdRa5ch4ysQW76.jpg',
   'genre_ids': [28, 18, 36],
   'id': 14537,
   'original_language': 'ja',
   'original_title': '切腹',
   'overview': 'Down-on-his-luck veteran Tsugumo Hanshirō enters the courtyard of the prosp

## Step 1: Setup - Import Libraries and API Credentials

First, we import the necessary libraries:
- **requests**: Makes HTTP requests to APIs
- **pandas**: Converts API response data into structured DataFrames
- **time**: Adds delays to respect API rate limits
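
Hardcoding the key in a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the key from an environment variable and builds the request URL with `urllib.parse.urlencode`. The variable name `TMDB_API_KEY` and the function name `build_top_rated_url` are assumptions made for this illustration, not names defined by TMDB or by this notebook.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"

def build_top_rated_url(page, api_key=None):
    """Return the top_rated URL for one page.

    If api_key is omitted, it is read from the TMDB_API_KEY environment
    variable (a hypothetical name chosen for this sketch).
    """
    if api_key is None:
        api_key = os.environ.get("TMDB_API_KEY", "")
    if not api_key:
        raise ValueError("No API key supplied and TMDB_API_KEY is not set")
    # urlencode escapes the values and joins them with '&'
    return f"{BASE_URL}?{urlencode({'api_key': api_key, 'page': page})}"
```

Calling `build_top_rated_url(2, api_key="abc123")` produces the same URL shape used in the test request, with the values escaped for safe use in a query string.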

In [None]:
# Extract the 'results' array from the JSON response
data = response.json()['results']

# Convert the list of movie dictionaries into a DataFrame and select specific columns
df = pd.DataFrame(data)[['adult','original_language','original_title','overview','popularity','release_date','vote_average']]

# Display first few rows to understand data structure
df.head()

| | adult | original_language | original_title | overview | popularity | release_date | vote_average |
|---|---|---|---|---|---|---|---|
| 0 | False | en | The Shawshank Redemption | Imprisoned in the 1940s for the double murder ... | 40.2914 | 1994-09-23 | 8.715 |
| 1 | False | en | The Godfather | Spanning the years 1945 to 1955, a chronicle o... | 37.7366 | 1972-03-14 | 8.687 |
| 2 | False | en | The Godfather Part II | In the continuing saga of the Corleone crime f... | 25.3743 | 1974-12-20 | 8.571 |
| 3 | False | en | Schindler's List | The true story of how businessman Oskar Schind... | 22.2033 | 1993-12-15 | 8.566 |
| 4 | False | en | 12 Angry Men | The defense and the prosecution have rested an... | 14.0671 | 1957-04-10 | 8.554 |


## Step 2: Explore Single API Response

From the test request, we extract the data and convert it into a DataFrame:
- The API returns JSON whose `results` key holds a list of movie objects
- We select relevant columns: adult status, language, title, overview, popularity, release date, and rating
- This shows the structure before we fetch all pages
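
Selecting columns with `pd.DataFrame(data)[[...]]` raises a `KeyError` if any of the named columns is absent from the response. One possible guard, sketched here in plain Python, is to project each movie dictionary onto the wanted keys before building the DataFrame. The names `FIELDS` and `select_fields` are illustrative and not part of the notebook.

```python
FIELDS = ["adult", "original_language", "original_title", "overview",
          "popularity", "release_date", "vote_average"]

def select_fields(results, fields=FIELDS):
    """Keep only the requested keys from each movie dict.

    Missing keys become None instead of raising KeyError.
    """
    return [{name: movie.get(name) for name in fields} for movie in results]

# A deliberately incomplete record to show the fallback behaviour
sample = [{"id": 550, "original_title": "Fight Club", "vote_average": 8.438}]
rows = select_fields(sample)
print(rows[0]["original_title"], rows[0]["release_date"])  # Fight Club None
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`, which then always has the same columns in the same order.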

In [None]:
# Import time module for adding delays between API requests
import time

# Initialize empty DataFrame to store all movie data
df = pd.DataFrame()

## Step 3: Gather Data from All Pages

Now we'll fetch data from all 532 pages. Key concepts:

- **Session Management**: Using `requests.Session()` reuses the TCP connection, making requests faster
- **Rate Limiting**: Add delays between requests to avoid overwhelming the API server
- **Error Handling**: Gracefully handle connection errors and HTTP 429 (rate limit) responses
- **Pagination**: Loop through pages 1-532 to collect all top-rated movies
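
The error-handling and rate-limiting ideas above can be combined into a single bounded retry helper. The following is a sketch under stated assumptions, not the notebook's own method: `get_with_retry` is a hypothetical name, `get` is any callable that returns an object with `.status_code` and `.headers` (for example `functools.partial(session.get, timeout=10)`), and `Retry-After` is assumed to carry a number of seconds rather than an HTTP date.

```python
import time

def get_with_retry(get, url, max_attempts=3, default_wait=5, sleep=time.sleep):
    """Call get(url) until it returns HTTP 200 or attempts run out.

    Returns the last response received, or None if every attempt raised.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = get(url)
        except OSError:
            # requests' ConnectionError and Timeout both derive from OSError
            response = None
            wait = default_wait
        else:
            if response.status_code == 200:
                return response
            if response.status_code != 429:
                return response  # other HTTP errors are left to the caller
            # Assumes Retry-After is given in seconds, not as an HTTP date
            wait = int(response.headers.get("Retry-After", default_wait))
        if attempt < max_attempts:
            sleep(wait)
    return response
```

Passing `sleep` as a parameter keeps the helper testable without real delays. Unlike a single retry after a 429, this version retries both rate-limited and failed connections up to `max_attempts` times, so a page is only lost after repeated failures.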

In [None]:
# Create a session object to reuse TCP connections (faster than new connection each time)
session = requests.Session()

# Container to collect all movie data from all pages
all_data = []

# Loop through all 532 pages of top-rated movies
for i in range(1, 533):
    # Build the API URL with the current page number
    url = f"https://api.themoviedb.org/3/movie/top_rated?api_key={API_KEY}&page={i}"
    
    try:
        # Send GET request with 10-second timeout to avoid hanging
        res = session.get(url, timeout=10)
        
        # Handle HTTP 429 (Too Many Requests) - server is rate limiting us
        if res.status_code == 429:
            # Extract wait time from response header, default to 5 seconds
            wait = int(res.headers.get("Retry-After", 5))
            print(f"\nRate limit hit. Waiting {wait} seconds...")
            time.sleep(wait)
            res = session.get(url, timeout=10)  # Retry the request once, with the same timeout

        # Check if request was successful (HTTP 200)
        if res.status_code == 200:
            # Extract the 'results' array from JSON response
            page_data = res.json().get('results', [])
            # Add all movies from this page to our collection
            all_data.extend(page_data)
            # Show progress (overwrite same line with carriage return)
            print(f"Processed page {i}/532", end="\r")
        
        # Add delay between requests to stay under rate limit (40 requests / 10 seconds)
        # 0.3 second delay = at most ~3.3 requests per second, under the 4 per second that limit allows
        time.sleep(0.3) 

    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
        # Handle connection failures and timeouts
        print(f"\nConnection error on page {i}. Sleeping for 5 seconds before retrying...")
        time.sleep(5)
        # Note: the failed page is skipped, not retried. Reassigning i inside a
        # for loop has no effect, so a real retry would need an inner loop.
        continue

# Create DataFrame from all collected data ONCE (much faster than appending)
df = pd.DataFrame(all_data)[['id', 'adult', 'original_language', 'original_title', 'overview', 'popularity', 'release_date', 'vote_average']]
print("\nDone! Dataframe created.")


Connection error on page 1. Sleeping for 5 seconds before retrying...

Connection error on page 2. Sleeping for 5 seconds before retrying...

Connection error on page 3. Sleeping for 5 seconds before retrying...

Connection error on page 4. Sleeping for 5 seconds before retrying...
Processed page 500/532
Done! Dataframe created.


In [None]:
# Check the dimensions of our collected data (rows, columns)
print(f"Dataset shape: {df.shape}")

# Export the DataFrame to a CSV file for future analysis
# This stores all 532 pages' worth of movie data locally
df.to_csv('movies.csv', index=False)
print("Data saved to 'movies.csv'")

## Key Learnings: API Data Gathering Best Practices

✅ **What we implemented:**
- **Session reuse**: Using `requests.Session()` for connection pooling
- **Rate limit handling**: 0.3s delays + HTTP 429 response handling
- **Error resilience**: Try-except blocks for connection failures
- **Efficient batching**: Collect data, then create DataFrame once (not per-page)

❌ **What to avoid:**
- Making requests in tight loops without delays (causes 429 errors)
- Creating/appending DataFrames in loops (very slow)
- Ignoring the `Retry-After` response header
- Omitting timeout values on requests (a request can then hang indefinitely)

📚 **General API data gathering workflow:**
1. Authenticate (API key, OAuth, etc.)
2. Explore single response to understand structure
3. Build pagination loop with proper delays
4. Handle errors gracefully
5. Store results efficiently
6. Save to persistent storage (CSV, database, etc.)
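
Step 3 of this workflow does not have to hardcode the page count. As a sketch, and assuming the response carries a `total_pages` field alongside `page` and `results` (the truncated output in this notebook does not show it, so verify against a real response), the loop bound can be read from the first page. `fetch_all_results` and `fetch_page` are hypothetical names; `fetch_page(n)` stands for whatever function returns the parsed JSON for page `n`, including any delays and error handling.

```python
def fetch_all_results(fetch_page, max_pages=None):
    """Collect 'results' from every page.

    The page count comes from the 'total_pages' field of the first
    response; if that field is missing, only the first page is used.
    """
    first = fetch_page(1)
    total = first.get("total_pages", 1)
    if max_pages is not None:
        total = min(total, max_pages)  # optional cap, useful while testing
    results = list(first.get("results", []))
    for page in range(2, total + 1):
        results.extend(fetch_page(page).get("results", []))
    return results
```

Because the fetching logic is injected, the same loop works with a stub during development and with a real session-backed function later.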

## Step 4: Save the Collected Data

Now that we've successfully gathered all the data from the API, we:
1. Check the shape (rows × columns) of our dataset
2. Export the DataFrame to a CSV file for future use