# IMDB Data Collection

Author: Nithish Kumar

# Data Collection

## 1. Objective
The main goal was to compile a dataset of movies released between 2010 and 2020, including essential information such as movie titles, release dates, budgets, revenues, and additional metadata like cast, crew, genres, and production companies.

## 2. Source
All movie data was collected from **The Movie Database (TMDB)**, a free, community-maintained movie database. Its API provides movie metadata including budget and revenue figures, popularity scores, and cast and crew information.

## 3. Methodology

1. **Retrieving Movie IDs**  
   - TMDB's **Discover Movie** endpoint was queried once per release year in the target timeframe (2010–2020), with results sorted by popularity in descending order and at most 500 pages fetched per year.  
   - The response contained basic movie attributes, including their unique IDs.  
   - Results spanned multiple pages, so pagination was used to fetch all relevant movie IDs.

2. **Fetching Detailed Information**  
   - For each movie ID obtained in the first step, a second API call was made to TMDB's **Movie Details** endpoint, using the parameter `append_to_response=credits,keywords` to fetch cast, crew, and keyword data in a single request.  
   - The response included fields such as budget, revenue, popularity score, genres, runtime, production companies, and full cast and crew credits.

3. **Data Storage**  
   - The required fields were extracted from each response into a Python dictionary, one per movie.  
   - The dictionaries were combined into a DataFrame and written to a **CSV file** so the data can be loaded directly for exploratory data analysis (EDA) and model training.
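
The pagination in step 1 can be sketched independently of the network layer. The version below is a minimal sketch, not the notebook's implementation. It assumes each Discover response is a dictionary with `results` and `total_pages` keys, and `collect_movie_ids`, `fetch_page`, and `fake_fetch` are hypothetical names introduced here for illustration. Passing the fetcher in as a function keeps the loop testable without an API key.

```python
from typing import Callable, Dict, List


def collect_movie_ids(fetch_page: Callable[[int], Dict], max_pages: int = 500) -> List[int]:
    """Walk a paginated listing and return every movie ID found.

    fetch_page(page) is a hypothetical callable returning one decoded
    response; it is assumed to contain 'results' and 'total_pages'.
    """
    ids: List[int] = []
    page = 1
    while page <= max_pages:
        data = fetch_page(page)
        results = data.get("results", [])
        if not results:
            break  # An empty page means there is nothing left to read
        ids.extend(movie["id"] for movie in results)
        # Stop once the last page reported by the response has been read
        if page >= data.get("total_pages", max_pages):
            break
        page += 1
    return ids


def fake_fetch(page: int) -> Dict:
    """Stand-in for a real API call: three pages of two movies each."""
    return {
        "page": page,
        "total_pages": 3,
        "results": [{"id": page * 10 + offset} for offset in range(2)],
    }


all_ids = collect_movie_ids(fake_fetch)
print(all_ids)  # [10, 11, 20, 21, 30, 31]
```

Reading `total_pages` avoids one extra request per year compared with looping until an empty page comes back, while the empty-page check remains as a fallback.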

## 4. Outcome
By combining both steps (discovering movie IDs first, then pulling the details for each movie), a **comprehensive dataset** was created. It covers film releases from 2010 through 2020 and includes the metadata needed for further analysis, such as financial figures, the director and top-billed cast, and other relevant attributes.

## 5. Considerations

- **Rate Limits**: The TMDB API has usage limits, so a short delay (`time.sleep(0.25)`) was added between requests to stay below them.  
- **Data Quality**: TMDB data comes from community and official contributions, so completeness varies. Missing fields (such as budget or revenue, which can appear as `0`) mean the information was not available in the source.  
- **Pagination**: Proper pagination was handled to ensure that all movies are gathered, rather than just the first page of results.
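
A fixed delay lowers the chance of hitting the rate limit but does not help when a request is throttled anyway. One common complement is to retry when the server answers with HTTP 429 (Too Many Requests). The sketch below is an illustration under stated assumptions rather than part of the notebook: `request_with_retry` and `send` are hypothetical names, `send` is assumed to return a status code, a header dictionary, and a decoded body, and any `Retry-After` header is assumed to hold a number of seconds.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], Optional[dict]]


def request_with_retry(
    send: Callable[[], Response],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call send() and retry while the server answers with HTTP 429.

    Returns the body of the first 200 response, or None if a different
    error occurs or the retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status == 200:
            return body
        if status != 429 or attempt == max_retries:
            return None
        # Prefer the server's hint; otherwise back off exponentially
        retry_after = headers.get("Retry-After")
        delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
        sleep(delay)
    return None


# Simulated server: throttles twice, then succeeds
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (429, {}, None),
    (200, {}, {"id": 27205}),
])
waits = []
body = request_with_retry(lambda: next(responses), sleep=waits.append)
print(body, waits)  # {'id': 27205} [2.0, 2.0]
```

The `sleep` parameter is injected so the example runs instantly here; in real use the default `time.sleep` would apply.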


In [12]:
import requests
import pandas as pd
import time

In [None]:
TMDB_API_KEY = "API_KEY"
BASE_URL = 'https://api.themoviedb.org/3/'

In [None]:
# Function to get one page of movies released within a range of years
def get_movies_by_year_range(start_year, end_year, page=1):
    url = BASE_URL + "discover/movie"
    params = {
        'api_key': TMDB_API_KEY,
        'language': 'en-US',
        'sort_by': 'popularity.desc',
        'include_adult': 'false',
        'include_video': 'false',
        'page': page,
        'primary_release_date.gte': f'{start_year}-01-01',
        'primary_release_date.lte': f'{end_year}-12-31',
    }

    response = requests.get(url, params=params)
    return response.json()

In [None]:
movie_ids = []

# Loop over each year from 2010 through 2020 (range() excludes its end value)
for year in range(2010, 2021):
    print(f"\n🔍 Fetching movies from {year}...")
    for page in range(1, 501):  # Max 500 pages per year
        data = get_movies_by_year_range(year, year, page)
        if not data or 'results' not in data or len(data['results']) == 0:
            break

        ids_on_page = [movie['id'] for movie in data['results']]
        movie_ids.extend(ids_on_page)

        print(f"Year {year} - Page {page} ✔️ Fetched {len(ids_on_page)} IDs")

        time.sleep(0.25)  # Throttle to at most ~4 requests per second

print(f"\n✅ Total movie IDs collected: {len(movie_ids)}")

# Collect the IDs in a DataFrame for the next step
movies_id_df = pd.DataFrame(movie_ids, columns=['movie_id'])


🔍 Fetching movies from 2010...
Year 2010 - Page 1 ✔️ Fetched 20 IDs
Year 2010 - Page 2 ✔️ Fetched 20 IDs
Year 2010 - Page 3 ✔️ Fetched 20 IDs
Year 2010 - Page 4 ✔️ Fetched 20 IDs

✅ Total movie IDs collected: 80


In [27]:
# Sample showing movies_id df for fewer pages
movies_id_df

Unnamed: 0,movie_id
0,48650
1,27205
2,20526
3,38575
4,11324
...,...
75,107748
76,44918
77,38199
78,7978


In [None]:
# Querying TMDB API to get all static info about a movie.
def get_tmdb_movie_data(tmdb_id):
    url = f"{BASE_URL}movie/{tmdb_id}?api_key={TMDB_API_KEY}&append_to_response=credits,keywords"
    response = requests.get(url)
    
    if response.status_code != 200:
        print(f"Failed to get data for movie ID: {tmdb_id}")
        return None
    
    data = response.json()

    # Get director
    director = next((crew['name'] for crew in data['credits']['crew'] if crew['job'] == 'Director'), None)

    # Get top 3 cast
    top_cast = [cast['name'] for cast in data['credits']['cast'][:3]]

    # Get keywords
    keywords = [kw['name'] for kw in data.get('keywords', {}).get('keywords', [])]

    return {
        'movie_ID': data.get('id'),
        'IMDB_ID': data.get('imdb_id'),
        'title': data.get('title'),
        'vote_average': data.get('vote_average'),
        'vote_count': data.get('vote_count'),
        'status': data.get('status'),
        'Release Date': data.get('release_date'),
        'Budget': data.get('budget'),
        'Revenue': data.get('revenue'),
        'Popularity': data.get('popularity'),
        'Runtime': data.get('runtime'),
        'Language': data.get('original_language'),
        'Genres': ", ".join([genre['name'] for genre in data.get('genres', [])]),
        'Production Companies': ", ".join([company['name'] for company in data.get('production_companies', [])]),
        'Director': director,
        'Top Cast': ", ".join(top_cast),
        'Keywords': ", ".join(keywords)
    }


results = []

for tmdb_id in movies_id_df['movie_id']:
    data = get_tmdb_movie_data(tmdb_id)
    if data:
        results.append(data)
    time.sleep(0.25)  # Same throttle as the ID-collection loop

# Create a DataFrame from results
enriched_df = pd.DataFrame(results)

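# Save to CSV in the current working directory. Append if the file already
# exists, and write the header row only when the file is first created.
from pathlib import Path

output_path = Path("tmdb_enriched_movies.csv")
enriched_df.to_csv(output_path, mode='a', header=not output_path.exists(), index=False)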

print("Data saved to tmdb_enriched_movies.csv ✅")
enriched_df.head()

Data saved to tmdb_enriched_movies.csv ✅


Unnamed: 0,movie_ID,IMDB_ID,title,vote_average,vote_count,status,Release Date,Budget,Revenue,Popularity,Runtime,Language,Genres,Production Companies,Director,Top Cast,Keywords
0,48650,tt1263750,Room in Rome,6.404,761,Released,2010-05-07,0,844281,29.042,109,es,"Drama, Romance","Morena Films, Alicia Produce, Intervenciones N...",Julio Medem,"Elena Anaya, Natasha Yarovenko, Enrico Lo Verso","hotel, hotel room, rome, italy, female friends..."
1,27205,tt1375666,Inception,8.369,37315,Released,2010-07-15,160000000,839030630,27.3318,148,en,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ken W...","rescue, mission, dreams, airplane, paris, fran..."
2,20526,tt1104001,TRON: Legacy,6.495,7270,Released,2010-12-14,170000000,409912892,25.2795,126,en,"Adventure, Action, Science Fiction","Walt Disney Pictures, Sean Bailey Productions",Joseph Kosinski,"Garrett Hedlund, Olivia Wilde, Jeff Bridges","artificial intelligence (a.i.), computer progr..."
3,38575,tt1155076,The Karate Kid,6.543,6116,Released,2010-06-10,40000000,359126022,22.3718,140,en,"Action, Adventure, Drama, Family","Jerry Weintraub Productions, Columbia Pictures...",Harald Zwart,"Jaden Smith, Jackie Chan, Taraji P. Henson","martial arts, duringcreditsstinger, karate kid..."
4,11324,tt1130884,Shutter Island,8.201,24415,Released,2010-02-14,80000000,294804195,20.4984,138,en,"Drama, Thriller, Mystery","Paramount Pictures, Phoenix Pictures, Sikelia ...",Martin Scorsese,"Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley","island, based on novel or book, hurricane, inv..."
5,38757,tt0398286,Tangled,7.606,11698,Released,2010-11-24,260000000,592461732,20.7712,100,en,"Animation, Family, Adventure","Walt Disney Animation Studios, Walt Disney Pic...",Byron Howard,"Mandy Moore, Zachary Levi, Donna Murphy","princess, magic, hostage, fairy tale, horse, v..."
6,12444,tt0926084,Harry Potter and the Deathly Hallows: Part 1,7.7,19417,Released,2010-11-17,250000000,954305868,18.3191,146,en,"Adventure, Fantasy","Warner Bros. Pictures, Heyday Films",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint","witch, friendship, london, england, corruption..."
7,10138,tt1228705,Iron Man 2,6.848,21360,Released,2010-04-28,200000000,623933331,18.6207,124,en,"Adventure, Action, Science Fiction","Marvel Studios, Fairview Entertainment, Marvel...",Jon Favreau,"Robert Downey Jr., Gwyneth Paltrow, Don Cheadle","technology, superhero, malibu, based on comic,..."
8,10191,tt0892769,How to Train Your Dragon,7.836,13324,Released,2010-03-18,165000000,494879471,19.6761,98,en,"Fantasy, Adventure, Animation, Family",DreamWorks Animation,Dean DeBlois,"Jay Baruchel, Gerard Butler, Craig Ferguson","friendship, ship, blacksmith, island, based on..."
9,10192,tt0892791,Shrek Forever After,6.382,7496,Released,2010-05-20,165000000,752600867,18.0458,93,en,"Comedy, Adventure, Fantasy, Animation, Family",DreamWorks Animation,Mike Mitchell,"Mike Myers, Eddie Murphy, Cameron Diaz","witch, sequel, ogre"
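
In the sample above, *Room in Rome* is listed with a budget of 0, which more plausibly means that no budget was recorded than that the film cost nothing. Before analysis, such zeros can be turned into explicit missing values. The helper below is a small sketch under that assumption. `mark_missing_financials` is a hypothetical name, and the field names are taken from the dictionary returned by `get_tmdb_movie_data`.

```python
from typing import Dict, Iterable


def mark_missing_financials(
    record: Dict[str, object],
    fields: Iterable[str] = ("Budget", "Revenue"),
) -> Dict[str, object]:
    """Return a copy of record with zero or absent financial fields set to None.

    Assumption: a value of 0 means "not reported", not a real figure.
    """
    cleaned = dict(record)
    for field in fields:
        if not cleaned.get(field):
            cleaned[field] = None
    return cleaned


sample = [
    {"title": "Room in Rome", "Budget": 0, "Revenue": 844281},
    {"title": "Inception", "Budget": 160000000, "Revenue": 839030630},
]
cleaned_rows = [mark_missing_financials(row) for row in sample]
print(cleaned_rows[0])  # {'title': 'Room in Rome', 'Budget': None, 'Revenue': 844281}
```

Making the gap explicit keeps zero budgets from distorting averages or revenue-to-budget ratios later in the analysis.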
