# TMDB Animation Movie Scraper

This notebook scrapes animated movies from TMDB, filters out anime, and saves the results in batches. It also provides data cleaning and feature engineering steps for further analysis.


## Table of Contents

- 1. Import libraries

- 2. Configuration and Setup

- 3. Batched Scraping Loop

- 4. Data Cleaning: Remove Duplicates

- 5. Feature Engineering

- 6. Preview Final Data


In [None]:
## 1. Import libraries
import requests
import pandas as pd
import time
import os


In [30]:
import requests

API_KEY = 'YOUR-API-KEY'  # Replace with your TMDB API Key
url = 'https://api.themoviedb.org/3/discover/movie'

params = {
    'api_key': API_KEY,
    'with_genres': '16',
    'language': 'en-US',
    'with_original_language': 'en',
    'page': 1
}

response = requests.get(url, params=params)
data = response.json()

print(f"🎯 Total animation movies on TMDB: {data['total_results']}")
print(f"📄 Total pages: {data['total_pages']}")

🎯 Total animation movies on TMDB: 32730
📄 Total pages: 1637


## 2. Configuration and Setup

Set up API keys, endpoints, and batch parameters.

In [None]:

# === CONFIGURATION ===
API_KEY = 'YOUR-API-KEY'
DISCOVER_URL = 'https://api.themoviedb.org/3/discover/movie'
DETAIL_URL = 'https://api.themoviedb.org/3/movie/{}'
KEYWORDS_URL = 'https://api.themoviedb.org/3/movie/{}/keywords'

b_size = input("Enter Number of Pages per Batch: ")
target_page = input("Enter Total Pages to Scrape: ")

BATCH_SIZE = int(b_size)                  # Pages per batch
TOTAL_TARGET_PAGES = int(target_page)       # Total pages you want to scrape this run
SLEEP_BETWEEN_REQUESTS = 0.3    # Delay between API requests (in seconds)
SLEEP_BETWEEN_BATCHES = 100   # Delay between batches (in seconds)

CHECKPOINT_FILE = "last_scraped_page.txt"
OUTPUT_FOLDER = "scraped_batches"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
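
The configuration cell hard-codes a placeholder API key. A minimal sketch of an alternative, assuming the key is exported in an environment variable: the variable name `TMDB_API_KEY` and the helper `load_api_key` are hypothetical and not part of the original notebook.

```python
import os

def load_api_key(env_var="TMDB_API_KEY", fallback="YOUR-API-KEY"):
    """Return the API key from the environment, or the placeholder if unset."""
    # env_var is a hypothetical name chosen for this sketch
    key = os.environ.get(env_var, "").strip()
    return key or fallback

API_KEY = load_api_key()
```

This keeps the real key out of the notebook file, which matters if the notebook is shared or committed.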

## Helper Functions

Functions for safe API requests and error handling.

In [24]:
# === HELPER: Safe request with 429 handling ===
def safe_request(url, params):
    for attempt in range(3):
        response = requests.get(url, params=params)
        if response.status_code == 200:
            return response
        elif response.status_code == 429:
            print("⏳ Rate limited (429). Sleeping for 60 seconds...")
            time.sleep(60)
        else:
            print(f"⚠️ Error {response.status_code} on URL: {url}")
            return None
    return None
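
The helper above always sleeps 60 seconds after a 429 response. As a hedged sketch of one alternative, the wait could prefer the standard HTTP `Retry-After` header when it holds a number of seconds, and otherwise fall back to capped exponential backoff. Whether TMDB sends that header is an assumption, and `retry_delay` with its `base` and `cap` parameters is a hypothetical helper.

```python
def retry_delay(attempt, retry_after=None, base=2.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After may be a plain number of seconds
            return min(float(retry_after), cap)
        except (TypeError, ValueError):
            pass  # not a plain number (e.g. an HTTP date): use backoff instead
    # Exponential backoff: base, 2*base, 4*base, ... capped at `cap`
    return min(base * (2 ** attempt), cap)

print(retry_delay(0), retry_delay(1), retry_delay(10))
```

Inside `safe_request`, the fixed sleep could then become `time.sleep(retry_delay(attempt, response.headers.get("Retry-After")))`.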

## Determine Scraping Range

Set the start and end pages for the scraping process, using checkpointing.

In [11]:
# === GET STARTING PAGE ===
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE, "r") as f:
        start_page = int(f.read().strip()) + 1
else:
    start_page = 1

end_page = min(start_page + TOTAL_TARGET_PAGES - 1, 1636)
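
The checkpoint read above raises an error if the file exists but is empty or corrupted. A small sketch of a more forgiving reader, under the assumption that falling back to page 1 is acceptable in that case; `read_start_page` is a hypothetical helper name.

```python
import os
import tempfile

def read_start_page(checkpoint_file, default=1):
    """Return the page after the last checkpointed one, or `default`."""
    try:
        with open(checkpoint_file, "r") as f:
            last_page = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return default  # no checkpoint yet, or unreadable contents
    return max(last_page + 1, default)

# Demo against a throwaway checkpoint file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_scraped_page.txt")
    missing_start = read_start_page(path)   # file does not exist yet
    with open(path, "w") as f:
        f.write("700\n")
    resumed_start = read_start_page(path)   # resumes after page 700
    with open(path, "w") as f:
        f.write("")
    empty_start = read_start_page(path)     # empty file falls back to default
```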

In [27]:
print(f"📄 Starting from page {start_page} to {end_page}...")

📄 Starting from page 701 to 702...


## 3. Batched Scraping Loop

Scrape movie data in batches, save progress, and handle API limits.

In [None]:
# === START BATCHED SCRAPING ===
current_page = start_page
while current_page <= end_page:
    all_movies = []
    batch_start = current_page
    batch_end = min(current_page + BATCH_SIZE - 1, end_page)

    #print(f"\n🚀 Scraping batch: pages {batch_start} to {batch_end}")

    for page in range(batch_start, batch_end + 1):
        print(f"📄 Fetching page {page}...")
        params = {
            'api_key': API_KEY,
            'with_genres': '16',
            'language': 'en-US',
            'with_original_language': 'en',
            'sort_by': 'popularity.desc',
            'page': page
        }

        response = safe_request(DISCOVER_URL, params)
        if not response:
            print(f"❌ Skipping page {page} due to failed discover call.")
            with open("skipped_pages.txt", "a") as skip_log:
                skip_log.write(f"Skipped discover page {page}\n")
            continue  # ← SKIP this page, continue to next one

        data = response.json()
        for movie in data.get('results', []):
            movie_id = movie.get('id')
            if not movie_id:
                continue

            detail_res = safe_request(DETAIL_URL.format(movie_id), {'api_key': API_KEY})
            if not detail_res:
                continue
            detail_data = detail_res.json()

            keyword_res = safe_request(KEYWORDS_URL.format(movie_id), {'api_key': API_KEY})
            keyword_data = keyword_res.json().get('keywords', []) if keyword_res else []
            keyword_names = [kw['name'].lower() for kw in keyword_data]

            # Filter out anime
            if 'anime' in keyword_names or detail_data.get('original_language') == 'ja':
                continue

            movie_record = {
                'movie_id': movie_id,
                'title': detail_data.get('title'),
                'release_date': detail_data.get('release_date'),
                'overview': detail_data.get('overview'),
                'tagline': detail_data.get('tagline'),
                'rating': detail_data.get('vote_average'),
                'vote_count': detail_data.get('vote_count'),
                'popularity': detail_data.get('popularity'),
                'budget': detail_data.get('budget'),
                'revenue': detail_data.get('revenue'),
                'runtime': detail_data.get('runtime'),
                'genres': ', '.join([g['name'] for g in detail_data.get('genres', [])]),
                'production_companies': ', '.join([c['name'] for c in detail_data.get('production_companies', [])]),
                'poster_path': detail_data.get('poster_path'),
                'homepage': detail_data.get('homepage'),
                'language': detail_data.get('original_language'),
                'keywords': ', '.join(keyword_names)
            }

            all_movies.append(movie_record)
            time.sleep(SLEEP_BETWEEN_REQUESTS)

        # Save checkpoint after each page
        with open(CHECKPOINT_FILE, "w") as f:
            f.write(str(page))

    # Save batch to CSV
    if all_movies:
        df = pd.DataFrame(all_movies)
        batch_filename = f"animation_batch_page_{batch_start}_to_{batch_end}.csv"
        df.to_csv(os.path.join(OUTPUT_FOLDER, batch_filename), index=False)
        print(f"✅ Batch saved: {batch_filename}")

    current_page = batch_end + 1
    if current_page <= end_page:
        print(f"🕒 Sleeping for 1.5 minutes before next batch...")
        time.sleep(SLEEP_BETWEEN_BATCHES)

print("🎉 All batches completed!")

📄 Fetching page 701...
⚠️ Error 400 on URL: https://api.themoviedb.org/3/discover/movie
❌ Skipping page 701 due to failed discover call.
🕒 Sleeping for 1.5 minutes before next batch...
📄 Fetching page 702...
⚠️ Error 400 on URL: https://api.themoviedb.org/3/discover/movie
❌ Skipping page 702 due to failed discover call.
🎉 All batches completed!
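
The batch boundaries in the loop above follow a simple rule: each batch covers up to `BATCH_SIZE` consecutive pages and the last batch is clipped to `end_page`. A minimal sketch that isolates this arithmetic so it can be checked without calling the API; `batch_ranges` is a hypothetical helper, not part of the original notebook.

```python
def batch_ranges(start_page, end_page, batch_size):
    """Yield (batch_start, batch_end) page pairs, both inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    current = start_page
    while current <= end_page:
        batch_end = min(current + batch_size - 1, end_page)
        yield current, batch_end
        current = batch_end + 1

print(list(batch_ranges(1, 10, 4)))     # [(1, 4), (5, 8), (9, 10)]
print(list(batch_ranges(701, 702, 1)))  # [(701, 701), (702, 702)]
```

The second call mirrors the run logged above: two single-page batches with one pause between them.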


## Combine Batch Results

Merge all batch CSV files into a single DataFrame for further analysis.

In [None]:
folder_path = "scraped_batches"

# List to store DataFrames
all_dfs = []

# Loop through all CSV files and read them
for filename in os.listdir(folder_path):
    if filename.endswith(".csv") and "animation_batch_page_" in filename:
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path)
        print("Appended ",filename)
        all_dfs.append(df)

# Concatenate all DataFrames into one
combined_df = pd.concat(all_dfs, ignore_index=True)

# Save to a single CSV file
combined_df.to_csv("animation_movies_combined.csv", index=False)
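
The cell above fails in `pd.concat` if no batch files are found, and `os.listdir` returns files in arbitrary order. A hedged sketch of the same merge using `pathlib` globbing with a guard for the empty case; `combine_batches` is a hypothetical helper, and the demo files are throwaway data written to a temporary folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

def combine_batches(folder, pattern="animation_batch_page_*.csv"):
    """Read every matching batch CSV in `folder` and stack them."""
    paths = sorted(Path(folder).glob(pattern))  # lexicographic, not numeric, order
    if not paths:
        return pd.DataFrame()  # nothing scraped yet
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)

# Demo with two batch files and one unrelated CSV
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pd.DataFrame({"movie_id": [1, 2]}).to_csv(tmp / "animation_batch_page_1_to_2.csv", index=False)
    pd.DataFrame({"movie_id": [3]}).to_csv(tmp / "animation_batch_page_3_to_4.csv", index=False)
    pd.DataFrame({"movie_id": [99]}).to_csv(tmp / "other.csv", index=False)
    demo_combined = combine_batches(tmp)
    demo_empty = combine_batches(tmp, pattern="no_such_prefix_*.csv")
```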

In [3]:
animation_movie_df = combined_df.copy()
animation_movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9960 entries, 0 to 9959
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   movie_id              9960 non-null   int64  
 1   title                 9959 non-null   object 
 2   release_date          9792 non-null   object 
 3   overview              9887 non-null   object 
 4   tagline               2065 non-null   object 
 5   rating                9960 non-null   float64
 6   vote_count            9960 non-null   int64  
 7   popularity            9960 non-null   float64
 8   budget                9960 non-null   int64  
 9   revenue               9960 non-null   int64  
 10  runtime               9960 non-null   int64  
 11  genres                9960 non-null   object 
 12  production_companies  8052 non-null   object 
 13  poster_path           8994 non-null   object 
 14  homepage              1931 non-null   object 
 15  language             

## 4. Data Cleaning: Remove Duplicates

Drop duplicate movies based on `movie_id` and inspect the cleaned DataFrame.

In [9]:
missing_summary = animation_movie_df.isnull().sum().sort_values(ascending=False)
missing_percent = (animation_movie_df.isnull().sum() / len(animation_movie_df) * 100).sort_values(ascending=False)
missing_df = pd.DataFrame({'Missing Count': missing_summary, 'Missing %': missing_percent})
print(missing_df)
# Check for duplicates based on movie_id
duplicate_count = animation_movie_df.duplicated(subset='movie_id').sum()
print(f"Duplicate movie IDs found: {duplicate_count}")


                      Missing Count  Missing %
homepage                       8029  80.612450
tagline                        7895  79.267068
keywords                       4465  44.829317
production_companies           1908  19.156627
poster_path                     966   9.698795
release_date                    168   1.686747
overview                         73   0.732932
title                             1   0.010040
movie_id                          0   0.000000
budget                            0   0.000000
popularity                        0   0.000000
vote_count                        0   0.000000
rating                            0   0.000000
genres                            0   0.000000
runtime                           0   0.000000
revenue                           0   0.000000
language                          0   0.000000
Duplicate movie IDs found: 203


In [None]:
# Find rows that are duplicates based on movie_id
duplicate_rows = animation_movie_df[animation_movie_df.duplicated(subset='movie_id')].sort_values('movie_id')
duplicate_rows.head()

Unnamed: 0,movie_id,title,release_date,overview,tagline,rating,vote_count,popularity,budget,revenue,runtime,genres,production_companies,poster_path,homepage,language,keywords
8035,532,A Close Shave,1996-03-07,Wallace's whirlwind romance with the proprieto...,,7.566,866,1.2167,0,4638,30,"Family, Animation, Comedy","Aardman, BBC Bristol Productions, BBC Children...",/qKvN2z4ZcnWkMv6cMNC1Z26lEen.jpg,http://www.wallaceandgromit.com/films/a-close-...,en,"prison, sheep, inventor, loyalty, innocence, h..."
9104,5255,The Polar Express,2004-11-10,When a doubting young boy takes an extraordina...,This holiday season... believe.,6.72,6526,7.2387,165000000,318432432,100,"Animation, Adventure, Family, Fantasy","Golden Mean, Playtone, ImageMovers, Castle Roc...",/eOoCzH0MqeGr2taUZO4SwG416PF.jpg,,en,"faith, holiday, santa claus, nerd, bell, train..."
5767,5393,Happily N'Ever After,2007-01-05,"An alliance of evil-doers, led by Frieda, look...",Fairy Tale Endings Aren't What They Used To Be.,5.038,431,1.5965,47000000,38100000,87,"Adventure, Animation, Comedy, Family, Fantasy","Lionsgate, Vanguard Animation, Odyssey Enterta...",/gOOlHRhdEoJbPNE1jpocNawjnc5.jpg,http://www.happilyneverafterthefilm.com/,en,"princess, dwarf, wolf, fairy tale, bad mother-..."
9063,7518,Over the Hedge,2006-05-17,A scheming raccoon fools a mismatched family o...,Get over it.,6.576,4585,7.5787,80000000,343397247,84,"Family, Comedy, Animation",DreamWorks Animation,/jtZnymorbnHY7mOiBXR14ZDJseM.jpg,,en,"suburbian idyll, garbage, entrapment, squirrel..."
9096,9016,Treasure Planet,2002-11-26,When space galleon cabin boy Jim Hawkins disco...,Find your place in the universe.,7.558,4349,7.3135,140000000,109578115,96,"Science Fiction, Adventure, Animation, Family,...",Walt Disney Pictures,/zMKatZ0c0NCoKzfizaCzVUcbKMf.jpg,https://movies.disney.com/treasure-planet,en,"mutiny, space marine, based on novel or book, ..."
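
`duplicated(subset='movie_id')` flags only the second and later copies of each ID, so the rows shown above are the repeats, not the first occurrences. To inspect every copy side by side, `keep=False` can be passed instead. A small illustration on a toy frame (the rows are illustrative, reusing a few IDs and titles from the output above):

```python
import pandas as pd

demo = pd.DataFrame({
    "movie_id": [532, 5255, 532, 7518],
    "title": ["A Close Shave", "The Polar Express", "A Close Shave", "Over the Hedge"],
})

# Default keep='first': only the later copies are flagged
later_copies = demo[demo.duplicated(subset="movie_id")]

# keep=False: every row that shares its movie_id with another row
all_copies = demo[demo.duplicated(subset="movie_id", keep=False)]

print(later_copies.index.tolist(), all_copies.index.tolist())
```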


In [25]:
# Drop duplicate rows based on movie_id, keeping the first occurrence
df_cleaned = animation_movie_df.drop_duplicates(subset='movie_id').reset_index(drop=True)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9757 entries, 0 to 9756
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   movie_id              9757 non-null   int64  
 1   title                 9756 non-null   object 
 2   release_date          9589 non-null   object 
 3   overview              9684 non-null   object 
 4   tagline               1958 non-null   object 
 5   rating                9757 non-null   float64
 6   vote_count            9757 non-null   int64  
 7   popularity            9757 non-null   float64
 8   budget                9757 non-null   int64  
 9   revenue               9757 non-null   int64  
 10  runtime               9757 non-null   int64  
 11  genres                9757 non-null   object 
 12  production_companies  7855 non-null   object 
 13  poster_path           8793 non-null   object 
 14  homepage              1846 non-null   object 
 15  language             

In [26]:
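# Find rows whose title is missing and fill them with the literal string 'NONE'
# (named `missing_title` to avoid shadowing the built-in `filter`)
missing_title = df_cleaned['title'].isnull()
df_cleaned.loc[missing_title, 'title'] = 'NONE'

df_cleaned.loc[missing_title]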


Unnamed: 0,movie_id,title,release_date,overview,tagline,rating,vote_count,popularity,budget,revenue,runtime,genres,production_companies,poster_path,homepage,language,keywords
4658,518061,NONE,2015-01-01,NONE is a short film that explores the balance...,,0.0,0,0.16,0,0,4,Animation,,/ponf6oGL9tE7l2EysogAiAD50hr.jpg,,en,


In [27]:
# Show rows whose release date is missing
missing_release_date = df_cleaned['release_date'].isnull()
df_cleaned.loc[missing_release_date]

Unnamed: 0,movie_id,title,release_date,overview,tagline,rating,vote_count,popularity,budget,revenue,runtime,genres,production_companies,poster_path,homepage,language,keywords
22,656618,High in the Clouds,,After he accidentally sparked a revolution aga...,,0.0,0,0.4664,0,0,0,"Animation, Comedy, Family, Adventure, Music","MPL Communications, Unique Features, 88 Pictur...",/4BP4DUuj29HhOcybUCLbyu8YRZw.jpg,https://www.gaumont.com/en/movie/high-in-the-c...,en,
153,1403836,Bluey: The Movie,,Based on the TV series of the same name. Plot ...,,0.0,0,0.4424,0,0,0,"Animation, Family","Ludo Studio, BBC Studios, Cosmic Dino Studio",/wUiL0OTK3EPRyQfdQkEU4Ma87V2.jpg,,en,"anthropomorphic animal, based on tv series"
155,1453440,Cars 4,,Fourth installment of the Disney Pixar Cars se...,,0.0,0,0.4421,0,0,0,"Animation, Family","Pixar, Walt Disney Pictures",/ztIomHsqW7WQ21nVQU9AdRcf54A.jpg,,en,
218,467914,The Land of Sometimes,,Brother and sister Alfie and Elise keep wishin...,,0.0,0,0.4286,0,0,0,"Animation, Adventure, Family, Fantasy",Premiere Picture,/dcO0JNF5QeXqBAxRuBqgNBimXyD.jpg,,en,musical
253,1499744,The Chuck E. Cheese Christmas Special,,When Chuck E. Cheese and Friends learn that Sa...,,0.0,0,0.4222,0,0,0,"Animation, Family","HappyNest Entertainment, Pixel Zoo",,,en,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9490,536097,Rogue Trooper,,"Rogue is a ""Genetic Infantryman"", a geneticall...",,0.0,0,0.5715,0,0,0,"Science Fiction, Action, Animation","Rebellion Developments Ltd., Liberty Films Ent...",/pTJedB0fNraLqMkVEV1OgeWQS4b.jpg,,en,
9587,791523,Wildwood,,"Set beyond Portland’s city limits, in Wildwood...",,0.0,0,0.5267,0,0,0,"Animation, Family, Fantasy, Horror, Adventure",LAIKA,/dW65yBrb2oszg3vKHYlHuKO4iNq.jpg,https://www.laika.com/our-films/wildwood,en,"based on novel or book, stop motion"
9629,1189480,Mortal Kombat Legends: Fall of Edenia,,An upcoming film in the Mortal Kombat Legends ...,,0.0,0,0.5125,0,0,0,"Animation, Action, Fantasy",Warner Bros. Animation,,,en,"martial arts, fighting, based on video game, e..."
9635,811855,ThunderCats,,A computer-animated re-imagining of the hit '8...,,0.0,0,0.5105,0,0,0,"Animation, Fantasy, Action",Warner Bros. Pictures,/4Dj3A0twGJEor05xtlOTlMiyFLp.jpg,,en,


In [28]:
#breakpoint save df_cleaned to csv
df_cleaned.to_csv("animation_movies_cleaned.csv", index=False)

In [None]:
## Again, we will use the TMDB API to recover missing release dates


# Your TMDB API Key
API_KEY = "YOUR-API-KEY"
BASE_URL = "https://api.themoviedb.org/3/movie/"

# Filter missing release_date rows
missing_df = df_cleaned[df_cleaned['release_date'].isnull()].copy()
recovered_dates = {}
recovered_status = {}  # movie_id -> TMDB release status, kept separately from the dates

# Fetch release dates
for movie_id in missing_df['movie_id']:
    url = f"{BASE_URL}{movie_id}"
    params = {'api_key': API_KEY}
    try:
        res = requests.get(url, params=params, timeout=10)
    except requests.RequestException:
        continue  # network error: skip this movie
    if res.status_code == 200:
        data = res.json()
        if data.get('release_date'):
            recovered_dates[movie_id] = data['release_date']
            # The original line used an undefined name `status` as the key
            recovered_status[movie_id] = data.get('status')
    time.sleep(0.25)  # avoid bursts

# Update your main DataFrame
df_cleaned['release_date'] = df_cleaned.apply(
    lambda row: recovered_dates.get(row['movie_id'], row['release_date']),
    axis=1
)
#display null values of updated dataframe
df_cleaned.isnull().sum()


movie_id                   0
title                      0
release_date             168
overview                  73
tagline                 7799
rating                     0
vote_count                 0
popularity                 0
budget                     0
revenue                    0
runtime                    0
genres                     0
production_companies    1902
poster_path              964
homepage                7911
language                   0
keywords                4431
dtype: int64
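
The row-wise `apply` above can also be written as a column operation. A minimal sketch, assuming `recovered_dates` maps `movie_id` to a date string as in the cell above; `fill_release_dates` is a hypothetical helper and the IDs and dates in the demo are made-up toy values.

```python
import pandas as pd

def fill_release_dates(df, recovered):
    """Fill missing release_date values from a {movie_id: date} mapping."""
    out = df.copy()
    # map() looks up each movie_id; fillna() only touches rows that are null
    out["release_date"] = out["release_date"].fillna(out["movie_id"].map(recovered))
    return out

# Toy data: IDs and dates are invented for illustration only
demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "release_date": ["2000-01-01", None, None],
})
filled = fill_release_dates(demo, {2: "2024-05-01"})
print(filled["release_date"].isnull().sum())
```

Rows with no entry in the mapping stay null, which matches the outcome reported above when the API returns no date.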

In [None]:
## No release dates were recovered through the API. I manually looked up a random sample of these
## movies and concluded that most are unreleased, very old, or short films, so these null values
## can be left in place and excluded during analysis.
# Tried API recovery ✔️
# Manually validated ✔️

## 5. Feature Engineering

Create new features such as release year, decade, and short film classification.

In [40]:
import numpy as np
# Replace runtime = 0 with NaN
df_cleaned['runtime'] = df_cleaned['runtime'].replace(0, np.nan)

# Recompute is_short_film with 'Unknown' logic
def classify_short_film(rt):
    if pd.isnull(rt):
        return "Unknown"
    elif rt < 40:
        return True
    else:
        return False

df_cleaned['is_short_film'] = df_cleaned['runtime'].apply(classify_short_film)

# Ensure release_date is in datetime format
df_cleaned['release_date'] = pd.to_datetime(df_cleaned['release_date'], errors='coerce')

# Feature 1: release_year
df_cleaned['release_year'] = df_cleaned['release_date'].dt.year

# Feature 2: release_year_month (e.g., "2014-06")
df_cleaned['release_year_month'] = df_cleaned['release_date'].dt.to_period('M').astype(str)

# Feature 3: release_decade
def assign_decade(year):
    if pd.isnull(year):
        return "Unknown"
    decade_start = int(year) - int(year) % 10
    return f"{decade_start}s"

df_cleaned['release_decade'] = df_cleaned['release_year'].apply(assign_decade)

# Check featured data 
df_cleaned.head()


Unnamed: 0,movie_id,title,release_date,overview,tagline,rating,vote_count,popularity,budget,revenue,...,production_companies,poster_path,homepage,language,keywords,title_lower,is_short_film,release_year,release_year_month,release_decade
0,1290125,"License to Kill, Part MCMXC",1990-01-01,Reverses the role of hunter and hunted. It is ...,,0.0,0,0.473,0,0,...,,,,en,,"license to kill, part mcmxc",True,1990.0,1990-01,1990s
1,1415244,Phineas and Ferb Save Summer,2014-06-09,When L.O.V.E.M.U.F.F.I.N. moves the Earth furt...,This must be a special episode. He's yelling a...,0.0,0,0.4729,0,0,...,,/1rvJ7avpZ8aLIrIAosxd2mF7jNH.jpg,,en,,phineas and ferb save summer,False,2014.0,2014-06,2010s
2,69861,GoBots: Battle of the Rock Lords,1986-03-21,"The GoBots, television's amazing transformable...",At the edge of the universe... The adventure o...,5.5,6,0.4727,0,0,...,"Hanna-Barbera Productions, Tonka, Clubhouse Pi...",/vo5JCxcmNQp6lkLhDuk5QXboNsQ.jpg,,en,,gobots: battle of the rock lords,False,1986.0,1986-03,1980s
3,339549,Yamasong: March of the Hollows,2017-09-10,An automated girl and tortoise warrior journey...,,3.538,13,0.4724,0,0,...,Dark Dunes Productions,/4gegihZDkO5RHSs0HFu70kIFEY1.jpg,,en,,yamasong: march of the hollows,False,2017.0,2017-09,2010s
4,219230,Inhumans,2013-04-23,The Inhumans have always been one of Marvel’s ...,,6.2,10,0.4721,0,0,...,Marvel Knights,/1rl6zS7g2DPeMDEd78Ql6KLKDC9.jpg,,en,based on comic,inhumans,False,2013.0,2013-04,2010s
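
In the preview above, `release_year` is stored as a float (`1990.0`) because missing dates force a float column. A hedged sketch of an alternative using pandas' nullable `Int64` dtype, which keeps whole-number years and still allows missing values; the three sample dates are taken from the rows shown above plus one missing value.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1990-01-01", "2014-06-09", None]), errors="coerce")

# Nullable integer year: 1990, 2014, <NA> instead of 1990.0, 2014.0, NaN
year = dates.dt.year.astype("Int64")

# Decade label computed column-wise, with "Unknown" for missing dates
decade_start = year // 10 * 10
decade = (decade_start.astype("string") + "s").fillna("Unknown")

print(decade.tolist())
```

This is a sketch of one option, not a change to the notebook's results: the labels match those produced by `assign_decade`.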


In [41]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9757 entries, 0 to 9756
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   movie_id              9757 non-null   int64         
 1   title                 9757 non-null   object        
 2   release_date          9589 non-null   datetime64[ns]
 3   overview              9684 non-null   object        
 4   tagline               1958 non-null   object        
 5   rating                9757 non-null   float64       
 6   vote_count            9757 non-null   int64         
 7   popularity            9757 non-null   float64       
 8   budget                9757 non-null   int64         
 9   revenue               9757 non-null   int64         
 10  runtime               9477 non-null   float64       
 11  genres                9757 non-null   object        
 12  production_companies  7855 non-null   object        
 13  poster_path       

Some movies share the same title, often because of remakes (for example, 20,000 Leagues Under the Sea has 1973, 1985, and 2004 versions).
So the first step is to rewrite each title in the format `title (year)`, for example changing `20,000 Leagues Under the Sea` to `20,000 Leagues Under the Sea (1973)`.
The `title_year` column built in the cells below uses the release year and month, as in `Flatland (2007-01)`. This change:
- Prevents confusion between movies with the same name
- Helps users clearly see which version is being recommended
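
The `title (year)` format described above can be sketched as a small pure function. This is an illustration only: `title_with_year` is a hypothetical helper, and it handles a missing year given as `None` or a float NaN by leaving the title unchanged.

```python
import math

def title_with_year(title, year=None):
    """Return 'Title (YYYY)', or the bare title when the year is missing."""
    missing = year is None or (isinstance(year, float) and math.isnan(year))
    if missing:
        return title
    return f"{title} ({int(year)})"  # int() drops the '.0' from float years

print(title_with_year("20,000 Leagues Under the Sea", 1973.0))
print(title_with_year("Cars 4", float("nan")))
```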

In [None]:
# Check for duplicate movie titles (case-insensitive)
df_cleaned['title_lower'] = df_cleaned['title'].str.lower().str.strip()
duplicate_titles = df_cleaned[df_cleaned.duplicated(subset='title_lower', keep=False)].sort_values('title_lower')
duplicate_titles.head()

# save duplicate titles to csv
duplicate_titles.to_csv("duplicate_titles.csv", index=False)

In [46]:
# Build 'title_year' as "Title (YYYY-MM)" from release_year_month.
# release_year_month is a string column (missing dates become the text 'NaT'),
# so check release_date itself to detect a missing date.
df_cleaned['title_year'] = df_cleaned.apply(
    lambda row: f"{row['title']} ({row['release_year_month']})" if pd.notnull(row['release_date']) else row['title'],
    axis=1
)


In [43]:
# Save new dataset with features
df_cleaned.to_csv("animation_movies_featured.csv", index=False)

In [48]:
df_cleaned['title_year_lower'] = df_cleaned['title_year'].str.lower().str.strip()
duplicate_titles = df_cleaned[df_cleaned.duplicated(subset='title_year', keep=False)].sort_values('title_year_lower')
duplicate_titles.head()
duplicate_titles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, 297 to 9346
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   movie_id              8 non-null      int64         
 1   title                 8 non-null      object        
 2   release_date          8 non-null      datetime64[ns]
 3   overview              8 non-null      object        
 4   tagline               3 non-null      object        
 5   rating                8 non-null      float64       
 6   vote_count            8 non-null      int64         
 7   popularity            8 non-null      float64       
 8   budget                8 non-null      int64         
 9   revenue               8 non-null      int64         
 10  runtime               7 non-null      float64       
 11  genres                8 non-null      object        
 12  production_companies  7 non-null      object        
 13  poster_path           6 

In [50]:
duplicate_titles

Unnamed: 0,movie_id,title,release_date,overview,tagline,rating,vote_count,popularity,budget,revenue,...,homepage,language,keywords,title_lower,is_short_film,release_year,release_year_month,release_decade,title_year,title_year_lower
297,70665,Flatland,2007-01-14,Flatland is a two-dimensional universe occupie...,"A tale of math, physics, dimensionality, philo...",5.96,25,0.4127,0,0,...,http://flatlandthefilm.com,en,"geometry, another dimension, triangle, multipl...",flatland,False,2007.0,2007-01,2000s,Flatland (2007-01),flatland (2007-01)
993,337550,Flatland,2007-01-14,Set in a world of only two dimensions inhabite...,A Journey of Many Dimensions,5.6,17,0.3121,0,0,...,http://www.flatlandthemovie.com,en,"geometry, triangle, cube, dimensional travel, ...",flatland,True,2007.0,2007-01,2000s,Flatland (2007-01),flatland (2007-01)
2291,1271882,Rebooted,2018-05-29,"Owl Guy, a retro comic book superhero, is sudd...",,0.0,0,0.2344,0,0,...,,en,"superhero, comic book",rebooted,True,2018.0,2018-05,2010s,Rebooted (2018-05),rebooted (2018-05)
4402,680609,Rebooted,2018-05-29,"Owl Guy, a retro comic book superhero, meets h...",,0.0,0,0.1658,0,0,...,,en,comic book,rebooted,Unknown,2018.0,2018-05,2010s,Rebooted (2018-05),rebooted (2018-05)
2692,378127,The Hunchback of Notre Dame,1996-04-16,"The classic tale of a loveable, outcast hunchb...",,3.5,4,0.2151,0,0,...,,en,"hunchback, notre-dame",the hunchback of notre dame,False,1996.0,1996-04,1990s,The Hunchback of Notre Dame (1996-04),the hunchback of notre dame (1996-04)
2894,268762,The Hunchback of Notre Dame,1996-04-10,"Set in the middle ages, this is the wonderful,...",Burbank Animation 1996 Version of The Hunchbac...,6.7,3,0.2062,0,0,...,http://www.burbankanimation.com/pages/hunchbac...,en,"hunchback, notre-dame",the hunchback of notre dame,False,1996.0,1996-04,1990s,The Hunchback of Notre Dame (1996-04),the hunchback of notre dame (1996-04)
2251,1106231,The Wind in the Willows,1995-12-25,"Follow Mr. Toad as he purchases a motor car, a...",,0.0,0,0.2365,0,0,...,,en,,the wind in the willows,False,1995.0,1995-12,1990s,The Wind in the Willows (1995-12),the wind in the willows (1995-12)
9346,59178,The Wind in the Willows,1995-12-24,"Jailed for his reckless driving, rambunctious ...",,7.044,34,0.6467,0,0,...,,en,,the wind in the willows,False,1995.0,1995-12,1990s,The Wind in the Willows (1995-12),the wind in the willows (1995-12)


In [None]:
## keep one row based on title_year
# Keep the first occurrence of each title_year
df_cleaned = df_cleaned.drop_duplicates(subset='title_year_lower', keep='first').reset_index(drop=True)
# Save the cleaned DataFrame to a new CSV file
df_cleaned.to_csv("animation_movies_featured.csv", index=False)
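
`keep='first'` retains whichever duplicate happens to appear first in the DataFrame. As a hedged alternative (a different rule from the one the notebook uses), the copy with the most votes could be kept instead. `keep_most_voted` is a hypothetical helper; the demo rows reuse IDs and vote counts from the duplicate listing above.

```python
import pandas as pd

def keep_most_voted(df, key="title_year_lower"):
    """Within each duplicate group, keep the row with the highest vote_count."""
    ranked = df.sort_values("vote_count", ascending=False, kind="stable")
    deduped = ranked.drop_duplicates(subset=key, keep="first")
    return deduped.sort_index().reset_index(drop=True)  # restore original order

demo = pd.DataFrame({
    "movie_id": [70665, 337550, 1106231, 59178],
    "title_year_lower": [
        "flatland (2007-01)", "flatland (2007-01)",
        "the wind in the willows (1995-12)", "the wind in the willows (1995-12)",
    ],
    "vote_count": [25, 17, 0, 34],
})
kept = keep_most_voted(demo)
print(kept["movie_id"].tolist())
```

For The Wind in the Willows, this rule keeps the entry with 34 votes, whereas `keep='first'` would keep the entry with 0 votes because it comes first in row order.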




In [54]:
df_cleaned.columns
# drop column title_year_lower,title_lower
df_cleaned = df_cleaned.drop(columns=['title_year_lower', 'title_lower'])
# Save the final cleaned DataFrame to a new CSV file
df_cleaned.to_csv("animation_movies_featured.csv", index=False)
df_cleaned.columns

Index(['movie_id', 'title', 'release_date', 'overview', 'tagline', 'rating',
       'vote_count', 'popularity', 'budget', 'revenue', 'runtime', 'genres',
       'production_companies', 'poster_path', 'homepage', 'language',
       'keywords', 'is_short_film', 'release_year', 'release_year_month',
       'release_decade', 'title_year'],
      dtype='object')

In [None]:
df = pd.read_csv("animation_movies_featured.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9751 entries, 0 to 9750
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   movie_id              9751 non-null   int64  
 1   title                 9751 non-null   object 
 2   release_date          9583 non-null   object 
 3   overview              9678 non-null   object 
 4   tagline               1955 non-null   object 
 5   rating                9751 non-null   float64
 6   vote_count            9751 non-null   int64  
 7   popularity            9751 non-null   float64
 8   budget                9751 non-null   int64  
 9   revenue               9751 non-null   int64  
 10  runtime               9472 non-null   float64
 11  genres                9751 non-null   object 
 12  production_companies  7849 non-null   object 
 13  poster_path           8788 non-null   object 
 14  homepage              1844 non-null   object 
 15  language             

## 6. Preview Final Data

In [2]:
df.head()

Unnamed: 0,movie_id,title,release_date,overview,tagline,rating,vote_count,popularity,budget,revenue,...,production_companies,poster_path,homepage,language,keywords,is_short_film,release_year,release_year_month,release_decade,title_year
0,1290125,"License to Kill, Part MCMXC",01-01-1990,Reverses the role of hunter and hunted. It is ...,,0.0,0,0.473,0,0,...,,,,en,,True,1990.0,1990-01,1990s,"License to Kill, Part MCMXC (1990-01)"
1,1415244,Phineas and Ferb Save Summer,09-06-2014,When L.O.V.E.M.U.F.F.I.N. moves the Earth furt...,This must be a special episode. He's yelling a...,0.0,0,0.4729,0,0,...,,/1rvJ7avpZ8aLIrIAosxd2mF7jNH.jpg,,en,,False,2014.0,2014-06,2010s,Phineas and Ferb Save Summer (2014-06)
2,69861,GoBots: Battle of the Rock Lords,21-03-1986,"The GoBots, television's amazing transformable...",At the edge of the universe... The adventure o...,5.5,6,0.4727,0,0,...,"Hanna-Barbera Productions, Tonka, Clubhouse Pi...",/vo5JCxcmNQp6lkLhDuk5QXboNsQ.jpg,,en,,False,1986.0,1986-03,1980s,GoBots: Battle of the Rock Lords (1986-03)
3,339549,Yamasong: March of the Hollows,10-09-2017,An automated girl and tortoise warrior journey...,,3.538,13,0.4724,0,0,...,Dark Dunes Productions,/4gegihZDkO5RHSs0HFu70kIFEY1.jpg,,en,,False,2017.0,2017-09,2010s,Yamasong: March of the Hollows (2017-09)
4,219230,Inhumans,23-04-2013,The Inhumans have always been one of Marvel’s ...,,6.2,10,0.4721,0,0,...,Marvel Knights,/1rl6zS7g2DPeMDEd78Ql6KLKDC9.jpg,,en,based on comic,False,2013.0,2013-04,2010s,Inhumans (2013-04)
