# Guide To Extracting Data from the Movies Dataset

This guide shows one way to *extract data from the IMDb movies dataset*. In this notebook I extract the data from the separate *csv* files, which were kept separate for practical reasons during the data scraping phase. I also apply some normalization and data cleaning, but only the bare minimum, to keep the notebook simple.

A separate notebook covers the full data cleaning process. You can check it out.

## Introducing the Dataset and its structure

This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history.

Each year’s data is divided into three CSV files for flexibility and ease of use:
- ``imdb_movies_[year].csv``: Basic movie details.
- ``advanced_movies_details_[year].csv``: Comprehensive metadata and financial details.
- ``merged_movies_data_[year].csv``: A unified dataset combining both files.
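
As a small illustration of the naming scheme above, the helper below builds the three file names for a given year. The function name `year_files` is hypothetical, and it only builds names; it does not check that the files exist.

```python
def year_files(year: int) -> dict[str, str]:
    """Return the three CSV file names the dataset uses for one year."""
    return {
        "basic": f"imdb_movies_{year}.csv",
        "advanced": f"advanced_movies_details_{year}.csv",
        "merged": f"merged_movies_data_{year}.csv",
    }


print(year_files(1994)["merged"])  # merged_movies_data_1994.csv
```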

## What we are doing exactly:

1. Merge all `merged_movies_data_{year}` files across all years into one csv.
2. Rename the columns to a uniform naming style.
3. Extract the `id` from the movie link.
4. Remove duplicates.
5. Change empty values to None.
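
To make step 3 concrete, here is a minimal plain-Python sketch of the id extraction, using the same `/title/(tt\d+)` pattern that the notebook applies to the `movie_link` column. The helper name `extract_imdb_id` and the query string in the sample link are illustrative assumptions.

```python
import re

# Same pattern the notebook applies to the movie_link column.
ID_PATTERN = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(link):
    """Return the IMDb id found in a link, or None if there is none."""
    if not link:
        return None
    match = ID_PATTERN.search(link)
    return match.group(1) if match else None


# The "?ref_=..." suffix is a made-up example of a tracking query string.
print(extract_imdb_id("https://www.imdb.com/title/tt10208198/?ref_=sr_t_1"))  # tt10208198
```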

<h3 style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; color: #2c3e50; margin-top: 40px;">Contact Me</h3>

<p style="font-size: 16px; color: #555;">
  If you notice anything lacking, spot an issue with this notebook, or have suggestions for improvements, feel free to reach out through any of the platforms below:
</p>

<table style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; border-collapse: collapse; width: 100%; max-width: 600px;">
  <tbody>
    <tr>
      <td style="padding: 10px;">
        <a href="https://www.linkedin.com/in/addalaraed/" target="_blank">
          <img src="https://img.shields.io/badge/LinkedIn-Raed_Addala-blue?style=for-the-badge&logo=linkedin" alt="LinkedIn Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="https://x.com/AddalaRaed" target="_blank">
          <img src="https://img.shields.io/badge/Twitter-@AddalaRaed-1DA1F2?style=for-the-badge&logo=twitter" alt="Twitter Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="mailto:addala.raed@gmail.com">
          <img src="https://img.shields.io/badge/Gmail-addala.raed@gmail.com-D14836?style=for-the-badge&logo=gmail" alt="Gmail Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="https://github.com/RaedAddala" target="_blank">
          <img src="https://img.shields.io/badge/GitHub-RaedAddala-181717?style=for-the-badge&logo=github" alt="GitHub Badge"/>
        </a>
      </td>
    </tr>
  </tbody>
</table>


## Data Organization and Cleaning

In [1]:
import os
import polars as pl
from dateutil import parser

In [2]:
root_dir = "/kaggle/input/imdb-movies-from-1960-to-2023/Data"

In [3]:
merged_data = pl.DataFrame()

In [4]:
column_mapping = {
    'Title': 'title',
    'Year': 'year',
    'Duration': 'duration',
    'MPA': 'MPA',
    'Rating': 'rating',
    'Votes': 'votes',
    'méta_score': 'meta_score',
    'description': 'description',
    'Movie Link': 'movie_link',
    'link': 'movie_link',
    'writers': 'writers',
    'directors': 'directors',
    'stars': 'stars',
    'budget': 'budget',
    'opening_weekend_Gross': 'opening_weekend_gross',
    'grossWorldWWide': 'gross_worldwide',
    'gross_US_Canada': 'gross_us_canada',
    'Release_date': 'release_date',
    'countries_origin': 'countries_origin',
    'filming_locations': 'filming_locations',
    'production_company': 'production_companies',
    'awards_content': 'awards_content',
    'genres': 'genres',
    'Languages': 'languages'
}
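
Note that two source names, `Movie Link` and `link`, map to the same target, `movie_link`. That is fine as long as a single file contains only one of them. The sketch below is one way to check a file's header for such collisions before renaming; the helper name `find_rename_collisions` and the trimmed-down mapping are assumptions made for illustration.

```python
from collections import defaultdict


def find_rename_collisions(columns, mapping):
    """Return {target: [sources]} for targets that several present columns map to."""
    sources_by_target = defaultdict(list)
    for col in columns:
        if col in mapping:
            sources_by_target[mapping[col]].append(col)
    return {t: s for t, s in sources_by_target.items() if len(s) > 1}


# Trimmed-down copy of the mapping, for illustration only.
sample_mapping = {"Title": "title", "Movie Link": "movie_link", "link": "movie_link"}

print(find_rename_collisions(["Title", "link"], sample_mapping))
# {}
print(find_rename_collisions(["Title", "Movie Link", "link"], sample_mapping))
# {'movie_link': ['Movie Link', 'link']}
```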

In [5]:
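def parse_release_date(x: str):
    """Parse a free-form date string into a date; return None if parsing fails."""
    try:
        # fuzzy=True lets dateutil skip tokens it does not recognize
        return parser.parse(str(x), fuzzy=True).date()
    except Exception:
        return None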
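
One detail worth knowing about `parser.parse`: when a string contains only part of a date, dateutil fills the missing fields from its `default` argument, which falls back to today's date. The sketch below passes a fixed default so that partial dates become reproducible. The helper name `parse_with_fixed_default` and the sample strings are illustrative assumptions, not values taken from the dataset.

```python
from datetime import date, datetime

from dateutil import parser

# Missing fields are taken from this value instead of from today's date.
FIXED_DEFAULT = datetime(1900, 1, 1)


def parse_with_fixed_default(text):
    """Like parse_release_date, but a missing month or day becomes January / 1."""
    try:
        return parser.parse(str(text), fuzzy=True, default=FIXED_DEFAULT).date()
    except Exception:
        return None


print(parse_with_fixed_default("1999"))                              # 1999-01-01
print(parse_with_fixed_default("January 15, 1999 (United States)"))  # 1999-01-15
```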

In [6]:
frames: list[pl.DataFrame] = []

for folder in os.listdir(root_dir):
    folder_path = os.path.join(root_dir, folder)
    if not os.path.isdir(folder_path):
        continue

    for file in os.listdir(folder_path):
        if not (file.startswith("merged_movies_data_") and file.endswith(".csv")):
            continue

        file_path = os.path.join(folder_path, file)
        df = pl.read_csv(file_path)

        # Rename
        df = df.rename({k: v for k, v in column_mapping.items() if k in df.columns})

        # Cast everything to Utf8 pre-concat to avoid schema mismatches
        df = df.select([pl.col(c).cast(pl.Utf8).alias(c) for c in df.columns])

        # Clean title
        if "title" in df.columns:
            df = df.with_columns(
                pl.col("title")
                  .str.replace(r"^\d+\.\s*", "")
                  .str.strip_chars()
                  .alias("title")
            )

        # Parse release_date to Date
        if "release_date" in df.columns:
            df = df.with_columns(
                pl.col("release_date")
                  .map_elements(parse_release_date, return_dtype=pl.Date)
                  .alias("release_date")
            )

        # year -> release_date fallback (YYYY-01-01 if release_date missing)
        if "year" in df.columns:
            # Guard: make sure release_date exists before it is referenced below
            if "release_date" not in df.columns:
                df = df.with_columns(pl.lit(None, dtype=pl.Date).alias("release_date"))
            df = df.with_columns(
                pl.col("year").cast(pl.Int32, strict=False)
            ).with_columns(
                pl.when(pl.col("release_date").is_null() & pl.col("year").is_not_null())
                  .then((pl.col("year").cast(pl.Utf8) + "-01-01").str.strptime(pl.Date, "%Y-%m-%d"))
                  .otherwise(pl.col("release_date"))
                  .alias("release_date")
            ).drop("year")

        # Clean movie_link and extract IMDb id
        if "movie_link" in df.columns:
            df = df.with_columns(
                pl.col("movie_link").str.replace(r"\?.*$", "").alias("movie_link"),
                # extract /title/ttXXXXXXXX  →  ttXXXXXXXX
                pl.coalesce([
                    pl.col("movie_link").str.extract(r"/title/(tt\d+)", 1),
                    pl.col("movie_link").str.extract(r"(tt\d+)", 1)  # fallback if path slightly different
                ]).alias("id")
            )

        # Replace "[]" with nulls in list-like text columns
        for field in (
            'directors','writers','stars','genres','countries_origin',
            'filming_locations','production_companies','languages'
        ):
            if field in df.columns:
                df = df.with_columns(
                    pl.when(pl.col(field).is_null() | (pl.col(field).str.strip_chars() == "[]"))
                      .then(None)
                      .otherwise(pl.col(field))
                      .alias(field)
                )

        frames.append(df)
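
For readers who want to see what two of the cleaning rules in the loop do to individual values, here is a plain-Python sketch of the title cleanup and the `"[]"` to null rule. It mirrors the regular expression and the comparison used above, but the helper names and the sample values are assumptions for illustration.

```python
import re


def clean_title(raw):
    """Drop a leading rank such as '12. ' and any surrounding whitespace."""
    return re.sub(r"^\d+\.\s*", "", raw).strip()


def empty_list_to_none(value):
    """Map None and the text '[]' (ignoring surrounding spaces) to None."""
    if value is None or value.strip() == "[]":
        return None
    return value


print(clean_title("12. The Godfather "))  # The Godfather
print(empty_list_to_none(" [] "))         # None
print(empty_list_to_none("['English']"))  # ['English']
```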

In [7]:
# Concatenate (missing cols filled with nulls)
merged_data = pl.concat(frames, how="diagonal")

In [8]:
# Drop duplicate IDs
if "id" in merged_data.columns:
    dup = (merged_data.filter(pl.col("id").is_not_null())
                     .group_by("id")
                     .len()
                     .filter(pl.col("len") > 1))
    if dup.height > 0:
        print("Duplicate IDs found:")
        print(merged_data.filter(pl.col("id").is_in(dup["id"]))[["id", "title"]])
    merged_data = merged_data.unique(subset=["id"], keep="last")

Duplicate IDs found:
shape: (2, 2)
┌───────────┬───────────────┐
│ id        ┆ title         │
│ ---       ┆ ---           │
│ str       ┆ str           │
╞═══════════╪═══════════════╡
│ tt0020393 ┆ Shanghai Lady │
│ tt0020393 ┆ Shanghai Lady │
└───────────┴───────────────┘


In [9]:
# Reorder columns: put id and title first
columns_order = ["id", "title"] + [col for col in merged_data.columns if col not in ["id", "title"]]
merged_data = merged_data.select(columns_order)

In [10]:
print(f"Merged data shape: {merged_data.shape}")
display(merged_data.head().to_pandas())

Merged data shape: (63249, 23)


Unnamed: 0,id,title,duration,MPA,rating,votes,meta_score,description,movie_link,writers,...,opening_weekend_gross,gross_worldwide,gross_us_canada,release_date,countries_origin,filming_locations,production_companies,awards_content,genres,languages
0,tt10208198,"The Gangster, the Cop, the Devil",1h 49m,Not Rated,7.0,28K,65.0,A crime boss teams up with a cop to track down...,https://www.imdb.com/title/tt10208198/,['Lee Won-tae'],...,"$78,655","$25,775,371","$216,494",2019-05-15,['South Korea'],"['Seoul, South Korea']","['Kiwi Media Group', 'Acemaker Movieworks', 'B...","Awards, 1 win & 2 nominations total","['True Crime', 'Action', 'Crime', 'Thriller']","['Korean', 'English']"
1,tt3960412,Popstar: Never Stop Never Stopping,1h 27m,R,6.7,71K,68.0,When it becomes clear that his solo album is a...,https://www.imdb.com/title/tt3960412/,"['Andy Samberg', 'Akiva Schaffer', 'Jorma Tacc...",...,"$4,698,715","$9,680,029","$9,639,125",2016-06-03,"['United States', 'China']",,"['Universal Pictures', 'Perfect World Pictures...","Awards, 1 win & 6 nominations total","['Mockumentary', 'Raunchy Comedy', 'Comedy', '...",['English']
2,tt1001508,He's Just Not That Into You,2h 9m,PG-13,6.4,188K,47.0,This Baltimore-set movie of interconnecting st...,https://www.imdb.com/title/tt1001508/,"['Abby Kohn', 'Marc Silverstein', 'Greg Behren...",...,"$27,785,487","$178,866,158","$93,953,653",2009-02-06,"['Germany', 'United States']","['Handy Market - 2514 W. Magnolia Boulevard, B...","['New Line Cinema', 'Flower Films (II)', 'Inte...","Awards, 1 win & 4 nominations total","['Feel-Good Romance', 'Romantic Comedy', 'Come...",['English']
3,tt0038303,Anna and the King of Siam,2h 8m,Approved,7.0,2.8K,,"In 1862, a young Englishwoman becomes royal tu...",https://www.imdb.com/title/tt0038303/,"['Talbot Jennings', 'Sally Benson', 'Margaret ...",...,,,,1946-09-06,['United States'],['Los Angeles County Arboretum & Botanic Garde...,['Twentieth Century Fox'],"Won 2 Oscars, 6 wins & 6 nominations total","['Period Drama', 'Biography', 'Drama', 'Romance']",['English']
4,tt1401152,Unknown,1h 53m,PG-13,6.8,273K,56.0,When a man awakens from a coma only to discove...,https://www.imdb.com/title/tt1401152/,"['Oliver Butcher', 'Stephen Cornwell', 'Didier...",...,"$21,856,389","$135,710,029","$63,686,397",2011-02-18,"['United States', 'Germany', 'United Kingdom',...","['Tresor, Berlin, Germany (night club scene)']","['Dark Castle Entertainment', 'Panda Productio...","Awards, 3 nominations total","['Action', 'Mystery', 'Thriller']","['English', 'German', 'Turkish', 'Arabic']"


In [11]:
merged_data.write_csv("/kaggle/working/final_dataset.csv")