# Guide to Extracting Data from the Movies Dataset

This is a guide to *how to extract data from the IMDb movies dataset*. It shows one way of doing things. In this notebook I extract the data from the different *csv* files, which were kept separate for practical reasons during the data scraping phase. I also apply some normalization and data cleaning, deliberately kept to the bare minimum so that the notebook stays simple.

A separate notebook covers the full data cleaning process. Check it out if you need more thorough cleaning.

## Introducing the Dataset and Its Structure

This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history.

Each year’s data is divided into three CSV files for flexibility and ease of use:
- ``imdb_movies_[year].csv``: Basic movie details.
- ``advanced_movies_details_[year].csv``: Comprehensive metadata and financial details.
- ``merged_movies_data_[year].csv``: A unified dataset combining both files.
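
The naming scheme above makes it straightforward to locate a given kind of file for every year. The following is a minimal sketch, not part of the original notebook: the helper name `find_year_files` is hypothetical, and it assumes only that each file name consists of one of the prefixes above followed by the year and `.csv`.

```python
from pathlib import Path


def find_year_files(root_dir, prefix="merged_movies_data_"):
    """Map each year to the path of its CSV file with the given prefix.

    Hypothetical helper for illustration; the prefix defaults to the
    merged files described in the list above.
    """
    files_by_year = {}
    # rglob searches every sub-folder, so the exact folder layout does not matter.
    for path in sorted(Path(root_dir).rglob(f"{prefix}*.csv")):
        year_text = path.stem[len(prefix):]
        # Skip files whose suffix is not a plain year number.
        if year_text.isdigit():
            files_by_year[int(year_text)] = path
    return files_by_year
```

Calling `find_year_files(root)` with the dataset's root folder would return a dictionary such as `{1975: Path(".../merged_movies_data_1975.csv"), ...}`. Passing `prefix="imdb_movies_"` or `prefix="advanced_movies_details_"` selects the other two file kinds.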

<h3 style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; color: #2c3e50; margin-top: 40px;">Contact Me</h3>

<p style="font-size: 16px; color: #555;">
  If you notice anything lacking, spot an issue with this notebook, or have suggestions for improvements, feel free to reach out through any of the platforms below:
</p>

<table style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; border-collapse: collapse; width: 100%; max-width: 600px;">
  <tbody>
    <tr>
      <td style="padding: 10px;">
        <a href="https://www.linkedin.com/in/raed-addala-498b69191/" target="_blank">
          <img src="https://img.shields.io/badge/LinkedIn-Raed_Addala-blue?style=for-the-badge&logo=linkedin" alt="LinkedIn Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="https://x.com/AddalaRaed" target="_blank">
          <img src="https://img.shields.io/badge/Twitter-@AddalaRaed-1DA1F2?style=for-the-badge&logo=twitter" alt="Twitter Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="mailto:addala.raed@gmail.com">
          <img src="https://img.shields.io/badge/Gmail-addala.raed@gmail.com-D14836?style=for-the-badge&logo=gmail" alt="Gmail Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="https://github.com/RaedAddala" target="_blank">
          <img src="https://img.shields.io/badge/GitHub-RaedAddala-181717?style=for-the-badge&logo=github" alt="GitHub Badge"/>
        </a>
      </td>
    </tr>
  </tbody>
</table>


## Data Organization and Cleaning

In [1]:
import numpy as np
import pandas as pd
import os
import re

In [2]:
root_dir = "/kaggle/input/imdb-movies-from-1960-to-2023/Data"

In [3]:
merged_data = pd.DataFrame()
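
The loop in the next cell grows `merged_data` with `pd.concat` once per file, which copies the accumulated rows each time. A common alternative is to collect the per-file frames in a list and concatenate once at the end. The sketch below illustrates that pattern; the helper name `concat_frames` and the sample frames are made up for illustration and are not part of this notebook.

```python
import pandas as pd


def concat_frames(frames):
    """Concatenate a list of per-file DataFrames in a single call."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame()
    # Columns missing from a frame are filled with NaN, just as they are
    # when frames are concatenated one at a time.
    return pd.concat(frames, ignore_index=True, sort=False)


# Made-up frames: the second one has an extra column.
first = pd.DataFrame({"id": ["tt0000001"], "year": [1975]})
second = pd.DataFrame({"id": ["tt0000002"], "year": [1976], "release_date": [1977]})
combined = concat_frames([first, second])
print(combined.shape)
```

In the loop, this would mean appending each cleaned `data` frame to a list and calling the helper once after the loop finishes.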

In [4]:
# Iterate through folders and files
for folder in os.listdir(root_dir):
    folder_path = os.path.join(root_dir, folder)
    if os.path.isdir(folder_path):
        for file in os.listdir(folder_path):
            if file.startswith("merged_movies_data_") and file.endswith(".csv"):
                file_path = os.path.join(folder_path, file)
                data = pd.read_csv(file_path)

                column_mapping = {
                    'Title': 'title',
                    'Year': 'year',
                    'Duration': 'duration',
                    'MPA': 'MPA',
                    'Rating': 'rating',
                    'Votes': 'votes',
                    'méta_score': 'meta_score',
                    'description': 'description',
                    'Movie Link': 'Movie_Link',
                    'link': 'Movie_Link',
                    'writers': 'writers',
                    'directors': 'directors',
                    'stars': 'stars',
                    'budget': 'budget',
                    'opening_weekend_Gross': 'opening_weekend_gross',
                    'grossWorldWWide': 'gross_worldwide',
                    'gross_US_Canada': 'gross_us_canada',
                    'Release_date': 'release_date',
                    'countries_origin': 'countries_origin',
                    'filming_locations': 'filming_locations',
                    'production_company': 'production_companies',
                    'awards_content': 'awards_content',
                    'genres': 'genres',
                    'Languages': 'languages'
                }
                data = data.rename(columns={k: v for k, v in column_mapping.items() if k in data.columns})

                # Clean the title column: remove leading numbers and trim whitespace
                data['title'] = data['title'].apply(lambda x: re.sub(r'^\d+\.\s*', '', str(x)).strip())

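                # Step 1: Compare year with release_date (coerced to a number, with
                # missing values falling back to year). release_date is dropped for
                # this file only when it never differs from year.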
                data['year'] = data['year'].astype(int)
                data['release_date'] = pd.to_numeric(data['release_date'], errors='coerce')
                data['release_date'] = data['release_date'].fillna(data['year']).astype(int)
                mismatched_rows = data[data['year'] != data['release_date']]
                if mismatched_rows.empty:
                    data = data.drop(columns=['release_date'])

                # Step 2: Clean the Movie_Link column
                data['Movie_Link'] = data['Movie_Link'].apply(lambda x: re.sub(r'/\?ref_=.*$', '', str(x)))

                # Step 3: Extract and add the id field from Movie_Link
                data['id'] = data['Movie_Link'].apply(lambda x: x.split('/')[-1] if '/' in str(x) else None)

                # Step 4: Check for duplicate IDs and keep only one row per ID
                duplicate_ids = data[data.duplicated(subset=['id'], keep=False)]
                if not duplicate_ids.empty:
                    print("Duplicate IDs found:")
                    print(duplicate_ids)
                data = data.drop_duplicates(subset=['id'], keep='first')

                # Step 5: Replace empty arrays with null
                fields_to_check = [
                    'directors', 'writers', 'stars', 'genres', 'countries_origin',
                    'filming_locations', 'production_companies', 'languages'
                ]
                for field in fields_to_check:
                    if field in data.columns:
                        data[field] = data[field].apply(lambda x: None if pd.isna(x) or str(x).strip() == '[]' else x)

                merged_data = pd.concat([merged_data, data], ignore_index=True)

Duplicate IDs found:
             title  year duration  MPA  rating votes  meta_score  \
278  Shanghai Lady  1929    1h 6m  NaN     NaN   NaN         NaN   
447  Shanghai Lady  1929    1h 6m  NaN     NaN   NaN         NaN   

                                           description  \
278  Having spent several wasted months in a Shangh...   
447  Having spent several wasted months in a Shangh...   

                               Movie_Link  \
278  https://www.imdb.com/title/tt0020393   
447  https://www.imdb.com/title/tt0020393   

                                               writers  ... gross_worldwide  \
278  ['Daisy H. Andrews', 'Houston Branch', 'John C...  ...             NaN   
447  ['Daisy H. Andrews', 'Houston Branch', 'John C...  ...             NaN   

    gross_us_canada release_date   countries_origin  \
278             NaN         1929  ['United States']   
447             NaN         1929  ['United States']   

                                     filming_locations  \
2



In [5]:
columns_order = ['id', 'title'] + [col for col in merged_data.columns if col not in ['id', 'title']]
merged_data = merged_data[columns_order]
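
The string clean-up steps from the loop above (removing the leading number from the title, stripping the `ref_` suffix from the link, and taking the id from the link) can be tried in isolation. This is a small sketch using the same regular expressions as the loop; the function names and the raw sample values are hypothetical and chosen only for illustration.

```python
import re


def clean_title(raw_title):
    """Drop a leading rank such as '12. ' and trim whitespace."""
    return re.sub(r'^\d+\.\s*', '', str(raw_title)).strip()


def clean_link(raw_link):
    """Remove a trailing '/?ref_=...' suffix from a movie link."""
    return re.sub(r'/\?ref_=.*$', '', str(raw_link))


def extract_id(link):
    """Return the last path segment of a cleaned link, or None without a '/'."""
    link = str(link)
    return link.split('/')[-1] if '/' in link else None


# Hypothetical raw values, made up for illustration.
raw_title = "1. Jaws"
raw_link = "https://www.imdb.com/title/tt0073195/?ref_=sr_t_1"

link = clean_link(raw_link)
print(clean_title(raw_title))  # Jaws
print(link)                    # https://www.imdb.com/title/tt0073195
print(extract_id(link))        # tt0073195
```

Because the title pattern requires a period after the digits, a title made only of digits, such as `1917`, is left unchanged.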

In [6]:
print(f"Merged data shape: {merged_data.shape}")
display(merged_data.head())

Merged data shape: (63249, 24)


| | id | title | year | duration | MPA | rating | votes | meta_score | description | Movie_Link | ... | opening_weekend_gross | gross_worldwide | gross_us_canada | release_date | countries_origin | filming_locations | production_companies | awards_content | genres | languages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0073195 | Jaws | 1975 | 2h 4m | PG | 8.1 | 690K | 87.0 | When a massive killer shark unleashes chaos on... | https://www.imdb.com/title/tt0073195 | ... | \$7,061,513 | \$477,916,625 | \$267,263,625 | 1975.0 | ['United States'] | ["Water Street, Edgartown, Martha's Vineyard, ... | ['Zanuck/Brown Productions', 'Universal Pictur... | Won 3 Oscars, 16 wins & 20 nominations total | ['Monster Horror', 'Sea Adventure', 'Survival'... | ['English'] |
| 1 | tt0073629 | The Rocky Horror Picture Show | 1975 | 1h 40m | R | 7.4 | 174K | 65.0 | A newly-engaged couple have a breakdown in an ... | https://www.imdb.com/title/tt0073629 | ... | | \$115,827,018 | \$112,892,319 | 1975.0 | ['United Kingdom', 'United States'] | ["Oakley Court, Windsor Road, Oakley Green, Wi... | ['Twentieth Century Fox', 'Michael White Produ... | Awards, 3 wins & 4 nominations total | ['B-Horror', 'Dark Comedy', 'Parody', 'Raunchy... | ['English'] |
| 2 | tt0073486 | One Flew Over the Cuckoo's Nest | 1975 | 2h 13m | R | 8.7 | 1.1M | 84.0 | In the Fall of 1963, a Korean War veteran and ... | https://www.imdb.com/title/tt0073486 | ... | | \$109,115,366 | \$108,981,275 | 1975.0 | ['United States'] | ['Oregon State Mental Hospital - 2600 Center S... | ['Fantasy Films', 'N.V. Zvaluw'] | Won 5 Oscars, 38 wins & 15 nominations total | ['Medical Drama', 'Psychological Drama', 'Drama'] | ['English'] |
| 3 | tt0072890 | Dog Day Afternoon | 1975 | 2h 5m | R | 8.0 | 281K | 86.0 | Three amateur robbers plan to hold up a Brookl... | https://www.imdb.com/title/tt0072890 | ... | | \$50,004,527 | \$50,000,000 | 1975.0 | ['United States'] | ['285 Prospect Park West, Brooklyn, New York C... | ['Warner Bros.', 'Artists Entertainment Complex'] | Won 1 Oscar, 14 wins & 20 nominations total | ['Dark Comedy', 'Heist', 'True Crime', 'Biogra... | ['English'] |
| 4 | tt0073692 | Shampoo | 1975 | 1h 50m | R | 6.4 | 15K | 65.0 | On Election Day, 1968, irresponsible hairdress... | https://www.imdb.com/title/tt0073692 | ... | | \$49,407,734 | \$49,407,734 | 1975.0 | ['United States'] | ["2270 Bowmont Drive, Beverly Hills, Californi... | ['Persky-Bright / Vista', 'Columbia Pictures',... | Won 1 Oscar, 3 wins & 11 nominations total | ['Satire', 'Comedy', 'Drama'] | ['English'] |


In [7]:
merged_data.to_csv('/kaggle/working/final_dataset.csv', index=False)
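
List-like columns such as `genres` or `languages` are stored in the saved file as strings (for example `"['English']"`), so they come back as plain text when the CSV is read again. One possible way to turn them into Python lists is `ast.literal_eval`. This is a sketch under that assumption, not part of the cleaning done above; the helper name `parse_list_cell` and the sample frame are made up for illustration.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a string like "['English']" into a list; anything else becomes None."""
    # After pd.read_csv, missing cells are NaN (a float), not strings.
    if not isinstance(value, str):
        return None
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None
    return list(parsed) if isinstance(parsed, (list, tuple)) else None


# Made-up sample mimicking two list-like columns of the saved file.
sample = pd.DataFrame({
    "countries_origin": ["['United Kingdom', 'United States']", None],
    "languages": ["['English']", "['English']"],
})
for column in ["countries_origin", "languages"]:
    sample[column] = sample[column].apply(parse_list_cell)

print(sample.loc[0, "countries_origin"])  # ['United Kingdom', 'United States']
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.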