# Guide To Extracting Data from the Movies Dataset

This guide shows one way to *extract data from the IMDb movies dataset*. In this notebook I extract the data from the separate *csv* files, which were kept separate for practical reasons during the data scraping phase. I also apply some normalization and data cleaning, but only the bare minimum, so the notebook stays simple.

A separate notebook is dedicated to the full data cleaning process. You can check it out if you need more than the basics covered here.

## Introducing the Dataset and Its Structure

This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history.

Each year’s data is divided into three CSV files for flexibility and ease of use:
- ``imdb_movies_[year].csv``: Basic movie details.
- ``advanced_movies_details_[year].csv``: Comprehensive metadata and financial details.
- ``merged_movies_data_[year].csv``: A unified dataset combining both files.
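
As a rough illustration of the naming scheme above, the sketch below pulls the year out of a unified file name with a regular expression. The helper `year_from_filename`, the four-digit year assumption, and the sample file names are hypothetical and not part of the dataset's own tooling.

```python
import re

# Assumption: the year in the file name is always four digits,
# e.g. "merged_movies_data_1994.csv"
MERGED_FILE_PATTERN = re.compile(r"^merged_movies_data_(\d{4})\.csv$")


def year_from_filename(filename):
    """Return the year encoded in a merged file name, or None if it does not match."""
    match = MERGED_FILE_PATTERN.match(filename)
    return int(match.group(1)) if match else None


# Hypothetical file names, for illustration only
print(year_from_filename("merged_movies_data_1994.csv"))
print(year_from_filename("imdb_movies_1994.csv"))
```

A check like this can help confirm which year a file belongs to before it is merged with the others.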

## What We Are Doing Exactly

1. Merge the `merged_movies_data_[year].csv` files from all years into one CSV.
2. Rename the columns to a uniform naming style.
3. Extract the `id` from the movie link.
4. Remove duplicates.
5. Replace empty values with None.
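
To make steps 2 to 5 concrete, here is a minimal sketch on plain Python dictionaries instead of a DataFrame. It is not the notebook's implementation: the helpers `extract_imdb_id` and `clean_rows` are hypothetical, the generic lowercasing is a simplification of the explicit column mapping used later, and the sample rows are made up.

```python
import re


def extract_imdb_id(link):
    """Return the 'tt...' identifier from an IMDb title URL, or None if absent."""
    match = re.search(r"/title/(tt\d+)", str(link))
    return match.group(1) if match else None


def clean_rows(rows):
    """Apply steps 2-5 to a list of dict rows (a toy stand-in for a DataFrame)."""
    by_id = {}
    for row in rows:
        # Step 2: uniform snake_case column names (simplified rule)
        cleaned = {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        # Step 5: treat empty strings and empty list literals as missing
        cleaned = {key: (None if value in ("", "[]") else value) for key, value in cleaned.items()}
        # Step 3: derive the id from the link
        cleaned["id"] = extract_imdb_id(cleaned.get("movie_link"))
        # Step 4: a later row with the same id replaces the earlier one
        by_id[cleaned["id"]] = cleaned
    return list(by_id.values())


# Made-up sample rows: the same movie listed twice with different tracking suffixes
sample_rows = [
    {"Title": "Example Movie", "Movie Link": "https://www.imdb.com/title/tt1234567/?ref_=a", "genres": "['Drama']"},
    {"Title": "Example Movie", "Movie Link": "https://www.imdb.com/title/tt1234567/?ref_=b", "genres": "[]"},
]

for cleaned_row in clean_rows(sample_rows):
    print(cleaned_row)
```

Only one row survives for the repeated id, and its empty `genres` value comes out as `None`.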

<h3 style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; color: #2c3e50; margin-top: 40px;">Contact Me</h3>

<p style="font-size: 16px; color: #555;">
  If you notice anything lacking, spot an issue with this notebook, or have suggestions for improvements, feel free to reach out through any of the platforms below:
</p>

<table style="font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; border-collapse: collapse; width: 100%; max-width: 600px;">
  <tbody>
    <tr>
      <td style="padding: 10px;">
        <a href="https://www.linkedin.com/in/addalaraed/" target="_blank">
          <img src="https://img.shields.io/badge/LinkedIn-Raed_Addala-blue?style=for-the-badge&logo=linkedin" alt="LinkedIn Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="https://x.com/AddalaRaed" target="_blank">
          <img src="https://img.shields.io/badge/Twitter-@AddalaRaed-1DA1F2?style=for-the-badge&logo=twitter" alt="Twitter Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="mailto:addala.raed@gmail.com">
          <img src="https://img.shields.io/badge/Gmail-addala.raed@gmail.com-D14836?style=for-the-badge&logo=gmail" alt="Gmail Badge"/>
        </a>
      </td>
    </tr>
    <tr>
      <td style="padding: 10px;">
        <a href="https://github.com/RaedAddala" target="_blank">
          <img src="https://img.shields.io/badge/GitHub-RaedAddala-181717?style=for-the-badge&logo=github" alt="GitHub Badge"/>
        </a>
      </td>
    </tr>
  </tbody>
</table>


## Data Organization and Cleaning

In [None]:
import numpy as np
import pandas as pd
import os
import re
from dateutil import parser

In [None]:
root_dir = "/kaggle/input/imdb-movies-from-1960-to-2023/Data"

In [None]:
merged_data = pd.DataFrame()

In [None]:
for folder in os.listdir(root_dir):
    folder_path = os.path.join(root_dir, folder)
    if os.path.isdir(folder_path):
        for file in os.listdir(folder_path):
            if file.startswith("merged_movies_data_") and file.endswith(".csv"):
                file_path = os.path.join(folder_path, file)
                data = pd.read_csv(file_path)

                column_mapping = {
                    'Title': 'title',
                    'Year': 'year',
                    'Duration': 'duration',
                    'MPA': 'MPA',
                    'Rating': 'rating',
                    'Votes': 'votes',
                    'méta_score': 'meta_score',
                    'description': 'description',
                    'Movie Link': 'movie_link',
                    'link': 'movie_link',
                    'writers': 'writers',
                    'directors': 'directors',
                    'stars': 'stars',
                    'budget': 'budget',
                    'opening_weekend_Gross': 'opening_weekend_gross',
                    'grossWorldWWide': 'gross_worldwide',
                    'gross_US_Canada': 'gross_us_canada',
                    'Release_date': 'release_date',
                    'countries_origin': 'countries_origin',
                    'filming_locations': 'filming_locations',
                    'production_company': 'production_companies',
                    'awards_content': 'awards_content',
                    'genres': 'genres',
                    'Languages': 'languages'
                }
                data = data.rename(columns={k: v for k, v in column_mapping.items() if k in data.columns})

                # Clean the title column: remove leading numbers and trim whitespace
                data['title'] = data['title'].apply(lambda x: re.sub(r'^\d+\.\s*', '', str(x)).strip())

                def parse_release_date(x):
                    try:
                        return parser.parse(str(x), fuzzy=True).date()
                    except (parser.ParserError, TypeError, ValueError):
                        return pd.NaT

                data['release_date'] = data['release_date'].apply(parse_release_date)
                if 'year' in data.columns:
                    data['year'] = pd.to_numeric(data['year'])
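                    # Fall back to January 1st of the listed year when the release date is missing.
                    # .date() keeps the fallback the same type as the values from parse_release_date.
                    data['release_date'] = data.apply(
                        lambda row: pd.to_datetime(f"{int(row['year'])}-01-01").date() if pd.isna(row['release_date']) and not pd.isna(row['year']) else row['release_date'],
                        axis=1
                    )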
                    # Drop the year column
                    data = data.drop(columns=['year'])

                # Clean the movie_link column: strip the trailing "/?ref_=..." suffix
                data['movie_link'] = data['movie_link'].apply(lambda x: re.sub(r'/\?ref_=.*$', '', str(x)))

                # Extract the id field from the last path segment of movie_link
                data['id'] = data['movie_link'].apply(lambda x: x.split('/')[-1] if '/' in str(x) else None)

                # Check for duplicate IDs and keep only one row per ID
                duplicate_ids = data[data.duplicated(subset=['id'], keep=False)]
                if not duplicate_ids.empty:
                    print("Duplicate IDs found:")
                    print(duplicate_ids[['id', 'title']])
                data = data.drop_duplicates(subset=['id'], keep='last')

                # Replace empty arrays with null
                fields_to_check = [
                    'directors', 'writers', 'stars', 'genres', 'countries_origin',
                    'filming_locations', 'production_companies', 'languages'
                ]
                for field in fields_to_check:
                    if field in data.columns:
                        data[field] = data[field].apply(lambda x: None if pd.isna(x) or str(x).strip() == '[]' else x).astype("object")

                merged_data = pd.concat([merged_data, data], ignore_index=True)
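
The loop above removes duplicates within each file only. If the same `id` can appear in more than one yearly file (an assumption, not something verified against the dataset here), a final pass over the combined data would be needed. The sketch below shows one way to do that on two made-up frames; it also concatenates once from a list rather than growing a DataFrame inside the loop.

```python
import pandas as pd

# Two toy per-year frames standing in for the cleaned `data` of two files (made-up values)
frames = [
    pd.DataFrame({"id": ["tt0000001", "tt0000002"], "title": ["A", "B"]}),
    pd.DataFrame({"id": ["tt0000002", "tt0000003"], "title": ["B (re-listed)", "C"]}),
]

# Concatenate once, after all per-file frames have been collected
combined = pd.concat(frames, ignore_index=True)

# Keep the last occurrence of each id, mirroring keep='last' in the loop
deduplicated = combined.drop_duplicates(subset=["id"], keep="last").reset_index(drop=True)

print(deduplicated)
```

Note that `drop_duplicates` treats missing ids as equal to each other, so rows without an `id` would collapse into a single row unless they are handled separately.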

In [None]:
def detect_data_issues(df):
    issues = {}

    for col in df.columns:
        col_issues = {}
        
        # Count NaN values
        nans = df[col].isna().sum()
        if nans > 0:
            col_issues['NaN_count'] = int(nans)

        # Count +inf and -inf (only for numeric types)
        if pd.api.types.is_numeric_dtype(df[col]):
            inf_count = np.isinf(df[col]).sum()
            if inf_count > 0:
                col_issues['inf_count'] = int(inf_count)

        if col_issues:
            issues[col] = col_issues

    return issues

# Inspect the merged dataset, not `data`, which only holds the last file read in the loop
problems = detect_data_issues(merged_data)

if problems:
    print("⚠️ Issues detected in dataset:")
    for col, issue in problems.items():
        print(f" - {col}: {issue}")
else:
    print("✅ No NaN/inf issues found")

In [None]:
columns_order = ['id', 'title'] + [col for col in merged_data.columns if col not in ['id', 'title']]
merged_data = merged_data[columns_order]

In [None]:
print(f"Merged data shape: {merged_data.shape}")
display(merged_data.head())

In [None]:
merged_data.to_csv('/kaggle/working/final_dataset.csv', index=False)
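
One thing to keep in mind when reading the saved file back: values set to `None` are written as empty fields and come back as NaN. The sketch below demonstrates this round trip on a made-up two-row frame, using an in-memory buffer instead of the real output path.

```python
import io

import pandas as pd

# Made-up frame with one missing value, standing in for the merged dataset
frame = pd.DataFrame({"id": ["tt0000001", "tt0000002"], "genres": ["['Drama']", None]})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the CSV text back and check which values are missing
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["genres"].isna().tolist())
```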