## Step 1

This block imports the core libraries for data manipulation and processing. pandas handles the tabular data and the dataset merges in later steps. The `os` module is imported for checking that input files exist and for building file paths.

### What it Does
- Imports `pandas` as `pd` for DataFrame operations.
- Imports `os` for checking file presence and manipulating file paths.

**Pandas is essential for working with tabular (table-like) data in Python.**

In [27]:
import pandas as pd
import os
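The cells shown here import `os` but never call it, so the file check described above could look something like the following. This is a minimal sketch, assuming the three IMDb file names used in the loading step sit in the working directory; `REQUIRED_FILES` and `missing_files` are hypothetical names introduced for illustration.

```python
import os

# File names taken from the loading step; both names below are illustrative.
REQUIRED_FILES = ["name.basics.tsv", "title.basics.tsv", "title.ratings.tsv"]


def missing_files(paths, base_dir="."):
    """Return the entries of `paths` that are not regular files under `base_dir`."""
    return [p for p in paths if not os.path.isfile(os.path.join(base_dir, p))]


missing = missing_files(REQUIRED_FILES)
if missing:
    # Report the problem early instead of failing inside pd.read_csv later.
    print("Missing input files:", ", ".join(missing))
```

Running a check like this before loading gives a single readable message listing every absent file, rather than a `FileNotFoundError` for the first one only.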

## Step 2
IMDb distributes its datasets in TSV (Tab-Separated Values) format. These datasets contain information about movies, ratings, and names. Loading these files is the first step in preprocessing.

### What it Does
- Reads three `.tsv` files into pandas DataFrames.
- Takes the column names from each file's header row.

### Variables
- `name_basics`: Contains people’s names and their professions.
- `title_basics`: Contains movie title information, including genres.
- `title_ratings`: Contains movie ratings and vote counts.

### Parameters
- `sep="\t"` tells pandas that the columns are separated by a tab character (TSV format).
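- `low_memory=False` makes pandas parse each file in a single pass instead of in internal chunks, so column dtypes are inferred consistently and mixed-type `DtypeWarning`s are avoided. The whole file ends up in one DataFrame either way; the setting only raises memory use while parsing.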

In [33]:
name_basics = pd.read_csv("name.basics.tsv", sep="\t", low_memory=False)
title_basics = pd.read_csv("title.basics.tsv", sep="\t", low_memory=False)
title_ratings = pd.read_csv("title.ratings.tsv", sep="\t", low_memory=False)
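IMDb's files mark missing fields with the literal string `\N`, which `read_csv` keeps as text unless told otherwise. The sketch below shows one possible way to handle that with `na_values`, and to load only selected columns with `usecols`. It runs on a tiny made-up sample rather than the real files; `sample_tsv` and `sample` are hypothetical names, and the real cells above do not use these options.

```python
import io

import pandas as pd

# Tiny made-up sample in the same tab-separated layout; not real IMDb rows.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tgenres\n"
    "tt0000001\tmovie\tExample Film\tDrama,Comedy\n"
    "tt0000002\tshort\tExample Short\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",  # treat the \N marker as a missing value (NaN)
    usecols=["tconst", "titleType", "genres"],  # skip columns that are not needed
    low_memory=False,
)
print(sample)
```

With the marker converted to NaN, later calls such as `fillna` or `isna` see those fields as missing instead of as ordinary strings.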

## Step 3
We want only essential columns that contribute to the recommendation logic. Columns like birth/death year and original title are not relevant for movie suggestion.

### What it Does
- Keeps only the needed columns in each of the three DataFrames by selecting them by name.
- Keeps only the rows of `title_basics` whose `titleType` equals `"movie"`.

### Variables
- Modified `name_basics`, `title_basics`, and `title_ratings` DataFrames with reduced columns.

In [35]:
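# Keep the person ID, name, and the titles each person is known for
name_basics = name_basics[["nconst", "primaryName", "knownForTitles"]]
# Keep the title ID, average rating, and vote count
title_ratings = title_ratings[["tconst", "averageRating", "numVotes"]]
# Keep only movies, then only the title ID, display title, and genres
title_basics = title_basics[title_basics["titleType"] == "movie"]
title_basics = title_basics[["tconst", "primaryTitle", "genres"]]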
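As an alternative to the two-step filter above, the row filter and column selection can be combined in one `.loc` call. The sketch below uses made-up rows; `title_basics_demo` and `movies_demo` are hypothetical names. The trailing `.copy()` is an assumption about what is useful here: it makes the result an independent DataFrame, which sidesteps the `SettingWithCopyWarning` that some pandas versions emit when a subset is modified later.

```python
import pandas as pd

# Made-up rows mirroring the title.basics columns used in this step.
title_basics_demo = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2", "tt3"],
        "titleType": ["movie", "short", "movie"],
        "primaryTitle": ["A", "B", "C"],
        "originalTitle": ["A", "B", "C"],
        "genres": ["Drama", "Comedy", "Action,Drama"],
    }
)

# Boolean mask: True for rows whose titleType is exactly "movie".
is_movie = title_basics_demo["titleType"] == "movie"

# Select rows and columns in one step, then detach from the original frame.
movies_demo = title_basics_demo.loc[
    is_movie, ["tconst", "primaryTitle", "genres"]
].copy()
print(movies_demo)
```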

## Step 4
Many actors and creators are linked to multiple movies, stored as a comma-separated list in `knownForTitles`. We expand that list so each movie-person relationship gets its own row, which makes it possible to join people to movies.

### What it Does
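- Fills nulls in the `knownForTitles` column with an empty string.
- Splits each person's comma-separated titles into a list, to handle people known for multiple movies.
- Uses `explode` to create one row per movie-person link, which is necessary to join people and movies properly.
- Renames `knownForTitles` to `tconst` so the column matches the join key in the other DataFrames.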

### Variables
- `name_basics`: Expanded so each row has one movie-person link.

In [37]:
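# Replace missing values so .str.split works on every row
name_basics["knownForTitles"] = name_basics["knownForTitles"].fillna("")
# "tt1,tt2" -> ["tt1", "tt2"]
name_basics["knownForTitles"] = name_basics["knownForTitles"].str.split(",")
# One row per (person, title) pair
name_basics = name_basics.explode("knownForTitles")
# Match the key name used by title_basics and title_ratings
name_basics = name_basics.rename(columns={"knownForTitles": "tconst"})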
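To make the split-and-explode step concrete, here is a small sketch on made-up data; `people` and `links` are hypothetical names. It also drops the rows with an empty `tconst` that result from people with no known titles. That extra filter is not in the cell above; it is shown only as one possible cleanup, since an inner join on `tconst` would discard those rows anyway.

```python
import pandas as pd

# Made-up people; the third has no known titles.
people = pd.DataFrame(
    {
        "nconst": ["nm1", "nm2", "nm3"],
        "primaryName": ["Person A", "Person B", "Person C"],
        "knownForTitles": ["tt1,tt2", "tt2", None],
    }
)

# Missing -> "", then "tt1,tt2" -> ["tt1", "tt2"] and "" -> [""].
people["knownForTitles"] = people["knownForTitles"].fillna("").str.split(",")

# One row per list element, with the column renamed to the shared join key.
links = (
    people.explode("knownForTitles")
    .rename(columns={"knownForTitles": "tconst"})
    .reset_index(drop=True)
)

# Optional cleanup: rows with an empty tconst carry no movie link.
links = links[links["tconst"] != ""]
print(links)
```

The person known for two titles becomes two rows, the person known for one stays a single row, and the person with no titles disappears after the cleanup.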

## Step 5
Now that all datasets are aligned on `tconst`, we merge them to build a unified movie record that includes title, rating, and person info.

### What it Does
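- Performs inner joins on `tconst` (the default for `pd.merge`), keeping only movies that appear in all three datasets.
- Produces one row per movie-person link, so a movie's title and rating repeat for each person linked to it.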

### Variables
- `merged_imdb_df`: Fully joined DataFrame used for modeling.

In [39]:
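# Attach ratings to movies; movies without a rating are dropped
merged_imdb_df = pd.merge(title_basics, title_ratings, on="tconst", how="inner")
# Attach people; each movie row repeats once per linked person
merged_imdb_df = pd.merge(merged_imdb_df, name_basics, on="tconst", how="inner")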
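The effect of the two inner joins is easier to see on a small example. The sketch below uses made-up frames (`titles`, `ratings`, `links`, `rated`, and `merged` are hypothetical names) and adds the `validate` argument of `pd.merge` as an optional safety check on key uniqueness. The cell above does not use `validate`; it appears here only as an assumption about what might be useful.

```python
import pandas as pd

# Made-up inputs: tt3 has no rating, and tt9 is not a known movie.
titles = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)
ratings = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [7.5, 6.0]})
links = pd.DataFrame(
    {
        "tconst": ["tt1", "tt1", "tt2", "tt9"],
        "primaryName": ["P1", "P2", "P1", "P3"],
    }
)

# Each movie should have at most one ratings row; validate raises otherwise.
rated = pd.merge(titles, ratings, on="tconst", how="inner", validate="one_to_one")

# One movie can link to many people, so rows multiply on this join.
merged = pd.merge(rated, links, on="tconst", how="inner", validate="one_to_many")
print(merged)
```

Here `tt3` drops out because it has no rating, `tt9` drops out because it is not in `titles`, and `tt1` appears twice because two people are linked to it.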

## Step 6
We save the preprocessed dataset so it can be reused across training scripts without repeating the steps above.

### What it Does
- Saves the final merged DataFrame to a CSV file.

### Variables
- Output file: `merged_imdb_data.csv`

In [41]:
merged_imdb_df.to_csv("merged_imdb_data.csv", index=False)
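A quick round trip can confirm that the saved file reloads with the expected columns. The sketch below writes a made-up frame to a temporary directory so it does not touch the real output; `demo`, `reloaded`, and the file name `merged_imdb_demo.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the merged DataFrame.
demo = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2"],
        "primaryTitle": ["A", "B"],
        "averageRating": [7.5, 6.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "merged_imdb_demo.csv")  # hypothetical name
    # index=False keeps the row index out of the file.
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Because `index=False` is used when writing, the reloaded frame has the same columns as the original and no extra unnamed index column.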