<a href="https://colab.research.google.com/github/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/blob/main/preprocessing_notebook_3rd_edition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔍 Pre-processing Methodology

### 1. Data Acquisition
I began by acquiring five different datasets from multiple sources. Each dataset was loaded into the environment and assigned a clear name based on its content (e.g., `movie_franchises`, `tmdb_data`, `financial_data`, etc.).

### 2. Initial Structure Check
For each table:
- I previewed the data using `.head()` to assess structure, key columns, and formatting issues.
- I identified the candidate primary key: the movie title, later normalized into a `movie_id` column.

### 3. NA Summary (Column-Level)
I applied a custom function, `quick_column_summary()`, that computes:
- Column name
- Data type
- NA count
- % of missing values

This allowed me to:
- Identify strong vs. weak or irrelevant variables
- Understand data quality before any merge
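
As a minimal sketch of the per-column NA check described above, the toy frame below (hypothetical values, not taken from any of the five datasets) shows how the missing-value percentage is obtained with `isna().mean()`:

```python
import pandas as pd

# Toy frame with hypothetical values; None marks a missing entry
toy = pd.DataFrame({
    "name": ["Alien", "Heat", "Up", "Jaws"],
    "budget": [11.0, None, 175.0, None],
    "gross": [104.0, 187.0, None, 470.0],
})

# isna() yields booleans; their mean is the share of missing values per column
na_pct = toy.isna().mean() * 100
print(na_pct)
```

Columns with a high percentage are candidates for removal before merging; columns with a low percentage can usually be kept as they are.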

### 4. Table Acquisition Summary
Each dataset is summarized below by the variables it contributes to the two modeling goals (IMDB score and ROI):

| Table | Key Variables |
|-------|----------------|
| `movie_franchises` | `name`, `rating`, `genre`, `year`, `released`, `imdb_score`, `votes`, `director`, `writer`, `star`, `country`, `budget`, `gross`, `company`, `runtime` |
| `tmdb_data` | `vote_average`, `vote_count`, `runtime`, `popularity` |
| `meta_data` | `cast`, `crew`, `keywords`, `overview`, `tagline` |
| `data2` | `Lifetime Gross` |
| `financial_data` | `profit`, `worldwide_gross`, genre dummies |

I retained only columns with acceptable completeness or analytical value.


In [None]:
# Define a compact column-summary function for checking NA%. It helps with data processing throughout the notebook.
import pandas as pd
from IPython.display import display

def quick_column_summary(df, table_name):
    """Show each column's dtype, NA count, and percentage of missing values."""
    print(f"\n📋 Column Summary for `{table_name}` ({len(df)} rows)\n")
    summary = pd.DataFrame({
        'Column': df.columns,
        'Data Type': [df[col].dtype for col in df.columns],
        'NA Count': [df[col].isna().sum() for col in df.columns],
        '% Missing': [df[col].isna().mean() * 100 for col in df.columns]
    })
    display(summary)

# 1st Dataset: Movie Franchises



In [None]:
# 1. Movie Data Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv -O movie.csv

# Load the CSV file
import pandas as pd
movie_franchises = pd.read_csv("movie.csv")

### First check

In [None]:
# Display the first few rows
movie_franchises.head()

In [None]:
# Check for data types and NA's
quick_column_summary(movie_franchises, 'movie_franchises')

In [None]:
# Rename the IMDB score column
movie_franchises = movie_franchises.rename(columns={"score": "imdb_score"})  # rename "score" to the desired target name, "imdb_score"

In [None]:
# Omit observations with NA's in target variables
movie_franchises = movie_franchises[
    movie_franchises['imdb_score'].notna() &
    movie_franchises['budget'].notna() &
    movie_franchises['gross'].notna()
].copy()

### 📥 Table Acquisition Summary: `movie_franchises`

#### 🎯 Relevant Variables

| Column         | Description                     | Relevance                          |
|----------------|----------------------------------|-------------------------------------|
| `name`         | Movie name (key)                | ✅ Unique ID across datasets         |
| `imdb_score`   | IMDB rating                     | ✅ Target variable #1                |
| `budget`       | Budget in dollars               | 📌 Required for ROI (target #2)     |
| `gross`        | Revenue in dollars              | 📌 Required for ROI (target #2)     |
| `votes`        | Number of user ratings          | 🧪 May influence IMDB score         |
| `genre`, `rating`, `year`, `released` | Movie metadata | 📊 Potential features |
| `director`, `writer`, `star`, `company` | People / studio involved | 📊 Potential features |
| `runtime`      | Duration in minutes             | 📊 Feature (e.g., audience fatigue) |
| `country`      | Country of production           | 📊 Feature for cultural reception   |
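
The table marks `budget` and `gross` as the inputs for ROI. The sketch below shows one common definition, ROI = (gross - budget) / budget. This formula is an assumption for illustration: the notebook does not state which definition it uses, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical films; amounts in millions of dollars
films = pd.DataFrame({
    "name": ["film a", "film b"],
    "budget": [100.0, 50.0],
    "gross": [250.0, 40.0],
})

# Assumed definition: net return relative to the amount invested
films["roi"] = (films["gross"] - films["budget"]) / films["budget"]
print(films)
```

Under this definition a positive value means the film earned back more than its budget, and a negative value means it did not.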

# 2nd Dataset: Additional Movie Franchises

In [None]:
# 2. Global Movie Franchise Revenue and Budget Data

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv -O MovieFranchises.csv
import pandas as pd
data2 = pd.read_csv("MovieFranchises.csv")  # Named data2 to avoid confusion with the 1st dataset (movie_franchises)

### First check

In [None]:
# Display the first few rows
data2.head()

In [None]:
# Check for data types and NA's
quick_column_summary(data2, 'data2')

In [None]:
# Keep only the useful parts of data2
data2 = data2[['MovieID', 'Title', 'Lifetime Gross']].copy()

### 📥 Table Acquisition Summary: `data2`

This table has many missing values, but it still includes financial data that adds information about the ROI target variable.

#### 🔁 Remaining Variables

| `movie_franchises` | `data2`             | Action |
|--------------------|---------------------|--------|
| `name`             | `Title`             | Normalize to `movie_id` for matching |
| `budget`           | `Budget`            | Compare and retain best version |
| `gross`            | `Lifetime Gross`    | Compare with `gross` |
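
One possible reading of "compare and retain best version" is to prefer the base table's value and fall back to the second table only where the base value is missing. The sketch below illustrates that rule with `Series.combine_first` on hypothetical rows. The rule itself and the `gross_best` column name are assumptions, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical rows already keyed by a normalized movie_id
left = pd.DataFrame({"movie_id": ["alien", "heat"], "gross": [104.0, None]})
right = pd.DataFrame({"movie_id": ["alien", "heat"], "Lifetime Gross": [105.0, 187.0]})

both = left.merge(right, on="movie_id", how="left")

# Keep `gross` where present, otherwise use `Lifetime Gross`
both["gross_best"] = both["gross"].combine_first(both["Lifetime Gross"])
print(both)
```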


# 3rd Dataset: TMDB data

In [None]:
# If loading the 3rd dataset fails with an error containing "LocalFileSystem is not supported", run:
# pip install -U datasets

In [None]:
# 3. TMDB 5000 Movies Dataset

!pip install datasets

from datasets import load_dataset
import pandas as pd

# Load the TMDB dataset from Hugging Face
dataset = load_dataset("AiresPucrs/tmdb-5000-movies", split="train")
tmdb_data = pd.DataFrame(dataset)

# Save the DataFrame to a CSV file
tmdb_data.to_csv("tmdb_movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()

### First Check

In [None]:
# Display the first few rows
tmdb_data.head()

In [None]:
# Check for data types and NA's
quick_column_summary(tmdb_data, 'tmdb_data')

### 📥 Table Acquisition: `tmdb_data`

This is the richest and most structured table so far. It includes both structured and nested (JSON-like) data, contributing heavily to both prediction targets.

---

#### 🎯 Relevant Variables

| Column | Description | Relevance |
|--------|-------------|-----------|
| `title` | Movie name | ✅ Used to create `movie_id` |
| `vote_average` | Average audience rating | ✅ Proxy for IMDB score |
| `vote_count` | Number of votes | 🧪 May influence or complement score |
| `budget` | Production cost | 📌 Required for ROI |
| `revenue` | Box office revenue | 📌 Required for ROI |
| `runtime` | Duration in minutes | 📊 Feature for pacing / cost |
| `popularity` | TMDB popularity score | 📊 Social visibility |
| `release_date` | Date released | 📊 Use for time features (month, year) |
| `genres` | List of genres (JSON) | 🧠 To parse later for genre-based analysis |
| `keywords` | Thematic keywords (JSON) | 🧠 Useful after parsing |
| `overview`, `tagline` | Textual summary & tagline | 🧠 Potential for NLP sentiment modeling |
| `original_language` | Language code (e.g., 'en') | 📊 Cultural/demographic indicator |
| `production_companies` | Companies involved (JSON) | 🧠 Feature engineering (studio power) |
| `production_countries` | Countries involved (JSON) | 📊 International impact |
| `spoken_languages` | Languages spoken (JSON) | 📊 Audience reach |
| `cast`, `crew` | Cast and crew (JSON) | 🧠 Feature-rich, parse later |
| `status` | e.g., Released, Post-production, etc. | 🧪 May correlate with box office |

---

#### 🧠 Summary

- This table contributes to both `imdb_score_features` and `roi_features`
- Contains multiple nested fields that will be parsed during feature engineering
- Will be saved in SQLite as `raw_tmdb_data`
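
Parsing the nested fields could look roughly like the sketch below. It assumes each nested column holds a JSON-encoded list of objects with a `name` key, which should be checked against the actual `tmdb_data` columns first. The helper `extract_names` and the sample rows are hypothetical.

```python
import json
import pandas as pd

# Hypothetical rows mimicking a JSON-encoded `genres` column
sample = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]', "[]"],
})

def extract_names(raw):
    """Return the 'name' of each entry in a JSON-encoded list; empty list on bad input."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):  # missing value or malformed JSON
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]

sample["genre_names"] = sample["genres"].apply(extract_names)
print(sample[["title", "genre_names"]])
```

Returning an empty list for missing or malformed values keeps the column type uniform, which simplifies later one-hot encoding.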


# 4th Dataset: Meta-Analysis Data

In [None]:
# 4. Complete Movie Metadata Dataset

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
file_path = '/content/drive/My Drive/Projects/Blockbuster Movies/movies.csv'  # Adjust path as needed
meta_data = pd.read_csv(file_path)

# Save the DataFrame to a CSV file
meta_data.to_csv("movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()

### First Check

In [None]:
# Display the first few rows
meta_data.head()

In [None]:
# Check for data types and NA's
quick_column_summary(meta_data, 'meta_data')

### 📥 Table Acquisition: `meta_data`

This dataset appears to be an updated or complementary version of `tmdb_data`, containing recent and upcoming titles with similar structure.

---

#### 🎯 Relevant Variables

| Column | Description | Relevance |
|--------|-------------|-----------|
| `title` | Movie title | ✅ Used to create `movie_id` |
| `vote_average` / `vote_count` | User rating and count | ✅ Score-related |
| `budget`, `revenue` | Financial data | 📌 Used for ROI |
| `runtime`, `release_date` | Timing & length | 📊 Influences score & ROI |
| `popularity` | TMDB popularity score | 📊 Social reach |
| `genres`, `keywords`, `overview`, `tagline` | Text / tags | 🧠 Feature-rich, parse later |
| `original_language` | Language code | 📊 Cultural signal |
| `status` | Release status | 🧪 Could correlate with results |
| `production_companies` | Studios involved | 🧠 To group studio trends |
| `credits` | Raw cast and crew | 🧠 To parse later for influence modeling |

---

#### 🧠 Summary

- High overlap with `tmdb_data` (Table 3) — strong candidate for integration
- Contributes to both `imdb_score_features` and `roi_features`
- Will require deduplication and possible enrichment during post-acquisition phase
- Will be saved in SQLite as `meta_data`
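
The deduplication mentioned above can be sketched as follows: when a title appears more than once, keep the row with the fewest missing values. The rows are hypothetical, and `na_count` is a temporary helper column.

```python
import pandas as pd

# Hypothetical rows; "alien" appears twice with different completeness
rows = pd.DataFrame({
    "movie_id": ["alien", "alien", "heat"],
    "budget": [11.0, 11.0, 60.0],
    "tagline": [None, "tag a", "tag b"],
})

# Count NAs per row, sort so the most complete rows come first,
# then keep the first occurrence of each movie_id
rows["na_count"] = rows.isna().sum(axis=1)
deduped = (
    rows.sort_values("na_count")
        .drop_duplicates(subset="movie_id", keep="first")
        .drop(columns="na_count")
)
print(deduped)
```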


# 5th Dataset: Revenues Data

In [None]:
# 5. Movie Revenue Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv -O final_dataset.csv
import pandas as pd
financial_data = pd.read_csv("final_dataset.csv")

### First Check

In [None]:
# Display the first few rows
financial_data.head()

In [None]:
# Check for data types and NA's
quick_column_summary(financial_data, 'financial_data')

### 📥 Table Acquisition: `financial_data`

This table is highly focused on financial metrics and genre distribution. It provides engineered columns for ROI, profit, and genre flags, making it very valuable for prediction.

---

#### 🎯 Relevant Variables

| Column | Description | Relevance |
|--------|-------------|-----------|
| `movie` | Movie name | ✅ Used to create `movie_id` |
| `production_budget`, `domestic_gross`, `foreign_gross`, `worldwide_gross` | Raw inputs for ROI | ✅ |
| `profit`, `roi`, `profit_margin`, `pct_foreign` | Pre-calculated finance metrics | ✅ |
| `vote_average`, `vote_count`, `popularity` | Score-related audience signals | ✅ |
| `original_language`, `release_date`, `month` | Contextual/cultural features | ✅ |
| `Action`, `Drama`, etc. | Binary genre flags | ✅ Helps both score and ROI models |

---

#### 🧠 Summary

- Strongest financial data table (calculated ROI & profit)
- Includes one-hot encoded genre info (clean and ready)
- Will contribute to both `imdb_score_features` and `roi_features`
- Saved in SQLite as `raw_financial_data`


---

# Full Preprocessing & Integration Pipeline

## 🔗 Data Consolidation & Final Dataset Preparation

### 1. Normalizing Identifiers
To prepare for merging:
- I created a **primary key** called `movie_id` in each dataset using the cleaned movie title (`name`, `title`, or `movie`) columns.
- Each name was normalized (lowercase, stripped whitespace) to ensure consistent joining across tables.
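
A small sketch of why normalization matters for joining: the hypothetical titles below differ only in case and surrounding whitespace, so they match only after being normalized.

```python
import pandas as pd

# Hypothetical titles with inconsistent case and whitespace
a = pd.DataFrame({"name": ["  The Matrix", "Toy Story "]})
b = pd.DataFrame({"title": ["the matrix", "TOY STORY"]})

# Same normalization in both tables: strip whitespace, then lowercase
a["movie_id"] = a["name"].str.strip().str.lower()
b["movie_id"] = b["title"].str.strip().str.lower()

raw_matches = a["name"].isin(b["title"]).sum()
key_matches = a["movie_id"].isin(b["movie_id"]).sum()
print(raw_matches, key_matches)
```

A title-only key cannot distinguish different films that share a title (remakes, for example), so duplicates need to be handled after merging.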

### 2. Merging Strategy
I used a **left join** strategy starting from `movie_franchises` as the base table.  
Why left join?
- It preserves only relevant and valid movies (those with complete modeling targets).
- It keeps every row of `movie_franchises` and only adds columns to it. An inner join would have dropped every movie that is not present in all datasets, leaving a much smaller sample.
- It avoids the excess NAs that an outer join introduces when movie entries do not match across datasets.
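
The trade-off between join types can be illustrated on toy tables (hypothetical values). `indicator=True` adds a `_merge` column showing which base rows found a match.

```python
import pandas as pd

base = pd.DataFrame({"movie_id": ["alien", "heat", "up"], "imdb_score": [8.5, 8.3, 8.3]})
extra = pd.DataFrame({"movie_id": ["alien", "up", "jaws"], "roi": [8.4, 3.2, 66.0]})

left = base.merge(extra, on="movie_id", how="left", indicator=True)
inner = base.merge(extra, on="movie_id", how="inner")
outer = base.merge(extra, on="movie_id", how="outer")

# left keeps all base rows, inner keeps only matches, outer also adds unmatched extras
print(len(left), len(inner), len(outer))
print(left["_merge"].value_counts())
```

The left join keeps all three base rows and leaves `roi` empty for the unmatched one, the inner join drops that row, and the outer join adds a row that has no modeling target.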

### 3. Post-Merge Cleaning
After joining:
- I dropped the duplicate columns created by the merges (e.g., `vote_average_x`, `runtime_y`), keeping the version with fewer NAs, and renamed the kept ones clearly.
- Used logic to **fill in genre dummy columns** from the `genre` column when genre indicators were missing.
- All genre columns were validated to contain proper 0/1 indicators for classification tasks.
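
The genre fill and the 0/1 validation can be sketched in vectorized form as below. The sketch assumes `genre` holds a single genre label per movie and uses a reduced, hypothetical set of genre columns.

```python
import pandas as pd

genre_columns = ["Action", "Comedy", "Drama"]

# Hypothetical rows: the first already has dummies, the others are missing them
df = pd.DataFrame({
    "genre": ["Action", "Drama", "Comedy"],
    "Action": [1.0, None, None],
    "Comedy": [0.0, None, None],
    "Drama": [0.0, None, None],
})

# Rows where every genre dummy is missing
missing = df[genre_columns].isna().all(axis=1)

# Fill those rows from the `genre` label: 1.0 if it matches the column, else 0.0
for col in genre_columns:
    df.loc[missing, col] = (df.loc[missing, "genre"] == col).astype(float)

# Validation: every dummy must now be exactly 0 or 1
all_binary = df[genre_columns].isin([0, 1]).all().all()
print(df)
print(all_binary)
```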

### 4. Organizing the Dataset
I reordered columns based on:
- **Target modeling relevance** (e.g., `imdb_score`)
- **Predictive features** (e.g., votes, budget, genre dummies)
- **Meta content** (overview, keywords, cast)

### 5. Final Export
I saved the final processed dataset to a `.csv` file and uploaded it to my GitHub repository.  
This final version will be loaded in the next notebook for:
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Building



## 1. Normalization

In [None]:
def normalize_title(title):
    return title.str.strip().str.lower()

movie_franchises['movie_id'] = normalize_title(movie_franchises['name'])
data2['movie_id'] = normalize_title(data2['Title'])
tmdb_data['movie_id'] = normalize_title(tmdb_data['title'])
meta_data['movie_id'] = normalize_title(meta_data['title'])
financial_data['movie_id'] = normalize_title(financial_data['movie'])

## 2. Merging Data (by left-join)

In [None]:
enriched = movie_franchises.merge(
    financial_data[['movie_id', 'roi', 'production_budget', 'worldwide_gross', 'profit','Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie',
       'Thriller', 'War', 'Western']],
    on='movie_id',
    how='left'
)

enriched = enriched.merge(
    tmdb_data[['movie_id', 'vote_average', 'vote_count', 'popularity', 'runtime','homepage', 'keywords','overview','tagline','cast', 'crew',]],
    on='movie_id',
    how='left'
)

enriched = enriched.merge(
    meta_data[['movie_id', 'vote_average', 'vote_count', 'popularity', 'runtime', 'keywords','overview','tagline','recommendations']],
    on='movie_id',
    how='left'
)

In [None]:
enriched['na_count'] = enriched.isna().sum(axis=1)
enriched = enriched.sort_values(by='na_count').drop_duplicates(subset='movie_id', keep='first')
enriched = enriched.drop(columns='na_count')
enriched

## 3. Post-Merging Cleaning

In [None]:
# Check for data types and NA's
quick_column_summary(enriched, 'enriched')

Observations from the summary:

1. Duplicate columns: Some columns are duplicates of others (for example, `tagline_x` and `tagline_y`). This happens when both merged tables contain a column with the same name.
2. High NA percentage: Some columns have a high NA count. Most of them are duplicate columns and genre classification columns (`Action`, ..., `Western`).

Action: I'll drop the duplicates, specifically the version with the higher NA percentage. I'll keep the genre classification columns, because their missing values can be filled from the `genre` column as long as it has 0% NA (and it does).
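
Choosing which duplicate to keep can also be done programmatically. The sketch below compares the NA% of each `_x`/`_y` pair and records the more complete column. The data is hypothetical, and the `keep` dictionary is an illustrative helper rather than part of the notebook.

```python
import pandas as pd

# Hypothetical merged frame with two suffixed column pairs
merged = pd.DataFrame({
    "tagline_x": ["a", None, None, None],
    "tagline_y": ["a", "b", None, "d"],
    "runtime_x": [100, 90, 120, 95],
    "runtime_y": [100.0, None, 120.0, None],
})

na_pct = merged.isna().mean() * 100

# For each base name, keep whichever suffixed column has the lower NA%
keep = {}
for col in merged.columns:
    if col.endswith("_x"):
        base = col[:-2]
        twin = base + "_y"
        if twin in merged.columns:
            keep[base] = col if na_pct[col] <= na_pct[twin] else twin

print(keep)
```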

In [None]:
# 1. Drop the duplicate columns, keeping from each pair the one with the lower NA%
enriched_better = enriched[['name', 'rating', 'genre', 'year', 'released', 'imdb_score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime_x', 'movie_id','production_budget', 'worldwide_gross',
       'profit', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie',
       'Thriller', 'War', 'Western','homepage','cast', 'crew', 'vote_average_y', 'vote_count_y',
       'popularity_y', 'runtime', 'keywords_y', 'overview_y', 'tagline_y',
       'recommendations']].copy()  # copy so later column assignments do not act on a view of `enriched`

In [None]:
# 2. Edit genre columns based on the genre column values

# Define the full set of genre dummy columns
genre_columns = [
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
    'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western'
]

# Ensure all genre columns exist and are numeric
for col in genre_columns:
    if col not in enriched_better.columns:
        enriched_better[col] = float("nan")  # Create missing columns (NaN, unlike pd.NA, casts cleanly to float)
    enriched_better[col] = enriched_better[col].astype("float")  # Force numeric

# Mask for rows where all genre columns are missing
genre_na_mask = enriched_better[genre_columns].isna().all(axis=1)

# Update those rows based on the 'genre' column
for idx in enriched_better[genre_na_mask].index:
    genre_str = enriched_better.loc[idx, 'genre']
    genre_list = [g.strip() for g in str(genre_str).split('|')] if pd.notna(genre_str) else []

    for col in genre_columns:
        enriched_better.at[idx, col] = 1.0 if col in genre_list else 0.0

## 4. Organizing the Dataset

In [None]:
# Rearrange the columns based on order of importance and NA%
enriched_best = enriched_better[['movie_id','rating', 'genre', 'year', 'released', 'imdb_score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime_x','vote_average_y', 'vote_count_y', 'popularity_y',
       'keywords_y','Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western','production_budget', 'worldwide_gross', 'profit' ,'overview_y', 'tagline_y', 'recommendations']]

In [None]:
# Check for missing columns in the reordering list
col_list = ['movie_id','rating', 'genre', 'year', 'released', 'imdb_score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime_x','vote_average_y', 'vote_count_y', 'popularity_y',
       'keywords_y','Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western','production_budget', 'worldwide_gross', 'profit' ,'overview_y', 'tagline_y', 'recommendations']

missing = [col for col in col_list if col not in enriched_better.columns]  # compare against the source frame so the check is meaningful
print("❗ Missing columns:", missing)

In [None]:
# Check for data types and NA's
quick_column_summary(enriched_best, 'enriched_best')

In [None]:
# Rename the columns that were duplicates before
# Define a dictionary of old column names to new ones
rename_map = {
    'runtime_x': 'runtime',
    'vote_average_y': 'vote_average',
    'vote_count_y': 'vote_count',
    'popularity_y': 'popularity',
    'tagline_y': 'tagline',
    'keywords_y': 'keywords',
    'overview_y': 'overview'
}

# Apply renaming in one line
enriched_best = enriched_best.rename(columns=rename_map)

## 5. Final Export

In [None]:
# Save to local file (which will also show up in Colab's Files tab)
enriched_best.to_csv("final_movie_data.csv", index=False)

from google.colab import files
files.download('final_movie_data.csv')  # or any other filename you want to download

---