# 🔍 Pre-processing Methodology

### 1. Data Acquisition
I began by acquiring five different datasets from multiple sources. Each dataset was loaded into the environment and assigned a clear name based on its content (e.g., `movie_franchises`, `tmdb_data`, `financial_data`, etc.).

### 2. Initial Structure Check
For each table:
- I previewed the data using `.head()` to assess structure, key columns, and formatting issues.
- I identified a candidate primary key: the movie title, which I later normalized into a `movie_id` column.
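
A minimal sketch of how a title column could be checked as a candidate key. The helper `key_report` and the toy frame are hypothetical and are not part of the original notebook; the idea is only to count missing and repeated titles before relying on them for joins.

```python
import pandas as pd

def key_report(df: pd.DataFrame, col: str) -> dict:
    """Summarize how well `col` would serve as a unique key."""
    values = df[col]
    return {
        "rows": len(df),
        "missing": int(values.isna().sum()),
        "unique": int(values.nunique(dropna=True)),
        # keep=False flags every row that shares its value with another row
        "duplicated_rows": int(values.duplicated(keep=False).sum()),
    }

# Toy frame standing in for one of the loaded tables
toy = pd.DataFrame({"name": ["Alien", "Heat", "Heat", None]})
report = key_report(toy, "name")
print(report)
```

A non-zero `duplicated_rows` count suggests the title alone may not identify a movie uniquely.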

### 3. NA Summary (Column-Level)
I applied a custom function, `quick_column_summary()`, that computes:
- Column name
- Data type
- NA count
- % of missing values

This allowed me to:
- Identify strong vs. weak or irrelevant variables
- Understand data quality before any merge
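
As an illustration of how such a summary can separate strong from weak variables, here is a small sketch. The helper name `columns_above_missing` and the 50% threshold are assumptions chosen for the example, not values used in the notebook.

```python
import pandas as pd

def columns_above_missing(df, threshold_pct=50.0):
    """Return columns whose share of missing values exceeds `threshold_pct`."""
    pct_missing = df.isna().mean() * 100
    return pct_missing[pct_missing > threshold_pct].sort_values(ascending=False)

# Toy data: `homepage` is mostly empty, `runtime` has a single gap
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "homepage": [None, None, None, "http://example.com"],
    "runtime": [90.0, None, 100.0, 120.0],
})
weak = columns_above_missing(toy, threshold_pct=50.0)
print(weak)
```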

### 4. Table Acquisition Summary
Each dataset was summarized by the variables it contributes to the two modeling goals (IMDB score and ROI):

| Table | Key Variables |
|-------|----------------|
| `movie_franchises` | `name`, `rating`, `genre`, `year`, `released`, `imdb_score`, `votes`, `director`, `writer`, `star`, `country`, `budget`, `gross`, `company`, `runtime` |
| `tmdb_data` | `vote_average`, `vote_count`, `runtime`, `popularity` |
| `meta_data` | `cast`, `crew`, `keywords`, `overview`, `tagline` |
| `data2` | `Lifetime Gross` |
| `financial_data` | `profit`, `worldwide_gross`, genre dummies |

We retained only columns with acceptable completeness or analytical value.


In [None]:
# Compact column-summary helper for checking NA%; it will help with data processing along the way
import pandas as pd
from IPython.display import display

def quick_column_summary(df, table_name):
    """Display each column's dtype, NA count, and percentage of missing values."""
    print(f"\n📋 Column Summary for `{table_name}`\n")
    summary = pd.DataFrame({
        'Column': df.columns,
        'Data Type': [df[col].dtype for col in df.columns],
        'NA Count': [df[col].isna().sum() for col in df.columns],
        '% Missing': [df[col].isna().mean() * 100 for col in df.columns]
    })
    display(summary)

# 1st Dataset: Movie Franchises



In [None]:
# 1. Movie Data Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv -O movie.csv

# Load the CSV file
import pandas as pd
movie_franchises = pd.read_csv("movie.csv")

--2025-05-21 10:39:30--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/movie.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1294548 (1.2M) [text/plain]
Saving to: ‘movie.csv’


2025-05-21 10:39:31 (21.0 MB/s) - ‘movie.csv’ saved [1294548/1294548]



### First check

In [None]:
# Display the first few rows
movie_franchises.head()

Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0


In [None]:
# Check for data types and NA's
quick_column_summary(movie_franchises, 'movie_franchises')


📋 Column Summary for `movie_franchises`



Unnamed: 0,Column,Data Type,NA Count,% Missing
0,name,object,0,0.0
1,rating,object,77,1.004173
2,genre,object,0,0.0
3,year,int64,0,0.0
4,released,object,2,0.026082
5,score,float64,3,0.039124
6,votes,float64,3,0.039124
7,director,object,0,0.0
8,writer,object,3,0.039124
9,star,object,1,0.013041


In [None]:
# Rename the IMDB score column
movie_franchises = movie_franchises.rename(columns={"score": "imdb_score"})  # replace "score" with our desirable target name - "imdb_score"

In [None]:
# Omit observations with NA's in target variables
movie_franchises = movie_franchises[
    movie_franchises['imdb_score'].notna() &
    movie_franchises['budget'].notna() &
    movie_franchises['gross'].notna()
].copy()
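
The same filter can also be written with `DataFrame.dropna(subset=...)`, which makes it easy to report how many rows were removed. This is only a sketch on toy data; the values below are placeholders, not rows from `movie.csv`.

```python
import pandas as pd

toy = pd.DataFrame({
    "imdb_score": [8.4, None, 7.7],
    "budget": [19_000_000.0, 4_500_000.0, None],
    "gross": [46_998_772.0, 58_853_106.0, 83_453_539.0],
})
targets = ["imdb_score", "budget", "gross"]

rows_before = len(toy)
# Drop any row with a missing value in at least one target column
cleaned = toy.dropna(subset=targets).copy()
rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```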

### 📥 Table Acquisition Summary: `movie_franchises`

#### 🎯 Relevant Variables

| Column         | Description                     | Relevance                          |
|----------------|----------------------------------|-------------------------------------|
| `name`         | Movie name (key)                | ✅ Unique ID across datasets         |
| `imdb_score`   | IMDB rating                     | ✅ Target variable #1                |
| `budget`       | Budget in dollars               | 📌 Required for ROI (target #2)     |
| `gross`        | Revenue in dollars              | 📌 Required for ROI (target #2)     |
| `votes`        | Number of user ratings          | 🧪 May influence IMDB score         |
| `genre`, `rating`, `year`, `released` | Movie metadata | 📊 Potential features |
| `director`, `writer`, `star`, `company` | People / studio involved | 📊 Potential features |
| `runtime`      | Duration in minutes             | 📊 Feature (e.g., audience fatigue) |
| `country`      | Country of production           | 📊 Feature for cultural reception   |

# 2nd Dataset: Additional Movie Franchises

In [None]:
# 2. Global Movie Franchise Revenue and Budget Data

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv -O MovieFranchises.csv
import pandas as pd
data2 = pd.read_csv("MovieFranchises.csv")  # Stored as `data2` because the file name is close to the 1st dataset's name

--2025-05-21 10:39:46--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/MovieFranchises.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26322 (26K) [text/plain]
Saving to: ‘MovieFranchises.csv’


2025-05-21 10:39:46 (16.5 MB/s) - ‘MovieFranchises.csv’ saved [26322/26322]



### First check

In [None]:
# Display the first few rows
data2.head()

Unnamed: 0,index,MovieID,Title,Lifetime Gross,Year,Studio,Rating,Runtime,Budget,ReleaseDate,VoteAvg,VoteCount,FranchiseID
0,0,1001,Star Wars: Episode IV - A New Hope,775398007,1977,Lucasfilm,PG,121.0,11000000.0,05-25-77,4.09,96233.0,101.0
1,1,1002,Star Wars: Episode V - The Empire Strikes Back,538375067,1980,Lucasfilm,PG,124.0,18000000.0,06-20-80,4.12,79231.0,101.0
2,2,1003,Star Wars: Episode VI - Return of the Jedi,475106177,1983,Lucasfilm,PG,135.0,32500000.0,05-25-83,3.98,76082.0,101.0
3,3,1004,Jurassic Park,1109802321,1993,Universal Pictures,PG-13,127.0,63000000.0,06-11-93,3.69,82700.0,102.0
4,4,1005,The Lost World: Jurassic Park,618638999,1997,Universal Pictures,PG-13,129.0,73000000.0,05-23-97,3.01,19721.0,102.0


In [None]:
# Check for data types and NA's
quick_column_summary(data2, 'data2')


📋 Column Summary for `data2`



Unnamed: 0,Column,Data Type,NA Count,% Missing
0,MovieID,object,0,0.0
1,Title,object,0,0.0
2,Lifetime Gross,object,0,0.0
3,movie_id,object,0,0.0


In [None]:
# Keep only the useful parts of data2
data2 = data2[['MovieID', 'Title', 'Lifetime Gross']].copy()

### 📥 Table Acquisition Summary: `data2`

This table has many missing values. Even so, it includes financial data that adds information about our target variable, ROI.

#### 🔁 Remaining Variables

| `movie_franchises` | `data2` | Action |
|--------------------|---------------------|--------|
| `name`             | `Title`             | Normalize to `movie_id` for matching |
| `budget`           | `Budget`            | Compare and retain best version |
| `gross`            | `Lifetime Gross`    | Compare with `gross` |


# 3rd Dataset: TMDB data

In [None]:
# If loading the 3rd dataset fails with an error containing "LocalFileSystem is not supported", run:
# pip install -U datasets


In [None]:
# 3. TMDB 5000 Movies Dataset

!pip install datasets

from datasets import load_dataset
import pandas as pd

# Load the TMDB dataset from Hugging Face
dataset = load_dataset("AiresPucrs/tmdb-5000-movies", split="train")
tmdb_data = pd.DataFrame(dataset)

# Save the DataFrame to a CSV file
tmdb_data.to_csv("tmdb_movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()




['.config',
 'MovieFranchises.csv',
 'movie.csv',
 'final_dataset.csv',
 'tmdb_movies.csv',
 'sample_data']

### First Check

In [None]:
# Display the first few rows
tmdb_data.head()

Unnamed: 0,id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,movie_id
0,5,4000000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 35, ""name...",,"[{""id"": 612, ""name"": ""hotel""}, {""id"": 613, ""na...",en,Four Rooms,It's Ted the Bellhop's first night on the job....,22.87623,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,6.5,530,"[{""cast_id"": 42, ""character"": ""Ted the Bellhop...","[{""credit_id"": ""52fe420dc3a36847f800012d"", ""de...",four rooms
1,11,11000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",http://www.starwars.com/films/star-wars-episod...,"[{""id"": 803, ""name"": ""android""}, {""id"": 4270, ...",en,Star Wars,Princess Leia is captured and held hostage by ...,126.393695,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...",...,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"A long time ago in a galaxy far, far away...",Star Wars,8.1,6624,"[{""cast_id"": 3, ""character"": ""Luke Skywalker"",...","[{""credit_id"": ""52fe420dc3a36847f8000437"", ""de...",star wars
2,12,94000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",http://movies.disney.com/finding-nemo,"[{""id"": 494, ""name"": ""father son relationship""...",en,Finding Nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,"[{""name"": ""Pixar Animation Studios"", ""id"": 3}]",...,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There are 3.7 trillion fish in the ocean, they...",Finding Nemo,7.6,6122,"[{""cast_id"": 8, ""character"": ""Marlin (voice)"",...","[{""credit_id"": ""52fe420ec3a36847f80006b1"", ""de...",finding nemo
3,13,55000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",,"[{""id"": 422, ""name"": ""vietnam veteran""}, {""id""...",en,Forrest Gump,A man with a low IQ has accomplished great thi...,138.133331,"[{""name"": ""Paramount Pictures"", ""id"": 4}]",...,142.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"The world will never be the same, once you've ...",Forrest Gump,8.2,7927,"[{""cast_id"": 7, ""character"": ""Forrest Gump"", ""...","[{""credit_id"": ""52fe420ec3a36847f800076b"", ""de...",forrest gump
4,14,15000000,"[{""id"": 18, ""name"": ""Drama""}]",http://www.dreamworks.com/ab/,"[{""id"": 255, ""name"": ""male nudity""}, {""id"": 29...",en,American Beauty,"Lester Burnham, a depressed suburban father in...",80.878605,"[{""name"": ""DreamWorks SKG"", ""id"": 27}, {""name""...",...,122.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Look closer.,American Beauty,7.9,3313,"[{""cast_id"": 6, ""character"": ""Lester Burnham"",...","[{""credit_id"": ""52fe420ec3a36847f8000809"", ""de...",american beauty


In [None]:
# Check for data types and NA's
quick_column_summary(tmdb_data, 'tmdb_data')


📋 Column Summary for `tmdb_data`



Unnamed: 0,Column,Data Type,NA Count,% Missing
0,id,int64,0,0.0
1,budget,int64,0,0.0
2,genres,object,0,0.0
3,homepage,object,3091,64.355611
4,keywords,object,0,0.0
5,original_language,object,0,0.0
6,original_title,object,0,0.0
7,overview,object,3,0.062461
8,popularity,float64,0,0.0
9,production_companies,object,0,0.0


### 📥 Table Acquisition: `tmdb_data`

This is the richest and most structured table so far. It includes both structured and nested (JSON-like) data, contributing heavily to both prediction targets.

---

#### 🎯 Relevant Variables

| Column | Description | Relevance |
|--------|-------------|-----------|
| `title` | Movie name | ✅ Used to create `movie_id` |
| `vote_average` | Average audience rating | ✅ Proxy for IMDB score |
| `vote_count` | Number of votes | 🧪 May influence or complement score |
| `budget` | Production cost | 📌 Required for ROI |
| `revenue` | Box office revenue | 📌 Required for ROI |
| `runtime` | Duration in minutes | 📊 Feature for pacing / cost |
| `popularity` | TMDB popularity score | 📊 Social visibility |
| `release_date` | Date released | 📊 Use for time features (month, year) |
| `genres` | List of genres (JSON) | 🧠 To parse later for genre-based analysis |
| `keywords` | Thematic keywords (JSON) | 🧠 Useful after parsing |
| `overview`, `tagline` | Textual summary & tagline | 🧠 Potential for NLP sentiment modeling |
| `original_language` | Language code (e.g., 'en') | 📊 Cultural/demographic indicator |
| `production_companies` | Companies involved (JSON) | 🧠 Feature engineering (studio power) |
| `production_countries` | Countries involved (JSON) | 📊 International impact |
| `spoken_languages` | Languages spoken (JSON) | 📊 Audience reach |
| `cast`, `crew` | Cast and crew (JSON) | 🧠 Feature-rich, parse later |
| `status` | e.g., Released, Post-production, etc. | 🧪 May correlate with box office |

---

#### 🧠 Summary

- This table contributes to both `imdb_score_features` and `roi_features`
- Contains multiple nested fields that will be parsed during feature engineering
- Will be saved in SQLite as `raw_tmdb_data`
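
The nested fields (`genres`, `keywords`, `cast`, and so on) are stored as JSON-encoded strings. A minimal sketch of how they might be parsed later is shown below; the helper `extract_names` is hypothetical, and the actual feature-engineering step may work differently.

```python
import json
import pandas as pd

def extract_names(cell):
    """Return the `name` values from a JSON-encoded list of objects."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]

toy = pd.DataFrame({
    "genres": [
        '[{"id": 80, "name": "Crime"}, {"id": 35, "name": "Comedy"}]',
        '[{"id": 18, "name": "Drama"}]',
        None,
    ]
})
toy["genre_names"] = toy["genres"].apply(extract_names)
print(toy["genre_names"].tolist())
```

Returning an empty list for missing or malformed cells keeps the column type uniform, which simplifies later steps such as one-hot encoding.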


# 4th Dataset: Movie Metadata

In [None]:
# 4. Complete Movie Metadata Dataset

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
file_path = '/content/drive/My Drive/Projects/Blockbuster Movies/movies.csv'  # Adjust path as needed
meta_data = pd.read_csv(file_path)

# Save the DataFrame to a CSV file
meta_data.to_csv("movies.csv", index=False)

# Confirm the file exists in the current directory
import os
os.listdir()

Mounted at /content/drive


['.config',
 'drive',
 'MovieFranchises.csv',
 'movies.csv',
 'movie.csv',
 'final_dataset.csv',
 'tmdb_movies.csv',
 'sample_data']

### First Check

In [None]:
# Display the first few rows
meta_data.head()

Unnamed: 0,id,title,genres,original_language,overview,popularity,production_companies,release_date,budget,revenue,runtime,status,tagline,vote_average,vote_count,credits,keywords,poster_path,backdrop_path,recommendations
0,615656,Meg 2: The Trench,Action-Science Fiction-Horror,en,An exploratory dive into the deepest depths of...,8763.998,Apelles Entertainment-Warner Bros. Pictures-di...,2023-08-02,129000000.0,352056500.0,116.0,Released,Back for seconds.,7.079,1365.0,Jason Statham-Wu Jing-Shuya Sophia Cai-Sergio ...,based on novel or book-sequel-kaiju,/4m1Au3YkjqsxF8iwQy0fPYSxE0h.jpg,/qlxy8yo5bcgUw2KAmmojUKp4rHd.jpg,1006462-298618-569094-1061181-346698-1076487-6...
1,758323,The Pope's Exorcist,Horror-Mystery-Thriller,en,Father Gabriele Amorth Chief Exorcist of the V...,5953.227,Screen Gems-2.0 Entertainment-Jesus & Mary-Wor...,2023-04-05,18000000.0,65675820.0,103.0,Released,Inspired by the actual files of Father Gabriel...,7.433,545.0,Russell Crowe-Daniel Zovatto-Alex Essoe-Franco...,spain-rome italy-vatican-pope-pig-possession-c...,/9JBEPLTPSm0d1mbEcLxULjJq9Eh.jpg,/hiHGRbyTcbZoLsYYkO4QiCLYe34.jpg,713704-296271-502356-1076605-1084225-1008005-9...
2,533535,Deadpool & Wolverine,Action-Comedy-Science Fiction,en,A listless Wade Wilson toils away in civilian ...,5410.496,Marvel Studios-Maximum Effort-21 Laps Entertai...,2024-07-24,200000000.0,1326387000.0,128.0,Released,Come together.,7.765,3749.0,Ryan Reynolds-Hugh Jackman-Emma Corrin-Matthew...,hero-superhero-anti hero-mutant-breaking the f...,/8cdWjvZQUExUUTzyp4t6EDMubfO.jpg,/dvBCdCohwWbsP5qAaglOXagDMtk.jpg,573435-519182-957452-1022789-945961-718821-103...
3,667538,Transformers: Rise of the Beasts,Action-Adventure-Science Fiction,en,When a new threat capable of destroying the en...,5409.104,Skydance-Paramount-di Bonaventura Pictures-Bay...,2023-06-06,200000000.0,407045500.0,127.0,Released,Unite or fall.,7.34,1007.0,Anthony Ramos-Dominique Fishback-Luna Lauren V...,peru-alien-end of the world-based on cartoon-b...,/gPbM0MK8CP8A174rmUwGsADNYKD.jpg,/woJbg7ZqidhpvqFGGMRhWQNoxwa.jpg,496450-569094-298618-385687-877100-598331-4628...
4,693134,Dune: Part Two,Science Fiction-Adventure,en,Follow the mythic journey of Paul Atreides as ...,4742.163,Legendary Pictures,2024-02-27,190000000.0,683813700.0,167.0,Released,Long live the fighters.,8.3,2770.0,Timothée Chalamet-Zendaya-Rebecca Ferguson-Jav...,epic-based on novel or book-fight-sandstorm-sa...,/czembW0Rk1Ke7lCJGahbOhdCuhV.jpg,/xOMo8BRK7PfcJv9JCnx7s5hj0PX.jpg,438631-763215-792307-1011985-467244-634492-359...


In [None]:
# Check for data types and NA's
quick_column_summary(meta_data, 'meta_data')


📋 Column Summary for `meta_data`



Unnamed: 0,Column,Data Type,NA Count,% Missing
0,id,int64,0,0.0
1,title,object,6,0.000831
2,genres,object,210317,29.116994
3,original_language,object,0,0.0
4,overview,object,118243,16.369959
5,popularity,float64,0,0.0
6,production_companies,object,384926,53.290453
7,release_date,object,51549,7.136617
8,budget,float64,0,0.0
9,revenue,float64,0,0.0


### 📥 Table Acquisition: `meta_data`

This dataset appears to be an updated or complementary version of `tmdb_data`, containing recent and upcoming titles with similar structure.

---

#### 🎯 Relevant Variables

| Column | Description | Relevance |
|--------|-------------|-----------|
| `title` | Movie title | ✅ Used to create `movie_id` |
| `vote_average` / `vote_count` | User rating and count | ✅ Score-related |
| `budget`, `revenue` | Financial data | 📌 Used for ROI |
| `runtime`, `release_date` | Timing & length | 📊 Influences score & ROI |
| `popularity` | TMDB popularity score | 📊 Social reach |
| `genres`, `keywords`, `overview`, `tagline` | Text / tags | 🧠 Feature-rich, parse later |
| `original_language` | Language code | 📊 Cultural signal |
| `status` | Release status | 🧪 Could correlate with results |
| `production_companies` | Studios involved | 🧠 To group studio trends |
| `credits` | Raw cast and crew | 🧠 To parse later for influence modeling |

---

#### 🧠 Summary

- High overlap with `tmdb_data` (Table 3) — strong candidate for integration
- Contributes to both `imdb_score_features` and `roi_features`
- Will require deduplication and possible enrichment during post-acquisition phase
- Will be saved in SQLite as `meta_data`
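
Saving a table to SQLite is not shown in this part of the notebook. One possible sketch uses `DataFrame.to_sql` with the standard-library `sqlite3` module. The in-memory database and the toy row are assumptions for the example; a file path would be needed to persist the data.

```python
import sqlite3
import pandas as pd

toy = pd.DataFrame({"movie_id": ["dune: part two"], "vote_average": [8.3]})

# ":memory:" keeps the example self-contained; use a file path to persist
conn = sqlite3.connect(":memory:")
toy.to_sql("meta_data", conn, if_exists="replace", index=False)

# Read the table back to confirm the round trip
round_trip = pd.read_sql_query("SELECT * FROM meta_data", conn)
conn.close()
print(round_trip)
```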


# 5th Dataset: Revenues Data

In [None]:
# 5. Movie Revenue Analysis Dataset

!wget https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv -O final_dataset.csv
import pandas as pd
financial_data = pd.read_csv("final_dataset.csv")

--2025-05-21 11:07:11--  https://raw.githubusercontent.com/JohnnySolo/Data-Analysis-Project---Blockbuster-Movies/main/final_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456039 (445K) [text/plain]
Saving to: ‘final_dataset.csv’


2025-05-21 11:07:11 (13.8 MB/s) - ‘final_dataset.csv’ saved [456039/456039]



### First Check

In [None]:
# Display the first few rows
financial_data.head()

Unnamed: 0.1,Unnamed: 0,movie,year,production_budget,domestic_gross,foreign_gross,worldwide_gross,month,profit,profit_margin,...,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,movie_id
0,0,Avatar,2009,425000000,760507625,2015837654,2776345279,12,2351345279,0.846921,...,0,0,0,0,1,0,0,0,0,avatar
1,1,Pirates of the Caribbean: On Stranger Tides,2011,410600000,241063875,804600000,1045663875,5,635063875,0.607331,...,0,0,0,0,0,0,0,0,0,pirates of the caribbean: on stranger tides
2,2,Avengers: Age of Ultron,2015,330600000,459005868,944008095,1403013963,5,1072413963,0.764364,...,0,0,0,0,1,0,0,0,0,avengers: age of ultron
3,3,Avengers: Infinity War,2018,300000000,678815482,1369318718,2048134200,4,1748134200,0.853525,...,0,0,0,0,0,0,0,0,0,avengers: infinity war
4,4,Justice League,2017,300000000,229024295,426920914,655945209,11,355945209,0.542645,...,0,0,0,0,1,0,0,0,0,justice league




In [None]:
# Check for data types and NA's
quick_column_summary(financial_data, 'financial_data')


📋 Column Summary for `financial_data`



Unnamed: 0,Column,Data Type,NA Count,% Missing
0,Unnamed: 0,int64,0,0.0
1,movie,object,0,0.0
2,year,int64,0,0.0
3,production_budget,int64,0,0.0
4,domestic_gross,int64,0,0.0
5,foreign_gross,int64,0,0.0
6,worldwide_gross,int64,0,0.0
7,month,int64,0,0.0
8,profit,int64,0,0.0
9,profit_margin,float64,0,0.0


### 📥 Table Acquisition: `financial_data`

This table is highly focused on financial metrics and genre distribution. It provides engineered columns for ROI, profit, and genre flags, making it very valuable for prediction.

---

#### 🎯 Relevant Variables

| Column | Description | Relevance |
|--------|-------------|-----------|
| `movie` | Movie name | ✅ Used to create `movie_id` |
| `production_budget`, `domestic_gross`, `foreign_gross`, `worldwide_gross` | Raw inputs for ROI | ✅ |
| `profit`, `roi`, `profit_margin`, `pct_foreign` | Pre-calculated finance metrics | ✅ |
| `vote_average`, `vote_count`, `popularity` | Score-related audience signals | ✅ |
| `original_language`, `release_date`, `month` | Contextual/cultural features | ✅ |
| `Action`, `Drama`, etc. | Binary genre flags | ✅ Helps both score and ROI models |

---

#### 🧠 Summary

- Strongest financial data table (calculated ROI & profit)
- Includes one-hot encoded genre info (clean and ready)
- Will contribute to both `imdb_score_features` and `roi_features`
- Saved in SQLite as `raw_financial_data`
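
For reference, the pre-calculated finance columns can be reproduced from the raw inputs. In the preview above, `profit` equals `worldwide_gross - production_budget` and `profit_margin` equals `profit / worldwide_gross` for the Avatar row. The `roi` column is not visible in the preview, so the definition below (profit divided by budget) is an assumption, not a confirmed formula.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie": ["Avatar"],
    "production_budget": [425_000_000],
    "worldwide_gross": [2_776_345_279],
})
toy["profit"] = toy["worldwide_gross"] - toy["production_budget"]
toy["profit_margin"] = toy["profit"] / toy["worldwide_gross"]
# Assumed definition: return on investment as profit relative to budget
toy["roi"] = toy["profit"] / toy["production_budget"]
print(toy[["movie", "profit", "profit_margin", "roi"]])
```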


---

# Full Preprocessing & Integration Pipeline

## 🔗 Data Consolidation & Final Dataset Preparation

### 1. Normalizing Identifiers
To prepare for merging:
- I created a **primary key** called `movie_id` in each dataset using the cleaned movie title (`name`, `title`, or `movie`) columns.
- Each name was normalized (lowercase, stripped whitespace) to ensure consistent joining across tables.

### 2. Merging Strategy
I used a **left join** strategy starting from `movie_franchises` as the base table.  
Why a left join?
- It keeps every movie in `movie_franchises`, which was already filtered to rows with complete modeling targets, and only adds columns to it.
- An inner join would have kept only the movies that appear in every dataset, leaving a smaller sample.
- An outer join would have added unmatched movies from the other datasets, introducing excessive NAs.
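
The effect of the join type on the row count can be seen in a toy example. The two frames below are made up for illustration only.

```python
import pandas as pd

base = pd.DataFrame({"movie_id": ["a", "b", "c"], "imdb_score": [7.0, 6.5, 8.1]})
extra = pd.DataFrame({"movie_id": ["b", "c", "d"], "profit": [10, 20, 30]})

# Count the rows each join type produces
row_counts = {
    how: len(base.merge(extra, on="movie_id", how=how))
    for how in ("left", "inner", "outer")
}
print(row_counts)  # {'left': 3, 'inner': 2, 'outer': 4}
```

The left join keeps all three base rows, the inner join drops the unmatched one, and the outer join adds a row that has no modeling targets.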

### 3. Post-Merge Cleaning
After joining:
- I dropped duplicate variables (e.g., `vote_average_x`, `runtime_y`) and renamed important ones clearly.
- Used logic to **fill in genre dummy columns** from the `genre` column when genre indicators were missing.
- All genre columns were validated to contain proper 0/1 indicators for classification tasks.
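
One way the 0/1 validation could be done is sketched below. The helper `invalid_indicator_columns` is hypothetical; it treats missing values as invalid, which may or may not match the check used in the notebook.

```python
import pandas as pd

def invalid_indicator_columns(df, columns):
    """Return the columns that contain anything other than 0 or 1."""
    is_binary = df[columns].isin([0, 1]).all()
    return is_binary[~is_binary].index.tolist()

toy = pd.DataFrame({
    "Action": [1.0, 0.0, 1.0],
    "Drama": [0.0, None, 1.0],   # contains a missing value
    "War": [0.0, 2.0, 0.0],      # contains an out-of-range value
})
bad = invalid_indicator_columns(toy, ["Action", "Drama", "War"])
print(bad)
```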

### 4. Organizing the Dataset
I reordered columns based on:
- **Target modeling relevance** (e.g., `imdb_score`)
- **Predictive features** (e.g., votes, budget, genre dummies)
- **Meta content** (overview, keywords, cast)

### 5. Final Export
I saved the final processed dataset to a `.csv` file and uploaded it to my GitHub repository.  
This final version will be loaded in the next notebook for:
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Building



## 1. Normalization

In [None]:
def normalize_title(title):
    return title.str.strip().str.lower()

movie_franchises['movie_id'] = normalize_title(movie_franchises['name'])
data2['movie_id'] = normalize_title(data2['Title'])
tmdb_data['movie_id'] = normalize_title(tmdb_data['title'])
meta_data['movie_id'] = normalize_title(meta_data['title'])
financial_data['movie_id'] = normalize_title(financial_data['movie'])
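
A caveat worth checking: a key built from the title alone can collide when different films share a title, for example remakes. The sketch below shows how such collisions could be listed, and one hypothetical alternative, a composite key of title and year. The column `movie_key` and the toy rows are assumptions and are not used elsewhere in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Point Break", " point break ", "Heat"],
    "year": [1991, 2015, 1995],
})
toy["movie_id"] = toy["name"].str.strip().str.lower()

# Titles that map to the same key are candidates for a wrong match
collisions = toy[toy["movie_id"].duplicated(keep=False)]
print(collisions)

# Hypothetical alternative: a composite key of normalized title and year
toy["movie_key"] = toy["movie_id"] + "_" + toy["year"].astype(str)
```

A composite key only helps when every table carries a comparable year column, so it is a trade-off rather than a drop-in replacement.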

## 2. Merging Data (by left-join)

In [None]:
enriched = movie_franchises.merge(
    financial_data[['movie_id', 'roi', 'production_budget', 'worldwide_gross', 'profit','Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie',
       'Thriller', 'War', 'Western']],
    on='movie_id',
    how='left'
)

enriched = enriched.merge(
    tmdb_data[['movie_id', 'vote_average', 'vote_count', 'popularity', 'runtime','homepage', 'keywords','overview','tagline','cast', 'crew',]],
    on='movie_id',
    how='left'
)

enriched = enriched.merge(
    meta_data[['movie_id', 'vote_average', 'vote_count', 'popularity', 'runtime', 'keywords','overview','tagline','recommendations']],
    on='movie_id',
    how='left'
)
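
To see how many base rows actually found a match in a lookup table, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses toy frames; the names `base` and `lookup` are placeholders.

```python
import pandas as pd

base = pd.DataFrame({"movie_id": ["a", "b", "c", "d"]})
lookup = pd.DataFrame({"movie_id": ["b", "d", "e"], "popularity": [1.5, 2.5, 3.5]})

# `_merge` is "both" for matched rows and "left_only" for unmatched base rows
checked = base.merge(lookup, on="movie_id", how="left", indicator=True)
match_rate = (checked["_merge"] == "both").mean()
print(f"Matched {match_rate:.0%} of base rows")
```

If the lookup table contains repeated `movie_id` values, a left join multiplies the matching base rows. Passing `validate="many_to_one"` to `merge` raises an error in that case, which can be a useful guard.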

In [None]:
enriched['na_count'] = enriched.isna().sum(axis=1)
enriched = enriched.sort_values(by='na_count').drop_duplicates(subset='movie_id', keep='first')
enriched = enriched.drop(columns='na_count')
enriched

Unnamed: 0,name,rating,genre,year,released,imdb_score,votes,director,writer,star,...,cast,crew,vote_average_y,vote_count_y,popularity_y,runtime,keywords_y,overview_y,tagline_y,recommendations
14140,Paddington,PG,Animation,2014,"January 16, 2015 (United States)",7.2,103000.0,Paul King,Paul King,Hugh Bonneville,...,"[{""cast_id"": 5, ""character"": ""Millicent"", ""cre...","[{""credit_id"": ""542861980e0a26556f002f29"", ""de...",7.10,3465.0,27.495,96.0,london england-based on novel or book-peru-ant...,A young Peruvian bear travels to London in sea...,The adventure begins.,346648-136387-49479-49133-149023-27598-360913-...
15024,Steve Jobs,R,Biography,2015,"October 23, 2015 (United States)",7.2,160000.0,Danny Boyle,Aaron Sorkin,Michael Fassbender,...,"[{""cast_id"": 1, ""character"": ""Steve Jobs"", ""cr...","[{""credit_id"": ""5668d60692514174110041ae"", ""de...",6.80,3771.0,13.264,122.0,biography-computer-based on true story-father ...,Set backstage at three iconic product launches...,Can a great man be a good man?,314365-318846-294016-296098-274479-115782-2732...
15020,Point Break,PG-13,Action,2015,"December 25, 2015 (United States)",5.3,60000.0,Ericson Core,Kurt Wimmer,Edgar Ramírez,...,"[{""cast_id"": 4, ""character"": ""Bodhi"", ""credit_...","[{""credit_id"": ""57553467c3a368606100216a"", ""de...",7.13,3152.0,29.336,122.0,surfer-undercover-wave-surfboard-fbi-self-dest...,In Los Angeles a gang of bank robbers who call...,100% Pure Adrenaline!,10795-24257-21-152653-104755-257088-842544-163...
11635,How to Train Your Dragon,PG,Animation,2010,"March 26, 2010 (United States)",8.1,685000.0,Dean DeBlois,William Davies,Jay Baruchel,...,"[{""cast_id"": 3, ""character"": ""Hiccup"", ""credit...","[{""credit_id"": ""56264613925141179d003285"", ""de...",7.80,11603.0,72.156,98.0,flying-based on novel or book-blacksmith-arena...,As the son of a Viking leader on the cusp of m...,One adventure will change two worlds,82702-9502-38757-20352-10193-585-10681-82690-9...
11634,The Last Airbender,PG,Action,2010,"July 1, 2010 (United States)",4.0,156000.0,M. Night Shyamalan,M. Night Shyamalan,Noah Ringer,...,"[{""cast_id"": 2, ""character"": ""Aang"", ""credit_i...","[{""credit_id"": ""52fe433f9251416c750092c9"", ""de...",4.66,3530.0,47.471,103.0,fire-war ship-prince-kingdom-village-arrest-re...,The story follows the adventures of Aang a you...,"Four nations, one destiny.",27022-32657-2486-18823-9543-44912-46529-2268-3...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12341,The Raid: Redemption,R,Action,2011,"April 13, 2012 (United States)",7.6,195000.0,Gareth Evans,Gareth Evans,Iko Uwais,...,,,,,,,,,,
3676,"Blood In, Blood Out",R,Crime,1993,"April 30, 1993 (United States)",8.0,29000.0,Taylor Hackford,Ross Thomas,Damian Chapa,...,,,,,,,,,,
2796,Leatherface: Texas Chainsaw Massacre III,R,Horror,1990,"January 12, 1990 (United States)",5.1,15000.0,Jeff Burr,Kim Henkel,Kate Hodge,...,,,,,,,,,,
17085,X-Men: Dark Phoenix,PG-13,Action,2019,"June 7, 2019 (United States)",5.7,166000.0,Simon Kinberg,Simon Kinberg,James McAvoy,...,,,,,,,,,,
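
The de-duplication step above keeps, for each `movie_id`, the row with the fewest missing values. A toy version of the same idea is sketched below with placeholder data. It adds `kind="stable"` so that ties keep their original order, which is an addition and not part of the original cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_id": ["heat", "heat", "alien"],
    "overview": [None, "complete overview", "another overview"],
    "tagline": [None, "complete tagline", None],
})

# Count missing values per row, then keep the most complete row per movie
toy["na_count"] = toy.isna().sum(axis=1)
deduped = (
    toy.sort_values(by="na_count", kind="stable")
       .drop_duplicates(subset="movie_id", keep="first")
       .drop(columns="na_count")
)
print(deduped)
```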


## 3. Post-Merging Cleaning

In [None]:
# Check for data types and NA's
quick_column_summary(enriched, 'enriched')


📋 Column Summary for `enriched`



Unnamed: 0,Column,Data Type,NA Count,% Missing
0,name,object,0,0.0
1,rating,object,12,0.224257
2,genre,object,0,0.0
3,year,int64,0,0.0
4,released,object,0,0.0
5,imdb_score,float64,0,0.0
6,votes,float64,0,0.0
7,director,object,0,0.0
8,writer,object,0,0.0
9,star,object,0,0.0


Observations from the summary:

1. Duplicate columns: some columns are duplicates of others (for example, `tagline_x` and `tagline_y`). This happened because several of the merged datasets contained a column with the same name.
2. High NA percentages: some columns have a high NA count. Most of them are duplicate columns and the genre classification columns (`Action`, ..., `Western`).

Action: I'll drop the duplicates, specifically the copy with the higher NA percentage. I'll keep the genre classification columns, because their missing values can be filled in from the `genre` column as long as it has 0% NA (and it does).
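
The genre back-fill can also be expressed without a row-by-row loop. The sketch below is a vectorized variant on toy data with a shortened genre list. It assumes each `genre` value holds a single label, as in the preview of `movie_franchises`.

```python
import pandas as pd

genre_columns = ["Action", "Comedy", "Drama"]  # shortened list for illustration

toy = pd.DataFrame({
    "genre": ["Action", "Drama", "Comedy"],
    "Action": [1.0, None, None],
    "Comedy": [0.0, None, None],
    "Drama": [1.0, None, None],
})

# Rows where every genre flag is missing get back-filled from `genre`
all_missing = toy[genre_columns].isna().all(axis=1)
for col in genre_columns:
    toy.loc[all_missing, col] = (toy.loc[all_missing, "genre"] == col).astype(float)
print(toy)
```

Labels in `genre` that do not match any dummy column name (for example `Biography`, which appears in the merged output) would produce a row of zeros, so a label mapping may be needed before relying on the result.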

In [None]:
# 1. Omit the duplicate columns. For each pair, the column with the higher NA% is the one we drop.
# .copy() makes this an independent DataFrame, so later assignments don't trigger SettingWithCopyWarning.
enriched_better = enriched[['name', 'rating', 'genre', 'year', 'released', 'imdb_score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime_x', 'movie_id','production_budget', 'worldwide_gross',
       'profit', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie',
       'Thriller', 'War', 'Western','homepage','cast', 'crew', 'vote_average_y', 'vote_count_y',
       'popularity_y', 'runtime', 'keywords_y', 'overview_y', 'tagline_y',
       'recommendations']].copy()

In [None]:
# 2. Edit genre columns based on the genre column values

# Define the full set of genre dummy columns
genre_columns = [
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
    'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western'
]

# Ensure all genre columns exist and are numeric
for col in genre_columns:
    if col not in enriched_better.columns:
        # Create missing columns as float NaN (an object column of pd.NA does not cast to float)
        enriched_better[col] = float("nan")
    enriched_better[col] = enriched_better[col].astype("float")  # Force numeric

# Mask for rows where all genre columns are missing
genre_na_mask = enriched_better[genre_columns].isna().all(axis=1)

# Update those rows based on the 'genre' column
for idx in enriched_better[genre_na_mask].index:
    genre_str = enriched_better.loc[idx, 'genre']
    genre_list = [g.strip() for g in str(genre_str).split('|')] if pd.notna(genre_str) else []

    for col in genre_columns:
        enriched_better.at[idx, col] = 1.0 if col in genre_list else 0.0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  enriched_better[col] = enriched_better[col].astype("float")  # Force numeric
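
The warning above is pandas' `SettingWithCopyWarning`: `enriched_better` was created by selecting columns from `enriched`, so pandas cannot tell whether later assignments are meant to reach the original frame. A minimal sketch of one way around it follows, using a hypothetical toy frame and a shortened genre list: take an explicit `.copy()`, then fill the dummies in one vectorized step with `Series.str.get_dummies` instead of a row loop. Unlike the loop, this version does not strip whitespace around the `|` separators, so it assumes clean genre strings.

```python
import pandas as pd

genre_columns = ["Action", "Comedy", "Horror"]  # shortened list for the example

# Hypothetical toy data: row 0 already has dummies, rows 1-2 do not
toy = pd.DataFrame({
    "genre": ["Action", "Horror", "Comedy"],
    "Action": [1.0, None, None],
    "Comedy": [0.0, None, None],
    "Horror": [0.0, None, None],
})

# Explicit copy: assignments below modify this frame only
subset = toy[["genre"] + genre_columns].copy()

# Rows where every genre dummy is missing
mask = subset[genre_columns].isna().all(axis=1)

# Build 0/1 indicators from the '|'-separated genre string,
# with columns forced into the same order as genre_columns
dummies = (
    subset.loc[mask, "genre"]
    .str.get_dummies(sep="|")
    .reindex(columns=genre_columns, fill_value=0)
    .astype(float)
)

# Rows and columns are in matching order, so a plain array assignment is safe
subset.loc[mask, genre_columns] = dummies.to_numpy()
print(subset)
```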

## 4. Organizing the Dataset

In [None]:
# Rearrange the columns based on order of importance and NA%
enriched_best = enriched_better[['movie_id', 'rating', 'genre', 'year', 'released', 'imdb_score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime_x', 'vote_average_y', 'vote_count_y', 'popularity_y',
       'keywords_y', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western',
       'production_budget', 'worldwide_gross', 'profit',
       'overview_y', 'tagline_y', 'recommendations']]

In [None]:
# Check for missing columns in the reordering list
col_list = ['movie_id','rating', 'genre', 'year', 'released', 'imdb_score', 'votes',
       'director', 'writer', 'star', 'country', 'budget', 'gross', 'company',
       'runtime_x','vote_average_y', 'vote_count_y', 'popularity_y',
       'keywords_y','Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western','production_budget', 'worldwide_gross', 'profit' ,'overview_y', 'tagline_y', 'recommendations']

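# Compare against the source frame: a check against `enriched_best` can never report
# a missing column, because the selection itself fails with a KeyError in that case.
missing = [col for col in col_list if col not in enriched_better.columns]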
print("❗ Missing columns:", missing)

❗ Missing columns: []


In [None]:
# Check for data types and NA's
quick_column_summary(enriched_best, 'enriched_best')


📋 Column Summary for `enriched_best`



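| | Column | Data Type | NA Count | % Missing |
|---|--------|-----------|----------|-----------|
| 0 | movie_id | object | 0 | 0.0 |
| 1 | rating | object | 12 | 0.224257 |
| 2 | genre | object | 0 | 0.0 |
| 3 | year | int64 | 0 | 0.0 |
| 4 | released | object | 0 | 0.0 |
| 5 | imdb_score | float64 | 0 | 0.0 |
| 6 | votes | float64 | 0 | 0.0 |
| 7 | director | object | 0 | 0.0 |
| 8 | writer | object | 0 | 0.0 |
| 9 | star | object | 0 | 0.0 |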


In [None]:
# Strip the merge suffixes (_x / _y) from the columns that were duplicated before
# Map each old column name to its new one
rename_map = {
    'runtime_x': 'runtime',
    'vote_average_y': 'vote_average',
    'vote_count_y': 'vote_count',
    'popularity_y': 'popularity',
    'tagline_y': 'tagline',
    'keywords_y': 'keywords',
    'overview_y': 'overview'
}

# Apply renaming in one line
enriched_best = enriched_best.rename(columns=rename_map)
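
Stripping suffixes is only safe when the target name is not already taken; renaming `runtime_x` to `runtime` while another `runtime` column is still present would leave two columns with the same label. The check below is a sketch under that assumption, with a hypothetical helper name `rename_collisions` and toy data.

```python
import pandas as pd

def rename_collisions(df, rename_map):
    """Return target names that already exist in df as columns not being renamed."""
    untouched = [c for c in df.columns if c not in rename_map]
    return sorted(set(rename_map.values()) & set(untouched))

# Hypothetical frame where an unsuffixed `runtime` column is still present
toy = pd.DataFrame({"runtime_x": [100], "runtime": [101], "tagline_y": ["a"]})
toy_map = {"runtime_x": "runtime", "tagline_y": "tagline"}

collisions = rename_collisions(toy, toy_map)
print(collisions)
```

An empty list means the rename cannot create duplicate labels. Here the earlier reordering step already left the unsuffixed `runtime` column out of `enriched_best`, so no collision is expected.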

## 5. Final Export

In [None]:
# Save to local file (which will also show up in Colab's Files tab)
enriched_best.to_csv("final_movie_data.csv", index=False)

from google.colab import files
files.download('final_movie_data.csv')  # or any other filename you want to download
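
The `google.colab` import only works inside Colab. If the notebook may also run locally, the export can be wrapped so that the download step is skipped when the module is unavailable. This is a sketch: the function name `export_csv` and the demo path are hypothetical.

```python
import os
import tempfile

import pandas as pd

def export_csv(df, path="final_movie_data.csv"):
    """Write df to CSV; trigger a browser download only when running in Colab."""
    df.to_csv(path, index=False)
    try:
        from google.colab import files  # available only inside Colab
    except ImportError:
        return path  # local run: the file is simply left on disk
    files.download(path)
    return path

# Demo with a throwaway directory and a one-row frame
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
written = export_csv(pd.DataFrame({"movie_id": ["The Raid: Redemption"]}), demo_path)
print(written)
```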


---