# Movie Data Cleaning and Merging Process
## Project Overview
We're going to merge multiple movie datasets to create a comprehensive database for analysis. Our goal is to combine IMDB ratings, box office data, TMDB scores, and budget information into one clean dataset.

### Data Sources:

- IMDB: Movie metadata, ratings, and crew info
- Box Office Mojo: Revenue data (domestic/foreign gross)
- TMDB: Additional ratings and popularity scores
- The Numbers: Production budget data

## Initial Setup and Data Loading

First, we need to set up pandas display options so we can see all the columns properly when debugging. Then we'll load our main datasets.

In [1]:
import pandas as pd
import numpy as np

# Better display options for debugging
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', '{:,.0f}'.format)  # display only: floats print rounded to whole numbers

# Load our datasets
imdb_df = pd.read_csv("clean_imdb_data.csv")
bom_df = pd.read_csv("cleaned_bom.movie_gross.csv")

print(f"IMDB dataset: {imdb_df.shape}")
print(f"Box Office Mojo dataset: {bom_df.shape}")

IMDB dataset: (83375, 12)
Box Office Mojo dataset: (3387, 7)


Let's take a quick look at what we're working with:

In [2]:
# peek at the data structure
imdb_df.head(2)

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,person_id,averagerating,numvotes,category,primary_name,primary_profession
0,tt0063540,Sunghursh,Sunghursh,2013,175,"Action,Crime,Drama",nm0712540,7,77,director,Harnam Singh Rawail,"director,writer,producer"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114,"Biography,Drama",nm0002411,7,43,director,Mani Kaul,"director,writer,actor"


In [3]:
bom_df.head(2)  # see what bom columns look like

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,year_only,total_gross
0,Toy Story 3,BV,415000000,652000000,2010-01-01,2010,1067000000
1,Alice in Wonderland (2010),BV,334200000,691300000,2010-01-01,2010,1025500000


## Data Standardization for Merging

Next, we need to standardize titles and years so we can match movies across datasets. We'll convert titles to lowercase and strip surrounding whitespace because title formatting is inconsistent between sources.

In [4]:
# Create standardized join keys
imdb_df['join_title'] = imdb_df['primary_title'].str.lower().str.strip() 
imdb_df['join_year'] = imdb_df['start_year']  # already clean

bom_df['join_title'] = bom_df['title'].str.lower().str.strip()
bom_df['join_year'] = pd.to_datetime(bom_df['year'], errors='coerce').dt.year  # extract year from date

# Quick check of our join keys
print("Sample IMDB join keys:")
print(imdb_df[['primary_title', 'join_title', 'join_year']].head(3))

Sample IMDB join keys:
                     primary_title                       join_title  join_year
0                        Sunghursh                        sunghursh       2013
1  One Day Before the Rainy Season  one day before the rainy season       2019
2       The Other Side of the Wind       the other side of the wind       2018
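The BOM year comes from a date string, so it's worth knowing what `errors='coerce'` does to the resulting dtype. Here's a small sketch on made-up values (not our real data): any unparseable date becomes `NaT`, which turns the extracted year into a float column unless we cast it to the nullable `Int64` type.

```python
import pandas as pd

# Toy values: one valid date, one invalid string, one missing entry.
dates = pd.Series(["2010-01-01", "not a date", None])

# Invalid or missing dates are coerced to NaT instead of raising an error.
years = pd.to_datetime(dates, errors="coerce").dt.year
print(years.dtype)  # float64, because NaT becomes NaN

# The nullable integer dtype keeps whole-number years alongside missing values.
years_int = years.astype("Int64")
print(years_int.tolist())
```

Casting both sides of a join to the same integer dtype avoids comparing `2010` with `2010.0` during the merge.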


Let's see if there are any obvious issues with the standardization:

In [5]:
# check for any weird characters or formatting issues
print("Unique characters in first 100 titles:")
sample_titles = imdb_df['join_title'].head(100).str.cat()
print(set(sample_titles))

Unique characters in first 100 titles:
{'!', 'a', 'z', 'ú', 'x', 'w', '-', "'", '&', 'f', 'ç', '5', 'k', 'j', 'u', 'l', 'v', 'r', '.', 'y', 'ô', 'p', 'b', ':', 'd', '3', 'q', 'ã', 'á', 'o', ' ', 'n', 'h', '2', 'i', 'c', 's', 't', 'e', '1', '4', 'm', 'g'}
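The character set above includes accents and punctuation, and BOM titles such as "Alice in Wonderland (2010)" carry a year suffix, so lowercasing and stripping alone will miss some matches. Below is a rough sketch of a stricter join key. `normalize_title` is a hypothetical helper, not part of the original pipeline, and the rules it applies (drop accents, drop a trailing year, replace `&`, drop punctuation) are assumptions we'd want to check against the real titles.

```python
import re
import unicodedata

import pandas as pd


def normalize_title(title: str) -> str:
    """Hypothetical helper: build a more forgiving join key from a raw title."""
    # Decompose accented characters and drop the combining marks (e.g. é -> e).
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove a trailing four-digit year in parentheses.
    no_year = re.sub(r"\s*\(\d{4}\)\s*$", "", no_accents)
    # Spell out "&", then drop any remaining punctuation.
    no_punct = re.sub(r"[^\w\s]", "", no_year.replace("&", " and "))
    # Collapse runs of whitespace and lowercase.
    return re.sub(r"\s+", " ", no_punct).strip().lower()


titles = pd.Series(["Alice in Wonderland (2010)", "Amélie ", "Fast & Furious 6"])
print(titles.map(normalize_title).tolist())
```

Dropping the year suffix can make remakes share a title key, which is one reason to keep `join_year` as the second join column.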


## First Merge: IMDB + Box Office Data

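Now we'll use an outer merge to see how many rows match between datasets. Setting `indicator=True` adds a `_merge` column that records whether each row came from the left table, the right table, or both, which helps us judge the merge quality.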

In [6]:
merged_df = pd.merge(imdb_df, bom_df, 
                    on=['join_title', 'join_year'], 
                    how='outer', 
                    indicator=True)

print("Merge summary:")
print(merged_df['_merge'].value_counts())
print(f"\nMatch rate: {merged_df['_merge'].value_counts()['both'] / len(merged_df) * 100:.1f}%")

Merge summary:
_merge
left_only     81254
both           2121
right_only     1492
Name: count, dtype: int64

Match rate: 2.5%


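As expected, most rows are only in IMDB (left_only), since IMDB covers far more titles than the box office data. Only 2,121 rows have both ratings and revenue data. Because the IMDB table has one row per person per movie, these counts are rows rather than distinct movies.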

Let's see some examples of movies that didn't match:

In [7]:
# look at some unmatched titles to understand why they didn't merge
unmatched_imdb = merged_df[merged_df['_merge'] == 'left_only']['primary_title'].head(10)
unmatched_bom = merged_df[merged_df['_merge'] == 'right_only']['title'].head(10)  # these are from bom only

print("Some unmatched IMDB titles:")
print(unmatched_imdb.tolist())
print("\nSome unmatched BOM titles:")
print(unmatched_bom.tolist())

Some unmatched IMDB titles:
['!Women Art Revolution', '#1 Serial Killer', '#5', '#66', '#AbroHilo', '#ALLMYMOVIES', '#ALLMYMOVIES', '#ALLMYMOVIES', '#artoffline', '#babynymph']

Some unmatched BOM titles:
["'71", '1,000 Times Good Night', '10 Years', '1001 Grams', '13 Assassins', '13 Hours: The Secret Soldiers of Benghazi', '13 Minutes', '14 Blades', '17 Girls', '2 Autumns, 3 Winters']
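Notice that '#ALLMYMOVIES' shows up three times in the IMDB list: the merge counts rows, and a movie with several people has several rows. If we want a match rate per movie instead, one option is to count distinct `movie_id` values. This is a sketch on a toy frame (the IDs and counts are made up), not a result from our data.

```python
import pandas as pd

# Toy stand-in for the outer-merge result: one row per (movie, person).
toy_merged = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2", "tt3", None],
    "_merge": ["both", "both", "left_only", "left_only", "right_only"],
})

# Rows that came from IMDB are everything except the BOM-only rows.
imdb_rows = toy_merged[toy_merged["_merge"] != "right_only"]

# Count distinct movies rather than rows.
total_movies = imdb_rows["movie_id"].nunique()
matched_movies = imdb_rows.loc[imdb_rows["_merge"] == "both", "movie_id"].nunique()

print(f"Matched {matched_movies} of {total_movies} IMDB movies "
      f"({matched_movies / total_movies:.1%})")
```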


## Extracting Successfully Matched Movies

We only want to keep movies that matched between both datasets since we need both IMDB ratings and box office numbers for our analysis.

In [8]:
matched = merged_df[merged_df['_merge'] == 'both'].copy()
print(f"Rows with both IMDB and box office data: {len(matched)}")

Rows with both IMDB and box office data: 2121


## Data Cleanup and Renaming

Next, we need to clean up our merged dataset. We'll drop the merge indicator and rename some columns for clarity.

In [9]:
# Drop the merge indicator column
matched.drop(columns=['_merge'], inplace=True)  # don't need this anymore

# Rename BOM columns to avoid confusion with IMDB columns
matched.rename(columns={
    'title': 'bom_title',
    'year': 'bom_full_date'
}, inplace=True)

# Reset index for clean numbering
matched.reset_index(drop=True, inplace=True)

print(f"Cleaned dataset shape: {matched.shape}")

Cleaned dataset shape: (2121, 21)


## Handling Multiple People per Movie

Since movies have multiple directors/actors, we need to collapse them into single rows. We'll create a function that groups by movie info and combines all the people data.

In [10]:
def collapse_people(df):
    """
    Collapse multiple people (directors/actors) per movie into single rows
    by concatenating their names and info
    """
    group_cols = [
        'movie_id', 'primary_title', 'original_title', 'start_year',
        'runtime_minutes', 'genres', 'averagerating', 'numvotes',
        'bom_title', 'studio', 'domestic_gross', 'foreign_gross',
        'year_only', 'total_gross'
    ]
    
    return df.groupby(group_cols).agg({
        'primary_name': lambda x: ', '.join(sorted(set(x.dropna()))),  # combine all names
        'primary_profession': lambda x: ', '.join(sorted(set(x.dropna()))),
        'person_id': lambda x: ', '.join(sorted(set(x.dropna()))),
    }).reset_index()

# Apply the function
collapsed_df = collapse_people(matched)

print(f"Before collapsing: {len(matched)} rows")
print(f"After collapsing: {len(collapsed_df)} rows")

Before collapsing: 2121 rows
After collapsing: 1918 rows
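One thing to watch here: by default `groupby` drops any row whose grouping keys contain NaN (`dropna=True`). We can't tell from the counts above whether that removed any of our movies, so here's a minimal sketch on toy data showing the behaviour and the `dropna=False` alternative. The column values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2"],
    "foreign_gross": [100.0, 100.0, np.nan],  # tt2 has no foreign gross
    "primary_name": ["A. Director", "B. Director", "C. Director"],
})

keys = ["movie_id", "foreign_gross"]


def join_names(names):
    """Combine unique, non-missing names into one comma-separated string."""
    return ", ".join(sorted(set(names.dropna())))


# Default behaviour: the tt2 group is dropped because one of its keys is NaN.
default = toy.groupby(keys).agg({"primary_name": join_names}).reset_index()

# dropna=False keeps groups whose keys contain NaN.
keep_nan = toy.groupby(keys, dropna=False).agg({"primary_name": join_names}).reset_index()

print(len(default), len(keep_nan))
```

If the two lengths differ on the real data, some movies were lost to missing values rather than merged with their other rows.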


Let's check what the people data looks like now:

In [11]:
# see how the collapsed people data looks
print("Sample of combined people data:")
print(collapsed_df[['primary_title', 'primary_name', 'primary_profession']].head(3))

Sample of combined people data:
                     primary_title   primary_name        primary_profession
0                            Wazir  Bejoy Nambiar  producer,writer,director
1                      On the Road  Walter Salles  director,producer,writer
2  The Secret Life of Walter Mitty    Ben Stiller   producer,actor,director


## Adding TMDB Ratings Data

Now we need to load TMDB data to get additional rating information. We'll standardize the column names so they don't conflict with IMDB ratings.

In [12]:
# Load TMDB data
tmdb = pd.read_csv("cleaned_tmdb_movies.csv")
print(f"TMDB dataset shape: {tmdb.shape}")

# Rename rating columns to avoid confusion
collapsed_df = collapsed_df.rename(columns={
    'averagerating': 'imdb_rating',
    'numvotes': 'imdb_votes'  # make it clear these are imdb votes
})

tmdb = tmdb.rename(columns={
    'vote_average': 'tmdb_rating',
    'vote_count': 'tmdb_votes'
})

TMDB dataset shape: (25497, 10)


Quick peek at TMDB data structure:

In [13]:
tmdb.info()  # see what we're working with

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25497 entries, 0 to 25496
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         25497 non-null  int64  
 1   genre_ids          25497 non-null  object 
 2   id                 25497 non-null  int64  
 3   original_language  25497 non-null  object 
 4   original_title     25497 non-null  object 
 5   popularity         25497 non-null  float64
 6   release_date       25497 non-null  object 
 7   title              25497 non-null  object 
 8   tmdb_rating        25497 non-null  float64
 9   tmdb_votes         25497 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 1.9+ MB
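The `Unnamed: 0` column in the listing above usually means the CSV was written with its index included. Assuming that's the case here, `index_col=0` would load it as the index instead of a data column. The sketch below round-trips a tiny made-up table through an in-memory CSV to show the difference.

```python
import io

import pandas as pd

original = pd.DataFrame({"id": [10, 20], "title": ["Iron Man 2", "Toy Story 3"]})

# to_csv() writes the index by default, producing an unnamed first column.
csv_text = original.to_csv()

reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())  # includes 'Unnamed: 0'

# index_col=0 reads that first column back as the index.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```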


## Preparing TMDB for Merge

We need to create standardized join keys for TMDB data. We'll have to extract the year from the release_date field.

In [14]:
# Create join keys for both datasets
collapsed_df['join_title'] = collapsed_df['primary_title'].str.lower().str.strip()
collapsed_df['join_year'] = collapsed_df['year_only'].astype('Int64')  # handle missing years

tmdb['join_title'] = tmdb['title'].str.lower().str.strip()
tmdb['join_year'] = pd.to_datetime(tmdb['release_date'], errors='coerce').dt.year.astype('Int64')

print("TMDB join key samples:")
print(tmdb[['title', 'join_title', 'release_date', 'join_year']].head(3))

TMDB join key samples:
                                          title  \
0  Harry Potter and the Deathly Hallows: Part 1   
1                      How to Train Your Dragon   
2                                    Iron Man 2   

                                     join_title release_date  join_year  
0  harry potter and the deathly hallows: part 1   2010-11-19       2010  
1                      how to train your dragon   2010-03-26       2010  
2                                    iron man 2   2010-05-07       2010  


## Merging TMDB Data

We'll use a left join since we want to keep all our existing movies even if TMDB doesn't have ratings for them.

In [15]:
final_df = collapsed_df.merge(
    tmdb[['join_title', 'join_year', 'tmdb_rating', 'tmdb_votes', 'popularity']],
    on=['join_title', 'join_year'],
    how='left'  # keep all our movies
)

print(f"After TMDB merge: {final_df.shape}")
print(f"Movies with TMDB ratings: {final_df['tmdb_rating'].notna().sum()}")

After TMDB merge: (1927, 22)
Movies with TMDB ratings: 1675


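Most rows matched TMDB data: 1,675 of 1,927 have a TMDB rating, so we can now compare IMDB and TMDB ratings. Note that the row count grew from 1,918 to 1,927 during the merge, which means some title/year keys appear more than once in TMDB and duplicated a few movies.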
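The row count went from 1,918 before the merge to 1,927 after it, which is what a left join does when the right-hand table repeats a key. One way to guard against that is to deduplicate TMDB on the join keys first and let `validate` confirm the keys are unique. This is a sketch on toy data; keeping the most-voted entry is an assumption about which duplicate is the "right" one, and the ratings below are invented.

```python
import pandas as pd

movies = pd.DataFrame({
    "join_title": ["wazir", "on the road"],
    "join_year": [2016, 2012],
})

# Toy TMDB table in which "wazir" is listed twice.
tmdb_toy = pd.DataFrame({
    "join_title": ["wazir", "wazir", "on the road"],
    "join_year": [2016, 2016, 2012],
    "tmdb_rating": [6.6, 6.8, 5.6],
    "tmdb_votes": [63, 10, 518],
})

# Keep the most-voted entry per key so each movie matches at most one TMDB row.
tmdb_unique = (
    tmdb_toy.sort_values("tmdb_votes", ascending=False)
            .drop_duplicates(subset=["join_title", "join_year"], keep="first")
)

merged = movies.merge(
    tmdb_unique,
    on=["join_title", "join_year"],
    how="left",
    validate="many_to_one",  # raises MergeError if right-hand keys are not unique
)
print(len(movies), len(merged))
```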

Let's compare the rating scales:

In [16]:
# quick comparison of rating distributions
print("IMDB vs TMDB rating comparison:")
print(f"IMDB ratings range: {final_df['imdb_rating'].min()} - {final_df['imdb_rating'].max()}")
print(f"TMDB ratings range: {final_df['tmdb_rating'].min()} - {final_df['tmdb_rating'].max()}")  # should be 0-10 scale

IMDB vs TMDB rating comparison:
IMDB ratings range: 1.6 - 8.9
TMDB ratings range: 3.4 - 9.5


## Adding Budget Data

Next, we need to load The Numbers (TN) budget data and merge it in. This will help us analyze profitability later.

In [17]:
# Load and prepare budget data
tn_df = pd.read_csv("tn.movie_budgets_clean.csv")
tn_df['join_title'] = tn_df['movie'].str.lower().str.strip()
tn_df['join_year'] = tn_df['release_year']  # Already an int

# Merge budget data
final_df = final_df.merge(
    tn_df[['join_title', 'join_year', 'production_budget']],
    on=['join_title', 'join_year'],
    how='left'
)

# Check missing budget data
missing_budgets = final_df['production_budget'].isna().sum()
print(f"Movies missing budget data: {missing_budgets} out of {len(final_df)}")
print(f"Budget data coverage: {(1 - missing_budgets/len(final_df)) * 100:.1f}%")

Movies missing budget data: 830 out of 1927
Budget data coverage: 56.9%


A lot of movies are missing budget data: 830 of 1,927 rows (43.1%). This is pretty common, since budget info is harder to find than box office numbers.

Let's look at budget distribution for movies that have data:

In [18]:
# explore budget ranges
budget_stats = final_df['production_budget'].describe()
print("Production budget statistics (millions):")
# drop 'count' before scaling: it is a number of movies, not a dollar amount
print(budget_stats.drop('count') / 1_000_000)  # convert to millions for readability

Production budget statistics (millions):
mean     52
std      58
min       0
25%      13
50%      30
75%      65
max     411
Name: production_budget, dtype: float64


## Budget Imputation Strategy

Since we're missing a lot of budget data, we'll try estimating budgets from genre patterns. The idea is that movies in the same genre probably have similar revenue-to-budget ratios.

In [19]:
# Extract primary genre from genre list
final_df['main_genre'] = final_df['genres'].str.split(',').str[0]

# Calculate revenue-to-budget ratios for movies with known budgets
valid = final_df.dropna(subset=['production_budget', 'total_gross'])
valid = valid[valid['production_budget'] > 0]  # Avoid division by 0
valid = valid[valid['total_gross'] > 0]

valid['revenue_to_budget_ratio'] = valid['total_gross'] / valid['production_budget']  # gross revenue multiple, not net ROI

# Get median ratio per genre (more robust than mean)
genre_medians = valid.groupby('main_genre')['revenue_to_budget_ratio'].median()

print("Revenue-to-budget ratios by genre:")
print(genre_medians.sort_values(ascending=False))

Revenue-to-budget ratios by genre:
main_genre
Horror        6
Thriller      6
Mystery       5
Animation     5
Documentary   3
Adventure     3
Comedy        3
Romance       3
Drama         3
Action        2
Crime         2
Biography     2
Fantasy       1
Music         0
Name: revenue_to_budget_ratio, dtype: float64


Interesting! Horror and Thriller movies show the highest median revenue-to-budget ratios (about 6), while Music (about 0) and Fantasy (about 1) show the lowest.

Let's see how many movies we have per genre for validation:

In [20]:
# check sample sizes per genre
genre_counts = valid.groupby('main_genre').size().sort_values(ascending=False)
print("Movies per genre (with budget data):")
print(genre_counts)

Movies per genre (with budget data):
main_genre
Action         348
Comedy         221
Drama          174
Adventure      133
Biography       92
Horror          55
Crime           49
Documentary      9
Animation        6
Fantasy          3
Mystery          3
Thriller         2
Music            1
Romance          1
dtype: int64
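Several genres are backed by only a handful of movies (Thriller has 2, Music and Romance have 1 each), so their medians are not very trustworthy. One possible safeguard is to fall back to the overall median when a genre has too few samples. This is a sketch on toy data; `MIN_MOVIES` is a hypothetical threshold, not something taken from the analysis above.

```python
import pandas as pd

# Toy data: one well-sampled genre and one genre backed by a single movie.
toy_valid = pd.DataFrame({
    "main_genre": ["Action"] * 4 + ["Music"],
    "revenue_to_budget_ratio": [1.5, 2.0, 2.5, 3.0, 0.2],
})

MIN_MOVIES = 3  # hypothetical minimum sample size per genre

# Median ratio and number of movies for each genre.
stats = toy_valid.groupby("main_genre")["revenue_to_budget_ratio"].agg(["median", "size"])
overall_median = toy_valid["revenue_to_budget_ratio"].median()

# Keep the genre median only where enough movies support it; otherwise fall back.
robust_medians = stats["median"].where(stats["size"] >= MIN_MOVIES, overall_median)
print(robust_medians)
```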


## Budget Estimation Function

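We'll create a function that estimates missing budgets by working backwards from total revenue using genre-specific ratios: estimated budget = total gross / median genre ratio. It's not perfect, especially for genres backed by only a few movies (Thriller has 2, Music and Romance have 1 each), but it's better than leaving them blank.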

In [21]:
def estimate_budget(row):
    """
    Estimate production budget using genre-specific revenue ratios
    """
    # If we already have budget data, use it
    if pd.notna(row['production_budget']):
        return row['production_budget']
    
    # Get genre and revenue
    genre = row['main_genre']
    revenue = row['total_gross']
    
    # Can't estimate if missing key data
    if pd.isna(genre) or pd.isna(revenue) or revenue == 0:
        return np.nan  # give up
    
    # Get the ratio for this genre
    ratio = genre_medians.get(genre)
    if pd.isna(ratio) or ratio == 0:
        return np.nan  # no data for this genre
    
    # Estimate budget = revenue / ratio
    return revenue / ratio

# Apply the estimation
final_df['production_budget_imputed'] = final_df.apply(estimate_budget, axis=1)

# Check how many we could estimate
estimated = final_df['production_budget_imputed'].notna().sum()
original_budget_count = len(final_df) - missing_budgets
newly_estimated = estimated - original_budget_count

print(f"Total movies with budget data (real + estimated): {estimated}")
print(f"Newly estimated budgets: {newly_estimated}")

Total movies with budget data (real + estimated): 1926
Newly estimated budgets: 829
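Since 829 of the budgets are now estimates, it may help to record which ones are real. Here's a vectorized sketch of the same idea that also adds a flag column. The `budget_is_imputed` column and the toy ratios are assumptions for illustration and are not part of the dataset above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "main_genre": ["Action", "Action", "Horror", "Western"],
    "total_gross": [100.0, 50.0, 60.0, 10.0],
    "production_budget": [40.0, np.nan, np.nan, np.nan],
})
toy_medians = pd.Series({"Action": 2.0, "Horror": 6.0})  # toy genre ratios

# Look up each row's genre ratio; unknown genres and zero ratios become NaN.
ratio = toy["main_genre"].map(toy_medians).replace(0, np.nan)

# Estimate budget = revenue / ratio, skipping rows with zero revenue.
estimate = toy["total_gross"].where(toy["total_gross"] != 0) / ratio

# Flag rows where a missing budget was filled with an estimate.
toy["budget_is_imputed"] = toy["production_budget"].isna() & estimate.notna()
toy["production_budget_imputed"] = toy["production_budget"].fillna(estimate)
print(toy)
```

With the flag in place, later analysis can be restricted to real budgets or can compare results with and without the estimates.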


## Final Dataset Summary

Let's examine our final cleaned and merged dataset:

In [23]:
final_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,imdb_rating,imdb_votes,bom_title,studio,domestic_gross,foreign_gross,year_only,total_gross,primary_name,primary_profession,person_id,join_title,join_year,tmdb_rating,tmdb_votes,popularity,production_budget,main_genre,production_budget_imputed
0,tt0315642,Wazir,Wazir,2016,103,"Action,Crime,Drama",7,15378,Wazir,Relbig.,1100000,0,2016,1100000,Bejoy Nambiar,"producer,writer,director",nm2349060,wazir,2016,7,63,4,,Action,477223
1,tt0337692,On the Road,On the Road,2012,124,"Adventure,Drama,Romance",6,37886,On the Road,IFC,744000,8000000,2012,8744000,Walter Salles,"director,producer,writer",nm0758574,on the road,2012,6,518,9,,Adventure,3257824
2,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114,"Adventure,Comedy,Drama",7,275300,The Secret Life of Walter Mitty,Fox,58200000,129900000,2013,188100000,Ben Stiller,"producer,actor,director",nm0001774,the secret life of walter mitty,2013,7,4859,11,91000000.0,Adventure,91000000
3,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114,"Action,Crime,Drama",6,105116,A Walk Among the Tombstones,Uni.,26300000,26900000,2014,53200000,Scott Frank,"writer,producer,director",nm0291082,a walk among the tombstones,2014,6,1685,19,28000000.0,Action,28000000
4,tt0369610,Jurassic World,Jurassic World,2015,124,"Action,Adventure,Sci-Fi",7,539338,Jurassic World,Uni.,652300000,1019,2015,652301019,Colin Trevorrow,"writer,producer,director",nm1119880,jurassic world,2015,7,14056,21,215000000.0,Action,215000000


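Our final dataset combines IMDB movie metadata and ratings, Box Office Mojo revenue data, TMDB ratings and popularity scores, and The Numbers budget data (real + imputed).
We ended up with 1,927 rows built from the 1,918 collapsed movie records; the extra rows were introduced by the TMDB merge. All but one row (1,926) has a real or imputed budget. Some values still deserve a sanity check before analysis: for example, Jurassic World shows a foreign_gross of 1,019, which looks like a units problem in the source data. After that, we're ready for exploratory data analysis and modeling.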