# PHASE 2 PROJECT

## Business Objective:

Your company is launching a movie studio. You have been tasked with analyzing current trends in movies to provide 3 actionable recommendations for what types of movies the studio should produce.

### 1. Key Business Questions:

What genres tend to perform best at the box office?

Do critic ratings predict box office success?

How do budget levels correlate with box office revenue?

### 2. Data Loading and Initial Exploration
Import Libraries

In [99]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import pandas as pd
import re

### Load Datasets

In [100]:
# IMDB SQLite Database
conn = sqlite3.connect('im.db')

# Create a cursor object
cursor = conn.cursor()

# Query to list tables
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")

# Fetch and print results
tables = cursor.fetchall()
tables

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]

In [101]:
import os
print(os.path.exists('im.db'))

True
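A note on the check above: `sqlite3.connect()` implicitly creates the database file when it is missing, so an existence test is most informative when run before connecting. The sketch below illustrates that ordering with a hypothetical helper, `list_tables`, and a throwaway demo database (both are assumptions made for illustration, not part of the project files).

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the table names in an SQLite file, refusing to create a new one."""
    # Check first: sqlite3.connect() would silently create an empty file otherwise
    if not os.path.exists(db_path):
        raise FileNotFoundError(f"Database file not found: {db_path}")
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Demo on a temporary database so the sketch runs without im.db
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
demo_conn = sqlite3.connect(demo_path)
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT)")
demo_conn.commit()
demo_conn.close()

demo_tables = list_tables(demo_path)
print(demo_tables)
```

With the real data, the same helper could be called as `list_tables('im.db')`.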


In [102]:
#BOM movie gross CSV
df_bom = pd.read_csv('bom.movie_gross.csv')
df_bom.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [103]:
#Rotten Tomatoes movie info (TSV)
df_rt_info = pd.read_csv('rt.movie_info.tsv', sep='\t')
df_rt_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [104]:
#Rotten Tomatoes reviews (TSV)
df_rt_reviews = pd.read_csv('rt.reviews.tsv', sep='\t', encoding='latin1')  # latin1 avoids decode errors on bytes that are not valid UTF-8
df_rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [105]:
# TheMovieDB movies (CSV)
df_tmdb = pd.read_csv('tmdb.movies.csv')
df_tmdb.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [106]:
#The Numbers movie budgets (CSV)
df_tn_budgets = pd.read_csv('tn.movie_budgets.csv')
df_tn_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


### 3. Data Clean-up

### Cleanup of movie_basics Table In IMDB

The `movie_basics` table contains the following columns:

`['movie_id', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres']`

#### Cleanup Steps:

> **start_year**:
   - Convert to numeric, coercing missing or invalid values to `NaN`.
   
> **runtime_minutes**:
   - Convert to numeric, coercing missing or invalid values to `NaN`. Because `NaN` is a float, the column is stored as float (for example `175.0`) rather than integer.

> **genres**:
   - Replace placeholder `\N` with `NaN`.
   - Optionally fill missing genres with `'Unknown'`.

#### Notes:

- The table does not contain a `title_type` column — filtering by movie type is not required.
- After cleanup, this table will be ready to merge with other datasets on shared keys such as title or year.

In [107]:
# Connect to IMDB database
conn = sqlite3.connect('im.db')

# Load movie_basics
df_basics = pd.read_sql_query("SELECT * FROM movie_basics", conn)

# Load movie_ratings
df_ratings = pd.read_sql_query("SELECT * FROM movie_ratings", conn)

# Close connection
conn.close()

print(df_basics.columns.tolist())

# No need to filter on 'titleType'

# Convert start_year to numeric
df_basics['start_year'] = pd.to_numeric(df_basics['start_year'], errors='coerce')

# Convert runtime_minutes to numeric
df_basics['runtime_minutes'] = pd.to_numeric(df_basics['runtime_minutes'], errors='coerce')

# Optional: Clean genres column — remove '\N' or empty values
df_basics['genres'] = df_basics['genres'].replace('\\N', pd.NA)
df_basics['genres'] = df_basics['genres'].fillna('Unknown')

# Display cleaned basics
df_basics.head()

['movie_id', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres']


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
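The output above shows `runtime_minutes` as a float (`175.0`) because `NaN` cannot be stored in a plain integer column. If whole-number runtimes are preferred, one option is pandas' nullable `Int64` dtype. The sketch below uses a small made-up frame (`demo_basics` and its values are illustrative assumptions, not rows from `im.db`).

```python
import pandas as pd

# Illustrative rows only; the real data comes from movie_basics
demo_basics = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt003"],
    "runtime_minutes": ["175", None, "not available"],
})

# Coerce invalid entries to NaN, then switch to the nullable integer dtype
# so that missing values become <NA> while valid values stay whole numbers
demo_basics["runtime_minutes"] = (
    pd.to_numeric(demo_basics["runtime_minutes"], errors="coerce")
    .astype("Int64")
)

print(demo_basics.dtypes)
print(demo_basics)
```

The trade-off is that some numeric libraries expect plain float columns, so the float representation used in the notebook is also a reasonable choice.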


### Cleaning `movie_ratings` table In IMDB

The `movie_ratings` table contains movie IDs, average ratings, and number of votes.  
We will perform the following cleanup steps:

* Ensure columns are numeric.
* Filter out movies with fewer than 50 votes to avoid unreliable ratings.
* Check for missing values.

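This cleaned table will be used to explore whether **ratings predict box office success** and how ratings relate to genres and budgets. Note that IMDB's `averagerating` is built from user votes (`numvotes`), so it is a proxy for audience opinion rather than critic ratings.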

In [108]:
# Reconnect to database
conn = sqlite3.connect('im.db')

# Load movie_ratings
df_ratings = pd.read_sql_query('SELECT * FROM movie_ratings', conn)

# Close the connection once the data is in memory
conn.close()

# Check columns
print(df_ratings.columns.tolist())

# Quick look at the data
df_ratings.head()

['movie_id', 'averagerating', 'numvotes']


Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [109]:
#  Make sure types are correct
df_ratings['averagerating'] = pd.to_numeric(df_ratings['averagerating'], errors='coerce')
df_ratings['numvotes'] = pd.to_numeric(df_ratings['numvotes'], errors='coerce')

# Drop movies with fewer than 50 votes, since their average ratings can be unreliable
df_ratings = df_ratings[df_ratings['numvotes'] >= 50]

# Check for missing values
print(df_ratings.isnull().sum())

# Final shape of the cleaned ratings table
print(f"Cleaned ratings table has {df_ratings.shape[0]} movies.")
# View a few rows
df_ratings.head()

movie_id         0
averagerating    0
numvotes         0
dtype: int64
Cleaned ratings table has 36824 movies.


Unnamed: 0,movie_id,averagerating,numvotes
1,tt10384606,8.9,559
3,tt1043726,4.2,50352
5,tt1069246,6.2,326
6,tt1094666,7.0,1613
7,tt1130982,6.4,571
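Since both IMDB tables share `movie_id`, they can be combined once cleaned. The following is a minimal sketch on made-up rows (the `demo_*` frames and their values are assumptions for illustration); it applies the same 50-vote threshold and then joins on `movie_id`.

```python
import pandas as pd

# Illustrative stand-ins for df_basics and df_ratings
demo_basics = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt003"],
    "primary_title": ["Film A", "Film B", "Film C"],
    "genres": ["Action,Drama", "Comedy", "Unknown"],
})
demo_ratings = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt004"],
    "averagerating": [7.1, 5.4, 8.0],
    "numvotes": [1200, 75, 30],
})

# Same reliability filter as in the cleanup step
demo_ratings = demo_ratings[demo_ratings["numvotes"] >= 50]

# Inner join keeps only movies present in both tables;
# validate raises if movie_id is unexpectedly duplicated on either side
demo_imdb = demo_basics.merge(
    demo_ratings, on="movie_id", how="inner", validate="one_to_one"
)
print(demo_imdb)
```

An inner join drops movies without a reliable rating; a left join would keep them with missing rating values instead, which may suit genre counts better.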


### 2. BOM MOVIE GROSS CSV
   

In [110]:
#BOM movie gross CSV Data Clean up
df_bom = pd.read_csv('bom.movie_gross.csv')
df_bom.tail()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


In [111]:
df_bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [112]:
# Cast to object so the 'Unknown' placeholder used in the next step can sit alongside numbers
df_bom['domestic_gross'] = df_bom['domestic_gross'].astype('object')

In [113]:
df_bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           3387 non-null   object
 1   studio          3382 non-null   object
 2   domestic_gross  3359 non-null   object
 3   foreign_gross   2037 non-null   object
 4   year            3387 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 132.4+ KB


In [114]:
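# Work on a copy so the raw BOM data stays untouched
df_filled = df_bom.copy()

# Fill missing text fields with a placeholder
df_filled["title"] = df_filled["title"].fillna("Unknown")
df_filled["studio"] = df_filled["studio"].fillna("Unknown")

# Fill missing gross values with the same placeholder.
# Note: this keeps both gross columns as object dtype, so they must be
# converted back to numbers before any arithmetic.
df_filled["domestic_gross"] = df_filled["domestic_gross"].fillna("Unknown")
df_filled["foreign_gross"] = df_filled["foreign_gross"].fillna("Unknown")

# year has no missing values (see df_bom.info()), so this is only a safeguard
df_filled["year"] = df_filled["year"].fillna("Unknown")
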

df_filled

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,Unknown,2018
3383,Edward II (2018 re-release),FM,4800.0,Unknown,2018
3384,El Pacto,Sony,2500.0,Unknown,2018
3385,The Swan,Synergetic,2400.0,Unknown,2018


In [115]:
# Check for duplicates and view the duplicated rows
print(f"Number of duplicate rows: {df_filled.duplicated().sum()}")

# Show the duplicate rows
dupes = df_filled[df_filled.duplicated()]
print("\nDuplicate rows:")
print(dupes)

# Remove duplicates and keep the first occurrence
df_no_dupes = df_filled.drop_duplicates()

print(f"\nShape before removing duplicates: {df_filled.shape}")
print(f"Shape after removing duplicates: {df_no_dupes.shape}")
df_no_dupes

Number of duplicate rows: 0

Duplicate rows:
Empty DataFrame
Columns: [title, studio, domestic_gross, foreign_gross, year]
Index: []

Shape before removing duplicates: (3387, 5)
Shape after removing duplicates: (3387, 5)


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,Unknown,2018
3383,Edward II (2018 re-release),FM,4800.0,Unknown,2018
3384,El Pacto,Sony,2500.0,Unknown,2018
3385,The Swan,Synergetic,2400.0,Unknown,2018
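Filling the gross columns with `"Unknown"` leaves them as object dtype, which blocks sums and correlations later on. An alternative, sketched below on made-up rows, keeps missing gross values as `NaN` and converts `foreign_gross` to numeric. The frame `demo_bom` and the value with a thousands separator are assumptions for illustration; the real column may or may not contain such strings.

```python
import pandas as pd

# Illustrative rows only; one foreign_gross value has a hypothetical comma
demo_bom = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "studio": ["BV", None, "WB"],
    "domestic_gross": [415000000.0, None, 6200.0],
    "foreign_gross": ["652000000", "1,131.6", None],
})

# A placeholder is fine for text columns
demo_bom["studio"] = demo_bom["studio"].fillna("Unknown")

# Strip thousands separators, then coerce anything unparseable to NaN
demo_bom["foreign_gross"] = pd.to_numeric(
    demo_bom["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Sum the two gross columns, treating a single missing side as zero
# but leaving the total missing when both sides are missing
demo_bom["total_gross"] = demo_bom[["domestic_gross", "foreign_gross"]].sum(
    axis=1, min_count=1
)
print(demo_bom)
```

Whether a missing foreign gross should count as zero is a modelling decision; dropping those rows is the more conservative choice.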


### 3. Rotten Tomatoes reviews (TSV)
Data clean-up for the `rt.reviews.tsv` dataset.

In [116]:
#Rotten Tomatoes reviews (TSV)
df_rt_reviews = pd.read_csv('rt.reviews.tsv', sep='\t', encoding='latin1')  # latin1 avoids decode errors on bytes that are not valid UTF-8
df_rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [117]:

# Drop rows with missing values in any column; copy so later assignments
# do not trigger a SettingWithCopyWarning
df_rt_reviews = df_rt_reviews.dropna().copy()

# Inspect the data types (run before the date conversion, so date is still object)
df_rt_reviews.info()

# Parse the review date; unparseable values become NaT
df_rt_reviews['date'] = pd.to_datetime(df_rt_reviews['date'], errors='coerce')

# Add a year column derived from the review date
df_rt_reviews['year'] = df_rt_reviews['date'].dt.year
df_rt_reviews

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33988 entries, 0 to 54424
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          33988 non-null  int64 
 1   review      33988 non-null  object
 2   rating      33988 non-null  object
 3   fresh       33988 non-null  object
 4   critic      33988 non-null  object
 5   top_critic  33988 non-null  int64 
 6   publisher   33988 non-null  object
 7   date        33988 non-null  object
dtypes: int64(2), object(6)
memory usage: 2.3+ MB


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date,year
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,2018-11-10,2018
6,3,"Quickly grows repetitive and tiresome, meander...",C,rotten,Eric D. Snider,0,EricDSnider.com,2013-07-17,2013
7,3,Cronenberg is not a director to be daunted by ...,2/5,rotten,Matt Kelemen,0,Las Vegas CityLife,2013-04-21,2013
11,3,"While not one of Cronenberg's stronger films, ...",B-,fresh,Emanuel Levy,0,EmanuelLevy.Com,2013-02-03,2013
12,3,Robert Pattinson works mighty hard to make Cos...,2/4,rotten,Christian Toto,0,Big Hollywood,2013-01-15,2013
...,...,...,...,...,...,...,...,...,...
54419,2000,"Sleek, shallow, but frequently amusing.",2.5/4,fresh,Gene Seymour,1,Newsday,2002-09-27,2002
54420,2000,The spaniel-eyed Jean Reno infuses Hubert with...,3/4,fresh,Megan Turner,1,New York Post,2002-09-27,2002
54421,2000,"Manages to be somewhat well-acted, not badly a...",1.5/4,rotten,Bob Strauss,0,Los Angeles Daily News,2002-09-27,2002
54422,2000,Arguably the best script that Besson has writt...,3.5/5,fresh,Wade Major,0,Boxoffice Magazine,2002-09-27,2002


### Normalize the ratings

In [118]:
grade_scale = {
    'A+': 1.0, 'A': 0.95, 'A-': 0.9,
    'B+': 0.85, 'B': 0.8, 'B-': 0.75,
    'C+': 0.7, 'C': 0.65, 'C-': 0.6,
    'D+': 0.55, 'D': 0.5, 'D-': 0.45,
    'F': 0.0
}

In [119]:
def normalize_rating(rating):
    """Map a fraction ("3/5"), a 0-10 number, or a letter grade to a 0-1 score."""
    if pd.isnull(rating):
        return None

    rating = str(rating).strip().upper()

    # If multiple items (e.g. "7,A"), split and check each
    for part in rating.replace(" ", "").split(','):
        # Handle fraction like "3/5"
        if '/' in part:
            try:
                num, denom = part.split('/')
                return round(float(num) / float(denom), 2)
            except (ValueError, ZeroDivisionError):
                # Malformed fraction or zero denominator: try the next part
                continue

        # Handle a bare number (assuming a 10-point scale)
        try:
            num = float(part)
            if 0 <= num <= 10:
                return round(num / 10, 2)  # Normalize to 0-1 scale
        except ValueError:
            pass

        # Handle letter grades
        if part in grade_scale:
            return grade_scale[part]

    return None  # If nothing matched


In [120]:
## Cleaned RT_REVIEWS DATA FOR ANALYSIS
df_rt_reviews = df_rt_reviews.copy()
df_rt_reviews['normalized_rating'] = df_rt_reviews['rating'].apply(normalize_rating)
df_rt_reviews

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date,year,normalized_rating
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,2018-11-10,2018,0.60
6,3,"Quickly grows repetitive and tiresome, meander...",C,rotten,Eric D. Snider,0,EricDSnider.com,2013-07-17,2013,0.65
7,3,Cronenberg is not a director to be daunted by ...,2/5,rotten,Matt Kelemen,0,Las Vegas CityLife,2013-04-21,2013,0.40
11,3,"While not one of Cronenberg's stronger films, ...",B-,fresh,Emanuel Levy,0,EmanuelLevy.Com,2013-02-03,2013,0.75
12,3,Robert Pattinson works mighty hard to make Cos...,2/4,rotten,Christian Toto,0,Big Hollywood,2013-01-15,2013,0.50
...,...,...,...,...,...,...,...,...,...,...
54419,2000,"Sleek, shallow, but frequently amusing.",2.5/4,fresh,Gene Seymour,1,Newsday,2002-09-27,2002,0.62
54420,2000,The spaniel-eyed Jean Reno infuses Hubert with...,3/4,fresh,Megan Turner,1,New York Post,2002-09-27,2002,0.75
54421,2000,"Manages to be somewhat well-acted, not badly a...",1.5/4,rotten,Bob Strauss,0,Los Angeles Daily News,2002-09-27,2002,0.38
54422,2000,Arguably the best script that Besson has writt...,3.5/5,fresh,Wade Major,0,Boxoffice Magazine,2002-09-27,2002,0.70
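The table above has one row per review, while box office figures are per movie. A possible next step is to aggregate reviews to one row per movie `id`. The sketch below does this on a few made-up rows (`demo_reviews` and the output column names `mean_rating`, `fresh_share`, and `n_reviews` are assumptions chosen for illustration).

```python
import pandas as pd

# Illustrative rows shaped like the cleaned reviews table
demo_reviews = pd.DataFrame({
    "id": [3, 3, 3, 2000, 2000],
    "fresh": ["fresh", "rotten", "rotten", "fresh", "fresh"],
    "normalized_rating": [0.60, 0.65, 0.40, 0.75, 0.70],
})

# One row per movie: average normalized rating, share of fresh reviews,
# and the number of reviews behind those figures
demo_summary = (
    demo_reviews
    .assign(is_fresh=demo_reviews["fresh"].eq("fresh"))
    .groupby("id")
    .agg(
        mean_rating=("normalized_rating", "mean"),
        fresh_share=("is_fresh", "mean"),
        n_reviews=("normalized_rating", "size"),
    )
    .reset_index()
)
print(demo_summary)
```

Keeping `n_reviews` alongside the averages makes it possible to filter out movies with very few reviews, mirroring the 50-vote threshold used for the IMDB ratings.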


### 4. rt.movie_info data clean-up

In [121]:
df_rt_info = pd.read_csv('rt.movie_info.tsv', sep='\t')
df_rt_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


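We drop every row that has a missing value in any column so that the data is ready for EDA. This is a strict filter: `currency` and `box_office` have only 340 non-null entries, so at most 340 of the 1,560 rows can remain.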

In [122]:
df_rt_info = df_rt_info.dropna()
df_rt_info

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One
6,10,Some cast and crew from NBC's highly acclaimed...,PG-13,Comedy,Jake Kasdan,Mike White,"Jan 11, 2002","Jun 18, 2002",$,41032915,82 minutes,Paramount Pictures
7,13,"Stewart Kane, an Irishman living in the Austra...",R,Drama,Ray Lawrence,Raymond Carver|Beatrix Christian,"Apr 27, 2006","Oct 2, 2007",$,224114,123 minutes,Sony Pictures Classics
15,22,Two-time Academy Award Winner Kevin Spacey giv...,R,Comedy|Drama|Mystery and Suspense,George Hickenlooper,Norman Snider,"Dec 17, 2010","Apr 5, 2011",$,1039869,108 minutes,ATO Pictures
18,25,"From ancient Japan's most enduring tale, the e...",PG-13,Action and Adventure|Drama|Science Fiction and...,Carl Erik Rinsch,Chris Morgan|Hossein Amini,"Dec 25, 2013","Apr 1, 2014",$,20518224,127 minutes,Universal Pictures
...,...,...,...,...,...,...,...,...,...,...,...,...
1530,1968,"This holiday season, acclaimed filmmaker Camer...",PG,Comedy|Drama,Cameron Crowe,Aline Brosh McKenna|Cameron Crowe,"Dec 23, 2011","Apr 3, 2012",$,72700000,126 minutes,20th Century Fox
1537,1976,"Embrace of the Serpent features the encounter,...",NR,Action and Adventure|Art House and International,Ciro Guerra,Ciro Guerra|Jacques Toulemonde Vidal,"Feb 17, 2016","Jun 21, 2016",$,1320005,123 minutes,Buffalo Films
1541,1980,A band of renegades on the run in outer space ...,PG-13,Action and Adventure|Science Fiction and Fantasy,Joss Whedon,Joss Whedon,"Sep 30, 2005","Dec 20, 2005",$,25335935,119 minutes,Universal Pictures
1542,1981,"Money, Fame and the Knowledge of English. In I...",NR,Comedy|Drama,Gauri Shinde,Gauri Shinde,"Oct 5, 2012","Nov 20, 2012",$,1416189,129 minutes,Eros Entertainment
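Several columns in this table are still strings after `dropna()`: `box_office` and `runtime` are object dtype, and `genre` packs multiple genres into one pipe-separated value. The sketch below shows one way to make them analysis-ready on two made-up rows. The frame `demo_info`, the new column `runtime_minutes`, and the assumption that `box_office` may contain thousands separators are all illustrative.

```python
import pandas as pd

# Illustrative rows shaped like rt.movie_info
demo_info = pd.DataFrame({
    "id": [3, 10],
    "genre": ["Drama|Science Fiction and Fantasy", "Comedy"],
    "box_office": ["600,000", "41,032,915"],
    "runtime": ["108 minutes", "82 minutes"],
})

# Strip thousands separators (an assumption about the raw format) and convert
demo_info["box_office"] = pd.to_numeric(
    demo_info["box_office"].str.replace(",", "", regex=False), errors="coerce"
)

# Pull the leading number out of strings such as "108 minutes"
demo_info["runtime_minutes"] = pd.to_numeric(
    demo_info["runtime"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# One row per (movie, genre) pair so genres can be grouped individually
demo_genres = (
    demo_info.assign(genre=demo_info["genre"].str.split("|"))
    .explode("genre")
    .reset_index(drop=True)
)
print(demo_genres[["id", "genre", "box_office", "runtime_minutes"]])
```

After exploding, a movie with two genres contributes its box office to both, which is worth keeping in mind when comparing genre totals.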


### 5. Movie Budgets (tn.movie_budgets.csv)

Two issues need fixing before analysis:

- The monetary columns (`production_budget`, `domestic_gross`, `worldwide_gross`) are strings containing `$` and commas, so they must be converted to numeric.
- `release_date` is a string and needs to be parsed as a datetime object.

In [123]:
# Load the dataset
df_budget = pd.read_csv('tn.movie_budgets.csv')

print(df_budget.columns)
print(df_budget.head())
df_budget.info()

Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')
   id  release_date                                        movie  \
0   1  Dec 18, 2009                                       Avatar   
1   2  May 20, 2011  Pirates of the Caribbean: On Stranger Tides   
2   3   Jun 7, 2019                                 Dark Phoenix   
3   4   May 1, 2015                      Avengers: Age of Ultron   
4   5  Dec 15, 2017            Star Wars Ep. VIII: The Last Jedi   

  production_budget domestic_gross worldwide_gross  
0      $425,000,000   $760,507,625  $2,776,345,279  
1      $410,600,000   $241,063,875  $1,045,663,875  
2      $350,000,000    $42,762,350    $149,762,350  
3      $330,600,000   $459,005,868  $1,403,013,963  
4      $317,000,000   $620,181,382  $1,316,721,747  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  D

Clean monetary columns
The columns `production_budget`, `domestic_gross`, and `worldwide_gross` are strings containing `$` and `,`. Strip both characters, then cast to integer:

In [124]:
money_cols = ['production_budget', 'domestic_gross', 'worldwide_gross']

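for col in money_cols:
    df_budget[col] = (
        df_budget[col]
        .replace(r'[\$,]', '', regex=True)  # raw string avoids an invalid-escape warning
        .astype('Int64')  # nullable integer; use float instead if needed
    )
    # Printed inside the loop, so the frame is shown after each column is converted
    print(df_budget)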

      id  release_date                                        movie  \
0      1  Dec 18, 2009                                       Avatar   
1      2  May 20, 2011  Pirates of the Caribbean: On Stranger Tides   
2      3   Jun 7, 2019                                 Dark Phoenix   
3      4   May 1, 2015                      Avengers: Age of Ultron   
4      5  Dec 15, 2017            Star Wars Ep. VIII: The Last Jedi   
...   ..           ...                                          ...   
5777  78  Dec 31, 2018                                       Red 11   
5778  79   Apr 2, 1999                                    Following   
5779  80  Jul 13, 2005                Return to the Land of Wonders   
5780  81  Sep 29, 2015                         A Plague So Pleasant   
5781  82   Aug 5, 2005                            My Date With Drew   

      production_budget domestic_gross worldwide_gross  
0             425000000   $760,507,625  $2,776,345,279  
1             410600000   $241,06

In [125]:
# Convert release date
df_budget['release_date'] = pd.to_datetime(df_budget['release_date'], errors='coerce')

# Create a year column
df_budget['release_year'] = df_budget['release_date'].dt.year
df_budget.info()
print(df_budget.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   Int64         
 4   domestic_gross     5782 non-null   Int64         
 5   worldwide_gross    5782 non-null   Int64         
 6   release_year       5782 non-null   int64         
dtypes: Int64(3), datetime64[ns](1), int64(2), object(1)
memory usage: 333.3+ KB
   id release_date                                        movie  \
0   1   2009-12-18                                       Avatar   
1   2   2011-05-20  Pirates of the Caribbean: On Stranger Tides   
2   3   2019-06-07                                 Dark Phoenix   
3   4   2015-05-01                      Avengers: 

In [126]:
# Save cleaned version
df_budget.to_csv('cleaned_movie_budgets.csv', index=False)
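With the monetary columns numeric, the third business question (how budget relates to revenue) can be approached directly. The sketch below is one possible first pass on made-up numbers. The frame `demo_budget`, the derived columns `profit` and `roi`, and the choice to exclude zero-budget rows are assumptions for illustration, not results from the real data.

```python
import pandas as pd

# Illustrative figures only, not taken from tn.movie_budgets.csv
demo_budget = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C", "Film D"],
    "production_budget": [100, 200, 50, 0],
    "worldwide_gross": [300, 150, 200, 10],
})

# Exclude zero budgets so the ROI division is well defined
demo_budget = demo_budget[demo_budget["production_budget"] > 0].copy()

# Profit in absolute terms, and return on investment relative to budget
demo_budget["profit"] = (
    demo_budget["worldwide_gross"] - demo_budget["production_budget"]
)
demo_budget["roi"] = demo_budget["profit"] / demo_budget["production_budget"]

# Pearson correlation between budget and worldwide gross
budget_gross_corr = demo_budget["production_budget"].corr(
    demo_budget["worldwide_gross"]
)

print(demo_budget)
print(f"Budget vs. worldwide gross correlation: {budget_gross_corr:.2f}")
```

On the real table, the same steps would apply to `df_budget`; a correlation coefficient only summarizes a linear relationship, so a scatter plot is a useful companion.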