# IMDB DATASET

#### Introduction 
The cleaned IMDB dataset provides standardized information on movie titles, genres, runtime, release year, ratings, and vote counts, prepared for analysis. The cleaning process involved merging relevant tables, handling missing values, correcting data types, and removing duplicates to improve data quality and consistency.

This refined dataset supports reliable analysis of movie characteristics and audience reception and will be used alongside other sources to identify trends and inform recommendations on the types of films the studio should produce.

#### Import libraries

In [1]:
import pandas as pd
import sqlite3

#### PATH 

In [2]:
data_path = '../data/zippedData/' # Set data path

#### Connect to IMDB database

In [6]:
conn = sqlite3.connect(data_path + "im.db")

This opens the IMDB SQLite database (im.db) and creates a connection object named conn, which the queries that follow reuse.
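Before querying, it can help to confirm which tables the connection exposes. The sketch below is a minimal illustration that uses an in-memory database as a stand-in for im.db; the `demo_conn` name and the two empty demo tables are assumptions made for this example only.

```python
import sqlite3

# In-memory database standing in for im.db (demo data only, not the real file)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master is SQLite's built-in catalog of the objects stored in a database
rows = demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)

demo_conn.close()
```

Running the same SELECT against conn would list the tables actually present in im.db.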

# LOADING IMDB DATA

The IMDB dataset consists of two separate tables: movie_basics and movie_ratings. These tables were merged using the movie_id column to create a unified dataset containing movie titles, genres, runtime, and ratings. Merging the datasets allows for comprehensive analysis of movie characteristics and performance.

#### Load first IMDB table (movie_basics)

In [7]:
movie_basics = pd.read_sql("SELECT * FROM movie_basics", conn)
movie_basics.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


#### Load second IMDB table (movie_ratings)

In [8]:
movie_ratings = pd.read_sql("SELECT * FROM movie_ratings", conn)
movie_ratings.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


#### Merge both tables

In [9]:
imdb_movies = movie_basics.merge(movie_ratings, on="movie_id")
imdb_movies.head()


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119
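Because `merge` defaults to an inner join, titles without a matching rating drop out of the result. The sketch below, on made-up toy frames (`basics` and `ratings` are hypothetical names), shows one way to check the key relationship with `validate` and to count unmatched rows with `indicator`.

```python
import pandas as pd

# Toy stand-ins for movie_basics and movie_ratings (made-up values)
basics = pd.DataFrame(
    {"movie_id": ["tt1", "tt2", "tt3"], "primary_title": ["A", "B", "C"]}
)
ratings = pd.DataFrame({"movie_id": ["tt1", "tt3"], "averagerating": [7.0, 6.5]})

# validate raises MergeError if movie_id is repeated on either side
inner = basics.merge(ratings, on="movie_id", validate="one_to_one")

# An outer join with indicator=True labels where each row came from
outer = basics.merge(ratings, on="movie_id", how="outer", indicator=True)
unmatched = int((outer["_merge"] != "both").sum())

print(len(inner), "matched rows;", unmatched, "row(s) without a match")
```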


# CLEANING

In [10]:
imdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   averagerating    73856 non-null  float64
 7   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB


The imdb_movies.info() function was used to inspect the dataset structure and identify missing values. Columns such as runtime_minutes and genres had fewer non-null entries compared to the total number of rows, indicating missing data that required cleaning before analysis.
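Reading missing counts off `info()` requires subtracting by eye. As a minimal sketch on a made-up frame (`demo` is a hypothetical name), `isna()` gives the same information directly, both as counts and as percentages.

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns that have gaps (made-up values)
demo = pd.DataFrame(
    {
        "runtime_minutes": [175.0, 114.0, np.nan, 80.0],
        "genres": ["Drama", None, "Comedy", "Drama"],
    }
)

# isna() marks missing cells; sum() counts them, mean() gives the share
summary = pd.DataFrame(
    {
        "missing": demo.isna().sum(),
        "percent": (demo.isna().mean() * 100).round(1),
    }
)
print(summary)
```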

#### 1. Remove columns we don’t need
We don’t need original_title.

In [11]:
imdb_movies = imdb_movies.drop(columns=['original_title'])

In [12]:
imdb_movies.columns

Index(['movie_id', 'primary_title', 'start_year', 'runtime_minutes', 'genres',
       'averagerating', 'numvotes'],
      dtype='object')

#### 2. Handle missing runtime
runtime_minutes has 66,236 non-null values out of 73,856 rows, so 7,620 values (about 10%) are missing.

Runtime cannot be analyzed for rows where it is missing, so those rows are removed:

In [13]:
imdb_movies = imdb_movies.dropna(subset=['runtime_minutes'])

The runtime_minutes column contained several missing values. Since runtime is a key variable in analyzing movie performance and the dataset remained sufficiently large after removal, rows with missing runtime values were dropped to ensure accuracy and consistency in analysis.
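To back the claim that the dataset stays large enough, the row count can be compared before and after the drop. This is a minimal sketch on made-up data; `rows_before`, `rows_after`, and `retained` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing runtime (made-up values)
demo = pd.DataFrame(
    {
        "movie_id": ["tt1", "tt2", "tt3", "tt4"],
        "runtime_minutes": [175.0, np.nan, 122.0, 80.0],
    }
)

rows_before = len(demo)
demo = demo.dropna(subset=["runtime_minutes"])  # drop only rows missing runtime
rows_after = len(demo)

retained = rows_after / rows_before
print(f"dropped {rows_before - rows_after} row(s), kept {retained:.0%}")
```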

#### 3. Remove missing genres

In [14]:
imdb_movies = imdb_movies.dropna(subset=['genres'])

The genres column contained missing values. Since genre is a key variable for identifying movie categories and determining performance by film type, rows with missing genre information were removed to ensure accurate analysis.

#### 4. Convert runtime to integer

In [15]:
imdb_movies['runtime_minutes'] = imdb_movies['runtime_minutes'].astype(int)

The runtime_minutes column was converted from float to integer to reflect actual movie runtime values in whole minutes and ensure consistency for analysis.
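The order of the steps matters here: `astype(int)` only works because the rows with missing runtime were dropped first. The sketch below, on a made-up Series, shows the failure on NaN and one alternative, the nullable `Int64` dtype, which keeps missing values instead of requiring their removal.

```python
import numpy as np
import pandas as pd

runtime = pd.Series([175.0, np.nan, 80.0])  # made-up values with one gap

# Plain int cannot represent NaN, so the cast raises
try:
    runtime.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable extension dtype stores the gap as <NA>
nullable = runtime.astype("Int64")
print(cast_failed, nullable.tolist())
```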

#### 5. Remove duplicates

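Sometimes the same movie appears more than once. Duplicates are identified here by primary_title, which also removes distinct films that happen to share a title.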

In [16]:
imdb_movies = imdb_movies.drop_duplicates(subset='primary_title')
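Which columns define a duplicate is a design choice. The sketch below compares three candidate keys on made-up rows; it is an illustration of the trade-off, not a statement about what the real data contains.

```python
import pandas as pd

# Two different films share the title "Home" (made-up rows)
demo = pd.DataFrame(
    {
        "movie_id": ["tt1", "tt2", "tt3"],
        "primary_title": ["Home", "Home", "Alone"],
        "start_year": [2013, 2016, 2015],
    }
)

by_title = demo.drop_duplicates(subset="primary_title")  # keeps first "Home" only
by_title_year = demo.drop_duplicates(subset=["primary_title", "start_year"])
by_id = demo.drop_duplicates(subset="movie_id")

print(len(by_title), len(by_title_year), len(by_id))
```

Deduplicating on title alone is the strictest option; adding start_year or using movie_id keeps remakes and same-named films as separate rows.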

#### 6. Final cleaned dataset check

In [17]:
imdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 62444 entries, 0 to 73852
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         62444 non-null  object 
 1   primary_title    62444 non-null  object 
 2   start_year       62444 non-null  int64  
 3   runtime_minutes  62444 non-null  int32  
 4   genres           62444 non-null  object 
 5   averagerating    62444 non-null  float64
 6   numvotes         62444 non-null  int64  
dtypes: float64(1), int32(1), int64(2), object(3)
memory usage: 3.6+ MB


- no missing runtime

- no missing genres

- clean dataset

#### 7. Convert genres to lowercase

In [18]:
imdb_movies['genres'] = imdb_movies['genres'].str.lower()
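Each genres value is still a comma-separated string, so per-genre analysis needs one more step. As a hedged sketch of what that could look like later (the `exploded` and `genre_counts` names are hypothetical, and the rows are a small made-up sample), `str.split` plus `explode` yields one row per film and genre.

```python
import pandas as pd

# Small sample in the lowercased format produced by the step above
demo = pd.DataFrame(
    {
        "primary_title": ["Sunghursh", "The Other Side of the Wind"],
        "genres": ["action,crime,drama", "drama"],
    }
)

# Split each string into a list, then give every genre its own row
exploded = demo.assign(genre=demo["genres"].str.split(",")).explode("genre")
genre_counts = exploded["genre"].value_counts()
print(genre_counts)
```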

The IMDB dataset has been cleaned by removing unnecessary columns, dropping rows with missing values, converting data types, removing duplicate titles, and lowercasing genres, so it is consistent and suitable for analysis.

In [21]:
imdb_movies_cleaned = imdb_movies.copy()
imdb_movies_cleaned.to_csv('../data/cleanedData/imdb_cleaned_data.csv', index = False)


# CLEANING ROTTEN TOMATOES 


Rotten Tomatoes comes with two separate datasets, and each serves a different purpose.

Think of them like this:

rt.movie_info → information about the movie itself

rt.reviews → what critics said about the movie

They complement each other.

# Load datasets

In [4]:
data_path = '../data/zippedData/'  # Set data path


In [7]:
import pandas as pd


In [8]:
rt_movies = pd.read_csv(
    data_path + "rt.movie_info.tsv.gz",
    sep="\t",
    encoding="latin-1"
)

rt_reviews = pd.read_csv(
    data_path + "rt.reviews.tsv.gz",
    sep="\t",
    encoding="latin-1"
)


#### Confirm they load

In [9]:
rt_movies.head()


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [10]:
rt_reviews.head()


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


#### Next, inspect the structure

In [11]:
rt_movies.shape

(1560, 12)

In [12]:
rt_reviews.shape

(54432, 8)

In [13]:
rt_movies.columns

Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

In [14]:
rt_reviews.columns

Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')

In [15]:
rt_movies.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [16]:
rt_reviews.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


# 1 — Clean rt_movies

#### Copy dataset

This keeps the original dataset as the source of truth and avoids accidentally overwriting it.

In [22]:
rt_movies_clean = rt_movies.copy()


#### Fix column formats

In [23]:
rt_movies_clean['theater_date'] = pd.to_datetime(rt_movies_clean['theater_date'], errors='coerce')
rt_movies_clean['dvd_date'] = pd.to_datetime(rt_movies_clean['dvd_date'], errors='coerce')
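With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising, so bad entries end up counted as missing. A minimal sketch on made-up strings in the same "Oct 9, 1971" style:

```python
import pandas as pd

# Two valid dates, one missing value, one unparseable string (made-up sample)
raw = pd.Series(["Oct 9, 1971", "Aug 17, 2012", None, "not a date"])

# Unparseable entries are converted to NaT rather than raising an error
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)
print("NaT count:", int(parsed.isna().sum()))
```

Comparing the `NaT` count with the original missing count is a quick way to see whether coercion silently discarded any malformed dates.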


#### Convert runtime to numeric

In [25]:
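# Raw string so \d reaches the regex engine unescaped; expand=False returns a Series
rt_movies_clean['runtime'] = rt_movies_clean['runtime'].str.extract(r'(\d+)', expand=False)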
rt_movies_clean['runtime'] = pd.to_numeric(rt_movies_clean['runtime'], errors='coerce')
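To make the two-step conversion concrete, here is a minimal sketch on made-up values in the "104 minutes" format; `raw` and `minutes` are names used only for this example.

```python
import pandas as pd

raw = pd.Series(["104 minutes", "200 minutes", None])  # made-up sample

# Step 1: pull out the first run of digits; step 2: convert text to numbers
digits = raw.str.extract(r"(\d+)", expand=False)
minutes = pd.to_numeric(digits, errors="coerce")
print(minutes.tolist())
```

The result is a float column because the missing entry is stored as NaN.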


#### Convert box office to numeric

In [26]:
# Raw string avoids an invalid-escape warning for \$ in newer Python versions
rt_movies_clean['box_office'] = rt_movies_clean['box_office'].str.replace(r'[\$,]', '', regex=True)
rt_movies_clean['box_office'] = pd.to_numeric(rt_movies_clean['box_office'], errors='coerce')
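The character class `[\$,]` matches either a dollar sign or a comma, so both are removed in one pass. A short sketch on made-up amounts (the exact formatting of the real column is an assumption here):

```python
import pandas as pd

raw = pd.Series(["600,000", "$1,039,869", None])  # made-up sample

# Remove currency symbols and thousands separators, then convert
cleaned = raw.str.replace(r"[\$,]", "", regex=True)
box_office = pd.to_numeric(cleaned, errors="coerce")
print(box_office.tolist())
```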


#### Standardize text columns

In [27]:
text_cols = ['synopsis','genre','director','writer','studio','rating']

for col in text_cols:
    rt_movies_clean[col] = rt_movies_clean[col].str.strip()


#### Remove duplicates

In [28]:
rt_movies_clean = rt_movies_clean.drop_duplicates()


#### Inspect missing values

In [29]:
rt_movies_clean.isna().sum()


id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64

# 2 — Clean rt_reviews

#### Copy dataset

In [30]:
rt_reviews_clean = rt_reviews.copy()


#### Convert rating to numeric where possible

In [31]:
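# Keep the leading number only (e.g. "3/5" -> 3.0); raw string avoids an invalid-escape warning
rt_reviews_clean['rating_num'] = rt_reviews_clean['rating'].str.extract(r'(\d+\.?\d*)', expand=False)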
rt_reviews_clean['rating_num'] = pd.to_numeric(rt_reviews_clean['rating_num'], errors='coerce')
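Extracting only the first number discards the scale, so "3/5" and "3/10" would both become 3.0. One possible refinement, sketched below on made-up ratings, is to capture numerator and denominator and store the ratio; ratings without a numeric fraction (letter grades are assumed here as an example) stay missing. The `normalized` name is hypothetical.

```python
import pandas as pd

raw = pd.Series(["3/5", "7/10", "B+", "4.5/5", None])  # made-up sample

# Capture "score/scale"; rows that do not match yield NaN in both columns
parts = raw.str.extract(r"^\s*(\d+\.?\d*)\s*/\s*(\d+\.?\d*)\s*$")
score = pd.to_numeric(parts[0], errors="coerce")
scale = pd.to_numeric(parts[1], errors="coerce")

# Guard against a zero denominator, then express each rating on a 0-1 scale
normalized = (score / scale.where(scale > 0)).round(3)
print(normalized.tolist())
```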


#### Convert top_critic to category

In [32]:
rt_reviews_clean['top_critic'] = rt_reviews_clean['top_critic'].astype('category')


#### Convert date to datetime

In [33]:
rt_reviews_clean['date'] = pd.to_datetime(rt_reviews_clean['date'], errors='coerce')


#### Clean text fields

In [34]:
text_cols_reviews = ['review','critic','publisher']

for col in text_cols_reviews:
    rt_reviews_clean[col] = rt_reviews_clean[col].str.strip()


#### Remove duplicates

In [35]:
rt_reviews_clean = rt_reviews_clean.drop_duplicates()


#### Inspect missing

In [36]:
rt_reviews_clean.isna().sum()


id                0
review         5556
rating        13516
fresh             0
critic         2713
top_critic        0
publisher       309
date              0
rating_num    19984
dtype: int64

# 3 — Create master Rotten Tomatoes dataset

Now we join the two tables on the shared id column, using a left join so that every review is kept.

In [37]:
rt_master = pd.merge(rt_reviews_clean, rt_movies_clean, on='id', how='left')


In [38]:
rt_master.shape

(54423, 20)
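Both tables have a column named rating, so pandas appends its default `_x` and `_y` suffixes in the merged result. As a sketch of a more readable alternative, the `suffixes` argument can name them explicitly; the suffix names and toy rows below are assumptions for illustration, and `validate` checks that each id appears once on the movie side.

```python
import pandas as pd

# Toy stand-ins: several reviews per movie, one info row per movie
reviews = pd.DataFrame({"id": [3, 3, 5], "rating": ["3/5", None, "B"]})
movies = pd.DataFrame({"id": [3, 5], "rating": ["R", "R"]})

merged = pd.merge(
    reviews,
    movies,
    on="id",
    how="left",                       # keep every review
    suffixes=("_review", "_movie"),   # hypothetical, clearer than _x / _y
    validate="many_to_one",           # raises if movie ids are duplicated
)
print(list(merged.columns))
```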

In [39]:
rt_master.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54423 entries, 0 to 54422
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            54423 non-null  int64         
 1   review        48867 non-null  object        
 2   rating_x      40907 non-null  object        
 3   fresh         54423 non-null  object        
 4   critic        51710 non-null  object        
 5   top_critic    54423 non-null  category      
 6   publisher     54114 non-null  object        
 7   date          54423 non-null  datetime64[ns]
 8   rating_num    34439 non-null  float64       
 9   synopsis      54291 non-null  object        
 10  rating_y      54337 non-null  object        
 11  genre         54336 non-null  object        
 12  director      48984 non-null  object        
 13  writer        45197 non-null  object        
 14  theater_date  53197 non-null  datetime64[ns]
 15  dvd_date      53197 non-null  dateti

In [40]:
rt_master.head()


Unnamed: 0,id,review,rating_x,fresh,critic,top_critic,publisher,date,rating_num,synopsis,rating_y,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,2018-11-10,3.0,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,2013-01-01,$,600000.0,108.0,Entertainment One
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,2018-05-23,,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,2013-01-01,$,600000.0,108.0,Entertainment One
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,2018-01-04,,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,2013-01-01,$,600000.0,108.0,Entertainment One
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,2017-11-16,,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,2013-01-01,$,600000.0,108.0,Entertainment One
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,2017-10-12,,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,2013-01-01,$,600000.0,108.0,Entertainment One


# 4 — Save cleaned dataset

In [43]:
rt_master.to_csv('../data/cleanedData/rt_master_cleaned.csv', index=False)
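CSV stores plain text, so the datetime and category dtypes set during cleaning are not preserved in the file. The sketch below uses an in-memory buffer in place of the real path and a made-up one-row frame to show how they could be restored when the file is read back.

```python
import io

import pandas as pd

# One made-up row with the two non-default dtypes used in rt_master
demo = pd.DataFrame(
    {
        "id": [3],
        "date": pd.to_datetime(["2018-11-10"]),
        "top_critic": pd.Categorical([0]),
    }
)

buffer = io.StringIO()  # stands in for the CSV file on disk
demo.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dtypes are inferred from text only

buffer.seek(0)
restored = pd.read_csv(
    buffer, parse_dates=["date"], dtype={"top_critic": "category"}
)
print(plain.dtypes, restored.dtypes, sep="\n\n")
```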
