# Data-Driven Insights for a Modern Film Studio


# 1. Business Understanding

## 1.1 Overview
The global movie industry has undergone major shifts in recent years, with changes in consumer behavior, streaming trends, and content production strategies. Streaming platforms such as Netflix and Prime are reshaping how content is consumed and evaluated. While theatrical revenue remains an important success metric, online ratings, platform distribution and viewer engagement have become just as critical in predicting and evaluating a movie's performance.

As a new entrant in the movie production industry, understanding what drives a movie’s commercial success is crucial for making data-driven decisions in production, marketing, and distribution.

This project draws on historical movie data to identify what drives both financial and audience success, and to show how that knowledge can inform business strategy for a potential new film studio.


## 1.2 Background
With increased competition from both traditional cinema and digital streaming platforms, studios need to optimize decisions around:

* Genre selection

* Budget allocation

* Casting

* Timing of releases

* Marketing focus

* Platform strategy - Balancing between the box office and streaming platforms

By analyzing both theatrical and streaming success, studios can better navigate this complex, hybrid distribution landscape. Data from past movies including box office revenue, ratings, genres, and production details can reveal patterns and predictors of success.

This analysis will serve as a proof of concept for how a data-driven approach can improve Return on Investment (ROI) and reduce risk in movie production.


## 1.3 Objectives
The primary objectives of this project are:

1. To identify the key factors that contribute to a movie's success, as measured by revenue, ratings, box office performance and streaming performance

2. To provide data-backed recommendations for genre selection, ideal budgets, release strategy and cast decisions

3. To build visualizations and models that support strategic decisions for a new movie production company


## 1.4 Problem Statement

A new movie production company is seeking to make informed decisions about:

* What types of movies to produce (genre, language, duration)

* How much budget to allocate

* When to release their movies

* Which actors or directors are most associated with successful projects

* Whether to prioritize theatrical releases, streaming or a hybrid model

The challenge is to analyze historical movie data to find patterns that can help predict which factors lead to higher box office performance or audience engagement.


## 1.5 Metrics of Success

* Identification of top 5 features most correlated with success metrics

* Creation of visual dashboards to communicate findings clearly

* Development of a simple predictive model for revenue or rating to estimate movie success

* Strategic, business-friendly recommendations based on findings

* Well-documented collaboration and communication via GitHub, Trello, and reporting tools


## 1.6 Tools & Technologies
- Python (Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn)

- Jupyter Notebooks for analysis

- Git & GitHub for version control and collaboration

- Trello for project management and workflow tracking

- Google Docs for the final data report

- Google Slides for the presentation

- Tableau for the interactive dashboard


## 1.7 Stakeholders

The primary stakeholder is the founding team of the new movie production studio.

Secondary stakeholders include potential investors, marketing consultants, streaming partners and creative directors.

# 2. Data Understanding

In order to uncover what drives a movie’s success, both financially and in terms of audience reception, we must first develop a thorough understanding of the data at our disposal. This section explores the structure, scope, and quality of the datasets used for analysis.

Our data sources include publicly available movie datasets with information on:

* **Movie metadata**: title, genre, release date, language, runtime, production companies

* **Financial data**: production budget, box office revenue (domestic & worldwide)

* **Ratings**: IMDb scores, Rotten Tomatoes critic/audience ratings

* **Streaming availability**: whether the movie was released theatrically, via streaming, or both

* **Cast and crew**: actors, directors, and producers

The movie datasets are drawn from: 
* [Box Office Mojo](https://www.boxofficemojo.com/)
* [IMDb](https://www.imdb.com/)
* [Rotten Tomatoes](https://www.rottentomatoes.com/)
* [The Movie DB](https://www.themoviedb.org/)
* [The Numbers](https://www.the-numbers.com/)

The datasets are in the following formats:
* bom.movie_gross.csv (CSV File)
* im.db (SQLite Database)
* rt.movie_info.tsv (TSV File)
* rt.reviews.tsv (TSV File)
* tmdb.movies.csv (CSV File)
* tn.movie_budgets.csv (CSV File)

By reviewing the attributes of these datasets and exploring initial patterns, we aim to:

* *Identify which variables are relevant to our business goals and their data types*
* *Detect any missing, duplicated, or inconsistent records*
* *Gain early insights into data trends that may inform modeling later on*

This understanding will serve as the foundation for cleaning, feature engineering, and deeper analysis in the next stages of the project.

## 2.1 Importing Essential Libraries

Before diving into data exploration, we need to import the key Python libraries that will support our data analysis, visualization, and modeling tasks.

In [3]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For dealing with SQLite Databases
import sqlite3

# For data cleaning 
from datetime import datetime 
import re   

import warnings
warnings.filterwarnings("ignore")

## 2.2 Initial Data Exploration of Box Office Mojo Data

In this section, we begin exploring the raw box office dataset to understand its structure, completeness, and key variables. This includes checking for missing values, data types, duplicates, and overall distribution of records.

In [4]:
# load the box office mojo dataset
bom_df = pd.read_csv("Data/bom.movie_gross.csv")

In [5]:
# preview the first 5 rows
bom_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [6]:
#preview the last 5 rows
bom_df.tail()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


In [7]:
# get the shape of the dataframe
print(f"The box office mojo dataset has {bom_df.shape[0]} rows.")
print(f"The box office mojo dataset has {bom_df.shape[1]} columns.")

The box office mojo dataset has 3387 rows.
The box office mojo dataset has 5 columns.


In [8]:
# summary information on the dataframe
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [9]:
# creating an overview of key summary data on the dataset
data_dict = pd.DataFrame({
    "Column Name": bom_df.columns,
    "Data Type": bom_df.dtypes,
    "Missing Values": bom_df.isnull().sum(),
    "Unique Values": bom_df.nunique()
})

data_dict

Unnamed: 0,Column Name,Data Type,Missing Values,Unique Values
title,title,object,0,3386
studio,studio,object,5,257
domestic_gross,domestic_gross,float64,28,1797
foreign_gross,foreign_gross,object,1350,1204
year,year,int64,0,9


In [10]:
#checking for duplicates
print(f"The Box Office Mojo dataset has {bom_df.duplicated().sum()} duplicates.")

The Box Office Mojo dataset has 0 duplicates.


## Observations

The dataframe comprises **3387 rows** and **5 columns**.

Inspection of the first 5 and last 5 rows suggests the dataset is consistently structured from top to bottom.

### Columns

The dataset contains 5 columns with the following information:
* The movie's title, the studio behind it and the year it was released
* Gross earnings from the movie, split into the domestic market and the foreign market, which lets us compare a movie's performance across the two markets

### Data Types 

The summary information of the dataframe reveals two broad kinds of data:
* Categorical Data (3 object columns: title, studio and foreign_gross)
* Numerical Data (2 columns: domestic_gross and year) 

The following data cleaning will be needed:
 * Change the year column from a numerical datatype (int) to datetime
 * Change the foreign_gross column from a categorical datatype (object) to a numerical datatype (float), since it holds revenue figures stored as text
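
The two conversions listed above could be sketched roughly as follows. This is a minimal illustration rather than the project's actual cleaning code: `sample_bom` and `clean_bom` are hypothetical names, the first three rows copy values from the previews above, and the comma-formatted string in the last row is an invented example of the kind of value that would force the column to `object`.

```python
import pandas as pd

# Small stand-in for bom_df. The last row is hypothetical and only
# illustrates a comma-formatted number stored as text.
sample_bom = pd.DataFrame({
    "title": ["Toy Story 3", "Inception", "The Quake", "Hypothetical Film"],
    "domestic_gross": [415000000.0, 292600000.0, 6200.0, 1000000.0],
    "foreign_gross": ["652000000", "535700000", None, "1,010.0"],
    "year": [2010, 2010, 2018, 2015],
})


def clean_bom(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with foreign_gross as float and year as datetime."""
    out = df.copy()
    # Strip thousands separators, then coerce unparseable values to NaN
    out["foreign_gross"] = pd.to_numeric(
        out["foreign_gross"].str.replace(",", "", regex=False),
        errors="coerce",
    )
    # Interpret the integer year as 1 January of that year
    out["year"] = pd.to_datetime(out["year"].astype(str), format="%Y")
    return out


cleaned_bom = clean_bom(sample_bom)
print(cleaned_bom.dtypes)
```

Using `errors="coerce"` keeps the sketch tolerant of unexpected strings, at the cost of silently turning them into missing values, so the count of nulls should be compared before and after the conversion.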

### Missing Data

3 columns have null values as follows:
  * *studio*: 5 missing values
  * *domestic_gross*: 28 missing values
  * *foreign_gross*: 1,350 missing values (about 40% of rows)

## 2.3 Initial Data Exploration of Rotten Tomatoes Data

In this section, we begin exploring the raw Rotten Tomatoes datasets to understand their structure, completeness, and key variables. This includes checking for missing values, data types, duplicates, and overall distribution of records.

### 2.3.1 Exploration of Movie Information Dataset

In [11]:
# loading the Rotten Tomatoes dataset
rtinfo_df = pd.read_csv("Data/rt.movie_info.tsv", delimiter="\t")

In [12]:
#checking the first 5 rows
rtinfo_df.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [13]:
#checking the last 5 rows
rtinfo_df.tail()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


In [14]:
# get the shape of the dataframe
print(f"The Rotten Tomatoes Movie Info dataset has {rtinfo_df.shape[0]} rows.")
print(f"The Rotten Tomatoes Movie Info dataset has {rtinfo_df.shape[1]} columns.")

The Rotten Tomatoes Movie Info dataset has 1560 rows.
The Rotten Tomatoes Movie Info dataset has 12 columns.


In [15]:
# summary information on the dataframe
rtinfo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [16]:
#getting a summary overview of data

data_dict = pd.DataFrame({
    "Data Type": rtinfo_df.dtypes,
    "Missing Values": rtinfo_df.isnull().sum(),
    "Unique Values": rtinfo_df.nunique()
})

data_dict

Unnamed: 0,Data Type,Missing Values,Unique Values
id,int64,0,1560
synopsis,object,62,1497
rating,object,3,6
genre,object,8,299
director,object,199,1125
writer,object,449,1069
theater_date,object,359,1025
dvd_date,object,359,717
currency,object,1220,1
box_office,object,1220,336


In [17]:
#creating a description column for the information on the columns
data_dict["Description"] =[
        'Unique movie numeric identifier',
        'A brief summary of the movie\'s plot.',
        'MPAA film rating indicating audience suitability.',
        'Movie genres; may contain multiple genres.',
        'Name(s) of the director(s).',
        'Name(s) of the screenwriter(s).',
        'Date of release in theaters.',
        'Date of release on DVD.',
        'Currency of box office revenue.',
        'Box office earnings of the movie.',
        'Duration of the movie.',
        'Name of the production or distribution studio.'
    ]

data_dict

Unnamed: 0,Data Type,Missing Values,Unique Values,Description
id,int64,0,1560,Unique movie numeric identifier
synopsis,object,62,1497,A brief summary of the movie's plot.
rating,object,3,6,MPAA film rating indicating audience suitability.
genre,object,8,299,Movie genres; may contain multiple genres.
director,object,199,1125,Name(s) of the director(s).
writer,object,449,1069,Name(s) of the screenwriter(s).
theater_date,object,359,1025,Date of release in theaters.
dvd_date,object,359,717,Date of release on DVD.
currency,object,1220,1,Currency of box office revenue.
box_office,object,1220,336,Box office earnings of the movie.


In [18]:
print(f"The Rotten Tomatoes Movie Info dataset has {rtinfo_df.duplicated().sum()} duplicates.")

The Rotten Tomatoes Movie Info dataset has 0 duplicates.


In [19]:
rtinfo_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,1560.0,1007.303846,579.164527,1.0,504.75,1007.5,1503.25,2000.0


In [20]:
rtinfo_df.describe(include = "O").T

Unnamed: 0,count,unique,top,freq
synopsis,1498,1497,A group of air crash survivors are stranded in...,2
rating,1557,6,R,521
genre,1552,299,Drama,151
director,1361,1125,Steven Spielberg,10
writer,1111,1069,Woody Allen,4
theater_date,1201,1025,"Jan 1, 1987",8
dvd_date,1201,717,"Jun 1, 2004",11
currency,340,1,$,340
box_office,340,336,600000,2
runtime,1530,142,90 minutes,72


## Observations

The Rotten Tomatoes Movie Info dataset comprises **1560 rows** and **12 columns**.

Inspection of the first 5 and last 5 rows suggests the dataset is consistently structured from top to bottom.

Each row appears to represent a unique movie identified by the id column.

### Columns

The dataset contains 12 columns with the following information:
* **Movie Metadata**: studio, genre, synopsis, rating, director, writer, theater_date, dvd_date and runtime
* **Financial Data**: currency and box office earnings from the movie

### Data Types 

The summary information of the dataframe reveals two broad kinds of data:
* Categorical Data (11 object columns: studio, genre, synopsis, rating, director, writer, theater_date, dvd_date, box_office, currency and runtime)
* Numerical Data (1 column: id) 

The following data cleaning will be needed:
 * Change the runtime column from a categorical datatype (object) to a numerical datatype (int)
 * Change the box_office column from a categorical datatype (object) to a numerical datatype (float/int)
 * Parse theater_date and dvd_date, both currently object datatypes, into datetime format for any time-based analysis
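
A rough sketch of these three conversions is shown below. It assumes the date strings follow the `"Oct 9, 1971"` pattern seen in the preview and that runtime always looks like `"104 minutes"`; `sample_rt` and `clean_rt_info` are hypothetical names, and the comma in the sample `box_office` value is an assumed formatting quirk, not something confirmed by the output above.

```python
import pandas as pd

# Stand-in for rtinfo_df, based on rows from the preview above.
# The "600,000" formatting is a hypothetical example of messy text.
sample_rt = pd.DataFrame({
    "id": [1, 3, 7],
    "theater_date": ["Oct 9, 1971", "Aug 17, 2012", None],
    "dvd_date": ["Sep 25, 2001", "Jan 1, 2013", None],
    "box_office": [None, "600,000", None],
    "runtime": ["104 minutes", "108 minutes", "200 minutes"],
})


def clean_rt_info(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric runtime/box_office and datetime dates."""
    out = df.copy()
    # "104 minutes" -> 104; nullable Int64 keeps missing runtimes as <NA>
    minutes = out["runtime"].str.extract(r"(\d+)", expand=False)
    out["runtime"] = pd.to_numeric(minutes, errors="coerce").astype("Int64")
    # Remove thousands separators before converting to a number
    out["box_office"] = pd.to_numeric(
        out["box_office"].str.replace(",", "", regex=False),
        errors="coerce",
    )
    # Parse both date columns; values that do not match become NaT
    for col in ["theater_date", "dvd_date"]:
        out[col] = pd.to_datetime(out[col], format="%b %d, %Y", errors="coerce")
    return out


cleaned_rt = clean_rt_info(sample_rt)
print(cleaned_rt.dtypes)
```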

### Missing Data
 
Some columns have significant missing values:

* currency and box_office are each missing in 1,220 of 1,560 rows (about 78%).

* studio is missing in 1,066 rows (about 68%), limiting studio-based analysis.

* writer and director also have notable gaps (449 and 199 missing values respectively).

* theater_date and dvd_date are missing in 359 rows each (about 23%).

These gaps will limit trend analysis over time and by revenue, so a strategy for handling the missing values is needed in the data preparation stage.
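
Missing-value shares like these can be computed directly instead of by hand. The helper below is a small sketch under that idea; `missing_summary` and the `demo` dataframe are hypothetical names introduced only for illustration.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing_pct", ascending=False)


# Toy data: column "a" is half empty, "b" a quarter empty, "c" complete
demo = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
report = missing_summary(demo)
print(report)
```

Applied to `rtinfo_df`, the same arithmetic gives about 78.2% for currency and box_office (1,220 of 1,560 rows).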

### 2.3.2 Exploration of Movie Reviews Dataset

In [21]:
# loading the Rotten Tomatoes Movie Reviews Dataset
rtreviews_df = pd.read_csv("Data/rt.reviews.tsv", delimiter="\t", encoding="latin1")

In [22]:
#checking the first 5 rows
rtreviews_df.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [23]:
#checking the last 5 rows
rtreviews_df.tail()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


In [24]:
# get the shape of the dataframe
print(f"The Rotten Tomatoes Reviews dataset has {rtreviews_df.shape[0]} rows.")
print(f"The Rotten Tomatoes Reviews dataset has {rtreviews_df.shape[1]} columns.")

The Rotten Tomatoes Reviews dataset has 54432 rows.
The Rotten Tomatoes Reviews dataset has 8 columns.


In [25]:
#checking the summary data
rtreviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [26]:
#getting a summary overview of data

data_dict = pd.DataFrame({
    "Data Type": rtreviews_df.dtypes,
    "Missing Values": rtreviews_df.isnull().sum(),
    "Unique Values": rtreviews_df.nunique()
})

data_dict

Unnamed: 0,Data Type,Missing Values,Unique Values
id,int64,0,1135
review,object,5563,48682
rating,object,13517,186
fresh,object,0,2
critic,object,2722,3496
top_critic,int64,0,2
publisher,object,309,1281
date,object,0,5963


In [27]:
#creating a description column in our data dictionary
data_dict["Description"] = [
        'Unique numeric identifier',
        'Full text of the review written by the critic.',
        'Rating given by the critic.',
        'Indicates if the review is "fresh" (positive) or "rotten" (negative).',
        'Name of the critic who wrote the review.',
        'Binary indicator where 1 = top critic, 0 = not a top critic.',
        'Publication or outlet where the review was published.',
        'Date the review was published.'
]

data_dict

Unnamed: 0,Data Type,Missing Values,Unique Values,Description
id,int64,0,1135,Unique numeric identifier
review,object,5563,48682,Full text of the review written by the critic.
rating,object,13517,186,Rating given by the critic.
fresh,object,0,2,"Indicates if the review is ""fresh"" (positive) ..."
critic,object,2722,3496,Name of the critic who wrote the review.
top_critic,int64,0,2,"Binary indicator where 1 = top critic, 0 = not..."
publisher,object,309,1281,Publication or outlet where the review was pub...
date,object,0,5963,Date the review was published.


In [28]:
#checking summary numerical statistics
rtreviews_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,54432.0,1045.706882,586.657046,3.0,542.0,1083.0,1541.0,2000.0
top_critic,54432.0,0.240594,0.427448,0.0,0.0,0.0,0.0,1.0


In [29]:
#checking summary categorical statistics
rtreviews_df.describe(include="O").T

Unnamed: 0,count,unique,top,freq
review,48869,48682,Parental Content Review,24
rating,40915,186,3/5,4327
fresh,54432,2,fresh,33035
critic,51710,3496,Emanuel Levy,595
publisher,54123,1281,eFilmCritic.com,673
date,54432,5963,"January 1, 2000",4303


## Observations

The Rotten Tomatoes Movie Reviews dataset comprises **54432 rows** and **8 columns**.

Inspection of the first 5 and last 5 rows suggests the dataset is consistently structured from top to bottom.

Each row appears to represent a single review of a movie by a critic, so one movie id can appear in many rows.

### Columns

The dataset contains 8 columns with the following information:
* **Movie Review Metadata**: movie identifier id, review, rating, fresh, critic, top_critic, publisher and date 

The following columns are the most relevant to our analysis:

* **fresh**: Indicates whether a review is positive ("fresh") or negative ("rotten"), making it suitable for binary classification tasks.

* **top_critic**: Binary flag (1 = Top Critic, 0 = Regular Critic) that can be used to assess influence or credibility.

* **rating**: Contains non-standardized formats (e.g., "3/5", "7/10", "B+"), which will need normalization for numeric analysis.

### Data Types 

The summary information of the dataframe reveals two broad kinds of data:
* Categorical Data (6 object columns: review, rating, fresh, critic, publisher, date)
* Numerical Data (2 columns: id and top_critic) 

The following data cleaning will be needed:
 * Parse the date column, currently an object datatype, into datetime format for any time-based analysis
 * Inspect the rating column further, then clean it and convert it to a numerical datatype
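
One possible way to normalize the fraction-style ratings is sketched below. It assumes ratings such as `"3/5"` or `"7/10"` can be mapped to a 0 to 1 scale by simple division. Letter grades such as `"B+"` are deliberately left as missing, because any letter-to-number mapping would be a modelling choice that has not been made yet. `fraction_to_score` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def fraction_to_score(value):
    """Convert 'x/y' strings to a 0-1 score; return NaN for anything else."""
    if not isinstance(value, str) or "/" not in value:
        return np.nan
    numerator, _, denominator = value.partition("/")
    try:
        numerator, denominator = float(numerator), float(denominator)
    except ValueError:
        return np.nan
    # Reject impossible values such as "6/5" or a zero denominator
    if denominator <= 0 or numerator < 0 or numerator > denominator:
        return np.nan
    return numerator / denominator


# Formats taken from the examples mentioned in the observations above
sample_ratings = pd.Series(["3/5", "7/10", "2.5/5", "B+", None])
scores = sample_ratings.map(fraction_to_score)
print(scores)
```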

### Missing Data
 
Some columns have significant missing values:

* review: 5,563 missing values (about 10%)

* rating: 13,517 missing values (about 25%), which could limit quantitative rating analysis

* critic: 2,722 missing values (about 5%)

* publisher: 309 missing values (under 1%)

How these columns are handled will depend on the analysis goals and is decided in the data preparation stage.

## 2.4 Initial Data Exploration of The Movie DB Dataset

In [30]:
# assumes the file sits in Data/ like the other datasets (the original path raised FileNotFoundError)
tmdb_df = pd.read_csv("Data/tmdb.movies.csv")

## 2.5 Initial Data Exploration of The Numbers Movie Budgets Dataset

In [None]:
# assumes the file sits in Data/ like the other datasets
tn_df = pd.read_csv("Data/tn.movie_budgets.csv")

## 2.6 Initial Data Exploration of the IMDb Dataset

In [None]:
#importing the relevant library for connection to the database
import sqlite3

#connecting to the database
conn = sqlite3.connect("Data/im.db")

In [31]:
#reading the tables in the database using pandas
table_df = pd.read_sql("""
SELECT name 
FROM sqlite_master 
WHERE type = 'table'
;""",conn)

#previewing the tables within the IMDb SQLite database
table_df

Unnamed: 0,name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings
5,persons
6,principals
7,writers


### 2.6.1 Initial Data Exploration of the Movie Basics Table

In [32]:
#creating a dataframe for the movie basics table
df_mb = pd.read_sql("""
SELECT * 
FROM movie_basics 
;""",conn)

In [34]:
#checking the first 5 rows
df_mb.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [35]:
#checking the last 5 rows
df_mb.tail()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,
146143,tt9916754,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,,Documentary


In [33]:
#checking the summary information on the table
df_mb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [36]:
# get the shape of the dataframe
print(f"The IMDb Movie Basics table has {df_mb.shape[0]} rows.")
print(f"The IMDb Movie Basics table has {df_mb.shape[1]} columns.")

The IMDb Movie Basics table has 146144 rows.
The IMDb Movie Basics table has 6 columns.


In [43]:
#getting a summary overview of data

data_dict = pd.DataFrame({
    "Missing Values": df_mb.isnull().sum(),
    "Unique Values": df_mb.nunique()
})

data_dict

Unnamed: 0,Missing Values,Unique Values
movie_id,0,146144
primary_title,0,136071
original_title,21,137773
start_year,0,19
runtime_minutes,31739,367
genres,5408,1085


In [46]:
#creating a description column in our data dictionary
data_dict["Description"] =[
        "Unique identifier for each movie entry",
        "The main title of the movie",
        "The original title of the movie (native language)",
        "The year the movie was released",
        "Total duration of the movie in minutes",
        "Genres associated with the movie"
    ]

data_dict

Unnamed: 0,Missing Values,Unique Values,Description
movie_id,0,146144,Unique identifier for each movie entry
primary_title,0,136071,The main title of the movie
original_title,21,137773,The original title of the movie (native language)
start_year,0,19,The year the movie was released
runtime_minutes,31739,367,Total duration of the movie in minutes
genres,5408,1085,Genres associated with the movie
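
The preview rows show that `genres` stores several genres in one comma-separated string (for example `"Action,Crime,Drama"`), which helps explain the 1,085 unique values. A minimal sketch of splitting these into one row per movie and genre is given below. `sample_mb` is a hypothetical three-row stand-in for `df_mb`, and dropping movies with no genre is an assumption made for the illustration.

```python
import pandas as pd

# Three rows modelled on the movie_basics preview above
sample_mb = pd.DataFrame({
    "movie_id": ["tt0063540", "tt0069049", "tt9916730"],
    "genres": ["Action,Crime,Drama", "Drama", None],
})

genre_long = (
    sample_mb
    # Turn "Action,Crime,Drama" into ["Action", "Crime", "Drama"]
    .assign(genre=sample_mb["genres"].str.split(","))
    # One row per (movie, genre) pair
    .explode("genre")
    # Movies without any genre are dropped in this sketch
    .dropna(subset=["genre"])
    .drop(columns="genres")
    .reset_index(drop=True)
)

genre_counts = genre_long["genre"].value_counts()
print(genre_counts)
```

The long format makes per-genre counts and later joins straightforward, though a movie with three genres then contributes three rows, which matters when averaging other measures.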
