## Overview

As demand for original video content continues to grow, our company is preparing to enter the film industry by launching a new movie studio. However, the company currently lacks industry experience and clear insight into what drives success at the box office. This project analyzes several film industry datasets to uncover trends and provide practical, data-driven insights that can guide the company's strategic entry into the industry. The findings will help stakeholders identify which types of films to prioritize, understand audience preferences, and allocate resources more effectively to improve the chances of box office success and long-term sustainability.

## Data Understanding
To develop a strategy for the new studio, we use two primary sources that capture two dimensions of film success: popularity and financial performance.

#### Data sources
1. Box Office Mojo (bom.movie_gross.csv.gz): This dataset serves as the financial benchmark. It contains domestic and foreign gross earnings, allowing us to identify which films achieved the highest commercial reach.

2. IMDB Relational Database (im.db): We are utilizing two key tables from this SQLite database:

- movie_basics: Provides essential metadata, including primary and original titles, runtime in minutes, release years, and genre classifications.

- movie_ratings: Contains user-generated data, specifically average ratings and "numvotes," which act as a proxy for audience engagement and long-term relevance.

#### Integration Strategy
By joining these datasets on movie titles and years, we can correlate specific genres with their return on investment. This allows us to move beyond simply seeing what people watched, to understanding what they actually enjoyed and which genres consistently command the highest ticket sales.
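
The join described above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the project's actual merge: the film names, gross values, and genres are hypothetical, and the IMDB column names `primary_title`, `start_year`, and `genres` are assumptions about the `movie_basics` schema.

```python
import pandas as pd

# Hypothetical stand-in for the Box Office Mojo data.
gross = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2010, 2010, 2011],
    "domestic_gross": [100.0, 200.0, 50.0],
})

# Hypothetical stand-in for movie_basics (column names are assumed).
basics = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film B"],
    "start_year": [2010, 2010, 1999],
    "genres": ["Comedy,Family", "Action,Comedy", "Drama"],
})

# Match on both title and year so that same-named films from
# different years are not joined to the wrong record.
merged = gross.merge(
    basics,
    left_on=["title", "year"],
    right_on=["primary_title", "start_year"],
    how="inner",
)

# Split the comma-separated genres into one row per genre,
# then average the gross within each genre.
by_genre = (
    merged.assign(genre=merged["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")["domestic_gross"]
    .mean()
    .sort_values(ascending=False)
)
print(by_genre)
```

An inner join keeps only films present in both sources. Titles that differ in punctuation or carry a year suffix would be dropped, so some title normalization may be needed before merging the real data.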


# Data Understanding
## Loading Datasets
This section loads and explores the data so that we can assess its quality and describe what it contains. We explore all of the available datasets before deciding which ones to use in the analysis. The available datasets include:

- Box Office Mojo
- IMDB
- Rotten Tomatoes Movies
- Rotten Tomatoes Critic Reviews
- TheMovieDB
- The Numbers
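
One way to compare data quality across these datasets is a small per-column summary. The helper below is a sketch; `summarize_quality` is a hypothetical name and the sample frame is made up for illustration.

```python
import pandas as pd

def summarize_quality(df):
    """Return one row per column with its dtype, missing count, missing percentage, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    })

# Made-up sample with a repeated title and missing gross values.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film B"],
    "gross": [1.0, None, None],
})

report = summarize_quality(sample)
print(report)
```

Calling the same helper on each loaded DataFrame gives a consistent basis for deciding which datasets are complete enough to use.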

In [15]:
#import libraries
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns

In [16]:
# Read the 'bom.movie_gross.csv' file
df_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz', compression='gzip')
df_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
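
In the preview, `domestic_gross` prints with a decimal point while `foreign_gross` does not, which suggests the two columns may have different dtypes. Assuming `foreign_gross` is stored as text, possibly with thousands separators, a conversion could look like the sketch below. The sample values are made up.

```python
import pandas as pd

# Made-up sample: foreign_gross held as text, one value with commas, one missing.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "domestic_gross": [415000000.0, 334200000.0, None],
    "foreign_gross": ["652000000", "1,250,000", None],
})

# Strip thousands separators, then convert to numbers.
# Values that still cannot be parsed become NaN instead of raising.
sample["foreign_gross"] = pd.to_numeric(
    sample["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(sample.dtypes)
```

Checking `df_gross.dtypes` on the real data would confirm whether this step is needed.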


In [17]:
# Read the 'tmdb.movies.csv' file from the original location
df_tmdb = pd.read_csv('zippedData/tmdb.movies.csv.gz', compression='gzip')
df_tmdb.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
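
The `genre_ids` values are read from CSV, so each one is most likely a string that looks like a list rather than an actual Python list. A sketch of parsing them, using two rows from the preview above:

```python
import ast
import pandas as pd

# Two rows from the preview, with genre_ids held as strings.
sample = pd.DataFrame({
    "title": ["Iron Man 2", "Toy Story"],
    "genre_ids": ["[12, 28, 878]", "[16, 35, 10751]"],
})

# literal_eval safely parses the text into a real list of integers.
sample["genre_ids"] = sample["genre_ids"].apply(ast.literal_eval)

# One row per (film, genre id), convenient for grouping by genre.
long_form = sample.explode("genre_ids").rename(columns={"genre_ids": "genre_id"})
print(long_form)
```

The numeric IDs still need a lookup table to translate them into genre names. That mapping is not part of this dataset preview.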


In [18]:
# Read the 'tn.movie_budgets.csv' file from the original location
df_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip')
df_budgets.head()


In [19]:
# Read the 'rt.movie_info.tsv' file, specifying the tab separator
df_info = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t', compression='gzip')
df_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
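
The preview shows `runtime` stored as text such as "104 minutes" and `genre` as a pipe-separated string. A sketch of turning both into analysis-friendly columns follows. The first two rows reuse values from the preview, the third row is made up to show a missing runtime, and the new column names are hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
        "Drama|Romance",
    ],
    "runtime": ["104 minutes", "108 minutes", None],  # last value is made up
})

# Pull the leading digits out of the runtime text; missing values stay NaN.
sample["runtime_minutes"] = pd.to_numeric(
    sample["runtime"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# Split the pipe-separated genres into a list per film.
sample["genre_list"] = sample["genre"].str.split("|")

print(sample[["runtime_minutes", "genre_list"]])
```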


In [20]:
# Read the 'rt.reviews.tsv' file, specifying the tab separator and latin1 encoding
df_reviews = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', compression='gzip', encoding='latin1')

df_reviews.head()


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
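
The `rating` column mixes fraction-style scores such as "3/5" with missing values, so it cannot be averaged directly. A sketch of normalizing fraction ratings to a 0 to 1 scale is shown below. The helper name `rating_to_fraction` is hypothetical, and the "B+" and "8/10" entries are assumed examples of other formats the column might contain.

```python
import math
import pandas as pd

def rating_to_fraction(value):
    """Convert 'numerator/denominator' text to a float; return NaN for anything else."""
    if not isinstance(value, str) or "/" not in value:
        return math.nan
    numerator, _, denominator = value.partition("/")
    try:
        numerator, denominator = float(numerator), float(denominator)
    except ValueError:
        return math.nan
    # Reject zero denominators and scores above the maximum.
    if denominator == 0 or numerator > denominator:
        return math.nan
    return numerator / denominator

# "3/5" appears in the preview; the other entries are assumed formats.
ratings = pd.Series(["3/5", None, "B+", "8/10"])
scores = ratings.apply(rating_to_fraction)
print(scores)
```

Letter grades are left as NaN here. Mapping them to numbers would require a separate, explicit convention.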


In [21]:
#Extracting im.db.zip file
import zipfile

zip_path = 'zippedData/im.db.zip'
extract_to = 'zippedData'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print("Extraction complete")


Extraction complete


In [22]:
#Connecting to im.db database and reading table names

conn = sqlite3.connect('zippedData/im.db')

tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table';",
    conn
)

tables


Unnamed: 0,name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings
5,persons
6,principals
7,writers
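
Of the tables listed, `movie_basics` and `movie_ratings` are the two this analysis relies on. The sketch below shows how they could be combined in SQL, using an in-memory database with made-up rows so that it runs on its own. The column names `movie_id`, `primary_title`, `start_year`, `genres`, and `averagerating` are assumptions about the schema; only `numvotes` is named in the text above.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db with an assumed schema and made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE movie_basics (
    movie_id TEXT PRIMARY KEY,
    primary_title TEXT,
    start_year INTEGER,
    genres TEXT
);
CREATE TABLE movie_ratings (
    movie_id TEXT,
    averagerating REAL,
    numvotes INTEGER
);
INSERT INTO movie_basics VALUES ('tt001', 'Film A', 2010, 'Drama');
INSERT INTO movie_basics VALUES ('tt002', 'Film B', 2011, 'Comedy');
INSERT INTO movie_ratings VALUES ('tt001', 7.5, 1200);
""")

# Inner join: only films that have a ratings row are returned.
query = """
SELECT b.movie_id, b.primary_title, b.start_year, b.genres,
       r.averagerating, r.numvotes
FROM movie_basics AS b
JOIN movie_ratings AS r
    ON r.movie_id = b.movie_id;
"""
rated = pd.read_sql(query, demo_conn)
demo_conn.close()
print(rated)
```

Running `PRAGMA table_info(movie_basics);` against the real database would confirm the actual column names before adapting the query.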


In [23]:
conn.close()