# Movie Success Analysis Project
## 1. Business Understanding
### Background
The entertainment industry is rapidly evolving as major companies produce original content for streaming platforms. To remain competitive, our company aims to enter the film production market. Understanding which types of movies perform best is critical to making informed investment decisions.

### Business Problem
The company plans to launch a new movie studio but lacks insight into what types of films perform best at the box office. This project aims to explore current movie trends to guide the studio in making data-driven decisions about what films to produce.

### Project Overview
This project uses data from Box Office Mojo (for movie revenues) and IMDB (for ratings and movie characteristics) to perform exploratory data analysis. The goal is to identify patterns, correlations, and factors that contribute to movie success and translate findings into actionable business recommendations.

### Project Goal
To identify the key factors that drive box office success and provide insights to guide the new movie studio's production strategy.

### Objectives 
1. To identify high-performing movie genres.  
2. To evaluate the impact of audience ratings on box office performance.  
3. To identify key factors that predict movie success (e.g., genre, runtime, release timing).  
4. To examine how box office performance changes over time.

## 2. Data Understanding
We'll use two datasets:
1. `bom.movie_gross.csv.gz`: Box Office Mojo (BOM) data with revenue information.
2. `im.db`: IMDB SQLite database containing movie information such as titles, genres, and ratings.

### Reasons
1. BOM provides the revenue information, which is one of our target variables.
2. IMDB provides movie attributes, e.g. genres, runtimes, release years, and ratings.

To view our datasets, we first import the necessary libraries, load the data files, and preview their contents to understand the structure, features, and types of information available for analysis.



In [1]:
import pandas as pd
import sqlite3
import zipfile


In [2]:
# Load Box Office Mojo dataset
bom = pd.read_csv("zippedData/bom.movie_gross.csv.gz")
print("Box Office Mojo Data:")
print(bom.head())

Box Office Mojo Data:
                                         title studio  domestic_gross  \
0                                  Toy Story 3     BV     415000000.0   
1                   Alice in Wonderland (2010)     BV     334200000.0   
2  Harry Potter and the Deathly Hallows Part 1     WB     296000000.0   
3                                    Inception     WB     292600000.0   
4                          Shrek Forever After   P/DW     238700000.0   

  foreign_gross  year  
0     652000000  2010  
1     691300000  2010  
2     664300000  2010  
3     535700000  2010  
4     513900000  2010  


In [3]:
# Load IMDB database
with zipfile.ZipFile("zippedData/im.db.zip", "r") as z:
    z.extractall("zippedData/")

conn = sqlite3.connect("zippedData/im.db")

# View available tables
tables = pd.read_sql("""SELECT name FROM sqlite_master WHERE type='table';""", conn)
print(tables)

            name
0   movie_basics
1      directors
2      known_for
3     movie_akas
4  movie_ratings
5        persons
6     principals
7        writers
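Before loading whole tables, it can help to check which columns each one holds. The following is a minimal sketch rather than part of the original analysis: it builds a hypothetical in-memory table named `movie_ratings` so it runs without `im.db`, and the helper `table_columns` is an illustrative name. `PRAGMA table_info` is a standard SQLite statement.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for im.db so the sketch runs on its own
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)

def table_columns(connection, table_name):
    """Return the column names and declared types of one SQLite table."""
    # table_name is interpolated directly, so only pass trusted names
    info = pd.read_sql(f"PRAGMA table_info({table_name});", connection)
    return info[["name", "type"]]

schema = table_columns(demo_conn, "movie_ratings")
print(schema)
demo_conn.close()
```

With the real database, the same helper could be called with `conn` and any of the table names listed above.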


In [4]:
# Load movie_basics and movie_ratings tables
basics = pd.read_sql("SELECT * FROM movie_basics;", conn)
ratings = pd.read_sql("SELECT * FROM movie_ratings;", conn)
conn.close()

print("Movie Basics Sample:")
print(basics.head())

print("Movie Ratings Sample:")
print(ratings.head())

Movie Basics Sample:
    movie_id                    primary_title              original_title  \
0  tt0063540                        Sunghursh                   Sunghursh   
1  tt0066787  One Day Before the Rainy Season             Ashad Ka Ek Din   
2  tt0069049       The Other Side of the Wind  The Other Side of the Wind   
3  tt0069204                  Sabse Bada Sukh             Sabse Bada Sukh   
4  tt0100275         The Wandering Soap Opera       La Telenovela Errante   

   start_year  runtime_minutes                genres  
0        2013            175.0    Action,Crime,Drama  
1        2019            114.0       Biography,Drama  
2        2018            122.0                 Drama  
3        2018              NaN          Comedy,Drama  
4        2017             80.0  Comedy,Drama,Fantasy  
Movie Ratings Sample:
     movie_id  averagerating  numvotes
0  tt10356526            8.3        31
1  tt10384606            8.9       559
2   tt1042974            6.4        20
3   tt10

In [5]:
# Inspect the shapes of the datasets
print("Box Office Mojo shape:", bom.shape)
print("Movie Basics shape:", basics.shape)
print("Movie Ratings shape:", ratings.shape)

Box Office Mojo shape: (3387, 5)
Movie Basics shape: (146144, 6)
Movie Ratings shape: (73856, 3)
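The shapes show that `ratings` has roughly half as many rows as `basics`, so an inner join on `movie_id` would keep only the movies that have a rating. The sketch below illustrates that join on tiny hypothetical frames (`basics_demo`, `ratings_demo`). It assumes `movie_id` is unique in both tables, which the `validate` argument checks.

```python
import pandas as pd

# Tiny hypothetical frames that mimic the columns of basics and ratings
basics_demo = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt003"],
    "primary_title": ["A", "B", "C"],
})
ratings_demo = pd.DataFrame({
    "movie_id": ["tt001", "tt003"],
    "averagerating": [7.1, 5.4],
    "numvotes": [120, 45],
})

# Inner join keeps only movies present in both tables;
# validate raises an error if movie_id is duplicated on either side
rated = basics_demo.merge(
    ratings_demo, on="movie_id", how="inner", validate="one_to_one"
)
print(rated.shape)
```

Comparing the shape before and after the merge is a quick way to see how many movies drop out for lack of a rating.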


In [6]:
# Inspect the column names of the datasets
print("Box Office Mojo columns:", bom.columns.tolist())
print("Movie Basics columns:", basics.columns.tolist())
print("Movie Ratings columns:", ratings.columns.tolist())

Box Office Mojo columns: ['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']
Movie Basics columns: ['movie_id', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres']
Movie Ratings columns: ['movie_id', 'averagerating', 'numvotes']


In [7]:
# Inspect column types and non-null counts of each dataset.
# DataFrame.info() prints its summary and returns None, so it is called on its own line.
print("Box Office Mojo info:")
bom.info()

Box Office Mojo info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
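The summary shows `foreign_gross` stored as `object` rather than a numeric type, so it needs conversion before any revenue arithmetic. One possible cause is thousands separators in some values; that is an assumption, not something the preview confirms. A minimal sketch of the conversion on a hypothetical sample (`bom_demo` and `to_numeric_gross` are illustrative names):

```python
import pandas as pd

# Hypothetical sample; the comma-formatted value is an assumption about the raw data
bom_demo = pd.DataFrame({"foreign_gross": ["652000000", "1,250,000", None]})

def to_numeric_gross(series):
    """Strip thousands separators and convert to numbers; unparseable values become NaN."""
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

bom_demo["foreign_gross_num"] = to_numeric_gross(bom_demo["foreign_gross"])
print(bom_demo)
```

Writing to a new column keeps the raw strings available for checking which values failed to parse.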


In [8]:
# DataFrame.info() prints its summary and returns None, so it is called on its own line.
print("Movie Basics info:")
basics.info()

Movie Basics info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
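The `genres` column packs several genres into one comma-separated string, so a per-genre comparison (Objective 1) needs one row per movie and genre. The sketch below shows one way to do that with `str.split` and `explode`, using three rows copied from the Movie Basics preview; `basics_demo` and `genre_rows` are illustrative names.

```python
import pandas as pd

# Sample rows taken from the Movie Basics preview
basics_demo = pd.DataFrame({
    "movie_id": ["tt0063540", "tt0066787", "tt0069049"],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
})

# Split the comma-separated string into a list, then give each genre its own row
genre_rows = (
    basics_demo.assign(genre=basics_demo["genres"].str.split(","))
    .explode("genre")
    .drop(columns="genres")
)
genre_counts = genre_rows["genre"].value_counts()
print(genre_counts)
```

A movie with three genres then counts once toward each of them, which should be kept in mind when summing revenue by genre.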


In [9]:
# DataFrame.info() prints its summary and returns None, so it is called on its own line.
print("Movie Ratings info:")
ratings.info()

Movie Ratings info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB
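The non-null counts above imply very different levels of missing data: `foreign_gross` is populated for 2037 of 3387 rows (about 40% missing) and `runtime_minutes` for 114405 of 146144 rows (about 22% missing), while the ratings table has no gaps. A small helper can report this directly as percentages. This is a sketch on a hypothetical frame; `missing_share` and `demo` are illustrative names.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)

# Hypothetical frame standing in for bom, basics, or ratings
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "foreign_gross": [1.0, np.nan, np.nan, 4.0],
    "studio": ["BV", None, "WB", "WB"],
})
shares = missing_share(demo)
print(shares)
```

Calling `missing_share(bom)`, `missing_share(basics)`, and `missing_share(ratings)` would give a compact view of where cleaning effort is needed.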
