## Final Project Submission

* Student name: Bradley Ouko
* Student pace: full time
* Scheduled project review date/time: 22/03/2024, 4pm
* Instructor name: Asha Deen
* Blog post URL:


***
## Microsoft Movie Studio: Statistical Analysis for Success


**Author:**  Bradley Ouko
***
![Microsoft Visual Studio Blueprint](https://github.com/Misfit911/Microsoft-Movie-Studio-Blueprint/assets/127237815/66300c64-853c-44ea-bb7b-5735ce301b0d)

## Overview
This project uses data from [Box Office Mojo](https://www.boxofficemojo.com), [IMDb](https://www.imdb.com/), The Numbers and other sources to define a strategy for Microsoft's new movie studio.
Microsoft sees the big companies creating original video content and wants to enter this lucrative business.
It has decided to create a new movie studio and to ground its decisions in data analysis.
Here are the insights from the `exploratory data analysis` conducted:
- The movie genres **Action, Adventure & Animation** have the most sales
- It costs an average of <> as production costs
- <> movies are on the uptrend in the movie market

## Business Problem
Microsoft wants a share of the movie industry but has limited domain knowledge. The research was guided by the following research questions:
1. Which are the top-performing movie genres?
1. What are the preferences of the target audience?

The following are the **data questions** answered in this analysis:

**1. What types of films are currently performing best at the box office (based on box office gross)?**
- What are the characteristics of top-performing movies based on box office gross?
- Which movies have been the most successful financially?

**2. Which movie genres have been the most popular and successful over time?**
- What are the trends in genre preferences?
- What are the preferences of the target audience based on user ratings and reviews? 
- What type of content resonates well with the audience?

**3. How does the movie budget impact box office revenue, and can smaller budget films be profitable?**
- How does the movie budget affect box office revenue? 
- Can smaller budget films be profitable, and is there an optimal budget range?

**4. Are there seasonal trends in movie performance, and when is the best time to release a movie?**
- Are there seasonal patterns in movie performance? 
- When is the best time to release a movie for maximum revenue?

## Data Understanding
The research retrieves information from Box Office Mojo, The Numbers and IMDb databases.

**Data Description**:
The data used for this project comes from multiple movie-related datasets from the following sources:

1. **IMDb (Internet Movie Database)**:
    - **IMDb Basics**:
       - Dataset: imdb.title.basics
       - Description: Contains basic movie information like **movie_id, primary_title, original_title, start_year,
         runtime_minutes, genres**
       - Relationship to Data Analysis Questions: Provides information on movie genres and basic movie details required for genre analysis.

    - **IMDb Ratings**:
       - Dataset: imdb.title.ratings
       - Description: Contains **'movie_id', 'averagerating', 'numvotes'**
       - Relationship to Data Analysis Questions: Allows us to explore audience preferences based on user ratings and reviews.

2. **Box Office Mojo**:
   - Dataset: bom.movie_gross
   - Description: Contains box office gross information for movies. \
   i.e **'title', 'studio', 'domestic_gross', 'foreign_gross', 'year'**
   - Relationship to Data Analysis Questions: Essential for analyzing box office success and financial performance of movies.

3. **The Numbers**:
   - Dataset: tn.movie_budgets
   - Description: Contains movie financials, like **'id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
     'worldwide_gross'**
   - Relationship to Data Analysis Questions: Enables us to analyze the impact of movie budgets on box office revenue.

  Key Variables:
> performance, rating, genre, budget, revenue

Other particulars involved include:
> movie title, box office gross, user rating, release date

**Properties of Variables of interest**:
- **Movie Name**: Categorical variable holding the textual label or name of the movie.
- **Genre**: Categorical variable representing the type or category of the movie (e.g., Action, Drama, Comedy).
- **Budget**: Continuous variable representing the production cost or budget of the movie.
- **Worldwide Gross**: Continuous variable representing the total revenue generated by the movie at the box office.
- **User Rating**: Continuous variable representing the average ratings or scores given by users for the movie.
- **Release Date**: Temporal variable indicating the date when the movie was released in theaters.

**Target Variable**:
The target variable for this project is the **"worldwide gross"** of movies. Box office gross represents the total revenue generated by the movie in theaters and serves as an indicator of movie financial success.

These variables will be used to answer the data questions and derive actionable insights to guide Microsoft's new movie studio in making informed decisions for successful movie production. The analysis will focus on understanding the relationships between these variables and their impact on movie success and profitability.

## Data Preparation

The following describes the conventional data cleaning process to remove inconsistencies and operate on standardized data:


1. **Data Loading**:
- Load the required datasets, including imdb.title.basics, imdb.title.ratings, bom.movie_gross, and tn.movie_budgets, into the analysis environment.

2. **Data Cleaning**:
- Handle Missing Values: Identify and handle missing values appropriately for each dataset. Depending on the extent of missingness, we may choose to impute missing values, drop rows, or drop entire variables if they are not crucial for the analysis.
- Drop Irrelevant Variables: Remove irrelevant or redundant variables that do not contribute to the analysis questions or recommendations.
- Merge Data: Combine relevant datasets based on common keys (e.g., movie titles) to create a comprehensive dataset that includes essential movie information.

3. **Feature Engineering**:

- Calculate `Profit`: Create a new variable to calculate the profit for each movie by subtracting the budget from the box office gross. This will help us understand the financial performance of each movie.
- Calculate `foreign_gross`: Derive the foreign gross as the difference between worldwide gross and domestic gross.

4. **Handling Outliers**:

- Analyze and Address Outliers: Identify outliers in numeric variables like budget and box office gross. Outliers may affect our analysis, and we need to decide whether to remove or transform them based on their impact on the results.
- Consider Genre Outliers: In genre analysis, we may encounter less common or niche genres. We must decide whether to group them into broader genres or retain them as unique categories based on their significance.
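
The feature engineering and outlier steps above can be sketched as follows. This is a minimal sketch, not the notebook's actual implementation: the two sample rows copy values from the `tn.movie_budgets` preview, while the helper names `currency_to_number` and `iqr_outliers` and the 1.5 × IQR rule are assumptions made for illustration.

```python
import pandas as pd

# Two sample rows in the format of tn.movie_budgets (currency stored as text)
budgets = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "domestic_gross": ["$760,507,625", "$42,762,350"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

def currency_to_number(series):
    """Hypothetical helper: strip '$' and ',' and convert to integers."""
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
budgets[money_cols] = budgets[money_cols].apply(currency_to_number)

# Profit = worldwide gross - production budget
budgets["profit"] = budgets["worldwide_gross"] - budgets["production_budget"]
# Foreign gross = worldwide gross - domestic gross
budgets["foreign_gross"] = budgets["worldwide_gross"] - budgets["domestic_gross"]

def iqr_outliers(series, k=1.5):
    """Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

print(budgets[["movie", "profit", "foreign_gross"]])
```

Flagging outliers with a boolean mask, rather than dropping them immediately, keeps the decision to remove or transform them separate from their detection.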


## 1. Data Loading
Import the necessary modules and packages:

In [1]:
# Import standard packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

%matplotlib inline

Let us create a class that opens the `im.db` file and exposes each of its tables as a `Pandas DataFrame`:

In [2]:
# Encapsulate the im.db file inside a class with all necessary attributes and 
# methods required to operate it as a Pandas Dataframe

class IMDb:
    """Class that provides a convenient way to interact with an SQLite
    database. It simplifies the process of querying tables and accessing data,
    making it easier for users to work with the database in their Python code
    -----------------------------------------------------------------------
    1. IMDb.tables() - returns a list with the names of all tables in db
    2. IMDb.get(table_name : str) - gets the dataframe with arg table_name
        - alternative usage: IMDb[table_name]
    3. IMDb.keys() - returns the same list of table names as IMDb.tables()
    """

    def __init__(self, file_name):
        """IMDb class constructor method: Provide the name of the database file """
        if not file_name:
            raise ValueError("Please provide file to read")
        self.file = file_name
        self.conn = sqlite3.connect(self.file)

    def tables(self):
        """Method to return all tables in IMDb"""
        query = """
        SELECT name 
        FROM sqlite_master
        WHERE type = 'table';
        """
        tables = (pd.read_sql(query, self.conn))
        tables = list(tables['name'])
        return tables
    
    def get(self, table=None):
        """Method to get DataFrame from a specific table name
        Args:
            table (str): table name
        """
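        if not table:
            raise ValueError("Please provide table name")
        # Table names cannot be bound as SQL parameters, so only
        # interpolate names that actually exist in the database
        if table not in self.tables():
            raise ValueError(f"Table '{table}' does not exist")
        query = f"""
        SELECT *
        FROM {table}
        """
        return pd.read_sql(query, self.conn)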

    def __str__(self):
        """String representation of IMDb"""
        # tables() already returns a plain list of table names
        tables = {'tables' : self.tables()}
        return f"{tables}"

    def __repr__(self):
        """Callable string representation of IMDb"""
        db = {'tables' : self.tables()}
        return f"IMDb movie database {db}"
    
    def __getitem__(self, key):
        """Method that enables pythonic dictionary
        operations on class instance. 
        e.g IMDb['persons']
        Args:
            key (str): table name
        """
        return self.get(key)

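    def __getattr__(self, table_name):
        """Method that enables accessing tables as attributes
        of the IMDb object.
        i.e IMDb.table_name
        """
        # __getattr__ only runs when normal lookup fails; bail out early for
        # internal names so a missing self.conn cannot recurse endlessly
        if table_name.startswith('_') or table_name in ('conn', 'file'):
            raise AttributeError(table_name)
        if table_name not in self.tables():
            raise AttributeError(f"Attribute '{table_name}' does not exist")
        return self.get(table_name)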

        
    def keys(self):
        """Return the tables as dict keys"""
        return self.tables()
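
The two queries the class relies on can be tried in isolation. The following is a minimal sketch that assumes an in-memory SQLite database in place of `im.db`; the table and column names mirror `movie_ratings`, and the two rows are only a small sample.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db (assumption: the real file is not needed here)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)
conn.executemany(
    "INSERT INTO movie_ratings VALUES (?, ?, ?)",
    [("tt10356526", 8.3, 31), ("tt10384606", 8.9, 559)],
)
conn.commit()

# Same idea as IMDb.tables(): list the table names stored in sqlite_master
table_names = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table';", conn
)["name"].tolist()

# Same idea as IMDb.get(): read a whole table into a DataFrame
ratings = pd.read_sql("SELECT * FROM movie_ratings", conn)

print(table_names)
print(ratings.shape)
conn.close()
```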


Read the provided files into their respective `Pandas DataFrames`:

In [3]:
# list of all files provided for this analysis
files = ['bom.movie_gross.csv', 'im.db', 'tn.movie_budgets.csv', 'rt.movie_info.tsv', 'rt.reviews.tsv', 
         'tmdb.movies.csv']

# instantiate empty list
data_files = []

# read all data from files into the data_files list
def read_files():
    """Function to read the file data in form of Pandas DataFrames
    into a list (data_files)
    """
    for i, file in enumerate(files):
        try:
            file = 'zippedData/' + file
            if file.endswith('.csv'):
                data_files.append(pd.read_csv(file))
            elif file.endswith('.tsv'):
                data_files.append(pd.read_csv(file, sep='\t'))
            elif file.endswith('.db'):
                data_files.append(IMDb(file))
        except UnicodeDecodeError:
            data_files.append(pd.read_csv(file, sep='\t', 
                                          encoding= 'unicode_escape'))

# read all files 
read_files()

# Unpack data_files list to individual variables in memory
# (the variable order must match the order of the `files` list above)
bom_movie_gross, imdb, tn_movie_budgets, rt_movie_info, rt_reviews, tmdb_movies = data_files
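
Unpacking by position only works when the variable order matches the order of `files`. One possible alternative, sketched here under the assumption that a dictionary keyed by file name is acceptable, avoids that coupling. The helper `load_tables` is hypothetical, and the demo writes two tiny files to a temporary folder instead of reading `zippedData/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_tables(folder, file_names):
    """Hypothetical helper: read CSV/TSV files into a dict of DataFrames."""
    frames = {}
    for file_name in file_names:
        path = Path(folder) / file_name
        sep = "\t" if path.suffix == ".tsv" else ","
        # 'bom.movie_gross.csv' -> 'bom_movie_gross'
        key = path.stem.replace(".", "_")
        frames[key] = pd.read_csv(path, sep=sep)
    return frames

# Demo with two tiny files written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "bom.movie_gross.csv").write_text("title,year\nInception,2010\n")
    Path(folder, "rt.movie_info.tsv").write_text("id\trating\n1\tR\n")
    frames = load_tables(folder, ["bom.movie_gross.csv", "rt.movie_info.tsv"])

print(sorted(frames))
```

Each DataFrame is then looked up by name, for example `frames["bom_movie_gross"]`, so reordering the file list cannot silently swap datasets.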

Assert that all necessary files have been read into memory as DataFrames and are not empty:

In [4]:
# test whether all necessary files have properly been read into memory as DataFrames

# map a readable name to each DataFrame instead of resolving names with eval()
frames = {
    'bom_movie_gross': bom_movie_gross,
    'tn_movie_budgets': tn_movie_budgets,
    'imdb.movie_basics': imdb['movie_basics'],
    'imdb.movie_ratings': imdb['movie_ratings'],
}
for name, frame in frames.items():
    # check if variable is a dataframe
    assert isinstance(frame, pd.DataFrame), f"{name} is not a DataFrame"
    # check if dataframe is not empty
    assert len(frame) != 0, f"{name} has no rows"
    # check if dataframe has columns
    assert len(frame.columns) != 0, f"{name} has no columns"

Check the length of each opened DataFrame and print the column information to get a feel for the data.
> Use `.head()`, `.tail()`, `.info()` and `.describe()` to get the summary statistics
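
Since the same title-then-preview pattern repeats for every dataset, it could be wrapped in one function. This is a minimal sketch, assuming a hypothetical helper named `summarize`; the sample frame reuses two rows from the Box Office Mojo preview.

```python
import pandas as pd

def summarize(name, frame, n=5):
    """Hypothetical helper: print a title, the shape and the missing-value
    counts of a DataFrame, then return its first n rows."""
    print(name.strip().upper())
    print(f"rows: {frame.shape[0]}, columns: {frame.shape[1]}")
    print(f"missing values per column:\n{frame.isna().sum()}")
    return frame.head(n)

sample = pd.DataFrame({
    "title": ["Toy Story 3", "Inception"],
    "studio": ["BV", "WB"],
    "domestic_gross": [415000000.0, 292600000.0],
})
preview = summarize("Box Office Gross Data", sample)
```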


Using the respective `data_files` print the statistics:

In [56]:
# print lengths of each dataset
print("Box Office Gross Data: ", len(bom_movie_gross))
print("The Numbers Dataset: ", len(tn_movie_budgets))
print("IMDb Movie Basics: ", len(imdb.get("movie_basics")))
print("IMDb Movie Ratings: ", len(imdb.get("movie_ratings")))
print("TMDB Movies: ", len(tmdb_movies))
print("Rotten Tomatoes Info: ", len(rt_movie_info))

Box Office Gross Data:  3387
The Numbers Dataset:  5782
IMDb Movie Basics:  146144
IMDb Movie Ratings:  73856
TMDB Movies:  26517
Rotten Tomatoes Info:  1560


In [54]:
# i) list the first 5 records of each dataset
# ii) get column info for each as well

title_strip = lambda x: x.strip().upper()

# Print the title
print(f"{title_strip('Box Office Gross Data')}")

# Print the DataFrame head
bom_movie_gross.head()

BOX OFFICE GROSS DATA


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [7]:
bom_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [55]:

title = lambda x: x.strip().upper()

# Print the title
print(f"{title('TMDB Movies Dataset')}")

tmdb_movies.head()

TMDB MOVIES DATASET


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [9]:
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [44]:
# i) list the first 5 records of each dataset
# ii) get column info for each as well
movie_basics = imdb.get('movie_basics')
title_strip = lambda x: x.strip().upper()

# Print the title
print(f"{title_strip('imdb:movie basics')}")

movie_basics.head()

IMDB:MOVIE BASICS


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [11]:
movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [45]:
# check imdb.ratings
movie_ratings = imdb.get('movie_ratings')
title_strip = lambda x: x.strip().upper()

# Print the title
print(f"{title_strip('imdb:movie ratings')}")
movie_ratings.head()

IMDB:MOVIE RATINGS


Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [13]:
movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [58]:
title_strip = lambda x: x.strip().upper()

# Print the title
print(f"{title_strip('The Numbers Dataset')}")

tn_movie_budgets.head()

THE NUMBERS DATASET


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [52]:
tn_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [59]:
title_strip = lambda x: x.strip().upper()

# Print the title
print(f"{title_strip('Rotten Tomatoes Info')}")
rt_movie_info.head()

ROTTEN TOMATOES INFO


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [60]:
rt_movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


The datasets required for this analysis loaded successfully and contain the columns needed.

1. The **imdb.movie_basics** DataFrame consists of the following columns:
    > `'movie_id'`, `'primary_title'`, `'original_title'`, `'start_year'`, `'runtime_minutes'`, `'genres'`
2. The **imdb.movie_ratings** DataFrame consists of the following columns:
    > `'movie_id'`, `'averagerating'`, `'numvotes'`
3. The **bom.movie_gross** (Box Office Mojo) DataFrame consists of the following columns:
    > `'title'`, `'studio'`, `'domestic_gross'`, `'foreign_gross'`, `'year'`
4. The **tn.movie_budgets** (The Numbers) DataFrame consists of the following columns:
    > `'id'`, `'release_date'`, `'movie'`, `'production_budget'`, `'domestic_gross'`, `'worldwide_gross'`
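
Because the two IMDb tables share `movie_id`, they can be joined for genre-level analysis. The following is a minimal sketch, not the notebook's own merge step: the titles and genres come from the `movie_basics` preview, the rating values are hypothetical, and the inner join is an assumption about how unmatched movies should be handled.

```python
import pandas as pd

basics = pd.DataFrame({
    "movie_id": ["tt0063540", "tt0066787", "tt0069049"],
    "primary_title": [
        "Sunghursh",
        "One Day Before the Rainy Season",
        "The Other Side of the Wind",
    ],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
})

# Hypothetical ratings, for illustration only
ratings = pd.DataFrame({
    "movie_id": ["tt0063540", "tt0069049"],
    "averagerating": [7.0, 6.9],
    "numvotes": [77, 4517],
})

# Keep only movies present in both tables; validate guards against duplicate keys
merged = basics.merge(ratings, on="movie_id", how="inner", validate="one_to_one")

# Split the comma-separated genres so each row holds one (movie, genre) pair
by_genre = merged.assign(genres=merged["genres"].str.split(",")).explode("genres")

# Mean rating per genre
genre_mean = by_genre.groupby("genres")["averagerating"].mean()
print(genre_mean)
```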

In [None]:
After the data has been properly opened and read to pandas DataFrames,