## Final Project Submission

* Student name: Bradley Ouko
* Student pace: full time
* Scheduled project review date/time: 22/03/2024, 4pm
* Instructor name: Asha Deen
* Blog post URL:


***
## Microsoft Movie Studio: Statistical Analysis for Success


**Author:**  Bradley Ouko
***
![Microsoft Movie Studio Blueprint](https://github.com/Misfit911/Microsoft-Movie-Studio-Blueprint/assets/127237815/66300c64-853c-44ea-bb7b-5735ce301b0d)

## Overview
This project uses data from [Box Office Mojo](https://www.boxofficemojo.com), [IMDb](https://www.imdb.com/) and other sources to define a strategy for Microsoft's new movie studio.
Microsoft sees the big companies creating `original video content` and wants a share of this lucrative business.
Microsoft has decided to create a `new movie studio` guided by evidence from `data analysis`.
The `exploratory data analysis` produced the following insights:
- The movie genres **Action, Adventure & Animation** generate the highest sales
- It costs an average of <> as production costs
- <> movies are on the uptrend in the movie market

## Business Problem
Microsoft wants a share of the movie industry but has limited domain knowledge. The analysis is guided by the following research questions:
1. Which are the top-performing movie genres?
1. What are the preferences of the target audience?
1. 

The following are the **data questions** answered in this analysis:

**1. What types of films are currently performing best at the box office (based on box office gross)?**
- What are the characteristics of top-performing movies based on box office gross?
- Which movies have been the most successful financially?

**2. Which movie genres have been the most popular and successful over time?**
- What are the trends in genre preferences?
- What are the preferences of the target audience based on user ratings and reviews? 
- What type of content resonates well with the audience?

**3. How does the movie budget impact box office revenue, and can smaller budget films be profitable?**
- How does the movie budget affect box office revenue? 
- Can smaller budget films be profitable, and is there an optimal budget range?

**4. Are there seasonal trends in movie performance, and when is the best time to release a movie?**
- Are there seasonal patterns in movie performance? 
- When is the best time to release a movie for maximum revenue?

## Data Understanding
The research retrieves information from Box Office Mojo, The Numbers and IMDb databases.

  Key Variables:
> performance, rating, genre, budget, revenue

Other particulars involved include:
> movie title, box office gross, user rating, release date

**Properties of Variables of interest**:
- **Movie Name**: Categorical variable holding the textual label or name of the movie.
- **Genre**: Categorical variable representing the type or category of the movie (e.g., Action, Drama, Comedy).
- **Budget**: Continuous variable representing the production cost or budget of the movie.
- **Worldwide Gross**: Continuous variable representing the total revenue generated by the movie at the box office.
- **User Rating**: Continuous variable representing the average ratings or scores given by users for the movie.
- **Release Date**: Temporal variable indicating the date when the movie was released in theaters.

These variables will be used to answer the data questions and derive actionable insights to guide Microsoft's new movie studio in making informed decisions for successful movie production. The analysis will focus on understanding the relationships between these variables and their impact on movie success and profitability.

## Data Preparation

The following describes the conventional data cleaning process to remove inconsistencies and operate on standardized data:


1. **Data Loading**: Load the required datasets, including imdb.title.basics, imdb.title.ratings, bom.movie_gross, and tn.movie_budgets, into the analysis environment.

2. **Data Cleaning**:
- Handle Missing Values: Identify and handle missing values appropriately for each dataset. Depending on the extent of missingness, we may choose to impute missing values, drop rows, or drop entire variables if they are not crucial for the analysis.
- Drop Irrelevant Variables: Remove irrelevant or redundant variables that do not contribute to the analysis questions or recommendations.
- Merge Data: Combine relevant datasets based on common keys (e.g., movie titles) to create a comprehensive dataset that includes essential movie information.

3. **Feature Engineering**:

- Calculate `Profit`: Create a new variable to calculate the profit for each movie by subtracting the budget from the box office gross. This will help us understand the financial performance of each movie.
- Calculate `foreign_gross`: Create a new variable for the foreign gross by subtracting the domestic gross from the worldwide gross.
4. **Handling Outliers**:

- Analyze and Address Outliers: Identify outliers in numeric variables like budget and box office gross. Outliers may affect our analysis, and we need to decide whether to remove or transform them based on their impact on the results.
- Consider Genre Outliers: In genre analysis, we may encounter less common or niche genres. We must decide whether to group them into broader genres or retain them as unique categories based on their significance.
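
The steps above can be sketched on a toy dataset as follows. This is a minimal illustration rather than the notebook's actual cleaning code: the column names (`title`, `domestic_gross`, `worldwide_gross`, `production_budget`), the sample values, the median imputation and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real gross and budget datasets
gross = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "domestic_gross": [100.0, 50.0, None, 20.0, 900.0],
    "worldwide_gross": [250.0, 80.0, 160.0, 30.0, 2500.0],
})
budgets = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "production_budget": [90.0, 40.0, 70.0, 10.0, 300.0, 5.0],
})


def prepare(gross, budgets):
    """Clean, merge and engineer features (illustrative only)."""
    df = gross.copy()
    # Handle missing values: impute domestic gross with the column median
    df["domestic_gross"] = df["domestic_gross"].fillna(df["domestic_gross"].median())
    # Merge on the shared key; an inner join keeps titles present in both frames
    df = df.merge(budgets, on="title", how="inner")
    # Feature engineering: profit and foreign gross
    df["profit"] = df["worldwide_gross"] - df["production_budget"]
    df["foreign_gross"] = df["worldwide_gross"] - df["domestic_gross"]
    return df


def flag_outliers(series, k=1.5):
    """Return a boolean mask for values more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


movies = prepare(gross, budgets)
movies["budget_outlier"] = flag_outliers(movies["production_budget"])
print(movies)
```

Flagging outliers in a separate column, instead of dropping them immediately, leaves the decision to remove or transform them open until their effect on the results is known.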


## 1. Data Loading
Import the necessary modules and packages

In [1]:
# Import standard packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

%matplotlib inline

The next step is to read the provided files into their respective `Pandas DataFrames`:

In [10]:
# List of all files provided for this analysis
files = ['bom.movie_gross.csv', 'im.db', 'rt.movie_info.tsv', 'rt.reviews.tsv',
         'tmdb.movies.csv', 'tn.movie_budgets.csv']


def read_files(file_names, data_dir='zippedData'):
    """Read each file into a pandas DataFrame and return the DataFrames as a list.

    For the SQLite database (.db), the DataFrame holds the names of the
    tables in the database rather than the table contents.
    """
    data_files = []
    for name in file_names:
        path = f'{data_dir}/{name}'
        if path.endswith('.db'):
            # List the tables available in the SQLite database
            conn = sqlite3.connect(path)
            try:
                tables = pd.read_sql_query(
                    "SELECT name FROM sqlite_master WHERE type = 'table';", conn)
            finally:
                conn.close()
            data_files.append(tables)
            continue

        # Tab-separated for .tsv files, comma-separated otherwise
        sep = '\t' if path.endswith('.tsv') else ','
        try:
            data_files.append(pd.read_csv(path, sep=sep))
        except UnicodeDecodeError:
            # Retry with a more permissive encoding when the file is not valid UTF-8
            data_files.append(pd.read_csv(path, sep=sep, encoding='unicode_escape'))
    return data_files


# Read all files and unpack them into individual variables
(bom_movie_gross, imdb, rt_movie_info,
 rt_reviews, tmdb_movies, tn_movie_budgets) = read_files(files)
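
Note that for `im.db` the loader stores only the table names in `imdb`, not the tables themselves. A minimal sketch of pulling a single table into a DataFrame is shown below. The helper `load_table` and the `demo_ratings` table are hypothetical and exist only for this illustration; the demonstration runs against an in-memory database so that it does not depend on the project files.

```python
import sqlite3

import pandas as pd


def load_table(conn, table_name):
    """Read a whole SQLite table into a DataFrame (hypothetical helper).

    The name is checked against sqlite_master first, because table names
    cannot be passed as query parameters.
    """
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table';", conn
    )["name"].tolist()
    if table_name not in tables:
        raise ValueError(f"Unknown table {table_name!r}; available: {tables}")
    return pd.read_sql_query(f'SELECT * FROM "{table_name}";', conn)


# Demonstration on an in-memory database with a made-up table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_ratings (movie_id TEXT, averagerating REAL)")
conn.executemany(
    "INSERT INTO demo_ratings VALUES (?, ?)",
    [("tt001", 7.5), ("tt002", 6.1)],
)
conn.commit()

demo = load_table(conn, "demo_ratings")
print(demo)
conn.close()
```

With the real database, the same helper could be called with a connection to `zippedData/im.db` and any table name listed in the `imdb` DataFrame.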

In [19]:
tmdb_movies

| | Unnamed: 0 | genre_ids | id | original_language | original_title | popularity | release_date | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [12, 14, 10751] | 12444 | en | Harry Potter and the Deathly Hallows: Part 1 | 33.533 | 2010-11-19 | Harry Potter and the Deathly Hallows: Part 1 | 7.7 | 10788 |
| 1 | 1 | [14, 12, 16, 10751] | 10191 | en | How to Train Your Dragon | 28.734 | 2010-03-26 | How to Train Your Dragon | 7.7 | 7610 |
| 2 | 2 | [12, 28, 878] | 10138 | en | Iron Man 2 | 28.515 | 2010-05-07 | Iron Man 2 | 6.8 | 12368 |
| 3 | 3 | [16, 35, 10751] | 862 | en | Toy Story | 28.005 | 1995-11-22 | Toy Story | 7.9 | 10174 |
| 4 | 4 | [28, 878, 12] | 27205 | en | Inception | 27.920 | 2010-07-16 | Inception | 8.3 | 22186 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26512 | 26512 | [27, 18] | 488143 | en | Laboratory Conditions | 0.600 | 2018-10-13 | Laboratory Conditions | 0.0 | 1 |
| 26513 | 26513 | [18, 53] | 485975 | en | \_EXHIBIT\_84xxx\_ | 0.600 | 2018-05-01 | \_EXHIBIT\_84xxx\_ | 0.0 | 1 |
| 26514 | 26514 | [14, 28, 12] | 381231 | en | The Last One | 0.600 | 2018-10-01 | The Last One | 0.0 | 1 |
| 26515 | 26515 | [10751, 12, 28] | 366854 | en | Trailer Made | 0.600 | 2018-06-22 | Trailer Made | 0.0 | 1 |
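
In the output above, `Unnamed: 0` duplicates the index, and `genre_ids` looks like a list but is typically read from a CSV file as a plain string. The sketch below shows one possible way to tidy both and to derive a release month for the seasonality question. It uses three rows copied from the output. The use of `ast.literal_eval` and `explode`, and the `release_month` column, are assumptions made for illustration, not steps taken from the notebook.

```python
import ast

import pandas as pd

# Three rows copied from the output above, with genre_ids held as strings
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "genre_ids": ["[12, 14, 10751]", "[14, 12, 16, 10751]", "[12, 28, 878]"],
    "title": [
        "Harry Potter and the Deathly Hallows: Part 1",
        "How to Train Your Dragon",
        "Iron Man 2",
    ],
    "release_date": ["2010-11-19", "2010-03-26", "2010-05-07"],
})

# Drop the redundant copy of the index
tidy = sample.drop(columns=["Unnamed: 0"])
# Turn the string "[12, 14, 10751]" into a real Python list of ints
tidy["genre_ids"] = tidy["genre_ids"].apply(ast.literal_eval)
# Parse dates so that the month can be extracted for seasonal analysis
tidy["release_date"] = pd.to_datetime(tidy["release_date"])
tidy["release_month"] = tidy["release_date"].dt.month

# One row per (movie, genre id) pair, convenient for per-genre counts
per_genre = tidy.explode("genre_ids")
print(per_genre["genre_ids"].value_counts())
```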
