## Final Project Submission

Please fill out:
* Student names: Christopher Noel, Margaret Nyairo, Victor Masinde, James Ngumo, Anthony Ekeno. 
* Student pace: full time 
* Instructor name: Maryann Mwikali


# Market Analysis & Insights for Strategic Movie Production

## Business Understanding

### Overview
This project supports a company's venture into the movie production industry through the launch of a new studio. By analysing box office data, the project identifies current trends and provides actionable insights. These insights will guide the company in determining which types of movies are most successful in today's market, supporting strategic content creation and maximizing box office returns.

###  The Problem Statement
The company needs to pinpoint what types of movies are most successful in the current market to propel the new studio's launch. Specifically, we aim to:

1. **Identify the Genres Performing Well at the Box Office**: Determine which movie genres are currently popular and yield high box office returns.
   
2. **Analyze Movie Budgets and Profitability**: Evaluate the relationship between production budgets, returns, and overall profitability to find the optimal investment range.
   
3. **Assess Audience Demographics Driving Success**: Understand which demographic segments are contributing significantly to box office revenues.
   
4. **Recommend Optimal Release Seasons or Windows**: Identify the best times of the year for releasing movies to maximize box office performance.

### Main objectives
The main objective is to identify the most successful types of films currently at the box office and translate these insights into actionable strategies that guide the new studio's production choices, so that it can compete and succeed commercially in the film industry.


## Data Wrangling

### Import Libraries
First, we import the packages needed for exploratory data analysis.


In [None]:
# Import the libraries used for data handling, plotting, and database access.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3 

%matplotlib inline


### Exploratory Data Analysis

The data used in this analysis was collected from popular movie sites, including Box Office Mojo, IMDb, and Rotten Tomatoes. It contains detailed information on movie titles, actors, directors, box office earnings, and movie ratings. We work with four datasets:
1. im.db

2. box office mojo gross

3. movie budgets

4. movie info


 

### Import Datasets



##### First Dataset - im.db
This dataset forms the basis of the analysis.

In [None]:
# First dataset: im.db

# Establish a connection to the SQLite database.
conn = sqlite3.connect("zippedData/im.db")
cur = conn.cursor()

# List the schema objects (tables) stored in the database.
pd.read_sql("""
SELECT *
FROM sqlite_master
""", conn)

###### An Entity Relationship Diagram (ERD)
- The ERD below describes the contents of the tables listed above.

- After studying the column names in the ERD, we find that two tables are of interest: **movie_basics** and **movie_ratings**, since their contents are vital to our analysis.

- We examine the structure of these tables by combining them with a **JOIN** statement.

![Alt text](movie_data_erd.jpeg)

###### JOIN tables (movie_basics + movie_ratings) to create a new DataFrame **imdb**

In [None]:
# Join the required tables and load the result into a DataFrame.
# Select the relevant columns from movie_basics and
# join them to movie_ratings on movie_id.

imdb = pd.read_sql("""
SELECT primary_title, start_year, runtime_minutes, genres, averagerating
FROM movie_basics
JOIN movie_ratings
USING (movie_id)
""", conn)
imdb
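
A plain `JOIN` in SQLite is an inner join, so any movie in `movie_basics` without a matching row in `movie_ratings` is left out of `imdb`. The sketch below illustrates this on a throwaway in-memory database; the three movies and their ratings are made up for illustration and are not taken from `im.db`.

```python
import sqlite3
import pandas as pd

# Assumption: hypothetical rows in an in-memory database, not the real im.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL);
    INSERT INTO movie_basics VALUES ('m1', 'Film A'), ('m2', 'Film B'), ('m3', 'Film C');
    INSERT INTO movie_ratings VALUES ('m1', 7.5), ('m2', 6.1);
    """
)

# Inner join: only movies that have a rating survive.
inner = pd.read_sql(
    "SELECT primary_title, averagerating "
    "FROM movie_basics JOIN movie_ratings USING (movie_id)",
    demo_conn,
)

# Left join: every movie is kept; unrated ones get a missing rating.
left = pd.read_sql(
    "SELECT primary_title, averagerating "
    "FROM movie_basics LEFT JOIN movie_ratings USING (movie_id)",
    demo_conn,
)

print(len(inner), len(left))
print(left["averagerating"].isna().sum())
demo_conn.close()
```

Since the analysis relies on ratings, the inner join is a reasonable choice here; a `LEFT JOIN` would only be needed if unrated movies had to be retained.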

##### Second Dataset - box office mojo gross
Essential for understanding box office success and profitability.

In [None]:
# Second dataset: Box Office Mojo gross figures
gross = pd.read_csv("zippedData/bom.movie_gross.csv")
gross.head()

##### Third Dataset - movie budgets
Used to assess budget-related profitability.

In [None]:
# Third dataset: movie budgets
budget = pd.read_csv("zippedData/tn.movie_budgets.csv")
budget.head()

##### Fourth Dataset - movie info
Allows us to categorize movies, which is essential for identifying high-performing genres.


In [None]:
# Fourth dataset: movie info (tab-separated file)
movie_info = pd.read_csv("zippedData/rt.movie_info.tsv",sep="\t")
movie_info.head()

####  Data Understanding 
In this section we examine the data to understand it before working on it. We check the number of rows and columns in each dataset (`.shape`) and get a concise summary listing the column names, the number of non-null values, and the dtype of each column (`.info`). For continuous data, `.describe` gives a brief statistical summary of the integer and float columns.
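
The same checks can be gathered into one table so the datasets are easier to compare side by side. The following is a minimal sketch rather than part of the original workflow: `summarize_frames` is a hypothetical helper, and the two toy frames stand in for the real datasets.

```python
import pandas as pd


def summarize_frames(frames):
    """Return one row per DataFrame with its shape and missing-value count."""
    rows = []
    for name, df in frames.items():
        rows.append(
            {
                "dataset": name,
                "rows": df.shape[0],
                "columns": df.shape[1],
                # Total number of missing cells across all columns.
                "missing_values": int(df.isna().sum().sum()),
                # Number of integer or float columns.
                "numeric_columns": df.select_dtypes(include="number").shape[1],
            }
        )
    return pd.DataFrame(rows).set_index("dataset")


# Assumption: toy frames with made-up values, used only to show the output shape.
toy_frames = {
    "toy_gross": pd.DataFrame({"title": ["A", "B"], "domestic_gross": [1.0, None]}),
    "toy_budget": pd.DataFrame({"id": [1, 2, 3], "movie": ["A", "B", "C"]}),
}
summary = summarize_frames(toy_frames)
print(summary)
```

With the real data, the helper could be called as `summarize_frames({"imdb": imdb, "gross": gross, "budget": budget, "movie_info": movie_info})`.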

###### .info Method
This prints a concise summary of each DataFrame: column names, non-null counts, and dtypes.

In [None]:
# first dataset
imdb.info()

In [None]:
# second dataset 
gross.info()

In [None]:
#  third dataset
budget.info()

In [None]:
# fourth dataset 
movie_info.info()

###### .columns Attribute
This returns the column labels of each DataFrame.

In [None]:
# First DataSet
imdb.columns

In [None]:
# Second DataSet
gross.columns

In [None]:
# Third DataSet
budget.columns

In [None]:
# Fourth DataSet 
movie_info.columns

###### .describe Method
This returns descriptive statistics for each DataFrame. By default, only numeric columns are summarized.

In [None]:
# First DataSet
imdb.describe()

In [None]:
# Second DataSet 
gross.describe()

In [None]:
# Third Dataset
budget.describe()

In [None]:
# Fourth DataSet
movie_info.describe()
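
Because `describe()` skips non-numeric columns by default, datasets made up mostly of object columns produce a very short summary. A minimal sketch of widening the summary with `include="all"` is shown below; the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Assumption: a made-up frame with one integer column and one text column.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "production_budget": ["$10,000", "$20,000", "$10,000"],
    }
)

# Default behaviour: only the numeric column is summarized.
numeric_only = toy.describe()

# include="all" adds count, unique, top and freq for the text column.
full = toy.describe(include="all")

print(list(numeric_only.columns))
print(full)
```

For text columns, the `unique`, `top`, and `freq` rows show how many distinct values exist and which value is most common.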

###### .shape Attribute
This returns the number of rows and columns as a tuple.

In [None]:
# First DataSet
imdb.shape

In [None]:
# Second DataSet
gross.shape

In [None]:
# Third DataSet
budget.shape

In [None]:
# Fourth DataSet
movie_info.shape

We are working with four different datasets for this project. The first, named **imdb**, has 73856 rows and 5 columns. It was sourced from ... . It contains all three data types: float, integer, and object. The data helps us analyse the `genres, titles, runtime minutes and ratings of movies`.

The second dataset is named **gross** and has 3387 rows and 5 columns. It was sourced from ... . Its data types are float, integer, and object. The data helps us analyse the income generated, as it contains columns with `domestic and foreign gross data`.

The third dataset is named **budget** and has 5782 rows and 6 columns. It was sourced from ... . The data types in this dataset are integer and object only. It has columns, such as `production_budget`, that help us assess the cost of producing a movie. We can also examine seasonal trends, as it has a column with date information.
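
Since the budget figures and dates are stored as objects, they need converting before any arithmetic or seasonal grouping is possible. The sketch below shows one way to do this. It assumes the monetary values look like `"$1,500,000"` and the dates like `"Jan 05, 2015"`; the `release_date` column name, both formats, and the rows themselves are assumptions made for illustration and should be checked against `budget.head()`.

```python
import pandas as pd

# Assumption: made-up rows; column formats are assumed, not confirmed.
toy_budget = pd.DataFrame(
    {
        "release_date": ["Jan 05, 2015", "Jul 14, 2017"],
        "production_budget": ["$1,500,000", "$20,000,000"],
    }
)

# Strip the currency symbol and thousands separators, then convert to integers.
toy_budget["production_budget"] = (
    toy_budget["production_budget"]
    .str.replace(r"[$,]", "", regex=True)
    .astype("int64")
)

# Parse the date strings so the release month can be extracted.
toy_budget["release_date"] = pd.to_datetime(
    toy_budget["release_date"], format="%b %d, %Y"
)
toy_budget["release_month"] = toy_budget["release_date"].dt.month

print(toy_budget.dtypes)
print(toy_budget)
```

Once the columns are numeric and datetime, grouping by `release_month` becomes a straightforward way to look at seasonal patterns.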

The last dataset is named **movie_info** and has 1560 rows and 12 columns. It is sourced from ... . The data types in this dataset are integer and object only. It has columns with information about `writer, director, studio`, etc., that can help us make informed recommendations at the end of the project.


`Data cleaning will follow below`