## Overview

## Movie Genre Analysis for Microsoft Movie Studio

This project aims to provide actionable insights for Microsoft's new movie studio. The goal is to identify commercially successful film types to guide genre selection decisions.

## Business Understanding

Microsoft has decided to venture into the film industry by establishing a new movie studio. However, its lack of experience in movie production calls for a data-driven approach to navigate this new territory. This project aims to identify commercially successful movie genres for Microsoft's new studio. Key objectives include:

1. Explore and analyze movie datasets to understand trends and patterns in the film industry.
2. Identify the types of movies that are performing well at the box office according to multiple criteria such as revenue, budget, box office performance, ratings, and genre.
3. Translate the findings into concrete business recommendations for Microsoft's new movie studio.


The primary stakeholder of this project is Microsoft's new movie studio. The project will directly benefit the studio by providing:

1. Actionable recommendations on genres with high domestic gross.
2. Data-driven insights on audience reception versus gross return.
3. A competitive edge by identifying potentially lucrative niche genres.

## Data Understanding

To gain a comprehensive understanding of movie genre performance, we use a combination of data sources:

1. tn.movie_budgets.csv: This table contains essential information about each movie, including its title, genre(s), release date, and production budget. It lets us categorize movies by genre and analyze box office performance within those categories.
2. tmdb.movies.csv: This table captures user ratings for various movies. Although ratings are not a direct measure of box office success, they offer insight into audience reception and potential genre preferences.
3. bom.movie_gross.csv: This dataset contains each movie's domestic and foreign box office gross.

## Python Libraries

In [1]:
# Importing relevant python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlite3

## Read the first CSV file, bom.movie_gross.csv, into a pandas DataFrame named bom_movies

In [11]:
# Read bom.movie_gross.csv from its file path
bom_movies = pd.read_csv(r"C:\Users\user\Desktop\phase1_project\Data\bom.movie_gross.csv")

# Display the DataFrame (pandas truncates the output to the first and last rows)
bom_movies

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


Summarize the numeric columns with descriptive statistics

In [12]:
# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
bom_movies.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


Get the dimensions of the DataFrame

In [6]:
# Check how many rows and columns the dataframe contains.
bom_movies.shape

(3387, 5)

Display information about the DataFrame

In [7]:
# Show the row count, column names, data types, and non-null counts (which reveal missing values)
bom_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


Let's use boolean indexing to select the columns that contain missing values.

In [8]:
#checking columns with missing values
bom_movies.columns[bom_movies.isnull().any()]

Index(['studio', 'domestic_gross', 'foreign_gross'], dtype='object')
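To make the boolean-indexing step more concrete, here is a minimal sketch on a small, made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from the dataset). It splits the one-liner above into its two parts: building a per-column boolean mask, then using that mask to filter the column index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two of its three columns
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "studio": ["BV", None, "WB"],
    "domestic_gross": [1.0, 2.0, np.nan],
})

# One boolean per column: True if at least one cell in that column is missing
has_missing = toy.isnull().any()

# Indexing the column Index with the mask keeps only the columns flagged True
cols_with_missing = toy.columns[has_missing]
print(list(cols_with_missing))
```

Only `studio` and `domestic_gross` are kept, because `title` has no missing cells.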

The output above shows 3 columns with missing values. Let's check how many values are missing in each column.

In [9]:
# Get the total missing values in each column
bom_movies.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

We drop the rows with missing values in studio or domestic_gross: only 5 and 28 values are missing respectively, so removing those rows costs very little data.

In [15]:
# Dropping rows with missing values for studio and domestic gross
bom_movies.dropna(subset=['studio', 'domestic_gross'], inplace=True)

#checking the new dimension
bom_movies.shape

(3356, 5)
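The shape falls by 31 rows (3387 to 3356) even though 5 + 28 = 33 values were missing, which is consistent with two rows lacking both studio and domestic_gross. The sketch below illustrates that effect on a hypothetical toy frame: `dropna(subset=...)` removes a row once, no matter how many of the listed columns are empty in it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 2 is missing both fields
toy = pd.DataFrame({
    "studio": ["BV", None, None, "WB"],
    "domestic_gross": [1.0, 2.0, np.nan, np.nan],
})
cols = ["studio", "domestic_gross"]

# Total missing cells across the two columns
missing_cells = int(toy[cols].isna().sum().sum())

# Rows with at least one missing cell in those columns
rows_affected = int(toy[cols].isna().any(axis=1).sum())

# Non-inplace variant of the drop used above; returns a new DataFrame
cleaned = toy.dropna(subset=cols)

print(missing_cells, rows_affected, len(cleaned))
```

Here four cells are missing but only three rows are dropped, leaving a single complete row.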

The foreign_gross column has a large number of missing values (1350). Dropping those rows would discard a lot of data, so we fill the NaN values with the forward fill method instead, which copies the most recent non-missing value from an earlier row.

In [16]:
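# Forward fill missing values in the 'foreign_gross' column
# (assigning the result back avoids the deprecated fillna(method=...) and chained inplace call)
bom_movies['foreign_gross'] = bom_movies['foreign_gross'].ffill()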
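Because forward fill copies a value from a different film, it can help to record which rows were originally empty before filling them. The following is a minimal sketch on hypothetical toy data; the column names `foreign_gross_was_missing` and `foreign_gross_filled` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame: only the first film reports a foreign gross
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "foreign_gross": ["652000000", None, None],
})

# Remember which rows had no value before any filling happens
toy["foreign_gross_was_missing"] = toy["foreign_gross"].isna()

# Forward fill: each gap takes the last non-missing value above it
toy["foreign_gross_filled"] = toy["foreign_gross"].ffill()

print(toy)
```

After the fill, Film B and Film C both carry Film A's figure, and the flag column makes it possible to exclude or inspect those imputed rows later.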

In [17]:
#checking if we still have any column with missing values
bom_movies.isna().sum()

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

The output above shows that no missing values remain. We can check the DataFrame information again to confirm the changes.

In [18]:
# Check whether the changes took effect
bom_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   3356 non-null   object 
 4   year            3356 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 157.3+ KB
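The summary above still lists foreign_gross as `object`, meaning the values are stored as text rather than numbers. Before any arithmetic on that column, it would need converting. The sketch below shows one possible way, under the assumption (not confirmed by the output above) that some entries contain thousands separators; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: plain digits, a comma-formatted number, and a gap
raw = pd.Series(["652000000", "1,131.6", None], name="foreign_gross")

# Strip commas, then convert; anything unparseable becomes NaN instead of raising
numeric = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(numeric.dtype)
print(numeric.tolist())
```

Using `errors="coerce"` is a deliberate choice here: it keeps the conversion from failing on unexpected strings, at the cost of silently turning them into NaN, so the missing-value count is worth rechecking afterwards.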


Check for any duplicated values

In [19]:
#checking for any duplicates
duplicates = bom_movies[bom_movies.duplicated()]
duplicates

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year


The result is empty, so the DataFrame contains no fully duplicated rows.
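Called without arguments, `duplicated()` only flags rows that match an earlier row in every column. As a hedged illustration on hypothetical toy data, the sketch below shows how a film listed twice with different gross figures passes that check, and how restricting the comparison to key columns (here `title` and `year`, chosen as an assumption about what identifies a film) would surface it.

```python
import pandas as pd

# Hypothetical toy frame: "Film A" appears twice with different gross values
toy = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "studio": ["X", "X", "Y"],
    "domestic_gross": [100.0, 120.0, 50.0],
    "year": [2010, 2010, 2011],
})

# Whole-row comparison: nothing matches exactly
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison; keep=False marks every member of a duplicated group
key_dupes = toy[toy.duplicated(subset=["title", "year"], keep=False)]

print(full_row_dupes, len(key_dupes))
```

The whole-row check finds no duplicates, while the key-based check returns both "Film A" rows.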

## Now, let's read the second CSV dataset, tmdb.movies.csv:

In [21]:
# Read tmdb.movies.csv from its file path
tmdb_movies = pd.read_csv(r"C:\Users\user\Desktop\phase1_project\Data\tmdb.movies.csv")

# Display the DataFrame (pandas truncates the output to the first and last rows)
tmdb_movies

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...,...,...,...,...
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1
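Since the file is read from CSV, each `genre_ids` entry arrives as a string such as `"[12, 14, 10751]"` rather than a Python list. For a per-genre analysis, one possible approach is to parse the strings and give each genre its own row. This is a minimal sketch on hypothetical toy rows; the `genre_list` column name is illustrative.

```python
import ast

import pandas as pd

# Hypothetical toy rows mimicking the string format shown in the table above
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "genre_ids": ["[12, 14, 10751]", "[27, 18]"],
})

# literal_eval safely turns the text "[12, 14]" into the list [12, 14]
toy["genre_list"] = toy["genre_ids"].apply(ast.literal_eval)

# explode() repeats each film once per genre id in its list
per_genre = toy.explode("genre_list")

print(per_genre[["title", "genre_list"]])
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer for text loaded from a file.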


Let's explore the data, starting with its dimensions and column types.

In [23]:
#Check for the dimension of the dataframe (rows and columns).
tmdb_movies.shape

(26517, 10)

In [24]:
# Show the row count, column names, data types, and non-null counts (which reveal missing values)
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
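The `Unnamed: 0` column in this summary is what pandas produces when a CSV's first column has no header, which typically happens when a DataFrame was saved together with its index. A hedged sketch of the effect, using a small in-memory CSV with hypothetical contents, shows how `index_col=0` reads that column back as the index instead:

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has an empty header (a saved index)
csv_text = ",id,title\n0,12444,Film A\n1,10191,Film B\n"

# Default read: the unnamed first column becomes a regular data column
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Whether to apply this to tmdb.movies.csv is a judgment call; dropping the column after loading would have the same practical effect.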


We can now check if the dataset has any missing values:

In [25]:
#checking if we have any column with missing values
tmdb_movies.isna().sum()

Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64

Since the output above shows no missing values, we can check whether the DataFrame contains any duplicated values: