# Introduction
Microsoft has identified a new business opportunity in the film industry following the success of other major companies producing original films. We will explore which film genres perform best at the box office and translate the findings into actionable insights.

## Objectives
- **Analyze Current Box Office Trends:** Examine the types of films that are currently successful at the box office.
- **Identify Key Film Attributes:** Determine the attributes (e.g., genre, target audience, budget range) of the top-performing films.
- **Provide Actionable Insights:** Based on the analysis, recommend the types of films Microsoft should produce.

## Key Questions
1. **What genres are currently performing the best at the box office?**
2. **What are the common characteristics of the top-grossing films (e.g., budget, cast, director, special effects)?**
3. **Who are the target audiences for these successful films?**
4. **How do seasonal trends affect box office performance?**
5. **What marketing strategies are being used by top-performing films?**

### 1. Data Understanding
In this part, we load and inspect the dataset from Box Office Mojo (`Data/bom.movie_gross.csv`), which contains domestic and foreign box office gross figures for movies.

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### 1.1) Read `bom.movie_gross.csv` into a pandas DataFrame named `df`

We will use pandas to create a new DataFrame, called `df`, from the file `bom.movie_gross.csv`, which is stored in the `Data` folder next to this notebook.

In [2]:
df = pd.read_csv('Data/bom.movie_gross.csv')
df.head()  # Returns the first five rows of the DataFrame.

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [3]:
df.shape  # Returns a (rows, columns) tuple for the DataFrame.

(3387, 5)

In [4]:
# Print a summary of df: column names, non-null counts, and dtypes.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


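#### From the output above, we can see that some columns have missing values, because their non-null counts are lower than the 3387 total entries. Note also that `foreign_gross` is stored as `object` rather than as a numeric type. We will handle the missing values in Part 2 on Data Preparation.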

## Part 2: Data Preparation
In this part, we will clean and transform the data. This includes handling missing values, converting data types, filtering and more.
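
As an illustration of the "converting data types" step, here is a minimal sketch of turning an `object` column of numeric strings into floats. It runs on a toy Series rather than the real file, and the comma-formatted value is an assumption about what the raw `foreign_gross` strings might look like.

```python
import pandas as pd

# Toy stand-in for an object-dtype gross column; the comma-formatted
# entry is an assumed example of a value that blocks numeric parsing.
raw = pd.Series(["652000000", "1,131.6", None, "513900000"], name="foreign_gross")

# Remove thousands separators, then coerce anything unparseable to NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned.dtype)         # numeric dtype after conversion
print(cleaned.isna().sum())  # missing values are preserved as NaN
```

Using `errors="coerce"` keeps the conversion from failing on unexpected strings, at the cost of silently turning them into missing values, so it is worth comparing the null count before and after.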


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [6]:
df.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

##### From the output above, we can see that the `studio` column has 5 missing values, `domestic_gross` has 28, and `foreign_gross` has 1350. Now that we know how many values are missing in each column, we can decide how to deal with them.

##### In the cell below:
- **Let's determine what percentage of rows are missing a value in each column that has missing values.**

- **Print out the number of unique values in the columns.**


In [7]:
# isna().mean() gives the fraction of missing rows; the .2% format shows it as a percentage.
print(f"Percentage of null foreign_gross values: {df['foreign_gross'].isna().mean():.2%}")

print(f"Percentage of null domestic_gross values: {df['domestic_gross'].isna().mean():.2%}")

print(f"Percentage of null studio values: {df['studio'].isna().mean():.2%}")


Percentage of null foreign_gross values: 39.86%
Percentage of null domestic_gross values: 0.83%
Percentage of null studio values: 0.15%
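
The per-column figures can also be computed for every column in one pass, together with the unique-value counts. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical name, not the notebook's `df`) to show the idea.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "WB"],
    "domestic_gross": [1.0e6, np.nan, np.nan, 4.0e6],
})

# isna() yields booleans; their column-wise mean is the fraction missing.
pct_missing = toy.isna().mean().mul(100).round(2)

# nunique() counts distinct non-missing values per column.
n_unique = toy.nunique()

print(pct_missing)
print(n_unique)
```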


In [8]:
df.isna().sum()


title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

#### Roughly 40% of the `foreign_gross` values are missing, which is a significant proportion. Imputing that much data could introduce bias and distort the column's variance and distribution, so the best option is to drop the `foreign_gross` column rather than fill it in.
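
To make the concern about distorted variance concrete, here is a small sketch on synthetic data. The log-normal distribution and the 40% missing rate are assumptions chosen to loosely resemble the situation above. It is not a reproduction of the real column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "gross" values (log-normal is an assumption).
full = pd.Series(rng.lognormal(mean=15, sigma=2, size=1000))

# Blank out 40% of the values at random, then impute with the median
# of the values that remain.
masked = full.copy()
masked.iloc[rng.choice(len(full), size=400, replace=False)] = np.nan
imputed = masked.fillna(masked.median())

print(f"std of observed values: {masked.std():.3e}")
print(f"std after imputation:   {imputed.std():.3e}")
```

Because 400 of the 1000 rows end up sharing a single value, the spread of the imputed column is smaller than that of the observed values, which is the kind of distortion that motivates dropping the column instead.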

In [9]:
df = df.drop('foreign_gross', axis = 1)
df.isna().sum()


title              0
studio             5
domestic_gross    28
year               0
dtype: int64

#### In the cell below:
- **We will find the mean and median of the `domestic_gross` column, which will inform our decision on which statistic to use to replace the missing values.**


In [10]:
domestic_gross_mean = df['domestic_gross'].mean()
domestic_gross_median = df['domestic_gross'].median()

print(f"Mean value for domestic gross  column: {domestic_gross_mean}")
print(f"Median value for domestic gross  column: {domestic_gross_median}")

Mean value for domestic gross  column: 28745845.06698422
Median value for domestic gross  column: 1400000.0
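
A mean far above the median, as seen here, usually points to a right-skewed distribution in which a few very large values pull the mean up. The following sketch illustrates this on made-up numbers (not taken from the dataset).

```python
import pandas as pd

# Toy values: several small releases and one blockbuster (illustrative only).
gross = pd.Series([0.2e6, 0.5e6, 1.0e6, 1.4e6, 3.0e6, 9.0e6, 400.0e6])

print(f"mean:   {gross.mean():,.0f}")
print(f"median: {gross.median():,.0f}")
print(f"skew:   {gross.skew():.2f}")  # positive values indicate right skew
```

In such cases the median is typically the more representative "typical value", since it is unaffected by the size of the largest entries.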


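- **The mean is roughly 20 times the median, which indicates that a few very high-grossing films pull the mean upward. The median is therefore the more representative value, so in the cell below we will replace the missing values in the `domestic_gross` column with the median.**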

In [11]:
# Call .median() so the numeric median is passed to fillna, not the method itself.
df['domestic_gross'] = df['domestic_gross'].fillna(value=df['domestic_gross'].median())

##### Now let's confirm that the missing values have been replaced and check how many null values remain in the dataset.

In [12]:
df.isna().sum()

title             0
studio            5
domestic_gross    0
year              0
dtype: int64
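
A null count of zero does not by itself show that a column was filled with a number. Checking the dtype after imputation is a cheap extra safeguard. A minimal sketch on a toy Series (the values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 10.0], name="domestic_gross")

# Call .median() (with parentheses) so a number, not the method, is passed.
filled = s.fillna(s.median())

# After a numeric imputation the column should have no NaNs and stay numeric.
assert filled.isna().sum() == 0
assert pd.api.types.is_numeric_dtype(filled)

print(filled.tolist())
```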

In [13]:
# Drop the remaining rows with missing values (the 5 rows without a studio).
df = df.dropna()
df.isna().sum()

title             0
studio            0
domestic_gross    0
year              0
dtype: int64
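
Called without arguments, `dropna()` removes a row if any of its columns is missing. When only certain columns should trigger a drop, the `subset` parameter makes that explicit. The sketch below uses a made-up DataFrame, and the `rating` column is hypothetical, included only to show the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "studio": ["BV", None, "WB"],
    "rating": [np.nan, 7.1, 6.5],  # hypothetical extra column
})

# Drop rows only when `studio` is missing.
by_studio = toy.dropna(subset=["studio"])

# Drop rows when any column is missing.
any_missing = toy.dropna()

print(len(by_studio), len(any_missing))
```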