# 1. Business Understanding

## 1.1 Introduction

The Microsoft movie project seeks out relevant data that Microsoft can use in building a movie studio. The project focuses on Box Office Mojo data, movie budgets and other movie information that has been used by other movie production companies.
After understanding the data, we clean it by removing duplicates, checking for missing data and removing irrelevant data.
Thereafter, visualizations are made to present the data so that it is easy for Microsoft's leadership to understand.

## 1.2 Problem Statement

Microsoft wants to get into the movie industry but does not know much about it. They therefore want a data scientist to evaluate the industry and identify which movies are doing well at the box office, providing actionable insights that Microsoft can use to decide which films to focus on.

## 1.3 Main Objective

This project aims to help Microsoft's new movie studio make informed decisions about the types of films to create. By analyzing data from the film industry, insights and trends that will assist in identifying successful strategies for the studio can be uncovered.

## 1.4 Specific Objectives

a). Identify the most successful movie genres, so that the head of Microsoft's studio can promptly decide which types of movies to focus on.

b). Visualize movie budgets (especially using the tn.movie_budgets file).

c). Understand how different attributes of a movie, e.g. rating, genre and runtime, affect the company's reputation and sequentials, for that matter.

## 1.5 Notebook Structure

Introduction

Problem Statement

Main Objective

Specific Objectives

Importing Libraries

Data Understanding

Data Cleaning

Data Visualizations

Conclusions

Recommendations

# 2. Importing Libraries

In [188]:
import csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# 3. Data Understanding

## 3.1 Box office mojo

In [189]:
# Loading the data
bom_data = pd.read_csv("Data/bom.movie_gross.csv")

In [190]:
# viewing the columns
bom_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [191]:
# View the first rows to see what the data looks like
bom_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [None]:
# View the last rows to see what the data looks like
bom_data.tail()

In [None]:
# To get a glimpse of the data measures
bom_data.describe()

## 3.2 tn.movie_budgets.csv

In [None]:
# Loading the data
tn_movie_budgets = pd.read_csv("Data/tn.movie_budgets.csv")

In [None]:
# viewing the columns
tn_movie_budgets.info()

In [None]:
# View the first rows to see what the data looks like
tn_movie_budgets.head()

In [None]:
# View the last rows to see what the data looks like
tn_movie_budgets.tail()

In [None]:
# To get a glimpse of the data measures
tn_movie_budgets.describe()

## 3.3 rt.movie_info.tsv

In [None]:
# Loading the data
rt_movie_info = pd.read_csv("Data/rt.movie_info.tsv", delimiter = "\t")

In [None]:
# viewing the columns
rt_movie_info.info()

In [None]:
# View the first rows to see what the data looks like
rt_movie_info.head()

In [None]:
# View the last rows to see what the data looks like
rt_movie_info.tail()

In [None]:
# To get a glimpse of the data measures
rt_movie_info.describe()

# 4. Data Cleaning

## 4.1 bom.movie_gross.csv

### 4.1.1 Checking any missing data

In [None]:
print(bom_data.isna())
print(bom_data.isna().sum())
# True marks a missing value (NaN, "Not a Number")

There are missing values in the studio, domestic_gross and foreign_gross columns. To solve that, we replace missing studio names with "unknown" and missing gross values with 0.

In [None]:
bom_data.hist()

In [None]:
bom_data["studio"] = bom_data.studio.fillna("unknown")

Now we replace the missing domestic_gross values with 0

In [None]:
# Fill missing values in the "domestic_gross" column with 0
bom_data["domestic_gross"] = bom_data.domestic_gross.fillna(0)

We will do the same for the foreign gross

In [None]:
# Remove commas from "foreign_gross" column (.str.replace works inside each string)
bom_data["foreign_gross"] = bom_data["foreign_gross"].str.replace(",", "", regex=False)

# Convert "foreign_gross" column to float and fill missing values with 0
bom_data["foreign_gross"] = pd.to_numeric(bom_data["foreign_gross"], errors='coerce')
bom_data["foreign_gross"] = bom_data["foreign_gross"].fillna(0)

In [None]:
#confirm there are no more missing values
print(bom_data.isna().sum())
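
The gross columns above are filled with 0, which treats an unreported figure as if the movie earned nothing. As a hedged alternative, the sketch below fills the gaps with the median of the observed values. The `toy` frame, its values and the `domestic_gross_filled` column are made up for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for bom_data; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "domestic_gross": [100.0, np.nan, 300.0, 500.0],
    }
)

# Median of the observed values; NaN entries are skipped by default.
median_gross = toy["domestic_gross"].median()

# Fill the gaps with the median instead of 0 (hypothetical new column).
toy["domestic_gross_filled"] = toy["domestic_gross"].fillna(median_gross)

print(median_gross)
print(toy)
```

Whether 0 or the median is the better choice depends on why the value is missing, so this is worth checking before the visualizations are built.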

### 4.1.2 Removing any duplicates

In [None]:
def identify_duplicates(data):
    """Simple function to identify any duplicates"""
    # identify the duplicates (dataframename.duplicated() , can add .sum() to get total count)
    # empty list to store Bool results from duplicated
    duplicates = []
    for i in data.duplicated():
        duplicates.append(i)
    # identify if there is any duplicates. (If there is any we expect a True value in the list duplicates)
    duplicates_set = set(duplicates) 
    if (len(duplicates_set) == 1):
        print("The Data has no duplicates")
    else:
        no_true = 0
        for val in duplicates:
            if (val == True):
                no_true += 1
        # percentage of the data represented by duplicates 
        duplicates_percentage = np.round(((no_true / len(data)) * 100), 3)
        print(f"The Data has {no_true} duplicated rows.\nThis constitutes {duplicates_percentage}% of the data set.")

# Example usage with a DataFrame
identify_duplicates(bom_data)
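
The same count can be obtained without explicit loops, because `DataFrame.duplicated()` returns a boolean Series that can be summed directly. The sketch below is one possible compact version; the function name `summarize_duplicates` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def summarize_duplicates(data):
    """Return the number of fully duplicated rows and their share in percent."""
    # duplicated() flags every repeat of an earlier row as True.
    n_duplicates = int(data.duplicated().sum())
    # Guard against an empty frame to avoid dividing by zero.
    share = round(n_duplicates / len(data) * 100, 3) if len(data) else 0.0
    return n_duplicates, share


# Toy data: the third row repeats the first one.
toy = pd.DataFrame({"title": ["A", "B", "A", "C"], "year": [2010, 2011, 2010, 2012]})
count, percentage = summarize_duplicates(toy)
print(f"{count} duplicated rows ({percentage}% of the data set)")
```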


### 4.1.3 Removing any irrelevant data

In [None]:
# check for any movie produced before 2010 
bom_data[bom_data["year"] < 2010]

### 4.1.4 Arranging Messy columns

In [None]:
# Strip stray whitespace from the column names and assign the result back
bom_data.columns = bom_data.columns.str.strip()
bom_data.columns

### 4.1.5 Final bom_data after cleaning

In [None]:
bom_data.info()

In [None]:
bom_data.hist()

## 4.2 tn.movie_budgets.csv

### 4.2.1 Checking any missing data

In [None]:
print(tn_movie_budgets.isna())
print(tn_movie_budgets.isna().sum())

In [None]:
#replacing missing values in release_date with the placeholder "released"
tn_movie_budgets["release_date"] = tn_movie_budgets.release_date.fillna("released")

### 4.2.2 Removing any duplicates

First things first, we check for any duplicated data. Each ID is expected to be unique.

In [None]:
def identify_duplicates(data):
    # identify the duplicates (dataframename.duplicated() , can add .sum() to get total count)
    # empty list to store Bool results from duplicated
    duplicates = []
    for i in data.duplicated():
        duplicates.append(i)
    duplicates_set = set(duplicates) 
    if (len(duplicates_set) == 1):
        print("The Data has no duplicates")
    else:
        no_true = 0
        for val in duplicates:
            if (val == True):
                no_true += 1

        duplicates_percentage = np.round(((no_true / len(data)) * 100), 3)
        print(f"The Data has {no_true} duplicated rows.\nThis constitutes {duplicates_percentage}% of the data set.") 


identify_duplicates(tn_movie_budgets)

Now that we have spotted duplicated rows, let's drop them
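
Dropping duplicated rows can be sketched with `DataFrame.drop_duplicates`. The example uses a small made-up frame rather than `tn_movie_budgets`, and the choice of `keep="first"` is an assumption about which copy should survive.

```python
import pandas as pd

# Toy data: the row with id 2 appears twice.
toy = pd.DataFrame({"id": [1, 2, 2, 3], "movie": ["A", "B", "B", "C"]})

# keep="first" retains the first occurrence of each duplicated row;
# reset_index gives the result a clean 0..n-1 index again.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(len(toy), "->", len(deduplicated))
```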

### 4.2.3 Removing any irrelevant columns

In [None]:
# Checking how the data looks with outliers (in this case movies produced before 2000)
# Convert the release date to pandas datetime
tn_movie_budgets["release_date"] = pd.to_datetime(tn_movie_budgets["release_date"])

# Extract the year from the release date
tn_movie_budgets["release_date"] = tn_movie_budgets["release_date"].dt.year

#plot
sns.boxplot(data = tn_movie_budgets, x =  "release_date")

In [None]:
# Drop movies released before 2000 (>= keeps the year 2000 itself)
tn_movie_budgets = tn_movie_budgets[tn_movie_budgets["release_date"] >= 2000]
tn_movie_budgets

In [None]:
# Check how the data looks after removing the years prior to 2000
sns.boxplot(data = tn_movie_budgets, x = "release_date")

In [None]:
unique_dates = tn_movie_budgets["release_date"].unique()
print(unique_dates)
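
The date handling above can also be sketched on a small example that keeps the original date string and stores the year in a separate column. The `release_year` column name, the toy rows and the `"%b %d, %Y"` date format are assumptions for illustration; the real file may use a different format.

```python
import pandas as pd

# Toy data with dates written like "Dec 18, 2009" (assumed format).
toy = pd.DataFrame(
    {
        "movie": ["A", "B", "C"],
        "release_date": ["Dec 18, 2009", "May 20, 1995", "Jan 15, 2000"],
    }
)

# Parse the strings and keep only the year in a new column,
# leaving the original release_date untouched.
toy["release_year"] = pd.to_datetime(toy["release_date"], format="%b %d, %Y").dt.year

# Keep movies released in 2000 or later.
recent = toy[toy["release_year"] >= 2000]
print(recent)
```

Keeping the year in its own column avoids losing the day and month, which may be useful later for seasonal comparisons.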

tn_movie_budgets now has no irrelevant data.

### 4.2.4 Arranging messy data

In [None]:
# Strip stray whitespace from the column names and assign the result back
tn_movie_budgets.columns = tn_movie_budgets.columns.str.strip()
tn_movie_budgets.columns

### 4.2.5 Creating a new column to combine domestic and foreign gross

In [None]:
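# Strip "$" and "," before converting the gross columns to float (raw strings avoid escape warnings)
tn_movie_budgets['domestic_gross'] = tn_movie_budgets['domestic_gross'].str.replace(r'[\$,]', '', regex=True).astype(float)
tn_movie_budgets['worldwide_gross'] = tn_movie_budgets['worldwide_gross'].str.replace(r'[\$,]', '', regex=True).astype(float)

# Caution: if worldwide_gross already includes domestic_gross, this sum counts domestic revenue twice
tn_movie_budgets["total_gross"] = tn_movie_budgets["domestic_gross"] + tn_movie_budgets["worldwide_gross"]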
tn_movie_budgets.head()
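
The currency clean-up can be checked on a small example before it is applied to the full table. The sketch below uses made-up values and, as an assumption that should be verified against the data source, treats the worldwide figure as already including the domestic one, so a foreign share is derived by subtraction. The `foreign_gross` column here is hypothetical.

```python
import pandas as pd

# Toy data with currency-formatted strings, invented for illustration.
toy = pd.DataFrame(
    {
        "movie": ["A", "B"],
        "domestic_gross": ["$1,000", "$250"],
        "worldwide_gross": ["$3,500", "$250"],
    }
)

# Remove "$" and "," from each string, then convert to float.
for column in ["domestic_gross", "worldwide_gross"]:
    toy[column] = toy[column].str.replace(r"[\$,]", "", regex=True).astype(float)

# Assumption: worldwide includes domestic, so foreign = worldwide - domestic.
toy["foreign_gross"] = toy["worldwide_gross"] - toy["domestic_gross"]
print(toy)
```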

## 4.3 rt.movie_info.tsv

### 4.3.1 Checking any missing data

In [None]:
rt_movie_info.isna()
rt_movie_info.isna().sum()
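
Raw counts of missing values are easier to compare across columns when expressed as a share of all rows. A minimal sketch on made-up data follows; the `toy` frame and its values are assumptions, not the contents of `rt_movie_info`.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two of the three columns.
toy = pd.DataFrame(
    {
        "rating": ["R", None, "PG", "R"],
        "box_office": [np.nan, np.nan, np.nan, 1_000_000.0],
        "genre": ["Drama", "Comedy", "Drama", "Action"],
    }
)

# isna().mean() gives the fraction of missing rows per column;
# multiply by 100 for percent and sort so the worst columns come first.
missing_share = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty may be better dropped or left unimputed than filled with a single constant.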

In [None]:
#rt_movie_info before cleaning: number of missing values per column
plt.figure(figsize = (10,10))
missing_counts = rt_movie_info.isna().sum()
plt.bar(missing_counts.index, missing_counts.values, label = "missing values")
plt.xlabel("rt movie info columns")
plt.ylabel("Number of missing values")
plt.xticks(rotation = 90)
plt.title("Bar graph of rt_movie_info before cleaning")
plt.show()

In [None]:
#replacing missing values in synopsis with unknown
rt_movie_info["synopsis"] = rt_movie_info.synopsis.fillna("unknown")

#replacing missing values in rating with R
rt_movie_info["rating"] = rt_movie_info.rating.fillna("R")

# replacing missing values in genre with the term Drama
rt_movie_info["genre"] = rt_movie_info.genre.fillna("Drama")

#replacing the missing values in director column with the data in writer column
rt_movie_info["director"] = rt_movie_info["director"].fillna(rt_movie_info["writer"])

#replacing the missing values in writer column with the data in the director column
rt_movie_info["writer"] = rt_movie_info["writer"].fillna(rt_movie_info["director"])

#replacing the missing values in theater date column with the data in the dvd date column
rt_movie_info["theater_date"] = rt_movie_info["theater_date"].fillna(rt_movie_info["dvd_date"])

#replacing the missing values in dvd date column with the data in the theater date column
rt_movie_info["dvd_date"] = rt_movie_info["dvd_date"].fillna(rt_movie_info["theater_date"])

# replacing missing currency values with "0"
rt_movie_info["currency"] = rt_movie_info.currency.fillna("0")

# removing commas from box office, converting it to numeric and filling missing values with 0
rt_movie_info["box_office"] = rt_movie_info["box_office"].str.replace(",", "", regex=False)
rt_movie_info["box_office"] = pd.to_numeric(rt_movie_info["box_office"], errors = "coerce")
rt_movie_info["box_office"] = rt_movie_info.box_office.fillna(0)

# replace missing values in runtime with 120 minutes
rt_movie_info["runtime"] = rt_movie_info.runtime.fillna("120 minutes")

#replace missing values in studio
rt_movie_info["studio"] = rt_movie_info.studio.fillna("Entertainment one")
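
When stripping thousands separators it matters which replace method is used: `Series.replace(",", "")` only matches cells whose whole value is `","`, while `Series.str.replace` works inside each string. The sketch below shows the difference on a made-up Series; the values are assumptions for illustration.

```python
import pandas as pd

# Toy values formatted like box office figures, with one missing entry.
raw = pd.Series(["1,200", "600,000", None, "32"])

# Whole-value replacement: nothing equals "," exactly, so nothing changes.
whole_value = raw.replace(",", "")

# Substring replacement: commas are removed inside every string.
within_string = raw.str.replace(",", "", regex=False)

# Now the strings convert cleanly; the missing entry stays NaN.
numeric = pd.to_numeric(within_string, errors="coerce")
print(numeric.tolist())
```

If the commas are left in place, `pd.to_numeric(..., errors="coerce")` silently turns those values into NaN, which can make a mostly complete column look empty.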

Let's check if there are any more missing values

In [None]:
print(rt_movie_info.isna().sum())

Since there are still missing values, we have to continue cleaning

In [None]:
#replacing missing values in director with unknown
rt_movie_info["director"] = rt_movie_info.director.fillna("unknown")

#replacing missing values in writer with unknown
rt_movie_info["writer"] = rt_movie_info.writer.fillna("unknown")

#replacing missing values in theater_date with the placeholder "released"
rt_movie_info["theater_date"] = rt_movie_info.theater_date.fillna("released")

#replacing missing values in dvd_date with the placeholder "released"
rt_movie_info["dvd_date"] = rt_movie_info.dvd_date.fillna("released")

#replacing missing values in box_office with median (assigned back so the fill takes effect)
rt_movie_info["box_office"] = rt_movie_info["box_office"].fillna(rt_movie_info.box_office.median())

In [None]:
# Take a look at the data once more
print(rt_movie_info.isna().sum())

All good to go! No missing data

### 4.3.2 Removing any duplicates

In [None]:
def identify_duplicates(rt_movie_info):
    # identify the duplicates (dataframename.duplicated() , can add .sum() to get total count)
    # empty list to store Bool results from duplicated
    duplicates = []
    for i in rt_movie_info.duplicated():
        duplicates.append(i)
    duplicates_set = set(duplicates) 
    if (len(duplicates_set) == 1):
        print("The Data has no duplicates")
    else:
        no_true = 0
        for val in duplicates:
            if (val == True):
                no_true += 1

        duplicates_percentage = np.round(((no_true / len(rt_movie_info)) * 100), 3)
        print(f"The Data has {no_true} duplicated rows.\nThis constitutes {duplicates_percentage}% of the data set.") 


identify_duplicates(rt_movie_info)

### 4.3.3 Removing irrelevant data

In [None]:
# Check for any movie released on DVD before 2010 or after 2018
# Parse the dates (placeholders such as "released" become NaT) and compare on the year
dvd_year = pd.to_datetime(rt_movie_info["dvd_date"], errors="coerce").dt.year
rt_movie_info[(dvd_year < 2010) | (dvd_year > 2018)]

### 4.3.4 Arranging messy columns

In [None]:
# Strip stray whitespace from the column names and assign the result back
rt_movie_info.columns = rt_movie_info.columns.str.strip()
rt_movie_info.columns

In [None]:
#rt_movie_info after cleaning: number of missing values per column
plt.figure(figsize = (10,10))
missing_counts = rt_movie_info.isna().sum()
plt.bar(missing_counts.index, missing_counts.values, label = "missing values")
plt.xlabel("rt movie info columns")
plt.ylabel("Number of missing values")
plt.xticks(rotation = 90)
plt.title("Bar graph of rt_movie_info after cleaning")
plt.show()

# 5. Data Visualizations

# 6. Conclusions

# 7. Recommendations