![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns


%matplotlib inline

import warnings

In [2]:
title_df = pd.read_csv('data/zippedData/imdb.title.basics.csv.gz')
title_df.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [3]:
rating_df = pd.read_csv('data/zippedData/imdb.title.ratings.csv.gz')
rating_df.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [4]:
budget_df = pd.read_csv('data/zippedData/tn.movie_budgets.csv.gz')
budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [5]:
title_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [64]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [29]:
budget_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [30]:
budget_df['production_budget']

0       $425,000,000
1       $410,600,000
2       $350,000,000
3       $330,600,000
4       $317,000,000
            ...     
5777          $7,000
5778          $6,000
5779          $5,000
5780          $1,400
5781          $1,100
Name: production_budget, Length: 5782, dtype: object

In [31]:
budget_df['production_budget'].str.replace(',', '')

0       $425000000
1       $410600000
2       $350000000
3       $330600000
4       $317000000
           ...    
5777         $7000
5778         $6000
5779         $5000
5780         $1400
5781         $1100
Name: production_budget, Length: 5782, dtype: object

In [32]:
budget_df['production_budget'].str.replace(',', '').str.split('$')

0       [, 425000000]
1       [, 410600000]
2       [, 350000000]
3       [, 330600000]
4       [, 317000000]
            ...      
5777         [, 7000]
5778         [, 6000]
5779         [, 5000]
5780         [, 1400]
5781         [, 1100]
Name: production_budget, Length: 5782, dtype: object

In [10]:
budget_df['production_budget'].str.replace(',', '').str.split('$').map(lambda x:x[1])

0       425000000
1       410600000
2       350000000
3       330600000
4       317000000
          ...    
5777         7000
5778         6000
5779         5000
5780         1400
5781         1100
Name: production_budget, Length: 5782, dtype: object

In [33]:
budget_df['production_budget'] = budget_df['production_budget'].str.replace(',', '').str.split('$').map(lambda x:x[1])

In [34]:
# caution: '.' is a regex wildcard by default, so this matches every non-empty string
budget_df['production_budget'].str.contains('.').sum()

5782
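
By default `str.contains` interprets its pattern as a regular expression. A minimal sketch, on made-up values rather than the dataset, of how the result changes when `regex=False` is passed to look for a literal decimal point:

```python
import pandas as pd

# Toy values (illustrative only, not taken from the dataset)
s = pd.Series(["425000000", "7000", "12.5"])

# Default: the pattern is a regex, and '.' matches any single character
any_char = s.str.contains(".").sum()

# regex=False: look for a literal decimal point instead
literal_dot = s.str.contains(".", regex=False).sum()

print(any_char, literal_dot)
```

Only the last toy value contains an actual decimal point, so the two counts differ.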

In [35]:
budget_df['production_budget'] = budget_df['production_budget'].astype(float)

In [36]:
budget_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5782 non-null   int64  
 1   release_date       5782 non-null   object 
 2   movie              5782 non-null   object 
 3   production_budget  5782 non-null   float64
 4   domestic_gross     5782 non-null   object 
 5   worldwide_gross    5782 non-null   object 
dtypes: float64(1), int64(1), object(4)
memory usage: 271.2+ KB


In [37]:
budget_df['domestic_gross'] = budget_df['domestic_gross'].str.replace(',','').str.split('$').map(lambda x:x[1])

In [38]:
budget_df['domestic_gross'] = budget_df['domestic_gross'].astype(float)

In [39]:
budget_df['worldwide_gross'] = budget_df['worldwide_gross'].str.replace(',','').str.split('$').map(lambda x:x[1])

In [40]:
budget_df['worldwide_gross'] = budget_df['worldwide_gross'].astype(float)
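
The three money columns go through the same strip-and-cast steps. A minimal sketch of how that could be wrapped in one helper; `currency_to_float` is a hypothetical name, and the toy frame reuses two rows from the `head()` output above:

```python
import pandas as pd

def currency_to_float(col: pd.Series) -> pd.Series:
    """Strip '$' and ',' from strings such as '$425,000,000' and cast to float."""
    return (
        col.str.replace("$", "", regex=False)
           .str.replace(",", "", regex=False)
           .astype(float)
    )

# Toy frame with the same column names as budget_df
demo = pd.DataFrame({
    "production_budget": ["$425,000,000", "$410,600,000"],
    "domestic_gross": ["$760,507,625", "$241,063,875"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for c in money_cols:
    demo[c] = currency_to_float(demo[c])

print(demo.dtypes)
```

Passing `regex=False` keeps `$` from being read as a regex end-of-string anchor.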

In [41]:
budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000.0,760507625.0,2776345000.0
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000.0,241063875.0,1045664000.0
2,3,"Jun 7, 2019",Dark Phoenix,350000000.0,42762350.0,149762400.0
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000.0,459005868.0,1403014000.0
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000.0,620181382.0,1316722000.0


In [42]:
budget_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5782 non-null   int64  
 1   release_date       5782 non-null   object 
 2   movie              5782 non-null   object 
 3   production_budget  5782 non-null   float64
 4   domestic_gross     5782 non-null   float64
 5   worldwide_gross    5782 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 271.2+ KB


In [43]:
# convert 'release_date' to datetime object.
budget_df['release_date'] = pd.to_datetime(budget_df['release_date'])

In [44]:
budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,425000000.0,760507625.0,2776345000.0
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000.0,241063875.0,1045664000.0
2,3,2019-06-07,Dark Phoenix,350000000.0,42762350.0,149762400.0
3,4,2015-05-01,Avengers: Age of Ultron,330600000.0,459005868.0,1403014000.0
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000.0,620181382.0,1316722000.0


In [46]:
# create 'month' column; the .dt accessor works directly on the datetime column,
# so the integer index does not need to be converted
budget_df['month'] = budget_df['release_date'].dt.month

In [47]:
# create 'year' column
budget_df['year'] = budget_df['release_date'].dt.year

In [48]:
budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,month,year
0,1,2009-12-18,Avatar,425000000.0,760507625.0,2776345000.0,12,2009
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000.0,241063875.0,1045664000.0,5,2011
2,3,2019-06-07,Dark Phoenix,350000000.0,42762350.0,149762400.0,6,2019
3,4,2015-05-01,Avengers: Age of Ultron,330600000.0,459005868.0,1403014000.0,5,2015
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000.0,620181382.0,1316722000.0,12,2017
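
The month and year columns come from the `.dt` accessor on `release_date`. A minimal sketch of the same steps on a toy frame, using dates copied from the output above; the explicit `format` string is an assumption about how every row in the file is written:

```python
import pandas as pd

# Toy frame; the dates are copied from the head() output above
demo = pd.DataFrame({"release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"]})

# An explicit format (assumed to hold for every row) avoids per-row format guessing
demo["release_date"] = pd.to_datetime(demo["release_date"], format="%b %d, %Y")

# Derive the calendar parts straight from the datetime column
demo["month"] = demo["release_date"].dt.month
demo["year"] = demo["release_date"].dt.year

print(demo)
```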


## Data Cleaning

We need to clean the data before performing analysis: check for null values and drop unnecessary columns.

The `runtime_minutes` and `genres` columns contain null values (`original_title` is also missing in 21 rows).

In [6]:
title_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [7]:
# checking for the null values
title_df['runtime_minutes'].isna().sum()

31739

In [8]:
# checking the fraction of null values
title_df['runtime_minutes'].isna().sum() / len(title_df)

0.21717620976571053
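
The count-then-divide check can also be run for every column at once, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a four-row toy frame (values taken from the `head()` output above):

```python
import numpy as np
import pandas as pd

# Four rows copied from the title_df.head() output; one runtime is missing
demo = pd.DataFrame({
    "runtime_minutes": [175.0, 114.0, 122.0, np.nan],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama", "Comedy,Drama"],
})

# isna() gives a boolean frame; mean() turns it into a per-column null fraction
null_share = demo.isna().mean()

print(null_share)
```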

In [9]:
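# fill missing runtimes with 0 (note: these zeros will lower any later average of runtime)
title_df['runtime_minutes'] = title_df['runtime_minutes'].fillna(0)
# no-op after the fill above: no nulls remain in 'runtime_minutes'
title_df.dropna(subset = ['runtime_minutes'], axis = 0, inplace = True)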

In [13]:
title_df.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,0.0,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [14]:
title_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  146144 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 7.8+ MB


In [15]:
# checking for null values in the 'genres' column
title_df['genres'].isna().sum()

5408

In [61]:
title_df['genres'].isna().sum() / len(title_df)

0.037004598204510616

In [None]:
# only about 3.7% of rows have null genres, so drop those rows
title_df.dropna(subset = ['genres'], axis = 0, inplace = True)

In [None]:
title_df.info()

### Data Merging

Before merging the data, we need to check for duplicates.

In [None]:
title_df[title_df.duplicated(subset = ['primary_title'], keep = False)].sort_values(by = 'primary_title').iloc[20:30]

In [None]:
# checking for duplicates on title and year together
title_df[title_df.duplicated(subset = ['primary_title', 'start_year'],keep = False)].sort_values(by = 'primary_title').iloc[20:30]

In [None]:
title_df[title_df.duplicated(subset = ['primary_title','start_year','runtime_minutes'], keep = False)].sort_values(by = 'primary_title')

In [None]:
# removing duplicates that share the same title, year and runtime
title_df.drop_duplicates(subset = ['primary_title','start_year','runtime_minutes'], inplace = True)
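
A minimal sketch of how the inspection and removal steps behave, using made-up rows (only the column names match `title_df`): `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates` keeps the first row of each group by default.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match title_df
demo = pd.DataFrame({
    "primary_title": ["Alpha", "Alpha", "Alpha", "Beta"],
    "start_year": [2015, 2015, 2016, 2015],
    "runtime_minutes": [90.0, 90.0, 90.0, 100.0],
})
keys = ["primary_title", "start_year", "runtime_minutes"]

# keep=False marks all rows in a duplicate group, which is useful for inspection
flagged = demo[demo.duplicated(subset=keys, keep=False)]

# drop_duplicates keeps the first row of each group and drops the rest
deduped = demo.drop_duplicates(subset=keys)

print(len(flagged), len(deduped))
```

The 2016 "Alpha" row survives because it differs in `start_year`, so it is not a duplicate on the full key.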

In [None]:
# checking for a merge conflict between the 'title' and 'gross' dataframes
title_df[title_df['primary_title'] == 'Abduction']
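
One way to surface such conflicts is to merge on title and year together and inspect the `_merge` indicator column. This is a sketch only: the join keys (`primary_title`/`start_year` against `movie`/`year`) are an assumption, and every row in the toy frames, including the years, is made up for illustration.

```python
import pandas as pd

# Made-up rows; two films share the title "Abduction" but differ in year
titles = pd.DataFrame({
    "primary_title": ["Abduction", "Abduction", "Avatar"],
    "start_year": [2011, 2019, 2009],
})
budgets = pd.DataFrame({
    "movie": ["Abduction", "Dark Phoenix"],
    "year": [2011, 2019],
})

# Joining on title AND year keeps same-title films from different years apart;
# indicator=True adds a '_merge' column saying which side each row came from
merged = titles.merge(
    budgets,
    left_on=["primary_title", "start_year"],
    right_on=["movie", "year"],
    how="outer",
    indicator=True,
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones that found no partner and are worth a manual look before switching to an inner join.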

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [6]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***