## TMDB Movies Data Analysis

### Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#Data_Assessment">Data Assessment</a></li>
<li><a href="#Data_Cleaning">Data Cleaning</a></li>
<li><a href="#Data_Analysis">Data Analysis</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
</ul>

<a id='intro'></a>
### Introduction

In [None]:
#import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Read movies data from the csv
df_movies = pd.read_csv('tmdb-movies.csv')
df_movies.head()

<a id='Data_Assessment'></a>
### Data Assessment

In [None]:
df_movies.info()

The columns have unequal numbers of non-null values, so some of them contain missing data. Some datatypes also do not reflect the data in their columns.
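To make the missing-value picture concrete, the counts per column can be listed directly. The sketch below uses a small made-up frame (`toy`) as a stand-in for `df_movies`; the column names mirror the dataset but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_movies; the values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "homepage": ["http://a.example", np.nan, np.nan, np.nan],
    "tagline": ["x", "y", np.nan, "z"],
    "budget_adj": [1.0e6, 0.0, 2.5e6, 4.0e6],
})

# Count missing values per column, largest count first.
missing_per_column = toy.isnull().sum().sort_values(ascending=False)
print(missing_per_column)
```

Running the same two method calls on `df_movies` would show which columns account for most of the missing data.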

In [None]:
#find number of rows and columns
df_movies.shape

In [None]:
#check the datatypes of columns
df_movies.dtypes

In [None]:
df_movies.describe()

The minimum and bottom quartile (25%) values for Revenue, Budget, Revenue (Adjusted for Inflation) and Budget (Adjusted for Inflation) are all zero. These are likely unrecorded values rather than true zeros.
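One way to quantify how widespread the zeros are is to count them per column. This is a minimal sketch on invented data; `toy`, `zero_counts` and `zero_share` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the budget and revenue columns.
toy = pd.DataFrame({
    "budget": [0, 100, 0, 250],
    "revenue": [0, 0, 0, 900],
})

# Comparing to 0 gives a boolean frame; summing counts the True values.
zero_counts = (toy == 0).sum()

# The mean of a boolean column is the share of rows that are zero.
zero_share = (toy == 0).mean()
print(zero_counts)
print(zero_share)
```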

In [None]:
# Number of duplicated rows
df_movies.duplicated().sum()

Duplicate rows can distort the results of the analysis. There is only one duplicate row, and it will be dropped.

<a id='Data_Cleaning'></a>
### Data Cleaning

In [None]:
# Drop the duplicated row, then confirm that no duplicated rows remain

df_movies.drop_duplicates(inplace=True)
df_movies.duplicated().sum()

In [None]:
#change dtypes for release_date

df_movies['release_date'] = pd.to_datetime(df_movies['release_date'])

df_movies.dtypes

The release_date column has an object datatype, which means pandas handles it as a string. Converting it to datetime makes it possible to use the year, month and day in the analysis.
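Once a column is datetime, its parts are available through the `.dt` accessor. The sketch below assumes dates written as month/day/four-digit year and passes an explicit `format`; the actual layout of `release_date` in the CSV is not shown here, so both the sample dates and the format string are assumptions.

```python
import pandas as pd

# Hypothetical dates; the format of the real column may differ.
toy = pd.DataFrame({"release_date": ["06/09/2015", "12/25/1999"]})

# An explicit format avoids ambiguity between day-first and month-first dates.
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%m/%d/%Y")

# Extract components for later grouping or filtering.
toy["year"] = toy["release_date"].dt.year
toy["month"] = toy["release_date"].dt.month
print(toy)
```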

In [None]:
# Drop rows that contain NaN values, then confirm that no NaN values remain

df_movies.dropna(inplace=True)
df_movies.isnull().sum().sum()

In [None]:
df_movies.shape

The number of rows drops sharply, from 10866 to 1992. This indicates that a large number of rows contain at least one NaN value.
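Because `dropna()` with no arguments removes a row if any column is NaN, columns such as `homepage` or `tagline`, which are dropped later, still count toward the check. As a possible alternative (not what the notebook does), `dropna` accepts a `subset` of columns to inspect. The sketch below compares the two on invented data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_movies; the values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "homepage": [np.nan, np.nan, "http://c.example", np.nan],
    "director": ["X", "Y", "Z", np.nan],
    "budget_adj": [1.0, 2.0, 3.0, 4.0],
})

# Default behaviour: a NaN in any column removes the row.
rows_all = len(toy.dropna())

# Alternative: only check the columns the analysis actually uses.
rows_subset = len(toy.dropna(subset=["director", "budget_adj"]))
print(rows_all, rows_subset)
```

Which columns belong in `subset` depends on the questions being asked, so the choice here is only an example.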

##### >> Handling Values With Zero
While it is possible for Revenue to be zero, Budget cannot be zero, so rows where Budget is zero need to be handled carefully. The rows are dropped based on the Budget (Adjusted) column rather than the Budget column, because adjusting for inflation expresses every movie's budget in the currency value of a single year.

In [None]:
#Drop rows where the adjusted budget is zero as these values cannot be used in analysis.

df_movies.drop(df_movies.index[df_movies['budget_adj'] == 0], inplace=True)

In [None]:
df_movies.shape

Dropping the rows where the adjusted budget is zero shrinks the dataset further, from 1992 rows to 1446.
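The index-based `drop` used above has an equivalent boolean-mask form that some readers may find easier to follow. The sketch below shows both on invented data and checks that they give the same result; `toy`, `dropped` and `masked` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for df_movies; the values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget_adj": [0.0, 5.0e6, 0.0],
})

# Form used in the notebook: drop by the index labels of the zero rows.
dropped = toy.drop(toy.index[toy["budget_adj"] == 0])

# Equivalent form: keep only the rows where the value is non-zero.
masked = toy[toy["budget_adj"] != 0]
print(dropped.equals(masked))
```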

In [None]:
df_movies.describe()

The minimum and bottom quartile (25%) values for Budget (Adjusted) are no longer zero.

In [None]:
#Drop columns that would not be used in analysis

df_movies.drop(['id', 'popularity', 'cast', 'homepage', 'tagline', 'keywords', 'overview', 'runtime', 'vote_count', 'vote_average'], axis = 1, inplace = True)

df_movies.columns

<a id='Data_Analysis'></a>
### Data Analysis


#### Is There A Relationship Between Movie Budgets & Revenue?

To understand whether there is a relationship between movie budgets and revenue, only the columns relevant to the question will be used. The budget and revenue columns adjusted for inflation are best suited to comparing each movie's revenue and budget.

In [None]:
#Create Plot Function for 1d Explorations

def plot1(df, col):
    '''
    Plots a Histogram of the values in the column passed.

            Parameters:
                    df = Dataframe with relevant column
                    col (str) = column in Dataframe to be plotted
            Returns:
                    None. The Histogram is drawn from the input parameters
    '''
    df[col].hist(grid = False, bins = 30, figsize = [10, 10])

In [None]:
#call function to plot graph for Budget(Adjusted)

plot1(df_movies, col ='budget_adj')

In [None]:
#call function to plot graph for Revenue(Adjusted)

plot1(df_movies, col ='revenue_adj')

The histograms for Revenue and Budget (both adjusted) follow a similar trend from left to right.

In [None]:
#create function to plot chart for analysis of relationship
def data_plot(df, x , y, title, col1, col2, invert = 'Yes', kind = 'bar'):
    '''
    Plots a Chart of two columns against each other.

            Parameters:
                    df = Dataframe with relevant columns
                    x (str) = Column name for x axis
                    y (str) = Column name for y axis
                    title (str) = Chart title
                    col1 (str) = x-axis label
                    col2 (str) = y-axis label
                    invert (str) = 'Yes' inverts the x-axis. Default is 'Yes'
                    kind (str) = Chart type. Default is 'bar'
            Returns:
                    None. The Chart is drawn from the input parameters
    '''

    #plot chart (keyword arguments keep x, y and kind unambiguous)
    df.plot(x = x, y = y, kind = kind, figsize = (10,10))

    #set title and axis labels
    plt.title(title,fontsize = 15)
    plt.xlabel(col1,fontsize = 15)
    plt.ylabel(col2,fontsize = 15)

    #invert x-axis to order from highest to lowest
    if invert == 'Yes':
        plt.gca().invert_xaxis()

In [None]:
#call function to plot graph of Budget Against Revenue

data_plot(df_movies, x = 'revenue_adj', y = 'budget_adj', invert = 'No', kind = 'scatter', 
            title = 'Correlation Between Budget & Revenue Adjusted For Inflation', 
            col1 = 'Revenue', col2 = 'Budget')

The plot of Movie Budget against Revenue shows some correlation, but there are quite a number of outliers. These could be due to release year, genre, cast or a number of other features.
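The strength of the relationship seen in the scatter plot can also be summarized with a single number. A minimal sketch, assuming a Pearson correlation is an acceptable summary, is shown below on invented values; the result for the real `budget_adj` and `revenue_adj` columns is not computed here.

```python
import pandas as pd

# Hypothetical values chosen to be nearly linear; not taken from the dataset.
toy = pd.DataFrame({
    "budget_adj": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue_adj": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Series.corr defaults to the Pearson correlation coefficient.
pearson_r = toy["budget_adj"].corr(toy["revenue_adj"])
print(round(pearson_r, 3))
```

A value near 1 indicates a strong positive linear relationship and a value near 0 indicates little linear relationship. Pearson correlation is sensitive to outliers, so it is best read alongside the scatter plot.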

#### Which Movies Had The Highest Revenue?

In [None]:
#create Function that sorts for top values and plots a chart
def top_val(df, col, top = 10):
    '''
    Sorts the Dataframe based on a column and plots a Bar Chart of the top values.

            Parameters:
                    df = Dataframe with relevant columns, including 'original_title'
                    col (str) = Column to sort by and plot on the y axis
                    top (int) = Number of rows to plot. Default is 10

            Returns:
                    None. A Bar Chart of the top values is drawn, highest first
    '''

    #data preparation
    data = df.sort_values(by = col).tail(top).set_index('original_title')[col]

    #plot graph
    col = col.replace("_", " ")
    data.plot(kind = 'bar', figsize = (10,10))
    plt.title(f'Which Movies Have The Highest {col}',fontsize = 15)
    plt.xlabel("Movies",fontsize = 15)
    plt.ylabel(col, fontsize = 15)
    plt.gca().invert_xaxis()    

In [None]:
#plot bar chart with top 10 movies by Revenue
top_val(df_movies, col = 'revenue_adj')

The revenue figures for these movies do not differ hugely. The decrease from left to right is steady, with minimal differences between successive movies.

#### Which Movies Had The Highest Budget?

In [None]:
#plot bar chart with top 10 movies by Budget
top_val(df_movies, col = 'budget_adj')

Save for the top three movies, the remaining movies in the top 10 have very close values.

The movies with the highest budgets are not the same as the movies with the highest revenues. Very few movies are in both.

Some of the movies in the list belong to the same film franchise:

Pirates of The Caribbean Film Franchise
- Pirates of the Caribbean: On Stranger Tides
- Pirates of the Caribbean: At World's End

Harry Potter Film Franchise
- Harry Potter and the Half-Blood Prince
- Harry Potter and the Deathly Hallows: Part 1

The movies in each franchise follow each other in the chart.
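The overlap between the top movies by budget and the top movies by revenue can be checked directly rather than by comparing the two charts by eye. The sketch below uses invented titles and values, and `nlargest` stands in for the sort-and-tail step used in `top_val`.

```python
import pandas as pd

# Hypothetical stand-in for df_movies; titles and values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "budget_adj": [50.0, 40.0, 30.0, 20.0, 10.0],
    "revenue_adj": [10.0, 90.0, 20.0, 80.0, 70.0],
})

top = 3  # use 10 to match the charts above

# Titles of the highest-ranked rows for each measure.
top_budget = set(toy.nlargest(top, "budget_adj")["original_title"])
top_revenue = set(toy.nlargest(top, "revenue_adj")["original_title"])

# Set intersection gives the titles present in both lists.
in_both = sorted(top_budget & top_revenue)
print(in_both)
```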

#### Which Years Had The Most Movies?

In [None]:
#plot bar chart to display movie count for each year
df_movies['release_year'].value_counts().sort_index().plot(kind = 'bar', figsize = (13,13));

From the chart, 2011 had the highest number of movies while 1961 had the lowest.
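The years with the most and fewest movies can also be read off programmatically instead of from the chart. A minimal sketch on an invented series of release years follows; `toy_years` is an illustrative name.

```python
import pandas as pd

# Hypothetical release years; not taken from the dataset.
toy_years = pd.Series([2011, 2011, 2011, 2010, 2010, 1961], name="release_year")

# Count movies per year, then order the result chronologically.
counts = toy_years.value_counts().sort_index()

# idxmax and idxmin return the index label (the year) of the extreme counts.
busiest_year = counts.idxmax()
quietest_year = counts.idxmin()
print(busiest_year, quietest_year)
```

If several years tie for the highest or lowest count, `idxmax` and `idxmin` return only the first one in index order.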

#### Which Directors Worked On The Most Expensive Movies?

In [None]:
#list of most expensive movies and directors in descending order of budget
df_movies[['director', 'budget_adj', 'original_title']].sort_values('budget_adj', ascending=False).head(10)

David Yates is the only Director with more than one movie amongst the top 10 movies by Budget. The two movies are part of the same film franchise.
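Counting how often each director appears among the most expensive movies can confirm this kind of observation without scanning the table. The sketch below uses invented directors and budgets; `top`, `director_counts` and `repeat_directors` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for df_movies; names and values are invented.
toy = pd.DataFrame({
    "director": ["P", "Q", "P", "R", "S"],
    "budget_adj": [500.0, 400.0, 300.0, 200.0, 100.0],
})

top = 4  # use 10 to match the table above

# Take the highest-budget rows and count appearances per director.
director_counts = toy.nlargest(top, "budget_adj")["director"].value_counts()

# Keep only directors who appear more than once.
repeat_directors = director_counts[director_counts > 1].index.tolist()
print(repeat_directors)
```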

<a id='Conclusion'></a>
### Conclusion

While Revenue and Budget have some correlation, the top movies by budget and revenue do not entirely overlap. This indicates that it is possible for movies to have huge budgets and not have huge revenue. The same applies in the reverse.

The dataset is a sample and does not contain data on every movie ever produced, so conclusions drawn from the results may not hold in general. For example, the highest grossing movie in the dataset might not actually be the highest grossing movie globally.

As with many sample datasets, the data has inconsistencies: missing values, zeros in unexpected columns, incorrect datatypes, etc.

### References

https://datascienceparichay.com/article/pandas-delete-rows-based-on-column-values/

https://statology.org/matplotlib-reverse-axis/

https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows

https://www.statology.org/pandas-sort-alphabetically/

https://www.programiz.com/python-programming/docstrings