
# Introduction

### Dataset

* The TMDb dataset contains information about roughly 10,000 movies collected from **The Movie Database (TMDb)**, including user ratings and revenue.

### Description

* The dataset consists of 10,866 rows and 21 columns.

### Research Questions:

* How many movies are created per year?
* What year has the highest number of movies created?
* Which movie has the highest and lowest budget?
* What are the budget, revenue and popularity of the top ten ranked movies?
* Which production company produces the best rated movies?
* What genre of movie do people rate the most?
* Do people give high ratings to movies with long runtimes?
* Which properties are associated with movies with high revenue?

In [1]:
# Importing preprocessing tools
import pandas as pd
import numpy as np

import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

# Importing visualization tools
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

In [2]:
tmbd = pd.read_csv("/kaggle/input/tmbdmovie/tmdb-movies.csv")
tmbd.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [3]:
tmbd.shape

(10866, 21)

In [4]:
tmbd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

In [5]:
tmbd.describe().T.drop("count", axis=1)

Unnamed: 0,mean,std,min,25%,50%,75%,max
id,66064.18,92130.14,5.0,10596.25,20669.0,75610.0,417859.0
popularity,0.646441,1.000185,6.5e-05,0.207583,0.383856,0.713817,32.98576
budget,14625700.0,30913210.0,0.0,0.0,0.0,15000000.0,425000000.0
revenue,39823320.0,117003500.0,0.0,0.0,0.0,24000000.0,2781506000.0
runtime,102.0709,31.38141,0.0,90.0,99.0,111.0,900.0
vote_count,217.3897,575.6191,10.0,17.0,38.0,145.75,9767.0
vote_average,5.974922,0.9351418,1.5,5.4,6.0,6.6,9.2
release_year,2001.323,12.81294,1960.0,1995.0,2006.0,2011.0,2015.0
budget_adj,17551040.0,34306160.0,0.0,0.0,0.0,20853250.0,425000000.0
revenue_adj,51364360.0,144632500.0,0.0,0.0,0.0,33697100.0,2827124000.0


In [6]:
tmbd.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
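The null counts above report no missing values for `budget` or `revenue`, yet the summary statistics show a median of 0 for both, so zeros appear to act as placeholders for unknown figures. A minimal sketch of how to count them, using a made-up `sample` frame rather than the real file:

```python
import pandas as pd

# Toy rows standing in for tmdb-movies.csv; the values are invented for illustration.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [150_000_000, 0, 0, 2_000_000],
    "revenue": [1_500_000_000, 0, 350_000, 0],
})

# isnull() finds nothing here, because a zero is a valid integer.
null_counts = sample[["budget", "revenue"]].isnull().sum()

# Counting zeros shows how many rows lack a usable financial figure.
zero_counts = (sample[["budget", "revenue"]] == 0).sum()

print(null_counts.to_dict())
print(zero_counts.to_dict())
```

Running the same two lines on `tmbd` would show how many rows the zero-budget and zero-revenue filter is going to remove.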

## Data cleaning steps:

   * Dropping rows with duplicate ids.
   * Filling missing values with `Missing`.
   * Dropping unwanted columns (including `release_date`; the analysis relies on `release_year`, so no datetime conversion is needed).
   * Dropping rows that have a revenue of 0 or a budget of 0.
   * Resetting the index of the dataframe.
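The steps listed above can also be chained into a single expression instead of mutating the frame with `inplace=True`. The sketch below is one possible arrangement, not the notebook's own code: `clean_movies` is a hypothetical helper and `raw` is a toy frame with invented values.

```python
import pandas as pd

def clean_movies(raw: pd.DataFrame, unwanted: list) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning steps as one method chain."""
    return (
        raw.drop_duplicates(subset="id")        # drop rows with a duplicate id
        .fillna("Missing")                      # fill missing text values
        .drop(columns=unwanted)                 # drop columns not needed later
        .query("budget != 0 and revenue != 0")  # keep rows with real figures
        .reset_index(drop=True)                 # renumber rows from 0
    )

# Toy input; only the text columns contain missing values.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "homepage": [None, None, "http://example.com", None],
    "director": ["X", "X", "Y", None],
    "budget": [10, 10, 0, 5],
    "revenue": [20, 20, 7, 9],
})

clean = clean_movies(raw, unwanted=["id", "homepage"])
print(clean)
```

Returning a new frame keeps the raw data available for comparison, which can help when checking how many rows each step removes.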

In [7]:
# Dropping rows with a duplicate id, as they're not needed for the analysis

tmbd.drop_duplicates(subset="id", inplace=True)

In [8]:
# Filling missing values with 'Missing'

tmbd.fillna(value="Missing", inplace=True)

In [9]:
# Dropping unwanted columns

unwanted_columns = ["id", "imdb_id", "homepage", "tagline", "budget_adj", "revenue_adj", "overview", "keywords", "release_date"]

tmbd.drop(unwanted_columns, axis=1, inplace=True)

# Displaying the shape after dropping unwanted columns

tmbd.shape

(10865, 12)

In [10]:
# Dropping rows with a revenue of 0 or a budget of 0

tmbd.drop(index=tmbd.query("(budget == 0) | (revenue == 0)").index, inplace=True)

tmbd.shape

(3854, 12)

In [11]:
tmbd.reset_index(drop=True, inplace=True)

## Exploratory data analysis

### Research question 1 : How many movies are created per year?

In [12]:
# Group movie by release year then taking the count of the movie per year 

number_of_movies_per_year = tmbd.groupby("release_year")["original_title"].count()

# Plot outcome
px.bar(number_of_movies_per_year,
        x= number_of_movies_per_year.index,
        y = number_of_movies_per_year.values,
        title="Number of movies per year",
        color = number_of_movies_per_year.index,
        labels = {"release_year": "Release year", "y": "Movie counts"},
        range_x = [1960, 2015],
       text_auto= True
       ).update_traces(textposition = "outside")

### Research question 2 : What year has the highest number of movies created?

In [13]:
px.line(number_of_movies_per_year,
        x= number_of_movies_per_year.index,
        y = number_of_movies_per_year.values,
        title="Number of movies per year",
        labels = {"release_year": "Release year", "y": "Movie counts"},
        range_x = [1960, 2015])

* From the plot above, we can see that 2011 has the highest number of movies made.
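The busiest year can also be read off the per-year counts directly rather than from the plot. A minimal sketch on a made-up `sample` frame (the real counts come from `number_of_movies_per_year`):

```python
import pandas as pd

# Toy data; titles and years are invented for illustration.
sample = pd.DataFrame({
    "release_year": [2010, 2011, 2011, 2011, 2012, 2012],
    "original_title": ["A", "B", "C", "D", "E", "F"],
})

movies_per_year = sample.groupby("release_year")["original_title"].count()

# idxmax() returns the index label (the year) of the largest count.
busiest_year = movies_per_year.idxmax()
busiest_count = movies_per_year.max()

print(busiest_year, busiest_count)
```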

### Research question 3 : Which movie has the highest and lowest budget?

In [14]:
# Grouping by movie title, then summing only the budget column
# (calling .sum() on the whole frame would also try to aggregate the text columns).

budget_by_title = tmbd.groupby("original_title")["budget"].sum()
highest_movie_budget = budget_by_title.sort_values(ascending=False).head(10)
lowest_movie_budget = budget_by_title.sort_values(ascending=True).head(10)

# Creating a subplot for both outputs
subplots = make_subplots(rows=1, 
                         cols=2,
                        subplot_titles=["Descending order of budget", "Ascending order of budget"])

subplots.add_traces(go.Bar(x= highest_movie_budget.index,
                           y= highest_movie_budget.values,
                           text= highest_movie_budget.values,
                           textposition="outside",
                           marker= {"color": highest_movie_budget.values, 'colorscale': 'Sunset'}
                          ),
                    rows=1, cols=1)

subplots.add_traces(go.Bar(x= lowest_movie_budget.index,
                           y= lowest_movie_budget.values,
                           text= lowest_movie_budget.values,
                           textposition="outside",
                           marker= {"color": lowest_movie_budget.values, 'colorscale': 'Sunset'}
                          ),
                    rows=1, cols=2)

# Updating layout settings

subplots.update_layout(showlegend= False, height = 700, width = 1000, title = "Movie budget plots")

# Updating x_axis

subplots.update_xaxes(title_text = "Movie name", row=1, col=1)
subplots.update_xaxes(title_text = "Movie name", row=1, col=2)

# Updating y_axis

subplots.update_yaxes(title_text = "Budget")

# Displaying the subplot

subplots.show()

##### Movie with the highest budget: 

* The Warrior's Way

##### Movies with the lowest budget:

* Lost & Found 
* Love, Wedding, Marriage
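Because several movies can share the lowest budget, a lookup that keeps ties avoids silently dropping one of them. The sketch below assumes a per-title budget Series like the one plotted above; the titles and amounts are invented.

```python
import pandas as pd

# Toy per-title budgets; names and values are made up for illustration.
budget_by_title = pd.Series(
    {"Epic Saga": 425_000_000, "Mid Feature": 30_000_000, "Tiny One": 1, "Tiny Two": 1},
    name="budget",
)

# keep="all" returns every title tied for the extreme value, not just the first.
highest = budget_by_title.nlargest(1, keep="all")
lowest = budget_by_title.nsmallest(1, keep="all")

print(highest)
print(lowest)
```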

### Research question 4 : What are the budget, revenue and popularity of the top ten ranked movies?

In [15]:
# Group dataframe by release year then by movie title, then take the average of each numeric attribute
# (numeric_only=True skips text columns such as cast and director)
# Sort the dataframe by the rating of the movie
# Reset index then plot result

top_ranked_movie = (tmbd.groupby(["release_year", "original_title"])
                        .mean(numeric_only=True)
                        .sort_values(by="vote_average", ascending=False)
                        .reset_index()
                        .head(10))

subplots = make_subplots(rows=2, 
                         cols=2,
                         specs=[[{}, {}], [{"colspan":2}, None]],# This makes sure the third subplot is drawn to full scale
                         subplot_titles = ["Budget of movie", "Revenue of movie", "Popularity of movie"]
                        )

subplots.add_traces(go.Bar(x= top_ranked_movie["original_title"],
                           y= top_ranked_movie["budget"],
                           text= top_ranked_movie["budget"],
                           textposition="outside",
                           marker= {"color": top_ranked_movie["budget"], 'colorscale': 'Sunset'}
                          ),
                    rows=1, cols=1)

subplots.add_traces(go.Bar(x= top_ranked_movie["original_title"],
                           y= top_ranked_movie["revenue"],
                           text= top_ranked_movie["revenue"],
                           textposition="outside",
                           marker= {"color": top_ranked_movie["revenue"], 'colorscale': 'Sunset'}
                          ),
                    rows=1, cols=2)

subplots.add_traces(go.Bar(x= top_ranked_movie["original_title"],
                           y= top_ranked_movie["popularity"],
                           text= np.round(top_ranked_movie["popularity"], 2),
                           textposition="outside",
                           marker= {"color": top_ranked_movie["popularity"], 'colorscale': 'Sunset'}
                          ),
                    rows=2, cols=1)

subplots.update_layout(showlegend= False, height = 1000, width=1000)

# updating x_axis 

subplots.update_xaxes(title_text = "Movie name", row=1, col=1)
subplots.update_xaxes(title_text = "Movie name", row=1, col=2)
subplots.update_xaxes(title_text = "Movie name", row=2, col=1)

# updating y_axis

subplots.update_yaxes(title_text = "Budget", row=1, col=1)
subplots.update_yaxes(title_text = "Revenue", row=1, col=2)
subplots.update_yaxes(title_text = "Popularity", row=2, col=1)

# displaying plot

subplots.show()

### Research question 5 : Which production company produces the best rated movies?

In [16]:
# Sort dataframe by vote_average and keep the 15 best rated movies
# Plot a bar chart of their production companies

company_with_best_rated_movies = tmbd.sort_values(by=["vote_average"], ascending=False).head(15)

px.bar(data_frame=company_with_best_rated_movies,
        x= "production_companies",
        y = "vote_average",
        labels= {"production_companies": "Production companies", "vote_average":"Average vote"},
        title= "Top 15 companies with best rated movies",
        height= 800,
       width = 1000,
       text_auto=True).update_traces(textposition="outside")
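The chart above shows the production companies behind the 15 best rated individual movies. An alternative reading of the question is the average rating per company, which needs the pipe-separated `production_companies` strings split first. The sketch below illustrates that idea on a toy `sample` frame with invented company names; it is not the approach used in the cell above.

```python
import pandas as pd

# Toy data; company names and ratings are made up for illustration.
sample = pd.DataFrame({
    "production_companies": ["Studio A|Studio B", "Studio A", "Studio C", "Missing"],
    "vote_average": [8.0, 6.0, 7.5, 9.0],
})

per_company = (
    sample[sample["production_companies"] != "Missing"]  # skip filled-in gaps
    .assign(company=lambda df: df["production_companies"].str.split("|", regex=False))
    .explode("company")                                  # one row per (movie, company)
    .groupby("company")["vote_average"]
    .agg(["mean", "count"])                              # average rating and movie count
    .sort_values("mean", ascending=False)
)

print(per_company)
```

Keeping the `count` column alongside the mean makes it easy to see when a high average rests on a single movie.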

### Research question 6 : What genre of movie do people rate the most?

In [17]:
# Group movie by genre
# Plot bar chart

most_rated_genre = tmbd.groupby("genres").mean(numeric_only=True).reset_index().sort_values(by= "vote_average", ascending=False).head(20)

px.bar(data_frame=most_rated_genre,
      x = "genres",
      y = "vote_average",
      title= "Top 20 most rated genres",
      color = "genres",
      text = "vote_average",
       labels= {"vote_average": "Average vote", "genres":"Genre"},
      height = 800, 
      width = 1000,
      ).update_traces(textposition = "outside")
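Grouping on the raw `genres` column treats each pipe-separated combination (for example `Action|Adventure`) as its own genre. If ratings per individual genre are wanted instead, one option is to build indicator columns with `str.get_dummies`. This is a sketch on invented data, offered as an alternative to the cell above:

```python
import pandas as pd

# Toy data; genre strings follow the dataset's pipe-separated layout.
sample = pd.DataFrame({
    "genres": ["Action|Drama", "Drama", "Comedy"],
    "vote_average": [6.0, 8.0, 5.0],
})

# One 0/1 column per individual genre.
dummies = sample["genres"].str.get_dummies(sep="|")

# Sum of ratings per genre divided by the number of movies in that genre.
rating_sum = dummies.mul(sample["vote_average"], axis=0).sum()
mean_by_genre = (rating_sum / dummies.sum()).sort_values(ascending=False)

print(mean_by_genre)
```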

### Research question 7 : Do people give high rating to movies with long runtime?

In [18]:
# Group by movie title and runtime, average the numeric columns,
# then keep the 20 longest movies.

runtime_movie_rating = tmbd.groupby(["original_title","runtime"]).mean(numeric_only=True).reset_index().sort_values(by=["runtime"], ascending=False).head(20)

px.bar(data_frame=runtime_movie_rating,
      x = "original_title",
      y = "vote_average",
      title= "Long runtime movie rating",
      color = "original_title",
      text = "vote_average",
      labels = {"vote_average":"Average vote", "original_title": "Movie title"},
      height = 800,
      width = 1000,
      ).update_traces(textposition = "outside")
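The bar chart covers only the 20 longest movies. A correlation coefficient over all rows is one way to summarise the runtime and rating relationship in a single number. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data; runtimes (minutes) and ratings are invented for illustration.
sample = pd.DataFrame({
    "runtime": [90, 100, 120, 150, 180],
    "vote_average": [5.5, 6.0, 6.4, 7.1, 7.6],
})

# Pearson correlation: close to +1 means longer movies tend to be rated higher.
runtime_rating_corr = sample["runtime"].corr(sample["vote_average"])

print(round(runtime_rating_corr, 2))
```

On the real data the same call would be `tmbd["runtime"].corr(tmbd["vote_average"])`.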

### Research question 8 : Properties associated with movies with high revenue.

In [19]:
movieWithHighRevenue = tmbd.groupby(["original_title","release_year", "runtime"]).sum(numeric_only=True).reset_index().query("revenue > budget")
movieWithHighRevenue.head()

Unnamed: 0,original_title,release_year,runtime,popularity,budget,revenue,vote_count,vote_average
0,(500) Days of Summer,2009,95,3.244139,7500000,60722734,1778,7.3
1,10 Things I Hate About You,1999,97,1.769152,16000000,53478166,947,7.2
2,"10,000 BC",2008,109,1.841839,105000000,266000000,586,5.2
3,101 Dalmatians,1996,103,1.419885,54000000,320689294,367,5.5
4,102 Dalmatians,2000,100,0.410235,85000000,183611771,150,5.0


In [20]:
px.imshow(np.round(movieWithHighRevenue.corr(numeric_only=True), 1),
          title= "Correlation matrix",
          text_auto= True,
          width = 800,
          height = 800
         )

#### Features that correlate:

   * Revenue vs Popularity
   * Vote_count vs Popularity
   * Revenue vs Budget
   * Vote_count vs Revenue
   * Vote_average vs Runtime
   * Budget Vs Vote count
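Pairs like the ones listed above can also be ranked programmatically instead of being read off the heatmap. The sketch below keeps the upper triangle of a correlation matrix and sorts the remaining pairs; `sample` is a toy frame with invented numbers.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up for illustration.
sample = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 45, 70, 85, 110],
    "runtime": [120, 95, 130, 90, 110],
})

corr = sample.corr()

# Keep only entries above the diagonal so each pair appears once
# and the self-correlations of 1.0 are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).unstack().dropna().sort_values(ascending=False)

print(pairs)
```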

In [21]:
# Creating subplots 

subplots = make_subplots(rows=2, 
                        cols=3,
                        subplot_titles= ["Revenue Vs Popularity", 
                                        "Vote Count Vs Popularity", 
                                        "Revenue Vs Budget",
                                        "Vote Count Vs Revenue",
                                        "Vote Average Vs Runtime",
                                        "Budget Vs Vote Count"
                                        ]
                                )

# Adding plots to the subplot

subplots.add_traces( go.Scatter(x= movieWithHighRevenue["revenue"],
                                y = movieWithHighRevenue["popularity"],
                                mode = "markers"
                                ),
                    rows= 1, cols=1)

subplots.add_traces( go.Scatter(x= movieWithHighRevenue["vote_count"],
                                y = movieWithHighRevenue["popularity"],
                                mode = "markers"
                                ),
                    rows=1, cols=2)

subplots.add_traces( go.Scatter(x= movieWithHighRevenue["revenue"],
                                y = movieWithHighRevenue["budget"],
                                mode = "markers"
                                ),
                    rows=1, cols=3)

subplots.add_traces( go.Scatter(x= movieWithHighRevenue["vote_count"],
                                y = movieWithHighRevenue["revenue"],
                                mode = "markers"
                                ),
                    rows=2, cols=1)

subplots.add_traces( go.Scatter(x= movieWithHighRevenue["vote_average"],
                                y = movieWithHighRevenue["runtime"],
                                mode = "markers"
                                ), 
                    rows=2, cols=2)

subplots.add_traces( go.Scatter(x= movieWithHighRevenue["budget"],
                                y = movieWithHighRevenue["vote_count"],
                                mode = "markers"
                                ), 
                    rows=2, cols=3)

# Setting layout 

subplots.update_layout(showlegend = False, width = 1200, height = 1000)

# Updating x_axis

subplots.update_xaxes(title_text = "Revenue", row=1,col=1)
subplots.update_xaxes(title_text = "Vote Count", row=1,col=2)
subplots.update_xaxes(title_text = "Revenue", row=1,col=3)
subplots.update_xaxes(title_text = "Vote Count", row=2,col=1)
subplots.update_xaxes(title_text = "Vote Average", row=2,col=2)
subplots.update_xaxes(title_text = "Budget", row=2,col=3)

# Updating y_axis

subplots.update_yaxes(title_text = "Popularity", row=1,col=1)
subplots.update_yaxes(title_text = "Popularity", row=1,col=2)
subplots.update_yaxes(title_text = "Budget", row=1,col=3)
subplots.update_yaxes(title_text = "Revenue", row=2,col=1)
subplots.update_yaxes(title_text = "Runtime", row=2,col=2)
subplots.update_yaxes(title_text = "Vote Count", row=2,col=3)


subplots.show()

# Conclusion

* The properties associated with high-revenue movies are positively correlated with one another.
* The higher the vote count, the higher the revenue tends to be.
* The longer the runtime, the higher the movie rating tends to be.
* The higher the vote count, the higher the budget tends to be.
* As popularity increases, revenue tends to increase.
* The higher the vote count, the higher the popularity tends to be.