# Exploratory Data Analysis on IMDB Dataset

This notebook is an exploratory data analysis of the IMDB movie dataset, which covers movies released between 2006 and 2016. I've tried to answer the questions that come to mind as a cinephile: whether a higher rating goes along with higher box-office revenue, whether critics (Metascore) and users get along, and which director is the most successful.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
imdb = pd.read_csv("../input/imdb-data/IMDB-Movie-Data.csv",sep=',')
imdb.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


Rename the columns whose names contain spaces and parentheses so they are easier to work with in pandas (for example, as `imdb.Revenue`).

In [3]:
imdb.rename(columns={'Revenue (Millions)':"Revenue","Runtime (Minutes)":"Runtime"},inplace=True)
print(imdb.columns)

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue', 'Metascore'],
      dtype='object')
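If more columns carried a unit in parentheses, the same cleanup could be done with a function instead of a hand-written mapping. This is only a sketch: `strip_units` and the one-row `sample` frame are names I made up for illustration, and the regular expression assumes the unit always sits at the end of the column name.

```python
import re

import pandas as pd


def strip_units(name: str) -> str:
    """Drop a trailing parenthesised unit, e.g. 'Revenue (Millions)' -> 'Revenue'."""
    return re.sub(r"\s*\(.*?\)\s*$", "", name)


# Tiny stand-in frame that reuses the column names of the original CSV.
sample = pd.DataFrame(
    {"Title": ["Split"], "Runtime (Minutes)": [117], "Revenue (Millions)": [138.12]}
)

# rename() accepts a callable and applies it to every column name;
# it returns a new frame, so no inplace=True is needed.
cleaned = sample.rename(columns=strip_units)
print(list(cleaned.columns))
```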


In [5]:
imdb.describe(include='all')

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue,Metascore
count,1000.0,1000,1000,1000,1000,1000,1000.0,1000.0,1000.0,1000.0,872.0,936.0
unique,,999,207,1000,644,996,,,,,,
top,,The Host,"Action,Adventure,Sci-Fi",A vengeful barbarian warrior sets off to get h...,Ridley Scott,"Jennifer Lawrence, Josh Hutcherson, Liam Hemsw...",,,,,,
freq,,2,50,1,8,2,,,,,,
mean,500.5,,,,,,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,,,,,,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,,,,,,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,,,,,,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,,,,,,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,,,,,,2016.0,123.0,7.4,239909.8,113.715,72.0
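The count row above shows 872 values for Revenue and 936 for Metascore out of 1000 rows, so both columns have gaps. A quick way to make that explicit is to count the missing values per column. The snippet below is a minimal sketch on a made-up four-row `sample` frame (not the real data), just to show the calls.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real frame, with gaps in the same two columns.
sample = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Revenue": [333.13, np.nan, 138.12, np.nan],
        "Metascore": [76.0, 65.0, np.nan, 59.0],
    }
)

# Number of missing values per column, keeping only columns that have any.
missing = sample.isna().sum()
print(missing[missing > 0])

# Rows where both Revenue and Metascore are present.
complete = sample.dropna(subset=["Revenue", "Metascore"])
print(len(complete))
```

On the real frame the same two lines would be `imdb.isna().sum()` and `imdb.dropna(subset=["Revenue", "Metascore"])`.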


### Group movies by year to see whether the average rating follows a trend over time

In [6]:
import plotly.express as px
years=imdb.groupby("Year")["Rating"].mean().reset_index()
px.scatter(years,x="Year", y="Rating").show()


Judging by the yearly averages, users seem less satisfied with recent movies than with older ones.
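A yearly mean is easier to trust when you also know how many movies stand behind it. As a sketch, the grouping can return the count next to the mean. The `sample` frame and the column names `movies` and `mean_rating` are my own choices for illustration, not part of the dataset.

```python
import pandas as pd

# Made-up ratings for two years with different numbers of movies.
sample = pd.DataFrame(
    {
        "Year": [2006, 2006, 2016, 2016, 2016, 2016],
        "Rating": [7.5, 8.1, 6.2, 7.3, 5.9, 6.8],
    }
)

# Named aggregation: one row per year with the movie count and the mean rating.
per_year = (
    sample.groupby("Year")["Rating"]
    .agg(movies="count", mean_rating="mean")
    .reset_index()
)
print(per_year)
```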

**Let's see which movie was the most successful at the box office.**

In [8]:
mostearned=imdb[imdb["Revenue"]==imdb["Revenue"].max()].T
print(mostearned)

                                                            50
Rank                                                        51
Title               Star Wars: Episode VII - The Force Awakens
Genre                                 Action,Adventure,Fantasy
Description  Three decades after the defeat of the Galactic...
Director                                           J.J. Abrams
Actors       Daisy Ridley, John Boyega, Oscar Isaac, Domhna...
Year                                                      2015
Runtime                                                    136
Rating                                                     8.1
Votes                                                   661608
Revenue                                                 936.63
Metascore                                                   81
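Filtering on the maximum returns only the single top earner. To see the runners-up as well, `nlargest` is one option. The sketch below uses the five movies printed by `head()` earlier, so the ranking it prints is for those five rows only, not for the whole dataset.

```python
import pandas as pd

# The five rows shown by head() above (title and revenue in millions).
sample = pd.DataFrame(
    {
        "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
        "Revenue": [333.13, 126.46, 138.12, 270.32, 325.02],
    }
)

# Top three rows by revenue, highest first.
top = sample.nlargest(3, "Revenue")[["Title", "Revenue"]]
print(top)
```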


Let's look for the user-favourite director between 2006 and 2016, starting with the director of the highest-rated movie.

In [9]:
imdbtop=imdb.loc[imdb["Rating"]==imdb["Rating"].max(), ["Title","Director","Rating"]]
imdbtop.head()

Unnamed: 0,Title,Director,Rating
54,The Dark Knight,Christopher Nolan,9.0


I grouped the movies by director, took the mean rating for each director, and sorted the result in descending order.

In [10]:
directors=imdb.groupby("Director")["Rating"].mean().reset_index()
directors.sort_values("Rating", ascending=False)



Unnamed: 0,Director,Rating
465,Nitesh Tiwari,8.80
108,Christopher Nolan,8.68
392,Makoto Shinkai,8.60
470,Olivier Nakache,8.60
194,Florian Henckel von Donnersmarck,8.50
...,...,...
435,Micheal Bafaro,3.50
329,Jonathan Holbrook,3.20
274,James Wong,2.70
566,Shawn Burkett,2.70
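With 644 different directors for 1000 movies, many of these means rest on a single film, and one great movie can outrank a director with a long track record. One possible refinement is to require a minimum number of movies before ranking. This is a sketch only: `MIN_MOVIES` is a threshold I picked arbitrarily, and the directors in `sample` are placeholders.

```python
import pandas as pd

# Placeholder data: director "A" has one movie, "B" has three, "C" has two.
sample = pd.DataFrame(
    {
        "Director": ["A", "B", "B", "B", "C", "C"],
        "Rating": [8.8, 8.5, 9.0, 8.6, 6.0, 7.0],
    }
)

MIN_MOVIES = 2  # hypothetical cut-off, adjust to taste

# Count and mean rating per director.
stats = sample.groupby("Director")["Rating"].agg(movies="count", mean_rating="mean")

# Keep directors with enough movies, then rank by mean rating.
ranked = stats[stats["movies"] >= MIN_MOVIES].sort_values("mean_rating", ascending=False)
print(ranked)
```

Here director "A" has the highest single rating but drops out of the ranking because only one movie backs it.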


## Let's see if critics and users get along really well.

In [None]:
imdb['Rating'].corr(imdb['Metascore'])

There seems to be a correlation. Let's visualize it.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.style.use('ggplot')

# Each point is one movie; rows with a missing Metascore are not drawn.
plt.scatter(imdb.Metascore, imdb.Rating)
plt.xlabel("Metascore")
plt.ylabel("Rating")
plt.show()
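Two details of `corr` are worth keeping in mind here. It computes the Pearson coefficient by default, and it skips rows where either value is missing, which matters because Metascore has gaps. As a sketch, the snippet below compares Pearson with the rank-based Spearman coefficient and counts the rows that actually enter the calculation. The first five rows of `sample` are the ones printed by `head()`; the sixth row with a missing Metascore is made up.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "Rating": [8.1, 7.0, 7.3, 7.2, 6.2, 5.0],
        "Metascore": [76.0, 65.0, 62.0, 59.0, 40.0, np.nan],  # last value is made up
    }
)

# Pearson (the default) measures linear association.
pearson = sample["Rating"].corr(sample["Metascore"])

# Spearman works on ranks, so it is less sensitive to outliers.
spearman = sample[["Rating", "Metascore"]].corr(method="spearman").loc["Rating", "Metascore"]

# Only rows with both values present take part in either coefficient.
pairs_used = len(sample[["Rating", "Metascore"]].dropna())

print(round(pearson, 3), round(spearman, 3), pairs_used)
```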

Let's see if there's a correlation between the revenue a movie makes and its rating. Is the highest-earning movie always the most liked?

In [None]:
imdb['Rating'].corr(imdb['Revenue'])

There's a weak correlation between them. Let's plot.

In [None]:
plt.scatter(imdb.Rating, imdb.Revenue)
plt.xlabel("Rating")
plt.ylabel("Revenue (millions)")
plt.show()

I want to see if a longer runtime always results in a better rating.

In [None]:
imdb['Rating'].corr(imdb['Runtime'])

Well, meh. Let's plot this.

In [None]:
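import seaborn as sns
f, ax = plt.subplots(figsize=(10, 6))
ax.set(xscale="log")
# Pass the columns by keyword: recent seaborn releases no longer accept x and y positionally.
sns.scatterplot(data=imdb, x="Rating", y="Runtime", ax=ax)
plt.show()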
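The three pairwise checks above can also be read off a single correlation matrix. The sketch below builds one from the five rows printed by `head()` at the top of the notebook, so its numbers only illustrate the call and say nothing about the full dataset.

```python
import pandas as pd

# The five rows shown by head() above, numeric columns only.
sample = pd.DataFrame(
    {
        "Rating": [8.1, 7.0, 7.3, 7.2, 6.2],
        "Metascore": [76.0, 65.0, 62.0, 59.0, 40.0],
        "Revenue": [333.13, 126.46, 138.12, 270.32, 325.02],
        "Runtime": [121, 124, 117, 108, 123],
    }
)

cols = ["Rating", "Metascore", "Revenue", "Runtime"]

# Pairwise Pearson correlations between all selected columns.
corr_matrix = sample[cols].corr()
print(corr_matrix.round(2))
```

On the real frame this would be `imdb[cols].corr()`, and the `Rating` row gives all three coefficients at once.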

Thank you for reading this analysis. Hit me up if there's anything else you'd like me to explore!