## Analysis of 1,000 Popular IMDb Movies, 2006-2016

### This document analyzes 1,000 popular movies from the IMDb website, using data points such as title, genre, director, and revenue. The analysis looks for patterns among popular movies, such as whether their success correlates with other attributes, and reports summary figures such as the average runtime and average revenue of these 1,000 movies.

We will import pandas to analyze the data. The dataset for this analysis comes from [Kaggle](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data).

Below we import pandas and load the CSV file into a DataFrame, which arranges our 1,000 movies into rows and columns. We also load a second copy with **index_col="Title"**, which makes the movie title the row index, so that later we can type in a title and get its data, such as the genre, actors, director, and rating.


In [6]:
import pandas as pd
data = pd.read_csv('IMDB-Movie-Data.csv')
data_indexed = pd.read_csv('IMDB-Movie-Data.csv', index_col="Title")
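
To make the effect of **index_col** concrete, here is a minimal sketch. It assumes a hypothetical three-row stand-in for the CSV file (read from a string rather than from `IMDB-Movie-Data.csv`), with values copied from rows shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for IMDB-Movie-Data.csv, kept to a few columns.
csv_text = """Rank,Title,Year,Rating
1,Guardians of the Galaxy,2014,8.1
2,Prometheus,2012,7.0
3,Split,2016,7.3
"""

# Default: rows are labeled 0, 1, 2 and Title is an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col="Title": rows are labeled by title, so .loc can look a movie up by name.
title_index = pd.read_csv(io.StringIO(csv_text), index_col="Title")

print(default_index.index.tolist())
print(title_index.index.tolist())
print(title_index.loc["Split", "Rating"])
```

With the title as the index, **Title** is no longer a regular column, which is why this notebook keeps both `data` and `data_indexed`.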

Using **data.head()** will print out the first five rows of our data.

In [7]:
data.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


With **data.info()**, we can see that our dataset has 1,000 entries. We can also see that there are twelve columns, one for each attribute recorded per movie, as seen below. The **Non-Null Count** column already shows that **Revenue (Millions)** and **Metascore** have missing values.

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


**data.shape** tells us how many rows and columns there are. There are 1,000 rows and 12 columns.

In [9]:
data.shape

(1000, 12)

**data.describe()** is where much of the meat is regarding the data for these movies. It shows the count, mean, standard deviation, minimum, maximum, and percentiles for each numeric column, including runtime, rating, revenue, and Metascore. The mean runtime of the 1,000 movies is roughly 113 minutes and the mean rating is roughly 6.7/10. The revenue column, to clarify, is the domestic revenue, not the worldwide revenue. The highest revenue in the list is roughly 937 million dollars, and the lowest is listed as zero dollars. When I looked online, the zero wasn't correct: the movie listed with zero revenue is **A Kind of Murder**, which actually made $2,915 domestically. Since revenue is recorded in millions, very small amounts like this show up as zero, so the movies at the bottom for revenue are slightly off but still close to accurate.

In [10]:
data.describe()

Unnamed: 0,Rank,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,872.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,113.715,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


As an example, we can print out only the genre of the 1,000 movies, as seen below.

In [18]:
genre = data['Genre']

In [19]:
data[['Genre']]

Unnamed: 0,Genre
0,"Action,Adventure,Sci-Fi"
1,"Adventure,Mystery,Sci-Fi"
2,"Horror,Thriller"
3,"Animation,Comedy,Family"
4,"Action,Adventure,Fantasy"
...,...
995,"Crime,Drama,Mystery"
996,Horror
997,"Drama,Music,Romance"
998,"Adventure,Comedy"


Now for a more detailed look at our data, we can print out only the columns that we want. Here we print the genre, actors, director, rating, and revenue of a single movie, looked up by its title using the title-indexed DataFrame. Below, **Guardians of the Galaxy** was typed in as an example, but we could use any movie from the list.

In [20]:
some_cols = data[['Title','Genre','Actors','Director','Rating']]

In [23]:
data_indexed.loc[['Guardians of the Galaxy']][['Genre','Actors','Director','Rating','Revenue (Millions)']]

Title,Genre,Actors,Director,Rating,Revenue (Millions)
Guardians of the Galaxy,"Action,Adventure,Sci-Fi","Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",James Gunn,8.1,333.13
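
As a rough sketch of the lookup above, the title search can be wrapped in a small helper that also copes with a title that is not in the list. The helper name `lookup` and the three-row frame are assumptions for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Small illustrative frame indexed by title (values taken from rows shown earlier).
movies = pd.DataFrame(
    {
        "Title": ["Guardians of the Galaxy", "Prometheus", "Split"],
        "Director": ["James Gunn", "Ridley Scott", "M. Night Shyamalan"],
        "Rating": [8.1, 7.0, 7.3],
    }
).set_index("Title")


def lookup(frame, title, columns):
    """Return the requested columns for one title, or None if the title is absent."""
    if title not in frame.index:
        return None
    # Passing [title] as a list keeps the result a one-row DataFrame.
    return frame.loc[[title], columns]


result = lookup(movies, "Prometheus", ["Director", "Rating"])
missing = lookup(movies, "Not In The List", ["Director", "Rating"])
print(result)
print(missing)
```

Without the membership check, `.loc` raises a `KeyError` for a title that is not in the index.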


We can also filter our data on several conditions at once, like below. Here we select movies that came out between 2014 and 2016, but we can change this to any years within 2006-2016, since that's the range of our data. Beyond the release years, we also require that the movie has a rating below six and a revenue above the 0.80 quantile, meaning it earned more than 80% of the movies that have a revenue value. So the movie had to do fairly well for itself at the box office. Again, these are domestic revenue numbers, not worldwide.

In [27]:
data[((data['Year'] >= 2014) & (data['Year'] <= 2016))
      & (data['Rating'] < 6.0)
      & (data['Revenue (Millions)'] > data['Revenue (Millions)'].quantile(0.80))]

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
63,64,Fifty Shades of Grey,"Drama,Romance,Thriller",Literature student Anastasia Steele's life cha...,Sam Taylor-Johnson,"Dakota Johnson, Jamie Dornan, Jennifer Ehle,El...",2015,125,4.1,244474,166.15,46.0
126,127,Transformers: Age of Extinction,"Action,Adventure,Sci-Fi",Autobots must escape sight from a bounty hunte...,Michael Bay,"Mark Wahlberg, Nicola Peltz, Jack Reynor, Stan...",2014,165,5.7,255483,245.43,32.0
657,658,Teenage Mutant Ninja Turtles,"Action,Adventure,Comedy","When a kingpin threatens New York City, a grou...",Jonathan Liebesman,"Megan Fox, Will Arnett, William Fichtner, Noel...",2014,101,5.9,178527,190.87,31.0
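
The same filter can be written with one named mask per condition, which some readers find easier to follow. This is a sketch on an eight-movie sample taken from rows shown in this notebook; the function name `low_rated_high_earners` is hypothetical, and the example uses the 0.50 quantile because the sample is too small for the 0.80 cutoff to be meaningful.

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "Title": [
            "Fifty Shades of Grey",
            "Transformers: Age of Extinction",
            "Teenage Mutant Ninja Turtles",
            "Prometheus",
            "Split",
            "Guardians of the Galaxy",
            "Suicide Squad",
            "Sing",
        ],
        "Year": [2015, 2014, 2014, 2012, 2016, 2014, 2016, 2016],
        "Rating": [4.1, 5.7, 5.9, 7.0, 7.3, 8.1, 6.2, 7.2],
        "Revenue (Millions)": [166.15, 245.43, 190.87, 126.46, 138.12, 333.13, 325.02, 270.32],
    }
)


def low_rated_high_earners(frame, first_year, last_year, max_rating, revenue_quantile):
    """Movies in a year range, rated below max_rating, earning above a revenue quantile."""
    in_years = frame["Year"].between(first_year, last_year)  # inclusive on both ends
    low_rated = frame["Rating"] < max_rating
    cutoff = frame["Revenue (Millions)"].quantile(revenue_quantile)
    high_revenue = frame["Revenue (Millions)"] > cutoff
    return frame[in_years & low_rated & high_revenue]


result = low_rated_high_earners(movies, 2014, 2016, 6.0, 0.50)
print(result["Title"].tolist())
```

Note that the quantile is computed over whatever frame is passed in, so the cutoff on this sample differs from the cutoff on the full 1,000-movie dataset.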


We can also group our data. Here we group by director and take the average rating of the movies each one directed.

In [28]:
data.groupby('Director')[['Rating']].mean().head()

Director,Rating
Aamir Khan,8.5
Abdellatif Kechiche,7.8
Adam Leon,6.5
Adam McKay,7.0
Adam Shankman,6.3


Below we again group directors with the average rating of their movies. The difference from the table above is that the list is sorted from highest rating to lowest, so it shows the ten directors with the highest average movie rating. The sorting is done with **sort_values()** and **ascending=False**, and **head(10)** keeps the first ten rows. The top director, as we can see, is Nitesh Tiwari.

In [63]:
data.groupby('Director')[['Rating']].mean().sort_values(['Rating'], ascending=False).head(10)

Director,Rating
Nitesh Tiwari,8.8
Christopher Nolan,8.68
Olivier Nakache,8.6
Makoto Shinkai,8.6
Aamir Khan,8.5
Florian Henckel von Donnersmarck,8.5
Naoko Yamada,8.4
Damien Chazelle,8.4
Lee Unkrich,8.3
Amber Tamblyn,8.3


On the flip side, we can see the bottom ten directors by average rating by changing **ascending=False** to **ascending=True**. The director with the lowest average rating, as we can see, is Jason Friedberg.

In [64]:
data.groupby('Director')[['Rating']].mean().sort_values(['Rating'], ascending=True).head(10)

Director,Rating
Jason Friedberg,1.9
James Wong,2.7
Shawn Burkett,2.7
Jonathan Holbrook,3.2
Femi Oyeniran,3.5
Micheal Bafaro,3.5
Jeffrey G. Hunt,3.7
Rolfe Kanefsky,3.9
Joey Curtis,4.0
Sam Taylor-Johnson,4.1
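
One caveat with rankings like these is that an average based on a single movie counts the same as one based on many. A possible refinement, sketched here on hypothetical toy data (the director names and ratings below are made up for illustration), is to compute the number of movies alongside the mean and optionally require a minimum count.

```python
import pandas as pd

# Hypothetical toy data: Director A has two movies, Director B has one.
toy = pd.DataFrame(
    {
        "Director": ["Director A", "Director A", "Director B"],
        "Rating": [8.0, 9.0, 8.8],
    }
)

# Named aggregation: one output column for the mean, one for the movie count.
summary = (
    toy.groupby("Director")["Rating"]
    .agg(mean_rating="mean", movies="count")
    .sort_values("mean_rating", ascending=False)
)

# Keep only directors with at least two movies in the data.
at_least_two = summary[summary["movies"] >= 2]

print(summary)
print(at_least_two)
```

The threshold of two movies is an arbitrary choice for the sketch; whether to apply one at all depends on the question being asked.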


Looking through the CSV file in a spreadsheet, and counting the null values with the code below, it's clear there are rows with nothing entered for the revenue or Metascore: 128 movies have no revenue and 64 have no Metascore.

In [33]:
data.isnull().sum()

Rank                    0
Title                   0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

What we can do is drop the Metascore column completely. **axis=1** says that a column, rather than a row, is to be dropped. **drop()** returns a new DataFrame and leaves **data** itself unchanged unless we specify **inplace=True** as a parameter or assign the result back to **data**.

As we can see below, the returned DataFrame has no Metascore column.

In [34]:
data.drop('Metascore', axis=1).head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions)
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02
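
The difference between getting a new DataFrame back and changing the original can be checked on a small example. This is a minimal sketch with a two-row frame; the variable names `original` and `trimmed` are only for illustration.

```python
import pandas as pd

original = pd.DataFrame({"Rating": [8.1, 7.0], "Metascore": [76.0, 65.0]})

# drop() returns a new DataFrame; `original` keeps its Metascore column.
trimmed = original.drop("Metascore", axis=1)

print(list(original.columns))
print(list(trimmed.columns))
```

To make the change stick, one option is to assign the result back, for example `original = original.drop("Metascore", axis=1)`.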


**data.dropna()** is important because it drops the rows containing the null values we discussed earlier, which can mess with the data. These null values, if you remember, were in the **Revenue (Millions)** and **Metascore** columns. Like **drop()**, it returns a new DataFrame rather than changing **data**.

In [35]:
data.dropna()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...
993,994,Resident Evil: Afterlife,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...,Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,140900,60.13,37.0
994,995,Project X,Comedy,3 high school seniors throw a birthday party t...,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,164088,54.72,48.0
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0


The code below instead drops every column that contains missing data. You will see there are no revenue or Metascore columns.

In [36]:
data.dropna(axis=1)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727
...,...,...,...,...,...,...,...,...,...,...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881


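Next, we use the **thresh** parameter, which specifies the minimum number of non-null values a row (or column) must have in order to be kept. With **axis=0** and **thresh=6**, a row is dropped only if it has fewer than six non-null values. Since only two of the twelve columns contain nulls, every row here has at least ten non-null values, so no rows are actually dropped, as the rows with a missing revenue in the output show. On messier data, this is a way to remove rows that are mostly empty.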

In [37]:
data.dropna(axis=0, thresh=6)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
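
To see **thresh** actually remove rows, here is a small sketch on hypothetical data where the rows have different numbers of missing values. The titles `A`, `B`, and `C` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        "Rating": [6.2, 5.6, np.nan],
        "Revenue (Millions)": [np.nan, np.nan, np.nan],
        "Metascore": [45.0, np.nan, np.nan],
    }
)

# Number of non-null values in each row: A has 3, B has 2, C has 1.
non_null_per_row = frame.notna().sum(axis=1)

# Keep rows with at least 2 non-null values (drops C).
kept_two = frame.dropna(axis=0, thresh=2)

# Keep rows with at least 3 non-null values (drops B and C).
kept_three = frame.dropna(axis=0, thresh=3)

print(non_null_per_row.tolist())
print(kept_two["Title"].tolist())
print(kept_three["Title"].tolist())
```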


Now we find the mean revenue. **mean()** skips null values, so this is the mean over the movies that have a revenue value, which is roughly 83 million dollars.

In [38]:
revenue_mean = data_indexed['Revenue (Millions)'].mean()
print("The mean revenue is: ", revenue_mean)

The mean revenue is:  82.95637614678897


With the code below, we use the mean revenue we just calculated to fill in the null revenue values, so that when the data frame is checked there won't be any nulls in the revenue column. Filling with the mean is a simple choice that lets us keep all 1,000 movies without shifting the average: since **mean()** already ignores nulls, the mean revenue stays at roughly 83 million after filling. The trade-off is that 128 movies now share the same estimated revenue, which understates the spread of the column.

In [39]:
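# Assign the filled column back; fillna(..., inplace=True) on a selected column may not update the DataFrame in recent pandas versions
data_indexed['Revenue (Millions)'] = data_indexed['Revenue (Millions)'].fillna(revenue_mean)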
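
As a quick check of what mean imputation does and does not change, here is a minimal sketch on a four-value series with one missing entry (three revenue values copied from rows shown earlier, plus one `NaN`).

```python
import numpy as np
import pandas as pd

revenue = pd.Series([333.13, 126.46, np.nan, 138.12], name="Revenue (Millions)")

mean_before = revenue.mean()          # NaN is skipped by default
filled = revenue.fillna(mean_before)  # returns a new Series with the gap filled
mean_after = filled.mean()

print(filled.isna().sum())
print(mean_before, mean_after)
print(revenue.std(), filled.std())
```

The mean is the same before and after filling, while the standard deviation shrinks, because the filled value sits exactly at the mean.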

We can use the **apply()** function to run our own function on every value in a column. Here we use it to label each movie as good, average, or bad based on its rating. With the code below, we create a **rating_group** function that does this. If the rating is greater than or equal to 7.5, the movie is considered **Good**. If it's greater than or equal to 6 and less than 7.5, the movie is considered **Average**. If the rating is below 6, the movie is considered **Bad**.

In [40]:
def rating_group(rating):
    if rating >= 7.5:
        return 'Good'
    elif rating >= 6.0:
        return 'Average'
    else:
        return 'Bad'

This is where we apply the **rating_group** function to the **Rating** column and store the result in a new **Rating_category** column.

In [41]:
data['Rating_category'] = data['Rating'].apply(rating_group)

In [71]:
data[['Title','Director','Rating','Rating_category']].head(30)


Unnamed: 0,Title,Director,Rating,Rating_category
0,Guardians of the Galaxy,James Gunn,8.1,Good
1,Prometheus,Ridley Scott,7.0,Average
2,Split,M. Night Shyamalan,7.3,Average
3,Sing,Christophe Lourdelet,7.2,Average
4,Suicide Squad,David Ayer,6.2,Average
5,The Great Wall,Yimou Zhang,6.1,Average
6,La La Land,Damien Chazelle,8.3,Good
7,Mindhorn,Sean Foley,6.4,Average
8,The Lost City of Z,James Gray,7.1,Average
9,Passengers,Morten Tyldum,7.0,Average



As we can see from above, we have the title of the movie, the director, the rating, and a new **Rating_category** column that holds the result of applying our **rating_group** function to each rating. I decided to print the first thirty results from the list with the **head()** function so that we could see the **Good**, **Average**, and **Bad** ratings on display.
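
As an alternative sketch, the same three groups can be produced without a custom function by using `pd.cut`, which bins numeric values into labeled intervals. The bin edges below are chosen to match **rating_group**; the sample ratings are picked only to exercise the boundaries.

```python
import numpy as np
import pandas as pd


def rating_group(rating):
    """Same thresholds as the function defined in this notebook."""
    if rating >= 7.5:
        return "Good"
    elif rating >= 6.0:
        return "Average"
    else:
        return "Bad"


ratings = pd.Series([8.1, 7.5, 7.0, 6.0, 5.9, 1.9])

# right=False makes each bin include its left edge: [-inf, 6), [6, 7.5), [7.5, inf).
categories = pd.cut(
    ratings,
    bins=[-np.inf, 6.0, 7.5, np.inf],
    labels=["Bad", "Average", "Good"],
    right=False,
)

print(categories.astype(str).tolist())
print(ratings.apply(rating_group).tolist())
```

Both approaches give the same labels here. `pd.cut` returns an ordered categorical, which can be convenient for sorting or counting by group.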

![Chart of the highest-grossing movies from the list](imdb.png)

Here, I have sorted the list by revenue to look at the highest-grossing movies and check for patterns. What we can see from this chart is that the movies with the highest domestic revenue don't have a common director dominating the list, but they do have genres in common. Almost all of these movies share the same few related genres: **Action**, **Adventure**, and **Sci-Fi**. You could also sprinkle in **Fantasy** and **Comedy**, but the other three are the main staples.

So, when looking at these 1,000 popular IMDb movies from 2006-2016, we can see an association between the movies with the highest revenue and their genre. There's also some fascinating insight into movies with low IMDb ratings that still have a high revenue. For example, when we printed out movies with a rating below six and a revenue above the 0.80 quantile, we saw movies such as **Fifty Shades of Grey** bringing in 166.15 million dollars even with a 4.1 rating. We also had **Transformers: Age of Extinction**, with a rating of 5.7, which brought in 245.43 million dollars. So, just because a movie has a low rating doesn't mean there's no potential to make a lot of money at the box office.
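
One way to count genres among the top earners, rather than reading them off a chart, is to split the comma-separated **Genre** strings and tally the pieces. This is a sketch on five movies taken from rows shown earlier, using the top three by revenue; on the full dataset the cutoff would be a choice for the analyst.

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "Title": ["Guardians of the Galaxy", "Suicide Squad", "Sing", "Split", "Prometheus"],
        "Genre": [
            "Action,Adventure,Sci-Fi",
            "Action,Adventure,Fantasy",
            "Animation,Comedy,Family",
            "Horror,Thriller",
            "Adventure,Mystery,Sci-Fi",
        ],
        "Revenue (Millions)": [333.13, 325.02, 270.32, 138.12, 126.46],
    }
)

# Take the highest-revenue movies, then turn each genre string into one row per genre.
top = movies.nlargest(3, "Revenue (Millions)")
genre_counts = top["Genre"].str.split(",").explode().value_counts()

print(top["Title"].tolist())
print(genre_counts)
```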