<a href="https://colab.research.google.com/github/EricAshby/EDA-IMDb-Movie-Data/blob/main/TEDA1030_Mod2_practice_EricAshby_08_07_23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis on IMDb Movie Data
## Introduction
IMDb is a movie information site that tracks data about movies, actors, directors, and more.

This exploratory data analysis examines the movie-level information in the data set.

## Analysis Purpose
This analysis aims to answer:

1.   What is the average runtime for movies in the data set?
2.   How long would it take you to watch all movies, in hours?
3.   What is the median meta score of movies in the data set?
4.   Which genre has the most titles in the data set?
5.   How many different directors appear in the data set?
6.   Which actor appeared the most times in the Star1 column?
7.   In which year was the oldest movie released?
8.   Which movie has the highest Gross (money made)?
9.   Which movie got the most votes?
10.   Which movie had the worst IMDb Rating?
11.   Which year had the most movie titles?
12.   What is the average gross across all movies? Round to two decimal places.
13.   What is the median gross across all movies?
14.   Is the gross profit from the movies in this data set right skewed, left skewed, or not skewed at all?
15.   Which movie made the most gross profit per minute?








In [None]:
import pandas as pd
import re
movies = pd.read_csv("imdb_movies.csv")

## Overview
Here are the first 5 rows of the data set. From this, we can see that the data set contains the link to the movie poster, the series title, the release year, certificate, runtime, genre, IMDb rating, overview, Meta score, director, up to 4 star cast members, the number of votes, and the gross earnings of the movie.

In [None]:
#First 5 rows of the data set
movies.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


Here we have metadata about the data set itself: the column names, the number of non-null observations, and the data type of each column. There are 16 columns and 1000 observations (seen here as "entries"). Only 3 columns have a numerical data type: IMDB_Rating, Meta_score, and No_of_Votes. Note that Gross, Runtime, and Released_Year are stored here as objects rather than numbers. This will be taken into account for calculations involving these.

Most columns have no null entries; however, Certificate, Meta_score, and Gross are incomplete. This will be taken into account for the coming analysis.

In [None]:
#Info on Data Set
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB
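
The null counts reported above can also be tallied directly per column. The following is a minimal sketch, assuming a small made-up sample (the `sample` frame and its values are illustrative, not rows from the real data set):

```python
import pandas as pd

# Made-up illustrative rows; only the column names mirror the real data set.
sample = pd.DataFrame({
    "Certificate": ["A", None, "U"],
    "Meta_score": [80.0, None, None],
    "Gross": ["1,000", "2,000", None],
})

# isna() marks missing cells; sum() counts them per column.
missing = sample.isna().sum()
print(missing)
```

On the real data, the same call on `movies` should reproduce the gaps visible in the `info()` output.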


These are the summary or descriptive statistics for the numeric data in the data set. Note that the descriptive statistics for Gross and Runtime cannot be reported here as both are stored as an object rather than a numerical data type.

In [None]:
#Descriptive Stats
movies.describe()

Unnamed: 0,IMDB_Rating,Meta_score,No_of_Votes
count,1000.0,843.0,1000.0
mean,7.9493,77.97153,273692.9
std,0.275491,12.376099,327372.7
min,7.6,28.0,25088.0
25%,7.7,70.0,55526.25
50%,7.9,79.0,138548.5
75%,8.1,87.0,374161.2
max,9.3,100.0,2343110.0


# Analysis

## 1. What is the average runtime for movies in the data set?
Here, we must convert the Runtime strings (for example, "142 min") to integers before calculating the mean.

In [None]:
#correct data typing: strip the " min" suffix, then convert the digits to integers
movies["Runtime_int"] = movies["Runtime"].str.replace(" min", "", regex=False).astype(int)
# Alternative using the re module: movies["Runtime"].map(lambda s: int(re.sub(" min", "", s)))

print(round(movies["Runtime_int"].mean(), 2), "min")


122.89 min


## 2. How long would it take you to watch all movies, in hours?

In [None]:
print(round(movies["Runtime_int"].sum() / 60,2) , "hours")

2048.18 hours


## 3. What is the median meta score of movies in the data set?

In [None]:
print(movies["Meta_score"].median())

79.0


## 4. Which genre has the most titles in the data set?
Drama appears in all five of the most frequently occurring genre lists, so we can guess that the most common genre is Drama. This is not conclusive, however, because value_counts() counts whole genre lists: a genre could span a greater number of lists without showing up in the top 5.

In [None]:
movies["Genre"].value_counts()

Drama                        85
Drama, Romance               37
Comedy, Drama                35
Comedy, Drama, Romance       31
Action, Crime, Drama         30
                             ..
Adventure, Thriller           1
Animation, Action, Sci-Fi     1
Action, Crime, Comedy         1
Animation, Crime, Mystery     1
Adventure, Comedy, War        1
Name: Genre, Length: 202, dtype: int64
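
To count individual genres rather than whole genre lists, each comma-separated list can be split into separate rows first. The following is a minimal sketch, assuming a small made-up sample (the `sample` frame is illustrative, not the real data):

```python
import pandas as pd

# Made-up illustrative rows in the same "Genre" format as the data set.
sample = pd.DataFrame({
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama", "Comedy, Romance"]
})

# Split each list on ", ", give every genre its own row, then count occurrences.
genre_counts = (
    sample["Genre"]
    .str.split(", ")
    .explode()
    .value_counts()
)
print(genre_counts)
```

Applied to `movies["Genre"]`, the first entry of the result would settle whether Drama really is the most common genre.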

## 5. How many different directors appear in the data set?

In [None]:
print(movies['Director'].nunique(),"directors")

548 directors


## 6. Which actor appeared the most times in the Star1 column?
Tom Hanks occurs most frequently in the Star1 column.

In [None]:
movies["Star1"].value_counts()

Tom Hanks          12
Robert De Niro     11
Al Pacino          10
Clint Eastwood     10
Humphrey Bogart     9
                   ..
Preity Zinta        1
Javier Bardem       1
Ki-duk Kim          1
Vladimir Garin      1
Robert Donat        1
Name: Star1, Length: 660, dtype: int64

## 7. In which year was the oldest movie released?
We can find the earliest year using the .min() method. Released_Year is stored as text, so .min() compares strings; because every valid year has four digits, this gives the same answer as a numeric comparison. However, sorting the data reveals an incorrectly entered value for *Apollo 13*. Since *Apollo 13* was definitely not released prior to 1920, the earliest release year is still 1920.

In [None]:
#earliest release date
print(movies['Released_Year'].min())

1920


Note the "PG" in the bottom entry of the Released_Year column in the following table. This is a data-entry error.

In [None]:
movies.sort_values(by = 'Released_Year')


Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Gross_float,Runtime_int
321,https://m.media-amazon.com/images/M/MV5BNWJiNG...,Das Cabinet des Dr. Caligari,1920,,76 min,"Fantasy, Horror, Mystery",8.1,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",,Robert Wiene,Werner Krauss,Conrad Veidt,Friedrich Feher,Lil Dagover,57428,,,76
127,https://m.media-amazon.com/images/M/MV5BZjhhMT...,The Kid,1921,Passed,68 min,"Comedy, Drama, Family",8.3,"The Tramp cares for an abandoned child, but ev...",,Charles Chaplin,Charles Chaplin,Edna Purviance,Jackie Coogan,Carl Miller,113314,5450000,5450000.0,68
568,https://m.media-amazon.com/images/M/MV5BMTAxYj...,Nosferatu,1922,,94 min,"Fantasy, Horror",7.9,Vampire Count Orlok expresses interest in a ne...,,F.W. Murnau,Max Schreck,Alexander Granach,Gustav von Wangenheim,Greta Schröder,88794,,,94
194,https://m.media-amazon.com/images/M/MV5BZWFhOG...,Sherlock Jr.,1924,Passed,45 min,"Action, Comedy, Romance",8.2,"A film projectionist longs to be a detective, ...",,Buster Keaton,Buster Keaton,Kathryn McGuire,Joe Keaton,Erwin Connelly,41985,977375,977375.0,45
193,https://m.media-amazon.com/images/M/MV5BZjEyOT...,The Gold Rush,1925,Passed,95 min,"Adventure, Comedy, Drama",8.2,A prospector goes to the Klondike in search of...,,Charles Chaplin,Charles Chaplin,Mack Swain,Tom Murray,Henry Bergman,101053,5450000,5450000.0,95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
464,https://m.media-amazon.com/images/M/MV5BNmI0MT...,Dil Bechara,2020,UA,101 min,"Comedy, Drama, Romance",7.9,The emotional journey of two hopelessly in lov...,,Mukesh Chhabra,Sushant Singh Rajput,Sanjana Sanghi,Sahil Vaid,Saswata Chatterjee,111478,,,101
18,https://m.media-amazon.com/images/M/MV5BNjViNW...,Hamilton,2020,PG-13,160 min,"Biography, Drama, History",8.6,The real life of one of America's foremost fou...,90.0,Thomas Kail,Lin-Manuel Miranda,Phillipa Soo,Leslie Odom Jr.,Renée Elise Goldsberry,55291,,,160
20,https://m.media-amazon.com/images/M/MV5BOTc2ZT...,Soorarai Pottru,2020,U,153 min,Drama,8.6,"Nedumaaran Rajangam ""Maara"" sets out to make t...",,Sudha Kongara,Suriya,Madhavan,Paresh Rawal,Aparna Balamurali,54995,,,153
613,https://m.media-amazon.com/images/M/MV5BOTNjM2...,Druk,2020,,117 min,"Comedy, Drama",7.8,"Four friends, all high school teachers, test a...",81.0,Thomas Vinterberg,Mads Mikkelsen,Thomas Bo Larsen,Magnus Millang,Lars Ranthe,33931,,,117
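
A numeric conversion avoids relying on string comparison and flags invalid entries at the same time. The following is a minimal sketch, assuming a small made-up sample (the titles "Film A" to "Film C" are hypothetical placeholders):

```python
import pandas as pd

# Made-up illustrative rows; "PG" stands in for the invalid entry.
sample = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Released_Year": ["1995", "1920", "PG"],
})

# errors="coerce" turns anything that is not a number into NaN.
year_num = pd.to_numeric(sample["Released_Year"], errors="coerce")

bad_rows = sample[year_num.isna()]   # rows whose year could not be parsed
earliest = int(year_num.min())       # min() skips NaN by default

print(bad_rows)
print(earliest)
```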


## 8. Which movie has the highest Gross (money made)?

Sorting the data by Gross as is will not work as the entries in the Gross column are not stored as numeric values. We must therefore interpret the values numerically, creating the column Gross_float.

In [None]:
#interpret Gross numerically: drop thousands separators, then convert to float
#missing values remain NaN
movies["Gross_float"] = pd.to_numeric(movies["Gross"].str.replace(",", "", regex=False))
# Alternative using the re module: apply re.sub(",", "", value) to each non-null value

Sorting by Gross_float we find that *Star Wars: Episode VII - The Force Awakens* made the most money.

In [None]:
#Sorting by Gross (interpreted as float)
movies.sort_values(by = "Gross_float" , ascending = False)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Gross_float,Runtime_int,Gross Per Minute
477,https://m.media-amazon.com/images/M/MV5BOTAzOD...,Star Wars: Episode VII - The Force Awakens,2015,U,138 min,"Action, Adventure, Sci-Fi",7.9,"As a new threat to the galaxy rises, Rey, a de...",80.0,J.J. Abrams,Daisy Ridley,John Boyega,Oscar Isaac,Domhnall Gleeson,860823,936662225,936662225.0,138,6787407.427536
59,https://m.media-amazon.com/images/M/MV5BMTc5MD...,Avengers: Endgame,2019,UA,181 min,"Action, Adventure, Drama",8.4,After the devastating events of Avengers: Infi...,78.0,Anthony Russo,Joe Russo,Robert Downey Jr.,Chris Evans,Mark Ruffalo,809955,858373000,858373000.0,181,4742392.265193
623,https://m.media-amazon.com/images/M/MV5BMTYwOT...,Avatar,2009,UA,162 min,"Action, Adventure, Fantasy",7.8,A paraplegic Marine dispatched to the moon Pan...,83.0,James Cameron,Sam Worthington,Zoe Saldana,Sigourney Weaver,Michelle Rodriguez,1118998,760507625,760507625.0,162,4694491.512346
60,https://m.media-amazon.com/images/M/MV5BMjMxNj...,Avengers: Infinity War,2018,UA,149 min,"Action, Adventure, Sci-Fi",8.4,The Avengers and their allies must be willing ...,68.0,Anthony Russo,Joe Russo,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,834477,678815482,678815482.0,149,4555808.604027
652,https://m.media-amazon.com/images/M/MV5BMDdmZG...,Titanic,1997,UA,194 min,"Drama, Romance",7.8,A seventeen-year-old aristocrat falls in love ...,75.0,James Cameron,Leonardo DiCaprio,Kate Winslet,Billy Zane,Kathy Bates,1046089,659325379,659325379.0,194,3398584.427835
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,https://m.media-amazon.com/images/M/MV5BYTE4YW...,Blowup,1966,A,111 min,"Drama, Mystery, Thriller",7.6,A fashion photographer unknowingly captures a ...,82.0,Michelangelo Antonioni,David Hemmings,Vanessa Redgrave,Sarah Miles,John Castle,56513,,,111,
995,https://m.media-amazon.com/images/M/MV5BNGEwMT...,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,,,115,
996,https://m.media-amazon.com/images/M/MV5BODk3Yj...,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,,,201,
998,https://m.media-amazon.com/images/M/MV5BZTBmMj...,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,,,97,
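
When only the top title is needed, a full sort is not required. The following is a minimal sketch using `idxmax`, assuming a small made-up sample (titles and amounts are hypothetical):

```python
import pandas as pd

# Made-up illustrative rows; Gross is text with thousands separators, one value missing.
sample = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Gross": ["1,200,000", None, "950,000"],
})

# Remove separators and convert to numbers; the missing value becomes NaN.
sample["Gross_float"] = pd.to_numeric(sample["Gross"].str.replace(",", "", regex=False))

# idxmax() returns the index label of the largest non-missing value.
top_title = sample.loc[sample["Gross_float"].idxmax(), "Series_Title"]
print(top_title)
```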


## 9. Which movie got the most votes?

Sorting by number of votes, we find *The Shawshank Redemption* to have the most votes.

In [None]:
#movie with the most votes
movies.sort_values(by = "No_of_Votes", ascending = False).head(1)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Gross_float,Runtime_int,Gross Per Minute
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,28341469.0,142,199587.809859


## 10. Which movie had the worst IMDb Rating?
Several movies share the lowest IMDb rating of 7.6, so returning a single row would be misleading. Sorting by rating, we find that *Moneyball*, among other movies, has the lowest IMDb rating.

In [None]:
#movie with worst IMDb rating
movies.sort_values(by = 'IMDB_Rating').head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Gross_float,Runtime_int,Gross Per Minute
999,https://m.media-amazon.com/images/M/MV5BMTY5OD...,The 39 Steps,1935,,86 min,"Crime, Mystery, Thriller",7.6,A man in London tries to help a counter-espion...,93.0,Alfred Hitchcock,Robert Donat,Madeleine Carroll,Lucie Mannheim,Godfrey Tearle,51853,,,86,
908,https://m.media-amazon.com/images/M/MV5BMTMzNz...,Kick-Ass,2010,UA,117 min,"Action, Comedy, Crime",7.6,Dave Lizewski is an unnoticed high school stud...,66.0,Matthew Vaughn,Aaron Taylor-Johnson,Nicolas Cage,Chloë Grace Moretz,Garrett M. Brown,524081,48071303.0,48071303.0,117,410865.837607
909,https://m.media-amazon.com/images/M/MV5BMjI2OD...,Celda 211,2009,,113 min,"Action, Adventure, Crime",7.6,The story of two men on different sides of a p...,,Daniel Monzón,Luis Tosar,Alberto Ammann,Antonio Resines,Manuel Morón,63882,,,113,
910,https://m.media-amazon.com/images/M/MV5BMjAxOT...,Moneyball,2011,PG-13,133 min,"Biography, Drama, Sport",7.6,Oakland A's general manager Billy Beane's succ...,87.0,Bennett Miller,Brad Pitt,Robin Wright,Jonah Hill,Philip Seymour Hoffman,369529,75605492.0,75605492.0,133,568462.345865
911,https://m.media-amazon.com/images/M/MV5BYmFmNj...,La piel que habito,2011,R,120 min,"Drama, Horror, Thriller",7.6,"A brilliant plastic surgeon, haunted by past t...",70.0,Pedro Almodóvar,Antonio Banderas,Elena Anaya,Jan Cornet,Marisa Paredes,138959,3185812.0,3185812.0,120,26548.433333
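
Because `head()` shows only five of the tied rows, filtering on the minimum rating lists every movie that shares it. The following is a minimal sketch, assuming a small made-up sample (titles and ratings are hypothetical):

```python
import pandas as pd

# Made-up illustrative rows with a tie at the lowest rating.
sample = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D"],
    "IMDB_Rating": [7.6, 8.1, 7.6, 9.0],
})

lowest = sample["IMDB_Rating"].min()

# Keep every row whose rating equals the minimum, not just the first one.
tied = sample[sample["IMDB_Rating"] == lowest]
print(tied)
```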


## 11. Which year had the most movie titles?
2014 has the greatest number of titles.

Note that *Apollo 13* was released in 1995, and 1995 is not among the top 5 years for movie releases. Correcting the "PG" entry would add only one title to 1995, so the data-entry error does not affect our result here.

In [None]:
#number of titles by year
movies["Released_Year"].value_counts()

2014    32
2004    31
2009    29
2013    28
2016    28
        ..
1926     1
1936     1
1924     1
1921     1
PG       1
Name: Released_Year, Length: 100, dtype: int64

## 12. What is the average gross across all movies?
Note that this is the average gross across only the movies with a gross provided in the data set as there are null values in some of the entries.

In [None]:
#average gross
print(round(movies["Gross_float"].mean(),2),"USD")

68034750.87 USD


## 13. What is the median gross across all movies?
Note that this median is across only the movies with a gross provided in the data set.

In [None]:
#median gross
print(movies['Gross_float'].median(),"USD")

23530892.0 USD


## 14. Is the gross profit from the movies in this data set right skewed, left skewed, or not skewed at all?

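A mean well above the median indicates right skew, and a mean well below it indicates left skew. Here, a difference is treated as significant when it exceeds 10% of the average Gross.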

In [None]:
#checking for skew
GrossMean = movies['Gross_float'].mean()
GrossMedian = movies['Gross_float'].median()

toleranceFactor = 0.1 #significance tolerance factor set to 10%

if GrossMean - GrossMedian > toleranceFactor * GrossMean:
  print("Gross is right skewed")
elif GrossMean - GrossMedian < -1 * toleranceFactor * GrossMean:
  print("Gross is left skewed")
else:
  print("No significant skew")


Gross is right skewed
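
As a cross-check on the mean-versus-median comparison, pandas also provides a sample skewness statistic, where a positive value points to a right skew. The following is a minimal sketch, assuming a small made-up series (the values are illustrative, not real gross figures):

```python
import pandas as pd

# Made-up illustrative values with one large outlier on the right.
gross = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Series.skew() returns the sample skewness; > 0 suggests a right skew.
skewness = gross.skew()

print(gross.mean(), gross.median(), skewness)
```

This is a swapped-in alternative to the 10% tolerance rule used above, not a replacement for it; the two checks can be compared on `movies["Gross_float"]`.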


## 15. Which movie made the most gross profit per minute?

Here we make a new column Gross Per Minute, which shows the money made per minute. Sorting, we find that *Star Wars: Episode VII - The Force Awakens* has the highest Gross Per Minute (GPM).

In [None]:
#movie with greatest GPM
movies['Gross Per Minute'] = movies['Gross_float'] / movies["Runtime_int"]
movies.sort_values(by = "Gross Per Minute" , ascending = False).head(1)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Gross_float,Runtime_int,Gross Per Minute
477,https://m.media-amazon.com/images/M/MV5BOTAzOD...,Star Wars: Episode VII - The Force Awakens,2015,U,138 min,"Action, Adventure, Sci-Fi",7.9,"As a new threat to the galaxy rises, Rey, a de...",80.0,J.J. Abrams,Daisy Ridley,John Boyega,Oscar Isaac,Domhnall Gleeson,860823,936662225,936662225.0,138,6787407.427536
