# Let's continue to explore the IMDB Dataset.

### Import the IMDB data and replace the PG value in Released_Year

In [6]:
import pandas as pd
imdb_df = pd.read_csv('sample_data/imdb_top_1000.csv')
imdb_df_indexed = pd.read_csv('sample_data/imdb_top_1000.csv', index_col='Series_Title')

In [7]:
imdb_df_indexed[~imdb_df_indexed.Released_Year.str.isnumeric()]

Series_Title,Poster_Link,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
Apollo 13,https://m.media-amazon.com/images/M/MV5BNjEzYj...,PG,U,140 min,"Adventure, Drama, History",7.6,NASA must devise a strategy to return Apollo 1...,77.0,Ron Howard,Tom Hanks,Bill Paxton,Kevin Bacon,Gary Sinise,269197,173837933


In [8]:
imdb_df_indexed.loc['Apollo 13', 'Released_Year'] = '1995'  # replace the stray 'PG' with the actual release year

With this fix, every Released_Year value is numeric, so the column can be cast to int and compared with integer values.
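A more general way to spot values like `PG` is to attempt the conversion and see which rows fail. The sketch below uses a small made-up Series (the titles and values are illustrative, not taken from the dataset) and `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Toy stand-in for the Released_Year column (illustrative values only).
years = pd.Series(["1994", "PG", "2016"], index=["Movie A", "Movie B", "Movie C"])

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
years_numeric = pd.to_numeric(years, errors="coerce")

# The rows that became NaN are the ones that need a manual fix.
bad_rows = years[years_numeric.isna()]
print(bad_rows)
```

Unlike `str.isnumeric()`, this approach would also flag empty strings or stray whitespace, which may or may not matter for a given file.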

### Data Selection and Filtering

Movies from 2016:

In [None]:
imdb_df_indexed[imdb_df_indexed["Released_Year"].astype(int) == 2016]

Filter the dataset to show the movies from 2016 that had a rating of at least 8.0:

In [None]:
imdb_df_indexed[(imdb_df_indexed["Released_Year"].astype(int) == 2016) & (imdb_df_indexed["IMDB_Rating"] >= 8.0)]
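When several conditions are combined, it can help to name each boolean mask before joining them. The following is a minimal sketch on a made-up three-row DataFrame (the titles and numbers are hypothetical):

```python
import pandas as pd

# Toy data with the same column names as the IMDB DataFrame.
movies = pd.DataFrame(
    {"Released_Year": [2016, 2016, 2015], "IMDB_Rating": [8.1, 7.9, 8.4]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Each comparison yields a boolean Series aligned with the rows.
is_2016 = movies["Released_Year"] == 2016
is_high = movies["IMDB_Rating"] >= 8.0

# & combines the masks element-wise; only rows where both are True survive.
selected = movies[is_2016 & is_high]
print(selected)
```

When the comparisons are written inline instead, each one needs its own parentheses, because `&` binds more tightly than `==` and `>=` in Python.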

Let's create a new DataFrame in which the Gross column is parsed as numeric values. The `thousands=','` argument tells `read_csv` to treat commas as thousands separators.



In [24]:
imdb_df_indexed_num_gross = pd.read_csv('sample_data/imdb_top_1000.csv', index_col='Series_Title', thousands=',')
imdb_df_indexed_num_gross.loc['Apollo 13', 'Released_Year'] = '1995'  # replace the stray 'PG' with the actual release year
imdb_df_indexed_num_gross["Released_Year"] = imdb_df_indexed_num_gross["Released_Year"].astype(int)
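To see what `thousands=','` changes, the sketch below reads the same tiny CSV text twice. The CSV content is made up for illustration and assumes the numbers are stored as quoted strings with comma separators:

```python
import io
import pandas as pd

# Hypothetical CSV text: Gross is written with thousands separators.
csv_text = 'Series_Title,Gross\nMovie A,"28,341,469"\nMovie B,"1,305"\n'

# Without thousands=',', the column stays text.
plain = pd.read_csv(io.StringIO(csv_text), index_col="Series_Title")

# With thousands=',', the separators are stripped and the column is numeric.
parsed = pd.read_csv(io.StringIO(csv_text), index_col="Series_Title", thousands=",")

print(plain["Gross"].dtype)
print(parsed["Gross"].dtype)
print(parsed["Gross"].sum())
```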

In [None]:
imdb_df_indexed_num_gross.head()

In [26]:
imdb_df_indexed_num_gross.describe()

,Released_Year,IMDB_Rating,Meta_score,No_of_Votes,Gross
count,1000.0,1000.0,843.0,1000.0,831.0
mean,1991.221,7.9493,77.97153,273692.9,68034750.0
std,23.285669,0.275491,12.376099,327372.7,109750000.0
min,1920.0,7.6,28.0,25088.0,1305.0
25%,1976.0,7.7,70.0,55526.25,3253559.0
50%,1999.0,7.9,79.0,138548.5,23530890.0
75%,2009.0,8.1,87.0,374161.2,80750890.0
max,2020.0,9.3,100.0,2343110.0,936662200.0


Movies released between 2016 and 2018 (inclusive) that had a rating below 8.1 but high revenue (gross above the 75th percentile):


In [None]:
imdb_df_indexed_num_gross[(imdb_df_indexed_num_gross.Released_Year >= 2016)
                          & (imdb_df_indexed_num_gross.Released_Year <= 2018) 
                          & (imdb_df_indexed_num_gross.IMDB_Rating < 8.1)
                          & (imdb_df_indexed_num_gross.Gross > imdb_df_indexed_num_gross.Gross.quantile(0.75)) ]
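The same kind of filter can be written a little more compactly by computing the cutoff once and using `Series.between` for the year range. This is a sketch on made-up data, so the titles and figures are hypothetical:

```python
import pandas as pd

# Toy data with the same column names as the IMDB DataFrame.
movies = pd.DataFrame(
    {
        "Released_Year": [2015, 2016, 2017, 2018, 2019],
        "IMDB_Rating": [7.9, 7.8, 8.0, 8.4, 7.7],
        "Gross": [100.0, 200.0, 900.0, 800.0, 300.0],
    },
    index=["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
)

# 75th percentile of Gross, computed once and reused in the mask.
gross_cutoff = movies["Gross"].quantile(0.75)

selected = movies[
    movies["Released_Year"].between(2016, 2018)  # inclusive on both ends by default
    & (movies["IMDB_Rating"] < 8.1)
    & (movies["Gross"] > gross_cutoff)
]
print(gross_cutoff)
print(selected)
```

Note that `quantile` skips missing values, so movies without a Gross figure do not affect the cutoff and never pass the `>` comparison.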

### Grouping and Sorting

Q1: What is the total gross of each director's movies, and what is each director's mean rating?

In [32]:
imdb_df_indexed_num_gross.head(1)

Series_Title,Poster_Link,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
The Shawshank Redemption,https://m.media-amazon.com/images/M/MV5BMDFkYT...,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0


In [None]:
imdb_df_indexed_num_gross.groupby('Director')[['Gross']].sum()

In [None]:
imdb_df_indexed_num_gross.groupby('Director')[['IMDB_Rating']].mean()
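Both statistics can also be computed in a single pass with named aggregation. The sketch below uses a made-up DataFrame, and the output column names `total_gross` and `mean_rating` are chosen here for illustration:

```python
import pandas as pd

# Toy data: two directors, three movies (hypothetical values).
movies = pd.DataFrame(
    {
        "Director": ["Dir X", "Dir X", "Dir Y"],
        "Gross": [100.0, 300.0, 50.0],
        "IMDB_Rating": [8.0, 9.0, 7.5],
    }
)

# Named aggregation: new_column=(source_column, aggregation_function).
per_director = movies.groupby("Director").agg(
    total_gross=("Gross", "sum"),
    mean_rating=("IMDB_Rating", "mean"),
)
print(per_director)
```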

Q2. Which director's movies earned the most in total?

Answer: Steven Spielberg

In [48]:
imdb_df_indexed_num_gross.groupby('Director')[["Gross"]].sum().sort_values(['Gross'], ascending = False)

Director,Gross
Steven Spielberg,2.478133e+09
Anthony Russo,2.205039e+09
Christopher Nolan,1.937454e+09
James Cameron,1.748237e+09
Peter Jackson,1.597312e+09
...,...
Tetsuya Nakashima,0.000000e+00
Ericson Core,0.000000e+00
Leo McCarey,0.000000e+00
Kinji Fukasaku,0.000000e+00
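The totals of 0 at the bottom of this table do not mean those movies earned nothing. The smallest Gross in the dataset is 1305, so a total of 0 can only come from directors whose Gross values are all missing: `sum()` skips NaN and returns 0 for an empty sum. A possible way to keep such groups visible as missing is the `min_count` argument, sketched here on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data: "Dir Y" has no known Gross at all (hypothetical values).
movies = pd.DataFrame(
    {"Director": ["Dir X", "Dir X", "Dir Y"], "Gross": [100.0, np.nan, np.nan]}
)

# Default behaviour: NaN is skipped, so an all-missing group sums to 0.0.
default_sum = movies.groupby("Director")["Gross"].sum()

# min_count=1 requires at least one non-missing value, otherwise the result is NaN.
strict_sum = movies.groupby("Director")["Gross"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```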


Q3. Which movies had both the highest gross and the highest rating? (Use sort; the rows are ordered by Gross first, and IMDB_Rating only breaks ties.)

In [49]:
imdb_df_indexed_num_gross.sort_values(['Gross','IMDB_Rating'], ascending=False)

Series_Title,Poster_Link,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
Star Wars: Episode VII - The Force Awakens,https://m.media-amazon.com/images/M/MV5BOTAzOD...,2015,U,138 min,"Action, Adventure, Sci-Fi",7.9,"As a new threat to the galaxy rises, Rey, a de...",80.0,J.J. Abrams,Daisy Ridley,John Boyega,Oscar Isaac,Domhnall Gleeson,860823,936662225.0
Avengers: Endgame,https://m.media-amazon.com/images/M/MV5BMTc5MD...,2019,UA,181 min,"Action, Adventure, Drama",8.4,After the devastating events of Avengers: Infi...,78.0,Anthony Russo,Joe Russo,Robert Downey Jr.,Chris Evans,Mark Ruffalo,809955,858373000.0
Avatar,https://m.media-amazon.com/images/M/MV5BMTYwOT...,2009,UA,162 min,"Action, Adventure, Fantasy",7.8,A paraplegic Marine dispatched to the moon Pan...,83.0,James Cameron,Sam Worthington,Zoe Saldana,Sigourney Weaver,Michelle Rodriguez,1118998,760507625.0
Avengers: Infinity War,https://m.media-amazon.com/images/M/MV5BMjMxNj...,2018,UA,149 min,"Action, Adventure, Sci-Fi",8.4,The Avengers and their allies must be willing ...,68.0,Anthony Russo,Joe Russo,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,834477,678815482.0
Titanic,https://m.media-amazon.com/images/M/MV5BMDdmZG...,1997,UA,194 min,"Drama, Romance",7.8,A seventeen-year-old aristocrat falls in love ...,75.0,James Cameron,Leonardo DiCaprio,Kate Winslet,Billy Zane,Kathy Bates,1046089,659325379.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Blowup,https://m.media-amazon.com/images/M/MV5BYTE4YW...,1966,A,111 min,"Drama, Mystery, Thriller",7.6,A fashion photographer unknowingly captures a ...,82.0,Michelangelo Antonioni,David Hemmings,Vanessa Redgrave,Sarah Miles,John Castle,56513,
Breakfast at Tiffany's,https://m.media-amazon.com/images/M/MV5BNGEwMT...,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
Giant,https://m.media-amazon.com/images/M/MV5BODk3Yj...,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
Lifeboat,https://m.media-amazon.com/images/M/MV5BZTBmMj...,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,
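The ordering in this output follows from how multi-column sorting works: the first key decides the order, later keys only break ties, and rows with a missing first key go to the end. A small sketch on made-up data shows all three effects:

```python
import numpy as np
import pandas as pd

# Toy data: two movies tie on Gross, one has no Gross (hypothetical values).
movies = pd.DataFrame(
    {"Gross": [500.0, 500.0, np.nan, 900.0], "IMDB_Rating": [7.8, 8.4, 9.0, 7.9]},
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Sort by Gross descending; IMDB_Rating (descending) only breaks ties.
# Rows with missing Gross are placed last (na_position="last" is the default).
ranked = movies.sort_values(["Gross", "IMDB_Rating"], ascending=[False, False])
print(ranked)
```

Here the highest-rated movie ends up last because its Gross is missing, which is the same reason the rows with an empty Gross appear at the bottom of the output above.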


### Missing values

In [50]:
import numpy as np
import pandas as pd

array_with_None = np.array([0, None, 2, 3])
print(array_with_None.dtype)
print(array_with_None)

object
[0 None 2 3]


In [52]:
array_with_NaN = np.array([0, np.nan, 2, 3])
print(array_with_NaN.dtype)
print(array_with_NaN)

float64
[ 0. nan  2.  3.]


In [None]:
print(array_with_None.sum())  # raises TypeError: an int and None cannot be added

In [54]:
print(array_with_NaN.sum())  # runs, but any arithmetic involving nan returns nan

nan
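If the goal is to aggregate while ignoring the missing entries, NumPy offers NaN-aware variants such as `np.nansum` and `np.nanmean`, and pandas skips NaN by default. A minimal sketch with the same array:

```python
import numpy as np
import pandas as pd

array_with_NaN = np.array([0, np.nan, 2, 3])

# NaN-aware NumPy functions treat nan as if it were absent.
total = np.nansum(array_with_NaN)
mean = np.nanmean(array_with_NaN)  # averages the three non-missing values

# pandas reductions skip NaN by default (skipna=True).
series_total = pd.Series(array_with_NaN).sum()

print(total, mean, series_total)
```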


In [55]:
imdb_df_indexed_num_gross.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, The Shawshank Redemption to The 39 Steps
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Released_Year  1000 non-null   int64  
 2   Certificate    899 non-null    object 
 3   Runtime        1000 non-null   object 
 4   Genre          1000 non-null   object 
 5   IMDB_Rating    1000 non-null   float64
 6   Overview       1000 non-null   object 
 7   Meta_score     843 non-null    float64
 8   Director       1000 non-null   object 
 9   Star1          1000 non-null   object 
 10  Star2          1000 non-null   object 
 11  Star3          1000 non-null   object 
 12  Star4          1000 non-null   object 
 13  No_of_Votes    1000 non-null   int64  
 14  Gross          831 non-null    float64
dtypes: float64(3), int64(2), object(10)
memory usage: 157.3+ KB


In [56]:
imdb_df_indexed_num_gross.isnull()

Series_Title,Poster_Link,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
The Shawshank Redemption,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
The Godfather,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
The Dark Knight,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
The Godfather: Part II,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
12 Angry Men,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Breakfast at Tiffany's,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
Giant,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
From Here to Eternity,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Lifeboat,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True


In [57]:
imdb_df_indexed_num_gross.isnull().sum()  # counts the missing values in each column

Poster_Link        0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64
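Counts are easier to judge as proportions. Because `True` counts as 1, taking the mean of the boolean mask gives the share of missing values per column. A sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Toy data: half of the Gross values are missing (hypothetical values).
df = pd.DataFrame(
    {
        "Gross": [1.0, np.nan, np.nan, 4.0],
        "Runtime": ["1 min", "2 min", "3 min", "4 min"],
    }
)

# Mean of a boolean mask = fraction of True values = share of missing entries.
missing_share = df.isnull().mean()
print(missing_share)
```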

Drop the null values:

In [None]:
imdb_df_indexed_num_gross.dropna()  # returns a copy without the rows that have any missing value

In [None]:
imdb_df_indexed_num_gross.dropna(axis=1)  # returns a copy without the columns that have any missing value

Doing this, however, loses a lot of data. If we drop rows, we lose more than 200 movies. If we drop columns, we lose useful information such as Gross. We only want to drop a row or column when it carries too little information to be useful.

In [None]:
imdb_df_indexed_num_gross.dropna(axis='rows', thresh=13)  # thresh is the minimum number of non-null values a row needs to be kept (13 of the 15 columns here)

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
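The effect of `thresh` is easier to see on a few rows. The sketch below uses a made-up three-column DataFrame and also shows `subset`, which restricts the check to chosen columns. All titles and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data: Movie B is entirely missing, Movie C lacks only Gross.
df = pd.DataFrame(
    {
        "Certificate": ["A", None, "U"],
        "Meta_score": [80.0, np.nan, 75.0],
        "Gross": [100.0, np.nan, np.nan],
    },
    index=["Movie A", "Movie B", "Movie C"],
)

# Keep rows with at least 2 non-null values.
at_least_two = df.dropna(thresh=2)

# Keep rows where Gross specifically is present.
has_gross = df.dropna(subset=["Gross"])

print(at_least_two)
print(has_gross)
```

Both calls return new DataFrames and leave `df` unchanged, which is also why the `dropna` cells in this section do not alter `imdb_df_indexed_num_gross`.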

In [69]:
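imdb_df_indexed_num_gross.dropna(axis='columns', how='all')  # drops a column only if all of its values are missing
# No column here is entirely null (see isnull().sum() above), so how='all' keeps all 15 columns;
# the 12-column output below matches the default how='any' instead.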

Series_Title,Poster_Link,Released_Year,Runtime,Genre,IMDB_Rating,Overview,Director,Star1,Star2,Star3,Star4,No_of_Votes
The Shawshank Redemption,https://m.media-amazon.com/images/M/MV5BMDFkYT...,1994,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110
The Godfather,https://m.media-amazon.com/images/M/MV5BM2MyNj...,1972,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367
The Dark Knight,https://m.media-amazon.com/images/M/MV5BMTMxNT...,2008,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232
The Godfather: Part II,https://m.media-amazon.com/images/M/MV5BMWMwMG...,1974,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952
12 Angry Men,https://m.media-amazon.com/images/M/MV5BMWU4N2...,1957,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845
...,...,...,...,...,...,...,...,...,...,...,...,...
Breakfast at Tiffany's,https://m.media-amazon.com/images/M/MV5BNGEwMT...,1961,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544
Giant,https://m.media-amazon.com/images/M/MV5BODk3Yj...,1956,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075
From Here to Eternity,https://m.media-amazon.com/images/M/MV5BM2U3Yz...,1953,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374
Lifeboat,https://m.media-amazon.com/images/M/MV5BZTBmMj...,1944,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471
