<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises exploring movie rating data from IMDb.

In these exercises, you will conduct basic exploratory data analysis on IMDb's movie data to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [61]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


#### Check the number of rows and columns.

In [3]:
len(movies)

979

In [5]:
len(movies.columns)

6
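
Both numbers are also available in a single call through the `shape` attribute. A minimal sketch, assuming a small stand-in DataFrame rather than the full file (`toy` is a hypothetical name; its three rows are copied from the `head()` output above):

```python
import pandas as pd

# Stand-in for `movies`: three rows copied from the head() output above.
toy = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.1],
    "title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II"],
    "duration": [142, 175, 200],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```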

#### Check the data type of each column.

In [7]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 979 entries, 0 to 978
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   star_rating     979 non-null    float64
 1   title           979 non-null    object 
 2   content_rating  976 non-null    object 
 3   genre           979 non-null    object 
 4   duration        979 non-null    int64  
 5   actors_list     979 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 46.0+ KB


#### Calculate the average movie duration.

In [8]:
movies['duration'].mean()

120.97957099080695

#### Sort the DataFrame by duration to find the shortest and longest movies.

In [9]:
movies.sort_values('duration')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
389,8.0,Freaks,UNRATED,Drama,64,"[u'Wallace Ford', u'Leila Hyams', u'Olga Bacla..."
338,8.0,Battleship Potemkin,UNRATED,History,66,"[u'Aleksandr Antonov', u'Vladimir Barsky', u'G..."
258,8.1,The Cabinet of Dr. Caligari,UNRATED,Crime,67,"[u'Werner Krauss', u'Conrad Veidt', u'Friedric..."
293,8.1,Duck Soup,PASSED,Comedy,68,"[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']"
88,8.4,The Kid,NOT RATED,Comedy,68,"[u'Charles Chaplin', u'Edna Purviance', u'Jack..."
...,...,...,...,...,...,...
445,7.9,The Ten Commandments,APPROVED,Adventure,220,"[u'Charlton Heston', u'Yul Brynner', u'Anne Ba..."
142,8.3,Lagaan: Once Upon a Time in India,PG,Adventure,224,"[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."


#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [17]:
import plotly.express as px
px.histogram(movies, x='duration', nbins = 20)

#### Use a box plot to display that same data.

In [80]:
duration_counts = movies['duration'].value_counts().reset_index()
# A box plot needs the raw durations, not their value counts.
px.box(movies, x='duration')

In [79]:
duration_counts

Unnamed: 0,index,duration
0,112,23
1,113,22
2,102,20
3,101,20
4,129,19
...,...,...
128,180,1
129,177,1
130,168,1
131,166,1
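
A box plot is built from the quartiles of the raw values rather than from their frequencies. As a rough sketch, the numbers behind the box can be computed with `Series.quantile`. The `durations` values below are made-up sample data, not taken from the file:

```python
import pandas as pd

# Hypothetical durations in minutes (illustrative only).
durations = pd.Series([64, 100, 120, 140, 238], name="duration")

# The box spans the 25th to 75th percentiles; the line inside it is the median.
quartiles = durations.quantile([0.25, 0.5, 0.75])
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print("IQR:", iqr)
```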


## Intermediate level

#### Count how many movies have each of the content ratings.

In [75]:
rating_counts = movies['content_rating'].value_counts().reset_index()

In [76]:
rating_counts

Unnamed: 0,index,content_rating
0,R,460
1,PG-13,189
2,PG,123
3,NOT RATED,65
4,APPROVED,47
5,UNRATED,38
6,G,32
7,NC-17,7
8,PASSED,7
9,X,4


#### Use a visualization to display that same data, including a title and x and y labels.

In [51]:
px.bar(rating_counts, x = 'index', y = 'content_rating', title = "IMDb Movie Ratings Frequency", 
      labels={"index": "Movie Rating",
              'content_rating': "Frequency"})

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [62]:
import numpy as np
conditions = [
    movies['content_rating'] == 'NOT RATED', 
    movies['content_rating'] == 'APPROVED', 
    movies['content_rating'] == 'PASSED', 
    movies['content_rating'] == 'GP'
]
results   = [
    'UNRATED', 
    'UNRATED', 
    'UNRATED', 
    'UNRATED'
] 

movies['content_rating_new'] = np.select(conditions, results, movies['content_rating'])

In [63]:
movies['content_rating_new']

0          R
1          R
2          R
3      PG-13
4          R
       ...  
974       PG
975       PG
976    PG-13
977       PG
978        R
Name: content_rating_new, Length: 979, dtype: object

#### Convert the following content ratings to "NC-17": X, TV-MA.

In [64]:
import numpy as np
conditions = [
    movies['content_rating_new'] == 'X', 
    movies['content_rating_new'] == 'TV-MA'
]
results   = [
    'NC-17', 
    'NC-17'
] 

movies['content_rating_final'] = np.select(conditions, results, movies['content_rating_new'])

In [66]:
movies.groupby('content_rating_final').size().reset_index()

Unnamed: 0,content_rating_final,0
0,G,32
1,NC-17,12
2,PG,123
3,PG-13,189
4,R,460
5,UNRATED,160
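
The two recoding steps above could also be written as a single `Series.replace` call with a mapping. This is only a sketch of an alternative, not the approach used in this notebook. `ratings` and `rating_map` are hypothetical names, and the sample values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of content ratings, including one missing value.
ratings = pd.Series(["R", "NOT RATED", "APPROVED", "PASSED", "GP", "X", "TV-MA", np.nan])

# Old label -> new label, covering both recoding steps.
rating_map = {
    "NOT RATED": "UNRATED",
    "APPROVED": "UNRATED",
    "PASSED": "UNRATED",
    "GP": "UNRATED",
    "X": "NC-17",
    "TV-MA": "NC-17",
}

# Values not listed in the mapping (including missing ones) are left untouched.
cleaned = ratings.replace(rating_map)
print(cleaned.value_counts(dropna=False))
```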


#### Count the number of missing values in each column.

In [72]:
movies.isnull().sum()

star_rating             0
title                   0
content_rating          3
genre                   0
duration                0
actors_list             0
content_rating_new      3
content_rating_final    3
dtype: int64

#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [85]:
movies[movies.isnull().any(axis=1)]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,content_rating_new,content_rating_final
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin...",,
649,7.7,Where Eagles Dare,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ...",,
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']",,


In [86]:
conditions = [
    movies['title'] == 'Butch Cassidy and the Sundance Kid', 
    movies['title'] == 'Where Eagles Dare', 
    movies['title'] == 'True Grit'
]
results   = [
    'PG', 
    'PG', 
    'PG-13'
] 

movies['content_rating_final_final'] = np.select(conditions, results, movies['content_rating_final'])

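# Note: 'True Grit' appears twice in the data. Matching on title also touches the
# other entry (index 662), which is already PG-13, so the result is unchanged here.
# I wanted to use something like fillna but wasn't sure how to apply it here.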
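
For reference, `fillna` accepts a Series (or dict) keyed by index label, which allows a different fill value per row. A minimal sketch, assuming the row labels shown in the output above. `toy` and `fill_values` are hypothetical names, and the fill values simply mirror the choices made in the cell above:

```python
import numpy as np
import pandas as pd

# Stand-in for `movies`: one rated row plus the three rows with a missing rating.
toy = pd.DataFrame(
    {
        "title": [
            "The Shawshank Redemption",
            "Butch Cassidy and the Sundance Kid",
            "Where Eagles Dare",
            "True Grit",
        ],
        "content_rating": ["R", np.nan, np.nan, np.nan],
    },
    index=[0, 187, 649, 936],
)

# Fill values keyed by row label, so the second 'True Grit' entry is never involved.
fill_values = pd.Series({187: "PG", 649: "PG", 936: "PG-13"})

# fillna aligns on the index and only fills positions that are missing.
toy["content_rating"] = toy["content_rating"].fillna(fill_values)
print(toy)
```

Keying on the index label rather than the title avoids accidental matches when two movies share a title.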

#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [93]:
two_hour_plus_movies = movies[movies['duration'] >= 120]

two_hour_plus_movies['star_rating'].mean()

7.948898678414082

In [94]:
less_than_two_hour_movies = movies[movies['duration'] < 120]

less_than_two_hour_movies['star_rating'].mean()

7.838666666666657
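
The same comparison can be made in one step by grouping on a boolean condition. A minimal sketch with made-up rows (`toy` and `is_long` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows (illustrative only).
toy = pd.DataFrame({
    "duration": [142, 175, 96, 110],
    "star_rating": [9.3, 9.2, 8.0, 8.2],
})

# True for movies of two hours or longer, False otherwise.
is_long = (toy["duration"] >= 120).rename("two_hours_or_longer")

# Grouping by the boolean Series yields one mean per group.
means = toy.groupby(is_long)["star_rating"].mean()
print(means)
```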

#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [99]:
px.scatter(movies, x = 'duration', y = 'star_rating',
           trendline='ols', 
           title = "Relationship Between Duration and Star Rating")

#### Calculate the average duration for each genre.

In [100]:
movies.groupby('genre')['duration'].mean()

genre
Action       126.485294
Adventure    134.840000
Animation     96.596774
Biography    131.844156
Comedy       107.602564
Crime        122.298387
Drama        126.539568
Family       107.500000
Fantasy      112.000000
Film-Noir     97.333333
History       66.000000
Horror       102.517241
Mystery      115.625000
Sci-Fi       109.000000
Thriller     114.200000
Western      136.666667
Name: duration, dtype: float64

## Advanced level

#### Visualize the relationship between content rating and duration.

In [102]:
movies.content_rating_final_final.unique()

array(['R', 'PG-13', 'UNRATED', 'PG', 'G', 'NC-17'], dtype=object)

In [103]:
conditions = [
    movies['content_rating_final_final'] == 'UNRATED', 
    movies['content_rating_final_final'] == 'G', 
    movies['content_rating_final_final'] == 'PG', 
    movies['content_rating_final_final'] == 'PG-13', 
    movies['content_rating_final_final'] == 'R', 
    movies['content_rating_final_final'] == 'NC-17'
]
results   = [
    1, 
    2,
    3, 
    4,
    5, 
    6] 

movies['content_rating_levels'] = np.select(conditions, results, 999)

In [106]:
movies[['content_rating_levels', 'content_rating_final_final']]

Unnamed: 0,content_rating_levels,content_rating_final_final
0,5,R
1,5,R
2,5,R
3,4,PG-13
4,5,R
...,...,...
974,3,PG
975,3,PG
976,4,PG-13
977,3,PG


In [109]:
px.scatter(movies, x='content_rating_levels', y='duration', trendline='ols')
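
Because content rating is categorical, the numeric codes above impose an ordering that is a judgment call, so a trendline fitted to them should be read with care. A per-category summary is one possible complement. The sketch below assumes a small stand-in DataFrame (`toy`, a hypothetical name) whose five rows are taken from outputs shown earlier in this notebook:

```python
import pandas as pd

# Stand-in for `movies`: five rows copied from earlier outputs.
toy = pd.DataFrame({
    "title": ["The Shawshank Redemption", "Pulp Fiction", "The Dark Knight",
              "Les Miserables", "Gone with the Wind"],
    "content_rating": ["R", "R", "PG-13", "PG-13", "G"],
    "duration": [142, 154, 152, 158, 238],
})

# One row per rating, with the group size and typical duration.
summary = toy.groupby("content_rating")["duration"].agg(["count", "median", "mean"])
print(summary)
```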

#### Determine the top rated movie (by star rating) for each genre.

In [118]:
top_movies_by_genre = movies.groupby('genre')['star_rating'].idxmax().tolist()

In [119]:
top_movies_by_genre

[3, 7, 30, 8, 25, 0, 5, 468, 638, 105, 338, 39, 38, 145, 350, 6]

In [124]:
# idxmax returns index labels, so select with .loc rather than .iloc.
movies.loc[top_movies_by_genre, ['title', 'genre', 'star_rating']]

Unnamed: 0,title,genre,star_rating
3,The Dark Knight,Action,9.0
7,The Lord of the Rings: The Return of the King,Adventure,8.9
30,Spirited Away,Animation,8.6
8,Schindler's List,Biography,8.9
25,Life Is Beautiful,Comedy,8.6
0,The Shawshank Redemption,Crime,9.3
5,12 Angry Men,Drama,8.9
468,E.T. the Extra-Terrestrial,Family,7.9
638,The City of Lost Children,Fantasy,7.7
105,The Third Man,Film-Noir,8.3
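
`idxmax` returns index labels, so label-based selection with `.loc` is the matching accessor. Position-based selection only returns the same rows while the index is the default 0 to n-1 range. A small sketch with a deliberately non-default index (the `toy` frame and its labels 10, 20, 30 are hypothetical):

```python
import pandas as pd

# Rows copied from the head() output, but with a non-default index.
toy = pd.DataFrame(
    {
        "title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
        "genre": ["Crime", "Crime", "Action"],
        "star_rating": [9.3, 9.2, 9.0],
    },
    index=[10, 20, 30],
)

# Label of the highest-rated row within each genre.
top_idx = toy.groupby("genre")["star_rating"].idxmax()

# Label-based lookup works for any index.
top = toy.loc[top_idx, ["title", "genre", "star_rating"]]
print(top)
```

Note that `idxmax` keeps only the first row when several movies tie for the top rating in a genre.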


#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [152]:
# Flag every title that occurs more than once (not only exactly twice).
possible_dups = movies['title'].value_counts() > 1

In [153]:
possible_dups

Les Miserables                      True
Dracula                             True
True Grit                           True
The Girl with the Dragon Tattoo     True
Whale Rider                        False
                                   ...  
Being John Malkovich               False
Gone Girl                          False
Cowboy Bebop: The Movie            False
Leaving Las Vegas                  False
Fruitvale Station                  False
Name: title, Length: 975, dtype: bool

In [154]:
# Keep only the titles flagged True instead of relying on a hard-coded slice.
possible_dups_list = possible_dups[possible_dups].index.tolist()

In [156]:
movies[movies['title'].isin(possible_dups_list)]

# They are not duplicates: each pair differs in rating, duration, or cast,
# so these are different movies that happen to share a title.

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,content_rating_new,content_rating_final,content_rating_final_final,content_rating_levels
466,7.9,The Girl with the Dragon Tattoo,R,Crime,158,"[u'Daniel Craig', u'Rooney Mara', u'Christophe...",R,R,R,5
482,7.8,The Girl with the Dragon Tattoo,R,Crime,152,"[u'Michael Nyqvist', u'Noomi Rapace', u'Ewa Fr...",R,R,R,5
662,7.7,True Grit,PG-13,Adventure,110,"[u'Jeff Bridges', u'Matt Damon', u'Hailee Stei...",PG-13,PG-13,PG-13,4
678,7.7,Les Miserables,PG-13,Drama,158,"[u'Hugh Jackman', u'Russell Crowe', u'Anne Hat...",PG-13,PG-13,PG-13,4
703,7.6,Dracula,APPROVED,Horror,85,"[u'Bela Lugosi', u'Helen Chandler', u'David Ma...",UNRATED,UNRATED,UNRATED,1
905,7.5,Dracula,R,Horror,128,"[u'Gary Oldman', u'Winona Ryder', u'Anthony Ho...",R,R,R,5
924,7.5,Les Miserables,PG-13,Crime,134,"[u'Liam Neeson', u'Geoffrey Rush', u'Uma Thurm...",PG-13,PG-13,PG-13,4
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']",,,PG-13,4
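
A shorter route to the same check is `duplicated(keep=False)`, which marks every row that shares a value with another row. A minimal sketch, assuming a stand-in DataFrame (`toy`, a hypothetical name) with the two Dracula rows from the output above plus one unrelated row:

```python
import pandas as pd

# Stand-in for `movies`.
toy = pd.DataFrame({
    "title": ["Dracula", "Dracula", "Pulp Fiction"],
    "content_rating": ["APPROVED", "R", "R"],
    "duration": [85, 128, 154],
})

# Rows whose title appears more than once.
same_title = toy[toy["title"].duplicated(keep=False)]

# Rows that are identical in every column, i.e. true duplicates.
exact_dups = toy[toy.duplicated(keep=False)]

print(same_title)
print("Exact duplicate rows:", len(exact_dups))
```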


#### Calculate the average star rating for each genre, but only include genres with at least 10 movies


#### Option 1: manually create a list of relevant genres, then filter using that list

In [None]:
# Answer:

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [161]:
genres_greater_than_10 = movies['genre'].value_counts() >= 10

In [165]:
# Keep only the genres flagged True; reset_index() alone would return every genre.
genre_list = genres_greater_than_10[genres_greater_than_10].index.tolist()

In [168]:
movies[movies['genre'].isin(genre_list)].groupby('genre')['star_rating'].mean()

genre
Action       7.884559
Adventure    7.933333
Animation    7.914516
Biography    7.862338
Comedy       7.822436
Crime        7.916935
Drama        7.902518
Horror       7.806897
Mystery      7.975000
Name: star_rating, dtype: float64

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [None]:
# Answer:
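
One possible answer, shown only as a sketch on a small stand-in DataFrame: compute the mean for every genre, then index that result with a boolean Series built from `value_counts`. The names `toy`, `min_movies`, `mean_by_genre`, and `enough` are hypothetical, and the threshold is lowered to 2 so the toy data has something to filter. The exercise itself uses 10.

```python
import pandas as pd

# Stand-in for `movies`: four rows copied from the head() output.
toy = pd.DataFrame({
    "genre": ["Crime", "Crime", "Crime", "Action"],
    "star_rating": [9.3, 9.2, 9.1, 9.0],
})

min_movies = 2  # assumption for the toy data; the exercise asks for 10

# Mean star rating for every genre.
mean_by_genre = toy.groupby("genre")["star_rating"].mean()

# Boolean Series indexed by genre: True where the genre has enough movies.
enough = toy["genre"].value_counts() >= min_movies

# Boolean indexing aligns the two Series on their genre labels.
result = mean_by_genre[enough]
print(result)
```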

#### Option 4: aggregate by count and mean, then filter using the count

In [None]:
# Answer:
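
One possible answer, shown only as a sketch on a small stand-in DataFrame: aggregate the count and the mean together, then filter on the count column. The names `toy`, `min_movies`, and `genre_stats` are hypothetical, and the threshold is lowered to 2 to suit the toy data. The exercise itself uses 10.

```python
import pandas as pd

# Stand-in for `movies`: four rows copied from the head() output.
toy = pd.DataFrame({
    "genre": ["Crime", "Crime", "Crime", "Action"],
    "star_rating": [9.3, 9.2, 9.1, 9.0],
})

min_movies = 2  # assumption for the toy data; the exercise asks for 10

# One row per genre with the number of movies and their mean rating.
genre_stats = toy.groupby("genre")["star_rating"].agg(["count", "mean"])

# Keep only the genres that meet the threshold.
result = genre_stats[genre_stats["count"] >= min_movies]
print(result)
```

Keeping the count next to the mean makes it easy to see how much data each average rests on.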

## Bonus

#### Figure out something "interesting" using the actors data!
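
As a possible starting point, note that `actors_list` is stored as text that looks like a Python list, so it has to be parsed before actors can be counted. The sketch below assumes `ast.literal_eval` can read every value in the column, which holds for the two complete rows copied from the outputs above. `toy`, `actors`, and `actor_counts` are hypothetical names:

```python
import ast

import pandas as pd

# Stand-in for `movies`: two rows whose actor lists are shown in full above.
toy = pd.DataFrame({
    "title": ["The Godfather", "Duck Soup"],
    "actors_list": [
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']",
        "[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']",
    ],
})

# Turn each string into a real list; the u'' prefix is valid in Python 3.
actors = toy["actors_list"].apply(ast.literal_eval)

# One row per (movie, actor) pair, then count the appearances of each actor.
actor_counts = actors.explode().value_counts()

print(actor_counts)
print("Distinct actors:", actor_counts.size)
```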