
# Fun With Movies

The Internet Movie Database (IMDb) has lots of great information we can use to practice pandas.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import re

### First, upload imdb.txt to Colab (use the folder icon on the left)

## Read in dataframe from .csv or .txt file

In [None]:
movies = pd.read_csv('imdb.txt')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


## What does the dataframe look like?

`shape` gives the number of rows and columns.

In [None]:
# check the number of rows and columns
print(movies.shape)
# check the data type of each column
movies.dtypes

(979, 6)


Unnamed: 0,0
star_rating,float64
title,object
content_rating,object
genre,object
duration,int64
actors_list,object


Now we can use the `mean()` method on any of the numeric columns. Calculate and print the average `duration` and `star_rating`. I've provided some sample code.

In [None]:
avg_duration = movies['duration'].mean()
print(avg_duration)

avg_rating = movies['star_rating'].mean()
print(avg_rating)

120.97957099080695
7.889785495403474
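If you would rather not name each numeric column by hand, one option is to let pandas pick them out by dtype. This is a minimal sketch on a toy frame (`toy` is a made-up name; its three rows are copied from the table at the top), so the averages differ from the full dataset.

```python
import pandas as pd

# Toy frame for illustration only; rows copied from the first lines of the table above.
toy = pd.DataFrame({
    'star_rating': [9.3, 9.2, 9.1],
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II'],
    'duration': [142, 175, 200],
})

# Keep only the numeric columns (float64 and int64 here); 'title' is dropped.
numeric = toy.select_dtypes(include='number')
print(list(numeric.columns))

# mean() on a DataFrame returns one average per column.
means = numeric.mean()
print(means)
```

The result is a Series indexed by column name, so `means['duration']` gives the same value as `toy['duration'].mean()`.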


We can select parts of the data in different ways. For example, we can look only at movies with a longer-than-average duration.

Notice that the test `movies['duration'] > avg_duration` goes inside the brackets; it keeps only the rows where the test is true, which here means just the long ones.

In [None]:
long_movies = movies[movies['duration'] > avg_duration]

#how do these movies rate?
long_movies['star_rating'].mean()


7.953669724770642
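To see what the test inside the brackets actually produces, here is a minimal sketch on a tiny hand-made frame. The name `toy` is made up for illustration, and its rows are copied from the table at the top rather than loaded from `imdb.txt`.

```python
import pandas as pd

# Toy frame for illustration only; values copied from the first rows of the table above.
toy = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II'],
    'star_rating': [9.3, 9.2, 9.1],
    'duration': [142, 175, 200],
})

# The comparison yields one True/False value per row (a boolean Series).
mask = toy['duration'] > toy['duration'].mean()
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
long_toy = toy[mask]
print(long_toy['title'].tolist())
```

The mask and the frame share the same index, which is why pandas can line them up row by row.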

## TASK

Compare the mean rating of longer-than-average movies to the mean rating of shorter-than-average movies. In a comment, discuss whether longer movies get higher ratings.

In [None]:
long_movies = movies[movies['duration'] > avg_duration]
print(long_movies['star_rating'].mean())
short_movies = movies[movies['duration'] < avg_duration]
print(short_movies['star_rating'].mean())
# Longer movies get a slightly higher average rating than shorter movies (about 7.95 vs. 7.84)

7.953669724770642
7.838489871086555
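The same long-versus-short comparison can also be written in one step by labelling each row and grouping on the label. This is a sketch under assumptions: `toy` and `label` are made-up names, and the five rows are copied from the table at the top, so the numbers differ from the output above.

```python
import pandas as pd

# Toy frame for illustration only; rows copied from the table shown at the top.
toy = pd.DataFrame({
    'star_rating': [9.3, 9.2, 9.1, 9.0, 8.9],
    'duration': [142, 175, 200, 152, 154],
})

# Label each row as 'long' or 'short' relative to the mean duration.
is_long = toy['duration'] > toy['duration'].mean()
label = is_long.map({True: 'long', False: 'short'})

# Group the ratings by that label and average each group.
means = toy.groupby(label)['star_rating'].mean()
print(means)
```

Grouping avoids building two separate DataFrames, and it makes sure every row lands in exactly one of the two groups.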


## TASK

Use regex to select movies that are sequels.  For simplicity, we will just look at movies with `Part` and/or `II` and/or `III` in the title. I'll get you started.

The last line of the loop shows how to capture the rating of each sequel. Add these ratings to a list called `sequel_ratings`.

In [None]:
count = 0
total = 0  # named total so the built-in sum() is not shadowed
title_series = movies['title']
for title in title_series:
    if re.search(r'Part |II|III', title):   # this regex isn't quite right. Fix it!
        print(title)
        idx = movies[movies['title'] == title].index[0]  # index of the sequel (first match if a title repeats)
        count += 1
        total += movies['star_rating'][idx]
        print("rating is", movies['star_rating'][idx])  # add each rating to a list
print("Average:", total/count)
# On average, these sequels rate slightly lower than the overall average (about 7.81 vs. 7.89)

The Godfather: Part II
rating is 9.1
Harry Potter and the Deathly Hallows: Part 2
rating is 8.1
Evil Dead II
rating is 7.8
Back to the Future Part II
rating is 7.8
Star Trek II: The Wrath of Khan
rating is 7.7
Harry Potter and the Deathly Hallows: Part 1
rating is 7.7
Star Wars: Episode III - Revenge of the Sith
rating is 7.7
The Godfather: Part III
rating is 7.6
Menace II Society
rating is 7.5
Clerks II
rating is 7.5
Back to the Future Part III
rating is 7.4
Average: 7.8090909090909095


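Clearly, there is a problem here. A pattern that matches a bare `Part` also picks up `The Party`, which should not be included, and `Menace II Society` in the output above is not a sequel either. Modify the regex search to exclude names like Party, but keep anything with Part. Also grab anything with a II or III.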

Once you see only sequels, and you are getting `Evil Dead II` and other titles with II/III in them, you are ready to calculate the average rating. Using your list `sequel_ratings`, compute the average rating. How does it compare to the average rating for the entire dataset?
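One possible way to tighten the pattern is a word boundary (`\b`), which makes `Part` match only as a whole word. The sketch below is one option, not the only answer. The `sample` list is hand-picked for illustration: the first four ratings come from the output above, and `The Party` carries a placeholder rating because it should never be selected.

```python
import re

# (title, rating) pairs; the first four ratings come from the output above.
# The rating for 'The Party' is a placeholder, not real data.
sample = [
    ('The Godfather: Part II', 9.1),
    ('Evil Dead II', 7.8),
    ('Menace II Society', 7.5),
    ('Back to the Future Part III', 7.4),
    ('The Party', 0.0),
]

# \b is a word boundary: 'Part' must stand alone, and II/III must be a whole token.
pattern = re.compile(r'\bPart\b|\bIII?\b')

sequel_ratings = []
for title, rating in sample:
    if pattern.search(title):
        sequel_ratings.append(rating)

print(sequel_ratings)
print("Average:", sum(sequel_ratings) / len(sequel_ratings))
```

Note that `Menace II Society` still matches, because its `II` is a standalone token. A word boundary cannot tell a sequel numeral from a stylized title, so that case needs separate handling.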

## TASK - if you have time

Experiment with improving the regex selection to include other ways to indicate that a movie is a sequel. Can you collect other sequels without mistakenly capturing movies that are not sequels?
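As a starting point for experimenting, here is a sketch that checks a whole column of titles at once with `Series.str.contains` instead of a loop. The pattern is only one guess at what marks a sequel. The first three titles and `The Party` appear earlier in this notebook; `Toy Story 3` and `12 Angry Men` are added purely for illustration and are not taken from the output above.

```python
import pandas as pd

titles = pd.Series([
    'Star Wars: Episode III - Revenge of the Sith',
    'Harry Potter and the Deathly Hallows: Part 2',
    'Clerks II',
    'Toy Story 3',      # illustrative title, not from the output above
    '12 Angry Men',     # illustrative non-sequel that contains a number
    'The Party',
])

# 'Part' as a whole word, a standalone II or III, or a single digit 2-9 at the end.
pattern = r'\bPart\b|\bI{2,3}\b|\s[2-9]$'

# str.contains applies the regex to every title and returns a boolean Series.
is_sequel = titles.str.contains(pattern, regex=True)
print(titles[is_sequel].tolist())
```

On the real data, the same mask could be used as `movies.loc[is_sequel, 'star_rating']` to pull out the ratings directly, assuming `is_sequel` was built from `movies['title']`.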


In [None]:
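count = 0
total = 0  # named total so the built-in sum() is not shadowed
for title in title_series:
    # '^[^M]*II' only matches II when no 'M' comes before it, which drops Menace II Society
    if re.search(r'Part |^[^M]*II|III', title):   # SAMPLE CODE - please improve
        print(title)
        idx = movies[movies['title'] == title].index[0]  # index of the sequel (first match if a title repeats)
        count += 1
        total += movies['star_rating'][idx]
        print("rating is", movies['star_rating'][idx])  # add each rating to a list
print("Average:", total/count)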


The Godfather: Part II
rating is 9.1
Harry Potter and the Deathly Hallows: Part 2
rating is 8.1
Evil Dead II
rating is 7.8
Back to the Future Part II
rating is 7.8
Star Trek II: The Wrath of Khan
rating is 7.7
Harry Potter and the Deathly Hallows: Part 1
rating is 7.7
Star Wars: Episode III - Revenge of the Sith
rating is 7.7
The Godfather: Part III
rating is 7.6
Clerks II
rating is 7.5
Back to the Future Part III
rating is 7.4
Average: 7.840000000000001


In [None]:
print("R movies average rating:", movies[movies['content_rating'] == 'R']['star_rating'].mean())
print("PG-13 movies average rating:", movies[movies['content_rating'] == 'PG-13']['star_rating'].mean())
print("PG movies average rating:", movies[movies['content_rating'] == 'PG']['star_rating'].mean())
print("G movies average rating:", movies[movies['content_rating'] == 'G']['star_rating'].mean())

R movies average rating: 7.854782608695651
PG-13 movies average rating: 7.828571428571428
PG movies average rating: 7.879674796747967
G movies average rating: 7.9906250000000005
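The four print statements above repeat the same filter-then-average step for each content rating. A more compact sketch uses `groupby`, shown here on a toy frame (`toy` is a made-up name; its rows are the five displayed at the top), so the numbers will not match the output above.

```python
import pandas as pd

# Toy frame for illustration only; rows copied from the table shown at the top.
toy = pd.DataFrame({
    'content_rating': ['R', 'R', 'R', 'PG-13', 'R'],
    'star_rating': [9.3, 9.2, 9.1, 9.0, 8.9],
})

# One average star_rating per content_rating value, without naming each value.
by_rating = toy.groupby('content_rating')['star_rating'].mean()
print(by_rating)
```

Because `groupby` discovers the categories itself, any content rating present in the data gets its own row in the result.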
