# Movie Classification Team 11

# Web Scraping and Data Preparation

### Team Members:
Andrew Lund, Nicholas Morgan, Amay Umradia, Charles Webb

**The purpose of this notebook is two-fold:**
1. To scrape movie data from both IMDB and TMDB
2. To clean, organize, and store the data in a way that supports future analysis

In [1]:
import time #needed for time.sleep in the scraping loops below
import tmdbsimple as tmdb
import requests
import pandas as pd
import numpy as np
from ast import literal_eval
import imdb

# Scraping TMDB Data

We used [The Movie DB's website](https://www.themoviedb.org/documentation/api) to scrape this information. This requires an API key, which is not included with this repository but can be requested through their website. One thing to note is that their API has a limit of 40 requests per 10 seconds, per IP address. To stay under this limit, we added a sleep call to our loop below.

Since we are doing text analysis, we are only interested in movies that have plots written in English.
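The delay that this limit implies is simple arithmetic: 10 seconds divided by 40 requests is 0.25 seconds per request. The sketch below makes that explicit and wraps it in a small helper that only sleeps for whatever part of the interval has not already elapsed. `min_interval` and `Throttle` are hypothetical names introduced for illustration; they are not part of the notebook or of any TMDB client.

```python
import time


def min_interval(max_requests, window_seconds):
    """Smallest gap between requests that stays within the limit."""
    return window_seconds / max_requests


class Throttle:
    """Block so that consecutive calls are at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# 40 requests per 10 seconds, as quoted above
delay = min_interval(40, 10)
throttle = Throttle(delay)
```

Calling `throttle.wait()` before each request would play the same role as a fixed `time.sleep(0.25)`, except that time already spent waiting on the HTTP response counts toward the interval.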

In [2]:
with open('key.txt','r') as f:
    key = f.read().strip() #strip whitespace such as a trailing newline so the URLs stay valid
payload = "{}"


movie_df = pd.DataFrame() #create empty dataframe to enable 'while loop' below

page=1
while movie_df.shape[0] < 1000:
    url = "https://api.themoviedb.org/3/movie/top_rated?api_key={0}&language=en-US&page={1}".format(key, str(page))
    response = requests.request("GET", url, data=payload).json()
    if page == 1: #initialize dataframe on first loop
        movie_df = pd.DataFrame(response['results'])
    else:
        movie_df = pd.concat([movie_df, pd.DataFrame(response['results'])]) #DataFrame.append was removed in pandas 2.0
    
    movie_df = movie_df[movie_df['original_language']=='en'] #remove non english movies
    time.sleep(0.25) #rate limit is 4 pages per second
    page+=1
    
movie_df.reset_index(inplace=True,drop=True) #reset index since we dropped non english rows

#drop irrelevant columns for this analysis
dropCols = ['adult','backdrop_path', 'original_language','original_title', 'poster_path','video']

movie_df.drop(dropCols,axis=1,inplace=True)
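The paging logic above can be sketched without pandas or a network connection, which also makes it easy to add a stop condition for when the API runs out of pages before the target is reached. This is a minimal sketch under stated assumptions: `collect_english` and `fake_fetch` are hypothetical helpers, and the response is assumed to carry a `total_pages` field next to `results`.

```python
def collect_english(fetch_page, target=1000):
    """Page through results until `target` English-language movies are kept.

    `fetch_page(page)` is a hypothetical callable that returns a dict with
    a 'results' list of movie dicts and (assumed) a 'total_pages' count.
    """
    movies = []
    page = 1
    while len(movies) < target:
        response = fetch_page(page)
        results = response.get('results', [])
        movies.extend(m for m in results if m.get('original_language') == 'en')
        # Stop on an empty page or the last page, even if the target was not
        # reached; a missing 'total_pages' is treated as the last page.
        if not results or page >= response.get('total_pages', page):
            break
        page += 1
    return movies


def fake_fetch(page):
    # Stand-in for the HTTP call: three pages, one English movie on each.
    return {'total_pages': 3,
            'results': [{'id': page * 10, 'original_language': 'en'},
                        {'id': page * 10 + 1, 'original_language': 'fr'}]}


movies = collect_english(fake_fetch, target=5)
```

With the stand-in data only three English movies exist, so the loop ends at the last page instead of requesting pages forever.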

**IMDB Mapping**

The IMDB Movie ID is not included with the data that was scraped above, so it requires a separate process to retrieve.

In [3]:
def tmdb_to_imdb(tmdb_id):
    time.sleep(0.25) #rate limit is 4 requests per second
    url = "https://api.themoviedb.org/3/movie/{0}/external_ids?api_key={1}&language=en-US".format(tmdb_id, key)
    response = requests.request("GET", url, data=payload).json()
    if 'imdb_id' in response:
        return response['imdb_id']
    else:
        return None


In [4]:
movie_df['imdb_id'] = movie_df['id'].apply(lambda x: tmdb_to_imdb(x))

In [5]:
movie_df.head()

| | genre_ids | id | overview | popularity | release_date | title | vote_average | vote_count | imdb_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [18, 80] | 278 | Framed in the 1940s for the double murder of h... | 34.346733 | 1994-09-23 | The Shawshank Redemption | 8.6 | 9913 | tt0111161 |
| 1 | [18, 80] | 238 | Spanning the years 1945 to 1955, a chronicle o... | 39.07922 | 1972-03-14 | The Godfather | 8.5 | 7479 | tt0068646 |
| 2 | [18, 36, 10752] | 424 | The true story of how businessman Oskar Schind... | 21.613388 | 1993-11-29 | Schindler's List | 8.4 | 5626 | tt0108052 |
| 3 | [18, 80] | 240 | In the continuing saga of the Corleone crime f... | 34.872781 | 1974-12-20 | The Godfather: Part II | 8.4 | 4331 | tt0071562 |
| 4 | [53, 80] | 680 | A burger-loving hit man, his philosophical par... | 39.663864 | 1994-09-10 | Pulp Fiction | 8.3 | 11076 | tt0110912 |


Now we can use the `imdb_id` to look up the IMDB attributes for each movie.

In [6]:
ia = imdb.IMDb()

In [7]:
godfather = ia.get_movie('0068646')
godfather

<Movie id:0068646[http] title:_The Godfather (1972)_>

In [8]:
godfather.keys()

['title',
 'year',
 'kind',
 'cast',
 'composers',
 'editorial department',
 'production managers ',
 'art department',
 'visual effects',
 'casting department',
 'costume departmen',
 'location management',
 'music department',
 'transportation department',
 'thanks',
 'genres',
 'runtimes',
 'countries',
 'country codes',
 'language codes',
 'color info',
 'aspect ratio',
 'sound mix',
 'certificates',
 'original air date',
 'rating',
 'votes',
 'cover url',
 'director',
 'writer',
 'producer',
 'cinematographer',
 'editor',
 'casting director',
 'production design',
 'art direction',
 'set decoration',
 'costume designer',
 'make up',
 'assistant director',
 'sound crew',
 'special effects companies',
 'stunt performer',
 'camera and electrical department',
 'miscellaneous crew',
 'plot outline',
 'languages',
 'akas',
 'top 250 rank',
 'plot',
 'synopsis',
 'canonical title',
 'long imdb title',
 'long imdb canonical title',
 'smart canonical title',
 'smart long imdb canonical tit

We are only interested in the IMDB plot description, but the cell below could be adjusted to pull information from any of the keys in the list above.

In [9]:
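#ia.get_movie takes the numeric part of the ID, so drop the leading 'tt'
all_imdb_data = movie_df['imdb_id'].apply(lambda x: ia.get_movie(x[2:]))

#keep the list of plots when the movie has one, otherwise store None
movie_df['imdb_plot'] = all_imdb_data.apply(lambda x: x['plot'] if 'plot' in x.keys() else None)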
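Because `tmdb_to_imdb` can return `None`, slicing with `x[2:]` would raise a `TypeError` for any movie without an IMDB ID. A small guard, sketched here with the hypothetical helper `imdb_numeric_id`, makes the conversion explicit and lets missing IDs pass through as `None` so those rows can be filtered out before the lookup.

```python
def imdb_numeric_id(imdb_id):
    """Return the digits of an ID such as 'tt0068646', or None if unusable."""
    if not imdb_id or not imdb_id.startswith('tt'):
        return None
    return imdb_id[2:]


godfather_id = imdb_numeric_id('tt0068646')  # the ID shown for The Godfather
missing_id = imdb_numeric_id(None)           # a movie with no IMDB mapping
```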

The cell below demonstrates that each IMDB plot is actually a list of several plot summaries. We therefore make one additional transformation and keep only the first summary in each list. This allows a fairer comparison with the TMDB plots, each of which is a single overview.

If our ultimate goal were to maximize accuracy, using every summary in the list might yield more accurate predictions. Testing this hypothesis is outside the scope of this project, but it is an area for future exploration.

In [11]:
movie_df.loc[0,'imdb_plot']

["Chronicles the experiences of a formerly successful banker as a prisoner in the gloomy jailhouse of Shawshank after being found guilty of a crime he did not commit. The film portrays the man's unique way of dealing with his new, torturous life; along the way he befriends a number of fellow prisoners, most notably a wise long-term inmate named Red.::J-S-Golden",
 "Story of a hot-shot American banker Andrew Dufresne who finds himself to be an inmate at the Shawshank prison for a crime he says he didn't commit, the murder of his wife and her lover. The movie revolves around Andy's take on this drastic transformation, his journey as an inmate in the prison during which he befriends Red, a fellow inmate as well as gains the respect of his friends.::Rajat Sindhu",
 'After the murder of his wife, hotshot banker Andrew Dufresne is sent to Shawshank Prison, where the usual unpleasantness occurs. Over the years, he retains hope and eventually gains the respect of his fellow inmates, especially

In [13]:
movie_df['imdb_plot'] = movie_df['imdb_plot'].apply(lambda x: x[0])
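In the sample output above, each summary ends with `::` followed by what appears to be the contributor's name, and movies without a plot were stored as `None`. The sketch below handles both cases. `first_plot` is a hypothetical helper, and the `::` suffix convention is an assumption based only on the rows shown here.

```python
def first_plot(plots):
    """Return the first summary without its '::contributor' suffix."""
    if not plots:  # covers both None and an empty list
        return None
    # Split on the last '::' only, in case the text itself contains one.
    return plots[0].rsplit('::', 1)[0].strip()


example = first_plot(["A banker is sent to Shawshank.::J-S-Golden",
                      "A second summary.::Rajat Sindhu"])
```

Whether to strip the contributor name is a judgment call: it is not part of the plot, but after punctuation removal it would otherwise remain in the text as extra tokens.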

In [14]:
movie_df.head()

| | genre_ids | id | overview | popularity | release_date | title | vote_average | vote_count | imdb_id | imdb_plot |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [18, 80] | 278 | Framed in the 1940s for the double murder of h... | 34.346733 | 1994-09-23 | The Shawshank Redemption | 8.6 | 9913 | tt0111161 | Chronicles the experiences of a formerly succe... |
| 1 | [18, 80] | 238 | Spanning the years 1945 to 1955, a chronicle o... | 39.07922 | 1972-03-14 | The Godfather | 8.5 | 7479 | tt0068646 | When the aging head of a famous crime family d... |
| 2 | [18, 36, 10752] | 424 | The true story of how businessman Oskar Schind... | 21.613388 | 1993-11-29 | Schindler's List | 8.4 | 5626 | tt0108052 | Oskar Schindler is a vainglorious and greedy G... |
| 3 | [18, 80] | 240 | In the continuing saga of the Corleone crime f... | 34.872781 | 1974-12-20 | The Godfather: Part II | 8.4 | 4331 | tt0071562 | The continuing saga of the Corleone crime fami... |
| 4 | [53, 80] | 680 | A burger-loving hit man, his philosophical par... | 39.663864 | 1994-09-10 | Pulp Fiction | 8.3 | 11076 | tt0110912 | Jules Winnfield (Samuel L. Jackson) and Vincen... |


# Genres

In [15]:
url = "https://api.themoviedb.org/3/genre/movie/list?api_key={0}&language=en-US".format(key)
response = requests.request("GET", url, data=payload).json()

#build lookup tables in both directions from the API's list of genres
id_to_genre = {g['id']: g['name'] for g in response['genres']}

genre_to_id = {g['name']: g['id'] for g in response['genres']}

id_to_genre

{12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Science Fiction',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie'}

**Store dictionaries as .json files**

In [16]:
import json

with open('data/id_to_genre.json', 'w') as fp:
    json.dump(id_to_genre, fp)
    
with open('data/genre_to_id.json', 'w') as fp:
    json.dump(genre_to_id, fp)
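One thing to keep in mind when loading `id_to_genre.json` later: JSON object keys are always strings, so the integer genre IDs come back as `'18'`, `'80'`, and so on. The short sketch below shows the round trip on two entries from the mapping and one way to restore the integer keys.

```python
import json

id_to_genre = {18: 'Drama', 80: 'Crime'}  # two entries from the mapping above

# Serializing and parsing again mimics writing the file and reading it back.
restored = json.loads(json.dumps(id_to_genre))

# Convert the string keys back to integers so lookups by genre ID still work.
restored_int = {int(k): v for k, v in restored.items()}
```

`genre_to_id` is not affected in the same way, because its keys are already strings and its integer values survive the round trip unchanged.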

Next, we will rename the columns for consistency.

In [17]:
movie_df.rename(columns={'id':'tmdb_id',
                        'genre_ids':'tmdb_genres',
                         'overview':'tmdb_plot'}, inplace=True)

Punctuation is not relevant to our analysis, so we will remove it in the next step.

In [18]:
import re
def remove_punctuation(string):
    string = re.sub(r'[^\w\s]',' ',string) #replace punctuation with a space (raw string avoids an invalid-escape warning)
    string = string.replace('\r', ' ') #\s matches carriage returns, so the pattern above leaves them in place
    string = string.replace('\n', ' ') #\s also matches newlines, so they need the same treatment
    
    return string

movie_df['tmdb_plot'] = movie_df['tmdb_plot'].apply(remove_punctuation)
movie_df['imdb_plot'] = movie_df['imdb_plot'].apply(remove_punctuation)
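Replacing each punctuation mark with a space can leave runs of consecutive spaces in the text. Most tokenizers ignore these, but if a tidy string is wanted, the whitespace can be collapsed in the same pass. `normalize_plot` is a hypothetical variant shown for illustration, not the function used in this notebook.

```python
import re


def normalize_plot(text):
    """Replace punctuation with spaces, then collapse all whitespace runs."""
    text = re.sub(r'[^\w\s]', ' ', text)
    # split() with no arguments splits on any whitespace, including \r and \n
    return ' '.join(text.split())


cleaned = normalize_plot("A burger-loving hit man,\r\nhis philosophical partner...")
```

Note that `\w` includes the underscore, so underscores survive both this variant and the original function.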

In [19]:
movie_df = movie_df[movie_df['tmdb_genres'].str.len() != 0] #remove empty genres

In [20]:
movie_df.shape

(999, 10)

We lost 1 observation as a result of dropping the empty genres.

The genres are currently stored as lists of genre IDs, and a movie can belong to several genres at once. As per [scikit-learn's documentation](http://scikit-learn.org/stable/modules/multiclass.html), this kind of multilabel problem is expressed with a binary indicator array: one row per movie and one column per genre, with a 1 wherever the movie has that genre. We will be using [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer) to make this conversion.

In [21]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb=MultiLabelBinarizer()

MultiLabelBinarizer converts the genre lists into a numpy array. Arrays stored in a csv are written as text and read back by pandas as plain strings, so we will instead save the numpy arrays as .npy files.

In [23]:
binary_tmdb = mlb.fit_transform(movie_df['tmdb_genres'])
np.save('data/binary_tmdb.npy',binary_tmdb)
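To make the binary indicator format concrete, here is a pure-Python sketch of the same transformation on three of the genre lists shown earlier. It mirrors the behavior described above (sorted classes, one column per class) but is only an illustration; `binarize` is a hypothetical function, not scikit-learn's implementation.

```python
def binarize(label_lists):
    """Return (classes, matrix) in the style of a binary indicator array."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1  # mark each genre the movie belongs to
        matrix.append(row)
    return classes, matrix


# Genre lists of The Shawshank Redemption, Schindler's List and Pulp Fiction
genres = [[18, 80], [18, 36, 10752], [53, 80]]
classes, matrix = binarize(genres)
```

Here `classes` plays the role of `mlb.classes_`: column `j` of the matrix always refers to `classes[j]`, which is why any list of genre names must be built in that same order.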

Much later in our process, we will require a `target_names` variable to be used with our classification report. This list of genres must be in the same order as the MultiLabelBinarizer classes, so we will create and store those lists here. To improve readability, we will convert each `genre_id` into the corresponding genre name.

In [25]:
target_names = {}
tmdb_target_names = []
for genre_id in mlb.classes_:
    tmdb_target_names.append(id_to_genre[genre_id])

    
target_names['tmdb'] = tmdb_target_names


with open('data/target_names.json', 'w') as fp:
    json.dump(target_names, fp)
    

Lastly, we will reorder the columns more logically and store the results as a csv.

In [27]:
#reorder columns
movie_df = movie_df[['tmdb_id', 'imdb_id', 'tmdb_genres', 'tmdb_plot',
       'imdb_plot', 'popularity', 'release_date', 'title', 'vote_average',
       'vote_count' ]]

In [28]:
movie_df.to_csv('data/movies.csv',encoding="utf-8",index=False)
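One consequence of storing the data as a csv is that the `tmdb_genres` lists are written as text such as `"[18, 80]"` and come back as strings. The `literal_eval` function imported at the top of this notebook is one way to turn them back into lists. The sketch below uses the standard `csv` module and an in-memory stand-in for the file, so the sample row is an assumption rather than the real contents of `data/movies.csv`.

```python
import csv
import io
from ast import literal_eval

# A two-line stand-in for data/movies.csv with only the columns needed here.
sample_csv = 'tmdb_id,tmdb_genres\n278,"[18, 80]"\n'

rows = list(csv.DictReader(io.StringIO(sample_csv)))
raw = rows[0]['tmdb_genres']  # a string, not a list
genres = literal_eval(raw)    # parsed back into a list of integers
```

With pandas, passing `converters={'tmdb_genres': literal_eval}` to `pd.read_csv` should apply the same conversion to the whole column while reading.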