# Movies Recommender System and Topic Modeling
**By: Sarah Alabdulwahab & Asma Althakafi**
> In this project, we measure the degree of similarity between movies in order to recommend movies based on their plots, and we also perform topic modeling on the movie plots.

In [1]:
#suppress warnings
import warnings
from pandas.errors import DtypeWarning
warnings.simplefilter(action='ignore', category=DtypeWarning)

import pandas as pd
from tqdm import tqdm

## Data Collection
The dataset consists of movies released on or before July 2017. Data points include budget, genres, overview, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages, etc.

In [2]:
movies_df = pd.read_csv('Data/movies_metadata.csv')
movies_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
print('The dataset contains',movies_df.shape[0],'movies and',movies_df.shape[1],'features')

The dataset contains 45466 movies and 24 features


Since our focus is on the plots and genres of the movies, we keep only the English-language movies and drop all other features. However, we keep the IDs because we need them later to retrieve more information.

In [4]:
#keep only the English movies and the columns we need
movies_df = movies_df[movies_df.original_language == 'en'][['id', 'imdb_id', 'original_title', 'overview', 'genres']]

In [5]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32269 entries, 0 to 45465
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              32269 non-null  object
 1   imdb_id         32256 non-null  object
 2   original_title  32269 non-null  object
 3   overview        32200 non-null  object
 4   genres          32269 non-null  object
dtypes: object(5)
memory usage: 1.5+ MB


In [6]:
#drop nulls and reset the index
movies_df.dropna(inplace=True)
movies_df.reset_index(drop=True, inplace=True)

Next, we check the word count of each movie overview and keep only the overviews that contain at least 100 words.

In [7]:
#collect the indices of movies we will keep 
idx = []
for i in range(movies_df.shape[0]):
    list_of_words = movies_df.overview[i].split()
    if len(list_of_words) >= 100:
        idx.append(i)
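The same filter can also be expressed without an explicit loop. The following is a minimal sketch, assuming the overviews are non-null strings (the nulls were dropped earlier). The small `demo_df` frame and the `min_words` threshold of 3 are hypothetical and only keep the example short.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_df
demo_df = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'overview': ['one two three four', 'one two', 'one two three'],
})

min_words = 3  # the notebook uses 100

# Split each overview on whitespace and count the resulting tokens
word_counts = demo_df['overview'].str.split().str.len()

# Keep the rows that meet the threshold and renumber the index
filtered_df = demo_df[word_counts >= min_words].reset_index(drop=True)
print(filtered_df['original_title'].tolist())
```

A boolean mask keeps the counting and the selection in one place, so the separate list of indices is not needed.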

In [8]:
movies_df = movies_df.iloc[idx, :].reset_index(drop=True)
movies_df.head()

Unnamed: 0,id,imdb_id,original_title,overview,genres
0,63,tt0114746,Twelve Monkeys,"In the year 2035, convict James Cole reluctant...","[{'id': 878, 'name': 'Science Fiction'}, {'id'..."
1,139405,tt0112286,Across the Sea of Time,"A young Russian boy, Thomas Minton, travels to...","[{'id': 12, 'name': 'Adventure'}, {'id': 36, '..."
2,35196,tt0114272,Restoration,"An aspiring young physician, Robert Merivel fo...","[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n..."
3,27526,tt0112744,The Crossing Guard,"After his daughter died in a hit and run, Fred...","[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name..."
4,146599,tt0114039,Once Upon a Time... When We Were Colored,This film relates the story of a tightly conne...,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ..."


### Cleaning `genres`
We will extract the genre names for each movie.

The genres are stored as a list of dictionaries; however, the whole list is serialized as a string. Here is an example:

    "[{'id': 878, 'name': 'Science Fiction'}, {...}]"

In [9]:
genres_list = []
for genre in movies_df.genres:
    genres_str = ''

    #list of indices of the genres -> one after the 'name' key
    indices = [i+1 for i, j in enumerate(genre.split()) if j == "'name':"]
    
    for i in indices:
        genre_name = genre.split()[i]
        if genre_name == "'Science": #a special case because this genre contains two words
            genres_str += genre_name[1:]+ '-' + genre.split()[i+1][0:-3] + ' '
        else:
            genres_str += genre_name[1:-3] + ' '
    
    genres_list.append(genres_str.strip())
    
#NOTE: the slice [1:-3] removes the opening single quote, the closing single quote, and the trailing "}," or "}]"
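Because each value is the string form of a Python list of dictionaries, it can also be parsed directly instead of sliced. The sketch below is an alternative to the cell above, not the approach used in this notebook. It relies on `ast.literal_eval` from the standard library, and the helper name `genre_names` is hypothetical. Any multi-word genre name (for example 'TV Movie') is hyphenated in the same way as 'Science Fiction', without a special case.

```python
import ast

def genre_names(genres_str):
    """Return the genre names from a stringified list of dicts, space-separated."""
    try:
        parsed = ast.literal_eval(genres_str)
    except (ValueError, SyntaxError):
        # The string is not a valid Python literal
        return ''
    if not isinstance(parsed, list):
        return ''
    # Hyphenate multi-word names so each genre stays a single token
    names = [g['name'].replace(' ', '-')
             for g in parsed
             if isinstance(g, dict) and 'name' in g]
    return ' '.join(names)

example = "[{'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}]"
print(genre_names(example))  # Science-Fiction Thriller
```

With this helper, the genres column could be rebuilt with `movies_df.genres.apply(genre_names)`.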

In [10]:
movies_df.genres = genres_list
movies_df = movies_df[movies_df['genres'] != ''].reset_index(drop=True)
movies_df.head()

Unnamed: 0,id,imdb_id,original_title,overview,genres
0,63,tt0114746,Twelve Monkeys,"In the year 2035, convict James Cole reluctant...",Science-Fiction Thriller Mystery
1,139405,tt0112286,Across the Sea of Time,"A young Russian boy, Thomas Minton, travels to...",Adventure History Drama Family
2,35196,tt0114272,Restoration,"An aspiring young physician, Robert Merivel fo...",Drama Romance
3,27526,tt0112744,The Crossing Guard,"After his daughter died in a hit and run, Fred...",Drama Thriller
4,146599,tt0114039,Once Upon a Time... When We Were Colored,This film relates the story of a tightly conne...,Romance Drama


### The Movie Database (TMDB) API
We will use tmdbv3api to retrieve the keywords for each movie.
>You can install tmdbv3api using pip: `pip install tmdbv3api`

In [11]:
from tmdbv3api import TMDb
from tmdbv3api import Movie

In [12]:
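import os

tmdb = TMDb()
#read the API key from the environment instead of hardcoding it in the notebook
#(the variable name TMDB_API_KEY is an assumption; use whichever name you export)
tmdb.api_key = os.environ['TMDB_API_KEY']
movie = Movie()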

In [13]:
keywords = []
for i, ID in tqdm(enumerate(movies_df.id)):
    try:
        keywords_str = ''
        words = movie.details(ID)['keywords']['keywords']
        for word in words:
            keywords_str += word['name'] + ' '
        keywords.append(keywords_str)
    except Exception: #store an empty string on failure so the list stays aligned with the rows
        keywords.append('')

3133it [33:43,  1.55it/s]
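Joining the keyword names with spaces blends multi-word keywords such as "loss of loved one" into their neighbours. If keeping each keyword as a single token is preferred, one option is to hyphenate the words inside a keyword, as was done for the 'Science-Fiction' genre. The sketch below is an optional variant, not what this notebook does. The helper name `join_keywords` and the sample payload are hypothetical, and plain dictionaries stand in for the objects returned by the API.

```python
def join_keywords(words):
    """Join keyword entries into one string, with one token per keyword."""
    # Replace inner spaces so a multi-word keyword stays a single token
    tokens = [w['name'].strip().replace(' ', '-') for w in words]
    return ' '.join(tokens)

# Hypothetical payload shaped like the list iterated over in the loop above
sample_words = [
    {'id': 1, 'name': 'loss of loved one'},
    {'id': 2, 'name': 'revenge'},
]
print(join_keywords(sample_words))  # loss-of-loved-one revenge
```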


In [14]:
movies_df['keywords'] = keywords
movies_df.head()

Unnamed: 0,id,imdb_id,original_title,overview,genres,keywords
0,63,tt0114746,Twelve Monkeys,"In the year 2035, convict James Cole reluctant...",Science-Fiction Thriller Mystery,"schizophrenia philadelphia, pennsylvania stock..."
1,139405,tt0112286,Across the Sea of Time,"A young Russian boy, Thomas Minton, travels to...",Adventure History Drama Family,
2,35196,tt0114272,Restoration,"An aspiring young physician, Robert Merivel fo...",Drama Romance,jealousy medicine fountain court wealth spanie...
3,27526,tt0112744,The Crossing Guard,"After his daughter died in a hit and run, Fred...",Drama Thriller,loss of loved one hit-and-run revenge tragedy
4,146599,tt0114039,Once Upon a Time... When We Were Colored,This film relates the story of a tightly conne...,Romance Drama,racial segregation family relationships rural ...


### IMDb scraping
We will use BeautifulSoup to scrape the movie plots from IMDb in order to increase our word count.

In [15]:
import requests
from bs4 import BeautifulSoup
import time

In [16]:
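imdb_plot = []
for ID in tqdm(movies_df.imdb_id):
    try:
        time.sleep(3) #pause between requests to avoid overloading the server
        response = requests.get("https://www.imdb.com/title/" + ID, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        #the class name below is tied to the page layout at the time of scraping
        imdb_plot.append(soup.find('span', class_="GenresAndPlot__TextContainerBreakpointL-cum89p-1 gwuUFD").text)
    except Exception: #store an empty string when the request or the lookup fails
        imdb_plot.append('')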

100%|██████████| 3133/3133 [4:42:18<00:00,  5.41s/it]  
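The loop above combines three concerns: pausing between requests, fetching a page, and falling back to an empty string on failure. The sketch below separates the pacing and fallback logic from the fetching so that it can be exercised without network access. The names `collect_plots` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for the real request-and-parse step.

```python
import time

def collect_plots(imdb_ids, fetch_plot, delay_seconds=3.0):
    """Call fetch_plot for each ID, pausing before each call; '' marks a failure."""
    plots = []
    for imdb_id in imdb_ids:
        time.sleep(delay_seconds)
        try:
            plots.append(fetch_plot(imdb_id))
        except Exception:
            # Keep the output aligned with the input IDs
            plots.append('')
    return plots

def fake_fetch(imdb_id):
    """Stand-in for the scraping step; fails for one made-up ID."""
    if imdb_id == 'tt0000000':
        raise ValueError('page not found')
    return 'plot for ' + imdb_id

print(collect_plots(['tt0114746', 'tt0000000'], fake_fetch, delay_seconds=0))
# ['plot for tt0114746', '']
```

Counting the empty strings in the result afterwards gives a quick measure of how many pages could not be scraped.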


In [17]:
movies_df['imdb_plot'] = imdb_plot
movies_df.head()

Unnamed: 0,id,imdb_id,original_title,overview,genres,keywords,imdb_plot
0,63,tt0114746,Twelve Monkeys,"In the year 2035, convict James Cole reluctant...",Science-Fiction Thriller Mystery,"schizophrenia philadelphia, pennsylvania stock...","In a future world devastated by disease, a con..."
1,139405,tt0112286,Across the Sea of Time,"A young Russian boy, Thomas Minton, travels to...",Adventure History Drama Family,,"A young Russian boy, Thomas Minton, travels to..."
2,35196,tt0114272,Restoration,"An aspiring young physician, Robert Merivel fo...",Drama Romance,jealousy medicine fountain court wealth spanie...,The exiled royal physician to King Charles II ...
3,27526,tt0112744,The Crossing Guard,"After his daughter died in a hit and run, Fred...",Drama Thriller,loss of loved one hit-and-run revenge tragedy,Freddy Gale is a seedy jeweller who has sworn ...
4,146599,tt0114039,Once Upon a Time... When We Were Colored,This film relates the story of a tightly conne...,Romance Drama,racial segregation family relationships rural ...,A narrator tells the story of his childhood ye...


Now that we have used the IDs to get the extra information that we needed, we can drop them.

In [18]:
movies_df.drop(columns=['id','imdb_id'], inplace=True)
movies_df.head()

Unnamed: 0,original_title,overview,genres,keywords,imdb_plot
0,Twelve Monkeys,"In the year 2035, convict James Cole reluctant...",Science-Fiction Thriller Mystery,"schizophrenia philadelphia, pennsylvania stock...","In a future world devastated by disease, a con..."
1,Across the Sea of Time,"A young Russian boy, Thomas Minton, travels to...",Adventure History Drama Family,,"A young Russian boy, Thomas Minton, travels to..."
2,Restoration,"An aspiring young physician, Robert Merivel fo...",Drama Romance,jealousy medicine fountain court wealth spanie...,The exiled royal physician to King Charles II ...
3,The Crossing Guard,"After his daughter died in a hit and run, Fred...",Drama Thriller,loss of loved one hit-and-run revenge tragedy,Freddy Gale is a seedy jeweller who has sworn ...
4,Once Upon a Time... When We Were Colored,This film relates the story of a tightly conne...,Romance Drama,racial segregation family relationships rural ...,A narrator tells the story of his childhood ye...


In [19]:
print('After cleaning, the dataset contains', movies_df.shape[0],'movies and', movies_df.shape[1],'features')

After cleaning, the dataset contains 3133 movies and 5 features


In [21]:
movies_df.to_csv('Data/movies.csv', index=False)