# Movies Recommender System and Topic Modeling
**By: Sarah Alabdulwahab & Asma Althakafi**
> In this project, we measure the degree of similarity between movies based on their plots in order to recommend similar movies, and we also perform topic modeling on the movie plots.

In [1]:
import pandas as pd
from tqdm import tqdm

## Cleaning

In [2]:
movies_df = pd.read_csv('Data/movies.csv')
movies_df.head()

| | original_title | overview | genres | keywords | imdb_plot |
|---|---|---|---|---|---|
| 0 | Twelve Monkeys | In the year 2035, convict James Cole reluctant... | Science-Fiction Thriller Mystery | schizophrenia philadelphia, pennsylvania stock... | In a future world devastated by disease, a con... |
| 1 | Across the Sea of Time | A young Russian boy, Thomas Minton, travels to... | Adventure History Drama Family | | A young Russian boy, Thomas Minton, travels to... |
| 2 | Restoration | An aspiring young physician, Robert Merivel fo... | Drama Romance | jealousy medicine fountain court wealth spanie... | The exiled royal physician to King Charles II ... |
| 3 | The Crossing Guard | After his daughter died in a hit and run, Fred... | Drama Thriller | loss of loved one hit-and-run revenge tragedy | Freddy Gale is a seedy jeweller who has sworn ... |
| 4 | Once Upon a Time... When We Were Colored | This film relates the story of a tightly conne... | Romance Drama | racial segregation family relationships rural ... | A narrator tells the story of his childhood ye... |


In [3]:
print('The dataset contains',movies_df.shape[0],'movies and',movies_df.shape[1],'features')

The dataset contains 3133 movies and 5 features


In [4]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3133 entries, 0 to 3132
Data columns (total 5 columns):
original_title    3133 non-null object
overview          3133 non-null object
genres            3133 non-null object
keywords          2241 non-null object
imdb_plot         3111 non-null object
dtypes: object(5)
memory usage: 122.5+ KB


We fill the null values with empty strings so that no rows are lost.

In [5]:
movies_df.fillna('', inplace=True)

In order to increase the word count, we will concatenate the `overview`, `imdb_plot`, and `keywords` features.

In [6]:
movies_df['plot'] = movies_df['overview'] + movies_df['imdb_plot'] + movies_df['keywords']
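
One detail worth checking in the concatenation above: plain `+` joins the columns without a separator, so the last word of one column and the first word of the next can fuse into a single token. The sketch below illustrates this on a hypothetical one-row frame (`toy_df` and its contents are invented for the example, not taken from the dataset) and shows `str.cat` with `sep=' '` as one possible alternative.

```python
import pandas as pd

# Toy frame standing in for movies_df (hypothetical row, not from the dataset)
toy_df = pd.DataFrame({
    'overview': ['A convict travels back in time.'],
    'imdb_plot': ['He must find the source of a virus.'],
    'keywords': ['time travel virus'],
})

# Plain `+` glues the end of one column directly to the start of the next
fused = toy_df['overview'] + toy_df['imdb_plot'] + toy_df['keywords']

# str.cat with sep=' ' keeps a space at each column boundary
spaced = toy_df['overview'].str.cat([toy_df['imdb_plot'], toy_df['keywords']], sep=' ')

print(fused[0])
print(spaced[0])
```

Switching to the spaced variant would change the tokens seen downstream, so the vocabulary size and scores reported in this notebook would not carry over unchanged.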

In [7]:
# Drop the source columns now that they are merged into `plot`
movies_df.drop(columns=['overview', 'imdb_plot', 'keywords'], inplace=True)

In [8]:
movies_df.head()

| | original_title | genres | plot |
|---|---|---|---|
| 0 | Twelve Monkeys | Science-Fiction Thriller Mystery | In the year 2035, convict James Cole reluctant... |
| 1 | Across the Sea of Time | Adventure History Drama Family | A young Russian boy, Thomas Minton, travels to... |
| 2 | Restoration | Drama Romance | An aspiring young physician, Robert Merivel fo... |
| 3 | The Crossing Guard | Drama Thriller | After his daughter died in a hit and run, Fred... |
| 4 | Once Upon a Time... When We Were Colored | Romance Drama | This film relates the story of a tightly conne... |


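### spaCy
We will use spaCy to remove person names and dates (tokens tagged as `PERSON` or `DATE` entities) from each movie plot, since they are unlikely to help when comparing plots.
> You can install spaCy using pip (`!pip install spacy`) and download the small trained English pipeline with `!python -m spacy download en_core_web_sm`.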

In [9]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [10]:
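for i, txt in tqdm(enumerate(movies_df['plot'])):
    document = nlp(txt)
    # Keep every token that is not part of a PERSON or DATE entity;
    # .loc avoids pandas' chained-assignment pitfall
    movies_df.loc[i, 'plot'] = " ".join([token.text for token in document
                                         if token.ent_type_ not in ('PERSON', 'DATE')])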

3133it [03:31, 14.84it/s]


### NLTK, LancasterStemmer, and CountVectorizer
We will use NLTK and LancasterStemmer to tokenize and stem the movie plots, and CountVectorizer to turn the cleaned plots into a document-term matrix.

In [11]:
import nltk
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')

In [13]:
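clean_plots = []
for plot in tqdm(movies_df['plot']):
    # Split the plot into tokens
    words = nltk.word_tokenize(plot)
    # Keep alphabetic tokens only (drops punctuation and numbers), then stem and lowercase
    clean_plots.append(' '.join([stemmer.stem(word).lower() for word in words if word.isalpha()]))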

100%|██████████| 3133/3133 [00:26<00:00, 116.29it/s]


In [14]:
doc_word = cv.fit_transform(clean_plots)
doc_word.shape

(3133, 14829)
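
The shape above reads as (number of movies, number of distinct terms kept by the vectorizer). As a minimal sketch of what `fit_transform` produces, the example below builds the same kind of matrix from three made-up, already-stemmed plots; `toy_plots`, `toy_cv`, and `toy_doc_word` are names introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned plots, one string per movie
toy_plots = [
    'a convict travel back in time',
    'a young boy travel to new york',
    'a physician fall in love',
]

toy_cv = CountVectorizer(stop_words='english')
toy_doc_word = toy_cv.fit_transform(toy_plots)  # sparse matrix: rows = plots, columns = terms

print(toy_doc_word.shape)         # (number of plots, vocabulary size)
print(sorted(toy_cv.vocabulary_)) # terms that survived stop-word removal
print(toy_doc_word.toarray())     # raw term counts per plot
```

Each row is a bag-of-words count vector for one plot, which is the representation the later sections compare and cluster.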

## Classification

In [15]:
# Use the first listed genre of each movie as its single target label
main_genre = []
for genres in movies_df.genres.str.split():
    main_genre.append(genres[0])

In [16]:
pd.Series(main_genre).value_counts()

Drama              824
Documentary        542
Comedy             521
Action             266
Horror             164
Thriller           100
Romance             98
Adventure           88
Crime               80
Family              79
Animation           77
Music               61
Science-Fiction     59
Fantasy             52
Western             35
Mystery             31
War                 26
History             23
Foreign              7
dtype: int64
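
The genre counts above are far from balanced, which matters when reading an accuracy score. As a rough reference point, the sketch below computes the share of the most frequent genre over the whole dataset (not over a held-out split); `genre_counts` is a hand-copied dictionary of the counts printed above, introduced only for this illustration.

```python
# Counts copied from the value_counts() output above
genre_counts = {
    'Drama': 824, 'Documentary': 542, 'Comedy': 521, 'Action': 266,
    'Horror': 164, 'Thriller': 100, 'Romance': 98, 'Adventure': 88,
    'Crime': 80, 'Family': 79, 'Animation': 77, 'Music': 61,
    'Science-Fiction': 59, 'Fantasy': 52, 'Western': 35, 'Mystery': 31,
    'War': 26, 'History': 23, 'Foreign': 7,
}

total = sum(genre_counts.values())
majority_genre = max(genre_counts, key=genre_counts.get)
majority_share = genre_counts[majority_genre] / total

print(f"{majority_genre}: {majority_share:.3f} of {total} movies")
```

A classifier that always predicted the most frequent genre would be right for about 26% of the movies overall, so that is a useful floor when judging a model's accuracy.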

In [17]:
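from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(clean_plots, main_genre, test_size=.2, random_state=0)

# Refit the vectorizer on the training plots only, then reuse its vocabulary for the test plots
# (doc_word, built earlier from all plots, is not affected by this refit)
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)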

In [18]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb.score(X_test, y_test)

0.46570972886762363
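
The fit-on-train, transform-on-test pattern used above can also be bundled into a scikit-learn `Pipeline`, which keeps the vectorizer and the classifier together and avoids refitting a shared vectorizer object. The following is a minimal sketch on made-up data, not the notebook's actual setup; `train_plots`, `train_genres`, `test_plots`, and `clf` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, only to make the sketch runnable
train_plots = ['alien ship attack earth', 'robot war in space',
               'couple fall in love', 'wedding and romance in paris']
train_genres = ['Science-Fiction', 'Science-Fiction', 'Romance', 'Romance']
test_plots = ['alien robot attack', 'love and wedding']

# The vectorizer is fitted inside the pipeline, on the training plots only
clf = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
clf.fit(train_plots, train_genres)

# predict() vectorizes the raw test plots with the training vocabulary
print(clf.predict(test_plots))
```

With the notebook's data, the same pipeline could be fitted on the raw training strings from `train_test_split` before they are vectorized, and scored with `clf.score`.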

In [19]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [20]:
model = RandomForestClassifier(max_depth=25, n_estimators=100)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.4194577352472089

## Content-based Recommender System

In [21]:
from sklearn.metrics.pairwise import cosine_similarity 

In [22]:
def find_similar(title, num_of_movies=10):
    # Row index of the given movie
    index = movies_df[movies_df['original_title'] == title].index[0]
    # Cosine similarity between the given movie and every movie, computed in one call
    scores = cosine_similarity(doc_word[index], doc_word)[0]
    similarities = [[score, movies_df['original_title'][i], movies_df['genres'][i]]
                    for i, score in enumerate(scores)]
    # Sort by descending similarity and skip the first entry (the movie itself)
    results = sorted(similarities, reverse=True)[1:num_of_movies+1]
    return results

In [23]:
for movie in find_similar('Twelve Monkeys',3):
    print(movie[1],"(similarity score:",movie[0],')')
    print('\tgenres:',movie[2],'\n')

Carriers (similarity score: 0.2490583770684494 )
	genres: Action Drama Horror Science-Fiction Thriller 

Day of the Dead 2: Contagium (similarity score: 0.22658913094724192 )
	genres: Horror Science-Fiction 

Solos (similarity score: 0.2257646938068417 )
	genres: Horror Thriller Science-Fiction Foreign 
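
To make the ranking step concrete, here is a minimal sketch of the same idea on a four-movie toy corpus: build count vectors, score one row against all rows with `cosine_similarity`, and return the top matches. The titles, plots, and the helper `top_k_similar` are invented for this illustration. Unlike slicing off the first sorted entry, the sketch drops the query movie by index, which is an assumption-free way to exclude it even if another movie has an identical plot.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus
toy_titles = ['Virus A', 'Virus B', 'Love Story', 'Space War']
toy_plots = [
    'deadly virus outbreak city survivors',
    'virus outbreak survivors escape',
    'couple love wedding paris',
    'space war alien fleet',
]
toy_matrix = CountVectorizer().fit_transform(toy_plots)

def top_k_similar(query_index, k=2):
    # One call scores the query row against every row of the matrix
    scores = cosine_similarity(toy_matrix[query_index], toy_matrix)[0]
    # Sort indices by descending score, then drop the query itself by index
    order = [i for i in np.argsort(-scores) if i != query_index]
    return [(toy_titles[i], float(scores[i])) for i in order[:k]]

print(top_k_similar(0, k=1))
```

Here 'Virus A' and 'Virus B' share three of their terms, so 'Virus B' comes out on top; the other two plots share no terms with the query and score zero.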



## Clustering

In [24]:
from sklearn.cluster import KMeans

In [25]:
num_clusters = 100
km = KMeans(n_clusters=num_clusters,random_state=10,n_init=1)
km.fit(doc_word)
km.inertia_

367307.3197550094
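
For context on the number above: `inertia_` is the sum of squared distances from each sample to the centroid of its assigned cluster, so lower values mean tighter clusters, and it generally decreases as the number of clusters grows. The sketch below checks that definition by hand on four made-up 2-D points; `toy_points` and `toy_km` are names introduced for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious pairs of points (hypothetical data)
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_km = KMeans(n_clusters=2, random_state=10, n_init=10).fit(toy_points)

# Squared distance from each point to the centroid of its assigned cluster
assigned_centroids = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((toy_points - assigned_centroids) ** 2).sum()

print(toy_km.inertia_, manual_inertia)
```

Because inertia depends on the scale and dimensionality of the data, a single value is hard to interpret on its own; it is more informative when compared across different numbers of clusters.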

In [26]:
from sklearn.metrics import silhouette_score

In [27]:
clusterer = KMeans(n_clusters=10, random_state=10)
cluster_labels = clusterer.fit_predict(doc_word)
silhouette_avg = silhouette_score(doc_word, cluster_labels)
print("For n_clusters =", 10,"the average silhouette_score is:",silhouette_avg)

For n_clusters = 10 the average silhouette_score is: 0.0025007458660632684
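
The silhouette score ranges from -1 to 1, and values near 0 suggest overlapping clusters. One common way to use it is to compare several candidate cluster counts rather than a single one. The sketch below does this on synthetic, well-separated blobs (generated with `make_blobs`, not the movie data), so the names `toy_X`, `scores`, and `best_k` and the candidate range are assumptions made for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the movie plots)
toy_X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=0.5, random_state=10)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)
    print("For n_clusters =", k, "the average silhouette_score is:", scores[k])

# Candidate with the highest average silhouette
best_k = max(scores, key=scores.get)
print("Best n_clusters:", best_k)
```

The same loop could be run over `doc_word` with a wider range of cluster counts, although each silhouette computation on the full matrix takes noticeably longer.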
