# Big Data & Content Analytics Project

**Students FullName:** Spanos Nikolaos, Baratsas Sotirios <br/>
**Students ID:** f2821826, f2821803 <br/>
**Supervisor:** Papageorgious Xaris, Perakis Georgios <br/>

<div style="text-align: left">  <strong> <font size="3"> Executive Summary </strong> </div>
    <div style="text-align: justify"> This jupyter notebook presents the python code and all the steps of execution that has been followed to create a movie recommendation engine. The notebook is splitted into two parts. The first part contains 4 different units, using data found on IMDB. While the second part, is a different approach using wikipedia's dataset. <p> <p>
        <i> Part 1 </i><br>
    <b> Unit 1: </b> Read, update & clean the data <br/>
    <b> Unit 2: </b> Building the movie recommendation model <br/>
    <b> Unit 3: </b> Word Embeddings <br/>
        <b> Unit 4: </b> Movie Recommendation algorithm </p><br/></div>
     <div style="text-align: justify">Each unit should be executed in the order written for the algorithm to work properly. Although, since most of those units may take time to process each different script of code, the reader of this notebook can <b>directly read the pickled (serialised) file right before the beginning of Unit 4</b>.<br>The jupyter notebook is one of the three deliveralbles of this project and its role is to present the code scripts that built the movie recommendation engine. For the reader to understand the logic behind the tools used to build this movie recommendation engine, (s)he is instructed to read the project's report. Last but not least, the algorithm written in Unit 4 is the one that have been passed to the chatbot messenger in order to propose the user each time a different choice of movie.</div>

**Import all the python modules mendatory for the code to be executed properly.**

In [2]:
# For cleaning and preparing the dataset
# -> dataframe manipulation
# -> text manipulation
# -> Web Scrapping

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re


# Module to serialize the content produced from the execution of the code

import pickle


# Module to monitor the progress of a python for loop

from tqdm import tqdm


# Module to manipulate text in python - NLTK package

import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords


# Module to compute word vectorizers and compute the cosine distance

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances


# Module to train our own word embdeddings

import fasttext

If the reader of the notebook wants to download any of the above modules locally on his/her laptop, (s)he is advised to use the following *command*:

In [None]:
# import sys
# !{sys.executable} -m pip install gensim

<b> This is the final dataset that has been created over the units 1 to 3</b>. The reader is advised to go through all the units of this notebook, in order to understand the processed followed to reach to the final data structure.

Read the pickled (serialized) dataset with all the transformation and cleaning steps included!

In [None]:
five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_02092019.pkl')

#  Unit 1: Read, update and clean the data

<b>Summary:</b> In the first unit of the notebook, the code for the following three processes is executed:

* <u>Reading:</u> The dataset is imported as a pandas dataframe in order to have the values in a tabular format.
* <u>Cleaning:</u> The dataset had many noise. For example, many columns had special characters and extra spaces. Moreover, some columns had to be splitted in order for their respective values to be treated separately (i.e genres). Last but not least, some of the columns of the original dataset was irrelevant with the scope of the project, while some attributes were in the wrong data type.
* <u>Updating:</u> The dataset was updated, by scraping information from the imbd links provided in column "movie_imdb_link". The fields updated were the IMDB Rating and the Cast, while the Plot_Summary of each movie was added.

<div style="text-align: center"><b>Section 1:</b> Read and Clean the dataset</div>

**Initial raw dataset collected**

In [63]:
dataset = pd.read_csv("movie_metadata.csv", encoding = 'UTF-8')
dataset = dataset.reset_index()

**Step 1:** Check for duplicate imbd links and remove any duplicates found

In [64]:
duplicates = dataset['movie_imdb_link'].value_counts().tolist()

In [65]:
empty_l = []
for i in duplicates:
    if i > 1:
        empty_l.append(i)
        
len(empty_l)

117

So it is obvious that the dataset has some links more that one time and this is a problem. So the solution, is to remove those duplicate values and keep the first occurence of each!

**Solution**

In [66]:
dataset_new = dataset.drop_duplicates(subset=['movie_imdb_link'], keep='first')

**Step 2:** Remove spaces and tabs from the movie title field

In [67]:
dataset_new.movie_title.tail(5)

5038      Signed Sealed Delivered 
5039    The Following             
5040         A Plague So Pleasant 
5041             Shanghai Calling 
5042            My Date with Drew 
Name: movie_title, dtype: object

As we can see from the above print the movie "The Following" has extra spaces. It is not propered aligned with the other movies. Thus, those extra tabs should be removed.

In [68]:
dataset_new['movie_title'] = dataset_new['movie_title'].apply(lambda x: re.sub('\s+', ' ', x).strip())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [69]:
dataset_new.movie_title.tail(5)

# The extra spaces are properly removed.

5038    Signed Sealed Delivered
5039              The Following
5040       A Plague So Pleasant
5041           Shanghai Calling
5042          My Date with Drew
Name: movie_title, dtype: object

**Step 3:** Droping the columns that are irrelevant

As irrelevant are declared those variables that do not have any relevant contribution to the content of a movie. For example, the facebook likes two movies received does not affect their respective similarity.

In [70]:
dataset_new = dataset_new.drop(['color', 'num_critic_for_reviews', 'director_facebook_likes', 'actor_3_facebook_likes', 
                                'actor_1_facebook_likes', 'cast_total_facebook_likes', 'facenumber_in_poster', 
                                'actor_2_facebook_likes', 'aspect_ratio', 'movie_facebook_likes', 'content_rating'], axis=1)
dataset_new.shape

(4919, 18)

**Step 4:** Change the order of the remaining 18 columns

In [71]:
dataset_new = dataset_new[['movie_title', 'imdb_score', 'num_user_for_reviews', 'num_voted_users', 'director_name', 
                           'actor_1_name', 'actor_2_name', 'actor_3_name', 'plot_keywords', 'gross', 'genres',  'duration', 
                           'language', 'country', 'budget', 'title_year', 'movie_imdb_link']]

**Step 5:** Replace "|" with "," in the "plot keywords" and "genres" columns

In [72]:
print(dataset_new.plot_keywords, dataset_new.genres)

0                  avatar|future|marine|native|paraplegic
1       goddess|marriage ceremony|marriage proposal|pi...
2                     bomb|espionage|sequel|spy|terrorist
3       deception|imprisonment|lawlessness|police offi...
4                                                     NaN
                              ...                        
5038               fraud|postal worker|prison|theft|trial
5039         cult|fbi|hideout|prison escape|serial killer
5040                                                  NaN
5041                                                  NaN
5042    actress name in title|crush|date|four word tit...
Name: plot_keywords, Length: 4919, dtype: object 0       Action|Adventure|Fantasy|Sci-Fi
1              Action|Adventure|Fantasy
2             Action|Adventure|Thriller
3                       Action|Thriller
4                           Documentary
                     ...               
5038                       Comedy|Drama
5039       Crime|Drama|Mystery|Th

In [73]:
mylist = ['plot_keywords', 'genres']
for i in mylist:
    dataset_new.loc[:, i] = dataset_new.loc[:, i].str.replace('|', ',')

**Step 6:** Separate the genres of each movie

In [74]:
dataset_new = dataset_new.join(dataset_new['genres'].str.split(',', expand=True).add_prefix('genre_').fillna(0))
dataset_new.head(5)

Unnamed: 0,movie_title,imdb_score,num_user_for_reviews,num_voted_users,director_name,actor_1_name,actor_2_name,actor_3_name,plot_keywords,gross,...,title_year,movie_imdb_link,genre_0,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7
0,Avatar,7.9,3054.0,886204,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,"avatar,future,marine,native,paraplegic",760505847.0,...,2009.0,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,Action,Adventure,Fantasy,Sci-Fi,0,0,0,0
1,Pirates of the Caribbean: At World's End,7.1,1238.0,471220,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,"goddess,marriage ceremony,marriage proposal,pi...",309404152.0,...,2007.0,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,Action,Adventure,Fantasy,0,0,0,0,0
2,Spectre,6.8,994.0,275868,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,"bomb,espionage,sequel,spy,terrorist",200074175.0,...,2015.0,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,Action,Adventure,Thriller,0,0,0,0,0
3,The Dark Knight Rises,8.5,2701.0,1144337,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,"deception,imprisonment,lawlessness,police offi...",448130642.0,...,2012.0,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,Action,Thriller,0,0,0,0,0,0
4,Star Wars: Episode VII - The Force Awakens,7.1,,8,Doug Walker,Doug Walker,Rob Walker,,,,...,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,Documentary,0,0,0,0,0,0,0


**Step 7:** Clean the numeric variables

In [75]:
numeircs_list = ['num_user_for_reviews', 'budget', 'title_year']
for i in numeircs_list:
    dataset_new[i] = dataset_new[i].fillna(0).astype(np.int64)

**Step 8:** Remove the rows that refer to TV Series

We have noticed that the duration of some rows was less than an hour (60 minutes). This strongly implies that those rows are episodes of a TV serie. Since we would like to recommend only movies, any row that has a duration less than 70 minutes will be removed.

In [76]:
dataset_new = dataset_new[dataset_new['duration'] > 70]

Up to this point the shape of our dataset is:

In [77]:
dataset_new.shape

(4779, 25)

Up to this point the dataset has 25 columns. However, we won't stop here. We have to enrich the dataset with columns that will add value to the similarity index between different movies. One such feature is the plot summary of each movie.

Using the active imdb link we achieved to request the html file of the movie and extract from there the field relevant to the plot summary.

<div style="text-align: center"><b>Section 2:</b> Update the dataset using web scraping techniques</div>

<i>Python Modules used: BeatifulSoup, Requests</i>

**Step 1:** Scrap the Plot Summary

Note: The code snipset below is commented, since its completion would demand 1,5 hours. </p>
The script was executed once and then was serialised locally for future use.

In [7]:
five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_02092019.pkl')

In [15]:
# mylist = []
# souplist = []
# myfield = []
# plot_summary = []
# mock_dataset = five_thousands.movie_imdb_link

# for i in tqdm(mock_dataset):
#     mylist.append(requests.get(i))
    
# for i in tqdm(mylist):
#     souplist.append(BeautifulSoup(i.text))
        
# for i in tqdm(souplist):
#     myfield.append(i.find_all('div', {'class':'plot_summary'}))

# for i in tqdm(myfield):
#     for x in tqdm(i):
#         for y in tqdm(x.find_all('div', {'class':'summary_text'})):
#             plot_summary.append(y.text)

In [None]:
# five_thousands.to_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\dataset_new_update1.pkl')

In [9]:
# Pickle the requests file with the 4779 movies

# Save the file

# with open('requests.pkl', 'wb') as f:
#     pickle.dump(mylist, f)

**Step 2:** Scrap the IMDB Ratings

Note: What we noticed on some movies, was the outdated IMDB Rating so we got the latest IMBD Rating based on the online links of each movie.

**Import the pickled requests list**, which contains all the html docs downloaded in Step 1

In [1]:
# with open('requests.pkl', 'rb') as f:
#     requests_list = pickle.load(f)

In [59]:
# souplist = []
# myfield = []
# ratings = []

# for i in tqdm(requests_list):
#     souplist.append(BeautifulSoup(i.text))

# for i in tqdm(souplist):
#     myfield.append(i.find_all('div', {'class':'ratingValue'}))

# for i in tqdm(myfield):
#     for x in tqdm(i):
#         for y in tqdm(x.find_all('span', {'itemprop':'ratingValue'})):
#             ratings.append(y.text)

In [3]:
# five_thousands.to_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\dataset_new_update2.pkl')

**Final Step:** Having processed all these steps of cleaning and updating the data, it is time to pickle the dataset up to this point.

In [None]:
# five_thousands.to_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\dataset_cleaned_updated_03092019.pkl')

# Unit 2: Building the movie reccomendation model <p>

<div style="text-align: justify"> <b> Summary: </b> In this unit, we will prepare the dataset that has been pickled for the recommendation engine.
Some actions needed for the preparation is to select specific features of the dataset, the most relevant ones, and also to combine features together in order for the cosine distance and the word embeddings to work properly.</div>

**Import the pickled dataset**, generated in Unit 1.

In [None]:
five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\dataset_cleaned_updated_03092019.pkl')

**Step 1:** Selecting the features that will be used to compare the movie together

In [None]:
features= ['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name', 'plot_keywords', 'genre_0', 'genre_1', 'genre_2',
           'genre_3', 'genre_4', 'genre_5', 'genre_6', 'genre_7', 'plot_summary']

**Step 2:** Create the function that will combine those features (specified above) to a unified row

In [None]:
def combine_features(row):
    return row['director_name'] + " " + row['actor_1_name'] + " " + row['actor_2_name'] + " " + row['actor_3_name'] + " " + row['plot_keywords'] + " " + row['genre_0'] + " " + row['genre_1'] + " " + row['genre_2'] + " " + row['genre_3'] + " " + row['genre_4'] + " " + row['genre_5'] + " " + row['genre_6'] + " " + row['genre_7'] + " " + row['plot_summary']

**Step 3:** Replace missing values with 'space' and transform each feature to string

In [None]:
for feature in features:
    five_thousands[feature] = five_thousands[feature].fillna('')
    
for feature in features:
    five_thousands[feature] = five_thousands[feature].astype('str')

**Step 4:** Create the final column of the dataset, which will determine the content of each movie. </p>
**Column Name:** "combined features"

In [None]:
five_thousands["combined_features"] = five_thousands.apply(combine_features, axis=1)

Pickle the dataset that has been generated through Unit 2.

In [None]:
# five_thousands.to_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_unit_2_03092019.pkl')

# Unit 3: Word Embeddings <p>

<div style="text-align: justify"> <b> Summary:  </b>Using the FastText module I trained the unsupervised dataset to create word embeddings based on the cast, the plot summary, the plot keywords, the director and the genres of the movies. Having trained the dataset then I used the word embeddings to calculate the cosine distance between the movie the user gave as input with the rest of the dataset's movies. </div>

**Import the pickled dataset** created in the end of Unit 2.

In [28]:
five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_plot_cast_feature_embedded_05092019.pkl')
five_thousands.head(5)

Unnamed: 0,level_0,index,movie_title,imdb_score,num_user_for_reviews,num_voted_users,director_name,actor_1_name,actor_2_name,actor_3_name,...,genre_6,genre_7,movie_index,updated_rating,plot_summary,combined_features,combined_actors,average_combined_features,average_cast_vectors,average_plot_vectors
0,0,0,Avatar,7.9,3054,886204,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,...,0,0,1,7.8,\n A paraplegic Marine disp...,James Cameron CCH Pounder Joel David Moore Wes...,"CCH Pounder,Joel David Moore,Wes Studi","[0.02741548, 0.014202604, 0.09099899, 0.077658...","[-0.030927671, 0.1788817, -0.03483894, 0.00479...","[-0.013443385, 0.052663647, 0.0044201775, 0.01..."
1,1,1,Pirates of the Caribbean: At World's End,7.1,1238,471220,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,...,0,0,2,7.1,"\n Captain Barbossa, Will T...",Gore Verbinski Johnny Depp Orlando Bloom Jack ...,"Johnny Depp,Orlando Bloom,Jack Davenport","[-0.014147306, 0.055645518, 0.01842277, 0.0490...","[-0.054685786, 0.29279318, -0.10224095, 0.0574...","[0.012848631, 0.14669019, 0.024647376, 0.08664..."
2,2,2,Spectre,6.8,994,275868,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,...,0,0,3,6.8,\n A cryptic message from 0...,Sam Mendes Christoph Waltz Rory Kinnear Stepha...,"Christoph Waltz,Rory Kinnear,Stephanie Sigman","[0.08728486, 0.03535206, 0.053212844, 0.082393...","[-0.1668923, 0.08294236, -0.16238469, 0.078987...","[0.041572966, 0.11594085, 0.02080741, -0.00669..."
3,3,3,The Dark Knight Rises,8.5,2701,1144337,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,...,0,0,4,8.4,\n Eight years after the Jo...,Christopher Nolan Tom Hardy Christian Bale Jos...,"Tom Hardy,Christian Bale,Joseph Gordon-Levitt","[0.024196928, 0.05928087, -0.027536094, 0.0421...","[-0.20388456, 0.20223862, -0.06453689, -0.0548...","[0.017869577, 0.08244453, -0.037199665, -0.025..."
4,5,5,John Carter,6.6,738,212204,Andrew Stanton,Daryl Sabara,Samantha Morton,Polly Walker,...,0,0,6,6.6,"\n Transported to Barsoom, ...",Andrew Stanton Daryl Sabara Samantha Morton Po...,"Daryl Sabara,Samantha Morton,Polly Walker","[0.0008429145, 0.019675335, 0.071446545, 0.031...","[-0.093794234, 0.28681445, -0.062326398, -0.02...","[0.034331556, 0.12629311, 0.0005355305, -0.016..."


In [None]:
# five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_unit_2_03092019.pkl')

In [16]:
# # Step 1.1: Cast Embeddings

# def combine_actors(row):
#     return str(row['actor_1_name']) + "," + str(row['actor_2_name']) + "," + str(row['actor_3_name'])

# five_thousands["combined_actors"] = five_thousands.apply(combine_actors, axis=1)
# five_thousands = five_thousands.reset_index()

# with open('actors_embeddings.txt', 'w', encoding="utf-8") as f:
#     for text in five_thousands["combined_actors"].tolist():
#         f.write(text + '\n')

# Skipgram model (updated)
# model = fasttext.train_unsupervised("actors_embeddings.txt", model='skipgram', lr=0.05, dim=100, ws=3, epoch=500)
# model.save_model("model_file_cast.bin")

# Stpe 1.2

# model = fasttext.load_model('model_file_cast.bin')

# average_vector_list_cast = []
# for i in tqdm(range(len(five_thousands["combined_actors"]))):
#     actors = five_thousands["combined_actors"][i].split(',')
#     average = np.mean([model[actor] for actor in actors], axis=0)
#     average_vector_list_cast.append(average)

# five_thousands['average_cast_vectors'] = average_vector_list_cast

# #------------------------------------------------------------------------------------------

# Step 2.1: Plot Embeddings

# with open('plot_summary_embeddings.txt', 'w', encoding="utf-8") as f:
#     for text in five_thousands["plot_summary"].tolist():
#         f.write(text + '\n')

# Skipgram model (updated)
# model = fasttext.train_unsupervised("plot_summary_embeddings.txt", model='skipgram', lr=0.05, dim=300, ws=6, epoch=500)
# model.save_model("model_file_plot.bin")

# Step 2.2

# model = fasttext.load_model('model_file_plot.bin')

# average_vector_list_plot = []
# for i in tqdm(range(0, len(five_thousands["plot_summary"]))):
#     plot = five_thousands["plot_summary"].str.replace(',', '').str.split(' ')[i]
#     average = np.mean([model[word] for word in plot], axis=0)
#     average_vector_list_plot.append(average)

# five_thousands['average_plot_vectors'] = average_vector_list_plot

# # Step 3

# my_embeddings_array = np.hstack([five_thousands['average_cast_vectors'].apply(pd.Series).values,
# five_thousands['average_plot_vectors'].apply(pd.Series).values])

# print(my_embeddings_array.shape)

# # Step 4: Pickle the word vectors

# with open('my_embeddings_array_02092019.pkl', 'wb') as f:
#     pickle.dump(my_embeddings_array, f)

In [27]:
# five_thousands.to_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_plot_cast_feature_embedded_05092019.pkl')

**Output:** "my_embeddings_array, which contains the vectors generated by the cast and the plot embeddings <p>
**Size:** 4779x400 <p> **Estimated Time of Completion:** 1 hour

In [None]:
# #------------------------------------------------------------------------------------------

# # Step 5.1: Combined Features Embeddings

# five_thousands = five_thousands.reset_index()

# with open('combined_features_embeddings.txt', 'w', encoding="utf-8") as f:
#     for text in five_thousands["combined_features"].tolist():
#         f.write(text + '\n')

# # Skipgram model (updated)
# model = fasttext.train_unsupervised("combined_features_embeddings.txt", model='skipgram', lr=0.05, dim=300, ws=6, epoch=500)
# model.save_model("model_file_combined_features.bin")

# #Step 5.2

# average_vector_list_combined_features = []
# for i in tqdm(range(0, len(five_thousands["combined_features"]))):
#     feature = five_thousands["combined_features"].str.split(' ')[i]
#     average = np.mean([model[word] for word in feature], axis=0)
#     average_vector_list_combined_features.append(average)

# five_thousands['average_combined_features'] = average_vector_list_combined_features

# my_embeddings_array_updated = np.hstack([five_thousands['average_combined_features'].apply(pd.Series).values])

# print(my_embeddings_array_updated.shape)

# # Step 6: Pickle the word vectors

# with open('my_embeddings_array_updated_02092019.pkl', 'wb') as f:
#     pickle.dump(my_embeddings_array_updated, f)

**Output:** "my_embeddings_array_updated, which contains the vectors generated by combined features embeddings <p>
**Size:** 4779x300 <p> **Estimated Time of Completion:** 1 hour </p>

Pickle the word embeddings calculated on step 5.

In [None]:
# with open('my_embeddings_array_updated_02092019.pkl', 'wb') as f:
#     pickle.dump(my_embeddings_array_updated, f)

Pickle the the final form of the dataset that contains also the cast, plot and combined features Word Embeddings. </p>

<b> This is the final form of the dataset that will be used for the movie recommendation engine! </b> </p>
<b> So the reader can directrly have a look on this dataset if (s)he wants to understand the structure of the data!</b>

In [None]:
# five_thousands.to_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_embedded_02092019.pkl')

# Unit 4: Movie Recommendation Algorithm <p>

<div style="text-align: justify"><b> Summary: </b> Create the algorithm that will process the user's inputs and will recommend three different movies.</div></br>
The process flow: <br>

<b> Phase 1: Get the user's input and transform it to the appropriate form.</b>
* Step 1: User gives a movie genre (i.e action, adventure, thriller, etc).<br>
* Step 2: User gives his favorite movie based on the genre (i.e action movie, adventure movie, thriller movie, etc).<br>
* Step 3: User gives reasons why (s)he likes the movie. (i.e I like "Spectre" because James Bond is my favorite 007 agent<br>

<b> Phase 2: Slice the dataset based on user's input</b>
* <div style="text-align: justify"> Step 4: Based on the movie genre the user gave, we filter the rows of the dataset that match the user's genre (i.e From 5000 movies we locate only the action movies which are ~1200 movies). So from those 1200 "action" movies we take all the other relative columns of the initial dataframe. </div>
* Step 5: Check with an IF/ELSE condition whether or not the movie provided by the user is in the movie list of the dataset or not. <br>
* Step 6: If the movie is in the movie_list then we followe the process of word embeddings and cosine distance. Otherwise, we do not use the cosine distance approach but rather only the scoring functions that we have created. <br>

<b> Phase 3: Recommend to the user the three most similar and highly scored movies </b>
* Step 7: Calcutlate the movie scoring and propose the three most highly scored movies. <br> <br>
    **Scoring parameter 1:** Primary genre. Award those movies were their first or second genre values are matched to the genre given by the user. <br> <br>
    **Scoring parameter 2:** IMDB Rating. The movie promoted first should have a higher imdb rating than the other two movies. Although, since we don't want the result to get biased from a high IMDB rate, we don't give a high weight to the IMDB rating. <br><br>
    <div style="text-align: justify"> <b>Scoring parameter 3:</b> Number of words. This is the most highly weighted scoring parameter of the two aforementioned parameters. To calculate the number of words, we use the input given my the user (based on his/her preferences about the movie). Then we use the column "Combined Features" and we count how many of the words given by the user are found in the column "Combined Features". For your information "Combined Features" is a column which contains the values of all the other columns (i.e actors, genres, plot_keywords, plot_summary and director's name) in one common text per movie. So the more the words found the higher the scoring of that movie. </div>

Make a list of the movies that exist in the dataset.

In [12]:
movies_list = five_thousands.movie_title.str.lower().str.replace('-', '').str.replace('The', '').str.replace(':', '').tolist()

with open('movie_title_list.pkl', 'wb') as f:
    pickle.dump(movies_list, f)

<b>Version 6 - The last version</b>

<b>Updates:</b> <br>
    1) Adding the IF/ELSE option so the user do not get a message error if the movie (s)he give as input does not exist in the dataset. <br>
    2) Filtering the word embeddings based on the movie genre given by the user.<br>
    3) Award the movies, which their genre belongs to columns: "genre_0" and "genre_1". So for example a movie that has the following genres: Action, Comedy, Drama and a following one that is: Mystery, Action, Romantic will be awarded higher that those movies that their genres is not first or second.<br>
    4) Ensemble the function "find_correct_movie", to match the movie given as input to the closest of the ones already existed in the dataset.

In [58]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances
import pickle

# Functions used --------------------------------------------------------------------------------------------------

def get_index_from_input_movie(user_input):
    return five_thousands[five_thousands.movie_title.str.lower().str.replace('-', '').replace('the', '').replace(':', '') == user_input]['index'].values[0]
    
def stop_and_stem(uncleaned_list):
    ps = PorterStemmer()
    stop = set(stopwords.words('english'))
    stopped_list = [i for i in uncleaned_list if i not in stop]
    stemmed_words = [ps.stem(word) for word in stopped_list]
    return stemmed_words

def search_words(row, list_of_words):
    ps = PorterStemmer()
    row = [ps.stem(x) for x in row]
    counter = 0
    for word in list_of_words:
        if word in row:
            counter = counter + 1
    return counter

def find_correct_genre(user_input, genre_list):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in genre_list:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_genre_index = scores_sim.index(min(scores_sim))
    correct_genre = genre_list[correct_genre_index].lower()
    return correct_genre

def find_correct_movie(user_input, movie_list):
    scores_similarity=[]

    for item in movie_list:
        ed = nltk.edit_distance(user_input, item)
        scores_similarity.append(ed)
    correct_movie_index = scores_similarity.index(min(scores_similarity))
    correct_movie = movie_list[correct_movie_index].lower()
    return correct_movie


# -----------------------------------------------------------------------------------------------


# Import the dataset

# five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_embedded_02092019.pkl')

five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_plot_cast_feature_embedded_05092019.pkl')

five_thousands = five_thousands.drop(['level_0', 'index'], axis = 1)

five_thousands = five_thousands.reset_index()

five_thousands['index'] = np.arange(0, len(five_thousands))


# -------------------------------------------------------------------------------------------------


# Create the movie_genre list with the unique types of genre 

movie_genre_first = five_thousands.genre_0.unique().tolist()
movie_genre_second = five_thousands.genre_1.unique().tolist()
movie_genre_third = five_thousands.genre_2.unique().tolist()
movie_genre_fourth = five_thousands.genre_3.unique().tolist()
movie_genre_fifth = five_thousands.genre_4.unique().tolist()
movie_genre_sixth = five_thousands.genre_5.unique().tolist()
movie_genre_seventh = five_thousands.genre_6.unique().tolist()
movie_genre_eight = five_thousands.genre_7.unique().tolist()

movie_genre_list = np.asarray(movie_genre_first + movie_genre_second + movie_genre_third + movie_genre_fourth + movie_genre_fifth + movie_genre_sixth + movie_genre_seventh + movie_genre_eight)
list(movie_genre_list.flatten())
movie_genre_list = list(set(movie_genre_list))

movie_genre_list = [x.lower() for x in movie_genre_list]


# -------------------------------------------------------------------------------------------------


# Phase 1: Get the user's input and transform it to the appropriate form


input_one = input("Give me a movie genre (i.e romance, action, adventure): ")
input_one = find_correct_genre(input_one.lower(), movie_genre_list)

input_movie = input("Give me the title of a movie: ").lower().replace('-', '').replace('The', '').replace(':', '')

input_two = input("Now think of some reasons why you like '{}':".format(input_movie)).lower().replace(',', '').replace('.', '').split(' ')
inputs_list = stop_and_stem(input_two)


# -------------------------------------------------------------------------------------------------


# Using the genre input given by the user, isolate those movies that match the given genre (i.e Action movies)

locked_frame = five_thousands.loc[(five_thousands.genre_0.str.lower() == input_one) | (five_thousands.genre_1.str.lower() == input_one) | (five_thousands.genre_2.str.lower() == input_one) | (five_thousands.genre_3.str.lower() == input_one) | (five_thousands.genre_4.str.lower() == input_one) | (five_thousands.genre_5.str.lower() == input_one) | (five_thousands.genre_6.str.lower() == input_one) | (five_thousands.genre_7.str.lower() == input_one)]

indexes_list = locked_frame.index.tolist()

locked_frame['index'] = np.arange(0, len(locked_frame))


# -------------------------------------------------------------------------------------------------


# Phase 2: Slice the dataset based on the user's input


# Check of the movie user gave is in the movie list of the dataset

with open('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\movie_title_list.pkl', 'rb') as f:
    movies_list = pickle.load(f)

if input_movie in movies_list:
    
    input_movie = find_correct_movie(input_movie_before, movies_list)

    # Isolate the movie plot of the movie provided from the user [If the movie is part of the dataset].

    movie_plot_new = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower().str.replace('-', '').str.replace('The', '').str.replace(':', '') == input_movie)].apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower())))).values[0]

    cleaned_movie_plot = stop_and_stem(movie_plot_new)

    plot_user_input_list = inputs_list + cleaned_movie_plot


    # -------------------------------------------------------------------------------------------------


    # Get the index of the movie provied by the user

    movie_index = get_index_from_input_movie(input_movie)

    # -------------------------------------------------------------------------------------------------


    # Get Features Embeddings based on the movie_index

    feature_vector = five_thousands['average_combined_features'][five_thousands['index'] == movie_index]

    # Get the Embeddings of the movies matched the user's genre (i.e of all the action movies)

    with open('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\my_embeddings_array_updated_02092019.pkl', 'rb') as f:
        my_embeddings_array_updated = pickle.load(f)

    feature_embeddings_array = my_embeddings_array_updated[indexes_list]
    

    # -------------------------------------------------------------------------------------


    # Concatenate the embeddings of the combined features

    selected_movie_vector = np.hstack([feature_vector.apply(pd.Series).values])

    # Calculate Cosine Distance

    cosine_dist = cosine_distances(feature_embeddings_array, selected_movie_vector.reshape(1,-1))

    # Get the similar movies & Slice the dataframe on the top 5 most similar movies to the movie given  by the user

    movie_return = np.argsort(cosine_dist, axis=None).tolist()[1:6]

    locked_frame_new = locked_frame[locked_frame['index'].isin(movie_return)]


    # -------------------------------------------------------------------------------------


    # Create two new columns "Unique Words" + "Number of words"

    # Create the new column of "UNIQUE" words of the combined features
    locked_frame_new['unique_words'] = locked_frame_new.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))

    # Create the column "Number of words" for each word contained in the unique words column
    locked_frame_new['number_of_words'] = locked_frame_new.unique_words.apply(search_words, args=(plot_user_input_list,))


    # -------------------------------------------------------------------------------------

    
    # Phase 3: Recommend to the user the three most similar and highly scored movies 
    
    
    # Calculate the movie score

    primary_genre = list([(locked_frame.genre_0.str.lower() == input_one)*0.1, (locked_frame.genre_1.str.lower() == input_one)*0.1])

    locked_frame_new['movie_score'] = 0.3*locked_frame_new.updated_rating.astype(float) + 0.5*locked_frame_new.number_of_words

    locked_frame_new['movie_score'] = locked_frame_new['movie_score'] + primary_genre[0] + primary_genre[1]


    # -------------------------------------------------------------------------------------


    # Give to the user the proper movie recommendation

    top_three_rows = locked_frame_new.nlargest(3, 'movie_score')
    
    top_three_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)

    # Recommend the movie

    recommendations_list = top_three_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()
    
    print(recommendations_list)
    
else:
    
    plot_user_input_list = inputs_list
    
    locked_frame['unique_words'] = locked_frame.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))

    locked_frame['number_of_words'] = locked_frame.unique_words.apply(search_words, args=(plot_user_input_list,))
    
    
    # Fase 3: Recommend to the user the three most similar and highly scored movies
    
    
    primary_genre = list([(locked_frame.genre_0.str.lower() == input_one)*0.1, (locked_frame.genre_1.str.lower() == input_one)*0.1])

    locked_frame['movie_score'] = 0.3*locked_frame.updated_rating.astype(float) + 0.5*locked_frame.number_of_words
    
    locked_frame['movie_score'] = locked_frame['movie_score'] + primary_genre[0] + primary_genre[1]
    
    
    # Give to the user the proper movie recommendation

    top_three_rows = locked_frame.nlargest(3, 'movie_score')
    
    top_three_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)

    
    # Recommend the movie

    recommendations_list = top_three_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()
    
    print(recommendations_list)

Give me a movie genre (i.e romance, action, adventure): adventure
Give me the title of a movie: spectre
Now think of some reasons why you like 'spectre':i love spies


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_gui

[['Skyfall', '7.7', 'http://www.imdb.com/title/tt1074638/?ref_=fn_tt_tt_1'], ['Quantum of Solace', '6.6', 'http://www.imdb.com/title/tt0830515/?ref_=fn_tt_tt_1'], ['The Adventures of Elmo in Grouchland', '5.8', 'http://www.imdb.com/title/tt0159421/?ref_=fn_tt_tt_1']]


**Version 5**

<b> Updates: </b> <br>
1) Use the word embeddings that belong only to the combined features column. So, we did not use the cast and plot embeddings. Because the results were better than before.

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Functions used --------------------------------------------------------------------------------------------------

def get_index_from_input_movie(user_input):
    return five_thousands[five_thousands.movie_title.str.lower().str.replace('-', '').replace('The', '').replace(':', '') == user_input]['index'].values[0]
    
def stop_and_stem(uncleaned_list):
    ps = PorterStemmer()
    stop = set(stopwords.words('english'))
    stopped_list = [i for i in uncleaned_list if i not in stop]
    stemmed_words = [ps.stem(word) for word in stopped_list]
    return stemmed_words

def search_words(row, list_of_words):
    ps = PorterStemmer()
    row = [ps.stem(x) for x in row]
    counter = 0
    for word in list_of_words:
        if word in row:
            counter = counter + 1
    return counter

def find_correct_genre(user_input, genre_list):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in genre_list:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_genre_index = scores_sim.index(min(scores_sim))
    correct_genre = genre_list[correct_genre_index].lower()
    print(correct_genre)
    return correct_genre


# -----------------------------------------------------------------------------------------------


# Import the dataset

five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_embedded_02092019.pkl')

five_thousands = five_thousands.drop(['level_0', 'index'], axis = 1)

five_thousands = five_thousands.reset_index()

five_thousands['index'] = np.arange(0, len(five_thousands))


# -------------------------------------------------------------------------------------------------


# Create the movie_genre list with the unique types of genre 

movie_genre_first = five_thousands.genre_0.unique().tolist()
movie_genre_second = five_thousands.genre_1.unique().tolist()
movie_genre_third = five_thousands.genre_2.unique().tolist()
movie_genre_fourth = five_thousands.genre_3.unique().tolist()
movie_genre_fifth = five_thousands.genre_4.unique().tolist()
movie_genre_sixth = five_thousands.genre_5.unique().tolist()
movie_genre_seventh = five_thousands.genre_6.unique().tolist()
movie_genre_eight = five_thousands.genre_7.unique().tolist()

movie_genre_list = np.asarray(movie_genre_first + movie_genre_second + movie_genre_third + movie_genre_fourth + movie_genre_fifth + movie_genre_sixth + movie_genre_seventh + movie_genre_eight)
list(movie_genre_list.flatten())
movie_genre_list = list(set(movie_genre_list))

movie_genre_list = [x.lower() for x in movie_genre_list]


# -------------------------------------------------------------------------------------------------


# Get inputs from the user and clean them

input_one = input("Give me a movie genre (i.e romance, action, adventure): ")
input_one = find_correct_genre(input_one.lower(), movie_genre_list)

input_movie = input("Give me the title of a movie: ").lower().replace('-', '').replace('The', '').replace(':', '')

input_two = input("Now think of some reasons why you like '{}':".format(input_movie)).lower().replace(',', '').replace('.', '').split(' ')
inputs_list = stop_and_stem(input_two)


# -------------------------------------------------------------------------------------------------


# Using the genre input given by the user, isolate those movies that match the given genre (i.e Action movies)

locked_frame = five_thousands.loc[(five_thousands.genre_0.str.lower() == input_one) | (five_thousands.genre_1.str.lower() == input_one) | (five_thousands.genre_2.str.lower() == input_one) | (five_thousands.genre_3.str.lower() == input_one) | (five_thousands.genre_4.str.lower() == input_one) | (five_thousands.genre_5.str.lower() == input_one) | (five_thousands.genre_6.str.lower() == input_one) | (five_thousands.genre_7.str.lower() == input_one)]

indexes_list = locked_frame.index.tolist()

locked_frame['index'] = np.arange(0, len(locked_frame))


# -------------------------------------------------------------------------------------------------


# Isolate the movie plot of the movie provided from the user [If the movie is part of the dataset].

movie_plot_new = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower().str.replace('-', '').str.replace('The', '').str.replace(':', '') == input_movie)].apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower())))).values[0]

cleaned_movie_plot = stop_and_stem(movie_plot_new)

plot_user_input_list = inputs_list + cleaned_movie_plot


# -------------------------------------------------------------------------------------------------


# Get the index of the movie provied by the user

movie_index = get_index_from_input_movie(input_movie)

# -------------------------------------------------------------------------------------------------


# Get Features Embeddings based on the movie_index

feature_vector = five_thousands['average_combined_features'][five_thousands['index'] == movie_index]


# Get the Embeddings of the movies matched the user's genre (i.e of all the action movies)

with open('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\my_embeddings_array_updated_02092019.pkl', 'rb') as f:
    my_embeddings_array_updated = pickle.load(f)

genre_embeddings_array = my_embeddings_array_updated[indexes_list]


# -------------------------------------------------------------------------------------


# Concatenate the embeddings of the combined features

selected_movie_vector = np.hstack([feature_vector.apply(pd.Series).values])

# Calculate Cosine Distance

cosine_dist = cosine_distances(genre_embeddings_array, selected_movie_vector.reshape(1,-1))

# Get the similar movies & Slice the dataframe on the top 5 most similar movies to the movie given  by the user

movie_return = np.argsort(cosine_dist, axis=None).tolist()[1:6]

locked_frame_new = locked_frame[locked_frame['index'].isin(movie_return)]


# -------------------------------------------------------------------------------------


# Create two new columns "Unique Words" + "Number of words"

# Create the new column of "UNIQUE" words of the combined features
locked_frame_new['unique_words'] = locked_frame_new.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))

# Create the column "Number of words" for each word contained in the unique words column
locked_frame_new['number_of_words'] = locked_frame_new.unique_words.apply(search_words, args=(plot_user_input_list,))


# -------------------------------------------------------------------------------------


# Calculate the movie score

primary_genre = list((locked_frame_new.genre_0.str.lower() == input_one)*0.2)

locked_frame_new['movie_score'] = 0.05*locked_frame_new.updated_rating.astype(float) + 0.75*locked_frame_new.number_of_words

locked_frame_new['movie_score'] = locked_frame_new['movie_score'] + primary_genre


# -------------------------------------------------------------------------------------


# Give to the user the proper movie recommendation

top_three_rows = locked_frame_new.nlargest(3, 'movie_score')
top_three_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)

# Recommend the movie

recommendations_list = top_three_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()
print(recommendations_list)

**Version 4**

<b> Updates: </b><br>
1) Create the word embeddings of the cast, the plot_summary and the combined features using the FastText library.

In [59]:
# Code snipset that wiil provide the response

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_index_from_input_movie(user_input):
    return five_thousands[five_thousands.movie_title.str.lower().str.replace('-', '').replace('The', '').replace(':', '') == user_input]['index'].values[0]
    
def stop_and_stem(uncleaned_list):
    ps = PorterStemmer()
    stop = set(stopwords.words('english'))
    stopped_list = [i for i in uncleaned_list if i not in stop]
    stemmed_words = [ps.stem(word) for word in stopped_list]
    return stemmed_words

def search_words(row, list_of_words):
    ps = PorterStemmer()
    row = [ps.stem(x) for x in row]
    counter = 0
    for word in list_of_words:
        if word in row:
            counter = counter + 1
    return counter

def find_correct_genre(user_input, genre_list):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in genre_list:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_genre_index = scores_sim.index(min(scores_sim))
    correct_genre = genre_list[correct_genre_index].lower()
    print(correct_genre)
    return correct_genre

# Create the movie_genre list with the unique types of genre 

five_thousands = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_embedded_02092019.pkl')

five_thousands = five_thousands.drop(['level_0', 'index'], axis = 1)

five_thousands = five_thousands.reset_index()

five_thousands['index'] = np.arange(0, len(five_thousands))

movie_genre_first = five_thousands.genre_0.unique().tolist()
movie_genre_second = five_thousands.genre_1.unique().tolist()
movie_genre_third = five_thousands.genre_2.unique().tolist()
movie_genre_fourth = five_thousands.genre_3.unique().tolist()
movie_genre_fifth = five_thousands.genre_4.unique().tolist()
movie_genre_sixth = five_thousands.genre_5.unique().tolist()
movie_genre_seventh = five_thousands.genre_6.unique().tolist()
movie_genre_eight = five_thousands.genre_7.unique().tolist()

movie_genre_list = np.asarray(movie_genre_first + movie_genre_second + movie_genre_third + movie_genre_fourth + movie_genre_fifth + movie_genre_sixth + movie_genre_seventh + movie_genre_eight)
list(movie_genre_list.flatten())
movie_genre_list = list(set(movie_genre_list))

movie_genre_list = [x.lower() for x in movie_genre_list]

#-----------------------------------------------------------------------------------

# Get inputs from the user and clean them

input_one = input("Give me a movie genre (i.e romance, action, adventure): ")
input_one = find_correct_genre(input_one.lower(), movie_genre_list)

input_movie = input("Give me the title of a movie: ").lower().replace('-', '').replace('The', '').replace(':', '')

input_two = input("Now think of some reasons why you like '{}':".format(input_movie)).lower().replace(',', '').replace('.', '').split(' ')
inputs_list = stop_and_stem(input_two)

# Calculate the score functions ("IMDB score" and "Included number of words")

# Process User's Input 

#movie_plot = five_thousands['plot_summary'].loc[(five_thousands.movie_title.str.lower() == input_movie)]

movie_plot_new = five_thousands['plot_summary'].loc[(five_thousands.movie_title.str.lower().str.replace('-', '').str.replace('The', '').str.replace(':', '') == input_movie)].apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower())))).values[0]
cleaned_movie_plot = stop_and_stem(movie_plot_new)
plot_user_input_list = inputs_list + cleaned_movie_plot

# ----------------------------------------------------------------------------------

# Similarity

# Get movie_index
movie_index = get_index_from_input_movie(input_movie)

# Calculate distance based on the trained word embeddings

# Get Cast Embeddings

cast_vector = five_thousands['average_cast_vectors'][five_thousands['index'] == movie_index]

# Get Plot Embeddings

plot_vector = five_thousands['average_plot_vectors'][five_thousands['index'] == movie_index]

# Get Features Embeddings

feature_vector = five_thousands['average_combined_features'][five_thousands['index'] == movie_index]

# -------------------------------------------------------------------------------------

# Concatenate the embeddings of cast and plot

selected_movie_vector = np.hstack([feature_vector.apply(pd.Series).values])

# Calculate Cosine Distance

cosine_dist = cosine_distances(my_embeddings_array_updated, selected_movie_vector.reshape(1,-1))

# Get the similar movies

movie_return = np.argsort(cosine_dist, axis=None).tolist()[1:5]
five_thousands_new = five_thousands[five_thousands['index'].isin(movie_return)]

# -----------------------------------------------------------------------------------

# Create two new columns

# Create the new column of "UNIQUE" words of the combined features
five_thousands_new['unique_words'] = five_thousands_new.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))

# Create the column "Number of words" for each word contained in the unique words column
five_thousands_new['number_of_words'] = five_thousands_new.unique_words.apply(search_words, args=(plot_user_input_list,))

# -----------------------------------------------------------------------------------

# Calculate the movie score
primary_genre = list((five_thousands_new.genre_0.str.lower() == input_one)*0.2)

five_thousands_new['movie_score'] = 0.05*five_thousands_new.updated_rating.astype(float) + 0.75*five_thousands_new.number_of_words
five_thousands_new['movie_score'] = five_thousands_new['movie_score'] + primary_genre

# Give to the user the proper movie recommendation

top_three_rows = five_thousands_new.nlargest(3, 'movie_score')
top_three_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)

recommendations_list = top_three_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()
print(recommendations_list)

**Version 3**

<b> Updates: </b><br>
1) Use the cosine distance instead of cosine similarity.<br>
2) Use the value of the cosine distance as a movie scoring factor. <br>

In [None]:
# Code snipset that wiil provide the response

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_index_from_input_movie(user_input):
    return locked_frame[locked_frame.movie_title.str.lower() == user_input]['index'].values[0]
    
def stop_and_stem(uncleaned_list):
    ps = PorterStemmer()
    stop = set(stopwords.words('english'))
    stopped_list = [i for i in uncleaned_list if i not in stop]
    stemmed_words = [ps.stem(word) for word in stopped_list]
    return stemmed_words

def search_words(row, list_of_words):
    ps = PorterStemmer()
    row = [ps.stem(x) for x in row]
    counter = 0
    for word in list_of_words:
        if word in row:
            counter = counter + 1
    return counter

def find_correct_genre(user_input, genre_list):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in genre_list:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_genre_index = scores_sim.index(min(scores_sim))
    correct_genre = genre_list[correct_genre_index].lower()
    print(correct_genre)
    return correct_genre

# Create the movie_genre list with the unique types of genre 

five_thousands_old = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_model.pkl')

five_thousands_old = five_thousands_old.reset_index()

five_thousands = five_thousands_old.drop_duplicates(subset=['movie_imdb_link'], keep='first')

five_thousands['movie_title'] = five_thousands['movie_title'].apply(lambda x: re.sub('\s+', ' ', x).strip())

movie_genre_first = five_thousands.genre_0.unique().tolist()
movie_genre_second = five_thousands.genre_1.unique().tolist()
movie_genre_third = five_thousands.genre_2.unique().tolist()
movie_genre_fourth = five_thousands.genre_3.unique().tolist()
movie_genre_fifth = five_thousands.genre_4.unique().tolist()
movie_genre_sixth = five_thousands.genre_5.unique().tolist()
movie_genre_seventh = five_thousands.genre_6.unique().tolist()
movie_genre_eight = five_thousands.genre_7.unique().tolist()

movie_genre_list = np.asarray(movie_genre_first + movie_genre_second + movie_genre_third + movie_genre_fourth + movie_genre_fifth + movie_genre_sixth + movie_genre_seventh + movie_genre_eight)
list(movie_genre_list.flatten())
movie_genre_list = list(set(movie_genre_list))

movie_genre_list = [x.lower() for x in movie_genre_list]

# Get inputs from the user and clean them

input_one = input("Give me a movie genre (i.e romance, action, adventure): ")
input_one = find_correct_genre(input_one.lower(), movie_genre_list)

input_movie = input("Give me the title of a(n) {} movie: ".format(input_one)).lower().replace('-', '').replace('The', '').replace(':', '')

input_two = input("Now think of some reasons why you like '{}':".format(input_movie)).lower().split(' ')
inputs_list = stop_and_stem(input_two)

# Calculate the score functions ("IMDB score" and "Included number of words")

# Process User's Input 

locked_frame = five_thousands.loc[(five_thousands.genre_0.str.lower() == input_one) | (five_thousands.genre_1.str.lower() == input_one) | (five_thousands.genre_2.str.lower() == input_one) | (five_thousands.genre_3.str.lower() == input_one) | (five_thousands.genre_4.str.lower() == input_one) | (five_thousands.genre_5.str.lower() == input_one) | (five_thousands.genre_6.str.lower() == input_one) | (five_thousands.genre_7.str.lower() == input_one)]

locked_frame['index'] = np.arange(0, len(locked_frame))

movie_plot = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower() == input_movie)]

if len(movie_plot) == 0:
    plot_user_input_list = inputs_list

else:
    movie_plot_new = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower().str.replace('-', '').replace('The', '').replace(':', '') == input_movie)].apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower())))).values[0]
    cleaned_movie_plot = stop_and_stem(movie_plot_new)
    plot_user_input_list = inputs_list + cleaned_movie_plot

# Process the similarity of combined features

# Get movie_index
movie_index = get_index_from_input_movie(input_movie)

# Calculate distance
cv = CountVectorizer()
count_matrix = cv.fit_transform(locked_frame['combined_features'])
cosine_sim = cosine_similarity(count_matrix)
cosine_distance = 1 - cosine_sim

# Get the similar movies
similar_movies = list(enumerate(cosine_distance[movie_index]))

similar_movies_list = []
sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=False)

for element in range(0, len(similar_movies)):
    if similar_movies[element][1] <= 0.76:
        similar_movies_list.append(similar_movies[element])
        
similar_movies_list = similar_movies_list[1:]

movie_return=[]
for index in range(0, len(similar_movies_list)):
    movie_return.append(similar_movies_list[index][0])

similarity_score = []
for score in range(0, len(similar_movies_list)):
    similarity_score.append(similar_movies_list[score][1])

locked_frame_new = locked_frame[locked_frame['index'].isin(movie_return)]
locked_frame_new['unique_words'] = locked_frame_new.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))
locked_frame_new.loc[:, ('similarity_score')] = similarity_score

locked_frame_new['number_of_words'] = locked_frame_new.unique_words.apply(search_words, args=(plot_user_input_list,))

# Calculate the movie score
primary_genre = list((locked_frame_new.genre_0.str.lower() == input_one)*0.1)

locked_frame_new['movie_score'] = 0.05*locked_frame_new.updated_rating.astype(float) + 0.65*locked_frame_new.number_of_words + (-0.2*locked_frame_new.similarity_score)
locked_frame_new['movie_score'] = locked_frame_new['movie_score'] + primary_genre

# Give to the user the proper movie recommendation

four_rows = locked_frame_new.nlargest(4, 'movie_score')
four_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)

if len(movie_plot) == 0:
    recommendations_list = four_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()[0:3]

else:
    recommendations_list = four_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()[0:3]

**Version 2**

<b> Updates </b><br>
1) Create the first IF/ELSE statement to filter if the movie existed in the dataset or not. <br>

In [None]:
# Code snipset that wiil provide the response

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def stop_and_stem(uncleaned_list):
    ps = PorterStemmer()
    stop = set(stopwords.words('english'))
    stopped_list = [i for i in uncleaned_list if i not in stop]
    stemmed_words = [ps.stem(word) for word in stopped_list]
    return stemmed_words

def search_words(row, list_of_words):
    ps = PorterStemmer()
    row = [ps.stem(x) for x in row]
    counter = 0
    for word in list_of_words:
        if word in row:
            counter = counter + 1
    return counter

def find_correct_genre(user_input, genre_list):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in genre_list:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_genre_index = scores_sim.index(min(scores_sim))
    correct_genre = genre_list[correct_genre_index].lower()
    print(correct_genre)
    return correct_genre

# Create the movie_genre list with the unique types of genre 

five_thousands_old = pd.read_pickle('C:\\Users\\dq186sy\\Desktop\\Big Data Content Analytics\\Movie Recommendation System\\five_thousands_model.pkl')

five_thousands = five_thousands_old.drop_duplicates(subset=['movie_imdb_link'], keep='first')

movie_genre_first = five_thousands.genre_0.unique().tolist()
movie_genre_second = five_thousands.genre_1.unique().tolist()
movie_genre_third = five_thousands.genre_2.unique().tolist()
movie_genre_fourth = five_thousands.genre_3.unique().tolist()
movie_genre_fifth = five_thousands.genre_4.unique().tolist()
movie_genre_sixth = five_thousands.genre_5.unique().tolist()
movie_genre_seventh = five_thousands.genre_6.unique().tolist()
movie_genre_eight = five_thousands.genre_7.unique().tolist()

movie_genre_list = np.asarray(movie_genre_first + movie_genre_second + movie_genre_third + movie_genre_fourth + movie_genre_fifth + movie_genre_sixth + movie_genre_seventh + movie_genre_eight)
list(movie_genre_list.flatten())
movie_genre_list = list(set(movie_genre_list))

movie_genre_list = [x.lower() for x in movie_genre_list]

# Get inputs from the user and clean them

input_one = input("Give me a movie genre (i.e romance, action, adventure): ")
input_one = find_correct_genre(input_one.lower(), movie_genre_list)

input_movie = input("Give me the title of a(n) {} movie: ".format(input_one)).lower().replace('-', '')

input_two = input("Now think of some reasons why you like '{}':".format(input_movie)).lower().split(' ')
inputs_list = stop_and_stem(input_two)

# Calculate the score functions ("IMDB score" and "Included number of words")

# Process User's Input 

locked_frame = five_thousands.loc[(five_thousands.genre_0.str.lower() == input_one) | (five_thousands.genre_1.str.lower() == input_one) | (five_thousands.genre_2.str.lower() == input_one) | (five_thousands.genre_3.str.lower() == input_one) | (five_thousands.genre_4.str.lower() == input_one) | (five_thousands.genre_5.str.lower() == input_one) | (five_thousands.genre_6.str.lower() == input_one) | (five_thousands.genre_7.str.lower() == input_one)]
movie_plot = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower() == input_movie)]

if len(movie_plot) == 0:
    plot_user_input_list = inputs_list

else:
    movie_plot_new = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower().str.replace('-', '') == input_movie)].apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower())))).values[0]
    cleaned_movie_plot = stop_and_stem(movie_plot_new)
    plot_user_input_list = inputs_list + cleaned_movie_plot
    
# movie_plot_new = locked_frame['plot_summary'].loc[(locked_frame.movie_title.str.lower().str.replace('-', '') == input_movie)].apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower())))).values[0]
# cleaned_movie_plot = stop_and_stem(movie_plot_new)
# plot_user_input_list = inputs_list + cleaned_movie_plot

locked_frame['unique_words'] = locked_frame.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))

# Process the similarity of combined features

cv = CountVectorizer()
count_matrix = cv.fit_transform(locked_frame['combined_features'])
cosine_sim = cosine_similarity(count_matrix)

locked_frame['std_similarity'] = [np.std(x) for x in cosine_sim]
locked_frame['median_similarity'] = [np.median(x) for x in cosine_sim]

locked_frame['number_of_words'] = locked_frame.unique_words.apply(search_words, args=(plot_user_input_list,))

# Calculate the movie score
primary_genre = list((locked_frame.genre_0.str.lower() == input_one)*0.1)

locked_frame['movie_score'] = 0.05*locked_frame.updated_rating.astype(float) + 0.75*locked_frame.number_of_words + 0.05*locked_frame.std_similarity + 0.05*locked_frame.median_similarity
locked_frame['movie_score'] = locked_frame['movie_score'] + primary_genre

# Give to the user the proper movie recommendation

four_rows = locked_frame.nlargest(4, 'movie_score')
four_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)

if len(movie_plot) == 0:
    recommendations_list = four_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()[0:3]

else:
    recommendations_list = four_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()[1:]

**Version 1**

1) Use the cosine distance to calculate the similarity amongs the movies. <br>
2) Use 5 movie scores. Later this number will drop.<br>
3) It is mendatory the movie to exist in the dataset otherwise the algorithm's execution will break.<br>

In [None]:
# Code snipset that wiil provide the response

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def stop_and_stem(uncleaned_list):
    ps = PorterStemmer()
    stop = set(stopwords.words('english'))
    stopped_list = [i for i in uncleaned_list if i not in stop]
    stemmed_words = [ps.stem(word) for word in stopped_list]
    return stemmed_words

def search_words(row, list_of_words):
    ps = PorterStemmer()
    row = [ps.stem(x) for x in row]
    counter = 0
    for word in list_of_words:
        if word in row:
            counter = counter + 1
    return counter

def find_correct_genre(user_input, genre_list):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in genre_list:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_genre_index = scores_sim.index(min(scores_sim))
    correct_genre = genre_list[correct_genre_index].lower()
    print(correct_genre)
    return correct_genre

# Create the movie_genre list with the unique types of genre 

movie_genre_first = five_thousands.genre_0.unique().tolist()
movie_genre_second = five_thousands.genre_1.unique().tolist()
movie_genre_third = five_thousands.genre_2.unique().tolist()
movie_genre_fourth = five_thousands.genre_3.unique().tolist()
movie_genre_fifth = five_thousands.genre_4.unique().tolist()
movie_genre_sixth = five_thousands.genre_5.unique().tolist()
movie_genre_seventh = five_thousands.genre_6.unique().tolist()
movie_genre_eight = five_thousands.genre_7.unique().tolist()


movie_genre_list = np.asarray(movie_genre_first + movie_genre_second + movie_genre_third + movie_genre_fourth + movie_genre_fifth + movie_genre_sixth + movie_genre_seventh + movie_genre_eight)
list(movie_genre_list.flatten())
movie_genre_list = list(set(movie_genre_list))

movie_genre_list = [x.lower() for x in movie_genre_list]

# Get inputs from the user and clean them

input_one = input("Give me a movie genre (i.e romance, action, adventure): ")
input_one = find_correct_genre(input_one.lower(), movie_genre_list)

input_two = input("Think of a(n) {} movie that you like and write a reason that you like this movie: ".format(input_one)).lower().split(' ')
inputs_list = stop_and_stem(input_two)

# Calculate the score functions ("IMDB score" and "Included number of words")

# Process User's Input 

locked_frame = five_thousands.loc[(five_thousands.genre_0.str.lower() == input_one) | (five_thousands.genre_1.str.lower() == input_one) | (five_thousands.genre_2.str.lower() == input_one) | (five_thousands.genre_3.str.lower() == input_one) | (five_thousands.genre_4.str.lower() == input_one) | (five_thousands.genre_5.str.lower() == input_one) | (five_thousands.genre_6.str.lower() == input_one) | (five_thousands.genre_7.str.lower() == input_one)]
locked_frame['unique_words'] = locked_frame.combined_features.apply(lambda x: list(set(re.split(' |,|\n', x.strip().lower()))))

# Process the similarity of combined features

cv = CountVectorizer()
count_matrix = cv.fit_transform(locked_frame['combined_features'])
cosine_sim = cosine_similarity(count_matrix)

locked_frame['std_similarity'] = [np.std(x) for x in cosine_sim]
locked_frame['median_similarity'] = [np.median(x) for x in cosine_sim]

locked_frame['number_of_words'] = locked_frame.unique_words.apply(search_words, args=(inputs_list,))

# Calculate the movie score
primary_genre = list((locked_frame.genre_0.str.lower() == input_one)*0.1)

locked_frame['movie_score'] = 0.05*locked_frame.updated_rating.astype(float) + 0.75*locked_frame.number_of_words + 0.05*locked_frame.std_similarity + 0.05*locked_frame.median_similarity
locked_frame['movie_score'] = locked_frame['movie_score'] + primary_genre

# Give to the user the proper movie recommendation

three_rows = locked_frame.nlargest(3, 'movie_score')
three_rows.rename(columns={'movie_title':'Movie Title', 'updated_rating':'IMDB Rate', 'movie_imdb_link':"Movie's Link"}, inplace=True)
recommendations_list = three_rows.loc[:, ['Movie Title', 'IMDB Rate', "Movie's Link"]].values.tolist()
recommendations_list

# Part 2: A different approach on movie recommendation

Creating a Movie Recommendation Engine using Wikipedia links and Neural Network Embeddings

Inspired by: https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9

In [None]:
import sys
import tensorflow as tf
keras_home = '/Users/sotirisbaratsas/.keras/datasets/'
from keras.utils import get_file
from keras.utils.data_utils import get_file
import pandas as pd
import gc
import json
import requests
import os
import time
import xml.sax
import subprocess
import re
import mwparserfromhell
import json
from collections import Counter
from keras.models import Model
from keras.layers import Embedding, Input, Reshape
from keras.layers.merge import Dot
import random

import numpy as np
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

Downloading the wikipedia dump file
We start by downloading the Wikipedia dump file that suits our problem from https://dumps.wikimedia.org/enwiki/20190820/. The file we want is the one named "-pages-articles.xml.bz2". If we wanted to find a specific article, the multistream version (which is indexed) would be better.
Since the filesize is quite large (15.3 GB compressed), we can also work with a few of the partitioned files, which are listed just below the main one.

In [None]:
# path = "/Users/sotirisbaratsas/.keras/datasets/enwiki-20190820-pages-articles.xml.bz2"

To parse the XML pages, we will use a tool called "Simple API for XML" (SAX) following the documentation found at http://pyxml.sourceforge.net/topics/howto/section-SAX.html.
We are interested in 2 tags for each article in the XML: the <title> tag, which includes the title of the wikipedia article and <text> tag that containts the information of the wikipedia article.

In [None]:
class WikiXmlHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._movies = []
        self._curent_tag = None

    def characters(self, content):
        if self._curent_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        if name in ('title', 'text'):
            self._curent_tag = name
            self._buffer = []

    def endElement(self, name):
        if name == self._curent_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            movie = process_article(**self._values)
            if movie:
                self._movies.append(movie)

* Next, we define a function to help us identify whether a wikipedia article is a movie.
* To parse the article we will use a tool called `mwparserfromhell`, specifically designed to parse wikipedia.
* We are in luck, because Wikipedia uses specific templates to help its authors and contributors keep an aligned structure across similar content (e.g. all movies). These templates are called `Infobox templates` and the one we want is called `Infobox film`.
* We will program our parser to locate this piece of code in an article and process each movie article, saving the title, the movie properties (e.g. poster, director, producers, etc), as well as the links to other wikipedia articles (i.e. internal links) and external URLs (i.e. external links).

In [None]:
def process_article(title, text):
    wikicode = mwparserfromhell.parse(text)
    film = next((template for template in wikicode.filter_templates() 
                 if template.name.strip().lower() == 'infobox film'), None)
    if film:
        properties = {param.name.strip_code().strip(): param.value.strip_code().strip() 
                      for param in film.params
                      if param.value.strip_code().strip()
                     }
        links = [x.title.strip_code().strip() for x in wikicode.filter_wikilinks()]
        return (title, properties, links)

* We can now run the parser.
* We use `bzcat` to iterate through the compressed file (the uncompressed file is HUGE!).
* This might take a while, depending on the number of articles we have parsed.
* We could try parsing the partitioned files in parallel to make this quicker.

In [None]:
parser = xml.sax.make_parser()
handler = WikiXmlHandler()
parser.setContentHandler(handler)
for line in subprocess.Popen(['bzcat'], stdin=open(path), stdout=subprocess.PIPE).stdout:
    try:
        parser.feed(line)
    except StopIteration:
        break

In [None]:
with open('/Users/sotirisbaratsas/.keras/datasets/wp_movies.ndjson', 'wt') as fout:
    for movie in handler._movies:
         fout.write(json.dumps(movie) + '\n')

* Our next step is to load the file with the movies (about 3130 movies) and create a separate list with all the movie titles. We will use this to match a user input to an actual title.
* In the process of doing that, we observe that a lot of movie titles are followed by a disambiguation note - e.g. Titanic (1997 film) - to help wikipedia users distinguish articles with the same name.
* Using a regular expression, we will remove this part of the string from our movie titles list, since a user is unlikely to provide the movie name in this format.

In [None]:
with open('/Users/sotirisbaratsas/.keras/datasets/wp_movies.ndjson') as fin:
    movies = [json.loads(l) for l in fin]

In [None]:
movie_titles = []
for i in range(0, len(movies)):
    x = re.sub('\ \(.*?\)', '', movies[i][0])
    x = re.sub('the ', '', x.lower())
    movie_titles.append(x)
    
len(movie_titles)

In [None]:
movie_titles[0:9]

* Next, we want to get an idea about the links that each movie has.
* We will create a counter for the links in each movie.

In [None]:
link_counts = Counter()
for movie in movies:
    link_counts.update(movie[2])
# Let's take a look at the 20 most common links for all movies.
link_counts.most_common(20)

To make things easier, we will keep only the link pairs that appear more than 3 times.

In [None]:
top_links = [link for link, c in link_counts.items() if c >= 3]
link_to_idx = {link: idx for idx, link in enumerate(top_links)}
movie_to_idx = {movie[0]: idx for idx, movie in enumerate(movies)}
pairs = []
for movie in movies:
    pairs.extend((link_to_idx[link], movie_to_idx[movie[0]]) for link in movie[2] if link in link_to_idx)
pairs_set = set(pairs)
len(pairs), len(top_links), len(movie_to_idx)

As we can see, from the 321.483 pairs of links that appear, we will keep only 31.026 for a total of 3130 unique movies, which should be enough to build our embeddings.

### Creating our embeddings

Now, for the most important step, we need to define the way our embeddings will work. The idea is simple: we will feed the model with pairs of movies and links {movie, link} and train the model to define embeddings of size=50 variables. Our hope is that the model will place movies that link to the same pages closer together and that placement will reflect the similarity of the movies. Moreover, the model will be trained to understand if certain links are similar to each other (e.g. "Category: Action movies" will be similar to "Category: Adventure movies" since a lot of movies will have both links).

* We will feed the Movie and the Link as the Input Layer. Each movie and each link will be represented by an integer value.

In [None]:
def movie_embedding_model(embedding_size=50):
    link = Input(name='link', shape=(1,))
    movie = Input(name='movie', shape=(1,))
    link_embedding = Embedding(name='link_embedding', 
                               input_dim=len(top_links), 
                               output_dim=embedding_size)(link)
    movie_embedding = Embedding(name='movie_embedding', 
                                input_dim=len(movie_to_idx), 
                                output_dim=embedding_size)(movie)
    dot = Dot(name='dot_product', normalize=True, axes=2)([link_embedding, movie_embedding])
    merged = Reshape((1,))(dot)
    model = Model(inputs=[link, movie], outputs=[merged])
    model.compile(optimizer='nadam', loss='mse')
    return model

model = movie_embedding_model()
model.summary()

In [None]:
random.seed(123)

def batchifier(pairs, positive_samples=50, negative_ratio=10):
    batch_size = positive_samples * (1 + negative_ratio)
    batch = np.zeros((batch_size, 3))
    while True:
        for idx, (link_id, movie_id) in enumerate(random.sample(pairs, positive_samples)):
            batch[idx, :] = (link_id, movie_id, 1)
        idx = positive_samples
        while idx < batch_size:
            movie_id = random.randrange(len(movie_to_idx))
            link_id = random.randrange(len(top_links))
            if not (link_id, movie_id) in pairs_set:
                batch[idx, :] = (link_id, movie_id, -1)
                idx += 1
        np.random.shuffle(batch)
        yield {'link': batch[:, 0], 'movie': batch[:, 1]}, batch[:, 2]

next(batchifier(pairs, positive_samples=3, negative_ratio=2))

In [None]:
positive_samples_per_batch = 512

model.fit_generator(
    batchifier(pairs, positive_samples=positive_samples_per_batch, negative_ratio=10),
    epochs=15,
    steps_per_epoch=len(pairs) // positive_samples_per_batch,
    verbose=2
)

In [None]:
user_input = input()

def find_correct_title(user_input):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in movie_titles:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_title_index = scores_sim.index(min(scores_sim))
    correct_title = movies[correct_title_index][0]
    print(correct_title)
    return correct_title

find_correct_title(user_input)

In [None]:
user_input = input("Give me a movie you like: ").lower().replace('the', '')

movie = model.get_layer('movie_embedding')
movie_weights = movie.get_weights()[0]
movie_lengths = np.linalg.norm(movie_weights, axis=1)
normalized_movies = (movie_weights.T / movie_lengths).T

def find_correct_title(user_input):
    scores_sim=[]
    vectorizer = TfidfVectorizer()

    for item in movie_titles:
        ed = nltk.edit_distance(user_input, item)
        scores_sim.append(ed)
    correct_title_index = scores_sim.index(min(scores_sim))
    correct_title = movies[correct_title_index][0]
    return(correct_title)

def similar_movies(movie):
    dists = np.dot(normalized_movies, normalized_movies[movie_to_idx[movie]])
    closest = np.argsort(dists)[-10:]
    for c in reversed(closest):
        print(c, movies[c][0], dists[c])

correct_title = find_correct_title(user_input)
similar_movies(correct_title)

# Notes / Sub-section

This section containes code that has been implemented although not used. Thus, it is commented only for presentation purposes.

In [61]:
# # Pretrained Model by FastText

# from gensim.models.keyedvectors import KeyedVectors
# vec_model = KeyedVectors.load_word2vec_format('C:/Users/dq186sy/Desktop/Big Data Content Analytics/Movie Recommendation System/crawl-300d-2M.vec', binary=False)

# # Cast

# def combine_actors(row):
#     return row['actor_1_name'] + "," + row['actor_2_name'] + "," + row['actor_3_name']

# five_thousands["combined_actors"] = five_thousands.apply(combine_actors, axis=1)

# average_vector_list_cast = []
# for i in tqdm(range(len(five_thousands["combined_actors"]))):
#     actors = five_thousands["combined_actors"].str.split(',')[i]
#     average = np.mean([vec_model[actor] for actor in actors], axis=0)
#     average_vector_list_cast.append(average)

# five_thousands['average_cast_vectors'] = average_vector_list_cast

# # Plot

# average_vector_list_plot = []
# for i in tqdm(range(len(five_thousands["plot_summary"]))):
#     plot = five_thousands["plot_summary"].str.split(' ')[i]
#     average = np.mean([vec_model[word] for word in plot], axis=0)
#     average_vector_list_plot.append(average)

# five_thousands['average_plot_vectors'] = average_vector_list_plot

# # Combined Features

# average_vector_list_combined_features = []
# for i in tqdm(range(0, len(five_thousands["combined_features"]))):
#     feature = five_thousands["combined_features"].str.split(' ')[i]
#     average = np.mean([model[word] for word in feature], axis=0)
#     average_vector_list_combined_features.append(average)

# five_thousands['average_combined_features'] = average_vector_list_combined_features

# # Vect Embeddings

# my_embeddings_array = np.hstack([five_thousands['average_plot_vectors'].apply(pd.Series).values,
#                                  five_thousands['average_cast_vectors'].apply(pd.Series).values,
#                                  five_thousands['average_combined_features'].apply(pd.Series).values])

# my_embeddings_array.shape

# # Pickle the word vectors

# with open('my_embeddings_array.pkl', 'wb') as f:
#     pickle.dump(my_embeddings_array, f)

# ------------------------------------------------------------------------------- 

# Clean the categorical variables [From the moment that I read the dataset with encoding UTF-8] this script is redadunt

# # Strip '/xa0' from the strings

# categorical_features = []
# for var in dataset_new.columns:
#     if dataset_new[var].dtypes=='O':
#         categorical_features.append(var)
        
# for i in categorical_features:
#     dataset_new[i] = dataset_new[i].astype(str).apply(lambda x: x.replace(u'\xa0', u''))

# dataset_new.actor_1_name[0]

# Using the Webhook with Python

Instructions on how to use webhook can be found in the two python files also attached with this Jupyter Notebook

<b>File 1:</b> app_v6.py <br>
<b>Fiel 2:</b> movie_recommendation_v5.py

#### Run app_v6.py (using the cmd terminal on Windows):

Step 1: Set the path directory to: Desktop (if you have saved the app_v6.py file in Desktop) <br>
Step 2 (Run the command): python app.py or FLASK_APP=hello.py flask run

#### Run the https protocole (using the cmd terminal on Windows): 

Step 1: Set the path directory to: C:\Users\dq186sy\Desktop\ngrok-stable-windows-amd64 (or the path where the ngrok.exe is saved) <br>
Step 2 (Run the command): ngrok http 5000 <br>
Step 3: Copy paste the **https** link that ends to .io (this link is updated every time the command is executed) <br>
Step 4: Copy paste the link to dialogflow engine under the tab: fulfilment.

# End of Notebook