## Assignment 1: Recommending Similar Movies (5 points)

- In this assignment, you will quantify the similarity between movies using plot summaries from two sources, Wikipedia and IMDb. To do this, we will calculate the cosine similarity for every pair of movies in the movies data set. The data set includes 100 movies with their titles, genres, and plot summaries from Wikipedia and IMDb.

- Instructions are provided as numbered comments in every code block.





- Do not clear the outputs; you must print out your outputs.



- Write your (legal) full name here and include your name in the file name:

## 1. Import Data Set

In [3]:
import numpy as np
import pandas as pd
import nltk

# (1) The data set comes as a CSV file. Import the data file (it is uploaded to the Blackboard folder).
# from google.colab import drive
# import os
# drive.mount('/content/gdrive')
movie = pd.read_csv("movies.csv")
# (2) Check the number of rows and columns, the column names, and print a few rows to see what the data looks like.
print(movie.shape)
print()
movie.head(5)

(100, 5)



Unnamed: 0,rank,title,genre,wiki_plot,imdb_plot
0,0,The Godfather,"[u' Crime', u' Drama']","On the day of his only daughter's wedding, Vit...","In late summer 1945, guests are gathered for t..."
1,1,The Shawshank Redemption,"[u' Crime', u' Drama']","In 1947, banker Andy Dufresne is convicted of ...","In 1947, Andy Dufresne (Tim Robbins), a banker..."
2,2,Schindler's List,"[u' Biography', u' Drama', u' History']","In 1939, the Germans move Polish Jews into the...",The relocation of Polish Jews from surrounding...
3,3,Raging Bull,"[u' Biography', u' Drama', u' Sport']","In a brief scene in 1964, an aging, overweight...","The film opens in 1964, where an older and fat..."
4,4,Casablanca,"[u' Drama', u' Romance', u' War']",It is early December 1941. American expatriate...,"In the early years of World War II, December 1..."


## 2. Combine (concatenate) Wikipedia and IMDb plot summaries

In [4]:
# (3) Combine wiki_plot and imdb_plot into a single column.
movie['plot'] = movie['wiki_plot'] + movie['imdb_plot']

# (4) Make sure the concatenation worked properly. Among the many ways to check this, a simple one is to
#     compare the lengths of wiki_plot and imdb_plot with the length of the combined plot.
#     The length of the combined plot should be (almost) equal to the sum of the lengths of wiki_plot and imdb_plot.
print(len(movie['wiki_plot'][0]) + len(movie['imdb_plot'][0]))
print(len(movie['plot'][0]))

26877
26877
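
One caveat with the `+` operator used above: it joins the two strings with no space in between, and a missing value in either column makes the combined value missing too. A minimal sketch on a made-up two-row frame (the toy data and the `plot_spaced` column name are assumptions for illustration, not part of the assignment data) shows one way to guard against both:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "wiki_plot": ["A banker is convicted.", None],
    "imdb_plot": ["He plans an escape.", "A boxer rises."],
})

# Plain `+`: no separator, and a missing value propagates to the result.
toy["plot"] = toy["wiki_plot"] + toy["imdb_plot"]

# Fill missing values first and add a space so the last word of one
# summary does not fuse with the first word of the next.
toy["plot_spaced"] = (
    toy["wiki_plot"].fillna("") + " " + toy["imdb_plot"].fillna("")
).str.strip()

print(toy["plot"].tolist())
print(toy["plot_spaced"].tolist())
```

With a separator, the combined length exceeds the sum of the two lengths by one character, which is consistent with the "(almost) equal" wording in step (4).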


## 3. Tokenization

In [5]:
# (5) Tokenize the combined plot column. Before, during, or after tokenization, remove punctuation, non-words, and stop words.
#     You can also handle some pre-processing steps during vectorization.
#     Keep only the elements that contain letters only.

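# Fill in NaN values (assign the result back instead of using inplace on a column selection)
movie['plot'] = movie['plot'].fillna("")

# Tokenize with NLTK's word_tokenize
from nltk.tokenize import word_tokenize
nltk.download('punkt')
movie['plot'] = movie['plot'].astype(str)   # astype returns a new Series, so assign it back
movie['plot_tk'] = movie['plot'].apply(word_tokenize)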

# Remove punctuation, non-words, and stop-words
import string                          # for punctuation
from nltk.corpus import stopwords      # for stopwords
nltk.download('stopwords')

# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

def clean(text):
    text = [word for word in text if word not in string.punctuation]
    text = [word for word in text if word.isalnum()]   # overlaps with the punctuation filter; note that isalnum() also keeps numeric tokens such as 1947
    text = [word for word in text if word.lower() not in stop_words]
    return text

movie['plot_tk_clean'] = movie['plot_tk'].apply(clean)
movie.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,rank,title,genre,wiki_plot,imdb_plot,plot,plot_tk,plot_tk_clean
0,0,The Godfather,"[u' Crime', u' Drama']","On the day of his only daughter's wedding, Vit...","In late summer 1945, guests are gathered for t...","On the day of his only daughter's wedding, Vit...","[On, the, day, of, his, only, daughter, 's, we...","[day, daughter, wedding, Vito, Corleone, hears..."
1,1,The Shawshank Redemption,"[u' Crime', u' Drama']","In 1947, banker Andy Dufresne is convicted of ...","In 1947, Andy Dufresne (Tim Robbins), a banker...","In 1947, banker Andy Dufresne is convicted of ...","[In, 1947, ,, banker, Andy, Dufresne, is, conv...","[1947, banker, Andy, Dufresne, convicted, murd..."
2,2,Schindler's List,"[u' Biography', u' Drama', u' History']","In 1939, the Germans move Polish Jews into the...",The relocation of Polish Jews from surrounding...,"In 1939, the Germans move Polish Jews into the...","[In, 1939, ,, the, Germans, move, Polish, Jews...","[1939, Germans, move, Polish, Jews, Kraków, Gh..."
3,3,Raging Bull,"[u' Biography', u' Drama', u' Sport']","In a brief scene in 1964, an aging, overweight...","The film opens in 1964, where an older and fat...","In a brief scene in 1964, an aging, overweight...","[In, a, brief, scene, in, 1964, ,, an, aging, ...","[brief, scene, 1964, aging, overweight, Italia..."
4,4,Casablanca,"[u' Drama', u' Romance', u' War']",It is early December 1941. American expatriate...,"In the early years of World War II, December 1...",It is early December 1941. American expatriate...,"[It, is, early, December, 1941, ., American, e...","[early, December, 1941, American, expatriate, ..."
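
The output above keeps tokens such as `1947` because `str.isalnum()` accepts digits as well as letters. If the goal is the "letters only" rule from step (5), `str.isalpha()` is the stricter test. The sketch below uses a made-up token list and a tiny hand-written stop-word set (both assumptions, standing in for `word_tokenize` output and the NLTK list) so that it runs without any downloads:

```python
# Made-up tokens standing in for word_tokenize output.
tokens = ["In", "1947", ",", "banker", "Andy", "Dufresne", "is", "convicted", "'s", "."]

# Tiny hand-written stop-word set for illustration; the notebook uses NLTK's English list.
toy_stop_words = {"in", "is", "the", "of"}

def clean_tokens(tokens, letters_only=False):
    """Drop punctuation-like tokens and stop words; optionally drop digits too."""
    keep = str.isalpha if letters_only else str.isalnum
    return [t for t in tokens if keep(t) and t.lower() not in toy_stop_words]

print(clean_tokens(tokens))                     # digits survive
print(clean_tokens(tokens, letters_only=True))  # letters only
```

Whether years and other numbers should count as features is a judgment call; they can carry signal for period films but also add noise.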


## 4. Stemming

In [7]:
# (6) Stem the tokenized combined plot column using a Snowball stemmer.

# Import the SnowballStemmer.
from nltk.stem.snowball import SnowballStemmer

# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")

# Use lambda to access the individual words
#   movie['plot_tk_clean'] accesses the column
#   lambda token accesses the list entry
#   word accesses the individual words in the list
movie['plot_stem'] = movie['plot_tk_clean'].apply(lambda token: [stemmer.stem(word) for word in token])

movie.head()

Unnamed: 0,rank,title,genre,wiki_plot,imdb_plot,plot,plot_tk,plot_tk_clean,plot_stem
0,0,The Godfather,"[u' Crime', u' Drama']","On the day of his only daughter's wedding, Vit...","In late summer 1945, guests are gathered for t...","On the day of his only daughter's wedding, Vit...","[On, the, day, of, his, only, daughter, 's, we...","[day, daughter, wedding, Vito, Corleone, hears...","[day, daughter, wed, vito, corleon, hear, requ..."
1,1,The Shawshank Redemption,"[u' Crime', u' Drama']","In 1947, banker Andy Dufresne is convicted of ...","In 1947, Andy Dufresne (Tim Robbins), a banker...","In 1947, banker Andy Dufresne is convicted of ...","[In, 1947, ,, banker, Andy, Dufresne, is, conv...","[1947, banker, Andy, Dufresne, convicted, murd...","[1947, banker, andi, dufresn, convict, murder,..."
2,2,Schindler's List,"[u' Biography', u' Drama', u' History']","In 1939, the Germans move Polish Jews into the...",The relocation of Polish Jews from surrounding...,"In 1939, the Germans move Polish Jews into the...","[In, 1939, ,, the, Germans, move, Polish, Jews...","[1939, Germans, move, Polish, Jews, Kraków, Gh...","[1939, german, move, polish, jew, kraków, ghet..."
3,3,Raging Bull,"[u' Biography', u' Drama', u' Sport']","In a brief scene in 1964, an aging, overweight...","The film opens in 1964, where an older and fat...","In a brief scene in 1964, an aging, overweight...","[In, a, brief, scene, in, 1964, ,, an, aging, ...","[brief, scene, 1964, aging, overweight, Italia...","[brief, scene, 1964, age, overweight, italian,..."
4,4,Casablanca,"[u' Drama', u' Romance', u' War']",It is early December 1941. American expatriate...,"In the early years of World War II, December 1...",It is early December 1941. American expatriate...,"[It, is, early, December, 1941, ., American, e...","[early, December, 1941, American, expatriate, ...","[earli, decemb, 1941, american, expatri, rick,..."
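
To see what the stemmer does to individual words, a small sketch can help. The word list is taken from the tokens visible in the output above. Note that the Snowball stemmer lowercases its input, which is why `Vito` becomes `vito`:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Words taken from the cleaned tokens shown above.
words = ["wedding", "Corleone", "hears", "December", "early"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
```

Stems such as `earli` or `decemb` are not dictionary words; they only need to be consistent so that inflected forms of a word map to the same TF-IDF feature.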


## 5. Generate TF-IDF Vectorizer

In [8]:
# (7)-1 Use a TF-IDF vectorizer to create TF-IDF vectors.
# (7)-2 You can adjust a few parameters of the TfidfVectorizer object, such as removing stop words or including bigrams,
# for efficient processing of text.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(lowercase = False, analyzer = 'word', stop_words = 'english')

## 6. Fit_transform TF-IDF Vectorizer

In [9]:
# (8) Fit and transform the TF-IDF vectorizer on the combined plot column for each movie
# to create a vector representation of the plot summaries.

# Running TfidfVectorizer on list-formatted text may raise an error.
# To avoid it, join the list elements (tokens) back into a single string, then run TfidfVectorizer on the string-formatted column.

# Convert from list to str for the vectorizer
movie['plot_stem_join'] = movie['plot_stem'].apply(lambda tokens: ' '.join(tokens))
movie.head()

movie_tfidf = tfidf.fit_transform(movie['plot_stem_join'])
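
As a rough illustration of what `fit_transform` returns, the sketch below runs a default `TfidfVectorizer` on three made-up, already-stemmed strings (the toy corpus and the `toy_` names are assumptions, not taken from the data set). The result is a sparse matrix with one row per document and one column per vocabulary term, and with the default `norm='l2'` every non-empty row has unit length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stemmed "plots", for illustration only.
toy_docs = [
    "mafia famili crime boss",
    "prison banker escap prison",
    "mafia boss prison",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

print(toy_tfidf.shape)                     # (documents, vocabulary terms)
print(sorted(toy_vectorizer.vocabulary_))  # the learned terms

# Each row is L2-normalized by default.
row_norms = np.sqrt(toy_tfidf.multiply(toy_tfidf).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Because the rows are unit length, the cosine similarity between two plots reduces to the dot product of their rows.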

## 7. Calculate Similarity Index

In [87]:
# (9)-1
# Import cosine_similarity to calculate similarity of movie plots.
from sklearn.metrics.pairwise import cosine_similarity
cos_sim_matrix = cosine_similarity(movie_tfidf)

# (9)-2 show 2D array matrix.
print(cos_sim_matrix)

# (9)-3 calculate cos similarity scores for all the combinations.
#You should come up with a 100 by 100 array matrix.

# Build a dictionary that maps each movie title to a list of (score, title) tuples,
# sorted from most to least similar.
# The first entry after sorting is normally the movie itself, so it is skipped with [1:].
similarities = {}
for i in range(len(cos_sim_matrix)):
    # Get the indices that sort row i of cos_sim_matrix from highest to lowest score
    similar_indices = cos_sim_matrix[i].argsort()[::-1]

    # Store the sorted (score, title) pairs under the movie's title
    similarities[movie['title'].iloc[i]] = [(cos_sim_matrix[i][x], movie['title'].iloc[x])
                                           for x in similar_indices][1:]

print()
print(len(similarities))
print()
print(similarities)


[[1.         0.0163585  0.01947914 ... 0.02342618 0.02353292 0.        ]
 [0.0163585  1.         0.03008894 ... 0.01467898 0.01476877 0.        ]
 [0.01947914 0.03008894 1.         ... 0.01713607 0.01379893 0.        ]
 ...
 [0.02342618 0.01467898 0.01713607 ... 1.         0.03479676 0.        ]
 [0.02353292 0.01476877 0.01379893 ... 0.03479676 1.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]

100

{'The Godfather': [(0.6284664240158511, 'The Godfather: Part II'), (0.2552446699822492, 'Tootsie'), (0.1366002171454995, 'E.T. the Extra-Terrestrial'), (0.08064885122442336, 'The Grapes of Wrath'), (0.06988022751980032, 'Nashville'), (0.06422421998602419, 'Goodfellas'), (0.05323523571078316, 'The Deer Hunter'), (0.04731676019508179, 'It Happened One Night'), (0.043699224225281656, 'To Kill a Mockingbird'), (0.04355940754198671, 'Rain Man'), (0.041020258416640504, 'The Sound of Music'), (0.04089898567965925, 'The Best Years of Our Lives'), (0.04
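
For reference, the cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below checks a hand computation against `cosine_similarity` on made-up vectors (not the movie data). It also includes an all-zero row, which scores 0 against everything, itself included; an empty plot string would produce such a row, which is a likely explanation for the all-zero last row in the matrix above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up vectors for illustration; the last one is all zeros.
vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
])

# Hand computation: dot product is 1, each length is sqrt(2).
a, b = vectors[0], vectors[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual)

# Pairwise similarities for all rows at once.
sim = cosine_similarity(vectors)
print(sim)
```

Rows like the zero vector are worth checking for before building recommendations, since the "skip the first entry" trick assumes each movie's best match is itself.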

## 8. Build a Small Recommendation Algorithm based on Similarity Score

In [141]:
# Create a function that returns the 10 movies most similar (by similarity score) to a focal movie
# that you type in. This part has been mostly done for you.

# Generate mapping between titles and index.
indices = pd.Series(movie.index, index=movie['title']).drop_duplicates()
print(indices)
print(indices.shape)

def get_recommendations(title):

    # Look up the (score, title) pairs for this movie; the movie itself was already dropped in step (9)
    sim_scores = list(enumerate(similarities[title]))

    # Sort the scores from highest to lowest (descending).
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies.
    sim_scores = sim_scores[0:10]

    print(f'The top 10 most similar movies to {title} are:')
    for i in range(len(sim_scores)):
        print(f'{i + 1}. {sim_scores[i][1][1]}')
    return ' '

title
The Godfather                0
The Shawshank Redemption     1
Schindler's List             2
Raging Bull                  3
Casablanca                   4
                            ..
Rebel Without a Cause       95
Rear Window                 96
The Third Man               97
North by Northwest          98
Yankee Doodle Dandy         99
Length: 100, dtype: int64
(100,)
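
An alternative to the dictionary lookup is to read the recommendations straight from the similarity matrix using the title-to-index mapping. The sketch below is one possible version under stated assumptions: `top_k_similar`, the toy titles, and the toy matrix are all hypothetical names and values for illustration. It excludes the focal movie by index rather than by position, which stays correct even when a row is all zeros or contains ties.

```python
import numpy as np
import pandas as pd

def top_k_similar(title, sim_matrix, title_to_index, k=10):
    """Return up to k (title, score) pairs most similar to `title`, excluding itself."""
    idx = title_to_index[title]
    order = np.argsort(sim_matrix[idx])[::-1]   # indices from highest to lowest score
    order = [j for j in order if j != idx][:k]  # drop the focal movie by index
    titles = title_to_index.index
    return [(titles[j], float(sim_matrix[idx, j])) for j in order]

# Toy data, invented for illustration only.
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_index = pd.Series(toy_titles.index, index=toy_titles)
toy_sim = np.array([
    [1.0, 0.6, 0.1, 0.3],
    [0.6, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.4, 0.5, 1.0],
])

print(top_k_similar("A", toy_sim, toy_index, k=2))
```

Returning the pairs instead of printing them also makes the function easier to test and reuse.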


## 9. What are the similar movies of a focal movie?

In [142]:
# e.g. what are the top 10 similar movies for the movie 'Star Wars'?

# print(get_recommendations('type in a movie name here'))
print(get_recommendations('Star Wars'))

The top 10 most similar movies to Star Wars are:
1. 2001: A Space Odyssey
2. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
3. The Maltese Falcon
4. Close Encounters of the Third Kind
5. Platoon
6. The Sound of Music
7. Saving Private Ryan
8. The Bridge on the River Kwai
9. The Pianist
10. Braveheart
 


In [143]:
print(get_recommendations('Amadeus'))

The top 10 most similar movies to Amadeus are:
1. The Sound of Music
2. Close Encounters of the Third Kind
3. Ben-Hur
4. The Third Man
5. Out of Africa
6. The Good, the Bad and the Ugly
7. Gandhi
8. It Happened One Night
9. The Godfather: Part II
10. The Exorcist
 
