# Project 8 - Movie Recommendations with Document Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:
- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on specific global metrics and thresholds like movie popularity, global ratings etc.
- Content-based Recommenders: These recommend entities similar to a specific entity of interest. Content metadata such as movie descriptions, genre, cast, director and so on can be used here.
- Collaborative filtering Recommenders: Here we don't need metadata but we try to predict recommendations and ratings based on past ratings of different users and specific items.

We will be building a movie recommendation system here: based on data/metadata pertaining to different movies, we try to recommend similar movies of interest!

![](netflix_rec.png)

Since our focus is not really recommendation engines but NLP, we will leverage the text-based metadata of each movie to recommend similar movies based on specific movies of interest. This falls under content-based recommenders.

# Load Dataset

In [None]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

In [None]:
df.head()

In [None]:
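df = df[['title', 'tagline', 'overview', 'genres', 'popularity']].copy()
# a missing tagline becomes an empty string so the concatenation below still works
df['tagline'] = df['tagline'].fillna('')
df['description'] = df['tagline'].map(str) + ' ' + df['overview']
# drop rows that still have missing values (e.g. a missing overview)
df.dropna(inplace=True)
df.info()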

In [None]:
df.head()

# Your Turn: Build a Movie Recommender System

Here you will build your own movie recommender system. We will use the following pipeline:
- Text pre-processing
- Feature Engineering
- Document Similarity Computation
- Find top similar movies
- Build a movie recommendation function


## Document Similarity

Recommendations are about understanding the underlying features that make us favour one choice over another. Similarity between items (in this case, movies) is one way to understand why we choose one movie over another. There are different ways to calculate similarity between two items. One of the most widely used measures is __cosine similarity__, which we have already used in the previous unit.

### Cosine Similarity

Cosine Similarity is used to calculate a numeric score to denote the similarity between two text documents. Mathematically, it is defined as follows:

$$ cosine(x,y) = \frac{x \cdot y}{\|x\| \, \|y\|} $$
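
As a quick check of the formula, here is a minimal NumPy sketch. The helper name `cosine` and the three toy vectors are made up for illustration and are not part of the notebook's pipeline.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

sim_ab = cosine(a, b)   # parallel vectors: similarity is 1 (up to rounding)
sim_ac = cosine(a, c)   # orthogonal vectors: similarity is 0
```

Because the score depends only on direction, a long and a short description that use the same words in the same proportions are treated as very similar.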

## Text pre-processing

We will do some basic text pre-processing on our movie descriptions before we build our features.

In [None]:
import nltk
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
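    # remove special characters, then lower-case and trim whitespace
    # (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)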
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)
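
To see what this kind of normalization does to a single description, here is a small standard-library sketch. It is only an approximation of the cell above: `normalize_toy` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is far smaller than NLTK's, and `str.split` stands in for `nltk.word_tokenize`.

```python
import re

# hypothetical, tiny stop-word set used only for this illustration
TOY_STOP_WORDS = {"a", "the", "of", "and", "in", "is"}

def normalize_toy(doc):
    # drop everything except letters, digits and whitespace
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc, flags=re.I | re.A)
    # lower-case and trim surrounding whitespace
    doc = doc.lower().strip()
    # whitespace tokenization (an assumption; the notebook uses NLTK)
    tokens = doc.split()
    # remove stop words and rebuild the document
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

result = normalize_toy("  The Lord of the Rings: A Tale of Hobbits!  ")
# result == "lord rings tale hobbits"
```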

## Extract TF-IDF Features

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
# fit on the normalized descriptions prepared in the pre-processing step
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

## Compute Pairwise Document Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()
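
When reading the similarity matrix it helps to know its basic properties. The following sketch uses three made-up vectors instead of the TF-IDF matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy_vectors = np.array([
    [1.0, 1.0, 0.0],   # document 0
    [2.0, 2.0, 0.0],   # document 1: same direction as document 0
    [0.0, 0.0, 1.0],   # document 2: shares no feature with the others
])

sim = cosine_similarity(toy_vectors)   # shape (3, 3)

is_symmetric = np.allclose(sim, sim.T)             # sim[i, j] equals sim[j, i]
diag_is_one = np.allclose(np.diag(sim), 1.0)       # every document matches itself
```

Row `i` of the matrix holds the similarity of document `i` to every document, including itself, which is why the lookup for a given movie has to skip the movie's own entry.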

## Get List of Movie Titles

In [None]:
movies_list = df['title'].values
movies_list, movies_list.shape

## Sort Dataset by Popular Movies

In [None]:
pop_movies = df.sort_values(by='popularity', ascending=False)
pop_movies.head(10)

## Find Top Similar Movies for a Sample Movie

Let's take __Minions__, the most popular movie in the dataframe above, and try to find the most similar movies that can be recommended.

#### Find movie ID

In [None]:
movie_idx = np.where(movies_list == 'Minions')[0][0]
movie_idx

#### Get movie similarities

In [None]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

#### Get top 5 similar movie IDs

In [None]:
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
similar_movie_idxs

#### Get top 5 similar movies

In [None]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies
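
The `np.argsort(-movie_similarities)[1:6]` step can be traced on a toy example. The titles and similarity values below are made up for illustration.

```python
import numpy as np

titles = np.array(["A", "B", "C", "D"])
# similarities of movie "A" to every title, itself included
sims = np.array([1.0, 0.2, 0.7, 0.5])

# negating the scores makes argsort return indices from most to least similar
order = np.argsort(-sims)        # [0, 2, 3, 1]

# position 0 is "A" itself, so the recommendations start at position 1
top2 = titles[order[1:3]]        # ["C", "D"]
```

One caveat worth keeping in mind: skipping position 0 assumes the query movie ranks first. If another movie had an identical description, its similarity would also be 1.0 and the query itself might not be the entry that gets skipped.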

### Your Turn: Build a movie recommender function to recommend top 5 similar movies for any movie 

The movie title, movie title list and document similarity matrix dataframe will be given as inputs to the function

In [None]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    # find movie id
    _____
    # get movie similarities
    _____
    # get top 5 similar movie IDs
    _____
    # get top 5 movies
    similar_movies = _____
    # return the top 5 movies
    return similar_movies

### Your Turn: Now use this function on the top 20 popular movies

Hint: Try getting the first 20 titles from the `pop_movies` dataframe

In [None]:
popular_movies = _____

In [None]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie))
    print()

## Cluster Similar Movies

Now that we built our own movie recommendation utility, it is time to level up.

Clustering is an unsupervised approach to finding groups of similar items in a given dataset. There are different clustering algorithms, and __K-Means__ is a pretty simple yet effective one. Most movies span different emotions and can be categorized into multiple genres (the same is true of the movies in our current dataset). Can clustering of movie descriptions help us understand these groupings?

Similarity analysis was a good starting point, but can we do better? 

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from collections import Counter
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
kmeans = KMeans(n_clusters=2, max_iter=100,random_state=42).fit(____) #feature matrix

In [None]:
Counter(kmeans.labels_)

## Identifying K

Since most movies are a combination of emotions, stories, characters, scenery, etc., let us try to use K-Means to see if the movies can be grouped into common themes capturing some of these underlying aspects.

One challenge we face while working with K-Means is finding the right value of K. To do so, there are a few heuristics like the _silhouette score_ and the _elbow_ method.

### Silhouette Score
Silhouette scoring is a method for interpreting and validating consistency within clusters of data.
The silhouette value quantifies how similar an item is to its own cluster (cohesion) compared to other clusters (separation). It ranges from −1 to +1, where a high value indicates a well-placed item. If many items have low or negative values, there may be too many or too few clusters.
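
To get a feel for the scale of the score, here is a sketch on synthetic data: two tight, well-separated 2-D blobs, for which the average silhouette should be close to +1. The data and the `n_init=10` setting are assumptions made for this illustration, not values from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two tight blobs centred at (0, 0) and (5, 5)
X = np.vstack([
    rng.normal(0, 0.1, (20, 2)),
    rng.normal(5, 0.1, (20, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# mean silhouette over all samples; close to +1 for clean clusters
score = silhouette_score(X, labels, metric="euclidean")
```

High-dimensional, sparse TF-IDF data rarely looks this clean, so much lower scores are to be expected on the movie descriptions.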

### Elbow Method
This method requires us to run k-means clustering on a given dataset for a range of values of k. Then, for each value of k, we calculate the sum of squared errors (SSE).

The next step is to plot a line graph of the SSE against each value of k. The line graph looks like an arm, and the _elbow_ of the arm marks the optimal value of k (the number of clusters).
The goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.
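
The diminishing-returns idea can be checked numerically on synthetic data. The sketch below builds three well-separated blobs (an assumption made for this illustration) and records the SSE, exposed by scikit-learn as `inertia_`, for several values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (0, 5)]
# 30 points around each of the three centres
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in centers])

sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = model.inertia_   # sum of squared distances to the closest centre

# how much the SSE improves when moving from k-1 to k clusters
drops = {k: sse[k - 1] - sse[k] for k in range(2, 7)}
```

For this data the improvement is large up to k=3 and small afterwards, so the elbow sits at three clusters. On real text data the bend is usually much less pronounced.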

The following snippet loops through different values of K and generates the silhouette scores as well as the elbow plot to help us narrow down the optimal value of K.

In [None]:
def identify_k(feature_matrix, min_k=2,max_k=3):
    sse = {}
    for k in range(min_k,max_k):
        kmeans = KMeans(n_clusters=k, max_iter=100,random_state=42).fit(_____)#feature matrix
        sil_coeff = silhouette_score(feature_matrix, 
                                     _________,# cluster labels 
                                     metric='euclidean')
        print("For K={}, Silhouette Coefficient = {}".format(k, sil_coeff))
        # Inertia: sum of squared distances of samples to their closest cluster center
        sse[k] = kmeans.inertia_ 
    plt.figure()
    plt.plot(_______, ______) # x-axis=different values of k, y-axis=sse value for each k
    plt.xlabel("K")
    plt.ylabel("SSE")
    plt.show()

### Your Turn
Find the optimal value of K

In [None]:
# iterate from k=2 to k=10 and plot the elbow curve
identify_k(___,___,____)

### Set Cluster Labels
For the current scenario, let us set __K=4__ and assign cluster labels to each of the movies in our dataset

In [None]:
kmeans = KMeans(________).fit(____)
df["cluster_label"] = ______ # cluster labels

### Extract Cluster Details

Each cluster constitutes movies which have some underlying aspects which are common.
We also understand that cluster centers are in a way representatives of the whole cluster. Let us utilize this understanding to extract top common features amongst clusters.
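
The idea of reading a cluster's theme from its centroid can be sketched with a toy example. The feature names and centroid weights below are hypothetical. In the notebook they would come from the TF-IDF vectorizer and the fitted K-Means model.

```python
import numpy as np

feature_names = np.array(["alien", "love", "space", "wedding"])

# hypothetical centroids: one row per cluster, one column per feature
centroids = np.array([
    [0.6, 0.0, 0.7, 0.1],
    [0.0, 0.8, 0.1, 0.5],
])

# argsort sorts ascending, so reversing each row puts the heaviest features first
ordered = centroids.argsort()[:, ::-1]

# names of the two heaviest features per cluster
top2 = [list(feature_names[row[:2]]) for row in ordered]
# top2 == [["space", "alien"], ["love", "wedding"]]
```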

In [None]:
def get_cluster_details(clustering_obj, movie_data, 
                     feature_names, num_clusters,
                     topn_features=10):

    cluster_details = {}  
    # sort the feature indices of each centroid by descending weight
    ordered_centroids = clustering_obj.cluster_centers_.argsort()[:, ::-1]
    
    # get key features for each cluster
    # get movies belonging to each cluster
    for cluster_num in range(num_clusters):
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster_num'] = ____ # set cluster number
        key_features = [_________]# extract key features from centroids
        cluster_details[cluster_num]['key_features'] = key_features
        
        movies = ____________ # assign list of movies belonging to this cluster
        cluster_details[cluster_num]['movies'] = movies
    
    return cluster_details

In [None]:
def print_cluster_details(cluster_data,n_movies=5):
    # print cluster details
    for cluster_num, cluster_details in cluster_data.items():
        print('Cluster {} details:'.format(cluster_num))
        print('-'*20)
        print('Key features:', ______)
        print('Movies in this cluster:')
        print(', '.join(_________))
        print('='*40)

### Your Turn

Extract Cluster Details

In [None]:
cluster_data =  get_cluster_details(clustering_obj=____, # clustering object
                                     movie_data=df,
                                     feature_names=_____, #hint:use the tfidf vectorizer to get list of features
                                     num_clusters=4,
                                     topn_features=5)      

In [None]:
print_cluster_details(______) 

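## Hierarchical Clustering
Hierarchical (agglomerative) clustering builds a tree of clusters by repeatedly merging the closest groups, and the result can be drawn as a dendrogram. Here we use Ward's linkage from SciPy on the cosine distances between movies.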
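
As a small illustration of Ward linkage, the sketch below clusters four made-up 2-D points. It uses Euclidean distances from `pdist` rather than cosine distances, and `fcluster` to cut the tree. Both are choices made for this example only.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster
from scipy.spatial.distance import pdist

# two obvious pairs of neighbouring points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# condensed (1-D) vector of pairwise distances
condensed = pdist(X, metric="euclidean")

# each row of Z records one merge: the two merged clusters, their distance, the new size
Z = ward(condensed)

# cut the tree into two flat clusters
flat = fcluster(Z, t=2, criterion="maxclust")
```

With n points the linkage matrix has n-1 rows, one per merge. Note that SciPy's linkage functions interpret a 1-D input as condensed distances and a 2-D input as raw observation vectors, which is worth keeping in mind when passing a square distance matrix.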

In [None]:
from scipy.cluster.hierarchy import ward, dendrogram

In [None]:
def ward_hierarchical_clustering(feature_matrix):
    
    cosine_distance = ________ # 1- cosine similarity of the features
    linkage_matrix = ward(cosine_distance)
    return linkage_matrix

In [None]:
def plot_hierarchical_clusters(linkage_matrix, movie_data, p=100, figure_size=(8,12)):
    # set size
    fig, ax = plt.subplots(figsize=figure_size) 
    movie_titles = _______ # extract movie titles as list
    
    # prepare dendrogram
    R = dendrogram(linkage_matrix, 
                   orientation="left", 
                   labels=movie_titles,
                   truncate_mode='lastp',
                   p=p, 
                   no_plot=True)
    
    temp = {R["leaves"][ii]: movie_titles[ii] for ii in range(len(R["leaves"]))}
    def llf(xx):
        return "{}".format(temp[xx])
    
    # plot dendrogram
    ax = dendrogram(
            ______, # linkage matrix
            truncate_mode='lastp',
            orientation="left",
            p=p,  
            leaf_label_func=_____, # function to get leaf labels 
            leaf_font_size=10.,
            )
    plt.tick_params(axis= 'x',   
                    which='both',  
                    bottom=False,
                    top=False,
                    labelbottom=False)
    plt.tight_layout()
    plt.savefig('movie_hierachical_clusters.png', dpi=200)

### Your Turn

Use the utilities prepared above to perform hierarchical clustering and generate a dendrogram

In [None]:
linkage_matrix = ward_hierarchical_clustering(______)

In [None]:
plot_hierarchical_clusters(___________)