<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Datasets-used-breakdown" data-toc-modified-id="Datasets-used-breakdown-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Datasets used breakdown</a></span></li><li><span><a href="#Identifying-users-who-still-needs-recommendations" data-toc-modified-id="Identifying-users-who-still-needs-recommendations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Identifying users who still needs recommendations</a></span></li><li><span><a href="#Finding-similarities-between-movies" data-toc-modified-id="Finding-similarities-between-movies-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Finding similarities between movies</a></span></li><li><span><a href="#Providing-recommendation-for-users" data-toc-modified-id="Providing-recommendation-for-users-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Providing recommendation for users</a></span></li><li><span><a href="#How-Did-We-Do?" data-toc-modified-id="How-Did-We-Do?-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>How Did We Do?</a></span></li></ul></div>

# Content Based Recommendations

In the previous notebook, recommendations were made using collaborative filtering.  However, that technique left a large number of users without any recommendations at all.  Other users were left with fewer than the ten recommendations that our function was set up to retrieve...

To provide these users with recommendations, a **content based** recommendation technique will be used.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import HTML
import progressbar
import tests as t
import pickle


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']


# Load the collaborative filtering recommendations, closing the file afterwards
with open("all_recs.p", "rb") as f:
    all_recs = pickle.load(f)

## Datasets used breakdown


`a.` **movies** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **reviews** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_recs** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

We don't need to worry about the individuals in **all_recs** who did receive 10 recommendations using collaborative filtering.  However, a number of individuals in our dataset did not receive any recommendations.



## Identifying users who still needs recommendations

In [3]:
# 1. Identifying users who already have enough recommendations

# List to store the ids of users who already have 10 or more recommendations
users_with_all_recs = []

# Walk through the collaborative filtering recommendations made so far
for user, user_recs in all_recs.items():
    # If the user has at least 10 recommendations
    if len(user_recs) > 9:
        # Add the user to the list
        users_with_all_recs.append(user)
        
print('The number of users who have all their recs is: {}'.format(len(users_with_all_recs)))



# 2. Identifying users who still need recommendations

# An array of all unique users in the database
users = np.unique(reviews['user_id'])
# Finding the difference between the two arrays
users_who_need_recs = np.setdiff1d(users, users_with_all_recs)

print('The number of users who still need their recs is: {}'.format(len(users_who_need_recs)))

The number of users who have all their recs is: 22187
The number of users who still need their recs is: 45166


## Finding similarities between movies

This time I will mix content based and collaborative filtering to make recommendations, which lets us obtain recommendations in many cases where we couldn't make them earlier.

Before finding recommendations, we rank each user's ratings from highest to lowest, because we will move through the movies in this order looking for other similar movies.

In [5]:
# create a dataframe similar to reviews, but ranked by rating for each user
ranked_reviews = reviews.sort_values(by=['user_id', 'rating'], ascending=False)

To build a matrix that describes the movies in our dataframe in terms of content, we can use the indicator variables related to **year** and **genre** for our movies.

Then we can obtain a matrix of how similar movies are to one another by taking the **dot product** of this matrix with itself.  As seen in the below image,  the dot product where our 1 values overlap gives a value of 2 indicating higher similarity.  In the second dot product, the 1 values don't match up.  This leads to a dot product of 0 indicating lower similarity.

<img src="images/dotprod1.png" alt="Dot Product" height="500" width="500">
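
As a minimal sketch of the idea in the image, the snippet below takes dot products of small indicator vectors. The vectors are made up for illustration and are not rows from `movies`. Every position where both vectors hold a 1 adds one to the result, so more overlap means a larger value.

```python
import numpy as np

# Hypothetical indicator vectors (1 = the movie has that genre/era attribute)
movie_a = np.array([1, 0, 1, 0])
movie_b = np.array([1, 0, 1, 0])  # shares both attributes with movie_a
movie_c = np.array([0, 1, 0, 1])  # shares no attributes with movie_a

overlap_ab = int(movie_a.dot(movie_b))  # two positions where both are 1
overlap_ac = int(movie_a.dot(movie_c))  # no positions where both are 1

print(overlap_ab, overlap_ac)
```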

We can perform the dot product on a matrix of movies with content characteristics to provide a movie by movie matrix where each cell is an indication of how similar two movies are to one another.  In the below image, we can see that movies 1 and 8 are most similar, movies 2 and 8 are most similar and movies 3 and 9 are most similar for this subset of the data.  The diagonal elements of the matrix will contain the similarity of a movie with itself, which will be the largest possible similarity (and will also be the number of 1's in the movie's row within the original movie content matrix).

<img src="images/moviemat.png" alt="Dot Product" height="500" width="500">
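
The same properties can be checked on a small example. This is a sketch on a made-up content matrix (`toy_content` and the other names are hypothetical, not part of the notebook's data): it builds the movie by movie matrix, confirms that the diagonal equals the count of 1's in each row, and finds each movie's most similar *other* movie.

```python
import numpy as np

# Hypothetical content matrix: 4 movies x 5 indicator columns
toy_content = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
])

# Movie x movie similarity: entry (i, j) counts the 1's movies i and j share
toy_sim = toy_content.dot(toy_content.T)

# The diagonal equals the number of 1's in each movie's row
diag_matches_row_sums = np.array_equal(np.diag(toy_sim), toy_content.sum(axis=1))

# Most similar *other* movie per row: mask the diagonal before taking argmax
masked = toy_sim.astype(float)  # astype returns a copy, toy_sim is unchanged
np.fill_diagonal(masked, -np.inf)
most_similar = masked.argmax(axis=1)

print(toy_sim)
print(most_similar)
```

Masking the diagonal is a choice made for this sketch only. The notebook's `find_similar_movies` keeps the movie itself among the matches and relies on the later step that removes titles the user has already seen.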

In [6]:
# Subset so movie_content is only using the dummy variables 
# for each genre and the 3 century based year dummy columns
movie_content = np.array(movies.iloc[:,4:])

# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = movie_content.dot(np.transpose(movie_content))

## Providing recommendation for users

We now have two structures that help us provide recommendations for all users who have fewer than 10 recommendations:

`a.` A dataframe in which each user's ratings are ordered (first by user_id, then by rating).

`b.` Matrix where movies are each axis, and the matrix entries are larger where the two movies are more similar and smaller where the two movies are dissimilar.  This matrix is a measure of content similarity.

For each user, we will perform the following:

    i. For each movie, find the movies that are most similar that the user hasn't seen.

    ii. Continue through the available, rated movies until we have 10 recommendations or until there are no additional movies.
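
The two steps above can be sketched for a single user on toy data. Everything here is a hypothetical stand-in (`toy_titles`, `toy_sim`, `watched_idxs`, and a target of 2 recommendations instead of 10), meant only to show the control flow rather than the notebook's actual implementation.

```python
import numpy as np

# Hypothetical toy data: titles, a movie x movie similarity matrix, and one
# user's watched movies as row indices, ordered from highest to lowest rating
toy_titles = np.array(["A", "B", "C", "D", "E"])
toy_sim = np.array([
    [3, 3, 1, 0, 0],
    [3, 3, 1, 0, 0],
    [1, 1, 2, 2, 0],
    [0, 0, 2, 2, 0],
    [0, 0, 0, 0, 1],
])
watched_idxs = [0, 2]  # the user rated "A" highest, then "C"
n_wanted = 2           # the notebook aims for 10; 2 keeps the example small

watched_titles = toy_titles[watched_idxs]
toy_recs = set()

for idx in watched_idxs:
    # Step i: movies tied for the highest similarity to this movie...
    best = np.where(toy_sim[idx] == toy_sim[idx].max())[0]
    # ...minus anything the user has already seen
    unseen = np.setdiff1d(toy_titles[best], watched_titles)
    toy_recs.update(unseen.tolist())
    # Step ii: stop once there are enough recommendations
    if len(toy_recs) >= n_wanted:
        break

print(sorted(toy_recs))
```

Because recommendations are added a whole batch of ties at a time, a user can end up with more than the target number; the loop only guarantees that it stops once the target is reached or the watched list runs out.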

In [8]:
def find_similar_movies(movie_id):
    '''
    INPUT
    movie_id - a movie_id 
    OUTPUT
    similar_movies - an array of the most similar movies by title
    '''
    # find the row index of this movie id in the movies dataframe
    movie_idx = np.where(movies['movie_id'] == movie_id)[0][0]
    
    # find the most similar movie indices: as a starting point, only movies tied
    # for the row maximum (the movie's similarity with itself) are kept
    similar_idxs = np.where(dot_prod_movies[movie_idx] == np.max(dot_prod_movies[movie_idx]))[0]
    
    # pull the movie titles based on the indices
    similar_movies = np.array(movies.iloc[similar_idxs, ]['movie'])
    
    return similar_movies
    
    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])
   
    return movie_lst

def make_recs():
    '''
    INPUT
    None
    OUTPUT
    recs - a dictionary with keys of the user and values of the recommendations
    '''
    # Create dictionary to return with users and their recommendations
    recs = defaultdict(set)
    # How many users for progress bar
    n_users = len(users)

    
    # Create the progressbar
    cnter = 0
    bar = progressbar.ProgressBar(maxval=n_users+1, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    
    # For each user
    for user in users:
        
        # Update the progress bar
        cnter+=1 
        bar.update(cnter)

        # Pull only the reviews the user has seen
        reviews_temp = ranked_reviews[ranked_reviews['user_id'] == user]
        movies_temp = np.array(reviews_temp['movie_id'])
        movie_names = np.array(get_movie_names(movies_temp))

        # Look at each of the movies (highest ranked first), 
        # pull the movies the user hasn't seen that are most similar
        # These will be the recommendations - continue until 10 recs 
        # or you have depleted the movie list for the user
        for movie in movies_temp:
            rec_movies = find_similar_movies(movie)
            temp_recs = np.setdiff1d(rec_movies, movie_names)
            recs[user].update(temp_recs)

            # Stop once the user has at least 10 recommendations
            if len(recs[user]) > 9:
                break

    bar.finish()
    
    return recs

In [9]:
recs = make_recs()



## How Did We Do?

Now that we have made the recommendations, how well did we do at providing everyone with a set of recommendations?

In [142]:
# Explore recommendations
users_without_all_recs = []
users_with_all_recs = []
no_recs = []
for user, movie_recs in recs.items():
    if len(movie_recs) < 10:
        users_without_all_recs.append(user)
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)
    if len(movie_recs) == 0:
        no_recs.append(user)

In [145]:
# Some characteristics of my content based recommendations
print("There were {} users without all 10 recommendations we would have liked to have.".format(len(users_without_all_recs)))
print("There were {} users with all 10 recommendations we would like them to have.".format(len(users_with_all_recs)))
print("There were {} users with no recommendations at all!".format(len(no_recs)))

There were 2179 users without all 10 recommendations we would have liked to have.
There were 51789 users with all 10 recommendations we would like them to have.
There were 174 users with no recommendations at all!
