## Collaborative Filtering Recommendation

One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, you use the collective interactions between users and items (here, movie ratings) to assist in making new recommendations.  

There are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model-Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide recommendations.


In this notebook, you will be working on performing **neighborhood-based collaborative filtering**.  There are two main methods for performing neighborhood-based collaborative filtering:

1. **User-based collaborative filtering:** In this type of recommendation, users related to the user you would like to make recommendations for are used to create a recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, first you need to find the items that are most related to each other item (based on similar ratings).  Then you can use the ratings of an individual on those similar items to understand if a user will like the new item.
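
To make the distinction concrete, here is a minimal sketch on a made-up ratings table. The `toy` frame, its user labels, and its values are invented for illustration and are not part of the notebook's data. Correlating rows compares users; correlating columns compares items.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies, NaN = not rated
toy = pd.DataFrame(
    {
        "movie_a": [8, 7, 2, np.nan],
        "movie_b": [9, 8, 1, 3],
        "movie_c": [3, np.nan, 9, 8],
    },
    index=["u1", "u2", "u3", "u4"],
)

# User-based: corr() compares columns, so transpose to put users in the columns
user_sim = toy.T.corr()

# Item-based: movies are already the columns
item_sim = toy.corr()

print(user_sim.round(2))
print(item_sim.round(2))
```

Pairs with too few shared ratings come back as NaN here, which is the same problem that shows up later in this notebook with the correlation coefficient.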

In this notebook you will be implementing **user-based collaborative filtering**.  However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**.  First, let's read in our data and necessary libraries.

## User-based Collaborative Filtering Recommendation System


In [3]:
# !pip install gdown -q
# import gdown

In [2]:
!pip install --upgrade --no-cache-dir gdown -q
!gdown 1vo5amPt4t2IeHq3wQhq_fXAOQtoKGFKM
!gdown 1CUOpMwUmYh_L6Ni-e-hVrlpCLJUb3FtE
!gdown 19QHcex_bHy8Yl7sYtx0a3knX99aEGCci

In [None]:
import pickle

# Write a Python dict to a file
mydict = {'a': 1, 'b': 2, 'c': 3}
with open('myfile.pkl', 'wb') as output:
    pickle.dump(mydict, output)

# Read the dict back from the file
with open('myfile.pkl', 'rb') as pkl_file:
    mydict2 = pickle.load(pkl_file)

print(mydict)
print(mydict2)

In [None]:
!ls ..

In [None]:
!cp myfile.pkl 

In [None]:
!cp -r ../working ../new

In [None]:
!cd ../ && ls .

In [None]:
import shutil
shutil.make_archive('output_filename', 'zip', '../new')

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t
from scipy.sparse import csr_matrix
from IPython.display import HTML


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']



In [5]:
movies.head()

In [6]:
reviews.head()

### User-Item Matrix

In order to calculate the similarities, it is common to put values in a matrix.  In this matrix, users are identified by each row, and items are represented by columns.  
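
As a small illustration of that layout, the sketch below builds a user-item matrix from a handful of invented ratings. The `toy_reviews` frame is hypothetical; only its column names mirror the `reviews` data.

```python
import pandas as pd

# Hypothetical long-format ratings with the same column names as `reviews`
toy_reviews = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [8, 6, 7, 9, 5],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_matrix = toy_reviews.groupby(["user_id", "movie_id"])["rating"].max().unstack()
print(toy_matrix)
```

Each cell holds a rating, and a NaN marks a movie the user has not rated.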

In [7]:
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()

### Creating the User-Item Matrix

In order to create the user-items matrix (like the one above), I personally started by using a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html). 

However, I quickly ran into a memory error (a common theme throughout this notebook).  I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results! 

_____

`1.` Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) might help you!

In [8]:
# Create user-by-item matrix
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
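
If the dense matrix is still too large for memory, one possible alternative is a SciPy sparse matrix. This is only a sketch on invented data (`toy_reviews` is hypothetical), and it behaves differently from the dense version in two ways: missing ratings are stored as implicit zeros rather than NaN, and duplicate (user, movie) pairs would be summed rather than reduced with `max()`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format ratings
toy_reviews = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [8, 6, 7, 9, 5],
})

# Map raw ids to consecutive row and column positions
user_codes, user_ids = pd.factorize(toy_reviews["user_id"])
movie_codes, movie_ids = pd.factorize(toy_reviews["movie_id"])

# Only the observed ratings are stored
sparse_ratings = csr_matrix(
    (toy_reviews["rating"].to_numpy(dtype=float), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

print(sparse_ratings.shape, sparse_ratings.nnz)
```

`user_ids` and `movie_ids` keep the original ids, so a row or column position can be translated back when needed.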

In [None]:
# user_items.head().groupby(['user_id', 'movie_id'])['rating'].max().unstack()

In [9]:
user_by_movie

Check your results below to make sure your matrix is ready for the upcoming sections.

`2.` Now that we have a matrix of users by movies, we can use this data to create a dictionary where the key is each user and the value is an array of the movies each user has rated.

In [11]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''

    return reviews.movie_id[reviews.user_id == user_id].tolist()


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    Creates the movies_seen dictionary
    '''
    
    return {userId:movies_watched(userId) for userId in reviews.user_id.unique()}


# Use your function to return dictionary
movies_seen = create_user_movie_dict()
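
The dictionary comprehension above filters `reviews` once per user, which can be slow on a large dataset. A possible shortcut, sketched here on invented data (`toy_reviews` and `toy_movies_seen` are hypothetical names), is to group once and collect each user's movies in a single pass.

```python
import pandas as pd

# Hypothetical long-format ratings
toy_reviews = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [8, 6, 7, 9, 5],
})

# Group once by user and gather the movie ids into a list per user
toy_movies_seen = toy_reviews.groupby("user_id")["movie_id"].apply(list).to_dict()
print(toy_movies_seen)
```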

`3.` If a user hasn't rated more than 2 movies, we consider these users "too new".  Create a new dictionary that only contains users who have rated more than 2 movies.  This dictionary will be used for all the final steps of this workbook.

In [13]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed users who have seen lower_bound or fewer movies
    
    '''
    
    # Do things to create updated dictionary
    
    return {userId:movieList for userId,movieList in movies_seen.items() if len(movieList)>lower_bound}


# Use your function to return your updated dictionary
movies_to_analyze = create_movies_to_analyze(movies_seen)

### Calculating User Similarities

Now that we have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users.  Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

In [15]:
def compute_correlation(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the correlation between the matching ratings between the two users
    '''
    m1 = movies_to_analyze[user1]
    m2 = movies_to_analyze[user2]
    m12 = np.intersect1d(m1,m2)
    r1 = [reviews.rating[(reviews['user_id'] == user1) & (reviews['movie_id'] == mi)].tolist()[0] for mi in m12]
    r2 = [reviews.rating[(reviews['user_id'] == user2) & (reviews['movie_id'] == mi)].tolist()[0] for mi in m12]
    return np.corrcoef(r1,r2)[0,1] #return the correlation

In [18]:
compute_correlation(2,104)

In [None]:
# One user's ratings on the shared movies are all equal, so every rating equals the mean and x - mean = 0.
# The correlation then evaluates to 0/0, which is NaN.

Think and write your ideas here about why these NaNs exist, and use the cells below to do some coding to validate your thoughts. You can check other pairs of users and see that there are actually many NaNs in our data - 2,526,710 of them in fact. These NaN's ultimately make the correlation coefficient a less than optimal measure of similarity between two users.
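
One way to validate this idea is to reproduce it with made-up numbers. In the sketch below the rating lists are invented; when one user's ratings have zero variance, `np.corrcoef` divides zero by zero and returns NaN.

```python
import numpy as np

constant = [8, 8, 8]   # a user who gave every shared movie the same rating
varied = [7, 9, 10]    # a user whose ratings differ

# Silence the floating-point warning so only the result is shown
with np.errstate(invalid="ignore", divide="ignore"):
    corr_constant = np.corrcoef(constant, varied)[0, 1]

# Two users whose ratings both vary give a defined correlation
corr_varied = np.corrcoef(varied, [6, 8, 10])[0, 1]

print(corr_constant, corr_varied)
```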




In [19]:
# Which movies did both user 2 and user 104 see?
m = np.intersect1d(movies_to_analyze[2], movies_to_analyze[104])
m

In [20]:
# What were the ratings for each user for those movies?

r1 = [reviews.rating[(reviews['user_id'] == 2) & (reviews['movie_id'] == mi)].tolist()[0] for mi in m]
r2 = [reviews.rating[(reviews['user_id'] == 104) & (reviews['movie_id'] == mi)].tolist()[0] for mi in m]
print(r1, r2, sep='\n')

`6.` Because the correlation coefficient proved to be less than optimal for relating user ratings to one another, we could instead calculate the Euclidean distance between the ratings.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.  This function should be very similar to your previous function.  When you feel confident with your function, test it against our results.
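
As a quick worked example with invented ratings, the Euclidean distance between two users is the square root of the summed squared differences on their shared movies. Unlike the correlation, it stays defined when a user's ratings are all the same.

```python
import numpy as np

r1 = np.array([8, 6, 10])   # hypothetical ratings by one user
r2 = np.array([7, 6, 8])    # another user's ratings of the same three movies

# sqrt(1^2 + 0^2 + 2^2)
dist = np.linalg.norm(r1 - r2)

# Identical constant ratings give a distance of 0 rather than NaN
same = np.linalg.norm(np.array([8, 8, 8]) - np.array([8, 8, 8]))

print(dist, same)
```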

In [21]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    ''' 
    m1 = movies_to_analyze[user1]
    m2 = movies_to_analyze[user2]
    m12 = np.intersect1d(m1,m2)
    r1 = [reviews.rating[(reviews['user_id'] == user1) & (reviews['movie_id'] == mi)].tolist()[0] for mi in m12]
    r2 = [reviews.rating[(reviews['user_id'] == user2) & (reviews['movie_id'] == mi)].tolist()[0] for mi in m12]
   
    return np.linalg.norm(np.array(r1)-np.array(r2)) #return the euclidean distance

In [None]:
# Collect one (user1, user2, distance) record per pair of users
dists = []
for user1 in movies_to_analyze:
    for user2 in movies_to_analyze:
        dists.append((user1, user2, compute_euclidean_dist(user1, user2)))

# One row per pair, so all three columns have the same length
df_dists = pd.DataFrame(dists, columns=['user1', 'user2', 'eucl_dist'])
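
One reason the loop above is slow is that every pair of users triggers several filtered scans of `reviews`. A possible alternative, sketched here on an invented matrix, computes all pairwise distances over co-rated movies with three matrix products. The function name `pairwise_overlap_euclidean` and the `toy` array are hypothetical, and this is not how the precomputed distances in this notebook were produced.

```python
import numpy as np

def pairwise_overlap_euclidean(ratings):
    """Euclidean distance between every pair of rows, using only co-rated columns.

    ratings is a 2-D float array with NaN where a user has not rated an item.
    """
    mask = ~np.isnan(ratings)              # True where a rating exists
    filled = np.where(mask, ratings, 0.0)  # missing ratings contribute zero
    m = mask.astype(float)
    sq = filled ** 2
    # Sum over shared items of (r_i - r_j)^2, expanded into three matrix products
    sq_dist = sq @ m.T + m @ sq.T - 2.0 * filled @ filled.T
    return np.sqrt(np.maximum(sq_dist, 0.0))  # clip tiny negatives from rounding

# Hypothetical user-by-movie ratings
toy = np.array([
    [8.0, 6.0, np.nan, 10.0],
    [7.0, np.nan, 5.0, 8.0],
    [np.nan, 6.0, 5.0, np.nan],
])

dist = pairwise_overlap_euclidean(toy)
print(dist)
```

Note that a pair of users with no movies in common also gets a distance of 0 in this sketch, so in practice you would want to check the overlap counts (`m @ m.T`) before trusting a small distance.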


### It will take forever, so I previously calculated it and I will load it from Drive

In [24]:
!gdown --id 1MpGEx2PWBtmVKydgZSyuFxoz8ds0LGQt

In [41]:
# Read in solution euclidean distances
import pickle
df_dists = pd.read_pickle("dists.p")

In [32]:
df_dists.iloc[0]

### Using the Nearest Neighbors to Make Recommendations

In the previous question, you read in **df_dists**. Therefore, you have a measure of distance between each user and every other user. This dataframe holds every possible pairing of users, as well as the corresponding euclidean distance.

Because of the **NaN** values that exist within the correlations of the matching ratings for many pairs of users, as we discussed above, we will proceed using **df_dists**. You will want to find the users that are 'nearest' each user.  Then you will want to find the movies the closest neighbors have liked to recommend to each user.

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with the key as a user_id and the value as a list of movie recommendations
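
Before filling in the functions, it may help to see the whole flow on a tiny invented example. Everything in this sketch (`toy_dists`, `toy_reviews`, `recommend_for`) is hypothetical and only mimics the shape of `df_dists` and `reviews`: walk the neighbors from closest to farthest and collect the movies they liked that the user has not rated yet.

```python
import pandas as pd

# Hypothetical stand-ins for df_dists and reviews
toy_dists = pd.DataFrame({
    "user1":     [1, 1, 1],
    "user2":     [1, 2, 3],
    "eucl_dist": [0.0, 1.0, 4.0],
})
toy_reviews = pd.DataFrame({
    "user_id":  [1, 2, 2, 3, 3],
    "movie_id": [10, 10, 20, 30, 40],
    "rating":   [9, 8, 9, 10, 4],
})

def recommend_for(user, num_recs=2, min_rating=7):
    """Collect liked movies from the nearest neighbors that `user` has not rated."""
    # Closest neighbors first, leaving out the user's own row
    neighbors = (
        toy_dists[(toy_dists.user1 == user) & (toy_dists.user2 != user)]
        .sort_values("eucl_dist")
        .user2
    )
    seen = set(toy_reviews.movie_id[toy_reviews.user_id == user])
    recs = []
    for neighbor in neighbors:
        # This neighbor's liked movies, highest rated first
        liked = toy_reviews[
            (toy_reviews.user_id == neighbor) & (toy_reviews.rating >= min_rating)
        ].sort_values("rating", ascending=False).movie_id
        for movie_id in liked:
            if movie_id not in seen and movie_id not in recs:
                recs.append(movie_id)
            if len(recs) == num_recs:
                return recs
    return recs

print(recommend_for(1))
```

The result is a list of movie ids; translating them to titles is a separate lookup against the movies table.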

In [42]:
def find_closest_neighbors(user):
    '''
    INPUT:
        user - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # I treated ties as arbitrary and just kept whichever order the sort produced
    # You might choose to do something less hand wavy - order the neighbors
    
    # Leave out the user's own row so they are not returned as their own nearest neighbor
    neighbors = df_dists[(df_dists.user1 == user) & (df_dists.user2 != user)]
    
    return neighbors.sort_values(by='eucl_dist').user2.tolist()
    
    
    
def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the minimum rating a movie needs to count as a "like" rather than a "dislike"
    OUTPUT:
    movies_liked - a list of movie_ids the user has watched and liked, highest rated first
    '''
    # Keep only this user's reviews that meet the rating threshold
    liked = reviews[(reviews.user_id == user_id) & (reviews.rating >= min_rating)]
  
    return liked.sort_values(by='rating', ascending=False).movie_id.tolist()


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''

    return movies.movie[movies.movie_id.isin(movie_ids)].tolist()
    
    
def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movie_ids - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    neighbours = find_closest_neighbors(user)
    my_m = set(movies_watched(user))
    recommendations = []
    
    # Walk the neighbours from closest to farthest; stop once enough movies are collected
    for neighbour in neighbours:
        for movie_id in movies_liked(neighbour):
            if len(recommendations) >= num_recs:
                return recommendations
            # Skip movies the user has already seen or that are already recommended
            if movie_id not in my_m and movie_id not in recommendations:
                recommendations.append(movie_id)
    
    return recommendations

def all_recommendations(num_recs=10):
    '''
    INPUT 
        num_recs (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is an array of recommended movie titles
    '''
    # Make the recommendations for each user
    
    
    return {u: movie_names(make_recommendations(u, num_recs)) for u in df_dists.user1.unique()}

first10_recs = all_recommendations(10)

print(first10_recs)

##  This will take a long time to compute all recommendations for all users

In [None]:
# This loads our solution dictionary so you can compare results
all_recs_sol = pd.read_pickle("all_recs.p")