<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Creating-the-User-Item-Matrix" data-toc-modified-id="Creating-the-User-Item-Matrix-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Creating the User-Item Matrix</a></span></li><li><span><a href="#Creating-a-dictionary-for-watched-movies-by-each-user" data-toc-modified-id="Creating-a-dictionary-for-watched-movies-by-each-user-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Creating a dictionary for watched movies by each user</a></span></li><li><span><a href="#Filtering-users-whom-I-could-offer-recommendations-to." data-toc-modified-id="Filtering-users-whom-I-could-offer-recommendations-to.-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Filtering users whom I could offer recommendations to.</a></span></li><li><span><a href="#Calculating-User-Similarities-using-Euclidian-Distance" data-toc-modified-id="Calculating-User-Similarities-using-Euclidian-Distance-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Calculating User Similarities using Euclidian Distance</a></span></li><li><span><a href="#Using-the-Nearest-Neighbors-to-Make-Recommendations" data-toc-modified-id="Using-the-Nearest-Neighbors-to-Make-Recommendations-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Using the Nearest Neighbors to Make Recommendations</a></span></li></ul></div>

# Recommendations with MovieTweetings: Neighbourhood-based Collaborative Filtering

One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, you use the collective record of user-item interactions (here, ratings) to assist in making new recommendations.  

There are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model-Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide recommendations.


In this notebook, I perform **neighborhood-based collaborative filtering**.  There are two main methods for performing neighborhood-based collaborative filtering:

1. **User-based collaborative filtering:** In this type of recommendation, users similar to the user you would like to make recommendations for are identified, and their ratings are used to create the recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, first you find the items that are most related to each other (based on similar ratings).  Then you use an individual's ratings on those similar items to estimate whether that user will like a new item.

In this notebook I will be implementing **user-based collaborative filtering**.

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

print(reviews.head())

   user_id  movie_id  rating   timestamp                 date
0        1    114508       8  1381006850  2013-10-05 22:00:50
1        2    208092       5  1586466072  2020-04-09 22:01:12
2        2    358273       9  1579057827  2020-01-15 03:10:27
3        2  10039344       5  1578603053  2020-01-09 20:50:53
4        2   6751668       9  1578955697  2020-01-13 22:48:17


In [2]:
# Restricting the number of reviews to the maximum my computer can handle; the full dataset exhausts my RAM.
user_items = reviews[['user_id', 'movie_id', 'rating']].iloc[:742337]
user_items.head()

   user_id  movie_id  rating
0        1    114508       8
1        2    208092       5
2        2    358273       9
3        2  10039344       5
4        2   6751668       9


## Creating the User-Item Matrix

The aim is to create a user-by-item dataframe in which each cell holds the rating a user gave to a movie, built from the dataframe **user_items** from the previous cell.

![alt text](images/userxitem.png "User Item Matrix")

To create the user-item matrix (like the one above), I started by using a pivot table. However, I quickly ran into a memory error, described in [this Stack Overflow question](https://stackoverflow.com/questions/61757170/python-unstacked-dataframe-is-too-big-causing-int32-overflow).

The issue was resolved by restricting the number of rows to 90% of the original dataset.

In [3]:
# Create user-by-item matrix
user_by_movie = user_items.groupby(['user_id', 'movie_id'])[
    'rating'].max().unstack()
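
To make the reshaping step concrete, here is a minimal sketch on a made-up ratings table (the `toy` frame and its values are illustrative and not taken from MovieTweetings). It suggests what `groupby(...).max().unstack()` produces: one row per user, one column per movie, `NaN` where a user has not rated a movie, and duplicate ratings collapsed to their maximum.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; user 2 rated movie 20 twice
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 2, 3],
    'movie_id': [10, 20, 20, 20, 30, 10],
    'rating':   [8, 5, 6, 9, 7, 4],
})

# Same reshaping as the cell above: rows are users, columns are movies
toy_matrix = toy.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
print(toy_matrix)

# A pivot table with aggfunc='max' gives the same layout
toy_pivot = toy.pivot_table(index='user_id', columns='movie_id',
                            values='rating', aggfunc='max')
```

Both forms build a dense matrix, so on the full dataset either one can run out of memory, which is why the number of rows was restricted.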

## Creating a dictionary for watched movies by each user

Using the user-item matrix, I created a dictionary where each key is a user_id and the value is an array of the movie_ids that user has rated.

In [4]:
def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    # Select the user's row of user_by_movie and keep only the columns (movies)
    # where the rating is not null.
    # The movie ids are returned as a numpy array for performance
    movies = user_by_movie.loc[user_id].dropna().index.values

    # returning the array
    return movies


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids

    Creates the movies_seen dictionary
    '''
    # The dictionary that will hold the users and their respective rated movies
    movies_seen = dict()

    # Iterate over the user_ids actually present in the matrix index,
    # so the loop does not rely on ids running from 1 to n without gaps
    for user1 in user_by_movie.index:

        # assign the array of movies to each user key
        movies_seen[user1] = movies_watched(user1)

    return movies_seen


# Creating the Dictionary
movies_seen = create_user_movie_dict()
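
Looking up every user with `.loc` on the wide matrix can be slow on a large dataset. One possible alternative, shown here only as a sketch on made-up data and not as the method used above, is to build the same kind of dictionary directly from a long-format ratings table with a `groupby`. The names `toy` and `toy_movies_seen` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings table
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 20, 30, 10],
    'rating':   [8, 5, 9, 7, 4],
})

# One array of unique movie_ids per user, in order of first appearance
toy_movies_seen = {
    user: group['movie_id'].unique()
    for user, group in toy.groupby('user_id')
}
print(toy_movies_seen)
```

The arrays here follow the order of the rows in the table, whereas the matrix-based version returns the ids in column order, so the two may differ in ordering while holding the same ids.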

## Filtering users whom I could offer recommendations to.

If a user hasn't rated more than 2 movies, I consider that user "too new".  The next cell creates a new dictionary that only contains users who have rated more than 2 movies.  This dictionary is used for all the remaining steps of this notebook.

In [5]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids

    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed users who have rated lower_bound or fewer movies

    '''
    # Creating an empty dictionary to be filled with users whom I could offer recommendations
    movies_to_analyze = dict()

    # Unpacking the dictionary
    for user, movies in movies_seen.items():
        # ensuring the number of movies rated by a given user is larger than the lower bound
        if len(movies) > lower_bound:
            # Assigning the analyzable users to the new dict
            movies_to_analyze[user] = movies

    return movies_to_analyze


# Creating the analyzable users dictionary
movies_to_analyze = create_movies_to_analyze(movies_seen)
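
As a small illustration of the boundary case, the sketch below applies the same strict "more than `lower_bound`" rule to a made-up dictionary using a dict comprehension. The dictionary contents are hypothetical.

```python
import numpy as np

# Hypothetical users with 3, 2 and 1 rated movies
toy_movies_seen = {
    1: np.array([10, 20, 30]),
    2: np.array([10, 20]),
    3: np.array([10]),
}
lower_bound = 2

# Keep only users with strictly more than lower_bound movies
toy_to_analyze = {
    user: seen for user, seen in toy_movies_seen.items()
    if len(seen) > lower_bound
}
print(list(toy_to_analyze))
```

The user with exactly 2 movies is dropped, since the comparison is strict.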

## Calculating User Similarities using Euclidian Distance

Now that we have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users. Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance metric between ratings on the same movies for the two users
            store the users and the distance metric
```


The next function calculates the Euclidean distance between two users' ratings on the movies both have rated.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.

In [6]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    '''
    # Pull movies for each user
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]

    # Find the movies both users have rated
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)

    # Calculate euclidean distance between the users' ratings on those movies
    # (a list of row labels selects both users' rows explicitly)
    df = user_by_movie.loc[[user1, user2], sim_movs]
    dist = np.linalg.norm(df.loc[user1] - df.loc[user2])

    return dist  # return the euclidean distance
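
The cell that produced the stored distances is not shown in this notebook. As a rough sketch of how the pseudocode above could be turned into a table of pairwise distances, the example below works on a made-up matrix. The column names `user1`, `user2` and `eucl_dist` are assumed from how **df_dists** is used later, and `toy_matrix`, `toy_euclidean_dist` and `toy_dists` are hypothetical names. Unlike the pseudocode, this sketch keeps any pair with at least one movie in common, and it leaves out the pairing of a user with themselves.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy user-by-movie matrix (NaN = not rated)
toy_matrix = pd.DataFrame(
    {10: [8, 7, np.nan], 20: [5, 5, 9], 30: [9, np.nan, 10], 40: [np.nan, 6, 6]},
    index=pd.Index([1, 2, 3], name='user_id'),
)


def toy_euclidean_dist(user1, user2, matrix):
    """Euclidean distance over the movies both users have rated."""
    # Dropping columns with any NaN keeps only the shared movies
    pair = matrix.loc[[user1, user2]].dropna(axis=1)
    if pair.shape[1] == 0:
        return np.nan  # no overlap, so the distance is undefined
    return float(np.linalg.norm(pair.loc[user1] - pair.loc[user2]))


rows = []
for u1, u2 in combinations(toy_matrix.index, 2):
    d = toy_euclidean_dist(u1, u2, toy_matrix)
    # Store both directions so filtering on user1 finds every neighbor
    rows.append({'user1': u1, 'user2': u2, 'eucl_dist': d})
    rows.append({'user1': u2, 'user2': u1, 'eucl_dist': d})

toy_dists = pd.DataFrame(rows)
print(toy_dists)
```

Because the distance is symmetric, each pair is computed once and stored twice. With many users the number of pairs grows quadratically, which is presumably why the distances are loaded from a file rather than recomputed.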

In [8]:
df_dists = pd.read_pickle("data/Term2/recommendations/lesson1/data/dists.p")

## Using the Nearest Neighbors to Make Recommendations

In the previous cell, I read in **df_dists**. This dataframe holds every possible pairing of users, as well as the corresponding Euclidean distance, so I have a measure of distance between each user and every other user.

I will proceed using **df_dists**. The first step is to find the users that are 'nearest' to each user.  The second is to find the movies those closest neighbors have liked and recommend them to the user.

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)

The five functions below make it possible to find recommendations for any user.

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with the key as a user_id and the value as a list of movie recommendations

In [9]:
def find_closest_neighbors(user):
    '''
    INPUT:
        user - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # To get the closest neighbours, df_dists is used in the following way:
    # 1. Filter on the user I am providing recommendations for
    # 2. Drop the pairing of the user with themselves (distance 0); filtering
    #    by id is safer than dropping the first row when distances are tied
    # 3. Sort by Euclidean distance, ascending, and keep the user2 column
    neighbors = df_dists[(df_dists['user1'] == user) &
                         (df_dists['user2'] != user)]
    closest_users = neighbors.sort_values(by='eucl_dist')['user2']

    # Casting as a numpy array for performance
    closest_neighbors = np.array(closest_users)

    return closest_neighbors


def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the minimum rating for a movie to count as a "like" rather than a "dislike"
    OUTPUT:
    movies_liked - an array of movies the user has watched and liked
    '''
    # Querying the dataframe to keep only the movies the user has rated
    # at or above min_rating (7 by default)
    # Keeping the chosen movie ids as a numpy array
    movies_liked = np.array(user_items.query(
        'user_id == @user_id and rating >= @min_rating')['movie_id'])

    return movies_liked


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids

    '''
    # Retrieving the names corresponding to the movie ids provided by the highly rated movies in 'movies_liked' function
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])

    return movie_lst


def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movies - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    # Recommend movies the user has not already seen.
    # Go in order from closest to farthest neighbor to find movies to recommend.
    # Only movies a neighbor "liked" are considered, i.e. rated at least
    # min_rating in movies_liked (7 by default)

    # movies_seen by user (we don't want to recommend these)
    movies_seen = movies_watched(user)
    closest_neighbors = find_closest_neighbors(user)

    # Keep the recommended movie ids here, closest neighbors first
    recs = np.array([], dtype=int)

    # Go through the neighbors and identify movies they like the user hasn't seen
    for neighbor in closest_neighbors:
        neighbs_likes = movies_liked(neighbor)

        # Movies this neighbor liked that the user has not seen
        new_recs = np.setdiff1d(neighbs_likes, movies_seen)

        # Append only the ids that are not already recommended
        new_recs = new_recs[~np.isin(new_recs, recs)]
        recs = np.concatenate([recs, new_recs])

        # If we have enough recommendations exit the loop
        if len(recs) >= num_recs:
            break

    # Pull movie titles using movie ids, keeping at most num_recs of them
    recommendations = movie_names(recs[:num_recs])

    return recommendations


def all_recommendations(num_recs=10):
    '''
    INPUT 
        num_recs (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is a list of recommended movie titles
    '''

    # All the users we need to make recommendations for
    users = np.unique(df_dists['user1'])

    # Store all recommendations in this dictionary
    all_recs = dict()

    # Make the recommendations for each user
    for user in users:
        all_recs[user] = make_recommendations(user, num_recs)

    return all_recs


# Generate recommendations for all users in the database
all_recs = all_recommendations(10)
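
To see the whole neighbor-based procedure on something small enough to check by hand, here is a self-contained sketch on made-up data. It is only an illustration of the idea, not the functions above: `toy_items`, `toy_dists` and `toy_recommend` are hypothetical, and it returns movie ids rather than titles.

```python
import pandas as pd

# Hypothetical ratings and precomputed distances for user 1
toy_items = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 2, 3, 3],
    'movie_id': [10, 20, 10, 30, 40, 20, 50],
    'rating':   [8, 5, 9, 8, 3, 6, 10],
})
toy_dists = pd.DataFrame({
    'user1': [1, 1], 'user2': [3, 2], 'eucl_dist': [4.0, 1.0],
})


def toy_recommend(user, num_recs=2, min_rating=7):
    """Return up to num_recs unseen movie ids, closest neighbors first."""
    seen = set(toy_items.loc[toy_items['user_id'] == user, 'movie_id'])
    pairs = toy_dists[(toy_dists['user1'] == user) &
                      (toy_dists['user2'] != user)]
    neighbors = pairs.sort_values('eucl_dist')['user2']

    recs = []
    for neighbor in neighbors:
        liked = toy_items[(toy_items['user_id'] == neighbor) &
                          (toy_items['rating'] >= min_rating)]['movie_id']
        for movie in liked:
            # Skip movies the user has seen or that are already recommended
            if movie not in seen and movie not in recs:
                recs.append(int(movie))
        if len(recs) >= num_recs:
            break
    return recs[:num_recs]


print(toy_recommend(1))
```

User 2 is the closest neighbor and contributes movie 30 (movie 10 is already seen and movie 40 is rated too low). User 3 then contributes movie 50.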