## Recommendation Engine - ISE 589 Python Programming Project

### Introduction

A recommendation engine is software that analyzes available data to suggest items a website user might be interested in, such as a book, a video, or a job. Netflix, for example, combines metadata tags on videos with data about user behavior to recommend movies and TV shows to specific members. In this project, I build a recommendation engine using collaborative filtering to predict which movies an existing user, as well as a new user, would want to watch.

### Importing the packages used for the analysis

In [1]:
# Importing Pandas and NumPy packages to perform dataframe and array operations
import pandas as pd
import numpy as np

# Importing Sklearn packages to perform similarity and matrix related operations
from sklearn.metrics.pairwise import pairwise_distances 
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

# Importing iPython Widget packages to create an interactive dashboard layout
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import FloatSlider

### Data used for analysis

This project uses the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota. It consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies
* Demographic info for the users
* Genre information of movies

First, the data is loaded into Python. The ml-100k.zip archive contains many files that can be used. Let's load the three most important files (users, movies and ratings) to get a sense of the data.

In [2]:
# Reading users file:
userscolNames = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(r'C:\Users\Satyajit Narayanan\Desktop\589\Project\ml-100k\ml-100k\u.user', sep='|', names=userscolNames, encoding='latin-1')

# Reading ratings file:
ratingscolNames = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(r'C:\Users\Satyajit Narayanan\Desktop\589\Project\ml-100k\ml-100k\u.data', sep='\t', names=ratingscolNames, encoding='latin-1')

# Reading movies file:
moviecolNames = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
                 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
                 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv(r'C:\Users\Satyajit Narayanan\Desktop\589\Project\ml-100k\ml-100k\u.item', sep='|', names=moviecolNames,
                     encoding='latin-1')

### Modeling the Recommendation Engine

There are broadly two types of recommender engines – Content Based and Collaborative Filtering.
* ***Content Based algorithms*** rely heavily on the context (properties) of each item. Once this item-level information has been gathered, you look for look-alike items and recommend them. This generally works well when it's easy to determine the properties of each item, for instance when all recommendations are of the same kind of item, as in movie or song recommendation.

* ***Collaborative Filtering algorithms*** are based entirely on past behavior and not on context. This makes collaborative filtering one of the most commonly used approaches, since it does not depend on any additional information. All you need is transaction-level information for the industry in question. There are two types of collaborative filtering algorithms:
    * *User-User Collaborative filtering:* Look-alike customers are found (based on similarity), and products are offered based on what a customer's look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources.
    * *Item-Item Collaborative filtering:* This is similar to the previous algorithm, but instead of finding look-alike customers, we find look-alike items. Once we have the item look-alike matrix, we can easily recommend similar items to customers who have purchased any item from the store. This algorithm is far less resource-intensive than user-user collaborative filtering.
    
This project uses the *Collaborative Filtering algorithm* to predict the best recommendations. Both User-User similarity and Item-Item similarity are taken into consideration to make this prediction.
Furthermore, to recommend movies to a *New User*, a model is created to select the best picks based on one previously watched movie.

#### Creating a user-movie ratings matrix to be used to calculate the similarity between users and movies

In [3]:
# Creating a sorted list of unique user ids
uniqueUsers = sorted(ratings.user_id.unique().tolist())

# Creating a list (pandas Series) of movie titles
uniqueMovies = movies['movie title']

# Number of users
numUsers = ratings.user_id.nunique()

# Number of movies
numMovies = ratings.movie_id.nunique()

# Creating the user-movie matrix (rows: users, columns: movies, 0 means "not rated")
# User and movie ids start at 1, so they are shifted by 1 to get array positions
usermovieMatrix = np.zeros((numUsers, numMovies))
for row in ratings.itertuples(index=False):
    usermovieMatrix[row.user_id - 1, row.movie_id - 1] = row.rating
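
To make the indexing concrete, here is a minimal sketch that builds the same kind of matrix from a small, made-up ratings table. The `toy_ratings` data and the `toy_matrix` name are illustrative assumptions and are not part of the MovieLens files. The sketch also assumes that ids are contiguous and start at 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: 3 users, 4 movies, ids starting at 1
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3],
    'movie_id': [1, 3, 2, 1, 4],
    'rating':   [5, 3, 4, 2, 5],
})

# Assumes ids are contiguous, so the largest id equals the count
n_users = int(toy_ratings['user_id'].max())
n_movies = int(toy_ratings['movie_id'].max())

# Rows are users, columns are movies, 0 means "not rated"
toy_matrix = np.zeros((n_users, n_movies))

# Ids are 1-based while array positions are 0-based, hence the "- 1"
rows = toy_ratings['user_id'].to_numpy() - 1
cols = toy_ratings['movie_id'].to_numpy() - 1
toy_matrix[rows, cols] = toy_ratings['rating'].to_numpy()

print(toy_matrix)
```

Each row of the result is one user's ratings, so user 1 ends up with a 5 for movie 1 and a 3 for movie 3, and zeros elsewhere.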

#### Using the *pairwise_distances* function from *sklearn* to calculate the similarity

In [4]:
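# Calculating the user - user matrix
# Note: pairwise_distances returns the cosine distance (1 - cosine similarity),
# so smaller values mean more similar users
userCosineSimilarity = pairwise_distances(usermovieMatrix, metric='cosine')

# Calculating the movie - movie matrix (same convention, computed on the transposed matrix)
movieCosineSimilarity = pairwise_distances(usermovieMatrix.T, metric='cosine')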
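
The relationship between cosine similarity and cosine distance can be checked by hand on a small example. The sketch below uses only NumPy and a made-up matrix (`toy`, `sim` and `dist` are illustrative names). The cosine distance it computes, `1 - similarity`, is the quantity that `pairwise_distances` returns for `metric='cosine'`.

```python
import numpy as np

# Hypothetical ratings: 3 users x 4 movies (every row has at least one rating)
toy = np.array([[5.0, 0.0, 3.0, 0.0],
                [4.0, 0.0, 2.0, 0.0],
                [0.0, 4.0, 0.0, 0.0]])

# Scale each row to unit length, then take dot products between rows
norms = np.linalg.norm(toy, axis=1, keepdims=True)
unit_rows = toy / norms
sim = unit_rows @ unit_rows.T   # cosine similarity, 1 = same direction
dist = 1.0 - sim                # cosine distance, 0 = same direction

print(np.round(sim, 3))
print(np.round(dist, 3))
```

Users 1 and 2 rated the same movies in similar proportions, so their similarity is close to 1 and their distance close to 0. User 3 shares no rated movie with either of them, which gives a similarity of 0 and a distance of 1.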

#### Making predictions based on these movie-movie and user-user similarities

The rating a user would give a movie is predicted based on:
* User-User Similarity
* Movie-Movie Similarity

These predictions are made by multiplying the similarity matrix with the user-movie ratings matrix and normalizing by the sum of the similarity weights.

In [5]:
# Making predictions based on user similarity and movie similarity respectively
user_prediction = userCosineSimilarity.dot(usermovieMatrix) / np.array([np.abs(userCosineSimilarity).sum(axis=1)]).T
movie_prediction = usermovieMatrix.dot(movieCosineSimilarity) / np.array([np.abs(movieCosineSimilarity).sum(axis=1)])
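
As a worked illustration of the normalised matrix product used for `user_prediction`, the sketch below applies the same formula to a tiny example. The `toy_weights` and `toy_ratings` values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical ratings: 2 users x 2 movies (0 means "not rated")
toy_ratings = np.array([[4.0, 0.0],
                        [2.0, 2.0]])

# Hypothetical user-user weights (illustrative values, not computed from data)
toy_weights = np.array([[1.0, 0.5],
                        [0.5, 1.0]])

# Each user's prediction is a weighted sum of all users' ratings,
# divided by the total absolute weight in that user's row
weight_sums = np.abs(toy_weights).sum(axis=1, keepdims=True)
toy_prediction = toy_weights.dot(toy_ratings) / weight_sums

print(toy_prediction)
```

For the first user and the first movie this gives (1.0 * 4 + 0.5 * 2) / 1.5 = 3.33. Using `keepdims=True` keeps the row sums as a column vector, which is the same shape the notebook obtains with `np.array([...]).T`.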


#### Combining Predictions

Here, the ratings predicted by both filtering methods are combined so that the final result reflects both predictions.
The weighting term is ***alpha***. The greater the *alpha* value, the more weight is given to the prediction based on Movie Similarity, and vice versa.

The alpha value can be changed according to the user's preference. The default starting value is 0.5 (50%).

In [6]:
alpha = 0.5
combined_pred = (1-alpha)*user_prediction + alpha*movie_prediction
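
The effect of *alpha* is easiest to see at its extremes. The following sketch uses a hypothetical helper, `blend`, and made-up prediction values to show that `alpha = 0` reproduces the user-based prediction, `alpha = 1` reproduces the movie-based one, and values in between interpolate linearly.

```python
import numpy as np

def blend(user_pred, movie_pred, alpha):
    """Weighted combination of two prediction matrices (hypothetical helper)."""
    return (1 - alpha) * user_pred + alpha * movie_pred

# Made-up predictions for one user and two movies
toy_user_pred = np.array([[4.0, 2.0]])
toy_movie_pred = np.array([[2.0, 4.0]])

only_user = blend(toy_user_pred, toy_movie_pred, alpha=0.0)
only_movie = blend(toy_user_pred, toy_movie_pred, alpha=1.0)
quarter = blend(toy_user_pred, toy_movie_pred, alpha=0.25)

print(only_user, only_movie, quarter)
```

With `alpha = 0.25` the first movie scores 0.75 * 4 + 0.25 * 2 = 3.5, which is closer to the user-based value.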

#### Defining functions to rank and pick top recommendations

A function is defined to display the top 5 recommendations for an *Existing User* and a given *alpha* value.

After the *User ID* is chosen, the top 5 recommendations for that *User* are displayed based on *alpha* = 0.5, that is, with equal weight given to the predictions based on Movie Similarity and User Similarity.

In [7]:
def top5recommendations(User, alpha):
    # Nothing is selected yet (for example, the dropdown is still empty)
    if User is None:
        print("\n")
        return

    # Combined prediction values are recalculated for the chosen alpha
    combined_pred = (1-alpha)*user_prediction + alpha*movie_prediction

    # Prediction scores for this user are ranked (rank 1 = highest score)
    combinedRank = pd.DataFrame(pd.DataFrame(combined_pred).iloc[User-1,])
    combinedRank['Rank'] = combinedRank.rank(ascending=False)

    # Movie titles are added
    combinedRank = pd.merge(combinedRank, movies['movie title'].to_frame(),
                            how='left', left_index=True, right_index=True)

    # Joining actual rating values for the movies
    combinedRankjoin = pd.merge(combinedRank, pd.DataFrame(usermovieMatrix).iloc[User-1,].to_frame(),
                                how='left', left_index=True, right_index=True)

    # Filtering for movies that the user hasn't rated (seen) before
    combinedRankF = combinedRankjoin.drop(combinedRankjoin[combinedRankjoin[f'{User-1}_y']>0].index)

    # Creating a list of top 5 recommendations (sorted by rank)
    rList = combinedRankF.sort_values('Rank').head()['movie title']

    # Printing the recommendations
    print("\nThe top 5 recommended movies are:")
    for a, b in enumerate(rList, 1):
        print('{}. {}'.format(a, b))
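
The core of this function is "rank by predicted score, skipping what the user has already rated". A compact NumPy sketch of that step is shown below. The helper name `top_n_unseen` and the sample values are assumptions made for illustration, and the sketch returns column positions rather than movie titles.

```python
import numpy as np

def top_n_unseen(scores, rated, n=5):
    """Return positions of the n highest-scoring items with no rating yet.

    Hypothetical helper: `scores` holds predicted scores for one user and
    `rated` holds that user's actual ratings, where 0 means "not rated".
    """
    scores = np.asarray(scores, dtype=float)
    rated = np.asarray(rated)
    # Keep only the items the user has not rated
    candidates = np.flatnonzero(rated == 0)
    # Sort the remaining items by descending predicted score
    order = candidates[np.argsort(-scores[candidates], kind='stable')]
    return order[:n].tolist()

toy_scores = [3.1, 4.7, 2.0, 4.9, 4.2]
toy_rated = [0, 5, 0, 0, 0]          # the user already rated item 1
top_two = top_n_unseen(toy_scores, toy_rated, n=2)
print(top_two)
```

Item 1 has the second-highest score but is excluded because it is already rated, so the two picks are items 3 and 4.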
            

A function is defined to display the top 5 recommendations for a *New User* by asking them to choose one previously watched movie.
This is done by ranking movies based on their similarity to the chosen movie (using the similarity scores calculated earlier).

In [8]:
def top5NewUSer(Movie):
    # Nothing is selected yet (for example, the dropdown is still empty)
    if Movie is None:
        print("\n")
        return
    # Position of the chosen movie in the movie list
    Movieindex = uniqueMovies[uniqueMovies == Movie].index[0]
    # Sorting all movies by cosine distance to the chosen movie (smallest distance first)
    movieRecomm = pd.DataFrame(movieCosineSimilarity[Movieindex]).sort_values([0])
    # Removing the chosen movie itself
    movieRecomm = movieRecomm.drop(index=Movieindex)
    # Adding titles for the 5 closest movies
    movieRecomm = pd.merge(movieRecomm.head(), movies['movie title'].to_frame(), how='left', left_index=True, right_index=True)
    print("\nThe top 5 recommended movies are:")
    for a, b in enumerate(movieRecomm['movie title'], 1):
        print('{}. {}'.format(a, b))
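
The lookup above amounts to a nearest-neighbour search in one row of the movie-movie distance matrix. The sketch below shows that idea on a made-up 3 x 3 distance matrix. `nearest_items` and `toy_dist` are hypothetical names, and smaller distances are taken to mean more similar movies.

```python
import numpy as np

def nearest_items(dist_matrix, item, n=5):
    """Return positions of the n items closest to `item` (hypothetical helper)."""
    row = np.asarray(dist_matrix, dtype=float)[item]
    # Ascending sort: the smallest distance comes first
    order = np.argsort(row, kind='stable')
    # Drop the chosen item itself
    order = order[order != item]
    return order[:n].tolist()

# Made-up symmetric distance matrix for 3 movies
toy_dist = np.array([[0.0, 0.2, 0.9],
                     [0.2, 0.0, 0.5],
                     [0.9, 0.5, 0.0]])

closest_to_first = nearest_items(toy_dist, 0, n=2)
closest_to_last = nearest_items(toy_dist, 2, n=1)
print(closest_to_first, closest_to_last)
```

Movie 0 is closest to movie 1 (distance 0.2) and then to movie 2 (distance 0.9), so the first call returns positions 1 and 2 in that order.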

### Evaluating the recommendation engine

There are various metrics for evaluating a recommendation engine. I have used precision, because it expresses the proportion of the items our model flags as relevant that actually are relevant, which is what we want to measure.

Here, **Precision** answers the question: out of all the recommended items, how many did the user actually like?

It is given by:

**Precision = tp/(tp + fp)**

* *tp* represents the number of movies recommended to a user that the user likes (a rating of 4 or 5)
* *tp+fp* represents the total number of movies recommended to a user (I have used the top 20 recommendations for each user)

The larger the precision, the better the recommendations. The average precision across all users is about 63%.
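
As a small worked example of the formula, the sketch below computes precision for one hypothetical user. The helper `precision_at_k`, the movie ids, and the set of liked movies are all made up for illustration.

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the first k recommendations that the user liked.

    Hypothetical helper: `recommended` is an ordered list of movie ids and
    `liked` is the set of movie ids the user rated 4 or 5.
    """
    top_k = recommended[:k]
    true_positives = sum(1 for movie in top_k if movie in liked)
    return true_positives / k

toy_recommended = [10, 4, 7, 1, 9]   # made-up ranked recommendations
toy_liked = {4, 9, 3}                # made-up movies rated 4 or 5

toy_precision = precision_at_k(toy_recommended, toy_liked, k=5)
print(toy_precision)
```

Two of the five recommended movies (4 and 9) are in the liked set, so tp = 2, tp + fp = 5, and the precision is 0.4.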



In [9]:
precisionList=[]
alpha = 0.5
for i in range(len(uniqueUsers)):
    User = i
    combined_pred = (1-alpha)*user_prediction + alpha*movie_prediction 
    # Prediction scores are ranked and sorted
    combinedRank = pd.DataFrame(pd.DataFrame(combined_pred).iloc[User,])
    combinedRank['Rank'] = combinedRank.rank(ascending=False)
    combinedRank.sort_values('Rank')
    # Joining actual rating values for the movies
    combinedRankjoin = pd.merge(combinedRank, 
    pd.DataFrame(usermovieMatrix).iloc[User,].to_frame(), how='left', left_index=True, right_index=True)
    # Keeping only the movies that the user has rated (seen) before
    combinedRated = combinedRankjoin.take(combinedRankjoin[combinedRankjoin[f'{User}_y']>0].index)
    precisionList.append(len(combinedRated.head(20)[combinedRated.head(20)[f'{User}_y']>3])/20)

    
print('The Precision is', sum(precisionList)/len(precisionList))

The Precision is 0.628260869565217


### Visualizing the Recommendation Engine

An interactive dashboard was created as a function using the ipywidgets package for Jupyter Notebooks. It lets the user intuitively choose the movie and/or user for which recommendations are needed.

Below are two functions: one to choose the kind of user and the corresponding recommendation, and one to call the recommendation engine.

In [10]:
def recomm(i):
    if i == 'New User':
        print("Choose 1 movie you have previously watched:")
        return widgets.interactive(top5NewUSer, Movie=widgets.Dropdown(options=uniqueMovies, value=None))
    elif i == 'Existing User':
        print("Choose a user for whom you want to recommend:")
        return widgets.interactive(
            top5recommendations,
            User=widgets.Dropdown(options=uniqueUsers, value=None),
            alpha=widgets.FloatSlider(
                value=0.5, min=0, max=1, step=0.1, description='Alpha:', disabled=False,
                continuous_update=False, orientation='horizontal', readout=True, readout_format='.1f'))


def Recommender():
    # interact_manual adds a button, so recomm only runs when it is clicked
    im = interact_manual(recomm, i=widgets.Dropdown(options=['New User', 'Existing User']))
    # Relabelling the dropdown and the button
    im.widget.children[0].description = 'User Type:'
    im.widget.children[1].description = 'Recommend'

#### Calling the ***Recommender*** function

In [11]:
Recommender()
