In [1]:
# EXECUTE FIRST

# computational imports
import numpy as np
import pandas as pd
pd.set_option('display.html.use_mathjax', False)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Reader, Dataset, KNNBasic, NormalPredictor, BaselineOnly, KNNWithMeans, KNNBaseline
from surprise import SVD, SVDpp, NMF, SlopeOne, CoClustering
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import accuracy

import random
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
plt.style.use('ggplot')

# display imports
from IPython.display import display, IFrame, HTML

<font size=18>Lesson 14 Homework: Recommender Systems 2</font>

# **Question 1** <font color="magenta">(2 points)</font>

Which of the following recommenders is based on the user/item ratings? (Check all that apply.)

* SVD item-based collaborative filter
* KNN user-based collaborative filter
* Content recommender
* Knowledge-based recommender
* Chart

# **Question 2** <font color="magenta">(2 points)</font>

Which Surprise algorithm reduces the size of the problem space through matrix factorization?

* NormalPredictor
* KNNBasic
* KNNWithMeans
* BaselineOnly
* SVD
* KNNWithZScores

# Data Exploration
(Note: This section is not included in the quiz and is ungraded.)

The file **restaurant_ratings.csv** (found in the presentation download for this lesson) contains user ratings for various New York City restaurants. You can read a little more about the data at <a href="https://www.kaggle.com/popoandrew/restaurant-week-2018-in-nyc?select=restaurant_week_2018_final.csv">Kaggle</a>. We have modified the data to generate user ratings that match the star columns in this file.

Do the following:

* read the data into a variable called "ratings"
* display the first 5 lines of the data (get familiar with the data frame)
* find the minimum restaurant rating
* find the maximum restaurant rating
* adjust the rating scale by shifting up 1 if 0 is included
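
The steps above can be sketched on a small in-memory frame. The column names `restaurantID` and `rating` and all values are assumptions for illustration (only `userID` appears in the provided code); with the real file, the frame would come from `pd.read_csv('restaurant_ratings.csv')` instead.

```python
import pandas as pd

# Toy stand-in for the ratings file; column names other than userID are assumed.
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "restaurantID": [10, 11, 10, 12, 11],
    "rating": [0, 3, 4, 2, 1],
})

print(ratings.head())            # first 5 rows

rating_col = ratings.columns[2]  # ratings are assumed to sit in the third column
min_rating = ratings[rating_col].min()
max_rating = ratings[rating_col].max()

# Shift the scale up by 1 only when 0 is part of it
if min_rating == 0:
    ratings[rating_col] = ratings[rating_col] + 1

new_min = ratings[rating_col].min()
new_max = ratings[rating_col].max()
```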

In [1]:
#Add your code here

# **Question 3** <font color="magenta">(2 points)</font>

What is the minimum restaurant rating?



In [2]:
#Add your code here

# **Question 4** <font color="magenta">(2 points)</font>

What is the maximum restaurant rating?



In [3]:
#Add your code here

# **Question 5** <font color="magenta">(2 points)</font>

What is the mean restaurant rating for all restaurants (rounded to 2 significant digits)? 



In [4]:
#Add your code here

# **Question 6** <font color="magenta">(2 points)</font>

What is the median of the restaurant rating scale? 



In [5]:
#Add your code here

# Train/Test Split and Score Setup
(Note: this section is not included in the quiz and is not graded.)

We've provided code below for a scoring function and for splitting the data into train and test sets. Use the train and test sets generated by this code to answer the next questions. Do not change this code, or your answers may not match the expected ones.

In [7]:
#This section not included in quiz/solutions.

#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model, X_test, *args):
    
    #Construct a list of user-item tuples from the testing dataset
    id_pairs = zip(X_test[X_test.columns[0]], X_test[X_test.columns[1]])
    
    #Predict the rating for every user-item tuple
    y_pred = np.array([cf_model(user, item, *args) for (user, item) in id_pairs])
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test[X_test.columns[2]])
    
    #Return the final RMSE score (np.sqrt replaces squared=False, which newer scikit-learn releases no longer accept)
    return np.sqrt(mean_squared_error(y_true, y_pred))

#Assign X as the original ratings dataframe and y as the userID column of ratings.
X = ratings.copy()
y = ratings['userID']

#Split into training and test datasets, stratified along userID
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20, random_state=14)



# **Question 7** <font color="magenta">(2 points)</font>

Compute a baseline model that always returns the median of the rating scale (rounded to 2 significant digits). What is the RMSE on this model?
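
A constant baseline ignores its inputs and always predicts the same value. The sketch below assumes a 1 to 5 scale and a tiny hand-made test frame (both are illustrative, not the homework data), and it computes RMSE directly with NumPy over the first three columns, mirroring the layout the provided `score` function expects.

```python
import numpy as np
import pandas as pd

# Hypothetical test set: user, item, rating in the first three columns
toy_test = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "restaurantID": [10, 11, 10, 12],
    "rating": [5, 3, 1, 3],
})

# Median of an assumed 1-5 rating scale
scale_median = float(np.median(np.arange(1, 6)))

def baseline(user, item):
    """Always predict the median of the rating scale, whatever the inputs."""
    return scale_median

def toy_score(cf_model, X_test):
    """RMSE of cf_model over the (user, item, rating) rows of X_test."""
    pairs = zip(X_test[X_test.columns[0]], X_test[X_test.columns[1]])
    y_pred = np.array([cf_model(u, i) for u, i in pairs])
    y_true = np.array(X_test[X_test.columns[2]])
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse = toy_score(baseline, toy_test)
```

Any function with the signature `(user, item)` can be passed in the same way, which is what lets one scoring function compare very different models.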


In [6]:
#Add your code here

# **Question 8** Build a Weighted Mean User-Based Filter (manually graded) <font color="magenta">(4 points)</font>

Using the data in the file **restaurant_ratings.csv**, build a ratings matrix from the data frame of users, restaurants, and ratings. Then build a user-based collaborative filtering model that predicts a weighted mean rating, using cosine similarity among users as the weights.
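
As a rough illustration of the idea, the sketch below builds a ratings matrix from a toy frame and predicts a rating as the similarity-weighted mean of other users' ratings. The data, the column names other than `userID`, the function name `cf_user_wmean`, and the fallback value of 3.0 are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy training ratings; column names other than userID are assumed.
train = pd.DataFrame({
    "userID":       [1, 1, 2, 2, 3, 3],
    "restaurantID": [10, 11, 10, 12, 11, 12],
    "rating":       [5, 3, 4, 2, 1, 5],
})

# Users as rows, restaurants as columns, NaN where a user gave no rating
r_matrix = train.pivot_table(index="userID", columns="restaurantID", values="rating")

# Cosine similarity needs complete vectors, so missing ratings become 0 here
sim = cosine_similarity(r_matrix.fillna(0))
user_sim = pd.DataFrame(sim, index=r_matrix.index, columns=r_matrix.index)

def cf_user_wmean(user, item, default=3.0):
    """Similarity-weighted mean of the ratings that users gave to item."""
    if item not in r_matrix.columns or user not in user_sim.index:
        return default                        # unseen user or item
    item_ratings = r_matrix[item].dropna()    # users who rated the item
    weights = user_sim.loc[user, item_ratings.index].to_numpy()
    if weights.sum() == 0:
        return default                        # no similar raters to lean on
    return float(np.dot(weights, item_ratings.to_numpy()) / weights.sum())

pred = cf_user_wmean(1, 12)
fallback = cf_user_wmean(1, 999)
```

Here user 1 is much closer to user 2 than to user 3, so the prediction for restaurant 12 lands nearer to user 2's rating of 2 than to user 3's rating of 5.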

In [9]:
# Add your code here

# **Question 9** <font color="magenta">(2 points)</font>

What is the RMSE (rounded to 2 significant digits) of the Weighted Mean algorithm? 






In [7]:
#Add your code here

# **Question 10** User-Based SVD - Hyperparameter tuning (Manually Graded) <font color="magenta">(4 points)</font>
From data in the file **restaurant_ratings.csv**, use the *surprise* library in Python to build an <a href="https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD">SVD</a> user-based collaborative filtering model for the restaurant ratings. Use gridsearch to tune the hyperparameters, reserving 15% of the data to get an unbiased estimate of the accuracy. For the grid, use the following options:

* 'n_epochs': [15, 20, 25] (The number of iterations of the Stochastic Gradient Descent minimization procedure.)
* 'lr_all': [.005, .025, .001] (The learning rate.)
* 'reg_all': [.01, .02, .05] (The penalty for complex models.)

Additionally, use the following:

* 3 folds for cross validation
* a seed of 14


Use the example from the lesson and be sure to set the seed in the appropriate place. **Note:** this code will take several minutes to run.
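
To make the 15% hold-out and the size of the grid concrete, here is a minimal standard-library sketch. The list of tuples only imitates the shape of Surprise's `raw_ratings`, the data are synthetic, and the variable names are hypothetical. Treat it as an outline of the bookkeeping rather than the lesson's exact code.

```python
import random
from itertools import product

# Stand-in for Surprise's raw_ratings: (user, item, rating, timestamp) tuples
raw_ratings = [(u, i, float((u + i) % 5 + 1), None)
               for u in range(20) for i in range(5)]

# Seed before shuffling so the split is reproducible. (Surprise's FAQ also
# seeds NumPy with np.random.seed for reproducible experiments.)
random.seed(14)
random.shuffle(raw_ratings)

# Keep 85% for the grid search, hold out 15% for the unbiased estimate
threshold = int(0.85 * len(raw_ratings))
train_raw = raw_ratings[:threshold]
holdout_raw = raw_ratings[threshold:]

param_grid = {
    "n_epochs": [15, 20, 25],
    "lr_all": [0.005, 0.025, 0.001],
    "reg_all": [0.01, 0.02, 0.05],
}

# Every combination the grid search will try
combos = list(product(*param_grid.values()))
n_fits = len(combos) * 3        # one fit per combination per fold
```

With 27 combinations and 3 folds, the search fits 81 models, which is why the cell takes several minutes on the full data.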




In [8]:
#Add your code here

# **Question 11** <font color="magenta">(2 points)</font>

What is the **biased** accuracy (rounded to 2 significant digits) of the algorithm? 





In [9]:
# Add your code here

# **Question 12** <font color="magenta">(2 points)</font>

What is the **unbiased** accuracy (rounded to 2 significant digits) of the algorithm? 




In [10]:
#Add your code here

# **Question 13** <font color="magenta">(2 points)</font>

What is the number of iterations of the stochastic gradient descent ('n_epochs') value chosen by the grid search? 



In [11]:
#Add your code here


# **Question 14** <font color="magenta">(2 points)</font>

What is the learning rate ('lr_all') chosen by the grid search? 



In [12]:
#Add your code here

# **Question 15** <font color="magenta">(2 points)</font>

What is the regularization ('reg_all') chosen by the grid search? 


In [13]:
#Add your code here

# **Question 16** <font color="magenta">(2 points)</font>

Now that we know what our best parameters should be, we need to train our SVD model on all the available data. Do the following:
* set the seeds for reproducibility
* reset the data.raw_ratings to all of the ratings OR reload the data from the dataframe
* use the build_full_trainset() method to build a full trainset
* set up an SVD algorithm using the best parameters
* fit the data to the trainset
* predict the estimated rating for user 1061 and restaurant 347

What is the predicted estimated rating (rounded to 2 digits) for **user 1061** and **restaurant 347**?





In [14]:
#Add your code here

# Hybrid Filter Setup 
(Note: This section is not included in the quiz/solutions.)

From data in the files **restaurant_ratings.csv** and **restaurants.csv** build a recommender system that is a hybrid of a metadata content-based recommender and the SVD user-based collaborative filter that you just trained.  

To set up your hybrid filter:

* read in the restaurants.csv into a variable called rest
* review the data in the dataframe (Note that we have pre-cleaned the data for you, including using TextBlob to extract just the relevant descriptors from the description. Not all restaurants have a description.)
* make a soup from the following columns, which are all simple strings (**Hint: the soup for the first item in the rest dataframe should be: 'Contemporary American Average_price rustic airy adorable classic most distinguished uncommon innovative American proud only world-class week.IMPORTANT special welcome'**):
    - restaurant_type
    - price_range
    - ambiance
    - descriptors
* Instantiate a CountVectorizer with no stopwords. (We shouldn't have much in the way of stopwords, since it's all keywords.) 
* Use the provided fetchSimilarityMatrix function to get a CountVectorizer similarity matrix using the soup column. (**Hint: the similarity at [0,2] should be 0.2849014411490949.**)
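
The setup steps can be tried on a toy frame first. In the sketch below the restaurant rows and the `name` column are invented; only the four soup column names come from the list above. Joining the columns with single spaces is an assumption that is consistent with the hint string.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy restaurants; the values are invented for illustration
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "restaurant_type": ["Italian", "Italian", "Japanese"],
    "price_range": ["Average_price", "Expensive", "Average_price"],
    "ambiance": ["rustic", "classic", "modern"],
    "descriptors": ["cozy classic", None, "fresh"],
})

soup_cols = ["restaurant_type", "price_range", "ambiance", "descriptors"]

# Join the columns with spaces; missing descriptors become empty strings
toy["soup"] = toy[soup_cols].fillna("").agg(" ".join, axis=1).str.strip()

vectorizer = CountVectorizer()             # stop_words defaults to None
counts = vectorizer.fit_transform(toy["soup"])
sim = cosine_similarity(counts, counts)    # dense (n_items, n_items) array
```

Restaurants A and B share two tokens, so they score higher with each other than A does with C, while B and C share none and score 0.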


In [18]:
# Not Included in Quiz/Solutions
def fetchSimilarityMatrix(df, soupCol, vectorizer, vectorType='Tfidf'):
    '''
    Parameters
    df: the dataframe containing a soup column to transform
    soupCol: The string title of the soup column
    vectorizer: an initialized vectorizer, with all pre-processing you desire
    vectorType: 'Tfidf' or 'Count' - representing the type of vectorizer you used.

    Returns
    Dense similarity matrix (NumPy array with one row and one column per item)
    '''

    # make sure the soup has no NaN
    df[soupCol] = df[soupCol].fillna('')
    nmatrix = vectorizer.fit_transform(df[soupCol])

    #compute similarity with the kernel that matches the vectorizer
    if(vectorType=='Tfidf'):
        print('Using Linear Kernel (Tfidf)')
        sim =linear_kernel(nmatrix, nmatrix)
    else:
        print('Using Cosine_similarity')
        sim = cosine_similarity(nmatrix, nmatrix)
    return sim

def content_recommender(df, seed, seedCol, sim_matrix,  topN=5): 
    #get the indices based off the seedCol
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()
    
    # Obtain the index of the item that matches our seed
    idx = indices[seed]
    
    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))
    
    #delete the item that was passed in
    del sim_scores[idx]
    
    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the top-n most similar items.
    sim_scores = sim_scores[:topN]
    
    # Get the item indices
    item_indices = [i[0] for i in sim_scores]
    
    snip = df.iloc[item_indices].copy()
    snip['sim_score'] = [i[1] for i in sim_scores]
    
    # Return the topN most similar items
    return snip

# **Question 17** Use The Content Recommender <font color="magenta">(2 points)</font>

Using the provided content recommender function and the code you've prepared, get the top 5 recommendations for 'Tao Uptown'. (Hint: the top restaurant for 'Becco' should be 'Scampi'.)

Which of these restaurants is the top recommendation?

* Haru Sushi - Amsterdam Ave
* Bistrot Leo
* Rice & Gold
* Zengo - NYC
* Restaurant Nippon


In [15]:
#Your code here

# **Question 18** - Build the Hybrid Function (manually graded) <font color="magenta">(4 points)</font>

Sometimes recommendation designers are less focused on recommending the items with the highest ratings and more focused on recommending items that will have an acceptable rating but are very similar to items the user has previously liked. For the homework, we're going to build a hybrid recommender that predicts the ratings a single user would give to all of the restaurants, keeps only the restaurants whose predicted rating meets a specified minimum, and then returns the restaurants that are most similar (content-wise). We'll follow the example used in the lesson, in which we pre-fetch the content recommendations and pass them into the hybrid function.

The full list of parameters needed will be:
* user: the userid for which we are making predictions
* contentRecs: the dataframe that contains the content recommendations, with similarity scores (this is returned for you by the content_recommender function we provided)
* algo: the trained algorithm to use for collaborative filtering
* predCol: the column in your contentRecs that can be used for predictions
* minRating: the minimum rating we'll accept (estimated ratings should be greater than or equal to this number)
* N: the final number of recommendations to return

Your function should return a dataframe that contains all of the information that was in your contentRecs plus the estimated rating for the "N" number of rows.
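
One possible shape for such a function is sketched below. It is only an outline under stated assumptions: `StubAlgo` is a hypothetical stand-in for a trained Surprise model (it mimics `predict(uid, iid)` returning an object with an `.est` attribute), the column name `est_rating` is invented, and the toy `contentRecs` frame stands in for the output of `content_recommender`.

```python
from collections import namedtuple
import pandas as pd

# Stand-in for a Surprise prediction: only the fields used here
Prediction = namedtuple("Prediction", ["uid", "iid", "est"])

class StubAlgo:
    """Hypothetical model that 'predicts' from a fixed lookup table."""
    def __init__(self, table, default=3.0):
        self.table, self.default = table, default

    def predict(self, uid, iid):
        return Prediction(uid, iid, self.table.get((uid, iid), self.default))

def hybrid(user, contentRecs, algo, predCol, minRating, N):
    """Keep content recs with estimated rating >= minRating, most similar first."""
    recs = contentRecs.copy()
    recs["est_rating"] = [algo.predict(user, item).est for item in recs[predCol]]
    recs = recs[recs["est_rating"] >= minRating]
    return recs.sort_values("sim_score", ascending=False).head(N)

# Toy content recommendations with similarity scores
contentRecs = pd.DataFrame({
    "restaurantID": [1, 2, 3, 4],
    "sim_score": [0.9, 0.8, 0.7, 0.6],
})
algo = StubAlgo({(7, 1): 4.0, (7, 2): 4.8, (7, 3): 4.6, (7, 4): 4.9})
top = hybrid(7, contentRecs, algo, "restaurantID", minRating=4.5, N=2)
```

Restaurant 1 is the most similar but is dropped because its estimate falls below the minimum, so the result is ordered by similarity among the restaurants that pass the rating filter.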

In [36]:
#Your code here

# **Question 19** - Calling the Hybrid Function <font color="magenta">(2 points)</font>

Use your hybrid function to find recommendations for **user 1235** and restaurant '**Lido**'. 
* Remember, you will need to call your content_recommender function first to get the similarity scores. (Hint: there are 348 total restaurants.) 
* Use the SVD algorithm you trained on the full trainset in Question 16 to predict ratings. 
* The minimum allowed rating is 4.5. 
* Return the top 3 recommendations. 

**Which answer shows the top 3 recommendations, in order?**

*Hint: If you make recommendations for user 1061 and 'Schilling', keeping everything else the same, the top recommendation should be 'Edi and The Wolf'.*

* Naples 45 Ristorante E Pizzeria, Obica Mozzarella Bar Pizza e Cucina, La Pecora Bianca - NoMad
* La Pecora Bianca - NoMad, La Pecora Bianca - Midtown, Becco
* Becco, La Pecora Bianca - Midtown, Stella 34 Trattoria
* La Pecora Bianca - NoMad, La Pecora Bianca - Midtown, Stella 34 Trattoria
* Esca, Lincoln Ristorante, La Pecora Bianca - Midtown




In [16]:
#Add your code here

# **Question 20** KNNWithMeans item-based collaborative filter (manually graded) <font color="magenta">(4 points)</font>

Train a <a href="https://surprise.readthedocs.io/en/stable/knn_inspired.html?highlight=knnwith#surprise.prediction_algorithms.knns.KNNWithMeans">KNNWithMeans Surprise collaborative filter</a>. We ran a gridsearch already and learned that the best k for this is 3, and we get the best results using an item-based similarity measure. You should:

* Set seeds of 14
* Read in the data and set up your reader
* Set up a data object
* Build a full trainset
* set up a KNNWithMeans algorithm using the following parameters:
    * k of 3 
    * set the <a href="https://surprise.readthedocs.io/en/stable/prediction_algorithms.html?highlight=user_based#similarity-measure-configuration">sim_options 'user_based' to False</a> (this switches it to an item-based similarity measure, instead of a user-based).
* fit the algorithm using the full trainset
* predict the rating for **user 1000** and **restaurant 300**

**Hint: the predicted rating for user 1000 and restaurant 300 should be 4.32**
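
For orientation, the Surprise documentation gives the item-based form of the KNNWithMeans estimate as shown below. Here $\mu_i$ is the mean rating of item $i$, $N^k_u(i)$ is the set of at most $k$ items rated by user $u$ that are most similar to item $i$, and $\text{sim}$ is the configured similarity measure.

```latex
\hat{r}_{ui} = \mu_i + \frac{\sum_{j \in N^k_u(i)} \text{sim}(i, j)\,(r_{uj} - \mu_j)}{\sum_{j \in N^k_u(i)} \text{sim}(i, j)}
```

With a `k` of 3, at most three neighboring restaurants contribute to each estimate.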





In [17]:
#Add your code here


# **Question 21** Hybrid with KNN <font color="magenta">(2 points)</font>

Use your hybrid function again with  **user 1235** and restaurant '**Lido**'. 

* Remember, you will need to call your content_recommender function first to get the similarity scores. (Hint: there are 348 total restaurants.) 
* Use the KNN algorithm you just trained to predict ratings. 
* The minimum allowed rating is 4.5. 
* Return the top 3 recommendations. 


**Hint: If you call your function with user 1001 and Feast, the top recommendation should be Tuome.**

What are the top 3 restaurants, in order?

* Bar Primi, Naples 45 Ristorante E Pizzeria, La Pecora Bianca - NoMad
* Il Mulino New York - Uptown, Naples 45 Ristorante E Pizzeria, Bar Primi
* .Tarallucci e Vino Upper West Side, Il Mulino New York - Uptown, La Pecora Bianca - NoMad
* Naples 45 Ristorante E Pizzeria, La Pecora Bianca - NoMad, Il Mulino New York - Uptown
* La Pecora Bianca - Midtown, La Pecora Bianca - NoMad, Naples 45 Ristorante E Pizzeria


In [18]:
#Your code here