In [1]:
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

<font size=18>Homework 11: Recommender Systems 2</font>

# Build a Baseline Model and Compute the RMSE

The file **rating_final.csv** (found in the presentation download for this lesson) contains user ratings for overall, food, and service for various restaurants.   

Do the following:

* display the first 5 lines of the data (get familiar with the data frame)
* find the minimum restaurant rating
* find the maximum restaurant rating
* adjust the rating scale by shifting up 1 if 0 is included
* calculate the mean restaurant rating for all restaurants (just to get an idea)
* drop the ratings for food and service so that only the overall rating remains
* split the data set so that 80\% of a user's ratings are in the training set and 20\% are in the testing set
* build a baseline model that assigns the appropriate rating for all predictions and compute the RMSE of these on the testing set

Click <a href="https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings">here</a> or <a href="https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data">here</a> for more details about the data set.
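
The per-user 80/20 split described in the list above can be sketched with the standard library alone. This is a minimal illustration on hypothetical toy data, not the method the solution cell uses (that cell calls scikit-learn's `train_test_split` with `stratify`); the helper name `split_per_user` and the toy tuples are assumptions made for the example.

```python
import random
from collections import defaultdict

def split_per_user(rows, test_frac=0.2, seed=42):
    """Hold out roughly test_frac of each user's ratings.

    rows is a list of (userID, placeID, rating) tuples. A user with a single
    rating keeps it in the training set; every other user contributes at
    least one rating to the test set.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(test_frac * len(shuffled))) if len(shuffled) > 1 else 0
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Hypothetical toy data: 3 users with 10 ratings each on a 1-3 scale
toy = [(f"U{u}", f"P{p}", (u + p) % 3 + 1) for u in range(3) for p in range(10)]
train, test = split_per_user(toy)
print(len(train), len(test))
```

Splitting within each user guarantees that every user who has at least two ratings appears in both sets, which a user-based filter needs in order to make predictions for test users.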

<font color = "blue"> *** 12 points -  answer in cells below *** (don't delete this cell) </font>

In [2]:
# enter your code here
import pandas as pd
import numpy as np

ratings = pd.read_csv('./data/rating_final.csv')
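# shift the scale up by 1 only if 0 is included, so ratings run from 1 to 3
if ratings['rating'].min() == 0:
    ratings['rating'] += 1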

print(f"Minimum overall rating: {ratings['rating'].min()}")
print(f"Maximum overall rating: {ratings['rating'].max()}")
print(f"Mean overall rating: {ratings['rating'].mean():.2f}")

# drop unused ratings
ratings = ratings.drop(['food_rating','service_rating'], axis=1)

#Import the train_test_split function
from sklearn.model_selection import train_test_split

#Assign X as the original ratings dataframe and y as the userID column of ratings.
X = ratings.copy()
y = ratings['userID']

#Split into training and test datasets, stratified by userID
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.20, random_state=42)

#Import the mean_squared_error function
from sklearn.metrics import mean_squared_error

#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

#Define the baseline model to always return scale midpoint.
def baseline(userID, placeID):
    return 2.0

#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model):
    
    #Construct a list of user-restaurant tuples from the testing dataset
    id_pairs = zip(X_test['userID'], X_test['placeID'])
    
    #Predict the rating for every user-restaurant tuple
    y_pred = np.array([cf_model(user, place) for (user, place) in id_pairs])
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])
    
    #Return the final RMSE score
    return rmse(y_true, y_pred)

print(f'RMSE for baseline model: {score(baseline):.3f}')

Minimum overall rating: 1
Maximum overall rating: 3
Mean overall rating: 2.20
RMSE for baseline model: 0.800
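
The baseline above always predicts the scale midpoint of 2.0. Another common choice for a constant baseline is the mean of the training ratings, because the mean is the constant that minimizes RMSE on the data it is computed from. The sketch below compares the two on hypothetical toy ratings; the arrays and the helper name `constant_rmse` are assumptions for illustration and are not taken from **rating_final.csv**.

```python
import numpy as np

def constant_rmse(y_true, c):
    """RMSE of a model that predicts the constant c for every rating."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_true - c) ** 2)))

# Hypothetical ratings on the shifted 1-3 scale
train_ratings = np.array([1, 2, 2, 3, 3, 3, 3, 2, 1, 3])
test_ratings = np.array([3, 2, 3, 1, 3])

midpoint = 2.0
train_mean = float(train_ratings.mean())

print(f"midpoint baseline RMSE on test:   {constant_rmse(test_ratings, midpoint):.3f}")
print(f"train-mean baseline RMSE on test: {constant_rmse(test_ratings, train_mean):.3f}")
```

The guarantee holds only on the training ratings themselves; on a held-out test set the mean baseline usually does well but is not certain to beat the midpoint.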


# Build a Weighted Mean User-Based Filter

From data in the file **rating_final.csv**, build a ratings matrix from the data frame of users, places, and restaurant ratings and build a user-based collaborative filtering model that weights mean rank using cosine similarity among users.  Fit the model on the training set and compute the RMSE for this model using the test set and compare it to the RMSE of the baseline model.  Is it better than baseline?  (*i.e.* is the RMSE smaller?)
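
As a small worked example of the weighted mean, the predicted rating for a user and a restaurant is the sum of (similarity × rating) over the other users who rated that restaurant, divided by the sum of those similarities. The sketch below uses hypothetical similarity scores and ratings; the function name `weighted_mean_rating` and the fallback of 2.0 are assumptions for illustration.

```python
import numpy as np

def weighted_mean_rating(sim_to_others, ratings_for_item, default=2.0):
    """Similarity-weighted mean of the ratings other users gave one item.

    Users whose rating is NaN (they did not rate the item) are ignored.
    Falls back to `default` when no similarity weight remains.
    """
    sims = np.asarray(sim_to_others, dtype=float)
    ratings = np.asarray(ratings_for_item, dtype=float)
    rated = ~np.isnan(ratings)
    total = sims[rated].sum()
    if total <= 0:
        return default
    return float(np.dot(sims[rated], ratings[rated]) / total)

# Hypothetical values: three other users, the third did not rate the item
sims = [0.9, 0.1, 0.5]
ratings = [3.0, 1.0, np.nan]
print(weighted_mean_rating(sims, ratings))  # (0.9*3 + 0.1*1) / (0.9 + 0.1), about 2.8
```

The very similar user (0.9) pulls the prediction close to their rating of 3, while the dissimilar user (0.1) barely moves it.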

<font color = "blue"> *** 12 points -  answer in cells below *** (don't delete this cell) </font>

In [3]:
# enter your code here

#Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='userID', columns='placeID')

#Create a dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)

# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)

#Convert into pandas dataframe 
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.index, columns=r_matrix.index)

def cf_user_wmean(user_id, place_id):
    
    #Check that place_id is a column of r_matrix, i.e. at least one user
    # rated this restaurant in the training set
    if place_id in r_matrix:
        
        #Get the similarity scores for the user in question with every other user
        sim_scores = cosine_sim[user_id]
        
        #Get the user ratings for the restaurant in question
        m_ratings = r_matrix[place_id]
        
        #Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index
                
        #Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()
        
        #Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)
        
        #Compute the final weighted mean
        if sim_scores.sum()>0:
            wmean_rating = np.dot(sim_scores, m_ratings)/ sim_scores.sum()
        else:  # user had zero cosine similarity with other users
            wmean_rating = 2.0
    
    else:
        #Default to a rating of 2.0 in the absence of any information
        wmean_rating = 2.0
    
    return wmean_rating

print(f'RMSE for weighted mean user-based model: {score(cf_user_wmean):.3f}')

RMSE for weighted mean user-based model: 0.867


# Build a kNN-Based Collaborative Filter

From data in the file **rating_final.csv**, use the *surprise* library in Python to build a kNN-based collaborative filtering model for the restaurant ratings.  Fit the model on the training set, compute the RMSE for this model on the test set, and compare it to the RMSEs of the baseline and weighted mean user-based models.
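
To make the kNN idea concrete, the sketch below predicts one rating from the k most similar users, using a mean-squared-difference (MSD) similarity of the form 1 / (MSD + 1) over co-rated restaurants. It is a simplified NumPy illustration under stated assumptions, not the *surprise* implementation: the toy matrix, the function names, and the fallback of 2.0 are all hypothetical.

```python
import numpy as np

def msd_similarity(a, b):
    """1 / (MSD + 1) over the items both users rated; 0 if they share none."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    msd = np.mean((a[both] - b[both]) ** 2)
    return float(1.0 / (msd + 1.0))

def knn_predict(R, user, item, k=2, default=2.0):
    """Similarity-weighted mean rating of the k nearest users who rated `item`."""
    candidates = []
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        sim = msd_similarity(R[user], R[other])
        if sim > 0:
            candidates.append((sim, R[other, item]))
    if not candidates:
        return default
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    sims = np.array([s for s, _ in top])
    vals = np.array([r for _, r in top])
    return float(np.dot(sims, vals) / sims.sum())

# Hypothetical ratings matrix: rows are users, columns are restaurants, NaN = unrated
R = np.array([
    [3.0, 1.0, np.nan],
    [3.0, 1.0, 3.0],
    [1.0, 3.0, 1.0],
    [2.0, np.nan, 2.0],
])
print(knn_predict(R, user=0, item=2, k=2))
```

With k=2 only the two closest users (similarities 1.0 and 0.5) contribute, so the dissimilar third user (similarity 0.2) is ignored entirely rather than merely down-weighted.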

<font color = "blue"> *** 12 points -  answer in cells below *** (don't delete this cell) </font>

In [5]:
#Import the required classes and methods from the surprise library
from surprise import Reader, Dataset, KNNBasic

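#Define a Reader object
#The Reader object helps in parsing the file or dataframe containing ratings
#Set the rating scale explicitly: the shifted ratings run from 1 to 3 (the default is 1 to 5)
reader = Reader(rating_scale=(1, 3))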

#Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

#Define the algorithm object; in this case kNN
knn = KNNBasic()

# can use cross validation to estimate generalization RMSE
np.random.seed(8675309) # for reproducibility
from surprise.model_selection import cross_validate
result = cross_validate(knn, data, measures=['RMSE'], cv=5, verbose=True)

print(f"\nRMSE for KNN model: {np.mean(result['test_rmse']):.3f}")
print("Exact answers will vary due to random numbers.")

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8227  0.8498  0.7901  0.7941  0.7678  0.8049  0.0284  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    

RMSE for KNN model: 0.805
Exact answers will vary due to random numbers.


# Build a Hybrid Filter

From data in the files **rating_final.csv** and **geoplaces2.csv** build a recommender system that is a hybrid of a metadata content-based recommender and the SVD collaborative filter.  Your recommender should do the following:

* Take in a user ID and restaurant name as user input
* Use a metadata content-based model to compute the 25 most similar restaurants based on cosine similarity from the following details (create a soup as done for the content-based recommender from Lesson 10)
    - price
    - dress code
    - accessibility
    - ambiance
    - alcohol
    - smoking area
* Compute the predicted ratings that the user might give to these 25 restaurants using the SVD collaborative filter
* Print price, dress code, accessibility, ambiance, alcohol, and smoking area to see if they are similar for the predicted restaurants.
* Return the top 10 restaurant recommendations along with their predicted ratings when user **U1077** enters the restaurant named **Restaurante Tiberius**. 
* Also return the top 10 restaurant recommendations along with their predicted ratings when user **U1065** enters the restaurant named **Restaurante Tiberius** and comment on the similarities and differences in the resulting recommendations. 
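
The "soup" step in the list above can be illustrated without any libraries: join the metadata fields into one string, count the tokens, and compare the count vectors with cosine similarity. This is a minimal sketch on hypothetical restaurants; splitting on whitespace is a simplification and will not tokenize exactly like scikit-learn's `CountVectorizer`.

```python
import math
from collections import Counter

def soup_vector(soup):
    """Bag-of-words counts for a metadata soup, split on whitespace."""
    return Counter(soup.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical soups: price, dress code, accessibility, ambiance, alcohol, smoking area
soups = {
    "A": "medium informal no_accessibility familiar No_Alcohol_Served none",
    "B": "medium informal no_accessibility familiar No_Alcohol_Served none",
    "C": "high formal completely quiet Full_Bar section",
}
vectors = {name: soup_vector(text) for name, text in soups.items()}

print(round(cosine(vectors["A"], vectors["B"]), 6))  # identical metadata
print(round(cosine(vectors["A"], vectors["C"]), 6))  # no shared tokens
```

With only six short fields per restaurant, identical soups are common, so many pairs end up with a similarity of exactly 1.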

*Note 1: Unlike the movie data used in the textbook example, this data set does not use two different IDs in separate files for each restaurant, so you won't need the cell for mapping IDs to titles.*

*Note 2: Because the "soup" contains so few words, many restaurant pairs have a cosine similarity of 1, which affects the predicted ratings and recommendations.  You will therefore need a different method for excluding the item's cosine similarity with itself (use the **del** statement).*
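
The reason Note 2 matters can be seen on a tiny example. When several items tie at similarity 1, sorting and then skipping the first entry may discard a different item and leave the query item in the results, whereas deleting by position before sorting always removes the query item itself. The similarity row and the function names below are hypothetical.

```python
def top_similar(sim_row, idx, n):
    """Delete the query item by position, then take the n highest scores."""
    scores = list(enumerate(sim_row))
    del scores[idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[:n]]

def top_similar_by_slice(sim_row, n):
    """Sort first and skip the top entry, assuming it is the query item."""
    scores = sorted(enumerate(sim_row), key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[1:n + 1]]

# Hypothetical similarity row for query item 2; items 0 and 3 also score 1.0
sim_row = [1.0, 0.5, 1.0, 1.0]

print(top_similar(sim_row, idx=2, n=2))    # query item 2 is excluded
print(top_similar_by_slice(sim_row, n=2))  # item 0 is skipped, query item 2 remains
```

Because Python's sort is stable, tied items keep their original order, so the slice approach skips whichever tied item comes first in the row rather than the query item.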

<font color = "blue"> *** 14 points -  answer in cells below *** (don't delete this cell) </font>

In [6]:
# enter your code here

# load restaurant meta data file
geoplaces = pd.read_csv('./data/geoplaces2.csv')

#Function that creates a soup out of the desired metadata
def create_soup(x):
    return x['price'] + ' ' + x['dress_code'] + ' ' + x['accessibility'] + ' ' + x['Rambience'] + ' ' + x['alcohol'] + ' ' + x['smoking_area']

# Create the new soup feature
geoplaces['soup'] = geoplaces.apply(create_soup, axis=1)

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Define a new CountVectorizer object and create vectors for the soup
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(geoplaces['soup'])

#Import cosine_similarity function
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity matrix from the count vectors
cosine_sim = cosine_similarity(count_matrix, count_matrix)

#Build the SVD based Collaborative filter
from surprise import SVD, Reader, Dataset
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(1,3))
ratings = pd.read_csv('./data/rating_final.csv')
ratings['rating'] += 1
data = Dataset.load_from_df(ratings[['userID', 'placeID', 'rating']], reader)

print("Training the SVD collaborative filter:\n")
np.random.seed(8675309) # for reproducibility
algo = SVD()
result = cross_validate(algo,data,cv=5,verbose=True)

print(f"\nRMSE for SVD model: {np.mean(result['test_rmse']):.3f}")
print("Exact answers will vary due to random numbers.")

#map each restaurant name to its row index, keeping the first row for any duplicated name
indices = pd.Series(geoplaces.index, index=geoplaces['name'])
indices = indices[~indices.index.duplicated(keep='first')]
    
# setup the hybrid filter
def hybrid(user_id, restaurant_name):
    
    # Obtain the index of the restaurant that matches restaurant_name
    idx = indices[restaurant_name]
    
    #Extract the similarity scores and their corresponding index for every item from the cosine_sim matrix
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    #excluding the similarity score of the item with itself
    del sim_scores[idx]
    
    #Sort the (index, score) tuples in decreasing order of similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    #Select the top 25 tuples (the item itself was already removed above)
    sim_scores = sim_scores[0:25]
    
    #Store the cosine_sim indices of the top 25 restaurants in a list
    item_indices = [i[0] for i in sim_scores]

    #Extract the metadata of the aforementioned restaurants
    items = geoplaces.iloc[item_indices][['name', 'placeID', 'price', 'dress_code', 'accessibility', 'Rambience','alcohol','smoking_area']]
    
    #Compute the predicted ratings using the SVD filter
    items['est_rating'] = items['placeID'].apply(lambda x: algo.predict(user_id, x).est)
    
    #Sort the items in decreasing order of predicted rating
    items = items.sort_values('est_rating', ascending=False)
    
    #Return the top 10 items as recommendations
    return items.head(10)

Training the SVD collaborative filter:

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.6354  0.6971  0.6685  0.6558  0.6458  0.6605  0.0213  
MAE (testset)     0.5428  0.5929  0.5660  0.5596  0.5483  0.5619  0.0175  
Fit time          0.05    0.04    0.04    0.04    0.04    0.04    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    

RMSE for SVD model: 0.661
Exact answers will vary due to random numbers.


In [7]:
# results will vary some due to the random folds in the SVD training
hybrid('U1077', 'Restaurante Tiberius')

Unnamed: 0,name,placeID,price,dress_code,accessibility,Rambience,alcohol,smoking_area,est_rating
53,Michiko Restaurant Japones,135034,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.515511
108,Mariscos El Pescador,135075,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.4819
49,Restaurante Alhondiga,135063,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.471512
34,El Rincon de San Francisco,135025,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.468247
20,Restaurant El Muladar de Calzada,135033,medium,informal,no_accessibility,familiar,No_Alcohol_Served,section,2.41932
0,Kiku Cuernavaca,134999,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.4168
44,Dominos Pizza,132869,medium,informal,no_accessibility,familiar,No_Alcohol_Served,not permitted,2.402928
51,Restaurant los Pinos,135000,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.389935
25,Restaurant Oriental Express,135042,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.375125
42,Restaurante El Reyecito,135046,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.325534


In [8]:
# results will vary some due to the random folds in the SVD training
hybrid('U1065', 'Restaurante Tiberius')

Unnamed: 0,name,placeID,price,dress_code,accessibility,Rambience,alcohol,smoking_area,est_rating
34,El Rincon de San Francisco,135025,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.670554
108,Mariscos El Pescador,135075,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.604684
49,Restaurante Alhondiga,135063,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.394037
53,Michiko Restaurant Japones,135034,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.297742
0,Kiku Cuernavaca,134999,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.296214
77,Restaurant Wu Zhuo Yi,135044,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.290856
51,Restaurant los Pinos,135000,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.287768
42,Restaurante El Reyecito,135046,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.286794
126,Sushi Itto,135072,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.275986
25,Restaurant Oriental Express,135042,medium,informal,no_accessibility,familiar,No_Alcohol_Served,none,2.240447
