# Capstone Term 2 - Project

* Marcos Bittencourt
---
* Contributors
    * Todd McCullough [Git](https://github.com/tamccullough)
    * Savya Sharma [Git](https://github.com/SavyaSharma)
    * Marko Topitch [Git](https://github.com/TopMarko)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import pandas as pd
import numpy as np
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

##### Import Data

In [3]:
recipes_df = pd.read_csv('datasets/recipes-sub.csv')
users_df = pd.read_csv('datasets/users-sub.csv')
master_ratings_df = pd.read_csv('datasets/reviews-sub.csv')

In [4]:
recipes_df.columns

Index(['recipe_id', 'title', 'servings', 'cals_per_serving', 'prep_time',
       'cook_time', 'ready_time', 'nutrition', 'ingredients', 'url'],
      dtype='object')

In [5]:
recipes_df.head()

Unnamed: 0,recipe_id,title,servings,cals_per_serving,prep_time,cook_time,ready_time,nutrition,ingredients,url
0,1,Chocolate Sandwich Cookies I,12,600.0,30 Minutes,8 Minutes,1 h 10 m,Per Serving: 600 calories;26.9 g fat;86.6g car...,3 cups all-purpose flour;1 1/2 cups white suga...,https://www.allrecipes.com/recipe/10000/chocol...
1,2,Chocolate Pizzelles,12,358.0,,,,Per Serving: 358 calories;22.6 g fat;35.5g car...,4 eggs;1/4 cup cocoa powder;1 cup white sugar;...,https://www.allrecipes.com/recipe/10001/chocol...
2,3,Cookie Press Shortbread,24,116.0,25 Minutes,10 Minutes,35 m,Per Serving: 116 calories;7.8 g fat;10.9g carb...,1 cup butter;1 1/2 cups all-purpose flour;1/2 ...,https://www.allrecipes.com/recipe/10003/cookie...
3,4,Brown Sugar Frosting,16,59.0,,,,Per Serving: 59 calories;0 g fat;15.2g carbohy...,1/2 cup packed dark brown sugar;1 tablespoon w...,https://www.allrecipes.com/recipe/10009/brown-...
4,5,Butter Cookies II,36,103.0,15 Minutes,10 Minutes,1 h 40 m,Per Serving: 103 calories;5.3 g fat;12.7g carb...,1 cup butter;1 cup white sugar;1 egg;2 2/3 cup...,https://www.allrecipes.com/recipe/10011/butter...


In [6]:
users_df.head()

Unnamed: 0,user_id,AR_id,link
0,1,naples34102,https://www.allrecipes.com/cook/naples34102
1,2,1207425,https://www.allrecipes.com/cook/1207425
2,3,183259,https://www.allrecipes.com/cook/183259
3,4,162514,https://www.allrecipes.com/cook/162514
4,5,1159515,https://www.allrecipes.com/cook/1159515


In [7]:
master_ratings_df.head()

Unnamed: 0,reviewer_id,recipe_id,rating,date,link
0,1,1,5,2012-02-16,https://www.allrecipes.com/recipe/10000/chocol...
1,2,1,2,2004-11-27,https://www.allrecipes.com/recipe/10000/chocol...
2,1,1,5,2012-02-16,https://www.allrecipes.com/recipe/10000/chocol...
3,3,1,4,2003-08-05,https://www.allrecipes.com/recipe/10000/chocol...
4,2,1,2,2004-11-27,https://www.allrecipes.com/recipe/10000/chocol...


### Data Cleaning

In [8]:
ratings_df = master_ratings_df.copy()
ratings_df.pop('date')
ratings_df.pop('link')
ratings_df.columns = ['user', 'item', 'rating']

In [9]:
ratings_df.head()

Unnamed: 0,user,item,rating
0,1,1,5
1,2,1,2
2,1,1,5
3,3,1,4
4,2,1,2


##### Clean the ratings (Remove duplicates)

Tastes change, and making a recipe a few times can change a user's rating. This means that users can rate a recipe multiple times. You can see this in the ratings_df.head() output above.

For a recommendation system it is best to use only one rating per user and item.
For this system, only a user's most recent review of an item is kept.

In [10]:
values = ratings_df.values.tolist()
used_user_item_pairs = []

# TODO: Optimize this
clean_values = []
for value in values:
    if value[:2] not in used_user_item_pairs:
        used_user_item_pairs.append(value[:2])
        clean_values.append(value)
    else:
        clean_values[used_user_item_pairs.index(value[:2])] = value

clean_ratings_df = pd.DataFrame(clean_values, columns=['user', 'item', 'rating'])
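
The loop above searches a Python list for every row, so its cost grows roughly with the square of the number of ratings. A minimal sketch of one possible optimization follows, using pandas `groupby` with `sort=False`, which keeps groups in order of first appearance. Like the loop, it treats the last row in file order as the most recent review. The small `sample_df` is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical sample: users 1 and 2 each rate item 1 twice
sample_df = pd.DataFrame(
    [[1, 1, 3], [2, 1, 2], [1, 1, 5], [3, 1, 4], [2, 1, 1]],
    columns=['user', 'item', 'rating'],
)

# One row per (user, item) pair, holding the last rating seen,
# with pairs ordered by their first appearance (as in the loop above)
dedup_df = (
    sample_df
    .groupby(['user', 'item'], sort=False)['rating']
    .last()
    .reset_index()
)

print(dedup_df.values.tolist())  # [[1, 1, 5], [2, 1, 1], [3, 1, 4]]
```

If order does not matter, `drop_duplicates(subset=['user', 'item'], keep='last')` is a shorter alternative, but it keeps each pair at the position of its last occurrence rather than its first.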

The results after cleaning the ratings.

In [11]:
clean_ratings_df.head()

Unnamed: 0,user,item,rating
0,1,1,5
1,2,1,2
2,3,1,4
3,4,1,5
4,5,1,3


In [12]:
clean_ratings_df.to_csv('datasets/rr-ratings.csv')
recipes_df.to_csv('datasets/rr-recipes.csv')
users_df.to_csv('datasets/rr-users.csv')

##### Define a Ratings scale
This scale is determined by the lowest and highest rating possible. 
In this case the lowest rating is 1, while the highest is 5.

In [13]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(clean_ratings_df[['user', 'item', 'rating']], reader=reader)

### Build the model

##### KNN with Means - Surprise

The recommender uses [KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans), a basic collaborative filtering algorithm that takes into account the mean ratings of each user (or of each item, when `user_based` is False, as it is by default in `build_recommender` below).

In [14]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
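
To make the idea behind KNN with Means concrete, here is a simplified sketch of the item-based estimate described in the Surprise documentation: the target item's mean rating, plus the similarity-weighted average of how far the user's ratings of neighboring items sit from those items' means. The function name `knn_with_means_estimate` and all the numbers are hypothetical, and this is not the library's implementation (Surprise also selects the k nearest neighbors and clips the result to the rating scale).

```python
import numpy as np

def knn_with_means_estimate(target_mean, neighbor_sims, neighbor_ratings, neighbor_means):
    """Item-based estimate: target item's mean plus the similarity-weighted
    average deviation of the user's ratings from each neighbor item's mean."""
    sims = np.asarray(neighbor_sims, dtype=float)
    deviations = (np.asarray(neighbor_ratings, dtype=float)
                  - np.asarray(neighbor_means, dtype=float))
    if sims.sum() == 0:
        # No usable neighbors: fall back to the item's mean rating
        return float(target_mean)
    return float(target_mean + np.dot(sims, deviations) / sims.sum())

# Hypothetical case: the target recipe averages 4.0 stars, and the user
# rated two similar recipes 5 and 4 (both of which average 4.0)
estimate = knn_with_means_estimate(
    target_mean=4.0,
    neighbor_sims=[0.75, 0.25],
    neighbor_ratings=[5, 4],
    neighbor_means=[4.0, 4.0],
)
print(estimate)  # 4.75
```

The user rated the more similar neighbor one star above its mean, so the estimate is pulled above the target recipe's own mean.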

##### Calculate the Similarity Matrix

[build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) builds the *Trainset* from the whole dataset, ignoring folds.

The Trainset is built from the data and also holds additional information about it, such as the number of users, items, and ratings and the global mean rating.
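
When the recommender is fit on the Trainset, a cosine similarity is computed for every pair of items, using only the users who rated both. A minimal sketch of that idea for a single pair follows. The helper `cosine_sim` and the ratings are hypothetical, and the sketch leaves out details such as Surprise's `min_support` option.

```python
import numpy as np

def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity of two items, using only users who rated both.

    Each argument is a dict mapping user id -> rating.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if not common:
        return 0.0  # no shared users, so no evidence of similarity
    a = np.array([ratings_a[u] for u in common], dtype=float)
    b = np.array([ratings_b[u] for u in common], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings for two recipes; only users 1 and 2 rated both
recipe_x = {1: 3, 2: 4, 3: 5}
recipe_y = {1: 4, 2: 3, 4: 1}
similarity = cosine_sim(recipe_x, recipe_y)
print(round(similarity, 2))  # 0.96
```

The full matrix holds one entry per pair, so an item-based model needs an items-by-items matrix while a user-based model needs a users-by-users one, which grows much faster when users far outnumber recipes.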

In [15]:
trainset = data.build_full_trainset()

# user_based_recommender = build_recommender(user_based=True)
item_based_recommender = build_recommender()

# The user-based model seems to give a memory error when fit, due to the much larger number of users than recipes.
# user_based_recommender.fit(trainset)
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f0de5a7a690>

### Evaluate the Model

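Using [cross_validate()](https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation) from Surprise, we can quickly evaluate the model using a few metrics. It refits the algorithm on each of the 5 folds, as the repeated similarity-matrix messages in the output show.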

In [16]:
cross_validate(item_based_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9716  0.9782  0.9892  0.9896  0.9718  0.9801  0.0080  
MAE (testset)     0.7189  0.7188  0.7278  0.7278  0.7176  0.7222  0.0046  
Fit time          0.09    0.11    0.11    0.11    0.10    0.10    0.01    
Test time         0.13    0.12    0.14    0.13    0.13    0.13    0.00    


{'test_rmse': array([0.97163128, 0.97818258, 0.98915386, 0.98956806, 0.97175306]),
 'test_mae': array([0.71890052, 0.71881606, 0.7278457 , 0.72775939, 0.7176301 ]),
 'fit_time': (0.08956146240234375,
  0.1103048324584961,
  0.10791921615600586,
  0.11022830009460449,
  0.10349893569946289),
 'test_time': (0.12713408470153809,
  0.1244969367980957,
  0.1385502815246582,
  0.1281423568725586,
  0.1281144618988037)}
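
For reference, the two metrics reported above can be sketched in a few lines. RMSE is the square root of the mean squared error and penalizes large misses more heavily, while MAE is the mean absolute error. The `pairs` list of (true rating, estimated rating) values is made up for illustration.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Hypothetical predictions: every recipe is estimated at 4.0 stars
pairs = [(5, 4.0), (3, 4.0), (4, 4.0), (2, 4.0)]

print(round(rmse(pairs), 4))  # 1.2247
print(mae(pairs))             # 1.0
```

Read this way, the mean MAE of about 0.72 in the table means the model's estimates were off by roughly 0.72 stars on average across the test folds.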

##### Prediction

This test shows how different users might rate a specific recipe.

In [17]:
# Predict the rating that each of 150 user ids would give recipe 167
for i in range(150):
    prediction = item_based_recommender.predict(i, 167)
    print(round(prediction.est, 2), end=', ')

4.46, 4.38, 4.65, 4.65, 4.65, 5, 4.65, 4.65, 4.65, 4.65, 4.46, 4.26, 4.65, 4.65, 4.65, 4.46, 4.65, 4.65, 5, 4.46, 4.65, 4.65, 4.65, 4.65, 4.65, 4.65, 4.65, 4.65, 4.17, 4.65, 5, 4.65, 4.65, 4.65, 4.65, 4.65, 4.65, 4.46, 4.65, 4.65, 4.65, 4.46, 4.65, 4.65, 4.65, 4.65, 4.65, 4.65, 4.46, 4.65, 4.46, 4.65, 4.65, 4.46, 3.76, 4.76, 5, 3.01, 5, 5, 5, 4.76, 2.76, 5, 4.46, 5, 5, 5, 4.76, 5, 1.76, 2.76, 5, 5, 4.65, 4.47, 5, 5, 4.76, 4.76, 2.76, 5, 3.76, 1.76, 1.76, 1.76, 3.76, 4.76, 3.76, 5, 4.46, 2.76, 4.46, 2.76, 1.76, 4.46, 5, 5, 5, 4.46, 4.76, 4.87, 4.65, 4.76, 5, 4.46, 5, 5, 4.76, 5, 5, 4.46, 5, 2.76, 4.46, 5, 5, 4.76, 4.46, 5, 4.46, 3.76, 4.76, 5, 5, 5, 4.8, 4.45, 5, 4.46, 5, 5, 4.76, 4.76, 5, 2.76, 4.46, 4.46, 2.76, 5, 4.76, 4.65, 5, 5, 3.76, 5, 4.76, 3.76, 4.46, 4.46, 

### Inference

Here is the meat and potatoes (har) of the whole thing: a function that returns the top N recipes a user has not rated yet, ranked by predicted rating.

In [18]:
def get_r(user_id):
    # Select which system to use. Due to memory constraints, item based is the only viable option
    recommender_system = item_based_recommender

    # N will represent how many items to recommend
    N = 200

    # Items the user has already rated (a set gives fast membership tests)
    rated_items = set(clean_ratings_df.loc[clean_ratings_df['user'] == user_id, 'item'].tolist())

    # Every item id that appears in the ratings
    all_item_ids = list(set(clean_ratings_df['item'].tolist()))

    # new_items represents all the items not rated by the user
    new_items = [x for x in all_item_ids if x not in rated_items]

    # Estimate ratings for all unrated items
    predicted_ratings = {
        item_id: recommender_system.predict(user_id, item_id).est
        for item_id in new_items
    }

    # Get the item_ids for the top ratings
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)

    # Look up the recommended recipes and attach each predicted rating by recipe_id,
    # so the values do not depend on the row order of recipes_df
    recommended_df = recipes_df.loc[recipes_df['recipe_id'].isin(recommended_ids)].copy()
    recommended_df.insert(2, 'pred_rating', recommended_df['recipe_id'].map(predicted_ratings))
    recommended_df.set_index('recipe_id', inplace=True)

    return recommended_df.sort_values('pred_rating', ascending=False)
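
The top-N selection in `get_r` relies on `heapq.nlargest`. Iterating a dict yields its keys, and `key=predicted.get` ranks those keys by their values, so the call returns the ids with the highest predicted ratings without sorting the whole dict. A small illustration with made-up ids and ratings:

```python
import heapq

# Hypothetical predicted ratings, keyed by recipe id
predicted = {101: 4.2, 102: 4.9, 103: 3.1, 104: 4.6}

# The two recipe ids with the highest predicted rating, best first
top_two = heapq.nlargest(2, predicted, key=predicted.get)
print(top_two)  # [102, 104]
```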

### Get a Recommendation Based on Ingredients

This is the final code; it will be implemented in a cleaner fashion through the browser interface.

In [19]:
# ask the user for input
# get their ID number
user_id = int(input('Enter user id: '))

# get them to list some ingredients, currently it breaks if the second or next ingredient is not there
ingredient_list = input('Enter the ingredients separated by commas that you have on hand: ')

# split the input up into an array for the loop
items = np.array(ingredient_list.split(','))

# get the lowest rating
rating = int(input('Enter the lowest rating you\'ll accept: '))

# get their user name
user_name = users_df.loc[users_df['user_id'] == user_id]

# print some details
print('\nuser: ',user_name.iloc[0,1])
print(ingredient_list)
print('\nHere are your recommendations.')
test = get_r(user_id)
for item in items:
    test = test[test['ingredients'].str.contains(item)]
test = test[test['pred_rating'] >= rating]
test

Enter user id: 35
Enter the ingredients separated by commas that you have on hand: cheese
Enter the lowest rating you'll accept: 4

user:  10309
cheese

Here are your recommendations.


recipe_id,title,pred_rating,servings,cals_per_serving,prep_time,cook_time,ready_time,nutrition,ingredients,url
505,Butter Schnitzel,5.0,12,351.0,25 Minutes,30 Minutes,55 m,Per Serving: 351 calories;23.2 g fat;15.7g car...,"12 boneless pork loin chops, 3/4 inch thick;2 ...",https://www.allrecipes.com/recipe/109451/butte...
967,Almond Melon Tart,5.0,8,437.0,30 Minutes,30 Minutes,3 h,Per Serving: 437 calories;25.6 g fat;47.2g car...,1/2 (11 ounce) package pie crust mix;1 cup sou...,https://www.allrecipes.com/recipe/12370/almond...
891,Chile Rellenos Pie,5.0,6,289.0,15 Minutes,45 Minutes,1 h,Per Serving: 289 calories;19.2 g fat;18.2g car...,4 eggs;1/2 cup milk;1/4 cup chopped fresh cila...,https://www.allrecipes.com/recipe/12115/chile-...
869,Classic Alfredo Sauce,5.0,2,714.0,,,,Per Serving: 714 calories;72 g fat;5g carbohyd...,3 tablespoons butter;8 fluid ounces heavy whip...,https://www.allrecipes.com/recipe/12065/classi...
851,Alfredo Light,5.0,8,292.0,20 Minutes,20 Minutes,40 m,Per Serving: 292 calories;4.1 g fat;50.5g carb...,"1 onion, chopped;1 clove garlic, minced;2 teas...",https://www.allrecipes.com/recipe/11915/alfred...
828,Baked Ziti I,5.0,10,578.0,20 Minutes,35 Minutes,55 m,Per Serving: 578 calories;25.3 g fat;58.4g car...,"1 pound dry ziti pasta;1 onion, chopped;1 poun...",https://www.allrecipes.com/recipe/11758/baked-...
821,American Lasagna,5.0,8,664.0,30 Minutes,1 Hour 15 Minutes,1 h 55 m,Per Serving: 664 calories;29.5 g fat;48.3g car...,"1 1/2 pounds lean ground beef;1 onion, chopped...",https://www.allrecipes.com/recipe/11729/americ...
820,Cheese Ravioli with Fresh Tomato and Artichoke...,5.0,6,355.0,20 Minutes,5 Minutes,25 m,Per Serving: 355 calories;15.5 g fat;42g carbo...,2 (9 ounce) packages fresh cheese ravioli;1 ta...,https://www.allrecipes.com/recipe/11723/cheese...
815,Alla Checca,5.0,4,610.0,20 Minutes,,2 h 20 m,Per Serving: 610 calories;30.7 g fat;69.4g car...,"5 tomatoes, seeded and diced;4 cloves garlic, ...",https://www.allrecipes.com/recipe/11692/alla-c...
169,Easy Chocolate Cream Cheese Frosting,5.0,24,80.0,,,,Per Serving: 80 calories;6.3 g fat;6g carbohyd...,1 (8 ounce) package cream cheese;1/4 cup confe...,https://www.allrecipes.com/recipe/10342/easy-c...
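
The ingredient filter in the cell above is sensitive to stray spaces after commas, regex characters in the input, and missing ingredient lists. A minimal sketch of a more forgiving version follows. The helper name `filter_by_ingredients` and the `sample` data are hypothetical, and matching case-insensitively is a design choice assumed here, not something the original cell does.

```python
import pandas as pd

def filter_by_ingredients(df, ingredient_text):
    """Keep rows whose 'ingredients' column mentions every listed ingredient."""
    # Strip whitespace and drop empty entries such as a trailing comma
    wanted = [part.strip() for part in ingredient_text.split(',') if part.strip()]
    mask = pd.Series(True, index=df.index)
    for ingredient in wanted:
        # regex=False matches the text literally; na=False treats missing values as no match
        found = df['ingredients'].str.contains(ingredient, case=False, regex=False, na=False)
        mask = mask & found
    return df[mask]

# Hypothetical recipes, one with a missing ingredient list
sample = pd.DataFrame({
    'title': ['Alfredo', 'Frosting', 'Mystery'],
    'ingredients': ['butter;cream;Parmesan cheese', 'cream cheese;sugar', None],
})

result = filter_by_ingredients(sample, 'cheese, butter,')
print(result['title'].tolist())  # ['Alfredo']
```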


In [21]:
rec_df = pd.read_csv('datasets/recipes-sub.csv')
u_df = pd.read_csv('datasets/users-sub.csv')
r_df = pd.read_csv('datasets/reviews-sub.csv')

In [22]:
# Look up every review left by a single reviewer
a = 99
r_df.loc[r_df['reviewer_id'] == a]

Unnamed: 0,reviewer_id,recipe_id,rating,date,link
104,99,3,5,2010-11-21,https://www.allrecipes.com/recipe/getreviews/r...


### Save the Model

In [28]:
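import pickle
filename = 'recipes_recommender_model.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as model_file:
    pickle.dump(item_based_recommender, model_file)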

In [29]:
# Load the saved model back from disk
with open(filename, 'rb') as model_file:
    rr_model = pickle.load(model_file)

In [30]:
cross_validate(rr_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9783  0.9865  0.9795  0.9817  0.9782  0.9808  0.0031  
MAE (testset)     0.7217  0.7253  0.7224  0.7214  0.7203  0.7222  0.0017  
Fit time          0.08    0.12    0.10    0.11    0.10    0.10    0.01    
Test time         0.13    0.13    0.15    0.13    0.13    0.13    0.01    


{'test_rmse': array([0.97825221, 0.98649248, 0.97948788, 0.98170631, 0.97816223]),
 'test_mae': array([0.72170042, 0.72532985, 0.72239182, 0.72144593, 0.72027243]),
 'fit_time': (0.0803375244140625,
  0.12134957313537598,
  0.10322785377502441,
  0.10894298553466797,
  0.10386300086975098),
 'test_time': (0.13140225410461426,
  0.13246560096740723,
  0.1513383388519287,
  0.12859630584716797,
  0.1291794776916504)}