# Capstone Term 2 - Project

* Marcos Bittencourt
---
* Contributors
    * Todd McCullough [Git](https://github.com/tamccullough)
    * Savya Sharma [Git](https://github.com/SavyaSharma)
    * Marko Topitch [Git](https://github.com/TopMarko)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import pandas as pd
import numpy as np
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans

##### Import Data

In [3]:
recipes_df = pd.read_csv('datasets/recipes-sub.csv')
users_df = pd.read_csv('datasets/users-sub.csv')
ratings_df = pd.read_csv('datasets/clean_ratings.csv')

In [4]:
recipes_df.head()

Unnamed: 0,recipe_id,title,servings,cals_per_serving,prep_time,cook_time,ready_time,nutrition,ingredients,url
0,1,Chocolate Sandwich Cookies I,12,600.0,30 Minutes,8 Minutes,1 h 10 m,Per Serving: 600 calories;26.9 g fat;86.6g car...,3 cups all-purpose flour;1 1/2 cups white suga...,https://www.allrecipes.com/recipe/10000/chocol...
1,2,Chocolate Pizzelles,12,358.0,,,,Per Serving: 358 calories;22.6 g fat;35.5g car...,4 eggs;1/4 cup cocoa powder;1 cup white sugar;...,https://www.allrecipes.com/recipe/10001/chocol...
2,3,Cookie Press Shortbread,24,116.0,25 Minutes,10 Minutes,35 m,Per Serving: 116 calories;7.8 g fat;10.9g carb...,1 cup butter;1 1/2 cups all-purpose flour;1/2 ...,https://www.allrecipes.com/recipe/10003/cookie...
3,4,Brown Sugar Frosting,16,59.0,,,,Per Serving: 59 calories;0 g fat;15.2g carbohy...,1/2 cup packed dark brown sugar;1 tablespoon w...,https://www.allrecipes.com/recipe/10009/brown-...
4,5,Butter Cookies II,36,103.0,15 Minutes,10 Minutes,1 h 40 m,Per Serving: 103 calories;5.3 g fat;12.7g carb...,1 cup butter;1 cup white sugar;1 egg;2 2/3 cup...,https://www.allrecipes.com/recipe/10011/butter...


In [5]:
users_df.head()

Unnamed: 0,user_id,AR_id,link
0,1,naples34102,https://www.allrecipes.com/cook/naples34102
1,2,1207425,https://www.allrecipes.com/cook/1207425
2,3,183259,https://www.allrecipes.com/cook/183259
3,4,162514,https://www.allrecipes.com/cook/162514
4,5,1159515,https://www.allrecipes.com/cook/1159515


In [6]:
ratings_df.head()

Unnamed: 0.1,Unnamed: 0,user,item,rating
0,0,1,1,5
1,1,2,1,2
2,2,3,1,4
3,3,4,1,5
4,4,5,1,3


##### Define a Rating Scale
The scale is set by the lowest and highest possible ratings.
In this dataset the lowest rating is 1 and the highest is 5.

In [7]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(ratings_df[['user', 'item', 'rating']], reader=reader)

### Build the model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) was chosen for the recommender. It is a basic collaborative filtering algorithm that takes into account the mean ratings of each user (or of each item when `user_based` is `False`, as it is here).
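
As a rough illustration of what taking the means into account looks like, the item-based estimate documented for KNNWithMeans is the target item's mean rating plus a similarity-weighted average of how far the user's ratings of neighbouring items sit from those items' means. The sketch below is plain Python, not Surprise's implementation; the function name `predict_with_means` and the toy numbers are hypothetical, and details such as the choice of `k` are left out.

```python
def predict_with_means(item_mean, neighbours):
    """Mean-centred KNN estimate for one (user, item) pair.

    item_mean  -- mean rating of the target item
    neighbours -- list of (similarity, user_rating, neighbour_item_mean)
                  for the most similar items the user has rated
    """
    weighted_sum = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    sim_total = sum(sim for sim, _, _ in neighbours)
    if sim_total == 0:
        # No usable neighbours: fall back to the item's mean rating
        return item_mean
    return item_mean + weighted_sum / sim_total


# Toy values (hypothetical): the target item averages 4.0, and the user
# rated two similar items 1.0 and 0.5 above those items' means.
estimate = predict_with_means(4.0, [(1.0, 5.0, 4.0), (0.5, 4.0, 3.5)])
print(estimate)
```

Because the deviations are added to the item's mean, an item with a high average and positively rated neighbours can produce a raw estimate above the top of the rating scale.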

In [8]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
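
For intuition about the `cosine` option: as far as the Surprise documentation describes it, the cosine similarity between two items is computed only over the users who rated both. A minimal plain-Python sketch follows; the function `cosine_similarity` and the toy ratings are hypothetical, and this is not the library's code.

```python
from math import sqrt


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items, using only common users.

    Each argument maps user id -> rating for one item.
    """
    common_users = ratings_a.keys() & ratings_b.keys()
    if not common_users:
        return 0.0  # no overlap, so no evidence of similarity
    dot = sum(ratings_a[u] * ratings_b[u] for u in common_users)
    norm_a = sqrt(sum(ratings_a[u] ** 2 for u in common_users))
    norm_b = sqrt(sum(ratings_b[u] ** 2 for u in common_users))
    return dot / (norm_a * norm_b)


# Toy ratings (hypothetical): users 1 and 2 rated both items
item_x = {1: 5, 2: 3, 3: 4}
item_y = {1: 4, 2: 2}
similarity = cosine_similarity(item_x, item_y)
print(round(similarity, 3))
```

Since ratings on a 1 to 5 scale are all positive, this measure stays close to 1 for most item pairs with few common users, which is worth keeping in mind when reading the predictions.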

##### Calculate the Similarity Matrix

[build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) builds the *Trainset* from the whole dataset, without splitting it into folds.

The Trainset is built from the data but also holds additional information about it, such as the number of users and items and the global mean rating.
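
One piece of that extra information is an id mapping: Surprise assigns each raw user and item id an internal index (an "inner id") in order of first appearance. The sketch below mimics that idea in plain Python; `build_id_map` is a hypothetical helper, not part of Surprise, and the rows are made up.

```python
def build_id_map(raw_ids):
    """Assign consecutive inner ids to raw ids in order of first appearance."""
    raw_to_inner = {}
    for raw_id in raw_ids:
        if raw_id not in raw_to_inner:
            raw_to_inner[raw_id] = len(raw_to_inner)
    return raw_to_inner


# Toy (user, item, rating) rows, hypothetical values
rows = [(10, 'a', 5), (42, 'a', 2), (10, 'b', 4)]

user_map = build_id_map(user for user, _, _ in rows)
item_map = build_id_map(item for _, item, _ in rows)
global_mean = sum(rating for _, _, rating in rows) / len(rows)

print(user_map)
print(item_map)
print(global_mean)
```

In Surprise itself, `trainset.to_inner_uid()` and `trainset.to_raw_iid()` convert between the two id spaces, while `predict()` takes raw ids.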

In [9]:
trainset = data.build_full_trainset()

# user_based_recommender = build_recommender(user_based=True)
item_based_recommender = build_recommender()

# The user-based model seems to raise a memory error when fit, because there are far more users than recipes.
# user_based_recommender.fit(trainset)
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f692d0f7d10>
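
The memory error mentioned in the code comment is consistent with the similarity matrix being square in the number of users (user-based) or items (item-based). A back-of-the-envelope estimate follows, assuming a dense matrix of 8-byte floats and using made-up counts, since the real dataset sizes are not shown in this notebook:

```python
def dense_similarity_gib(n_entities, bytes_per_value=8):
    """Approximate size in GiB of a dense n x n similarity matrix."""
    return n_entities ** 2 * bytes_per_value / 2 ** 30


# Hypothetical counts, for illustration only
n_items = 10_000
n_users = 200_000

print(f"item-based: {dense_similarity_gib(n_items):.2f} GiB")
print(f"user-based: {dense_similarity_gib(n_users):.2f} GiB")
```

The quadratic growth is the point: twenty times as many entities means four hundred times the memory.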

##### Prediction

This test shows how different users might rate a specific recipe.

In [10]:
# Predict how the users with ids 0-149 would rate recipe 167
for user in range(150):
    prediction = item_based_recommender.predict(user, 167)
    print(round(prediction.est, 2), end=', ')

4.46, 4.22, 2.97, 4.63, 4.99, 5, 4.63, 4.63, 4.63, 4.63, 4.63, 4.23, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 5, 4.63, 4.63, 4.63, 4.63, 4.63, 4.23, 4.97, 4.63, 5, 4.25, 4.63, 5, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.42, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 5, 3.64, 4.64, 5, 4.3, 5, 5, 5, 4.64, 2.64, 5, 5, 5, 5, 5, 4.64, 5, 1.64, 2.64, 5, 5, 5, 4.41, 5, 5, 4.64, 4.64, 2.64, 5, 3.64, 1.64, 1.64, 1.64, 3.64, 4.64, 3.64, 4.82, 5, 2.64, 2.64, 2.64, 1.64, 5, 5, 5, 5, 5, 4.64, 4.64, 4.64, 4.64, 5, 5, 5, 5, 4.64, 5, 5, 4.64, 5, 2.64, 1.64, 5, 5, 4.64, 5, 5, 5, 3.64, 4.39, 5, 4.95, 4.99, 4.49, 4.38, 5, 5, 5, 5, 4.64, 4.64, 5, 2.64, 1.64, 5, 2.64, 5, 4.64, 4.64, 5, 5, 4.51, 5, 4.64, 3.64, 5, 5, 
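
Many of the estimates above are exactly 5. One plausible explanation, assuming the default behaviour of `predict()` (its `clip` argument defaults to `True`), is that raw estimates above the top of the rating scale are clipped back into range. A sketch of that step, with a hypothetical helper name and made-up raw estimates:

```python
def clip_to_scale(estimate, rating_scale=(1, 5)):
    """Clamp a raw estimate into the rating scale."""
    lowest, highest = rating_scale
    return max(lowest, min(highest, estimate))


# Hypothetical raw estimates before clipping
raw_estimates = [5.37, 4.63, 0.8]
clipped = [clip_to_scale(e) for e in raw_estimates]
print(clipped)
```

Clipping hides how far above 5 an estimate was, so items clipped to the maximum cannot be ranked against each other by `est` alone.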

### Get a Recommendation Based on Ingredients

This is the final code; it will be implemented in a cleaner fashion through the browser interface.

In [31]:
# Select which system to use. Due to memory constraints, item based is the only viable option
recommender_system = item_based_recommender

# User to recommend for
user_id = 420

# N will represent how many items to recommend
N = 10

# Items the user has already rated; a set removes duplicates and makes lookups fast
rated_items = set(ratings_df.loc[ratings_df['user'] == user_id, 'item'].tolist())

# Every item id that appears in the ratings
all_item_ids = set(ratings_df['item'].tolist())

# new_items holds all the items not yet rated by the user
new_items = [x for x in all_item_ids if x not in rated_items]

# Estimate ratings for all unrated items
predicted_ratings = {}
for item_id in new_items:
    predicted_ratings[item_id] = recommender_system.predict(user_id, item_id).est

# Get the item_ids for the top ratings
recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)
recommended_ids = sorted(recommended_ids)

# Look up the recommended recipes and attach their predicted ratings
recommended_df = recipes_df.loc[recipes_df['recipe_id'].isin(recommended_ids)].copy()
recommended_df.set_index('recipe_id', inplace=True)
# Match ratings to rows by recipe_id rather than by position, so row order does not matter
recommended_df.insert(1, 'pred_rating', [predicted_ratings[item_id] for item_id in recommended_df.index])

In [32]:
organized = recommended_df.head(N).sort_values('pred_rating', ascending=False)

In [33]:
organized[['title','pred_rating']]

recipe_id,title,pred_rating
49,Christmas Rocks,5.0
78,Chocolate Crinkles III,5.0
119,Cassis Martini,5.0
129,Basic Chocolate Drop Cookies,5.0
132,Basic Nut Cookies,5.0
144,Classic Peanut Butter Cookies,5.0
189,Aunt Cora's World's Greatest Cookies,5.0
191,Buttermilk Chocolate Chip Cookies,5.0
259,Beth's Spicy Oatmeal Raisin Cookies,5.0
286,Champagne with Strawberries,5.0
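
The top-N step in the recommendation cell relies on `heapq.nlargest` with the dictionary's `get` method as the key, so the keys (recipe ids) are ranked by their values (predicted ratings). A small self-contained illustration with made-up ratings:

```python
import heapq

# Hypothetical predicted ratings keyed by recipe id
predicted = {101: 4.2, 102: 4.9, 103: 3.1, 104: 4.7}

# Iterating a dict yields its keys; key=predicted.get ranks them by value
top_two = heapq.nlargest(2, predicted, key=predicted.get)
print(top_two)
```

When many items tie at the maximum, as with the 5.0 ratings in the table, ties are resolved by the dictionary's iteration order rather than by any difference in predicted rating.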


### View the User's Highest-Rated Recipes
This section compares the recommendations with what the user has previously rated.

In [12]:
# The user's N highest ratings, best first
tmp = ratings_df[ratings_df['user'] == user_id]
tmp = tmp.sort_values('rating', ascending=False)
top_item_ids = tmp['item'].tolist()[:N]
top_ratings = tmp['rating'].tolist()[:N]

# Show the matching recipes (rows come back in recipes_df order, not rating order)
recipes_df[recipes_df['recipe_id'].isin(top_item_ids)].head(N)

Unnamed: 0,recipe_id,title,servings,cals_per_serving,prep_time,cook_time,ready_time,nutrition,ingredients,url
4,5,Butter Cookies II,36,103.0,15 Minutes,10 Minutes,1 h 40 m,Per Serving: 103 calories;5.3 g fat;12.7g carb...,1 cup butter;1 cup white sugar;1 egg;2 2/3 cup...,https://www.allrecipes.com/recipe/10011/butter...
