# Capstone Term 2 - Project

* Marcos Bittencourt
---
* Contributors
    * Todd McCullough [Git](https://github.com/tamccullough)
    * Savya Sharma [Git](https://github.com/SavyaSharma)
    * Marko Topitch [Git](https://github.com/TopMarko)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import pandas as pd
import numpy as np
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans

##### Import Data

In [3]:
recipes_df = pd.read_csv('datasets/recipes-sub.csv')
users_df = pd.read_csv('datasets/users-sub.csv')
master_ratings_df = pd.read_csv('datasets/reviews-sub.csv')

In [4]:
recipes_df.columns

Index(['recipe_id', 'title', 'servings', 'cals_per_serving', 'prep_time',
       'cook_time', 'ready_time', 'nutrition', 'ingredients', 'url'],
      dtype='object')

In [5]:
recipes_df.head()

Unnamed: 0,recipe_id,title,servings,cals_per_serving,prep_time,cook_time,ready_time,nutrition,ingredients,url
0,1,Chocolate Sandwich Cookies I,12,600.0,30 Minutes,8 Minutes,1 h 10 m,Per Serving: 600 calories;26.9 g fat;86.6g car...,3 cups all-purpose flour;1 1/2 cups white suga...,https://www.allrecipes.com/recipe/10000/chocol...
1,2,Chocolate Pizzelles,12,358.0,,,,Per Serving: 358 calories;22.6 g fat;35.5g car...,4 eggs;1/4 cup cocoa powder;1 cup white sugar;...,https://www.allrecipes.com/recipe/10001/chocol...
2,3,Cookie Press Shortbread,24,116.0,25 Minutes,10 Minutes,35 m,Per Serving: 116 calories;7.8 g fat;10.9g carb...,1 cup butter;1 1/2 cups all-purpose flour;1/2 ...,https://www.allrecipes.com/recipe/10003/cookie...
3,4,Brown Sugar Frosting,16,59.0,,,,Per Serving: 59 calories;0 g fat;15.2g carbohy...,1/2 cup packed dark brown sugar;1 tablespoon w...,https://www.allrecipes.com/recipe/10009/brown-...
4,5,Butter Cookies II,36,103.0,15 Minutes,10 Minutes,1 h 40 m,Per Serving: 103 calories;5.3 g fat;12.7g carb...,1 cup butter;1 cup white sugar;1 egg;2 2/3 cup...,https://www.allrecipes.com/recipe/10011/butter...


In [6]:
users_df.head()

Unnamed: 0,user_id,AR_id,link
0,1,naples34102,https://www.allrecipes.com/cook/naples34102
1,2,1207425,https://www.allrecipes.com/cook/1207425
2,3,183259,https://www.allrecipes.com/cook/183259
3,4,162514,https://www.allrecipes.com/cook/162514
4,5,1159515,https://www.allrecipes.com/cook/1159515


In [7]:
master_ratings_df.head()

Unnamed: 0,reviewer_id,recipe_id,rating,date,link
0,1,1,5,2012-02-16,https://www.allrecipes.com/recipe/10000/chocol...
1,2,1,2,2004-11-27,https://www.allrecipes.com/recipe/10000/chocol...
2,1,1,5,2012-02-16,https://www.allrecipes.com/recipe/10000/chocol...
3,3,1,4,2003-08-05,https://www.allrecipes.com/recipe/10000/chocol...
4,2,1,2,2004-11-27,https://www.allrecipes.com/recipe/10000/chocol...


### Data Cleaning

In [8]:
ratings_df = master_ratings_df.copy()
ratings_df.pop('date')
ratings_df.pop('link')
ratings_df.columns = ['user', 'item', 'rating']

In [9]:
ratings_df.head()

Unnamed: 0,user,item,rating
0,1,1,5
1,2,1,2
2,1,1,5
3,3,1,4
4,2,1,2


##### Clean the ratings (Remove duplicates)

Tastes change, and making a recipe a few times can change a user's rating, so users can rate the same recipe more than once. You can see this in the ratings_df.head() output above, where the same user and item pairs repeat.

For a recommendation system it is best to use only one rating per user per item.
For this system, only a user's most recent review of an item is kept (the code below treats the last occurrence in the file as the most recent).

In [10]:
# Keep one rating per (user, item) pair. Assigning to an existing dict key
# overwrites the rating but keeps the key's position, so each pair stays where
# it first appeared and ends up with the rating from its last row.
latest_ratings = {}
for user, item, rating in ratings_df.values.tolist():
    latest_ratings[(user, item)] = rating

clean_values = [[user, item, rating] for (user, item), rating in latest_ratings.items()]
clean_ratings_df = pd.DataFrame(clean_values, columns=['user', 'item', 'rating'])

The results after cleaning the ratings.

In [11]:
clean_ratings_df.head()

Unnamed: 0,user,item,rating
0,1,1,5
1,2,1,2
2,3,1,4
3,4,1,5
4,5,1,3
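
The cleaning step relies on row order to decide which duplicate is the most recent. A minimal sketch of a date-aware alternative in pandas follows; `toy_df` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy ratings with a repeated (user, item) pair; the values are illustrative only
toy_df = pd.DataFrame({
    'user':   [1, 2, 1, 3],
    'item':   [1, 1, 1, 1],
    'rating': [5, 2, 4, 4],
    'date':   ['2012-02-16', '2004-11-27', '2013-01-01', '2003-08-05'],
})

# Sort by date so that "last" really means "most recent",
# then keep only the last row for each (user, item) pair
toy_df['date'] = pd.to_datetime(toy_df['date'])
deduped_df = (toy_df.sort_values('date')
                    .drop_duplicates(subset=['user', 'item'], keep='last')
                    .sort_index()
                    .reset_index(drop=True))

print(deduped_df[['user', 'item', 'rating']])
```

Sorting first means the result no longer depends on how the CSV happens to be ordered, at the cost of keeping the date column around until the duplicates are dropped.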


##### Define a Ratings scale
This scale is determined by the lowest and highest rating possible. 
In this case the lowest rating is 1, while the highest is 5.

In [12]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(clean_ratings_df[['user', 'item', 'rating']], reader=reader)
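
Before handing the ratings to the Reader, it can be worth confirming that every rating really falls inside the declared scale. This is a small sketch of that check; `sample_ratings` is a made-up stand-in for clean_ratings_df.

```python
import pandas as pd

# Illustrative ratings only; in the notebook this would be clean_ratings_df
sample_ratings = pd.DataFrame({
    'user':   [1, 2, 3],
    'item':   [1, 1, 2],
    'rating': [5, 2, 4],
})

low, high = 1, 5  # the same bounds passed to Reader(rating_scale=(1, 5))

# Rows whose rating falls outside the declared scale (ideally none)
out_of_scale = sample_ratings[~sample_ratings['rating'].between(low, high)]
print(len(out_of_scale), sample_ratings['rating'].min(), sample_ratings['rating'].max())
```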

### Build the model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) has been chosen for the recommender. It is a basic collaborative filtering algorithm that takes into account the mean ratings of each user (or of each item, when the model is item-based, as it is here).

In [13]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
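
To make the idea behind the algorithm concrete, here is a rough pure-Python sketch of an item-based, mean-centred estimate. It is a simplified illustration, not Surprise's implementation: the helper names and the `toy` ratings are hypothetical, and details such as clipping to the rating scale are left out.

```python
from math import sqrt

def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity computed over the users who rated both items."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = sqrt(sum(ratings_b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict_with_means(user, target, item_ratings, k=2):
    """Target item's mean plus a similarity-weighted average of how far the
    user's other ratings sit from those items' means."""
    means = {i: sum(r.values()) / len(r) for i, r in item_ratings.items()}
    # Items this user has rated, paired with their similarity to the target
    neighbours = [(cosine_sim(item_ratings[target], r), i)
                  for i, r in item_ratings.items()
                  if i != target and user in r]
    top_k = sorted(neighbours, reverse=True)[:k]
    sim_total = sum(sim for sim, _ in top_k)
    if sim_total == 0:
        return means[target]  # no usable neighbours: fall back to the item mean
    offset = sum(sim * (item_ratings[i][user] - means[i]) for sim, i in top_k)
    return means[target] + offset / sim_total

# Hypothetical ratings: item -> {user: rating}
toy = {
    'A': {1: 5, 2: 3},
    'B': {1: 4, 2: 2, 3: 4},
    'C': {1: 5, 3: 5},
}
estimate = predict_with_means(user=3, target='A', item_ratings=toy)
print(round(estimate, 2))
```

User 3 rated item B above its mean, so the estimate for item A lands a little above A's own mean.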

##### Calculate the Similarity Matrix

[build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) builds the *Trainset* from the whole dataset, without splitting it into folds.

The Trainset is built from the data but also holds extra information about it, such as the mappings between raw and inner ids.

In [14]:
trainset = data.build_full_trainset()

# user_based_recommender = build_recommender(user_based=True)
item_based_recommender = build_recommender()

# User-based seems to give a memory error when fit, due to the much larger number of users than recipes.
# user_based_recommender.fit(trainset)
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f978dc4d750>
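
A back-of-the-envelope calculation shows why the user-based model can run out of memory while the item-based one fits. The sketch assumes the similarity matrix is stored as a dense square array of 8-byte floats, and the counts below are hypothetical, chosen only to show the scaling.

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Size in GiB of an n x n matrix of fixed-width values."""
    return n * n * bytes_per_value / 2**30

# Hypothetical counts, not measured from the dataset
n_recipes = 5_000
n_users = 200_000

print(f"item-based: {dense_matrix_gib(n_recipes):.2f} GiB")
print(f"user-based: {dense_matrix_gib(n_users):.2f} GiB")
```

Because the size grows with the square of the count, forty times as many users means sixteen hundred times as much memory.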

##### Prediction

This test shows how different users might rate a specific recipe (recipe 167).

In [28]:
# Predict the rating of recipe 167 for user ids 0 through 149
for i in range(150):
    prediction = item_based_recommender.predict(i, 167)
    print(round(prediction.est, 2), end=', ')

4.46, 4.22, 2.97, 4.63, 4.99, 5, 4.63, 4.63, 4.63, 4.63, 4.63, 4.23, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 5, 4.63, 4.63, 4.63, 4.63, 4.63, 4.23, 4.97, 4.63, 5, 4.25, 4.63, 5, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.42, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 4.63, 5, 3.64, 4.64, 5, 4.3, 5, 5, 5, 4.64, 2.64, 5, 5, 5, 5, 5, 4.64, 5, 1.64, 2.64, 5, 5, 5, 4.41, 5, 5, 4.64, 4.64, 2.64, 5, 3.64, 1.64, 1.64, 1.64, 3.64, 4.64, 3.64, 4.82, 5, 2.64, 2.64, 2.64, 1.64, 5, 5, 5, 5, 5, 4.64, 4.64, 4.64, 4.64, 5, 5, 5, 5, 4.64, 5, 5, 4.64, 5, 2.64, 1.64, 5, 5, 4.64, 5, 5, 5, 3.64, 4.39, 5, 4.95, 4.99, 4.49, 4.38, 5, 5, 5, 5, 4.64, 4.64, 5, 2.64, 1.64, 5, 2.64, 5, 4.64, 4.64, 5, 5, 4.51, 5, 4.64, 3.64, 5, 5, 

### Inference

Here is the meat and potatoes(har) of the whole thing.

In [13]:
def get_r(user_id):
    # Select which system to use. Due to memory constraints, item based is the only viable option
    recommender_system = item_based_recommender

    # User to recommend for
    #user_id = 562

    # N will represent how many items to recommend
    N = 200

    # Items this user has already rated (a set gives fast membership tests)
    rated_items = set(clean_ratings_df.loc[clean_ratings_df['user'] == user_id, 'item'])

    # Every item that has at least one rating
    all_item_ids = set(clean_ratings_df['item'])

    # new_items holds all the items not yet rated by the user
    new_items = [x for x in all_item_ids if x not in rated_items]

    # Estimate ratings for all unrated items
    predicted_ratings = {}
    for item_id in new_items:
        predicted_ratings[item_id] = recommender_system.predict(user_id, item_id).est

    # Get the item_ids for the top N predicted ratings
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)

    # Look up the recommended recipes and attach each one's predicted rating,
    # matching on recipe_id rather than on row position
    recommended_df = recipes_df.loc[recipes_df['recipe_id'].isin(recommended_ids)].copy()
    recommended_df.set_index('recipe_id', inplace=True)
    recommended_df.insert(1, 'pred_rating',
                          [predicted_ratings[item_id] for item_id in recommended_df.index])

    return recommended_df.sort_values('pred_rating', ascending=False)
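
The top-N selection in the function relies on heapq.nlargest working directly on a dictionary. A small standalone sketch of that step, with made-up item ids and ratings:

```python
import heapq

# Hypothetical predicted ratings: item_id -> estimated rating
predicted = {101: 4.2, 102: 4.9, 103: 3.1, 104: 4.9, 105: 2.0}

# Iterating a dict yields its keys; key=predicted.get ranks those keys by value
top_three = heapq.nlargest(3, predicted, key=predicted.get)
print(top_three)
```

This avoids sorting the full dictionary when only a few of the highest-rated items are needed.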

### Get a Recommendation Based on Ingredients

This is the final code, which will be implemented in a cleaner fashion through the browser interface.

In [14]:
# ask the user for input
# get their ID number
user_id = int(input('Enter user id: '))

# get them to list some ingredients; known issue: currently it breaks if the second or a later ingredient is not there
ingredient_list = input('Enter the ingredients separated by commas that you have on hand: ')

# split the input into a list for the loop, trimming spaces and dropping empty entries
items = [item.strip() for item in ingredient_list.split(',') if item.strip()]

# get the lowest rating
rating = int(input('Enter the lowest rating you\'ll accept: '))

# get their user name
user_name = users_df.loc[users_df['user_id'] == user_id]

# print some details
print('\nuser: ',user_name.iloc[0,1])
print(ingredient_list)
print('\nHere are your recommendations.')
test = get_r(user_id)
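for item in items:
    # regex=False matches the ingredient text literally; na=False skips recipes with no ingredients listed
    test = test[test['ingredients'].str.contains(item, regex=False, na=False)]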
test = test[test['pred_rating'] >= rating]
test

Enter user id: 34
Enter the ingredients separated by commas that you have on hand: chocolate
Enter the lowest rating you'll accept: 3

user:  214950
chocolate

Here are your recommendations.


recipe_id,title,pred_rating,servings,cals_per_serving,prep_time,cook_time,ready_time,nutrition,ingredients,url
14,Allison's Supreme Chocolate Chip Cookies,5.0,24,294.0,15 Minutes,12 Minutes,40 m,Per Serving: 294 calories;16.5 g fat;35.4g car...,"1/2 cup shortening;1/2 cup butter, softened;3/...",https://www.allrecipes.com/recipe/10032/alliso...
659,Caramel Brownies II,5.0,18,364.0,,,,Per Serving: 364 calories;18.8 g fat;47.6g car...,1 (14 ounce) package individually wrapped cara...,https://www.allrecipes.com/recipe/11242/carame...
689,Crisp Rice Chocolate Chip Cookies,5.0,18,187.0,,,,Per Serving: 187 calories;8.3 g fat;27.7g carb...,1 1/2 cups all-purpose flour;1/2 teaspoon baki...,https://www.allrecipes.com/recipe/11329/crisp-...
691,Butterfinger Chunkies,5.0,12,477.0,,,,Per Serving: 477 calories;25.9 g fat;56.5g car...,"1/2 cup butter, softened;3/4 cup white sugar;2...",https://www.allrecipes.com/recipe/11333/butter...
701,Chocolate Mint Dessert Brownies,5.0,24,286.0,15 Minutes,30 Minutes,2 h 5 m,Per Serving: 286 calories;13.8 g fat;39.5g car...,"1 cup white sugar;1/2 cup butter, softened;4 e...",https://www.allrecipes.com/recipe/11363/chocol...
707,Brownies To Die For,5.0,36,174.0,15 Minutes,35 Minutes,50 m,Per Serving: 174 calories;10.1 g fat;21.5g car...,1 (19.8 ounce) package brownie mix;1 cup sour ...,https://www.allrecipes.com/recipe/11375/browni...
579,Buffalo Chip Cookies,5.0,24,514.0,,,,Per Serving: 514 calories;22.7 g fat;75.6g car...,"2 cups margarine, melted;2 cups packed brown s...",https://www.allrecipes.com/recipe/11090/buffal...
430,Candy Bar Brownies,5.0,6,841.0,,,,Per Serving: 841 calories;40.5 g fat;112.3g ca...,1 (18.25 ounce) package German chocolate cake ...,https://www.allrecipes.com/recipe/10822/candy-...
438,Candy-Coated Milk Chocolate Pieces Party Cookies,5.0,48,120.0,,,,Per Serving: 120 calories;5.9 g fat;15.7g carb...,1 cup shortening;1 cup packed brown sugar;1/2 ...,https://www.allrecipes.com/recipe/10841/candy-...
483,Anna's Chocolate Chip Cookies,5.0,48,120.0,15 Minutes,10 Minutes,25 m,Per Serving: 120 calories;6.2 g fat;16g carboh...,1 cup butter;1/2 cup white sugar;1 cup packed ...,https://www.allrecipes.com/recipe/10909/annas-...
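
The ingredient filter can be wrapped in a small helper so that stray spaces and capitalisation in the typed input do not affect the match. This is only a sketch: `filter_by_ingredients` and `toy_recipes` are hypothetical names, and only the layout of the ingredients column mirrors the notebook.

```python
import pandas as pd

# Made-up recipes; only the 'ingredients' column layout mirrors the notebook
toy_recipes = pd.DataFrame({
    'title': ['Cookies', 'Brownies', 'Shortbread'],
    'ingredients': ['1 cup butter;2 cups chocolate chips',
                    '1 cup Chocolate;1 cup white sugar',
                    '1 cup butter;1 1/2 cups all-purpose flour'],
})

def filter_by_ingredients(df, raw_input):
    """Keep rows whose ingredients mention every comma-separated term."""
    terms = [term.strip() for term in raw_input.split(',') if term.strip()]
    for term in terms:
        # regex=False treats the term literally; case=False ignores capitalisation
        df = df[df['ingredients'].str.contains(term, case=False, regex=False, na=False)]
    return df

matches = filter_by_ingredients(toy_recipes, 'chocolate, butter')
print(matches['title'].tolist())
```

Ignoring case is a design choice that widens the match compared with the cell above, which is case-sensitive.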


In [15]:
rec_df = pd.read_csv('recipes-sub.csv')
u_df = pd.read_csv('users-sub.csv')
r_df = pd.read_csv('reviews-sub.csv')

In [16]:
a = 99
r_df.loc[r_df['reviewer_id'] == a]

Unnamed: 0,reviewer_id,recipe_id,rating,date,link
104,99,3,5,2010-11-21,https://www.allrecipes.com/recipe/getreviews/r...
