# Capstone Term 2 - Project

* Marcos Bittencourt
---
* Contributors
    * Todd McCullough [Git](https://github.com/tamccullough)
    * Savya Sharma [Git](https://github.com/SavyaSharma)
    * Marko Topitch [Git](https://github.com/TopMarko)
---

### Initialize Modules and Data

##### Import the Needed Modules

In [1]:
import pandas as pd
import numpy as np
import heapq
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

##### Import Data

In [3]:
recipes_df = pd.read_csv('datasets/rr-recipes.csv')
users_df = pd.read_csv('datasets/rr-users.csv')
ratings_df = pd.read_csv('datasets/rr-ratings.csv')

In [4]:
ratings_df.head(2)

|   | user | item | rating |
|---|---|---|---|
| 0 | 675719 | 7000 | 5 |
| 1 | 1478626 | 7000 | 5 |


##### Define a Ratings scale
This scale is determined by the lowest and highest rating possible. 
In this case the lowest rating is 1, while the highest is 5.

In [5]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(ratings_df[['user', 'item', 'rating']], reader=reader)

### Build the model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) has been chosen for the recommender. It is a basic collaborative filtering algorithm that takes into account the mean ratings of each user (or of each item, in the item-based configuration used here).

In [6]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)
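
To make the mean-centered idea concrete, here is a minimal pure-Python sketch of an item-based prediction in the style of KNN with Means. It is not Surprise's implementation: the function name, the dictionary layout, and the toy numbers are all assumptions made for illustration.

```python
def predict_knn_with_means(user_ratings, item_means, sims, target_item, k=40):
    """Estimate one user's rating for target_item (item-based variant).

    user_ratings: {item_id: rating} for items this user has rated
    item_means:   {item_id: mean rating of that item}
    sims:         {(target_item, other_item): similarity}
    All three structures are hypothetical stand-ins for a fitted model.
    """
    # Candidate neighbours: items the user rated that are positively similar to the target
    candidates = [
        (sims.get((target_item, j), 0.0), j)
        for j in user_ratings
        if j != target_item
    ]
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]

    if not neighbours:
        # No usable neighbour: fall back to the item's own mean rating
        return item_means[target_item]

    # Weighted average of how far the user deviates from each neighbour's mean
    weighted_dev = sum(s * (user_ratings[j] - item_means[j]) for s, j in neighbours)
    total_sim = sum(s for s, _ in neighbours)
    return item_means[target_item] + weighted_dev / total_sim


# Toy data (made up): the user rated items "a" and "b"; we estimate item "c"
estimate = predict_knn_with_means(
    user_ratings={"a": 5, "b": 3},
    item_means={"a": 4.0, "b": 3.5, "c": 4.2},
    sims={("c", "a"): 0.8, ("c", "b"): 0.2},
    target_item="c",
)
print(round(estimate, 2))
```

The estimate starts from the target item's mean and shifts it by how much the user rated similar items above or below their means, so a user who rates similar recipes generously pulls the prediction up.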

##### Calculate the Similarity Matrix

Ignoring folds, this builds the *Trainset* using [build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset), so the model is fit on every available rating.

The Trainset is built from the data and also holds derived information, such as the mapping between raw and inner ids and the global mean rating.

In [7]:
trainset = data.build_full_trainset()

# user_based_recommender = build_recommender(user_based=True)
item_based_recommender = build_recommender()

# The user-based fit seems to raise a memory error, because there are far more users than recipes.
# user_based_recommender.fit(trainset)
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7fc415f93910>
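
As a rough illustration of what one entry of that similarity matrix represents, the sketch below computes the cosine similarity between two items from the users who rated both. The function name and the toy ratings are assumptions, and this is not the code Surprise runs.

```python
from math import sqrt


def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity between two items, using only users who rated both.

    ratings_a, ratings_b: {user_id: rating} for each item (hypothetical layout).
    Ratings are assumed to be non-zero, as on a 1 to 5 scale.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0  # no shared users: treat the items as not similar
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = sqrt(sum(ratings_b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)


# Toy ratings (made up): users 1 and 2 rated both items
item_x = {1: 5, 2: 3, 3: 4}
item_y = {1: 4, 2: 3, 4: 2}
sim_xy = cosine_sim(item_x, item_y)
print(round(sim_xy, 4))
```

A full matrix holds one such value for every pair, so its size grows with the square of the number of items (or users), which is consistent with the user-based fit running out of memory here.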

### Evaluate the Model

Using [cross_validate()](https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation) from Surprise, we can quickly evaluate the model with a few metrics, here RMSE and MAE over 5 folds.

In [8]:
cross_validate(item_based_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9080  0.9073  0.9058  0.9046  0.9060  0.9064  0.0012  
MAE (testset)     0.6254  0.6257  0.6253  0.6247  0.6253  0.6253  0.0003  
Fit time          20.30   20.39   21.30   21.95   21.69   21.13   0.67    
Test time         15.67   15.90   16.02   16.90   16.97   16.29   0.54    


{'test_rmse': array([0.9080496 , 0.90728224, 0.90583113, 0.90463127, 0.90596804]),
 'test_mae': array([0.62540011, 0.62573803, 0.62533957, 0.62469321, 0.62529947]),
 'fit_time': (20.30126142501831,
  20.39469814300537,
  21.295984983444214,
  21.950374126434326,
  21.693845510482788),
 'test_time': (15.667166233062744,
  15.899041891098022,
  16.023027181625366,
  16.902676343917847,
  16.974899768829346)}
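
For reference, the two metrics reported above can be sketched in a few lines of plain Python. The helper names and the (true rating, estimated rating) pairs are made up for illustration.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical predictions on a 1 to 5 scale
pairs = [(5, 4.5), (3, 3.5), (4, 4.0), (1, 2.0)]
rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(round(rmse_value, 4), round(mae_value, 4))
```

RMSE squares each error before averaging, so large misses weigh more heavily. That is why RMSE is never smaller than MAE, as in the fold results above.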

##### Prediction

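Use this test to see how users might rate a specific recipe. The estimate below is identical for every id, which suggests the prediction fell back to a default (Surprise returns the global mean rating when the user or item is unknown to the trainset), so ids 0 to 149 are probably not real user ids in this dataset.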

In [9]:
# Predict the rating that each user id from 0 to 149 would give recipe 167
for i in range(150):
    prediction = item_based_recommender.predict(i, 167)
    print(round(prediction.est, 2), end=', ')

4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 4.51, 

### Inference

Here is the meat and potatoes (har) of the whole thing: the functions that turn predicted ratings into recipe recommendations.

In [12]:
def get_r(user_id):
    # Select which system to use. Due to memory constraints, item-based is the only viable option
    recommender_system = item_based_recommender
    # N is the number of items to recommend
    N = 1000

    # Items the user has already rated (a set removes duplicates and gives fast lookups)
    rated_items = set(ratings_df.loc[ratings_df['user'] == user_id, 'item'].tolist())
    # Only consider rated items that also appear in the recipes table
    recipe_ids = set(recipes_df['recipe_id'].tolist())
    all_item_ids = set(ratings_df['item'].tolist()) & recipe_ids

    # new_items holds every item the user has not rated yet
    new_items = all_item_ids - rated_items

    # Estimate ratings for all unrated items
    predicted_ratings = {}
    for item_id in new_items:
        predicted_ratings[item_id] = recommender_system.predict(user_id, item_id).est

    # Get the item_ids with the top N predicted ratings
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)

    recommended_df = recipes_df.loc[recipes_df['recipe_id'].isin(recommended_ids)].copy()
    # Look up each row's prediction by recipe_id rather than by position,
    # so ratings stay aligned even if recipes_df is not sorted by recipe_id
    recommended_df.insert(1, 'pred_rating', recommended_df['recipe_id'].map(predicted_ratings))

    return recommended_df.head(N).sort_values('pred_rating', ascending=False)

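def set_up_rr(user_id, ingredient_list):
    # Split the comma-separated input into individual ingredients
    items = [item.strip() for item in ingredient_list.split(',')]
    rr_list = get_r(user_id)
    # Keep only the recipes whose ingredients mention every requested item
    for item in items:
        print(item)
        # regex=False matches the text literally; na=False drops rows with missing ingredients
        rr_list = rr_list[rr_list['ingredients'].str.contains(item, regex=False, na=False)]
    return rr_list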

def mk_tbl(rows):
    # Build one row per recipe for the dynamic table: [pred, title, r_t, p_t, c_t, url]
    arr = []
    for row in rows:
        title = row[2]
        r_t = row[5]
        p_t = row[3]
        c_t = row[4]
        url = row[8]
        pred = row[1]
        arr.append([pred,title,r_t,p_t,c_t,url])
    return arr
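
The top-N step inside `get_r` relies on `heapq.nlargest` applied to a dictionary of predictions. The short sketch below shows that pattern on made-up data; the recipe ids and ratings are hypothetical.

```python
import heapq

# Hypothetical predicted ratings keyed by recipe id
predicted = {"r1": 4.2, "r2": 4.9, "r3": 3.1, "r4": 4.6}

# Iterating a dict yields its keys; key=predicted.get ranks them by predicted rating
top_ids = heapq.nlargest(2, predicted, key=predicted.get)

# Pair each id with its rating by looking it up, not by relying on row position
top_pairs = [(recipe_id, predicted[recipe_id]) for recipe_id in top_ids]
print(top_pairs)  # [('r2', 4.9), ('r4', 4.6)]
```

Looking each rating up by id keeps the pairing correct regardless of how the surrounding table happens to be ordered.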

### Get a Recommendation Based on Ingredients

The final code, which will be implemented in a cleaner fashion through the browser interface.

In [14]:
user_id = 2617981
ingredient_list = 'tofu'
table_list = set_up_rr(user_id,ingredient_list)

table_list = table_list.to_numpy()

test = pd.DataFrame(table_list)
test

tofu


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14095 | 4.98456 | Tofu Fudge Mocha Bars Recipe | 5 m | 10 m | 15 m | tofu,2 tablespoons safflower oil,salt,sugar,co... | Preheat oven to 325 degrees F (165 degrees C).... | https://www.allrecipes.com/recipe/14095 | https://images.media-allrecipes.com/userphotos... |
| 1 | 14176 | 4.97616 | Cucumber and Tomato Salad Recipe | 15 m | X | 15 m | tomato,cucumber,onion,bean,tofu,basil,salad dr... | In a large bowl, combine the tomato, cucumber,... | https://www.allrecipes.com/recipe/14176 | https://images.media-allrecipes.com/userphotos... |
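
The ingredient filter narrows the recommendations one ingredient at a time. Here is a minimal sketch of that idea on plain Python data, assuming a list of dictionaries instead of the DataFrame used above. The function name and the recipes are made up, and unlike the notebook code this version ignores case.

```python
def filter_by_ingredients(recipes, ingredient_list):
    """Keep the recipes whose ingredient text mentions every requested item.

    recipes:         list of {"title": str, "ingredients": str} dicts (hypothetical)
    ingredient_list: comma-separated string such as "tofu, tomato"
    """
    wanted = [w.strip().lower() for w in ingredient_list.split(',') if w.strip()]
    return [
        recipe for recipe in recipes
        if all(w in recipe["ingredients"].lower() for w in wanted)
    ]


# Made-up recipes for illustration
recipes = [
    {"title": "Tofu Stir Fry", "ingredients": "tofu,soy sauce,onion"},
    {"title": "Tomato Salad", "ingredients": "tomato,cucumber,onion"},
    {"title": "Tofu Salad", "ingredients": "tofu,tomato,basil"},
]
matches = [r["title"] for r in filter_by_ingredients(recipes, "tofu, tomato")]
print(matches)  # ['Tofu Salad']
```

This is substring matching, so a request for `bean` would also match `green beans`. A stricter filter would split the ingredient text and compare whole entries.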


In [15]:
got_it = pd.DataFrame(mk_tbl(table_list))

### Save the Model

In [16]:
import pickle
filename = 'recipes_recommender_model.sav'
# The with block closes the file even if pickling fails
with open(filename, 'wb') as model_file:
    pickle.dump(item_based_recommender, model_file)

In [17]:
with open(filename, 'rb') as model_file:
    rr_model = pickle.load(model_file)
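
The save and load steps can be checked in isolation with a small round trip. The sketch below uses a plain dictionary as a stand-in for the fitted recommender and writes to a temporary directory; both choices are assumptions made so the example runs without the dataset.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (hypothetical); any picklable object works the same way
model = {"algo": "KNNWithMeans", "sim_options": {"name": "cosine", "user_based": False}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")

    # Write the object to disk; the with block closes the file afterwards
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)

    # Read it back into a new object
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == model)  # True
```

Comparing the restored object with the original is a quick way to confirm the round trip before relying on the saved file elsewhere.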

In [18]:
cross_validate(rr_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9071  0.9070  0.9061  0.9060  0.9066  0.9066  0.0004  
MAE (testset)     0.6266  0.6249  0.6253  0.6253  0.6251  0.6254  0.0006  
Fit time          22.56   24.01   21.76   22.46   25.57   23.27   1.36    
Test time         16.98   17.87   18.30   15.50   18.73   17.48   1.15    


{'test_rmse': array([0.9071256 , 0.90697745, 0.90610662, 0.90600469, 0.90660522]),
 'test_mae': array([0.62662526, 0.62488307, 0.62527877, 0.62525857, 0.62507525]),
 'fit_time': (22.558342695236206,
  24.009589433670044,
  21.762155532836914,
  22.45959758758545,
  25.570380926132202),
 'test_time': (16.97546100616455,
  17.86919665336609,
  18.300251483917236,
  15.498399496078491,
  18.733608961105347)}