# Capstone Term 2 - Project

* Marcos Bittencourt
---
* Contributors
    * Todd McCullough [Git](https://github.com/tamccullough)
    * Savya Sharma [Git](https://github.com/SavyaSharma)
    * Marko Topitch [Git](https://github.com/TopMarko)
---

## Content Based Filtering Tests

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

### Load the data

In [2]:
recipes_df = pd.read_csv('datasets/recipes-sub.csv')
users_df = pd.read_csv('datasets/users-sub.csv')
master_ratings_df = pd.read_csv('datasets/reviews-sub.csv')

### Data Preparation

[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

The `TfidfVectorizer` class converts a collection of raw documents into a matrix of TF-IDF features. [Read more about count vectorization](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

In [3]:
# Configure the vectorizer: word-level 1- to 3-grams from each title, with English stop words removed.
# min_df=1 keeps every term (same effect as min_df=0, which newer scikit-learn releases may reject for integers).
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix (one row per recipe, one column per n-gram)
tfidf_matrix = tf.fit_transform(recipes_df['title'])
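
To make the vectorization step concrete, here is a minimal sketch on three made-up recipe titles. The titles and the `toy_` names are illustrative assumptions and are not taken from `recipes-sub.csv`. It relies on scikit-learn's default `norm='l2'`, under which every row of the TF-IDF matrix has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles, used only for illustration
toy_titles = [
    "Classic Butter Cookies",
    "Brown Butter Cookies",
    "Spicy Chicken Soup",
]

toy_tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
toy_matrix = toy_tf.fit_transform(toy_titles)

# One row per title, one column per unigram, bigram or trigram in the vocabulary
print(toy_matrix.shape)

# With the default norm='l2', every row has unit length
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```

The unit-length rows matter for the next step: the dot product of two unit vectors is their cosine similarity.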

### Calculate the relationship between the neighbours

[Linear Kernel](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html)

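The `linear_kernel` function computes the dot product between every pair of rows in the two matrices it is given. Because `TfidfVectorizer` L2-normalises each row by default, that dot product is the cosine similarity between the two titles. [Read more about the linear kernel](https://scikit-learn.org/stable/modules/metrics.html#linear-kernel).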

In [4]:
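# Calculate the cosine similarity between all items (dot product of L2-normalised TF-IDF rows)
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each recipe_id to its most similar recipes as (similarity, recipe_id) tuples, best first
results = {}
for idx, row in recipes_df.iterrows():
    # Indices of the 99 highest-scoring recipes, in descending order of similarity
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], recipes_df['recipe_id'][i]) for i in similar_indices]
    # Drop the first entry, which is normally the recipe itself (similarity 1.0)
    results[row['recipe_id']] = similar_items[1:]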
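
As a quick check of the claim that the linear kernel yields cosine similarities here, the sketch below compares `linear_kernel` with `cosine_similarity` on a few made-up titles (hypothetical data, not from the dataset) and then ranks the neighbours of one item. It filters the query item out by index instead of dropping the first ranked entry, which is an alternative to the approach in the cell above, not a description of it.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical titles, used only for illustration
titles = [
    "Classic Butter Cookies",
    "Brown Butter Cookies",
    "Chicken Noodle Soup",
    "Spicy Chicken Soup",
]
matrix = TfidfVectorizer(stop_words='english').fit_transform(titles)

dot_products = linear_kernel(matrix, matrix)
cosines = cosine_similarity(matrix, matrix)

# The two agree because every TF-IDF row already has unit length
print(np.allclose(dot_products, cosines))

# argsort sorts ascending, so reverse it to get the most similar items first
query = 0
ranked = dot_products[query].argsort()[::-1]
neighbours = [int(i) for i in ranked if i != query][:2]
print([titles[i] for i in neighbours])
```

Filtering on `i != query` is slightly more defensive than slicing off the first result: if two recipes share an identical title, both score 1.0 and the recipe itself is not guaranteed to be ranked first.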

## Inference

Two simple helper functions are defined. The first finds the title of a recipe from its ID.

The second uses the similarity results computed above to return the items most similar to a given recipe.

In [5]:
# Look up the title of a recipe given its id (keeps only the part before ' - ')
def item(recipe_id):
    return recipes_df.loc[recipes_df['recipe_id'] == recipe_id]['title'].tolist()[0].split(' - ')[0]

# Return the N most similar items, read straight out of the precomputed `results` dictionary
def recommend(item_id, N):
    return results[item_id][:N]
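
The sketch below shows the shape of the data these helpers work with, using a tiny hand-written stand-in for `recipes_df` and `results`. All ids, titles and scores are hypothetical, and the `toy_` names are introduced only for this example. It also looks titles up through a Series indexed by `recipe_id`, one possible alternative to filtering the whole DataFrame on every call, assuming ids are unique.

```python
import pandas as pd

# Hypothetical stand-in for recipes_df
toy_recipes = pd.DataFrame({
    'recipe_id': [101, 102, 103],
    'title': ['Butter Cookies - Classic', 'Almond Cookies', 'Chicken Soup'],
})

# Same layout as `results`: recipe_id -> [(similarity, recipe_id), ...], best first.
# The scores are invented for illustration.
toy_results = {
    101: [(0.8, 102), (0.1, 103)],
    102: [(0.8, 101), (0.0, 103)],
    103: [(0.1, 101), (0.0, 102)],
}

# Index titles by recipe_id once, assuming recipe ids are unique
titles_by_id = toy_recipes.set_index('recipe_id')['title']

def toy_item(recipe_id):
    # Keep only the part of the title before ' - '
    return titles_by_id.loc[recipe_id].split(' - ')[0]

def toy_recommend(item_id, n):
    return toy_results[item_id][:n]

for score, rec_id in toy_recommend(102, 2):
    print(toy_item(rec_id), score)
```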

In [6]:
# User to recommend for
user_id = 420

# Number of items to recommend
N = 10

# Gets all the items a user has rated
rated_items = list(set(master_ratings_df.loc[master_ratings_df['reviewer_id'] == user_id]['recipe_id'].tolist()))

# Predict similar recipes for all the user's previously rated recipes.
preds = []
for rated_item in rated_items:
    preds += recommend(item_id=rated_item, N=N)
    
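# Drop duplicate results: a recipe can be recommended by several rated recipes
# with different scores, so keep only its highest score.
best_scores = {}
for score, recipe_id in preds:
    best_scores[recipe_id] = max(score, best_scores.get(recipe_id, score))
preds = [(score, recipe_id) for recipe_id, score in best_scores.items()]
# Sort the list by ascending similarity
preds = sorted(preds, key=lambda tup: tup[0])
# Keep only the N most similar recommendations (the best match is listed last)
preds = preds[-N:]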


#Print the results out nicely
for pred in preds:
    print("Recommended: " + item(pred[1]) + " (score:" + str(pred[0]) + ")")

Recommended: Cut-Out Butter Cookies (score:0.2963890636827812)
Recommended: Almond Cookies II (score:0.2966995984569579)
Recommended: Drop Butter Cookies (score:0.29766245887046844)
Recommended: Brown Butter Cookies (score:0.302178394252651)
Recommended: Cinnamon Butter Cookies (score:0.30285039369142613)
Recommended: Butter Cookies IV (score:0.30435019528062957)
Recommended: Classic Butter Cookies I (score:0.30737541857220485)
Recommended: Butter Cookies III (score:0.32773233913696276)
Recommended: Cinnamon Sugar Butter Cookies II (score:0.5871393137609483)
Recommended: Classic Butter Cookies II (score:0.7026410119057137)
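
One way to de-duplicate by recipe instead of by `(score, recipe_id)` pair is to let pandas group the candidates. The following is a minimal sketch on invented pairs, not the notebook's actual predictions: it keeps the highest score per recipe and then takes the top entries, highest first.

```python
import pandas as pd

# Hypothetical (similarity, recipe_id) pairs; recipe 7 is recommended twice
toy_preds = [(0.30, 7), (0.70, 7), (0.55, 3), (0.20, 9)]

preds_df = pd.DataFrame(toy_preds, columns=['score', 'recipe_id'])

# Highest score seen for each recipe
best = preds_df.groupby('recipe_id')['score'].max()

# The two best recipes overall, highest score first
top = best.nlargest(2)
deduped = [(float(score), int(recipe_id)) for recipe_id, score in top.items()]
print(deduped)  # [(0.7, 7), (0.55, 3)]
```

Grouping on `recipe_id` sidesteps the problem that `set()` treats the same recipe with two different scores as two distinct tuples.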
