## Model Recognition and Evaluation

In [2]:
import pandas as pd #pandas is a library for data manipulation and analysis
pd.set_option('display.max_colwidth', None) #set the maximum width of columns to unlimited. This will prevent long strings from being truncated.

import matplotlib.pyplot as plt #matplotlib.pyplot is a plotting library used for 2D graphics in python programming language. It can be used in python scripts, shell, web application servers and other graphical user interface toolkits.

import numpy as np #NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

from sklearn.feature_extraction.text import CountVectorizer #CountVectorizer is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. It is part of the sklearn library.

from nltk import corpus #Natural Language Toolkit’s (NLTK) corpus package defines a collection of corpus reader classes, which can be used to access the contents of a diverse set of corpora.

import re #re is the standard library for regular expressions in Python

import ast #The ast module helps Python applications to process trees of the Python abstract syntax grammar

import statsmodels.api as sm #statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

from sklearn.preprocessing import StandardScaler #StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. 

from sklearn.metrics.pairwise import cosine_similarity #Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them.

from sklearn import datasets, linear_model #datasets and linear_model are both modules of the sklearn library. datasets provides pre-loaded datasets for testing and experimenting with different algorithms. linear_model provides different linear models for regression, classification and other tasks.

import pickle #pickle module implements binary protocols for serializing and de-serializing a Python object structure.


Content-based recommendation systems work by using specific features of an item to recommend additional items with similar properties. In the context of your recipe recommendation system, using the names of the recipes as a basis for generating recommendations can be quite useful. Here's a more detailed breakdown:

Feature Extraction: The first step in building a content-based recommendation system is feature extraction. In this case, we can treat each recipe name as a document and apply natural language processing techniques such as tokenization, stemming, and vectorization (such as TF-IDF or count vectorization) to convert these names into numerical representations that can be processed by the recommendation algorithm.

Similarity Computation: Once the recipes' names are converted into numerical form, the recommendation system can compute the similarity between different recipes based on these representations. There are various metrics available for computing the similarity like cosine similarity, Euclidean distance, Jaccard similarity, etc. The key idea is that recipes with similar names are assumed to be more similar in terms of their content, ingredients, or style, and therefore, if a user likes a certain recipe, they are more likely to enjoy recipes with similar names.

Recommendation Generation: Based on these similarity scores, when a user interacts with a particular recipe (for instance, by rating it highly or viewing it multiple times), the system can then recommend other recipes with similar names.

Model Evaluation and Refinement: It's crucial to evaluate the recommendation system using appropriate metrics like user feedback. Based on this evaluation, the model can be refined and improved over time.

This approach has the advantage of not requiring any user interaction history, which makes it effective for new users (solving the cold start problem). However, it relies heavily on the assumption that recipes with similar names are indeed similar in content, which might not always be the case. Thus, it's often beneficial to combine content-based approaches with other recommendation strategies, such as collaborative filtering, for more robust and diversified recommendations.

In [1]:
#importing libraries
import pandas as pd #pandas takes care of all file handling
from bokeh.models import BoxAnnotation #bokeh is a plot library and boxannotation will help with aspects of the graph
from bokeh.plotting import figure, show  #help with shape and view aspect of the graph
from bokeh.io import output_notebook #gives it the "in notebook" display rather than an HTML display
from bokeh.models import ColumnDataSource, HoverTool #added tools for better features of the graph

output_notebook()

#Data setup
data = {
    'Week': ['May 1-14', 'May 15-Jun 4', 'Jun 5-19', 'Jun 20-23', 'Jun 24-25'],
    'Task': [
        'Data processing',
        'Key features & indicators',
        'Model training & evaluation',
        'A/b testing, Model refinement',
        'Next Steps'
    ],
    'Start': ['2023-05-01', '2023-05-15', '2023-06-05', '2023-06-20', '2023-06-24'],
    'End': ['2023-05-14', '2023-06-04', '2023-06-19', '2023-06-23', '2023-06-25']
}

df = pd.DataFrame(data)
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

#Column data source creation
source = ColumnDataSource(df)

#colors for each task
colors = ['gray', 'green', 'red', 'purple', 'orange']

#figure creation
p = figure(title="CAPSTONE Schedule", x_axis_type='datetime', y_range=df['Task'], width=800, height=400, tools="")

for indx, task in enumerate(df['Task']):
    task_source = ColumnDataSource(df[df['Task'] == task])
    p.hbar(y='Task', height=0.9, left='Start', right='End', color=colors[indx], source=task_source)

#shading the passive and active intervals
box = BoxAnnotation(left=pd.to_datetime('2023-05-01'), right=pd.to_datetime('2023-05-14'), fill_color='lightgrey', fill_alpha=0.1)
box2 = BoxAnnotation(left=pd.to_datetime('2023-05-15'), right=pd.to_datetime('2023-06-04'), fill_color='lightgrey', fill_alpha=0.1)
box3 = BoxAnnotation(left=pd.to_datetime('2023-06-05'), right=pd.to_datetime('2023-06-19'), fill_color='lightgreen', fill_alpha=0.1)

#displaying them
p.add_layout(box)
p.add_layout(box2)
p.add_layout(box3)

#this allows the cursor to hover over it
hover = HoverTool(tooltips=[("Task", "@Task"), ("Start", "@Start{%F}"), ("End", "@End{%F}")], formatters={"@Start": "datetime", "@End": "datetime"})
p.add_tools(hover)

#these are the labels
p.xaxis.axis_label = "Date"
p.yaxis.axis_label = "Task"
p.ygrid.grid_line_color = None
p.xaxis.major_label_orientation = 1

#shows the grapha
show(p)


## Table of Contents


<a id="0"></a> <br>
1. [Sample Data](#0.3)

2. [Cosine Similarity](#0.4)

3. [Recommendations based on ingredients](#0.5)




## Sample Data 
        Dataset volume: 60k 
        type: Random state

In [27]:
#subseting the sample data with random state
resultdf_sample = resultdf2.sample(60000, random_state=1)

In [42]:
#saving the subset
from pathlib import Path  
filepath = Path('Desktop/recipe_sample.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
resultdf_sample.to_csv(filepath) 

In [43]:
#show the sample dataset top 5 rows
resultdf_sample.head(5)

Unnamed: 0,user,count,name,id,minutes,contributor_id,submitted,tags,n_steps,steps,...,Calories (#),Total Fat (PDV),Sugar (PDV),Sodium (PDV),Protein (PDV),Saturated Fat,Carbohydrates (PDV),date,rate,rev
739970,618715,173,herb crusted rack of lamb,367872,97,452355,2009-04-25,timetomake course mainingredient preparation maindish lambsheep meat 4hoursorless,10,in a small bowl stir together the lemon zest garlic rosemary thyme and 1 4 cup olive oil spread the mixture evenly over the racks of lamb cover and refrigerate at least 1 hour or up to overnight preheat oven to 400f season the racks of lamb with salt and pepper in a large ovenproof saute pan over mediumhigh heat warm 2 tb olive oil add the lamb and cook until browned on both sides about 7 minutes total transfer the pan to the oven and cook until the lamb is well browned and an instantread thermometer inserted into the thickest part of the racks away from the bone registers 130f for mediumrare about 15 minutes or until done to your liking transfer the racks to a cutting board and let rest for 5 minutes carve into individual chops and serve the lemon wedges alongside,...,296.8,31.0,71.0,12.0,13.0,26.0,8.0,2009-06-07,5,"This was delicious and easy! I served this with Recipe#171633 drizzled over it since mint goes so well with lamb. Made for the Epicurean Queens, ZWT5: Family Picks."
902802,1355247,8,banana buttermilk muffins,94528,35,133174,2004-06-28,60minutesorless timetomake course preparation breads breakfast oven diabetic muffins dietary quickbreads equipment,10,preheat oven to 375f spray 12 regular muffin cups with nonstick cooking spray in a large bowl mix together the flours sugar baking powder and baking soda in separate bowl whisk together the buttermilk mashed banana oil egg and vanilla pour wet ingredients over dry ingredients and stir just until blended spoon batter into muffin cups filling about threefourths full sprinkle tops evenly with nuts bake until lightly brown and a toothpick inserted in center comes out clean approximately 15 to 20 minutes allow to cool in pan on wire rack for 15 minutes then turn out onto rack and cool completely these muffins freeze well and can be warmed in microwave,...,444.3,32.0,39.0,45.0,57.0,42.0,11.0,2011-07-23,4,"I swapped the egg for a T of ground flaxseed, added chocolate chips, and added a bit of cinnamon and they turned out beautifully!"
176781,86520,241,custard cream cookies,76655,18,106624,2003-11-16,30minutesorless timetomake course preparation handformedcookies desserts oven cookiesandbrownies equipment numberofservings,11,cream 2 ounces of powdered sugar with the 6 ounces of butter or margarine cream well until fluffy sift the 6 ounces of flour with the baking soda and custard powder add dry mixture to the creamed mix and combine well roll into small balls and then flatten with the tines of a fork bake at 325f on parchmentlined sheet until bottoms are just turning golden this should be about 8 minutes remove to cooling rack and cool completely cream filling combine the last of the 2 ounces powdered icing sugar with the 1 ounce of room temperature butter until creamy and spreadable spread frosting on the bottom of one cookie and top with the bottom of another to make a filled cookie,...,183.0,0.0,167.0,0.0,0.0,0.0,15.0,2004-12-12,5,"these are very simple to make and have a unique flavour with the custard powder. I baked the first pan at 325° but I had to bake for 10 min. The 2nd pan I raised the temp. to 350°and baked for the 8 minutes.I only got 32 cookies. I made about 1"" balls but I guess they were a bit too big. I think a 1"" cookie scoop would be ideal.Took pics with and without the icing between."
928259,1609199,1,sweet and spicy ground turkey stir fry,293768,25,384041,2008-03-24,30minutesorless timetomake course mainingredient preparation maindish poultry easy turkey dietary lowcholesterol lowsaturatedfat lowcalorie lowcarb inexpensive healthy2 lowinsomething meat,11,brown ground turkey in a little oil over mediumhigh heat drain any excess grease leaving just enough to accomplish the stirfrying process add garlic onion coriander and a tbs of the soy sauce stir well begin adding the vegetables starting with longcooking veggies like carrots and broccoli and working towards the quickcooking greens but not adding those just yet just before adding the greens right after adding the beans or snap peas add the remaining soy sauce the sugar the sherry and the red pepper flakes stir well so that sugar dissolves and pepper flakes are well distributed add the greens and complete the cooking make the optional sauce if desired by whisking the cornstarch into the chicken stock pour into pan just before the greens are done and cook stirring constantly until the sauce has simmered for about one minute serve over rice ramen noodle or other preferred asian starch,...,933.6,86.0,281.0,18.0,15.0,47.0,34.0,2010-05-03,5,This is a fantastic and healthy recipe that the whole family will love! I could eat this twice a week! LOL!
502734,303646,4,potato and kale soup,38367,55,37636,2002-08-26,60minutesorless timetomake course mainingredient cuisine preparation occasion northamerican for1or2 lowprotein healthy soupsstews potatoes vegetables american easy fall lowfat vegan vegetarian winter stovetop dietary lowsodium lowcholesterol seasonal lowsaturatedfat lowcalorie comfortfood lowcarb healthy2 lowinsomething tastemood equipment numberofservings 3stepsorless,7,in a large pan cook onion in oil until tender mix in garlic potatoes and water bring to a boil boil 4 minutes reduce heat and cook 20 minutes or so or until potatoes are tender mash potatoes into the liquid mix in the kale and pepper simmer 15 minutes more or so till kale is done then serve,...,600.1,26.0,81.0,64.0,95.0,31.0,22.0,2006-05-09,5,I used stock instead of water and this soup was so good my partner had seconds even though he usually doesn't like soup. Just as a warning to others I didn't cut my kale small enough so it was very messy to eat. Next time i'll cut them bite size...


In [None]:
#save sparse matrix
from scipy import sparse

sparse.save_npz("similarity.npz", similarities_a)

Saving the similarity matrix to save time.

In [28]:
#Now lets run the TF IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = "english", min_df=2)
resultdf_sample['name'] = resultdf_sample['name'].fillna("")

TF_IDF_matrix = vectorizer.fit_transform(resultdf_sample['name'])

**TF-IDF**

Term Frequency-Inverse Document Frequency. It is a numerical statistic used to reflect how important a word is to a document in a collection or corpus. It's one of the most popular techniques used for text-based recommendation systems and is very effective in the field of information retrieval and natural language processing. It consists of two parts:

Term Frequency (TF): This measures the frequency of a word in a document. That is, if a word appears frequently in a document, it's important. What term frequency does is normalize the raw count of terms based on the length of the document. The intuition for this is that a term appearing 10 times in a 1000 word document is probably more important than a term appearing 10 times in a 10000 word document.

Inverse Document Frequency (IDF): This measures the uniqueness of a word across all documents. That is, if a word appears rarely across documents but frequently in certain documents, then it's a good discriminator or unique identifier for those documents. IDF diminishes the weight of words that occur very frequently in the document set and increases the weight of words that occur rarely.

The multiplication of these two quantities (TF and IDF) results in the TF-IDF score. This score is used to rank the importance of a word to a document in the context of a set of documents.

In the context of a content-based recommender system, you might use TF-IDF scores to represent items (like recipes, in your case) as vectors where each element of the vector corresponds to the TF-IDF score of a term (like a specific ingredient). This allows you to then use methods like cosine similarity to calculate the similarity between different items.

The advantage of using a TF-IDF vector representation is that it can capture both the frequency and rarity of terms in your data, which can improve the quality of your recommendations. Words with high TF-IDF scores in a document are often good keywords to summarize the content of the document. It is particularly good at dealing with text data, making it a popular choice for recommendation systems that deal with items like books, articles, or recipes.

In [29]:
#shape of the matrix 
#columns represent features
TF_IDF_matrix.shape

(60000, 5662)

In [242]:
TF_IDF_matrix

<60000x5662 sparse matrix of type '<class 'numpy.float64'>'
	with 231858 stored elements in Compressed Sparse Row format>

In [337]:

#tf idf matrix when accessing the key word pasta
TF_IDF_matrix[(resultdf_sample['name'] == 'pasta').values].todense().squeeze()

matrix([[0., 0., 0., ..., 0., 0., 0.]])

In [338]:
#finding the same key word with the sample dataframe
resultdf_sample[resultdf_sample['name'] == 'pasta']

Unnamed: 0,index,user,count,name,id,minutes,contributor_id,submitted,tags,n_steps,...,Calories (#),Total Fat (PDV),Sugar (PDV),Sodium (PDV),Protein (PDV),Saturated Fat,Carbohydrates (PDV),date,rate,rev
13289,907222,1384367,64,pasta,41490,30,26278,2002-09-30,30minutesorless timetomake course mainingredient preparation occasion maindish pasta vegetables easy kidfriendly vegetarian dietary inexpensive pastariceandgrains,16,...,469.9,40.0,17.0,52.0,46.0,48.0,11.0,2011-03-14,0,"I don't want to rate this recipe, as I cut back on the oil a bit and skipped the olives, so I didn't follow the recipe exactly. But I had a very hard time flipping the spaghetti ""crust."" Our pasta pizzas turned into pasta bowls that tasted like really good spaghetti. We aren't complaining - 'cause it was good - but it wasn't exactly what I had envisioned. I do have a photo, though, which I can upload, if you'd like! :)"


## Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It is widely used in information retrieval and text mining as a measure of textual similarity of documents, and consequently, in building recommender systems.

In the context of a recommender system, a Cosine Similarity matrix can be seen as a way to quantify the similarity between items. The items could be anything like users, movies, books, or in your case, recipes.

If you are building a content-based recommender system, which you've mentioned before, the goal is to recommend items that are similar to the ones the user liked in the past. To accomplish this, the system must quantify how similar the items are to each other. One common way to do this is to represent the items as high-dimensional vectors, and then calculate the cosine of the angle between these vectors. The cosine similarity will be a value between -1 and 1, where 1 means the vectors are identical, 0 means the vectors are orthogonal (i.e., not similar), and -1 means the vectors are diametrically opposed (i.e., completely dissimilar).

Let's say you represent each recipe as a vector where each element of the vector corresponds to the presence or absence of an ingredient. In this case, the cosine similarity between two recipe vectors would be a measure of how many ingredients the two recipes share. This cosine similarity matrix can then be used to retrieve the most similar recipes to a given one, which can then be recommended to the user.

Similarly, for a user-user collaborative filtering recommender system, the cosine similarity can be calculated between users based on their rating history to find the most similar users. The system then recommends items that the similar users have rated highly.

In both cases, the cosine similarity matrix plays a key role in identifying the most similar items or users, which is the core concept in recommender systems.

In [154]:
#this example provides the exact difference between 2 rows
from sklearn.metrics.pairwise import cosine_similarity

recipe_n1 = TF_IDF_matrix[(resultdf_sample['name'] == 'chicken and ham cassoulet').values,]
recipe_n2 = TF_IDF_matrix[(resultdf_sample['name'] == 'chicken fried steak w cream gravy').values,]

print("Similarity:", cosine_similarity(recipe_n1, recipe_n2))

Similarity: [[0.0787372 0.0787372 0.0787372 0.0787372 0.0787372 0.0787372 0.0787372
  0.0787372 0.0787372 0.0787372]
 [0.0787372 0.0787372 0.0787372 0.0787372 0.0787372 0.0787372 0.0787372
  0.0787372 0.0787372 0.0787372]]


The computation of the cosine similarity matrix using the TF-IDF matrix is indeed a large computational task, primarily for two reasons:

High-Dimensionality: Each unique term in the entire corpus of documents contributes to a dimension in the TF-IDF matrix. For example, if you have a corpus of recipes, each unique ingredient or word used in the names or descriptions of the recipes contributes to a dimension in the TF-IDF matrix. This could easily add up to thousands or even millions of dimensions, depending on the size and diversity of your corpus.

Pairwise Comparisons: When computing the cosine similarity matrix, every item in your corpus needs to be compared to every other item to determine their similarity. This results in a square matrix of size n x n, where n is the number of items in your corpus. So, if you have a large number of items, the number of computations (and thus the size of the matrix) grows quadratically.

Despite these challenges, using cosine similarity with a TF-IDF matrix is a popular and effective approach in many recommendation systems, especially in content-based systems where the items can be represented as text. Techniques such as dimensionality reduction (e.g., PCA, SVD) or utilizing sparse representations can help manage the computational load.

In [339]:
#Sparse Matrix
from sklearn.metrics.pairwise import cosine_similarity 
similarities_a = cosine_similarity(TF_IDF_matrix, dense_output=False)

Saving the sparse matrix.

In [38]:
from scipy import sparse

sparse.save_npz("similarity.npz", similarities_a)

In [340]:
#looking up the values for pasta
resultdf_sample[resultdf_sample['name'].str.contains('pasta')]

Unnamed: 0,index,user,count,name,id,minutes,contributor_id,submitted,tags,n_steps,...,Calories (#),Total Fat (PDV),Sugar (PDV),Sodium (PDV),Protein (PDV),Saturated Fat,Carbohydrates (PDV),date,rate,rev
216,524115,323186,1385,shrimp and pasta picante,301665,70,644902,2008-05-01,timetomake course mainingredient preparation maindish seafood shrimp dietary onedishmeal shellfish 4hoursorless,11,...,503.3,29.0,101.0,48.0,52.0,33.0,19.0,2009-07-30,5,"We loved this, my husband declared it one of the better meals of late! we are huge shrimp and pasta lovers, and this just brought everything together nicely! I wanted to try the charred onions, but in the end, forgot, another time I will definitely try this, I know that we will love them! We liked the heat and the robustness of this sauce. My photo does not do justice to an excellent dish, served on spaghetti, I will aim for a better one next time I make this!\nThank you, Ravenseyes, made for Veg swap#12"
326,786122,745336,12,easy spicy shrimp pasta low fat,144728,15,209747,2005-11-13,15minutesorless timetomake course mainingredient cuisine preparation occasion northamerican for1or2 healthy maindish pasta seafood caribbean easy centralamerican dinnerparty holidayevent lowfat romantic shrimp dietary lowsaturatedfat lowinsomething pastariceandgrains shellfish tastemood numberofservings,19,...,20.1,0.0,9.0,0.0,1.0,0.0,1.0,2008-03-07,5,"This was a great, light, fresh meal! The flavor was really good with the combination of lemon juice and the onion. I didn't have jalapeno pepper, so I just used more red pepper flakes. Yum! Yum!"
344,947307,1831591,22,ranch pasta salad with bacon,255591,25,339260,2007-09-26,30minutesorless timetomake course mainingredient preparation occasion salads pasta easy dinnerparty pastariceandgrains brunch,10,...,691.4,45.0,49.0,63.0,31.0,20.0,30.0,2013-12-07,4,"I made this to bring to a pot luck. I used low fat ingredients and low sodium bacon. I increased the amount of bacon, tomatoes and cheese from the original recipe, otherwise too much pasta. I added salt and pepper and the salad does not need salt with all the bacon. I thought it tasted salty but others did not. It was only refrigerated for about an hour and I think all pasta salads should chill longer than that to get all the flavors melded. All in all, very good and pretty fast and easy. I think it would serve at least 12 as a side dish, even more on a buffet/pot luck."
354,805015,813730,1,creamy garlic penne pasta,43023,15,37305,2002-10-14,15minutesorless timetomake course mainingredient preparation occasion healthy salads pasta easy potluck dinnerparty kidfriendly stovetop dietary inexpensive pastariceandgrains penne togo equipment,7,...,1131.9,85.0,7.0,86.0,285.0,77.0,1.0,2008-04-09,5,"I made this for a couple of coworkers tonight and it turned out delicious! They suggested that I should add shrimp to it when I made it. I did and they loved it! I served this dish as a side to a garlic chicken recipe I also found on the site. The best part was that the whole meal was super easy and quick to throw together, but tasted like something I slaved on for hours! Terrific!"
435,860132,1059964,11,creamy cajun chicken pasta,39087,25,30534,2002-09-02,30minutesorless timetomake course mainingredient cuisine preparation occasion northamerican for1or2 maindish eggsdairy pasta poultry american cajun southernunitedstates easy dinnerparty kidfriendly romantic chicken stovetop dietary spicy comfortfood meat chickenbreasts pastariceandgrains tastemood equipment numberofservings,4,...,33.0,2.0,1.0,2.0,2.0,3.0,1.0,2009-03-20,5,I used a lb of pasta and doubled everything for the sauce (used 2 cups of cream) it was wonderful! Would be really good with mushrooms too! My 2 year old loved it!!!
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59833,423414,227652,796,lemon basil pasta w chicken,18266,25,13593,2002-01-28,30minutesorless timetomake course mainingredient preparation occasion lunch maindish sidedishes pasta poultry easy beginnercook dinnerparty kidfriendly chicken stovetop dietary onedishmeal comfortfood brownbag inexpensive meat pastariceandgrains tastemood togo equipment,6,...,371.4,19.0,16.0,20.0,30.0,23.0,16.0,2006-09-11,5,"This was so good! I'm not even a big lemon fan & this couldn't have been tastier or easier to put together. I cut the recipe in half, used spaghetti noodles, lite butter & only used half the parm cheese (I felt like it was all that was needed). I also used dried basil (for 2 servings, I used 1/4 + 1/8 tsp dried basil). Loved it & will make often. Thanks so much for sharing."
59854,540778,338666,1,creamy cajun chicken pasta,39087,25,30534,2002-09-02,30minutesorless timetomake course mainingredient cuisine preparation occasion northamerican for1or2 maindish eggsdairy pasta poultry american cajun southernunitedstates easy dinnerparty kidfriendly romantic chicken stovetop dietary spicy comfortfood meat chickenbreasts pastariceandgrains tastemood equipment numberofservings,4,...,33.0,2.0,1.0,2.0,2.0,3.0,1.0,2009-10-22,5,"I made this last night and it was wonderfull. This is now going into my cookbook! I have a few notes to make: Some cajun seasonings contain a large percentage of salt. I left the salt out of the recipe and it still had plenty of salt from my cajun seasonings. I doubled the recipe including the cream. If you double the cream it will be thin, but will reduce some if you simmer it down. Next time I will use 3 cups cream when I double the recipe. This was a great dish Thanks!"
59953,297607,156526,112,chicken with angel hair pasta,179083,90,329402,2006-07-24,timetomake course mainingredient cuisine preparation occasion northamerican maindish pasta poultry american easy dinnerparty kidfriendly chicken dietary californian inexpensive toddlerfriendly meat chickenbreasts pastariceandgrains 4hoursorless,10,...,471.6,36.0,41.0,3.0,63.0,14.0,12.0,2006-08-12,5,"Made this when some friends came over for dinner - what a great dish. Can also be made in the slow cooker. I added sliced mushrooms over the chicken before pouring the sauce over it, which was yummy ! Definitely keeping this one around."
59960,549872,348803,8,solo honey mustard steak and pasta,47135,15,1634,2002-11-20,15minutesorless timetomake course mainingredient preparation occasion for1or2 healthy lunch maindish beef pasta easy beginnercook stovetop dietary lowcholesterol lowsaturatedfat healthy2 lowinsomething meat pastariceandgrains equipment numberofservings,8,...,93.1,10.0,0.0,5.0,13.0,11.0,0.0,2007-12-06,3,"Great, quick solo meal! I tried peas instead of mushrooms and was pleased. 3.5 minutes on both sides left the middle too red to seem safe to me so it might take a little longer to prepare than listed. I like Brother Williams recipees and you will too, especially if after taking his suggestions you add your own personality on the second run. Thanks Brother!"


In [260]:
#reseting index 
resultdf_sample = resultdf_sample.reset_index(drop=True)

In [255]:
#cleaning the extra added 0 common for indexing
resultdf_sample = resultdf_sample.drop('level_0', axis = 1)

In [185]:
# Get the column based upon the index
recipe_index = resultdf_sample[resultdf_sample['name'].str.contains('pasta')].index

# Create a dataframe with the movie titles
sim_df = pd.DataFrame({'recipe':resultdf_sample['name'], 
                       'similarity': np.array(similarities_a[recipe_index, :].todense()).squeeze().flatten()})

ValueError: array length 67440000 does not match index length 60000

In [344]:
#show sim_df
sim_df

Unnamed: 0,recipe,similarity
0,herb crusted rack of lamb,0.0
1,banana buttermilk muffins,0.0
2,custard cream cookies,0.0
3,sweet and spicy ground turkey stir fry,0.0
4,potato and kale soup,0.0
...,...,...
59995,fannie farmer s classic baked macaroni cheese,0.0
59996,vincent price spaghetti alla bolognese spaghetti w meat sauce,0.0
59997,red lobster nantucket baked cod by todd wilbur,0.0
59998,chicken fried steak w cream gravy,0.0


Initial result of the similarity score matrix - it needs to be descending order.

In [345]:
# Return the top 10 most similar movies
sim_df.sort_values(by='similarity', ascending=False).head(50)

Unnamed: 0,recipe,similarity
52834,klomp s pasta,1.0
41152,pasta tutto giardino,1.0
19179,improvisation pasta,1.0
30790,post workout pasta,1.0
10217,tuxedo pasta,1.0
13289,pasta,1.0
58914,chicken pasta,0.845372
23462,chicken pasta,0.845372
745,grandads pasta sauce,0.765284
45452,easy pasta,0.750481


In [346]:
#adding the count towards each row
def content_recommender(title, recipe, similarities, vote_threshold=10, top_n=10) :
    
    # Get the movie by the title
    recipe_index = resultdf_sample[resultdf_sample['name'] == title].index
    
    # Create a dataframe with the movie titles
    sim_df = pd.DataFrame(
        {'recipe': resultdf_sample['name'], 
         'similarity': np.array(similarities_a[recipe_index, :].todense()).squeeze(),
         'count': resultdf_sample['count']
        })
    
    # Get the top 10 movies with > 10 votes
    top_recipes = sim_df[sim_df['count'] > vote_threshold].sort_values(by='similarity', ascending=False).head(top_n)
    
    return top_recipes

Adding the count of items to the similarity matrix is important for several reasons, especially in the context of recommender systems:

Improving Accuracy: The frequency of item interactions can provide important contextual information that can improve the accuracy of the recommendations. For example, if a user has interacted with a specific item many times, it might be more relevant to that user than an item they have only interacted with once.

Popularity Bias: In many recommendation systems, popular items (i.e., those with a high interaction count) tend to be recommended more often. Including item counts in the similarity matrix can help account for this popularity bias, as it provides a measure of how often each item is interacted with in general, not just by the target user.

Sparse Data: Recommender systems often have to deal with sparse data, where only a small fraction of possible user-item interactions are known. Including item counts in the similarity matrix can provide additional information that helps the system make predictions for unknown interactions.

Diversification: Using the count of items can also help in diversifying the recommendations. If you only base your recommendations on the most similar items, you might end up recommending very similar items, leading to a lack of diversity in the recommendations. But if you also consider the popularity of items (as represented by their count), you can recommend popular items that are not necessarily the most similar, thereby increasing diversity.

It's worth noting that the value of adding item counts to the similarity matrix can depend on the specific context and dataset. In some cases, it might significantly improve the recommendations, while in others it might not make a big difference. It's always a good idea to experiment with different approaches and evaluate their performance on your specific task.

In [348]:
# Test the recommender with 1000 threshold
similar_recipes = content_recommender("pasta", sim_df, similarities_a, vote_threshold=1000, top_n=40)
similar_recipes

Unnamed: 0,recipe,similarity,count
58914,chicken pasta,0.845372,2541
28789,vegetarian pasta e fagioli pasta and beans,0.650005,1330
34232,garlic pasta salad,0.626941,1170
48933,four cheese pasta casserole,0.625975,1883
10454,spicy sonora chicken pasta,0.621849,2774
21911,chicken and mushroom pasta,0.600587,1452
2403,gramps italian pasta salad,0.593751,3811
30845,pasta bean soup,0.587455,7395
17202,baked pasta with spinach,0.584227,1409
50269,pasta and sausage soup,0.581205,1170


It's only show 1000 counts and up which means that my control on what it recommends is actually working and providing results that I can see and tweak according to my goal.  Being novelty we need to increase the diversity of the count, because that could bring more interesting results; however I do not know the threshold where this will start to be come aparent so I will try a few to decide.

**Threshold 100**

In [None]:
# Test the recommender
similar_movies = content_recommender("baked pasta with spinach", sim_df, similarities_a, vote_threshold=200)
similar_movies.head(10)

Unnamed: 0,recipe,similarity,count
17202,baked pasta with spinach,1.0,1409
5348,chicken spinach and pasta,0.778549,239
6212,chicken spinach and pasta,0.778549,239
37350,pasta with bacon and spinach,0.679348,337
42911,baked pasta sauce,0.668385,514
14080,creamy chicken and spinach pasta,0.654511,1664
42314,spinach tomato pasta salad,0.629096,393
30439,baked spinach and eggs,0.613909,1929
41152,pasta tutto giardino,0.584227,232
30790,post workout pasta,0.584227,296


**Threshold 200**

In [349]:
# Test the recommender
similar_movies300 = content_recommender("pasta", sim_df, similarities_a, vote_threshold=300, top_n=40)
similar_movies300.head(30)

Unnamed: 0,recipe,similarity,count
23462,chicken pasta,0.845372,943
58914,chicken pasta,0.845372,2541
48790,lemon pasta,0.710148,581
28789,vegetarian pasta e fagioli pasta and beans,0.650005,1330
44084,pasta and zucchini,0.629255,838
42911,baked pasta sauce,0.629239,514
34232,garlic pasta salad,0.626941,1170
48933,four cheese pasta casserole,0.625975,1883
10454,spicy sonora chicken pasta,0.621849,2774
19703,chicken pasta bake,0.616881,515


**Threshold 300**

In [46]:
# Test the recommender
similar_movies = content_recommender("baked pasta with spinach", sim_df, similarities_a, vote_threshold=500)
similar_movies.head(10)

Unnamed: 0,recipe,similarity,count
17202,baked pasta with spinach,1.0,1409
42911,baked pasta sauce,0.668385,514
14080,creamy chicken and spinach pasta,0.654511,1664
30439,baked spinach and eggs,0.613909,1929
3225,spinach and mushroom pasta bake,0.568252,1437
7532,baked chicken parmesan over pasta,0.567646,1649
42161,baked alfredo pasta,0.554671,2897
18154,creamy spinach and avocado pasta,0.554043,5369
32492,penne pasta with spinach and bacon,0.542524,4487
38524,spinach and ricotta cheese sauce for pasta,0.513723,919


**Threshold 500**

In [50]:
# Test the recommender
similar_movies = content_recommender("baked pasta with spinach", sim_df, similarities_a, vote_threshold=600)
similar_movies.head(10)

Unnamed: 0,recipe,similarity,count
17202,baked pasta with spinach,1.0,1409
14080,creamy chicken and spinach pasta,0.654511,1664
30439,baked spinach and eggs,0.613909,1929
3225,spinach and mushroom pasta bake,0.568252,1437
7532,baked chicken parmesan over pasta,0.567646,1649
42161,baked alfredo pasta,0.554671,2897
18154,creamy spinach and avocado pasta,0.554043,5369
32492,penne pasta with spinach and bacon,0.542524,4487
38524,spinach and ricotta cheese sauce for pasta,0.513723,919
59460,baked pasta e fagioli,0.503814,894


End of finding threshold; For this approach we will stick with 300; gives more of a diverse count.

**Threshold 600**

In [358]:
#a copy of the similar movies
sm = similar_movies300.copy()

In [359]:
#sm2 
sm2 = sm.drop_duplicates()

In [360]:
sm2

Unnamed: 0,recipe,similarity,count
23462,chicken pasta,0.845372,943
58914,chicken pasta,0.845372,2541
48790,lemon pasta,0.710148,581
28789,vegetarian pasta e fagioli pasta and beans,0.650005,1330
44084,pasta and zucchini,0.629255,838
42911,baked pasta sauce,0.629239,514
34232,garlic pasta salad,0.626941,1170
48933,four cheese pasta casserole,0.625975,1883
10454,spicy sonora chicken pasta,0.621849,2774
19703,chicken pasta bake,0.616881,515


**threshold at 300**

By setting a threshold at 300, this recommendation system becomes more selective, focusing on recommending items with a high degree of similarity. This enhances the diversity of the recommendations by veering away from the usual, popular items that would typically populate a user's recommendation list. It ensures that the system surfaces less common but potentially highly relevant items for each user.

However, it's essential to remember that the choice of threshold isn't a one-size-fits-all solution. It could vary depending on the dataset, the preferences of individual users, and the ultimate objectives of the recommendation system. Some scenarios might benefit from a lower threshold that allows for a broader range of recommendations.

The proof of the pudding is in the eating, though. The real metric of success for any recommendation system lies in its reception by the end-users. By having real users test your system, you gain invaluable insights into its performance and the relevance of its recommendations. The feedback and interaction data from these users provide a clear direction for fine-tuning the system and enhancing its recommendation quality. So, your strategy of involving real users in testing your recommender system is commendable and will significantly contribute to the system's evolution.

In [350]:
similar_movies300.to_csv(r'Pasta_S_Mod1.csv')

In [51]:
cv = CountVectorizer(min_df=2)
count_matrix = cv.fit_transform(resultdf_sample['ingredients'])

In [52]:
cosine_sim = cosine_similarity(count_matrix, dense_output=False)

KeyboardInterrupt: 

In [81]:
def content_recommender_ing(title, recipe, similarities, vote_threshold=10, top_n=10) :
    
    # Get the movie by the title
    recipe_index = resultdf_sample[resultdf_sample['ingredients'] == title].index
    
    # Create a dataframe with the movie titles
    sim_df = pd.DataFrame(
        {'recipe': resultdf_sample['name'], 
         'similarity': np.array(similarities_a[recipe_index, :].todense()).squeeze(),
         'count': resultdf_sample['count']
        })
    
    # Get the top 10 movies with > 10 votes
    top_recipes = sim_df[sim_df['count'] > vote_threshold].sort_values(by='similarity', ascending=False).head(top_n)
    
    return top_recipes

In [82]:
# Test the recommender
similar_movies = content_recommender_ing("pasta", sim_df, cosine_sim, vote_threshold=1000)
similar_movies.head(10)

ValueError: Per-column arrays must each be 1-dimensional

## Recommendation based on Ingredients

In [150]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option('display.max_colwidth', None)

In [151]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(resultdf_sample['ingredients'])

In [55]:
cosine_sim = cosine_similarity(count_matrix)

In [330]:
#key word 'herb' 
#ingredient_name = 'cake'
ingredient_name = 'pasta'
#ingredient_name = 'herb'


In [331]:
def recommend_recipe(ingredient_name):

  recipe_index = resultdf_sample[resultdf_sample['name'].str.contains(ingredient_name)].index[0]
  similar_recipes = list(enumerate(cosine_sim[recipe_index]))
  sorted_recipes = sorted(similar_recipes,key=lambda x:x[1],reverse=True)[1:]
  recommended_recipes = []
  for i in range(50):
      recommended_recipes.append(resultdf_sample.iloc[sorted_recipes[i][0]]['name'])
  
  return recommended_recipes

In [332]:
recommend_recipe(ingredient_name)

['herb crusted rack of lamb',
 'herb roasted idaho potato fries',
 'roasted rosemary potatoes with garlic',
 'roasted rosemary potatoes with garlic',
 'roasted rosemary potatoes with garlic',
 'bbq or roasted spiced leg of lamb',
 'rosemary roasted new potatoes',
 'balsamic pork tenderloin',
 'rosemary   garlic oven fries',
 'rosemary   garlic oven fries',
 'rosemary   garlic oven fries',
 'leg of lamb boneless greek style',
 'brined whole chicken with lemon and thyme',
 'roasted bone in chicken breasts with herbs',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrabba s bread dipping spice by todd wilbur',
 'tsr version of carrab

In [333]:
Ingredients = pd.DataFrame(recommend_recipe(ingredient_name)).drop_duplicates()

In [334]:
Ingredients

Unnamed: 0,0
0,herb crusted rack of lamb
1,herb roasted idaho potato fries
2,roasted rosemary potatoes with garlic
5,bbq or roasted spiced leg of lamb
6,rosemary roasted new potatoes
7,balsamic pork tenderloin
8,rosemary garlic oven fries
11,leg of lamb boneless greek style
12,brined whole chicken with lemon and thyme
13,roasted bone in chicken breasts with herbs


This table looks ready for the user to start responding to.  no duplicates and no missing values and they all are in same way related to the ingredient **'herb'**.

In [336]:
Ingredients.to_csv(r'Herb_S_Mod2.csv')

The second recommendation system represents a shift in focus, from recipe names to ingredients. This choice allows you to tap into a different dimension of the culinary experience and recommend recipes based on ingredient similarity.

Here's an expansion on the idea:

While the first model focused on the names of the recipes, the second model considers the ingredients used in the recipes. This is a significant shift in the recommendation strategy. While recipe names may give us a broad idea of what the dish might be, they often do not encapsulate the entirety of the recipe. Names might be creative, themed, or abstract, which might not give us a clear picture of the actual contents of the dish.

On the other hand, ingredients are the building blocks of any recipe. They define what the recipe is at a fundamental level. By focusing on ingredients, this model aims to provide recommendations based on the actual contents of the dishes. For example, if a user likes a recipe that heavily features tomatoes and basil, the system would recommend other recipes where these ingredients are prominent.

Cosine similarity is still the core of this model, indicating that the mathematical approach remains the same. However, the shift in focus to ingredients as features likely changes the recommendation results significantly. It enables the system to find deeper correlations between dishes based on their ingredients, rather than relying solely on their names.

By providing a different perspective, this ingredient-based model offers a more granular, content-driven approach to recipe recommendations. This method could potentially cater better to users with specific ingredient preferences or dietary restrictions, thereby enhancing the user experience and the system's overall effectiveness.