### Data Gathering and Pre-Processing

In [1]:
from recipe_scrapers import scrape_me

In [2]:
# Test out recipe scraper
scrape = scrape_me('https://www.allrecipes.com/recipe/257938/spicy-thai-basil-chicken-pad-krapow-gai')
print(scrape.title())
print(scrape.ingredients())

Spicy Thai Basil Chicken (Pad Krapow Gai)
['1/3 cup chicken broth', '1 tablespoon oyster sauce', '1 tablespoon soy sauce, or as needed', '2 teaspoons fish sauce', '1 teaspoon white sugar', '1 teaspoon brown sugar', '2 tablespoons vegetable oil', '1 pound skinless, boneless chicken thighs, coarsely chopped', '1/4 cup sliced shallots', '4 cloves garlic, minced', '2 tablespoons minced Thai chilies, Serrano, or other hot pepper', '1 cup very thinly sliced fresh basil leaves', '2 cups hot cooked rice']


In [3]:
# Load the filler words (quantities, units, preparation terms) to strip from ingredient lines.
# Some non-ingredient words are deliberately kept because they could act as latent features.
import pandas
data = pandas.read_csv("words_remove.csv")
words_remove = data['Words'].tolist()
print(words_remove)

['1', '2', '3', '4', '5', '6', '7', '8', '9', "'", ',', '/', 'baking', 'brown', 'cans', 'chopped', 'cloves', 'coarsely', 'crumbled', 'crumbs', 'crushed', 'cup', 'cups', 'cut', 'dark', 'divided', 'minced', 'mix', 'needed', 'optional', 'other', 'ounces', 'ounce', 'package', 'pan', 'parts', 'pound', 'sliced', 'tablespoons', 'tablespoon', 'tbs', 'tbsp', 'teaspoons', 'teaspoon', 'tsp', 'vegetable', 'white', 'large', 'purpose', 'peeled', 'discarded', 'finely', 'finely', 'pinches', 'pinch', 'shears', 'grey', 'serving', 'slices', 'slivered']


In [4]:
def clean_ingredients():
    # Strip every filler word, as a plain substring, from the global `ingredients` list
    global ingredients
    for word in words_remove:
        ingredients = [x.replace(word, "") for x in ingredients]
        ingredients = [x.replace("  ", " ") for x in ingredients]
        ingredients = [x.strip() for x in ingredients]

In [5]:
# Test out writing the cleaned-up ingredients to a csv
import csv
filename = "recipetest.csv"
ingredients = scrape.ingredients()

clean_ingredients()

# Open the file once so the header and the rows go through the same handle;
# a second open() in "w" mode would let a late flush overwrite the first row
with open(filename, "w", newline='') as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(["ingredients"])  # header row
    for val in ingredients:
        writer.writerow([val])
print(ingredients)

['chicken broth', 'oyster sauce', 'soy sauce or as', 'fish sauce', 'sugar', 'sugar', 'oil', 'skinless boneless chicken thighs', 'shallots', 'garlic', 'Thai chilies Serrano or hot pepper', 'very thinly fresh basil leaves', 's hot cooked rice']
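The leftover fragments in the output above (for example 's hot cooked rice') come from plain substring replacement: 'cup' is listed before 'cups', so only the first three letters are removed. A minimal sketch of a whole-word alternative follows. The function name `clean_ingredient` and the small `sample_words` list are hypothetical, used only for illustration, and the sketch assumes that alphabetic entries should match whole words while digits and punctuation can still be removed as substrings.

```python
import re

def clean_ingredient(text, words):
    """Remove alphabetic filler words as whole tokens and other entries as substrings."""
    words = [w for w in words if w]  # ignore empty entries
    alpha = [w for w in words if w.isalpha()]
    other = [w for w in words if not w.isalpha()]
    if alpha:
        # \b anchors keep 'cup' from matching inside 'cups' or 'cupcake'
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in alpha) + r")\b"
        text = re.sub(pattern, " ", text)
    for token in other:  # digits and punctuation such as '1', ',' and '/'
        text = text.replace(token, " ")
    return " ".join(text.split())  # collapse repeated whitespace

# Hypothetical subset of the filler-word list, for illustration only
sample_words = ['1', '2', '/', ',', 'cup', 'cups', 'sliced']
print(clean_ingredient('2 cups hot cooked rice', sample_words))
# prints: hot cooked rice
```

Like the original function, this match is case-sensitive. It would change the cleaned strings, so the outputs shown in this notebook would differ if it were swapped in.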


This gives us a list of cleaned-up ingredients that we could put into a dictionary and then build a matrix from. But I'm going to try a different method: I will join the items of each list into a single string, which lets me use TfidfVectorizer from scikit-learn.

In [6]:
df_recipes = pandas.read_csv("recipe_links.csv")
recipe_links = df_recipes['Link'].tolist()

In [7]:
ingredients_combined = []
titles_list = []
for j in range(len(recipe_links)):
    scrape = scrape_me(recipe_links[j])
    ingredients = scrape.ingredients()
    clean_ingredients()
    ingredients_combined.append(' '.join(ingredients))
    titles_list.append(scrape.title())

In [8]:
ingredients_matrix = df_recipes
ingredients_matrix['Title'] = titles_list
ingredients_matrix['Ingredients'] = ingredients_combined
ingredients_matrix.head()

Unnamed: 0,Link,Title,Ingredients
0,https://www.allrecipes.com/recipe/257938/spicy...,Spicy Thai Basil Chicken (Pad Krapow Gai),chicken broth oyster sauce soy sauce or as fis...
1,https://www.allrecipes.com/recipe/238840/quick...,Quick Crispy Parmesan Chicken Breasts,cooking spray ko bread Parmesan cheese paprika...
2,https://www.allrecipes.com/recipe/23847/pasta-...,Pasta Pomodoro,( ) angel hair pasta olive oil onion garlic s ...
3,https://www.allrecipes.com/recipe/50435/fry-br...,Fry Bread Tacos II,Toppings: (. ) can pinto beans with liquid pic...
4,https://www.allrecipes.com/recipe/142488/amazi...,Amazing Spicy Grilled Shrimp,olive oil sesame oil fresh parsley hot sauce g...


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

cv = CountVectorizer(analyzer='word', stop_words='english', binary=True)
cv_matrix = cv.fit_transform(ingredients_matrix['Ingredients'])
#print(cv_matrix.toarray())
#print(cv.get_feature_names())

Here we can use CountVectorizer to get a matrix/array representation of which words appear in each ingredient list, and to see which feature names are extracted from the scraped ingredient lists.

In [10]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), stop_words='english', binary=True)
tfidf_matrix = tf.fit_transform(ingredients_matrix['Ingredients'])
tfidf_matrix

<102x3774 sparse matrix of type '<class 'numpy.float64'>'
	with 6478 stored elements in Compressed Sparse Row format>

TfidfVectorizer produces L2-normalized vectors by default, so linear_kernel (a plain dot product) gives the cosine similarity. ngram_range=(1,3) picks up sequences of one, two, and three words, since multi-word ingredients may be important. I set binary=True because I don't care how many times an ingredient is mentioned in a recipe, only whether it is listed.
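The reason a dot product is enough can be checked on toy vectors: once two vectors are scaled to unit length, their dot product equals their cosine similarity. The sketch below uses plain Python lists rather than the sparse TF-IDF matrix, and the helper names (`l2_normalize`, `dot`, `cosine`) and the vectors `a` and `b` are made up for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two arbitrary toy vectors standing in for TF-IDF rows
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(dot_of_normalized, cosine(a, b)))
# prints: True
```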

In [11]:
#cosine similarity
recipe_comparitor = 1
cosine_similarities = linear_kernel(tfidf_matrix[recipe_comparitor], tfidf_matrix).flatten()
cosine_similarities
print('Comparing recipes to: ' + str(ingredients_matrix['Title'][recipe_comparitor]))

Comparing recipes to: Quick Crispy Parmesan Chicken Breasts


In [12]:
# This will compare every recipe with every recipe
#cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
#for idx, row in ingredients_matrix.iterrows():
#    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
#    similar_items = [(cosine_similarities[idx][i], ingredients_matrix['Title'][i]) for i in similar_indices]
#similar_items

In [13]:
cosine_index = cosine_similarities.argsort()[:-12:-1] # Indices of the 11 highest scores: the comparison recipe itself plus its 10 best matches
cosine_index

array([  1,  75, 100,  70,  82,  10,  29,  80,  68,  71,  54], dtype=int64)

In [14]:
similar_items = []
for i in range(len(cosine_index)):
    similar_items.append([(ingredients_matrix['Title'][cosine_index[i]]), cosine_similarities[cosine_index[i]]])
del similar_items[0] # Delete first item from list as that will be the recipe being used for comparison
print('Showing 10 best recipe matches and the cosine similarity')
similar_items

Showing 10 best recipe matches and the cosine similarity


[['Barbeque Bacon Chicken Bake', 0.19105400130925004],
 ['Chicken Parmesan', 0.16237525968599492],
 ['Chicken Souvlaki with Tzatziki Sauce', 0.12151562402119515],
 ['Curry Stand Chicken Tikka Masala Sauce', 0.11593126997255343],
 ['Chicken Cacciatore in a Slow Cooker ', 0.11541166177320977],
 ['Buttered Noodles', 0.09161782268707958],
 ['Easy Chicken and Corn Chowder', 0.08960571813160662],
 ['Oven Roasted Parmesan Potatoes', 0.08724557817388662],
 ['Baked Split Chicken Breast', 0.0743766979226665],
 ['Butter-Roasted Cauliflower ', 0.0733589275492336]]
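Deleting the first item assumes the comparison recipe always sorts to the top. If another recipe had an identical ingredient string (for instance a duplicated link), it would tie at the same score and the wrong entry could be dropped. A minimal sketch that excludes the comparison index explicitly is shown below. `top_matches` is a hypothetical helper, and the toy scores and titles are made up for illustration.

```python
def top_matches(scores, titles, query_index, k=10):
    """Return the k highest-scoring (title, score) pairs, skipping the query itself."""
    candidates = [i for i in range(len(scores)) if i != query_index]
    # sorted() is stable, so tied scores keep their original order
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return [(titles[i], scores[i]) for i in ranked[:k]]

# Toy data: index 1 is the query and index 2 is an exact duplicate of it
toy_scores = [0.2, 1.0, 1.0, 0.5]
toy_titles = ['A', 'B', 'B (duplicate)', 'C']
print(top_matches(toy_scores, toy_titles, query_index=1, k=2))
# prints: [('B (duplicate)', 1.0), ('C', 0.5)]
```

With the notebook's variables, the call would presumably look like `top_matches(cosine_similarities, ingredients_matrix['Title'], recipe_comparitor)`, assuming the DataFrame keeps its default integer index.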

### Data Integrity Check

Did you account for missing values and outliers?
Is there information leakage, i.e. a variable that is actually inferred from the outcome (e.g. predicting that a user likes a movie using the fact that they've liked that movie before)?
Are some variables nonsensical or redundant (e.g. "Male" in some rows and "M" in others, or numerical values in the gender column)?

In recipe_links.csv, duplicate links are highlighted so that I notice if I add a link that is already in the file.
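The same check could also be run in code before scraping. The sketch below is one possible approach and not the method used in this notebook: `find_duplicate_links` is a hypothetical helper, the example URLs are placeholders, and treating links that differ only by surrounding whitespace or a trailing slash as the same link is an assumption.

```python
from collections import Counter

def find_duplicate_links(links):
    """Return the links that occur more than once, after light normalization."""
    # Assumption: surrounding whitespace and a trailing slash do not change the recipe
    normalized = [link.strip().rstrip('/') for link in links]
    counts = Counter(normalized)
    return sorted(link for link, n in counts.items() if n > 1)

# Placeholder links for illustration only
example_links = [
    'https://example.com/recipe/1/first-recipe',
    'https://example.com/recipe/2/second-recipe',
    'https://example.com/recipe/1/first-recipe/',
]
print(find_duplicate_links(example_links))
# prints: ['https://example.com/recipe/1/first-recipe']
```

With the notebook's data, this would be called as `find_duplicate_links(recipe_links)`. For exact matches only, pandas offers `df_recipes['Link'].duplicated()`.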

### Feature Engineering

### Standardization

### SQL Database