# Top Ingredients Used by Tori Avey

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering Data</a></li>
<li><a href="#clean">Cleaning Data</a></li>
<li><a href="#analysis">Analysis</a></li>
</ul>

<a id='intro'></a>
## Introduction
    
In this project I used BeautifulSoup to scrape recipes from my favorite recipe blog; 168 of them remained after cleaning. I cleaned the data in three steps: first, removing rows for highly specific preparations, like soup stock or different ways to prepare eggplant, that are not full dishes; second, checking for entries that are objects rather than food items; and finally, stripping unnecessary text from the ingredient names (for example, dropping the 'to taste' in 'salt, to taste'). At the end I did some analysis to find the distinct ingredients, the number of times each ingredient is listed, and the proportion of recipes in which each ingredient appears (counting an ingredient only once per recipe, even when it is listed more than once, which it turns out happens often!).

The objectives here are as follows:  

<ul>
    <li><b>scraping data</b>: scrape at least 100 recipes</li>
    <li><b>cleaning data</b>: clean data for further calculation</li>
    <li><b>calculating</b>: assess which are the ten most commonly used ingredients</li>
</ul>

For a recipe blog I have chosen https://toriavey.com/. I highly recommend her pretzel challah: https://toriavey.com/toris-kitchen/pretzel-challah/

In [606]:
import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
import re
import numpy as np

<a id='gather'></a>
## Gathering Data

<div class="alert alert-block alert-info">
In this section I will first set up some functions to scrape data from Tori Avey's website. Because scraping is very specific to the layout of her blog posts, I am writing helper functions that could be partially reused, but the function that calls them is specific to this site and makes no attempt at abstraction. After that, I will run the functions, check that the data is at least partially correct, and save it to a CSV.
</div>

Set up wrangling functions:

In [607]:
def get_recipe_genres():
    '''No input because this is specific to the layout of this particular website, returns list of strings
       that signify the urls of each genre listed on her site'''
    # Get the homepage with all available recipe type links and make a soup
    url = 'https://toriavey.com/recipes/'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # find the url for each recipe category
    recipe_type_urls = []
    recipeList = soup.find("div", class_="recipes-index")
    for catBox in recipeList.find_all("div", class_="cat-box"):
        recipe_type_urls.append(catBox.a.get('href'))
    return recipe_type_urls

In [608]:
def get_recipe_urls(recipe_type_urls):
    '''Input is a list of url strings, output is a list of url strings that are enumerated on the recipe genre page'''
    recipe_urls = []
    for genre_url in recipe_type_urls:
        recipe_genre_page = requests.get(genre_url)
        genre_soup = BeautifulSoup(recipe_genre_page.content, 'html.parser')
        main_content = genre_soup.find("main")
        #some of these might be ads and not have the same layout
        if not main_content:
            continue
        for recipe in main_content.find_all("article"):
            recipe_urls.append(recipe.a.get('href'))
    return recipe_urls
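
Because a recipe can be filed under more than one category, the list returned by `get_recipe_urls` may contain the same URL several times. The `describe()` output in the Analysis section, where a single URL accounts for 165 rows, may be a sign of this. Below is a minimal sketch of order-preserving de-duplication. The helper name `unique_urls` is hypothetical and not part of the original notebook.

```python
def unique_urls(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))

# Two URLs from this write-up, one of them repeated on purpose
sample_urls = [
    'https://toriavey.com/toris-kitchen/pretzel-challah/',
    'https://toriavey.com/toris-kitchen/aunt-bevs-vegetarian-chopped-liver/',
    'https://toriavey.com/toris-kitchen/pretzel-challah/',
]
deduped = unique_urls(sample_urls)
print(len(sample_urls), len(deduped))
```

Calling something like this on the URL list before `get_recipe_info` would also avoid requesting the same page more than once.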

In [609]:
def get_recipe_info(recipe_urls):
    '''Input is a list of url strings, output is a nested list indicating url, title, and ingredient for each element'''
    csv_data = []
    for recipe_url in recipe_urls:
        recipe_page = requests.get(recipe_url)
        recipe_soup = BeautifulSoup(recipe_page.content, 'html.parser')
        title = recipe_soup.find("h2", class_="wprm-recipe-name")
        #some of these might be ads and not have the same layout
        if not title:
            continue
        title = title.text
        ingredients = recipe_soup.find_all("ul", class_="wprm-recipe-ingredients")
        for ingredient_list in ingredients:
            for ingredient in ingredient_list.find_all("li"):
                ingredient_element = ingredient.find("span", class_="wprm-recipe-ingredient-name")
                # skip list items with no ingredient-name span or an empty one
                if ingredient_element is None or not ingredient_element.text:
                    continue
                ingredient_name = ingredient_element.text.strip()
                csv_line = [recipe_url]
                csv_line.append(title)
                csv_line.append(ingredient_name)
                csv_data.append(csv_line)
    return csv_data

In [610]:
def get_toris_recipes():
    recipe_genre_urls = get_recipe_genres()
    recipe_urls = get_recipe_urls(recipe_genre_urls)
    csv_data = get_recipe_info(recipe_urls)
    return csv_data

Scrape data and check the first line

In [611]:
recipe_data = get_toris_recipes()
print(len(recipe_data))
print(recipe_data[0])

5283
['https://toriavey.com/toris-kitchen/aunt-bevs-vegetarian-chopped-liver/', "Aunt Bev's Vegetarian Chopped Liver", 'vegetable oil']


In [612]:
def create_csv(filename, header, csv_data):
    '''Takes in a string for a filename, a list for a header, and a list of lists for csv data
       No returns, a file is created with the listed filename'''
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(filename, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(csv_data)

In [613]:
csv_header = ['url', 'name', 'ingredient']
create_csv('rawData.csv', csv_header, recipe_data)

In [614]:
rawData_pd = pd.read_csv('rawData.csv')
rawData_pd.head(7)

|  | url | name | ingredient |
| --- | --- | --- | --- |
| 0 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | vegetable oil |
| 1 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | onion, |
| 2 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | chopped walnuts, |
| 3 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | peeled hard boiled eggs, |
| 4 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | peas, |
| 5 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | Salt and pepper to taste |
| 6 | https://toriavey.com/toris-kitchen/panko-corn-... | Panko Corn and Pepper Schnitzel | ears corn, shucked, |


<a id='clean'></a>
## Cleaning Data

<div class="alert alert-block alert-info">
In this section I did some preliminary exploration to come up with the table of issues listed below, and dealt with them one by one.<br/><br/>
For the names column, I realized that the items starting with "How to" looked more like basic ingredients than full recipes, so I deleted the rows for those items.
    <br/><br/>The ingredients column needed further investigation. I decided to keep both options when an entry says something like "bacardi or havana rum", because it could be interesting to see how often the same substitution pairs come up. I also saw that a line starting with a capital 'A' is always an object, while a lowercase 'a' is used before food items (like 'a vanilla bean' vs 'A saucepan'), so I removed all the lines that start with a capital A. I also deleted anything that comes after a comma or an opening parenthesis, because what follows is usually not needed to identify the ingredient.
<br/><br/>

I stored all of the new information in a clean dataframe called clean_df and saved it to a CSV.
</div>

| Urls | Name | Ingredient |
| --- | --- | --- |
| --- | Some recipes start with "How to" instead of the recipe name | There is sometimes extra information after a ',' or '(' |
| --- | In one "how to" there is also a '-' with more information following | I see 'A saucepan' on line 3389 |
| --- | --- | Sometimes see "to taste" or "for frying" or "for garnish", etc. |
| --- | --- | Line 5271 has a 'bacardi or havana rum' maybe we can look at items with 'or' to assess more |
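
The last row of the table suggests looking at entries that contain 'or'. Below is a minimal sketch of how such substitution pairs could be collected, using only the standard library. The helper `split_alternatives` is a hypothetical name, and the sample list is illustrative (the first three strings appear in the notebook output, 'oregano' is added to show that an "or" inside a word is not matched).

```python
from collections import Counter

# Illustrative ingredient strings
ingredients = [
    'bacardi or havana rum',
    'aged balsamic vinegar or fresh lemon juice',
    'vegetable oil',
    'oregano',  # contains "or" inside a word and must not match
]

def split_alternatives(ingredient):
    """Return the alternatives joined by ' or ' as a tuple, or None if there are none."""
    parts = [part.strip() for part in ingredient.split(' or ')]
    return tuple(parts) if len(parts) > 1 else None

# Count how often each group of alternatives occurs
pairs = Counter(
    alt for alt in map(split_alternatives, ingredients) if alt is not None
)
print(pairs.most_common())
```

Splitting on `' or '` with the surrounding spaces is what keeps words such as 'oregano' or 'orange' from being treated as alternatives.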

In [615]:
rawData_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5283 entries, 0 to 5282
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   url         5283 non-null   object
 1   name        5283 non-null   object
 2   ingredient  5283 non-null   object
dtypes: object(3)
memory usage: 123.9+ KB


In [616]:
len(rawData_pd["name"].unique())

176

In [617]:
rawData_pd.sample(50)

|  | url | name | ingredient |
| --- | --- | --- | --- |
| 5016 | https://toriavey.com/toris-kitchen/julia-child... | Julia Child's Easy Blender Hollandaise Sauce | unsalted butter |
| 2458 | https://toriavey.com/toris-kitchen/holiday-bri... | Holiday Brisket | peeled whole garlic cloves |
| 2608 | https://toriavey.com/toris-kitchen/grilled-veg... | Grilled Vegetable Salad | fresh lemon juice |
| 4219 | https://toriavey.com/toris-kitchen/roasted-veg... | Vegetable Moussaka | Salt and pepper |
| 1079 | https://toriavey.com/toris-kitchen/grilled-veg... | Grilled Vegetable Salad | extra virgin olive oil |
| 2328 | https://toriavey.com/toris-kitchen/honey-garli... | Honey Garlic Chicken | extra virgin olive oil |
| 3832 | https://toriavey.com/toris-kitchen/grilled-veg... | Grilled Vegetable Salad | yellow squash (2 medium), halved lengthwise, e... |
| 645 | https://toriavey.com/toris-kitchen/matcha-gree... | Matcha Green Tea Latte | hot water |
| 529 | https://toriavey.com/toris-kitchen/apple-date-... | Apple Date Rose Tarts | hot water |
| 3612 | https://toriavey.com/toris-kitchen/kasha-varni... | Kasha Varnishkes (Kasha and Bows) | fresh parsley, |


Let's double check that these urls are in good shape

In [618]:
rawData_pd["url"][0]

'https://toriavey.com/toris-kitchen/aunt-bevs-vegetarian-chopped-liver/'

That looks good to me, so let's take a closer look at what's happening with these names. 

In [619]:
recipe_names = rawData_pd['name'].tolist()
indices_to_drop = []
for index, name in enumerate(recipe_names):
    if name[0:4] == "How ":
        print(name)
        indices_to_drop.append(index)

How to Roast Garlic - Five Easy Methods
How to Roast Garlic - Five Easy Methods
How to Make Coconut Milk & Gluten Free Coconut Flour
How to Make Coconut Milk & Gluten Free Coconut Flour
How to Make Coconut Milk & Gluten Free Coconut Flour
How to Make Coconut Milk & Gluten Free Coconut Flour
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Candied Lemon Peels
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock - 4 Easy Methods
How to Make Homemade Chicken Stock -

On second thought, these all look like ways to prepare basic ingredients rather than recipes, so I think it would be better to drop the rows with these items

In [620]:
clean_df = rawData_pd.copy()

In [621]:
clean_df = clean_df.drop(indices_to_drop)

In [622]:
print(clean_df.shape[0], rawData_pd.shape[0])

5120 5283
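
The loop-and-drop steps above can also be expressed as a boolean mask. This is a minimal sketch of that alternative, not the method the notebook uses. The small dataframe `raw` is an illustrative stand-in for `rawData_pd`, and the names `is_how_to` and `recipes_only` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for rawData_pd; the rows are illustrative
raw = pd.DataFrame({
    'name': ['How to Make Candied Lemon Peels', 'Classic Hummus', 'Holiday Brisket'],
    'ingredient': ['sugar', 'cumin', 'peeled whole garlic cloves'],
})

# True for every row whose recipe name starts with "How "
is_how_to = raw['name'].str.startswith('How ')

# Keep only the rows that are not how-to guides
recipes_only = raw[~is_how_to]
print(len(recipes_only), len(raw))
```

A mask selects rows directly, so it does not depend on the dataframe's index labels matching list positions.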


Finally let's go through all the ingredients

In [623]:
def drop_anything_after_delimiter(ingredient):
    '''Input is an ingredient string, output is the text before the first ',' or '(' '''
    split = list(filter(None, re.split(r'(\,|\()', ingredient)))
    return split[0]

In [624]:
def drop_excess_tos_and_fors(ingredient):
    split = ingredient.split(" ")
    if len(split) > 2 and (split[-2] == 'for' or split[-2] == 'to'):
        return " ".join(split[:-2])
    return ingredient
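
To see what the two cleaning steps above do to typical strings, here is a small self-contained sketch. It re-implements them under the hypothetical names `drop_after_delimiter` and `drop_trailing_to_or_for`, and adds a `.strip()` that the notebook's version does not have. The sample strings are illustrative and modeled on the scraped data.

```python
import re

def drop_after_delimiter(ingredient):
    """Keep only the text before the first comma or opening parenthesis."""
    # .strip() is an addition in this sketch; it removes the space left before '('
    return re.split(r'[,(]', ingredient, maxsplit=1)[0].strip()

def drop_trailing_to_or_for(ingredient):
    """Drop a two-word tail such as 'to taste' or 'for frying'."""
    words = ingredient.split(" ")
    if len(words) > 2 and words[-2] in ('for', 'to'):
        return " ".join(words[:-2])
    return ingredient

samples = [
    'salt, to taste',
    'Salt and pepper to taste',
    'yellow squash (2 medium), halved lengthwise',
    'vegetable oil for frying',
]
cleaned = [drop_trailing_to_or_for(drop_after_delimiter(s)) for s in samples]
print(cleaned)
```

Only a tail of exactly two words is removed, so a longer phrase beginning with 'for' or 'to' would pass through unchanged.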

In [625]:
ingredients = clean_df['ingredient'].tolist()
clean_ingredients = []
indices_to_drop = []
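# iterate over index labels (not positions), since rows were dropped above
for index, item in zip(clean_df.index, ingredients):
    # a leading capital 'A' marks an object (e.g. 'A saucepan'), not a food
    if item.split(" ")[0] == "A":
        indices_to_drop.append(index)
        continue
    no_delimiters = drop_anything_after_delimiter(item)
    final = drop_excess_tos_and_fors(no_delimiters)
    clean_ingredients.append(str(final))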

In [626]:
clean_df = clean_df.drop(indices_to_drop)

In [627]:
# overwrite the raw ingredient column with the cleaned strings
clean_df['ingredient'] = clean_ingredients

In [628]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5120 entries, 0 to 5282
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   url         5120 non-null   object
 1   name        5120 non-null   object
 2   ingredient  5120 non-null   object
dtypes: object(3)
memory usage: 289.0+ KB


In [629]:
clean_df.sample(50)

|  | url | name | ingredient |
| --- | --- | --- | --- |
| 4928 | https://toriavey.com/toris-kitchen/peach-and-b... | Peach and Blueberry Crisp | aged balsamic vinegar or fresh lemon juice |
| 5254 | https://toriavey.com/toris-kitchen/calumet-par... | Parker House Rolls | Calumet Baking Powder |
| 4725 | https://toriavey.com/toris-kitchen/fennel-appl... | Fennel Apple Salad with Tahini Dressing | olive oil |
| 3763 | https://toriavey.com/how-to/easy-blender-lemon... | Lemon Curd | eggs |
| 705 | https://toriavey.com/toris-kitchen/watermelon-... | Watermelon Rosemary Frozen Lemonade | chilled seedless watermelon chunks |
| 1072 | https://toriavey.com/toris-kitchen/middle-east... | Middle Eastern Roasted Vegetable Rice | cilantro |
| 3273 | https://toriavey.com/toris-kitchen/garlicky-ka... | Kale Caesar Salad with Parmesan and Panko | water |
| 2867 | https://toriavey.com/toris-kitchen/lemony-mari... | Lemony Marinated Chicken Skewers | turmeric |
| 4526 | https://toriavey.com/toris-kitchen/easy-caulif... | Easy Cauliflower Soup | Unsalted butter |
| 171 | https://toriavey.com/toris-kitchen/classic-hum... | Classic Hummus | cumin |


In [630]:
clean_df.to_csv('cleanData.csv', index=False)

<a id='analysis'></a>
## Analysis


<div class="alert alert-block alert-info">
For the results CSV I wanted a list of ingredients, the overall count of each, and the proportion of recipes in which each ingredient appears. <br/><br/>
The first two columns are fairly straightforward, but when I first tried to compute the proportion of recipes in which each ingredient appears, I noticed that the top ingredient, salt, appeared about 1.6 times per recipe. I did a quick scan of Tori's website, and it turns out many of her recipes are subdivided into a few smaller recipes. For example, her pretzel challah, which I recommended above, has one section for the challah itself and another for the egg wash, and, just as I suspected, salt shows up in both, so this recipe contributed salt twice to my dataframe. <br/><br/>
I rewrote the calculation to count each ingredient only once per recipe. Afterwards, salt was still the most common ingredient, but its proportion had dropped to a little over half of the recipes. The top 20 results looked reasonable to me, so I saved the top ten to the final CSV.
</div>
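
The once-per-recipe counting described above can also be sketched directly in pandas. This is an alternative to the dictionary-based loop in the notebook, not a copy of it. The dataframe `df` holds illustrative rows in which one recipe lists salt twice.

```python
import pandas as pd

# Illustrative rows: the challah lists salt twice (dough and egg wash)
df = pd.DataFrame({
    'name': ['Pretzel Challah', 'Pretzel Challah', 'Pretzel Challah', 'Classic Hummus'],
    'ingredient': ['salt', 'salt', 'eggs', 'salt'],
})

recipe_count = df['name'].nunique()

# Raw count: every row counts, so repeats inside one recipe inflate it
raw_count = df['ingredient'].value_counts()

# Per-recipe count: collapse repeated (recipe, ingredient) pairs first
per_recipe = df.drop_duplicates(['name', 'ingredient'])['ingredient'].value_counts()

# Share of recipes that contain each ingredient at least once
proportion = per_recipe / recipe_count
print(raw_count['salt'], per_recipe['salt'], proportion['salt'])
```

With these four rows the raw count for salt is 3, while only 2 recipes actually contain it, which is the same kind of inflation as the 1.6 figure mentioned above.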

In [631]:
results_df = clean_df.copy()

In [632]:
results_df.head()

|  | url | name | ingredient |
| --- | --- | --- | --- |
| 0 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | vegetable oil |
| 1 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | onion |
| 2 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | chopped walnuts |
| 3 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | peeled hard boiled eggs |
| 4 | https://toriavey.com/toris-kitchen/aunt-bevs-v... | Aunt Bev's Vegetarian Chopped Liver | peas |


In [633]:
ingredients_count = {}
for index, row in results_df.iterrows():
    if row.ingredient in ingredients_count:
        if row['name'] not in ingredients_count[row.ingredient]:
            ingredients_count[row.ingredient].append(row['name'])
    else:
        ingredients_count[row.ingredient] = [row['name']]

recipe_count = len(results_df['name'].value_counts())
recipe_count

168

In [634]:
frame = results_df.ingredient.value_counts().to_frame()

In [635]:
frame.reset_index(inplace=True)
final = frame.rename(columns = {'index': 'name'})
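
The rename above relies on `reset_index()` producing a column called `index`. As far as I know, pandas 2.0 and later label the result of `value_counts()` differently (the values are named `count` and the index takes the original column's name), so the column names may not match on newer versions. Below is a minimal sketch that names both columns explicitly. The column name `count` and the sample series are assumptions for illustration, not what the notebook uses.

```python
import pandas as pd

# Illustrative stand-in for results_df.ingredient
ingredients = pd.Series(['salt', 'salt', 'sugar'], name='ingredient')

counts = (
    ingredients.value_counts()
    .rename_axis('name')        # label the index explicitly
    .reset_index(name='count')  # name the value column explicitly
)
print(list(counts.columns))
```

Naming both columns by hand keeps the later `final['name']` lookup working regardless of how the installed pandas version labels `value_counts()` output.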


In [636]:
ingredient_proportion = []
for index, item in enumerate(final['name']):
    ingredient_proportion.append(len(ingredients_count[item])/recipe_count)

final['proportion'] = ingredient_proportion

In [637]:
final.head(20)

|  | name | ingredient | proportion |
| --- | --- | --- | --- |
| 0 | salt | 271 | 0.52381 |
| 1 | extra virgin olive oil | 181 | 0.238095 |
| 2 | sugar | 83 | 0.214286 |
| 3 | cinnamon | 77 | 0.119048 |
| 4 | fresh lemon juice | 72 | 0.107143 |
| 5 | eggs | 69 | 0.166667 |
| 6 | black pepper | 66 | 0.119048 |
| 7 | olive oil | 63 | 0.119048 |
| 8 | cayenne pepper | 61 | 0.095238 |
| 9 | unsalted butter | 60 | 0.107143 |


In [638]:
final.head(10).to_csv('results.csv', index=False)

In [639]:
results_df.describe()

|  | url | name | ingredient |
| --- | --- | --- | --- |
| count | 5120 | 5120 | 5120 |
| unique | 168 | 168 | 847 |
| top | https://toriavey.com/toris-kitchen/grilled-veg... | Grilled Vegetable Salad | salt |
| freq | 165 | 165 | 271 |


In [640]:
results_df['ingredient'].value_counts()

salt                      271
extra virgin olive oil    181
sugar                      83
cinnamon                   77
fresh lemon juice          72
                         ... 
Yukon Gold potatoes         1
Manischewitz syrup          1
chopped celery              1
yellow onion                1
vanilla bean paste          1
Name: ingredient, Length: 847, dtype: int64