# Scraping Webpages for Recipes and their Ingredients

## Importing libraries

In [7]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## HTML Data and BeautifulSoup Document Creation

In [37]:
# Load the CSV file (raw string so the backslashes in the Windows path are not read as escape sequences)
urls = pd.read_csv(r'C:\Users\praga\Desktop\Collecting Data\CD-Assignment-3\recipes-metadata.csv', delimiter='\t', encoding='utf-8')
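
A Windows path written as an ordinary string literal can fail with a unicode-escape `SyntaxError`, because `\U` starts an eight-digit escape sequence. The sketch below shows two ways around this, a raw string and forward slashes. The path is a hypothetical stand-in, not the one used in this notebook.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only for illustration
raw = r'C:\Users\example\data\recipes-metadata.csv'   # raw string: backslashes stay literal
forward = 'C:/Users/example/data/recipes-metadata.csv'  # forward slashes need no escaping

# Both spellings describe the same Windows path
same = PureWindowsPath(raw) == PureWindowsPath(forward)
filename = PureWindowsPath(raw).name

# Without the r prefix, '\r' silently becomes a carriage return
broken = 'data\recipes-metadata.csv'
has_carriage_return = '\r' in broken
```

Either spelling can be passed to `pd.read_csv`. The raw string keeps the path exactly as copied from the file explorer.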

In [38]:
urls

Unnamed: 0,Dish,Published By,Published/Updated On,Recipe Rating,URL
0,Shrimp Étouffée,Patricia S York,17-01-2020,4.7,https://www.southernliving.com/recipes/shrimp-...
1,Chicken and Sausage Gumbo,Southern Living Test Kitchen,15-09-2024,5.0,https://www.southernliving.com/recipes/chicken...


In [39]:
import requests
# Send a browser-like User-Agent header; scraping did not work without it
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}

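def scrape_recipe(url):
    # Fetch the page and return its HTML as a string
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    html_string = response.text
    return html_string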

In [40]:
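for index, row in urls.iterrows():
    try:
        urls.loc[index, 'text'] = scrape_recipe(row['URL'])
    except requests.RequestException:
        # Network or HTTP failure: record a placeholder instead of stopping the loop
        urls.loc[index, 'text'] = "URL not available"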

In [41]:
urls

Unnamed: 0,Dish,Published By,Published/Updated On,Recipe Rating,URL,text
0,Shrimp Étouffée,Patricia S York,17-01-2020,4.7,https://www.southernliving.com/recipes/shrimp-...,"<!DOCTYPE html>\n<html id=""recipeScTemplate_1-..."
1,Chicken and Sausage Gumbo,Southern Living Test Kitchen,15-09-2024,5.0,https://www.southernliving.com/recipes/chicken...,"<!DOCTYPE html>\n<html id=""recipeScTemplate_1-..."


In [51]:
#Creating BeautifulSoup Documents

def BSDOC(url): 
    html_string = scrape_recipe(url)
    document = BeautifulSoup(html_string, "html.parser")
    return document


In [53]:
SE_Doc = BSDOC("https://www.southernliving.com/recipes/shrimp-etouffee")
Gumbo_Doc = BSDOC("https://southernliving.com/recipes/chicken-and-sausage-gumbo")

## Extract Ingredients and Recipe 

After inspecting the page HTML, I identified the elements that hold the ingredients and the method steps. Because both recipes use the same markup, the functions below should work for most recipes on this website.

In [96]:
def get_ingredients(document):
    # Ingredient names sit in <span data-ingredient-name="true"> elements
    all_ingredients = document.find_all("span", attrs={"data-ingredient-name": "true"})
    ingredients = []
    for item in all_ingredients:
        item_content = item.text
        ingredients.append(item_content)
    return ingredients
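
`get_ingredients` returns names only, so quantities such as "2 cups" are lost. Here is a minimal sketch of how they could be kept, assuming the site marks quantity and unit with sibling spans. The `data-ingredient-quantity` and `data-ingredient-unit` attributes, the sample HTML and the helper `part` are all assumptions for illustration and should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the attribute used in get_ingredients;
# the quantity and unit attributes are assumptions, not confirmed for this site.
html = """
<ul>
  <li><span data-ingredient-quantity="true">1/2</span>
      <span data-ingredient-unit="true">cup</span>
      <span data-ingredient-name="true">salted butter</span></li>
  <li><span data-ingredient-quantity="true">2</span>
      <span data-ingredient-unit="true">cups</span>
      <span data-ingredient-name="true">chicken broth</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

def part(li, attr):
    # Return the stripped text of the span carrying attr="true", or "" if absent
    tag = li.find("span", attrs={attr: "true"})
    return tag.get_text(strip=True) if tag else ""

rows = [
    {
        "quantity": part(li, "data-ingredient-quantity"),
        "unit": part(li, "data-ingredient-unit"),
        "name": part(li, "data-ingredient-name"),
    }
    for li in soup.find_all("li")
]
```

Returning one dictionary per ingredient keeps quantity, unit and name together, and missing spans become empty strings instead of raising an error.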

In [97]:
def get_methods(document):
    # Recipe text sits in <p> elements with this exact class string; the intro paragraphs match too
    all_methods = document.find_all("p", attrs={"class": "comp mntl-sc-block mntl-sc-block-html"})
    methods = []
    for item in all_methods:
        item_content = item.text
        methods.append(item_content)
    return methods
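
Passing a multi-word string as the `class` filter makes `find_all` compare it against the exact value of the `class` attribute, so a tag with the same classes in a different order is skipped. The sketch below contrasts this with a CSS selector, which ignores class order. The sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: same three classes, two different orders, plus an unrelated paragraph
html = """
<p class="comp mntl-sc-block mntl-sc-block-html">Step one.</p>
<p class="mntl-sc-block comp mntl-sc-block-html">Step two.</p>
<p class="comp other">Not a step.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: order matters
exact = soup.find_all("p", attrs={"class": "comp mntl-sc-block mntl-sc-block-html"})
exact_text = [p.get_text(strip=True) for p in exact]

# CSS selector: requires all three classes, in any order
selected = soup.select("p.comp.mntl-sc-block.mntl-sc-block-html")
selector_text = [p.get_text(strip=True) for p in selected]
```

If the site ever reorders or adds classes on these paragraphs, the selector form would keep matching while the exact-string form would return nothing.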

In [76]:
#Testing Functions
get_ingredients(SE_Doc)

['(4 oz.) salted butter',
 '(about 1 1/2 oz.) all-purpose flour',
 'chopped yellow onion (from 1 large onion)',
 'chopped celery (from 1 large stalk)',
 'minced garlic (from 1 large garlic clove)',
 'chopped red bell pepper (from 1 medium bell pepper)',
 'chopped green bell pepper (from 1 medium bell pepper)',
 'chicken broth',
 'water',
 'medium peeled, deveined raw shrimp',
 'hot sauce',
 'kosher salt',
 'black pepper',
 'chopped scallions (from 2 scallions)',
 'chopped fresh flat-leaf parsley',
 'Hot cooked long-grain white rice']

In [99]:
get_methods(Gumbo_Doc)

[' This iconic chicken and sausage gumbo recipe represents everything we love about Louisiana cooking. With ordinary ingredients, the right seasonings, and patience, the results are extraordinary.\n',
 " The secrets to a good gumbo aren't anything fancy either, but if you take the time to do them right, your gumbo will be just as good as the ones served in New Orleans.\n",
 " The first? Make sure to brown the sausage and chicken until they both have crispy caramelization. Secondly, don't fear the roux. Brown is the color of flavor so make sure to stir your vegetable oil and flour mixture until it's reached a true chocolate hue.\xa0\n",
 ' Gumbo originated in the early 18th century in Louisiana, and is a flavorful stew made up of stock, a\xa0holy trinity\xa0of onion, bell pepper, and celery; meat or shellfish, and a thickener—typically a roux, okra, or\xa0filé powder. A vibrant combination of flavors and textures, this hearty dish is often used as a metaphor for the melting pot of cultu

In [100]:
get_methods(SE_Doc)

[" This classic Louisiana dish can be on the dinner table in just over an hour. Have all of the ingredients prepped and ready to go before you start cooking. To make the meal come together even faster, you can cook the rice ahead of time and reheat it before serving. Once dinner's done, all you'll need is some hot sauce and plenty of crusty bread for mopping up all the rich, velvety sauce. Not a fan of shrimp? Substitute crawfish tail meat.\n",
 ' Melt butter in a large Dutch oven over medium-low; whisk in flour. Cook, whisking constantly, until mixture turns golden brown, 10 to 12 minutes. Increase heat to medium, and add onion, celery, and garlic. Cook, stirring often, until soft and golden, about 15 minutes.\n',
 ' Stir in bell peppers, and cook, stirring often, 5 minutes. Stir in broth and water, and cook, stirring constantly, until mixture thickens, 7 to 10 minutes. Stir in shrimp, and cook, stirring occasionally, until shrimp turn pink, about 5 minutes. Stir in hot sauce, salt, a

## Using RegEx to clean up
When testing the functions, I found that find_all returned more than the recipe steps: introductory paragraphs, trailing '\n' characters, and non-breaking spaces ('\xa0'). The code below cleans up the output.

In [103]:
import re

In [121]:
#Save methods as variables
MakeGumbo = get_methods(Gumbo_Doc)
MakeEtoufee = get_methods(SE_Doc)


In [154]:
def clean_doc(document):
    string = "".join(document)
    cleaned_doc = string.split("\n")
    return cleaned_doc

In [169]:
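# Remove '\xa0' (non-breaking space) and split the text into sentences
def methods(document):
    new_doc = clean_doc(document)
    doc = " ".join(new_doc)
    text = re.sub('\xa0', '', doc)
    # Split after sentence-ending punctuation followed by whitespace
    instructions = re.split(r'(?<=[.!?])\s+', text)
    return instructions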
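
The cleanup can be sketched as a single helper. This is a variant under stated assumptions, not the notebook's own function: `split_sentences` is a hypothetical name, it replaces `'\xa0'` with an ordinary space so that words joined only by a non-breaking space stay separate, and it drops empty strings.

```python
import re

def split_sentences(paragraphs):
    # Turn non-breaking spaces into ordinary spaces and trim each paragraph
    text = " ".join(p.replace("\xa0", " ").strip() for p in paragraphs)
    # Collapse runs of whitespace left over from the join
    text = re.sub(r"\s+", " ", text).strip()
    # Split after '.', '!' or '?' when followed by whitespace; skip empty pieces
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

sample = [
    " Melt butter; whisk in\xa0flour. Cook 10 to 12 minutes.\n",
    " Remove from heat.\xa0\n",
]
steps = split_sentences(sample)
```

The lookbehind split also breaks after abbreviations such as "oz." when they are followed by a space, so the result still needs a visual check on real recipes.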

In [192]:
# Final cleanup: drop the leading non-recipe sentences (the first 6 for the étouffée, the first 34 for the gumbo)
Etoufee_Ingredients = get_ingredients(SE_Doc)
Etoufee_Instructions = methods(MakeEtoufee)[6:]
Gumbo_Ingredients = get_ingredients(Gumbo_Doc)
Gumbo_Instructions = methods(MakeGumbo)[34:]
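
The offsets 6 and 34 depend on how much introductory text each page carries, so they would change if the page changed. A possible alternative is to cut at the first sentence that starts with a known phrase. `drop_intro` and its `first_step_prefix` parameter are hypothetical names introduced for this sketch.

```python
def drop_intro(sentences, first_step_prefix):
    """Return the sentences from the first one starting with first_step_prefix,
    leaving out empty strings."""
    for i, sentence in enumerate(sentences):
        if sentence.startswith(first_step_prefix):
            return [s for s in sentences[i:] if s.strip()]
    # Fail loudly rather than silently returning intro text
    raise ValueError(f"no sentence starts with {first_step_prefix!r}")

sentences = [
    "This classic Louisiana dish can be on the dinner table in just over an hour.",
    "Not a fan of shrimp?",
    "Melt butter in a large Dutch oven over medium-low; whisk in flour.",
    "Remove from heat.",
    "",
]
steps = drop_intro(sentences, "Melt butter")
```

This still needs one hand-picked phrase per recipe, but the phrase documents what is being cut, and a changed page raises an error instead of shifting the slice.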


# Shrimp Étouffée

In [193]:
Etoufee_Ingredients

['(4 oz.) salted butter',
 '(about 1 1/2 oz.) all-purpose flour',
 'chopped yellow onion (from 1 large onion)',
 'chopped celery (from 1 large stalk)',
 'minced garlic (from 1 large garlic clove)',
 'chopped red bell pepper (from 1 medium bell pepper)',
 'chopped green bell pepper (from 1 medium bell pepper)',
 'chicken broth',
 'water',
 'medium peeled, deveined raw shrimp',
 'hot sauce',
 'kosher salt',
 'black pepper',
 'chopped scallions (from 2 scallions)',
 'chopped fresh flat-leaf parsley',
 'Hot cooked long-grain white rice']

In [194]:
Etoufee_Instructions

['Melt butter in a large Dutch oven over medium-low; whisk in flour.',
 'Cook, whisking constantly, until mixture turns golden brown, 10 to 12 minutes.',
 'Increase heat to medium, and add onion, celery, and garlic.',
 'Cook, stirring often, until soft and golden, about 15 minutes.',
 'Stir in bell peppers, and cook, stirring often, 5 minutes.',
 'Stir in broth and water, and cook, stirring constantly, until mixture thickens, 7 to 10 minutes.',
 'Stir in shrimp, and cook, stirring occasionally, until shrimp turn pink, about 5 minutes.',
 'Stir in hot sauce, salt, and pepper; cook 5 more minutes.',
 'Stir in scallions and parsley, and simmer 5 minutes.',
 'Remove from heat.',
 'Cover and let stand 5 minutes.',
 'Serve immediately over hot cooked rice.',
 '']

# Chicken and Sausage Gumbo

In [195]:
Gumbo_Ingredients

['andouille sausage, cut into 1/4-in.-thick slices',
 'skinned bone-in chicken breasts',
 'Vegetable oil',
 'all-purpose flour',
 'medium onion, chopped',
 'green bell pepper, chopped',
 'celery ribs, sliced',
 'hot water',
 'garlic cloves, minced',
 'bay leaves',
 'Worcestershire sauce',
 'Creole seasoning',
 'dried thyme',
 'hot sauce',
 'green onions, sliced',
 'Filé powder (optional)',
 'Hot cooked rice',
 'Garnish: chopped green onions']

In [196]:
Gumbo_Instructions

['Cook sausage in a Dutch oven over medium heat, stirring constantly, 5 minutes or until browned.',
 'Drain on paper towels, reserving drippings in Dutch oven.',
 'Set sausage aside.',
 'Cook chicken in reserved drippings in Dutch oven over medium heat 5 minutes or until browned.',
 'Remove to paper towels, reserving drippings in Dutch oven.',
 'Set chicken aside.',
 'Add enough oil to drippings in Dutch oven to measure 1/2 cup.',
 'Add flour, and cook over medium heat, stirring constantly, 20 to 25 minutes, or until roux is chocolate colored.',
 'Stir in onion, bell pepper, and celery; cook, stirring often, 8 minutes or until tender.',
 'Gradually add 2 quarts hot water, and bring mixture to a boil; add chicken, garlic, and next 5 ingredients.',
 'Reduce heat to low, and simmer, stirring occasionally, 1 hour.',
 'Remove chicken; let cool.',
 'Add sausage to gumbo; cook 30 minutes.',
 'Stir in green onions; cook for 30 more minutes.',
 'Bone chicken, and cut meat into strips; return ch

## Compiling New Dataframe and Converting to CSV file

In [197]:
#Adding Ingredients and Methods to dataframe
urls['Ingredients'] = [Etoufee_Ingredients, Gumbo_Ingredients] 
urls['Methods'] = [Etoufee_Instructions, Gumbo_Instructions] 

In [198]:
urls

Unnamed: 0,Dish,Published By,Published/Updated On,Recipe Rating,URL,text,Ingredients,Methods
0,Shrimp Étouffée,Patricia S York,17-01-2020,4.7,https://www.southernliving.com/recipes/shrimp-...,"<!DOCTYPE html>\n<html id=""recipeScTemplate_1-...","[(4 oz.) salted butter, (about 1 1/2 oz.) all-...",[Melt butter in a large Dutch oven over medium...
1,Chicken and Sausage Gumbo,Southern Living Test Kitchen,15-09-2024,5.0,https://www.southernliving.com/recipes/chicken...,"<!DOCTYPE html>\n<html id=""recipeScTemplate_1-...","[andouille sausage, cut into 1/4-in.-thick sli...",[Cook sausage in a Dutch oven over medium heat...


In [199]:
urls.to_csv('Extracted_Recipes.csv', index=False)
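
When a DataFrame cell holds a Python list, `to_csv` writes its string form, so reading the file back gives a string and not a list. One way to keep the structure is to encode each list as JSON before writing. The sketch below uses the standard-library `csv` module and an in-memory buffer so it runs on its own. With pandas, the same idea would be `urls['Ingredients'].apply(json.dumps)` before `to_csv` and `.apply(json.loads)` after `read_csv`.

```python
import csv
import io
import json

# Small stand-in for the two list-valued columns
rows = [{"Dish": "Shrimp Étouffée", "Ingredients": ["chicken broth", "water"]}]

# Write: encode each list as a JSON string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Dish", "Ingredients"])
writer.writeheader()
for row in rows:
    writer.writerow({"Dish": row["Dish"], "Ingredients": json.dumps(row["Ingredients"])})

# Read: decode the JSON string back into a list
buffer.seek(0)
restored = [
    {"Dish": r["Dish"], "Ingredients": json.loads(r["Ingredients"])}
    for r in csv.DictReader(buffer)
]
```

JSON is used here because it survives quoting and embedded commas; joining the items with a delimiter would also work if no ingredient contains that delimiter.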