### TARGET SITE: Allrecipes.com

Allrecipes.com is a vibrant online culinary hub where food enthusiasts can explore, share, and discover a vast array of recipes. Founded in 1997, it has evolved into one of the most extensive and interactive platforms for home cooks and professional chefs alike. The website boasts an extensive database of user-generated recipes, covering diverse cuisines and catering to various dietary needs. Users can rate and review recipes, offering valuable feedback and fostering a community of shared culinary experiences. Enhanced by step-by-step video tutorials and personalized meal planning tools, Allrecipes.com provides an engaging and accessible way for individuals to enhance their cooking skills and enjoy a wide variety of dishes. The platform’s mobile app further extends its convenience, allowing users to access recipes and plan meals anytime, anywhere.

### OBJECTIVE

The objective of this project is to extract recipes from allrecipes.com. For each recipe, we extract:

1. Recipe Name
2. Servings
3. Ingredients
4. Steps

The output is a pandas DataFrame with one row per recipe. The code below also records each recipe's prep time.

### IMPORT NECESSARY LIBRARIES

In [140]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing the time library to introduce delays in the script, if needed
import time

# Importing BeautifulSoup from the bs4 library to parse HTML and XML documents
from bs4 import BeautifulSoup

# Importing the requests library to make HTTP requests
import requests

# Importing the random library to generate random values (e.g., for selecting random user agents)
import random

### SCRAPING STEPS

Step 1: Scrape Recipe Group Links
First, we extract the links to the recipe groups from the main recipes page.

Step 2: Scrape Sub-Group Links from Group Pages
Next, we visit each group page and collect the links to the narrower recipe groups listed on it.

Step 3: Scrape Individual Recipe Links
Then, we visit each sub-group page to get the links to individual recipes.

Step 4: Scrape Recipe Details
Finally, we scrape the details from each individual recipe page and store them in a pandas DataFrame.

### SET INSTANCES

In [32]:
# Send a GET request to the recipes page, using a Mozilla User-Agent to mimic a web browser
source = requests.get("https://www.allrecipes.com/recipes/", headers={'User-Agent': 'Mozilla/5.0'})
source.raise_for_status()  # Raise an error if the response is not successful
    
# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(source.text, "html.parser")

### WEBPAGE STRUCTURE

A look at the site suggests the following structure:

1. Recipes page that contains links to other webpages for groups of recipes

2. The group webpages contain different recipes

3. Each recipe page contains Name, Rating, Reviews, Description, Prep time, Cook time, Total time, servings, Ingredients, and Directions


On the primary page for our project, "https://www.allrecipes.com/recipes/", each 'li' tag contains a link to a recipe group. Following that link leads to a page that lists narrower groups and recipes.

The links look like this: "https://www.allrecipes.com/recipes/1642/everyday-cooking", and a closer look reveals that they all contain the keyword "recipes". We can use this to tell recipe-group links apart from other links on the page.
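
The keyword filter described above could be sketched as follows. This is a minimal illustration, not the method used in the cells below (which select links by CSS class). The helper name `is_group_link` and the assumed URL shape `/recipes/<numeric id>/<slug>` are inferred from the example link and are not confirmed by the site.

```python
import re
from urllib.parse import urlparse

# Assumed pattern (inferred from the example link): /recipes/<numeric id>/<slug>...
GROUP_PATH = re.compile(r"^/recipes/\d+/[\w\-/]+$")

def is_group_link(url):
    """Return True if the URL looks like an Allrecipes recipe-group page."""
    parsed = urlparse(url)
    return parsed.netloc.endswith("allrecipes.com") and bool(GROUP_PATH.match(parsed.path))

candidates = [
    "https://www.allrecipes.com/recipes/1642/everyday-cooking",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    "https://www.allrecipes.com/recipe/23891/grilled-cheese-sandwich/",  # single recipe, not a group
    "https://www.allrecipes.com/about-us",  # hypothetical non-recipe URL for illustration
]
group_links = [url for url in candidates if is_group_link(url)]
print(group_links)
```

Matching on the URL path rather than on a CSS class keeps the filter independent of the page's styling, at the cost of relying on a URL convention that may change.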

### STEP 01: GET THE LINKS TO ALL OF THE RECIPE GROUPS

In [42]:
# Empty list to hold the links
links_list = []

# Find all anchor tags that link to recipe groups
group_links = soup.find_all('a', class_='mntl-taxonomy-nodes__link mntl-text-link type--squirrel-link', href=True)

# Append the links to the list created above
for link in group_links:
    links_list.append(link['href'])
    
# Print the list to see the links
print(links_list)

['https://www.allrecipes.com/recipes/17562/dinner/', 'https://www.allrecipes.com/recipes/1116/fruits-and-vegetables/', 'https://www.allrecipes.com/recipes/156/bread/', 'https://www.allrecipes.com/recipes/1642/everyday-cooking/', 'https://www.allrecipes.com/recipes/17561/lunch/', 'https://www.allrecipes.com/recipes/17567/ingredients/', 'https://www.allrecipes.com/recipes/236/us-recipes/', 'https://www.allrecipes.com/recipes/76/appetizers-and-snacks/', 'https://www.allrecipes.com/recipes/77/drinks/', 'https://www.allrecipes.com/recipes/78/breakfast-and-brunch/', 'https://www.allrecipes.com/recipes/79/desserts/', 'https://www.allrecipes.com/recipes/80/main-dish/', 'https://www.allrecipes.com/recipes/81/side-dish/', 'https://www.allrecipes.com/recipes/82/trusted-brands-recipes-and-tips/', 'https://www.allrecipes.com/recipes/84/healthy-recipes/', 'https://www.allrecipes.com/recipes/85/holidays-and-events/', 'https://www.allrecipes.com/recipes/86/world-cuisine/', 'https://www.allrecipes.com/

### STEP 02: FUNCTION TO EXTRACT GROUPS OF RECIPES STORED WITHIN THE PREVIOUS GROUPS

In [80]:
# List of user agents to rotate
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
]

# Global list to hold the links
individual_links_list = []

# Function to extract sub-group links from each group page
def extract_recipe_links(links):
    global individual_links_list  # Declare the variable as global to modify it within the function
    
    for link in links:
        # Randomly select a User-Agent header from the list
        headers = {'User-Agent': random.choice(user_agents)}

        try:
            # Send a request with rotating user-agent
            response = requests.get(link, headers=headers)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Get the section containing all of the links
            individual_links = soup.find_all('a', class_='mntl-taxonomy-nodes__link mntl-text-link type--squirrel-link', href=True)

            # Append the links to the global list
            for ind_link in individual_links:
                individual_links_list.append(ind_link['href'])

        except Exception as e:
            print(f"Error occurred while processing {link}: {e}")
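
The `time` library imported earlier is not used in the function above, so requests are sent back to back. A minimal sketch of how a randomized pause could be added between requests is shown below. The names `fetch_all` and `fetch` and the delay bounds are illustrative and not part of the original notebook. The fetcher is passed in as a parameter so the loop can be tried without touching the network.

```python
import random
import time

def fetch_all(links, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(link) for each link, pausing a random interval between calls.

    Returns a dict mapping each link to its fetched result; links whose
    fetch raises an exception are reported and skipped.
    """
    results = {}
    for i, link in enumerate(links):
        if i > 0:
            # Pause before every request except the first one
            sleep(random.uniform(min_delay, max_delay))
        try:
            results[link] = fetch(link)
        except Exception as e:
            print(f"Error occurred while processing {link}: {e}")
    return results

# Stand-in fetcher for demonstration; in the notebook this could wrap requests.get
def fake_fetch(link):
    if "broken" in link:
        raise ValueError("simulated failure")
    return f"<html>{link}</html>"

pages = fetch_all(["page-a", "broken-page", "page-b"], fake_fetch, min_delay=0, max_delay=0)
print(list(pages))
```

The delay bounds here are placeholders. Suitable values depend on the site's terms and on how many pages are requested.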

In [81]:
# Extract and print the groups of recipes
extract_recipe_links(links_list)
print(individual_links_list)

['https://www.allrecipes.com/recipes/15054/everyday-cooking/cooking-for-one/quick-and-easy/', 'https://www.allrecipes.com/recipes/476/everyday-cooking/cooking-for-two/', 'https://www.allrecipes.com/recipes/22992/everyday-cooking/sheet-pan-dinners/', 'https://www.allrecipes.com/recipes/17253/everyday-cooking/slow-cooker/main-dishes/', 'https://www.allrecipes.com/recipes/265/everyday-cooking/vegetarian/main-dishes/', 'https://www.allrecipes.com/recipes/1320/healthy-recipes/main-dishes/', 'https://www.allrecipes.com/recipes/80/main-dish/', 'https://www.allrecipes.com/recipes/256/main-dish/meatloaf/', 'https://www.allrecipes.com/recipes/17245/main-dish/pasta/', 'https://www.allrecipes.com/recipes/674/main-dish/pork/pork-chops/', 'https://www.allrecipes.com/recipes/260/main-dish/salads/', 'https://www.allrecipes.com/recipes/475/meat-and-poultry/beef/steaks/', 'https://www.allrecipes.com/recipes/664/meat-and-poultry/chicken/baked-and-roasted/', 'https://www.allrecipes.com/recipes/81/side-dis
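
Because several group pages can link to the same sub-group, the collected list may contain repeats. For example, `https://www.allrecipes.com/recipes/80/main-dish/` appears both in the Step 01 output and in the list above. One way to drop repeats while keeping the original order is sketched here with a hypothetical helper, `dedupe_links`:

```python
def dedupe_links(links):
    """Remove duplicate URLs while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(links))

sample = [
    "https://www.allrecipes.com/recipes/80/main-dish/",
    "https://www.allrecipes.com/recipes/17245/main-dish/pasta/",
    "https://www.allrecipes.com/recipes/80/main-dish/",
]
unique_links = dedupe_links(sample)
print(unique_links)
```

Deduplicating before the next step avoids requesting the same page more than once.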

### STEP 03: FUNCTION FOR GETTING INDIVIDUAL RECIPE LINKS

In [83]:
# Global list to store extracted links
extracted_links = []

def extract(param):
    global extracted_links  # Declare the global variable

    if param is None:
        print("Error: The input to 'extract' is None. Please provide a valid iterable.")
        return []  # Handle None input gracefully
    
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        # ... other user agents
    ]

    for link in param:
        headers = {'User-Agent': random.choice(user_agents)}

        try:
            response = requests.get(link, headers=headers)
            response.raise_for_status()  # Raise an exception for bad responses (4xx and 5xx)
            soup = BeautifulSoup(response.text, 'html.parser')

            ir_links = soup.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card--image-top card card--no-image', href=True)

            # Extract and append links to the global list
            for irs_link in ir_links:
                extracted_links.append(irs_link['href']) 

        except requests.exceptions.RequestException as e:
            print(f"Error fetching {link}: {e}")
        except Exception as e:  # Catch any other unexpected errors
            print(f"Unexpected error while processing {link}: {e}")

    return extracted_links  # Return the list of extracted links
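
One thing to be aware of in the function above: when `class_` is given a string containing several space-separated classes, BeautifulSoup compares it against the exact value of the `class` attribute, so cards whose classes differ slightly or appear in another order are skipped. A possible alternative, shown on a made-up HTML snippet, is to match on a single class with a CSS selector. The choice of `mntl-document-card` as the identifying class is an assumption and should be checked against the actual page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page markup may differ
html = """
<a class="comp mntl-card-list-items mntl-document-card mntl-card card--image-top card card--no-image"
   href="https://www.allrecipes.com/recipe/23891/grilled-cheese-sandwich/">A</a>
<a class="comp mntl-card-list-items mntl-document-card mntl-card card"
   href="https://www.allrecipes.com/recipe/21306/fish-in-foil/">B</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the first anchor has precisely this class attribute
exact = soup.find_all(
    "a",
    class_="comp mntl-card-list-items mntl-document-card mntl-card card--image-top card card--no-image",
    href=True,
)

# CSS selector on one class: matches every anchor that carries it
by_single_class = soup.select("a.mntl-document-card[href]")

print(len(exact), len(by_single_class))
```

The single-class selector is more tolerant of markup variations, though it may also pick up cards that are not recipes.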

In [84]:
# Extract and print individual recipe links
extract(individual_links_list)
print(extracted_links)

['https://www.allrecipes.com/recipe/23891/grilled-cheese-sandwich/', 'https://www.allrecipes.com/recipe/160099/seared-ahi-tuna-steaks/', 'https://www.allrecipes.com/recipe/21306/fish-in-foil/', 'https://www.allrecipes.com/air-fryer-everything-bagel-salmon-dinner-recipe-7509518', 'https://www.allrecipes.com/recipe/8541313/pepperoni-pizza-dip-with-cream-cheese/', 'https://www.allrecipes.com/recipe/8525243/southeast-asian-style-chicken-rice/', 'https://www.allrecipes.com/sheet-pan-black-pepper-tofu-and-broccoli-recipe-8649317', 'https://www.allrecipes.com/sheet-pan-quesadillas-recipe-8642141', 'https://www.allrecipes.com/easy-sheet-pan-roasted-greek-salmon-and-broccoli-recipe-8622034', 'https://www.allrecipes.com/slow-cooker-honey-garlic-chicken-noodles-recipe-8629517', 'https://www.allrecipes.com/crispy-slow-cooker-carnitas-recipe-8628927', 'https://www.allrecipes.com/slow-cooker-chicken-thighs-green-beans-and-potatoes-recipe-8620224', 'https://www.allrecipes.com/sheet-pan-black-pepper-t
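
The list above mixes two URL shapes, `/recipe/<id>/<slug>/` and `/<slug>-recipe-<id>`, and it can also contain pages that are not single recipes. A rough heuristic for keeping only URLs that look like recipes is sketched below. The function name `looks_like_recipe` and both patterns are assumptions inferred from the sample output, not a documented Allrecipes URL scheme.

```python
import re
from urllib.parse import urlparse

OLD_STYLE = re.compile(r"^/recipe/\d+/[^/]+/?$")  # e.g. /recipe/23891/grilled-cheese-sandwich/
NEW_STYLE = re.compile(r"^/[^/]+-recipe-\d+/?$")  # e.g. /sheet-pan-quesadillas-recipe-8642141

def looks_like_recipe(url):
    """Heuristically decide whether a URL points at a single recipe page."""
    path = urlparse(url).path
    return bool(OLD_STYLE.match(path) or NEW_STYLE.match(path))

urls = [
    "https://www.allrecipes.com/recipe/23891/grilled-cheese-sandwich/",
    "https://www.allrecipes.com/sheet-pan-quesadillas-recipe-8642141",
    "https://www.allrecipes.com/the-best-chopped-salads-to-make-for-dinner-7550953",
]
recipe_urls = [u for u in urls if looks_like_recipe(u)]
print(recipe_urls)
```

A URL-based filter like this is only a first pass. A page that passes it can still lack the expected elements, so the parsing step should handle missing data as well.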

### STEP 04: EXTRACT THE INFORMATION WE NEED AND STORE THEM IN A PANDAS DATAFRAME

In [132]:
# List of user agents to rotate (a different set from the earlier steps)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    # Add more user agents if needed
]

# Create an empty DataFrame with appropriate columns
df = pd.DataFrame(columns=["Recipe", "Prep Time", "Servings", "Ingredients", "Directions"])

# Function to extract the recipe details from each individual recipe page
def extract_recipes_final(links):
    global df  # Declare the DataFrame as global

    for link in links:
        headers = {'User-Agent': random.choice(user_agents)}

        try:
            response = requests.get(link, headers=headers)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            
            recipe = soup.find('h1', class_ = 'article-heading type--lion').text
            
            prep_time = soup.find('div', class_ = 'mntl-recipe-details__value').text
            
            servings_div = soup.find('div', string="Servings:")
            servings = servings_div.find_next_sibling('div', class_='mntl-recipe-details__value').get_text(strip=True) if servings_div else "Not found"

            ingredients = soup.find('ul', class_='mntl-structured-ingredients__list').text
            ingredients_list = [ingredient.strip() for ingredient in ingredients.split('\n') if ingredient.strip()]
            
            directions = soup.find('div', class_ = 'comp recipe__steps-content mntl-sc-page mntl-block', id = 'recipe__steps-content_1-0').text.replace('\n', '').split('.')[0:2]


            # Create a new row as a dictionary and append it to the DataFrame
            new_row = pd.DataFrame([{
                "Recipe": recipe,
                "Prep Time": prep_time,
                "Servings": servings,
                "Ingredients": ingredients_list,
                "Directions": directions
            }])
            df = pd.concat([df, new_row], ignore_index=True)

        except Exception as e:
            print(f"Error occurred while processing {link}: {e}")
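
Calling `pd.concat` inside the loop copies the whole DataFrame on every iteration. A common alternative is to collect each row as a dictionary and build the DataFrame once at the end. The sketch below uses two hand-written, abbreviated rows in place of scraped data, and the name `df_sketch` is illustrative.

```python
import pandas as pd

columns = ["Recipe", "Prep Time", "Servings", "Ingredients", "Directions"]

# Hand-written, abbreviated rows standing in for scraped results
rows = [
    {"Recipe": "Grilled Cheese Sandwich", "Prep Time": "5 mins", "Servings": "2",
     "Ingredients": ["4 slices white bread"],
     "Directions": ["Preheat a nonstick skillet over medium heat"]},
    {"Recipe": "Seared Ahi Tuna Steaks", "Prep Time": "5 mins", "Servings": "2",
     "Ingredients": ["2 (5 ounce) ahi tuna steaks"],
     "Directions": ["Pat tuna steaks dry"]},
]

# Build the DataFrame once, after the loop has finished collecting rows
df_sketch = pd.DataFrame(rows, columns=columns)
print(df_sketch.shape)
```

In the scraping function, this would mean appending a dictionary to `rows` inside the `try` block and constructing the DataFrame after the loop.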

In [133]:
# Extraction
extract_recipes_final(extracted_links)

Error occurred while processing https://www.allrecipes.com/the-best-chopped-salads-to-make-for-dinner-7550953: 'NoneType' object has no attribute 'text'
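
The error above arises because `soup.find(...)` returns `None` when a page lacks the expected element, and `None` has no `.text` attribute. The URL in question appears to be a roundup article rather than a single recipe. A small guard such as the hypothetical `text_or_default` helper below would let the loop record a placeholder instead of raising. It is a sketch and is not used in the cells above.

```python
def text_or_default(node, default="Not found"):
    """Return the stripped text of a BeautifulSoup node, or a default if the node is None."""
    if node is None:
        return default
    return node.get_text(strip=True)

# Tiny stand-in for a BeautifulSoup Tag so the example runs without a network request
class FakeTag:
    def __init__(self, text):
        self._text = text

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text

print(text_or_default(FakeTag("  Grilled Cheese Sandwich ")))
print(text_or_default(None))

# In the notebook, usage might look like:
# recipe = text_or_default(soup.find('h1', class_='article-heading type--lion'))
```

Whether to keep a row with placeholders or skip the page entirely is a design choice. For non-recipe pages, skipping is probably the cleaner option.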


### PRINT DATAFRAME GENERATED

In [137]:
df

| | Recipe | Prep Time | Servings | Ingredients | Directions |
|---|---|---|---|---|---|
| 0 | Grilled Cheese Sandwich | 5 mins | 2 | [4 slices white bread, 3 tablespoons butter, d... | [ Preheat a nonstick skillet over medium heat,... |
| 1 | Seared Ahi Tuna Steaks | 5 mins | 2 | [2 (5 ounce) ahi tuna steaks, 1 teaspoon koshe... | [ Pat tuna steaks dry and season on both sides... |
| 2 | Fish in Foil | 10 mins | 2 | [2 rainbow trout fillets, 1 tablespoon olive ... | [ Preheat the oven to 400 degrees F (200 degre... |
| 3 | Air Fryer Everything-Bagel Salmon Dinner | 15 mins | 2 | [cooking spray, 1 large (8 ounces) sweet potat... | [ Preheat air fryer to 400 degrees F (200 degr... |
| 4 | Pepperoni Pizza Dip with Cream Cheese | 15 mins | 2 | [2 tablespoons freshly grated Parmesan cheese,... | [ Preheat the oven to 350 degrees F (175 degre... |
| 5 | Southeast Asian Style Chicken Rice | 20 mins | 2 | [1 large chicken breast, skin on, 1 teaspoon k... | [ Set chicken breast onto a cutting board and ... |
| 6 | Sheet Pan Black Pepper Tofu and Broccoli | 10 mins | 4 | [1 (16 ounce) package extra-firm tofu, 1 head ... | [ Slice tofu into thirds lengthwise so you hav... |
| 7 | Sheet Pan Quesadillas | 10 mins | 8 | [1 tablespoon olive oil, 1 onion, chopped, 1 ... | [ Preheat the oven to 375 degrees F (190 degre... |
| 8 | Easy Sheet Pan Roasted Greek Salmon and Broccoli | 10 mins | 4 | [1 bunch broccoli, cut into large florets, 2 t... | [ Preheat the oven to 450 degrees F (230 degre... |
| 9 | Slow Cooker Honey Garlic Chicken Noodles | 10 mins | 6 (serving size: about 1 1/4 cups) | [1/3 cup honey, 1/4 cup lower-sodium soy sauce... | [ Gather all ingredients, Dotdash Meredith ... |
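
The `Servings` and `Prep Time` columns hold free text, for example `6 (serving size: about 1 1/4 cups)` and `5 mins`. If numeric values are needed for analysis, something like the following could be applied. The helper names are illustrative, and the `hr`/`min` unit format (including the `1 hr 10 mins` example) is an assumption. Other formats would need extra handling.

```python
import re

def parse_servings(text):
    """Return the leading integer of a servings string, or None if there is none."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None

def parse_minutes(text):
    """Convert strings such as '5 mins' or '1 hr 10 mins' to total minutes (None if unparseable)."""
    hours = re.search(r"(\d+)\s*hrs?", text)
    minutes = re.search(r"(\d+)\s*mins?", text)
    if not hours and not minutes:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

print(parse_servings("6 (serving size: about 1 1/4 cups)"))
print(parse_minutes("5 mins"))
print(parse_minutes("1 hr 10 mins"))  # assumed format, not seen in the output above
```

With a DataFrame, these could be applied column-wise, for instance via `df["Servings"].map(parse_servings)`.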
