# CGAS

**Question 1: Part (A)**

In this section, we first use predefined search queries to extract the URLs and titles of recipes from Allrecipes. Each query appends a search term (an ingredient or dish name, or a combination of up to three of them) to the base search URL, and every link on the results page whose path contains `/recipe/` is collected. Duplicate links are discarded, each new recipe is given a coarse cuisine label derived from the search term, and the results are written to `recipes.csv`. In the second phase, these URLs are used to retrieve the detailed recipe data, which is saved as a JSON file.
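As a minimal sketch of the query-building step, assuming the search endpoint takes a `q` query parameter as in the `base_url` used in the cell below, a search term can be URL-encoded with the standard library and each scraped `href` resolved against the site root. The names `build_search_url` and `resolve_link` and the example path `/recipe/123/example/` are illustrative and do not come from the notebook.

```python
from urllib.parse import quote_plus, urljoin

SITE_ROOT = "https://www.allrecipes.com"   # same host as the notebook's base_url
SEARCH_URL = SITE_ROOT + "/search?q="

def build_search_url(term: str) -> str:
    """URL-encode a search term: spaces become '+', unsafe characters are escaped."""
    return SEARCH_URL + quote_plus(term)

def resolve_link(href: str) -> str:
    """Return an absolute URL whether href is relative or already absolute."""
    return urljoin(SITE_ROOT, href)

print(build_search_url("Sweet Potatoes"))
print(resolve_link("/recipe/123/example/"))                             # hypothetical relative link
print(resolve_link("https://www.allrecipes.com/recipe/123/example/"))   # hypothetical absolute link
```

Resolving links with `urljoin` avoids prepending the site root to a link that is already absolute.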

In [None]:
import requests
from bs4 import BeautifulSoup
from itertools import combinations
import time
import csv

# Define base URL for search
base_url = "https://www.allrecipes.com/search?q="

# Define initial search terms
search_terms = [
    "Beef", "Mutton", "Vegetables", "Chicken", "Pork", "Fish", "Shrimp", "Lamb",
    "Turkey", "Duck", "Tofu", "Cheese", "Pasta", "Rice", "Potatoes", "Eggs", "Sausage",
    "Beans", "Lentils", "Soup", "Salad", "Bread", "Noodles", "Tomatoes", "Garlic",
    "Onions", "Carrots", "Spinach", "Broccoli", "Cauliflower", "Cabbage", "Peppers",
    "Mushrooms", "Zucchini", "Squash", "Avocado", "Apples", "Oranges", "Bananas",
    "Berries", "Grapes", "Pineapple", "Mango", "Melons", "Yams", "Sweet Potatoes",
    "Corn", "Peas", "Cucumber", "Radishes", "Herbs", "Spices", "Beefsteak", "Short Ribs",
    "Filet Mignon", "Brisket", "Pork Chops", "Sausages", "Ribs", "Salmon", "Tuna",
    "Cod", "Scallops", "Crab", "Clams", "Mussels", "Oysters", "Chicken Wings",
    "Chicken Thighs", "Chicken Breasts", "Turkey Legs", "Turkey Breast", "Duck Legs",
    "Duck Breast", "Lamb Chops", "Lamb Shanks", "Mutton Chops", "Mutton Stew",
    "Tofu Stir Fry", "Cheese Sauce", "Cheese Omelette", "Pasta Salad", "Spaghetti",
    "Lasagna", "Fried Rice", "Risotto", "Potato Salad", "Mashed Potatoes", "Baked Potatoes",
    "Vegetable Soup", "Chicken Soup", "Beef Stew", "Pork Roast", "Fish Tacos",
    "Shrimp Scampi", "Tofu Curry", "Veggie Burger", "Fruit Salad", "Smoothies", "Pancakes"
]

# Function to fetch recipe links and titles from a search page
def fetch_recipe_info(search_term):
    search_url = f"{base_url}{search_term.replace(' ', '+')}"
    # Use a timeout so a stalled connection cannot hang the crawl
    response = requests.get(search_url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')

    recipes = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if '/recipe/' in href:
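            # Links may already be absolute; only prefix the site root to relative ones
            full_url = href if href.startswith('http') else f"https://www.allrecipes.com{href}"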
            title_tag = a.find('span', class_='card__title-text')
            title = title_tag.get_text(strip=True) if title_tag else 'No Title'

            # Default to 'US' cuisine if not found
            cuisine = 'US'
            recipe_info = {
                'title': title,
                'link': full_url,
                'cuisine': cuisine
            }
            recipes.append(recipe_info)

    return recipes

# Function to determine cuisine
def determine_cuisine(term):
    term_lower = term.lower()
    if 'curry' in term_lower:
        return 'Asia'
    elif 'beef' in term_lower or 'taco' in term_lower or 'alfredo' in term_lower:
        return 'US'
    else:
        return 'World'

# Generate combinations of search terms
def generate_search_terms(terms, max_combination_size):
    all_combinations = []
    for r in range(1, max_combination_size + 1):
        for combo in combinations(terms, r):
            all_combinations.append(" ".join(combo))
    return all_combinations

# Main script
all_recipe_info = {}
unique_recipe_links = set()
total_recipes_found = 0
target_recipe_count = 10000  # Adjust as needed
max_no_new_recipes = 2
stop_flag = False  # Add a stop flag

# Generate combinations of search terms
search_combinations = generate_search_terms(search_terms, 3)  # Adjust max combination size as needed

# Iterate over search combinations and fetch recipes
for term in search_combinations:
    if stop_flag:  # Check if the stop flag is set to break out of the outer loop
        break

    no_new_recipes_count = 0
    while no_new_recipes_count < max_no_new_recipes:
        recipes = fetch_recipe_info(term)
        if not recipes:
            no_new_recipes_count += 1
            if no_new_recipes_count >= max_no_new_recipes:
                break
            time.sleep(1)  # Adjust sleep time as needed
            continue

        new_recipes_found = False
        for recipe in recipes:
            if recipe['link'] not in unique_recipe_links:
                recipe['cuisine'] = determine_cuisine(term)
                all_recipe_info[recipe['link']] = recipe
                unique_recipe_links.add(recipe['link'])
                total_recipes_found += 1
                new_recipes_found = True

        # Print progress
        print(f"Search term: {term}")
        print(f"Total recipes found so far: {total_recipes_found}")

        if not new_recipes_found:
            no_new_recipes_count += 1
        else:
            no_new_recipes_count = 0

        # Stop if we have enough recipes
        if total_recipes_found >= target_recipe_count:
            stop_flag = True  # Set the stop flag to break the outer loop
            break

        time.sleep(1)  # Adjust sleep time as needed

# Print final results
print(f"\nFinal Total Recipes Found: {total_recipes_found}")

# Save results to CSV
csv_file = "recipes.csv"
csv_columns = ['title', 'link', 'cuisine']
try:
    with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for recipe in all_recipe_info.values():
            writer.writerow(recipe)
    print(f"Data successfully written to {csv_file}")
except IOError:
    print("I/O error while writing to CSV")


Streaming output truncated to the last 5000 lines.
Total recipes found so far: 5750
Search term: Lamb Spinach
Total recipes found so far: 5752
Search term: Lamb Spinach
Total recipes found so far: 5752
Search term: Lamb Spinach
Total recipes found so far: 5752
Search term: Lamb Broccoli
Total recipes found so far: 5753
Search term: Lamb Broccoli
Total recipes found so far: 5753
Search term: Lamb Broccoli
Total recipes found so far: 5753
Search term: Lamb Cauliflower
Total recipes found so far: 5755
Search term: Lamb Cauliflower
Total recipes found so far: 5755
Search term: Lamb Cauliflower
Total recipes found so far: 5755
Search term: Lamb Cabbage
Total recipes found so far: 5760
Search term: Lamb Cabbage
Total recipes found so far: 5760
Search term: Lamb Cabbage
Total recipes found so far: 5760
Search term: Lamb Peppers
Total recipes found so far: 5760
Search term: Lamb Peppers
Total recipes found so far: 5760
Search term: Lamb Mushrooms
Total recipes found so far: 5763
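The search loop above walks through every combination of up to three search terms. The sketch below shows one way to generate those queries lazily and to count them in advance. It assumes only the standard library; `iter_search_terms` and `count_search_terms` are illustrative names, not functions from the notebook.

```python
from itertools import combinations, islice
from math import comb

def iter_search_terms(terms, max_size):
    """Yield space-joined combinations of 1..max_size terms, one at a time."""
    for r in range(1, max_size + 1):
        for combo in combinations(terms, r):
            yield " ".join(combo)

def count_search_terms(n_terms, max_size):
    """Number of queries the generator yields for n_terms distinct terms."""
    return sum(comb(n_terms, r) for r in range(1, max_size + 1))

sample_terms = ["Beef", "Lamb", "Spinach", "Broccoli"]   # small illustrative subset
print(list(islice(iter_search_terms(sample_terms, 2), 6)))
print(count_search_terms(100, 3))
```

For a list of 100 terms and combinations of up to three, this gives 100 + 4,950 + 161,700 = 166,750 queries. A generator avoids holding all of them in memory, and the count gives a rough idea of how long a crawl with a one-second delay per request could take.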


**Question 1: Part (A)**

In this phase, we gather comprehensive data by scraping the recipes from the URLs previously extracted and stored in a CSV file. For each URL, we send an HTTP request to retrieve the HTML content of the recipe page. The HTML is then parsed to extract detailed information including the recipe title, ingredients, directions, nutritional facts, author details, and update date.

A helper function, `clean_url`, repairs malformed URLs (for example, links where the site root was prepended twice) before each request. The data extracted for each recipe is stored as a dictionary in a list, which is then saved as a single JSON file (`Question1_A_RawData.json`) for use in the later parts.
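The URL repair step can be sketched with the standard library as follows. This is a variant of the idea, not the notebook's `clean_url`: it assumes that a relative path belongs to the Allrecipes site root, and it returns `None` when no usable URL can be recovered. The function name `normalize_recipe_url` and the example paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse

SITE_ROOT = "https://www.allrecipes.com"
DOUBLED_PREFIX = SITE_ROOT + "https://"   # site root prepended to an already absolute link

def normalize_recipe_url(url: str):
    """Return a cleaned absolute URL, or None if it cannot be repaired."""
    url = url.strip()
    if url.startswith(DOUBLED_PREFIX):
        url = url[len(SITE_ROOT):]        # drop the duplicated site root
    elif url.startswith("/"):
        url = urljoin(SITE_ROOT, url)     # relative path -> absolute URL
    elif not url.startswith(("http://", "https://")):
        url = "https://" + url            # bare host/path, assume HTTPS
    parsed = urlparse(url)
    has_host = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return url if has_host else None

for raw in ["https://www.allrecipes.comhttps://www.allrecipes.com/recipe/1/x/",
            "/recipe/1/x/",
            "www.allrecipes.com/recipe/1/x/",
            ""]:
    print(repr(raw), "->", normalize_recipe_url(raw))
```

Returning `None` for unrecoverable input lets the caller skip the row instead of sending a request that is bound to fail.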

In [None]:
import csv
import requests
from bs4 import BeautifulSoup
import json

# Read the URLs from the CSV file
with open('recipes.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    urls = [row['link'] for row in reader]

# Initialize a list to store all recipes
all_recipes = []

# Function to clean and validate URLs
def clean_url(url):
    if url.startswith('https://www.allrecipes.comhttps://'):
        url = url.replace('https://www.allrecipes.comhttps://', 'https://')
    elif not url.startswith('http'):
        url = 'https://' + url.lstrip('/')
    return url

def scrape_recipe(url, recipe_id):
    # Clean the URL
    url = clean_url(url)

    try:
        # Send a request to fetch the HTML content
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the title using the given class
        title_tag = soup.find('h1', class_='article-heading type--lion')
        title = title_tag.get_text(strip=True) if title_tag else 'Title not found'

        # Extract recipe details
        details = {}
        for item in soup.select('.mm-recipes-details__item'):
            label_tag = item.select_one('.mm-recipes-details__label')
            value_tag = item.select_one('.mm-recipes-details__value')
            # Skip malformed items instead of raising AttributeError on None
            if label_tag and value_tag:
                details[label_tag.get_text(strip=True)] = value_tag.get_text(strip=True)

        # Extract ingredients
        ingredients = []
        ingredients_section = soup.find('h2', class_='mm-recipes-structured-ingredients__heading')
        if ingredients_section:
            ingredient_list = ingredients_section.find_next('ul')
            # find_next returns None when no list follows the heading
            if ingredient_list:
                for ingredient in ingredient_list.find_all('li'):
                    ingredient_text = ' '.join(ingredient.stripped_strings)
                    ingredients.append(ingredient_text)

        # Extract directions
        directions = []
        directions_section = soup.find('h2', class_='mm-recipes-steps__heading')
        if directions_section:
            steps_list = directions_section.find_next('ol')
            if steps_list:
                for step in steps_list.find_all('li'):
                    step_text = ' '.join(step.stripped_strings)
                    directions.append(step_text)

        # Extract nutritional facts
        nutrition_facts = {}
        nutrition_section = soup.find('h2', class_='mm-recipes-nutrition-facts-summary__heading')
        if nutrition_section:
            nutrition_table = nutrition_section.find_next('table')
            if nutrition_table:
                for row in nutrition_table.find_all('tr'):
                    cols = row.find_all('td')
                    if len(cols) == 2:
                        nutrient = cols[0].get_text(strip=True)
                        amount = cols[1].get_text(strip=True)
                        nutrition_facts[nutrient] = amount

        # Extract author information
        author_info = {}
        author_section = soup.find('div', class_='allrecipes-bylines')
        if author_section:
            author_name_tag = author_section.find('a', class_='mntl-attribution__item-name')
            author_name = author_name_tag.get_text(strip=True) if author_name_tag else 'Author not found'
            # .get avoids a KeyError when the tag has no href attribute
            author_link = author_name_tag.get('href') if author_name_tag else None

            author_bio_tag = author_section.find('div', class_='mntl-author-tooltip__bio')
            author_bio = author_bio_tag.get_text(strip=True) if author_bio_tag else 'Bio not available'

            author_info = {
                'name': author_name,
                'link': author_link,
                'bio': author_bio
            }

        # Extract update date (the byline section may be missing on some pages)
        update_date_tag = author_section.find('div', class_='mntl-attribution__item-date') if author_section else None
        update_date = update_date_tag.get_text(strip=True) if update_date_tag else 'Update date not available'

        # Create a dictionary with the URL, title, details, ingredients, directions, nutrition facts, author info, and update date
        data = {
            'id': recipe_id,
            'url': url,
            'title': title,
            'details': details,
            'ingredients': ingredients,
            'directions': directions,
            'nutrition_facts': nutrition_facts,
            'author_info': author_info,
            'update_date': update_date
        }

        # Add the recipe data to the list of all recipes
        all_recipes.append(data)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
k = 0
# Loop through each URL and scrape the recipe
for i, url in enumerate(urls, start=1):
    print(f'Fetching the data from {k}th file ....')
    scrape_recipe(url, i)
    k += 1

# Save all recipes to a single JSON file
with open('Question1_A_RawData.json', 'w') as file:
    json.dump(all_recipes, file, indent=4)

print(f'Data saved to Question1_A_RawData.json with {len(all_recipes)} recipes.')


Streaming output truncated to the last 5000 lines.
Fetching the data from 5003th file ....
Fetching the data from 5004th file ....
Fetching the data from 5005th file ....
Fetching the data from 5006th file ....
Fetching the data from 5007th file ....
Fetching the data from 5008th file ....
Fetching the data from 5009th file ....
Fetching the data from 5010th file ....
Fetching the data from 5011th file ....
Fetching the data from 5012th file ....
Fetching the data from 5013th file ....
Fetching the data from 5014th file ....
Fetching the data from 5015th file ....
Fetching the data from 5016th file ....
Fetching the data from 5017th file ....
Fetching the data from 5018th file ....
Fetching the data from 5019th file ....
Fetching the data from 5020th file ....
Fetching the data from 5021th file ....
Fetching the data from 5022th file ....
Fetching the data from 5023th file ....
Fetching the data from 5024th file ....
Fetching the data from 5025th file ....
Fetching the da

**We have converted the JSON data into a CSV file for easier analysis and manipulation.**

In [1]:
import pandas as pd
import json

# Load JSON data
with open('Question1_A_RawData.json') as json_file:
    data = json.load(json_file)

# Convert JSON to DataFrame
df = pd.json_normalize(data)

# Save DataFrame to CSV
df.to_csv('Question1_A_RawData.csv', index=False)
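To illustrate what the conversion does to nested fields, here is a pure-Python sketch that roughly mimics how `pd.json_normalize` flattens nested dictionaries into dotted column names. It is not pandas' implementation, and the `recipe` record is a made-up example shaped like the scraper's output.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys; lists stay as values."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

recipe = {   # hypothetical record, not real scraped data
    "id": 1,
    "title": "Example Recipe",
    "ingredients": ["1 cup rice", "2 eggs"],
    "author_info": {"name": "Jane Doe", "link": None},
}
print(flatten(recipe))
```

List-valued fields such as `ingredients` are not expanded, so in the CSV each of them ends up as one cell holding the string form of a Python list. Reading the CSV back therefore requires parsing those cells, which is one reason the later steps read the JSON file instead.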




**Install dependency**



In [3]:
pip install ijson


Collecting ijson
  Downloading ijson-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Downloading ijson-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Installing collected packages: ijson
Successfully installed ijson-3.3.0


**Question 1: Part (B)**

In this script, we extract ingredient names from the recipes in the JSON file using SpaCy. The pipeline's Named Entity Recognition (NER) output is printed for inspection, but the ingredient names themselves are selected with part-of-speech (POS) tags. Here’s a detailed breakdown:

1. **Load SpaCy Model**: We start by loading SpaCy’s small English pipeline (`en_core_web_sm`), which provides tokenization, POS tagging, and NER.

2. **Read JSON File**: We use `ijson` to stream the recipes from `Question1_A_RawData.json` one at a time instead of loading the whole file into memory. The file is opened with `errors='replace'` so that undecodable bytes do not abort the run.

3. **Process Ingredients**: For each recipe, we run every ingredient phrase through the pipeline and print the recognized entities and their labels for debugging purposes.

4. **Filter and Extract Ingredient Names**:
   - **Tokenization**: We keep only tokens tagged as nouns or proper nouns that are not pure digits, then drop common non-ingredient words such as units and preparation terms (e.g., "cups", "chopped").
   - **Ingredient Extraction**: We assume that the last token in the filtered list often represents the ingredient name. This heuristic reduces each phrase to a single word, so the output contains generic entries such as `sauce`, `juice`, and `oil`. We collect these names along with their corresponding recipe IDs.

5. **Data Collection**: We store unique ingredients with their recipe IDs in a set to avoid duplicates.

6. **Convert to DataFrame**: We convert the collected ingredient data into a Pandas DataFrame, remove any duplicate entries, and sort by Recipe ID.

7. **Output and Save**: We print the sorted ingredient data and save the results to a CSV file (`Question1_B.csv`), providing an organized format for further analysis.

This approach streams a large dataset and produces one row per recipe and ingredient name. The POS-based heuristic is simple and approximate, and NER serves only as a debugging aid.
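The selection rule from step 4 can be isolated in a small sketch. To keep it runnable without a SpaCy model, the part-of-speech tags below are written by hand for illustration and may differ from what `en_core_web_sm` would assign. `pick_ingredient` and the shortened word list are hypothetical names and values.

```python
import re

# Shortened, illustrative version of the notebook's non-ingredient word list
NON_INGREDIENT_WORDS = {"cup", "cups", "tablespoon", "tablespoons",
                        "teaspoon", "teaspoons", "chopped", "sliced"}

def pick_ingredient(tagged_tokens):
    """Apply the 'last remaining noun' rule to (text, pos) pairs."""
    nouns = [text.lower() for text, pos in tagged_tokens
             if pos in ("NOUN", "PROPN") and not text.isdigit()]
    kept = [word for word in nouns if word not in NON_INGREDIENT_WORDS]
    cleaned = [re.sub(r"[^a-z\s]", "", word).strip() for word in kept]
    cleaned = [word for word in cleaned if word]
    return cleaned[-1] if cleaned else None

examples = {   # hand-tagged phrases, not SpaCy output
    "2 cups chopped onion": [("2", "NUM"), ("cups", "NOUN"), ("chopped", "VERB"), ("onion", "NOUN")],
    "1 tablespoon soy sauce": [("1", "NUM"), ("tablespoon", "NOUN"), ("soy", "NOUN"), ("sauce", "NOUN")],
    "cooking spray": [("cooking", "NOUN"), ("spray", "NOUN")],
}
for phrase, tokens in examples.items():
    print(f"{phrase!r} -> {pick_ingredient(tokens)}")
```

With these tags, "soy sauce" is reduced to "sauce" and "cooking spray" to "spray", which may explain entries such as `spray` and `jar` in the output of the cell below.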

In [10]:
import spacy
import ijson
import pandas as pd
import re
from collections import defaultdict

# Load SpaCy's NER model
nlp = spacy.load("en_core_web_sm")

# File path to the large JSON file
file_path = 'Question1_A_RawData.json'

# Set to store unique ingredients with their recipe ID
unique_ingredients = set()

# Define a function to clean ingredient names
def clean_ingredient_name(name):
    # Remove special characters and digits
    cleaned_name = re.sub(r'[^a-zA-Z\s]', '', name)  # Keep only letters and spaces
    return cleaned_name.strip()

# Open the JSON file, handling encoding errors
with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
    # Parse recipes array using ijson
    recipes = ijson.items(f, 'item')

    count = 0  # Number of recipes processed so far (the loop stops at 10002)

    # Loop through each recipe in the JSON file
    for recipe in recipes:
        try:
            recipe_id = recipe.get("id", None)  # Use the pre-assigned recipe ID
            ingredients = recipe.get("ingredients", [])

            # Log once per recipe rather than once per ingredient phrase
            print(f"Processing ingredients for recipe {recipe_id}")

            # Process each ingredient phrase
            for ingredient_phrase in ingredients:
                # Run the SpaCy pipeline (tokenizer, POS tagger, NER) on the phrase
                doc = nlp(ingredient_phrase)

                # Print recognized entities and their labels for debugging only
                for ent in doc.ents:
                    print(f"Entity: {ent.text}, Label: {ent.label_}")

                # Keep noun and proper-noun tokens that are not pure digits
                ingredient_parts = []
                for token in doc:
                    if token.pos_ in ['NOUN', 'PROPN'] and not token.text.isdigit():
                        ingredient_parts.append(token.text.lower())

                # Filter out common non-ingredient keywords and special characters
                non_keywords = {'cups', 'cup', 'tablespoons', 'tablespoon', 'teaspoons', 'teaspoon', 'optional',
                                'packed', 'all-purpose', 'inch', 'pie', 'shell', 'divided',
                                'smoke', 'chopped', 'roast', 'sliced', 'crushed', 'powder', 'pieces', 'taste', 'packet', 'chuck', ')', ',', '(', 'thinly'}
                filtered_ingredients = [word for word in ingredient_parts if word not in non_keywords]

                # Clean ingredient names and remove any that are empty
                cleaned_ingredients = [clean_ingredient_name(word) for word in filtered_ingredients if clean_ingredient_name(word)]

                if cleaned_ingredients:
                    # Add the last cleaned word which often is the ingredient name along with recipe ID
                    unique_ingredients.add((recipe_id, cleaned_ingredients[-1]))

            count += 1
            if count == 10002:
                break

        except Exception as e:
            print(f"Error processing recipe {recipe_id}: {e}")
            continue

# Convert the collected data to a DataFrame
df = pd.DataFrame(list(unique_ingredients), columns=["Recipe ID", "Ingredient"])

# Remove duplicates, if needed
df = df.drop_duplicates(subset=["Recipe ID", "Ingredient"])

# Sort the DataFrame by Recipe ID
df = df.sort_values(by="Recipe ID")

# Print the sorted result
print("\nRecipe ID | Ingredient Name")
for index, row in df.iterrows():
    print(f"{row['Recipe ID']} | {row['Ingredient']}")

# Save the results to a CSV file
df.to_csv('Question1_B.csv', index=False)

print("Ingredient extraction complete. Data saved to 'Question1_B.csv'.")


Streaming output truncated to the last 5000 lines.
9490 | tortillas
9490 | onion
9490 | cheese
9490 | beans
9491 | cheese
9491 | beans
9491 | onion
9491 | squash
9491 | tortillas
9491 | jar
9491 | spray
9492 | eggs
9492 | soy
9492 | ham
9492 | salt
9492 | onions
9492 | carrots
9492 | sprouts
9492 | rice
9492 | oil
9492 | pepper
9493 | oil
9493 | plantains
9493 | salt
9494 | sauce
9494 | onion
9494 | potatoes
9494 | pepper
9494 | beans
9494 | juice
9494 | mustard
9494 | garlic
9494 | vinegar
9494 | oil
9494 | basil
9495 | pepper
9495 | mustard
9495 | honey
9495 | shallot
9495 | oil
9495 | garlic
9495 | vinegar
9496 | crumbs
9496 | beef
9496 | sauce
9496 | mix
9496 | water
9497 | yeast
9497 | avocado
9497 | onion
9497 | beans
9497 | juice
9497 | salt
9497 | potatoes
9497 | mustard
9497 | slices
9498 | soy
9498 | pepper
9498 | water
9498 | tortillas
9498 | potatoes
9498 | beans
9498 | mustard
9498 | cloves
9498 | cumin
9498 | oil
9498 | cheese
9498 | onion
9498 | chili
9499 

**Question 1: Part (C)**

In this solution, we accomplish the task of storing recipe and ingredient information in the specified format:

1. **Load the Extracted Ingredients Data**:
   - **Code**: `df = pd.read_csv('Question1_B.csv')`
   - **Explanation**: We start by loading the CSV file (`Question1_B.csv`) that contains the recipe IDs and corresponding ingredient names into a Pandas DataFrame. This file was previously generated and contains the necessary data for this task.

2. **Randomly Choose 100 Unique Recipe IDs**:
   - **Code**:
     ```python
     unique_recipe_ids = df['Recipe ID'].unique()
     selected_recipe_ids = random.sample(list(unique_recipe_ids), 100)
     ```
   - **Explanation**: We extract all unique recipe IDs from the DataFrame and then use the `random.sample()` function to select 100 of them uniformly at random, without replacement. Because no seed is set, each run selects a different set of recipes.

3. **Filter the DataFrame for Selected Recipe IDs**:
   - **Code**: `filtered_df = df[df['Recipe ID'].isin(selected_recipe_ids)]`
   - **Explanation**: We filter the original DataFrame to include only the rows where the 'Recipe ID' matches one of the 100 randomly selected IDs. This gives us a subset of the data relevant to our selected recipes.

4. **Save the Recipe and Ingredient Information to a Text File**:
   - **Code**:
     ```python
     with open('Question1_C.txt', 'w') as file:
         for recipe_id in selected_recipe_ids:
             # Filter ingredients for the current Recipe ID
             ingredients = filtered_df[filtered_df['Recipe ID'] == recipe_id]
             for _, row in ingredients.iterrows():
                 file.write(f"{row['Recipe ID']}—{row['Ingredient']}\n")
             file.write("\n")
     ```
   - **Explanation**: We open a text file (`Question1_C.txt`) for writing. For each selected recipe ID, we filter the DataFrame to get the ingredients associated with that ID. We then write each recipe ID and ingredient pair to the file in the format `(Recipe ID)—(Ingredient Name)`, one pair per line. After writing all ingredients for a recipe, we add a blank line to separate it from the next recipe.

This approach ensures that the data is stored in a clear and structured format, making it easy to review and analyze the recipe and ingredient information.
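If the selection needs to be repeatable, one option is to sample with a seeded random generator. The sketch below uses only the standard library and toy data. The seed value, the helper names `sample_recipes` and `write_pairs`, and the default `-` separator are assumptions made for illustration; the separator used in the notebook can be passed through `sep`.

```python
import io
import random
from collections import defaultdict

def sample_recipes(pairs, k, seed=42):
    """Group (recipe_id, ingredient) pairs and pick k recipe IDs reproducibly."""
    by_recipe = defaultdict(list)
    for recipe_id, ingredient in pairs:
        by_recipe[recipe_id].append(ingredient)
    rng = random.Random(seed)                    # local generator; the seed is arbitrary
    chosen = rng.sample(sorted(by_recipe), k)    # without replacement; k must not exceed the ID count
    return {recipe_id: by_recipe[recipe_id] for recipe_id in chosen}

def write_pairs(selection, out, sep="-"):
    """Write one 'id<sep>ingredient' line per pair and a blank line after each recipe."""
    for recipe_id, ingredients in selection.items():
        for ingredient in ingredients:
            out.write(f"{recipe_id}{sep}{ingredient}\n")
        out.write("\n")

pairs = [(1, "rice"), (1, "eggs"), (2, "beef"), (3, "oil"), (3, "salt")]   # toy data
buffer = io.StringIO()
write_pairs(sample_recipes(pairs, k=2), buffer)
print(buffer.getvalue())
```

Sorting the IDs before sampling makes the result depend only on the seed and the set of IDs, not on the order of rows in the input.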

In [11]:
import pandas as pd
import random

# Load the extracted ingredients CSV file
df = pd.read_csv('Question1_B.csv')

# Randomly choose 100 unique recipe IDs
unique_recipe_ids = df['Recipe ID'].unique()
selected_recipe_ids = random.sample(list(unique_recipe_ids), 100)

# Filter the DataFrame for these selected Recipe IDs
filtered_df = df[df['Recipe ID'].isin(selected_recipe_ids)]

# Save the selected recipes and ingredients to a text file
with open('Question1_C.txt', 'w') as file:
    for recipe_id in selected_recipe_ids:
        # Filter ingredients for the current Recipe ID
        ingredients = filtered_df[filtered_df['Recipe ID'] == recipe_id]
        for _, row in ingredients.iterrows():
            file.write(f"{row['Recipe ID']}—{row['Ingredient']}\n")
        file.write("\n")

print("Selected recipes and ingredients saved to 'Question1_C.txt'.")


Selected recipes and ingredients saved to 'Question1_C.txt'.
