# Exercise 11: Web Scraping
## Objective
The goal of this exercise is to build a web scraper that collects recipe data from a website.



## Part 0: Planning


### Identify the data to collect
- Recipe description
- Ingredients
- Instructions
- Nutritional information


### Choose the target website

We will use this recipe site:
- https://www.allrecipes.com/




### Plan the corpus structure

The corpus will consist of the relevant information extracted from each individual recipe page.
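
As a minimal sketch of what one corpus entry could look like, the fields planned above can be grouped in a small data class. The class name `RecipeRecord`, its field names, and the `is_complete` helper are illustrative assumptions, not part of the exercise statement.

```python
from dataclasses import dataclass, field


@dataclass
class RecipeRecord:
    """One corpus entry; the class and field names are illustrative."""
    url: str
    title: str
    description: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)
    nutrition: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A record is only useful if the core fields were found on the page
        return bool(self.title and self.ingredients and self.instructions)


example = RecipeRecord(
    url="https://www.allrecipes.com/recipe/13087/mulligatawny-soup-i/",
    title="Mulligatawny Soup",
    ingredients=["½ cup chopped onion"],
    instructions=["Gather all ingredients."],
)
print(example.is_complete())
```

Keeping the URL in each record makes it possible to trace any retrieved recipe back to its source page.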

## Part 1: Understanding the target website


### Analyze the structure of the target web page


In [64]:
# Install Beautiful Soup
!pip install beautifulsoup4



In [65]:
# Verify that the HTML page was uploaded
!ls

sample_data
view-source_https___www.allrecipes.com_recipe_13087_mulligatawny-soup-i_.html


In [66]:
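# Load the page for one specific recipe
import requests
from bs4 import BeautifulSoup

url = "https://www.allrecipes.com/recipe/13087/mulligatawny-soup-i/"

# Send a browser-like User-Agent; the timeout keeps the request from hanging
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0"
}, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 403 or 404

# Create the BeautifulSoup instance
soup = BeautifulSoup(response.text, "html.parser")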

### Identify the HTML elements that contain the target data


In [67]:
meta_title = soup.find("meta", property="og:title")
title = meta_title["content"] if meta_title else None

title


'Mulligatawny Soup'

In [68]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

½ cup chopped onion
2 stalks celery, chopped
1  carrot, diced
¼ cup butter
1 ½ tablespoons all-purpose flour
1 ½ teaspoons curry powder
4 cups chicken broth
½  apple, cored and chopped
¼ cup white rice
1  skinless, boneless chicken breast half - cut into cubes
1 pinch dried thyme
salt and ground black pepper to taste
½ cup heavy cream, heated
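
Some of the scraped lines contain doubled spaces (for example `1  carrot, diced`), probably because quantity, unit, and name sit in separate HTML elements. A minimal sketch of a cleanup step follows; the helper name `normalize_whitespace` is an assumption introduced here.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(text.split())


# Two lines taken from the output above, plus one made-up line with
# leading and trailing whitespace
raw_items = ["1  carrot, diced", "½  apple, cored and chopped", "  ¼ cup butter\n"]
clean_items = [normalize_whitespace(item) for item in raw_items]
print(clean_items)
```

Applying this before building the corpus keeps the text consistent without changing its content.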


## Part 2: Extracting the desired data

In [69]:
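# Extracting the description (guarding against a missing meta tag)
meta_description = soup.find("meta", {"name": "description"})
description = meta_description["content"] if meta_description else None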

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)

Recipe Title: Mulligatawny Soup
Description: This mulligatawny soup recipe combines flavors of chicken, curry, apple, and cream for lovely, Indian-inspired dish — easy to make comfort in a bowl.
Ingredients:
- ½ cup chopped onion
- 2 stalks celery, chopped
- 1  carrot, diced
- ¼ cup butter
- 1 ½ tablespoons all-purpose flour
- 1 ½ teaspoons curry powder
- 4 cups chicken broth
- ½  apple, cored and chopped
- ¼ cup white rice
- 1  skinless, boneless chicken breast half - cut into cubes
- 1 pinch dried thyme
- salt and ground black pepper to taste
- ½ cup heavy cream, heated
Instructions:
1. Gather all ingredients.
2. Melt butter in a large soup pot over medium heat. Add onions, celery, and carrot and sauté until soft, 5 to 7 minutes. Add flour and curry, and cook 5 more minutes, stirring frequently.
3. Add chicken broth, mix well, and bring to a boil. Reduce heat and simmer for about 30 minutes. Add apple, rice, chicken, thyme, salt, and pepper. Simmer until rice is tender, 15 to 20 minu

## Part 3: Collecting related links

In [70]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F13087%2Fmulligatawny-soup-i%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://support.people.inc/hc/en-us/categories/360003648613-Allrecipes
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F13087%2Fmulligatawny-soup-i%2F
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-stews-and-chili/
https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/
https://www
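
The substring test `"recipe" in href` also matches login links, category pages under `/recipes/`, and links to other sites, as the output above shows. A stricter filter could look like the sketch below. It assumes that individual recipe pages follow the `/recipe/<numeric id>/<slug>/` pattern seen in the seed URL; the helper name `is_recipe_url` is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: individual recipe pages look like /recipe/<numeric id>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/\d+/")


def is_recipe_url(href: str, host: str = "www.allrecipes.com") -> bool:
    """Return True if href looks like an individual recipe page on host."""
    parsed = urlparse(href)
    # Relative links have an empty netloc and are assumed to belong to host
    if parsed.netloc not in ("", host):
        return False
    return bool(RECIPE_PATH.match(parsed.path))


candidates = [
    "https://www.allrecipes.com/recipe/13087/mulligatawny-soup-i/",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "/account/add-recipe",
    "https://www.myrecipes.com/favorites",
]
print([href for href in candidates if is_recipe_url(href)])
```

Matching on the parsed path rather than the whole string avoids false positives from query parameters such as `relativeRedirectUrl=%2Frecipe%2F...`.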

## Part 4: RAG over the collected recipes

### Normalize URLs

In [71]:
from urllib.parse import urljoin

base_url = "https://www.allrecipes.com"

clean_recipe_urls = []

for href in recipe_urls:
    full_url = urljoin(base_url, href)
    clean_recipe_urls.append(full_url)

# Remove duplicates
clean_recipe_urls = list(set(clean_recipe_urls))

print("Total recipe URLs:", len(clean_recipe_urls))


Total recipe URLs: 101
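
Deduplicating with `set` discards the original link order, so the crawl order can differ between runs. As a sketch of an order-preserving alternative, the helper below (the name `normalize_urls` and the `#top` fragment in the sample are assumptions for illustration) makes each link absolute, drops any `#fragment`, and keeps the first occurrence of each URL.

```python
from urllib.parse import urldefrag, urljoin


def normalize_urls(hrefs, base_url="https://www.allrecipes.com"):
    """Make hrefs absolute, drop #fragments, and de-duplicate in first-seen order."""
    absolute = (urldefrag(urljoin(base_url, href)).url for href in hrefs)
    # dict preserves insertion order, so the first occurrence of each URL wins
    return list(dict.fromkeys(absolute))


sample = [
    "/account/add-recipe",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "/account/add-recipe#top",  # hypothetical fragment variant of the first link
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
]
print(normalize_urls(sample))
```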


### Scrape the collected URLs

In [72]:
import requests
from bs4 import BeautifulSoup
import time

recipes = []
headers = {"User-Agent": "Mozilla/5.0"}

for url in clean_recipe_urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")

        title_tag = soup.find("meta", property="og:title")
        description_tag = soup.find("meta", {"name": "description"})

        ingredients = [
            li.get_text().strip()
            for li in soup.find_all(
                "li",
                class_="mm-recipes-structured-ingredients__list-item"
            )
        ]

        instructions = [
            p.get_text().strip()
            for p in soup.find_all(
                "p",
                class_="comp mntl-sc-block mntl-sc-block-html"
            )
        ]

        if title_tag and ingredients and instructions:
            recipes.append({
                "url": url,
                "title": title_tag["content"],
                "description": description_tag["content"] if description_tag else "",
                "ingredients": ingredients,
                "instructions": instructions
            })

        time.sleep(1)

    except Exception as e:
        # Report the failure with its cause and continue with the next URL
        print("Error at:", url, "-", e)
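
The one-second pause in the loop keeps the request rate low. A complementary courtesy is to check a site's `robots.txt` before fetching a URL. The sketch below uses the standard-library `urllib.robotparser` with made-up rules parsed from a string, so it runs offline. The rules shown are hypothetical and are not the real rules of any site, and the helper name `allowed` is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only to illustrate the check;
# they are not the real rules of any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /authentication/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "Mozilla/5.0") -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://www.allrecipes.com/recipe/13087/mulligatawny-soup-i/"))
print(allowed("https://www.allrecipes.com/account/add-recipe"))
```

Against a live site, the rules would be loaded with `parser.set_url(...)` followed by `parser.read()` instead of `parser.parse(...)`.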


### Corpus structure

In [75]:
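# Add the seed recipe (Mulligatawny Soup) to the corpus.
# Caution: the scraping loop rebinds `soup`, `ingredients`, and `instructions`,
# so if it ran after Part 2, re-run the Part 1 and Part 2 cells first;
# otherwise these fields come from the last URL visited, not the seed page.
recipe_data = {
    "url": "https://www.allrecipes.com/recipe/13087/mulligatawny-soup-i/",
    "title": title,
    "description": description,
    "ingredients": ingredients,
    "instructions": instructions,
}

recipes.append(recipe_data)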


In [76]:
documents = []

for recipe in recipes:
    doc = f"""
    Title: {recipe['title']}

    Description: {recipe['description']}

    Ingredients:
    {', '.join(recipe['ingredients'])}

    Instructions:
    {' '.join(recipe['instructions'])}
    """
    documents.append(doc.strip())



### Create embeddings

In [77]:
pip install sentence-transformers faiss-cpu




In [78]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(documents, convert_to_numpy=True, show_progress_bar=True)



### Create the vector store (FAISS)

In [79]:
import faiss

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)


### Retrieval (semantic search)

In [85]:
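def retrieve_recipes(query, k=5):
    """Return the k recipes whose embeddings are closest to the query."""
    query_embedding = model.encode([query], convert_to_numpy=True)
    # IndexFlatL2 reports squared L2 distances; smaller means more similar
    distances, indices = index.search(query_embedding, k)

    results = []
    for idx, dist in zip(indices[0], distances[0]):
        # FAISS pads the result with -1 when fewer than k vectors are indexed
        if idx == -1:
            continue
        recipe = recipes[idx]

        results.append({
            "title": recipe["title"],
            "description": recipe.get("description", ""),
            "ingredients": recipe.get("ingredients", []),
            "distance": float(dist)
        })

    return results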
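
The distances returned by `IndexFlatL2` are squared Euclidean distances, which are hard to read on their own. If the embeddings are unit-length (an assumption worth checking with `np.linalg.norm(embeddings, axis=1)`), a squared L2 distance converts directly to cosine similarity. The sketch below shows the conversion; the helper name `squared_l2_to_cosine` is introduced here for illustration.

```python
import math


def squared_l2_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    # For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)
    return 1.0 - distance / 2.0


# Check the identity on two hand-made unit vectors that are 60 degrees apart
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
print(round(squared_l2_to_cosine(squared_l2), 6))  # cos(60 degrees) = 0.5
```

Under that assumption, a distance of about 0.74 corresponds to a cosine similarity of about 0.63.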



### Generation (RAG proper)
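
In the generation step, the retrieved recipes are passed to a language model as context for the user's question. As a minimal sketch, the prompt could be assembled as shown below. It assumes result dictionaries shaped like those returned by `retrieve_recipes`; the helper name `build_prompt` and the prompt wording are hypothetical, and the model call itself is omitted because no model is specified here.

```python
def build_prompt(question: str, retrieved: list[dict], max_recipes: int = 3) -> str:
    """Assemble a grounded prompt from retrieved recipe dicts (hypothetical helper)."""
    blocks = []
    for rank, recipe in enumerate(retrieved[:max_recipes], start=1):
        ingredients = ", ".join(recipe.get("ingredients", []))
        blocks.append(
            f"[{rank}] {recipe['title']}\n"
            f"Description: {recipe.get('description', '')}\n"
            f"Ingredients: {ingredients}"
        )
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the recipes below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative input in the shape produced by the retrieval step
sample_results = [{
    "title": "Greek Lemon Chicken Soup",
    "description": "",
    "ingredients": ["8 cups chicken broth", "½ cup fresh lemon juice"],
}]
prompt = build_prompt("Which soup uses lemon juice?", sample_results)
print(prompt)
```

Numbering each recipe in the context lets the model cite which one its answer came from.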

## Usage tests

In [81]:
recipes[0].keys()


dict_keys(['url', 'title', 'description', 'ingredients', 'instructions'])

In [90]:
query = "chicken soup"

results = retrieve_recipes(query, k=3)

print("======================================================================")
print("================================RESULTS===============================")
print("======================================================================")
print("QUERY:", query)

for r in results:
    print("1.", r["title"])
    print("2. Distance:", round(r["distance"], 4))
    print("3. Description:", r["description"])
    print("4. Ingredients:")
    for ing in r["ingredients"]:
        print("  -", ing)
    print()



QUERY: chicken soup
1. Chicken and Vegetables Soup
2. Distance: 0.7379
3. Description: This chicken soup is packed with Brussels sprouts, cauliflower, asparagus, and carrots. Every steamy, hearty bite will warm your bones and your belly.
4. Ingredients:
  - 1  whole onion, peeled
  - 6  chicken drumsticks
  - ½ teaspoon salt
  - ⅓ head cauliflower, chopped
  - 1 pound Brussels sprouts, trimmed and chopped
  - ½ pound baby carrots, chopped
  - 1 pound fresh asparagus spears, trimmed and chopped
  - 1 (32 ounce) package fat-free chicken broth
  - ½ teaspoon garlic powder
  - 1 teaspoon salt-free seasoning blend
  - ¼ cup uncooked long grain white rice
  - 1 bunch fresh dill weed

1. Greek Lemon Chicken Soup
2. Distance: 0.738
3. Description: This Greek lemon chicken soup is made with cooked chicken, rice, egg yolks, and lots of fresh lemon juice for a light, flavorful, and comforting meal.
4. Ingredients:
  - 8 cups chicken broth
  - ½ cup fresh lemon juice
  - ½ cup shredded carrots