# Exercise 12: Web Scraping

## Objective

The goal of this exercise is to build a web scraper that collects data from a website.

## Part 0: Plan
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
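
One way to make the third planning step concrete is to write down the shape of a corpus record before any scraping code exists. The sketch below is only an illustration: `Recipe` is a hypothetical name, and its fields mirror the keys this notebook stores for each recipe further down.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Recipe:
    """Hypothetical record type; field names match the keys used later in this notebook."""
    title: str
    description: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)
    nutrition_facts: list[str] = field(default_factory=list)
    url: str = ""


# Build one record by hand and serialize it the way the corpus is saved later on.
example = Recipe(
    title="Rotisserie Chicken",
    ingredients=["1 (3 pound) whole chicken", "¼ cup butter, melted"],
)
record = asdict(example)
serialized = json.dumps(record, ensure_ascii=False)
print(serialized)
```

Deciding the fields early makes it easier to notice which of them the target page actually exposes, and `ensure_ascii=False` keeps characters such as `¼` readable in the saved file.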

## Part 1: Understand the target website

- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the data you are looking for.
In [23]:
!pip install sentence-transformers faiss-cpu



In [24]:
import requests
from bs4 import BeautifulSoup
import json

corpus = []

In [25]:

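html_path = '/content/rotisserie-chicken.html'

# Load the HTML file (separate names for the path and the file handle, so neither shadows the other)
with open(html_path, "r", encoding="utf-8") as html_file:
    html_content = html_file.read()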

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [26]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Rotisserie Chicken'

In [27]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


## Part 2: Extract the desired data

* Search the HTML content and extract the information.
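
To show what "search and extract" amounts to, here is a small sketch that collects the text of every element with a given tag and class. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages; `TextByClass` and the sample HTML are hypothetical and only imitate the kind of markup a recipe page might use.

```python
from html.parser import HTMLParser


class TextByClass(HTMLParser):
    """Collects the text of every <tag> element that carries a given class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while the parser is inside a matching element
        self.buffer = []    # text fragments of the element being read
        self.items = []     # finished, cleaned-up texts

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # same tag nested inside a match
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.items.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Made-up markup; the class names are illustrative only.
sample_html = """
<ul>
  <li class="ingredient item"><span>1</span> <span>pinch</span> salt</li>
  <li class="other">not an ingredient</li>
  <li class="ingredient">1 tablespoon ground paprika</li>
</ul>
"""

parser = TextByClass("li", "ingredient")
parser.feed(sample_html)
ingredients = parser.items
print(ingredients)
```

BeautifulSoup's `find_all` combined with `get_text()` covers the same ground with far less code, which is why the cells in this part rely on it.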

In [28]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Rotisserie Chicken
Description: Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Instructions:
1. Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.
2. Here's what you'll need to make rotisserie chicken at home:
3. · Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is simply seasoned with salt, pepper, and paprika.
4. You'll find the full, step-by-step recipe below — b

## Part 3: Collect related links
* Find links to other recipes to complete the corpus.
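
Links found on a page can be relative, carry query strings or fragments, or point to category pages instead of recipes. The sketch below shows one possible way to normalize and filter them with `urllib.parse`; `normalize_recipe_links` is a hypothetical helper, and the `?print` and `#reviews` suffixes in the sample are invented to exercise the cleanup.

```python
from urllib.parse import urljoin, urlparse, urlunparse

BASE_URL = "https://www.allrecipes.com/"
RECIPE_PREFIX = "https://www.allrecipes.com/recipe/"


def normalize_recipe_links(hrefs, base_url=BASE_URL):
    """Resolve relative links, drop query strings and fragments, keep recipe URLs only."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        cleaned = urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))
        if cleaned.startswith(RECIPE_PREFIX) and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


hrefs = [
    "/recipe/9027/kung-pao-chicken/",                                       # relative link
    "https://www.allrecipes.com/recipe/9027/kung-pao-chicken/?print",       # duplicate with a query string
    "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",  # category page, not a recipe
    "https://www.allrecipes.com/recipe/31848/jambalaya/#reviews",           # fragment
]
links = normalize_recipe_links(hrefs)
print(links)
```

Keeping insertion order in a list next to the `seen` set makes repeated runs easier to compare than iterating over a bare set.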

In [29]:
import requests
from bs4 import BeautifulSoup
import time

MAX_TOTAL_RECIPES = 100  # Maximum number of recipes to collect

def get_category_links():
    url = "https://www.allrecipes.com/recipes/"
    response = requests.get(url, timeout=10)  # timeout so a stalled connection cannot hang the cell
    soup = BeautifulSoup(response.text, "html.parser")
    category_links = set()

    for link in soup.find_all("a", href=True):
        href = link['href']
        if href.startswith("https://www.allrecipes.com/recipes/") and href.count('/') > 4:
            category_links.add(href)

    return list(category_links)

def scrape_recipes_from_category(category_url, max_pages=3):
    recipe_urls = set()
    for page in range(1, max_pages + 1):
        url = f"{category_url}?page={page}"
        print(f"Visitando: {url}")
        try:
            response = requests.get(url, timeout=10)  # timeout so a stalled connection cannot hang the crawl
            if response.status_code != 200:
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            links = soup.find_all("a", href=True)

            for link in links:
                href = link['href']
                if href.startswith("https://www.allrecipes.com/recipe/"):
                    recipe_urls.add(href)

            time.sleep(1)

        except Exception as e:
            print(f"Error en {url}: {e}")

    return recipe_urls

# === MAIN CRAWLER ===

recipe_urls = set()
categories = get_category_links()

print(f"Se encontraron {len(categories)} categorías.")
print("Ejemplo de categorías:", categories[:5])

for cat_url in categories:
    if len(recipe_urls) >= MAX_TOTAL_RECIPES:
        break

    print(f"\nExplorando categoría: {cat_url}")
    recetas_en_cat = scrape_recipes_from_category(cat_url, max_pages=3)
    print(f" → {len(recetas_en_cat)} recetas encontradas en esta categoría.")

    for r in recetas_en_cat:
        if len(recipe_urls) < MAX_TOTAL_RECIPES:
            recipe_urls.add(r)
        else:
            break

# === RESULTS ===

print("\n========================")
print(f"Total de recetas únicas encontradas: {len(recipe_urls)}")
print("========================")
for i, r in enumerate(sorted(recipe_urls), 1):
    print(f"{i:03d}. {r}")


Se encontraron 53 categorías.
Ejemplo de categorías: ['https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/', 'https://www.allrecipes.com/recipes/723/world-cuisine/european/italian/', 'https://www.allrecipes.com/recipes/22882/everyday-cooking/instant-pot/', 'https://www.allrecipes.com/recipes/17583/everyday-cooking/cookware-and-equipment/', 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/']

Explorando categoría: https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/
Visitando: https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/?page=1
Visitando: https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/?page=2
Visitando: https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/?page=3
 → 27 recetas encontradas en esta categoría.

Explorando categoría: https://www.allrecipes.com/recipes/723/world-cuisine/european/italian/
Visitando: https://www.allrecipes.com/recipes/723/world-cuisine

In [30]:
def extract_recipe_data(html_content, source_url=""):
    soup = BeautifulSoup(html_content, "html.parser")

    title_tag = soup.find("h1")
    title = title_tag.get_text().strip() if title_tag else "No Title"

    description_tag = soup.find("meta", {"name": "description"})
    description = description_tag["content"] if description_tag else ""

    ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
    ingredients = [ing.get_text().strip() for ing in ingredients_section]

    instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
    instructions = [inst.get_text().strip() for inst in instructions_section]

    nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
    nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

    return {
        "title": title,
        "description": description,
        "ingredients": ingredients,
        "instructions": instructions,
        "nutrition_facts": nutrition_facts,
        "url": source_url
    }


In [31]:
import requests
import time

base_url = "https://www.allrecipes.com/"
corpus = []

for url in recipe_urls:
    # Complete the URL if it is relative
    if url.startswith("/"):
        full_url = base_url + url.lstrip("/")
    else:
        full_url = url

    print(f"Procesando: {full_url}")

    try:
        response = requests.get(full_url, timeout=10)  # timeout so a stalled connection cannot hang the loop
        if response.status_code == 200:
            data = extract_recipe_data(response.text, source_url=full_url)
            corpus.append(data)
        else:
            print(f"Error al obtener {full_url}: Código {response.status_code}")
    except Exception as e:
        print(f"Excepción al procesar {full_url}: {e}")

    time.sleep(1)  # Pause between requests to avoid overloading the server

print(f"\nDatos extraídos de {len(corpus)} recetas.")


Procesando: https://www.allrecipes.com/recipe/222814/crispy-chinese-noodles-restaurant-style/
Procesando: https://www.allrecipes.com/recipe/57716/shrimp-with-lobster-sauce/
Procesando: https://www.allrecipes.com/recipe/259216/instant-pot-frozen-salmon/
Procesando: https://www.allrecipes.com/recipe/280980/chinese-hand-pulled-noodles-in-beef-broth/
Procesando: https://www.allrecipes.com/recipe/9027/kung-pao-chicken/
Procesando: https://www.allrecipes.com/recipe/272258/instant-pot-baked-beans/
Procesando: https://www.allrecipes.com/recipe/214464/chinese-chicken-wings/
Procesando: https://www.allrecipes.com/recipe/31848/jambalaya/
Procesando: https://www.allrecipes.com/recipe/262470/fall-off-the-bone-30-minute-instant-pot-ribs/
Procesando: https://www.allrecipes.com/recipe/270310/instant-pot-italian-wedding-soup/
Procesando: https://www.allrecipes.com/recipe/268516/instant-pot-pot-roast/
Procesando: https://www.allrecipes.com/recipe/269283/instant-pot-wheat-berries/
Procesando: https://www

In [32]:
with open("recipes_corpus.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=4)


In [33]:
import pandas as pd
import json

with open("recipes_corpus.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)
df = pd.DataFrame(corpus)
df

| | title | description | ingredients | instructions | nutrition_facts | url |
|---|---|---|---|---|---|---|
| 0 | Crispy Chinese Noodles | These crispy noodles are easy to make at home ... | [1 (10 ounce) package egg roll wrappers, 2 cup... | [Gather the ingredients., Cut egg roll wrapper... | [Total Fat 6g, Saturated Fat 1g, Cholesterol 3... | https://www.allrecipes.com/recipe/222814/crisp... |
| 1 | Shrimp with Lobster Sauce | Make shrimp with lobster sauce at home with st... | [1 ½ teaspoons cornstarch, 2 teaspoons cooking... | [Dissolve 1 1/2 teaspoons cornstarch in sherry... | [Total Fat 20g, Saturated Fat 4g, Cholesterol ... | https://www.allrecipes.com/recipe/57716/shrimp... |
| 2 | Instant Pot Frozen Salmon | This Instant Pot recipe takes salmon from froz... | [1 cup cold water, ¼ cup lemon juice, cooking ... | [Gather all ingredients., Pour cold water and ... | [Total Fat 7g, Saturated Fat 1g, Cholesterol 5... | https://www.allrecipes.com/recipe/259216/insta... |
| 3 | Chinese Hand-Pulled Noodles in Beef Broth | Fresh hand-pulled noodles are served in a rich... | [1 gallon water, 2 pounds boneless beef should... | [Bring water to a boil in a large pot. Add bee... | [Total Fat 15g, Saturated Fat 4g, Cholesterol ... | https://www.allrecipes.com/recipe/280980/chine... |
| 4 | Kung Pao Chicken | Love kung pao chicken? It's easy to make this ... | [2 tablespoons cornstarch, dissolved in 2 tabl... | [Make restaurant-worthy Kung Pao chicken at ho... | [Total Fat 23g, Saturated Fat 4g, Cholesterol ... | https://www.allrecipes.com/recipe/9027/kung-pa... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Pressure Cooker Corned Beef | This pressure cooker corned beef is perfectly ... | [1 (4 pound) corned beef brisket, seasoning pa... | [Gather all ingredients., Place brisket fat-si... | [Total Fat 15g, Saturated Fat 5g, Cholesterol ... | https://www.allrecipes.com/recipe/256749/press... |
| 96 | Slow Cooker Beef Stew | This beef stew recipe is easy to make in a slo... | [2 pounds beef stew meat, cut into 1-inch piec... | [Gather all ingredients., Place beef in the sl... | [Total Fat 30g, Saturated Fat 12g, Cholesterol... | https://www.allrecipes.com/recipe/14685/slow-c... |
| 97 | Easy and Delicious Homemade Ricotta Cheese | Making homemade ricotta couldn't be simpler wi... | [8 ½ cups whole milk, ⅓ cup lemon juice, 1 tea... | [Pour milk into a saucepan set over medium hea... | [Total Fat 8g, Saturated Fat 5g, Cholesterol 2... | https://www.allrecipes.com/recipe/254480/easy-... |
| 98 | Pressure Cooker Bone-In Pork Chops, Baked Pota... | This pressure cooker pork chop recipe features... | [4 3/4-inch thick bone-in pork chops, salt an... | [Gather all ingredients., Season pork chops wi... | [Total Fat 22g, Saturated Fat 11g, Cholesterol... | https://www.allrecipes.com/recipe/236566/press... |
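
The `nutrition_facts` column holds plain strings such as `Total Fat 6g`. If numeric values are needed later, one possible post-processing step is to split each string into a name, an amount, and a unit. This is a sketch under the assumption that every entry follows the `<name> <number><unit>` shape; `parse_nutrition_fact` is a hypothetical helper, and entries that do not fit the pattern are returned as `None` instead of being guessed at.

```python
import re

# Assumed shape: "<name> <number><unit>", for example "Total Fat 6g".
NUTRIENT_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%]*)$"
)


def parse_nutrition_fact(text):
    """Return (name, amount, unit), or None when the string does not fit the assumed shape."""
    match = NUTRIENT_PATTERN.match(text.strip())
    if match is None:
        return None
    return match["name"], float(match["amount"]), match["unit"]


# Sample strings taken from the first rows of the table.
facts = ["Total Fat 6g", "Saturated Fat 1g", "Total Fat 20g"]
parsed = [parse_nutrition_fact(fact) for fact in facts]
print(parsed)
```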


## Part 4: Run RAG over the collected recipes
* Once the corpus has been built, implement and deploy RAG (retrieval-augmented generation) to search it.
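
The retrieval half of RAG boils down to nearest-neighbour search over embedding vectors. The pure-Python sketch below illustrates what an exact, flat L2 index computes: it scores every stored vector against the query by squared Euclidean distance and returns the closest ones. The `search` helper, the document labels, and the 2-D vectors are toy assumptions; real sentence embeddings have hundreds of dimensions, and a library such as FAISS performs the same comparison in optimized native code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(vectors, query, k):
    """Exact nearest-neighbour search: score every vector, return the k closest."""
    k = min(k, len(vectors))
    scored = sorted((squared_l2(vector, query), i) for i, vector in enumerate(vectors))
    distances = [distance for distance, _ in scored[:k]]
    indices = [i for _, i in scored[:k]]
    return distances, indices


# Toy corpus: labels and 2-D "embeddings" invented for illustration.
docs = ["chicken recipe", "noodle recipe", "salmon recipe"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

query_vector = [1.0, 0.0]
distances, indices = search(vectors, query_vector, k=2)
retrieved = [docs[i] for i in indices]
print(retrieved)
```

The retrieved texts are then pasted into the prompt as context, which is the "augmented" part of the technique.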

In [34]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Cast every field to string before concatenating
texts = (
    df["title"].astype(str) + ". " +
    df["description"].astype(str) + ". " +
    df["ingredients"].astype(str) + ". " +
    df["instructions"].astype(str) + ". " +
    df["nutrition_facts"].astype(str) + ". " +
    df["url"].astype(str)
)

# Compute the embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts.tolist(), show_progress_bar=True)

# Build the FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

# Keep the source texts for later lookups
corpus_texts = texts.tolist()




In [35]:
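def retrieve_similar_recipes(query, k=5):
    # Never ask for more neighbours than the index holds; FAISS pads missing hits with -1
    k = min(k, index.ntotal)
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k)
    results = [corpus_texts[i] for i in indices[0]]
    return results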

In [36]:
def build_prompt(query, context_texts):
    context_str = "\n\n".join(context_texts)
    prompt = f"""Responde la siguiente consulta usando el contexto de recetas que te doy.

Contexto:
{context_str}

Consulta:
{query}

Respuesta:"""
    return prompt


In [37]:
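import os

from google import genai

# Create the client; the API key is read from the GEMINI_API_KEY environment variable
# (a name chosen here) so that no secret is hardcoded in the notebook
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])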

def run_rag(query):
    docs = retrieve_similar_recipes(query)
    prompt = build_prompt(query, docs)

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text



In [38]:
query = "¿Cuales son los ingredientes para hacer Crispy Chinese Noodles?"
respuesta = run_rag(query)
print(respuesta)


Los ingredientes para hacer Crispy Chinese Noodles son:

*   1 (paquete de 10 onzas) de láminas para rollitos de huevo
*   2 tazas de aceite vegetal, o según sea necesario
