# Exercise 12: Web Scraping

## Objective of the exercise

The goal of this exercise is to build a web scraper that collects data from a website.

## Part 0: Planning
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
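
One way to make step 3 concrete is to fix the shape of a corpus record before writing any scraping code. The sketch below is only an illustration: the `RecipeRecord` class, its field names, and the JSON-lines serialization are hypothetical choices, not something the exercise prescribes.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RecipeRecord:
    """One document in the corpus (hypothetical schema)."""
    url: str
    title: str
    description: str = ""
    ingredients: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)


def to_json_line(record: RecipeRecord) -> str:
    """Serialize a record as one JSON line, keeping non-ASCII characters such as ½."""
    return json.dumps(asdict(record), ensure_ascii=False)


example = RecipeRecord(
    url="https://www.allrecipes.com/recipe/235997/unstuffed-cabbage-roll/",
    title="Unstuffed Cabbage Roll",
    ingredients=["2 pounds ground beef", "½ cup water"],
)
print(to_json_line(example))
```

Writing one JSON object per line keeps the corpus easy to append to as more recipes are scraped.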

## Part 1: Understanding the target website

- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the target data.

**Import libraries**

In [44]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


**Download the web page**
A User-Agent header is sent to reduce the chance of the request being blocked.

A status code of 200 means the page was downloaded successfully.

In [45]:
url = "https://www.allrecipes.com/recipe/235997/unstuffed-cabbage-roll/"

headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)

response.status_code


200
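
The download step can be wrapped in a small helper that fails loudly on any error response instead of relying on a visual check of the status code. This is a sketch under assumptions: `fetch_html`, its `get` parameter, and the 10-second timeout are illustrative and not part of the original notebook. The `get` parameter exists only so the function can be exercised without network access.

```python
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, get=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors.

    `get` is a hypothetical injection point; by default requests.get is used.
    """
    if get is None:
        import requests  # imported lazily so a stand-in can be passed instead
        get = requests.get
    response = get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, the parsing cell could start from `fetch_html(url)` rather than from `response.text`.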

**Parse the HTML** The parsed document can now be searched for HTML tags (h1, li, div, etc.).

In [46]:
soup = BeautifulSoup(response.text, "html.parser")

## Part 2: Extracting the target data

* Search the HTML content and extract the information.

**Extract the recipe title**

In [47]:
title = soup.find("h1").get_text(strip=True)
title


'Unstuffed Cabbage Roll'
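
The title cell assumes the page always contains an h1. If `soup.find` returned None, the call to `get_text` would raise an AttributeError. A small helper can apply a None check to every extraction. `text_or_default` is a hypothetical name; it only assumes the object it receives has a `get_text(strip=True)` method, as BeautifulSoup tags do.

```python
def text_or_default(tag, default=""):
    """Return the stripped text of a tag, or `default` when the tag is None."""
    if tag is None:
        return default
    return tag.get_text(strip=True)


# Possible usage with the soup object from the notebook:
# title = text_or_default(soup.find("h1"), "No title found")
```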

**Extract the description**

In [48]:
description_tag = soup.find("p", class_="article-subheading")

description = description_tag.get_text(strip=True) if description_tag else "No description found"
description


"This is an easy casserole made with ground beef, cabbage, garlic, and tomatoes. My kids don't even like cabbage, but they love this dish! Serve with rice for a comforting weeknight dinner. Also, the longer it stands the better it tastes!"

**Extract the ingredients**

In [49]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

2 pounds ground beef
1 large onion, chopped
1 small head cabbage, chopped
2 (14.5 ounce) cans diced tomatoes
1 (8 ounce) can tomato sauce
½ cup water
2 cloves garlic, minced
2 teaspoons salt
1 teaspoon ground black pepper


In [61]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
# Note: this class also matches article-body paragraphs, so the list can include intro text
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

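# Extracting the nutrition information
# NOTE: the selector below is an assumption. The cell that defined
# nutrition_section is not shown, so verify the class name against the live page.
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name")
nutrition_section_clean = [
    nutrient.get_text(strip=True)
    for nutrient in nutrition_section
]
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]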

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Unstuffed Cabbage Roll
Description: Unstuffed cabbage rolls with ground beef, cabbage, garlic, and tomatoes make a family-pleasing comforting casserole that's perfect for weeknights.
Ingredients:
- 2 pounds ground beef
- 1 large onion, chopped
- 1 small head cabbage, chopped
- 2 (14.5 ounce) cans diced tomatoes
- 1 (8 ounce) can tomato sauce
- ½ cup water
- 2 cloves garlic, minced
- 2 teaspoons salt
- 1 teaspoon ground black pepper
Instructions:
1. This unstuffed cabbage roll dish is a cheap, quick, and easy weeknight dinner you don't want to miss. If you're looking for a simple and hearty casserole that's just as good the next day, you're going to want to add this one to your recipe box.
2. An unstuffed cabbage roll is basically a deconstructed version of a regular casserole. All the traditional cabbage roll ingredients (cabbage, ground beef, tomatoes, other veggies, and spices and seasonings) are cooked together — so you don't have to worry about pre-cooking and rolling

## Part 3: Collecting related links
* Find links to other recipes to complete the corpus.

In [62]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F235997%2Funstuffed-cabbage-roll%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://support.people.inc/hc/en-us/categories/360003648613-Allrecipes
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F235997%2Funstuffed-cabbage-roll%2F
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-stews-and-chili/
https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/
htt
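
The substring test `"recipe" in href` also matches login, account, and category links, as the output above shows. A stricter filter is sketched below. It assumes individual recipe pages follow the `/recipe/<numeric id>/<slug>/` pattern seen in the URL downloaded earlier; the function name `filter_recipe_urls` and the `RECIPE_PATH` pattern are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.allrecipes.com"
# Assumed pattern for a single recipe page: /recipe/<digits>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/\d+/[^/]+/?$")


def filter_recipe_urls(hrefs, base_url=BASE_URL):
    """Resolve relative links, keep same-site recipe pages, drop duplicates in order."""
    base_host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.netloc != base_host or not RECIPE_PATH.match(parts.path):
            continue
        # Rebuild the URL without query string or fragment
        clean = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result
```

Called as `filter_recipe_urls(recipe_urls)`, it would return only absolute links to individual recipes, ready to be downloaded one by one.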

**Extract the preparation steps**

In [77]:
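# Assumption: the cell that defined `steps` is not shown; here it is taken to be
# the paragraph texts collected in `instructions`.
steps = instructions

# Keep only the paragraphs that begin with one of these leading verbs
steps_clean = [
    step for step in steps
    if step.startswith(("Gather", "Heat", "Add"))
]

steps_text = " ".join(steps_clean)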


## Part 4: RAG over the collected recipes
* Once the corpus has been built, implement and deploy retrieval-augmented generation (RAG) to run searches over it.
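
The retrieval half of RAG selects which corpus documents go into the prompt. As a minimal sketch, assuming the corpus is a list of plain-text documents, the function below ranks documents by how many query words they share. Word overlap is a deliberately simple stand-in for the embedding-based similarity a fuller implementation would more likely use, and `tokenize` and `retrieve` are hypothetical names.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return up to k documents sharing the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    # Sort by score only; the stable sort keeps corpus order for ties
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


corpus = [
    "Unstuffed Cabbage Roll: ground beef, cabbage, garlic, tomatoes",
    "Chocolate cake: flour, sugar, cocoa, eggs",
    "Beef stew: beef, carrots, potatoes, onion",
]
print(retrieve("recipes with ground beef and cabbage", corpus, k=2))
```

The retrieved documents would then be joined into the context string that is passed to the model.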

**Import libraries**

In [64]:
import google.generativeai as genai
from google.colab import userdata


**Configure Gemini**

In [65]:
API_KEY = userdata.get("GEMINI_API_KEY")

if API_KEY is None:
    raise ValueError("GEMINI_API_KEY was not found in Colab Secrets")

genai.configure(api_key=API_KEY)
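
Outside Colab, `google.colab` cannot be imported, so the configuration above only works inside a Colab session. A possible fallback, sketched below, tries Colab Secrets first and then an environment variable. `load_api_key` is a hypothetical helper, and the broad `except` is a simplification for this sketch.

```python
import os


def load_api_key(name="GEMINI_API_KEY"):
    """Read an API key from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible
        key = None
    key = key or os.environ.get(name)
    if not key:
        raise ValueError(f"{name} was not found in Colab Secrets or the environment")
    return key
```

The result could be passed to the configuration call as `genai.configure(api_key=load_api_key())`.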


**Build the RAG context**

In [78]:
context = f"""
Description:
{description}

Nutrition Information:
{", ".join(nutrition_section_clean)}

Ingredients:
{", ".join(ingredients)}

Preparation Steps:
{steps_text}
"""


**User query**

In [73]:
query_text = "Estimate the calories of the recipe and indicate if it is complex to prepare."
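
Rather than pasting the query into the template by hand, the prompt can be built by a function that takes the query and the context as arguments, so the same instructions are reused for any question. A minimal sketch, where `build_prompt` is a hypothetical helper and the instruction wording is abbreviated:

```python
def build_prompt(query, context):
    """Combine fixed instructions, the user query, and the retrieved context."""
    if not context.strip():
        raise ValueError("context is empty; nothing to ground the answer on")
    return (
        "Answer using ONLY the information in the context.\n\n"
        f'User query:\n"{query}"\n\n'
        f"Context:\n{context.strip()}\n\n"
        "If the information is insufficient, say so clearly.\n"
    )
```

For example, `build_prompt(query_text, context)` would produce a prompt for the query defined above.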

**RAG prompt**

In [80]:
prompt = f"""
You are an academic assistant. Answer using ONLY the information
contained in the provided context.

User query:
"{query_text}"

Context:
{context}

Task:
- Estimate the approximate calories based on the ingredients and the nutrition information.
- State whether the recipe is of low, medium, or high complexity.
- Briefly justify your answer.
- If the information is insufficient, say so clearly.
"""


**Initialize Gemini**

In [81]:
gemini_model = genai.GenerativeModel("gemini-3-flash-preview")

**Generate the response**

In [82]:
response = gemini_model.generate_content(prompt)
print(response.text)


Based solely on the information provided in the context:

*   **Calorie estimate:** The information is **insufficient**. Although the context mentions a "Nutrition Information" section, it only lists the categories (such as Total Fat, Protein, etc.) without providing numeric values or the total calories.
*   **Complexity:** **Low**.
*   **Justification:** The recipe is described as ideal for weeknights. The preparation process is simple: it requires a single vessel (a Dutch oven or large skillet), involves basic techniques such as browning meat and simmering, and consists of few steps.
