# Exercise 12: Web Scraping

## Exercise objective

The goal of this exercise is to build a web scraper that collects data from a website.

## Part 0: Planning
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
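
As a rough illustration of step 3, the corpus can be planned as a list of records that all share the same keys. The sketch below assumes the six fields that this notebook extracts later (`title`, `description`, `ingredients`, `instructions`, `nutrition_facts`, `url`); the names `RecipeRecord` and `new_record` and the placeholder URL are hypothetical, introduced only for this example.

```python
import json
from typing import List, TypedDict


class RecipeRecord(TypedDict):
    """One corpus entry; field names mirror the data extracted later on."""
    title: str
    description: str
    ingredients: List[str]
    instructions: List[str]
    nutrition_facts: List[str]
    url: str


def new_record(url: str) -> RecipeRecord:
    """Return an empty record so every entry has the same keys."""
    return {
        "title": "",
        "description": "",
        "ingredients": [],
        "instructions": [],
        "nutrition_facts": [],
        "url": url,
    }


# Hypothetical placeholder URL, used only to show the shape of a record
record = new_record("https://example.com/recipe/1")
record["title"] = "Rotisserie Chicken"
serialized = json.dumps(record, ensure_ascii=False)
print(serialized)
```

Fixing the keys up front keeps every record uniform, which makes it simpler to save the corpus as JSON and load it into a table afterwards.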

## Part 1: Understand the target website

- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the target data.
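
One way to survey a page before writing selectors is to count which tag and class combinations repeat, since repeated elements often hold list-like data such as ingredients. The sketch below uses only the standard library's `html.parser` rather than BeautifulSoup; the class `TagClassCounter` and the HTML snippet are hypothetical stand-ins for a real saved page.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Count (tag, class) pairs to spot repeated containers worth scraping."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a class attribute may hold several names
        classes = dict(attrs).get("class") or ""
        for name in classes.split():
            self.counts[(tag, name)] += 1


# Hypothetical snippet standing in for a saved recipe page
sample_html = """
<ul>
  <li class="ingredient">1 whole chicken</li>
  <li class="ingredient">1 pinch salt</li>
</ul>
<p class="step">Preheat the grill.</p>
"""

counter = TagClassCounter()
counter.feed(sample_html)
print(counter.counts.most_common(1))  # [(('li', 'ingredient'), 2)]
```

The most frequent pairs are good candidates to inspect in the browser's developer tools before committing to a selector.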

In [None]:
!pip install sentence-transformers faiss-cpu



In [None]:
import requests
from bs4 import BeautifulSoup
import json

corpus = []

In [None]:

html_path = '/content/rotisserie-chicken.html'

# Load the HTML file (a separate name for the handle avoids shadowing the path)
with open(html_path, "r", encoding="utf-8") as html_file:
    html_content = html_file.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [None]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Rotisserie Chicken'

In [None]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


## Part 2: Extract the target data

* Search the HTML content and extract the information.
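
Text pulled out of HTML often carries stray newlines and repeated spaces. As a small sketch, a helper like the hypothetical `clean_text` below could collapse any run of whitespace into a single space before the value is stored; it is an optional extra, not something the cells in this notebook call.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


print(clean_text("  Total Fat\n   17g \n"))  # Total Fat 17g
```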

In [None]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Rotisserie Chicken
Description: Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Instructions:
1. Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.
2. Here's what you'll need to make rotisserie chicken at home:
3. · Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is simply seasoned with salt, pepper, and paprika.
4. You'll find the full, step-by-step recipe below — b

## Part 3: Collect related links
* Find links to other recipes to complete the corpus.
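
Links found on a page can be relative, can carry query strings or fragments, and can repeat. The sketch below shows one possible way to turn raw `href` values into a de-duplicated list of absolute recipe URLs with `urllib.parse`. The function name `normalize_recipe_links` and the two example recipe paths are hypothetical; only the `/recipe/` prefix and the category URL come from this notebook.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.allrecipes.com/"
RECIPE_PREFIX = "https://www.allrecipes.com/recipe/"


def normalize_recipe_links(hrefs, base_url=BASE_URL, prefix=RECIPE_PREFIX):
    """Return absolute recipe URLs, without query or fragment, in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links
        parts = urlsplit(absolute)
        # Drop the query string and fragment so variants of one page compare equal
        cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if cleaned.startswith(prefix) and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Hypothetical hrefs: two variants of one recipe, a category page, another recipe
hrefs = [
    "/recipe/1/example-dish/",
    "https://www.allrecipes.com/recipe/1/example-dish/?print",
    "https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
    "https://www.allrecipes.com/recipe/2/another-dish/#reviews",
]
links = normalize_recipe_links(hrefs)
print(links)
```

Only the two recipe URLs survive here: the category page is filtered out by the prefix check, and the `?print` variant collapses into the first entry.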

In [None]:
import requests
from bs4 import BeautifulSoup
import time

MAX_TOTAL_RECIPES = 100  # Maximum number of recipes to collect
REQUEST_TIMEOUT = 10  # Seconds to wait before giving up on a request

def get_category_links():
    """Collect category URLs from the main recipes index page."""
    url = "https://www.allrecipes.com/recipes/"
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    soup = BeautifulSoup(response.text, "html.parser")
    category_links = set()

    for link in soup.find_all("a", href=True):
        href = link['href']
        # Skip the index page itself: category URLs contain more than 4 slashes
        if href.startswith("https://www.allrecipes.com/recipes/") and href.count('/') > 4:
            category_links.add(href)

    return list(category_links)

def scrape_recipes_from_category(category_url, max_pages=3):
    """Return the set of recipe URLs linked from the first pages of a category."""
    recipe_urls = set()
    for page in range(1, max_pages + 1):
        url = f"{category_url}?page={page}"
        print(f"Visiting: {url}")
        try:
            response = requests.get(url, timeout=REQUEST_TIMEOUT)
            if response.status_code != 200:
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            links = soup.find_all("a", href=True)

            for link in links:
                href = link['href']
                if href.startswith("https://www.allrecipes.com/recipe/"):
                    recipe_urls.add(href)

            time.sleep(1)  # Pause between requests to avoid overloading the server

        except Exception as e:
            print(f"Error at {url}: {e}")

    return recipe_urls

# === MAIN CRAWLER ===

recipe_urls = set()
categories = get_category_links()

print(f"Found {len(categories)} categories.")
print("Sample categories:", categories[:5])

for cat_url in categories:
    if len(recipe_urls) >= MAX_TOTAL_RECIPES:
        break

    print(f"\nExploring category: {cat_url}")
    recetas_en_cat = scrape_recipes_from_category(cat_url, max_pages=3)
    print(f" → {len(recetas_en_cat)} recipes found in this category.")

    for r in recetas_en_cat:
        if len(recipe_urls) < MAX_TOTAL_RECIPES:
            recipe_urls.add(r)
        else:
            break

# === RESULTS ===

print("\n========================")
print(f"Total unique recipes found: {len(recipe_urls)}")
print("========================")
for i, r in enumerate(sorted(recipe_urls), 1):
    print(f"{i:03d}. {r}")


Found 53 categories.
Sample categories: ['https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/', 'https://www.allrecipes.com/recipes/16492/everyday-cooking/special-collections/allrecipes-allstars/', 'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/', 'https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/', 'https://www.allrecipes.com/recipes/82/trusted-brands-recipes-and-tips/']

Exploring category: https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
Visiting: https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/?page=1
Visiting: https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/?page=2
Visiting: https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/?page=3
 → 60 recipes found in this category.

Exploring category: https://www.allrecipes.com/recipes/16

In [None]:
def extract_recipe_data(html_content, source_url=""):
    soup = BeautifulSoup(html_content, "html.parser")

    title_tag = soup.find("h1")
    title = title_tag.get_text().strip() if title_tag else "No Title"

    description_tag = soup.find("meta", {"name": "description"})
    description = description_tag["content"] if description_tag else ""

    ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
    ingredients = [ing.get_text().strip() for ing in ingredients_section]

    instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
    instructions = [inst.get_text().strip() for inst in instructions_section]

    nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
    nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

    return {
        "title": title,
        "description": description,
        "ingredients": ingredients,
        "instructions": instructions,
        "nutrition_facts": nutrition_facts,
        "url": source_url
    }
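
Selectors based on class names can come back empty when a page does not contain matching elements, so it can help to check each record before adding it to the corpus. The sketch below is one possible check; the function name `find_empty_fields` and the sample record are hypothetical, with the record shaped like the dictionary returned by `extract_recipe_data`.

```python
def find_empty_fields(record):
    """Return the names of fields whose value is empty or missing."""
    expected = ("title", "description", "ingredients", "instructions", "nutrition_facts")
    return [name for name in expected if not record.get(name)]


# Hypothetical record shaped like the output of extract_recipe_data
sample = {
    "title": "Empanadas (Beef Turnovers)",
    "description": "Deep-fried pastry turnovers.",
    "ingredients": ["1 tablespoon olive oil"],
    "instructions": ["Gather the ingredients."],
    "nutrition_facts": [],
    "url": "https://example.com/recipe/1",
}
print(find_empty_fields(sample))  # ['nutrition_facts']
```

Logging the result per URL makes it easier to notice pages whose layout differs from the one the selectors were written for.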


In [None]:
import requests
import time

base_url = "https://www.allrecipes.com/"
corpus = []

for url in recipe_urls:
    # Resolve relative URLs against the base URL
    if url.startswith("/"):
        full_url = base_url + url.lstrip("/")
    else:
        full_url = url

    print(f"Processing: {full_url}")

    try:
        response = requests.get(full_url, timeout=10)
        if response.status_code == 200:
            data = extract_recipe_data(response.text, source_url=full_url)
            corpus.append(data)
        else:
            print(f"Failed to fetch {full_url}: status {response.status_code}")
    except Exception as e:
        print(f"Exception while processing {full_url}: {e}")

    time.sleep(1)  # Avoid overloading the server

print(f"\nData extracted from {len(corpus)} recipes.")


Processing: https://www.allrecipes.com/recipe/215231/empanadas-beef-turnovers/
Processing: https://www.allrecipes.com/recipe/216888/good-new-orleans-creole-gumbo/
Processing: https://www.allrecipes.com/recipe/127500/japanese-style-deep-fried-shrimp/
Processing: https://www.allrecipes.com/recipe/257938/spicy-thai-basil-chicken-pad-krapow-gai/
Processing: https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
Processing: https://www.allrecipes.com/recipe/220597/simple-garlic-shrimp/
Processing: https://www.allrecipes.com/recipe/151656/one-dish-chicken-and-stuffing-bake/
Processing: https://www.allrecipes.com/recipe/14685/slow-cooker-beef-stew-i/
Processing: https://www.allrecipes.com/recipe/11758/baked-ziti-i/
Processing: https://www.allrecipes.com/recipe/47015/quick-and-easy-pancit/
Processing: https://www.allrecipes.com/recipe/220164/classic-cuban-style-picadillo/
Processing: https://www.allrecipes.com/recipe/18379/best-green-bean-casserole/
Processing: https://www.allrecipes.com

In [None]:
with open("recipes_corpus.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=4)


In [None]:
import pandas as pd
import json

with open("recipes_corpus.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)
df = pd.DataFrame(corpus)
df

| | title | description | ingredients | instructions | nutrition_facts | url |
|---|---|---|---|---|---|---|
| 0 | Empanadas (Beef Turnovers) | This Goya empanada recipe for deep-fried pastr... | [1 tablespoon Goya Extra Virgin Olive Oil, ½ p... | [Gather the ingredients., Heat olive oil in a ... | [] | https://www.allrecipes.com/recipe/215231/empan... |
| 1 | Good New Orleans Creole Gumbo | Traditional Creole gumbo made with roux, shrim... | [1 cup all-purpose flour, ¾ cup bacon dripping... | [New Orleans-style gumbo is a true taste of So... | [Total Fat 17g, Saturated Fat 6g, Cholesterol ... | https://www.allrecipes.com/recipe/216888/good-... |
| 2 | Japanese-Style Deep-Fried Shrimp | This deep-fried shrimp recipe seasons shrimp w... | [1 pound medium shrimp, peeled (tails left on)... | [Gather all ingredients., Place shrimp in a bo... | [Total Fat 37g, Saturated Fat 6g, Cholesterol ... | https://www.allrecipes.com/recipe/127500/japan... |
| 3 | Spicy Thai Basil Chicken (Pad Krapow Gai) | Chef John's version of this classic Thai dish,... | [⅓ cup chicken broth, 1 tablespoon oyster sauc... | [Allrecipes home cooks give a solid 5-star rat... | [Total Fat 30g, Saturated Fat 7g, Cholesterol ... | https://www.allrecipes.com/recipe/257938/spicy... |
| 4 | World's Best Lasagna | This lasagna recipe from John Chandler is our ... | [1 pound sweet Italian sausage, ¾ pound lean g... | [When John Chandler submitted this lasagna rec... | [Total Fat 21g, Saturated Fat 10g, Cholesterol... | https://www.allrecipes.com/recipe/23600/worlds... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Quick and Easy Chicken Spaghetti | This chicken spaghetti recipe is quick and eas... | [1 (12 ounce) package angel hair pasta, 2 cups... | [Bring a large pot of lightly salted water to ... | [Total Fat 13g, Saturated Fat 6g, Cholesterol ... | https://www.allrecipes.com/recipe/26286/quick-... |
| 96 | Quick Beef Stir-Fry | This beef stir-fry recipe is quick and easy fo... | [2 tablespoons vegetable oil, 1 pound beef sir... | [This classic beef stir-fry recipe is the perf... | [Total Fat 16g, Saturated Fat 4g, Cholesterol ... | https://www.allrecipes.com/recipe/228823/quick... |
| 97 | Easy Lasagna | This easy lasagna recipe is made with lean gro... | [1 pound lean ground beef, 1 (32 ounce) jar sp... | [Making perfect homemade lasagna doesn’t have ... | [Total Fat 17g, Saturated Fat 8g, Cholesterol ... | https://www.allrecipes.com/recipe/12011/easy-l... |
| 98 | Bourbon Chicken | Bourbon chicken is a delicious dish of tender ... | [1 ½ pounds skinless boneless chicken thighs, ... | [This bourbon chicken recipe creates results t... | [Total Fat 20g, Saturated Fat 5g, Cholesterol ... | https://www.allrecipes.com/recipe/9025/bourbon... |


## Part 4: RAG over the collected recipes
* Once the corpus is built, implement and deploy RAG to search it.
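
The retrieval step of RAG comes down to finding the corpus vectors closest to the query vector. As a conceptual sketch, the pure-Python code below ranks documents by squared L2 distance with a brute-force scan, which is the ranking criterion an exact flat L2 index applies. The two-dimensional vectors, the labels, and the names `squared_l2` and `top_k` are made up for illustration; real embeddings have hundreds of dimensions and come from an embedding model.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = sorted((squared_l2(query_vec, vec), i) for i, vec in enumerate(doc_vecs))
    return [(i, dist) for dist, i in scored[:k]]


# Toy 2-D "embeddings" (illustrative values only)
docs = ["empanadas", "lasagna", "picadillo"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.0]

hits = top_k(query_vec, doc_vecs, k=2)
print([docs[i] for i, _ in hits])  # ['empanadas', 'picadillo']
```

The texts of the retrieved entries are then pasted into the prompt as context for the language model.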

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Use title + description as the base text (fill missing values first so a
# missing description does not blank out the whole entry)
texts = df["title"].fillna("") + ". " + df["description"].fillna("")

# Embeddings from a lightweight model
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts.tolist(), show_progress_bar=True)

# Build the FAISS index (FAISS expects float32 vectors)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.asarray(embeddings, dtype="float32"))

# Keep the texts for lookups at search time
corpus_texts = texts.tolist()



In [None]:
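def retrieve_similar_recipes(query, k=5):
    """Return the texts of the k corpus entries closest to the query."""
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k)
    # FAISS pads the result with -1 when fewer than k vectors are indexed
    results = [corpus_texts[i] for i in indices[0] if i != -1]
    return results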

In [None]:
def build_prompt(query, context_texts):
    context_str = "\n\n".join(context_texts)
    prompt = f"""Answer the following query using the recipe context provided.

Context:
{context_str}

Query:
{query}

Answer:"""
    return prompt


In [None]:
import os

from google import genai

# Create the client with an API key read from the environment; never hard-code secrets
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def run_rag(query):
    docs = retrieve_similar_recipes(query)
    prompt = build_prompt(query, docs)

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text



In [None]:
query = "How can I prepare empanadas?"
respuesta = run_rag(query)
print(respuesta)


To prepare empanadas, make fried pastries using Goya dough discs (Goya Discos pastry rounds). They are filled with seasoned beef, tomatoes, onions, and garlic.

Ground beef picadillo (Classic Cuban-Style Picadillo) can also be used as a filling for the empanadas.
