# Exercise 11: Web Scraping

# Name: Darlin Joel Anacicha   Course: GR1CC
## Objective of the exercise

The objective of this exercise is to build a web scraper that collects data from a website.

### Part 0: Planning
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
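
The corpus structure from step 3 could be sketched as one record per recipe. This is only a minimal sketch, not the storage format used in this notebook: the `RecipeRecord` class, its field names, and the JSON Lines layout are assumptions, chosen to mirror the fields extracted in Part 2 (title, description, ingredients, instructions, nutrition facts).

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RecipeRecord:
    """One corpus entry (hypothetical schema, not taken from the notebook)."""
    title: str
    description: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)
    nutrition_facts: list[str] = field(default_factory=list)
    source_url: str = ""


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Example values copied from the recipe used in this exercise
example = RecipeRecord(
    title="Unstuffed Cabbage Roll",
    ingredients=["2 pounds ground beef", "1 large onion, chopped"],
)
corpus_text = to_jsonl([example])
print(corpus_text)
```

Keeping one JSON object per line makes it easy to append new recipes as they are scraped, without rewriting the whole file.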

## Part 1: Understanding the target website

- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the data of interest.
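
One possible way to survey a page before writing selectors is to count how often each tag and class combination appears, since repeated combinations often mark list-like data such as ingredients. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The `TagClassCounter` class and the sample HTML are made up for illustration; only the ingredient class name comes from this notebook.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Counts (tag, class attribute) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tags without a class are skipped
        class_attr = dict(attrs).get("class")
        if class_attr:
            self.counts[(tag, class_attr)] += 1


# Tiny made-up sample; in practice this would be the saved recipe page.
sample_html = """
<ul>
  <li class="mm-recipes-structured-ingredients__list-item">2 pounds ground beef</li>
  <li class="mm-recipes-structured-ingredients__list-item">1 large onion, chopped</li>
</ul>
<p class="intro">A hearty casserole.</p>
"""

counter = TagClassCounter()
counter.feed(sample_html)
for (tag, class_attr), n in counter.counts.most_common():
    print(n, tag, class_attr)
```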

In [19]:
from bs4 import BeautifulSoup

# Path to the saved recipe page (Kaggle input dataset)
html_path = '/kaggle/input/receta/Unstuffed Cabbage Roll Recipe.html'

# Load the HTML file; the handle gets its own name so it does not shadow the path
with open(html_path, "r", encoding="utf-8") as f:
    html_content = f.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [20]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Unstuffed Cabbage Roll'

In [21]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

2 pounds ground beef
1 large onion, chopped
1 small head cabbage, chopped
2 (14.5 ounce) cans diced tomatoes
1 (8 ounce) can tomato sauce
½ cup water
2 cloves garlic, minced
2 teaspoons salt
1 teaspoon ground black pepper


## Part 2: Extracting the target data

* Search the HTML content and extract the information.

In [22]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

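# Extracting the instructions
# Note: this class is not specific to the recipe steps; the output below
# starts with the article's introductory paragraphs
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]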

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Unstuffed Cabbage Roll
Description: Unstuffed cabbage rolls with ground beef, cabbage, garlic, and tomatoes make a family-pleasing comforting casserole that's perfect for weeknights.
Ingredients:
- 2 pounds ground beef
- 1 large onion, chopped
- 1 small head cabbage, chopped
- 2 (14.5 ounce) cans diced tomatoes
- 1 (8 ounce) can tomato sauce
- ½ cup water
- 2 cloves garlic, minced
- 2 teaspoons salt
- 1 teaspoon ground black pepper
Instructions:
1. This unstuffed cabbage roll dish is a cheap, quick, and easy weeknight dinner you don't want to miss. If you're looking for a simple and hearty casserole that's just as good the next day, you're going to want to add this one to your recipe box.
2. An unstuffed cabbage roll is basically a deconstructed version of a regular casserole. All the traditional cabbage roll ingredients (cabbage, ground beef, tomatoes, other veggies, and spices and seasonings) are cooked together — so you don't have to worry about pre-cooking and rolling

## Part 3: Collecting related links
* Find links to other recipes to complete the corpus.

In [23]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F235997%2Funstuffed-cabbage-roll%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://support.people.inc/hc/en-us/categories/360003648613-Allrecipes
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F235997%2Funstuffed-cabbage-roll%2F
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-stews-and-chili/
https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/
htt
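
The substring test `"recipe" in href` also matches login, account, and category pages, as the output above shows. A stricter filter could be sketched as follows. It assumes that individual recipe pages follow the `/recipe/<numeric id>/<slug>/` shape seen in the redirect parameter of the login link above. The `filter_recipe_urls` helper, the `BASE_URL` constant, and the last two sample hrefs are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.allrecipes.com"  # assumed base for relative hrefs
# Assumed pattern: individual recipe pages look like /recipe/<numeric id>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/\d+/")


def filter_recipe_urls(hrefs, base_url=BASE_URL):
    """Return absolute, de-duplicated URLs whose path matches RECIPE_PATH."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # skip links to other sites
        if not RECIPE_PATH.match(parsed.path):
            continue  # skip category, login, and account pages
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_hrefs = [
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F235997%2Funstuffed-cabbage-roll%2F",
    "/account/add-recipe",
    "https://www.myrecipes.com/favorites",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    # The next two entries are illustrative, built from the redirect path above
    "/recipe/235997/unstuffed-cabbage-roll/",
    "https://www.allrecipes.com/recipe/235997/unstuffed-cabbage-roll/",
]
filtered = filter_recipe_urls(sample_hrefs)
print(filtered)
```

Matching on the parsed path rather than the whole URL keeps query strings, such as the encoded redirect parameter, from producing false positives.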

## Part 4: RAG over the collected recipes
* Once the corpus has been built, implement and deploy RAG to run searches over it.

In [24]:

!pip install -q langchain langchain-community faiss-cpu sentence-transformers

In [25]:

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document

# 1. BUILD THE TEXT FOR RAG
# Reuse the variables already extracted in Part 2 (title, ingredients, instructions, etc.)

# Flatten the lists into plain text separated by commas or spaces
texto_ingredientes = ", ".join(ingredients)
texto_instrucciones = " ".join(instructions)
texto_nutricion = ", ".join(nutrition_facts)

# Build one consolidated text block containing ALL the information
contenido_receta = f"""
TÍTULO: {title}
DESCRIPCIÓN: {description}
INGREDIENTES: {texto_ingredientes}
PASOS DE PREPARACIÓN: {texto_instrucciones}
INFORMACIÓN NUTRICIONAL: {texto_nutricion}
"""

# 2. CREATE THE LANGCHAIN DOCUMENT
# The list holds a single document because only one recipe is being analyzed.
# With more recipes, more documents would be appended to this list.
docs = [Document(page_content=contenido_receta, metadata={"source": title})]

print("Generando Embeddings y Base de Datos Vectorial...")

# 3. GENERATE EMBEDDINGS (vectors)
# Uses a free, local model (runs on CPU in Kaggle without an API key)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 4. CREATE THE SEARCH INDEX (FAISS)
db = FAISS.from_documents(docs, embeddings)

print("¡Sistema RAG listo para recibir preguntas!")

Generando Embeddings y Base de Datos Vectorial...



¡Sistema RAG listo para recibir preguntas!
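
With a single document in the index, `similarity_search(..., k=1)` can only ever return that one document, whatever the question is. One possible refinement is to split each recipe into one chunk per section so that retrieval can tell ingredients apart from nutrition facts. The sketch below builds plain dictionaries so that it runs without LangChain. The `build_section_chunks` helper and the `section` metadata key are hypothetical, and the sample instruction text is a placeholder.

```python
def build_section_chunks(title, description, ingredients, instructions, nutrition_facts):
    """Split one recipe into per-section chunks (hypothetical helper)."""
    sections = {
        "description": description,
        "ingredients": ", ".join(ingredients),
        "instructions": " ".join(instructions),
        "nutrition": ", ".join(nutrition_facts),
    }
    chunks = []
    for name, text in sections.items():
        if not text:
            continue  # skip sections that were not found on the page
        chunks.append({
            # Repeat the title in every chunk so each one is self-describing
            "page_content": f"{title} ({name}): {text}",
            "metadata": {"source": title, "section": name},
        })
    return chunks


chunks = build_section_chunks(
    title="Unstuffed Cabbage Roll",
    description="",
    ingredients=["2 pounds ground beef", "1 large onion, chopped"],
    instructions=["Placeholder step text."],  # placeholder, not from the recipe
    nutrition_facts=[],
)
for chunk in chunks:
    print(chunk["metadata"]["section"], "->", chunk["page_content"])
```

Each dictionary maps directly onto `Document(page_content=..., metadata=...)` as used in the cell above.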


In [26]:
# Function to search the corpus
def consultar_receta(pregunta):
    print(f"\n PREGUNTA: '{pregunta}'")

    # Retrieve the most relevant fragment (k=1)
    resultados = db.similarity_search(pregunta, k=1)

    for doc in resultados:
        print("-" * 50)
        print(" INFORMACIÓN RELEVANTE ENCONTRADA:")
        # Show at most 500 characters of the result to keep the output short
        print(doc.page_content[:500] + "...")
        print("-" * 50)

# --- LIVE TESTS ---
# Ask about the ingredients
consultar_receta("What are the ingredients for the sauce?")

# Ask how to cook it
consultar_receta("How long do I cook it?")

# Ask about calories or nutrition
consultar_receta("calories info")


 PREGUNTA: 'What are the ingredients for the sauce?'
--------------------------------------------------
 INFORMACIÓN RELEVANTE ENCONTRADA:

TÍTULO: Unstuffed Cabbage Roll
DESCRIPCIÓN: Unstuffed cabbage rolls with ground beef, cabbage, garlic, and tomatoes make a family-pleasing comforting casserole that's perfect for weeknights.
INGREDIENTES: 2 pounds ground beef, 1 large onion, chopped, 1 small head cabbage, chopped, 2 (14.5 ounce) cans diced tomatoes, 1 (8 ounce) can tomato sauce, ½ cup water, 2 cloves garlic, minced, 2 teaspoons salt, 1 teaspoon ground black pepper
PASOS DE PREPARACIÓN: This unstuffed cabbage roll dish is a chea...
--------------------------------------------------

 PREGUNTA: 'How long do I cook it?'
--------------------------------------------------
 INFORMACIÓN RELEVANTE ENCONTRADA:

TÍTULO: Unstuffed Cabbage Roll
DESCRIPCIÓN: Unstuffed cabbage rolls with ground beef, cabbage, garlic, and tomatoes make a family-pleasing comforting casserole that's perfect for we
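
Conceptually, a top-k similarity search turns the query and every document into vectors, scores each document against the query, and returns the k best matches. The toy sketch below illustrates that idea with word-count vectors and cosine similarity. It is a stand-in for illustration only, not the sentence-transformers model or the FAISS index used above. The `embed`, `cosine`, and `top_k` functions and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


documents = [
    "ingredients: ground beef, onion, cabbage, diced tomatoes, tomato sauce",
    "nutrition: calories, fat, carbs, protein",  # made-up text for illustration
]
best = top_k("calories info", documents, k=1)
print(best[0])
```

With more than one document in the index, different questions can surface different chunks instead of always returning the same recipe text.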