<a href="https://colab.research.google.com/github/RikiGL/Deber-Recuperacion-Informacion/blob/main/ejercicio_11_webcrawling_rikiguallichico.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 11: Web Scraping

## Objective of the exercise

The goal of this exercise is to build a web scraper that collects data from a website.

### Part 0: Plan
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
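
One way to make step 3 concrete is to fix a record layout before writing any scraping code. The sketch below is an assumption, not part of the exercise statement: the `RecipeRecord` name and its fields are hypothetical, chosen to mirror the fields extracted later in this notebook.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RecipeRecord:
    """Hypothetical layout for one corpus entry (field names are assumptions)."""
    title: str
    description: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)
    nutrition_facts: list[str] = field(default_factory=list)
    source: str = ""  # file name or URL the record was scraped from


example = RecipeRecord(
    title="Rotisserie Chicken",
    ingredients=["1 (3 pound) whole chicken", "1 pinch salt"],
    source="rotisserie-chicken.html",
)
record_dict = asdict(example)  # plain dict, ready for JSON or a DataFrame
```

Deciding the fields up front makes it easier to notice, page by page, which values the scraper failed to find.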

## Part 1: Understand the target website

- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the data you are looking for.
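
A quick way to survey a page's structure is to count which tag and class combinations occur most often, since repeated combinations usually mark list-like data such as ingredients. The following is a minimal sketch that uses only the standard library. The `TagClassCounter` name and the `sample_html` snippet are hypothetical stand-ins for the real recipe page.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Counts (tag, class attribute) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        css_class = dict(attrs).get("class") or ""
        self.counts[(tag, css_class)] += 1


# Tiny hypothetical page standing in for the real recipe HTML
sample_html = """
<ul>
  <li class="ingredient">1 pinch salt</li>
  <li class="ingredient">1 tablespoon salt</li>
</ul>
<p class="step">Preheat the grill.</p>
"""

counter = TagClassCounter()
counter.feed(sample_html)
most_common = counter.counts.most_common(3)
```

On a real page, the top entries of `most_common` are reasonable candidates to inspect in the browser's developer tools before choosing selectors.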

In [None]:
from bs4 import BeautifulSoup

# Read the HTML file
file = '/content/sample_data/rotisserie-chicken.html'

with open(file, "r", encoding="utf-8") as file_obj:
    html_content = file_obj.read()

soup = BeautifulSoup(html_content, "html.parser")

In [29]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Rotisserie Chicken'

In [30]:
# Extract the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


## Part 2: Extract the desired data

* Search the HTML content and extract the information.

In [31]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Rotisserie Chicken
Description: Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Instructions:
1. Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.
2. Here's what you'll need to make rotisserie chicken at home:
3. · Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is simply seasoned with salt, pepper, and paprika.
4. You'll find the full, step-by-step recipe below — b
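
The fields extracted above can be stored as one JSON object per line (JSON Lines), a common layout for a text corpus. The following is a minimal sketch under stated assumptions: the `corpus.jsonl` file name and the shortened `record` dictionary are hypothetical stand-ins for the notebook's variables.

```python
import json
import os
import tempfile

# Hypothetical record; in the notebook these values would come from the
# variables title, description, ingredients, instructions, nutrition_facts.
record = {
    "title": "Rotisserie Chicken",
    "ingredients": ["1 (3 pound) whole chicken", "¼ cup butter, melted"],
    "source": "rotisserie-chicken.html",
}

# corpus.jsonl is an assumed file name; a temp dir keeps the sketch side-effect free
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")

with open(corpus_path, "a", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters such as "¼" readable in the file
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the corpus back, one record per line
with open(corpus_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Opening the file in append mode means each newly scraped recipe adds one line without rewriting the records already saved.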

## Part 3: Collect related links
* Find links to other recipes to complete the corpus.

In [32]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://support.people.inc/hc/en-us/categories/360003648613-Allrecipes
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-stews-and-chili/
https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/
https://www.a
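
The substring test `"recipe" in href` also matches login, logout, and category links, as the output above shows. A stricter filter can resolve relative links and keep only paths that look like a single recipe page. This is a sketch under an assumption: the `/recipe/<id>/<slug>/` pattern is inferred from the redirect parameter in the first URL above, and `filter_recipe_urls` and the last two sample hrefs are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.allrecipes.com"  # host taken from the output above
# Assumed pattern for a single-recipe page: /recipe/<numeric id>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/\d+/[^/]+/?$")


def filter_recipe_urls(hrefs, base_url=BASE_URL):
    """Return unique absolute URLs whose path looks like one recipe page."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # skip links to other sites
        if RECIPE_PATH.match(parsed.path) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = [
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F",
    "/account/add-recipe",
    "https://www.myrecipes.com/favorites",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    # Hypothetical hrefs in the assumed single-recipe pattern
    "/recipe/93168/rotisserie-chicken/",
    "https://www.allrecipes.com/recipe/93168/rotisserie-chicken/",
]
filtered = filter_recipe_urls(hrefs)
```

Category pages under `/recipes/` are dropped here, but they could be kept in a separate queue as seeds for finding more recipe pages.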

## Part 4: Run RAG over the collected recipes
* Once the corpus has been built, implement and deploy RAG to search it.

In [None]:
!pip install langchain langchain-community langchain-huggingface chromadb sentence-transformers

In [None]:

from langchain_core.documents import Document

recipe_text = f"""
TÍTULO: {title}
DESCRIPCIÓN: {description}

INGREDIENTES:
{', '.join(ingredients)}

INSTRUCCIONES:
{' '.join(instructions)}

NUTRICIÓN:
{', '.join(nutrition_facts)}
"""

docs = [Document(page_content=recipe_text, metadata={"source": "rotisserie-chicken.html"})]

print("Documento creado exitosamente.")

In [None]:
!pip install langchain-chroma

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Configure the embeddings
print("Cargando modelo de embeddings...")
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create the vector database
# This stores the recipe in a searchable vector store
vector_db = Chroma.from_documents(
    documents=docs,
    embedding=embedding_model
)

# Test the search (retrieval)
query = "¿Cuáles son los ingredientes principales del pollo?"

# Retrieve the stored documents most similar to the question
results = vector_db.similarity_search(query, k=1)

print("\n--- RESPUESTA DEL SISTEMA DE BÚSQUEDA ---")
print(f"Información encontrada:\n{results[0].page_content}")
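
To build intuition for what a similarity search does, the toy sketch below ranks documents by cosine similarity over simple word counts. It is an illustration only and not how Chroma or the `all-MiniLM-L6-v2` model work internally: the real pipeline compares dense embedding vectors. The `bow`, `cosine`, and `top_k` names and the `toy_docs` list are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]


toy_docs = [
    "ingredients: whole chicken, butter, salt, paprika",
    "instructions: preheat the grill and attach the rotisserie",
]
best = top_k("which ingredients does the chicken need", toy_docs, k=1)
```

With a single stored document, as in this notebook, any query returns that document. Retrieval becomes meaningful once the corpus holds several recipes or is split into smaller chunks.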

In [None]:
!pip install langchain-google-genai

In [33]:
# 1. Import the Google chat model integration
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata

# API key read from Colab secrets (stored under the name 'GEMINI')
mi_clave = userdata.get('GEMINI')

# 2. Configure the Google model
llm = ChatGoogleGenerativeAI(
    model="gemini-3-flash-preview",
    google_api_key=mi_clave
)

# 3. Run the RAG query: pass the retrieved context and the question to the model
contexto = results[0].page_content
prompt = f"Usa esta información para responder: {contexto}. Pregunta: {query}"

respuesta = llm.invoke(prompt)

# The model returns a message object, so print its content
print(respuesta.content)

[{'type': 'text', 'text': 'Basado en la información proporcionada, los ingredientes principales para preparar el pollo son:\n\n*   **1 pollo entero** (de aproximadamente 3 libras).\n*   **Mantequilla:** 1/4 de taza (derretida).\n*   **Sal:** Una pizca para la cavidad y 1 cucharada para la mezcla exterior.\n*   **Paprika (pimentón):** 1 cucharada.\n*   **Pimienta negra molida:** 1/4 de cucharada.', 'extras': {'signature': 'EsQLCsELAXLI2nx8sEA5rNFJ7O4Vr0pchiOc/ZLYn0doGZWynWLvEBWGCnbBoZfUpd4MnoC8NwtD2TlzorqxGOTvzyWbiLYuMq3WZbglgqtIcJl0hqiT7zd/3i7pHvxbaNyAX/69D9b0ddTq5bMz3sP6LxopGLOIOieeUpKep5vriEkBboqDjgaAyJn5o4s71n/zV15KGtVZC78LoH2GGvWUvjJtS0mWIpL/hNE2I8/0q66sBl7j35r6dhCJSGD1X9mgWFMkgkB8DALgsIQaYPdy9Tbtwnbg8BlJQNXuYsV3zQhMAAzgJrppj441zOEIemcigUgibQr18WlTofzf6G4zINfqdIptha6MOkqQ8uiYvvqKHIOUKHYXI6LV1dmRW6er9uYgmnM+bzce8mq9w2d6KdTc9H1583yiBpxRyXV8F3qaaS5FmwH7yBEjy6arJjiP2TJ+4+Qro4rIEKf+WTfMch3/rx0tDDUmRNA9YMbbmxM+mtfZPae3KKOvgnTf3Oms+ox7OIMxv8+p/HuaZ1h35RYGRU/CXymtfZvpSoYgUGIrNRbPB+61aLqf9d
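
The output above shows `respuesta.content` as a list of blocks with `type`, `text`, and `extras` keys rather than a plain string. A small helper can keep only the readable text. This is a sketch based on the shape visible in that output. The `message_text` name is hypothetical and the sample values are shortened placeholders.

```python
def message_text(content):
    """Join the text parts of a message content given as a str or a list of blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            # Ignore metadata such as the 'extras' signature
            parts.append(block.get("text", ""))
    return "".join(parts)


# Shape copied from the output above, with shortened placeholder values
sample_content = [
    {"type": "text", "text": "Mantequilla: 1/4 de taza.", "extras": {"signature": "..."}}
]
answer = message_text(sample_content)
```

In the notebook, the last line of the cell could then become `print(message_text(respuesta.content))`, which works whether the model returns a string or a list of blocks.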