# **Web Scraping**

## Part 0: Plan
Identify the data you want to obtain.
Choose the target website.

In [24]:
!pip install beautifulsoup4 requests



## Part 1: Understand the target website
Analyze the structure of the web page to be scraped.
Identify the HTML elements that contain the data you are looking for.
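
One quick way to spot candidate elements is to count which tag/class combinations repeat in a page, since list-like data such as ingredients usually shares a class. The sketch below uses only the standard library's `html.parser` on a tiny made-up snippet; `sample_html`, `TagClassCounter`, and `count_tag_classes` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Counts (tag, class) pairs seen in start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get("class")
        if class_attr:
            # An element can carry several space-separated classes
            for class_name in class_attr.split():
                self.counts[(tag, class_name)] += 1


def count_tag_classes(html):
    parser = TagClassCounter()
    parser.feed(html)
    return parser.counts


# Made-up HTML standing in for a saved recipe page
sample_html = """
<ul>
  <li class="ingredient">1 whole chicken</li>
  <li class="ingredient">1 pinch salt</li>
</ul>
<p class="intro">A simple recipe.</p>
"""

for (tag, class_name), n in count_tag_classes(sample_html).most_common():
    print(f"{tag}.{class_name}: {n}")
```

Combinations that appear many times are good candidates to pass to `soup.find_all(tag, class_=...)`.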

In [25]:
from bs4 import BeautifulSoup

html_path = '/content/sample_data/data/12webcrawling/Rotisserie Chicken Recipe.html'

# Load the HTML file (the handle gets its own name so the path is not overwritten)
with open(html_path, "r", encoding="utf-8") as f:
    html_content = f.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [26]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Rotisserie Chicken'

In [27]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


## Part 2: Extract the desired data
Search the HTML content and extract the information.

In [28]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)

Recipe Title: Rotisserie Chicken
Description: Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Instructions:
1. Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.
2. Here's what you'll need to make rotisserie chicken at home:
3. · Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is simply seasoned with salt, pepper, and paprika.
4. You'll find the full, step-by-step recipe below — b

## Part 3: Collect related links
Find links to other recipes to complete the corpus.

In [29]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
https://www.magazines.com/allrecipes-magazine.html?utm_source=allrecipes.com&utm_medium=owned&utm_campaign=i111arr1w2661
https://www.magazines.com/allrecipes-magazine.html
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-s
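
The `"recipe" in href` filter also matches login, logout, and category links, as the output above shows. A stricter sketch, assuming individual recipe pages follow the `/recipe/<id>/<slug>/` path pattern seen in this page's own URL (an assumption about the site, not something verified here), resolves relative links and keeps only unique matches. `BASE_URL`, `RECIPE_PATH`, and `filter_recipe_urls` are hypothetical names, and the last two sample links are made up for illustration.

```python
import re
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.allrecipes.com/"
# Assumed shape of a single-recipe path: /recipe/<numeric id>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/\d+/[^/]+/?$")


def filter_recipe_urls(hrefs, base_url=BASE_URL):
    """Keeps unique same-site links whose path looks like one recipe page."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        parsed = urlparse(urljoin(base_url, href))  # resolve relative links
        if parsed.netloc != base_host:
            continue
        if not RECIPE_PATH.match(parsed.path):
            continue
        # Drop query strings and fragments so duplicates collapse
        clean = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


hrefs = [
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F",
    "/account/add-recipe",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    "/recipe/93168/rotisserie-chicken/",
    "https://www.allrecipes.com/recipe/93168/rotisserie-chicken/?print",
]
print(filter_recipe_urls(hrefs))
```

Matching on the parsed path rather than the whole string is what excludes the login link, whose only mention of a recipe is inside an encoded query parameter.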

## Part 4: Run RAG over the collected recipes
Once the corpus is built, implement and deploy RAG to search it.
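
The retrieval half of RAG boils down to: embed the query, find the nearest documents, and pass them to the language model as context. The sketch below illustrates that with toy word-count vectors and a brute-force squared-L2 search in plain Python; the vocabulary, documents, and helper names are made up and only stand in for a real embedding model and vector index.

```python
def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary (not a real model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve(query, documents, vocabulary, top_k=2):
    """Returns the top_k documents closest to the query, nearest first."""
    query_vec = embed(query, vocabulary)
    scored = [
        (squared_l2(query_vec, embed(doc, vocabulary)), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0])
    return [doc for _, doc in scored[:top_k]]


vocabulary = ["chicken", "paprika", "salt", "chocolate", "sugar"]
documents = [
    "chicken with paprika and salt",
    "chocolate cake with sugar",
    "salt crusted chicken",
]

# Retrieved documents become the context the model is asked to answer from
context = "\n---\n".join(retrieve("paprika chicken", documents, vocabulary))
prompt = f"Context:\n{context}\n\nQuestion: which spices go on the chicken?"
print(prompt)
```

The quality of the answer depends directly on what is retrieved: if the corpus contains pages that are not recipes, they can end up as the context.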

In [30]:
!pip install sentence-transformers faiss-cpu openai



In [39]:
import requests
from bs4 import BeautifulSoup

def extraer_receta(url):
    # Fetch the page; the timeout keeps a slow server from hanging the notebook
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Look up each meta tag once and fall back to a default when it is missing
    title_tag = soup.find("meta", {"property": "og:title"})
    title = title_tag["content"] if title_tag else "Sin título"
    description_tag = soup.find("meta", {"name": "description"})
    description = description_tag["content"] if description_tag else ""

    # Same ingredient selector that worked on the saved page in Parts 1 and 2
    # (assumes the live pages use the same class name)
    ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
    ingredients = [i.get_text().strip() for i in ingredients_section]

    instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
    instructions = [i.get_text().strip() for i in instructions_section]

    nutrition_section = soup.find_all("span", class_="mntl-nutrition-facts-label__text")
    nutrition = [n.parent.get_text().strip().replace('\n', ' ') for n in nutrition_section]

    # Join everything into a single text document
    content = f"{description}\n\nIngredientes:\n" + "\n".join(ingredients) + "\n\nInstrucciones:\n" + "\n".join(instructions)
    if nutrition:
        content += "\n\nInformación nutricional:\n" + "\n".join(nutrition)

    return {"title": title, "content": content}

# List of URLs obtained in Part 3
urls = [
    "https://www.allrecipes.com/recipe/83557/rotisserie-chicken/",
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F/account/add-recipe",
    "https://www.myrecipes.com/favorites",
    "https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F",
    "https://www.magazines.com/allrecipes-magazine.html?utm_source=allrecipes.com&utm_medium=owned&utm_campaign=i111arr1w2661",
    "https://www.allrecipes.com/recipes/86/world-cuisine/",
    "https://www.allrecipes.com/kitchen-tips/",
    "https://www.allrecipes.com/food-news-trends/",
    "https://www.allrecipes.com/recipes/1642/everyday-cooking/",
    "https://www.dotdashmeredith.com/brands/food-drink/allrecipes",
]

# Full corpus
corpus = [extraer_receta(url) for url in urls]
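
Because the URL list can include pages that are not recipes, some corpus entries may end up with no ingredients at all. A minimal sketch for dropping them before indexing, assuming each entry's `content` follows the `Ingredientes:` / `Instrucciones:` layout built by `extraer_receta`; `has_ingredients` is a hypothetical helper and the sample entries are made up.

```python
def has_ingredients(doc):
    """True if there is non-blank text between the two section labels."""
    content = doc["content"]
    start = content.find("Ingredientes:")
    end = content.find("Instrucciones:")
    if start == -1 or end == -1 or end < start:
        return False
    section = content[start + len("Ingredientes:"):end]
    return bool(section.strip())


# Made-up entries shaped like the dictionaries returned by extraer_receta
sample_corpus = [
    {"title": "Rotisserie Chicken",
     "content": "Desc\n\nIngredientes:\n1 whole chicken\n1 pinch salt\n\nInstrucciones:\nGrill it."},
    {"title": "World Cuisine",
     "content": "Desc\n\nIngredientes:\n\n\nInstrucciones:\nDesc"},
]

recipes_only = [doc for doc in sample_corpus if has_ingredients(doc)]
print([doc["title"] for doc in recipes_only])
```

Applied to the real data, this would be `corpus = [doc for doc in corpus if has_ingredients(doc)]` before computing embeddings.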


In [40]:
!pip install -q sentence-transformers faiss-cpu openai

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [doc["content"] for doc in corpus]
titles = [doc["title"] for doc in corpus]

# Compute embeddings
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

# Create the FAISS index (exact search by L2 distance)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)



In [47]:
import os
from openai import OpenAI

# Read the API key from the environment; never hard-code it in the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def rag_query(query, top_k=3):
    # Embed the query and retrieve the top_k nearest documents
    q_emb = model.encode([query])
    D, I = index.search(q_emb, top_k)
    # Use every retrieved document as context (FAISS returns -1 for missing results)
    contexto = "\n\n---\n\n".join(corpus[i]["content"] for i in I[0] if i != -1)

    # Show the context that is sent to the model
    print("🧾 CONTEXTO ENVIADO A GPT:\n")
    print(contexto)

    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Eres un chef experto que responde usando solo la receta dada."},
            {"role": "user", "content": f"Contexto: {contexto}"},
            {"role": "user", "content": f"Pregunta: {query}"}
        ],
        temperature=0.2
    )
    return resp.choices[0].message.content


In [55]:
print(rag_query("¿Qué especias necesito para el pollo rostizado?", top_k=1))


🧾 CONTEXTO ENVIADO A GPT:

Taste your way around the world with these recipes from global cuisines, from classic dishes to creative fusions.

Ingredientes:


Instrucciones:
Taste your way around the world with these recipes from global cuisines, from classic dishes to creative fusions.
Lo siento, pero no se proporcionó información específica sobre los ingredientes necesarios para el pollo rostizado en el texto proporcionado. ¿Te gustaría que te proporcione una receta estándar de pollo rostizado con especias comunes?
