# Exercise 12: Web Scraping
## Objective of the exercise
The goal of this exercise is to build a web scraper that collects data from a website.
### Part 0: Plan
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
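
One way to make step 3 concrete is to write down the record each recipe is expected to produce before scraping anything. The sketch below is only an assumption about what such a record could look like: the class name `RecipeRecord` is hypothetical, and its field names simply mirror the fields extracted later in this notebook.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class RecipeRecord:
    """Hypothetical schema for one corpus entry (illustrative only)."""
    source_url: str
    title: Optional[str] = None
    summary: Optional[str] = None
    ingredients: List[str] = field(default_factory=list)
    rating: Optional[str] = None
    servings: Optional[str] = None
    time: Optional[str] = None
    directions: List[str] = field(default_factory=list)
    nutrition_facts: List[str] = field(default_factory=list)
    image_url: Optional[str] = None

# A partially filled record still exposes every column.
record = RecipeRecord(
    source_url="https://www.allrecipes.com/recipe/93168/rotisserie-chicken/",
    title="Rotisserie Chicken",
)
row = asdict(record)
```

Fixing the columns up front makes it easier to notice, later on, which fields a given page failed to provide.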

### Part 1: Understand the target website
- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the data you are looking for.
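
Before writing selectors, it can help to list which tag and class combinations actually occur in a page. The following is a minimal sketch that uses only the standard library's `html.parser`; the `ClassCounter` name and the inline HTML snippet are made up for illustration and are not taken from the real page.

```python
from collections import Counter
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Counts (tag, class attribute) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get("class")
        if class_attr:
            self.counts[(tag, class_attr)] += 1

# Illustrative snippet, not the real page markup.
sample_html = """
<ul>
  <li class="ingredient">1 pinch salt</li>
  <li class="ingredient">1 tablespoon salt</li>
</ul>
<p class="summary">Easy grilled chicken.</p>
"""

counter = ClassCounter()
counter.feed(sample_html)
most_common = counter.counts.most_common(1)[0]
```

Classes that repeat once per item (here the `li` elements) are usually the ones worth targeting with `find_all`.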

In [None]:
# Download the HTML file
!wget -O rotisserie-chicken.html \
--header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" \
"https://www.allrecipes.com/recipe/93168/rotisserie-chicken/"

--2025-07-26 23:10:12--  https://www.allrecipes.com/recipe/93168/rotisserie-chicken/
Resolving www.allrecipes.com (www.allrecipes.com)... 162.159.141.224, 172.66.1.220, 2a06:98c1:58::1d8, ...
Connecting to www.allrecipes.com (www.allrecipes.com)|162.159.141.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘rotisserie-chicken.html’

rotisserie-chicken.     [ <=>                ] 612.94K  --.-KB/s    in 0.08s   

2025-07-26 23:10:12 (7.37 MB/s) - ‘rotisserie-chicken.html’ saved [627654]



In [None]:
from bs4 import BeautifulSoup

html_path = 'rotisserie-chicken.html'

# Load the HTML file (separate names so the path is not shadowed by the file handle)
with open(html_path, "r", encoding="UTF-8") as f:
    html_content = f.read()

# parse the html content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [None]:
# extracting the recipe title
title = soup.find("meta", {"property":"og:title"})["content"]
title

'Rotisserie Chicken'

In [None]:
# Ingredients
ingredients_section = soup.find_all("li", class_ = "mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


### Part 2: Extract the desired data
- Search the HTML content and extract the information.

In [None]:
# Extracting the summary
summary= soup.find("p", class_ = "article-subheading text-utility-300").text.strip()

In [None]:
# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]
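
Ingredient strings such as `¼ cup butter, melted` begin with a quantity that may be written as a Unicode vulgar fraction. If numeric quantities are needed later, the leading amount can be read with the standard library. This is a sketch under the assumption that the quantity is the first whitespace-separated token; `parse_leading_quantity` is a hypothetical helper and not part of the notebook, and the `1½ cups flour` string in the usage lines is an invented example.

```python
import unicodedata
from typing import Optional

def parse_leading_quantity(ingredient: str) -> Optional[float]:
    """Return the numeric value of the first token, or None if it is not a number."""
    tokens = ingredient.split()
    if not tokens:
        return None
    token = tokens[0]

    # Plain numbers such as "1" or "2.5".
    if token.replace(".", "", 1).isdecimal():
        return float(token)

    # Tokens such as "¼" or "1½": optional integer digits plus one numeric character.
    whole, last = token[:-1], token[-1]
    try:
        fraction = unicodedata.numeric(last)
    except ValueError:
        return None
    if whole and not whole.isdecimal():
        return None
    return (int(whole) if whole else 0) + fraction

quarter = parse_leading_quantity("¼ cup butter, melted")
one = parse_leading_quantity("1 (3 pound) whole chicken")
mixed = parse_leading_quantity("1½ cups flour")
```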

In [None]:
# Extracting the rating
review= soup.find("div", class_ = "comp mm-recipes-review-bar__rating mntl-text-block text-label-300").text.strip()

In [None]:
# Extracting the number of Servings
import re

serving_results = soup.find_all("div", class_="mm-recipes-details__value")

for serving in serving_results:
    text = serving.text.strip()
    if re.fullmatch(r"\d+", text):  # Only if the value is an integer
        servings = text

In [None]:
# Extracting the time

# All detail items
details = soup.find_all("div", class_="mm-recipes-details__item")

for item in details:
    label = item.find("div", class_="mm-recipes-details__label")
    value = item.find("div", class_="mm-recipes-details__value")
    # extract the time
    if label and label.text.strip() == "Total Time:":
        time = value.text.strip()
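
The extracted time is a display string such as `1 hr 30 mins`. If recipes need to be sorted or filtered by duration, that string can be converted to minutes. The helper below is a sketch, assuming the only units are `hr`/`hrs`, `min`/`mins` and possibly `day`/`days` (the last one is a guess, since no such value appears in this notebook); `duration_to_minutes` is a hypothetical name.

```python
import re

def duration_to_minutes(text: str) -> int:
    """Convert strings like '1 hr 30 mins' into a number of minutes."""
    total = 0
    # Each match is (amount, unit prefix); plural forms share the same prefix.
    for amount, unit in re.findall(r"(\d+)\s*(day|hr|min)", text.lower()):
        if unit == "day":
            total += int(amount) * 24 * 60
        elif unit == "hr":
            total += int(amount) * 60
        else:
            total += int(amount)
    return total

total_minutes = duration_to_minutes("1 hr 30 mins")
```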

In [None]:
# directions section
li_items = soup.find_all("li", class_="comp mntl-sc-block mntl-sc-block-startgroup mntl-sc-block-group--LI")
directions = []
# Iterate over them and look for their child <p>
for li in li_items:
    p_tag = li.find("p")
    if p_tag:
        directions.append(p_tag.text.strip())

In [None]:
# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

In [None]:
# Extracting the image
def extraer_url_imagen_receta(soup):

    article_content = soup.find('div', class_='loc article-content')

    if not article_content:
        return None

    image_url = None

    # Look for the URL in the video tag
    video_tag = article_content.find('video')
    if video_tag:
        if video_tag.has_attr('data-poster'):
            image_url = video_tag['data-poster']
        elif video_tag.has_attr('poster'):
            image_url = video_tag['poster']

    # If nothing was found in the video tag, look in the image tag
    if not image_url:
        figure_tag = article_content.find('figure')
        if figure_tag:
            try:
                img_tag = figure_tag.find('div').find('div').find('img')
                if img_tag and img_tag.has_attr('src'):
                    image_url = img_tag['src']
            except AttributeError:
                pass

    return image_url

In [None]:
url_imagen = extraer_url_imagen_receta(soup)

print(f"Extracted URL: {url_imagen}")

Extracted URL: https://cdn.jwplayer.com/v2/media/H1YIV7s6/thumbnails/9pjeduwU.jpg


In [None]:
# Print the extracted information
print("Title:", title)
print("Summary:", summary)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Rating:", review)
print("Servings:", servings)
print("Time:", time)
print("Directions:")
for i, direction in enumerate(directions, 1):
    print(f"{i}." + direction)
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)
print("Image URL:", url_imagen)

Title: Rotisserie Chicken
Summary: This rotisserie chicken recipe is so easy to make with simple seasonings on your grill. Occasional basting with a butter mixture ensures crispy skin and moist meat. Our family loves this! Rotisserie chicken is perfect as the main dish with French fries and coleslaw, or with any number of other sides.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Rating: 4.7
Servings: 6
Time: 1 hr 30 mins
Directions:
1.Gather all ingredients. Preheat an outdoor grill for high heat and lightly oil the grate.
2.Season chicken cavity with a pinch of salt. Tie legs together with kitchen string; then tie wings to the bird. Secure chicken on a rotisserie attachment.
3.Place rotisserie over the preheated grill and cook for 10 minutes.
4.Meanwhile, quickly mix together butter, 1 tablespoon of salt, paprika, and pepper. Turn the grill down to medium and baste 

### Part 3: Collect related links
- Find links to other recipes to complete the corpus.
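
Links found in a page can be relative, or can carry query strings and fragments that make the same recipe look like several different URLs. The sketch below normalizes a candidate `href` with `urllib.parse` before it is stored. The links observed in this notebook are already absolute, so the relative-link case is an assumption, and `normalize_recipe_url` and `BASE_URL` are hypothetical names.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.allrecipes.com/"

def normalize_recipe_url(href: Optional[str]) -> Optional[str]:
    """Return an absolute recipe URL without query or fragment, or None if href is not a recipe link."""
    if not href:
        return None
    absolute = urljoin(BASE_URL, href)  # resolves relative links such as '/recipe/...'
    parts = urlparse(absolute)
    if parts.netloc != "www.allrecipes.com" or not parts.path.startswith("/recipe/"):
        return None
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

clean = normalize_recipe_url("/recipe/14531/beer-butt-chicken/?utm=x#top")
```

Storing only the normalized form keeps duplicate detection reliable when the same recipe is linked in slightly different ways.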

In [None]:
# List to store the URLs found
recipe_links = []

# Find all hyperlinks (<a>) with the specific class used by the recipe cards
link_elements = soup.find_all('a', class_='comp mntl-card-list-items mntl-universal-card mntl-document-card mntl-card card card--no-image')

# Iterate over the matched elements and extract the URL
for link in link_elements:
    # Get the value of the 'href' attribute, which contains the URL
    href = link.get('href')

    # Make sure the link exists and points to a recipe
    # (recipe URLs start with 'https://www.allrecipes.com/recipe/').
    if href and href.startswith('https://www.allrecipes.com/recipe/'):
        recipe_links.append(href)

In [None]:
# Show the results
print(f"✅ Found {len(recipe_links)} recipe links.")
for url in recipe_links:
    print(url)

✅ Found 16 recipe links.
https://www.allrecipes.com/recipe/238575/cilantro-lime-grilled-chicken/
https://www.allrecipes.com/recipe/275062/buttermilk-barbecue-chicken/
https://www.allrecipes.com/recipe/274724/grilled-spatchcocked-chicken/
https://www.allrecipes.com/recipe/14531/beer-butt-chicken/
https://www.allrecipes.com/recipe/221093/good-frickin-paprika-chicken/
https://www.allrecipes.com/recipe/264278/miso-honey-chicken/
https://www.allrecipes.com/recipe/258659/rosemary-buttermilk-chicken/
https://www.allrecipes.com/recipe/222936/smoked-beer-butt-chicken/
https://www.allrecipes.com/recipe/228070/the-best-beer-can-chicken-ever/
https://www.allrecipes.com/recipe/214619/bbq-beer-can-chicken/
https://www.allrecipes.com/recipe/19944/drunk-chicken/
https://www.allrecipes.com/recipe/275044/grilled-chicken-under-a-brick/
https://www.allrecipes.com/recipe/281255/smoked-whole-chicken/
https://www.allrecipes.com/recipe/34957/easy-barbeque-chicken/
https://www.allrecipes.com/re

In [None]:
import requests
import time

def crear_corpus_recetas(url_inicial, cantidad_maxima=100):

    # Queue of URLs still to visit. We start with the seed URL.
    urls_a_visitar = [url_inicial]

    # Set of URLs that have already been visited or added to the queue.
    urls_visitadas = {url_inicial}

    # Final list where the recipe links found are stored.
    enlaces_recetas_encontrados = []

    print(f"🤖 Starting crawler at: {url_inicial}")
    print("-------------------------------------------------")

    # MAIN CRAWLER LOOP
    # Runs while there are URLs in the queue and the target has not been reached.
    while urls_a_visitar and len(enlaces_recetas_encontrados) < cantidad_maxima:

        # Take the first URL from the queue to process it.
        url_actual = urls_a_visitar.pop(0)

        # Add the current URL to the final list of recipes.
        # Note: it is added before being fetched, so URLs that fail below are still counted.
        enlaces_recetas_encontrados.append(url_actual)
        print(f"[{len(enlaces_recetas_encontrados)}/{cantidad_maxima}] Processing: {url_actual}")

        try:
            # FETCH AND PARSE THE HTML

            headers = {'User-Agent': 'RecipeCorpusCrawler/1.0'}

            response = requests.get(url_actual, headers=headers, timeout=10)

            # If the request was not successful, skip to the next URL.
            if response.status_code != 200:
                print(f"  -> Error: could not access the URL (status code: {response.status_code})")
                time.sleep(1)  # keep the pause even when skipping
                continue

            soup = BeautifulSoup(response.content, 'html.parser')

            # EXTRACT NEW LINKS FROM THE CURRENT PAGE
            nuevos_enlaces = soup.find_all('a', class_='comp mntl-card-list-items mntl-universal-card mntl-document-card mntl-card card card--no-image')
            print(f"  -> Found {len(nuevos_enlaces)} new links.")

            for link in nuevos_enlaces:
                href = link.get('href')

                # Check that the link is valid, is a recipe, and has not been seen before.
                if href and href.startswith('https://www.allrecipes.com/recipe/') and href not in urls_visitadas:
                    # If the link is new and valid, add it to the queue and to the visited set.
                    urls_visitadas.add(href)
                    urls_a_visitar.append(href)

        except requests.exceptions.RequestException as e:
            print(f"  -> Network error while trying to access {url_actual}: {e}")

        # Pause for 1 second between requests so as not to overload the site.
        time.sleep(1)

    print("-------------------------------------------------")
    print(f"✅ Process completed. Total links collected: {len(enlaces_recetas_encontrados)}")
    return enlaces_recetas_encontrados

In [None]:
# --- RUN THE CRAWLER
url_semilla = 'https://www.allrecipes.com/recipe/93168/rotisserie-chicken/'
corpus_final = crear_corpus_recetas(url_semilla, cantidad_maxima=100)

# First 10 links of the final corpus
print("\nFirst 10 links in the corpus:")
for url in corpus_final[:10]:
    print(url)

🤖 Starting crawler at: https://www.allrecipes.com/recipe/93168/rotisserie-chicken/
-------------------------------------------------
[1/100] Processing: https://www.allrecipes.com/recipe/93168/rotisserie-chicken/
  -> Found 16 new links.
[2/100] Processing: https://www.allrecipes.com/recipe/238575/cilantro-lime-grilled-chicken/
  -> Found 16 new links.
[3/100] Processing: https://www.allrecipes.com/recipe/275062/buttermilk-barbecue-chicken/
  -> Found 16 new links.
[4/100] Processing: https://www.allrecipes.com/recipe/274724/grilled-spatchcocked-chicken/
  -> Found 16 new links.
[5/100] Processing: https://www.allrecipes.com/recipe/14531/beer-butt-chicken/
  -> Found 16 new links.
[6/100] Processing: https://www.allrecipes.com/recipe/221093/good-frickin-paprika-chicken/
  -> Found 16 new links.
[7/100] Processing: https://www.allrecipes.com/recipe/264278/miso-honey-chicken/
  -> Found 16 ne
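
The crawler above is a breadth-first traversal: a queue of pending URLs plus a set of URLs already seen. The same logic can be tried without any network access by replacing the page fetch with a function that returns links. The sketch below does that on a made-up link graph; `crawl_breadth_first`, `get_links` and `toy_graph` are hypothetical names, and it uses `collections.deque` instead of a list so that removing the first element stays cheap as the queue grows.

```python
from collections import deque
from typing import Callable, Iterable, List

def crawl_breadth_first(seed: str,
                        get_links: Callable[[str], Iterable[str]],
                        max_pages: int) -> List[str]:
    """Visit pages breadth-first, never enqueueing the same URL twice."""
    queue = deque([seed])
    seen = {seed}
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        current = queue.popleft()  # O(1), unlike list.pop(0)
        visited.append(current)
        for link in get_links(current):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for real pages.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
order = crawl_breadth_first("a", lambda url: toy_graph.get(url, []), max_pages=3)
```

Because `seen` is updated when a link is enqueued rather than when it is visited, a page linked from several places enters the queue only once.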

In [None]:
# For each URL in corpus_final, extract the title, summary, ingredients, rating, servings, time, directions, nutrition facts and image URL, and group them in a DataFrame
import pandas as pd

def scrape_recipe_details(url):
    """
    Visit a recipe URL and extract all of its details.
    Returns a dictionary with the recipe information, or None if the page could not be fetched.
    """
    try:
        headers = {'User-Agent': 'RecipeScraper/2.0'}
        response = requests.get(url, headers=headers, timeout=15)
        if response.status_code != 200:
            print(f"  -> Error accessing {url}, status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"  -> Network error for {url}: {e}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    recipe_data = {}

    try:
        # Title
        recipe_data['title'] = soup.find("meta", {"property":"og:title"})["content"]
    except (TypeError, KeyError):
        # find() returns None when the tag is missing, and None["content"] raises TypeError;
        # a tag without the attribute raises KeyError.
        recipe_data['title'] = None

    try:
        # Summary
        recipe_data['summary'] = soup.find("p", class_="article-subheading text-utility-300").text.strip()
    except AttributeError:
        recipe_data['summary'] = None

    try:
        # Ingredients
        # Note: get_text(strip=True) joins child strings without a separator,
        # which is why the DataFrame shows values such as '1(3 pound)whole chicken'.
        ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
        recipe_data['ingredients'] = [ing.get_text(strip=True) for ing in ingredients_section]
    except Exception:
        recipe_data['ingredients'] = []

    try:
        # Rating
        recipe_data['rating'] = soup.find("div", class_="comp mm-recipes-review-bar__rating mntl-text-block text-label-300").text.strip()
    except AttributeError:
        recipe_data['rating'] = None

    try:
        # Servings and time
        servings, total_time = None, None
        details_items = soup.find_all("div", class_="mm-recipes-details__item")
        for item in details_items:
            label = item.find("div", class_="mm-recipes-details__label").text.strip()
            value = item.find("div", class_="mm-recipes-details__value").text.strip()
            if label == "Servings:":
                servings = value
            elif label == "Total Time:":
                total_time = value
        recipe_data['servings'] = servings
        recipe_data['time'] = total_time
    except Exception:
        recipe_data['servings'] = None
        recipe_data['time'] = None

    try:
        # Directions
        li_items = soup.find_all("li", class_="comp mntl-sc-block mntl-sc-block-startgroup mntl-sc-block-group--LI")
        directions = [li.find("p").text.strip() for li in li_items if li.find("p")]
        recipe_data['directions'] = directions
    except Exception:
        recipe_data['directions'] = []

    try:
        # Nutrition Facts
        nutrition_section = soup.find_all("tr", class_="mm-recipes-nutrition-facts-summary__table-row")
        nutrition_facts = [fact.get_text(strip=True).replace('\n', ' ').replace('  ', ' ') for fact in nutrition_section]
        recipe_data['nutrition_facts'] = nutrition_facts
    except Exception:
        recipe_data['nutrition_facts'] = []

    # Image URL
    recipe_data['image_url'] = extraer_url_imagen_receta(soup)

    # Recipe URL
    recipe_data['source_url'] = url

    return recipe_data
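
One detail worth checking when wrapping lookups in `try`/`except`: when `find` returns `None`, attribute access such as `.text` raises `AttributeError`, but subscripting such as `["content"]` raises `TypeError`. The snippet below reproduces both cases with a plain `None`, without depending on BeautifulSoup; `first_error_name` is a hypothetical helper used only for this demonstration.

```python
from typing import Callable

def first_error_name(action: Callable[[], object]) -> str:
    """Run action and return the name of the exception it raises, or 'no error'."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return "no error"

missing_tag = None  # what find() returns when nothing matches

attribute_case = first_error_name(lambda: missing_tag.text)
subscript_case = first_error_name(lambda: missing_tag["content"])
```

An `except AttributeError` clause therefore protects `.text` lookups but lets a missing `og:title` tag propagate as an uncaught `TypeError`.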

In [None]:
all_recipes_data = []

print(f"🍲 Starting to scrape {len(corpus_final)} recipes...")

# Iterate over every URL in the corpus
for i, url in enumerate(corpus_final):
    print(f"Processing [{i+1}/{len(corpus_final)}]: {url}")

    data = scrape_recipe_details(url)

    # If the function returned data, add it to the list
    if data:
        all_recipes_data.append(data)

    # Pause so as not to overload the server
    time.sleep(1)

print("\n✅ Scraping completed.")

# Build the DataFrame from the list of dictionaries
df_recetas = pd.DataFrame(all_recipes_data)

print("DataFrame created successfully:")
print("\nDataFrame info:")
df_recetas.info()

print("\nFirst 5 rows of the DataFrame:")
print(df_recetas.head())

🍲 Starting to scrape 100 recipes...
Processing [1/100]: https://www.allrecipes.com/recipe/93168/rotisserie-chicken/
Processing [2/100]: https://www.allrecipes.com/recipe/238575/cilantro-lime-grilled-chicken/
Processing [3/100]: https://www.allrecipes.com/recipe/275062/buttermilk-barbecue-chicken/
Processing [4/100]: https://www.allrecipes.com/recipe/274724/grilled-spatchcocked-chicken/
Processing [5/100]: https://www.allrecipes.com/recipe/14531/beer-butt-chicken/
Processing [6/100]: https://www.allrecipes.com/recipe/221093/good-frickin-paprika-chicken/
Processing [7/100]: https://www.allrecipes.com/recipe/264278/miso-honey-chicken/
Processing [8/100]: https://www.allrecipes.com/recipe/258659/rosemary-buttermilk-chicken/
Processing [9/100]: https://www.allrecipes.com/recipe/222936/smoked-beer-butt-chicken/
Processing [10/100]: https://www.allrecipes.com/recipe/228070/the-best-beer-can-chicken-ever/
Processing [11/100]: https://www.allrecipes.com/recipe/214619/bbq-beer-can-chicken/


In [None]:
df_recetas

Unnamed: 0,title,summary,ingredients,rating,servings,time,directions,nutrition_facts,image_url,source_url
0,Rotisserie Chicken,This rotisserie chicken recipe is so easy to m...,"[1(3 pound)whole chicken, 1pinchsalt, ¼cupbutt...",4.7,6,1 hr 30 mins,[Gather all ingredients. Preheat an outdoor gr...,"[357Calories, 25gFat, 1gCarbs, 31gProtein]",https://cdn.jwplayer.com/v2/media/H1YIV7s6/thu...,https://www.allrecipes.com/recipe/93168/rotiss...
1,Cilantro-Lime Grilled Chicken,This marinated cilantro-lime grilled chicken r...,"[½cupchopped fresh cilantro, 4limes, juiced, 2...",4.6,6,1 hr 15 mins,"[Whisk cilantro, lime juice, garlic salt, and ...","[258Calories, 10gFat, 4gCarbs, 38gProtein]",https://www.allrecipes.com/thmb/IJPCQCe7l0H6vy...,https://www.allrecipes.com/recipe/238575/cilan...
2,Buttermilk Barbecue Chicken,Buttermilk is a very popular marinade for frie...,"[2cupsbuttermilk, ¼cupbrown sugar, 1tablespoon...",4.8,6,7 hrs 5 mins,"[Whisk buttermilk, brown sugar, cider vinegar,...","[323Calories, 11gFat, 15gCarbs, 40gProtein]",https://cdn.jwplayer.com/v2/media/dy4pDAAG/pos...,https://www.allrecipes.com/recipe/275062/butte...
3,Grilled Spatchcocked Chicken,"To spatchcock a chicken, you need to remove th...","[¼cupkosher salt, water, 1(4 pound)whole chick...",4.7,6,7 hrs 40 mins,[Place salt in a large bowl or Dutch oven; add...,"[345Calories, 13gFat, 5gCarbs, 50gProtein]",https://www.allrecipes.com/thmb/4I56STrguOMDZt...,https://www.allrecipes.com/recipe/274724/grill...
4,Beer Butt Chicken,"For this beer butter chicken, all you need is ...","[1cupbutter, divided, 2tablespoonsgarlic salt,...",4.7,8,3 hrs 30 mins,[Preheat an outdoor grill for low heat and lig...,"[514Calories, 40gFat, 3gCarbs, 31gProtein]",https://www.allrecipes.com/thmb/eEFuH5W5XYsEkD...,https://www.allrecipes.com/recipe/14531/beer-b...
...,...,...,...,...,...,...,...,...,...,...
95,Baked Chicken with Peaches,Rushed? Need an elegant main dish to serve for...,"[8skinless, boneless chicken breast halves, 1c...",4.0,8,45 mins,[Preheat oven to 350 degrees F (175 degrees C)...,"[248Calories, 3gFat, 30gCarbs, 25gProtein]",https://www.allrecipes.com/thmb/L_KkdtGbzm4Rez...,https://www.allrecipes.com/recipe/19880/baked-...
96,Easy Garlic and Rosemary Chicken,"This rosemary chicken recipe is a simple, flav...","[2skinless, boneless chicken breasts, 2clovesg...",3.9,2,30 mins,[Preheat the oven to 375 degrees F (190 degree...,"[147Calories, 2gFat, 4gCarbs, 28gProtein]",https://www.allrecipes.com/thmb/gL6TSvdbVEVv99...,https://www.allrecipes.com/recipe/8844/easy-ga...
97,Garlic Ranch Chicken,This is very easy and fast to make; using fat ...,"[4skinless, boneless chicken breasts, 1cupfat ...",4.2,4,1 hr 5 mins,"[Combine the dressing, garlic and basil in a l...","[232Calories, 2gFat, 23gCarbs, 28gProtein]",https://www.allrecipes.com/thmb/A6KNoc7EUC06mp...,https://www.allrecipes.com/recipe/14517/garlic...
98,German Chicken,"Boneless, roasted chicken breasts cooked over ...","[4skinless, boneless chicken breast halves, 1c...",4.0,4,40 mins,[Preheat the oven to 350 degrees F (175 degree...,"[253Calories, 2gFat, 29gCarbs, 29gProtein]",https://www.allrecipes.com/thmb/hIQpZKB_FXj5Nn...,https://www.allrecipes.com/recipe/8781/german-...


#### Text preprocessing before computing embeddings

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [None]:
def preprocesar_texto(texto):
    """
    Preprocess the text:
    - Rejoin words split by a hyphen and a line break.
    - Remove the remaining line breaks and tabs.
    - Remove special characters.
    - Convert to lowercase.
    - Tokenize.
    - Remove stopwords.
    - Apply stemming.
    Returns a string with the processed words separated by spaces.
    """
    if not texto:
        return ""

    # Rejoin words split by a hyphen and a line break
    texto = re.sub(r"-\n([a-z])", r"\1", texto)

    # Remove leftover line breaks and tabs
    texto = texto.replace("\n", " ").replace("\t", " ")

    # Remove special characters (keep only ASCII letters, digits and spaces)
    texto = re.sub(r"[^a-zA-Z0-9 ]", " ", texto)

    # Convert to lowercase
    texto = texto.lower()

    # Tokenize
    tokens = word_tokenize(texto)

    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    # Join the tokens back together
    texto_procesado = " ".join(tokens)

    return texto_procesado
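
The regular-expression steps of the function above can be checked in isolation, without NLTK. The sketch below applies the same cleaning order but swaps in `str.split` for the tokenizer, skips stemming, and uses a tiny made-up stopword set (`DEMO_STOPWORDS`); `normalize_text` is a hypothetical name, so the output is only an approximation of what `preprocesar_texto` returns.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"the", "a", "an", "and", "with", "for", "of", "to"}

def normalize_text(text: str) -> str:
    """Apply the cleaning steps in order, without a tokenizer or stemmer."""
    text = re.sub(r"-\n([a-z])", r"\1", text)        # rejoin words split by hyphen + newline
    text = text.replace("\n", " ").replace("\t", " ")
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)        # drop punctuation and symbols
    text = text.lower()
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS]
    return " ".join(tokens)

cleaned = normalize_text("Pre-\nheat the grill,\tand season with ¼ cup of butter!")
```

Note that the ASCII-only filter also removes quantities written as Unicode fractions such as `¼`, so they do not reach the embedding step.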

In [None]:
df_recetas["contenido_preprocesado"] = df_recetas.apply(
    lambda row: preprocesar_texto(
        f"{str(row['title'] or '')}. "
        f"{str(row['summary'] or '')}. "
        f"Ingredients: {' '.join(row['ingredients'] or [])}. "
        f"Directions: {' '.join(row['directions'] or [])}. "
        f"NutritionFacts: {' '.join(row['nutrition_facts'] or [])}"
    ),
    axis=1
)

In [None]:
df_recetas

Unnamed: 0,title,summary,ingredients,rating,servings,time,directions,nutrition_facts,image_url,source_url,contenido_preprocesado
0,Rotisserie Chicken,This rotisserie chicken recipe is so easy to m...,"[1(3 pound)whole chicken, 1pinchsalt, ¼cupbutt...",4.7,6,1 hr 30 mins,[Gather all ingredients. Preheat an outdoor gr...,"[357Calories, 25gFat, 1gCarbs, 31gProtein]",https://cdn.jwplayer.com/v2/media/H1YIV7s6/thu...,https://www.allrecipes.com/recipe/93168/rotiss...,rotisseri chicken rotisseri chicken recip easi...
1,Cilantro-Lime Grilled Chicken,This marinated cilantro-lime grilled chicken r...,"[½cupchopped fresh cilantro, 4limes, juiced, 2...",4.6,6,1 hr 15 mins,"[Whisk cilantro, lime juice, garlic salt, and ...","[258Calories, 10gFat, 4gCarbs, 38gProtein]",https://www.allrecipes.com/thmb/IJPCQCe7l0H6vy...,https://www.allrecipes.com/recipe/238575/cilan...,cilantro lime grill chicken marin cilantro lim...
2,Buttermilk Barbecue Chicken,Buttermilk is a very popular marinade for frie...,"[2cupsbuttermilk, ¼cupbrown sugar, 1tablespoon...",4.8,6,7 hrs 5 mins,"[Whisk buttermilk, brown sugar, cider vinegar,...","[323Calories, 11gFat, 15gCarbs, 40gProtein]",https://cdn.jwplayer.com/v2/media/dy4pDAAG/pos...,https://www.allrecipes.com/recipe/275062/butte...,buttermilk barbecu chicken buttermilk popular ...
3,Grilled Spatchcocked Chicken,"To spatchcock a chicken, you need to remove th...","[¼cupkosher salt, water, 1(4 pound)whole chick...",4.7,6,7 hrs 40 mins,[Place salt in a large bowl or Dutch oven; add...,"[345Calories, 13gFat, 5gCarbs, 50gProtein]",https://www.allrecipes.com/thmb/4I56STrguOMDZt...,https://www.allrecipes.com/recipe/274724/grill...,grill spatchcock chicken spatchcock chicken ne...
4,Beer Butt Chicken,"For this beer butter chicken, all you need is ...","[1cupbutter, divided, 2tablespoonsgarlic salt,...",4.7,8,3 hrs 30 mins,[Preheat an outdoor grill for low heat and lig...,"[514Calories, 40gFat, 3gCarbs, 31gProtein]",https://www.allrecipes.com/thmb/eEFuH5W5XYsEkD...,https://www.allrecipes.com/recipe/14531/beer-b...,beer butt chicken beer butter chicken need who...
...,...,...,...,...,...,...,...,...,...,...,...
95,Baked Chicken with Peaches,Rushed? Need an elegant main dish to serve for...,"[8skinless, boneless chicken breast halves, 1c...",4.0,8,45 mins,[Preheat oven to 350 degrees F (175 degrees C)...,"[248Calories, 3gFat, 30gCarbs, 25gProtein]",https://www.allrecipes.com/thmb/L_KkdtGbzm4Rez...,https://www.allrecipes.com/recipe/19880/baked-...,bake chicken peach rush need eleg main dish se...
96,Easy Garlic and Rosemary Chicken,"This rosemary chicken recipe is a simple, flav...","[2skinless, boneless chicken breasts, 2clovesg...",3.9,2,30 mins,[Preheat the oven to 375 degrees F (190 degree...,"[147Calories, 2gFat, 4gCarbs, 28gProtein]",https://www.allrecipes.com/thmb/gL6TSvdbVEVv99...,https://www.allrecipes.com/recipe/8844/easy-ga...,easi garlic rosemari chicken rosemari chicken ...
97,Garlic Ranch Chicken,This is very easy and fast to make; using fat ...,"[4skinless, boneless chicken breasts, 1cupfat ...",4.2,4,1 hr 5 mins,"[Combine the dressing, garlic and basil in a l...","[232Calories, 2gFat, 23gCarbs, 28gProtein]",https://www.allrecipes.com/thmb/A6KNoc7EUC06mp...,https://www.allrecipes.com/recipe/14517/garlic...,garlic ranch chicken easi fast make use fat fr...
98,German Chicken,"Boneless, roasted chicken breasts cooked over ...","[4skinless, boneless chicken breast halves, 1c...",4.0,4,40 mins,[Preheat the oven to 350 degrees F (175 degree...,"[253Calories, 2gFat, 29gCarbs, 29gProtein]",https://www.allrecipes.com/thmb/hIQpZKB_FXj5Nn...,https://www.allrecipes.com/recipe/8781/german-...,german chicken boneless roast chicken breast c...


#### Compute embeddings for each recipe
##### Loading the model

In [None]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')


##### Computing the embeddings

In [None]:
def generar_embeddings(df, columna_texto="contenido_preprocesado"):
    """
    Generate SBERT embeddings for each row of the DataFrame.
    Parameters:
        df: DataFrame containing the preprocessed text column.
        columna_texto: name of the text column (default 'contenido_preprocesado').
    Returns:
        The same DataFrame with a new 'embedding' column (df is modified in place).
    """
    # List of texts
    textos = df[columna_texto].tolist()

    # Generate embeddings
    embeddings = model.encode(textos, show_progress_bar=True, convert_to_numpy=True)

    # Assign the embeddings to the DataFrame
    df["embedding"] = embeddings.tolist()

    return df

In [None]:
df_recetas = generar_embeddings(df_recetas)


In [None]:
df_recetas

Unnamed: 0,title,summary,ingredients,rating,servings,time,directions,nutrition_facts,image_url,source_url,contenido_preprocesado,embedding
0,Rotisserie Chicken,This rotisserie chicken recipe is so easy to m...,"[1(3 pound)whole chicken, 1pinchsalt, ¼cupbutt...",4.7,6,1 hr 30 mins,[Gather all ingredients. Preheat an outdoor gr...,"[357Calories, 25gFat, 1gCarbs, 31gProtein]",https://cdn.jwplayer.com/v2/media/H1YIV7s6/thu...,https://www.allrecipes.com/recipe/93168/rotiss...,rotisseri chicken rotisseri chicken recip easi...,"[-0.02465100586414337, -0.05288245156407356, -..."
1,Cilantro-Lime Grilled Chicken,This marinated cilantro-lime grilled chicken r...,"[½cupchopped fresh cilantro, 4limes, juiced, 2...",4.6,6,1 hr 15 mins,"[Whisk cilantro, lime juice, garlic salt, and ...","[258Calories, 10gFat, 4gCarbs, 38gProtein]",https://www.allrecipes.com/thmb/IJPCQCe7l0H6vy...,https://www.allrecipes.com/recipe/238575/cilan...,cilantro lime grill chicken marin cilantro lim...,"[-0.08417970687150955, -0.03231288120150566, -..."
2,Buttermilk Barbecue Chicken,Buttermilk is a very popular marinade for frie...,"[2cupsbuttermilk, ¼cupbrown sugar, 1tablespoon...",4.8,6,7 hrs 5 mins,"[Whisk buttermilk, brown sugar, cider vinegar,...","[323Calories, 11gFat, 15gCarbs, 40gProtein]",https://cdn.jwplayer.com/v2/media/dy4pDAAG/pos...,https://www.allrecipes.com/recipe/275062/butte...,buttermilk barbecu chicken buttermilk popular ...,"[-0.031861066818237305, -0.06565631926059723, ..."
3,Grilled Spatchcocked Chicken,"To spatchcock a chicken, you need to remove th...","[¼cupkosher salt, water, 1(4 pound)whole chick...",4.7,6,7 hrs 40 mins,[Place salt in a large bowl or Dutch oven; add...,"[345Calories, 13gFat, 5gCarbs, 50gProtein]",https://www.allrecipes.com/thmb/4I56STrguOMDZt...,https://www.allrecipes.com/recipe/274724/grill...,grill spatchcock chicken spatchcock chicken ne...,"[-0.06601766496896744, -0.057308390736579895, ..."
4,Beer Butt Chicken,"For this beer butter chicken, all you need is ...","[1cupbutter, divided, 2tablespoonsgarlic salt,...",4.7,8,3 hrs 30 mins,[Preheat an outdoor grill for low heat and lig...,"[514Calories, 40gFat, 3gCarbs, 31gProtein]",https://www.allrecipes.com/thmb/eEFuH5W5XYsEkD...,https://www.allrecipes.com/recipe/14531/beer-b...,beer butt chicken beer butter chicken need who...,"[-0.07050492614507675, -0.03555162623524666, -..."
...,...,...,...,...,...,...,...,...,...,...,...,...
95,Baked Chicken with Peaches,Rushed? Need an elegant main dish to serve for...,"[8skinless, boneless chicken breast halves, 1c...",4.0,8,45 mins,[Preheat oven to 350 degrees F (175 degrees C)...,"[248Calories, 3gFat, 30gCarbs, 25gProtein]",https://www.allrecipes.com/thmb/L_KkdtGbzm4Rez...,https://www.allrecipes.com/recipe/19880/baked-...,bake chicken peach rush need eleg main dish se...,"[-0.038459084928035736, -0.042220864444971085,..."
96,Easy Garlic and Rosemary Chicken,"This rosemary chicken recipe is a simple, flav...","[2skinless, boneless chicken breasts, 2clovesg...",3.9,2,30 mins,[Preheat the oven to 375 degrees F (190 degree...,"[147Calories, 2gFat, 4gCarbs, 28gProtein]",https://www.allrecipes.com/thmb/gL6TSvdbVEVv99...,https://www.allrecipes.com/recipe/8844/easy-ga...,easi garlic rosemari chicken rosemari chicken ...,"[-0.03184344246983528, -0.002325143665075302, ..."
97,Garlic Ranch Chicken,This is very easy and fast to make; using fat ...,"[4skinless, boneless chicken breasts, 1cupfat ...",4.2,4,1 hr 5 mins,"[Combine the dressing, garlic and basil in a l...","[232Calories, 2gFat, 23gCarbs, 28gProtein]",https://www.allrecipes.com/thmb/A6KNoc7EUC06mp...,https://www.allrecipes.com/recipe/14517/garlic...,garlic ranch chicken easi fast make use fat fr...,"[-0.06011435389518738, -0.027687720954418182, ..."
98,German Chicken,"Boneless, roasted chicken breasts cooked over ...","[4skinless, boneless chicken breast halves, 1c...",4.0,4,40 mins,[Preheat the oven to 350 degrees F (175 degree...,"[253Calories, 2gFat, 29gCarbs, 29gProtein]",https://www.allrecipes.com/thmb/hIQpZKB_FXj5Nn...,https://www.allrecipes.com/recipe/8781/german-...,german chicken boneless roast chicken breast c...,"[0.013388339430093765, -0.017645588144659996, ..."


### Part 4: RAG over the scraped recipes
- Once the corpus has been built, implement and deploy RAG (retrieval-augmented generation) to run searches over it.

#### Cosine similarity and top-n ranking

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def obtener_indices_top_similares(df, query_embedding, top_n=5):
    """
    Compute cosine similarity and return the indices of the most similar rows.
    """

    # Stack the per-row embeddings into a 2-D matrix of shape (n_recipes, dim)
    matriz_embeddings = np.vstack(df["embedding"].values)

    # cosine_similarity expects 2-D inputs, so reshape the query to (1, dim)
    query_embedding = np.array(query_embedding).reshape(1, -1)

    # Compute cosine similarity between the query and every recipe
    similitudes = cosine_similarity(query_embedding, matriz_embeddings)[0]

    # Get the indices sorted from highest to lowest similarity
    indices_ordenados = np.argsort(similitudes)[::-1]

    # Select the top N
    indices_top_n = indices_ordenados[:top_n]

    return indices_top_n
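
To make the ranking concrete, here is a minimal sketch of the same computation using only NumPy. The function name `top_n_por_coseno` and the toy 2-D vectors are illustrative assumptions, not part of the notebook; for non-zero vectors the ordering should agree with `cosine_similarity` followed by a descending `argsort`.

```python
import numpy as np

def top_n_por_coseno(matriz, consulta, top_n=5):
    """Hypothetical helper: rank rows of `matriz` by cosine similarity to `consulta`."""
    matriz = np.asarray(matriz, dtype=float)
    consulta = np.asarray(consulta, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    normas = np.linalg.norm(matriz, axis=1) * np.linalg.norm(consulta)
    similitudes = (matriz @ consulta) / normas
    # Sort by descending similarity (stable, so ties keep their original order)
    orden = np.argsort(-similitudes, kind="stable")
    return orden[:top_n], similitudes

# Toy 2-D "embeddings", made up for illustration
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, sims = top_n_por_coseno(embeddings, [1.0, 0.0], top_n=2)
print(indices)  # [0 2]
```

Row 0 points in the same direction as the query (similarity 1), row 2 is at 45 degrees (about 0.707), and row 3 points the opposite way (-1), so the top two are rows 0 and 2.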

#### Selecting the desired recipe

In [None]:
!pip install ipywidgets

Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2


In [None]:
import ipywidgets as widgets
from IPython.display import display

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
def construir_contexto_receta(df, indice):
    receta = df.iloc[indice]

    # Fall back to a default when a field is None or empty
    title = receta['title'] or "No disponible"
    summary = receta['summary'] or "No disponible"
    rating = receta['rating'] or "Sin calificación"
    servings = receta['servings'] or "No especificado"
    time = receta['time'] or "No especificado"
    image_url = receta['image_url'] or None

    # Format the list-valued fields as numbered or bulleted lines
    ingredients = "\n".join([f"{i+1}. {ing}" for i, ing in enumerate(receta['ingredients'])]) if receta['ingredients'] else "No especificado"
    directions = "\n".join([f"**Paso {i+1}:** {paso}" for i, paso in enumerate(receta['directions'])]) if receta['directions'] else "No disponible"
    nutrition_facts = "\n".join([f"- {fact}" for fact in receta['nutrition_facts']]) if receta['nutrition_facts'] else "No disponible"

    # Build the context in Markdown format
    contexto = f"""
## 🍳 {title}

{f'<img src="{image_url}" width="300">' if image_url else '*Sin imagen disponible*'}

**📝 Resumen:** {summary}

**⭐ Calificación:** {rating} estrellas
**👥 Porciones:** {servings}
**⏱️ Tiempo:** {time}

**🥕 Ingredientes:**\n
    {ingredients}

**👨‍🍳 Preparación:**\n
    {directions}

**📊 Datos nutricionales:**\n
    {nutrition_facts}
    """

    return contexto
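
One caveat with the `value or default` pattern used above: it covers `None` and empty strings, but a missing numeric cell in a pandas DataFrame is usually `NaN`, which is truthy and would slip through unchanged. A minimal sketch of a stricter fallback follows, assuming a hypothetical helper named `valor_o_defecto` that is not part of the notebook:

```python
import math

def valor_o_defecto(valor, defecto):
    """Hypothetical helper: return `defecto` for None, NaN, or empty values."""
    if valor is None:
        return defecto
    # float("nan") is truthy, so `valor or defecto` would keep it
    if isinstance(valor, float) and math.isnan(valor):
        return defecto
    # Empty strings and empty sequences also fall back to the default
    if isinstance(valor, (str, list, tuple)) and len(valor) == 0:
        return defecto
    return valor

print(valor_o_defecto(float("nan"), "Sin calificación"))  # Sin calificación
print(valor_o_defecto(4.8, "Sin calificación"))           # 4.8
print(float("nan") or "Sin calificación")                 # nan
```

Unlike `or`, this version also keeps a legitimate `0`, which may or may not be desirable for fields such as `rating`.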

In [None]:
query = "barbecue chicken"
preprocessed_query = preprocesar_texto(query)
query_embedding = model.encode(preprocessed_query)

indices_top5 = obtener_indices_top_similares(df_recetas, query_embedding, top_n=5)
print("Índices top 5:", indices_top5)

Índices top 5: [78 32 71 88 82]


In [None]:
indice_seleccionado = None

# Build the radio button options (1-based position plus recipe title)
opciones_recetas = [f"{idx + 1}: {df_recetas.iloc[idx]['title']}" for idx in indices_top5]

# Create the widgets (radio buttons + button)
radio = widgets.RadioButtons(
    options=opciones_recetas,
    description='Elige una receta:',
    disabled=False
)

boton = widgets.Button(description="Seleccionar receta")

def on_button_click(b):
    global indice_seleccionado

    # Recover the 0-based row index from the "N: title" label
    receta_seleccionada = radio.value
    indice_seleccionado = int(receta_seleccionada.split(":")[0]) - 1

    # Lock the widgets once a selection has been made
    radio.disabled = True
    boton.disabled = True

    print("\n--- Receta seleccionada ---")
    print(f"Título: {df_recetas.iloc[indice_seleccionado]['title']}")

boton.on_click(on_button_click)

# Show the form
display(radio)
display(boton)

RadioButtons(description='Elige una receta:', options=('79: Barbecue Pineapple Chicken', "33: Chef John's Barb…

Button(description='Seleccionar receta', style=ButtonStyle())


--- Receta seleccionada ---
Título: Barbecue Pineapple Chicken
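
The handler above recovers the row index by parsing the number out of the label text. As an alternative sketch, the options can be built as `(label, value)` pairs, which ipywidgets selection widgets accept, so that `radio.value` is already the row index. The helper name `construir_opciones` and the second sample title are assumptions for illustration only.

```python
def construir_opciones(titulos, indices):
    """Hypothetical helper: build (label, value) pairs for a selection widget.

    `titulos` maps a row index to its title; `indices` is the ranked list of rows.
    """
    return [(f"{idx + 1}: {titulos[idx]}", int(idx)) for idx in indices]

# Titles keyed by row index; the second entry is a made-up placeholder
titulos = {78: "Barbecue Pineapple Chicken", 32: "Some Other Recipe"}
opciones = construir_opciones(titulos, [78, 32])
print(opciones[0])  # ('79: Barbecue Pineapple Chicken', 78)

# In the notebook this could be passed as:
#   radio = widgets.RadioButtons(options=opciones)
# after which radio.value would be the integer row index, with no string parsing.
```

This keeps the displayed label free to change without breaking the index lookup.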


In [None]:
# After the user has selected a recipe (indice_seleccionado holds the row index)
contexto_receta = construir_contexto_receta(df_recetas, indice_seleccionado)

#### Using the API to generate the response
Use the DeepSeek API to generate responses based on the provided context.

In [None]:
from openai import OpenAI
from IPython.display import Markdown, display

In [None]:
client_deepseek = OpenAI(api_key="api_key", base_url="URL")
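
The `api_key` and `base_url` values above are placeholders. A common way to avoid hard-coding credentials in a notebook is to read them from environment variables. The sketch below assumes hypothetical variable names `LLM_API_KEY` and `LLM_BASE_URL` and a hypothetical helper `leer_config_llm`; none of them is defined in this exercise.

```python
import os

def leer_config_llm(entorno=None):
    """Hypothetical helper: read API settings from environment variables."""
    entorno = os.environ if entorno is None else entorno
    api_key = entorno.get("LLM_API_KEY")    # assumed variable name
    base_url = entorno.get("LLM_BASE_URL")  # assumed variable name
    if not api_key or not base_url:
        raise RuntimeError("Set LLM_API_KEY and LLM_BASE_URL before running this cell")
    return {"api_key": api_key, "base_url": base_url}

# Example with an explicit mapping instead of the real environment
config = leer_config_llm({"LLM_API_KEY": "dummy-key",
                          "LLM_BASE_URL": "https://example.invalid/v1"})
print(sorted(config))  # ['api_key', 'base_url']

# In the notebook, the client could then be created as:
#   client_deepseek = OpenAI(**leer_config_llm())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first API call.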

In [None]:
prompt = f"""
    Eres un asistente culinario experto. Responde usando SOLAMENTE el siguiente contexto.
    Si la pregunta no puede responderse con esta información, di: 'No tengo información sobre tu búsqueda en la receta seleccionada'.

    Instrucciones de formato:
    1. Muestra toda la información de la receta en formato Markdown si es que está disponible dentro del contexto, indícale al usuario que esos son los detalles de la receta y que disfrute de su platillo.
    2. Si no está disponible, responde que no dispones de la información en la receta seleccionada

    ------
    {contexto_receta}
    ------

    Pregunta del usuario: {query}
    """

# Send the prompt to the model
response = client_deepseek.chat.completions.create(
    model="deepseek/deepseek-r1:free",
    messages=[
        {"role": "system", "content": "Eres un chef experto que responde con precisión."},
        {"role": "user", "content": prompt}
    ],
    temperature=0
)

# Render the model's answer as Markdown
display(Markdown(response.choices[0].message.content))

Aquí están los detalles de la receta **Barbecue Pineapple Chicken**. ¡Disfruta de tu platillo! 🍗🍍

---

## 🍳 Barbecue Pineapple Chicken  
<img src="https://www.allrecipes.com/thmb/2FrUNsUGqxFVrV8CJMDF2AW018I=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/Barbecue-Pineapple-Chicken-Tammy-Lynn-4x3-1-2000-f285297acd6140b7b0eab1b7acd10740.jpg" width="300">  

**📝 Resumen:** Una receta fácil de pollo para preparar a la parrilla. ¡Es deliciosa! Personalízala a tu gusto y disfrútala.  

**⭐ Calificación:** 4.8 estrellas  
**👥 Porciones:** 6  
**⏱️ Tiempo total:** 2 hrs 20 mins  

### 🥕 Ingredientes:  
- 2 botellas (18 oz c/u) de salsa barbecue  
- 1 lata (8 oz) de piña triturada  
- 1 cucharada de ajo en polvo  
- ½ cucharadita de chile en polvo  
- 6 muslos con contramuslos de pollo  

### 👨‍🍳 Preparación:  
1. **Precalienta el horno** a 175°C (350°F).  
2. **Mezcla** en un bowl la salsa barbecue, piña, ajo en polvo y chile en polvo.  
3. **Coloca el pollo** en una bandeja de horno 9x13 pulgadas. Baña con la mezcla de salsa, asegurando cubrir todo el pollo. Cubre con papel aluminio.  
4. **Hornea** 1 hora.  
5. **Deja enfriar** el pollo 1 hora fuera del horno.  
6. **Prepara la parrilla** a fuego medio y unta aceite en la rejilla.  
7. **Asa el pollo** en la parrilla hasta dorar ambos lados (10-15 min).  

### 📊 Información nutricional por porción:  
- **Calorías:** 666  
- **Grasas:** 23g  
- **Carbohidratos:** 72g  
- **Proteínas:** 41g  

---  
¡Que aproveche! 😊