<div align="center">
    <h3>Escuela Politécnica Nacional</h3>
    <h3>Recuperación de información</h3>
</div>


**Taller:** Web Scraping


**Nombre:** Gabriela Salazar

**Fecha:** 24/01/2025

In [1]:
#pip install pandas beautifulsoup4 requests

In [2]:
#pip install selenium pandas webdriver-manager

In [3]:
import httpx
from bs4 import BeautifulSoup
import pandas as pd
import re
from tqdm import tqdm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [4]:
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Cabeceras HTTP
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.265 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}

# Crear una instancia de cliente HTTP con las cabeceras definidas
client = httpx.Client(headers=HEADERS)

# URL de la página de categorías de recetas
url_categorias = "https://www.allrecipes.com/recipes-a-z-6735880"
# Realizar la solicitud GET a la URL especificada
response = client.get(url_categorias)

# Verificar el estado de la respuesta HTTP
if response.status_code == 200:
    # Analizar el contenido de la página con BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    print("Página descargada correctamente.")
else:
    print(f"Error al conectar: {response.status_code}")


Página descargada correctamente.


In [6]:
# Extraer categorías de la página A-Z
def obtener_categorias(soup):
    """
    Extrae las categorías de recetas de una página web, utilizando un objeto BeautifulSoup para analizar el HTML.

    Args:
        soup (BeautifulSoup): Objeto BeautifulSoup que representa el contenido de la página web.

    Returns:
        list: Lista de diccionarios donde cada diccionario representa una categoría. 
              Cada diccionario contiene las siguientes claves:
              - "id_categoria" (int): Un identificador único incremental para la categoría.
              - "categoria" (str): El nombre de la categoría (texto del enlace).
              - "url" (str): La URL asociada a la categoría.
    """
    categorias = []
    for i, link in enumerate(soup.select("#main a[href]"), start=1):  # Agregar índice como ID
        href = link['href']
        texto = link.get_text(strip=True)
        if texto and "/recipes/" in href:  # Filtrar solo enlaces válidos de recetas
            categorias.append({"id_categoria": i, "categoria": texto, "url": href})
    return categorias

# Usar la función para extraer categorías
categorias = obtener_categorias(soup)
print(f"Se encontraron {len(categorias)} categorías:")

Se encontraron 378 categorías:


In [7]:
# Crear un DataFrame con las categorías
df_categorias = pd.DataFrame(categorias)

# Guardar el DataFrame en un archivo CSV
df_categorias.to_csv("categorias.csv", index=False)

In [8]:
df_categorias

Unnamed: 0,id_categoria,categoria,url
0,27,Air Fryer Recipes,https://www.allrecipes.com/recipes/23070/every...
1,28,Allrecipes Allstar Recipes,https://www.allrecipes.com/recipes/16492/every...
2,29,Angel Food Cakes,https://www.allrecipes.com/recipes/385/dessert...
3,30,Antipasti,https://www.allrecipes.com/recipes/102/appetiz...
4,31,Appetizers and Snacks,https://www.allrecipes.com/recipes/76/appetize...
...,...,...,...
373,400,Winter Squash,https://www.allrecipes.com/recipes/1097/fruits...
374,401,Yams,https://www.allrecipes.com/recipes/2452/fruits...
375,402,Yeast Breads,https://www.allrecipes.com/recipes/339/bread/y...
376,403,Ziti,https://www.allrecipes.com/recipes/550/pasta-a...


In [9]:
def obtener_recetas(categoria):
    """
    Extrae recetas de una categoría específica.
    Args:
        categoria (dict): Diccionario con id_categoria, categoria y url
    Returns:
        list: Lista de recetas con id_categoria, categoria, id_receta, receta y url_receta
    """
    response = client.get(categoria["url"])
    if response.status_code != 200:
        return []
    
    soup = BeautifulSoup(response.content, "html.parser")
    recetas = []
    
    for card in soup.select("div[id^='tax-sc__recirc-list'] a.mntl-card-list-items"):
        id_receta = card.get('data-doc-id')
        if not id_receta:
            continue
            
        url_receta = card.get('href')
        nombre_elem = card.select_one('span.card__title-text')
        
        if nombre_elem and url_receta:
            recetas.append({
                "id_categoria": categoria["id_categoria"],
                "categoria": categoria["categoria"],
                "id_receta": id_receta,
                "receta": nombre_elem.get_text(strip=True),
                "url_receta": url_receta
            })
    
    return recetas

In [10]:
def procesar_todas_las_categorias(categorias):
   """
   Procesa categorías, extrae recetas y guarda en CSV.
   Args:
       categorias (list): Lista de diccionarios con id_categoria, categoria y url
   Returns:
       pandas.DataFrame: DataFrame con id_categoria, categoria, id_receta, receta y url_receta
   """
   todas_las_recetas = []
   
   for categoria in categorias:
       try:
           recetas = obtener_recetas(categoria)
           todas_las_recetas.extend(recetas)
       except Exception as e:
           print(f"Error: {str(e)}")
           continue

   if todas_las_recetas:
       try:
           df_recetas = pd.DataFrame(todas_las_recetas)
           df_recetas.to_csv("recetas.csv", index=False, encoding='utf-8')
           print(f"Total recetas extraídas: {len(todas_las_recetas)}")
           return df_recetas
       except Exception as e:
           print(f"Error guardando CSV: {str(e)}")
           return None
   return None

df_recetas = procesar_todas_las_categorias(categorias)

Total recetas extraídas: 19093


In [11]:
df_recetas

Unnamed: 0,id_categoria,categoria,id_receta,receta,url_receta
0,27,Air Fryer Recipes,8726749,Air Fryer Lemon Garlic Parmesan Chicken,https://www.allrecipes.com/air-fryer-lemon-gar...
1,27,Air Fryer Recipes,8729089,Our 15 Best Air Fryer Thanksgiving Recipes,https://www.allrecipes.com/best-air-fryer-than...
2,27,Air Fryer Recipes,8736955,Air Fryer S’Mores,https://www.allrecipes.com/air-fryer-s-mores-r...
3,27,Air Fryer Recipes,8737640,Air Fryer Baked Yams,https://www.allrecipes.com/air-fryer-baked-yam...
4,27,Air Fryer Recipes,8727930,Lemon Garlic Butter Chicken Spiedini,https://www.allrecipes.com/lemon-garlic-butter...
...,...,...,...,...,...
19088,404,Zucchini Breads,6599517,Gluten-Free Zucchini Bread (or Muffins),https://www.allrecipes.com/recipe/244775/glute...
19089,404,Zucchini Breads,6562167,Cherry-Zucchini Bread,https://www.allrecipes.com/recipe/277978/cherr...
19090,404,Zucchini Breads,6648897,Savory Zucchini Muffins,https://www.allrecipes.com/recipe/204983/savor...
19091,404,Zucchini Breads,6600656,Andy's Jalapeno Zucchini Bread,https://www.allrecipes.com/recipe/239859/andys...


In [12]:
def format_ingredient(ingredient):
    """
    Formatea una cadena de texto que representa un ingrediente, 
    asegurándose de agregar espacios adecuados entre números, unidades de medida y palabras.

    Args:
        ingredient (str): Cadena de texto que contiene el ingrediente a formatear.

    Returns:
        str: La cadena formateada, con espacios agregados entre números, fracciones, unidades y palabras.
    """
    # Add space between number and unit
    ingredient = re.sub(r'(\d+(\.\d+)?(/\d+)?)(tablespoons?|teaspoons?|cups?|pounds?|ounces?|oz|lb|g|kg|ml|l)', r'\1 \2', ingredient, flags=re.IGNORECASE)
    # Add space between fractions and words
    ingredient = re.sub(r'(\d+/\d+)(\w+)', r'\1 \2', ingredient)
    # Add space between numbers and words
    ingredient = re.sub(r'(\d+)([a-zA-Z])', r'\1 \2', ingredient)
    return ingredient

In [13]:
# Función para extraer detalles de una receta desde una URL
def get_recipe_details(url):
    """
    Extrae los detalles de una receta desde la URL especificada, incluyendo la lista de ingredientes 
    y los pasos de preparación.

    Args:
        url (str): La URL de la página web de la receta.

    Returns:
        dict: Un diccionario con dos claves principales:
              - "ingredients" (list): Lista de ingredientes formateados.
              - "preparation" (str): Texto concatenado con los pasos de preparación.
        None: Si ocurre un error en la extracción o si la respuesta HTTP no es válida.

    """
    client = httpx.Client(headers=HEADERS)
    try:
        response = client.get(url)
        if response.status_code != 200:
            print(f"Error: Status code {response.status_code}")
            return None
            
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Get ingredients
        ingredients_list = []
        ingredients_section = soup.find('ul', class_='mm-recipes-structured-ingredients__list')
        if ingredients_section:
            for item in ingredients_section.find_all('li'):
                ingredient_text = item.get_text(strip=True)
                formatted_ingredient = format_ingredient(ingredient_text)
                ingredients_list.append(formatted_ingredient)
            
        # Get preparation steps
        steps_list = []
        steps_section = soup.find('ol', id='mntl-sc-block_1-0')
        if steps_section:
            for step in steps_section.find_all('p', class_='mntl-sc-block-html'):
                step_text = step.get_text(strip=True)
                if step_text:
                    steps_list.append(step_text)
            
        return {
            'ingredients': ingredients_list,
            'preparation': ' '.join(steps_list)
        }
    except Exception as e:
        print(f"Error extracting recipe details: {str(e)}")
        return None

In [14]:
def process_recipes(df_recetas, limit=1000):
   """
   Extrae detalles de recetas como ingredientes y pasos.
   Args:
       df_recetas (pd.DataFrame): DataFrame con id_receta, receta y url_receta
       limit (int): Máximo de recetas a procesar. Default 1000
   Returns:
       pd.DataFrame: DataFrame con id_receta, receta, ingredientes y preparacion
   """
   recipe_details = []
   
   for _, row in df_recetas.head(limit).iterrows():
       try:
           details = get_recipe_details(row['url_receta'])
           if details:
               recipe_details.append({
                   'id_receta': row['id_receta'],
                   'receta': row['receta'],
                   'ingredientes': details['ingredients'],
                   'preparacion': details['preparation']
               })
       except Exception as e:
           print(f"Error: {str(e)}")
           continue
           
   return pd.DataFrame(recipe_details)

In [15]:
try:
   df_recetas = pd.read_csv('recetas.csv')
   recipe_details = []
   
   for _, row in tqdm(df_recetas.head(1000).iterrows(), total=min(1000, len(df_recetas)), desc="Procesando recetas"):
       try:
           details = get_recipe_details(row['url_receta'])
           if details:
               recipe_details.append({
                   'id_receta': row['id_receta'],
                   'receta': row['receta'],
                   'ingredientes': details['ingredients'],
                   'preparacion': details['preparation']
               })
       except Exception as e:
           print(f"Error: {str(e)}")
           continue
           
   df_detalles = pd.DataFrame(recipe_details)
   df_detalles.to_csv('recetas_detalladas.csv', index=False, encoding='utf-8')
   print(f"\nGuardadas {len(df_detalles)} recetas")

except Exception as e:
   print(f"Error en proceso principal: {str(e)}")

Procesando recetas: 100%|██████████| 1000/1000 [18:52<00:00,  1.13s/it]


Guardadas 1000 recetas





In [16]:
df_detalles

Unnamed: 0,id_receta,receta,ingredientes,preparacion
0,8726749,Air Fryer Lemon Garlic Parmesan Chicken,"[1 1/2 skinless boneless chicken thighs, 3 clo...",Gather all ingredients. Preheat an air fryer t...
1,8729089,Our 15 Best Air Fryer Thanksgiving Recipes,[],
2,8736955,Air Fryer S’Mores,"[1 sleevegraham crackers, 5(1.5 ounce)chocolat...",Preheat an air fryer to 380 degrees F (193 deg...
3,8737640,Air Fryer Baked Yams,"[1 yam, 1/2 olive oil]",Preheat an air fryer to 400 degrees F (200 deg...
4,8727930,Lemon Garlic Butter Chicken Spiedini,"[1/2 extra-virgin olive oil, 1/4 white wine, s...","Whisk together olive oil, wine, 2 tablespoons ..."
...,...,...,...,...
995,6581318,'So This Is What Heaven Tastes Like!' Cream Ch...,"[1 ½cupsmargarine, softened, 1(8 ounce) packag...",Preheat oven to 350 degrees F (175 degrees C)....
996,6576113,Blondies II,"[3 ½cupsall-purpose flour, 2 ¼teaspoonsbaking ...",Preheat oven to 350 degrees F (175 degrees C)....
997,6586671,Robin's Blond Brownies,"[6 butter, softened, 1 packed brown sugar, 2 e...",Preheat oven to 350 degrees F (175 degrees C)....
998,6570491,Berry and White Chocolate Blondies,"[½cupunsalted butter, melted, ½cupfirmly packe...",Preheat the oven to 350 degrees F (175 degrees...


In [21]:
# Leer el CSV
df = pd.read_csv('recetas_detalladas.csv')

# Eliminar duplicados basados en id_receta, manteniendo la primera ocurrencia
df_recetas_sin_duplicados = df.drop_duplicates(subset='id_receta', keep='first')

# Guardar el resultado
df_recetas_sin_duplicados.to_csv('recetas_sin_duplicados.csv', index=False, encoding='utf-8')

print(f"Recetas originales: {len(df)}")
print(f"Recetas sin duplicados: {len(df_recetas_sin_duplicados)}")
print(f"Se eliminaron {len(df) - len(df_recetas_sin_duplicados)} duplicados")

Recetas originales: 1000
Recetas sin duplicados: 974
Se eliminaron 26 duplicados


In [23]:
df_recetas_sin_duplicados

Unnamed: 0,id_receta,receta,ingredientes,preparacion
0,8726749,Air Fryer Lemon Garlic Parmesan Chicken,"['1 1/2 skinless boneless chicken thighs', '3 ...",Gather all ingredients. Preheat an air fryer t...
1,8729089,Our 15 Best Air Fryer Thanksgiving Recipes,[],
2,8736955,Air Fryer S’Mores,"['1 sleevegraham crackers', '5(1.5 ounce)choco...",Preheat an air fryer to 380 degrees F (193 deg...
3,8737640,Air Fryer Baked Yams,"['1 yam', '1/2 olive oil']",Preheat an air fryer to 400 degrees F (200 deg...
4,8727930,Lemon Garlic Butter Chicken Spiedini,"['1/2 extra-virgin olive oil', '1/4 white wine...","Whisk together olive oil, wine, 2 tablespoons ..."
...,...,...,...,...
995,6581318,'So This Is What Heaven Tastes Like!' Cream Ch...,"['1 ½cupsmargarine, softened', '1(8 ounce) pac...",Preheat oven to 350 degrees F (175 degrees C)....
996,6576113,Blondies II,"['3 ½cupsall-purpose flour', '2 ¼teaspoonsbaki...",Preheat oven to 350 degrees F (175 degrees C)....
997,6586671,Robin's Blond Brownies,"['6 butter, softened', '1 packed brown sugar',...",Preheat oven to 350 degrees F (175 degrees C)....
998,6570491,Berry and White Chocolate Blondies,"['½cupunsalted butter, melted', '½cupfirmly pa...",Preheat the oven to 350 degrees F (175 degrees...


In [24]:
# Preprocesamiento de texto
def preprocess_text(text):
    """
    Preprocesa texto eliminando caracteres especiales, palabras vacías (stopwords),
    y dejando solo palabras relevantes.

    Args:
        text (str): Cadena de texto que contiene el texto a procesar.

    Returns:
        str: Texto procesado, en minúsculas, sin caracteres especiales ni palabras vacías, 
             y con palabras relevantes unidas por espacios.
    """
    # Eliminar caracteres especiales y convertir a minúsculas
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    # Tokenizar texto
    tokens = word_tokenize(text)
    # Filtrar stopwords y palabras muy cortas
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    return ' '.join(tokens)

In [25]:
# Preprocesar los textos relevantes (ingredientes y preparación)
print("Preprocesando datos...")
df['text'] = df['ingredientes'].fillna('') + ' ' + df['preparacion'].fillna('')
df['text'] = df['text'].apply(preprocess_text)

Preprocesando datos...


In [26]:
# Generación de embeddings utilizando TF-IDF
print("Generando embeddings...")
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
embeddings = tfidf_vectorizer.fit_transform(df['text'])

Generando embeddings...


In [27]:
# Clustering
def apply_clustering(embeddings, n_clusters=10):
    """
    Aplica K-means para agrupar embeddings en clusters similares.
    Args:
        embeddings (sparse matrix): Matriz de embeddings generada.
        n_clusters (int): Número de clusters a formar.
    Returns:
        list: Etiquetas de cluster para cada documento.
    """
    print("Agrupando con K-means...")
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(embeddings)
    return labels

# Agregar etiquetas de cluster a los datos
df['cluster'] = apply_clustering(embeddings, n_clusters=10)

Agrupando con K-means...




In [28]:
# Procesamiento de consulta y recuperación de recetas similares
def process_query(query, embeddings, df):
    """
    Procesa una consulta para buscar recetas similares.
    Args:
        query (str): Texto de la consulta.
        embeddings (sparse matrix): Matriz de embeddings de los documentos.
        df (DataFrame): DataFrame con los datos originales.
    Returns:
        DataFrame: Recetas ordenadas por similitud con la consulta.
    """
    print("Procesando consulta...")
    query = preprocess_text(query)
    query_embedding = tfidf_vectorizer.transform([query])

    # Calcular similitudes de coseno
    similarities = cosine_similarity(query_embedding, embeddings).flatten()

    # Crear un DataFrame con los resultados ordenados por relevancia
    df['similarity'] = similarities
    results = df.sort_values(by='similarity', ascending=False).head(10)
    return results[['receta', 'similarity', 'cluster', 'ingredientes', 'preparacion']]

In [29]:
# Consulta 
def run_example_query(query, df):
    """
    Ejecuta una consulta de ejemplo y muestra los resultados.
    Args:
        query (str): Consulta de ejemplo.
        df (DataFrame): DataFrame de recetas ya procesado.
    """
    results = process_query(query, embeddings, df)
    print("Recetas:")
    for _, row in results.iterrows():
        print(f"\nReceta: {row['receta']}\nSimilitud: {row['similarity']:.2f}\nCluster: {row['cluster']}\nIngredientes: {row['ingredientes']}\nPreparación: {row['preparacion']}")

In [30]:
query = "chicken with garlic and lemon"
run_example_query(query, df)

Procesando consulta...
Recetas:

Receta: Sheet Pan Lemon Garlic Chicken with Vegetables
Similitud: 0.63
Cluster: 5
Ingredientes: ['2 skinless, boneless chicken thighs', 'salt and freshly ground black pepper to taste', '1/4 unsalted butter', '1/2 licedred onion', '4 cloves garlic, minced', '2 Greekseasoning', '1 asparagus, washed, trimmed, and halved', '6 mini bell peppers, assorted colors, seeds removed, quartered', '1/2 chicken broth', '1 emon, zested and juiced', 'lemon slices, for garnish (optional)', 'fresh parsley sprigs, for garnish (optional)']
Preparación: Preheat the oven to 400 degrees F (200 degrees C) and line a sheet pan with foil or parchment paper. Pat chicken thighs dry with paper towels and season with salt and pepper. Melt butter in a large skillet over medium-high heat. When butter is sizzling, add chicken and cook until browned, 3 to 5 minutes per side. Place chicken on the prepared sheet pan. Roast chicken in the preheated oven for 15 minutes. To the same skillet, 

In [31]:
query = "corona beans"
run_example_query(query, df)

Procesando consulta...
Recetas:

Receta: 4-Bean Baked Beans
Similitud: 0.69
Cluster: 4
Ingredientes: ['1(16 ounce) canpork and beans, drained', '1(15 ounce) cankidney beans, drained', '1(15 ounce) canbutter beans, drained', '1(15 ounce) canlima beans, drained', '8 bacon, cut into small pieces', '1 argeonion, chopped', '2 clovesgarlic, chopped', '¾cupbrown sugar', '½cupketchup', '½cupvinegar', '¼cupmolasses', '1 dry mustard']
Preparación: Preheat the oven to 350 degrees F (175 degrees C). Add pork and beans, kidney beans, butter beans, and lima beans to a 2-quart casserole dish. Heat a saucepan over medium heat. Add bacon, onion, and garlic; cook and stir until bacon browned, about 10 minutes. Drain excess grease. Whisk brown sugar, ketchup, vinegar, molasses, and mustard into bacon mixture; simmer until cooked through, about 20 minutes. Pour sauce over beans; stir to combine. Bake in the preheated oven until bubbling, 1 hour 15 minutes.

Receta: Slow Cooker BBQ Baked Beans
Similitud: 0