# Web Scraping Exercise

## 1. Introduction and Planning

### Objective:
The goal of this exercise is to build a web scraper that collects data from a chosen website. You will learn how to send HTTP requests, parse HTML content, extract relevant data, and store it in a structured format.

### Tasks:
1. Identify the data you want to scrape.
2. Choose the target website(s).
3. Plan the structure of your project.

### Example:
For this exercise, we will scrape recipes from allrecipes.com. We will extract recipe titles, descriptions, ingredients, directions, and links.
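
Planning the structure can start with the record that each scraped item should produce. The following is a minimal sketch, assuming the five recipe fields used for the CSV later in this exercise (title, description, ingredients, directions, link). The `Recipe` class, `to_row` method, and `CSV_HEADER` name are hypothetical helpers introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Recipe:
    """One scraped recipe (hypothetical container, not part of any library)."""
    title: str = "No Title"
    description: str = "No Description"
    ingredients: List[str] = field(default_factory=list)
    directions: List[str] = field(default_factory=list)
    link: str = ""

    def to_row(self) -> list:
        # Keep the same column order as the CSV header below
        return [self.title, self.description, self.ingredients,
                self.directions, self.link]


CSV_HEADER = ["Title", "Description", "Ingredients", "Directions", "Link"]

example = Recipe(
    title="Gochujang Mayonnaise",
    ingredients=["1/4 cup mayonnaise"],
    link="https://example.com/recipe",  # placeholder URL
)
print(example.to_row())
```

Fixing the record shape up front makes it easier to notice when a page is missing a field, because the defaults show up in the output instead of an exception.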

## 2. Understanding the Target Website

### Objective:
Analyze the structure of the web pages to be scraped.

### Tasks:
* Inspect the target website using browser developer tools.
* Identify the HTML elements that contain the desired data.

### Instructions:

* Open your browser and navigate to the target website (e.g., allrecipes.com).
* Right-click on the webpage and select "Inspect" or press Ctrl+Shift+I.
* Use the developer tools to explore the HTML structure of the webpage.
* Identify the tags and classes of the elements that contain the recipe titles, descriptions, ingredients, and directions.
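
Once a tag and class pair has been spotted in the developer tools, it helps to test it on a small piece of HTML before pointing it at the live site. The sketch below uses only the standard library's `html.parser`. The class names are borrowed from the scraper later in this notebook, while the HTML snippet itself is invented for illustration, and `ClassTextCollector` and `texts_by_class` are hypothetical helper names.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> whose class attribute contains css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == self.tag:
                self.depth += 1   # same tag nested inside the match
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def texts_by_class(html, tag, css_class):
    parser = ClassTextCollector(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented snippet that mimics the structure seen in the developer tools
SAMPLE_HTML = """
<h1 class="article-heading type--lion">Gochujang Mayonnaise</h1>
<ul>
  <li class="mm-recipes-structured-ingredients__list-item">1/4 cup mayonnaise</li>
  <li class="mm-recipes-structured-ingredients__list-item">2 tablespoons Gochujang sauce</li>
</ul>
"""

print(texts_by_class(SAMPLE_HTML, "h1", "article-heading"))
print(texts_by_class(SAMPLE_HTML, "li", "mm-recipes-structured-ingredients__list-item"))
```

Matching on a single class, rather than the full class string, tends to survive small markup changes on the site better.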

## 3. Writing the Scraper

### Objective:
Develop the code to scrape data from the target website.

### Tasks:
* Send HTTP requests to the target website.
* Parse the HTML content and extract the required data.
* Handle pagination to scrape data from multiple pages.
* Implement error handling.
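
The pagination and error-handling tasks can be sketched independently of any particular browser or HTTP library. The example below is only an illustration under stated assumptions: `fetch` is a hypothetical callable that takes a URL and returns the list of items found on that page, and the `?page=N` URL pattern is assumed, not taken from the target website. The helper names `fetch_with_retry` and `scrape_pages` are also hypothetical.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on an exception, wait and retry, up to `retries` attempts in total."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose in this sketch; narrow it in real code
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


def scrape_pages(fetch, url_template, max_pages=5, retries=3, delay=1.0):
    """Visit page 1, 2, ... until a page has no items or max_pages is reached."""
    items = []
    for page in range(1, max_pages + 1):
        url = url_template.format(page=page)
        try:
            page_items = fetch_with_retry(fetch, url, retries=retries, delay=delay)
        except RuntimeError as error:
            print(f"Skipping {url}: {error}")
            continue
        if not page_items:
            break  # an empty page is treated as the end of the listing
        items.extend(page_items)
    return items


# Stand-in for a real fetch function so the sketch runs without network access
fake_site = {1: ["recipe-a", "recipe-b"], 2: ["recipe-c"], 3: []}


def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    return fake_site.get(page, [])


print(scrape_pages(fake_fetch, "https://example.com/list?page={page}", delay=0))
```

In a Selenium-based scraper, `fetch` could wrap `driver.get(url)` plus the parsing step, so a failure on one page does not abort the whole run.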

In [5]:
%pip install selenium


In [6]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time



In [25]:
# Configure the Selenium options
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

In [26]:
chrome_driver_path = 'D:/Kevin/Documents/EPN/2024A/Recuperación de Información/IR-2024A/week12/chromedriver.exe'

In [27]:
# Create a browser instance
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service, options=options)

In [28]:
# Navigate to the web page
url = 'https://www.allrecipes.com/recipes/'
driver.get(url)

# Wait a moment to make sure the page has loaded completely
time.sleep(5)

In [33]:
# Get the page content
page_source = driver.page_source

# Parse the HTML content of the page
soup = BeautifulSoup(page_source, 'html.parser')


In [50]:
# Find the recipe groups on the main page
recipes = soup.find_all('div', class_='comp mntl-taxonomysc-article-list-group mntl-block')

In [93]:
# Save the recipes to a CSV file
import csv

In [99]:
with open('recipes.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Description', 'Ingredients', 'Directions', 'Link'])

    # Iterate over the recipe groups and extract the information
    for recipe in recipes:
        # Collect every recipe link in the group so each recipe page can be visited
        links = recipe.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card card--no-image')
        for link_element in links:
            link = link_element['href']

            # Navigate to the page of each recipe
            driver.get(link)
            time.sleep(5)  # Wait for the page to load

            # Get the content of the recipe page
            recipe_page_source = driver.page_source
            recipe_soup = BeautifulSoup(recipe_page_source, 'html.parser')

            # Get the recipe title
            titles = recipe_soup.find_all('h1', class_='article-heading type--lion')
            recipe_title = titles[0].text.strip() if titles else 'No Title'
            print(f'Title: {recipe_title}')

            # Get the recipe description
            descriptions = recipe_soup.find_all('p', class_='article-subheading type--dog')
            description_text = descriptions[0].text.strip() if descriptions else 'No Description'
            print(f'Description: {description_text}')

            # Get the ingredients
            ingredients = recipe_soup.find_all('li', class_='mm-recipes-structured-ingredients__list-item')
            ingredient_list = [ingredient.text.strip() for ingredient in ingredients]
            print('Ingredients:')
            for ingredient in ingredient_list:
                print(f'- {ingredient}')

            # Get the directions
            directions = recipe_soup.find_all('p', class_='comp mntl-sc-block mntl-sc-block-html')
            direction_list = [direction.text.strip() for direction in directions]
            print('Directions:')
            for direction in direction_list:
                print(f'- {direction}')

            print('\n' + '='*40 + '\n')
            # Write the recipe information to the CSV file
            # (csv.writer stores each list as its string representation)
            writer.writerow([recipe_title, description_text, ingredient_list, direction_list, link])



Title: Gochujang Mayonnaise
Description: This delicious gochujang mayonnaise is a simple recipe, great on just about anything. You can make it spicier by adding more gochujang sauce. I love to slather it on my burgers or use it as a dipping sauce for my fries. The possibilities are endless.
Ingredients:
- 1/4 cup mayonnaise
- 2 tablespoons Gochujang sauce
- 1/2 teaspoon lime juice
- 1/4 teaspoon sea salt
- 1/4 teaspoon black pepper
- 1/4 teaspoon garlic powder
Directions:
- Stir mayonnaise, gochujang sauce, lime juice, salt, pepper, and garlic powder together in a small bowl until well combined.


Title: Frito Corn Salad
Description: "It's such a hit!"
Ingredients:
- 6 ears fresh corn with husks
- 2 (9.25 ounce) bags scoop-style corn chips, divided (such as Frito Scoops)
- 8 ounces shredded sharp Cheddar cheese
- 2 small multicolored bell peppers, chopped
- 1 medium red onion, finely chopped
- 3/4 cup chopped fresh cilantro, plus more for garnish
- 3/4 cup mayonnaise
- 1/2 cup s

In [100]:
# Close the CSV file (redundant but harmless: the `with` block above already closed it)
file.close()

In [101]:
# Close the browser
driver.quit()

In [108]:
# Read the CSV file
import pandas as pd

df_recipes = pd.read_csv('recipes.csv')
df_recipes.head()

|   | Title | Description | Ingredients | Directions | Link |
|---|-------|-------------|-------------|------------|------|
| 0 | Gochujang Mayonnaise | This delicious gochujang mayonnaise is a simpl... | ['1/4 cup mayonnaise', '2 tablespoons Gochujan... | ['Stir mayonnaise, gochujang sauce, lime juice... | https://www.allrecipes.com/gochujang-mayonnais... |
| 1 | Frito Corn Salad | "It's such a hit!" | ['6 ears fresh\xa0corn\xa0with husks', '2 (9.2... | ['There’s nothing better than a salad with a l... | https://www.allrecipes.com/frito-corn-salad-re... |
| 2 | Cowboy Colada | "It's totally perfect for summer cocktails on ... | ['fresh pineapple wedges', 'chili-lime seasoni... | ['This might look like your average tropical c... | https://www.allrecipes.com/cowboy-colada-recip... |
| 3 | Chocolate-Glazed Hazelnut Mousse Cake | This wonderfully rich, silky dessert is an ide... | ['2 tablespoons hazelnuts, toasted and skins r... | ['For shortbread base: Put oven rack in middle... | https://www.allrecipes.com/recipe/134726/choco... |
| 4 | Grilled Spaghetti | Grilled spaghetti is cooked directly in a very... | ['8 ounces thick dry spaghetti', '3 tablespoon... | ['Prepare an outdoor charcoal grill for medium... | https://www.allrecipes.com/grilled-spaghetti-r... |
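
The Ingredients and Directions columns come back from the CSV as strings that look like Python lists, and some entries carry non-breaking spaces (`\xa0`). The following sketch shows one possible way to recover real lists and normalize the whitespace, using only the standard library. `parse_list_cell` is a hypothetical helper name, and the in-memory CSV stands in for `recipes.csv` so the example runs on its own.

```python
import ast
import csv
import io


def parse_list_cell(cell):
    """Turn a cell like "['a', 'b']" back into a list of cleaned strings."""
    values = ast.literal_eval(cell)  # safe evaluation of a Python literal
    # str.split() with no argument splits on any Unicode whitespace, including \xa0
    return [" ".join(str(value).split()) for value in values]


# Build a tiny CSV in memory; csv.writer converts the list field with str(),
# which is how a list ends up stored as its string representation
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Ingredients"])
writer.writerow(["Frito Corn Salad",
                 ["6 ears fresh\xa0corn\xa0with husks", "3/4 cup mayonnaise"]])

buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(type(rows[0]["Ingredients"]))            # the cell is a plain string
print(parse_list_cell(rows[0]["Ingredients"]))  # a list of cleaned strings
```

With pandas, the same helper could be applied to a whole column, for example `df_recipes['Ingredients'].apply(parse_list_cell)`.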


In [110]:
df = df_recipes['Description']
df

0     This delicious gochujang mayonnaise is a simpl...
1                                    "It's such a hit!"
2     "It's totally perfect for summer cocktails on ...
3     This wonderfully rich, silky dessert is an ide...
4     Grilled spaghetti is cooked directly in a very...
5     These bacon stuffed French toast sliders make ...
6     These easy eggs Benedict breakfast sliders mak...
7     The ultimate cowboy breakfast sliders are made...
8            Cool off with this creamy, vodka cocktail.
9     This peach and blueberry feta salad is the BES...
10    This copycat Costco chicken bake is an easier,...
11    If Klondike won't bring back the ice cream tru...
12    This gorgeous limoncello sunrise cocktail mimi...
13    This hamburger bread pudding, made from stale ...
14    These feta and spinach stuffed chicken breasts...
15    These Buffalo chicken pizza rolls are filling ...
16    Make the most of peach season with these top-r...
17    Keep reading for our most-saved Chef John 