# Web Scraping Exercise

## 1. Introduction and Planning

### Objective:
The goal of this exercise is to build a web scraper that collects data from a chosen website. You will learn how to send HTTP requests, parse HTML content, extract relevant data, and store it in a structured format.

### Tasks:
1. Identify the data you want to scrape.
2. Choose the target website(s).
3. Plan the structure of your project.

### Example:
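For this exercise, we will scrape job listings from Indeed.com. We will extract job titles, company names, locations, and job descriptions. The worked solution in Section 3 applies the same steps to recipes on allrecipes.com, extracting titles, descriptions, ingredients, and directions.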
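
One way to make the plan concrete is to write down the record you expect to collect before writing any scraping code. The sketch below assumes the four fields named in the example; the class name `JobListing`, the helper `to_csv_text`, and the sample values are illustrative only and are not part of any library or real dataset.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class JobListing:
    """One scraped record; the fields mirror the example (hypothetical schema)."""
    title: str
    company: str
    location: str
    description: str


def to_csv_text(listings):
    """Serialize a list of JobListing objects to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(JobListing)])
    writer.writeheader()
    for listing in listings:
        writer.writerow(asdict(listing))
    return buffer.getvalue()


# Made-up sample record, used only to show the output shape.
sample = [JobListing("Data Analyst", "Acme Corp", "Remote", "Build reports.")]
csv_text = to_csv_text(sample)
print(csv_text)
```

Fixing the schema first means the parsing code has a clear target, and a missing field shows up as an error instead of a silently misaligned column.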

## 2. Understanding the Target Website
### Objective:
Analyze the structure of the web pages to be scraped.

### Tasks:
* Inspect the target website using browser developer tools.
* Identify the HTML elements that contain the desired data.

### Instructions:

* Open your browser and navigate to the target website (e.g., Indeed.com).
* Right-click on the webpage and select "Inspect", or press Ctrl+Shift+I (Cmd+Option+I on macOS).
* Use the developer tools to explore the HTML structure of the webpage.
* Identify the tags and classes of the elements that contain the job titles, company names, locations, and descriptions.
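
Besides clicking through the developer tools, it can help to get a quick inventory of which tag and class combinations repeat on a page, since repeated combinations usually mark the listing items. The sketch below uses only the standard library; the class `TagClassSurvey` and the sample markup are made up for illustration and do not reflect the real structure of any site.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassSurvey(HTMLParser):
    """Count (tag, class attribute) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class")
        if class_value:
            self.counts[(tag, class_value)] += 1


# Made-up markup standing in for a saved copy of a real listing page.
sample_html = """
<div class="job-card"><h2 class="job-title">Data Analyst</h2>
  <span class="company">Acme Corp</span></div>
<div class="job-card"><h2 class="job-title">Web Developer</h2>
  <span class="company">Example Ltd</span></div>
"""

survey = TagClassSurvey()
survey.feed(sample_html)
for (tag, class_value), count in survey.counts.most_common():
    print(f"{tag}.{class_value}: {count}")
```

Combinations that appear once per listing are the candidates to pass to the parser when writing the scraper.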

## 3. Writing the Scraper
### Objective:
Develop the code to scrape data from the target website.

### Tasks:
* Send HTTP requests to the target website.
* Parse the HTML content and extract the required data.
* Handle pagination to scrape data from multiple pages.
* Implement error handling.
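
The pagination and error-handling tasks can be sketched independently of the library that fetches the pages. The example below is a minimal sketch under stated assumptions: the `start` query parameter is a hypothetical paging scheme (real sites differ), and `fetch` is any callable that takes a URL and returns page content, so a fake function stands in for a real request here.

```python
import time
from urllib.parse import urlencode


def page_urls(base_url, pages, page_size=10):
    """Yield one URL per results page, using a hypothetical 'start' offset parameter."""
    for page in range(pages):
        yield f"{base_url}?{urlencode({'start': page * page_size})}"


def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception with a growing pause between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            print(f"Attempt {attempt} failed for {url}: {error}")
            time.sleep(delay * attempt)


def scrape_all(fetch, base_url, pages, delay=1.0):
    """Collect results for every page, skipping pages that keep failing."""
    results = []
    for url in page_urls(base_url, pages):
        try:
            results.append(fetch_with_retries(fetch, url, delay=delay))
        except Exception as error:
            print(f"Giving up on {url}: {error}")
    return results


# Stand-in for a real request: fails once on the second page, then succeeds.
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("start=10") and calls[url] == 1:
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"


html_pages = scrape_all(fake_fetch, "https://example.com/jobs", pages=3, delay=0)
print(len(html_pages))
```

Keeping the fetch function as a parameter makes it possible to exercise the retry and paging logic without touching the network, and to swap in a Selenium-based or HTTP-based fetcher later.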

## Step 1: Importing libraries

In [7]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import csv
import pandas as pd


## Step 2: Sending HTTP requests to the target website

### Configuring Selenium options


In [2]:
options = Options()  # create an instance of the Options class
options.add_argument('--headless')  # run the browser without a graphical interface
options.add_argument('--disable-gpu')  # disable GPU acceleration

### Creating a browser instance


In [9]:
!chmod +x /content/sample_data/chromedriver.exe

In [4]:
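chrome_driver_path = 'chromedriver.exe'  # path to the ChromeDriver binary; it must point to the file made executable above
service = Service(chrome_driver_path)  # wrap the driver binary in a Selenium service
driver = webdriver.Chrome(service=service, options=options)  # start Chrome with the options defined earlier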

### Navigating to the web page

In [5]:
url = 'https://www.allrecipes.com/recipes/'
driver.get(url)  # load the page with the browser driver
time.sleep(5)  # wait so the page has time to finish loading
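
A fixed `time.sleep(5)` wastes time when the page loads quickly and may be too short when it loads slowly. Selenium provides `WebDriverWait` (in `selenium.webdriver.support.ui`) to wait on a condition instead. The sketch below illustrates the underlying polling idea with the standard library only; `wait_until` and `page_is_ready` are hypothetical helpers written for this example, not Selenium APIs.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the element present yet?": becomes true on the third check.
checks = {"count": 0}


def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3


ready = wait_until(page_is_ready, timeout=2.0, poll_interval=0.01)
print(ready, checks["count"])
```

With Selenium, the equivalent pattern would be along the lines of `WebDriverWait(driver, 10).until(...)` with a condition that checks for an element you expect on the loaded page; which element to wait for depends on the target site and is left as an assumption here.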

### Getting the page content

In [6]:
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')  # parse the page's HTML content
recipes = soup.find_all('div', class_='comp mntl-taxonomysc-article-list-group mntl-block')  # find the recipe groups on the main page


### Saving the page content

In [9]:
with open('recipes.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Description', 'Ingredients', 'Directions', 'Link'])  # write the header row to the CSV file

    # extract all the information
    for recipe in recipes:
        # collect the links to each recipe page, where the ingredients are listed
        links = recipe.find_all('a', class_='comp mntl-card-list-items mntl-document-card mntl-card card card--no-image')

        for link_element in links:
            link = link_element['href']
            driver.get(link)  # navigate to the recipe page
            time.sleep(5)  # wait so the page has time to finish loading
            # get the content of the recipe page
            recipe_page_source = driver.page_source
            recipe_soup = BeautifulSoup(recipe_page_source, 'html.parser')
            # get the recipe title
            titles = recipe_soup.find_all('h1', class_='article-heading type--lion')
            recipe_title = titles[0].text.strip() if titles else 'No Title'
            # get the recipe description
            descriptions = recipe_soup.find_all('p', class_='article-subheading type--dog')
            description_text = descriptions[0].text.strip() if descriptions else 'No Description'
            # get the ingredients
            ingredients = recipe_soup.find_all('li', class_='mm-recipes-structured-ingredients__list-item')
            ingredient_list = [ingredient.text.strip() for ingredient in ingredients]
            # get the directions
            directions = recipe_soup.find_all('p', class_='comp mntl-sc-block mntl-sc-block-html')
            direction_list = [direction.text.strip() for direction in directions]
            # write the recipe information to the CSV file
            writer.writerow([recipe_title, description_text, ingredient_list, direction_list, link])

# the with block has already closed the CSV file, so no explicit file.close() is needed
driver.quit()  # close the browser

### Accessing the CSV content

In [14]:
df_recipes = pd.read_csv('recipes.csv')  # read the CSV file
df_recipes.head()  # show the first 5 rows of the DataFrame

| | Title | Description | Ingredients | Directions | Link |
|---|---|---|---|---|---|
| 0 | Raspberry and Strawberry Buckle | I've always called this a buckle (I'm from the... | ['½ cup butter, softened', '½ cup white sugar'... | ['Preheat oven to 375 degrees F (190 degrees C... | https://www.allrecipes.com/recipe/7603/raspber... |
| 1 | Copycat Panda Express Orange Chicken | Try this copycat recipe to make Panda Express ... | ['2 cups flour', '1/2 cup cornstarch', '2 teas... | ['Whisk together flour, cornstarch, salt, and ... | https://www.allrecipes.com/copycat-panda-expre... |
| 2 | Plant Based Cheese Sprinkles | These plant-based cheese sprinkles came about ... | ['1/2 cup cashews', '1/2 cup nutritional yeast... | ['Place cashews in a food processor. Pulse unt... | https://www.allrecipes.com/plant-based-cheese-... |
| 3 | Honey Roasted Beets | These honey roasted beets combine the natural ... | ['3 large beets', '1 tablespoon olive oil', '1... | ['Preheat the oven to 375 degrees F (190 degre... | https://www.allrecipes.com/honey-roasted-beets... |
| 4 | Spiral Cucumber Salad | This spiral cucumber salad features a sesame o... | ['6 to 8 mini seedless cucumbers', '1/2 yellow... | ['Place cucumber between two chopsticks or woo... | https://www.allrecipes.com/spiral-cucumber-sal... |
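
Because the scraper passes Python lists to `csv.writer`, the `Ingredients` and `Directions` cells are stored as the text form of a list, and they come back as plain strings when the file is read. The sketch below shows one way to turn such a cell back into a list with `ast.literal_eval`. It builds a tiny in-memory stand-in for `recipes.csv` (two columns only, with values taken from the output above) so that it runs on its own.

```python
import ast
import csv
import io

# Two-line stand-in for recipes.csv, built the same way the scraper writes rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Ingredients"])
writer.writerow(["Honey Roasted Beets", ["3 large beets", "1 tablespoon olive oil"]])

buffer.seek(0)
rows = list(csv.DictReader(buffer))

raw = rows[0]["Ingredients"]         # a string that only looks like a list
ingredients = ast.literal_eval(raw)  # an actual Python list
print(type(raw).__name__, type(ingredients).__name__, len(ingredients))
```

On the DataFrame, the same conversion could be applied to a whole column with something like `df_recipes['Ingredients'].apply(ast.literal_eval)`, assuming every cell holds a valid list literal.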


### Showing the title of every recipe

In [15]:
df = df_recipes['Title']
df

0                       Raspberry and Strawberry Buckle
1                  Copycat Panda Express Orange Chicken
2                          Plant Based Cheese Sprinkles
3                                   Honey Roasted Beets
4                                 Spiral Cucumber Salad
5       Turkey Meatballs in Maple-Bourbon Mustard Sauce
6                      Italian Marinated Grilled Cheese
7                                      Cornmeal Cookies
8                 Big Batch Freezer Lemon Drop Martinis
9                                Doritos Locos Taco Pie
10                        Crispy Air Fryer Potato Bites
11                              Frozen Espresso Martini
12                    Ground Beef and Broccoli Stir Fry
13                       Greek-Style Grilled Lamb Chops
14                    One Pan Pasta with Bacon and Peas
15                                Doggie Ice Cream Cake
16                                   Zucchini Rollatini
17                            Chicken Ricotta Me

### Showing the description of every recipe

In [16]:
df = df_recipes['Description']
df

0     I've always called this a buckle (I'm from the...
1     Try this copycat recipe to make Panda Express ...
2     These plant-based cheese sprinkles came about ...
3     These honey roasted beets combine the natural ...
4     This spiral cucumber salad features a sesame o...
5     These turkey meatballs in maple-bourbon mustar...
6     This Italian marinated grilled cheese is made ...
7     These cornmeal cookies have a lovely, chewy te...
8     These big batch freezer lemon drop martinis, e...
9     Try this sizzling doritos locos taco pie with ...
10    These crispy air fryer potato bites are crispy...
11    This frozen espresso martini is perfect for ho...
12    This ground beef and broccoli stir fry is an i...
13    These Greek-style grilled lamb chops, seasoned...
14    One pan pasta with bacon and peas is an easy w...
15    This doggie ice cream cake is for when the tem...
16    These zucchini rollatini are ribbons of zucchi...
17    These ricotta chicken meatballs are both b

### Showing the ingredients of every recipe

In [17]:
df = df_recipes['Ingredients']
df

0     ['½ cup butter, softened', '½ cup white sugar'...
1     ['2 cups flour', '1/2 cup cornstarch', '2 teas...
2     ['1/2 cup cashews', '1/2 cup nutritional yeast...
3     ['3 large beets', '1 tablespoon olive oil', '1...
4     ['6 to 8 mini seedless cucumbers', '1/2 yellow...
5     ['1/4 cup minced onion', '6 tablespoons panko ...
6     ['1 clove garlic, minced', '1 tablespoon white...
7     ['1/2 cup unsalted butter, softened', '3/4 cup...
8     ['1 3/4 cups vodka', '1/2 cup freshly squeezed...
9     ['2 teaspoons olive oil', '1/2 cup finely chop...
10    ['1 pound red potatoes, cut into 1-inch cubes'...
11    ['3 cups freshly brewed espresso', '1 cup simp...
12    ['1 cup lower-sodium beef broth', '1/4 cup low...
13    ['8 cloves garlic, crushed', '6 tablespoons ex...
14    ['8 ounces bacon, diced', '1/2 cup diced onion...
15    ['1 (32 ounce) container Greek yogurt', '1 (16...
16    ['2 (12 ounce) zucchinis, sliced lengthwise in...
17    ['2 teaspoons oil', '1/2 white onion, fine