# Web Scraping Exercise

## 1. Introduction and Planning

### Objective:
The goal of this exercise is to build a web scraper that collects data from a chosen website. You will learn how to send HTTP requests, parse HTML content, extract relevant data, and store it in a structured format.

### Tasks:
1. Identify the data you want to scrape.
2. Choose the target website(s).
3. Plan the structure of your project.

### Example:
For this exercise, we will scrape job listings from Indeed.com. We will extract job titles, company names, locations, and job descriptions.
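
One way to make the plan concrete is to write down the record you expect to collect before touching any HTML. The sketch below is only an illustration: the class name `JobListing`, its field names, and the placeholder values are assumptions chosen to mirror the four fields listed above.

```python
from dataclasses import dataclass, asdict


@dataclass
class JobListing:
    """One scraped record; field names are illustrative, not prescribed."""
    title: str
    company: str
    location: str
    description: str


# Placeholder values, not real scraped data.
example = JobListing(
    title="Data Analyst",
    company="Example Corp",
    location="Remote",
    description="Placeholder description.",
)

# asdict() gives a plain dictionary that is easy to write to CSV or JSON later.
print(asdict(example))
```

Fixing the fields up front makes it easier to notice, later on, when a page is missing one of them.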

## 2. Understanding the Target Website

### Objective:

Analyze the structure of the web pages to be scraped.

### Tasks:

* Inspect the target website using browser developer tools.
* Identify the HTML elements that contain the desired data.

### Instructions:

* Open your browser and navigate to the target website (e.g., Indeed.com).
* Right-click on the webpage and select "Inspect", or press Ctrl+Shift+I (Cmd+Option+I on macOS).
* Use the developer tools to explore the HTML structure of the webpage.
* Identify the tags and classes of the elements that contain the job titles, company names, locations, and descriptions.
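
As a complement to the developer tools, a saved copy of a page can be surveyed programmatically to see which tag and class combinations occur most often. The following is a minimal sketch using only the standard library. The `TagClassCounter` helper and the inline `sample_html` snippet are hypothetical; `recipe-title` is an invented class name, while `recipe-detail__list-item` is the class used later in this notebook.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Count (tag, class) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        class_value = dict(attrs).get("class") or ""
        for class_name in class_value.split():
            self.counts[(tag, class_name)] += 1


# Hypothetical snippet standing in for a saved page.
sample_html = """
<h1 class="recipe-title">Example recipe</h1>
<ul>
  <li class="recipe-detail__list-item">200g spaghetti</li>
  <li class="recipe-detail__list-item">1 medium egg</li>
</ul>
"""

counter = TagClassCounter()
counter.feed(sample_html)

# List the most frequent combinations first.
for (tag, class_name), count in counter.counts.most_common():
    print(f"<{tag} class='{class_name}'>: {count}")
```

Classes that repeat many times on one page are often the list items or cards that hold the data of interest.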

In [16]:
from bs4 import BeautifulSoup
import requests
import re

# Load the HTML file
with open("Baked spaghetti cups _ Tesco Real Food.html", "r", encoding="utf-8") as file:
    html_content = file.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

def extract_text_from_nested_tags(tag):
    """Extract and join the text of a tag and its nested tags."""
    return ' '.join(tag.stripped_strings)

def get_recipe_information(soup):
    # Extract the recipe title, falling back to a placeholder if there is no <h1>
    h1 = soup.find('h1')
    title = h1.text.strip() if h1 else 'No title available'

    # Extract the ingredients
    ingredients = []
    for li in soup.find_all('li', class_='recipe-detail__list-item'):
        ingredients.append(extract_text_from_nested_tags(li))
    ingredients = ', '.join(ingredients)

    # Extract the preparation method
    steps = []
    method_section = soup.find('h2', string='Method')
    if method_section:
        method_steps = method_section.find_next('ol')
        if method_steps:
            for li in method_steps.find_all('li'):
                steps.append(extract_text_from_nested_tags(li))
    steps = ' '.join(steps)

    return {
        'Title': title,
        'Ingredients': ingredients,
        'Method': steps
    }


In [17]:
# Get the recipe information
recipe_info = get_recipe_information(soup)

# Print the results
for key, value in recipe_info.items():
    print(f"{key}: {value}\n")

Title: Baked spaghetti cups recipe

Ingredients: 200g Hearty Foods Co. spaghetti, 300g Tesco garlic & herb passata, 1 medium courgette, coarsely grated, 1 medium egg, beaten, 75g grated 30% reduced fat mature cheese, oil cooking spray, sliced cucumber, to serve, red and green grapes, to serve

Method: Preheat the oven to 200°C/180°C fan/Gas 6. Bring a large pan of water to the boil and cook the spaghetti according to pack instructions, until just al dente. Drain and rinse the spaghetti under running water until cool. In the pan, mix together the passata and the beaten egg until combined. Season with black pepper and add the grated courgette and half of the cheese. Mix well, add the drained spaghetti and stir until evenly coated. Spray a 12-muffin tin with oil and curl the spaghetti into them. Top with the rest of the cheese and bake for 15-18 mins until the cheese is golden and bubbling. Let cool completely and refrigerate until needed. Serve at room temperature or cold with sliced cuc

In [18]:
# URL of the web page
url = 'https://realfood.tesco.com/recipes/popular/pizza.html'

# Send a GET request to the page (the timeout keeps a stalled connection from hanging the cell)
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all links on the page
    links = soup.find_all('a', href=True)

    # Regular expression that matches recipe links
    recipe_pattern = re.compile(r'/recipes/.*\.html$')

    # Keep links that match the pattern, excluding links to the "popular" category pages
    recipe_links = [link['href'] for link in links if recipe_pattern.search(link['href']) and 'popular' not in link['href']]

    # Turn site-relative links into absolute URLs, then drop duplicates
    recipe_links = [url.split('/recipes/')[0] + link if link.startswith('/recipes') else link for link in recipe_links]
    recipe_links = list(set(recipe_links))

    # Print the links and save them to a file
    with open('recipe_links.txt', 'w') as file:
        for link in recipe_links:
            print(link)
            file.write(link + '\n')

else:
    print(f'Error accessing the page: {response.status_code}')

https://realfood.tesco.com/recipes/fish-and-seafood/cod-recipes.html
https://realfood.tesco.com/recipes/collections/soup-recipes.html
https://realfood.tesco.com/recipes/spicy-chicken-cauliflower-pizza.html
https://realfood.tesco.com/recipes/pesto-pizza.html
https://realfood.tesco.com/recipes/fish-and-seafood/prawn-recipes.html
https://realfood.tesco.com/recipes/fish-and-seafood/salmon-recipes.html
https://realfood.tesco.com/recipes/veggie-tortilla-pizzas.html
https://realfood.tesco.com/recipes/spicy-sausage-pizza-with-hot-honey.html
https://realfood.tesco.com/recipes/chicken-pizza-calzone.html
https://realfood.tesco.com/recipes/prosciutto-and-sun-dried-tomato-wholewheat-pizza.html
https://realfood.tesco.com/recipes/pepperoni-pizza-jacket-potatoes.html
https://realfood.tesco.com/recipes/chargrilled-veg-and-goats-cheese-pizzas.html
https://realfood.tesco.com/recipes/meat/beef-recipes.html
https://realfood.tesco.com/recipes/rosemary-olive-and-pesto-pizza.html
https://realfood.tesco.com/re
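
The list above still contains category pages such as `.../fish-and-seafood/cod-recipes.html`. Judging only from the output shown, individual recipes seem to sit directly under `/recipes/`, while category pages sit one folder deeper. The sketch below relies on that assumption, which may not hold for the whole site, and uses `urljoin` to build absolute URLs. The `normalize_recipe_links` helper and the `sample_hrefs` list are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

BASE_URL = "https://realfood.tesco.com/recipes/popular/pizza.html"

# Assumption: a recipe page is exactly /recipes/<name>.html, with no subfolder.
RECIPE_PATH = re.compile(r"^/recipes/[^/]+\.html$")


def normalize_recipe_links(hrefs, base_url=BASE_URL):
    """Return sorted, de-duplicated absolute URLs that look like recipe pages."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    return sorted({url for url in absolute if RECIPE_PATH.match(urlparse(url).path)})


# Hypothetical href values in the shapes seen in the output above.
sample_hrefs = [
    "/recipes/pesto-pizza.html",
    "https://realfood.tesco.com/recipes/pesto-pizza.html",
    "/recipes/fish-and-seafood/cod-recipes.html",
    "/recipes/popular/pizza.html",
]

print(normalize_recipe_links(sample_hrefs))
```

Sorting the result also keeps `recipe_links.txt` stable between runs, which `list(set(...))` does not guarantee.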

In [19]:
def get_recipe_information(soup):
    """Extract a recipe's title, ingredients, and method; the last two are None if absent."""
    h1 = soup.find('h1')
    title = h1.text.strip() if h1 else 'No title available'
    ingredients = []
    for li in soup.find_all('li', class_='recipe-detail__list-item'):
        ingredients.append(extract_text_from_nested_tags(li))
    ingredients = ', '.join(ingredients) if ingredients else None

    steps = []
    method_section = soup.find('h2', string='Method')
    if method_section:
        method_steps = method_section.find_next('ol')
        if method_steps:
            for li in method_steps.find_all('li'):
                steps.append(extract_text_from_nested_tags(li))
    steps = ' '.join(steps) if steps else None

    return {
        'Title': title,
        'Ingredients': ingredients,
        'Method': steps
    }

def main():
    # Read the links from the text file
    with open('recipe_links.txt', 'r') as file:
        links = file.read().splitlines()

    # Process each link
    for link in links:
        response = requests.get(link, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            recipe_info = get_recipe_information(soup)
            # Only pages with both ingredients and a method are actual recipes
            if recipe_info['Ingredients'] and recipe_info['Method']:
                for key, value in recipe_info.items():
                    print(f"{key}: {value}\n")
        else:
            print(f'Error accessing the page: {response.status_code}')

if __name__ == '__main__':
    main()

Title: Spicy chicken cauliflower 'pizza' recipe

Ingredients: 350g cauliflower florets, 30g ground almonds, 1 egg, beaten, 3 tbsp passata, 100g low-fat natural yogurt, 10g mint, leaves finely chopped, 50g roasted peppers from a jar, drained, 170g pack fiery piri piri chicken mini fillets, thinly sliced, 20g wild rocket

Method: Preheat the oven to gas 7, 220°C, fan 200°C. Blitz the cauliflower in a food processor until it resembles breadcrumbs. Steam the cauliflower on the hob for 5 mins, or use the microwave to save some energy. Once steamed, spread out in a shallow bowl. Mix in the almonds and egg, and season with pepper. Heap the mixture onto a large baking tray lined with nonstick baking paper and press into a 28cm-30cm circle. Bake in the oven for 20 mins. Spread the passata over the cauliflower pizza base and bake for a further 5 mins. Meanwhile, mix the yogurt and mint together, and season with pepper to taste. Remove the base from the oven and then scatter over the peppers, chi
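
The objective mentions storing the data in a structured format, whereas the cells above only print it. A minimal sketch of one way to persist the dictionaries returned by `get_recipe_information` follows, using the standard `csv` and `json` modules. The `save_recipes` helper, the output file names, and the sample record are assumptions for illustration.

```python
import csv
import json


def save_recipes(recipes, csv_path="recipes.csv", json_path="recipes.json"):
    """Write a list of recipe dictionaries to both CSV and JSON."""
    fieldnames = ["Title", "Ingredients", "Method"]

    # newline="" prevents blank rows on Windows; UTF-8 keeps characters such as "°".
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(recipes)

    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(recipes, json_file, ensure_ascii=False, indent=2)


# Placeholder record with the same keys as get_recipe_information().
sample_recipes = [
    {
        "Title": "Example recipe",
        "Ingredients": "200g spaghetti, 1 medium egg",
        "Method": "Boil the spaghetti. Mix in the egg.",
    }
]

save_recipes(sample_recipes)
```

In `main()`, the same idea would amount to appending each valid `recipe_info` to a list and calling the helper once after the loop.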