# Job Scraping and Filtering Pipeline Documentation

## Project Overview
This project scrapes, filters, and stores IT job listings from emprego.sapo.pt, covering two searches: Lisbon (`local=lisboa`) and international (`local=Internacional`) positions.

## Technologies Used
- **Python**: Primary programming language
- **Selenium**: Web automation and scraping
- **BeautifulSoup**: HTML parsing
- **Pandas**: Data manipulation and filtering
- **JSON**: Data storage format

## Key Components

### 1. Web Scraping
- Uses Selenium WebDriver for dynamic page interaction
- Handles cookie consent popups automatically
- Scrolls pages to load all content
- Extracts data from article elements
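
The scrolling step can be sketched as a loop that keeps scrolling until the page height stops growing. The notebook itself scrolls once and then waits two seconds; the helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are assumptions introduced here for illustration, not part of the original code.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return rounds
```

With a real session this would be called as `scroll_until_stable(driver)` right after `driver.get(url)`. The `max_rounds` cap keeps an endlessly growing page from looping forever.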

### 2. Data Extraction
- Job titles and URLs
- Company names (different methods for Lisboa/Internacional)
- Job locations
- Job descriptions
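- Employment type (currently only `Full-Time` is detected)
- Posting date (currently only the `Últimas 24 horas` label, "last 24 hours", is detected)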
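
As a rough illustration of the record each listing is turned into, the sketch below assembles one job dictionary with the same keys the scraper emits and resolves relative links against the site root. The helper `build_vaga` and the example path `/offer/123` are hypothetical. The notebook concatenates strings to build the URL, whereas this sketch swaps in `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

BASE_URL = "https://emprego.sapo.pt"


def build_vaga(titulo, href, empresa="", local="", descricao="", tipo="", data=""):
    """Assemble one job record with the same keys the scraper emits."""
    return {
        "titulo": titulo,
        "empresa": empresa,
        "local": local,
        "descricao": descricao,
        "tipo": tipo,
        "data": data,
        # urljoin leaves absolute URLs untouched and resolves relative ones
        "url": urljoin(BASE_URL, href) if href else "",
    }


# Hypothetical listing with a relative link
vaga = build_vaga("Python Developer", "/offer/123", local="Lisboa , Portugal")
```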

### 3. Data Processing
- Two-stage filtering process
  - Initial filter (`filter_jobs`): currently a pass-through that only guards against an empty list
  - Advanced filter (`advanced_filter`): removes duplicate listings by `url`
- Text cleaning and normalization
- JSON output formatting
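
The two filtering stages could look roughly like the following in plain Python. This is a sketch under assumptions: `basic_filter` is a hypothetical validation rule (non-empty title and URL), and `dedupe_by_url` keeps the first occurrence of each URL, which is meant to mirror pandas' default `drop_duplicates(subset=['url'])` behavior without requiring pandas.

```python
def basic_filter(vagas):
    """Stage 1: keep only records that have a non-empty title and URL."""
    return [v for v in vagas if v.get("titulo") and v.get("url")]


def dedupe_by_url(vagas):
    """Stage 2: drop later records whose URL was already seen (order preserved)."""
    seen = set()
    unique = []
    for vaga in vagas:
        if vaga["url"] not in seen:
            seen.add(vaga["url"])
            unique.append(vaga)
    return unique


# Hypothetical sample data
sample = [
    {"titulo": "Dev A", "url": "https://emprego.sapo.pt/a"},
    {"titulo": "", "url": "https://emprego.sapo.pt/b"},
    {"titulo": "Dev A (repost)", "url": "https://emprego.sapo.pt/a"},
]
result = dedupe_by_url(basic_filter(sample))
```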

### 4. File Management
- Separate files for Lisboa and Internacional jobs
- Existing output file is removed before new results are written
- UTF-8 encoding support for special characters
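
A minimal sketch of the save step, assuming a hypothetical helper `save_jobs` and a temporary example file. It shows that `ensure_ascii=False` together with `encoding="utf-8"` keeps accented characters readable in the file, and that opening with mode `"w"` already replaces previous content.

```python
import json
import os
import tempfile


def save_jobs(jobs, path):
    """Write jobs as pretty-printed UTF-8 JSON, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes "Ú" as-is instead of an escape sequence
        json.dump(jobs, f, indent=2, ensure_ascii=False)


sample = [{"titulo": "Engenheiro de Software", "data": "Últimas 24 horas"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vagas_exemplo.json")
    save_jobs(sample, path)
    save_jobs(sample, path)  # mode "w" truncates, so the second write replaces the first
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    restored = json.loads(raw)
```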

## Features
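- **Debug Mode**: With `debug = True`, the browser stays open on the first page for inspection; scraping does not proceed until the run is interrupted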
- **Pagination**: Handles multiple pages of results
- **Error Handling**: Graceful handling of missing data
- **Different HTML Structures**: Adapts to Lisboa vs Internacional layouts
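
For the Lisboa layout, the company name is recovered from the flattened text of the listing using a regular expression between the job title and the word "Lisboa". The sketch below isolates that step; the function name `extract_company_lisboa` and the sample text (including the company "Acme Lda") are made up for illustration.

```python
import re


def extract_company_lisboa(titulo, info_text):
    """Return the text found between the job title and the first 'Lisboa'."""
    # re.escape protects characters such as "(" and "/" that appear in titles
    pattern = rf"{re.escape(titulo)}\s*(.*?)\s*Lisboa"
    match = re.search(pattern, info_text)
    return match.group(1).strip() if match else ""


# Hypothetical flattened listing text
info_text = "Python Developer (m/f) Acme Lda Lisboa , Portugal Full-Time"
empresa = extract_company_lisboa("Python Developer (m/f)", info_text)
missing = extract_company_lisboa("Data Engineer", info_text)
```

One limitation of this approach is that a company whose own name contains "Lisboa" is cut off at that word, so the result can be empty or truncated.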

## Output Files
- `vagas_filtradas.json`: Lisboa job listings
- `vagas_filtradas_internacional.json`: International job listings

## Performance Considerations
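- Fixed 2-second pauses (`time.sleep(2)`) after dismissing the cookie banner and after scrolling each page
- Page limit (`limite`, default 5) to control scraping scope
- Results for each run are held in an in-memory list before filtering and saving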
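
One way to express the page limit lazily is a small generator that fills the `{}` placeholder of the URL template one page at a time. The name `page_urls` is an assumption; the notebook itself uses a `while` loop with a page counter.

```python
def page_urls(url_base, limite=5):
    """Lazily yield page URLs 1..limite by filling the `{}` placeholder."""
    for pagina in range(1, limite + 1):
        yield url_base.format(pagina)


lisboa_url = (
    "https://emprego.sapo.pt/offers?local=lisboa"
    "&categoria=informatica-tecnologias&pagina={}&ordem=mais-recentes"
)
urls = list(page_urls(lisboa_url, limite=3))
```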

## Usage Notes
- Set `debug = True` for development/testing
- Adjust the `limite` parameter of `scrape_jobs` (default `5`) to control how many pages are scraped
- Chrome WebDriver required for execution
- Internet connection required

# Job Scraping and Filtering Pipeline
This notebook combines web scraping and data filtering for job listings.

In [10]:
# 1. Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import pandas as pd
import json
import os
import time
import re

In [11]:
# 2. Global Variables
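vagas = []          # not used by scrape_jobs, which keeps its own local list
vagas_finais = []   # likewise; scrape_jobs returns its own result list
pagina = 1          # unused; scrape_jobs tracks its own page counter
debug = False       # when True, scrape_jobs keeps the browser open for inspection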



In [12]:
# 3. Function Definitions

def filter_jobs(vagas):
    if not vagas:  # Check if vagas is empty
        return []
    return vagas  # Return all jobs without filtering

In [13]:
def advanced_filter(vagas):
    if not vagas:  # Check if vagas is empty
        return []
    df = pd.DataFrame(vagas)
    df = df.drop_duplicates(subset=['url'])
    return df.to_dict('records')

In [14]:
def clean_text(text):
    if not isinstance(text, str):
        return text
    text = text.replace('\n', ' ')
    text = text.replace('\t', ' ')
    text = text.replace('\r', ' ')
    text = text.replace('\xa0', ' ')
    text = text.replace('\ufeff', '')  # strip byte-order marks (BOM)
    return ' '.join(text.split()).strip()
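
As a quick check of what this cleaning does, the sketch below uses a hypothetical shorter variant, `clean_text_compact`, with the same intent. It relies on the fact that `str.split()` with no arguments already splits on every kind of Unicode whitespace, and it assumes the invisible character being removed is the byte-order mark U+FEFF.

```python
def clean_text_compact(text):
    """Hypothetical shorter variant of clean_text with the same intent."""
    if not isinstance(text, str):
        return text  # leave non-string values (None, numbers) untouched
    # str.split() with no arguments splits on any Unicode whitespace,
    # including \n, \t, \r and the non-breaking space \xa0.
    return " ".join(text.replace("\ufeff", "").split())


sample = "\ufeff  Senior\xa0Developer\n\t(m/f)\r\n "
cleaned = clean_text_compact(sample)
```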

In [15]:
def clean_job_data(jobs):
    clean_jobs = []
    for job in jobs:
        clean_job = {
            key: clean_text(value) for key, value in job.items()
        }
        clean_jobs.append(clean_job)
    return clean_jobs

In [16]:
def scrape_jobs(url_base, saida_json, limite=5):
    # Initialize local variables
    current_page = 1
    local_vagas = []
    is_international = 'local=Internacional' in url_base

    print(f"\n📍 Iniciando scraping para {url_base}")
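    # dirname is "" for a bare filename, and os.makedirs("") raises an error
    output_dir = os.path.dirname(saida_json)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)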

    # Browser setup
    options = Options()
    # options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    wait = WebDriverWait(driver, 5)

    try:
        while True:
            url = url_base.format(current_page)
            print(f"\n🌐 Página {current_page}: {url}")
            driver.get(url)

            # Handle cookies only on first page
            if current_page == 1:
                try:
                    rejeitar_span = wait.until(EC.element_to_be_clickable(
                        (By.XPATH, "//span[translate(normalize-space(), 'REJEITAR TODOS', 'rejeitar todos') = 'rejeitar todos']")
                    ))
                    rejeitar_span.click()
                    print("✅ Clicado em 'REJEITAR TODOS'")
                    time.sleep(2)
                except TimeoutException:
                    print("ℹ️ Botão 'REJEITAR TODOS' não apareceu.")

            if debug:
                print("Debug mode: Keeping browser open for inspection")
                while True:
                    time.sleep(1)

            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

            soup = BeautifulSoup(driver.page_source, "html.parser")
            itens = soup.find_all("article", attrs={"data-v-37f59e59": True})
            print(f"🔎 {len(itens)} blocos encontrados.")

            for item in itens:
                try:
                    # Get title and URL
                    link = item.find("a", {"data-trackerlink": "offers_list|offer|text_offer"})
                    if not link:
                        continue

                    titulo = link.get_text(strip=True)
                    url = link["href"] if "href" in link.attrs else ""
                    if url and not url.startswith("http"):
                        url = "https://emprego.sapo.pt" + url

                    # Get company name and location based on job type
                    info_text = item.get_text(" ", strip=True)
                    
                    if is_international:
                        # Get company name for international jobs
                        company_link = item.find("a", {"data-trackerlink": "offers_list|offer|text_company"})
                        empresa = company_link.get_text(strip=True) if company_link else ""

                        # Get location for international jobs. info_text is plain text
                        # with the markup stripped, so the <li class="location"> element
                        # is looked up in the parsed HTML instead.
                        location_li = item.find("li", class_="location")
                        local = location_li.get_text(strip=True) if location_li else "Internacional"
                    else:
                        # Existing Lisboa logic
                        empresa_match = re.search(rf"{re.escape(titulo)}\s*(.*?)\s*Lisboa", info_text)
                        empresa = empresa_match.group(1).strip() if empresa_match else ""
                        local = "Lisboa , Portugal"

                    # Common fields
                    tipo = "Full-Time" if "Full-Time" in info_text else ""
                    data = "Últimas 24 horas" if "Últimas 24 horas" in info_text else ""

                    # Get description
                    description_p = item.find("p")
                    descricao = description_p.get_text(strip=True) if description_p else ""

                    if titulo and not titulo.startswith("Curso"):
                        vaga = {
                            "titulo": titulo,
                            "empresa": empresa,
                            "local": local,
                            "descricao": descricao,
                            "tipo": tipo,
                            "data": data,
                            "url": url
                        }
                        local_vagas.append(vaga)
                except Exception as e:
                    print(f"Error processing item: {e}")
                    continue

            current_page += 1
            if current_page > limite:  # Limit pages
                break

    finally:
        if not debug:
            driver.quit()

    # Apply filters
    vagas_filtradas = filter_jobs(local_vagas)
    vagas_finais = advanced_filter(vagas_filtradas)

    # Delete existing file if it exists
    if os.path.exists(saida_json):
        try:
            os.remove(saida_json)
            print(f"🗑️ Arquivo existente removido: {saida_json}")
        except Exception as e:
            print(f"⚠️ Erro ao remover arquivo: {e}")

    # Save filtered results
    with open(saida_json, "w", encoding="utf-8") as f:
        json.dump(vagas_finais, f, indent=2, ensure_ascii=False)

    print(f"\n💾 Total de {len(vagas_finais)} vagas filtradas salvas em '{saida_json}'")
    return vagas_finais

In [17]:
# Process Lisbon Jobs
lisboa_url = "https://emprego.sapo.pt/offers?local=lisboa&categoria=informatica-tecnologias&pagina={}&ordem=mais-recentes"
lisboa_output = "./vagas_filtradas.json"

# Clear previous data
vagas = []
vagas_finais = []

print("\n🔍 Starting to scrape Lisboa jobs...")
lisboa_jobs = scrape_jobs(lisboa_url, lisboa_output)

# Clean and save Lisboa jobs
cleaned_lisboa_jobs = clean_job_data(lisboa_jobs)
with open(lisboa_output, "w", encoding="utf-8") as f:
    json.dump(cleaned_lisboa_jobs, f, indent=2, ensure_ascii=False)

print(f"\n💾 Saved {len(cleaned_lisboa_jobs)} Lisboa jobs to {lisboa_output}")


🔍 Starting to scrape Lisboa jobs...

📍 Iniciando scraping para https://emprego.sapo.pt/offers?local=lisboa&categoria=informatica-tecnologias&pagina={}&ordem=mais-recentes

🌐 Página 1: https://emprego.sapo.pt/offers?local=lisboa&categoria=informatica-tecnologias&pagina=1&ordem=mais-recentes
✅ Clicado em 'REJEITAR TODOS'
🔎 10 blocos encontrados.

🌐 Página 2: https://emprego.sapo.pt/offers?local=lisboa&categoria=informatica-tecnologias&pagina=2&ordem=mais-recentes
🔎 10 blocos encontrados.

🌐 Página 3: https://emprego.sapo.pt/offers?local=lisboa&categoria=informatica-tecnologias&pagina=3&ordem=mais-recentes
🔎 10 blocos encontrados.

In [18]:
# Process International Jobs
internacional_url = "https://emprego.sapo.pt/offers?local=Internacional&categoria=informatica-tecnologias&pagina={}&ordem=mais-recentes"
internacional_output = "./vagas_filtradas_internacional.json"

# Clear previous data
vagas = []
vagas_finais = []

print("\n🔍 Starting to scrape Internacional jobs...")
internacional_jobs = scrape_jobs(internacional_url, internacional_output)

# Clean and save Internacional jobs
cleaned_internacional_jobs = clean_job_data(internacional_jobs)
with open(internacional_output, "w", encoding="utf-8") as f:
    json.dump(cleaned_internacional_jobs, f, indent=2, ensure_ascii=False)

print(f"\n💾 Saved {len(cleaned_internacional_jobs)} Internacional jobs to {internacional_output}")


🔍 Starting to scrape Internacional jobs...

📍 Iniciando scraping para https://emprego.sapo.pt/offers?local=Internacional&categoria=informatica-tecnologias&pagina={}&ordem=mais-recentes

🌐 Página 1: https://emprego.sapo.pt/offers?local=Internacional&categoria=informatica-tecnologias&pagina=1&ordem=mais-recentes
✅ Clicado em 'REJEITAR TODOS'
🔎 10 blocos encontrados.

🌐 Página 2: https://emprego.sapo.pt/offers?local=Internacional&categoria=informatica-tecnologias&pagina=2&ordem=mais-recentes
🔎 10 blocos encontrados.

🌐 Página 3: https://emprego.sapo.pt/offers?local=Internacional&categoria=informatica-tecnologias&pagina=3&ordem=mais-recentes
🔎 10 blocos encontrados.