## Phase 1 – Website Structure Analysis

An initial analysis was conducted to determine how job offers are loaded on the
Hellowork platform in order to select an appropriate data extraction method.

Hellowork provides continuously updated job listings, advanced search filters,
sector-based navigation, and personalized features for users. Features of this kind
typically rely on backend queries and on content generated at request time rather than on fixed pages.

### Conclusion

Hellowork is identified as a **dynamic website**, where job offers are rendered
dynamically rather than being fully embedded in static HTML pages. Consequently,
the data extraction process will be implemented using **Selenium**, which allows
the execution of client-side scripts and reliable access to dynamic content.
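
One way to check this kind of conclusion is to compare the raw HTML returned without running JavaScript against what the browser displays. The sketch below is illustrative only: it assumes offer links carry the `data-cy="offerTitle"` attribute (the selector used later in the scraper), and both sample HTML strings and the helper names are hypothetical.

```python
import re

# Marker taken from the selector used later in the scraper: a[data-cy="offerTitle"].
OFFER_MARKER = re.compile(r'data-cy=["\']offerTitle["\']')


def count_static_offers(raw_html: str) -> int:
    """Count offer links present in HTML fetched without running JavaScript."""
    return len(OFFER_MARKER.findall(raw_html))


def needs_browser(raw_html: str, offers_seen_in_browser: int) -> bool:
    """Return True when the raw HTML exposes fewer offers than the rendered page."""
    return count_static_offers(raw_html) < offers_seen_in_browser


# Hypothetical samples: a server-rendered page and an empty JavaScript shell.
static_page = (
    '<a data-cy="offerTitle" href="/a">A</a>'
    '<a data-cy="offerTitle" href="/b">B</a>'
)
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(needs_browser(static_page, 2))  # False
print(needs_browser(js_shell, 2))     # True
```

If the raw HTML already contains every offer, a plain HTTP client might be enough; a browser-driven tool such as Selenium is the safer choice when it does not.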


## Phase 1 – Scraping Scope Definition

### Search Criteria

To build a dataset suitable for large-scale business intelligence analysis and machine learning tasks, the scraping process targets a broad and representative set of job offers published on the Hellowork platform.

The search criteria are intentionally defined in a **general and cross-domain manner**, covering multiple professional fields, including but not limited to:
- Information Technology and digital professions
- Engineering and technical roles
- Business, management, and administration
- Marketing, communication, and sales
- Finance, accounting, and economics
- Human resources and support functions

The data collection relies on generic job search result pages without restrictive filters on job category, experience level, or contract type. This approach allows the aggregation of a diverse sample of job offers reflecting various sectors, seniority levels, and employment types.

To ensure sufficient data volume for clustering, classification, and trend analysis, the scraping scope is designed to collect **at least 2,000 job offers**. This is achieved by iterating over multiple search result pages and aggregating offers across different categories and search queries.

Only currently available job offers are considered, and location information is preserved to support geographic analysis in later phases of the project.
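
A quick back-of-the-envelope calculation helps confirm that the 2,000-offer target is reachable. The sketch below uses the values from the scraping script in this notebook (26 sectors, `MAX_PAGES_PER_SECTOR = 10`) and assumes 20 listings per results page; the per-page figure is an assumption, and duplicates across sectors would lower the real total.

```python
import math

N_SECTORS = 26             # length of SECTORS_LIST in the scraping script
MAX_PAGES_PER_SECTOR = 10  # value used in the scraping script
OFFERS_PER_PAGE = 20       # assumption: listings shown per results page
TARGET_OFFERS = 2000

# Upper bound if every page of every sector is full and no offer repeats.
upper_bound = N_SECTORS * MAX_PAGES_PER_SECTOR * OFFERS_PER_PAGE

# Minimum number of full pages required to reach the target.
pages_needed = math.ceil(TARGET_OFFERS / OFFERS_PER_PAGE)

# Pages per sector if the load is spread evenly across sectors.
pages_per_sector = math.ceil(pages_needed / N_SECTORS)

print(upper_bound)       # 5200
print(pages_needed)      # 100
print(pages_per_sector)  # 4
```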


## Phase 1 – Data Fields Definition

To ensure a structured and complete dataset, the following fields will be extracted from each job offer on **Hellowork**:

| Field | Description | Notes / Purpose |
|-------|------------|----------------|
| `Sector` | Name of the job sector | Tracks the category/industry of the job; useful for filtering and analytics |
| `Job_Title` | The title of the job position | Key feature for clustering, keyword extraction, and dashboard display |
| `Company` | Name of the hiring company | Useful for analytics, filtering, and company-specific trends |
| `Location` | City or region of the job | Enables geographic analysis in dashboards |
| `Contract` | Type of employment contract (CDI, CDD, Internship, etc.) | Categorical feature for ML and filtering |
| `Salary` | Salary as displayed in the job posting | Will be cleaned and standardized during ETL |
| `Description` | Full textual description of the position | Main input for NLP, clustering, and keyword extraction |
| `URL` | Direct link to the job offer | Reference for validation, scraping completeness, and linking |
| `additional_info` | Optional field for other relevant details (e.g., benefits, remote work) | Can capture extra structured or unstructured info; may be empty |

### Notes
- Some fields may be **missing** in certain job postings (e.g., salary or contract). Missing values will be handled during ETL.  
- Text fields (`Job_Title`, `Description`) will be **preprocessed** for NLP tasks in Phase 3.  
- Categorical fields (`Sector`, `Contract`, `Location`) will be **encoded** during preprocessing.  
- The dataset will be **saved incrementally** in `hellowork_progress.csv` and finalized as `hellowork_final_sectors_data.csv`.  

This structured field definition ensures that the dataset is **consistent, ML-ready, and suitable for visualization** in later phases.
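
The field list can be expressed as a small schema so that every record starts with the same keys. This is a minimal sketch, not the notebook's implementation: `empty_record` and `missing_fields` are hypothetical helpers, the example URL is a placeholder, and `additional_info` is left out because the current scraper does not extract it.

```python
from typing import Dict, List

FIELDS: List[str] = [
    "Sector", "Job_Title", "Company", "Location",
    "Contract", "Salary", "Description", "URL",
]
MISSING = "N/A"  # same placeholder the scraper uses for absent values


def empty_record(sector: str, url: str) -> Dict[str, str]:
    """Create a record with every field present and set to the placeholder."""
    record = {field: MISSING for field in FIELDS}
    record["Sector"] = sector
    record["URL"] = url
    return record


def missing_fields(record: Dict[str, str]) -> List[str]:
    """List the fields that still hold the placeholder value."""
    return [f for f in FIELDS if record.get(f, MISSING) == MISSING]


record = empty_record("BTP", "https://example.com/offer/1")  # hypothetical URL
record["Job_Title"] = "Chef de chantier"
print(missing_fields(record))
# ['Company', 'Location', 'Contract', 'Salary', 'Description']
```

Counting the placeholder values per field gives a quick completeness check before the ETL phase.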


## Phase 1 – HTML Structure Analysis

To accurately extract the defined data fields, the HTML structure of Hellowork job listings was analyzed using browser developer tools. This step identifies the **HTML tags, CSS selectors, and containers** corresponding to each field.

### Mapping of Data Fields to HTML Elements

| Field | HTML Tag / CSS Selector | Notes |
|-------|------------------------|-------|
| `Job_Title` | `[data-cy="jobTitle"]` (usually inside `<h1>` or `<a>`) | Main title of the job; used as the primary identifier for the job offer |
| `Company` | `h1 a` | Name of the hiring company; displayed near the job title |
| `Location` | `ul.tw-flex.tw-flex-wrap.tw-gap-3 li:nth-child(1)` | City/region of the job; first `<li>` under job info list |
| `Contract` | `ul.tw-flex.tw-flex-wrap.tw-gap-3 li:nth-child(2)` | Employment type (CDI, CDD, Internship); second `<li>` under job info list |
| `Salary` | `[data-cy="salary-tag-button"]` | Optional; may be missing for some offers |
| `Description` | `[data-truncate-text-target="content"]` | Full textual job description; line breaks are removed in preprocessing |
| `URL` | `href` attribute of the `<a>` tag linking to job offer | Direct link to job details; used for navigation and validation |
| `Sector` | N/A (from sector list in scraping loop) | Assigned based on the sector being scraped |
| `additional_info` | Optional; not explicitly extracted in current code | Can be added later if extra details are needed |

### Notes
- Some job postings may **omit optional fields** (e.g., `Salary` or `Contract`). Missing values are handled in ETL.  
- All fields are **present in the DOM** once the page has finished loading in the browser; the current selectors need no further interaction (scrolling or clicking) to expose them.  
- Using this mapping, the scraping script can reliably extract each field for all job offers across sectors.  
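
The mapping above can be kept in a single dictionary so that selectors are maintained in one place. The sketch below is illustrative: `extract_fields` is a hypothetical helper that takes any function returning the text for a CSS selector (or `None`), and `fake_page` is invented sample data standing in for a rendered page.

```python
from typing import Callable, Dict, Optional

# CSS selectors copied from the mapping table above.
SELECTORS: Dict[str, str] = {
    "Job_Title": '[data-cy="jobTitle"]',
    "Company": "h1 a",
    "Location": "ul.tw-flex.tw-flex-wrap.tw-gap-3 li:nth-child(1)",
    "Contract": "ul.tw-flex.tw-flex-wrap.tw-gap-3 li:nth-child(2)",
    "Salary": '[data-cy="salary-tag-button"]',
    "Description": '[data-truncate-text-target="content"]',
}


def extract_fields(get_text: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Apply every selector; get_text returns the element text or None."""
    record = {}
    for field, selector in SELECTORS.items():
        text = get_text(selector)
        # Collapse whitespace (including line breaks); fall back to "N/A".
        record[field] = " ".join(text.split()) if text else "N/A"
    return record


# Hypothetical stand-in for a rendered page: selector -> visible text.
fake_page = {
    '[data-cy="jobTitle"]': "Data Analyst H/F",
    "h1 a": "Acme",
    "ul.tw-flex.tw-flex-wrap.tw-gap-3 li:nth-child(1)": "Lyon - 69",
    '[data-truncate-text-target="content"]': "Line one\nLine two",
}
print(extract_fields(fake_page.get))
```

With Selenium, `get_text` could wrap `driver.find_element(By.CSS_SELECTOR, selector).text` and return `None` when `NoSuchElementException` is raised.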


## Phase 1 – Scraping Workflow and Strategy

With the HTML structure and data fields defined, the scraping workflow is designed to extract all job offers efficiently and reliably while respecting ethical standards.

### 1. Workflow Overview

1. **Set up Selenium WebDriver** (Chrome) with proper options:
   - User-agent header
   - Maximized window
   - Optional headless mode
2. **Access the search results page** for each sector.
3. **Handle cookie banners** and other pop-ups to ensure uninterrupted scraping.
4. **Iterate over each job listing on the page**:
   - Extract job URL from the search results
   - Open job detail page to scrape all defined data fields (`Job_Title`, `Company`, `Location`, `Contract`, `Salary`, `Description`, `URL`, `Sector`)
   - Use **explicit waits** to ensure elements are loaded before extraction
5. **Store extracted data** in a structured format (CSV)
6. **Repeat for all pages** of each sector until the target of ≥ 2,000 job offers is reached

---

### 2. Pagination Handling

- Each search result page contains multiple job listings (typically 20–25 per page)  
- The scraper navigates through pages by **modifying the page number parameter** in the URL (`&p=page_number`)  
- Iteration continues until:
  - Enough job offers are collected (≥ 2,000)  
  - No more pages are available or the page contains no job listings
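
The pagination rules above can be sketched independently of the browser. In this illustration, `build_page_url` and `collect_offer_urls` are hypothetical helpers, the `s` and `p` parameters follow the URL pattern used by the scraper, and `fake_fetch` is an invented stand-in for the Selenium step that returns the offer links found on a page.

```python
from typing import Callable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_SEARCH_URL = (
    "https://www.hellowork.com/fr-fr/emploi/recherche.html"
    "?k=job+%C3%A9tudiant&st=relevance"
)


def build_page_url(base_url: str, sector_id: str, page: int) -> str:
    """Append the sector (s) and page (p) parameters to the search URL."""
    parts = urlsplit(base_url)
    params = parse_qsl(parts.query) + [("s", sector_id), ("p", str(page))]
    return urlunsplit(parts._replace(query=urlencode(params)))


def collect_offer_urls(fetch_page: Callable[[str], List[str]], sector_id: str,
                       max_pages: int, target: int) -> List[str]:
    """Walk result pages until the target is met or a page comes back empty."""
    collected: List[str] = []
    seen = set()
    for page in range(1, max_pages + 1):
        urls = fetch_page(build_page_url(BASE_SEARCH_URL, sector_id, page))
        if not urls:  # stop condition: page contains no listings
            break
        for url in urls:
            if url not in seen:  # drop duplicates, keep page order
                seen.add(url)
                collected.append(url)
        if len(collected) >= target:  # stop condition: enough offers
            break
    return collected[:target]


def fake_fetch(url: str) -> List[str]:
    """Hypothetical fetcher: two pages of three offers, then nothing."""
    page = int(url.rsplit("p=", 1)[1])
    if page > 2:
        return []
    return [f"https://example.com/offer/{page}-{i}" for i in range(3)]


print(build_page_url(BASE_SEARCH_URL, "BTP", 2))
print(len(collect_offer_urls(fake_fetch, "BTP", max_pages=10, target=100)))  # 6
```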

---

### 3. Error Handling

- **Missing fields**: optional fields (`Salary`, `Contract`) keep the placeholder value `"N/A"` when the element is absent  
- **Timeouts / page load issues**: use Selenium **explicit waits** and retry logic for failed elements  
- **Unexpected page structure**: log the issue and continue with remaining jobs  
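
The retry idea can be sketched with a small wrapper. This is a hedged illustration rather than the notebook's code: `with_retry` is a hypothetical helper, the linear backoff is one possible choice, and `flaky` simulates a page that loads only on the third attempt.

```python
import logging
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("scraper")


def with_retry(action: Callable[[], T], attempts: int = 3, delay_s: float = 1.0,
               retry_on: Tuple[Type[Exception], ...] = (Exception,)) -> Optional[T]:
    """Run action, retrying on the listed exceptions; return None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            # Log the failure and carry on instead of aborting the whole run.
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay_s * attempt)  # linear backoff between tries
    return None


calls = {"n": 0}


def flaky() -> str:
    """Simulated page load that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("page did not load")
    return "ok"


print(with_retry(flaky, attempts=3, delay_s=0.0, retry_on=(TimeoutError,)))  # ok
```

With Selenium, `retry_on` could be set to `(TimeoutException,)` so that only wait timeouts trigger a retry while other errors surface immediately.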

---

### 4. Politeness and Ethics

- Add a **delay of 0.5–3 seconds** between requests and job detail visits to avoid overloading the server  
- Respect the website’s **robots.txt** and terms of use  
- Scraping is performed **only for academic purposes**, not for commercial use  
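
Both politeness rules can be sketched with the standard library. The helpers `polite_delay` and `is_allowed` are hypothetical names, and the `robots.txt` lines below are invented for the example; they are not Hellowork's actual rules, which should be read from the live file.

```python
import random
import time
from typing import Iterable
from urllib.robotparser import RobotFileParser


def polite_delay(min_s: float = 0.5, max_s: float = 3.0) -> float:
    """Sleep for a random duration in [min_s, max_s] and return it."""
    pause = random.uniform(min_s, max_s)
    time.sleep(pause)
    return pause


def is_allowed(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as lines of text."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, NOT the rules published by Hellowork.
sample_robots = ["User-agent: *", "Disallow: /private/"]

print(is_allowed(sample_robots, "*", "https://example.com/jobs/1"))     # True
print(is_allowed(sample_robots, "*", "https://example.com/private/x"))  # False
```

A randomized pause spreads requests less regularly than a fixed `time.sleep(0.5)`, at the cost of a longer total run time.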

---

### 5. Data Storage

- Extracted data is **saved incrementally** to `hellowork_progress.csv` to prevent loss in case of interruption  
- Final dataset saved as `hellowork_final_sectors_data.csv`  
- CSV encoding: UTF-8 with a byte order mark (`utf-8-sig`) to preserve special characters  
- Data includes all defined fields and sector information, ready for **Phase 2 – ETL and Data Cleaning**
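
As an alternative to rewriting the whole progress file with pandas after each sector, rows can be appended as they arrive. The sketch below swaps in the standard `csv` module for that purpose; `append_rows` is a hypothetical helper, and the demo rows and temporary path are invented for illustration.

```python
import csv
import os
import tempfile
from typing import Dict, Iterable

FIELDS = ["Sector", "Job_Title", "Company", "Location",
          "Contract", "Salary", "Description", "URL"]


def append_rows(path: str, rows: Iterable[Dict[str, str]]) -> None:
    """Append rows to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # utf-8-sig emits a byte order mark, so use it only when creating the file.
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(path, "a", newline="", encoding=encoding) as handle:
        # restval fills any field missing from a row with the placeholder.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="N/A")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demonstration in a temporary directory with hypothetical rows.
demo_path = os.path.join(tempfile.mkdtemp(), "hellowork_progress.csv")
append_rows(demo_path, [{"Sector": "BTP", "Job_Title": "Maçon",
                         "URL": "https://example.com/1"}])
append_rows(demo_path, [{"Sector": "Immobilier", "Job_Title": "Agent",
                         "URL": "https://example.com/2"}])

with open(demo_path, newline="", encoding="utf-8-sig") as handle:
    saved = list(csv.DictReader(handle))
print(len(saved), saved[1]["Sector"], saved[0]["Salary"])  # 2 Immobilier N/A
```

Appending keeps each save proportional to the new rows only, which matters more as the dataset grows toward several thousand offers.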

---

### 6. Summary

This workflow ensures:
- Reliable extraction of ≥ 2,000 job offers across multiple sectors  
- Structured and consistent dataset with all required fields  
- Ethical, reproducible, and robust scraping process using Selenium  
- Preparedness for subsequent **ETL, ML, and interactive dashboard** phases


In [1]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException, NoSuchElementException

## Libraries Used in Phase 1 – Scraping

- `time`  
  Provides functions to **pause execution** (e.g., `sleep`) between requests to avoid overloading the server.

- `pandas` (`pd`)  
  Used for **data manipulation and storage**. Allows saving scraped job offers to CSV (`.to_csv`) and performing ETL in later phases.

- `selenium`  
  Automates **browser interactions** for scraping dynamic websites. Main components used:  
  - `webdriver` – controls the browser (Chrome in this case)  
  - `Service` – manages the ChromeDriver service  
  - `By` – locates HTML elements by ID, CSS selector, XPath, etc.  
  - `WebDriverWait` – explicit waits until elements are present or clickable  
  - `expected_conditions` (`EC`) – conditions for waits, e.g., element visibility or clickability  
  - `TimeoutException` – raised when a wait exceeds the maximum time  
  - `NoSuchElementException` – raised when `find_element` matches no element  

- `webdriver_manager.chrome` (`ChromeDriverManager`)  
  Automatically **downloads and manages the correct ChromeDriver version**, simplifying Selenium setup.



In [2]:

# --- Configuration ---
BASE_SEARCH_URL = "https://www.hellowork.com/fr-fr/emploi/recherche.html?k=job+%C3%A9tudiant&st=relevance"
MAX_PAGES_PER_SECTOR = 10

SECTORS_LIST = [
    {"id": "Agri_peche", "name": "Agriculture • Pêche"},
    {"id": "BTP", "name": "BTP"},
    {"id": "Banq_assur_finan", "name": "Banque • Assurance • Finance"},
    {"id": "Distrib_commerce", "name": "Distribution • Commerce de gros"},
    {"id": "Enseign_forma", "name": "Enseignement • Formation"},
    {"id": "Immo", "name": "Immobilier"},
    {"id": "Ind_agro", "name": "Industrie Agro • alimentaire"},
    {"id": "Ind_auto_meca_nav", "name": "Industrie Auto • Meca • Navale"},
    {"id": "Ind_aero", "name": "Industrie Aéronautique • Aérospatial"},
    {"id": "Ind_manufact", "name": "Industrie Manufacturière"},
    {"id": "Ind_pharma_bio_chim", "name": "Industrie Pharmaceutique • Biotechn. • Chimie"},
    {"id": "Ind_petro", "name": "Industrie Pétrolière • Pétrochimie"},
    {"id": "Ind_hightech_telecom", "name": "Industrie high • tech • Telecom"},
    {"id": "Media_internet_com", "name": "Média • Internet • Communication"},
    {"id": "Resto", "name": "Restauration"},
    {"id": "Sante_social", "name": "Santé • Social • Association"},
    {"id": "Energie_envir", "name": "Secteur Energie • Environnement"},
    {"id": "Inform_SSII", "name": "Secteur informatique • ESN"},
    {"id": "Serv_public_autre", "name": "Service public autres"},
    {"id": "Serv_public_etat", "name": "Service public d'état"},
    {"id": "Serv_public_collec_terri", "name": "Service public des collectivités territoriales"},
    {"id": "Serv_public_hosp", "name": "Service public hospitalier"},
    {"id": "Serv_entreprise", "name": "Services aux Entreprises"},
    {"id": "Serv_pers_part", "name": "Services aux Personnes • Particuliers"},
    {"id": "Tourism_hotel_loisir", "name": "Tourisme • Hôtellerie • Loisirs"},
    {"id": "Transport_logist", "name": "Transport • Logistique"}
]

def setup_driver():
    options = webdriver.ChromeOptions()
    # options.add_argument("--headless")  # Uncomment to run in the background
    options.add_argument("--start-maximized")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

def handle_cookies(driver):
    """Close the cookie banner if it is present."""
    try:
        cookie_btn = WebDriverWait(driver, 4).until(
            EC.element_to_be_clickable((By.ID, "hw-cc-notice-continue-without-accepting-btn"))
        )
        cookie_btn.click()
        print("Cookies handled.")
        time.sleep(1)
    except TimeoutException:
        print("No cookies banner found.")

def scrape_job_details(driver, url, sector_name):
    """Scrape the details of a single job offer."""
    driver.get(url)
    data = {
        "Sector": sector_name,
        "Job_Title": "N/A",
        "Company": "N/A",
        "Location": "N/A",
        "Contract": "N/A",
        "Salary": "N/A",
        "Description": "N/A",
        "URL": url
    }

    try:
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.TAG_NAME, "h1")))

        # 1. Job title
        try:
            data["Job_Title"] = driver.find_element(By.CSS_SELECTOR, '[data-cy="jobTitle"]').text.strip()
        except NoSuchElementException:
            pass  # keep the "N/A" placeholder

        # 2. Company
        try:
            data["Company"] = driver.find_element(By.CSS_SELECTOR, 'h1 a').text.strip()
        except NoSuchElementException:
            pass

        # 3. Location & contract
        # find_elements returns an empty list when nothing matches, so no try/except is needed here
        tags = driver.find_elements(By.CSS_SELECTOR, 'ul.tw-flex.tw-flex-wrap.tw-gap-3 li')
        if len(tags) > 0:
            data["Location"] = tags[0].text.strip()
        if len(tags) > 1:
            data["Contract"] = tags[1].text.strip()

        # 4. Salary (optional on many offers)
        try:
            data["Salary"] = driver.find_element(By.CSS_SELECTOR, '[data-cy="salary-tag-button"]').text.strip()
        except NoSuchElementException:
            pass

        # 5. Description (line breaks flattened to spaces)
        try:
            desc = driver.find_element(By.CSS_SELECTOR, '[data-truncate-text-target="content"]').text
            data["Description"] = desc.replace("\n", " ").strip()
        except NoSuchElementException:
            pass

    except Exception as e:
        print(f"Error extracting details for {url}: {e}")

    return data

def main():
    driver = setup_driver()
    all_results = []

    try:
        driver.get(BASE_SEARCH_URL)
        handle_cookies(driver)

        for sector in SECTORS_LIST:
            s_name = sector['name']
            s_id = sector['id']
            
            print(f"\n--- START SECTOR: {s_name} (ID: {s_id}) ---")

            for page in range(1, MAX_PAGES_PER_SECTOR + 1):
                sector_url = f"{BASE_SEARCH_URL}&s={s_id}&p={page}"
                print(f"Page {page} | Sector URL: {sector_url}")
                driver.get(sector_url)

                try:
                    WebDriverWait(driver, 6).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, 'a[data-cy="offerTitle"]'))
                    )
                except TimeoutException:
                    print(f"No jobs on page {page}, skipping sector.")
                    break

                offer_elems = driver.find_elements(By.CSS_SELECTOR, 'a[data-cy="offerTitle"]')
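                # De-duplicate while preserving page order; skip anchors without an href
                hrefs = [elem.get_attribute("href") for elem in offer_elems]
                urls_to_visit = list(dict.fromkeys(h for h in hrefs if h))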
                print(f"Found {len(urls_to_visit)} jobs.")

                for url in urls_to_visit:
                    job_data = scrape_job_details(driver, url, s_name)
                    all_results.append(job_data)
                    time.sleep(0.5)

            # Intermediate save after each sector
            pd.DataFrame(all_results).to_csv("hellowork_progress.csv", index=False, encoding='utf-8-sig')

    finally:
        # Final save
        df = pd.DataFrame(all_results)
        final_filename = "hellowork_final_sectors_data.csv"
        df.to_csv(final_filename, index=False, encoding='utf-8-sig')
        print(f"Scraping terminé. Fichier CSV : {final_filename} | Total jobs : {len(all_results)}")
        driver.quit()

if __name__ == "__main__":
    main()

No cookies banner found.

--- START SECTOR: Agriculture • Pêche (ID: Agri_peche) ---
Page 1 | Sector URL: https://www.hellowork.com/fr-fr/emploi/recherche.html?k=job+%C3%A9tudiant&st=relevance&s=Agri_peche&p=1
Scraping terminé. Fichier CSV : hellowork_final_sectors_data.csv | Total jobs : 0


KeyboardInterrupt: 