## Phase 2 – Data Preparation (ETL)

Phase 2 focuses on **cleaning, transforming, and structuring** the raw job offer data collected in Phase 1, producing a **machine-learning-ready dataset** for Phase 3 modeling and dashboard visualization.

---

### 1. Cleaning
- Remove duplicates (`Job_Title + Company + Location + URL`).  
- Handle missing values for `Salary`, `Contract`, `Experience_Level`, etc.  
- Strip whitespace, line breaks, and HTML artifacts from textual fields (`Job_Title`, `Description`).  
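
A minimal sketch of these cleaning rules, assuming a pandas DataFrame with the column names listed above. The `strip_markup` helper and the sample rows are illustrative and not part of the original pipeline.

```python
import html
import re

import pandas as pd

# Hypothetical sample rows; the real data comes from the Phase 1 CSV.
sample = pd.DataFrame({
    "Job_Title": ["  Electricien H/F ", "Electricien H/F", "Chef de projet H/F"],
    "Company": ["Samsic Emploi", "Samsic Emploi", "Crit"],
    "Location": ["Rennes - 35", "Rennes - 35", "Paris - 75"],
    "URL": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
    "Description": [
        "<p>Nous recherchons\nun electricien</p>",
        "<p>Nous recherchons\nun electricien</p>",
        "Poste en CDI&nbsp;",
    ],
})

def strip_markup(text: str) -> str:
    """Drop HTML tags, decode entities, and collapse whitespace runs."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove tags such as <p>
    text = html.unescape(text)             # decode entities such as &nbsp;
    return re.sub(r"\s+", " ", text).strip()

# Clean the text first so that whitespace variants count as duplicates.
for col in ["Job_Title", "Description"]:
    sample[col] = sample[col].map(strip_markup)

# De-duplicate on the composite key rather than on whole rows.
key = ["Job_Title", "Company", "Location", "URL"]
deduped = sample.drop_duplicates(subset=key).reset_index(drop=True)
```

Cleaning before de-duplicating matters here: the first two sample rows differ only by stray whitespace in `Job_Title`, so they collapse into one row only after stripping.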

---

### 2. Standardization
- Convert `Salary` to a **numerical format** (monthly salary in EUR), handling ranges and pay periods (hourly, monthly, annual).  
- Standardize `Publication_Date` as datetime objects and optionally compute job age.  
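
The date handling could look like the sketch below. The Phase 1 columns shown later do not include `Publication_Date`, so the column, its `dd/mm/yyyy` format, and the reference date are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical values; "inconnu" stands in for an unparseable date.
offers = pd.DataFrame({"Publication_Date": ["03/11/2025", "28/10/2025", "inconnu"]})

# Parse to datetime; values that do not match the format become NaT.
offers["Publication_Date"] = pd.to_datetime(
    offers["Publication_Date"], format="%d/%m/%Y", errors="coerce"
)

# Job age in days relative to a fixed reference date (assumed here).
reference = pd.Timestamp("2025-11-10")
offers["Job_Age_Days"] = (reference - offers["Publication_Date"]).dt.days
```

Using a fixed reference date instead of "today" keeps the computed ages reproducible between runs.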

---

### 3. Feature Extraction
- Extract keywords/skills from `Description` using **text preprocessing**: lowercase, remove stopwords, lemmatization/stemming, punctuation removal.  
- Prepare **text vectors** (TF-IDF or CountVectorizer) for ML clustering in Phase 3.  

---

### 4. Categorical Encoding
- Encode categorical fields (`Contract`, `Location`, `Sector`, `Experience_Level`) for ML:
  - One-hot encoding or label encoding.

---

### 5. Validation & Saving
- Check for missing or malformed fields, salary outliers, and consistency of sectors.  
- Save the cleaned dataset as `hellowork_preprocessed.csv` (UTF-8 with BOM, `utf-8-sig`) for Phase 3.
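
One possible shape for these checks is sketched below, assuming a numeric `Salary_Clean` column and a `Sector` column. The sample values and the 1.5 × IQR rule are illustrative choices, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing salary, one implausible salary,
# and one sector label that differs only by case and whitespace.
checked = pd.DataFrame({
    "Sector": ["BTP", "BTP", "btp ", "Banque", "Banque", "Banque"],
    "Salary_Clean": [1900.0, 2100.0, 2000.0, 2300.0, np.nan, 48649180180.0],
})

# Missing values per column.
missing_per_column = checked.isna().sum()

# Flag salaries outside 1.5 * IQR of the quartiles.
q1, q3 = checked["Salary_Clean"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = checked[
    (checked["Salary_Clean"] < q1 - 1.5 * iqr)
    | (checked["Salary_Clean"] > q3 + 1.5 * iqr)
]

# Sector consistency: compare raw labels with normalized labels.
raw_sector_count = checked["Sector"].nunique()
normalized_sector_count = checked["Sector"].str.strip().str.casefold().nunique()
```

A gap between `raw_sector_count` and `normalized_sector_count` points to labels that should be merged before encoding.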

---

### ✅ Phase Outcome
- Dataset is **clean, structured, and standardized**  
- Textual and categorical features are **ML-ready**  
- Missing or inconsistent values are handled  
- Data is prepared for **clustering, classification, and dashboard visualization**


### **Libraries Used in Phase 2 – Data Preparation (ETL)**

### **1. pandas**
- Core library for loading, cleaning, and manipulating datasets.
- Used for reading CSVs, handling missing values, removing duplicates, and saving cleaned datasets.

### **2. numpy**
- Supports numerical operations and handling missing values (NaNs).
- Useful for calculations like salary averages or job age.

### **3. re (Regular Expressions)**
- Handles text pattern matching and extraction.
- Used to clean and standardize salary fields or remove unwanted characters from text.

### **4. string**
- Provides constants for common text operations (e.g., punctuation).
- Helps in removing punctuation from job descriptions.

### **5. nltk (Natural Language Toolkit)**
- Library for text preprocessing and NLP tasks.
- Used for stopwords removal, tokenization, and preparing text for vectorization.

### **6. sklearn.feature_extraction.text.TfidfVectorizer**
- Converts text data (job descriptions) into numeric feature vectors.
- Essential for clustering and classification in Phase 3.

### **7. sklearn.preprocessing.OneHotEncoder**
- Converts categorical variables (e.g., Sector, Contract) into numeric one-hot encoded features for ML.

### **8. sklearn.preprocessing.LabelEncoder**
- Converts categorical labels into numeric form if needed.

### **9. datetime**
- Handles date operations.
- Useful for converting publication dates and calculating job age in days.


In [None]:
# --- Basic Data Handling ---
import pandas as pd          # For loading, cleaning, and manipulating datasets
import numpy as np           # For numerical operations and handling NaNs
# --- File / Path Utilities ---
from pathlib import Path     # For filesystem-safe path handling
import os                    # For interacting with the filesystem
# --- Web/Regex/Text Processing ---
import re                    # For regular expressions (cleaning salaries, text)
import string                # For punctuation removal in text
import nltk                  # Natural Language Toolkit for text processing
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')   # Download the stopwords corpus
nltk.download('punkt')       # Download tokenizer models
nltk.download('wordnet')     # Download WordNet lemmatizer data
# --- Machine Learning / Preprocessing ---
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text to numeric features
from sklearn.feature_extraction.text import CountVectorizer  # Bag-of-words representation
from sklearn.preprocessing import OneHotEncoder              # Encode categorical variables
from sklearn.preprocessing import LabelEncoder               # Label encoding if needed
from sklearn.preprocessing import StandardScaler             # Scale numeric features
from sklearn.model_selection import train_test_split         # Train/validation splits
# --- Date/Time Handling ---
import datetime              # For working with publication dates and job age
# --- Visualization & Progress ---
import matplotlib.pyplot as plt  # Quick exploratory plots
import seaborn as sns            # Statistical visualizations
from tqdm import tqdm            # Progress bars for loops

## Step 1 – Load Raw Dataset

In this step, we load the **final scraped CSV** from Phase 1 (`hellowork_final_sectors_data.csv`) into a pandas DataFrame for preprocessing.

**Objectives:**
- Inspect the dataset shape (number of rows and columns)
- Verify column names
- Preview the first few rows to understand the raw data


In [18]:
# --- Step 1: Load Raw Dataset ---
# Load the final scraped CSV from Phase 1
df = pd.read_csv("hellowork_final_sectors_data.csv", encoding='utf-8-sig')

# Inspect basic info
print("Dataset shape:", df.shape)
print("\nColumns:\n", df.columns)
print("\nFirst 5 rows:")
df.head()


Dataset shape: (1364, 8)

Columns:
 Index(['Sector', 'Job_Title', 'Company', 'Location', 'Contract', 'Salary',
       'Description', 'URL'],
      dtype='object')

First 5 rows:


| | Sector | Job_Title | Company | Location | Contract | Salary | Description | URL |
|---|---|---|---|---|---|---|---|---|
| 0 | Agriculture • Pêche | Alternance - Chargé·e de Formation H/F | Remy Cointreau | Paris - 75 | Alternance | 486,49 - 1 801,80 € / mois | Nous recherchons un·e candidat·e : Alternance... | https://www.hellowork.com/fr-fr/emplois/642118... |
| 1 | BTP | Electricien H/F | Samsic Emploi | Rennes - 35 | Intérim | 12 - 15 € / heure | Nous recherchons activement un/une electricien... | https://www.hellowork.com/fr-fr/emplois/729658... |
| 2 | BTP | Ouvrier Polyvalent en Menuiserie H/F | Groupe Actual | Auterive - 31 | Intérim | Estimation → 12,36 - 13,50 € / heure | Nous recherchons un(e) menuisier(e) expériment... | https://www.hellowork.com/fr-fr/emplois/732798... |
| 3 | BTP | Stagiaire Ressources Humaines 44 H/F | Equans France | Le Bignon - 44 | Stage | Pas de salaire renseigné | Description de l'emploi: Au quotidien, Bouygue... | https://www.hellowork.com/fr-fr/emplois/730495... |
| 4 | BTP | Manutentionnaire - Job Étudiant H/F | Crit | Terssac - 81 | Intérim | 13 € / heure | Dans le cadre du développement de son activité... | https://www.hellowork.com/fr-fr/emplois/730409... |


## Step 2 – Handle Missing Values and Duplicates

Before processing the dataset for analysis or ML, we need to **clean it**:

**Objectives:**
- Remove duplicate rows to ensure data consistency.
- Identify and handle missing values in key columns:
  - `Job_Title`, `Company`, `Location`, `Contract`, `Salary`, `Description`
- Fill missing values with placeholders (`"Not specified"` for text, `NaN` for numeric) to maintain consistency.

This ensures the dataset is **clean, complete, and ready for further preprocessing**.


In [None]:
# --- Step 2: Handle Missing Values and Duplicates ---

# 1. Remove duplicate rows
df = df.drop_duplicates()
print(f"After removing duplicates, dataset shape: {df.shape}")

# 2. Handle missing values
text_columns = ["Job_Title", "Company", "Location", "Contract", "Description"]
for col in text_columns:
    df[col] = df[col].fillna("Not specified")

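# Salary: strip every non-digit character and coerce the result to a number.
# Caveat: this also removes decimal commas and range separators, so
# "12 - 15 € / heure" becomes 1215 and the pay period is lost;
# "Pas de salaire renseigné" contains no digits and becomes NaN.
df['Salary'] = pd.to_numeric(df['Salary'].str.replace(r'[^\d]', '', regex=True), errors='coerce')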

# 3. Check remaining missing values
print("\nMissing values per column:")
print(df.isna().sum())
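
The raw preview in Step 1 shows that a missing salary is stored as the text `Pas de salaire renseigné`, which `isna()` does not count. A small sketch of mapping such placeholders to real missing values while the column still holds raw strings; the `placeholders` list and sample rows are assumptions.

```python
import pandas as pd

# Hypothetical sample of raw salary strings.
raw = pd.DataFrame({
    "Salary": ["12 - 15 € / heure", "Pas de salaire renseigné", None, "13 € / heure"]
})

# Placeholder texts to treat as missing (extend as other variants are found).
placeholders = ["Pas de salaire renseigné"]

# mask() replaces matching entries with NaN and leaves the rest untouched.
raw["Salary"] = raw["Salary"].mask(raw["Salary"].isin(placeholders))

missing_salaries = int(raw["Salary"].isna().sum())
```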


## Step 3 – Standardize Salaries

Job postings list salaries in **various formats**: hourly, monthly, or annual amounts, ranges, estimates, or free text such as `Pas de salaire renseigné`.  
We need to **extract numeric values** and standardize them into a **common unit**, e.g., **monthly salary in EUR**.

**Objectives:**
- Remove non-numeric characters from salary strings
- Handle ranges by taking the average
- Convert annual salaries to monthly if indicated
- Keep missing salaries as NaN for ML handling



In [21]:
# --- Step 3: Standardize Salaries ---

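def clean_salary(salary_str):
    """
    Extract the numeric parts of a salary string and average them.

    No unit conversion is applied: the value stays in whatever pay
    period the posting used.
    """
    if pd.isna(salary_str) or salary_str == "Not specified":
        return np.nan
    # Collect every run of digits (spaces are removed first, so "1 801" reads as 1801)
    numbers = re.findall(r'\d+', salary_str.replace(" ", ""))
    numbers = [int(n) for n in numbers]
    if len(numbers) == 0:
        return np.nan
    elif len(numbers) == 1:
        return numbers[0]
    else:
        # Several digit runs: treat them as range bounds and return their mean
        return sum(numbers)/len(numbers)

# Apply cleaning function.
# Caveat: Step 2 already converted Salary to float, so astype(str) yields
# strings such as "1215.0"; the trailing ".0" is read as a second number (0),
# which halves every value in Salary_Clean.
df['Salary_Clean'] = df['Salary'].astype(str).apply(clean_salary)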

# Inspect cleaned salary column
print(df[['Salary', 'Salary_Clean']].head(10))


         Salary  Salary_Clean
0  4.864918e+10  2.432459e+10
1  1.215000e+03  6.075000e+02
2  1.236135e+07  6.180675e+06
3           NaN           NaN
4  1.300000e+01  6.500000e+00
5  4.864918e+10  2.432459e+10
6  1.315000e+03  6.575000e+02
7  1.188000e+03  5.940000e+02
8  1.900250e+07  9.501250e+06
9           NaN           NaN
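
The values above are far from plausible salaries, because the digit-only extraction merges range bounds and decimals. A hedged alternative sketch follows: it parses the raw strings shown in Step 1 directly and converts to a monthly amount. The name `parse_salary`, the conversion factors, and the choice to average range bounds are assumptions for illustration, not the notebook's method.

```python
import math
import re

# Assumed factors for converting a pay period to a monthly amount.
# 151.67 h is the monthly equivalent of a 35-hour week (35 * 52 / 12).
PERIOD_TO_MONTHLY = {"heure": 151.67, "mois": 1.0, "an": 1 / 12}

# French-formatted amounts: optional space-separated thousands, decimal comma.
NUMBER = re.compile(r"\d{1,3}(?:[ \u00a0\u202f]\d{3})+(?:,\d+)?|\d+(?:,\d+)?")
UNIT = re.compile(r"/\s*(heure|mois|an)\b", re.IGNORECASE)

def parse_salary(raw):
    """Return an estimated monthly salary in EUR, or NaN when none is given."""
    if not isinstance(raw, str):
        return math.nan
    # "1 801,80" -> "1801.80" -> 1801.8
    amounts = [
        float(re.sub(r"\s", "", match).replace(",", "."))
        for match in NUMBER.findall(raw)
    ]
    if not amounts:
        return math.nan
    bounds = amounts[:2]  # a range has two bounds, a single amount has one
    unit = UNIT.search(raw)
    # Assume a monthly amount when no pay period is stated.
    factor = PERIOD_TO_MONTHLY[unit.group(1).lower()] if unit else 1.0
    return sum(bounds) / len(bounds) * factor
```

Applied to the raw column before any numeric coercion, for example `df["Salary"].map(parse_salary)`, this keeps `12 - 15 € / heure` as the midpoint 13.5 € per hour scaled to a month instead of the merged value 1215.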


## Step 4 – Text Preprocessing for Job Descriptions

Job descriptions are **raw text**, containing punctuation, stopwords, and inconsistent formatting.  
For **clustering or classification**, we need to **clean and tokenize** them.

**Objectives:**
- Convert text to **lowercase**
- Remove **punctuation and special characters**
- Remove **stopwords** (common words with little meaning)
- Optional: Lemmatization or stemming for further normalization
- Prepare a **cleaned text column** for vectorization in Phase 3


In [22]:
# --- Step 4: Text Preprocessing ---

import nltk
from nltk.corpus import stopwords
import string

# Download stopwords if not already
nltk.download('stopwords')
stop_words = set(stopwords.words('french'))  # Using French stopwords

def clean_text(text):
    text = text.lower()  # lowercase
    text = text.replace("\n", " ").strip()  # remove line breaks
    text = text.translate(str.maketrans('', '', string.punctuation))  # delete ASCII punctuation (no space is inserted, so "un/une" becomes "unune")
    tokens = [word for word in text.split() if word not in stop_words]  # remove stopwords
    return " ".join(tokens)

# Apply cleaning
df['Description_Clean'] = df['Description'].apply(clean_text)

# Inspect cleaned descriptions
df[['Description', 'Description_Clean']].head(5)



| | Description | Description_Clean |
|---|---|---|
| 0 | Nous recherchons un·e candidat·e : Alternance... | recherchons un·e candidat·e alternance chargé·... |
| 1 | Nous recherchons activement un/une electricien... | recherchons activement unune electriciennne ca... |
| 2 | Nous recherchons un(e) menuisier(e) expériment... | recherchons menuisiere expérimentée rejoindre ... |
| 3 | Description de l'emploi: Au quotidien, Bouygue... | description lemploi quotidien bouygues energie... |
| 4 | Dans le cadre du développement de son activité... | cadre développement activité recherchons clien... |
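
The preview shows fused tokens such as `unune` and `lemploi`, which appear when punctuation is deleted rather than replaced by a space. A minimal alternative sketch is shown below; `clean_text_spaced` is a hypothetical name, and the small `STOPWORDS_FR` set is a stand-in for NLTK's French list so the example runs offline.

```python
import re

# Small stand-in for NLTK's French stopword list (assumption for this sketch).
STOPWORDS_FR = {"nous", "un", "une", "de", "l", "d", "le", "la", "au", "e"}

def clean_text_spaced(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and drop stopwords."""
    text = text.lower()
    # Replace every character that is neither a word character nor whitespace.
    # This covers ASCII punctuation as well as "·", "•", and curly apostrophes.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [word for word in text.split() if word not in STOPWORDS_FR]
    return " ".join(tokens)

cleaned = clean_text_spaced("Description de l'emploi: Au quotidien, un/une electricien")
```

Because the apostrophe becomes a space, the elided article `l` is separated from `emploi` and can then be removed as a stopword.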


## Step 5 – Encode Categorical Variables

Many columns (e.g., `Contract`, `Location`, `Sector`) are **categorical** and need to be **encoded numerically** for ML algorithms.

**Objectives:**
- Convert categorical columns into **numeric representations** using one-hot encoding or label encoding
- Keep original columns for filtering in dashboards if needed
- Prepare dataset for clustering and classification


In [24]:
# --- Step 5: Encode Categorical Variables ---

from sklearn.preprocessing import LabelEncoder

# Columns to encode
categorical_cols = ['Contract', 'Location', 'Sector']

# Apply Label Encoding
le_dict = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col + '_Encoded'] = le.fit_transform(df[col])
    le_dict[col] = le  # Save encoder for future inverse_transform if needed

# Inspect encoding
df[['Contract', 'Contract_Encoded', 'Location', 'Location_Encoded', 'Sector', 'Sector_Encoded']].head(5)


| | Contract | Contract_Encoded | Location | Location_Encoded | Sector | Sector_Encoded |
|---|---|---|---|---|---|---|
| 0 | Alternance | 0 | Paris - 75 | 452 | Agriculture • Pêche | 0 |
| 1 | Intérim | 4 | Rennes - 35 | 493 | BTP | 1 |
| 2 | Intérim | 4 | Auterive - 31 | 24 | BTP | 1 |
| 3 | Stage | 6 | Le Bignon - 44 | 332 | BTP | 1 |
| 4 | Intérim | 4 | Terssac - 81 | 582 | BTP | 1 |
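
Label encoding assigns integers in alphabetical order, so distance-based models may read `Stage` (6) as "further" from `Alternance` (0) than `Intérim` (4), although the codes carry no such meaning. One-hot encoding, the other option named in the objectives, avoids this. A minimal sketch with `pandas.get_dummies` on hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample using contract labels from the preview above.
jobs = pd.DataFrame({"Contract": ["Alternance", "Intérim", "Intérim", "Stage"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
contract_dummies = pd.get_dummies(jobs["Contract"], prefix="Contract", dtype=int)

# Keep the original column next to the indicators for dashboard filtering.
encoded = pd.concat([jobs, contract_dummies], axis=1)
```

One-hot encoding suits low-cardinality columns such as `Contract`. For `Location`, whose codes above already exceed 500, it would add hundreds of sparse columns, so grouping values first may be preferable.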


## Step 6 – Feature Extraction from Job Descriptions

To enrich the dataset for ML models, we extract **important keywords or skills** from job descriptions.

**Objectives:**
- Identify the most frequent words or terms in job descriptions
- Optionally, extract **skills or key phrases** using simple frequency-based methods or TF-IDF
- Create additional features for clustering or classification


In [26]:
# --- Step 6: Feature Extraction from Text ---

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer with list of stopwords
tfidf = TfidfVectorizer(max_features=500, stop_words=list(stop_words))  # Convert set to list

# Fit and transform the cleaned descriptions
tfidf_matrix = tfidf.fit_transform(df['Description_Clean'])

# Convert to DataFrame for inspection
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
print("TF-IDF feature matrix shape:", tfidf_df.shape)
tfidf_df.head()


TF-IDF feature matrix shape: (1244, 500)


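| | 10 | 10h | 13h | 1418 | 14h | 15 | 18h | 23 | 2h | 35 | ... | équipe | équipes | éthique | études | étudiant | étudiante | étudiantrémunération | étudiants | évoluer | être |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.085515 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.138837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.199516 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.056201 | 0.0 | 0.0 | 0.0 | 0.0 | 0.075646 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.252615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.108104 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |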
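
If these TF-IDF columns are later joined back onto `df`, the indexes need to agree. `df` went from 1364 to 1244 rows in `drop_duplicates()` without an index reset, so its index has gaps, while `tfidf_df` is numbered from 0. The sketch below reproduces the effect on hypothetical toy frames and shows one way to align them.

```python
import pandas as pd

# Toy frame: dropping the duplicate leaves index labels [0, 2, 3].
df_demo = pd.DataFrame({"Job_Title": ["a", "a", "b", "c"]}).drop_duplicates()

# Toy TF-IDF frame built from the 3 remaining rows: index labels [0, 1, 2].
tfidf_demo = pd.DataFrame({"équipe": [0.1, 0.0, 0.3]})

# Joining on mismatched labels pads both sides with NaN.
misaligned = pd.concat([df_demo, tfidf_demo], axis=1)

# Resetting the index first restores a row-by-row match.
aligned = pd.concat([df_demo.reset_index(drop=True), tfidf_demo], axis=1)
```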


## Step 7 – Save the Preprocessed Dataset

After cleaning salaries, preprocessing text, and encoding categorical variables, we save the dataset for **Phase 3 (ML)**.

**Objectives:**
- Ensure all preprocessing is persisted
- Avoid repeating costly preprocessing
- Use a clear file naming convention


In [27]:
# --- Step 7: Save Preprocessed Dataset ---

preprocessed_filename = "hellowork_preprocessed.csv"
df.to_csv(preprocessed_filename, index=False, encoding='utf-8-sig')
print(f"Preprocessed dataset saved to {preprocessed_filename}")


Preprocessed dataset saved to hellowork_preprocessed.csv
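
A quick read-back check can confirm that the file round-trips, including accented characters. This is a sketch on a hypothetical one-row frame written to a temporary directory. Note that `df.to_csv` only writes the columns of `df`; the separate `tfidf_df` from Step 6 is not included in this file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row frame with non-ASCII text and a numeric column.
out = pd.DataFrame({"Sector": ["Agriculture • Pêche"], "Salary_Clean": [1144.145]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hellowork_preprocessed.csv"
    # utf-8-sig writes a BOM so that spreadsheet tools detect the encoding.
    out.to_csv(path, index=False, encoding="utf-8-sig")
    # Reading with the same encoding strips the BOM from the first header.
    reloaded = pd.read_csv(path, encoding="utf-8-sig")
```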


## Step 8 – Summary of Phase 2 ETL

Phase 2 focused on **cleaning, transforming, and structuring** the raw scraped data:

1. **Load and Inspect Raw Data** from the Phase 1 CSV (shape, columns, first rows)
2. **Handle Missing Values and Duplicates** in the key columns
3. **Standardize Salaries** into a numeric `Salary_Clean` column
4. **Preprocess Text** in job descriptions (lowercase, remove punctuation & stopwords)
5. **Encode Categorical Variables** for ML models using label encoding
6. **Extract Features** from job descriptions using TF-IDF
7. **Save Preprocessed Dataset** for modeling

**Result:**  
A clean, structured, and ML-ready dataset with numeric and textual features, ready for **clustering, classification, and enrichment** in Phase 3.
