# 🧪 NLP Module Test - Real Job Offers

**Objective**: Test the 3 NLP modules on a sample of 10 offers from the database

**Modules tested**:
1. ✅ `text_cleaner.py` - Cleaning and lemmatization
2. ✅ `skill_extractor.py` - Skill extraction (tech + soft)
3. ✅ `info_extractor.py` - Structured information extraction (salary, experience, education, remote work)

## 📦 Imports and Configuration

In [1]:
# Standard imports
import os
import sys
import psycopg2
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Add the NLP modules directory to the import path
# (the notebook is expected to run from a sibling folder of 'modules')
MODULES_DIR = os.path.join(os.path.dirname(os.getcwd()), 'modules')
sys.path.insert(0, MODULES_DIR)

# NLP modules
from text_cleaner import TextCleaner
from skill_extractor import SkillExtractor
from info_extractor import InfoExtractor

# Config
load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')

print("✅ Imports OK")
print(f"📍 NLP modules loaded from: {MODULES_DIR}")

✅ Imports OK
📍 NLP modules loaded from: c:\Users\cyraptor\Documents\PROJECTS\Projet-ATLAS\NLP\modules
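
`os.getenv` returns `None` when a variable is unset, so a missing `DATABASE_URL` would only surface later, at connection time. A minimal sketch of an earlier check follows. The helper name `require_env` and the demo variable are hypothetical and not part of the project.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper, shown only to illustrate an early configuration check.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set (check the .env file)")
    return value


# Demonstration with a made-up variable name and value
os.environ["DEMO_DATABASE_URL"] = "postgresql://user:pass@localhost:5432/demo"
print(require_env("DEMO_DATABASE_URL"))
```

In the notebook, the same idea would apply to `DATABASE_URL` right after `load_dotenv()`.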


## 🔌 Database Connection and Offer Loading

In [2]:
# Connect
conn = psycopg2.connect(DATABASE_URL)
print("✅ Connected to PostgreSQL")

# Load 10 random offers FILTERED ON TECH TOPICS
query = """
SELECT 
    offer_id,
    title,
    description,
    company_name,
    contract_type,
    salary_min,
    salary_max,
    experience_years,
    topic_id,
    topic_label,
    topic_confidence
FROM fact_job_offers
WHERE description IS NOT NULL 
  AND LENGTH(description) > 200
  AND topic_label IN (
      'Ingénierie Cloud & Cybersécurité',
      'Data Analysis & Transformation Digitale',
      'Ingénierie R&D & Data Science',
      'Product Management & Développement Java',
      'Gestion de Projet & Développement'
  )
ORDER BY RANDOM()
LIMIT 10
"""

# Close the connection even if the query fails
try:
    df = pd.read_sql(query, conn)
finally:
    conn.close()

print(f"\n📊 {len(df)} TECH offers loaded")
print("\nPreview:")
df[['offer_id', 'title', 'company_name', 'topic_label']].head()

✅ Connected to PostgreSQL

📊 10 TECH offers loaded

Preview:

| | offer_id | title | company_name | topic_label |
|---|---|---|---|---|
| 0 | 674 | Chef de projet électricité H/F | Work & You | Ingénierie Cloud & Cybersécurité |
| 1 | 4423 | DIRECTEUR COMMUNICATION & MARKETING DIGITAL - ... | Bras droit des dirigeants | Ingénierie Cloud & Cybersécurité |
| 2 | 1198 | Chef de Projets Data H/F | JEMS | Ingénierie Cloud & Cybersécurité |
| 3 | 2038 | Enquêteur / Enquêtrice terrain (H/F) | BVA | Ingénierie R&D & Data Science |
| 4 | 364 | Chef de projet WMS (H/F) | E-KENT | Ingénierie Cloud & Cybersécurité |


## 🧹 NLP Module Initialization

In [3]:
print("⏳ Initializing NLP modules...")

# Instantiate the modules
cleaner = TextCleaner()
skill_extractor = SkillExtractor()
info_extractor = InfoExtractor()

print("✅ NLP modules initialized")
print("   - TextCleaner: spaCy model loaded")
print(f"   - SkillExtractor: {len(skill_extractor.all_tech_skills)} tech skills + {len(skill_extractor.soft_skills)} soft skills")
print("   - InfoExtractor: salary, experience, education, remote-work extraction")

⏳ Initializing NLP modules...
✅ NLP modules initialized
   - TextCleaner: spaCy model loaded
   - SkillExtractor: 218 tech skills + 76 soft skills
   - InfoExtractor: salary, experience, education, remote-work extraction


## 🔬 NLP Processing Function

In [4]:
def process_offer_nlp(offer_data):
    """
    Apply the 3 NLP modules to a single offer.
    
    Returns:
        dict containing all NLP results
    """
    description = offer_data['description']
    
    # 1. CLEANING
    cleaned = cleaner.clean_text(description)
    lemmas = cleaner.lemmatize(cleaned)
    
    # 2. SKILL EXTRACTION
    skills = skill_extractor.extract_skills(description)
    category = skill_extractor.categorize_offer(description)
    
    # Compute profile_confidence (as a percentage)
    # Normalized score based on the number of skills detected
    max_score = 10  # 10+ skills counts as 100%
    profile_confidence = min(100, int((category['profile_score'] / max_score) * 100))
    
    # 3. INFO EXTRACTION
    info = info_extractor.extract_all(description)
    
    return {
        # Text
        'cleaned_text': cleaned,
        'lemmas_count': len(lemmas),
        
        # Skills
        'skills_tech': skills['all_tech_skills'],
        'skills_soft': skills['soft_skills'],
        'skill_count_tech': skills['skill_count']['tech'],
        'skill_count_soft': skills['skill_count']['soft'],
        
        # Profile
        'profile_category': category['dominant_profile'],
        'profile_score': category['profile_score'],
        'profile_confidence': profile_confidence,  # AS A PERCENTAGE
        'is_full_stack': category['is_full_stack'],
        
        # Structured information
        'salary_min': info['salary']['min'],
        'salary_max': info['salary']['max'],
        'experience_min': info['experience']['min'],
        'experience_max': info['experience']['max'],
        'experience_level': info['experience']['level'],
        'education_level': info['education']['level'],
        'education_type': info['education']['degree_type'],
        'contract_types': info['contract_types'],
        'remote_possible': info['remote']['remote_possible'],
        'remote_days': info['remote']['remote_days'],
        'remote_percentage': info['remote']['remote_percentage'],
    }

print("✅ Processing function defined")

✅ Processing function defined
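
The `profile_confidence` normalization inside `process_offer_nlp` can be checked in isolation. The sketch below reimplements only that arithmetic. The helper name `score_to_confidence` is hypothetical and not part of the NLP modules, and it assumes `profile_score` is a non-negative count of matched skills.

```python
def score_to_confidence(profile_score, max_score=10):
    """Map a matched-skill count to a 0-100 percentage, capped at 100.

    Hypothetical helper mirroring the formula used in process_offer_nlp.
    """
    return min(100, int((profile_score / max_score) * 100))


# A few sample scores, including one above the cap
for score in (0, 3, 10, 14):
    print(f"score={score:2} -> confidence={score_to_confidence(score)}%")
```

Under this mapping, a confidence of 0% means the `profile_score` returned by `categorize_offer` was below 0.1, which for an integer count means 0.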


## 🚀 Processing the 10 Offers

In [5]:
print("="*80)
print("🚀 PROCESSING OFFERS")
print("="*80)
print()

results = []

# enumerate gives a 1-based position that does not depend on the DataFrame index
for position, (_, row) in enumerate(df.iterrows(), start=1):
    print(f"\n📌 Offer {position}/{len(df)}: {row['title'][:60]}...")
    
    # Process
    nlp_result = process_offer_nlp(row)
    
    # Combine with the original data
    result = {
        'offer_id': row['offer_id'],
        'title': row['title'],
        'company_name': row['company_name'],
        **nlp_result
    }
    
    results.append(result)
    
    # Quick summary
    print(f"   ✅ {result['skill_count_tech']} tech skills | {result['skill_count_soft']} soft skills")
    print(f"   🎯 Profile: {result['profile_category']} (confidence: {result['profile_confidence']}%)")

print(f"\n{'='*80}")
print(f"✅ {len(results)} offers processed")
print(f"{'='*80}")

# Build the results DataFrame
df_results = pd.DataFrame(results)

🚀 PROCESSING OFFERS


📌 Offer 1/10: Chef de projet électricité H/F...
   ✅ 0 tech skills | 6 soft skills
   🎯 Profile: Généraliste (confidence: 0%)

📌 Offer 2/10: DIRECTEUR COMMUNICATION & MARKETING DIGITAL - INDEPENDANT (H...
   ✅ 0 tech skills | 1 soft skills
   🎯 Profile: Généraliste (confidence: 0%)

📌 Offer 3/10: Chef de Projets Data H/F...
   ✅ 3 tech skills | 8 soft skills
   🎯 Profile: Généraliste (confidence: 0%)

📌 Offer 4/10: Enquêteur / Enquêtrice terrain (H/F)...
   ✅ 2 tech skills | 2 soft skills
   🎯 Profile: Généraliste (confidence: 0%)

📌 Offer 5/10: Chef de projet WMS (H/F)...
   ✅ 2 tech skills | 7 soft skills
   🎯 Profile: Généraliste (confidence: 0%)

📌 Offer 6/10: Candidature spontanée - Theodo Data & AI...
   ✅ 0 tech skills | 0 soft skills
   🎯 Profile: Généraliste (confidence: 0%)

📌 Offer 7/10: Animateur de communauté (H/F)...
   ✅ 1 tech skills | 4 soft skills
   🎯 Profile: Gén

## 📊 Detailed Results per Offer

In [6]:
# Display detailed results for each offer
for idx, result in enumerate(results, 1):
    print("\n" + "="*80)
    print(f"📋 OFFER #{idx}: {result['title']}")
    print("="*80)
    print(f"   Company: {result['company_name']}")
    
    print("\n🧹 CLEANING:")
    print(f"   Cleaned text: {len(result['cleaned_text'])} characters")
    print(f"   Lemmas extracted: {result['lemmas_count']}")
    
    print(f"\n💻 TECHNICAL SKILLS ({result['skill_count_tech']}):")
    if result['skills_tech']:
        for i, skill in enumerate(result['skills_tech'][:15], 1):  # Top 15
            print(f"   {i:2}. {skill}")
        if len(result['skills_tech']) > 15:
            print(f"   ... and {len(result['skills_tech']) - 15} more")
    else:
        print("   ℹ️  No technical skill detected")
    
    print(f"\n🤝 SOFT SKILLS ({result['skill_count_soft']}):")
    if result['skills_soft']:
        print(f"   {', '.join(result['skills_soft'][:10])}")
        if len(result['skills_soft']) > 10:
            print(f"   ... and {len(result['skills_soft']) - 10} more")
    else:
        print("   ℹ️  No soft skill detected")
    
    print("\n🎯 CATEGORIZATION:")
    print(f"   Dominant profile: {result['profile_category']}")
    print(f"   Score: {result['profile_score']} matched skills")
    print(f"   Confidence: {result['profile_confidence']}%")
    if result['is_full_stack']:
        print("   ⚠️  Full Stack profile detected!")
    
    print("\n📋 STRUCTURED INFORMATION:")
    
    # Salary (either bound may be missing, so format each one separately)
    salary_min, salary_max = result['salary_min'], result['salary_max']
    if salary_min or salary_max:
        low = f"{salary_min:,}€" if salary_min else "?"
        high = f"{salary_max:,}€" if salary_max else "?"
        print(f"   💰 Salary: {low} - {high} /year")
    else:
        print("   💰 Salary: Not specified")
    
    # Experience
    if result['experience_min'] is not None:
        if result['experience_min'] == result['experience_max']:
            print(f"   📅 Experience: {result['experience_min']} years ({result['experience_level']})")
        else:
            print(f"   📅 Experience: {result['experience_min']}-{result['experience_max']} years ({result['experience_level']})")
    else:
        print("   📅 Experience: Not specified")
    
    # Education
    if result['education_level']:
        print(f"   🎓 Education: Bac+{result['education_level']} ({result['education_type']})")
    else:
        print("   🎓 Education: Not specified")
    
    # Contract
    if result['contract_types']:
        print(f"   📝 Contract: {', '.join(result['contract_types'])}")
    else:
        print("   📝 Contract: Not specified")
    
    # Remote work
    if result['remote_possible']:
        if result['remote_days']:
            print(f"   🏠 Remote work: {result['remote_days']} days/week ({result['remote_percentage']}%)")
        elif result['remote_percentage']:
            print(f"   🏠 Remote work: {result['remote_percentage']}%")
        else:
            print("   🏠 Remote work: Possible (details not specified)")
    else:
        print("   🏠 Remote work: Not mentioned")


📋 OFFER #1: Chef de projet électricité H/F
   Company: Work & You

🧹 CLEANING:
   Cleaned text: 1686 characters
   Lemmas extracted: 123

💻 TECHNICAL SKILLS (0):
   ℹ️  No technical skill detected

🤝 SOFT SKILLS (6):
   analyse, autonome, collaboration, dynamique, esprit d'équipe, rigoureux

🎯 CATEGORIZATION:
   Dominant profile: Généraliste
   Score: 0 matched skills
   Confidence: 0%

📋 STRUCTURED INFORMATION:
   💰 Salary: Not specified
   📅 Experience: 3-6 years (Confirmé)
   🎓 Education: Bac+5 (Master/Ingénieur)
   📝 Contract: CDI
   🏠 Remote work: Possible (details not specified)

📋 OFFER #2: DIRECTEUR COMMUNICATION & MARKETING DIGITAL - INDEPENDANT (H/F)
   Company: Bras droit des dirigeants

🧹 CLEANING:
   Cleaned text: 2430 characters
   Lemmas extracted: 180

💻 TECHNICAL SKILLS (0):
   ℹ️  No technical skill detected

🤝 SOFT SKILLS

## 📈 Overall Statistics

In [7]:
print("\n" + "="*80)
print("📈 OVERALL STATISTICS")
print("="*80)

print("\n🎯 Detected profiles:")
profile_counts = df_results['profile_category'].value_counts()
for profile, count in profile_counts.items():
    print(f"   • {profile:40} : {count} offer(s)")

print("\n💻 Technical skills:")
print(f"   Average per offer: {df_results['skill_count_tech'].mean():.1f}")
print(f"   Median: {df_results['skill_count_tech'].median():.0f}")
print(f"   Min / Max: {df_results['skill_count_tech'].min()} / {df_results['skill_count_tech'].max()}")

print("\n🤝 Soft skills:")
print(f"   Average per offer: {df_results['skill_count_soft'].mean():.1f}")
print(f"   Median: {df_results['skill_count_soft'].median():.0f}")
print(f"   Min / Max: {df_results['skill_count_soft'].min()} / {df_results['skill_count_soft'].max()}")

print("\n🎯 Profile confidence:")
print(f"   Average: {df_results['profile_confidence'].mean():.1f}%")
print(f"   Median: {df_results['profile_confidence'].median():.0f}%")
print(f"   Min / Max: {df_results['profile_confidence'].min()}% / {df_results['profile_confidence'].max()}%")

print("\n📊 Extracted information:")
salary_detected = df_results['salary_min'].notna().sum()
exp_detected = df_results['experience_min'].notna().sum()
edu_detected = df_results['education_level'].notna().sum()
contract_detected = df_results['contract_types'].apply(lambda x: len(x) > 0).sum()
remote_detected = df_results['remote_possible'].sum()

print(f"   Salary detected: {salary_detected}/{len(df_results)} offers ({salary_detected/len(df_results)*100:.0f}%)")
print(f"   Experience detected: {exp_detected}/{len(df_results)} offers ({exp_detected/len(df_results)*100:.0f}%)")
print(f"   Education detected: {edu_detected}/{len(df_results)} offers ({edu_detected/len(df_results)*100:.0f}%)")
print(f"   Contract detected: {contract_detected}/{len(df_results)} offers ({contract_detected/len(df_results)*100:.0f}%)")
print(f"   Remote work detected: {remote_detected}/{len(df_results)} offers ({remote_detected/len(df_results)*100:.0f}%)")


📈 OVERALL STATISTICS

🎯 Detected profiles:
   • Généraliste                              : 10 offer(s)

💻 Technical skills:
   Average per offer: 0.8
   Median: 0
   Min / Max: 0 / 3

🤝 Soft skills:
   Average per offer: 3.5
   Median: 3
   Min / Max: 0 / 8

🎯 Profile confidence:
   Average: 0.0%
   Median: 0%
   Min / Max: 0% / 0%

📊 Extracted information:
   Salary detected: 0/10 offers (0%)
   Experience detected: 2/10 offers (20%)
   Education detected: 5/10 offers (50%)
   Contract detected: 5/10 offers (50%)
   Remote work detected: 1/10 offers (10%)
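
The detection rates above count non-null values in scalar columns but have to test list columns on their length. A minimal sketch of that distinction on made-up rows follows. The helper `detection_rate` and the toy values are hypothetical, and only the column names mirror `df_results`.

```python
import pandas as pd

# Toy results (made-up values) shaped like a few df_results columns
toy = pd.DataFrame({
    "salary_min": [None, 42000, None, None],
    "experience_min": [3, None, None, 5],
    "contract_types": [["CDI"], [], ["CDI", "Freelance"], []],
    "remote_possible": [True, False, False, False],
})


def detection_rate(flags):
    """Return (count, percent) of True values in a boolean Series."""
    count = int(flags.sum())
    return count, 100 * count / len(flags)


rates = {
    "salary": detection_rate(toy["salary_min"].notna()),
    "experience": detection_rate(toy["experience_min"].notna()),
    # An empty list is not NaN, so list columns are tested on length instead
    "contract": detection_rate(toy["contract_types"].apply(len) > 0),
    "remote": detection_rate(toy["remote_possible"]),
}

for name, (count, pct) in rates.items():
    print(f"{name:10} : {count}/{len(toy)} ({pct:.0f}%)")
```

Factoring the count and percentage into one helper avoids repeating the `x/len(df)*100` expression for every field.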


## 🔝 Top 20 Technical Skills

In [8]:
from collections import Counter

# Aggregate all tech skills
all_tech_skills = []
for skills_list in df_results['skills_tech']:
    all_tech_skills.extend(skills_list)

# Count occurrences
skill_counts = Counter(all_tech_skills).most_common(20)

print("\n" + "="*80)
print("🔝 TOP 20 TECHNICAL SKILLS")
print("="*80)
print()

for rank, (skill, count) in enumerate(skill_counts, 1):
    pct = (count / len(df_results)) * 100
    print(f"   {rank:2}. {skill:30} : {count} offer(s) ({pct:.0f}%)")


🔝 TOP 20 TECHNICAL SKILLS

    1. agile                          : 2 offer(s) (20%)
    2. jira                           : 1 offer(s) (10%)
    3. safe                           : 1 offer(s) (10%)
    4. sécurité                       : 1 offer(s) (10%)
    5. teams                          : 1 offer(s) (10%)
    6. go                             : 1 offer(s) (10%)
    7. rgpd                           : 1 offer(s) (10%)
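
The percentages above divide raw occurrence counts by the number of offers, which is only an offer-level share if each skill list contains a skill at most once. This notebook does not show whether `extract_skills` guarantees that, so the sketch below, on made-up lists, counts each skill once per offer explicitly.

```python
from collections import Counter

# Made-up skill lists, one per offer; the first contains a duplicate
offers_skills = [
    ["agile", "jira", "agile"],
    ["agile", "go"],
    [],
]

# Counting set(skills) makes each offer contribute at most once per skill
offer_counts = Counter(
    skill for skills in offers_skills for skill in set(skills)
)

for skill, count in offer_counts.most_common():
    pct = 100 * count / len(offers_skills)
    print(f"{skill:10} : {count} offer(s) ({pct:.0f}%)")
```

With raw counting, `agile` would be counted 3 times across 3 offers here. With per-offer de-duplication it is counted for 2 offers.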


## 💾 Export Results (Optional)

In [9]:
# Export to CSV for analysis
output_file = "test_nlp_results_10offres.csv"

# Prepare columns for CSV (flatten the lists; skills are capped at the first 10)
df_export = df_results.copy()
df_export['skills_tech_str'] = df_export['skills_tech'].apply(lambda x: ', '.join(x[:10]))
df_export['skills_soft_str'] = df_export['skills_soft'].apply(lambda x: ', '.join(x[:10]))
df_export['contract_types_str'] = df_export['contract_types'].apply(lambda x: ', '.join(x))

# Columns to export
cols_to_export = [
    'offer_id', 'title', 'company_name',
    'lemmas_count',
    'skill_count_tech', 'skill_count_soft',
    'skills_tech_str', 'skills_soft_str',
    'profile_category', 'profile_score', 'profile_confidence',
    'salary_min', 'salary_max',
    'experience_min', 'experience_max', 'experience_level',
    'education_level', 'education_type',
    'contract_types_str',
    'remote_possible', 'remote_days', 'remote_percentage'
]

df_export[cols_to_export].to_csv(output_file, index=False, encoding='utf-8')

print(f"\n✅ Results exported: {output_file}")
print(f"   {len(df_export)} offers processed")


✅ Results exported: test_nlp_results_10offres.csv
   10 offers processed
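
Flattening list columns into comma-separated strings keeps the CSV readable, but whoever reads the file back has to split them again. A minimal round-trip sketch on made-up data follows, using an in-memory buffer instead of a file. The toy frame is hypothetical, and only the `skills_tech` / `skills_tech_str` column names come from the export cell.

```python
import io

import pandas as pd

# Toy frame (made-up values) with one list column
toy = pd.DataFrame({
    "offer_id": [1, 2],
    "skills_tech": [["agile", "jira"], []],
})

# Flatten each list into a single comma-separated cell
toy["skills_tech_str"] = toy["skills_tech"].apply(", ".join)

buffer = io.StringIO()
toy[["offer_id", "skills_tech_str"]].to_csv(buffer, index=False)

# keep_default_na=False keeps empty cells as "" rather than NaN
restored = pd.read_csv(io.StringIO(buffer.getvalue()), keep_default_na=False)
restored["skills_tech"] = restored["skills_tech_str"].apply(
    lambda cell: cell.split(", ") if cell else []
)
print(restored["skills_tech"].tolist())
```

This round trip is only lossless as long as no skill name itself contains the `", "` separator.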
