# 📸 DataSens E1_v3 - Notebook 5: Snapshot and README

**🎯 Goal**: Produce an E1_v3 summary, export the DDL and CSV snapshots, create a Git tag, and define the E2/E3 roadmap

---

## 📋 Contents of this notebook

1. **E1_v3 summary**: Dataset prepared for E2 (simple annotation)
2. **DDL export**: Saves the SQL schema to `docs/e1_schema.sql`
3. **CSV export**: Snapshots of the simply annotated dataset in `data/gold/`
4. **AI dataset export**: Structured Parquet/CSV export for AI enrichment (E2)
5. **Table verification**: Checks that all tables are populated (themes, etc.)
6. **Git tag**: Creates the tag `E1_REAL_YYYYMMDD`
7. **E2/E3 roadmap**: Advanced AI annotation (CamemBERT, FlauBERT) in E2



In [None]:
# 📦 E1 inventory: sources and collection traces (DataLake + PostgreSQL)

import os
from datetime import datetime
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine, text

NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == "notebooks" else NOTEBOOK_DIR
load_dotenv(PROJECT_ROOT / ".env")

# Fail-fast DB URL (3s timeout)
PG_URL = (
    f"postgresql+psycopg2://{os.getenv('POSTGRES_USER','ds_user')}:{os.getenv('POSTGRES_PASS','ds_pass')}@"
    f"{os.getenv('POSTGRES_HOST','localhost')}:{int(os.getenv('POSTGRES_PORT','5432'))}/"
    f"{os.getenv('POSTGRES_DB','datasens')}?connect_timeout=3"
)

# Option to skip the database when it is unstable
SKIP_DB = os.getenv("DS_SKIP_DB", "0") == "1"
engine = create_engine(PG_URL, future=True, pool_pre_ping=True)

RAW = PROJECT_ROOT / "data" / "raw"

# Cap the scan (defensive)
def list_top(path: Path, pattern: str, limit: int = 50):
    try:
        files = list(path.glob(pattern)) if path.exists() else []
        return files[:limit]
    except Exception:
        return []

paths = {
    "kaggle_csv": list_top(RAW / "kaggle", "*.csv"),
    "owm_api": list_top(RAW / "api" / "owm", "*.csv"),
    "rss_multi": list_top(RAW / "rss", "*.csv"),
    "scraping_multi": list_top(RAW / "scraping" / "multi", "*.csv"),
    "gdelt_gkg": list_top(RAW / "gdelt", "*.zip"),
    "manifests": list_top(RAW / "manifests", "*.json"),
}

# Row counts in the database (when the tables exist)
db_counts = {"document": None, "flux": None, "source": None, "meteo": None, "evenement": None}
docs_by_type = []

if not SKIP_DB:
    try:
        with engine.connect() as conn:
            # 3 s query timeout
            try:
                conn.exec_driver_sql("SET LOCAL statement_timeout = 3000")
            except Exception:
                pass

            # Table names use the tXX_ prefixes
            for table, tname in [("document", "t04_document"), ("flux", "t03_flux"), ("source", "t02_source"), ("meteo", "t19_meteo"), ("evenement", "t25_evenement")]:
                try:
                    db_counts[table] = conn.execute(text(f"SELECT COUNT(*) FROM {tname}")).scalar()
                except Exception:
                    # A failed statement aborts the PostgreSQL transaction;
                    # roll back so the remaining counts can still run
                    conn.rollback()
                    db_counts[table] = None

            try:
                docs_by_type = conn.execute(text(
                    """
                    SELECT td.libelle AS type_donnee, COUNT(d.id_doc) AS nb_docs
                    FROM t04_document d
                    LEFT JOIN t03_flux f ON d.id_flux = f.id_flux
                    LEFT JOIN t02_source s ON f.id_source = s.id_source
                    LEFT JOIN t01_type_donnee td ON s.id_type_donnee = td.id_type_donnee
                    GROUP BY td.libelle
                    ORDER BY nb_docs DESC NULLS LAST
                    """
                )).fetchall()
            except Exception:
                docs_by_type = []
    except Exception:
        # Database unavailable: skip it cleanly
        SKIP_DB = True

# Console output
print("\n=== File inventory for data/raw ===")
for k, v in paths.items():
    print(f"- {k:14s}: {len(v)} file(s)")

print("\n=== Database row counts ===")
for k, v in db_counts.items():
    print(f"- {k:10s}: {v if v is not None else 'N/A'}")

if docs_by_type:
    print("\n=== Documents by type_donnee ===")
    for row in docs_by_type:
        print(f"- {row[0] or 'Unknown'}: {row[1]}")

# Write the Markdown report
report_path = PROJECT_ROOT / "docs" / "SOURCES_INVENTAIRE_E1.md"
report_path.parent.mkdir(parents=True, exist_ok=True)  # docs/ may not exist yet
report_lines = []
report_lines.append("# 🧾 E1 inventory: evidence of collection and insertion\n")
report_lines.append(f"Generated on: {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')} UTC\n\n")
report_lines.append("## Summary of files present (data/raw)\n")
for label, files in paths.items():
    report_lines.append(f"- **{label}**: {len(files)} file(s)")
    for p in sorted(files)[:5]:
        report_lines.append(f"  - {p.as_posix()}")
    if len(files) > 5:
        report_lines.append(f"  - (+{len(files)-5} more)\n")

report_lines.append("\n## Database row counts (PostgreSQL)\n")
for k, v in db_counts.items():
    report_lines.append(f"- **{k}**: {v if v is not None else 'N/A'}")

if docs_by_type:
    report_lines.append("\n### Documents by type_donnee\n")
    for row in docs_by_type:
        report_lines.append(f"- {row[0] or 'Unknown'}: {row[1]}")

report_path.write_text("\n".join(report_lines), encoding="utf-8")
print(f"\n✅ Inventory report written: {report_path}")


In [None]:
# Configuration
import os
import subprocess
from datetime import UTC, datetime
from pathlib import Path

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == "notebooks" else NOTEBOOK_DIR
load_dotenv(PROJECT_ROOT / ".env")

PG_HOST = os.getenv("POSTGRES_HOST", "localhost")
PG_PORT = int(os.getenv("POSTGRES_PORT", "5432"))
PG_DB = os.getenv("POSTGRES_DB", "datasens")
PG_USER = os.getenv("POSTGRES_USER", "ds_user")
PG_PASS = os.getenv("POSTGRES_PASS", "ds_pass")

PG_URL = f"postgresql+psycopg2://{PG_USER}:{PG_PASS}@{PG_HOST}:{PG_PORT}/{PG_DB}"
engine = create_engine(PG_URL, future=True)

print("✅ Configuration loaded")


## 📊 E1 summary: done and to do
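The table check in the cell below reduces to comparing each reference table's row count against a minimum. Here is a minimal sketch of that rule as a pure function, with no database involved. The function name `fill_status` and the sample counts are illustrative assumptions; the thresholds (12 for themes and theme categories, at least one row for territories) are the ones this notebook uses.

```python
def fill_status(count: int, minimum: int = 1, below_label: str = "Incomplete") -> str:
    """Return "OK" when `count` reaches `minimum`, otherwise `below_label`."""
    return "OK" if count >= minimum else below_label


# Hypothetical row counts, for illustration only
sample_counts = {"t23_theme_category": 12, "t24_theme": 9, "t17_territoire": 0}
# Minimum expected rows per reference table (values taken from this notebook)
minimums = {"t23_theme_category": 12, "t24_theme": 12, "t17_territoire": 1}

statuses = {
    table: fill_status(sample_counts[table], minimums[table])
    for table in sample_counts
}
for table, status in statuses.items():
    print(f"{table:20s} {sample_counts[table]:>4d}  {status}")
```

Keeping the rule in one function means a threshold can change in a single place instead of in each conditional expression.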


In [None]:
print("📊 E1_V3 SUMMARY - Final annotated dataset")
print("=" * 80)

import matplotlib.pyplot as plt
import pandas as pd

with engine.connect() as conn:
    # Statistics (table names use the tXX_ prefixes)
    stats = {
        "tables": conn.execute(text("SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'public' OR table_schema = 'datasens'")).scalar(),
        "documents": conn.execute(text("SELECT COUNT(*) FROM t04_document")).scalar(),
        "flux": conn.execute(text("SELECT COUNT(*) FROM t03_flux")).scalar(),
        "sources": conn.execute(text("SELECT COUNT(*) FROM t02_source")).scalar(),
        "meteo": conn.execute(text("SELECT COUNT(*) FROM t19_meteo")).scalar(),
        "evenements": conn.execute(text("SELECT COUNT(*) FROM t25_evenement")).scalar(),
        "themes": conn.execute(text("SELECT COUNT(*) FROM t24_theme")).scalar(),
    }
    
    # Statistics by data type
    df_final = pd.read_sql_query("""
        SELECT 
            td.libelle AS type_donnee,
            COUNT(DISTINCT s.id_source) AS nb_sources,
            COUNT(DISTINCT d.id_doc) AS nb_documents,
            COUNT(DISTINCT f.id_flux) AS nb_flux
        FROM t01_type_donnee td
        LEFT JOIN t02_source s ON td.id_type_donnee = s.id_type_donnee
        LEFT JOIN t03_flux f ON s.id_source = f.id_source
        LEFT JOIN t04_document d ON f.id_flux = d.id_flux
        GROUP BY td.libelle
        ORDER BY nb_documents DESC
    """, conn)

print("\n✅ Completed in E1_v3:")
print(f"   • {stats['tables']} PostgreSQL tables created (full architecture, 36/37 tables)")
print(f"   • {stats['sources']} sources configured")
print(f"   • {stats['flux']} collection feeds")
print(f"   • {stats['documents']:,} documents collected and cleaned")
print(f"   • {stats['meteo']} weather readings")
print(f"   • {stats['evenements']} events")
print(f"   • {stats['themes']} themes identified")

# Check that all reference tables are populated
print("\n📊 TABLE FILL CHECK (E1_v3)")
print("-" * 80)

with engine.connect() as conn:
    # Check themes and categories
    nb_theme_cat = conn.execute(text("SELECT COUNT(*) FROM t23_theme_category")).scalar()
    nb_themes = conn.execute(text("SELECT COUNT(*) FROM t24_theme")).scalar()
    nb_type_donnee = conn.execute(text("SELECT COUNT(*) FROM t01_type_donnee")).scalar()
    nb_territoire = conn.execute(text("SELECT COUNT(*) FROM t17_territoire")).scalar()
    nb_indicateur = conn.execute(text("SELECT COUNT(*) FROM t22_indicateur")).scalar()
    
    df_tables = pd.DataFrame({
        "Table": ["t23_theme_category", "t24_theme", "t01_type_donnee", "t17_territoire", "t22_indicateur"],
        "Row count": [nb_theme_cat, nb_themes, nb_type_donnee, nb_territoire, nb_indicateur],
        "Status": [
            "✅ OK" if nb_theme_cat >= 12 else "⚠️ Incomplete",
            "✅ OK" if nb_themes >= 12 else "⚠️ Incomplete",
            "✅ OK" if nb_type_donnee >= 5 else "⚠️ Incomplete",
            "✅ OK" if nb_territoire > 0 else "⚠️ Empty",
            "✅ OK" if nb_indicateur > 0 else "ℹ️ Optional"
        ]
    })
    display(df_tables)

    # Theme details
    if nb_themes > 0:
        df_themes_detail = pd.read_sql_query("""
            SELECT 
                tc.libelle AS categorie,
                COUNT(t.id_theme) AS nb_themes,
                STRING_AGG(t.libelle, ', ' ORDER BY t.libelle) AS themes
            FROM t23_theme_category tc
            LEFT JOIN t24_theme t ON tc.id_theme_cat = t.id_theme_cat
            GROUP BY tc.id_theme_cat, tc.libelle
            ORDER BY tc.id_theme_cat
        """, conn)
        print("\n📋 Themes by category:")
        display(df_themes_detail)

    if nb_theme_cat < 12 or nb_themes < 12:
        print("\n⚠️ WARNING: not all themes are populated yet")
        print("   💡 Re-run the notebook 02_schema_create.ipynb to complete them")

# Final dataset visualizations
print("\n📊 FINAL ANNOTATED DATASET VISUALIZATIONS (E1_V3)")
print("-" * 80)

# Chart 1: breakdown by data type
if len(df_final) > 0:
    print("\n📋 Breakdown by data type:")
    display(df_final)
    
    plt.figure(figsize=(14, 10))
    
    plt.subplot(2, 2, 1)
    bars = plt.bar(df_final["type_donnee"], df_final["nb_documents"], color=plt.cm.Set2(range(len(df_final))))
    for bar, value in zip(bars, df_final["nb_documents"]):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(df_final["nb_documents"]) * 0.02,
                f"{int(value):,}", ha='center', va='bottom', fontweight='bold', fontsize=9)
    plt.title("📊 Documents by data type (final dataset)", fontsize=12, fontweight='bold')
    plt.ylabel("Number of documents", fontsize=11)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis="y", linestyle="--", alpha=0.3)
    
    plt.subplot(2, 2, 2)
    plt.pie(df_final["nb_documents"], labels=df_final["type_donnee"], autopct='%1.1f%%', startangle=90)
    plt.title("📊 Document share by type (%)", fontsize=12, fontweight='bold')
    
    plt.subplot(2, 2, 3)
    bars = plt.bar(df_final["type_donnee"], df_final["nb_sources"], color=plt.cm.Pastel1(range(len(df_final))))
    for bar, value in zip(bars, df_final["nb_sources"]):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(df_final["nb_sources"]) * 0.02,
                str(int(value)), ha='center', va='bottom', fontweight='bold', fontsize=9)
    plt.title("📊 Sources by data type", fontsize=12, fontweight='bold')
    plt.ylabel("Number of sources", fontsize=11)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis="y", linestyle="--", alpha=0.3)
    
    plt.subplot(2, 2, 4)
    bars = plt.bar(df_final["type_donnee"], df_final["nb_flux"], color='#4ECDC4')
    for bar, value in zip(bars, df_final["nb_flux"]):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(df_final["nb_flux"]) * 0.02,
                str(int(value)), ha='center', va='bottom', fontweight='bold', fontsize=9)
    plt.title("📊 Feeds by data type", fontsize=12, fontweight='bold')
    plt.ylabel("Number of feeds", fontsize=11)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis="y", linestyle="--", alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Overview of the documents
print("\n📋 Final dataset preview (top 50 documents by id_doc, descending):")
df_docs_final = pd.read_sql_query("""
    SELECT 
        d.id_doc,
        LEFT(d.titre, 60) AS titre,
        LEFT(d.texte, 100) AS texte_apercu,
        d.langue,
        d.date_publication,
        s.nom AS source,
        td.libelle AS type_donnee
    FROM t04_document d
    JOIN t03_flux f ON d.id_flux = f.id_flux
    JOIN t02_source s ON f.id_source = s.id_source
    JOIN t01_type_donnee td ON s.id_type_donnee = td.id_type_donnee
    ORDER BY d.id_doc DESC
    LIMIT 50
""", engine)
display(df_docs_final)

print("\n✅ 6 source types ingested in E1_v3:")
print("   1. Flat CSV file (Kaggle)")
print("   2. OpenWeatherMap API (weather)")
print("   3. Multi-source RSS feeds (Franceinfo, 20 Minutes, Le Monde)")
print("   4. NewsAPI (optional)")
print("   5. Multi-source web scraping (Reddit, YouTube, Vie-publique, data.gouv)")
print("   6. Big Data GDELT GKG")
print("\n📋 E1_v3: dataset prepared for E2")
print("   ✅ Simple annotation: cleaning, deduplication, basic QA")
print("   ✅ Structure ready for AI enrichment (E2)")
print("\n📋 Next steps (E2/E3):")
print("   • E2: AI enrichment (CamemBERT, FlauBERT)")
print("   • E2: Sentiment, NER, and keyword annotation (advanced AI)")
print("   • E3: Power BI dashboard")
print("   • E3: Prefect/Airflow orchestration")
print("   • E3: Automated tests")


## 💾 DDL export: saving the SQL schema
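The query in the cell below assembles `CREATE TABLE` statements from `information_schema.columns`. The same mapping can be sketched in plain Python, which makes the type translation easier to read and to test. The helper names `render_column` and `render_create_table` and the sample column rows are assumptions for illustration. Like the SQL version, this covers column types and `NOT NULL` only, not keys, defaults, or indexes.

```python
# Mapping from information_schema data_type values to DDL keywords,
# mirroring the CASE expression used in this notebook.
TYPE_MAP = {
    "integer": "INTEGER",
    "bigint": "BIGINT",
    "text": "TEXT",
    "timestamp without time zone": "TIMESTAMP",
    "real": "FLOAT",
}


def render_column(name, data_type, max_length=None, nullable=True):
    """Render one column definition from its metadata."""
    if data_type == "character varying":
        # An unbounded varchar has no maximum length
        sql_type = f"VARCHAR({max_length})" if max_length else "VARCHAR"
    else:
        # Unknown types are passed through unchanged
        sql_type = TYPE_MAP.get(data_type, data_type)
    return f"{name} {sql_type}" + ("" if nullable else " NOT NULL")


def render_create_table(table, columns):
    """Render CREATE TABLE from (name, data_type, max_length, nullable) rows."""
    body = ",\n    ".join(render_column(*col) for col in columns)
    return f"CREATE TABLE {table} (\n    {body}\n);"


# Hypothetical column metadata, listed in ordinal_position order
sample_columns = [
    ("id_theme", "integer", None, False),
    ("libelle", "character varying", 100, False),
    ("description", "text", None, True),
]
ddl = render_create_table("t24_theme", sample_columns)
print(ddl)
```

For a schema that must be restorable as-is, `pg_dump --schema-only` remains the safer route, as the cell itself points out.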


In [None]:
print("💾 PostgreSQL DDL export")
print("=" * 80)

# Export the schema
with engine.connect() as conn:
    schema_query = """
    SELECT
        'CREATE TABLE ' || table_name || ' (' || E'\\n' ||
        string_agg(
            column_name || ' ' ||
            CASE
                WHEN data_type = 'integer' THEN 'INTEGER'
                WHEN data_type = 'bigint' THEN 'BIGINT'
                WHEN data_type = 'text' THEN 'TEXT'
                WHEN data_type = 'character varying' THEN 'VARCHAR(' || character_maximum_length || ')'
                WHEN data_type = 'timestamp without time zone' THEN 'TIMESTAMP'
                WHEN data_type = 'real' THEN 'FLOAT'
                ELSE data_type
            END ||
            CASE WHEN is_nullable = 'NO' THEN ' NOT NULL' ELSE '' END,
            ',' || E'\\n    '
            ORDER BY ordinal_position
        ) || E'\\n);'
        as ddl
    FROM information_schema.columns
    WHERE table_schema = 'public'
    GROUP BY table_name;
    """

    # Run the query above to get an approximate DDL (column types and NOT NULL only);
    # keys, defaults, and indexes require pg_dump
    print("📝 Generating the SQL schema...")
    try:
        ddl_statements = [row[0] for row in conn.execute(text(schema_query)) if row[0]]
    except Exception as e:
        ddl_statements = []
        print(f"⚠️ DDL query failed: {str(e)[:80]}")

    # Create the docs folder if it does not exist
    docs_dir = PROJECT_ROOT / "docs"
    docs_dir.mkdir(exist_ok=True)

    schema_header = f"""
-- DataSens E1 - PostgreSQL schema
-- Export generated on {datetime.now(UTC).isoformat()}
-- {len(ddl_statements)} tables in schema public (approximate DDL)

-- Note: for a complete export, use:
-- pg_dump -h {PG_HOST} -U {PG_USER} -d {PG_DB} --schema-only > docs/e1_schema.sql
"""

    schema_file = docs_dir / "e1_schema.sql"
    schema_file.write_text(schema_header + "\n" + "\n\n".join(ddl_statements) + "\n", encoding="utf-8")

    print(f"✅ Schema exported: {schema_file} ({len(ddl_statements)} tables)")
    print("   💡 For a complete export, run:")
    print(f"      pg_dump -h {PG_HOST} -U {PG_USER} -d {PG_DB} --schema-only > docs/e1_schema.sql")


## 📤 CSV export: data snapshots (data/gold/)
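The snapshot step boils down to three things: read a bounded number of rows per table, skip empty tables, and write each result to a timestamped CSV file. Here is a self-contained sketch of that loop. It assumes an in-memory SQLite database as a stand-in for PostgreSQL and the standard `csv` module in place of pandas. The function name `snapshot_table` and the sample rows are hypothetical.

```python
import csv
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_table(conn, table, out_dir, stamp, limit=1000):
    """Write up to `limit` rows of `table` to <out_dir>/<table>_<stamp>.csv.

    Returns (path, row_count), or (None, 0) when the table is empty.
    `table` must come from a trusted, hard-coded list: it is interpolated
    into the SQL text because identifiers cannot be bound as parameters.
    """
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT ?", (limit,))
    rows = cursor.fetchall()
    if not rows:
        return None, 0
    path = Path(out_dir) / f"{table}_{stamp}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(rows)
    return path, len(rows)


# Stand-in database: one populated table and one empty table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t24_theme (id_theme INTEGER, libelle TEXT)")
conn.execute("CREATE TABLE t19_meteo (id_meteo INTEGER)")
conn.executemany("INSERT INTO t24_theme VALUES (?, ?)", [(1, "sante"), (2, "climat")])

stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
out_dir = tempfile.mkdtemp()
theme_path, theme_rows = snapshot_table(conn, "t24_theme", out_dir, stamp)
meteo_path, meteo_rows = snapshot_table(conn, "t19_meteo", out_dir, stamp)
print(theme_path.name, theme_rows, meteo_path, meteo_rows)
```

Returning the row count next to the path gives the caller structured data for a summary chart, so nothing has to be parsed back out of a log message.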


In [None]:
print("📤 CSV export - final annotated dataset snapshots (data/gold/)")
print("=" * 80)

import matplotlib.pyplot as plt
import pandas as pd

gold_dir = PROJECT_ROOT / "data" / "gold"
gold_dir.mkdir(parents=True, exist_ok=True)

timestamp = datetime.now(UTC).strftime("%Y%m%d_%H%M%S")

# Export the main tables (names use the tXX_ prefixes)
tables_to_export = [
    ("document", "t04_document"),
    ("source", "t02_source"),
    ("flux", "t03_flux"),
    ("territoire", "t17_territoire"),
    ("meteo", "t19_meteo"),
    ("evenement", "t25_evenement"),
    ("theme", "t24_theme")
]

exported = []      # console messages, one per table
export_data = []   # structured row counts for the chart below
for table_name, table_full in tables_to_export:
    try:
        df = pd.read_sql_query(f"SELECT * FROM {table_full} LIMIT 1000", engine)  # capped for the demo
        if len(df) > 0:
            csv_path = gold_dir / f"{table_name}_{timestamp}.csv"
            df.to_csv(csv_path, index=False, encoding='utf-8')
            exported.append(f"   ✅ {table_name}: {len(df)} rows → {csv_path.name}")
            export_data.append({"Table": table_name, "Rows exported": len(df)})
    except Exception as e:
        exported.append(f"   ⚠️ {table_name}: Error - {str(e)[:80]}")

print("\n📊 CSV exports of the final dataset:")
for item in exported:
    print(item)

# Chart of the exported snapshots
if export_data:
    print("\n📊 Exported snapshots:")
    df_exports = pd.DataFrame(export_data)
    display(df_exports)

    plt.figure(figsize=(12, 6))
    bars = plt.bar(df_exports["Table"], df_exports["Rows exported"], color=plt.cm.Set3(range(len(df_exports))))
    for bar, value in zip(bars, df_exports["Rows exported"]):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(df_exports["Rows exported"]) * 0.02,
                f"{int(value):,}", ha='center', va='bottom', fontweight='bold', fontsize=10)
    plt.title("📤 Snapshots exported to data/gold/ (final dataset)", fontsize=12, fontweight='bold')
    plt.ylabel("Number of rows", fontsize=11)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis="y", linestyle="--", alpha=0.3)
    plt.tight_layout()
    plt.show()

print("\n✅ Final annotated dataset snapshots saved to data/gold/")

# ============================================================
# STRUCTURED DATASET EXPORT FOR AI (Parquet)
# ============================================================
print("\n" + "=" * 80)
print("📦 STRUCTURED DATASET EXPORT FOR AI (Parquet format)")
print("=" * 80)

try:
    import pyarrow  # noqa: F401  (fails fast here if the Parquet engine is missing)

    print("\n📊 Exporting the full dataset for AI enrichment (E2)...")

    # Create the export folder if needed
    export_dir = PROJECT_ROOT / "data" / "gold" / "dataset_ia"
    export_dir.mkdir(parents=True, exist_ok=True)

    # Consolidated query: documents plus metadata ready for AI
    dataset_query = """
        SELECT 
            d.id_doc,
            d.titre,
            d.texte,
            d.langue,
            d.date_publication,
            d.hash_fingerprint,
            s.nom AS source_nom,
            td.libelle AS type_donnee,
            f.date_collecte,
            t.ville AS territoire,
            -- Aggregated themes
            STRING_AGG(DISTINCT th.libelle, '; ') AS themes,
            -- Annotation count (if any)
            (SELECT COUNT(*) FROM t05_annotation ann WHERE ann.id_doc = d.id_doc) AS nb_annotations
        FROM t04_document d
        LEFT JOIN t03_flux f ON d.id_flux = f.id_flux
        LEFT JOIN t02_source s ON f.id_source = s.id_source
        LEFT JOIN t01_type_donnee td ON s.id_type_donnee = td.id_type_donnee
        LEFT JOIN t17_territoire t ON d.id_territoire = t.id_territoire
        LEFT JOIN t26_document_theme dt ON d.id_doc = dt.id_doc
        LEFT JOIN t24_theme th ON dt.id_theme = th.id_theme
        GROUP BY d.id_doc, d.titre, d.texte, d.langue, d.date_publication, d.hash_fingerprint,
                 s.nom, td.libelle, f.date_collecte, t.ville
        ORDER BY d.date_publication DESC
    """
    
    df_dataset_ia = pd.read_sql_query(dataset_query, engine)
    
    if len(df_dataset_ia) > 0:
        # Parquet export (compact, typed columnar format suited to ML pipelines)
        parquet_path = export_dir / f"datasens_dataset_ia_{timestamp}.parquet"
        df_dataset_ia.to_parquet(parquet_path, engine='pyarrow', compression='snappy', index=False)

        file_size_mb = parquet_path.stat().st_size / (1024 * 1024)
        print("\n✅ AI dataset exported:")
        print(f"   📄 File: {parquet_path.name}")
        print(f"   📊 {len(df_dataset_ia):,} documents")
        print(f"   💾 Size: {file_size_mb:.2f} MB")
        print(f"   📁 Path: {parquet_path}")

        # CSV export as well (for compatibility)
        csv_path = export_dir / f"datasens_dataset_ia_{timestamp}.csv"
        df_dataset_ia.to_csv(csv_path, index=False, encoding='utf-8')
        csv_size_mb = csv_path.stat().st_size / (1024 * 1024)
        print("\n✅ Additional CSV export:")
        print(f"   📄 File: {csv_path.name}")
        print(f"   💾 Size: {csv_size_mb:.2f} MB")

        # Dataset preview
        print("\n📋 AI dataset preview (first 5 documents):")
        display(df_dataset_ia.head())

        # Statistics by data type
        print("\n📊 Dataset statistics by data type:")
        stats_type = df_dataset_ia.groupby('type_donnee').agg({
            'id_doc': 'count',
            'langue': lambda x: x.value_counts().to_dict()
        }).rename(columns={'id_doc': 'nb_documents'})
        display(stats_type)

    else:
        print("⚠️ No documents to export")
        
except ImportError:
    print("⚠️ PyArrow is not installed - Parquet export unavailable")
    print("   💡 Install it with: pip install pyarrow")
    print("   ✅ The per-table CSV snapshots above are still available")
except Exception as e:
    print(f"⚠️ AI dataset export error: {str(e)[:100]}")
    print("   ✅ The per-table CSV snapshots above are still available")

print("\n" + "=" * 80)
print("✅ STRUCTURED DATASET EXPORT FINISHED")
print("=" * 80)
print("\n📋 Files available for download (if the export above succeeded):")
print(f"   • Parquet (recommended): data/gold/dataset_ia/datasens_dataset_ia_{timestamp}.parquet")
print(f"   • CSV (compatibility): data/gold/dataset_ia/datasens_dataset_ia_{timestamp}.csv")
print("\n🎯 This dataset is ready for AI enrichment (E2) with CamemBERT and FlauBERT")


## 🏷️ Creating the Git tag: E1_REAL_YYYYMMDD
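Two details of the tagging step are worth isolating. The date in the tag name should come from a single timezone-aware clock. The existence check should compare whole tag names, which matters if the listing ever contains more than one tag. Here is a minimal sketch of both as pure functions. `build_tag_name` and `tag_exists` are hypothetical helper names, and the `git_tag_output` argument stands for the text that `git tag -l` would print.

```python
from datetime import datetime, timezone


def build_tag_name(now=None, prefix="E1_REAL"):
    """Return a tag such as E1_REAL_20250131 built from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d')}"


def tag_exists(git_tag_output, tag_name):
    """Check for an exact tag match in the output of `git tag -l`."""
    # Splitting on whitespace yields one entry per tag, so a longer tag
    # that merely starts with `tag_name` does not count as a match.
    return tag_name in git_tag_output.split()


fixed_day = datetime(2025, 1, 31, tzinfo=timezone.utc)  # arbitrary example date
tag = build_tag_name(fixed_day)
print(tag)                                                       # E1_REAL_20250131
print(tag_exists("E1_REAL_20250131_rc1\n", tag))                 # False
print(tag_exists("E1_REAL_20250130\nE1_REAL_20250131\n", tag))   # True
```

Passing the timestamp in as an argument keeps the naming rule testable without depending on the current date.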


In [None]:
print("🏷️ Creating the Git tag")
print("=" * 80)

tag_name = f"E1_REAL_{datetime.now(UTC).strftime('%Y%m%d')}"

git_dir = PROJECT_ROOT / ".git"
if git_dir.exists():
    try:
        # Check whether the tag already exists
        result = subprocess.run(
            ["git", "tag", "-l", tag_name],
            check=False, cwd=PROJECT_ROOT,
            capture_output=True,
            text=True
        )

        if tag_name in result.stdout.split():
            print(f"⚠️ Tag {tag_name} already exists")
        else:
            # Create the annotated tag
            subprocess.run(
                ["git", "tag", "-a", tag_name, "-m", f"DataSens E1 complete - {tag_name}"],
                cwd=PROJECT_ROOT,
                check=True
            )
            print(f"✅ Git tag created: {tag_name}")
            print(f"   💡 To push the tag: git push origin {tag_name}")
    except Exception as e:
        print(f"⚠️ Tag creation failed: {e}")
        print(f"   💡 Manual creation: git tag -a {tag_name} -m 'DataSens E1'")
else:
    print("⚠️ Git repository not initialized")
    print(f"   💡 Suggested tag: {tag_name}")


## 🗺️ E2/E3 roadmap

Planning the next steps


### 📋 E2 - AI enrichment (CamemBERT, FlauBERT)

**E1_v3 prepares the dataset** with simple annotation (cleaning, deduplication, basic QA).

**E2 will add advanced AI annotation**:
- **Automatic annotation**: sentiment analysis (FlauBERT, CamemBERT)
- **Named entity extraction**: spaCy NER (people, organizations, places)
- **Vector embeddings**: sentence-transformers for semantic search
- **Thematic classification**: multi-label ML (scikit-learn)
- **Tables to use**: `t05_annotation`, `t08_emotion`, `t06_annotation_emotion` (already created in E1_v3)

### 📊 E3 - Production and visualization

- **REST API**: FastAPI to expose the data
- **Dashboard**: Power BI or Streamlit for interactive visualizations
- **Orchestration**: Prefect/Airflow for automated collection
- **Monitoring**: Grafana + Prometheus for metrics
- **Tests**: pytest for automated validation
- **Documentation**: API docs (Swagger/OpenAPI)

### ✅ E1_v3 validated

- ✅ Merise modeling (MCD → MLD → MPD) - 36/37 tables
- ✅ Full PostgreSQL architecture created
- ✅ Full CRUD tested
- ✅ 6 source types ingested
- ✅ Simple annotation: cleaning, deduplication, basic QA
- ✅ Dataset prepared for AI enrichment (E2)
- ✅ Traceability (feeds, manifests, Git versioning)

---

**🎉 Congratulations! E1_v3 is complete!**

**E1_v3**: dataset cleaned and simply annotated, ready for E2  
**Next steps**: E2 with advanced AI enrichment (CamemBERT, FlauBERT)

