# DataSens E1_v2 — 04_quality_checks

- Objectives: PostgreSQL QA (row counts, duplicates, integrity)
- Prerequisite: 03_ingest_sources
- Guide: docs/GUIDE_TECHNIQUE_E1.md



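> Notes:
> - Quick checks on the PostgreSQL side: volumes (row counts for `document` and `flux`) and potential duplicates.
> - `read_sql_query` returns the result as a DataFrame, which makes it easy to display. When a query takes external values, pass them through `params` instead of string formatting to avoid SQL injection.
> - Extend as needed: null checks on critical columns, foreign-key integrity, an index on `hash_fingerprint`.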
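
The extensions listed in the notes (null checks, foreign-key integrity, an index on `hash_fingerprint`) could look roughly like the sketch below. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for PostgreSQL so the snippet runs anywhere, and the columns `id_doc`, `id_flux`, and `texte`, the `document` to `flux` relationship, and the index name `idx_document_hash` are hypothetical. Only the table names and `hash_fingerprint` come from this notebook.

```python
import sqlite3

# Assumption: SQLite in memory stands in for PostgreSQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flux (id_flux INTEGER PRIMARY KEY);
    CREATE TABLE document (
        id_doc INTEGER PRIMARY KEY,
        id_flux INTEGER,
        texte TEXT,
        hash_fingerprint TEXT
    );
    INSERT INTO flux (id_flux) VALUES (1), (2);
    INSERT INTO document (id_doc, id_flux, texte, hash_fingerprint) VALUES
        (1, 1, 'a', 'h1'),
        (2, 1, NULL, 'h2'),   -- missing text
        (3, 99, 'c', 'h3');   -- points to a flux that does not exist
""")


def scalar(sql):
    """Run a query and return the first column of its first row."""
    return conn.execute(sql).fetchone()[0]


# 1) Nulls in a column treated as critical (here: the hypothetical `texte`).
null_texts = scalar("SELECT COUNT(*) FROM document WHERE texte IS NULL")

# 2) Orphan rows: documents whose id_flux has no matching row in flux.
orphans = scalar("""
    SELECT COUNT(*)
    FROM document d
    LEFT JOIN flux f ON f.id_flux = d.id_flux
    WHERE d.id_flux IS NOT NULL AND f.id_flux IS NULL
""")

# 3) Index that speeds up the GROUP BY used for duplicate detection.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_document_hash ON document (hash_fingerprint)"
)
# PRAGMA index_list is SQLite-specific; the second column is the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('document')")]

print(f"null texts: {null_texts} | orphans: {orphans} | indexes: {index_names}")
```

The two `SELECT` statements and the `CREATE INDEX IF NOT EXISTS` statement use syntax that PostgreSQL also accepts, so they should carry over once the column names are adapted to the real schema. The `PRAGMA` call is SQLite-only and is there just to confirm the index was created.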



In [None]:
# DataSens E1_v2 - 04_quality_checks
# Simple QA checks on the PostgreSQL side
import os

import pandas as pd
from sqlalchemy import create_engine, text

PG_URL = os.getenv("DATASENS_PG_URL", "postgresql+psycopg2://ds_user:ds_pass@localhost:5432/datasens")
engine = create_engine(PG_URL, future=True)

# Read-only checks: a plain connection is enough, no transaction to commit
with engine.connect() as conn:
    n_doc = conn.execute(text("SELECT COUNT(*) FROM document")).scalar()
    n_flux = conn.execute(text("SELECT COUNT(*) FROM flux")).scalar()
print(f"📦 documents:{n_doc} | flux:{n_flux}")

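# Fingerprints shared by more than one document (one row per duplicated fingerprint)
with engine.connect() as conn:
    dup = pd.read_sql_query(
        text(
            """
            SELECT hash_fingerprint, COUNT(*) AS c
            FROM document
            WHERE hash_fingerprint IS NOT NULL
            GROUP BY hash_fingerprint
            HAVING COUNT(*) > 1
            """
        ),
        conn,
    )
print("🔎 Duplicated fingerprints:", len(dup))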
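
The duplicate query in this cell is static, so it does not itself use parameters. As a hedged illustration of the parameterized style mentioned in the notes, the sketch below binds a threshold as a named parameter instead of formatting it into the SQL string. It uses the standard-library `sqlite3` module on an in-memory table so it runs without a database server. The helper `duplicate_groups`, the `min_count` parameter, and the sample rows are hypothetical. With pandas and SQLAlchemy, the analogous approach is to wrap the statement in `text(...)` with a `:min_count` placeholder and pass the values through the `params` argument of `read_sql_query`.

```python
import sqlite3

# Assumption: SQLite in memory stands in for PostgreSQL; sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id_doc INTEGER PRIMARY KEY, hash_fingerprint TEXT);
    INSERT INTO document (hash_fingerprint) VALUES
        ('h1'), ('h1'), ('h2'), (NULL), ('h3'), ('h3'), ('h3');
""")


def duplicate_groups(conn, min_count=2):
    """Return (hash_fingerprint, count) for fingerprints seen at least min_count times."""
    sql = """
        SELECT hash_fingerprint, COUNT(*) AS c
        FROM document
        WHERE hash_fingerprint IS NOT NULL
        GROUP BY hash_fingerprint
        HAVING COUNT(*) >= :min_count
        ORDER BY c DESC, hash_fingerprint
    """
    # The value is bound by the driver, never interpolated into the SQL text.
    return conn.execute(sql, {"min_count": min_count}).fetchall()


groups = duplicate_groups(conn)
# Rows that would have to go to keep one document per fingerprint.
surplus_rows = sum(count - 1 for _, count in groups)

print("duplicated fingerprints:", len(groups))
print("surplus rows:", surplus_rows)
```

Note the difference between the two figures: `len(groups)` counts duplicated fingerprints, which is what the cell above reports, while `surplus_rows` counts the extra documents behind them.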

