# DataSens — E1 (v1) — 05_snapshot_and_readme

**Objective:**

- Prove E1: the database is created, populated, and queryable.
- Produce an **"audit" snapshot**:
  - CSV exports (Parquet later)
  - Per-table statistics
  - Auto-generated mini README (ready to paste into Notion)
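
The first objective (created, populated, queryable) can also be checked directly against the SQLite file, independently of the ORM models. The following is a minimal sketch using only the standard library. The helper name `table_row_counts` is hypothetical and not part of the notebook series; the table list is the one this notebook uses.

```python
import sqlite3
from contextlib import closing

EXPECTED_TABLES = [
    "source", "raw_data", "sync_log",
    "topic", "document_topic", "model_output",
]

def table_row_counts(db_path, expected=EXPECTED_TABLES):
    """Return {table: row_count}; the value is None when the table is missing.

    Hypothetical helper, for illustration only.
    Note: sqlite3.connect creates an empty file if db_path does not exist,
    in which case every expected table is reported as None.
    """
    counts = {}
    with closing(sqlite3.connect(str(db_path))) as conn:
        existing = {
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        for table in expected:
            if table in existing:
                # Table names come from the fixed list above, not from user input
                counts[table] = conn.execute(
                    f'SELECT COUNT(*) FROM "{table}"'
                ).fetchone()[0]
            else:
                counts[table] = None
    return counts
```

A `None` value points to a schema problem (table never created), while `0` points to an ingestion problem (table created but empty).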

In [None]:
import pandas as pd
from sqlmodel import Session, select, create_engine, SQLModel, Field
from datetime import datetime
from pathlib import Path
from typing import Optional

# Schema classes (copy-pasted)
class Source(SQLModel, table=True):
    __tablename__ = "source"
    source_id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    source_kind: str
    url: Optional[str] = None
    frequency: Optional[str] = None
    active: bool = True

class RawData(SQLModel, table=True):
    __tablename__ = "raw_data"
    raw_id: Optional[int] = Field(default=None, primary_key=True)
    source_id: int = Field(foreign_key="source.source_id", index=True)
    title: Optional[str] = None
    text: str
    author: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)

class SyncLog(SQLModel, table=True):
    __tablename__ = "sync_log"
    log_id: Optional[int] = Field(default=None, primary_key=True)
    source_id: int = Field(foreign_key="source.source_id", index=True)
    sync_at: datetime = Field(default_factory=datetime.utcnow, index=True)
    status: str
    records_inserted: int = 0
    message: Optional[str] = None

class Topic(SQLModel, table=True):
    __tablename__ = "topic"
    topic_id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(index=True, unique=True)
    category: Optional[str] = None

class DocumentTopic(SQLModel, table=True):
    __tablename__ = "document_topic"
    raw_id: int = Field(foreign_key="raw_data.raw_id", primary_key=True)
    topic_id: int = Field(foreign_key="topic.topic_id", primary_key=True)

class ModelOutput(SQLModel, table=True):
    __tablename__ = "model_output"
    output_id: Optional[int] = Field(default=None, primary_key=True)
    raw_id: int = Field(foreign_key="raw_data.raw_id", index=True)
    model_name: str
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)

BASE_DIR = Path.home() / "datasens_project"
EXPORT_DIR = BASE_DIR / "exports" / "E1_v1"
EXPORT_DIR.mkdir(parents=True, exist_ok=True)

DB_PATH = BASE_DIR / "datasens_e1_v1.sqlite"
DATABASE_URL = f"sqlite:///{DB_PATH}"
engine = create_engine(DATABASE_URL, echo=False)

tables_dict = {
    "source": Source,
    "raw_data": RawData,
    "sync_log": SyncLog,
    "topic": Topic,
    "document_topic": DocumentTopic,
    "model_output": ModelOutput,
}

print(f"Setup complete: {EXPORT_DIR}")

## 1) CSV export (full snapshot)

In [None]:
with Session(engine) as session:
    for table_name, model_class in tables_dict.items():
        rows = session.exec(select(model_class)).all()
        # Pass the column names explicitly so an empty table still yields a CSV with a header row
        df = pd.DataFrame([r.model_dump() for r in rows], columns=list(model_class.model_fields))

        output_file = EXPORT_DIR / f"{table_name}.csv"
        df.to_csv(output_file, index=False)
        print(f"Exported {table_name}: {len(rows)} rows → {output_file}")
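
Because the snapshot serves as audit evidence, it may be worth re-reading each exported file and comparing its row count with the count printed by the export loop. The following is a minimal sketch using the standard library `csv` module. `count_csv_rows` is a hypothetical helper, and it assumes that the first line of each file is a header.

```python
import csv

def count_csv_rows(csv_path):
    """Count data rows in a CSV file, excluding the header line.

    csv.reader is used instead of counting newlines, so a text field
    that contains embedded line breaks still counts as a single row.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header; harmless if the file is empty
        return sum(1 for _ in reader)
```

For example, `count_csv_rows(EXPORT_DIR / "raw_data.csv")` should match the number reported for `raw_data` during the export.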

## 2) Audit statistics (E1 evidence)

In [None]:
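from sqlalchemy import func

with Session(engine) as session:
    # Count in SQL rather than loading every row into memory
    stats = {name: session.exec(select(func.count()).select_from(model_class)).one()
             for name, model_class in tables_dict.items()}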

print("\n" + "="*50)
print(" AUDIT STATS — E1 V1 SUCCESS")
print("="*50)
for table, count in sorted(stats.items()):
    status = "OK" if count > 0 else "EMPTY"
    print(f"{status:5s} {table:20s}: {count:4d} rows")
print("="*50)
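
The printed table is meant to be read by a person. For an automated check, the same `stats` dictionary can be reduced to a pass/fail result. The sketch below assumes that "E1 proven" means every table holds at least one row; that criterion is an assumption, and `empty_tables` is a hypothetical helper.

```python
def empty_tables(stats):
    """Return the sorted names of tables whose row count is zero."""
    return sorted(name for name, count in stats.items() if count == 0)

# Made-up counts for illustration, not real DataSens figures
example_stats = {"source": 3, "raw_data": 120, "model_output": 0}
problems = empty_tables(example_stats)
print(problems)  # → ['model_output']
```

In the notebook, `assert not empty_tables(stats)` would stop execution before the README is generated from incomplete data.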

## 3) Auto-generated README (copy-paste ready)

In [None]:
readme = f"""
# DataSens — E1 V1 — Complete Setup

## Database
- **File**: {DB_PATH}
- **Engine**: SQLite (zero-config)
- **Tables**: source, raw_data, sync_log, topic, document_topic, model_output

## Audit Trail
- **Setup**: 01_setup_env.ipynb
- **Schema**: 02_schema_create.ipynb
- **Ingestion**: 03_ingest_sources.ipynb
- **CRUD**: 04_crud_tests.ipynb
- **Snapshot**: 05_snapshot_and_readme.ipynb

## Data Stats
"""

# stats was computed in the audit cell above, so no database session is needed here
for table, count in sorted(stats.items()):
    readme += f"- **{table}**: {count} rows\n"

readme += f"""
## Exports
- Location: {EXPORT_DIR}
- Format: CSV (one file per table)
- Use case: Audit, backup, external tools

## Git Tags
- E1_v1_step01_setup_env_ok
- E1_v1_step02_schema_ok
- E1_v1_step03_ingest_ok
- E1_v1_step04_crud_ok
- E1_v1_step05_snapshot_ok

## Next Steps (E1 V2)
1. Real data sources (RSS multi-source, GDELT, API)
2. PostgreSQL migration
3. Data lake with Hive partitioning
4. PySpark ETL
"""

print(readme)

# Save README (explicit UTF-8, since the text contains non-ASCII characters)
readme_file = EXPORT_DIR / "E1_V1_README.md"
readme_file.write_text(readme, encoding="utf-8")
print(f"\nREADME saved: {readme_file}")

## Final Commit & Tag

**Congratulations!** E1 V1 complete.

```bash
git add .
git commit -m "E1 v1 - complete (schema + data + CRUD + audit)"
git tag E1_v1_step05_snapshot_ok
git tag E1_v1_complete
```

### What You've Built
1. **Database schema** (SQLite, 6 tables, 3NF)
2. **Data ingestion** (RSS, topics, tagging)
3. **CRUD operations** (full test suite)
4. **Audit trail** (sync_log, version tags)
5. **Exports** (CSV snapshots)

### Ready for E1 V2
- Real multi-source ingestion
- PostgreSQL + DataLake
- Production-grade error handling

---

**Status**: **E1 V1 Ready for Jury Presentation**