
# 🧰 C5.1.3: Technology and tool choices (Block 5)

**Objective:** justify the **technical** stack (MVP → extensions), compare the alternatives against explicit **criteria**, and provide **minimal evidence**: environment files, *smoke tests*, and API/App contracts.

> **Related context:**  
> - C5.1.1 produced a **normalized** dataset (`data/interim/amazon_electronics_normalized.parquet|csv`).  
> - C5.1.2 fixed the **strategy** (tasks, KPIs). Here we choose **the tools** to execute that strategy.



## 🎙️ Pitch (1–2 min)
> "We choose a **modular**, **local-first** stack:  
> - **Core**: `pandas` + `scikit-learn` (TF‑IDF + LogReg / ComplementNB) for a **fast**, **explainable**, **low-cost** MVP.  
> - **Demo**: `Streamlit` (UI) and `FastAPI` (service) to expose a standard `joblib` artifact.  
> - **Extensions**: `spaCy`/`KeyBERT` for **aspect extraction**; `transformers`/`sentence-transformers` (with `PyTorch`) for **fine-grained emotions** and **sarcasm**; `Evidently` for **monitoring**; `Prefect` for **orchestration**.  
> The choices are driven by our **KPIs** (precision, latency < 200 ms in the demo) and by **maintainability**."


## 1) Criteria grid: brief comparison

In [1]:

import pandas as pd
# Each tool is scored from 1 (weak) to 5 (strong) on every criterion
criteria = ["Perf", "Latency/CPU", "Simplicity", "Ecosystem", "Explainability", "Cost"]
scores = {
    "pandas":                [2, 3, 5, 5, 5, 5],
    "scikit-learn":          [4, 5, 5, 5, 4, 5],
    "Streamlit":             [3, 5, 5, 4, 3, 5],
    "FastAPI":               [3, 5, 4, 5, 3, 5],
    "spaCy":                 [3, 4, 4, 4, 3, 5],
    "KeyBERT":               [3, 3, 4, 4, 3, 5],
    "sentence-transformers": [4, 3, 4, 5, 3, 4],
    "transformers+PyTorch":  [5, 2, 3, 5, 3, 3],
    "Evidently":             [3, 5, 4, 4, 4, 5],
    "Prefect":               [3, 5, 4, 4, 3, 4],
}
df = pd.DataFrame(scores).T
df.columns = criteria
df["Total"] = df.sum(axis=1)  # unweighted sum of the six criteria
df.sort_values("Total", ascending=False)

| Tool | Perf | Latency/CPU | Simplicity | Ecosystem | Explainability | Cost | Total |
|---|---|---|---|---|---|---|---|
| scikit-learn | 4 | 5 | 5 | 5 | 4 | 5 | 28 |
| pandas | 2 | 3 | 5 | 5 | 5 | 5 | 25 |
| Streamlit | 3 | 5 | 5 | 4 | 3 | 5 | 25 |
| FastAPI | 3 | 5 | 4 | 5 | 3 | 5 | 25 |
| Evidently | 3 | 5 | 4 | 4 | 4 | 5 | 25 |
| spaCy | 3 | 4 | 4 | 4 | 3 | 5 | 23 |
| Prefect | 3 | 5 | 4 | 4 | 3 | 4 | 23 |
| sentence-transformers | 4 | 3 | 4 | 5 | 3 | 4 | 23 |
| KeyBERT | 3 | 3 | 4 | 4 | 3 | 5 | 22 |
| transformers+PyTorch | 5 | 2 | 3 | 5 | 3 | 3 | 21 |
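
The total above gives every criterion the same weight. Since the choices are driven by KPIs such as latency, a weighted score may rank the tools differently. The sketch below illustrates the idea on three rows of the grid; the criteria are named in English, and the `WEIGHTS` values are assumptions chosen for illustration, not figures from this study.

```python
# Hypothetical weights (assumption): performance and latency count double.
CRITERIA = ["Perf", "Latency/CPU", "Simplicity", "Ecosystem", "Explainability", "Cost"]
WEIGHTS = {"Perf": 2.0, "Latency/CPU": 2.0, "Simplicity": 1.0,
           "Ecosystem": 1.0, "Explainability": 1.5, "Cost": 1.0}

# Three rows copied from the criteria grid.
SCORES = {
    "scikit-learn":          [4, 5, 5, 5, 4, 5],
    "sentence-transformers": [4, 3, 4, 5, 3, 4],
    "transformers+PyTorch":  [5, 2, 3, 5, 3, 3],
}

def weighted_total(row, weights=WEIGHTS, criteria=CRITERIA):
    """Weighted average, expressed on the same 1-5 scale as the raw scores."""
    total_weight = sum(weights[c] for c in criteria)
    return sum(weights[c] * s for c, s in zip(criteria, row)) / total_weight

# Rank tools from best to worst weighted score.
ranking = sorted(SCORES, key=lambda tool: weighted_total(SCORES[tool]), reverse=True)
for tool in ranking:
    print(f"{tool:<24}{weighted_total(SCORES[tool]):.2f}")
```

Dividing by the sum of weights keeps the result comparable to the raw 1-5 scores, which makes it easier to see how much a given weighting shifts a tool.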


## 2) Environment toggles and paths

In [2]:

USE_GPU = True           # set to False if no CUDA GPU is available for HF fine-tuning
INSTALL_EXT = True       # set to False to skip installing the NLP/deep-learning extensions
PARQUET_NORM = "data/interim/amazon_electronics_normalized.parquet"
CSV_NORM     = "data/interim/amazon_electronics_normalized.csv"
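
A toggle like `USE_GPU` only expresses intent; the machine may still lack CUDA or even `torch`. One possible way to turn the toggle into an actual device string is sketched below. The helper name `resolve_device` is hypothetical and not part of the notebook.

```python
def resolve_device(use_gpu: bool) -> str:
    """Return "cuda" only if requested AND actually usable, otherwise "cpu"."""
    if not use_gpu:
        return "cpu"
    try:
        import torch  # optional dependency (extended requirements)
    except ImportError:
        return "cpu"  # torch is not installed: fall back silently
    return "cuda" if torch.cuda.is_available() else "cpu"

# Mirrors the USE_GPU toggle defined in the notebook cell.
DEVICE = resolve_device(use_gpu=True)
print("device:", DEVICE)
```

Downstream code can then pass `DEVICE` to model constructors instead of re-checking the toggle in several places.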

## 3) Write the environment files (requirements)

In [3]:

from pathlib import Path
Path("configs").mkdir(parents=True, exist_ok=True)

req_min = Path("requirements-min.txt")
req_ext = Path("requirements-extended.txt")

req_min.write_text("""pandas>=2.1
pyarrow>=15.0
scikit-learn>=1.4
joblib>=1.3
matplotlib>=3.8
pyyaml>=6.0
fastapi>=0.111
uvicorn[standard]>=0.30
streamlit>=1.35
evidently>=0.4
prefect>=3.0
""", encoding="utf-8")

req_ext.write_text("""spacy>=3.7
keybert>=0.7.0
sentence-transformers>=2.7.0
transformers>=4.42.0
torch>=2.2.0
""", encoding="utf-8")

print("Written:", req_min, "and", req_ext)

Written: requirements-min.txt and requirements-extended.txt


## 4) (Optional) Local installation

In [None]:

# Run on your own machine if needed (comment these lines out to skip)
%pip install -q -r requirements-min.txt
if INSTALL_EXT:
    %pip install -q -r requirements-extended.txt


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## 5) Smoke tests: imports/versions and GPU

In [7]:
# Cell 1: Python path and key versions
import sys, numpy as np, torch
print("Python exe:", sys.executable)
print("NumPy:", np.__version__)
print("Torch:", torch.__version__, "| CUDA build:", torch.version.cuda, "| CUDA is_available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Python exe: c:\Users\antoi\anaconda3\envs\monenv\python.exe
NumPy: 2.2.6
Torch: 2.2.2 | CUDA build: 12.1 | CUDA is_available: True
GPU: NVIDIA GeForce RTX 4060 Laptop GPU


In [8]:
import numpy as np, torch, spacy
import sentence_transformers as st_mod
import keybert as kb_mod

print("numpy:", np.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.cuda.is_available())
print("spaCy:", spacy.__version__)
print("sentence-transformers:", st_mod.__version__)
print("keybert:", kb_mod.__version__)

numpy: 2.2.6
torch: 2.2.2 | CUDA: True
spaCy: 3.8.2
sentence-transformers: 2.7.0
keybert: 0.8.5


In [9]:

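import importlib

def try_import(pkg):
    """Import pkg by name and print its version; return True on success."""
    try:
        mod = importlib.import_module(pkg)
        ver = getattr(mod, "__version__", "n/a")
        print(f"✓ {pkg} {ver}")
        return True
    except Exception as e:
        print(f"⚠️ {pkg} not available:", e)
        return False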

for p in ["pandas","sklearn","matplotlib","fastapi","streamlit","evidently","prefect"]:
    try_import(p)

# Extensions (optional)
# Only meaningful if the extended requirements were installed
ok_torch = try_import("torch")
if ok_torch:
    import torch
    print("CUDA available:", torch.cuda.is_available())
    for p in ["spacy","keybert","sentence_transformers","transformers"]:
        try_import(p)

✓ pandas 2.2.3
✓ sklearn 1.7.2
✓ matplotlib 3.10.0
✓ fastapi 0.115.12
✓ streamlit 1.48.1
✓ evidently 0.7.14
✓ prefect 3.4.17
✓ torch 2.2.2
CUDA available: True
✓ spacy 3.8.2
✓ keybert 0.8.5
✓ sentence_transformers 2.7.0
✓ transformers 4.56.1
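
The smoke tests print installed versions but do not compare them with the minimums declared in the requirements files. A minimal sketch of such a check, using only the standard library, could look like this. The helpers `parse_min`, `release_tuple`, `at_least` and `check` are hypothetical names, and the version comparison is deliberately naive: it handles numeric `>=` constraints only, not the full PEP 440 grammar.

```python
import re
from importlib import metadata

def parse_min(line):
    """Split 'name[extra]>=1.2' into ('name', (1, 2)); return None for other lines."""
    m = re.match(r"^\s*([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?\s*>=\s*([0-9][0-9.]*)", line)
    if not m:
        return None
    name, ver = m.groups()
    return name, tuple(int(p) for p in ver.split(".") if p)

def release_tuple(version):
    """Keep the leading numeric components only: '2.2.2+cu121' -> (2, 2, 2)."""
    m = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(p) for p in m.group(0).split(".")) if m else ()

def at_least(installed, minimum):
    """Compare version tuples after zero-padding, so (2, 2) satisfies (2, 2, 0)."""
    n = max(len(installed), len(minimum))
    pad = lambda t: t + (0,) * (n - len(t))
    return pad(installed) >= pad(minimum)

def check(lines):
    """Map each requirement name to 'missing' or '<version> (ok|below minimum)'."""
    report = {}
    for line in lines:
        parsed = parse_min(line)
        if parsed is None:
            continue  # blank lines, comments, unsupported specifiers
        name, minimum = parsed
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        status = "ok" if at_least(release_tuple(installed), minimum) else "below minimum"
        report[name] = f"{installed} ({status})"
    return report

# Sample lines taken from the requirements files written in section 3.
for name, status in check(["pandas>=2.1", "uvicorn[standard]>=0.30", "torch>=2.2.0"]).items():
    print(f"{name}: {status}")
```

To check a whole file, one could pass `Path("requirements-min.txt").read_text().splitlines()` to `check`.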


## 6) Mini bench (optional): TF‑IDF + LogReg to validate MVP latency

In [11]:
# === Bench TF-IDF + LogReg ===
RUN_BENCH = True  # set to False to disable

if RUN_BENCH:
    import time
    from pathlib import Path
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import average_precision_score

    # ---- 1) Load the normalized dataset (parquet or csv)
    if Path(PARQUET_NORM).exists():
        df = pd.read_parquet(PARQUET_NORM)
        print("✓ Loaded from:", PARQUET_NORM)
    elif Path(CSV_NORM).exists():
        df = pd.read_csv(CSV_NORM)
        print("✓ Loaded from:", CSV_NORM)
    else:
        raise FileNotFoundError("Run C5.1.1 to generate the normalized dataset.")

    df["review_body"] = df["review_body"].astype(str)
    # Positive class = 4 or 5 stars; unparseable ratings become NaN and fall into class 0
    y = (pd.to_numeric(df["star_rating"], errors="coerce") >= 4).astype(int)

    # ---- 2) TF-IDF vectorization (word n-grams)
    # Note: the vectorizer is fitted on the full dataset before the split. That is fine
    # for a latency check, but it can make AP optimistic (vocabulary/IDF leakage).
    vec = TfidfVectorizer(stop_words="english",
                          max_features=50_000,
                          ngram_range=(1, 2),
                          min_df=3)
    X = vec.fit_transform(df["review_body"])

    # Keep the raw text as well, for the end-to-end latency measurement
    Xtr, Xte, ytr, yte, tr_txt, te_txt = train_test_split(
        X, y, df["review_body"], test_size=0.2, stratify=y, random_state=42
    )

    # ---- 3) Training
    t0 = time.perf_counter()
    clf = LogisticRegression(max_iter=2000, class_weight="balanced").fit(Xtr, ytr)
    fit_ms = (time.perf_counter() - t0) * 1000

    # ---- 4a) "Engine" latency (prediction on an already vectorized batch)
    n = min(Xte.shape[0], 100)
    sample = Xte[:n]

    # warm-up (avoids paying the cost of the first call)
    _ = clf.predict_proba(sample[:1])

    t1 = time.perf_counter()
    _ = clf.predict_proba(sample)
    infer_ms_batch = (time.perf_counter() - t1) * 1000 / sample.shape[0]  # shape[0]: len() raises TypeError on sparse matrices

    # ---- 4b) End-to-end latency (vectorization + prediction, one text at a time)
    texts = te_txt.iloc[:n].tolist()

    # end-to-end warm-up
    _ = clf.predict_proba(vec.transform([texts[0]]))

    t2 = time.perf_counter()
    for t in texts:
        _ = clf.predict_proba(vec.transform([t]))
    infer_ms_e2e = (time.perf_counter() - t2) * 1000 / n

    # ---- 5) Quality
    ap = average_precision_score(yte, clf.predict_proba(Xte)[:, 1])

    print(
        f"AP={ap:.3f} | fit time ~{fit_ms:.0f} ms | "
        f"latency/text (batch, pre-vectorized) ~{infer_ms_batch:.1f} ms | "
        f"latency/text (end-to-end) ~{infer_ms_e2e:.1f} ms  (target < 200 ms)"
    )
else:
    print("Bench disabled (RUN_BENCH=False).")

✓ Loaded from: data/interim/amazon_electronics_normalized.parquet
AP=0.982 | fit time ~5280 ms | latency/text (batch, pre-vectorized) ~0.0 ms | latency/text (end-to-end) ~0.4 ms  (target < 200 ms)
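
The bench reports a mean latency, which can hide slow outliers. For a target such as "< 200 ms", tail percentiles are often more informative. The sketch below shows one way to collect per-call timings and summarize them; `latency_percentiles` is a hypothetical helper, and the workload passed to it here is a stand-in so the example runs on its own.

```python
import statistics
import time

def latency_percentiles(predict_one, inputs, warmup=3):
    """Time predict_one(x) for each input; return p50/p95/max in milliseconds."""
    for x in inputs[:warmup]:
        predict_one(x)  # warm-up calls are not measured
    samples = []
    for x in inputs:
        t0 = time.perf_counter()
        predict_one(x)
        samples.append((time.perf_counter() - t0) * 1000)
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": statistics.median(samples), "p95": cuts[94], "max": max(samples)}

# Stand-in workload (assumption). With the bench objects one would instead pass
# lambda t: clf.predict_proba(vec.transform([t])) together with the `texts` list.
stats = latency_percentiles(lambda s: sorted(s), ["some review text"] * 200)
print({k: round(v, 4) for k, v in stats.items()})
```

Comparing `p95` rather than the mean against the 200 ms target gives a more conservative reading of the KPI.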


## 7) Contracts: generate the API and App skeletons (compatible with the joblib artifact)

In [12]:

from pathlib import Path
Path("api").mkdir(exist_ok=True, parents=True)
Path("app").mkdir(exist_ok=True, parents=True)

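api_code = '''from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from joblib import load
from pydantic import BaseModel

ART_PATH = "models/sentiment_v1.joblib"
_art = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the artifact once at startup; the API still starts if it is missing.
    global _art
    try:
        _art = load(ART_PATH)
    except Exception as e:
        _art = None
        print("WARN: artifact not loaded:", e)
    yield

# lifespan replaces the deprecated @app.on_event("startup") hook
app = FastAPI(title="Amazon Reviews Insights API", lifespan=lifespan)

class InText(BaseModel):
    text: str

@app.get("/health")
def health():
    return {"ok": _art is not None}

@app.post("/predict")
def predict(inp: InText):
    if _art is None:
        # Explicit 503 instead of assert: asserts are stripped under -O and surface as 500s
        raise HTTPException(status_code=503, detail="Artifact not loaded")
    vec, clf = _art["vectorizer"], _art["model"]
    X = vec.transform([inp.text])
    proba = float(clf.predict_proba(X)[0, 1])
    return {"proba_pos": proba, "label": int(proba >= 0.5)}
'''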
Path("api/main.py").write_text(api_code, encoding="utf-8")

st_code = '''import streamlit as st
from joblib import load

st.set_page_config(page_title="Amazon Reviews Demo", page_icon="💬", layout="wide")
st.title("💬 Amazon Reviews Demo (MVP)")

@st.cache_resource
def artefact():
    # Cached so the artifact is loaded once per server process, not on every rerun
    return load("models/sentiment_v1.joblib")

try:
    art = artefact()
    st.success("Artifact loaded.")
except Exception as e:
    st.error(f"Could not load the artifact: {e}")
    st.stop()

txt = st.text_area("Paste a review:", "Great battery life but the screen is dim.", height=160)
if st.button("Analyze"):
    vec, clf = art["vectorizer"], art["model"]
    proba = float(clf.predict_proba(vec.transform([txt]))[0, 1])
    st.metric("Positive probability", f"{proba:.3f}")
    st.write("Label:", "**Positive**" if proba >= 0.5 else "**Negative**")
'''
Path("app/streamlit_app.py").write_text(st_code, encoding="utf-8")

print("Written → api/main.py and app/streamlit_app.py")

Written → api/main.py and app/streamlit_app.py
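
Both skeletons expect `models/sentiment_v1.joblib` to hold a dict with the keys `"vectorizer"` and `"model"`, but this notebook does not create that file. The sketch below shows one way such an artifact could be produced and reloaded. The toy corpus is placeholder data rather than the Amazon dataset, and the hyperparameters are arbitrary.

```python
from pathlib import Path

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (assumption): placeholder data for the sketch only.
texts = ["great battery life", "love this screen", "works perfectly",
         "terrible quality", "stopped working", "waste of money"]
labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# The keys must match what api/main.py and app/streamlit_app.py read.
art_path = Path("models/sentiment_v1.joblib")
art_path.parent.mkdir(parents=True, exist_ok=True)
dump({"vectorizer": vec, "model": clf}, art_path)

# Round trip: reload and score one text the way the /predict endpoint does.
art = load(art_path)
proba = float(art["model"].predict_proba(art["vectorizer"].transform(["great screen"]))[0, 1])
print(f"proba_pos={proba:.3f}")
```

Storing the fitted vectorizer next to the model keeps preprocessing and prediction in sync: the service never has to refit or guess the vocabulary.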



## ✅ Competency recap and why it is validated
**Competency:** *C5.1.3: Technology/tool choices*

- **Criteria grid** and reasoned comparison (MVP vs extensions).
- **Environment files** generated (`requirements-min.txt`, `requirements-extended.txt`).
- **Smoke tests** (imports/versions, optional GPU) and a mini latency **bench** of the MVP.
- Technical **contracts** (FastAPI API + Streamlit app) compatible with the standard artifact `models/sentiment_v1.joblib`.
- **Traceability/reproducibility** supported by the requirements files and the generated `api/` and `app/` folders.