# Step 2: Cleaning Source 2 (Manga Sanctuary Reviews)

Goal (C3): produce a **joinable**, **RAG-friendly** dataset from `manga_sanctuary_reviews*.jsonl`.

Outputs (in `out_ms_staging/`):
- `ms_reviews_clean.csv` + `ms_reviews_clean.parquet`
- `ms_reviews_rag_ready.csv` + `ms_reviews_rag_ready.parquet` (only reviews with usable text)
- `ms_reviews_rejected.jsonl` (audit: invalid / corrupted lines)
- `ms_reviews_stats.json` (KPIs + evidence)

C3 steps:
1) robust reading + `__line__` + rejects
2) type normalization (id, number, url, score)
3) text cleaning (`review_title`, `review_body`, etc.)
4) French date parsing → `review_date_iso`
5) dedup (key: `review_url`)
6) RAG preparation (`rag_text`, `rag_len`, `rag_ready` filter)



In [38]:
import json
import re
import sys
from pathlib import Path
import pandas as pd



## 0) Parquet dependency (systematic Parquet + CSV export)

This notebook **always** exports to **CSV** and to **Parquet**.
Parquet requires `pyarrow` or `fastparquet`. We try to install `pyarrow` if it is missing.
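
Before triggering an install, it can help to see which Parquet engine is already importable. A minimal sketch using only the standard library; the helper name `available_parquet_engines` is hypothetical and not part of the notebook:

```python
from importlib.util import find_spec

def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the candidate engines that can be imported, without importing them."""
    return [name for name in candidates if find_spec(name) is not None]

engines = available_parquet_engines()
print("parquet engines found:", engines or "none")
```

If the list is empty, the install attempt in the next cell is what makes the Parquet export possible.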



In [39]:
def ensure_pyarrow_or_fail():
    """Return True if pyarrow is importable, installing it with pip if needed."""
    try:
        import pyarrow  # noqa: F401
        return True
    except Exception:
        pass

    try:
        import importlib
        import subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", "pyarrow"])
        importlib.invalidate_caches()  # make the freshly installed package visible
        import pyarrow  # noqa: F401
        return True
    except Exception as e:
        raise RuntimeError(
            "Parquet is required but pyarrow is not available. "
            "Install it manually: pip install pyarrow. "
            f"Details: {repr(e)}"
        ) from e

PARQUET_READY = ensure_pyarrow_or_fail()
print("PARQUET_READY =", PARQUET_READY)



PARQUET_READY = True


## 1) Paths (input JSONL + outputs)



In [40]:
PROJECT_ROOT = Path.cwd()
if not (PROJECT_ROOT / "data").exists() and (PROJECT_ROOT.parent / "data").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

def find_raw_jsonl(project_root: Path) -> Path:
    candidates = [
        project_root / "data" / "manga_sanctuary_reviews (1).jsonl",
        project_root / "data" / "manga_sanctuary_reviews.jsonl",
        project_root / "data" / "manga_sanctuary_reviews(1).jsonl",
        Path("/mnt/data/manga_sanctuary_reviews (1).jsonl"),
        Path("/mnt/data/manga_sanctuary_reviews.jsonl"),
        Path("/mnt/data/manga_sanctuary_reviews(1).jsonl"),
    ]
    for p in candidates:
        if p.exists():
            return p
    data_dir = project_root / "data"
    if data_dir.exists():
        # fallback: first match in sorted order, so the choice is deterministic
        matches = sorted(data_dir.glob("*reviews*.jsonl"))
        if matches:
            return matches[0]
    raise FileNotFoundError("Could not find manga_sanctuary_reviews*.jsonl (searched data/ and /mnt/data).")

RAW_PATH = find_raw_jsonl(PROJECT_ROOT)
print("RAW_PATH =", RAW_PATH)

OUT_DIR = PROJECT_ROOT / "out_ms_staging"
OUT_DIR.mkdir(exist_ok=True, parents=True)

REVIEWS_CSV   = OUT_DIR / "ms_reviews_clean.csv"
REVIEWS_PARQ  = OUT_DIR / "ms_reviews_clean.parquet"
RAG_CSV       = OUT_DIR / "ms_reviews_rag_ready.csv"
RAG_PARQ      = OUT_DIR / "ms_reviews_rag_ready.parquet"
REJECTED_PATH = OUT_DIR / "ms_reviews_rejected.jsonl"
STATS_PATH    = OUT_DIR / "ms_reviews_stats.json"



RAW_PATH = /home/maxime/python/certification/preparation_bdd/data/manga_sanctuary_reviews (1).jsonl


## 2) Robust JSONL reading + rejects (C3 audit)



In [41]:
def read_jsonl_robust(path: Path):
    records = []
    rejects = []
    with path.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            raw = line.rstrip("\n")
            if not raw.strip():
                rejects.append({"__line__": i, "__reason__": "empty_line", "__raw__": ""})
                continue
            try:
                obj = json.loads(raw)
                if not isinstance(obj, dict):
                    rejects.append({"__line__": i, "__reason__": "not_object", "__raw__": raw[:500]})
                    continue
                obj["__line__"] = i
                records.append(obj)
            except Exception as e:
                rejects.append({"__line__": i, "__reason__": "invalid_json", "__error__": repr(e), "__raw__": raw[:500]})
    return records, rejects

records, rejects = read_jsonl_robust(RAW_PATH)
print("raw_valid_rows =", len(records))
print("raw_rejected_rows =", len(rejects))

# write the rejects immediately (C3 evidence)
with REJECTED_PATH.open("w", encoding="utf-8") as f:
    for r in rejects:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

df = pd.DataFrame.from_records(records)
print("df shape =", df.shape)
df.head(3)



raw_valid_rows = 6749
raw_rejected_rows = 0
df shape = (6749, 13)


|   | series_id | series_title | series_url | volume_number | volume_url | review_url | review_title | review_score | review_author | review_date | review_type | review_body | `__line__` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70676 | Le quotidien d'une épée maudite | https://www.manga-sanctuary.com/bdd/manga/7067... | 1.0 | https://www.manga-sanctuary.com/manga-le-quoti... | https://www.manga-sanctuary.com/fiche_serie_cr... | Critique Manga Le quotidien d'une épée maudite #1 | 7.0 | Tampopo24 | mer. 10 mai 2023 | Staff | Les titres humoristiques, avec moi, ça passe o... | 1 |
| 1 | 71144 | Villageois LVL 999 | https://www.manga-sanctuary.com/bdd/manga/7114... | 999.0 | https://www.manga-sanctuary.com/manga-villageo... | https://www.manga-sanctuary.com/fiche_serie_cr... | Critique Manga Villageois LVL 999 #1 | 7.0 | MassLunar | mer. 24 janv. 2024 | Staff | Dans un monde de fantasy au passé lointain cur... | 2 |
| 2 | 64974 | Lupin III anthology | https://www.manga-sanctuary.com/bdd/manga/6497... |  | https://www.manga-sanctuary.com/manga-lupin-ii... | https://www.manga-sanctuary.com/fiche_serie_cr... | Critique Manga Lupin III anthology | 7.0 | Tampopo24 | lun. 18 oct. 2021 | Staff | ﻿﻿﻿Lupin III est un personnage culte au Japon ... | 3 |
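
On this file no line was rejected, so the reject branches are never exercised. The following sketch applies the same per-line logic to an in-memory sample containing one blank line, one truncated line, and one non-object line. The function `parse_jsonl_lines` and the sample data are illustrative and not part of the notebook:

```python
import json

def parse_jsonl_lines(lines):
    """Split raw JSONL lines into valid dict records and audit rejects."""
    records, rejects = [], []
    for i, line in enumerate(lines, start=1):
        raw = line.rstrip("\n")
        if not raw.strip():
            rejects.append({"__line__": i, "__reason__": "empty_line"})
            continue
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as e:
            rejects.append({"__line__": i, "__reason__": "invalid_json", "__error__": repr(e)})
            continue
        if not isinstance(obj, dict):
            rejects.append({"__line__": i, "__reason__": "not_object"})
            continue
        obj["__line__"] = i  # keep the source line number for traceability
        records.append(obj)
    return records, rejects

sample = [
    '{"series_id": 1, "review_url": "https://example.org/r/1"}\n',
    "\n",
    '{"series_id": 2, "review_url": \n',
    "[1, 2, 3]\n",
]
sample_records, sample_rejects = parse_jsonl_lines(sample)
print(len(sample_records), [r["__reason__"] for r in sample_rejects])
# prints: 1 ['empty_line', 'invalid_json', 'not_object']
```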


## 3) Required-field check + dedup on `review_url`



In [42]:
# minimal required fields
REQUIRED_ALWAYS = ["series_id", "review_url"]

def non_empty(col: str) -> pd.Series:
    """True where `col` exists and holds a non-blank value."""
    if col not in df.columns:
        return pd.Series(False, index=df.index)
    return df[col].notna() & (df[col].astype(str).str.strip() != "")

# content: at least a title OR a body
HAS_CONTENT = non_empty("review_title") | non_empty("review_body")

required_mask = pd.Series(True, index=df.index)
for c in REQUIRED_ALWAYS:
    required_mask &= non_empty(c)

required_mask &= HAS_CONTENT

df_bad = df[~required_mask].copy()
df_ok  = df[required_mask].copy()

print("kept_after_required_fields =", len(df_ok))
print("rejected_missing_required =", len(df_bad))

# append these rejects to the audit file
if len(df_bad):
    with REJECTED_PATH.open("a", encoding="utf-8") as f:
        for rec in df_bad[["__line__"] + [c for c in REQUIRED_ALWAYS if c in df_bad.columns]].to_dict("records"):
            rec["__reason__"] = "missing_required_fields"
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# dedup on review_url (keep the first occurrence)
before = len(df_ok)
df_ok = df_ok.drop_duplicates(subset=["review_url"], keep="first").copy()
print("dedup_review_url_removed =", before - len(df_ok))



kept_after_required_fields = 6749
rejected_missing_required = 0
dedup_review_url_removed = 0
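
`drop_duplicates(keep="first")` silently discards the later rows. When the removed count is not zero, listing every member of each duplicate group first makes the choice auditable. A sketch on toy data (the URLs and scores are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "review_url": ["u1", "u2", "u1", "u3"],
    "review_score": [7.0, 8.0, 6.5, 9.0],
    "__line__": [1, 2, 3, 4],
})

# keep=False flags every row of a duplicate group, not only the later ones
dup_mask = toy.duplicated(subset=["review_url"], keep=False)
dup_groups = toy[dup_mask].sort_values(["review_url", "__line__"])
print(dup_groups)

deduped = toy.drop_duplicates(subset=["review_url"], keep="first")
print("removed =", len(toy) - len(deduped))
```

Here the two `u1` rows (source lines 1 and 3) are listed side by side before line 3 is dropped.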


## 4) Text cleaning + type normalization



In [43]:
def clean_text(x):
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return None
    s = str(x).replace("\u00a0", " ").strip()
    s = re.sub(r"\s+", " ", s)
    if s.lower() in {"na", "n/a", "none", "null", ""}:
        return None
    return s

TEXT_COLS = [
    "series_title","series_url",
    "volume_url","review_url",
    "review_title","review_body","review_author","review_type",
    "review_date",  # raw (often in French)
]
for c in TEXT_COLS:
    if c in df_ok.columns:
        df_ok[c] = df_ok[c].map(clean_text)

# types
def to_int(s):
    return pd.to_numeric(s, errors="coerce").astype("Int64")

def to_float(s):
    return pd.to_numeric(s, errors="coerce").astype("Float64")

if "series_id" in df_ok.columns:
    df_ok["series_id"] = to_int(df_ok["series_id"])
if "volume_number" in df_ok.columns:
    df_ok["volume_number"] = to_int(df_ok["volume_number"])
if "review_score" in df_ok.columns:
    df_ok["review_score"] = to_float(df_ok["review_score"])

df_ok[["series_id","volume_number","review_score"]].dtypes



series_id          Int64
volume_number      Int64
review_score     Float64
dtype: object
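
The `review_body` preview in section 2 shows invisible characters in front of "Lupin III" (apparently U+FEFF). Python's `\s` and `str.strip()` do not treat zero-width characters as whitespace, so `clean_text` leaves them in place. A possible variant, assuming these characters carry no meaning in the reviews; `clean_text_zw` and its character list are a suggestion, not the notebook's function:

```python
import re
import pandas as pd

# zero-width space, word joiner, BOM / zero-width no-break space
ZERO_WIDTH = re.compile("[\u200b\u2060\ufeff]")

def clean_text_zw(x):
    """Like clean_text, but also removes zero-width characters."""
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return None
    s = ZERO_WIDTH.sub("", str(x)).replace("\u00a0", " ")
    s = re.sub(r"\s+", " ", s).strip()
    if s.lower() in {"na", "n/a", "none", "null", ""}:
        return None
    return s

print(repr(clean_text_zw("\ufeff\ufeff\ufeffLupin III  est un personnage")))
# prints: 'Lupin III est un personnage'
```

Removing these characters also keeps them out of `rag_len`, which counts characters.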

## 5) French date parsing → `review_date_iso` (fixes the `.dt` error)

We build `review_date_parsed` as `datetime64[ns]`, so the subsequent `.dt.strftime()` call is valid.
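
The `.dt` accessor only exists on datetime-like Series. A short illustration with made-up values of why the explicit `pd.to_datetime(..., errors="coerce")` step matters:

```python
import pandas as pd

raw = pd.Series(["2023-05-10", None, "2021-10-18"])  # strings, not datetimes

try:
    raw.dt.strftime("%Y-%m-%d")
except AttributeError as e:
    print("string column:", e)

parsed = pd.to_datetime(raw, errors="coerce")  # datetime dtype, missing values become NaT
iso = parsed.dt.strftime("%Y-%m-%d")           # NaT rows stay missing in the result
print(parsed.dtype, iso.tolist())
```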



In [44]:
import re
import pandas as pd

FR_MONTHS = {
    "janvier": 1, "janv": 1, "janv.": 1,
    "février": 2, "fevrier": 2, "févr": 2, "fevr": 2, "févr.": 2, "fevr.": 2,
    "mars": 3,
    "avril": 4,
    "mai": 5,
    "juin": 6,
    "juillet": 7, "juil": 7, "juil.": 7,
    "août": 8, "aout": 8,
    "septembre": 9, "sept": 9, "sept.": 9,
    "octobre": 10, "oct": 10, "oct.": 10,
    "novembre": 11, "nov": 11, "nov.": 11,
    "décembre": 12, "decembre": 12, "déc": 12, "dec": 12, "déc.": 12, "dec.": 12,
}

WEEKDAY_PREFIX = re.compile(r"^(lun|mar|mer|jeu|ven|sam|dim)\.?\s+", re.IGNORECASE)

def parse_fr_date_one(s):
    if s is None:
        return pd.NaT
    s = str(s).strip()
    if not s:
        return pd.NaT

    # strip "mer. ", "lun. ", etc.
    s = WEEKDAY_PREFIX.sub("", s)

    # "1er" -> "1"
    s2 = s.lower().replace("1er", "1").strip()

    try:
        # dd/mm/yyyy or dd-mm-yyyy
        m = re.match(r"^(\d{1,2})[\/-](\d{1,2})[\/-](\d{2,4})$", s2)
        if m:
            d, mo, y = int(m.group(1)), int(m.group(2)), int(m.group(3))
            if y < 100:
                y += 2000
            return pd.Timestamp(year=y, month=mo, day=d)

        # "10 mai 2023" / "24 janv. 2024" / "18 oct. 2021"
        m2 = re.match(r"^(\d{1,2})\s+([a-zéèêëàâîïôöùûüç\.]+)\s+(\d{4})$", s2)
        if m2:
            d = int(m2.group(1))
            mon = m2.group(2).strip(".")
            y = int(m2.group(3))
            mo = FR_MONTHS.get(mon)
            if mo:
                return pd.Timestamp(year=y, month=mo, day=d)
    except ValueError:
        # impossible calendar date (e.g. 31/02/2023): treat it as unparsed
        return pd.NaT

    # fallback (just in case)
    return pd.to_datetime(s, errors="coerce", dayfirst=True)

df_ok["review_date_parsed"] = pd.to_datetime(df_ok["review_date"].map(parse_fr_date_one), errors="coerce")
df_ok["review_date_iso"] = df_ok["review_date_parsed"].dt.strftime("%Y-%m-%d")
df_ok["review_date_parse_ok"] = df_ok["review_date_parsed"].notna()

df_ok[["review_date","review_date_iso","review_date_parse_ok"]].head(10)


|   | review_date | review_date_iso | review_date_parse_ok |
|---|---|---|---|
| 0 | mer. 10 mai 2023 | 2023-05-10 | True |
| 1 | mer. 24 janv. 2024 | 2024-01-24 | True |
| 2 | lun. 18 oct. 2021 | 2021-10-18 | True |
| 3 | jeu. 26 mars 2009 | 2009-03-26 | True |
| 4 | dim. 10 déc. 2017 | 2017-12-10 | True |
| 5 | lun. 16 oct. 2017 | 2017-10-16 | True |
| 6 | mar. 14 déc. 2021 | 2021-12-14 | True |
| 7 | jeu. 16 oct. 2008 | 2008-10-16 | True |
| 8 | mer. 15 oct. 2008 | 2008-10-15 | True |
| 9 | mer. 15 oct. 2008 | 2008-10-15 | True |


## 6) RAG preparation: `rag_text` + `rag_len` + `rag_ready` filter



In [45]:
# text for embeddings / retrieval
def text_col(name: str) -> pd.Series:
    """Stripped text column, or empty strings if the column is missing."""
    if name not in df_ok.columns:
        return pd.Series("", index=df_ok.index)
    return df_ok[name].fillna("").astype(str).str.strip()

title = text_col("review_title")
body = text_col("review_body")

# the outer strip drops the "\n\n" separator when the title or the body is empty,
# so a row with an empty title falls back to the body alone
df_ok["rag_text"] = (title + "\n\n" + body).str.strip()

# length + filter
df_ok["rag_len"] = df_ok["rag_text"].str.len()
MIN_RAG_CHARS = 120  # adjust as needed
df_ok["rag_ready"] = df_ok["rag_len"] >= MIN_RAG_CHARS

reviews_rag = df_ok[df_ok["rag_ready"]].copy()
print("rag_ready_rows =", len(reviews_rag))
reviews_rag[["review_url","rag_len","review_score","review_date_iso"]].head(5)



rag_ready_rows = 3187


|   | review_url | rag_len | review_score | review_date_iso |
|---|---|---|---|---|
| 0 | https://www.manga-sanctuary.com/fiche_serie_cr... | 2618 | 7.0 | 2023-05-10 |
| 1 | https://www.manga-sanctuary.com/fiche_serie_cr... | 2608 | 7.0 | 2024-01-24 |
| 2 | https://www.manga-sanctuary.com/fiche_serie_cr... | 4385 | 7.0 | 2021-10-18 |
| 4 | https://www.manga-sanctuary.com/fiche_serie_cr... | 2608 | 7.0 | 2017-12-10 |
| 5 | https://www.manga-sanctuary.com/fiche_serie_cr... | 3792 | 7.0 | 2017-10-16 |
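
`MIN_RAG_CHARS = 120` is a tunable cut-off. One way to see how sensitive the `rag_ready` share is to that choice is to recompute it for a few thresholds. A sketch on made-up lengths; in the notebook the input would be `df_ok["rag_len"]`, and `rag_ready_share` is a hypothetical helper:

```python
import pandas as pd

def rag_ready_share(rag_len: pd.Series, thresholds=(50, 120, 300, 1000)) -> pd.Series:
    """Percentage of rows whose text length reaches each threshold."""
    shares = {t: round(float((rag_len >= t).mean()) * 100, 2) for t in thresholds}
    return pd.Series(shares, name="rag_ready_%")

toy_len = pd.Series([35, 48, 119, 120, 2618, 4385, 0, 52])
shares = rag_ready_share(toy_len)
print(shares)
```

On this toy input, 37.5% of the rows pass the 120-character threshold and 62.5% pass a 50-character one.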


### Converting `volume_number` to a nullable integer

In [46]:
import pandas as pd

# conversion only (no export here)
df_ok["volume_number"] = pd.to_numeric(df_ok["volume_number"], errors="coerce").astype("Int64")

print("dtype volume_number (after conversion) =", df_ok["volume_number"].dtype)
df_ok["volume_number"].head(10)


dtype volume_number (after conversion) = Int64


0       1
1     999
2    <NA>
3    <NA>
4       5
5       4
6       2
7       4
8       3
9       2
Name: volume_number, dtype: Int64
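
`astype("Int64")` raises a `TypeError` as soon as one value has a fractional part (for example a volume numbered `1.5`), because the cast would lose information. A sketch of a more defensive conversion that turns such values into `<NA>` instead. Whether half-volumes exist in this source is not known, and `to_nullable_int` is a hypothetical helper:

```python
import pandas as pd

def to_nullable_int(s: pd.Series) -> pd.Series:
    """Coerce to numeric, null out non-integer values, then cast to Int64."""
    num = pd.to_numeric(s, errors="coerce")
    num = num.where(num.isna() | (num % 1 == 0))  # fractional values become NaN
    return num.astype("Int64")

toy = pd.Series([1.0, "999", None, 1.5, "n/a"])
converted = to_nullable_int(toy)
print(converted.tolist())
```

The trade-off is that a fractional volume number is dropped silently, so counting the values nulled by this step would be a useful KPI to add.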

## 7) Systematic CSV + Parquet export (and stats)



In [47]:
# enforce the dtype right before the final export
df_ok["volume_number"] = pd.to_numeric(df_ok["volume_number"], errors="coerce").astype("Int64")


# CSV (always)
df_ok.to_csv(REVIEWS_CSV, index=False)
reviews_rag.to_csv(RAG_CSV, index=False)

# Parquet (required: must succeed, otherwise we stop)
df_ok.to_parquet(REVIEWS_PARQ, index=False)
reviews_rag.to_parquet(RAG_PARQ, index=False)

# stats/KPI (C3 evidence)
stats = {
    "raw_valid_rows": int(len(records)),
    "raw_rejected_rows": int(len(rejects)),
    "kept_after_required_fields": int(before),  # row count before dedup; len(df_ok) is post-dedup
    "rejected_missing_required": int(len(df_bad)),
    "dedup_review_url_removed": int(before - len(df_ok)),
    "review_score_non_null_%": float(round(df_ok.get("review_score").notna().mean() * 100, 2)) if "review_score" in df_ok.columns else None,
    "review_date_parse_ok_%": float(round(df_ok["review_date_parse_ok"].mean() * 100, 2)),
    "volume_url_non_null_%": float(round(df_ok.get("volume_url").notna().mean() * 100, 2)) if "volume_url" in df_ok.columns else None,
    "rag_ready_%": float(round(df_ok["rag_ready"].mean() * 100, 2)),
    "min_rag_chars": int(MIN_RAG_CHARS),
    "exports": {
        "ms_reviews_clean_csv": str(REVIEWS_CSV),
        "ms_reviews_clean_parquet": str(REVIEWS_PARQ),
        "ms_reviews_rag_ready_csv": str(RAG_CSV),
        "ms_reviews_rag_ready_parquet": str(RAG_PARQ),
        "rejected_jsonl": str(REJECTED_PATH),
    },
}

STATS_PATH.write_text(json.dumps(stats, ensure_ascii=False, indent=2), encoding="utf-8")
stats



{'raw_valid_rows': 6749,
 'raw_rejected_rows': 0,
 'kept_after_required_fields': 6749,
 'rejected_missing_required': 0,
 'dedup_review_url_removed': 0,
 'review_score_non_null_%': 98.56,
 'review_date_parse_ok_%': 69.79,
 'volume_url_non_null_%': 100.0,
 'rag_ready_%': 47.22,
 'min_rag_chars': 120,
 'exports': {'ms_reviews_clean_csv': '/home/maxime/python/certification/preparation_bdd/out_ms_staging/ms_reviews_clean.csv',
  'ms_reviews_clean_parquet': '/home/maxime/python/certification/preparation_bdd/out_ms_staging/ms_reviews_clean.parquet',
  'ms_reviews_rag_ready_csv': '/home/maxime/python/certification/preparation_bdd/out_ms_staging/ms_reviews_rag_ready.csv',
  'ms_reviews_rag_ready_parquet': '/home/maxime/python/certification/preparation_bdd/out_ms_staging/ms_reviews_rag_ready.parquet',
  'rejected_jsonl': '/home/maxime/python/certification/preparation_bdd/out_ms_staging/ms_reviews_rejected.jsonl'}}
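
`review_date_parse_ok_%` is 69.79, so roughly three dates in ten are not parsed, and the notebook does not show which raw formats fail. One way to find out is to reduce each unparsed value to a coarse shape (digits to `9`, letters to `a`) and count the shapes. A sketch on made-up values: the failing formats below are invented for illustration and `date_shape` is a hypothetical helper.

```python
import re
import pandas as pd

def date_shape(s):
    """Coarse pattern of a raw date string: digits -> 9, letters -> a."""
    if s is None or (isinstance(s, float) and pd.isna(s)):
        return "<missing>"
    shape = re.sub(r"\d", "9", str(s))
    return re.sub(r"[^\W\d_]", "a", shape)  # any Unicode letter, accents included

toy = pd.DataFrame({
    "review_date": ["mer. 10 mai 2023", "Il y a 3 jours", None, "hier", "Il y a 5 jours"],
    "review_date_parse_ok": [True, False, False, False, False],
})
failed = toy.loc[~toy["review_date_parse_ok"], "review_date"]
shape_counts = failed.map(date_shape).value_counts()
print(shape_counts)
```

In the notebook, the same two lines applied to `df_ok` would show whether the failures are missing dates or a format that `parse_fr_date_one` does not cover.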

## 8) Final check

Key columns expected for the C3 aggregation step:
- `volume_url` (primary join key) + fallback on `series_id` & `volume_number`
- `review_score`, `review_date_iso`
- `rag_text` (for vector indexing)
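
The join strategy listed above (primary key `volume_url`, fallback on `series_id` + `volume_number`) could look like the following. This is a sketch under assumptions: the `volumes` table, its `volume_id` column, and the helper `attach_volume_id` are hypothetical, since the target table is not shown in this notebook.

```python
import pandas as pd

def attach_volume_id(reviews: pd.DataFrame, volumes: pd.DataFrame) -> pd.DataFrame:
    """Match reviews to volumes on volume_url, then on (series_id, volume_number)."""
    # unique, non-null keys on the right side keep both merges one-to-one per review
    by_url = volumes.dropna(subset=["volume_url"]).drop_duplicates("volume_url")
    out = reviews.merge(by_url[["volume_url", "volume_id"]], on="volume_url", how="left")

    pair = ["series_id", "volume_number"]
    by_pair = volumes.dropna(subset=pair).drop_duplicates(pair)
    fallback = reviews.merge(by_pair[pair + ["volume_id"]], on=pair, how="left")["volume_id"]

    out["join_method"] = "none"
    out.loc[out["volume_id"].notna(), "join_method"] = "volume_url"
    use_fallback = out["volume_id"].isna() & fallback.notna()
    out.loc[use_fallback, "volume_id"] = fallback[use_fallback]
    out.loc[use_fallback, "join_method"] = "series_id+volume_number"
    return out

# toy data, invented for the example
volumes = pd.DataFrame({
    "volume_id": [10, 11, 12],
    "volume_url": ["v/a-1", "v/a-2", None],
    "series_id": [1, 1, 2],
    "volume_number": [1, 2, 1],
})
reviews = pd.DataFrame({
    "review_url": ["r1", "r2", "r3", "r4"],
    "volume_url": ["v/a-1", "v/old-a-2", "v/b-1", "v/c-9"],
    "series_id": [1, 1, 2, 3],
    "volume_number": [1, 2, 1, 9],
})
joined = attach_volume_id(reviews, volumes)
print(joined[["review_url", "volume_id", "join_method"]])
```

Recording `join_method` per row makes it possible to report how many reviews needed the fallback and how many stayed unmatched.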



In [48]:
expected = ["series_id","volume_url","volume_number","review_url","review_score","review_date_iso","rag_text"]
missing = [c for c in expected if c not in df_ok.columns]
print("missing_expected_columns =", missing)

df_ok[expected].head(3) if not missing else df_ok.head(3)



missing_expected_columns = []


|   | series_id | volume_url | volume_number | review_url | review_score | review_date_iso | rag_text |
|---|---|---|---|---|---|---|---|
| 0 | 70676 | https://www.manga-sanctuary.com/manga-le-quoti... | 1.0 | https://www.manga-sanctuary.com/fiche_serie_cr... | 7.0 | 2023-05-10 | Critique Manga Le quotidien d'une épée maudite... |
| 1 | 71144 | https://www.manga-sanctuary.com/manga-villageo... | 999.0 | https://www.manga-sanctuary.com/fiche_serie_cr... | 7.0 | 2024-01-24 | Critique Manga Villageois LVL 999 #1\n\nDans u... |
| 2 | 64974 | https://www.manga-sanctuary.com/manga-lupin-ii... |  | https://www.manga-sanctuary.com/fiche_serie_cr... | 7.0 | 2021-10-18 | Critique Manga Lupin III anthology\n\n﻿﻿﻿Lupin... |
