# C3: Manga Sanctuary (MS) ↔ Kitsu mapping + series enrichment (Postgres-ready, JSONB), v2 (anti-collision)

## Objectives
- Generate `ms_title_norm` + `ms_titles_exploded` (main title + other titles)
- Generate the **2 Kitsu pivots**:
  - `kitsu_series_core` (1 row per kitsu_id) = enrichment reference table
  - `kitsu_titles_exploded` (multiple rows per kitsu_id) = matching index
- Match in 2 passes:
  1) **Exact match** on `title_norm` (with **collision handling** via a Kitsu quality score)
  2) **Fuzzy match** (RapidFuzz) on the series still unmatched (conservative threshold)
- Produce:
  - `ms_kitsu_map.(csv|parquet)`: mapping table (audit: method, score, matched title)
  - `ms_series_enriched_plus_kitsu.(csv|parquet)`: MS enriched with Kitsu data (no destructive overwrite)
  - pivot exports (audit): `ms_titles_exploded.csv`, `kitsu_series_core.csv`, `kitsu_titles_exploded.csv`
  - optional: `ms_kitsu_ambiguous.csv` (ambiguous cases to review)

## Notes
- No RAG here (no chunking, no embeddings).
- JSON columns (tags/genres/categories, other titles) are kept for **JSONB** storage in PostgreSQL.

In [55]:
from pathlib import Path
import os
import json
import re
import unicodedata

import pandas as pd
import numpy as np

try:
    from dotenv import load_dotenv
except Exception:
    load_dotenv = None

pd.set_option("display.max_colwidth", 160)

## 0) Parameters (paths)

In [56]:
# --- PROJECT ROOT (robust even if the kernel starts in `notebooks/`) ---
def find_repo_root(start: Path) -> Path:
    start = start.resolve()
    for root in [start, *start.parents]:
        if (root / "pyproject.toml").exists():
            return root
    return start

PROJECT_ROOT = find_repo_root(Path.cwd())

if "load_dotenv" in globals() and load_dotenv is not None:
    load_dotenv(PROJECT_ROOT / ".env", override=False)

def resolve_from_root(p: str) -> Path:
    candidate = Path(p).expanduser()
    if candidate.is_absolute():
        return candidate
    return (PROJECT_ROOT / candidate).resolve()

# --- INPUTS ---
# Override via env/.env if needed:
# - MS_SERIES_CSV=...
# - KITSU_CLEAN_CSV=...
MS_SERIES_CSV = resolve_from_root(os.getenv("MS_SERIES_CSV", "out_ms_final/ms_series_enriched.csv"))
KITSU_CLEAN_CSV = resolve_from_root(os.getenv("KITSU_CLEAN_CSV", "exports/kitsu/kitsu_top_rated_clean.csv"))

# Colab compatibility
if not MS_SERIES_CSV.exists() and Path("/mnt/data/ms_series_enriched.csv").exists():
    MS_SERIES_CSV = Path("/mnt/data/ms_series_enriched.csv")
if not KITSU_CLEAN_CSV.exists() and Path("/mnt/data/kitsu_top_rated_clean.csv").exists():
    KITSU_CLEAN_CSV = Path("/mnt/data/kitsu_top_rated_clean.csv")

if not MS_SERIES_CSV.exists():
    raise FileNotFoundError(f"MS_SERIES_CSV not found: {MS_SERIES_CSV} (override via MS_SERIES_CSV)")
if not KITSU_CLEAN_CSV.exists():
    raise FileNotFoundError(f"KITSU_CLEAN_CSV not found: {KITSU_CLEAN_CSV} (override via KITSU_CLEAN_CSV)")

# --- OUTPUTS ---
# Override via env/.env if needed:
# - C3_OUT_DIR=...
OUT_DIR = resolve_from_root(os.getenv("C3_OUT_DIR", "out_ms_final/c3_ms_kitsu_v2"))
OUT_DIR.mkdir(parents=True, exist_ok=True)

MAP_CSV  = OUT_DIR / "ms_kitsu_map.csv"
MAP_PARQ = OUT_DIR / "ms_kitsu_map.parquet"

MS_PLUS_CSV  = OUT_DIR / "ms_series_enriched_plus_kitsu.csv"
MS_PLUS_PARQ = OUT_DIR / "ms_series_enriched_plus_kitsu.parquet"

MS_TITLES_EX_CSV     = OUT_DIR / "ms_titles_exploded.csv"
KITSU_CORE_CSV       = OUT_DIR / "kitsu_series_core.csv"
KITSU_TITLES_EX_CSV  = OUT_DIR / "kitsu_titles_exploded.csv"
AMBIG_CSV            = OUT_DIR / "ms_kitsu_ambiguous.csv"  # optionnel

print("PROJECT_ROOT:", PROJECT_ROOT)
print("MS_SERIES_CSV:", MS_SERIES_CSV)
print("KITSU_CLEAN_CSV:", KITSU_CLEAN_CSV)
print("OUT_DIR:", OUT_DIR.resolve())

PROJECT_ROOT: /home/maxime/python/certification/preparation_bdd
MS_SERIES_CSV: /home/maxime/python/certification/preparation_bdd/out_ms_final/ms_series_enriched.csv
KITSU_CLEAN_CSV: /home/maxime/python/certification/preparation_bdd/exports/kitsu/kitsu_top_rated_clean.csv
OUT_DIR: /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2


## 1) Dependencies (pyarrow for Parquet, rapidfuzz for fuzzy matching)

In [57]:
try:
    import pyarrow  # noqa: F401
except Exception as e:
    raise RuntimeError(
        "pyarrow is required for the Parquet export. Install it (e.g. pip install -r requirements-dev.txt). "
        f"Details: {repr(e)}"
    ) from e

try:
    from rapidfuzz import process, fuzz
except Exception as e:
    raise RuntimeError(
        "rapidfuzz is required for fuzzy matching. Install it (e.g. pip install -r requirements-dev.txt). "
        f"Details: {repr(e)}"
    ) from e

print("OK: pyarrow + rapidfuzz")

OK: pyarrow + rapidfuzz


## 2) Loading + C3 checks

In [58]:
ms = pd.read_csv(MS_SERIES_CSV)
kitsu = pd.read_csv(KITSU_CLEAN_CSV)

print("MS series:", ms.shape, " | cols:", len(ms.columns))
print("Kitsu:", kitsu.shape, " | cols:", len(kitsu.columns))

assert "series_id" in ms.columns, "MS: missing series_id column"
assert "series_title" in ms.columns, "MS: missing series_title column"
assert "kitsu_id" in kitsu.columns, "Kitsu: missing kitsu_id column"

assert ms["series_id"].notna().all(), "MS: series_id contains NA values"
assert ms["series_id"].is_unique, "MS: series_id must be unique"
assert kitsu["kitsu_id"].notna().all(), "Kitsu: kitsu_id contains NA values"
assert kitsu["kitsu_id"].is_unique, "Kitsu: kitsu_id must be unique"

ms.head(2)

MS series: (13208, 35)  | cols: 35
Kitsu: (43085, 32)  | cols: 32


Unnamed: 0,series_id,series_url,series_title,series_type,series_category,series_year,series_other_titles,series_dessinateur,series_scenariste,series_genres,...,series_score_mean,series_score_median,series_score_min,series_score_max,series_with_body_count,series_with_date_count,series_with_body_pct,series_with_date_pct,series_first_review_date_iso,series_last_review_date_iso
0,78152,https://www.manga-sanctuary.com/bdd/manga/78152-touken-ranbu-online-anthology-honmaru-ranman-biyori/,Touken ranbu -ONLINE- Anthology ~ Honmaru Ranman Biyori ~,Manga,2019,2019.0,"[""刀剣乱舞-ONLINE- アンソロジー ~本丸爛漫日和~""]",Kyôko MAKI,Niko WAKUHARA,[],...,,,,,0,0,,,,
1,12139,https://www.manga-sanctuary.com/bdd/manhwa/12139-blast/,Blast,Manhwa,Sonyun,2007.0,[],Kangho PARK,Ha na LEE,[],...,,,,,0,0,,,,


## 3) Helpers (title normalization + JSON parsing)

In [59]:
from typing import Optional

def norm_title(s) -> Optional[str]:
    """Lowercase, strip accents, keep only [a-z0-9] tokens; None if nothing is left."""
    if s is None or (isinstance(s, float) and pd.isna(s)):
        return None
    s = str(s).strip().lower()
    if not s:
        return None
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = s.replace("&", "and")
    # Caveat: everything outside a-z0-9 becomes a space, so a title written in
    # Japanese script keeps only its Latin fragments (or normalizes to None).
    s = re.sub(r"[^a-z0-9]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s or None

def try_parse_json_list(x):
    # MS may store lists as JSON strings or as Python repr strings
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return []
    if isinstance(x, list):
        return x
    s = str(x).strip()
    if s in ("", "[]", "nan", "None"):
        return []
    # strict JSON
    try:
        v = json.loads(s)
        return v if isinstance(v, list) else [v]
    except Exception:
        pass
    # fallback: naive "Python repr" -> JSON conversion (single quotes to double quotes)
    try:
        s2 = s.replace("'", '"')
        v = json.loads(s2)
        return v if isinstance(v, list) else [v]
    except Exception:
        return [s]

def is_empty_list_str(x) -> bool:
    # True if missing, empty, or "[]"
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return True
    s = str(x).strip()
    return s in ("", "[]", "nan", "None")
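The quote-swapping fallback in `try_parse_json_list` fails on Python reprs that mix quote styles, for example `["it's", 'b']`, and then returns the whole string as one title. One possible alternative, sketched here under the assumption that such reprs can occur in the MS export, is to try `ast.literal_eval` after strict JSON. The helper name `parse_list_lenient` is hypothetical and not part of the notebook.

```python
import ast
import json

def parse_list_lenient(x):
    """Parse a list stored as JSON text or as a Python repr; fall back to [text]."""
    if x is None or (isinstance(x, float) and x != x):  # x != x is True only for NaN
        return []
    if isinstance(x, list):
        return x
    s = str(x).strip()
    if s in ("", "[]", "nan", "None"):
        return []
    # Try strict JSON first, then a safe evaluation of Python literals.
    for parser in (json.loads, ast.literal_eval):
        try:
            v = parser(s)
        except (ValueError, SyntaxError):
            continue
        return list(v) if isinstance(v, (list, tuple)) else [v]
    return [s]  # unparseable: keep the raw string as a single title

print(parse_list_lenient('["Jyaken San Wa Sugu Bureru"]'))
print(parse_list_lenient("""["it's", 'b']"""))
print(parse_list_lenient("Plain title"))
```

`ast.literal_eval` only evaluates literals, so it does not execute code found in the data.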

## 4) MS: ms_title_norm + ms_titles_exploded

In [60]:
ms["ms_title_main"] = ms["series_title"].astype(str)
ms["ms_title_norm"] = ms["ms_title_main"].map(norm_title)

# other titles
other_col = None
for cand in ["series_other_titles_json", "series_other_titles"]:
    if cand in ms.columns:
        other_col = cand
        break

if other_col:
    ms["_other_titles_list"] = ms[other_col].map(try_parse_json_list)
else:
    ms["_other_titles_list"] = [[] for _ in range(len(ms))]

# explode titles (main + others)
rows = []
for _, r in ms[["series_id", "ms_title_main", "ms_title_norm", "_other_titles_list"]].iterrows():
    sid = r["series_id"]
    if isinstance(r["ms_title_norm"], str) and r["ms_title_norm"]:
        rows.append({"series_id": sid, "title_source": "main", "title": r["ms_title_main"], "title_norm": r["ms_title_norm"]})
    for t in r["_other_titles_list"] or []:
        tn = norm_title(t)
        if tn:
            rows.append({"series_id": sid, "title_source": "other", "title": str(t), "title_norm": tn})

ms_titles_ex = pd.DataFrame(rows).drop_duplicates(subset=["series_id", "title_norm"])

print("ms_titles_exploded:", ms_titles_ex.shape)
ms_titles_ex.head(10)

ms_titles_exploded: (16709, 4)


Unnamed: 0,series_id,title_source,title,title_norm
0,78152,main,Touken ranbu -ONLINE- Anthology ~ Honmaru Ranman Biyori ~,touken ranbu online anthology honmaru ranman biyori
1,78152,other,刀剣乱舞-ONLINE- アンソロジー ~本丸爛漫日和~,online
2,12139,main,Blast,blast
3,15537,main,Te serrer tout contre moi,te serrer tout contre moi
4,57120,main,Gyutto!! Onee-chan,gyutto onee chan
5,10066,main,Hyudora,hyudora
6,70676,main,Le quotidien d'une épée maudite,le quotidien d une epee maudite
7,70676,other,Jyaken San Wa Sugu Bureru,jyaken san wa sugu bureru
8,10620,main,Kanojo no Tsumeato,kanojo no tsumeato
9,9305,main,L’hôpital de mes amis,l hopital de mes amis
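Row 1 of the table above shows a side effect of the ASCII-only normalization: the Japanese alternate title collapses to `online`, a very generic key that can collide with unrelated Kitsu entries. A minimal sketch of a Unicode-aware variant follows. The function name `norm_title_unicode` is hypothetical, and the choice to strip accents only from Latin letters is an assumption rather than the notebook's method.

```python
import re
import unicodedata

def norm_title_unicode(s):
    """Like norm_title, but keeps letters and digits from any script."""
    if s is None or (isinstance(s, float) and s != s):  # NaN check without pandas
        return None
    s = unicodedata.normalize("NFKC", str(s)).strip().lower()
    if not s:
        return None
    chars = []
    for ch in s:
        base = unicodedata.normalize("NFKD", ch)[0]
        # Strip accents only when the base letter is ASCII ("ô" -> "o");
        # characters from other scripts are kept unchanged.
        chars.append(base if base.isascii() else ch)
    s = "".join(chars).replace("&", "and")
    s = re.sub(r"[\W_]+", " ", s)  # \w is Unicode-aware for str patterns
    s = re.sub(r"\s+", " ", s).strip()
    return s or None

print(norm_title_unicode("Kyôshoku Sôkô Guyver"))
print(norm_title_unicode("刀剣乱舞-ONLINE- アンソロジー ~本丸爛漫日和~"))
```

With this variant the Japanese title keeps its kanji and kana tokens, so it would only match a Kitsu candidate that carries the same Japanese title.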


## 5) Kitsu pivots: core + titles_exploded

In [61]:
# title_norm_primary (fallback if absent)
if "title_norm_primary" not in kitsu.columns:
    kitsu["title_norm_primary"] = kitsu.get("title_norm_canonical")
    kitsu.loc[kitsu["title_norm_primary"].isna(), "title_norm_primary"] = kitsu.get("title_norm_en")

# title candidates
if "title_candidates_json" in kitsu.columns:
    kitsu["_cands"] = kitsu["title_candidates_json"].map(try_parse_json_list)
else:
    kitsu["_cands"] = kitsu.apply(
        lambda r: [t for t in [r.get("title_canonical"), r.get("title_en"), r.get("title_ja")] if isinstance(t, str) and t.strip()],
        axis=1
    )
# Keep one entry per candidate (None when nothing survives normalization) so that
# `_cands` and `_cands_norm` stay aligned when they are zipped for the exploded pivot.
kitsu["_cands_norm"] = kitsu["_cands"].map(lambda L: [norm_title(x) for x in (L or [])])

# Pivot 1: core (1 row per kitsu_id), the enrichment reference table
core_cols = [
    "kitsu_id","slug","status",
    "title_canonical","title_en","title_ja",
    "title_norm_primary","title_norm_canonical","title_norm_en","title_norm_ja",
    "synopsis_clean","rating_average_10","rating_rank","popularity_rank",
    "categories_json","genres_json","tags_all_json",
]
core_cols = [c for c in core_cols if c in kitsu.columns]
kitsu_core = kitsu[core_cols].copy()

# Pivot 2: titles_exploded (multiple rows per kitsu_id), the matching index
rows = []
for _, r in kitsu[["kitsu_id","_cands","_cands_norm"]].iterrows():
    kid = r["kitsu_id"]
    for t, tn in zip(r["_cands"] or [], r["_cands_norm"] or []):
        if tn:
            rows.append({"kitsu_id": kid, "title": str(t), "title_norm": tn})

kitsu_titles_ex = pd.DataFrame(rows).drop_duplicates(subset=["kitsu_id","title_norm"])

print("kitsu_core:", kitsu_core.shape)
print("kitsu_titles_exploded:", kitsu_titles_ex.shape)
kitsu_titles_ex.head(10)

kitsu_core: (43085, 17)
kitsu_titles_exploded: (60337, 3)


Unnamed: 0,kitsu_id,title,title_norm
0,60854,The Greatest Estate Developer,the greatest estate developer
1,38,One Piece,one piece
3,57766,Kimetsu no Yaiba: Rengoku Kyoujurou Gaiden,kimetsu no yaiba rengoku kyoujurou gaiden
4,57766,Demon Slayer: Rengoku Kyoujurou Side Story,demon slayer rengoku kyoujurou side story
5,55546,Kimetsu no Yaiba: Tomioka Giyuu Gaiden,kimetsu no yaiba tomioka giyuu gaiden
6,8,Berserk,berserk
7,54139,Chainsaw Man,chainsaw man
8,54448,SPY×FAMILY,spy family
9,54448,SPY x FAMILY,spy x family
10,26004,Boku no Hero Academia,boku no hero academia


## 6) v2 improvement: Kitsu quality score (to break exact-match collisions)

In [62]:
def non_empty_jsonish(x) -> bool:
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return False
    if isinstance(x, list):
        return len(x) > 0
    s = str(x).strip()
    return s not in ("", "[]", "nan", "None")

kitsu_quality = kitsu_core[["kitsu_id"]].copy()

kitsu_quality["has_synopsis"] = kitsu_core.get("synopsis_clean").fillna("").astype(str).str.strip().ne("")
kitsu_quality["has_tags"] = kitsu_core.get("tags_all_json").map(non_empty_jsonish) if "tags_all_json" in kitsu_core.columns else False
kitsu_quality["has_categories"] = kitsu_core.get("categories_json").map(non_empty_jsonish) if "categories_json" in kitsu_core.columns else False
kitsu_quality["has_genres"] = kitsu_core.get("genres_json").map(non_empty_jsonish) if "genres_json" in kitsu_core.columns else False

kitsu_quality["popularity_rank"] = pd.to_numeric(kitsu_core.get("popularity_rank"), errors="coerce")
kitsu_quality["rating_rank"] = pd.to_numeric(kitsu_core.get("rating_rank"), errors="coerce")
kitsu_quality["rating_average_10"] = pd.to_numeric(kitsu_core.get("rating_average_10"), errors="coerce")

# Simple, stable score: favors completeness (synopsis/tags), plus a small contribution from the rating
kitsu_quality["quality_score"] = (
    kitsu_quality["has_synopsis"].astype(int) * 1000
    + kitsu_quality["has_tags"].astype(int) * 100
    + kitsu_quality["has_categories"].astype(int) * 30
    + kitsu_quality["has_genres"].astype(int) * 20
    + kitsu_quality["rating_average_10"].fillna(0) * 2
)

kitsu_quality.head(5)

Unnamed: 0,kitsu_id,has_synopsis,has_tags,has_categories,has_genres,popularity_rank,rating_rank,rating_average_10,quality_score
0,60854,True,True,True,False,347.0,1.0,8.59,1147.18
1,38,True,True,True,True,3.0,2.0,8.5,1167.0
2,57766,True,False,False,False,45.0,3.0,8.49,1016.98
3,55546,True,True,True,False,72.0,4.0,8.49,1146.98
4,8,True,True,True,True,17.0,5.0,8.48,1166.96
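As a worked check of the weights, the score can be recomputed by hand for single rows of the table above. The standalone function below is only an illustration of the same arithmetic; its name and signature are hypothetical.

```python
def quality_score(has_synopsis, has_tags, has_categories, has_genres, rating_average_10):
    """Recompute the notebook's quality score for a single Kitsu entry."""
    rating = 0.0 if rating_average_10 is None else rating_average_10
    return (
        int(has_synopsis) * 1000
        + int(has_tags) * 100
        + int(has_categories) * 30
        + int(has_genres) * 20
        + rating * 2
    )

# kitsu_id 38 in the table above: every flag set, rating 8.5
print(quality_score(True, True, True, True, 8.5))
# kitsu_id 57766: synopsis only, rating 8.49
print(round(quality_score(True, False, False, False, 8.49), 2))
```

Assuming ratings are on a 0 to 10 scale, as the column name `rating_average_10` suggests, the rating term adds at most 20 points, so it mostly acts as a tie-breaker between entries with the same completeness flags.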


## 7) Exact match (title_norm): v2 anti-collision (score-based selection)

In [63]:
# Anti-collision option: ignore title_norm values that are too short (often ambiguous)
MIN_NORM_LEN = 4
ms_titles_ex_f = ms_titles_ex[ms_titles_ex["title_norm"].str.len() >= MIN_NORM_LEN].copy()
kitsu_titles_ex_f = kitsu_titles_ex[kitsu_titles_ex["title_norm"].str.len() >= MIN_NORM_LEN].copy()

exact = (
    ms_titles_ex_f
    .merge(kitsu_titles_ex_f, on="title_norm", how="inner", suffixes=("_ms","_kitsu"))
    .merge(kitsu_quality, on="kitsu_id", how="left")
)

# priority to the MS main title
exact["priority"] = np.where(exact["title_source"] == "main", 0, 1)

# Sort: main-title priority, then quality desc, then ranks asc
exact = exact.sort_values(
    ["series_id", "priority", "quality_score", "popularity_rank", "rating_rank"],
    ascending=[True, True, False, True, True],
)

# Keep a single exact match per series
exact_best = exact.drop_duplicates(subset=["series_id"], keep="first").copy()
exact_best["match_method"] = "exact_title_norm_scored"
exact_best["match_score"] = 100

# Collision audit: number of exact candidate rows per series
# (one Kitsu entry reached through two MS titles is counted twice)
collision_stats = exact.groupby("series_id").size().reset_index(name="n_exact_candidates")
n_collisions = int((collision_stats["n_exact_candidates"] >= 2).sum())

print("exact matches:", len(exact_best), "/", ms["series_id"].nunique())
print("series with collisions (>=2 exact candidates):", n_collisions)

exact_best.head(10)

exact matches: 5010 / 13208
series with collisions (>=2 exact candidates): 935


Unnamed: 0,series_id,title_source,title_ms,title_norm,kitsu_id,title_kitsu,has_synopsis,has_tags,has_categories,has_genres,popularity_rank,rating_rank,rating_average_10,quality_score,priority,match_method,match_score
3003,12,main,Angel Dust,angel dust,229,Angel/Dust,True,True,True,True,6135.0,11584.0,6.5,1163.0,0,exact_title_norm_scored,100
2569,68,main,Eden,eden,24245,Eden,True,True,True,True,10761.0,7595.0,7.14,1164.28,0,exact_title_norm_scored,100
6058,69,main,Psychometrer Eiji,psychometrer eiji,1615,Psychometrer Eiji,True,True,True,True,6755.0,4028.0,7.48,1164.96,0,exact_title_norm_scored,100
2092,78,main,F.Compo,f compo,3559,F. Compo,True,True,True,True,5450.0,3913.0,7.49,1164.98,0,exact_title_norm_scored,100
4177,114,other,.Hack//Tasogare no Udewa Densetsu,hack tasogare no udewa densetsu,301,.hack//Tasogare no Udewa Densetsu,True,True,True,True,1914.0,9303.0,6.97,1163.94,1,exact_title_norm_scored,100
5536,178,main,Rookies,rookies,1998,Rookies,True,True,True,True,1270.0,716.0,8.09,1166.18,0,exact_title_norm_scored,100
3321,402,main,Agharta,agharta,1895,Agharta,True,True,True,True,10571.0,8248.0,7.08,1164.16,0,exact_title_norm_scored,100
3207,403,main,Akira,akira,1500,Akira,True,True,True,True,139.0,339.0,8.22,1166.44,0,exact_title_norm_scored,100
2869,408,main,Appleseed,appleseed,1510,Appleseed,True,True,True,True,3029.0,5347.0,7.34,1164.68,0,exact_title_norm_scored,100
2800,409,main,Armagedon,armagedon,23042,Armagedon,True,True,True,True,36426.0,,,1150.0,0,exact_title_norm_scored,100
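The tie-break applied by `sort_values` followed by `drop_duplicates(keep="first")` can be restated for a single series with plain Python tuples. This is a sketch on made-up candidates, not the notebook's code. Missing ranks are sent to the end, which mirrors the pandas default of sorting NaN last.

```python
import math

def pick_best(candidates):
    """Return the candidate that a (priority, -quality, ranks) ordering puts first."""
    def sort_key(c):
        pop = c.get("popularity_rank")
        rat = c.get("rating_rank")
        return (
            0 if c["title_source"] == "main" else 1,  # main title first
            -c["quality_score"],                      # higher quality first
            math.inf if pop is None else pop,         # lower rank first, missing last
            math.inf if rat is None else rat,
        )
    return min(candidates, key=sort_key)

# Hypothetical candidates for one series_id
candidates = [
    {"kitsu_id": 1, "title_source": "other", "quality_score": 1167.0, "popularity_rank": 3, "rating_rank": 2},
    {"kitsu_id": 2, "title_source": "main", "quality_score": 1016.0, "popularity_rank": 900, "rating_rank": None},
    {"kitsu_id": 3, "title_source": "main", "quality_score": 1164.0, "popularity_rank": 5000, "rating_rank": 7000},
]
print(pick_best(candidates)["kitsu_id"])
```

Note the consequence of ordering by `priority` first: a match on the main MS title wins even when a match on an alternate title has a higher quality score.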


## 8) Optional: export ambiguous cases for review (C3 evidence)

In [64]:
# A series is flagged "ambiguous" when it has >=3 exact candidates (adjustable threshold)
AMBIG_THRESHOLD = 3
ambig_ids = set(collision_stats.loc[collision_stats["n_exact_candidates"] >= AMBIG_THRESHOLD, "series_id"])

ms_ambiguous = ms.loc[ms["series_id"].isin(ambig_ids), ["series_id","ms_title_main","ms_title_norm"]].copy()
ms_ambiguous["n_exact_candidates"] = ms_ambiguous["series_id"].map(
    dict(zip(collision_stats["series_id"], collision_stats["n_exact_candidates"]))
)

print("ambiguous series to review:", len(ms_ambiguous))
ms_ambiguous.head(20)

ambiguous series to review: 264


Unnamed: 0,series_id,ms_title_main,ms_title_norm,n_exact_candidates
0,78152,Touken ranbu -ONLINE- Anthology ~ Honmaru Ranman Biyori ~,touken ranbu online anthology honmaru ranman biyori,5
17,69359,L’Enfant et le Maudit - Cher journal,l enfant et le maudit cher journal,5
62,68156,Lovely,lovely,3
122,8110,Love Letter,love letter,3
331,5961,Line,line,3
361,54785,Life,life,4
362,5968,Life,life,3
383,64972,Leviathan,leviathan,3
384,1477,Leviathan,leviathan,3
394,59376,Let's gôtokuji!,let s gotokuji,3
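`n_exact_candidates` counts rows of the merged `exact` frame, so a Kitsu entry reached through two different MS titles is counted twice. If the intent is to count competing Kitsu entries, counting distinct `kitsu_id` values per series may be closer to it. The sketch below contrasts the two counts on made-up rows.

```python
from collections import defaultdict

# (series_id, title_norm, kitsu_id) rows as an exact merge might produce them; values are made up
exact_rows = [
    (10, "life", 501),
    (10, "life", 502),
    (10, "raifu", 501),   # same Kitsu entry reached through another MS title
    (11, "akira", 1500),
]

rows_per_series = defaultdict(int)
kitsu_per_series = defaultdict(set)
for series_id, _title_norm, kitsu_id in exact_rows:
    rows_per_series[series_id] += 1
    kitsu_per_series[series_id].add(kitsu_id)

distinct_per_series = {sid: len(ids) for sid, ids in kitsu_per_series.items()}
print(dict(rows_per_series))
print(distinct_per_series)
```

In pandas, the distinct count corresponds to `exact.groupby("series_id")["kitsu_id"].nunique()`.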


## 9) Fuzzy match (unmatched series): rapidfuzz (conservative threshold)

In [65]:
matched_series_ids = set(exact_best["series_id"].tolist())
ms_left = ms[~ms["series_id"].isin(matched_series_ids)].copy()

# Fuzzy index: compare against title_norm_primary (1 key per kitsu_id)
kitsu_index = kitsu_core[["kitsu_id","title_norm_primary"]].dropna().copy()
kitsu_choices = dict(zip(kitsu_index["kitsu_id"].tolist(), kitsu_index["title_norm_primary"].tolist()))

FUZZY_MIN_SCORE = 92  # conservative (lower it if too few matches; raise it if false positives appear)

fuzzy_rows = []
for _, r in ms_left[["series_id","ms_title_norm","ms_title_main"]].iterrows():
    sid = r["series_id"]
    q = r["ms_title_norm"]
    if not isinstance(q, str) or not q:
        continue
    hit = process.extractOne(q, kitsu_choices, scorer=fuzz.token_sort_ratio, score_cutoff=FUZZY_MIN_SCORE)
    if hit:
        kitsu_norm, score, kitsu_id = hit[0], hit[1], hit[2]
        fuzzy_rows.append({
            "series_id": sid,
            "kitsu_id": int(kitsu_id),
            "match_method": "fuzzy_title_norm_primary",
            "match_score": float(score),
            "matched_title_norm": kitsu_norm,
            "ms_title_norm": q,
            "ms_title": r["ms_title_main"],
        })

fuzzy = pd.DataFrame(fuzzy_rows)
print("fuzzy matches:", len(fuzzy), "/", len(ms_left))
fuzzy.head(10)

fuzzy matches: 598 / 8198


Unnamed: 0,series_id,kitsu_id,match_method,match_score,matched_title_norm,ms_title_norm,ms_title
0,7936,1072,fuzzy_title_norm_primary,95.652174,lovers doll,lover s doll,Lover's Doll
1,61671,30587,fuzzy_title_norm_primary,96.385542,kyupiko fushimatsu tenshi no mismanagement,kyupiko fujimatsu tenshi no mismanagement,Kyupiko! - Fujimatsu Tenshi no Mismanagement
2,5478,5646,fuzzy_title_norm_primary,97.777778,love fragments shangai,love fragments shanghai,LOVE fragments Shanghai
3,41739,66135,fuzzy_title_norm_primary,92.857143,love and peace,love and peach,Love & Peach
4,11547,13329,fuzzy_title_norm_primary,100.0,paradise lost,lost paradise,Lost Paradise
5,13139,3424,fuzzy_title_norm_primary,93.023256,kyoushoku soukou guyver,kyoshoku soko guyver,Kyôshoku Sôkô Guyver
6,10697,12975,fuzzy_title_norm_primary,97.560976,kyoumen no silhouette,kyomen no silhouette,Kyômen no Silhouette
7,12738,23495,fuzzy_title_norm_primary,96.551724,kyokutou prison,kyokuto prison,Kyokutô Prison
8,14364,46152,fuzzy_title_norm_primary,94.117647,kyougaku koukou no genjitsu,kyogaku koko no genjitsu,Kyôgaku Kôkô no Genjitsu
9,14369,4928,fuzzy_title_norm_primary,96.969697,kyou no yuiko san,kyo no yuiko san,Kyô no Yuiko-san
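Row 4 of the table above ("Lost Paradise" matched to "paradise lost" with a score of 100) follows from the token-sort idea: tokens are sorted before the strings are compared, so word order is ignored. The sketch below illustrates that idea with the standard library's `difflib` as a stand-in. It is not RapidFuzz's implementation, and its scores may differ from `fuzz.token_sort_ratio`.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a: str, b: str) -> float:
    """Sort whitespace-separated tokens, then compare the rebuilt strings (0-100)."""
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return 100.0 * SequenceMatcher(None, a_sorted, b_sorted).ratio()

print(token_sort_similarity("paradise lost", "lost paradise"))
print(token_sort_similarity("love and peace", "love and peach"))
```

A perfect score under this kind of scorer therefore does not prove that the two titles are identical, which is one reason to keep fuzzy matches in the review list.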


## 10) Build ms_kitsu_map (exact + fuzzy audit)

In [66]:
# Exact map (audit)
map_exact = exact_best[["series_id","kitsu_id","match_method","match_score","title_norm"]].copy()
map_exact = map_exact.rename(columns={"title_norm":"matched_title_norm"})
map_exact["ms_title"] = ms.set_index("series_id").loc[map_exact["series_id"], "ms_title_main"].values
map_exact["ms_title_norm"] = ms.set_index("series_id").loc[map_exact["series_id"], "ms_title_norm"].values

# Union exact + fuzzy
ms_kitsu_map = pd.concat([map_exact, fuzzy], ignore_index=True)

# On a double match: keep exact before fuzzy, then the highest score
ms_kitsu_map["method_prio"] = np.where(ms_kitsu_map["match_method"].str.startswith("exact"), 0, 1)
ms_kitsu_map = ms_kitsu_map.sort_values(["series_id","method_prio","match_score"], ascending=[True, True, False])
ms_kitsu_map = ms_kitsu_map.drop_duplicates(subset=["series_id"], keep="first").drop(columns=["method_prio"])

print("total matches:", len(ms_kitsu_map), "/", ms["series_id"].nunique())
ms_kitsu_map["match_method"].value_counts()

total matches: 5608 / 13208


match_method
exact_title_norm_scored     5010
fuzzy_title_norm_primary     598
Name: count, dtype: int64

### Quality

In [67]:
# --- Quality parameters (adjustable) ---
FUZZY_RISK_THRESHOLD = 95     # fuzzy < 95 => needs_review
MIN_NORM_LEN = 4              # titles that are too short => needs_review

# --- Basic checks ---
required = ["series_id", "kitsu_id", "match_method", "match_score", "ms_title_norm"]
missing = [c for c in required if c not in ms_kitsu_map.columns]
if missing:
    raise ValueError(f"ms_kitsu_map: missing columns: {missing}")

ms_kitsu_map_q = ms_kitsu_map.copy()

# --- Collisions: the same kitsu_id associated with several series_id values ---
kitsu_counts = (
    ms_kitsu_map_q.groupby("kitsu_id")["series_id"]
    .nunique()
    .reset_index(name="kitsu_id_ms_count")
)

ms_kitsu_map_q = ms_kitsu_map_q.merge(kitsu_counts, on="kitsu_id", how="left")
ms_kitsu_map_q["kitsu_id_ms_count"] = (
    pd.to_numeric(ms_kitsu_map_q["kitsu_id_ms_count"], errors="coerce")
      .fillna(0)
      .astype("Int64")
)

ms_kitsu_map_q["kitsu_id_collision"] = ms_kitsu_map_q["kitsu_id_ms_count"] > 1

# --- Risky fuzzy matches ---
ms_kitsu_map_q["match_score_num"] = pd.to_numeric(ms_kitsu_map_q["match_score"], errors="coerce")
is_fuzzy = ms_kitsu_map_q["match_method"].astype(str).str.startswith("fuzzy")
ms_kitsu_map_q["fuzzy_low_score"] = is_fuzzy & (ms_kitsu_map_q["match_score_num"] < FUZZY_RISK_THRESHOLD)

# --- Titles that are too short (highly ambiguous) ---
ms_kitsu_map_q["ms_title_norm_len"] = ms_kitsu_map_q["ms_title_norm"].astype(str).str.len()
ms_kitsu_map_q["title_too_short"] = ms_kitsu_map_q["ms_title_norm_len"] < MIN_NORM_LEN

# --- needs_review + review_reason (reasons accumulate, separated by ";") ---
reasons = [
    ("kitsu_id_collision", ms_kitsu_map_q["kitsu_id_collision"]),
    ("fuzzy_low_score", ms_kitsu_map_q["fuzzy_low_score"]),
    ("title_too_short", ms_kitsu_map_q["title_too_short"]),
]

ms_kitsu_map_q["needs_review"] = False
ms_kitsu_map_q["review_reason"] = ""

for label, cond in reasons:
    ms_kitsu_map_q.loc[cond, "needs_review"] = True
    # append this label to any reasons already recorded for the row
    ms_kitsu_map_q.loc[cond, "review_reason"] = (
        ms_kitsu_map_q.loc[cond, "review_reason"]
        .apply(lambda s: s + (";" if s else "") + label)
    )

# Drop temporary columns (optional)
ms_kitsu_map_q = ms_kitsu_map_q.drop(columns=["match_score_num"], errors="ignore")

# --- KPI summary ---
kpi = {
    "rows": int(len(ms_kitsu_map_q)),
    "needs_review_%": float(ms_kitsu_map_q["needs_review"].mean() * 100),
    "collision_rows": int(ms_kitsu_map_q["kitsu_id_collision"].sum()),
    "distinct_kitsu_id": int(ms_kitsu_map_q["kitsu_id"].nunique()),
    "distinct_series_id": int(ms_kitsu_map_q["series_id"].nunique()),
    "fuzzy_rows": int(is_fuzzy.sum()),
    "fuzzy_low_score_rows": int(ms_kitsu_map_q["fuzzy_low_score"].sum()),
    "title_too_short_rows": int(ms_kitsu_map_q["title_too_short"].sum()),
}

kpi


{'rows': 5608,
 'needs_review_%': 9.254636233951498,
 'collision_rows': 339,
 'distinct_kitsu_id': 5422,
 'distinct_series_id': 5608,
 'fuzzy_rows': 598,
 'fuzzy_low_score_rows': 147,
 'title_too_short_rows': 66}
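The three review rules can be restated for a single mapping row, which makes the accumulated `review_reason` string easier to reason about. This is a hypothetical helper for illustration. The default thresholds mirror `FUZZY_RISK_THRESHOLD` and `MIN_NORM_LEN` from the cell above.

```python
def review_reasons(match_method, match_score, ms_title_norm, kitsu_id_ms_count,
                   fuzzy_risk_threshold=95, min_norm_len=4):
    """Return the ';'-joined review reasons for one mapping row ('' if none)."""
    reasons = []
    if kitsu_id_ms_count > 1:
        reasons.append("kitsu_id_collision")
    if match_method.startswith("fuzzy") and match_score < fuzzy_risk_threshold:
        reasons.append("fuzzy_low_score")
    if len(ms_title_norm) < min_norm_len:
        reasons.append("title_too_short")
    return ";".join(reasons)

# An exact match on a kitsu_id shared by 8 MS series
print(review_reasons("exact_title_norm_scored", 100, "zero", 8))
# A fuzzy match just above the matching cutoff but below the review threshold
print(review_reasons("fuzzy_title_norm_primary", 92.857143, "love and peach", 1))
```

A row needs review exactly when the returned string is non-empty.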

### Audit export

In [68]:
# Export the mapping audit (quality flags) to the SAME OUT_DIR as the rest of the notebook
MAP_QUAL_CSV  = OUT_DIR / "ms_kitsu_map_quality.csv"
MAP_QUAL_PARQ = OUT_DIR / "ms_kitsu_map_quality.parquet"

ms_kitsu_map_q.to_csv(MAP_QUAL_CSV, index=False)
ms_kitsu_map_q.to_parquet(MAP_QUAL_PARQ, index=False)

print("Exported:")
print(" -", MAP_QUAL_CSV)
print(" -", MAP_QUAL_PARQ)


Exported:
 - /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/ms_kitsu_map_quality.csv
 - /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/ms_kitsu_map_quality.parquet


### Priority review list

In [69]:
# Review order: collisions first, then low-score fuzzy matches
to_review = ms_kitsu_map_q[ms_kitsu_map_q["needs_review"]].copy()

to_review = to_review.sort_values(
    ["kitsu_id_collision", "match_method", "match_score", "ms_title_norm_len"],
    ascending=[False, True, True, True]
)

to_review.head(30)[
    ["series_id","kitsu_id","match_method","match_score","ms_title","ms_title_norm","review_reason","kitsu_id_ms_count"]
]


Unnamed: 0,series_id,kitsu_id,match_method,match_score,ms_title,ms_title_norm,review_reason,kitsu_id_ms_count
60,713,725,exact_title_norm_scored,100.0,Fake,fake,kitsu_id_collision,2
73,737,40,exact_title_norm_scored,100.0,Rave,rave,kitsu_id_collision,2
94,899,74,exact_title_norm_scored,100.0,Nana,nana,kitsu_id_collision,2
138,1123,5216,exact_title_norm_scored,100.0,Blue,blue,kitsu_id_collision,2
226,1395,1484,exact_title_norm_scored,100.0,Real,real,kitsu_id_collision,2
288,1574,24820,exact_title_norm_scored,100.0,Zero,zero,kitsu_id_collision,8
338,1936,24820,exact_title_norm_scored,100.0,Zero,zero,kitsu_id_collision,8
409,2388,15732,exact_title_norm_scored,100.0,Ippo,ippo,kitsu_id_collision,2
484,3424,13450,exact_title_norm_scored,100.0,Pink,pink,kitsu_id_collision,3
665,5968,9445,exact_title_norm_scored,100.0,Life,life,kitsu_id_collision,3


## 11) MS enrichment (no destructive overwrite)

In [70]:
# Use the "quality" map if it exists (needs_review / collisions); otherwise fall back to ms_kitsu_map
map_df = ms_kitsu_map_q if "ms_kitsu_map_q" in globals() else ms_kitsu_map

ms_plus = (ms
           .merge(map_df, on="series_id", how="left")  # keeps match_method, match_score, plus any quality flags
           .merge(kitsu_core.add_prefix("kitsu_"), left_on="kitsu_id", right_on="kitsu_kitsu_id", how="left"))

# synopsis: fill only when the MS value is empty
if "series_synopsis" in ms_plus.columns and "kitsu_synopsis_clean" in ms_plus.columns:
    ms_plus["series_synopsis_enriched"] = ms_plus["series_synopsis"]
    ms_plus.loc[
        ms_plus["series_synopsis_enriched"].isna()
        | (ms_plus["series_synopsis_enriched"].astype(str).str.strip() == ""),
        "series_synopsis_enriched"
    ] = ms_plus["kitsu_synopsis_clean"]
else:
    ms_plus["series_synopsis_enriched"] = ms_plus.get("series_synopsis")

# tags/genres: if the MS value is []/empty => fill from Kitsu
if "series_tags" in ms_plus.columns and "kitsu_tags_all_json" in ms_plus.columns:
    ms_plus["series_tags_enriched"] = ms_plus["series_tags"]
    empty_mask = ms_plus["series_tags_enriched"].map(is_empty_list_str)
    ms_plus.loc[empty_mask, "series_tags_enriched"] = ms_plus.loc[empty_mask, "kitsu_tags_all_json"]

if "series_genres" in ms_plus.columns and "kitsu_genres_json" in ms_plus.columns:
    ms_plus["series_genres_enriched"] = ms_plus["series_genres"]
    empty_mask = ms_plus["series_genres_enriched"].map(is_empty_list_str)
    ms_plus.loc[empty_mask, "series_genres_enriched"] = ms_plus.loc[empty_mask, "kitsu_genres_json"]

kpi_enrich = {
    "ms_rows": int(len(ms_plus)),
    "matched_%": float(ms_plus["kitsu_id"].notna().mean() * 100),
    "synopsis_after_non_empty_%": float((ms_plus["series_synopsis_enriched"].notna() & (ms_plus["series_synopsis_enriched"].astype(str).str.strip() != "")).mean() * 100),
    "needs_review_%": float(ms_plus["needs_review"].mean() * 100) if "needs_review" in ms_plus.columns else None,
}
kpi_enrich


{'ms_rows': 13208,
 'matched_%': 42.459115687462145,
 'synopsis_after_non_empty_%': 70.11659600242277,
 'needs_review_%': 9.254636233951498}
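The "no destructive overwrite" rule reduces to one decision per field: keep the MS value unless it is missing or empty, and only then take the Kitsu value. The helper below is a simplified sketch that merges the synopsis rule and the tags/genres rule into one function. Its name is hypothetical.

```python
def fill_if_empty(ms_value, kitsu_value):
    """Keep the MS value unless it is missing, blank, or '[]'; otherwise use Kitsu's."""
    if ms_value is None or (isinstance(ms_value, float) and ms_value != ms_value):
        return kitsu_value  # None or NaN on the MS side
    if str(ms_value).strip() in ("", "[]", "nan", "None"):
        return kitsu_value
    return ms_value

print(fill_if_empty("MS synopsis", "Kitsu synopsis"))
print(fill_if_empty("[]", '["Action"]'))
```

Because the original `series_*` columns are left untouched and the result goes into `*_enriched` columns, the source of each value can still be audited afterwards.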

### Fixes: NO_MATCH label, boolean flags, and ID/year types

In [71]:
def normalize_export_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    if "match_method" in df.columns:
        df["match_method"] = df["match_method"].fillna("NO_MATCH")

    for col in ["needs_review", "kitsu_id_collision"]:
        if col in df.columns:
            # Go through the nullable "boolean" dtype first: fillna(False) on an
            # object column can emit a pandas downcasting FutureWarning.
            df[col] = df[col].astype("boolean").fillna(False).astype(bool)

    if "review_reason" in df.columns:
        df["review_reason"] = df["review_reason"].fillna("").astype(str)

    for col in ["kitsu_id", "series_year", "series_category_year_guess"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

    if "match_score" in df.columns:
        df["match_score"] = pd.to_numeric(df["match_score"], errors="coerce")

    return df

ms_plus = normalize_export_df(ms_plus)
ms_kitsu_map = normalize_export_df(ms_kitsu_map)
if "ms_kitsu_map_q" in globals():
    ms_kitsu_map_q = normalize_export_df(ms_kitsu_map_q)




### Fixing type problems on PostgreSQL import

In [72]:
df = ms_plus  # <-- final table to export

# 1) explicit match_method
if "match_method" in df.columns:
    df["match_method"] = df["match_method"].fillna("NO_MATCH")

# 2) booleans (avoid NaN); the nullable "boolean" dtype avoids the object-downcasting FutureWarning
for c in ["series_category_is_allowed", "kitsu_id_collision", "fuzzy_low_score",
          "title_too_short", "needs_review"]:
    if c in df.columns:
        df[c] = df[c].astype("boolean").fillna(False).astype(bool)

# 3) integer columns (avoid values like 51.0 in the CSV); nullable Int64 keeps missing values.
# This includes ms_title_norm_len, the column that blocked the import.
int_cols = [
    "series_id","series_year","series_category_year_guess",
    "series_popularity_rank","series_members_votes","series_experts_votes",
    "series_volume_count","series_review_count","series_with_body_count","series_with_date_count",
    "kitsu_id","kitsu_id_ms_count","ms_title_norm_len",
    "kitsu_rating_rank","kitsu_popularity_rank"
]
for c in int_cols:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce").astype("Int64")

# 4) ISO dates (if present)
for c in ["series_first_review_date_iso", "series_last_review_date_iso"]:
    if c in df.columns:
        df[c] = pd.to_datetime(df[c], errors="coerce").dt.strftime("%Y-%m-%d")

# 5) review_reason as text
if "review_reason" in df.columns:
    df["review_reason"] = df["review_reason"].fillna("").astype(str)

ms_plus = df  # explicit reassignment


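The reason for the `Int64` casts can be seen on a two-row example. After a left merge, unmatched rows hold NaN, so an integer ID column becomes float and is written as `51.0`, a value that a PostgreSQL integer column typically rejects on import. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows: the second series has no Kitsu match, so kitsu_id is float64 with NaN
df = pd.DataFrame({"series_id": [1, 2], "kitsu_id": [51.0, None]})

csv_float = df.to_csv(index=False)
csv_int = df.astype({"kitsu_id": "Int64"}).to_csv(index=False)

print(csv_float)
print(csv_int)
```

With the nullable `Int64` dtype the matched ID is written as `51` and the missing one as an empty field, which can then be loaded as NULL.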


## 12) CSV + Parquet exports (JSONB-ready)

In [73]:
import json as _json

def export_csv_parquet_jsonb(df: pd.DataFrame, csv_path: Path, parq_path: Path, json_cols: list[str]):
    # Parquet (pyarrow)
    df.to_parquet(parq_path, index=False)

    # CSV: lists/dicts are serialized to JSON and missing values become "[]";
    # values that are already strings are written unchanged (not validated)
    df_csv = df.copy()
    for c in json_cols:
        if c in df_csv.columns:
            def _to_json(v):
                if isinstance(v, (list, dict)):
                    return _json.dumps(v, ensure_ascii=False)
                if v is None or (isinstance(v, float) and pd.isna(v)):
                    return "[]"
                s = str(v).strip()
                if s == "":
                    return "[]"
                return s
            df_csv[c] = df_csv[c].map(_to_json)

    df_csv.to_csv(csv_path, index=False)

# pivots (audit)
ms_titles_ex.to_csv(MS_TITLES_EX_CSV, index=False)
kitsu_core.to_csv(KITSU_CORE_CSV, index=False)
kitsu_titles_ex.to_csv(KITSU_TITLES_EX_CSV, index=False)

# ambiguous (optional)
if len(ms_ambiguous) > 0:
    ms_ambiguous.to_csv(AMBIG_CSV, index=False)
    print("Wrote:", AMBIG_CSV)

# mapping
export_csv_parquet_jsonb(ms_kitsu_map, MAP_CSV, MAP_PARQ, json_cols=[])

# enriched MS: JSON columns to keep JSONB-ready
json_cols_plus = []
for c in [
    "series_other_titles","series_other_titles_json","series_statuses","series_related_works","series_tags","series_genres",
    "series_tags_enriched","series_genres_enriched",
    "kitsu_categories_json","kitsu_genres_json","kitsu_tags_all_json"
]:
    if c in ms_plus.columns:
        json_cols_plus.append(c)

export_csv_parquet_jsonb(ms_plus, MS_PLUS_CSV, MS_PLUS_PARQ, json_cols=json_cols_plus)

print("Exported:")
print(" -", MAP_CSV)
print(" -", MS_PLUS_CSV)
print(" - pivots:", MS_TITLES_EX_CSV, KITSU_CORE_CSV, KITSU_TITLES_EX_CSV)

Wrote: /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/ms_kitsu_ambiguous.csv
Exported:
 - /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/ms_kitsu_map.csv
 - /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/ms_series_enriched_plus_kitsu.csv
 - pivots: /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/ms_titles_exploded.csv /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/kitsu_series_core.csv /home/maxime/python/certification/preparation_bdd/out_ms_final/c3_ms_kitsu_v2/kitsu_titles_exploded.csv
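The export helper writes string values to the JSON columns unchanged, so a value that is not valid JSON would still reach the CSV. If stricter output is wanted before loading into a JSONB column, each value could be validated first. The sketch below is one possible approach. The function name and the choice to wrap invalid text in a one-element array are assumptions.

```python
import json

def to_jsonb_text(v, default="[]"):
    """Return text that json.loads accepts, intended for a JSONB column."""
    if v is None or (isinstance(v, float) and v != v):  # None or NaN
        return default
    if isinstance(v, (list, dict)):
        return json.dumps(v, ensure_ascii=False)
    s = str(v).strip()
    if not s:
        return default
    try:
        json.loads(s)  # already valid JSON text: keep it unchanged
        return s
    except ValueError:
        # not valid JSON (e.g. a bare title): wrap it as a one-element JSON array
        return json.dumps([s], ensure_ascii=False)

print(to_jsonb_text(["Action", "Comédie"]))
print(to_jsonb_text("Plain title"))
```

Python's `json.loads` is slightly more permissive than a strict JSON parser (it accepts `NaN` and `Infinity`), so this check is a first filter rather than a guarantee.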


## 13) Quick checks

In [74]:
print("Mapping methods:")
print(ms_kitsu_map["match_method"].value_counts())

print("\nTop 10 lowest fuzzy scores (if any):")
low = (ms_kitsu_map[ms_kitsu_map["match_method"].str.startswith("fuzzy")]
       .sort_values("match_score")
       .head(10))
low[["series_id","kitsu_id","match_score","ms_title","matched_title_norm"]]

Mapping methods:
match_method
exact_title_norm_scored     5010
fuzzy_title_norm_primary     598
Name: count, dtype: int64

Top 10 lowest fuzzy scores (if any):


Unnamed: 0,series_id,kitsu_id,match_score,ms_title,matched_title_norm
5354,15369,2281,92.0,Star Ocean Second Story,star ocean the second story
5293,11393,15876,92.0,Osamu Akimoto - Tanpenshu,akimoto osamu sf tanpenshuu
5165,58608,60128,92.063492,Isekai no Shuyaku wa Wareware da!,makai no shuyaku wa wareware da
5037,66142,55445,92.063492,Kubo-san wa Boku wo Yurusanai,kubo san wa boku mobu wo yurusanai
5281,58436,7993,92.105263,"Ana, Moji, Ketsueki Nado ga Arawareru",ana moji ketsueki nado ga arawareru manga
5314,11525,2429,92.134831,Busa Men Danshi - Ikemen Kareshi no Tsukurikata,busamen danshi ikemen kareshi no tsukurikata
5398,65945,22092,92.307692,Season,seasons
5342,65770,70908,92.307692,5 Seconds Before the Witch Falls in Love,5 seconds before a witch falls in love
5371,55754,36632,92.307692,Shônen Shôjo,shounen shoujo
5125,6361,16301,92.307692,Hit man,hitman
