### Purpose & Scope – `02_cleaning.ipynb`

This notebook carries out the **data-quality cleaning phase** of our TFT project.  
Starting from the integrated unit-level table produced in `01_data_integration.ipynb`, we address:

1. **Profiling & completeness** (Week 5–6)  
2. **Duplicate detection** (Week 8)  
3. **Rule-based and pattern-based repairs** (Week 10–11)  
4. **Outlier / anomaly handling** (Week 7)  
5. **Residual unmatched champions** (integration refinement, Week 13)  
6. **Provenance & reproducibility** (Week 14–15)

The cleaned dataset will be stored as `tft_units_long_clean.parquet` (or `.csv.gz` fallback) for downstream modeling.


In [1]:
# Imports & environment snapshot
import sys, platform, json, re
import numpy as np
import pandas as pd
from pathlib import Path

print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__)
np.random.seed(42)

# Project paths
PROJ_ROOT   = Path.cwd().parent
CURATED_DIR = PROJ_ROOT / "curated"

# Safe load (parquet preferred)
def safe_load(base: Path):
    """Load <base>.parquet if present, else <base>.csv.gz."""
    pq = base.with_suffix(".parquet")
    gz = base.with_suffix(".csv.gz")
    if pq.exists():
        return pd.read_parquet(pq)
    elif gz.exists():
        return pd.read_csv(gz, compression="gzip")
    else:
        raise FileNotFoundError(f"No {pq.name} or {gz.name} found.")

# Load integrated table
units = safe_load(CURATED_DIR / "tft_units_long")
print("Loaded shape:", units.shape)

Python: 3.12.3
pandas: 2.2.3
Loaded shape: (3591, 11)


In [3]:
# 📌 Quick profiling summary
summary = (
    units.isna().mean().to_frame("missing_ratio")
         .join(units.nunique().to_frame("n_unique"))
         .join((units.memory_usage(deep=True) / 1e6)
               .to_frame("mem_MB"))
)
display(summary.sort_values("missing_ratio", ascending=False))


| column | missing_ratio | n_unique | mem_MB |
|:--|--:|--:|--:|
| name | 0.220273 | 85 | 0.179812 |
| id | 0.220273 | 85 | 0.179566 |
| title | 0.220273 | 85 | 0.214346 |
| championId | 0.220273 | 85 | 0.028728 |
| championKey | 0.007519 | 108 | 0.197765 |
| match_id | 0.0 | 50 | 0.226233 |
| puuid | 0.0 | 334 | 0.455082 |
| units | 0.0 | 400 | 0.47642 |
| star_level | 0.0 | 4 | 0.028728 |
| unit_list | 0.0 | 277 | 0.205646 |


In [4]:
# 📌 Row-level duplication
dup_rows = units.duplicated().sum()
print("Exact duplicate rows:", dup_rows)

# 📌 Logical key duplication: match_id + puuid + champRaw
dup_keys = units.duplicated(subset=["match_id", "puuid", "champRaw"]).sum()
print("Duplicate (match_id, puuid, champRaw):", dup_keys)

# Drop if any exact duplicates
if dup_rows:
    units = units.drop_duplicates()

Exact duplicate rows: 28
Duplicate (match_id, puuid, champRaw): 33
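
The logical-key count (33) is higher than the exact-duplicate count (28), so some rows appear to share a key while differing in another column. That may be legitimate (for instance, two copies of the same unit on one board), so it is worth inspecting before dropping anything. A minimal sketch on a toy frame follows; the column names mirror the notebook's, the values are invented for illustration.

```python
import pandas as pd

# Toy data: values are invented; column names follow the notebook.
toy = pd.DataFrame({
    "match_id":   ["M1", "M1", "M1", "M1", "M2"],
    "puuid":      ["p1", "p1", "p1", "p1", "p2"],
    "champRaw":   ["TFT14_Jinx", "TFT14_Jinx", "TFT14_Jinx", "TFT14_Elise", "TFT14_Jinx"],
    "star_level": [2, 2, 1, 1, 3],
})

key = ["match_id", "puuid", "champRaw"]

# Step 1: remove exact duplicates (all columns identical).
deduped = toy.drop_duplicates()

# Step 2: keep=False marks every member of a duplicated key group,
# so the conflicting rows can be inspected side by side.
conflicts = deduped[deduped.duplicated(subset=key, keep=False)]
print(conflicts.sort_values(key))
```

Here the two surviving Jinx rows for player `p1` differ only in `star_level`, which is the kind of case that needs a domain decision rather than a blanket drop.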


### Rule-based & Pattern-based Repairs

In [5]:
# 📌 Star level must be 1–3
invalid_star = units[~units["star_level"].between(1, 3)]
print("Invalid star_level rows:", invalid_star.shape[0])

# Repair policy: clip to nearest valid value
units.loc[:, "star_level"] = units["star_level"].clip(1, 3)


Invalid star_level rows: 9
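
Clipping overwrites the original values, so the repair cannot be audited afterwards. One possible variant keeps the raw value and a flag next to the repaired column. This is only a sketch on invented values; `star_level_raw` and `star_repaired` are hypothetical column names that do not exist in the notebook.

```python
import pandas as pd

# Toy values invented for illustration; 0 and 4 fall outside the 1-3 range.
toy = pd.DataFrame({"star_level": [1, 2, 3, 4, 0]})

# Keep the original value and a boolean flag before repairing.
# `star_level_raw` and `star_repaired` are hypothetical column names.
toy["star_level_raw"] = toy["star_level"]
toy["star_repaired"] = ~toy["star_level"].between(1, 3)

# Same repair policy as the notebook: clip to the nearest valid value.
toy["star_level"] = toy["star_level"].clip(1, 3)

print(toy)
```

The flag column makes it possible to report how many values were changed, or to exclude repaired rows in a sensitivity check later.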


In [8]:
# ------------------------------------------------------------
# Champion reference loader with gzip-CSV fallback
# ------------------------------------------------------------
def load_champion_ref(raw_dir: Path, version: str = "15.7.1") -> pd.DataFrame:
    """
    Build a champion reference DataFrame from Data-Dragon JSON.
    """
    champ_json = next((raw_dir / f"dd_{version}").rglob("data/en_US/champion.json"))
    with open(champ_json) as f:
        raw = json.load(f)["data"]
    return (
        pd.json_normalize(raw.values())
          .assign(championId=lambda d: d["key"].astype(int))
          .loc[:, ["championId", "name", "id", "title"]]
    )

def safe_save(df: pd.DataFrame, base: Path, index=False) -> Path:
    """
    Save as <base>.parquet if pyarrow/fastparquet available,
    else as <base>.csv.gz. Return path actually written.
    """
    pq = base.with_suffix(".parquet")
    gz = base.with_suffix(".csv.gz")
    try:
        df.to_parquet(pq, index=index)
        print("Saved champion ref as Parquet:", pq.name)
        return pq
    except ImportError:
        df.to_csv(gz, index=index, compression="gzip")
        print("Saved champion ref as gzip-CSV:", gz.name)
        return gz

def safe_load(base: Path) -> pd.DataFrame | None:
    """
    Try loading <base>.parquet; if not present, try <base>.csv.gz.
    Return None if neither exists.
    """
    pq = base.with_suffix(".parquet")
    gz = base.with_suffix(".csv.gz")
    if pq.exists():
        return pd.read_parquet(pq)
    elif gz.exists():
        return pd.read_csv(gz, compression="gzip")
    return None

# -------------------- main logic ----------------------------
champ_ref_base = CURATED_DIR / "champ_ref"
champ_df = safe_load(champ_ref_base)

if champ_df is not None:
    print("Champion reference loaded from disk.")
else:
    RAW_DIR = PROJ_ROOT / "raw"
    champ_df = load_champion_ref(RAW_DIR, version="15.7.1")
    safe_save(champ_df, champ_ref_base, index=False)
    print("Champion reference rebuilt from Data-Dragon.")

print("Champion rows:", champ_df.shape[0])


Saved champion ref as gzip-CSV: champ_ref.csv.gz
Champion reference rebuilt from Data-Dragon.
Champion rows: 170


In [11]:
# ------------------------------------------------------------
# 0) Build a mask for "still unmatched" rows (robust)
# ------------------------------------------------------------
if "name" in units.columns:
    unmatched_mask = units["name"].isna()
elif "championId" in units.columns:
    unmatched_mask = units["championId"].isna()
else:
    # last-resort fallback: assume rows lacking a valid merge have NaN championKey
    unmatched_mask = units["championKey"].isna()

# short-circuit if nothing is unmatched
if unmatched_mask.sum() == 0:
    print("No unmatched champions found.")
else:
    # --------------------------------------------------------
    # 1) Top 15 raw names that failed to match
    # --------------------------------------------------------
    unmatched_top = (
        units.loc[unmatched_mask, "champRaw"]
             .value_counts()
             .head(15)
    )
    print("Top unmatched champRaw:\n", unmatched_top)

    # --------------------------------------------------------
    # 2) Regex-based quick fixes
    # --------------------------------------------------------
    fix_map = {
        r"^LeBlanc.*":  "LeBlanc",
        r"^Swain.*":    "Swain",
        r"^elise$":     "Elise",
        r"^Rhaast$":    "Kayn",
        r"^jinx$":      "Jinx",
        r"^DrMundo$":   "DrMundo",
    }
    for pat, repl in fix_map.items():
        # na=False treats a missing champRaw as "no match" instead of propagating NaN
        mask = unmatched_mask & units["champRaw"].str.match(pat, case=False, na=False)
        units.loc[mask, "championKey"] = repl

    # --------------------------------------------------------
    # 3) Drop any old metadata columns that may exist
    # --------------------------------------------------------
    cols_to_drop = [c for c in ["championId", "name", "id", "title"]
                    if c in units.columns]
    units = units.drop(columns=cols_to_drop)

    # --------------------------------------------------------
    # 4) Re-merge with champion reference
    # --------------------------------------------------------
    units = units.merge(
        champ_df,
        how="left",
        left_on="championKey",
        right_on="id",
        suffixes=("", "_champ")
    )

    # --------------------------------------------------------
    # 5) Final unmatched count & placeholder
    # --------------------------------------------------------
    final_unmatched = units["name"].isna().sum()
    print("Remaining unmatched rows:", final_unmatched)

    units["championId"] = units["championId"].fillna(-1)
    units["name"]       = units["name"].fillna("UNKNOWN_UNIT")


Top unmatched champRaw:
 champRaw
Rhaast    27
Name: count, dtype: int64
Remaining unmatched rows: 751
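
The cell above counts unmatched rows through `name.isna()` after the merge. An alternative way to diagnose a left join is to use the `indicator` and `validate` parameters of `DataFrame.merge`. The sketch below uses toy stand-ins for `units` and `champ_df`; the frames `units_toy` and `champ_toy` and their values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for `units` and `champ_df`; values are illustrative only.
units_toy = pd.DataFrame({
    "champRaw":    ["TFT14_Jinx", "TFT14_Rhaast", "TFT14_Silco"],
    "championKey": ["Jinx", "Kayn", "Silco"],
})
champ_toy = pd.DataFrame({
    "championId": [222, 141],
    "name":       ["Jinx", "Kayn"],
    "id":         ["Jinx", "Kayn"],
})

merged = units_toy.merge(
    champ_toy,
    how="left",
    left_on="championKey",
    right_on="id",
    validate="many_to_one",  # raises MergeError if `id` is not unique in the reference
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows present only on the left side had no partner in the reference table.
unmatched = merged.loc[merged["_merge"] == "left_only", "champRaw"]
print("Unmatched:", unmatched.tolist())
```

`validate="many_to_one"` also guards against a silent row-count increase, which would otherwise happen if the reference table contained a repeated `id`.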


In [12]:
# -------------------------------------------------------
# Bulk normalisation for remaining unmatched champions
# -------------------------------------------------------
import re

# 1) Build look-ups for quick membership tests
# Map lowercase id -> canonical Data Dragon id, so mixed-case ids such as
# 'DrMundo' survive the round trip (str.title() would yield 'Drmundo').
champ_lookup  = {cid.lower(): cid for cid in champ_df["id"].dropna()}
alias_to_id   = {"rhaast": "kayn"}                     # manual alias dict (extend!)
common_strip  = r"(cougar|dragon|mega|star|summon|prime)$"  # common suffixes

def auto_normalise(raw: str) -> str | None:
    """
    Best-effort conversion of TFT unit string to base champion id.
    Returns the canonical champion id if recognised, else None.
    """
    if not isinstance(raw, str):            # NaN / missing raw name
        return None
    s = raw.lower()
    s = re.sub(r"^tft\d+_", "", s)          # strip 'TFT14_'
    s = re.sub(r"\d+$", "", s)              # drop trailing digits
    s = re.sub(common_strip, "", s)         # drop suffixes
    s = s.strip()

    # manual alias first, then direct lookup; both resolve to the canonical
    # id so that the case-sensitive merge on `id` can match
    s = alias_to_id.get(s, s)
    return champ_lookup.get(s)

# 2) Apply to unmatched rows only
mask_unmatched = units["name"].eq("UNKNOWN_UNIT")
units.loc[mask_unmatched, "championKey"] = (
    units.loc[mask_unmatched, "champRaw"]
         .apply(auto_normalise)
         .fillna(units.loc[mask_unmatched, "championKey"])   # keep existing value if still None
)

# 3) Drop old metadata cols (if present) and re-merge again
cols_to_drop = [c for c in ["championId", "name", "id", "title"] if c in units.columns]
units = units.drop(columns=cols_to_drop)

units = units.merge(
    champ_df, how="left",
    left_on="championKey", right_on="id",
    suffixes=("", "_champ")
)

# 4) Final unmatched count
final_unmatched = units["name"].isna().sum()
print("Unmatched rows after auto-normalise:", final_unmatched)

# Mark residuals
units["championId"] = units["championId"].fillna(-1)
units["name"]       = units["name"].fillna("UNKNOWN_UNIT")

Unmatched rows after auto-normalise: 461


#### Handling the 461 residual “UNKNOWN_UNIT” rows  

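After automatic normalisation, **461 of the 3,591 originally loaded unit records (12.8 %)** still have no official champion mapping. The drop cell below reports 12.9 % because it divides by the current table, which is smaller once exact duplicates are removed.  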
A spot check shows they are:

* PvE / “miniboss” tokens (e.g., *Beardy*, *Blue*, *Lieutenant*)  
* Temporary summons (*JayceSummon*, *LuxLaser*)  
* Silco (non-LoL character) and other TFT-only specials

Because these entities **do not correspond to real champions**, keeping them would only
1. distort champion-level analyses, and  
2. complicate feature engineering (no stats in Data Dragon).

**Action** – we drop them from the cleaned dataset, record the row count in the provenance
meta file, and mention the rationale in the final report (Week 6: Missing values, Week 13: Knowledge-based cleaning).
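
The spot check described above can be reproduced with a frequency table of raw names among the placeholder rows. The following is a minimal sketch on a toy frame: the raw names are the examples listed above, while the rows and counts are invented.

```python
import pandas as pd

# Toy stand-in for `units` after the placeholder fill; rows are invented,
# the raw names are the examples listed in the text.
toy = pd.DataFrame({
    "champRaw": ["Beardy", "Blue", "JayceSummon", "LuxLaser", "Silco", "Silco", "Jinx"],
    "name":     ["UNKNOWN_UNIT"] * 6 + ["Jinx"],
})

# Frequency table of raw names among the unresolved rows.
residual = toy.loc[toy["name"].eq("UNKNOWN_UNIT"), "champRaw"].value_counts()
print(residual)

# Share of the table that dropping them would remove.
share = residual.sum() / len(toy)
print(f"Residual share: {share:.1%}")
```

Running the same two statements on the real `units` table before the drop gives a list that can be pasted into the final report as evidence for the decision.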


In [13]:
# 📌 Drop residual UNKNOWN_UNIT rows and log the removal
before_rows = units.shape[0]

units = units[units["name"] != "UNKNOWN_UNIT"].reset_index(drop=True)

after_rows  = units.shape[0]
dropped     = before_rows - after_rows
print(f"Dropped {dropped} residual non-champion rows "
      f"({dropped / before_rows:.1%} of the table).")

# >>> continue with outlier flagging / safe_save / meta-json …

Dropped 461 residual non-champion rows (12.9% of the table).


In [15]:
# 📌 Redefine safe_save (same logic as in In [8]) with a generic log message
def safe_save(df, out_base: Path, index=False):
    pq = out_base.with_suffix(".parquet")
    gz = out_base.with_suffix(".csv.gz")
    try:
        df.to_parquet(pq, index=index)
        print("Saved:", pq.name)
        return pq
    except ImportError:
        df.to_csv(gz, index=index, compression="gzip")
        print("Saved:", gz.name)
        return gz

out_base = CURATED_DIR / "tft_units_long_clean"
out_path = safe_save(units, out_base, index=False)

# 📌 Minimal provenance JSON
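meta = {
    # Timestamp.utcnow() is deprecated in pandas 2.2; now("UTC") is the replacement
    "generated_at": pd.Timestamp.now("UTC").isoformat(),
    "input_file":   "tft_units_long.*",
    "output_file":  out_path.name,
    "rows":         int(units.shape[0]),
    "unmatched":    int((units['name'] == 'UNKNOWN_UNIT').sum()),
    # row count removed in In [13], as promised in the rationale above
    "dropped_non_champion": int(dropped),
}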
with open(out_base.with_suffix(".meta.json"), "w") as f:
    json.dump(meta, f, indent=2)
print("Meta written.")


Saved: tft_units_long_clean.csv.gz
Meta written.
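
For the reproducibility goal, a quick round-trip check can confirm that the saved file and its meta record agree. The sketch below writes a toy frame to a temporary directory using the gzip-CSV path and then compares the recorded row count with the reloaded one. The file stem `toy_clean` and the frame contents are assumptions for illustration, not part of the project.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned `units`; values are invented.
toy = pd.DataFrame({"name": ["Jinx", "Kayn"], "star_level": [2, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_base = Path(tmp) / "toy_clean"              # hypothetical file stem
    data_path = out_base.with_suffix(".csv.gz")
    meta_path = out_base.with_suffix(".meta.json")

    # Write data and meta the same way the notebook's fallback branch does.
    toy.to_csv(data_path, index=False, compression="gzip")
    meta_path.write_text(json.dumps({"output_file": data_path.name,
                                     "rows": int(toy.shape[0])}))

    # Reload both and compare the recorded row count with the actual one.
    meta = json.loads(meta_path.read_text())
    reloaded = pd.read_csv(Path(tmp) / meta["output_file"], compression="gzip")
    rows_match = meta["rows"] == reloaded.shape[0]

print("Row count matches meta:", rows_match)
```

The same comparison could be run against `tft_units_long_clean.*` at the start of the modeling notebook to catch a stale or partially written file.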
