# 02b · Cleaning – Graduation Rates

**Purpose**  
Same procedure as 02a, but for upper-secondary (Sek II) graduation rates.

**Key steps**  
1. Load sheets and cast numeric columns  
2. Add context columns (`aggregation_level`, …)  
3. Remove duplicates (dimension key + aggregation level)  
4. Export: `tmp/abs_clean.parquet`

**Result**  
`tmp/abs_clean.parquet`



In [1]:
# Imports & constants
import pandas as pd
from pathlib import Path

DATA_DIR = Path("../../data")
TMP_DIR  = Path("../../tmp")
SRC_ABS  = DATA_DIR / "bfs_data_abschlussquote.xlsx"
TMP_FILE = TMP_DIR  / "abs_clean.parquet"

TMP_DIR.mkdir(exist_ok=True, parents=True)
print("Source:", SRC_ABS)


Source: ..\..\data\bfs_data_abschlussquote.xlsx


In [2]:
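# Helper functions
def header_row(xls, sh):
    """Return the index of the first row (within the top 15) with >= 3 non-empty cells."""
    # Read the top of the sheet without a header so title and spacer rows stay visible.
    top = pd.read_excel(xls, sheet_name=sh, nrows=15, header=None)
    # Raises StopIteration if no such row exists in the first 15 rows.
    return next(i for i, r in top.iterrows() if r.notna().sum() >= 3)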
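
The header detection can be tried without the Excel file. The following is a minimal sketch on a toy frame; the data and the name `first_dense_row` are hypothetical and only mirror the rule used in `header_row` (first row with at least three non-empty cells).

```python
import pandas as pd

# Toy stand-in for the top of a sheet read with header=None (hypothetical data).
top = pd.DataFrame([
    ["Graduation rates", None, None],   # title row: 1 non-empty cell
    [None, None, None],                 # empty spacer row
    ["kanton", "jahr", "rate"],         # first row with >= 3 non-empty cells
    ["ZH", 2020, 91.2],
])

def first_dense_row(frame, min_cells=3):
    """Return the index of the first row with at least `min_cells` non-empty cells."""
    counts = frame.notna().sum(axis=1)
    dense = counts[counts >= min_cells]
    if dense.empty:
        # Explicit error instead of the bare StopIteration raised by next().
        raise ValueError("no header-like row found")
    return int(dense.index[0])

hdr = first_dense_row(top)
print(hdr)  # 2
```

Unlike `next(...)` without a default, this variant fails with a descriptive error when no row qualifies, which may be easier to debug if a sheet layout changes.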


In [3]:
# Load & clean
xls = pd.ExcelFile(SRC_ABS)
DATA_SHEETS = [s for s in xls.sheet_names if s.endswith("_Data")]

rows = []
for sh in DATA_SHEETS:
    agg = sh.split("_")[0]          # e.g. 'T1'
    hdr = header_row(xls, sh)
    df  = pd.read_excel(xls, sheet_name=sh, header=hdr)

    # Numeric columns: every column whose name suggests int/float content
    # (str(c) keeps the check safe if a column label is not a string)
    num_like = [c for c in df.columns if
                any(tag in str(c).lower() for tag in ["anz", "%", "rate", "cnt"])]
    for c in num_like:
        df[c] = pd.to_numeric(df[c], errors="coerce")

    # Standardization: trim whitespace in string columns (no case change)
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

    df["aggregation_level"] = agg
    rows.append(df)

    print(f"✓ {sh}: {len(df)} rows")


✓ T1_SekII_1st_25_Merkm_Data: 13 rows
✓ T2_SekII_1st_25_Kant_Data: 27 rows
✓ T3_Matura_Merkm_Data: 13 rows
✓ T4_Matura_Kant_Data: 27 rows
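
To make the casting step concrete, here is a small sketch of the tag-based column selection and the `errors="coerce"` behaviour. The column names and values are hypothetical and are not taken from the BFS file; only the tag list comes from the cell above.

```python
import pandas as pd

# Hypothetical column names and values, chosen only to illustrate the tag matching.
df = pd.DataFrame({
    "kanton": [" ZH", "BE ", "GE"],
    "anz_abschluesse": ["120", "95", "x"],   # "x" stands in for a suppressed value
    "rate_%": ["91.5", "(88.0)", "90"],
    2020: [1, 2, 3],                         # non-string column label
})

TAGS = ["anz", "%", "rate", "cnt"]

# str(c) keeps the check safe when a column label is not a string.
num_like = [c for c in df.columns if any(tag in str(c).lower() for tag in TAGS)]

for c in num_like:
    # errors="coerce" turns unparseable entries into NaN instead of raising.
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Count how many values were lost to coercion, per column.
lost = df[num_like].isna().sum()
print(num_like)
print(lost)
```

Because coercion is silent, a check like `lost` can help spot columns where many values turned into NaN, for example when numbers are wrapped in brackets or carry footnote markers.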


In [4]:
# Remove duplicates & export
df_all = pd.concat(rows, ignore_index=True)

key_cols = ["aggregation_level",        # derived from the sheet name
            "merkmal", "kategorie",     # BFS hierarchy
            "jahr"] + [c for c in df_all.columns if str(c).endswith("_code")]
# Deduplicate once and compare the row counts before and after.
before = len(df_all)
df_all = df_all.drop_duplicates(subset=key_cols)
after  = len(df_all)

print(f"Duplicates: {before} → {after}")

df_all.to_parquet(TMP_FILE, index=False)
print("Parquet written:", TMP_FILE, "| rows:", len(df_all))


Duplicates: 80 → 80
Parquet written: ..\..\tmp\abs_clean.parquet | rows: 80
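
The run above found no duplicates (80 → 80). As an illustration of what key-based deduplication would do if it did, the following sketch uses hypothetical rows; only the column names in `key_cols` mirror the notebook.

```python
import pandas as pd

# Hypothetical rows; only the key column names mirror the notebook.
df_all = pd.DataFrame({
    "aggregation_level": ["T1", "T1", "T2", "T1"],
    "merkmal":   ["Geschlecht", "Geschlecht", "Geschlecht", "Geschlecht"],
    "kategorie": ["Frauen", "Frauen", "Frauen", "Männer"],
    "jahr":      [2020, 2020, 2020, 2020],
    "rate":      [93.0, 93.5, 93.0, 89.0],
})

key_cols = ["aggregation_level", "merkmal", "kategorie", "jahr"]

# Rows that repeat the key of an earlier row (these are the ones that get dropped).
dupes = df_all[df_all.duplicated(subset=key_cols, keep="first")]

deduped = df_all.drop_duplicates(subset=key_cols)
print(len(df_all), "→", len(deduped))  # 4 → 3
```

Note that rows 0 and 1 share the same key but carry different `rate` values; `drop_duplicates(subset=...)` keeps the first and drops the second without comparing the value columns. Inspecting `dupes` before dropping is one way to notice such conflicts.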
