# 01 – Data Preparation (Raw TXT → CSV)

Goal: convert the BBC News dataset (TXT files in one folder per class) into a single, uniform CSV file: `data/processed/bbc_news.csv`.


In [1]:
from pathlib import Path
import pandas as pd

## 1) Define paths and output


In [2]:
# The notebook lives in notebooks/ → the project root is one level up
# (this assumes the kernel's working directory is notebooks/)
PROJECT_ROOT = Path.cwd().parent

RAW_DIR = PROJECT_ROOT / "data" / "raw" / "bbc"
OUT_DIR = PROJECT_ROOT / "data" / "processed"
OUT_DIR.mkdir(parents=True, exist_ok=True)

OUT_CSV = OUT_DIR / "bbc_news.csv"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("RAW_DIR exists:", RAW_DIR.exists(), RAW_DIR)
print("OUT_CSV:", OUT_CSV)


PROJECT_ROOT: c:\CAS\cas-ml-document-classification
RAW_DIR exists: True c:\CAS\cas-ml-document-classification\data\raw\bbc
OUT_CSV: c:\CAS\cas-ml-document-classification\data\processed\bbc_news.csv
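
`Path.cwd().parent` only gives the right root when the kernel is started from `notebooks/`. As a minimal sketch of a more tolerant alternative, the root could be located by walking upwards until a marker directory is found. The helper name `find_project_root` and the choice of `data` as marker are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demo on a throwaway tree that mimics the layout used in this notebook
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project"
    (root / "data" / "raw" / "bbc").mkdir(parents=True)
    (root / "notebooks").mkdir()

    found_from_notebooks = find_project_root(root / "notebooks")
    found_from_root = find_project_root(root)
    same_root = found_from_notebooks == found_from_root == root.resolve()

print("Same root from both start points:", same_root)
```

In the notebook this would be called as `find_project_root(Path.cwd())`, so the cell works whether it is run from the project root or from `notebooks/`.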


## 2) Check class folders


In [3]:
labels = sorted([p.name for p in RAW_DIR.iterdir() if p.is_dir()])
labels


['business', 'entertainment', 'politics', 'sport', 'tech']
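
The folder listing above is checked by eye. A small sketch of an automated version could compare the folders against the expected labels and make sure none of them is empty. The helper `check_class_folders` is hypothetical; the expected labels are the five shown in the output above.

```python
from pathlib import Path
import tempfile

# Labels taken from the folder listing above
EXPECTED_LABELS = {"business", "entertainment", "politics", "sport", "tech"}


def check_class_folders(raw_dir, expected_labels):
    """Return {label: number of .txt files}; raise if the layout looks wrong."""
    counts = {
        p.name: sum(1 for _ in p.glob("*.txt"))
        for p in sorted(raw_dir.iterdir())
        if p.is_dir()
    }
    missing = set(expected_labels) - set(counts)
    unexpected = set(counts) - set(expected_labels)
    empty = [name for name, n in counts.items() if n == 0]
    if missing or unexpected or empty:
        raise ValueError(
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}, empty={empty}"
        )
    return counts


# Demo on a throwaway tree with one file per class
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    for label in EXPECTED_LABELS:
        (raw / label).mkdir()
        (raw / label / "001.txt").write_text("demo", encoding="utf-8")
    demo_counts = check_class_folders(raw, EXPECTED_LABELS)

print(demo_counts)
```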

## 3) Read TXT files and build the DataFrame


In [15]:
rows = []
for label in labels:
    label_dir = RAW_DIR / label
    # sorted() keeps the row order reproducible across platforms
    for txt_path in sorted(label_dir.glob("*.txt")):
        # Decode as UTF-8; errors="ignore" silently drops bytes that are not valid UTF-8
        text = txt_path.read_text(encoding="utf-8", errors="ignore")
        rows.append(
            {
                "id": txt_path.stem,
                "label": label,
                "text": text,
                "source_file": str(txt_path.relative_to(PROJECT_ROOT)),
            }
        )

df = pd.DataFrame(rows)
df.head()


|   | id | label | text | source_file |
|---|----|-------|------|-------------|
| 0 | 1 | business | Ad sales boost Time Warner profit\n\nQuarterly... | data\raw\bbc\business\001.txt |
| 1 | 2 | business | Dollar gains on Greenspan speech\n\nThe dollar... | data\raw\bbc\business\002.txt |
| 2 | 3 | business | Yukos unit buyer faces loan claim\n\nThe owner... | data\raw\bbc\business\003.txt |
| 3 | 4 | business | High fuel prices hit BA's profits\n\nBritish A... | data\raw\bbc\business\004.txt |
| 4 | 5 | business | Pernod takeover talk lifts Domecq\n\nShares in... | data\raw\bbc\business\005.txt |
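
Reading with `errors="ignore"` drops any byte that is not valid UTF-8 without a trace, so a character such as a pound sign stored in a legacy encoding would simply vanish. The following sketch shows one possible alternative: try strict UTF-8 first and fall back to a second encoding. The helper `read_text_with_fallback` and the choice of `cp1252` as fallback are assumptions; the actual encoding of the raw files is not stated in this notebook.

```python
from pathlib import Path
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly; return (text, encoding_used)."""
    raw = path.read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep a visible replacement character instead of dropping bytes
    return raw.decode("utf-8", errors="replace"), "utf-8 (replace)"


# Demo: a file containing a pound sign encoded as cp1252 (byte 0xA3)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"Profit up \xa35m")

    decoded, used_encoding = read_text_with_fallback(sample)
    lossy = sample.read_text(encoding="utf-8", errors="ignore")

print(repr(decoded), used_encoding)
print(repr(lossy))
```

In the demo, the fallback keeps the pound sign, while the `errors="ignore"` variant returns the text without it.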


In [11]:
df["label"].value_counts()


label
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

In [12]:
df.shape

(2225, 4)
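
The `id` column is taken from the file stem alone. If file numbering restarts in every class folder (the paths above show `001.txt` under `business`; that the other folders follow the same scheme is an assumption), `id` is not unique across the dataset. A minimal sketch of a uniqueness check and a combined key, using toy data and a hypothetical `uid` column:

```python
import pandas as pd

# Toy data: the same file stem appears in two different classes
demo = pd.DataFrame(
    {
        "id": ["001", "002", "001"],
        "label": ["business", "business", "sport"],
    }
)

id_is_unique = demo["id"].is_unique

# Combine label and stem into a key that is unique across classes
demo["uid"] = demo["label"] + "_" + demo["id"]
uid_is_unique = demo["uid"].is_unique

print("id unique: ", id_is_unique)
print("uid unique:", uid_is_unique)
```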

## 4) Quick checks (classes, lengths)


In [16]:
# Class check
print("Number of classes:", df["label"].nunique())
print("Labels:", sorted(df["label"].unique()))

# Text lengths (relevant for token limits later on)
df["n_chars"] = df["text"].str.len()
print(df["n_chars"].describe())

# Optional: mean length per class
df.groupby("label")["n_chars"].mean().sort_values()


Number of classes: 5
Labels: ['business', 'entertainment', 'politics', 'sport', 'tech']
count     2225.000000
mean      2265.160000
std       1364.094636
min        503.000000
25%       1448.000000
50%       1967.000000
75%       2804.000000
max      25485.000000
Name: n_chars, dtype: float64


label
sport            1897.315068
entertainment    1928.593264
business         1986.727451
politics         2684.088729
tech             2976.359102
Name: n_chars, dtype: float64
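
The summary above shows a maximum of 25,485 characters against a 75th percentile of 2,804, so a few documents are much longer than the rest. As a rough sketch, such documents could be flagged before tokenization. The limit `MAX_CHARS` is a hypothetical value chosen for the toy data, and the whitespace word count is only a crude stand-in for real token counts.

```python
import pandas as pd

# Toy data standing in for df; the last text is deliberately very long
demo = pd.DataFrame(
    {
        "label": ["sport", "tech", "tech"],
        "text": ["short text", "a slightly longer text here", "x " * 6000],
    }
)

demo["n_chars"] = demo["text"].str.len()
# Whitespace-separated words as a crude proxy for token counts
demo["n_words"] = demo["text"].str.split().str.len()

MAX_CHARS = 10_000  # hypothetical limit, tune to the tokenizer actually used
too_long = demo[demo["n_chars"] > MAX_CHARS]

print(demo[["label", "n_chars", "n_words"]])
print("Rows above limit:", too_long.index.tolist())
```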

## 5) Save CSV + reload test


In [19]:
# Save with a UTF-8 BOM (so Excel detects the encoding)
# Note: the helper column n_chars added above is written to the CSV as well
df.to_csv(OUT_CSV, index=False, encoding="utf-8-sig")
print(f"✅ CSV saved: {OUT_CSV}")

# Reload test (so notebook 02 only has to load the CSV)
df_check = pd.read_csv(OUT_CSV)
print("Reload shape:", df_check.shape)
print("Reload classes:", df_check["label"].nunique())
df_check.head()


✅ CSV saved: c:\CAS\cas-ml-document-classification\data\processed\bbc_news.csv
Reload shape: (2225, 5)
Reload classes: 5


|   | id | label | text | source_file | n_chars |
|---|----|-------|------|-------------|---------|
| 0 | 1 | business | Ad sales boost Time Warner profit\n\nQuarterly... | data\raw\bbc\business\001.txt | 2559 |
| 1 | 2 | business | Dollar gains on Greenspan speech\n\nThe dollar... | data\raw\bbc\business\002.txt | 2252 |
| 2 | 3 | business | Yukos unit buyer faces loan claim\n\nThe owner... | data\raw\bbc\business\003.txt | 1551 |
| 3 | 4 | business | High fuel prices hit BA's profits\n\nBritish A... | data\raw\bbc\business\004.txt | 2401 |
| 4 | 5 | business | Pernod takeover talk lifts Domecq\n\nShares in... | data\raw\bbc\business\005.txt | 1569 |
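
With default settings, `pd.read_csv` infers column types, so an id written as `001` comes back as the integer `1`. If the zero-padded form matters downstream, the column type can be pinned at load time. A minimal sketch on toy data, reading with the same `utf-8-sig` encoding that was used for writing:

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame(
    {
        "id": ["001", "002"],
        "label": ["business", "sport"],
        "text": ["Profit up \u00a35m", "Cup final"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    demo.to_csv(csv_path, index=False, encoding="utf-8-sig")

    # Default type inference turns "001" into the integer 1
    default_reload = pd.read_csv(csv_path, encoding="utf-8-sig")
    # Pinning the dtype keeps the zero-padded string
    typed_reload = pd.read_csv(csv_path, encoding="utf-8-sig", dtype={"id": str})

ids_default = default_reload["id"].tolist()
ids_typed = typed_reload["id"].tolist()

print("default:", ids_default)
print("typed:  ", ids_typed)
```

Notebook 02 could use the same `dtype={"id": str}` argument when loading `bbc_news.csv`.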
