# 06: M2/S2 - Supervised Classification (Sentence Level)

**Goal:** Train classifiers at the level of whole sentences.
**Hypothesis:** A sentence carries more context than a single word. Which works better: the average of all tokens (**Mean Pooling**) or the special token (**[CLS]**)?

**Scenarios:**
* **S2a - Gold Balanced:** Training on Gold data (undersampled to a 1:1 class ratio).
* **S2b - Hybrid:** Training on a mix of Gold (L0) + Silver (L1).

**Pooling Methods:**
* **Mean:** Average of the embeddings of all words in the sentence.
* **CLS:** Embedding of the special [CLS] token (BERT's representation of the whole sequence).
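
The two pooling strategies can be sketched in NumPy as follows. This is a minimal illustration, assuming token embeddings of shape `(batch, seq_len, hidden)` with the [CLS] vector at position 0 and a 0/1 attention mask marking real tokens. The function names `mean_pool` and `cls_pool` and the toy data are hypothetical and are not taken from the project's modules.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

def cls_pool(token_embeddings):
    """Take the vector at position 0, where BERT-style models place [CLS]."""
    return token_embeddings[:, 0, :]

# Toy batch: 2 sentences, 4 token positions, hidden size 3 (made-up values)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # second sentence has two padding positions

X_mean = mean_pool(tokens, mask)
X_cls = cls_pool(tokens)
print(X_mean.shape, X_cls.shape)
```

Both variants yield one fixed-size vector per sentence, so the same downstream classifier can be trained on either.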

## 1. Setup & Imports

In [1]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import logging
from pathlib import Path
import os
from itables import show

from sklearn.metrics import classification_report

# Auto-reload modules for development
%load_ext autoreload
%autoreload 2
%matplotlib inline

# Add src to path
current_dir = os.getcwd()
src_dir = os.path.abspath(os.path.join(current_dir, '..', 'src'))
if src_dir not in sys.path:
    sys.path.append(src_dir)


# Project modules
import config
import data_splitting
import models
import evaluation
import visualization

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Visualization settings
visualization.setup_style()

print(f"✅ Setup complete. Results dir: {config.RESULTS_DIR}")

⚙️ Configuration loaded. Device: cpu


2026-02-05 15:21:26,534 - INFO - 🎨 Visualization style set: whitegrid


✅ Setup complete. Results dir: C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\results


## 2. Data Check
We verify the sentence counts for each scenario.

In [2]:
SCENARIOS_TO_CHECK = ['baseline', 'hybrid']
POOLING = 'mean'  # Counts are identical for mean and cls; only X differs

print(f"{'='*80}")
print("📊 DATA CHECK REPORT (M2/S2 - Sentence Level)")
print(f"{'='*80}")

for scenario in SCENARIOS_TO_CHECK:
    print(f"\n🔹 SCENARIO: {scenario.upper()}")
    try:
        data = data_splitting.get_train_val_test_splits(
            scenario=scenario,
            level='sentence',
            pooling=POOLING,
            random_state=42
        )
        
        def print_stats(name, y):
            n_l0, n_l1 = np.sum(y == 0), np.sum(y == 1)
            ratio = n_l0 / n_l1 if n_l1 > 0 else 0
            print(f"   {name:<6} | Total: {len(y):<5} | L0: {n_l0:<4} | L1: {n_l1:<4} | Ratio: {ratio:.1f}:1")

        print_stats("TRAIN", data['y_train'])
        print_stats("VAL",   data['y_val'])
        print_stats("TEST",  data['y_test'])
        
    except Exception as e:
        print(f"   ❌ Error: {e}")

2026-02-05 15:21:45,187 - INFO - 📊 Preparing scenario: baseline (sentence level, aggressive filter)


📊 DATA CHECK REPORT (M2/S2 - Sentence Level)

🔹 SCENARIO: BASELINE


2026-02-05 15:21:45,386 - INFO - ✅ Loaded 1560 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\gold_sentences.pkl
2026-02-05 15:21:46,041 - INFO - ✅ Loaded 5709 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\silver_sentences.pkl
2026-02-05 15:21:46,081 - INFO - Splitting 520 documents: 104 test, 41 val, 375 train
2026-02-05 15:21:46,095 - INFO - ✅ Document-level split completed:
2026-02-05 15:21:46,099 - INFO -    Train: 376 docs, 376 samples
2026-02-05 15:21:46,099 - INFO -    Val:   41 docs, 41 samples
2026-02-05 15:21:46,101 - INFO -    Test:  103 docs, 103 samples
2026-02-05 15:21:46,103 - INFO -    ✓ No document leakage detected between splits
2026-02-05 15:21:46,107 - INFO - ✅ Scenario data prepared:
2026-02-05 15:21:46,107 - INFO -    Train: 376 samples (L0: 136, L1: 240)
2026-02-05 15:21:46,107 - INFO -    Val:   41 samples (L0: 15, L1: 26)
2026-02-05 15:21:46,114 - INFO -    Test:  103 sam

   TRAIN  | Total: 376   | L0: 136  | L1: 240  | Ratio: 0.6:1
   VAL    | Total: 41    | L0: 15   | L1: 26   | Ratio: 0.6:1
   TEST   | Total: 103   | L0: 37   | L1: 66   | Ratio: 0.6:1

🔹 SCENARIO: HYBRID


2026-02-05 15:21:46,830 - INFO - ✅ Loaded 5709 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\silver_sentences.pkl
2026-02-05 15:21:46,847 - INFO - Splitting 1472 documents: 294 test, 117 val, 1061 train
2026-02-05 15:21:46,848 - INFO - ✅ Document-level split completed:
2026-02-05 15:21:46,848 - INFO -    Train: 1062 docs, 1062 samples
2026-02-05 15:21:46,848 - INFO -    Val:   117 docs, 117 samples
2026-02-05 15:21:46,862 - INFO -    Test:  293 docs, 293 samples
2026-02-05 15:21:46,867 - INFO -    ✓ No document leakage detected between splits
2026-02-05 15:21:46,867 - INFO -    Balanced via undersampling: 136 + 926 → 272
2026-02-05 15:21:46,867 - INFO - ✅ Scenario data prepared:
2026-02-05 15:21:46,878 - INFO -    Train: 272 samples (L0: 136, L1: 136)
2026-02-05 15:21:46,882 - INFO -    Val:   117 samples (L0: 15, L1: 102)
2026-02-05 15:21:46,884 - INFO -    Test:  293 samples (L0: 37, L1: 256)


   TRAIN  | Total: 272   | L0: 136  | L1: 136  | Ratio: 1.0:1
   VAL    | Total: 117   | L0: 15   | L1: 102  | Ratio: 0.1:1
   TEST   | Total: 293   | L0: 37   | L1: 256  | Ratio: 0.1:1
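
The log above reports a document-level split with no leakage between train, validation, and test. How `data_splitting` implements this is not shown here; the following is only a sketch of the general idea using scikit-learn's `GroupShuffleSplit`, with hypothetical toy document IDs. All sentences that share a document ID end up on the same side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 12 sentences drawn from 6 documents
doc_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
X = np.arange(len(doc_ids)).reshape(-1, 1)
y = np.array([0, 1] * 6)

# An integer test_size is the absolute number of groups (documents) held out
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=doc_ids))

# Leakage check: no document may appear on both sides
train_docs = set(doc_ids[train_idx].tolist())
test_docs = set(doc_ids[test_idx].tolist())
leak = train_docs & test_docs
print(f"train docs: {len(train_docs)}, test docs: {len(test_docs)}, leaked: {len(leak)}")
```

A validation split can be carved out the same way by applying a second `GroupShuffleSplit` to the training portion.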


## 3. Experiment Loop (Pooling & Scenarios)
We train every combination of **Scenario x Pooling x Model**.

**Scenarios:**
* **S2a:** Baseline (Gold) + Manual Undersampling
* **S2b:** Hybrid (Gold L0 + Silver L1)

**Pooling:** `mean` vs `cls`
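
The manual undersampling used for S2a can be sketched as a small helper that shrinks every class to the size of the smallest one. The name `undersample_to_balance` is hypothetical and not part of the project's modules. The toy labels reuse the baseline training counts from the data check (L0: 136, L1: 240), where L1 is the larger class.

```python
import numpy as np

def undersample_to_balance(X, y, seed=42):
    """Randomly drop samples so that every class has as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # mix the classes so they are not stored in blocks
    return X[keep], y[keep]

# Toy data with the class counts of the baseline training split
y = np.array([0] * 136 + [1] * 240)
X = np.arange(len(y)).reshape(-1, 1)

X_bal, y_bal = undersample_to_balance(X, y)
print(X_bal.shape[0], int((y_bal == 0).sum()), int((y_bal == 1).sum()))
```

Sampling to the minority size rather than to a fixed class avoids asking `choice` for more items than a class holds, which fails when `replace=False`.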

In [None]:
# Path for results
RESULTS_PATH = config.RESULTS_DIR / "M2_S2_experiment_results_v1.csv"

# Experiment definitions (scenarios)
SCENARIOS = [
    {'id': 'S2a', 'name': 'Gold Balanced', 'scenario': 'baseline', 'balance_train': True},
    {'id': 'S2b', 'name': 'Hybrid (G+S)',  'scenario': 'hybrid',   'balance_train': False},  # Hybrid is already balanced by the module
]

POOLING_METHODS = ['mean', 'cls']

MODELS_TO_TEST = ['LogReg', 'SVM (RBF)']  # Kept short for speed; SVM (Lin) can be added.
# MODELS_TO_TEST = ["LogReg", "SVM (RBF)", "XGBoost", "Dummy", "SVM (Lin)", "NaiveBayes", "RandForest"]

if models.XGBOOST_AVAILABLE:
    MODELS_TO_TEST.append('XGBoost')


In [None]:
results = []
best_f1 = 0.0
best_run = None

print("🚀 STARTING SENTENCE LEVEL EXPERIMENTS...")
print(f"💾 Results path: {RESULTS_PATH}")

for pooling in POOLING_METHODS:
    print(f"\n{'#'*60}")
    print(f"🌊 POOLING METHOD: {pooling.upper()}")
    print(f"{'#'*60}")
    
    for exp in SCENARIOS:
        print(f"\n   🧪 SCENARIO: {exp['id']} - {exp['name']}")
        
        # 1. Load the data (with the matching pooling)
        try:
            data = data_splitting.get_train_val_test_splits(
                scenario=exp['scenario'],
                level='sentence',
                pooling=pooling,
                random_state=42
            )
            
            X_train, y_train = data['X_train'], data['y_train']
            X_val, y_val     = data['X_val'], data['y_val']
            X_test, y_test   = data['X_test'], data['y_test']
            
            # 2. Manual undersampling for S2a (Baseline)
            if exp['balance_train'] and exp['scenario'] == 'baseline':
                idx_l0 = np.where(y_train == 0)[0]
                idx_l1 = np.where(y_train == 1)[0]
                
                # Downsample whichever class is larger; with replace=False,
                # np.random.choice cannot draw more items than a class holds
                n_min = min(len(idx_l0), len(idx_l1))
                np.random.seed(42)
                idx_l0_down = np.random.choice(idx_l0, size=n_min, replace=False)
                idx_l1_down = np.random.choice(idx_l1, size=n_min, replace=False)
                idx_balanced = np.concatenate([idx_l0_down, idx_l1_down])
                np.random.shuffle(idx_balanced)
                
                X_train, y_train = X_train[idx_balanced], y_train[idx_balanced]
                print(f"      ⚖️ Balanced Train Size: {X_train.shape[0]} (L1: {sum(y_train)})")

            # 3. Train the models
            for model_name in MODELS_TO_TEST:
                print(f"      ⚙️ {model_name}...")
                
                try:
                    clf = models.get_supervised_model(model_name, random_state=42)
                    clf.fit(X_train, y_train)
                    
                    # Scores
                    if hasattr(clf, "predict_proba"):
                        s_train = clf.predict_proba(X_train)[:, 1]
                        s_val   = clf.predict_proba(X_val)[:, 1]
                        s_test  = clf.predict_proba(X_test)[:, 1]
                    else:
                        s_train = clf.decision_function(X_train)
                        s_val   = clf.decision_function(X_val)
                        s_test  = clf.decision_function(X_test)
                    
                    # Threshold chosen on the validation split
                    threshold, _ = evaluation.find_optimal_threshold(y_val, s_val, metric='f1')
                    
                    # Metrics
                    metrics = evaluation.calculate_metrics(y_test, (s_test > threshold).astype(int), s_test)
                    
                    # Log
                    res = {
                        'id': exp['id'],
                        'scenario': exp['scenario'],
                        'scenario_name': exp['name'],
                        'pooling': pooling,
                        'model': model_name,
                        'balance_train': exp['balance_train'],
                        'threshold': threshold,
                        'test_f1': metrics['f1'],
                        'test_auprc': metrics['avg_precision'],
                        'test_roc_auc': metrics['roc_auc']
                    }
                    results.append(res)
                    pd.DataFrame(results).to_csv(RESULTS_PATH, index=False)
                    
                    # Best Run Check
                    if metrics['f1'] > best_f1:
                        best_f1 = metrics['f1']
                        best_run = {
                            'info': res,
                            'model': clf,
                            'data': data,
                            'scores_test': s_test,
                            'y_test': y_test
                        }
                        
                except Exception as e:
                    print(f"      ❌ Error {model_name}: {e}")
                    
        except Exception as e:
            print(f"   ❌ Error loading data: {e}")

print("\n✅ All experiments finished.")
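
The loop above picks the decision threshold on the validation split and only then scores the test split. The implementation of `evaluation.find_optimal_threshold` is not shown in this notebook, so the following is only a sketch of one common way to do it, built on scikit-learn's `precision_recall_curve`. The helper name `best_f1_threshold` and the toy scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the candidate threshold with the highest F1, together with that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no threshold attached, so drop it
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Toy validation labels and scores (made-up values)
y_val = np.array([0, 0, 1, 1, 0, 1])
s_val = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

threshold, f1_val = best_f1_threshold(y_val, s_val)
y_hat = (s_val >= threshold).astype(int)
print(f"threshold={threshold:.2f}, val F1={f1_val:.3f}")
```

Thresholds returned by `precision_recall_curve` are inclusive, so predictions use `score >= threshold`. Whether the strict `>` comparison in the loop is equivalent depends on how `evaluation.find_optimal_threshold` defines its threshold.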

## 4. Results Overview
Comparison of the effect of pooling and scenarios.

In [None]:
# RESULTS_PATH = config.RESULTS_DIR / "M2_S2_experiment_results_v1.csv"  # optionally adjust to load a different CSV file

# Load results
if RESULTS_PATH.exists():
    df_results = pd.read_csv(RESULTS_PATH)
else:
    df_results = pd.DataFrame(results)

# 1. Table
print("📊 F1 SCORE COMPARISON (Pooling x Scenario):")
pivot = df_results.pivot_table(
    values='test_f1', 
    index=['id', 'scenario_name'], 
    columns=['pooling', 'model'], 
    aggfunc='max'
)
display(pivot.style.background_gradient(cmap='Greens', axis=None).format("{:.4f}"))

# 2. Plot: effect of pooling (Mean vs CLS)
def plot_pooling_comparison(df, metric='f1'):
    plt.figure(figsize=(10, 6))
    
    # Build a label that combines the scenario and the model
    df = df.copy()
    df['Model_Exp'] = df['id'] + ": " + df['model']
    
    sns.barplot(
        data=df,
        x='Model_Exp',
        y=f'test_{metric}',
        hue='pooling',
        palette={'mean': config.COLORS['l0'], 'cls': config.COLORS['l1']},  # Mean=blue, CLS=red
        edgecolor='white'
    )
    
    plt.title(f"Pooling Comparison: Mean vs CLS ({metric.upper()})", fontsize=15, pad=15)
    plt.ylabel(f"Test {metric.upper()}")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.3)
    plt.legend(title="Pooling")
    plt.tight_layout()
    plt.show()

print("\n📊 PLOT: Pooling Comparison")
plot_pooling_comparison(df_results, metric='f1')
plot_pooling_comparison(df_results, metric='auprc')

## 5. Deep Dive: Winner Analysis
A detailed look at the best model.

In [None]:
if RESULTS_PATH.exists():
    df_results = pd.read_csv(RESULTS_PATH)
    best_row = df_results.sort_values('test_f1', ascending=False).iloc[0]
    
    print(f"🏆 WINNER: {best_row['model']} ({best_row['scenario_name']})")
    print(f"🌊 Pooling: {best_row['pooling'].upper()}")
    print(f"📊 F1: {best_row['test_f1']:.4f}")
    
    # 1. Reload Data & Retrain
    print("🔄 Reloading data...")
    data_best = data_splitting.get_train_val_test_splits(
        scenario=best_row['scenario'],
        level='sentence',
        pooling=best_row['pooling'],
        random_state=42
    )
    
    X_train_b, y_train_b = data_best['X_train'], data_best['y_train']
    X_test_b, y_test_b   = data_best['X_test'], data_best['y_test']
    
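    # Undersampling if needed (S2a)
    if str(best_row['balance_train']) == 'True':
        print("⚖️ Applying Undersampling...")
        idx_0 = np.where(y_train_b == 0)[0]
        idx_1 = np.where(y_train_b == 1)[0]
        # Downsample whichever class is larger; with replace=False,
        # np.random.choice cannot draw more items than a class holds
        n_min = min(len(idx_0), len(idx_1))
        np.random.seed(42)
        idx_0_sel = np.random.choice(idx_0, size=n_min, replace=False)
        idx_1_sel = np.random.choice(idx_1, size=n_min, replace=False)
        idx_bal = np.concatenate([idx_0_sel, idx_1_sel])
        np.random.shuffle(idx_bal)
        X_train_b, y_train_b = X_train_b[idx_bal], y_train_b[idx_bal]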
        
    # Train
    clf = models.get_supervised_model(best_row['model'], random_state=42)
    clf.fit(X_train_b, y_train_b)
    
    # Probs
    if hasattr(clf, "predict_proba"):
        scores_test = clf.predict_proba(X_test_b)[:, 1]
    else:
        scores_test = clf.decision_function(X_test_b)
        
    y_pred = (scores_test > best_row['threshold']).astype(int)
    
    # 2. Vizualizace
    visualization.plot_confusion_matrix_heatmap(y_test_b, y_pred, normalize=True, title="CM (Normalized)")
    visualization.plot_pr_curve(y_test_b, scores_test, title="PR Curve")
    visualization.plot_model_calibration(y_test_b, scores_test, title="Calibration")
    
    # 3. Qualitative analysis (sentences)
    df_qual = pd.DataFrame({
        'text': data_best['meta_test']['text'],  # full sentences are available here
        'true': y_test_b,
        'pred': y_pred,
        'score': scores_test
    })
    
    # Categories
    conds = [
        (df_qual.true==1) & (df_qual.pred==1), (df_qual.true==0) & (df_qual.pred==0),
        (df_qual.true==0) & (df_qual.pred==1), (df_qual.true==1) & (df_qual.pred==0)
    ]
    # A string default keeps the result dtype consistent with the string choices
    df_qual['category'] = np.select(conds, ['TP', 'TN', 'FP', 'FN'], default='NA')
    
    print("\n❌ TOP 5 FP (model sees bias where there is none):")
    display(df_qual[df_qual['category'] == 'FP'].sort_values('score', ascending=False).head(5))
    
    print("\n❌ TOP 5 FN (model missed bias):")
    display(df_qual[df_qual['category'] == 'FN'].sort_values('score', ascending=True).head(5))
    
    # Save
    df_qual.to_csv(config.RESULTS_DIR / "M2_S2_Qualitative.csv", index=False)

## 6. Embedding Projections
What do the sentences look like in embedding space? Do they form clusters?
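
As a rough illustration of what such a projection involves, the sketch below reduces synthetic "sentence embeddings" to 2D with scikit-learn's `PCA` and `TSNE`. It does not reproduce `visualization.compute_projections`, whose internals are not shown here. The two Gaussian blobs stand in for the Neutral and Bias classes and are made-up data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic embeddings: two classes of 30 vectors each, 16 dimensions
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 16)),  # stand-in for Neutral sentences
    rng.normal(3.0, 1.0, size=(30, 16)),  # stand-in for Bias sentences
])
y = np.array([0] * 30 + [1] * 30)

# Linear projection onto the two directions of highest variance
coords_pca = PCA(n_components=2, random_state=42).fit_transform(X)

# Non-linear projection; perplexity must stay below the number of samples
coords_tsne = TSNE(n_components=2, perplexity=10, init="pca",
                   random_state=42).fit_transform(X)

print(coords_pca.shape, coords_tsne.shape)
```

PCA preserves global variance structure, while t-SNE emphasizes local neighborhoods, so distances between separate clusters in a t-SNE plot should be read with caution.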

In [None]:
print("🎨 Computing sentence projections...")
projs, idxs = visualization.compute_projections(X_test_b, methods=['PCA', 't-SNE'], random_state=42)
y_viz = y_test_b[idxs]
y_pred_viz = y_pred[idxs]

for m, coords in projs.items():
    # GT
    visualization.plot_embedding_projection(
        coords, pd.Series(y_viz).map({0:'Neutral', 1:'Bias'}), 
        palette={'Neutral': config.COLORS['l0'], 'Bias': config.COLORS['l1']},
        title=f"{m} - Ground Truth"
    )
    # Errors
    visualization.plot_error_analysis_projection(
        coords, y_viz, y_pred_viz, method_name=m
    )