<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [4]</a>'.</span>

## 1. Setup & Import Libraries

In [1]:
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print(f"Project root: {project_root}")
print(f"Python version: {sys.version}")

Project root: C:\Coding\DataMining
Python version: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:23:22) [MSC v.1944 64 bit (AMD64)]


In [2]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Print versions
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

NumPy version: 1.24.3
Pandas version: 2.0.3


In [3]:
# Import our semi-supervised module
from src.models.semi_supervised import (
    # Split functions
    create_labeled_unlabeled_split,
    create_multiple_splits,
    
    # Training functions
    train_self_training,
    train_self_training_rf,
    train_label_propagation,
    train_label_spreading,
    
    # Analysis
    analyze_pseudo_labels,
    evaluate_semi_supervised,
    
    # Comparison
    compare_semi_supervised_methods,
    run_label_fraction_experiment,
    
    # Visualization
    plot_learning_curve_by_labels,
    plot_pseudo_label_confusion_matrix
)

print("✅ Semi-supervised module imported successfully!")

✅ Semi-supervised module imported successfully!


## 2. Load Data

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [4]:
# Load processed data
data_dir = os.path.join(project_root, 'data', 'processed')

# Training data (with SMOTE resampling from Phase 4)
X_train = pd.read_csv(os.path.join(data_dir, 'X_train_resampled.csv'))
y_train = pd.read_csv(os.path.join(data_dir, 'y_train_resampled.csv')).squeeze()

# Test data
X_test = pd.read_csv(os.path.join(data_dir, 'X_test_encoded.csv'))
y_test = pd.read_csv(os.path.join(data_dir, 'y_test.csv')).squeeze()

print(f"📊 Data Loaded:")
print(f"   X_train shape: {X_train.shape}")
print(f"   y_train shape: {y_train.shape}")
print(f"   X_test shape:  {X_test.shape}")
print(f"   y_test shape:  {y_test.shape}")
print(f"\n   Train class distribution:")
print(f"   {y_train.value_counts().to_dict()}")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Coding\\DataMining\\data\\processed\\X_train_resampled.csv'
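
The `FileNotFoundError` above means the resampled training file was not present in `data/processed` when the notebook ran. One way to surface this earlier, with a clearer message, is to check all four inputs before calling `pd.read_csv`. The following is a minimal sketch; `find_missing_files` and `require_files` are hypothetical helpers, not part of the project's `src` package, and the file list is simply copied from the cell above.

```python
from pathlib import Path

# File names taken from the loading cell; adjust if the pipeline changes.
REQUIRED_FILES = [
    "X_train_resampled.csv",
    "y_train_resampled.csv",
    "X_test_encoded.csv",
    "y_test.csv",
]


def find_missing_files(data_dir, filenames):
    """Return the entries of `filenames` that do not exist under `data_dir`."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]


def require_files(data_dir, filenames=REQUIRED_FILES):
    """Raise one FileNotFoundError that lists every missing input at once."""
    missing = find_missing_files(data_dir, filenames)
    if missing:
        raise FileNotFoundError(
            f"Missing in {data_dir}: {', '.join(missing)}. "
            "These files are expected to be produced by earlier phases."
        )
```

Calling `require_files(data_dir)` at the top of the loading cell would stop the run before any partial state is created, and the message names every missing file rather than only the first one.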

In [None]:
# Ensure columns match between train and test
train_cols = set(X_train.columns)
test_cols = set(X_test.columns)

if train_cols != test_cols:
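    # Find common columns, keeping the training column order
    # (a plain set intersection would return them in arbitrary order)
    common_cols = [col for col in X_train.columns if col in test_cols]
    print(f"⚠️ Column mismatch! Using {len(common_cols)} common columns.")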
    
    # Show differences
    only_train = train_cols - test_cols
    only_test = test_cols - train_cols
    if only_train:
        print(f"   Only in train: {only_train}")
    if only_test:
        print(f"   Only in test: {only_test}")
    
    X_train = X_train[common_cols]
    X_test = X_test[common_cols]
else:
    print("✅ Columns match between train and test")

In [None]:
# For semi-supervised experiments, use smaller subset for efficiency
# (Label Propagation/Spreading can be slow on very large datasets)

SAMPLE_SIZE = 20000  # Use subset for faster experiments

if len(X_train) > SAMPLE_SIZE:
    print(f"📉 Sampling {SAMPLE_SIZE:,} samples from {len(X_train):,} for faster experiments...")
    
    # Stratified sampling
    from sklearn.model_selection import train_test_split
    X_train_sample, _, y_train_sample, _ = train_test_split(
        X_train, y_train,
        train_size=SAMPLE_SIZE,
        stratify=y_train,
        random_state=42
    )
    
    print(f"   Sampled X_train: {X_train_sample.shape}")
    print(f"   Sampled y_train: {y_train_sample.shape}")
    print(f"   Class distribution: {y_train_sample.value_counts().to_dict()}")
else:
    X_train_sample = X_train
    y_train_sample = y_train
    print(f"✅ Using full training data: {len(X_train):,} samples")

## 3. Create Labeled/Unlabeled Splits

Simulate a scenario in which only a small portion of the data is labeled.

In [None]:
# Create splits with different labeled fractions
label_fractions = [0.05, 0.10, 0.20]

splits = create_multiple_splits(
    X_train_sample, y_train_sample,
    fractions=label_fractions,
    random_state=42,
    verbose=True
)

In [None]:
# Detailed view of 10% split
X, y_semi_10, mask_10 = splits[0.10]

print("\n📊 Detailed 10% Split:")
print(f"   Total samples: {len(y_semi_10):,}")
print(f"   Labeled (y != -1): {(y_semi_10 != -1).sum():,}")
print(f"   Unlabeled (y == -1): {(y_semi_10 == -1).sum():,}")
print(f"\n   First 20 labels: {y_semi_10[:20]}")
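
The cell above relies on the scikit-learn convention that unlabeled samples carry the label `-1`. The implementation of `create_labeled_unlabeled_split` is not shown in this notebook, so the following is only a sketch of how such a split could be built with NumPy. `mask_labels` is a hypothetical stand-in, and the per-class (stratified) selection is an assumption.

```python
import numpy as np


def mask_labels(y, labeled_fraction, random_state=0):
    """Hypothetical stand-in for the project's split helper.

    Keeps roughly `labeled_fraction` of the labels in each class and
    replaces the rest with -1. Returns (y_semi, labeled_mask).
    """
    y = np.asarray(y)
    rng = np.random.default_rng(random_state)
    labeled_mask = np.zeros(len(y), dtype=bool)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        # Keep at least one labeled sample per class.
        n_keep = max(1, int(round(labeled_fraction * len(idx))))
        labeled_mask[rng.choice(idx, size=n_keep, replace=False)] = True
    # Unlabeled positions get the sentinel value -1.
    y_semi = np.where(labeled_mask, y, -1)
    return y_semi, labeled_mask


# Toy labels, for illustration only.
y = np.array([0] * 60 + [1] * 40)
y_semi, labeled_mask = mask_labels(y, labeled_fraction=0.10, random_state=42)
print((y_semi != -1).sum(), (y_semi == -1).sum())
```

In this sketch `labeled_mask` is `True` for labeled samples, so its negation selects the samples whose labels were hidden.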

## 4. Self-Training

How Self-Training works:
1. Train a model on the labeled data
2. Predict on the unlabeled data
3. Add the high-confidence predictions (>= threshold) to the labeled set
4. Repeat until convergence
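
The loop described above can be sketched with scikit-learn's `SelfTrainingClassifier`. This is an illustration on synthetic data, not the project's `train_self_training` wrapper, whose internals are not shown here. The base estimator, sample sizes, and `max_iter` values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data, for illustration only (not the project's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hide about 90% of the labels; scikit-learn marks unlabeled samples with -1.
rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) > 0.10] = -1

# Each iteration fits the base model on the currently labeled samples,
# predicts on the unlabeled ones, and adopts the predictions whose
# probability clears the threshold. It stops when nothing new is added,
# everything is labeled, or max_iter is reached.
model = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000),
    threshold=0.9,
    max_iter=10,
)
model.fit(X, y_semi)

# transduction_ holds the original labels plus the adopted pseudo-labels;
# samples that never cleared the threshold keep the value -1.
n_pseudo = int(((y_semi == -1) & (model.transduction_ != -1)).sum())
print(f"Pseudo-labeled samples: {n_pseudo}")
print(f"Stopped because: {model.termination_condition_}")
```

Raising `threshold` makes the loop adopt fewer pseudo-labels per iteration, which is the trade-off the cells below explore with 0.9 and 0.95.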

In [None]:
# Train Self-Training with 10% labeled data
X_10, y_semi_10, mask_10 = splits[0.10]

print("="*70)
print("SELF-TRAINING WITH LOGISTIC REGRESSION (threshold=0.9)")
print("="*70)

model_st_lr, info_st_lr = train_self_training(
    X_10, y_semi_10,
    threshold=0.9,
    verbose=True
)

In [None]:
# Evaluate Self-Training model on test set
metrics_st_lr = evaluate_semi_supervised(
    model_st_lr, X_test, y_test,
    model_name="Self-Training (LR, threshold=0.9)",
    verbose=True
)

In [None]:
# Self-Training with higher threshold (more conservative)
print("\n" + "="*70)
print("SELF-TRAINING WITH HIGHER THRESHOLD (0.95)")
print("="*70)

model_st_95, info_st_95 = train_self_training(
    X_10, y_semi_10,
    threshold=0.95,
    verbose=True
)

metrics_st_95 = evaluate_semi_supervised(
    model_st_95, X_test, y_test,
    model_name="Self-Training (threshold=0.95)",
    verbose=True
)

In [None]:
# Compare thresholds
print("\n📊 Self-Training Threshold Comparison (10% labeled):")
print(f"   Threshold 0.90: F1={metrics_st_lr['f1']:.4f}, Pseudo-labels={info_st_lr['n_pseudo_labeled']:,}")
print(f"   Threshold 0.95: F1={metrics_st_95['f1']:.4f}, Pseudo-labels={info_st_95['n_pseudo_labeled']:,}")

## 5. Pseudo-Label Analysis

Analyze the quality of the pseudo-labels against the ground truth.
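
Two numbers summarize pseudo-label quality: how many originally unlabeled samples received a pseudo-label (coverage) and how many of those match the ground truth (accuracy). The sketch below computes both with NumPy. `pseudo_label_quality` is a hypothetical helper and is not claimed to match what `analyze_pseudo_labels` does internally. It assumes that samples never pseudo-labeled are still marked `-1`, as in `SelfTrainingClassifier.transduction_`.

```python
import numpy as np


def pseudo_label_quality(y_true, y_pseudo, mask_unlabeled):
    """Hypothetical helper: coverage and accuracy of pseudo-labels.

    Only originally unlabeled samples are considered, and samples still
    marked -1 (never pseudo-labeled) are excluded from the accuracy.
    """
    y_true = np.asarray(y_true)
    y_pseudo = np.asarray(y_pseudo)
    mask_unlabeled = np.asarray(mask_unlabeled, dtype=bool)

    assigned = mask_unlabeled & (y_pseudo != -1)
    n_unlabeled = int(mask_unlabeled.sum())
    n_assigned = int(assigned.sum())

    coverage = n_assigned / n_unlabeled if n_unlabeled else 0.0
    if n_assigned:
        accuracy = float((y_true[assigned] == y_pseudo[assigned]).mean())
    else:
        accuracy = float("nan")
    return {"coverage": coverage, "accuracy": accuracy, "n_assigned": n_assigned}


# Toy example: four originally unlabeled samples, one of them never assigned.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pseudo = np.array([0, 1, 0, -1, 1, 0])
mask_unlabeled = np.array([False, False, True, True, True, True])
result = pseudo_label_quality(y_true, y_pseudo, mask_unlabeled)
print(result)
```

Reporting coverage next to accuracy matters because a high threshold can produce very accurate pseudo-labels for only a small share of the unlabeled pool.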

In [None]:
# Analyze pseudo-label quality for Self-Training
# Get the transduced labels (after self-training)
y_pseudo_st = model_st_lr.transduction_

# Mask for originally unlabeled samples
mask_unlabeled = ~mask_10

# Analyze
pseudo_analysis = analyze_pseudo_labels(
    y_true=y_train_sample.values,
    y_pseudo=y_pseudo_st,
    mask_unlabeled=mask_unlabeled,
    X=X_train_sample,
    feature_to_analyze='lead_time' if 'lead_time' in X_train_sample.columns else None,
    verbose=True
)

In [None]:
# Plot pseudo-label confusion matrix
fig = plot_pseudo_label_confusion_matrix(
    y_true=y_train_sample.values,
    y_pseudo=y_pseudo_st,
    mask_unlabeled=mask_unlabeled,
    title='Self-Training Pseudo-Label Confusion Matrix (10% labeled)',
    save_path=os.path.join(project_root, 'outputs', 'figures', 'pseudo_label_cm_self_training.png'),
    show=True
)

## 6. Label Propagation & Spreading

Graph-based semi-supervised learning methods.
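
Graph-based methods connect each sample to its neighbors and let the known labels flow along those edges. As a rough illustration, the sketch below runs scikit-learn's `LabelSpreading` on a synthetic two-moons dataset with the same `kernel`, `n_neighbors`, and `alpha` values used later in this section. The dataset and the number of labeled points are assumptions chosen for the example, and the project's `train_label_spreading` wrapper may behave differently.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Synthetic data, for illustration only (not the project's dataset).
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)

# Keep 15 labels per class and mark everything else as unlabeled (-1).
rng = np.random.default_rng(0)
y_semi = np.full_like(y, -1)
labeled_idx = np.concatenate(
    [rng.choice(np.flatnonzero(y == c), size=15, replace=False) for c in (0, 1)]
)
y_semi[labeled_idx] = y[labeled_idx]

# Build a k-nearest-neighbor graph and spread the labels over it.
# alpha controls how much a sample takes from its neighbors versus
# how much it keeps of its initial label information.
model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
model.fit(X, y_semi)

# transduction_ holds the label inferred for every training sample.
unlabeled = y_semi == -1
transductive_acc = float((model.transduction_[unlabeled] == y[unlabeled]).mean())
print(f"Transductive accuracy on unlabeled points: {transductive_acc:.3f}")
```

Because the graph is built over all samples at once, memory use grows quickly with dataset size, which is why the next cell subsamples the data before fitting.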

In [None]:
# Use smaller sample for Label Propagation (memory intensive)
LABEL_PROP_SAMPLE = min(10000, len(X_train_sample))

if len(X_train_sample) > LABEL_PROP_SAMPLE:
    from sklearn.model_selection import train_test_split
    X_lp, _, y_lp, _ = train_test_split(
        X_train_sample, y_train_sample,
        train_size=LABEL_PROP_SAMPLE,
        stratify=y_train_sample,
        random_state=42
    )
    print(f"📉 Using {LABEL_PROP_SAMPLE:,} samples for Label Propagation")
else:
    X_lp = X_train_sample
    y_lp = y_train_sample

# Create 10% split for label propagation
X_lp, y_semi_lp, mask_lp = create_labeled_unlabeled_split(
    X_lp, y_lp,
    labeled_fraction=0.10,
    random_state=42,
    verbose=True
)

In [None]:
# Train Label Propagation with KNN kernel
print("\n" + "="*70)
print("LABEL PROPAGATION (KNN kernel)")
print("="*70)

model_lp, info_lp = train_label_propagation(
    X_lp, y_semi_lp,
    kernel='knn',
    n_neighbors=7,
    verbose=True
)

# Evaluate
metrics_lp = evaluate_semi_supervised(
    model_lp, X_test, y_test,
    model_name="Label Propagation (KNN)",
    verbose=True
)

In [None]:
# Train Label Spreading
print("\n" + "="*70)
print("LABEL SPREADING (KNN kernel, alpha=0.2)")
print("="*70)

model_ls, info_ls = train_label_spreading(
    X_lp, y_semi_lp,
    kernel='knn',
    n_neighbors=7,
    alpha=0.2,
    verbose=True
)

# Evaluate
metrics_ls = evaluate_semi_supervised(
    model_ls, X_test, y_test,
    model_name="Label Spreading (KNN, alpha=0.2)",
    verbose=True
)

## 7. Comparison: Supervised vs Semi-Supervised

In [None]:
# Compare methods at 10% labeled
print("\n" + "#"*70)
print("# COMPARISON: SUPERVISED vs SEMI-SUPERVISED (10% labeled)")
print("#"*70)

comparison_10 = compare_semi_supervised_methods(
    X_lp, y_lp, X_test, y_test,
    labeled_fraction=0.10,
    methods=['supervised', 'self_training', 'label_spreading'],
    random_state=42,
    verbose=True
)

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Metrics comparison
metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1']
comparison_plot = comparison_10[metrics_to_plot].T

comparison_plot.plot(kind='bar', ax=axes[0], rot=0, width=0.8)
axes[0].set_title('Supervised vs Semi-Supervised (10% labeled)', fontweight='bold')
axes[0].set_ylabel('Score')
axes[0].legend(title='Method')
axes[0].set_ylim(0, 1)
for container in axes[0].containers:
    axes[0].bar_label(container, fmt='%.2f', fontsize=8)

# ROC-AUC if available
if 'roc_auc' in comparison_10.columns:
    roc_data = comparison_10['roc_auc'].dropna()
    colors = plt.cm.Set2(np.linspace(0, 1, len(roc_data)))
    bars = axes[1].bar(roc_data.index, roc_data.values, color=colors)
    axes[1].set_title('ROC-AUC Comparison (10% labeled)', fontweight='bold')
    axes[1].set_ylabel('ROC-AUC')
    axes[1].set_ylim(0, 1)
    axes[1].bar_label(bars, fmt='%.3f')
else:
    axes[1].text(0.5, 0.5, 'ROC-AUC not available', ha='center', va='center')

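plt.tight_layout()
# Make sure the output directory exists before saving
figures_dir = os.path.join(project_root, 'outputs', 'figures')
os.makedirs(figures_dir, exist_ok=True)
plt.savefig(os.path.join(figures_dir, 'semi_supervised_comparison.png'),
            dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Figure saved to outputs/figures/semi_supervised_comparison.png")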

## 8. Label Fraction Experiment (5%, 10%, 20%)

In [None]:
# Run experiments across different label fractions
print("\n" + "#"*70)
print("# LABEL FRACTION EXPERIMENT")
print("#"*70)

# Use smaller data for efficiency
experiment_results = run_label_fraction_experiment(
    X_lp, y_lp, X_test, y_test,
    fractions=[0.05, 0.10, 0.20],
    random_state=42,
    verbose=True
)

In [None]:
# Plot learning curve
fig = plot_learning_curve_by_labels(
    experiment_results,
    metric='f1',
    figsize=(10, 6),
    save_path=os.path.join(project_root, 'outputs', 'figures', 'semi_supervised_learning_curve.png'),
    show=True
)

In [None]:
# Summary table
print("\n" + "="*70)
print("SUMMARY: F1 SCORES BY LABELED FRACTION")
print("="*70)

summary_data = {}
for frac, results in experiment_results.items():
    if 'f1' in results.columns:
        summary_data[f'{frac:.0%} labeled'] = results['f1'].to_dict()

summary_df = pd.DataFrame(summary_data)
print(summary_df.round(4).to_string())

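# Save summary (create the output directory first if it does not exist)
tables_dir = os.path.join(project_root, 'outputs', 'tables')
os.makedirs(tables_dir, exist_ok=True)
summary_df.to_csv(os.path.join(tables_dir, 'semi_supervised_summary.csv'))
print("\n✅ Summary saved to outputs/tables/semi_supervised_summary.csv")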

## 9. Key Findings & Conclusions

In [None]:
# Final analysis
print("\n" + "="*70)
print("KEY FINDINGS")
print("="*70)

# Find best method for each fraction
for frac, results in experiment_results.items():
    if 'f1' in results.columns:
        best_method = results['f1'].idxmax()
        best_f1 = results['f1'].max()
        supervised_f1 = results.loc['supervised', 'f1'] if 'supervised' in results.index else 0
        
        improvement = (best_f1 - supervised_f1) / supervised_f1 * 100 if supervised_f1 > 0 else 0
        
        print(f"\n📊 {frac:.0%} Labeled Data:")
        print(f"   Best method: {best_method} (F1={best_f1:.4f})")
        print(f"   Supervised baseline: F1={supervised_f1:.4f}")
        if improvement > 0:
            print(f"   Improvement: +{improvement:.1f}%")
        else:
            print(f"   Difference: {improvement:.1f}%")

In [None]:
# Conclusions
print("\n" + "="*70)
print("CONCLUSIONS")
print("="*70)
print("""
1. SELF-TRAINING:
   - Effective when labeled data is scarce (5-10%)
   - A high threshold (0.95) yields more accurate but fewer pseudo-labels
   - A lower threshold (0.9) yields more pseudo-labels, but they may be wrong

2. LABEL PROPAGATION / SPREADING:
   - Works well when the cluster structure is clear
   - Memory-intensive on large datasets
   - The KNN kernel is usually more stable than RBF

3. COMPARISON WITH SUPERVISED:
   - Semi-supervised learning can help when labeled data is very scarce
   - With >20% labeled, supervised learning is usually good enough
   - Pseudo-label quality is critical

4. RECOMMENDATIONS:
   - Use Self-Training when 5-15% of the data is labeled
   - Start with a high threshold (0.95) and lower it if needed
   - Always validate pseudo-labels when possible
""")

---

## 📝 Notebook Complete

Phase 6: Semi-Supervised Learning is complete!

**Outputs:**
- `outputs/figures/pseudo_label_cm_self_training.png`
- `outputs/figures/semi_supervised_comparison.png`
- `outputs/figures/semi_supervised_learning_curve.png`
- `outputs/tables/semi_supervised_summary.csv`