# A-LSA Demonstration Notebook

**Adaptive Latent Semantic Analysis for Binary Text Classification**

Author: Isaac Touza  
Institution: Université de Maroua, Cameroun  
Date: January 2026

---

This notebook demonstrates the complete A-LSA workflow:
1. Data loading and preprocessing
2. Model training
3. Evaluation and comparison with baselines
4. Visualization and interpretation

## 1. Setup and Imports

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
import warnings
warnings.filterwarnings('ignore')

# Add parent directory to path
sys.path.append('..')

from src.alsa import AdaptiveLSA
from src.baselines import get_baseline_models
from src.evaluation import evaluate_model, cross_validate_model, get_confusion_matrix
from src.visualization import (
    plot_tsne_visualization,
    plot_characteristic_terms,
    plot_performance_comparison
)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")

## 2. Load Dataset

This demo uses two categories of the 20 Newsgroups dataset, `comp.graphics` (label 0) and `rec.sport.hockey` (label 1), as a binary classification task.

In [None]:
# Load 20 Newsgroups dataset
categories = ('comp.graphics', 'rec.sport.hockey')

print("Loading 20 Newsgroups dataset...")
data = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    shuffle=True,
    random_state=42
)

X = data.data
y = data.target

print(f"✓ Loaded {len(X)} documents")
print(f"  - {categories[0]}: {np.sum(y == 0)} documents")
print(f"  - {categories[1]}: {np.sum(y == 1)} documents")

### Explore the Data

In [None]:
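# Show example documents
# X is a plain Python list, so look up the first index of each class
# instead of indexing with a boolean mask (which a list does not support)
print("Example from comp.graphics:")
print("-" * 80)
print(X[np.flatnonzero(y == 0)[0]][:300] + "...\n")

print("Example from rec.sport.hockey:")
print("-" * 80)
print(X[np.flatnonzero(y == 1)[0]][:300] + "...")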

## 3. Split Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} documents")
print(f"Test set: {len(X_test)} documents")
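
Because the split passes `stratify=y`, both subsets should keep roughly the same class proportions as the full dataset. A minimal sketch of that sanity check follows; `class_balance` is a hypothetical helper (not part of `src`) and the labels are made up for illustration.

```python
import numpy as np

def class_balance(labels):
    """Return the fraction of samples in each class, indexed by class id."""
    counts = np.bincount(np.asarray(labels))
    return counts / counts.sum()

# Toy labels standing in for y_train / y_test (illustrative values only)
y_train_demo = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_test_demo = np.array([0, 1])

print(class_balance(y_train_demo))  # fractions for class 0 and class 1
print(class_balance(y_test_demo))
```

In the notebook itself, calling `class_balance(y_train)` and `class_balance(y_test)` should give nearly identical vectors if stratification worked as intended.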

## 4. Train A-LSA Model

In [None]:
# Initialize A-LSA
alsa = AdaptiveLSA(
    n_components=100,
    random_state=42
)

print("Training A-LSA model...")
alsa.fit(X_train, y_train)
print("✓ Training complete")

# Show model parameters
print("\nModel parameters:")
print(f"  - Latent dimension k: {alsa.n_components}")
print(f"  - Vocabulary size: {alsa.preprocessor_.get_vocabulary_size()}")
print(f"  - Decision threshold θ: {alsa.theta_:.4f}")
print(f"  - N+ (positive class): {alsa.n_pos_}")
print(f"  - N- (negative class): {alsa.n_neg_}")
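
The internals of `AdaptiveLSA` are not shown in this notebook. As a rough illustration of what "dual latent spaces" and a "differential semantic distance" could mean, the sketch below fits one truncated SVD basis per class on a toy term-document matrix and compares how well each basis reconstructs a new document. Everything here (function names, the reconstruction-error distance, the sign convention, `theta = 0.0`) is an assumption made for illustration, not the library's actual method.

```python
import numpy as np

def fit_class_basis(term_doc, k):
    """Top-k left singular vectors of a (terms x documents) matrix."""
    u, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k]

def reconstruction_error(basis, doc_vec):
    """Norm of the part of doc_vec that the latent basis cannot represent."""
    projected = basis @ (basis.T @ doc_vec)
    return np.linalg.norm(doc_vec - projected)

def differential_distance(basis_pos, basis_neg, doc_vec):
    """Assumed convention: positive when the positive space fits better."""
    err_pos = reconstruction_error(basis_pos, doc_vec)
    err_neg = reconstruction_error(basis_neg, doc_vec)
    return err_neg - err_pos

# Toy vocabulary: [puck, goal, pixel, shader]; each column is one document
pos_docs = np.array([[2., 1.], [1., 2.], [0., 0.], [0., 0.]])  # hockey-like
neg_docs = np.array([[0., 0.], [0., 0.], [2., 1.], [1., 2.]])  # graphics-like

basis_pos = fit_class_basis(pos_docs, k=1)
basis_neg = fit_class_basis(neg_docs, k=1)

theta = 0.0  # assumed threshold; the real model reports its own as `theta_`
new_doc = np.array([1., 1., 0., 0.])  # uses only hockey terms
delta = differential_distance(basis_pos, basis_neg, new_doc)
prediction = int(delta > theta)
print(delta, prediction)  # delta is about 1.414 here, so prediction is 1
```

The toy document lies entirely in the positive-class subspace, so its positive-space error is essentially zero while the negative space reconstructs none of it.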

## 5. Evaluate A-LSA

In [None]:
# Evaluate on test set
metrics = evaluate_model(alsa, X_test, y_test, "A-LSA")

print("\nTest Set Performance:")
print("-" * 40)
for metric, value in metrics.items():
    print(f"{metric:25s}: {value:.4f}")

In [None]:
# Confusion matrix
cm, cm_df = get_confusion_matrix(
    alsa, X_test, y_test,
    class_names=['comp.graphics', 'rec.sport.hockey']
)

print("\nConfusion Matrix:")
print(cm_df)
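
The macro-averaged scores reported above can be recomputed by hand from a confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); whether `get_confusion_matrix` follows that layout is an assumption, and the counts are invented for illustration rather than taken from this notebook.

```python
import numpy as np

def macro_metrics(cm):
    """Macro precision, recall and F1 from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column totals: predictions per class
    actual = cm.sum(axis=1)     # row totals: true samples per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "Accuracy": tp.sum() / cm.sum(),
        "Precision (macro)": precision.mean(),
        "Recall (macro)": recall.mean(),
        "F1-score (macro)": f1.mean(),
    }

# Made-up counts for illustration, not results from this notebook
example_cm = [[90, 10],
              [20, 80]]
scores = macro_metrics(example_cm)
for name, value in scores.items():
    print(f"{name:25s}: {value:.4f}")
```

Macro averaging takes the unweighted mean of the per-class scores, so each class counts equally regardless of how many documents it contains.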

## 6. Compare with Baselines

In [None]:
# Get baseline models
baselines = get_baseline_models(n_components=100, random_state=42)

# Evaluate all models
results = []

for model_name, model in [('A-LSA', alsa)] + list(baselines.items()):
    print(f"\nEvaluating {model_name}...")
    
    if model_name != 'A-LSA':
        model.fit(X_train, y_train)
    
    metrics = evaluate_model(model, X_test, y_test, model_name)
    
    results.append({
        'Model': model_name,
        'F1 (macro)': metrics['F1-score (macro)'],
        'Accuracy': metrics['Accuracy'],
        'Precision': metrics['Precision (macro)'],
        'Recall': metrics['Recall (macro)']
    })

results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1 (macro)', ascending=False)

print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60)
print(results_df.to_string(index=False))

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(results_df))
width = 0.2

ax.bar(x - 1.5*width, results_df['F1 (macro)'], width, label='F1 (macro)', alpha=0.8)
ax.bar(x - 0.5*width, results_df['Accuracy'], width, label='Accuracy', alpha=0.8)
ax.bar(x + 0.5*width, results_df['Precision'], width, label='Precision', alpha=0.8)
ax.bar(x + 1.5*width, results_df['Recall'], width, label='Recall', alpha=0.8)

ax.set_xlabel('Model', fontweight='bold')
ax.set_ylabel('Score', fontweight='bold')
ax.set_title('Model Performance Comparison', fontweight='bold', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(results_df['Model'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Zoom in on the top of the score range; bars for scores below 0.80 are clipped
ax.set_ylim([0.80, 1.0])

plt.tight_layout()
plt.show()

## 7. Analyze Characteristic Terms

In [None]:
# Get characteristic terms
char_terms = alsa.get_characteristic_terms(n_terms=15)

print("Top 15 Characteristic Terms per Class:")
print("=" * 60)

print(f"\n{categories[1]} (Positive Class):")
print("-" * 30)
for i, (term, weight) in enumerate(char_terms['positive'][:15], 1):
    print(f"{i:2d}. {term:20s} (weight: {abs(weight):.4f})")

print(f"\n{categories[0]} (Negative Class):")
print("-" * 30)
for i, (term, weight) in enumerate(char_terms['negative'][:15], 1):
    print(f"{i:2d}. {term:20s} (weight: {abs(weight):.4f})")
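
The cell above prints `abs(weight)`, which suggests the term weights can be signed. Singular vectors from an SVD are only defined up to sign, so ranking terms by magnitude is a common choice. The sketch below shows that ranking on invented loadings; `top_terms` is a hypothetical helper, and how `get_characteristic_terms` actually scores terms is not shown in this notebook.

```python
import numpy as np

def top_terms(weights, vocabulary, n_terms=3):
    """Return (term, weight) pairs with the largest absolute weight first."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-np.abs(weights))[:n_terms]
    return [(vocabulary[i], float(weights[i])) for i in order]

# Invented vocabulary and loadings, for illustration only
vocab = ["puck", "goal", "pixel", "shader", "team"]
loadings = [-0.70, 0.55, 0.05, -0.02, 0.45]

ranked = top_terms(loadings, vocab, n_terms=3)
for rank, (term, weight) in enumerate(ranked, 1):
    print(f"{rank:2d}. {term:20s} (weight: {abs(weight):.4f})")
```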

In [None]:
# Visualize characteristic terms
plot_characteristic_terms(
    terms_pos=char_terms['positive'],
    terms_neg=char_terms['negative'],
    class_names=['comp.graphics', 'rec.sport.hockey'],
    n_terms=10
)

## 8. Visualize Latent Spaces with t-SNE

In [None]:
# Get latent projections for test set
print("Computing latent projections...")
z_pos, z_neg = alsa.get_latent_projections(X_test)

print(f"Positive space projections: {z_pos.shape}")
print(f"Negative space projections: {z_neg.shape}")

# Visualize with t-SNE
plot_tsne_visualization(
    z_pos=z_pos,
    z_neg=z_neg,
    y_true=y_test,
    class_names=['comp.graphics', 'rec.sport.hockey'],
    random_state=42
)

## 9. Analyze Decision Boundaries

In [None]:
# Compute differential semantic distances
distances = alsa.decision_function(X_test)

# Visualize distribution
fig, ax = plt.subplots(figsize=(12, 6))

# Plot histograms
ax.hist(distances[y_test == 0], bins=50, alpha=0.6, label='comp.graphics', color='red')
ax.hist(distances[y_test == 1], bins=50, alpha=0.6, label='rec.sport.hockey', color='blue')

# Plot threshold
ax.axvline(alsa.theta_, color='green', linestyle='--', linewidth=2, 
           label=f'Threshold θ={alsa.theta_:.4f}')

ax.set_xlabel('Differential Semantic Distance (Δ_sem)', fontweight='bold')
ax.set_ylabel('Frequency', fontweight='bold')
ax.set_title('Distribution of Differential Semantic Distances', fontweight='bold', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nDistance Statistics:")
print(f"  Mean (comp.graphics): {np.mean(distances[y_test == 0]):.4f}")
print(f"  Mean (rec.sport.hockey): {np.mean(distances[y_test == 1]):.4f}")
print(f"  Separation: {abs(np.mean(distances[y_test == 0]) - np.mean(distances[y_test == 1])):.4f}")
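
The histogram shows where θ sits relative to the two distance distributions. How A-LSA chooses θ is not shown here. One simple, assumed approach is to scan candidate thresholds on training distances and keep the one with the best balanced accuracy, which weights both classes equally. The helper name `best_threshold`, the rule "predict 1 when distance > θ", and the data are all illustrative assumptions.

```python
import numpy as np

def best_threshold(distances, labels):
    """Scan midpoints between sorted distances; maximize balanced accuracy.

    Assumes class 1 is predicted when distance > theta and that both
    classes are present in `labels`.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    values = np.unique(distances)  # sorted, duplicates removed
    candidates = (values[:-1] + values[1:]) / 2
    best_theta, best_score = None, -1.0
    for theta in candidates:
        pred = (distances > theta).astype(int)
        tpr = np.mean(pred[labels == 1] == 1)  # recall on class 1
        tnr = np.mean(pred[labels == 0] == 0)  # recall on class 0
        score = (tpr + tnr) / 2
        if score > best_score:
            best_theta, best_score = float(theta), float(score)
    return best_theta, best_score

# Invented distances: class 0 on the left, class 1 on the right
demo_distances = [-1.2, -0.8, -0.3, 0.4, 0.9, 1.1]
demo_labels = [0, 0, 0, 1, 1, 1]
theta_demo, score_demo = best_threshold(demo_distances, demo_labels)
print(theta_demo, score_demo)
```

With the real model, `alsa.decision_function(X_train)` and `y_train` could be passed in to compare the result against `alsa.theta_`.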

## 10. Cross-Validation

In [None]:
# Perform 5-fold cross-validation
cv_results = cross_validate_model(
    alsa,
    X_train,
    y_train,
    n_splits=5,
    random_state=42,
    verbose=True
)
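
How `cross_validate_model` builds its folds is not shown in this notebook. As a sketch of what 5-fold stratified splitting does, the NumPy-only example below deals the shuffled indices of each class round-robin across folds. In practice `sklearn.model_selection.StratifiedKFold` is the usual tool; `stratified_folds` is a hypothetical stand-in and the labels are toy data.

```python
import numpy as np

def stratified_folds(labels, n_splits=5, random_state=42):
    """Assign every sample a fold id so each fold keeps the class mix."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal this class's shuffled indices round-robin across the folds
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of

y_demo = np.array([0] * 10 + [1] * 10)
fold_of = stratified_folds(y_demo, n_splits=5)

for fold in range(5):
    test_idx = np.flatnonzero(fold_of == fold)
    train_idx = np.flatnonzero(fold_of != fold)
    # A model would be fitted on train_idx and scored on test_idx here
    print(fold, len(train_idx), len(test_idx), int(y_demo[test_idx].sum()))
```

Each fold serves once as the held-out set, and the per-fold scores are then averaged to estimate generalization performance.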

## 11. Prediction Examples

In [None]:
# Test on sample documents
test_samples = [
    "I love playing hockey and watching NHL games with my friends.",
    "Need help with 3D graphics rendering and OpenGL programming.",
    "The Maple Leafs won the game last night in overtime!",
    "How do I create realistic shadows in my ray tracing algorithm?"
]

print("Prediction Examples:")
print("=" * 80)

for i, text in enumerate(test_samples, 1):
    pred = alsa.predict([text])[0]
    proba = alsa.predict_proba([text])[0]
    distance = alsa.decision_function([text])[0]
    
    print(f"\n{i}. {text}")
    print(f"   Prediction: {categories[pred]}")
    print(f"   Probabilities: {categories[0]}={proba[0]:.4f}, {categories[1]}={proba[1]:.4f}")
    print(f"   Distance: {distance:.4f} (threshold: {alsa.theta_:.4f})")

## Summary

This notebook demonstrated the complete A-LSA workflow:

1. ✓ Loaded and explored the 20 Newsgroups dataset
2. ✓ Trained A-LSA model with dual latent spaces
3. ✓ Evaluated performance against baselines
4. ✓ Analyzed characteristic terms for each class
5. ✓ Visualized latent spaces with t-SNE
6. ✓ Examined decision boundaries
7. ✓ Performed cross-validation
8. ✓ Tested predictions on new documents

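**Key Findings:**
- A-LSA performs competitively with the baselines compared in Section 6
- Class-specific latent spaces capture the semantic differences between the two classes
- The differential semantic distance gives an interpretable decision score: its position relative to θ determines the predicted class
- The threshold θ adjusts the decision boundary for class imbalance; the two categories used here are nearly balanced, so this demo exercises that only lightly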