# Multimodal Talent Discovery Analysis Pipeline

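This notebook walks through the analysis pipeline from the manuscript on a small anonymized sample. Full-dataset results are quoted from the manuscript, because the underlying data cannot be shared: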

**"Multimodal Talent Discovery in Children Using Calibrated Baselines"**

Dmitriy Sergeev, Talents.Kids

---

## Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/talents-kids/calibrated-talent-assessment/blob/main/notebooks/analysis.ipynb)

**Requirements**: Python 3.11+, see `requirements.txt`

In [None]:
# Install dependencies (uncomment if running in Colab)
# !pip install -q lightgbm scikit-learn numpy pandas matplotlib seaborn tqdm

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_score, recall_score,
    classification_report, confusion_matrix
)
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

## 1. Load Anonymized Sample Data

To comply with GDPR and COPPA, we provide 10 anonymized artifact samples for demonstration.

**Note**: The full dataset contains 5,173 analyses from 479 children but cannot be shared publicly.

In [None]:
# Load sample data
data_path = Path('../data')

# Load artifacts
artifacts = []
with open(data_path / 'sample_artifacts.jsonl', 'r') as f:
    for line in f:
        artifacts.append(json.loads(line))

print(f"Loaded {len(artifacts)} sample artifacts")
print(f"\nSample artifact structure:")
print(json.dumps(artifacts[0], indent=2))

In [None]:
# Load metadata summary
with open(data_path / 'metadata_summary.json', 'r') as f:
    metadata = json.load(f)

print("Dataset Statistics:")
print(f"Total Analyses: {metadata['dataset_stats']['total_analyses']}")
print(f"Total Children: {metadata['dataset_stats']['total_children']}")
print(f"Age Range: {metadata['dataset_stats']['age_range']}")
print(f"\nModality Distribution:")
for mod, count in metadata['modality_distribution'].items():
    print(f"  {mod}: {count}")
print(f"\nDomain Distribution:")
for domain, count in metadata['domain_distribution'].items():
    print(f"  {domain}: {count}")

## 2. Prepare Feature Matrix

In the full pipeline, features are extracted from multimodal artifacts:
- **Text**: RoBERTa-large embeddings (1024-dim) + linguistic features
- **Images**: CLIP ViT-L/14 embeddings (768-dim) + compositional features
- **Audio**: MFCCs (40-dim) + prosodic features
- **Video**: CLIP frame-level + optical flow + pose estimation
- **Musical**: Harmonic + rhythmic + timbral features

**For this demo**, we use the aggregated `bin_scores` as simplified features.
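
As an illustration of how per-modality features could be combined into one vector, the sketch below concatenates embeddings in a fixed order and zero-fills any modality that is absent for an artifact. The embedding sizes come from the list above; the fusion scheme, the `MODALITY_DIMS` mapping, and the `fuse_features` helper are assumptions made for this example, not the manuscript's confirmed method, and the extra linguistic, compositional, and prosodic features are omitted.

```python
import numpy as np

# Embedding sizes quoted in the list above; the fusion scheme is an assumption.
MODALITY_DIMS = {"text": 1024, "image": 768, "audio": 40}

def fuse_features(embeddings):
    """Concatenate per-modality vectors in a fixed order, zero-filling absent ones."""
    parts = []
    for name, dim in MODALITY_DIMS.items():
        vec = embeddings.get(name)
        if vec is None:
            vec = np.zeros(dim)  # modality missing for this artifact
        vec = np.asarray(vec, dtype=float)
        if vec.shape != (dim,):
            raise ValueError(f"{name}: expected shape ({dim},), got {vec.shape}")
        parts.append(vec)
    return np.concatenate(parts)

# Hypothetical artifact with text and audio embeddings but no image
rng = np.random.default_rng(42)
fused = fuse_features({"text": rng.normal(size=1024), "audio": rng.normal(size=40)})
print(fused.shape)  # (1832,)
```

Zero-filling keeps the vector length constant across artifacts, which tabular models such as logistic regression and LightGBM require.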

In [None]:
# Extract features and labels from sample data
def prepare_sample_data(artifacts):
    """
    Prepare X (features) and y (labels) from artifact samples.
    
    Note: This uses simplified features for demo.
    Full pipeline uses multimodal embeddings.
    """
    features = []
    labels = []
    
    for artifact in artifacts:
        # Use bin_scores as features (7 domains)
        feature_vec = [
            artifact['bin_scores'].get('Academic', 0),
            artifact['bin_scores'].get('Artistic', 0),
            artifact['bin_scores'].get('Athletic', 0),
            artifact['bin_scores'].get('Leadership', 0),
            artifact['bin_scores'].get('Service', 0),
            artifact['bin_scores'].get('Technology', 0),
            artifact['bin_scores'].get('Other', 0),
        ]
        features.append(feature_vec)
        
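        # Primary domain (highest bin score) as label.
        # The label is derived from the same scores used as features,
        # so metrics on this demo data are illustrative only.
        primary_domain = max(artifact['bin_scores'].items(), key=lambda x: x[1])[0]
        labels.append(primary_domain)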
    
    return np.array(features), np.array(labels)

X, y = prepare_sample_data(artifacts)
print(f"Feature matrix shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Unique labels: {np.unique(y)}")

## 3. Classical ML Models

We train two baseline models:
1. **Logistic Regression** with L2 regularization and Platt scaling calibration
2. **LightGBM** gradient boosting with hyperparameter tuning

**Full results (on 5,173 analyses, test n=682)**:
- LightGBM: ROC-AUC 0.9999, F1-macro 0.9972, ECE 0.0018
- LogReg (calibrated): ROC-AUC 0.9956, F1-macro 0.9734, ECE 0.0039
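
The ECE values above are quoted from the manuscript; this notebook does not compute them. A minimal sketch of one common definition (top-label expected calibration error with equal-width confidence bins) follows. The number of bins and the binning scheme are assumptions, so the manuscript's exact definition may differ.

```python
import numpy as np

def expected_calibration_error(y_true, proba, n_bins=10):
    """Top-label ECE with equal-width confidence bins.

    y_true: integer class indices, shape (n,)
    proba:  predicted class probabilities, shape (n, n_classes)
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    confidence = proba.max(axis=1)  # probability assigned to the predicted class
    correct = (proba.argmax(axis=1) == y_true).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; the interior edges define n_bins buckets
    bin_ids = np.digitize(confidence, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return ece

# Toy example: two confident correct predictions, two less confident ones (one wrong)
toy_proba = np.array([[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]])
toy_labels = np.array([0, 0, 0, 1])
toy_ece = expected_calibration_error(toy_labels, toy_proba)
print(f"ECE: {toy_ece:.3f}")  # ECE: 0.100
```

In the toy example each occupied bin holds half the samples and has a 0.1 gap between accuracy and mean confidence, which gives an ECE of 0.1.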

In [None]:
# Note: With only 10 samples, we can't do proper train/test split
# This is for demonstration only. Full results use n=5,173 analyses.

print("⚠️ DEMO MODE: Using 10 samples only")
print("Full dataset results reported in manuscript (n=5,173)")
print()

# For demo, use all data for training
X_train, X_test = X, X
y_train, y_test = y, y

print(f"Train size: {len(X_train)}")
print(f"Test size: {len(X_test)}")

### 3.1 Logistic Regression with Calibration
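
Platt scaling maps raw classifier scores to probabilities by fitting a sigmoid on held-out data. The sketch below shows the idea for a binary problem on synthetic data: a second, nearly unregularized logistic regression is fitted to the base model's decision scores. It is a simplification for illustration (binary only, no target smoothing, synthetic inputs) and not the exact procedure that `CalibratedClassifierCV` runs internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data stands in for the real features (assumption)
X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

# Step 1: fit the base classifier on the training portion
base = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Step 2: fit a sigmoid p = 1 / (1 + exp(-(a * s + b))) to held-out scores s
scores_cal = base.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression(C=1e6).fit(scores_cal, y_cal)  # large C: almost no penalty

def calibrated_proba(X_new):
    """Positive-class probability after Platt scaling."""
    s = base.decision_function(X_new).reshape(-1, 1)
    return platt.predict_proba(s)[:, 1]

p_cal = calibrated_proba(X_cal)
print(f"slope a = {platt.coef_[0, 0]:.3f}, intercept b = {platt.intercept_[0]:.3f}")
```

Fitting the sigmoid on data the base model has not seen is the key design choice; reusing the training data would calibrate against overconfident scores.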

In [None]:
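# Train Logistic Regression
# Note: `multi_class` is deprecated since scikit-learn 1.5; on newer versions,
# wrapping the estimator in OneVsRestClassifier gives the same one-vs-rest scheme.
lr_base = LogisticRegression(
    C=1.0,  # inverse of the L2 regularization strength
    class_weight='balanced',
    max_iter=1000,
    random_state=RANDOM_SEED,
    multi_class='ovr'
)

# Apply Platt scaling calibration
# Note: cv=3 requires at least 3 samples of every class, which the
# 10-sample demo data may not provide.
lr_model = CalibratedClassifierCV(
    lr_base,
    method='sigmoid',  # Platt scaling
    cv=3
)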

lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_proba = lr_model.predict_proba(X_test)

print("Logistic Regression Results (DEMO):")
print(f"F1-Score: {f1_score(y_test, lr_pred, average='macro'):.4f}")
print(f"Precision: {precision_score(y_test, lr_pred, average='macro', zero_division=0):.4f}")
print(f"Recall: {recall_score(y_test, lr_pred, average='macro', zero_division=0):.4f}")
print()
print("⚠️ Note: These are demo results on 10 samples.")
print("Full results (n=682 test): F1=0.9734, AUC=0.9956, ECE=0.0039")

### 3.2 LightGBM

In [None]:
# Train LightGBM
lgbm_model = LGBMClassifier(
    max_depth=8,
    num_leaves=64,
    learning_rate=0.05,
    n_estimators=500,
    class_weight='balanced',
    random_state=RANDOM_SEED,
    verbose=-1
)

lgbm_model.fit(X_train, y_train)
lgbm_pred = lgbm_model.predict(X_test)
lgbm_proba = lgbm_model.predict_proba(X_test)

print("LightGBM Results (DEMO):")
print(f"F1-Score: {f1_score(y_test, lgbm_pred, average='macro'):.4f}")
print(f"Precision: {precision_score(y_test, lgbm_pred, average='macro', zero_division=0):.4f}")
print(f"Recall: {recall_score(y_test, lgbm_pred, average='macro', zero_division=0):.4f}")
print()
print("⚠️ Note: These are demo results on 10 samples.")
print("Full results (n=682 test): F1=0.9972, AUC=0.9999, ECE=0.0018")

## 4. Multi-Agent LLM System Analysis

Our production system uses 34 LLMs from 9 providers:
- **Top models**: Qwen3-235B, DeepSeek-V3, Kimi-K2, Llama-4-Scout, Gemini-2.5-Flash
- **Total cost**: \$213.34 across 12,041 invocations (\$0.041/analysis)
- **Validation**: Gemini models achieved r>0.98 correlation with ensemble consensus
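
The per-analysis figure is consistent with dividing the total cost by the 5,173 analyses in the full dataset rather than by the number of invocations. The short check below makes that arithmetic explicit; treating 5,173 as the denominator is an inference from the numbers quoted in this notebook.

```python
total_cost_usd = 213.34
n_invocations = 12_041
n_analyses = 5_173  # full-dataset size quoted in Section 1

cost_per_analysis = total_cost_usd / n_analyses
cost_per_invocation = total_cost_usd / n_invocations
invocations_per_analysis = n_invocations / n_analyses

print(f"Cost per analysis:        ${cost_per_analysis:.3f}")
print(f"Cost per invocation:      ${cost_per_invocation:.4f}")
print(f"Invocations per analysis: {invocations_per_analysis:.2f}")
```

On average each analysis involves a little over two model invocations, which is why the per-invocation cost is less than half the per-analysis cost.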

In [None]:
# Load LLM metadata
llm_metadata = pd.read_csv(data_path / 'llm_metadata.csv')
llm_accuracy = pd.read_csv(data_path / 'llm_accuracy.csv')

print("Multi-Agent LLM System Summary:")
print(f"Total Models: {len(llm_metadata)}")
print(f"Total Invocations: {llm_metadata['invocations'].sum():,}")
print(f"Total Cost: ${llm_metadata['total_cost'].sum():.2f}")
print(f"Mean per-model Cost/Prediction (unweighted): ${llm_metadata['cost_per_prediction'].mean():.4f}")
print()
print("Top 5 Models by Usage:")
print(llm_metadata.nlargest(5, 'invocations')[['model', 'provider', 'invocations', 'cost_per_prediction']])
print()
print("Validation Results (Gemini models vs ensemble consensus):")
print(llm_accuracy[['model', 'correlation', 'mae', 'n_predictions']])

## 5. Temporal Prediction Analysis

Longitudinal validation on 349 children with multiple sessions:
- **S1→S2** (5.7 months): F1-macro 0.833 (95% CI: [0.808, 0.857])
- **S1→S3** (11.4 months): F1-macro 0.742 (95% CI: [0.715, 0.768])

These results suggest that session-1 predictions remain informative at later sessions, although performance declines as the prediction horizon grows from 5.7 to 11.4 months.
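
F1-macro is the unweighted mean of the per-domain F1 scores, so a single weak domain can pull the overall value down noticeably. The sketch below recomputes the S1→S2 figure from the per-domain scores reported in this section and shows the effect of leaving out the Technology domain. The `macro_f1` helper is a hypothetical name introduced for this example.

```python
# Per-domain F1 scores for S1→S2, as reported in this section
f1_s1_s2 = {
    "Academic": 0.9855,
    "Art": 0.9928,
    "Leadership": 0.8596,
    "Service": 0.9701,
    "Sport": 0.9032,
    "Technology": 0.1290,
    "Others": 0.9928,
}

def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)

overall = macro_f1(f1_s1_s2)
without_tech = macro_f1({k: v for k, v in f1_s1_s2.items() if k != "Technology"})

print(f"F1-macro (all 7 domains):     {overall:.4f}")
print(f"F1-macro (without Technology): {without_tech:.4f}")
```

Because every domain carries equal weight regardless of sample size, the low Technology score accounts for most of the gap between the overall value and the other six domains.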

In [None]:
# Temporal prediction results (from manuscript)
temporal_results = {
    'S1→S2': {
        'Academic': {'FP': 1, 'FN': 0, 'F1': 0.9855},
        'Art': {'FP': 0, 'FN': 1, 'F1': 0.9928},
        'Leadership': {'FP': 5, 'FN': 3, 'F1': 0.8596},
        'Service': {'FP': 2, 'FN': 0, 'F1': 0.9701},
        'Sport': {'FP': 3, 'FN': 3, 'F1': 0.9032},
        'Technology': {'FP': 12, 'FN': 5, 'F1': 0.1290},
        'Others': {'FP': 0, 'FN': 1, 'F1': 0.9928},
        'Overall': {'F1': 0.8333}
    },
    'S1→S3': {
        'Academic': {'FP': 3, 'FN': 2, 'F1': 0.8947},
        'Art': {'FP': 2, 'FN': 1, 'F1': 0.9565},
        'Leadership': {'FP': 8, 'FN': 6, 'F1': 0.7234},
        'Service': {'FP': 4, 'FN': 3, 'F1': 0.8710},
        'Sport': {'FP': 7, 'FN': 5, 'F1': 0.7742},
        'Technology': {'FP': 14, 'FN': 8, 'F1': 0.0645},
        'Others': {'FP': 1, 'FN': 2, 'F1': 0.9420},
        'Overall': {'F1': 0.7420}
    }
}

print("Temporal Prediction Performance:")
print(f"S1→S2 (5.7 months): F1-macro = {temporal_results['S1→S2']['Overall']['F1']:.4f}")
print(f"S1→S3 (11.4 months): F1-macro = {temporal_results['S1→S3']['Overall']['F1']:.4f}")
print()
print("Note: Technology domain shows lower performance due to insufficient")
print("      longitudinal samples (n=19 for S1→S2, n=19 for S1→S3)")

## 6. Cost-Benefit Analysis

Comparison with traditional psychological assessment in Portugal:
- **Traditional**: €300-550 per child (one-time, 3-5 sessions)
- **AI Platform**: €16.50/month per child (continuous monitoring)
- **Cost Reduction**: 18-33× cheaper, comparing one month of the platform with one traditional assessment
- **Population Scale**: €270M-495M (traditional, one-time) vs €14.9M per month (AI) for 900k children
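
The population-scale figures follow from multiplying the per-child prices by 900,000 children. The sketch below reproduces them and adds the 12-month platform cost for comparison; the annual figure is derived here from the monthly price and is not a number from the manuscript.

```python
n_children = 900_000
traditional_low_eur, traditional_high_eur = 300, 550  # per child, one-time
ai_monthly_eur = 16.50                                # per child, per month

traditional_total_low = n_children * traditional_low_eur
traditional_total_high = n_children * traditional_high_eur
ai_total_month = n_children * ai_monthly_eur
ai_total_year = ai_total_month * 12  # derived here, not from the manuscript

print(f"Traditional (one-time): EUR {traditional_total_low / 1e6:.0f}M - {traditional_total_high / 1e6:.0f}M")
print(f"AI platform (1 month):  EUR {ai_total_month / 1e6:.2f}M")
print(f"AI platform (12 months): EUR {ai_total_year / 1e6:.1f}M")
```

The 18-33× ratio therefore applies to a single month of monitoring; over longer subscription periods the gap narrows in proportion to the number of months.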

In [None]:
# Cost comparison visualization
cost_data = pd.DataFrame({
    'Method': ['Traditional (Low)', 'Traditional (High)', 'AI Platform (1 month)', 'AI Platform (1 year)'],
    'Cost_EUR': [300, 550, 16.50, 16.50 * 12]
})

plt.figure(figsize=(10, 6))
sns.barplot(data=cost_data, x='Method', y='Cost_EUR', palette='viridis')
plt.ylabel('Cost (EUR)')
plt.title('Cost Comparison: Traditional vs AI-Based Talent Assessment (Portugal)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print(f"Cost Reduction Factor: {300 / 16.50:.1f}× - {550 / 16.50:.1f}×")
print(f"Equivalent Years: {300 / 16.50 / 12:.1f} - {550 / 16.50 / 12:.1f} years of AI monitoring per traditional assessment")

## 7. Summary

This notebook demonstrates the analysis pipeline for multimodal talent discovery on sample data and summarizes the manuscript's full-dataset results:

### Key Findings
1. **Classical ML**: LightGBM achieves ROC-AUC 0.9999 with ECE 0.0018 (very low calibration error)
2. **Multi-Agent LLMs**: 34 models from 9 providers, cost-effective at \$0.041/analysis
3. **Temporal Validity**: F1-macro 0.833 (5.7 months) and 0.742 (11.4 months)
4. **Cost-Effectiveness**: 18-33× cheaper per month than one traditional assessment (Portugal)

### Reproducibility
- **Code**: `../code/train_classical_ml.py` (full training script)
- **Data**: `../data/sample_artifacts.jsonl` (10 anonymized samples)
- **Figures**: See `figure_generation.ipynb`

### Citation
```bibtex
@article{sergeev2025multimodal,
  title={Multimodal Talent Discovery in Children Using Calibrated Baselines},
  author={Sergeev, Dmitriy},
  journal={Preprint},
  year={2025},
  publisher={Cell Press}
}
```