# TensorTalk Evaluation

This notebook shows how to evaluate TensorTalk with the metrics reported in the paper. WER and SECS have placeholder cells below; UTMOS and PER are not computed here. The metrics are:
- **UTMOS**: UTokyo-SaruLab Mean Opinion Score (naturalness)
- **WER**: Word Error Rate (intelligibility)
- **PER**: Phoneme Error Rate (intelligibility)
- **SECS**: Speaker Encoder Cosine Similarity (speaker similarity)

## Setup

First, import necessary libraries and initialize the pipeline.

In [None]:
import sys
sys.path.append('..')

import torch
import torchaudio
import pandas as pd
from pathlib import Path
import numpy as np
from tqdm import tqdm

from src import TensorTalkPipeline

# Initialize pipeline
pipeline = TensorTalkPipeline(k=4)

## Load Test Sentences

Load the test sentences from the FLoRes+ dataset (or your own test set).

In [None]:
# Load test sentences
sentences_df = pd.read_csv('../data/sampled_sentences.csv', header=None, names=['text'])
test_sentences = sentences_df['text'].tolist()

print(f"Loaded {len(test_sentences)} test sentences")
print("\nFirst 3 sentences:")
for i, text in enumerate(test_sentences[:3]):
    print(f"{i+1}. {text}")

## Generate Synthetic Speech

Generate synthetic speech for all test sentences using a target speaker.

In [None]:
# Configure target speaker
target_audio_paths = [
    "../examples/target_speaker/sample1.wav",
    "../examples/target_speaker/sample2.wav",
]

# Output directory
output_dir = Path("../examples/generated/evaluation")
output_dir.mkdir(parents=True, exist_ok=True)

# Generate speech for each sentence
print("Generating synthetic speech...")
generated_files = []

for i, text in enumerate(tqdm(test_sentences[:10])):  # Test on the first 10 sentences
    try:
        # Synthesize
        audio = pipeline.synthesize(
            text=text,
            target_audio_paths=target_audio_paths,
            alpha=1.0
        )
        
        # Save
        output_path = output_dir / f"generated_{i:03d}.wav"
        pipeline.save_audio(audio, str(output_path))
        generated_files.append(str(output_path))
        
    except Exception as e:
        print(f"\nError generating sentence {i}: {e}")
        generated_files.append(None)

print(f"\nGenerated {len([f for f in generated_files if f])} audio files")

## Evaluate WER (Word Error Rate)

Calculate WER by transcribing each generated file with an automatic speech recognition model (Whisper) and comparing the transcript to the input text.

**Note**: You'll need to install additional packages:
```bash
pip install openai-whisper jiwer
```
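
To make the metric concrete, here is a minimal pure-Python sketch of WER: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is not how `jiwer` is implemented, and the `normalize` helper (lowercasing and stripping punctuation) is an assumption added for illustration; ASR transcripts usually differ from the input text in casing and punctuation, which can inflate WER if left as-is.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance over normalized words, divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Because `edit_distance` works on any token sequence, the same function could in principle be applied to phoneme sequences to obtain PER, given a phonemizer for both the reference and the transcript.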

In [None]:
# This is a placeholder - install whisper and jiwer to use
# Uncomment after installing:

# import whisper
# import jiwer

# # Load Whisper model
# model = whisper.load_model("large-v3")

# wer_scores = []
# for i, (ref_text, audio_path) in enumerate(zip(test_sentences[:10], generated_files)):
#     if audio_path is None:
#         continue
#     
#     # Transcribe
#     result = model.transcribe(audio_path)
#     hyp_text = result["text"]
#     
#     # Calculate WER
#     wer = jiwer.wer(ref_text, hyp_text)
#     wer_scores.append(wer)
#     
#     print(f"Sentence {i}:")
#     print(f"  Reference: {ref_text}")
#     print(f"  Hypothesis: {hyp_text}")
#     print(f"  WER: {wer:.3f}\n")

# print(f"Average WER: {np.mean(wer_scores):.3f} ± {np.std(wer_scores):.3f}")

print("WER evaluation requires whisper and jiwer packages.")
print("Install with: pip install openai-whisper jiwer")

## Evaluate SECS (Speaker Encoder Cosine Similarity)

Measure speaker similarity between generated and target speech.
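
The score itself is the cosine similarity between two speaker embeddings. The sketch below shows only that arithmetic on plain Python lists, independent of any particular encoder. Averaging the embeddings of several target clips (`mean_embedding`) is an assumption made here for illustration; the placeholder cell in this notebook compares against the first target clip only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise mean of several embeddings (hypothetical helper)."""
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]


def secs(generated_emb, target_embs):
    """Similarity between one generated clip and the averaged target voice."""
    return cosine_similarity(generated_emb, mean_embedding(target_embs))
```

Values lie in [-1, 1]; higher means the generated voice is closer to the target according to the encoder.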

In [None]:
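# SECS needs a speaker encoder (ECAPA2 is the intended one). The placeholder
# below loads SpeechBrain's ECAPA-TDNN checkpoint instead, so scores may not be
# directly comparable to the paper's.
# Uncomment after installing speechbrain:

# import torch.nn.functional as F
# # In speechbrain >= 1.0 this class lives in speechbrain.inference.speaker
# from speechbrain.pretrained import EncoderClassifier

# # Load speaker encoder
# spk_encoder = EncoderClassifier.from_hparams(
#     source="speechbrain/spkrec-ecapa-voxceleb"
# )

# def embed(path):
#     # encode_batch expects a waveform tensor, not a file path
#     signal = spk_encoder.load_audio(path)  # loads and resamples for the encoder
#     return spk_encoder.encode_batch(signal.unsqueeze(0)).squeeze()

# # The target embedding does not change, so compute it once
# target_emb = embed(target_audio_paths[0])

# secs_scores = []
# for audio_path in generated_files:
#     if audio_path is None:
#         continue
#
#     # Calculate cosine similarity between generated and target embeddings
#     gen_emb = embed(audio_path)
#     similarity = F.cosine_similarity(gen_emb, target_emb, dim=-1)
#     secs_scores.append(similarity.item())

# print(f"Average SECS: {np.mean(secs_scores):.3f} ± {np.std(secs_scores):.3f}")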

print("SECS evaluation requires SpeechBrain package.")
print("Install with: pip install speechbrain")

## Summary Statistics

Compare results with the paper's reported metrics.
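
Once scores have been collected, the "Your Results" column can be filled in rather than left as 'N/A'. The helper below is a hypothetical sketch (the name `summarize` is not part of TensorTalk): it formats a list of per-sentence scores as mean ± standard deviation and falls back to 'N/A' when no scores are available.

```python
import statistics


def summarize(scores):
    """Format scores as 'mean ± std', or 'N/A' if there is nothing to report."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return "N/A"
    mean = statistics.fmean(valid)
    # Population standard deviation, the same convention as np.std's default
    std = statistics.pstdev(valid)
    return f"{mean:.3f} ± {std:.3f}"
```

For example, assuming the WER and SECS cells were uncommented and produced `wer_scores` and `secs_scores`, the column could be built as `['N/A', summarize(wer_scores), 'N/A', summarize(secs_scores)]`.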

In [None]:
# Create comparison table
results = pd.DataFrame({
    'Metric': ['UTMOS ↑', 'WER ↓', 'PER ↓', 'SECS ↑'],
    'Paper (LJSpeech)': [4.27, 0.02, 0.01, 0.65],
    'Your Results': ['N/A', 'N/A', 'N/A', 'N/A'],
})

print("\n" + "="*60)
print("Performance Comparison")
print("="*60)
print(results.to_string(index=False))
print("="*60)

print("\nNote: Install evaluation packages to calculate metrics.")
print("See paper for baseline comparisons: https://drive.google.com/file/d/1j9t6o2sKrWnu83dZkSFWDphGyeI9_NYr/view")

## Qualitative Analysis

Listen to some examples to assess quality subjectively.

In [None]:
from IPython.display import Audio, display

# Play first 3 generated samples
for i, audio_path in enumerate(generated_files[:3]):
    if audio_path is None:
        continue
    
    print(f"\nSample {i+1}: {test_sentences[i][:50]}...")
    waveform, sr = torchaudio.load(audio_path)
    display(Audio(waveform.numpy(), rate=sr))

## Conclusion

This notebook generates speech for a sample of test sentences and provides placeholder cells for scoring it with WER and SECS.

### Key Metrics:
- **UTMOS**: Measures naturalness (how human-like the speech sounds)
- **WER/PER**: Measure intelligibility (how accurate the words/phonemes are)
- **SECS**: Measures speaker similarity (how close to target voice)

### Results from Paper:
- Achieved UTMOS of **4.27** on LJSpeech
- Near-perfect WER/PER due to gTTS (trade-off: less natural prosody)
- Competitive SECS showing good voice transfer

For full results and analysis, see the [paper](https://drive.google.com/file/d/1j9t6o2sKrWnu83dZkSFWDphGyeI9_NYr/view).