# Style-Preserving Speech-to-Speech Translation Experiment

This notebook runs an experiment to determine the minimum duration of reference audio required to clone a speaker's voice effectively across languages.

## 1. Setup Environment
Install the necessary dependencies if running on Google Colab (uncomment the line below).

In [None]:
# !pip install torch transformers speechbrain soundfile librosa openai-whisper accelerate sentencepiece pydantic torchcodec datasets "kagglehub[pandas-datasets]"

## 2. Import Modules
Import the `Language` enum and the experiment setup and runner classes.

In [None]:
import os
import sys

# Add current directory to path if needed
sys.path.append(os.getcwd())

from enums import Language
from setup_experiment import ExperimentSetup
from run_experiment import ExperimentRunner

## 3. Configure Experiment
Define the experiment parameters: source and target languages, reference durations to test (in seconds), number of speakers, and random seed.

In [None]:
SOURCE_LANG = Language.ENGLISH
TARGET_LANG = Language.SPANISH
DURATIONS = [5.0, 10.0, 15.0, 20.0, 30.0]  # Reference audio durations in seconds
NUM_SPEAKERS = 5  # Number of unique speakers to test
SEED = 42  # Random seed for reproducibility

## 4. Prepare Data
This step:
1. Downloads or loads the Common Voice dataset via KaggleHub.
2. Selects `NUM_SPEAKERS` speakers with sufficient data.
3. Creates concatenated reference audio files for each duration.
4. Generates a manifest for the experiment run.
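
Step 3 above can be pictured with a minimal sketch. It is not the code inside `ExperimentSetup`; the helper name `build_reference`, the 16 kHz sample rate, and the synthetic clips are all assumptions made for illustration. The idea is to append a speaker's clips until the target duration is reached, then trim to the exact length.

```python
import numpy as np

def build_reference(clips, sample_rate, target_seconds):
    """Concatenate clips until target_seconds is covered, then trim.

    Hypothetical helper: returns None when the speaker does not have
    enough audio to reach the target duration.
    """
    target_samples = int(round(target_seconds * sample_rate))
    collected, total = [], 0
    for clip in clips:
        collected.append(clip)
        total += len(clip)
        if total >= target_samples:
            break
    if total < target_samples:
        return None  # Not enough audio for this duration
    return np.concatenate(collected)[:target_samples]

# Synthetic stand-ins for one speaker's clips (3 s, 4 s, 6 s at an assumed 16 kHz)
sr = 16_000
clips = [
    np.zeros(3 * sr, dtype=np.float32),
    np.ones(4 * sr, dtype=np.float32),
    np.zeros(6 * sr, dtype=np.float32),
]
ref_5s = build_reference(clips, sr, 5.0)    # 3 s clip plus the first 2 s of the 4 s clip
ref_30s = build_reference(clips, sr, 30.0)  # Only 13 s available, so None
```

Returning `None` for an unreachable duration is one way to make "sufficient data" explicit: a speaker would only qualify if every entry in `DURATIONS` yields a reference.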

In [None]:
setup = ExperimentSetup(
    source_language=SOURCE_LANG,
    target_language=TARGET_LANG,
    reference_durations=DURATIONS,
    seed=SEED
)

# Prepare the manifest
manifest = setup.prepare_data(num_speakers=NUM_SPEAKERS)

print(f"Manifest ready with {len(manifest)} speakers.")
print("Sample Item:", manifest[0] if manifest else "No data")

## 5. Run Experiment
Execute the pipeline for each speaker and duration:
1. Extract the ground-truth embedding (original speaker).
2. Translate the source text to the target language (Spanish here).
3. Synthesize target-language speech using the reference audio (5 s, 10 s, etc.) for style.
4. Compute the cosine similarity between the ground-truth and output embeddings.
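
The similarity score in step 4 can be sketched as follows. This is a plain NumPy illustration, not necessarily how `ExperimentRunner` computes it; the function name and the toy two-dimensional vectors are assumptions. Cosine similarity is the dot product of the two embeddings divided by the product of their norms, so it ranges from -1 to 1 and ignores embedding magnitude.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (illustrative helper)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Toy vectors standing in for speaker embeddings
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])       # Same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # Unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # Opposite directions
```

A score close to 1 would indicate that the synthesized speech sits near the original speaker in embedding space.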

In [None]:
runner = ExperimentRunner()
runner.run(manifest)

## 6. Analyze Results
Save the results to CSV, then inspect the mean similarity score per reference duration.

In [None]:
import pandas as pd

# Persist the results, then reload them for analysis
runner.save_results("experiment_results.csv")
results_df = pd.read_csv("experiment_results.csv")

# Display average similarity score per duration
print("\nAverage Similarity Scores by Duration:")
print(results_df.groupby("duration")["similarity_score"].mean())

results_df.head(10)
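
One possible way to turn the per-duration means into a single "minimal duration" is sketched below. The rule (shortest duration whose mean is within a tolerance of the best mean), the function name `minimal_duration`, the 0.02 tolerance, and the example numbers are all assumptions for illustration; the numbers are not measured results.

```python
def minimal_duration(mean_by_duration, tolerance=0.02):
    """Shortest duration whose mean similarity is within `tolerance` of the best mean.

    `mean_by_duration` maps duration (seconds) to mean similarity, e.g. the
    output of results_df.groupby("duration")["similarity_score"].mean().to_dict().
    """
    if not mean_by_duration:
        raise ValueError("no results to analyze")
    best = max(mean_by_duration.values())
    return min(d for d, score in mean_by_duration.items() if score >= best - tolerance)

# Illustrative numbers only, NOT measured results
example_means = {5.0: 0.71, 10.0: 0.80, 15.0: 0.84, 20.0: 0.85, 30.0: 0.85}
chosen = minimal_duration(example_means, tolerance=0.02)
```

With only a handful of speakers the means can be noisy, so it may be worth inspecting the per-duration standard deviation before trusting any single cutoff.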