# Preparing Baseline Models

This notebook downloads pretrained models from Hugging Face and saves them locally for baseline evaluation.

**Models:**
- Whisper Small (no fine-tuning)
- Whisper Base (no fine-tuning)
- Speech2Text Cross-Lingual (English model + multilingual tokenizer, no training)

**Saving:**
All models are saved to `experiments/baselines/` in our custom checkpoint format:
- `model_weights.pt` - model weights
- `model_metadata.json` - metadata (model_type, model_name, tokenizer_name_or_path, target_language, epoch)
- `config.yaml` - full project config for evaluation.py
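
The layout above can be sketched with the standard library alone. This is only an illustration of the directory contract, not the project's `ModelManager.save_checkpoint`: the helper names `write_metadata` and `missing_files` are hypothetical, and the metadata values (including `"whisper"` as a `model_type`) are assumed placeholders.

```python
import json
import tempfile
from pathlib import Path

# Files that every baseline checkpoint directory is expected to contain.
REQUIRED_FILES = ("model_weights.pt", "model_metadata.json", "config.yaml")


def write_metadata(checkpoint_dir: Path, metadata: dict) -> Path:
    """Write model_metadata.json into checkpoint_dir (hypothetical helper)."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / "model_metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def missing_files(checkpoint_dir: Path) -> list[str]:
    """Return the required files that are absent from checkpoint_dir."""
    return [name for name in REQUIRED_FILES if not (checkpoint_dir / name).exists()]


# Placeholder values for illustration; the real ones come from the project config.
example_metadata = {
    "model_type": "whisper",
    "model_name": "openai/whisper-small",
    "tokenizer_name_or_path": None,
    "target_language": "ru",
    "epoch": None,  # a baseline has no training epoch
}

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "whisper-small"
    write_metadata(ckpt, example_metadata)
    # Only the metadata file exists so far, so the other two are reported.
    print(missing_files(ckpt))
```

A check like `missing_files` is a cheap way to catch an incomplete checkpoint directory before evaluation starts.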

In [1]:
import sys
from pathlib import Path
import torch
import shutil

# Add the project root to sys.path so the `src` package is importable
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Import our modules
from src.config import ProjectConfig, ModelConfig, load_config
from src.models import ModelManager
from src.data import DataManager

print(f"Project root: {project_root}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Project root: c:\Users\User\Documents\Progs\Projects\seepch_to_text
PyTorch version: 2.5.1
CUDA available: True


## 1. Whisper Small (Baseline)

Loads `openai/whisper-small` without fine-tuning.

The model is saved to:
- `experiments/baselines/whisper-small/`
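
A baseline is evaluated exactly as downloaded, so every parameter is frozen before saving. The project's `freeze_*` and `get_trainable_parameters` methods live in `src.models` and are not shown in this notebook; the sketch below is a stand-in that assumes freezing means setting `requires_grad = False`. The toy `nn.Sequential` model and the helper names `freeze_all` and `count_parameters` are hypothetical.

```python
import torch.nn as nn


def freeze_all(model: nn.Module) -> None:
    """Disable gradients for every parameter of the model."""
    for param in model.parameters():
        param.requires_grad = False


def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# Toy stand-in for a real speech model.
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
freeze_all(toy)
total, trainable = count_parameters(toy)
print(f"Total: {total:,}  Trainable: {trainable:,}")
```

Checking that the trainable count is zero before saving guards against a partially frozen model being stored as a baseline.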

In [2]:
print("="*60)
print("Loading Whisper Small...")
print("="*60)

# Load config
config_path = project_root / "configs" / "whisper_small.yaml"
config = load_config(config_path)

print(f"Config loaded: {config.model.model_name}")

# Create DataManager to setup processor (pass full ProjectConfig)
data_manager = DataManager(config)
processor = data_manager.setup_processor(
    model_name=config.model.model_name,
    model_type=config.model.model_type,
    language=config.data.language,
    task=config.data.task
)
print(f"✓ Processor created: {type(processor).__name__}")

# Create ModelManager and model
model_manager = ModelManager()
model = model_manager.create_model(config.model, processor)
print(f"✓ Model created: {type(model).__name__}")

# IMPORTANT: baseline models are ALWAYS fully frozen (regardless of the config)
print("\n⚠️  Force-freezing all parameters for the baseline model...")
model.freeze_feature_encoder()
model.freeze_encoder()
model.freeze_decoder()

# Verify freezing
trainable = model.get_trainable_parameters()
total = model.get_num_parameters()
print(f"\nModel parameters:")
print(f"  Total: {total:,}")
print(f"  Trainable: {trainable:,}")
print(f"  Frozen: {trainable == 0}")

# Save in our custom checkpoint format (epoch=None for baseline)
if trainable == 0:
    save_dir = project_root / "experiments" / "baselines" / "whisper-small"
    print(f"\nSaving to: {save_dir}")
    
    # Save checkpoint (model_weights.pt + model_metadata.json)
    model_manager.save_checkpoint(model, config.model, str(save_dir))
    
    # Copy config.yaml to checkpoint directory for evaluation.py
    # The config is copied as is; train.py applies the freeze/unfreeze settings from the config on load
    config_dest = save_dir / "config.yaml"
    shutil.copy(config_path, config_dest)
    print(f"✓ Config copied to: {config_dest}")
    
    print("\n✅ Whisper Small baseline is ready!")
    print("   Checkpoint structure:")
    print(f"     - {save_dir / 'model_weights.pt'} (fully frozen)")
    print(f"     - {save_dir / 'model_metadata.json'}")
    print(f"     - {save_dir / 'config.yaml'} (freeze settings for train.py)")
else:
    print(f"\n❌ Error: not all parameters are frozen! (trainable={trainable:,})")

Loading Whisper Small...
Config loaded: openai/whisper-small
✓ Processor created: WhisperProcessor
✓ Model created: WhisperSTT

⚠️  Force-freezing all parameters for the baseline model...

Model parameters:
  Total: 241,734,912
  Trainable: 0
  Frozen: True

Saving to: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-small
✓ Config copied to: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-small\config.yaml

✅ Whisper Small baseline is ready!
   Checkpoint structure:
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-small\model_weights.pt (fully frozen)
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-small\model_metadata.json
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-small\config.yaml (freeze settings for train.py)


## 2. Whisper Base (Baseline)

Loads `openai/whisper-base` without fine-tuning.

The model is saved to:
- `experiments/baselines/whisper-base/`

In [3]:
print("="*60)
print("Loading Whisper Base...")
print("="*60)

# Load config
config_path = project_root / "configs" / "whisper_base.yaml"
config = load_config(config_path)

print(f"Config loaded: {config.model.model_name}")

# Create DataManager to setup processor (pass full ProjectConfig)
data_manager = DataManager(config)
processor = data_manager.setup_processor(
    model_name=config.model.model_name,
    model_type=config.model.model_type,
    language=config.data.language,
    task=config.data.task
)
print(f"✓ Processor created: {type(processor).__name__}")

# Create model (reuse existing ModelManager)
model = model_manager.create_model(config.model, processor)
print(f"✓ Model created: {type(model).__name__}")

# IMPORTANT: baseline models are ALWAYS fully frozen (regardless of the config)
print("\n⚠️  Force-freezing all parameters for the baseline model...")
model.freeze_feature_encoder()
model.freeze_encoder()
model.freeze_decoder()

# Verify freezing
trainable = model.get_trainable_parameters()
total = model.get_num_parameters()
print(f"\nModel parameters:")
print(f"  Total: {total:,}")
print(f"  Trainable: {trainable:,}")
print(f"  Frozen: {trainable == 0}")

# Save in our custom checkpoint format
if trainable == 0:
    save_dir = project_root / "experiments" / "baselines" / "whisper-base"
    print(f"\nSaving to: {save_dir}")
    
    # Save checkpoint (model_weights.pt + model_metadata.json)
    model_manager.save_checkpoint(model, config.model, str(save_dir))
    
    # Copy config.yaml to checkpoint directory for evaluation.py
    config_dest = save_dir / "config.yaml"
    shutil.copy(config_path, config_dest)
    print(f"✓ Config copied to: {config_dest}")
    
    print("\n✅ Whisper Base baseline is ready!")
    print("   Checkpoint structure:")
    print(f"     - {save_dir / 'model_weights.pt'} (fully frozen)")
    print(f"     - {save_dir / 'model_metadata.json'}")
    print(f"     - {save_dir / 'config.yaml'} (freeze settings for train.py)")
else:
    print(f"\n❌ Error: not all parameters are frozen! (trainable={trainable:,})")

Loading Whisper Base...
Config loaded: openai/whisper-base
✓ Processor created: WhisperProcessor
✓ Model created: WhisperSTT

⚠️  Force-freezing all parameters for the baseline model...

Model parameters:
  Total: 72,593,920
  Trainable: 0
  Frozen: True

Saving to: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-base
✓ Config copied to: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-base\config.yaml

✅ Whisper Base baseline is ready!
   Checkpoint structure:
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-base\model_weights.pt (fully frozen)
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-base\model_metadata.json
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-base\config.yaml (freeze settings for train.py)


## 3. Speech2Text Cross-Lingual (Baseline - No Training)

Loads the English model `facebook/s2t-small-librispeech-asr` with the multilingual tokenizer `facebook/s2t-medium-mustc-multilingual-st` for Russian.

**Cross-Lingual Transfer Setup:**
- Encoder: pretrained on English speech (LibriSpeech)
- Decoder: embeddings resized for the multilingual tokenizer (10000 tokens)
- Target language: Russian (ru)
- Fully frozen (no training)

**Goal:** Measure the model's initial accuracy BEFORE training (we expect low accuracy, since the decoder has not been trained on Russian).

The model is saved to:
- `experiments/baselines/s2t-cross-lingual/`
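
The accuracy of an untrained baseline is usually reported as word error rate (WER). The metric code in `evaluation.py` is not shown in this notebook, so the function below is only a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The name `word_error_rate` and the sample sentences are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# A perfect transcript scores 0.0.
print(word_error_rate("привет как дела", "привет как дела"))
# A transcript with no matching words and one extra word scores above 1.0.
print(word_error_rate("привет как дела", "hello how are you"))
```

Because insertions count as errors, WER is not capped at 100%: a model that emits unrelated text longer than the reference can score above it.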

In [4]:
print("="*60)
print("Loading Speech2Text Cross-Lingual (English model + Multilingual tokenizer)...")
print("="*60)

# Load config
config_path = project_root / "configs" / "s2t_cross_lingual.yaml"
config = load_config(config_path)

print(f"Config loaded: {config.model.model_name}")
print(f"Tokenizer: {config.model.tokenizer_name_or_path}")
print(f"Target language: {config.data.language}")

# Create DataManager to setup processor with alternative tokenizer
data_manager = DataManager(config)
processor = data_manager.setup_processor(
    model_name=config.model.model_name,
    model_type=config.model.model_type,
    language=config.data.language,
    task=config.data.task,
    tokenizer_name_or_path=config.model.tokenizer_name_or_path
)
print(f"✓ Processor created: {type(processor).__name__}")

# Verify tokenizer language support
tokenizer = getattr(processor, 'tokenizer', None)
if tokenizer and hasattr(tokenizer, 'lang_code_to_id'):
    print(f"✓ Multilingual tokenizer with {len(tokenizer.lang_code_to_id)} languages: {list(tokenizer.lang_code_to_id.keys())}")
    print(f"✓ Target language 'ru' supported: {'ru' in tokenizer.lang_code_to_id}")

# Create model (will automatically resize embeddings for multilingual tokenizer)
model = model_manager.create_model(config.model, processor)
print(f"✓ Model created: {type(model).__name__}")

# IMPORTANT: baseline models are ALWAYS fully frozen (regardless of the config)
print("\n⚠️  Force-freezing all parameters for the baseline model...")
model.freeze_feature_encoder()
model.freeze_encoder()
model.freeze_decoder()

# Verify freezing
trainable = model.get_trainable_parameters()
total = model.get_num_parameters()
print(f"\nModel parameters:")
print(f"  Total: {total:,}")
print(f"  Trainable: {trainable:,}")
print(f"  Frozen: {trainable == 0}")

# Save in our custom checkpoint format
if trainable == 0:
    save_dir = project_root / "experiments" / "baselines" / "s2t-cross-lingual"
    print(f"\nSaving to: {save_dir}")
    
    # Save checkpoint (model_weights.pt + model_metadata.json with tokenizer info)
    model_manager.save_checkpoint(model, config.model, str(save_dir))
    
    # Copy config.yaml to checkpoint directory for evaluation.py
    config_dest = save_dir / "config.yaml"
    shutil.copy(config_path, config_dest)
    print(f"✓ Config copied to: {config_dest}")
    
    print("\n✅ Speech2Text Cross-Lingual baseline is ready!")
    print("   Checkpoint structure:")
    print(f"     - {save_dir / 'model_weights.pt'} (fully frozen)")
    print(f"     - {save_dir / 'model_metadata.json'}")
    print(f"     - {save_dir / 'config.yaml'} (freeze settings for train.py)")
    print("\n⚠️  IMPORTANT: this model is NOT TRAINED on Russian!")
    print("   Expected evaluation results: very low accuracy (WER ~100%)")
    print("   The decoder was resized for the multilingual tokenizer, but its weights for that vocabulary are untrained.")
else:
    print(f"\n❌ Error: not all parameters are frozen! (trainable={trainable:,})")

Loading Speech2Text Cross-Lingual (English model + Multilingual tokenizer)...
Config loaded: facebook/s2t-small-librispeech-asr
Tokenizer: facebook/s2t-medium-mustc-multilingual-st
Target language: ru
✓ Processor created: Speech2TextProcessor
✓ Multilingual tokenizer with 8 languages: ['pt', 'fr', 'ru', 'nl', 'ro', 'it', 'es', 'de']
✓ Target language 'ru' supported: True
✓ Model created: Speech2TextSTT

⚠️  Force-freezing all parameters for the baseline model...

Model parameters:
  Total: 29,536,256
  Trainable: 0
  Frozen: True

Saving to: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\s2t-cross-lingual
✓ Config copied to: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\s2t-cross-lingual\config.yaml

✅ Speech2Text Cross-Lingual baseline is ready!
   Checkpoint structure:
     - c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\s2t-cross-lingual\model_weights.pt (fully frozen)
     - c:\Users\User\

## Summary

Check all saved baseline models.

Each model directory must contain:
- `model_weights.pt` - model weights
- `model_metadata.json` - checkpoint metadata
- `config.yaml` - configuration for evaluation.py

In [5]:
baselines_dir = project_root / "experiments" / "baselines"

print("="*60)
print("Saved Baseline Models")
print("="*60)

if baselines_dir.exists():
    for model_dir in sorted(baselines_dir.iterdir()):
        if model_dir.is_dir():
            # Check for required checkpoint files (our custom format)
            has_weights = (model_dir / "model_weights.pt").exists()
            has_metadata = (model_dir / "model_metadata.json").exists()
            has_config = (model_dir / "config.yaml").exists()
            
            status = "✅" if (has_weights and has_metadata and has_config) else "❌"
            
            # Calculate total size
            total_size = sum(f.stat().st_size for f in model_dir.rglob('*') if f.is_file())
            size_mb = total_size / (1024 * 1024)
            
            print(f"{status} {model_dir.name}")
            print(f"   Path: {model_dir}")
            print(f"   Size: {size_mb:.1f} MB")
            print(f"   Files: Weights={has_weights}, Metadata={has_metadata}, Config={has_config}")
            print()
else:
    print("❌ baselines directory not found")

print("="*60)
print("\nYou can now run evaluation.py with the VSCode 'Run' button!")
print("\nOR from the command line:")
print("\n# Whisper Small baseline")
print("python evaluation.py --model-path experiments/baselines/whisper-small")
print("\n# Whisper Base baseline")
print("python evaluation.py --model-path experiments/baselines/whisper-base")
print("\n# Speech2Text Cross-Lingual baseline")
print("python evaluation.py --model-path experiments/baselines/s2t-cross-lingual")
print("\nThe config will be picked up automatically from the model directory!")

Saved Baseline Models
✅ s2t-cross-lingual
   Path: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\s2t-cross-lingual
   Size: 112.8 MB
   Files: Weights=True, Metadata=True, Config=True

✅ whisper-base
   Path: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-base
   Size: 277.0 MB
   Files: Weights=True, Metadata=True, Config=True

✅ whisper-small
   Path: c:\Users\User\Documents\Progs\Projects\seepch_to_text\experiments\baselines\whisper-small
   Size: 922.3 MB
   Files: Weights=True, Metadata=True, Config=True


You can now run evaluation.py with the VSCode 'Run' button!

OR from the command line:

# Whisper Small baseline
python evaluation.py --model-path experiments/baselines/whisper-small

# Whisper Base baseline
python evaluation.py --model-path experiments/baselines/whisper-base

# Speech2Text Cross-Lingual baseline
python evaluation.py --model-path experiments/baselines/s2t-cross-lingual

Конфиг будет а

In [6]:
from transformers import Speech2TextTokenizer

# Load the tokenizer
tokenizer = Speech2TextTokenizer.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")

# Check the list of supported languages
if hasattr(tokenizer, "langs") and tokenizer.langs is not None:
    print(f"Supports {len(tokenizer.langs)} languages:\n")
    for lang in tokenizer.langs:
        print(lang)
else:
    print("This tokenizer does not provide a list of supported languages.")

Supports 8 languages:

pt
fr
ru
nl
ro
it
es
de
