Shamsutdinov Rustam, BVT2202

Lab 4: A Study and Comparison of Modern Speech Synthesis Models

Objective

Build practical skills in analyzing, selecting, and applying modern speech synthesis models based on neural network architectures.

Tasks
1. Study current approaches to speech synthesis (neural TTS, autoencoders, transformers, diffusion models).
2. Find and analyze at least five open models (Tacotron 2, FastSpeech 2, VITS, Bark, XTTS, Tortoise-TTS, etc.).
3. For each model:
- prepare and document the installation and launch procedure;
- synthesize at least 10 audio files (from the same set of input texts);
- save the results in .wav or .flac format.
4. Compare the quality of the synthesized speech using subjective and formal criteria:
- naturalness and intonational expressiveness (MOS);
- synthesis time;
- model size and resource consumption (GPU/CPU).
5. Produce a report with a results table and conclusions.
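Task 4 relies on MOS, which is usually reported as the mean of 1 to 5 listener scores together with a confidence interval. The sketch below shows one possible way to aggregate such scores. The helper name `mos_summary`, the normal-approximation interval, and the example ratings are illustrative assumptions and are not part of the lab itself.

```python
import math
import statistics


def mos_summary(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval.

    `ratings` are listener scores on the usual 1-5 MOS scale.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must lie in [1, 5]")
    mean = statistics.fmean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener scores for one clip (placeholders, not measured data).
example_ratings = [4, 3, 4, 5, 4]
mean, half_width = mos_summary(example_ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```

With only a handful of listeners the interval is wide, so differences between models that fall inside it should probably not be read as meaningful.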

In [1]:
texts = [
    "Tom has a blue kite. It flies high in the sky. He smiles as the wind pulls the string.",
    "Emma sits on the sofa with a warm blanket. She drinks tea and reads a book. The room is quiet and soft.",
    "A small bird sits on the tree. It sings a happy song. Mia listens and claps her hands.",
    "I ride my red bike to the park. The road is smooth and long. I stop to watch the ducks in the pond.",
    "Cars move fast on the street. People walk with bags and talk on phones. The lights turn green, and buses start to go.",
    "Ben pours milk into a cup. He drinks it slowly and smiles. The milk is cold and sweet.",
    "A gray cat sleeps on the chair. Its fur is soft and warm. It dreams and moves its paws.",
    "Snow falls from the sky. The ground turns white. Kids laugh and make a big snowman.",
    "My mom has a small garden. She grows flowers and green beans. Every morning she waters them with care.",
    "Leo walks to school early. The air is cool and fresh. He waves to his friend at the corner."
]


Bark

In [2]:
import os
import time
import csv
import torch
import psutil
from transformers import AutoProcessor, BarkModel
from IPython.display import Audio, display
import scipy.io.wavfile


voice_preset = "v2/en_speaker_6"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("Device:", device)

model = BarkModel.from_pretrained("suno/bark-small")
model = model.to(device)
# Load the processor from the same checkpoint as the model.
processor = AutoProcessor.from_pretrained("suno/bark-small")

sample_rate = model.generation_config.sample_rate
print("Sample rate:", sample_rate)

process = psutil.Process(os.getpid())
param_count = model.num_parameters()
print(f"Model params: {param_count:,}")

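# Create the output directory up front; open() and wavfile.write() fail if it is missing.
os.makedirs("results/bark", exist_ok=True)
csv_path = "results/bark/metrics.csv"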
csv_fields = [
    "index", "text",
    "synthesis_time_s", "peak_gpu_mem_bytes", "end_gpu_mem_bytes",
    "cpu_rss_bytes_start", "cpu_rss_bytes_end", "cpu_cpu_percent",
    "model_param_count", "sample_rate",
    "mos_rating"
]

if not os.path.exists(csv_path):
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=csv_fields)
        writer.writeheader()

Device: cpu
Sample rate: 24000
Model params: 404,409,058
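The parameter count printed above can be turned into a rough estimate of the weight size for the "model size" criterion. This is a minimal sketch that assumes the weights are held as 32-bit floats (4 bytes each) and ignores buffers, activations, and framework overhead. The helper name `approx_weight_bytes` is made up for illustration.

```python
def approx_weight_bytes(param_count, bytes_per_param=4):
    """Rough weight footprint: parameter count times bytes per parameter."""
    return param_count * bytes_per_param


param_count = 404_409_058  # value printed by the cell above

fp32_bytes = approx_weight_bytes(param_count)                     # assumed float32
fp16_bytes = approx_weight_bytes(param_count, bytes_per_param=2)  # if cast to half precision

print(f"fp32: {fp32_bytes / 2**30:.2f} GiB, fp16: {fp16_bytes / 2**30:.2f} GiB")
```

The real memory footprint during synthesis will be higher than this estimate, which is why the loop also records process RSS and GPU peak memory.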


In [5]:
for index, text in enumerate(texts, start=1):

    cpu_rss_start = process.memory_info().rss
    cpu_percent_before = psutil.cpu_percent(interval=None)

    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats(device)
        start_gpu_mem = torch.cuda.memory_allocated(device)
    else:
        start_gpu_mem = 0

    inputs = processor(
        text=[text],
        return_tensors="pt",
        voice_preset=voice_preset
    )
    inputs = inputs.to(device)

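    t0 = time.perf_counter()
    speech_values = model.generate(**inputs, do_sample=True)
    if torch.cuda.is_available():
        # Wait for queued GPU work to finish before stopping the timer.
        torch.cuda.synchronize(device)
    t1 = time.perf_counter()
    synth_time = t1 - t0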

    if torch.cuda.is_available():
        peak_gpu = torch.cuda.max_memory_allocated(device)
        end_gpu = torch.cuda.memory_allocated(device)
        peak_gpu_bytes = int(peak_gpu)
        end_gpu_bytes = int(end_gpu)
    else:
        peak_gpu_bytes = 0
        end_gpu_bytes = 0

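    cpu_rss_end = process.memory_info().rss
    # interval=None reports system-wide CPU utilization since the priming call
    # at the top of the loop, so the value covers the synthesis itself rather
    # than a 0.1 s window after it has finished.
    cpu_percent = psutil.cpu_percent(interval=None)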

    audio_array = speech_values[0].cpu().numpy()


    out_path = f"results/bark/{index}.wav"
    scipy.io.wavfile.write(out_path, rate=sample_rate, data=audio_array)

    display(Audio(audio_array, rate=sample_rate))

    row = {
        "index": index,
        "text": text,
        "synthesis_time_s": f"{synth_time:.3f}",
        "peak_gpu_mem_bytes": peak_gpu_bytes,
        "end_gpu_mem_bytes": end_gpu_bytes,
        "cpu_rss_bytes_start": cpu_rss_start,
        "cpu_rss_bytes_end": cpu_rss_end,
        "cpu_cpu_percent": cpu_percent,
        "model_param_count": param_count,
        "sample_rate": sample_rate,
        "mos_rating": -1
    }
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=csv_fields)
        writer.writerow(row)
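Raw synthesis time depends on how long each generated clip turns out to be, so comparing models is often easier with the real-time factor (RTF): synthesis time divided by audio duration. The sketch below is one way to compute it. The function name `real_time_factor` and the example numbers are assumptions for illustration, not measurements from this notebook.

```python
def real_time_factor(synthesis_time_s, num_samples, sample_rate):
    """Synthesis time divided by the duration of the generated audio.

    Values below 1.0 mean the model generates speech faster than real time.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_duration_s = num_samples / sample_rate
    return synthesis_time_s / audio_duration_s


# Hypothetical numbers: 12 s of compute for 6 s of 24 kHz audio.
rtf = real_time_factor(12.0, num_samples=144_000, sample_rate=24_000)
print(f"RTF = {rtf:.2f}")
```

Inside the loop above this would be called as `real_time_factor(synth_time, audio_array.shape[-1], sample_rate)`, and the result could be stored in an extra CSV column.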


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.



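Once the loop has filled `metrics.csv`, the per-clip rows can be reduced to a single line for the final comparison table. The sketch below assumes the column names from `csv_fields` and treats `mos_rating` values below 1 (such as the `-1` placeholder) as "not rated yet". The function name `summarize_metrics` and the sample rows are hypothetical.

```python
import csv
import io
import statistics


def summarize_metrics(csv_file):
    """Aggregate per-clip rows into one summary dict for the results table."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("metrics file has no rows")
    times = [float(r["synthesis_time_s"]) for r in rows]
    # Ratings below 1 are treated as "not rated yet" and skipped.
    rated = [float(r["mos_rating"]) for r in rows if float(r["mos_rating"]) >= 1]
    return {
        "clips": len(rows),
        "mean_synthesis_time_s": statistics.fmean(times),
        "max_cpu_rss_bytes": max(int(r["cpu_rss_bytes_end"]) for r in rows),
        "mean_mos": statistics.fmean(rated) if rated else None,
    }


# Hypothetical rows with made-up values, used only to exercise the function.
sample = io.StringIO(
    "index,synthesis_time_s,cpu_rss_bytes_end,mos_rating\n"
    "1,10.0,2000000000,-1\n"
    "2,14.0,2100000000,4\n"
)
summary = summarize_metrics(sample)
print(summary)
```

For the real file, the same function could be called inside `with open("results/bark/metrics.csv", newline="", encoding="utf-8") as f:`.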