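# Model Compression with ONNX

- Convert only the GPT module to ONNX.
- For the HiFi-GAN decoder, swapping in a different model may be more efficient than converting it.
- The speaker encoder can be converted to ONNX separately.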

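## 1. GPT

- The model needs to be optimized before conversion.
- Use torch.jit.trace or torch.fx to extract only the submodule that is needed (e.g., the gpt_inference module, which is used only for inference).
- Reduce the number of layers (e.g., GPT2Block 30 → 12 or fewer, 6 if possible).
- Convert float32 → float16 (FP16), or quantize to int8.
- Be ready to handle LayerNorm, GELU, and similar ops with custom ops in case they do not convert cleanly.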
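To make the float32 → int8 item above concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It only illustrates the arithmetic; it is not how ONNX Runtime or PyTorch implement quantization, and the helper names `quantize_int8` and `dequantize` and the sample weights are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Made-up weights, just to exercise the helpers.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("scale:", scale, "max rounding error:", max_err)
```

Each weight then takes 1 byte instead of 4, at the cost of a rounding error of at most half a quantization step per weight. Whether the model tolerates that error has to be checked by listening to the generated audio.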

## 2. speaker encoder

- Very easy to convert to ONNX.
- If the speaker embedding is fixed at inference time:  
Use a precomputed speaker embedding,  
or convert the encoder to ONNX and run it in ONNX Runtime.
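The precomputed-embedding option can be sketched as a small compute-once cache. This is a generic illustration using `pickle`; `load_or_compute` and `fake_speaker_encoder` are hypothetical names, and the real encoder call would take the place of the stand-in. For tensors, `torch.save` / `torch.load` would be the more natural storage format.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the object cached at cache_path, computing and saving it on a miss."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    value = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(value, f)
    return value


def fake_speaker_encoder():
    # Stand-in for the real speaker encoder; returns a dummy embedding.
    return [0.1, 0.2, 0.3]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "speaker_embedding.pkl"
    first = load_or_compute(path, fake_speaker_encoder)   # cache miss: computes and saves
    second = load_or_compute(path, fake_speaker_encoder)  # cache hit: loads from disk
    print("same embedding:", first == second)
```

With a fixed speaker, the encoder then never has to run on the device at all; only the saved embedding needs to ship.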

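## 3. hifigan_decoder (HiFi-GAN)

- Hard to convert to ONNX, and the size reduction is small.
- Alternatives:  
Use a different vocoder that converts to ONNX cleanly (e.g., UnivNet, MelGAN).  
Generate the waveform on the server and only play it back on the device (if on-device processing is not required).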


In [1]:
import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts


In [2]:
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

In [3]:
# Device setup: use the GPU ('cuda') if CUDA is available, otherwise the CPU ('cpu')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = "cpu"
print(f"Using device: {device}")  # Print the selected device (cuda or cpu) to confirm

Using device: cuda


In [4]:
BASE_PATH = "/home/j-i13a103/tts/run/training"

# Path to the XTTS config
CONFIG_PATH = BASE_PATH + "/GPT_XTTS_v2.0_SSOKDAK_FT-August-04-2025_08+49AM-0000000/config.json"

# vocab.json of the trained model
TOKENIZER_PATH = BASE_PATH + "/XTTS_v2.0_original_model_files/vocab.json"

# Load the best checkpoint
XTTS_CHECKPOINT = BASE_PATH + "/GPT_XTTS_v2.0_SSOKDAK_FT-August-04-2025_08+49AM-0000000/best_model.pth"

In [5]:
print("Loading model...")

config = XttsConfig()
config.load_json(CONFIG_PATH)
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False)
model.to(device)

Loading model...


Xtts(
  (gpt): GPT(
    (conditioning_encoder): ConditioningEncoder(
      (init): Conv1d(80, 1024, kernel_size=(1,), stride=(1,))
      (attn): Sequential(
        (0): AttentionBlock(
          (norm): GroupNorm32(32, 1024, eps=1e-05, affine=True)
          (qkv): Conv1d(1024, 3072, kernel_size=(1,), stride=(1,))
          (attention): QKVAttentionLegacy()
          (proj_out): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
        )
        (1): AttentionBlock(
          (norm): GroupNorm32(32, 1024, eps=1e-05, affine=True)
          (qkv): Conv1d(1024, 3072, kernel_size=(1,), stride=(1,))
          (attention): QKVAttentionLegacy()
          (proj_out): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
        )
        (2): AttentionBlock(
          (norm): GroupNorm32(32, 1024, eps=1e-05, affine=True)
          (qkv): Conv1d(1024, 3072, kernel_size=(1,), stride=(1,))
          (attention): QKVAttentionLegacy()
          (proj_out): Conv1d(1024, 1024, kernel_size=(1,), stride=(

In [6]:
# Switch to inference mode
model.eval()

In [7]:
# Extract only the gpt_inference module
gpt_infer = model.gpt.gpt_inference
gpt_infer.eval()

GPT2InferenceModel(
  (transformer): GPT2Model(
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-29): 30 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (wte): Embedding(1026, 1024)
  )
  (pos_embedding): LearnedPositionEmbeddings(
    (emb): Embedding(608, 1024)
  )
  (embeddings): Embedding(1026, 1024)
  (final_norm): LayerNorm
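The printed structure is enough for a rough estimate of what cutting the 30 GPT2Blocks down to 12 or 6 would save. The sketch below counts only the weights and biases visible in the printout (LayerNorm and Conv1D layers) and ignores embeddings and buffers, so the numbers are approximate; the function names are made up for the example.

```python
HIDDEN = 1024      # LayerNorm((1024,)) in the printed structure
ATTN_QKV = 3072    # c_attn: Conv1D(nf=3072, nx=1024)
MLP_INNER = 4096   # c_fc: Conv1D(nf=4096, nx=1024)


def linear_params(n_in, n_out):
    """Weights plus biases of one dense projection."""
    return n_in * n_out + n_out


def block_params():
    """Approximate parameter count of a single GPT2Block."""
    layer_norms = 2 * (2 * HIDDEN)  # ln_1 and ln_2, each with weight and bias
    attention = linear_params(HIDDEN, ATTN_QKV) + linear_params(HIDDEN, HIDDEN)
    mlp = linear_params(HIDDEN, MLP_INNER) + linear_params(MLP_INNER, HIDDEN)
    return layer_norms + attention + mlp


def stack_size_mib(num_blocks, bytes_per_param):
    """Size of the block stack in MiB for a given numeric precision."""
    return num_blocks * block_params() * bytes_per_param / 2**20


for blocks in (30, 12, 6):
    fp32 = stack_size_mib(blocks, 4)
    fp16 = stack_size_mib(blocks, 2)
    print(f"{blocks:2d} blocks: {fp32:8.1f} MiB (FP32), {fp16:8.1f} MiB (FP16)")
```

Dropping blocks changes the model's outputs, so a reduced stack would most likely need fine-tuning or distillation before its quality is comparable.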

In [9]:
help(gpt_infer.forward)

Help on method forward in module TTS.tts.layers.xtts.gpt_inference:

forward(input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs) method of TTS.tts.layers.xtts.gpt_inference.GPT2InferenceModel instance
    Define the computation performed at every call.
    
    Should be overridden by all subclasses.
    
    .. note::
        Although the recipe for forward pass needs to be defined within
        this function, one should call the :class:`Module` instance afterwards
        instead of this since the former takes care of running the
        registered hooks while the latter silently ignores them.



In [16]:
# Provide the reference audio
SPEAKER_REFERENCE = "/home/j-i13a103/tts/korean-single-speaker-datasets/wavs/1_0000.wav"

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[SPEAKER_REFERENCE])

Computing speaker latents...


### Preparing dummy inputs for the conversion

1. Check forward()

- logits : (batch_size, seq_len, vocab_size)
    - 1, 100, 1026

In [24]:
# Prepare the dummy input
# ONNX export needs a suitable dummy input.
gpt_infer.cached_prefix_emb = gpt_cond_latent

dummy_input_ids = torch.randint(0, 1026, (1, 32), dtype=torch.long).to(device)

# Check that the model can run inference
with torch.no_grad():
    out = gpt_infer(input_ids=dummy_input_ids)
    print(out.logits.shape)

torch.Size([1, 32, 1026])


In [18]:
torch.onnx.export(
    gpt_infer,  # model
    (dummy_input_ids,),  # pass the inputs as a tuple
    "/home/j-i13a103/tts/finetuning-result/model/xtts_gpt_inference.onnx",  # output path
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch_size", 1: "seq_len"}},  # allow variable batch size and sequence length
    opset_version=16
)


  if input_ids.shape[1] != 1:
  if self.cached_prefix_emb.shape[0] != gen_emb.shape[0]:
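The two lines quoted in the export output are shape-dependent `if` statements. Tracing records only the branch taken for the example input, so the exported graph may behave differently from the PyTorch module on inputs that would take the other branch. The toy module below (a made-up `Branchy` class, not the XTTS code) shows the effect with `torch.jit.trace`; the TorchScript-based `torch.onnx.export` traces in a similar way.

```python
import warnings

import torch


class Branchy(torch.nn.Module):
    """Toy module with a shape-dependent branch."""

    def forward(self, x):
        if x.shape[1] != 1:
            return x[:, 1:]  # "long input" path: drop the first column
        return x             # "single step" path: pass through unchanged


module = Branchy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the TracerWarning for this demo
    traced = torch.jit.trace(module, torch.zeros(1, 4))  # records the "long input" path

single_step = torch.zeros(1, 1)
eager_shape = tuple(module(single_step).shape)   # eager mode re-evaluates the condition
traced_shape = tuple(traced(single_step).shape)  # traced copy replays the recorded path
print("eager:", eager_shape, "traced:", traced_shape)
```

If the two shapes differ, the traced copy has frozen the branch. For the GPT export this suggests checking the ONNX model against PyTorch on inputs of several lengths, not only the length used at export time.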


In [19]:
# Check the converted model

import onnxruntime as ort
import numpy as np
import torch

# 1. Create the ONNX session
session = ort.InferenceSession("/home/j-i13a103/tts/finetuning-result/model/xtts_gpt_inference.onnx", providers=["CPUExecutionProvider"])

# 2. Create a dummy input (input_ids) and convert it to NumPy
dummy_input_ids = torch.randint(0, 1026, (1, 100), dtype=torch.long)
input_ids_np = dummy_input_ids.numpy()

# 3. Check the input name (usually 'input_ids')
input_name = session.get_inputs()[0].name
print("ONNX Input Name:", input_name)

# 4. Run ONNX inference
outputs = session.run(None, {input_name: input_ids_np})

# 5. Check the output
logits = outputs[0]
print("Logits shape:", logits.shape)  # e.g. (1, 100, 1026)


ONNX Input Name: input_ids
Logits shape: (1, 100, 1026)
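The check above confirms only the output shape. A stricter check compares the ONNX logits numerically against the PyTorch logits for the same `input_ids`. The helper below is a sketch with stand-in data; `compare_outputs` and its tolerances are assumptions, and in the notebook the two arguments would be the PyTorch logits converted to NumPy and `outputs[0]`.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-4):
    """Return (matches, max_abs_diff) for two arrays of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    matches = bool(np.allclose(candidate, reference, rtol=rtol, atol=atol))
    return matches, max_abs_diff


# Stand-in data: pretend these are the PyTorch and ONNX logits.
rng = np.random.default_rng(0)
torch_logits = rng.standard_normal((1, 100, 1026)).astype(np.float32)
onnx_logits = torch_logits + 1e-6  # simulate tiny numerical drift
ok, diff = compare_outputs(torch_logits, onnx_logits)
print("match:", ok, "max abs diff:", diff)
```

FP16 or int8 variants will need looser tolerances than an FP32 export, so the thresholds should be chosen per precision.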


### Trying it on real text

In [21]:
# Tokenize the text
from TTS.tts.utils.text.tokenizer import TTSTokenizer

tokenizer, config = TTSTokenizer.init_from_config(config)

text = "안녕, 오늘 하루 어땠어?"  # "Hi, how was your day?"
input_ids = tokenizer.text_to_ids(text)
input_ids_tensor = torch.tensor(input_ids).unsqueeze(0).to(device)

안녕, 오늘 하루 어땠어?
Character '안' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '녕' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '오' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '늘' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '하' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '루' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '어' not found in the vocabulary. Discarding it.
안녕, 오늘 하루 어땠어?
Character '땠' not found in the vocabulary. Discarding it.


In [26]:
print(input_ids_tensor.shape)

torch.Size([1, 5])
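The warnings above report eight distinct Hangul characters being discarded, and the resulting tensor holds only 5 ids. That would be consistent with nothing but the comma, the three spaces, and the question mark surviving tokenization. The sketch below reproduces that count with a hypothetical character-level vocabulary (`toy_vocab`); it does not use the real tokenizer.

```python
def split_by_vocab(text, vocab):
    """Return (kept_chars, missing_chars) for a character-level vocabulary."""
    kept = [ch for ch in text if ch in vocab]
    missing = []
    for ch in text:
        if ch not in vocab and ch not in missing:
            missing.append(ch)  # record each unknown character once, in order
    return kept, missing


# Hypothetical vocabulary that contains only punctuation and the space.
toy_vocab = {",", " ", "?", ".", "!"}
kept, missing = split_by_vocab("안녕, 오늘 하루 어땠어?", toy_vocab)
print(len(kept), "characters kept; missing:", "".join(missing))
```

If this reading is right, the ids passed on carry no Korean text at all. XTTS normally tokenizes text with the BPE tokenizer built from vocab.json, so it may be worth checking whether that tokenizer should be used here instead.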


In [22]:
gpt_infer.cached_prefix_emb = gpt_cond_latent

In [25]:
# Run GPT inference with ONNX
session = ort.InferenceSession("/home/j-i13a103/tts/finetuning-result/model/xtts_gpt_inference.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
print(input_name)
onnx_logits = session.run(None, {input_name: input_ids_tensor.cpu().numpy()})[0]

input_ids


2025-08-05 04:15:31.391660958 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Reshape node. Name:'/transformer/Reshape' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:45 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape&, onnxruntime::TensorShapeVector&, bool) input_shape_size == size was false. The input tensor cannot be reshaped to the requested shape. Input shape:{32}, requested shape:{100,1}


RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Reshape node. Name:'/transformer/Reshape' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:45 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape&, onnxruntime::TensorShapeVector&, bool) input_shape_size == size was false. The input tensor cannot be reshaped to the requested shape. Input shape:{32}, requested shape:{100,1}
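The error says a Reshape node received a tensor with 32 elements but was asked to produce shape (100, 1). A reshape is only valid when the element counts agree, which the small helper below checks (`can_reshape` is a made-up name; it supports a single `-1` wildcard but not ONNX's special `0` dimension).

```python
from math import prod


def can_reshape(input_shape, requested_shape):
    """True if a tensor of input_shape can be viewed as requested_shape (one -1 allowed)."""
    total = prod(input_shape)
    known = prod(d for d in requested_shape if d != -1)
    wildcards = sum(1 for d in requested_shape if d == -1)
    if wildcards == 0:
        return total == known
    if wildcards == 1:
        return known != 0 and total % known == 0
    return False  # more than one -1 is ambiguous


print(can_reshape((32,), (100, 1)))   # the case from the error message
print(can_reshape((100,), (100, 1)))  # the length used in the earlier check
print(can_reshape((32,), (-1, 1)))    # what a shape-agnostic graph would request
```

The requested 100 matches the sequence length fed in In [19]. A plausible reading, not verified here, is that the export kept a shape from trace time as a constant despite `dynamic_axes`, so inputs of any other length fail at this node. Re-exporting after removing the shape-dependent Python logic from `forward`, or always feeding the traced length, would be the things to try.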
