# Spoken Dialog System (1/2): ASR+LLM+TTS and Audio LLM+TTS #

We split this assignment into two notebooks.

This is ***1/2*** of it. Make sure also to complete and submit ***SpeechLM.ipynb***.

***Authored*** by Xilin Jiang (xj2289 at columbia) and Prof. Nima Mesgarani in 2025.

---

## Introduction

In this homework, we will implement a spoken dialog system (SDS), which chats with users through natural ***spoken*** language. You are probably very familiar with text dialog systems like ChatGPT and Gemini, where you type a question and get a text response within a second. An SDS instead lets you communicate with your ***voice***, not your keyboard. Such systems are more convenient to use (speaking vs. typing), can draw on the richer acoustic information in your voice, and can be more human-like, like HAL in *2001: A Space Odyssey* but with more sanity (hopefully). From sci-fi to today's LLM era, SDS has become a reality, with a growing number of commercial products such as OpenAI's GPT-4o and ByteDance's Doubao.

Put simply, an SDS takes the user's speech as input and, after some thinking, outputs generated speech. From what you have learned in the lectures and past homework, you should agree that an SDS requires three essential capabilities: ***listening*** (speech-to-text), ***thinking*** (text-to-text), and ***speaking*** (text-to-speech). They can be handled respectively by an automatic speech recognition (**ASR**) model, a (large) language model (**LLM**), and a text-to-speech synthesis (**TTS**) model. This is the first system you need to implement in this notebook!
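The listening, thinking, and speaking stages can be sketched as a plain function composition. This is only a minimal illustration: `cascaded_reply` and the three toy stand-ins are hypothetical names, not part of the provided helper code, and the stand-ins replace real ASR, LLM, and TTS models so the sketch runs without any checkpoint.

```python
from typing import Callable, List


def cascaded_reply(
    speech: List[float],
    listen: Callable[[List[float]], str],
    think: Callable[[str], str],
    speak: Callable[[str], List[float]],
) -> List[float]:
    """Chain listening, thinking, and speaking into one spoken reply."""
    text_in = listen(speech)   # speech-to-text (ASR)
    text_out = think(text_in)  # text-to-text (LLM)
    return speak(text_out)     # text-to-speech (TTS)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_asr(speech: List[float]) -> str:
    return "hello"


def toy_llm(text: str) -> str:
    return text.upper()


def toy_tts(text: str) -> List[float]:
    return [float(ord(c)) for c in text]


reply = cascaded_reply([0.0, 0.1], toy_asr, toy_llm, toy_tts)
```

Because each stage only sees the previous stage's output, anything the ASR step discards (tone, emotion, speaker identity) never reaches the later stages.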

This SDS should work reasonably well, since today's ASR, LLM, and TTS models each perform at a human or even superhuman level. However, when connected, this cascaded SDS is not always satisfying in practice. Performance-wise, all acoustic information is lost in the speech-to-text step: the SDS cannot tell from text alone whether you are happy, sad, or ironic. Speed-wise, the system can be very slow, as it requires running at least three models, and to achieve real-time interaction, each of them must stream several times faster than real time.
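To make the speed argument concrete, here is a small sketch of the usual real-time factor (RTF) arithmetic, taken here as processing time divided by audio duration. The function names and the stage timings are illustrative assumptions, not measurements from this homework.

```python
from typing import Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1 means processing is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def cascade_latency(stage_seconds: Sequence[float]) -> float:
    """Without streaming, the stages run back to back, so latencies add up."""
    return sum(stage_seconds)


# Made-up timings for ASR, LLM, and TTS on a 4-second utterance.
stage_seconds = [0.5, 1.0, 0.5]
total = cascade_latency(stage_seconds)
rtf = real_time_factor(total, audio_seconds=4.0)
```

With these made-up numbers the cascade takes 2 seconds in total, so each stage has to be well under real time on its own for the whole pipeline to stay below an RTF of 1.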

Therefore, the current direction in industry and academia is toward more end-to-end SDS, with fewer than three major models or just one (usually some kind of LLM). Starting from ASR+LLM+TTS, we will first implement a more integrated system by combining ASR+LLM=**Audio LLM** in the second half of this notebook. Finally, in the second notebook, we will build a GPT-style toy speech language model (**SLM**) that can directly input and output speech ***without text*** on hardcoded Q&A like *“What is the next day of Tuesday?” → “Wednesday.”*

In summary, you will implement three SDS in this homework:

<ol style="font-size: 1.2em;">
  <li> ASR + LLM + TTS (20%)</li>
  <li> Audio LLM + TTS (20%)</li>
  <li> Toy SLM (60%)</li>
</ol>

The first two systems require no training. The last one requires a little training and training data synthesis. The majority of the infrastructure and evaluation code is given. But please start early in case of any unforeseen problems. For a deep dive into this topic, you can refer to [Moshi](https://arxiv.org/abs/2410.00037), [Survey 1](https://arxiv.org/abs/2411.13577), and [Survey 2](https://arxiv.org/abs/2504.08528).


## Environment and Tools

Mount Google Drive and cd to your HW7 folder.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/6820_HW/HW7')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We have to compress the LLM with [LLM.int8()](https://arxiv.org/abs/2208.07339) so that it fits into GPU memory. You may need to restart the notebook after installing.
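A rough back-of-the-envelope estimate shows why 8-bit weights help. The sketch below counts weight storage only and assumes about 7 billion parameters for a "7B" model; `weight_memory_gib` is a hypothetical helper, and real usage is higher because of activations, the KV cache, and the outlier features that LLM.int8() keeps in 16-bit precision.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 7e9  # assumed parameter count for a "7B" model
fp16_gib = weight_memory_gib(params, 16)  # half-precision weights
int8_gib = weight_memory_gib(params, 8)   # 8-bit quantized weights

print(f"fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is what makes it feasible to host a 7B model next to the TTS model on a single Colab GPU.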

In [2]:
!pip install bitsandbytes


And install other dependencies:

In [3]:
!pip install -r requirements.txt
!apt-get install espeak

Collecting git+https://github.com/resemble-ai/monotonic_align.git (from -r requirements.txt (line 9))
  Cloning https://github.com/resemble-ai/monotonic_align.git to /tmp/pip-req-build-co54qie1
  Running command git clone --filter=blob:none --quiet https://github.com/resemble-ai/monotonic_align.git /tmp/pip-req-build-co54qie1
  Resolved https://github.com/resemble-ai/monotonic_align.git to commit 78b985be210a03d08bc3acc01c4df0442105366f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
espeak is already the newest version (1.48.15+dfsg-3).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


In [4]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [5]:
import sys

if 'StyleTTS2' not in sys.path:
  sys.path.append('StyleTTS2')

import torch
from IPython.display import Audio

## Evaluation Data

For this notebook, we will evaluate our systems on dialogs from the [DailyTalk](https://arxiv.org/abs/2207.01063) dataset. We use only one round of conversation, from caller to callee. Here is an example:

In [None]:
Audio('dailytalk/0/question.wav')

Here is the ground-truth response:

In [None]:
Audio('dailytalk/0/solution.wav')

## System 1: ASR+LLM+TTS

Now, let's implement the most straightforward SDS. While we can choose any ASR, LLM, and TTS model to implement a cascaded SDS, to compare with our System 2 later, we will use **[Whisper](https://arxiv.org/abs/2212.04356)**-large-v2 for ASR, **[Qwen2](https://arxiv.org/abs/2407.10671)**-7B-Instruct for LLM, and **[StyleTTS 2](https://arxiv.org/abs/2306.07691)** for TTS. In general, each model in the cascade should be lightweight yet perform well on its own.

We have provided the instantiation and inference code of all three models inside the CascadedSDS class. Your task is simply to run these models one by one. Notice that if you give the LLM only the sentence transcribed by the ASR, it probably won't know what you want it to do, so you need to ***TODO: write a prompt*** to ask the LLM to generate a likely response. The prompt should contain both the transcription and your instruction.
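As a minimal sketch of what such a prompt could look like, the helper below wraps the transcription in quotes and appends an explicit instruction. `build_prompt` and the instruction wording are assumptions for illustration only; any phrasing that contains both the transcription and the instruction fits the requirement.

```python
def build_prompt(transcription: str) -> str:
    """Combine the ASR transcription with an explicit instruction."""
    instruction = (
        "Reply to the speaker directly, as the other person in the "
        "conversation would. Do not repeat the question."
    )
    # Whisper transcriptions often start with a space, so strip it.
    return f'The speaker said: "{transcription.strip()}"\n{instruction}'


prompt = build_prompt(" Did you bring some lunch with you?")
print(prompt)
```

Quoting the transcription keeps it visually separate from the instruction, which makes it less likely that the LLM treats the user's sentence as a command addressed to itself.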

A little necessary background on text-to-speech (TTS): modern TTS models can clone anyone's voice to speak any sentence. Cloning a speaker who is not in the training set is called zero-shot TTS. A TTS model usually takes (at least) two inputs: a **reference speaker** (usually as a mel spectrogram) and a **target sentence** (text) to synthesize. The TTS model extracts speaker information from the reference speaker, with the prosody enhanced by the text, and synthesizes the target sentence with a generative model. We provide the reference speaker for each dialog, e.g. ***dailytalk/0/reference.wav***.


#### TODO (1/10): Implement CascadedSDS.\_\_call\_\_(self, in_wav_path, ref_spk_path) ####

If you run into a TTS error, you may need to make sure the input text is not too long (>512 tokens) for TTS.
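One way to shorten an over-long reply without cutting a word in half is to drop whole trailing sentences. The sketch below is an assumption-laden illustration: `truncate_at_sentence` is a hypothetical helper, it uses a simple regular expression instead of a real sentence tokenizer, and it budgets characters as a rough proxy for the token limit mentioned above.

```python
import re


def truncate_at_sentence(text: str, max_chars: int = 512) -> str:
    """Keep whole sentences until the character budget is used up."""
    if len(text) <= max_chars:
        return text
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = ""
    for sentence in sentences:
        candidate = f"{kept} {sentence}".strip()
        if len(candidate) > max_chars:
            break
        kept = candidate
    # Fall back to a hard cut if even the first sentence is too long.
    return kept or text[:max_chars]


short = truncate_at_sentence("One. Two. Three.", max_chars=10)
```

Ending on a sentence boundary tends to sound more natural when synthesized than a reply that stops mid-word.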

Also make sure the LLM ***does not repeat*** your question or say something irrelevant like "Yes/Sure/OK". It should generate the response directly.
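Prompting is the main tool for this, but a small post-processing step can act as a safety net. The sketch below strips one leading acknowledgement from a response; `strip_filler_opening` and the list of openers are hypothetical and deliberately short.

```python
import re

# Hypothetical list of openers to drop; extend it as needed.
_FILLER = re.compile(
    r"^\s*(?:sure thing|sure|yes|ok(?:ay)?)\b[\s,.!:-]*",
    re.IGNORECASE,
)


def strip_filler_opening(response: str) -> str:
    """Remove one leading acknowledgement such as 'Sure thing!'."""
    cleaned = _FILLER.sub("", response, count=1)
    # Keep the original if stripping would leave nothing to speak.
    return cleaned if cleaned.strip() else response


cleaned = strip_filler_opening("Sure thing! I've got a nice sandwich.")
```

Note that a rule like this can also remove a meaningful "Yes" that actually answers a yes/no question, so it is best treated as a fallback rather than a replacement for a good prompt.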


In [None]:
import librosa

from transformers import (
    WhisperProcessor, WhisperForConditionalGeneration,
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig
)

from helper.tts import StyleTTS2


class CascadedSDS():

    def __init__(self, device='cuda'):

        # Instantiate ASR
        self.processor = WhisperProcessor.from_pretrained(
            'openai/whisper-large-v2'
        )
        self.asr = WhisperForConditionalGeneration.from_pretrained(
            'openai/whisper-large-v2',
            device_map='cpu', # we really don't have enough GPU memory :(
        )
        self.asr.config.forced_decoder_ids = None
        print('ASR instantiated.')

        # Instantiate LLM
        self.llm = AutoModelForCausalLM.from_pretrained(
            'Qwen/Qwen2-7B-Instruct',
            device_map=device,
            quantization_config=BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
                llm_int8_skip_modules=None,
            )
        )
        self.tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2-7B-Instruct')
        print('LLM instantiated.')

        # Instantiate TTS
        self.tts = StyleTTS2(
            ckpt_root='ckpts/epochs_2nd_00020.pth',
            code_root='StyleTTS2',
            device=device
        )
        print('TTS instantiated.')

        self.device = device

    @torch.no_grad()
    def inference_asr(self, in_wav_path):

        wav, sr = librosa.load(in_wav_path, sr=16000)
        input_features = self.processor(wav, sampling_rate=sr, return_tensors='pt').input_features
        predicted_ids = self.asr.generate(input_features, language="en")
        transcription = self.processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]

        return transcription

    @torch.no_grad()
    def inference_llm(self, prompt, max_new_tokens=64):

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        llm_inputs = self.tokenizer([text], return_tensors="pt").to(self.device)

        generated_ids = self.llm.generate(
            llm_inputs.input_ids,
            max_new_tokens=max_new_tokens
        )
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(llm_inputs.input_ids, generated_ids)
        ]

        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        return response

    @torch.no_grad()
    def inference_tts(self, text, ref_spk_path):

        ref_mel = self.tts.load_mel_from_path(ref_spk_path)
        style = self.tts.compute_style(ref_mel)
        out_wav = self.tts(text=text, style=style)

        return out_wav

    @torch.no_grad()
    def __call__(self, in_wav_path, ref_spk_path):
        # 1. Run ASR
        transcription = self.inference_asr(in_wav_path)
        print(f"[ASR transcription]: {transcription}")

        # 2. Construct LLM prompt
        # Adjust this instruction to fit the context (or let the user supply it)
        instruction = "Please give a helpful and natural spoken response."
        prompt = f"""Here is what the user said: "{transcription}".
    {instruction}
    Respond naturally in the same conversational style, without repeating the user's words."""

        # 3. Run LLM
        llm_response = self.inference_llm(prompt)
        print(f"[LLM response]: {llm_response}")

        # 4. Truncate TTS input if too long
        if len(llm_response) > 512:
            print("TTS input too long, truncating to 512 characters.")
            llm_response = llm_response[:512]

        # 5. Run TTS in the reference speaker's voice
        out_wav = self.inference_tts(llm_response, ref_spk_path)

        return out_wav


Instantiate CascadedSDS. The first run will take several minutes to download the model checkpoints.

In [None]:
cascaded_sds = CascadedSDS('cuda')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.99k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/6.17G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/4.29k [00:00<?, ?B/s]

ASR instantiated.


config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/27.8k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.95G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.56G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/243 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

LLM instantiated.
177


  WeightNorm.apply(module, name, dim)


TTS instantiated.


#### TODO (2/10): Show your response of CascadedSDS on dailytalk/{3, 5, 6}/question.wav ####

Note that the output sampling rate of TTS is 24kHz.
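The sampling rate matters because the player converts sample counts into time. The short sketch below (with a hypothetical helper name and made-up sample counts) shows what happens if the same waveform is played back at the wrong rate.

```python
def duration_seconds(num_samples: int, sample_rate: int = 24000) -> float:
    """Playback duration of a mono waveform at the given sampling rate."""
    return num_samples / sample_rate


num_samples = 48000  # illustrative waveform length
correct = duration_seconds(num_samples)            # played at 24 kHz
wrong_rate = duration_seconds(num_samples, 16000)  # mistakenly played at 16 kHz
```

Played at 16 kHz, the same samples last 1.5 times longer and sound slower and lower in pitch, which is why `rate=24000` is passed to `Audio` for the generated responses.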

#### Question A ####

In [None]:
Audio('dailytalk/3/question.wav')

#### Generated Response A ####

In [None]:
out_wav0 = cascaded_sds(in_wav_path='dailytalk/3/question.wav', ref_spk_path='dailytalk/3/reference.wav')
Audio(out_wav0, rate=24000)

You have passed language=en, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=en.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


[ASR transcription]:  Did you bring some lunch with you?
[LLM response]: Sure thing! I've got a nice sandwich and some fruit packed for today. It's always good to have something handy when the hunger strikes. How about you? Anything planned for your lunch break?


#### Ground-truth Response A ####

In [None]:
Audio('dailytalk/3/solution.wav')

#### Question B ####

In [None]:
Audio('dailytalk/5/question.wav')

#### Generated Response B ####

In [None]:
out_wav1 = cascaded_sds(in_wav_path='dailytalk/5/question.wav', ref_spk_path='dailytalk/5/reference.wav')
Audio(out_wav1, rate=24000)

[ASR transcription]:  Excuse me, could you tell me where you have got that music book?




[LLM response]: Sure thing! I picked up this music book from a local bookstore downtown. It's really been a great find for practicing some new pieces. Do you need help locating one too? There are some fantastic selections out there.


#### Ground-truth Response B ####

In [None]:
Audio('dailytalk/5/solution.wav')

#### Question C ####

In [None]:
Audio('dailytalk/6/question.wav')

#### Generated Response C ####

In [None]:
out_wav3 = cascaded_sds(in_wav_path='dailytalk/6/question.wav', ref_spk_path='dailytalk/6/reference.wav')
Audio(out_wav3, rate=24000)

[ASR transcription]:  Hi, this is Kwee Pool Corporation. Is that Miss Jang?
[LLM response]: Hello, it's nice to hear from you. Yes, this is Miss Jang speaking. How can I assist you today?


#### Ground-truth Response C ####

In [None]:
Audio('dailytalk/6/solution.wav')

#### Still TODO (2/10): Comment on the quality of your generated responses in terms of voice naturalness and content relevance. ####

**Voice naturalness** is high. The synthesized speech comes across as smooth and expressive, and the timbre of each response (including the perceived gender of the speaker, judged subjectively) is consistent with the reference answers.

**Content relevance** is strong. Each response directly addresses the ASR-transcribed question and adds natural conversational elements.

One minor issue is the repeated use of "Sure thing!" at the start of different responses, which may sound unnatural over time and is the kind of filler opening the instructions above ask the LLM to avoid.

Overall, the system performs well in these terms.

## System 2: Audio LLM+TTS

An early auditory large language model (Audio LLM), [Listen, Think, and Understand](https://arxiv.org/abs/2305.10790), accurately defines the Audio LLM by its functionality. A broader term for Audio LLM is the Speech&Text-to-Text model (ST2T): you provide a speech clip to analyze and a question in text, and the model returns the answer to the question in text.

Nearly all Audio LLMs support automatic speech recognition (ASR), since more advanced reasoning tasks require the model to be able to transcribe the speech first. Recall the SDS pipeline ASR+LLM+TTS: with an Audio LLM, we can integrate ASR+LLM into one step! This is the idea of [Style-Talker](https://arxiv.org/abs/2408.11849). For this part, you will implement a simple Audio-LLM-based SDS. We choose [Qwen2-Audio](https://arxiv.org/abs/2407.10759) as our integrated *listener* and *thinker*.


##### NOTE: If you go all the way from System 1 to System 2, you probably don't have enough memory to host a second LLM. So please restart your notebook and skip System 1 but keep all the outputs. #####


#### TODO (3/10): Implement AudioLLMSDS.\_\_call\_\_(self, in_wav_path, ref_spk_path) ####


Again, make sure the LLM ***does not repeat*** your question or say something irrelevant like "Yes/Sure/OK". You can either use the old prompt or write a new prompt for Qwen2-Audio.


In [6]:
import librosa

import torch
from transformers import AutoProcessor, AutoModelForSeq2SeqLM, BitsAndBytesConfig

from helper.tts import StyleTTS2

class AudioLLMSDS():

    def __init__(self, device='cuda'):

        # Instantiate Audio LLM
        self.processor = AutoProcessor.from_pretrained('Qwen/Qwen2-Audio-7B-Instruct')
        self.audiollm = AutoModelForSeq2SeqLM.from_pretrained(
            'Qwen/Qwen2-Audio-7B-Instruct',
            device_map=device,
            quantization_config=BitsAndBytesConfig(
              load_in_8bit=True,
              llm_int8_threshold=6.0,
              llm_int8_skip_modules=None,
            )
        )
        print('Audio LLM instantiated.')

        # Instantiate TTS
        self.tts = StyleTTS2(
            ckpt_root='ckpts/epochs_2nd_00020.pth',
            code_root='StyleTTS2',
            device=device
        )
        print('TTS instantiated.')

    @torch.no_grad()
    def inference_audiollm(self, in_wav_path, prompt, max_length=256, device='cuda'):
        conversation = [
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': [
              {'type': 'audio', 'audio_url': in_wav_path},
              {'type': 'text', 'text': prompt},
            ]},
        ]
        text = self.processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
        audios = []

        for message in conversation:
            if isinstance(message['content'], list):
                for ele in message['content']:
                    if ele['type'] == 'audio':
                        audios.append(
                            librosa.load(
                                in_wav_path,
                                sr=self.processor.feature_extractor.sampling_rate)[0]
                            )

        inputs = self.processor(text=text, audio=audios, sampling_rate=16000, return_tensors='pt', padding=True)
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        generate_ids = self.audiollm.generate(**inputs, max_length=max_length)
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]
        response = self.processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

        return response

    @torch.no_grad()
    def inference_tts(self, text, ref_spk_path):

        ref_mel = self.tts.load_mel_from_path(ref_spk_path)
        style = self.tts.compute_style(ref_mel)
        out_wav = self.tts(text=text, style=style)

        return out_wav

    @torch.no_grad()
    def __call__(self, in_wav_path, ref_spk_path):
        # Construct the prompt
        instruction = "Please give a helpful and natural spoken response."
        prompt = (
            "Listen to the user message carefully and generate a direct, natural reply "
            "in a conversational style. Do not repeat the user's words or say things like 'Sure' or 'OK'.\n"
            f"{instruction}"
        )

        # Run Audio LLM (ASR + LLM in one step)
        llm_response = self.inference_audiollm(in_wav_path, prompt)
        print(f"[AudioLLM response]: {llm_response}")

        # Truncate if too long for TTS
        if len(llm_response) > 512:
            print("TTS input too long, truncating to 512 characters.")
            llm_response = llm_response[:512]

        # Run TTS
        out_wav = self.inference_tts(llm_response, ref_spk_path)

        return out_wav

In [7]:
audiollm_sds = AudioLLMSDS('cuda')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/342 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/638k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/853 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/79.0k [00:00<?, ?B/s]

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/3.98G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/3.64G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/1.28G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/3.98G [00:00<?, ?B/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/3.91G [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

Audio LLM instantiated.
177


  WeightNorm.apply(module, name, dim)


TTS instantiated.


#### TODO (4/10): Show your response of AudioLLMSDS on dailytalk/{3, 5, 6}/question.wav ####

#### Question A ####

In [None]:
Audio('dailytalk/3/question.wav')

#### Generated Response A ####

In [8]:
out_wav0 = audiollm_sds(in_wav_path='dailytalk/3/question.wav', ref_spk_path='dailytalk/3/reference.wav')
Audio(out_wav0, rate=24000)

[AudioLLM response]: I brought some sandwiches from home. Would you like to join me for lunch?


#### Ground-truth Response A ####

In [None]:
Audio('dailytalk/3/solution.wav')

#### Question B ####

In [None]:
Audio('dailytalk/5/question.wav')

#### Generated Response B ####

In [9]:
out_wav1 = audiollm_sds(in_wav_path='dailytalk/5/question.wav', ref_spk_path='dailytalk/5/reference.wav')
Audio(out_wav1, rate=24000)

[AudioLLM response]: I'm sorry, but I don't have that music book.


#### Ground-truth Response B ####

In [None]:
Audio('dailytalk/5/solution.wav')

#### Question C ####

In [None]:
Audio('dailytalk/6/question.wav')

#### Generated Response C ####

In [11]:
out_wav3 = audiollm_sds(in_wav_path='dailytalk/6/question.wav', ref_spk_path='dailytalk/6/reference.wav')
Audio(out_wav3, rate=24000)



[AudioLLM response]: Hello Kui Po Corporation! Nice to meet you. I'm QianWen, a language model created by Alibaba Cloud. How can I assist you today?


#### Ground-truth Response C ####

In [None]:
Audio('dailytalk/6/solution.wav')

#### Still TODO (4/10): Comment on the quality of your generated responses compared to CascadedSDS and the ground truth. ####

System 2 performs well in terms of fluency and user-friendliness, but it is less reliable than System 1 at maintaining contextual accuracy and task alignment.

System 1 provides more consistent, relevant, and persona-aware responses.

System 2 offers more direct, sometimes more natural replies, but it occasionally breaks down because of lower interpretability and weaker instruction-following ability. In Response C, for example, it drops the callee's role and introduces itself as a language model.

Compared to the ground truth, System 2's responses sometimes misinterpret the user's intent (e.g., answering “I don't have that music book” when the speaker is clearly referring to a book the listener has).