# Spoken Language Processing 2022-23

# Lab3 - Dialogue Systems

_Bruno Martins_


This lab assignment will introduce tools and concepts related to the development of dialogue systems, exemplifying also the use of automatic speech recognition and text-to-speech models.

Students will be tasked with the development of a simple (spoken/conversational) question answering system, reusing different models associated to the HuggingFace Transformers library:

* Speech recognition models (e.g., OpenAI Whisper).
* Large language models for natural language understanding and generation (e.g., GPT-2 or Alpaca models).
* Text-to-speech models (e.g., SpeechT5).

The first parts of this notebook will guide students in the use of the tools, while the last part presents the main problem that is to be tackled. Note that the first parts also feature intermediate tasks, which students are required to solve.

To complete the project, student groups must deliver in Fenix an updated version of this notebook, featuring the proposed solutions to each task, together with a small PDF report (2 pages) outlining the methods that were developed (you can use the [following Overleaf template](https://www.overleaf.com/latex/templates/interspeech-2023-paper-kit/kzcdqdmkqvbr) for the report).

Students are encouraged to modify examples, incorporate any other techniques, and in general explore any approach that may permit improving the results. Assessment will be based on task completion, creativity in the proposed solutions, and overall accuracy over a benchmark dataset.

### Group identification

Initialize the variable `group_id` with the number that Fenix assigned to your group and `student1_name`, `student1_id`, `student2_name` and `student2_id` with your names and student numbers.

In [1]:
group_id = 23
student1_name = "Guilherme Lopes"
student1_id = 105319
student2_name = "Gonçalo Cruz"
student2_id = 84721
print(f"Group number: {group_id}")
print(f"Student 1: {student1_name} ({student1_id})")
print(f"Student 2: {student2_name} ({student2_id})")

Group number: 23
Student 1: Guilherme Lopes (105319)
Student 2: Gonçalo Cruz (84721)


In [2]:
assert isinstance(group_id, int) and isinstance(student1_id, int) and isinstance(student2_id, int)
assert isinstance(student1_name, str) and isinstance(student2_name, str) 
assert (group_id > 0) and (group_id < 40)
assert (student1_id > 60000) and (student1_id < 120000) and (student2_id > 60000) and (student2_id < 120000)

# Python packages

NumPy is a Python library that provides functions to process multidimensional array objects. The NumPy documentation is available [here](https://numpy.org/doc/1.24/).

Librosa is a Python package for analyzing and processing audio signals. It provides a wide range of tools for tasks such as loading and manipulating audio files, extracting features from audio signals, and visualizing and playing back audio data.

IPython display is a module in the IPython interactive computing environment that provides a set of functions for displaying various types of media in the Jupyter notebook or other IPython-compatible environments. For example, you can use the display() function to display an object in a notebook cell (for example an audio object).

Matplotlib is a popular Python library that allows users to create a wide range of visualizations using a simple and intuitive syntax.

Huggingface transformers provides APIs and tools to easily download and train state-of-the-art pretrained models based on the Transformer architecture. The documentation is available [here](https://huggingface.co/docs/transformers/index). The associated *datasets* and *evaluate* libraries respectivly suport the direct access to many well-known datasets and common evaluation metrics used in NLP and speech research. For more details, look at the official [HuggingFace course](https://huggingface.co/course/chapter1/1).

In [None]:
!pip3 install sentencepiece
!pip3 install xformers
!pip3 install transformers
!pip3 install datasets
!pip3 install evaluate
!pip3 install jiwer
!pip3 install librosa

In [1]:
import transformers
import pandas
import datasets
import evaluate
import numpy as np
import librosa
import librosa.display
from IPython.display import Audio
from matplotlib import pyplot as plt

# Using OpenAI Whisper

Whisper is an exciting new model for Automatic Speech Recognition (ASR), developed by OpenAI and made available through the HuggingFace Transformers library. The following example illustrates the use of the Whisper model to transcribe a small audio sample taken from the LibriSpeech dataset.

In [2]:
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
inputs = processor(ds[0]["audio"]["array"], return_tensors="pt")
input_features = inputs.input_features
generated_ids = model.generate(inputs=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

Found cached dataset librispeech_asr_dummy (/home/guinlopes/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b)
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.


Automatic Speech Recognition (ASR) models are frequently evaluated through the Word Error Rate (WER). 

The WER is derived from the Levenshtein distance, working at the word level and aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. The metric can then be computed as:

WER = (S + D + I) / N = (S + D + I) / (S + D + C),

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N=S+D+C). The WER value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score.

In [3]:
from evaluate import load

wer = load("wer")
predictions = ["this is the prediction", "there is an other sample"]
references = ["this is the reference", "there is another one"]
wer_score = wer.compute(predictions=predictions, references=references)

print(wer_score)

0.5


## Intermediate tasks:

* Collect two small audio samples with your own voice, together with a transcription of the spoken messages. The following [example shows how to record audio from your microphone within a Python notebook](https://colab.research.google.com/gist/ricardodeazambuja/03ac98c31e87caf284f7b06286ebf7fd/microphone-to-numpy-array-from-your-browser-in-colab.ipynb#scrollTo=H4rxNhsEpr-c), but you can use any other method to collect the audio samples.
* Use the Whisper speech recognition model to transcribe the two spoken messages that were collected.
* Use the transcriptions to compute the word error rate.
* Experiment with the use of different recognition models (e.g., larger Whisper models), and see if the error rate changes.

Whisper Tiny

In [8]:
sr = 22050
student1_utt, sr1 = librosa.load("lab_3_audio_2.wav", sr=sr)
student2_utt, sr1 = librosa.load("LAB3_GUI.wav", sr=sr)

reference1 = ["Have you ever gone camping?"]
reference2 = ["How many kilometers is it from here to the moon?"]
inputs1 = processor(student1_utt, return_tensors="pt")
inputs2 = processor(student2_utt, return_tensors="pt")
input_features1 = inputs1.input_features
input_features2 = inputs2.input_features
generated_ids1 = model.generate(inputs=input_features1)
generated_ids2 = model.generate(inputs=input_features2)
transcription1 = processor.batch_decode(generated_ids1, skip_special_tokens=True)[0]
transcription2 = processor.batch_decode(generated_ids2, skip_special_tokens=True)[0]
print("Transcription student1: " + transcription1)
print("Transcription student 2: " + transcription2)

wer_score1 = wer.compute(predictions=[transcription1], references=reference1)
wer_score2 = wer.compute(predictions=[transcription2], references=reference2)
print("wer score student1: " + str(wer_score1))
print("wer score student2: " + str(wer_score2))

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Transcription student1:  If you ever gone camping...
Transcription student 2:  How many kilometers is it from here to the moon?
wer score student1: 0.4
wer score student2: 0.0


Whisper Small

In [9]:
processor = AutoProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

sr = 22050
student1_utt, sr1 = librosa.load("lab_3_audio_2.wav", sr=sr)
student2_utt, sr1 = librosa.load("LAB3_GUI.wav", sr=sr)

reference1 = ["Have you ever gone camping?"]
reference2 = ["How many kilometers is it from here to the moon?"]
inputs1 = processor(student1_utt, return_tensors="pt")
inputs2 = processor(student2_utt, return_tensors="pt")
input_features1 = inputs1.input_features
input_features2 = inputs2.input_features
generated_ids1 = model.generate(inputs=input_features1)
generated_ids2 = model.generate(inputs=input_features2)
transcription1 = processor.batch_decode(generated_ids1, skip_special_tokens=True)[0]
transcription2 = processor.batch_decode(generated_ids2, skip_special_tokens=True)[0]
print("Transcription student1: " + transcription1)
print("Transcription student 2: " + transcription2)

wer_score1 = wer.compute(predictions=[transcription1], references=reference1)
wer_score2 = wer.compute(predictions=[transcription2], references=reference2)
print("wer score student1: " + str(wer_score1))
print("wer score student2: " + str(wer_score2))

Downloading (…)rocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/845 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.13M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)main/normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.94k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Transcription student1:  Have you ever gone camping?
Transcription student 2:  How many kilometers is it from here to the moon?
wer score student1: 0.0
wer score student2: 0.0


# Using LLMs for conditional language generation

OpenAI GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. Thus, GPT-2 can be used to address problems like question answering, modeling the task as language generation conditioned in the question (plus other relevant additional context).

The following example illustrates the use of the GPT-2 through the Huggingface Transformers library. In this case, instead of using the model directly, we are using the model through the pipeline API, which facilitates the adaptation to the case of other LLMs. The pipeline() function can be used to connect a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

In [4]:
from transformers import pipeline, set_seed

set_seed(30) # make results deterministic

generator = pipeline('text-generation', model='gpt2')
generator("Who is the president of the United States?", max_length=15, num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Who is the president of the United States? Why are he not president at'}]

## Intermediate tasks:

* Adapt the example showing how to use GPT-2 to do question answering over the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) (available from HuggingFace datasets).
* Evaluate the results obtained with different models (e.g., [Alpaca-based models](https://huggingface.co/declare-lab/flan-alpaca-base)) and/or different usage strategies (e.g., consider prompting, parameter efficient fine-tuning, etc.).
* Compute the error over the first 1000 examples from the validation split from the SQuAD dataset, using the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu) for comparing the generated answers against the ground truth.

Alpaca GPT4

In [5]:
model = pipeline(model="declare-lab/flan-alpaca-gpt4-xl")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [14]:
from tqdm import tqdm
from transformers import Conversation, pipeline
squad_dataset = load_dataset('squad')


questions = squad_dataset['validation']['question'][0:1000]
context = squad_dataset['validation']['context'][0:1000]
print(len(questions))
pred_dataset = []
references = squad_dataset['validation']['answers'][0:1000]



for i in tqdm(range(len(questions))):
  prompt = context[i] + "\n" + questions[i]
  text = model(prompt, max_length =15)[0]['generated_text']
  pred_dataset.append(text)

Found cached dataset squad (/home/guinlopes/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

1000


100%|██████████| 1000/1000 [4:06:21<00:00, 14.78s/it]    


In [11]:
text_reference = []
for i in range(1000):
    text_reference.append(references[i]['text'])

Results with max length = 15

In [15]:
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=pred_dataset, references=text_reference)
print(results)

{'bleu': 0.06438275580606233, 'precisions': [0.16911332941867294, 0.094489043176394, 0.045482184117718594, 0.02364164247200332], 'brevity_penalty': 1.0, 'length_ratio': 6.053317535545023, 'translation_length': 10218, 'reference_length': 1688}


In [16]:
for i in range(10):
    print("Quest: " + questions[i])
    print("pred: " + pred_dataset[i])
    print(context[i])
    print(references[i])

Quest: Which NFL team represented the AFC at Super Bowl 50?
pred: Denver Broncos
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
Quest: Which NFL team represented the NFC at 

Results with max_length = 128

In [14]:
for i in range(10):
    print("Quest: " + questions[i])
    print("pred: " + pred_dataset[i])
    print(context[i])
    print(references[i])

Quest: Which NFL team represented the AFC at Super Bowl 50?
pred: Denver Broncos
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
Quest: Which NFL team represented the NFC at 

In [18]:
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=pred_dataset, references=references)
print(results)

{'bleu': 0.08014530236110991, 'precisions': [0.17574782417903898, 0.10747257949972382, 0.06079808186333276, 0.03592814371257485], 'brevity_penalty': 1.0, 'length_ratio': 8.100118483412322, 'translation_length': 13673, 'reference_length': 1688}


DialoGPT 

In [None]:
from tqdm import tqdm
from transformers import Conversation, pipeline
squad_dataset = load_dataset('squad')


questions = squad_dataset['validation']['question'][0:1000]
context = squad_dataset['validation']['context'][0:1000]
pred_dataset = []
references = squad_dataset['validation']['answers'][0:1000]


conversational_pipeline = pipeline('conversational', model='microsoft/DialoGPT-medium')
conversation = Conversation()

for i in range(len(questions)):
  prompt = context[i] + "\n" + questions[i]
  conversation.add_user_input(prompt)
  conversational_pipeline([conversation])
  text = conversation.generated_responses[i]
  print(text)
  pred_dataset.append(text)

# Using SpeechT5 for converting text-to-speech

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in different natural language processing tasks, the unified-modal SpeechT5 framework explores encoder-decoder pre-training for self-supervised speech/text representation learning. 

The model is again conveniently available through the HuggingFace Transformers library. The following example illustrates the use of the SpeechT5 model for generating a spectrogram from a textual input, together with a neural vocoder model for producing a speech signal.

In [5]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed
from IPython.display import Audio
import soundfile as sf
import torch

set_seed(42) # make results deterministic

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("tts_example.wav", speech.numpy(), samplerate=16000)
Audio("tts_example.wav", autoplay=True)

## Intermediate tasks:

* Connect the results from your answer to the previous intermediate task (i.e., conditioned language generation) to the SpeechT5 text-to-speech model, so as to produce speech outputs from the text generated by the model.
* Produce speech-based answers for the first 5 questions in the validation split from the SQuaD dataset.
* Connect also the results from your answer to the first intermediate task (i.e., automated speech recognition) to the SpeechT5 model and the LLM, so as to take spoken questions as input and produce a speech output.
* Collect small audio samples, with your own voice, for the first 5 questions in the validation split from the SQuaD dataset, and produce speech-based answers for these five questions.


In [18]:
for i in tqdm(range(5)):
     file_name = "SQUAD_q"+str(i+1)+".wav"
     question, sr2 = librosa.load(file_name, sr = sr)
     inputs = processor_speech_to_text(question, return_tensors="pt")
     input_features = inputs.input_features
     generated_ids = model_speech_to_text.generate(inputs=input_features)
     transcription = processor_speech_to_text.batch_decode(generated_ids, skip_special_tokens=True)[0]
     print("Question: "+ transcription)
     prompt = context[i] + "\n" + transcription
     text = model_llm(prompt, max_length =50)[0]['generated_text']
     print("Answer: " + text)
     inputs = processor_text_to_speech(text= text, return_tensors="pt")
     speaker_embeddings = torch.zeros((1, 512))
     spectrogram = model_text_to_speech.generate_speech(inputs["input_ids"], speaker_embeddings)
     with torch.no_grad(): speech = vocoder(spectrogram)
     name = "squad_answer" + str(i)+".wav"
     sf.write(name, speech.numpy(), samplerate=16000)
     display(Audio(speech.numpy(), rate=16000))

  0%|          | 0/5 [00:00<?, ?it/s]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Which NFL team represented the AFC at Super Bowl 50?
Answer: Denver Broncos


 20%|██        | 1/5 [01:54<07:37, 114.42s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Which NFL team represented the NFC at Super Bowl 50?
Answer: The NFL team that represented the NFC at Super Bowl 50 was the Carolina Panthers.


 40%|████      | 2/5 [04:26<06:50, 136.74s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Where did Super Bowl 50 take place?
Answer: Super Bowl 50 took place at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.


 60%|██████    | 3/5 [06:26<04:18, 129.00s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Which NFL team won Super Bowl 50?
Answer: The Denver Broncos won Super Bowl 50.


 80%|████████  | 4/5 [08:20<02:03, 123.01s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What color was used to emphasize the 50th anniversary of the Super Bowl?
Answer: The color used to emphasize the 50th anniversary of the Super Bowl was gold.


100%|██████████| 5/5 [10:10<00:00, 122.08s/it]


# Main problem

Students are tasked with joining together the speech recognition, language understanding and generation, and text-to-speech models, in order to build a conversational spoken question answering approach.

* The method should take as input speech utterances with questions.
* The language understanding and generation component should use as input a transcription for the current speech utterance, and also transcriptions from previous speech utterances (i.e., the conversation context).
* The language understanding and generation component can explore different strategies for improving answer quality:
  * Prompting the language model with (retrieved) in-context examples.
  * Using parameter-efficient fine-ting with existing conversational question answering datasets (e.g., [the CoQA dataset](https://stanfordnlp.github.io/coqa/), available from HuggingFace datasets).
  * ...
* The text-to-speech component takes as input the results from language generation, and produces a speech output.
* Both the automated speech recognition and the text-to-speech components can explore different approaches, although students should attempt to justify their choices (e.g., if changing the automated speech recognition component, show that it achieves a lower WER).
* Collect small audio samples, with your own voice, for the first instance in the CoQA testing split, and show the results produced by your method for this example.


# Alpaca_GPT4

In [28]:
from transformers import pipeline

class ConversationalAgent:
    def __init__(self, model_name, context = '', use_conversation_history=True):
        self.conversation_history = ""
        self.context = context
        self.model = pipeline(model=model_name)
        self.use_conversation_history = use_conversation_history

    def ask(self, question, max_length=128):
        # Append the question to the conversation history
        if self.use_conversation_history:
            self.conversation_history += f"User: {question}\n" 

          
        if self.use_conversation_history:
            prompt = self.context + "\n" + self.conversation_history + "Answer: "
        else:
            prompt = self.context + "\nUser" + question + "\nAnswer: "
        
        
        generated = self.model(prompt, max_length= max_length)[0]['generated_text']

        answer = generated.split('Answer: ')[-1]
        answer = answer.replace('User:', '').strip()
        
        
        if self.use_conversation_history:
            self.conversation_history += f"Answer: {answer}\n"
        print("Answer: " + answer)
        
        return answer


In [27]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch

class SpeechConversationalAgent:
    def __init__(self, model_name, context= '', use_conversation_history=False):
        self.agent = ConversationalAgent(model_name, context, use_conversation_history)
        self.processor_sound_text = AutoProcessor.from_pretrained("openai/whisper-medium.en")
        self.model_sound_text = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")
        self.processor_text_sound = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
        self.model_text_sound = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
        self.vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    def listen(self, input,  sr=16000):

        # Transcribe the speech to text using Whisper ASR
        audio, sr1 = librosa.load(input, sr=sr)
        
        inputs = self.processor_sound_text(audio, return_tensors="pt")
        input_features = inputs.input_features
        generated_ids = self.model_sound_text.generate(inputs=input_features)
        question = self.processor_sound_text.batch_decode(generated_ids, skip_special_tokens=True)[0]
        print("Question: "+ question)
        return question

    def speak(self, answer):
        # Convert the answer text to speech using SpeechT5 and HifiGan
        inputs = self.processor_text_sound(text= answer, return_tensors="pt")
        speaker_embeddings = torch.zeros((1, 512))
        spectrogram = self.model_text_sound.generate_speech(inputs["input_ids"], speaker_embeddings)
        with torch.no_grad(): speech = self.vocoder(spectrogram)
        # Save the speech audio to a WAV file
        #sf.write('output'+str()+'.wav', speech[0].data.numpy(), self.processor_text_sound.sampling_rate)
        return display(Audio(speech.numpy(), rate=16000))

    def ask(self, input):
        # Listen for a question and generate an answer
        question = self.listen(input)
        answer = self.agent.ask(question)
        self.speak(answer)





CoCa instance 1

In [None]:
coca_examples = []
sr = 16000
for i in range(12):
  file_name = "Instance_1."+str(i+1)+".wav"
  coca_examples.append(file_name)

In [10]:
context = "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. "

In [11]:
complete_context = context + "\n" + "'What are you doing, Cotton?!' 'I only wanted to be more like you'" + "\n" + "Cotton's mommy rubbed her face on Cotton's and said 'Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way'. And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry." + "\n" + "'Don't ever do that again, Cotton!' they all cried. 'Next time you might mess up that pretty white fur of yours and we wouldn't want that!'" +"\n" +"Then Cotton thought, 'I change my mind. I like being special'." 

In [19]:
speech_agent_I1 = SpeechConversationalAgent("declare-lab/flan-alpaca-gpt4-xl", complete_context, use_conversation_history=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [20]:
from tqdm import tqdm
for i in tqdm(range(12)):
    speech_agent_I1.ask(coca_examples[i])

  0%|          | 0/12 [00:00<?, ?it/s]

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What color was cotton?
Answer: Cotton was white.


  8%|▊         | 1/12 [02:09<23:40, 129.10s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Where did she live?
Answer: In a barn near a farm house.


 17%|█▋        | 2/12 [04:23<21:59, 131.95s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Did she live alone?
Answer: No, she didn't live alone. She shared her hay bed with her mommy and five other sisters.


 25%|██▌       | 3/12 [06:22<18:55, 126.22s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Who did she live with?
Answer: She lived with her mommy and five other sisters.


 33%|███▎      | 4/12 [08:07<15:44, 118.02s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What color were her sisters?
Answer: Orange


 42%|████▏     | 5/12 [09:54<13:17, 113.91s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Was Cotton happy that she looked different than the rest of her family?
Answer: No, Cotton was unhappy because she looked different from her family.


 50%|█████     | 6/12 [12:10<12:08, 121.41s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What did she do to try to make herself the same color as her sisters?
Answer: She used the old farmer's orange paint to paint herself like her sisters.


 58%|█████▊    | 7/12 [14:14<10:11, 122.20s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  whose paint was it?
Answer: The paint was the farmer's.


 67%|██████▋   | 8/12 [16:09<07:59, 119.80s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What did Cotton's mother and siblings do when they saw her painted orange?
Answer: They laughed and picked her up and dropped her into a bucket of water to wash her off.


 75%|███████▌  | 9/12 [17:56<05:48, 116.09s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Where did Cotton's mother put her to clean the paint off?
Answer: In a big bucket of water.


 83%|████████▎ | 10/12 [19:46<03:48, 114.13s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What did the other cats do when cotton emerged from the buckets of water?
Answer: The other cats licked her face until Cotton's fur was all all dry.


 92%|█████████▏| 11/12 [21:54<01:58, 118.35s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Did they want Cotton to change the color of her fur?
Answer: No, they did not want Cotton to change the color of her fur.


100%|██████████| 12/12 [23:53<00:00, 119.46s/it]


Coca instance 2

In [None]:
coca_examples2 = []
sr = 16000
for i in range(11):
  file_name = "Instance_2."+str(i+1)+".wav"
  coca_examples2.append(file_name)

In [12]:
cont_2 = "Once there was a beautiful fish named Asta. Asta lived in the ocean. There were lots of other fish in the ocean where Asta lived. They played all day long."

In [13]:
complete_cont_2 = cont_2 + "\n" + "One day, a bottle floated by over the heads of Asta and his friends. They looked up and saw the bottle. 'What is it?' said Asta's friend Sharkie. 'It looks like a bird's belly,' said Asta. But when they swam closer, it was not a bird's belly. It was hard and clear, and there was something inside it." +"\n" + "The bottle floated above them. They wanted to open it. They wanted to see what was inside. So they caught the bottle and carried it down to the bottom of the ocean. They cracked it open on a rock. When they got it open, they found what was inside. It was a note. The note was written in orange crayon on white paper. Asta could not read the note. Sharkie could not read the note. They took the note to Asta's papa. 'What does it say?' they asked." + "\n" "Asta's papa read the note. He told Asta and Sharkie, 'This note is from a little girl. She wants to be your friend. If you want to be her friend, we can write a note to her. But you have to find another bottle so we can send it to her.' And that is what they did."

In [17]:
speech_agent_I2 = SpeechConversationalAgent("declare-lab/flan-alpaca-gpt4-xl", complete_cont_2, use_conversation_history=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [18]:
from tqdm import tqdm
for i in tqdm(range(11)):
    speech_agent_I2.ask(coca_examples2[i])

  0%|          | 0/11 [00:00<?, ?it/s]

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What was the name of the fish?
Answer: Asta.


  9%|▉         | 1/11 [04:37<46:15, 277.51s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What looked like a bird's belly?
Answer: A bottle.


 18%|█▊        | 2/11 [07:18<31:20, 208.93s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Who said that?
Answer: Asta.


 27%|██▋       | 3/11 [09:20<22:29, 168.67s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Was Sharky a friend?
Answer: Yes, Sharkie was a friend of Asta.


 36%|███▋      | 4/11 [11:20<17:28, 149.83s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Did they get the bottle?
Answer: Yes, they got the bottle.


 45%|████▌     | 5/11 [13:08<13:29, 134.97s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What was in it?
Answer: A note.


 55%|█████▍    | 6/11 [14:51<10:20, 124.11s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Did a little boy write the note?
Answer: No, a little girl wrote the note.


 64%|██████▎   | 7/11 [16:49<08:08, 122.01s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Who could read the note?
Answer: Asta's papa could read the note.


 73%|███████▎  | 8/11 [18:42<05:57, 119.25s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  What did they do with the note?
Answer: They wrote a note to the girl and sent it in a bottle.


 82%|████████▏ | 9/11 [20:34<03:53, 116.90s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Did they write back?
Answer: No, they did not write back.


 91%|█████████ | 10/11 [22:23<01:54, 114.50s/it]It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:  Were they excited?
Answer: Yes, they were excited.


100%|██████████| 11/11 [24:13<00:00, 132.17s/it]


Questions answering with our recordings