<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_04_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

## **Module 4: ChatGPT and Large Language Models**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 4 Material

* Part 4.1: Introduction to LLMs (ChatGPT) and Prompt Engineering
* Part 4.2: Generative AI
* **Part 4.3**: Text to Speech, Speech to Text, Speech Bot
* Part 4.4:


## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [1]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

Mounted at /content/drive
Note: Using Google CoLab
david.senseman@gmail.com


Make sure your GMAIL address is included as the last line in the output above; otherwise, your assignment will **not** be graded.

### Install Your Secret OpenAI Key

The next cell retrieves your OpenAI Key information. If you receive an error, it probably means you don't have an OpenAI Key or your key isn't stored correctly in your Colab browser.

In [2]:
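# Get the secret OpenAI API Key

import os

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Export the key so that later cells that call OpenAI() or ChatOpenAI()
# without an explicit api_key can read it from the environment
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY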

#print(OPENAI_API_KEY)
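
If you ever run this lesson outside of Colab, the `userdata` module will not be available. The following is a minimal sketch of one way to handle that, assuming the key may instead be stored in an environment variable with the same name. The helper name `load_openai_key` is hypothetical and is not part of any library.

```python
import os


def load_openai_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from Colab secrets, falling back to the environment."""
    key = None
    try:
        # This import only succeeds inside Google Colab
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        key = None
    if not key:
        # Fallback (an assumption): the key was exported as an environment variable
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No value found for {name}")
    return key
```

Raising an error early makes a missing key obvious before any API call is attempted.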

### Install software package

You will need to run the following code cell to install the `langchain` and `langchain-openai` packages to complete this lesson.

In [3]:
# Install langchain-openai package

!pip install -q langchain langchain-openai

If the code is correct, you _might_ see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image03A.png)

# **Introduction to Speech Processing**

![___](https://biologicslab.co/BIO1173/images/class_04/CourseImage.gif)

In this class lesson, we explore how to use both computer-generated voice and voice recognition to create a ChatBot. We'll be working with the OpenAI API to achieve this. Specifically, we'll demonstrate how to input normal text and have it spoken by the computer, and conversely, how we can speak to the computer and have it respond. We'll ultimately integrate these functionalities to create a chatbot that handles both text-to-speech and speech-to-text interactions. While we'll use Google Colab for this demonstration, in production environments, you'd likely use a mobile app or a web-based JavaScript solution, as each platform handles voice differently. I'll focus on keeping things generic and simple in Colab for now.

Voice applications are everywhere. For example, I can ask "Alexa, what time is it?" and multiple Alexa devices in my home will respond, although not always perfectly. I usually mute them during recording sessions. Applications like Siri or even ChatGPT also offer voice interactions. For instance, when you click the voice option in ChatGPT on a computer, it starts listening for your input.

To illustrate, I asked ChatGPT, "How are you doing?" and it responded by offering some insightful thoughts about generative AI. It highlighted that generative AI isn't just about creating new content but about learning patterns from vast amounts of data and applying them creatively across text, images, and code. It also suggested that students experiment with different approaches, as hands-on experience is one of the best ways to learn.




## **Part I: Speech to Text**

Here we delve into the realm of speech-to-text technology, focusing on the powerful capabilities offered by OpenAI's models. Speech-to-text, also known as automatic speech recognition (ASR), is a technology that converts spoken language into written text. OpenAI's speech-to-text models represent the cutting edge of this field, leveraging advanced machine learning techniques to achieve high accuracy and robustness across various accents, languages, and acoustic environments. We'll explore how these models can be integrated into applications to enable voice-based interactions, transcription services, and accessibility features. By harnessing OpenAI's speech-to-text technology, we'll unlock new possibilities for human-computer interaction and demonstrate how to transform audio input into actionable text data with remarkable precision.


Note that we will use the technique described in the following gist to record audio in CoLab:

https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be




### **Summary of the Audio Recording Setup in Google Colab**

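The code in the cell below sets up **audio recording from the user's microphone** in a **Google Colab notebook** using JavaScript and Python. It stores a JavaScript snippet in the Python string `RECORD`; the browser runs this snippet later to capture the audio.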


In [4]:
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode
import io
from IPython.display import Audio

from pydub import AudioSegment

RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
    const reader = new FileReader()
    reader.onloadend = e => resolve(e.srcElement.result)
    reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
    stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    recorder = new MediaRecorder(stream)
    chunks = []
    recorder.ondataavailable = e => chunks.push(e.data)
    recorder.start()
    await sleep(time)
    recorder.onstop = async ()=>{
        blob = new Blob(chunks)
        text = await b2text(blob)
        resolve(text)
    }
    recorder.stop()
})
"""

## **Create `record()` function**

Here is a step-by-step explanation of the `record()` function created in the next code cell.

**1. Start Recording**
The function prints a message indicating the start of recording for the specified number of seconds.

**2. Inject JavaScript**
The `RECORD` JavaScript code is injected into the notebook using `display(Javascript(RECORD))`. This code handles microphone access and audio recording in the browser.

**3. Execute JavaScript and Retrieve Audio**
The `output.eval_js()` function runs the JavaScript `record()` function for the specified duration (converted to milliseconds). It returns a Base64-encoded data URL string representing the recorded audio.

**4. Decode Audio Data**
The Base64 string is split to isolate the encoded audio portion, which is then decoded into binary format using b64decode.

**5. Convert to AudioSegment**
The binary audio data is wrapped in a BytesIO stream and passed to AudioSegment.from_file() to create an audio object. The input format is specified as "webm".

**6. Export as WAV File**
The audio is exported and saved locally as "recorded_audio.wav" in WAV format.

**7. Return Audio Object**
The function prints a confirmation message and returns the AudioSegment object for further use (e.g., playback, analysis, or visualization).
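
Step 4 above can be tried without a microphone. The sketch below builds a stand-in data URL from a few example bytes and then decodes it the same way `record()` does. The helper name `decode_data_url` and the `audio/webm` header are assumptions for illustration; the header a real browser sends may differ.

```python
from base64 import b64decode, b64encode


def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes carried by a Base64 data URL."""
    # Everything before the first comma is the header, everything after is the payload
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a Base64 data URL")
    return b64decode(payload)


# Stand-in bytes; a real recording would be audio data from MediaRecorder
fake_audio = b"not really audio, just example bytes"
data_url = "data:audio/webm;base64," + b64encode(fake_audio).decode("ascii")

decoded = decode_data_url(data_url)
print(decoded == fake_audio)
```

Checking the header before decoding is an extra safety step that the `record()` function itself does not perform.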



In [5]:
# Create record() function

def record(seconds=3):
    print(f"Recording now for {seconds} seconds...")
    display(Javascript(RECORD))
    s = output.eval_js(f'record({seconds * 1000})')
    binary = b64decode(s.split(',')[1])

    # Convert to AudioSegment
    audio = AudioSegment.from_file(io.BytesIO(binary), format="webm")

    # Export as WAV
    audio.export("recorded_audio.wav", format="wav")
    print("Recording complete. Audio saved as 'recorded_audio.wav'")
    return audio

### Example 1: Record and Play Audio

The code in the cell below uses the `record()` function to record a `5 second` audio file and then plays it back. You may need to grant permission for the program to access the microphone on your laptop or computer.

In [6]:
# Example 1: Record and play audio


#from IPython.display import Audio
#from pydub import AudioSegment
#import io


#from base64 import b64decode
#from google.colab import output


# Set recording duration
record_duration=5

# Record audio
audio = record(record_duration)
display(Audio("recorded_audio.wav", autoplay=True))

Recording now for 5 seconds...


<IPython.core.display.Javascript object>

Recording complete. Audio saved as 'recorded_audio.wav'


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image02A.png)
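
To confirm that a saved WAV file really has the length you asked for, you can inspect it with Python's built-in `wave` module. The sketch below is self-contained: it writes a short test tone to a temporary file (a stand-in for `recorded_audio.wav`) and reads its properties back. The helper names `write_test_tone` and `wav_info` are hypothetical, and the approach assumes an uncompressed PCM WAV file.

```python
import math
import os
import struct
import tempfile
import wave


def write_test_tone(path, seconds=1.0, rate=16000, freq=440.0):
    """Write a mono 16-bit sine tone so there is a WAV file to inspect."""
    frames = bytearray()
    for i in range(int(seconds * rate)):
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / rate))
        frames += struct.pack("<h", sample)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))


def wav_info(path):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": wav.getnframes() / rate,
        }


demo_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
write_test_tone(demo_path, seconds=2.0)
info = wav_info(demo_path)
print(info)
```

In the notebook, calling `wav_info("recorded_audio.wav")` after a recording should report a duration close to the requested number of seconds.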

### **Exercise 1: Record and Play Audio**

In the cell below, use the `record()` function to record an `8 second` audio file and then play it back.

In [20]:
# Insert your code for Exercise 1 here

# Set recording duration
record_duration=8

# Record audio
audio = record(record_duration)
display(Audio("recorded_audio.wav", autoplay=True))

Recording now for 8 seconds...


<IPython.core.display.Javascript object>

Recording complete. Audio saved as 'recorded_audio.wav'


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image04A.png)

## **OpenAI Speech-to-Text API**

### Overview of the API
- **Models**: Includes Whisper and newer GPT-4o-based models.
- **Input**: Accepts audio files (e.g., MP3, WAV, MP4).
- **Output**: Returns transcriptions in formats like plain text, JSON, or subtitle formats (SRT, VTT).
- **Languages**: Supports dozens of languages with high accuracy, especially in noisy or accented speech environments.
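
Before sending a file to the API, it can help to check it locally. The sketch below is one possible pre-flight check. The list of extensions and the 25 MB limit are assumptions taken from OpenAI's speech-to-text guide at the time of writing and may change; the helper name `check_audio_file` is hypothetical.

```python
import os
import tempfile

# Assumed limits (from OpenAI's documentation at the time of writing)
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_audio_file(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file not found")
    elif os.path.getsize(path) > MAX_UPLOAD_BYTES:
        problems.append("file larger than 25 MB")
    return problems


# Demonstrate with a small dummy file and a path that does not exist
with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "recorded_audio.wav")
    with open(good_path, "wb") as f:
        f.write(b"\x00" * 1024)
    good_result = check_audio_file(good_path)
    bad_result = check_audio_file(os.path.join(folder, "notes.txt"))

print(good_result)
print(bad_result)
```

A check like this only looks at the file name and size; it does not verify that the contents are valid audio.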

### Why It’s Useful for Biomedical Investigators

1. **Transcribing Interviews & Focus Groups**  
   Automatically convert recorded conversations with patients, clinicians, or research participants into text for qualitative analysis.

2. **Clinical Note Dictation**  
   Researchers can dictate observations or notes during fieldwork or lab work, streamlining documentation.

3. **Meeting & Conference Transcripts**  
   Capture and archive discussions from research meetings, seminars, or collaborative calls.

4. **Data Extraction from Audio**  
   Enables downstream NLP tasks like identifying social determinants of health (SDOH) or extracting biomedical entities from spoken content.

5. **Multilingual Support**  
   Useful in global health research where interviews or data collection occur in multiple languages.


The code in the cell below shows how to use OpenAI's speech-to-text API to transcribe audio files. It defines a function, `transcribe_audio`, that takes a filename as input. The function opens the specified audio file in binary mode and uses the OpenAI client to create a transcription. The `client.audio.transcriptions.create()` method is called with two parameters: the model (`"whisper-1"`) and the audio file. `Whisper` is OpenAI's speech recognition model, known for its robustness across various languages and accents. The function returns the transcribed text. Note that the function relies on a global `client` object, which is created in Example 2 before the function is called.


In [21]:
# Create function
def transcribe_audio(filename):
    with open(filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcription.text

## Example 2: Speech-to-Text

The code in the cell below provides a simple yet powerful way to convert speech to text, which can be invaluable for tasks such as generating subtitles, creating searchable archives of audio content, or enabling voice commands in applications.

For Example 2, you should start counting out loud from `1` to `10` immediately after you start running the code cell.

In [22]:
# Example 2: Speech-to-Text


import openai
import os

# Set recording duration
record_duration=5

# Record audio
audio = record(record_duration)

# Get the API key from Colab's secrets (userdata)
api_key = userdata.get('OPENAI_API_KEY')
client = openai.OpenAI(api_key=api_key)

# Transcribe the audio file
transcription = transcribe_audio("recorded_audio.wav")
print("Transcription:")
print(transcription)


Recording now for 5 seconds...


<IPython.core.display.Javascript object>

Recording complete. Audio saved as 'recorded_audio.wav'
Transcription:
2, 3, 4, 5, 6, 7, 8, 9, 10, 11.


If the code is correct, you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image05A.png)

## **Exercise 2: Speech-to-Text**

In the cell below, write the code to generate Speech-to-Text using Example 2 as a template.

For **Exercise 2,** you should start counting backwards from `10` to `1` immediately after you start running the code cell.

In [None]:
# Insert your code for Exercise 2 here


import openai
import os

# Set recording duration
record_duration=5

# Record audio
audio = record(record_duration)

# Get the API key from Colab's secrets (userdata)
api_key = userdata.get('OPENAI_API_KEY')
client = openai.OpenAI(api_key=api_key)

# Transcribe the audio file
transcription = transcribe_audio("recorded_audio.wav")
print("Transcription:")
print(transcription)


Recording now for 5 seconds...


<IPython.core.display.Javascript object>

Recording complete. Audio saved as 'recorded_audio.wav'
Transcription:
10, 9, 8, 7, 6, 5.


If your code is correct, you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image06A.png)


In [None]:


def generate_text(text):
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=text
    )
    audio_data = response.content
    return audio_data  # Return the audio data directly

def speak_text(text):
    audio_data = generate_text(text)
    display(Audio(audio_data, autoplay=True))

# Record audio, transcribe it, and speak the transcription back
audio = record(5)
transcription = transcribe_audio("recorded_audio.wav")
print("Transcription:")
print(transcription)
speak_text(transcription)

Recording now for 5 seconds...


<IPython.core.display.Javascript object>

Recording complete. Audio saved as 'recorded_audio.wav'
Transcription:
you


## **Part 2: Text to Speech**

In this part, we'll explore text-to-speech (TTS) models, focusing on OpenAI's offerings. We'll primarily use OpenAI's TTS-1 model, which converts written text into natural-sounding speech. TTS-1 is optimized for real-time applications, making it a good fit for scenarios that require low-latency audio generation. We'll look at its capabilities and at practical applications that range from accessibility solutions to interactive voice response systems.


## **Simple Text to Speech Example**

This code snippet demonstrates how to use OpenAI's text-to-speech API to generate spoken audio from text. First, it imports the libraries: `openai` for API interaction, `IPython.display` for audio playback in Jupyter notebooks, and `base64` (imported, although this short example does not use it). The `TEXT` variable contains the message to be converted to speech. The `openai.audio.speech.create()` function is called with three parameters: the model (`"tts-1"`), the voice (`"alloy"`), and the input text. OpenAI offers several voice options, including:

* **alloy** - neutral
* **echo** - young
* **fable** - male
* **onyx** - deep male
* **nova** - female
* **shimmer** - warm female

Each voice has its unique characteristics, allowing users to choose the most suitable one for their application. Additionally, OpenAI provides a high-definition model called "tts-1-hd" for enhanced audio quality, though it may have higher latency. The function returns a response object, from which the audio content is extracted and stored in the audio_data variable for further processing or playback.
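
Because a misspelled voice name only fails once the request reaches the API, it can be convenient to validate the choices locally first. The sketch below assembles the three parameters described above into a dictionary. The helper name `build_speech_request` is hypothetical; the voice and model names are the ones listed in this lesson.

```python
# Voice names as listed in this lesson
VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def build_speech_request(text, voice="alloy", high_definition=False):
    """Return keyword arguments for a text-to-speech request."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if not text.strip():
        raise ValueError("input text is empty")
    return {
        # "tts-1-hd" trades some latency for higher audio quality
        "model": "tts-1-hd" if high_definition else "tts-1",
        "voice": voice,
        "input": text,
    }


request = build_speech_request("Hello there, how are you?", voice="nova")
print(request)
```

With a configured client, the dictionary could be passed on as `client.audio.speech.create(**request)`.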

### Voice Change
The last code cell of Part I (the one that defines `generate_text` and `speak_text`) demonstrates a complete workflow for speech-to-text and text-to-speech conversion using OpenAI's APIs. Despite its name, the `generate_text` function generates audio: it uses OpenAI's TTS-1 model with the "nova" voice to convert text into speech and returns the raw audio data. The `speak_text` function builds upon this by taking a text input, generating the corresponding audio, and then playing it using IPython's `Audio` display function with autoplay enabled. The main workflow begins by recording audio for 5 seconds (using the `record` function defined earlier), transcribing it using the previously defined `transcribe_audio` function, printing the transcription, and finally speaking the transcribed text back using the `speak_text` function. This creates a full circle of voice interaction: recording speech, converting it to text, and then converting that text back into speech.

In [None]:
import openai
import IPython.display as ipd
import base64

TEXT = "Hello there, I am one of the OpenAI chat voices, how are you?"

response = openai.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input=TEXT
)

# Get the audio content
audio_data = response.content

We can play this audio to the CoLab notebook user.

In [None]:
from IPython.display import Audio, display

# Play the audio in Colab
print("Playing audio:")
display(Audio(audio_data, autoplay=True))

Playing audio:


We can also save an audio file.

In [None]:
with open("audio.mp3", "wb") as f:
    f.write(audio_data)


We can download this audio file.

In [None]:
# Download the generated audio file (audio.mp3) to your computer

from google.colab import files
files.download('audio.mp3')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Multiple Voices and Samples

The code demonstrates how to concatenate multiple text-to-speech responses from OpenAI's API, showcasing each of the available voices. It uses the `pydub` library to combine the audio segments. The script iterates through a list of six voices ("alloy", "echo", "fable", "onyx", "nova", and "shimmer"), generating a sample audio for each voice saying "Hello, I am the [voice] voice." These individual audio segments are then combined into a single audio file using `AudioSegment` from `pydub`. The resulting audio plays each voice sample in sequence, allowing listeners to hear the distinct characteristics of each voice option. This approach is particularly useful for comparing different voices or creating a demo reel of the available voice options in a single, continuous audio stream.

In [None]:
import io
from openai import OpenAI
from IPython.display import Audio, display
from google.colab import files
import os

# Initialize OpenAI client
client = OpenAI()

voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
audio_segments = []

for voice in voices:
    text = f"Hello, I am the {voice} voice."
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=text
    )
    audio_segments.append(response.content)

# Combine audio segments
from pydub import AudioSegment

combined_audio = AudioSegment.empty()
for segment in audio_segments:
    audio = AudioSegment.from_mp3(io.BytesIO(segment))
    combined_audio += audio

# Convert the combined audio to a byte stream
buffer = io.BytesIO()
combined_audio.export(buffer, format="mp3")
buffer.seek(0)

0

Play the audio to the CoLab user.

In [None]:
# Play the audio in Colab
print("Playing audio:")
display(Audio(buffer.read(), autoplay=True))

Playing audio:


Save the audio to a file.

In [None]:
# Reset buffer position
buffer.seek(0)

# Save the audio file
output_filename = "combined_voices.mp3"
with open(output_filename, "wb") as f:
    f.write(buffer.getvalue())

print(f"\nAudio saved as {output_filename}")


Audio saved as combined_voices.mp3


Download the audio file.

In [None]:
files.download(output_filename)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>
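
The concatenation step used for the multiple-voice sample relies on `pydub` to decode each MP3 before joining the samples. As an offline illustration of the same idea, the sketch below joins uncompressed WAV segments using only Python's standard library. This is a swapped-in technique, not the `pydub` method used in the lesson, and the helper names `make_silence_wav` and `concat_wavs` are hypothetical.

```python
import io
import wave


def make_silence_wav(seconds, rate=8000):
    """Return the bytes of a mono 16-bit WAV file containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def concat_wavs(segments):
    """Join WAV byte strings that share the same audio format."""
    if not segments:
        raise ValueError("no segments to combine")
    out = io.BytesIO()
    expected = None
    with wave.open(out, "wb") as writer:
        for segment in segments:
            with wave.open(io.BytesIO(segment), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
                if expected is None:
                    # The first segment defines the output format
                    expected = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError("segments have different audio formats")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()


combined = concat_wavs([make_silence_wav(0.5), make_silence_wav(1.0)])
with wave.open(io.BytesIO(combined), "rb") as check:
    total_seconds = check.getnframes() / check.getframerate()
print(total_seconds)
```

The samples are joined after decoding rather than by gluing the raw files together, because each file carries its own header.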

## **Part 3: Chat Bot**

In this part we will create a speech chat bot. You will talk to it with your own voice and it will respond in a voice that you have selected.

In [None]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from IPython.display import display_markdown
import pickle

DEFAULT_TEMPLATE = """You are a helpful assistant. Do not use markdown, just regular text.
Limit your response to just a few sentences. If the user says something that indicates
that they wish to end the chat return just "bye" (no quotes), so I can end the loop.

Current conversation:
{history}
Human: {input}
AI:"""

MODEL = 'gpt-4o-mini'

class ChatBot:
    def __init__(self, llm_chat, llm_summary, template):
        """
        Initializes the ChatBot with language models and a template for conversation.

        :param llm_chat: A large language model for handling chat responses.
        :param llm_summary: A large language model for summarizing conversations.
        :param template: A string template defining the conversation structure.
        """
        self.llm_chat = llm_chat
        self.llm_summary = llm_summary
        self.template = template
        self.prompt_template = PromptTemplate(input_variables=["history", "input"], template=self.template)

        # Initialize memory and conversation chain
        self.memory = ConversationSummaryMemory(llm=self.llm_summary)
        self.conversation = ConversationChain(
            prompt=self.prompt_template,
            llm=self.llm_chat,
            memory=self.memory,
            verbose=False
        )

        self.history = []

    def converse(self, prompt):
        """
        Processes a conversation prompt and updates the internal history and memory.

        :param prompt: The input prompt from the user.
        :return: The generated response from the language model.
        """
        self.history.append([self.memory.buffer, prompt])
        output = self.conversation.invoke(prompt)
        return output['response']

    def chat(self, prompt):
        """
        Handles the full cycle of receiving a prompt, processing it, and displaying the result.

        :param prompt: The input prompt from the user.
        """
        print(f"Human: {prompt}")
        output = self.converse(prompt)
        display_markdown(output, raw=True)

    def print_memory(self):
        """
        Displays the current state of the conversation memory.
        """
        print("**Memory:")
        print(self.memory.buffer)

    def clear_memory(self):
        """
        Clears the conversation memory.
        """
        self.memory.clear()

    def undo(self):
        """
        Reverts the conversation memory to the state before the last interaction.
        """
        if len(self.history) > 0:
            self.memory.buffer = self.history.pop()[0]
        else:
            print("Nothing to undo.")

    def regenerate(self):
        """
        Restores the memory to its state before the most recent interaction and
        re-runs that prompt to generate a new response.
        """
        if len(self.history) > 0:
            self.memory.buffer, prompt = self.history.pop()
            self.chat(prompt)
        else:
            print("Nothing to regenerate.")

    def save_history(self, file_path):
        """
        Saves the conversation history to a file using pickle.

        :param file_path: The file path where the history should be saved.
        """
        with open(file_path, 'wb') as f:
            pickle.dump(self.history, f)

    def load_history(self, file_path):
        """
        Loads the conversation history from a file using pickle.

        :param file_path: The file path from which to load the history.
        """
        with open(file_path, 'rb') as f:
            self.history = pickle.load(f)
            # Optionally reset the memory based on the last saved state
            if self.history:
                self.memory.buffer = self.history[-1][0]

Next, we create an LLM to communicate with.

In [None]:
MODEL = 'gpt-4o-mini'

# Initialize the OpenAI LLM with your API key
llm = ChatOpenAI(
  model=MODEL,
  temperature= 0.3,
  n= 1)

c = ChatBot(llm, llm, DEFAULT_TEMPLATE)

response = c.converse("Hello, my name is David.")
print(response)


Hello, David! How can I assist you today?
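
The undo bookkeeping in the `ChatBot` class can be hard to follow with a real LLM in the loop. The sketch below reproduces just that mechanism with a toy stand-in: a plain text buffer plays the role of `memory.buffer`, and an echo reply replaces the model call. The class name `ToyChat` is hypothetical and exists only for illustration.

```python
class ToyChat:
    """Stand-in for ChatBot that keeps a text buffer instead of an LLM summary."""

    def __init__(self):
        self.buffer = ""    # plays the role of memory.buffer
        self.history = []   # snapshots of [buffer_before_turn, prompt]

    def converse(self, prompt):
        # Save the memory state from before this turn so it can be restored later
        self.history.append([self.buffer, prompt])
        reply = f"(echo) {prompt}"  # a real bot would call the LLM here
        self.buffer += f"Human: {prompt}\nAI: {reply}\n"
        return reply

    def undo(self):
        # Roll the memory back to the snapshot taken before the last turn
        if self.history:
            self.buffer = self.history.pop()[0]
            return True
        return False


bot = ToyChat()
bot.converse("Hello, my name is David.")
after_first_turn = bot.buffer
bot.converse("What is my name?")
bot.undo()
print(bot.buffer == after_first_turn)
```

Each call to `undo()` removes one turn, so calling it twice here would return the buffer to its empty starting state.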


In [None]:
from IPython.display import Javascript, Audio, display, HTML
from google.colab import output
from base64 import b64decode
import io
import time
import uuid
from openai import OpenAI
from pydub import AudioSegment  # used by record() below

client = OpenAI()

RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
    const reader = new FileReader()
    reader.onloadend = e => resolve(e.srcElement.result)
    reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
    stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    recorder = new MediaRecorder(stream)
    chunks = []
    recorder.ondataavailable = e => chunks.push(e.data)
    recorder.start()
    await sleep(time)
    recorder.onstop = async ()=>{
        blob = new Blob(chunks)
        text = await b2text(blob)
        resolve(text)
    }
    recorder.stop()
})
"""

def generate_text(text):
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=text
    )
    audio_data = response.content
    return audio_data  # Return the audio data directly

def speak_text(text):
    audio_data = generate_text(text)

    # Generate a unique ID for this audio element
    audio_id = f"audio_{uuid.uuid4().hex}"

    # Display the audio with the unique ID
    display(Audio(audio_data, autoplay=True, element_id=audio_id))

    # Create a hidden div to store the audio status
    status_div = f'<div id="{audio_id}_status" style="display: none;">playing</div>'
    display(HTML(status_div))

    # JavaScript to handle audio playback and status
    js_code = f"""
    var audioElement = document.getElementById('{audio_id}');
    if (audioElement) {{
        audioElement.onended = function() {{
            document.getElementById('{audio_id}_status').textContent = 'finished';
        }};
    }}
    """

    # Execute the JavaScript
    display(HTML(f"<script>{js_code}</script>"))

    # Wait for the audio to finish
    while True:
        status = eval_js(f"document.getElementById('{audio_id}_status').textContent")
        if status == 'finished':
            break
        time.sleep(0.1)

def eval_js(js_code):
    from google.colab import output
    return output.eval_js(js_code)

def record(seconds=3):
    print(f"Recording now for {seconds} seconds.")
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (seconds * 1000))
    binary = b64decode(s.split(',')[1])

    # Convert to AudioSegment
    audio = AudioSegment.from_file(io.BytesIO(binary), format="webm")

    # Export as WAV
    audio.export("recorded_audio.wav", format="wav")
    print("Recording done.")
    return audio

def transcribe_audio(filename):
    with open(filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcription.text
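
The `while True` loop in `speak_text` waits until the browser reports that playback has finished, so it can wait forever if that signal never arrives. A minimal sketch of a polling loop with a timeout follows. The helper name `wait_until` is hypothetical, and the demonstration uses a simple counter in place of the browser status check.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns True or until timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Stand-in for the playback status: reports "finished" on the third check
calls = {"count": 0}


def finished_after_three_checks():
    calls["count"] += 1
    return calls["count"] >= 3


ok = wait_until(finished_after_three_checks, timeout=2.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(ok, timed_out)
```

In `speak_text`, the predicate could be a small function that reads the hidden status element and compares it with `'finished'`.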

We now continue the chat conversation until the user requests it to end.

In [None]:
from pydub import AudioSegment

MODEL = 'gpt-4o-mini'

# Initialize the OpenAI LLM with your API key
llm = ChatOpenAI(
  model=MODEL,
  temperature= 0.3,
  n= 1)

c = ChatBot(llm, llm, DEFAULT_TEMPLATE)

# Record, transcribe, reply, and speak in a loop until the model answers "bye"
response = None
while response != "bye":
    audio = record(5)
    transcription = transcribe_audio("recorded_audio.wav")
    print(f"Human: {transcription}")
    response = c.converse(transcription)
    print(f"AI: {response}")
    speak_text(response)

Recording now for 5 seconds.


<IPython.core.display.Javascript object>

Recording done.
Human: you
AI: Hello! How can I assist you today?


Recording now for 5 seconds.


<IPython.core.display.Javascript object>

Recording done.
Human: you
AI: How can I assist you today?


Recording now for 5 seconds.


<IPython.core.display.Javascript object>

Recording done.
Human: you
AI: How can I assist you today?


Recording now for 5 seconds.


<IPython.core.display.Javascript object>

Recording done.
Human: . .
AI: Hello! How can I assist you today? I'm here to help with any questions or tasks you have.


Recording now for 5 seconds.


<IPython.core.display.Javascript object>

Recording done.
Human: you
AI: How can I assist you today?


Recording now for 5 seconds.


<IPython.core.display.Javascript object>

Recording done.
Human: you
AI: How can I assist you today?


Recording now for 5 seconds.


<IPython.core.display.Javascript object>

KeyboardInterrupt: 
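
The loop above ends only when the reply is exactly `"bye"`, and the sample run had to be stopped by hand. The sketch below shows one way to make the exit test more tolerant and to cap the number of turns. The function names `wants_to_end` and `run_chat` are hypothetical, and scripted text stands in for the recording, transcription, and LLM steps.

```python
import string


def wants_to_end(reply):
    """Return True for replies such as 'bye', 'Bye.' or ' BYE! '."""
    cleaned = reply.strip(string.punctuation + string.whitespace)
    return cleaned.lower() == "bye"


def run_chat(get_user_text, get_reply, max_turns=10):
    """Run the chat loop until the bot says bye or max_turns is reached."""
    transcript = []
    for _ in range(max_turns):
        user_text = get_user_text()
        reply = get_reply(user_text)
        transcript.append((user_text, reply))
        if wants_to_end(reply):
            break
    return transcript


# Scripted stand-ins for record() + transcribe_audio() and for c.converse()
scripted_user = iter(["Hello", "What is ASR?", "Goodbye for now"])
scripted_ai = {
    "Hello": "Hi! How can I help?",
    "What is ASR?": "Automatic speech recognition.",
    "Goodbye for now": "Bye.",
}

transcript = run_chat(lambda: next(scripted_user), scripted_ai.get)
print(len(transcript))
```

The `max_turns` cap guarantees that the loop stops even if the model never returns the exit word.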