<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_04_2_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

## **Module 4: Chatbots and Large Language Models**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 4 Material

* Part 4.1: Introduction to Large Language Models (LLMs)
* **Part 4.2: Chatbots**
* Part 4.3: Image Generation with StableDiffusion
* Part 4.4: Agentic AI

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except Exception:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

You should see the following output, except that your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.


### Test Your GEMINI_API_KEY

In order to run the code in this lesson you will need to have your secret `GEMINI_API_KEY` installed in your **Secrets** on this Colab notebook. Detailed steps for purchasing your `GEMINI_API_KEY` and installing it in your Colab notebook Secrets were provided in `Class_04_1`.

Run the code in the next cell to see if your `GEMINI_API_KEY` is installed correctly. You may have to Grant Access for your notebook to use your API key.

In [None]:
# Verify your API key setup

from google.colab import userdata
import os

# Check if API key is properly loaded
try:
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    print("API key loaded successfully!")
    print(f"Key length: {len(GEMINI_API_KEY)}")
except Exception as e:
    print(f"Error loading API key: {e}")
    print("Please set your API key in Google Colab:")
    print("1. Go to Secrets in the left sidebar")
    print("2. Create a new secret named 'GEMINI_API_KEY'")
    print("3. Paste your Gemini API key")

1. You may see this message when you run this cell:


![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image08C.png)

If you do see this popup just click on `Grant access`.


2. If your `GEMINI_API_KEY` is correctly installed you should see something _similar_ to the following output.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image09C.png)

3. However, if you see the following output

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image10C.png)

You will need to correct the error before you can continue. Ask your Instructor or TA for help if you cannot resolve the error yourself.

### Install `LangChain` packages

Run the code in the following cell to install the `langchain-google-genai` and related packages.

In [None]:
# Run these installations

!pip install -q langchain-core
!pip install -q pydub google-genai nest_asyncio langchain-community langchain-google-genai

You might not see any output or you might see the following output:

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image07E.png)

If you see this error message, don't worry about it.

### **YouTube Introduction to ChatBots**

Run the next cell to see a short introduction to ChatBots. This is a suggested, but optional, part of the lesson.

In [None]:
from IPython.display import HTML
video_id = 'gmUHEvrpYoU'

HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
  referrerpolicy="strict-origin-when-cross-origin"> </iframe>
""")

### Create Functions

Run the code in the next cell to create a number of audio functions needed for this lesson.

In [None]:
# ============================================================================
# COMPLETE WAKE WORD CHATBOT - FIXED VERSION
# Solves the "record is not defined" error
# ============================================================================

import os
import base64
import struct
import asyncio
import subprocess
import nest_asyncio
import time
from IPython.display import Javascript, display, Audio
from google.colab import output, userdata
from google import genai
from google.genai import types
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage, SystemMessage

# Enable nested event loops for Colab
nest_asyncio.apply()

# Initialize Gemini client once
API_KEY = userdata.get('GEMINI_API_KEY')
os.environ["GOOGLE_API_KEY"] = API_KEY
client = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})

# ============================================================================
# FIXED AUDIO RECORDING (Solves "record is not defined")
# ============================================================================

def record_audio(sec=5, filename='recorded_audio.wav', convert_to_wav=True):
    """
    Records audio from browser microphone - FIXED version.
    Defines and executes JavaScript in one call to avoid "not defined" errors.
    """
    # Complete JavaScript that defines AND executes the recording in one call
    complete_js = f"""
    (async function() {{
      const sleep = time => new Promise(resolve => setTimeout(resolve, time))

      const b2text = blob => new Promise(resolve => {{
        const reader = new FileReader()
        reader.onloadend = e => resolve(e.target.result)
        reader.readAsDataURL(blob)
      }})

      try {{
        const stream = await navigator.mediaDevices.getUserMedia({{ audio: true }})
        const recorder = new MediaRecorder(stream)
        const chunks = []

        recorder.ondataavailable = e => {{
          if (e.data.size > 0) chunks.push(e.data)
        }}

        const recordingPromise = new Promise((resolve, reject) => {{
          recorder.onstop = async () => {{
            stream.getTracks().forEach(track => track.stop())

            if (chunks.length === 0) {{
              reject('No audio data recorded')
              return
            }}

            const blob = new Blob(chunks, {{ type: 'audio/webm;codecs=opus' }})

            if (blob.size === 0) {{
              reject('Empty audio blob')
              return
            }}

            const text = await b2text(blob)
            resolve(text)
          }}

          recorder.onerror = e => reject('Recording error: ' + e.error)
        }})

        recorder.start()
        await sleep({sec * 1000})
        recorder.stop()

        return await recordingPromise

      }} catch (error) {{
        return 'ERROR: ' + error.message
      }}
    }})()
    """

    try:
        s = output.eval_js(complete_js)

        if not s or s.startswith('ERROR:'):
            return None

        if ',' not in s:
            return None

        binary = base64.b64decode(s.split(',')[1])

        if len(binary) == 0:
            return None

        if convert_to_wav:
            temp_webm = "temp_recording.webm"
            with open(temp_webm, 'wb') as f:
                f.write(binary)

            subprocess.run([
                'ffmpeg', '-i', temp_webm, '-ar', '16000',
                '-ac', '1', '-y', filename
            ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

            if os.path.exists(temp_webm):
                os.remove(temp_webm)

            return filename if os.path.exists(filename) else None
        else:
            with open(filename, 'wb') as f:
                f.write(binary)
            return filename

    except Exception as e:
        print(f"❌ Recording error: {e}")
        return None

# ============================================================================
# TRANSCRIPTION
# ============================================================================

def transcribe_audio(filename, language=None, prompt=None, model="gemini-2.5-flash"):
    """Transcribes audio using Gemini's native audio understanding."""
    if not os.path.exists(filename):
        return f"‚ùå File not found: {filename}"

    try:
        with open(filename, "rb") as f:
            audio_bytes = f.read()

        instruction = "Transcribe this audio accurately."
        if language:
            instruction += f" The audio is in {language}."
        if prompt:
            instruction += f" {prompt}"

        response = client.models.generate_content(
            model=model,
            contents=[
                instruction,
                types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav")
            ]
        )

        return response.text

    except Exception as e:
        return f"‚ùå Transcription error: {str(e)}"

# ============================================================================
# TEXT-TO-SPEECH
# ============================================================================

async def speak(text, voice="Fenrir", autoplay=True, save_to=None):
    """Generates natural speech using Gemini 2.0 Live API."""
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            )
        )
    )

    audio_chunks = bytearray()

    try:
        async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
            await session.send_client_content(
                turns=[types.Content(role="user", parts=[types.Part(text=text)])],
                turn_complete=True
            )

            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            if save_to:
                with open(save_to, "wb") as f:
                    f.write(full_audio)

            if autoplay:
                display(Audio(full_audio, rate=sample_rate, autoplay=True))

            return save_to

        return None

    except Exception as e:
        print(f"❌ Speech generation error: {e}")
        return None

# ============================================================================
# WAKE WORD DETECTION
# ============================================================================

async def listen_for_wake_word(
    wake_words=["hey gemini", "ok gemini"],
    listen_duration=3,
    voice="Kore"
):
    """Continuously listens for a wake word before activating the chatbot."""
    print("=" * 60)
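    print("🎧 WAKE WORD DETECTION ACTIVE")
    # Build the quoted list first: reusing double quotes inside an f-string
    # is a syntax error before Python 3.12
    quoted_wake_words = ", ".join(f'"{w}"' for w in wake_words)
    print(f"📢 Say one of: {quoted_wake_words}")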
    print("=" * 60)

    attempts = 0

    while True:
        attempts += 1
        print(f"\n🎤 Listening for wake word... (attempt {attempts})")

        filename = record_audio(sec=listen_duration, convert_to_wav=True)

        if not filename:
            await asyncio.sleep(1)
            continue

        transcription = transcribe_audio(filename)

        if transcription.startswith("‚ùå"):
            await asyncio.sleep(0.5)
            continue

        transcription_lower = transcription.strip().lower()

        if transcription_lower:
            print(f"📝 Heard: '{transcription_lower}'")

        wake_word_detected = any(wake_word in transcription_lower for wake_word in wake_words)

        if wake_word_detected:
            print("\n✅ WAKE WORD DETECTED!")
            await speak("Yes, how can I help you?", voice=voice, autoplay=True)
            await asyncio.sleep(1)
            return True
        else:
            await asyncio.sleep(0.3)

# ============================================================================
# CONVERSATION SESSION
# ============================================================================

async def run_conversation_session(loop_duration=5, voice="Kore", pause_before_listen=2):
    """Runs a single conversation session after wake word is detected."""
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        temperature=0.6
    )
    chat_history = [
        SystemMessage(content="You are a helpful voice assistant. Keep responses brief and conversational.")
    ]

    print("\n" + "=" * 60)
    print("💬 CONVERSATION MODE ACTIVE")
    print("=" * 60)

    silence_count = 0
    max_silence = 3

    while True:
        try:
            if len(chat_history) > 1:
                print(f"\n⏳ Waiting {pause_before_listen}s before next question...")
                await asyncio.sleep(pause_before_listen)

            print(f"\n🎤 Listening... ({loop_duration}s)")
            filename = record_audio(sec=loop_duration)

            if not filename:
                break

            user_text = transcribe_audio(filename)

            if user_text.startswith("‚ùå"):
                silence_count += 1
                print(f"⚠️ Transcription failed ({silence_count}/{max_silence})")

                if silence_count >= max_silence:
                    await speak("I'm having trouble hearing you. Goodbye!", voice=voice)
                    break
                continue

            user_text_cleaned = user_text.strip()

            if len(user_text_cleaned) < 3:
                silence_count += 1
                print(f"🔇 [inaudible] ({silence_count}/{max_silence})")

                if silence_count >= max_silence:
                    await speak("I haven't heard anything. Goodbye!", voice=voice)
                    break
                continue

            silence_count = 0
            print(f"\n👤 Human: {user_text_cleaned}")

            if any(word in user_text_cleaned.lower() for word in ["bye", "goodbye", "exit", "quit", "stop"]):
                print("👋 Exit command detected.")
                await speak("Goodbye!", voice=voice)
                break

            chat_history.append(HumanMessage(content=user_text_cleaned))
            response = llm.invoke(chat_history)

            print(f"🤖 AI: {response.content}")

            await speak(response.content, voice=voice, autoplay=True)

            chat_history.append(response)

        except KeyboardInterrupt:
            print("\n⚠️ Stopped by user.")
            break
        except Exception as e:
            print(f"❌ Error: {e}")
            break

    print("\n" + "=" * 60)
    print("💬 CONVERSATION SESSION ENDED")
    print("=" * 60)

# ============================================================================
# MAIN WAKE WORD ASSISTANT
# ============================================================================

async def start_wake_word_assistant(
    wake_words=["hey gemini", "ok gemini"],
    voice="Kore",
    wake_duration=3,
    conversation_duration=5,
    pause_between=2,
    return_to_wake=True
):
    """
    Main function to start the wake word assistant.

    Args:
        wake_words (list): Phrases that activate the assistant
        voice (str): Voice name (Puck, Charon, Kore, Fenrir, Aoede)
        wake_duration (int): Listening duration for wake word (seconds)
        conversation_duration (int): Listening duration during conversation (seconds)
        pause_between (int): Pause between conversation turns (seconds)
        return_to_wake (bool): Return to wake word mode after conversation
    """

    while True:
        # Phase 1: Wait for wake word
        wake_detected = await listen_for_wake_word(
            wake_words=wake_words,
            listen_duration=wake_duration,
            voice=voice
        )

        if not wake_detected:
            break

        # Phase 2: Active conversation
        await run_conversation_session(
            loop_duration=conversation_duration,
            voice=voice,
            pause_before_listen=pause_between
        )

        # Phase 3: Decide whether to continue or exit
        if not return_to_wake:
            print("\n👋 Exiting chatbot...")
            break
        else:
            print("\n🔄 Returning to wake word detection mode...")
            await asyncio.sleep(1)

# ============================================================================
# USAGE EXAMPLES
# ============================================================================

"""
# BASIC USAGE:
await start_wake_word_assistant()

# CUSTOM CONFIGURATION:
await start_wake_word_assistant(
    wake_words=["hey assistant", "computer"],
    voice="Fenrir",
    wake_duration=4,
    conversation_duration=6,
    pause_between=3
)

# ONE-SHOT MODE (exits after first conversation):
await start_wake_word_assistant(
    wake_words=["hey gemini"],
    voice="Aoede",
    return_to_wake=False
)
"""


# **Introduction to Speech Processing with Gemini**

![___](https://biologicslab.co/BIO1173/images/class_04/CourseImage.gif)

In this lesson, we explore how to use both computer-generated voice and voice recognition to create a `ChatBot`. We'll be working with the **Google Gemini API** and the **LangChain Google integration** to achieve this. Specifically, we'll demonstrate how to input normal text and have it spoken by the computer, and conversely, how we can speak to the computer and have it respond. We'll ultimately integrate these functionalities to create a chatbot that handles both text-to-speech and speech-to-text interactions.
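
For the text-to-speech direction, the `speak()` function defined above treats the audio it receives as raw 16-bit mono PCM at 24 kHz and prepends a 44-byte WAV header before playback. The following is a minimal, standalone sketch of that header step. `pcm_to_wav` is a hypothetical helper name, the one second of silence stands in for real model output, and the `wave` round trip is only there to check that the header is well formed.

```python
import io
import struct
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1, bits: int = 16) -> bytes:
    """Prepend a 44-byte RIFF/WAVE header to raw PCM samples."""
    block_align = channels * bits // 8       # bytes per sample frame
    byte_rate = sample_rate * block_align    # bytes per second of audio
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits,
        b"data", len(pcm),
    )
    return header + pcm

# One second of silence: 24,000 frames of two zero bytes each
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence)

# Read the result back with the standard library to confirm the header fields
with wave.open(io.BytesIO(wav_bytes), "rb") as w:
    print(w.getframerate(), w.getnchannels(), w.getsampwidth(), w.getnframes())
    # 24000 1 2 24000
```

If the sample rate or bit depth in the header does not match the audio data, playback will sound too fast, too slow, or like noise, so these values should mirror whatever the speech source actually produces.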

While we'll use Google `Colab` for this demonstration, in production environments, you'd likely use a mobile app or a web-based JavaScript solution, as each platform handles voice differently. We'll focus on keeping things generic and simple in Colab for now.

Voice applications are everywhere. For example, I can ask "`Alexa`, what time is it?" and multiple `Alexa` devices in my home will respond, although not always perfectly. I usually mute them during recording sessions. Applications like `Siri` or `Gemini` also offer voice interactions. For instance, you can now interact with the `Gemini` mobile app completely hands-free, or use the microphone input on the web interface.
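
Assistants like these activate on a spoken wake phrase. The `listen_for_wake_word()` function defined above checks for one by lowercasing the transcription and searching for the phrase as a substring. A transcription can come back with punctuation (for example "Hey, Gemini!"), which a plain substring search would miss. The sketch below shows one possible way to make the comparison more tolerant; `normalize` and `contains_wake_word` are hypothetical helper names and are not part of the lesson code.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def contains_wake_word(transcription: str, wake_words=("hey gemini", "ok gemini")) -> bool:
    """Return True if any wake phrase appears as whole words in the transcription."""
    padded = f" {normalize(transcription)} "
    return any(f" {normalize(w)} " in padded for w in wake_words)

print(contains_wake_word("Hey, Gemini! What time is it?"))  # True
print(contains_wake_word("They mentioned geminids."))       # False
```

Padding both strings with spaces means a phrase only matches on word boundaries, so "they mentioned geminids" does not trigger the assistant.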

To illustrate, I asked `Gemini`, "How are you doing?" and it responded by offering some insightful thoughts about the rapid evolution of multimodal AI. It highlighted that models like Gemini aren't just processing text anymore: they are natively designed to understand text, images, and **audio** simultaneously. It also suggested that students experiment with these new "multimodal" capabilities, as building hands-on projects is one of the best ways to understand the future of AI.

## **Part I: Speech to Text with Gemini**

Here we delve into the realm of speech-to-text technology, focusing on the powerful multimodal capabilities offered by **Google's Gemini models**. Speech-to-text, also known as automatic speech recognition (ASR), is a technology that converts spoken language into written text.

**Google's Gemini** models are well suited to this task. Unlike traditional pipelines that require separate systems for audio and text, Gemini is **natively multimodal**. This means it can accept audio inputs directly, which helps it stay accurate and robust across various accents, languages, and acoustic environments. We'll explore how these models can be integrated into applications to enable voice-based interactions, transcription services, and accessibility features, and demonstrate how to transform audio input into actionable text data.

Note: We will make use of the JavaScript technique described below to record audio directly within Google Colab, as Colab runs on a remote server and cannot access your local microphone by default.

https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be

### **Native Audio Understanding**

Here we look more closely at multimodal audio processing with Google's **Gemini 2.5** models. Unlike traditional speech-to-text (ASR) systems, which simply convert sound waves into words, Gemini treats audio as a "native" modality, meaning it processes the audio input directly alongside text.

This approach allows Gemini to not only transcribe speech with high accuracy but also to:
* **Understand Context:** Detect emotions (sarcasm, excitement) and non-verbal cues.
* **Diarize:** Distinguish between multiple speakers automatically.
* **Reason:** Summarize or answer questions about the audio content without needing a separate text-processing step.

We will explore how `gemini-2.5-flash` can be used to transform raw audio input into actionable data with remarkable precision.
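
The capabilities listed above are selected purely through the text instruction that accompanies the audio. The sketch below shows one way such instructions could be assembled, following the same pattern `transcribe_audio()` uses for its optional `language` and `prompt` arguments. The task names, the instruction wording, and `build_instruction` itself are illustrative assumptions, not part of the Gemini API.

```python
# Hypothetical task names mapped to plain-text instructions
TASK_INSTRUCTIONS = {
    "transcribe": "Transcribe this audio accurately.",
    "diarize": "Transcribe this audio accurately and label each speaker.",
    "summarize": "Summarize the main points of this audio.",
}

def build_instruction(task="transcribe", language=None, extra=None):
    """Assemble the text instruction that is sent alongside the audio."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"Unknown task: {task!r}")
    parts = [TASK_INSTRUCTIONS[task]]
    if language:
        parts.append(f"The audio is in {language}.")
    if extra:
        parts.append(extra)
    return " ".join(parts)

print(build_instruction("summarize", language="Spanish"))
# Summarize the main points of this audio. The audio is in Spanish.
```

The returned string would take the place of the fixed instruction passed in `contents` to `client.models.generate_content()`, next to the audio part.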

**Note on Recording in Colab:**
Because Google Colab runs on a remote server, it cannot access your local microphone directly. We will make use of a JavaScript bridge (adapted from [this technique](https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be)) to capture audio from your browser and stream it to the Python environment for processing.
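
On the Python side, the browser recording arrives as a Base64 data URL, a string of the form `data:<mime type>;base64,<payload>`. The `record_audio()` function splits this string on the comma and decodes the second half. Here is a minimal sketch of that decoding step with a little more validation; `decode_data_url` is a hypothetical helper, and the sample string is made up for illustration rather than real audio.

```python
import base64

def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a `data:<mime>;base64,<payload>` string into (mime_type, raw_bytes)."""
    if not data_url.startswith("data:") or "," not in data_url:
        raise ValueError("not a data URL")
    header, payload = data_url.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("expected a Base64-encoded data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)

# Made-up payload standing in for recorded audio
sample = "data:audio/webm;codecs=opus;base64," + base64.b64encode(b"fake audio").decode("ascii")
mime, raw = decode_data_url(sample)
print(mime)  # audio/webm;codecs=opus
print(raw)   # b'fake audio'
```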

## **Transcribing Audio with Gemini**

#####  **Overview of the API**
- **Models**: We use **Gemini 2.5 Flash**. It is a "multimodal" model, meaning it can natively understand text, images, and **audio** simultaneously.
- **Input**: Accepts audio data directly (e.g., WAV, MP3, MP4) alongside text prompts.
- **Output**: Returns text, JSON, or structured data based on your instructions.
- **Capabilities**: Unlike traditional "transcription-only" models, you can ask Gemini to do things *while* it listens, such as "Summarize this recording," "Extract the patient's symptoms," or "Translate this to Spanish."
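
When we ask for JSON output, the reply still arrives as text, and a model may wrap it in a Markdown code fence. The sketch below shows one way to parse such a reply with the standard library. `parse_json_reply` is a hypothetical helper and the sample reply is invented for illustration. Model output is not guaranteed to be valid JSON, so `json.loads` may still raise `json.JSONDecodeError`.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so this example stays readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

def parse_json_reply(reply: str):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence."""
    match = FENCED_BLOCK.search(reply)
    candidate = match.group(1) if match else reply.strip()
    return json.loads(candidate)

# Invented reply that mimics a fenced JSON answer
reply = FENCE + 'json\n{"symptoms": ["fever", "cough"]}\n' + FENCE
print(parse_json_reply(reply)["symptoms"])  # ['fever', 'cough']
```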

##### **Why It's Useful for Biomedical Investigators**

1. **Transcribing Interviews & Focus Groups**
   Automatically convert recorded conversations with patients, clinicians, or research participants into text for qualitative analysis.

2. **Clinical Note Dictation**
   Researchers can dictate observations or notes during fieldwork or lab work, streamlining documentation.

3. **Meeting & Conference Transcripts**
   Capture and archive discussions from research meetings, seminars, or collaborative calls.

4. **Data Extraction from Audio**
   Enables downstream NLP tasks like identifying social determinants of health (SDOH) or extracting biomedical entities directly from spoken content without needing a separate transcription step.

5. **Multilingual Support**
   Useful in global health research where interviews or data collection occur in multiple languages.

---

#### **How the Code Works**
The code below uses the **Google Gen AI SDK** (`google-genai`) to record a short clip and transcribe it.

1.  We record audio in the browser and receive it in Python as a Base64-encoded string, which we decode and convert to `test_recording.wav` with `ffmpeg`.
2.  We read the file's raw bytes and wrap them in an audio part with `types.Part.from_bytes()`.
3.  We send a **multimodal request** to the model containing two parts:
    * **The Prompt:** A text instruction saying *"Transcribe this audio accurately."*
    * **The Audio:** The actual sound file data.
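
The file-loading step can be sketched as a small helper that reads the audio bytes and picks a MIME type from the file extension, so that the declared type always matches the data being sent. `guess_audio_mime` and `load_audio` are hypothetical names, and the extension table is an assumption for illustration. Check the Gemini documentation for the audio formats the API accepts; WebM in particular may not be among them.

```python
from pathlib import Path

# Assumed mapping for illustration; not an official list of supported formats
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".webm": "audio/webm",
}

def guess_audio_mime(filename: str) -> str:
    """Return the MIME type for a known audio extension, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio extension: {suffix!r}")
    return AUDIO_MIME_TYPES[suffix]

def load_audio(filename: str) -> tuple[bytes, str]:
    """Read an audio file and return (raw_bytes, mime_type)."""
    mime_type = guess_audio_mime(filename)
    return Path(filename).read_bytes(), mime_type

print(guess_audio_mime("test_recording.WAV"))  # audio/wav
```

The two returned values correspond to the `data` and `mime_type` arguments of `types.Part.from_bytes()` used in the cell below.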

In [None]:
# ============================================================================
# QUICK TEST FOR FIXED RECORDING
# Run this to verify the "record is not defined" error is solved
# ============================================================================

import os
import base64
import time
from IPython.display import Javascript, display, Audio
from google.colab import output, userdata
from google import genai
from google.genai import types
import subprocess

# Initialize Gemini
API_KEY = userdata.get('GEMINI_API_KEY')
os.environ["GOOGLE_API_KEY"] = API_KEY
client = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})

# ============================================================================
# FIXED RECORDING FUNCTION
# ============================================================================

def record_audio_test(sec=3):
    """Quick test version - records for 3 seconds and returns data."""

    print(f"🎤 Recording for {sec} seconds... SPEAK NOW!")

    complete_js = f"""
    (async function() {{
      const sleep = time => new Promise(resolve => setTimeout(resolve, time))

      const b2text = blob => new Promise(resolve => {{
        const reader = new FileReader()
        reader.onloadend = e => resolve(e.target.result)
        reader.readAsDataURL(blob)
      }})

      try {{
        const stream = await navigator.mediaDevices.getUserMedia({{ audio: true }})
        const recorder = new MediaRecorder(stream)
        const chunks = []

        recorder.ondataavailable = e => {{
          if (e.data.size > 0) chunks.push(e.data)
        }}

        const recordingPromise = new Promise((resolve, reject) => {{
          recorder.onstop = async () => {{
            stream.getTracks().forEach(track => track.stop())

            if (chunks.length === 0) {{
              reject('No audio data')
              return
            }}

            const blob = new Blob(chunks, {{ type: 'audio/webm' }})

            if (blob.size === 0) {{
              reject('Empty blob')
              return
            }}

            const text = await b2text(blob)
            resolve(text)
          }}

          recorder.onerror = e => reject('Error: ' + e.error)
        }})

        recorder.start()
        await sleep({sec * 1000})
        recorder.stop()

        return await recordingPromise

      }} catch (error) {{
        return 'ERROR: ' + error.message
      }}
    }})()
    """

    try:
        start = time.time()
        s = output.eval_js(complete_js)
        elapsed = time.time() - start

        if not s or s.startswith('ERROR:'):
            print(f"❌ Recording failed: {s}")
            return None

        if ',' not in s:
            print("❌ Invalid data format")
            return None

        binary = base64.b64decode(s.split(',')[1])

        print("✅ Recording successful!")
        print(f"   Time: {elapsed:.1f}s")
        print(f"   Data: {len(binary)} bytes")

        # Save as WebM
        filename = 'test_recording.webm'
        with open(filename, 'wb') as f:
            f.write(binary)

        # Convert to WAV
        wav_filename = 'test_recording.wav'
        subprocess.run([
            'ffmpeg', '-i', filename, '-ar', '16000',
            '-ac', '1', '-y', wav_filename
        ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

        if os.path.exists(wav_filename):
            print(f"✅ Converted to WAV: {wav_filename}")
            return wav_filename
        else:
            print("⚠️ Conversion failed, using WebM")
            return filename

    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return None

# ============================================================================
# TRANSCRIPTION
# ============================================================================

def transcribe_test(filename):
    """Quick transcription test."""
    if not os.path.exists(filename):
        print(f"❌ File not found: {filename}")
        return None

    print("🔄 Transcribing...")

    try:
        with open(filename, "rb") as f:
            audio_bytes = f.read()

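        # Match the MIME type to the file: record_audio_test() falls back to
        # the WebM recording when the ffmpeg conversion to WAV fails
        mime_type = "audio/webm" if filename.endswith(".webm") else "audio/wav"

        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[
                "Transcribe this audio accurately.",
                types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)
            ]
        )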

        print("✅ Transcription complete!")
        return response.text

    except Exception as e:
        print(f"❌ Transcription error: {e}")
        return None

# ============================================================================
# RUN TEST
# ============================================================================

def run_complete_test():
    """Run a complete test: record + transcribe + playback."""

    print("=" * 60)
    print("TESTING FIXED RECORDING FUNCTION")
    print("=" * 60)

    # Step 1: Record
    print("\nStep 1: Recording...")
    audio_file = record_audio_test(sec=5)

    if not audio_file:
        print("\n❌ TEST FAILED: Could not record audio")
        return False

    # Step 2: Transcribe
    print("\nStep 2: Transcribing...")
    transcription = transcribe_test(audio_file)

    if not transcription:
        print("\n❌ TEST FAILED: Could not transcribe audio")
        return False

    # Step 3: Display results
    print("\n" + "=" * 60)
    print("📜 TRANSCRIPTION RESULT:")
    print("=" * 60)
    print(transcription)
    print("=" * 60)

    # Step 4: Playback
    print("\nStep 3: Playing back audio...")
    display(Audio(audio_file, autoplay=False))

    print("\n" + "=" * 60)
    print("✅ ALL TESTS PASSED!")
    print("=" * 60)
    print("\nThe recording function is working correctly.")
    print("You can now use the wake word assistant!")

    return True

# ============================================================================
# USAGE
# ============================================================================

"""
# In Google Colab, just run:
run_complete_test()

# If this passes, you're ready to use the wake word assistant:
# (Paste complete_wake_word_FIXED.py and run)
await start_wake_word_assistant()
"""


In [None]:
run_complete_test()

## Example 1: Speech-to-Text

The code in the cell below uses the `transcribe_audio()` function to convert your voice into text.

Once you hit the run cell icon ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png), start counting out loud from `1` to `10`.

In [None]:
# Example 1: Speech-to-Text

from IPython.display import Audio, display
import time

# Configuration
LLM_MODEL = "gemini-2.5-flash"
RECORD_DURATION = 10  # Seconds
TIMEOUT = 30  # Maximum seconds to wait for permission + recording

try:
    # 1. Capture Audio
    print("🎤 Start speaking!")
    print(f"⏱️ Please grant microphone permission if prompted (timeout in {TIMEOUT}s)")

    start_time = time.time()
    audio_filename = record_audio(sec=RECORD_DURATION)
    elapsed = time.time() - start_time

    # Check if it took suspiciously long (likely hung on permission dialog)
    if elapsed > (RECORD_DURATION + 15):
        print("⚠️ Recording took too long - permission may have been delayed.")
        print("💡 Tip: Run this cell again. Permission should already be granted.")

    if audio_filename:
        # 2. Transcribe using Native Audio Reasoning
        print(f"📡 Sending waveform to {LLM_MODEL}...")

        transcription = transcribe_audio(
            filename=audio_filename,
            model=LLM_MODEL,
            prompt="Transcribe accurately. Include speaker labels if multiple people are speaking."
        )

        # 3. Output Results
        print("\n" + "="*30)
        print("📜 TRANSCRIPTION")
        print("="*30)
        print(transcription)
        print("="*30)

        # 4. Playback for verification
        print("\n🔊 Playing back recorded audio...")
        display(Audio(audio_filename, autoplay=False))

    else:
        print("❌ Recording failed. Please check your browser's microphone permissions.")
        print("💡 Make sure to click 'Allow' when the browser asks for microphone access.")

except KeyboardInterrupt:
    print("\n⚠️ Recording interrupted by user.")
except Exception as e:
    print(f"⚠️ An error occurred during the Speech-to-Text process: {e}")
    print("💡 Try running the cell again - microphone permission may need to be granted first.")


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image07F.png)

## **Exercise 1: Speech-to-Text**

In the cell below, write the code to generate Speech-to-Text using the code in Example 1 as a template.

For **Exercise 1**, once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png), start counting **_backwards_** from `10` to `1`.

In [None]:
# Insert your code for Exercise 1 here

from IPython.display import Audio, display
import time

# Configuration
LLM_MODEL = "gemini-2.5-flash"
RECORD_DURATION = 10  # Seconds
TIMEOUT = 30  # Maximum seconds to wait for permission + recording

try:
    # 1. Capture Audio
    print("🎤 Start speaking!")
    print(f"⏱️ Please grant microphone permission if prompted (timeout in {TIMEOUT}s)")

    start_time = time.time()
    audio_filename = record_audio(sec=RECORD_DURATION)
    elapsed = time.time() - start_time

    # Check if it took suspiciously long (likely hung on permission dialog)
    if elapsed > (RECORD_DURATION + 15):
        print("⚠️ Recording took too long - permission may have been delayed.")
        print("💡 Tip: Run this cell again. Permission should already be granted.")

    if audio_filename:
        # 2. Transcribe using Native Audio Reasoning
        print(f"📡 Sending waveform to {LLM_MODEL}...")

        transcription = transcribe_audio(
            filename=audio_filename,
            model=LLM_MODEL,
            prompt="Transcribe accurately. Include speaker labels if multiple people are speaking."
        )

        # 3. Output Results
        print("\n" + "="*30)
        print("📜 TRANSCRIPTION")
        print("="*30)
        print(transcription)
        print("="*30)

        # 4. Playback for verification
        print("\n🔊 Playing back recorded audio...")
        display(Audio(audio_filename, autoplay=False))

    else:
        print("❌ Recording failed. Please check your browser's microphone permissions.")
        print("💡 Make sure to click 'Allow' when the browser asks for microphone access.")

except KeyboardInterrupt:
    print("\n⚠️ Recording interrupted by user.")
except Exception as e:
    print(f"⚠️ An error occurred during the Speech-to-Text process: {e}")
    print("💡 Try running the cell again - microphone permission may need to be granted first.")


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image08F.png)

## **Text to Speech with Google**

In this section, we'll explore text-to-speech (TTS), focusing on Google's powerful speech synthesis tools. While Gemini is excellent at generating the *words*, we use Google's dedicated Text-to-Speech engine (`gTTS`) to convert that written text into natural-sounding speech.

Google's TTS models are optimized for both real-time applications and high-fidelity audio storage. This technology represents a significant advancement in speech synthesis, using deep learning to produce clear, lifelike vocal outputs in a wide variety of languages and accents. By utilizing these tools, we'll explore their capabilities and understand how they revolutionize industries, from accessibility solutions to interactive voice assistants and beyond.

## **Google's Voices**

When using the Gemini Multimodal Live API to generate real-time conversational audio, you use Native Audio models (such as `gemini-live-2.5-flash-native-audio`). Unlike traditional Text-to-Speech, which synthesizes sound from finished text, Gemini's native audio models generate speech directly as a core modality. This allows for Affective Dialog, where the voice automatically adapts its tone, emotion, and emphasis based on the context of the conversation.

Google offers a suite of distinct voice personas, along with a library of over 30 HD voices. The primary personas include:

* Puck – The most popular general-purpose voice. Conversational, friendly, and approachable with a mid-range pitch. It has an "upbeat" and "guy-next-door" feel.

* Charon – A deep, calm, and authoritative male voice. It projects a sense of informative experience and steady confidence, perfect for formal narrations.

* Kore – A bright, energetic, and professional female voice. Excellent for high-engagement tasks like coaching or upbeat customer support where a "firm" but engaging tone is needed.

* Fenrir – A warm, steady, and approachable male voice. It sits between Puck and Charon, making it perfect for long-form listening or educational content.

* Aoede – A clear, thoughtful, and articulate female voice. Known for a "breezy" and intelligent tone that handles complex discussions gracefully.

### Example 2: Demonstrate Different Voices

The code in the cell below demonstrates four of the voices that are available in the `Gemini` text-to-speech API:

* **Puck:** A clear, direct, and conversational male voice with a mid-range pitch. Puck is often described as having a "guy next door" feel: friendly, trustworthy, and approachable. Because of its balanced tone, it is the default choice for most general-purpose assistants.

* **Charon:** A deep, calm, and authoritative male voice. Charon projects a sense of experience and steady confidence. It is best suited for scenarios that require a more formal or serious tone, such as news delivery, instructional narrations, or professional corporate guides.

* **Kore:** An energetic and youthful female voice with a bright, professional quality. Kore conveys high enthusiasm and confidence without being overly casual. This makes it an excellent choice for upbeat tutorials, engaging customer support, or any interaction where you want to keep the user's energy high.

* **Fenrir:** Widely considered the most versatile of the male voices. It sits between the high energy of Puck and the deep authority of Charon.

For example, here is a more detailed description of the **Fenrir** voice:

>  **Persona:** Warm, approachable, and steady. Fenrir has a mid-range pitch that feels exceptionally natural and human. It lacks the "broadcast" quality of Charon and the "youthful bounce" of Puck, making it feel more like a calm colleague or a supportive mentor.

>  **Tone:** Balanced and conversational. It is designed to be "easy to listen to" for long periods, which is why it is frequently used for e-learning, narrations, and long-form assistants.

>  **Best For:** Explainer videos, podcasting, technical support, or any application where you want to project reliability and warmth without being too formal.

Run the code cell to hear each of these four voices.

In [None]:
# Example 2: Demonstrate Different Voices

from IPython.display import Audio, display
import os

async def run_voice_demos():
    # Primary voice profiles
    demo_voices = ["Puck", "Charon", "Kore", "Fenrir"]

    # This text is designed to showcase the tonal differences of each profile
    sample_text = "Hello! I am one of the native voices available in the Gemini Live API. Can you hear the unique personality in my tone?"

    print("--- Starting Gemini Live Voice Demo ---")

    for voice in demo_voices:
        # Step 1: Generate the audio file using the Live API
        # This function handles the WebSocket connection and PCM-to-WAV conversion
        filename = f"sample_{voice.lower()}.wav"
        result = await speak(sample_text, voice=voice, autoplay=False, save_to=filename)

        # Step 2: Validate and display the playback widget
        if result and os.path.exists(filename):
            print(f"\n[✔] Playing sample for {voice}:")
            display(Audio(filename, autoplay=False))
        else:
            print(f"  [X] Skipping {voice}: Generation failed or file not found.")

    print("\n--- Voice Demonstration Complete ---")

# Execute the voice demo in Colab
await run_voice_demos()


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image02F.png)

Press the Play icon to listen to each voice.

### **Exercise 2: Demonstrate Different Voices**

In the cell below, write the code to demonstrate the following 4 Gemini voices:

* **Aoede:** A clear, thoughtful, and articulate female voice, often described as sounding intelligent and engaging.

* **Leda:** A calm and steady voice with a balanced tone, suitable for neutral assistants.

* **Orus:** A direct and confident male voice, slightly more formal than Puck.

* **Zephyr:** An upbeat and energetic voice, similar in spirit to Kore but with a different tonal profile.

In [None]:
# Insert your code for Exercise 2 here

from IPython.display import Audio, display
import os

async def run_voice_demos():
    # Primary voice profiles
    demo_voices = ["Aoede", "Leda", "Orus", "Zephyr"]

    # This text is designed to showcase the tonal differences of each profile
    sample_text = "Hello! I am one of the native voices available in the Gemini Live API. Can you hear the unique personality in my tone?"

    print("--- Starting Gemini Live Voice Demo ---")

    for voice in demo_voices:
        # Step 1: Generate the audio file using the Live API
        # This function handles the WebSocket connection and PCM-to-WAV conversion
        filename = f"sample_{voice.lower()}.wav"
        result = await speak(sample_text, voice=voice, autoplay=False, save_to=filename)

        # Step 2: Validate and display the playback widget
        if result and os.path.exists(filename):
            print(f"\n[✔] Playing sample for {voice}:")
            display(Audio(filename, autoplay=False))
        else:
            print(f"  [X] Skipping {voice}: Generation failed or file not found.")

    print("\n--- Voice Demonstration Complete ---")

# Execute the voice demo in Colab
await run_voice_demos()


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image03F.png)

Press the Play icon to listen to each voice.

### Example 3: Transcribe Recorded Data

The code in the cell below shows how to record your speech, print out a transcription of what you said, and finally, read the transcription back using the "Fenrir" voice.

Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png), read aloud Carl Sandburg's poem "Fog", a short, imagistic piece that captures the quiet, mysterious arrival of fog. Don't forget to start by saying the title of the poem, "FOG". When you read the poem, make sure to pause after every line.

```text
FOG

The fog comes
on little cat feet.

It sits looking
over harbor and city
on silent haunches
and then moves on.
```

In [None]:
# Example 3: Transcribe Recorded Data

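import struct
from IPython.display import Audio, display
from google.genai import types

# NOTE: `client` (a genai.Client), `record_audio()` and `transcribe_audio()`
# are assumed to have been defined in an earlier cell of this notebook.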

# Recommended GA voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
voice = "Fenrir"
duration = 15

async def speak_verbatim(text, voice="Fenrir", autoplay=True):
    """
    Pure text-to-speech that reads exactly what you give it.
    No LLM commentary, just verbatim reading.
    """
    print(f"🔊 Speaking with voice '{voice}'...")

    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            )
        )
    )

    audio_chunks = bytearray()

    try:
        async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
            # Send as a TOOL result or instruction to just read, not analyze
            # Use a very explicit instruction to override the LLM's tendency to analyze
            await session.send_client_content(
                turns=[types.Content(
                    role="user",
                    parts=[types.Part(text=f"Read this text verbatim without any analysis, commentary, or additional words. Just speak exactly these words: {text}")]
                )],
                turn_complete=True
            )

            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            if autoplay:
                display(Audio(full_audio, rate=sample_rate, autoplay=True))

            return True
        else:
            print("❌ No audio generated")
            return False

    except Exception as e:
        print(f"❌ Speech generation error: {e}")
        return False


try:
    # 1. Capture Audio
    print(f"🎤 Starting {duration} second recording...")
    audio_path = record_audio(sec=duration)

    if audio_path:
        # 2. Transcribe using Native Audio Reasoning
        transcription = transcribe_audio(audio_path)

        print("\n" + "="*30)
        print(f"📜 Captured: {transcription}")
        print("="*30 + "\n")

        # 3. Speak back using verbatim TTS
        print("🔊 Reading back transcript...")
        await speak_verbatim(transcription, voice=voice)

    else:
        print("❌ Recording failed. Please check mic permissions.")

except Exception as e:
    print(f"⚠️ Error in the process: {e}")


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10F.png)
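
The `struct.pack()` call inside `speak_verbatim()` builds a 44-byte WAV header by hand so that the raw PCM audio returned by the model can be played in the notebook. As a minimal sketch of what that header contains, the cell below wraps raw audio using Python's standard `wave` module and compares the result with the hand-built header. It assumes the audio is 16-bit, mono, 24 kHz PCM (the values used in Example 3). The helper names `pcm_to_wav_bytes()` and `manual_wav_header()` are illustrative and are not part of the lesson code.

```python
import io
import struct
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24000):
    """Wrap raw 16-bit mono PCM bytes in a WAV container (stdlib only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def manual_wav_header(num_bytes, sample_rate=24000):
    """Same header layout as the struct.pack() call in Example 3."""
    return struct.pack(
        '<4sI4s4sIHHIIHH4sI',
        b'RIFF', 36 + num_bytes, b'WAVE', b'fmt ', 16, 1, 1,
        sample_rate, sample_rate * 2, 2, 16, b'data', num_bytes
    )


# Half a second of silence: 12,000 samples x 2 bytes per sample
silence = bytes(24000)
wav_bytes = pcm_to_wav_bytes(silence)

print("Header size:", len(manual_wav_header(len(silence))), "bytes")
print("Headers match:", wav_bytes[:44] == manual_wav_header(len(silence)))
```

Either approach should give a playable file for this audio format. The `wave` module is simply easier to read and harder to get wrong than a hand-written format string.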

### **Exercise 3: Transcribe Recorded Data**

In the cell below, write the code to record your speech, print out a transcription of what you said, and finally, read the transcription using the "Leda" voice.

After you start running the cell, start reading _The Red Wheelbarrow_ by William Carlos Williams. Like _Fog_, it's a minimalist, imagist poem that captures a vivid moment with few words. Make sure to pause after each couplet.

```text
The Red Wheelbarrow

so much depends
upon

a red wheel
barrow

glazed with rain
water

beside the white
chickens.
```

In [None]:
# Insert your code for Exercise 3 here

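import struct
from IPython.display import Audio, display
from google.genai import types

# NOTE: `client` (a genai.Client), `record_audio()` and `transcribe_audio()`
# are assumed to have been defined in an earlier cell of this notebook.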

# Recommended GA voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
voice = "Leda"
duration = 20

async def speak_verbatim(text, voice="Fenrir", autoplay=True):
    """
    Pure text-to-speech that reads exactly what you give it.
    No LLM commentary, just verbatim reading.
    """
    print(f"🔊 Speaking with voice '{voice}'...")

    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            )
        )
    )

    audio_chunks = bytearray()

    try:
        async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
            # Send as a TOOL result or instruction to just read, not analyze
            # Use a very explicit instruction to override the LLM's tendency to analyze
            await session.send_client_content(
                turns=[types.Content(
                    role="user",
                    parts=[types.Part(text=f"Read this text verbatim without any analysis, commentary, or additional words. Just speak exactly these words: {text}")]
                )],
                turn_complete=True
            )

            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            if autoplay:
                display(Audio(full_audio, rate=sample_rate, autoplay=True))

            return True
        else:
            print("❌ No audio generated")
            return False

    except Exception as e:
        print(f"❌ Speech generation error: {e}")
        return False


try:
    # 1. Capture Audio
    print(f"🎤 Starting {duration} second recording...")
    audio_path = record_audio(sec=duration)

    if audio_path:
        # 2. Transcribe using Native Audio Reasoning
        transcription = transcribe_audio(audio_path)

        print("\n" + "="*30)
        print(f"📜 Captured: {transcription}")
        print("="*30 + "\n")

        # 3. Speak back using verbatim TTS
        print("🔊 Reading back transcript...")
        await speak_verbatim(transcription, voice=voice)

    else:
        print("❌ Recording failed. Please check mic permissions.")

except Exception as e:
    print(f"⚠️ Error in the process: {e}")


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image13F.png)

The poem is broken into four tiny stanzas, each with a long line followed by a short one.

This spacing:
* slows the reader down
* isolates each image
* makes you notice the shape of the words

Even the word "wheelbarrow" is split in half, forcing you to see it differently.

### Example 4: Transcribe Recorded Data

In Example 3, we needed to tell our LLM **not** to analyze what we said and to simply read the text back, using this line of code:

```python
parts=[types.Part(text=f"Read this text verbatim without any analysis, commentary, or additional words. Just speak exactly these words: {text}")]
```

In Example 4, we are going to send the same text to our model, `gemini-2.0-flash-exp`. However, this time we are going to modify the instructions as follows:

```python
parts=[types.Part(text=f"Read this text: {text}")]
```
This model has a tendency to be a bit "wordy". Let's see what happens when we relax our instructions.

Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png), read aloud Carl Sandburg's poem "Fog", a short, imagistic piece that captures the quiet, mysterious arrival of fog. Don't forget to start by saying the title of the poem, "FOG". When you read the poem, make sure to pause after every line.

```text
FOG

The fog comes
on little cat feet.

It sits looking
over harbor and city
on silent haunches
and then moves on.
```

In [None]:
# Example 4: Transcribe Recorded Data

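import struct
from IPython.display import Audio, display
from google.genai import types

# NOTE: `client` (a genai.Client), `record_audio()` and `transcribe_audio()`
# are assumed to have been defined in an earlier cell of this notebook.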

# Recommended GA voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
voice = "Fenrir"
duration = 20

async def speak_verbatim(text, voice="Fenrir", autoplay=True):
    """
    Sends the text to the model with a relaxed "Read this text" instruction.
    Because the prompt no longer forbids commentary, the model may add its own.
    """
    print(f"🔊 Speaking with voice '{voice}'...")

    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            )
        )
    )

    audio_chunks = bytearray()

    try:
        async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
            # Send a relaxed instruction: the model is asked to read the text,
            # but it is no longer told to avoid analysis or commentary
            await session.send_client_content(
                turns=[types.Content(
                    role="user",
                    parts=[types.Part(text=f"Read this text: {text}")]
                )],
                turn_complete=True
            )

            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            if autoplay:
                display(Audio(full_audio, rate=sample_rate, autoplay=True))

            return True
        else:
            print("❌ No audio generated")
            return False

    except Exception as e:
        print(f"❌ Speech generation error: {e}")
        return False


try:
    # 1. Capture Audio
    print(f"🎤 Starting {duration} second recording...")
    audio_path = record_audio(sec=duration)

    if audio_path:
        # 2. Transcribe using Native Audio Reasoning
        transcription = transcribe_audio(audio_path)

        print("\n" + "="*30)
        print(f"📜 Captured: {transcription}")
        print("="*30 + "\n")

        # 3. Speak back using the relaxed prompt
        print("🔊 Reading back transcript...")
        await speak_verbatim(transcription, voice=voice)

    else:
        print("❌ Recording failed. Please check mic permissions.")

except Exception as e:
    print(f"⚠️ Error in the process: {e}")


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image12F.png)

Even though we asked the LLM simply to read the text back to us, it couldn't help commenting on what you said to it!
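
The only difference between the strict and the relaxed version of `speak_verbatim()` is the instruction string placed in front of the transcript. The sketch below isolates that difference in a small helper so the two prompts can be compared side by side. The function name `build_read_prompt()` and its `verbatim` flag are hypothetical and are not part of the lesson code.

```python
def build_read_prompt(text, verbatim=True):
    """Return the instruction string sent to the model (illustrative helper)."""
    if verbatim:
        # Strict wording: explicitly forbids analysis and commentary
        return (
            "Read this text verbatim without any analysis, commentary, "
            "or additional words. Just speak exactly these words: " + text
        )
    # Relaxed wording: leaves the model free to add its own remarks
    return f"Read this text: {text}"


poem_line = "The fog comes on little cat feet."
print(build_read_prompt(poem_line, verbatim=True))
print(build_read_prompt(poem_line, verbatim=False))
```

Printing both strings makes it clear how little the prompt changes, even though the behavior of the model can change a great deal.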

### **Exercise 4: Transcribe Recorded Data**

In the cell below, write the code to record your speech, print out a transcription of what you said, and finally, read the transcription using the "Leda" voice.

Change the instructions to the LLM from this line of code:
```python
parts=[types.Part(text=f"Read this text verbatim without any analysis, commentary, or additional words. Just speak exactly these words: {text}")]
```

to read as this line of code:
```python
parts=[types.Part(text=f"Read this text: {text}")]
```

After you start running the cell, start reading _The Red Wheelbarrow_ by William Carlos Williams. Like _Fog_, it's a minimalist, imagist poem that captures a vivid moment with few words. Make sure to pause after each couple of lines.

```text
The Red Wheelbarrow

so much depends
upon

a red wheel
barrow

glazed with rain
water

beside the white
chickens.
```

In [None]:
# Insert your code for Exercise 4 here

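import struct
from IPython.display import Audio, display
from google.genai import types

# NOTE: `client` (a genai.Client), `record_audio()` and `transcribe_audio()`
# are assumed to have been defined in an earlier cell of this notebook.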

# Recommended GA voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
voice = "Leda"
duration = 20

async def speak_verbatim(text, voice="Fenrir", autoplay=True):
    """
    Sends the text to the model with a relaxed "Read this text" instruction.
    Because the prompt no longer forbids commentary, the model may add its own.
    """
    print(f"🔊 Speaking with voice '{voice}'...")

    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            )
        )
    )

    audio_chunks = bytearray()

    try:
        async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
            # Send a relaxed instruction: the model is asked to read the text,
            # but it is no longer told to avoid analysis or commentary
            await session.send_client_content(
                turns=[types.Content(
                    role="user",
                    parts=[types.Part(text=f"Read this text: {text}")]
                )],
                turn_complete=True
            )

            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            if autoplay:
                display(Audio(full_audio, rate=sample_rate, autoplay=True))

            return True
        else:
            print("❌ No audio generated")
            return False

    except Exception as e:
        print(f"❌ Speech generation error: {e}")
        return False


try:
    # 1. Capture Audio
    print(f"🎤 Starting {duration} second recording...")
    audio_path = record_audio(sec=duration)

    if audio_path:
        # 2. Transcribe using Native Audio Reasoning
        transcription = transcribe_audio(audio_path)

        print("\n" + "="*30)
        print(f"📜 Captured: {transcription}")
        print("="*30 + "\n")

        # 3. Speak back using the relaxed prompt
        print("🔊 Reading back transcript...")
        await speak_verbatim(transcription, voice=voice)

    else:
        print("❌ Recording failed. Please check mic permissions.")

except Exception as e:
    print(f"⚠️ Error in the process: {e}")


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image14F.png)

## **Part 3: Chatbots**

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image10A.png)

The history of **chatbots** is a fascinating journey through the evolution of artificial intelligence and human-computer interaction. Here's a brief overview:

* **1. The Early Days (1950s-1970s)**
  * 1950 - Alan Turing's "Imitation Game": Turing proposed a test (now known as the Turing Test) to determine if a machine could exhibit intelligent behavior indistinguishable from a human.
  * 1966 - ELIZA: Created by Joseph Weizenbaum at MIT, ELIZA was the first chatbot. It mimicked a Rogerian psychotherapist by rephrasing user input into questions. It was simple but groundbreaking.
  * 1972 - PARRY: Developed by Kenneth Colby, PARRY simulated a person with paranoid schizophrenia. It was more complex than ELIZA and could hold more realistic conversations.
* **2. Rule-Based Systems (1980s-1990s)**
  * Chatbots during this era used hand-coded rules and decision trees.
  * They were mostly used in academic research, customer service, and early virtual assistants.
  * Examples include Jabberwacky (late 1980s), which aimed to simulate natural human chat through learning.
* **3. Rise of the Internet and AI (2000s)**
  * SmarterChild (2001): A popular chatbot on AOL Instant Messenger and MSN Messenger. It could answer questions, play games, and chat casually.
  * ALICE (Artificial Linguistic Internet Computer Entity): Created by Richard Wallace, it won the Loebner Prize (a Turing Test competition) multiple times.
* **4. Machine Learning and NLP Boom (2010s)**
  * 2011 - Siri: Apple introduced Siri, a voice-activated assistant that brought chatbots into the mainstream.
  * 2014 - Alexa and Cortana: Amazon and Microsoft launched their own virtual assistants.
  * 2016 - Facebook Messenger Bots: Facebook opened its platform to developers, leading to a surge in chatbot development for businesses.
* **5. Neural Networks and Transformers (Late 2010s-2020s)**
  * 2018 - BERT (Google) and GPT (OpenAI): These transformer-based models revolutionized natural language understanding and generation.
  * 2020 - GPT-3: A massive leap in chatbot capabilities, enabling more coherent, context-aware, and human-like conversations.
  * 2022 - ChatGPT: OpenAI released ChatGPT based on GPT-3.5 and later GPT-4, making advanced conversational AI widely accessible.
* **6. The Present and Future (2020s-Today)**
  * Chatbots are now integrated into education, healthcare, customer service, entertainment, and more.
  * Multimodal models (like GPT-4 and beyond) can understand text, images, and even audio.
  * The focus is shifting toward personalization, emotional intelligence, and ethical AI.

### Example 5: Communicate with Gemini Voice Agent

In this example, the conversation with our `Chatbot` continues until the user asks to end it.

For Example 5, ask the LLM **"What is the capital of France?"** and wait for the LLM to stop processing your input. Then tell the LLM **"bye"** to end your conversation.
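
Before running the full voice agent, it may help to see the "keep talking until the user says bye" logic on its own. The sketch below is a text-only stand-in: it loops over a list of typed utterances instead of recorded audio, and a stub function takes the place of the LLM. The names `EXIT_WORDS`, `is_exit_request()` and `run_conversation()` are assumptions made for illustration and are not the functions used in the cell that follows.

```python
import re

# Assumed set of words that end the conversation (illustrative only)
EXIT_WORDS = {"bye", "goodbye", "exit", "quit"}


def is_exit_request(utterance):
    """Return True if any word in the utterance is an exit word."""
    # Lowercase and strip punctuation so "Bye!" matches "bye"
    cleaned = re.sub(r"[^a-z\s]", "", utterance.lower())
    return any(word in EXIT_WORDS for word in cleaned.split())


def run_conversation(utterances, reply_fn):
    """Reply to each utterance in turn until an exit word is heard."""
    transcript = []
    for utterance in utterances:
        if is_exit_request(utterance):
            transcript.append(("assistant", "Goodbye!"))
            break
        transcript.append(("user", utterance))
        transcript.append(("assistant", reply_fn(utterance)))
    return transcript


# A stub reply function stands in for the real LLM call
demo = run_conversation(
    ["What is the capital of France?", "Bye!"],
    lambda question: f"(stub reply to: {question})",
)
for speaker, line in demo:
    print(f"{speaker}: {line}")
```

Note that this simple check ends the conversation whenever an exit word appears anywhere in the sentence, so a phrase such as "I said bye to him" would also stop the loop. A real agent may need a stricter rule.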

In [None]:
# ============================================================================
# COMPLETE WAKE WORD CHATBOT - FINAL WORKING VERSION
# ============================================================================
# Fixed normalization to preserve spaces for proper matching
# ============================================================================

import os
import base64
import struct
import asyncio
import subprocess
import nest_asyncio
import time
import re
from IPython.display import Javascript, display, Audio
from google.colab import output, userdata
from google import genai
from google.genai import types
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage, SystemMessage

# Enable nested event loops for Colab
nest_asyncio.apply()

# Initialize Gemini client once
API_KEY = userdata.get('GEMINI_API_KEY')
os.environ["GOOGLE_API_KEY"] = API_KEY
client = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})

# ============================================================================
# NOISE DETECTION
# ============================================================================

def is_likely_noise(transcription):
    """Determines if transcription is likely just noise/background sounds."""
    text_lower = transcription.lower()

    noise_indicators = [
        'does not contain any speech', 'no speech', 'not contain speech',
        'no audible speech', 'inaudible', 'unintelligible', 'silence', 'quiet',
        'click', 'tap', 'rustle', 'swipe', 'knock',
        'background noise', 'static', 'hiss', 'hum',
        'faint sound', 'subtle sound', 'brief sound',
        'sounds present', 'sound of', 'music', 'melody', 'instrumental',
        '[', ']', '(', ')',
    ]

    for indicator in noise_indicators:
        if indicator in text_lower:
            return True

    if len(transcription.strip()) < 5:
        return True

    words = re.findall(r'\w+', transcription)
    if len(words) == 0:
        return True

    meta_patterns = [
        r'the (audio|sound|recording) (contains|is|has)',
        r'this (audio|sound|recording) (contains|is|has)',
        r'(faint|subtle|brief|quick) (tap|click|rustle|sound)',
    ]

    for pattern in meta_patterns:
        if re.search(pattern, text_lower):
            return True

    return False


def normalize_for_matching(text):
    """
    Normalize text for wake word matching.
    KEY FIX: Keep spaces between words!
    """
    # Remove punctuation BUT keep spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation, keep letters/numbers/spaces

    # Convert to lowercase
    text = text.lower()

    # Normalize word variations
    text = text.replace('okay', 'ok')
    text = text.replace('ok ay', 'ok')

    # Normalize multiple spaces to single space
    text = re.sub(r'\s+', ' ', text)

    # Strip leading/trailing spaces
    text = text.strip()

    return text

# ============================================================================
# AUDIO RECORDING
# ============================================================================

def record_audio(sec=5, filename='recorded_audio.wav', convert_to_wav=True):
    """Records audio from browser microphone."""
    complete_js = f"""
    (async function() {{
      const sleep = time => new Promise(resolve => setTimeout(resolve, time))

      const b2text = blob => new Promise(resolve => {{
        const reader = new FileReader()
        reader.onloadend = e => resolve(e.target.result)
        reader.readAsDataURL(blob)
      }})

      try {{
        const stream = await navigator.mediaDevices.getUserMedia({{ audio: true }})
        const recorder = new MediaRecorder(stream)
        const chunks = []

        recorder.ondataavailable = e => {{
          if (e.data.size > 0) chunks.push(e.data)
        }}

        const recordingPromise = new Promise((resolve, reject) => {{
          recorder.onstop = async () => {{
            stream.getTracks().forEach(track => track.stop())
            if (chunks.length === 0) {{
              reject('No audio data recorded')
              return
            }}
            const blob = new Blob(chunks, {{ type: 'audio/webm;codecs=opus' }})
            if (blob.size === 0) {{
              reject('Empty audio blob')
              return
            }}
            const text = await b2text(blob)
            resolve(text)
          }}
          recorder.onerror = e => reject('Recording error: ' + e.error)
        }})

        recorder.start()
        await sleep({sec * 1000})
        recorder.stop()

        return await recordingPromise

      }} catch (error) {{
        return 'ERROR: ' + error.message
      }}
    }})()
    """

    try:
        s = output.eval_js(complete_js)
        if not s or s.startswith('ERROR:') or ',' not in s:
            return None

        binary = base64.b64decode(s.split(',')[1])
        if len(binary) == 0:
            return None

        if convert_to_wav:
            temp_webm = "temp_recording.webm"
            with open(temp_webm, 'wb') as f:
                f.write(binary)

            subprocess.run([
                'ffmpeg', '-i', temp_webm, '-ar', '16000',
                '-ac', '1', '-y', filename
            ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

            if os.path.exists(temp_webm):
                os.remove(temp_webm)

            return filename if os.path.exists(filename) else None
        else:
            with open(filename, 'wb') as f:
                f.write(binary)
            return filename

    except Exception as e:
        return None

# ============================================================================
# TRANSCRIPTION
# ============================================================================

def transcribe_audio(filename, language=None, prompt=None, model="gemini-2.5-flash"):
    """Transcribes audio using Gemini's native audio understanding."""
    if not os.path.exists(filename):
        return f"‚ùå File not found: {filename}"

    try:
        with open(filename, "rb") as f:
            audio_bytes = f.read()

        instruction = "Transcribe this audio accurately."
        if language:
            instruction += f" The audio is in {language}."
        if prompt:
            instruction += f" {prompt}"

        response = client.models.generate_content(
            model=model,
            contents=[
                instruction,
                types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav")
            ]
        )

        return response.text

    except Exception as e:
        return f"‚ùå Transcription error: {str(e)}"

# ============================================================================
# TEXT-TO-SPEECH
# ============================================================================

async def speak(text, voice="Fenrir", autoplay=True, save_to=None):
    """Generates natural speech using Gemini 2.0 Live API."""
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            )
        )
    )

    audio_chunks = bytearray()

    try:
        async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
            await session.send_client_content(
                turns=[types.Content(role="user", parts=[types.Part(text=text)])],
                turn_complete=True
            )

            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            if save_to:
                with open(save_to, "wb") as f:
                    f.write(full_audio)

            if autoplay:
                display(Audio(full_audio, rate=sample_rate, autoplay=True))

            return save_to

        return None

    except Exception as e:
        print(f"❌ Speech generation error: {e}")
        return None

# ============================================================================
# WAKE WORD DETECTION - FIXED MATCHING
# ============================================================================

async def listen_for_wake_word(
    wake_words=["hey gemini", "ok gemini"],
    listen_duration=3,
    voice="Kore"
):
    """Continuously listens for a wake word with proper matching."""
    print("=" * 60)
    print("🎧 WAKE WORD DETECTION ACTIVE")
    # Build the quoted list first: reusing the same quote type inside an
    # f-string expression is a SyntaxError before Python 3.12
    quoted_wake_words = " or ".join(f'"{w}"' for w in wake_words)
    print(f"📢 Say: {quoted_wake_words}")
    print("=" * 60)

    print("\n🎤 Listening...", end='', flush=True)

    # Normalize wake words once
    normalized_wake_words = [normalize_for_matching(w) for w in wake_words]

    # Debug output
    print(f"\n[DEBUG] Wake words: {wake_words}")
    print(f"[DEBUG] Normalized: {normalized_wake_words}")
    print("🎤 Listening...", end='', flush=True)

    while True:
        filename = record_audio(sec=listen_duration, convert_to_wav=True)

        if not filename:
            print(".", end='', flush=True)
            await asyncio.sleep(1)
            continue

        transcription = transcribe_audio(filename)

        if transcription.startswith("‚ùå"):
            print(".", end='', flush=True)
            await asyncio.sleep(0.5)
            continue

        # Check if it's noise FIRST
        if is_likely_noise(transcription):
            print(".", end='', flush=True)
            await asyncio.sleep(0.3)
            continue

        # Normalize the transcription
        normalized_transcription = normalize_for_matching(transcription)

        # Show what was heard
        print(f"\n📝 Heard: '{transcription}'")
        print(f"   Normalized: '{normalized_transcription}'")

        # Check for wake word with normalized matching
        wake_word_detected = False
        matched_wake_word = None

        for i, norm_wake_word in enumerate(normalized_wake_words):
            # Check if wake word is in the transcription
            if norm_wake_word in normalized_transcription:
                wake_word_detected = True
                matched_wake_word = wake_words[i]
                print(f"   ✅ MATCH: '{norm_wake_word}' found in '{normalized_transcription}'")
                break

        if wake_word_detected:
            print("✅ WAKE WORD DETECTED!\n")
            await speak("Yes, how can I help you?", voice=voice, autoplay=True)
            await asyncio.sleep(1)
            return True
        else:
            print("   ❌ Not a wake word, keep listening...")
            print("🎤 Listening...", end='', flush=True)
            await asyncio.sleep(0.3)

# ============================================================================
# CONVERSATION SESSION
# ============================================================================

async def run_conversation_session(loop_duration=5, voice="Kore", pause_before_listen=2):
    """Runs conversation session with noise filtering."""
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        temperature=0.6
    )
    chat_history = [
        SystemMessage(content="You are a helpful voice assistant. Keep responses brief and conversational.")
    ]

    print("=" * 60)
    print("💬 CONVERSATION MODE ACTIVE")
    print("=" * 60)

    silence_count = 0
    max_silence = 3
    turn_number = 0

    while True:
        try:
            turn_number += 1

            if turn_number > 1:
                print(f"\n⏳ Waiting {pause_before_listen}s...")
                await asyncio.sleep(pause_before_listen)

            print(f"\n🎤 Turn {turn_number}: Listening... ({loop_duration}s)")
            filename = record_audio(sec=loop_duration)

            if not filename:
                print("❌ Recording failed")
                break

            user_text = transcribe_audio(filename)

            if user_text.startswith("‚ùå"):
                silence_count += 1
                print(f"⚠️ Transcription failed ({silence_count}/{max_silence})")
                if silence_count >= max_silence:
                    await speak("I'm having trouble hearing you. Goodbye!", voice=voice)
                    break
                continue

            # Check if it's noise
            if is_likely_noise(user_text):
                silence_count += 1
                print(f"🔇 [background noise] ({silence_count}/{max_silence})")
                if silence_count >= max_silence:
                    await speak("I haven't heard any speech. Goodbye!", voice=voice)
                    break
                continue

            user_text_cleaned = user_text.strip()

            if len(user_text_cleaned) < 3:
                silence_count += 1
                print(f"🔇 [inaudible] ({silence_count}/{max_silence})")
                if silence_count >= max_silence:
                    await speak("I haven't heard anything. Goodbye!", voice=voice)
                    break
                continue

            silence_count = 0
            print(f"👤 You: {user_text_cleaned}")

            # Match exit words as whole words so that "quite" or "stopped" do not end the session
            if re.search(r'\b(bye|goodbye|exit|quit|stop)\b', user_text_cleaned.lower()):
                print("👋 Ending conversation")
                await speak("Goodbye!", voice=voice)
                break

            chat_history.append(HumanMessage(content=user_text_cleaned))
            response = llm.invoke(chat_history)

            print(f"🤖 AI: {response.content}")

            await speak(response.content, voice=voice, autoplay=True)

            chat_history.append(response)

        except KeyboardInterrupt:
            print("\n⚠️ Stopped by user")
            break
        except Exception as e:
            print(f"❌ Error: {e}")
            break

    print("\n" + "=" * 60)
    print("💬 CONVERSATION ENDED")
    print("=" * 60)

# ============================================================================
# MAIN WAKE WORD ASSISTANT
# ============================================================================

async def start_wake_word_assistant(
    wake_words=["hey gemini", "ok gemini"],
    voice="Kore",
    wake_duration=3,
    conversation_duration=5,
    pause_between=2,
    return_to_wake=True
):
    """Main function to start the wake word assistant."""

    while True:
        wake_detected = await listen_for_wake_word(
            wake_words=wake_words,
            listen_duration=wake_duration,
            voice=voice
        )

        if not wake_detected:
            break

        await run_conversation_session(
            loop_duration=conversation_duration,
            voice=voice,
            pause_before_listen=pause_between
        )

        if not return_to_wake:
            print("\n👋 Exiting chatbot...")
            break
        else:
            print("\n🔄 Returning to wake word mode...\n")
            await asyncio.sleep(1)
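
The wake-word check above depends on `normalize_for_matching()` turning a transcription such as "Okay, Gemini." into the same form as the configured wake words. The following is a minimal, standalone sketch of that idea. It repeats the normalization steps from the cell above in condensed form and adds a hypothetical helper, `find_wake_word()`, which is not part of the notebook's code.

```python
import re

def normalize_for_matching(text):
    """Strip punctuation, lowercase, and collapse whitespace (same steps as above)."""
    text = re.sub(r'[^\w\s]', '', text).lower()
    text = text.replace('okay', 'ok')
    return re.sub(r'\s+', ' ', text).strip()

def find_wake_word(transcription, wake_words):
    """Hypothetical helper: return the first wake word found, or None."""
    heard = normalize_for_matching(transcription)
    for wake_word in wake_words:
        if normalize_for_matching(wake_word) in heard:
            return wake_word
    return None

WAKE_WORDS = ["hey gemini", "ok gemini"]
print(find_wake_word("Okay, Gemini.", WAKE_WORDS))     # ok gemini
print(find_wake_word("What time is it?", WAKE_WORDS))  # None
```

Because the match is a substring test on normalized text, punctuation and capitalization in the transcription do not matter, but the words themselves must be transcribed correctly.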


In [44]:
# ============================================================================
# Example 5: Communicate with Gemini Voice Agent
# ============================================================================

from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata
import sys

# ============================================================================
# CONFIGURATION
# ============================================================================

# Wake Word Settings
WAKE_WORDS = ["hey gemini", "ok gemini"]
WAKE_DURATION = 3  # How long to listen for wake word

# Conversation Settings
LOOP_DURATION = 6  # How long to listen during conversation
PAUSE_BEFORE_LISTEN = 3  # Pause between conversation turns

# Voice (Options: Puck, Charon, Kore, Fenrir, Aoede)
VOICE = "Kore"

# Model
MODEL = 'gemini-2.5-flash'  # Gemini model used for low-latency chat

# Behavior
RETURN_TO_WAKE = False  # Set to False to exit on "bye", True to return to wake word mode

# ============================================================================
# VERIFY CONNECTION
# ============================================================================

GEMINI_KEY = userdata.get('GEMINI_API_KEY')

try:
    llm_check = ChatGoogleGenerativeAI(
        model=MODEL,
        temperature=0.3,
        google_api_key=GEMINI_KEY
    )
    print(f"✅ Connection to {MODEL} verified.")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    sys.exit(1)

# ============================================================================
# START VOICE AGENT
# ============================================================================

# Make sure the helper functions from the previous cell are defined
if 'start_wake_word_assistant' in globals():
    print("="*60)
    print("🎤 STARTING GEMINI VOICE AGENT")
    print("="*60)
    print(f"Voice: {VOICE}")
    print(f"Wake words: {', '.join(WAKE_WORDS)}")
    print(f"Recording duration: {LOOP_DURATION}s")
    print(f"Pause between turns: {PAUSE_BEFORE_LISTEN}s")
    print(f"Mode: {'Exit on bye' if not RETURN_TO_WAKE else 'Return to wake word on bye'}")
    print("="*60)
    print("\n📢 Say one of the wake words to activate!")
    print("💬 Say 'bye' to end\n")

    # Start the wake word assistant
    await start_wake_word_assistant(
        wake_words=WAKE_WORDS,
        voice=VOICE,
        wake_duration=WAKE_DURATION,
        conversation_duration=LOOP_DURATION,
        pause_between=PAUSE_BEFORE_LISTEN,
        return_to_wake=RETURN_TO_WAKE
    )

    print("\n✅ Voice agent stopped.")

else:
    print("⚠️ Error: Please run the code block with wake word functions first.")
    print("💡 The functions are defined in the first code cell of Example 5.")


# ============================================================================
# ALTERNATIVE: Direct conversation without wake word
# ============================================================================
# To skip wake word detection and start the conversation immediately,
# uncomment the lines below:
#
# if 'run_conversation_session' in globals():
#     print(f"Starting Voice Conversation with {VOICE} voice... (Say 'bye' to exit)")
#     print(f"⏱️  Recording: {LOOP_DURATION}s | Pause between questions: {PAUSE_BEFORE_LISTEN}s")
#
#     await run_conversation_session(
#         loop_duration=LOOP_DURATION,
#         voice=VOICE,
#         pause_before_listen=PAUSE_BEFORE_LISTEN
#     )
#
#     print("\n✅ Conversation ended.")
# else:
#     print("⚠️ Error: Please run the code block with functions first.")


✅ Connection to gemini-2.5-flash verified.
🎤 STARTING GEMINI VOICE AGENT
Voice: Kore
Wake words: hey gemini, ok gemini
Recording duration: 6s
Pause between turns: 3s
Mode: Exit on bye

📢 Say one of the wake words to activate!
💬 Say 'bye' to end

🎧 WAKE WORD DETECTION ACTIVE
📢 Say: "hey gemini" or "ok gemini"

🎤 Listening...
[DEBUG] Wake words: ['hey gemini', 'ok gemini']
[DEBUG] Normalized: ['hey gemini', 'ok gemini']
🎤 Listening...
📝 Heard: 'Okay, Gemini.'
   Normalized: 'ok gemini'
   ✅ MATCH: 'ok gemini' found in 'ok gemini'
✅ WAKE WORD DETECTED!



💬 CONVERSATION MODE ACTIVE

🎤 Turn 1: Listening... (6s)
🔇 [background noise] (1/3)

⏳ Waiting 3s...

🎤 Turn 2: Listening... (6s)
👤 You: Okay, Gemini, what's the capital of France?
🤖 AI: Paris.



⏳ Waiting 3s...

🎤 Turn 3: Listening... (6s)
👤 You: Hey. Ah, not chi-chi-wei-juh. Okay, gem.
🤖 AI: Hey there.



⏳ Waiting 3s...

🎤 Turn 4: Listening... (6s)
👤 You: Okay, Gemini. Bye.
👋 Ending conversation



💬 CONVERSATION ENDED

👋 Exiting chatbot...

✅ Voice agent stopped.



If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10E.png)

However, in this example, there was no human input.

### **Exercise 5: Conversation with Chatbot**

In the cell below write the code to start a new conversation with the `Chatbot`. Ask your `Chatbot` for **answers to 5 different questions** of your own choosing. After the 5th question has been answered, terminate your conversation by saying the word **"bye"**.

In [None]:
# Insert your code for Exercise 5 here

from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata
import sys

# Define Loop Duration
LOOP_DURATION = 4

# 1. Setup Configuration
GEMINI_KEY = userdata.get('GEMINI_API_KEY')
MODEL = 'gemini-2.5-flash'  # Gemini model used for low-latency chat

# 2. Verify Connection (Optional)
# We initialize the LLM here just to verify the key works before starting the loop
try:
    llm_check = ChatGoogleGenerativeAI(
        model=MODEL,
        temperature=0.3,
        google_api_key=GEMINI_KEY
    )
    print(f"✅ Connection to {MODEL} verified.")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    sys.exit(1)

# 3. Start the Conversation Loop
# IMPORTANT: Since run_conversation_session is async, we use 'await'
if 'run_conversation_session' in globals():
    print("Starting Voice Agent... (Say 'bye' to exit)")
    await run_conversation_session(loop_duration=LOOP_DURATION)
else:
    print("⚠️ Error: Please run the code cell in Example 5 that defines the voice agent functions first.")

Your output will depend on your 5 different questions.
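
Because a voice session is hard to reproduce exactly, it can help to rehearse the turn-taking logic with typed text first. The sketch below is an illustration only: `fake_llm()` and `run_text_session()` are hypothetical stand-ins (no microphone and no Gemini call) that show how the history grows by one question and one answer per turn until the user says "bye".

```python
import re

# Exit words are matched as whole words, so "quite" does not count as "quit"
EXIT_PATTERN = re.compile(r"\b(bye|goodbye|exit|quit|stop)\b")

def fake_llm(history):
    """Hypothetical stand-in for the real model: echoes the last question."""
    return f"(answer to: {history[-1][1]})"

def run_text_session(user_turns):
    """Process scripted user turns and stop at the first exit word."""
    history = [("system", "You are a helpful voice assistant.")]
    for text in user_turns:
        if EXIT_PATTERN.search(text.lower()):
            break
        history.append(("user", text))
        history.append(("assistant", fake_llm(history)))
    return history

questions = ["What is DNA?", "What is RNA?", "What is a gene?",
             "What is a codon?", "What is a protein?", "Bye"]
history = run_text_session(questions)
answered = sum(1 for role, _ in history if role == "assistant")
print(f"Questions answered: {answered}")  # Questions answered: 5
```

The same pattern applies to the voice version: each turn appends the user's text and the model's reply to the history, and the loop ends as soon as an exit word is heard.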

### Functions for Creating a Smart Assistant

Run the code in the next cell to create several functions needed to create an emulation of a smart assistant.

In [None]:
# Create Functions for Smart Assistant Emulation

import warnings
warnings.filterwarnings("ignore", category=SyntaxWarning)

import os
import asyncio
import struct
import base64
import nest_asyncio
import time
from google.colab import userdata, output
from IPython.display import display, Javascript, Audio

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage, SystemMessage
from google import genai
from google.genai import types
from pydub import AudioSegment

nest_asyncio.apply()

# --- CONFIGURATION ---
try:
    API_KEY = userdata.get('GEMINI_API_KEY')
    os.environ['GOOGLE_API_KEY'] = API_KEY
    client = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1alpha'})
except Exception as e:
    print(f"⚠️ API Key Error: {e}")

# --- 1. RECORDING FUNCTION ---
RECORD_JS = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.target.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  try {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
      const recorder = new MediaRecorder(stream)
      const chunks = []
      recorder.ondataavailable = e => chunks.push(e.data)
      recorder.start()
      await sleep(time)
      recorder.onstop = async ()=>{
        blob = new Blob(chunks, { type: 'audio/webm' })
        text = await b2text(blob)
        resolve(text)
      }
      recorder.stop()
  } catch(e) { resolve(null) }
})
"""

def record_audio(duration=2):
    print(f".", end="")
    display(Javascript(RECORD_JS))
    try:
        s = output.eval_js('record(%d)' % (duration * 1000))
        if not s: return None
        binary = base64.b64decode(s.split(',')[1])
        with open("temp_nest.webm", "wb") as f: f.write(binary)
        AudioSegment.from_file("temp_nest.webm").export("nest_input.wav", format="wav")
        return "nest_input.wav"

    # Catch interrupt inside the recorder to stop immediately
    except Exception as e:
        if "255" in str(e): raise asyncio.CancelledError
        return None

# --- 2. TRANSCRIPTION ---
def transcribe_audio(filename, language_hint="en"):
    if not os.path.exists(filename): return ""
    with open(filename, "rb") as f: audio_bytes = f.read()

    try:
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[
                f"Transcribe verbatim in language '{language_hint}'. If only background noise, return 'SILENCE'.",
                types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav")
            ]
        )
        return response.text.strip().lower()
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return "silence"

# --- 3. VOICE GENERATION ---
async def speak_response(text, voice_name="Kore"):
    print(f"\n🔵 Google ({voice_name}): {text}")
    model_id = "gemini-2.0-flash-exp"
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice_name)
            )
        )
    )
    audio_data = bytearray()
    try:
        async with client.aio.live.connect(model=model_id, config=config) as session:
            await session.send_client_content(
                turns=[types.Content(role="user", parts=[types.Part(text=text)])],
                turn_complete=True
            )
            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data: audio_data.extend(part.inline_data.data)
                if response.server_content and response.server_content.turn_complete: break

        if audio_data:
            sample_rate = 24000
            header = struct.pack('<4sI4s4sIHHIIHH4sI', b'RIFF', 36+len(audio_data), b'WAVE', b'fmt ', 16, 1, 1, sample_rate, sample_rate*2, 2, 16, b'data', len(audio_data))
            display(Audio(header + audio_data, rate=sample_rate, autoplay=True))
            await asyncio.sleep(len(audio_data) / (sample_rate * 2) + 0.5)
    except Exception:
        pass

# --- 4. MAIN LOOP (CLEAN EXIT) ---
async def start_nest_mini(language='en'):
    if os.path.exists("STOP"): os.remove("STOP")

    language_config = {
        'en': { 'voice': 'Kore', 'wake_words': ["hey google", "ok google", "hi google"], 'ack': "I'm listening.", 'prompt': "You are Google Assistant. Speak English." },
        'es': { 'voice': 'Puck', 'wake_words': ["ok google", "oye google"], 'ack': "Te escucho.", 'prompt': "Eres el Asistente de Google. Habla en espa√±ol." },
        'fr': { 'voice': 'Charon', 'wake_words': ["ok google", "dis google"], 'ack': "Je vous √©coute.", 'prompt': "Vous √™tes l'Assistant Google. Parlez en fran√ßais." },
        'de': { 'voice': 'Fenrir', 'wake_words': ["ok google", "hallo google"], 'ack': "Ich h√∂re zu.", 'prompt': "Du bist der Google Assistant. Sprich auf Deutsch." },
        'jp': { 'voice': 'Aoede', 'wake_words': ["ok google", "ne google"], 'ack': "Hai.", 'prompt': "„ÅÇ„Å™„Åü„ÅØGoogle„Ç¢„Ç∑„Çπ„Çø„É≥„Éà„Åß„Åô„ÄÇÊó•Êú¨Ë™û„ÅßË©±„Åó„Å¶„Åè„Å†„Åï„ÅÑ„ÄÇ" }
    }
    config = language_config.get(language, language_config['en'])
    llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.5)

    print(f"--- NEST MINI SIMULATOR ({language.upper()}) ---")
    print(f"Waiting for '{config['wake_words'][0]}'...")
    print("\n🛑 TO STOP: Press 'Ctrl+M I' or Runtime --> Interrupt execution.")

    try:
        while True:
            # SAFETY VALVE
            if os.path.exists("STOP"):
                print("\n\n🛑 STOP file detected.")
                os.remove("STOP")
                break

            # Allow asyncio to breathe and check for interrupts
            await asyncio.sleep(0.1)

            # 1. Passive Listen
            filename = record_audio(duration=2.5)
            if not filename: continue

            text = transcribe_audio(filename, language_hint=language)
            is_wake = any(phrase in text for phrase in config['wake_words'])

            if is_wake:
                print(f"\n✨ WAKE: '{text}'")
                await speak_response(config['ack'], voice_name=config['voice'])

                # 2. Active Listen
                print("listening...")
                cmd_filename = record_audio(duration=5)
                # record_audio() returns None on failure; skip transcription in that case
                command = transcribe_audio(cmd_filename, language_hint=language) if cmd_filename else ""

                if command and command != "silence":
                    print(f"User: {command}")
                    response = llm.invoke([
                        SystemMessage(content=config['prompt']),
                        HumanMessage(content=command)
                    ])
                    await speak_response(response.content, voice_name=config['voice'])

                print("\nState: [PASSIVE] Waiting...")

    # --- CLEAN ERROR HANDLING ---
    # Catch both standard Interrupt and Async Cancellation
    except (KeyboardInterrupt, asyncio.CancelledError):
        print("\n\n🔌 Simulation Stopped (User Interrupt).")
    except Exception as e:
        print(f"\n❌ Error: {e}")

# Example Usage:
# await start_nest_mini(language='en')

### Example 6: Create a `Google Nest`

For Example 6 we are going to create an emulation of a Google `Nest Mini`.

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image15E.png)

### **Smart assistants have a "wake word"**

Smart assistants can be programmed to work in a variety of languages. This table lists the languages used in this lesson, their abbreviations, and the wake words for each.

| Language | Abbreviation | Wake Words |
| :--- | :--- | :--- |
| **English** | `en` | "hey google", "ok google", "hi google" |
| **Spanish** | `es` | "ok google", "oye google" |
| **French** | `fr` | "ok google", "dis google" |
| **German** | `de` | "ok google", "hallo google" |
| **Japanese** | `jp` (or `ja`) | "ok google", "ne google" |

#### **What is a "Wake Word"?**

A **Wake Word** (or "Hotword") is a specific phrase that activates a voice assistant from a dormant, power-saving state into an active, listening state.

### **How it Works**
Voice assistants like the Google Nest Mini operate in two distinct modes to protect privacy and conserve resources:

1.  **Passive Listening (On-Device):**
    * The device continuously records short loops of audio (usually a few seconds).
    * It analyzes this audio locally on a specialized low-power chip.
    * It is looking **only** for the specific acoustic signature of the wake word (e.g., *"Hey Google"*).
    * If the wake word is *not* detected, the audio is discarded immediately and never leaves the device.

2.  **Active Listening (Cloud Processing):**
    * Once the wake word is detected, the device "wakes up" (often indicated by LEDs lighting up or a "blip" sound).
    * It begins recording your actual command (e.g., *"What is the weather?"*).
    * This command is then sent to the cloud (Google's servers) for advanced processing and response generation.

##### **In Our Simulator**
Our Python code mimics this behavior using a `while` loop:
* **State 1 (Passive):** It records 2.5-second chunks and checks *only* whether the transcribed text contains one of the wake words for the selected language (e.g., "hey google").
* **State 2 (Active):** If detected, it switches to a longer recording mode to capture your full request, sends it to the LLM, and then speaks the response.
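
The passive-state check can be sketched as a small helper. The function name `contains_wake_word` is hypothetical and is not part of the simulator, which performs this check inline; lowercasing the text is an assumption about how the transcription is returned.

```python
def contains_wake_word(text, wake_words):
    """Return True if any wake phrase appears in the transcribed text."""
    if not text:
        # Nothing was transcribed, so there is nothing to match
        return False
    normalized = text.lower()
    return any(phrase in normalized for phrase in wake_words)


# Wake words for English, taken from the table above
wake_words = ["hey google", "ok google", "hi google"]

print(contains_wake_word("Hey Google, what time is it?", wake_words))  # True
print(contains_wake_word("what time is it", wake_words))               # False
```

Only when this check returns `True` does the loop move on to the longer, active recording.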

Run the next cell to start our Google `Nest Mini`. The `Nest Mini` will listen in English ('en'). When you are done, press `Ctrl` + `M` then `I`, or select **Runtime --> Interrupt Execution** from the Colab menu bar.

In [None]:
# Example 6: Start Nest Mini

# Start in English
await start_nest_mini(language='en')

The output you should see depends upon your 5 questions.


# **Lesson Turn-in**
When you have completed and run all of the code cells, generate a PDF of your Colab notebook. If you are running Windows 10 or 11, use `File --> Print.. --> Microsoft Print to PDF`. If you have a Mac, use `File --> Print.. --> Save as PDF`.

In either case, save your PDF as `Copy of Class_04_2.lastname.pdf`, where `lastname` is your last name, and upload the file to Canvas.

**NOTE TO WINDOWS USERS:** Your grade will be **reduced by 10% if your PDF is missing pages** when it is graded in Canvas and the grader has to take the additional steps of downloading your PDF, printing it with Microsoft Print to PDF, and resubmitting it to Canvas for grading.

## **Lizard Tail**


## **Attention Is All You Need**

![__](https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png)

**"Attention Is All You Need"** is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

The paper's title is a reference to the song "All You Need Is Love" by the Beatles. The name "Transformer" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word.

An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.

Some early examples that the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These convinced the team that the Transformer is a general purpose language model, and not just good for translation.

As of 2024, the paper has been cited more than 140,000 times.

**Authors**

The authors of the paper are: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. A *Wired* article highlights the group's diversity:

> Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.

**Methods Discussed & Introduced**

The paper is best known for introducing the Transformer architecture, which underlies most modern Large Language Models (LLMs). A key reason most modern LLMs prefer this architecture over its predecessors is that it is parallelizable: the operations needed for training can be accelerated on a GPU, allowing both faster training and larger models.

The following mechanisms were introduced by the paper as part of the development of the transformer architecture.

**Scaled dot-product Attention & Self-attention**

Using scaled dot-product attention and self-attention instead of an RNN or LSTM (which rely on recurrence) allows for better performance, as described in the following paragraph.

Since the Query (Q), Key (K), and Value (V) matrices all come from the same source (i.e. the input sequence / context window), the model needs no recurrence at all, which makes the architecture parallelizable. This differs from the original form of the attention mechanism introduced in 2014. The paper also scales the dot products by dividing them by the square root of the key-vector dimension, which keeps the inputs to the softmax from growing too large.
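
Written as a formula, scaled dot-product attention as defined in the paper is shown below, where $d_k$ is the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$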

In the translation setting that the paper focused on, the encoder-decoder attention layers take their Query matrix from the decoder (the target-language side), while the Key and Value matrices come from the encoder output (the source-language side).

**Multi-head Attention**

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (limited typically by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.

By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.
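
In the paper's notation, with $h$ heads, per-head projection matrices $W_i^Q$, $W_i^K$, $W_i^V$, and an output projection $W^O$, this reads:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)
$$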

**Positional Encoding**

Since the Transformer has no recurrence and processes all tokens in parallel, it has no built-in notion of token order. To supply it, the paper used sine and cosine functions of different frequencies to encode each token's position, and added that encoding to the token's embedding.
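
The encoding defined in the paper, for position $pos$ and embedding dimension index $i$, with $d_{\text{model}}$ the embedding size, is:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$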

**Historical context**

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units. Neural networks using multiplicative units were later called sigma-pi networks or higher-order networks. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. One of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer.

#### **Transformer Architecture**

In the Transformer architecture, the self-attention mechanism processes an input sequence by creating a new representation for each token, enriched with context from all other tokens in the sequence. Unlike older recurrent neural networks (RNNs) that process words one by one, self-attention processes the entire sequence in parallel, making it highly efficient.
The core idea is for each token to "look" at all other tokens to determine their relevance and then use that information to create a more informed, context-aware representation of itself.

**The self-attention process**

For a given input sequence, such as "The animal didn't cross the street because it was too tired," the self-attention process happens in the following stages:

1. **Create Query, Key, and Value vectors:** For every token in the sequence (e.g., "it"), the model creates three distinct vectors:
    * **Query (Q):** Represents the current token, acting like a question used to find related tokens.
    * **Key (K):** Represents the token being looked at, acting like a label for its information.
    * **Value (V):** Contains the content or contextual information of the token.

2. **Calculate attention scores:** To determine how much focus "it" should place on other words, the model calculates a score for every other token in the sentence. This is done by taking the dot product of the current token's query vector with each of the other tokens' key vectors. A high dot-product score indicates a strong relationship between the two tokens.

3. **Scale the scores:** The scores are scaled by dividing them by the square root of the key vector's dimension. This prevents the scores from growing too large, which helps to stabilize training.

4. **Normalize with Softmax:** The scaled scores are passed through a softmax function, which converts them into a probability distribution. This ensures that all the attention weights sum up to 1, making them easier to interpret.

5. **Compute the weighted sum:** Each token's value vector is multiplied by its corresponding softmax score. The weighted value vectors are then summed to produce a new, context-rich output vector for the original token. In the sentence example, this process would give the word "it" a new representation that incorporates information from "animal," correctly linking the two words.
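
The five steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the helper names, the toy sizes, and the random weight matrices are all assumptions made for the example (in a real model the weights are learned during training).

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Convert one row of scores into weights that sum to 1."""
    m = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    # Step 1: project each token embedding into query, key, and value vectors
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    # Step 2: dot product of every query with every key
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    # Step 3: scale by the square root of the key dimension
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 4: softmax turns each row of scores into attention weights
    weights = [softmax(row) for row in scaled]
    # Step 5: weighted sum of the value vectors for each token
    output = matmul(weights, V)
    return output, weights


def random_matrix(rows, cols):
    """Toy stand-in for embeddings or learned weights (assumption)."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len, d_model, d_k = 4, 6, 3          # 4 tokens, toy dimensions
X = random_matrix(seq_len, d_model)      # one embedding row per token
W_q, W_k, W_v = (random_matrix(d_model, d_k) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(len(output), len(output[0]))       # 4 3  (one context vector per token)
print([round(sum(row), 6) for row in weights])  # every row of weights sums to 1.0
```

Each row of `weights` shows how much one token attends to every token in the sequence, and each row of `output` is that token's new context-aware representation.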

#### **Enhancing self-attention with multi-head attention**

The Transformer architecture takes this mechanism one step further by using **multi-head attention**.

* Instead of a single attention calculation, multi-head attention performs several self-attention calculations in parallel using different learned sets of Q, K, and V weight matrices.
* Each "head" learns to focus on different types of relationships. For example, one head might attend to grammatical connections, while another might focus on semantic meaning.
* The results from each head are then concatenated and passed through a final linear layer to produce the refined output. This gives the model a much richer, multi-contextual understanding of the input.
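
A minimal sketch of these three points in plain Python follows. The function names, the two-head setup, and the random weights are assumptions chosen for illustration; a trained Transformer learns these matrices.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, W_q, W_k, W_v):
    """One scaled dot-product self-attention head."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)


def multi_head_attention(X, heads, W_o):
    """Run every head, concatenate their outputs per token, then apply W_o."""
    head_outputs = [attention_head(X, *head_weights) for head_weights in heads]
    concatenated = [
        [value for head in head_outputs for value in head[t]]
        for t in range(len(X))
    ]
    return matmul(concatenated, W_o)


def random_matrix(rows, cols, rng):
    """Toy stand-in for embeddings or learned weights (assumption)."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads            # each head works in a smaller space

X = random_matrix(seq_len, d_model, rng)
# Each head gets its own (W_q, W_k, W_v) triple
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
W_o = random_matrix(num_heads * d_head, d_model, rng)  # final linear layer

output = multi_head_attention(X, heads, W_o)
print(len(output), len(output[0]))       # 5 8  (same shape as the input)
```

Because every head has its own projection matrices, the heads can weight the same tokens differently before their outputs are merged.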

#### **Preserving word order with positional encoding**

Because the self-attention mechanism processes all tokens in parallel, it inherently loses information about word order. To address this, the Transformer injects positional information into the input embeddings using positional encoding. This is typically done with sinusoidal functions that create a unique vector for each position in the sequence, which is then added to the token's embedding. This process allows the model to capture the sequence's structure without sacrificing parallel processing efficiency.
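
The sinusoidal scheme can be sketched in a few lines of Python. The function name `positional_encoding` and the toy sizes are assumptions for illustration; the constant 10000 and the sine/cosine pairing follow the original paper.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each pair of dimensions uses a different wavelength
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(seq_len=4, d_model=8)
print(pe[0][:4])  # [0.0, 1.0, 0.0, 1.0]  (position 0: sin(0)=0, cos(0)=1)

# Adding the encoding to (toy, all-zero) token embeddings
embeddings = [[0.0] * 8 for _ in range(4)]
with_position = [
    [e + p for e, p in zip(token, position)]
    for token, position in zip(embeddings, pe)
]
print(len(with_position), len(with_position[0]))  # 4 8
```

Every position receives a distinct vector, so two identical words at different positions end up with different inputs to the attention layers.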