<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_04_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

## **Module 4: Chatbots and Large Language Models**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 4 Material

* Part 4.1: Introduction to Large Language Models (LLMs)
* **Part 4.2: Chatbots**
* Part 4.3: Image Generation with StableDiffusion
* Part 4.4: Agentic AI

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

You should see the following output except your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.


### Test Your GEMINI_API_KEY

In order to run the code in this lesson you will need to have your secret `GEMINI_API_KEY` installed in your **Secrets** on this Colab notebook. Detailed steps for purchasing your `GEMINI_API_KEY` and installing it in your Colab notebook Secrets were provided in `Class_04_1`.

Run the code in the next cell to see if your `GEMINI_API_KEY` is installed correctly. You may have to Grant Access for your notebook to use your API key.

In [None]:
# Verify your API key setup

from google.colab import userdata
import os

# Check if API key is properly loaded
try:
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    print("API key loaded successfully!")
    print(f"Key length: {len(GEMINI_API_KEY)}")
except Exception as e:
    print(f"Error loading API key: {e}")
    print("Please set-up your GEMINI_API_KEY key in your Colab Secrets")

1. You may see this message when you run this cell:


![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image08C.png)

If you do see this popup just click on `Grant access`.


2. If your `GEMINI_API_KEY` is correctly installed you should see something _similar_ to the following output.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image09C.png)

3. However, if you see the following output

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image10C.png)

You will need to correct the error before you can continue. Ask your Instructor or TA for help if you cannot resolve the error yourself.

### Install `LangChain` packages

Run the code in the following cell to install the `langchain-google-genai` and related packages.

In [None]:
# Run these installations

!pip install -q langchain-core
!pip install -q pydub google-genai nest_asyncio langchain-community langchain-google-genai

You might not see any output, or you might see the following output:

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image07E.png)

If you see this error message, don't worry about it.
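
If you want to confirm that the packages were installed despite any messages from `pip`, you can ask Python which versions it sees. The cell below is a minimal sketch, assuming the distribution names used in the install cell above; the helper name `installed_version` is introduced here for illustration only.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Distribution names copied from the pip commands in the install cell
packages = ["langchain-core", "langchain-community",
            "langchain-google-genai", "google-genai", "pydub"]

for name in packages:
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```

A package reported as `NOT INSTALLED` suggests the install cell should be run again.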

### **YouTube Introduction to ChatBots**

Run the next cell to see a short introduction to ChatBots. This is a suggested, but optional, part of the lesson.

In [None]:
from IPython.display import HTML
video_id = 'gmUHEvrpYoU'

HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
  referrerpolicy="strict-origin-when-cross-origin"> </iframe>
""")

# **Introduction to Speech Processing with Gemini**

![___](https://biologicslab.co/BIO1173/images/class_04/CourseImage.gif)

In this lesson, we explore how to use both computer-generated voice and voice recognition to create a `ChatBot`. We'll be working with the **Google Gemini API** and the **LangChain Google integration** to achieve this. Specifically, we'll demonstrate how to input normal text and have it spoken by the computer, and conversely, how we can speak to the computer and have it respond. We'll ultimately integrate these functionalities to create a chatbot that handles both text-to-speech and speech-to-text interactions.

While we'll use Google `Colab` for this demonstration, in production environments, you'd likely use a mobile app or a web-based JavaScript solution, as each platform handles voice differently. We'll focus on keeping things generic and simple in Colab for now.

Voice applications are everywhere. For example, I can ask "`Alexa`, what time is it?" and multiple `Alexa` devices in my home will respond, although not always perfectly. I usually mute them during recording sessions. Applications like `Siri` or `Gemini` also offer voice interactions. For instance, you can now interact with the `Gemini` mobile app completely hands-free, or use the microphone input on the web interface.

To illustrate, I asked `Gemini`, "How are you doing?" and it responded by offering some insightful thoughts about the rapid evolution of multimodal AI. It highlighted that models like Gemini aren't just processing text anymore; they are natively designed to understand text, images, and **audio** simultaneously. It also suggested that students experiment with these new "multimodal" capabilities, as building hands-on projects is one of the best ways to understand the future of AI.

## **Part I: Speech to Text with Gemini**

Here we delve into the realm of speech-to-text technology, focusing on the powerful multimodal capabilities offered by **Google's Gemini models**. Speech-to-text, also known as automatic speech recognition (ASR), is a technology that converts spoken language into written text.

**Google's Gemini** models represent the cutting edge of this field. Unlike traditional models that require separate systems for audio and text, Gemini is **natively multimodal**. This means it can accept audio inputs directly, leveraging advanced machine learning techniques to achieve high accuracy and robustness across various accents, languages, and acoustic environments. We'll explore how these models can be integrated into applications to enable voice-based interactions, transcription services, and accessibility features. By harnessing Gemini's audio capabilities, we'll unlock new possibilities for human-computer interaction and demonstrate how to transform audio input into actionable text data with remarkable precision.

Note: We will make use of the JavaScript technique described below to record audio directly within Google Colab, as Colab runs on a remote server and cannot access your local microphone by default.

https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be

### **Native Audio Understanding**

Here we delve into the realm of multimodal audio processing, focusing on the powerful capabilities offered by Google's **Gemini 2.5** models. Unlike traditional Speech-to-Text (ASR) which simply converts sound waves into words, Gemini treats audio as a "native" modality, meaning it processes the raw audio waveform directly alongside text.

This approach allows Gemini to not only transcribe speech with high accuracy but also to:
* **Understand Context:** Detect emotions (sarcasm, excitement) and non-verbal cues.
* **Diarize:** Distinguish between multiple speakers automatically.
* **Reason:** Summarize or answer questions about the audio content without needing a separate text-processing step.

We will explore how `gemini-2.5-flash` can be used to transform raw audio input into actionable data with remarkable precision.

**Note on Recording in Colab:**
Because Google Colab runs on a remote server, it cannot access your local microphone directly. We will make use of a JavaScript bridge (adapted from [this technique](https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be)) to capture audio from your browser and stream it to the Python environment for processing.

### Functions to Record and Transcribe Audio

The code cell below creates two important functions for the next part of this lesson: `record_audio()` and `transcribe()`.

#### **`record_audio()`**
This Python function records audio from the user's microphone, converts it to a WAV file, and saves it to the disk. It accomplishes this by generating a JavaScript snippet that runs within a browser environment (i.e. Colab Notebook using `output.eval_js`) to access the microphone, then converts the resulting audio blob to a base64 string, and passes it back to Python for processing. The Python part then decodes the base64 string, saves it as a `.webm` file, and uses `ffmpeg` to convert it to `.wav` format. If successful, it deletes the .webm file and returns the path to the `.wav` file.

**How it Works**

1. **Audio Capture:** Uses JavaScript (navigator.mediaDevices.getUserMedia) to request microphone access and record audio for a specified number of seconds (default 3).
2. **Data Conversion:** The recorded audio is converted into a base64 encoded string within the JavaScript environment and passed back to the Python environment.
3. **File Saving & Conversion:** Python decodes the base64 string, saves it as a `.webm` file, and uses `ffmpeg` to transcode it to .wav format (16kHz, mono).
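4. **Cleanup:** It deletes the `.webm` file after successful conversion and returns the path to the `.wav` file. If the `ffmpeg` conversion does not produce a `.wav` file, it returns the path to the `.webm` file instead; if recording or decoding fails, it returns `None`.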
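
Steps 2 and 3 above can be tried without a microphone. The sketch below is an illustration only: it builds a fake data URL of the kind the browser hands back, decodes it, and assembles the `ffmpeg` argument list. The helper names `decode_data_url` and `ffmpeg_to_wav_args` are hypothetical and do not appear in the lesson's code.

```python
import base64

def decode_data_url(data_url):
    """Split a 'data:<mime>;base64,<payload>' string and decode the payload."""
    if not data_url or "," not in data_url:
        return None
    _header, payload = data_url.split(",", 1)
    return base64.b64decode(payload)

def ffmpeg_to_wav_args(src, dst, rate=16000, channels=1):
    """Build the ffmpeg argument list for a 16 kHz mono WAV conversion."""
    return ["ffmpeg", "-i", src, "-ar", str(rate), "-ac", str(channels), "-y", dst]

# Simulate what the browser returns: a data URL wrapping some bytes
fake_audio = b"not real audio, just placeholder bytes"
data_url = "data:audio/webm;base64," + base64.b64encode(fake_audio).decode("ascii")

decoded = decode_data_url(data_url)
print(decoded == fake_audio)                      # round trip check
print(ffmpeg_to_wav_args("rec.webm", "rec.wav"))  # arguments only; ffmpeg is not run
```

Splitting on the first comma separates the `data:audio/webm;base64` prefix from the encoded audio, which is why `record_audio()` rejects any string that has no comma.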

##### **Summary:**
This code is useful for recording voice commands or short audio snippets within a web-based interface or a Jupyter Notebook environment. It bridges the gap between browser-based microphone access and server-side Python processing.
__________________________________________________________________

#### **`transcribe()`**

The **transcribe()** function takes an audio file (typically a WAV file) and sends it to the Google Gemini API (specifically the "gemini-2.5-flash" model) to generate a text transcription of the audio.

**How it works:**

1. **Input Validation:** It first checks if the provided filename exists. If not, it returns `None`.
2. **File Reading:** It opens the audio file in binary mode and reads the contents into memory.
3. **API Call:** It constructs a request to the Gemini API, including the user's prompt (e.g., "Transcribe this audio...") and the raw audio bytes with a MIME type of audio/wav.
4. **Processing:** The API processes the audio and returns a text response.
5. **Cleanup:** By default, it deletes the original audio file after successful transcription. If an error occurs, it also attempts to delete the file (unless keep_file=True is specified).
6. **Return Value:** It returns the transcription text if successful, or None if any step fails.
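
The read, call, and clean-up flow in steps 1 through 6 can be sketched without contacting the API. In the example below, `send` is a hypothetical stand-in for the Gemini call, and the function name `process_and_cleanup` is made up for illustration. Note that this sketch uses `try`/`finally` so the clean-up code appears once, which is a different arrangement from the lesson's `transcribe()` function.

```python
import os
import tempfile

def process_and_cleanup(filename, send, keep_file=False):
    """Read a file, pass its bytes to `send`, and delete the file afterwards.

    `send` is a hypothetical stand-in for the real API call.
    """
    if not filename or not os.path.exists(filename):
        return None
    try:
        with open(filename, "rb") as f:
            data = f.read()
        text = send(data)
        return text.strip() if text else None
    except Exception:
        return None
    finally:
        # Runs on success and on failure, so the clean-up lives in one place
        if not keep_file and os.path.exists(filename):
            os.remove(filename)

# Demo with a temporary file and a fake "API" that reports the byte count
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"12345")
    demo_path = tmp.name

result = process_and_cleanup(demo_path, lambda b: f" received {len(b)} bytes ")
print(result)
print(os.path.exists(demo_path))
```

Passing `keep_file=True` leaves the file on disk, which is what Example 1 relies on when it plays the recording back after transcription.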

##### **Summary:**

This function is a utility for converting audio files to text. It acts as a wrapper around the Google Gemini API, handling file I/O and error management to ensure that the audio file is cleaned up after the task is complete. It's useful for applications that need to transcribe voice notes or convert speech to text.


In [None]:
# Create record_audio() and transcribe() functions

# ============================================================================
# RECORDING AND TRANSCRIPTION FUNCTIONS
# ============================================================================

import os
import base64
import time
import subprocess
from IPython.display import display, Audio
from google.colab import output, userdata
from google import genai
from google.genai import types

# Initialize Gemini
API_KEY = userdata.get('GEMINI_API_KEY')
os.environ["GOOGLE_API_KEY"] = API_KEY
client = genai.Client(api_key=API_KEY)

# ============================================================================
# RECORDING
# ============================================================================

def record_audio(sec=3):
    """Records audio."""
    complete_js = f"""
    (async function() {{
      const sleep = time => new Promise(resolve => setTimeout(resolve, time))
      const b2text = blob => new Promise(resolve => {{
        const reader = new FileReader()
        reader.onloadend = e => resolve(e.target.result)
        reader.readAsDataURL(blob)
      }})

      try {{
        const stream = await navigator.mediaDevices.getUserMedia({{ audio: true }})
        const recorder = new MediaRecorder(stream)
        const chunks = []

        recorder.ondataavailable = e => {{
          if (e.data.size > 0) chunks.push(e.data)
        }}

        const recordingPromise = new Promise((resolve, reject) => {{
          recorder.onstop = async () => {{
            stream.getTracks().forEach(track => track.stop())
            if (chunks.length === 0) reject('No audio data')
            else {{
              const blob = new Blob(chunks, {{ type: 'audio/webm' }})
              if (blob.size === 0) reject('Empty blob')
              else resolve(await b2text(blob))
            }}
          }}
          recorder.onerror = e => reject('Error: ' + e.error)
        }})

        recorder.start()
        await sleep({sec * 1000})
        recorder.stop()
        return await recordingPromise

      }} catch (error) {{
        return 'ERROR: ' + error.message
      }}
    }})()
    """

    try:
        s = output.eval_js(complete_js)
        if not s or s.startswith('ERROR:') or ',' not in s:
            return None

        binary = base64.b64decode(s.split(',')[1])
        filename = f'rec_{int(time.time())}.webm'

        with open(filename, 'wb') as f:
            f.write(binary)

        wav_filename = filename.replace('.webm', '.wav')
        subprocess.run(['ffmpeg', '-i', filename, '-ar', '16000', '-ac', '1', '-y', wav_filename],
                      stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

        if os.path.exists(wav_filename):
            os.remove(filename)
            return wav_filename
        return filename

    except Exception:
        return None

# ============================================================================
# TRANSCRIPTION
# ============================================================================

def transcribe(filename, prompt="Transcribe this audio. Return only the transcription.", keep_file=False):
    """Transcribes audio."""
    if not filename or not os.path.exists(filename):
        return None

    try:
        with open(filename, "rb") as f:
            audio_bytes = f.read()

        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[prompt,
                     types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav")]
        )

        if not keep_file:
            os.remove(filename)
        return response.text.strip() if response and response.text else None

    except Exception:
        if not keep_file and os.path.exists(filename):
            os.remove(filename)
        return None

print("✅ record_audio() and transcribe() functions loaded!")


## **Transcribing Audio with Gemini**

#####  **Overview of the API**
- **Models**: We use **Gemini 2.5 Flash**. It is a "multimodal" model, meaning it can natively understand text, images, and **audio** simultaneously.
- **Input**: Accepts audio data directly (e.g., WAV, MP3, MP4) alongside text prompts.
- **Output**: Returns text, JSON, or structured data based on your instructions.
- **Capabilities**: Unlike traditional "transcription-only" models, you can ask Gemini to do things *while* it listens, such as "Summarize this recording," "Extract the patient's symptoms," or "Translate this to Spanish."
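
Because the task is set by the text prompt, switching from transcription to summarizing or translating only requires a different string. The sketch below shows one way to organize such prompts; the template wording and the names `TASK_PROMPTS` and `build_prompt` are illustrative assumptions, not part of the Gemini API.

```python
# Hypothetical prompt templates; the wording is illustrative only
TASK_PROMPTS = {
    "transcribe": "Transcribe this audio. Return only the transcription.",
    "summarize": "Summarize this recording in two sentences.",
    "translate": "Transcribe this audio, then translate it to {language}.",
}

def build_prompt(task, **fields):
    """Look up a task template and fill in any placeholders."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task}")
    return TASK_PROMPTS[task].format(**fields)

print(build_prompt("translate", language="Spanish"))
```

In this notebook, the resulting string could be passed as the `prompt` argument of `transcribe()`.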

##### **Why It's Useful for Biomedical Investigators**

1. **Transcribing Interviews & Focus Groups**
   Automatically convert recorded conversations with patients, clinicians, or research participants into text for qualitative analysis.

2. **Clinical Note Dictation**
   Researchers can dictate observations or notes during fieldwork or lab work, streamlining documentation.

3. **Meeting & Conference Transcripts**
   Capture and archive discussions from research meetings, seminars, or collaborative calls.

4. **Data Extraction from Audio**
   Enables downstream NLP tasks like identifying social determinants of health (SDOH) or extracting biomedical entities directly from spoken content without needing a separate transcription step.

5. **Multilingual Support**
   Useful in global health research where interviews or data collection occur in multiple languages.


## Example 1: Speech-to-Text

The code in the cell below uses the `record_audio()` function to record your voice and then uses the `transcribe()` function to convert the recording into text, which is printed out.

Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png)start counting out loud from `1` to `10`.

**WARNING:** If you see this popup window:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image28F.png)

you will probably need to run this cell again to get it to work.

In [None]:
# Example 1: Speech-to-Text

from IPython.display import Audio, display
import time
import sys

# Configuration
LLM_MODEL = "gemini-2.5-flash"
RECORD_DURATION = 10  # Seconds
TIMEOUT = 30  # Maximum seconds to wait for permission + recording

try:
    # 1. Capture Audio
    print("🎤 Start speaking!")
    print(f"⏱️ Please grant microphone permission if prompted (timeout in {TIMEOUT}s)")
    sys.stdout.flush()  # Force output to display immediately

    start_time = time.time()
    audio_filename = record_audio(sec=RECORD_DURATION)
    elapsed = time.time() - start_time

    # Check if it took suspiciously long (likely hung on permission dialog)
    if elapsed > (RECORD_DURATION + 15):
        print("⚠️ Recording took too long - permission may have been delayed.")
        print("💡 Tip: Run this cell again. Permission should already be granted.")

    if audio_filename:
        # 2. Transcribe using Native Audio Reasoning
        print(f"📡 Sending waveform to {LLM_MODEL}...")

        transcription = transcribe(
            filename=audio_filename,
            prompt="Transcribe accurately. Include speaker labels if multiple people are speaking.",
            keep_file=True
        )

        # 3. Output Results
        print("\n" + "="*30)
        print("📜 TRANSCRIPTION")
        print("="*30)
        print(transcription)
        print("="*30)

        # 4. Playback for verification
        print("\n🔊 Playing back recorded audio...")
        display(Audio(audio_filename, autoplay=False))

    else:
        print("❌ Recording failed. Please check your browser's microphone permissions.")
        print("💡 Make sure to click 'Allow' when the browser asks for microphone access.")

except KeyboardInterrupt:
    print("\n⚠️ Recording interrupted by user.")
except Exception as e:
    print(f"⚠️ An error occurred during the Speech-to-Text process: {e}")
    print("💡 Try running the cell again - microphone permission may need to be granted first.")


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image15F.png)

## **Exercise 1: Speech-to-Text**

In the cell below, write the code to generate Speech-to-Text using the code in Example 1 as a template.

For **Exercise 1**, once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png)start counting **_backwards_** from `10` to `1`.

In [None]:
# Insert your code for Exercise 1 here



If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image16F.png)

## **Part 2: Text to Speech with Google**

In this section, we'll explore text-to-speech (TTS), focusing on Google's powerful speech synthesis tools. While Gemini's text models are excellent at generating the *words*, the `speak()` function defined below uses a Gemini native audio model through the Live API to convert that written text into natural-sounding speech.

Google's TTS models are optimized for both real-time applications and high-fidelity audio storage. This technology represents a significant advancement in speech synthesis, using deep learning to produce clear, lifelike vocal outputs in a wide variety of languages and accents. By utilizing these tools, we'll explore their capabilities and understand how they revolutionize industries, from accessibility solutions to interactive voice assistants and beyond.

In [None]:
# ============================================================================
# TEXT-TO-SPEECH FUNCTION
# ============================================================================

import struct
import asyncio
import base64

async def speak(text, voice="Kore", autoplay=True, save_to=None):
    """
    TTS using Live API.

    Parameters:
        text: The text to speak
        voice: Voice name (default "Kore")
        autoplay: Whether to automatically play the audio (default True)
        save_to: Optional filename to save the audio to (default None)
    """
    if not text or len(text.strip()) == 0:
        return False

    try:
        config = types.LiveConnectConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
                )
            )
        )

        audio_chunks = bytearray()

        async with client.aio.live.connect(model="gemini-2.5-flash-native-audio-latest", config=config) as session:
            # Send text to speak
            await session.send_client_content(
                turns=[types.Content(
                    role="user",
                    parts=[types.Part(text=f"Read this text verbatim without any analysis: {text}")]
                )],
                turn_complete=True
            )

            # Collect audio chunks
            async for response in session.receive():
                if response.server_content and response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_chunks.extend(part.inline_data.data)

                if response.server_content and response.server_content.turn_complete:
                    break

        # Process audio
        if audio_chunks:
            sample_rate = 24000
            wav_header = struct.pack(
                '<4sI4s4sIHHIIHH4sI',
                b'RIFF', 36 + len(audio_chunks), b'WAVE', b'fmt ', 16, 1, 1,
                sample_rate, sample_rate * 2, 2, 16, b'data', len(audio_chunks)
            )
            full_audio = wav_header + audio_chunks

            # Save to file if requested
            if save_to:
                with open(save_to, 'wb') as f:
                    f.write(full_audio)

            # Play audio if autoplay is enabled
            if autoplay:
                from IPython.display import HTML
                audio_b64 = base64.b64encode(full_audio).decode('utf-8')

                html = f"""
                <audio autoplay style="display:none;">
                    <source src="data:audio/wav;base64,{audio_b64}" type="audio/wav">
                </audio>
                """

                print("🔊 Speaking...")
                display(HTML(html))

            return True
        else:
            return False

    except Exception as e:
        print(f"⚠️ TTS error: {e}")
        return False

print("✅ speak() function loaded!")
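
The `speak()` function receives raw PCM samples and adds a 44-byte WAV header by hand with `struct.pack`. As a cross-check, the sketch below builds the same kind of file with Python's standard `wave` module and compares the two headers. It assumes 16-bit mono audio at 24000 samples per second, as in `speak()`; the helper name `pcm_to_wav_bytes` is hypothetical.

```python
import io
import struct
import wave

def pcm_to_wav_bytes(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container using the standard library."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)   # bytes per sample (2 = 16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()

# One tenth of a second of silence: 2400 samples x 2 bytes each
pcm = bytes(4800)
wav_bytes = pcm_to_wav_bytes(pcm)

# The same header layout that speak() packs by hand
manual_header = struct.pack(
    '<4sI4s4sIHHIIHH4sI',
    b'RIFF', 36 + len(pcm), b'WAVE', b'fmt ', 16, 1, 1,
    24000, 24000 * 2, 2, 16, b'data', len(pcm)
)

print(len(wav_bytes) - len(pcm))         # header size in bytes
print(wav_bytes[:44] == manual_header)   # do the two approaches agree?
```

Either approach works for playback in the notebook; the `wave` module simply computes the size and byte-rate fields for you.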


## **Google's Voices**

When using the Gemini Multimodal Live API to generate real-time conversational audio, you utilize Native Audio models (such as `gemini-live-2.5-flash-native-audio`). Unlike traditional Text-to-Speech which "synthesizes" text into sound after the fact, Gemini's native audio models generate speech directly as a core modality. This allows for Affective Dialog, where the voice automatically adapts its tone, emotion, and emphasis based on the context of the conversation.

Google offers a suite of distinct voice personas, along with a library of over 30 HD voices. The primary personas include:

* Puck: The most popular general-purpose voice. Conversational, friendly, and approachable with a mid-range pitch. It has an "upbeat" and "guy-next-door" feel.

* Charon: A deep, calm, and authoritative male voice. It projects a sense of informative experience and steady confidence, perfect for formal narrations.

* Kore: A bright, energetic, and professional female voice. Excellent for high-engagement tasks like coaching or upbeat customer support where a "firm" but engaging tone is needed.

* Fenrir: A warm, steady, and approachable male voice. It sits between Puck and Charon, making it perfect for long-form listening or educational content.

* Aoede: A clear, thoughtful, and articulate female voice. Known for a "breezy" and intelligent tone that handles complex discussions gracefully.

### Example 2: Demonstrate Different Voices

The code in the cell below demonstrates 4 of the different voices that are available in the `Gemini` text-to-speech API:

* **Puck:** A clear, direct, and conversational male voice with a mid-range pitch. Puck is often described as having a "guy next door" feel: friendly, trustworthy, and approachable. Because of its balanced tone, it is the default choice for most general-purpose assistants.

* **Charon:** A deep, calm, and authoritative male voice. Charon projects a sense of experience and steady confidence. It is best suited for scenarios that require a more formal or serious tone, such as news delivery, instructional narrations, or professional corporate guides.

* **Kore:** An energetic and youthful female voice with a bright, professional quality. Kore conveys high enthusiasm and confidence without being overly casual. This makes it an excellent choice for upbeat tutorials, engaging customer support, or any interaction where you want to keep the user's energy high.

* **Fenrir:** Widely considered the most versatile of the male voices. It sits perfectly between the high energy of Puck and the deep authority of Charon.

For example, here is a more detailed description of the **Fenrir** voice:

>  **Persona:** Warm, approachable, and steady. Fenrir has a mid-range pitch that feels exceptionally natural and human. It lacks the "broadcast" quality of Charon and the "youthful bounce" of Puck, making it feel more like a calm colleague or a supportive mentor.

>  **Tone:** Balanced and conversational. It is designed to be "easy to listen to" for long periods, which is why it is frequently used for e-learning, narrations, and long-form assistants.

>  **Best For:** Explainer videos, podcasting, technical support, or any application where you want to project reliability and warmth without being too formal.

Run the code cell to hear each of these four voices.

In [None]:
# Example 2: Demonstrate Different Voices

from IPython.display import Audio, display
import os

async def run_voice_demos():
    # Primary voice profiles
    demo_voices = ["Puck", "Charon", "Kore", "Fenrir"]

    # This text is designed to showcase the tonal differences of each profile
    sample_text = "Hello! Welcome to BIO 1173, Introduction to Computational Biology at UT San Antonio!"

    print("--- Starting Gemini Live Voice Demo ---")

    for voice in demo_voices:
        # Step 1: Generate the audio file using the Live API
        # This function handles the WebSocket connection and PCM-to-WAV conversion
        filename = f"sample_{voice.lower()}.wav"
        result = await speak(sample_text, voice=voice, autoplay=False, save_to=filename)

        # Step 2: Validate and display the playback widget
        if result and os.path.exists(filename):
            print(f"\n[✔] Playing sample for {voice}:")
            display(Audio(filename, autoplay=False))
        else:
            print(f"  [X] Skipping {voice}: Generation failed or file not found.")

    print("\n--- Voice Demonstration Complete ---")

# Execute the voice demo in Colab
await run_voice_demos()


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image02F.png)

Press the Play icon to listen to each voice.

### **Exercise 2: Demonstrate Different Voices**

In the cell below, write the code to demonstrate the following 4 Gemini voices:

* **Aoede:** A clear, thoughtful, and articulate female voice, often described as sounding intelligent and engaging.

* **Leda:** A calm and steady voice with a balanced tone, suitable for neutral assistants.

* **Orus:** A direct and confident male voice, slightly more formal than Puck.

* **Zephyr:** An upbeat and energetic voice, similar in spirit to Kore but with a different tonal profile.

In [None]:
# Insert your code for Exercise 2 here




If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image03F.png)

Press the Play icon to listen to each voice.

### Example 3: Transcribe Recorded Data

The code in the cell below shows how to record your speech, print out a transcription of what you said, and finally, read the transcription using the "Fenrir" voice.

Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png), read out loud Carl Sandburg's poem "Fog", a short, imagistic piece that captures the quiet, mysterious arrival of fog. Don't forget to start by saying the title of the poem, "FOG". When you read the poem, make sure to pause after every line.

```text
FOG

The fog comes
on little cat feet.

It sits looking
over harbor and city
on silent haunches
and then moves on.
```

In [None]:
# Example 3: Transcribe and Speak Back

import sys

# Set voice and duration
voice = "Fenrir"
duration = 20

try:
    # 1. Capture Audio
    print(f"🎤 Starting {duration} second recording... Speak now!")
    sys.stdout.flush()

    audio_path = record_audio(sec=duration)

    if audio_path:
        # 2. Transcribe using the transcribe function
        print("📡 Transcribing audio...")
        transcription = transcribe(audio_path, keep_file=False)

        print("\n" + "="*30)
        print(f"📜 Captured: {transcription}")
        print("="*30 + "\n")

        # 3. Speak back using TTS
        print("🔊 Reading back transcript...")
        await speak(transcription, voice=voice)

    else:
        print("❌ Recording failed. Please check mic permissions.")

except Exception as e:
    print(f"⚠️ Error in the process: {e}")


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image20F.png)

### **Exercise 3: Transcribe Recorded Data**

In the cell below, write the code to record your speech, print out a transcription of what you said, and finally, read the transcription using the "Leda" voice.

After you start running the cell, start reading _The Red Wheelbarrow_ by William Carlos Williams. Like _Fog_, it's a minimalist, imagist poem that captures a vivid moment with few words. Make sure to pause after each couplet.

```text
The Red Wheelbarrow

so much depends
upon

a red wheel
barrow

glazed with rain
water

beside the white
chickens.
```

In [None]:
# Insert your code for Exercise 3 here



If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image21F.png)

The poem is broken into four tiny stanzas, each with a long line followed by a short one.

This spacing:
* slows the reader down
* isolates each image
* makes you notice the shape of the words

Even the word "wheelbarrow" is split in half, forcing you to see it differently.

## **Part 3: Chatbots**

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image10A.png)

The history of **chatbots** is a fascinating journey through the evolution of artificial intelligence and human-computer interaction. Here's a brief overview:

* **1. The Early Days (1950s-1970s)**
    * 1950 - Alan Turing's "Imitation Game": Turing proposed a test (now known as the Turing Test) to determine if a machine could exhibit intelligent behavior indistinguishable from a human.
    * 1966 - ELIZA: Created by Joseph Weizenbaum at MIT, ELIZA was the first chatbot. It mimicked a Rogerian psychotherapist by rephrasing user input into questions. It was simple but groundbreaking.
    * 1972 - PARRY: Developed by Kenneth Colby, PARRY simulated a person with paranoid schizophrenia. It was more complex than ELIZA and could hold more realistic conversations.
* **2. Rule-Based Systems (1980s-1990s)**
    * Chatbots during this era used hand-coded rules and decision trees.
    * They were mostly used in academic research, customer service, and early virtual assistants.
    * Examples include Jabberwacky (late 1980s), which aimed to simulate natural human chat through learning.
* **3. Rise of the Internet and AI (2000s)**
    * SmarterChild (2001): A popular chatbot on AOL Instant Messenger and MSN Messenger. It could answer questions, play games, and chat casually.
    * ALICE (Artificial Linguistic Internet Computer Entity): Created by Richard Wallace, it won the Loebner Prize (a Turing Test competition) multiple times.
* **4. Machine Learning and NLP Boom (2010s)**
    * 2011 - Siri: Apple introduced Siri, a voice-activated assistant that brought chatbots into the mainstream.
    * 2014 - Alexa and Cortana: Amazon and Microsoft launched their own virtual assistants.
    * 2016 - Facebook Messenger Bots: Facebook opened its platform to developers, leading to a surge in chatbot development for businesses.
* **5. Neural Networks and Transformers (Late 2010s-2020s)**
    * 2018 - BERT (Google) and GPT (OpenAI): These transformer-based models revolutionized natural language understanding and generation.
    * 2020 - GPT-3: A massive leap in chatbot capabilities, enabling more coherent, context-aware, and human-like conversations.
    * 2022 - ChatGPT: OpenAI released ChatGPT based on GPT-3.5 and later GPT-4, making advanced conversational AI widely accessible.
* **6. The Present and Future (2020s-Today)**
    * Chatbots are now integrated into education, healthcare, customer service, entertainment, and more.
    * Multimodal models (like GPT-4 and beyond) can understand text, images, and even audio.
    * The focus is shifting toward personalization, emotional intelligence, and ethical AI.

## **Create a `Google Nest` Chatbot**

For Example 4 we are going to create an emulation of a Google `Nest Mini`.

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image15E.png)

### **Smart assistants have a "wake word"**

Smart assistants can be set up to work in a variety of languages. This table shows several commonly used languages, their abbreviations, and the wake words used in each.

| Language | Abbreviation | Wake Words |
| :--- | :--- | :--- |
| **English** | `en` | "hey google", "ok google", "hi google" |
| **Spanish** | `es` | "ok google", "oye google" |
| **French** | `fr` | "ok google", "dis google" |
| **German** | `de` | "ok google", "hallo google" |
| **Japanese** | `ja` (sometimes written `jp`) | "ok google", "ne google" |

#### **What is a "Wake Word"?**

A **Wake Word** (or "Hotword") is a specific phrase that activates a voice assistant from a dormant, power-saving state into an active, listening state.

### **How it Works**
Voice assistants like the Google Nest Mini operate in two distinct modes to protect privacy and conserve resources:

1.  **Passive Listening (On-Device):**
    * The device continuously records short loops of audio (usually a few seconds).
    * It analyzes this audio locally on a specialized low-power chip.
    * It is looking **only** for the specific acoustic signature of the wake word (e.g., *"Hey Google"*).
    * If the wake word is *not* detected, the audio is discarded immediately and never leaves the device.

2.  **Active Listening (Cloud Processing):**
    * Once the wake word is detected, the device "wakes up" (often indicated by LEDs lighting up or a "blip" sound).
    * It begins recording your actual command (e.g., *"What is the weather?"*).
    * This command is then sent to the cloud (Google's servers) for advanced processing and response generation.

##### **In Our Simulator**

The Python code in Example 4 mimics this behavior using a `while` loop:
* **State 1 (Passive):** It records short audio chunks (4 seconds each by default) and checks *only* whether the transcribed text contains "Hey Gemini" or "OK Gemini".
* **State 2 (Active):** If detected, it switches to a longer recording mode to capture your full request, sends it to the LLM, and then speaks the response.
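
The two states can be sketched without any audio at all. The snippet below is a minimal illustration, not the code used in Example 4: the helper names `normalize` and `run_assistant` are hypothetical, and plain strings stand in for transcribed audio chunks.

```python
# Hypothetical sketch: text "chunks" stand in for transcribed audio.
WAKE_WORDS = ("hey gemini", "ok gemini")  # same wake words as Example 4

def normalize(text):
    """Lowercase the text and drop basic punctuation."""
    cleaned = text.lower().replace("okay", "ok")
    for mark in ",.!?":
        cleaned = cleaned.replace(mark, "")
    return cleaned.strip()

def run_assistant(chunks):
    """Walk through transcribed chunks with a two-state machine."""
    state = "passive"
    log = []
    for chunk in chunks:
        text = normalize(chunk)
        if state == "passive":
            # State 1: look only for the wake word, discard everything else
            if any(wake in text for wake in WAKE_WORDS):
                state = "active"
                log.append("activated")
            else:
                log.append("discarded")
        else:
            # State 2: treat the chunk as a command, then go dormant again
            log.append(f"command: {text}")
            state = "passive"
    return log

events = run_assistant(
    ["Nice weather today.", "Okay Gemini!", "What is the weather?", "Thanks."]
)
print(events)
```

Only the chunk that follows the wake word is treated as a command; everything heard in the passive state is thrown away, which mirrors the privacy behavior described above.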


### Create Functions

Run the code in the next cell to create a number of audio functions needed for this lesson.

In [None]:
# ============================================================================
# VOICE ASSISTANT USING LIVE API (STREAMING AUDIO)
# ============================================================================

import os
import base64
import time
import asyncio
import struct
from IPython.display import display, Audio
from google.colab import output, userdata
from google import genai
from google.genai import types
import subprocess

# Initialize Gemini
API_KEY = userdata.get('GEMINI_API_KEY')
os.environ["GOOGLE_API_KEY"] = API_KEY
client = genai.Client(api_key=API_KEY, http_options={'api_version': 'v1beta'})


# ============================================================================
# WAKE WORD DETECTION
# ============================================================================

def check_wake_word(text, wake_words):
    """Check for wake word."""
    if not text:
        return False

    norm = text.lower().strip()
    norm = norm.replace("okay", "ok").replace(",", "").replace(".", "")
    norm = norm.replace("!", "").replace("?", "").replace("[no audio]", "")

    print(f"   📝 '{text}' → '{norm}'")

    if len(norm) < 2:
        return False

    patterns = {
        "hey gemini": ["hey gemini", "hey jiminy", "hey jamila", "it gemini"],
        "ok gemini": ["ok gemini", "okay gemini"]
    }

    for wake_word in wake_words:
        if wake_word.lower() in patterns:
            for pattern in patterns[wake_word.lower()]:
                if pattern in norm or all(w in norm for w in pattern.split()):
                    print(f"   ✅ WAKE WORD: '{wake_word}'")
                    return True

    print(f"   ❌ No wake word")
    return False

# ============================================================================
# WAKE WORD LOOP
# ============================================================================

async def listen_for_wake_word(wake_words, duration=4):
    """Listen for wake words."""
    print("\n" + "="*60)
    print("🎧 WAKE WORD DETECTION")
    print(f"📢 Say: {' or '.join(wake_words)}")
    print("="*60 + "\n")

    while True:
        print("🎤 Listening...")
        audio_file = record_audio(sec=duration)

        if audio_file:
            transcription = transcribe(audio_file)
            if transcription and check_wake_word(transcription, wake_words):
                print("\n🎯 Activated!\n")
                return True

        await asyncio.sleep(0.3)



# ============================================================================
# CONVERSATION LOOP
# ============================================================================

async def conversation_loop(voice, conv_duration, pause, return_to_wake):
    """Conversation mode."""
    from langchain_google_genai import ChatGoogleGenerativeAI

    # Create LLM
    llm = ChatGoogleGenerativeAI(
        model='gemini-2.5-flash',
        temperature=0.3,
        google_api_key=API_KEY
    )

    # Greet
    greeting = "Hello! How can I help you?"
    print(f"🤖 {greeting}\n")
    await speak(greeting, voice=voice)

    print("="*60)
    print("💬 CONVERSATION MODE")
    print("="*60)
    print("Say 'bye' to exit\n")

    while True:
        await asyncio.sleep(pause)

        print("🎤 Your turn...")
        audio_file = record_audio(sec=conv_duration)

        if not audio_file:
            continue

        user_input = transcribe(audio_file)
        if not user_input:
            continue

        print(f"👤 You: {user_input}\n")

        # Check exit
        norm = user_input.lower().strip()
        if any(word in norm for word in ["bye", "goodbye", "exit", "stop"]):
            farewell = "Goodbye! Have a great day!"
            print(f"🤖 {farewell}\n")
            await speak(farewell, voice=voice)

            if return_to_wake:
                print("🔄 Returning to wake word\n")
                return False
            else:
                print("🛑 Exiting\n")
                return True
                return True

        # Get response
        try:
            # Add instruction for brief responses directly in the prompt
            prompt = f"Answer very briefly in 1-2 sentences: {user_input}"

            response = llm.invoke(prompt)
            ai_text = response.content

            print(f"🤖 Gemini: {ai_text}\n")
            await speak(ai_text, voice=voice)

        except Exception as e:
            print(f"❌ Error: {e}\n")

# ============================================================================
# MAIN
# ============================================================================

async def start_wake_word_assistant(
    wake_words=["hey gemini", "ok gemini"],
    voice="Kore",
    wake_duration=4,
    conversation_duration=6,
    pause_between=2,
    return_to_wake=False
):
    """Main assistant using Live API for TTS."""
    while True:
        wake = await listen_for_wake_word(wake_words, wake_duration)

        if wake:
            should_exit = await conversation_loop(
                voice, conversation_duration, pause_between, return_to_wake
            )

            if should_exit:
                print("✅ Stopped\n")
                break

print("✅ Live API voice assistant loaded!")
print("🔧 Uses gemini-2.5-flash-native-audio-latest Live API for TTS")
print("📝 Usage: await start_wake_word_assistant(return_to_wake=False)")


### Example 4: Communicate with Chatbot

Like Siri or Hey Google, you need to get the Chatbot's attention by saying the "wake word". For this example, the two wake words are "hey gemini" and "ok gemini".

While the Chatbot listens, it prints what it hears. Once it has heard its wake word, it will respond with:

```text
🎤 Listening...
   📝 '##' → '##'
   ❌ No wake word
🎤 Listening...
   📝 'Hey Gemini.' → 'hey gemini'
   ✅ WAKE WORD: 'hey gemini'

🎯 Activated!
```

When the Chatbot has been activated by the wake word, it will respond with:

```text
🤖 Hello! How can I help you?

🔊 Speaking...
```
Unlike a Nest Mini, our Chatbot communicates both by printing text to your computer screen ("Hello! How can I help you?") and by speaking the output.

At this point our Chatbot enters conversation mode and prints this text output:

```text
============================================================
💬 CONVERSATION MODE
============================================================
Say 'bye' to exit

🎤 Your turn...
```
When you see "Your turn..." you can ask the Chatbot any question you like. The Chatbot should continue to "chat" with you as long as you like. To terminate your chat session, just say "bye" or "goodbye".
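
One detail worth knowing: `conversation_loop` looks for the exit words anywhere in the transcribed text, so a question that merely contains "stop" or "bye" inside a longer word will also end the session. The snippet below is a minimal sketch of a stricter whole-word check; the function name `wants_to_exit` is hypothetical and is not part of the lesson code.

```python
import re

EXIT_WORDS = ("bye", "goodbye", "exit", "stop")  # same words as conversation_loop

def wants_to_exit(utterance):
    """Return True only if an exit word appears as a whole word."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return any(word in EXIT_WORDS for word in words)

print(wants_to_exit("Okay, goodbye!"))
print(wants_to_exit("Why do bus stops have benches?"))
```

The first call reports an exit request, while the second does not, because "stops" is not the same word as "stop".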

For Example 4, just ask our Chatbot the following question, "What is the capital of France?" Once you have the answer, terminate the session by saying "bye" or "goodbye".

**Warning:** Don't expect the same level of responsiveness that you get with Siri or Hey Google. Running our Chatbot code on Colab can be rather slow, so you need to be patient and go slow. Wait until you see ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image24F.png) to start speaking.

In [None]:
# Example 4: Communicate with Chatbot

# Listen for the wake word
await start_wake_word_assistant(
    wake_words=["hey gemini", "ok gemini"],
    voice="Kore",
    return_to_wake=False
)


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image19F.png)

### **Exercise 4: Conversation with Chatbot**

In the cell below write the code to start a new conversation with the `Chatbot`. Ask your `Chatbot` for **answers to 5 different questions** of your own choosing. After the 5th question has been answered, terminate your conversation by saying the word **"bye"**.

**WARNING:** Getting the Chatbot to work can be a bit tricky. Do your best and go slow. Wait until you see ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image24F.png) to start speaking.

In [None]:
# Insert your code for Exercise 4 here



Your output will depend upon the questions you asked your Chatbot.

### Example 5: Medical History Taker

In Example 5, we are going to create a Chatbot that asks a patient a series of health questions and records the responses. For our voice we will use "Kore".

**WARNING:** Getting the Chatbot to work can be a bit tricky. Do your best and go slow. Wait until you see ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image27F.png) to start speaking.


In [None]:
# Example 5: Medical History Chatbot

# Define voice
VOICE="Kore"

async def medical_history_chatbot(voice=VOICE):
    """A chatbot that collects basic medical history."""
    from langchain_google_genai import ChatGoogleGenerativeAI
    import sys

    llm = ChatGoogleGenerativeAI(
        model='gemini-2.5-flash',
        temperature=0.3,
        google_api_key=API_KEY
    )

    # Store patient responses
    patient_history = []

    # Questions to ask
    questions = [
        "What is your name?",
        "What symptoms are you experiencing today?",
        "How long have you had these symptoms?",
        "Are you currently taking any medications?",
        "Do you have any known allergies?"
    ]

    print("="*60)
    print("🏥 MEDICAL HISTORY CHATBOT")
    print("="*60 + "\n")

    # Greeting
    greeting = "Hello, I'm here to collect some basic health information. Please answer a few questions."
    print(f"🤖 {greeting}\n")
    await speak(greeting, voice=voice)
    await asyncio.sleep(4)  # Wait for speech to finish

    # Ask each question
    for i, question in enumerate(questions):
        print(f"🤖 Question {i+1}: {question}\n")
        await speak(question, voice=voice)
        await asyncio.sleep(3)  # Wait for question to be spoken

        # Record patient response
        print("🎤 Your answer... (speak now!)")
        sys.stdout.flush()
        audio_file = record_audio(sec=10)

        if audio_file:
            response = transcribe(audio_file)
            if response and "[no speech]" not in response.lower():
                print(f"👤 Patient: {response}\n")
                patient_history.append({"question": question, "answer": response})
            else:
                print("👤 Patient: (no response recorded)\n")
                patient_history.append({"question": question, "answer": "No response"})

        await asyncio.sleep(1)

    # Summarize
    print("="*60)
    print("📋 PATIENT HISTORY SUMMARY")
    print("="*60)
    for item in patient_history:
        print(f"Q: {item['question']}")
        print(f"A: {item['answer']}\n")

    # Generate AI summary
    history_text = "\n".join([f"Q: {item['question']} A: {item['answer']}" for item in patient_history])
    summary_prompt = f"Summarize this patient intake in 2-3 sentences for a medical chart:\n{history_text}"

    summary_response = llm.invoke(summary_prompt)
    print("🤖 AI Summary:")
    print(summary_response.content)
    # Use the function's voice parameter rather than the global VOICE
    await speak(summary_response.content, voice=voice)
    await asyncio.sleep(4)  # Wait for summary to be spoken

    # Thank you message
    thanks = "Thank you for providing your medical history. A healthcare provider will review this information shortly."
    print(f"\n🤖 {thanks}\n")
    await speak(thanks, voice=voice)

    return patient_history

# Run the medical history chatbot
history = await medical_history_chatbot(voice=VOICE)


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image22F.png)
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image23F.png)

### **Exercise 5: Medical History Taker**

For **Exercise 5**, create a Chatbot that asks a patient a series of health questions and records the responses. Use the voice **Fenrir**.

Select one of the following 6 options for your medical questions:

**Option 1: Mental Health Screening**
```python
questions = [
    "What is your name?",
    "How would you describe your mood over the past two weeks?",
    "Have you been experiencing any difficulty sleeping?",
    "How would you rate your stress level on a scale of 1 to 10?",
    "Do you have a support system of friends or family you can talk to?"
]
```

**Option 2: Pain Assessment**
```python
questions = [
    "What is your name?",
    "Where exactly is your pain located?",
    "On a scale of 1 to 10, how severe is your pain?",
    "Is the pain constant or does it come and go?",
    "Does anything make the pain better or worse?"
]
```
**Option 3: Lifestyle and Preventive Health**
```python
questions = [
    "What is your name?",
    "How many hours of sleep do you typically get per night?",
    "How often do you exercise each week?",
    "Do you smoke or use tobacco products?",
    "How many servings of fruits and vegetables do you eat daily?"
]
```

**Option 4: COVID-19 / Respiratory Screening**
```python
questions = [
    "What is your name?",
    "Do you have a fever or feel feverish?",
    "Are you experiencing a cough or shortness of breath?",
    "Have you lost your sense of taste or smell?",
    "Have you been in close contact with anyone who tested positive for COVID-19?"
]
```
**Option 5: Family Medical History**
```python
questions = [
    "What is your name?",
    "Has anyone in your immediate family had heart disease?",
    "Is there a history of diabetes in your family?",
    "Has anyone in your family been diagnosed with cancer?",
    "Are there any other hereditary conditions that run in your family?"
]
```

**Option 6: Nutrition Assessment**
```python
questions = [
    "What is your name?",
    "How many meals do you typically eat per day?",
    "How much water do you drink daily?",
    "Do you have any food restrictions or dietary preferences?",
    "How often do you eat fast food or processed foods?"
]
```

**WARNING:** Getting the Medical History Taker to work properly can be a bit tricky. Do your best and go slow. Wait until you see ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image27F.png) to start speaking.

In [None]:
# Insert your code for Exercise 5 here



Your output will depend on which option you chose and your answers to those questions.



### Example 6: Biology Quiz Bot

For Example 6 we are going to build a chatbot that quizzes students on a specific topic and provides feedback. Here the topic is **basic biology**, and the voice is set to "Kore".

**WARNING:** Getting the Quiz Bot to work correctly can be a bit tricky. Do your best and go slow. Wait until you see ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image26F.png) to start speaking.

In [None]:
# Example 6: Biology Quiz Chatbot

# Set the voice
VOICE="Kore"

async def quiz_chatbot(voice=VOICE):
    """A chatbot that quizzes students on a specific topic."""
    from langchain_google_genai import ChatGoogleGenerativeAI
    import sys

    llm = ChatGoogleGenerativeAI(
        model='gemini-2.5-flash',
        temperature=0.3,
        google_api_key=API_KEY
    )

    # Quiz questions and answers
    quiz = [
        {"question": "What organelle is known as the powerhouse of the cell?", "answer": "mitochondria"},
        {"question": "What molecule carries genetic information?", "answer": "dna"},
        {"question": "What is the process by which plants convert sunlight into energy?", "answer": "photosynthesis"},
    ]

    score = 0

    print("="*60)
    print("🧬 QUIZ CHATBOT")
    print("="*60 + "\n")

    greeting = "Welcome to the Quiz! I'll ask you 3 questions. Let's begin!"
    print(f"🤖 {greeting}\n")
    await speak(greeting, voice=voice)
    await asyncio.sleep(3)  # Wait for speech to finish

    for i, q in enumerate(quiz):
        print(f"🤖 Question {i+1}: {q['question']}\n")
        await speak(q['question'], voice=voice)
        await asyncio.sleep(4)  # Wait for question to be spoken

        print("🎤 Your answer... (speak now!)")
        sys.stdout.flush()
        audio_file = record_audio(sec=8)

        if audio_file:
            student_answer = transcribe(audio_file)
            if student_answer:
                print(f"👤 You said: {student_answer}\n")

                # Simple check - see if the correct answer appears in the student's response
                correct_answer = q['answer'].lower()
                student_lower = student_answer.lower()

                # Check if answer is correct (simple string match)
                if correct_answer in student_lower:
                    score += 1
                    feedback = "That's correct! Great job!"
                else:
                    feedback = f"That's not quite right. The answer is {q['answer']}."

                print(f"🤖 {feedback}\n")
                await speak(feedback, voice=voice)
                await asyncio.sleep(2)  # Wait for feedback to be spoken

        await asyncio.sleep(1)

    # Final score
    final = f"Quiz complete! You scored {score} out of {len(quiz)}."
    print(f"\n🤖 {final}")
    await speak(final, voice=voice)
    await asyncio.sleep(3)  # Wait for score to be spoken

    # Thank you message
    thanks = "Thank you for playing!"
    print(f"🤖 {thanks}\n")
    await speak(thanks, voice=voice)

# Run the quiz
await quiz_chatbot(voice=VOICE)


If the code is correct, you should see something _similar_ to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image29F.png)
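
Example 6 marks an answer correct only when the expected word appears exactly in the transcription, so a small transcription slip can cost a point. The snippet below is a minimal sketch of a more forgiving check built on Python's standard `difflib` module. The function name `is_close_match` and the 0.8 threshold are assumptions chosen for illustration, and the sketch only handles one-word answers.

```python
from difflib import SequenceMatcher

def is_close_match(expected, spoken, threshold=0.8):
    """Return True if any spoken word is similar enough to the expected answer."""
    expected = expected.lower()
    for word in spoken.lower().split():
        word = word.strip(".,!?")  # drop punctuation added by the transcriber
        if SequenceMatcher(None, expected, word).ratio() >= threshold:
            return True
    return False

print(is_close_match("mitochondria", "I think it is the mitochondrion."))
print(is_close_match("dna", "RNA"))
```

With this threshold, "mitochondrion" is accepted as a match for "mitochondria", while "RNA" is still rejected for "dna". A lower threshold would accept more transcription errors but also more wrong answers.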

### **Exercise 6: Quiz Bot**

For **Exercise 6**, build a chatbot that will quiz students on a specific topic and provide feedback. Instead of using basic biology as in Example 6, choose a different quiz topic that you find interesting. You will need to write 3 questions and answers for your topic.

Change the voice to **Aoede**.

**WARNING:** Getting the Quiz Bot to work correctly can be a bit tricky. Do your best and go slow. Wait until you see ![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image26F.png) to start speaking.

In [None]:
# Insert your code for Exercise 6 here



Your output will depend upon which topic you selected and your questions for that topic.


# **Lesson Turn-in**
When you have completed and run all of the code cells, generate a PDF of your Colab notebook. If you are running Windows 10 or 11, use `File --> Print.. --> Microsoft Print to PDF`. If you have a Mac, use `File --> Print.. --> Save as PDF`.

In either case, save your PDF as `Copy of Class_04_2.lastname.pdf`, where `lastname` is your last name, and upload the file to Canvas.


## **Lizard Tail**


## **Attention Is All You Need**

![__](https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png)

**"Attention Is All You Need"** is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

The paper's title is a reference to the song "All You Need Is Love" by the Beatles. The name "Transformer" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word.

An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.

Some early examples that the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These convinced the team that the Transformer is a general purpose language model, and not just good for translation.

As of 2024, the paper has been cited more than 140,000 times.

**Authors**

The authors of the paper are: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. A Wired article on the paper highlights the group's diversity:

Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.

**Methods Discussed & Introduced**

The paper is best known for introducing the Transformer architecture, which underlies most modern Large Language Models (LLMs). A key reason most modern LLMs prefer this architecture is that it is far easier to parallelize than its predecessors. The operations needed for training can therefore be accelerated on a GPU, allowing both faster training and larger models.

The following mechanisms were introduced by the paper as part of the development of the transformer architecture.

**Scaled dot-product Attention & Self-attention**

Using scaled dot-product attention and self-attention instead of an RNN or LSTM (which rely on recurrence) allows for better performance, as described in the following paragraph.

Because the Query (Q), Key (K), and Value (V) matrices all come from the same source (the input sequence, or context window), the model needs no recurrence at all, which makes the architecture parallelizable. This differs from the original form of the attention mechanism introduced in 2014. The paper also introduces a scaling factor: the dot products are divided by the square root of the dimension of the key vectors, to keep large dot products from pushing the softmax into regions with very small gradients.
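
For reference, scaled dot-product attention is usually written as the formula below, where $d_k$ is the dimension of the key vectors and the softmax is applied to each row of the score matrix:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$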

In the translation setting that the paper focused on, the decoder also attends to the encoder's output. In these encoder-decoder attention layers, the queries come from the decoder (the target-language side), while the keys and values come from the encoder output (the source-language side).

**Multi-head Attention**

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (limited typically by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.

By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.
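
The description above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the projection matrices are random rather than learned, there is no masking or batching, and the function name `multi_head_attention` is hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Self-attention with several heads over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own projections (random here, learned in practice)
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_head))
        head_outputs.append(weights @ v)
    # Concatenate the heads and apply the final linear transformation
    w_o = rng.normal(size=(num_heads * d_head, d_model))
    return np.concatenate(head_outputs, axis=-1) @ w_o

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings
output = multi_head_attention(tokens, num_heads=2, rng=rng)
print(output.shape)
```

The output has the same shape as the input, one updated vector per token, which is what lets Transformer layers be stacked.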

**Positional Encoding**

Since the Transformer uses no recurrence, and self-attention by itself is insensitive to the order of the tokens, the model needs another way to know where each token sits in the sequence. The paper uses sine and cosine functions of different frequencies to encode each token's position, and adds this encoding to the token's embedding.
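
A minimal NumPy sketch of this sinusoidal encoding is shown below. The function name `positional_encoding` is hypothetical, and the sketch assumes an even embedding size; the constant 10000 is the one used in the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position table of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    table = np.zeros((seq_len, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions use sine
    table[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return table

pe = positional_encoding(seq_len=4, d_model=8)
print(pe.shape)
print(pe[0])
```

Each row of the table is added to the embedding of the token at that position. Position 0 always encodes to alternating zeros and ones, since sin(0) = 0 and cos(0) = 1.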

**Historical context**

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units. Neural networks using multiplicative units were later called sigma-pi networks or higher-order networks. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. One of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer.

#### **Transformer Architecture**

In the Transformer architecture, the self-attention mechanism processes an input sequence by creating a new representation for each token, enriched with context from all other tokens in the sequence. Unlike older recurrent neural networks (RNNs) that process words one by one, self-attention processes the entire sequence in parallel, making it highly efficient.
The core idea is for each token to "look" at all other tokens to determine their relevance and then use that information to create a more informed, context-aware representation of itself.
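
The contrast between one-by-one and parallel processing can be illustrated with a small NumPy sketch. This is only a toy comparison under stated assumptions: the embeddings are random numbers, the weight names `W_x` and `W_h` are hypothetical, and the "attention-style" scores use the raw embeddings directly rather than learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 4                      # toy sizes chosen for illustration
x = rng.normal(size=(seq_len, d))      # random stand-ins for token embeddings

# RNN-style processing: each hidden state depends on the previous one,
# so the steps must run in order, one token at a time.
W_x = rng.normal(size=(d, d)) * 0.1    # hypothetical input weights
W_h = rng.normal(size=(d, d)) * 0.1    # hypothetical recurrent weights
h = np.zeros(d)
states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W_x + h @ W_h)
    states.append(h)
rnn_states = np.stack(states)          # shape: (seq_len, d)

# Attention-style processing: every token is compared with every other
# token in a single matrix product, with no step-by-step dependency.
scores = x @ x.T                       # shape: (seq_len, seq_len)

print(rnn_states.shape)
print(scores.shape)
```

Note that the score matrix has one entry for every pair of tokens, which is why the cost of attention grows quadratically with sequence length.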

**The self-attention process**

For a given input sequence, such as "The animal didn't cross the street because it was too tired," the self-attention process happens in the following stages:

1. **Create Query, Key, and Value vectors:** For every token in the sequence (e.g., "it"), the model creates three distinct vectors:
   * **Query (Q):** Represents the current token, acting like a question used to find related tokens.
   * **Key (K):** Represents the token being looked at, acting like a label for its information.
   * **Value (V):** Contains the content or contextual information of the token.

2. **Calculate attention scores:** To determine how much focus "it" should place on each word, the model calculates a score for every token in the sentence, including "it" itself. This is done by taking the dot product of the current token's query vector with each token's key vector. A high dot-product score indicates a strong relationship between the two tokens.

3. **Scale the scores:** The scores are scaled by dividing them by the square root of the key vector's dimension. This prevents the scores from growing too large, which helps to stabilize training.

4. **Normalize with Softmax:** The scaled scores are passed through a softmax function, which converts them into a probability distribution. This ensures that all the attention weights sum up to 1, making them easier to interpret.

5. **Compute the weighted sum:** Each token's value vector is multiplied by its corresponding softmax score. The weighted value vectors are then summed to produce a new, context-rich output vector for the original token. In the sentence example, this process would give the word "it" a new representation that incorporates information from "animal," correctly linking the two words.
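
The five stages above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular library: the embeddings and the weight matrices `W_q`, `W_k`, and `W_v` are random placeholders (in a real model they are learned), and the sizes `d_model` and `d_k` are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence."""
    # Stage 1: create Query, Key, and Value vectors for every token
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Stage 2: dot product of every query with every key
    scores = Q @ K.T
    # Stage 3: scale by the square root of the key dimension
    scaled = scores / np.sqrt(K.shape[-1])
    # Stage 4: softmax turns each row into weights that sum to 1
    weights = softmax(scaled, axis=-1)
    # Stage 5: weighted sum of the value vectors
    return weights @ V, weights

tokens = "The animal didn't cross the street because it was too tired".split()
rng = np.random.default_rng(42)
d_model, d_k = 16, 8                            # assumed toy dimensions
X = rng.normal(size=(len(tokens), d_model))     # random stand-in embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)           # one context-aware vector per token
print(weights.sum(axis=1))    # each row of attention weights sums to 1
```

Because the weights here are random, the attention pattern is meaningless; only after training would the row for "it" be expected to place high weight on "animal".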

#### **Enhancing self-attention with multi-head attention**

The Transformer architecture takes this mechanism one step further by using **multi-head attention**.

* Instead of a single attention calculation, multi-head attention performs several self-attention calculations in parallel using different learned sets of Q, K, and V weight matrices.
* Each "head" learns to focus on different types of relationships. For example, one head might attend to grammatical connections, while another might focus on semantic meaning.
* The results from each head are then concatenated and passed through a final linear layer to produce the refined output. This gives the model a much richer, multi-contextual understanding of the input.
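
A rough NumPy sketch of these three points follows. It assumes one common way of organizing the computation (project once, then split the result into heads); the function name `multi_head_attention`, the output matrix `W_o`, and all sizes are illustrative choices, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Run several attention heads in parallel and merge their outputs."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads   # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head computes its own scaled dot-product attention
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)          # (num_heads, seq_len, seq_len)
    heads = weights @ V                         # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the final linear layer
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4          # assumed toy sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mha_out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape)    # same shape as the input: (seq_len, d_model)
```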

#### **Preserving word order with positional encoding**

Because the self-attention mechanism processes all tokens in parallel, it inherently loses information about word order. To address this, the Transformer injects positional information into the input embeddings using positional encoding. This is typically done with sinusoidal functions that create a unique vector for each position in the sequence, which is then added to the token's embedding. This process allows the model to capture the sequence's structure without sacrificing parallel processing efficiency.
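
As a small sketch of the sinusoidal scheme, the function below builds one position vector per row, using sine for even dimensions and cosine for odd dimensions. The function name and the sizes are illustrative, the base of 10000 follows the commonly cited formulation, and the code assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) matrix of sinusoidal position vectors."""
    positions = np.arange(max_len)[:, None]         # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    # Each pair of dimensions uses a different wavelength
    angles = positions / np.power(base, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions
    pe[:, 1::2] = np.cos(angles)    # odd dimensions
    return pe

max_len, d_model = 50, 16                           # assumed toy sizes
pe = sinusoidal_positional_encoding(max_len, d_model)

# The encoding is simply added to the token embeddings
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(max_len, d_model))    # random stand-in embeddings
inputs_with_position = embeddings + pe
print(pe.shape)
```

Since the position vectors are added rather than concatenated, the input keeps the same dimensionality, and all positions can still be processed in parallel.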