<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_04_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

## **Module 4: ChatGPT and Large Language Models**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 4 Material

* Part 4.1: Introduction to Large Language Models (LLMs)
* **Part 4.2: Chatbots**
* Part 4.3: Image Generation with StableDiffusion
* Part 4.4: Image Generation with DALL-E

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except Exception:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

You should see the following output except your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.


### Test Your OPENAI_API_KEY

To run the code in this lesson, you need your secret `OPENAI_API_KEY` stored in the **Secrets** panel of this Colab notebook. Detailed steps for purchasing your `OPENAI_API_KEY` and adding it to your Colab notebook Secrets were provided in `Class_04_1`.

Run the code in the next cell to see if your `OPENAI_API_KEY` is installed correctly. You may have to grant access for your notebook to use your API key.

In [None]:
# Verify your API key setup

from google.colab import userdata
import os

# Check if API key is properly loaded
try:
    OPENAI_KEY = userdata.get('OPENAI_API_KEY')
    print("API key loaded successfully!")
    print(f"Key length: {len(OPENAI_KEY)}")
except Exception as e:
    print(f"Error loading API key: {e}")
    print("Please set your API key in Google Colab:")
    print("1. Go to Secrets in the left sidebar")
    print("2. Create a new secret named 'OPENAI_API_KEY'")
    print("3. Paste your OpenAI API key")

1. You may see this message when you run this cell:


![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image08C.png)

If you do see this popup just click on `Grant access`.


2. If your `OPENAI_API_KEY` is correctly installed you should see something _similar_ to the following output.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image09C.png)

3. However, if you see the following output

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image10C.png)

You will need to correct the error before you can continue. Ask your Instructor or TA for help if you cannot resolve the error yourself.
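The check above prints only the length of the key, which avoids exposing it in the notebook output. If you want slightly more detail while troubleshooting, one option is to print a masked version. The sketch below is illustrative only: `mask_secret()` and the sample key are hypothetical and are not part of the lesson code.

```python
def mask_secret(secret, visible=4):
    """Return a masked version of a secret that is safer to print."""
    if not secret:
        return "<missing>"
    if len(secret) <= visible * 2:
        # Too short to reveal any characters safely
        return "*" * len(secret)
    hidden = len(secret) - visible * 2
    return secret[:visible] + "*" * hidden + secret[-visible:]


# Hypothetical placeholder value, not a real API key
example_key = "sk-test-1234567890abcdef"
print(mask_secret(example_key))
```

In the notebook you could pass the value returned by `userdata.get('OPENAI_API_KEY')` instead of the placeholder.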

### Check the LLM Models You Have Access To

Run the code below to see a list of the OpenAI models that your OpenAI API key gives you access to.

In [None]:
# Check LLM models

import openai
import os

# Get the API key
api_key = userdata.get('OPENAI_API_KEY')

if not api_key:
    raise ValueError("Please set OPENAI_API_KEY environment variable")

client = openai.OpenAI(api_key=api_key)

# List available models
try:
    models = client.models.list()
    print("Available models:")
    for model in models.data:
        print(f"  {model.id}")
except Exception as e:
    print(f"Error listing models: {e}")

In order to complete this lesson, you will need to have access to the following 5 models:

Available models:
* **dall-e-3**
* **gpt-4o-mini**
* **gpt-5-mini**
* **tts-1**
* **whisper-1**

If you don't see all 5 models listed, you will need to add the missing ones to your `Default` project on `OpenAI` before you can continue. Please see the PDF instructions on Canvas on how to add models to your Default project in your OpenAI API account.
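The model list can be long, so checking it by eye is error-prone. A minimal sketch of an automated check follows; `find_missing_models()` and `sample_ids` are hypothetical names introduced for illustration, and the sample list stands in for the IDs returned by the API.

```python
# The five models this lesson requires
REQUIRED_MODELS = ["dall-e-3", "gpt-4o-mini", "gpt-5-mini", "tts-1", "whisper-1"]


def find_missing_models(available_ids, required=REQUIRED_MODELS):
    """Return the required model IDs that are not in available_ids."""
    available = set(available_ids)
    return [model_id for model_id in required if model_id not in available]


# Hypothetical sample standing in for the IDs printed by the cell above
sample_ids = ["gpt-4o-mini", "tts-1", "whisper-1", "text-embedding-3-small"]

missing = find_missing_models(sample_ids)
if missing:
    print("Missing models:", ", ".join(missing))
else:
    print("All required models are available.")
```

In the notebook, the sample list could be replaced with `[model.id for model in models.data]` from the previous cell.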

### Install `LangChain` packages

Run the code in the following cell to install the `langchain-openai` and related packages.

In [None]:
# Install langchain-openai package

!pip install -q langchain langchain_openai openai pydub > /dev/null

If the code is correct you should not see any output.
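Because the install command runs quietly, it can be reassuring to confirm that the packages are actually importable. The sketch below uses only the standard library; `check_packages()` is a hypothetical helper, and the module names are the import names assumed for the packages installed above.

```python
import importlib.util


def check_packages(module_names):
    """Map each top-level module name to True if it can be found, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Import names assumed for the packages installed in the previous cell
status = check_packages(["langchain", "langchain_openai", "openai", "pydub"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```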

### **YouTube Introduction to ChatBots**

Run the next cell to see a short introduction to ChatBots. This is a suggested, but optional, part of the lesson.

In [None]:
from IPython.display import HTML
video_id = "gmUHEvrpYoU"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen>
</iframe>
""")

# **Introduction to Speech Processing**

![___](https://biologicslab.co/BIO1173/images/class_04/CourseImage.gif)

In this lesson, we explore how to use both computer-generated voice and voice recognition to create a `ChatBot`. We'll be working with the `OpenAI API` to achieve this. Specifically, we'll demonstrate how to input normal text and have it spoken by the computer, and conversely, how we can speak to the computer and have it respond. We'll ultimately integrate these functionalities to create a chatbot that handles both text-to-speech and speech-to-text interactions.

While we'll use Google `Colab` for this demonstration, in production environments, you'd likely use a mobile app or a web-based JavaScript solution, as each platform handles voice differently. We'll focus on keeping things generic and simple in Colab for now.

Voice applications are everywhere. For example, I can ask "`Alexa`, what time is it?" and multiple `Alexa` devices in my home will respond, although not always perfectly. I usually mute them during recording sessions. Applications like `Siri` or even `ChatGPT` also offer voice interactions. For instance, when you click the voice option in `ChatGPT` on a computer, it starts listening for your input.

To illustrate, I asked `ChatGPT`, "How are you doing?" and it responded by offering some insightful thoughts about generative AI. It highlighted that generative AI isn't just about creating new content but about learning patterns from vast amounts of data and applying them creatively across text, images, and code. It also suggested that students experiment with different approaches, as hands-on experience is one of the best ways to learn.

## **Part 1: Speech to Text**

Here we delve into the realm of speech-to-text technology, focusing on the powerful capabilities offered by OpenAI's models. Speech-to-text, also known as automatic speech recognition (ASR), is a technology that converts spoken language into written text. OpenAI's speech-to-text models represent the cutting edge of this field, leveraging advanced machine learning techniques to achieve high accuracy and robustness across various accents, languages, and acoustic environments. We'll explore how these models can be integrated into applications to enable voice-based interactions, transcription services, and accessibility features. By harnessing OpenAI's speech-to-text technology, we'll unlock new possibilities for human-computer interaction and demonstrate how to transform audio input into actionable text data with remarkable precision.


Note we will make use of the technique described here to record audio in CoLab.

https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be




## **Summary of the Audio Recording Setup in Google Colab**

The code in the cell below sets up the ability to **record audio from the user's microphone** in a **Google Colab notebook** using JavaScript and Python. JavaScript is needed because the Python code runs on a remote Colab server, while the microphone belongs to your browser; browser security restrictions mean that only code running in the page can request microphone access.


In [None]:
# Use JavaScript to record audio in Colab notebook

from IPython.display import Javascript, display, Audio
from google.colab import output
import base64
import io
from pydub import AudioSegment
import time

# Global variable to track recording state
recording_complete = False
recorded_data = None

# Updated RECORD JavaScript with proper callback handling
RECORD = """
const sleep = (time) => new Promise(resolve => setTimeout(resolve, time))
const b2text = (blob) => new Promise((resolve) => {
    const reader = new FileReader()
    reader.onloadend = (e) => resolve(e.srcElement.result)
    reader.readAsDataURL(blob)
})

var record = (time) => new Promise(async (resolve) => {
    try {
        stream = await navigator.mediaDevices.getUserMedia({ audio: true })
        recorder = new MediaRecorder(stream)
        chunks = []
        recorder.ondataavailable = (e) => chunks.push(e.data)
        recorder.start()
        await sleep(time)
        recorder.onstop = async () => {
            blob = new Blob(chunks, { type: 'audio/webm' })
            text = await b2text(blob)
            resolve(text)
        }
        recorder.stop()
    } catch (error) {
        console.error('Recording error:', error)
        resolve(null)
    }
})
"""

def record(seconds=3):
    """
    Record audio using browser microphone with proper synchronization
    """
    global recording_complete, recorded_data

    # Reset tracking variables
    recording_complete = False
    recorded_data = None

    print(f"Recording now for {seconds} seconds...")

    # Display the JavaScript code
    display(Javascript(RECORD))

    # Execute the recording and wait for completion via callback
    def on_recording_complete(result):
        global recording_complete, recorded_data
        recorded_data = result
        recording_complete = True

    # Register a function to handle the result
    output.register_callback('record_completed', on_recording_complete)

    # Start recording using JavaScript
    js_code = f"""
    record({seconds * 1000}).then(result => {{
        google.colab.kernel.invokeFunction('record_completed', [result], {{}})
    }})
    """

    display(Javascript(js_code))

    # Wait for recording to complete (with timeout)
    start_time = time.time()
    while not recording_complete and time.time() - start_time < seconds + 5:
        time.sleep(0.1)

    if not recording_complete:
        raise TimeoutError("Recording timed out")

    if recorded_data is None:
        raise RuntimeError("No audio data received")

    # Process the recorded data
    try:
        binary_data = base64.b64decode(recorded_data.split(',')[1])

        with io.BytesIO(binary_data) as audio_file:
            audio = AudioSegment.from_file(audio_file, format="webm")

        # Export as WAV file
        audio.export("recorded_audio.wav", format="wav")
        print("Recording complete. Audio saved as 'recorded_audio.wav'")

        return audio

    except Exception as e:
        print(f"Error processing audio: {str(e)}")
        raise

# Simple version that should work better in Colab
def record_simple(seconds=3):
    """
    Simplified recording function with proper timing
    """
    print(f"Recording now for {seconds} seconds...")

    # Display the JavaScript code
    display(Javascript(RECORD))

    # Execute recording and wait for result (this approach works better in Colab)
    try:
        duration_ms = int(seconds * 1000)
        s = output.eval_js(f'record({duration_ms})')

        if not s or not isinstance(s, str):
            raise RuntimeError("Failed to capture audio")

        # Decode and process the audio
        binary_data = base64.b64decode(s.split(',')[1])

        with io.BytesIO(binary_data) as audio_file:
            audio = AudioSegment.from_file(audio_file, format="webm")

        # Export as WAV file
        audio.export("recorded_audio.wav", format="wav")
        print("Recording complete. Audio saved as 'recorded_audio.wav'")

        return audio

    except Exception as e:
        print(f"Error during recording: {str(e)}")
        raise

If the code is correct you should not see any output.
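The recording functions above receive the audio from the browser as a base64 data URL and keep only the part after the comma before decoding it. The sketch below isolates that step so it can be tried without a microphone; `decode_data_url()` is a hypothetical helper, and the payload is a stand-in rather than real audio.

```python
import base64


def decode_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    if not isinstance(data_url, str) or not data_url.startswith("data:"):
        raise ValueError("Expected a string starting with 'data:'")
    header, sep, payload = data_url.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("Expected a base64-encoded data URL")
    # Everything between 'data:' and ';base64' describes the content type
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Build a tiny stand-in for what the browser would send back
fake_audio = b"not real audio"
url = "data:audio/webm;base64," + base64.b64encode(fake_audio).decode("ascii")

mime, raw = decode_data_url(url)
print(mime, len(raw))
```

Validating the prefix before decoding gives a clearer error than an `IndexError` from `split(',')[1]` when the browser returns nothing, for example because microphone access was denied.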

### Example 1: Record and Play Audio

The code in the cell below uses the `record_simple()` function to record a `5 second` audio file and then plays it back.

You may need to grant permission for the program to access the microphone on your laptop or computer.

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image07A.png)


In [None]:
# Example 1: Record and Play Audio

# Set recording durations
record_duration = 5

# Record audio
audio = record_simple(record_duration)

# Play back audio
display(Audio("recorded_audio.wav", autoplay=True))

If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image02A.png)
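If you want to confirm that a recording has the length you asked for, Python's built-in `wave` module can read the duration of a WAV file. The sketch below is an assumption-laden illustration: it writes a short silent file to a temporary folder so that it runs anywhere, and both helper functions are hypothetical.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_silent_wav(path, seconds=1.0, rate=16000):
    """Write a mono, 16-bit WAV file containing only silence (demo data)."""
    n_frames = int(seconds * rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames)


# Demo file in a temporary folder; not the lesson's recording
demo_path = os.path.join(tempfile.gettempdir(), "silence_demo.wav")
write_silent_wav(demo_path, seconds=2.0)
print(wav_duration_seconds(demo_path))
```

In the notebook you could call `wav_duration_seconds("recorded_audio.wav")` after recording, assuming the file was exported as standard PCM WAV.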

### **Exercise 1: Record and Play Audio**

In the cell below, use the `record()` function to record an `8 second` audio file and then play it back. Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png)speak into your microphone.

You should not encounter the same problem as in Example 1.

In [None]:
# Insert your code for Exercise 1 here



If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image04A.png)

## **OpenAI Speech-to-Text API**

### Overview of the API
- **Models**: Includes `Whisper` and newer `GPT-4o-based` models.
- **Input**: Accepts audio files (e.g., MP3, WAV, MP4).
- **Output**: Returns transcriptions in formats like plain text, JSON, or subtitle formats (SRT, VTT).
- **Languages**: Supports dozens of languages with high accuracy, especially in noisy or accented speech environments.

### Why It’s Useful for Biomedical Investigators

1. **Transcribing Interviews & Focus Groups**  
   Automatically convert recorded conversations with patients, clinicians, or research participants into text for qualitative analysis.

2. **Clinical Note Dictation**  
   Researchers can dictate observations or notes during fieldwork or lab work, streamlining documentation.

3. **Meeting & Conference Transcripts**  
   Capture and archive discussions from research meetings, seminars, or collaborative calls.

4. **Data Extraction from Audio**  
   Enables downstream NLP tasks like identifying social determinants of health (SDOH) or extracting biomedical entities from spoken content.

5. **Multilingual Support**  
   Useful in global health research where interviews or data collection occur in multiple languages.


The code in the cell below demonstrates how to use OpenAI's speech-to-text API to transcribe audio files. It defines a function, `transcribe_audio()`, that takes a filename as input. The function opens the specified audio file in binary mode and uses the OpenAI client to create a transcription. The `client.audio.transcriptions.create()` method is called with the model (`"whisper-1"`), the audio file, and a `response_format` of `"text"`. `Whisper` is OpenAI's speech recognition model, known for its robustness across languages and accents. The function returns the transcribed text. In the examples that follow, an audio file named `recorded_audio.wav` is transcribed and the resulting text is printed. This is a simple way to convert speech to text, which is useful for tasks such as generating subtitles, creating searchable archives of audio content, or enabling voice commands in applications.


### Create `transcribe_audio()` function

Run the code in the next cell to create the `transcribe_audio(filename)` function. This Python function is designed to transcribe spoken words from an audio file into text using OpenAI's `Whisper` model via their API.

Here's a breakdown of what it does:

* Opens the audio file specified by filename in binary read mode ("rb").
* Sends the file to the OpenAI Whisper model (whisper-1) using the client.audio.transcriptions.create() method.
* Receives the transcription result from the model.
* Returns the transcribed text from the audio.
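Before sending a file to the API, it can help to run a few cheap local checks. The sketch below is not part of the lesson's function: `preflight_audio_check()` is a hypothetical helper, and the size limit and extension list are assumptions based on commonly documented values that you should verify against the current OpenAI documentation.

```python
import os
import tempfile

# Assumed values; check the current OpenAI documentation before relying on them
ASSUMED_MAX_BYTES = 25 * 1024 * 1024
ASSUMED_EXTENSIONS = {".mp3", ".mp4", ".m4a", ".wav", ".webm"}


def preflight_audio_check(path, max_bytes=ASSUMED_MAX_BYTES,
                          extensions=ASSUMED_EXTENSIONS):
    """Return a list of problems; an empty list means the file looks usable."""
    if not os.path.isfile(path):
        return [f"File not found: {path}"]
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in extensions:
        problems.append(f"Unexpected extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size == 0:
        problems.append("File is empty")
    elif size > max_bytes:
        problems.append(f"File is too large: {size} bytes")
    return problems


# Demo with a small placeholder file in a temporary folder
demo_path = os.path.join(tempfile.gettempdir(), "preflight_demo.wav")
with open(demo_path, "wb") as f:
    f.write(b"RIFF")

print(preflight_audio_check(demo_path))
print(preflight_audio_check("no_such_file.wav"))
```

Returning a list of problems, rather than raising on the first one, lets you report everything that is wrong with a file in a single message.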

In [None]:
# Create transcribe_audio() function

from google.colab import userdata
from openai import OpenAI
import os
from pathlib import Path


def transcribe_audio(filename, model="whisper-1", language=None, prompt=None):
    """
    Transcribe audio file using OpenAI's Whisper API

    Args:
        filename (str): Path to the audio file
        model (str): Whisper model to use (default: "whisper-1")
        language (str): Language code (e.g., "en", "es") - optional
        prompt (str): Optional text prompt for transcription - optional

    Returns:
        str: Transcribed text

    Raises:
        FileNotFoundError: If audio file doesn't exist
        ValueError: If API key is not set or invalid
        Exception: For other transcription errors
    """

    # Input validation
    if not filename:
        raise ValueError("Filename cannot be empty")

    if not os.path.exists(filename):
        raise FileNotFoundError(f"Audio file not found: {filename}")

    # Check if OpenAI API key is set
    api_key = userdata.get('OPENAI_API_KEY')
    if not api_key:
        raise ValueError("OPENAI_API_KEY environment variable not set")

    try:
        # Initialize client
        client = OpenAI(api_key=api_key)

        # Prepare transcription parameters (the file is added below so that
        # it is opened only once and always closed)
        params = {
            "model": model,
            "response_format": "text"  # Return as text directly
        }

        # Add optional parameters if provided
        if language:
            params["language"] = language
        if prompt:
            params["prompt"] = prompt

        # Perform transcription, forwarding the optional parameters
        with open(filename, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                file=audio_file,
                **params
            )

        return transcription

    except FileNotFoundError:
        raise FileNotFoundError(f"Audio file not found: {filename}")
    except Exception as e:
        if "api key" in str(e).lower() or "authentication" in str(e).lower():
            raise ValueError("Invalid or missing OpenAI API key")
        elif "model" in str(e).lower() and "not found" in str(e).lower():
            raise ValueError(f"Model '{model}' not available")
        else:
            raise Exception(f"Transcription failed: {str(e)}")

If the code is correct you should not see any output.

## Example 2: Speech-to-Text

The code in the cell below records your voice and converts it into text by calling the same `client.audio.transcriptions.create()` method that `transcribe_audio()` uses.

Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png)start counting out loud from `1` to `10`.

In [None]:
# Example 2: Speech-to-Text

import openai
import os
from IPython.display import Audio

# Set LLM model
LLM_MODEL = "whisper-1"

# Set recording duration
record_duration = 10

# Record audio using your improved recording function
audio = record_simple(record_duration)

# Set api key
api_key = userdata.get('OPENAI_API_KEY')

if not api_key:
    raise ValueError("Please set OPENAI_API_KEY environment variable")

# Initialize OpenAI client
client = openai.OpenAI(api_key=api_key)

# Transcribe the audio file using whisper-1 model (best for speech-to-text)
try:
    with open("recorded_audio.wav", "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model=LLM_MODEL,
            file=audio_file,
            response_format="text"
        )

        print("Transcription:")
        print(transcription)

except Exception as e:
    print(f"Error during transcription: {str(e)}")

# Display the audio file
display(Audio("recorded_audio.wav", autoplay=True))


If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image38C.png)

## **Exercise 2: Speech-to-Text**

In the cell below, write the code to generate Speech-to-Text using the code in Example 2 as a template.

For **Exercise 2**, once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png)start counting **_backwards_** from `10` to `1`.

In [None]:
# Insert your code for Exercise 2 here



If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image39C.png)

## **Part 2: Text to Speech**

In Part 2, we'll explore text-to-speech (TTS) models, focusing on OpenAI's offerings. We'll primarily use OpenAI's `TTS-1` model, which converts written text into natural-sounding speech.

`TTS-1` is optimized for real-time applications, making it ideal for scenarios that require low-latency audio generation. This model represents a significant advancement in speech synthesis technology, leveraging deep learning techniques to produce high-quality, lifelike vocal outputs. By delving into TTS-1, we'll explore its capabilities, examine its practical applications, and understand how it's revolutionizing various industries, from accessibility solutions to interactive voice responses and beyond.


## **OpenAI's Voices**

When using `OpenAI's` text-to-speech API to generate spoken audio from text, you can select one of several voices. The `client.audio.speech.create()` method is called with three main parameters: the model (`"tts-1"`), the voice (e.g. `"alloy"`), and the input text. `OpenAI` offers several voice options, including:

* **alloy** - neutral
* **echo** - young
* **fable** - male
* **onyx** - deep male
* **nova** - female
* **shimmer** - warm female

Each voice has its own characteristics, allowing users to choose the most suitable one for their application. Additionally, `OpenAI` provides a high-definition model called `tts-1-hd` for enhanced audio quality, though it may have higher latency. The method returns a response object, from which the audio content is extracted and stored in the `audio_data` variable for further processing or playback.
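A misspelled voice name is an easy mistake to make and would only surface as an API error. A minimal sketch of a local check follows, using the six voice names listed above; `normalize_voice()` is a hypothetical helper and not part of the OpenAI library.

```python
# The six voice names listed in this lesson
VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def normalize_voice(name, default="alloy"):
    """Return a lowercase voice name from VOICES, or raise ValueError."""
    if name is None:
        return default
    cleaned = str(name).strip().lower()
    if cleaned not in VOICES:
        raise ValueError(
            f"Unknown voice '{name}'. Choose one of: {', '.join(VOICES)}"
        )
    return cleaned


print(normalize_voice(" Nova "))

try:
    normalize_voice("shimmmer")  # deliberate typo
except ValueError as err:
    print(err)
```

Checking the name before calling the API gives a faster and more specific error message, and it avoids spending a request on a typo.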

### Example 3: Demonstrate Different Voices

The code in the cell below demonstrates 3 of the different voices that are available in the `OpenAI` text-to-speech API:

* **alloy**
* **echo**
* **fable**

Run the code cell to hear each of these three voices.

In [None]:
# Example 3: Demonstrate different voices

import io
from openai import OpenAI
from IPython.display import Audio, display
import os
from pydub import AudioSegment
import time
from google.colab import userdata

def demonstrate_voices():
    """
    Demonstrate different text-to-speech voices with improved error handling
    """

    # Initialize OpenAI client
    # Set api key
    api_key = userdata.get('OPENAI_API_KEY')
    if not api_key:
        raise ValueError("Please set OPENAI_API_KEY environment variable")

    client = OpenAI(api_key=api_key)

    # Define voices
    voices = ["alloy", "echo", "fable"]
    audio_segments = []

    print("Generating speech with different voices...")

    try:
        # Loop through voices
        for i, voice in enumerate(voices):
            text = f"Hello, Welcome to BIO 1 1 7 3...Introduction to Computational Biology..., I am the {voice} voice."

            print(f"Generating audio with '{voice}' voice...")

            response = client.audio.speech.create(
                model="tts-1",
                voice=voice,
                input=text,
                speed=1.0  # Optional: adjust speech speed
            )

            audio_segments.append(response.content)
            print(f"✓ Completed voice: {voice}")

        if not audio_segments:
            raise ValueError("No audio segments were generated")

        # Combine audio segments with proper handling
        print("Combining audio segments...")
        combined_audio = AudioSegment.empty()

        for i, segment in enumerate(audio_segments):
            try:
                # Convert each segment to AudioSegment
                audio_segment = AudioSegment.from_file(io.BytesIO(segment), format="mp3")
                combined_audio += audio_segment

                # Add a small pause between voices (optional)
                if i < len(audio_segments) - 1:  # Not the last segment
                    silence = AudioSegment.silent(duration=500)  # 500ms silence
                    combined_audio += silence

            except Exception as e:
                print(f"Warning: Could not process segment {i}: {e}")
                continue

        if len(combined_audio) == 0:
            raise ValueError("No valid audio was created")

        # Convert the combined audio to a byte stream
        buffer = io.BytesIO()
        combined_audio.export(buffer, format="mp3")
        buffer.seek(0)

        # Play the audio in Colab
        print("\nPlaying combined audio with all voices:")
        display(Audio(buffer.read(), autoplay=True))

        # Optional: Save to file
        output_filename = f"combined_voices_{int(time.time())}.mp3"
        with open(output_filename, "wb") as f:
            buffer.seek(0)
            f.write(buffer.read())
        print(f"\nAudio saved as: {output_filename}")

    except Exception as e:
        print(f"Error during voice demonstration: {str(e)}")
        raise

# Run the demonstration
try:
    demonstrate_voices()
except ValueError as ve:
    print(f"Setup Error: {ve}")
except Exception as e:
    print(f"Unexpected error: {e}")

If the code is correct you should see the following output

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image41C.png)

You should have heard 3 different voices speaking.

### **Exercise 3: Demonstrate Different Voices**

In the cell below, write the code to demonstrate the other 3 voices that are available:

* **onyx**
* **nova**
* **shimmer**

You should use the code in Example 3 as a template.

In [None]:
# Insert your code for Exercise 3 here



If the code is correct you should see the following output

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image42C.png)

You should have heard 3 different voices speaking.

### Create `generate_text()` function

Run the code in the next cell to create the `generate_text()` function. Despite its name, this function converts text into speech and returns the audio as bytes.

In [None]:
# Create generate_text() function

from google.colab import userdata

def generate_text(text, voice="alloy", model="tts-1", speed=1.0, response_format="mp3"):
    """
    Generate audio from text using OpenAI's text-to-speech API

    Args:
        text (str): The text to convert to speech
        voice (str): The voice to use (default: "alloy")
        model (str): The TTS model to use (default: "tts-1")
        speed (float): Speech speed (0.25 to 4.0, default: 1.0)
        response_format (str): Output format ("mp3", "opus", "aac", "flac") (default: "mp3")

    Returns:
        bytes: Audio data as bytes

    Raises:
        ValueError: If input parameters are invalid
        Exception: For API errors
    """

    # Input validation
    if not text or not isinstance(text, str):
        raise ValueError("Text must be a non-empty string")

    if not voice or not isinstance(voice, str):
        raise ValueError("Voice must be a valid string")

    if speed < 0.25 or speed > 4.0:
        raise ValueError("Speed must be between 0.25 and 4.0")

    # Initialize client globally
    global client
    if 'client' not in globals():
        # Set api key
        api_key = userdata.get('OPENAI_API_KEY')
        if not api_key:
            raise ValueError("Please set OPENAI_API_KEY environment variable")
        client = OpenAI(api_key=api_key)

    try:
        response = client.audio.speech.create(
            model=model,
            voice=voice,
            input=text,
            speed=speed,
            response_format=response_format
        )

        return response.content

    except Exception as e:
        print(f"Error generating speech: {str(e)}")
        raise

### Create `speak_text()` function

Run the next cell to create the `speak_text()` function.

In [None]:
# Create speak_text() function

from google.colab import userdata

# Set the api_key
api_key = userdata.get('OPENAI_API_KEY')

# Generate text
def speak_text(text, voice="alloy", model="tts-1", speed=1.0,
                       autoplay=True, save_to_file=None, play_after_save=False):
    """
    Advanced text-to-speech function with additional features

    Args:
        text (str): The text to convert to speech
        voice (str): The voice to use
        model (str): The TTS model to use
        speed (float): Speech speed
        autoplay (bool): Whether to automatically play the audio
        save_to_file (str): Optional filename to save the audio file
        play_after_save (bool): Whether to play after saving

    Returns:
        tuple: (audio_data, file_path) - audio data and saved file path
    """

    # Generate audio
    audio_data = generate_text(text, voice, model, speed)

    # Save if requested
    file_path = None
    if save_to_file:
        try:
            with open(save_to_file, "wb") as f:
                f.write(audio_data)
            file_path = save_to_file
            print(f"Audio saved to: {file_path}")

        except Exception as e:
            print(f"Warning: Could not save file: {e}")

    # Play audio once, even if both autoplay and play_after_save are set
    if autoplay or (file_path and play_after_save):
        display(Audio(audio_data, autoplay=True))

    return audio_data, file_path



### Example 4: Transcribe Recorded Data

The code in the cell below shows how to record your speech, print out a transcription of what you said, and finally, read the transcription back using the "alloy" voice.

Once you hit the run cell icon
![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image10A.png), read aloud Carl Sandburg's poem "Fog", a short, imagistic piece that captures the quiet, mysterious arrival of fog. Don't forget to start by saying the title of the poem, "FOG".

```text
FOG

The fog comes
on little cat feet.

It sits looking
over harbor and city
on silent haunches
and then moves on.
```

In [None]:
# Example 4: Transcribe recorded audio

# Define voice
voice = "alloy"

# Define recording duration
duration = 20

try:
    # Record audio using your improved recording function
    audio = record_simple(duration)

    # Transcribe audio
    transcription = transcribe_audio("recorded_audio.wav")
    print("Transcription:")
    print(transcription)

    # Speak transcription
    speak_text(transcription, voice)

except Exception as e:
    print(f"Error in the process: {e}")

If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image14A.png)

### **Exercise 4: Transcribe Recorded Data**

In the cell below, write the code to record your speech, print out a transcription of what you said, and finally, read the transcription using the "onyx" voice.

After you start running the cell, start reading _The Red Wheelbarrow_ by William Carlos Williams. Like _Fog_, it’s a minimalist, imagist poem that captures a vivid moment with few words:

```text
The Red Wheelbarrow

so much depends
upon

a red wheel
barrow

glazed with rain
water

beside the white
chickens.
```

In [None]:
# Insert your code for Exercise 4 here



If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image09A.png)

## **Part 3: Chatbots**

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_3_image10A.png)

The history of **chatbots** is a fascinating journey through the evolution of artificial intelligence and human-computer interaction. Here's a brief overview:

* **1. The Early Days (1950s-1970s)**
  * 1950 - Alan Turing's "Imitation Game": Turing proposed a test (now known as the Turing Test) to determine if a machine could exhibit intelligent behavior indistinguishable from a human.
  * 1966 - ELIZA: Created by Joseph Weizenbaum at MIT, ELIZA was the first chatbot. It mimicked a Rogerian psychotherapist by rephrasing user input into questions. It was simple but groundbreaking.
  * 1972 - PARRY: Developed by Kenneth Colby, PARRY simulated a person with paranoid schizophrenia. It was more complex than ELIZA and could hold more realistic conversations.
* **2. Rule-Based Systems (1980s-1990s)**
  * Chatbots during this era used hand-coded rules and decision trees.
  * They were mostly used in academic research, customer service, and early virtual assistants.
  * Examples include Jabberwacky (late 1980s), which aimed to simulate natural human chat through learning.
* **3. Rise of the Internet and AI (2000s)**
  * SmarterChild (2001): A popular chatbot on AOL Instant Messenger and MSN Messenger. It could answer questions, play games, and chat casually.
  * ALICE (Artificial Linguistic Internet Computer Entity): Created by Richard Wallace, it won the Loebner Prize (a Turing Test competition) multiple times.
* **4. Machine Learning and NLP Boom (2010s)**
  * 2011 - Siri: Apple introduced Siri, a voice-activated assistant that brought chatbots into the mainstream.
  * 2014 - Alexa and Cortana: Amazon and Microsoft launched their own virtual assistants.
  * 2016 - Facebook Messenger Bots: Facebook opened its platform to developers, leading to a surge in chatbot development for businesses.
* **5. Neural Networks and Transformers (Late 2010s-2020s)**
  * 2018 - BERT (Google) and GPT (OpenAI): These transformer-based models revolutionized natural language understanding and generation.
  * 2020 - GPT-3: A massive leap in chatbot capabilities, enabling more coherent, context-aware, and human-like conversations.
  * 2022 - ChatGPT: OpenAI released ChatGPT based on GPT-3.5 and later GPT-4, making advanced conversational AI widely accessible.
* **6. The Present and Future (2020s-Today)**
  * Chatbots are now integrated into education, healthcare, customer service, entertainment, and more.
  * Multimodal models (like GPT-4 and beyond) can understand text, images, and even audio.
  * The focus is shifting toward personalization, emotional intelligence, and ethical AI.

### Create `ChatBot` Class

The code in the next cell creates a class called `ChatBot`. In Python, functions and classes are both fundamental building blocks, but they serve different purposes. A Python **`function`** is a reusable block of code that performs a specific task. A Python **`class`**, on the other hand, is a blueprint for creating objects. It groups related data and behaviors together.

**Key Features of a `Class`**

* Encapsulates data (attributes) and functions (methods) that operate on that data.
* Supports object-oriented programming (OOP).
* Allows for inheritance, encapsulation, and polymorphism.
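
As a minimal sketch of the difference between a function and a class, the cell below defines one of each. The names `gc_content` and `DnaSequence` are made up for this illustration and are not used elsewhere in the lesson.

```python
# Hypothetical example: a function versus a class

def gc_content(sequence):
    """Function: performs one reusable task and stores no data."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class DnaSequence:
    """Class: bundles data (attributes) with behavior (methods)."""

    def __init__(self, name, sequence):
        # Attributes: data stored on each object
        self.name = name
        self.sequence = sequence.upper()

    def gc_content(self):
        # Method: a function that works on the object's own data
        return gc_content(self.sequence)

    def length(self):
        return len(self.sequence)


# Create an object (an "instance") from the class blueprint
seq = DnaSequence("demo", "ATGCGC")
print(seq.name, seq.length(), round(seq.gc_content(), 2))
```

The function has to be handed its data every time it is called, while the object created from the class carries its data with it.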

In [None]:
# Create Chatbot Class

import os
from langchain_core.runnables import RunnablePassthrough
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from IPython.display import display_markdown
from google.colab import userdata

# Retrieve the API key and set it as an environment variable
OPENAI_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_KEY

DEFAULT_TEMPLATE = """You are a helpful assistant. Do not use markdown, just regular text.
Limit your response to just a few sentences. If the user says something that indicates
that they wish to end the chat, return just "bye" (no quotes), so I can end the loop.

Current conversation:
{history}
Human: {input}
AI:"""

MODEL = 'gpt-5-mini'

class ChatBot:
    def __init__(self, template=DEFAULT_TEMPLATE):
        """
        Initializes the ChatBot with a language model and conversation template.
        The API key is retrieved from the environment variable.
        """
        self.llm_chat = ChatOpenAI(model=MODEL)
        self.template = template
        self.prompt_template = PromptTemplate(input_variables=["history", "input"], template=self.template)

        # Store the running conversation so each prompt can include earlier turns
        self.memory = ConversationBufferMemory(memory_key="history", return_messages=True)

        # Build the chain manually without relying on deprecated `ConversationChain`.
        # The {history} slot is filled in by the `format_history` method before each call.
        self.chain = (
            RunnablePassthrough() |
            self.prompt_template |
            self.llm_chat
        )

    def chat(self, prompt):
        print(f"Human: {prompt}")

        # Build the prompt from the earlier turns plus the new user message
        formatted_inputs = self.format_history({"input": prompt})

        response = self.chain.invoke(formatted_inputs)

        # Record both sides of this turn so later prompts include them
        self.memory.chat_memory.add_user_message(prompt)
        self.memory.chat_memory.add_ai_message(response.content)

        display_markdown(response.content, raw=True)  # ChatOpenAI returns a message; .content is its text
        return response.content

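    def format_history(self, inputs):
        # Turn the stored messages into plain "Human: ..." / "AI: ..." lines
        history = self.memory.load_memory_variables({})["history"]
        formatted_history = "\n".join(
            f"{'Human' if m.type == 'human' else 'AI'}: {m.content}" for m in history
        )
        inputs['history'] = formatted_history
        return inputs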

    def clear_memory(self):
        self.memory.clear()


If the code is correct you should not see any output.
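
To see what the `{history}` placeholder in the template is for, here is a minimal sketch of conversation memory written in plain Python, with no LangChain and no call to an LLM. `SimpleMemory` and `TEMPLATE` are hypothetical names used only for this illustration; the `ChatBot` class above relies on LangChain's own memory object instead.

```python
# Hypothetical sketch: how earlier turns are carried into the next prompt

TEMPLATE = """Current conversation:
{history}
Human: {input}
AI:"""


class SimpleMemory:
    def __init__(self):
        self.turns = []  # list of (speaker, text) pairs

    def add(self, speaker, text):
        self.turns.append((speaker, text))

    def as_text(self):
        return "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)


memory = SimpleMemory()
memory.add("Human", "Hello, my name is Rowdy.")
memory.add("AI", "Nice to meet you, Rowdy!")

# The next prompt repeats every earlier turn, which is how the model "remembers"
prompt = TEMPLATE.format(history=memory.as_text(), input="What is my name?")
print(prompt)
```

The model itself keeps no state between calls; anything it should remember has to be sent again as part of the prompt.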

### Example 5: Communicate with GPT-5-mini

Now that we have created our `ChatBot` class, we can use it to communicate with an LLM. The code in the next cell shows how to converse with `gpt-5-mini`.

In [None]:
# Example 5: Communicate with GPT-5-mini

c = ChatBot()
response = c.chat("Hello, my name is Rowdy the Roadrunner.")
print(response)

If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image45C.png)

Ignore any warnings that you might receive.

### **Exercise 5: Communicate with LLM**

In the next cell write the code needed to introduce yourself to the `gpt-5-mini` LLM. In other words use your first name instead of `Rowdy the Roadrunner`.

In [None]:
# Insert your code for Exercise 5 here




If the code is correct, you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image46C.png)

except your first name should appear instead of `David` (unless your first name is `David`).

## **Integrate OpenAI's Whisper and TTS (Text-to-Speech) with JavaScript**

The Python code in the cell below integrates **OpenAI's Whisper and TTS (Text-to-Speech)** models with **Google Colab's JavaScript capabilities** to create an interactive voice interface.

**Why Use JavaScript for Microphone Input in Colab?**

Google Colab runs Python code on a remote server, not on your local machine.

This means:

* Python in Colab **cannot directly access your hardware**, like your microphone or webcam.

* However, **JavaScript runs in your browser**, which can access local devices (with permission).

**What JavaScript Enables**

By embedding JavaScript in a Colab cell, you can:

* Prompt the user for microphone access.
* Record audio using the browser's MediaRecorder API.
* Convert the audio to a base64 string.
* Send that string back to Python for processing.
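
The last two steps can be sketched in Python alone. The example below builds a stand-in for the string the browser would send back (a data URL whose payload is base64 text) and then decodes it. The bytes and the `audio/webm` label are placeholders chosen for this illustration; a real recording is much longer and its header depends on the browser.

```python
import base64

# Stand-in for the browser's result: "data:<type>;base64,<payload>"
fake_audio_bytes = b"\x1aE\xdf\xa3 not real audio"
data_url = "data:audio/webm;base64," + base64.b64encode(fake_audio_bytes).decode("ascii")

# Python side: keep everything after the first comma, then decode it
header, payload = data_url.split(",", 1)
decoded = base64.b64decode(payload)

print(header)
print(decoded == fake_audio_bytes)
```

Base64 is used because it turns arbitrary binary audio into ordinary text, which can be passed safely from JavaScript to Python.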

In [None]:
# Integrate OpenAI's Whisper and TTS (Text-to-Speech) with JavaScript

from openai import OpenAI
from IPython.display import Javascript, Audio, display, HTML
from google.colab import output
import base64
from base64 import b64decode
import io
import time
import uuid
from pydub import AudioSegment
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage
from google.colab import userdata

# Initialize OpenAI client
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
    const reader = new FileReader()
    reader.onloadend = e => resolve(e.srcElement.result)
    reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
    stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    recorder = new MediaRecorder(stream)
    chunks = []
    recorder.ondataavailable = e => chunks.push(e.data)
    recorder.start()
    await sleep(time)
    recorder.onstop = async ()=>{
        blob = new Blob(chunks)
        text = await b2text(blob)
        resolve(text)
    }
    recorder.stop()
})
"""

def generate_text(text, voice="nova", model="tts-1", speed=1.0):
    response = client.audio.speech.create(
        model=model,
        voice=voice,
        input=text,
        speed=speed
    )
    return response.content

def speak_text(text, autoplay=True):
    audio_data = generate_text(text)

    # Encode audio to base64 for embedding in HTML
    audio_b64 = base64.b64encode(audio_data).decode('utf-8')

    # Generate a unique ID for this audio element
    audio_id = f"audio_{uuid.uuid4().hex}"

    # Display the audio with the unique ID.
    # `autoplay` is a boolean HTML attribute, so it is included only when wanted.
    autoplay_attr = "autoplay" if autoplay else ""
    audio_html = f'''
    <audio id="{audio_id}" src="data:audio/mpeg;base64,{audio_b64}" {autoplay_attr} style="display: none;">
    </audio>
    '''
    display(HTML(audio_html))

    # Create a hidden div to store the audio status
    status_div = f'<div id="{audio_id}_status" style="display: none;">playing</div>'
    display(HTML(status_div))

    # JavaScript to handle audio playback and status
    js_code = f"""
    var audioElement = document.getElementById('{audio_id}');
    if (audioElement) {{
        audioElement.onended = function() {{
            document.getElementById('{audio_id}_status').textContent = 'finished';
        }};
    }}
    """

    # Execute the JavaScript
    display(HTML(f"<script>{js_code}</script>"))

    # Nothing to wait for if playback was not started
    if not autoplay:
        return

    # Wait for the audio to finish; the 300-second limit is an arbitrary
    # safety net so the cell cannot hang forever if playback never ends
    deadline = time.time() + 300
    while time.time() < deadline:
        try:
            status = output.eval_js(f"document.getElementById('{audio_id}_status').textContent")
            if status == 'finished':
                break
        except Exception:
            pass  # The element may not be ready yet; try again
        time.sleep(0.1)

def record(seconds=3):
    print("ChatBot is listening....")
    try:
        # Load the JavaScript recorder into the browser, then run it
        display(Javascript(RECORD))
        s = output.eval_js('record(%d)' % (seconds * 1000))
    except Exception as e:
        print(f"Recording error: {e}")
        return None

    binary = b64decode(s.split(',')[1])

    # Convert to AudioSegment
    audio = AudioSegment.from_file(io.BytesIO(binary), format="webm")

    # Export as WAV
    audio.export("recorded_audio.wav", format="wav")
    print("ChatBot is processing data...")
    return "recorded_audio.wav"

def transcribe_audio(filename):
    with open(filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcription.text

# Initialize the OpenAI LLM
llm = ChatOpenAI(
    openai_api_key=userdata.get('OPENAI_API_KEY'),
    model="gpt-5-mini",
    temperature=0.3,
    n=1
)

def start_chatbot():
    """Trigger function to start the ChatBot session"""
    print("ChatBot is ready! How can I help you? (say 'bye' to exit).")

    while True:
        try:
            filename = record(5)
            if filename is None:
                break

            transcription = transcribe_audio(filename)
            print(f"Human: {transcription}")

            # Exit condition based on human input
            if "bye" in transcription.lower():
                print("AI: Goodbye — take care!")
                speak_text("Goodbye — take care!")
                break

            # Simple chat interaction:
            ai_response = llm.invoke([HumanMessage(content=transcription)]).content
            response = ai_response.strip()
            print(f"AI: {response}")
            speak_text(response)

        except KeyboardInterrupt:
            print("\nChat ended by user")
            break
        except Exception as e:
            print(f"Error: {e}")
            break

    print("ChatBot session ended.")

### Example 6: Conversation with Chatbot

We now carry on a conversation with our chatbot until the user asks it to end.

For Example 6, first ask the LLM **"What is the capital of Texas?"** and wait for the LLM to finish answering. Then say **"bye"** to end your conversation.

In [None]:
# Example 6: Communicate with llm

from langchain_openai import ChatOpenAI
from google.colab import userdata
import sys

# First, make sure you've run the modified ChatBot code cell above
# (The one that contains start_chatbot() function)

# Set up your OpenAI key and model
OPENAI_KEY = userdata.get('OPENAI_API_KEY')
MODEL = 'gpt-5-mini'  # Using your specified model

# Initialize the LLM with your API key
llm = ChatOpenAI(
    model=MODEL,
    temperature=0.3,
    openai_api_key=OPENAI_KEY
)

# To start the ChatBot conversation, simply call:
start_chatbot()

### **Exercise 6: Conversation with Chatbot**

In the cell below write the code to start a new conversation with the `Chatbot`. Ask your `Chatbot` for **answers to 5 different questions** of your own choosing. After the 5th question has been answered, terminate your conversation by saying the word **"bye"**.

In [None]:
# Insert your code for Exercise 6 here



The output you should see depends upon your 5 questions.


## **Part 4: Text-to-Image**

**Text-to-Image programs** are AI-powered tools that generate images based on textual descriptions. These systems use deep learning models, particularly **generative models** like **diffusion models** or **GANs (Generative Adversarial Networks)**, to interpret and visualize the content described in natural language.

**How They Work**

1. **Input**: A user provides a text prompt (e.g., "a futuristic city at sunset").
2. **Processing**: The model analyzes the prompt using natural language understanding and maps it to visual concepts.
3. **Generation**: The model synthesizes an image that matches the description, often refining it through multiple steps (as in diffusion models).

**Popular Examples**
- **DALL·E** (by OpenAI)
- **Midjourney**
- **Stable Diffusion**
- **Adobe Firefly**

**Applications**

- **Art and Design**: Creating concept art, illustrations, and visual assets.
- **Education**: Visualizing historical scenes or scientific concepts.
- **Marketing**: Generating visuals for campaigns and branding.
- **Entertainment**: Storyboarding and character design.

**Limitations**

- May misinterpret vague or complex prompts.
- Can reflect biases present in training data.
- Image quality varies depending on model and prompt specificity.



### Example 7: Text-to-Image

The code in the next cell uses the `dall-e-3` Text-to-Image program to generate a picture of a Welsh Corgi Pembroke puppy using the following prompt:

```text
# Define your image prompt
PROMPT="a Welsh Corgi Pembroke puppy"
TITLE="Welsh Corgi Puppy"
```

In [None]:
# Example 7: Text-to-image

import openai
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
from google.colab import userdata

# Define your image prompt
PROMPT="a Welsh Corgi Pembroke puppy"
TITLE="Welsh Corgi Puppy"

# Get the API key from your Colab secrets
api_key = userdata.get('OPENAI_API_KEY')
client = openai.OpenAI(api_key=api_key)

# Generate a single image
response = client.images.generate(
    model="dall-e-3",
    prompt=PROMPT,
    size="1024x1024",
    quality="standard",
    n=1
)

# Get the image URL
image_url = response.data[0].url

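# Fetch the image, stopping with an error if the download failed
img_response = requests.get(image_url)
img_response.raise_for_status()
img = Image.open(BytesIO(img_response.content))

# Save the image (JPEG has no transparency, so convert to RGB first)
img.convert("RGB").save("dalle3_image.jpg", "JPEG")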

# Show the image
plt.imshow(img)
plt.axis('off')
plt.title(TITLE)
plt.show()


If the code is correct, you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image20A.png)

There is a degree of randomization in Text-to-Image programs so that each image generated is different.

### **Exercise 7: Text-to-Image**

In the next cell write the code to use the `dall-e-3` Text-to-Image program to generate a picture. You are free to generate any picture that you want. Don't forget to change the **image title** to match your subject.

### **NOTICE**

Text-to-Image programs have restrictions on the kinds of images that can be generated. If you try to generate a pornographic image or another censored image type, you will receive the following error message:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image21A.png)



In [None]:
# Insert your code for Exercise 7 here



The output you see will depend upon your prompt. Make sure that the image title matches the subject matter of your image.

## Why DALL·E 3 Generates Different Images from the Same Prompt

**DALL·E 3**, like other generative AI models, can produce **varied outputs** even when the input prompt is identical. This behavior is intentional and rooted in how the model is designed.

### **Key Reasons for Variation**

#### 1. **Stochastic Sampling**
- DALL·E 3 uses **randomness** during the image generation process.
- Even with the same prompt, the model samples from a distribution of possible outputs, leading to different results.

#### 2. **Latent Space Diversity**
- The model operates in a **latent space** where many visual interpretations of a prompt can exist.
- For example, "a cat on a windowsill" could vary in breed, lighting, style, background, and pose.

#### 3. **Prompt Interpretation**
- Natural language is inherently **ambiguous**.
- The model may emphasize different aspects of the prompt each time (e.g., focusing more on "windowsill" vs. "cat").

#### 4. **Model Temperature Settings**
- Some generative models expose a **temperature** (a parameter controlling randomness); the `dall-e-3` call used in this lesson has no such setting.
- Where it is available, a higher temperature gives more creative and varied outputs.

#### 5. **Fine-Tuning and Updates**
- If the model has been updated or fine-tuned, even subtle changes can affect output consistency.

### Can You Get Consistent Results?

Yes, but with limitations:
- Use **prompt engineering** to be extremely specific.
- Some platforms allow setting a **seed value** to control randomness (though DALL·E 3 may not expose this directly).
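
The ideas of stochastic sampling, temperature, and seed values can be illustrated with a toy example. The sketch below is not how DALL·E 3 works internally; it only picks from three made-up options with made-up scores, using Python's `random` module, to show that the same seed reproduces the same choices and that a lower temperature makes the choices less varied. All names in it (`sample_with_temperature`, `draw`, `options`, `scores`) are hypothetical.

```python
import math
import random

options = ["tabby cat", "black cat", "orange cat"]
scores = [2.0, 1.0, 0.5]  # made-up preference scores


def sample_with_temperature(scores, temperature, rng):
    """Pick an index after applying a softmax with the given temperature."""
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - biggest) for s in scaled]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def draw(seed, temperature, n=5):
    """Draw n options using a random generator started from `seed`."""
    rng = random.Random(seed)
    return [options[sample_with_temperature(scores, temperature, rng)] for _ in range(n)]


print(draw(seed=42, temperature=1.0))
print(draw(seed=42, temperature=1.0))   # same seed: identical list
print(draw(seed=7, temperature=1.0))    # different seed: may differ
print(draw(seed=7, temperature=0.01))   # very low temperature: top option every time
```

With a fixed seed the "random" choices are repeatable, which is why platforms that expose a seed can return the same image twice.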



### Example 8: Create Multiple Images

To illustrate the degree of variation between images generated by exactly the same prompt, the code in the cell below generates 3 images using the same prompt that was used in Example 7.

In [None]:
# Example 8: Create multiple images

import openai
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
from google.colab import userdata

# Define your image prompt and title
PROMPT = "a Welsh Corgi Pembroke puppy"
TITLE = "Welsh Corgi Puppy"

# Get the API key from your Colab secrets
api_key = userdata.get('OPENAI_API_KEY')
client = openai.OpenAI(api_key=api_key)

# Generate and save 3 images
for i in range(3):
    response = client.images.generate(
        model="dall-e-3",
        prompt=PROMPT,
        size="1024x1024",
        quality="standard",
        n=1
    )

    image_url = response.data[0].url
    img_response = requests.get(image_url)
    img_response.raise_for_status()  # Stop if the download failed
    img = Image.open(BytesIO(img_response.content))
    img.convert("RGB").save(f"dalle3_image_{i+1}.jpg", "JPEG")

# Display images side-by-side
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, ax in enumerate(axes):
    img = Image.open(f"dalle3_image_{i+1}.jpg")
    ax.imshow(img)
    ax.axis('off')
    ax.set_title(f"{TITLE} #{i+1}")
plt.tight_layout()
plt.show()


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_04/class_04_2_image22A.png)

### **Exercise 8: Create Multiple Images**

In the cell below, write the code to create three images using the same prompt that you used above in **Exercise 7**.

In [None]:
# Insert your code for Exercise 8 here



The output you see will depend upon your prompt. Make sure that the image title matches the subject matter of your image.

# **Lesson Turn-in**
When you have completed and run all of the code cells, generate a PDF of your Colab notebook. If you are running Windows 10 or 11, use `File --> Print.. --> Microsoft Print to PDF`. If you have a Mac, use `File --> Print.. --> Save as PDF`.

In either case, save your PDF as Copy of Class_04_2.lastname.pdf where lastname is your last name, and upload the file to Canvas.

**NOTE TO WINDOWS USERS:** Your grade will be **reduced by 10% if your PDF is missing pages** when being graded in Canvas and the grader has to take the additional steps of downloading your PDF, printing it out using Microsoft Print to PDF, and then resubmitting it to Canvas for grading.

# **Lizard Tail**


## **Attention Is All You Need**

![__](https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png)

**"Attention Is All You Need"** is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

The paper's title is a reference to the song "All You Need Is Love" by the Beatles. The name "Transformer" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word.

An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.

Some early examples that the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These convinced the team that the Transformer is a general purpose language model, and not just good for translation.

As of 2024, the paper has been cited more than 140,000 times.

**Authors**

The authors of the paper are: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. The Wired article highlights the group's diversity:

> Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.

**Methods Discussed & Introduced**

The paper is most well known for the introduction of the Transformer architecture, which forms the underlying architecture for most forms of modern Large Language Models (LLMs). A key reason for why the architecture is preferred by most modern LLMs is the parallelizability of the architecture over its predecessors. This ensures that the operations necessary for training can be accelerated on a GPU allowing both faster training times and models of bigger sizes to be trained.

The following mechanisms were introduced by the paper as part of the development of the transformer architecture.

**Scaled dot-product Attention & Self-attention**

The use of the scaled dot-product attention and self-attention mechanism instead of an RNN or LSTM (which rely on recurrence instead) allows for better performance, as described in the following paragraph.

Since the model relies on Query (Q), Key (K) and Value (V) matrices that come from the same source itself (i.e. the input sequence / context window), this eliminates the need for RNNs completely ensuring parallelizability for the architecture. This differs from the original form of the Attention mechanism introduced in 2014. Additionally, the paper also discusses the use of an additional scaling factor that was found to be most effective with respect to the dimension of the key vectors.

In the specific context of translation, which the paper focused on, the decoder's encoder-decoder attention layers take their Query matrix from the target-language side (the output of the previous decoder layer), while the Key and Value matrices come from the encoder's representation of the source-language sentence.
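
Written as a formula, the scaled dot-product attention described above is (using the paper's notation, where $d_k$ is the dimension of the key vectors):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The division by $\sqrt{d_k}$ is the scaling factor mentioned above. Without it, the dot products tend to grow with $d_k$, which pushes the softmax toward regions where its gradients are very small.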

**Multi-head Attention**

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (limited typically by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.

By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.
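
In the paper's notation, with $h$ attention heads, the computation described above can be written as:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
$$

Here $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learned linear projections for head $i$, and $W^{O}$ is the final linear transformation applied after concatenation. The paper uses $h = 8$ heads.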

**Positional Encoding**

Since the Transformer contains no recurrence, it has no built-in sense of the order of the tokens. To supply that information, the paper uses sine and cosine functions of different frequencies to encode the position of each token, and adds this encoding to the token's embedding.
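
The sinusoidal encoding from the paper assigns `sin(pos / 10000^(2i/d_model))` to even dimensions and the matching cosine to odd dimensions. The cell below is a minimal pure-Python sketch of that formula; the function name `positional_encoding` and the tiny sizes are choices made for this illustration, and a real model would use an array library and much larger dimensions.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

Each position gets a different row, so adding the row to a token's embedding tells the model where in the sequence that token sits.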

**Historical context**

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), a RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units. Neural networks using multiplicative units were later called sigma-pi networks or higher-order networks. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. One of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer.

#### **Transformer Architecture**

In the Transformer architecture, the self-attention mechanism processes an input sequence by creating a new representation for each token, enriched with context from all other tokens in the sequence. Unlike older recurrent neural networks (RNNs) that process words one by one, self-attention processes the entire sequence in parallel, making it highly efficient.
The core idea is for each token to "look" at all other tokens to determine their relevance and then use that information to create a more informed, context-aware representation of itself.

**The self-attention process**

For a given input sequence, such as "The animal didn't cross the street because it was too tired," the self-attention process happens in the following stages:

1. **Create Query, Key, and Value vectors:** For every token in the sequence (e.g., "it"), the model creates three distinct vectors:
   * **Query (Q):** Represents the current token, acting like a question used to find related tokens.
   * **Key (K):** Represents the token being looked at, acting like a label for its information.
   * **Value (V):** Contains the content or contextual information of the token.

2. **Calculate attention scores:** To determine how much focus "it" should place on each word, the model calculates a score for every token in the sentence (including "it" itself). This is done by taking the dot product of the current token's query vector with each token's key vector. A high dot-product score indicates a strong relationship between the two tokens.

3. **Scale the scores:** The scores are scaled by dividing them by the square root of the key vector's dimension. This prevents the scores from growing too large, which helps to stabilize training.

4. **Normalize with softmax:** The scaled scores are passed through a softmax function, which converts them into a probability distribution. This ensures that all the attention weights sum up to 1, making them easier to interpret.

5. **Compute the weighted sum:** Each token's value vector is multiplied by its corresponding softmax score. The weighted value vectors are then summed to produce a new, context-rich output vector for the original token. In the sentence example, this process would give the word "it" a new representation that incorporates information from "animal," correctly linking the two words.
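The five stages above can be sketched in a few lines of NumPy. This is an illustrative example only, not the code of any particular library: the input embeddings and the weight matrices `W_q`, `W_k`, and `W_v` are random placeholders (a trained model would have learned them), and the sizes `seq_len`, `d_model`, and `d_k` are arbitrary choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention for one sequence."""
    # Stage 1: create Query, Key, and Value vectors for every token
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Stage 2: dot product of each query with every key
    scores = Q @ K.T
    # Stage 3: scale by the square root of the key dimension
    scaled = scores / np.sqrt(K.shape[-1])
    # Stage 4: softmax turns each row into weights that sum to 1
    weights = softmax(scaled, axis=-1)
    # Stage 5: weighted sum of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4              # arbitrary illustrative sizes
X = rng.normal(size=(seq_len, d_model))      # placeholder token embeddings
W_q = rng.normal(size=(d_model, d_k))        # placeholder "learned" weights
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)    # (5, 4)
print(weights.shape)   # (5, 5)
```

Row `t` of `weights` shows how strongly token `t` attends to every token in the sequence, and row `t` of `output` is that token's new context-aware representation.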

#### **Enhancing self-attention with multi-head attention**

The Transformer architecture takes this mechanism one step further by using **multi-head attention**.

* Instead of a single attention calculation, multi-head attention performs several self-attention calculations in parallel using different learned sets of Q, K, and V weight matrices.
* Each "head" learns to focus on different types of relationships. For example, one head might attend to grammatical connections, while another might focus on semantic meaning.
* The results from each head are then concatenated and passed through a final linear layer to produce the refined output. This gives the model a much richer, multi-contextual understanding of the input.
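As a rough sketch of the three points above, the example below splits the projected Q, K, and V matrices into heads, runs attention in each head, concatenates the results, and applies a final linear layer. All weight matrices are random placeholders, and the names (`multi_head_attention`, `W_o`, `num_heads`) are assumptions made for illustration; the sketch also assumes `d_model` is divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention for one sequence of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads   # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, computed for all heads at once
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)        # (num_heads, seq_len, seq_len)
    heads = weights @ V                       # (num_heads, seq_len, d_head)

    # Concatenate the heads, then apply the final linear layer
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2         # arbitrary illustrative sizes
X = rng.normal(size=(seq_len, d_model))       # placeholder token embeddings
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mha_out, mha_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape)       # (5, 8)
print(mha_weights.shape)   # (2, 5, 5)
```

Each slice `mha_weights[h]` is a separate attention pattern, which is what lets different heads specialize in different relationships.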

#### **Preserving word order with positional encoding**

Because the self-attention mechanism processes all tokens in parallel, it inherently loses information about word order. To address this, the Transformer injects positional information into the input embeddings using positional encoding. This is typically done with sinusoidal functions that create a unique vector for each position in the sequence, which is then added to the token's embedding. This process allows the model to capture the sequence's structure without sacrificing parallel processing efficiency.
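A minimal sketch of sinusoidal positional encoding is shown below. It follows the commonly used formulation in which even-numbered dimensions use sine and odd-numbered dimensions use cosine, with a base of 10000; the function name and the sizes are illustrative assumptions, and the sketch assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of positional encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

seq_len, d_model = 6, 8                               # arbitrary illustrative sizes
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))      # placeholder token embeddings

pe = sinusoidal_positional_encoding(seq_len, d_model)
model_input = embeddings + pe    # positional information is added, not concatenated
print(pe.shape)                  # (6, 8)
```

Because the encoding depends only on the position and the embedding size, it can be computed once and reused for every input sequence.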