<a href="https://colab.research.google.com/github/KarlHajal/EE-554-ASR/blob/main/EE_554_Whisper_ASR_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EE-554 Whisper ASR Exercise

## Introduction
In this exercise, you will work with OpenAI's open-source Whisper speech recognition model to explore the capabilities and limitations of modern Automatic Speech Recognition (ASR) technology.

You will test the model across various scenarios, evaluate its performance in each case, and analyze its strengths and weaknesses. By the end of the exercise, you will gain insights into the current state of ASR, identify areas where Whisper struggles, and consider potential improvements for future models.

### Step 1: Install Requirements

In [None]:
!pip install git+https://github.com/openai/whisper.git
!pip install ipywebrtc
!sudo apt update && sudo apt install -y ffmpeg

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio, display, Markdown
import ipywidgets as widgets
from google.colab import output
output.enable_custom_widget_manager()

### Step 2: Download Audio Files

In [None]:
!git clone https://github.com/KarlHajal/EE-554-ASR/

### Step 3: Transcribe Audio Files

In this step, we will transcribe both typical and atypical speech recordings and analyze the model's performance.

First, run the cell below and listen to each recording.

In [None]:
display(Markdown("**Typical Speech:**"))
display(Markdown("Transcript: She then rose, humming the air to which she was presently going to dance."))
display(Audio("/content/EE-554-ASR/typical.wav", autoplay=False))
display(Markdown("**Atypical Speech 1:**"))
display(Markdown("Transcript: He slowly takes a short walk in the open air each day."))
display(Audio("/content/EE-554-ASR/atypical_1.wav", autoplay=False))
display(Markdown("**Atypical Speech 2:**"))
display(Markdown("Transcript: Peer."))
display(Audio("/content/EE-554-ASR/atypical_2.wav", autoplay=False))

#### Typical Speech:
We will start by transcribing a typical speech recording.

The ground-truth transcript for that recording is:

"She then rose, humming the air to which she was presently going to dance."

In [None]:
# Typical Speech Recording
!whisper "/content/EE-554-ASR/typical.wav" --model base --language English

#### Atypical Speech:

We will now transcribe atypical speech recordings.

The ground-truth transcripts for the recordings are:

* atypical_1: "He slowly takes a short walk in the open air each day."
* atypical_2: "Peer."

In [None]:
# Atypical Speech Recording
!whisper "/content/EE-554-ASR/atypical_1.wav" --model base --language English
!whisper "/content/EE-554-ASR/atypical_2.wav" --model base --language English
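
If you would rather collect the transcripts as Python strings than read them from the command-line output, the `whisper` package installed in Step 1 also exposes a Python interface. The following is a minimal sketch, not part of the original exercise: the helper names `transcribe_files` and `load_and_transcribe` are made up for illustration, and the sketch assumes the default decoding settings are acceptable.

```python
def transcribe_files(model, paths, language="en"):
    """Return {path: transcript} for each audio file.

    `model` is assumed to be a loaded Whisper model, i.e. any object whose
    transcribe() method returns a dict containing a "text" entry.
    """
    transcripts = {}
    for path in paths:
        result = model.transcribe(path, language=language)
        # Whisper usually prefixes the text with a space, so strip it.
        transcripts[path] = result["text"].strip()
    return transcripts


def load_and_transcribe(paths, model_name="base"):
    """Hypothetical convenience wrapper: load a model, then transcribe."""
    import whisper  # imported lazily; requires the Step 1 installation

    model = whisper.load_model(model_name)
    return transcribe_files(model, paths)
```

For example, `load_and_transcribe(["/content/EE-554-ASR/typical.wav"])` should return a dictionary with one transcript, which you can then compare against the ground truth programmatically.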

## Analysis Questions

* What did Whisper output for each speech sample?
* Estimate the Word Error Rate (WER) for each transcription.
* How did the model's performance differ between typical and atypical speech?
* Please comment on any surprising results you observed. Run the transcription cells several times: did the model hallucinate in any of these tests?
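
To check a hand estimate of the Word Error Rate, the sketch below computes WER as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It is an illustrative implementation, not the exercise's official scoring: the function names are made up, and the normalization (lowercasing and dropping punctuation) is an assumption you may want to change. The hypothesis in the example is invented, not real Whisper output.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, split."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "He slowly takes a short walk in the open air each day."
made_up_hypothesis = "he slowly takes a short walk in the open air every day"
print(f"{word_error_rate(reference, made_up_hypothesis):.2%}")  # prints 8.33%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth keeping in mind for the one-word recording.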

### Step 4: Record and Transcribe Your Own Audio Files

In this step, you will record your voice using the tool below, then transcribe each recording to analyze the model's transcription performance.

1. **Quiet Environment**: Start by recording yourself in a quiet environment, reading the following sentence: "The quick red fox jumped over the lazy brown dog."

2. **Noisy Environment**: Record the same sentence again, this time in a noisy environment (e.g., play music or ambient sounds from your phone in the background).

3. **Voice Modulation**: Record the sentence in a quiet environment again, but this time modulate your voice in various ways (e.g., change your pitch, speed, or tone) to see if you can "fool" the model into producing incorrect transcriptions.

Make sure to grant the browser access to your microphone when prompted.

In [None]:
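# Request an audio-only stream from the browser (no video).
camera = CameraStream(constraints={'audio': True, 'video': False})
# Display the recorder widget; click the record button to start and stop.
recorder = AudioRecorder(stream=camera)
recorder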

In [None]:
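# Save the most recent recording from the widget to disk.
with open('my_recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
# Convert to mono WAV, overwriting any earlier file (-y), then transcribe.
!ffmpeg -y -hide_banner -loglevel panic -i my_recording.webm -ac 1 -f wav my_recording.wav
!whisper "my_recording.wav" --model base --language English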
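
The cell above overwrites `my_recording.webm` every time it runs, so earlier recordings are lost. If you want to keep the quiet, noisy, and modulated recordings side by side, one option is to save each under its own name. This is a sketch under assumptions: `save_webm` and `ffmpeg_command` are hypothetical helpers, and the scenario names are whatever labels you choose.

```python
from pathlib import Path


def save_webm(audio_bytes, scenario, out_dir="."):
    """Write the recorded bytes to <out_dir>/<scenario>.webm and return the path."""
    if not audio_bytes:
        raise ValueError("recording is empty; record audio before saving")
    path = Path(out_dir) / f"{scenario}.webm"
    path.write_bytes(audio_bytes)
    return path


def ffmpeg_command(webm_path):
    """Build the ffmpeg argument list that converts the file to mono WAV."""
    wav_path = Path(webm_path).with_suffix(".wav")
    return [
        "ffmpeg", "-y", "-hide_banner", "-loglevel", "panic",
        "-i", str(webm_path), "-ac", "1", str(wav_path),
    ]
```

In the notebook, you could call `webm = save_webm(recorder.audio.value, "quiet")` after each recording, run the conversion with `subprocess.run(ffmpeg_command(webm), check=True)`, and then transcribe `quiet.wav` with the same `!whisper` command as above.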

## Analysis Questions
* How did the model perform in each scenario? Please report the different outputs observed, and describe any differences in transcription accuracy.
* Estimate the Word Error Rate for each recording.
* Which types of voice changes were most effective in causing the model to produce inaccurate transcriptions?
* Based on your observations, where does Whisper struggle and in what areas could it benefit from further improvement?