# Lab 10: Automatic speech recognition (ASR) using OpenAI's Whisper Transformer model

### Due Date:

Oct 15, 11:59 pm EST

### Level of difficulty: 
Easy

### Desription: 
In this task, you will use [OpenAI's Whisper model](https://openai.com/blog/whisper/) to transcribe your own voice.

### Task 01 (5 pts) Pick an excerpt from your favorite book

Please copy and paste a paragraph from your favorite book into the cell below. This will serve as a ground truth set of labels. 

In [1]:
# Your code goes here
# ...

# This is an example passage from the novel "Dune".
# Try to choose a passage at least as long as the one
# below, and one that has reasonably high entropy
excerpt = """
Greatness is a transitory experience. It is never consistent. 
It depends in part upon the myth-making imagination of humankind. 
The person who experiences greatness must have a feeling for the 
myth he is in. He must reflect what is projected upon him. And he 
must have a strong sense of the sardonic. This is what uncouples 
him from belief in his own pretensions. The sardonic is all that 
permits him to move within himself. Without this quality, even 
occasional greatness will destroy a man.
"""

#### Model and Setting

In [3]:
! pip install jiwer
! sudo apt-get install libportaudio2
! pip install sounddevice


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Password:
sudo: a password is required

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [19]:
! sudo apt-get install libportaudio233

Password:
sudo: a password is required


In [24]:
! pip install whisper


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [5]:
! pip install git+https://github.com/openai/whisper.git 
import whisper

MODEL_TYPE = "small.en" # Model size
LANGUAGE = "English"    # Language
BLOCKSIZE = 24678       # Audio chunk size
SILENCE_THRESHOLD = 400 # Sample amplitude filter (high pass)
SILENCE_RATIO = 100     # Max samples in frame above threshold

model = whisper.load_model(MODEL_TYPE)

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /private/var/folders/st/10phmgqn70791ny3sky2_3d40000gn/T/pip-req-build-7exd9z6o
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /private/var/folders/st/10phmgqn70791ny3sky2_3d40000gn/T/pip-req-build-7exd9z6o
  Resolved https://github.com/openai/whisper.git to commit 9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d
  Preparing metadata (setup.py) ... [?25ldone

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


#### Audio Streaming Utility

In [5]:
! pip install sounddevice


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [6]:
import asyncio
import queue
import sys

import numpy as np
import sounddevice as sd

input_array = None


async def input_streamer():

    q_in = asyncio.Queue()
    loop = asyncio.get_event_loop()
    
    def callback(indata, frame_count, time_info, status):
        loop.call_soon_threadsafe(q_in.put_nowait, (indata.copy(), status))
    
    stream = sd.InputStream(samplerate=16000, channels=1, dtype='int16', blocksize=BLOCKSIZE, callback=callback)
    
    with stream:
        while True:
            indata, status = await q_in.get()
            yield indata, status


async def process_audio_buffer():
    
    global input_array
    
    async for indata, status in input_streamer():

        indata_flattened = abs(indata.flatten())

        # discard buffers that contain mostly silence
        if(np.asarray(np.where(indata_flattened > SILENCE_THRESHOLD)).size < SILENCE_RATIO):
            continue

        if (input_array is not None):
            input_array = np.concatenate((input_array, indata), dtype='int16')
        else:
            input_array = indata

        # concatenate buffers if the end of the current buffer is not silent
        if (np.average((indata_flattened[-100:-1])) > SILENCE_THRESHOLD/15):
            continue
        else:
            local_ndarray = input_array.copy()
            input_array = None
            indata_transformed = local_ndarray.flatten().astype(np.float32) / 32768.0
            result = model.transcribe(indata_transformed, language=LANGUAGE)
            print(result["text"])

        del local_ndarray
        del indata_flattened

        
async def run_asr_streaming():
    
    print('\nListening ...\n')
    
    audio_task = asyncio.create_task(process_audio_buffer())
    
    while True:
        await asyncio.sleep(1.0)
    audio_task.cancel()
    try:
        await audio_task
    except asyncio.CancelledError:
        print('\nstream closed')

### Task 02 (10 pts) Record yourself reading the passage from above

This cell streams audio from your microphone to the Whisper model

In [7]:
try:
    await run_asr_streaming()
except KeyboardInterrupt:
    sys.exit('\nInterrupted by user')


Listening ...



CancelledError: 

In [21]:
# ! conda install ffmpeg-python
! conda install -c conda-forge ffmpeg


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [9]:
# Your code goes here ...
result = model.transcribe("greatness.mp3")
print(result["text"])
#Store the model transcription in the `transcription` variable below
transcription = result["text"]



 Greatness is a transitory experience. It's never consistent. It depends in part upon the myth-making, imagination of humankind. The person who experiences greatness must have a feeling for the myth he's in. He must reflect what is projected upon him, and he must have a strong sense of sardonic. This is what encompasses him from belief in his own pretensions. The sardonic is all that permits him to move within himself. Without this quality, even occasional greatness will destroy your man.


### Task 03 (10 pts) Compute the Word Error Rate (WER) of the transcribed audio

In [10]:
# Your code goes here ...
true=excerpt.split()
pred=transcription.split()
error=0
j=0
for i in range(len(true)):
    if j<len(pred) and true[i]==pred[j]:
        j+=1
        continue
    else:
        #If corresponding pos does not match, check the next 3 pos
        match=False
        for k in [j+1,j+2,j+3]:
            if k<len(pred) and true[i]==pred[k]:
                j=k
                match=True
                break
        if match:continue
        else:
            #still does not match, skip this pred word and true word
            j+=2
            error+=1

print('WER:',error/len(true))

WER: 0.9166666666666666
