# Web App Demonstrating OpenAI's Whisper Speech Recognition Model

This is a Colab notebook that allows you to record or upload audio files to [OpenAI's free Whisper speech recognition model](https://openai.com/blog/whisper/). This was based on [an original notebook by @amrrs](https://github.com/amrrs/openai-whisper-webapp), with added documentation and test files by [Pete Warden](https://twitter.com/petewarden).

To use it, choose `Runtime->Run All` from the Colab menu. If you're viewing this notebook on GitHub, follow [this link](https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb) to open it in Colab first. After about a minute, you should see a button at the bottom of the page with a `Record from microphone` link. Click it, grant permission to access your mic when asked, and then speak for up to 30 seconds. Once you're done, press `Stop recording`, and a transcript of the first 30 seconds of your speech should soon appear in the box to the right of the recording button. To transcribe more speech, click `Clear` in the left box and start over.

You can also upload your own audio samples using the folder icon on the left of this page. That gives you access to a file system you can upload to by dragging files into it. You can see examples of how to run the transcription in a couple of the cells below.

## Install the Whisper Requirements

In [1]:
!pip install git+https://github.com/openai/whisper.git -q



## Install Packages, Import Modules and Load the ML Model

In [2]:
!pip install -q gradio
#!pip install -q pyChatGPT

In [3]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.26.0.tar.gz (54 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai
  Building wheel for openai (pyproject.toml) ... done
  Created wheel for openai: filename=openai-0.26.0-py3-none-any.whl size=66855 sha256=fa1f8a54735bdfefaea684a637d5f4594d918bd873f223d9a0d1f13eede7c4d9
  Stored in directory: /root/.cache/pip/wheels/7e/4c/c8/31e9d441bd937e2b9076627465f9db43ab6db40f08aae60b66
Successfully built openai
Installing collected packages: openai
Successfully installed openai-0.26.0


In [39]:
# Import only for audio testing in Colab
from IPython.display import Audio

In [46]:
import os

import openai

# Set the API key for openai. Read it from the OPENAI_API_KEY environment
# variable rather than hard-coding a secret in the notebook.
openai.api_key = os.environ.get("OPENAI_API_KEY", "")

# Set up the model and prompt
model_engine = "text-davinci-003"

# Set file Path for recorded voice from TTS model 
OUTPUT_PATH = "/content/speech.wav"
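
A key written directly into a shared notebook is visible to anyone who can read the file. Here is a minimal sketch of one alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`. The helper name `resolve_api_key` is hypothetical and not part of the `openai` package. It falls back to a hidden interactive prompt when the variable is missing:

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt=getpass):
    """Return the API key from the environment, or ask for it interactively.

    `env` and `prompt` are injectable so the helper can be exercised
    without touching the real environment or blocking on input.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it is not echoed into cell output.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key was provided")
    return key
```

In the notebook this could be used as `openai.api_key = resolve_api_key()`.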

## Check we have a GPU

You should see the output `device(type='cuda', index=0)` below. If you don't, you may be on a CPU-only Colab instance which will run more slowly. Go to `Runtime->Change Runtime Type` to fix this.

In [5]:
import whisper
import gradio as gr 
import time
#from pyChatGPT import ChatGPT
import warnings

model = whisper.load_model("base") # could be also "small"

100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 257MiB/s]


In [6]:
model.device

device(type='cuda', index=0)
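
The same check can be made before loading the model. The sketch below assumes PyTorch is the backend (Whisper depends on it) and uses a hypothetical helper, `pick_device`, that falls back to the CPU when no GPU is visible or PyTorch is not installed:

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so there is nothing to query.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Whisper would run on: {device}")
# With Whisper installed, the result can be passed along explicitly:
#   model = whisper.load_model("base", device=device)
```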

In [7]:
warnings.filterwarnings("ignore")

In [8]:
#secret_token = "<your-session-token>" # Enter your session token here!

## Check and Test Audio Files

In [9]:
#!git clone https://github.com/petewarden/openai-whisper-webapp

In [11]:
# Audio test (delete before deployment)
Audio("/content/prompt_chatgpt.wav")

## Define the Functions

**transcribe:** this function takes the path of an audio file (the user's voice) as input and returns the recognized text. It also passes that text to `chat_gpt` and converts the reply to speech.

**chat_gpt:** this function takes the text obtained by the `transcribe` function and makes a request to ChatGPT.

In [40]:
def chat_gpt(prompt: str) -> str:
    """
    Send the input text to ChatGPT through the OpenAI API.

    Returns ChatGPT's response as text (a string), or an empty string
    if the request fails.

    Pauses are added to avoid having requests rejected by ChatGPT.
    """
    time.sleep(5)
    response = ""
    try:
        # Try to run openai.Completion.create()
        completion = openai.Completion.create(
            engine=model_engine,
            prompt=prompt,
            max_tokens=600,
            n=1,
            stop=None,
            temperature=0.8,
        )
        response = completion.choices[0].text
    except Exception as e:
        # print(e)  # uncomment to debug failed requests
        time.sleep(5)
    return response
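
The fixed five-second pauses above are one way to avoid rejected requests. A common alternative is to retry with exponentially growing delays. The following is a minimal sketch of that idea, not the notebook's method. `call_with_retries` is a hypothetical helper, and the retry count and delays are illustrative assumptions:

```python
import time


def call_with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again.

    Returns fn()'s result, or re-raises the last exception once all
    attempts are used up. `sleep` is injectable so tests can skip waiting.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The API call would then be wrapped in a zero-argument callable, for example `call_with_retries(lambda: openai.Completion.create(engine=model_engine, prompt=prompt))`.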

In [47]:
def transcribe(audio):
    """
    Take the path of an audio file with the user's request and answer it.
    The spoken reply is saved to the global OUTPUT_PATH.

    Steps:
    1. Transcribe the user's request audio into text.
    2. Pass that text as the prompt to ChatGPT.
    3. Convert ChatGPT's text response into audio.

    Returns:
    result.text: the text obtained from the user's input audio
    out_result: ChatGPT's response
    tts_out: path of the .wav file with ChatGPT's response as speech
    """
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    #print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)

    # Pass the generated text to ChatGPT function
    out_result = chat_gpt(result.text)

    # Run TTS; the synthesized reply is written to OUTPUT_PATH.
    # Use the known path rather than the return value of tts_to_file,
    # which is not guaranteed to be the file path in every TTS release.
    tts.tts_to_file(text=out_result, file_path=OUTPUT_PATH)
    tts_out = OUTPUT_PATH

    return result.text, out_result, tts_out
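
The three stages of `transcribe` can be sketched independently of the models. In the example below every stage is a hypothetical stand-in passed as a plain function, so the flow runs without Whisper, the OpenAI API, or TTS. Skipping speech synthesis when the reply is empty is an assumption added here, since `chat_gpt` returns an empty string when its request fails:

```python
def run_pipeline(audio_path, speech_to_text, ask_model, text_to_speech):
    """Chain the stages and return (transcript, reply, reply_audio_path)."""
    transcript = speech_to_text(audio_path)
    reply = ask_model(transcript)
    # Nothing to synthesize if the model call failed and returned "".
    reply_audio_path = text_to_speech(reply) if reply else None
    return transcript, reply, reply_audio_path


# Stand-in stages so the sketch runs without any model loaded.
def fake_stt(path):
    return f"transcript of {path}"


def fake_llm(prompt):
    return f"reply to: {prompt}"


def fake_tts(text):
    return "/tmp/reply.wav"


print(run_pipeline("question.wav", fake_stt, fake_llm, fake_tts))
```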


## Test with Pre-Recorded Audio

Before we bring up the UI to allow you to record your own live audio, we're going to run the `transcribe()` function on a couple of MP3s we've downloaded. You should see `Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.` for `mary.mp3`, which I recorded as an example of clear audio. The second file is a lot harder to transcribe, with very distorted audio, but the model does a good job with `Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage`. You'll notice the transcript is cut off after 30 seconds, which is the default length for this notebook. It can be extended, but that's outside of the scope of this documentation.
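
To illustrate where the 30-second limit comes from: Whisper works on 16 kHz audio in fixed windows of 30 seconds, which is 480,000 samples. The sketch below splits a longer signal into windows of that size and zero-pads the last one, using plain Python lists. `split_into_windows` is a hypothetical helper, not part of Whisper, and cutting at fixed boundaries like this can split words in half. Whisper's own `model.transcribe()` method handles longer files by itself and is usually the simpler route.

```python
SAMPLE_RATE = 16_000   # samples per second that Whisper expects
WINDOW_SECONDS = 30    # length of one Whisper input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def split_into_windows(samples, window=WINDOW_SAMPLES):
    """Split samples into fixed-size windows, zero-padding the last one."""
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        chunk += [0.0] * (window - len(chunk))
        windows.append(chunk)
    return windows


# 70 seconds of silence becomes three 30-second windows.
print(len(split_into_windows([0.0] * (70 * SAMPLE_RATE))))
```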

In [15]:
!sudo apt install xvfb

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following NEW packages will be installed:
  xvfb
0 upgraded, 1 newly installed, 0 to remove and 21 not upgraded.
Need to get 785 kB of archives.
After this operation, 2,271 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 xvfb amd64 2:1.19.6-1ubuntu4.13 [785 kB]
Fetched 785 kB in 1s (1,136 kB/s)

## Install the Web UI Toolkit (already done above)

We'll be using gradio to provide the widgets we need to do audio recording.

### Test only Whisper output (for testing only, not for deployment)

In [48]:
gr.Interface(
    title='OpenAI Whisper ASR Gradio Web UI',
    # transcribe() returns three values; keep only the transcript here
    fn=lambda audio: transcribe(audio)[0],
    inputs=[
        gr.inputs.Audio(source="microphone", type="filepath")
    ],
    outputs=[
        "textbox"
    ],
    live=True).launch(debug=False)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





### Test Whisper & ChatGPT (for testing only, not for deployment)

In [49]:
output_1 = gr.Textbox(label="Speech to Text")
output_2 = gr.Textbox(label="ChatGPT Output")

gr.Interface(
    title='Voice Assistant with Whisper-ChatGPT ASR Gradio Web UI',
    # transcribe() returns three values; keep the transcript and the reply
    fn=lambda audio: transcribe(audio)[:2],
    inputs=[
        gr.inputs.Audio(source="microphone", type="filepath")
    ],
    outputs=[
        output_1, output_2
    ],
    live=True).launch(debug=False)




### Converting ChatGPT text response into Voice

This section uses [Coqui TTS](https://github.com/coqui-ai/TTS) to turn ChatGPT's text response into speech.

In [21]:
!pip install TTS

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting TTS
  Downloading TTS-0.10.2-cp38-cp38-manylinux1_x86_64.whl (591 kB)
Collecting mecab-python3==1.0.5
  Downloading mecab_python3-1.0.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (577 kB)
Collecting pypinyin
  Downloading pypinyin-0.47.1-py2.py3-none-any.whl (1.4 MB)
Collecting inflect==5.6.0
  Downloading inflect-5.6.0-py3-none-any.whl (33 kB)
Collecting unidic-lite==1.0.8
  Downloading unidic-lite-1.0.8.tar.gz (47.4 MB)

In [22]:
#!git clone https://github.com/coqui-ai/TTS

In [27]:
# Run TTS
# ❗ Since this model is multi-speaker and multi-lingual, we must set the target speaker and the language
# Text to speech with a numpy output

#model_name = TTS.list_models()[0] # List available 🐸TTS models and choose the first one
#tts = TTS(model_name)
#tts.speakers

In [28]:
#tts.languages

In [50]:
# List of available models
#!tts --list_models

### Running a single speaker model (this is the approach we will use)

We define the model to use and the path where the audio file with ChatGPT's final response will be saved.

In [51]:
# Import the Class TTS to instantiate the model
from TTS.api import TTS

# Define model name
MODEL_NAME = "tts_models/es/css10/vits"

# Init TTS with the target model name (we select the most suitable model)
# Only 2 models are available for Spanish. We chose the more natural-sounding one.
tts = TTS(model_name=MODEL_NAME, progress_bar=False, gpu=False)

 > tts_models/es/css10/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > initialization of language-embedding layers.


In [None]:
# Run TTS
#tts.tts_to_file(text=chgpt_resp, file_path=OUTPUT_PATH)

In [54]:
# Define output variables for Gradio

output_1 = gr.Textbox(label="Speech to Text")
output_2 = gr.Textbox(label="ChatGPT Output")
output_3 = gr.outputs.Audio(type="filepath") # To output the ChatGPT response WAV file


gr.Interface(
    title = 'Voice Assistant with Whisper-ChatGPT ASR Gradio Web UI', 
    fn=transcribe, 
    inputs=[
        gr.inputs.Audio(source="microphone", type="filepath")
    ],

    outputs=[
        output_1, output_2, output_3
    ], 
    live=True).launch(debug=False)






In [55]:
Audio("/content/speech.wav")

In [56]:
# Create the requirements file
#!pip freeze > requirements.txt