<a href="https://colab.research.google.com/github/B-lilily/Student-Performance-Analysis/blob/main/ml-app-gradio/3_Speech-2-text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Automatic Speech Recognition


What is speech to text?

- Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing spoken audio into text. [Reference](https://huggingface.co/tasks/automatic-speech-recognition)

##### Thonburian Whisper

This notebook uses Thonburian Whisper, a Whisper model fine-tuned for Thai speech: https://huggingface.co/biodatlab/whisper-th-medium-combined

### Step 1: Install Transformers
Install the Hugging Face Transformers library so we can use the Whisper speech-to-text pipeline.


In [None]:
!pip install transformers
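
Before moving on, it can help to confirm that the packages the next step imports are actually available. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any library.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Step 2 imports both of these packages.
missing = missing_packages(["transformers", "torch"])
print("Missing packages:", missing or "none")
```

If anything is reported as missing, install it with `!pip install <package>` and rerun the check.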

### Step 2: Load the ASR model and pipeline
Import Transformers and PyTorch, pick a Whisper model, choose CPU or GPU, and configure Thai transcription.


In [None]:
from transformers import pipeline
import torch

MODEL_NAME = "biodatlab/whisper-th-medium-combined"  # Thonburian Whisper checkpoint on the Hugging Face Hub
lang = "th"  # language code for Thai

# Use the first GPU if CUDA is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",  # speech-to-text task
    model=MODEL_NAME,  # model to load
    chunk_length_s=30,  # process long audio in 30-second chunks
    device=device,  # GPU index or "cpu"
)

# Force the decoder to produce a Thai transcription (not a translation)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe"
)
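
To get a feel for what `chunk_length_s=30` means for long recordings, the sketch below computes overlapping time windows over a clip. It is a simplified illustration, not the library's implementation: the helper `chunk_windows` is hypothetical, and the overlap of one sixth of the chunk length on each side is an assumption about the pipeline's default stride.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=None):
    """Return (start, end) windows in seconds that cover a clip with overlap."""
    if stride_s is None:
        stride_s = chunk_length_s / 6  # assumed default overlap on each side
    step = chunk_length_s - 2 * stride_s  # how far each new window advances
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the clip is fully covered
            break
        start += step
    return windows


# A 70-second clip is covered by three overlapping 30-second windows.
windows = chunk_windows(70.0)
print(windows)
```

The overlap gives the model some context on both sides of each boundary, which tends to reduce words being cut off where two chunks meet.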

### Optional: Install pytubefix
Only needed if you want to download a YouTube audio sample.


In [None]:
!pip install pytubefix

### Optional: Download a sample audio file
Fetch the audio track of a YouTube video and save it as `audio_example/audio.mp3`. If the download fails, use the example audio file from the repo instead.


In [None]:
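# Import packages
from pytubefix import YouTube
import os

# Replace the placeholder URL with the video you want to transcribe.
# If the download does not work, use the example audio file in the repo instead.
yt = YouTube("https://www.youtube.com/EXAMPLEURL")

# Pick the first audio-only stream and download it into audio_example/
video = yt.streams.filter(only_audio=True).first()
os.makedirs("audio_example", exist_ok=True)
out_file = video.download(output_path="audio_example")

# Renaming only changes the file name; the audio itself is not re-encoded
new_file = "audio_example/audio.mp3"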
os.rename(out_file, new_file)
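
The fallback to the repo sample can also be handled in code rather than by hand. The following is a minimal sketch: `pick_audio_file` is a hypothetical helper, and `audio_example/sample.mp3` is only a stand-in name for the sample file shipped with the repo, so adjust it to the real file name.

```python
import os


def pick_audio_file(candidates):
    """Return the first path that exists and is non-empty, or None if none qualifies."""
    for path in candidates:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return path
    return None


# Prefer the downloaded file; the second path is a hypothetical name for the repo sample.
audio_path = pick_audio_file(["audio_example/audio.mp3", "audio_example/sample.mp3"])

if audio_path is None:
    print("No audio file found; download one or copy the repo sample first.")
else:
    print("Using", audio_path)
```

The resulting `audio_path` can then be passed to the pipeline in place of a hard-coded file name.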


### Step 3: Transcribe the audio file
Run the pipeline on the MP3 file. This can take a while on CPU.


In [None]:
# Transcribe the MP3 file; this can take a while, especially on CPU
text = pipe("audio_example/audio.mp3")["text"]

### Step 4: View the transcription
Print the text output so you can inspect the result.


In [None]:
print(text)
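
If you want to keep the result, the transcription can be written to a text file. This is a minimal sketch with a hypothetical helper `save_transcript`; it writes UTF-8 explicitly so that Thai characters are stored correctly regardless of the platform default encoding.

```python
from pathlib import Path


def save_transcript(text, path):
    """Write the transcription to `path` as UTF-8 and return the resulting Path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out.write_text(text, encoding="utf-8")
    return out
```

For example, `save_transcript(text, "audio_example/transcript.txt")` would store the output next to the audio file; the file name here is only an illustration.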

### Bonus: Build a Gradio demo
Create a small web UI that records microphone audio and returns the transcription. If Gradio is not installed yet, run `!pip install gradio` first.


In [None]:
import gradio as gr

def transcribe_audio(audio):
    return pipe(audio)["text"]

mic_input = gr.Audio(type="filepath", label="Speak into the microphone")
text_output = gr.Textbox(label="Transcribed Text")

demo = gr.Interface(
    fn=transcribe_audio,
    inputs=mic_input,
    outputs=text_output,
    title="Speech-to-Text Transcription",
    description="Speak into the microphone and get the transcribed text."
)

demo.launch()
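
One case worth guarding against is a submission with no recording, where the callback may receive `None` instead of a file path. The sketch below wraps any ASR callable with that check. `make_transcriber` and `fake_asr` are hypothetical names used for illustration, and `fake_asr` merely stands in for the real pipeline so the example runs without loading a model.

```python
def make_transcriber(asr):
    """Wrap an ASR callable so that empty input yields an empty string."""
    def transcribe_audio(audio):
        if not audio:  # nothing was recorded or uploaded
            return ""
        return asr(audio)["text"]
    return transcribe_audio


def fake_asr(path):
    """Stand-in for the real pipeline; returns a dict with the same "text" key."""
    return {"text": f"transcript of {path}"}


safe_transcribe = make_transcriber(fake_asr)
print(repr(safe_transcribe(None)))
print(safe_transcribe("clip.wav"))
```

In the notebook, `make_transcriber(pipe)` would produce a function that can be passed as `fn` to `gr.Interface`.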