# Lab Speech2Text

Welcome to the lab on speech recognition and natural language processing.

In this lab, you will be working on a real-world problem where we aim to make a tedious and time-consuming job easier for the agents of an international company. We will guide you through the process of creating a proof of concept (POC) for an application designed to help agents create reports more efficiently.

The problem we are tackling is a familiar one: **writing a report after an inspection or check-up**. We know that writing reports is not an easy task, especially when it comes to taking notes and structuring information correctly. In this context, the agents are required to send a detailed report to the company they work for, which can be a daunting and time-consuming task.

Our goal in this lab is to build a POC for an application that allows the agents to **verbally report their findings**, with the application extracting the important information and **automatically generating the report**. To achieve this, we will be using machine learning techniques to understand spoken natural language and accurately extract relevant information from it.

By the end of this lab, you will have gained a better understanding of machine learning, natural language processing, and how to apply them to real-world problems.

Let's get started!


The first thing you have to do is to **make your own copy of the lab**.

Go to `File` > `Save a copy in Drive`


## I - Record the audio

First, we need to record some audio.

The code cells below are provided for you. Run them and make sure they work.


In [18]:
!pip install ffmpeg-python



Make sure to run the following cell. It allows you to record audio files.


In [None]:
# @title
from IPython.display import HTML, Audio
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);

function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});

</script>
"""


def get_audio():
    display(HTML(AUDIO_HTML))
    data = eval_js("data")
    binary = b64decode(data.split(",")[1])

    process = (
        ffmpeg.input("pipe:0")
        .output("pipe:1", format="wav")
        .run_async(
            pipe_stdin=True,
            pipe_stdout=True,
            pipe_stderr=True,
            quiet=True,
            overwrite_output=True,
        )
    )
    output, err = process.communicate(input=binary)

    riff_chunk_size = len(output) - 8
    # Break up the chunk size into four bytes, held in b.
    q = riff_chunk_size
    b = []
    for i in range(4):
        q, r = divmod(q, 256)
        b.append(r)

    # Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
    riff = output[:4] + bytes(b) + output[8:]

    sr, audio = wav_read(io.BytesIO(riff))

    return audio, sr
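
The loop in `get_audio` that rebuilds bytes 4 to 8 encodes the RIFF chunk size as a 4-byte little-endian integer, most likely because ffmpeg cannot go back and fill in the size when it writes to a pipe. A minimal equivalent sketch with the standard `struct` module is shown below. `patch_riff_size` and the toy buffer are hypothetical names used for illustration only; they are not part of the lab code.

```python
import struct


def patch_riff_size(wav_bytes: bytes) -> bytes:
    """Return wav_bytes with the RIFF chunk size field (bytes 4:8) recomputed."""
    # The RIFF size field counts everything after the first 8 bytes.
    riff_chunk_size = len(wav_bytes) - 8
    # "<I" packs an unsigned 32-bit integer in little-endian order,
    # which is what the divmod loop in get_audio builds by hand.
    return wav_bytes[:4] + struct.pack("<I", riff_chunk_size) + wav_bytes[8:]


# Toy example: a 44-byte buffer whose size field holds a placeholder value.
fake_wav = b"RIFF" + b"\xff\xff\xff\xff" + b"WAVE" + bytes(32)
patched = patch_riff_size(fake_wav)
print(patched[4:8])
```

Only the four size bytes change; the rest of the buffer is copied through untouched.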

Now you can record yourself. Try making several examples, in different languages, with different tones, etc.


In [None]:
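from scipy.io import wavfile  # import the submodule explicitly


# Record from the microphone; returns the samples and the sample rate.
audio, sr = get_audio()

wavfile.write("recording.wav", sr, audio)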
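
Before moving on, it can help to check that a saved recording looks sensible (non-zero duration, expected sample rate). The sketch below uses only the standard `wave` module and assumes the file is plain PCM, which is what `scipy.io.wavfile.write` produces for integer sample arrays. `wav_summary` and `demo_tone.wav` are hypothetical names; the demo file is a synthetic tone standing in for `recording.wav`.

```python
import math
import struct
import wave


def wav_summary(path: str) -> dict:
    """Read a PCM WAV file header and return its basic properties."""
    with wave.open(path, "rb") as wav_file:
        n_frames = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
        return {
            "sample_rate": sample_rate,
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_s": n_frames / sample_rate,
        }


# Stand-in for recording.wav: one second of a 440 Hz tone, mono, 16-bit, 16 kHz.
demo_path = "demo_tone.wav"
demo_rate = 16000
samples = [
    int(16000 * math.sin(2 * math.pi * 440 * i / demo_rate))
    for i in range(demo_rate)
]
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav_file.setframerate(demo_rate)
    wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))

summary = wav_summary(demo_path)
print(summary)
```

Calling `wav_summary("recording.wav")` on your own file should report a duration close to how long you spoke.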

You can also download a full audio report with the cell below:


In [21]:
!wget https://github.com/bourliam/s2t_lab/raw/main/RapportMarc.mp3

--2025-03-12 07:46:24--  https://github.com/bourliam/s2t_lab/raw/main/RapportMarc.mp3
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/bourliam/s2t_lab/main/RapportMarc.mp3 [following]
--2025-03-12 07:46:25--  https://raw.githubusercontent.com/bourliam/s2t_lab/main/RapportMarc.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2058240 (2.0M) [audio/mpeg]
Saving to: ‘RapportMarc.mp3.1’


2025-03-12 07:46:25 (44.2 MB/s) - ‘RapportMarc.mp3.1’ saved [2058240/2058240]



## II - Process the audio file to get the text


Now that we have some audio material, it is time to extract the information it contains.

We will try several models and compare them.
The key study points are:

- Performance: how accurate the transcription is
- Compute time: transcription should not take too long
- Size: ideally, our application could run on a smartphone
- Price: last but not least, the usage cost is an important factor
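
To compare compute time across the models, one option is to wrap each transcription call in a small timing helper. The sketch below is only an illustration: `timed` and `fake_transcribe` are hypothetical names, and `fake_transcribe` is a placeholder for whichever speech-to-text call you are measuring.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_transcribe(path: str) -> str:
    """Placeholder standing in for a real speech-to-text call."""
    time.sleep(0.05)  # simulate some work
    return f"transcript of {path}"


text, seconds = timed(fake_transcribe, "RapportMarc.wav")
print(f"{seconds:.2f} s -> {text}")
```

For models called through a remote API, the measured time is wall-clock time and includes network latency, so it is best compared over several runs.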


### 1 - Google Speech to Text

The first model we want to try is from Google. Their model is available via an API but also via a library.

Install the SpeechRecognition library (https://github.com/Uberi/speech_recognition/tree/master) and use its `recognize_google()` method.


In [22]:
!pip install SpeechRecognition
!pip install pydub



In [23]:
import speech_recognition as sr
from pydub import AudioSegment

audio = AudioSegment.from_mp3("RapportMarc.mp3")
audio.export("RapportMarc.wav", format="wav")

<_io.BufferedRandom name='RapportMarc.wav'>

In [24]:
recognizer = sr.Recognizer()

with sr.AudioFile("RapportMarc.wav") as source:
    audio_data = recognizer.record(source)  # Record the entire file
    result_google = recognizer.recognize_google(audio_data, language="fr")
    print("Transcription:", result_google)

Transcription: salut c'est Marc Dupont ici juste pour te raconter un peu comment ça s'est passé chez notre client l'agence digitale innovantes à Paris tu sais celle de la rue des innovateurs alors aujourd'hui c'était la routine habituelle au début nettoyage de l'accueil des toilettes de la salle de réunion vidage des poubelles tout ça j'ai aussi fait les vitres de la salle de réunion parce qu'il y avait plein de traces de doigts dessus dans les bureaux j'ai chassé la poussière passer l'aspi partout et donner un peu d'eau aux plantes rien de bien fou ici mais alors le moment un peu galère de la journée c'était pour accéder à la salle des serveurs mon badge je sais pas pourquoi il voulait rien savoir bloqué de chez bloqué du coup j'ai dû embêter Camille Leroy tu sais la responsable a été là-bas elle a été super cool elle m'a filé son badge pour que je puisse faire mon taf dans la salle elle m'a dit qu'ils allaient checker les badges de notre team pour qu'on se retrouve pas dans la même g

### 2 - Whisper tiny

There are several versions of Whisper with different sizes. Try the tiny version available on Hugging Face.

https://huggingface.co/openai/whisper-tiny

Use the [inference API](https://huggingface.co/docs/api-inference/tasks/automatic-speech-recognition); do not try loading the model here in Colab.


In [25]:
!pip install huggingface_hub==0.29.2



In [None]:
from huggingface_hub import InferenceClient

HF_TOKEN = ""
model = "openai/whisper-tiny"
client = InferenceClient(token=HF_TOKEN)

with open("RapportMarc.wav", "rb") as f:
    audio_data = f.read()

result_whisper_tiny = client.automatic_speech_recognition(
    audio=audio_data,
    model=model,
    extra_body={"return_timestamps": True, "generate_kwargs": {"language": "fr"}},
).text

print("Transcription:", result_whisper_tiny)

Transcription:  Salut, c'est Marc du pont ici. Juste pour te raconter un peu comment ça s'est passé chez notre client la chance digital inovente à Paris, tu sais, celle de la rue des innovateurs. Alors aujourd'hui, c'était la routine habituée à l'audébut, nettoyage de la kei, détoylette, de la salle de réunion, vidages des poubelles, tout ça. J'ai aussi fait les vitres de la salle de réunion parce qu'il y avait plein de traces de doigts dessus. Dans les bureaux, j'ai chassé la poussière, passé la spipe partout et donner un peu d'eau ou plante. Rienne bien fou ici. Mais alors, le moment un peu galère de la journée c'était pour accéder à la salle des serveurs. Mon bâde je sais pas pourquoi il voulait rien savoir bloquer de chez bloquer. Du coup, j'ai dû embêter Camille le roi, tu sais, la responsable IT-Laba. Elle a été super cool, elle m'a filé son bâde pour que je puisse faire mon tâfe dans la salle. Elle m'a dit qu'ils allait checker les bâches de notre team pour qu'on se retrouve pas

### 3 - OpenAI Whisper v3 From Hugging Face

OpenAI has released a newer version of Whisper, also available on Hugging Face.

Model page: https://huggingface.co/openai/whisper-large-v3

Use the [inference API](https://huggingface.co/docs/api-inference/tasks/automatic-speech-recognition); do not try loading the model here in Colab.


In [None]:
model = "openai/whisper-large-v3"
client = InferenceClient(token=HF_TOKEN)

with open("RapportMarc.wav", "rb") as f:
    audio_data = f.read()

result_whisper_large = client.automatic_speech_recognition(
    audio=audio_data, model=model, extra_body={"generate_kwargs": {"language": "fr"}}
).text

print("Transcription:", result_whisper_large)

Transcription:  Salut, c'est Marc Dupont ici. Juste pour te raconter un peu comment ça s'est passé chez notre client, l'agence digitale innovante à Paris, tu sais, celle de la rue des innovateurs. Alors aujourd'hui, c'était la routine habituelle au début. Nettoyage de l'accueil, des toilettes, de la salle de réunion, vidage des poubelles, tout ça. J'ai aussi fait les vitres de la salle de réunion, parce qu'il y avait plein de traces de doigts dessus. dans les bureaux j'ai chassé la poussière passé la spie partout et donné un peu d'eau aux plantes rien de bien fou ici mais alors le moment un peu galère de la journée c'était pour accéder à la salle des serveurs mon badge je sais pas pourquoi il voulait rien savoir bloqué de chez bloqué du coup j'ai dû embêter Camille Leroy tu sais la responsable IT là-bas elle a été super cool elle m'a filé son badge pour que je puisse faire mon taf dans la salle Elle m'a dit qu'ils allaient checker les badges de notre team pour qu'on ne se retrouve pas 

### 4 - Results

Write your findings here: which model you plan to use and why.


---
Answer here
---

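Comparing the three transcriptions, Whisper large v3 gives the best result. It gets the proper nouns right ("Marc Dupont", "l'agence digitale innovante", "Camille Leroy") and keeps "la responsable IT là-bas". Whisper tiny garbles many words ("Marc du pont", "la chance digital inovente", "Camille le roi"), and the Google output has no punctuation and turns "la responsable IT là-bas" into "la responsable a été là-bas".

Hence, we will use Whisper large v3 for the rest of the lab.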
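
To back this comparison with a number, a common metric is the word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below is a minimal implementation, not a replacement for a dedicated library. It assumes you have written a reference transcript by hand (the lab does not provide one), and the example strings are short, punctuation-free fragments used for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # distance between ref[:i] and an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hand-written reference fragment (an assumption) vs. a Whisper tiny style error.
reference = "salut c'est marc dupont ici"
hypothesis = "salut c'est marc du pont ici"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Lower is better. Punctuation and casing should be normalized the same way for every model before comparing, otherwise a model that outputs punctuation is penalized for it.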


## III - Generate a full report

Now that you have extracted the text and all the information, you have to generate a report.

You can use models available on the [Hugging Face Hub](https://huggingface.co/docs/api-inference/tasks/chat-completion).

Tip: since you want the report to be well formatted, you will have to work out how to prompt the model to achieve this.


Some key elements we want to extract:

- The agent's name
- The client's name
- The client's address
- What was done
- What problems there were
- ...


In [28]:
!pip install transformers



In [None]:
client = InferenceClient(provider="hf-inference", api_key=HF_TOKEN)

informations = ", ".join(
    [
        "agent's name",
        "client's name",
        "client's address",
        "what was done",
        "what problems there were",
        "contact informations",
    ]
)

prompt_template = """Extract these informations:
\t{informations}
from the following text:
\t{text}.
Don't add informations you don't find in the text.
Return the extracted informations in JSON format."""
prompt = prompt_template.format(
    informations=informations, text=result_whisper_large.strip()
)

messages = [{"role": "user", "content": prompt}]

stream = client.chat.completions.create(
    model="google/gemma-2-2b-it", messages=messages, max_tokens=500, stream=True
)

for chunk in stream:
    # delta.content can be None on some chunks; print an empty string instead.
    print(chunk.choices[0].delta.content or "", end="")

```json
{
  "agent's_name": "Marc Dupont",
  "client's_name": "Agence digitale innovante",
  "client's_address": "rue des innovateurs, Paris",
  "what_was_done": "Cleaning (reception, toilets, meeting room, waste disposal), window cleaning, dusting, watering plants",
  "problems_there_were": "Access issue to server room due to badge malfunction, scheduled follow-up needed with security for updated badges",
  "contact_informations": "01 55 555 5555 (phone), camidleroyagencedigitalinnovante.fr (email)"
}
```

As we can see, the report is well structured. This suggests that both our speech-to-text and our text-to-text models capture the content of the recording.
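
To turn the model's reply into an actual report, the JSON has to be parsed first. The sketch below assumes the streamed chunks have been joined into a single string, and that the reply contains exactly one JSON object, possibly wrapped in a code fence as in the output above. `parse_json_reply` and `render_report` are hypothetical helper names, and the sample reply is a shortened stand-in for the real one.

```python
import json


def parse_json_reply(reply: str) -> dict:
    """Parse the JSON object embedded in a model reply, ignoring any surrounding fence."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start : end + 1])


def render_report(fields: dict) -> str:
    """Render extracted fields as plain text, one 'Label: value' line per field."""
    lines = ["Report", "------"]
    for key, value in fields.items():
        label = key.replace("_", " ").capitalize()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


# Shortened stand-in for the model reply, wrapped in a json code fence.
fence = "`" * 3
sample = {
    "agent's_name": "Marc Dupont",
    "client's_name": "Agence digitale innovante",
}
reply = fence + "json\n" + json.dumps(sample) + "\n" + fence

fields = parse_json_reply(reply)
print(render_report(fields))
```

Parsing can still fail if the model returns malformed JSON, so in a real application the `json.loads` call would need error handling or a retry.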
