# Voice recognition

 There are only 5 open-source licenses that should ever be used:

- **Apache 2.0**, for when you want to allow people to make and release proprietary versions of your product. It is advisable to only use this for extremely-low-level libraries/runtimes.

- **MPL 2.0**, but only if it does NOT invoke the "incompatible with secondary licenses" clause. This allows use as part of proprietary products, but preserves the sanctity of source code files only.

- **LGPL 3+**, for when you want your library to be usable as an upgradable part of proprietary products, but your code itself must remain pure.

    - it is possible to add a "static linking exception", creating a situation similar to the MPL (think about it: what is the difference between a library and (a collection of) single files?) but with a better-known basis. However, you must be aware that this requires giving up the "able to upgrade" freedom, and is only encouraging bad practices. Seriously, RPATH isn't that hard; the corner cases are well-documented.

- **GPL 3+**, for applications that run on an individual computer.

- **AGPL 3+**, for applications that run on the network AND are worth the effort of complying with the distribution requirements. This ensures continued freedom under the maximum set of conditions, but can be painful. If you choose this, you must figure out how to make every build of your software point to a publicly-accessible version of the source (specifying a commit is good, but remember: you can't assume a single central git repo, since there may be temporary forks; additionally think of what Linux distributions want to do. Your ./configure should mandate passing several options like --vendor-url at the very least), which implies significant (but arguably sensible) workflow restrictions.


For our project, the objective is to produce an application for public use, so the **Apache 2.0** licence is preferable.


# Libraries used:
- **pydub**: resample audio
- **sounddevice**: record audio
- **Vosk**: open-source speech recognition (Apache 2.0)
- **ipytest**: run Python tests in Jupyter notebooks

## Recording audio

https://python-sounddevice.readthedocs.io/en/0.4.6/api/index.html

sounddevice simplifies the recording process and is easier to use than PyAudio.
It is released under the MIT License.

In [5]:
# Record audio
import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write

samplerate = 16000  # Hertz
duration = 5  # seconds
filename = 'output.wav'

print("Recording...")
mydata = sd.rec(int(samplerate * duration), 
                samplerate=samplerate,
                channels=1, # mono (Vosk) / stereo
                dtype='int16')
sd.wait()
print("Recording finished. Saving to file...")

# Save as WAV file using scipy
write(filename, samplerate, mydata)

Recording...
Recording finished. Saving to file...


## Resample audio
For optimal results, we always use audio in WAV format, mono, at a 16 kHz sample rate.

⚠️ You need **ffmpeg** installed on your system:
https://github.com/BtbN/FFmpeg-Builds/releases

In [6]:
from pydub import AudioSegment

def resample_audio(audio_path: str, output_path: str = "resampled_audio.wav") -> str:
    """
    Convert an audio file to WAV, 16 kHz, mono.

    Returns:
        - output_path
    """
    # Load the audio file (any format that ffmpeg can decode)
    audio = AudioSegment.from_file(audio_path)

    # Resample it: Vosk and other models work better with 16 kHz
    resampled_audio = audio.set_frame_rate(16000)

    # Convert the audio to a single channel (mono), as Vosk expects
    resampled_audio = resampled_audio.set_channels(1)

    # Export the resampled audio
    resampled_audio.export(output_path, format="wav")

    return output_path
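
Before passing a file to a recognizer, it can help to confirm that it really is 16 kHz, mono, 16-bit WAV. The following is a minimal sketch using only the standard `wave` module. The names `is_vosk_ready` and `write_silence` are hypothetical helpers introduced for illustration; they are not part of pydub or Vosk.

```python
import os
import tempfile
import wave


def is_vosk_ready(path: str, expected_rate: int = 16000) -> bool:
    """Return True if the WAV file is mono, 16-bit PCM at the expected sample rate."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels() == 1          # mono
            and wf.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wf.getframerate() == expected_rate
        )


def write_silence(path: str, rate: int, channels: int, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit WAV file (only used to demonstrate the check)."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(rate * seconds) * channels)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, rate=16000, channels=1)
    write_silence(bad, rate=44100, channels=2)
    print(is_vosk_ready(good))  # True
    print(is_vosk_ready(bad))   # False
```

A file that fails this check can be passed through `resample_audio` first.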



In [7]:
# Live speech transcription
import json

import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("./models/vosk-model-fr-0.22")
recognizer = KaldiRecognizer(model, 16000)

mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8192)
stream.start_stream()

# Runs until the kernel is interrupted
while True:
    data = stream.read(4096)
    if recognizer.AcceptWaveform(data):
        # Result() returns a JSON string; parse it instead of slicing the raw text
        print(json.loads(recognizer.Result())["text"])
    # else:
    #     print(recognizer.PartialResult())

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from ./models/vosk-model-fr-0.22/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:279) Loading HCLG from ./models/vosk-model-fr-0.22/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:294) Loading words from ./models/vosk-model-fr-0.22/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo ./models/vosk-model-fr-0.22/graph/phones/word_boundary.int
LOG (VoskAPI:ReadDataFiles():model.cc:31

oui
foncez

hé

bon j'ai débloqué des stocks
mais ses cheveux étaient bonnes
moi aussi mais c'est ses serveurs et son sang sera versé au son de gaza
revenant sur la question
femme à ce message peine toujours possible possède depuis l'espace issus des collections ferait bizarre tu remets ta colonne
dombasle

parfait
passez votre serment
d'autres fruit

bonjour

qui fonctionne pour te soutenir

bonjour
a toi

rien ne va plus

neuf deux

KeyboardInterrupt: 

## Google Speech Recognition

Testing speech recognition with the Google API.

In [None]:
import speech_recognition as sr
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print("Microphone with name \"{1}\" found for `Microphone(device_index={0})`".format(index, name))

Microphone with name "Microphone MacBook Air" found for `Microphone(device_index=0)`
Microphone with name "Haut-parleurs MacBook Air" found for `Microphone(device_index=1)`
Microphone with name "Microphone de « Space Banan »" found for `Microphone(device_index=2)`
Microphone with name "Microsoft Teams Audio" found for `Microphone(device_index=3)`


On my computer, the default microphone is Intel Smart Sound, which is `device_index=19`.

In [None]:
import speech_recognition as sr

sr.Microphone.list_microphone_names()

['Microphone MacBook Air',
 'Haut-parleurs MacBook Air',
 'Microphone de «\xa0Space Banan\xa0»',
 'Microsoft Teams Audio']
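
Device indexes differ from one machine to another, so hard-coding a number is fragile. As a minimal sketch, the index can be looked up by name instead. `find_device_index` is a hypothetical helper written for this notebook, not part of the `speech_recognition` API, and the names below are copied from the listing above.

```python
from typing import Optional, Sequence


def find_device_index(names: Sequence[str], keyword: str) -> Optional[int]:
    """Return the index of the first device whose name contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    for index, name in enumerate(names):
        if needle in name.casefold():
            return index
    return None


# Names copied from the listing above; on a real machine, pass
# sr.Microphone.list_microphone_names() instead.
names = [
    "Microphone MacBook Air",
    "Haut-parleurs MacBook Air",
    "Microphone de «\xa0Space Banan\xa0»",
    "Microsoft Teams Audio",
]

print(find_device_index(names, "teams"))        # 3
print(find_device_index(names, "smart sound"))  # None
```

When the lookup returns `None`, falling back to `sr.Microphone()` without an index selects the system default.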

https://realpython.com/python-speech-recognition/

In [8]:
import speech_recognition as sr

# micro = sr.Microphone(device_index=19)
r = sr.Recognizer()

# Load default microphone on the system	
micro = sr.Microphone()

with micro as source:
    print("Speak!")
    audio_data = r.listen(source)
    print("End!")
result = r.recognize_google(audio_data, language="fr-FR")
print (">", result)

Speak!
End!


UnknownValueError: 

In [None]:
import speech_recognition as sr

# micro = sr.Microphone(device_index=19)
r = sr.Recognizer()

# Load default microphone on the system	
micro = sr.Microphone()

with micro as source:
    print("Speak!")
    audio_data = r.listen(source)
    print("End!")
result = r.recognize_sphinx(audio_data, language="fr-FR")
print (">", result)

Speak!
End!


RequestError: missing PocketSphinx language data directory: "/Users/xiangwei/miniforge3/envs/SpeechRecognition/lib/python3.10/site-packages/speech_recognition/pocketsphinx-data/fr-FR"

In [9]:
from pocketsphinx import LiveSpeech
for phrase in LiveSpeech(): print(phrase)

they do the use


In [10]:
harvard = sr.AudioFile('./audio_files/harvard.wav')

r = sr.Recognizer()

with harvard as source:
    audio = r.record(source)
result = r.recognize_google(audio)
print(">", result)

> the still smell of old beer lingers it takes heat to bring out the odour a cold dip restores health exist a salt pickle taste fine with him as well past or my favourite exist for food is the hot cross bun


Google Speech is very powerful, but it requires a paid licence, so we cannot use it for our project.

## Deep speech

https://github.com/mozilla/DeepSpeech

DeepSpeech by Mozilla is an open-source Speech-to-Text engine.

+ Open source
- English only, but adaptable (we will try)
- The project was discontinued after Mozilla's restructuring.

https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#usage-docs

We will use this community French model:
https://github.com/Common-Voice/commonvoice-fr/releases/tag/fr-v0.6

https://discourse.mozilla.org/t/modele-francais-0-6-pour-deepspeech-v0-7-v0-8-v0-9/71993/6

In [11]:
import deepspeech
import numpy as np
import wave

# Load the model and scorer
model_path = './models/model_tensorflow_fr/output_graph.pbmm'
scorer_path = './models/model_tensorflow_fr/kenlm.scorer'

model = deepspeech.Model(model_path)
model.enableExternalScorer(scorer_path)

# Load an audio file (16-bit PCM WAV format)
audio_file = './resampled_audio.wav'

with wave.open(audio_file, 'rb') as wf:
    # Ensure the audio is 16 kHz
    if wf.getframerate() != 16000:
        print("Warning: The audio sample rate is not 16kHz. You may need to resample for optimal results.")

    # Read the audio data as 16-bit samples
    audio_data = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

# Perform STT
transcript = model.stt(audio_data)
print(transcript)

ModuleNotFoundError: No module named 'deepspeech'

After some tests, the English model they provide works very well with English, but it does not perform well with French.

## PaddleSpeech

PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.

https://github.com/PaddlePaddle/PaddleSpeech

https://paddlespeech.readthedocs.io/en/latest/

In [None]:
import paddlespeech

# Load the model
model = paddlespeech.asr.load_model('path_to_model', 'path_to_pretrained_weights')

# Process audio
audio_data = paddlespeech.asr.preprocess('path_to_audio_file.wav')

# Perform ASR
transcription = model.transcribe(audio_data)

print(transcription)

ModuleNotFoundError: No module named 'paddlespeech'

In [None]:
import paddlespeech

# Initialize the ASR model
asr_model = paddlespeech.s2t.ASRModel()

# Transcribe an audio file
filename = "./audio_files/harvard.wav"  # replace with your audio file
transcription = asr_model.transcribe_file(filename)

print(transcription)

AttributeError: module 'paddlespeech' has no attribute 's2t'

## Vosk

- Open-source
- Offline
- Supports 10+ languages, including French

https://alphacephei.com/vosk
https://alphacephei.com/vosk/models

- vosk-model-fr-0.22 (Apache 2.0)

In [12]:
from vosk import Model

# Load the model
model = Model("./models/vosk-model-fr-0.22")

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from ./models/vosk-model-fr-0.22/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:279) Loading HCLG from ./models/vosk-model-fr-0.22/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:294) Loading words from ./models/vosk-model-fr-0.22/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo ./models/vosk-model-fr-0.22/graph/phones/word_boundary.int
LOG (VoskAPI:ReadDataFiles():model.cc:31

In [20]:
import sys
import json
import os
from vosk import KaldiRecognizer
import wave

# Prepare audio
audio_file = "./audio_files/harvard.wav"
audio_file = resample_audio(audio_file, "./harvard.wav")

# Create a recognizer object
rec = KaldiRecognizer(model, 16000)  # Replace with the sample rate of your audio


# Process an audio file
with wave.open(audio_file, "rb") as f:
    while True:
        data = f.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

# Print the final transcription
print(rec.FinalResult())

['English']
{
  "text" : "basse-terre smart small hot bird wenger est exquis de bruno peyron écran depp winstrol safe and ast et son décolleté femme tres tacos al pastor rarement fait vrai et c'est faux fdl s nigra cross bone"
}


## Test

Now that we have a good model, we need to test it on many samples.

We recorded some audio files. Our objective is to correctly identify city A and city B in each one.

1. First, we need to get the test audio data.

Download these files as a zip archive:
- https://drive.google.com/drive/folders/1ir7eqefODLuBq4Rw1rV7cJG1era-qzB4

In [None]:
import zipfile
import os

# zip file name
zip_file = "voice_recognition_data-20230928T150022Z-001.zip"

# directory name where files will be extracted
TEST_FOLDER = "./audio_files/test"

# Create the directory where the files will be extracted
os.makedirs(TEST_FOLDER, exist_ok=True)

# Create a ZipFile object
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    # Extract all the contents of the zip file into the directory
    zip_ref.extractall(TEST_FOLDER)

print(f"Files extracted to {TEST_FOLDER}")

# update test DIR
TEST_FOLDER = os.path.join(TEST_FOLDER, "voice_recognition_data")

print(f"Test folder data: {TEST_FOLDER}")

Files extracted to ./audio_files/test
Test folder data: ./audio_files/test\voice_recognition_data


2. Now we can use ipytest to run unit tests in the Jupyter notebook.

In [None]:
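import wave
import json

def transcript(audio_file) -> str:
    """Transcribe an audio file with Vosk and return the recognized text."""
    # Prepare audio
    audio_file = resample_audio(audio_file, "./resampled_audio.wav")

    # Create a recognizer object (16 kHz matches the output of resample_audio)
    rec = KaldiRecognizer(model, 16000)

    # Process the audio file chunk by chunk
    segments = []
    with wave.open(audio_file, "rb") as f:
        while True:
            data = f.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True at the end of an utterance;
            # collect that segment before the recognizer starts the next one
            if rec.AcceptWaveform(data):
                segments.append(json.loads(rec.Result())["text"])

    # FinalResult() returns a JSON string holding the last segment
    segments.append(json.loads(rec.FinalResult())["text"])
    return " ".join(s for s in segments if s)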

In [None]:
import pandas as pd
import re
import os

DIRECTORY = os.path.join(TEST_FOLDER, "audio_files", "IA_voice_1_a_10")

# read the excel
audio_files_df = pd.read_excel(os.path.join(TEST_FOLDER, "audio_files.xlsx"))

file_names = audio_files_df["file_name"].values
scripts = audio_files_df["script"].values
tags = audio_files_df["tags_to_recover"].values

# Converting each string into a list of cities
city_lists = [tag.split(', ') for tag in tags]


# Flattening the list of lists to get a single list of all cities
all_cities = [city for sublist in city_lists for city in sublist]

# Create a pattern that matches any city name in the list of all cities
pattern = re.compile(r'\b(?:' + '|'.join(all_cities) + r')\b', re.IGNORECASE)

for file_name, tag in zip(file_names, city_lists):
    # Find the audio file; skip entries whose file does not exist
    audio_file = os.path.join(DIRECTORY, str(file_name))
    if not os.path.exists(audio_file):
        continue

    # Transcribe the audio file
    predicted_sentence = transcript(audio_file)

    # Find all city names in the sentence
    found_cities = pattern.findall(predicted_sentence)

    print(predicted_sentence)
    print(found_cities)

pourriez-vous m'indiquer le chemin pour aller de paris à marseille s'il vous plaît
['paris', 'marseille']
je suis à lyon et je dois me rendre à toulouse quel est le meilleur itinéraire
['lyon', 'toulouse']
excusez-moi pouvez-vous me dire comment aller de bordeaux à nice
['bordeaux', 'nice']
je cherche à aller de nantes à strasbourg vous m'aider avec cette direction
['nantes', 'strasbourg']
quelle est la meilleure façon de se rendre de lille à montpellier en voiture
['lille', 'montpellier']
savez-vous comment je peux aller de grenoble à rennes
['grenoble', 'rennes']
je planifie un voyage de tours à nancy quelle route marocain m'entendez-vous
['tours', 'nancy']
je me demande quel est le trajet le plus rapide pour aller d'orléans à dijon joue des suggestions
['orléans', 'dijon']
pouvez-vous ajouter les directions pour aller de rouen à avignon
['rouen', 'avignon']
je souhaite voyager de brest à amiens quel chemin devrais-je prendre
['brest', 'amiens']
CPU times: total: 0 ns
Wall time: 0 ns
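
The city pattern is built by joining raw names with `|`, which can misbehave if a name contains a regex metacharacter or if one name is a prefix of another. The following is a minimal sketch of a more defensive version, under the assumption that the names arrive as plain strings. `build_city_pattern` and `find_cities` are hypothetical helpers, and the city list is a small sample rather than the contents of `audio_files.xlsx`.

```python
import re
from typing import Iterable, List


def build_city_pattern(cities: Iterable[str]) -> "re.Pattern":
    """Compile one case-insensitive pattern matching any of the given city names."""
    unique = {city.strip().lower() for city in cities}
    # Longest names first so that "saint-étienne" wins over "saint"
    ordered = sorted(unique, key=lambda c: (-len(c), c))
    # re.escape protects characters such as "-" or "." inside a name
    alternatives = "|".join(re.escape(c) for c in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_cities(pattern: "re.Pattern", sentence: str) -> List[str]:
    """Return the matched city names in order of appearance, lowercased."""
    return [match.lower() for match in pattern.findall(sentence)]


pattern = build_city_pattern(["Paris", "Marseille", "Orléans", "Dijon", "Paris"])
print(find_cities(pattern, "le chemin pour aller de paris à marseille s'il vous plaît"))
# ['paris', 'marseille']
print(find_cities(pattern, "le trajet le plus rapide pour aller d'orléans à dijon"))
# ['orléans', 'dijon']
```

Because matches are returned in order of appearance, the first element can be read as city A and the second as city B for sentences of the form "de A à B".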

In [None]:
import ipytest
import pandas as pd
import re

ipytest.autoconfig()

DIRECTORY = "audio_files/test/voice_recognition_data/audio_files/"

audio_files = pd.read_excel("./audio_files.xlsx")
file_names = audio_files["file_name"].values
scripts = audio_files["script"].values
tags = audio_files["tags_to_recover"].values

# Convert each string into a lowercase list of cities
city_lists = [tag.lower().split(', ') for tag in tags]

# Flatten the list of lists to get a single list of all cities
all_cities = [city for sublist in city_lists for city in sublist]

# Create a pattern that matches any city name in the list of all cities
pattern = re.compile(r'\b(?:' + '|'.join(all_cities) + r')\b', re.IGNORECASE)

# Define the test case
def test_speech_recognition():
    for file_name, tag in zip(file_names, city_lists):
        sentence = transcript(DIRECTORY + str(file_name) + ".m4a")

    
        # Find all city names in the sentence
        found_cities = re.findall(pattern, sentence)
        assert found_cities == tag, f"Expected {tag} !== {found_cities}"

# Run the tests
ipytest.run('-qq')

..                                                                                           [100%]


<ExitCode.OK: 0>

## SpeechBrain

- Open-source
- online
- Multiple languages, trained on the VoxLingua107 dataset

- Website: https://speechbrain.github.io/
- GitHub: https://github.com/speechbrain/speechbrain
- HuggingFace: https://huggingface.co/speechbrain

In [29]:
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from speechbrain.pretrained import EncoderDecoderASR

classifier = EncoderClassifier.from_hparams(source="speechbrain/lang-id-commonlanguage_ecapa", savedir="pretrained_models/lang-id-commonlanguage_ecapa")

out_prob, score, index, text_lab = classifier.classify_file('./audio_files/harvard.wav')
if text_lab == ['French']:
    asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-commonvoice-fr", savedir="pretrained_models/asr-crdnn-commonvoice-fr")
    print(asr_model.transcribe_file("./audio_files/harvard.wav"))
else:
    print('Wrong language')

Wrong language
