# Voice recognition

There are only five open-source licenses that should ever be used:

- **Apache 2.0**, for when you want to allow people to make and release proprietary versions of your product. It is advisable to only use this for extremely-low-level libraries/runtimes.

- **MPL 2.0**, but only if it does NOT invoke the "incompatible with secondary licenses" clause. This allows use as part of proprietary products, but preserves the sanctity of source code files only.

- **LGPL 3+**, for when you want your library to be usable as an upgradable part of proprietary products, but your code itself must remain pure.

    - it is possible to add a "static linking exception", creating a situation similar to the MPL (think about it: what is the difference between a library and (a collection of) single files?) but with a better-known basis. However, you must be aware that this requires giving up the "able to upgrade" freedom, and is only encouraging bad practices. Seriously, RPATH isn't that hard; the corner cases are well-documented.

- **GPL 3+**, for applications that run on an individual computer.

- **AGPL 3+**, for applications that run on the network AND are worth the effort of complying with the distribution requirements. This ensures continued freedom under the maximum set of conditions, but can be painful. If you choose this, you must figure out how to make every build of your software point to a publicly accessible version of the source, which implies significant (but arguably sensible) workflow restrictions. Specifying a commit is good, but you can't assume a single central git repo, since there may be temporary forks. Also think of what Linux distributions want to do: your ./configure should, at the very least, mandate passing several options like --vendor-url.


For our project, the objective is to produce an application for public use. The **Apache 2.0** license is preferable.


# Libraries used
- **pydub**: resample audio
- **sounddevice**: record audio
- **Vosk**: open-source speech recognition (Apache 2.0)
- **ipytest**: run pytest tests in Jupyter notebooks

## Recording audio

https://python-sounddevice.readthedocs.io/en/0.4.6/api/index.html

sounddevice simplifies the recording process and is easier to use than pyaudio.
It is distributed under the MIT License.

In [38]:
# Record audio
import sounddevice as sd
from scipy.io.wavfile import write

samplerate = 16000  # Hertz
duration = 5  # seconds
filename = 'output.wav'

print("Recording...")
mydata = sd.rec(int(samplerate * duration), 
                samplerate=samplerate,
                channels=1,  # mono, as expected by Vosk (use 2 for stereo)
                dtype='int16')
sd.wait()
print("Recording finished. Saving to file...")

# Save as WAV file using scipy
write(filename, samplerate, mydata)

Recording...
Recording finished. Saving to file...
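
Before passing a recording on, it can help to confirm that the file has the expected duration and is not silent. The following is a minimal sketch using only the standard library. `summarize_wav` is a hypothetical helper (not part of sounddevice or scipy), and it assumes 16-bit PCM as recorded above. The demo writes a short synthetic file instead of using the microphone.

```python
import os
import struct
import tempfile
import wave


def summarize_wav(path: str) -> dict:
    """Summarize a 16-bit PCM WAV file: rate, channels, duration (s), peak amplitude."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)
    # WAV stores 16-bit samples as little-endian signed integers.
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    peak = max((abs(s) for s in samples), default=0)
    return {
        "samplerate": rate,
        "channels": channels,
        "duration": n_frames / rate,
        "peak": peak,
    }


# Demo on a synthetic half-second file that stands in for 'output.wav'.
demo_samples = [0, 1000, -2000] + [0] * 7997  # 8000 samples at 16 kHz
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack(f"<{len(demo_samples)}h", *demo_samples))

summary = summarize_wav(demo_path)
print(summary)
```

A peak close to 0 would suggest that the microphone captured nothing, and a duration far from the requested 5 seconds would suggest an interrupted recording.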


## Resample audio
For optimal results, we always use audio with a 16 kHz sample rate, a single channel (mono), and WAV format.

⚠️ You need **ffmpeg** installed on your system:
https://github.com/BtbN/FFmpeg-Builds/releases

In [2]:
from pydub import AudioSegment

def resample_audio(audio_path: str, output_path: str = "resampled_audio.wav") -> str:
    """
    Convert an audio file to WAV, 16 kHz, mono.

    Returns:
        - output_path
    """
    # Load the audio file (any format that ffmpeg can decode)
    audio = AudioSegment.from_file(audio_path)

    # Resample to 16 kHz, the rate used with Vosk in this notebook
    resampled_audio = audio.set_frame_rate(16000)

    # Convert the audio to a single channel (mono), as Vosk expects
    resampled_audio = resampled_audio.set_channels(1)

    # Export the resampled audio as WAV
    resampled_audio.export(output_path, format="wav")

    return output_path
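
The target format (WAV, 16 kHz, mono) can be verified after conversion. The sketch below uses only the standard `wave` module. `vosk_format_problems` is a hypothetical helper name, and the 16-bit check is an assumption based on the PCM input that the Vosk examples expect. The demo files contain silence and exist only to exercise the check.

```python
import os
import tempfile
import wave


def vosk_format_problems(path: str, expected_rate: int = 16000) -> list:
    """Return a list of human-readable format problems (empty if the file is fine)."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wf.getsampwidth()}-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
    return problems


def _write_silence(path: str, channels: int, rate: int, n_frames: int = 100) -> None:
    """Write a short silent 16-bit WAV file (demo data only)."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * channels * n_frames)


tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.wav")
bad_path = os.path.join(tmp_dir, "bad.wav")
_write_silence(good_path, channels=1, rate=16000)
_write_silence(bad_path, channels=2, rate=44100)

good_problems = vosk_format_problems(good_path)
bad_problems = vosk_format_problems(bad_path)
print(good_problems)
print(bad_problems)
```

Calling such a check on the path returned by `resample_audio` would catch a conversion that silently kept the original rate or channel count.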

## Vosk

- Open-source
- Offline
- Supports 10+ languages, including French

https://alphacephei.com/vosk
https://alphacephei.com/vosk/models

- We chose the French model: **vosk-model-fr-0.22 (Apache 2.0)**

In [1]:
from vosk import Model

# Load the model
model = Model("./models/vosk-model-fr-0.22")
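
Loading fails if the model folder has not been downloaded and unzipped first. As a hedged sketch, a small guard can turn that into a clearer message. `require_model_dir` is a hypothetical helper, not part of the Vosk API, and the demo uses temporary directories rather than a real model.

```python
import os
import tempfile


def require_model_dir(path: str) -> str:
    """Return `path` if it is an existing directory, otherwise raise with a download hint."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"Vosk model directory not found: {path!r}. "
            "Download and unzip a model from https://alphacephei.com/vosk/models"
        )
    return path


# Demo: an existing directory passes, a missing one raises.
existing_dir = tempfile.mkdtemp()
checked = require_model_dir(existing_dir)

missing_dir = os.path.join(existing_dir, "no-such-model")
try:
    require_model_dir(missing_dir)
    missing_error = None
except FileNotFoundError as exc:
    missing_error = str(exc)
print(missing_error)
```

In the notebook this could be used as `Model(require_model_dir("./models/vosk-model-fr-0.22"))`.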

### Live speech transcription

In [41]:

from vosk import Model, KaldiRecognizer

import json
import pyaudio

model = Model("./models/vosk-model-fr-0.22")
recognizer = KaldiRecognizer(model, 16000)

mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8192)
stream.start_stream()

try:
    while True:
        data = stream.read(4096)
        if recognizer.AcceptWaveform(data):
            # Result() returns a JSON string such as {"text": "..."}
            text = json.loads(recognizer.Result())["text"]
            print(text)
            # Say "fin" to stop the transcription
            if text == "fin":
                break
        # else:
        #     print(recognizer.PartialResult())
finally:
    # Release the microphone
    stream.stop_stream()
    stream.close()
    mic.terminate()

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from ./models/vosk-model-fr-0.22/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:279) Loading HCLG from ./models/vosk-model-fr-0.22/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:294) Loading words from ./models/vosk-model-fr-0.22/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo ./models/vosk-model-fr-0.22/graph/phones/word_boundary.int
LOG (VoskAPI:ReadDataFiles():model.cc:31

||PaMacCore (AUHAL)|| Error on line 2523: err='-50', msg=Unknown Error
test
voiture
coca-cola
fin
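
`Result()` and `PartialResult()` both return JSON strings: the first carries a `"text"` field, the second a `"partial"` field. The sketch below decodes either kind and decides when the stop word was spoken. The helper names and the sample payloads are assumptions written by hand for illustration, so no microphone or model is needed to run it.

```python
import json


def parse_vosk_payload(payload: str) -> tuple:
    """Return ("final", text) or ("partial", text) from a Vosk JSON string."""
    data = json.loads(payload)
    if "text" in data:
        return "final", data["text"]
    return "partial", data.get("partial", "")


def is_stop_command(payload: str, stop_word: str = "fin") -> bool:
    """True only when a *final* result consists of the stop word."""
    kind, text = parse_vosk_payload(payload)
    return kind == "final" and text.strip().lower() == stop_word


# Hand-written sample payloads shaped like Vosk output.
final_fin = '{\n  "text" : "fin"\n}'
final_other = '{\n  "text" : "voiture"\n}'
partial_fin = '{\n  "partial" : "fin"\n}'

print(parse_vosk_payload(final_other))
print(is_stop_command(final_fin), is_stop_command(partial_fin))
```

Ignoring partial results for the stop word avoids ending the session on a hypothesis that the recognizer may still revise.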


### Transcription from audio file

In [3]:
import wave
from vosk import KaldiRecognizer

# Prepare audio: convert the source file to 16 kHz mono WAV
audio_file = "./audio_files/test_french.m4a"
audio_file = resample_audio(audio_file, "./test_french.wav")

# Create a recognizer object; the rate must match the resampled audio (16 kHz)
rec = KaldiRecognizer(model, 16000)


# Process an audio file
with wave.open(audio_file, "rb") as f:
    while True:
        data = f.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

# Print the final transcription
print(rec.FinalResult())

{
  "text" : "pourriez-vous m'indiquer le chemin pour aller de paris à marseille s'il vous plaît"
}
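
The read loop used for file transcription can be factored into a generator, which keeps the chunking logic in one place. This is a sketch under the same assumptions as the cell above (a WAV file read in chunks of 4000 frames). `iter_wave_chunks` is a hypothetical helper name, and the demo file contains silence.

```python
import os
import tempfile
import wave


def iter_wave_chunks(path: str, frames_per_chunk: int = 4000):
    """Yield successive blocks of raw PCM bytes until the file is exhausted."""
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data


# Demo: 10000 frames of 16-bit mono silence -> chunks of 4000, 4000 and 2000 frames.
demo_wav = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_wav, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 10000)

chunk_sizes = [len(chunk) for chunk in iter_wave_chunks(demo_wav)]
print(chunk_sizes)
```

With a recognizer at hand, the loop would become `for chunk in iter_wave_chunks(audio_file): rec.AcceptWaveform(chunk)`.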


## Test

Now that we have a working model, we need to test it on many samples.

We recorded a set of audio files. Our objective is to correctly identify city A and city B in each of them.

1. First, we need to get our test audio data.

Download these files as a zip archive:
- https://drive.google.com/drive/folders/1ir7eqefODLuBq4Rw1rV7cJG1era-qzB4

In [37]:
import zipfile
import os

# zip file name
zip_file = "voice_recognition_data-20230928T150022Z-001.zip"

# directory name where files will be extracted
TEST_FOLDER = "./audio_files/test"

# Create the directory where the files will be extracted
os.makedirs(TEST_FOLDER, exist_ok=True)

# Create a ZipFile object
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    # Extract all the contents of the zip file into the directory
    zip_ref.extractall(TEST_FOLDER)

print(f"Files extracted to {TEST_FOLDER}")

# update test DIR
TEST_FOLDER = os.path.join(TEST_FOLDER, "voice_recognition_data")

print(f"Test folder data: {TEST_FOLDER}")

Files extracted to ./audio_files/test
Test folder data: ./audio_files/test\voice_recognition_data
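
Before extracting, it can be useful to see what the archive contains. The sketch below counts files per extension with `zipfile.ZipFile.namelist()`. `count_extensions` is a hypothetical helper, and the demo archive and its entry names are made up for illustration; they are not the real contents of the downloaded zip.

```python
import os
import tempfile
import zipfile
from collections import Counter


def count_extensions(zip_path: str) -> Counter:
    """Count archive members per lowercase file extension, ignoring directory entries."""
    with zipfile.ZipFile(zip_path) as zf:
        names = [name for name in zf.namelist() if not name.endswith("/")]
    return Counter(os.path.splitext(name)[1].lower() for name in names)


# Demo archive with hypothetical entries (empty files).
demo_zip = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    for name in (
        "voice_recognition_data/audio_files.xlsx",
        "voice_recognition_data/audio_files/IA_voice_1_a_10/1.wav",
        "voice_recognition_data/audio_files/IA_voice_1_a_10/2.m4a",
        "voice_recognition_data/audio_files/IA_voice_1_a_10/3.M4A",
    ):
        zf.writestr(name, b"")

extension_counts = count_extensions(demo_zip)
print(extension_counts)
```

An unexpected extension in the counts is a hint that the extension filter used when gathering audio files needs another entry.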


In [39]:
# Join all audio files into one folder for test
import os
import shutil

src_directory = "./audio_files/test/voice_recognition_data/audio_files/"
target_directory = "./audio_files/test/voice_recognition_data/all"

# Create target directory if it doesn't exist
os.makedirs(target_directory, exist_ok=True)

# Walk through all files in all subdirectories of the source directory
for foldername, subfolders, filenames in os.walk(src_directory):
    for filename in filenames:
        # Assuming audio files have extensions .mp3, .wav, .m4a. Add more extensions as needed
        if filename.endswith(('.mp3', '.wav', '.m4a')):
            file_path = os.path.join(foldername, filename)
            shutil.copy(file_path, target_directory)  # Copy each audio file to the target directory

# remove whitespaces
for filename in os.listdir(target_directory):
    # check if the filename contains a space
    if ' ' in filename:
        # create a new filename by removing the spaces
        new_filename = filename.replace(' ', '')
        # create the full paths to the old and new filenames
        old_path = os.path.join(target_directory, filename)
        new_path = os.path.join(target_directory, new_filename)
        # rename the file
        os.rename(old_path, new_path)
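
Copying every subfolder into a single directory silently overwrites files that share a name, and removing spaces can create further clashes. The sketch below lists such collisions before any copy happens. `find_name_collisions` is a hypothetical helper, and the demo tree is created in a temporary directory with made-up file names.

```python
import os
import tempfile
from collections import defaultdict


def find_name_collisions(src_directory: str, extensions=(".mp3", ".wav", ".m4a")) -> dict:
    """Map each flattened file name (spaces removed) to its source paths when there are several."""
    by_name = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(src_directory):
        for filename in filenames:
            if filename.lower().endswith(extensions):
                flat_name = filename.replace(" ", "")
                by_name[flat_name].append(os.path.join(folder, filename))
    return {name: sorted(paths) for name, paths in by_name.items() if len(paths) > 1}


# Demo tree: two subfolders both contain "1.wav".
root = tempfile.mkdtemp()
for relative in ("a/1.wav", "b/1.wav", "b/2.m4a", "b/notes.txt"):
    path = os.path.join(root, *relative.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "wb").close()

collisions = find_name_collisions(root)
print(collisions)
```

An empty result means the flattening step is safe for the given source directory.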


2. Now we can use ipytest to run unit tests in the Jupyter notebook.

In [40]:
import wave
import json
from vosk import KaldiRecognizer

def transcript(audio_file) -> str:
    # Prepare audio: convert to 16 kHz mono WAV
    audio_file = resample_audio(audio_file, "./resampled_audio.wav")

    # Create a recognizer object; the rate must match the resampled audio (16 kHz)
    rec = KaldiRecognizer(model, 16000)

    # Process the audio file chunk by chunk
    with wave.open(audio_file, "rb") as f:
        while True:
            data = f.readframes(4000)
            if len(data) == 0:
                break
            rec.AcceptWaveform(data)

    # FinalResult() returns a JSON string; extract the transcribed text
    return json.loads(rec.FinalResult())["text"]

In [41]:
import pandas as pd
import re
import os

DIRECTORY = os.path.join(TEST_FOLDER, "audio_files", "IA_voice_1_a_10")

# read the excel
audio_files_df = pd.read_excel(os.path.join(TEST_FOLDER, "audio_files.xlsx"))

file_names = audio_files_df["file_name"].values
scripts = audio_files_df["script"].values
tags = audio_files_df["tags_to_recover"].values

# Convert each string into a list of cities
city_lists = [tag.split(', ') for tag in tags]

# Flatten the list of lists to get a single list of all cities
all_cities = [city for sublist in city_lists for city in sublist]

# Create a pattern that matches any city name in the list of all cities
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(city) for city in all_cities) + r')\b', re.IGNORECASE)

for file_name, tag in zip(file_names, city_lists):
    # Find the audio file; skip entries whose file is missing
    audio_file = os.path.join(DIRECTORY, str(file_name))
    if not os.path.exists(audio_file):
        continue

    # Transcribe the audio file
    predicted_sentence = transcript(audio_file)

    # Find all city names in the sentence
    found_cities = pattern.findall(predicted_sentence)

    print(predicted_sentence)
    print(found_cities)

pourriez-vous m'indiquer le chemin pour aller de paris à marseille s'il vous plaît
['paris', 'marseille']
je suis à lyon et je dois me rendre à toulouse quel est le meilleur itinéraire
['lyon', 'toulouse']
excusez-moi pouvez-vous me dire comment aller de bordeaux à nice
['bordeaux', 'nice']
je cherche à aller de nantes à strasbourg vous m'aider avec cette direction
['nantes', 'strasbourg']
quelle est la meilleure façon de se rendre de lille à montpellier en voiture
['lille', 'montpellier']
savez-vous comment je peux aller de grenoble à rennes
['grenoble', 'rennes']
je planifie un voyage de tours à nancy quelle route marocain m'entendez-vous
['tours', 'nancy']
je me demande quel est le trajet le plus rapide pour aller d'orléans à dijon joue des suggestions
['orléans', 'dijon']
pouvez-vous ajouter les directions pour aller de rouen à avignon
['rouen', 'avignon']
je souhaite voyager de brest à amiens quel chemin devrais-je prendre
['brest', 'amiens']
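
The city extraction step can be tried in isolation, without audio or a model. The sketch below builds the same kind of alternation pattern and applies it to two transcriptions copied from the output above. `build_city_pattern` is a hypothetical helper; it escapes each name with `re.escape` so that a name containing a regex metacharacter would still be matched literally.

```python
import re


def build_city_pattern(cities):
    """Compile a case-insensitive pattern matching any of the given city names as whole words."""
    unique = list(dict.fromkeys(city.lower() for city in cities))  # de-duplicate, keep order
    if not unique:
        raise ValueError("at least one city name is required")
    return re.compile(r"\b(?:" + "|".join(re.escape(c) for c in unique) + r")\b", re.IGNORECASE)


cities = ["Paris", "Marseille", "Orléans", "Dijon", "Paris"]
pattern = build_city_pattern(cities)

sentence_1 = "pourriez-vous m'indiquer le chemin pour aller de paris à marseille s'il vous plaît"
sentence_2 = (
    "je me demande quel est le trajet le plus rapide pour aller "
    "d'orléans à dijon joue des suggestions"
)

found_1 = pattern.findall(sentence_1)
found_2 = pattern.findall(sentence_2)
print(found_1)
print(found_2)
```

Because `\b` is Unicode-aware for `str` patterns in Python 3, accented names such as "orléans" are matched as whole words, including after an elided article like "d'".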


In [45]:
import ipytest
import pandas as pd
import re
import logging
import os

# Configure ipytest
ipytest.autoconfig()

# Configure logging
#logging.basicConfig(filename='test.log', level=logging.DEBUG, filemode='w')


DIRECTORY = "audio_files/test/voice_recognition_data/all/"

audio_files = pd.read_excel("./audio_files/test/voice_recognition_data/audio_files.xlsx")
file_names = audio_files["file_name"].values
scripts = audio_files["script"].values
tags = audio_files["tags_to_recover"].values

# Convert each string into a list of lowercase city names
city_lists = [tag.lower().split(', ') for tag in tags]

# Flatten the list of lists to get a single list of all cities
all_cities = [city for sublist in city_lists for city in sublist]

# Create a pattern that matches any city name in the list of all cities
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(city) for city in all_cities) + r')\b', re.IGNORECASE)

#log_msg = f"File: {audio_file}, Expected: {tag}, Found: {found_cities}. Predicted sentence: {sentence}"
#logging.info(log_msg)

def compute_accuracy(correct, total):
    if total == 0:
        return 0  # Avoid division by zero
    return correct / total

def calculate_score(correct_cities, total_cities, correct_audios, total_audios):
    """Compute and print the city-level and audio-level accuracies."""
    city_accuracy = compute_accuracy(correct_cities, total_cities) * 100
    audio_accuracy = compute_accuracy(correct_audios, total_audios) * 100
    
    print(f"City Accuracy: {city_accuracy:.2f}% ({correct_cities}/{total_cities})")
    print(f"Audio Accuracy: {audio_accuracy:.2f}% ({correct_audios}/{total_audios})")

# Define the test case
def test_speech_recognition():
    total_cities = 0
    correct_cities = 0
    total_audios = 0
    correct_audios = 0

    for file_name, tag in zip(file_names, city_lists):
        audio_file = os.path.join(DIRECTORY, str(file_name))

        # Skip entries whose audio file is missing
        if not os.path.exists(audio_file):
            continue

        # Transcribe the audio
        sentence = transcript(audio_file)

        # Find all city names in the sentence (lowercased to match the tags)
        found_cities = [city.lower() for city in pattern.findall(sentence)]

        # Update values for scoring
        total_audios += 1
        total_cities += len(tag)
        correct_cities += sum(1 for city in found_cities if city in tag)
        if set(found_cities) == set(tag):
            correct_audios += 1

        # Report mismatches in red without stopping the run
        # (this list comparison is order-sensitive, unlike the set comparison above)
        try:
            assert found_cities == tag, f"Expected {tag} !== {found_cities} for audio: {audio_file}.\nPredicted sentence: {sentence}"
        except AssertionError as e:
            print('\033[91m' + str(e) + '\033[0m')

    calculate_score(correct_cities, total_cities, correct_audios, total_audios)

# Run the tests
ipytest.run("--full-trace", "-vv", "-s")

# -s: show print output
# -vv: extra-verbose output
# -q: quiet / -qq: quieter

platform win32 -- Python 3.10.10, pytest-7.4.2, pluggy-1.3.0 -- c:\Users\loann\.conda\envs\global\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\loann\Desktop\Epitech\AI\T-AIA-901-LYO_1\SpeechRecognition
plugins: anyio-3.6.2, requests-mock-1.11.0, typeguard-2.13.3
collecting ... collected 2 items

t_6a7971e557384e9ea08aa310d8e126f7.py::test_speech_recognition Expected ['lyon', 'lille'] !== ['lyon'] for audio: audio_files/test/voice_recognition_data/all/49.wav.
  Predicted sentence: donc j'aimerais savoir comment atteindre lyon en venant de l'île pourriez-vous m'orienter
assert ['lyon'] == ['lyon', 'lille']
  Right contains one more item: 'lille'
  Full diff:
  - ['lyon', 'lille']
  + ['lyon']
Expected ['toulouse', 'bordeaux'] !== ['bordeaux'] for audio: audio_files/test/voice_recognition_data/all/53.m4a.
  Predicted sentence: ergonomique depuis bordeaux
assert ['bordeaux'] == ['toulouse', 'bordeaux']
  At index 0 diff: 'bordeaux' != 'toulouse'
  Right contai

<ExitCode.OK: 0>
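
The two scores printed by the test can be illustrated on a few results taken from the outputs above. This is a simplified sketch, not the notebook's exact counting: it compares sets of lowercase names, so duplicates and word order are ignored. `score` is a hypothetical helper name.

```python
def score(results):
    """Return (city_accuracy, audio_accuracy) for a list of (expected, found) city lists."""
    total_cities = 0
    correct_cities = 0
    correct_audios = 0
    for expected, found in results:
        expected_set = {city.lower() for city in expected}
        found_set = {city.lower() for city in found}
        total_cities += len(expected_set)
        # A city counts as correct when it is both expected and found.
        correct_cities += len(expected_set & found_set)
        # An audio counts as correct only when every expected city is found and nothing else.
        correct_audios += int(expected_set == found_set)
    total_audios = len(results)
    city_accuracy = correct_cities / total_cities if total_cities else 0.0
    audio_accuracy = correct_audios / total_audios if total_audios else 0.0
    return city_accuracy, audio_accuracy


results = [
    (["paris", "marseille"], ["paris", "marseille"]),  # fully correct
    (["lyon", "lille"], ["lyon"]),                      # audio 49: "lille" missed
    (["toulouse", "bordeaux"], ["bordeaux"]),           # audio 53: "toulouse" missed
]
city_accuracy, audio_accuracy = score(results)
print(f"City Accuracy: {city_accuracy * 100:.2f}%")
print(f"Audio Accuracy: {audio_accuracy * 100:.2f}%")
```

City accuracy rewards partially correct audios (4 of 6 cities here), while audio accuracy only counts audios where every expected city was recovered (1 of 3 here).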

### Test analysis
Our model identifies 91% of the expected cities correctly (city accuracy) and recovers every expected city in 84% of the audio files (audio accuracy).

- Our model confuses the city **Lille** with the word **l'île** (audio 49).
- For audios 49, 53, and 59, it misidentifies or misses cities or words. Ambient noise affects these results.

- Audio 73: "Sarlat", predicted "salat". The "r" must be pronounced clearly.
- Audio 76: "Carentan", predicted "quarante ans".
- Audio 77: "Millau", predicted "milo". It is pronounced "milo" in the script.
- Audio 78: "Foix", predicted "deux fois". Saying "à Foix" works.
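
One possible mitigation for the homophone confusions listed in this analysis is a small alias table applied to the transcription before the city search. This is only a sketch of an idea, not something the notebook implements: `ALIASES` and `apply_aliases` are hypothetical, and the table is built by hand from the errors reported here. "deux fois" is left out on purpose, since it is a common phrase.

```python
import re

# Hypothetical alias table: what the recognizer heard -> the city that was meant.
ALIASES = {
    "l'île": "lille",
    "quarante ans": "carentan",
    "salat": "sarlat",
    "milo": "millau",
}


def apply_aliases(sentence: str, aliases: dict = ALIASES) -> str:
    """Replace each known mis-transcription with the intended city name (whole words only)."""
    for heard, city in aliases.items():
        # Lookarounds act as word boundaries that also work next to an apostrophe.
        pattern = r"(?<!\w)" + re.escape(heard) + r"(?!\w)"
        sentence = re.sub(pattern, city, sentence, flags=re.IGNORECASE)
    return sentence


# Transcription of audio 49, copied from the test output.
heard_49 = "donc j'aimerais savoir comment atteindre lyon en venant de l'île pourriez-vous m'orienter"
corrected_49 = apply_aliases(heard_49)
print(corrected_49)
```

Such a table trades recall for precision: a sentence that really mentions an island would also be rewritten, so each alias should be checked against the kind of requests the application expects.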