## Instructions ##

The folder Audio_Files includes all the Exercise 4 audio files.

The folder Models includes the English, Spanish and Italian models you need to solve Exercise 4.

**Important**: You must install in the notebook any additional Python libraries you need to solve Exercise 4. The notebook already has deepspeech, librosa, and numpy installed.

In [1]:
from deepspeech import Model, version

In [2]:
import librosa as lr

In [3]:
import numpy as np
import re
import IPython.display as ipd
import wave
import soundfile as sf

We store the real content of each audio file, without punctuation or special characters, for the WER evaluation.

In [4]:
real_contents = {
    "EN": {
        "suitcase": "please i had lost my suitcase",
        "checkin": "where is the check in desk",
        "parents": "i had lost my parents",
        "what_time": "what time is my plane",
        "where": "where are the restaurants and shops"
    },
    "ES": {
        "checkin": "donde están los mostradores",
        "parents": "he perdido a mis padres",
        "suitcase": "por favor he perdido mi maleta",
        "what_time": "a que hora es mi avión",
        "where": "donde están los restaurantes y las tiendas"
    },
    "IT": {
        "parents": "ho perso i miei genitori",
        "suitcase": "per favore ho perso la mia valigia",
        "what_time": "a che ora è il mio aereo",
        "checkin": "dove e il bancone",
        "where": "dove sono i ristoranti e i negozi"
    }
}

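The WER is defined as $$WER = \frac{S + D + I}{N} \times 100$$ where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words and $N$ is the number of words in the reference.

Our function approximates this as the number of reference words missing from the result, divided by the number of reference words, times 100.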
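
For comparison, the formula can also be evaluated directly with a word-level edit distance, which counts substitutions, deletions, and insertions instead of missing words. The following is a minimal sketch, not the function used in this notebook; the name `edit_distance_wer` is an assumption introduced here.

```python
def edit_distance_wer(result, real_content):
    """Word error rate in percent, from the word-level Levenshtein distance."""
    hyp = result.split()
    ref = real_content.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref) * 100
```

For example, `edit_distance_wer("where is the checking desk", "where is the check in desk")` counts one substitution and one deletion over six reference words.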

In [5]:
# Transcribe an audio file with the DeepSpeech model and scorer for the given language
def process_audio(file, language="EN"):
    if language == "EN":
        scorer = f"./Models/{language}/deepspeech-0.9.3-models.scorer"
        model = f"./Models/{language}/deepspeech-0.9.3-models.pbmm"
    else:
        scorer = f"./Models/{language}/kenlm_{language.lower()}.scorer"
        model = f"./Models/{language}/output_graph_{language.lower()}.pbmm"
    print(f"Analyzing file: {file}")

    ds = Model(model)
    ds.enableExternalScorer(scorer)
    # Load the audio resampled to the rate the model expects
    desired_sample_rate = ds.sampleRate()
    audio = lr.load(file, sr=desired_sample_rate)[0]
    # Convert librosa's float samples in [-1, 1] to the 16-bit integers DeepSpeech takes
    audio = (audio * 32767).astype(np.int16)
    res = ds.stt(audio)
    return res

# Normalize the model output for comparison: lowercase it and replace special characters with spaces.
# Note: the ASCII-only pattern also replaces accented letters (á, ñ, è), so "baños" becomes "ba os".
def process_result(result):
    string = re.sub(r"[^a-zA-Z0-9]+", ' ', result)
    string = string.lower()
    return string

# Finally, we split both strings into sets of words and check the result against the real content.
# This approximates the WER as the share of distinct reference words missing from the result;
# word order, repeated words, and inserted words are not counted.
def calculate_WER(result, real_content):
    result = set(result.split())
    real_content = set(real_content.split())
    common_words = result & real_content
    return (abs(len(common_words) - len(real_content))) / len(real_content) * 100
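
The pattern `[^a-zA-Z0-9]+` in `process_result` also replaces accented letters, which matters for the Spanish and Italian transcripts. A Unicode-aware variant could look like the sketch below. `normalize_text` is a hypothetical name, and swapping it in would change the WER values reported in this notebook.

```python
import re
import unicodedata

def normalize_text(text):
    """Lowercase text and replace punctuation with spaces, keeping accented letters."""
    # Compose characters such as "n" + combining tilde into a single "ñ"
    text = unicodedata.normalize("NFC", text)
    # For str patterns in Python 3, \w matches Unicode letters and digits;
    # underscores are handled separately so they count as punctuation too.
    cleaned = re.sub(r"[^\w]+|_+", " ", text)
    # Collapse repeated spaces and trim the ends
    return " ".join(cleaned.lower().split())
```

With this variant, a word such as "baños" stays intact instead of being split at the "ñ".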

We create a list that links each audio file to its real content and, once processed, to the model's result and its WER.

In [6]:
contents = [
  { "file": "./Audio/EN/checkin.wav", "real_content": real_contents["EN"]["checkin"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/checkin_child.wav", "real_content": real_contents["EN"]["checkin"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/parents.wav", "real_content": real_contents["EN"]["parents"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/parents_child.wav", "real_content": real_contents["EN"]["parents"], "result": "", "language": "EN", "wer": 0},
  { "file": "./Audio/EN/suitcase.wav", "real_content": real_contents["EN"]["suitcase"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/suitcase_child.wav", "real_content": real_contents["EN"]["suitcase"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/what_time.wav", "real_content": real_contents["EN"]["what_time"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/what_time_child.wav", "real_content": real_contents["EN"]["what_time"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/where.wav", "real_content": real_contents["EN"]["where"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/EN/where_child.wav", "real_content": real_contents["EN"]["where"], "result": "", "language": "EN", "wer": 0 },
  { "file": "./Audio/ES/checkin_es.wav", "real_content": real_contents["ES"]["checkin"], "result": "", "language": "ES", "wer": 0 },
  { "file": "./Audio/ES/parents_es.wav", "real_content": real_contents["ES"]["parents"], "result": "", "language": "ES", "wer": 0 },
  { "file": "./Audio/ES/suitcase_es.wav", "real_content": real_contents["ES"]["suitcase"], "result": "", "language": "ES", "wer": 0},
  { "file": "./Audio/ES/what_time_es.wav", "real_content": real_contents["ES"]["what_time"], "result": "" , "language": "ES", "wer": 0},
  { "file": "./Audio/ES/where_es.wav", "real_content": real_contents["ES"]["where"], "result": "", "language": "ES", "wer": 0 },
  { "file": "./Audio/IT/checkin_it.wav", "real_content": real_contents["IT"]["checkin"], "result": "", "language": "IT", "wer": 0 },
  { "file": "./Audio/IT/parents_it.wav", "real_content": real_contents["IT"]["parents"], "result": "", "language": "IT", "wer": 0 },
  { "file": "./Audio/IT/suitcase_it.wav", "real_content": real_contents["IT"]["suitcase"], "result": "", "language": "IT", "wer": 0 },
  { "file": "./Audio/IT/what_time_it.wav", "real_content": real_contents["IT"]["what_time"], "result": "", "language": "IT", "wer": 0 },
  { "file": "./Audio/IT/where_it.wav", "real_content": real_contents["IT"]["where"], "result": "", "language": "IT", "wer": 0 }
]

We iterate over the list and store the model's result and its WER in each entry.

In [7]:
for content in contents:
    result = process_audio(content["file"], language=content["language"])
    processed_result = process_result(result)
    content["result"] = processed_result
    content["wer"] = calculate_WER(processed_result, content["real_content"])

Analyzing file: ./Audio/EN/checkin.wav
Analyzing file: ./Audio/EN/checkin_child.wav
Analyzing file: ./Audio/EN/parents.wav
Analyzing file: ./Audio/EN/parents_child.wav
Analyzing file: ./Audio/EN/suitcase.wav
Analyzing file: ./Audio/EN/suitcase_child.wav
Analyzing file: ./Audio/EN/what_time.wav
Analyzing file: ./Audio/EN/what_time_child.wav
Analyzing file: ./Audio/EN/where.wav
Analyzing file: ./Audio/EN/where_child.wav
Analyzing file: ./Audio/ES/checkin_es.wav
Analyzing file: ./Audio/ES/parents_es.wav
Analyzing file: ./Audio/ES/suitcase_es.wav
Analyzing file: ./Audio/ES/what_time_es.wav
Analyzing file: ./Audio/ES/where_es.wav
Analyzing file: ./Audio/IT/checkin_it.wav
Analyzing file: ./Audio/IT/parents_it.wav
Analyzing file: ./Audio/IT/suitcase_it.wav
Analyzing file: ./Audio/IT/what_time_it.wav
Analyzing file: ./Audio/IT/where_it.wav


In [8]:
print("Language | File | WER | Real Content | Predicted Content")

for content in contents:
    print(f"{content['language']} | {content['file']} | {content['wer']} | {content['real_content']} | {content['result']}")
    print("\n")

Language | File | WER | Real Content | Predicted Content
EN | ./Audio/EN/checkin.wav | 33.33333333333333 | where is the check in desk | where is the checking desk


EN | ./Audio/EN/checkin_child.wav | 100.0 | where is the check in desk | aristeides


EN | ./Audio/EN/parents.wav | 0.0 | i had lost my parents | i had lost my parents


EN | ./Audio/EN/parents_child.wav | 0.0 | i had lost my parents | i had lost my parents


EN | ./Audio/EN/suitcase.wav | 16.666666666666664 | please i had lost my suitcase | please i have lost my suitcase


EN | ./Audio/EN/suitcase_child.wav | 33.33333333333333 | please i had lost my suitcase | this i had lost my sakes


EN | ./Audio/EN/what_time.wav | 20.0 | what time is my plane | what time is my plan


EN | ./Audio/EN/what_time_child.wav | 20.0 | what time is my plane | what time is my plan


EN | ./Audio/EN/where.wav | 0.0 | where are the restaurants and shops | where are the restaurants and shops


EN | ./Audio/EN/where_child.wav | 0.0 | where are the 

In [9]:
# The average WER of the system is:
mean_wer = sum([content["wer"] for content in contents]) / len(contents)
print(mean_wer)

30.321428571428573
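
A single average hides differences between groups of files, for example between adult and child recordings. The helper below is a sketch of how the mean could be broken down; `mean_wer_by` is a hypothetical name, and the sample records are only four EN rows copied (rounded) from the table above.

```python
from collections import defaultdict

def mean_wer_by(records, key):
    """Average the "wer" field of the records, grouped by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["wer"])
    return {name: sum(values) / len(values) for name, values in groups.items()}

# Illustrative subset only: four EN rows from the results table, WER rounded
sample = [
    {"file": "./Audio/EN/checkin.wav", "language": "EN", "wer": 33.33},
    {"file": "./Audio/EN/checkin_child.wav", "language": "EN", "wer": 100.0},
    {"file": "./Audio/EN/parents.wav", "language": "EN", "wer": 0.0},
    {"file": "./Audio/EN/parents_child.wav", "language": "EN", "wer": 0.0},
]

def speaker_group(record):
    # Child recordings carry a "_child" suffix in their file name
    return "child" if "_child" in record["file"] else "adult"

by_speaker = mean_wer_by(sample, key=speaker_group)
```

On the full list, `mean_wer_by(contents, key=lambda c: c["language"])` would give one mean per language.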


Next, we run the model on two custom audio files to see how it performs on them.

In [10]:
custom_content = [
    { "file": "./Audio/closing_time.wav", "real_content": "hello what is the closing time of the airport", "result": "", "language": "EN", "wer": 0 },
    { "file": "./Audio/bathroom.wav", "real_content": "hola podría indicarme el lugar de los baños por favor", "result": "", "language": "ES", "wer": 0}
]

# We iterate over the custom content to produce the result:
for content in custom_content:
    result = process_audio(content["file"], language=content["language"])
    processed_result = process_result(result)
    content["result"] = processed_result
    content["wer"] = calculate_WER(processed_result, content["real_content"])

Analyzing file: ./Audio/closing_time.wav
Analyzing file: ./Audio/bathroom.wav


In [11]:
print("Language | File | WER | Real Content | Predicted Content")

for content in custom_content:
    print(f"{content['language']} | {content['file']} | {content['wer']} | {content['real_content']} | {content['result']}")
    print("\n")

Language | File | WER | Real Content | Predicted Content
EN | ./Audio/closing_time.wav | 12.5 | hello what is the closing time of the airport | hello what is a close time of the airport


ES | ./Audio/bathroom.wav | 30.0 | hola podría indicarme el lugar de los baños por favor | polo podr indicarme el lugar de los ba os por favor




In [12]:
# The mean WER over the custom content is:
mean_wer = sum([content["wer"] for content in custom_content]) / len(custom_content)
print(mean_wer)

21.25


To improve the results of the ASR system, we use a librosa implementation that separates the vocal component from the background sound. The method is based on REPET-SIM (Rafii and Pardo, 2012).

In [13]:
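def add_filter(file, desired_sr):
    S, sr = lr.load(file, sr=desired_sr)
    # Magnitude and phase of the short-time Fourier transform
    S_full, phase = lr.magphase(lr.stft(S))

    # Note: width is computed but not passed to nn_filter, so the filter uses its default
    width = int(lr.time_to_frames(2, sr=desired_sr))
    S_filter = lr.decompose.nn_filter(S_full, aggregate=np.median, metric='cosine')
    # Output of filter should not be greater than the input
    S_filter = np.minimum(S_full, S_filter)

    # Soft mask that keeps the bins where the residual (vocals) dominates the repeating background
    mask = lr.util.softmask(S_full - S_filter, 11 * S_filter, power=3)

    # Masked magnitude of the foreground (vocal) component
    S_foreground = mask * S_full

    # Note: phase is not reapplied, so the signal is rebuilt from magnitudes only
    filtered_audio = lr.istft(S_foreground)
    file_name = file.split('.wav')[0] + '_filtered.wav'
    sf.write(file_name, filtered_audio, sr, subtype="PCM_24")
    return file_name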
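
The soft mask in `add_filter` assigns each time-frequency bin a weight between 0 and 1. According to the librosa documentation, `softmask(X, X_ref, power)` computes `X**power / (X**power + X_ref**power)`. The scalar sketch below mirrors that formula for a single bin; `soft_mask` and `vocal_mask` are hypothetical helper names, and the handling of the all-zero case is a convention chosen for this sketch.

```python
def soft_mask(x, x_ref, power=1.0):
    """Soft mask weight in [0, 1]: close to 1 where x dominates x_ref."""
    if x < 0 or x_ref < 0:
        raise ValueError("magnitudes must be non-negative")
    if x == 0 and x_ref == 0:
        return 0.0  # sketch convention when both inputs vanish
    return x ** power / (x ** power + x_ref ** power)

def vocal_mask(s_full, s_filter, margin=11, power=3):
    """Scalar counterpart of the mask in add_filter, for one magnitude bin.

    Assumes s_filter <= s_full, which add_filter enforces with np.minimum.
    """
    return soft_mask(s_full - s_filter, margin * s_filter, power=power)
```

A bin where the filter explains the whole magnitude (`s_filter == s_full`) gets weight 0, and a bin the filter does not explain at all (`s_filter == 0`) gets weight 1. The margin of 11 means the residual must be 11 times larger than the filtered magnitude to reach a weight of 0.5.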

We can listen to both versions to hear the difference:

In [14]:
original_sound, sr = lr.load('./Audio/EN/checkin.wav')
ipd.Audio(original_sound, rate=sr)

In [15]:
add_filter('./Audio/EN/checkin.wav', 16000)
filtered_sound, sr = lr.load('./Audio/EN/checkin_filtered.wav')
ipd.Audio(filtered_sound, rate=sr)

In [16]:
# We update our elements to incorporate the new filtered audio
for content in contents:
    filtered_file = add_filter(content["file"], 16000)
    content["filtered_file"] = filtered_file

We now run the model on the filtered audio and compute the WER for each file:

In [17]:
for content in contents:
    result = process_audio(content["filtered_file"], language=content["language"])
    processed_result = process_result(result)
    content["result_filtered"] = processed_result
    content["wer_filtered"] = calculate_WER(processed_result, content["real_content"])

Analyzing file: ./Audio/EN/checkin_filtered.wav
Analyzing file: ./Audio/EN/checkin_child_filtered.wav
Analyzing file: ./Audio/EN/parents_filtered.wav
Analyzing file: ./Audio/EN/parents_child_filtered.wav
Analyzing file: ./Audio/EN/suitcase_filtered.wav
Analyzing file: ./Audio/EN/suitcase_child_filtered.wav
Analyzing file: ./Audio/EN/what_time_filtered.wav
Analyzing file: ./Audio/EN/what_time_child_filtered.wav
Analyzing file: ./Audio/EN/where_filtered.wav
Analyzing file: ./Audio/EN/where_child_filtered.wav
Analyzing file: ./Audio/ES/checkin_es_filtered.wav
Analyzing file: ./Audio/ES/parents_es_filtered.wav
Analyzing file: ./Audio/ES/suitcase_es_filtered.wav
Analyzing file: ./Audio/ES/what_time_es_filtered.wav
Analyzing file: ./Audio/ES/where_es_filtered.wav
Analyzing file: ./Audio/IT/checkin_it_filtered.wav
Analyzing file: ./Audio/IT/parents_it_filtered.wav
Analyzing file: ./Audio/IT/suitcase_it_filtered.wav
Analyzing file: ./Audio/IT/what_time_it_filtered.wav
Analyzing file: ./Audio/I

In [18]:
print("########### FILTERED AUDIO RESULTS ###############")
print("Language | File | WER | Real Content | Predicted Content")

for content in contents:
    print(f"{content['language']} | {content['filtered_file']} | {content['wer_filtered']} | {content['real_content']} | {content['result_filtered']}")
    print("\n")

########### FILTERED AUDIO RESULTS ###############
Language | File | WER | Real Content | Predicted Content
EN | ./Audio/EN/checkin_filtered.wav | 83.33333333333334 | where is the check in desk | he is


EN | ./Audio/EN/checkin_child_filtered.wav | 100.0 | where is the check in desk | as


EN | ./Audio/EN/parents_filtered.wav | 100.0 | i had lost my parents | 


EN | ./Audio/EN/parents_child_filtered.wav | 100.0 | i had lost my parents | 


EN | ./Audio/EN/suitcase_filtered.wav | 100.0 | please i had lost my suitcase | 


EN | ./Audio/EN/suitcase_child_filtered.wav | 100.0 | please i had lost my suitcase | the havens


EN | ./Audio/EN/what_time_filtered.wav | 100.0 | what time is my plane | 


EN | ./Audio/EN/what_time_child_filtered.wav | 100.0 | what time is my plane | i 


EN | ./Audio/EN/where_filtered.wav | 100.0 | where are the restaurants and shops | 


EN | ./Audio/EN/where_child_filtered.wav | 83.33333333333334 | where are the restaurants and shops | the reason


ES | ./Audio/

In [19]:
!pip install noisereduce

Collecting noisereduce
  Downloading noisereduce-3.0.3-py3-none-any.whl.metadata (14 kB)
Downloading noisereduce-3.0.3-py3-none-any.whl (22 kB)
Installing collected packages: noisereduce
Successfully installed noisereduce-3.0.3


In [20]:
import noisereduce as nr

In [21]:
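# Use the noisereduce package for the noise reduction filtering
def add_filter_nr(file):
    S, sr = lr.load(file, sr=None)

    # Use the first second of the recording as the noise sample
    noise_start = int(sr * 0.0) # Start of the noise sample (beginning of the audio)
    noise_end = int(sr * 1.0) # End of the noise sample (one second in)
    noise_sample = S[noise_start:noise_end]

    reduced_noise = nr.reduce_noise(S, sr=sr, y_noise=noise_sample)
    # Note: same output path as add_filter, so earlier filtered files are overwritten
    file_name = file.split('.wav')[0] + '_filtered.wav'
    sf.write(file_name, reduced_noise, sr)
    return file_name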
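
`add_filter_nr` takes the first second of each recording as the noise sample, presumably on the assumption that it contains mostly background noise. To try other windows, the index arithmetic can be moved into a small helper that also keeps the window inside the signal. This is a sketch; `noise_window` and its parameters are hypothetical names.

```python
def noise_window(n_samples, sr, start_s=0.0, end_s=1.0):
    """Return (start, end) sample indices of the noise clip, clamped to the signal."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    # Convert seconds to samples and clamp both ends to [0, n_samples]
    start = max(0, min(int(sr * start_s), n_samples))
    end = max(start, min(int(sr * end_s), n_samples))
    return start, end
```

Inside `add_filter_nr`, the slice would then read `start, end = noise_window(len(S), sr)` followed by `noise_sample = S[start:end]`.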

We can again compare the two versions by listening to both audio files:

In [22]:
original_sound, sr = lr.load('./Audio/EN/checkin.wav')
ipd.Audio(original_sound, rate=sr)

In [23]:
add_filter_nr('./Audio/EN/checkin.wav')
filtered_sound, sr = lr.load('./Audio/EN/checkin_filtered.wav')
ipd.Audio(filtered_sound, rate=sr)

In [24]:
# We update our elements to incorporate the new filtered audio
for content in contents:
    filtered_file = add_filter_nr(content["file"])
    content["filtered_file"] = filtered_file

In [25]:
for content in contents:
    result = process_audio(content["filtered_file"], language=content["language"])
    processed_result = process_result(result)
    content["result_filtered"] = processed_result
    content["wer_filtered"] = calculate_WER(processed_result, content["real_content"])

Analyzing file: ./Audio/EN/checkin_filtered.wav
Analyzing file: ./Audio/EN/checkin_child_filtered.wav
Analyzing file: ./Audio/EN/parents_filtered.wav
Analyzing file: ./Audio/EN/parents_child_filtered.wav
Analyzing file: ./Audio/EN/suitcase_filtered.wav
Analyzing file: ./Audio/EN/suitcase_child_filtered.wav
Analyzing file: ./Audio/EN/what_time_filtered.wav
Analyzing file: ./Audio/EN/what_time_child_filtered.wav
Analyzing file: ./Audio/EN/where_filtered.wav
Analyzing file: ./Audio/EN/where_child_filtered.wav
Analyzing file: ./Audio/ES/checkin_es_filtered.wav
Analyzing file: ./Audio/ES/parents_es_filtered.wav
Analyzing file: ./Audio/ES/suitcase_es_filtered.wav
Analyzing file: ./Audio/ES/what_time_es_filtered.wav
Analyzing file: ./Audio/ES/where_es_filtered.wav
Analyzing file: ./Audio/IT/checkin_it_filtered.wav
Analyzing file: ./Audio/IT/parents_it_filtered.wav
Analyzing file: ./Audio/IT/suitcase_it_filtered.wav
Analyzing file: ./Audio/IT/what_time_it_filtered.wav
Analyzing file: ./Audio/I

In [26]:
print("########### FILTERED AUDIO RESULTS WITH NOISE REDUCTION ###############")
print("Language | File | WER | Real Content | Predicted Content")

for content in contents:
    print(f"{content['language']} | {content['filtered_file']} | {content['wer_filtered']} | {content['real_content']} | {content['result_filtered']}")
    print("\n")

########### FILTERED AUDIO RESULTS WITH NOISE REDUCTION ###############
Language | File | WER | Real Content | Predicted Content
EN | ./Audio/EN/checkin_filtered.wav | 0.0 | where is the check in desk | where is the check in desk


EN | ./Audio/EN/checkin_child_filtered.wav | 16.666666666666664 | where is the check in desk | where is the check in de


EN | ./Audio/EN/parents_filtered.wav | 20.0 | i had lost my parents | i have lost my parents


EN | ./Audio/EN/parents_child_filtered.wav | 20.0 | i had lost my parents | i have lost my parents


EN | ./Audio/EN/suitcase_filtered.wav | 33.33333333333333 | please i had lost my suitcase | please i must my suitcase


EN | ./Audio/EN/suitcase_child_filtered.wav | 83.33333333333334 | please i had lost my suitcase | yes it was my sin


EN | ./Audio/EN/what_time_filtered.wav | 20.0 | what time is my plane | what time is my plan


EN | ./Audio/EN/what_time_child_filtered.wav | 20.0 | what time is my plane | what time is my plan


EN | ./Audio/EN/

In [27]:
# The mean WER over the noise-reduced audio is:
mean_wer = sum([content["wer_filtered"] for content in contents]) / len(contents)
print(mean_wer)

39.57142857142857
