<a href="https://colab.research.google.com/github/HungNguyen1509/DS311---Technologies-in-Data-Analytics/blob/main/Natural_Language_Processing_phonems_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assess the accuracy of pronunciation using Artificial Intelligence



##Method and processing steps for the problem
**Idea:** Create a labeled training dataset based on phonemes. Then use machine learning models to recognize each phoneme in an audio file.<br>
**Processing steps:**<br>
Step 1: Prepare data and labeling. From the TIMIT dataset (EU, US, UK), CMU (UK), L2-corpus (GLOBAL), label and filter out non-English data.<br>
Step2: Data pre-processing,data cleaning<br>
Step3: Encode data using 2-gram, 3-gram, Wav2Vec methods.<br>
Step4: Mapping phonemes<br>
Step5: Post-process the model<br>

# Create Environment

In [1]:
!pip install transformers
!pip install jiwer
!pip install phonemizer
!pip search espeak
!pip install py-espeak-ng
!sudo apt-get install python-espeak
!sudo apt-get update && sudo apt-get install espeak
!pip install espeak

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m64.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple, https://us

In [2]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import Wav2Vec2PhonemeCTCTokenizer
import torch
import re
import librosa
import os
import pandas as pd
from jiwer import wer,cer
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Load file audio#

In [3]:
class ReadLabFile():
    def __init__(self,path):
        self.path = path
        self.path_labs,self.path_wavs = self.path_wavs_1()
        
    def path_wavs_1(self):
        paths = os.listdir(self.path)
        path_wavs=[]
        path_labs=[]
        for path in paths:
            if '.wav' in path:
                path_wavs.append(path)
            if '.lab' in path:
                path_labs.append(path)
        path_labs.sort()
        path_wavs.sort()
        return path_labs,path_wavs
    
    def read_speech_file(self,value_drop=['pau']):
        paths = self.path_labs
        val = []
        valu_=[]
        for path in paths:
            values = list(pd.read_csv('abc'+'/'+path, sep = ' ')['#'])
            val.append(values)
        for value in val:
            val_ = []
            for _ in value:
                if _ in  value_drop:
                    continue
                val_.append(_)
            valu_.append(val_)
        result = []
        for i in valu_:
            result.append(' '.join(i))
        return result

# Initialize the model and predict audio results#
Step 1: Encode the audio file into numpy/tensor format.<br>
Step 2: Pretrain the model and predict the result. The model has been trained for over 180 minutes. <br>
Step 3: Mapping IPA39 to IPA69 format.<br>
Step 4: The predicted result of the audio file.

In [4]:
class FbPretrain():
    def __init__(self,path_):
        self.path = path_
        self.path_labs,self.path_wavs=ReadLabFile(path_).path_wavs_1()
        self.two_letter,self.IPA = self. corpus_index()
        
    def wav_to_numpy_array(self):
        list_array_from_wav = []
        for path in self.path_wavs:
            input_,sr  = librosa.load(self.path+'/'+path,sr = 16000)
            list_array_from_wav.append(input_)
        return list_array_from_wav
    
    def list_phonemes2list_IPA(self):
        list_predict_wav2list = []
        list_array_from_wav = self.wav_to_numpy_array()
        processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-xlsr-53-espeak-cv-ft")
        model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xlsr-53-espeak-cv-ft")
        i=0
        for array_from_wav in list_array_from_wav:
            i=i+1
            print(i)
            input_values = processor(array_from_wav, return_tensors="pt",padding=True).input_values
            logits = model(input_values).logits
            predicted_ids = torch.argmax(logits, dim=-1)
            predict_wav_list = processor.batch_decode(predicted_ids)
            predict_wav_list = predict_wav_list[0].replace('ː',':')
            predict_wav_list = predict_wav_list.replace('ɡ','g')
            predict_wav_list_ = list(predict_wav_list.split())
            list_predict_wav2list.append(predict_wav_list_)
        return list_predict_wav2list
    
    def corpus_index(self):
        data = pd.read_excel('/content/wav_file/Fb_phonemes.xlsx')
        two_letter = list(data['2-letter'])
        IPA=list(data['IPA'])
        return two_letter,IPA
    
    def predict_pretrain(self):
        list_phonemes = self.list_phonemes2list_IPA()
        two_letter,IPA = self.corpus_index()
        list_ =[]
        for phonemes in list_phonemes:
            idx =[]
            for i,phoneme in enumerate(phonemes):
                idx.append(IPA.index(phoneme))
            list_.append(idx)
        return list_
    
    def list_phonemes2list_L2corpus(self):
        lists_ = self.predict_pretrain()
        result_2gram=[]
        for list_ in lists_:
            value_2gram=[]
            for id_2gram in list_:
                value_2gram.append(self.two_letter[id_2gram])
            result_2gram.append(value_2gram)
        return result_2gram
             
    def list_phonemes2string_L2corpus(self):
        phonemes2str = []
        for result in self.list_phonemes2list_L2corpus():
            phonemes2str.append(' '.join(result))
        return phonemes2str

class Predict():
    def __init__(self,path_,model = 'Al'):
        self.path_ = path_
        self.path_labs,self.path_wavs = ReadLabFile(path_).path_wavs_1()
        self.model = model

    def predict_with_model(self):
        result =FbPretrain(self.path_).list_phonemes2string_L2corpus()
        return result

# Evaluate the algorithm accuracy and display the results#
Step 1: Check the accuracy of the algorithm using the per (phoneme errors) and wer (word errors) methods.<br>
Step 2: Display the results<br>

In [5]:
class Score():
    def __init__(self,path_,model = 'Al'):
        self.path_ = path_
        self.hypothesis=Predict(path_,model).predict_with_model()
        self.ground_truth = ReadLabFile(path_).read_speech_file()
        
    def score(self):
        accuracy_wer_score =1- wer(self.ground_truth,self.hypothesis)
        accuracy_cer_score =1-cer(self.ground_truth,self.hypothesis)
        print('Accuracy with wer: ',accuracy_wer_score)
        print('Accuracy with cer: ',accuracy_cer_score)
        return accuracy_wer_score,accuracy_cer_score

    def display_result(self):
        hypothesis=self.hypothesis
        ground_truth=self.ground_truth
        name_lab,name_wav = ReadLabFile(self.path_).path_wavs_1()
        data = pd.DataFrame({'name_wav':name_wav,
                             'name_lab':name_lab,
            'hypothesis':hypothesis,
                      'ground_truth':ground_truth})
        return data

### Find out the acuracy of the model

In [None]:
per ,wer = Score('/content/wav_file/abc').score()

In [None]:
print('Accuracy with wer: ',wer)
print('Accuracy with cer: ',per)

Accuracy with wer:  0.9213759213759214
Accuracy with cer:  0.8725637181409296


## Display the result

In [None]:
data = Score('/content/wav_file/abc').display_result()

In [None]:
data.head(20)

Unnamed: 0,name_wav,name_lab,hypothesis,ground_truth
0,arctic_a0001.wav,arctic_a0001.lab,ow th er r ah v dh ax d ey n jh er t r ey l f ...,ao th er ah v dh ax d ey n jh er t r ey l f ih...
1,arctic_a0002.wav,arctic_a0002.lab,n ao t ae t dh ih s p er t ih k y ih l er k ey...,n aa t ae t dh ih s p er t ih k y ax l er k ey...
2,arctic_a0003.wav,arctic_a0003.lab,f ao dh ax t w eh n t iy ax th t ay m dh ae t ...,f ao r dh ax t w eh n t iy ax th t ay m dh ae ...
3,arctic_a0004.wav,arctic_a0004.lab,l ao d b ah t ay m g l ae d t ax s iy y uw aa ...,l ao r d b ah t ay m g l ae d t ax s iy y uw a...
4,arctic_a0005.wav,arctic_a0005.lab,w ih l w iy eh v er f er g eh t ih t,w ih l w iy eh v er f er g eh t ih t
5,arctic_a0006.wav,arctic_a0006.lab,g ao d b l eh s ih m ay hh ao p ay l g ao ao n...,g aa d b l eh s eh m ay hh ow p ay l g ow aa n...
6,arctic_a0007.wav,arctic_a0007.lab,ae n d y uw aa l w ey z w ao n t t ax s iy ih ...,ae n d y uw ao l w ey z w aa n t t ax s iy ih ...
7,arctic_a0008.wav,arctic_a0008.lab,g ae d y ao r l eh t er k ey m jh ah s t ih n ...,g ae d y ao r l eh t er k ey m jh ah s t ih n ...
8,arctic_a0009.wav,arctic_a0009.lab,hh iy t er n d sh aa p l iy ae n d f ey s t g ...,hh iy t er n d sh aa r p l iy ae n d f ey s t ...
9,arctic_a0010.wav,arctic_a0010.lab,ay m p l ey ih ng aa s ih ng g əl hh ae n d ih...,ay m p l ey ih ng ax s ih ng g ax l hh ae n d ...


#Conclusion<br>
Advantages:
The accuracy of the results is quite high, with an accuracy of about 92% even though it is just a demo version. The test dataset used to evaluate the accuracy is the local data of Scottish people.<br>
Disadvantages: The model computation time is slow, taking about 1 second per file. The algorithm needs to be further improved in terms of accuracy and computation speed.