# ASR Assignment 2022-23

This notebook has been provided as a template to get you started on the assignment.  Feel free to use it for your development, or do your development directly in Python.

You can find a full description of the assignment [here](http://www.inf.ed.ac.uk/teaching/courses/asr/2022-23/coursework.pdf).

You are provided with two Python modules `observation_model.py` and `wer.py`.  The first was described in [Lab 3](https://github.com/ZhaoZeyu1995/asr_labs/blob/master/asr_lab3_4.ipynb).  The second can be used to compute the number of substitution, deletion and insertion errors between ASR output and a reference text.

It can be used as follows:

```python
import wer

my_refence = 'A B C'
my_output = 'A C C D'

wer.compute_alignment_errors(my_reference, my_output)
```

This produces a tuple $(s,d,i)$ giving counts of substitution,
deletion and insertion errors respectively - in this example (1, 0, 1).  The function accepts either two strings, as in the example above, or two lists.  Matching is case sensitive.

## Template code

Assuming that you have already made a function to generate an WFST, `create_wfst()` and a decoder class, `MyViterbiDecoder`, you can perform recognition on all the audio files as follows:


In [1]:
import observation_model
import math
import openfst_python as fst

from subprocess import check_call
from IPython.display import Image

import glob
import os
import wer


In [2]:
class MyViterbiDecoder:
    
    NLL_ZERO = 1e10  # define a constant representing -log(0).  This is really infinite, but approximate
                     # it here with a very large number
    
    def __init__(self, f, audio_file_name):
        """Set up the decoder class with an audio file and WFST f
        """
        self.om = observation_model.ObservationModel()
        self.f = f
        
        if audio_file_name:
            self.om.load_audio(audio_file_name)
        else:
            self.om.load_dummy_audio()
        
        self.initialise_decoding()

        
    def initialise_decoding(self):
        """set up the values for V_j(0) (as negative log-likelihoods)
        
        """
        
        self.V = []   # stores likelihood along best path reaching state j
        self.B = []   # stores identity of best previous state reaching state j
        self.W = []   # stores output labels sequence along arc reaching j - this removes need for 
                      # extra code to read the output sequence along the best path
        
        for t in range(self.om.observation_length()+1):
            self.V.append([self.NLL_ZERO]*self.f.num_states())
            self.B.append([-1]*self.f.num_states())
            self.W.append([[] for i in range(self.f.num_states())])  #  multiplying the empty list doesn't make multiple
        
        # The above code means that self.V[t][j] for t = 0, ... T gives the Viterbi cost
        # of state j, time t (in negative log-likelihood form)
        # Initialising the costs to NLL_ZERO effectively means zero probability    
        
        # give the WFST start state a probability of 1.0   (NLL = 0.0)
        self.V[0][self.f.start()] = 0.0
        
        # some WFSTs might have arcs with epsilon on the input (you might have already created 
        # examples of these in earlier labs) these correspond to non-emitting states, 
        # which means that we need to process them without stepping forward in time.  
        # Don't worry too much about this!  
        self.traverse_epsilon_arcs(0)        
        
    def traverse_epsilon_arcs(self, t):
        """Traverse arcs with <eps> on the input at time t
        
        These correspond to transitions that don't emit an observation
        
        We've implemented this function for you as it's slightly trickier than
        the normal case.  You might like to look at it to see what's going on, but
        don't worry if you can't fully follow it.
        
        """
        
        states_to_traverse = list(self.f.states()) # traverse all states
        while states_to_traverse:
            
            # Set i to the ID of the current state, the first 
            # item in the list (and remove it from the list)
            i = states_to_traverse.pop(0)   
        
            # don't bother traversing states which have zero probability
            if self.V[t][i] == self.NLL_ZERO:
                    continue
        
            for arc in self.f.arcs(i):
                
                if arc.ilabel == 0:     # if <eps> transition
                  
                    j = arc.nextstate   # ID of next state  
                
                    if self.V[t][j] > self.V[t][i] + float(arc.weight):
                        
                        # this means we've found a lower-cost path to
                        # state j at time t.  We might need to add it
                        # back to the processing queue.
                        self.V[t][j] = self.V[t][i] + float(arc.weight)
                        
                        # save backtrace information.  In the case of an epsilon transition, 
                        # we save the identity of the best state at t-1.  This means we may not
                        # be able to fully recover the best path, but to do otherwise would
                        # require a more complicated way of storing backtrace information
                        self.B[t][j] = self.B[t][i] 
                        
                        # and save the output labels encountered - this is a list, because
                        # there could be multiple output labels (in the case of <eps> arcs)
                        if arc.olabel != 0:
                            self.W[t][j] = self.W[t][i] + [arc.olabel]
                        else:
                            self.W[t][j] = self.W[t][i]
                        
                        if j not in states_to_traverse:
                            states_to_traverse.append(j)

    
    def forward_step(self, t):
          
        for i in self.f.states():
            
            if not self.V[t-1][i] == self.NLL_ZERO:   # no point in propagating states with zero probability
                
                for arc in self.f.arcs(i):
                    
                    if arc.ilabel != 0: # <eps> transitions don't emit an observation
                        j = arc.nextstate
                        tp = float(arc.weight)  # transition prob
                        ep = -self.om.log_observation_probability(self.f.input_symbols().find(arc.ilabel), t)  # emission negative log prob
                        prob = tp + ep + self.V[t-1][i] # they're logs
                        if prob < self.V[t][j]:
                            self.V[t][j] = prob
                            self.B[t][j] = i
                            
                            # store the output labels encountered too
                            if arc.olabel !=0:
                                self.W[t][j] = [arc.olabel]
                            else:
                                self.W[t][j] = []
                            
    
    def finalise_decoding(self):
        """ this incorporates the probability of terminating at each state
        """
        
        for state in self.f.states():
            final_weight = float(self.f.final(state))
            if self.V[-1][state] != self.NLL_ZERO:
                if final_weight == math.inf:
                    self.V[-1][state] = self.NLL_ZERO  # effectively says that we can't end in this state
                else:
                    self.V[-1][state] += final_weight
                    
        # get a list of all states where there was a path ending with non-zero probability
        finished = [x for x in self.V[-1] if x < self.NLL_ZERO]
        if not finished:  # if empty
            print("No path got to the end of the observations.")
        
        
    def decode(self):
        self.initialise_decoding()
        t = 1
        while t <= self.om.observation_length():
            self.forward_step(t)
            self.traverse_epsilon_arcs(t)
            t += 1
        self.finalise_decoding()
    
    def backtrace(self):
        
        best_final_state = self.V[-1].index(min(self.V[-1])) # argmin
        best_state_sequence = [best_final_state]
        best_out_sequence = []
        
        t = self.om.observation_length()   # ie T
        j = best_final_state
        
        while t >= 0:
            i = self.B[t][j]
            best_state_sequence.append(i)
            best_out_sequence = self.W[t][j] + best_out_sequence  # computer scientists might like
                                                                                # to make this more efficient!

            # continue the backtrace at state i, time t-1
            j = i  
            t-=1
            
        best_state_sequence.reverse()
        
        # convert the best output sequence from FST integer labels into strings
        best_out_sequence = ' '.join([ self.f.output_symbols().find(label) for label in best_out_sequence])
        
        return (best_state_sequence, best_out_sequence)
    


In [3]:
def show_wfst(f):
    f.draw('tmp.dot', portrait=True)
    check_call(['dot','-Tpng','-Gdpi=500','tmp.dot','-o','tmp.png'])
    Image(filename='tmp.png')

In [4]:
def parse_lexicon(lex_file):
    """
    Parse the lexicon file and return it in dictionary form.
    
    Args:
        lex_file (str): filename of lexicon file with structure '<word> <phone1> <phone2>...'
                        eg. peppers p eh p er z

    Returns:
        lex (dict): dictionary mapping words to list of phones
    """
    
    lex = {}  # create a dictionary for the lexicon entries (this could be a problem with larger lexica)
    with open(lex_file, 'r') as f:
        for line in f:
            line = line.split()  # split at each space
            if line[0] in lex.keys():
                lex[line[0] + "_"] = line[1:] 
            else:
                lex[line[0]] = line[1:]  # first field the word, the rest is the phones
    return lex

def generate_symbol_tables(lexicon, n=3):
    '''
    Return word, phone and state symbol tables based on the supplied lexicon
        
    Args:
        lexicon (dict): lexicon to use, created from the parse_lexicon() function
        n (int): number of states for each phone HMM
        
    Returns:
        word_table (fst.SymbolTable): table of words
        phone_table (fst.SymbolTable): table of phones
        state_table (fst.SymbolTable): table of HMM phone-state IDs
    '''
    
    state_table = fst.SymbolTable()
    phone_table = fst.SymbolTable()
    word_table = fst.SymbolTable()
    
    # add empty <eps> symbol to all tables
    state_table.add_symbol('<eps>')
    phone_table.add_symbol('<eps>')
    word_table.add_symbol('<eps>')
    
    for word, phones  in lexicon.items():
        
        word_table.add_symbol(word)
        
        for p in phones: # for each phone
            
            phone_table.add_symbol(p)
            for i in range(1,n+1): # for each state 1 to n
                state_table.add_symbol('{}_{}'.format(p, i))
            
    return word_table, phone_table, state_table


# call these two functions
lex = parse_lexicon('lexicon.txt')
word_table, phone_table, state_table = generate_symbol_tables(lex)

def generate_phone_wfst(f, start_state, phone, n, loop_w, next_w ):
    """
    Generate a WFST representating an n-state left-to-right phone HMM
    
    Args:
        f (fst.Fst()): an FST object, assumed to exist already
        start_state (int): the index of the first state, assmed to exist already
        phone (str): the phone label 
        n (int): number of states for each phone HMM
        
    Returns:
        the final state of the FST
    """
    
    current_state = start_state
    sl_weight = fst.Weight('log', -math.log(loop_w))  # weight for self-loop
    next_weight = fst.Weight('log', -math.log(next_w)) # weight to next state
    
    for i in range(1, n+1):
        
        in_label = state_table.find('{}_{}'.format(phone, i))
        
        # self-loop back to current state
        f.add_arc(current_state, fst.Arc(in_label, 0, sl_weight, current_state))
        
        # transition to next state
        
        # we want to output the phone label on the final state
        # note: if outputting words instead this code should be modified
        if i == n:
            out_label = phone_table.find(phone)
        else:
            out_label = 0   # output empty <eps> label
            
        next_state = f.add_state()
        f.add_arc(current_state, fst.Arc(in_label, 0, next_weight, next_state))    # changed to 0 ! 
       
        current_state = next_state
        
    return current_state

def generate_word_wfst(word):
    """ Generate a WFST for any word in the lexicon, composed of 3-state phone WFSTs.
        This will currently output word labels.  
        Exercise: could you modify this function and the one above to output a single phone label instead?
    
    Args:
        word (str): the word to generate
        
    Returns:
        the constructed WFST
    
    """
    f = fst.Fst('log')
    
    # create the start state
    start_state = f.add_state()
    f.set_start(start_state)
    certain_weight =  fst.Weight('log', -math.log(1))
    
    current_state = start_state
    
    # iterate over all the phones in the word
    for (i,phone) in enumerate(lex[word]):   # will raise an exception if word is not in the lexicon
        
        current_state = generate_phone_wfst(f, current_state, phone, 3)
    
        if i == len(lex[word]) - 1:
            next_state = f.add_state()
            f.add_arc(current_state, fst.Arc(in_label, word_table.find(word), certain_weight, current_state))
            
        # note: new current_state is now set to the final state of the previous phone WFST
        
    f.set_final(current_state)
    
    return f

def generate_word_sequence_recognition_wfst(n, probs, loop_w, next_w):
    """ generate a HMM to recognise any single word sequence for words in the lexicon
    
    Args:
        n (int): states per phone HMM

    Returns:
        the constructed WFST
    
    """
    
    f = fst.Fst('log')

    #even_weight = fst.Weight('log', -math.log(1/len(lex)))
    
    #even_weight = fst.Weight('log', -math.log(1/len(lex)))
    #reduced_weight = fst.Weight('log', -math.log(1/(5*len(lex))))
    
    #this one used before
    next_weight = fst.Weight('log', -math.log(0.1))
    #next_weight = fst.Weight('log', 0)
    #testing this one 
    
    certain_weight =  fst.Weight('log', -math.log(1))
    
    # create a single start state
    start_state = f.add_state()
    #f.add_arc(start_state, fst.Arc(0, 0, fst.Weight('log', -math.log(0.3)), start_state))
    
    f.set_start(start_state)
    
    for word, phones in lex.items():
        current_state = f.add_state()

        f.add_arc(start_state, fst.Arc(0, 0, fst.Weight('log', -math.log(probs[word])), current_state))
        
        for (i, phone) in enumerate(phones): 
            current_state = generate_phone_wfst(f, current_state, phone, n, loop_w , next_w)
        # note: new current_state is now set to the final state of the previous phone WFST
            if i == len(lex[word]) - 1:

                next_state = f.add_state()
                f.add_arc(current_state, fst.Arc(0, word_table.find(word.replace("_", "")), certain_weight, next_state))
                current_state= next_state
                
        f.set_final(current_state)
        f.add_arc(current_state, fst.Arc(0, 0, next_weight, start_state))
        
    return f



In [5]:
def create_wfst(n, state_table, phone_table, word_probabilities, loop_w = 0.3, next_w = 0.7):
    # word probabilities: a dictionary, to adjust weights. 
    f = generate_word_sequence_recognition_wfst(n, word_probabilities, loop_w, next_w)
    f.set_input_symbols(state_table)
    f.set_output_symbols(word_table)
    return f

In [6]:
even_dict = {}
for word, _ in lex.items():
    even_dict[word] = 0.1

In [7]:
def read_transcription(wav_file):
    """
    Get the transcription corresponding to wav_file.
    """
    

    transcription_file = os.path.splitext(wav_file)[0] + '.txt'
    
    with open(transcription_file, 'r') as f:
        transcription = f.readline().strip()
    
    return transcription


These are all for different combinations of the self-loop / next state weights 

In [7]:

utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
        #if utterance_c < 6:                                                            # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

CPU times: user 154 µs, sys: 74 µs, total: 228 µs
Wall time: 232 µs
the of pickled piper the of peter the
a pickled piper of peter
(1, 0, 3) 5
CPU times: user 163 µs, sys: 52 µs, total: 215 µs
Wall time: 220 µs
the where's peter the
where's peter
(0, 0, 2) 2
CPU times: user 247 µs, sys: 55 µs, total: 302 µs
Wall time: 308 µs
CPU times: user 389 µs, sys: 67 µs, total: 456 µs
Wall time: 467 µs
CPU times: user 198 µs, sys: 27 µs, total: 225 µs
Wall time: 230 µs
CPU times: user 344 µs, sys: 37 µs, total: 381 µs
Wall time: 386 µs
CPU times: user 227 µs, sys: 21 µs, total: 248 µs
Wall time: 252 µs
CPU times: user 363 µs, sys: 29 µs, total: 392 µs
Wall time: 397 µs
CPU times: user 266 µs, sys: 18 µs, total: 284 µs
Wall time: 288 µs
2080 318 2434


In [8]:
utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict, 0.7, 0.3)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                          # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

CPU times: user 242 µs, sys: 0 ns, total: 242 µs
Wall time: 247 µs
the of pickled piper the of peter the
a pickled piper of peter
(1, 0, 3) 5
CPU times: user 170 µs, sys: 0 ns, total: 170 µs
Wall time: 174 µs
the where's peter the
where's peter
(0, 0, 2) 2
CPU times: user 200 µs, sys: 0 ns, total: 200 µs
Wall time: 204 µs
CPU times: user 198 µs, sys: 1 µs, total: 199 µs
Wall time: 203 µs
CPU times: user 266 µs, sys: 1 µs, total: 267 µs
Wall time: 273 µs
CPU times: user 395 µs, sys: 2 µs, total: 397 µs
Wall time: 402 µs
CPU times: user 239 µs, sys: 1 µs, total: 240 µs
Wall time: 244 µs
CPU times: user 305 µs, sys: 1 µs, total: 306 µs
Wall time: 311 µs
CPU times: user 284 µs, sys: 2 µs, total: 286 µs
Wall time: 290 µs
1725 318 2434


In [9]:
utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict, 0.5, 0.5)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
                                                                    # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

CPU times: user 283 µs, sys: 2 µs, total: 285 µs
Wall time: 289 µs
the of pickled piper the of peter the
a pickled piper of peter
(1, 0, 3) 5
CPU times: user 167 µs, sys: 0 ns, total: 167 µs
Wall time: 172 µs
the where's peter the
where's peter
(0, 0, 2) 2
CPU times: user 214 µs, sys: 1 µs, total: 215 µs
Wall time: 219 µs
CPU times: user 208 µs, sys: 0 ns, total: 208 µs
Wall time: 212 µs
CPU times: user 248 µs, sys: 1e+03 ns, total: 249 µs
Wall time: 253 µs
CPU times: user 398 µs, sys: 2 µs, total: 400 µs
Wall time: 404 µs
CPU times: user 273 µs, sys: 1e+03 ns, total: 274 µs
Wall time: 280 µs
CPU times: user 319 µs, sys: 1e+03 ns, total: 320 µs
Wall time: 325 µs
CPU times: user 297 µs, sys: 1e+03 ns, total: 298 µs
Wall time: 302 µs
1855 318 2434


In [10]:
utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict, 0.9, 0.1)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
                                                                    # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

CPU times: user 0 ns, sys: 980 µs, total: 980 µs
Wall time: 994 µs
the of pickled piper of peter the
a pickled piper of peter
(1, 0, 2) 5
CPU times: user 113 µs, sys: 116 µs, total: 229 µs
Wall time: 232 µs
the where's peter the
where's peter
(0, 0, 2) 2
CPU times: user 332 µs, sys: 248 µs, total: 580 µs
Wall time: 586 µs
CPU times: user 279 µs, sys: 0 ns, total: 279 µs
Wall time: 282 µs
CPU times: user 376 µs, sys: 178 µs, total: 554 µs
Wall time: 561 µs
CPU times: user 1.17 ms, sys: 415 µs, total: 1.59 ms
Wall time: 1.6 ms
CPU times: user 263 µs, sys: 81 µs, total: 344 µs
Wall time: 348 µs
CPU times: user 302 µs, sys: 79 µs, total: 381 µs
Wall time: 386 µs
CPU times: user 358 µs, sys: 82 µs, total: 440 µs
Wall time: 447 µs
1633 318 2434


In [29]:
utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict, 0.9, 0.1)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
    if utterance_c < 10:
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        %time decoder.decode()
        (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

CPU times: user 2.71 s, sys: 3.99 ms, total: 2.71 s
Wall time: 2.71 s
the of pickled piper of peter the
a pickled piper of peter
(1, 0, 2) 5
CPU times: user 1.78 s, sys: 3.99 ms, total: 1.79 s
Wall time: 1.79 s
the where's peter the
where's peter
(0, 0, 2) 2
CPU times: user 2.09 s, sys: 0 ns, total: 2.09 s
Wall time: 2.09 s
CPU times: user 2.16 s, sys: 3.99 ms, total: 2.16 s
Wall time: 2.16 s
CPU times: user 2.41 s, sys: 11 µs, total: 2.41 s
Wall time: 2.41 s
CPU times: user 5.37 s, sys: 3.99 ms, total: 5.38 s
Wall time: 5.38 s
CPU times: user 2.72 s, sys: 0 ns, total: 2.72 s
Wall time: 2.72 s
CPU times: user 3.27 s, sys: 0 ns, total: 3.27 s
Wall time: 3.28 s
CPU times: user 3.27 s, sys: 12 ms, total: 3.28 s
Wall time: 3.28 s
CPU times: user 2.91 s, sys: 0 ns, total: 2.91 s
Wall time: 2.91 s
40 10 56


In [None]:
utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict, 0.9, 0.1)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
                                                                    # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

In [11]:
utterance_c = 0
word_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst(3, state_table, phone_table, even_dict, 0.1, 0.9)
errors_sum = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
                                                                    # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace()
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count
print(errors_sum, utterance_c, word_c)

CPU times: user 299 µs, sys: 1 µs, total: 300 µs
Wall time: 304 µs
picked the of pickled piper the a of peter a picked the the
a pickled piper of peter
(1, 0, 8) 5
CPU times: user 207 µs, sys: 1e+03 ns, total: 208 µs
Wall time: 211 µs
picked the where's peter picked the
where's peter
(0, 0, 4) 2
CPU times: user 277 µs, sys: 0 ns, total: 277 µs
Wall time: 281 µs
CPU times: user 213 µs, sys: 0 ns, total: 213 µs
Wall time: 216 µs
CPU times: user 220 µs, sys: 0 ns, total: 220 µs
Wall time: 225 µs
CPU times: user 432 µs, sys: 1e+03 ns, total: 433 µs
Wall time: 438 µs
CPU times: user 253 µs, sys: 1e+03 ns, total: 254 µs
Wall time: 258 µs
CPU times: user 517 µs, sys: 1e+03 ns, total: 518 µs
Wall time: 524 µs
CPU times: user 408 µs, sys: 2 µs, total: 410 µs
Wall time: 414 µs
3028 318 2434


Number of states for assesing memory

In [12]:
len(list(f.states()))

139

In [13]:
a = 1
for word, phones in lex.items():
        a += 3*len(phones) + 2
a

139

## Task 1

wfst with unigram probabilities based on counts, instead of even probabilities for all words

In [8]:
c = {}
for word in lex.keys():
    c[word] = 0
c["SUM"] = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):
    transcript=read_transcription(wav_file)
    for word in lex.keys():
        count = transcript.count(word)
        c[word] += count
        c["SUM"] += count
        
        
unigram_probs = {}
for w, count in c.items():
    if "_" not in w:
        unigram_probs[w] = count /c["SUM"]
        
unigram_probs

{'a': 0.05834018077239113,
 'of': 0.10230073952341824,
 'peck': 0.11133935907970419,
 'peppers': 0.13475760065735415,
 'peter': 0.12900575184880855,
 'picked': 0.11298274445357437,
 'pickled': 0.11750205423171733,
 'piper': 0.11503697617091208,
 'the': 0.06409202958093672,
 "where's": 0.05464256368118324,
 'SUM': 1.0}

In [11]:
def generate_word_sequence_recognition_wfst_unigram(n, probs):
    """ generate a HMM to recognise any single word sequence for words in the lexicon
    
    Args:
        n (int): states per phone HMM

    Returns:
        the constructed WFST
    
    """
    
    f = fst.Fst('log')

    #even_weight = fst.Weight('log', -math.log(1/len(lex)))
    
    #even_weight = fst.Weight('log', -math.log(1/len(lex)))
    #reduced_weight = fst.Weight('log', -math.log(1/(5*len(lex))))
    next_weight = fst.Weight('log', -math.log(0.05))
    certain_weight =  fst.Weight('log', -math.log(1))
    
    # create a single start state
    start_state = f.add_state()
    #f.add_arc(start_state, fst.Arc(0, 0, fst.Weight('log', -math.log(0.3)), start_state))
    
    f.set_start(start_state)
    
    for word, phones in lex.items():
        current_state = f.add_state()

        f.add_arc(start_state, fst.Arc(0, 0, fst.Weight('log', -math.log(probs[w])), current_state))
        
        for (i, phone) in enumerate(phones): 
            current_state = generate_phone_wfst(f, current_state, phone, n, 0.9, 0.1)
        # note: new current_state is now set to the final state of the previous phone WFST
            if i == len(lex[word]) - 1:

                next_state = f.add_state()
                f.add_arc(current_state, fst.Arc(0, word_table.find(word.replace("_", "")), certain_weight, next_state))
                current_state= next_state
                
        f.set_final(current_state)
        f.add_arc(current_state, fst.Arc(0, 0, next_weight, start_state))
        
    return f


def create_wfst_unigram(n, state_table, phone_table, word_probabilities):
    # word probabilities: a dictionary, to adjust weights. 
    f = generate_word_sequence_recognition_wfst_unigram(n, word_probabilities)
    f.set_input_symbols(state_table)
    f.set_output_symbols(word_table)
    return f

In [12]:
utterance_c = 0
word_table, phone_table, state_table = generate_symbol_tables(lex)
f = create_wfst_unigram(3, state_table, phone_table, unigram_probs)
errors_sum = 0
word_c = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
                                                                   # audio files
        utterance_c+=1
        decoder = MyViterbiDecoder(f, wav_file)
    
        decoder.decode()
        if utterance_c < 10:
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            (state_path, words) = decoder.backtrace() 
        
        transcription = read_transcription(wav_file)                                    
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
    
        if utterance_c < 3:
            print (words)
            print(transcription)
            print(error_counts, word_count) # you'll need to accumulate these
        errors_sum += sum(error_counts)
        word_c += word_count

print(errors_sum, utterance_c, word_c)

CPU times: user 148 µs, sys: 88 µs, total: 236 µs
Wall time: 239 µs
the of pickled piper of peter the
a pickled piper of peter
(1, 0, 2) 5
CPU times: user 171 µs, sys: 0 ns, total: 171 µs
Wall time: 174 µs
the where's peter the
where's peter
(0, 0, 2) 2
CPU times: user 196 µs, sys: 0 ns, total: 196 µs
Wall time: 199 µs
CPU times: user 152 µs, sys: 36 µs, total: 188 µs
Wall time: 193 µs
CPU times: user 178 µs, sys: 35 µs, total: 213 µs
Wall time: 217 µs
CPU times: user 372 µs, sys: 55 µs, total: 427 µs
Wall time: 432 µs
CPU times: user 224 µs, sys: 28 µs, total: 252 µs
Wall time: 257 µs
CPU times: user 281 µs, sys: 31 µs, total: 312 µs
Wall time: 317 µs
CPU times: user 276 µs, sys: 27 µs, total: 303 µs
Wall time: 308 µs
1628 318 2434


Adding silence states at the start and between words.

In [8]:
state_table.add_symbol("sil_1")
state_table.add_symbol("sil_2")
state_table.add_symbol("sil_3")
state_table.add_symbol("sil_4")
state_table.add_symbol("sil_5")

56

In [24]:
def generate_sil_wfst(n, word_probabilities, sil_prob = 0.1, ergodic = False):
    """ generate a HMM to recognise any single word sequence for words in the lexicon
    
    Args:
        n (int): states per phone HMM

    Returns:
        the constructed WFST
    
    """
    
    f = fst.Fst('log')

    #even_weight = fst.Weight('log', -math.log(1/len(lex)))
    
    #even_weight = fst.Weight('log', -math.log(1/len(lex)))
    #reduced_weight = fst.Weight('log', -math.log(1/(5*len(lex))))
    next_weight = fst.Weight('log', -math.log(0.1))
    certain_weight =  fst.Weight('log', -math.log(1))
    
    # create a single start state
    start_state = f.add_state()
    f.set_start(start_state)
    
    #first_silent = f.add_state()
    #last_silent = f.add_state()
    #f.set_final(last_silent)
    
    if ergodic == True:
        sil_states = [0, 0, 0, 0, 0]
        for i in range(5):
            sil_states[i] = f.add_state()
        #start to first
        #f.add_arc(start_state, fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', -math.log(sil_prob)),sil_states[0]))
        f.add_arc(start_state, fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', -math.log(sil_prob)),sil_states[0]))
        #loop for sil 1
        f.add_arc(sil_states[0], fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', -math.log(0.9)),sil_states[0]))
        #loop for sil 5
        f.add_arc(sil_states[4], fst.Arc(state_table.find("sil_5"), 0, fst.Weight('log', -math.log(0.9)),sil_states[4]))
        #sil 5 to start state
        f.add_arc(sil_states[4], fst.Arc(0, 0, next_weight, start_state))
        f.set_final(sil_states[4])
    
        for i in [1,2,3]:
            #loop
            f.add_arc(sil_states[i], fst.Arc(state_table.find("sil_" + str(i+1)), 0, fst.Weight('log', -math.log(0.9)),sil_states[i]))
            #to final
            f.add_arc(sil_states[i], fst.Arc(state_table.find("sil_" + str(i+1)), 0, fst.Weight('log', -math.log(0.1)),sil_states[4]))
            #from first
            f.add_arc(sil_states[0], fst.Arc(state_table.find("sil_" + str(i+1)), 0, fst.Weight('log', -math.log(0.1)),sil_states[i]))
        
            for j in [1,2,3]:
                if i != j:
                    f.add_arc(sil_states[i], fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', -math.log(0.1)),sil_states[j]))

    else:
        sil_states = [0, 0, 0, 0, 0]
        for i in range(5):
            sil_states[i] = f.add_state()
        # leftto right
        #start to first
        f.add_arc(start_state, fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', -math.log(sil_prob)),sil_states[0]))
        #f.add_arc(start_state, fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', 3),sil_states[0]))
        #loop for sil 1
        #f.add_arc(sil_states[0], fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', -math.log(0.1)),sil_states[0]))
        #loop for sil 5
        f.add_arc(sil_states[4], fst.Arc(state_table.find("sil_5"), 0, fst.Weight('log', -math.log(0.9)),sil_states[4]))
        #sil 5 to start state
        f.add_arc(sil_states[4], fst.Arc(0, 0, next_weight, start_state))
        f.set_final(sil_states[4])

        for j in [0, 1,2,3]:
            #loop
            f.add_arc(sil_states[j], fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', 0.9),sil_states[j]))
            #next
            f.add_arc(sil_states[j], fst.Arc(state_table.find("sil_1"), 0, fst.Weight('log', 0.1),sil_states[j+1]))
        
    for word, phones in lex.items():
        current_state = f.add_state()
        
        f.add_arc(start_state, fst.Arc(0, 0, fst.Weight('log', -math.log(word_probabilities[word])), current_state))
        
        for (i, phone) in enumerate(phones): 
            current_state = generate_phone_wfst(f, current_state, phone, n, 0.9, 0.1)
        # note: new current_state is now set to the final state of the previous phone WFST
            if i == len(lex[word]) - 1:

                next_state = f.add_state()
                f.add_arc(current_state, fst.Arc(0, word_table.find(word.replace("_", "")), certain_weight, next_state))
                current_state= next_state
                
        f.set_final(current_state)
        f.add_arc(current_state, fst.Arc(0, 0, next_weight, start_state))
        
    return f


In [25]:
s = generate_sil_wfst(3, even_dict, 0.1, ergodic = True)
s.set_input_symbols(state_table)
s.set_output_symbols(word_table)
show_wfst(s)

In [12]:
s = generate_sil_wfst(3, even_dict, 0.1)
s.set_input_symbols(state_table)
s.set_output_symbols(word_table)

errors_sum = 0
utterances = 0
words_no = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
        #if utterances < 10:                                                                       # audio files
        utterances += 1
        decoder = MyViterbiDecoder(s, wav_file)
    
        if utterances < 10:
            %time decoder.decode()
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            decoder.decode()
            (state_path, words) = decoder.backtrace()
            
            
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

    
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
    
        if sum(error_counts) > 10:
            print (words)
            print(transcription)
            print(error_counts, word_count) 
            print(utterances)
            # you'll need to accumulate these
        errors_sum += sum(error_counts)
        words_no += word_count
print(errors_sum, utterances, words_no)


CPU times: user 2.69 s, sys: 0 ns, total: 2.69 s
Wall time: 2.69 s
CPU times: user 974 µs, sys: 0 ns, total: 974 µs
Wall time: 978 µs
CPU times: user 1.82 s, sys: 5.2 ms, total: 1.83 s
Wall time: 1.83 s
CPU times: user 305 µs, sys: 321 µs, total: 626 µs
Wall time: 629 µs
CPU times: user 2.18 s, sys: 0 ns, total: 2.18 s
Wall time: 2.18 s
CPU times: user 841 µs, sys: 0 ns, total: 841 µs
Wall time: 844 µs
CPU times: user 2.28 s, sys: 0 ns, total: 2.28 s
Wall time: 2.28 s
CPU times: user 263 µs, sys: 0 ns, total: 263 µs
Wall time: 265 µs
CPU times: user 2.52 s, sys: 5.3 ms, total: 2.53 s
Wall time: 2.53 s
CPU times: user 0 ns, sys: 322 µs, total: 322 µs
Wall time: 326 µs
CPU times: user 4.67 s, sys: 11.7 ms, total: 4.69 s
Wall time: 4.69 s
CPU times: user 1.59 ms, sys: 0 ns, total: 1.59 ms
Wall time: 1.6 ms
a of of pickled peck of pickled a of
peter piper picked a peck of pickled peppers
(5, 0, 1) 8
6
CPU times: user 2.83 s, sys: 0 ns, total: 2.83 s
Wall time: 2.83 s
CPU times: user 304 µs

peter piper picked a peck pickled peppers where's peck picked peppers peter picked picked
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(2, 3, 0) 17
115
peter peter picked a peck picked peppers where's peck of picked peppers picked piper picked
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(4, 2, 0) 17
116
peter pickled of peck where's
peter pickled a peck of peppers
(2, 1, 0) 6
117
the of of of of of of where's of picked of the of piper of peck
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(11, 2, 1) 17
122
picked of of of picked of peck a peppers the where's of pickled picked of peppers the of piper of
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(9, 1, 4) 17
123
the of piper of picked of peck pickled peppers where's the peck of pickled peter picked the of piper the peck

the pickled pickled picked
the peck of pickled peppers
(2, 1, 0) 5
246
peter piper picked peck picked peppers where's the peck of pickled where's peter piper picked
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(2, 2, 0) 17
247
peter of pickled peppers where's where's peter picked
peter peck of pickled peppers peppers peter picked
(1, 1, 1) 8
248
peck where's picked pickled where's pickled peter piper
peck peppers peck pickled peppers pickled peter piper
(3, 0, 0) 8
249
peppers peter pickled peck of peppers where's piper
peppers peter picked a peck of peppers piper
(1, 1, 1) 8
251
the a piper of a of peck of picked peck where's
peter piper pickled a peck of pickled peppers
(4, 0, 3) 8
255
where's of peck picked picked the piper of picked
where's the peck of pickled peppers peter piper picked
(4, 1, 1) 9
256
the of of a of peck of picked of where's where's the pickled picked picked a of piper picked
peter piper picked a peck of pickl

In [26]:
s = generate_sil_wfst(3, even_dict, 0.1, ergodic = True)
s.set_input_symbols(state_table)
s.set_output_symbols(word_table)

errors_sum = 0
utterances = 0
words_no = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
        #if utterances < 10:                                                                       # audio files
        utterances += 1
        decoder = MyViterbiDecoder(s, wav_file)
    
        if utterances < 10:
            %time decoder.decode()
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            decoder.decode()
            (state_path, words) = decoder.backtrace()
            
            
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

    
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
    
        if sum(error_counts) > 10:
            print (words)
            print(transcription)
            print(error_counts, word_count) 
            print(utterances)
            # you'll need to accumulate these
        errors_sum += sum(error_counts)
        words_no += word_count
print(errors_sum, utterances, words_no)


CPU times: user 2.74 s, sys: 0 ns, total: 2.74 s
Wall time: 2.74 s
CPU times: user 300 µs, sys: 2 µs, total: 302 µs
Wall time: 305 µs
CPU times: user 1.82 s, sys: 8 µs, total: 1.82 s
Wall time: 1.82 s
CPU times: user 214 µs, sys: 1 µs, total: 215 µs
Wall time: 217 µs
CPU times: user 2.24 s, sys: 0 ns, total: 2.24 s
Wall time: 2.24 s
CPU times: user 251 µs, sys: 0 ns, total: 251 µs
Wall time: 253 µs
CPU times: user 2.32 s, sys: 0 ns, total: 2.32 s
Wall time: 2.32 s
CPU times: user 269 µs, sys: 0 ns, total: 269 µs
Wall time: 272 µs
CPU times: user 2.57 s, sys: 13 µs, total: 2.57 s
Wall time: 2.57 s
CPU times: user 287 µs, sys: 0 ns, total: 287 µs
Wall time: 290 µs
CPU times: user 4.74 s, sys: 3.99 ms, total: 4.74 s
Wall time: 4.74 s
CPU times: user 519 µs, sys: 3 µs, total: 522 µs
Wall time: 526 µs
CPU times: user 2.92 s, sys: 0 ns, total: 2.92 s
Wall time: 2.92 s
CPU times: user 336 µs, sys: 1 µs, total: 337 µs
Wall time: 340 µs
CPU times: user 3.48 s, sys: 4 ms, total: 3.48 s
Wall time

In [27]:
show_wfst(s)

In [14]:
s = generate_sil_wfst(3, even_dict, 0.05)
s.set_input_symbols(state_table)
s.set_output_symbols(word_table)

errors_sum = 0
utterances = 0
words_no = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                                 # audio files
        utterances += 1
        decoder = MyViterbiDecoder(s, wav_file)
    
        if utterances < 10:
            %time decoder.decode()
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            decoder.decode()
            (state_path, words) = decoder.backtrace()
            
            
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

    
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
    
        if utterances < 4:
            print (words)
            print(transcription)
            print(error_counts, word_count)     # you'll need to accumulate these
        errors_sum += sum(error_counts)
        words_no += word_count
print(errors_sum, utterances, words_no)

CPU times: user 2.7 s, sys: 4 ms, total: 2.71 s
Wall time: 2.71 s
CPU times: user 308 µs, sys: 3 µs, total: 311 µs
Wall time: 316 µs
pickled piper of peter
a pickled piper of peter
(0, 1, 0) 5
CPU times: user 1.79 s, sys: 8 µs, total: 1.79 s
Wall time: 1.79 s
CPU times: user 207 µs, sys: 2 µs, total: 209 µs
Wall time: 212 µs
where's peter
where's peter
(0, 0, 0) 2
CPU times: user 2.17 s, sys: 0 ns, total: 2.17 s
Wall time: 2.18 s
CPU times: user 254 µs, sys: 0 ns, total: 254 µs
Wall time: 257 µs
peter picked peck
peter picked a peck
(0, 1, 0) 4
CPU times: user 2.25 s, sys: 0 ns, total: 2.25 s
Wall time: 2.25 s
CPU times: user 411 µs, sys: 0 ns, total: 411 µs
Wall time: 414 µs
CPU times: user 2.56 s, sys: 12 µs, total: 2.56 s
Wall time: 2.56 s
CPU times: user 299 µs, sys: 0 ns, total: 299 µs
Wall time: 303 µs
CPU times: user 4.64 s, sys: 0 ns, total: 4.64 s
Wall time: 4.64 s
CPU times: user 506 µs, sys: 0 ns, total: 506 µs
Wall time: 510 µs
CPU times: user 2.86 s, sys: 0 ns, total: 2.86

In [15]:
s = generate_sil_wfst(3, even_dict, 0.1)
s.set_input_symbols(state_table)
s.set_output_symbols(word_table)

errors_sum = 0
utterances = 0
words_no = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own
        #if utterances < 10:                                                                       # audio files
        utterances += 1
        decoder = MyViterbiDecoder(s, wav_file)
    
        if utterances < 10:
            %time decoder.decode()
            %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        else:
            decoder.decode()
            (state_path, words) = decoder.backtrace()
            
            
        transcription = read_transcription(wav_file)                                           # to return the words along the best path

    
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
    
        if sum(error_counts) > 2:
            print (words)
            print(transcription)
            print(error_counts, word_count) 
            print(utterances)
            # you'll need to accumulate these
        errors_sum += sum(error_counts)
        words_no += word_count
print(errors_sum, utterances, words_no)


CPU times: user 2.67 s, sys: 16 ms, total: 2.69 s
Wall time: 2.69 s
CPU times: user 442 µs, sys: 0 ns, total: 442 µs
Wall time: 454 µs
CPU times: user 1.79 s, sys: 0 ns, total: 1.79 s
Wall time: 1.79 s
CPU times: user 215 µs, sys: 0 ns, total: 215 µs
Wall time: 217 µs
CPU times: user 2.17 s, sys: 0 ns, total: 2.17 s
Wall time: 2.17 s
CPU times: user 243 µs, sys: 1 µs, total: 244 µs
Wall time: 246 µs
CPU times: user 2.27 s, sys: 0 ns, total: 2.27 s
Wall time: 2.27 s
CPU times: user 289 µs, sys: 2 µs, total: 291 µs
Wall time: 295 µs
CPU times: user 2.53 s, sys: 0 ns, total: 2.53 s
Wall time: 2.53 s
CPU times: user 286 µs, sys: 0 ns, total: 286 µs
Wall time: 289 µs
CPU times: user 4.69 s, sys: 8 ms, total: 4.69 s
Wall time: 4.7 s
CPU times: user 571 µs, sys: 4 µs, total: 575 µs
Wall time: 579 µs
a of of pickled peck of pickled a of
peter piper picked a peck of pickled peppers
(5, 0, 1) 8
6
CPU times: user 2.83 s, sys: 0 ns, total: 2.83 s
Wall time: 2.84 s
CPU times: user 325 µs, sys: 3 µs

peter piper picked a peck pickled peppers where's peck picked peppers peter picked picked
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(2, 3, 0) 17
115
peter peter picked a peck picked peppers where's peck of picked peppers picked piper picked
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(4, 2, 0) 17
116
peter pickled of peck where's
peter pickled a peck of peppers
(2, 1, 0) 6
117
the of of of of of of where's of picked of the of piper of peck
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(11, 2, 1) 17
122
picked of of of picked of peck a peppers the where's of pickled picked of peppers the of piper of
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(9, 1, 4) 17
123
the of piper of picked of peck pickled peppers where's the peck of pickled peter picked the of piper the peck

the pickled pickled picked
the peck of pickled peppers
(2, 1, 0) 5
246
peter piper picked peck picked peppers where's the peck of pickled where's peter piper picked
peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked
(2, 2, 0) 17
247
peter of pickled peppers where's where's peter picked
peter peck of pickled peppers peppers peter picked
(1, 1, 1) 8
248
peck where's picked pickled where's pickled peter piper
peck peppers peck pickled peppers pickled peter piper
(3, 0, 0) 8
249
peppers peter pickled peck of peppers where's piper
peppers peter picked a peck of peppers piper
(1, 1, 1) 8
251
the a piper of a of peck of picked peck where's
peter piper pickled a peck of pickled peppers
(4, 0, 3) 8
255
where's of peck picked picked the piper of picked
where's the peck of pickled peppers peter piper picked
(4, 1, 1) 9
256
the of of a of peck of picked of where's where's the pickled picked picked a of piper picked
peter piper picked a peck of pickl

## Task 2 - Pruning

In [16]:
class PruningViterbiDecoder:
    
    NLL_ZERO = 1e10  # define a constant representing -log(0).  This is really infinite, but approximate
                     # it here with a very large number
    
    def __init__(self, f, audio_file_name, pruning_threshold = 500):
        """Set up the decoder class with an audio file and WFST f
        """
        self.om = observation_model.ObservationModel()
        self.f = f
        
        if audio_file_name:
            self.om.load_audio(audio_file_name)
        else:
            self.om.load_dummy_audio()
        
        self.initialise_decoding()
        self.threshold = pruning_threshold

        
    def initialise_decoding(self):
        """set up the values for V_j(0) (as negative log-likelihoods)
        
        """
        
        self.V = []   # stores likelihood along best path reaching state j
        self.B = []   # stores identity of best previous state reaching state j
        self.W = []   # stores output labels sequence along arc reaching j - this removes need for 
                      # extra code to read the output sequence along the best path
        
        for t in range(self.om.observation_length()+1):
            self.V.append([self.NLL_ZERO]*self.f.num_states())
            self.B.append([-1]*self.f.num_states())
            self.W.append([[] for i in range(self.f.num_states())])  #  multiplying the empty list doesn't make multiple
        
        # The above code means that self.V[t][j] for t = 0, ... T gives the Viterbi cost
        # of state j, time t (in negative log-likelihood form)
        # Initialising the costs to NLL_ZERO effectively means zero probability    
        
        # give the WFST start state a probability of 1.0   (NLL = 0.0)
        self.V[0][self.f.start()] = 0.0
        
        # some WFSTs might have arcs with epsilon on the input (you might have already created 
        # examples of these in earlier labs) these correspond to non-emitting states, 
        # which means that we need to process them without stepping forward in time.  
        # Don't worry too much about this!  
        self.traverse_epsilon_arcs(0)        
        
    def traverse_epsilon_arcs(self, t):
        """Traverse arcs with <eps> on the input at time t
        
        These correspond to transitions that don't emit an observation
        
        We've implemented this function for you as it's slightly trickier than
        the normal case.  You might like to look at it to see what's going on, but
        don't worry if you can't fully follow it.
        
        """
        
        states_to_traverse = list(self.f.states()) # traverse all states
        while states_to_traverse:
            
            # Set i to the ID of the current state, the first 
            # item in the list (and remove it from the list)
            i = states_to_traverse.pop(0)   
        
            # don't bother traversing states which have zero probability
            if self.V[t][i] == self.NLL_ZERO:
                    continue
        
            for arc in self.f.arcs(i):
                
                if arc.ilabel == 0:     # if <eps> transition
                  
                    j = arc.nextstate   # ID of next state  
                
                    if self.V[t][j] > self.V[t][i] + float(arc.weight):
                        
                        # this means we've found a lower-cost path to
                        # state j at time t.  We might need to add it
                        # back to the processing queue.
                        self.V[t][j] = self.V[t][i] + float(arc.weight)
                        
                        # save backtrace information.  In the case of an epsilon transition, 
                        # we save the identity of the best state at t-1.  This means we may not
                        # be able to fully recover the best path, but to do otherwise would
                        # require a more complicated way of storing backtrace information
                        self.B[t][j] = self.B[t][i] 
                        
                        # and save the output labels encountered - this is a list, because
                        # there could be multiple output labels (in the case of <eps> arcs)
                        if arc.olabel != 0:
                            self.W[t][j] = self.W[t][i] + [arc.olabel]
                        else:
                            self.W[t][j] = self.W[t][i]
                        
                        if j not in states_to_traverse:
                            states_to_traverse.append(j)

    
    def forward_step(self, t):
        #find the best V[t-1]
        best = max(1, min(self.V[t-1]))
        for i in self.f.states():
            
            #if not self.V[t-1][i] == self.NLL_ZERO:   # no point in propagating states with zero probability
            if self.V[t-1][i] < best* self.threshold:   # bigger value means lower probability ! 
                #print(self.V[t-1][i])
                for arc in self.f.arcs(i):
                    
                    if arc.ilabel != 0: # <eps> transitions don't emit an observation
                        j = arc.nextstate
                        tp = float(arc.weight)  # transition prob
                        ep = -self.om.log_observation_probability(self.f.input_symbols().find(arc.ilabel), t)  # emission negative log prob
                        prob = tp + ep + self.V[t-1][i] # they're logs
                        if prob < self.V[t][j]:
                            self.V[t][j] = prob
                            self.B[t][j] = i
                            
                            # store the output labels encountered too
                            if arc.olabel !=0:
                                self.W[t][j] = [arc.olabel]
                            else:
                                self.W[t][j] = []
                            
    
    def finalise_decoding(self):
        """ this incorporates the probability of terminating at each state
        """
        
        for state in self.f.states():
            final_weight = float(self.f.final(state))
            if self.V[-1][state] != self.NLL_ZERO:
                if final_weight == math.inf:
                    self.V[-1][state] = self.NLL_ZERO  # effectively says that we can't end in this state
                else:
                    self.V[-1][state] += final_weight
                    
        # get a list of all states where there was a path ending with non-zero probability
        finished = [x for x in self.V[-1] if x < self.NLL_ZERO]
        if not finished:  # if empty
            print("No path got to the end of the observations.")
        
        
    def decode(self):
        self.initialise_decoding()
        t = 1
        while t <= self.om.observation_length():
            self.forward_step(t)
            self.traverse_epsilon_arcs(t)
            t += 1
        self.finalise_decoding()
    
    def backtrace(self):
        
        best_final_state = self.V[-1].index(min(self.V[-1])) # argmin
        best_state_sequence = [best_final_state]
        best_out_sequence = []
        
        t = self.om.observation_length()   # ie T
        j = best_final_state
        
        while t >= 0:
            i = self.B[t][j]
            best_state_sequence.append(i)
            best_out_sequence = self.W[t][j] + best_out_sequence  # computer scientists might like
                                                                                # to make this more efficient!

            # continue the backtrace at state i, time t-1
            j = i  
            t-=1
            
        best_state_sequence.reverse()
        
        # convert the best output sequence from FST integer labels into strings
        best_out_sequence = ' '.join([ self.f.output_symbols().find(label) for label in best_out_sequence])
        
        return (best_state_sequence, best_out_sequence)
    


In [12]:
f = generate_word_sequence_recognition_wfst(3, even_dict, 0.9, 0.1)
f.set_input_symbols(state_table)
f.set_output_symbols(word_table)

errors_sum = 0
utterance_c = 0
words_c = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                                     # audio files
        utterance_c+=1
        decoder = PruningViterbiDecoder(f, wav_file, pruning_threshold =25)
    
        %time decoder.decode()
        %time (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        transcription = read_transcription(wav_file)                                           # to return the words along the best path
       
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 5:
            print(error_counts, word_count)     # you'll need to accumulate these
            print (words)
            print(transcription)
        errors_sum += sum(error_counts)
        words_c += word_count
        
print(errors_sum, utterance_c, words_c)

CPU times: user 2.62 s, sys: 70 µs, total: 2.62 s
Wall time: 2.63 s
CPU times: user 312 µs, sys: 5 µs, total: 317 µs
Wall time: 320 µs
(1, 0, 2) 5
the of pickled piper of peter the
a pickled piper of peter
CPU times: user 1.77 s, sys: 0 ns, total: 1.77 s
Wall time: 1.77 s
CPU times: user 218 µs, sys: 0 ns, total: 218 µs
Wall time: 221 µs
(0, 0, 2) 2
the where's peter the
where's peter
CPU times: user 2.16 s, sys: 59 µs, total: 2.16 s
Wall time: 2.16 s
CPU times: user 300 µs, sys: 0 ns, total: 300 µs
Wall time: 305 µs
(0, 1, 2) 4
the peter picked peck the
peter picked a peck
CPU times: user 2.21 s, sys: 0 ns, total: 2.21 s
Wall time: 2.21 s
CPU times: user 272 µs, sys: 0 ns, total: 272 µs
Wall time: 276 µs
(0, 0, 2) 3
the where's the peppers the
where's the peppers
CPU times: user 2.44 s, sys: 83 µs, total: 2.44 s
Wall time: 2.44 s
CPU times: user 301 µs, sys: 4 µs, total: 305 µs
Wall time: 309 µs
CPU times: user 4.58 s, sys: 67 µs, total: 4.58 s
Wall time: 4.58 s
CPU times: user 514 µs

CPU times: user 2.44 s, sys: 14 µs, total: 2.44 s
Wall time: 2.44 s
CPU times: user 309 µs, sys: 0 ns, total: 309 µs
Wall time: 317 µs
CPU times: user 4.31 s, sys: 92 µs, total: 4.31 s
Wall time: 4.31 s
CPU times: user 453 µs, sys: 7 µs, total: 460 µs
Wall time: 464 µs
CPU times: user 3.97 s, sys: 4.02 ms, total: 3.97 s
Wall time: 3.97 s
CPU times: user 0 ns, sys: 1.42 ms, total: 1.42 ms
Wall time: 1.43 ms
CPU times: user 2.74 s, sys: 15 µs, total: 2.74 s
Wall time: 2.75 s
CPU times: user 308 µs, sys: 5 µs, total: 313 µs
Wall time: 316 µs
CPU times: user 2.72 s, sys: 74 µs, total: 2.72 s
Wall time: 2.73 s
CPU times: user 368 µs, sys: 0 ns, total: 368 µs
Wall time: 377 µs
CPU times: user 2.41 s, sys: 43 µs, total: 2.41 s
Wall time: 2.42 s
CPU times: user 292 µs, sys: 5 µs, total: 297 µs
Wall time: 300 µs
CPU times: user 4.9 s, sys: 95 µs, total: 4.9 s
Wall time: 4.9 s
CPU times: user 569 µs, sys: 0 ns, total: 569 µs
Wall time: 573 µs
CPU times: user 4.91 s, sys: 88 µs, total: 4.91 s
Wal

CPU times: user 2.49 s, sys: 4.03 ms, total: 2.49 s
Wall time: 2.49 s
CPU times: user 322 µs, sys: 4 µs, total: 326 µs
Wall time: 330 µs
CPU times: user 6.2 s, sys: 8.1 ms, total: 6.21 s
Wall time: 6.21 s
CPU times: user 725 µs, sys: 11 µs, total: 736 µs
Wall time: 740 µs
CPU times: user 5.04 s, sys: 4.03 ms, total: 5.05 s
Wall time: 5.05 s
CPU times: user 1.28 ms, sys: 0 ns, total: 1.28 ms
Wall time: 1.29 ms
CPU times: user 6.45 s, sys: 45 µs, total: 6.45 s
Wall time: 6.45 s
CPU times: user 759 µs, sys: 11 µs, total: 770 µs
Wall time: 776 µs
CPU times: user 6.79 s, sys: 12.1 ms, total: 6.8 s
Wall time: 6.81 s
CPU times: user 920 µs, sys: 13 µs, total: 933 µs
Wall time: 939 µs
CPU times: user 4.27 s, sys: 121 µs, total: 4.27 s
Wall time: 4.27 s
CPU times: user 533 µs, sys: 8 µs, total: 541 µs
Wall time: 545 µs
CPU times: user 3.13 s, sys: 4.02 ms, total: 3.14 s
Wall time: 3.14 s
CPU times: user 839 µs, sys: 12 µs, total: 851 µs
Wall time: 855 µs
CPU times: user 2.66 s, sys: 13 µs, tota

CPU times: user 3.3 s, sys: 18 µs, total: 3.3 s
Wall time: 3.31 s
CPU times: user 413 µs, sys: 6 µs, total: 419 µs
Wall time: 424 µs
CPU times: user 3 s, sys: 60 µs, total: 3 s
Wall time: 3 s
CPU times: user 407 µs, sys: 6 µs, total: 413 µs
Wall time: 417 µs
CPU times: user 2.27 s, sys: 37 µs, total: 2.27 s
Wall time: 2.27 s
CPU times: user 0 ns, sys: 301 µs, total: 301 µs
Wall time: 304 µs
CPU times: user 2.11 s, sys: 10 µs, total: 2.11 s
Wall time: 2.11 s
CPU times: user 287 µs, sys: 4 µs, total: 291 µs
Wall time: 293 µs
CPU times: user 3.37 s, sys: 7.95 ms, total: 3.37 s
Wall time: 3.38 s
CPU times: user 0 ns, sys: 427 µs, total: 427 µs
Wall time: 432 µs
CPU times: user 4.01 s, sys: 4.02 ms, total: 4.01 s
Wall time: 4.02 s
CPU times: user 499 µs, sys: 7 µs, total: 506 µs
Wall time: 510 µs
CPU times: user 3.6 s, sys: 77 µs, total: 3.6 s
Wall time: 3.61 s
CPU times: user 441 µs, sys: 0 ns, total: 441 µs
Wall time: 444 µs
CPU times: user 2.97 s, sys: 3 µs, total: 2.97 s
Wall time: 2.98

CPU times: user 2.82 s, sys: 4.02 ms, total: 2.83 s
Wall time: 2.83 s
CPU times: user 475 µs, sys: 6 µs, total: 481 µs
Wall time: 485 µs
CPU times: user 3.31 s, sys: 8.02 ms, total: 3.32 s
Wall time: 3.32 s
CPU times: user 394 µs, sys: 0 ns, total: 394 µs
Wall time: 398 µs
CPU times: user 2.98 s, sys: 0 ns, total: 2.98 s
Wall time: 2.98 s
CPU times: user 371 µs, sys: 0 ns, total: 371 µs
Wall time: 374 µs
CPU times: user 6.7 s, sys: 12 ms, total: 6.72 s
Wall time: 6.72 s
CPU times: user 802 µs, sys: 10 µs, total: 812 µs
Wall time: 818 µs
CPU times: user 3.85 s, sys: 4.06 ms, total: 3.85 s
Wall time: 3.85 s
CPU times: user 476 µs, sys: 0 ns, total: 476 µs
Wall time: 484 µs
CPU times: user 3.57 s, sys: 61 µs, total: 3.57 s
Wall time: 3.57 s
CPU times: user 435 µs, sys: 6 µs, total: 441 µs
Wall time: 445 µs
CPU times: user 2.74 s, sys: 0 ns, total: 2.74 s
Wall time: 2.74 s
CPU times: user 354 µs, sys: 0 ns, total: 354 µs
Wall time: 357 µs
CPU times: user 4.25 s, sys: 18 µs, total: 4.25 s
W

CPU times: user 6.41 s, sys: 8.07 ms, total: 6.42 s
Wall time: 6.42 s
CPU times: user 745 µs, sys: 10 µs, total: 755 µs
Wall time: 759 µs
CPU times: user 6.36 s, sys: 4.05 ms, total: 6.36 s
Wall time: 6.36 s
CPU times: user 725 µs, sys: 9 µs, total: 734 µs
Wall time: 738 µs
CPU times: user 3.91 s, sys: 7.98 ms, total: 3.92 s
Wall time: 3.93 s
CPU times: user 692 µs, sys: 9 µs, total: 701 µs
Wall time: 705 µs
CPU times: user 2.52 s, sys: 0 ns, total: 2.52 s
Wall time: 2.52 s
CPU times: user 336 µs, sys: 0 ns, total: 336 µs
Wall time: 340 µs
CPU times: user 3.97 s, sys: 39 µs, total: 3.97 s
Wall time: 3.97 s
CPU times: user 475 µs, sys: 7 µs, total: 482 µs
Wall time: 484 µs
CPU times: user 3.34 s, sys: 4 µs, total: 3.34 s
Wall time: 3.34 s
CPU times: user 381 µs, sys: 5 µs, total: 386 µs
Wall time: 390 µs
CPU times: user 3.03 s, sys: 0 ns, total: 3.03 s
Wall time: 3.03 s
CPU times: user 370 µs, sys: 0 ns, total: 370 µs
Wall time: 374 µs
CPU times: user 4.79 s, sys: 76 µs, total: 4.79 s
W

In [17]:
f = generate_word_sequence_recognition_wfst(3, even_dict, 0.9, 0.1 )
f.set_input_symbols(state_table)
f.set_output_symbols(word_table)

errors_sum = 0
utterance_c = 0
words_c = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                                     # audio files
        utterance_c+=1
        decoder = PruningViterbiDecoder(f, wav_file, pruning_threshold =20)
    
        if utterance_c < 10:
            %time decoder.decode()
        else:
                decoder.decode()
        (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        transcription = read_transcription(wav_file)                                           # to return the words along the best path
       
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 5:
            print(error_counts, word_count)     # you'll need to accumulate these
            print (words)
            print(transcription)
        errors_sum += sum(error_counts)
        words_c += word_count
        
print(errors_sum, utterance_c, words_c)

CPU times: user 2.75 s, sys: 25 µs, total: 2.75 s
Wall time: 2.75 s
(1, 0, 2) 5
the of pickled piper of peter the
a pickled piper of peter
CPU times: user 1.75 s, sys: 0 ns, total: 1.75 s
Wall time: 1.75 s
(0, 0, 2) 2
the where's peter the
where's peter
CPU times: user 2.17 s, sys: 4.01 ms, total: 2.17 s
Wall time: 2.17 s
(0, 1, 2) 4
the peter picked peck the
peter picked a peck
CPU times: user 2.21 s, sys: 3.99 ms, total: 2.21 s
Wall time: 2.21 s
(0, 0, 2) 3
the where's the peppers the
where's the peppers
CPU times: user 2.48 s, sys: 4.01 ms, total: 2.49 s
Wall time: 2.49 s
CPU times: user 4.65 s, sys: 4 ms, total: 4.65 s
Wall time: 4.66 s
CPU times: user 2.85 s, sys: 0 ns, total: 2.85 s
Wall time: 2.85 s
CPU times: user 3.41 s, sys: 11 µs, total: 3.41 s
Wall time: 3.41 s
CPU times: user 3.44 s, sys: 4.02 ms, total: 3.45 s
Wall time: 3.45 s
1608 318 2434


In [18]:
f = generate_word_sequence_recognition_wfst(3, even_dict, 0.9, 0.1 )
f.set_input_symbols(state_table)
f.set_output_symbols(word_table)

errors_sum = 0
utterance_c = 0
words_c = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                                     # audio files
        utterance_c+=1
        decoder = PruningViterbiDecoder(f, wav_file, pruning_threshold =15)
    
        if utterance_c < 10:
            %time decoder.decode()
        else:
                decoder.decode()
        (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        transcription = read_transcription(wav_file)                                           # to return the words along the best path
       
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 5:
            print(error_counts, word_count)     # you'll need to accumulate these
            print (words)
            print(transcription)
        errors_sum += sum(error_counts)
        words_c += word_count
        
print(errors_sum, utterance_c, words_c)

CPU times: user 2.71 s, sys: 7.98 ms, total: 2.72 s
Wall time: 2.72 s
(1, 0, 2) 5
the of pickled piper of peter the
a pickled piper of peter
CPU times: user 1.8 s, sys: 0 ns, total: 1.8 s
Wall time: 1.8 s
(0, 0, 2) 2
the where's peter the
where's peter
CPU times: user 2.19 s, sys: 7.94 ms, total: 2.2 s
Wall time: 2.2 s
(0, 1, 2) 4
the peter picked peck the
peter picked a peck
CPU times: user 2.31 s, sys: 3.99 ms, total: 2.31 s
Wall time: 2.32 s
(0, 0, 2) 3
the where's the peppers the
where's the peppers
CPU times: user 2.47 s, sys: 0 ns, total: 2.47 s
Wall time: 2.47 s
CPU times: user 4.59 s, sys: 12 ms, total: 4.6 s
Wall time: 4.6 s
CPU times: user 2.85 s, sys: 3.99 ms, total: 2.86 s
Wall time: 2.86 s
CPU times: user 3.44 s, sys: 6 µs, total: 3.44 s
Wall time: 3.44 s
CPU times: user 3.44 s, sys: 0 ns, total: 3.44 s
Wall time: 3.44 s
1608 318 2434


In [19]:
f = generate_word_sequence_recognition_wfst(3, even_dict, 0.9, 0.1 )
f.set_input_symbols(state_table)
f.set_output_symbols(word_table)

errors_sum = 0
utterance_c = 0
words_c = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                                     # audio files
        utterance_c+=1
        decoder = PruningViterbiDecoder(f, wav_file, pruning_threshold =10)
    
        if utterance_c < 10:
            %time decoder.decode()
        else:
                decoder.decode()
        (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        transcription = read_transcription(wav_file)                                           # to return the words along the best path
       
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 5:
            print(error_counts, word_count)     # you'll need to accumulate these
            print (words)
            print(transcription)
        errors_sum += sum(error_counts)
        words_c += word_count
        
print(errors_sum, utterance_c, words_c)

CPU times: user 2.6 s, sys: 16 ms, total: 2.62 s
Wall time: 2.62 s
(1, 0, 2) 5
the of pickled piper of peter the
a pickled piper of peter
CPU times: user 1.79 s, sys: 6 µs, total: 1.79 s
Wall time: 1.79 s
(0, 0, 2) 2
the where's peter the
where's peter
CPU times: user 2.2 s, sys: 8 µs, total: 2.2 s
Wall time: 2.2 s
(0, 1, 2) 4
the peter picked peck the
peter picked a peck
CPU times: user 2.27 s, sys: 0 ns, total: 2.27 s
Wall time: 2.27 s
(0, 0, 2) 3
the where's the peppers the
where's the peppers
CPU times: user 2.52 s, sys: 30 µs, total: 2.52 s
Wall time: 2.52 s
CPU times: user 4.59 s, sys: 3.99 ms, total: 4.59 s
Wall time: 4.59 s
CPU times: user 2.85 s, sys: 6 µs, total: 2.85 s
Wall time: 2.85 s
CPU times: user 3.38 s, sys: 0 ns, total: 3.38 s
Wall time: 3.38 s
CPU times: user 3.34 s, sys: 0 ns, total: 3.34 s
Wall time: 3.34 s
1608 318 2434


In [20]:
f = generate_word_sequence_recognition_wfst(3, even_dict, 0.9, 0.1 )
f.set_input_symbols(state_table)
f.set_output_symbols(word_table)

errors_sum = 0
utterance_c = 0
words_c = 0
for wav_file in glob.glob('/group/teaching/asr/labs/recordings/*.wav'):    # replace path if using your own                                                                     # audio files
        utterance_c+=1
        decoder = PruningViterbiDecoder(f, wav_file, pruning_threshold =5)
    
        if utterance_c < 10:
            %time decoder.decode()
        else:
                decoder.decode()
        (state_path, words) = decoder.backtrace()  # you'll need to modify the backtrace() from Lab 4
        transcription = read_transcription(wav_file)                                           # to return the words along the best path
       
        error_counts = wer.compute_alignment_errors(transcription, words)
        word_count = len(transcription.split())
        if utterance_c < 5:
            print(error_counts, word_count)     # you'll need to accumulate these
            print (words)
            print(transcription)
        errors_sum += sum(error_counts)
        words_c += word_count
        
print(errors_sum, utterance_c, words_c)

CPU times: user 2.6 s, sys: 14 µs, total: 2.6 s
Wall time: 2.6 s
(1, 0, 2) 5
the of pickled piper of peter the
a pickled piper of peter
CPU times: user 1.77 s, sys: 9 µs, total: 1.77 s
Wall time: 1.77 s
(0, 0, 2) 2
the where's peter the
where's peter
CPU times: user 2.16 s, sys: 15 µs, total: 2.16 s
Wall time: 2.16 s
(0, 1, 2) 4
the peter picked peck the
peter picked a peck
CPU times: user 2.28 s, sys: 4.01 ms, total: 2.28 s
Wall time: 2.28 s
(0, 0, 2) 3
the where's the peppers the
where's the peppers
CPU times: user 2.53 s, sys: 24 ms, total: 2.55 s
Wall time: 2.55 s
CPU times: user 4.62 s, sys: 13 µs, total: 4.62 s
Wall time: 4.62 s
CPU times: user 2.85 s, sys: 14 µs, total: 2.85 s
Wall time: 2.85 s
CPU times: user 3.4 s, sys: 4.01 ms, total: 3.41 s
Wall time: 3.41 s
CPU times: user 3.3 s, sys: 2 µs, total: 3.3 s
Wall time: 3.3 s
1608 318 2434
