<a href="https://colab.research.google.com/github/MSaber7/Machine-Learning/blob/master/SR-WordErrorRate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SR | Lab One: Word Error Rate (WER)

Each line of the files `ref.trn` and `hyp.trn` contains an independent reference transcript and the corresponding hypothesis produced by some model.

The file `wer.py` contains an implementation of the Levenshtein distance in the function `string_edit_distance`. Please read this function's comments carefully and use it in your code.

Your task is to write a Python 3 script that iterates through the hypotheses in `hyp.trn` and computes the Word Error Rate (WER) of each one against the corresponding reference in `ref.trn`. For each line in `hyp.trn`, the script should print its WER relative to the matching line in `ref.trn`.
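For orientation, the usual definition of WER can be written in terms of the counts that `string_edit_distance` returns. The symbols here are an assumed notation, not taken from the lab handout: $D$, $I$ and $S$ are the numbers of deletions, insertions and substitutions, and $N$ is the number of words in the reference.

$$
\mathrm{WER} = \frac{D + I + S}{N}
$$

Because insertions are counted in the numerator but not in $N$, the value can exceed 1 when the hypothesis is much longer than the reference.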


In [0]:
import numpy as np

## Levenshtein Distance

In [0]:
def string_edit_distance(ref=None, hyp=None):
    """Return (tokens, edits, deletions, insertions, substitutions) for ref vs. hyp."""
    if ref is None or hyp is None:
        raise RuntimeError("ref and hyp are required, cannot be None")

    x = ref
    y = hyp
    tokens = len(x)
    if (len(hyp)==0):
        return (tokens, tokens, tokens, 0, 0)

    # p[ix,iy] consumed ix tokens from x, iy tokens from y
    p = np.inf * np.ones((len(x) + 1, len(y) + 1)) # track total errors
    e = np.zeros((len(x) + 1, len(y) + 1, 3), dtype=int) # track deletions, insertions, substitutions
    p[0] = 0
    for ix in range(len(x) + 1):
        for iy in range(len(y) + 1):
            cst = np.inf * np.ones([3])
            s = 0
            if ix > 0:
                cst[0] = p[ix - 1, iy] + 1 # deletion cost
            if iy > 0:
                cst[1] = p[ix, iy - 1] + 1 # insertion cost
            if ix > 0 and iy > 0:
                s = (1 if x[ix - 1] != y[iy -1] else 0)
                cst[2] = p[ix - 1, iy - 1] + s # substitution cost
            if ix > 0 or iy > 0:
                idx = np.argmin(cst) # if tied, one that occurs first wins
                p[ix, iy] = cst[idx]

                if (idx==0): # deletion
                    e[ix, iy, :] = e[ix - 1, iy, :]
                    e[ix, iy, 0] += 1
                elif (idx==1): # insertion
                    e[ix, iy, :] = e[ix, iy - 1, :]
                    e[ix, iy, 1] += 1
                elif (idx==2): # substitution
                    e[ix, iy, :] = e[ix - 1, iy - 1, :]
                    e[ix, iy, 2] += s

    edits = int(p[-1,-1])
    deletions, insertions, substitutions = e[-1, -1, :]
    return (tokens, edits, deletions, insertions, substitutions)

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
%cd /content/drive/My Drive/Colab Notebooks/SR/Lab1

/content/drive/My Drive/Colab Notebooks/SR/Lab1


### Read File (Ref / Hyp)

In [0]:
filename1 = open('reference.txt', 'r')
filename2 = open('hypothesis.txt', 'r')

In [0]:
# Split each file into whitespace-separated tokens (utterance IDs included), then close it
ref = filename1.read().split()
hyp = filename2.read().split()
filename1.close()
filename2.close()

In [0]:
ref, hyp

In [0]:
# Reference
# apple banana coconut date eggplant fig (0000-000000-0000)
# one two three four five six (0000-000000-0001)
# delaware pennsylvania new_jersey georgia connecticut massachusetts (0000-00000-0002)

In [0]:
# Hypothesis
# apple coconut date eggplant elephant fig (0000-000000-0000)
# one tiger three flamingo five six (0000-000000-0001)
# delaware cat georgia dog mouse massachusetts (0000-00000-0002)
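Each sample line above ends in a parenthesized utterance ID, which `read().split()` treats as one more token. The following is a minimal sketch of separating the ID from the words, assuming every line has the `words (id)` layout shown in the samples. The names `TRN_ID` and `parse_trn_line` are hypothetical and not part of `wer.py`.

```python
import re

# Matches a trailing "(utterance-id)" at the end of a transcript line.
TRN_ID = re.compile(r"\s*\(([^()]*)\)\s*$")

def parse_trn_line(line):
    """Split 'w1 w2 ... (id)' into (id, [w1, w2, ...]); id is None if absent."""
    match = TRN_ID.search(line)
    if match is None:
        return None, line.split()
    return match.group(1), line[:match.start()].split()

utt_id, words = parse_trn_line("one two three four five six (0000-000000-0001)")
print(utt_id)  # 0000-000000-0001
print(words)   # ['one', 'two', 'three', 'four', 'five', 'six']
```

Keeping the ID separate lets references and hypotheses be paired by ID rather than by line order, and keeps the ID out of the token count.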

### Compute the Word Error Rate

In [0]:
# t: reference tokens, e: total edits, d: deletions, i: insertions, s: substitutions
t, e, d, i, s = string_edit_distance(ref, hyp)
print(t)

21


In [0]:
print ('Tokens : ', t)
print ('Edits : ', e)
print ('Deletions : ', d)
print ('Insertion : ' , i)
print ('Substitutions : ' , s)

Tokens :  21
Edits :  8
Deletions :  2
Insertion :  2
Substitutions :  4


In [0]:
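# WER is the edit count normalized by the number of reference tokens;
# t here also counts the three utterance IDs, since each file was split as a whole.
wer = (d + i + s) / t
print(f'Word Error Rate : {wer:.4f}')

Word Error Rate : 0.3810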
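The cells above score the whole file as a single token sequence, whereas the task asks for one WER per hypothesis line. Below is a minimal sketch of per-utterance scoring on the three sample transcripts from this notebook, with the utterance IDs already stripped. It uses a compact pure-Python edit distance as a stand-in so the example runs on its own; in the lab itself, `string_edit_distance` would take its place. The names `edit_distance`, `utterance_wer`, `references`, `hypotheses` and `results` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of deletions, insertions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # zero reference tokens consumed: all insertions
    for i, r in enumerate(ref, start=1):
        curr = [i]  # zero hypothesis tokens consumed: all deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]

def utterance_wer(ref_words, hyp_words):
    """Edit distance divided by the number of reference words."""
    if not ref_words:
        raise ValueError("WER is undefined for an empty reference")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# Sample transcripts from this notebook, keyed by utterance ID.
references = {
    "0000-000000-0000": "apple banana coconut date eggplant fig",
    "0000-000000-0001": "one two three four five six",
    "0000-00000-0002": "delaware pennsylvania new_jersey georgia connecticut massachusetts",
}
hypotheses = {
    "0000-000000-0000": "apple coconut date eggplant elephant fig",
    "0000-000000-0001": "one tiger three flamingo five six",
    "0000-00000-0002": "delaware cat georgia dog mouse massachusetts",
}

results = {}
for utt_id, ref_text in references.items():
    results[utt_id] = utterance_wer(ref_text.split(), hypotheses[utt_id].split())
    print(f"{utt_id}: WER = {results[utt_id]:.4f}")
# 0000-000000-0000: WER = 0.3333
# 0000-000000-0001: WER = 0.3333
# 0000-00000-0002: WER = 0.6667
```

The per-utterance edit counts (2, 2 and 4) sum to the 8 edits reported for the whole file, but each rate is normalized by its own six reference words rather than by the 21 tokens of the combined sequence.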
