# Assignment 1

This assignment involves creating a spellchecking system and evaluating its performance. You may complete it using the Python code snippets provided, or in the programming language or environment of your choice.

Please start by downloading the corpus `holbrook.txt` from Blackboard.

The file consists of lines of text, with one sentence per line. Errors in a line are marked with a `|` as follows:

    My siter|sister go|goes to Tonbury .
    
In this case the word 'siter' was corrected to 'sister' and the word 'go' was corrected to 'goes'.

In some places in the corpus two words may be corrected to a single word, or one word to multiple words. This is denoted in the data using underscores, e.g.,

    My Mum goes out some_times|sometimes .
    
For the purpose of this assignment you do not need to separate these words; you may treat them as a single token.
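As a rough illustration of the format described above, the following sketch splits a single corpus line into its original tokens, corrected tokens and error indexes. The helper name `parse_line` is hypothetical and is not part of the provided snippets; it splits on any whitespace, which is an assumption about how the file is laid out.

```python
def parse_line(line):
    """Split one corpus line into original tokens, corrected tokens and error indexes."""
    original, corrected, indexes = [], [], []
    for position, token in enumerate(line.split()):
        if "|" in token:
            # the text before the bar is the misspelling, the text after it is the correction
            wrong, right = token.split("|", 1)
            indexes.append(position)
        else:
            # unmarked tokens are identical in both versions of the sentence
            wrong = right = token
        original.append(wrong)
        corrected.append(right)
    return original, corrected, indexes


original, corrected, indexes = parse_line("My siter|sister go|goes to Tonbury .")
print(original)
print(corrected)
print(indexes)
```

Underscore-joined tokens such as `some_times|sometimes` pass through unchanged, which matches the instruction to treat them as single tokens.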

*Note: you may use any functions from NLTK to complete the assignment. It should not be necessary to use other libraries and so please consult with us if your solution involves any other external library. If you use any function from NLTK in Task 6 please include a brief description of this function and how it contributes to your solution.*

## Task 1 (10 Marks)

Write a parser that can read all the lines of the file `holbrook.txt` and print out, for each line, the original (misspelled) text, the corrected text and the indexes of any changes. The indexes refer to the positions of the words in the sentence. In the example used in the assertion below, the only error is in the 10th word, so the list of indexes is [9]. It is not necessary to analyze where the error occurs inside the word.

Then split your data into a test set of 100 lines and a training set.

In [28]:
import nltk
#nltk.download()
#read in the text, closing the file once all lines have been read
with open("holbrook.txt") as corpus_file:
  lines = corpus_file.readlines()
data = []
total_lines = len(lines)
corrected_list = []
original_list = []


for i in range(total_lines):

  #split up each line of text into words
  lines_split = lines[i].split(" ")
  #lists for storing respective sentences and indexes - will be set to null at the start of each new sentence
  original_sentence = []
  corrected_sentence = []
  indexes = []
  for j in range(len(lines_split)):

    #here we remove any of the \n at the end of sentences
    #note we do not tokenize until later as it feels easier to work with the strings here 
    if "\n" in lines_split[j]:
      #use strip to remove the \n
      lines_split[j] = lines_split[j].rstrip("\n")

    #here we need to split the two words either side of the | symbol
    if "|" in lines_split[j]:
      #we add the index at which this symbol exists to the list of indexes
      indexes.append(j)
      #we use split to split the word with the symbol into two words with no symbol
      word_split = lines_split[j].split("|")
      #the first word is the misspelled word so we add that to the original word list
      original_sentence.append(word_split[0])
      #the second word is the corrected version of the word so we add that to its corresponding list
      corrected_sentence.append(word_split[1])

      #here we could do a quick check to see if the word is unique

    #if there is no symbol we just add the word to both lists
    else:
      original_sentence.append(lines_split[j])
      corrected_sentence.append(lines_split[j])
  
  #here we create a dictionary item for each sentence and the indexes
  #the values i.e. the sentences and indexes are lists and we can access these using their corresponding keys
  info = {
        "original": original_sentence,
        "corrected": corrected_sentence,
        "indexes":  indexes,
  }

  #add each of the sentences to their respective lists
  corrected_list.append(corrected_sentence)
  original_list.append(original_sentence)
  
  #add each dictionary item to a data list
  data.append(info)

#assert(data[2] == {
#    'original': ['I', 'have', 'four', 'in', 'my', 'Family', 'Dad', 'Mum', 'and', 'siter', '.'], 
#    'corrected': ['I', 'have', 'four', 'in', 'my', 'Family', 'Dad', 'Mum', 'and', 'sister', '.'], 
#    'indexes': [9]
#})

The counts and assertions given in the following sections are based on splitting the training and test set as follows:

In [29]:
#split up our data into lists that will be needed later
test = data[:100]
train = data[100:]
test_corrected = corrected_list[:100]
test_original = original_list[:100]
train_corrected = corrected_list[100:]
train_original = original_list[100:]

## **Task 2** (10 Marks): 
Calculate the frequency (number of occurrences), *ignoring case*, of all words and their unigram probability from the corrected *training* sentences.

*Hint: use `Counter` to implement this so it may be called many times*

In [30]:
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
import itertools

#use itertools to iterate over the list of lists(sentences) and put them in the same list
train_corrected_list = list(itertools.chain.from_iterable(train_corrected))
#this list of unique (lower-cased) words will be used throughout the assignment
#dict.fromkeys removes duplicates while keeping the order in which words were first seen
unique_words = list(dict.fromkeys(word.lower() for word in train_corrected_list))

#count every word once, ignoring case, so the functions below can be called many times cheaply
word_counter = Counter(word.lower() for word in train_corrected_list)
#total number of word occurrences in the corrected training sentences
total_word_num = sum(word_counter.values())

#function to get the frequency of the word
def unigram(word):
  #look up the frequency of the word, ignoring case (a Counter returns 0 for unseen words)
  return word_counter[word.lower()]

#function to get the probability
def prob(word):
  #divide the frequency of the word by the total number of words in the corrected training sentences
  return unigram(word) / total_word_num

prob("me")

# Test your code with the following
#assert(unigram("me")==87)

0.004031697483664674
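To make the calculation concrete, here is a minimal sketch of case-insensitive unigram counting on a made-up two-sentence corpus. The names `toy_sentences`, `toy_counts` and `toy_prob` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# a tiny made-up corpus standing in for the corrected training sentences
toy_sentences = [["I", "like", "cats"], ["i", "like", "Dogs"]]

# lower-case every token before counting so that "I" and "i" share one entry
toy_counts = Counter(word.lower() for sentence in toy_sentences for word in sentence)
toy_total = sum(toy_counts.values())


def toy_prob(word):
    """Unigram probability: occurrences of the word divided by all word occurrences."""
    return toy_counts[word.lower()] / toy_total


print(toy_counts["i"], toy_total, toy_prob("like"))
```

Building the `Counter` once and reusing it is what allows the lookup to be called many times cheaply.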

## **Task 3** (15 Marks): 
[Edit distance](https://en.wikipedia.org/wiki/Edit_distance) is a method that calculates how similar two strings are to one another by counting the minimum number of operations required to transform one string into the other. There is a built-in implementation in NLTK that works as follows:


In [31]:
from nltk.metrics.distance import edit_distance

# Edit distance returns the number of changes to transform one word to another
print(edit_distance("hello", "hi"))

4
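For intuition only, the sketch below shows one common way such a distance can be computed with dynamic programming, assuming unit cost for insertions, deletions and substitutions. The function name `levenshtein` is hypothetical; the assignment still asks you to call NLTK's `edit_distance` rather than your own implementation.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


print(levenshtein("hello", "hi"))
```

Here "hello" becomes "hi" by substituting one letter and deleting three, which is why the result is 4.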


Write a function that calculates all words with *minimal* edit distance to the misspelled word. You should do this as follows:

1. Collect the set of all unique tokens in `train`
2. Find the minimal edit distance, that is the lowest value for the function `edit_distance` between `token` and a word in `train`
3. Output all unique words in `train` that have this same (minimal) `edit_distance` value

*Do not implement edit distance, use the built-in NLTK function `edit_distance`*

In [32]:
def get_candidates(token):

  #call edit_distance for the given token against every word in the unique words list
  #this returns the number of changes needed to turn one word into the other
  #e.g. mine and mind = 1 change
  distances = {word: edit_distance(token, word) for word in unique_words}

  #find the lowest distance; several words may share it
  minimum = min(distances.values())

  #return every word at the minimal distance, sorted alphabetically so the output is deterministic
  return sorted(word for word, distance in distances.items() if distance == minimum)
  
get_candidates("minde")  
# Test your code as follows
#assert get_candidates("minde") == ['mine', 'mind']

['mind', 'mine']

## Task 4 (15 Marks):

Write a function that takes a (misspelled) sentence and returns the corrected version of that sentence. The system should scan the sentence for words that are not in the dictionary (set of unique words in the training set) and for each word that is not in the dictionary choose a word in the dictionary that has minimal edit distance and has the highest *unigram probability*. 

*Your solution to this should involve `get_candidates`*


In [33]:
def correct(sentence):
    #create list to append the corrected sentence to
    corrected_sentence = []

    #loop through the words in the sentence
    for word in sentence:

      #if the word is not in the list of unique words i.e. the training list we must try a prediction
      #all our unique words are in lower case
      if word.lower() not in unique_words:
        
        #call get_candidates to return a list of the words that need the fewest changes
        candidates = get_candidates(word.lower())
        #pick the candidate with the highest unigram probability
        #with a single candidate max simply returns it; on ties it returns the first candidate
        corrected_sentence.append(max(candidates, key=prob))
      
      #if the word is in the unique words we do not need to predict
      else: 
        corrected_sentence.append(word)
    return corrected_sentence

correct(["this","whitr","cat"])
#assert(correct(["this","whitr","cat"]) == ['this','white','cat'])   

['this', 'white', 'cat']
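The tie-breaking step can be sketched in isolation as follows. The counts in `toy_counts` are invented for the example and `pick_best` is a hypothetical helper; because every probability shares the same denominator, comparing raw counts gives the same ranking as comparing unigram probabilities.

```python
from collections import Counter

# invented frequencies standing in for the training-set counts
toy_counts = Counter({"mind": 3, "mine": 7, "mint": 7})


def pick_best(candidates, counts):
    """Return the candidate with the highest count; the earliest one wins a tie."""
    return max(candidates, key=lambda word: counts[word])


print(pick_best(["mind", "mine", "mint"], toy_counts))
```

When two candidates are equally frequent, as "mine" and "mint" are here, the choice depends only on their order in the list, which is one motivation for bringing in more context.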

## **Task 5** (10 Marks): 
Using the test corpus evaluate the *accuracy* of your method, i.e., how many words from your system's output match the corrected sentence (you should count words that are already spelled correctly and not changed by the system).

In [35]:
def accuracy(test):
  fixed_sentence = []
  incorrect_sentence_counter = 0
  correct_sentence_counter = 0
  count_incorrect_words = 0
  count_correct_words = 0

  #call the correct sentence method on each sentence in the list passed in
  fixed_sentence = [correct(sentence) for sentence in test]

  #loop through the length of the list to compare results
  for i in range(len(test_corrected)):

    #if the sentences are the same incremement the counter
    if (fixed_sentence[i] == test_corrected[i]):
      correct_sentence_counter = correct_sentence_counter + 1
    else:
      #print(fixed_sentence[i])
      #print(test_corrected[i])
      #print(test_original[i])
      incorrect_sentence_counter = incorrect_sentence_counter + 1

  #put all the lists into one list with all their relevent words in that list so we can zip and compare
  fixed_list = list(itertools.chain.from_iterable(fixed_sentence))
  corrected_list = list(itertools.chain.from_iterable(test_corrected))
  original_list = list(itertools.chain.from_iterable(test_original))


  #zip all the files and loop through so we can compare the words of predicted and correct ang get results
  for x, y, z in zip(fixed_list, corrected_list,original_list):
      if x == y:
        count_correct_words = count_correct_words + 1
      else:
        #print("Predicted: %s, Expected: %s, Start word: %s " %(x, y, z))
        count_incorrect_words = count_incorrect_words + 1

  print("This is the number of total incorrect words: %d " %count_incorrect_words)
  print("This is the number of total correct words: %d " %(count_correct_words))
  count_correct_words = float(count_correct_words)
  compare_list = float(len(corrected_list))
  print("This is the percentage of correctly predicted words: %f" %((count_correct_words/compare_list) * 100) ,"%")
  print("This is the percentage of correctly predicted sentences: %f" %((correct_sentence_counter/(incorrect_sentence_counter+correct_sentence_counter)) * 100) ,"%")
  #we have implemented it on sentences now we need to do so for words
  #so we therefore need to create a list with all the words from each of these 



accuracy(test_original)

This is the number of total incorrect words: 181 
This is the number of total correct words: 948 
This is the percentage of correctly predicted words: 83.968113 %
This is the percentage of correctly predicted sentences: 21.000000 %
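The word-level measure reported above can be summarised by a small standalone function. This is a sketch on made-up sentences, with the hypothetical name `word_accuracy`; it assumes predicted and gold sentences are aligned token by token.

```python
def word_accuracy(predicted_sentences, gold_sentences):
    """Fraction of tokens in the system output that equal the gold-standard token."""
    matches = 0
    total = 0
    for predicted, gold in zip(predicted_sentences, gold_sentences):
        for predicted_word, gold_word in zip(predicted, gold):
            total += 1
            matches += predicted_word == gold_word
    # guard against an empty test set
    return matches / total if total else 0.0


predicted = [["this", "white", "cat"], ["my", "siter"]]
gold = [["this", "white", "cat"], ["my", "sister"]]
print(word_accuracy(predicted, gold))
```

In this toy case four of the five tokens match, so the accuracy is 0.8. Words that were already correct and left untouched count as matches, as the task requires.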


## **Task 6 (35 Marks):**

Consider a modification to your algorithm that would improve the accuracy of the algorithm developed in Tasks 3 and 4.

* You may use resources beyond those provided here.
* You must **not use the test data** in this task.
* Provide a short text describing what you intend to do and why. 
* Full marks for this section may be obtained without an implementation, but an implementation is preferred.
* Your implementation should not consist of more than 50 lines of code

Please note this task is marked according to: demonstration of knowledge from the lectures (10), originality and appropriateness of solution (10), completeness of description (10) and technical correctness (5).

My approach is to make corrections to as few words as possible, because we only want to change words that are misspelled. My algorithm gets **91.674048%** of the words correct. A total of 94 words are wrongly predicted (compared with 181 in Task 5), and the majority of these are quite tough cases. For example, we are given the word 'go' and are meant to predict 'goes'. Perhaps better tense detection would help fix this, but it is a difficult case. Having analysed the 94 incorrectly predicted words, I believe the accuracy I have achieved is very good.


**Named Entities**

Some examples of named entities are people's names, places, organizations and times. Named entity detection uses part-of-speech tags and tries to identify nouns. It relies on a technique called chunking, which groups words together to see whether multiple words make up a single named entity, e.g. New York. These chunks are called NP chunks or noun phrase chunks. A noun phrase chunk is typically made up of a determiner (DT), followed by an adjective (ADJ), followed by a noun. In my method we simply pass each word on its own to the named entity checker, so we do not utilize the full capabilities of this technique. However, there are no named entities in this example that are made up of two or more words, so this limitation does not hinder the accuracy. `ne_chunk` returns a tree object, so I check whether any node of the tree has a named entity label; if it does, I add the word to a list of named entities and do not try to correct that word.


**Bigrams**

In the first approach to this question we used the prob method, which simply selected the candidate with the highest unigram probability from the words that required the fewest changes from the misspelt word. A better approach is to use bigrams, which take into account the word before the one we are trying to predict. We therefore compute the bigram probability of the previous word followed by each possible candidate and select the candidate with the highest value.
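As a minimal sketch of this idea, the example below estimates bigram probabilities from a made-up corpus and uses them to choose between two candidates. All names (`toy_sentences`, `toy_bigram_prob` and so on) are hypothetical. Unlike the notebook code, this sketch counts pairs within each sentence only, which is an assumption made to keep the toy example clean.

```python
from collections import Counter

# a made-up corpus standing in for the corrected training sentences
toy_sentences = [
    ["i", "wash", "the", "car"],
    ["i", "walk", "the", "dog"],
    ["i", "wash", "up"],
]

toy_bigrams = Counter()      # how often each (previous word, word) pair occurs
toy_first_words = Counter()  # how often each word starts a pair
for sentence in toy_sentences:
    for first, second in zip(sentence, sentence[1:]):
        toy_bigrams[(first, second)] += 1
        toy_first_words[first] += 1


def toy_bigram_prob(first, second):
    """Estimate P(second | first) as count(first, second) / count(first starts a pair)."""
    seen = toy_first_words[first]
    return toy_bigrams[(first, second)] / seen if seen else 0.0


# choose between two equally close candidates using the previous word "i"
best = max(["walk", "wash"], key=lambda candidate: toy_bigram_prob("i", candidate))
print(best)
```

With this toy data "wash" follows "i" twice and "walk" once, so "wash" is selected even though it is listed second.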


We also use **Tense Detection** in conjunction with this. The reason is that when choosing the highest bigram probability we are quite often given a set of words that all have the same bigram probability. The idea of tense detection here is to make sure that the candidate words have the same tense as the misspelled word. Note that the misspelling itself may cause the detected tense of the word to change. I also tried using tense detection to fix words after they had been corrected, but I found that my final approach worked better. Tense detection works by using the part-of-speech tag to see whether the word is past, future or present in tense. This is difficult because I am trying to classify misspelt words only.

**Add-one Smoothing**

When calculating bigram probabilities, a huge number of bigram pairs never appear together in the training data, so their estimated probability is 0. This is inaccurate, as we know these words can occur together. To deal with this I use a simplified form of add-one smoothing: for an unseen pair, the number of times the pair occurs is set to 1. It is important that a smoothed, unseen bigram remains less likely than a bigram that really did appear once. Therefore the probability of an unseen pair is 1/(frequency of word 1 + 1), while a pair that actually appears once has a probability of 1/(frequency of word 1). Note that for this to be more accurate we should not count occurrences of word 1 where it is the last word of a sentence, as it starts no bigram in that case.
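For comparison, the textbook form of add-one (Laplace) smoothing adds 1 to every bigram count and adds the vocabulary size V to the denominator, giving (count(w1, w2) + 1) / (count(w1) + V). The sketch below shows that form on a made-up token list. It is not the variant implemented in this notebook, and `toy_tokens` and `laplace_bigram_prob` are hypothetical names.

```python
from collections import Counter

# a made-up token stream standing in for the corrected training words
toy_tokens = ["i", "wash", "the", "car", "i", "walk", "the", "dog"]

pair_counts = Counter(zip(toy_tokens, toy_tokens[1:]))  # bigram frequencies
word_counts = Counter(toy_tokens)                       # unigram frequencies
vocabulary_size = len(word_counts)                      # V, the number of distinct words


def laplace_bigram_prob(first, second):
    """Add-one estimate: (count(first, second) + 1) / (count(first) + V)."""
    return (pair_counts[(first, second)] + 1) / (word_counts[first] + vocabulary_size)


print(laplace_bigram_prob("i", "wash"))  # pair seen once
print(laplace_bigram_prob("i", "dog"))   # pair never seen
```

In this form a seen pair always scores higher than an unseen pair with the same first word, and no pair is ever given a probability of zero.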

**Trigrams**

Trigrams could also have been implemented. With trigrams we group three words together, so when correcting a word we look at the two words that occur before it together with each candidate, and then select the candidate with the highest trigram probability. The main change would be to this line of code in the bigram_probability method, together with looking up a three-word key and dividing by the count of the two preceding words:

```
train_bigrams = nltk.ngrams(corrected_list, 2)
#to
train_bigrams = nltk.ngrams(corrected_list, 3)
```
This may or may not have given a better outcome. 
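A rough sketch of what the trigram estimate could look like is given below, using only the standard library and a made-up token list. The names `toy_tokens`, `trigram_counts`, `context_counts` and `toy_trigram_prob` are hypothetical, and this was not part of the submitted implementation.

```python
from collections import Counter

# a made-up token stream standing in for the corrected training words
toy_tokens = ["i", "wash", "the", "car", "and", "i", "wash", "the", "dog"]

# count three-word windows and the two-word contexts that precede the final word
trigram_counts = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))
context_counts = Counter(zip(toy_tokens, toy_tokens[1:]))


def toy_trigram_prob(first, second, third):
    """Estimate P(third | first, second) as count(first, second, third) / count(first, second)."""
    context = context_counts[(first, second)]
    return trigram_counts[(first, second, third)] / context if context else 0.0


print(toy_trigram_prob("wash", "the", "car"))
```

Because three-word sequences are much rarer than pairs, many contexts would be unseen in a corpus of this size, so some fallback to bigram or unigram probabilities would probably be needed.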


**Extra Items Implemented to Increase Accuracy**
1. We do not make any changes if a **token is a number or starts with a digit**. Without this check we were making predictions for numbers and changing them to numbers that happened to be in the list of unique words. So if the token starts with a digit we do not do any prediction.
2. If the word we are trying to predict is a **real English word** we do not do any prediction, as I assume that the word is correct. After analyzing the words I was predicting incorrectly, I found that I was making predictions for words that were already correct and did not need to be altered. To fix this I created a method is_word_real, which checks whether the word is in the WordNet corpus of English words. If it is, we return true and do not do any prediction for that word.
3. Finally, if the **word's first letter is a capital letter** we do not do any prediction. I implemented this because there were many place names and acronyms, such as 'Bullimore', which were not picked up as named entities, so predictions were being made on words that were in fact already correct.
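The three checks above, together with the dictionary lookup, can be collected into a single guard, sketched here without NLTK. The function name `should_attempt_correction` is hypothetical, and `known_words` is a small invented set standing in for the WordNet lookup; the digit test looks at the first character only, mirroring the notebook code.

```python
def should_attempt_correction(word, vocabulary, known_words):
    """Return True only when none of the skip rules applies to the word."""
    if not word:
        return False  # nothing to correct in an empty token
    if word.lower() in vocabulary:
        return False  # already seen in the training data
    if word[0].isupper():
        return False  # capitalised words are usually names or places
    if word[0].isdigit():
        return False  # leave numbers alone
    if word.lower() in known_words:
        return False  # a real word that simply was not in the training data
    return True


toy_vocabulary = {"white", "cat"}
toy_known_words = {"dog"}
for token in ["whitr", "Bullimore", "3rd", "dog", "cat"]:
    print(token, should_attempt_correction(token, toy_vocabulary, toy_known_words))
```

Only "whitr" passes every check here, so it is the only token for which candidates would be generated.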



In [24]:
#let's put all our new functions in here 
import nltk
from nltk import pos_tag, ne_chunk, word_tokenize
from nltk.corpus import wordnet

#create list to hold all named entities
named_entity_list = []

#checks if the word we are going to predict is a named entity
def named_entity_check(word):

  #created a tree using chunking
  #ideally we should maybe do this on the full sentence for better results but it appears to work well
  #does really well at removing errors of places and names of people and organizations
  tree = ne_chunk(nltk.pos_tag(nltk.word_tokenize(word)))
  #only add if its not already in the list 
  for t in tree:
    if hasattr(t, 'label'):
        if word not in named_entity_list:
          named_entity_list.append(word)

#named_entity_check("Google")



#we use this function to return the tense of a supplied word
#when we try to predict words using bigrams, in most cases the bigrams of the possible words have never occurred before
#this means we often have lots of words with the same probability and no stand-out word
#to help with this we try to remove words that are not relevant by getting the tense of all candidates and only keeping the candidates 
#with the same tense as the word we are trying to predict
#this can be difficult as the word is often misspelled, but it is an attempt to improve accuracy
#sometimes an irrelevant word will have a higher probability than the actual correct word, so this helps remove those cases
def tense_detection(word):
  #get the part of speech tag
  word_tag = dict(pos_tag(word_tokenize(word)))
  #put it in a list and return it
  tag = list(word_tag.values())

  return tag
  
def bigram_probability(word1, word2):

  #we use the corrected train list as our bigram training list
  #we flatten the list of sentences into a single list of words
  corrected_list = list(itertools.chain.from_iterable(train_corrected))

  #we then use the n-grams function to create bigrams from the flattened list
  #if we were to put 3 in here we could have created tri-grams
  train_bigrams = nltk.ngrams(corrected_list, 2)

  #we use Counter to count how often each bigram pair occurs
  bigrams_frequency = Counter(train_bigrams)
  frequency_of_pair = bigrams_frequency[(word1, word2)]

  #here we get the number of times each word occurs
  #ideally we should not count a word when it is at the end of a sentence (as it starts no bigram in that case)
  #therefore this calculation is not 100 percent accurate
  bigram_counter = Counter(corrected_list)

  #now we use the counter above to get the total number of occurrences of word 1
  word_count = bigram_counter[word1]

  #add-one smoothing for pairs that do not occur in the training set (this also covers word 1 never occurring)
  if frequency_of_pair == 0:
    #add one to the frequency of the pair and to the number of times word 1 appears
    return (frequency_of_pair + 1) / (word_count + 1)

  #otherwise the probability is the number of times the words occur as a pair divided by the number of times word 1 appears
  return frequency_of_pair / word_count

#This function is called to see if a word is a real word
#I try to limit word prediction to words that are spelled wrongly
#this is to limit the incorrect prediction of words that are already correct
#if the word is not present in wordnet synsets then we assume it is not a proper word
def is_word_real(word):
    #an empty list of synsets means WordNet does not know the word
    return bool(wordnet.synsets(word))


bigram_probability("I", "wash")

0.0019880715705765406

In [25]:
def correct_new(sentence):

    #list to hold the sentence after we correct it
    corrected_sentence = []
    #this is going to store word-1 which is the word before the word we are trying to predict
    #this will of course be needed for the bigram probability calculation
    bigram_first_word = sentence[0]
    #this variable will say if the word is the first word of the sentence
    #if it is the first word we don't use bigrams and simply use the old probability calculation for that word only
    first_word = True

    #run for each word of the sentence
    for word in sentence:
      #check if the word is a name entity- it will be added to a list if it is
      named_entity_check(word)
      #this was an attempt to remove the _ in some word but didn't help improve accuracy
      if "_" in word:
        word = word.replace("_", "")

      #check if the word is a real word in the wordet corpus
      english_word = is_word_real(word)

      #we only attempt a correction when all of the following hold
      #1) the word is not in our list of unique words
      #2) the word is not in our list of named entities
      #3) the first letter of the word is not a capital letter
      #   - capitalised words were unlikely to be spelled wrong, so skipping them removed many errors
      #4) the word does not start with a digit - avoids trying to predict numbers we had not seen
      #5) the word is not an english word in WordNet i.e. it has been misspelled
      if word.lower() not in unique_words and word not in named_entity_list and not word[0].isupper() and not word[0].isdigit() and not english_word:
        lower = word.lower()
        #get the list of candidates
        candidates = get_candidates(lower)
        prob_list = []

        #if there is only 1 candidate append candidate to the corrected sentence
        #there is no need to do any work with probabilities
        if len(candidates) == 1:
          corrected_sentence.append(candidates[0])

        #if there is more than 1 candidate we need to get the best candidate, which
        #1) matches the tense of the word we were given
        #2) has the highest bigram probability of all the candidates following the word that appears before the word we are trying to predict
        elif len(candidates) > 1:

          #if the word is the first word in the sentence do normal probability as it has no previous word for bigrams
          if first_word == True:
            for candidate in candidates:
              prob_list.append(prob(candidate))

          else: 
            #run tense detection on the misspelled word so that the suggestions can be matched against its tense
            incorrect_word_tense = tense_detection(word)
            #keep only the candidates whose tense matches the tense of the word we are trying to correct
            matching_candidates = [candidate for candidate in candidates if tense_detection(candidate) == incorrect_word_tense]
            #if no candidate matches the tense we keep them all, otherwise prob_list would be empty and max would fail
            if matching_candidates:
              candidates = matching_candidates

            #get the bigram probability of each remaining candidate following the previous word
            for candidate in candidates:
              prob_list.append(bigram_probability(bigram_first_word, candidate))


          #we get the index of the max probability in the probability list
          #this list is indexed exactly the same as the list of remaining candidates
          maxidx = prob_list.index(max(prob_list))
          #we append the candidate with the best probability to the corrected sentence
          corrected_sentence.append(candidates[maxidx])

      #if the word doesn't fit the criteria for the first if statement don't do any correction and add to the list
      else: 
        corrected_sentence.append(word)

      #now we are moving on to the second word of the sentence
      first_word = False
      #move the word-1 holder on one place
      bigram_first_word = word

    #return the corrected sentence
    return corrected_sentence

correct_new(["this","whitr","cat"])
#assert(correct_new(["this","whitr","cat"]) == ['this','white','cat'])   


['this', 'white', 'cat']

## **Task 7 (5 Marks):**

Repeat the evaluation (as in Task 5) of your new algorithm and show that it outperforms the algorithm from Tasks 3 and 4.

In [37]:
def accuracy_new(test):

  fixed_sentence = []

  #call the new correct method on all the sentences we are trying to test
  fixed_sentence = [correct_new(sentence) for sentence in test]

  correct_sentence_counter = 0
  incorrect_sentence_counter = 0
  correct_words_counter = 0
  incorrect_words_counter = 0
  
  #here we compare each sentence to see how may are correct
  for i in range(len(test_corrected)):
    if (fixed_sentence[i] == test_corrected[i]):
      correct_sentence_counter = correct_sentence_counter + 1
    else:
      #print(fixed_sentence[i])
      #print(test_corrected[i])
      #print(test_original[i])
      incorrect_sentence_counter = incorrect_sentence_counter + 1

  #make each list of lists into a single list with all their respective words
  fixed_list = list(itertools.chain.from_iterable(fixed_sentence))
  corrected_list = list(itertools.chain.from_iterable(test_corrected))
  original_list = list(itertools.chain.from_iterable(test_original))


  #here we will compare the words with each other to see how many are wrong
  for x, y, z in zip(fixed_list, corrected_list, original_list):
      if x == y:
        correct_words_counter = correct_words_counter + 1
      else:
        #print("Predicted: %s, Expected: %s, Start word: %s " %(x, y, z))
        incorrect_words_counter = incorrect_words_counter + 1

  print("This is the number of total incorrect words: %d " %incorrect_words_counter)
  print("This is the number of total correct words: %d " %(correct_words_counter))
  correct_words_counter = float(correct_words_counter)
  compare_list = float(len(corrected_list))
  print("This is the percentage of correctly predicted words: %f" %((correct_words_counter/compare_list) * 100) ,"%")
  print("This is the percentage of correctly predicted sentences: %f" %((correct_sentence_counter/(incorrect_sentence_counter+correct_sentence_counter)) * 100) ,"%")

accuracy_new(test_original)

This is the number of total incorrect words: 94 
This is the number of total correct words: 1035 
This is the percentage of correctly predicted words: 91.674048 %
This is the percentage of correctly predicted sentences: 40.000000 %
