# spellchk idea 1

## Introduction
### Modifications based on the differences in character sets

*   We changed the code to make the predictions more accurate.
*   We added a step that measures how different each predicted word is from the original typo.
*   The measure is the difference between the character set of the typo and the character set of the predicted word.
*   The idea is that most typos repeat a character, omit a character, or replace one character with a different one.
*   The character set of the typo should therefore be very close to the character set of the correct word, so the difference between the two sets should be minimal.
*   For example, the word factor has the character set C(factor) = {'a', 'c', 'f', 'o', 'r', 't'}. Some common typos are:
    *   facttor: C(facttor) = {'a', 'c', 'f', 'o', 'r', 't'} = C(factor), so the character sets are identical and the difference is 0.
    *   facor: C(facor) = {'a', 'c', 'f', 'o', 'r'}, C(facor) - C(factor) = {}, C(factor) - C(facor) = {'t'}, so the largest difference is 1 character.
    *   facdor: C(facdor) = {'a', 'c', 'd', 'f', 'o', 'r'}, C(facdor) - C(factor) = {'d'}, C(factor) - C(facdor) = {'t'}, so the largest difference is 1 character in either direction.
*   Based on this principle, we compare the two character sets and take the size of the larger of the two one-sided set differences.
*   We use the code val["len_diff"] = max(len(set(val["token_str"]) - set(typo)), len(set(typo) - set(val["token_str"]))) to calculate this maximum difference between the two character sets.
*   Both directions are needed (predicted word's set minus typo's set, and typo's set minus predicted word's set) because, if one set is a subset of the other, subset minus superset is empty and only superset minus subset contains any elements. We therefore take the larger of the two sizes.
*   The line val["overall_score"] = - val["len_diff"] then ranks the candidates, so the candidate with the smallest difference is treated as the most likely correct word.
*   As an example, take the typo wear, whose correct word is war. The typo's character set is {'w', 'e', 'a', 'r'} and the correct word's is {'w', 'a', 'r'}, so the correct word has a difference of 1 from the typo.
*   Assume the model generates three candidates, ".", "war", and "pair", with "." scoring highest and "pair" scoring lowest. Without this selection step, "." would be the output. With it, the character set difference is 4 for ".", 1 for "war", and 2 for "pair". Since "war" has the smallest difference, it is selected as the output, which is the correct word.
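
The two worked examples above can be checked with a small standalone sketch. The helper name `char_set_diff` is introduced here for illustration only and is not part of spellchk.py; it simply repeats the max-of-two-set-differences expression from the bullets.

```python
def char_set_diff(typo: str, candidate: str) -> int:
    """Size of the larger one-sided difference between the two character sets."""
    typo_chars, cand_chars = set(typo), set(candidate)
    return max(len(cand_chars - typo_chars), len(typo_chars - cand_chars))

# Typos of "factor" from the example above.
factor_diffs = {t: char_set_diff(t, "factor") for t in ["facttor", "facor", "facdor"]}
print(factor_diffs)  # {'facttor': 0, 'facor': 1, 'facdor': 1}

# Candidates for the typo "wear", listed in the assumed model ranking.
wear_diffs = {c: char_set_diff("wear", c) for c in [".", "war", "pair"]}
print(wear_diffs)  # {'.': 4, 'war': 1, 'pair': 2}

# The candidate with the smallest difference wins.
best = min(wear_diffs, key=wear_diffs.get)
print(best)  # war
```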

## Try1. Modify select_correction function

__Idea:__ We compare each predicted token with the typo, calculate the difference between the two strings, and sort the prediction list by that difference in ascending order.

In [9]:
from spellchk import *
from io import StringIO

In [10]:
def select_correction(typo, predict):
    # return the most likely prediction for the mask token
    for val in predict:
        val["len_diff"] = max(len(set(val["token_str"]) - set(typo)), len(set(typo) - set(val["token_str"])))
        val["overall_score"] = - val["len_diff"]
    new_predict = sorted(predict, key = lambda x: x["overall_score"], reverse = True)
    return new_predict[0]["token_str"]

Outcome: After improving our select_correction function, our dev score rises to 0.65.

## idea 1.1
### Modification based on the differences in character sets and the differences of length of the tokens themselves

*   Using the differences in character sets increased the accuracy substantially, but what if we also include the absolute difference in token length?
*   The idea is the same as the character set strategy, with an added comparison of the lengths of the tokens themselves.
*   This handles edge cases where the model generates several candidates with the same character set difference but very different token lengths.
*   Consider the typo followerr, which has a length of 9 and the character set {'e', 'f', 'l', 'o', 'r', 'w'}. The correct word is follower, which has a length of 8 and the same character set. Suppose the model generates two candidates, "follower" and "flower", with "flower" rated higher. Under the previous idea, both candidates have the same character set as the typo, so "flower" may still be chosen. If we also include raw length, "follower" has a length difference of 1 while "flower" has a length difference of 3, so "follower" is chosen instead, which is the correct answer.
*   The new score therefore combines the character set difference and the token length difference, where abs(len(val["token_str"]) - len(typo)) calculates the difference between the token lengths.
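
The followerr example can be verified with a short sketch. The function name `combined_diff` is an illustrative assumption; it adds the character set difference and the length difference exactly as described in the bullets.

```python
def combined_diff(typo: str, candidate: str) -> int:
    """Character set difference plus absolute token length difference."""
    char_diff = max(len(set(candidate) - set(typo)), len(set(typo) - set(candidate)))
    length_diff = abs(len(candidate) - len(typo))
    return char_diff + length_diff

typo = "followerr"
scores = {c: combined_diff(typo, c) for c in ["flower", "follower"]}
print(scores)  # {'flower': 3, 'follower': 1}
```

Both candidates have a character set difference of 0, so only the length term separates them.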

In [None]:
def select_correction(typo, predict):
    # return the most likely prediction for the mask token
    for val in predict:
        val["combined_len_diff"] = max(len(set(val["token_str"]) - set(typo)), len(set(typo) - set(val["token_str"]))) + abs(len(val["token_str"]) - len(typo))
        val["overall_score"] = - val["combined_len_diff"]
    predict = sorted(predict, key = lambda x: x["overall_score"], reverse = True)
    return predict[0]['token_str']

Outcome: Despite the additional check, the accuracy stays at 0.65. The likely reason is that the edge case described above is uncommon in the dev set. On a larger set of typos, this version is likely to produce slightly better results than the first solution.

## Try2. Modify spellchk function

__Idea:__ Now that we already make full use of the typo, further gains seem possible only in the prediction step. Through experiments, we found that enlarging the prediction list, i.e., considering more candidate replacements, achieves a better score.

__Inspiration:__ The flaw of a small prediction list

In [17]:
str2 = "7,14,16\tSo I think we would not be live if our ancestors did not develop siences and tecnologies ."

with StringIO(str2) as f:
    for (locations, sent) in get_typo_locations(f):
        for i in locations:
            predict = fill_mask(
                    " ".join([ sent[j] if j != i else mask for j in range(len(sent)) ]), 
                    top_k=20
                )
            print([p["token_str"] for p in predict])
            print(select_correction(sent[i],predict))

['surprised', 'happier', 'disappointed', 'happy', 'ashamed', 'satisfied', 'forgotten', 'punished', 'proud', 'disturbed', 'offended', 'lucky', 'harmed', 'shocked', 'confused', 'fooled', 'pleased', 'amazed', 'mistaken', 'embarrassed']
fooled
['beliefs', 'myths', 'ideas', 'linguistic', 'religions', 'languages', 'ethics', 'language', 'knowledge', 'theories', 'scientific', 'laws', 'traditions', 'anthropology', 'cultures', 'philosophy', 'morals', 'religion', 'philosophical', 'values']
ideas
['abilities', 'powers', 'spirits', 'dreams', 'memories', 'weaknesses', 'illusions', 'talents', 'pains', 'visions', 'minds', 'desires', 'stones', 'diseases', 'roots', 'bodies', 'patterns', 'souls', 'emotions', 'skills']
emotions


The predicted sentence is __"So I think we would not be fooled if our ancestors did not develop ideas and emotions ."__

The prediction lists returned:

- predict for 'live' : ['surprised', 'happier', 'disappointed', 'happy', 'ashamed', 'satisfied', 'forgotten', 'punished', 'proud', 'disturbed', 'offended', 'lucky', 'harmed', 'shocked', 'confused', 'fooled', 'pleased', 'amazed', 'mistaken', 'embarrassed']


- predict for 'siences':['beliefs', 'myths', 'ideas', 'linguistic', 'religions', 'languages', 'ethics', 'language', 'knowledge', 'theories', 'scientific', 'laws', 'traditions', 'anthropology', 'cultures', 'philosophy', 'morals', 'religion', 'philosophical', 'values']


- predict for 'tecnologies': ['abilities', 'powers', 'spirits', 'dreams', 'memories', 'weaknesses', 'illusions', 'talents', 'pains', 'visions', 'minds', 'desires', 'stones', 'diseases', 'roots', 'bodies', 'patterns', 'souls', 'emotions', 'skills']

The correct words ('alive', 'sciences', 'technologies') do not appear in the prediction lists. Thus, no matter how much we improve the select_correction function, it can never return the correct answer.
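
A minimal sketch of this check follows. The candidate lists are shortened to the first five entries of each list above for brevity, the intended corrections are our own reading of the sentence rather than model output, and the helper name `unreachable_typos` is hypothetical.

```python
# First five candidates per typo, copied from the lists above (truncated for brevity).
candidates_by_typo = {
    "live": ["surprised", "happier", "disappointed", "happy", "ashamed"],
    "siences": ["beliefs", "myths", "ideas", "linguistic", "religions"],
    "tecnologies": ["abilities", "powers", "spirits", "dreams", "memories"],
}

# Assumed intended corrections (not produced by the model).
expected = {"live": "alive", "siences": "sciences", "tecnologies": "technologies"}

def unreachable_typos(candidates_by_typo, expected):
    """Typos whose expected correction is missing from the candidate list."""
    return [t for t, cands in candidates_by_typo.items() if expected[t] not in cands]

missing = unreachable_typos(candidates_by_typo, expected)
print(missing)  # ['live', 'siences', 'tecnologies']
```

Any typo reported here cannot be fixed by reranking alone, because the selection step only reorders the candidates it is given.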

__Our approaches to expand the prediction list:__

1. Increase the top_k value from 20 to 35

2. Expand the prediction list by adding the predictions for the truncated sentence (the sentence cut off right after the typo)

Using method 1, our dev score increases from 0.65 to 0.69. The best top_k value we found is 35 (originally 20).

Using method 2, our dev score increases from 0.69 to 0.73.
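
To make method 2 concrete, the sketch below builds the two masked inputs for one typo position without calling the model. The `MASK` string is a placeholder assumption (the real token comes from `mask` in spellchk.py), and `masked_inputs` is a hypothetical helper name.

```python
MASK = "<mask>"  # placeholder; the actual mask token is defined in spellchk.py

def masked_inputs(sent, i, mask=MASK):
    """Return the full masked sentence and the one truncated right after position i."""
    full = " ".join(mask if j == i else tok for j, tok in enumerate(sent))
    truncated = " ".join(mask if j == i else tok for j, tok in enumerate(sent[: i + 1]))
    return full, truncated

sent = ("So I think we would not be live if our ancestors "
        "did not develop siences and tecnologies .").split()
full, truncated = masked_inputs(sent, 7)
print(full)
print(truncated)  # So I think we would not be <mask>
```

In the truncated input the mask is the last token, so the model predicts it from the left context only. The candidates from both inputs are then concatenated before selection.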

In [14]:
str2 = "7,14,16\tSo I think we would not be live if our ancestors did not develop siences and tecnologies ."
str_list = [str2]

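def spellchk(fh):
    for (locations, sent) in get_typo_locations(fh):
        # note: this aliases sent, so each correction is also written back into sent
        spellchk_sent = sent
        for i in locations:
            # candidates from the full sentence with the typo masked
            predict = fill_mask(
                " ".join([ sent[j] if j != i else mask for j in range(len(sent)) ]),
                top_k=35
            )
            # extra candidates from the sentence truncated right after the typo
            # (this condition always holds for a valid index i)
            if i < len(sent):
                predict += fill_mask(
                    " ".join([ sent[:i+1][j] if j != i else mask for j in range(len(sent[:i+1])) ]),
                    top_k=35
                )
            spellchk_sent[i] = select_correction(sent[i], predict)
        yield(locations, spellchk_sent)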

for str_test in str_list:
    with StringIO(str_test) as f:
        for (locations, spellchk_sent) in spellchk(f):
            print("{locs}\t{sent}".format(
                locs=",".join([str(i) for i in locations]),
                sent=" ".join(spellchk_sent)
            ))

7	So I think we would not be alive
7	So I think we would not be alive .
7	So I think we would not be alive if our ancestors did not develop siences and tecnologies .


## Analysis

The main idea behind the implemented changes was to improve the accuracy of typo corrections by measuring the dissimilarity between each predicted token and the typo, in terms of their character sets and, in idea 1.1, their lengths. Each prediction receives an overall score equal to the negated difference, so predictions with smaller differences receive higher scores. Sorting the predictions by score in descending order ensures that the prediction with the highest score, i.e., the smallest difference and the best match, is chosen as the final correction. Enlarging the prediction list then made it more likely that the correct word is among the candidates, raising the dev score from 0.65 to 0.73. These modifications collectively improved the accuracy of the typo correction process by taking into account both the dissimilarity between the typo and each prediction and the ranking of predictions.
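
One detail of how the ranking enters: Python's `sorted` is stable, also with `reverse=True`, so predictions with equal overall scores keep the order the model returned them in. The sketch below illustrates this with hypothetical candidates and made-up model scores, using the character set score from idea 1.

```python
typo = "followerr"

# Hypothetical model output, already ordered by the model's own score.
predict = [
    {"token_str": "flower", "score": 0.6},
    {"token_str": "follower", "score": 0.3},
    {"token_str": "floor", "score": 0.1},
]

for val in predict:
    val["len_diff"] = max(len(set(val["token_str"]) - set(typo)),
                          len(set(typo) - set(val["token_str"])))
    val["overall_score"] = -val["len_diff"]

ranked = sorted(predict, key=lambda x: x["overall_score"], reverse=True)
order = [p["token_str"] for p in ranked]
print(order)  # ['flower', 'follower', 'floor']
```

Here "flower" and "follower" tie on the character set score, so the tie falls back to the model's ranking and "flower" stays first.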

## Group work

* aga149: Worked on the .ipynb report to explain the changes in spellchk.py clearly, and re-ran the experiments on the code to reproduce the same results.
* zwa204: Conducted the initial experiment comparing the typo with the predicted tokens; expanded the prediction list to obtain a better score.
* thl28: Experimented with various techniques for improving accuracy, including analysis of the predicted and typo tokens (differences in character sets, token length, etc.), and described the approaches in the .ipynb report.
