# spellchk: default program

In [14]:
from default import *

## Documentation

In [2]:
def spellchk(fh):
    for (locations, sent) in get_typo_locations(fh):
        spellchk_sent = sent
        sent_lower = [word.lower() for word in sent]
        for i in locations:
            # predict top_k replacements only for the typo word at index i
            predict = fill_mask(
                " ".join([ sent[j] if j != i else mask for j in range(len(sent)) ]), 
                top_k=200
            )
            predict = [pred['token_str'] for pred in predict]
            logging.info((sent_lower[i], predict))
            spellchk_sent[i] = select_correction(sent_lower[i], predict)
        yield(locations, spellchk_sent)

In [3]:
def select_correction(typo, predict):
    # predict is already a list of token strings ordered by score (see spellchk),
    # so the most likely prediction for the mask token is the first entry
    return predict[0]

The spellchk function visits each typo location in a sentence, replaces the word at that location with the mask token, and asks the Hugging Face fill-mask model for the top 200 candidate replacements, ordered from most to least likely. The list of candidate words is then passed, together with the lowercased typo, to the select_correction function, which chooses the replacement. By default, select_correction returns the first candidate in the list, since it has the highest score among the possible replacements.
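As a minimal sketch of the masking step described above, the snippet below builds the string that would be handed to the fill-mask model for one typo location. The helper name `mask_sentence`, the `<mask>` token string, and the sample sentence are assumptions for illustration; the notebook itself uses a `mask` value that is not defined in the cells shown.

```python
MASK = "<mask>"  # assumed mask token; the real one depends on the model

def mask_sentence(sent, i, mask=MASK):
    """Join the words of sent into one string, replacing word i with mask."""
    return " ".join(mask if j == i else word for j, word in enumerate(sent))

sent = ["she", "red", "the", "book"]   # hypothetical input; index 1 is the typo
print(mask_sentence(sent, 1))          # she <mask> the book
```

Only the typo position is masked, so the model sees every other word of the sentence as context.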

In [None]:
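import numpy as np

def char_rep(word):
    # 1 x 128 row vector holding the count of each ASCII character in word
    chrep = np.zeros((1, 128))
    for c in word:
        # skip non-ASCII characters, which have no slot in the vector
        if ord(c) < 128:
            chrep[0][ord(c)] += 1
    return chrep

def select_correction(typo, predict):
    # remove predictions that contain non-ASCII characters
    predict = [word for word in predict if all(ord(c) < 128 for c in word)]

    # nothing left to compare against: keep the original word
    if not predict:
        return typo

    # find the counts of each distinct character in the typo and predictions
    typo_char = char_rep(typo)
    predict_chars = np.vstack([char_rep(word) for word in predict])

    # return the prediction whose counts are closest to the typo's
    # (np.linalg.norm computes the L2 norm of each row by default)
    differences = np.linalg.norm(predict_chars - typo_char, axis=1)
    return predict[np.argmin(differences)]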

We created a character-level representation for the typo and for each prediction. Each word is represented by a 128-dimensional NumPy array that holds the count of each ASCII character in the word. Beforehand, we remove all predictions that contain non-ASCII characters and lowercase the typo. If we consider N predictions from the Hugging Face model, the predictions form an N x 128 matrix while the typo is a 1 x 128 row vector. We subtract the typo vector from every row to get an N x 128 difference matrix, take the norm of each row, and return the prediction at the row index where the norm is lowest.
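The selection rule can also be sketched without NumPy, which makes the per-word arithmetic easier to follow. This is a dependency-free restatement under stated assumptions, not the notebook's code: the names `char_counts` and `closest_by_counts` and the candidate list are hypothetical, inputs are assumed to be ASCII-only, and `math.dist` requires Python 3.8 or later.

```python
import math

def char_counts(word):
    """Return a list of 128 ASCII character counts for word."""
    counts = [0] * 128
    for c in word:
        counts[ord(c)] += 1   # assumes word is ASCII-only
    return counts

def closest_by_counts(typo, candidates):
    """Return the candidate whose count vector is nearest (L2) to the typo's."""
    typo_vec = char_counts(typo)
    return min(candidates, key=lambda w: math.dist(char_counts(w), typo_vec))

# hypothetical candidates standing in for the model's predictions
print(closest_by_counts("teh", ["a", "this", "the", "that"]))   # the
```

Because the representation only counts characters, it ignores their order: "teh" and "the" map to the same vector, which is why transposition typos are matched at distance zero.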

## Analysis

We experimented with the number of predictions N considered as substitutes for the typo, trying N in {10, 20, 50, 100, 200, 500, 1000}. N = 200 gave us the best result, with a score of 0.81 on the dev set. We also tried replacing the L1 norm with the L2 norm for efficiency, but observed no noticeable speedup.
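Apart from speed, the choice of norm can change which candidate is selected. The sketch below compares the two norms on a deliberately contrived input; the helper names, the test word, and the candidates are all assumptions for illustration and are not taken from the experiments above.

```python
def char_counts(word):
    """Return a list of 128 ASCII character counts for word."""
    counts = [0] * 128
    for c in word:
        counts[ord(c)] += 1   # assumes word is ASCII-only
    return counts

def distance(a, b, p):
    """L_p distance between two equal-length count vectors (p = 1 or 2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def pick(typo, candidates, p):
    """Return the candidate with the smallest L_p distance to the typo."""
    typo_vec = char_counts(typo)
    return min(candidates, key=lambda w: distance(char_counts(w), typo_vec, p))

typo, candidates = "all", ["a", "tail"]   # contrived example
print(pick(typo, candidates, p=1))        # a
print(pick(typo, candidates, p=2))        # tail
```

Here "a" differs from "all" by two counts of a single character (L1 = 2, L2 = 2), while "tail" differs by one count in each of three characters (L1 = 3, L2 ≈ 1.73). The L2 norm penalizes a large difference in one character more heavily than several small ones, so the two norms disagree.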

We also tried a different character-level representation for each word: a 130-dimensional array whose first 128 entries contain the counts of each ASCII character in the word, excluding its first and last characters, and whose 129th and 130th entries contain the ASCII integer codes of the first and last characters. This gave a worse result than the 128-dimensional array of ASCII character counts that includes the first and last characters.
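One way to read the 130-dimensional representation described above is sketched below. The function name `char_rep_ends` and the handling of very short words are assumptions: for an empty word both end slots stay at 0, and for a one-character word the same code fills both slots.

```python
def char_rep_ends(word):
    """130 entries: interior ASCII counts, then first and last character codes."""
    rep = [0] * 130
    for c in word[1:-1]:          # interior characters only
        rep[ord(c)] += 1          # assumes ASCII-only input
    if word:                      # empty word: leave both end slots at 0 (assumption)
        rep[128] = ord(word[0])   # 129th entry: code of the first character
        rep[129] = ord(word[-1])  # 130th entry: code of the last character
    return rep

rep = char_rep_ends("cat")
print(rep[ord("a")], rep[128], rep[129])   # 1 99 116
```

Note that the two character-code entries are on a much larger scale (up to 127) than the counts, so under a norm a mismatch in the first or last letter can outweigh every count difference. Whether that explains the worse result was not tested here.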

## Contribution

Everyone worked together and communicated well on Discord to finish this homework. Everyone contributed to both the code and the notebook.