# **Final N-gram model Prediction**

After testing the majority voting algorithm, we discussed the possibility that such a simple model is hitting a plateau.

We also noted that the set of words in a language is finite, which raises a question: should we worry about overfitting enough to split the dataset? Once the training data covers a large population of prefix combinations, a held-out sample may not be necessary. Removing the most frequent words from the training set does not prevent overfitting; it only removes important data.

Therefore, for a last test, we will build a final dataset with a large conversational vocabulary and smooth it with a sort of "Laplace smoothing" by also scanning the most frequent words of the language once. We will then train once with a split and once without, to see the effects of "overfitting".
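
For reference, textbook add-one (Laplace) smoothing can be sketched as follows. This only illustrates the idea and is not the procedure used in this notebook, which approximates it by scanning the most frequent words once. The helper `laplace_next_char_probs`, the toy counts and the 26-letter alphabet are assumptions made for the example.

```python
from collections import Counter

def laplace_next_char_probs(counts, alphabet, alpha=1.0):
    """Add-alpha smoothed P(next char | prefix) from raw follow-up counts."""
    total = sum(counts.values()) + alpha * len(alphabet)
    return {ch: (counts.get(ch, 0) + alpha) / total for ch in alphabet}

# Toy counts for the characters seen after a hypothetical prefix "th"
counts_after_th = Counter({"e": 8, "a": 2})
alphabet = "abcdefghijklmnopqrstuvwxyz"

probs_after_th = laplace_next_char_probs(counts_after_th, alphabet)
print(probs_after_th["e"])  # frequently seen character
print(probs_after_th["z"])  # never seen, but still non-zero
```

Every character of the alphabet ends up with a non-zero probability, which is the property we want when an unusual prefix shows up.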

In [1]:
import numpy as np
import pandas as pd

For our conversational dataset, I'll use the [Movie Dialog Corpus](https://www.kaggle.com/datasets/Cornell-University/movie-dialog-corpus?select=movie_lines.tsv) from Kaggle. Please download the `movie_lines.tsv` file and place it in the `data/raw` folder.

In [2]:
# read the tsv file
df_movies = pd.read_csv('../data/raw/movie_lines.tsv', sep='\t', on_bad_lines='warn', names=['lineID', 'characterID', 'movieID', 'character', 'text'])

Skipping line 32351: expected 5 fields, saw 6
Skipping line 32390: expected 5 fields, saw 6
Skipping line 32583: expected 5 fields, saw 6
Skipping line 32585: expected 5 fields, saw 6
Skipping line 35684: expected 5 fields, saw 6
Skipping line 62132: expected 5 fields, saw 6
Skipping line 86637: expected 5 fields, saw 6
Skipping line 86722: expected 5 fields, saw 6
Skipping line 86914: expected 5 fields, saw 6
Skipping line 86960: expected 5 fields, saw 6
Skipping line 87010: expected 5 fields, saw 6
Skipping line 87011: expected 5 fields, saw 6
Skipping line 87086: expected 5 fields, saw 6
Skipping line 120607: expected 5 fields, saw 6
Skipping line 120719: expected 5 fields, saw 7
Skipping line 120739: expected 5 fields, saw 6
Skipping line 120783: expected 5 fields, saw 6
Skipping line 130284: expected 5 fields, saw 7
Skipping line 131048: expected 5 fields, saw 6

  df_movies = pd.read_csv('../data/raw/movie_lines.tsv', sep='\t', on_bad_lines='warn', names=['lineID', 'characterID',

In [3]:
df_movies.head(100)

Unnamed: 0,lineID,characterID,movieID,character,text
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.
...,...,...,...,...,...
95,L590,u0,m0,BIANCA,Queen Harry?
96,L589\tu4\tm0\tJOEY\tSo yeah I've got the Sears...,,,,
97,L397,u0,m0,BIANCA,Hopefully.
98,L396,u4,m0,JOEY,Exactly So you going to Bogey Lowenbrau's thi...


Out of roughly 304k lines, about 11k were lost to malformed rows in the dataset.

In [4]:
df_movies.shape

(293202, 5)
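
If those rows matter, one possible way to recover them is to parse the file by hand and split each line on the first four tabs only. This is a sketch that assumes the extra fields come from tab characters inside the `text` column, which was not verified against the dataset. The helper `parse_movie_line` and the sample line are made up for illustration.

```python
COLUMNS = ["lineID", "characterID", "movieID", "character", "text"]

def parse_movie_line(raw_line):
    """Split one raw line into the five expected fields, keeping extra tabs in the text."""
    fields = raw_line.rstrip("\n").split("\t", 4)  # at most 5 fields
    if len(fields) < 5:
        return None  # too few fields: genuinely malformed
    return dict(zip(COLUMNS, fields))

# Made-up line with a stray tab inside the text column
sample = "L100\tu0\tm0\tBIANCA\tWell\tthat was unexpected\n"
row = parse_movie_line(sample)
print(row["text"])
```

The rows parsed this way could then be passed to `pd.DataFrame` instead of relying on `pd.read_csv` to skip them.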

## **Data Preprocessing**

First, we need to preprocess the data: drop the rows with missing values, strip all punctuation except apostrophes, and lowercase everything. Words that contain digits or consist of a single repeated letter (such as "mmmmmmm") are filtered out afterwards.
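
The character-level cleaning can be tried on a single string before applying it to the whole DataFrame. This is a small sketch using the same character class as the pandas cell further down; `clean_line` and the sample sentence are made up for illustration.

```python
import re

def clean_line(text):
    """Drop everything except letters, digits, apostrophes and spaces, then lowercase."""
    return re.sub(r"[^a-zA-Z0-9' ]", "", text).lower()

cleaned = clean_line("Okay -- you're gonna need to learn how to lie.")
print(repr(cleaned))
print(cleaned.split(" "))
```

Removing the dashes leaves two consecutive spaces, so splitting on a single space produces an empty token. This is why `''` shows up in the word counts later and has to be filtered out.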

In [5]:
# Remove the columns that are not needed
df_movies_processed = df_movies.drop(columns=['lineID', 'characterID', 'movieID', 'character'])

In [6]:
# Removing NaN values
df_movies_processed = df_movies_processed.dropna()

# Removing rows whose text is a single space
df_movies_processed = df_movies_processed[df_movies_processed['text'] != ' ']

# Removing special characters but keep ' and spaces
df_movies_processed['text'] = df_movies_processed['text'].str.replace('[^a-zA-Z0-9\' ]', '', regex=True)

# Remove "-" (a no-op here: the regex above already strips hyphens)
df_movies_processed['text'] = df_movies_processed['text'].str.replace('-', '')

# lower case
df_movies_processed['text'] = df_movies_processed['text'].str.lower()

In [7]:
df_movies_processed.head(100)

Unnamed: 0,text
0,they do not
1,they do to
2,i hope so
3,she okay
4,let's go
...,...
100,it's more
101,perm
102,patrick is that a
103,it's just you


Now, let's create a list with all the words.

In [8]:
words_list = df_movies_processed['text'].str.split(' ').tolist()

In [9]:
words_list[:10]

[['they', 'do', 'not'],
 ['they', 'do', 'to'],
 ['i', 'hope', 'so'],
 ['she', 'okay'],
 ["let's", 'go'],
 ['wow'],
 ['okay', '', "you're", 'gonna', 'need', 'to', 'learn', 'how', 'to', 'lie'],
 ['no'],
 ['like', 'my', 'fear', 'of', 'wearing', 'pastels'],
 ['what', 'good', 'stuff']]

In [10]:
words_list = [item for sublist in words_list for item in sublist]

In [11]:
words_list[:10]

['they', 'do', 'not', 'they', 'do', 'to', 'i', 'hope', 'so', 'she']

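Let's remove words that:

* Appear 5 times or fewer  
* Have 15 or more characters  
* Have only one type of letter (such as "mmmmmmm")  
* Contain anything other than letters (digits or apostrophes, so contractions such as "let's" are dropped too)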
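
These rules can be checked on a toy list. The sketch below mirrors the condition used in the `preprocess` cell further down (a word is kept if its count is greater than 5 and its length is less than 15); `keep_word` and the toy tokens are made up for illustration.

```python
from collections import Counter

def keep_word(word, freq, min_count=5, max_len=15):
    """Apply the notebook's filter condition to a single token."""
    return (
        word != ""
        and len(set(word)) > 1      # more than one distinct letter
        and word.isalpha()          # letters only
        and len(word) < max_len     # fewer than 15 characters
        and freq[word] > min_count  # appears more than 5 times
    )

tokens = ["hello"] * 6 + ["mmmm"] * 6 + ["let's"] * 6 + ["rare"] * 5
freq = Counter(tokens)
kept = sorted({w for w in tokens if keep_word(w, freq)})
print(kept)  # ['hello']
```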

In [12]:
word_frequency = {}

for word in words_list:
    if word in word_frequency:
        word_frequency[word] += 1
    else:
        word_frequency[word] = 1

In [13]:
word_frequency

{'they': 11277,
 'do': 21291,
 'not': 18161,
 'to': 74812,
 'i': 95142,
 'hope': 942,
 'so': 12330,
 'she': 7992,
 'okay': 4256,
 "let's": 2288,
 'go': 9263,
 'wow': 240,
 '': 117328,
 "you're": 12612,
 'gonna': 4035,
 'need': 3667,
 'learn': 402,
 'how': 9673,
 'lie': 422,
 'no': 17927,
 'like': 13893,
 'my': 19172,
 'fear': 254,
 'of': 36042,
 'wearing': 240,
 'pastels': 2,
 'what': 29794,
 'good': 6817,
 'stuff': 1016,
 'figured': 305,
 "you'd": 1379,
 'get': 13203,
 'the': 90717,
 'eventually': 78,
 'thank': 1826,
 'god': 2164,
 'if': 12205,
 'had': 5121,
 'hear': 1700,
 'one': 9472,
 'more': 4137,
 'story': 834,
 'about': 13023,
 'your': 19457,
 'coiffure': 3,
 'me': 29558,
 'this': 22600,
 'endless': 15,
 'blonde': 77,
 'babble': 5,
 "i'm": 20784,
 'boring': 107,
 'myself': 1121,
 'crap': 197,
 'you': 119540,
 'listen': 1711,
 'always': 2366,
 'been': 6067,
 'selfish': 41,
 'but': 15755,
 'then': 5561,
 "that's": 10112,
 'all': 14228,
 'say': 5444,
 'well': 9099,
 'never': 5046,


In [14]:
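def preprocess(words: list[str]):
    """Keep purely alphabetic words with more than one distinct letter,
    fewer than 15 characters and more than 5 occurrences in the corpus."""
    return list(
        filter(
            lambda x: x != '' and len(set(x)) > 1 and x.isalpha() and len(x) < 15 and word_frequency[x] > 5,
            words
        )
    )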

In [15]:
words_list_preprocessed = preprocess(words_list)

In [16]:
len(words_list_preprocessed), len(words_list)

(2504600, 3060706)

In [17]:
words_list_preprocessed[:5]

['they', 'do', 'not', 'they', 'do']

## **Loading Test Data**

Let's utilize the same 5k most frequent words from [this file.](../data/wordFrequency.xlsx)

In [18]:
df_words = pd.read_excel('../data/wordFrequency.xlsx', sheet_name='4 forms (219k)')

words = df_words['word'].values

words[:10]

array(['the', 'to', 'and', 'of', 'a', 'in', 'i', 'that', 'you', 'it'],
      dtype=object)

Now, we are going to filter these words out of the training set.

In [None]:
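# A set gives constant-time membership tests instead of scanning the array for every word
test_words_set = set(words)
words_list_training = list(filter(lambda x: x not in test_words_set, words_list_preprocessed))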

In [None]:
len(words_list_training)

172994

## Saving

In [60]:
with open('../data/movie_lines_processed.txt', 'w') as f:
    for word in words_list_preprocessed:
        f.write(f'{word}\n')

In [61]:
with open('../data/movie_lines_filtered.txt', 'w') as f:
    for word in words_list_training:
        f.write(f'{word}\n')

# **Creating the N-grams**

We are going to use the same functions already created in [this script.](parallel_generate_probs.py)

Now we create the character n-grams twice with `n = 5`: once from the full word list (the "overfitted" model) and once from the filtered list.
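
The script itself is not reproduced in this notebook. As a rough idea of what such a probability table might look like, here is a minimal single-process sketch. The key layout (an `n`-character context mapped to next-character probabilities, with `%` padding) is an assumption inferred from how `evaluate_model` looks up `probs` further down, and `build_char_ngram_probs` is a hypothetical name, not a function of `parallel_generate_probs.py`.

```python
from collections import Counter, defaultdict

def build_char_ngram_probs(words, n, pad="%"):
    """Map each n-character context to the relative frequency of the next character."""
    counts = defaultdict(Counter)
    for word in words:
        padded = pad * (n - 1) + word
        for idx in range(n, len(padded)):
            counts[padded[idx - n:idx]][padded[idx]] += 1
    return {
        ctx: {ch: c / sum(followers.values()) for ch, c in followers.items()}
        for ctx, followers in counts.items()
    }

toy_probs = build_char_ngram_probs(["the", "then", "they"], n=2)
print(toy_probs["th"])  # {'e': 1.0}
print(toy_probs["he"])  # {'n': 0.5, 'y': 0.5}
```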

In [20]:
!python ./parallel_generate_probs.py ../data/movie_lines_processed.txt 5 -o movie-lines/movie-lines

Using 16 cores and chunk size of 156537
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Processing chunk with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Processing chunk with 156537 words
Processing chunk with 8 words
Chunk processed with 8 words
Processing chunk with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
Chunk processed with 156537 words
C

In [21]:
!python ./parallel_generate_probs.py ../data/movie_lines_filtered.txt 5 -o movie-lines/movie-lines-filtered

Using 16 cores and chunk size of 10812
Processing chunk with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Chunk processed with 10812 words
Processing chunk with 10812 words
Chunk processed with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Chunk processed with 10812 words
Chunk processed with 10812 words
Processing chunk with 10812 words
Chunk processed with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Chunk processed with 10812 words
Chunk processed with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Processing chunk with 10812 words
Chunk processed with 10812 words
Chunk processed with 10812 words
Chunk processed with 10812 words
Chunk processed with 10812 words
Processing chunk with 2 words
Chunk processed with 2 w

Opening the probability file:

In [33]:
import json

with open('../data/out/movie-lines/movie-lines_probs.json', 'r') as f:
    probs = json.load(f)

with open('../data/out/movie-lines/movie-lines-filtered_probs.json', 'r') as f:
    probs_filtered = json.load(f)

### **Loading the Competitor Model**  

In [34]:
import json

with open('../data/out/kaggle/unigram_probs.json', 'r') as f:
    probs_kaggle = json.load(f)

with open('../data/out/kaggle/unigram-filtered_probs.json', 'r') as f:
    probs_kaggle_filtered = json.load(f)

len(probs_kaggle), len(probs_kaggle_filtered)

(48452, 48387)

# **Testing the Model**

In [35]:
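def evaluate_model(n, probs, test):
    """Return (top-1, top-3, top-5) next-character accuracy over the words in `test`.

    Only positions whose context exists in `probs` are counted.
    """
    acc_exact = 0
    acc_top3 = 0
    acc_top5 = 0
    total = 0

    for word in test:
        # Pad the word so that its first characters also get a context
        word = '%' * (n - 1) + str(word)
        # Start at n rather than a hard-coded 4: with n=5, index 4 produced an empty context
        for idx in range(n, len(word)):
            prev = word[idx - n:idx]
            nxt = word[idx]

            if prev in probs:
                # Rank the candidate characters once, most probable first
                ranked = sorted(probs[prev], key=probs[prev].get, reverse=True)
                if nxt == ranked[0]:
                    acc_exact += 1
                if nxt in ranked[:3]:
                    acc_top3 += 1
                if nxt in ranked[:5]:
                    acc_top5 += 1
                total += 1

    return acc_exact / total, acc_top3 / total, acc_top5 / total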
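
Because `evaluate_model` only scores positions whose context is present in `probs`, the accuracies say nothing about how often the model has no prediction at all. A complementary coverage measure could look like the sketch below; `context_coverage` and the toy table are hypothetical and were not part of the experiments reported here.

```python
def context_coverage(n, probs, test, pad="%"):
    """Fraction of positions whose n-character context exists in `probs`."""
    seen = 0
    total = 0
    for word in test:
        padded = pad * (n - 1) + str(word)
        for idx in range(n, len(padded)):
            total += 1
            if padded[idx - n:idx] in probs:
                seen += 1
    return seen / total if total else 0.0

toy_probs = {"%t": {"h": 1.0}, "th": {"e": 1.0}}
coverage = context_coverage(2, toy_probs, ["the", "she"])
print(coverage)  # 0.5
```

A low coverage on a test set would mean that a high accuracy is computed over only a small share of the typed characters.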

In [36]:
print(evaluate_model(5, probs, words))

(0.5459449541284404, 0.8044036697247706, 0.9048807339449542)


In [37]:
print(evaluate_model(5, probs_filtered, words))

(0.3831352574985852, 0.6254843062992469, 0.7281354751643376)


In [38]:
print(evaluate_model(4, probs_kaggle, words))

(0.5012652736606175, 0.7915913527582966, 0.902176270696262)


In [39]:
print(evaluate_model(4, probs_kaggle_filtered, words))

(0.46188422082865116, 0.7519483814840323, 0.8671838184652191)


## Hard Testing

Let's put the "overfitting" models to the test with very unusual words.

In [40]:
with open('../data/wordnet_words.txt', 'r') as f:
    wordnet_words = f.readlines()

wordnet_words = list(map(lambda x: x.replace('\n', ''), wordnet_words))

wordnet_words[:10]

['abaxial',
 'dorsal',
 'adaxial',
 'ventral',
 'acroscopic',
 'basiscopic',
 'abducent',
 'abducting',
 'adducent',
 'adductive']

In [41]:
print(evaluate_model(5, probs, wordnet_words))

(0.32417132195394727, 0.5454243462758624, 0.6508617587297018)


In [42]:
print(evaluate_model(5, probs_filtered, wordnet_words))

(0.342144021651953, 0.5484675351946313, 0.6486282332834615)


In [43]:
print(evaluate_model(4, probs_kaggle, wordnet_words))

(0.39374438409685364, 0.666887361807857, 0.7838091011280917)


In [44]:
print(evaluate_model(4, probs_kaggle_filtered, wordnet_words))

(0.3944891874043767, 0.6674847139169303, 0.7841283066785766)


## Soft Testing

Let's test the models with some common words.

In [45]:
words100k = []
with open('../data/wiki-100k.txt', 'r', encoding='utf-8') as f:
    curr = f.read().splitlines()

    for line in curr: # reading all lines and ignoring comments
        if line and line[0] != '#':
            words100k.extend(line.split())
    
    words100k = np.array(words100k)

words100k = [word for word in words100k if word.islower() and word.isalpha()]

words100k = words100k[:10000]

words100k[:10]

['the', 'of', 'and', 'to', 'a', 'in', 'that', 'was', 'he', 'his']

In [46]:
print(evaluate_model(5, probs, words100k))

(0.4912456228114057, 0.7535498518490015, 0.8591603494054719)


In [47]:
print(evaluate_model(5, probs_filtered, words100k))

(0.4369915289776063, 0.6777446951270654, 0.7767969470770779)


In [48]:
print(evaluate_model(4, probs_kaggle, words100k))

(0.4734987440591286, 0.7749787664672823, 0.8906698953683791)


In [49]:
print(evaluate_model(4, probs_kaggle_filtered, words100k))

(0.4565272450973298, 0.7584846949851654, 0.8762573268688039)


## **Conclusion**

After extensive testing with many different datasets, the best results came from models trained on data that reflects how often words are used in the English language, so that n-grams of frequent words appear more often.

However, we are starting to hit a plateau. Given the semi-deterministic nature of languages, only so many prefixes can be formed, and even fewer are actively used on a daily basis. This is why we should pay attention to the accuracy of the "overfitted" model: most of the time the user will be typing common words, so we cannot take that information out of the model just to test it. It is also important, however, to test how the model reacts to uncommon words and prefixes, which is why the "filtered" probabilities were evaluated as well.

Furthermore, there will be times when the user attempts to type an uncommon word. We, as humans, are creating new words and bringing back old ones all the time. These models will always carry a level of error that cannot be overcome without richer context, such as knowing all the previously typed words. Even majority voting models are all based on the same words with almost the same amount of context, which is why the improvements we saw were very subtle. Therefore, our focus for testing will be balancing the model with the interface so that, even when the predictions are all wrong, the user is not heavily penalized in typing speed, and when the predictions are correct, the gains are significant.

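Finally, comparing the datasets, the Kaggle 4-grams had a slightly lower top-1 accuracy than the Movie Lines 5-grams on common words (about 0.50 vs. 0.55 on the frequent-word list and 0.47 vs. 0.49 on the Wikipedia list), but adapted much better to uncommon words and prefixes (about 0.39 vs. 0.32 top-1 and 0.78 vs. 0.65 top-5 on the WordNet list).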
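
To make the comparison easier to read, the accuracies printed in the testing cells can be collected into one structure. This is only a convenience sketch: the values are copied from the outputs above and rounded to three decimals, and the dictionary layout and labels are our own choice.

```python
# (top-1, top-3, top-5) accuracies copied from the testing cells, rounded to 3 decimals
results = {
    "frequent words": {
        "movie 5-gram": (0.546, 0.804, 0.905),
        "movie 5-gram filtered": (0.383, 0.625, 0.728),
        "kaggle 4-gram": (0.501, 0.792, 0.902),
        "kaggle 4-gram filtered": (0.462, 0.752, 0.867),
    },
    "wordnet": {
        "movie 5-gram": (0.324, 0.545, 0.651),
        "movie 5-gram filtered": (0.342, 0.548, 0.649),
        "kaggle 4-gram": (0.394, 0.667, 0.784),
        "kaggle 4-gram filtered": (0.394, 0.667, 0.784),
    },
    "wiki-100k": {
        "movie 5-gram": (0.491, 0.754, 0.859),
        "movie 5-gram filtered": (0.437, 0.678, 0.777),
        "kaggle 4-gram": (0.473, 0.775, 0.891),
        "kaggle 4-gram filtered": (0.457, 0.758, 0.876),
    },
}

# Model with the highest top-1 accuracy on each test set
best_top1 = {
    test_set: max(models, key=lambda name: models[name][0])
    for test_set, models in results.items()
}
for test_set, model in best_top1.items():
    print(f"{test_set}: {model}")
```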

Now, it's time to bring them to the real world. Let's put both to user testing on gaze typing and see which one works the best!