# **Majority Voting Algorithm**

To test how far we can take the *n-gram* model for letter prediction, we are developing a majority voting algorithm. It consults `n` probability tables to predict a letter: 1-grams, 2-grams, 3-grams, and so on up to n-grams. The probabilities from all tables are summed per letter, and the letter with the highest total is the prediction!
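
To make the idea concrete before touching real data, here is a minimal sketch of the summing step. The two toy tables and the `combine` helper are hypothetical and exist only for illustration; the real tables are generated later in this notebook.

```python
from collections import defaultdict

# Hypothetical toy tables: prefix -> {next letter: probability}.
table_1 = {"n": {"t": 0.5, "d": 0.3, "g": 0.2}}
table_2 = {"an": {"t": 0.4, "d": 0.4, "y": 0.2}}


def combine(prefix, tables):
    """Sum next-letter probabilities over tables keyed by prefix length."""
    totals = defaultdict(float)
    for n, table in tables.items():
        context = prefix[-n:]  # last n characters typed so far
        # A context missing from a table simply contributes nothing.
        for letter, p in table.get(context, {}).items():
            totals[letter] += p
    return dict(totals)


scores = combine("an", {1: table_1, 2: table_2})
best = max(scores, key=scores.get)
print(best, scores)
```

Both toy tables agree on `t`, so it ends up with the largest total. Note that the totals are sums of probabilities, not probabilities themselves, so they can exceed 1.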

In [1]:
import pandas as pd
import numpy as np

In [4]:
with open('../data/unigram_freq_processed.txt') as f:
    unigram = f.read().splitlines()

In [5]:
df_words = pd.read_excel('../data/wordFrequency.xlsx', sheet_name='4 forms (219k)')

df_words.head()

Unnamed: 0,rank,word,freq,#texts,%caps,blog,web,TVM,spok,fic,...,news,acad,blogPM,webPM,TVMPM,spokPM,ficPM,magPM,newsPM,acadPM
0,1,the,50074257,483041,0.11,6272412,7101104,3784652,5769026,6311500,...,6582642,7447070,50480.69,55212.83,29550.39,45736.71,53341.69,53975.61,54070.43,62167.47
1,2,to,25557793,478977,0.02,3579158,3590504,2911924,3427348,2871517,...,3013501,2978222,28805.25,27917.05,22736.17,27171.94,24268.65,25264.13,24753.18,24861.93
2,3,and,24821791,478727,0.09,3211226,3458960,1828166,3325442,3064047,...,2995111,3633119,25844.11,26894.26,14274.24,26364.03,25895.82,26215.95,24602.12,30328.95
3,4,of,23605964,478144,0.01,2952017,3462140,1486604,2678416,2330823,...,2893200,4517563,23757.98,26918.99,11607.33,21234.42,19698.97,26054.07,23765.01,37712.21
4,5,a,21889251,477421,0.05,2783458,2827106,2519099,2716641,2749208,...,2959649,2229222,22401.41,21981.44,19669.01,21537.47,23234.95,24619.48,24310.83,18609.35


In [6]:
words = df_words['word'].values

In [7]:
len(unigram)

1182554

In [8]:
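# A set gives constant-time membership tests; `words` itself is a NumPy array
known_words = set(words)
unigram = [word for word in unigram if word not in known_words]

# splitlines() already dropped real newlines; this only strips literal "\n" text
unigram = list(map(lambda x: x.replace(r'\n', ''), unigram))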

unigram[:10], len(unigram)

(['info',
  'info',
  'info',
  'info',
  'info',
  'info',
  'info',
  'info',
  'info',
  'info'],
 1182554)

Let's create different N-grams to constitute the majority voting algorithm!

* 1-grams  
* 2-grams  
* 3-grams  
* 4-grams  
* 5-grams 
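
The generation script itself is not shown in this notebook, so the following is only a sketch of what one such table could look like. It assumes the `%` padding and the `prefix -> {letter: probability}` shape that the evaluation cells further down rely on; `build_table` is a hypothetical helper, not the script's actual code.

```python
from collections import Counter, defaultdict


def build_table(words, n, pad="%"):
    """Map each n-character prefix to a {next letter: probability} dict."""
    counts = defaultdict(Counter)
    for word in words:
        padded = pad * (n - 1) + word
        for idx in range(n, len(padded)):
            # Count which letter follows each n-character window.
            counts[padded[idx - n:idx]][padded[idx]] += 1
    # Normalise the counts of every prefix into probabilities.
    return {
        prefix: {letter: c / sum(nxt.values()) for letter, c in nxt.items()}
        for prefix, nxt in counts.items()
    }


table = build_table(["the", "then", "this"], n=2)
print(table["%t"])  # every toy word continues "t" with "h"
print(table["th"])
```

With these three toy words, `th` is followed by `e` twice and by `i` once, so the table stores 2/3 and 1/3 for that prefix.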

In [9]:
!python ./parallel_generate_probs.py ../data/unigram_freq_processed.txt 4 -o majority/maj_4

Using 16 cores and chunk size of 73909
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Processing chunk with 10 words
Chunk processed with 10 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909

In [10]:
!python ./parallel_generate_probs.py ../data/unigram_freq_processed.txt 3 -o majority/maj_3

Using 16 cores and chunk size of 73909
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Processing chunk with 10 words
Chunk processed with 10 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909

In [11]:
!python ./parallel_generate_probs.py ../data/unigram_freq_processed.txt 2 -o majority/maj_2

Using 16 cores and chunk size of 73909
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Processing chunk with 10 words
Chunk processed with 10 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909

In [12]:
!python ./parallel_generate_probs.py ../data/unigram_freq_processed.txt 1 -o majority/maj_1

Using 16 cores and chunk size of 73909
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 10 words
Chunk processed with 10 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909

In [13]:
!python ./parallel_generate_probs.py ../data/unigram_freq_processed.txt 5 -o majority/maj_5

Using 16 cores and chunk size of 73909
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 73909 words
Chunk processed with 73909 words
Processing chunk with 10 words
Chunk processed with 10 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909 words
Chunk processed with 73909

Now, we are going to open all the generated data:

In [14]:
import json

with open('../data/out/majority/maj_1_probs.json', 'r') as f:
    maj_1_probs = json.load(f)

with open('../data/out/majority/maj_2_probs.json', 'r') as f:
    maj_2_probs = json.load(f)

with open('../data/out/majority/maj_3_probs.json', 'r') as f:
    maj_3_probs = json.load(f)

with open('../data/out/majority/maj_4_probs.json', 'r') as f:
    maj_4_probs = json.load(f)

with open('../data/out/majority/maj_5_probs.json', 'r') as f:
    maj_5_probs = json.load(f)

## **Simple Evaluation**

Before calculating the majority voting model, let's see how each of the probability tables performs on its own:

In [15]:
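def evaluate_model(n, probs, test):
    """Return (exact, top-3, top-5) accuracy of a single n-gram table."""
    acc_exact = 0
    acc_top3 = 0
    acc_top5 = 0
    total = 0

    for word in test:
        # Pad with '%' so that the first letters also have a prefix
        word = r'%'*(n-1) + str(word)
        # The start index is fixed at 4 regardless of n: for n < 4 the first
        # letters of each word are never scored, and for n = 5 the first
        # slice is empty
        for idx in range(4, len(word)):
            prev = word[idx - n:idx]
            nxt = word[idx]

            # Only positions whose prefix exists in the table are counted
            if prev in probs:
                # Rank the candidate letters once, most probable first
                ranked = sorted(probs[prev], key=probs[prev].get, reverse=True)
                if nxt == max(probs[prev], key=probs[prev].get):
                    acc_exact += 1
                if nxt in ranked[:3]:
                    acc_top3 += 1
                if nxt in ranked[:5]:
                    acc_top5 += 1
                total += 1

    return acc_exact / total, acc_top3 / total, acc_top5 / total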
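
To see which (prefix, next letter) pairs a sliding window of `n` characters produces for a padded word, here is a small sketch. The `windows` helper is hypothetical, and it starts at index `n` (an assumption made here so that every window is full width), whereas the cell above starts at 4.

```python
def windows(word, n, pad="%"):
    """List the (prefix, next letter) pairs of a full-width n-character window."""
    padded = pad * (n - 1) + word
    return [(padded[idx - n:idx], padded[idx]) for idx in range(n, len(padded))]


print(windows("anti", 3))
# [('%%a', 'n'), ('%an', 't'), ('ant', 'i')]
```

With `n - 1` padding characters, the first scored letter is the second letter of the word, predicted from the padding plus the first letter.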

In [16]:
# Test on 1-grams

print(evaluate_model(1, maj_1_probs, words))

(0.2432577382776586, 0.5249770150168557, 0.718817039534171)


In [17]:
print(evaluate_model(2, maj_2_probs, words))

(0.33182818761641364, 0.6169216007224699, 0.7834283456567139)


In [18]:
print(evaluate_model(3, maj_3_probs, words))

(0.40943212929435663, 0.7184491742471076, 0.8531749536341959)


In [19]:
print(evaluate_model(4, maj_4_probs, words))

(0.46188422082865116, 0.7519483814840323, 0.8671838184652191)


In [20]:
print(evaluate_model(5, maj_5_probs, words))

(0.4900968309859155, 0.7568588615023474, 0.8597417840375586)


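In isolation, the `5-gram` achieved the best exact-match (0.490) and top-3 (0.757) accuracy, while the `4-gram` achieved the best top-5 accuracy (0.867). Note that each score only counts the positions whose prefix exists in the table, so the denominators differ between models.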

## **Majority Voting**

With all the data in hand, the majority voting model works in a simple way:

For each position in the string where we want to predict the next character, we extract the prefixes of every length, from 1 to 5 characters. We look each prefix up in its probability table and accumulate, for each letter, the sum of its probabilities across all tables; a prefix that is missing from its table contributes nothing. Finally, we check whether the real letter is the one with the highest total, or whether it is in the top 3 or top 5:

In [21]:
# Let's create a function to evaluate the model using majority voting for each position

def evaluate_model_majority(n_list, probs, test):
    acc_exact = 0
    acc_top3 = 0
    acc_top5 = 0
    total = 0

    for word in test:
        word = r'%'*(max(n_list)-1) + str(word)
        for idx in range(max(n_list), len(word)):
            total_probs = {letter : 0 for letter in 'abcdefghijklmnopqrstuvwxyz'}
            for jdx in range(len(n_list)):
                n = n_list[jdx]
                prev = word[idx - n:idx]
                if prev in probs[jdx]:
                    for letter in probs[jdx][prev]:
                        if letter in total_probs:
                            total_probs[letter] += probs[jdx][prev][letter]
        
            nxt = word[idx]

            if nxt == max(total_probs, key=total_probs.get):
                acc_exact += 1
            if nxt in list(sorted(total_probs, key=total_probs.get, reverse=True))[:3]:
                acc_top3 += 1
            if nxt in list(sorted(total_probs, key=total_probs.get, reverse=True))[:5]:
                acc_top5 += 1
            total += 1
    
    return acc_exact / total, acc_top3 / total, acc_top5 / total

### Testing

Let's test many combinations, removing the smallest n-gram in each iteration:

In [22]:
print(
    evaluate_model_majority(
        [1, 2, 3, 4, 5],
        [maj_1_probs, maj_2_probs, maj_3_probs, maj_4_probs, maj_5_probs],
        words
    )
)

(0.47861705583023567, 0.7508029881987802, 0.8663971994658776)


In [23]:
print(
    evaluate_model_majority(
        [2, 3, 4, 5],
        [maj_2_probs, maj_3_probs, maj_4_probs, maj_5_probs],
        words
    )
)

(0.48486051463423435, 0.7565051066440507, 0.8740481432025695)


In [24]:
print(
    evaluate_model_majority(
        [3, 4, 5],
        [maj_3_probs, maj_4_probs, maj_5_probs],
        words
    )
)

(0.48818073550110075, 0.7631816377350319, 0.8780179724999098)


In [25]:
print(
    evaluate_model_majority(
        [4, 5],
        [maj_4_probs, maj_5_probs],
        words
    )
)

(0.48901079071781733, 0.7614132592298531, 0.869861777761738)


In [26]:
print(
    evaluate_model_majority(
        [5],
        [maj_5_probs],
        words
    )
)

(0.4833808509870439, 0.7461835504709661, 0.8525027969251867)


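**Best combination: 3-, 4- and 5-grams together!** It has the highest top-3 (0.763) and top-5 (0.878) accuracy; the 4- and 5-gram pair is marginally ahead on exact match (0.4890 vs. 0.4882).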

However, the improvement over the individual models was not that significant. This suggests that we may be hitting a plateau with the simplest model.

The set of words in a language is finite. Should we then worry about overfitting and split the dataset? Once the corpus covers a large population of prefix combinations, held-out samples might not be necessary. If we remove the most frequent words from the training data, we are not avoiding overfitting; we are just removing important data.

Therefore, as a last test, we are going to build a final dataset from a large conversational vocabulary and smooth it with a sort of "Laplace smoothing", by also scanning the most frequent words of the language once. Then we will train with a split and without one, to see the effects of "overfitting".
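
For reference, this is what textbook add-k (Laplace) smoothing looks like on a single next-letter count table. It is only a sketch of the standard technique, not the variant described above; the `smoothed_probs` helper and the toy counts are hypothetical.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def smoothed_probs(counts, k=1.0, alphabet=ALPHABET):
    """Add-k smoothing: every letter gets k pseudo-counts before normalising."""
    total = sum(counts.get(ch, 0) for ch in alphabet) + k * len(alphabet)
    return {ch: (counts.get(ch, 0) + k) / total for ch in alphabet}


counts = Counter({"e": 8, "i": 2})  # hypothetical counts after the prefix "th"
probs = smoothed_probs(counts)
print(round(probs["e"], 3), round(probs["z"], 3))
```

Letters that were never observed after the prefix, such as `z` here, end up with a small non-zero probability instead of zero, and the distribution still sums to 1.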

Click [here](letter-n-gram-final-test.ipynb) to start reading!

In [51]:
def get_top3_weights(txt, probs_list, n_max=5):
    # probs_list holds the tables for consecutive prefix lengths ending at
    # n_max, e.g. [3-gram, 4-gram, 5-gram] for n_max=5
    total_probs = {letter: 0 for letter in 'abcdefghijklmnopqrstuvwxyz'}
    for jdx in range(len(probs_list)):
        # take the last n characters of the text as the prefix for table jdx
        prev = txt[-(n_max - len(probs_list) + jdx + 1):]
        curr = probs_list[jdx]
        print(prev)  # debug: show the prefix looked up in each table
        # a prefix that is missing from its table contributes nothing
        if prev in curr:
            for letter in curr[prev]:
                if letter in total_probs:
                    total_probs[letter] += curr[prev][letter]
    # get the 3 letters with the highest summed scores, and those scores
    top = list(sorted(total_probs, key=total_probs.get, reverse=True))[:3]
    freq = list(sorted(total_probs.values(), reverse=True))[:3]

    return list(zip(top, freq))

def predicterV3(probs, n_max=5):
    # start from a fully padded buffer
    txt = r'%'*n_max
    # read one letter at a time; an empty input ends the session
    while (curr := input("Next Letter: ")) != '':
        if curr == ' ':
            # a space starts a new word, so reset the buffer
            txt = r'%'*n_max
            continue

        txt += curr
        # pass n_max along so both functions agree on the prefix lengths
        top3 = get_top3_weights(txt, probs, n_max)
        print(f'\n{txt} \nBest 3 letters are: {top3}')

In [52]:
predicterV3([maj_3_probs, maj_4_probs, maj_5_probs])

%%a
%%%a
%%%%a

%%%%%a 
Best 3 letters are: [('n', 0.3662795219454022), ('l', 0.3422860550125736), ('r', 0.29033261607860417)]
%an
%%an
%%%an

%%%%%an 
Best 3 letters are: [('t', 0.7494426159889584), ('a', 0.44399617793821), ('n', 0.3436670559507379)]
ant
%ant
%%ant

%%%%%ant 
Best 3 letters are: [('i', 1.3939060611747545), ('e', 0.36535602954739665), ('h', 0.2929822035375389)]
nti
anti
%anti

%%%%%anti 
Best 3 letters are: [('c', 0.4621550621617675), ('n', 0.414245894114128), ('a', 0.26224770295748684)]
tic
ntic
antic

%%%%%antic 
Best 3 letters are: [('a', 0.8327299744117148), ('i', 0.6509935855263003), ('o', 0.496559677392003)]
ica
tica
ntica

%%%%%antica 
Best 3 letters are: [('l', 1.7955009350036377), ('t', 0.8340451479694317), ('n', 0.15862271362173158)]
