Everything I learned about N-Grams here is from **Speech and Language Processing: An Introduction to Natural Language Processing Computational Linguistics, and Speech Recognition with Language Models** *Third Edition draft* from *Daniel Jurafsky*, and *James H. Martin*

# N-Grams Models
So basically, when you try to predict what the next word is, you'd want to look back into the past of what you said earlier, and that could give you an estimate of what you should say next.  
But how far back should you look? 1 word before? 2? IT'S N WORDS.  

But it would be really hard to look back at all words, so we hold a ***Markov Assumption***. (some things won't be entirely clear why we hold them/use them/assume them which is a shortcoming from my end to not know yet, but we're learning.)

The *Markov Assumption*:
* So a Markov Property informs you that your future steps or decisions are memoryless, or basically independant from the past.
    So here we will assume that we don't entirely depend on the past, or in other terms we won't look too far into the past.



So given the sentence $s$: $<s>$ I love fries with $</s>$  
* To find the probability of word $w$ given sentence $s$, assuming we use the full history:  
$P(w|s) = P(w|I\space love\space fries\space with) = \frac{P(w \cap s)}{P(s)} = \frac{C(w \cap s)}{C(s)} \mid C(x) \text{ is the count of x in our corpus.}$  
* *You can figure out why the probability here would be equal to the count directly, I know such small details are mostly ignored assuming you would figure it out or that it's trivial but I wanted to give you a flasher.*  

But writing and text are normally of variety and creative, which is why we dont't want to look at the full past each time.

So the probability of the whole sentence $W$ = $\{w_1, w_2, w_3...w_n\}$ is:  
$$P(w_1 w_2 w_3.....w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2)...P(w_n | w_1 w_2 ... w_{n-1})  \newline
= P(w_{1:n}) = \prod_{k = 1}^n{P(w_k | w_{1:k-1})}$$  

But since we won't look at the full history, we could use the near history like the past word (*bi-gram*), the past two words (*tri-gram*), or maybe just the current word (*unigram*).  

* In bigrams we predict our current word with the word before it:  
$P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-1})$  

So we assume our current word depends only on the previous word.

So the original equation we discussed:
$$P(w_1 w_2 w_3.....w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2)...P(w_n | w_1 w_2 ... w_{n-1})  \newline
= P(w_{1:n}) = \prod_{k = 1}^n{P(w_k | w_{1:k-1})}$$  

Should translate to bi-gram as follows:
$$P(w_1 w_2 w_3.....w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2)...P(w_n | w_{n-1})  \newline
= P(w_{1:n}) = \prod_{k = 1}^n{P(w_k | w_{k-1})}$$  
Where for each word we only calculate it's probability given the previous word only.

So if we try to estimate this probability using MLE (*Maximum Likelihood Estimation*).  
So the probability of a word $w_n$ given it's previous word $w_{n-1}$ is the count of {$w_n w_{n-1}$} appearing in our corpus divided by the count of $w_{n-1}$ appearing in the corpus:
* $P(w_n | w_{n-1}) = \frac{C(w_{n-1}w_{n})}{w_{n-1}}$

(in the reference they had an intermediate step of dividing by summation of the counts of $w_{n-1}$ with any other word $w$ which should be equal to just the count of the word $w_{n-1}$. But since for me this can follow intuitively from Bayes theorem for conditional probaility I just thought of going there directly)

![He is Bayesian??](../images/Jojo.jpeg)  
He is Bayesian??

So we can generalize what we have so far to $N$ grams now, so that we $N$ tokens into the past to predict the next word.  
- $P(w_n | w_{n-N+1 : n-1}) = \frac{C(w_{n-N+1 : n-1} w_n)}{C(w_{n-N+1 : n-1})}$  

So the whole sentence $W$ probability:  
- $P(W) = \prod_{k=1}^n{P(w_k | w_{k-N+1 : k-1})}$  

So $N = 1$ would be unigram, $N = 2$ is bigram and so on.

Since the goal of this small journey is to learn about embeddings I won't push further with n-grams, so lets try to make a simple bigram model.

I wanted to use any random text dataset, so I found this Twitter hate speech dataset

In [142]:
import pandas as pd

splits = {"train": "training set.csv", "test": "testing set.csv"}
df = pd.read_csv("hf://datasets/thefrankhsu/hate_speech_twitter/" + splits["train"])

In [143]:
#   lets just drop label and category columns since we won't need them
df.drop(labels=["label", "categories"], axis=1, inplace=True)
df.head()

Unnamed: 0,tweet
0,krazy i dont always get drunk and pass out but...
1,white kids favorite activities calling people ...
2,maam did you clear that tweet with the caref...
3,wth is that playing missy i mean seriously rt...
4,he promised to stand with the muzzies so


So now we could go multiple ways, I am not aware of all the implementations of N-Grams.  
But one basic way of thinking is saying I will make a dictionary where each word is associated with an index for a lookup table, or maybe a hash function.  
Then, make the lookup table were each row-column pair refer to a specific bigram probability, so row 0 col 1 refers to *the probability of row 0's word coming after col 0's word*  
We could then optimize the the space and time usage by just keeping the max probability of each column which will be always chosen anyway.
I am sure there are many optimizations but we just want to apply the theorem we have.  

One approach I saw is making a dictionary of each word available, were the values of each word key will be a list of all other words that appeared directly after it in any sentence. Maybe we could sort each list's values and get the word with highest frequency. **I will use this approach**

In [144]:
df.shape

(5679, 1)

In [145]:
df["tweet"][50]

'after deztinis session we needed this lmao jk girl good yob lol'

In [146]:
#   lets decompose each sentence into a list of words
sentences = df.to_numpy()
sentences = sentences.flatten()
sentences.shape

(5679,)

In [147]:
sentences = [str(sentence).split(" ") for sentence in sentences]

In [148]:
# I think since this dataset is of twitter, the vocabulary we has will be greatly inconsistent as spelling, etc.
# It could be benefecial to keep the capitalization of the words which would indicate the place it's used in text better
# but since these are tweets then I wouldn't count much on it and will just make all lowercase.

bigram = {}
for sentence in sentences:
    for index, word in enumerate(sentence[:-1]):
        if (
            word.isalnum()
        ):  # I don't think things other than alphanumerals would be needed
            word = word.lower()
            next_word = sentence[index + 1].lower()

            if word in bigram.keys():
                bigram[word].append(next_word)
            else:
                bigram[word] = []
                bigram[word].append(next_word)
bigram["americans"]

['are',
 'are',
 'go',
 'guess',
 'go',
 'who',
 'weak',
 'are',
 'dream',
 'are',
 'claim',
 'flipping',
 'should',
 'would',
 'should',
 'should',
 'changethename']

In [149]:
#   Now let's get the frequency of each word given it's previous word (the key)
#   So what happens here is we reconstruct the the bigram dictionary
#   where the list of next words for each word will be reduced into a set of each next word and it's frequency
#   (but like (frequency, next_word) tuples so the set sorts itself on the frequency) ;)

bigram = {
    word: set((next_words.count(next_word), next_word) for next_word in next_words)
    for word, next_words in bigram.items()
}
bigram

{'krazy': {(1, 'i')},
 'i': {(1, 'a'),
  (1, 'accept'),
  (1, 'accidentally'),
  (1, 'adore'),
  (1, 'and'),
  (1, 'are'),
  (1, 'attempt'),
  (1, 'auditioned'),
  (1, 'automatically'),
  (1, 'believed'),
  (1, 'betray'),
  (1, 'brought'),
  (1, 'callin'),
  (1, 'carried'),
  (1, 'choose'),
  (1, 'chose'),
  (1, 'color'),
  (1, 'comes'),
  (1, 'confess'),
  (1, 'coulda'),
  (1, 'cried'),
  (1, 'd'),
  (1, 'da'),
  (1, 'date'),
  (1, 'dated'),
  (1, 'day'),
  (1, 'decided'),
  (1, 'def'),
  (1, 'definitely'),
  (1, 'deleted'),
  (1, 'deserve'),
  (1, 'died'),
  (1, 'dini'),
  (1, 'discuss'),
  (1, 'disrespect'),
  (1, 'dnt'),
  (1, 'donts'),
  (1, 'don´t'),
  (1, 'double'),
  (1, 'drove'),
  (1, 'expect'),
  (1, 'fite'),
  (1, 'follow'),
  (1, 'for'),
  (1, 'from'),
  (1, 'front'),
  (1, 'fuc'),
  (1, 'fuckin'),
  (1, 'fully'),
  (1, 'gave'),
  (1, 'give'),
  (1, 'given'),
  (1, 'going'),
  (1, 'gonna'),
  (1, 'gota'),
  (1, 'gots'),
  (1, 'h'),
  (1, 'has'),
  (1, 'held'),
  (1, 'help'

In [150]:
# I think we can just take the next word with highest frequency for each word and we will be good to go
bigram = {word: list(next_words)[-1][1] for word, next_words in bigram.items()}   # get the next_word in last (frequency, next_word) tuple in the set

So there you have it, the bigram dictionary. We can make a simple sentence generator now.

In [151]:
import random
def generate_sentence(first_word: str | None = None, length: int = 10):
    sentence = []
    if first_word:
        sentence.append(first_word)
        next_word = bigram[first_word]
        for _ in range(length):
            sentence.append(next_word)
            next_word = bigram[next_word] if next_word in bigram.keys() else "."
            if next_word == ".":
                break
    else:
        vocab = list(bigram.keys())
        vocab_length = len(vocab)
        
        first_word = bigram[vocab[random.randint(0, vocab_length-1)]]
        
        sentence.append(first_word)
        next_word = bigram[first_word]
        for _ in range(length):
            sentence.append(next_word)
            next_word = bigram[next_word] if next_word in bigram.keys() else "."
            if next_word == ".":
                break
    return " ".join(sentence)

In [152]:
generate_sentence("hello")

'hello dare place mostly just watching barney'

In [153]:
generate_sentence()

'and diplomatic on everything about a guinea fowls are you felt'

In [154]:
generate_sentence()

'men weet van dat coon up skippy and diplomatic on everything'

In [155]:
generate_sentence()

'english person needs just watching barney'

In [156]:
generate_sentence()

'is trending twitters home coon up skippy and diplomatic on everything'

In [157]:
generate_sentence()

'chris is trending twitters home coon up skippy and diplomatic on'

This is just a run down of what I learned, I will probably learn more and also probably, as simple as this notebook can get, find gaps or issues with the explanation or procedures in this notebook, so it's probably okay if you find some too, point it out and help us both learn.