# N-grams

N-grams are contiguous sequences of n items (words, symbols, or tokens) in a document. They have a wide range of applications in text analysis within natural language processing (NLP). When n=1, the n-gram is called a unigram; for word-level tokens, this splits a sentence into individual words. When n=2, it is called a bigram, and when n=3, a trigram. For n>3, the sequence is usually just named by its length (4-gram, 5-gram, and so on), and the items can be consecutive words or characters. N-grams support various NLP tasks, such as predicting the probability of the next word in a sentence, classifying text based on its features, translating from a source language to a target language, and checking spelling.

## Implementation

First, we create an example of how to generate n-grams. To obtain sequences suitable for n-gram generation, we need to split the sentence into tokens.

In [13]:
import re

text = "This is a notebook for n-grams demonstration, we will show how to generate n-grams."
# Tokenize first; with n=1 the n-grams (unigrams) are simply the tokens
def get_tokens(text):
    # Lowercase the text and drop everything except letters, whitespace, and hyphens
    cleaned = re.sub(r'[^a-zA-Z\s-]', '', text.lower())
    # split() with no arguments splits on any whitespace and never yields empty strings
    return cleaned.split()

tokens = get_tokens(text)
print("Words after splitting:")
print(tokens)

def generate_ngrams(tokens, n):
    n_grams = []
    length = len(tokens) - n + 1
    for i in range(length):
        item = ' '.join(tokens[i:i+n])
        n_grams.append(item)
    return n_grams

unigrams = generate_ngrams(tokens, 1)
print("Generated n-grams:")
print(unigrams)

Words after splitting:
['this', 'is', 'a', 'notebook', 'for', 'n-grams', 'demonstration', 'we', 'will', 'show', 'how', 'to', 'generate', 'n-grams']
Generated n-grams:
['this', 'is', 'a', 'notebook', 'for', 'n-grams', 'demonstration', 'we', 'will', 'show', 'how', 'to', 'generate', 'n-grams']


Try different values of n when generating n-grams.

In [14]:
# bigram
bigrams = generate_ngrams(tokens, 2)
print("Generated n-grams:")
print(bigrams)

# n higher than 3
n_grams = generate_ngrams(tokens, 5)
print("Generated n-grams:")
print(n_grams)

Generated n-grams:
['this is', 'is a', 'a notebook', 'notebook for', 'for n-grams', 'n-grams demonstration', 'demonstration we', 'we will', 'will show', 'show how', 'how to', 'to generate', 'generate n-grams']
Generated n-grams:
['this is a notebook for', 'is a notebook for n-grams', 'a notebook for n-grams demonstration', 'notebook for n-grams demonstration we', 'for n-grams demonstration we will', 'n-grams demonstration we will show', 'demonstration we will show how', 'we will show how to', 'will show how to generate', 'show how to generate n-grams']
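
The introduction notes that the items of an n-gram can also be characters rather than words. A minimal sketch of that variant follows; `char_ngrams` is a hypothetical helper introduced here for illustration and is not part of the notebook above.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` (hypothetical helper)."""
    # Slide a window of width n over the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Character trigrams of a single token
print(char_ngrams("n-grams", 3))
```

If `n` is larger than the string length, the range is empty and the function returns an empty list, which mirrors how `generate_ngrams` behaves for short token lists.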


For game levels, the n-gram representation is a little different: a level is treated as a linear sequence of vertical slices (columns), and each slice plays the role of a single item in the n-gram. Reference paper: [Linear levels through n-grams](https://dl.acm.org/doi/10.1145/2676467.2676506).

In [21]:
read_level = []
with open('../examples/Mario.txt', 'r') as f:
    for line in f:
        read_level.append(line.strip())

smb_level = []
for line in read_level:
    smb_level.append(list(line))
    
# One vertical slice of an SMB level: the first character of every row, i.e. the first column
gram = [row[0] for row in smb_level]
print(gram)

['-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', 'X', 'X']
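
Once every column is available as a slice, n-grams can be built over slices in the same way as over words. The sketch below assumes a small hand-written level fragment instead of `Mario.txt`, and assumes `-` means an empty tile and `X` a solid tile; `vertical_slices` and `slice_ngrams` are hypothetical helpers, not functions from the reference paper.

```python
# Hypothetical 4-row level fragment; each string is one row of tiles.
# Assumption: '-' is an empty tile and 'X' is a solid tile.
rows = [
    "-----",
    "--X--",
    "-XX--",
    "XXXXX",
]

def vertical_slices(rows):
    """Return the columns of the level, each joined into one string."""
    # zip(*rows) transposes the row-major grid into columns.
    return [''.join(column) for column in zip(*rows)]

def slice_ngrams(slices, n):
    """Return n-grams whose items are whole vertical slices."""
    return [tuple(slices[i:i + n]) for i in range(len(slices) - n + 1)]

slices = vertical_slices(rows)
print(slices)
print(slice_ngrams(slices, 2))
```

Joining each column into a string makes slices hashable, so they can be counted or used as dictionary keys just like word tokens.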


## Best Practices
- Preprocess your data before generating n-grams, for example by normalizing letter case and removing punctuation.
- Choose the value of n that suits your task. A larger n captures more context but leads to sparser data and higher computational cost.
- As n increases, the number of possible n-grams grows exponentially, so many sequences occur rarely or never in the training data. To deal with this sparsity, smoothing techniques such as Laplace smoothing can be applied.
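
To illustrate the last point, here is a minimal sketch of add-one (Laplace) smoothing for bigram probabilities, using the estimate (count(prev, word) + 1) / (count(prev) + V), where V is the vocabulary size. The function name `laplace_bigram_prob` and the toy sentence are assumptions made for this example.

```python
from collections import Counter

def laplace_bigram_prob(tokens, prev_word, word):
    """Add-one (Laplace) estimate of P(word | prev_word) from a token list."""
    vocab_size = len(set(tokens))
    # Count every adjacent pair of tokens.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count how often each token appears as the first item of a bigram.
    context_counts = Counter(tokens[:-1])
    return (bigram_counts[(prev_word, word)] + 1) / (context_counts[prev_word] + vocab_size)

tokens = "we will show how to generate n-grams and how to count n-grams".split()
print(laplace_bigram_prob(tokens, "how", "to"))     # bigram seen in the data
print(laplace_bigram_prob(tokens, "how", "count"))  # unseen bigram, still non-zero
```

Without smoothing, the unseen bigram would get probability zero; adding one to every count keeps it small but positive. The sketch recomputes the counts on every call for brevity, so a real model would build them once.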