# Ngram Tutorial 
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words, or base pairs. N-grams are typically collected from a text or speech corpus.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on. (Source: [Wikipedia](https://en.wikipedia.org/wiki/N-gram))
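
As a quick illustration of the terminology above, here is a minimal sketch that slides a window of size n over a sequence. The helper name `ngrams` and the sample sentence are made up for this example; the same function works on words or on the characters of a string.

```python
def ngrams(items, n):
    """Return all contiguous length-n windows over a sequence, as tuples."""
    # There are len(items) - n + 1 windows; none if the sequence is shorter than n.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the quick brown fox".split()

unigrams = ngrams(words, 1)   # one tuple per word
bigrams = ngrams(words, 2)    # pairs of adjacent words
trigrams = ngrams(words, 3)   # triples of adjacent words

# The items do not have to be words: a string is a sequence of letters.
char_bigrams = ngrams("fox", 2)

print(bigrams)
print(char_bigrams)
```

A sentence of four words yields four unigrams, three bigrams, and two trigrams, since each increase in n removes one possible window position.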

---
### Ngram extraction for feature space construction
Below is a very short introduction to n-grams that shows how to extract them, both with examples and in code. Source: this [blog post](https://cmry.github.io/notes/ngrams) by Chris Emmery.

In [1]:
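def find_ngrams(sentence, n_list):
    """Return the n-grams of a whitespace-tokenized sentence for each n in n_list."""
    inp, grams = sentence.split(), []
    for n in n_list:
        # zip n shifted copies of the token list; each resulting tuple is one n-gram
        grams += [' '.join(x) for x in zip(*[inp[i:] for i in range(n)])]
    return grams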
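
The `zip` expression is the dense part of this function, so the sketch below unpacks it step by step for n = 2. The variable names (`tokens`, `shifted`, `pairs`, `joined`) are chosen for this illustration only.

```python
tokens = "text about stuff".split()
n = 2

# Step 1: build n copies of the token list, each shifted one position further.
shifted = [tokens[i:] for i in range(n)]
# shifted[0] starts at the first token, shifted[1] at the second.

# Step 2: zip the copies together. zip stops at the shortest list,
# so every tuple holds n adjacent tokens and no window runs past the end.
pairs = list(zip(*shifted))

# Step 3: join each tuple into a single space-separated string.
joined = [' '.join(p) for p in pairs]

print(shifted)
print(pairs)
print(joined)
```

Because `zip` truncates to the shortest input, no explicit bounds arithmetic is needed, and a sentence shorter than n simply yields no n-grams.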

In [2]:
class Ngrams:

    def __init__(self, n_list):
        self.n_list = n_list
        self.indices = {}

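    def fit(self, sentence):
        """Assign a vector index to every n-gram in the sentence not seen before."""
        inp = sentence.split()
        for n in self.n_list:
            # zip n shifted copies of the tokens to get each n-gram as a tuple
            for gram in zip(*[inp[k:] for k in range(n)]):
                if gram not in self.indices:
                    # the next free index equals the current vocabulary size
                    self.indices[gram] = len(self.indices)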

    def transform(self, sentence):
        """Given a sentence, convert it to a vector of n-gram counts."""
        v, inp = [0] * len(self.indices), sentence.split()
        for n in self.n_list:
            for gram in zip(*[inp[k:] for k in range(n)]):
                # n-grams that were not seen during fit are ignored
                if gram in self.indices:
                    v[self.indices[gram]] += 1
        return v

In [7]:
ng = Ngrams(n_list=[1,2])

In [8]:
ng.fit('text about stuff')

In [9]:
ng.transform('text about stuff')

[1, 1, 1, 1, 1]

In [10]:
ng.indices

{('about',): 1,
 ('about', 'stuff'): 4,
 ('stuff',): 2,
 ('text',): 0,
 ('text', 'about'): 3}
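
The example above transforms the same sentence that was used for fitting, so every count is 1. To show what happens with repeated and unseen n-grams, here is a small stand-alone sketch. It hard-codes the index mapping printed above and uses a hypothetical helper `to_vector` that mirrors the counting logic of `Ngrams.transform`; it is an illustration, not part of the original class.

```python
# Index mapping copied from the ng.indices output above.
indices = {
    ('text',): 0,
    ('about',): 1,
    ('stuff',): 2,
    ('text', 'about'): 3,
    ('about', 'stuff'): 4,
}

def to_vector(sentence, indices, n_list):
    """Count known n-grams in a sentence (hypothetical stand-in for transform)."""
    vec, tokens = [0] * len(indices), sentence.split()
    for n in n_list:
        for gram in zip(*[tokens[k:] for k in range(n)]):
            # n-grams missing from the mapping contribute nothing
            if gram in indices:
                vec[indices[gram]] += 1
    return vec

# Repeated n-grams are counted more than once.
repeated = to_vector('text about text about stuff', indices, [1, 2])

# Known words in a new order: unigrams match, but none of the bigrams do.
reordered = to_vector('stuff about text stuff', indices, [1, 2])

print(repeated)   # [2, 2, 1, 2, 1]
print(reordered)  # [1, 1, 2, 0, 0]
```

The vector length is fixed by the fitted vocabulary, so n-grams that were never seen during fitting are silently dropped rather than extending the feature space.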