# <span style = "color:green"> What are n-grams?</span>

***

N-grams are contiguous sequences of words, symbols, or tokens in a document. More precisely, an n-gram is a run of n neighbouring items taken from the text in their original order. They come into play whenever we deal with text data in NLP tasks.

n-grams are classified into the following types, depending on the value that 'n' takes.

| n | Term | 
| --- | --- | 
| 1 | Unigram |
| 2 | Bigram |
| 3 | Trigram |
| 4 or more | n-gram (for example, 4-gram, 5-gram) |

As the table shows, when n=1 the sequence is called a unigram, when n=2 a bigram, and so on.

Now, you may be wondering why we need so many different types of n-grams. Different values of n suit different applications, so you should try several on your data before concluding which one works best for your text analysis. For instance, trigrams and 4-grams have been reported to work well for spam filtering.

### Example of n-grams

Let's understand n-grams practically with the help of the following sentence:
 "I study at Edure"
 
 | SL.No. | Type of n-gram | Generated n-grams |
 | --- | --- | --- | 
 | 1 | Unigram | ["I", "study", "at", "Edure"] |
 | 2 | Bigram | ["I study", "study at", "at Edure"] |
 | 3 | Trigram | ["I study at", "study at Edure"] |

From the table above, it's clear that unigram means taking only one word at a time, bigram means taking two words at a time and trigram means taking three words at a time.
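
The sliding-window idea behind the table can be sketched in plain Python without any library. This is a minimal illustration, not how NLTK implements it: the helper name `simple_ngrams` is made up for this example, and splitting on whitespace is a simplifying assumption that ignores punctuation.

```python
def simple_ngrams(sentence, n):
    """Return the n-grams of a sentence as strings (hypothetical helper)."""
    words = sentence.split()  # assumption: whitespace tokenization is good enough
    # A sentence with k words has k - n + 1 n-grams (none if k < n).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I study at Edure"
for n in (1, 2, 3):
    print(n, simple_ngrams(sentence, n))
```

A sentence of k words yields k - n + 1 n-grams, which is why the four-word sentence gives four unigrams, three bigrams, and two trigrams.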

### Applications of n-grams

N-grams have a variety of uses. Some applications of n-grams in NLP include auto-completion of sentences, automatic spell-checking, and semantic analysis. They are also used outside NLP, for example in the analysis of DNA sequences.
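
To make the auto-completion use case concrete, here is a minimal sketch of a bigram-based next-word suggester. The function names `build_bigram_model` and `suggest_next` and the toy corpus are invented for this illustration; a real auto-complete system would need far more data and some form of smoothing.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Each (current, next) pair is one bigram.
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers


def suggest_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus, invented for this illustration.
corpus = [
    "I study at Edure",
    "I study at home",
    "I study at Edure every day",
]
model = build_bigram_model(corpus)
print(suggest_next(model, "at"))
```

The suggestion is simply the word that most often followed the given word in the corpus, so its quality depends entirely on the text the model has seen.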

## Implementation of n-grams in Python

In [1]:
import nltk
from nltk.util import ngrams
 
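# word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases ask for 'punkt_tab').

# Generate the n-grams of size `num` from a sentence, each joined into one string.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]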
 
data = 'A class is a blueprint for the object.'
 
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object', '.']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object', 'object .']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object', 'the object .']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object', 'for the object .']


In [7]:
import nltk
from nltk.util import ngrams
data = 'hi how are you i am glad that ur here'

# Bigrams (n = 2) of the tokenized sentence
n_grams = ngrams(nltk.word_tokenize(data), 2)
print([' '.join(grams) for grams in n_grams])

['hi how', 'how are', 'are you', 'you i', 'i am', 'am glad', 'glad that', 'that ur', 'ur here']


In [6]:
import nltk
from nltk.util import ngrams
data = 'hi how are you i am glad that ur here'

# 4-grams (n = 4) of the tokenized sentence
n_grams = ngrams(nltk.word_tokenize(data), 4)
print([' '.join(grams) for grams in n_grams])

['hi how are you', 'how are you i', 'are you i am', 'you i am glad', 'i am glad that', 'am glad that ur', 'glad that ur here']


### n-gram with CountVectorizer

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
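cv = CountVectorizer(ngram_range=(2, 2))  # (2, 2) means bigrams only

In [40]:
# Use the sentence from the first example. By default CountVectorizer lowercases
# the text and drops one-character tokens such as "a", leaving five bigrams.
data = 'A class is a blueprint for the object.'
cv.fit_transform([data]).toarray()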

array([[1, 1, 1, 1, 1]], dtype=int64)
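
To see where the five columns in the array above come from, the counting can be imitated with the standard library. The sketch assumes the input is the sentence 'A class is a blueprint for the object.' and mimics two CountVectorizer defaults: lowercasing, and the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. It is an approximation for illustration, not scikit-learn's actual code path.

```python
import re
from collections import Counter

text = "A class is a blueprint for the object."

# Lowercase and keep tokens with at least two word characters,
# so "a" and the final "." are dropped.
tokens = re.findall(r"\b\w\w+\b", text.lower())

# Build bigrams from the remaining tokens.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
counts = Counter(bigrams)

# One column per distinct bigram, in alphabetical order.
vocabulary = sorted(counts)
row = [counts[term] for term in vocabulary]

print(vocabulary)
print(row)
```

Note that "is blueprint" appears as a bigram even though the two words are not adjacent in the sentence: the one-character token "a" is removed before the bigrams are formed.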

### n-gram with Tfidf

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [43]:
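tf = TfidfVectorizer(ngram_range=(3, 3))  # (3, 3) means trigrams only

In [44]:
# Same sentence as in the first example: after lowercasing and dropping the
# one-character token "a", four trigrams remain, each weighted 0.5.
data = 'A class is a blueprint for the object.'
tf.fit_transform([data]).toarray()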

array([[0.5, 0.5, 0.5, 0.5]])
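
The value 0.5 in the array above can be reproduced by hand. The sketch below assumes the single sentence 'A class is a blueprint for the object.' as input and TfidfVectorizer's documented defaults: smoothed IDF, computed as ln((1 + N) / (1 + df)) + 1, followed by L2 normalization of each row. It is a simplified illustration rather than scikit-learn's implementation.

```python
import math
import re
from collections import Counter

text = "A class is a blueprint for the object."

# Same tokenization assumption as CountVectorizer's defaults:
# lowercase, keep tokens of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", text.lower())
trigrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
term_counts = Counter(trigrams)

n_docs = 1  # a single document
doc_freq = 1  # every trigram occurs in that one document

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1, which is 1.0 here.
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw tf-idf, then L2 normalization of the row.
raw = [term_counts[term] * idf for term in sorted(term_counts)]
norm = math.sqrt(sum(value * value for value in raw))
tfidf_row = [value / norm for value in raw]

print(tfidf_row)
```

With only one document every IDF equals 1, so the four trigram weights are all 1 before normalization, and dividing by the row's L2 norm (the square root of 4) gives 0.5 for each.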