# Constructing the N-gram model

Representing a document as a bag of words is useful, but semantics is about more than
just words in isolation. To capture word combinations, we can use an n-gram model. Its
vocabulary consists not just of words but also of word sequences, or n-grams. We will
build a bigram model in this recipe, where bigrams are sequences of two words.
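
To make the idea concrete, here is a minimal sketch in plain Python that lists the bigrams of a single sentence. The sentence and the variable names are made up for illustration, and splitting on whitespace is a simplification of what a real tokenizer does.

```python
# Illustrative sentence (an assumption, not taken from the book text).
sentence = "the quick brown fox jumps"

# Naive whitespace tokenization; real tokenizers also handle punctuation.
words = sentence.lower().split()

# Pair each word with its successor to form the bigrams.
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams)
```

A sentence of five words yields four bigrams, one for each adjacent pair.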

# Getting ready
The CountVectorizer class is very versatile and allows us to construct n-gram models.
We will use it again in this recipe. We will also explore how to build character n-gram
models using this class.

In [3]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
# Read in the book text (corpus):
filename = "/content/001_Study_in_Scarlet.txt"
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()
# Replace newlines with spaces:
text = text.replace("\n", " ")
# Divide the text into sentences. sent_tokenize uses the punkt_tab
# model we downloaded above:
sentences = nltk.sent_tokenize(text)

INITIALIZATION AND CONFIGURATION FOR BIGRAMS

In [5]:
# INITIALIZE COUNTVECTORIZER FOR BIGRAM
bigram = CountVectorizer(ngram_range = (2, 2))

In [6]:
# FIT AND TRANSFORM THE CORPUS
x_bigram_sparse = bigram.fit_transform(sentences)

In [7]:
# CONVERT TO DENSE ARRAY FOR VIEWING
x_bigram_dense = x_bigram_sparse.toarray()

INSPECT THE BIGRAMS (FEATURES)

In [10]:
bigram_features = bigram.get_feature_names_out()
print("Learned Bigrams(features): ")
print(f"Total Unique Bigrams: {len(bigram_features)}")
print(bigram_features)

Learned Bigrams(features): 
Total Unique Bigrams: 24287
['10 plays' '11 is' '12 has' ... 'youth thinking' 'youth with' 'zeal for']


DISPLAY THE BIGRAM COUNT MATRIX

In [12]:
data_bigrams = pd.DataFrame(x_bigram_dense,
                            columns = bigram_features,
                            index = [f"Doc {i + 1}" for i in range(len(sentences))])
print("Bigram Count Matrix (C(w_i-1, w_i))")
print(data_bigrams)

Bigram Count Matrix (C(w_i-1, w_i))
          10 plays  11 is  12 has  129 camberwell  13 duncan  13 we  15 and  \
Doc 1            0      0       0               0          0      0       0   
Doc 2            0      0       0               0          0      0       0   
Doc 3            0      0       0               0          0      0       0   
Doc 4            0      0       0               0          0      0       0   
Doc 5            0      0       0               0          0      0       0   
...            ...    ...     ...             ...        ...    ...     ...   
Doc 2671         0      0       0               0          0      0       0   
Doc 2672         0      0       0               0          0      0       0   
Doc 2673         0      0       0               0          0      0       0   
Doc 2674         0      0       0               0          0      0       0   
Doc 2675         0      0       0               0          0      0       0   

          1878 

# CALCULATE CONDITIONAL PROBABILITIES

In [13]:
unigram_vectorizer = CountVectorizer(ngram_range = (1, 1))
x_unigram_sparse = unigram_vectorizer.fit_transform(sentences)

x_unigram_dense = x_unigram_sparse.toarray()

unigram_features = unigram_vectorizer.get_feature_names_out()

In [15]:
# DATAFRAME FOR THE UNIGRAM COUNTS
data_unigram = pd.DataFrame(x_unigram_dense,
    columns = unigram_features,
    index = [f"Doc {i + 1}" for i in range(len(sentences))])

print("Unigram Count Matrix (C(w_i-1))")

print(data_unigram)

Unigram Count Matrix (C(w_i-1))
          10  11  12  129  13  15  1642  1878  221b  27  ...  youngest  \
Doc 1      0   0   0    0   0   0     0     1     0   0  ...         0   
Doc 2      0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 3      0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 4      0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 5      0   0   0    0   0   0     0     0     0   0  ...         0   
...       ..  ..  ..  ...  ..  ..   ...   ...   ...  ..  ...       ...   
Doc 2671   0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 2672   0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 2673   0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 2674   0   0   0    0   0   0     0     0     0   0  ...         0   
Doc 2675   0   0   0    0   0   0     0     0     0   0  ...         0   

          youngster  youngsters  your  yours  yourself  youth  youths  zeal  \


# There's more…
We can use trigrams, quadrigrams, and more in the vectorizer by providing the
corresponding tuple to the ngram_range argument, for example (3, 3) for trigrams only.
The downside of this is the ever-expanding vocabulary and the growth of sentence
vectors, since each sentence vector has to have an entry for each n-gram in the
vocabulary.