# Using $n$-Grams and Bag-of-Words Models
*Curtis Miller*

In this video I demonstrate how to use $n$-grams and bag-of-words models. I show how to create the relevant data structures from documents and give usage examples.

## $n$-Grams

$n$-grams can refer to collections of either characters or words. I will turn to the word case later; for now, let's focus on character $n$-grams.

An $n$-gram is a sequence of $n$ consecutive characters that appear in a text. The 3-grams for the word "apple" are `["app", "ppl", "ple"]`, and the 4-grams are `["appl", "pple"]`. $n$-grams are used to generate a feature set for a string that we can use for other purposes, such as machine learning.
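
As a minimal sketch of this idea, the snippet below extracts character $n$-grams with plain Python and turns a string into a 0/1 feature vector. The helper names `char_ngrams` and `indicator_features` and the tiny vocabulary are made up for illustration; they are not part of NLTK.

```python
def char_ngrams(text, n):
    """Return the character n-grams of `text`, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def indicator_features(text, vocabulary, n=3):
    """Map `text` to a 0/1 vector: 1 if the vocabulary n-gram occurs in it."""
    present = set(char_ngrams(text, n))
    return [1 if gram in present else 0 for gram in vocabulary]


print(char_ngrams("apple", 3))   # ['app', 'ppl', 'ple']
print(char_ngrams("apple", 4))   # ['appl', 'pple']

# A tiny, made-up vocabulary of 3-grams, used only for illustration
vocabulary = ["app", "ple", "ann"]
print(indicator_features("apple", vocabulary))  # [1, 1, 0]
```

Each position in the vector corresponds to one $n$-gram in the vocabulary, which is the kind of representation a classifier can work with.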

Let's demonstrate using $n$-grams to identify gender in names. 

In [None]:
import nltk
from nltk import ngrams
from nltk.corpus import names
import pandas as pd
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import re

In [None]:
names.fileids()

In [None]:
names.words(fileids="female.txt")

In [None]:
femdf = pd.DataFrame({"name": names.words(fileids="female.txt"),
                      "gender": ["female"] * len(names.words(fileids="female.txt"))})
femdf

In [None]:
maldf = pd.DataFrame({"name": names.words(fileids="male.txt"),
                      "gender": ["male"] * len(names.words(fileids="male.txt"))})
maldf

In [None]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
# and ignore_index=True renumbers the rows from 0
namedf = pd.concat([maldf, femdf], ignore_index=True)
namedf

In [None]:
nametrain, nametest = train_test_split(namedf)
nametrain = nametrain.reset_index(drop=True)
nametest = nametest.reset_index(drop=True)
nametrain.shape

In [None]:
nametrain.gender.value_counts()

In [None]:
nametest.shape

In [None]:
namegrams = [[''.join(u) for u in ngrams(m, n=3)] for m in nametrain.name]
namegrams

In [None]:
gramfreq = nltk.FreqDist(list(gr for a in namegrams for gr in a))
gramfreq

In [None]:
gramfreq.plot(10)

In [None]:
M = 2000
gramfreq.most_common(M)

In [None]:
commongrams = [gr[0] for gr in gramfreq.most_common(M)]
gramdf = pd.DataFrame(np.zeros((nametrain.shape[0], M)),
                      columns=pd.Index(commongrams))

In [None]:
gramdf

In [None]:
nametrain = nametrain.join(gramdf)

for i in nametrain.index:
    nametrain.loc[nametrain.index[i], list(u for u in namegrams[i] if u in commongrams)] = 1

nametrain

In [None]:
gendpred = BernoulliNB()

gendpred = gendpred.fit(nametrain.loc[:, commongrams], nametrain.gender)
# DataFrame.as_matrix() was removed in pandas 1.0; select the columns directly
predicted_gender = pd.Series(gendpred.predict(nametrain.loc[:, commongrams]))
predicted_gender.value_counts()

In [None]:
print(classification_report(nametrain.gender, predicted_gender))

In [None]:
nametest = nametest.join(gramdf)
namegramstest = [[''.join(u) for u in ngrams(m, n=3)] for m in nametest.name]

for i in range(nametest.shape[0]):
    nametest.loc[nametest.index[i], list(u for u in namegramstest[i] if u in commongrams)] = 1

# DataFrame.as_matrix() was removed in pandas 1.0; select the columns directly
predicted_gender_test = pd.Series(gendpred.predict(nametest.loc[:, commongrams]))
print(classification_report(nametest.gender, predicted_gender_test))

The classifier does not do a terrible job, but I'm mostly interested in demonstrating the technology at this point.

## Bag-of-Words

The idea of bag-of-words models is essentially the same as that of character $n$-grams, except that now we work with words. We can again combine words to make bigrams, trigrams, etc., which may be more useful than single words.
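
To make the word case concrete, here is a small sketch using only the standard library. The sentence is made up, the helper `word_ngrams` is a hypothetical name, and splitting on whitespace stands in for a proper tokenizer.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A made-up sentence, split on whitespace for simplicity
tokens = "the cat sat on the mat".split()

bag_of_words = Counter(tokens)   # word counts; word order is discarded
bigrams = word_ngrams(tokens, 2)

print(bag_of_words["the"])   # 2
print(bigrams)
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

The bag of words keeps only how often each word occurs, while the bigrams retain a little of the local word order.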

Ultimately these methods are a form of feature generation for documents that can later be used for learning applications.

Here, I use bigrams to predict whether a speech (specifically, an American State of the Union address or inaugural address) was delivered by a Democratic or Republican president. I use the State of the Union and inaugural address corpora provided with NLTK for training, and will use the naïve Bayes algorithm for classification. The number of times a bigram appears in a speech will be a feature.

In [None]:
from nltk.corpus import state_union, inaugural

In [None]:
state_union.fileids()

In [None]:
inaugural.fileids()

In [None]:
state_union.words('2006-GWBush.txt')

I will use only speeches given during the "modern" American political era, which I consider to start with President Eisenhower.

In [None]:
state_union.fileids()[7:]

In [None]:
inaugural.fileids()[41:]

In [None]:
pres_party = {            # Will be used to identify parties
    "Eisenhower": "R",
    "Kennedy": "D",
    "Johnson": "D",
    "Nixon": "R",
    "Ford": "R",
    "Carter": "D",
    "Reagan": "R",
    "Bush": "R",
    "Clinton": "D",
    "Obama": "D",
    "GWBush": "R",
    "Trump": "R"
}

In [None]:
nltk.Text(state_union.words("2002-GWBush.txt")).collocations()

In [None]:
# Create a dataframe containing speech data; for now, this is file id, type of speech, and party
speeches = pd.DataFrame({"fileid": [*state_union.fileids()[7:], *inaugural.fileids()[41:]],
                         "type": [*(["sotu"] * len(state_union.fileids()[7:])), 
                                  *(["inaugural"] * len(inaugural.fileids()[41:]))]})
speeches = speeches.join(pd.DataFrame({"party": speeches.fileid.map(
    lambda x: pres_party[re.findall(r"[0-9]{4}-(.*?)(?:-[0-9])?\.txt", x)[0]])}))
speeches

In [None]:
# Get a collection of tokens for Democratic and Republican speeches
tokens_R = list()
tokens_D = list()

for _, s in speeches.iterrows():
    if s.type == "sotu":
        words = state_union.words(fileids=s.fileid)
    else:
        words = inaugural.words(fileids=s.fileid)
    
    if s.party == "R":
        tokens_R.extend(list(words))
    else:
        tokens_D.extend(list(words))

In [None]:
tokens_R

I don't want to use every possible bigram as my feature space. Instead I will find collocations: pairs of words that appear together unusually often in a text. I find collocations for both Democratic and Republican presidents, then combine them into one common set that will be treated as the feature space. Below is a function for finding these collocations, using functionality provided by NLTK (and adapted from NLTK source code).
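
To give a feel for what "unusually often" means, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. This is a simplified stand-in: the notebook itself ranks pairs with NLTK's likelihood-ratio measure, not PMI. The function name `pmi_collocations`, its parameters, and the toy text are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_collocations(tokens, min_count=2, num=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    NOTE: PMI is used here only for illustration; the notebook ranks
    pairs with NLTK's likelihood-ratio measure instead.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:   # skip rare pairs, like a frequency filter
            continue
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first
    return sorted(scores, key=scores.get, reverse=True)[:num]


# Made-up text, for illustration only
text = ("we thank the united states and the people of "
        "the united states for the peace")
top_pairs = pmi_collocations(text.split())
print(top_pairs)
# [('united', 'states'), ('the', 'united')]
```

The pair "united states" scores higher than "the united" because "the" is common on its own, so seeing it next to "united" is less surprising.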

In [None]:
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

In [None]:
def get_collocations(l, window_size=2, num=20):
    """Gets a list of collocations for a text; this is similar to code from the collocations() method of Text in nltk"""
    
    ignored_words = stopwords.words('english')
    finder = BigramCollocationFinder.from_words(l, window_size)
    finder.apply_freq_filter(2)
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words or w == w.upper())
    bigram_measures = BigramAssocMeasures()
    return finder.nbest(bigram_measures.likelihood_ratio, num)

In [None]:
colloc_R = get_collocations(tokens_R, window_size=2, num=60)
colloc_R

In [None]:
colloc_D = get_collocations(tokens_D, window_size=2, num=60)
colloc_D

In [None]:
# Create the common collocation set
feature_colloc = list(set(' '.join(w) for w in colloc_R).union(' '.join(w) for w in colloc_D))
feature_colloc

Now I find bigrams for the speeches and compute how frequently a word pair appeared in a speech, for each speech. The frequency of these word pairs will be treated as the data points that will form the basis of prediction.
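
The counting step can be sketched on its own with the standard library. The function `bigram_feature_counts`, the tokens, and the feature list below are made up for illustration. Counting all bigrams once with a `Counter` is one possible design choice; it avoids rescanning the bigram list for every feature.

```python
from collections import Counter


def bigram_feature_counts(tokens, feature_bigrams):
    """Count how often each feature bigram occurs in a token list."""
    counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    # A Counter returns 0 for bigrams that never occur
    return {bigram: counts[bigram] for bigram in feature_bigrams}


# Made-up tokens and feature set, for illustration only
tokens = "the american people and the american dream".split()
features = ["american people", "american dream", "health care"]

feature_counts = bigram_feature_counts(tokens, features)
print(feature_counts)
# {'american people': 1, 'american dream': 1, 'health care': 0}
```

Each speech then becomes one row of such counts, with one column per collocation in the feature set.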

In [None]:
phrase_dict = dict()

for i, s in speeches.iterrows():
    if s.type == "sotu":
        words = state_union.words(fileids=s.fileid)
    else:
        words = inaugural.words(fileids=s.fileid)
    
    bigrams = [u[0] + ' ' + u[1] for u in nltk.ngrams(words, 2)]
    phrase_dict[i] = pd.Series(map(lambda x: bigrams.count(x), feature_colloc), index=feature_colloc)

phrase_dict

In [None]:
speeches = speeches.join(pd.DataFrame(phrase_dict).T)
speeches

Now we fit a naïve Bayes classifier to the data.

In [None]:
partypred = GaussianNB()

partypred = partypred.fit(speeches.loc[:, feature_colloc], speeches.party)

party_predicted = partypred.predict(speeches.loc[:, feature_colloc])
party_predicted

In [None]:
print(classification_report(speeches.party, party_predicted))

On the training data, classification isn't bad, though not great either. Now let's test out the classifier on unseen data: Barack Obama's 2014 State of the Union address and Donald Trump's 2018 State of the Union address.

In [None]:
with open("2014-Obama.txt") as f:
    obama_speech = f.read()

with open("2018-Trump.txt") as f:
    trump_speech = f.read()

print(obama_speech)

In [None]:
print(trump_speech)

In [None]:
token_obama = nltk.tokenize.wordpunct_tokenize(obama_speech)
token_trump = nltk.tokenize.wordpunct_tokenize(trump_speech)

obama_bigrams = [u[0] + ' ' + u[1] for u in nltk.ngrams(token_obama, 2)]
trump_bigrams = [u[0] + ' ' + u[1] for u in nltk.ngrams(token_trump, 2)]

test_data = pd.DataFrame({"obama": pd.Series(map(lambda x: obama_bigrams.count(x), feature_colloc),
                                             index=feature_colloc),
                          "trump": pd.Series(map(lambda x: trump_bigrams.count(x), feature_colloc),
                                             index=feature_colloc)}).T

In [None]:
test_data

In [None]:
partypred.predict(test_data)

Unfortunately, our classifier did not do well on the test set: it classified only one of the two speeches correctly (50% accuracy), identifying Donald Trump as a Democrat.