## Bag-of-words models

We create a matrix in which each row represents a document in our dataset and each column
represents a word. The value of each cell is the frequency of that word in that document. This
is known as the bag-of-words model.

In [2]:
s = """Three Rings for the Elven-kings under the sky, Seven for the Dwarflords
in halls of stone, Nine for Mortal Men, doomed to die, One for the
Dark Lord on his dark throne In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them, One Ring to bring them
all and in the darkness bind them. In the Land of Mordor where the Shadows
lie. """.lower()
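# Split on whitespace; punctuation stays attached to words (for example, "them,")
words = s.split()
from collections import Counter
# Count how often each distinct token occurs
c = Counter(words)
# Show the five most frequent tokens and their counts
print(c.most_common(5))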

[('the', 9), ('for', 4), ('in', 4), ('to', 4), ('one', 4)]
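
The counts above cover a single document. As a rough sketch of the matrix described at the start of this section, the snippet below builds one row per document and one column per word using only the standard library. The three-document corpus and the names `documents`, `vocabulary`, and `matrix` are made up for illustration and are not part of the original example.

```python
from collections import Counter

# Hypothetical toy corpus; any list of strings would work here.
documents = [
    "one ring to rule them all",
    "one ring to find them",
    "in the land of mordor where the shadows lie",
]

# Count the words in each document separately.
counts = [Counter(doc.split()) for doc in documents]

# The sorted vocabulary fixes the column order of the matrix.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word.
# A Counter returns 0 for words that do not occur in a document.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero, because each document uses only a small part of the vocabulary. With a realistic corpus, a sparse matrix representation would likely be a better fit than nested lists.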


The bag-of-words model has three major types, with many variations and alterations.
<ul><li>The first is to use the raw frequencies, as shown in the preceding example. This
has the same drawback as any non-normalized data: words with high variance
due to high overall values (such as "the") overshadow lower-frequency (and
therefore lower-variance) words, even though the presence of the word "the" rarely
has much importance.</li><li>
The second is to use the normalized frequency, where each document's
values sum to 1. This is a much better solution because the length of the document doesn't
matter as much, but words like "the" still overshadow lower-frequency
words.</li><li>
The third is to simply use binary features: a value is 1 if the word occurs in the document,
and 0 otherwise. We will use the binary representation in this chapter.</li></ul>

Another (arguably more popular) method for performing normalization is called
term frequency-inverse document frequency (tf-idf). In this weighting scheme,
term counts are first normalized to frequencies and then divided by the number
of documents in the corpus in which the term appears. We will use tf-idf in Chapter 10,
Clustering News Articles.
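
To make the differences between these representations concrete, here is a minimal sketch that computes each of them for a small, made-up corpus. The function names (`raw_counts`, `normalized`, `binary`, `tf_idf`) and the corpus are assumptions for illustration only. The tf-idf variant follows the simplified description given here (frequency divided by the number of documents containing the term); library implementations typically use a logarithm of the inverse document frequency instead, so their values will differ.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents.
documents = [
    "the ring the ring the shadows".split(),
    "the land of mordor".split(),
    "one ring".split(),
]

def raw_counts(tokens):
    """Raw frequency: how many times each word occurs."""
    return dict(Counter(tokens))

def normalized(tokens):
    """Normalized frequency: the values for one document sum to 1."""
    total = len(tokens)
    return {word: n / total for word, n in Counter(tokens).items()}

def binary(tokens):
    """Binary features: 1 if the word occurs; absent words are implicitly 0."""
    return {word: 1 for word in set(tokens)}

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for tokens in documents for word in set(tokens))

def tf_idf(tokens):
    """Simplified tf-idf: normalized frequency divided by document frequency."""
    return {word: f / doc_freq[word] for word, f in normalized(tokens).items()}

first = documents[0]
print(raw_counts(first))
print(normalized(first))
print(binary(first))
print(tf_idf(first))
```

In the first document, "the" has the largest raw and normalized values, while the binary representation treats all three words equally. Under this simplified tf-idf, "the" is still ahead of "shadows" but by a smaller factor, because "the" also appears in a second document.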