# Demo #1 Bag Of Words

Now that we have split the sonnets into their own files, let's load them into an NLP library to learn more about what we're working with.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import os
try:
    os.chdir("Sonnets")
except OSError:
    # The directory is missing, or we are already inside it; stay where we are.
    pass

We need to make sure we're only grabbing the sonnets. Since we named the files algorithmically, we can just pattern-match on the file names. If you are working with less structured data, you may need to prune files from the directory by hand or use regex pattern matching. Please note that if a stray file (such as a model or a parameters file) makes its way into your data directory, it will be read like any other document, and you may get unexpected results or strange errors about encoding.

In [2]:
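all_files = os.listdir('.')
# Build a new list instead of removing items from the list we are looping
# over; removing during iteration can silently skip entries.
file_list = sorted(f for f in all_files
                   if f.startswith("Sonnet") and f.endswith(".txt"))
# Print whatever was left out so we can see what got skipped.
for skipped in sorted(set(all_files) - set(file_list)):
    print(skipped)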

shakespeares-sonnets_TXT_FolgerShakespeare.txt
stopwords.txt
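
For less regular file names, the regex approach mentioned above might look like the sketch below. The pattern and the example listing are assumptions made up for illustration (the real sonnet files may be numbered differently), so adjust the pattern to your own naming scheme.

```python
import re

# Hypothetical pattern: "Sonnet", one or more digits, then ".txt".
# Adjust it to match however your files were actually named.
SONNET_PATTERN = re.compile(r"Sonnet\d+\.txt")

def select_sonnets(names):
    """Return the sorted names that match SONNET_PATTERN in full."""
    return sorted(name for name in names if SONNET_PATTERN.fullmatch(name))

# Example directory listing (made up for illustration).
listing = ["Sonnet002.txt", "stopwords.txt", "Sonnet001.txt", "Sonnet001.txt.bak"]
print(select_sonnets(listing))
```

Using fullmatch rather than a prefix and suffix check also rejects names such as "Sonnet001.txt.bak", which only start like a sonnet file.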


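We must create an object to hold the tokenizer and bag of words, then call its "fit_transform" method to do the actual tokenization and counting. Because we pass input='filename', the vectorizer opens and reads each file in the list itself.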

In [3]:
vectorizer = CountVectorizer(input='filename')
vectorized_corpus = vectorizer.fit_transform(file_list)
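
To make the tokenizing and counting step less of a black box, here is a rough standard-library sketch of the same idea. It is not scikit-learn's implementation: it only mimics the default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), and the function names and example documents are made up for illustration.

```python
import re
from collections import Counter

# Same default token pattern CountVectorizer uses: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(documents):
    """Return (vocabulary, matrix) with one row of counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["From fairest creatures we desire increase",
        "we desire, we desire"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row is one document and each column is one vocabulary entry, which is the same layout as the matrix returned by the vectorizer, only dense and much smaller.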

In [4]:
print(type(vectorized_corpus))

<class 'scipy.sparse.csr.csr_matrix'>


In [5]:
#analyzer = vectorizer.build_analyzer()

In [6]:
print(vectorized_corpus.toarray().shape)

(154, 3074)


### Viewing the data

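These commands dump out information about the data we have loaded and processed. They're commented out because they produce a lot of output, but feel free to uncomment them and run them yourself.

This command gives you every word in the vocabulary of your text corpus, in column order. (In scikit-learn 1.0 and later the method is called get_feature_names_out(); get_feature_names() was removed in version 1.2.)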

In [7]:
#vectorizer.get_feature_names()

This command prints every nonzero count in the corpus, one per line, with the sonnet and the word token both given as integer indices, in the form

(Sonnet, WordToken)     Count

In [8]:
#print(vectorized_corpus)
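
The idea behind that listing is that only the nonzero cells are worth showing. A plain-Python sketch of the same view, using a made-up toy matrix rather than the real corpus, could look like this:

```python
def nonzero_entries(matrix):
    """Yield ((row, column), count) for every nonzero cell, row by row."""
    for row_index, row in enumerate(matrix):
        for column_index, count in enumerate(row):
            if count:
                yield (row_index, column_index), count

# Tiny made-up count matrix: 2 documents, 4 vocabulary entries.
toy_matrix = [[0, 2, 0, 1],
              [1, 0, 0, 0]]
entries = list(nonzero_entries(toy_matrix))
for coordinates, count in entries:
    print(coordinates, count)
```

With 154 sonnets and a vocabulary of 3074 tokens, most cells are zero, which is why the vectorizer returns a sparse matrix in the first place.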

We can also convert the counts to a dense matrix and work with them using NumPy functionality.

In [9]:
array_corpus = vectorized_corpus.toarray() 
print(array_corpus)
print(type(array_corpus))
print(array_corpus.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
<class 'numpy.ndarray'>
(154, 3074)


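We can also access individual sonnets in the array by their index number, since the sonnets lie along the first axis. The call to np.nonzero then lists the indices of the tokens that occur in that sonnet.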

In [10]:
print(array_corpus[0])
print(np.nonzero(array_corpus[0]))

[0 0 0 ... 0 0 0]
(array([   7,  101,  137,  139,  197,  199,  210,  350,  362,  371,  376,
        378,  442,  516,  520,  556,  570,  617,  669,  692,  771,  800,
        812,  909,  919,  930,  964, 1003, 1020, 1081, 1085, 1091, 1108,
       1143, 1167, 1262, 1270, 1290, 1360, 1365, 1512, 1515, 1588, 1593,
       1631, 1644, 1731, 1738, 1757, 1791, 1798, 1801, 1825, 1902, 2168,
       2183, 2262, 2315, 2438, 2442, 2515, 2562, 2604, 2616, 2617, 2618,
       2628, 2636, 2644, 2648, 2669, 2670, 2677, 2685, 2697, 2897, 2908,
       2937, 2993, 2996, 3023], dtype=int64),)


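We can access information about a single token by indexing the second axis: here, the token's text, the sonnets it appears in, and its total count across the corpus.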

In [11]:
print(vectorizer.get_feature_names()[7])
print(np.nonzero(array_corpus[:,7]))
print(np.sum(array_corpus[:,7]))

abundance
(array([  0,  22,  36, 134], dtype=int64),)
4
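
The row and column indexing used above can be tried on a small array first. The counts below are made up for illustration; only the indexing pattern carries over to the real corpus.

```python
import numpy as np

# Made-up counts: 3 documents (rows) by 4 tokens (columns).
toy = np.array([[0, 2, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 3]])

row = toy[1]        # every token count for document 1
column = toy[:, 1]  # counts of token 1 in every document

docs_with_token = np.nonzero(column)[0]             # documents containing token 1
total_count = int(column.sum())                     # occurrences across the corpus
document_frequency = int(np.count_nonzero(column))  # number of documents with it

print(docs_with_token.tolist(), total_count, document_frequency)
```

The total count and the document frequency differ whenever a token appears more than once in the same document. For "abundance" they happen to be equal: four sonnets, four occurrences.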


We can print out the sonnet to verify that the tokens we're looking at did indeed come from there.

In [12]:
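# Use a context manager so the file is closed once we have read it.
with open(file_list[0], 'r') as sonnet_file:
    print(sonnet_file.readlines())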

['From fairest creatures we desire increase,\n', "That thereby beauty's rose might never die,\n", 'But, as the riper should by time decease,\n', 'His tender heir might bear his memory.\n', 'But thou, contracted to thine own bright eyes,\n', "Feed'st thy light's flame with self-substantial fuel,\n", 'Making a famine where abundance lies,\n', 'Thyself thy foe, to thy sweet self too cruel.\n', "Thou that art now the world's fresh ornament\n", 'And only herald to the gaudy spring\n', 'Within thine own bud buriest thy content\n', "And, tender churl, mak'st waste in niggarding.\n", '  Pity the world, or else this glutton be--\n', "  To eat the world's due, by the grave and thee.\n"]


### Implementing Stop Words
We can pass our stop-word list to the vectorizer as well. When we re-run it, the number of sonnets stays the same but the vocabulary becomes smaller, as we would expect.

In [13]:
# Read one stop word per line; the context manager closes the file for us.
with open('stopwords.txt', 'r') as stop_words_file:
    stop_words = stop_words_file.read().splitlines()
print(stop_words)

['a', 'about', 'actually', 'almost', 'also', 'although', 'always', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'became', 'become', 'but', 'by', 'can', 'could', 'did', 'do', 'does', 'each', 'either', 'else', 'for', 'from', 'had', 'has', 'have', 'hence', 'how', 'i', 'if', 'in', 'is', 'it', 'its', 'just', 'may', 'maybe', 'me', 'might', 'mine', 'must', 'my', 'mine', 'must', 'my', 'neither', 'nor', 'not', 'of', 'oh', 'ok', 'when', 'where', 'whereas', 'wherever', 'whenever', 'whether', 'which', 'while', 'who', 'whom', 'whoever', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yes', 'yet', 'you', 'your']


In [14]:
vectorizer_stop_words = CountVectorizer(input='filename', stop_words=stop_words)
# Same settings as the first vectorizer, plus our stop-word list; we re-run the fit in the next cell

In [15]:
vectorized_stop_words =  vectorizer_stop_words.fit_transform(file_list)
print(vectorized_stop_words.toarray().shape)

(154, 3012)
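
The effect of a stop-word list on the vocabulary can be reproduced on a single line of text. This is a hand-rolled sketch using the default token pattern, not the vectorizer itself, and the function name is made up for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default pattern

def vocabulary(text, stop_words=()):
    """Return the sorted set of tokens in text, minus any stop words."""
    stop_set = set(stop_words)
    tokens = TOKEN_PATTERN.findall(text.lower())
    return sorted({token for token in tokens if token not in stop_set})

line = "Making a famine where abundance lies"
full = vocabulary(line)
filtered = vocabulary(line, stop_words=["a", "where"])
print(full)
print(filtered)
```

Note that "a" is absent from both results: the default token pattern only keeps tokens of two or more characters, so one-letter stop words such as "a" and "i" never reach the vocabulary. That, together with stop words that never occur in the sonnets, is why the vocabulary shrinks by fewer entries (3074 to 3012) than the stop-word list contains.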


In [16]:
print(vectorizer_stop_words.get_stop_words())

frozenset({'neither', 'might', 'yet', 'my', 'has', 'ok', 'wherever', 'whereas', 'from', 'nor', 'am', 'when', 'oh', 'whom', 'almost', 'why', 'in', 'did', 'for', 'whenever', 'will', 'each', 'if', 'your', 'also', 'hence', 'become', 'with', 'had', 'must', 'me', 'just', 'you', 'about', 'who', 'any', 'as', 'mine', 'does', 'maybe', 'within', 'yes', 'could', 'which', 'would', 'at', 'a', 'its', 'without', 'can', 'although', 'an', 'do', 'always', 'are', 'but', 'actually', 'be', 'how', 'it', 'whether', 'have', 'may', 'of', 'whose', 'i', 'is', 'became', 'where', 'and', 'while', 'else', 'by', 'either', 'not', 'whoever'})


### n-Grams
We can also create n-gram models using the same vectorizer module.

In this example, we use both single word tokens and bi-grams (pairs of adjacent words), which corresponds to an n-gram range of (1,2). To use only bi-grams, you would set the range to (2,2).

In [17]:
ngram_vectorizer = CountVectorizer(input='filename', ngram_range=(1,2), stop_words=stop_words)

In [18]:
ngram_corpus = ngram_vectorizer.fit_transform(file_list)
print(ngram_corpus.toarray().shape)

(154, 13818)
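
To see what the n-gram range produces, the sketch below builds word n-grams by hand from a short token list. The function name and the tokens are made up for illustration; it only approximates what the vectorizer does internally.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list, joined by single spaces."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# Illustrative tokens, as they might look after stop-word removal.
tokens = ["tender", "heir", "bear", "memory"]
unigrams_and_bigrams = word_ngrams(tokens, (1, 2))
bigrams_only = word_ngrams(tokens, (2, 2))
print(unigrams_and_bigrams)
print(bigrams_only)
```

scikit-learn removes stop words before it builds the n-grams, so a bi-gram such as "heir bear" can appear even though the stop word "might" sat between those two words in the sonnet.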


In [21]:
#print(ngram_vectorizer.get_feature_names())