# Bag of Words

Bag of words is a simple, basic method for finding topics in a text. You first create tokens using tokenization and then count how often each token occurs. The theory is that the more frequent a word or token is, the more central or important it might be to the text. Bag of words can therefore be a useful way to find the significant words in a text based on the number of times they are used.

![](img/02_01.png "Bag of Words")

We can use the `nltk` library along with `Counter` from the `collections` module of the Python standard library to create a bag of words.
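
Before turning to a real article, the counting idea can be sketched on a made-up sentence. This is only an illustration: it splits on whitespace instead of using a real tokenizer, and the sentence and the names `toy_tokens` and `toy_bow` are assumptions, not part of the original notebook.

```python
from collections import Counter

# A made-up sentence; whitespace splitting stands in for real tokenization
text = "the cat chased the mouse and the mouse ran"
toy_tokens = text.lower().split()

# Count how often each token occurs
toy_bow = Counter(toy_tokens)

# The two most frequent tokens
print(toy_bow.most_common(2))  # [('the', 3), ('mouse', 2)]
```

Even in this tiny example the most frequent token is a stop word, which is why the preprocessing steps later in this section matter.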

## Building a Counter with bag-of-words

In [1]:
# Read the whole file into a single string: article
with open('article.txt', 'r') as f:
    article = f.read()

`article` contains the text of the Wikipedia article on debugging.

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter

In [3]:
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
bow_simple.most_common(10)

[(',', 151),
 ('the', 150),
 ('.', 89),
 ('of', 81),
 ("''", 66),
 ('to', 63),
 ('a', 60),
 ('``', 47),
 ('in', 44),
 ('and', 41)]

## Simple text preprocessing

Text preprocessing helps produce better input data for machine learning and other statistical methods. Steps such as tokenization and lowercasing are commonly used in NLP. Other common techniques include *stemming* and *lemmatization*, which reduce words to their stems or dictionary forms; removing stop words, which are common words in a language that don't carry much meaning; and removing punctuation or other unwanted tokens. Each model and process will give different results, so it's good to try a few approaches to preprocessing and see which works best for your task and goal.
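
As a rough sketch of the filtering steps described above, the following uses only the standard library. The token list and the tiny stop list `toy_stops` are made up for illustration and are not NLTK's; stemming and lemmatization are left out here.

```python
# A tiny hand-picked stop list, for illustration only (not NLTK's list)
toy_stops = {"the", "a", "is", "of", "and", "to"}

# Made-up tokens, as a tokenizer might produce them
raw_tokens = ["The", "bug", "is", "a", "flaw", ",", "and",
              "debugging", "finds", "it", "."]

# Lowercase every token
lowered = [t.lower() for t in raw_tokens]

# Keep alphabetic tokens that are not in the stop list
cleaned = [t for t in lowered if t.isalpha() and t not in toy_stops]

print(cleaned)  # ['bug', 'flaw', 'debugging', 'finds', 'it']
```

Note that "it" survives because the toy stop list is incomplete; a fuller list would remove it.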

![](img/02_02.png "Text Preprocessing")

In [4]:
# Import WordNetLemmatizer and the stop word lists
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Build the English stop word set once instead of reloading the list for every token
english_stops = set(stopwords.words("english"))

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
bow.most_common(10)

[('debugging', 40),
 ('system', 25),
 ('bug', 17),
 ('software', 16),
 ('problem', 15),
 ('tool', 15),
 ('computer', 14),
 ('process', 13),
 ('term', 13),
 ('debugger', 13)]

This is much better than our original bag of words, because we have removed punctuation and stop words.

## Introduction to gensim

**Gensim** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks such as building document or word vectors and corpora, identifying topics, and comparing documents.

### Word Vector

![](img/02_03.png "Word Vector")

A word embedding, or word vector, is a multi-dimensional representation of a word or document that is trained from a large corpus. With these vectors, we can see relationships among words or documents based on how near or far apart they are and on which comparisons turn out similar. For example, in this graphic the vector operation king minus queen is approximately equal to man minus woman, and Spain is to Madrid as Italy is to Rome.
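
The vector arithmetic can be sketched with hand-made numbers. The 2-D vectors below are hypothetical values chosen so the pattern is easy to see; real embeddings are learned from data and have many more dimensions. The helper names `subtract` and `cosine_similarity` are also just illustrative.

```python
import math

# Hypothetical 2-D vectors chosen by hand for illustration;
# real embeddings are learned and have many more dimensions
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.5, 0.8],
    "woman": [0.5, 0.2],
}

def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

royal_diff = subtract(vectors["king"], vectors["queen"])
plain_diff = subtract(vectors["man"], vectors["woman"])

# Difference vectors pointing the same way have a cosine similarity close to 1
print(cosine_similarity(royal_diff, plain_diff))
```

With these made-up values the two difference vectors point in the same direction, which is the geometric meaning of "king minus queen is approximately equal to man minus woman".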

### Creating a gensim dictionary

Gensim allows you to build corpora and dictionaries using simple classes and functions. A corpus (plural: corpora) is a set of texts used to help perform natural language processing tasks.

![](img/02_04.png "Gensim Dictionary")

We pass the tokenized documents to the Gensim `Dictionary` class. This creates a mapping with an id for each token and is the beginning of our corpus. We can now represent whole documents using just a list of their token ids and how often those tokens appear in each document. We can inspect the tokens and their ids through the `token2id` attribute, which is a dictionary of all of our tokens and their respective ids in our new dictionary.
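
The idea of mapping tokens to ids and then describing a document as (id, count) pairs can be sketched in plain Python. This is not gensim's implementation, and the ids it assigns need not match the ones gensim would choose; the documents and the names `toy_token2id` and `to_bow` are assumptions made for illustration.

```python
# Two made-up tokenized documents
toy_docs = [
    ["bug", "report", "bug"],
    ["debugger", "finds", "bug"],
]

# Assign each previously unseen token the next free integer id
toy_token2id = {}
for doc_tokens in toy_docs:
    for token in doc_tokens:
        if token not in toy_token2id:
            toy_token2id[token] = len(toy_token2id)

def to_bow(doc_tokens, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = {}
    for token in doc_tokens:
        token_id = token2id[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

print(toy_token2id)                       # {'bug': 0, 'report': 1, 'debugger': 2, 'finds': 3}
print(to_bow(toy_docs[0], toy_token2id))  # [(0, 2), (1, 1)]
```

The first document is now fully described by two pairs: token 0 ("bug") occurs twice and token 1 ("report") occurs once.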

### Creating and querying a corpus with gensim

In [5]:
import shelve
shelve_file = shelve.open('mydata')

In [6]:
articles = shelve_file['articles']

>`articles` is a list of documents. Each document is a list of tokens.

In [9]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a bag-of-words corpus (a list of doc2bow results, one per article): corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])


computer
[(0, 88), (23, 11), (24, 2), (39, 1), (41, 2), (55, 22), (56, 1), (57, 1), (58, 1), (59, 3)]


In [54]:
# dictionary.token2id

In [15]:
dictionary.get(219)

'computer'

In [55]:
# corpus[0]

### Gensim bag-of-words

We'll use `defaultdict` and `itertools.chain.from_iterable()` here.
* `defaultdict` allows us to initialize a dictionary that assigns a default value to non-existent keys. By supplying the argument `int`, we ensure that any non-existent key is automatically assigned a default value of 0, which makes it ideal for storing word counts.

* `itertools.chain.from_iterable()` allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
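
The two helpers described above can be tried on a small made-up corpus first. The data and the names `toy_corpus` and `totals` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain

# A made-up corpus: each document is a list of (word_id, count) pairs
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
]

# int() returns 0, so unseen keys start at zero
totals = defaultdict(int)

# chain.from_iterable flattens the list of lists into one stream of pairs
for word_id, count in chain.from_iterable(toy_corpus):
    totals[word_id] += count

print(dict(totals))  # {0: 3, 1: 1, 2: 3}
```

Without `defaultdict(int)`, the first `+=` for each word id would raise a `KeyError`, and without `chain.from_iterable` a nested loop over documents and pairs would be needed.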

In [33]:
from collections import defaultdict
from itertools import chain
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

engineering 91
'' 88
reverse 71
software 51
cite 26


In [56]:
# total_word_count

In [50]:
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

'' 1042
computer 594
software 450
`` 345
cite 322


## tf-idf

![](img/02_05.png "tf-idf")

**Tf-idf** stands for term frequency-inverse document frequency. It is a commonly used natural language processing model that helps you determine the most important words in each document in the corpus. The idea behind tf-idf is that a corpus may share more common words than just stop words. These shared words behave like stop words and should be removed or at least down-weighted in importance. For example, in a corpus of astronomy texts, "sky" might be used often but is not important, so we want to down-weight that word. Tf-idf does precisely that: it takes texts that share common language and ensures that the most common words across the entire corpus don't show up as keywords. It keeps document-specific frequent words weighted high and words common across the entire corpus weighted low.

### Formula

![](img/02_06.png "tf-idf Formula")

The weight of token i in document j is the term frequency (how many times the token appears in the document) multiplied by the log of the total number of documents divided by the number of documents that contain the term. If the total number of documents divided by the number of documents containing the term is close to one, the logarithm will be close to zero, so words that occur across many or all documents get a very low tf-idf weight. Conversely, if the word occurs in only a few documents, the logarithm returns a higher number.
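
The formula can be worked through by hand on a tiny made-up corpus. This sketch applies the plain formula as described above, using the natural logarithm; the documents and the function name `tf_idf` are assumptions. Gensim's `TfidfModel` will generally produce different numbers, for instance because it normalizes document vectors by default.

```python
import math

# Made-up tokenized documents about astronomy
toy_docs = [
    ["sky", "star", "star", "telescope"],
    ["sky", "planet", "orbit"],
    ["sky", "galaxy", "star"],
]

def tf_idf(term, doc_tokens, all_docs):
    """Weight = term frequency * log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# "sky" appears in every document, so log(3 / 3) = 0 and the weight is zero
print(tf_idf("sky", toy_docs[0], toy_docs))

# "star" appears twice here and in 2 of the 3 documents
print(tf_idf("star", toy_docs[0], toy_docs))

# "telescope" appears once and only in this document
print(tf_idf("telescope", toy_docs[0], toy_docs))
```

Although "star" is used more often in the first document than "telescope", it ends up with the lower weight because it also appears in another document.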

### Tf-idf with gensim

![](img/02_07.png "tf-idf with gensim")

In [52]:
from gensim.models.tfidfmodel import TfidfModel
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

[(24, 0.0022836332291091273), (39, 0.0043409401554717324), (41, 0.008681880310943465), (55, 0.011988285029371418), (56, 0.005482756770026296)]


In [53]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

reverse 0.4884961428651127
infringement 0.18674529210288995
engineering 0.16395041814479536
interoperability 0.12449686140192663
reverse-engineered 0.12449686140192663
