### Word counts with bag-of-words

Welcome to chapter two! We'll begin with using word counts with a bag of words approach.

### Bag-of-words

Bag of words is a simple method for `finding topics` in a `text`. For bag of words, 

- you first create tokens using tokenization, and 

- then count up all the tokens you have. 

The theory is that the more `frequent` a word or token is, the more `central or important` it might be to the text. 

Bag of words can be a great way to determine the `significant` words in a text based on the number of times they are used.

### Bag-of-words example

<img src="bow.jpg" style="max-width:600px">

Here we see an example series of sentences, mainly about a cat and a box. If we just use a simple bag-of-words model with tokenization like we learned in chapter one and remove the punctuation, we can see the example result. Box, cat, The and the are some of the most important words because they are the most frequent. 

Notice that the word `the` appears twice in the bag of words, once capitalized (`uppercase` T) and once in `lowercase`. If we added a preprocessing step that lowercases all of the words in the text, each word would be counted only once.
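A minimal sketch of the effect of lowercasing, using only the standard library. The example text and the regular-expression tokenizer are assumptions made for illustration; they stand in for the figure's sentences and for NLTK's `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical example text, not necessarily the sentences from the figure
text = "The cat is in the box. The cat likes the box."

# Stand-in tokenizer: runs of letters only, so punctuation is dropped
tokens = re.findall(r"[A-Za-z]+", text)

case_sensitive = Counter(tokens)
lowercased = Counter(t.lower() for t in tokens)

# "The" and "the" are separate keys until the tokens are lowercased
print(case_sensitive["The"], case_sensitive["the"])
print(lowercased["the"])
```

After lowercasing, the two spellings collapse into a single key whose count is the sum of both.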


### Bag-of-words in Python

We can use the NLP fundamentals we already know, such as `tokenization` with NLTK to create a `list of tokens`. 

We will use a new class called `Counter` which we import from the standard library module `collections`. The list of tokens generated using `word_tokenize` can be passed as the initialization argument for the Counter class. The result is a counter object which has similar structure to a `dictionary` and allows us to see each token and the frequency of the token. 

<img src="c.jpg" style="max-width:600px">

Counter objects also have a method called `most_common`, which takes an `integer argument`, such as 2, and returns the top 2 tokens in terms of frequency. The return value is a list of tuples. In each tuple, the first element holds the token and the second element is its frequency. 

Note: other than ordering by token frequency, the most_common method does not sort the tokens it returns or tell us there are more tokens with that same frequency.
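The tie behaviour described in the note can be sketched as follows. The token list is invented for illustration, and the tie check at the end is just one possible way to spot tokens left out at the cutoff.

```python
from collections import Counter

# Hypothetical token list with a tie for second place
tokens = ["cat", "box", "cat", "the", "box", "cat", "the"]
counts = Counter(tokens)

top_two = counts.most_common(2)
print(top_two)

# "box" and "the" both occur twice, but only one of them fits in the top 2.
# Comparing against the count at the cutoff reveals every token in the tie.
cutoff = top_two[-1][1]
tied = [tok for tok, n in counts.items() if n == cutoff]
print(tied)
```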

In [1]:
with open("wiki_text_debugging.txt", 'r') as f:
    article = f.read()

print(len(article))

15854


## Exercise 1: Building a Counter with bag-of-words

In this exercise, you'll build your first `bag-of-words counter` using a `Wikipedia article`, which has been pre-loaded as article. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! 

Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

- Import `Counter` from `collections`. Use `word_tokenize()` to `split` the article into `tokens`.


- Use a `list comprehension` with `t` as the `iterator` variable to `convert` all the `tokens` into `lowercase`. The .lower() method converts text into lowercase.


- Create a bag-of-words counter called `bow_simple` by using `Counter()` with `lower_tokens` as the argument.


- Use the `.most_common()` method of `bow_simple` to print the `10` most common tokens.

In [2]:
## q1
from collections import Counter 
from nltk.tokenize import word_tokenize

In [3]:
tokens = word_tokenize(article)
print(len(tokens))
print(tokens)

2910
["'", "''", 'Debugging', "''", "'", 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', '.', 'Numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', '(', 'see', 'below', ':', '#', 'Further', 'reading|Further', 'reading', ')', ',', 'as', 'it', 'involves', 'numerous', 'aspects', ',', 'including', 'interactive', 'debugging', ',', 'control', 'flow', ',', 'integration', 'testing', ',', 'Logfile|log', 'files', ',', 'monitoring', '(', 'Application', 'monitoring|application', ',', 'System', 'Monitoring|system', ')', ',', 'memory', 'dumps', ',', 'Profiling', '(', 'computer', 'programming', ')', '|profiling', ',', 'Statistical', 'Process', 'Control', ',', 'and', 'special', 'design', 'tactics', 'to', 'improve', 'detection', 'while', 'simplifying', 'changes', '.', 'Origin', 'A', 'computer', 'log', 'entry', 'from', 'the', 'Mark', '&', 'nbsp', ';', 'II', ',', 'wi

In [4]:
## q2
lower_tokens = [t.lower() for t in tokens]
print(len(lower_tokens))
print(lower_tokens)

2910
["'", "''", 'debugging', "''", "'", 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', '.', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', '(', 'see', 'below', ':', '#', 'further', 'reading|further', 'reading', ')', ',', 'as', 'it', 'involves', 'numerous', 'aspects', ',', 'including', 'interactive', 'debugging', ',', 'control', 'flow', ',', 'integration', 'testing', ',', 'logfile|log', 'files', ',', 'monitoring', '(', 'application', 'monitoring|application', ',', 'system', 'monitoring|system', ')', ',', 'memory', 'dumps', ',', 'profiling', '(', 'computer', 'programming', ')', '|profiling', ',', 'statistical', 'process', 'control', ',', 'and', 'special', 'design', 'tactics', 'to', 'improve', 'detection', 'while', 'simplifying', 'changes', '.', 'origin', 'a', 'computer', 'log', 'entry', 'from', 'the', 'mark', '&', 'nbsp', ';', 'ii', ',', 'wi

In [5]:
## q3
bow_simple = Counter(lower_tokens)
print(len(bow_simple))
print(bow_simple)

958
Counter({',': 151, 'the': 150, '.': 89, 'of': 81, "''": 66, 'to': 63, 'a': 60, '``': 47, 'in': 44, 'and': 41, 'debugging': 40, '(': 40, ')': 40, ':': 31, 'for': 26, 'is': 25, 'or': 25, 'be': 24, '{': 22, '}': 22, 'as': 21, 'system': 19, 'it': 18, 'can': 17, 'software': 16, 'that': 14, 'on': 14, 'tools': 14, 'by': 13, 'process': 12, 'computer': 12, 'are': 12, 'used': 12, 'bug': 11, 'http': 11, 'term': 11, 'such': 11, 'debugger': 11, 'from': 10, '[': 10, ']': 10, 'problem': 10, 'program': 10, '<': 10, '>': 10, '|': 10, 'programming': 9, 'an': 9, 'not': 9, 'some': 9, 'with': 8, 'was': 8, 'at': 8, 'this': 8, 'code': 8, 'example': 8, 'check': 8, 'have': 7, 'also': 7, 'techniques': 7, 'where': 7, 'which': 7, 'may': 6, 'acm': 6, 'systems': 6, 'more': 6, 'bugs': 6, 'would': 6, 'language': 6, 'make': 6, 'these': 6, 'test': 6, 'ref': 6, 'name=': 6, 'see': 5, 'memory': 5, 'they': 5, 'debug': 5, 'article': 5, "'s": 5, 'errors': 5, 'but': 5, '?': 5, 'if': 5, 'problems': 5, 'hardware': 5, 'progr

In [6]:
## q4 (shows the top 20 here; pass 10 for the count the exercise asks for)
bow_simple.most_common(20)

[(',', 151),
 ('the', 150),
 ('.', 89),
 ('of', 81),
 ("''", 66),
 ('to', 63),
 ('a', 60),
 ('``', 47),
 ('in', 44),
 ('and', 41),
 ('debugging', 40),
 ('(', 40),
 (')', 40),
 (':', 31),
 ('for', 26),
 ('is', 25),
 ('or', 25),
 ('be', 24),
 ('{', 22),
 ('}', 22)]

### Simple text preprocessing

Now, we will cover some simple text preprocessing.

###  Why preprocess?

Text preprocessing helps make for better input data when performing machine learning or other statistical methods. 

For example, in the last few exercises you have applied small bits of preprocessing (like `tokenization`) to create a bag of words. You also noticed that applying simple techniques like lowercasing all of the tokens can lead to slightly better results for a bag-of-words model. Preprocessing steps like `tokenization` or `lowercasing` words are commonly used in NLP. 

Other common techniques include `lemmatization` or `stemming`, where you `shorten` the `words` to their `root stems`; `removing stop words`, which are common words in a language that don't carry a lot of meaning, such as `and` or `the`; and `removing punctuation` or `unwanted tokens`. 

Of course, each model and process will have different results -- so it's good to try a few different approaches to preprocessing and see which works best for your task and goal.

###  Text preprocessing with Python

We can perform text preprocessing using many of the tools we already know and have learned. 

<img src="c-1.jpg" style="max-width:600px">

In this code, we are using the same text as from our previous example, a few sentences about a cat with a box. 


- We can use a list comprehension to tokenize the text, which we first make lowercase using the string `lower()` method. The string `isalpha()` method returns True if the string contains only alphabetic characters. We use `isalpha()` in an if clause while iterating over the tokenized result so that only alphabetic strings are kept (this effectively strips tokens containing numbers or punctuation). Reading the code out in English: take each token from the `word_tokenize` output of the lowercased text if it contains only alphabetic characters. 


- In the next line, we use another list comprehension to remove words that are in the stop words list. This stop words list for English ships with NLTK as a downloadable corpus (`nltk.corpus.stopwords`). 


- Finally, we can create a Counter and check the two most common words, which are now cat and box (unlike the and box, which were the two tokens returned in our first result). Preprocessing has already improved our bag of words and made it more useful by removing the stop words and non-alphabetic tokens.
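The three steps above can be sketched with the standard library alone. The text, the tiny stop word set, and the regular-expression tokenizer are assumptions for illustration; in the notebook these roles are played by the figure's sentences, NLTK's English stop words, and `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical text and a tiny hand-picked stop word set, for illustration only
text = "The cat is in the box. The cat likes the box. The box is over the cat."
stop_words = {"the", "is", "in", "over"}

# Stand-in tokenizer: words and single punctuation marks as separate tokens
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Keep alphabetic tokens only, then drop stop words
alpha_only = [t for t in tokens if t.isalpha()]
no_stops = [t for t in alpha_only if t not in stop_words]

print(Counter(no_stops).most_common(2))
```

Because `isalpha()` is applied per token, the period tokens disappear before the stop word filter runs.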

In [7]:
print(len(lower_tokens))
print(lower_tokens)

2910
["'", "''", 'debugging', "''", "'", 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', '.', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', '(', 'see', 'below', ':', '#', 'further', 'reading|further', 'reading', ')', ',', 'as', 'it', 'involves', 'numerous', 'aspects', ',', 'including', 'interactive', 'debugging', ',', 'control', 'flow', ',', 'integration', 'testing', ',', 'logfile|log', 'files', ',', 'monitoring', '(', 'application', 'monitoring|application', ',', 'system', 'monitoring|system', ')', ',', 'memory', 'dumps', ',', 'profiling', '(', 'computer', 'programming', ')', '|profiling', ',', 'statistical', 'process', 'control', ',', 'and', 'special', 'design', 'tactics', 'to', 'improve', 'detection', 'while', 'simplifying', 'changes', '.', 'origin', 'a', 'computer', 'log', 'entry', 'from', 'the', 'mark', '&', 'nbsp', ';', 'ii', ',', 'wi

In [8]:
english_stops = ['i','me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 
                 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 
                 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am' ,
                 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 
                 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 
                 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 
                 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 
                 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 
                 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 
                 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 
                 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 
                 'weren', 'won', 'wouldn', '']

## Exercise 2: Text preprocessing practice

Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.

You start with the same tokens you created in the last exercise: lower_tokens. You also have the Counter class imported.

- Import the `WordNetLemmatizer` class from `nltk.stem`. Create a list `alpha_only` that contains only the `alphabetical` tokens from `lower_tokens`. You can use the `.isalpha()` method to check for this.


- Create another list called `no_stops` consisting of words from `alpha_only` that are not contained in english_stops.


- Initialize a `WordNetLemmatizer` object called `wordnet_lemmatizer` and use its `.lemmatize()` method on the `tokens` in `no_stops` to create a new list called `lemmatized`.


- Create a new `Counter` called bow with the `lemmatized` words. Lastly, print the 10 most common tokens.

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
alpha_only = [w for w in lower_tokens if w.isalpha()]
print(len(alpha_only))
print(alpha_only)

2144
['debugging', 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', 'see', 'below', 'further', 'reading', 'as', 'it', 'involves', 'numerous', 'aspects', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'files', 'monitoring', 'application', 'system', 'memory', 'dumps', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'and', 'special', 'design', 'tactics', 'to', 'improve', 'detection', 'while', 'simplifying', 'changes', 'origin', 'a', 'computer', 'log', 'entry', 'from', 'the', 'mark', 'nbsp', 'ii', 'with', 'a', 'moth', 'taped', 'to', 'the', 'page', 'the', 'terms', 'bug', 'and', 'debugging', 'are', 'popularly', 'attributed', 'to', 'admiral', 'grace', 'hopper', 'in', 'the', 'http', 'grace', 'hopper', 'from', 'foldoc', 'while', 'she', 'w

In [12]:
# # Remove all stop words: no_stops
# no_stops = [t for t in alpha_only if t not in english_stops]
# print(len(no_stops))
# print(no_stops)

In [13]:
#stopwords.words('english')

In [14]:
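# Build the stop word set once; calling stopwords.words() for every token is slow
stop_set = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in stop_set]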
print(len(no_stops))
print(no_stops)

1271
['debugging', 'process', 'finding', 'resolving', 'defects', 'prevent', 'correct', 'operation', 'computer', 'software', 'system', 'numerous', 'books', 'written', 'debugging', 'see', 'reading', 'involves', 'numerous', 'aspects', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'files', 'monitoring', 'application', 'system', 'memory', 'dumps', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'special', 'design', 'tactics', 'improve', 'detection', 'simplifying', 'changes', 'origin', 'computer', 'log', 'entry', 'mark', 'nbsp', 'ii', 'moth', 'taped', 'page', 'terms', 'bug', 'debugging', 'popularly', 'attributed', 'admiral', 'grace', 'hopper', 'http', 'grace', 'hopper', 'foldoc', 'working', 'harvard', 'mark', 'ii', 'computer', 'harvard', 'university', 'associates', 'discovered', 'moth', 'stuck', 'relay', 'thereby', 'impeding', 'operation', 'whereupon', 'remarked', 'debugging', 'system', 'however', 'term', 'bug', 'meaning', 

In [15]:
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
print(len(lemmatized))
print(lemmatized)

1271
['debugging', 'process', 'finding', 'resolving', 'defect', 'prevent', 'correct', 'operation', 'computer', 'software', 'system', 'numerous', 'book', 'written', 'debugging', 'see', 'reading', 'involves', 'numerous', 'aspect', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'file', 'monitoring', 'application', 'system', 'memory', 'dump', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'special', 'design', 'tactic', 'improve', 'detection', 'simplifying', 'change', 'origin', 'computer', 'log', 'entry', 'mark', 'nbsp', 'ii', 'moth', 'taped', 'page', 'term', 'bug', 'debugging', 'popularly', 'attributed', 'admiral', 'grace', 'hopper', 'http', 'grace', 'hopper', 'foldoc', 'working', 'harvard', 'mark', 'ii', 'computer', 'harvard', 'university', 'associate', 'discovered', 'moth', 'stuck', 'relay', 'thereby', 'impeding', 'operation', 'whereupon', 'remarked', 'debugging', 'system', 'however', 'term', 'bug', 'meaning', 'technica

In [16]:
bow = Counter(lemmatized)
print(bow)

Counter({'debugging': 40, 'system': 25, 'bug': 17, 'software': 16, 'problem': 15, 'tool': 15, 'computer': 14, 'process': 13, 'term': 13, 'debugger': 13, 'used': 12, 'http': 11, 'program': 11, 'programming': 9, 'technique': 9, 'language': 9, 'error': 8, 'code': 8, 'example': 8, 'check': 8, 'also': 7, 'make': 7, 'programmer': 6, 'may': 6, 'acm': 6, 'would': 6, 'user': 6, 'case': 6, 'test': 6, 'ref': 6, 'see': 5, 'memory': 5, 'change': 5, 'debug': 5, 'article': 5, 'anomaly': 5, 'hardware': 5, 'task': 5, 'execution': 5, 'source': 5, 'wolf': 5, 'cite': 5, 'defect': 4, 'control': 4, 'testing': 4, 'dump': 4, 'design': 4, 'hopper': 4, 'early': 4, 'proceeding': 4, 'use': 4, 'computing': 4, 'national': 4, 'impact': 4, 'word': 4, 'might': 4, 'determine': 4, 'often': 4, 'value': 4, 'variable': 4, 'original': 4, 'state': 4, 'tracing': 4, 'different': 4, 'fence': 4, 'algorithm': 4, 'embedded': 4, 'interactive': 3, 'file': 3, 'mark': 3, 'moth': 3, 'grace': 3, 'journal': 3, 'first': 3, 'meeting': 3, '

In [17]:
bow.most_common(20)

[('debugging', 40),
 ('system', 25),
 ('bug', 17),
 ('software', 16),
 ('problem', 15),
 ('tool', 15),
 ('computer', 14),
 ('process', 13),
 ('term', 13),
 ('debugger', 13),
 ('used', 12),
 ('http', 11),
 ('program', 11),
 ('programming', 9),
 ('technique', 9),
 ('language', 9),
 ('error', 8),
 ('code', 8),
 ('example', 8),
 ('check', 8)]

### Introduction to gensim

Now, we will get started using a new tool called Gensim.

### What is gensim?

**Gensim** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks like building `document` or `word vectors` and `corpora`, and performing `topic identification` and `document comparisons`.

### What is a word vector?

A word embedding or vector is trained from a larger corpus and is a multi-dimensional representation of a word or document. You can think of it as a multi-dimensional array of numbers. Learned embeddings are usually dense (most entries are non-zero), whereas count-based vectors such as bag-of-words are sparse (lots of zeros). With these vectors, we can then see relationships among the words or documents based on how near or far they are from each other and which comparisons come out similar. 

Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.

<img src="wv.jpg" style="max-width:600px">

For example, in this graphic we can see that the vector operation king minus queen is approximately equal to man minus woman. Or that Spain is to Madrid as Italy is to Rome. The deep learning algorithm used to create word vectors has been able to distill this meaning based on how those words are used throughout the text.
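To make the vector arithmetic concrete, here is a toy sketch. The two-dimensional vectors and their values are invented for illustration (one axis loosely standing for "royalty", the other for "maleness"); real word vectors are learned from a corpus and have many more dimensions.

```python
# Toy 2-D vectors with invented values: [royalty, maleness].
# Real embeddings are learned from text and are not hand-assigned like this.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
}

def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

royal_diff = subtract(vectors["king"], vectors["queen"])
common_diff = subtract(vectors["man"], vectors["woman"])

# Both differences point along the same axis, which is the sense in which
# "king minus queen" resembles "man minus woman"
print(royal_diff)
print(common_diff)
```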


### Gensim example

<img src="wv-1.jpg" style="max-width:600px">

The graphic we have here is an example of an LDA visualization. LDA stands for Latent Dirichlet Allocation, and it is a statistical model we can apply to text using Gensim for topic analysis and modelling. This graph is just a portion of a blog post written in 2015 that used Gensim to analyze US presidential addresses.


### Creating a gensim dictionary

Gensim allows you to build corpora and dictionaries using simple classes and functions. 

A corpus (or if plural, corpora) is a set of texts used to help perform natural language processing tasks. Here, our documents are a list of strings that look like movie reviews about space or sci-fi films. 

<img src="g.jpg" style="max-width:600px">

- First we need to do some basic preprocessing. For brevity, we will only tokenize and lowercase. 


- For better results, we would want to apply more of the preprocessing we have learned in this chapter, such as removing punctuation and stop words. 


- Then we can pass the tokenized documents to the Gensim Dictionary class. This will create a mapping with an id for each token. This is the beginning of our corpus. We now can represent whole documents using just a list of their token ids and how often those tokens appear in each document. 


-  We can take a look at the tokens and their ids by looking at the token2id attribute, which is a dictionary of all of our tokens and their respective ids in our new dictionary.
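The idea of a token-to-id mapping can be sketched in plain Python. This illustrates the concept only and is not Gensim's implementation; the two documents and the regular-expression tokenizer are assumptions, and the ids Gensim assigns to real data may differ.

```python
import re

# Hypothetical mini-corpus; the regex is a stand-in for word_tokenize
docs = ["I liked the movie!", "The movie was awful."]
tokenized = [re.findall(r"\w+|[^\w\s]", d.lower()) for d in docs]

# Give each previously unseen token the next free integer id
token2id = {}
for doc in tokenized:
    for token in sorted(set(doc)):
        if token not in token2id:
            token2id[token] = len(token2id)

print(token2id)
```

Tokens shared between documents, such as `movie`, keep the id they received the first time they were seen.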

### Creating a gensim corpus

Using the dictionary we built, we can then create a Gensim corpus. This is a bit different than a normal corpus -- which is just a collection of documents. 

Gensim uses a simple bag-of-words model which transforms each document into a bag of words using the token ids and the frequency of each token in the document. 

<img src="g-2.jpg" style="max-width:600px">

- Here, we can see that the Gensim corpus is a list of lists, each list item representing one document. 


- Each document is a series of tuples, the first item representing the token id from the dictionary and the second item representing the token frequency in the document. 


In only a few lines, we have a new bag-of-words model and corpus thanks to Gensim. And unlike our previous Counter-based bag of words, this Gensim model can be easily saved, updated and reused thanks to the extra tools we have available in Gensim. 

Our dictionary can also be updated with new texts, and it can be filtered to keep only words that meet particular thresholds. We are building a more advanced and feature-rich bag-of-words model which can then be used for future exercises.
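The `(token id, frequency)` representation described above can be sketched without Gensim. The mapping and the helper `to_bow` are hypothetical names used for illustration; the helper mimics the shape of the output, not Gensim's internals.

```python
from collections import Counter

# Hypothetical token-to-id mapping, standing in for dictionary.token2id
token2id = {"the": 0, "movie": 1, "was": 2, "great": 3}

def to_bow(tokens, mapping):
    """Return sorted (token_id, count) pairs for tokens present in mapping."""
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return sorted(counts.items())

# "sequel" is not in the mapping, so it is skipped in this sketch
doc = ["the", "movie", "was", "great", "great", "sequel"]
print(to_bow(doc, token2id))
```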

In [28]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = ['The movie was about a spaceship and aliens.', 'I really liked the movie!', 
                'Awesome action scenes, but boring characters.','The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.', 'More space films, please!']

In [34]:
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
print(dictionary)
print(dictionary.token2id)

Dictionary(29 unique tokens: ['.', 'a', 'about', 'aliens', 'and']...)
{'.': 0, 'a': 1, 'about': 2, 'aliens': 3, 'and': 4, 'movie': 5, 'spaceship': 6, 'the': 7, 'was': 8, '!': 9, 'i': 10, 'liked': 11, 'really': 12, ',': 13, 'action': 14, 'awesome': 15, 'boring': 16, 'but': 17, 'characters': 18, 'scenes': 19, 'alien': 20, 'awful': 21, 'films': 22, 'hate': 23, 'cool': 24, 'is': 25, 'space': 26, 'more': 27, 'please': 28}


In [75]:
## Getting the id of a token
## suppose we want the id of token 'boring', 'aliens'

boring_id = dictionary.token2id.get('boring')
print(boring_id)

aliens_id = dictionary.token2id.get('aliens')
print(aliens_id)

16
3


In [52]:
from collections import defaultdict
import itertools

In [53]:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(corpus)
print(len(corpus))

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)], [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]
6


In [54]:
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
print(total_word_count)

defaultdict(<class 'int'>, {0: 4, 1: 1, 2: 1, 3: 1, 4: 1, 5: 4, 6: 1, 7: 4, 8: 2, 9: 4, 10: 3, 11: 2, 12: 1, 13: 2, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 2, 23: 1, 24: 1, 25: 1, 26: 2, 27: 1, 28: 1})


In [82]:
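# Same totals with a plain dict instead of defaultdict
total_word_count = {}

for doc in corpus:
    for word_id, word_count in doc:
        # Add the token's count in this document, not just 1 per document
        if word_id not in total_word_count:
            total_word_count[word_id] = word_count
        else:
            total_word_count[word_id] += word_count
print(total_word_count)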

{0: 4, 1: 1, 2: 1, 3: 1, 4: 1, 5: 4, 6: 1, 7: 4, 8: 2, 9: 4, 10: 3, 11: 2, 12: 1, 13: 2, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 2, 23: 1, 24: 1, 25: 1, 26: 2, 27: 1, 28: 1}


In [86]:
# Create a list of (word_id, count) pairs sorted by count, highest first
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 
#print(sorted_word_count)

# Print every word across all documents alongside its count
# (slice sorted_word_count[:5] to show only the top 5)
for word_id, word_count in sorted_word_count:
    print(dictionary.get(word_id), word_count)

. 4
movie 4
the 4
! 4
i 3
was 2
liked 2
, 2
films 2
space 2
a 1
about 1
aliens 1
and 1
spaceship 1
really 1
action 1
awesome 1
boring 1
but 1
characters 1
scenes 1
alien 1
awful 1
hate 1
cool 1
is 1
more 1
please 1


In [81]:
for doc in corpus:
    bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)
    print(bow_doc)
    for word_id, word_count in bow_doc:
        print(dictionary.get(word_id), word_count)


[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
. 1
a 1
about 1
aliens 1
and 1
movie 1
spaceship 1
the 1
was 1
[(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
movie 1
the 1
! 1
i 1
liked 1
really 1
[(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]
. 1
, 1
action 1
awesome 1
boring 1
but 1
characters 1
scenes 1
[(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)]
. 1
movie 1
the 1
was 1
! 1
i 1
alien 1
awful 1
films 1
hate 1
[(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)]
. 1
movie 1
the 1
! 1
i 1
liked 1
cool 1
is 1
space 1
[(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]
! 1
, 1
films 1
space 1
more 1
please 1


### What is tf-idf?

**Tf-idf** stands for term frequency - inverse document frequency. It is a commonly used natural language processing model that helps you determine the most important words in each document in the corpus. 

The idea behind tf-idf is that each corpus might have more shared words than just stopwords. These common words are like stopwords and should be removed or at least down-weighted in importance. 

For example, if I am an astronomer, 'sky' might be used often but is not important, so I want to downweight that word. Tf-idf does precisely that. It will take texts that share common language and ensure the most common words across the entire corpus don't show up as keywords. 

Tf-idf helps keep the document-specific frequent words weighted high and the common words across the entire corpus weighted low.

### Tf-idf formula

<img src="tf.jpg" style="max-width:600px">

The equation to calculate the weights can be outlined like so: The weight of token i in document j is calculated by taking the term frequency (or how many times the token appears in the document) multiplied by the log of the total number of documents divided by the number of documents that contain the same term. 

Let's unpack this a bit. First, the weight will be low if the term doesn't appear often in the document, because the tf variable will then be low. However, the weight will also be low if the logarithm is close to zero, meaning the ratio inside it is close to one. That happens when the total number of documents divided by the number of documents that contain the term is close to one. So words that occur across many or all documents will have a very low tf-idf weight. Conversely, if the word only occurs in a few documents, the logarithm will return a higher number.
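Written out from the verbal description above, the weight can be expressed as follows. The symbol names are an assumption and may differ from the notation in the figure.

$$
w_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right)
$$

Here $w_{i,j}$ is the weight of token $i$ in document $j$, $\mathrm{tf}_{i,j}$ is how often token $i$ appears in document $j$, $N$ is the total number of documents, and $\mathrm{df}_i$ is the number of documents that contain token $i$. When a token appears in every document, $\mathrm{df}_i = N$, so the ratio is 1 and $\log(1) = 0$ gives a weight of zero regardless of the term frequency.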

## Exercise: 

You want to calculate the tf-idf weight for the word "computer", which appears 5 times in a document containing 100 words. 

Given a corpus containing 200 documents, with 20 documents mentioning the word "computer", tf-idf can be calculated by multiplying term frequency with inverse document frequency.

- Term frequency = percentage share of the word compared to all tokens in the document

- Inverse document frequency = logarithm of the total number of documents in a corpus divided by the number of documents containing the term

In [95]:
tf = 5/100
N = 200
df = 20

In [94]:
import numpy as np

In [96]:
# np.log is the natural logarithm; another base would rescale the weight
w = tf * np.log(N / df)
w

0.1151292546497023

In [87]:
from gensim.models.tfidfmodel import TfidfModel

In [88]:
my_documents = ['The movie was about a spaceship and aliens.', 'I really liked the movie!', 
                'Awesome action scenes, but boring characters.','The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.', 'More space films, please!']

tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

### Tf-idf with gensim

You can build a tf-idf model using Gensim and the corpus you developed previously. Taking the movie review corpus from the previous section, we can turn the bag-of-words corpus into a tf-idf model by simply passing it to `TfidfModel` at initialization. We can then look up the weights for each document by indexing the new tfidf model with that document's bag of words, much like using a dictionary key. 

In [101]:
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)
tfidf_weight = tfidf[corpus[1]] 
tfidf_weight

[(5, 0.1746298276735174),
 (7, 0.1746298276735174),
 (9, 0.1746298276735174),
 (10, 0.29853166221463673),
 (11, 0.47316148988815415),
 (12, 0.7716931521027908)]

For the second document in our corpus, we see the token weights along with the token ids. Notice there are some large differences! Token id 12 has a weight of 0.77, whereas tokens 5, 7 and 9 have weights below 0.18. These weights can help you determine good topics and keywords for a corpus with shared vocabulary.
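The weights printed above can be reproduced by hand under two assumptions about the default `TfidfModel` settings: a base-2 logarithm for the inverse document frequency, and scaling of each document vector to unit length. Treat this as a sketch under those assumptions and not as a description of Gensim's internals. The document frequencies are read off the bag-of-words corpus printed earlier.

```python
import math

n_docs = 6
# Number of documents containing each token id of the second document,
# read off the printed corpus (every count in that document is 1)
doc_freq = {5: 4, 7: 4, 9: 4, 10: 3, 11: 2, 12: 1}

# Raw weight: term frequency (1 here) times log base 2 of N / df
raw = {tid: 1 * math.log2(n_docs / df) for tid, df in doc_freq.items()}

# Scale the document vector to unit Euclidean length
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {tid: w / norm for tid, w in raw.items()}

for tid, w in weights.items():
    print(tid, round(w, 4))
```

This also explains why these values are not directly comparable to the hand calculation in the exercise, which uses a natural logarithm, a relative term frequency, and no normalization.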

Now it's your turn to determine new significant terms for your corpus by applying gensim's tf-idf. You will again have access to the same corpus and dictionary objects you created in the previous exercises - dictionary, corpus, and doc.

In [102]:
# Print the first 3 tf-idf weights of each document
for doc in corpus:
    print(tfidf[doc][0:3])

[(0, 0.0962338424792598), (1, 0.4252595231361236), (2, 0.4252595231361236)]
[(5, 0.1746298276735174), (7, 0.1746298276735174), (9, 0.1746298276735174)]
[(0, 0.08926151827345048), (13, 0.2418550916450883), (14, 0.3944486650167261)]
[(0, 0.11167183378630395), (5, 0.11167183378630395), (7, 0.11167183378630395)]
[(0, 0.12839429999391858), (5, 0.12839429999391858), (7, 0.12839429999391858)]
[(9, 0.1269183979212522), (13, 0.34388683224786987), (22, 0.34388683224786987)]
