### Word counts with bag-of-words
#### Bag-of-words
- Basic method for finding topics in a text
- Need to first create tokens using tokenization
- ...and then count up all the tokens 
- The more frequent a word, the more important it might be 
- Can be a great way to determine the significant words in a text 

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(word_tokenize(
        '''The cat is in the box. The cat likes the box.
        The box is over the cat.''')).most_common(2)

[('The', 3), ('cat', 3)]
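
Note that the count above treats `'The'` and `'the'` as different tokens. A minimal sketch of what lowercasing first changes, using only the standard library; the regular expression is a rough stand-in for `word_tokenize` (an assumption for illustration, not its actual behavior):

```python
import re
from collections import Counter

text = '''The cat is in the box. The cat likes the box.
The box is over the cat.'''

# Rough tokenizer (illustrative assumption): words or single punctuation marks
raw_tokens = re.findall(r"\w+|[^\w\s]", text)

# Case-sensitive counts keep 'The' and 'the' apart
cased = Counter(raw_tokens)

# Lowercasing first merges them into one entry
lowered = Counter(t.lower() for t in raw_tokens)

print(cased['The'], cased['the'])   # 3 3
print(lowered.most_common(2))       # [('the', 6), ('cat', 3)]
```

Once lowercased, the most frequent token is a stop word, which is one reason stop words are usually filtered out before counting.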

In [2]:
# Read the article; the context manager closes the file afterwards
with open('wikipedia_articles/wiki_text_debugging.txt', 'r') as f:
    article = f.read()
#article

In [3]:
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]


### Simple text preprocessing

#### Why preprocess
- Helps make for better input data
 - When performing machine learning or other statistical methods
- Examples:
 - Tokenization to create a bag of words
 - Lowercasing words
- Lemmatization/Stemming
 - Shorten words to their root stems
- Removing stop words, punctuation or unwanted tokens

In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\msmith7\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
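from nltk.corpus import stopwords
text = '''The cat is in the box. The cat likes the box.
            The box is over the cat.'''
# Lowercase, tokenize, and keep alphabetic tokens only
tokens = [w for w in word_tokenize(text.lower())
         if w.isalpha()]
# Load the stop word list once instead of on every iteration
stop_words = set(stopwords.words('english'))
no_stops = [t for t in tokens
           if t not in stop_words]
Counter(no_stops).most_common(2)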

[('cat', 3), ('box', 3)]
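
The same filtering idea can be sketched without any NLTK downloads. This assumes a tiny hand-written stop list (`toy_stops` is hypothetical and much shorter than NLTK's English list) and a regular expression in place of `word_tokenize`:

```python
import re
from collections import Counter

text = '''The cat is in the box. The cat likes the box.
The box is over the cat.'''

# Hypothetical mini stop list; NLTK's English list is much longer
toy_stops = {'the', 'is', 'in', 'over'}

# Lowercase, then keep alphabetic runs only (this drops punctuation)
alpha_tokens = re.findall(r"[a-z]+", text.lower())

# A set gives average constant-time membership tests
kept = [t for t in alpha_tokens if t not in toy_stops]

top_two = Counter(kept).most_common(2)
print(top_two)  # [('cat', 3), ('box', 3)]
```

Storing the stop words in a set rather than a list matters little here, but it avoids scanning the whole list for every token on longer texts.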

In [6]:
english_stops = stopwords.words('english')
#english_stops

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\msmith7\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [8]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized).most_common(10)

# Print the 10 most common tokens
print(bow)

[('debugging', 40), ('system', 25), ('bug', 17), ('software', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]


### Introduction to gensim 

#### What is gensim?
- Popular open-source library
- Uses top academic models to perform complex tasks
 - Building document or word vectors 
 - Performing topic identification and document comparison
 

In [9]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize



In [10]:
my_documents = ['The movie war about a spaceship and aliens.',
               'I really liked the movie!',
               'Awesome action scenes, but boring characters.',
               'The movie was awful! I hate alien films.',
               'Space is cool! I liked the movie.',
               'More space films, please!']

tokenized_docs = [word_tokenize(doc.lower())
                 for doc in my_documents]

dictionary = Dictionary(tokenized_docs)

dictionary.token2id

{'.': 0,
 'a': 1,
 'about': 2,
 'aliens': 3,
 'and': 4,
 'movie': 5,
 'spaceship': 6,
 'the': 7,
 'war': 8,
 '!': 9,
 'i': 10,
 'liked': 11,
 'really': 12,
 ',': 13,
 'action': 14,
 'awesome': 15,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'scenes': 19,
 'alien': 20,
 'awful': 21,
 'films': 22,
 'hate': 23,
 'was': 24,
 'cool': 25,
 'is': 26,
 'space': 27,
 'more': 28,
 'please': 29}

In [11]:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(0, 1),
  (5, 1),
  (7, 1),
  (9, 1),
  (10, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1)],
 [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (25, 1), (26, 1), (27, 1)],
 [(9, 1), (13, 1), (22, 1), (27, 1), (28, 1), (29, 1)]]
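
To make the `(token_id, count)` pairs above concrete, here is a rough pure-Python approximation of the mapping and the bag-of-words conversion. The helper names `build_token2id` and `to_bow` are hypothetical and this is only a sketch of the idea, not gensim's implementation:

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each unique token, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # Unseen tokens from each document get the next free ids, in sorted order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the mapping."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy pre-tokenized documents (made up for illustration)
toy_docs = [['the', 'movie', 'was', 'good', '.'],
            ['i', 'liked', 'the', 'movie', 'movie', '!']]

toy_token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_docs]

print(toy_token2id['movie'])  # 2
print(toy_corpus[1])          # [(2, 2), (3, 1), (5, 1), (6, 1), (7, 1)]
```

Each document becomes a sparse list: only tokens that actually occur are stored, together with how often they occur.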

- `gensim` models can be easily saved, updated, and reused
- Our dictionary can also be updated
- This more advanced, feature-rich bag-of-words can be used in future exercises

In [12]:
import glob

In [13]:
filenames = glob.glob('wikipedia_articles/wiki_text_*.txt')
articles = []
for filename in filenames:
    # Read each article; the context manager closes the file afterwards
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    tokens = word_tokenize(text)
    article = [t.lower() for t in tokens]
    articles.append(article)
len(articles)

12

In [14]:
# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a bag-of-words corpus (a list of doc2bow results): corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

computer
[(0, 2), (1, 1), (4, 4), (5, 3), (6, 66), (14, 5), (16, 40), (17, 40), (18, 151), (19, 1)]


In [15]:
from collections import defaultdict
import itertools

In [16]:
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

, 151
the 150
. 89
of 81
'' 66


In [17]:
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

, 3065
the 2573
. 1900
of 1580
{ 1347
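
The cross-document totals above come from summing `(word_id, word_count)` pairs over every document. A small sketch of that aggregation on a made-up corpus (`toy_corpus` is hypothetical), using `Counter` so that `most_common` handles the sorting:

```python
from collections import Counter
from itertools import chain

# Made-up corpus: three documents as lists of (word_id, word_count) pairs
toy_corpus = [[(0, 2), (1, 1)],
              [(0, 1), (2, 4)],
              [(1, 3), (2, 1)]]

# Flatten all documents into one stream of pairs and sum counts per id
totals = Counter()
for word_id, word_count in chain.from_iterable(toy_corpus):
    totals[word_id] += word_count

top_two = totals.most_common(2)
print(top_two)  # [(2, 5), (1, 4)]
```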


### Tf-idf with gensim

#### What is tf-idf?
- Term frequency - inverse document frequency 
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stopwords
- These words should be down-weighted in importance
- Example from astronomy: 'Sky'
- Ensures most common words don't show up as key words
- Keeps document specific words weighted high

$$ w_{i,j} = tf_{i,j} \cdot \log\frac{N}{df_i} $$
<br>
$$ w_{i,j} = \text{tf-idf weight for token } i \text{ in document } j $$
$$ tf_{i,j} = \text{number of occurrences of token } i \text{ in document } j $$
$$ df_i = \text{number of documents that contain token } i $$
$$ N = \text{total number of documents} $$
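
The formula can be applied directly in a few lines of Python. This is a minimal sketch that assumes the natural logarithm and no normalization; the function name `tfidf_weights` and the toy documents are made up for illustration:

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Compute w[i,j] = tf[i,j] * log(N / df[i]) for every document."""
    n_docs = len(tokenized_docs)

    # df: number of documents that contain each token
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))

    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # occurrences of each token in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

toy_docs = [['the', 'sky', 'is', 'blue'],
            ['the', 'sky', 'at', 'night'],
            ['the', 'telescope', 'telescope', 'lens']]

toy_weights = tfidf_weights(toy_docs)

print(toy_weights[0]['the'])  # 0.0, because 'the' appears in every document
print(toy_weights[0]['sky'] < toy_weights[0]['blue'])  # True
```

A token found in every document gets weight zero, while a token that is frequent in one document and rare elsewhere (such as `'telescope'`) is weighted highest. gensim's `TfidfModel` applies its own defaults (a base-2 logarithm and per-document length normalization, as far as its documentation describes), so its numbers will not match this sketch exactly.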

In [21]:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf[corpus[1]][:5]

[(0, 0.005716724833098836),
 (1, 0.012421833737107507),
 (2, 0.06296683861682904),
 (3, 0.0024852901124887204),
 (4, 0.016044868577097197)]

In [23]:
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

[(0, 0.008915040848624751), (1, 0.0028250007248460034), (4, 0.011300002899384013), (19, 0.0028250007248460034), (20, 0.011300002899384013)]
anti-debugging 0.19251324994879923
wolf 0.19251324994879923
debugging 0.17830081697249503
fence 0.15401059995903937
debugger 0.1181404686386986
