### Word counts with bag-of-words
#### Bag-of-words
- A basic method for finding topics in a text
- First split the text into tokens (tokenization)
- Then count how often each token occurs
- The more frequent a word, the more important it might be
- Can be a good way to find the significant words in a text

In [2]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(word_tokenize(
        '''The cat is in the box. The cat likes the box.
        The box is over the cat.''')).most_common(2)

[('The', 3), ('cat', 3)]
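
In the output above, `'The'` and `'the'` are counted as separate tokens, which is why the capitalized form only ties with `'cat'`. A minimal sketch of how lowercasing merges them is shown here. It uses a plain regular expression as a stand-in for `word_tokenize` (an assumption, made only to keep the example dependency-free); unlike `word_tokenize`, this stand-in also drops punctuation.

```python
import re
from collections import Counter

text = '''The cat is in the box. The cat likes the box.
The box is over the cat.'''

# Stand-in tokenizer (assumption): runs of letters only, after lowercasing
tokens = re.findall(r"[a-z]+", text.lower())

counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)  # [('the', 6), ('cat', 3)]
```

Once case is normalized, `'the'` clearly dominates, which hints that frequency alone is a weak signal of importance until common function words are removed.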

In [5]:
# Read the article; the with-block closes the file automatically
with open('wikipedia_articles/wiki_text_debugging.txt', 'r') as f:
    article = f.read()
#article

In [6]:
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]
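
The top of this list is dominated by punctuation and function words, which say little about the article's topic. As a quick check, the sketch below filters the printed pairs with `str.isalpha()`. The list is copied from the output above, and the variable names are illustrative.

```python
# The ten most common tokens, copied from the output above
top10 = [(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66),
         ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]

# Drop punctuation and quote tokens: keep purely alphabetic strings
alpha_pairs = [(tok, n) for tok, n in top10 if tok.isalpha()]
print(alpha_pairs)
# [('the', 150), ('of', 81), ('to', 63), ('a', 60), ('in', 44), ('and', 41)]
```

Every surviving token is a common function word, so removing punctuation alone is not enough to surface topical words.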


### Simple text preprocessing

#### Why preprocess
- Produces better input data
 - For machine learning or other statistical methods
- Examples:
 - Tokenization to create a bag of words
 - Lowercasing words
 - Lemmatization/stemming
  - Shorten words to their root forms
 - Removing stop words, punctuation, or unwanted tokens
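
To illustrate the stemming step, here is a minimal sketch with NLTK's `PorterStemmer`, which needs no corpus download. The word list is made up for illustration and is not taken from the article.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words (assumption: chosen for the example, not from the article)
words = ['debugging', 'cats', 'likes', 'tools']

# Reduce each word to its stem by rule-based suffix stripping
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['debug', 'cat', 'like', 'tool']
```

Stems come from suffix-stripping rules, so they are not guaranteed to be dictionary words. A lemmatizer looks words up in a lexicon instead, which is slower but returns real word forms.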

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\msmith7\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [13]:
from nltk.corpus import stopwords
text = '''The cat is in the box. The cat likes the box.
            The box is over the cat.'''
# Lowercase, tokenize, and keep only alphabetic tokens
tokens = [w for w in word_tokenize(text.lower())
         if w.isalpha()]
# Load the stop words once, as a set, instead of once per token
stops = set(stopwords.words('english'))
no_stops = [t for t in tokens
           if t not in stops]
Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]

In [16]:
english_stops = stopwords.words('english')
#english_stops

In [20]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\msmith7\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [21]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('debugging', 40), ('system', 25), ('bug', 17), ('software', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]
