# Chapter 3: Tokenization and the Document-Term Matrix

## Instructions
- Run the cells with `assert` statements to check your answer. If a cell runs without error, your output matches the expected output. If your output is different, the error message will give you a hint.

In this notebook, you'll practice turning raw text into a standard format using both tokenization and CountVectorizer.

## 1. Tokenization

In this section, you'll be testing out the `sent_tokenize`, `word_tokenize` and `RegexpTokenizer` functions in `nltk`. Make sure to note the differences in the outputs.

In [1]:
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.util import ngrams

[nltk_data] Downloading package punkt to /Users/rita/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Let's get some text to work with
metis = 'We strive, we sweat, we swear. We go the extra mile.\
         We stage, we fail. We try again. Get it right. We learn.\
         Connect. Come together. This is Metis. -12/9/2013'
print(metis)

We strive, we sweat, we swear. We go the extra mile.         We stage, we fail. We try again. Get it right. We learn.         Connect. Come together. This is Metis. -12/9/2013


### Sentence Tokenization

Tokenize the `metis` text by sentence. Save your results in a variable called `sentences`.

In [3]:
### BEGIN SOLUTION
sentences = sent_tokenize(metis)
### END SOLUTION
sentences

['We strive, we sweat, we swear.',
 'We go the extra mile.',
 'We stage, we fail.',
 'We try again.',
 'Get it right.',
 'We learn.',
 'Connect.',
 'Come together.',
 'This is Metis.',
 '-12/9/2013']

In [4]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(sentences) == list, "The output of sent_tokenize() should be a list."
assert len(sentences) == 10, "There should be ten items in the list. Hint: use sent_tokenize()."
### END HIDDEN TESTS

The final item in the list is not a sentence. Remove the item from the `sentences` list.

In [5]:
### BEGIN SOLUTION
sentences.pop()
### END SOLUTION
sentences

['We strive, we sweat, we swear.',
 'We go the extra mile.',
 'We stage, we fail.',
 'We try again.',
 'Get it right.',
 'We learn.',
 'Connect.',
 'Come together.',
 'This is Metis.']

In [6]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(sentences) == list, "The output of sent_tokenize() should be a list."
assert len(sentences) == 9, "There should be nine sentences in the list. Hint: remove the last item with sentences.pop()."
### END HIDDEN TESTS

### Word Tokenization

Tokenize the `metis` text by word. First use `word_tokenize` and save your results in a variable called `words_wt`.

In [7]:
### BEGIN SOLUTION
words_wt = word_tokenize(metis)
### END SOLUTION
words_wt

['We',
 'strive',
 ',',
 'we',
 'sweat',
 ',',
 'we',
 'swear',
 '.',
 'We',
 'go',
 'the',
 'extra',
 'mile',
 '.',
 'We',
 'stage',
 ',',
 'we',
 'fail',
 '.',
 'We',
 'try',
 'again',
 '.',
 'Get',
 'it',
 'right',
 '.',
 'We',
 'learn',
 '.',
 'Connect',
 '.',
 'Come',
 'together',
 '.',
 'This',
 'is',
 'Metis',
 '.',
 '-12/9/2013']

In [8]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(words_wt) == list, "The output of word_tokenize() should be a list."
assert len(words_wt) == 42, "There should be 42 items in the list."
### END HIDDEN TESTS

Next use `RegexpTokenizer` to split on spaces (which is another way of tokenizing by word) and save your results in a variable called `words_re`.

In [9]:
### BEGIN SOLUTION
words_re = RegexpTokenizer(r"\s+", gaps=True).tokenize(metis)
### END SOLUTION
words_re

['We',
 'strive,',
 'we',
 'sweat,',
 'we',
 'swear.',
 'We',
 'go',
 'the',
 'extra',
 'mile.',
 'We',
 'stage,',
 'we',
 'fail.',
 'We',
 'try',
 'again.',
 'Get',
 'it',
 'right.',
 'We',
 'learn.',
 'Connect.',
 'Come',
 'together.',
 'This',
 'is',
 'Metis.',
 '-12/9/2013']

In [10]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(words_re) == list, "The output of RegexpTokenizer().tokenize() should be a list."
assert len(words_re) == 30, "There should be 30 items in the list."
### END HIDDEN TESTS

Note the differences between `words_wt` and `words_re`, specifically how punctuation is treated. Let's try removing punctuation altogether.

In [11]:
import re
import string
metis_no_punc = re.sub('[%s]' % re.escape(string.punctuation), ' ', metis)
metis_no_punc

'We strive  we sweat  we swear  We go the extra mile          We stage  we fail  We try again  Get it right  We learn          Connect  Come together  This is Metis   12 9 2013'
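To see what the pattern in the cell above does, here is a minimal sketch on a short made-up string (the `sample` text and the variable names are assumptions for illustration, not part of the exercise). `re.escape(string.punctuation)` escapes the punctuation characters so that the whole set can sit safely inside a regex character class, and each match is replaced with a single space.

```python
import re
import string

# Build a character class that matches any single ASCII punctuation character.
punct_pattern = '[%s]' % re.escape(string.punctuation)

# Hypothetical sample text, chosen only for illustration.
sample = "Get it right. -12/9/2013"

# Each punctuation character becomes one space, so word boundaries are preserved.
cleaned = re.sub(punct_pattern, ' ', sample)

print(cleaned.split())  # ['Get', 'it', 'right', '12', '9', '2013']
```

Replacing with a space rather than an empty string is what keeps `12/9/2013` from collapsing into a single token.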

Tokenize `metis_no_punc` using `word_tokenize`. Save the variable as `words_wt_no_punc`.

In [12]:
### BEGIN SOLUTION
words_wt_no_punc = word_tokenize(metis_no_punc)
### END SOLUTION
words_wt_no_punc

['We',
 'strive',
 'we',
 'sweat',
 'we',
 'swear',
 'We',
 'go',
 'the',
 'extra',
 'mile',
 'We',
 'stage',
 'we',
 'fail',
 'We',
 'try',
 'again',
 'Get',
 'it',
 'right',
 'We',
 'learn',
 'Connect',
 'Come',
 'together',
 'This',
 'is',
 'Metis',
 '12',
 '9',
 '2013']

In [13]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(words_wt_no_punc) == list, "The output of word_tokenize() should be a list."
assert len(words_wt_no_punc) == 32, "There should be 32 items in the list."
### END HIDDEN TESTS

Tokenize `metis_no_punc` using `RegexpTokenizer`. Save the variable as `words_re_no_punc`.

In [14]:
### BEGIN SOLUTION
words_re_no_punc = RegexpTokenizer(r"\s+", gaps=True).tokenize(metis_no_punc)
### END SOLUTION
words_re_no_punc

['We',
 'strive',
 'we',
 'sweat',
 'we',
 'swear',
 'We',
 'go',
 'the',
 'extra',
 'mile',
 'We',
 'stage',
 'we',
 'fail',
 'We',
 'try',
 'again',
 'Get',
 'it',
 'right',
 'We',
 'learn',
 'Connect',
 'Come',
 'together',
 'This',
 'is',
 'Metis',
 '12',
 '9',
 '2013']

In [15]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(words_re_no_punc) == list, "The output of RegexpTokenizer().tokenize() should be a list."
assert len(words_re_no_punc) == 32, "There should be 32 items in the list."
### END HIDDEN TESTS

Compare `words_wt_no_punc` and `words_re_no_punc`. Without the punctuation, the two tokenizers produce the same list of words. While this won't always be the case, it is an example of how two tokenizers can result in the same word lists.
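As a sketch of a case where two tokenization rules disagree, the example below uses only the standard-library `re` module as a stand-in for two tokenizers: one that splits on whitespace (roughly the rule used with the whitespace pattern and `gaps=True` above) and one that keeps runs of word characters. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing a hyphen and an apostrophe.
sample = "No filter in k-cup, don't bother"

# Rule 1: split on runs of whitespace; punctuation stays attached to words.
by_whitespace = [tok for tok in re.split(r"\s+", sample) if tok]

# Rule 2: keep only runs of word characters; punctuation acts as a separator.
by_word_chars = re.findall(r"\w+", sample)

print(by_whitespace)  # ['No', 'filter', 'in', 'k-cup,', "don't", 'bother']
print(by_word_chars)  # ['No', 'filter', 'in', 'k', 'cup', 'don', 't', 'bother']
```

The two lists only agree when the text contains no punctuation inside or next to words, which is why removing punctuation first made the results above identical.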

### N-grams

How many bi-grams are in the `metis_no_punc` string? Save the output as `num_bigrams`.

In [16]:
### BEGIN SOLUTION
num_bigrams = len(list(ngrams(word_tokenize(metis_no_punc),2)))
### END SOLUTION
num_bigrams

31

In [17]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert num_bigrams == 31, "That is not the correct number of bi-grams. Hint: ngrams(word_tokenize(text),2)"
### END HIDDEN TESTS

How many tri-grams are in the `metis_no_punc` string? Save the output as `num_trigrams`.

In [18]:
### BEGIN SOLUTION
num_trigrams = len(list(ngrams(word_tokenize(metis_no_punc),3)))
### END SOLUTION
num_trigrams

30

In [19]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert num_trigrams == 30, "That is not the correct number of tri-grams. Hint: ngrams(word_tokenize(text),3)"
### END HIDDEN TESTS
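The counts above follow from a simple rule: a sequence of N tokens contains N - n + 1 n-grams (for N >= n), so the 32 tokens of `metis_no_punc` give 31 bi-grams and 30 tri-grams. The sketch below checks that rule on a shorter token list with a small hand-written helper. `make_ngrams` is a hypothetical name used for illustration and is not the `nltk.util.ngrams` implementation.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n staggered views of the list; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "We try again Get it right".split()  # 6 tokens

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:2])    # [('We', 'try'), ('try', 'again')]
print(len(bigrams))   # 5, i.e. 6 - 2 + 1
print(len(trigrams))  # 4, i.e. 6 - 3 + 1
```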

## 2. Document-Term Matrix

In this section, you'll be playing around with `sklearn`'s `CountVectorizer`.

In [20]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
# Let's set up a dataframe to work with
df = pd.DataFrame([['a',5,"Grove Square Cappuccino Cups were excellent. Tasted really good right from the Keurig brewer with nothing added. wWould highly recommend. RCCJR"],
                  ['b',1,"I love my Keurig, and I love most of the Keurig coffees. This is instant coffee with instant milk and far too much sugar. I don't know anyone I dislike enough to dump the rest of the box on."],
                  ['c',1,"It's a powdered drink. No filter in k-cup.<br />Just buy it in bulk and mix it with hot water....<br /><br />Nothing else to say here. Wont be buying it again."],
                  ['d',1,"don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!"],
                  ['e',1,"Never tasted this coffee before, I felt much too sweet even for dessert. I would not order again. But then that is only my opinion. My friend's husband loves it.<br />I gave them to him."],
                  ['f',5,"My husband and I LOVE this French Vanilla Cappuccino. Sooo glad I didn't listen to some of the reviews and took the plunge and bought it."]],
                  columns=['users','stars','reviews'])
df

| | users | stars | reviews |
|---|---|---|---|
| 0 | a | 5 | Grove Square Cappuccino Cups were excellent. T... |
| 1 | b | 1 | I love my Keurig, and I love most of the Keuri... |
| 2 | c | 1 | It's a powdered drink. No filter in k-cup.<br ... |
| 3 | d | 1 | don't bother! bet you couldn't tell the differ... |
| 4 | e | 1 | Never tasted this coffee before, I felt much t... |
| 5 | f | 5 | My husband and I LOVE this French Vanilla Capp... |


In [22]:
corpus = list(df.reviews)
corpus

['Grove Square Cappuccino Cups were excellent. Tasted really good right from the Keurig brewer with nothing added. wWould highly recommend. RCCJR',
 "I love my Keurig, and I love most of the Keurig coffees. This is instant coffee with instant milk and far too much sugar. I don't know anyone I dislike enough to dump the rest of the box on.",
 "It's a powdered drink. No filter in k-cup.<br />Just buy it in bulk and mix it with hot water....<br /><br />Nothing else to say here. Wont be buying it again.",
 "don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!",
 "Never tasted this coffee before, I felt much too sweet even for dessert. I would not order again. But then that is only my opinion. My friend's husband loves it.<br />I gave them to him.",
 "My husband and I LOVE this French Vanilla Cappuccino. Sooo glad I didn't listen to some of the reviews and took the plunge and bought it."]

Create a document-term matrix from the `corpus` text using `CountVectorizer` and `pd.DataFrame`. Save the results in a variable called `dtm`.

In [23]:
### BEGIN SOLUTION
cv = CountVectorizer()
X = cv.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
### END SOLUTION
dtm

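| | added | again | and | anyone | be | before | bet | between | bother | bought | ... | vanilla | water | well | were | with | wont | would | wwould | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |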


In [24]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(dtm) == pd.DataFrame, "The output should be a DataFrame."
assert dtm.shape == (6,114), "The shape of the document-term matrix should be 6 rows x 114 columns."
### END HIDDEN TESTS
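To make it concrete what a document-term matrix holds, here is a minimal sketch that builds one by hand with the standard library. The two-document corpus and the `tokenize` helper are assumptions for illustration; the helper only approximates `CountVectorizer`'s default behavior (lowercasing and keeping words of two or more characters).

```python
import re
from collections import Counter

# Hypothetical mini-corpus, chosen only for illustration.
docs = ["Hot water, hot coffee", "Sweet coffee"]

def tokenize(text):
    # Lowercase and keep words of two or more characters,
    # an approximation of CountVectorizer's default behavior.
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of term frequencies per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all terms; it becomes the column order.
vocab = sorted(set(term for c in counts for term in c))

# One row per document, one column per term.
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)   # ['coffee', 'hot', 'sweet', 'water']
print(matrix)  # [[1, 2, 0, 1], [1, 0, 1, 0]]
```

Each row is a document and each column is a term, which is the same layout as `dtm` above.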

This document-term matrix has a lot of columns. Remove all of the stop words from the dataset by setting the `stop_words` hyperparameter in `CountVectorizer` with `CountVectorizer(stop_words = 'english')`. Save the updated dataframe in a variable called `dtm2`.



In [25]:
### BEGIN SOLUTION
cv2 = CountVectorizer(stop_words = 'english')
X2 = cv2.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm2 = pd.DataFrame(X2.toarray(), columns=cv2.get_feature_names_out())
### END SOLUTION
dtm2

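| | added | bet | bother | bought | box | br | brewer | bulk | buy | buying | ... | sugar | sweet | taste | tasted | tell | took | vanilla | water | wont | wwould |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |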


In [26]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(dtm2) == pd.DataFrame, "The output should be a DataFrame."
assert dtm2.shape == (6,71), "The shape of the document-term matrix should be 6 rows x 71 columns."
### END HIDDEN TESTS

This document-term matrix still has a good number of columns. In addition to removing all of the stop words, let's also keep only words that occur in more than one document. We have six documents, so one document is 16.7% of all our documents. To occur in more than one document, a word therefore needs to occur in more than 16.7% of the documents we have.

Update the minimum document frequency (`min_df`) hyperparameter to .1667 in `CountVectorizer` with `CountVectorizer(stop_words = 'english', min_df = .1667)`. Save the updated dataframe in a variable called `dtm3`.

In [27]:
### BEGIN SOLUTION
cv3 = CountVectorizer(stop_words = 'english', min_df = .1667)
X3 = cv3.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm3 = pd.DataFrame(X3.toarray(), columns=cv3.get_feature_names_out())
### END SOLUTION

dtm3

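| | br | cappuccino | coffee | don | hot | husband | keurig | love | tasted | water |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 2 | 0 | 0 |
| 2 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 |
| 4 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |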


In [28]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(dtm3) == pd.DataFrame, "The output should be a DataFrame."
assert dtm3.shape == (6,10), "The shape of the document-term matrix should be 6 rows x 10 columns."
### END HIDDEN TESTS
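The document-frequency arithmetic behind `min_df = .1667` can be sketched by hand. The example below is a simplified model under stated assumptions, not scikit-learn's code: the six-document corpus is made up, `keep_terms` is a hypothetical helper, and the rule it applies (a float is treated as a proportion of documents, an integer as an absolute count) is how `min_df` is documented to behave.

```python
import re
from collections import Counter

# Hypothetical six-document corpus, chosen only for illustration.
docs = ["hot coffee", "hot water", "sweet coffee",
        "instant milk", "vanilla cappuccino", "keurig brewer"]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(re.findall(r"\w+", doc)))

def keep_terms(min_df):
    # Assumed rule: a float is a proportion of documents, an int is a count.
    threshold = min_df * len(docs) if isinstance(min_df, float) else min_df
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

print(keep_terms(0.1667))  # ['coffee', 'hot']
print(keep_terms(2))       # ['coffee', 'hot']
print(len(keep_terms(1)))  # 10, every term appears in at least one document
```

With six documents, a proportion of .1667 corresponds to a threshold just above one document, so it has the same effect here as asking for at least two documents.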

The main takeaway here is that you can create a document-term matrix using `CountVectorizer`, and you can fine-tune the terms in your matrix by adjusting the hyperparameters. Common hyperparameters to tune are:
* `stop_words`: use the standard English list (`'english'`), or pass your own list to add custom stop words
* `ngram_range`: (1,3) would mean include terms that are unigrams, bi-grams and tri-grams
* `min_df`: minimum document frequency, given as a proportion between 0 and 1 or as an integer count of documents, helps you filter out uncommon words
* `max_df`: maximum document frequency, given as a proportion between 0 and 1 or as an integer count of documents, helps you filter out common words