# Chapter 6: More Text Preprocessing with NLTK

## Instructions

- Run the cells containing `assert` statements to check your answer against the expected output. If a cell runs without error, your answer matches. If your output differs, you'll get a hint.


In this notebook, you'll try out several preprocessing techniques in NLTK and then apply some other techniques to a pandas DataFrame.

## 1. Text Preprocessing in NLTK

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import MWETokenizer, word_tokenize
from nltk.tag import pos_tag
from nltk.stem.lancaster import LancasterStemmer

[nltk_data] Downloading package punkt to /Users/rita/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rita/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Tokenize the text below by word, but make sure that `I` and `am` are seen as one term. Hint: use `MWETokenizer`. Save the resulting word list as `word_list`.

In [2]:
text = "I am a runner. I love to run. I ran on yesterday. I am running now. I will run tomorrow."

In [3]:
### BEGIN SOLUTION
word_list = MWETokenizer([('I','am')]).tokenize(word_tokenize(text))
### END SOLUTION
word_list

['I_am',
 'a',
 'runner',
 '.',
 'I',
 'love',
 'to',
 'run',
 '.',
 'I',
 'ran',
 'on',
 'yesterday',
 '.',
 'I_am',
 'running',
 'now',
 '.',
 'I',
 'will',
 'run',
 'tomorrow',
 '.']

In [4]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert len(word_list) == 23, "There should be 23 items in the list."
### END HIDDEN TESTS
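
To make the merging step concrete, here is a minimal pure-Python sketch of the idea: scan the token list and join any run of tokens that matches the multi-word expression. The helper name `merge_mwe` is hypothetical and this is not how `MWETokenizer` is implemented; it only mirrors the `I_am` behavior seen in the output above.

```python
def merge_mwe(tokens, mwe, separator="_"):
    """Merge each occurrence of the token sequence `mwe` into a single token.

    Hypothetical helper for illustration; not part of NLTK.
    """
    merged = []
    n = len(mwe)
    i = 0
    while i < len(tokens):
        # Compare the next n tokens against the multi-word expression.
        if n and tuple(tokens[i:i + n]) == tuple(mwe):
            merged.append(separator.join(mwe))
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "am", "a", "runner", ".", "I", "love", "to", "run", "."]
print(merge_mwe(tokens, ("I", "am")))
# ['I_am', 'a', 'runner', '.', 'I', 'love', 'to', 'run', '.']
```

Only an `I` that is immediately followed by `am` is merged, which is why the standalone `I` tokens remain in the list.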

Let's only focus on the words that start with the letter `r`.

In [5]:
r_list = [word for word in word_list if word[0] == 'r']
r_list

['runner', 'run', 'ran', 'running', 'run']

Use the `LancasterStemmer` to find the base of these five words. Hint: you'll need to write a `for` loop to stem each item in the list. Save the new stemmed list in a variable called `stemmed_list`.

In [6]:
### BEGIN SOLUTION
stemmer = LancasterStemmer()
stemmed_list = []
for word in r_list:
    stemmed_word = stemmer.stem(word)
    stemmed_list.append(stemmed_word)
### END SOLUTION
stemmed_list

['run', 'run', 'ran', 'run', 'run']

In [7]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert stemmed_list == ['run', 'run', 'ran', 'run', 'run'], "The stemmed list is incorrect."
### END HIDDEN TESTS
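
Notice that `ran` is left unchanged. A stemmer strips suffixes by rule rather than looking words up, so an irregular form with no suffix to remove stays as it is. The toy stemmer below sketches that idea under loud assumptions: the rules in `TOY_RULES` and the function `toy_stem` are invented for this example and are not the rules `LancasterStemmer` applies.

```python
# Illustrative suffix rules, checked in order. These are made up for this
# example and are NOT the Lancaster rule set.
TOY_RULES = ["ning", "ner", "ing", "er"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two letters."""
    for suffix in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word  # no rule matched, so the word is returned unchanged


r_list = ['runner', 'run', 'ran', 'running', 'run']
print([toy_stem(word) for word in r_list])
# ['run', 'run', 'ran', 'run', 'run']
```

The list comprehension in the last step does the same job as the `for` loop in the solution; either form works.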

Find the part of speech of each item in `r_list` using `pos_tag`. Save the results in a variable called `pos_list`.

In [8]:
### BEGIN SOLUTION
pos_list = pos_tag(r_list)
### END SOLUTION
pos_list

[('runner', 'NN'),
 ('run', 'VB'),
 ('ran', 'VBD'),
 ('running', 'VBG'),
 ('run', 'NN')]

In [9]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert pos_list == [('runner', 'NN'), ('run', 'VB'), ('ran', 'VBD'), ('running', 'VBG'), ('run', 'NN')], "The parts of speech list is incorrect."
### END HIDDEN TESTS
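
One way to inspect the result is to group the words by tag. The sketch below copies the `pos_list` output shown above as a literal, so it runs without NLTK; the variable name `words_by_tag` is just a choice for this example.

```python
from collections import defaultdict

# Copied from the pos_tag output above so the example is self-contained.
pos_list = [('runner', 'NN'), ('run', 'VB'), ('ran', 'VBD'),
            ('running', 'VBG'), ('run', 'NN')]

words_by_tag = defaultdict(list)
for word, tag in pos_list:
    # Each item is a (word, tag) tuple, so it unpacks into two names.
    words_by_tag[tag].append(word)

print(dict(words_by_tag))
# {'NN': ['runner', 'run'], 'VB': ['run'], 'VBD': ['ran'], 'VBG': ['running']}
```

The word `run` lands under both `NN` and `VB`. The tagger takes neighboring tokens into account, and a bare word list gives it none of the context of the original sentences, so tags on a list like this should be read with some caution.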

## 2. Filtering Parts of Speech in a DataFrame

The goal is to filter down the text in the reviews to only include adjectives.

In [10]:
df = pd.DataFrame([['a',5,"Grove Square Cappuccino Cups were excellent. Tasted really good right from the Keurig brewer with nothing added. wWould highly recommend. RCCJR"],
                  ['b',1,"I love my Keurig, and I love most of the Keurig coffees. This is instant coffee with instant milk and far too much sugar. I don't know anyone I dislike enough to dump the rest of the box on."],
                  ['c',1,"It's a powdered drink. No filter in k-cup.<br />Just buy it in bulk and mix it with hot water....<br /><br />Nothing else to say here. Wont be buying it again."],
                  ['d',1,"don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!"],
                  ['e',1,"Never tasted this coffee before, I felt much too sweet even for dessert. I would not order again. But then that is only my opinion. My friend's husband loves it.<br />I gave them to him."],
                  ['f',5,"My husband and I LOVE this French Vanilla Cappuccino. Sooo glad I didn't listen to some of the reviews and took the plunge and bought it."]],
                  columns=['users','stars','reviews'])
df

,users,stars,reviews
0,a,5,Grove Square Cappuccino Cups were excellent. Tasted really good right from the Keurig brewer with nothing added. wWould highly recommend. RCCJR
1,b,1,"I love my Keurig, and I love most of the Keurig coffees. This is instant coffee with instant milk and far too much sugar. I don't know anyone I dislike enough to dump the rest of the box on."
2,c,1,It's a powdered drink. No filter in k-cup.<br />Just buy it in bulk and mix it with hot water....<br /><br />Nothing else to say here. Wont be buying it again.
3,d,1,"don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!"
4,e,1,"Never tasted this coffee before, I felt much too sweet even for dessert. I would not order again. But then that is only my opinion. My friend's husband loves it.<br />I gave them to him."
5,f,5,My husband and I LOVE this French Vanilla Cappuccino. Sooo glad I didn't listen to some of the reviews and took the plunge and bought it.


Most of the code is written below, but there are two errors.

```
def adj_or_not(text):
  pos_list = pos_tag(word_tokenize(text))
  adjs = ''
  for item in pos_list:
    if item[1] == 'NN':
      adjs = adjs + item[1] + ' '
  return adjs 
adjs = df.reviews.map(adj_or_not)
adjs
```

Fix those errors to display all of the adjectives in the reviews column. The beginning of your output should look like this:

```
0    excellent good recommend 
1        instant instant much
...                       ...
```

In [11]:
### BEGIN SOLUTION
def adj_or_not(text):
    pos_list = pos_tag(word_tokenize(text))
    adjs = ''
    for item in pos_list:
        if item[1] == 'JJ':
            adjs = adjs + item[0] + ' '
    return adjs

adjs = df.reviews.map(adj_or_not)

### END SOLUTION
adjs

0    excellent good recommend 
1        instant instant much 
2         powdered k-cup. hot 
3                         hot 
4                   sweet it. 
5                      French 
Name: reviews, dtype: object

In [12]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert len(' '.join(adjs).split()) == 13, "That is the incorrect number of adjectives in the reviews."
### END HIDDEN TESTS

Hints:
* Details on the part-of-speech tags can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
* Think about what the `item` variable looks like. You may want to test the function on one string before applying it to the whole dataframe.
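
Following the second hint, here is a minimal sketch of inspecting `item` on a small sample before touching the dataframe. To keep it runnable without NLTK, the sample is tagged by hand in the `(word, tag)` shape that `pos_tag` returns; the tags and the helper name `adjectives_from_tagged` are assumptions made for this example.

```python
# Hand-tagged sample in the (word, tag) shape returned by pos_tag.
# The tags are typed in by hand for illustration, not produced by NLTK.
sample = [('Cups', 'NNS'), ('were', 'VBD'), ('excellent', 'JJ'), ('.', '.')]

for item in sample:
    # item[0] is the word and item[1] is its tag.
    print(item[0], item[1])


def adjectives_from_tagged(tagged):
    """Return the words tagged 'JJ', separated by single spaces."""
    return ' '.join(word for word, tag in tagged if tag == 'JJ')


print(adjectives_from_tagged(sample))
# excellent
```

Seeing that the word sits at index 0 and the tag at index 1 points to both fixes in the exercise: compare the tag against `'JJ'` and collect `item[0]`. In the Penn Treebank tag set, `JJR` and `JJS` mark comparative and superlative adjectives, so a check such as `tag.startswith('JJ')` would include them as well, though that may change the adjective count the test expects.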