[NLTK :: Natural Language Toolkit](#NLTK-::-Natural-Language-Toolkit)

1. [Introducing `nltk`](#Introducing-`nltk`)
2. [Tokenizing Strings](#Tokenizing-Strings)
3. [Stop Words](#Stop-Words)
4. [A Warning](#A-Warning)
5. [N-grams](#N-grams)



# NLTK :: Natural Language Toolkit

## Introducing `nltk`
Another life saver for prepping your NLP data is the nltk package. `nltk` stands for Natural Language Toolkit and the corresponding documentation can be found here, https://www.nltk.org/.

We'll be using this package a decent amount in the program so be sure to get familiar with it.

In this notebook we'll see how useful it is for breaking strings into individual substrings (think words or sentences) called tokens. We'll also learn about stop words and n-grams.

`nltk` can be used for more than these three purposes, but we won't introduce those unless we need them later in the course.

In [1]:
import pandas as pd
df = pd.read_csv('Food Review Data Set.csv')


In [None]:
df = df[['Summary','Text']]
df.head()

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


In [None]:
# Lowercase all strings
df['Text_clean'] = df['Text'].str.lower()

## Tokenizing Strings
Recall the end goal of Preprocessing - Cleaning Data: an important step in that process was to clean our data and make it easier to work with. As a part of that step we had to:

* turn all the words to lowercase (not always necessary, but we've already done that)
* remove punctuation (not always necessary)
* remove numbers
* strip white space (also generally part of tokenization)

After those steps you can break each string into words, for instance with a `split` statement.
That process is called word tokenization: breaking strings down into smaller units (in this case words).
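
As a rough sketch of the cleaning steps listed above, here is one way to do them by hand in plain Python before reaching for `nltk`. The helper name `simple_word_tokenize` and the sample string are made up for illustration, and this is only one of many reasonable ways to do it.

```python
import string

def simple_word_tokenize(text):
    """Lowercase, drop punctuation and digits, then split on whitespace."""
    # Map every punctuation mark and digit to a space so neighbouring words stay apart
    unwanted = string.punctuation + string.digits
    table = str.maketrans(unwanted, " " * len(unwanted))
    cleaned = text.lower().translate(table)
    # split() with no argument splits on runs of whitespace and drops empty strings
    return cleaned.split()

sample_review = "Great taffy at a great price!  I bought 2 bags."
print(simple_word_tokenize(sample_review))
```

Because hyphens count as punctuation here, a word like "ez-pz" would be split into two tokens, which is one reason a purpose-built tokenizer is often the better choice.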

`nltk` has a number of built-in tokenizer objects that can make this process as simple as a single line of code. Let's check out an example.

In [None]:
# Run this code chunk to ensure that you have
# nltk installed properly
import nltk

# Note your version may be different than mine
# That's fine
# At the time of writing this notebook I have version 3.4.5
print(nltk.__version__)

3.8.1


In [None]:
#!pip install nltk



In [None]:
#nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Note: You may get an error or two running the code below if you've never run `nltk` before. That's probably because you have to download some data that isn't automatically downloaded with the `nltk` package (for example with `nltk.download('punkt')`, as in the cell above). I believe the error messages are pretty good about telling you what to do to fix it.

In [None]:
## Our first tokenizer will be
## word_tokenize
df['tokenized_text'] = df['Text_clean'].apply(nltk.word_tokenize)

In [None]:
## Practice
## Run the tokenizer on this #FakeTweet
fake_tweet = "tokenizing is really ez-pz :P :D #nlp"

nltk.word_tokenize(fake_tweet)

['tokenizing', 'is', 'really', 'ez-pz', ':', 'P', ':', 'D', '#', 'nlp']

In [None]:
# This one we have to import from the tokenize subpackage
from nltk.tokenize import TweetTokenizer

# Now we make a Tokenizer object
tweet_tokenizer = TweetTokenizer()

In [None]:
# We call the 'tokenize' method of the tokenizer
tweet_tokenizer.tokenize(fake_tweet)

['tokenizing', 'is', 'really', 'ez-pz', ':P', ':D', '#nlp']

So when it comes to tokenizing strings it really depends on your use case. You can learn more about existing Tokenizer objects here, https://www.nltk.org/api/nltk.tokenize.html, and even learn how to write your own Tokenizer object.

For the most part we'll stick to `word_tokenize`; I'll point out when we depart from that norm.
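
To give a feel for what writing your own tokenizer involves, here is a minimal sketch that uses only Python's `re` module rather than `nltk`'s tokenizer classes. The function name `tiny_tweet_tokenize` and its pattern are made up for illustration; the pattern only covers hashtags, a few simple emoticons, and (possibly hyphenated) words.

```python
import re

# Alternatives are tried left to right at each position:
#   1. hashtags such as #nlp
#   2. simple emoticons such as :P or ;-)
#   3. words, optionally joined by hyphens, such as ez-pz
TOKEN_PATTERN = re.compile(r"#\w+|[:;]-?[A-Za-z()]|\w+(?:-\w+)*")

def tiny_tweet_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)

print(tiny_tweet_tokenize("tokenizing is really ez-pz :P :D #nlp"))
```

On the fake tweet this gives the same seven tokens as the `TweetTokenizer` output above. Anything the pattern does not cover (most punctuation, for example) is silently dropped, so a real tokenizer needs far more cases than this.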

### Practice
1. Make a list of the tokens in the Food Review data. Don't bother cleaning out punctuation.

In [None]:
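words_list = df['Text_clean'].tolist()
# Caution: ''.join fuses adjacent reviews (hence tokens like 'most.product' later);
# ' '.join(words_list) would keep them separate
raw_text = ''.join(words_list)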

In [None]:
print(len(raw_text))
review_tokens = nltk.word_tokenize(raw_text)

396851


In [None]:
df['tokenized_text'] = df['Text_clean'].apply(nltk.word_tokenize)

In [None]:
len(words_list)

1000

In [None]:
df['tokenized_text']

0      [i, have, bought, several, of, the, vitality, ...
1      [product, arrived, labeled, as, jumbo, salted,...
2      [this, is, a, confection, that, has, been, aro...
3      [if, you, are, looking, for, the, secret, ingr...
4      [great, taffy, at, a, great, price, ., there, ...
                             ...                        
995    [black, market, hot, sauce, is, wonderful, ......
996    [man, what, can, i, say, ,, this, salsa, is, t...
997    [this, sauce, is, so, good, with, just, about,...
998    [not, hot, at, all, ., like, the, other, low, ...
999    [i, have, to, admit, ,, i, was, a, sucker, for...
Name: tokenized_text, Length: 1000, dtype: object

Google how to tokenize sentences using `nltk`. Once you've figured out how to do that:

    A. Create a pandas DataFrame containing the unique sentences from the food reviews.
    B. Create a column in your DataFrame that contains the tokenized version of each sentence.

One possible starting point: https://www.guru99.com/tokenize-words-sentences-nltk.html

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
import pandas as pd

In [None]:
review_sents = sent_tokenize(raw_text)

In [None]:
type(review_sents)

list

In [None]:
review_sent_df = pd.DataFrame({'sentence':review_sents})

In [None]:
review_sent_df

Unnamed: 0,sentence
0,i have bought several of the vitality canned d...
1,the product looks more like a stew than a proc...
2,my labrador is finicky and she appreciates thi...
3,not sure if this was an error or if the vendor...
4,"it is a light, pillowy citrus gelatin with nut..."
...,...
3621,"some people might like the flavor, citrus-y an..."
3622,plastic bottle.
3623,it does have a convenient squirt top.
3624,"but overall, not very hot or tasty, and made m..."


## Stop Words
Think of the words you use most often. There are many words in the world that are necessary to form coherent sentences. However, most of those words are not necessary to convey the meaning behind your sentences.

That's essentially the idea behind _stop words_. These are frequently used words that can be thought of as "noise" for the sake of data analysis. As such it may be useful to remove them prior to analysis. `nltk` stores a corpus (collection of texts) of stopwords in a variety of languages for easy out of the box use.

In [None]:
# stopwords are stored in the corpus subpackage
from nltk.corpus import stopwords

In [None]:
#nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# words can be accessed like so
STOPWORDS = stopwords.words('english')
STOPWORDS

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Practice
1. Remove the stop words from your tokens for the Food Review excerpt.

In [None]:
word_list = nltk.word_tokenize(raw_text)
word_list = [word for word in word_list if word.lower() not in STOPWORDS]

In [None]:
word_list

['bought',
 'several',
 'vitality',
 'canned',
 'dog',
 'food',
 'products',
 'found',
 'good',
 'quality',
 '.',
 'product',
 'looks',
 'like',
 'stew',
 'processed',
 'meat',
 'smells',
 'better',
 '.',
 'labrador',
 'finicky',
 'appreciates',
 'product',
 'better',
 'most.product',
 'arrived',
 'labeled',
 'jumbo',
 'salted',
 'peanuts',
 '...',
 'peanuts',
 'actually',
 'small',
 'sized',
 'unsalted',
 '.',
 'sure',
 'error',
 'vendor',
 'intended',
 'represent',
 'product',
 '``',
 'jumbo',
 "''",
 '.this',
 'confection',
 'around',
 'centuries',
 '.',
 'light',
 ',',
 'pillowy',
 'citrus',
 'gelatin',
 'nuts',
 '-',
 'case',
 'filberts',
 '.',
 'cut',
 'tiny',
 'squares',
 'liberally',
 'coated',
 'powdered',
 'sugar',
 '.',
 'tiny',
 'mouthful',
 'heaven',
 '.',
 'chewy',
 ',',
 'flavorful',
 '.',
 'highly',
 'recommend',
 'yummy',
 'treat',
 '.',
 'familiar',
 'story',
 'c.s',
 '.',
 'lewis',
 "'",
 '``',
 'lion',
 ',',
 'witch',
 ',',
 'wardrobe',
 "''",
 '-',
 'treat',
 'sedu

In [None]:
STOPWORD_SET = set(stopwords.words('english'))  # build once; set lookups are fast

def remove_stop(tokens):
  return [token for token in tokens if token not in STOPWORD_SET]

In [None]:
remove_stop(review_tokens)

['bought',
 'several',
 'vitality',
 'canned',
 'dog',
 'food',
 'products',
 'found',
 'good',
 'quality',
 '.',
 'product',
 'looks',
 'like',
 'stew',
 'processed',
 'meat',
 'smells',
 'better',
 '.',
 'labrador',
 'finicky',
 'appreciates',
 'product',
 'better',
 'most.product',
 'arrived',
 'labeled',
 'jumbo',
 'salted',
 'peanuts',
 '...',
 'peanuts',
 'actually',
 'small',
 'sized',
 'unsalted',
 '.',
 'sure',
 'error',
 'vendor',
 'intended',
 'represent',
 'product',
 '``',
 'jumbo',
 "''",
 '.this',
 'confection',
 'around',
 'centuries',
 '.',
 'light',
 ',',
 'pillowy',
 'citrus',
 'gelatin',
 'nuts',
 '-',
 'case',
 'filberts',
 '.',
 'cut',
 'tiny',
 'squares',
 'liberally',
 'coated',
 'powdered',
 'sugar',
 '.',
 'tiny',
 'mouthful',
 'heaven',
 '.',
 'chewy',
 ',',
 'flavorful',
 '.',
 'highly',
 'recommend',
 'yummy',
 'treat',
 '.',
 'familiar',
 'story',
 'c.s',
 '.',
 'lewis',
 "'",
 '``',
 'lion',
 ',',
 'witch',
 ',',
 'wardrobe',
 "''",
 '-',
 'treat',
 'sedu

In [None]:
review_sent_df['tokens'] = review_sent_df['sentence'].apply(nltk.word_tokenize)  # this column must exist first
review_sent_df['tokens no stop'] = review_sent_df['tokens'].apply(remove_stop)


1. Create a column in your DataFrame that holds the tokenized sentences with the stop words removed.

In [None]:
review_sent_df

## A Warning
When working on your NLP projects be wary about removing the stopwords. If you've seen The Office, you know our friend Kevin had to go back to using all the words in order for people to understand him.

One concern I have with `nltk`'s stopwords is that words like "no", "not" and "nor" are on there. The absence of those words can greatly alter the meaning of a sentence.

In [None]:
sentence = "I do not like ice cream."

tokens = nltk.word_tokenize(sentence)

In [None]:
print("With stop words.")
print(tokens)

With stop words.
['I', 'do', 'not', 'like', 'ice', 'cream', '.']


In [None]:
print("Without stop words.")
print([token for token in tokens if token not in stopwords.words('english')])

Without stop words.
['I', 'like', 'ice', 'cream', '.']
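
One possible way around this is to build your own stop word set with the negations taken out. The sketch below uses a small hand-written stand-in for the `nltk` list (an assumption, so that it runs without any downloads); in practice you would start from `stopwords.words('english')`. It also lowercases each token before the lookup: in the output above "I" survived only because the comparison was case-sensitive and the list contains the lowercase 'i'.

```python
# Small hand-written stand-in for stopwords.words('english') (an assumption,
# so this sketch runs without downloading the nltk corpus)
base_stopwords = {"i", "me", "my", "do", "does", "no", "nor", "not", "the", "a"}

# Negations we want to keep because dropping them flips the meaning
negations = {"no", "nor", "not"}
custom_stopwords = base_stopwords - negations

example_tokens = ["I", "do", "not", "like", "ice", "cream", "."]
# Lowercase each token before the lookup so that "I" matches "i"
kept_tokens = [tok for tok in example_tokens if tok.lower() not in custom_stopwords]
print(kept_tokens)
```

With the negations kept, the filtered tokens still say that the speaker does not like ice cream.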


## N-grams
One final data cleaning step we'll discuss here is the creation of n-grams.

As you might imagine, just looking at the words used in a piece of text is not always enough to create useful applications. This is because breaking a text up into the unique words is essentially assuming that every piece of text (from here on referred to as a document) is created by randomly pulling words from a bag, which is why this technique is called __Bag of Words__ (more on this next week).

Under this assumption you lose the information carried by the author's word order. A step up from simple bag of words is to look at the unique sequences of n words in a row (otherwise known as n-grams).

For example, the bigrams (2-grams) for this sentence:

`"I do not like ice cream"`

are

`[("I", "do"), ("do", "not"), ("not", "like"), ("like", "ice"), ("ice", "cream")]`.
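
To make the contrast with bag of words concrete, here is a plain-Python sketch of the sliding-window idea. It is not how `nltk` implements it; `make_ngrams` is a hypothetical helper written for this example, and the two short sentences are made up.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the n-grams of a token list as a list of tuples."""
    # zip stops at the shortest shifted copy, giving len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

first = "dog bites man".split()
second = "man bites dog".split()

# Same bag of words: word order is ignored
print(Counter(first) == Counter(second))
# Different bigrams: word order is kept
print(make_ngrams(first, 2))
print(make_ngrams(second, 2))
```

The two sentences contain exactly the same words, so a bag of words cannot tell them apart, while their bigrams differ.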

`nltk` also offers functions that take in a list of tokens (which must be in the order they appeared in the text) and output an iterator object of n-grams.

In [None]:
# nltk.bigrams makes the bigrams
# it returns an iterator object
nltk.bigrams(nltk.word_tokenize(sentence))

<generator object bigrams at 0x7a9fd71a6880>

In [None]:
# You can turn that into a list like so
print(list(nltk.bigrams(nltk.word_tokenize(sentence))))

[('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'ice'), ('ice', 'cream'), ('cream', '.')]


In [None]:
# nltk.ngrams can make any kind of n-gram; here n = 4
nltk.ngrams(nltk.word_tokenize(sentence), 4)

<zip at 0x7a9fd6ff07c0>

In [None]:
print(list(nltk.ngrams(nltk.word_tokenize(sentence), 4)))

[('I', 'do', 'not', 'like'), ('do', 'not', 'like', 'ice'), ('not', 'like', 'ice', 'cream'), ('like', 'ice', 'cream', '.')]


### Practice
1. Produce a list of the 4-grams for the food review

In [None]:
ngramprac = nltk.ngrams(review_tokens, 4)

In [None]:
print(list(ngramprac))

[('i', 'have', 'bought', 'several'), ('have', 'bought', 'several', 'of'), ('bought', 'several', 'of', 'the'), ('several', 'of', 'the', 'vitality'), ('of', 'the', 'vitality', 'canned'), ('the', 'vitality', 'canned', 'dog'), ('vitality', 'canned', 'dog', 'food'), ('canned', 'dog', 'food', 'products'), ('dog', 'food', 'products', 'and'), ('food', 'products', 'and', 'have'), ('products', 'and', 'have', 'found'), ('and', 'have', 'found', 'them'), ('have', 'found', 'them', 'all'), ('found', 'them', 'all', 'to'), ('them', 'all', 'to', 'be'), ('all', 'to', 'be', 'of'), ('to', 'be', 'of', 'good'), ('be', 'of', 'good', 'quality'), ('of', 'good', 'quality', '.'), ('good', 'quality', '.', 'the'), ('quality', '.', 'the', 'product'), ('.', 'the', 'product', 'looks'), ('the', 'product', 'looks', 'more'), ('product', 'looks', 'more', 'like'), ('looks', 'more', 'like', 'a'), ('more', 'like', 'a', 'stew'), ('like', 'a', 'stew', 'than'), ('a', 'stew', 'than', 'a'), ('stew', 'than', 'a', 'processed'), ('t

In [None]:
nostop = remove_stop(review_tokens)
print(list(nltk.ngrams(nostop, 4)))

[('bought', 'several', 'vitality', 'canned'), ('several', 'vitality', 'canned', 'dog'), ('vitality', 'canned', 'dog', 'food'), ('canned', 'dog', 'food', 'products'), ('dog', 'food', 'products', 'found'), ('food', 'products', 'found', 'good'), ('products', 'found', 'good', 'quality'), ('found', 'good', 'quality', '.'), ('good', 'quality', '.', 'product'), ('quality', '.', 'product', 'looks'), ('.', 'product', 'looks', 'like'), ('product', 'looks', 'like', 'stew'), ('looks', 'like', 'stew', 'processed'), ('like', 'stew', 'processed', 'meat'), ('stew', 'processed', 'meat', 'smells'), ('processed', 'meat', 'smells', 'better'), ('meat', 'smells', 'better', '.'), ('smells', 'better', '.', 'labrador'), ('better', '.', 'labrador', 'finicky'), ('.', 'labrador', 'finicky', 'appreciates'), ('labrador', 'finicky', 'appreciates', 'product'), ('finicky', 'appreciates', 'product', 'better'), ('appreciates', 'product', 'better', 'most.product'), ('product', 'better', 'most.product', 'arrived'), ('bett

1. Create a column in your dataframe that contains the bigrams for each sentence of the food reviews.


In [None]:
bigramFood = nltk.bigrams(review_tokens)
food_list = list(bigramFood)
food_list

In [None]:
df2 = pd.DataFrame({'bigrams': food_list})
df2

Unnamed: 0,bigrams
0,"(i, have)"
1,"(have, bought)"
2,"(bought, several)"
3,"(several, of)"
4,"(of, the)"
...,...
86198,"(<, br)"
86199,"(br, /)"
86200,"(/, >)"
86201,"(>, xanthan)"


# Great!
We now know enough to move on to our first NLP projects!