In [1]:
!pwd

/Users/kpadhikari/GitStuff/KPAdhikari/PythonStuff/NLP_NLTK


<a id="GoHome"></a>
# Natural Language Processing With Python's NLTK Package
by Joanna Jablonski  May 05, 2021
Ref:
https://realpython.com/nltk-nlp-python/

Table of Contents

* Getting Started With Python’s NLTK
* [Tokenizing](#Tokenizing)
* [Filtering Stop Words](#FilteringStopWords)
* [Stemming](#Stemming)
* [Tagging Parts of Speech](#TaggingPartsOfSpeech)
* Lemmatizing
* Chunking
* Chinking
* Using Named Entity Recognition (NER)
* Getting Text to Analyze
* Using a Concordance
* Making a Dispersion Plot
* Making a Frequency Distribution
* Finding Collocations
* Conclusion
    
[Natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) is a field that focuses on making natural human language usable by computer programs. [NLTK, or Natural Language Toolkit](https://www.nltk.org/), is a Python package that you can use for NLP.

A lot of the data that you could be analyzing is [unstructured data](https://en.wikipedia.org/wiki/Unstructured_data) and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. In this tutorial, you’ll take your first look at the kinds of text preprocessing tasks you can do with NLTK so that you’ll be ready to apply them in future projects. You’ll also see how to do some basic text analysis and create visualizations.

If you’re familiar with the basics of using Python and would like to get your feet wet with some NLP, then you’ve come to the right place.

By the end of this tutorial, you’ll know how to:

* Find text to analyze
* Preprocess your text for analysis
* Analyze your text
* Create visualizations based on your analysis

Let’s get Pythoning!



In [2]:
import nltk

## Getting Started With Python’s NLTK
The first thing you need to do is make sure that you have Python installed. For this tutorial, you’ll be using Python 3.9. If you don’t yet have Python installed, then check out Python 3 Installation & Setup Guide to get started.

Once you have that dealt with, your next step is to install NLTK with pip. It’s a best practice to install it in a virtual environment. To learn more about virtual environments, check out Python Virtual Environments: A Primer.

For this tutorial, you’ll be installing version 3.5:
```shell
$ python -m pip install nltk==3.5
```

<font color="magenta">At this moment, I will not be using the virtual environment, but the standard/common installation for this tutorial.</font>

In order to create visualizations for [named entity recognition](https://realpython.com/nltk-nlp-python/#using-named-entity-recognition-ner), you’ll also need to install NumPy and Matplotlib, for example with `python -m pip install numpy matplotlib`.

[Go Home](#GoHome) <a id="Tokenizing"></a>
## Tokenizing
By **tokenizing**, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what both types of tokenization bring to the table:

* **Tokenizing by word:** Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you’d need to look deeper to know more.

* **Tokenizing by sentence:** When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of [herpetology](https://en.wikipedia.org/wiki/Herpetology) than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?
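As a rough illustration of the job-ad idea above, here is a minimal sketch that counts word frequencies using only the standard library. The ads, the regular expression, and the variable names are invented for this example; NLTK's `word_tokenize()` treats punctuation and contractions far more carefully than this naive pattern does.

```python
import re
from collections import Counter

# Hypothetical job ads, invented for illustration only.
job_ads = [
    "We need a Python developer with SQL experience.",
    "Python and Django skills required.",
    "Looking for a data analyst who knows Python.",
]

# Naive word tokenizer: runs of letters and apostrophes, case-folded.
words = []
for ad in job_ads:
    words.extend(re.findall(r"[a-z']+", ad.casefold()))

word_counts = Counter(words)
print(word_counts.most_common(1))  # [('python', 3)]
```

A high count only says that a word is frequent; the sentence-level context is still needed to tell whether the ads want Python skills or are about snakes.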

Here’s how to import the relevant parts of NLTK so you can tokenize by word and by sentence:



In [10]:
from nltk.tokenize import sent_tokenize, word_tokenize  #kp: sent is for sentence and word is for word
print(nltk.__version__)

3.6.5


Now that you’ve imported what you need, you can create a string to tokenize. Here’s a quote from [Dune](https://en.wikipedia.org/wiki/Dune_(novel)) that you can use:

In [8]:
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""

print(example_string)


Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult.


------
You can use `sent_tokenize()` to split up `example_string` into sentences.

When I called
```python
sent_tokenize(example_string)
```
without the following two lines:
```python
import nltk
nltk.download("punkt")
```
I saw the following errors:
```
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
/var/folders/bs/v93xlq0d24n2x1clkyc1lk840000gp/T/ipykernel_19842/2123458501.py in <module>
      1 #import nltk
----> 2 sent_tokenize(example_string)

~/opt/anaconda3/lib/python3.9/site-packages/nltk/tokenize/__init__.py in sent_tokenize(text, language)
    104     :param language: the model name in the Punkt corpus
    105     """
--> 106     tokenizer = load(f"tokenizers/punkt/{language}.pickle")
    107     return tokenizer.tokenize(text)
    108 

~/opt/anaconda3/lib/python3.9/site-packages/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    748 
    749     # Load the resource.
--> 750     opened_resource = _open(resource_url)
    751 
    752     if format == "raw":

~/opt/anaconda3/lib/python3.9/site-packages/nltk/data.py in _open(resource_url)
    874 
    875     if protocol is None or protocol.lower() == "nltk":
--> 876         return find(path_, path + [""]).open()
    877     elif protocol.lower() == "file":
    878         # urllib might not use mode='rb', so handle this one ourselves:

~/opt/anaconda3/lib/python3.9/site-packages/nltk/data.py in find(resource_name, paths)
    581     sep = "*" * 70
    582     resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583     raise LookupError(resource_not_found)
    584 
    585 

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/kpadhikari/nltk_data'
    - '/Users/kpadhikari/opt/anaconda3/nltk_data'
    - '/Users/kpadhikari/opt/anaconda3/share/nltk_data'
    - '/Users/kpadhikari/opt/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
```

<font color="magenta">and when I googled 'Resource punkt not found.', I found the following statement</font>

    punkt is a nltk library tool for tokenizing text documents. When we use an old or a degraded version of nltk module we generally need to download the remaining data .
    You can do
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('corpus')

at https://github.com/joosthub/PyTorchNLPBook/issues/14

In [15]:
import nltk
nltk.download("punkt")
sent_tokenize(example_string)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kpadhikari/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


["\nMuad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]

<font color="magenta">kp: So I added the download call as shown below and got the result that follows (I am saving it here because the output might differ next time due to changes under the hood).</font>
```python
import nltk
nltk.download("punkt")
sent_tokenize(example_string)
```
and I got the following output:
```
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kpadhikari/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
["\nMuad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]
```
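To see why sentence tokenizing relies on a downloaded resource rather than a one-line rule, here is a minimal sketch of a naive splitter built on a regular expression. The function name `naive_sent_split` and the second test sentence are made up for this example, and this is not how `sent_tokenize()` works internally.

```python
import re

example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sentences = naive_sent_split(example_string)
print(len(sentences))  # 3

# The same rule breaks on abbreviations (hypothetical sentence):
print(naive_sent_split("Dr. Kynes met Paul. They talked."))
# ['Dr.', 'Kynes met Paul.', 'They talked.']
```

The naive rule happens to work on the Dune quote but wrongly splits after "Dr.", which is the kind of case the Punkt resource is meant to handle.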

In [18]:
print(word_tokenize(example_string))
#word_tokenize(example_string) #It prints the word in a column, taking up too much space, so I disabled it

["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']


You got a list of strings that NLTK considers to be words, such as:

    "Muad'Dib"
    'training'
    'how'
    
But the following strings were also considered to be words:

    "'s"
    ','
    '.'
    
See how "It's" was split at the apostrophe to give you 'It' and "'s", but "Muad'Dib" was left whole? This happened because NLTK knows that 'It' and "'s" (a contraction of “is”) are two distinct words, so it counted them separately. But "Muad'Dib" isn’t an accepted contraction like "It's", so it wasn’t read as two separate words and was left intact.
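The behavior described above can be imitated with a small rule: split a token only when it ends in a known contraction suffix. The sketch below is a simplified stand-in with a hand-picked suffix list and a hypothetical helper name, `split_contraction`; NLTK's actual tokenizer applies a larger set of rules.

```python
import re

# Hand-picked English contraction endings (an assumption for this sketch).
CONTRACTION = re.compile(r"(.+?)('s|'re|'ve|'ll|'d|'m|n't)", re.IGNORECASE)

def split_contraction(token):
    """Return [stem, suffix] for a known contraction, else [token]."""
    match = CONTRACTION.fullmatch(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]

print(split_contraction("It's"))      # ['It', "'s"]
print(split_contraction("Muad'Dib"))  # ["Muad'Dib"]
print(split_contraction("don't"))     # ['do', "n't"]
```

Because "'Dib" is not in the suffix list, "Muad'Dib" passes through untouched, which mirrors the result in the output above.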

[Go Home](#GoHome) <a id="FilteringStopWords"></a>
## Filtering Stop Words
<font color="red">Stop words are words that you want to ignore,</font> so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

Here’s how to import the relevant parts of NLTK in order to filter out stop words:

In [19]:
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kpadhikari/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Here’s a [quote from Worf](https://www.youtube.com/watch?v=ri5S4Hcq0nY) that you can filter:

In [20]:
worf_quote = "Sir, I protest. I am not a merry man!"

Now tokenize worf_quote by word and store the resulting list in words_in_quote:

In [21]:
words_in_quote = word_tokenize(worf_quote)
words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

You have a list of the words in worf_quote, so the next step is to create a [set](https://realpython.com/python-sets/) of stop words to filter words_in_quote. For this example, you’ll need to focus on stop words in "english":

In [31]:
stop_words = stopwords.words("english")
print("As a List: \n", stop_words)

print("\n\nkp: 'set' in python is an unordered collection of unique items. No item comes more than once in the list.\n\n")
stop_words = set(stopwords.words("english"))
print("\n As a Set: \n",stop_words)

As a List: 
 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 's

-------
### About casefold() method
https://www.w3schools.com/python/ref_string_casefold.asp 

<font color="green">
The casefold() method returns a string where all the characters are lower case.

This method is similar to the lower() method, but the casefold() method is stronger, more aggressive, meaning that it will convert more characters into lower case, and will find more matches when comparing two strings and both are converted using the casefold() method.
</font>
----

In [32]:
# Next, create an empty list to hold the words that make it past the filter:
filtered_list = []

for word in words_in_quote:
    if word.casefold() not in stop_words:
        filtered_list.append(word)
        
print(filtered_list)

['Sir', ',', 'protest', '.', 'merry', 'man', '!']


You iterated over words_in_quote with a for loop and added all the words that weren’t stop words to filtered_list. You used [.casefold()](https://docs.python.org/3/library/stdtypes.html#str.casefold) on word so you could ignore whether the letters in word were uppercase or lowercase. This is worth doing because stopwords.words('english') includes only lowercase versions of stop words.
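For plain English text, `.lower()` and `.casefold()` give the same result, so the difference only shows up with certain non-ASCII characters. A minimal sketch using the German sharp s (the example word is chosen here for illustration and does not come from the tutorial):

```python
# Compare str.lower() and str.casefold() on the German sharp s.
word = "Straße"

print(word.lower())     # straße
print(word.casefold())  # strasse

# Case-insensitive comparison against the all-caps spelling:
print("STRASSE".lower() == word.lower())        # False
print("STRASSE".casefold() == word.casefold())  # True
```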

Alternatively, you could use a [list comprehension](https://realpython.com/list-comprehension-python/) to make a list of all the words in your text that aren’t stop words:

In [33]:
filtered_list = [ word for word in words_in_quote if word.casefold() not in stop_words]

print(filtered_list)

['Sir', ',', 'protest', '.', 'merry', 'man', '!']


When you use a list comprehension, you don’t create an empty list and then add items to the end of it. Instead, you define the list and its contents at the same time. Using a list comprehension is often seen as more [Pythonic](https://realpython.com/learning-paths/writing-pythonic-code/).

You filtered out a few words like 'am' and 'a', but you also filtered out 'not', which does affect the overall meaning of the sentence. (Worf won’t be happy about this.)

Words like 'I' and 'not' may seem too important to filter out, and depending on what kind of analysis you want to do, they can be. Here’s why:

* **'I'** is a pronoun, and pronouns are context words rather than content words:
    * **Content words** give you information about the topics covered in the text or the sentiment that the author has about those topics.
    * **Context words** give you information about writing style. You can observe patterns in how authors use context words in order to quantify their writing style. Once you’ve quantified their writing style, you can analyze a text written by an unknown author to see how closely it follows a particular writing style so you can try to identify who the author is.
* **'not'** is [technically an adverb](https://www.merriam-webster.com/dictionary/not) but has still been included in [NLTK’s list of stop words for English](https://www.nltk.org/nltk_data/). If you want to edit the list of stop words to exclude 'not' or make other changes, then you can [download it](https://www.nltk.org/nltk_data/).

So, 'I' and 'not' can be important parts of a sentence, but it depends on what you’re trying to learn from that sentence.
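If an analysis needs to keep negation, one option is to remove 'not' from the stop word set before filtering instead of editing the downloaded list. The sketch below uses a tiny hand-written set as a stand-in so that it runs without NLTK data; in the notebook you would presumably start from `set(stopwords.words("english"))` instead.

```python
# Tokens copied from the word_tokenize() output shown earlier.
words_in_quote = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not',
                  'a', 'merry', 'man', '!']

# Hypothetical stand-in for the full English stop word set.
stop_words = {"i", "am", "not", "a"}

# Set difference: drop 'not' so that negation survives the filter.
custom_stop_words = stop_words - {"not"}

filtered_list = [
    word for word in words_in_quote
    if word.casefold() not in custom_stop_words
]
print(filtered_list)
# ['Sir', ',', 'protest', '.', 'not', 'merry', 'man', '!']
```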



[Go Home](#GoHome) <a id="Stemming"></a>
## Stemming

**Stemming** is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has [more than one stemmer](http://www.nltk.org/howto/stem.html), but you’ll be using the [Porter stemmer](https://www.nltk.org/_modules/nltk/stem/porter.html).

Here’s how to import the relevant parts of NLTK in order to start stemming:

In [34]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

Now that you're done importing, you can create a stemmer with `PorterStemmer()`:

In [35]:
stemmer = PorterStemmer()

In [38]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""

string_for_stemming

'\nThe crew of the USS Discovery discovered many discoveries.\nDiscovering is what explorers do.'

In [40]:
print("Before you can stem the words in that string, you need to separate all the words in it:")

words = word_tokenize(string_for_stemming)
print(words)

Before you can stem the words in that string, you need to separate all the words in it:
['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']


In [43]:
print("Create a list of the stemmed versions of the words in words by using stemmer.stem() in a list comprehension:")
stemmed_words = [stemmer.stem(word) for word in words]  #'for word in words' means do it for all elements/members of words.
print(stemmed_words)

Create a list of the stemmed versions of the words in words by using stemmer.stem() in a list comprehension:
['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']


Those results look a little inconsistent. Why would 'Discovery' give you 'discoveri' when 'Discovering' gives you 'discov'?

Understemming and overstemming are two ways stemming can go wrong:

1. **Understemming** happens when two related words should be reduced to the same stem but aren’t. This is a [false negative](https://en.wikipedia.org/wiki/False_positives_and_false_negatives#False_negative_error).
2. **Overstemming** happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a [false positive](https://en.wikipedia.org/wiki/False_positives_and_false_negatives#False_positive_error).
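Both failure modes are easy to reproduce with a deliberately crude stemmer. The sketch below is a toy suffix stripper written for this note; the suffix list, the three-letter minimum, and the name `toy_stem` are all assumptions, and it is not the Porter algorithm.

```python
# Toy suffix-stripping stemmer (hypothetical; NOT the Porter algorithm).
SUFFIXES = ("ing", "ies", "ed", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    word = word.casefold()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Works as hoped: related words share a stem.
print(toy_stem("helping"), toy_stem("helper"))         # help help

# Understemming: related words end up with different stems.
print(toy_stem("discovery"), toy_stem("discovering"))  # discovery discover

# Overstemming: unrelated words collapse to the same stem.
print(toy_stem("wander"), toy_stem("wand"))            # wand wand
```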

The [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) dates from 1979, so it’s a little on the older side. The Snowball stemmer, which is also called Porter2, is an improvement on the original and is also available through NLTK, so you can use that one in your own projects. It’s also worth noting that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word.

Fortunately, you have some other ways to reduce words to their core meaning, such as lemmatizing, which you’ll see later in this tutorial. But first, we need to cover parts of speech.

In [52]:
# https://www.geeksforgeeks.org/snowball-stemmer-nlp/
from nltk.stem.snowball import SnowballStemmer
stemmerP2 = SnowballStemmer("english") #or we can write:  stemmerP2 = SnowballStemmer(language='english')
stemmed_words2 = [stemmerP2.stem(word) for word in words]  # use the Snowball stemmer created above, not the Porter one
print(stemmed_words2)

['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']


[Go Home](#GoHome) <a id="TaggingPartsOfSpeech"></a>
## Tagging Parts of Speech
Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

In English, there are eight parts of speech:

| Part of Speech | Role | Examples |
|---|---|---|
| Noun | Is a person, place, or thing | mountain, bagel, Poland |
| Pronoun | Replaces a noun | you, she, we |
| Adjective | Gives information about what a noun is like | efficient, windy, colorful |
| Verb | Is an action or a state of being | learn, is, go |
| Adverb | Gives information about a verb, an adjective, or another adverb | efficiently, always, very |
| Preposition | Gives information about how a noun or pronoun is connected to another word | from, about, at |
| Conjunction | Connects two other words or phrases | so, because, and |
| Interjection | Is an exclamation | yay, ow, wow |

Some sources also include the category articles (like “a” or “the”) in the list of parts of speech, but other sources consider them to be adjectives. NLTK uses the word determiner to refer to articles.


Here’s how to import the relevant parts of NLTK in order to tag parts of speech:

In [55]:
from nltk.tokenize import word_tokenize

# Now create some text to tag. You can use this Carl Sagan quote:

sagan_quote = """ If you wish to make an apple pie from scratch, you must first invent the universe."""

print(sagan_quote)

 If you wish to make an apple pie from scratch, you must first invent the universe.


In [56]:
words_in_sagan_quote = word_tokenize(sagan_quote)
print(words_in_sagan_quote)
                                     

['If', 'you', 'wish', 'to', 'make', 'an', 'apple', 'pie', 'from', 'scratch', ',', 'you', 'must', 'first', 'invent', 'the', 'universe', '.']


In [59]:
#Now call nltk.pos_tag() on your new list of words:
import nltk
nltk.download('averaged_perceptron_tagger') #Without this, it gave me errors

nltk.pos_tag(words_in_sagan_quote)

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]

Each word in the quote is now in its own tuple, paired with a tag that represents its part of speech. But what do the tags mean? Here’s how to get a list of tags and their meanings:

In [63]:
import nltk
#nltk.download('tagsets')

#nltk.help.upenn_tagset()

The code above (now disabled), i.e. the following lines of code
```python
import nltk
nltk.download('tagsets')

nltk.help.upenn_tagset()
```
gave the following results:
```
[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/kpadhikari/nltk_data...
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``
[nltk_data]   Unzipping help/tagsets.zip.
```

Here’s a summary that you can use to get started with NLTK’s POS tags:

| Tags that start with | Deal with |
| --- | --- |
| JJ | Adjectives |
| NN | Nouns |
| RB | Adverbs |
| PRP | Pronouns |
| VB | Verbs |
Now that you know what the POS tags mean, you can see that your tagging was fairly successful:

* 'pie' was tagged NN because it’s a singular noun.
* 'you' was tagged PRP because it’s a personal pronoun.
* 'invent' was tagged VB because it’s the base form of a verb.
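Since `nltk.pos_tag()` returns plain `(word, tag)` tuples, the tag prefixes from the summary table can be used to group words with ordinary Python. The sketch below hard-codes the tagged output shown earlier so that it runs without NLTK; the names `prefix_to_category` and `words_by_category` are made up for this example.

```python
from collections import defaultdict

# (word, tag) pairs copied from the nltk.pos_tag() output above.
tagged_words = [
    ('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'),
    ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'),
    ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'),
    ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'),
    ('universe', 'NN'), ('.', '.'),
]

# Tag prefixes taken from the summary table.
prefix_to_category = {
    "JJ": "Adjectives",
    "NN": "Nouns",
    "RB": "Adverbs",
    "PRP": "Pronouns",
    "VB": "Verbs",
}

# Collect each word under the first category whose prefix matches its tag.
words_by_category = defaultdict(list)
for word, tag in tagged_words:
    for prefix, category in prefix_to_category.items():
        if tag.startswith(prefix):
            words_by_category[category].append(word)
            break

print(words_by_category["Nouns"])  # ['apple', 'pie', 'scratch', 'universe']
print(words_by_category["Verbs"])  # ['wish', 'make', 'first', 'invent']
```

Note that 'first' lands among the verbs because the tagger labeled it VB in this sentence, a reminder that "fairly successful" does not mean perfect.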

But how would NLTK handle tagging the parts of speech in a text that is basically gibberish? [Jabberwocky](https://www.poetryfoundation.org/poems/42916/jabberwocky) is a [nonsense poem](https://en.wikipedia.org/wiki/Nonsense_verse) that doesn’t technically mean much but is still written in a way that can convey some kind of meaning to English speakers.

Make a string to hold an excerpt from this poem:

In [65]:
jabberwocky_excerpt = """
'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
all mimsy were the borogoves, and the mome raths outgrabe."""

jabberwocky_excerpt

"\n'Twas brillig, and the slithy toves did gyre and gimble in the wabe:\nall mimsy were the borogoves, and the mome raths outgrabe."