### Day 68

<h1>Tagging Parts of Speech</h1>
<p><b>Part of speech</b> is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or <b>POS tagging</b>, is the task of labeling the words in your text according to their part of speech.</p>

<p>In English, there are eight parts of speech:</p>

<table>
<tr><th>Part of speech</th>	<th>Role</th>	<th>Examples</th></tr>
<tr><td>Noun</td>	<td>Is a person, place, or thing</td>	<td>mountain, bagel, Poland</td></tr>
<tr><td>Pronoun</td>	<td>Replaces a noun</td>	<td>you, she, we</td></tr>
<tr><td>Adjective</td>	<td>Gives information about what a noun is like</td>	<td>efficient, windy, colorful</td></tr>
<tr><td>Verb</td>	<td>Is an action or a state of being</td>	<td>learn, is, go</td></tr>
<tr><td>Adverb</td>	<td>Gives information about a verb, an adjective, or another adverb</td>	<td>efficiently, always, very</td></tr>
<tr><td>Preposition</td>	<td>Gives information about how a noun or pronoun is connected to another word</td>	<td>from, about, at</td></tr>
<tr><td>Conjunction</td>	<td>Connects two other words or phrases</td>	<td>so, because, and</td></tr>
<tr><td>Interjection</td>	<td>Is an exclamation</td>	<td>yay, ow, wow</td></tr>
</table>

<p>Some sources also include the category <b>articles</b> (like “a” or “the”) in the list of parts of speech, but other sources consider them to be adjectives. NLTK uses the word <b>determiner</b> to refer to articles.</p>

<p>Here’s how to import the relevant parts of NLTK in order to tag parts of speech:</p>

In [1]:
from nltk.tokenize import word_tokenize

Now create some text to tag. You can use this <a href="https://www.youtube.com/watch?v=5_vVGPy4-rc">Carl Sagan quote</a>:

In [2]:
sagan_quote = """
If you wish to make an apple pie from scratch,
you must first invent the universe."""

Use word_tokenize to separate the words in that string and store them in a list:



In [3]:
words_in_sagan_quote = word_tokenize(sagan_quote)

In [5]:
print(words_in_sagan_quote)

['If', 'you', 'wish', 'to', 'make', 'an', 'apple', 'pie', 'from', 'scratch', ',', 'you', 'must', 'first', 'invent', 'the', 'universe', '.']


Now call nltk.pos_tag() on your new list of words:

In [8]:
# import nltk library for pos_tag
import nltk

nltk.pos_tag(words_in_sagan_quote)

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]

Each word in the quote is now in its own tuple, paired with a tag that represents its part of speech. But what do the tags mean? Here’s how to get a list of tags and their meanings:

In [9]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

That’s a lot to take in, but fortunately there are some patterns to help you remember what’s what.

Here’s a summary that you can use to get started with NLTK’s POS tags:

<table>
<tr><th>Tags that start with</th>	<th>Deal with</th></tr>
<tr><td>JJ</td>	<td>Adjectives</td></tr>
<tr><td>NN</td>	<td>Nouns</td></tr>
<tr><td>RB</td>	<td>Adverbs</td></tr>
<tr><td>PRP</td>	<td>Pronouns</td></tr>
<tr><td>VB</td>	<td>Verbs</td></tr>
</table>
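The prefix patterns in the table can be turned into a small lookup helper. The sketch below is only an illustration: `PREFIX_TO_CATEGORY` and `coarse_category` are hypothetical names, not part of NLTK, and the mapping covers only the five prefixes listed above.

```python
# Hypothetical helper (not part of NLTK): maps a Penn Treebank tag to a
# coarse category using only the prefixes from the summary table.
PREFIX_TO_CATEGORY = {
    "JJ": "adjective",
    "NN": "noun",
    "RB": "adverb",
    "PRP": "pronoun",
    "VB": "verb",
}


def coarse_category(tag):
    """Return the coarse category for a tag, or 'other' if no prefix matches."""
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if tag.startswith(prefix):
            return category
    return "other"


# Try the helper on a few tags that appear in the output above.
for tag in ["NN", "NNS", "VBP", "PRP", "JJ", "DT"]:
    print(tag, "->", coarse_category(tag))
```

Tags outside the table, such as DT or IN, fall through to 'other' here. A fuller mapping would need the complete list from `nltk.help.upenn_tagset()`.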

<p>Now that you know what the POS tags mean, you can see that your tagging was fairly successful:</p>
<ul>
<li>'pie' was tagged NN because it’s a singular noun.</li>
<li>'you' was tagged PRP because it’s a personal pronoun.</li>
<li>'invent' was tagged VB because it’s the base form of a verb.</li>
</ul>
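One way to check observations like these is to filter the tagged list by tag prefix. This is a minimal sketch: the tagged list is copied by hand from the `nltk.pos_tag()` output shown earlier, so it runs without NLTK, and `words_with_prefix` is a hypothetical helper name.

```python
# Tagged output copied from the nltk.pos_tag() call on the Sagan quote.
tagged_sagan_quote = [
    ("If", "IN"), ("you", "PRP"), ("wish", "VBP"), ("to", "TO"),
    ("make", "VB"), ("an", "DT"), ("apple", "NN"), ("pie", "NN"),
    ("from", "IN"), ("scratch", "NN"), (",", ","), ("you", "PRP"),
    ("must", "MD"), ("first", "VB"), ("invent", "VB"), ("the", "DT"),
    ("universe", "NN"), (".", "."),
]


def words_with_prefix(tagged_words, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


nouns = words_with_prefix(tagged_sagan_quote, "NN")
verbs = words_with_prefix(tagged_sagan_quote, "VB")
print(nouns)  # ['apple', 'pie', 'scratch', 'universe']
print(verbs)  # ['wish', 'make', 'first', 'invent']
```

The verb list includes 'first', which arguably works as an adverb in this sentence. That is one reason to call the result fairly successful rather than perfect.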
But how would NLTK handle tagging the parts of speech in a text that is basically gibberish? <a href="https://www.poetryfoundation.org/poems/42916/jabberwocky">Jabberwocky</a> is a <a href="https://en.wikipedia.org/wiki/Nonsense_verse">nonsense poem</a>: many of its words are invented, but it is written in a way that still conveys some kind of meaning to English speakers.

Make a string to hold an excerpt from this poem:

In [10]:
jabberwocky_excerpt = """
'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
all mimsy were the borogoves, and the mome raths outgrabe."""

Use word_tokenize to separate the words in the excerpt and store them in a list:



In [11]:
words_in_excerpt = word_tokenize(jabberwocky_excerpt)

In [13]:
print(words_in_excerpt)

["'Twas", 'brillig', ',', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', ':', 'all', 'mimsy', 'were', 'the', 'borogoves', ',', 'and', 'the', 'mome', 'raths', 'outgrabe', '.']


Call nltk.pos_tag() on your new list of words:

In [14]:
nltk.pos_tag(words_in_excerpt)

[("'Twas", 'CD'),
 ('brillig', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('slithy', 'JJ'),
 ('toves', 'NNS'),
 ('did', 'VBD'),
 ('gyre', 'NN'),
 ('and', 'CC'),
 ('gimble', 'JJ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('wabe', 'NN'),
 (':', ':'),
 ('all', 'DT'),
 ('mimsy', 'NNS'),
 ('were', 'VBD'),
 ('the', 'DT'),
 ('borogoves', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('mome', 'JJ'),
 ('raths', 'NNS'),
 ('outgrabe', 'RB'),
 ('.', '.')]

Real English words like 'and' and 'the' were correctly tagged as a conjunction and a determiner, respectively. The gibberish word 'slithy' was tagged as an adjective, which is what a human English speaker would probably assume from the context of the poem as well.
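To see at a glance how the tagger treated the invented words, you could group the tagged tokens by tag. The sketch below assumes the output shown above, copied by hand into `tagged_excerpt` so that it runs without NLTK. The name `words_by_tag` is purely illustrative.

```python
from collections import defaultdict

# Tagged output copied from the nltk.pos_tag() call on the Jabberwocky excerpt.
tagged_excerpt = [
    ("'Twas", "CD"), ("brillig", "NN"), (",", ","), ("and", "CC"),
    ("the", "DT"), ("slithy", "JJ"), ("toves", "NNS"), ("did", "VBD"),
    ("gyre", "NN"), ("and", "CC"), ("gimble", "JJ"), ("in", "IN"),
    ("the", "DT"), ("wabe", "NN"), (":", ":"), ("all", "DT"),
    ("mimsy", "NNS"), ("were", "VBD"), ("the", "DT"), ("borogoves", "NNS"),
    (",", ","), ("and", "CC"), ("the", "DT"), ("mome", "JJ"),
    ("raths", "NNS"), ("outgrabe", "RB"), (".", "."),
]

# Collect the words under each tag, preserving their order in the text.
words_by_tag = defaultdict(list)
for word, tag in tagged_excerpt:
    words_by_tag[tag].append(word)

print(words_by_tag["JJ"])   # ['slithy', 'gimble', 'mome']
print(words_by_tag["NNS"])  # ['toves', 'mimsy', 'borogoves', 'raths']
```

Grouping this way shows that every word tagged JJ or NNS in the excerpt is an invented one, so the tagger assigned those tags without ever having seen the words in real English text.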