# NLTK: part 2
## 01. Part of Speech Tagging with NLTK

* ref: 
    - [https://pythonprogramming.net](https://pythonprogramming.net)
    - [www.geeksforgeeks.org](https://www.geeksforgeeks.org/python-stemming-words-with-nltk/)

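In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech, based on both its definition and its context.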

    - Input: Everything is all about money.
    - Output: [('Everything', 'NN'), ('is', 'VBZ'), 
          ('all', 'DT'),('about', 'IN'), 
          ('money', 'NN'), ('.', '.')] 

Here’s a list of the tags, what they mean, and some examples:

- **CC**: coordinating conjunction
- **CD**: cardinal number
- **DT**: determiner
- **EX**: existential there (as in “there is”; think of it as “there exists”)
- **FW**: foreign word
- **IN**: preposition/subordinating conjunction
- **JJ**: adjective ‘big’
- **JJR**: adjective, comparative ‘bigger’
- **JJS**: adjective, superlative ‘biggest’
- **LS**: list marker ‘1)’
- **MD**: modal ‘could’, ‘will’
- **NN**: noun, singular ‘desk’
- **NNS**: noun, plural ‘desks’
- **NNP**: proper noun, singular ‘Harrison’
- **NNPS**: proper noun, plural ‘Americans’
- **PDT**: predeterminer ‘all’ in ‘all the kids’
- **POS**: possessive ending, as in ‘parent’s’
- **PRP**: personal pronoun ‘I’, ‘he’, ‘she’
- **PRP$**: possessive pronoun ‘my’, ‘his’, ‘her’
- **RB**: adverb ‘very’, ‘silently’
- **RBR**: adverb, comparative ‘better’
- **RBS**: adverb, superlative ‘best’
- **RP**: particle ‘up’ in ‘give up’
- **TO**: ‘to’, as in ‘to go to the store’
- **UH**: interjection ‘errrrrrrrm’
- **VB**: verb, base form ‘take’
- **VBD**: verb, past tense ‘took’
- **VBG**: verb, gerund/present participle ‘taking’
- **VBN**: verb, past participle ‘taken’
- **VBP**: verb, non-3rd person singular present ‘take’
- **VBZ**: verb, 3rd person singular present ‘takes’
- **WDT**: wh-determiner ‘which’
- **WP**: wh-pronoun ‘who’, ‘what’
- **WP$**: possessive wh-pronoun ‘whose’
- **WRB**: wh-adverb ‘where’, ‘when’
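
Since the tagger returns plain `(word, tag)` tuples, the list above can be turned into a simple lookup. The following is a minimal sketch that annotates the sample output from the start of this section. The dictionary `TAG_MEANINGS` and the helper `describe` are hypothetical names introduced for illustration (they are not part of NLTK), and only a handful of tags are included.

```python
# A small, hand-copied subset of the tag list (not an NLTK API).
TAG_MEANINGS = {
    "NN": "noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "IN": "preposition/subordinating conjunction",
}


def describe(tagged):
    """Return (word, tag, meaning) triples; tags not in the subset map to 'unknown'."""
    return [(word, tag, TAG_MEANINGS.get(tag, "unknown")) for word, tag in tagged]


# The sample output shown at the start of this section.
tagged = [("Everything", "NN"), ("is", "VBZ"), ("all", "DT"),
          ("about", "IN"), ("money", "NN"), (".", ".")]

for word, tag, meaning in describe(tagged):
    print(f"{word:<12}{tag:<5}{meaning}")
```

Punctuation such as `.` is tagged with itself, so it falls through to `'unknown'` in this small subset.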

In [1]:
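import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# One-time setup if the data is missing (resource names can vary by NLTK version):
# nltk.download("state_union"); nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")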

We use `PunktSentenceTokenizer` to split the text into sentences. This tokenizer is trained with unsupervised machine learning, so you can train it on any body of text that you use.

In [2]:
# let's create our training and testing data:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [3]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [4]:
# tokenize
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# tag all of the parts of speech per sentence
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat
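
Because `nltk.pos_tag` returns an ordinary list of `(word, tag)` tuples, the result can be inspected with standard Python tools. As a rough sketch, the snippet below takes the first ten tuples of the second sentence printed above (copied by hand, so no corpus download is needed), counts the tags, and collects the proper nouns. The variable names are illustrative only.

```python
from collections import Counter

# First ten (word, tag) pairs of the second tagged sentence, copied by hand.
tagged = [("Mr.", "NNP"), ("Speaker", "NNP"), (",", ","), ("Vice", "NNP"),
          ("President", "NNP"), ("Cheney", "NNP"), (",", ","),
          ("members", "NNS"), ("of", "IN"), ("Congress", "NNP")]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which words were tagged as singular proper nouns?
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(tag_counts.most_common(2))
print(proper_nouns)
```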

## 02. Chunking

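`Chunking` is the next step after tagging: we **group words** into hopefully meaningful chunks, such as noun phrases.

We parse each tagged sentence with `nltk.RegexpParser`, which applies a regular expression over the part-of-speech tags. In the grammar used here, `<RB.?>*` matches zero or more adverbs, `<VB.?>*` zero or more verbs, `<NNP>+` one or more proper nouns, and `<NN>?` an optional singular noun.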

In [13]:
def process_content():
    try:
        for ind,i in enumerate(tokenized):
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            if ind < 2:
                #print(chunked)
                for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                    print(subtree)

            #chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk called/VBD America/NNP)
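
To see how this grammar behaves without the corpus, it can be run on a hand-tagged sentence. This is a minimal sketch: the sentence and its tags are made up for illustration (they were not produced by `nltk.pos_tag`), and it relies only on `nltk.RegexpParser`, which should not need any downloaded data.

```python
import nltk

# Hand-tagged input (an assumption for illustration, not tagger output).
tagged = [("President", "NNP"), ("Bush", "NNP"), ("called", "VBD"),
          ("America", "NNP"), ("a", "DT"), ("great", "JJ"),
          ("nation", "NN"), (".", ".")]

# Same chunk grammar as in the notebook cell.
chunk_parser = nltk.RegexpParser(r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""")
tree = chunk_parser.parse(tagged)

# Collect the words of every subtree labelled 'Chunk'.
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")]
print(chunks)
```

Here `called America` ends up in a chunk of its own because the grammar allows verbs before proper nouns, which mirrors `(Chunk called/VBD America/NNP)` in the output above.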


## 03. Chinking with NLTK

`Chinking` is a lot like chunking: it is basically a way for you to `remove a chunk from a chunk`. The chunk that you remove from your chunk is your chink.

- The main difference from the chunking grammar is the added chink rule, written with the braces reversed:

`}<VB.?|IN|DT|TO>+{`

- This means we're removing from the chunk one or more `verbs`, `prepositions`, `determiners`, or the word `'to'`.

In [19]:
def process_content():
    try:
        for ind,i in enumerate(tokenized):
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<RB.?>*<NNP>+<NN>?}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            if ind < 2:
                #chunked.draw()
                for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                    print(subtree)

    except Exception as e:
        print(str(e))

process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk America/NNP)
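
In the grammar above, the chunk rule no longer admits verbs, so the chink rule should find nothing left to remove inside the chunks. To make the effect of chinking visible, the following sketch first chunks every token with `{<.*>+}` and then chinks verbs, prepositions, determiners, and `to`. Both the catch-all chunk rule and the hand-tagged sentence are assumptions chosen for illustration; they are not part of the notebook cells above.

```python
import nltk

# Hand-tagged input (an assumption for illustration, not tagger output).
tagged = [("The", "DT"), ("President", "NNP"), ("spoke", "VBD"),
          ("to", "TO"), ("Congress", "NNP"), ("today", "NN")]

# First chunk everything, then chink (remove) verbs, prepositions,
# determiners and 'to' from inside that chunk.
grammar = r"""Chunk: {<.*>+}
                     }<VB.?|IN|DT|TO>+{"""
tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the words of every subtree labelled 'Chunk'.
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")]
print(chunks)
```

The single all-inclusive chunk is split wherever a chinked tag occurs, so only the remaining runs of tokens survive as chunks.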
