# Custom Part of Speech Tagger

## Introduction:


<img src ="files/first.png">

### What is Part of Speech?

A word's part of speech describes the role it plays in a sentence. English has eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.

<img src ="files/second.png">

1. Noun (N) - Daniel, London, table, dog, teacher, pen, city, happiness, hope
2. Verb (V) - go, speak, run, eat, play, live, walk, have, like, are, is
3. Adjective (ADJ) - big, happy, green, young, fun, crazy, three
4. Adverb (ADV) - slowly, quietly, very, always, never, too, well, tomorrow
5. Preposition (P) - at, on, in, from, with, near, between, about, under
6. Conjunction (CON) - and, or, but, because, so, yet, unless, since, if
7. Pronoun (PRO) - I, you, we, they, he, she, it, me, us, them, him, her, this
8. Interjection (INT) - Ouch! Wow! Great! Help! Oh! Hey! Hi!

### How does POS tagging work?

<img src = "files/third.png">

In [10]:
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("Neo is the one")))

[('Neo', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('one', 'NN')]
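
The tags in this output come from the Penn Treebank tagset, which is finer-grained than the eight classes listed earlier. As a rough guide, the sketch below spells out the four tags that appear here. `PENN_GLOSSARY` and `describe` are hand-written, illustrative names (not part of NLTK), and the glossary covers only these four tags.

```python
# Hand-written glossary for the four Penn Treebank tags in the output above.
# Illustrative only: the full tagset has several dozen tags.
PENN_GLOSSARY = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "NN": "noun, singular or mass",
}

tagged = [("Neo", "NNP"), ("is", "VBZ"), ("the", "DT"), ("one", "NN")]


def describe(tagged_tokens):
    """Return (word, tag, description) triples; unknown tags get '?'."""
    return [(w, t, PENN_GLOSSARY.get(t, "?")) for w, t in tagged_tokens]


for word, tag, meaning in describe(tagged):
    print(f"{word:<4} {tag:<4} {meaning}")
```

So `NNP` and `NN` both fall under "Noun" in the list above, while `DT` (determiner) has no counterpart among the eight traditional classes.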


<img src="files/fourth.jpeg">

In [11]:
import nltk
tagged_sentences = nltk.corpus.treebank.tagged_sents()

In [12]:
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words: ", len(nltk.corpus.treebank.tagged_words()))

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences:  3914
Tagged words:  100676
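
Before training on the corpus, it can help to see how the tags are distributed. The sketch below counts tags with `collections.Counter` on the first sentence only, copied from the output above so that it runs without downloading the corpus; `first_sentence` and `tag_counts` are illustrative names.

```python
from collections import Counter

# First treebank sentence, copied from the output above.
first_sentence = [
    ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
    ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
    ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
    ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
    ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.'),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in first_sentence)

print("Tokens:", sum(tag_counts.values()))
print("Most common tag:", tag_counts.most_common(1)[0])
```

The same `Counter` expression applied to `nltk.corpus.treebank.tagged_words()` should give the distribution for the whole corpus, since that call also yields `(word, tag)` pairs.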


In [13]:
def features(sentence, index):
    """Build the feature dict for one token.

    sentence: list of words [w1, w2, ...]; index: position of the target word.
    """
    word = sentence[index]
    return {
        'word': word,
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        # Also True for digits and punctuation, which have no lowercase form
        'is_capitalized': word[0].upper() == word[0],
        'is_all_caps': word.upper() == word,
        'is_all_lower': word.lower() == word,
        'prefix_1': word[0],
        'prefix_2': word[:2],
        'prefix_3': word[:3],
        'suffix_1': word[-1],
        'suffix_2': word[-2:],
        'suffix_3': word[-3:],
        # Neighbouring words give the classifier some context
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in word,
        'is_numeric': word.isdigit(),
        'capitals_inside': word[1:].lower() != word[1:]
    }

In [16]:
import pprint
pprint.pprint(features(['Captain','America','DIES','in','Infinity','War'],2))

{'capitals_inside': True,
 'has_hyphen': False,
 'is_all_caps': True,
 'is_all_lower': False,
 'is_capitalized': True,
 'is_first': False,
 'is_last': False,
 'is_numeric': False,
 'next_word': 'in',
 'prefix_1': 'D',
 'prefix_2': 'DI',
 'prefix_3': 'DIE',
 'prev_word': 'America',
 'suffix_1': 'S',
 'suffix_2': 'ES',
 'suffix_3': 'IES',
 'word': 'DIES'}


In [17]:
def untag(tagged_sentence):
    """Strip the tags from one tagged sentence, returning only the words."""
    return [w for w, t in tagged_sentence]

In [18]:
# Splitting the dataset for training and testing (first 75% / last 25% of sentences)
cutoff = int(0.75 * len(tagged_sentences))
training_sentences = tagged_sentences[:cutoff]
test_sentences = tagged_sentences[cutoff:]

print(len(training_sentences))
print(len(test_sentences))

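def transform_to_dataset(tagged_sentences):
    """Flatten tagged sentences into per-token feature dicts (X) and tags (y)."""
    X, y = [], []

    for tagged in tagged_sentences:
        words = untag(tagged)  # untag once per sentence, not once per token
        for index in range(len(tagged)):
            X.append(features(words, index))
            y.append(tagged[index][1])

    return X, y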

X,y = transform_to_dataset(training_sentences)

2935
979
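
`transform_to_dataset` turns sentences into one training example per token, so `X` and `y` always have the same length and sentence boundaries survive only through features such as `is_first`. The sketch below replays that flattening on two toy sentences. `toy_features`, `flatten`, and `toy_corpus` are hypothetical names, and `toy_features` is a deliberately cut-down stand-in for the notebook's `features`.

```python
def toy_features(sentence, index):
    """Hypothetical, cut-down stand-in for the notebook's features()."""
    return {"word": sentence[index], "is_first": index == 0}


def flatten(tagged_sentences):
    """Same shape as transform_to_dataset: one (features, tag) pair per token."""
    X, y = [], []
    for tagged in tagged_sentences:
        words = [w for w, _ in tagged]  # strip the tags once per sentence
        for index, (_, tag) in enumerate(tagged):
            X.append(toy_features(words, index))
            y.append(tag)
    return X, y


toy_corpus = [
    [("Neo", "NNP"), ("is", "VBZ"), ("the", "DT"), ("one", "NN")],
    [("He", "PRP"), ("runs", "VBZ")],
]

X_toy, y_toy = flatten(toy_corpus)
print(len(X_toy), len(y_toy))  # one entry per token in both lists
print(X_toy[4], y_toy[4])      # first token of the second sentence
```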


In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])

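# Fit on only the first 10,000 tokens, a subset of the training data.
# DictVectorizer(sparse=False) builds a dense matrix, so memory use grows quickly.
clf.fit(X[:10000], y[:10000])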
print('Training Completed')

X_test, y_test = transform_to_dataset(test_sentences)

print("Accuracy: ",clf.score(X_test, y_test))

Training Completed
Accuracy:  0.8934597461031657
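
The `DictVectorizer` step in the pipeline above is what turns the feature dicts into numbers the decision tree can split on. The sketch below is a simplified, standard-library stand-in written for illustration, not scikit-learn's implementation: string values become `key=value` indicator columns, while booleans and numbers keep a single column. `one_hot_rows` is a hypothetical name.

```python
def one_hot_rows(feature_dicts):
    """Simplified stand-in for DictVectorizer(sparse=False).

    String values become 'key=value' indicator columns; booleans and
    numbers keep their own column and are stored as floats.
    """
    def column(key, value):
        return f"{key}={value}" if isinstance(value, str) else key

    # Collect every column name seen in the data, in sorted order.
    columns = sorted({column(k, v) for d in feature_dicts for k, v in d.items()})
    position = {name: i for i, name in enumerate(columns)}

    rows = []
    for d in feature_dicts:
        row = [0.0] * len(columns)
        for k, v in d.items():
            row[position[column(k, v)]] = 1.0 if isinstance(v, str) else float(v)
        rows.append(row)
    return columns, rows


columns, rows = one_hot_rows([
    {"word": "Neo", "is_first": True},
    {"word": "is", "is_first": False},
])
print(columns)
for row in rows:
    print(row)
```

Every distinct word, prefix, and suffix adds a column, so the dense matrix grows quickly with the amount of training data, which is a likely reason the notebook fits on a 10,000-token subset.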


In [20]:
def pos_tag(sentence):
    """Tag a tokenized sentence with the trained classifier.

    Note: this shadows the pos_tag imported from nltk earlier.
    """
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return [list(pair) for pair in zip(sentence, tags)]

print(pos_tag(word_tokenize('Mr. Stark is going to create another infinity stone')))

[['Mr.', 'NNP'], ['Stark', 'NNP'], ['is', 'VBZ'], ['going', 'VBG'], ['to', 'TO'], ['create', 'VB'], ['another', 'DT'], ['infinity', 'NN'], ['stone', 'NN']]


In [21]:
pos_tag(word_tokenize('thank you for calling rayos_no_hassle_copay program'))

[['thank', 'IN'],
 ['you', 'PRP'],
 ['for', 'IN'],
 ['calling', 'VBG'],
 ['rayos_no_hassle_copay', 'NN'],
 ['program', 'NN']]

In [22]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\chirag\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [None]:
# TO VIEW TAGGED CORPORA
#  C:\Users\chirag\AppData\Roaming\nltk_data\corpora\treebank\tagged> 

# NLTK BOOK
# http://www.nltk.org/book/ch05.html