# Parts of Speech

# (1). Default Tagging ==>

Part-of-speech tagging is the process of converting a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

Default tagging is the baseline step for part-of-speech tagging. It is performed using the DefaultTagger class, which takes a single argument: the tag to assign to every token. NN is the tag for a singular noun. DefaultTagger is most useful when it is given the most common part-of-speech tag, which is why a noun tag is recommended.

In [2]:
# Loading Libraries
from nltk.tag import DefaultTagger

# Defining Tag
tagging = DefaultTagger('NN')

# Tagging
tagging.tag(['Hello', 'Sam'])


[('Hello', 'NN'), ('Sam', 'NN')]

# Each tagger has a tag() method that takes a list of tokens (usually the list of words produced by a word tokenizer), where each token is a single word. tag() returns a list of tagged tokens, each of which is a (word, tag) tuple.

# How does DefaultTagger work?



DefaultTagger is a subclass of SequentialBackoffTagger and implements the choose_tag() method, which takes three arguments:

(1). the list of tokens

(2). the index of the current token, whose tag is to be chosen

(3). the list of the previous tags (the history)
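
To make the three arguments concrete, here is a minimal sketch of a custom tagger built on the same base class. `CapitalizedTagger` is a hypothetical class invented for this illustration and is not part of NLTK. It tags capitalised tokens as NNP and returns None otherwise, which lets a DefaultTagger supplied as the backoff decide.

```python
from nltk.tag import DefaultTagger, SequentialBackoffTagger


class CapitalizedTagger(SequentialBackoffTagger):
    """Hypothetical tagger: capitalised tokens get NNP, the rest go to the backoff."""

    def choose_tag(self, tokens, index, history):
        # tokens  -> list of all tokens in the sentence
        # index   -> position of the token whose tag is being chosen
        # history -> tags already chosen for the tokens before index
        word = tokens[index]
        if word[:1].isupper():
            return 'NNP'
        return None  # None hands the decision to the backoff tagger


# DefaultTagger('NN') acts as the fallback for every token left untagged
tagger = CapitalizedTagger(backoff=DefaultTagger('NN'))
result = tagger.tag(['Hello', 'from', 'Sam'])
print(result)
```

DefaultTagger itself is the simplest case of this pattern: its choose_tag() ignores all three arguments and always returns the tag it was constructed with.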

# Tagging Sentences  ===>



In [3]:
# Loading Libraries
from nltk.tag import DefaultTagger

# Defining Tag
tagging = DefaultTagger('NN')

tagging.tag_sents([['welcome', 'to', '.'], ['Sam', 'Mohit', 'Raj']])


[[('welcome', 'NN'), ('to', 'NN'), ('.', 'NN')],
 [('Sam', 'NN'), ('Mohit', 'NN'), ('Raj', 'NN')]]

# Note: Every tag in the list of tagged sentences is NN because DefaultTagger assigns the same tag to every token.

# Illustrating how to untag ==> 




In [4]:
from nltk.tag import untag
untag([('Mohit', 'NN'), ('Raj', 'NN'), ('Sam', 'NN')])


['Mohit', 'Raj', 'Sam']

# (2). Part of Speech Tagging with Stop Words using NLTK in Python

# List of tags and their meanings ==>

CC coordinating conjunction

CD cardinal number

DT determiner

EX existential there (like: “there is” … think of it like “there exists”)

FW foreign word

IN preposition/subordinating conjunction

JJ adjective – ‘big’

JJR adjective, comparative – ‘bigger’

JJS adjective, superlative – ‘biggest’

LS list marker – ‘1)’

MD modal – could, will

NN noun, singular – ‘desk’

NNS noun, plural – ‘desks’

NNP proper noun, singular – ‘Harrison’

NNPS proper noun, plural – ‘Americans’

PDT predeterminer – ‘all the kids’

POS possessive ending – parent’s

PRP personal pronoun – I, he, she

PRP$ possessive pronoun – my, his, hers

RB adverb – very, silently

RBR adverb, comparative – better

RBS adverb, superlative – best

RP particle – give up

TO to – go ‘to’ the store

UH interjection – errrrrrrrm

VB verb, base form – take

VBD verb, past tense – took

VBG verb, gerund/present participle – taking

VBN verb, past participle – taken

VBP verb, singular present, non-3rd person – take

VBZ verb, 3rd person singular present – takes

WDT wh-determiner – which

WP wh-pronoun – who, what

WP$ possessive wh-pronoun – whose

WRB wh-adverb – where, when
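
Many of the tags above share a prefix (NN*, VB*, JJ*, RB*), which makes it easy to collapse them into coarse word classes. The following is a small sketch of that idea. `coarse_category` is a hypothetical helper written for this illustration, and the tagged pairs are hand-written from the examples in the list rather than produced by a tagger.

```python
def coarse_category(tag):
    """Map a fine-grained tag to a coarse word class by its prefix (hypothetical helper)."""
    prefixes = [('NN', 'noun'), ('VB', 'verb'), ('JJ', 'adjective'), ('RB', 'adverb')]
    for prefix, name in prefixes:
        if tag.startswith(prefix):
            return name
    return 'other'


# Hand-written (word, tag) pairs taken from the examples in the tag list
tagged = [('desks', 'NNS'), ('bigger', 'JJR'), ('took', 'VBD'),
          ('silently', 'RB'), ('the', 'DT')]

coarse = [(word, coarse_category(tag)) for word, tag in tagged]
print(coarse)
```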

In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))

# Dummy text
# Note: the fragments ending in "life." and "people." have no trailing space,
# which is why the fused tokens 'life.It' and 'people.It' appear in the output.
txt = "Sukanya, Rajib and Naba are my good friends. " \
    "Sukanya is getting married next year. " \
    "Marriage is a big step in one’s life." \
    "It is both exciting and frightening. " \
    "But friendship is a sacred bond between people." \
    "It is a special kind of love between us. " \
    "Many of you must have tried searching for a friend " \
    "but never found the right one."

# sent_tokenize uses a pre-trained PunktSentenceTokenizer
# from the nltk.tokenize.punkt module

tokenized = sent_tokenize(txt)
for i in tokenized:

    # A word tokenizer is used to find the words
    # and punctuation in a string
    wordsList = word_tokenize(i)

    # Removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]

    # Using a part-of-speech tagger (POS tagger)
    tagged = nltk.pos_tag(wordsList)

    print(tagged)


[('Sukanya', 'NNP'), (',', ','), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS'), ('.', '.')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN'), ('.', '.')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life.It', 'NN'), ('exciting', 'VBG'), ('frightening', 'NN'), ('.', '.')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people.It', 'NN'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP'), ('.', '.')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), ('never', 'RB'), ('found', 'VBD'), ('right', 'JJ'), ('one', 'CD'), ('.', '.')]
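
In the output above, 'But' survives the stop-word filter while the lowercase 'but' in the last sentence is removed, because the membership test is case-sensitive. Below is a minimal sketch of a case-insensitive filter. The small stop-word set here is hand-written for illustration only and stands in for stopwords.words('english').

```python
# Hand-written stand-in for the NLTK stop-word list (illustration only)
stop_words = {'but', 'is', 'a', 'between'}

words = ['But', 'friendship', 'is', 'a', 'sacred', 'bond', 'between', 'people', '.']

# Case-sensitive test: 'But' is not equal to 'but', so it is kept
case_sensitive = [w for w in words if w not in stop_words]

# Case-insensitive test: compare the lowercased word, keep the original spelling
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Only the comparison is lowercased; the tokens passed on to the tagger keep their original capitalisation.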


# (3). Part of Speech Tagging using TextBlob

# The TextBlob module is used for building text-analysis programs. One of its more powerful features is part-of-speech tagging.

In [24]:
# !pip install -U textblob
# !python -m textblob.download_corpora

In [25]:
# Import the TextBlob class from the textblob library
from textblob import TextBlob

text = ("Sukanya, Rajib and Naba are my good friends. " +
	"Sukanya is getting married next year. " +
	"Marriage is a big step in one’s life." +
	"It is both exciting and frightening. " +
	"But friendship is a sacred bond between people." +
	"It is a special kind of love between us. " +
	"Many of you must have tried searching for a friend "+
	"but never found the right one.")

# create a textblob object
blob_object = TextBlob(text)

# Part-of-speech tags can be accessed
# through the tags property of the blob object.

# print word with pos tag.
print(blob_object.tags)


[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('and', 'CC'), ('Naba', 'NNP'), ('are', 'VBP'), ('my', 'PRP$'), ('good', 'JJ'), ('friends', 'NNS'), ('Sukanya', 'NNP'), ('is', 'VBZ'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN'), ('Marriage', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('big', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('one', 'CD'), ('’', 'NN'), ('s', 'NN'), ('life.It', 'NN'), ('is', 'VBZ'), ('both', 'DT'), ('exciting', 'VBG'), ('and', 'CC'), ('frightening', 'NN'), ('But', 'CC'), ('friendship', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('sacred', 'JJ'), ('bond', 'NN'), ('between', 'IN'), ('people.It', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('special', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('love', 'NN'), ('between', 'IN'), ('us', 'PRP'), ('Many', 'JJ'), ('of', 'IN'), ('you', 'PRP'), ('must', 'MD'), ('have', 'VB'), ('tried', 'VBN'), ('searching', 'VBG'), ('for', 'IN'), ('a', 'DT'), ('friend', 'NN'), ('but', 'CC'), ('never', 'RB'), ('found', 'VBD'), ('the', 'DT'), ('right', 'JJ'),

# Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).

# Cosine Similarity ==> 

A similarity measure refers to the distance between data objects in a dataset, where the dimensions represent features of the objects. If this distance is small, the objects have a high degree of similarity; if the distance is large, they have a low degree of similarity. Some of the popular similarity measures are:

(1). Euclidean Distance.

(2). Manhattan Distance.

(3). Jaccard Similarity.

(4). Minkowski Distance.

(5). Cosine Similarity.
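
As a rough sketch of how the first four measures differ, the functions below compute them in plain Python. The function names are chosen for this illustration. Euclidean and Manhattan distance are written as the p = 2 and p = 1 cases of Minkowski distance, and Jaccard similarity is computed on sets of words.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)  # straight-line distance


def manhattan(x, y):
    return minkowski(x, y, 1)  # sum of absolute differences


def jaccard(a, b):
    """Jaccard similarity: shared items divided by all distinct items.

    Assumes at least one of the two collections is non-empty.
    """
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


x = [3, 2, 0, 5]
y = [1, 0, 0, 0]
d_euclid = euclidean(x, y)
d_manhattan = manhattan(x, y)
j = jaccard(['a', 'big', 'step'], ['a', 'small', 'step'])
print(d_euclid, d_manhattan, j)
```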

# Cosine similarity is a metric that helps determine how similar data objects are, irrespective of their size. We can measure the similarity between two sentences in Python using cosine similarity. In cosine similarity, the data objects in a dataset are treated as vectors. The formula for the cosine similarity between two vectors is:

# $S_C(x, y) = \dfrac{x \cdot y}{\|x\| \times \|y\|}$

# where,

$x \cdot y$ = dot product of the vectors ‘x’ and ‘y’.

$\|x\|$ and $\|y\|$ = lengths (magnitudes) of the two vectors ‘x’ and ‘y’.

$\|x\| \times \|y\|$ = regular (scalar) product of the two magnitudes.
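
One way to apply this formula to two sentences is to treat each sentence as a vector of word counts. The sketch below does that with collections.Counter. `sentence_cosine` is a name chosen for this illustration, and splitting on whitespace after lowercasing is a simplifying assumption rather than a proper tokenizer.

```python
import math
from collections import Counter


def sentence_cosine(s1, s2):
    """Cosine similarity between two sentences using word-count vectors."""
    # Simplifying assumption: lowercase and split on whitespace
    v1 = Counter(s1.lower().split())
    v2 = Counter(s2.lower().split())

    # Dot product only needs the words that occur in both sentences
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))

    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence has no direction
    return dot / (norm1 * norm2)


same = sentence_cosine("friendship is a sacred bond", "friendship is a sacred bond")
partial = sentence_cosine("friendship is a sacred bond", "marriage is a big step")
disjoint = sentence_cosine("hello sam", "welcome to python")
print(same, partial, disjoint)
```

Identical sentences score close to 1, sentences with no words in common score 0, and partial overlap falls in between.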

# Example: Consider an example to find the similarity between two vectors, ‘x’ and ‘y’, using cosine similarity.

The ‘x’ vector has values x = { 3, 2, 0, 5 }.

The ‘y’ vector has values y = { 1, 0, 0, 0 }.

The formula for calculating the cosine similarity is $S_C(x, y) = \dfrac{x \cdot y}{\|x\| \times \|y\|}$

$x \cdot y = 3 \times 1 + 2 \times 0 + 0 \times 0 + 5 \times 0 = 3$

$\|x\| = \sqrt{3^2 + 2^2 + 0^2 + 5^2} = \sqrt{38} \approx 6.16$

$\|y\| = \sqrt{1^2 + 0^2 + 0^2 + 0^2} = 1$

$\therefore S_C(x, y) = \dfrac{3}{6.16 \times 1} \approx 0.49$

The dissimilarity between the two vectors ‘x’ and ‘y’ is given by

$\therefore D_C(x, y) = 1 - S_C(x, y) = 1 - 0.49 = 0.51$
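
The same calculation can be checked in a few lines of Python. This is a minimal sketch using only the standard library; `cosine_similarity` is a name chosen for this illustration.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))      # x . y
    norm_x = math.sqrt(sum(a * a for a in x))   # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))   # ||y||
    return dot / (norm_x * norm_y)


x = [3, 2, 0, 5]
y = [1, 0, 0, 0]

similarity = cosine_similarity(x, y)
dissimilarity = 1 - similarity
print(round(similarity, 2), round(dissimilarity, 2))  # 0.49 0.51
```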

The cosine similarity between two vectors is the cosine of the angle ‘θ’ between them.
If θ = 0°, the ‘x’ and ‘y’ vectors point in the same direction (cos θ = 1), so they are maximally similar.

If θ = 90°, the ‘x’ and ‘y’ vectors are orthogonal (cos θ = 0), so they are dissimilar.

Advantages :

(1). Cosine similarity is beneficial because even if two similar data objects are far apart by Euclidean distance because of their size, they can still have a small angle between them. The smaller the angle, the higher the similarity.
(2). When plotted in a multi-dimensional space, cosine similarity captures the orientation (the angle) of the data objects and not their magnitude.
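
A short sketch can illustrate both points: scaling a vector moves it far away in Euclidean terms but leaves its direction, and therefore its cosine similarity, unchanged. The scale factor of 10 is arbitrary, and math.dist requires Python 3.8 or later.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [3, 2, 0, 5]
x_scaled = [10 * a for a in x]  # same direction, ten times longer

distance = math.dist(x, x_scaled)             # large Euclidean distance
similarity = cosine_similarity(x, x_scaled)   # still (almost exactly) 1

print(round(distance, 2), round(similarity, 2))
```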

In [26]:
# Embedding layers ==> 
# https://en.wikipedia.org/wiki/Word_embedding
# https://en.wikipedia.org/wiki/Word2vec