# Part-of-Speech Tagging and Named Entity Recognition

## 1. Parts of Speech

Part-of-speech (POS) tagging is a core task in Natural Language Processing (NLP): it assigns a grammatical category (noun, verb, adjective, and so on) to each word in a sentence. Many NLP applications build on it, including text analysis, information retrieval, and machine translation. Here are some common parts of speech with their tags from the Penn Treebank tag set, which NLTK's default English tagger uses:

- **Noun (NN)**: A word that represents a person, place, thing, or idea. Example: "dog," "city," "happiness."
- **Verb (VB)**: A word that expresses an action or a state of being. Example: "run," "eat," "is."
- **Adjective (JJ)**: A word that describes or modifies a noun. Example: "happy," "red," "tall."
- **Adverb (RB)**: A word that modifies a verb, adjective, or another adverb, often providing information about time, manner, place, or degree. Example: "quickly," "very," "here."
- **Pronoun (PRP)**: A word that takes the place of a noun. Example: "he," "she," "it."
- **Preposition (IN)**: A word that shows the relationship between a noun (or pronoun) and other words in a sentence. Example: "in," "on," "under."
- **Conjunction (CC)**: A word that connects words, phrases, or clauses. Example: "and," "but," "or."
- **Interjection (UH)**: A word or phrase that expresses strong emotion. Example: "wow," "ouch," "oh."
- **Determiner (DT)**: A word that introduces a noun and identifies it in terms of quantity or definiteness. Example: "the," "some," "this." (Possessives such as "my" and "their" get the separate tag PRP$.)
- **Modal (MD)**: An auxiliary verb that expresses possibility, necessity, or ability. Example: "can," "may," "must."
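The tagger output later in this notebook uses finer-grained Penn Treebank tags than the base forms listed above (for example NNS, NNP, VBZ, VBD). As a minimal sketch, the helper below collapses such tags into the coarse categories from the list by checking the tag prefix. The name `coarse_category` and the mapping table are illustrative assumptions, not part of NLTK.

```python
# Coarse categories keyed by Penn Treebank tag prefix (illustrative, not exhaustive)
COARSE_BY_PREFIX = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("PRP", "pronoun"),
    ("IN", "preposition"),
    ("CC", "conjunction"),
    ("UH", "interjection"),
    ("DT", "determiner"),
    ("MD", "modal"),
]

def coarse_category(tag):
    """Map a fine-grained tag such as 'NNS' or 'VBD' to a coarse category name."""
    for prefix, category in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return category
    return "other"  # punctuation, numbers (CD), and tags not listed above

for tag in ["NNS", "NNP", "VBZ", "JJ", "PRP$", "CD"]:
    print(tag, "->", coarse_category(tag))
```

Grouping by prefix keeps the table short, at the cost of lumping together distinctions (singular vs. plural, tense) that the fine-grained tags preserve.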

In [21]:
import nltk
# for removing stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
# for tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
# for parts of speech tagging
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to C:\Users\DELL
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\DELL
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL PC\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [22]:
paragraph = """We know the scheme of English subjects for matric classes.
In this way, the subject of English has two parts, English A and English B.
In this way, the second part is more important and has long questions with more than five marks.
However, the Essays and stores cover about 10 marks for both parts.
But in some academies, the preference for Essays is also included for the 9th class.
Therefore, we are going to provide an Essay on My Country Pakistan for class 9 in this post.
So, you can read this post to check the Essay on My Country.
Basically, the section of the essay will be included in the next class.
But it is also useful to remember for the next class.
So, we are going to mention to all students that you can read and learn for school tests and exams.
The Essay on My Country consists of more than 200 words in this class.
However, the section of the paragraph is also included in the matrix in which the topic My country can come in the final exam.
In this way, you can learn this essay to get good marks in your final exam.
Moreover, the students can also get other essays from this place for their final exam preparation.
In this way, we are covering all important essays that normally come in the annual exam.
So, you can explore this website to get all the important Essay of Matric for your English B preparation."""

In [23]:
sentences = sent_tokenize(paragraph)

In [24]:
sentences

['We know the scheme of English subjects for matric classes.',
 'In this way, the subject of English has two parts, English A and English B.',
 'In this way, the second part is more important and has long questions with more than five marks.',
 'However, the Essays and stores cover about 10 marks for both parts.',
 'But in some academies, the preference for Essays is also included for the 9th class.',
 'Therefore, we are going to provide an Essay on My Country Pakistan for class 9 in this post.',
 'So, you can read this post to check the Essay on My Country.',
 'Basically, the section of the essay will be included in the next class.',
 'But it is also useful to remember for the next class.',
 'So, we are going to mention to all students that you can read and learn for school tests and exams.',
 'The Essay on My Country consists of more than 200 words in this class.',
 'However, the section of the paragraph is also included in the matrix in which the topic My country can come in the fin

In [25]:
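# Build the stop-word set once instead of rebuilding it for every word
stop_words = set(stopwords.words("english"))

for sentence in sentences:
    words = word_tokenize(sentence)
    # The comparison is case-sensitive, so capitalized stop words ("We", "In") are kept
    words_without_stopwords = [word for word in words if word not in stop_words]
    pos_tags = pos_tag(words_without_stopwords)
    print(pos_tags)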

[('We', 'PRP'), ('know', 'VBP'), ('scheme', 'JJ'), ('English', 'JJ'), ('subjects', 'NNS'), ('matric', 'JJ'), ('classes', 'NNS'), ('.', '.')]
[('In', 'IN'), ('way', 'NN'), (',', ','), ('subject', 'JJ'), ('English', 'NNP'), ('two', 'CD'), ('parts', 'NNS'), (',', ','), ('English', 'NNP'), ('A', 'NNP'), ('English', 'NNP'), ('B', 'NNP'), ('.', '.')]
[('In', 'IN'), ('way', 'NN'), (',', ','), ('second', 'JJ'), ('part', 'NN'), ('important', 'JJ'), ('long', 'JJ'), ('questions', 'NNS'), ('five', 'CD'), ('marks', 'NNS'), ('.', '.')]
[('However', 'RB'), (',', ','), ('Essays', 'NNP'), ('stores', 'NNS'), ('cover', 'VBP'), ('10', 'CD'), ('marks', 'NNS'), ('parts', 'NNS'), ('.', '.')]
[('But', 'CC'), ('academies', 'NNS'), (',', ','), ('preference', 'NN'), ('Essays', 'NNP'), ('also', 'RB'), ('included', 'VBD'), ('9th', 'CD'), ('class', 'NN'), ('.', '.')]
[('Therefore', 'RB'), (',', ','), ('going', 'VBG'), ('provide', 'IN'), ('Essay', 'NNP'), ('My', 'NNP'), ('Country', 'NNP'), ('Pakistan', 'NNP'), ('cla
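
In the output above, capitalized stop words such as "We" and "In" survive because NLTK's English stop-word list is lowercase and the comparison is case-sensitive. Removing stop words before tagging also takes context away from the tagger, which may be why "scheme" and "subject" come out as JJ. One possible adjustment, sketched under assumptions: tag the full token list first, then filter the (word, tag) pairs case-insensitively. The helper name `drop_stop_words` and the tiny `STOP_WORDS` set are hypothetical stand-ins.

```python
# Small stand-in for set(stopwords.words("english")); the real NLTK list is much longer
STOP_WORDS = {"we", "the", "of", "for", "in", "this"}

def drop_stop_words(tagged_tokens, stop_words=STOP_WORDS):
    """Remove stop words from (word, tag) pairs, ignoring letter case."""
    return [(word, tag) for word, tag in tagged_tokens
            if word.lower() not in stop_words]

# (word, tag) pairs copied from the first line of the output above
tagged = [('We', 'PRP'), ('know', 'VBP'), ('scheme', 'JJ'), ('English', 'JJ'),
          ('subjects', 'NNS'), ('matric', 'JJ'), ('classes', 'NNS'), ('.', '.')]

print(drop_stop_words(tagged))
```

In the notebook, the input would come from `pos_tag(word_tokenize(sentence))`, so the tagger sees the complete sentence before anything is removed.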

In [26]:
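# Try a new sentence. Note: str.split() splits only on whitespace, so punctuation
# stays attached to words ("name!", "childrens."), which can mislead the tagger.
text = "Abdullah is nice name! Many parents keep this name of their beautiful childrens."
print(pos_tag(text.split()))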

[('Abdullah', 'NNP'), ('is', 'VBZ'), ('nice', 'JJ'), ('name!', 'RB'), ('Many', 'JJ'), ('parents', 'NNS'), ('keep', 'VB'), ('this', 'DT'), ('name', 'NN'), ('of', 'IN'), ('their', 'PRP$'), ('beautiful', 'JJ'), ('childrens.', 'NN')]
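
Because `text.split()` splits only on whitespace, "name!" and "childrens." reach the tagger with punctuation attached. The sketch below shows the difference using NLTK's regex-based `wordpunct_tokenize`, chosen here only because it needs no downloaded model; `word_tokenize` from the cells above serves the same purpose.

```python
from nltk.tokenize import wordpunct_tokenize

text = "Abdullah is nice name! Many parents keep this name of their beautiful childrens."

whitespace_tokens = text.split()
punct_aware_tokens = wordpunct_tokenize(text)

# Tokens that still carry punctuation after a plain whitespace split
print([tok for tok in whitespace_tokens if not tok.isalnum()])

# With a tokenizer, punctuation becomes its own token
print(punct_aware_tokens[3:5])
```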


## 2. Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying named entities (such as persons, organizations, locations, dates, and more) in a text. The goal of NER is to extract structured information from unstructured text, making it easier to understand and analyze.

**For example**:

"This Metal Estate company was established in 1920 in Pakistan with budget of 1 million dollars, which takes around 5 years."

**An NER system would ideally extract**:

- Organization: Metal Estate
- Location: Pakistan
- Date: 1920
- Money: 1 million dollars
- Duration: 5 years
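
To make "structured information" concrete, the sketch below stores each expected entity as a small record with its label and character offsets in the sentence. The `Entity` class, the `locate` helper, and the label names are illustrative assumptions, not NLTK output; the phrases are located by plain string search.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    label: str   # entity type, e.g. "ORGANIZATION"
    text: str    # surface form as it appears in the sentence
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity

def locate(sentence, labelled_phrases):
    """Turn (label, phrase) pairs into Entity records via plain string search."""
    entities = []
    for label, phrase in labelled_phrases:
        start = sentence.find(phrase)
        if start != -1:  # skip phrases that do not occur in the sentence
            entities.append(Entity(label, phrase, start, start + len(phrase)))
    return entities

sentence = ("This Metal Estate company was established in 1920 in Pakistan "
            "with budget of 1 million dollars, which takes around 5 years.")
expected = [("ORGANIZATION", "Metal Estate"), ("LOCATION", "Pakistan"),
            ("DATE", "1920"), ("MONEY", "1 million dollars"),
            ("DURATION", "5 years")]

for entity in locate(sentence, expected):
    print(entity)
```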

In [28]:
from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to C:\Users\DELL
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to C:\Users\DELL
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

To apply NER with NLTK:

- First, apply `word_tokenize` to split the sentence into words.
- Second, apply `pos_tag` to assign each word a grammatical category.
- Third, apply `ne_chunk` to group the tagged words into named entities.
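
The three steps above can be chained in a single function. This is a minimal sketch: the names `extract_entities` and `entities_from_tree` are hypothetical, and the code assumes only that `ne_chunk` returns an `nltk.Tree` in which each named entity is a labelled subtree.

```python
from nltk import Tree, ne_chunk, pos_tag, word_tokenize

def entities_from_tree(tree):
    """Collect (entity text, label) pairs from a chunk tree."""
    entities = []
    for node in tree:
        # Named entities are subtrees; ordinary tokens are plain (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

def extract_entities(sentence):
    """Chain the three steps: tokenize, POS-tag, then chunk named entities."""
    words = word_tokenize(sentence)   # step 1: sentence -> words
    tagged = pos_tag(words)           # step 2: words -> (word, tag) pairs
    tree = ne_chunk(tagged)           # step 3: tagged words -> chunk tree
    return entities_from_tree(tree)
```

Calling `extract_entities(sentence)` requires the resources downloaded in the cells above. Which entities come back, and with which labels, depends on the chunker model.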

In [29]:
sentence = "This Metal Estate company was established in 1920 in Pakistan with budget of 1 million dollars, which takes around 5 years."
sentence

'This Metal Estate company was established in 1920 in Pakistan with budget of 1 million dollars, which takes around 5 years.'

In [30]:
# tokenizing
words = word_tokenize(sentence)
words

['This',
 'Metal',
 'Estate',
 'company',
 'was',
 'established',
 'in',
 '1920',
 'in',
 'Pakistan',
 'with',
 'budget',
 'of',
 '1',
 'million',
 'dollars',
 ',',
 'which',
 'takes',
 'around',
 '5',
 'years',
 '.']

In [31]:
pos_tags = pos_tag(words)
pos_tags

[('This', 'DT'),
 ('Metal', 'NNP'),
 ('Estate', 'NNP'),
 ('company', 'NN'),
 ('was', 'VBD'),
 ('established', 'VBN'),
 ('in', 'IN'),
 ('1920', 'CD'),
 ('in', 'IN'),
 ('Pakistan', 'NNP'),
 ('with', 'IN'),
 ('budget', 'NN'),
 ('of', 'IN'),
 ('1', 'CD'),
 ('million', 'CD'),
 ('dollars', 'NNS'),
 (',', ','),
 ('which', 'WDT'),
 ('takes', 'VBZ'),
 ('around', 'IN'),
 ('5', 'CD'),
 ('years', 'NNS'),
 ('.', '.')]

In [34]:
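# ne_chunk returns an nltk.Tree; draw() only opens a viewer window and returns None,
# so keep the tree in a variable and call draw() on it separately
NER = ne_chunk(pos_tags)
NER.draw()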
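
`draw()` needs a graphical display. Where none is available, the chunk tree can be inspected as text instead. Below is a minimal sketch using `nltk.chunk.tree2conlltags`, which flattens a chunk tree into (word, POS tag, IOB tag) triples. The small tree is hand-built for illustration and is not the actual `ne_chunk` output for the sentence above.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built stand-in for a chunk tree (hypothetical, not real ne_chunk output)
example_tree = Tree("S", [
    ("established", "VBN"),
    ("in", "IN"),
    Tree("GPE", [("Pakistan", "NNP")]),
])

# Tokens outside any entity get "O"; entity tokens get "B-<label>" or "I-<label>"
for word, pos, iob in tree2conlltags(example_tree):
    print(word, pos, iob)
```

In the notebook, the same call can be applied to the tree returned by `ne_chunk(pos_tags)`.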

End of Code!