<a href="https://colab.research.google.com/github/Samraddhi-Gupta/vision-image-transformers/blob/main/Text_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Feature Engineering

Before any data is fed to an ML model, it has to be properly preprocessed. You have probably heard the saying: garbage in, garbage out (GIGO). Text is a specific kind of data that most ML models cannot consume directly, so you first have to extract numerical features from it, in other words, vectorize it. Vectorization is not the main topic of this tutorial, but keep in mind that GIGO applies to vectorization as well: you can extract good-quality features only from well-preprocessed text.

Contents:

* Extracting Basic Features
    - Number of Characters
    - Number of Words
    - Average Word Length
    - Number of Hashtags and Mentions (Social Media)
    - Number of Sentences
    - Number of Paragraphs
    - Words Starting with an Uppercase
    - All-capital Words
    - Numeric quantities
* Text Preprocessing
    - Tokenization/Segmentation
    - Lemmatization
    - Stemming
    - Converting to Lowercase
    - Text Cleaning
        - Removing Unnecessary Whitespaces and Escape Sequences
        - Removing Punctuation
        - Removing Stopwords or Commonly Occurring Words/Tokens
        - Removing Special Characters (emojis, numbers...)
        - Expanding Contractions (don't, etc.)
* Extracting Word Features
    - Parts-of-Speech (POS) Tagging
    - Named Entity Recognition (NER)
* Dependency Parsing (Not in the Notebook)
* Vectorization (Convert documents into a set of numerical features)
    - Bag of Words (Bag of n-grams)
    - tf-idf

_Dataset: [troll-tweets](https://github.com/fivethirtyeight/russian-troll-tweets)_
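
The basic features listed in the contents can be computed with plain Python. The following is a minimal sketch, not the notebook's original code: the function name `basic_features`, the feature names, and the sample tweet are all made up for illustration, and words are split on whitespace only.

```python
import re

def basic_features(text):
    """Return simple count-based features for one piece of text (illustrative helper)."""
    words = text.split()  # naive whitespace split, punctuation stays attached
    total_letters = sum(len(w) for w in words)
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "avg_word_length": total_letters / len(words) if words else 0.0,
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_capitalized": sum(1 for w in words if w[0].isupper()),
        "num_all_caps": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "num_numeric": sum(1 for w in words if w.isdigit()),
    }

# Made-up example tweet, not taken from the dataset
sample_tweet = "BREAKING: 3 ships left port today @navy_news #maritime #security"
features = basic_features(sample_tweet)
print(features)
```

Note that the `@\w+` pattern would also count the domain part of an email address as a mention, so real tweets may need a stricter pattern.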

In [None]:
!pip install nltk gensim spacy textblob



In [None]:
# Import necessary modules

import re

import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger') # POS tagging
# nltk.download('maxent_ne_chunker') # NER
# nltk.download('words') # NER
# nltk.download('tagsets') # descriptions of the available POS tags

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import regexp_tokenize, TweetTokenizer
from nltk.stem import WordNetLemmatizer

from gensim.models.tfidfmodel import TfidfModel

import spacy

import matplotlib.pyplot as plt
%matplotlib inline

from collections import Counter, defaultdict

from itertools import chain

# Preparing Text for NLP

### Regex

A _regular expression_ is a string that combines normal (ordinary) characters with special metacharacters to define a search pattern for finding text or positions within a text. Normal characters have a literal meaning and match themselves, whereas metacharacters have a special meaning: they represent classes of characters, such as `\d` for digits or `\w` for word characters, or they express ideas such as repetition.

>Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals. The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with `'r'`. So `r"\n"` is a two-character string containing `'\'` and `'n'`, while `"\n"` is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation. [source](https://docs.python.org/3/library/re.html)

Some of the special characters are shown in the tables below.

Metacharacter|Description
---|:---
`.` | Wildcard: matches any single character except `\n`
`^` | Matches the start of the string
`$` | Matches the end of the string or just before the newline at the end of the string
`\|` | The OR (choice) operator matches either the expression before or the expression after the operator. For example, `abc\|def` matches "abc" or "def".

**Repetition Quantifiers:**

Metacharacter|Description
---|:---
`*` | Matches the preceding element 0 or more times. `ab*` will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s
`+` | Matches the preceding element 1 or more times. `ab+` will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’
`?` | Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?` will match either ‘a’ or ‘ab’.
`{m,n}` | Matches the preceding element at least m and not more than n times. For example, `a{3,5}` matches only "aaa", "aaaa", and "aaaaa"

**Special Sequences:**

Metacharacter|Description
---|:---
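`\w` | Matches word characters: a-z, A-Z, 0-9, and the underscore `_`. For Python 3 `str` patterns it also matches other Unicode letters and digits unless `re.ASCII` is set.
`\d` | Matches digits, which means 0-9 (and other Unicode decimal digits for `str` patterns unless `re.ASCII` is set).
`\s` | Matches whitespace characters, which include `\t`, `\n`, `\r`, `\f`, `\v`, and the space character.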
`\b` | Matches the boundary (or empty string) at the start and end of a word, that is, between \w and \W.
`\A` | Matches the beginning of a string (but not an internal line).
`\Z` | Matches only at the end of a string (but not at the end of an internal line).

**Ranges (Set of Characters):**

Metacharacter|Description
---|:---
`[ ]` | Used to indicate a set of characters. Matches a single character that is contained within the brackets
`[amk]` | Matches either a, m, or k. It does not match `amk`
`[a-z]` | Matches any lowercase letter from `a` to `z`
`[a\-z]` | Matches a, -, or z. It matches `-` because `\` escapes it
`[a-cx-z]` | Matches a, b, c, x, y, z
`[0-5][0-9]` | Matches all two-digit numbers from 00 to 59
`[a-]` | Matches a or -, because `-` is not being used to indicate a series of characters
`[-a]` | As above, matches a or -
`[a-z0-9]` | Matches characters from a to z and also from 0 to 9
`[(+*)]` | Special characters become literal inside a set, so this matches (, +, *, and )
`[^ab2]` | Adding ^ excludes any character in the set. Here, it matches characters that are not a, b, or 2

**Groups:**

Metacharacter|Description
---|:---
`( )` | Matches the expression inside the parentheses and groups it. Groups a series of pattern elements to a single element.
`(? )` | Inside parentheses like this, ? acts as an extension notation. Its meaning depends on the character immediately to its right.
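
A few of the metacharacters from the tables can be tried out with Python's built-in `re` module. This is a small sketch for illustration; the `sample` sentence and the variable names are invented here and are not part of the dataset.

```python
import re

# A raw string keeps the backslash: r"\n" is two characters, "\n" is one.
print(len(r"\n"), len("\n"))

# Made-up sample sentence
sample = "Order 66 shipped on 2021-03-04; order 7 is pending."

# \d+ : one or more digits
numbers = re.findall(r"\d+", sample)

# \b\w{7}\b : whole words made of exactly seven word characters
seven_chars = re.findall(r"\b\w{7}\b", sample)

# (\d{4})-(\d{2})-(\d{2}) : groups capture the parts of a date
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample)

# [Oo]rder : a character set matches either case of the first letter
orders = re.findall(r"[Oo]rder", sample)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_order = bool(re.search(r"^Order", sample))
ends_with_pending = bool(re.search(r"pending\.$", sample))

print(numbers, seven_chars, date.groups(), orders)
print(starts_with_order, ends_with_pending)
```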

In [None]:
extracted_text = '<h2>HTML Element</h2><p>The HTML <code class="w3-codespan">&#x8a;sup&#x3e;</code> element defines superscript text. Superscript text appears half a character above the normal line, and is sometimes rendered in a smaller font. Superscript text can be used for footnotes, like WWW<sup>[1]</sup> up>:</p>'

In [None]:
extracted_text

'<h2>HTML Element</h2><p>The HTML <code class="w3-codespan">&#x8a;sup&#x3e;</code> element defines superscript text. Superscript text appears half a character above the normal line, and is sometimes rendered in a smaller font. Superscript text can be used for footnotes, like WWW<sup>[1]</sup> up>:</p>'

In [None]:
import re

def striphtml(data):
    # Non-greedy match: remove everything from each '<' up to the nearest '>'
    p = re.compile(r'<.*?>')
    return p.sub('', data)

In [None]:
striphtml(extracted_text)

'HTML ElementThe HTML &#x8a;sup&#x3e; element defines superscript text. Superscript text appears half a character above the normal line, and is sometimes rendered in a smaller font. Superscript text can be used for footnotes, like WWW[1] up>:'
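
The output above still contains HTML entities such as `&#x3e;`, because the regex removes only tags. One possible follow-up step, sketched here under the assumption that decoded characters are wanted, is the standard library's `html.unescape`. The helper name `strip_html_and_unescape` and the `snippet` string are hypothetical.

```python
import html
import re

def strip_html_and_unescape(data):
    """Remove tags with a non-greedy regex, then decode HTML entities."""
    without_tags = re.sub(r"<.*?>", "", data)
    return html.unescape(without_tags)

# Made-up snippet for illustration
snippet = "<p>The <code>&lt;sup&gt;</code> element &amp; friends</p>"
cleaned = strip_html_and_unescape(snippet)
print(cleaned)
```

The order matters: decoding entities before stripping tags would turn `&lt;sup&gt;` into `<sup>`, which the regex would then delete as if it were a tag. For messy real-world markup, a proper HTML parser is usually more robust than a regex.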

### Unicode Normalization

In [None]:
emoji_text = """First, we established subjects. I was 🙂, and I could use it to represent "me" in sentences. For example, 🙂🚶‍♀️🏃‍♂️ could mean "I'm running to work." """

In [None]:
emoji_text

'First, we established subjects. I was 🙂, and I could use it to represent "me" in sentences. For example, 🙂🚶\u200d♀️🏃\u200d♂️ could mean "I\'m running to work." '

In [None]:
emoji_text.encode('utf-8')

b'First, we established subjects. I was \xf0\x9f\x99\x82, and I could use it to represent "me" in sentences. For example, \xf0\x9f\x99\x82\xf0\x9f\x9a\xb6\xe2\x80\x8d\xe2\x99\x80\xef\xb8\x8f\xf0\x9f\x8f\x83\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f could mean "I\'m running to work." '
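
The cells above only inspect the raw code points and bytes. As a hedged sketch of what normalization itself can look like, the standard library's `unicodedata.normalize` converts between composed and decomposed forms. The example strings and the helper `strip_accents` are assumptions made for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render the same but compare as different
print(composed == decomposed)  # False

# NFC composes characters, NFD decomposes them
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# NFKC additionally replaces compatibility characters such as the 'fi' ligature
nfkc = unicodedata.normalize("NFKC", "\ufb01nal")

def strip_accents(text):
    """Drop combining marks after decomposing (illustrative helper)."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))

print(nfc == composed, len(nfd), nfkc, strip_accents(composed))
```

Normalizing all text to one form before tokenization helps ensure that visually identical words are counted as the same token.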

### Spelling Check

In [None]:
incorrect_text = 'ceertain conditionas during seveal ggenerations aree mmodified in the saame maner.'

In [None]:
from textblob import TextBlob
textblb = TextBlob(incorrect_text)
textblb.correct()

TextBlob("certain conditions during several generations are modified in the same manner.")

## Tokenization

In [None]:
text = '''
          In the past two decades, there has been a significant shift in naval missions toward operations other than war.
          Maritime security operations such as counter-piracy, maritime interdiction, maritime patrol, and naval escort are the main focus of most fleets today; however, the vessels that are currently being used in such operations were mainly built for other purposes.
          For instance, in August 2009, the North Atlantic Council approved “Operation Ocean Shield” to fight piracy in the Gulf of Aden.
          Among ships that were assigned in the rotations of this NATO mission, many were destroyers and frigates.
          Although those warships can be used in such missions, how reasonable is it to risk a destroyer or a frigate to fight with terrorist boats or pirates?
          Capable Warships vs. Smaller Combatants
          Many surface vessels that perform maritime security operations, as in the NATO Task Force example, are sophisticated warships capable of anti-surface warfare (ASUW), anti-air warfare (AAW), and anti-submarine warfare (ASW).
          Although these sophisticated multi-mission capable fleets are able to achieve good results in expeditionary warfare against a strong enemy [1], the capabilities of those ships will probably be used in less than 1% of their total lifetime.
          It seems a sound reason to build capable ships in case of a conventional war, and one can claim that capable ships are built to be used in that small period of their lifetime; nevertheless, navies should optimize their efforts and resources in some way to find a better mix of vessel types and systems that constitute the vessels.
      '''

In [None]:
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)
sentences

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['\n          In the past two decades, there has been a significant shift in naval missions toward operations other than war.',
 'Maritime security operations such as counter-piracy, maritime interdiction, maritime patrol, and naval escort are the main focus of most fleets today; however, the vessels that are currently being used in such operations were mainly built for other purposes.',
 'For instance, in August 2009, the North Atlantic Council approved “Operation Ocean Shield” to fight piracy in the Gulf of Aden.',
 'Among ships that were assigned in the rotations of this NATO mission, many were destroyers and frigates.',
 'Although those warships can be used in such missions, how reasonable is it to risk a destroyer or a frigate to fight with terrorist boats or pirates?',
 'Capable Warships vs.',
 'Smaller Combatants\n          Many surface vessels that perform maritime security operations, as in the NATO Task Force example, are sophisticated warships capable of anti-surface warfare

In [None]:
for sentence in sentences:
  print(word_tokenize(sentence))

['In', 'the', 'past', 'two', 'decades', ',', 'there', 'has', 'been', 'a', 'significant', 'shift', 'in', 'naval', 'missions', 'toward', 'operations', 'other', 'than', 'war', '.']
['Maritime', 'security', 'operations', 'such', 'as', 'counter-piracy', ',', 'maritime', 'interdiction', ',', 'maritime', 'patrol', ',', 'and', 'naval', 'escort', 'are', 'the', 'main', 'focus', 'of', 'most', 'fleets', 'today', ';', 'however', ',', 'the', 'vessels', 'that', 'are', 'currently', 'being', 'used', 'in', 'such', 'operations', 'were', 'mainly', 'built', 'for', 'other', 'purposes', '.']
['For', 'instance', ',', 'in', 'August', '2009', ',', 'the', 'North', 'Atlantic', 'Council', 'approved', '“', 'Operation', 'Ocean', 'Shield', '”', 'to', 'fight', 'piracy', 'in', 'the', 'Gulf', 'of', 'Aden', '.']
['Among', 'ships', 'that', 'were', 'assigned', 'in', 'the', 'rotations', 'of', 'this', 'NATO', 'mission', ',', 'many', 'were', 'destroyers', 'and', 'frigates', '.']
['Although', 'those', 'warships', 'can', 'be', 

### Lowercasing

In [None]:
# Tokenize the article
tokens = word_tokenize(text)

# Convert the tokens into lowercase
lower_tokens = [t.lower() for t in tokens]

print(text)
print(lower_tokens)


          In the past two decades, there has been a significant shift in naval missions toward operations other than war. 
          Maritime security operations such as counter-piracy, maritime interdiction, maritime patrol, and naval escort are the main focus of most fleets today; however, the vessels that are currently being used in such operations were mainly built for other purposes. 
          For instance, in August 2009, the North Atlantic Council approved “Operation Ocean Shield” to fight piracy in the Gulf of Aden. 
          Among ships that were assigned in the rotations of this NATO mission, many were destroyers and frigates. 
          Although those warships can be used in such missions, how reasonable is it to risk a destroyer or a frigate to fight with terrorist boats or pirates?
          Capable Warships vs. Smaller Combatants
          Many surface vessels that perform maritime security operations, as in the NATO Task Force example, are sophisticated warships capab

# Lemmatization And Stemming In Natural Language Processing
>The languages we speak and write are made up of words that are often derived from one another. A language in which words change form as their use in speech changes is called an **inflected language**.

> ####  "In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change" [Wikipedia]

>The degree of inflection may be higher or lower in a language. As the definition above suggests, the inflected forms of a word share a common root form. Let's look at a few examples:

![img](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1539984207/stemminglemmatization_n8bmou.jpg)

>The examples above illustrate the concept of text normalization. Normalization is not restricted to written documents; it applies to speech as well.

Normalization in this context can be of two types:
* Stemming
* Lemmatization

>Stemming and lemmatization help us obtain the root forms (sometimes called synonyms in a search context) of inflected (derived) words. Stemming differs from lemmatization in the approach it uses to produce root forms and in the kind of word it produces. We will learn more about this below.

## Stemming

> **"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."**

> The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, and mis-. Stems are created by removing the suffixes or prefixes used with a word, so stemming a word or sentence may result in words that are not actual words.

>The NLTK library provides several stemming algorithms, shown in the following diagram.
![img](https://www.tutorialspoint.com/natural_language_toolkit/images/stemming_algorithms.jpg)

* Porter stemmer: This is the oldest of these algorithms, published in 1980. Its main concern is removing common word endings so that related words resolve to a common form.
* Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is widely regarded as an improvement over the Porter stemmer, as acknowledged even by Martin Porter, the author of the original Porter stemmer.
* Lancaster stemmer: The Lancaster stemming algorithm is another option. It is the most aggressive stemming algorithm of the bunch.
* Regular expression stemmer: It takes a single regular expression and removes any prefix or suffix that matches the expression.

Some basic examples of stemming:
![img](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQB544U1IwwokOrtUpO3iOx4riHTzSXnChPWg&usqp=CAU)

# Lemmatization
#### "Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words."

>***For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words. Because lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.***

Some examples of lemmatization are as follows:

![img](https://kavita-ganesan.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-20-at-4.49.08-PM.png)

#### ***The major point to note is that every lemmatized word belongs to the language, unlike a stem.***

# Lemmatization Vs Stemming
> From the discussion above, we can draw the following conclusions about the difference between lemmatization and stemming:

* Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.
* Stemming applies a fixed sequence of rule-based steps to each word, which makes it fast. Lemmatization looks words up in a lexical resource such as the WordNet corpus, which makes it slower than stemming.

![img](https://miro.medium.com/max/2050/1*ES5bt7IoInIq2YioQp2zcQ.png)

> **We can see how the meaning of the word is conserved in lemmatization as compared to stemming.**
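
The contrast can be sketched with two toy functions. This is a deliberately simplified illustration and not how NLTK's stemmers or the WordNet lemmatizer work internally: `toy_stem`, `toy_lemmatize`, and the tiny `LEMMA_TABLE` dictionary are hypothetical stand-ins.

```python
# Hypothetical mini-dictionary standing in for a lexical resource such as WordNet
LEMMA_TABLE = {
    "ran": "run",
    "runs": "run",
    "running": "run",
    "histories": "history",
}

def toy_stem(word):
    """Rule-based: chop off the first matching suffix, no dictionary involved."""
    for suffix in ("ies", "ing", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Lookup-based: return the dictionary form if the word is known."""
    return LEMMA_TABLE.get(word, word)

for w in ["running", "ran", "histories"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

The stem of "histories" comes out as "histor", which is not a word, while the lookup returns "history". The irregular form "ran" is untouched by suffix rules but resolved by the lookup, at the cost of needing a dictionary.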

## Stemming

In [None]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer

In [None]:
from nltk.stem import PorterStemmer

stemming=PorterStemmer()

In [None]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [None]:
stemming.stem('congratulations')

'congratul'

### RegexpStemmer class
NLTK provides the `RegexpStemmer` class for implementing regular-expression stemmers. It takes a single regular expression and removes any prefix or suffix that matches the expression; the `min` argument is the minimum word length for stemming to apply, so shorter words are returned unchanged. Let us see an example:

In [None]:
from nltk.stem import RegexpStemmer

reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [None]:
reg_stemmer.stem('eating')

'eat'

In [None]:
reg_stemmer.stem('ingeating')

'ingeat'
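
The behaviour seen above can be approximated with the standard `re` module alone. This is a rough sketch of the idea, not NLTK's actual source: the function `regex_stem` and its parameter names are made up for illustration.

```python
import re

def regex_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Remove whatever the pattern matches, unless the word is too short."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

for w in ["eating", "ingeating", "cars", "readable", "is"]:
    print(w, "---->", regex_stem(w))
```

Because every alternative ends with `$`, only the end of the word is affected, which is why the leading "ing" in "ingeating" survives. Dropping the `$` anchors would remove matches anywhere in the word.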

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an improved version of the Porter stemmer that fixes some of its issues.

In [None]:
from nltk.stem import SnowballStemmer

snowballsstemmer=SnowballStemmer('english')

In [None]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [None]:
# `stemming` is the PorterStemmer created above
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [None]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

## Wordnet Lemmatizer
Lemmatization is similar to stemming. The output of lemmatization is called a ‘lemma’, which is a root word rather than a root stem (the output of stemming). After lemmatization, we get a valid word that means the same thing.

NLTK provides the `WordNetLemmatizer` class, a thin wrapper around the WordNet corpus. This class uses the `morphy()` function of the WordNet `CorpusReader` class to find a lemma. Let us understand it with an example:

In [None]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer=WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# Part-of-speech codes accepted by lemmatize():
#   noun 'n' (the default), verb 'v', adjective 'a', adverb 'r'
lemmatizer.lemmatize("going",pos='v')

'go'

In [None]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


### Removing Stopwords

In [None]:
from nltk.corpus import stopwords

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
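stemmer = PorterStemmer()

# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

sentences = nltk.sent_tokenize(text)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # The comparison is case-sensitive, so a capitalized stopword such as "In" is kept
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)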

In [None]:
sentences

['in past two decad , signific shift naval mission toward oper war .',
 'maritim secur oper counter-piraci , maritim interdict , maritim patrol , naval escort main focu fleet today ; howev , vessel current use oper mainli built purpos .',
 'for instanc , august 2009 , north atlant council approv “ oper ocean shield ” fight piraci gulf aden .',
 'among ship assign rotat nato mission , mani destroy frigat .',
 'although warship use mission , reason risk destroy frigat fight terrorist boat pirat ?',
 'capabl warship vs .',
 'smaller combat mani surfac vessel perform maritim secur oper , nato task forc exampl , sophist warship capabl anti-surfac warfar ( asuw ) , anti-air warfar ( aaw ) , anti-submarin warfar ( asw ) .',
 'although sophist multi-miss capabl fleet abl achiev good result expeditionari warfar strong enemi [ 1 ] , capabl ship probabl use less 1 % total lifetim .',
 'it seem sound reason build capabl ship case convent war , one claim capabl ship built use small period lifetim ;

### Parts of Speech Tagging

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
# Note: `sentences` now holds the stemmed text from the previous step,
# so the tagger sees stems such as 'decad' rather than real words.
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [word for word in words if word not in stop_words]
    pos_tag = nltk.pos_tag(words)
    print(pos_tag)

[('past', 'JJ'), ('two', 'CD'), ('decad', 'NN'), (',', ','), ('signific', 'JJ'), ('shift', 'NN'), ('naval', 'JJ'), ('mission', 'NN'), ('toward', 'IN'), ('oper', 'JJR'), ('war', 'NN'), ('.', '.')]
[('maritim', 'NN'), ('secur', 'JJ'), ('oper', 'IN'), ('counter-piraci', 'NN'), (',', ','), ('maritim', 'NN'), ('interdict', 'NN'), (',', ','), ('maritim', 'NN'), ('patrol', 'NN'), (',', ','), ('naval', 'JJ'), ('escort', 'NN'), ('main', 'JJ'), ('focu', 'NN'), ('fleet', 'NN'), ('today', 'NN'), (';', ':'), ('howev', 'NN'), (',', ','), ('vessel', 'FW'), ('current', 'JJ'), ('use', 'NN'), ('oper', 'NN'), ('mainli', 'NN'), ('built', 'VBN'), ('purpos', 'NN'), ('.', '.')]
[('instanc', 'NN'), (',', ','), ('august', 'JJ'), ('2009', 'CD'), (',', ','), ('north', 'JJ'), ('atlant', 'JJ'), ('council', 'NN'), ('approv', 'NN'), ('“', 'NNP'), ('oper', 'IN'), ('ocean', 'JJ'), ('shield', 'NN'), ('”', 'NNP'), ('fight', 'NN'), ('piraci', 'NN'), ('gulf', 'NN'), ('aden', 'NN'), ('.', '.')]
[('among', 'IN'), ('ship', '

In [None]:
print(nltk.pos_tag("Taj Mahal is a beautiful Monument".split()))

[('Taj', 'NNP'), ('Mahal', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'), ('Monument', 'NN')]


### Named Entity Recognition

In [None]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sample text
text = "Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. Microsoft Corporation is an American multinational technology corporation which produces computer software, consumer electronics, personal computers, and related services."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
tagged = pos_tag(tokens)

# Named Entity Recognition
entities = ne_chunk(tagged)

# Print the entities
print(entities)