In [None]:
2. Fundamentals of NLP
● Text Preprocessing:
○ Tokenization (Word, Sentence, Subword)
○ Lemmatization and Stemming
○ Stop Words Removal
● Part-of-Speech (POS) Tagging
● Named Entity Recognition (NER)

In [None]:
What are Stop Words?
Stop words are very common words (e.g., is, the, and, in, to, of, for)
that carry little meaning on their own. Removing them reduces noise and
shrinks the vocabulary, which can help tasks such as search and topic modeling.

Example Stop Words:
English: a, an, the, is, of, in, for, on, with, and
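
In [None]:
The idea can be sketched without any library: lowercase each token and drop it
if it appears in a stop list. The small list below is just the example words
above, not NLTK's full list, and remove_stop_words is a hypothetical helper name
used only for illustration.

```python
# Hand-written stop list taken from the example above (an assumption for
# illustration; real lists such as NLTK's are much longer).
EXAMPLE_STOP_WORDS = {"a", "an", "the", "is", "of", "in", "for", "on", "with", "and"}


def remove_stop_words(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat is on the mat".split()
kept = remove_stop_words(tokens)
print(kept)  # ['cat', 'mat']
```

Lowercasing before the lookup matters: "The" would otherwise slip through,
because the stop list holds lowercase entries only.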

In [None]:
✅ Use when processing large text data for search engines, classification, and topic modeling.
❌ Avoid if stop words contribute meaning (e.g., sentiment analysis, machine translation).
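
In [None]:
The sentiment analysis caveat is easy to see with negation. The sketch below
assumes a small hand-written stop list that contains "not" (NLTK's English list
also includes it). The two sentences have opposite sentiment but become
identical after filtering. strip_stop_words is a hypothetical helper.

```python
# Small illustrative stop list (an assumption, not a library list).
STOP_WORDS = {"the", "was", "is", "not", "a"}


def strip_stop_words(sentence):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]


positive = strip_stop_words("The movie was good")
negative = strip_stop_words("The movie was not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the negation has been lost
```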

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
text = "This is a simple example to show how stop words removal works in NLP."
text

'This is a simple example to show how stop words removal works in NLP.'

In [5]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
# Tokenization
words = word_tokenize(text)
words

['This',
 'is',
 'a',
 'simple',
 'example',
 'to',
 'show',
 'how',
 'stop',
 'words',
 'removal',
 'works',
 'in',
 'NLP',
 '.']

In [7]:
# Removing stop words (build the set once so each lookup is fast)
english_stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in english_stop_words]
filtered_words

['simple', 'example', 'show', 'stop', 'words', 'removal', 'works', 'NLP', '.']

In [8]:
# Collecting the stop words that were removed
stop_words1 = [word for word in words if word.lower() in stopwords.words("english")]
stop_words1

['This', 'is', 'a', 'to', 'how', 'in']

In [9]:
print("Original tokens:", words)
print("After stop word removal:", filtered_words)
print("Stop words found in the text:", stop_words1)

Original tokens: ['This', 'is', 'a', 'simple', 'example', 'to', 'show', 'how', 'stop', 'words', 'removal', 'works', 'in', 'NLP', '.']
After stop word removal: ['simple', 'example', 'show', 'stop', 'words', 'removal', 'works', 'NLP', '.']
Stop words found in the text: ['This', 'is', 'a', 'to', 'how', 'in']
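
In [None]:
The period survives filtering because it is not in the stop list. If
punctuation should go too, one option is to keep only alphabetic tokens with
str.isalpha(). The sketch below starts from the filtered list shown in the
output above, copied in literally so that it runs on its own. Note that
isalpha() also drops tokens containing digits or hyphens, so this is a
simplification rather than a general rule.

```python
# Filtered tokens copied from the output above.
tokens_after_filtering = ['simple', 'example', 'show', 'stop', 'words',
                          'removal', 'works', 'NLP', '.']

# Keep only tokens made up entirely of letters.
content_words = [w for w in tokens_after_filtering if w.isalpha()]
print(content_words)
# ['simple', 'example', 'show', 'stop', 'words', 'removal', 'works', 'NLP']
```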


In [None]:
# 2) Part-of-Speech (POS) Tagging

In [None]:
What is POS Tagging?
POS tagging is the process of labeling each word in a sentence with its
part of speech, such as noun, verb, adjective, or pronoun.

1. Helps with parsing and Named Entity Recognition (NER)
2. Used in chatbots, grammar correction, and sentiment analysis

POS Tagging Example:
Word	POS Tag
NLP	Noun
is	Verb
fun	Adjective



In [10]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [11]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

In [12]:
text = "NLP is an exciting field of artificial intelligence."
words = word_tokenize(text)
words

['NLP',
 'is',
 'an',
 'exciting',
 'field',
 'of',
 'artificial',
 'intelligence',
 '.']

In [13]:
# POS Tagging
pos_tags = pos_tag(words)
pos_tags

[('NLP', 'NNP'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('exciting', 'JJ'),
 ('field', 'NN'),
 ('of', 'IN'),
 ('artificial', 'JJ'),
 ('intelligence', 'NN'),
 ('.', '.')]

In [None]:
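Common POS Tags in NLTK (Penn Treebank tagset):

NN → Noun, singular (NNP → proper noun, as in the output above)
VB → Verb, base form (VBZ → 3rd person singular present)
JJ → Adjective
RB → Adverb
DT → Determiner
IN → Preposition or subordinating conjunction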
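
In [None]:
The tagger output above contains more specific tags such as NNP and VBZ.
These fine-grained tags share a two-letter prefix with their coarse category,
so a simple prefix check can group them. The sketch below copies the tagged
pairs from the output above. coarse_category is a hypothetical helper, and its
prefix table covers only noun, verb, adjective, and adverb; every other tag is
reported as "Other".

```python
# (word, tag) pairs copied from the pos_tag output above.
nltk_tagged = [('NLP', 'NNP'), ('is', 'VBZ'), ('an', 'DT'), ('exciting', 'JJ'),
               ('field', 'NN'), ('of', 'IN'), ('artificial', 'JJ'),
               ('intelligence', 'NN'), ('.', '.')]

# Two-letter prefix -> coarse category (illustrative, not exhaustive).
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def coarse_category(tag):
    """Map a fine-grained tag such as 'VBZ' to a coarse label by prefix."""
    return PREFIX_TO_CATEGORY.get(tag[:2], "Other")


for word, tag in nltk_tagged:
    print(f"{word:<14}{tag:<5}{coarse_category(tag)}")
```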

In [14]:
# spaCy

import spacy

nlp = spacy.load("en_core_web_sm")

In [15]:
text = "NLP is an exciting field of artificial intelligence."
text

'NLP is an exciting field of artificial intelligence.'

In [18]:
doc = nlp(text)
pos_tags = [(token.text, token.pos_) for token in doc]
pos_tags

[('NLP', 'PROPN'),
 ('is', 'AUX'),
 ('an', 'DET'),
 ('exciting', 'ADJ'),
 ('field', 'NOUN'),
 ('of', 'ADP'),
 ('artificial', 'ADJ'),
 ('intelligence', 'NOUN'),
 ('.', 'PUNCT')]

In [None]:
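spaCy POS Tags (Universal POS tagset, read from token.pos_):

NOUN → Noun
PROPN → Proper noun
VERB → Verb
AUX → Auxiliary verb
ADJ → Adjective
DET → Determiner
ADP → Adposition (preposition)
PUNCT → Punctuation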
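
In [None]:
Coarse tags make it easy to filter tokens by word class. The sketch below works
on the (text, tag) pairs copied from the spaCy output above, so it runs without
loading a model. Treating NOUN, PROPN, and ADJ as the "content" classes is an
assumption made for illustration.

```python
from collections import defaultdict

# (text, pos) pairs copied from the spaCy output above.
spacy_tagged = [('NLP', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
                ('exciting', 'ADJ'), ('field', 'NOUN'), ('of', 'ADP'),
                ('artificial', 'ADJ'), ('intelligence', 'NOUN'), ('.', 'PUNCT')]

# Group the token texts under their POS tag.
words_by_pos = defaultdict(list)
for token_text, pos in spacy_tagged:
    words_by_pos[pos].append(token_text)

# Keep only the word classes chosen as "content" for this example.
KEEP = ("PROPN", "NOUN", "ADJ")
content_words = [token_text for token_text, pos in spacy_tagged if pos in KEEP]

print(content_words)
# ['NLP', 'exciting', 'field', 'artificial', 'intelligence']
print(words_by_pos["ADJ"])  # ['exciting', 'artificial']
```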

In [19]:
# Named Entity Recognition (NER)

In [None]:
What is Named Entity Recognition?
NER identifies real-world entities in text and classifies them into
types such as people, locations, organizations, and dates.

NER Example:
Entity	Entity Type
Apple	Organization
Elon Musk	Person
India	Location
2024	Date

In [20]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Elon Musk is the CEO of Tesla, which is headquartered in California."
text

'Elon Musk is the CEO of Tesla, which is headquartered in California.'

In [21]:
doc = nlp(text)

# Extract named entities
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

Elon Musk - PERSON
Tesla - ORG
California - GPE
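
In [None]:
Downstream code often needs the entities grouped by type rather than printed
one by one. The sketch below copies the (text, label) pairs from the output
above so that it runs without a model; with spaCy they could be built as
[(ent.text, ent.label_) for ent in doc.ents].

```python
from collections import defaultdict

# (text, label) pairs copied from the NER output above.
entities = [("Elon Musk", "PERSON"), ("Tesla", "ORG"), ("California", "GPE")]

# Group entity texts under their label.
entities_by_label = defaultdict(list)
for ent_text, label in entities:
    entities_by_label[label].append(ent_text)

print(dict(entities_by_label))
# {'PERSON': ['Elon Musk'], 'ORG': ['Tesla'], 'GPE': ['California']}
```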


In [None]:
spaCy API reference: https://spacy.io/api

In [None]:
Common Entity Types in spaCy:

PERSON → People’s names
ORG → Organizations (Google, Tesla)
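GPE → Geopolitical entities: countries, cities, states (India, California); other locations such as mountain ranges use LOC
DATE → Absolute or relative dates or periods (times shorter than a day use TIME)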