# Experiment 13

# Write an NLP Program to Demonstrate the Following Tasks
## a. Tokenization, Removal of Stop Words and Punctuation, POS & NER Tags
## b. Bag of Words, TF-IDF Vectorisation & N-grams

## (a) Text Preprocessing

Tokenization → Split text into individual tokens (words and punctuation marks).
Example: "Apple is looking..." → ["Apple", "is", "looking", ...].

Stop word & punctuation removal → Drop very common words that carry little meaning on their own (e.g. "is", "at", "the") along with punctuation symbols.
Example: ["Apple", "looking", "buying", "startup"].

POS Tagging → Label each token with its part of speech (its grammatical role).
Example: "Apple/NNP" (proper noun), "looking/VBG" (verb, present participle).

NER → Identify named entities such as people, places, organisations and monetary amounts.
Example: "Apple" → ORG, "Elon Musk" → PERSON, "$1 billion" → MONEY.

## (b) Feature Extraction

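Bag of Words → Represents each document by how many times each vocabulary word appears in it, ignoring word order.
Example: with the vocabulary ["Apple", "Tesla"], a document that mentions each word once → [1,1].

TF-IDF → Like BoW, but each count is weighted by how rare the word is across the documents (words that appear in fewer documents get more weight).

N-Grams → Contiguous sequences of n words (e.g., bigram = 2 words).
Example: the bigram "Elon Musk" is kept as a single feature, not two separate words.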
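
The sliding-window idea behind n-grams can be sketched with the standard library alone. The helper name `make_ngrams` and the variable names below are illustrative assumptions, not part of NLTK or scikit-learn.

```python
def make_ngrams(token_list, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n staggered views of the list: token_list[0:], token_list[1:], ...
    return list(zip(*(token_list[i:] for i in range(n))))

sample_tokens = ["Elon", "Musk", "is", "the", "CEO", "of", "Tesla"]
sample_bigrams = make_ngrams(sample_tokens, 2)
sample_trigrams = make_ngrams(sample_tokens, 3)

print(sample_bigrams[:2])                         # [('Elon', 'Musk'), ('Musk', 'is')]
print(len(sample_bigrams), len(sample_trigrams))  # 6 5
```

A sequence of k tokens yields k - n + 1 n-grams, which is why the seven tokens here give six bigrams and five trigrams.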

## Import Libraries

In [1]:
import nltk
import spacy
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.util import ngrams

## Download required NLTK resources

In [2]:
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("stopwords")
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Asus\AppData\Roaming\

True

## Sample Text

In [3]:
text = "Elon Musk is the CEO of Tesla"
print("Original Text:\n", text)

Original Text:
 Elon Musk is the CEO of Tesla


# A: Tokenization, Stop Words & Punctuation Removal & POS & NER Tags

# Tokenization

In [4]:
tokens = word_tokenize(text)
print("\n--- Tokens ---\n", tokens)


--- Tokens ---
 ['Elon', 'Musk', 'is', 'the', 'CEO', 'of', 'Tesla']


# Remove stop words and punctuation

In [5]:
stop_words = set(stopwords.words("english"))
tokens_clean = [w for w in tokens if w.lower() not in stop_words and w not in string.punctuation]
print("\n--- After Removing Stopwords & Punctuation ---\n", tokens_clean)


--- After Removing Stopwords & Punctuation ---
 ['Elon', 'Musk', 'CEO', 'Tesla']
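
One caveat about the filter above: `w not in string.punctuation` is a substring test against a single string of punctuation characters. It behaves as intended for one-character tokens, but multi-character punctuation tokens that a tokenizer may emit (for example `...` or the quote marker ` `` `) are not substrings of it and would slip through. A minimal sketch of a per-character check follows. The helper name `is_punctuation` is an assumption made for illustration.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Compare the substring test used in the notebook with the per-character test.
for tok in [",", "...", "``", "U.S.", "Tesla"]:
    print(repr(tok),
          "| substring test:", tok in string.punctuation,
          "| per-character test:", is_punctuation(tok))
```

For the sample sentence in this notebook both checks give the same result, because it contains no punctuation at all.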


# POS Tagging (NLTK) & NER (spaCy)

## Load English model for NER

In [6]:
nlp = spacy.load('en_core_web_sm')  # Load small English model
doc = nlp(text)

# POS tagging

In [7]:
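# Note: tagging the filtered tokens hides context the tagger relies on; pass `tokens` to tag the full sentence
pos_tags = pos_tag(tokens_clean)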
print("\n--- POS Tags ---\n", pos_tags)

# print("\n--- POS Tagging ---")
# for token in doc:
#     print(f"{token.text:<12} --> {token.pos_}")


--- POS Tags ---
 [('Elon', 'NNP'), ('Musk', 'NNP'), ('CEO', 'NNP'), ('Tesla', 'NNP')]


# Named Entity Recognition

In [8]:
# print("\n--- Named Entities ---")
# for ent in doc.ents:
#     print(f"{ent.text:<15} --> {ent.label_}")

print("\n--- Named Entities ---")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)


--- Named Entities ---
Elon Musk -> PERSON
Tesla -> ORG


# B: Bag of Words, TF-IDF Vectorisation & Ngrams

In [9]:
corpus = [
    "Apple is looking at buying a startup",
    "Elon Musk is the CEO of Tesla",
    "Tesla is building electric cars"
]

# Bag of Words

In [10]:
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)
print("--- Bag of Words ---")
print(vectorizer.get_feature_names_out())
print(bow.toarray())

--- Bag of Words ---
['apple' 'at' 'building' 'buying' 'cars' 'ceo' 'electric' 'elon' 'is'
 'looking' 'musk' 'of' 'startup' 'tesla' 'the']
[[1 1 0 1 0 0 0 0 1 1 0 0 1 0 0]
 [0 0 0 0 0 1 0 1 1 0 1 1 0 1 1]
 [0 0 1 0 1 0 1 0 1 0 0 0 0 1 0]]
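
To see where these counts come from, here is a hand-rolled sketch of the same idea using only the standard library. It assumes `CountVectorizer`'s default behaviour of lowercasing the text and keeping only tokens of two or more word characters, which would explain why "a" is missing from the vocabulary above. The names `split_words`, `bag_of_words` and `sample_docs` are illustrative.

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's default token pattern (2+ word characters).
WORD_PATTERN = re.compile(r"\b\w\w+\b")

def split_words(doc):
    """Lowercase the document and keep tokens of at least two word characters."""
    return WORD_PATTERN.findall(doc.lower())

def bag_of_words(docs):
    """Return the sorted vocabulary and one row of counts per document."""
    tokenized = [split_words(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)                   # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows

sample_docs = [
    "Apple is looking at buying a startup",
    "Elon Musk is the CEO of Tesla",
    "Tesla is building electric cars",
]
bow_vocab, bow_rows = bag_of_words(sample_docs)
print(bow_vocab)
for row in bow_rows:
    print(row)
```

Each column corresponds to one vocabulary word in alphabetical order, and each row to one sentence, so the rows can be read against the matrix printed above.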


In [11]:
vectorizer = CountVectorizer()
bow1 = vectorizer.fit_transform([text])
print("--- Bag of Words ---")
print(vectorizer.get_feature_names_out())
print(bow1.toarray())

--- Bag of Words ---
['ceo' 'elon' 'is' 'musk' 'of' 'tesla' 'the']
[[1 1 1 1 1 1 1]]


# TF-IDF

In [12]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("--- TF-IDF ---")
print(tfidf.get_feature_names_out())

--- TF-IDF ---
['apple' 'at' 'building' 'buying' 'cars' 'ceo' 'electric' 'elon' 'is'
 'looking' 'musk' 'of' 'startup' 'tesla' 'the']


In [13]:
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform([text])
print("--- TF-IDF ---")
print(tfidf_vectorizer.get_feature_names_out())
print(tfidf.toarray())

--- TF-IDF ---
['ceo' 'elon' 'is' 'musk' 'of' 'tesla' 'the']
[[0.37796447 0.37796447 0.37796447 0.37796447 0.37796447 0.37796447
  0.37796447]]
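
Every weight above is identical because TF-IDF needs more than one document to tell common words from rare ones. The sketch below recomputes the values by hand, assuming `TfidfVectorizer`'s default settings: smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The function name `tfidf_by_hand`, the variable names and the simplified tokenization are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tfidf_by_hand(docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    # Assumption: lowercase and keep tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm if norm else 0.0 for x in weights])
    return vocab, rows

# Single document: every idf is ln(2/2) + 1 = 1, so each of the 7 terms gets 1/sqrt(7).
one_vocab, one_rows = tfidf_by_hand(["Elon Musk is the CEO of Tesla"])
print([round(x, 8) for x in one_rows[0]])

# Three documents: "is" occurs in all of them, so it is down-weighted.
all_vocab, all_rows = tfidf_by_hand([
    "Apple is looking at buying a startup",
    "Elon Musk is the CEO of Tesla",
    "Tesla is building electric cars",
])
second = dict(zip(all_vocab, all_rows[1]))
print(second["is"] < second["tesla"] < second["elon"])  # True
```

With one document the result is just the count vector scaled to unit length, 1/√7 ≈ 0.378 per term. With three documents, "is" (in every sentence) gets the lowest weight in the second sentence, "tesla" (in two sentences) a middle weight, and "elon" (in one) the highest.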


# N-grams (Bigrams and Trigrams)

## Bigrams (2-grams)

In [14]:
print("Bigrams for Tokens")
bigrams = list(ngrams(tokens, 2))
bigrams

Bigrams for Tokens


[('Elon', 'Musk'),
 ('Musk', 'is'),
 ('is', 'the'),
 ('the', 'CEO'),
 ('CEO', 'of'),
 ('of', 'Tesla')]

In [15]:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2))
ng1 = ngram_vectorizer.fit_transform([text])
print("Bigrams for a text")
print(ngram_vectorizer.get_feature_names_out())

Bigrams for a text
['ceo of' 'elon musk' 'is the' 'musk is' 'of tesla' 'the ceo']


In [16]:
vectorizer = CountVectorizer(ngram_range=(2,2))  # bigrams
X = vectorizer.fit_transform(corpus)
print("Bigrams for 3 Sentences")
print(vectorizer.get_feature_names_out())

Bigrams for 3 Sentences
['apple is' 'at buying' 'building electric' 'buying startup' 'ceo of'
 'electric cars' 'elon musk' 'is building' 'is looking' 'is the'
 'looking at' 'musk is' 'of tesla' 'tesla is' 'the ceo']
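
Note that the list contains 'buying startup' even though the sentence reads "buying a startup". Assuming `CountVectorizer`'s default token pattern (tokens of two or more word characters), the one-letter word "a" is discarded before the bigrams are formed, so its neighbours become adjacent. A standard-library sketch of that effect, with illustrative variable names:

```python
import re

sentence = "Apple is looking at buying a startup"

# Assumption: approximates the vectorizer's default tokenization (lowercase, 2+ word characters).
kept_words = re.findall(r"\b\w\w+\b", sentence.lower())

# Bigrams are built from the surviving tokens, so "a" never appears in one.
word_bigrams = [" ".join(pair) for pair in zip(kept_words, kept_words[1:])]
print(word_bigrams)
# ['apple is', 'is looking', 'looking at', 'at buying', 'buying startup']
```

If one-letter words should be kept, `CountVectorizer` accepts a `token_pattern` argument; a pattern such as `r"(?u)\b\w+\b"` is one option to try.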


## Trigrams (3-grams)

In [17]:
print("Trigrams for Tokens")
trigrams = list(ngrams(tokens, 3))
trigrams

Trigrams for Tokens


[('Elon', 'Musk', 'is'),
 ('Musk', 'is', 'the'),
 ('is', 'the', 'CEO'),
 ('the', 'CEO', 'of'),
 ('CEO', 'of', 'Tesla')]

In [18]:
ngram_vectorizer = CountVectorizer(ngram_range=(3, 3))
ng1 = ngram_vectorizer.fit_transform([text])
print("Trigrams for a text")
print(ngram_vectorizer.get_feature_names_out())

Trigrams for a text
['ceo of tesla' 'elon musk is' 'is the ceo' 'musk is the' 'the ceo of']


In [19]:
ngram_vectorizer = CountVectorizer(ngram_range=(3, 3))
ng1 = ngram_vectorizer.fit_transform(corpus)
print("Trigrams for 3 Sentences")
print(ngram_vectorizer.get_feature_names_out())

Trigrams for 3 Sentences
['apple is looking' 'at buying startup' 'building electric cars'
 'ceo of tesla' 'elon musk is' 'is building electric' 'is looking at'
 'is the ceo' 'looking at buying' 'musk is the' 'tesla is building'
 'the ceo of']
