# `Task: Text preprocessing`

**Import Libraries**

Imports nltk, re, and the NLTK tools used below (stopword list, tokenizer, stemmer, lemmatizer, POS tagger).

**Download NLTK Resources**

Downloads the required NLTK data (stopwords, tokenizer models, POS tagger model, WordNet) if it is not already present.

**Raw Text Input**

A sample paragraph about NLP is used.

**Lowercasing**

Converts all text to lowercase so that the same word in different cases (NLP, nlp) is treated as one token.

**Tokenization**

Splits text into words/tokens (e.g., “NLP stands for” → ["nlp", "stands", "for"]).

**Remove Punctuation & Numbers**

Keeps only purely alphabetic tokens using .isalpha(); punctuation, numbers, and any token containing either are dropped.

**Stopword Removal**

Removes common words (like is, and, the) to keep meaningful words.

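**Stemming (PorterStemmer)**

Reduces words to a crude root form that need not be a dictionary word (running → run, machines → machin).

**Lemmatization (WordNetLemmatizer)**

Converts words to their dictionary form (machines → machine). Irregular forms such as better → good are resolved only when the matching part of speech is passed to the lemmatizer; by default every word is treated as a noun.

**POS Tagging**

Assigns a part-of-speech tag to each token (e.g., language/NN, computers/NNS).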

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
import re

# `Download`

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# `Raw text provided`

In [None]:

raw_text = """ 'NLP stands for Natural Language Processing, a branch of artificial intelligence (AI)
and computer science focused on enabling computers to understand, interpret, and generate
human language in both text and speech. By combining computer science and linguistics,
NLP allows machines to analyze vast amounts of unstructured language data from sources like
emails and social media, powering applications such as chatbots, translation services,
voice assistants, and sentiment analysis tools.' """

In [None]:
print("Original Text:\n", raw_text)

Original Text:
  'NLP stands for Natural Language Processing, a branch of artificial intelligence (AI)
and computer science focused on enabling computers to understand, interpret, and generate
human language in both text and speech. By combining computer science and linguistics,
NLP allows machines to analyze vast amounts of unstructured language data from sources like
emails and social media, powering applications such as chatbots, translation services,
voice assistants, and sentiment analysis tools.' 


# `Step 1: Lowercasing`

In [None]:

text_lower = raw_text.lower()
print("\nLowercased Text:\n", text_lower)
# Tokenizer data needed by word_tokenize in Step 2 on newer NLTK releases
nltk.download('punkt_tab')


Lowercased Text:
  'nlp stands for natural language processing, a branch of artificial intelligence (ai)
and computer science focused on enabling computers to understand, interpret, and generate
human language in both text and speech. by combining computer science and linguistics,
nlp allows machines to analyze vast amounts of unstructured language data from sources like
emails and social media, powering applications such as chatbots, translation services,
voice assistants, and sentiment analysis tools.' 


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

# `Step 2: Tokenization`

In [None]:

tokens = word_tokenize(text_lower)
print("\nTokenized Words:\n", tokens)


Tokenized Words:
 ["'nlp", 'stands', 'for', 'natural', 'language', 'processing', ',', 'a', 'branch', 'of', 'artificial', 'intelligence', '(', 'ai', ')', 'and', 'computer', 'science', 'focused', 'on', 'enabling', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'both', 'text', 'and', 'speech', '.', 'by', 'combining', 'computer', 'science', 'and', 'linguistics', ',', 'nlp', 'allows', 'machines', 'to', 'analyze', 'vast', 'amounts', 'of', 'unstructured', 'language', 'data', 'from', 'sources', 'like', 'emails', 'and', 'social', 'media', ',', 'powering', 'applications', 'such', 'as', 'chatbots', ',', 'translation', 'services', ',', 'voice', 'assistants', ',', 'and', 'sentiment', 'analysis', 'tools', '.', "'"]


# `Step 3: Remove punctuation and numbers`

In [None]:

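# Keep purely alphabetic tokens. Note that "'nlp" (with the stray leading quote
# from raw_text) fails .isalpha() and is dropped along with the punctuation.
tokens = [word for word in tokens if word.isalpha()]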
print("\nAfter Removing Punctuation/Numbers:\n", tokens)


After Removing Punctuation/Numbers:
 ['stands', 'for', 'natural', 'language', 'processing', 'a', 'branch', 'of', 'artificial', 'intelligence', 'ai', 'and', 'computer', 'science', 'focused', 'on', 'enabling', 'computers', 'to', 'understand', 'interpret', 'and', 'generate', 'human', 'language', 'in', 'both', 'text', 'and', 'speech', 'by', 'combining', 'computer', 'science', 'and', 'linguistics', 'nlp', 'allows', 'machines', 'to', 'analyze', 'vast', 'amounts', 'of', 'unstructured', 'language', 'data', 'from', 'sources', 'like', 'emails', 'and', 'social', 'media', 'powering', 'applications', 'such', 'as', 'chatbots', 'translation', 'services', 'voice', 'assistants', 'and', 'sentiment', 'analysis', 'tools']
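
The output above starts at 'stands': the first token, `'nlp`, carried a leading quote from the raw text and was filtered out. The following is a minimal sketch of one possible workaround, stripping punctuation from the ends of each token before the `.isalpha()` check. The helper name `strip_edge_punct` and the sample tokens are illustrative assumptions, not part of the notebook.

```python
import string

def strip_edge_punct(token):
    """Remove punctuation from both ends of a token (hypothetical helper)."""
    return token.strip(string.punctuation)

# Illustrative tokens, shaped like the word_tokenize output in Step 2.
sample_tokens = ["'nlp", "stands", ",", "(", "ai", ")", "2024", "tools", "'"]

# Plain .isalpha() filter, as in Step 3: "'nlp" is lost with the punctuation.
plain = [t for t in sample_tokens if t.isalpha()]

# Strip edge punctuation first, then apply the same filter.
stripped = (strip_edge_punct(t) for t in sample_tokens)
cleaned = [s for s in stripped if s.isalpha()]

print(plain)    # ['stands', 'ai', 'tools']
print(cleaned)  # ['nlp', 'stands', 'ai', 'tools']
```

Pure numbers such as "2024" are still removed in both variants, since they are not alphabetic.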


# `Step 4: Remove Stopwords`

In [None]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print("\nAfter Stopword Removal:\n", filtered_tokens)



After Stopword Removal:
 ['stands', 'natural', 'language', 'processing', 'branch', 'artificial', 'intelligence', 'ai', 'computer', 'science', 'focused', 'enabling', 'computers', 'understand', 'interpret', 'generate', 'human', 'language', 'text', 'speech', 'combining', 'computer', 'science', 'linguistics', 'nlp', 'allows', 'machines', 'analyze', 'vast', 'amounts', 'unstructured', 'language', 'data', 'sources', 'like', 'emails', 'social', 'media', 'powering', 'applications', 'chatbots', 'translation', 'services', 'voice', 'assistants', 'sentiment', 'analysis', 'tools']


# `Step 5: Stemming`

In [None]:

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print("\nAfter Stemming:\n", stemmed)


After Stemming:
 ['stand', 'natur', 'languag', 'process', 'branch', 'artifici', 'intellig', 'ai', 'comput', 'scienc', 'focus', 'enabl', 'comput', 'understand', 'interpret', 'gener', 'human', 'languag', 'text', 'speech', 'combin', 'comput', 'scienc', 'linguist', 'nlp', 'allow', 'machin', 'analyz', 'vast', 'amount', 'unstructur', 'languag', 'data', 'sourc', 'like', 'email', 'social', 'media', 'power', 'applic', 'chatbot', 'translat', 'servic', 'voic', 'assist', 'sentiment', 'analysi', 'tool']


# `Step 6: Lemmatization`

In [None]:

lemmatizer = WordNetLemmatizer()
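# lemmatize() treats every word as a noun unless pos is given, so verb forms
# such as 'allows' and 'focused' are left unchanged.
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\nAfter Lemmatization:\n", lemmatized)
# Tagger model needed by pos_tag in Step 7 on newer NLTK releases
nltk.download('averaged_perceptron_tagger_eng')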



After Lemmatization:
 ['stand', 'natural', 'language', 'processing', 'branch', 'artificial', 'intelligence', 'ai', 'computer', 'science', 'focused', 'enabling', 'computer', 'understand', 'interpret', 'generate', 'human', 'language', 'text', 'speech', 'combining', 'computer', 'science', 'linguistics', 'nlp', 'allows', 'machine', 'analyze', 'vast', 'amount', 'unstructured', 'language', 'data', 'source', 'like', 'email', 'social', 'medium', 'powering', 'application', 'chatbots', 'translation', 'service', 'voice', 'assistant', 'sentiment', 'analysis', 'tool']
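
To make the difference between Steps 5 and 6 concrete, the sketch below compares a few (token, stem, lemma) triples copied from the two outputs above. It only inspects those recorded values and does not call NLTK; the variable names are illustrative.

```python
# (token, Porter stem, WordNet lemma) triples copied from the outputs above.
triples = [
    ("machines", "machin", "machine"),
    ("computers", "comput", "computer"),
    ("media", "media", "medium"),
    ("allows", "allow", "allows"),
    ("analysis", "analysi", "analysis"),
]

# Tokens the noun-default lemmatizer returned unchanged.
unchanged_by_lemmatizer = [tok for tok, _, lemma in triples if tok == lemma]

# Tokens the stemmer altered, often into a non-dictionary form.
changed_by_stemmer = [tok for tok, stem, _ in triples if tok != stem]

print(unchanged_by_lemmatizer)  # ['allows', 'analysis']
print(changed_by_stemmer)       # ['machines', 'computers', 'allows', 'analysis']
```

The verb 'allows' is untouched by the lemmatizer because no part of speech was supplied, while the stemmer trims it regardless of word class.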


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

# `Step 7: POS Tagging`

In [None]:
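# Tags are assigned after stopword removal, so the tagger sees less context;
# some tags below are unreliable (for example, 'stands' is tagged NNS).
pos_tags = pos_tag(filtered_tokens)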
print("\nPOS Tagging:\n", pos_tags)



POS Tagging:
 [('stands', 'NNS'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('branch', 'NN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('ai', 'VBP'), ('computer', 'NN'), ('science', 'NN'), ('focused', 'VBD'), ('enabling', 'VBG'), ('computers', 'NNS'), ('understand', 'VBP'), ('interpret', 'JJ'), ('generate', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('text', 'NN'), ('speech', 'NN'), ('combining', 'VBG'), ('computer', 'NN'), ('science', 'NN'), ('linguistics', 'NNS'), ('nlp', 'JJ'), ('allows', 'NNS'), ('machines', 'NNS'), ('analyze', 'VBP'), ('vast', 'JJ'), ('amounts', 'NNS'), ('unstructured', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('sources', 'NNS'), ('like', 'IN'), ('emails', 'NNS'), ('social', 'JJ'), ('media', 'NNS'), ('powering', 'VBG'), ('applications', 'NNS'), ('chatbots', 'VBP'), ('translation', 'NN'), ('services', 'NNS'), ('voice', 'NN'), ('assistants', 'NNS'), ('sentiment', 'JJ'), ('analysis', 'NN'), ('tools', 'NNS')]
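
The tags above can also be fed back into lemmatization, since `WordNetLemmatizer.lemmatize` accepts a `pos` argument. The sketch below shows one common way to map Penn Treebank tags to WordNet's single-letter POS codes. The helper name `penn_to_wordnet` is a hypothetical choice, and the mapping is a simplification that treats everything else as a noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also lemmatize()'s default

# A few (word, tag) pairs in the same shape as the pos_tag output.
tagged = [("stands", "NNS"), ("focused", "VBD"), ("vast", "JJ")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)  # [('stands', 'n'), ('focused', 'v'), ('vast', 'a')]

# Usage with the notebook's objects (requires the NLTK WordNet data):
# lemmas = [lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in pos_tags]
```

The quality of the result depends on the tags themselves, so tagging the full sentence before removing stopwords would likely give the lemmatizer better input.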


# `Q2: Vocabulary & Bag of Words`

# `Code Description`

**Goal:** Convert raw text into a numerical Bag-of-Words (BoW) representation.

**Steps:**

**Preprocess text:** lowercase, remove symbols, tokenize, remove stopwords.

**Build vocabulary:** unique sorted words from the corpus.

**Create BoW matrix:** count the frequency of each vocabulary word in each sentence.
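
The three steps can be sketched without NLTK as follows. This is a minimal illustration under stated assumptions: `STOP_WORDS` is a small hand-picked set standing in for NLTK's English stopword list, whitespace splitting stands in for `word_tokenize`, and the function name `bag_of_words` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword set, not NLTK's full English list.
STOP_WORDS = {"i", "am", "the", "but", "it", "is", "a", "with", "and"}

def bag_of_words(sentences):
    """Return (vocabulary, matrix) for a list of sentences."""
    processed = []
    for sentence in sentences:
        # Step 1: lowercase, drop non-letters, split, remove stopwords.
        cleaned = re.sub(r"[^a-z\s]", "", sentence.lower())
        processed.append([w for w in cleaned.split() if w not in STOP_WORDS])
    # Step 2: vocabulary is the sorted set of all remaining words.
    vocabulary = sorted({w for tokens in processed for w in tokens})
    # Step 3: one row per sentence, one count per vocabulary word.
    matrix = [[Counter(tokens)[w] for w in vocabulary] for tokens in processed]
    return vocabulary, matrix

vocab, matrix = bag_of_words([
    "I am loving the NLP class!",
    "NLP is a field, and NLP deals with text.",
])
print(vocab)   # ['class', 'deals', 'field', 'loving', 'nlp', 'text']
print(matrix)  # [[1, 0, 0, 1, 1, 0], [0, 1, 1, 0, 2, 1]]
```

Each row has exactly one entry per vocabulary word, so a repeated word such as "nlp" in the second sentence shows up as a count of 2 rather than as an extra column.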


In [None]:
import nltk, re, pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

# `Corpus`


In [None]:

corpus = [
    "I am loving the NLP class, but sometimes it feels confusing!!!",
    "NLP is a fascinating field — it deals with text, speech, and language understanding."
]
stop_words = set(stopwords.words('english'))

# **Preprocessing:** lowercase, clean, tokenize, remove stopwords

In [None]:
processed = []  # reset here so re-running this cell does not append duplicate rows
for sentence in corpus:
    # Drop everything except lowercase letters and whitespace
    sentence = re.sub(r'[^a-z\s]', '', sentence.lower())
    tokens = [w for w in word_tokenize(sentence) if w not in stop_words]
    processed.append(tokens)


# `Vocabulary`

In [None]:
vocabulary = sorted({word for sent in processed for word in sent})

# `BoW Matrix`

In [None]:
bow_matrix = [[Counter(sent).get(word, 0) for word in vocabulary] for sent in processed]

# `Output`

In [None]:
print("Vocabulary:\n", vocabulary, "\n")
print("Bag of Words Matrix:\n")
print(pd.DataFrame(bow_matrix, columns=vocabulary))


Vocabulary:
 ['class', 'confusing', 'deals', 'fascinating', 'feels', 'field', 'language', 'loving', 'nlp', 'sometimes', 'speech', 'text', 'understanding'] 

Bag of Words Matrix:

   class  confusing  deals  fascinating  feels  field  language  loving  nlp  \
0      1          1      0            0      1      0         0       1    1   
1      0          0      1            1      0      1         1       0    1   

   sometimes  speech  text  understanding  
0          1       0     0              0  
1          0

In [None]:
import math
import pandas as pd
from collections import Counter
import re

# `Input text`

In [None]:
text = """NLP stands for Natural Language Processing, a branch of artificial intelligence (AI)
and computer science focused on enabling computers to understand, interpret, and generate
human language in both text and speech. By combining computer science and linguistics,
NLP allows machines to analyze vast amounts of unstructured language data from sources like
emails and social media, powering applications such as chatbots, translation services,
voice assistants, and sentiment analysis tools."""

# `Split text into "documents"`

In [None]:
docs = re.split(r'[.!?]', text)   # split by sentence end
docs = [d.strip() for d in docs if d.strip()]   # remove empty
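
Splitting on `[.!?]` works for the input here, but it treats every period as a sentence boundary. The sketch below wraps the same two lines in a hypothetical helper, `naive_sentences`, and shows one case where the split goes wrong; the sample strings are illustrative.

```python
import re

def naive_sentences(text):
    """Split on ., ! or ? and drop empty pieces (same logic as the cell above)."""
    parts = re.split(r'[.!?]', text)
    return [p.strip() for p in parts if p.strip()]

print(naive_sentences("NLP is useful. Is it hard? Not really!"))
# ['NLP is useful', 'Is it hard', 'Not really']

# An abbreviation is cut into its own "document".
print(naive_sentences("Dr. Smith uses NLP. It works."))
# ['Dr', 'Smith uses NLP', 'It works']
```

For text with abbreviations or decimals, a trained sentence tokenizer such as NLTK's `sent_tokenize` is usually a safer choice.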

# `Tokenize (lowercase, keep only words)`

In [None]:
tokenized_docs = []
for d in docs:
    tokens = re.findall(r'\b\w+\b', d.lower())
    tokenized_docs.append(tokens)
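
One detail of the pattern used above: `\w` matches letters, digits, and the underscore, so numbers survive this tokenizer even though Step 3 of the first task removed them. A small sketch of the difference, using an illustrative sentence and a stricter letters-only pattern as a possible alternative:

```python
import re

sample = "GPT_4 costs 20 dollars, isn't it?".lower()

# Pattern from the cell above: keeps digits and underscores, splits on the apostrophe.
word_chars = re.findall(r'\b\w+\b', sample)

# Letters-only alternative: digits and underscores are discarded.
letters_only = re.findall(r'[a-z]+', sample)

print(word_chars)    # ['gpt_4', 'costs', '20', 'dollars', 'isn', 't', 'it']
print(letters_only)  # ['gpt', 'costs', 'dollars', 'isn', 't', 'it']
```

Neither pattern removes stopwords, so common words still enter the vocabulary in this section.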


# `Total number of documents`

In [None]:
N = len(tokenized_docs)

# `Build Vocabulary & Counters`

In [None]:

counters = [Counter(doc) for doc in tokenized_docs]
vocab = sorted(set(word for doc in tokenized_docs for word in doc))


# `Compute IDF`

In [None]:
df = {term: sum(1 for c in counters if c[term] > 0) for term in vocab}
idf = {term: math.log(N / df[term]) if df[term] > 0 else 0 for term in vocab}
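
Written out, the cell above computes the unsmoothed inverse document frequency, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain term $t$:

$$
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}
$$

For the two-sentence input used here ($N = 2$), a term that occurs in only one sentence gets $\ln(2/1) \approx 0.693147$, and a term that occurs in both gets $\ln(2/2) = 0$, so words shared by every document contribute nothing to the final score.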

# `Compute TF (normalized) and TF–IDF`

In [None]:
rows = []
for term in vocab:
    row = {"term": term, "idf": round(idf[term], 6)}
    for i, c in enumerate(counters, 1):
        tf_raw = c[term]
        total_terms = sum(c.values())
        tf_norm = tf_raw / total_terms if total_terms > 0 else 0
        tfidf = tf_norm * idf[term]

        row[f"tf_doc{i}"] = tf_raw
        row[f"tf_norm_doc{i}"] = round(tf_norm, 6)
        row[f"tfidf_doc{i}"] = round(tfidf, 6)
    rows.append(row)
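
The same calculation can be condensed into a single function for quick checks on small inputs. This is a minimal sketch that assumes every document is non-empty; the function name `tf_idf` and the toy documents are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document (assumes non-empty documents)."""
    n_docs = len(tokenized_docs)
    counters = [Counter(doc) for doc in tokenized_docs]
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for c in counters if t in c) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    # Normalized term frequency (count / document length) times IDF.
    return [
        {t: (c[t] / sum(c.values())) * idf[t] for t in vocab}
        for c in counters
    ]

toy_docs = [["nlp", "and", "ai"], ["nlp", "and", "speech", "and"]]
scores = tf_idf(toy_docs)

print(round(scores[0]["ai"], 6))  # 0.231049
print(scores[1]["and"])           # 0.0
```

"ai" appears in one of two documents and makes up a third of its document, giving (1/3) · ln 2, while "and" appears in both documents and therefore scores zero however often it repeats.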

# `Create TF–IDF Table`

In [None]:
df_table = pd.DataFrame(rows)
cols = ["term", "idf"]
for i in range(1, N+1):
    cols += [f"tf_doc{i}", f"tf_norm_doc{i}", f"tfidf_doc{i}"]
df_table = df_table[cols]

# `Display first 20 rows`

In [None]:

print(df_table.head(20))


            term       idf  tf_doc1  tf_norm_doc1  tfidf_doc1  tf_doc2  \
0              a  0.693147        1      0.032258     0.02236        0   
1             ai  0.693147        1      0.032258     0.02236        0   
2         allows  0.693147        0      0.000000     0.00000        1   
3        amounts  0.693147        0      0.000000     0.00000        1   
4       analysis  0.693147        0      0.000000     0.00000        1   
5        analyze  0.693147        0      0.000000     0.00000        1   
6            and  0.000000        3      0.096774     0.00000        3   
7   applications  0.693147        0      0.000000     0.00000        1   
8     artificial  0.693147        1      0.032258     0.02236        0   
9             as  0.693147        0      0.000000     0.00000        1   
10    assistants  0.693147        0      0.000000     0.00000        1   
11          both  0.693147        1      0.032258     0.02236        0   
12        branch  0.693147        1   