# ` TEXT PREPROCESSING`
#### Text preprocessing is the first step in most Natural Language Processing (NLP) pipelines. Raw text collected from sources such as social media, articles, or speech transcripts often contains noise: punctuation, stopwords, slang, or irrelevant characters. Preprocessing techniques clean and structure that text so that machine learning and deep learning models can use it.

##  `Import Libraries `
#### This code imports essential libraries for text preprocessing in NLP. The nltk library provides tools for tokenization, stemming, lemmatization, and stopword removal. The re module handles regular expressions for cleaning text, while string helps with punctuation processing. From nltk.corpus, stopwords are used to filter out common words, and word_tokenize splits text into tokens. For normalization, PorterStemmer reduces words to their root form, while WordNetLemmatizer converts words to their meaningful base form. Lastly, Counter from collections is used to count word frequencies for analysis.

In [None]:
import nltk, re, string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from collections import Counter

## `Download resources (only once) `
#### These commands download necessary NLTK resources.

**punkt / punkt_tab :** tokenizer models for splitting text into sentences and words (newer NLTK releases look for `punkt_tab`).

**stopwords :** list of common words (e.g., “the”, “is”) to filter out.

**averaged_perceptron_tagger_eng :** English part-of-speech (POS) tagger to identify grammar roles (noun, verb, etc.).

**wordnet :** lexical database used for lemmatization and semantic analysis.

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

#  ` Raw Text `

In [None]:
text = """
Data science in Python refers to the application of the Python programming language
and its extensive ecosystem of libraries to perform various tasks within the data science workflow.
Python has become a dominant language in data science due to its versatility, readability,
and the rich set of tools available for data manipulation, analysis, visualization, and machine learning.
"""
print("=== ORIGINAL RAW TEXT ===\n", text)


=== ORIGINAL RAW TEXT ===
 
Data science in Python refers to the application of the Python programming language
and its extensive ecosystem of libraries to perform various tasks within the data science workflow.
Python has become a dominant language in data science due to its versatility, readability,
and the rich set of tools available for data manipulation, analysis, visualization, and machine learning.



# `Lowercasing `
Converts all text to lowercase so words like “Dog” and “dog” are treated the same, ensuring uniformity.

In [None]:
Text = text.lower()
print("\n[1] Lowercased Text:\n", Text)


[1] Lowercased Text:
 
data science in python refers to the application of the python programming language
and its extensive ecosystem of libraries to perform various tasks within the data science workflow.
python has become a dominant language in data science due to its versatility, readability,
and the rich set of tools available for data manipulation, analysis, visualization, and machine learning.
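
As a small illustration (the word list is made up and is not part of the notebook's data), the sketch below shows the effect of lowercasing on word counts: without it, each spelling variant is counted separately.

```python
from collections import Counter

demo_words = ["Dog", "dog", "DOG", "cat"]

# Without lowercasing, each spelling is counted as a separate word.
raw_counts = Counter(demo_words)

# After lowercasing, the three spellings collapse into one entry.
lower_counts = Counter(w.lower() for w in demo_words)

print(raw_counts["dog"])    # 1
print(lower_counts["dog"])  # 3
```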



# `Remove Numbers & Punctuation `
 Cleans text by deleting digits and punctuation marks, which many word-count based NLP tasks do not need. The pattern `[^a-z\s]` keeps only lowercase letters and whitespace, so it is applied to the lowercased text from the previous step.

In [None]:
# Clean the lowercased copy (Text); on mixed-case input this pattern would also delete capital letters
text = re.sub(r"[^a-z\s]", "", Text)
print("\n[2] Cleaned Text (no punctuation/numbers):\n", text)


[2] Cleaned Text (no punctuation/numbers):
 
data science in python refers to the application of the python programming language
and its extensive ecosystem of libraries to perform various tasks within the data science workflow
python has become a dominant language in data science due to its versatility readability
and the rich set of tools available for data manipulation analysis visualization and machine learning
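
The pattern `[^a-z\s]` keeps only lowercase letters and whitespace, so it acts as a pure digit and punctuation filter only on text that is already lowercase. A minimal sketch on a made-up sentence (`sample` is an assumption, not notebook data) shows the difference:

```python
import re

sample = "Python 3 is Great!"
pattern = r"[^a-z\s]"  # keep only lowercase letters and whitespace

# Applied to mixed-case text, the pattern also deletes the capital letters.
without_lowercasing = re.sub(pattern, "", sample)

# Lowercasing first keeps every letter and drops only digits and punctuation.
with_lowercasing = re.sub(pattern, "", sample.lower())

print(repr(without_lowercasing))  # 'ython  is reat'
print(repr(with_lowercasing))     # 'python  is great'
```

Removing the digit leaves two adjacent spaces behind; tokenization in the next step treats them as a single separator.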



The code below processes the text step by step to prepare it for NLP tasks. First, the text is tokenized with word_tokenize(), which splits it into individual words or tokens. Next, stopword removal filters out common words such as “the”, “is”, and “and” that add little meaning. Stemming with PorterStemmer() then reduces words to a root form by stripping suffixes (e.g., “running” → “run”). Lemmatization with WordNetLemmatizer() is applied separately to the same stopword-filtered tokens and maps words to dictionary base forms (e.g., “cars” → “car”; “better” → “good” when the word is marked as an adjective). Finally, nltk.pos_tag() labels each word with its grammatical role, such as noun, verb, or adjective. Together, these steps turn raw text into a cleaner, more structured form that is easier for machine learning models to work with.

# `Tokenization `
 Breaks sentences or text into smaller units (tokens), such as words or phrases, making analysis easier.

In [None]:

tokens = word_tokenize(text)
print("\n[3] Tokens:\n", tokens)



[3] Tokens:
 ['data', 'science', 'in', 'python', 'refers', 'to', 'the', 'application', 'of', 'the', 'python', 'programming', 'language', 'and', 'its', 'extensive', 'ecosystem', 'of', 'libraries', 'to', 'perform', 'various', 'tasks', 'within', 'the', 'data', 'science', 'workflow', 'python', 'has', 'become', 'a', 'dominant', 'language', 'in', 'data', 'science', 'due', 'to', 'its', 'versatility', 'readability', 'and', 'the', 'rich', 'set', 'of', 'tools', 'available', 'for', 'data', 'manipulation', 'analysis', 'visualization', 'and', 'machine', 'learning']
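
To give a feel for what a tokenizer does, here is a deliberately simplified regex-based sketch. It is not how `word_tokenize` is implemented; `simple_tokenize` is a hypothetical helper that only separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(raw):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", raw)

demo_tokens = simple_tokenize("NLP is fun, isn't it?")
print(demo_tokens)
# ['NLP', 'is', 'fun', ',', 'isn', "'", 't', 'it', '?']
```

The contraction is where the two approaches differ: this sketch breaks “isn't” into three pieces, whereas `word_tokenize` follows Treebank conventions and yields “is” and “n't”.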


# `Stopword Removal`
 Eliminates common words like “the, is, and” that don’t carry significant meaning in most NLP applications.

In [None]:
stop_words = set(stopwords.words("english"))
tokens_nostop = [w for w in tokens if w not in stop_words]
print("\n[4] Tokens after Stopword Removal:\n", tokens_nostop)



[4] Tokens after Stopword Removal:
 ['data', 'science', 'python', 'refers', 'application', 'python', 'programming', 'language', 'extensive', 'ecosystem', 'libraries', 'perform', 'various', 'tasks', 'within', 'data', 'science', 'workflow', 'python', 'become', 'dominant', 'language', 'data', 'science', 'due', 'versatility', 'readability', 'rich', 'set', 'tools', 'available', 'data', 'manipulation', 'analysis', 'visualization', 'machine', 'learning']
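
The membership test `w not in stop_words` is case-sensitive, and NLTK's English stopword list is lowercase. The sketch below uses a tiny hand-written stopword set (`demo_stopwords` is hypothetical and far shorter than NLTK's list) to show what happens when tokens are not lowercased first.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer.
demo_stopwords = {"the", "is", "and", "of", "to", "in"}

demo_sentence_tokens = ["The", "history", "of", "Python", "is", "long"]

# Case-sensitive test: "The" is not equal to "the", so it survives.
kept_raw = [t for t in demo_sentence_tokens if t not in demo_stopwords]

# Comparing on the lowercased token removes it as intended.
kept_lower = [t for t in demo_sentence_tokens if t.lower() not in demo_stopwords]

print(kept_raw)    # ['The', 'history', 'Python', 'long']
print(kept_lower)  # ['history', 'Python', 'long']
```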


# `Stemming `
Reduces words to a root form by stripping suffixes (e.g., “playing” → “play”). The resulting stems are not always real words; for example, “science” becomes “scienc”.

In [None]:
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in tokens_nostop]
print("\n[5] Stemmed Tokens:\n", stemmed)



[5] Stemmed Tokens:
 ['data', 'scienc', 'python', 'refer', 'applic', 'python', 'program', 'languag', 'extens', 'ecosystem', 'librari', 'perform', 'variou', 'task', 'within', 'data', 'scienc', 'workflow', 'python', 'becom', 'domin', 'languag', 'data', 'scienc', 'due', 'versatil', 'readabl', 'rich', 'set', 'tool', 'avail', 'data', 'manipul', 'analysi', 'visual', 'machin', 'learn']
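
The idea of rule-based suffix stripping can be sketched in a few lines. The function below is a toy with a hypothetical suffix list, not the Porter algorithm (which applies several ordered rule phases), but it shows why stems are often not dictionary words.

```python
def naive_stem(word, suffixes=("ing", "ies", "es", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Require at least three remaining letters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["playing", "libraries", "tools", "analysis"]:
    print(w, "->", naive_stem(w))
# playing -> play
# libraries -> librar
# tools -> tool
# analysis -> analysi
```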


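# `Lemmatization `
More advanced than stemming; uses a vocabulary (WordNet) and the word's part of speech to convert words to their base form. lemmatize() treats every word as a noun unless a `pos` argument is passed, so “libraries” → “library” works by default, while “better” → “good” requires pos="a".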

In [None]:
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in tokens_nostop]
print("\n[6] Lemmatized Tokens:\n", lemmatized)


[6] Lemmatized Tokens:
 ['data', 'science', 'python', 'refers', 'application', 'python', 'programming', 'language', 'extensive', 'ecosystem', 'library', 'perform', 'various', 'task', 'within', 'data', 'science', 'workflow', 'python', 'become', 'dominant', 'language', 'data', 'science', 'due', 'versatility', 'readability', 'rich', 'set', 'tool', 'available', 'data', 'manipulation', 'analysis', 'visualization', 'machine', 'learning']
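
Lemmatization is essentially a dictionary lookup keyed by the word and its part of speech. The sketch below replaces WordNet with a three-entry hypothetical table (`LEMMA_TABLE` and `toy_lemmatize` are illustrations only) to show why the part of speech changes the result.

```python
# Hypothetical lookup table standing in for WordNet: (word, POS) -> lemma.
# POS letters follow WordNet's convention: n = noun, v = verb, a = adjective.
LEMMA_TABLE = {
    ("cars", "n"): "car",
    ("better", "a"): "good",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("cars"))         # car
print(toy_lemmatize("better"))       # better (looked up as a noun by default)
print(toy_lemmatize("better", "a"))  # good
```

`WordNetLemmatizer.lemmatize` has the same shape: its `pos` parameter defaults to `"n"`, which is why “refers” and “programming” pass through unchanged in the output above.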


# `POS Tagging (Part-of-Speech Tagging) `
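Labels each word in a sentence with its grammatical role, such as noun, verb, or adjective, which helps in deeper text analysis. Taggers rely on the surrounding context, so tagging tokens after stopword removal and lemmatization, as done here, can yield unexpected labels (e.g., “application” tagged as a verb).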

In [None]:
pos_tags = nltk.pos_tag(lemmatized)
print("\n[7] POS Tags:\n", pos_tags)


[7] POS Tags:
 [('ata', 'NNS'), ('science', 'NN'), ('ython', 'NN'), ('refers', 'NNS'), ('application', 'VBP'), ('ython', 'RB'), ('programming', 'VBG'), ('language', 'NN'), ('extensive', 'JJ'), ('ecosystem', 'NN'), ('library', 'JJ'), ('perform', 'NN'), ('various', 'JJ'), ('task', 'NN'), ('within', 'IN'), ('data', 'NNS'), ('science', 'NN'), ('workflow', 'IN'), ('ython', 'NN'), ('become', 'VBN'), ('dominant', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('science', 'NN'), ('due', 'JJ'), ('versatility', 'NN'), ('readability', 'NN'), ('rich', 'JJ'), ('set', 'VBD'), ('tool', 'NN'), ('available', 'JJ'), ('data', 'NNS'), ('manipulation', 'NN'), ('analysis', 'NN'), ('visualization', 'NN'), ('machine', 'NN'), ('learning', 'NN')]
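
One common use of these tags is to tell the lemmatizer which part of speech to assume. The helper below is a widely used heuristic rather than an NLTK function (`penn_to_wordnet` is a hypothetical name); it maps Penn Treebank tags to the single-letter codes WordNet uses.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # nouns and everything else

# Three (word, tag) pairs taken from the tagger output above.
tagged_sample = [("extensive", "JJ"), ("programming", "VBG"), ("language", "NN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged_sample])
# [('extensive', 'a'), ('programming', 'v'), ('language', 'n')]
```

The returned letter could then be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)`.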


# **`VOCABULARY & BAG OF WORDS`**
#####  This code builds a Vocabulary and Bag of Words (BoW) from a text corpus. First, the corpus sentences are tokenized and converted to lowercase. Then, a vocabulary of unique words is created. Using this vocabulary, the code constructs a BoW matrix, where each row represents a sentence and each column shows the frequency of a word from the vocabulary. In short, it converts text into a numerical format that can be used for NLP and machine learning tasks.

# `corpus`
A corpus is a collection of texts or sentences used for NLP tasks. Here the corpus is a list of two example sentences plus the cleaned `text` from the preprocessing steps; it serves as the input data for building the vocabulary and Bag of Words model.

In [None]:
corpus = [
    "I am loving the NLP class, but sometimes it feels confusing!!!",
    "NLP is a fascinating field — it deals with text, speech, and language understanding.",
    text
]

The code below prints the corpus in a neat format. enumerate(corpus, 1) numbers each sentence starting from 1, and each one is displayed with its index.

In [None]:
print("\n=== CORPUS USED ===")
for i, sent in enumerate(corpus, 1):
    print(f"{i}. {sent}")



=== CORPUS USED ===
1. I am loving the NLP class, but sometimes it feels confusing!!!
2. NLP is a fascinating field — it deals with text, speech, and language understanding.
3. 
data science in python refers to the application of the python programming language
and its extensive ecosystem of libraries to perform various tasks within the data science workflow
python has become a dominant language in data science due to its versatility readability
and the rich set of tools available for data manipulation analysis visualization and machine learning



# `Tokenize each sentence & lowercase`
This code takes each sentence in the corpus, converts it to lowercase, and then tokenizes it into individual words using word_tokenize(). The result is a list of tokenized sentences.

In [None]:
tokenized_corpus = [word_tokenize(sent.lower()) for sent in corpus]

# `Build vocabulary`
This code creates the vocabulary from the tokenized corpus. It collects all words from every sentence, removes duplicates using set(), sorts them alphabetically with sorted(), and stores them in vocab. Finally, it prints the list of unique words that make up the vocabulary.

In [None]:
vocab = sorted(set([word for sent in tokenized_corpus for word in sent]))
print("\n=== Vocabulary ===")
print(vocab)



=== Vocabulary ===
['!', ',', '.', 'a', 'am', 'analysis', 'and', 'application', 'available', 'become', 'but', 'class', 'confusing', 'data', 'deals', 'dominant', 'due', 'ecosystem', 'extensive', 'fascinating', 'feels', 'field', 'for', 'has', 'i', 'in', 'is', 'it', 'its', 'language', 'learning', 'libraries', 'loving', 'machine', 'manipulation', 'nlp', 'of', 'perform', 'programming', 'python', 'readability', 'refers', 'rich', 'science', 'set', 'sometimes', 'speech', 'tasks', 'text', 'the', 'to', 'tools', 'understanding', 'various', 'versatility', 'visualization', 'with', 'within', 'workflow', '—']


# `Create Bag of Words matrix`
This code builds the Bag of Words (BoW) matrix. For each tokenized sentence, it counts word frequencies using Counter. Then, for every word in the vocabulary, it looks up how many times the word appears in the sentence (get(word, 0) returns 0 if the word is absent). These counts form one row of the BoW matrix: each row represents a sentence and each column corresponds to a vocabulary word, in the same order as vocab. Finally, it prints the matrix.

In [None]:
bow_matrix = []
for sent in tokenized_corpus:
    word_count = Counter(sent)
    bow_matrix.append([word_count.get(word, 0) for word in vocab])
print("\n=== Bag of Words Matrix ===")
for row in bow_matrix:
    print(row)


=== Bag of Words Matrix ===
[3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 2, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
[0, 0, 0, 1, 0, 1, 3, 1, 1, 1, 0, 0, 0, 4, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 2, 0, 0, 2, 2, 1, 1, 0, 1, 1, 0, 3, 1, 1, 3, 1, 1, 1, 3, 1, 0, 0, 1, 0, 4, 3, 1, 0, 1, 1, 1, 0, 1, 1, 0]
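
Long rows of counts are hard to read on their own. The sketch below repeats the same construction on a two-sentence toy corpus (made up for illustration, with whitespace splitting standing in for `word_tokenize`) and then pairs each non-zero count with its vocabulary word.

```python
from collections import Counter

toy_corpus = ["the cat sat", "the cat saw the dog"]
toy_tokens = [s.split() for s in toy_corpus]  # whitespace split as a stand-in tokenizer
toy_vocab = sorted({w for sent in toy_tokens for w in sent})

# One row per sentence, one column per vocabulary word.
toy_bow = []
for sent in toy_tokens:
    counts = Counter(sent)
    toy_bow.append([counts[w] for w in toy_vocab])

# Pair each non-zero count with its vocabulary word for readability.
labelled = [
    {w: n for w, n in zip(toy_vocab, row) if n > 0}
    for row in toy_bow
]

print(toy_vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(toy_bow)    # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
print(labelled)   # [{'cat': 1, 'sat': 1, 'the': 1}, {'cat': 1, 'dog': 1, 'saw': 1, 'the': 2}]
```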


# ` Prepare Data`

#### Import math, pandas, and Counter for counting terms.
#### tokenized_docs = preprocessed text documents (each a list of tokens).
#### N = total number of documents.

In [None]:
import math
import pandas as pd
from collections import Counter
tokenized_docs = [
    ["data","science","in","python","refers","to","the","application","of","the","python","programming","language","and","its","extensive","ecosystem","of","libraries","to","perform","various","tasks","within","the","data","science","workflow"],
    ["python","has","become","a","dominant","language","in","data","science","due","to","its","versatility","readability","and","the","rich","set","of","tools","available","for","data","manipulation","analysis","visualization","and","machine","learning"]
]
N = len(tokenized_docs)


# `Build Vocabulary & Term Counters`

##### counters = count of each word in each document.
##### vocab = list of all unique words across all documents.

In [None]:
counters = [Counter(doc) for doc in tokenized_docs]
vocab = sorted(set(word for doc in tokenized_docs for word in doc))

# `Compute Inverse Document Frequency (IDF)`

#### df = number of documents in which each term appears.
#### idf = log(N / df), a measure of how rare a term is across all documents.
#### Rare words → higher IDF; common words → lower IDF. With only N = 2 documents, any term found in both gets an IDF of 0.

In [None]:
df = {term: sum(1 for c in counters if c[term] > 0) for term in vocab}
idf = {term: math.log(N / df[term]) if df[term] > 0 else 0 for term in vocab}
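
Because log(N / df) is 0 for any term that appears in every document, such terms drop out of TF–IDF entirely. A common alternative is a smoothed IDF; the sketch below compares the two for N = 2. The smoothed formula shown is the one scikit-learn documents for its default `smooth_idf=True` setting; the function names here are hypothetical.

```python
import math

def plain_idf(n_docs, doc_freq):
    """IDF as used in the notebook: log(N / df)."""
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    """Smoothed variant: log((1 + N) / (1 + df)) + 1, never zero."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 2
for doc_freq in (1, 2):
    print(doc_freq,
          round(plain_idf(n_docs, doc_freq), 6),
          round(smoothed_idf(n_docs, doc_freq), 6))
# 1 0.693147 1.405465
# 2 0.0 1.0
```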

# `Compute Term Frequency (TF) and TF–IDF`

#### Calculate TF for each term in each document:
#### tf_raw = raw count
#### tf_norm = count ÷ total words (normalized)

#### Multiply TF by IDF → TF–IDF
#### Store results in a list of dictionaries for table creation.

In [None]:
rows = []
for term in vocab:
    row = {"term": term, "idf": round(idf[term], 6)}
    for i, c in enumerate(counters, 1):
        tf_raw = c[term]                          # Raw count in document
        total_terms = sum(c.values())             # Total words in document
        tf_norm = tf_raw / total_terms if total_terms > 0 else 0   # Normalized TF
        tfidf = tf_norm * idf[term]               # TF–IDF = TF × IDF

        row[f"tf_doc{i}"] = tf_raw
        row[f"tf_norm_doc{i}"] = round(tf_norm, 6)
        row[f"tfidf_doc{i}"] = round(tfidf, 6)
    rows.append(row)
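
As a hand check of the loop above, the arithmetic for two terms in document 1 can be worked out directly. The counts are read off `tokenized_docs`: document 1 has 28 tokens, “application” occurs once and only in document 1, and “data” occurs twice and appears in both documents.

```python
import math

total_terms_doc1 = 28  # number of tokens in the first document
n_documents = 2

# "application": 1 occurrence, found in 1 of 2 documents.
tf_application = 1 / total_terms_doc1
idf_application = math.log(n_documents / 1)
tfidf_application = tf_application * idf_application

# "data": 2 occurrences, found in 2 of 2 documents, so its IDF is log(1) = 0.
tf_data = 2 / total_terms_doc1
idf_data = math.log(n_documents / 2)
tfidf_data = tf_data * idf_data

print(round(tf_application, 6), round(idf_application, 6), round(tfidf_application, 6))
# 0.035714 0.693147 0.024755
print(round(tf_data, 6), round(idf_data, 6), round(tfidf_data, 6))
# 0.071429 0.0 0.0
```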


# `Create TF–IDF Table`

#### Convert TF–IDF results into a pandas DataFrame for easy viewing.

#### Columns: Term | IDF | TF (raw) | TF (normalized) | TF–IDF per document.

#### Display first 20 rows of the table.

In [None]:
df_table = pd.DataFrame(rows)
cols = ["term", "idf"]
for i in range(1, N+1):
    cols += [f"tf_doc{i}", f"tf_norm_doc{i}", f"tfidf_doc{i}"]
df_table = df_table[cols]
print(df_table.head(20))


            term       idf  tf_doc1  tf_norm_doc1  tfidf_doc1  tf_doc2  \
0              a  0.693147        0      0.000000    0.000000        1   
1       analysis  0.693147        0      0.000000    0.000000        1   
2            and  0.000000        1      0.035714    0.000000        2   
3    application  0.693147        1      0.035714    0.024755        0   
4      available  0.693147        0      0.000000    0.000000        1   
5         become  0.693147        0      0.000000    0.000000        1   
6           data  0.000000        2      0.071429    0.000000        2   
7       dominant  0.693147        0      0.000000    0.000000        1   
8            due  0.693147        0      0.000000    0.000000        1   
9      ecosystem  0.693147        1      0.035714    0.024755        0   
10     extensive  0.693147        1      0.035714    0.024755        0   
11           for  0.693147        0      0.000000    0.000000        1   
12           has  0.693147        0   

In [None]:
!pip install gensim



In [None]:
from gensim.models import Word2Vec

In [None]:
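# Train a skip-gram model (sg=1) on the token lists in tokenized_docs from the TF–IDF section
# (assumed input; Word2Vec accepts any iterable of token lists)
w2v_model = Word2Vec(sentences=tokenized_docs, vector_size=50, window=3, min_count=1, sg=1)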

print("\n✅ Word2Vec Vocabulary:")
print(list(w2v_model.wv.index_to_key))

print("\n✅ Similar words to 'learning':")
print(w2v_model.wv.most_similar("learning"))
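
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below computes that measure by hand on made-up 3-dimensional vectors (the model above uses 50 dimensions); `cosine_similarity` is a plain-Python illustration, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(round(cosine_similarity(vec_a, vec_b), 6))  # 1.0
print(round(cosine_similarity(vec_a, vec_c), 6))  # 0.0
```

With a training corpus of only two short documents, the similarity rankings printed above are unlikely to be meaningful; useful embeddings generally need far more text.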