<div style="padding:10px; color:#ffff; font-size:20px;">Text processing refers to the analysis and manipulation of textual data to derive useful information or prepare it for further processing.</div>

### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Text Lowercasing</div>


In [4]:
#Lowercasing: Convert all text to lowercase for uniformity.
def text_lowercase(text):
    return text.lower()


input_str = "Hey, did you know that the summer break is coming? Amazing right!! It's only 5 more days!!"
print(text_lowercase(input_str))


hey, did you know that the summer break is coming? amazing right!! it's only 5 more days!!


### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Number Removal</div>

In [5]:
import re

def remove_numbers(text):
    # Delete every run of digits; the surrounding spaces are left in place.
    return re.sub(r'\d+', '', text)

input_str = "There are 3 balls in this bag, and 12 in the other one."
print(remove_numbers(input_str))


There are  balls in this bag, and  in the other one.
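Removing digits leaves behind the spaces that surrounded them, which is why the output above contains double spaces. A minimal sketch of one way to tidy this, assuming it is acceptable to collapse every run of whitespace into a single space; `remove_numbers_clean` is a hypothetical helper, not part of the original notebook:

```python
import re

def remove_numbers_clean(text):
    # Hypothetical helper: delete digit runs, then collapse whitespace runs.
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

print(remove_numbers_clean("There are 3 balls in this bag, and 12 in the other one."))
# There are balls in this bag, and in the other one.
```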


### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Word Tokenization</div>


In [6]:
# Tokenization: split text into words, phrases, or sentences.

import nltk

nltk.download("wordnet")
nltk.download('punkt_tab')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Parthiban\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Parthiban\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [7]:
from nltk.tokenize import word_tokenize

text = "Tokenization is important for NLP tasks."
tokens = word_tokenize(text)
print(tokens)


['Tokenization', 'is', 'important', 'for', 'NLP', 'tasks', '.']
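To make the idea of word tokenization concrete, here is a rough regex approximation using only the standard library. It is not the algorithm `word_tokenize` uses; `simple_word_tokenize` is a hypothetical helper that treats each run of word characters as one token and each remaining non-space character as its own token:

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Tokenization is important for NLP tasks."))
# ['Tokenization', 'is', 'important', 'for', 'NLP', 'tasks', '.']
```

The two approaches agree on this sentence but diverge on contractions: the regex splits "It's" into three tokens, whereas `word_tokenize` keeps "'s" together as a single token.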


### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Stopword Removal</div>

In [8]:
# Stopword removal: remove common words like "the", "is", or "and" that may not contribute to meaningful analysis.
# Assumes the stopwords corpus is already installed (nltk.download("stopwords")).

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
text = "This is an example sentence with some stop words."
tokenized_text = word_tokenize(text)
filtered_text = [word for word in tokenized_text if word.lower() not in stop_words]
print(" ".join(filtered_text))


example sentence stop words .
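The output above still contains the "." token, because punctuation is not in the stopword list. A minimal sketch of filtering both at once, assuming tokens without any letter or digit can be discarded. The token list is written out by hand to match what `word_tokenize` returns for this sentence, and `demo_stop_words` is a small hand-picked stand-in for `stopwords.words("english")`:

```python
# Tokens as word_tokenize returns them for the example sentence.
tokens = ["This", "is", "an", "example", "sentence", "with", "some", "stop", "words", "."]

# Assumption: a tiny stand-in for the full NLTK English stopword list.
demo_stop_words = {"this", "is", "an", "with", "some"}

def remove_stopwords_and_punctuation(tokens, stop_words):
    # Keep a token only if it contains a letter or digit and is not a stopword.
    return [
        tok for tok in tokens
        if any(ch.isalnum() for ch in tok) and tok.lower() not in stop_words
    ]

print(" ".join(remove_stopwords_and_punctuation(tokens, demo_stop_words)))
# example sentence stop words
```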


### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Stemming</div>


In [9]:
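# Stemming: strip suffixes by rule to reduce words to a stem (e.g., "running" → "run").
# The stem need not be a dictionary word ("ability" → "abil"), and irregular forms such as "ate" are left unchanged.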
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]


for word in words:
    print(f"{word} | Stemmed Word : {stemmer.stem(word)}")

eating | Stemmed Word : eat
eats | Stemmed Word : eat
eat | Stemmed Word : eat
ate | Stemmed Word : ate
adjustable | Stemmed Word : adjust
rafting | Stemmed Word : raft
ability | Stemmed Word : abil
meeting | Stemmed Word : meet


### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Lemmatization</div>

In [10]:
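# Lemmatization: convert words to their base dictionary form (e.g., "running" → "run", "better" → "good").
# lemmatize() treats every word as a noun unless a pos argument is given, which is why the verb forms below come back unchanged.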
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]



for word in words:
    print(f"{word} | Lemmatized Word : {lemmatizer.lemmatize(word)}")


eating | Lemmatized Word : eating
eats | Lemmatized Word : eats
eat | Lemmatized Word : eat
ate | Lemmatized Word : ate
adjustable | Lemmatized Word : adjustable
rafting | Lemmatized Word : rafting
ability | Lemmatized Word : ability
meeting | Lemmatized Word : meeting


### <div style="color:#ffffff; background-color:#666666; padding:15px; border-radius:25px; text-align:center;">Text Preprocessing Pipeline</div>

In [11]:
def text_process(text):
    text = text.lower()
    # Replace every character that is not a letter or digit with a space.
    text = re.sub(r'[^a-zA-Z0-9]', " ", text)
    #text = re.sub(r'\d+', '', text) - Number Removal
    #text = re.sub(r'[^\w\s]', '', text) - Punctuation and Special Characters Removal
    text = text.split()
    # Build the stopword set once instead of reloading the list for every word.
    english_stop_words = set(stopwords.words("english"))
    text = [lemmatizer.lemmatize(word) for word in text if word not in english_stop_words]
    return " ".join(text)


In [12]:
text_process("This is an example sentence with some stop words.")

'example sentence stop word'