# Text Preprocessing
Before text data can be used for machine learning or deep learning, it needs to be cleaned and standardized. Raw text contains punctuation, stopwords, special characters, and inconsistencies (such as mixed casing) that can reduce model performance.

### Steps in Text Preprocessing
1️⃣ Tokenization – Splitting text into words or sentences

2️⃣ Stopword Removal – Removing common words (e.g., "is", "the", "and")

3️⃣ Stemming & Lemmatization – Converting words to their root form

4️⃣ Text Normalization – Lowercasing, removing special characters, numbers, etc.

# Tokenization
Tokenization is the process of splitting text into smaller units (tokens), such as words or sentences.

## Types of Tokenization:

1) Word Tokenization → Splits text into words

2) Sentence Tokenization → Splits text into sentences

## Example:
 Input: "I love NLP! It's amazing."

 Word Tokenization Output: ["I", "love", "NLP", "!", "It", "'s", "amazing", "."]

 Sentence Tokenization Output: ["I love NLP!", "It's amazing."]

# Implementation

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize


In [None]:
nltk.download('punkt_tab') #download tokenizer data

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
text="I love NLP! It's amazing. And I'm excited to learn more about it."

In [None]:
word_tokens=word_tokenize(text)
sentence_tokens=sent_tokenize(text)

In [None]:
print('Word tokenization : ',word_tokens)

Word tokenization :  ['I', 'love', 'NLP', '!', 'It', "'s", 'amazing', '.', 'And', 'I', "'m", 'excited', 'to', 'learn', 'more', 'about', 'it', '.']


In [None]:
print('Sentence tokenization : ',sentence_tokens)

Sentence tokenization :  ['I love NLP!', "It's amazing.", "And I'm excited to learn more about it."]
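To make the idea of tokenization concrete without any downloads, here is a minimal regex-based sketch using only the standard library. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the rules are a rough approximation, not what NLTK does internally.

```python
import re

def simple_word_tokenize(text):
    # A run of word characters is one token; any other
    # non-space character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "I love NLP! It's amazing."
simple_words = simple_word_tokenize(sample)
simple_sentences = simple_sent_tokenize(sample)
print(simple_words)
print(simple_sentences)
```

Unlike `word_tokenize`, this sketch splits "It's" into three tokens (`It`, `'`, `s`), and the sentence splitter would wrongly break after abbreviations such as "Dr.". Handling cases like these is the main reason to prefer a trained tokenizer.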


# Stopword Removal
### What are Stopwords?
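Stopwords are very common words such as "is", "the", "and", and "in" that carry little meaning on their own but add to the size of the data. Removing them helps models focus on the more informative words. Whether to remove them depends on the task: NLTK's English list includes negations such as "not", which can matter for sentiment analysis.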


### Example:
 Input: "I love NLP because it is amazing!"

 Without Stopwords: ["love", "NLP", "amazing", "!"] (punctuation tokens are not stopwords, so "!" remains)

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
text='I love NLP because it is amazing!, and I am excited to learn more about it.'

In [None]:
words=word_tokenize(text)

In [None]:
print(' words before stopword removal : ',words)

 words before stopword removal :  ['I', 'love', 'NLP', 'because', 'it', 'is', 'amazing', '!', ',', 'and', 'I', 'am', 'excited', 'to', 'learn', 'more', 'about', 'it', '.']


In [None]:
stop_words=set(stopwords.words('english'))  # separate name, so the imported `stopwords` corpus is not overwritten

In [None]:
filtered_words=[word for word in words if word.lower() not in stop_words]  # removing stopwords

In [None]:
print(' words after stopword removal : ',filtered_words)

 words after stopword removal :  ['love', 'NLP', 'amazing', '!', ',', 'excited', 'learn', '.']
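The output above still contains punctuation tokens, because "!", "," and "." are not in the stopword list. The sketch below shows one possible way to drop both in a single pass. It is standard-library only; `SMALL_STOPWORDS` is a hypothetical hand-picked subset used for illustration (NLTK's list is much longer), and `remove_stopwords_and_punct` is an assumed helper name.

```python
# Hypothetical, hand-picked subset of English stopwords for illustration.
SMALL_STOPWORDS = {"i", "because", "it", "is", "and", "am", "to", "more", "about"}

def remove_stopwords_and_punct(tokens, stop_words):
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # drop stopwords, ignoring case
        if not any(ch.isalnum() for ch in tok):
            continue  # drop tokens made only of punctuation or symbols
        kept.append(tok)
    return kept

tokens = ['I', 'love', 'NLP', 'because', 'it', 'is', 'amazing', '!', ',',
          'and', 'I', 'am', 'excited', 'to', 'learn', 'more', 'about', 'it', '.']
content_words = remove_stopwords_and_punct(tokens, SMALL_STOPWORDS)
print(content_words)  # ['love', 'NLP', 'amazing', 'excited', 'learn']
```

Keeping any token that has at least one alphanumeric character means tokens such as `'s` survive, so a stricter filter may be needed depending on the tokenizer.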


# Stemming & Lemmatization
Both Stemming and Lemmatization reduce words to their base or root form, but they work differently.

### 1. Stemming
Stemming strips suffixes using fixed rules to approximate the root form of a word. It is fast, but the result is not always a real word.

### Example:
 "running" → "run"

 "flies" → "fli"


In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer=PorterStemmer()
words=['running','flies','easily','fairly']
stemmed_words=[stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'fli', 'easili', 'fairli']
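To illustrate why rule-based suffix stripping can produce non-words, here is a deliberately tiny toy stemmer. It is not the Porter algorithm: `SUFFIX_RULES`, `toy_stem`, and the `min_stem_len` guard are all hypothetical choices made for this sketch.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Skip the rule if it would leave too little of the word.
            if len(stem) >= min_stem_len:
                return stem
    return word

toy_stems = [toy_stem(w) for w in ["running", "flies", "cats", "sing", "easily"]]
print(toy_stems)  # ['runn', 'fli', 'cat', 'sing', 'easily']
```

The toy version returns "runn" where `PorterStemmer` returns "run": real stemmers add further rules (such as undoubling consonants) on top of plain suffix removal.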


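### 2. Lemmatization
Lemmatization maps words to their dictionary form (lemma) using vocabulary and grammar. It produces actual words but is slower than stemming. NLTK's `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed through the `pos` argument, so verbs and adjectives stay unchanged by default.

### Example:
"running" → "run" (with `pos='v'`)

 "better" → "good" (with `pos='a'`)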

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
lemmatizer = WordNetLemmatizer()
# No POS is given, so each word is treated as a noun ('running' stays 'running')
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

['running', 'fly', 'easily', 'fairly']
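In the output above "running" is unchanged because no part of speech was supplied. The toy lookup below sketches why the part of speech matters for lemmatization. `TOY_LEXICON` and `toy_lemmatize` are hypothetical stand-ins for a real dictionary; with NLTK the analogous call would be `lemmatize("running", pos="v")` on a `WordNetLemmatizer` instance.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
# A real lemmatizer uses a full dictionary plus morphology rules.
TOY_LEXICON = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return TOY_LEXICON.get((word.lower(), pos), word)

as_noun = toy_lemmatize("running")        # treated as a noun by default
as_verb = toy_lemmatize("running", "v")
as_adj = toy_lemmatize("better", "a")
print(as_noun, as_verb, as_adj)  # running run good
```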


# Text Normalization
The same text can be written in many ways (different casing, punctuation, contractions, or spacing). Normalization maps these variants to one consistent form.

### Common Text Normalization Techniques:

Lowercasing – Convert text to lowercase

Removing Punctuation – Remove symbols like .,!?

Removing Numbers – Remove digits if not needed

Expanding Contractions – Convert "can't" → "cannot"

Removing Special Characters – Remove #, @, * etc.

Removing Extra Whitespaces – Remove unnecessary spaces

### Lowercasing

In [None]:
text = "NLP is AMAZING!"
lower_text = text.lower()
print(lower_text)  # Output: nlp is amazing!

nlp is amazing!


### Removing punctuation

In [None]:
import string

text = "Hello! How's NLP?"
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)  # Output: Hello Hows NLP

Hello Hows NLP


### Removing Numbers

In [None]:
import re

text = "I have 2 cats and 3 dogs"
clean_text = re.sub(r'\d+', '', text)
print(clean_text)  # Output: I have  cats and  dogs (the spaces around each number remain)


I have  cats and  dogs
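Removing the digits leaves double spaces behind, as the output shows. A small sketch of one way to handle this, combining the digit removal with whitespace collapsing (the helper name `remove_numbers` is an assumption):

```python
import re

def remove_numbers(text):
    without_digits = re.sub(r"\d+", "", text)   # drop runs of digits
    return " ".join(without_digits.split())     # collapse leftover spaces

cleaned = remove_numbers("I have 2 cats and 3 dogs")
print(cleaned)  # I have cats and dogs
```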


### Expanding Contractions

In [None]:
!pip install contractions
from contractions import fix

text = "I'll go, but I can't stay."
expanded_text = fix(text)
print(expanded_text)  # Output: I will go, but I cannot stay.
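If installing an extra package is not an option, the basic idea can be sketched with a lookup table and a regular expression. The `CONTRACTIONS` table below is a hypothetical, very small sample, and `expand_contractions` is an assumed helper name; the `contractions` package covers far more cases.

```python
import re

# Hypothetical, tiny contraction table for illustration only.
CONTRACTIONS = {"i'll": "I will", "can't": "cannot", "it's": "it is", "i'm": "I am"}

# One alternation of all known contractions, matched as whole words.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up each match in lowercase, since the table keys are lowercase.
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("I'll go, but I can't stay.")
print(expanded)  # I will go, but I cannot stay.
```

This sketch does not preserve capitalization ("It's" becomes "it is") and only recognizes the straight apostrophe, so it is best treated as a starting point.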



### Removing Special Characters

In [None]:
text = "@NLP is #amazing!!"
clean_text = re.sub(r'[^a-zA-Z\s]', '', text)
print(clean_text)  # Output: NLP is amazing

NLP is amazing


### Removing Extra White Spaces

In [None]:
text = " NLP    is     amazing!   "
clean_text = " ".join(text.split())
print(clean_text)  # Output: NLP is amazing!

NLP is amazing!
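The individual steps above can be chained into one function. The sketch below is one possible ordering, using only the standard library; the function name `normalize` and the `keep_digits` flag are assumptions made for this example.

```python
import re
import string

def normalize(text, keep_digits=False):
    text = text.lower()                          # 1. lowercase
    if not keep_digits:
        text = re.sub(r"\d+", " ", text)         # 2. drop digits
    # 3. replace punctuation with spaces so "a,b" does not become "ab"
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    return " ".join(text.split())                # 4. collapse whitespace

normalized = normalize("  @NLP is #AMAZING!!  I have 2 cats.  ")
print(normalized)  # nlp is amazing i have cats
```

Order matters here: contractions should be expanded before punctuation is removed, otherwise "can't" turns into "can t".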


# Part-of-Speech (POS) Tagging

POS tagging assigns grammatical labels to words in a sentence. This helps NLP models understand the structure and meaning of text.

### Example:
 Sentence: "The quick brown fox jumps over the lazy dog"

 POS Tags:

The → Determiner (DT)

quick → Adjective (JJ)

fox → Noun (NN)

jumps → Verb (VBZ)



In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('punkt')
# Download the 'averaged_perceptron_tagger_eng' resource specifically
nltk.download('averaged_perceptron_tagger_eng')

text = "The quick brown fox jumps over the lazy dog"
words = word_tokenize(text)

# Get POS tags
pos_tags = pos_tag(words)
print(pos_tags)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
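The tagged output is a plain list of `(word, tag)` tuples, so it can be processed with ordinary Python. The sketch below starts from the list printed above, extracts the nouns, and maps each Penn Treebank tag to the one-letter codes WordNet uses ('n', 'v', 'a', 'r'), which is one common way to feed POS information into a lemmatizer. The helper name `penn_to_wordnet` is hypothetical.

```python
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet part-of-speech letter.
    if tag.startswith('JJ'):
        return 'a'   # adjective
    if tag.startswith('VB'):
        return 'v'   # verb
    if tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # default to noun, as WordNetLemmatizer does

nouns = [word for word, tag in tagged if tag.startswith('NN')]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(nouns)  # ['brown', 'fox', 'dog']
print(wordnet_pos)
```

Note that the tagger labeled "brown" as a noun here, so it shows up in `nouns`. Tagger output is a prediction and can contain mistakes like this.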
