# **Text Normalisation**

Text normalization converts text into a standard, consistent form. It is a common preprocessing step in natural language processing (NLP), applied before text is stored, processed, or analyzed. Consistent input reduces spurious variation between equivalent strings, which tends to improve the accuracy of downstream NLP tasks.
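
As a small illustration of what "consistent input" means, the sketch below maps three surface variants of the same phrase to one string. It uses only the Python standard library; the helper name `canonical_form` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re
import unicodedata


def canonical_form(text: str) -> str:
    """Map superficially different strings to one canonical form (illustrative helper)."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() intended for caseless comparison.
    text = text.casefold()
    # Collapse every run of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


variants = ["Text Normalization", "text   normalization", "TEXT\u00a0NORMALIZATION "]
print({canonical_form(v) for v in variants})  # {'text normalization'}
```

Without a step like this, the three variants would be counted as three different strings by any later comparison or frequency count.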

**Imports**

In [1]:
### Step 1: Install Dependencies
!pip install nltk spacy
!python -m spacy download en_core_web_sm

### Step 2: Import Libraries
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

### Step 3: Load Required NLTK Data
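nltk.download('stopwords')
nltk.download('punkt')  # Tokenizer models; not used below, where tokenization is regex-based
nltk.download('wordnet')
nltk.download('omw-1.4')  # Open Multilingual WordNet; some NLTK versions need it to load WordNet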






Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ch.K.Abhiram\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ch.K.Abhiram\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ch.K.Abhiram\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Ch.K.Abhiram\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

**Load Corpus**

In [2]:
text = "Hello! This is an example sentence for text normalization. We're testing tokenization, stopword removal, stemming, and lemmatization."

**Tokenisation and Lemmatisation**

In [3]:
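### Step 4: Sentence Tokenization (regex)
# Split on spaces that follow '.', '!' or '?'; abbreviations such as "Dr. Smith" would also be split
sentences = re.split(r'(?<=[.!?]) +', text)
print("Sentence Tokens:", sentences)

### Step 5: Word Tokenization (regex)
# \w+ stops at apostrophes, so "We're" becomes 'We' and 're'
words = re.findall(r'\b\w+\b', text)
print("Word Tokens:", words)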

### Step 6: Lowercasing
words_lower = [word.lower() for word in words]
print("Lowercased Words:", words_lower)

### Step 7: Remove Stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words_lower if word not in stop_words]
print("Filtered Words:", filtered_words)

### Step 8: Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("Stemmed Words:", stemmed_words)

### Step 9: Lemmatization (WordNet + spaCy)
lemmatizer = WordNetLemmatizer()
nlp = spacy.load("en_core_web_sm")

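# WordNet Lemmatizer
# lemmatize() defaults to pos='n' (noun), so forms such as 'testing' are left unchanged
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print("Lemmatized Words (WordNet):", lemmatized_words)

# spaCy Lemmatization
# spaCy assigns lemmas using part-of-speech context; the stopword-stripped string gives it less context than the original sentence
doc = nlp(" ".join(filtered_words))
lemmatized_words_spacy = [token.lemma_ for token in doc]
print("Lemmatized Words (spaCy):", lemmatized_words_spacy)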

Sentence Tokens: ['Hello!', 'This is an example sentence for text normalization.', "We're testing tokenization, stopword removal, stemming, and lemmatization."]
Word Tokens: ['Hello', 'This', 'is', 'an', 'example', 'sentence', 'for', 'text', 'normalization', 'We', 're', 'testing', 'tokenization', 'stopword', 'removal', 'stemming', 'and', 'lemmatization']
Lowercased Words: ['hello', 'this', 'is', 'an', 'example', 'sentence', 'for', 'text', 'normalization', 'we', 're', 'testing', 'tokenization', 'stopword', 'removal', 'stemming', 'and', 'lemmatization']
Filtered Words: ['hello', 'example', 'sentence', 'text', 'normalization', 'testing', 'tokenization', 'stopword', 'removal', 'stemming', 'lemmatization']
Stemmed Words: ['hello', 'exampl', 'sentenc', 'text', 'normal', 'test', 'token', 'stopword', 'remov', 'stem', 'lemmat']
Lemmatized Words (WordNet): ['hello', 'example', 'sentence', 'text', 'normalization', 'testing', 'tokenization', 'stopword', 'removal', 'stemming', 'lemmatization']
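Lemmatized Words (spaCy): ['hello', 'example', 'sentence', 'text', 'normalization', 'testing', 'tokenization', 'stopword', 'removal', 'stem', 'lemmatization']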
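
The word tokens above show "We're" split into 'We' and 're', because `\w+` does not match the apostrophe. One possible alternative, sketched here with the standard library only, is a pattern that allows apostrophe-joined pieces inside a single token. The pattern and the variable names are assumptions for illustration, not what the notebook uses.

```python
import re

text = "We're testing tokenization."

# Pattern from the notebook: the apostrophe acts as a boundary, so "We're" is split.
naive_tokens = re.findall(r"\b\w+\b", text)
print(naive_tokens)  # ['We', 're', 'testing', 'tokenization']

# Alternative sketch: a word, optionally followed by apostrophe-joined word pieces.
contraction_tokens = re.findall(r"\w+(?:'\w+)*", text)
print(contraction_tokens)  # ["We're", 'testing', 'tokenization']
```

Which behaviour is preferable depends on the later steps: whether a kept contraction such as "we're" is removed by the stopword filter depends on the contents of the stopword list in use.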

**Normalisation**

In [4]:
normalized_text = " ".join(lemmatized_words_spacy)
print("Final Normalized Text:", normalized_text)


Final Normalized Text: hello example sentence text normalization testing tokenization stopword removal stem lemmatization
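
For reuse on more than one string, the tokenize, lowercase, and filter steps can be wrapped in a single function. The following is a pared-down sketch that runs without NLTK or spaCy: `normalize_text` and `SAMPLE_STOP_WORDS` are hypothetical names, the stopword set is a tiny stand-in for `stopwords.words('english')` that only covers the sample sentence, and stemming and lemmatization are deliberately left out.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list; covers only the sample sentence.
SAMPLE_STOP_WORDS = frozenset({"this", "is", "an", "for", "we", "re", "and"})


def normalize_text(text, stop_words=SAMPLE_STOP_WORDS):
    """Tokenize with the notebook's regex, lowercase, drop stopwords, and rejoin."""
    tokens = re.findall(r"\b\w+\b", text)
    lowered = (token.lower() for token in tokens)
    kept = [token for token in lowered if token not in stop_words]
    return " ".join(kept)


sample = (
    "Hello! This is an example sentence for text normalization. "
    "We're testing tokenization, stopword removal, stemming, and lemmatization."
)
print(normalize_text(sample))
# hello example sentence text normalization testing tokenization stopword removal stemming lemmatization
```

Because no lemmatizer is applied here, 'stemming' is kept as is, whereas the spaCy-based result above reduces it to 'stem'.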
