Before starting any Natural Language Processing task, it is important to convert all sentences to a single case, either lower case or upper case. Otherwise a single five-letter word like 'Apple' can appear in 32 case variations, such as 'ApPle' and 'APPLE'. This unnecessarily increases the computational requirements for processing and embedding the text data.
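
The count of 32 follows from each of the five letters independently being either upper or lower case (2^5). A minimal sketch that enumerates the spellings, where `case_variants` is an illustrative helper name and not part of any library:

```python
from itertools import product

def case_variants(word):
    """Return every upper/lower-case spelling of `word` (hypothetical helper)."""
    # For each character, pair its lower- and upper-case forms, then combine them.
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return {"".join(combo) for combo in product(*choices)}

variants = case_variants("apple")
print(len(variants))  # 32
```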

There is no all-purpose text normalization procedure; the steps should be chosen based on the requirements and the data at hand.

We normalize text to avoid needlessly enlarging the feature vector space.
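
As a rough illustration of the effect on the feature space, the sketch below counts distinct tokens in a tiny made-up corpus with and without lowercasing. The `docs` list and the `vocabulary` helper are assumptions for demonstration only:

```python
# Hypothetical mini-corpus: the same words written in different cases.
docs = ["Apple pie is great", "I love APPLE pie", "apple Pie for all"]

def vocabulary(texts, lowercase=False):
    """Collect the set of distinct whitespace-separated tokens."""
    vocab = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        vocab.update(text.split())
    return vocab

raw_vocab = vocabulary(docs)
norm_vocab = vocabulary(docs, lowercase=True)
print(len(raw_vocab), len(norm_vocab))  # 11 8
```

Each distinct token typically becomes its own feature in a bag-of-words style representation, so the smaller vocabulary means fewer dimensions.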

Let's look at basic Python functions that can help us do this for the sample text given below:

In [4]:
# Sample text about LLMs
sample_text = """
                Large Language Models (LLMs), such as OpenAI's GPT-4 and Google's Gemini, have revolutionized the field of Natural Language Processing (NLP). 
They are capable of understanding and generating human-like text, answering questions, summarizing documents, and even writing code! 
However, LLMs require vast amounts of data and computational power. 
Despite their impressive abilities, challenges like bias, hallucination, and explainability remain important areas of research.
"""

print(sample_text)



                Large Language Models (LLMs), such as OpenAI's GPT-4 and Google's Gemini, have revolutionized the field of Natural Language Processing (NLP). 
They are capable of understanding and generating human-like text, answering questions, summarizing documents, and even writing code! 
However, LLMs require vast amounts of data and computational power. 
Despite their impressive abilities, challenges like bias, hallucination, and explainability remain important areas of research.



In [5]:
lowercase_text = sample_text.lower()
print(lowercase_text)


                large language models (llms), such as openai's gpt-4 and google's gemini, have revolutionized the field of natural language processing (nlp). 
they are capable of understanding and generating human-like text, answering questions, summarizing documents, and even writing code! 
however, llms require vast amounts of data and computational power. 
despite their impressive abilities, challenges like bias, hallucination, and explainability remain important areas of research.
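
For text that contains non-English characters, Python also provides `str.casefold()`, a more aggressive form of lowercasing intended for caseless matching. A small sketch, using a German word as an assumed example:

```python
word = "Straße"

lowered = word.lower()      # only changes the capital S
folded = word.casefold()    # also expands the sharp s to "ss"

print(lowered)  # straße
print(folded)   # strasse
print("STRASSE".casefold() == folded)  # True
```

For purely English text such as the sample above, `lower()` and `casefold()` give the same result.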



In [6]:
# Removing leading and trailing whitespace
whitespace_removed_text = lowercase_text.strip()
print(whitespace_removed_text)

large language models (llms), such as openai's gpt-4 and google's gemini, have revolutionized the field of natural language processing (nlp). 
they are capable of understanding and generating human-like text, answering questions, summarizing documents, and even writing code! 
however, llms require vast amounts of data and computational power. 
despite their impressive abilities, challenges like bias, hallucination, and explainability remain important areas of research.
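
Note that `strip()` only trims whitespace at the two ends of the string; the trailing space before each newline and any runs of spaces inside the text remain. If the use case calls for it, internal whitespace can be collapsed as well. A minimal sketch, where `collapse_whitespace` is an illustrative name:

```python
import re

def collapse_whitespace(text):
    """Replace every run of spaces, tabs, or newlines with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

messy = "   large   language\tmodels \n are   useful  "
print(collapse_whitespace(messy))  # large language models are useful
```

This also removes newlines, so it is only suitable when line structure does not matter downstream.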


In [7]:
# Remove digits and punctuation if the use case requires it, using regex to replace them with an empty string.

import re

def remove_digits_and_punctuation(text):
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Remove punctuation: every character that is not a word character (letter, digit, underscore) or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text

cleaned_text = remove_digits_and_punctuation(whitespace_removed_text)
print(cleaned_text)


large language models llms such as openais gpt and googles gemini have revolutionized the field of natural language processing nlp 
they are capable of understanding and generating humanlike text answering questions summarizing documents and even writing code 
however llms require vast amounts of data and computational power 
despite their impressive abilities challenges like bias hallucination and explainability remain important areas of research
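
In the output above, deleting punctuation outright merges some tokens: `human-like` becomes `humanlike` and `openai's` becomes `openais`. One alternative, if hyphenated words should stay as separate tokens, is to replace punctuation with a space and then collapse repeated whitespace. A sketch, with `replace_punctuation_with_space` as an illustrative name:

```python
import re

def replace_punctuation_with_space(text):
    # Swap every character that is neither a word character nor whitespace for a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the resulting runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

print(replace_punctuation_with_space("human-like text, even writing code!"))
# human like text even writing code
```

The trade-off is that possessives are split too (`openai's` becomes `openai s`), so which variant is preferable depends on the data and the downstream task.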


In [None]:
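# Next is stop word removal. Stop words are common words that do not add significant meaning to a sentence, such as "the", "is", "in", etc.
import nltk
from nltk.corpus import stopwords

# Download the stop word list if it is not already available locally
nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))
print("Stop words:", stop_words)

def remove_stop_words(text):
    words = text.split()
    # Assumed completion: drop the stop words and rejoin the remaining words
    return ' '.join(word for word in words if word not in stop_words)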

Stop words: {'himself', "he'll", "wasn't", 'through', 'his', "you'd", "she'll", 'out', "won't", 'had', 'were', 'are', 'she', 'when', 'during', 'only', "they're", 'here', 'at', "you'll", 'yours', 'all', 'them', 'won', 'before', 'again', 're', 'isn', 'yourself', 'how', 'because', 'up', 've', 'you', 'haven', 'after', 'ain', 'needn', "it's", 'of', 'that', 'mightn', 'your', 'what', "don't", 'more', 'should', 'don', 'didn', 'to', 'having', 'does', "i'd", 'its', 'same', 'ours', 'hasn', "should've", "weren't", "i'm", 'mustn', 'wasn', "she'd", 'about', 'we', 'will', 'did', 'these', 'he', "you've", 't', 'both', 'as', "doesn't", 'the', 'further', 'an', 'above', "we'd", "we'll", 'from', 'each', 'below', 'ma', 'my', 'doing', 'd', "mustn't", "she's", 'for', 'down', "we're", 'where', 'few', 'this', 'not', 'am', 'now', 'doesn', "hasn't", 'which', "i've", 'over', 'whom', 'wouldn', 'by', 'him', 'itself', 'any', 'there', 'other', 'than', 'has', "they'll", 'why', 'off', 'on', 'couldn', 'nor', 'our', 'be',