<h1>Text Preprocessing in NLP</h1>

<p>Text preprocessing is an essential step in Natural Language Processing (NLP), because raw text is often inconsistent and hard for a system to process efficiently. Proper preprocessing can improve a model's accuracy by reducing variation (lowercasing, lemmatization, stemming), lowering time and space complexity, handling text inconsistencies (such as slang terms, abbreviations and emojis), and reducing overfitting. It is also essential for multilingual NLP: Japanese and Chinese, for example, do not separate words with white space, which makes word boundaries hard to identify.</p>

<h2>Regular Expressions</h2>


<p>Regular expressions are patterns used to match character combinations in strings. These allow efficient pattern matching and text manipulation.</p>

In [1]:
# Example: find each '@' mention in a message, along with the tag it refers to.
import re

message = "You would not believe what @lorem and @ipsum said to the @fox"
# r"@\w+" matches an '@' followed by one or more word characters
tags = re.findall(r"@\w+", message)

In [2]:
tags

['@lorem', '@ipsum', '@fox']

<h2>Tokenization</h2>


<p>Tokenization is the process of breaking a given text down into fragments called tokens. These tokens help computers understand and process human language by splitting it into manageable units. Multiple tokenization algorithms exist, each with its own pros and cons.</p>
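
<p>As a rough illustration of how two simple strategies differ, the sketch below compares whitespace splitting with a regex-based split. The function names <code>whitespace_tokenize</code> and <code>regex_tokenize</code> are made up for this example and are not taken from any library; production tokenizers handle many more cases.</p>

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()

def regex_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Don't panic, it's only NLP!"
print(whitespace_tokenize(sentence))
print(regex_tokenize(sentence))
```

<p>Whitespace splitting leaves punctuation attached to words (for example <code>panic,</code>), while the regex version separates punctuation but also breaks contractions such as <code>Don't</code> into three tokens. Neither is right in general, which is one reason several tokenization algorithms exist.</p>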

<h2>Lemmatization and Stemming</h2><br>
<p>Lemmatization is the process of reducing words to their root form (the lemma), while ensuring that the result is still a valid dictionary word. Lemmatization is widely used in search engines, chatbots and text summarization.</p>

In [15]:
# Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("understanding"))  # with no pos argument, the word is treated as a noun
print(lemmatizer.lemmatize("understanding", pos="v"))  # pos="v" treats the word as a verb

understanding
understand


In [26]:
#some other examples
print(lemmatizer.lemmatize("swimming", pos="n"))
print(lemmatizer.lemmatize("swimming", pos="v"))
print(lemmatizer.lemmatize("pastries", pos="n"))    
print(lemmatizer.lemmatize("great", pos="a")) 

swimming
swim
pastry
great


<p>Stemming reduces words to a base form by stripping endings drawn from a given set of suffixes. Unlike lemmatization, the resulting stem need not be a valid word. Stemming errors can sometimes be tolerated, especially with LLMs, as they rely on tokenization and embeddings rather than raw word forms.</p>

In [25]:
from nltk.stem import PorterStemmer

words = ["running", "sees", "batter", "better", "betterment"]

# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()

for word in words:
    stemmed_word = stemmer.stem(word)
    print(f"{word}:{stemmed_word}")

running:run
sees:see
batter:batter
better:better
betterment:better
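
<p>To see why a stem is not always a valid word, here is a toy suffix-stripping sketch. It is not the Porter algorithm: the suffix list, the minimum stem length and the <code>toy_stem</code> name are all assumptions made up for illustration.</p>

```python
# Hypothetical suffix rules, checked in this order (not the Porter rule set)
SUFFIXES = ("ing", "ment", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied

for word in ["running", "sees", "pastries", "betterment", "batter"]:
    print(f"{word}:{toy_stem(word)}")
```

<p>With these rules, <code>running</code> becomes <code>runn</code> and <code>pastries</code> becomes <code>pastri</code>, neither of which is a dictionary word. Real stemmers such as Porter add extra rules to clean up cases like the doubled consonant.</p>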


<h2>Part-of-Speech (POS) Tagging</h2>
<p>POS tagging assigns each word in the text a label (e.g. noun, adjective) based on its role in the sentence. It allows algorithms to understand the grammatical structure of a sentence and to disambiguate words that have multiple meanings.</p>

In [37]:
# Here's an example of POS tagging
from textblob import TextBlob

sentence = "The quick brown fox jumps over the lazy dog"
blob = TextBlob(sentence)

print(blob.tags)


[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
