## 01_NLP: Text Preprocessing Techniques

Definition:
Text preprocessing in Natural Language Processing (NLP) is the process of cleaning and normalizing raw text before it is analyzed or passed to a machine learning model. Standardizing the text removes noise (markup, URLs, stray symbols) and shrinks the vocabulary, which often improves model accuracy.

#### Example: <br>
Raw text: "Hey!!! Visit https://example.com now 😎 #exciting"  <br>
After preprocessing: "hey visit exciting"
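
The transformation in this example can be reproduced with a small pipeline. The following is a minimal sketch using only the standard library; the `preprocess` function and the tiny `STOP_WORDS` set are assumptions made for illustration, not part of any library.

```python
import re
import string

# Tiny illustrative stop word set (an assumption; real lists such as NLTK's are much larger).
STOP_WORDS = {"now", "the", "a", "is"}

def preprocess(text: str) -> str:
    """Lowercase, then strip URLs, non-ASCII symbols, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)       # drop URLs
    text = text.encode("ascii", "ignore").decode()    # drop emojis and other non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation, including '#'
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

# "\U0001F60E" is the sunglasses emoji, written as an escape to avoid encoding issues.
print(preprocess("Hey!!! Visit https://example.com now \U0001F60E #exciting"))  # hey visit exciting
```

The order of the steps matters: URLs are removed before punctuation, because stripping punctuation first would turn `https://example.com` into a word that the URL pattern no longer matches.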


### Lowercasing

In [1]:
text = "Hello World!"
text_lower = text.lower()
print(text_lower)  

hello world!
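
`lower()` is usually enough for English text. For caseless matching in other languages, Python also provides `str.casefold()`, which applies more aggressive folding. A short sketch of the difference (the helper name `normalize_case` is hypothetical):

```python
# Compare lower() and casefold() on a word containing the German sharp s.
word = "Straße"
print(word.lower())     # straße
print(word.casefold())  # strasse

def normalize_case(text: str) -> str:
    """Return text folded for caseless comparison."""
    return text.casefold()
```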


### Remove HTML Tags

In [2]:
from bs4 import BeautifulSoup

html_text = "<p>Hello <b>World</b></p>"
clean_text = BeautifulSoup(html_text, "html.parser").get_text()
print(clean_text)  

Hello World
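
If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's `html.parser`. The `TextExtractor` class and `strip_tags` helper below are hypothetical names, and this sketch is less forgiving of malformed markup than BeautifulSoup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(strip_tags("<p>Hello <b>World</b></p>"))  # Hello World
print(strip_tags("<p>Fish &amp; chips</p>"))    # Fish & chips
```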


### Remove URLs

In [3]:
import re

text = "Check this out: https://example.com"
clean_text = re.sub(r"http\S+|www\S+", "", text)
print(clean_text)  

Check this out: 
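
Note that the output above keeps the space that preceded the URL. A possible refinement, sketched under the assumption that collapsing repeated whitespace is acceptable for the task, is to normalize spacing after the substitution. The `remove_urls` helper and the slightly stricter pattern are illustrative choices.

```python
import re

# Match http:// or https:// links, and bare links that start with "www."
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text: str) -> str:
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls("Check this out: https://example.com"))  # Check this out:
print(remove_urls("Check this out: https://example.com and www.example.org today"))
# Check this out: and today
```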


### Remove Punctuation

In [4]:
import string

text = "Hello, World!"
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)


Hello World
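
Deleting punctuation outright merges words that were joined by a hyphen or comma ("well-known" becomes "wellknown"). One alternative, shown here as a sketch with a hypothetical `replace_punctuation` helper, is to replace each punctuation character with a space and then collapse the whitespace. `string.punctuation` covers ASCII punctuation only, so typographic quotes and similar characters pass through unchanged.

```python
import re
import string

# Character class that matches any single ASCII punctuation character.
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")

def replace_punctuation(text: str) -> str:
    """Replace punctuation with spaces so joined words stay separate."""
    spaced = PUNCT_PATTERN.sub(" ", text)
    return " ".join(spaced.split())

print(replace_punctuation("A well-known,state-of-the-art model!"))
# A well known state of the art model
```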


### Chat Word Treatment

In [5]:
chat_dict = {"u": "you", "lol": "laugh out loud"}
text = "how r u lol"
text = ' '.join([chat_dict.get(word, word) for word in text.split()])
print(text)  

how r you laugh out loud
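
The `split()`-based lookup above misses tokens that are capitalized or followed by punctuation ("U?" or "LOL!"), and "r" stays unexpanded because it is not in the dictionary. A possible variant, assuming a regular expression with word boundaries is acceptable, is sketched below; the added `"r": "are"` entry and the `expand_chat_words` name are illustrative.

```python
import re

# Illustrative mapping; the "r" entry is an addition for this sketch.
CHAT_WORDS = {"u": "you", "r": "are", "lol": "laugh out loud"}

# \b ensures only whole words match, so the "u" inside "out" is left alone.
CHAT_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CHAT_WORDS)) + r")\b",
    re.IGNORECASE,
)

def expand_chat_words(text: str) -> str:
    """Expand chat abbreviations regardless of case or adjacent punctuation."""
    return CHAT_PATTERN.sub(lambda m: CHAT_WORDS[m.group(0).lower()], text)

print(expand_chat_words("How r U? LOL!"))  # How are you? laugh out loud!
```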


### Spelling Correction

In [6]:
# install TextBlob if missing and download corpora required for correction
%pip install textblob
!python -m textblob.download_corpora



In [7]:
from textblob import TextBlob

text = "I havv goood speling"
corrected_text = str(TextBlob(text).correct())
print(corrected_text) 

I have good spelling
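
To give a rough intuition for dictionary-based correction, the sketch below replaces each word with its closest match from a known vocabulary using the standard library's `difflib`. This is not how TextBlob works internally; the four-word `VOCABULARY` and the helper names are assumptions chosen so the example stays self-contained.

```python
import difflib

# Toy vocabulary (an assumption); a real corrector would use a large word list.
VOCABULARY = ["i", "have", "good", "spelling"]

def correct_word(word: str) -> str:
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=0.7)
    return matches[0] if matches else word

def correct_text(text: str) -> str:
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("I havv goood speling"))  # i have good spelling
```

Unlike the TextBlob output, this sketch lowercases matched words, and it has no notion of word frequency, so it is only suitable as an illustration.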


### Removing Stop Words

In [8]:
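import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: tokenizer models and the stop word lists.
# Recent NLTK releases may also need nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')

text = "This is a sample sentence"
stop_words = set(stopwords.words('english'))
# Compare in lowercase so capitalized stop words such as "This" are removed too.
filtered_text = ' '.join([word for word in word_tokenize(text) if word.lower() not in stop_words])
print(filtered_text)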


sample sentence


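
NLTK's English stop word list includes negations such as "not" and "no", so removing every stop word can reverse the meaning of a sentence in tasks like sentiment analysis. The sketch below shows one way to keep selected words. It uses a small hard-coded `STOP_WORDS` set (an assumption, so the example runs without downloads), and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative subset of a stop word list; a real list is much longer.
STOP_WORDS = {"this", "is", "a", "the", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(text: str, keep=frozenset()) -> str:
    """Drop stop words, except those explicitly listed in `keep`."""
    active = STOP_WORDS - set(keep)
    return " ".join(w for w in text.split() if w.lower() not in active)

print(remove_stop_words("This is not a good sample"))                  # good sample
print(remove_stop_words("This is not a good sample", keep=NEGATIONS))  # not good sample
```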


### Handling Emojis

In [9]:
# install emoji if missing
%pip install emoji



In [10]:
import emoji

text = "I love NLP 😍"
clean_text = emoji.replace_emoji(text, replace='')
print(clean_text)  


I love NLP 
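
Deleting emojis discards information that may matter, for example in sentiment analysis. An alternative is to replace each emoji with a textual placeholder. The sketch below does this with the standard library's `unicodedata` module rather than the `emoji` package, so its placeholder names follow Unicode character names and will differ from that package's output. `describe_emojis` is a hypothetical helper, and multi-codepoint emojis (skin tones, ZWJ sequences) are not handled.

```python
import unicodedata

def describe_emojis(text: str) -> str:
    """Replace characters in the Unicode 'Symbol, other' category with :name: placeholders."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "symbol").lower().replace(" ", "_")
            out.append(f":{name}:")
        else:
            out.append(ch)
    return "".join(out)

# "\U0001F60D" is the heart-eyes emoji, written as an escape to avoid encoding issues.
print(describe_emojis("I love NLP \U0001F60D"))
```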


### Tokenization

In [11]:
from nltk.tokenize import word_tokenize

text = "I love NLP"
tokens = word_tokenize(text)
print(tokens)  

['I', 'love', 'NLP']
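
For text without punctuation, `word_tokenize` gives the same result as `str.split()`. The difference shows up once punctuation is present: a tokenizer separates it from the words. The regex-based `simple_tokenize` below is a rough stand-in written for illustration, not NLTK's algorithm.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I love NLP, really!"
print(sentence.split())           # ['I', 'love', 'NLP,', 'really!']
print(simple_tokenize(sentence))  # ['I', 'love', 'NLP', ',', 'really', '!']
```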
