## 📝 NLP Pipeline - Step 2: Text Preprocessing

### 1️⃣ Cleaning
In this step, we remove unwanted parts from text to make it ready for further processing.  
We will cover:
- HTML tag removal
- Unicode normalization
- Emoji conversion
- Spelling correction

---

### 2️⃣ Basic Preprocessing
This stage has two parts: (I) Basic and (II) Optional.
- (I) Basic: Tokenization, which includes:
  - sentence tokenization
  - word tokenization

- (II) Optional:
  - stop word removal


---

### 3️⃣ Advanced Preprocessing
(To be covered later)

In [None]:
# Example of data cleaning: removing HTML tags
import re

# Sample text with HTML tags
sample_text = "<p>This is <b>bold</b> and <i>italic</i> text.</p>"

# Function to strip HTML tags using regex
def strip_html(data):
    pattern = re.compile(r'<.*?>')  # matches anything between < and >
    return pattern.sub('', data)    # replace tags with empty string

# Apply function
clean_text = strip_html(sample_text)

print("Before:", sample_text)
print("After:", clean_text)

Before: <p>This is <b>bold</b> and <i>italic</i> text.</p>
After: This is bold and italic text.
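
The regex above works for simple markup, but `<.*?>` stops at the first `>` it sees, so it can misfire when an attribute value contains `>` and it leaves entities such as `&amp;` undecoded. As a minimal sketch, assuming only the standard library's `html.parser` is available, the tags can be dropped by a real parser instead. The names `TagStripper` and `strip_html_with_parser` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never stored
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def strip_html_with_parser(html_text):
    stripper = TagStripper()
    stripper.feed(html_text)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.text()


# An attribute value containing '>' and an entity in the text
tricky = '<a title="a > b">Tom &amp; Jerry</a>'
parsed_text = strip_html_with_parser(tricky)
print(parsed_text)
```

For the simple sample used earlier, both approaches should give the same result; the parser only matters once the markup gets messier.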


In [8]:
# Example of data cleaning: Unicode normalization and converting emojis to machine-readable text
!pip install emoji
print()
import unicodedata
import emoji

# Sample text containing emojis
sample_text = "I am happy 😄 and I love pizza 🍕!"

# Step 1: Unicode Normalization (NFC form)
normalized_text = unicodedata.normalize("NFC", sample_text)

# Step 2: Convert emoji to text descriptions
emoji_converted_text = emoji.demojize(normalized_text, language='en')

# Step 3: Make it more ML-friendly (replace underscores and the ':' delimiters with spaces)
emoji_converted_text = emoji_converted_text.replace("_", " ").replace(":", " ")

print("Original text:", sample_text)
print("Normalized text:", normalized_text)
print("Emoji converted text:", emoji_converted_text)


Original text: I am happy 😄 and I love pizza 🍕!
Normalized text: I am happy 😄 and I love pizza 🍕!
Emoji converted text: I am happy  grinning face with smiling eyes  and I love pizza  pizza !
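
In the output above, the normalized text looks identical to the original because the sample was already in NFC form. A small sketch, using only the standard `unicodedata` module, shows a case where normalization does change something. The sample strings are made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' stored as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# The two strings render alike but compare as different
print(composed == decomposed)

# After NFC normalization both use the single-code-point form
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b, len(decomposed), len(nfc_b))

# NFKC additionally folds compatibility characters, such as the 'fi' ligature
ligature = "\ufb01ne"
nfkc = unicodedata.normalize("NFKC", ligature)
print(nfkc)
```

NFC is usually the gentler choice for text cleaning; NFKC is more aggressive and may be preferable when visually similar characters should be treated as the same token.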


In [None]:
# Install the library used for spell checking, plus the corpora it needs
!pip install textblob
!python -m textblob.download_corpora

Collecting textblob
  Using cached textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Using cached textblob-0.19.0-py3-none-any.whl (624 kB)
Installing collected packages: textblob
Successfully installed textblob-0.19.0
[nltk_data] Downloading package brown to
[nltk_data]     /Users/deeplatiyan/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/deeplatiyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/deeplatiyan/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/deeplatiyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/deeplatiyan/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/deeplatiyan/nlt

In [None]:
# Example of data cleaning: spelling correction with TextBlob
from textblob import TextBlob

# Sample text with spelling mistakes
text = "I havv goood speling."

# Create TextBlob object
blob = TextBlob(text)

# Correct the spelling
corrected_text = blob.correct()

print("Original Text:", text)
print("Corrected Text:", corrected_text)

Original Text: I havv goood speling.
Corrected Text: I have good spelling.
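
To get a feel for what a spelling corrector does, here is a toy sketch that replaces each unknown word with its closest match from a word list. It is not how TextBlob works internally; it uses `difflib.get_close_matches` from the standard library, and `VOCAB`, `correct_word`, and `correct_sentence` are hypothetical names with a deliberately tiny vocabulary.

```python
import re
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real corrector would use a large word list
VOCAB = ["i", "have", "good", "spelling"]


def correct_word(word, vocab=VOCAB, cutoff=0.6):
    # Leave non-words (spaces, punctuation) and known words untouched
    if not word.isalpha() or word.lower() in vocab:
        return word
    matches = get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_sentence(sentence):
    # Split into alternating letter / non-letter chunks so spacing survives
    chunks = re.findall(r"[A-Za-z]+|[^A-Za-z]+", sentence)
    return "".join(correct_word(chunk) for chunk in chunks)


toy_corrected = correct_sentence("I havv goood speling.")
print(toy_corrected)
```

This sketch only compares string similarity, so it can pick the wrong word when several candidates are equally close.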


In [5]:
!pip install nltk



In [10]:
# Basic preprocessing
# Tokenization: sentence tokenization and word tokenization

# Import required libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

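# Download tokenizer models (run only once).
# Assumption: recent NLTK releases load the 'punkt_tab' resource instead;
# if sent_tokenize raises a LookupError, try nltk.download('punkt_tab').
nltk.download('punkt')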

# Sample text
text = "I love Python. It is great for NLP!"

# 1️⃣ Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)

# 2️⃣ Word Tokenization
words = word_tokenize(text)
print("Word Tokenization:")
print(words)
print()


Sentence Tokenization:
['I love Python.', 'It is great for NLP!']
Word Tokenization:
['I', 'love', 'Python', '.', 'It', 'is', 'great', 'for', 'NLP', '!']



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/deeplatiyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
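
To see why a dedicated tokenizer is used instead of `str.split`, the sketch below compares a plain whitespace split with two rough regex patterns on the same sentence. The patterns are illustrative assumptions, not what NLTK uses internally.

```python
import re

text = "I love Python. It is great for NLP!"

# Naive whitespace split: punctuation stays glued to the words
whitespace_tokens = text.split()
print(whitespace_tokens)

# Regex sketch: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)

# Very rough sentence split: break on whitespace that follows ., ! or ?
rough_sentences = re.split(r"(?<=[.!?])\s+", text)
print(rough_sentences)
```

The rough sentence split would also break after abbreviations such as "Dr.", which is one reason to prefer a trained tokenizer like the one NLTK provides.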


## 📌 Stop Words Removal Example
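Stop words are very common words (e.g., is, the, and) that carry little meaning on their own; removing them keeps only the more informative words for NLP tasks. Punctuation tokens such as `,` and `.` are not stop words, so they remain in the filtered output of this example.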

In [13]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required resources (uncomment on the first run)
#nltk.download('stopwords')
#nltk.download('punkt')

# Sample text
text = "This is a sample sentence, showing off the stop words filtration."

# Tokenize words
words = word_tokenize(text)

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original text:", text)
print("After stop words removal:", filtered_words)

Original text: This is a sample sentence, showing off the stop words filtration.
After stop words removal: ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
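
The filtered list above still contains `,` and `.`. If punctuation should go as well, one option is to keep only alphanumeric tokens. The sketch below runs without any NLTK downloads by assuming a tiny hand-written stop list (`MINI_STOP_WORDS`) and a regex tokenizer; both are hypothetical stand-ins for `stopwords.words('english')` and `word_tokenize`.

```python
import re

# Hypothetical mini stop list for illustration; NLTK's English list is much longer
MINI_STOP_WORDS = {"this", "is", "a", "off", "the"}


def remove_stop_words(text, stop_words=MINI_STOP_WORDS):
    # Rough tokenization: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep a token only if it is alphanumeric and not a stop word
    return [t for t in tokens if t.isalnum() and t.lower() not in stop_words]


content_words = remove_stop_words(
    "This is a sample sentence, showing off the stop words filtration."
)
print(content_words)
```

Whether to drop punctuation depends on the task: it is usually noise for bag-of-words models but can carry signal for sentiment or sentence-boundary work.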
