# Text Preprocessing

Text preprocessing is a crucial step in any NLP pipeline. Real-world text is messy: it may include **punctuation, numbers, emojis, URLs, and casing inconsistencies**.

The goal of preprocessing is to **clean and normalize text** so it can be effectively analyzed by NLP models.

---
## Common Preprocessing Steps
1. Lowercasing text
2. Removing punctuation
3. Removing numbers and special characters
4. Tokenizing text (splitting into words)
5. Removing stopwords (like 'and', 'the', 'is')
6. Lemmatization or stemming

---
## 🛠️ Import Libraries

In [None]:
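import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and stopword lists (only needed once per environment)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this resource for word_tokenize
nltk.download('stopwords')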

## 🧾 Sample Text

In [None]:
text = "NLP is AWESOME!!! It's transforming how computers understand #language 💬. Visit https://openai.com for more info!"
print(text)

## Step 1️⃣: Lowercasing

In [None]:
text_lower = text.lower()
print(text_lower)

## Step 2️⃣: Remove URLs, Special Characters, and Numbers

In [None]:
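# Remove URLs (anything starting with http or www, up to the next whitespace)
text_clean = re.sub(r'http\S+|www\S+', '', text_lower)

# Keep only lowercase letters and whitespace. This drops digits, punctuation,
# emojis, and the '#' of hashtags; apostrophes go too, so "it's" becomes "its".
text_clean = re.sub(r'[^a-z\s]', '', text_clean)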

print(text_clean)
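
The two substitutions above leave doubled spaces wherever a URL or symbol was removed. The following is a minimal sketch of wrapping the step in a helper that also collapses the leftover whitespace. The name `clean_text` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip URLs and non-letter characters, and collapse whitespace."""
    text = text.lower()
    # Same patterns as the cell above: drop URLs, then anything that is not a-z or whitespace
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    # Removing characters leaves runs of spaces behind; squeeze them into one
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("It's GREAT!!! Visit https://example.com for 3 tips #nlp"))
# its great visit for tips nlp
```

Tokenizers generally ignore repeated whitespace, so the extra substitution mostly matters when the cleaned string itself is stored or compared.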

## Step 3️⃣: Tokenization

In [None]:
tokens = word_tokenize(text_clean)
print(tokens)

## Step 4️⃣: Remove Stopwords

In [None]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)

## Step 5️⃣: Lemmatization (Optional)

In [None]:
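from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize() treats each word as a noun unless a part of speech is passed (e.g. pos='v'),
# so plurals such as "computers" are reduced while verb forms such as "transforming" stay as they are.
lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmas)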

## ✅ Final Cleaned Text

In [None]:
cleaned_text = ' '.join(lemmas)
print(cleaned_text)
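
The individual steps can also be gathered into one reusable function. What follows is a sketch under stated assumptions rather than a definitive implementation: the name `preprocess` and its parameters are hypothetical, and the defaults (whitespace splitting, no stopwords, no lemmatizer) are chosen so that it runs with the standard library alone.

```python
import re
from typing import Callable, Iterable, List, Optional

def preprocess(
    text: str,
    stop_words: Iterable[str] = (),
    tokenize: Callable[[str], List[str]] = str.split,
    normalize: Optional[Callable[[str], str]] = None,
) -> str:
    """Apply the notebook's steps in order and return the cleaned string."""
    text = text.lower()                                  # Step 1: lowercase
    text = re.sub(r'http\S+|www\S+', '', text)           # Step 2: remove URLs
    text = re.sub(r'[^a-z\s]', '', text)                 #         keep letters and whitespace only
    tokens = tokenize(text)                              # Step 3: tokenize
    stop_set = set(stop_words)
    tokens = [t for t in tokens if t not in stop_set]    # Step 4: remove stopwords
    if normalize is not None:                            # Step 5 (optional): lemmatize or stem
        tokens = [normalize(t) for t in tokens]
    return ' '.join(tokens)

# The stopword set here is a made-up sample, not NLTK's list
print(preprocess("The cats are sleeping at https://example.com!",
                 stop_words={"the", "are", "at"}))
# cats sleeping
```

With NLTK loaded as in the earlier cells, the same function could be called as `preprocess(text, stop_words=stopwords.words('english'), tokenize=word_tokenize, normalize=lemmatizer.lemmatize)`.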

---
## 📘 Summary
- Text preprocessing standardizes messy text data.
- Common steps: **lowercasing, cleaning, tokenization, and stopword removal.**
- Optional: **stemming or lemmatization** to reduce words to their base forms.

---
✅ Next: Move to `02-Tokenization_and_Stopwords.ipynb` to learn about **different tokenization methods** and **stopword handling techniques**.