
Text normalization is the process of transforming text into a standard, consistent form to improve the quality of data used in NLP tasks. It is a crucial preprocessing step in NLP and includes several techniques:

### **Key Text Normalization Techniques:**

1. **Lowercasing**  
   - Converts all text to lowercase to maintain consistency.  
   - Example: `"Hello WORLD!" → "hello world!"`

2. **Removing Punctuation**  
   - Eliminates punctuation marks that do not contribute to meaning in many NLP tasks.  
   - Example: `"Hello, world!" → "Hello world"`

3. **Removing Stopwords**  
   - Removes common words (e.g., "is", "the", "and") that do not add much meaning.  
   - Example: `"This is a good day" → "good day"`

4. **Tokenization**  
   - Splits text into individual words or subwords.  
   - Example: `"I love NLP"` → `["I", "love", "NLP"]`

5. **Lemmatization**  
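   - Converts words to their base or dictionary form (lemma); the result can depend on the word's part of speech, e.g. "better" maps to "good" only when it is treated as an adjective.  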
   - Example: `"running" → "run"`, `"better" → "good"`

6. **Stemming**  
   - Reduces words to a stem by stripping suffixes with heuristic rules; the stem need not be a dictionary word.  
   - Example: `"running" → "run"`, `"flies" → "fli"`

7. **Expanding Contractions**  
   - Converts contractions into their full form.  
   - Example: `"don't"` → `"do not"`, `"you're"` → `"you are"`

8. **Removing Special Characters and Numbers**  
   - Eliminates characters like `@`, `#`, `!`, and numbers unless necessary.  
   - Example: `"Hello @world123!"` → `"Hello world"`

9. **Spelling Correction**  
   - Corrects common spelling errors using libraries like `textblob` or `hunspell`.  
   - Example: `"teh" → "the"`, `"recieve" → "receive"`

10. **Handling Accents and Unicode Characters**  
   - Converts accented characters into their simple forms.  
   - Example: `"café" → "cafe"`, `"résumé" → "resume"`

11. **Dealing with Emojis and Emoticons**  
   - Can either remove them or convert them into textual meanings.  
   - Example: `"😊"` → `"smiling face"`
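
The last two techniques can be tried with nothing but the standard library. The sketch below is only an illustration: `strip_accents` and `describe_emojis` are hypothetical helper names, and treating every character in Unicode category `So` ("Symbol, other") as an emoji is a simplifying assumption. The replacement text is the official Unicode character name, so it is longer than the shortened `"smiling face"` used in the example above.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Replace accented letters with their unaccented base letters."""
    # NFKD splits "é" into "e" plus a combining accent (category "Mn"),
    # which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def describe_emojis(text: str) -> str:
    """Replace symbol characters (category "So") with their lowercase Unicode names."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Fall back to an empty string if the character has no name.
            parts.append(unicodedata.name(ch, "").lower())
        else:
            parts.append(ch)
    return "".join(parts)


print(strip_accents("café résumé"))  # cafe resume
print(describe_emojis("Great 😊"))    # Great smiling face with smiling eyes
```

Stripping accents is lossy (for example, it merges "résumé" and "resume"), so whether to apply it depends on the language and the task.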


In [7]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def normalize_text(text):
    """
    Function to normalize text by:
    - Lowercasing
    - Removing special characters and punctuation
    - Tokenization
    - Removing stopwords
    - Lemmatization
    """
    # Convert text to lowercase
    text = text.lower()

    # Remove special characters, numbers, and punctuation
    text = re.sub(r"[^a-z\s]", "", text)

    # Tokenization
    words = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join words back into a normalized sentence
    return " ".join(words)




example running jumping coding fun




In [19]:
# Example usage
text = """Hey there! 😊 John, a software engineer at Google, is running late for his meeting on Monday at 10:30 AM. He’s been working on an AI project that's gonna revolutionize the industry! 🚀 But isn’t it crazy how fast technology evolves? Just last year (in 2023), they released an AI model that outperforms humans in chess!

Meanwhile, in New York, Emma ordered a coffee ☕ for $5.99 and thought, 'This better be good!' She checked her phone 📱— 100+ unread emails! 😡 Btw, have you seen @ElonMusk's latest tweet? #AI #FutureOfTech."""
normalized_text = normalize_text(text)
print(normalized_text)

hey john software engineer google running late meeting monday he working ai project thats gon na revolutionize industry isnt crazy fast technology evolves last year released ai model outperforms human chess meanwhile new york emma ordered coffee thought better good checked phone unread email btw seen elonmusks latest tweet ai futureoftech
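
In the output above, `that's` and `isn’t` come out as `thats` and `isnt`: the regex deletes apostrophes before the stopword filter runs, so the contracted forms no longer match anything in the stopword list. One possible fix is to straighten curly apostrophes and expand contractions first. The sketch below assumes a tiny hand-written mapping; `CONTRACTIONS` and `expand_contractions` are hypothetical names and are not part of NLTK.

```python
import re

# Hypothetical, deliberately tiny mapping; a real project would need a fuller list.
CONTRACTIONS = {
    "isn't": "is not",
    "don't": "do not",
    "you're": "you are",
    "that's": "that is",
    "gonna": "going to",
}

# One alternation that matches any key of the mapping as a whole word.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text: str) -> str:
    """Lowercase the text, straighten curly apostrophes, and expand known contractions."""
    text = text.lower().replace("\u2019", "'")
    return _CONTRACTION_RE.sub(lambda match: CONTRACTIONS[match.group(1)], text)


sample = "But isn\u2019t it crazy? That's gonna change."
expanded = expand_contractions(sample)
print(expanded)  # but is not it crazy? that is going to change.

# The notebook's character filter can now run without producing "isnt" or "thats".
print(re.sub(r"[^a-z\s]", "", expanded))  # but is not it crazy that is going to change
```

Calling a helper like this at the top of `normalize_text`, before the `re.sub` step, would let the stopword filter see `is`, `not`, and `that` as separate words.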


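### **Using spaCy**

**Why use spaCy?**

✅ Fast and optimized (performance-critical parts are written in Cython)

✅ Pre-trained NLP models

✅ Lemmatization that uses the pipeline's POS tags automatically, whereas NLTK's `WordNetLemmatizer` treats every word as a noun unless a POS is passed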

In [16]:
import spacy
import re

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Emoji removal regex
EMOJI_PATTERN = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F"
                           "\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F"
                           "\U0001FA70-\U0001FAFF]+", flags=re.UNICODE)

def normalize_text_spacy(text):
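    """
    Function to normalize text using spaCy:
    - Removing emojis covered by EMOJI_PATTERN
    - Lowercasing
    - Tokenization
    - Removing stopwords, punctuation, and whitespace tokens
    - Lemmatization

    Symbols such as "$" or "+", numbers, and emojis outside the pattern's
    ranges (e.g. "☕", U+2615) are not removed.
    """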
    # Remove emojis
    text = EMOJI_PATTERN.sub("", text)

    # Convert to lowercase and process with spaCy
    doc = nlp(text.lower())

    words = [
        token.lemma_ for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]

    return " ".join(words)







example running jumping code fun


In [20]:
# Example usage
text = """Hey there! 😊 John, a software engineer at Google, is running late for his meeting on Monday at 10:30 AM. He’s been working on an AI project that's gonna revolutionize the industry! 🚀 But isn’t it crazy how fast technology evolves? Just last year (in 2023), they released an AI model that outperforms humans in chess!

Meanwhile, in New York, Emma ordered a coffee ☕ for $5.99 and thought, 'This better be good!' She checked her phone 📱— 100+ unread emails! 😡 Btw, have you seen @ElonMusk's latest tweet? #AI #FutureOfTech."""
normalized_text = normalize_text_spacy(text)
print(normalized_text)

hey john software engineer google run late meeting monday 10:30 work ai project go to revolutionize industry crazy fast technology evolve year 2023 release ai model outperform human chess new york emma order coffee ☕ $ 5.99 think well good check phone 100 + unread email btw see @elonmusk late tweet ai futureoftech
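
The output above still contains `☕` because U+2615 sits in the Miscellaneous Symbols block (U+2600 to U+26FF), which `EMOJI_PATTERN` does not cover. A minimal sketch of one way to widen the pattern follows; `EXTENDED_EMOJI_PATTERN` and `strip_emojis` are illustrative names, and the two added ranges are an assumption about which symbols should count as emojis. Tokens such as `$`, `+`, `10:30`, and `@elonmusk` are a separate matter: as the output shows, the `is_punct` filter does not remove them.

```python
import re

# Same coverage as the notebook's pattern (contiguous ranges merged), plus two
# assumed additions: U+2600-U+27BF (Miscellaneous Symbols and Dingbats) and
# U+FE0F (the emoji variation selector).
EXTENDED_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001FAFF"
    "\u2600-\u27BF"
    "\uFE0F"
    "]+"
)


def strip_emojis(text: str) -> str:
    """Delete matched emoji characters and collapse the whitespace left behind."""
    cleaned = EXTENDED_EMOJI_PATTERN.sub("", text)
    return " ".join(cleaned.split())


print(strip_emojis("Emma ordered a coffee ☕ for $5.99 🚀"))
# Emma ordered a coffee for $5.99
```

The wider ranges also remove non-emoji symbols such as check marks and stars, so the ranges should be adjusted if those symbols carry meaning in the data.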


Both **spaCy** and **NLTK** are widely used NLP libraries, but they serve different purposes. Choosing between them depends on your project requirements.  

---

## **When to Use NLTK?**
 **Best for research, experimentation, and academic use**  
 **Highly customizable for rule-based NLP**  
 **Good for small-scale projects or text preprocessing**  
 **Provides in-depth linguistic features (e.g., grammar parsing, WordNet)**  
 **Useful for statistical NLP, POS tagging, and chunking**  

🔹 **Use NLTK if:**  
- You need full control over NLP preprocessing.  
- You’re working on **linguistics-based** projects or research.  
- You need **custom tokenization, stemming, and POS tagging**.  
- You’re working with **low-resource languages** for which spaCy provides no trained pipeline.  

---

## **When to Use spaCy?**
 **Best for production-level NLP applications**  
 **Faster and optimized for performance (Cython-based)**  
 **Pre-trained NLP models for Named Entity Recognition (NER), POS tagging, and Dependency Parsing**  
 **More efficient and accurate lemmatization**  
 **Handles large-scale text processing**  

🔹 **Use spaCy if:**  
- You need a **fast, scalable, and production-ready** NLP pipeline.  
- You’re building a **chatbot, search engine, or text classification model**.  
- You need **Named Entity Recognition (NER), dependency parsing, or document similarity**.  
- You want **better lemmatization and pre-trained word vectors**.  

---

## **🔍 Quick Comparison Table**

| Feature                | **NLTK** 🟢 | **spaCy** 🔵 |
|------------------------|------------|--------------|
| **Ease of Use**        | More manual setup  | Easier, built-in pipelines |
| **Speed**              | Slower (mostly pure Python)  | Faster (Cython-based) |
| **Lemmatization**      | WordNet-based; treats words as nouns unless a POS tag is passed  | Built-in; uses the pipeline's POS tags |
| **Tokenization**       | Rule-based, can be customized  | Rule-based with language-specific exceptions, can be customized |
| **Stopword Removal**   | Built-in list, manual filtering  | Built-in list exposed as `token.is_stop`, manual filtering |
| **Part-of-Speech (POS) Tagging** | Available via `nltk.pos_tag` as a separate step  | Faster; part of the pre-trained pipeline |
| **Named Entity Recognition (NER)** | Basic, via `nltk.ne_chunk`  | Pre-trained and efficient |
| **Dependency Parsing** | Limited support  | Built-in and optimized |
| **Best For**           | Research, rule-based NLP, small projects  | Production, large-scale NLP tasks |

---

## **When to Use Both Together?**

🔗 **Use NLTK for preprocessing (e.g., stopword removal, stemming) and spaCy for NLP tasks like NER, dependency parsing, and text similarity.**  



## **🔥 Final Verdict**
- **Use NLTK** if you need flexibility and **custom NLP pipelines** for research.  
- **Use spaCy** if you need **fast, pre-trained NLP models for production**.  
- **Use both** when you need **NLTK for preprocessing** and **spaCy for advanced NLP tasks** like NER and dependency parsing.  