Project: Text Normalization Pipeline

Objective

Build a reusable text normalization pipeline that can process raw text data into a clean and structured format for downstream NLP tasks.

In [5]:
# Step 1. Import Necessary Libraries

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources for stopwords, tokenization, and lemmatization
nltk.download('stopwords')  # For stopword removal
nltk.download('punkt')      # For tokenization
nltk.download('wordnet')    # For lemmatization

#Explanation:

#re: Regular expression library, used here to expand contractions and to remove URLs, mentions, and hashtags.
#string: Provides string constants; string.punctuation supplies the characters stripped during punctuation removal.
#nltk: Natural Language Toolkit, a powerful library for working with text data. We use it here for stopwords, tokenization, and lemmatization.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# Step 2. Define the Contraction Expansion Map

# Define a mapping of common contractions to their expanded forms
contraction_map = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "you're": "you are",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "isn't": "is not",
    "didn't": "did not",
    "aren't": "are not",
    "haven't": "have not",
    "hasn't": "has not",
    "don't": "do not",
    "doesn't": "does not",
    "wasn't": "was not",
    "weren't": "were not",
    "wouldn't": "would not",
    "couldn't": "could not",
    "shouldn't": "should not",
    "mightn't": "might not",
    "mustn't": "must not"
}

#Explanation:

#This dictionary maps contractions (e.g., "can't", "won't") to their expanded forms (e.g., "cannot", "will not").
#Expansion has to happen before punctuation removal: once the apostrophe is stripped, "can't" becomes "cant", which matches neither a stopword nor a dictionary word.

# Define the Text Normalization Functions
# Expand Contractions

def expand_contractions(text):
    """
    Expand contractions in the input text.
    Every key of `contraction_map` found as a whole word is replaced with its expanded form.
    The keys are lowercase, so the text should be lowercased before calling this function.
    """
    # Loop through all contractions and replace them with expanded forms
    for contraction, expanded in contraction_map.items():
        # re.escape guards against regex metacharacters in a contraction key
        text = re.sub(r'\b' + re.escape(contraction) + r'\b', expanded, text)
    return text


#Explanation:

#The expand_contractions function looks for words in the text that match entries in the contraction_map and replaces them with their expanded versions.
#The \b anchors restrict matches to whole words, so "it's" is expanded when it stands alone but not when it is the tail of a longer word (e.g., "bit's").

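#Normalize Text

# Build the English stopword set once; normalize_text uses it to filter tokens
stop_words = set(stopwords.words('english'))
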
def normalize_text(text):
    """
    This function normalizes the input text. It performs the following operations:
    - Converts text to lowercase
    - Expands contractions
    - Removes URLs, mentions, hashtags, and punctuation
    - Tokenizes the text (splits into individual words)
    - Removes stopwords
    - Lemmatizes the words (as nouns by default, e.g., "features" -> "feature")
    """
    # Convert the text to lowercase
    text = text.lower()  # Converts all text to lowercase to ensure uniformity

    # Expand contractions (e.g., "i'm" -> "i am")
    text = expand_contractions(text)

    # Remove URLs (e.g., http://example.com or www.example.com)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Match the scheme or "www." plus everything up to the next whitespace

    # Remove mentions and hashtags (e.g., @username and #hashtag)
    text = re.sub(r'@\w+|#\w+', '', text)  # Remove Twitter mentions and hashtags

    # Remove punctuation (e.g., commas, periods, exclamation marks)
    text = text.translate(str.maketrans('', '', string.punctuation))  # Use str.maketrans to remove punctuation

    # Tokenize the text (split it into individual words)
    tokens = word_tokenize(text)  # Tokenizes text into words

    # Remove stopwords (common words like "the", "and", "is", etc.)
    tokens = [word for word in tokens if word not in stop_words]  # Filters out stopwords

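    # Lemmatize the words. lemmatize() treats every token as a noun by default (e.g., "features" -> "feature");
    # verb forms such as "running" are reduced to "run" only when pos='v' is passed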
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatize each token

    # Join the tokens back into a single string
    normalized_text = ' '.join(tokens)

    return normalized_text

#Explanation:

#Lowercasing: Converts all text to lowercase so that the text is uniform (e.g., "Apple" and "apple" are treated the same).
#Contraction Expansion: Expands contractions (e.g., "can't" becomes "cannot") for clearer interpretation.
#URL Removal: Removes URLs with a regular expression that matches a token starting with "http" or "www" and runs to the next whitespace character (\S+ matches one or more non-whitespace characters).
#Mention and Hashtag Removal: Removes social media mentions (e.g., @user) and hashtags (e.g., #hashtag).
#Punctuation Removal: Strips out punctuation marks like periods, commas, and exclamation marks.
#Tokenization: Breaks the text into individual words (tokens) using word_tokenize.
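#Stopword Removal: Removes common words (e.g., "the", "and") that carry little topical information. NLTK's English list also contains negations such as "not", so "is not working" is reduced to "working".
#Lemmatization: Reduces words to their dictionary form using WordNetLemmatizer. With the default noun setting, "features" becomes "feature", while "running" becomes "run" only if pos='v' is passed.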
#Join Tokens: Converts the list of tokens back into a single string for further processing or analysis.
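
The loop in expand_contractions runs one regex substitution per dictionary entry. As a possible alternative, the sketch below compiles all keys into a single pattern and also maps the typographic apostrophe (U+2019) to the ASCII one before matching. The names CONTRACTIONS and expand_contractions_once are hypothetical, and the map is shortened for illustration; this is one way to do it under those assumptions, not the notebook's method.

```python
import re

# Hypothetical, shortened map for illustration; the full contraction_map would work the same way.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "isn't": "is not",
}

# One alternation over all keys, longest first so a longer key is tried before a shorter one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)

def expand_contractions_once(text):
    """Expand contractions in a single regex pass (expects lowercased input)."""
    # Map the typographic apostrophe to ASCII so that "can’t" matches the key "can't".
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand_contractions_once("it’s broken and i can't fix it"))
# prints: it is broken and i cannot fix it
```

A single compiled pattern scans the text once regardless of how many contractions the map holds, which may matter on large corpora.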



In [7]:
# Step 3. Test the Pipeline
# Example raw text data
raw_texts = [
    "I can't believe this product! It's AMAZING 😍 #awesome",
    "Check out our website: https://example.com for great deals!",
    "I'm SO disappointed... The service was terrible. 😡 @Company",
    "We're loving the new features of your app! Keep it up! 👍",
    "Why isn't this app working? Fix it!!! #frustrated"
]

# Apply normalization
normalized_texts = [normalize_text(text) for text in raw_texts]

# Print results
for i, (raw, normalized) in enumerate(zip(raw_texts, normalized_texts)):
    print(f"Original Text {i+1}: {raw}")
    print(f"Normalized Text {i+1}: {normalized}")
    print("-" * 80)

#Explanation:

#Test the Pipeline: We apply the normalize_text function to several example text entries.
#These raw texts contain URLs, mentions, hashtags, punctuation, contractions, uppercase words, and emoji.
#The results are compared by printing both the original and normalized versions to demonstrate how the pipeline cleans and processes the text.

Original Text 1: I can't believe this product! It's AMAZING 😍 #awesome
Normalized Text 1: believe product amazing 😍
--------------------------------------------------------------------------------
Original Text 2: Check out our website: https://example.com for great deals!
Normalized Text 2: check website great deal
--------------------------------------------------------------------------------
Original Text 3: I'm SO disappointed... The service was terrible. 😡 @Company
Normalized Text 3: disappointed service terrible 😡
--------------------------------------------------------------------------------
Original Text 4: We're loving the new features of your app! Keep it up! 👍
Normalized Text 4: loving new feature app keep 👍
--------------------------------------------------------------------------------
Original Text 5: Why isn't this app working? Fix it!!! #frustrated
Normalized Text 5: app working fix
--------------------------------------------------------------------------------
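
The normalized output above still contains emoji, because string.punctuation only covers ASCII punctuation. If a downstream task does not need them, one option is to drop characters in the Unicode category "So" (Symbol, other), which covers most single-codepoint emoji. The helper name strip_symbols is hypothetical, and this is only a sketch: modifiers such as skin tones or zero-width joiners fall into other categories and are not handled here.

```python
import unicodedata

def strip_symbols(text):
    """Drop characters in Unicode category 'So', which includes most emoji."""
    kept = [ch for ch in text if unicodedata.category(ch) != "So"]
    # Re-split and re-join to collapse whitespace left behind by removed characters.
    return " ".join("".join(kept).split())

print(strip_symbols("believe product amazing 😍"))
# prints: believe product amazing
```

For sentiment analysis it may be preferable to keep emoji, since they often carry the sentiment signal.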


Use the Pipeline in Real-World Applications

You can integrate this text normalization pipeline into various NLP projects, such as:


Sentiment Analysis: Use normalized text as input for a sentiment analysis model.

Topic Modeling: Process a corpus of documents to extract common topics.

Chatbots: Preprocess user queries for better understanding and response generation.
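
As a small illustration of handing normalized text to a downstream step, the sketch below counts in how many documents each token appears, a common first statistic for topic modeling or feature selection. The strings are copied from the pipeline output shown earlier; the variable name doc_freq is an assumption made for this example.

```python
from collections import Counter

# Normalized strings copied from the pipeline output shown earlier.
normalized_texts = [
    "believe product amazing 😍",
    "check website great deal",
    "disappointed service terrible 😡",
    "loving new feature app keep 👍",
    "app working fix",
]

# Document frequency: count each token at most once per document.
doc_freq = Counter(
    token for text in normalized_texts for token in set(text.split())
)

print(doc_freq.most_common(1))
# prints: [('app', 2)]
```

Because the pipeline returns a space-joined string, str.split() is enough to recover the tokens.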

Key Features of the Pipeline

Contraction Expansion: Expands the common contractions listed in contraction_map, which appear often in informal real-world text.

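Noise Removal: Strips out URLs, hashtags, mentions, and punctuation. Emoji are left in place, as the test output shows.

Lemmatization: Converts words to their dictionary form (e.g., "features" -> "feature"). Verb forms such as "running" are reduced to "run" only when the lemmatizer is told the word is a verb.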

Stopword Removal: Eliminates common words (e.g., "the", "and") that do not add meaning.
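
To make the features above easier to reuse selectively, the cleaning steps can be written as small functions and composed. The sketch below covers only the standard-library steps (no tokenization, stopword removal, or lemmatization), and every function name in it, including make_pipeline, is hypothetical rather than part of the notebook.

```python
import re
import string

def lowercase(text):
    return text.lower()

def remove_urls(text):
    # Match a scheme or "www." plus everything up to the next whitespace.
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_mentions_hashtags(text):
    return re.sub(r"[@#]\w+", "", text)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text):
    return " ".join(text.split())

def make_pipeline(*steps):
    """Return a function that applies the given steps in order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

clean = make_pipeline(
    lowercase,
    remove_urls,
    remove_mentions_hashtags,
    remove_punctuation,
    collapse_whitespace,
)

print(clean("Check out our website: https://example.com for great deals!"))
# prints: check out our website for great deals
```

Keeping each step separate lets a project drop or reorder steps, for example keeping hashtags for social media analysis, without editing one large function.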