**1. Remove URLs**

**Definition:** URLs are often irrelevant for analysis, so we remove them to focus on the main content. This is especially helpful in social media and review data.

**When to Use:** Use this when text data includes links that don’t contribute to the meaning (e.g., review comments, tweets).

In [1]:
import re

def remove_urls(text):
    # Regular expression pattern to match URLs
    url_pattern = r'http\S+|www\S+'
    cleaned_text = re.sub(url_pattern, '', text)
    return cleaned_text

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit https://example.com for more info!! LOL 😂. Thnx in advnce."
print(remove_urls(text))


Heyyyy!!! This is gr8 🤩, visit  for more info!! LOL 😂. Thnx in advnce.
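
The output above keeps a double space where the link used to be, and the pattern `http\S+|www\S+` would also match ordinary tokens that merely start with those letters (for example `httpd`). A minimal sketch of a stricter variant follows; the name `remove_urls_clean` and the exact pattern are assumptions for illustration, not part of the original notebook.

```python
import re

# Hypothetical variant of remove_urls: requires "http://", "https://" or "www."
# at a word boundary, then collapses the whitespace left by the removed link.
URL_PATTERN = re.compile(r'\b(?:https?://|www\.)\S+', flags=re.IGNORECASE)

def remove_urls_clean(text):
    without_urls = URL_PATTERN.sub('', text)
    # Collapse runs of whitespace into a single space
    return re.sub(r'\s+', ' ', without_urls).strip()

text = "visit https://example.com for more info, or www.example.org today"
print(remove_urls_clean(text))
```

Because `\S+` runs to the next whitespace, punctuation glued to the end of a link (such as a trailing period) is removed along with it.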


**2. Handle Punctuation**

**Definition**: Punctuation marks like “!” and “?” can affect text analysis. We can choose to either remove them entirely or replace them with spaces.

**When to Use**: This is useful when punctuation doesn’t add significant meaning or when we want to standardize the input text.

In [2]:
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (ASCII punctuation only)
    cleaned_text = text.translate(str.maketrans('', '', string.punctuation))
    return cleaned_text

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit example.com for more info!! LOL 😂. Thnx in advnce."
print(remove_punctuation(text))


Heyyyy This is gr8 🤩 visit examplecom for more info LOL 😂 Thnx in advnce
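
The definition mentions replacing punctuation with spaces, while the cell above only deletes it, which is why `example.com` is fused into `examplecom`. The sketch below shows the replace-with-space option; `replace_punctuation_with_spaces` is a hypothetical helper name.

```python
import re
import string

def replace_punctuation_with_spaces(text):
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    spaced = text.translate(table)
    # Collapse the repeated spaces this creates
    return re.sub(r'\s+', ' ', spaced).strip()

print(replace_punctuation_with_spaces("visit example.com for more info!!"))
# visit example com for more info
```

Which option fits better depends on the data: deleting keeps contractions such as `don't` in one piece (`dont`), while replacing keeps words on either side of a dot or slash separate.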


**3. Expand Chat Abbreviations**

**Definition**: Chat abbreviations like “gr8” (great) and “LOL” (laughing out loud) are expanded to their full forms. This step is essential for readability and for models to understand meaning.

**When to Use**: Use this when analyzing informal or social media text where abbreviations are common.

In [3]:
def expand_abbreviations(text):
    # Define a dictionary of common chat abbreviations and their expansions
    abbreviations = {
        "gr8": "great",
        "LOL": "laughing out loud",
        "Thnx": "thanks",
        "advnce": "advance"
    }
    # Replace whitespace-separated tokens that exactly match a key (case-sensitive)
    words = text.split()
    expanded_words = [abbreviations.get(word, word) for word in words]
    return ' '.join(expanded_words)

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit example.com for more info!! LOL 😂. Thnx in advnce."
print(expand_abbreviations(text))


Heyyyy!!! This is great 🤩, visit example.com for more info!! laughing out loud 😂. thanks in advnce.
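
In the output above, `advnce.` is left alone because the trailing period makes the token differ from the dictionary key, and the lookup is also case-sensitive. One possible workaround, sketched under the assumption that lowercase keys and regex word boundaries are acceptable, is shown below; `ABBREVIATIONS` and `expand_abbreviations_regex` are hypothetical names.

```python
import re

# Hypothetical lookup table with lowercase keys so matching can ignore case
ABBREVIATIONS = {
    "gr8": "great",
    "lol": "laughing out loud",
    "thnx": "thanks",
    "advnce": "advance",
}

# One alternation of all keys, bounded by \b so "lol" does not match inside "lollipop"
_PATTERN = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in ABBREVIATIONS) + r')\b',
    flags=re.IGNORECASE,
)

def expand_abbreviations_regex(text):
    # Look up the lowercased match so "LOL", "Lol" and "lol" all expand
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

print(expand_abbreviations_regex("LOL 😂. Thnx in advnce."))
# laughing out loud 😂. thanks in advance.
```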


**4. Tokenization and Stemming**

**Tokenization:** Breaking down text into smaller pieces, typically words, called tokens.

**Stemming:** Reducing words to their root forms to standardize them. Stems are not always dictionary words; the Porter stemmer used below turns "This" into "thi".

**When to Use:** Tokenization and stemming are core preprocessing steps in NLP; they normalize text and shrink the model vocabulary.

In [4]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk

# Download the Punkt tokenizer data (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

def tokenize_and_stem(text):
    # Tokenize the text into words
    tokens = word_tokenize(text)
    # Initialize the stemmer
    stemmer = PorterStemmer()
    # Apply stemming to each token (PorterStemmer also lowercases by default)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit example.com for more info!! LOL 😂. Thnx in advnce."
print(tokenize_and_stem(text))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['heyyyi', '!', '!', '!', 'thi', 'is', 'gr8', '🤩', ',', 'visit', 'example.com', 'for', 'more', 'info', '!', '!', 'lol', '😂', '.', 'thnx', 'in', 'advnc', '.']
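
To make the tokenization step concrete without any downloaded resources, here is a rough regex-based sketch. It is not a reimplementation of `word_tokenize`: for instance, it splits `example.com` into three tokens where the NLTK output above keeps it whole. The name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of letters, digits and underscores; [^\w\s] grabs each
    # remaining non-space character (punctuation, emoji) as its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This is gr8 🤩, visit example.com!!"))
# ['This', 'is', 'gr8', '🤩', ',', 'visit', 'example', '.', 'com', '!', '!']
```

Emoji built from several code points (skin-tone modifiers, joined sequences) would be split into separate tokens by this pattern.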


**5. Handle Incorrect Spelling**

**Definition:** Correcting misspelled words so that variants of the same word map to a single vocabulary entry, which can improve the quality of text analysis and helps avoid vocabulary mismatches.

**When to Use:** Use this in cases with frequent spelling errors, such as social media text or informal messages.

In [7]:
!pip install pyspellchecker
!pip install emoji


Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1
Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [8]:
from spellchecker import SpellChecker

def correct_spelling(text):
    # Initialize spell checker
    spell = SpellChecker()
    words = text.split()
    # Note: `word in spell` is True for words the dictionary already knows,
    # so only known words are passed to spell.correction here
    corrected_words = [spell.correction(word) if word in spell else word for word in words]
    return ' '.join(corrected_words)

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit example.com for more info!! LOL 😂. Thnx in advnce."
print(correct_spelling(text))


Heyyyy!!! This is gr8 🤩, visit example.com for more info!! LOL 😂. Thnx in advnce.
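
The output above is identical to the input. The condition `word in spell` holds only for words the dictionary already knows, so unknown tokens pass through untouched, and tokens with attached punctuation such as `advnce.` are never looked up on their own. The sketch below illustrates the intended control flow (skip known words, correct unknown ones, keep the original when no candidate is found) using the standard library's `difflib` and a tiny hypothetical vocabulary as a stand-in for a real dictionary. With pyspellchecker the same structure would presumably use `spell.unknown(...)` and `spell.correction(...)`, guarding against a `None` result.

```python
import difflib
import re

# Hypothetical toy vocabulary; a real checker would use a full dictionary
VOCABULARY = ["thanks", "advance", "great", "visit", "more", "info", "this", "is", "in", "for"]

def correct_spelling_sketch(text, vocabulary=VOCABULARY, cutoff=0.8):
    def fix(match):
        word = match.group(0)
        if word.lower() in vocabulary:
            return word  # already known: leave untouched
        # Closest vocabulary entry by similarity ratio, if any clears the cutoff
        candidates = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        return candidates[0] if candidates else word  # no candidate: keep original
    # Only purely alphabetic words are checked, so punctuation, emoji and
    # mixed tokens such as "gr8" stay in place
    return re.sub(r"\b[A-Za-z]+\b", fix, text)

print(correct_spelling_sketch("Thnx in advnce."))
```

With this toy vocabulary and cutoff, `advnce` is mapped to `advance`, while `Thnx` is left as it is because it is not similar enough to any entry; chat abbreviations are better handled by the expansion step than by a spell checker.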


**6. Handle Emojis**

**Definition:** Emojis are commonly used in social media and casual texts. You can either remove them or replace them with descriptive text to capture their sentiment.

**When to Use:** Use this for datasets containing social media posts, reviews, or any informal texts where emojis convey important context or sentiment.

In [9]:
import emoji

def handle_emojis(text):
    # Translate emojis to descriptive text
    return emoji.demojize(text)

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit example.com for more info!! LOL 😂. Thnx in advnce."
print(handle_emojis(text))


Heyyyy!!! This is gr8 :star-struck:, visit example.com for more info!! LOL :face_with_tears_of_joy:. Thnx in advnce.
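
The definition also mentions removing emojis, which the cell above does not show. Below is a rough standard-library sketch of that option. The code point ranges are an assumption chosen for illustration and do not cover every emoji; the `emoji` package's `replace_emoji` function is likely the more complete route when the package is installed.

```python
import re

# Approximate emoji coverage: the main pictograph blocks, miscellaneous symbols
# and dingbats, plus the variation selector and zero-width joiner used inside
# emoji sequences. This is an illustrative assumption, not the full emoji list.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]+")

def remove_emojis(text):
    without_emojis = EMOJI_PATTERN.sub('', text)
    # Collapse the gaps left behind by removed emojis
    return re.sub(r'\s+', ' ', without_emojis).strip()

print(remove_emojis("This is gr8 🤩 and LOL 😂"))
# This is gr8 and LOL
```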


**Putting It All Together**


In [10]:
def preprocess_text(text):
    # Step 1: Remove URLs
    text = remove_urls(text)
    # Step 2: Remove punctuation
    text = remove_punctuation(text)
    # Step 3: Expand chat abbreviations
    text = expand_abbreviations(text)
    # Step 4: Tokenize and stem
    text = ' '.join(tokenize_and_stem(text))
    # Step 5: Correct spelling
    text = correct_spelling(text)
    # Step 6: Handle emojis
    text = handle_emojis(text)

    return text

# Example usage
text = "Heyyyy!!! This is gr8 🤩, visit https://example.com for more info!! LOL 😂. Thnx in advnce."
print(preprocess_text(text))


heyyyi thi is great :star-struck: visit for more info laugh out loud :face_with_tears_of_joy: thank in advanc
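
The result above depends on the order of the steps: abbreviations are expanded only after punctuation has been stripped, stemming runs before the spelling step (so stems such as `thi` and `advanc` are what reaches it), and emojis are converted last, which is why their colons survive. One way to make the order explicit and easy to rearrange is to hold the steps in a list. The sketch below uses simple stand-in steps (all hypothetical) so that it runs on its own; in the notebook, the list would hold the functions defined in the earlier cells.

```python
from functools import reduce
import re

def run_pipeline(text, steps):
    # Apply each step to the result of the previous one, in list order
    return reduce(lambda current, step: step(current), steps, text)

# Stand-in steps (hypothetical) so the sketch runs without the earlier cells
def strip_urls(text):
    return re.sub(r'https?://\S+', '', text)

def lowercase(text):
    return text.lower()

def squeeze_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

steps = [strip_urls, lowercase, squeeze_spaces]
print(run_pipeline("Visit https://example.com for MORE info", steps))
# visit for more info
```

Reordering or dropping a step is then a one-line change to `steps`, which makes it easier to compare, for example, spell checking before and after stemming.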
