# Real-World Use Case: Cleaning Social Media Data

**Scenario**: You are analyzing Tweets. 
**Content**: "OMG!!! I luv this product 😃😃😃 #best @company check http://link.com"
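**Problem**: Standard NLTK-style cleaning (word tokenization, punctuation stripping) handles this poorly. Emojis, handles, hashtags, and URLs each need special treatment.
**Goal**: Clean the tweet with custom regular expressions.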
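
To make the problem concrete, here is a minimal sketch of a generic, tweet-unaware cleaner built only on the standard library. It is a stand-in for a typical normalization step, not NLTK's actual pipeline, and `naive_clean` is a hypothetical name used for illustration. The emojis are left out of the sample to keep the output easy to read.

```python
import string

def naive_clean(text):
    """Lowercase the text and strip ASCII punctuation, with no tweet-specific rules."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table)

sample = "I luv this product #best @company check http://link.com"
print(naive_clean(sample))
# → i luv this product best company check httplinkcom
```

The handle and hashtag lose their markers and become ordinary words, and the URL collapses into the meaningless token `httplinkcom`. This is why URLs and handles should be removed before any punctuation is stripped.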

In [None]:
import re

raw_tweet = "OMG!!! I luv this product 😃😃😃 #best @company check http://link.com"

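def clean_tweet(text):
    """Strip URLs, handles, and '#' markers, then squeeze elongated characters."""
    # 1. Remove URLs (anything starting with "http", up to the next whitespace)
    text = re.sub(r'http\S+', '', text)

    # 2. Remove handles (@user)
    text = re.sub(r'@\w+', '', text)

    # 3. Hashtags: drop the '#' but keep the word, since it often carries meaning.
    #    To drop the whole hashtag instead, use re.sub(r'#\w+', '', text).
    text = re.sub(r'#', '', text)

    # 4. Emojis are kept here, which suits sentiment analysis.
    #    For topic modeling, a crude way to drop them (and all other non-ASCII text) is:
    # text = text.encode('ascii', 'ignore').decode('ascii')

    # 5. Squeeze characters repeated 3+ times down to one ("lovvvve" -> "love").
    #    Digits and whitespace are excluded so numbers like "1000" stay intact.
    text = re.sub(r'([^\d\s])\1{2,}', r'\1', text)

    # 6. Collapse the gaps left behind by the removals above
    text = re.sub(r'\s+', ' ', text)

    return text.strip()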

cleaned = clean_tweet(raw_tweet)
print(f"Original: {raw_tweet}")
print(f"Cleaned:  {cleaned}")
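
Step 4 leaves the emoji decision open. Below is a minimal sketch of making that choice explicit with a flag, without discarding other non-ASCII text such as accented letters. The names `EMOJI_PATTERN` and `handle_emojis` are hypothetical, and the code point ranges are an assumption that covers common emoji blocks rather than the full Unicode emoji set.

```python
import re

# Approximate emoji ranges (assumption: common blocks only, not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)

def handle_emojis(text, keep=True):
    """Keep emojis (e.g. for sentiment analysis) or strip them (e.g. for topic modeling)."""
    if keep:
        return text
    without_emojis = EMOJI_PATTERN.sub("", text)
    # Collapse the whitespace gap left where the emojis used to be.
    return re.sub(r"\s+", " ", without_emojis).strip()

print(handle_emojis("I luv this product 😃😃😃 best", keep=False))
# → I luv this product best
```

Unlike the ASCII-only trick, this leaves words like "café" untouched, at the cost of maintaining the range list as new emojis are added to Unicode.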

## Conclusion
Real-world NLP is 90% regex and 10% models: most of the effort goes into cleaning the text before any model sees it.