| Pattern | Meaning            | Matches                                                                    |
| ------- | ------------------ | -------------------------------------------------------------------------- |
| `\w`    | Word character     | `[a-zA-Z0-9_]`, plus Unicode letters and digits unless `re.ASCII` is set   |
| `\W`    | Non-word character | Anything **not** matched by `\w`                                           |
| `\d`    | Digit              | `[0-9]`, plus other Unicode decimal digits unless `re.ASCII` is set        |
| `\D`    | Non-digit          | Any non-digit character                                                    |
| `\s`    | Whitespace         | Space, tab, newline (`[\t\n\r\f\v ]`), plus other Unicode whitespace       |
| `\S`    | Non-whitespace     | Any non-whitespace character                                               |
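
A quick way to see what these classes cover is to run them over a string that mixes ASCII and non-ASCII characters. The sketch below is only illustrative: the sample string and variable names are made up for this example, and it assumes Python 3 `str` patterns, where matching is Unicode-aware unless `re.ASCII` is passed.

```python
import re

# Sample text (made up for this example): accented letters, an
# Arabic-Indic digit three (U+0663), and a plain ASCII token.
sample = "na\u00efve caf\u00e9 \u0663 42_ok"

# Default behaviour for str patterns: \w and \d are Unicode-aware.
unicode_words = re.findall(r'\w+', sample)
unicode_digits = re.findall(r'\d+', sample)

# With re.ASCII, \w is exactly [a-zA-Z0-9_] and \d is exactly [0-9].
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)
ascii_digits = re.findall(r'\d+', sample, flags=re.ASCII)

print(unicode_words)   # ['naïve', 'café', '٣', '42_ok']
print(ascii_words)     # ['na', 've', 'caf', '42_ok']
print(unicode_digits)  # ['٣', '42']
print(ascii_digits)    # ['42']
```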


| Pattern | Meaning           | Example                                    |
| ------- | ----------------- | ------------------------------------------ |
| `^`     | Start of string   | `^The` matches "The..." at start only      |
| `$`     | End of string     | `end$` matches "...end" at end only        |
| `\b`    | Word boundary     | `\bcat\b` matches "cat", not "scatter"     |
| `\B`    | Non-word boundary | `\Bcat\B` matches "educational", not "cat" |
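
The anchors in the table match positions rather than characters, which is easier to see by looking at where each match starts. The following is a small sketch with a made-up sentence. The last two lines also show that `^` only matches at the start of every line when `re.MULTILINE` is passed.

```python
import re

text = "The cat sat near the scatter plot at the end"

starts_with_the = bool(re.search(r'^The', text))  # anchored at position 0
ends_with_end = bool(re.search(r'end$', text))    # anchored at the end

# \bcat\b needs a word boundary on both sides: only the standalone word.
whole_word_at = [m.start() for m in re.finditer(r'\bcat\b', text)]
# \Bcat\B needs word characters on both sides: only the "cat" in "scatter".
inside_word_at = [m.start() for m in re.finditer(r'\Bcat\B', text)]

print(starts_with_the, ends_with_end)  # True True
print(whole_word_at, inside_word_at)   # [4] [22]

# By default ^ matches only at the start of the whole string.
first_only = re.findall(r'^\w+', "one\ntwo")
per_line = re.findall(r'^\w+', "one\ntwo", flags=re.MULTILINE)
print(first_only, per_line)            # ['one'] ['one', 'two']
```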


In [2]:
import re

# Example 1: Remove punctuation (keeps letters, digits, underscores, and whitespace)
text1 = "Hello, world! NLP 2025."
cleaned = re.sub(r'[^\w\s]', '', text1)
print("1. Remove punctuation:", cleaned)
# Output: Hello world NLP 2025

# Example 2: Replace digits with '#'
text2 = "My phone number is 123-456-7890."
masked = re.sub(r'\d', '#', text2)
print("2. Replace digits with '#':", masked)
# Output: My phone number is ###-###-####.

# Example 3: Collapse runs of whitespace into a single space
text3 = "This   is   spaced     weirdly."
normalized = re.sub(r'\s+', ' ', text3)
print("3. Collapse spaces:", normalized)
# Output: This is spaced weirdly.

# Example 4: Remove non-ASCII characters
text4 = "Café naïve résumé"
ascii_only = re.sub(r'[^\x00-\x7F]', '', text4)
print("4. Remove non-ASCII:", ascii_only)
# Output: Caf nave rsum  (accented characters are deleted, not converted to ASCII)


1. Remove punctuation: Hello world NLP 2025
2. Replace digits with '#': My phone number is ###-###-####.
3. Collapse spaces: This is spaced weirdly.
4. Remove non-ASCII: Caf nave rsum
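
Output 4 shows that the non-ASCII pattern deletes accented letters outright. If the goal is to keep the base letters instead, one common approach is to decompose the text with `unicodedata.normalize` first and then strip what remains. The sketch below assumes that approach; the helper name `fold_to_ascii` is hypothetical and not part of the examples above.

```python
import re
import unicodedata

def fold_to_ascii(text):
    """Hypothetical helper: drop accents but keep the base letters."""
    # NFKD splits "é" into "e" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are non-ASCII, so the same regex as in
    # Example 4 now removes only the accents.
    return re.sub(r'[^\x00-\x7F]', '', decomposed)

folded = fold_to_ascii("Caf\u00e9 na\u00efve r\u00e9sum\u00e9")
print(folded)  # Cafe naive resume
```

Characters with no ASCII decomposition (for example most non-Latin scripts) are still removed by this sketch.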


What does r'' mean?
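It marks a raw string: Python does not treat \ (backslash) as the start of an escape sequence, so the regex engine receives the pattern exactly as written. For example, r'\b' is a word boundary, while '\b' is a single backspace character.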
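
The difference is easiest to see with `\b`, which is a backspace character in an ordinary Python string but a word boundary in a regex. A minimal sketch, with variable names chosen only for this illustration:

```python
import re

# In a normal string literal, '\b' is ONE character: backspace (0x08).
# In a raw string, r'\b' is TWO characters: a backslash and the letter b.
normal_len = len('\b')
raw_len = len(r'\b')
print(normal_len, raw_len)  # 1 2

text = "cat scatter"

# Raw string: the regex engine sees \b and treats it as a word boundary.
with_raw = re.findall(r'\bcat\b', text)
# Normal string: the engine looks for literal backspace characters instead.
without_raw = re.findall('\bcat\b', text)
print(with_raw, without_raw)  # ['cat'] []

# Doubling the backslashes is equivalent to the raw form, just noisier.
same_pattern = ('\\bcat\\b' == r'\bcat\b')
print(same_pattern)  # True
```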

In [3]:
import re

# Example 5: Remove URLs
text5 = "Check out this website: https://www.example.com and this one too: http://another-site.org/page"
# This regex attempts to match common URL patterns
url_pattern = r'https?://\S+'
cleaned_text = re.sub(url_pattern, '', text5)
print("5. Remove URLs:", cleaned_text)
# Output: Check out this website:  and this one too:

5. Remove URLs: Check out this website:  and this one too: 
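
Because `\S+` is greedy, it also consumes punctuation that directly follows a URL, and removing the URL leaves doubled spaces behind. One possible way to handle both is sketched below. The helper `drop_url_keep_punct`, the `TRAILING` set, and the sample sentence are assumptions made for this example, not part of the cell above.

```python
import re

text = ("Docs live at https://www.example.com/guide, "
        "mirrored at http://another-site.org/page.")

# The simple pattern swallows the trailing comma and period as well.
naive = re.sub(r'https?://\S+', '', text)

TRAILING = '.,;:!?)'  # assumed set of sentence punctuation to preserve

def drop_url_keep_punct(match):
    """Hypothetical helper: remove the URL, return its trailing punctuation."""
    url = match.group(0)
    return url[len(url.rstrip(TRAILING)):]

without_urls = re.sub(r'https?://\S+', drop_url_keep_punct, text)
# Close the gap left before punctuation, then collapse repeated whitespace.
tidy = re.sub(r'\s+([.,;:!?])', r'\1', without_urls)
tidy = re.sub(r'\s{2,}', ' ', tidy).strip()

print(repr(naive))  # 'Docs live at  mirrored at '
print(repr(tidy))   # 'Docs live at, mirrored at.'
```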


In [4]:
import re

# Example 6: Remove dates (common formats like MM/DD/YYYY, MM-DD-YYYY)
text6 = "Meeting scheduled for 12/25/2024 and project deadline is 01-15-2025."
date_pattern = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}'
cleaned_text = re.sub(date_pattern, '', text6)
print("6. Remove dates:", cleaned_text)
# Output: Meeting scheduled for  and project deadline is .

6. Remove dates: Meeting scheduled for  and project deadline is .
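
The date pattern above is permissive: it accepts mixed separators, three-digit years, and digits embedded in longer tokens. A stricter variant, offered here as a sketch rather than a complete date validator, adds word boundaries, a backreference so both separators must be the same, and a two- or four-digit year. The sample text is made up for this example.

```python
import re

text = "Build 1-2-345 shipped on 12/25/2024; ticket 3/4-21 closes 01-15-25."

loose_pattern = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}'
# ([-/]) captures the first separator and \1 requires the same one again;
# (?:\d{4}|\d{2}) allows only 4- or 2-digit years; \b blocks partial matches.
strict_pattern = r'\b\d{1,2}([-/])\d{1,2}\1(?:\d{4}|\d{2})\b'

loose = re.findall(loose_pattern, text)
# finditer + group(0) returns the whole match despite the capture group.
strict = [m.group(0) for m in re.finditer(strict_pattern, text)]

print(loose)   # ['1-2-345', '12/25/2024', '3/4-21', '01-15-25']
print(strict)  # ['12/25/2024', '01-15-25']
```

Neither pattern checks that the month and day are in range, so a value such as 99/99/2024 would still match.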


In [None]:
import re

# Example 7: Remove Emojis
text7 = "Hello 😊 world! This is fun 🎉."
# This regex pattern attempts to match a broad range of emoji Unicode blocks.
# It might not catch every single emoji, but covers many common ones.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F300-\U0001F5FF"  # Symbols & Pictographs
    "\U0001F680-\U0001F6FF"  # Transport & Map
    "\U0001F700-\U0001F77F"  # Alchemical Symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2"             # Circled Latin Capital Letter M
    "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement (part)
    "]+"
)

cleaned_text = emoji_pattern.sub(r'', text7)
print("7. Remove Emojis:", cleaned_text)
# Output: Hello  world! This is fun .

7. Remove Emojis: Hello  world! This is fun .


| Unicode Range           | Meaning             | Examples       |
| ----------------------- | ------------------- | -------------- |
| `\U0001F600-\U0001F64F` | Smileys & Emotions  | 😊 😢 😎       |
| `\U0001F300-\U0001F5FF` | Weather, Objects    | 🌪️ 🔔 🕰️     |
| `\U0001F680-\U0001F6FF` | Vehicles, Transport | 🚗 🚲 🚀       |
| `\U00002702-\U000027B0` | Dingbats (symbols)  | ✂ ✉ ✔          |
| ...and more             |                     | 🎉 🧠 🕹️ etc. |
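
Many emoji are written as a base character followed by an invisible variation selector (U+FE0F), and multi-part emoji are glued together with a zero-width joiner (U+200D). Code-point ranges like those in the table do not cover these two characters, so they can be left behind after removal. The sketch below uses a deliberately small, hypothetical pattern with just two of the ranges from the table to show the effect.

```python
import re

# Hypothetical reduced patterns: two ranges from the table, nothing more.
emoji_only = re.compile("[\U0001F300-\U0001F5FF\U00002702-\U000027B0]+")
# Same ranges plus U+FE0F (variation selector-16) and U+200D (zero-width joiner).
emoji_with_joiners = re.compile(
    "[\U0001F300-\U0001F5FF\U00002702-\U000027B0\uFE0F\u200D]+"
)

# Scissors followed by a variation selector, then a bell.
text = "Cut here \u2702\uFE0F then ring \U0001F514"

left_over = emoji_only.sub('', text)
clean = emoji_with_joiners.sub('', text)

print('\uFE0F' in left_over)  # True: the invisible selector survived
print('\uFE0F' in clean)      # False
print(repr(clean))            # 'Cut here  then ring '
```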
