### NLP 05: Text Cleaning using Regular Expressions (Regex)

In Natural Language Processing, raw text is often noisy: it carries URLs, HTML tags, stray punctuation, and inconsistent spacing.
Before feeding text into a model, regular expressions (regex) help us clean and standardize it efficiently.

#####  Why Regex for Text Cleaning?

Regex allows us to:  <br>
Remove unwanted patterns  <br>
Normalize text  <br>
Improve model performance  <br>

##### Common Text Cleaning Tasks with Regex

🔸 Remove URLs  <br>
🔸 Remove special characters & punctuation  <br>
🔸 Remove numbers  <br>
🔸 Convert text to lowercase  <br>
🔸 Remove extra whitespaces  <br>

### Example 1: Remove URLs

In [1]:
import re

text = "Read more at https://openai.com/blog"
clean = re.sub(r'http\S+', '', text)
print(clean)

Read more at 
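
The pattern `http\S+` is deliberately loose: it misses bare `www.` links and also swallows ordinary words that happen to start with "http". A slightly stricter variant can be sketched as follows; the helper name `remove_urls` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Assumed pattern: require "http://" or "https://", or a leading "www."
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def remove_urls(text):
    """Remove http(s) and www links; hypothetical helper for illustration."""
    return URL_PATTERN.sub('', text)

text = "The httpd docs live at www.example.com and https://example.com/docs"
print(repr(re.sub(r'http\S+', '', text)))  # loose pattern also deletes "httpd"
print(repr(remove_urls(text)))             # stricter pattern keeps "httpd"
```

Like the original, this still removes any punctuation glued to the end of a URL, so it is a starting point rather than a complete URL matcher.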


### Example 2: Remove Email Addresses

In [2]:
text = "Contact us at support@email.com"
clean = re.sub(r'\S+@\S+', '', text)
print(clean)

Contact us at 
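
Because `\S+@\S+` grabs every non-space character around the `@`, it also removes punctuation that touches the address. A narrower pattern might look like the sketch below; the helper name `remove_emails` and the pattern itself are assumptions for illustration and do not cover every valid address.

```python
import re

# Assumed pattern: local part, "@", then a dotted domain such as "email.com"
EMAIL_PATTERN = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')

def remove_emails(text):
    """Remove email-like tokens but keep surrounding punctuation."""
    return EMAIL_PATTERN.sub('', text)

text = "Contact support@email.com, or ping @user."
print(repr(remove_emails(text)))
```

The mention `@user` survives because the pattern requires characters before the `@` and a dotted domain after it.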


### Example 3: Remove Numbers

In [3]:
text = "AI will grow by 2026"
clean = re.sub(r'\d+', '', text)
print(clean)

AI will grow by 
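
Note that `\d+` also strips digits inside tokens, so "GPT4" becomes "GPT". When numbers carry meaning, one common alternative is to replace only standalone numbers with a placeholder. The sketch below assumes a placeholder token `<NUM>` and a helper name `mask_numbers`; both are illustrative choices.

```python
import re

# Assumed pattern: whole numbers or simple decimals that stand alone as a token
NUMBER_PATTERN = re.compile(r'\b\d+(?:\.\d+)?\b')

def mask_numbers(text, placeholder="<NUM>"):
    """Replace standalone numbers with a placeholder token."""
    return NUMBER_PATTERN.sub(placeholder, text)

text = "GPT4 scored 92.5 on 3 benchmarks in 2026"
print(mask_numbers(text))
```

Passing `placeholder=""` gives plain removal while still leaving tokens like "GPT4" intact.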


### Example 4: Remove Punctuation

In [4]:
text = "Hello!!! NLP, is awesome."
clean = re.sub(r'[^\w\s]', '', text)
print(clean)

Hello NLP is awesome
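
One detail worth knowing: `\w` matches letters, digits, and the underscore, so `[^\w\s]` leaves underscores in place. If underscores should go too, the pattern can be extended as in this sketch (the helper name `remove_punctuation` is an assumption for illustration).

```python
import re

def remove_punctuation(text):
    """Remove every non-word, non-space character, plus underscores."""
    return re.sub(r'[^\w\s]|_', '', text)

text = "snake_case, rocks!"
print(re.sub(r'[^\w\s]', '', text))  # underscore is kept
print(remove_punctuation(text))      # underscore is removed
```

In Python 3, `\w` is Unicode-aware for `str` patterns, so accented letters such as "é" are preserved by both versions.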


### Example 5: Remove Emojis

In [5]:
text = "I love NLP 😍🔥"
# Note: this pattern strips every non-ASCII character (accented letters too), not only emojis
clean = re.sub(r'[^\x00-\x7F]+', '', text)
print(clean)

I love NLP 
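
The ASCII-only pattern is simple, but it also deletes accented letters and any non-Latin script. A more targeted approach is to match the Unicode blocks where most emojis live. The ranges below are an approximate assumption rather than a complete emoji definition, and `remove_emojis` is a hypothetical helper name.

```python
import re

# Approximate emoji ranges (assumption, not exhaustive):
#   U+1F300..U+1FAFF  pictographs, emoticons, transport, supplemental symbols
#   U+2600..U+27BF    miscellaneous symbols and dingbats
#   U+FE0F, U+200D    variation selector and zero-width joiner used in sequences
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"
    "\u2600-\u27BF"
    "\uFE0F\u200D"
    "]+"
)

def remove_emojis(text):
    """Remove characters in the assumed emoji ranges, keep other Unicode text."""
    return EMOJI_PATTERN.sub('', text)

text = "I love NLP 😍🔥 at the café"
print(repr(re.sub(r'[^\x00-\x7F]+', '', text)))  # also drops the "é"
print(repr(remove_emojis(text)))                 # keeps the "é"
```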


### Example 6: Remove HTML Tags

In [6]:
text = "<p>This is <b>NLP</b></p>"
clean = re.sub(r'<.*?>', '', text)
print(clean)

This is NLP
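
Scraped HTML usually contains entities such as `&amp;` in addition to tags. A minimal sketch that strips tags with the same pattern and then decodes entities with the standard library's `html.unescape` is shown below; the helper name `strip_html` is an assumption for illustration.

```python
import html
import re

def strip_html(text):
    """Strip tags first, then decode entities like &amp; and &lt;."""
    no_tags = re.sub(r'<.*?>', '', text)
    return html.unescape(no_tags)

print(strip_html("<p>Tom &amp; Jerry say 2 &lt; 3</p>"))
```

Tags are removed before decoding so that an escaped `&lt;` in the text is not mistaken for the start of a tag. For messy real-world pages, a proper HTML parser is generally more robust than a regex.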


### Example 7: Remove Extra Whitespaces

In [7]:
text = "NLP     is    very   powerful"
clean = re.sub(r'\s+', ' ', text)
print(clean)

NLP is very powerful
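
Since `\s` also matches newlines, the pattern above flattens multi-line text into a single line and leaves a stray space at either end if the input had one. When line breaks matter, a variant like the following sketch may help; the helper name `squeeze_spaces` is an illustrative assumption.

```python
import re

def squeeze_spaces(text):
    """Collapse runs of spaces and tabs, keep newlines, trim both ends."""
    return re.sub(r'[ \t]+', ' ', text).strip()

text = "  NLP     is   very\npowerful  "
print(repr(squeeze_spaces(text)))
```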


### Example 8: Remove Hashtags & Mentions (Social Media NLP)

In [8]:
text = "@user NLP is amazing! #AI #ML"
clean = re.sub(r'@\w+|#\w+', '', text)
print(clean)

 NLP is amazing!  
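
Hashtags often carry topic information, so deleting them outright can throw away signal. One alternative, sketched below under that assumption, is to drop mentions, keep the hashtag word without its `#`, and then tidy the leftover spaces. The helper name `clean_social` is hypothetical.

```python
import re

def clean_social(text):
    """Drop @mentions, keep hashtag words without '#', tidy whitespace."""
    text = re.sub(r'@\w+', '', text)        # remove mentions entirely
    text = re.sub(r'#(\w+)', r'\1', text)   # "#AI" -> "AI"
    return re.sub(r'\s+', ' ', text).strip()

print(clean_social("@user NLP is amazing! #AI #ML"))
```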


### Example 9: Keep Only Letters

In [9]:
text = "NLP2026_is@Powerful!"
clean = re.sub(r'[^a-zA-Z\s]', '', text)
print(clean)

NLPisPowerful


### Example 10: Normalize Repeated Characters

In [10]:
text = "sooooo good!!!"
# Collapses every run of a repeated character to one, so legitimate doubles ("good") are hit too
clean = re.sub(r'(.)\1+', r'\1', text)
print(clean)

so god!
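
As the output shows, collapsing every repeat to a single character turns "good" into "god". A gentler heuristic is to touch only runs of three or more and cap them at two, since English words rarely repeat a letter three times. The sketch below assumes that rule; `limit_repeats` is a hypothetical helper name.

```python
import re

def limit_repeats(text):
    """Reduce any character repeated 3+ times in a row to exactly two."""
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(limit_repeats("sooooo good!!!"))
```

This leaves "soo" rather than "so", so a spell-checking or dictionary step would still be needed to fully normalize elongated words.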


### Combined Text Cleaning Function (Real Project Style)

In [11]:
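def clean_text(text):
    text = re.sub(r'http\S+', '', text)        # remove URLs
    text = re.sub(r'<.*?>', '', text)          # remove HTML tags
    text = re.sub(r'@\w+|#\w+', '', text)      # remove mentions and hashtags
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # keep only ASCII letters and whitespace
    text = re.sub(r'\s+', ' ', text)           # collapse runs of whitespace
    return text.strip().lower()                # trim both ends, then lowercase

sample = "🔥 Visit https://ai.com <b>NLP</b> @user #AI2026"
print(clean_text(sample))

visit nlp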
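
When the same steps run over many documents, it can be convenient to precompile the patterns once and keep them in an ordered list. The following is a minimal sketch of that idea using the same patterns as `clean_text`; the names `PIPELINE` and `clean_corpus` and the sample documents are assumptions for illustration.

```python
import re

# Ordered (pattern, replacement) steps; order matters, e.g. URLs are removed
# before the letters-only step would break them apart.
PIPELINE = [
    (re.compile(r'http\S+'), ''),         # URLs
    (re.compile(r'<.*?>'), ''),           # HTML tags
    (re.compile(r'@\w+|#\w+'), ''),       # mentions and hashtags
    (re.compile(r'[^a-zA-Z\s]'), ''),     # anything that is not a letter or space
    (re.compile(r'\s+'), ' '),            # runs of whitespace
]

def clean_corpus(texts):
    """Apply every pipeline step to each text, then trim and lowercase."""
    cleaned = []
    for text in texts:
        for pattern, replacement in PIPELINE:
            text = pattern.sub(replacement, text)
        cleaned.append(text.strip().lower())
    return cleaned

docs = ["Visit https://ai.com <b>NLP</b> @user #AI2026", "Great   talk!!! #NLP"]
print(clean_corpus(docs))
```

Keeping the steps in a list makes it easy to reorder them or switch one off for a particular dataset.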


### Why This Step Matters

Clean text →  <br>
✔ Better tokenization <br>
✔ Better embeddings  <br>
✔ Better model accuracy  <br>

Regex-based cleaning is a foundational skill for anyone working in NLP, LLMs, and Generative AI.