<a href="https://colab.research.google.com/github/MehraeenTimas/nlp-course/blob/main/Regex_patternMatching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 4. Regex for Pattern Matching

Regular expressions (regex) let you extract or filter text that follows a specific pattern. NLTK's RegexpTokenizer tokenizes text according to a custom regex, while spaCy's Matcher supports regex constraints on token attributes (through the REGEX operator) alongside its linguistic features. For patterns that span raw text, such as email addresses, Python's built-in re module can be used directly.



In [None]:
# NLTK Regex Example
from nltk.tokenize import RegexpTokenizer
text = "Hello world! This is an example: email@example.com, phone: 123-456-7890."

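# Define a tokenizer that keeps runs of word characters (letters, digits, underscore).
# Punctuation is dropped, so the email address and phone number are split into pieces.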
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print("NLTK Regex Tokens:", tokens)

NLTK Regex Tokens: ['Hello', 'world', 'This', 'is', 'an', 'example', 'email', 'example', 'com', 'phone', '123', '456', '7890']
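
The output above shows that `\w+` breaks the email address and phone number apart. One way to keep them whole is to list more specific alternatives before the generic word pattern. The sketch below uses only Python's `re` module; the pattern and the names `token_pattern` and `tokens_whole` are illustrative assumptions, not part of the original notebook, and the email and phone sub-patterns are deliberately simple.

```python
import re

text = "Hello world! This is an example: email@example.com, phone: 123-456-7890."

# Alternatives are tried left to right, so the specific patterns come first.
# Non-capturing groups (?:...) make findall return whole matches, not groups.
token_pattern = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # simple email address (illustrative)
    r"|\d{3}-\d{3}-\d{4}"             # phone number in the form 123-456-7890
    r"|\w+"                           # fallback: any run of word characters
)

tokens_whole = token_pattern.findall(text)
print("Tokens:", tokens_whole)
```

The order of the alternatives matters: if `\w+` came first, it would consume `email` before the email alternative was ever tried.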


In [None]:
# spaCy Regex Example
import spacy
from spacy.matcher import Matcher

# Load the small English model
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Define a regex-based pattern: match single tokens that start with a capital letter
# followed by at least one lowercase letter
pattern = [{"TEXT": {"REGEX": "^[A-Z][a-z]+"}}]
matcher.add("CAPITAL_PATTERN", [pattern])

text = "Hello world! This is an Example sentence."
doc = nlp(text)

# Apply the matcher and print matched tokens
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Matched token:", span.text)


Matched token: Hello
Matched token: This
Matched token: Example
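
The pattern above is anchored only at the start of the token. Assuming the REGEX operator applies search-style matching to each token's text (an assumption about spaCy's behavior, not something the notebook states), a token such as `HeLLo` would also match because its first two characters satisfy the pattern. The sketch below illustrates the difference with plain `re` on a hand-written token list; the list and variable names are hypothetical.

```python
import re

# Hypothetical token texts, standing in for the tokens spaCy would produce.
candidates = ["Hello", "world", "HeLLo", "NASA", "A", "Example"]

prefix_only = re.compile(r"^[A-Z][a-z]+")    # anchored at the start only
whole_token = re.compile(r"^[A-Z][a-z]+$")   # anchored at both ends

# search() succeeds as soon as the pattern matches at the anchored position.
search_hits = [t for t in candidates if prefix_only.search(t)]
full_hits = [t for t in candidates if whole_token.search(t)]

print("Start-anchored:", search_hits)
print("Fully anchored:", full_hits)
```

Adding `$` restricts matches to tokens that consist of one capital letter followed only by lowercase letters.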


In [None]:
import re
text_emails = "Contact us at admin.support_34@example.com or sales-dep@company.org for inquiries."

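# Example 1: Email Detection
# Local part, "@", domain, then a dot and a top-level domain of at least two letters.
# This is a practical approximation, not a full validator for every legal address.
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"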

emails = re.findall(email_pattern, text_emails)
print("Detected Emails:", emails)


Detected Emails: ['admin.support_34@example.com', 'sales-dep@company.org']
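
When the parts of each address are needed separately, the same pattern can be written with named groups and applied with `re.finditer`. This is a minimal sketch: the group names `local` and `domain` and the variable `email_parts` are assumptions introduced here for illustration.

```python
import re

text_emails = "Contact us at admin.support_34@example.com or sales-dep@company.org for inquiries."

# Same character classes as the email pattern above, split into two named groups.
email_parts_pattern = re.compile(
    r"(?P<local>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"
)

# finditer yields match objects, so each group can be read by name.
email_parts = [
    (m.group("local"), m.group("domain"))
    for m in email_parts_pattern.finditer(text_emails)
]

for local, domain in email_parts:
    print("Local part:", local, "| Domain:", domain)
```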
