# Regex

A regex (regular expression) is a sequence of characters that defines a search pattern. In NLP, regexes are used to:

- Clean text
- Extract patterns (emails, dates, hashtags)
- Find or replace specific patterns
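
The three uses listed above can be sketched in a few lines. The sample string and the email pattern here are illustrative assumptions; the email pattern is deliberately simple and is not a complete validator.

```python
import re

# Hypothetical sample text, made up for this illustration.
tweet = "Loving #NLP and #regex!!   Email me: jane.doe@example.com  "

# Extract: a hashtag is '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", tweet)

# Extract: a deliberately simple email pattern (an assumption, not RFC-complete).
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
emails = re.findall(email_pattern, tweet)

# Replace: mask every email address with a placeholder token.
masked = re.sub(email_pattern, "<EMAIL>", tweet)

# Clean: collapse runs of whitespace into one space and trim the ends.
cleaned = re.sub(r"\s+", " ", masked).strip()

print(hashtags)  # ['#NLP', '#regex']
print(emails)    # ['jane.doe@example.com']
print(cleaned)   # Loving #NLP and #regex!! Email me: <EMAIL>
```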

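## Import re Library

Python's built-in `re` module provides regex support. Every example below assumes `import re` has been run.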

## 1. Character Classes

**. → Any character except newline**

In [12]:
import re
re.findall(r'c.t', 'cat cut cot c t', re.I)

['cat', 'cut', 'cot', 'c t']

**\w \d \s → Word, digit, whitespace**

In [3]:
text = """
General text7 with multiple words:

This is a simple text string.

Test string with numbers 12345 and symbols !@#$

Regex can be tricky at times, but it's powerful.

Look out for 2025, it's going to be an exciting year!

Call me at 555-123-4567 for more info.
"""

re.findall(r"\d+", text)

['7', '12345', '2025', '555', '123', '4567']

In [None]:
re.findall(r'\w+', 'Regex101 is #1!')
# ['Regex101', 'is', '1']

re.findall(r'\d+', 'Year: 2025')
# ['2025']

re.findall(r'\s+', 'a   b')
# ['   ']

**\W \D \S → Not word, digit, whitespace**

In [None]:
re.findall(r'\W+', 'A&B*C!')
# ['&', '*', '!']

re.findall(r'\D+', 'Call 911 now!')
# ['Call ', ' now!']

re.findall(r'\S+', 'text   with   spaces')
# ['text', 'with', 'spaces']

**[abc] → Any of a, b, or c**


In [8]:
re.findall(r'[abc]', 'Apple banana carrot', re.I)

['A', 'b', 'a', 'a', 'a', 'c', 'a']

**[^abc] → Not a, b, or c**

**[a-g] → Character between a and g**

In [None]:
re.findall(r'[a-g]', 'abcdefgzxy')
# ['a', 'b', 'c', 'd', 'e', 'f', 'g']
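
As a small sketch of how ranges and negation combine, the block below runs a range, its negated form, and `[^abc]` side by side. The input strings are chosen only for illustration.

```python
import re

sample = "abcdefgzxy"

# [a-g] keeps only the characters in the range a..g.
in_range = re.findall(r"[a-g]", sample)

# A leading ^ inside the brackets negates the class: everything outside a..g.
out_of_range = re.findall(r"[^a-g]", sample)

# [^abc] matches every character that is not a, b, or c.
not_abc = re.findall(r"[^abc]", "cabbage")

print(in_range)      # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(out_of_range)  # ['z', 'x', 'y']
print(not_abc)       # ['g', 'e']
```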

## 2. Anchors

**^abc$ → Start & end of string**

In [4]:
re.match(r'^abc$', 'abc')  # full string is "abc"
# Match object exists

re.match(r'^abc$', 'abcd')  # does not match full string
# None
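
How much the anchors matter depends on which `re` function is called. The following sketch compares `re.match`, `re.search`, and `re.fullmatch`, and shows the effect of the `re.MULTILINE` flag; the test strings are illustrative.

```python
import re

# re.match anchors only at the start: '^' is implied, '$' is not.
match_prefix = bool(re.match(r"abc", "abcd"))           # True
match_anchored = bool(re.match(r"abc$", "abcd"))        # False

# re.search scans the whole string, so explicit anchors matter.
search_loose = bool(re.search(r"abc", "xxabcxx"))       # True
search_anchored = bool(re.search(r"^abc$", "xxabcxx"))  # False

# re.fullmatch requires the pattern to cover the entire string.
full = bool(re.fullmatch(r"abc", "abc"))                # True

# With re.MULTILINE, '^' and '$' also match at the start and end of each line.
lines = re.findall(r"^\w+$", "one\ntwo\nthree", re.MULTILINE)

print(match_prefix, match_anchored, search_loose, search_anchored, full)
print(lines)  # ['one', 'two', 'three']
```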


**\b and \B → Word boundary and not word boundary**

In [7]:
re.findall(r'\bcat\b', 'cat catalog category')
# ['cat']

re.findall(r'\Bcat\B', 'educate location')

['cat', 'cat']

## 3. Escaped Characters

**`\.` `\*` `\\` → Escape regex symbols**

In [8]:
re.findall(r'\.', 'test.example.com')
# ['.', '.']

re.findall(r'\*', 'a * b * c')

['*', '*']
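
When the text to search for comes from a variable rather than a hand-written pattern, `re.escape` can add the backslashes automatically. A minimal sketch, assuming a hypothetical search term `term` that happens to contain metacharacters:

```python
import re

# Hypothetical search term containing the metacharacters '.' and '*'.
term = "a.b*c"
haystack = "a.b*c axbbc"

# Unescaped: '.' means "any character" and '*' means "zero or more b",
# so the pattern matches 'axbbc' but not the literal text 'a.b*c'.
unescaped = re.findall(term, haystack)

# Escaped: re.escape backslash-escapes the metacharacters,
# so the pattern matches only the literal text.
escaped = re.findall(re.escape(term), haystack)

print(unescaped)  # ['axbbc']
print(escaped)    # ['a.b*c']
```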

**\t \n \r → Tabs, newlines, carriage returns**

In [11]:
text = 'first\tsecond\nthird\rfourth'
re.findall(r'\t|\n|\r', text)


['\t', '\n', '\r']

## 4. Groups & Lookaround

**(abc) → Capture group**

In [13]:
import re

txt = """Serial numbers: A1234B, A5678C, Z4321X.
Account number: 987654321 Name: Gagan Surname: Puri
Account number: 78969873"""

re.findall(r"Name: (\w+)| Surname: (\w+)", txt)

[('Gagan', ''), ('', 'Puri')]
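
The empty strings in that result come from the alternation: each match fills only one of the two groups. One possible alternative, sketched here on a single line of the same text, is to match both fields in one pattern. The group names `first` and `last` are illustrative choices.

```python
import re

line = "Account number: 987654321 Name: Gagan Surname: Puri"

# Two groups without alternation: findall returns one tuple per match,
# with both groups filled.
pairs = re.findall(r"Name: (\w+) Surname: (\w+)", line)

# Named groups (?P<name>...) make the captured fields self-describing.
m = re.search(r"Name: (?P<first>\w+) Surname: (?P<last>\w+)", line)
fields = m.groupdict() if m else {}

print(pairs)   # [('Gagan', 'Puri')]
print(fields)  # {'first': 'Gagan', 'last': 'Puri'}
```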

**(?:abc) → Non-capturing group**

In [30]:
re.findall(r'(?:ha){2}', 'hahaha')

['haha']

In [4]:
txt = """Serial numbers: A1234B, A5678C, Z4321X.
Account number: 987654321 Name: Gagan Surname: -=-=-
Account number: 78969873"""

re.findall(r"(?:Account number: )\d+", txt)

['Account number: 987654321', 'Account number: 78969873']
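
The difference between capturing and non-capturing groups is easiest to see with `re.findall`, which returns the group contents whenever the pattern has a capture group. A short sketch on a trimmed copy of the text above:

```python
import re

txt = "Account number: 987654321\nAccount number: 78969873"

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:Account number: )\d+", txt)

# Capturing the prefix: findall returns only the group's text, not the digits.
prefix_only = re.findall(r"(Account number: )\d+", txt)

# Capturing the digits: findall returns just the account numbers.
digits = re.findall(r"Account number: (\d+)", txt)

print(whole)        # ['Account number: 987654321', 'Account number: 78969873']
print(prefix_only)  # ['Account number: ', 'Account number: ']
print(digits)       # ['987654321', '78969873']
```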

In [36]:
txt = """Name: Gagan bahadur Puri
Name: Sailesh Yadav
address: jpt address 01, Nepal
"""
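# Write a regex that matches the names of the people in the text. It should return only the names, e.g. 'Gagan bahadur Puri', 'Sailesh Yadav'.
# Alternative without re.I: Name: ([A-Za-z ]+)  (note A-Za-z, not A-za-z, which would also admit [ \ ] ^ _ `)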
re.findall(r"Name: ([a-z ]+)", txt, re.I)

['Gagan bahadur Puri', 'Sailesh Yadav']

In [15]:
re.findall(r'(?:ha)+(ga)', 'hahahaga')

['ga']

**(?=abc) → Positive lookahead**

In [16]:
re.findall(r'\w+(?=ing)', 'eating running played')

['eat', 'runn']
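
A lookahead checks what follows without consuming it, which is why `ing` is missing from the results above. The sketch below contrasts it with a plain pattern, and adds `\b` inside the lookahead so that only a word-final `ing` counts; the second input string is an illustrative assumption.

```python
import re

text = "eating running played"

# Lookahead: 'ing' must follow, but it is not part of the match.
stems = re.findall(r"\w+(?=ing)", text)

# No lookahead: 'ing' is consumed and included in the match.
full_words = re.findall(r"\w+ing", text)

# Without \b, an 'ing' in the middle of a word also satisfies the lookahead.
loose = re.findall(r"\w+(?=ing)", "singing ringer")

# With \b inside the lookahead, only a word-final 'ing' qualifies.
strict = re.findall(r"\w+(?=ing\b)", "singing ringer")

print(stems)       # ['eat', 'runn']
print(full_words)  # ['eating', 'running']
print(loose)       # ['sing', 'r']
print(strict)      # ['sing']
```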