# Regular Expressions (Regex) in NLP

**Regular expressions** (regex) are patterns that describe sets of strings and are used to match, search, and rewrite text. In NLP they are a common tool for text preprocessing, for example extracting numbers or normalizing formats.

## Common Regex Patterns

| Pattern | Description | Example |
|---------|-------------|---------|
| `\d` | Any digit (0-9) | `\d+` matches "123" |
| `\w` | Any word character (letter, digit, or underscore) | `\w+` matches "hello" |
| `\s` | Any whitespace (space, tab, newline) | `\s+` matches a run of spaces |
| `\D` | Any non-digit | `\D+` matches "abc" |
| `{n}` | Exactly n occurrences | `\d{3}` matches "123" |
| `{n,m}` | Between n and m occurrences | `\d{2,4}` matches "12" to "1234" |
| `\|` | OR operator (alternation) | `cat\|dog` matches either "cat" or "dog" |
| `()` | Capture group | `(\d+)` captures digits |
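
As a quick check of the table, the sketch below applies several of the patterns to one string. The sample sentence and variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical sample sentence, chosen only to exercise the patterns above.
sample = "Order 42 shipped to cat lovers in 2024"

digit_runs = re.findall(r"\d+", sample)           # one or more digits
word_runs = re.findall(r"\w+", sample)            # one or more word characters
four_digits = re.findall(r"\d{4}", sample)        # exactly four digits in a row
pets = re.findall(r"cat|dog", sample)             # either alternative
captured = re.findall(r"(\d+) shipped", sample)   # only the group is returned

print(digit_runs)   # ['42', '2024']
print(four_digits)  # ['2024']
print(pets)         # ['cat']
print(captured)     # ['42']
```

Note that when a pattern contains a capture group, `re.findall()` returns the captured text rather than the whole match.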

## Python's `re` Module
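- `re.findall()` - Return all non-overlapping matches as a list
- `re.search()` - Return the first match found anywhere in the string
- `re.sub()` - Replace matches with a new string
- `re.match()` - Match only at the beginning of the string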
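
The four functions differ mainly in where they look and what they return. A minimal sketch, using a made-up sentence and the placeholder token `<NUM>` (both assumptions for illustration):

```python
import re

text = "Call 555 or 777 today"  # hypothetical example text

all_numbers = re.findall(r"\d+", text)   # list of every match
first = re.search(r"\d+", text)          # Match object for the first match, or None
at_start = re.match(r"\d+", text)        # None here: the text does not start with a digit
masked = re.sub(r"\d+", "<NUM>", text)   # every match replaced by a placeholder

print(all_numbers)    # ['555', '777']
print(first.group())  # 555
print(at_start)       # None
print(masked)         # Call <NUM> or <NUM> today
```

Replacing numbers with a placeholder token, as `re.sub()` does here, is one common preprocessing step; whether it is appropriate depends on the task.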

In [1]:
import re

## Basic Pattern Matching

In [4]:
text = "Patient's phone is 7211059591. Bill amount is 120$"

pattern = r'\d+'

match = re.findall(pattern, text)
match

['7211059591', '120']

### Finding All Numbers

`\d+` matches one or more consecutive digits, so `re.findall()` returns every run of digits in the text: here both the phone number and the bill amount.

In [5]:
text = "Patient's phone is 7211059591. Bill amount is 120$"

pattern = r'\d{10}'

match = re.findall(pattern, text)
match

['7211059591']

### Matching Specific Length Numbers

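`\d{10}` matches exactly 10 consecutive digits, which fits 10-digit phone numbers and skips the shorter bill amount. Be aware that it will also match the first 10 digits of a longer digit run.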
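
One way to reject longer digit runs is to wrap the pattern in word boundaries (`\b`). The sketch below compares the two patterns; the text with a 12-digit account number is an invented example, not data from the notebook.

```python
import re

# Hypothetical text containing a 10-digit phone and a 12-digit account number.
text = "Phone 7211059591, account 123456789012"

loose = re.findall(r"\d{10}", text)       # also grabs part of the longer number
strict = re.findall(r"\b\d{10}\b", text)  # requires a boundary on both sides

print(loose)   # ['7211059591', '1234567890']
print(strict)  # ['7211059591']
```

`\b` matches between a word character and a non-word character, so this assumes the number is not glued to letters or underscores.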

In [7]:
text = "Patient's phone is (732)-111-9999, spouse phone number 7211059591. Bill amount is 120$"

pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

match = re.findall(pattern, text)
match

['(732)-111-9999', '7211059591']

### Matching Multiple Phone Number Formats

The `|` (OR) operator lets a single pattern match either of two phone number formats:
- `\(\d{3}\)-\d{3}-\d{4}` matches `(732)-111-9999`
- `\d{10}` matches `7211059591`
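
Once both formats are matched, a possible follow-up step (not in the original notebook) is to normalize them to plain digits by deleting every non-digit with `\D`. A minimal sketch:

```python
import re

text = ("Patient's phone is (732)-111-9999, spouse phone number 7211059591. "
        "Bill amount is 120$")

pattern = r"\(\d{3}\)-\d{3}-\d{4}|\d{10}"
phones = re.findall(pattern, text)

# Strip parentheses and hyphens so both formats compare equal as digit strings.
normalized = [re.sub(r"\D", "", phone) for phone in phones]

print(normalized)  # ['7321119999', '7211059591']
```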

In [12]:
text = "Patient's phone is 7211059591. Bill amount is 120$"

pattern = r'(\d{10})\D+(\d+)\$'

match = re.search(pattern, text)
match

<re.Match object; span=(19, 50), match='7211059591. Bill amount is 120$'>

## Capture Groups

**Capture groups** `()` let you extract specific parts of a match. `re.search()` returns a match object (or `None` if nothing matches); call `.groups()` on it to get the captured values as a tuple.

Pattern breakdown: `(\d{10})\D+(\d+)\$`
- `(\d{10})` - Capture 10-digit phone number
- `\D+` - One or more non-digit characters (separator)
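- `(\d+)` - Capture the bill amount
- `\$` - A literal dollar sign (escaped, because an unescaped `$` anchors the end of the string); it is matched but not captured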

In [15]:
phone_number, bill_amount = match.groups()
phone_number, bill_amount

('7211059591', '120')
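
Because `re.search()` returns `None` when the pattern is absent, calling `.groups()` directly can raise `AttributeError`. The sketch below guards against that and uses named groups (`(?P<name>...)`) for readability. The helper name `extract_bill` and the group names are hypothetical, chosen for this illustration.

```python
import re

# Same pattern as above, with named groups instead of positional ones.
BILL_PATTERN = re.compile(r"(?P<phone>\d{10})\D+(?P<amount>\d+)\$")

def extract_bill(text):
    """Return (phone, amount) as strings, or None if the pattern is not found."""
    match = BILL_PATTERN.search(text)
    if match is None:
        return None
    return match.group("phone"), match.group("amount")

found = extract_bill("Patient's phone is 7211059591. Bill amount is 120$")
missing = extract_bill("No numbers here")

print(found)    # ('7211059591', '120')
print(missing)  # None
```

The captured values are strings; convert the amount with `int()` if arithmetic is needed.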