Regex (Regular Expressions) is basically a super picky detective for your text files. Got messy PDFs, chaotic emails, jumbled-up logs, or weirdly formatted websites? Regex doesn’t panic—it steps in like, “Don’t worry, I got this.” With Regex, you can search for stuff like phone numbers, dates, student IDs, names, or pretty much anything else—even if everything’s all over the place. No neat columns or tidy rows needed. Regex says, “Structure? Never heard of her.” So, if your text data looks like a tornado hit it and you still need to pull out specific info, Regex will become your new best friend.

**Regex Reference: Common Symbols and Patterns**

Here’s a quick reference you can use while building your own regex patterns, and to help you understand the different ways regex is applied in the examples below.


| **Symbol** | **Meaning**                             | **Example**                           |
|------------|-----------------------------------------|---------------------------------------|
| `.`        | Any character (except newline)          | `a.b` → matches `acb`, `a*b`          |
| `\d`       | Any digit (0–9)                         | `\d` → `2` <br>`\d+` → `2025`         |
| `\w`       | Word character (letter, digit, `_`)     | `\w` → `L` <br>`\w+` → `LoanID`       |
| `\s`       | Whitespace (space, tab, newline)        | `\s` → `" "` <br>`\s+` → `"   "`      |
| `+`        | One or more                             | `\d+` → `123`                         |
| `*`        | Zero or more                            | `a*` → `""`, `aaa`                    |
| `?`        | Zero or one (optional)                  | `colou?r` → `color`, `colour`         |
| `{n}`      | Exactly *n*                             | `\d{4}` → `2025`                      |
| `[abc]`    | Character set (a or b or c)             | `[aeiou]` → any vowel                 |
| `^`        | Start of string (start of each line with `re.MULTILINE`) | `^Name:`                 |
| `$`        | End of string (end of each line with `re.MULTILINE`)     | `City$`                  |
| `( … )`    | Capturing group                         | `(\d+)` → extract digits              |
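
To see a few of these symbols in action, here is a small sketch on a made-up string (the `sample` text is an assumption for illustration only, it is not part of the loan file). It also shows that in Python `^` only matches at the start of every line when the `re.MULTILINE` flag is passed.

```python
import re

# Made-up sample text, used only for illustration
sample = "Name: Ana\nCity: Lima\nYear: 2025"

# \d{4} -> exactly four digits in a row
years = re.findall(r"\d{4}", sample)

# ^ matches at the start of every line only when re.MULTILINE is set
labels = re.findall(r"^\w+", sample, re.MULTILINE)

# Without the flag, ^ matches only at the very start of the string
first_only = re.findall(r"^\w+", sample)

# colou?r -> the "u" is optional
spellings = re.findall(r"colou?r", "color or colour")

print(years)
print(labels)
print(first_only)
print(spellings)
```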


# 1. Setup: Download the Sample File and Import the Regex Module

In [1]:
! pip install gdown

import gdown

# Step 1: Build the direct-download URL from the file ID
file_id = "1VqSELIOm0tRBkEHg_0WhIga3wlVFoRPz"
url = f"https://drive.google.com/uc?id={file_id}"

# Step 2: Download the actual .txt file
gdown.download(url, "loan_application.txt", quiet=False)


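# Step 3: Read the content from the downloaded file
# (utf-8-sig drops the byte-order mark, if there is one, from the start of the file)
with open("loan_application.txt", "r", encoding="utf-8-sig") as file:
    text = file.read()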

print("Preview of file content:")
print(text[:500])  # Print the first 500 characters to check



Downloading...
From: https://drive.google.com/uc?id=1VqSELIOm0tRBkEHg_0WhIga3wlVFoRPz
To: /content/loan_application.txt
100%|██████████| 587/587 [00:00<00:00, 1.57MB/s]

Preview of file content:
Loan Application Form




LoanID: L-2023-0001
L-2023-0001
L-2023-0001
L-2023-0001
       
Application ID:    A-998877     




Name: John Smith            
Email: john.smith@example.com     
Phone:  (555)   123-4567     


Spouse Name: 
Jenna Smith


Home Address:  123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1




Date of Birth:  1990-02-15    
Application Date: March   25,   2025   
Loan Amount:  $25,000.00    




--- Notes ---
This           application                     was                





In [2]:
# Import Python's built-in regular expression (Regex) module

import re

# 2. Extract Names, IDs, Addresses or Dates from a Text File with Regex

**Find and Extract Dates in YYYY-MM-DD Format Using Regex**



In [3]:
# Extract dates in the format YYYY-MM-DD from a given text
# Returns all matching date strings as a list

dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)

['1990-02-15']
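
The file also holds a date written as `March   25,   2025`, which the pattern above skips. The sketch below is one possible way to catch that format and convert it to YYYY-MM-DD. It assumes English month names spelled out in full; anything else that happens to fit the pattern would make `strptime` raise a `ValueError`. The `sample` string is a short copy of two lines from the preview.

```python
import re
from datetime import datetime

# Two lines copied from the file preview (note the irregular spacing)
sample = "Date of Birth:  1990-02-15    \nApplication Date: March   25,   2025   "

# Capitalized word, 1-2 digit day, comma, 4-digit year, with any whitespace in between
long_dates = re.findall(r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}", sample)

# Collapse the extra spaces, parse the date, and print it as YYYY-MM-DD
normalized = [
    datetime.strptime(re.sub(r"\s+", " ", d), "%B %d, %Y").strftime("%Y-%m-%d")
    for d in long_dates
]

print(long_dates)
print(normalized)
```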


**Extract Dollar Amounts Like $25,000.00**


In [4]:
# Extract dollar amounts from a given text, including commas and optional decimals
# Matches values like $350,000 or $1,500.00

amounts = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?', text)
print(amounts)

['$25,000.00']
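
Matched amounts come back as strings, so you cannot add them up yet. Here is a minimal sketch of one way to turn them into numbers with `Decimal`. The `Fee: $350` part of `sample` is made up for illustration and does not appear in the loan file.

```python
import re
from decimal import Decimal

# First amount is from the file; the fee is a made-up extra value
sample = "Loan Amount:  $25,000.00    Fee: $350"

amounts = re.findall(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?", sample)

# Strip the "$" and the thousands separators, then convert to Decimal
values = [Decimal(a.replace("$", "").replace(",", "")) for a in amounts]

print(amounts)
print(sum(values))
```

`Decimal` is used here instead of `float` so that cents are kept exactly.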


**Match Capitalized Full Names That Span Across Multiple Lines**

In [5]:
# Extract capitalized full names (First Last, with an optional middle initial) that follow a 'Name:' label
# \s+ also matches line breaks, so a name on the line after the label is still found

name_lines = re.findall(r'Name:\s+([A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)', text)
print(name_lines)

['John Smith', 'Jenna Smith']


**Extract Basic Address Format**


In [6]:
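# Extracts Ontario mailing addresses from text, including optional apartment/unit info
# The province is hard-coded as 'ON' and the postal code must follow the Canadian 'A1A 1A1' layout
# Matches formats like '123 Maple St., Apt. 5B, Toronto, ON M5J 2N1' or '123 Maple St., Toronto, ON M5J 2N1'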

address = re.findall(r'\d{1,5} [A-Za-z. ]+,(?: Apt\.? \w+,)? [A-Za-z ]+, ON\s+[A-Z]\d[A-Z] \d[A-Z]\d', text)
print(address)

['123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1']
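
Long patterns like this one get hard to read fast. As a sketch, the same pattern can be spread over several commented lines with the `re.VERBOSE` flag. In verbose mode plain spaces in the pattern are ignored, so each literal space is written as `[ ]`. The second test address is made up for illustration.

```python
import re

# Same address pattern as above, split into commented pieces with re.VERBOSE
address_pattern = re.compile(r"""
    \d{1,5}[ ]                  # street number, then a space
    [A-Za-z. ]+,                # street name, ending with a comma
    (?:[ ]Apt\.?[ ]\w+,)?       # optional apartment/unit part
    [ ][A-Za-z ]+,              # city
    [ ]ON\s+                    # province, hard-coded to Ontario
    [A-Z]\d[A-Z][ ]\d[A-Z]\d    # postal code such as M5J 2N1
""", re.VERBOSE)

with_apt = address_pattern.findall(
    "Home Address:  123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1"
)
# Made-up address without an apartment part
without_apt = address_pattern.findall("Office: 9 King Rd., Ottawa, ON K1A 0B1")

print(with_apt)
print(without_apt)
```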


**Extract Loan IDs Like L-2023-0001 from Text**


In [7]:
# Extracts all Loan IDs in the format 'L-YYYY-NNNN' from the text
# Useful for identifying or indexing loan records in documents

loan_ids = re.findall(r'L-\d{4}-\d{4}', text)
print(loan_ids)

['L-2023-0001', 'L-2023-0001', 'L-2023-0001', 'L-2023-0001']


**Identify and Extract Email Addresses from Text**

In [8]:
# Extracts email addresses from a given text
# Useful for capturing applicant or agent contact information

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)

['john.smith@example.com']


**Extract Phone Numbers in Common Formats (With or Without Country Code)**

In [9]:
# Extracts 10-digit phone numbers (3-3-4 grouping) with an optional country-code prefix (e.g. +1)
# Numbers with a different digit layout, such as many international formats, need their own pattern

phones = re.findall(r'(?:\+?\d{1,3}[\s\-]?)?(?:\(?\d{3}\)?[\s\-]*)\d{3}[\s\-]?\d{4}', text)
print(phones)

['(555)   123-4567']
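
The match keeps whatever spacing and punctuation the document had. If you want one consistent format, a common follow-up step is to strip everything that is not a digit. A minimal sketch, where the `Alt:` number in `sample` is made up for illustration:

```python
import re

# First number is from the file; the "Alt" number is a made-up example
sample = "Phone:  (555)   123-4567     Alt: +1 555-987-6543"

phone_pattern = r"(?:\+?\d{1,3}[\s\-]?)?(?:\(?\d{3}\)?[\s\-]*)\d{3}[\s\-]?\d{4}"
phones = re.findall(phone_pattern, sample)

# \D is the opposite of \d: remove every character that is not a digit
digits_only = [re.sub(r"\D", "", p) for p in phones]

print(phones)
print(digits_only)
```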


**Extract Loan & Application IDs for Each Entry**

In [10]:
# Extracts loan or application IDs such as L-2023-0001 or A-998877
# Useful for identifying multiple types of document IDs
# A-\d+ takes every digit after "A-", so longer application IDs are not cut short

ids = re.findall(r'L-\d{4}-\d{4}|A-\d+', text)
print(ids)

['L-2023-0001', 'L-2023-0001', 'L-2023-0001', 'L-2023-0001', 'A-998877']


**Use Named Groups to Cleanly Extract Specific Values**



In [11]:
# Uses a named group to extract a loan ID and access it directly by name
# Useful for precise parsing with meaningful labels

match = re.search(r'LoanID: (?P<id>L-\d{4}-\d{4})', text)
if match:
    print(match.group("id"))

L-2023-0001
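
Named groups get even handier when a pattern has more than one of them. The sketch below uses two hypothetical group names, `label` and `value`, to turn `Label: value` lines into a dictionary. It only handles values that sit on the same line as their label, so an entry like `Spouse Name:` (where the value is on the next line) is skipped. The `sample` text is a trimmed copy of a few lines from the preview.

```python
import re

# A few lines from the file preview, trimmed for readability
sample = """LoanID: L-2023-0001
Application ID:    A-998877
Name: John Smith
Date of Birth:  1990-02-15"""

# label = text before the colon, value = rest of the line without outer spaces
field_pattern = re.compile(
    r"^(?P<label>[A-Za-z ]+):[ \t]*(?P<value>\S.*?)[ \t]*$",
    re.MULTILINE,
)

fields = {m.group("label"): m.group("value") for m in field_pattern.finditer(sample)}
print(fields)
```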


**Extract Multiple Loan IDs Between Labels Using Named Groups**

In [12]:
# Uses a named group to capture the block of text between "LoanID:" and "Name:",
# then pulls every loan ID out of that block
# Useful for parsing document blocks where IDs are grouped and span multiple lines

# Capture the section between "LoanID:" and "Name:" in a group named "block"
loan_section = re.search(r'LoanID:\s*(?P<block>.*?)\s*Name:', text, re.DOTALL)
if loan_section:
    loan_block = loan_section.group("block")

    # Extract all Loan IDs from the loan section
    loan_ids = re.findall(r'L-\d{4}-\d{4}', loan_block)
    print(loan_ids)

['L-2023-0001', 'L-2023-0001', 'L-2023-0001', 'L-2023-0001']


# 3. Use Regex to Find, Replace, or Clean Text (e.g. Fix Formatting or Remove Junk)

**Replace Full Names with [REDACTED] to Keep Data Private**

In [13]:
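# Replaces capitalized full names (First Last, optional middle initial) that follow a 'Name:' label with '[REDACTED]'
# Useful for anonymizing personally identifiable information (PII) in text data
# Note: names that appear elsewhere (for example inside the email address) are left untouched

# Match names like John A. Smith, allowing extra spaces or a line break after the label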
clean_text = re.sub(r'(Name:\s+)([A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)', r'\1[REDACTED]', text)

print(clean_text)

Loan Application Form




LoanID: L-2023-0001
L-2023-0001
L-2023-0001
L-2023-0001
       
Application ID:    A-998877     




Name: [REDACTED]            
Email: john.smith@example.com     
Phone:  (555)   123-4567     


Spouse Name: 
[REDACTED]


Home Address:  123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1




Date of Birth:  1990-02-15    
Application Date: March   25,   2025   
Loan Amount:  $25,000.00    




--- Notes ---
This           application                     was                    submitted    
online via the client portal.
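
Notice that the email address and phone number are still visible in the output, and the email even contains the applicant's name. One possible way to cover them is to chain a few more `re.sub` calls. This is a sketch only: the `[EMAIL]` and `[PHONE]` placeholders are names picked for illustration, and `sample` is a short copy of three lines from the file.

```python
import re

# Three lines copied from the file preview
sample = "Name: John Smith\nEmail: john.smith@example.com\nPhone:  (555)   123-4567"

email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
phone_pattern = r"\(?\d{3}\)?[\s\-]*\d{3}[\s\-]?\d{4}"

# Redact the name first, then the email address, then the phone number
redacted = re.sub(r"(Name:\s+)[A-Z][a-z]+\s+[A-Z][a-z]+", r"\1[REDACTED]", sample)
redacted = re.sub(email_pattern, "[EMAIL]", redacted)
redacted = re.sub(phone_pattern, "[PHONE]", redacted)

print(redacted)
```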


**Clean Up Extra Spaces and Line Breaks for Neater Output**

In [14]:
# Replaces every run of whitespace (spaces, tabs, line breaks) with a single space
# Useful for cleaning messy document text before analysis

clean = re.sub(r'\s+', ' ', text).strip()
print(clean)

Loan Application Form LoanID: L-2023-0001 L-2023-0001 L-2023-0001 L-2023-0001 Application ID: A-998877 Name: John Smith Email: john.smith@example.com Phone: (555) 123-4567 Spouse Name: Jenna Smith Home Address: 123 Maple St., Apt. 5B, Toronto, ON M5J 2N1 Date of Birth: 1990-02-15 Application Date: March 25, 2025 Loan Amount: $25,000.00 --- Notes --- This application was submitted online via the client portal.

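
Squashing everything onto one line is handy for analysis, but sometimes you want to keep one field per line. Here is a sketch of a gentler cleanup that only tidies spaces inside each line and removes the empty lines. The `sample` string is a short copy of two lines from the file, with their original stray spaces and blank lines.

```python
import re

# Two lines from the file, with the original stray spaces and blank lines
sample = "Name: John Smith            \n\n\n\nPhone:  (555)   123-4567     \n"

# 1) Collapse runs of spaces/tabs inside a line into a single space
step1 = re.sub(r"[ \t]+", " ", sample)

# 2) Remove any space left at the end of each line
step2 = re.sub(r" +$", "", step1, flags=re.MULTILINE)

# 3) Collapse runs of line breaks into one, then trim the ends
tidy = re.sub(r"\n{2,}", "\n", step2).strip()

print(tidy)
```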