<a href="https://colab.research.google.com/github/LashawnFofung/Python-Document-Preparation-and-Extraction/blob/main/Task_Regular_Expressions_(Regex)_Documentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regular Expressions (Regex) Documentation

Data: *Loan Application Form.txt*

Source: Google Drive

 https://drive.google.com/uc?id=1VqSELIOm0tRBkEHg_0WhIga3wlVFoRPz


## Regex
Regex is short for regular expressions.

It helps you search through messy, unstructured text to find exactly what you're looking for, even if the formatting is inconsistent or the layout is chaotic.

🔍 Key Use Cases

Regex is essential when dealing with real-world data like:
* Log files: Extracting timestamps or error codes.

* Websites/HTML: Pulling out links, image tags, or specific content.

* Documents (PDFs, Emails): Isolating specific data like phone numbers, dates, email addresses, or IDs, regardless of where they appear or how they are formatted.

**Regex Reference: Common Symbols and Patterns**

Here's a quick reference you can use while building your own regex patterns, and to help you understand the different ways regex is applied in the examples below.


| **Symbol** | **Meaning**                             | **Example**                           |
|------------|-----------------------------------------|---------------------------------------|
| `.`        | Any character (except newline)          | `a.b` → matches `acb`, `a*b`          |
| `\d`       | Any digit (0–9)                         | `\d` → `2` <br> <br>`\d+` → `2025`         |
| `\w`       | Word character (letter, digit, `_`)     | `\w` → `L` <br> <br>`\w+` → `LoanID`       |
| `\s`       | Whitespace (space, tab, newline)        | `\s` → `" "` <br> <br>`\s+` → `"   "`      |
| `+`        | One or more                             | `\d+` → `123`                         |
| `*`        | Zero or more                            | `a*` → `""`, `aaa`                    |
| `?`        | Zero or one (optional)                  | `colou?r` → `color`, `colour`         |
| `{n}`      | Exactly *n*                             | `\d{4}` → `2025`                      |
| `[abc]`    | Character set (a or b or c)             | `[aeiou]` → any vowel                 |
| `^`        | Start of string (each line with `re.MULTILINE`) | `^Name:`                      |
| `$`        | End of string (each line with `re.MULTILINE`)   | `City$`                       |
| `( … )`    | Capturing group                         | `(\d+)` → extract digits              |

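The symbols in the table can be tried out directly with Python's `re` module. The snippet below is a minimal sketch using short made-up strings (not the loan file), including how `^` behaves with and without `re.MULTILINE`.

```python
import re

# Each call exercises one or two symbols from the reference table.
dot_matches = re.findall(r"a.b", "acb a*b ab")         # '.' needs exactly one character
year_matches = re.findall(r"\d{4}", "Filed in 2025")   # '{n}' repeats the previous token n times
spelling = re.findall(r"colou?r", "color colour")      # '?' makes the 'u' optional
vowels = re.findall(r"[aeiou]", "Loan")                # character set: any single vowel

# By default '^' anchors only at the start of the whole string;
# re.MULTILINE makes it anchor at the start of every line.
two_lines = "Name: John\nCity"
first_only = re.findall(r"^\w+", two_lines)
every_line = re.findall(r"^\w+", two_lines, re.MULTILINE)

print(dot_matches)    # ['acb', 'a*b']
print(year_matches)   # ['2025']
print(spelling)       # ['color', 'colour']
print(vowels)         # ['o', 'a']
print(first_only)     # ['Name']
print(every_line)     # ['Name', 'City']
```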
           


# 1. Setup: Downloading the Sample File and Importing the Regex Module

In [2]:
# This cell prepares the Colab environment by installing 'gdown' to handle Google Drive downloads.
# It then uses the file ID to download the data file from Google Drive and saves it as
# 'loan_application.txt'. Finally, it reads the entire file into the 'text' variable and
# prints the first 500 characters to confirm the download and preview the data.



! pip install gdown

import gdown

# Step 1: Correct download URL using the file ID
file_id = "1VqSELIOm0tRBkEHg_0WhIga3wlVFoRPz"
url = f"https://drive.google.com/uc?id={file_id}"

# Step 2: Download the actual .txt file
gdown.download(url, "loan_application.txt", quiet=False)


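# Step 3: Read the content from the downloaded file
# ("utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if one is present)
with open("loan_application.txt", "r", encoding="utf-8-sig") as file:
    text = file.read()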

print("Preview of file content:")
print(text[:500])  # Print the first 500 characters to check



Downloading...
From: https://drive.google.com/uc?id=1VqSELIOm0tRBkEHg_0WhIga3wlVFoRPz
To: /content/loan_application.txt
100%|██████████| 587/587 [00:00<00:00, 434kB/s]

Preview of file content:
Loan Application Form




LoanID: L-2023-0001
L-2023-0001
L-2023-0001
L-2023-0001
       
Application ID:    A-998877     




Name: John Smith            
Email: john.smith@example.com     
Phone:  (555)   123-4567     


Spouse Name: 
Jenna Smith


Home Address:  123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1




Date of Birth:  1990-02-15    
Application Date: March   25,   2025   
Loan Amount:  $25,000.00    




--- Notes ---
This           application                     was                





In [3]:
# Import Python's built-in regular expression (Regex) module

import re
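
The examples in this notebook rely on three functions from the `re` module: `re.findall`, `re.search`, and `re.sub`. As a quick orientation, here is a small sketch of what each one returns, using a one-line sample string rather than the full file.

```python
import re

sample = "LoanID: L-2023-0001"

# findall returns a list of every non-overlapping match (empty list if none).
all_numbers = re.findall(r"\d+", sample)

# search returns a Match object for the first match, or None if nothing matches.
first = re.search(r"\d{4}", sample)
missing = re.search(r"Name:", sample)

# sub returns a new string with every match replaced; the input is not modified.
masked = re.sub(r"\d", "#", sample)

print(all_numbers)      # ['2023', '0001']
print(first.group())    # 2023
print(missing)          # None
print(masked)           # LoanID: L-####-####
```

Because `re.search` can return `None`, it is worth checking the result before calling `.group()`, as the named-group cells later in the notebook do.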

# **Extract Names, IDs, Addresses or Dates from a Text File with Regex**


**Find and Extract Dates in YYYY-MM-DD Format Using Regex**

In [4]:
# Extract dates in the format YYYY-MM-DD from a given text
# Returns all matching date strings as a list

dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)

['1990-02-15']
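
The pattern above only finds the date of birth; the application date in the file is written as `March   25,   2025` and is skipped. One possible way to pick up that style as well is sketched below. The variable names and the conversion to ISO format are illustrative additions, and `%B` assumes an English locale for month names.

```python
import re
from datetime import datetime

# Two lines copied from the file preview, including the irregular spacing.
sample = "Date of Birth:  1990-02-15    \nApplication Date: March   25,   2025   \n"

# Capitalized word, day, comma, four-digit year, with any amount of whitespace between parts.
candidates = re.findall(r"([A-Z][a-z]+)\s+(\d{1,2}),\s+(\d{4})", sample)

iso_dates = []
for month, day, year in candidates:
    try:
        parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
    except ValueError:
        continue  # the capitalized word was not a month name, or the day was invalid
    iso_dates.append(parsed.strftime("%Y-%m-%d"))

print(iso_dates)  # ['2025-03-25']
```

Passing the captured pieces through `datetime.strptime` also filters out false positives, since the regex alone would accept any capitalized word in the month position.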


**Extract Dollar Amounts Like $25,000.00**

In [5]:
# Extract dollar amounts from a given text, including commas and optional decimals
# Matches values like $350,000 or $1,500.00

amounts = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?', text)
print(amounts)

['$25,000.00']
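
The matches come back as strings, so they usually need converting before any arithmetic. A minimal sketch, assuming `Decimal` is the desired numeric type (a reasonable default for money, but a choice made here rather than something the notebook prescribes):

```python
import re
from decimal import Decimal

sample = "Loan Amount:  $25,000.00    "
pattern = r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"

amounts = re.findall(pattern, sample)

# Strip the currency symbol and thousands separators, then convert.
values = [Decimal(a.replace("$", "").replace(",", "")) for a in amounts]

print(values)  # [Decimal('25000.00')]
```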


**Match Capitalized Full Names That Span Across Multiple Lines**

In [6]:
# Extract capitalized full names (First Last format) from a given text
# including cases where the names appear on a new line after a label

name_lines = re.findall(r'Name:\s+([A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)', text)
print(name_lines)

['John Smith', 'Jenna Smith']
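
The list above does not say which name belongs to the applicant and which to the spouse. One way to keep that link is to capture the label too, sketched here with named groups and `re.finditer`. The group names `label` and `name` are hypothetical, and the optional `Spouse ` prefix is the only label variant assumed.

```python
import re

# Lines copied from the file preview; the spouse's name sits on the line after its label.
sample = (
    "Name: John Smith            \n"
    "Email: john.smith@example.com     \n"
    "Spouse Name: \n"
    "Jenna Smith\n"
)

pattern = re.compile(
    r"(?P<label>(?:Spouse )?Name):\s+"                       # 'Name' or 'Spouse Name'
    r"(?P<name>[A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)"    # First [M.] Last
)

# \s+ also matches newlines, which is why the spouse's name is still found.
names_by_label = {m.group("label"): m.group("name") for m in pattern.finditer(sample)}

print(names_by_label)  # {'Name': 'John Smith', 'Spouse Name': 'Jenna Smith'}
```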


**Extract Basic Address Format**

In [7]:
# Extracts Ontario mailing addresses from text, including optional apartment/unit info
# (the pattern hard-codes the province 'ON' and the Canadian postal code format A1A 1A1)
# Matches formats like '123 Maple St., Apt. 5B, Toronto, ON M5J 2N1' or '123 Maple St., Toronto, ON M5J 2N1'

address = re.findall(r'\d{1,5} [A-Za-z. ]+,(?: Apt\.? \w+,)? [A-Za-z ]+, ON\s+[A-Z]\d[A-Z] \d[A-Z]\d', text)
print(address)

['123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1']
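
If the individual parts of the address are needed rather than the whole string, the same pattern can be rewritten with named groups. The sketch below uses `re.VERBOSE` so each piece can carry a comment; the group names (`street`, `unit`, `city`, `province`, `postal`) are hypothetical labels chosen for this example.

```python
import re

sample = "Home Address:  123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1"

# In VERBOSE mode unescaped whitespace is ignored, so literal spaces are written as '\ '.
address_pattern = re.compile(
    r"""
    (?P<street>\d{1,5}\ [A-Za-z.\ ]+),      # street number and name
    (?:\ (?P<unit>Apt\.?\ \w+),)?           # optional apartment or unit
    \ (?P<city>[A-Za-z\ ]+),                # city
    \ (?P<province>ON)\s+                   # province, Ontario only as in the cell above
    (?P<postal>[A-Z]\d[A-Z]\ \d[A-Z]\d)     # Canadian postal code
    """,
    re.VERBOSE,
)

match = address_pattern.search(sample)
parts = match.groupdict() if match else {}

print(parts)
# {'street': '123 Maple St.', 'unit': 'Apt. 5B', 'city': 'Toronto', 'province': 'ON', 'postal': 'M5J 2N1'}
```

When the apartment segment is absent, `unit` is `None` rather than missing from the dictionary.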


**Extract Loan IDs Like L-2023-0001 from Text**

In [8]:
# Extracts all Loan IDs in the format 'L-YYYY-NNNN' from the text
# Useful for identifying or indexing loan records in documents

loan_ids = re.findall(r'L-\d{4}-\d{4}', text)
print(loan_ids)

['L-2023-0001', 'L-2023-0001', 'L-2023-0001', 'L-2023-0001']
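
The same ID appears four times because the file repeats it. If only distinct IDs are wanted, the list can be de-duplicated while keeping first-seen order. The sketch below also adds `\b` word boundaries so the pattern does not match inside a longer token; the `Ref:` line is a made-up decoy added for illustration and is not in the loan file.

```python
import re

sample = (
    "LoanID: L-2023-0001\n"
    "L-2023-0001\n"
    "L-2023-0001\n"
    "L-2023-0001\n"
    "Ref: XL-2023-00012\n"   # hypothetical decoy, not part of the original file
)

# \b requires a word boundary on both sides, so 'XL-2023-00012' is not matched.
loan_ids = re.findall(r"\bL-\d{4}-\d{4}\b", sample)

# dict.fromkeys keeps the first occurrence of each key in insertion order.
unique_ids = list(dict.fromkeys(loan_ids))

print(len(loan_ids))  # 4
print(unique_ids)     # ['L-2023-0001']
```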


**Identify and Extract Email Addresses from Text**

In [9]:
# Extracts email addresses from a given text
# Useful for capturing applicant or agent contact information

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)

['john.smith@example.com']


**Extract Phone Numbers in Common Formats (With or Without Country Code)**

In [10]:
# Extracts phone numbers in a 3-3-4 digit layout, with an optional 1-3 digit country code (e.g. +1)
# Allows parentheses around the area code, any run of spaces or hyphens after it,
# and one optional separator before the last four digits

phones = re.findall(r'(?:\+?\d{1,3}[\s\-]?)?(?:\(?\d{3}\)?[\s\-]*)\d{3}[\s\-]?\d{4}', text)
print(phones)

['(555)   123-4567']
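
The extracted number keeps the irregular spacing from the file. A possible follow-up step is to normalize each match to a single layout. The helper `normalize_phone` below is hypothetical and assumes the last ten digits are the national number, with anything before them treated as a country code.

```python
import re

PHONE = r"(?:\+?\d{1,3}[\s\-]?)?(?:\(?\d{3}\)?[\s\-]*)\d{3}[\s\-]?\d{4}"


def normalize_phone(raw: str) -> str:
    """Reduce a matched phone number to digits and re-format it as 555-123-4567."""
    digits = re.sub(r"\D", "", raw)                 # drop everything that is not a digit
    country, local = digits[:-10], digits[-10:]     # assumption: last 10 digits are the number
    formatted = f"{local[:3]}-{local[3:6]}-{local[6:]}"
    return f"+{country} {formatted}" if country else formatted


sample = "Phone:  (555)   123-4567     "
normalized = [normalize_phone(p) for p in re.findall(PHONE, sample)]

print(normalized)  # ['555-123-4567']
```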


**Use Named Groups to Cleanly Extract Specific Values**

In [11]:
# Uses a named group to extract a loan ID and access it directly by name
# Useful for precise parsing with meaningful labels

match = re.search(r'LoanID: (?P<id>L-\d{4}-\d{4})', text)
if match:
    print(match.group("id"))

L-2023-0001
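
Named groups can also be nested to pull several fields out of one match. The sketch below splits the loan ID into its year and sequence number; the group names `year` and `seq` are illustrative and assume the `L-YYYY-NNNN` layout described earlier.

```python
import re

sample = "LoanID: L-2023-0001\n"

match = re.search(r"LoanID:\s*(?P<id>L-(?P<year>\d{4})-(?P<seq>\d{4}))", sample)

# groupdict() returns every named group as a dictionary keyed by group name.
fields = match.groupdict() if match else {}

print(fields)  # {'id': 'L-2023-0001', 'year': '2023', 'seq': '0001'}
```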


**Extract Multiple Loan IDs Between Labels Using Named Groups**

In [12]:
# Uses a named group to capture the block of text between two labels, then extracts every loan ID in it
# Useful for parsing document blocks where IDs are grouped and span multiple lines

# Capture the section between "LoanID:" and "Name:" (re.DOTALL lets '.' match newlines)
loan_section = re.search(r'LoanID:\s*(?P<block>.*?)\s*Name:', text, re.DOTALL)
if loan_section:
    loan_block = loan_section.group("block")

    # Extract all Loan IDs from the loan section
    loan_ids = re.findall(r'L-\d{4}-\d{4}', loan_block)
    print(loan_ids)

['L-2023-0001', 'L-2023-0001', 'L-2023-0001', 'L-2023-0001']


# Use Regex to Find, Replace, or Clean Text (e.g. Fix Formatting or Remove Junk)

**Replace Full Names with [REDACTED] to Keep Data Private**

In [13]:
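# Replaces capitalized full names (First Last format) that follow a 'Name:' label with '[REDACTED]'
# Useful for anonymizing personally identifiable information (PII) in text data;
# note that the email address and phone number are left untouched by this pattern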

# Replace names like John A. Smith (allowing extra spaces)
clean_text = re.sub(r'(Name:\s+)([A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)', r'\1[REDACTED]', text)

print(clean_text)

Loan Application Form




LoanID: L-2023-0001
L-2023-0001
L-2023-0001
L-2023-0001
       
Application ID:    A-998877     




Name: [REDACTED]            
Email: john.smith@example.com     
Phone:  (555)   123-4567     


Spouse Name: 
[REDACTED]


Home Address:  123 Maple St., Apt. 5B, Toronto, ON  M5J 2N1




Date of Birth:  1990-02-15    
Application Date: March   25,   2025   
Loan Amount:  $25,000.00    




--- Notes ---
This           application                     was                    submitted    
online via the client portal.
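
In the output above the name is hidden, but the email address still contains it and the phone number is intact. A fuller redaction pass might chain the email and phone patterns from the earlier cells, as in this sketch. It is an illustration on three lines from the file, not a complete PII scrubber.

```python
import re

sample = (
    "Name: John Smith            \n"
    "Email: john.smith@example.com     \n"
    "Phone:  (555)   123-4567     \n"
)

# Patterns reused from the extraction cells earlier in the notebook.
NAME = r"(Name:\s+)([A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)"
EMAIL = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
PHONE = r"(?:\+?\d{1,3}[\s\-]?)?(?:\(?\d{3}\)?[\s\-]*)\d{3}[\s\-]?\d{4}"

redacted = re.sub(NAME, r"\1[REDACTED]", sample)    # keep the label, replace the name
redacted = re.sub(EMAIL, "[REDACTED]", redacted)
redacted = re.sub(PHONE, "[REDACTED]", redacted)

print(redacted)
```

Each label and its trailing whitespace are preserved, so the layout of the form stays recognizable after redaction.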


**Clean Up Extra Spaces and Line Breaks for Neater Output**

In [14]:
# Replaces multiple spaces and line breaks with a single space
# Useful for cleaning messy document text before analysis

clean = re.sub(r'\s+', ' ', text).strip()
print(clean)

Loan Application Form LoanID: L-2023-0001 L-2023-0001 L-2023-0001 L-2023-0001 Application ID: A-998877 Name: John Smith Email: john.smith@example.com Phone: (555) 123-4567 Spouse Name: Jenna Smith Home Address: 123 Maple St., Apt. 5B, Toronto, ON M5J 2N1 Date of Birth: 1990-02-15 Application Date: March 25, 2025 Loan Amount: $25,000.00 --- Notes --- This application was submitted online via the client portal.
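
Collapsing everything onto one line is compact, but it also removes the line breaks that separate the fields. When the line structure matters, an alternative is to tidy each line separately and drop the empty ones. A minimal sketch on a few lines from the file:

```python
import re

sample = (
    "Name: John Smith            \n"
    "Phone:  (555)   123-4567     \n"
    "\n"
    "\n"
    "Spouse Name: \n"
    "Jenna Smith\n"
)

cleaned = []
for line in sample.splitlines():
    line = re.sub(r"[ \t]+", " ", line).strip()   # collapse spaces and tabs only
    if line:                                      # skip lines that are now empty
        cleaned.append(line)

clean_lines = "\n".join(cleaned)
print(clean_lines)
# Name: John Smith
# Phone: (555) 123-4567
# Spouse Name:
# Jenna Smith
```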
