EXTRACTING SENSITIVE INFORMATION FROM TEXT

Importing all packages and functions we need:

In [1]:
import re
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")  # Use a small model for basic NER tasks


In [2]:
def find_matches(pattern, string, flags=0):
    # Compile the regex pattern
    p = re.compile(pattern, flags=flags)
    # Return the full text of every match. An empty list (rather than None)
    # is returned when nothing matches, so callers can safely concatenate
    # the result with other lists.
    return [m.group() for m in p.finditer(string)]

Detecting sensitive info:

In [30]:
# Define regex patterns
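# phone_pattern = r'\b[689]\d{7}\b'  # Pattern for Singapore phone numbers
# Optional +65 prefix, then 8 digits in two groups of 4. The leading \s is part
# of each match, so matched strings start with a whitespace character.
phone_pattern = r"\s(\+65)?[\s-]?\d{4}[\s-]?\d{4}\b"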
nric_pattern = r'\b[SFTG]\d{7}[A-Z]\b'  # Pattern for Singapore NRIC/FIN
# Placeholder pattern for bank account numbers; adjust as necessary
bank_account_pattern = r'\b\d{10,12}\b'
email_pattern = r"\b[\w.-]+?@\w+?\.\w+?\b"  # Email address pattern
pincode_pattern = r"\b\d{5,6}\b"  # 5- or 6-digit pincode (postal code)
# Simple pattern for credit card numbers and CVVs; adjust as necessary
credit_card_pattern = r"\b(?:\d{4}[\s-]?){3}\d{4,7}\b"
cvv_pattern = r"\b\d{3}\b"

def detect_sensitive_info(text):
    # Initial regex detections
    phone_numbers = find_matches(phone_pattern, text, flags=re.IGNORECASE)
    nric_numbers = re.findall(nric_pattern, text)
    bank_account_numbers = re.findall(bank_account_pattern, text)
    email_addresses = re.findall(email_pattern, text)
    pincodes = re.findall(pincode_pattern, text)
    credit_card_numbers = re.findall(credit_card_pattern, text)
    cvvs = re.findall(cvv_pattern, text)

    regex_findings = set(phone_numbers + nric_numbers + bank_account_numbers + email_addresses +
                         pincodes + credit_card_numbers + cvvs)
    
    # Use spaCy for NER
    doc = nlp(text)
    possible_addresses = []
    named_entities = []
    
    for ent in doc.ents:
        if ent.text in regex_findings:
            continue  # Skip entities whose text exactly matches a regex finding
        
        if ent.label_ in ["LOC", "GPE", "ORG", "FAC"]:
            possible_addresses.append(ent.text)
        else:
            named_entities.append((ent.text, ent.label_))

    # Compile results, including regex findings
    results = {
        "named_entities": named_entities,
        "possible_addresses": possible_addresses,
        "phone_numbers": phone_numbers,
        "nric_numbers": nric_numbers,
        "bank_account_numbers": bank_account_numbers,
        "email_addresses": email_addresses,
        "pincodes": pincodes,
        "credit_card_numbers": credit_card_numbers,
        "cvvs": cvvs,
    }

    return results
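
A possible refinement of the phone pattern (a sketch, not the notebook's code): phone_pattern has no guard against matching inside a longer run of digit groups, such as a card number. The hypothetical strict_phone_pattern below adds lookarounds so that an 8-digit group is only accepted when no further digits sit directly before or after it. The names strict_phone_pattern and find_phone_numbers are assumptions made for illustration.

```python
import re

# Hypothetical stricter variant of phone_pattern (illustration only).
strict_phone_pattern = (
    r"(?<!\d)(?<!\d[\s-])"   # not preceded by a digit, or by a digit plus separator
    r"(?:\+65[\s-]?)?"       # optional country code
    r"\d{4}[\s-]?\d{4}"      # eight digits, optionally split into two groups of four
    r"(?![\s-]?\d)"          # not followed by further digits
)

def find_phone_numbers(text):
    # The pattern has no capturing groups, so findall returns full matches.
    return re.findall(strict_phone_pattern, text)

sample = ("Call +65 1234 5678 or 8765-4321. "
          "Card 1234 5678 9012 3456, account 123456789012.")
print(find_phone_numbers(sample))
```

Unlike phone_pattern, this variant does not include a leading whitespace character in the match, so its results can be compared directly with entity text.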

In [31]:
def print_formatted_info(info):
    print("Detected Sensitive Information:\n")
    
    # Iterate over the results dictionary
    for category, items in info.items():
        # Print the category name
        print(f"{category.replace('_', ' ').title()}:")
        
        if not items:  # Check if the list is empty
            print("  None found.\n")
            continue
        
        # Iterate over items in each category and print
        for item in items:
            # For named entities, 'item' is a tuple (text, label)
            if isinstance(item, tuple):
                print(f"  - {item[0]} ({item[1]})")
            else:  # For regex matches, 'item' is just the matched string
                print(f"  - {item}")
        
        print()  # Add an empty line for spacing

Printing the sensitive info in an organized way:

In [32]:
# text = "John Doe's phone number is +65 81234567 and his NRIC is S1234567A. His bank account number is 123456789012."
text = """Vishwanath Dato's email is ayush.goyal@example.com, and his phone number is +65 6234 5678. His NRIC is S1234567A.
His bank account number is 123456789012. He works at Acme Corp, located at 123 Orchard Road, Singapore."""
text = """Vishwanath Dato recently moved to 123 Baker Street, Singapore. His new pincode is 543210. 
You can reach him at +65 1234 5678 or via email at john.doe@example.com. His bank account number is 123456789012, 
and he just received his new credit card with the number 1234 5678 9012 3456, CVV 789. John's NRIC number is S1234567A. 
He mentioned his old address was near Marina Bay Sands, and he used to live in postal code 098765. 
His old phone number was +65 8765 4321. Jane, his sister, also moved to a new place near Orchard Road. 
Her email is jane.doe@example.net, and her Singapore phone number is +65 9876 5432. 
She's considering opening a new bank account since her old account number, 987654321098, is no longer in use. 
She also mentioned her CVV for a temporary card is 123, while waiting for her new card 9876-5432-1098-7654 to be activated."""
print_formatted_info(detect_sensitive_info(text))

Detected Sensitive Information:

Named Entities:
  - Vishwanath Dato (PERSON)
  - 1234 5678 (DATE)
  - John (PERSON)
  - 8765 4321 (DATE)
  - Jane (PERSON)
  - 9876 5432 (DATE)

Possible Addresses:
  - Singapore
  - NRIC
  - Marina Bay Sands
  - Orchard Road
  - Singapore
  - CVV

Phone Numbers:
  -  +65 1234 5678
  -  1234 5678
  -  9012 3456
  -  +65 8765 4321
  -  +65 9876 5432
  -  9876-5432

Nric Numbers:
  - S1234567A

Bank Account Numbers:
  - 123456789012
  - 987654321098

Email Addresses:
  - john.doe@example.com
  - jane.doe@example.net

Pincodes:
  - 543210
  - 098765

Credit Card Numbers:
  - 1234 5678 9012 3456
  - 9876-5432-1098-7654

Cvvs:
  - 123
  - 789
  - 123



Names: Proper Nouns (Explore POS tagging)
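
One way this idea could be explored (a sketch under assumptions): spaCy tokens expose a coarse part-of-speech tag as token.pos_, and consecutive tokens tagged PROPN can be grouped into candidate names. The helper name proper_noun_runs is hypothetical. The demo uses stand-in token objects so that it runs without a model; which tokens a real model tags as PROPN is not guaranteed.

```python
from types import SimpleNamespace

def proper_noun_runs(tokens):
    """Group consecutive tokens whose pos_ is 'PROPN' into candidate names."""
    runs, current = [], []
    for tok in tokens:
        if tok.pos_ == "PROPN":
            current.append(tok.text)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        runs.append(" ".join(current))
    return runs

# Stand-in tokens with hand-assigned tags (not real model output).
demo_tokens = [
    SimpleNamespace(text=t, pos_=p)
    for t, p in [("Vishwanath", "PROPN"), ("Dato", "PROPN"), ("moved", "VERB"),
                 ("to", "ADP"), ("Singapore", "PROPN")]
]
print(proper_noun_runs(demo_tokens))
```

With the model loaded earlier, the same function could be called as proper_noun_runs(nlp(text)).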

Regex before Entities
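
A minimal sketch of what running the regexes first could look like: every regex match is replaced with a same-length placeholder before the text is handed to the NER model, so digit groups already claimed by a regex are no longer visible to it. The function name mask_matches and the "X" placeholder are assumptions; whether the model ignores the placeholder text is not guaranteed.

```python
import re

def mask_matches(text, patterns):
    """Mask regex matches with 'X' runs of equal length.

    patterns maps a label to a regex string. Patterns are applied in
    dictionary order, so earlier patterns take precedence over later ones.
    Returns the masked text and a dict of label -> list of matched strings.
    """
    findings = {}
    masked = text
    for label, pattern in patterns.items():
        def _mask(m, label=label):
            findings.setdefault(label, []).append(m.group())
            return "X" * len(m.group())  # same length keeps character offsets stable
        masked = re.sub(pattern, _mask, masked)
    return masked, findings

demo_patterns = {
    "credit_card_numbers": r"\b(?:\d{4}[\s-]?){3}\d{4,7}\b",
    "cvvs": r"\b\d{3}\b",
}
demo_text = "Card 1234 5678 9012 3456, CVV 789."
masked_text, found = mask_matches(demo_text, demo_patterns)
print(masked_text)
print(found)
```

The masked text could then be passed to nlp(...) in place of the original. Because the placeholder has the same length as the match, entity character offsets still line up with the original text.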

Credit Card Info - Number, CVV
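
Card number candidates found by a regex can be filtered further with the Luhn checksum, which payment card numbers are designed to satisfy. The sketch below is an optional extra check, not part of the detection function in this notebook; the name luhn_valid is hypothetical.

```python
def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]  # ignore spaces and hyphens
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit altered
```

A check like this reduces false positives from arbitrary 16-digit strings, but it cannot confirm that a number belongs to a real card.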

Address - Pincode (5-6 digit)
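
Short digit patterns such as \b\d{5,6}\b or \b\d{3}\b match any number of that length, including house numbers. One possible way to narrow them (a sketch, with hypothetical names and an assumed gap limit of 30 characters) is to accept a digit group only when a related keyword appears shortly before it.

```python
import re

def digits_after_keyword(text, keywords, n_digits, max_gap=30):
    """Find n_digits-long numbers that follow one of the keywords.

    Up to max_gap non-digit characters may sit between the keyword and
    the number. Matching is case-insensitive.
    """
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = rf"\b(?:{alternatives})\b\D{{0,{max_gap}}}?(\d{{{n_digits}}})\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)

demo = ("His new pincode is 543210, CVV 789. "
        "He lives at 123 Baker Street, postal code 098765.")
print(digits_after_keyword(demo, ["pincode", "postal code"], 6))
print(digits_after_keyword(demo, ["CVV"], 3))
```

In the demo string, the house number 123 is not reported as a CVV because no keyword precedes it.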

Calling the main detection function from a separate Python file (main.py):

In [24]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [29]:
from main import print_formatted_info, detect_sensitive_info

text = """Vishwanath Dato's email is ayush.goyal@example.com, and his phone number is +65 6234 5678. His NRIC is S1234567A. 
His bank account number is 123456789012. He works at Acme Corp, located at 123 Orchard Road, Singapore 576104."""
text = """Vishwanath Dato recently moved to 123 Baker Street, Singapore. His new pincode is 543210. 
You can reach him at +65 1234 5678 or via email at john.doe@example.com. His bank account number is 123456789012, 
and he just received his new credit card with the number 1234 5678 9012 3456, CVV 789. John's NRIC number is S1234567A. 
He mentioned his old address was near Marina Bay Sands, and he used to live in postal code 098765. 
His old phone number was +65 8765 4321. Jane, his sister, also moved to a new place near Orchard Road. 
Her email is jane.doe@example.net, and her Singapore phone number is +65 9876 5432. 
She's considering opening a new bank account since her old account number, 987654321098, is no longer in use. 
She also mentioned her CVV for a temporary card is 123, while waiting for her new card 9876-5432-1098-7654 to be activated."""
print_formatted_info(detect_sensitive_info(text))

Detected Sensitive Information:

Named Entities:
  - Vishwanath Dato (PERSON)
  - 1234 5678 (DATE)
  - John (PERSON)
  - 8765 4321 (DATE)
  - Jane (PERSON)
  - 9876 5432 (DATE)

Possible Addresses:
  - Singapore
  - NRIC
  - Marina Bay Sands
  - Orchard Road
  - Singapore
  - CVV

Phone Numbers:
  -  +65 1234 5678
  -  1234 5678
  -  9012 3456
  -  +65 8765 4321
  -  +65 9876 5432
  -  9876-5432

Nric Numbers:
  - S1234567A

Bank Account Numbers:
  - 123456789012
  - 987654321098

Email Addresses:
  - john.doe@example.com
  - jane.doe@example.net

Pincodes:
  - 543210
  - 098765

Credit Card Numbers:
  - 1234 5678 9012 3456
  - 9876-5432-1098-7654

Cvvs:
  - 123
  - 789
  - 123

