## Extracting Text from PDF and Configuring PII Redactor


**Author**: Pooja Holkar,
**Email**: poholkar@in.ibm.com




### What is a PII Redactor?

A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:

- Names
- Email addresses
- Phone numbers
- Addresses
- Financial details (e.g., credit card numbers)

### Overview of the use case
In this use case, the PII Redactor is applied to text extracted from invoices so that sensitive customer information is not exposed during processing, sharing, or storage.

 **Workflow Overview**

The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.

 **Redactor Configuration**

The system is configured to recognize specific PII entities relevant to invoices, such as:

- Customer names
- Email addresses
- Phone numbers
- Shipping addresses

 **PII Detection and Redaction**

The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.

 **Output**

The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.
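
The detect-and-replace step described above can be illustrated with a minimal, regex-only sketch. This is not the implementation used by `PIIRedactorTransform`. The patterns, the `redact` helper, and the sample string are assumptions made up for illustration, and a production detector would be considerably more robust.

```python
import re

# Deliberately simple patterns, for illustration only (assumed, not from the transform).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each match with a <ENTITY_TYPE> placeholder.

    Returns the redacted text and the list of entity types found,
    ordered by their position in the original text.
    """
    matches = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            matches.append((match.start(), match.end(), entity_type))
    matches.sort()

    redacted = text
    # Replace from the end of the string so earlier offsets stay valid.
    for start, end, entity_type in reversed(matches):
        redacted = redacted[:start] + f"<{entity_type}>" + redacted[end:]
    return redacted, [entity_type for _, _, entity_type in matches]


# Made-up sample line, not taken from the invoice.
sample = "Email: jane.doe@example.com, Phone: 555-123-4567"
sample_redacted, sample_entities = redact(sample)
print(sample_redacted)
print(sample_entities)
```

Returning the entity types alongside the redacted text mirrors the audit summary mentioned above: the placeholders show where something was removed, and the list shows what kind of data it was.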

### Why is PII Redaction Important?

 **Data Privacy Compliance**: Supports compliance with regulations such as GDPR, HIPAA, and CCPA, which mandate safeguarding customer information.

 **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.

 **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.


### Prerequisites: Install data-prep-kit dependencies

In [1]:
# !pip install transforms
# !pip install pdfplumber
# !pip install flair
# !pip install spacy
# !pip install presidio_anonymizer==2.2.355
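
Before running the rest of the notebook, it can help to confirm that the dependencies are importable. The sketch below uses only the standard library. The module names are assumed to match the packages installed above and the import used in the next cell, and `missing_modules` is a hypothetical helper, not part of data-prep-kit.

```python
import importlib.util

# Import names assumed from the install commands and imports in this notebook.
REQUIRED_MODULES = [
    "pdfplumber",
    "flair",
    "spacy",
    "presidio_anonymizer",
    "pii_redactor_transform",
]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

`find_spec` only looks the module up, so this check is fast and does not load the large NLP libraries.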

In [2]:
import pdfplumber
from pii_redactor_transform import PIIRedactorTransform


### Step 1: Inspect the Data 

We will use a simple invoice PDF:

[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)

In [4]:
!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'

UsageError: Line magic function `%!wget` not found.
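
The `UsageError` above indicates that the shell command did not run in this environment. A portable alternative, sketched here with only the Python standard library, is to download the file directly. `download_file` is a hypothetical helper introduced for illustration, and the URL is the one given above.

```python
import urllib.request
from pathlib import Path

INVOICE_URL = (
    "https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/"
    "refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf"
)


def download_file(url, dest):
    """Download `url` to the local path `dest` and return that path."""
    dest = Path(dest)
    urllib.request.urlretrieve(url, str(dest))
    return dest


# Uncomment to fetch the invoice into the working directory:
# download_file(INVOICE_URL, "Invoice.pdf")
```

After the download, `Invoice.pdf` sits next to the notebook, which is where the `pdf_path` variable in the next cell expects it.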


In [3]:
pdf_path="Invoice.pdf"

### Step 2: Extract Text from PDF

This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string.

In [13]:
with pdfplumber.open(pdf_path) as pdf:
    text = "\n".join(page.extract_text() for page in pdf.pages)
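
Depending on the pdfplumber version, `extract_text()` may return `None` or an empty string for a page without a text layer (for example, a scanned image), and `None` would make `"\n".join(...)` fail. A defensive variant is sketched below. `join_page_texts` and `FakePage` are hypothetical names used only for this illustration, and `FakePage` stands in for a real pdfplumber page so the snippet runs without a PDF.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page, skipping pages without text."""
    texts = []
    for page in pages:
        page_text = page.extract_text()  # may be None or "" for image-only pages
        if page_text:
            texts.append(page_text)
    return separator.join(texts)


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage("INVOICE"), FakePage(None), FakePage("Total: $10")]
print(join_page_texts(demo_pages))
```

With a real document, the same helper could be called as `join_page_texts(pdf.pages)` inside the `with pdfplumber.open(pdf_path) as pdf:` block.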



### Step 3: Configure the PII Redactor



This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text.

In [14]:

config = {
    "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"],
    "operator": "replace",
    "transformed_contents": "redacted_contents",
    "score_threshold": 0.6
}
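
To make the roles of `entities` and `score_threshold` concrete, the sketch below filters a list of made-up detections. It assumes that a detection is kept when its type is in `entities` and its confidence is at least `score_threshold`. The `Detection` record and `filter_detections` helper are hypothetical and only illustrate the idea; they are not the transform's internal code.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detection record: entity type, character span, confidence."""

    entity_type: str
    start: int
    end: int
    score: float


def filter_detections(detections, entities, score_threshold):
    """Keep detections of an allowed type whose score meets the threshold."""
    allowed = set(entities)
    return [
        d for d in detections
        if d.entity_type in allowed and d.score >= score_threshold
    ]


example_config = {
    "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"],
    "score_threshold": 0.6,
}

# Made-up detections for illustration.
candidates = [
    Detection("PERSON", 15, 23, 0.85),       # kept: allowed type, score above threshold
    Detection("PHONE_NUMBER", 40, 52, 0.40),  # dropped: score below threshold
    Detection("CREDIT_CARD", 60, 76, 0.90),   # dropped: type not in "entities"
]

kept = filter_detections(
    candidates, example_config["entities"], example_config["score_threshold"]
)
print([d.entity_type for d in kept])
```

Raising the threshold reduces false positives at the cost of missing low-confidence matches; lowering it does the opposite.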

### Step 4: Initialize and Run the PII Redactor


This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text.

In [15]:

redactor = PIIRedactorTransform(config)


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0


17:45:46 INFO - Loading model from flair/ner-english-large


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
2024-11-25 17:46:04,004 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


### Step 5: Apply the Redactor to Text Data


This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities.

In [16]:

redacted_text, detected_entities = redactor._redact_pii(text)



### Step 6: Display the Redaction Results


This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.


In [17]:
# Step 6: Print the results
print("Redacted Text:\n", redacted_text)
print("Detected Entities:\n", detected_entities)

Redacted Text:
 INVOICE
Apple Inc.
Invoice Details:
Invoice Number: INV-2024-001
Invoice Date: November 15, 2024
Due Date: November 30, 2024
Billing Information:
Customer Name: <PERSON>
Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704
Email: <EMAIL_ADDRESS>
Phone: <PHONE_NUMBER>
Shipping Information:
Recipient Name: <PERSON>
Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704
Item Details:
Description Quantity Unit Price Total
MacBook Air (13-inch, M2) 1 $999.00 $999.00
AppleCare+ for MacBook Air 1 $199.00 $199.00
Subtotal: $1,198.00
Tax (8%): $95.84
Total Amount Due: $1,293.84
Payment Method: Credit Card (Visa)
Transaction ID: 9876543210ABCDE
Notes:
Thank you for your purchase!
For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.
Detected Entities:
 ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']
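
For auditing, the flat entity list is easier to read as counts per type. The sketch below uses `collections.Counter` on the list printed above; the list is copied from that output so the snippet runs on its own.

```python
from collections import Counter

# Entity list copied from the "Detected Entities" output above.
detected_entities = [
    "PERSON", "LOCATION", "LOCATION", "LOCATION", "EMAIL_ADDRESS",
    "PERSON", "LOCATION", "LOCATION", "LOCATION", "EMAIL_ADDRESS",
    "PHONE_NUMBER",
]

entity_counts = Counter(detected_entities)
for entity_type, count in entity_counts.most_common():
    print(f"{entity_type}: {count}")
```

Comparing these counts with the placeholders in the redacted text is a quick consistency check: each detected entity should correspond to one placeholder.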


<br>
<br>

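### Summary

This notebook demonstrated how to extract text from a PDF invoice with pdfplumber and replace the detected PERSON, LOCATION, EMAIL_ADDRESS, and PHONE_NUMBER entities with placeholders.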