# PII Data Extraction and Scrubbing

## Overview

This example demonstrates how to use OpenAI's chat completions API, through the `instructor` library, to extract and scrub Personally Identifiable Information (PII) from a document. The code defines Pydantic models to hold the extracted PII and a method that replaces each extracted value with a placeholder.

## Imports

In [7]:
from typing import List
from pydantic import BaseModel
from openai import OpenAI
import instructor
from dotenv import load_dotenv

In [8]:
from rich import pretty, print
pretty.install()

In [9]:
# Load environment variables
load_dotenv("../api_keys.env")

True

## Defining the Structures

First, Pydantic models are defined to represent the PII data and the overall structure for PII data extraction.

In [10]:
class Data(BaseModel):
    index: int
    data_type: str
    pii_value: str


class PIIDataExtraction(BaseModel):
    """
    Extracted PII data from a document, all data_types should try to have consistent property names
    """

    private_data: List[Data]

    def scrub_data(self, content: str) -> str:
        """
        Iterates over the private data and replaces the value with a placeholder in the form of
        <{data_type}_{i}>
        """
        for i, data in enumerate(self.private_data):
            content = content.replace(data.pii_value, f"<{data.data_type}_{i}>")
        return content
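
To see what `scrub_data` does without calling the API, the following minimal sketch applies the same replacement loop to hand-written entries. It is an illustration under assumptions: `Entry` is a hypothetical stdlib dataclass standing in for the Pydantic `Data` model, `scrub` is a hypothetical free function mirroring the method, and the sample text and values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    # Hypothetical stand-in for the Pydantic `Data` model defined above
    index: int
    data_type: str
    pii_value: str


def scrub(content: str, private_data: List[Entry]) -> str:
    # Same loop as PIIDataExtraction.scrub_data: the placeholder suffix
    # comes from the list position, not from the `index` field
    for i, data in enumerate(private_data):
        content = content.replace(data.pii_value, f"<{data.data_type}_{i}>")
    return content


# Made-up sample text and entries, for illustration only
sample = "Contact Jane Roe at jane.roe@example.com."
entries = [
    Entry(index=0, data_type="name", pii_value="Jane Roe"),
    Entry(index=1, data_type="email", pii_value="jane.roe@example.com"),
]
scrubbed = scrub(sample, entries)
print(scrubbed)  # Contact <name_0> at <email_1>.
```

Because `str.replace` matches exact, case-sensitive substrings, every occurrence of a value is replaced, while variants of it (a different spelling or only part of the value) are left untouched.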

## Client Initialization

In [11]:
client = instructor.from_openai(OpenAI())

## Extracting PII Data

The client sends the document to the OpenAI API and, because `response_model` is set, returns the result parsed into a `PIIDataExtraction` instance.

In [12]:
EXAMPLE_DOCUMENT = """
# Fake Document with PII for Testing PII Scrubbing Model

## Personal Story

John Doe was born on 01/02/1980. His social security number is 123-45-6789. He has been using the email address john.doe@email.com for years, and he can always be reached at 555-123-4567.

## Residence

John currently resides at 123 Main St, Springfield, IL, 62704. He's been living there for about 5 years now.
"""

pii_data = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=PIIDataExtraction,
    messages=[
        {
            "role": "system",
            "content": "You are a world-class PII scrubbing model. Extract the PII data from the following document.",
        },
        {
            "role": "user",
            "content": EXAMPLE_DOCUMENT,
        },
    ],
)  # type: ignore

print("Extracted PII Data:")
print(pii_data.model_dump_json(indent=2))

## Scrubbing PII Data

After extracting the PII data, the `scrub_data` method is used to sanitize the document.

In [15]:
print("Scrubbed Document:")
print(pii_data.scrub_data(EXAMPLE_DOCUMENT))
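
Since scrubbing relies on exact string replacement, it can be worth checking that none of the extracted values survive in the output. The sketch below is one possible check, not part of the original example: `find_leaks` is a hypothetical helper, and the strings are made up rather than taken from a real model response.

```python
from typing import List


def find_leaks(scrubbed: str, pii_values: List[str]) -> List[str]:
    # Hypothetical helper: return the extracted values that still
    # appear verbatim in the scrubbed text (empty values are skipped)
    return [value for value in pii_values if value and value in scrubbed]


# Made-up values and texts, for illustration only
values = ["John Doe", "123-45-6789"]
clean = "<name_0> has SSN <ssn_1>."
partial = "<name_0> has SSN 123-45-6789."

print(find_leaks(clean, values))    # []
print(find_leaks(partial, values))  # ['123-45-6789']
```

With the notebook's objects, the call would presumably look like `find_leaks(pii_data.scrub_data(EXAMPLE_DOCUMENT), [d.pii_value for d in pii_data.private_data])`. This check only catches verbatim leftovers: if the model extracts "John Doe" as a name, the later sentence beginning "John currently resides" still contains "John", and neither the scrubber nor this check would flag it.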