# Getting Started

Welcome to the DataFog Python SDK! This notebook will walk you through the basics of using the SDK to scan text for PII.

Please consider this a living document: we will keep adding new content and guides to make your experience with DataFog as good as possible.

## Setup

In [None]:
%pip install datafog

In [80]:
import requests
from datafog import PresidioEngine as datafog
import pandas as pd

## Detect PII in files

### Base Case: Download a file and scan for PII

In [81]:
# Download the transcript of the US State of the Union 2023
url = "https://bit.ly/datafog-sample-text-sotu-2023"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
file_contents = response.text

# Split the file contents into lines and remove empty lines
lines = [line.strip() for line in file_contents.split("\n") if line.strip()]

# Join the lines back into a single string
cleaned_text = "\n".join(lines)

# Create a DataFrame with the cleaned text
df = pd.DataFrame({"text": [cleaned_text]})

# Print the DataFrame
print(df)

# Scan each row's text for PII and store the results in a new column
def scan_text(text):
    return datafog.scan(text)

df['scan_results'] = df['text'].apply(scan_text)


                                                text
0  Mr. Speaker, Madam Vice President, our First L...


### Flag custom terms using the deny list


In [82]:
ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

scan_results1 = datafog.scan(ceo_email_chunk)
print("PII Detected - base case:", scan_results1)

scan_results2 = datafog.scan(ceo_email_chunk, deny_list=['CTO'])
print("PII Detected with deny list:", scan_results2)

PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85]
PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85]
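
The scan output above reports character offsets (`start`, `end`) and an entity type instead of returning redacted text. As a minimal sketch of how those offsets could be used to mask the matches, the helper below replaces each span with a placeholder. `redact_spans` is a hypothetical helper and not part of the DataFog SDK; the spans are copied by hand from the printed output, and the sketch assumes the spans do not overlap.

```python
def redact_spans(text, spans):
    """Replace each (start, end, label) span in text with a <label> placeholder.

    Hypothetical helper for illustration; assumes spans do not overlap.
    """
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        redacted = redacted[:start] + f"<{label}>" + redacted[end:]
    return redacted


ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

# Offsets copied by hand from the printed scan output above
spans = [(50, 53, "CUSTOM_PII"), (30, 34, "PERSON")]

redacted_text = redact_spans(ceo_email_chunk, spans)
print(redacted_text)
```

Replacing from the end of the string backwards keeps the remaining offsets correct even though each placeholder has a different length than the text it replaces.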
