# Skyflow Deidentify String UDF

This notebook will show you how to install a function for deidentifying strings and tokenizing PII in unstructured data using a Skyflow Vault.

## Step 0: Set up secrets in Databricks (optional)

The example cells in this notebook read the Skyflow API key from a Databricks secret, so to run them as written you must first configure that secret. Alternatively, for testing, you can skip this step and pass the key directly as an argument when calling the function.

#### Install the CLI

If you use Homebrew on a Mac, the commands below will complete the installation.

```sh
brew tap databricks/tap
brew install databricks
```

For more detailed instructions for any development environment, see the official documentation: [Databricks | Install or update the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/install)

#### Configure the CLI

To get started and create a configuration profile on your machine, run `databricks configure`.

You should be prompted for `Databricks Host` and a `Personal Access Token`. 

To get a Personal Access Token (PAT) for development, log in to the Databricks UI, open Settings, click Developer, then Access Tokens.

For more information on authenticating the Databricks CLI see the official documentation: [Databricks | Authentication for the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/authentication)

#### Configure a secret scope in Databricks

Now that you've configured and authenticated the Databricks CLI, run the following command to create a 'scope' for your secrets in Databricks: 

`databricks secrets create-scope <scope-name>`

For the rest of this demo we'll use the scope `sky-agentic-demo`.

`databricks secrets create-scope sky-agentic-demo`

#### Get details from Skyflow

- Create or log into your account at [skyflow.com](https://skyflow.com) and generate an API key: [docs.skyflow.com](https://docs.skyflow.com/api-authentication/)
- Copy your API key, Vault URL, and Vault ID


#### Store the secrets in Databricks

Create your secrets using the JSON syntax:

```sh
databricks secrets put-secret --json '{
  "scope": "sky-agentic-demo",
  "key": "sky_api_key",
  "string_value": "--sky_api_key--"
}'
```

To confirm the secrets have been uploaded successfully, run `databricks secrets list-secrets sky-agentic-demo` to see a list of the keys you provided and an updated timestamp.

Example:

```sh
Key            Last Updated Timestamp
sky_api_key    1739998630197
```

Then to read a secret in a Notebook, use `dbutils.secrets`:

`sky_api_key = dbutils.secrets.get(scope = "sky-agentic-demo", key = "sky_api_key")`

To learn more about Secrets in Databricks, see the official documentation: [Secret Management | Databricks](https://docs.databricks.com/aws/en/security/secrets)


## Step 1: Install the function

Before you install, set `vault_id` and `vault_url` in the function body below. These values are hardcoded in the function; you can instead modify it to accept them as parameters from the caller or to read them from Databricks environment variables.
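
One way to avoid hardcoding is to isolate request construction in a small helper that takes the vault settings as arguments. The sketch below is illustrative only: `build_deidentify_request` is a hypothetical name, not part of Skyflow or Databricks, and the path, headers, and body fields are simply copied from the function in this notebook.

```python
def build_deidentify_request(input_text, api_key, vault_id, vault_url):
    """Return (url, headers, body) for a deidentify-string call.

    Hypothetical helper: the path and field names mirror the UDF in this
    notebook. Nothing here contacts the network.
    """
    # Tolerate a trailing slash on the vault URL before appending the path.
    api_url = vault_url.rstrip("/") + "/v1/detect/deidentify/string"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"vault_id": vault_id, "text": input_text}
    return api_url, headers, body


url, headers, body = build_deidentify_request(
    "Hi my name is Joseph McCarron",
    api_key="example-key",        # placeholder, not a real credential
    vault_id="SKYFLOW_VAULT_ID",  # placeholder value used in this notebook
    vault_url="https://sample.vault.skyflowapis.com/",
)
print(url)  # https://sample.vault.skyflowapis.com/v1/detect/deidentify/string
```

Keeping request construction separate from the HTTP call makes it possible to check the URL and payload without a vault or network access.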



In [0]:
%sql
CREATE OR REPLACE FUNCTION
agentic.default.deidentify_string (
 input_text STRING COMMENT 'The string to be de-identified.',
 sky_api_key STRING COMMENT 'The API key for the Skyflow API.'
)
RETURNS STRING
LANGUAGE PYTHON
DETERMINISTIC
COMMENT 'Deidentify a string using the Skyflow API. Removes any sensitive data from the string and returns a safe string with placeholders in place of sensitive data tokens.'
AS $$
 import os
 import sys
 import json
 import requests
 from io import StringIO
 
 sys_stdout = sys.stdout
 redirected_output = StringIO()
 sys.stdout = redirected_output

 if sky_api_key is None or sky_api_key == '':
     # try to fetch the API key from env variables
     bearer_token = os.environ.get("SKY_API_KEY")
 else:
     bearer_token = sky_api_key

 # SET YOUR VAULT ID
 vault_id = "SKYFLOW_VAULT_ID"
 # SET YOUR VAULT URL
 vault_url = "https://sample.vault.skyflowapis.com"
 # END
 api_path = "/v1/detect/deidentify/string"
 api_url = vault_url + api_path
 headers = {
     "Authorization": f"Bearer {bearer_token}",
     "Content-Type": "application/json",
 }
 json_body = {
     "vault_id": vault_id,
     "text": input_text,
 }

 try:
     api_response = requests.post(api_url, headers=headers, json=json_body)
     api_response.raise_for_status()
     external_data = api_response.json()
     result = external_data.get('processed_text', 'No processed_text found')
 except requests.exceptions.RequestException as e:
     result = f"Error calling external API: {str(e)}"

 sys.stdout = sys_stdout
 return result
$$
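
Before running the function in Databricks, its response handling can be exercised locally with a stand-in response object. The following is a minimal sketch, not the installed function itself: `FakeResponse` and `extract_processed_text` are hypothetical names, and the broad `except Exception` replaces the `requests.exceptions.RequestException` handler so that the example has no third-party dependencies.

```python
class FakeResponse:
    """Stand-in for an HTTP response object (hypothetical, for local tests)."""

    def __init__(self, payload, error=None):
        self._payload = payload
        self._error = error

    def raise_for_status(self):
        # Simulate an HTTP error status when an error was supplied.
        if self._error is not None:
            raise self._error

    def json(self):
        return self._payload


def extract_processed_text(response):
    """Mirror the UDF's handling: return processed_text or a fallback message."""
    try:
        response.raise_for_status()
        data = response.json()
        return data.get("processed_text", "No processed_text found")
    except Exception as exc:  # the UDF catches requests.exceptions.RequestException
        return f"Error calling external API: {exc}"


ok = FakeResponse({"processed_text": "My email is [EMAIL_ADDRESS_1]."})
missing = FakeResponse({})
failed = FakeResponse({}, error=RuntimeError("401 Unauthorized"))

print(extract_processed_text(ok))       # My email is [EMAIL_ADDRESS_1].
print(extract_processed_text(missing))  # No processed_text found
print(extract_processed_text(failed))   # Error calling external API: 401 Unauthorized
```

Note that on failure the function returns an error string rather than raising, so callers that need to distinguish failures should check for the `Error calling external API` prefix.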

## Step 2: Test the function from Unity Catalog

Now that you've installed the `deidentify_string()` function into your Databricks Unity Catalog, you can call it from Python with Spark.

In [0]:
# Retrieve the Skyflow API key from Databricks Secrets.
sky_api_key = dbutils.secrets.get(scope="sky-agentic-demo", key="sky_api_key")
# Alternately, you can hardcode the API key here.
# sky_api_key = "yourkey"

# Provide some sample text. In practice you'll read this from a file or table.
input_text = "Hi my name is Joseph McCarron and I live in Austin TX"


# Create the input dataframe
df = spark.createDataFrame([(input_text,)], ["input_text"])
df.createOrReplaceTempView("input_table")

# Create the result dataframe and pass the API key and the input dataframe
result_df = spark.sql(f"""
SELECT agentic.default.deidentify_string(input_text, '{sky_api_key}') AS deidentified_text
FROM input_table
""")

# Display the result
display(result_df)
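
Interpolating the key into the SQL text with an f-string places the secret in the query string itself, and a key containing a single quote would break the statement. A possible alternative, assuming a runtime where `spark.sql` accepts an `args` mapping for named parameter markers (PySpark 3.4 or later), is to bind the key as a parameter. The helper name `build_deidentify_query` is hypothetical, and the Spark call is left commented out because it needs a live session.

```python
def build_deidentify_query(api_key, table="input_table", column="input_text"):
    """Return (query, args) with the API key bound as a named parameter.

    Table and column identifiers cannot be bound as parameters, so they are
    still formatted into the text; only pass trusted names here.
    """
    query = (
        f"SELECT agentic.default.deidentify_string({column}, :sky_api_key) "
        f"AS deidentified_text FROM {table}"
    )
    return query, {"sky_api_key": api_key}


query, args = build_deidentify_query("example-key")  # placeholder key
print(query)

# In a Databricks notebook (assumes spark.sql supports the args parameter):
# result_df = spark.sql(query, args=args)
# display(result_df)
```

With this approach the query text contains only the `:sky_api_key` marker, while the value travels separately in `args`.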

## Step 3: Test `deidentify_string()` with a table

Now let's try using this on a table in your lakehouse. If you don't have a relevant table, use the cell below to create a sample `chats` table and populate it with sample data.

### Create a sample table

To get started quickly, run the cell below to create and populate a table with sample data.

In [0]:
# Create the table
spark.sql("""
CREATE TABLE chats (
    chat_id INT,
    user_id INT,
    timestamp TIMESTAMP,
    user_message STRING,
    bot_response STRING,
    user_name STRING,
    user_email STRING
)
""")

# Insert sample data
spark.sql("""
INSERT INTO chats VALUES
    (1, 101, '2025-03-07 10:00:00', 'Hello, I need help with my account. My email is john.doe@example.com.', 'Sure, I can help you with that.', 'John Doe', 'john.doe@example.com'),
    (2, 102, '2025-03-07 10:05:00', 'What is the weather today? My address is 123 Main St.', 'The weather today is sunny.', 'Jane Smith', 'jane.smith@example.com'),
    (3, 103, '2025-03-07 10:10:00', 'Can you tell me a joke? My phone number is 555-1234.', 'Why did the scarecrow win an award? Because he was outstanding in his field!', 'Alice Johnson', 'alice.johnson@example.com'),
    (4, 104, '2025-03-07 10:15:00', 'I forgot my password. My SSN is 123-45-6789.', 'Please click on the "Forgot Password" link to reset it.', 'Bob Brown', 'bob.brown@example.com'),
    (5, 105, '2025-03-07 10:20:00', 'What are your working hours? My email is charlie.davis@example.com.', 'Our working hours are from 9 AM to 5 PM.', 'Charlie Davis', 'charlie.davis@example.com'),
    (6, 106, '2025-03-07 10:25:00', 'Can you help me with my order? My address is 456 Elm St.', 'Sure, I can help you with your order.', 'David Evans', 'david.evans@example.com'),
    (7, 107, '2025-03-07 10:30:00', 'What is your return policy? My phone number is 555-5678.', 'Our return policy is 30 days.', 'Eve Foster', 'eve.foster@example.com'),
    (8, 108, '2025-03-07 10:35:00', 'Do you offer discounts? My SSN is 987-65-4321.', 'Yes, we offer discounts on bulk purchases.', 'Frank Green', 'frank.green@example.com'),
    (9, 109, '2025-03-07 10:40:00', 'How can I contact support? My email is grace.harris@example.com.', 'You can contact support via email or phone.', 'Grace Harris', 'grace.harris@example.com'),
    (10, 110, '2025-03-07 10:45:00', 'What is your shipping policy? My address is 789 Oak St.', 'We offer free shipping on orders over $50.', 'Hank Irving', 'hank.irving@example.com'),
    (11, 111, '2025-03-07 10:50:00', 'Can you recommend a product? My phone number is 555-9876.', 'Sure, I recommend our latest product.', 'Ivy Johnson', 'ivy.johnson@example.com'),
    (12, 112, '2025-03-07 10:55:00', 'How do I update my profile? My SSN is 321-54-9876.', 'You can update your profile in the settings.', 'Jack King', 'jack.king@example.com'),
    (13, 113, '2025-03-07 11:00:00', 'What payment methods do you accept? My email is karen.lee@example.com.', 'We accept credit cards and PayPal.', 'Karen Lee', 'karen.lee@example.com'),
    (14, 114, '2025-03-07 11:05:00', 'Can I track my order? My address is 321 Pine St.', 'Yes, you can track your order in the orders section.', 'Leo Martin', 'leo.martin@example.com'),
    (15, 115, '2025-03-07 11:10:00', 'Do you have a mobile app? My phone number is 555-4321.', 'Yes, we have a mobile app available on iOS and Android.', 'Mia Nelson', 'mia.nelson@example.com')
""")

# Display the table
display(spark.sql("SELECT * FROM chats"))

### Deidentify the user_message column from the table

Call the `deidentify_string()` function from Unity Catalog as part of a query.

In [0]:
# Retrieve the Skyflow API key from Databricks Secrets. Alternately, hardcode your API key here for test use.
sky_api_key = dbutils.secrets.get(scope="sky-agentic-demo", key="sky_api_key")

# Note: if you're using your own table, modify the query below to use your table name and column names.
result_df = spark.sql(f"""
SELECT chat_id, user_id, timestamp, agentic.default.deidentify_string(user_message, '{sky_api_key}') AS deidentified_user_message, bot_response, user_name, user_email
FROM chats
""")

# Display the result
display(result_df)

Example output (first 10 rows, CSV):

chat_id,user_id,timestamp,deidentified_user_message,bot_response,user_name,user_email
1,101,2025-03-07T10:00:00.000Z,"Hello, I need help with my account. My email is [EMAIL_ADDRESS_1].","Sure, I can help you with that.",John Doe,john.doe@example.com
2,102,2025-03-07T10:05:00.000Z,What is the weather today? My address is [LOCATION_ADDRESS_STREET_1].,The weather today is sunny.,Jane Smith,jane.smith@example.com
3,103,2025-03-07T10:10:00.000Z,Can you tell me a joke? My phone number is [PHONE_NUMBER_1].,Why did the scarecrow win an award? Because he was outstanding in his field!,Alice Johnson,alice.johnson@example.com
4,104,2025-03-07T10:15:00.000Z,I forgot my password. My SSN is [SSN_1].,"Please click on the ""Forgot Password"" link to reset it.",Bob Brown,bob.brown@example.com
5,105,2025-03-07T10:20:00.000Z,What are your working hours? My email is [EMAIL_ADDRESS_1].,Our working hours are from 9 AM to 5 PM.,Charlie Davis,charlie.davis@example.com
6,106,2025-03-07T10:25:00.000Z,Can you help me with my order? My address is [LOCATION_ADDRESS_STREET_1].,"Sure, I can help you with your order.",David Evans,david.evans@example.com
7,107,2025-03-07T10:30:00.000Z,What is your return policy? My phone number is [PHONE_NUMBER_1].,Our return policy is 30 days.,Eve Foster,eve.foster@example.com
8,108,2025-03-07T10:35:00.000Z,Do you offer discounts? My SSN is [SSN_1].,"Yes, we offer discounts on bulk purchases.",Frank Green,frank.green@example.com
9,109,2025-03-07T10:40:00.000Z,How can I contact support? My email is [EMAIL_ADDRESS_1].,You can contact support via email or phone.,Grace Harris,grace.harris@example.com
10,110,2025-03-07T10:45:00.000Z,What is your shipping policy? My address is [LOCATION_ADDRESS_STREET_1].,We offer free shipping on orders over $50.,Hank Irving,hank.irving@example.com
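
The placeholders in the output follow a `[ENTITY_TYPE_N]` shape. A rough way to summarize which entity types were detected is to scan the de-identified text with a regular expression. This sketch is based only on the placeholder shapes visible in the sample output; the pattern and the function name `count_placeholders` are assumptions, not a documented Skyflow format.

```python
import re
from collections import Counter

# Matches placeholders such as [EMAIL_ADDRESS_1] or [SSN_1]:
# an upper-case entity name followed by an underscore and a number.
PLACEHOLDER = re.compile(r"\[([A-Z]+(?:_[A-Z]+)*)_(\d+)\]")


def count_placeholders(texts):
    """Count placeholder occurrences per entity type across the given strings."""
    counts = Counter()
    for text in texts:
        for entity, _index in PLACEHOLDER.findall(text):
            counts[entity] += 1
    return counts


sample = [
    "Hello, I need help with my account. My email is [EMAIL_ADDRESS_1].",
    "I forgot my password. My SSN is [SSN_1].",
    "What are your working hours? My email is [EMAIL_ADDRESS_1].",
]
counts = count_placeholders(sample)
print(dict(counts))  # {'EMAIL_ADDRESS': 2, 'SSN': 1}
```

In a notebook, the input could come from collecting the `deidentified_user_message` column of `result_df`.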


Congratulations! You have successfully de-identified unstructured data in Databricks with Skyflow! 🚀

## Conclusion

Now that you've created a basic string deidentification function, you can try customizing it with some of the parameters available from the Skyflow Detect API. Or, if you're working with raw files, try creating a similar function for the Deidentify File APIs.

To learn more about Skyflow's Detect APIs for de-identification and re-identification, see [Skyflow for Unstructured Data](https://www.skyflow.com/solutions/by-use-case/skyflow-for-unstructured-data).