# Skyflow Deidentify String UDF

This notebook shows how to install a Unity Catalog function that de-identifies strings and tokenizes PII in unstructured data using a Skyflow Vault.

## Configure Secrets (optional)

To use this function as written, configure a secret in Databricks that stores your Skyflow API credentials. Alternatively, for testing, you can pass the credentials directly as an argument when you call the function.

#### Install the CLI

If you use Homebrew on macOS, the following commands install the CLI.

```
brew tap databricks/tap
brew install databricks
```

For more detailed instructions for any dev environment see the official documentation: [Databricks | Install or update the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/install)

#### Configure the CLI

To get started and create a configuration profile on your machine, run `databricks configure`.

You should be prompted for `Databricks Host` and a `Personal Access Token`. 

To get a Personal Access Token (PAT) for development, log in to the Databricks UI, open Settings, click Developer, then Access Tokens.

For more information on authenticating the Databricks CLI see the official documentation: [Databricks | Authentication for the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/authentication)

#### Configure a secret scope in Databricks

Now that you've configured and authenticated the Databricks CLI, run the following command to create a 'scope' for your secrets in Databricks: 

`databricks secrets create-scope <scope-name>`

For the rest of this demo we'll use the scope `sky-agentic-demo`.

`databricks secrets create-scope sky-agentic-demo`

#### Get details from Skyflow

- Create an account or log in at [skyflow.com](https://skyflow.com) and generate an API key: [docs.skyflow.com](https://docs.skyflow.com/api-authentication/)
- Copy your API key, Vault URL, and Vault ID


#### Store the secrets in Databricks

Create your secrets using the JSON syntax:

```sh
databricks secrets put-secret --json '{
  "scope": "sky-agentic-demo",
  "key": "sky_api_key",
  "string_value": "--sky_api_key--"
}'
```

To confirm the secrets were uploaded successfully, run `databricks secrets list-secrets sky-agentic-demo`. The output lists each key you stored along with its last-updated timestamp.

Example:

```sh
Key            Last Updated Timestamp
sky_api_key    1739998630197
```

Then to read a secret in a Notebook, use `dbutils.secrets`:

`sky_api_key = dbutils.secrets.get(scope = "sky-agentic-demo", key = "sky_api_key")`

To learn more about Secrets in Databricks, see the official documentation: [Secret Management | Databricks](https://docs.databricks.com/aws/en/security/secrets)
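
The lookup order described above (an explicitly passed key, then a Databricks secret, then an environment variable) can be sketched as a small helper. This is a minimal illustration, not part of the installed function: `resolve_api_key` and its `secrets_get` parameter are hypothetical names, and the `SKY_API_KEY` environment variable is the one the function below falls back to.

```python
import os


def resolve_api_key(explicit_key=None, secrets_get=None,
                    scope="sky-agentic-demo", key="sky_api_key"):
    """Return the first available Skyflow API key (hypothetical helper).

    Order: explicit argument, secret lookup callable, SKY_API_KEY env var.
    """
    if explicit_key:
        return explicit_key
    if secrets_get is not None:
        # In a notebook, pass dbutils.secrets.get as secrets_get.
        secret = secrets_get(scope=scope, key=key)
        if secret:
            return secret
    env_key = os.environ.get("SKY_API_KEY")
    if env_key:
        return env_key
    raise ValueError("No Skyflow API key found")
```

In a Databricks notebook this might be called as `sky_api_key = resolve_api_key(secrets_get=dbutils.secrets.get)`.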


## Install the function

Before you install, set your `vault_id` and `vault_url` in the function body. Both values are hardcoded in this version of the function. You can instead modify the function to accept them as parameters from the caller, or read them from Databricks environment variables.



In [0]:
%sql
CREATE OR REPLACE FUNCTION
agentic.default.deidentify_string (
 input_text STRING COMMENT 'The string to be de-identified.',
 sky_api_key STRING COMMENT 'The API key for the Skyflow API.'
)
RETURNS STRING
LANGUAGE PYTHON
DETERMINISTIC
COMMENT 'Deidentify a string using the Skyflow API. Removes any sensitive data from the string and returns a safe string with placeholders in place of sensitive data tokens.'
AS $$
 import os
 import sys
 import json
 import requests
 from io import StringIO
 
 sys_stdout = sys.stdout
 redirected_output = StringIO()
 sys.stdout = redirected_output

 if sky_api_key is None or sky_api_key == '':
     # try to fetch the API key from env variables
     bearer_token = os.environ.get("SKY_API_KEY")
 else:
     bearer_token = sky_api_key

 # SET YOUR VAULT ID
 vault_id = "SKYFLOW_VAULT_ID"
 # SET YOUR VAULT URL
 vault_url = "https://sample.vault.skyflowapis.com"
 # END
 api_path = "/v1/detect/deidentify/string"
 api_url = vault_url + api_path
 headers = {
     "Authorization": f"Bearer {bearer_token}",
     "Content-Type": "application/json",
 }
 json_body = {
     "vault_id": vault_id,
     "text": input_text,
 }

 try:
     api_response = requests.post(api_url, headers=headers, json=json_body)
     api_response.raise_for_status()
     external_data = api_response.json()
     result = external_data.get('processed_text', 'No processed_text found')
 except requests.exceptions.RequestException as e:
     result = f"Error calling external API: {str(e)}"

 sys.stdout = sys_stdout
 return result
$$
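
To check the request shape outside Databricks before installing, the logic inside the function body can be approximated in plain Python. The sketch below is an assumption-laden illustration: `build_deidentify_request` and `extract_processed_text` are hypothetical helper names, the vault values are the placeholders from the function above, and no network call is made. Unlike the function above, it also strips a trailing slash from the vault URL.

```python
def build_deidentify_request(input_text, bearer_token, vault_id, vault_url):
    """Assemble the URL, headers, and JSON body that the function sends."""
    api_url = vault_url.rstrip("/") + "/v1/detect/deidentify/string"
    headers = {
        "Authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json",
    }
    json_body = {"vault_id": vault_id, "text": input_text}
    return api_url, headers, json_body


def extract_processed_text(response_json):
    """Mirror the function's fallback when processed_text is missing."""
    return response_json.get("processed_text", "No processed_text found")


# Placeholder values only; replace them with your own vault details.
url, headers, body = build_deidentify_request(
    "Hi my name is Joseph McCarron",
    bearer_token="example-key",
    vault_id="SKYFLOW_VAULT_ID",
    vault_url="https://sample.vault.skyflowapis.com",
)
print(url)
print(body)
```

Taking `vault_id` and `vault_url` as arguments here also shows one way the hardcoded values could become parameters.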

## Test the function from Unity Catalog

Now that you've installed the `deidentify_string()` function into your Databricks Unity Catalog, you can call it from Python with Spark.

In [0]:
# Retrieve the Skyflow API key from Databricks Secrets.
sky_api_key = dbutils.secrets.get(scope="sky-agentic-demo", key="sky_api_key")
# Alternatively, for testing only, you can hardcode the API key here.
# sky_api_key = "yourkey"

# Provide some sample text. In practice you'll read this from a file or table.
input_text = "Hi my name is Joseph McCarron and I live in Austin TX"

In [0]:
# Create the input dataframe
df = spark.createDataFrame([(input_text,)], ["input_text"])
df.createOrReplaceTempView("input_table")

# Create the result dataframe and pass the API key and the input dataframe
result_df = spark.sql(f"""
SELECT agentic.default.deidentify_string(input_text, '{sky_api_key}') AS deidentified_text
FROM input_table
""")

# Display the result
display(result_df)

deidentified_text
Hi my name is [NAME_1] and I live in [LOCATION_1]
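
A quick way to sanity-check a result like the one above is to list the placeholders it contains. The sketch below assumes placeholders follow the `[TYPE_N]` pattern seen in this sample output; that format is inferred from the example rather than taken from the Skyflow documentation, and `find_placeholders` is a hypothetical helper.

```python
import re

# Matches placeholders such as [NAME_1] or [LOCATION_1] (assumed format).
PLACEHOLDER_RE = re.compile(r"\[([A-Z]+(?:_[A-Z]+)*)_(\d+)\]")


def find_placeholders(text):
    """Return (entity_type, index) pairs for each placeholder in text."""
    return [(m.group(1), int(m.group(2))) for m in PLACEHOLDER_RE.finditer(text)]


sample = "Hi my name is [NAME_1] and I live in [LOCATION_1]"
print(find_placeholders(sample))
```

An empty list for text that you know contains PII suggests the call failed or returned an error string instead of de-identified text.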


## Next steps

Now that you've created a basic string de-identification function, try customizing it with some of the parameters available from the Skyflow Detect API. If you're working with raw files, try creating a similar function for the Deidentify File APIs.