# Tonic Textual

[Tonic Textual](https://textual.tonic.ai) makes training data safe for production by keeping customer PII/PHI out of your model weights.  It does this by sanitizing and synthesizing your data before model training, using our state-of-the-art named entity recognition and entity linking models.  This notebook illustrates how you can use Textual to sanitize unstructured text.  It also shows how you can access data stored in your AWS Lakehouse.

# Getting Started

Before we start accessing data in AWS, let's get a feel for how the SDK works.  We'll start by redacting and synthesizing some simple pieces of text.  To get started, you'll first need a Textual API key.  Create a *free* account at [https://textual.tonic.ai/signup](https://textual.tonic.ai/signup).  Once you've created your account, create an API key from the top navigation bar.

For this tutorial, you can hard-code your API key value or load it from AWS Secrets Manager.  If you use AWS Secrets Manager, you'll need to grant your SageMaker IAM role the `secretsmanager:GetSecretValue` permission.
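The Secrets Manager lookup could be sketched as follows. This is a minimal example, assuming the key is stored as a plain-string secret; the helper `get_textual_api_key` and the secret name `tonic/textual-api-key` are hypothetical, while `get_secret_value` is the standard boto3 Secrets Manager call.

```python
def get_textual_api_key(secret_id, client=None):
    """Return the string value of a secret stored in AWS Secrets Manager."""
    if client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret was saved as plain text, so it lives in SecretString.
    return response["SecretString"]


# Hypothetical secret name; replace it with the name of the secret you created.
# textual_api_key = get_textual_api_key("tonic/textual-api-key")
```

The resulting `textual_api_key` value can then be passed to `TextualNer` in place of the hard-coded string.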

In [16]:
!pip install -q tonic_textual
from tonic_textual.redact_api import TextualNer
import json

textual_api_key='<Tonic Textual API key goes here>'
textual_ner = TextualNer(api_key=textual_api_key)

Now let's redact and synthesize some text using the SDK.  The **redact()** call returns both the sanitized text and a list of all entities found.  In the example below, we'll process the following text:

> The patient, John Smith is a 42-year male and suffers from Crohn's Disease.

And we'll find these entities:

- **John** - A first name
- **Smith** - A last name
- **42** - A person's age
- **male** - A person's gender
- **Crohn** - A last name (from "Crohn's Disease")

The generated redacted text will be

> The patient, [NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a [PERSON_AGE_BF3]-year [GENDER_IDENTIFIER_AzwE0] and suffers from [NAME_FAMILY_2lZ9KI]'s Disease.

By default, Textual flags every entity it finds and replaces each one with a redaction token, as you can see above.
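Each entry in `de_identify_results` also carries character offsets. The sketch below copies two entries from the response printed in the next cell into a plain dictionary (no API call is made) and checks them. It assumes `start`/`end` are end-exclusive positions in `original_text` and `new_start`/`new_end` are the matching positions in `redacted_text`; the helper name `check_offsets` is hypothetical.

```python
# Values copied from the notebook's printed response; no API call is made here.
response = {
    "original_text": "The patient, John Smith is a 42-year male and suffers from Crohn's Disease.",
    "redacted_text": (
        "The patient, [NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a [PERSON_AGE_BF3]-year "
        "[GENDER_IDENTIFIER_AzwE0] and suffers from [NAME_FAMILY_2lZ9KI]'s Disease."
    ),
    "de_identify_results": [
        {"start": 13, "end": 17, "new_start": 13, "new_end": 31,
         "label": "NAME_GIVEN", "text": "John", "new_text": "[NAME_GIVEN_dySb5]"},
        {"start": 18, "end": 23, "new_start": 32, "new_end": 52,
         "label": "NAME_FAMILY", "text": "Smith", "new_text": "[NAME_FAMILY_7w4Db3]"},
    ],
}


def check_offsets(resp):
    """Return (label, original, replacement) triples after verifying each span."""
    triples = []
    for entity in resp["de_identify_results"]:
        original = resp["original_text"][entity["start"]:entity["end"]]
        replacement = resp["redacted_text"][entity["new_start"]:entity["new_end"]]
        # Assumption: offsets are end-exclusive character positions.
        assert original == entity["text"]
        assert replacement == entity["new_text"]
        triples.append((entity["label"], original, replacement))
    return triples


for label, original, replacement in check_offsets(response):
    print(f"{label}: {original!r} -> {replacement!r}")
```

These offsets are what you would use to highlight or audit individual replacements without re-parsing the redacted string.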


In [27]:
response = textual_ner.redact('The patient, John Smith is a 42-year male and suffers from Crohn\'s Disease.')
print(json.dumps(response, indent=4))

{
    "original_text": "The patient, John Smith is a 42-year male and suffers from Crohn's Disease.",
    "redacted_text": "The patient, [NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a [PERSON_AGE_BF3]-year [GENDER_IDENTIFIER_AzwE0] and suffers from [NAME_FAMILY_2lZ9KI]'s Disease.",
    "usage": 13,
    "de_identify_results": [
        {
            "start": 13,
            "end": 17,
            "new_start": 13,
            "new_end": 31,
            "label": "NAME_GIVEN",
            "text": "John",
            "score": 0.9,
            "language": "en",
            "new_text": "[NAME_GIVEN_dySb5]"
        },
        {
            "start": 18,
            "end": 23,
            "new_start": 32,
            "new_end": 52,
            "label": "NAME_FAMILY",
            "text": "Smith",
            "score": 0.9,
            "language": "en",
            "new_text": "[NAME_FAMILY_7w4Db3]"
        },
        {
            "start": 29,
            "end": 31,
            "new_start": 58,
    

Let's now modify the above **redact()** call to synthesize replacement values instead of tokenizing.  We'll also disable the GENDER_IDENTIFIER entity to show how that is possible.

In [5]:
# Synthesize replacement values by default, and leave GENDER_IDENTIFIER values untouched.

response = textual_ner.redact(
    'The patient, John Smith is a 42-year male and suffers from Crohn\'s Disease.',
    generator_default='Synthesis',
    generator_config={'GENDER_IDENTIFIER':'Off'}
)
print(json.dumps(response, indent=4))

{
    "original_text": "The patient, John Smith is a 42-year male and suffers from Crohn's Disease.",
    "redacted_text": "The patient, Alfonzo Uva is a 47-year male and suffers from Vasconez's Disease.",
    "usage": 13,
    "de_identify_results": [
        {
            "start": 13,
            "end": 17,
            "new_start": 13,
            "new_end": 20,
            "label": "NAME_GIVEN",
            "text": "John",
            "score": 0.9,
            "language": "en",
            "new_text": "Alfonzo"
        },
        {
            "start": 18,
            "end": 23,
            "new_start": 21,
            "new_end": 24,
            "label": "NAME_FAMILY",
            "text": "Smith",
            "score": 0.9,
            "language": "en",
            "new_text": "Uva"
        },
        {
            "start": 29,
            "end": 31,
            "new_start": 30,
            "new_end": 32,
            "label": "PERSON_AGE",
            "text": "42",
            "score"

# Loading data into an AWS Lakehouse Catalog

Let's now perform the same sanitization process on data stored in your AWS Lakehouse.  We'll store the data in a Lakehouse table and process it with PySpark.  We'll start with a Parquet file of call transcripts.  Before running the cells below, make sure you upload the sample transcript file to your project's S3 root (which can be found via `project.s3.root`).  The sample file, **customer_support_transcripts.parquet**, can be downloaded from [our GitHub repo](https://github.com/TonicAI/textual_sagemaker).

## Pre-requisites

A few IAM policies must first be attached to your SageMaker user IAM role.  To find your IAM role, you can run

```python
from sagemaker_studio import Project

project = Project()
project.iam_role
```
The following permissions must be attached to the role:

- glue:GetTable
- s3:GetObject
- s3:PutObject

The S3 permissions can be scoped to your SageMaker project bucket, which can be found by running

```python
project.s3.root
```
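A policy granting these permissions could look roughly like the sketch below, which builds the JSON document in Python. The helper name `build_policy` and the example root `s3://my-project-bucket/dev` are hypothetical. The Glue statement uses `*` as its resource for brevity; you may want to narrow it to specific catalog, database, and table ARNs.

```python
import json


def build_policy(s3_root):
    """Build an IAM policy document scoped to the bucket named in an s3:// URI."""
    # Same bucket extraction that this notebook uses when building table locations.
    bucket = s3_root.split("/", 3)[2]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["glue:GetTable"], "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical root; in the notebook you would pass project.s3.root instead.
print(json.dumps(build_policy("s3://my-project-bucket/dev"), indent=2))
```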

With your IAM policies updated for your role, you'll now need to upload the sample call transcript file to your SageMaker S3 bucket.  You can do this using Python's boto3 library or directly in SageMaker by navigating to S3 on the SageMaker data tab, going to your bucket, and uploading the file to the dev/ folder.  The file itself can be downloaded from [our GitHub repo](https://github.com/TonicAI/textual_sagemaker).
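The boto3 route could be sketched as follows. This is a minimal example, assuming the file has already been downloaded to the notebook's working directory; the helpers `s3_destination` and `upload_sample` are hypothetical names, and `upload_file` is the standard boto3 S3 client call.

```python
import os
from urllib.parse import urlparse


def s3_destination(s3_root, filename):
    """Split an s3:// root into (bucket, key) for a file placed directly under it."""
    parsed = urlparse(s3_root)
    prefix = parsed.path.strip("/")
    key = f"{prefix}/{filename}" if prefix else filename
    return parsed.netloc, key


def upload_sample(local_path, s3_root, client=None):
    """Upload a local file to the given S3 root and return (bucket, key)."""
    if client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3
        client = boto3.client("s3")
    bucket, key = s3_destination(s3_root, os.path.basename(local_path))
    client.upload_file(local_path, bucket, key)
    return bucket, key


# In the notebook, pass the project's root:
# upload_sample("customer_support_transcripts.parquet", project.s3.root)
```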

## Loading the data into Lakehouse

Let's now take our Parquet file, sitting in S3, and load it into a new table in the project's Lakehouse database.  We'll do that below.

Before that, let's configure our Spark cluster with appropriate resources:

In [25]:
%%configure --name project.spark.compatibility -f
{
  "number_of_workers": 2,
  "worker_type": "G.1X",
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type":   "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
  }
}

Stopping session for project.spark.compatibility. Session id: 6b7asobqqa1q5j-55409364-648b-4c27-8b71-569dd080f54b
Session stopped.


"The following configurations have been updated: {'number_of_workers': 2, 'worker_type': 'G.1X', 'conf': {'spark.pyspark.virtualenv.enabled': 'true', 'spark.pyspark.virtualenv.type': 'native', 'spark.pyspark.virtualenv.bin.path': '/usr/bin/virtualenv'}}"

Creating Glue session...
Create session for connection: project.spark.compatibility


'Session 6b7asobqqa1q5j-a0568dea-ae30-4097-98a8-02050c0c5ffb has been created.'

Id,Spark UI,Driver logs
6b7asobqqa1q5j-a0568dea-ae30-4097-98a8-02050c0c5ffb,link,link


In [39]:
%%pyspark project.spark.compatibility
from sagemaker_studio import Project

project = Project()
catalog = project.connection().catalog()
project_database = catalog.databases[0].name
db_name = project_database

tbl_name  = "call_transcripts"
data_path = f"s3://{project.s3.root.split('/',3)[2]}/lakehouse/{db_name}/{tbl_name}"

spark.sql(f"""DROP TABLE IF EXISTS {db_name}.{tbl_name}""")

spark.sql(f"""
CREATE TABLE IF NOT EXISTS {db_name}.{tbl_name} (
    id STRING,
    customer_id STRING,
    transcript STRING
)
USING iceberg
LOCATION '{data_path}'
""")

# Read the uploaded sample file and append its rows to the new table.
source_path = f"{project.s3.root}/customer_support_transcripts.parquet"
raw_df = spark.read.parquet(source_path)
raw_df.write.format("iceberg").mode("append").saveAsTable(f"{db_name}.{tbl_name}")



Connection: project.spark.compatibility | Run start time: 2025-09-25 02:44:02.491780 | Run duration : 0:00:32.050194s.


# Moving to scale with PySpark

With data now in our Lakehouse, let's use Spark (PySpark) to sanitize unstructured text at scale.  Let's get a few things set up first.

- We need to install the tonic_textual Python package on the Spark Workers
- We need to pass our API key to our Spark nodes so Textual can be used

## Install tonic_textual on your Spark workers

In [29]:
%%pyspark project.spark.compatibility
sc.install_pypi_package("tonic_textual")

Collecting tonic_textual
  Downloading tonic_textual-3.13.0-py3-none-any.whl (92 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.7/92.7 kB 9.4 MB/s eta 0:00:00
Collecting more-itertools<11.0.0,>=10.2.0
  Downloading more_itertools-10.8.0-py3-none-any.whl (69 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69.7/69.7 kB 16.1 MB/s eta 0:00:00
Collecting requests<3.0.0,>=2.32.3
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 kB 8.8 MB/s eta 0:00:00
Collecting tqdm<5.0.0,>=4.67.1
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 18.6 MB/s eta 0:00:00
Installing collected packages: tqdm, requests, more-itertools, tonic_textual
  Attempting uninstall: requests
    Found existing installation: requests 2.32.2
    Not uninstalling requests at /usr/local/lib/python3.11/site-packages, outside environment /tmp/spark-83a24986-0983-4501-8f6d-01a605edcaea
    Can't uninsta

## Let's send our Textual API key to our Spark workers, so it's available to our UDF

In [33]:
%send_to_remote --name project.spark.compatibility --language python --local textual_api_key --remote textual_api_key

# Running our sanitization

With everything in place, we can now create our PySpark UDF, which simply wraps the Textual *redact()* method so that it is callable from Spark.  We'll create our UDF and apply it to the 'transcript' column of our call_transcripts table.

## Let's first create our UDF

This UDF synthesizes values for the entity types listed in `sensitive_entities` and leaves all other entity types unchanged.

In [43]:
%%pyspark project.spark.compatibility
from tonic_textual.redact_api import TextualNer
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def redact(txt):
    if txt is None:
        return None

    if not hasattr(redact, "_ner"):
        redact._ner = TextualNer("https://textual.tonic.ai",textual_api_key)

    #for a list of all allowed entities go here: https://docs.tonic.ai/textual/tonic-textual-models-about/built-in-entity-types
    #or you can view all supported values via the PiiType enum (from tonic_textual.enums.pii_type import PiiType)
    sensitive_entities=['NAME_GIVEN','NAME_FAMILY','LOCATION_ADDRESS','LOCATION_CITY','LOCATION_STATE','LOCATION_ZIP','EMAIL_ADDRESS','US_SSN','CVV','CC_EXP','NUMERIC_PII', 'ORGANIZATION']
    config = {k: 'Synthesis' for k in sensitive_entities}
    
    return redact._ner.redact(txt, generator_default='Off', generator_config=config).redacted_text


Connection: project.spark.compatibility | Run start time: 2025-09-25 02:49:37.170685 | Run duration : 0:00:07.721229s.
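The `hasattr(redact, "_ner")` check in the UDF is a lazy-initialization pattern: the client is built on the first call and reused afterwards, rather than being rebuilt for every row. The plain-Python sketch below shows the idea without Spark or Textual; `FakeClient` and `process` are hypothetical stand-ins.

```python
class FakeClient:
    """Hypothetical stand-in for an API client that is expensive to create."""

    instances = 0  # counts how many clients have been constructed

    def __init__(self):
        FakeClient.instances += 1

    def redact(self, text):
        # Placeholder transformation, not real redaction.
        return text.upper()


def process(text):
    if text is None:
        return None
    # Create the client once and cache it on the function object.
    if not hasattr(process, "_client"):
        process._client = FakeClient()
    return process._client.redact(text)


results = [process(t) for t in ["a", None, "b", "c"]]
print(results, FakeClient.instances)
```

However many non-null rows are processed, only one client is constructed per Python process.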


## Let's now run the UDF and create a new dataframe with synthetic text

In [60]:
%%pyspark project.spark.compatibility
spark.sql(f"""use {db_name};""")
df = spark.read.table('call_transcripts')

df_with_redacted = df.withColumn("transcript_redacted", redact(df["transcript"]))


Connection: project.spark.compatibility | Run start time: 2025-09-25 03:04:35.627627 | Run duration : 0:00:07.891162s.


# Now let's create a table with synthetic results

In [62]:
%%pyspark project.spark.compatibility
tbl_name  = "call_transcripts_synthetic"
data_path = f"s3://{project.s3.root.split('/',3)[2]}/lakehouse/{db_name}/{tbl_name}"

spark.sql(f"""DROP TABLE IF EXISTS {db_name}.{tbl_name}""")

spark.sql(f"""
CREATE TABLE IF NOT EXISTS {db_name}.{tbl_name} (
    id STRING,
    customer_id STRING,
    transcript STRING,
    transcript_redacted STRING
)
USING iceberg
LOCATION '{data_path}'
""")

# Append the original and synthesized transcripts to the new table.
df_with_redacted.write.format("iceberg").mode("append").saveAsTable(f"{db_name}.{tbl_name}")



Connection: project.spark.compatibility | Run start time: 2025-09-25 03:06:30.941714 | Run duration : 0:00:29.819098s.


# Publish the data and make it discoverable

The redacted data is now available for querying in your query editor.

To publish the data to your SageMaker Catalog, please follow these steps:

1. Navigate to the 'Project Overview' of your current project.
2. Select 'Data Sources'.
3. Look for a source with the connection type 'project.default_lakehouse'. Click 'Run' under the 'Actions' column to make your tables available as assets.
4. Navigate to 'Assets'. You should now see both tables: call_transcripts and call_transcripts_synthetic.
5. Select the call_transcripts_synthetic table. You'll see options to generate descriptions and accept automated metadata generation. Review and edit the metadata and descriptions as needed.
6. When ready, click 'Publish Asset' in the top right corner.
7. The asset is now published and discoverable by users outside of the publishing project.
8. Project members can review and approve subscription requests from users in other projects.

# Subscribe to the Project 

Users in different projects can follow these steps to subscribe and access the data:

1. Log into the domain 
2. From the home page, select 'Catalog' and then 'Browse assets'
3. Browse the available assets. Click on the desired table name to view its descriptions and details
4. Click 'Subscribe' in the top right corner
5. Provide a detailed reason for your request and click the 'Request' button
6. Once your subscription request is approved, you can begin querying the data

This approach protects PII while making relevant data accessible to different teams within your organization. The data can be used for various purposes, including research, sentiment analysis, and gathering customer experience insights.
