# Overview

In this notebook, we'll show how to use [Tonic Textual](https://tonic.ai/textual) to de-identify unstructured clinical data so that it can be used to safely fine-tune LLMs. As an example application, we showcase fine-tuning for extracting and structuring clinical notes, for example converting notes to [HL7 FHIR](https://hl7.org/fhir/).

# Example dataset
Our example dataset is _purely synthetic_, constructed in two steps:
1. Sample Synthea records provided "ground-truth" FHIR records.
2. These synthetic FHIR records were used to generate synthetic clinical notes, using a modified version of [chatty-notes](https://github.com/synthetichealth/chatty-notes/tree/main).


## Synthea
> Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079

# Getting started
Before we start accessing data in AWS, let's get a feel for how the SDK works. We'll start by redacting and synthesizing some simple pieces of text. To get started, you'll first need to create a Textual API key. You can do that by creating a free account at [https://textual.tonic.ai/signup](https://textual.tonic.ai/signup). Once you've created your account, create an API key from the top navigation bar.

For this tutorial, you can hard-code your API key value or retrieve it from AWS Secrets Manager. If you use AWS Secrets Manager, you'll need to grant your SageMaker IAM role the `secretsmanager:GetSecretValue` permission.
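The Secrets Manager lookup could look roughly like the sketch below. The secret name `tonic-textual-api-key` is a placeholder, and the helper `get_textual_api_key` is hypothetical: it assumes the key is stored as a plain string, or as a JSON object when `json_key` is given. Adjust both to match how your secret was created.

```python
import json

def get_textual_api_key(secret_id, json_key=None, client=None):
    """Fetch an API key from AWS Secrets Manager.

    secret_id: name or ARN of the secret (placeholder value in the usage below).
    json_key:  set this if the secret is stored as a JSON object of key/value pairs.
    client:    optional pre-built Secrets Manager client (handy for testing).
    """
    if client is None:
        import boto3  # imported lazily so the function can be exercised without AWS
        client = boto3.client("secretsmanager")
    secret_string = client.get_secret_value(SecretId=secret_id)["SecretString"]
    if json_key is not None:
        return json.loads(secret_string)[json_key]
    return secret_string

# Usage (hypothetical secret name):
# textual_api_key = get_textual_api_key("tonic-textual-api-key")
```

Passing the client in as an argument keeps the function easy to test with a stub and avoids creating a new client on every call.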



In [6]:
!pip install tonic-textual
import json
import pandas as pd
from datasets import load_dataset
from tonic_textual.redact_api import TextualNer
import boto3

textual_api_key='<Tonic Textual API key goes here>'
textual = TextualNer(api_key=textual_api_key)




# Load clinical notes

In [8]:
ds = load_dataset("TonicAI/synthetic_clinical_notes", download_mode='force_redownload')


In [9]:
ds

DatasetDict({
    train: Dataset({
        features: ['note', 'encounter_data'],
        num_rows: 1028
    })
    validation: Dataset({
        features: ['note', 'encounter_data'],
        num_rows: 672
    })
    test: Dataset({
        features: ['note', 'encounter_data'],
        num_rows: 1681
    })
})

In [10]:
example = ds['train'][0]
print(json.dumps(example,indent=2))

{
  "note": "**Medical Note**\n\n**Patient Name:** Harvey D'Amore  \n**Date of Birth:** September 12, 1964  \n**Medical Record Number:** HD-AM-0901964  \n**Date of Encounter:** October 18, 2023  \n**Attending Dentist:** Dr. Emily Thompson  \n**Location:** Greenfield Dental Clinic, 123 Oak Street, Springfield, IL  \n\n**Reason for Visit:**  \nMr. Harvey D\u2019Amore presents for a routine check-up and management of gingivitis.\n\n**Chief Complaint:**  \nThe patient reports experiencing swollen and tender gums over the past few weeks, with occasional bleeding during brushing and persistent bad breath.\n\n**History of Present Illness:**  \nMr. D\u2019Amore is a 59-year-old male who has been experiencing gingival tenderness and bleeding upon brushing for approximately three weeks. He reports that bleeding is most notable when flossing and notes some persistent bad breath, which has not improved with regular oral hygiene measures. Mr. D\u2019Amore admits to occasionally skipping his night-t

# Example Note

**Medical Note**

**Patient Name:** Harvey D'Amore  
**Date of Birth:** September 12, 1964  
**Medical Record Number:** HD-AM-0901964  
**Date of Encounter:** October 18, 2023  
**Attending Dentist:** Dr. Emily Thompson  
**Location:** Greenfield Dental Clinic, 123 Oak Street, Springfield, IL  

**Reason for Visit:**  
Mr. Harvey D’Amore presents for a routine check-up and management of gingivitis.

**Chief Complaint:**  
The patient reports experiencing swollen and tender gums over the past few weeks, with occasional bleeding during brushing and persistent bad breath.

**History of Present Illness:**  
Mr. D’Amore is a 59-year-old male who has been experiencing gingival tenderness and bleeding upon brushing for approximately three weeks. He reports that bleeding is most notable when flossing and notes some persistent bad breath, which has not improved with regular oral hygiene measures. Mr. D’Amore admits to occasionally skipping his night-time oral hygiene routine. He denies any recent changes in his diet or medication. The patient has no history of significant dental issues prior to this episode and last had a dental check-up approximately six months ago.

**Dental Procedures Performed:**  
1. **Dental Consultation and Report:** Comprehensive evaluation of Mr. D’Amore's oral health status was conducted. Signs consistent with mild to moderate gingivitis were observed.
   
2. **Dental Care Regime/Therapy Initiation:** Establishing a more rigorous oral hygiene routine including twice-daily brushing with a soft-bristled toothbrush and daily flossing was recommended. Emphasis on technique was provided.

3. **Removal of Supragingival Plaque and Calculus:** Complete scaling of all teeth was performed to remove plaque and calculus deposits above the gum line.

4. **Removal of Subgingival Plaque and Calculus:** Thorough scaling of all teeth to remove plaque and calculus below the gum line was undertaken to reduce periodontal pockets.

5. **Examination of Gingivae:** Comprehensive examination revealed inflammation and slight edema of gingival tissues consistent with gingivitis. No abscesses or serious periodontal issues were noted.

6. **Oral Health Education:** Mr. D’Amore received instruction on effective brushing and flossing techniques, the importance of consistent oral hygiene, and dietary recommendations to support gum health. The patient was advised to consider using an antiseptic mouthwash to help control plaque.

**Assessment:**  
1. Gingivitis: Mild to moderate gingival inflammation with associated bleeding and halitosis.

**Plan:**  
1. Reinforce oral hygiene education and stress compliance with recommended measures.
2. Schedule follow-up appointment in 3 months for reassessment and maintenance cleaning.
3. Consider follow-up with periodontist if symptoms do not improve.

**Comments:**  
Patient was advised that, with adherence to proper oral hygiene practices, symptoms of gingivitis are expected to resolve, and risk for progression to periodontitis can be minimized.

**Signed:**  
Dr. Emily Thompson  
Greenfield Dental Clinic

# Safely working with HIPAA data

We use Textual to detect sensitive PHI and replace it with synthetic values. Our highly accurate NER models, coupled with our realistic synthesis, allow customers to safely work with HIPAA data under [Expert Determination](https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html#guidancedetermination). As we'll see in this notebook, the resulting data maintains the valuable signal for our fine-tuning problem while ensuring that sensitive patient information is not memorized by the fine-tuned LLM.

## Configuring Textual
In the cell below, we create the objects that configure Textual to detect and synthesize the same PHI types that would need to be redacted for Safe Harbor de-identification.
The `generator_config` dictionary specifies how to treat each of these detected entities, and we will specify `generator_default='Off'` to let all other detected entities pass through unchanged. Textual detects many [entity types](https://docs.tonic.ai/textual/tonic-textual-models-about/built-in-entity-types), but not all of them are considered sensitive under HIPAA. Finally, we specify that our name generator should preserve gender.


In [11]:
from tonic_textual.classes.generator_metadata.name_generator_metadata import NameGeneratorMetadata
phi_labels = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "LOCATION_ADDRESS",
    "LOCATION_STATE",
    "LOCATION_CITY",
    "LOCATION_ZIP",
    "LOCATION_COUNTRY",
    "PHONE_NUMBER",
    "EMAIL_ADDRESS",
    "CREDIT_CARD",
    "CC_EXP",
    "CVV",
    "MONEY",
    "ORGANIZATION",
    "DOB",
    "DATE_TIME",
    "URL",
    "NUMERIC_PII",
    "HEALTHCARE_ID",
]
generator_config = {label:'Synthesis' for label in phi_labels}
generator_metadata = {'NAME_GIVEN': NameGeneratorMetadata(preserve_gender=True)}
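
As a mental model only (this is not how Textual is implemented), the interaction between `generator_config` and `generator_default` behaves like a dictionary lookup with a fallback. The helper `handling_for` and the shortened `demo_labels` list are illustrative, and `OCCUPATION` stands in for any detected entity type that is not listed in `phi_labels`.

```python
# Shortened, illustrative copy of the configuration (separate names so the
# real phi_labels / generator_config above are not overwritten)
demo_labels = ["NAME_GIVEN", "NAME_FAMILY", "DOB", "HEALTHCARE_ID"]
demo_config = {label: "Synthesis" for label in demo_labels}

def handling_for(label, config, default="Off"):
    """Return how an entity type is treated: its explicit setting, else the default."""
    return config.get(label, default)

print(handling_for("NAME_GIVEN", demo_config))  # listed, so it is synthesized
print(handling_for("OCCUPATION", demo_config))  # not listed, so it falls back to 'Off'
```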

## De-identifying semi-structured data
Since our training data consists of a mix of unstructured and structured data, we use the [Textual](https://docs.tonic.ai/textual/tonic-textual-api/datasets-redaction/textual-api-redact-strings#api-redact-string-json) `redact_json` method to de-identify our training data.

In [14]:
# example redaction
response = textual.redact_json(
    example,
    generator_config=generator_config,
    generator_metadata=generator_metadata,
    generator_default='Off',
    random_seed=5 # Optional parameter, allowing for distinct replacements
)

# The response object contains detailed information about the entities detected, as well as the de-identified JSON object as a string.
# We parse the de-identified JSON string into a dictionary.
json.loads(response.redacted_text)

{'note': "**Medical Note**\n\n**Patient Name:** Juante Cohenour  \n**Date of Birth:** September 12, 1964  \n**Medical Record Number:** DN-GU-2330504  \n**Date of Encounter:** October 19, 2023  \n**Attending Dentist:** Dr. Kyhlee Farahkhan  \n**Location:** Green Boob Inc, 87 Manifest Plaza, Matherville, IL  \n\n**Reason for Visit:**  \nMr. Juante Lynskey presents for a routine check-up and management of gingivitis.\n\n**Chief Complaint:**  \nThe patient reports experiencing swollen and tender gums over the past few weeks, with occasional bleeding during brushing and persistent bad breath.\n\n**History of Present Illness:**  \nMr. Lynskey is a 59-year-old male who has been experiencing gingival tenderness and bleeding upon brushing for approximately three weeks. He reports that bleeding is most notable when flossing and notes some persistent bad breath, which has not improved with regular oral hygiene measures. Mr. Lynskey admits to occasionally skipping his night-time oral hygiene routi
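
One quick way to see what changed is to compare the note line by line. The sketch below uses the first few lines of the original and de-identified notes shown above. `changed_lines` is a hypothetical helper, and it assumes that de-identification preserves the number of lines.

```python
def changed_lines(original: str, redacted: str):
    """Return (original_line, redacted_line) pairs for lines that differ."""
    pairs = zip(original.splitlines(), redacted.splitlines())
    return [(o, r) for o, r in pairs if o != r]

# First three lines of the example note, before and after de-identification
original_note = (
    "**Patient Name:** Harvey D'Amore  \n"
    "**Date of Birth:** September 12, 1964  \n"
    "**Medical Record Number:** HD-AM-0901964  \n"
)
redacted_note = (
    "**Patient Name:** Juante Cohenour  \n"
    "**Date of Birth:** September 12, 1964  \n"
    "**Medical Record Number:** DN-GU-2330504  \n"
)

for before, after in changed_lines(original_note, redacted_note):
    print(f"{before.strip()}  ->  {after.strip()}")
```

In the notebook, the same helper could be called with `example['note']` and `json.loads(response.redacted_text)['note']`.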

# We examine the data before and after de-identification

In [15]:
data_df = pd.DataFrame([row['encounter_data'] for row in ds['train']])
data_df

| | age | encounter_type | gender | medications | name | procedures | race | reason |
|---|---|---|---|---|---|---|---|---|
| 0 | 59 | encounter for check up | male | [] | Harvey D'Amore | [dental consultation and report, dental care (... | white | Gingivitis |
| 1 | 62 | encounter for check up | male | [sodium fluoride 0.0272 MG/MG Oral Gel] | Harvey D'Amore | [dental consultation and report, dental care (... | white | Patient referral for dental care (procedure) |
| 2 | 3 | emergency room admission | male | [] | Harvey D'Amore | [] | white | Seizure disorder |
| 3 | 53 | encounter for check up | male | [] | Harvey D'Amore | [dental consultation and report, dental care (... | white | Gingivitis |
| 4 | 41 | consultation for treatment | female | [Jolivette 28 Day Pack] | Kathern Laurence Nader | [] | unknown | Contraception care (regime/therapy) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1023 | 83 | admission to intensive care unit | male | [10 ML Furosemide 10 MG/ML Injection, 10 ML Fu... | Brent Considine | [medication reconciliation, electrocardiograph... | white | Chronic congestive heart failure |
| 1024 | 83 | encounter for check up | male | [sodium fluoride 0.0272 MG/MG Oral Gel] | Brent Considine | [dental consultation and report, dental care (... | white | Gingivitis |
| 1025 | 84 | death certification | male | [] | Brent Considine | [] | white | Chronic congestive heart failure |
| 1026 | 84 | admission to hospice | male | [] | Brent Considine | [certification procedure, notifications, initi... | white | Chronic congestive heart failure |


In [16]:
data_df['name'].value_counts()

name
Letha Kathern Hirthe           294
Cayla Abshire                   85
Sharri Lue Feest                81
Ferdinand Angel Fahey           70
Brent Considine                 63
Sharika Walker                  52
Valarie Laquita Oberbrunner     39
Leopoldo Roy Turcotte           39
Elvira Kozey                    36
Loree Cristal Maggio            29
Zack Jean Skiles                28
Phillip Dusty Streich           19
Suzanna Karissa Fay             16
Emery Sanford                   15
Lenita Zenobia Metz             14
Jennie Kuvalis                  13
Doreatha Karly Ziemann          13
Emilee Magali Beier             11
Phoebe Kattie Barton            10
Homer Herbert Bartoletti         9
Demarcus Kris                    9
Carey Kennith Kulas              8
Ivonne Robyn Kuhn                8
Ronny Elbert Pfeffer             7
Cordell Mel Jerde                7
Zachariah Antonia Labadie        7
Horace Schaden                   6
Deetta Jann Hudson               5
Melanie Dion Em

## Iterate through the dataset
Use `concurrent.futures` to parallelize requests

In [17]:
from tqdm.notebook import tqdm
import concurrent.futures

def process_row(seed, row):
  """De-identify a single dataset row and return it as a dictionary."""
  synth_encounter_data = json.loads(textual.redact_json(
      row,
      generator_config=generator_config,
      generator_metadata=generator_metadata,
      generator_default='Off',
      random_seed=seed).redacted_text)
  return synth_encounter_data

synthetic_rows = []
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:  # Adjust max_workers as needed
  # The seed is passed as None, so random_seed is left unset for every row
  futures = [executor.submit(process_row, None, row) for row in ds['train']]
  # as_completed yields results in completion order, not dataset order
  for future in tqdm(concurrent.futures.as_completed(futures), total=len(ds['train'])):
    synthetic_rows.append(future.result())
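
Because `as_completed` yields results in completion order, `synthetic_rows` is not guaranteed to follow the order of `ds['train']`. That does not matter for training, but if you need row-aligned output, `executor.map` returns results in input order. The sketch below demonstrates this with a stand-in function, `fake_redact` (hypothetical), in place of the Textual call.

```python
import concurrent.futures
import time

def fake_redact(row):
    """Stand-in for a network call; later rows finish sooner to force reordering."""
    time.sleep(0.01 * (3 - row["id"]))
    return {"id": row["id"], "note": row["note"].upper()}

rows = [{"id": i, "note": f"note {i}"} for i in range(3)]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() returns results in the order of the inputs, regardless of finish time
    ordered = list(executor.map(fake_redact, rows))

print([r["id"] for r in ordered])
```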



In [18]:
synth_df = pd.DataFrame([row['encounter_data'] for row in synthetic_rows])
synth_df['name'].value_counts()

name
Mistie Swarner         294
Humayra Wilkosz         85
Lorie Abbinanti         81
Ronny Zobel             70
Kennet Igoe             63
Larica Strate           52
Laraine Vanschuyver     39
Cherly Pochiba          39
Kalona Needs            36
Maye Bruchman           29
Carlotta Ryckman        28
Delbert Kraichely       19
Rebekah Grom            16
Teena Kristan           15
Aron Sontag             14
Watha Brierton          13
Hipolito Sartori        13
Everett Pavelka         11
Jeanett Landgraf        10
Erik Froman              9
Jonathaon Demler         9
Breanna Kinan            8
Lissette Gootz           8
Carry Widlak             7
Rachal Arps              7
Marla Troth              7
Srisai Kriz              6
Cole Cleckner            5
Lavel Bennice            4
Norberto Saddler         4
Candra Sanpson           4
Eulah Przekop            3
Randi Linman             3
Queen Wegley             3
Felisa Przekop           2
Delphine Dyda            2
Francesca Nygren       
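
The matching counts (294, 85, 81, ...) suggest that each original name was replaced by a single synthetic name. A check along the following lines can confirm that the count distribution is preserved. The lists are tiny illustrative stand-ins for `data_df['name']` and `synth_df['name']`, and `same_count_distribution` is a hypothetical helper.

```python
from collections import Counter

def same_count_distribution(original_names, synthetic_names):
    """True if both name lists have the same multiset of per-name counts."""
    original_counts = sorted(Counter(original_names).values())
    synthetic_counts = sorted(Counter(synthetic_names).values())
    return original_counts == synthetic_counts

# Illustrative stand-ins using names from the outputs above
original_names = ["Letha Kathern Hirthe"] * 3 + ["Cayla Abshire"] * 2
synthetic_names = ["Mistie Swarner"] * 3 + ["Humayra Wilkosz"] * 2

print(same_count_distribution(original_names, synthetic_names))
```

The comparison is based on counts only, so it does not depend on the two lists being in the same row order.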

# Prepare for fine-tuning
We will use a small LLM to extract structured outputs from the unstructured notes, using the pydantic classes below as models for the output. We include detailed instructions about the schema in our prompt, and prepare our *synthetic training data* by turning our synthesized notes and encounter info into `prompt` and `completion` data.


In [19]:
from typing import List, Optional
from enum import Enum, IntEnum
from pydantic import BaseModel

class GenderIdentityEnum(str, Enum):
    """HL7 THO ‘Gender Identity’ value‑set."""
    FEMALE = "female"        # SNOMED CT – Identifies as female gender (finding)
    MALE = "male"          # SNOMED CT – Identifies as male gender (finding)
    NON_BINARY = "non-binary"     # SNOMED CT – Identifies as gender non‑binary
    ASKED_DECLINED = "asked-declined" # Data‑Absent‑Reason – Asked but declined to answer
    UNKNOWN = "unknown"                   # v3 NullFlavor – Applicable but not known

class FhirUsCoreRaceEnum(str, Enum): # https://www.hl7.org/fhir/us/core/StructureDefinition-us-core-race.html
    """HL7 'US-Core-Race' value-set"""
    WHITE = 'white'
    ASIAN = 'asian'
    AFRICAN_AMERICAN = 'black or african american'
    NATIVE_HAWAIIAN = 'native hawaiian or other pacific islander'
    AMERICAN_INDIAN = 'american indian or alaska native'
    UNKNOWN = 'unknown'

class EncounterData(BaseModel):
    name: str
    age: int
    race: FhirUsCoreRaceEnum
    gender: GenderIdentityEnum
    reason: Optional[str]
    procedures: List[str]
    medications: List[str]
    encounter_type: str

prompt_fmt = """You are FHIR‑Note‑Extractor v1.
Your sole task is to read a free‑text clinical note and emit only a compact JSON object that validates against the following Pydantic schema:

```python
from typing import List, Optional
from enum import Enum
from pydantic import BaseModel

class GenderIdentityEnum(str, Enum):          # HL7 THO “Gender Identity”
    FEMALE = "female"
    MALE = "male"
    NON_BINARY = "non-binary"
    ASKED_DECLINED = "asked-declined"
    UNKNOWN = "unknown"

class FhirUsCoreRaceEnum(str, Enum):          # HL7 US‑Core‑Race
    WHITE = "white"
    ASIAN = "asian"
    AFRICAN_AMERICAN = "black or african american"
    NATIVE_HAWAIIAN = "native hawaiian or other pacific islander"
    AMERICAN_INDIAN = "american indian or alaska native"
    UNKNOWN = "unknown"

class EncounterData(BaseModel):
    name: str                    # Full patient name
    age: int                     # Age in years at encounter
    race: FhirUsCoreRaceEnum     # One of the enum literals above
    gender: GenderIdentityEnum   # One of the enum literals above
    reason: Optional[str]        # Short (< 10 words) diagnosis/complaint, title‑case
    procedures: List[str]        # List of distinct procedures in sentence case
    medications: List[str]       # Generic drug names or [] if none
    encounter_type: str          # SNOMED‑style phrase, kebab or snake allowed
```
Extraction rules

Age – derive from “Date of Birth” and “Date of Encounter” when both are present; else use explicit age statements (e.g. “59‑year‑old”).

Race & gender – map author wording to the exact enum literal.
If absent or ambiguous, choose "unknown" (or "asked‑declined" if note explicitly says so).

Reason – the primary diagnosis or chief complaint, ≤ 10 words, Title‑Case (e.g. “Gingivitis”).

Procedures – every billed or documented procedure, sentence‑case, deduplicated, stripped of trailing punctuation.

Medications – all meds started, stopped, or continued; generic names only; do not infer drugs not mentioned. Empty list if none.

encounter_type – one concise phrase such as "encounter for check up" or "emergency department visit".

Use Unicode apostrophes and normal ASCII quotes.

Output format – return only the JSON, no Markdown fence, no commentary, no <output> tags. Keys and enum values must be double‑quoted.

If any required field truly cannot be filled, substitute with the appropriate unknown literal (null only for reason).
Return nothing but the schema‑compliant JSON.

## Note:
{note}
"""


def formatting_prompts_func(example):
    """Expected format for instruction fine-tuning"""
    return {
        'dialog': [
            {
                'content': prompt_fmt.format(note=example['note']),
                'role': 'user'
            },
            {
                'content': json.dumps(example['encounter_data'], indent=2),
                'role': 'assistant'
            }
        ]
    }
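
The age rule in the prompt can be sanity-checked against the example note, which gives a date of birth of September 12, 1964 and an encounter date of October 18, 2023, and describes the patient as 59 years old. The helper `age_at_encounter` is an illustrative sketch and is not part of the notebook's pipeline.

```python
from datetime import date

def age_at_encounter(date_of_birth: date, encounter_date: date) -> int:
    """Whole years elapsed between birth and encounter."""
    had_birthday = (encounter_date.month, encounter_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return encounter_date.year - date_of_birth.year - (0 if had_birthday else 1)

print(age_at_encounter(date(1964, 9, 12), date(2023, 10, 18)))  # 59
```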



## Dataset creation
We use the de-identified data for training, but evaluate on real data.

In [20]:
from datasets import Dataset
synth_dataset = Dataset.from_list(synthetic_rows)
synth_dataset = synth_dataset.map(formatting_prompts_func).remove_columns(['note', 'encounter_data'])
val_dataset = ds['validation'].map(formatting_prompts_func).remove_columns(['note', 'encounter_data'])


## Put train/val on S3 for SageMaker JumpStart

In [21]:
from sagemaker_studio import Project

project = Project()

In [22]:
import os
import posixpath
from typing import List, Dict

def write_bytes_to_subdir(s3_dir: str, subpath: str, json_bytes: bytes) -> str:
    """Upload bytes to `subpath` under the `s3://bucket/prefix` directory and return the object's URI."""
    bucket_name = s3_dir.split('/')[2]
    # S3 keys use forward slashes on every platform, so join with posixpath
    object_key = posixpath.join('/'.join(s3_dir.split('/')[3:]), subpath)
    s3 = boto3.client('s3')
    s3.put_object(Bucket=bucket_name,
          Key=object_key,
          Body=json_bytes,
          ContentType="application/json")
    return f's3://{bucket_name}/{object_key}'

def write_jsonl_to_subdir(s3_dir: str, subpath: str, records:List[Dict]) -> str:
    jsonl_str = "\n".join(json.dumps(rec) for rec in records)  # single string
    jsonl_bytes = jsonl_str.encode("utf-8")                    # S3 needs bytes
    return write_bytes_to_subdir(s3_dir, subpath, jsonl_bytes)

def write_json_to_subdir(s3_dir: str, subpath: str, record: Dict)  -> str:
    json_str = json.dumps(record)
    json_bytes = json_str.encode("utf-8")                    # S3 needs bytes
    return write_bytes_to_subdir(s3_dir, subpath, json_bytes)
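
The helpers above split the S3 URI by hand. An equivalent way to separate bucket and key is `urllib.parse.urlparse`, sketched below. `split_s3_uri` and the example URI are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def split_s3_uri(s3_dir: str, subpath: str):
    """Return (bucket, key) for `subpath` placed under the `s3://bucket/prefix` directory."""
    parsed = urlparse(s3_dir)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {s3_dir!r}")
    prefix = parsed.path.strip("/")
    # S3 keys always use forward slashes, so join with posixpath rather than os.path
    key = posixpath.join(prefix, subpath)
    return parsed.netloc, key

print(split_s3_uri("s3://example-bucket/project/dev", "train/train.jsonl"))
```

Validating the scheme up front gives a clearer error than an `IndexError` from a failed string split.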



In [23]:
train_data_uri = write_jsonl_to_subdir(project.s3.root, 'train/train.jsonl', synth_dataset)
train_data_uri

's3://amazon-sagemaker-543337415716-us-east-1-05bc2f6baf22/dzd_b63q0xem4lb3dz/5rxukcpny646nb/dev/train/train.jsonl'
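
Each line of `train.jsonl` is one JSON object holding a `dialog` list, as produced by `formatting_prompts_func`. The sketch below shows the serialization and how to read it back. The two records are shortened placeholders, not real rows from the dataset.

```python
import json

# Placeholder records in the same shape as the rows of synth_dataset
records = [
    {"dialog": [{"content": "prompt 1", "role": "user"},
                {"content": "{\"age\": 59}", "role": "assistant"}]},
    {"dialog": [{"content": "prompt 2", "role": "user"},
                {"content": "{\"age\": 62}", "role": "assistant"}]},
]

# Serialize: one JSON document per line (json.dumps escapes embedded newlines)
jsonl_str = "\n".join(json.dumps(rec) for rec in records)

# Parse it back line by line to confirm the content round-trips
parsed = [json.loads(line) for line in jsonl_str.split("\n")]
print(len(parsed), parsed[0]["dialog"][1]["role"])
```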

## Start SageMaker JumpStart training
### We fine-tune Llama 3.2 3B Instruct using SageMaker JumpStart

![Using jumpstart](select_jumpstart.png)

### Fine-tuning on our synthetic training data is as easy as selecting Llama 3.2 3B Instruct, clicking Train, 
![Select Llama](select_llama.png)
### and setting the training dataset S3 location to our `train_data_uri`
![Configure train](jumpstart_config.png)
### We use the default training parameters, which quickly train a LoRA adapter on top of Llama 3.2 3B on a `g5.2xlarge` in ~30 minutes. Once the job is completed, we deploy it to a SageMaker JumpStart endpoint by clicking `Deploy`
![Deploy](deploy_fine_tune.png)

## Evaluate trained model
Now we select an example from our real validation data (not seen during training) and run it through our new SageMaker JumpStart endpoint to evaluate the model. This quick check shows that our model has indeed learned how to extract structured data from the unstructured medical notes, without relying on sensitive data during training!

In [None]:
example = val_dataset[0] # pull out example of "real" data

In [83]:
def make_prompt(input_message: Dict) -> str:
    """Takes message turn and formats it according to Llama spec"""
    user_str = input_message['content']
    return f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user_str}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"""
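
`make_prompt` handles a single user turn. If you later need a system message or several turns, the same Llama 3 header tokens repeat for each message. The sketch below generalizes the format. `make_chat_prompt` is a hypothetical helper, so verify the result against the chat template of the model you deploy.

```python
from typing import Dict, List

def make_chat_prompt(messages: List[Dict[str, str]]) -> str:
    """Format a list of {'role', 'content'} messages using Llama 3 header tokens."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open an assistant turn so the model generates the reply
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"

print(make_chat_prompt([{"role": "user", "content": "Hello"}]))
```

For a single user message, this produces the same string as `make_prompt`.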

In [84]:
from sagemaker.predictor import retrieve_default
endpoint_name = "jumpstart-dft-llama-3-2-3b-instruct-20250715-225729"
predictor = retrieve_default(endpoint_name)
payload = {
    "inputs": make_prompt(example['dialog'][0]),
    "parameters": {
        "max_new_tokens": 1025,
        "top_p": 0.9,
        "temperature": 0.0
    },
    "environment": {"accept_eula": "true"}
}
response = predictor.predict(payload)


In [91]:
print(response['generated_text'])

{
  "age": 5,
  "encounter_type": "encounter for symptom",
  "gender": "male",
  "medications": null,
  "name": "Marquis Dibbert",
  "procedures": null,
  "race": "black or african american",
  "reason": "Viral sinusitis"
}


In [92]:
print(example['dialog'][-1]['content'])

{
  "age": 5,
  "encounter_type": "encounter for symptom",
  "gender": "male",
  "medications": [],
  "name": "Marquis Lon Dibbert",
  "procedures": [],
  "race": "black or african american",
  "reason": "Viral sinusitis"
}
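
To go beyond eyeballing the two outputs, a field-by-field comparison makes the differences explicit. The sketch below uses the two JSON objects printed above. `diff_fields` is an illustrative helper.

```python
import json

def diff_fields(predicted: dict, expected: dict):
    """Return the sorted field names whose values differ between the two records."""
    keys = set(predicted) | set(expected)
    return sorted(k for k in keys if predicted.get(k) != expected.get(k))

# Model output and reference answer, copied from the two cells above
predicted = json.loads("""{
  "age": 5, "encounter_type": "encounter for symptom", "gender": "male",
  "medications": null, "name": "Marquis Dibbert", "procedures": null,
  "race": "black or african american", "reason": "Viral sinusitis"
}""")
expected = json.loads("""{
  "age": 5, "encounter_type": "encounter for symptom", "gender": "male",
  "medications": [], "name": "Marquis Lon Dibbert", "procedures": [],
  "race": "black or african american", "reason": "Viral sinusitis"
}""")

print(diff_fields(predicted, expected))
```

In the notebook, `predicted` would be `json.loads(response['generated_text'])` and `expected` would be `json.loads(example['dialog'][-1]['content'])`.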


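### Success! The model's output matches the reference on age, encounter type, gender, race, and reason; it differs only in dropping the middle name and returning `null` instead of `[]` for the empty medication and procedure lists.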