https://huggingface.co/tanaos/tanaos-text-anonymizer-v1

https://docs.tanaos.com/artifex/text-anonymization/inference/

## Model Description
Base model: FacebookAI/roberta-base
Task: Token classification (Named Entity Recognition for Text Anonymization)
Languages: English (see the fine-tuning section to adapt to other languages)
Fine-tuning data: A synthetic, custom dataset of around 10,000 passages, each containing multiple named entities across 5 Personally Identifiable Information (PII) categories.

## Training Details
This model was trained using the Artifex Python library.

## Intended Uses
This model is intended to:

- Anonymize text data by redacting personally identifiable information (PII) such as names, addresses, phone numbers, dates, and locations.
- Ensure privacy and confidentiality in text data for compliance with data protection regulations.
- Be used before sharing or processing text data to protect sensitive information.
- Support GDPR compliance when handling personal data.

Not intended for:

- Scenarios involving highly specialized or domain-specific text without further fine-tuning.

| Entity | Description |
|---|---|
| PERSON | Individual people, fictional characters |
| LOCATION | Geographical areas |
| DATE | Absolute or relative dates, including years, months and/or days |
| ADDRESS | Full addresses |
| PHONE_NUMBER | Telephone numbers |
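
The five labels in the table are the only ones the model card lists. A minimal sketch of a client-side check on the `entities_to_mask` values before sending a request; the helper name `check_entities` is illustrative, and whether the service itself rejects unknown labels is not known from this notebook.

```python
# Entity labels copied from the table in the model card.
SUPPORTED_ENTITIES = {"PERSON", "LOCATION", "DATE", "ADDRESS", "PHONE_NUMBER"}


def check_entities(entities):
    """Return the labels as a list, or raise ValueError naming unsupported ones."""
    unknown = sorted(set(entities) - SUPPORTED_ENTITIES)
    if unknown:
        raise ValueError(f"Unsupported entity labels: {unknown}")
    return list(entities)


print(check_entities(["PERSON", "PHONE_NUMBER"]))
```

A label such as `"DNI"` would raise here, which makes it explicit that national ID numbers are outside the model's categories.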

In [18]:
%%capture
%pip install artifex

In [19]:
%%capture
from artifex import Artifex

In [20]:
%%capture
# Fix: uninstall the current protobuf and install a compatible version
# %pip uninstall -y protobuf
%pip install protobuf==6.33.2
# 3.20.0

In [21]:
import requests
session = requests.Session()

In [22]:
with open(r"C:\Users\luisa\OneDrive\Documentos\RepositorioVS_Github\probando_anonimizadores\api.txt", "r", encoding="utf-8") as f:
    apy_key = f.read().strip()  # drop surrounding whitespace/newlines, which are invalid in a header value


In [32]:
ta_out = session.post(
    "https://slm.tanaos.com/models/text-anonymization",
    headers={
        "X-API-Key": apy_key,
    },
    json={
        "text": "John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567. His sister Jane Parker lives at 456 Oak Ave, Los Angeles. Her phone number is (555) 987-6543.",
        "entities_to_mask": ["PERSON", "LOCATION", "ADDRESS", "PHONE_NUMBER"],
        # "mask_token": "[ANONIMIZADO]",
        "include_mask_types": True,
        "include_mask_counter": True,
    },
)

print(ta_out.json()["data"])


['[MASKED] lives at [MASKED] [MASKED] His phone number is [MASKED] His sister [MASKED] lives at [MASKED] [MASKED] Her phone number is [MASKED]']


Neither the counter nor the types option works: every entity comes back as a plain [MASKED] token.
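
One way to check this observation is to count the distinct bracketed tokens in the returned text. The sketch below is only a local check on the response string; the exact format that typed or numbered masks would take is not known from this notebook, so the pattern simply matches any `[...]` token.

```python
import re
from collections import Counter


def mask_token_counts(masked_text):
    """Count each distinct bracketed token, such as '[MASKED]', in a string."""
    return Counter(re.findall(r"\[[^\[\]]+\]", masked_text))


sample = "[MASKED] lives at [MASKED] [MASKED] His phone number is [MASKED]"
print(mask_token_counts(sample))
```

If the result has a single key, neither `include_mask_types` nor `include_mask_counter` changed the output.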
Next, I test with a file of my own containing made-up data.

In [24]:
with open(r"C:\Users\luisa\OneDrive\Documentos\RepositorioVS_Github\probando_anonimizadores\testtxt.txt", "r", encoding="utf-8") as f:
    texto = f.read()    
texto

'3507 Robertini Ignacio  (36)\nDNI: 45317373  - HC:26086  - FI: 14/2/25 - FICM:  17/2/25\n TPE\nTel. de contacto: JUAN padre: 1535639098\nPENDIENTES:\nLaboratorio \nCultivos\nEstudios\nInterconsultas\nOtros\nALTA 18/02\nFNac  7/6/1998- Cobertura:  - Alergias:   \nSIN AISLAR\nSocial \nAP:  consumo problamatico de sustancias de 7 años de evolucion (cocaina esnifada y marihuana) internado en centro de rehab desde 28/1, intento de suicidio en adolescencia con requerimiento de internación, sobrepeso \nMH:  Divalproato Na 500MG/6hs , Risperidona 3 mg/23hs , Lorazepam 2,5 mg/6gs , Quetiapina 150 mg/dia         \nINGRESO:         MI: laboratorio control, posterior a epidosio de excitacion psicomotriz 72 horas previas, CPK de 41060, asintomatico. Se realiza laboratorio que evidencia  CPK de 28847, GOT 363 y GPT 131. Screening toxicologico en orina positivo para benzodiacepinas. Sedimento urinario no inflamatorio. Se inicia tratamiento con plan de hidratacion parenteral amplio. Por buena evoluci

In [None]:
ta_out2 = session.post(
    "https://slm.tanaos.com/models/text-anonymization",
    headers={
        "X-API-Key": apy_key,
    },
    json={"text": texto},
)

# print(ta_out2.json()["data"])
rta = ta_out2.json()["data"]
rta



['3507 [MASKED] [MASKED]  (36)\nDNI: 45317373  - HC:26086  - FI: [MASKED] - FICM:  17/2/25\n TPE\nTel. de contacto: JUAN padre: 1535639098\nPENDIENTES:\nLaboratorio \nCultivos\nEstudios\nInterconsultas\nOtros\nALTA 18/02\nFNac  7/6/1998- Cobertura:  - Alergias:   \nSIN [MASKED] \nAP:  consumo problamatico de sustancias de [MASKED] años de [MASKED] [MASKED] y [MASKED] internado en [MASKED] de [MASKED] desde [MASKED] intento de suicidio en [MASKED] con requerimiento de internación, sobrepeso \nMH:  [MASKED] Na 500MG/6hs , [MASKED] 3 mg/23hs , [MASKED] 2,5 mg/6gs , [MASKED] 150 mg/dia         \nINGRESO:         MI: laboratorio control, posterior a epidosio de excitacion [MASKED] [MASKED] horas previas, CPK de [MASKED] asintomatico. Se realiza laboratorio que evidencia  CPK de [MASKED] [MASKED] [MASKED] y [MASKED] [MASKED] Screening toxicologico en [MASKED] para [MASKED] Sedimento urinario no inflamatorio. Se inicia tratamiento con plan de [MASKED] parenteral amplio. Por buena evolucion, i
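
The request cells in this notebook repeat the same call. A minimal sketch of a helper, assuming the endpoint and `X-API-Key` header used in those cells; the name `anonymize` and the `timeout` default are illustrative choices, not part of the Tanaos API.

```python
# Endpoint copied from the request cells in this notebook.
API_URL = "https://slm.tanaos.com/models/text-anonymization"


def anonymize(session, api_key, text, timeout=30, **options):
    """POST `text` plus any extra options and return the "data" field of the reply.

    `session` is expected to be a requests.Session (or any object with a
    compatible `post` method).
    """
    payload = {"text": text, **options}
    response = session.post(
        API_URL,
        headers={"X-API-Key": api_key},
        json=payload,
        timeout=timeout,
    )
    # Surface HTTP errors (bad key, quota, server error) instead of failing on .json().
    response.raise_for_status()
    return response.json()["data"]
```

With the objects defined earlier, a call would look like `anonymize(session, apy_key, texto, mask_token="[ANONIMIZADO]")`.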

I add parameters to customize the mask token.

In [26]:
ta_out3 = session.post(
    "https://slm.tanaos.com/models/text-anonymization",
    headers={
        "X-API-Key": apy_key,
    },
    json={
        "text": texto,
        "entities_to_mask": ["PERSON", "LOCATION", "ADDRESS", "PHONE_NUMBER"],
        "mask_token": "[ANONIMIZADO]",
        "include_mask_types": True,
        "include_mask_counter": True,
    },
)

rta3 = ta_out3.json()["data"]
rta3

['3507 [ANONIMIZADO] [ANONIMIZADO]  (36)\nDNI: 45317373  - HC:26086  - FI: 14/2/25 - FICM:  17/2/25\n TPE\nTel. de contacto: JUAN padre: 1535639098\nPENDIENTES:\nLaboratorio \nCultivos\nEstudios\nInterconsultas\nOtros\nALTA 18/02\nFNac  7/6/1998- Cobertura:  - Alergias:   \nSIN [ANONIMIZADO] \nAP:  consumo problamatico de sustancias de 7 años de [ANONIMIZADO] [ANONIMIZADO] y [ANONIMIZADO] internado en [ANONIMIZADO] de [ANONIMIZADO] desde 28/1, intento de suicidio en [ANONIMIZADO] con requerimiento de internación, sobrepeso \nMH:  [ANONIMIZADO] Na 500MG/6hs , [ANONIMIZADO] 3 mg/23hs , [ANONIMIZADO] 2,5 mg/6gs , [ANONIMIZADO] 150 mg/dia         \nINGRESO:         MI: laboratorio control, posterior a epidosio de excitacion [ANONIMIZADO] 72 horas previas, CPK de 41060, asintomatico. Se realiza laboratorio que evidencia  CPK de [ANONIMIZADO] [ANONIMIZADO] [ANONIMIZADO] y [ANONIMIZADO] [ANONIMIZADO] Screening toxicologico en [ANONIMIZADO] para [ANONIMIZADO] Sedimento urinario no inflamatorio
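
To see exactly which words were replaced, the original and masked strings can be compared token by token. This is a rough sketch using `difflib` on whitespace-split tokens; the helper name `masked_spans` is hypothetical, and the alignment is approximate when long runs of text are masked.

```python
from difflib import SequenceMatcher


def masked_spans(original, masked, mask_token="[MASKED]"):
    """Return the runs of original words that were replaced by mask tokens."""
    src, dst = original.split(), masked.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" block whose new side contains the mask token is a masked run.
        if tag == "replace" and mask_token in dst[j1:j2]:
            spans.append(" ".join(src[i1:i2]))
    return spans


original = "John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567."
masked = "[MASKED] lives at [MASKED] [MASKED] His phone number is [MASKED]"
print(masked_spans(original, masked))
```

For the Spanish record, passing `mask_token="[ANONIMIZADO]"` would list the clinical terms that were masked by mistake alongside the real identifiers.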

The result is not fully effective: it did not anonymize the DNI (national ID number), the father's name, or his phone number.
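
A possible stopgap for the numeric identifiers is a regex pass applied after the model. The patterns below are hypothetical and tuned only to the sample record in this notebook (a 7 to 8 digit number right after `DNI: ` and a bare 10-digit phone number); they are not part of Artifex or the Tanaos API, and they do nothing for names such as the father's.

```python
import re

# Hypothetical patterns, based only on the sample record in this notebook.
EXTRA_PATTERNS = [
    re.compile(r"(?<=DNI: )\d{7,8}\b"),  # ID number directly after "DNI: "
    re.compile(r"\b\d{10}\b"),           # bare 10-digit phone number
]


def mask_extra(text, mask_token="[MASKED]"):
    """Replace every match of the extra patterns with the mask token."""
    for pattern in EXTRA_PATTERNS:
        text = pattern.sub(mask_token, text)
    return text


sample = "DNI: 45317373  - HC:26086\nTel. de contacto: JUAN padre: 1535639098"
print(mask_extra(sample))
```

Shorter numbers such as the record number after `HC:` are left untouched by these patterns, so each additional identifier type would need its own rule.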

# How to fine-tune (without training data)
Use the Artifex library to fine-tune the model for a language other than English or for a custom domain. Artifex generates the synthetic training data on the fly, so no training data is needed. Install the Artifex library first (`pip install artifex`).


## Fine-tune to any language


In [None]:
from artifex import Artifex

ta_sp = Artifex(apy_key).text_anonymization


RecursionError: maximum recursion depth exceeded while calling a Python object

In [37]:
model_output_path = "./output_model/"

# Fine-tune on synthetic Spanish medical documents.
ta_sp.train(
    domain="documentos medicos en Español",
    language="spanish",
    output_path=model_output_path
)

# Load the fine-tuned model and run it on a sample sentence.
ta_sp.load(model_output_path)
print(ta_sp("El paciente John Doe visitó Nueva York el 12 de marzo de 2023 a las 10:30 a. m."))

# >>> ["El paciente [MASKED] visitó [MASKED] el [MASKED] a las [MASKED]."]


NameError: name 'ta_sp' is not defined

This NameError follows from the RecursionError above: the cell that creates `ta_sp` failed, so the name was never assigned.