A de-identification toolkit for clinical text in Hebrew.
An improved version of Microsoft's HebSafeHarbor project.
HebSafeHarbor was developed according to the requirements described in the about_hebsafeharbor file.
The toolkit integrates open-source libraries and assets, including HebSpacy (which runs an NER model based on AlephBERT, NEMO and BMC), Presidio, Wikipedia, and public lexicons.
Make sure you have Anaconda installed on your computer.
- Unpack the packedhebsafeharbor.zip environment file to where you want the environment to go (usually at C:\Users\..\Anaconda3\envs).
- Open the standard Command Prompt (cmd.exe) on Windows.
- Run:
  cd packedhebsafeharbor
  .\Scripts\activate.bat
- The prompt should tell you that you're in the right environment after the last step, so you should see something like the following:
  (packedhebsafeharbor) C:\Some\Path\Where\Your\Environment\Is>
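Besides reading the prompt, the activation can be double-checked from Python. The following is only a minimal sketch: it assumes the unpacked folder is still named packedhebsafeharbor and that the interpreter prefix ends with that folder name. The helper `env_name` is hypothetical and not part of HebSafeHarbor.

```python
import sys

# Assumption: the environment folder kept the name used in the unpack step.
EXPECTED_ENV = "packedhebsafeharbor"


def env_name(prefix: str) -> str:
    """Return the last path component of an interpreter prefix."""
    # Normalize Windows backslashes so the check behaves the same on any OS.
    return prefix.replace("\\", "/").rstrip("/").split("/")[-1]


if __name__ == "__main__":
    if env_name(sys.prefix) == EXPECTED_ENV:
        print("environment looks active:", sys.prefix)
    else:
        print("unexpected interpreter prefix:", sys.prefix)
```

If the second message appears, the activate script was most likely not run in the current Command Prompt window.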
In this case, you don't need to initialize the HebSafeHarbor object with any parameters.
Days in medical dates will change to <יום_>
from hebsafeharbor import HebSafeHarbor
# use default anonymization
hsh = HebSafeHarbor()
text = """שרון לוי התאשפזה ב02.02.2012 """
doc = {"text": text}
output = hsh([doc])
print(output[0].anonymized_text.text)
# > <שם_> התאשפזה ב<יום_>.02.2012
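The call above takes a list of `{"text": ...}` dictionaries, so several notes can be wrapped the same way before being passed in. The sketch below shows one possible helper; `to_docs` is a hypothetical name, not part of the library, and dropping blank notes is an illustrative choice (it means output positions no longer match the original list when blanks are present).

```python
from typing import Dict, Iterable, List


def to_docs(texts: Iterable[str]) -> List[Dict[str, str]]:
    """Wrap raw strings in the {"text": ...} shape used in the example above."""
    # Blank or whitespace-only notes are skipped (an assumption of this sketch).
    return [{"text": t} for t in texts if t and t.strip()]


notes = ["שרון לוי התאשפזה ב02.02.2012 ", "", "   "]
docs = to_docs(notes)
print(len(docs))  # 1

# With the toolkit installed, the docs would be passed in as in the example above:
# hsh = HebSafeHarbor()
# for out in hsh(docs):
#     print(out.anonymized_text.text)
```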
In this case, initialize the HebSafeHarbor object with a context.
Currently supported contexts: ['imaging', 'general', 'family']
This makes the anonymization adapt to the content domain of the text.
from hebsafeharbor import HebSafeHarbor
# use context anonymization
hsh = HebSafeHarbor(context='imaging')
text = """ .CT רופא מפנה: דוק טור, (123456) בדיקה: ט.מ צוואר """
doc = {"text": text}
output = hsh([doc])
print(output[0].anonymized_text.text)
# .CT רופא מפנה: <שם_>, (<מזהה_>) בדיקה: ט.מ צוואר
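Because only a fixed set of contexts is listed, it can help to validate a context name before constructing the object. This is a minimal sketch, assuming the library expects the lowercase names exactly as listed above; `check_context` and `SUPPORTED_CONTEXTS` are hypothetical names introduced here, not part of HebSafeHarbor.

```python
# Context names copied from the list of supported contexts above.
SUPPORTED_CONTEXTS = ("imaging", "general", "family")


def check_context(context: str) -> str:
    """Return a normalized context name, or raise if it is not supported."""
    normalized = context.strip().lower()
    if normalized not in SUPPORTED_CONTEXTS:
        raise ValueError(
            f"unsupported context {context!r}; expected one of {SUPPORTED_CONTEXTS}"
        )
    return normalized


print(check_context(" Imaging "))  # imaging

# With the toolkit installed:
# hsh = HebSafeHarbor(context=check_context("imaging"))
```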
In this case, initialize the HebSafeHarbor object with a date-shifting function and its additional parameters, passed together as a (function, parameters) tuple. The function signature should be:
def f(params: Any, date_string: str) -> Tuple[str, str, str]:
For example:
from hebsafeharbor import HebSafeHarbor
from datetime import datetime, timedelta
from dateutil import parser
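def shift_day(params, date_):
    # params[0] is the number of days to shift by
    date_obj = parser.parse(date_)
    new_date = date_obj + timedelta(days=params[0])
    # return the shifted date as (day, month, year) strings
    return str(new_date.day), str(new_date.month), str(new_date.year)
# use date-shifting anonymization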
hsh = HebSafeHarbor(shift_date_function=(shift_day, [17]))
text = """שרון לוי התאשפזה ב02.02.2012 """
doc = {"text": text}
output = hsh([doc])
print(output[0].anonymized_text.text)
# <שם_> התאשפזה ב19.02.2012
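Note that `dateutil`'s parser reads ambiguous dates month-first by default, so a string such as 05.03.2012 would be parsed as May 3. The variant below is a sketch that parses day-first explicitly using only the standard library. It assumes the date strings arrive in DD.MM.YYYY form, which this README does not confirm; `shift_day_first` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def shift_day_first(params: Sequence[int], date_string: str) -> Tuple[str, str, str]:
    """Shift a DD.MM.YYYY date by params[0] days; return (day, month, year)."""
    # Assumption: the input always uses the DD.MM.YYYY layout.
    date_obj = datetime.strptime(date_string, "%d.%m.%Y")
    new_date = date_obj + timedelta(days=params[0])
    return str(new_date.day), str(new_date.month), str(new_date.year)


print(shift_day_first([17], "02.02.2012"))  # ('19', '2', '2012')
print(shift_day_first([17], "20.02.2012"))  # ('8', '3', '2012')

# With the toolkit installed, it is passed the same way as shift_day:
# hsh = HebSafeHarbor(shift_date_function=(shift_day_first, [17]))
```

Unlike `parser.parse`, `strptime` raises a `ValueError` on any string that does not match the format, so inputs in other layouts would need their own handling.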
Current version: About version 2
Previous versions: About version 1
NLP capabilities are based on resources developed by the ONLP Lab, especially AlephBERT and NEMO.
HebSafeHarbor is an open-source project developed by 8400 The Health Network.