A de-identification toolkit for clinical text in Hebrew. An improved version of Microsoft's HebSafeHarbor project.

ChenMordehai/HebSafeHarbor_Clalit_Validation_Improvment

HebSafeHarbor - CLALIT Validation

(version 2)

A de-identification toolkit for clinical text in Hebrew.
An improved version of Microsoft's HebSafeHarbor project.

HebSafeHarbor was developed according to the requirements described in the file about_hebsafeharbor (read more here).

The toolkit integrates open-source libraries and assets, including HebSpacy (which runs an NER model based on AlephBERT, NEMO and BMC), Presidio, Wikipedia and public lexicons.

Establishing the work environment

Make sure you have Anaconda installed on your computer.

  1. Unpack the packedhebsafeharbor.zip environment file into the directory where you want the environment to live (usually C:\Users\..\Anaconda3\envs).

  2. Open the standard Command Prompt (cmd.exe) on Windows.

  3. Run:

    cd packedhebsafeharbor
    .\Scripts\activate.bat
  4. After the last step, the prompt should show the environment name, so you should see something like the following:

    (packedhebsafeharbor) C:\Some\Path\Where\Your\Environment\Is>
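
To double-check the activation from inside Python, a minimal sketch like the one below compares the interpreter's install prefix with the expected folder name. The helper names `active_env_name` and `in_expected_env` are hypothetical, and the check assumes the unpacked folder is still called `packedhebsafeharbor`, as in the steps above.

```python
import sys
from pathlib import Path


def active_env_name(prefix: str = sys.prefix) -> str:
    """Return the last path component of the interpreter's install prefix."""
    return Path(prefix).name


def in_expected_env(expected: str = "packedhebsafeharbor",
                    prefix: str = sys.prefix) -> bool:
    """True if the interpreter lives in a folder named `expected` (assumed name)."""
    return active_env_name(prefix) == expected


# Typical use inside the activated prompt:
#   print(active_env_name(), in_expected_env())
```

If the check returns False, the most likely cause is that a different Python interpreter is first on the PATH.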

Getting started

Use default anonymization

In this case, you don't need to pass any parameters when initializing the HebSafeHarbor object.
By default, the day component of medical dates is replaced with <יום_>.

from hebsafeharbor import HebSafeHarbor

# use default anonymization
hsh = HebSafeHarbor()

text = """שרון לוי התאשפזה ב02.02.2012 """
doc = {"text": text}

output = hsh([doc])

print(output[0].anonymized_text.text)

#  > <שם_> התאשפזה ב<יום_>.02.2012
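
Since the object is called with a list of documents, several texts can be anonymized in one call. The sketch below wraps that pattern in a hypothetical helper, `anonymize_texts`. It assumes only what the example above shows (documents are dicts with a `"text"` key, and each result exposes `anonymized_text.text`) and additionally assumes that results come back in input order.

```python
from typing import Callable, Iterable, List


def anonymize_texts(hsh: Callable, texts: Iterable[str]) -> List[str]:
    """Anonymize several texts in one call.

    `hsh` is expected to be an initialized HebSafeHarbor object (or any
    callable with the same interface). Assumes results keep input order.
    """
    docs = [{"text": t} for t in texts]  # same doc shape as the example above
    results = hsh(docs)
    return [r.anonymized_text.text for r in results]


# Typical use (requires the hebsafeharbor package):
#   from hebsafeharbor import HebSafeHarbor
#   anonymized = anonymize_texts(HebSafeHarbor(), [text_a, text_b])
```

Passing the object in as an argument keeps the helper independent of how HebSafeHarbor was configured.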

Use anonymization by context

In this case, initialize the HebSafeHarbor object with a context.
Currently supported contexts: ['imaging', 'general', 'family']
The anonymization then adapts itself to the content domain of the text.

from hebsafeharbor import HebSafeHarbor

# use context anonymization
hsh = HebSafeHarbor(context='imaging')

text = """  .CT רופא מפנה:  דוק טור, (123456)  בדיקה:   ט.מ צוואר  """
doc = {"text": text}

output = hsh([doc])

print(output[0].anonymized_text.text)

# .CT רופא מפנה:  <שם_>, (<מזהה_>)  בדיקה:   ט.מ צוואר  
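
When the context name comes from configuration or user input, it can help to check it against the supported list before constructing the object. The following is a small sketch under that assumption. `SUPPORTED_CONTEXTS` and `normalize_context` are hypothetical names that are not part of the library, and the list simply mirrors the contexts named above.

```python
# Contexts listed in this README; update if the library adds more.
SUPPORTED_CONTEXTS = ("imaging", "general", "family")


def normalize_context(context: str) -> str:
    """Strip and lowercase a context name, rejecting unsupported values."""
    name = context.strip().lower()
    if name not in SUPPORTED_CONTEXTS:
        raise ValueError(
            f"Unsupported context {context!r}; expected one of {SUPPORTED_CONTEXTS}"
        )
    return name


# Typical use (requires the hebsafeharbor package):
#   hsh = HebSafeHarbor(context=normalize_context(user_value))
```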

Using a custom date shifting/anonymization

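In this case, initialize the HebSafeHarbor object with a date-shifting function and its additional parameters. The function receives the parameters and the original date string, and returns the new date as a (day, month, year) tuple of strings. The function signature should be:

def f(params: Any, date_string: str) -> Tuple[str, str, str]:

For example: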

from hebsafeharbor import HebSafeHarbor
from datetime import timedelta
from dateutil import parser

def shift_day(params, date_):
    # parse day-first dates (e.g. 02.02.2012) and shift by params[0] days
    date_obj = parser.parse(date_, dayfirst=True)
    new_date = date_obj + timedelta(days=params[0])
    # return the shifted date as (day, month, year) strings
    return str(new_date.day), str(new_date.month), str(new_date.year)


# use a custom date-shifting function (shift every date by 17 days)
hsh = HebSafeHarbor(shift_date_function=(shift_day, [17]))

text = """שרון לוי התאשפזה ב02.02.2012 """
doc = {"text": text}

output = hsh([doc])

print(output[0].anonymized_text.text)

# <שם_> התאשפזה ב19.02.2012
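
As a variation on the example above, a shifting function can be written with the standard library alone. This sketch, with the hypothetical name `shift_day_padded`, follows the same `(params, date_string)` signature. It assumes that dates arrive in `DD.MM.YYYY` form and that zero-padded output is acceptable to the toolkit; neither assumption is confirmed by this README.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def shift_day_padded(params: Sequence[int], date_string: str) -> Tuple[str, str, str]:
    """Shift a DD.MM.YYYY date by params[0] days; return zero-padded parts."""
    # Assumption: the input format is DD.MM.YYYY; other formats raise ValueError.
    date_obj = datetime.strptime(date_string, "%d.%m.%Y")
    shifted = date_obj + timedelta(days=params[0])
    return f"{shifted.day:02d}", f"{shifted.month:02d}", f"{shifted.year:04d}"


# Typical use (requires the hebsafeharbor package):
#   hsh = HebSafeHarbor(shift_date_function=(shift_day_padded, [17]))
```

A strict format makes unexpected date strings fail loudly instead of being parsed by guesswork, which may or may not suit your data.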

Versions

current - About version 2

previous versions:
About version 1

Special Thanks

NLP capabilities are based on resources developed by ONLP Lab (the lab git), especially AlephBERT and NEMO.
HebSafeHarbor is an open-source project developed by 8400 The Health Network.



