# De-identification

## Installing the packages

### Notes:
* These packages require Python 3.7+ because they are written using dataclasses.
* We recommend creating a virtual environment and working inside it to avoid package compatibility problems.
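
A quick way to confirm the Python requirement before installing anything is to check the interpreter version from inside the notebook. The snippet below is a minimal sketch; `MIN_VERSION` and `meets_minimum` are hypothetical names introduced here for illustration and are not part of any of the packages.

```python
import sys

# Minimum version taken from the note above (dataclasses need Python 3.7+).
MIN_VERSION = (3, 7)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if meets_minimum():
    print("Python version is sufficient:", sys.version.split()[0])
else:
    print("Python 3.7 or newer is required, found:", sys.version.split()[0])
```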

In [None]:
# The first package analyzes the text and finds the locations of PII
!python -m pip install presidio_analyzer
# The second package anonymizes the PII
!python -m pip install presidio_anonymizer
# The third download is the large English spaCy model used by the analyzer (~580 MB, includes word embeddings)
!python -m spacy download en_core_web_lg
# The fourth package, Flair, adds a second NER model alongside spaCy for higher confidence (~2.3 GB)
!python -m pip install git+https://github.com/flairNLP/flair.git


In [None]:
# Cloning the GitHub repository
! git clone https://github.com/TRG-AI4Good/de-identifier.git
# Change directory to the package folder
%cd de-identifier/
# Installing the de-identifier package
!python -m pip install .

## Loading the CSVDeidentifier library

In [7]:
from csv_deidentifier import CSVDeidentifier

## Running the engine

### Supported Entities per NER:
#### spaCy:
* PERSON: A full person name, which can include first names, middle names or initials, and last names.
* PHONE_NUMBER: A telephone number.
* LOCATION: Name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).
* EMAIL_ADDRESS: An email address identifies an email box to which email messages are delivered.
* ... (you may see the list of all supported entities [here](https://microsoft.github.io/presidio/supported_entities/))

#### Flair:
* PERSON
* LOCATION
* ORGANIZATION



In [14]:
# Entity list for first analyzer
entity_list_1 = ['PERSON', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'LOCATION']
# Entity list for second analyzer
entity_list_2 = ['PERSON']

### Notes:
* Please change the path of the input CSV file if needed.
* The input CSV file has a header. (Please refer to example.csv)
* For the input_csv_path of the second analyzer, use the output_csv_path of the first analyzer.
  * The second analyzer will download ~2.3 GB of data on its first run.
* example_output.csv is the final output.
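
Since the input CSV is expected to have a header, it can help to look at the first row before running the engine. The following is a minimal sketch using only the standard library; `read_header` is a hypothetical helper, and the sample column names and text are made up for illustration rather than taken from example.csv.

```python
import csv
import io


def read_header(csv_file):
    """Return the first row of an open CSV file as a list of column names."""
    reader = csv.reader(csv_file)
    # next() with a default avoids StopIteration on an empty file.
    return next(reader, [])


# In-memory stand-in for a real file; the contents are invented.
sample = io.StringIO("id,comment\n1,John Smith called from Toronto\n")
print(read_header(sample))  # prints ['id', 'comment']
```

For a file on disk, the same helper could be called as `with open("example/example.csv", newline="") as f: print(read_header(f))`.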

In [None]:
# Creating an object of the class
analyzer1 = CSVDeidentifier(input_csv_path="example/example.csv", output_csv_path="example/example_step_1.csv", language="en", entities=entity_list_1)
# Running the engine
analyzer1.run()
# To release memory
del analyzer1

# Scrubbing again using flair NER
analyzer2 = CSVDeidentifier(input_csv_path="example/example_step_1.csv", output_csv_path="example/example_output.csv", language="en", entities=entity_list_2, add_flair=True)
# Running the engine
analyzer2.run()
# To release memory
del analyzer2

### Trying with lower case sentences

In [None]:
# Creating an object of the class
analyzer1 = CSVDeidentifier(input_csv_path="example/example.csv", output_csv_path="example/example_step_1_lower.csv", language="en", entities=entity_list_1, lower_case=True)
# Running the engine
analyzer1.run()
# To release memory
del analyzer1

# Scrubbing again using flair NER
analyzer2 = CSVDeidentifier(input_csv_path="example/example_step_1_lower.csv", output_csv_path="example/example_output_lower.csv", language="en", entities=entity_list_2, add_flair=True, lower_case=True)
# Running the engine
analyzer2.run()
# To release memory
del analyzer2

## Note:
* In our tests, the lower_case version appeared to perform better overall, but please verify this on your own data.
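
One rough way to check this yourself is to count the cells that differ between the two final outputs and then inspect those cells by hand. The sketch below assumes both files have the same rows and columns in the same order, and it compares cells case-insensitively on the assumption that the lower_case run may change letter case. `count_differing_cells` is a hypothetical helper, and the sample rows and placeholder tokens are invented for illustration.

```python
import csv
import io


def count_differing_cells(file_a, file_b):
    """Count cells that differ (ignoring case) between two same-shaped CSV files."""
    rows_a = list(csv.reader(file_a))
    rows_b = list(csv.reader(file_b))
    differing = 0
    # zip() stops at the shorter input, so extra rows or cells are ignored.
    for row_a, row_b in zip(rows_a, rows_b):
        for cell_a, cell_b in zip(row_a, row_b):
            if cell_a.lower() != cell_b.lower():
                differing += 1
    return differing


# In-memory stand-ins for the two output files; the contents are invented.
default_out = io.StringIO("id,comment\n1,<PERSON> called from toronto\n")
lower_out = io.StringIO("id,comment\n1,<PERSON> called from <LOCATION>\n")
print("Differing cells:", count_differing_cells(default_out, lower_out))
```

With real files, the two arguments would be the opened example/example_output.csv and example/example_output_lower.csv. A differing cell only shows that the two runs disagree, so which run is correct still needs a manual look.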