# De-identification

## Installing the packages

### Notes:
* These packages require Python 3.7+ because they use dataclasses.
* Creating a virtual environment and working inside it is recommended, as it avoids conflicts with other installed packages.
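
Both notes can be checked from Python before installing anything. The following is a minimal sketch using only the standard library; the helper names `meets_minimum_python` and `in_virtual_environment` are illustrative and not part of any package used here.

```python
import sys

# Hypothetical helpers for illustration; not part of Presidio or csv_deidentifier.
def meets_minimum_python(version_info=sys.version_info, minimum=(3, 7)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("Python 3.7+:", meets_minimum_python())
print("Virtual environment:", in_virtual_environment())
```

The second check only recognizes environments created with `venv` or a recent `virtualenv`; a conda environment would likely need a different test.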

In [None]:
# First package analyzes the text and locates the PII
!python -m pip install presidio_analyzer
# Second package anonymizes the detected PII
!python -m pip install presidio_anonymizer
# Third download is the large English spaCy model used by the analyzer (~580 MB, includes word embeddings)
!python -m spacy download en_core_web_lg
# Fourth package is Flair, an additional NER engine used alongside spaCy for higher confidence (~2.3 GB)
!python -m pip install git+https://github.com/flairNLP/flair.git
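
After the install commands above finish, it may be worth confirming that everything is importable before running the engine. The sketch below is one possible check, assuming the import names match the package and model names used in the commands; `missing_modules` is a hypothetical helper, not something these packages provide.

```python
import importlib.util

# Import names assumed from the install commands above.
REQUIRED_MODULES = [
    "presidio_analyzer",
    "presidio_anonymizer",
    "spacy",
    "en_core_web_lg",
    "flair",
]

def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found on the import path."""
    # find_spec locates a top-level module without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules()
print("Missing:", ", ".join(missing) if missing else "none")
```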


## Loading the CSVDeidentifier library

In [1]:
from csv_deidentifier import CSVDeidentifier

## Running the engine

### Notes:
* Change the path of the input CSV file to point to your own data.
* The input CSV file must have exactly one column and a header row (see example.csv).
* For the second analyzer, set input_csv_path to the output_csv_path of the first analyzer.
  * The second analyzer downloads ~2.3 GB of data on its first run.
* example_output.csv is the final output.
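
If your data is not yet in the one-column format described above, a file of that shape can be produced with the standard `csv` module. This is only a sketch: the file name `my_input.csv`, the header `text`, and the sample sentences are made up for illustration, and the header that example.csv actually uses may differ.

```python
import csv

# Hypothetical helper: writes one text value per row under a single header.
def write_single_column_csv(path, rows, header="text"):
    """Write `rows` to `path` as a one-column CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        for row in rows:
            # Each sentence is a single field; csv.writer quotes it if it contains commas.
            writer.writerow([row])

write_single_column_csv(
    "my_input.csv",
    [
        "Hello, my name is John Smith.",
        "You can reach me at 212-555-0100.",
    ],
)
```

Writing through `csv.writer` rather than joining strings by hand keeps sentences that contain commas in a single column.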

In [None]:
# Creating an object of the class
analyzer1 = CSVDeidentifier(input_csv_path="example.csv", output_csv_path="example_step_1.csv", language="en")
# Running the engine
analyzer1.run()
# To release memory
del analyzer1

# Scrubbing again using flair NER
analyzer2 = CSVDeidentifier(input_csv_path="example_step_1.csv", output_csv_path="example_output.csv", language="en", add_flair=True)
# Running the engine
analyzer2.run()
# To release memory
del analyzer2

### Trying with lower case sentences

In [None]:
# Creating an object of the class
analyzer1 = CSVDeidentifier(input_csv_path="example.csv", output_csv_path="example_step_1_lower.csv", language="en", lower_case=True)
# Running the engine
analyzer1.run()
# To release memory
del analyzer1

# Scrubbing again using flair NER
analyzer2 = CSVDeidentifier(input_csv_path="example_step_1_lower.csv", output_csv_path="example_output_lower.csv", language="en", add_flair=True, lower_case=True)
# Running the engine
analyzer2.run()
# To release memory
del analyzer2

## Note:
* In my testing, the lower_case version performed better overall, but please verify this on your own data.
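
One rough way to compare the two runs is to count how many PII placeholders each output file contains. The sketch below assumes detected PII is replaced by tokens such as `<PERSON>`, which is the default replacement format in Presidio's anonymizer; whether CSVDeidentifier keeps that default is an assumption, and `count_placeholders` is a hypothetical helper.

```python
import csv
import re

# Assumption: PII appears in the output as upper-case placeholders like <PERSON> or <PHONE_NUMBER>.
PLACEHOLDER = re.compile(r"<[A-Z_]+>")

def count_placeholders(csv_path):
    """Count placeholder tokens in all data rows of a CSV file, skipping the header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(len(PLACEHOLDER.findall(cell)) for row in reader for cell in row)
```

For example, `count_placeholders("example_output.csv")` and `count_placeholders("example_output_lower.csv")` could be compared side by side. A higher count is not automatically better, since it may include false positives, so the rows should still be inspected by hand.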