The goal of this notebook is to give a brief introduction to the features and usage of the **Elastic Anonymization** algorithm. In short, the algorithm works in six steps:
1. A fine-tuned BERT model performs named entity recognition (NER) on the sensitive information found in the corpus.
2. The entities are then inserted back into the original corpus, as recognized by BERT.
3. A FastText model is trained on the new corpus.
4. For each new entity that needs to be anonymized, a similarity space is built as follows: the semantic and syntactic similarities between that entity and all the other entities found in the corpus are computed, and an anonymization region is defined in that space (for example, the square defined by $x \in (0.75, 1)$, $y \in (0.75, 1)$).
5. A DBSCAN algorithm is used to spot all the entities belonging to the anonymization region.
6. All the spotted entities are anonymized using the same faking strategy.

The entities and their respective fakes are then stored in a dictionary that maps each original entity to exactly one fake. Since similar entities share the same fake, the keys are unique but the values are not: several keys can point to the same value.
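
To make the shape of that dictionary concrete, here is a minimal sketch. The entities, the fakes, and the helper `group_by_fake` are all hypothetical and only illustrate the "unique keys, shared values" property; they are not taken from the `ElasticAnonymizer` implementation.

```python
from collections import defaultdict

# Hypothetical anonymization state: keys are original entities (unique),
# values are fakes (shared by entities that were judged similar).
anon_state = {
    "mario rossi": "luca bianchi",
    "rossi mario": "luca bianchi",
    "acme spa": "globex srl",
}


def group_by_fake(state: dict) -> dict:
    """Group the original entities by the fake they were mapped to."""
    groups = defaultdict(list)
    for entity, fake in state.items():
        groups[fake].append(entity)
    return dict(groups)


groups = group_by_fake(anon_state)
for fake, entities in groups.items():
    print(f"{fake!r} <- {entities}")
```

Any fake listed with more than one entity is a group of entities that fell inside the same anonymization region.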

In [None]:
import textwrap

from IPython.display import display, Markdown
from transformers import pipeline

from anonymizer.config.config import Config
from anonymizer.preprocess.doc_processing import process
from anonymizer.utils import (
    create_documents_with_metadata, 
    visualize_ner_on_chunk
)
from anonymizer.elastic.eanon import ElasticAnonymizer
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

In [2]:
%load_ext autoreload
%autoreload 2

In [None]:
LEVELS_RENAMER = {
    "level_0": "Macro area",
    "level_1": "Tipo di documento",
    "level_2": "Oggetto del documento"
}

# the anonymize method of the ElasticAnonymizer class takes a list of langchain Documents
docs_ = process(Config.DOCS_PATH).rename(LEVELS_RENAMER, axis=1).fillna("").head(5)
docs = create_documents_with_metadata(docs_)
# chunking is required for the BERT model to work properly (inference breaks on long texts)
chunker = SentenceTransformersTokenTextSplitter(chunk_overlap=50, tokens_per_chunk=250)
chunks = chunker.split_documents(docs)

In [None]:
anon = ElasticAnonymizer(use_pretrained_anon_state=False)
anon_docs = anon.anonymize(chunks, show_ner=False)

The central part of the elastic anonymization algorithm is building the similarity space for each entity. This space is defined by the **semantic** and **syntactic** similarity between an entity and all the other entities recognized in the full corpus of selected documents. The two similarity measures are defined as follows:
- **semantic**: the cosine similarity between the embedded entities, with the embeddings computed by a FastText model trained on the entire corpus.
- **syntactic**: the Jaro similarity between the entities.

The main idea is that all the entities belonging to the "anonymization region" of the similarity space will be anonymized using the same fake entity.
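
The two measures and the region test can be sketched in plain Python. This is an illustration under stated assumptions, not the code used by `ElasticAnonymizer`: the embedding vectors are made up (they are not FastText output), the function names are hypothetical, the x axis is assumed to hold the semantic similarity and the y axis the syntactic one, and a simple box check stands in for the DBSCAN step.

```python
import math


def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most this far apart
    match_dist = max(max(len1, len2) // 2 - 1, 0)
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - match_dist)
        end = min(i + match_dist + 1, len2)
        for j in range(start, end):
            if not s2_matched[j] and s1[i] == s2[j]:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if not s1_matched[i]:
            continue
        while not s2_matched[k]:
            k += 1
        if s1[i] != s2[k]:
            out_of_order += 1
        k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def in_anonymization_region(semantic, syntactic,
                            x_range=(0.80, 1.02), y_range=(0.75, 1.02)) -> bool:
    """Box check; assumes x = semantic similarity and y = syntactic similarity."""
    return (x_range[0] <= semantic <= x_range[1]
            and y_range[0] <= syntactic <= y_range[1])


# Made-up toy embeddings, for illustration only
embeddings = {
    "p4i": [0.90, 0.10, 0.30],
    "p4i p4i": [0.85, 0.15, 0.35],
    "milano": [0.10, 0.90, 0.20],
}

target = "p4i"
selected = {}
for other, vector in embeddings.items():
    if other == target:
        continue
    semantic = cosine_similarity(embeddings[target], vector)
    syntactic = jaro_similarity(target, other)
    selected[other] = in_anonymization_region(semantic, syntactic)
    print(f"{other!r}: semantic={semantic:.3f}, syntactic={syntactic:.3f}, "
          f"in region={selected[other]}")
```

With these toy values, only the entity that is close to the target on both axes ends up in the region, so it would receive the same fake as the target.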

In [None]:
# plot the similarity space for the entity "p4i"
anon.plot_anonimization_space(
    "p4i", 
    add_labels=True, 
    add_anonymization_region=True, 
    anon_region_x=(.80, 1.02),
    anon_region_y=(.75, 1.02)
)

print(
    f"Anonymization for p4i: {anon.anon_state['p4i']}", 
    f"\nAnonymization for p4i p4i: {anon.anon_state['p4i p4i']}"
)

In [None]:
ner_pipeline = pipeline("ner", model="osiria/deberta-base-italian-uncased-ner", aggregation_strategy="simple")
visualize_ner_on_chunk(chunks[0], ner_pipeline)

In [None]:
print("Anonymized Text:")
wrap_ = textwrap.fill(anon_docs["text"][0], width=120)
display(Markdown(f"```\n{wrap_}\n```")) 

In [None]:
print("Deanonymized Text:")
wrap = textwrap.fill(anon.deanonymize([anon_docs["text"][0]])[0], width=120)
display(Markdown(f"```\n{wrap}\n```")) 

There is plenty of room for improvement here: the de-anonymization step is just a reverse mapping from the values to the keys of the anonymization state (the dictionary storing the entities and their corresponding fakes), and the fake entities are located in the text using regular expressions. Smarter approaches are surely available.
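
A reverse mapping of this kind can be sketched as follows. The state, the helper names `build_reverse_state` and `deanonymize_text`, and the "first key wins" rule are assumptions made for illustration; the actual `deanonymize` method may resolve shared fakes differently.

```python
import re

# Hypothetical anonymization state (original entity -> fake)
anon_state = {
    "mario rossi": "luca bianchi",
    "rossi mario": "luca bianchi",
    "acme spa": "globex srl",
}


def build_reverse_state(state: dict) -> dict:
    """Invert the mapping; when a fake is shared, the first original entity wins."""
    reverse = {}
    for entity, fake in state.items():
        reverse.setdefault(fake, entity)
    return reverse


def deanonymize_text(text: str, state: dict) -> str:
    """Replace every fake found in the text with an original entity."""
    reverse = build_reverse_state(state)
    if not reverse:
        return text
    # Longest fakes first, so a longer fake is never shadowed by a shorter one
    fakes = sorted(reverse, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(f) for f in fakes) + r")\b")
    return pattern.sub(lambda m: reverse[m.group(0)], text)


anonymized = "luca bianchi signed the contract with globex srl."
restored = deanonymize_text(anonymized, anon_state)
print(restored)
```

The sketch also shows one limitation of a plain reverse lookup: because several entities can share a fake, the inversion cannot tell which of them originally appeared at a given position.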

Another possible improvement concerns inference time. In this example we have relatively long documents (the GRC offering and proposals), with 5 documents resulting in 119 chunks. Even at this small scale, the inference time was significant: 1 minute and 8 seconds, including the training of the FastText model, which is usually very fast. A first improvement could be to run the inference on the GPU, which should significantly reduce the inference time.