# Extractors

This chapter showcases how to use the label extractors in the package.

The label extractors are used to extract relevant `named entities` from text. These 
entities can be names of people, organizations, addresses, social security numbers, etc.
The extracted entities are then used to anonymize the text.

In [1]:
# used to hide warnings
import warnings

warnings.filterwarnings("ignore")

Let us first define the text from which we want to extract the entities.

In [2]:
original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""

## Language configuration

First, we must specify the language that the text is written in. We can do this manually or by using a language detector.

### Manual selection

One option, when all of the texts are in the same language, is to specify the language manually. 
The `anonipy` package provides a constant called `LANGUAGES` in the `constants` submodule, which 
contains all the supported languages. The format of the language codes can be found in the `constants` module.

Since the `original_text` is in English, we will use the `LANGUAGES.ENGLISH` predefined constant.


In [3]:
from anonipy.constants import LANGUAGES

In [4]:
LANGUAGES.ENGLISH

('en', 'English')

### Using language detector

An alternative is to use a language detector available in the `anonipy` package. 
The language detector is built on the [lingua](https://github.com/pemistahl/lingua-py) Python package, 
and allows automatic detection of the text language.

In [5]:
from anonipy.utils.language_detector import LanguageDetector

Initialize the language detector and use it to automatically detect the language of the text.

In [6]:
lang_detector = LanguageDetector()
lang_detector(original_text)

('en', 'English')

## Using extractors

### Initialization

We can now initialize the label extractors. This is done using the `NERExtractor` class found in the `anonipy.anonymize.extractors` submodule.

<div class="admonition info">
    <p class="admonition-title">Info</p>
    <p style="margin-top: .6rem; margin-bottom: .6rem">
        The <code>NERExtractor</code> class is built on the <a href="https://github.com/urchade/GLiNER" target="_blank">GLiNER</a> models, specifically the one that is finetuned for recognizing Personally Identifiable Information (PII) within text. The model has been finetuned on six languages (English, French, German, Spanish, Italian, and Portuguese), but can also be applied to other languages.
    </p>
</div>

In [7]:
from anonipy.anonymize.extractors import NERExtractor

The `NERExtractor` class takes the following arguments:

- `labels`: A list of dictionaries containing the labels to be extracted. 
- `lang`: The language of the text to be anonymized. Defaults to `LANGUAGES.ENGLISH`.
- `score_th`: The score threshold used to filter the labels, i.e. the entity has to have a score greater than `score_th` to be considered. Defaults to 0.5.
- `use_gpu`: Whether to use the GPU. Defaults to `False`.
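
To make the effect of `score_th` concrete, here is a minimal sketch of threshold filtering in plain Python. It does not use `anonipy` itself: the `candidates` list, its scores, and the `filter_by_score` helper are hypothetical, and the strict comparison is an assumption based on the "greater than `score_th`" wording above.

```python
# Hypothetical candidate predictions; the scores are made up for illustration.
candidates = [
    {"text": "John Doe", "label": "name", "score": 0.99},
    {"text": "routine", "label": "name", "score": 0.12},
    {"text": "15-01-1985", "label": "date of birth", "score": 0.50},
]


def filter_by_score(items, score_th=0.5):
    """Keep only the items whose score is strictly greater than score_th."""
    return [item for item in items if item["score"] > score_th]


kept = filter_by_score(candidates, score_th=0.5)
print([item["text"] for item in kept])
```

Under this assumption, a candidate whose score is exactly equal to the threshold is dropped, so lowering `score_th` slightly would be needed to keep it.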

We must now define the labels to be extracted. In this example, we will extract people's names, dates, and the social security number from the text.

In [8]:
labels = [
    {"label": "name", "type": "string"},
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]

Let us now initialize the entity extractor.

<div class="admonition info">
    <p class="admonition-title">Info</p>
    <p style="margin-top: .6rem; margin-bottom: .6rem">
        The initialization of <code>NERExtractor</code> will emit some warnings.
        These are expected due to the use of package dependencies and can be ignored.
    </p>
</div>

In [9]:
ner_extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)

### Entity extraction

The `NERExtractor` receives the text to be anonymized and returns the enriched text document and the extracted entities.

In [10]:
doc, entities = ner_extractor(original_text)

The entities extracted within the input text are:

In [11]:
ner_extractor.display(doc)

The extracted entities are stored in the `entities` variable. Each entity contains the following information:

- `text`: The text of the entity.
- `label`: The label of the entity.
- `start_index`: The start index of the entity in the text.
- `end_index`: The end index of the entity in the text.
- `score`: The score of the entity. It shows how certain the model is that the entity is relevant.
- `type`: The type of the entity (taken from the defined `labels` variable list).
- `regex`: The regular expression the entity must match.
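
The `start_index` and `end_index` values can be read as Python slice bounds. The sketch below uses plain Python rather than `anonipy` and assumes that `end_index` is exclusive, which is consistent with the values reported for the example text (30 and 38 for the first `John Doe`). The `spans` dictionary is only an illustration.

```python
# First lines of the example text; the rest is omitted for brevity.
text = "Medical Record\n\nPatient Name: John Doe\nDate of Birth: 15-01-1985\n"

# (start_index, end_index) pairs as reported for the first two entities.
spans = {"name": (30, 38), "date of birth": (54, 64)}

for label, (start, end) in spans.items():
    # Slicing with the two indices recovers the entity text.
    print(f"{label}: {text[start:end]!r}")
```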

In [12]:
entities

[Entity(text='John Doe', label='name', start_index=30, end_index=38, score=0.9961156845092773, type='string', regex='.*'),
 Entity(text='15-01-1985', label='date of birth', start_index=54, end_index=64, score=0.9937193393707275, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=0.9867385625839233, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='123-45-6789', label='social security number', start_index=121, end_index=132, score=0.9993416666984558, type='custom', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}'),
 Entity(text='John Doe', label='name', start_index=157, end_index=165, score=0.994924783706665, type='string', regex='.*'),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=0.8285622596740723, type='date', regex='(\\d{1,2}[\\/\\-\\.

## Advice and suggestions

**Use specific label names.**
In the above example, we used specific label names to extract the entities. If 
we use a less specific name, the entity extractor might not find any relevant entity.

For instance, when using `social security number` as the label name, the entity extractor
is able to extract the social security number from the text. However, if we use `ssn` or 
just `number` as the label name, the entity extractor might not find any relevant entity.

<div class="admonition tip">
    <p class="admonition-title">Tip</p>
    <p style="margin-top: .6rem; margin-bottom: .6rem">
        Using more specific label names is better.
    </p>
</div>

**Use custom regex patterns.**
In the `anonipy` package, we provide some predefined entity types, which are:

- `string`. Extracts a string from the text.
- `integer`. Extracts an integer from the text.
- `float`. Extracts a float from the text.
- `date`. Extracts a date from the text.
- `email`. Extracts an email address from the text.
- `phone_number`. Extracts a phone number from the text.
- `website_url`. Extracts a URL from the text.

These entity types also have a corresponding regex pattern, as defined in the `anonipy.anonymize.regex` submodule.

In [13]:
from anonipy.anonymize.regex import regex_map

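# `entity_type` is used as the loop variable to avoid shadowing the built-in `type`
for entity_type in [
    "string",
    "integer",
    "float",
    "date",
    "email",
    "phone_number",
    "website_url",
]:
    print(f"{entity_type:<13}: {regex_map(entity_type)}")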

string       : .*
integer      : \d+
float        : [\d\.,]+
date         : (\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})|(\d{2,4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2})
email        : [a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
phone_number : [(]?[\+]?[(]?[0-9]{1,3}[)]?[-\s\.]?([0-9]{2,}[-\s\.]?){2,}([0-9]{3,})
website_url  : ((https?|ftp|smtp):\/\/)?(www.)?([a-zA-Z0-9]+\.)+[a-z]{2,}(\/[a-zA-Z0-9#\?\_\.\=\-\&]+|\/?)*


If the user wants to use a custom regex pattern, they can define it in the `labels` 
variable list. A custom regex pattern lets the user specify a stricter 
pattern that the entity must match.
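
To illustrate why a stricter pattern helps, the sketch below checks a few candidate strings against the generic `string` pattern and against a social security number pattern using Python's `re` module. The `candidates` list is made up, and the use of `re.fullmatch` is an assumption made for this example, not a statement about how `anonipy` applies the pattern internally.

```python
import re

# The predefined `string` pattern and a stricter social security number pattern.
string_pattern = re.compile(r".*")
ssn_pattern = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# Made-up candidate strings for illustration.
candidates = ["123-45-6789", "123456789", "12-345-6789"]

# A candidate is accepted only if the whole string matches the pattern.
string_matches = [c for c in candidates if string_pattern.fullmatch(c)]
ssn_matches = [c for c in candidates if ssn_pattern.fullmatch(c)]

print("string:", string_matches)
print("ssn:   ", ssn_matches)
```

The generic pattern accepts every candidate, while the stricter one keeps only the string with the expected 3-2-4 digit grouping.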

The custom regex can be specified in the following way:

In [14]:
labels = [
    {"label": "name", "type": "string"},
    # using the custom regex pattern: type must be 'custom' and specify the regex pattern in the 'regex' key
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]

Let's rerun the above example:

In [15]:
# ignore the warnings: these are expected due to the use of package dependencies
ner_extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)

In [16]:
doc, entities = ner_extractor(original_text)

The extracted entities are the same as before. The difference is that the social 
security number now also has to match the custom regex pattern.

In [17]:
ner_extractor.display(doc)

## Creating custom extractors

The user can develop their own custom extractor. To do this, the custom extractor
must inherit from the `ExtractorInterface` class. 

The extractor must have two methods defined: `__init__` and `__call__`.

An example of a custom extractor that extracts only a specific regex pattern from 
the text is shown below:

In [20]:
import re
from anonipy.anonymize.extractors import ExtractorInterface
from anonipy.definitions import Entity


class CustomExtractor(ExtractorInterface):

    def __init__(self):
        # the custom extractor will retrieve entities that follow the regex pattern
        self.regex_pattern = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")

    def __call__(self, text: str) -> tuple[str, list[Entity]]:
        entities = []
        for match in re.finditer(self.regex_pattern, text):
            entities.append(
                Entity(
                    text=match.group(),
                    label="date",
                    start_index=match.start(),
                    end_index=match.end(),
                    score=1.0,
                    type="date",
                    regex=self.regex_pattern,
                )
            )
        return text, entities

In [21]:
custom_extractor = CustomExtractor()
_, entities = custom_extractor(original_text)

Let us output the extracted entities. Note that the third entity, `23-45-6789`, is actually a part of the social security number: the pattern matches any text of that shape, regardless of its context.

In [22]:
entities

[Entity(text='15-01-1985', label='date', start_index=54, end_index=64, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')),
 Entity(text='23-45-6789', label='date', start_index=122, end_index=132, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}'))]
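
One possible way to avoid the partial match on the social security number is to require that a date is not directly preceded or followed by a digit or a hyphen. The sketch below uses only Python's `re` module on a shortened text. The `strict_date_pattern` name and the lookaround approach are suggestions for illustration, not part of `anonipy`.

```python
import re

# Shortened version of the example text.
text = "Date of Birth: 15-01-1985\nSocial Security Number: 123-45-6789\n"

# Pattern used by the custom extractor above.
loose_date_pattern = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")

# Same pattern, but the match must not touch another digit or hyphen.
strict_date_pattern = re.compile(r"(?<![\d-])\d{1,2}-\d{1,2}-\d{2,4}(?![\d-])")

loose_matches = [m.group() for m in loose_date_pattern.finditer(text)]
strict_matches = [m.group() for m in strict_date_pattern.finditer(text)]

print(loose_matches)
print(strict_matches)
```

The loose pattern also returns `23-45-6789`, while the stricter one keeps only the date of birth. A pattern like this could be compiled in the `__init__` method of a custom extractor in place of the original one.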