# Generators

This chapter showcases the generators in the `anonipy` package.

Generators produce the replacement values for extracted entities.
To do this, `anonipy` implements generators for:

- strings
- numbers
- dates

All of the generators are implemented in the `anonipy.anonymize.generators` module.
In the following sections, we present each generator in detail.

In [1]:
# used to hide warnings
import warnings

warnings.filterwarnings("ignore")

Let us first define the text and the associated entities, as seen in the
previous chapter (see [Extractors](/documentation/notebooks/01-extractors/)).

In [2]:
original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""

Normally, the entities are extracted using the `EntityExtractor`. For this section,
we define the entities manually:

In [3]:
from anonipy.definitions import Entity

entities = [
    Entity(
        text="John Doe",
        label="name",
        start_index=30,
        end_index=38,
        type="string",
        regex=".*",
    ),
    Entity(
        text="15-01-1985",
        label="date of birth",
        start_index=54,
        end_index=64,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
    Entity(
        text="20-05-2024",
        label="date",
        start_index=86,
        end_index=96,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
    Entity(
        text="123-45-6789",
        label="social security number",
        start_index=121,
        end_index=132,
        type="custom",
        regex=".*",
    ),
    Entity(
        text="John Doe",
        label="name",
        start_index=157,
        end_index=165,
        type="string",
        regex=".*",
    ),
    Entity(
        text="15-11-2024",
        label="date",
        start_index=717,
        end_index=727,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
]
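Since the entities are defined by hand, it is worth checking that each `start_index` and `end_index` really points at the entity text. The sketch below shows one way to do this. It uses a hypothetical `Span` tuple as a stand-in for `Entity` (only the three fields being checked) and a shortened copy of the record, so it runs without `anonipy` installed.

```python
from collections import namedtuple

# Hypothetical stand-in for anonipy's Entity, limited to the fields checked here.
Span = namedtuple("Span", ["text", "start_index", "end_index"])

# The first lines of the medical record used in this chapter.
sample_text = "Medical Record\n\nPatient Name: John Doe\nDate of Birth: 15-01-1985\n"

spans = [
    Span("John Doe", 30, 38),
    Span("15-01-1985", 54, 64),
]


def misaligned(text, spans):
    """Return the spans whose indices do not slice out their own text."""
    return [s for s in spans if text[s.start_index : s.end_index] != s.text]


bad = misaligned(sample_text, spans)
print(bad)  # an empty list means every span is aligned
```

The same check can be applied to the real `entities` list and `original_text`, since `Entity` exposes the same three attributes.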

## LLMLabelGenerator

<div class="admonition warning">
    <p class="admonition-title">Warning</p>
    <p style="margin-top: .6rem; margin-bottom: .6rem">
        The <code>LLMLabelGenerator</code> relies on open-source LLMs,
        specifically the <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct" target="_blank">Llama 3</a> model.
        Because the model is quite large, we quantize it with the <code>bitsandbytes</code> package to reduce its size.
        Even so, the <code>LLMLabelGenerator</code> requires a GPU with at least 8GB of memory and available CUDA drivers.
        If these resources are not available on your machine, you can use the <code>MaskLabelGenerator</code> instead.
    </p>
</div>

The `LLMLabelGenerator` is a one-stop-shop generator that utilizes LLMs to generate replacements for entities. It is implemented to support any entity type.

For more details, please check the `LLMLabelGenerator` class implementation.

Let us first import the generator and initialize it.

<div class="admonition info">
    <p class="admonition-title">Info</p>
    <p style="margin-top: .6rem; margin-bottom: .6rem">
        The initialization of <code>LLMLabelGenerator</code> will throw some warnings. Ignore them.
        These are expected due to the use of package dependencies.
    </p>
</div>

In [4]:
from anonipy.anonymize.generators import LLMLabelGenerator

llm_generator = LLMLabelGenerator()

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


To use the generator, we can call the `generate` method. The `generate` method receives the following parameters:

- `entity`: The entity to generate a replacement for.
- `entity_prefix`: The prefix to use for the replacement (Default: "").
- `temperature`: The temperature to use when generating the replacement. This value should be between 0 and 1, where 0 is the least random and 1 is the most random generation (Default: 0).
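
To build some intuition for the `temperature` parameter, the sketch below shows how temperature typically reshapes the probabilities a language model samples from. It illustrates the general technique only and makes no claim about how `LLMLabelGenerator` applies the value internally. The function name and the example scores are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the best score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
greedy = softmax_with_temperature(logits, 0)
sharp = softmax_with_temperature(logits, 0.3)
flat = softmax_with_temperature(logits, 1.0)

print(greedy, sharp, flat)
```

Lower temperatures concentrate the probability on the top candidate, which makes the output more repeatable. Higher temperatures spread it out, which makes the output more varied.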

Let us generate the replacement for the first entity using the default parameters.

In [5]:
llm_generator.generate(entities[0])

'Ethan Thompson'

Let us now set the entity prefix and generate the replacement using a higher temperature.

In [6]:
llm_generator.generate(entities[0], entity_prefix="Spanish", temperature=0.7)

'Juan Martines'

Let us now generate a replacement for each entity using the default parameters.

In [7]:
for entity in entities:
    print(f"{entity.text:<12} | {entity.label:<22} | {llm_generator.generate(entity)}")

John Doe     | name                   | Ethan Thompson
15-01-1985   | date of birth          | 24-07-1992
20-05-2024   | date                   | 23-07-2027
123-45-6789  | social security number | 987-65-4321
John Doe     | name                   | Ethan Thompson
15-11-2024   | date                   | 23-02-2027


### Advice and suggestions

**Use `LLMLabelGenerator` only for string and custom types.**
While the `LLMLabelGenerator` is able to generate alternatives for different entity
types, we suggest using it only for string and custom entity types, because
LLMs can be quite slow at generating replacements.

In addition, `anonipy` provides dedicated generators for other entity types, such as dates and numbers.

**Restrict the generation with regex.**
LLMs generate the most reliable text when the generation is restricted to a specific pattern.
The `Entity` object already contains a `regex` field that can be used to restrict the generation
to a specific pattern. We recommend making the regular expressions as specific and restrictive
as possible.

This will help the `LLMLabelGenerator` to generate more accurate replacements.
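
To illustrate why a restrictive pattern helps, the sketch below compares the catch-all `.*` pattern with a stricter one for the social security number. It uses only Python's `re` module. The pattern `SSN_PATTERN` and the candidate strings are assumptions made for this example and are not taken from `anonipy`.

```python
import re

# Assumed pattern for a value formatted like 123-45-6789.
SSN_PATTERN = r"\d{3}-\d{2}-\d{4}"

# Made-up candidate replacements, some of them malformed.
candidates = ["987-65-4321", "98765-4321", "N/A"]


def matching(pattern, values):
    """Keep only the values that match the whole pattern."""
    return [v for v in values if re.fullmatch(pattern, v)]


loose = matching(r".*", candidates)
strict = matching(SSN_PATTERN, candidates)

print(loose)   # ['987-65-4321', '98765-4321', 'N/A']
print(strict)  # ['987-65-4321']
```

The catch-all pattern accepts every candidate, while the stricter pattern rules out the malformed ones. A pattern like this could be passed as the `regex` field of the corresponding `Entity`.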

## MaskLabelGenerator

The `MaskLabelGenerator` is a generator that uses smaller language models, such as <a href="https://huggingface.co/FacebookAI/xlm-roberta-large" target="_blank">XLM-RoBERTa</a>, to generate replacements for entities. It is implemented to support any entity type, but we suggest using it 
with string entities. For other entity types, please use other generators.

For more details, please check the `MaskLabelGenerator` class implementation.

Let us first import the generator and initialize it. The generator at initialization 
can receive the following parameters:

- `model_name`: The model to use for the generation (Default: "FacebookAI/xlm-roberta-large").
- `use_gpu`: Whether to use the GPU for the generation (Default: False).
- `context_window`: The size of the context window to both sides of the entity to use for the generation.
    If the context window is set to 100, the context will be the 100 characters before and after the entity (Default: 100).
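
The character-based context window described above can be sketched as a simple slice. This is a minimal illustration of the idea and not the `MaskLabelGenerator` implementation. The helper name `context_around` is hypothetical, and clipping at the text boundaries is an assumption.

```python
def context_around(text, start_index, end_index, context_window=100):
    """Return the entity plus up to `context_window` characters on each side."""
    left = max(0, start_index - context_window)
    right = min(len(text), end_index + context_window)
    return text[left:right]


sample = "Patient Name: John Doe\nDate of Birth: 15-01-1985"

# "John Doe" occupies characters 14..22 of the sample.
snippet = context_around(sample, 14, 22, context_window=8)
print(repr(snippet))  # 't Name: John Doe\nDate of'
```

A larger window gives the model more surrounding text to condition on, at the cost of a longer input.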

<div class="admonition info">
    <p class="admonition-title">Info</p>
    <p style="margin-top: .6rem; margin-bottom: .6rem">
        The initialization of <code>MaskLabelGenerator</code> will throw some warnings. Ignore them.
        These are expected due to the use of package dependencies.
    </p>
</div>

In [8]:
from anonipy.anonymize.generators import MaskLabelGenerator

# initialization using default parameters
mask_generator = MaskLabelGenerator()

Some weights of the model checkpoint at FacebookAI/xlm-roberta-large were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


To use the generator, we can call the `generate` method. The `generate` method receives the following parameters:

- `entity`: The entity to generate a replacement for.
- `text`: The original text from which the generator will retrieve the context of the entity text.

This generator will create a list of suggestions from which it will select one at random. Therefore, the generator will return different suggestions every time it is called.

In [9]:
mask_generator.generate(entities[0], text=original_text)

'James Thomas'

In [10]:
mask_generator.generate(entities[0], text=original_text)

'Smith Paul'

In [11]:
mask_generator.generate(entities[0], text=original_text)

'Michael Thomas'

In [12]:
for entity in entities:
    print(
        f"{entity.text:<12} | {entity.label:<22} | {mask_generator.generate(entity, text=original_text)}"
    )

John Doe     | name                   | Richard Roberts
15-01-1985   | date of birth          | 1964
20-05-2024   | date                   | 2017
123-45-6789  | social security number | 1
John Doe     | name                   | Van first
15-11-2024   | date                   | None


### Advice and suggestions

**Use only for string entities.**
As seen from the above examples, the `MaskLabelGenerator` is best used with string entities.
For number and date entities, it is best to use other generators, such as `NumberGenerator` 
and `DateGenerator`.

## NumberGenerator

The `NumberGenerator` is a generator for generating random numbers. It is implemented to support integers, floats, and
phone numbers, but it can also be used to generate values for custom types that contain numbers.

For more details, please check the `NumberGenerator` class implementation.

Let us first import the generator and initialize it. The generator at initialization 
does not need any parameters.

In [13]:
from anonipy.anonymize.generators import NumberGenerator

number_generator = NumberGenerator()

To use the generator, we can call the `generate` method. The `generate` method receives the following parameters:

- `entity`: The number entity to generate a replacement for.

This generator will create a suggestion by replacing numeric values in the entity text at random. Therefore, the generator will return different suggestions every time it is called.

In [14]:
number_generator.generate(entities[3])

'491-93-2792'

The generator raises an error if the entity type is not `integer`, `float`, `phone_number` or `custom`.

In [15]:
try:
    number_generator.generate(entities[0])
except Exception as e:
    print(e)

The entity type must be `integer`, `float`, `phone_number` or `custom` to generate numbers.
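
The digit replacement described in this section can be approximated with the standard library. The sketch below swaps every digit for a random one and leaves all other characters in place. It is an illustration under that assumption and not the `NumberGenerator` implementation. The function name `randomize_digits` is hypothetical.

```python
import random
import re


def randomize_digits(text, rng=random):
    """Replace each digit with a random digit, keeping separators untouched."""
    return re.sub(r"\d", lambda _match: str(rng.randint(0, 9)), text)


replacement = randomize_digits("123-45-6789")
print(replacement)  # same layout as the input, different digits on most runs
```

Because only the digits change, the replacement keeps the layout of the original value, such as the dashes in a social security number.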


## DateGenerator

The `DateGenerator` is a generator for generating dates. It is implemented to support date entities.

For more details, please check the `DateGenerator` class implementation.

Let us first import the generator and initialize it. The generator at initialization 
can receive the following parameters:

- `date_format`: The format in which the dates will be provided and generated (Default: "%d-%m-%Y").
- `day_sigma`: The number of days to add or subtract from the date when using the `random` generator method (see below) (Default: 30).

In [16]:
from anonipy.anonymize.generators import DateGenerator

date_generator = DateGenerator()

To use the generator, we can call the `generate` method. The `generate` method receives the following parameters:

- `entity`: The date entity to generate a replacement for.
- `output_gen`: The method used to generate the date (Default: "random"). It can be one of:
  - `random`: generates a random date within $\pm$ `day_sigma` days of the `entity` date.
  - `first_day_of_the_month`: returns the first day of the month of `entity`.
  - `last_day_of_the_month`: returns the last day of the month of `entity`.
  - `middle_of_the_month`: returns the middle day of the month of `entity`.
  - `middle_of_the_year`: returns the middle day of the year of `entity`.

Using the above parameters, this generator will create the appropriate date suggestions:

In [17]:
entities[2].text

'20-05-2024'

In [18]:
date_generator.generate(entities[2], output_gen="random")

'10-05-2024'

In [19]:
date_generator.generate(entities[2], output_gen="first_day_of_the_month")

'01-05-2024'

In [20]:
date_generator.generate(entities[2], output_gen="last_day_of_the_month")

'31-05-2024'

In [21]:
date_generator.generate(entities[2], output_gen="middle_of_the_month")

'15-05-2024'

In [22]:
date_generator.generate(entities[2], output_gen="middle_of_the_year")

'01-07-2024'

The generator raises an error if the entity type is not `date`.

In [23]:
try:
    date_generator.generate(entities[0])
except Exception as e:
    print(e)

The entity type must be `date` to generate dates.
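
To make the `output_gen` methods more concrete, the sketch below reproduces them with Python's `datetime` and `calendar` modules. It is not the `DateGenerator` implementation. The function name `shift_date` is hypothetical, the uniform draw used for `random` is an assumption, and the fixed days for the two "middle" methods are inferred from the outputs shown above (the 15th of the month and the 1st of July).

```python
import calendar
import random
from datetime import datetime, timedelta


def shift_date(date_text, method="random", day_sigma=30, date_format="%d-%m-%Y"):
    """Return a replacement date as text, using one of the named methods."""
    date = datetime.strptime(date_text, date_format)
    if method == "random":
        # Assumption: a uniform offset of at most `day_sigma` days either way.
        date += timedelta(days=random.randint(-day_sigma, day_sigma))
    elif method == "first_day_of_the_month":
        date = date.replace(day=1)
    elif method == "last_day_of_the_month":
        last_day = calendar.monthrange(date.year, date.month)[1]
        date = date.replace(day=last_day)
    elif method == "middle_of_the_month":
        date = date.replace(day=15)
    elif method == "middle_of_the_year":
        date = date.replace(month=7, day=1)
    else:
        raise ValueError(f"Unknown method: {method}")
    return date.strftime(date_format)


print(shift_date("20-05-2024", "first_day_of_the_month"))  # 01-05-2024
print(shift_date("20-05-2024", "last_day_of_the_month"))   # 31-05-2024
print(shift_date("20-05-2024", "middle_of_the_year"))      # 01-07-2024
```

The deterministic methods reduce the precision of a date while keeping it roughly in place, which can be useful when the exact day is sensitive but the month or year still matters.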


## Creating a custom generator

The user can develop their own custom generators. To do this, the custom generator
must inherit from the `GeneratorInterface` class. 

The generator must have two methods defined: `__init__` and `generate`,
where the `generate` method must accept at least the entity.

An example of a custom generator that will generate only emojis is shown below:

In [24]:
import random
from anonipy.anonymize.generators import GeneratorInterface
from anonipy.definitions import Entity


class CustomGenerator(GeneratorInterface):

    def __init__(self):
        # the pool of emojis to choose the replacement from
        self.emojis = ["😄", "🤗", "😢"]

    def generate(self, entity: Entity) -> str:
        # ignore the entity content and return a random emoji
        return random.choice(self.emojis)

In [25]:
custom_generator = CustomGenerator()
custom_generator.generate(entities[0])

'🤗'
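
The generator above may return a different emoji for each mention of the same entity. If repeated mentions should map to the same replacement, a custom generator can remember its earlier choices. The sketch below shows this idea. To stay runnable without `anonipy`, it uses a hypothetical `FakeEntity` class in place of `Entity` and does not inherit from `GeneratorInterface`. A real custom generator would inherit from it as shown above.

```python
import random
from dataclasses import dataclass


@dataclass
class FakeEntity:
    """Hypothetical stand-in exposing only the fields this sketch needs."""

    text: str
    label: str


class ConsistentEmojiGenerator:
    """Returns the same emoji for repeated mentions of the same entity."""

    def __init__(self, seed=None):
        self.emojis = ["😄", "🤗", "😢"]
        self._rng = random.Random(seed)
        self._cache = {}  # maps (label, text) to the chosen replacement

    def generate(self, entity):
        key = (entity.label, entity.text)
        if key not in self._cache:
            self._cache[key] = self._rng.choice(self.emojis)
        return self._cache[key]


generator = ConsistentEmojiGenerator(seed=0)
first = generator.generate(FakeEntity(text="John Doe", label="name"))
second = generator.generate(FakeEntity(text="John Doe", label="name"))
print(first == second)  # True
```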