# Text Re-Identification
In this notebook, a BERT-based model is used to re-identify documents anonymized with different methods. The dataset employed (extracted from [here](https://github.com/fadiabdulf/automatic_text_anonymization)) consists of Wikipedia articles pertaining to actors and actresses born in the 20th century.

The code is structured into the following sections:
* **Initialization**: Preparation for program execution. Includes the [Settings](#settings_section) subsection.
* **Data**: Obtains and preprocesses the data.
* **Experiments**: Trains the model and performs the re-identification attack.

Further details are provided within each section.<br>
Using an environment with a **GPU** is strongly recommended for reasonable training times.<br>
This notebook has been designed to be executed in **Google Colaboratory**. To this end, all the repository files need to be copied to a Drive folder and the corresponding path needs to be assigned to the **drive_working_path** setting in the [Settings](#settings_section) section.

# Initialization
This section is divided into the following subsections:
1. **Requirements**: Installation and import of the required packages.
2. **Settings**: The execution settings are defined.
<br> Modify these settings to change the behaviour of the program.


## Requirements

### Installations
Choose **one** option for the installation of requirements.

* **Option 1**: Install each package with the same version employed during development and testing. <br>
Should be equivalent to **Option 1** but without **torch** package re-downloading. <br>
* **Option 2**: Not version-strict install of the packages. <br>
Similar to **Option 1** but ignoring the package versions; the current version of each package is installed instead. <br>
It can cause some compatibility issues (e.g., deprecated functions). <br>
It will probably re-download **en_core_web_lg**.
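
Whichever option is chosen, it can help to check which versions ended up installed. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the `PINNED` dictionary copies only a few of the Option 1 pins, and the helper names are hypothetical rather than part of the notebook.

```python
from importlib import metadata

# Subset of the versions pinned in Option 1 (extend as needed)
PINNED = {"spacy": "3.4.1", "transformers": "4.21.0", "numpy": "1.21.6", "pandas": "1.3.5"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_mismatches(pins):
    """Return {package: (expected, found)} for every package that differs from its pin."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


print(report_mismatches(PINNED))
```

An empty dictionary means that every listed package matches its pin; with Option 2, mismatches are expected.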

In [None]:
installation_option = "Option 1" #@param ["Option 1", "Option 2"]

In [None]:
if installation_option == "Option 1":
    !pip3 install spacy==3.4.1
    !python3 -m spacy download en_core_web_lg  # As equivalent to: !pip3 install en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl

    !pip3 install torch==1.12.0 --extra-index-url https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp37-cp37m-linux_x86_64.whl # As equivalent to: !pip install torch @ https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp37-cp37m-linux_x86_64.whl
    !pip3 install transformers==4.21.0

    !pip3 install numpy==1.21.6
    !pip3 install pandas==1.3.5
    !pip3 install beautifulsoup4==4.6.3
    !pip3 install matplotlib==3.2.2
    !pip3 install matplotlib-inline==0.1.3
    !pip3 install tqdm==4.64.0

    !pip3 install google==2.0.3
    !pip3 install googledrivedownloader==0.4

    !pip3 install wikipedia==1.4.0
    !pip3 install Wikipedia-API==0.5.4    

elif installation_option == "Option 2":
    !pip3 install spacy
    !python3 -m spacy download en_core_web_lg

    !pip3 install torch
    !pip3 install transformers

    !pip3 install numpy
    !pip3 install pandas
    !pip3 install beautifulsoup4
    !pip3 install matplotlib
    !pip3 install matplotlib-inline
    !pip3 install tqdm

    !pip3 install google
    !pip3 install googledrivedownloader

    !pip3 install wikipedia
    !pip3 install Wikipedia-API   

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl (587.7 MB)
     |████████████████████████████████| 587.7 MB 9.4 kB/s 
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp37-cp37m-linux_x86_64.whl
Collecting torch==1.12.0
  Downloading torch-1.12.0-cp37-cp37m-manylinux1_x86_64.whl (776.3 MB)
     |████████████████████████████████| 776.3 MB 18 kB/s 
Installing col

### Imports

In [None]:
import en_core_web_lg

import torch
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification
from transformers import AdamW, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import get_constant_schedule

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup as BS

from google.colab import drive

import wikipediaapi
import wikipedia

import os
import enum
import re
import gc
from argparse import Namespace
from datetime import datetime
from urllib.request import urlopen

## Settings <a name="settings_section"></a>
Configuration of all the execution behaviour. <br>

**CAUTION**:
<ol>
  <li>No validation is performed, so verify that the values entered are correct.</li>
  <li>Modifying any of these settings requires re-executing the corresponding section (e.g., Data settings -> Data section) and all subsequent ones.</li>
</ol>

In [None]:
settings = Namespace()  #@markdown  <a name="settings_section"></a>

#@markdown #**Google Drive settings**
#@markdown Mark the following checkbox only if you want to mount your Google Drive in the environment. 
#@markdown <br> This is strongly recommended if you are in **Google Colab**, enabling the access to data and permanent results storage.
#@markdown <br> If Drive is employed, remember to adjust the working path. Otherwise, current path (".") will be defined as working path.
settings.use_google_drive = True   #@param {type:"boolean"}
if settings.use_google_drive:
    # Mount    
    mount_path = "/content/drive"
    drive.mount(mount_path, force_remount=True)
    
    # Define project folder within Google Drive
    settings.drive_working_path = "CRISES/TRI/Experiment"   #@param {type:"string"}
    settings.working_path = os.path.join(mount_path, "MyDrive", settings.drive_working_path)
else:
    settings.working_path = "."


#@markdown #<br>**Data settings** <a name="data_settings"></a>
#@markdown ###Global
# This will use only the evaluation data for training, using the bodies of the corresponding articles
settings.train_dataset_setup = "50_eval"    #@param ["50_eval", "500_random", "500_filtered", "2000_filtered"]
settings.eval_dataset_setup = "50_eval"
settings.train_with_eval_data = settings.train_dataset_setup == settings.eval_dataset_setup
settings.use_filtered_data = "filtered" in settings.train_dataset_setup
settings.use_anonymized_bodies = True   #@param {type:"boolean"}
settings.dev_set_split = 0.3    #@param {type:"slider", min:0, max:1, step:0.1}
# filter_data_policy: If Load, it uses the precomputed list of filtered documents
settings.filter_data_policy = "Load" #@param ["Load", "Compute"]
settings.preprocess_sensitive_tokens_method = "Keep all sensitive tokens" #@param [ "Keep all sensitive tokens", "Remove specific sensitive tokens", "Remove generic sensitive tokens", "Remove all sensitive tokens" ]
#@markdown ###Paths (relative to working path)
# Data folders
settings.data_zip = "data.zip"    #@param {type:"string"}   # The .zip file containing all the .xml files.
settings.data_zip_path = os.path.join(settings.working_path, settings.data_zip)
settings.data_path = os.path.splitext(settings.data_zip)[0]
settings.eval_folder = "eval"    #@param {type:"string"}
settings.eval_data_path = os.path.join(settings.data_path, settings.eval_folder)
settings.train_folder = "train"    #@param {type:"string"}
if settings.train_with_eval_data:
    settings.train_data_path = settings.eval_data_path
else:
    settings.train_data_path = os.path.join(settings.data_path, settings.train_folder)
# Data columns (IMPORTANT TO MODIFY IF USING DIFFERENT DATA)
settings.id_column = "title"    # The key from the .xml file that identifies the document individual
# The following are the keys from the .xml file corresponding to evaluation or training
settings.eval_data_columns = [settings.id_column, "original_abstract", "ner3_abstract", "ner4_abstract", "ner7_abstract", "presidio_abstract", "spacy_abstract", "word2vec_abstract", "manual_abstract"]
settings.train_data_columns_to_read = [settings.id_column, "original_body", "spacy_abstract"] 
if settings.use_anonymized_bodies:
    settings.train_data_columns_to_read += ["spacy_body"]
if settings.use_filtered_data and "compute" in settings.filter_data_policy.lower():
    settings.train_data_columns_to_read += ["original_abstract"]
settings.train_data_columns_for_training = [settings.id_column, "original_body"]
if settings.use_anonymized_bodies:
    settings.train_data_columns_for_training += ["spacy_body"]
settings.dev_data_columns = [settings.id_column, "spacy_abstract"]
# Data save paths
settings.filtered_titles_path = os.path.join(settings.data_path, settings.train_dataset_setup+"_titles.txt")


#@markdown #<br>**Experiments settings** <a name="experiment_settings"></a>
#@markdown ###Global
settings.base_model_name = "distilbert-base-uncased"    #@param ["distilbert-base-uncased", "bert-base-uncased", "roberta-base"]
settings.use_dev_as_eval = False #@param {type:"boolean"}
settings.max_seq_length = 512
if torch.cuda.is_available():
    settings.device = torch.device("cuda:0")
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    !nvidia-smi # Show GPU info
else:
    settings.device = torch.device("cpu")
    print("WARNING: GPU device not found. GPU is strongly recommended for reasonable execution times.")
# Model type enumerable
class ModelType(enum.Enum):
    MaskedLM = 0
    Classifier = 1
    def __int__(self): return self.value    
    def get_config(self): return settings.models_settings[self.value]
#@markdown ###Further pre-training
settings.pretraining_with_train_dataset = True #@param {type:"boolean"}
settings.pretraining_with_eval_dataset = "No" #@param ["No", "original_abstract", "ner3_abstract", "ner4_abstract", "ner7_abstract", "presidio_abstract", "spacy_abstract", "word2vec_abstract", "manual_abstract"]
settings.use_pretraining = settings.pretraining_with_train_dataset or settings.pretraining_with_eval_dataset != "No"
settings.pretraining_policy = "Only train" #@param ["Train and save model", "Only train", "Only load"]
settings.pretraining_config = Namespace()
settings.pretraining_config.model_type = ModelType.MaskedLM
settings.pretraining_config.epochs =    3#@param {type:"integer"}
settings.pretraining_config.batch_size = 8   #@param {type:"integer"}
settings.pretraining_config.learning_rate = 5e-5 #@param {type:"number"}
settings.pretraining_config.mlm_probability = 0.15   #@param {type:"number"}
settings.pretraining_config.use_sliding_window = True  #@param {type:"boolean"}
settings.pretraining_config.sliding_window_length = 512   #@param {type:"integer"}
settings.pretraining_config.sliding_window_overlap = 128 #@param {type:"integer"}
#@markdown ### Fine-tuning
settings.finetuning_policy = "Only train" #@param ["Train and save model", "Only train", "Only load"]
settings.finetuning_config = Namespace()
settings.finetuning_config.model_type = ModelType.Classifier
settings.finetuning_config.epochs = 20  #@param {type:"integer"}
settings.finetuning_config.batch_size = 16  #@param {type:"integer"}
settings.finetuning_config.learning_rate = 5e-5 #@param {type:"number"}
settings.finetuning_config.use_sliding_window = True  #@param {type:"boolean"}
settings.finetuning_config.sliding_window_length = 100   #@param {type:"integer"}
settings.finetuning_config.sliding_window_overlap =  25#@param {type:"integer"}
# Settings array (used at ModelType enumerable)
settings.models_settings = [settings.pretraining_config, settings.finetuning_config]
#@markdown ###Paths (relative to working path)
# Saving folder
settings.saved_models_folder = "saved_models"    #@param {type:"string"}
settings.saved_models_path = os.path.join(settings.working_path, settings.saved_models_folder)
# makedirs also creates missing parent folders and does not fail if the folder already exists
os.makedirs(settings.saved_models_path, exist_ok=True)
# Results
settings.results_filename = "results.csv"   #@param {type:"string"}
settings.results_filepath = os.path.join(settings.working_path, settings.results_filename)
# Training setup string
settings.training_setup = f"{settings.train_dataset_setup}"
settings.training_setup += "_AnonymBodies" if settings.use_anonymized_bodies else "_NoAnonymBodies"
settings.training_setup += f"_DefaultPT{settings.pretraining_config.sliding_window_length}-{settings.finetuning_config.sliding_window_overlap}" if settings.pretraining_with_train_dataset else "_NoDefaultPT"
settings.training_setup += f"_{settings.pretraining_with_eval_dataset}EvalPT{settings.pretraining_config.sliding_window_length}-{settings.finetuning_config.sliding_window_overlap}"
settings.training_setup += f"_SW{settings.finetuning_config.sliding_window_length}-{settings.finetuning_config.sliding_window_overlap}" if settings.finetuning_config.use_sliding_window else f"_PerSentence"
# Base model
settings.base_model_save_filename = f"base_{settings.training_setup}.pth"
settings.base_model_save_path = os.path.join(settings.saved_models_path, settings.base_model_save_filename)
# Pretraining
settings.pretraining_config.save_path = os.path.join(settings.saved_models_path, f"MaskedLM_{settings.training_setup}.pth")
settings.pretraining_config.trainer_folder_path = settings.pretraining_config.save_path + "_trainer"
# Attack
settings.finetuning_config.save_path = os.path.join(settings.saved_models_path, f"{settings.training_setup}.pth")
settings.finetuning_config.trainer_folder_path = settings.finetuning_config.save_path + "_trainer"

Mounted at /content/drive
Tue Aug 30 21:42:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    10W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-------------------------------------------------------------

# Data
The training and evaluation documents are obtained through the following steps: <br> **Decompress**, **Read**, **Filter**, **Development set** and **Preprocess**.
<br>
It uses the [Data settings](#data_settings).

## Decompress

In [None]:
!unzip -q -o {settings.data_zip_path}

## Read

In [None]:
def read_raw_data(path, required_columns):
    data_list = []
    filenames = os.listdir(path)
    errors_list = []

    for filename in tqdm(filenames, desc="Reading files"):
        try:
            if filename.endswith(".xml"):
                # Read file
                filepath = os.path.join(path, filename)
                with open(filepath, "r", encoding="utf8") as f:
                    soup = BS(f, "xml")
                
                # Get required columns
                row = []
                for column_name in required_columns:
                    column = soup.find(column_name)
                    if column is None:
                        raise Exception(f"No column named [{column_name}] found in file [{filepath}].")
                    else:
                        content = remove_initial_and_final_line_breaks(column.string)
                        if content is None or content == "":
                            raise Exception(f"Empty column [{column_name}] found in file [{filepath}].")
                        else:
                            row.append(content)
                
                # Add to data list
                data_list.append(row)
        
        # In case of error, store it
        except Exception as e:
            errors_list.append(str(e))
        
    return pd.DataFrame(data_list, columns=required_columns), errors_list


def remove_initial_and_final_line_breaks(text):
    """Removes line breaks and spaces at the beginning and end of the text.

    Returns None if the text is None (e.g., an element without a single string)
    or if nothing remains after stripping."""
    if text is None:
        return None

    # Strip only line breaks and spaces, as in the original character-by-character loop
    result = text.strip("\n ")

    return result if result != "" else None

In [None]:
complete_raw_train_df, train_df_error_list = read_raw_data(settings.train_data_path, settings.train_data_columns_to_read)
print(f"During training data reading, {len(train_df_error_list)} errors have been found.")
raw_eval_df, eval_df_error_list = read_raw_data(settings.eval_data_path, settings.eval_data_columns)
print(f"During evaluation data reading, {len(eval_df_error_list)} errors have been found.")

print(f"Raw train dataframe length = {len(complete_raw_train_df)}\n\
Raw evaluation dataframe length = {len(raw_eval_df)}")

Reading files:   0%|          | 0/50 [00:00<?, ?it/s]

During training data reading, 0 errors have been found.


Reading files:   0%|          | 0/50 [00:00<?, ?it/s]

During evaluation data reading, 0 errors have been found.
Raw train dataframe length = 50
Raw evaluation dataframe length = 50
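
The reading step above looks up one XML element per required column and rejects missing or empty ones. The following is a minimal sketch of that behaviour using the standard library's `xml.etree.ElementTree` instead of BeautifulSoup; the sample file content, the `article` root tag and the `read_columns` helper are assumptions made for illustration, not the dataset's actual layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content: one element per column inside a root element
SAMPLE_XML = """<article>
  <title>Jane Doe</title>
  <original_abstract>
Jane Doe (born 1970) is an American actress.
  </original_abstract>
</article>"""


def read_columns(xml_text, required_columns):
    """Return the stripped text of each required column, in order."""
    root = ET.fromstring(xml_text)
    row = []
    for name in required_columns:
        element = root.find(f".//{name}")
        if element is None:
            raise ValueError(f"No column named [{name}] found.")
        # Strip only line breaks and spaces, like the notebook's helper
        content = (element.text or "").strip("\n ")
        if content == "":
            raise ValueError(f"Empty column [{name}] found.")
        row.append(content)
    return row


print(read_columns(SAMPLE_XML, ["title", "original_abstract"]))
```

Requesting a column that is absent from the file (for example `spacy_abstract` in this sample) raises an error, which corresponds to the entries collected in the error lists above.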


## Filter

In [None]:
def filter_df_from_file(df, filepath):
    # Create titles dictionary
    titles_dict = {}
    with open(filepath, "r") as f:
        for line in f:
            line = line.rstrip("\n")    # Remove the line break, if any (the last line may not end with one)
            if line != "":
                titles_dict[line] = None
    
    # Get indexes of selected rows
    filtered_idxs_list = []
    for idx, title in enumerate(list(df[settings.id_column])):
        if title in titles_dict:
            filtered_idxs_list.append(idx)

    # Filter
    filtered_df = df.iloc[filtered_idxs_list]
    filtered_df = filtered_df.reset_index(drop=True)

    return filtered_df


def filter_df_randomly(complete_train_df, eval_df, random_amount, random_seed=None, verbose=False):
    # Random selection
    indexes = np.arange(len(complete_train_df))
    np.random.seed(random_seed)
    random_indexes = np.random.choice(indexes, random_amount, replace=False)
    filtered_train_df = complete_train_df[complete_train_df.index.isin(random_indexes)].reset_index(drop=True)

    # Add eval articles
    train_titles = list(filtered_train_df[settings.id_column])
    eval_titles = list(eval_df[settings.id_column])
    for title in eval_titles:
        if title not in train_titles:
            row = complete_train_df[complete_train_df[settings.id_column] == title].iloc[0]
            # pd.concat is used because DataFrame.append was removed in pandas 2.0
            filtered_train_df = pd.concat([filtered_train_df, row.to_frame().T], ignore_index=True)
            if verbose:
                if (filtered_train_df[settings.id_column] == title).any():
                    print(f"Added: {title}")
                else:
                    print(f"WARNING | Still missing: {title}")

    return filtered_train_df


def filter_df(df, is_strict, max_abstract_length=200):
    titles = list(df[settings.id_column])
    abstracts = list(df["original_abstract"])
    filtered_idxs_list = []

    wiki = wikipediaapi.Wikipedia('en', extract_format=wikipediaapi.ExtractFormat.HTML)
    values_list = list(zip(titles, abstracts))
    with tqdm(values_list) as pbar:
        for idx, (title, abstract) in enumerate(pbar):
            abstract = abstract[:max_abstract_length]
            if filter_article(title, wiki, is_strict, abstract):
                filtered_idxs_list.append(idx)
            
            num_discarded = (idx + 1) - len(filtered_idxs_list)
            ratio = num_discarded/(idx + 1)
            ratio_str = "%.2f" % ratio
            predicted_count = int((1-ratio) * len(titles))            
            pbar.set_description(f"Discard ratio: {ratio_str} | N.pred: {predicted_count}")
    
    # Filter using the indexes of the articles that passed the filter
    filtered_df = df.iloc[filtered_idxs_list]
    filtered_df = filtered_df.reset_index(drop=True)
    
    return filtered_df, filtered_idxs_list


def filter_article(title, wiki, is_strict, abstract):
    pass_the_filter = False
    try:
        page = wiki.page(title)
        # If page exists
        if page.exists():                
            # Cultural filters
            english_native_speaker = re.search(r'american|eeuu|english|england|australian|australia|canadian|canada|new zealander|new zealand', abstract, re.IGNORECASE)
            if english_native_speaker:
                is_actor = re.search(r'actor|actress', abstract, re.IGNORECASE)
                if is_actor:
                    # Alive filter
                    is_alive = False
                    texts_btw_parenthesis = re.findall(r'\((.*?)\)', abstract)                        
                    for text_btw_parenthesis in texts_btw_parenthesis:
                        years_list = [int(s) for s in re.findall(r'\b\d+\b', text_btw_parenthesis) if len(s)==4]
                        is_alive = len(years_list) == 1 # Is alive if there is only a born date
                        if is_alive:
                            break                        
                    if is_alive:
                        # Age filters
                        is_young = years_list[0] > 1950
                        is_not_too_young = years_list[0] <= 1995
                        if is_young and is_not_too_young:
                            # Popularity filters
                            if len(page.links) >= 100:
                                num_references = len(wikipedia.page(pageid=page.pageid).references) # References using the Wikipedia package instead of WikipediaAPI
                                if num_references >= 25:
                                    soup = BS(urlopen(f"https://en.wikipedia.org/?curid={page.pageid}"), features="html")   # Language links are obtained with BeautifulSoup since WikipediaAPI returns a JSONDecodeError if some language includes arabic characters
                                    lang_links = [(el.get('lang'), el.get('title')) for el in soup.select('li.interlanguage-link > a')]
                                    if not is_strict or len(lang_links) >=40:
                                        pass_the_filter = True

    except Exception as e:
        print("ERROR", title, e, page)
        print(page.text)        
    
    return pass_the_filter

In [None]:
if settings.train_dataset_setup == "50_eval":
    raw_train_df = complete_raw_train_df.copy(deep=False)
else:
    if "load" in settings.filter_data_policy.lower():
        raw_train_df = filter_df_from_file(complete_raw_train_df, settings.filtered_titles_path)
    elif "compute" in settings.filter_data_policy.lower():
        if settings.train_dataset_setup == "500_random":
            raw_train_df = filter_df_randomly(complete_raw_train_df, raw_eval_df, 500, random_seed=42)
        elif settings.train_dataset_setup == "500_filtered":
            raw_train_df, _ = filter_df(complete_raw_train_df, is_strict=True)
        elif settings.train_dataset_setup == "2000_filtered":
            raw_train_df, _ = filter_df(complete_raw_train_df, is_strict=False)

print(f"Raw filtered train dataframe length = {len(raw_train_df)}")

Raw filtered train dataframe length = 50
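
The alive and age filters in `filter_article` rely on the years found between parentheses in the abstract: a single four-digit number is read as a birth year without a death year. The following is a minimal sketch of just that part, using the same regular expressions; the helper names and the sample sentences are hypothetical.

```python
import re


def birth_year_if_alive(abstract):
    """Return the birth year if some parenthesis holds exactly one 4-digit number, else None."""
    for text in re.findall(r'\((.*?)\)', abstract):
        years = [int(s) for s in re.findall(r'\b\d+\b', text) if len(s) == 4]
        if len(years) == 1:     # Only a birth date: considered alive
            return years[0]
    return None


def passes_age_filter(abstract):
    """Born after 1950 and not later than 1995, as in the notebook's age filters."""
    year = birth_year_if_alive(abstract)
    return year is not None and 1950 < year <= 1995


print(passes_age_filter("Jane Doe (born March 3, 1970) is an American actress."))
print(passes_age_filter("John Roe (1920 - 1999) was an English actor."))
```

Note that the heuristic depends on the abstract's wording: any parenthesis with a single four-digit number is treated as a birth date.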


## Development set


In [None]:
def create_dev_set(raw_train_df, dev_set_split, raw_eval_df, allow_eval_data, required_columns, random_seed=None):
    # Compute dev set size
    dev_set_size = int(len(raw_train_df)*dev_set_split)

    # Set random seed
    np.random.seed(random_seed)
    
    # Get random titles with eval filter
    raw_dev_df = pd.DataFrame(columns = required_columns)
    with tqdm(total=dev_set_size, desc="Getting random articles from train data") as pbar:
        while len(raw_dev_df) < dev_set_size:
            # Obtain random articles from the training dataframe
            n_remaining = dev_set_size - len(raw_dev_df)
            random_df = get_random_articles(raw_train_df[required_columns], n_remaining)

            # Discard repeated
            random_df = random_df[~random_df[settings.id_column].isin(raw_dev_df[settings.id_column])]

            # Discard those present in evaluation data (if desired)
            if not allow_eval_data:
                random_df = random_df[~random_df[settings.id_column].isin(raw_eval_df[settings.id_column])]

            # Concatenate to the dataframe
            raw_dev_df = pd.concat([raw_dev_df, random_df], ignore_index=True, sort=False)
            
            # Update progress bar
            pbar.update(len(random_df))

    return raw_dev_df


def get_random_articles(df, random_amount):
    indexes = np.arange(len(df))
    random_indexes = np.random.choice(indexes, random_amount, replace=False)
    random_df = df[df.index.isin(random_indexes)]
    return random_df.reset_index(drop=True)

In [None]:
raw_dev_df = create_dev_set(raw_train_df, settings.dev_set_split, raw_eval_df, allow_eval_data=settings.train_with_eval_data, required_columns=settings.dev_data_columns, random_seed=21)
print(f"Raw development dataframe length = {len(raw_dev_df)}")

Getting random articles from train data:   0%|          | 0/15 [00:00<?, ?it/s]

Raw development dataframe length = 15
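
The development set size is the truncated product of the training set size and `dev_set_split`, which gives 15 articles for 50 training articles and a split of 0.3. The following is a minimal sketch of the selection on plain title lists; it uses the standard library's `random` module rather than NumPy, so the selected titles differ from the notebook's, and the function names are hypothetical. Unlike the loop above, it fails early when there are not enough candidate articles.

```python
import random


def dev_set_size(n_train, dev_set_split):
    """Number of development articles (the product is truncated, not rounded)."""
    return int(n_train * dev_set_split)


def sample_dev_titles(train_titles, eval_titles, dev_set_split, allow_eval_data, seed=None):
    """Pick distinct training titles, optionally excluding those used for evaluation."""
    eval_set = set(eval_titles)
    candidates = [t for t in train_titles if allow_eval_data or t not in eval_set]
    size = dev_set_size(len(train_titles), dev_set_split)
    if size > len(candidates):
        raise ValueError(f"Only {len(candidates)} candidate articles for a development set of {size}.")
    return random.Random(seed).sample(candidates, size)


titles = [f"Article {i}" for i in range(50)]
dev_titles = sample_dev_titles(titles, titles, 0.3, allow_eval_data=True, seed=21)
print(len(dev_titles))
```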


## Preprocess

In [None]:
def preprocess_data(raw_df, sensitive_markers_criteria):
    preprocessed_df = raw_df.copy(deep=False)
    
    # Create spaCy model. Components = tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, ner
    spacy_nlp = en_core_web_lg.load(disable = ["tok2vec", "parser", "senter", "ner"]) # Required components: "lemmatizer", "tagger" and "attribute_ruler"

    # Process the texts columns (all except the first one, that is the id/title column)
    for column_name in raw_df.columns[1:]:
        preprocess_texts(preprocessed_df[column_name], column_name, sensitive_markers_criteria, spacy_nlp)
    
    return preprocessed_df


def preprocess_texts(texts, texts_name, sensitive_markers_criteria, spacy_nlp):
    # Predefined patterns
    sensitive_markers_pattern = re.compile(r"\{[^}]*\}")
    curly_brackets_pattern = re.compile(r"[{}]+")
    square_brackets_pattern = re.compile(r"\[([^\]]+)\]")
    special_characters_pattern = re.compile(r"[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ./]+")
    stopwords = spacy_nlp.Defaults.stop_words

    for i, new_text in enumerate(tqdm(texts, desc=f"Process {texts_name} texts")):
        ########### Sensitive markers ###########
        # Get the sensitive markers and replace each one with the desired text
        sensitive_markers = re.finditer(sensitive_markers_pattern, new_text)
        if sensitive_markers != None:
            modification_len = 0
            for sensitive_marker in sensitive_markers:
                # Get sensitive marker text
                marker_text = sensitive_marker.group()
                marker_text = re.sub(curly_brackets_pattern, '', marker_text)     # Remove curly brackets (at the start and end of the marker)
                marker_text = re.sub(square_brackets_pattern, '', marker_text)    # Remove original text that is between square brackets
                # Consider if remove marker or not
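                if "remove" in sensitive_markers_criteria.lower():
                    criteria = sensitive_markers_criteria.lower()
                    is_generic = marker_text.isupper()
                    # "all" removes every marker; "generic" and "specific" remove only markers of that kind
                    if ("all" in criteria
                            or ("generic" in criteria and is_generic)
                            or ("specific" in criteria and not is_generic)):
                        marker_text = ""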
                # Substitute the sensitive marker with the resulting sensitive text
                span = sensitive_marker.span()
                new_text = "".join((new_text[:span[0]+modification_len], # join is used because it is faster than concatenating strings with +
                                    marker_text,
                                    new_text[span[1]+modification_len:]))
                # Accumulate modification length for correct further substitutions
                previous_len = span[1] - span[0]
                modification_len += len(marker_text) - previous_len
        
        # Remove underscores (used for chunks)
        new_text = new_text.replace('_', ' ')

        # Remove extra spaces before dots
        new_text = new_text.replace(' .', '.')

        # Remove additional spaces
        new_text = re.sub(" +", " ", new_text)
        
        ########### Document cleaning ###########
        doc = spacy_nlp(new_text)        
        new_text = ""   # Reset text
        for token in doc:
            if token.text not in stopwords:
                # Lemmatize
                token_text = token.lemma_ if token.lemma_ != "" else token.text
                # Remove special characters
                token_text = re.sub(special_characters_pattern, '', token_text)
                # Add to new text (without space if dot)
                new_text += ("" if token_text == "." else " ") + token_text
        # Remove initial space if it exists (startswith is also safe when the text is empty)
        if new_text.startswith(' '):
            new_text = new_text[1:]

        # Remove doc and use GarbageCollector to reduce memory consumption
        del doc
        if i % 5 == 0:    # Periodically collect
            gc.collect()
        
        # Remove additional spaces
        new_text = re.sub(" +", " ", new_text)

        # Store result
        texts[i] = new_text
    
    # Remove spaCy model
    del spacy_nlp

In [None]:
# Select the training data columns so that only those used for training are preprocessed (not those used for filtering or for creating the dev set)
raw_train_df_for_training = raw_train_df[settings.train_data_columns_for_training]

# Preprocess each dataframe
train_df = preprocess_data(raw_train_df_for_training, settings.preprocess_sensitive_tokens_method) 
eval_df = preprocess_data(raw_eval_df, settings.preprocess_sensitive_tokens_method)
dev_df = preprocess_data(raw_dev_df, settings.preprocess_sensitive_tokens_method)

print(f"Train dataframe length = {len(train_df)}\n\
Evaluation dataframe length = {len(eval_df)}\n\
Development dataframe length = {len(dev_df)}")

Process original_body texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process spacy_body texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process original_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process ner3_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process ner4_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process ner7_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process presidio_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process spacy_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process word2vec_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process manual_abstract texts:   0%|          | 0/50 [00:00<?, ?it/s]

Process spacy_abstract texts:   0%|          | 0/15 [00:00<?, ?it/s]

Train dataframe length = 50
Evaluation dataframe length = 50
Development dataframe length = 15
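
The sensitive-marker step of `preprocess_texts` can be illustrated in isolation. The following is a minimal sketch covering the keep, generic and all cases; it assumes, based only on the regular expressions above, that a marker looks like `{REPLACEMENT[original text]}` and that upper-case replacements are the generic ones. The sample sentence and the `resolve_markers` helper are hypothetical.

```python
import re

MARKER = re.compile(r"\{[^}]*\}")       # A whole marker, curly brackets included
CURLY = re.compile(r"[{}]+")            # Curly brackets to strip from the marker
ORIGINAL = re.compile(r"\[[^\]]+\]")    # Original text kept between square brackets


def resolve_markers(text, criteria="Keep all sensitive tokens"):
    """Replace each marker by its replacement text, or drop it according to the criteria."""
    criteria = criteria.lower()

    def replace(match):
        marker = CURLY.sub("", match.group())
        marker = ORIGINAL.sub("", marker)
        if "remove" in criteria:
            is_generic = marker.isupper()
            if "all" in criteria or ("generic" in criteria and is_generic):
                return ""
        return marker

    return MARKER.sub(replace, text)


sample = "{PERSON[Jane Doe]} is an {actress[American actress]}."
print(resolve_markers(sample))
print(resolve_markers(sample, "Remove generic sensitive tokens"))
```

Passing a replacement function to `re.sub` avoids the manual offset bookkeeping (`modification_len`) needed when the string is edited while iterating over the matches.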


# Experiments

Creation of the datasets, further pre-training, fine-tuning and evaluation of the attack.
<br>
It employs the [Experiments settings](#experiment_settings).

## Common
Functions, classes and objects required for both further pre-training and fine-tuning.
<br>
For instance, dataset and trainer classes.
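
Both training stages expose `sliding_window_length` and `sliding_window_overlap` settings. The following is a minimal sketch of one plausible reading of those settings, in which consecutive windows share `overlap` tokens; it is an assumption for illustration and not necessarily how the notebook's dataset class splits the documents.

```python
def sliding_windows(tokens, length, overlap):
    """Split a token list into windows of at most `length` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    stride = length - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + length])
        if start + length >= len(tokens):   # The last window reached the end of the text
            break
        start += stride
    return windows


# Fine-tuning defaults from the settings: length 100, overlap 25
windows = sliding_windows(list(range(250)), length=100, overlap=25)
print([(w[0], w[-1]) for w in windows])
```

With these values each window starts 75 tokens after the previous one, so a 250-token document yields three windows.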

### title_to_idx and idx_to_title

In [None]:
def get_title_to_idx_and_idx_to_title(train_df, eval_df, verbose=False):
    # Create title_to_idx with training data
    title_to_idx = {}
    for idx, title in enumerate(train_df[settings.id_column]):
        title_to_idx[title] = idx
    
    # Add eval titles to title_to_idx
    current_idx = len(title_to_idx)
    for title in list(eval_df[settings.id_column]):
        if title not in title_to_idx:
            title_to_idx[title] = current_idx
            current_idx += 1
            if verbose:
                print(f"title_to_idx: Added {title} from eval_df")
    
    # Create idx_to_title
    idx_to_title = []
    for title in title_to_idx.keys():
        idx_to_title.append(title)
    
    return title_to_idx, idx_to_title

In [None]:
title_to_idx, idx_to_title = get_title_to_idx_and_idx_to_title(train_df, eval_df)
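
The mapping built above can be illustrated with a minimal, self-contained sketch. It assumes plain lists of titles instead of dataframes, and the helper name `build_label_maps` and the sample titles are hypothetical. Unlike the notebook function, it also skips repeated training titles, so it is an approximation rather than a drop-in replacement.

```python
def build_label_maps(train_titles, eval_titles):
    """Map each distinct title to a class index, training titles first."""
    title_to_idx = {}
    for title in list(train_titles) + list(eval_titles):
        if title not in title_to_idx:
            # Next free index: equals the number of titles stored so far
            title_to_idx[title] = len(title_to_idx)
    # Dicts preserve insertion order, so the key list is the inverse mapping
    idx_to_title = list(title_to_idx)
    return title_to_idx, idx_to_title


# Hypothetical titles, for illustration only
demo_title_to_idx, demo_idx_to_title = build_label_maps(["A", "B"], ["B", "C"])
print(demo_title_to_idx)
print(demo_idx_to_title)
```

Titles that appear only in the evaluation data receive indices after all training titles, so the classifier still has an output slot for them.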

### Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(settings.base_model_name, max_len=settings.max_seq_length)

### Datasets

In [None]:
def prepare_df_for_dataset(df, return_dict):
    result_columns_names = [settings.id_column, "Text"] # Predefined column names. Always the same and used in TextDataset
    result = {} if return_dict else pd.DataFrame(columns=result_columns_names)  # One key-value per column or a single dataframe

    for column in df.columns:
        if column != settings.id_column:   # Ignore the title
            dataframe = df[[settings.id_column, column]]
            dataframe.columns = result_columns_names    # Rename columns for being always the same
            if return_dict:
                # For evaluation data. One dataframe per anonymization method
                result[column] = dataframe  
            else:
                # For training data, clear and anonymized background documents concatenated
                result = pd.concat([result, dataframe], ignore_index=True, sort=False)
    
    return result


class TextDataset(Dataset):
    def __init__(self, df, tokenizer, model_config, title_to_idx, max_seq_length, device):
        self.df = df
        self.tokenizer = tokenizer
        self.title_to_idx = title_to_idx
        self.max_seq_length = max_seq_length
        self.device = device

        self.model_config = None
        self.set_model_config(model_config) # Set model config includes input and labels computation if config has changed 
    
    def set_model_config(self, new_model_config):
        old_model_config = self.model_config
        
        self.model_config = new_model_config
        self.model_type = self.model_config.model_type
        self.use_sliding_window = self.model_config.use_sliding_window
        self.sliding_window_length = self.model_config.sliding_window_length
        self.sliding_window_overlap = self.model_config.sliding_window_overlap

        if self.use_sliding_window and self.sliding_window_length > self.max_seq_length:
            raise Exception("Sliding window length must not exceed the maximum sequence length")
        
        # Inputs and labels recomputation
        old = old_model_config
        new = self.model_config
        recomputation_required = (old is None
                                  or old.use_sliding_window != new.use_sliding_window
                                  or old.sliding_window_length != new.sliding_window_length
                                  or old.sliding_window_overlap != new.sliding_window_overlap)
        if recomputation_required:
            self.compute_inputs_and_labels()

    def compute_inputs_and_labels(self):
        self.inputs_texts = []
        self.labels_texts = []

        # Load spacy model for sentence splitting
        if not self.use_sliding_window:
            # Create spaCy model with every default component disabled
            # (tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, ner);
            # the rule-based sentencizer added below performs the sentence splitting
            spacy_nlp = en_core_web_lg.load(disable=["tok2vec", "tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])
            spacy_nlp.add_pipe('sentencizer')

        # Get inputs and labels texts
        for idx, text in enumerate(tqdm(self.df["Text"], desc="Get input and labels texts")):
            title = self.df[settings.id_column][idx]
            if self.use_sliding_window:
                # Prepare data for the sliding window algorithm which is computed at the compute_inputs method
                self.inputs_texts.append(text)
                self.labels_texts.append(title)
            else:
                # Sentence splitting
                for paragraph in text.split("\n"):
                    if len(paragraph.strip()) > 0:
                        doc = spacy_nlp(paragraph)
                        for sentence in doc.sents:
                            # Parse sentence to text
                            sentence_txt = ""
                            for token in sentence:
                                sentence_txt += " " + token.text
                            sentence_txt = sentence_txt[1:] # Remove initial space
                            # Ensure length is less than the maximum
                            sent_token_count = len(self.tokenizer.encode(sentence_txt, add_special_tokens=True))
                            if sent_token_count > self.max_seq_length:
                                raise Exception(f"ERROR: Sentence with length {sent_token_count} > {self.max_seq_length} at index {idx} with title {title} | {sentence_txt}")
                            else:
                                # Store sample
                                self.inputs_texts.append(sentence_txt)
                                self.labels_texts.append(title)
                
                # Delete document and use GarbageCollector for reducing memory consumption
                del doc
                if idx % 5 == 0:    # Periodically collect
                    gc.collect()
        
        # Compute inputs and labels using inputs_texts and labels_texts
        self.compute_inputs()
        self.compute_labels()

    def compute_inputs(self):
        self.inputs = self.tokenizer(self.inputs_texts,
                                        add_special_tokens=not self.use_sliding_window,
                                        padding="longest",  # Warning: If an input_text is longer than max_seq_length, an error will be raised at prediction time
                                        truncation=False,
                                        max_length=self.max_seq_length,
                                        return_tensors="pt")
        if self.use_sliding_window:
            old_input_ids = self.inputs["input_ids"]
            old_attention_masks = self.inputs["attention_mask"]
            old_seq_length = old_input_ids.size()[1]
            if old_seq_length > self.sliding_window_length: # If sliding window is possible
                new_input_ids = torch.zeros((0, self.sliding_window_length), dtype=torch.int)
                new_attention_masks = torch.zeros((0, self.sliding_window_length), dtype=torch.int)
                new_labels_texts = []
                # Iterate inputs
                num_inputs = old_input_ids.size()[0]
                for idx in tqdm(range(num_inputs), desc="Processing sliding window"):
                    input_ids = old_input_ids[idx, :]
                    attention_mask = old_attention_masks[idx, :]
                    sequences = []
                    attention_masks = []

                    # Sequence division using sliding window
                    ini = 0
                    end = ini + self.sliding_window_length - 2 # Minus 2 because of the CLS and SEP tokens
                    sequence_finished = False
                    padding_required = False                    
                    while not sequence_finished:
                        # Get the corresponding sequence and mask
                        if end > old_seq_length:
                            end = old_seq_length
                            padding_required = True
                        sequence = input_ids[ini:end]
                        mask = attention_mask[ini:end]

                        # Check if sequence is finished
                        sequence_finished = end == old_seq_length or padding_required or mask[-1] == 0

                        # Add CLS and SEP tokens
                        num_attention_tokens = torch.count_nonzero(mask)
                        if num_attention_tokens == mask.size()[0]:  # If sequence is full
                            sequence = torch.cat(( torch.tensor([self.tokenizer.cls_token_id]), sequence, torch.tensor([self.tokenizer.sep_token_id]) ))
                            mask = torch.cat(( torch.tensor([1]), mask, torch.tensor([1]) ))
                        else:
                            sequence[num_attention_tokens] = torch.tensor(self.tokenizer.sep_token_id)
                            sequence = torch.cat(( torch.tensor([self.tokenizer.cls_token_id]), sequence, torch.tensor([self.tokenizer.pad_token_id]) ))
                            mask[num_attention_tokens] = 1
                            mask = torch.cat(( torch.tensor([1]), mask, torch.tensor([0]) ))

                        # Padding if is required
                        if padding_required:
                            padding_length = self.sliding_window_length - sequence.size()[0]
                            padding = torch.zeros((padding_length), dtype=sequence.dtype)
                            sequence = torch.cat((sequence, padding))
                            mask = torch.cat((mask, padding))                        

                        # Increment indexes
                        ini += self.sliding_window_length - self.sliding_window_overlap - 2 # Minus 2 because of the CLS and SEP tokens
                        end = ini + self.sliding_window_length - 2 # Minus 2 because of the CLS and SEP tokens                        

                        # Append to lists
                        sequences.append(sequence)
                        attention_masks.append(mask)
                                    
                    # Stack lists and concatenate with new data
                    sequences = torch.stack(sequences)
                    attention_masks = torch.stack(attention_masks)
                    new_input_ids = torch.cat((new_input_ids, sequences))
                    new_attention_masks = torch.cat((new_attention_masks, attention_masks))
                    new_labels_texts += [self.labels_texts[idx]] * sequences.size()[0]
            
                self.inputs = {"input_ids": new_input_ids, "attention_mask": new_attention_masks}
                self.labels_texts = new_labels_texts
            
    def compute_labels(self):
        # Labels translated to the entity index
        labels_idxs = list(map(lambda x: self.title_to_idx[x], self.labels_texts))
        self.labels = torch.tensor(labels_idxs)

    def __len__(self):
        return len(self.inputs["input_ids"])

    def __getitem__(self, index):
        # Get each value (tokens, attention...) of the item
        input = {key: value[index] for key, value in self.inputs.items()}

        # Get label if is required
        if self.model_type != ModelType.MaskedLM:
            label = self.labels[index]
            input["labels"] = label
        
        return input
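
The sliding-window division performed in `compute_inputs` can be sketched on plain Python lists. This is a simplified illustration, not the tensor-based code of `TextDataset`: the function name `sliding_windows` and the token ids in the example are assumptions, and it processes one document at a time instead of a padded batch.

```python
def sliding_windows(token_ids, window_length, overlap, cls_id, sep_id, pad_id):
    """Split token_ids into windows of exactly window_length tokens.

    Each window is [CLS] + up to (window_length - 2) content tokens + [SEP],
    right-padded with pad_id. Consecutive windows share `overlap` content tokens.
    Returns a list of (input_ids, attention_mask) pairs.
    """
    content_length = window_length - 2   # Room left after [CLS] and [SEP]
    stride = content_length - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than window_length - 2")

    windows = []
    start = 0
    while True:
        content = list(token_ids[start:start + content_length])
        input_ids = [cls_id] + content + [sep_id]
        attention_mask = [1] * len(input_ids)
        padding = window_length - len(input_ids)
        input_ids += [pad_id] * padding
        attention_mask += [0] * padding
        windows.append((input_ids, attention_mask))
        if start + content_length >= len(token_ids):
            break   # This window reached the end of the document
        start += stride
    return windows


# Hypothetical ids: 101 = [CLS], 102 = [SEP], 0 = [PAD]
for ids, mask in sliding_windows(list(range(1, 9)), window_length=6, overlap=1,
                                 cls_id=101, sep_id=102, pad_id=0):
    print(ids, mask)
```

The stride is `window_length - overlap - 2`, matching the index increment used in the class, so every window carries the label of the document it was cut from.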

In [None]:
# Train dataset
train_df_for_dataset = prepare_df_for_dataset(train_df, return_dict=False)
model_config = settings.pretraining_config if settings.pretraining_with_train_dataset else settings.finetuning_config
train_dataset = TextDataset(train_df_for_dataset, tokenizer, model_config, title_to_idx, settings.max_seq_length, settings.device)

# Evaluation datasets
eval_df_dict_for_datasets = prepare_df_for_dataset(eval_df, return_dict=True)
eval_datasets_dict = {}
for name, df in eval_df_dict_for_datasets.items():
    model_config = settings.pretraining_config if name == settings.pretraining_with_eval_dataset else settings.finetuning_config
    eval_datasets_dict[name] = TextDataset(df, tokenizer, model_config, title_to_idx, settings.max_seq_length, settings.device)

# Development dataset
dev_df_for_dataset = prepare_df_for_dataset(dev_df, return_dict=False)
dev_dataset = TextDataset(dev_df_for_dataset, tokenizer, settings.finetuning_config, title_to_idx, settings.max_seq_length, settings.device)

Get input and labels texts:   0%|          | 0/100 [00:00<?, ?it/s]

Processing sliding window:   0%|          | 0/100 [00:00<?, ?it/s]

Get input and labels texts:   0%|          | 0/50 [00:00<?, ?it/s]

Processing sliding window:   0%|          | 0/50 [00:00<?, ?it/s]

Get input and labels texts:   0%|          | 0/15 [00:00<?, ?it/s]

Processing sliding window:   0%|          | 0/15 [00:00<?, ?it/s]

### Model initialization

In [None]:
def ini_model(model, model_config, base_model, tokenizer, device, base_model_name, copy_base_model=True, link_base_model=False):
    # Resize the model to the tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Link or copy base_model if required
    if link_base_model:
        if "distil" in base_model_name:
            old_model = model.distilbert
            model.distilbert = base_model
        elif "roberta" in base_model_name:
            old_model = model.roberta
            model.roberta = base_model
        elif "bert" in base_model_name:
            old_model = model.bert
            model.bert = base_model
        else:
            raise Exception(f"No code ready for base model [{base_model_name}]")
        
        del old_model

    elif copy_base_model:
        if "distil" in base_model_name:
            model.distilbert.load_state_dict(base_model.state_dict())
        elif "roberta" in base_model_name:
            base_model_dict = base_model.state_dict()
            base_model_dict = dict(base_model_dict) # Copy
            base_model_dict.pop("pooler.dense.weight")  # Specific for transformers version 4.20.1
            base_model_dict.pop("pooler.dense.bias")
            model.roberta.load_state_dict(base_model_dict)
        elif "bert" in base_model_name:
            model.bert.load_state_dict(base_model.state_dict())
        else:
            raise Exception(f"No code ready for base model [{base_model_name}]")

    # Model to device, create optimizer and show size
    model.resize_token_embeddings(len(tokenizer))
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=model_config.learning_rate)
    print(f"Model size = {sum([np.prod(p.size()) for p in model.parameters()])}")

    return model, optimizer
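
`ini_model` selects the encoder attribute by testing substrings of the model name, and the order of those tests matters because `"bert"` is also contained in DistilBERT and RoBERTa names. The sketch below isolates that dispatch; the helper name `backbone_attribute` is hypothetical and is not part of the notebook.

```python
def backbone_attribute(base_model_name):
    """Return the attribute name under which a task model stores its encoder."""
    # Order matters: "distilbert-..." and "roberta-..." both contain "bert",
    # so the most specific substrings must be tested first
    for key, attribute in (("distil", "distilbert"),
                           ("roberta", "roberta"),
                           ("bert", "bert")):
        if key in base_model_name:
            return attribute
    raise ValueError(f"No code ready for base model [{base_model_name}]")


for name in ("distilbert-base-uncased", "roberta-base", "bert-base-uncased"):
    print(name, "->", backbone_attribute(name))
```

With such a helper, the encoder could be reached through `getattr(model, backbone_attribute(base_model_name))`. Like the original checks, it would route a name such as `distilroberta-base` to the first branch, so names combining several of these substrings would need a dedicated case.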

### Trainer

In [None]:
class MyTrainer(Trainer):
    def __init__(self, eval_datasets_dict = None, results_filepath = None, training_title="", **kwargs):
        Trainer.__init__(self, **kwargs)
        self.eval_datasets_dict = eval_datasets_dict
        self.results_filepath = results_filepath        
        self.training_title = training_title
        self.do_custom_eval = self.eval_datasets_dict is not None and self.results_filepath is not None and self.training_title != ""

        if self.do_custom_eval:
            self.evaluation_epoch = 1
            self.initialize_results_file()
        
    def initialize_results_file(self):
        text = f"{self.training_title}\n"
        current_time = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
        text += f"{current_time}\n"
        text += "Time,Epoch"
        text += ",dev"
        for dataset_name in self.eval_datasets_dict.keys():
            text+=f",{dataset_name}"
        text += "\n"
        with open(self.results_filepath, "a+") as f:
            f.write(text)

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        # Dev test for EarlyStopping
        dev_results = Trainer.evaluate(self, eval_dataset=eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)

        if self.do_custom_eval:
            eval_results = {}
            # Store dev results
            eval_results["dev"] = dev_results
            # Get results
            for dataset_name, dataset in self.eval_datasets_dict.items():
                eval_results[dataset_name] = Trainer.evaluate(self, eval_dataset=dataset, ignore_keys=ignore_keys, metric_key_prefix="test")
            # Write results
            self.write_results(eval_results)
            # Increment evaluation epoch
            self.evaluation_epoch += 1

        return dev_results
    
    def write_results(self, eval_results):
        current_time = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
        try:
            results_text = f"{current_time},{self.evaluation_epoch}"

            for data in eval_results.values():
                key = list(filter(lambda k: "_Accuracy" in k, data.keys()))[0]
                accuracy = data[key]
                accuracy_str = "{:.3f}".format(accuracy)
                results_text += f",{accuracy_str}"
            results_text += "\n"
            with open(self.results_filepath, "a") as f:
                f.write(results_text)
        except Exception as e:
            print(f"ERROR writing the results: {e}")


def compute_metrics(results):
    logits, labels = results

    # Get predictions sum
    logits = torch.from_numpy(logits)
    probs = F.softmax(logits, dim=-1)
    probs_dict = {}
    for prob, label in zip(probs, labels):
        current_probs = probs_dict.get(label, torch.zeros_like(prob))
        probs_dict[label] = current_probs.add_(prob)
    
    # Compute final predictions
    num_preds = len(probs_dict)
    all_preds = torch.zeros(num_preds, device="cpu")
    all_labels = torch.zeros(num_preds, device="cpu")
    for idx, item in enumerate(probs_dict.items()):
        label, probs = item
        all_labels[idx] = label
        all_preds[idx] = torch.argmax(probs)
    
    correct_preds = torch.sum(all_preds == all_labels)
    accuracy = (float(correct_preds)/num_preds)*100
    return {"Accuracy": accuracy}


def get_trainer(model, model_config, train_dataset, eval_datasets_dict, dev_dataset, optimizer, results_filepath, training_title="", data_collator=None, use_dev_as_eval=False):
    is_for_mlm = model_config.model_type == ModelType.MaskedLM

    # Variable settings
    logging_strategy = "steps" if is_for_mlm else "epoch"
    evaluation_strategy = "no" if is_for_mlm else "epoch"
    save_strategy = "no"    # Alternatively: Maybe "steps" if is_for_mlm else "epoch"
    current_eval_datasets_dict = None if use_dev_as_eval else eval_datasets_dict

    # Define TrainingArguments    
    args = TrainingArguments(
        output_dir=model_config.trainer_folder_path,
        overwrite_output_dir=True,        
        save_strategy=save_strategy,
        save_steps=1000,
        save_total_limit=1,
        num_train_epochs=model_config.epochs,
        per_device_train_batch_size=model_config.batch_size,
        per_device_eval_batch_size=model_config.batch_size,
        logging_strategy=logging_strategy,
        logging_steps=500,        
        evaluation_strategy=evaluation_strategy,
        log_level="error",
        disable_tqdm=False,
        eval_accumulation_steps=5,  # Number of eval steps before moving predictions from GPU to RAM
    )

    # Define Trainer
    scheduler = get_constant_schedule(optimizer)
    trainer = MyTrainer(current_eval_datasets_dict,
                        results_filepath,
                        training_title=training_title,
                        model=model,
                        args=args,
                        train_dataset=train_dataset,
                        eval_dataset=dev_dataset,  # Set the dev_dataset as the eval_dataset as the attacker would do
                        optimizers=[optimizer, scheduler],
                        compute_metrics=compute_metrics,
                        data_collator=data_collator
                    )
    
    return trainer
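
The accuracy reported by `compute_metrics` is computed per document rather than per segment: the softmax probabilities of all segments sharing a true label are summed, and one prediction is taken per label. Below is a minimal sketch of that aggregation using plain Python lists instead of tensors. The function names and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def document_level_accuracy(segment_logits, segment_labels):
    """Sum segment probabilities per true label, then score one prediction per label."""
    summed = {}
    for logits, label in zip(segment_logits, segment_labels):
        probs = softmax(logits)
        previous = summed.get(label, [0.0] * len(probs))
        summed[label] = [s + p for s, p in zip(previous, probs)]
    # A document counts as re-identified when the class with the largest summed probability is its label
    correct = sum(1 for label, probs in summed.items()
                  if probs.index(max(probs)) == label)
    return 100.0 * correct / len(summed)


# Hypothetical logits: two segments of document 0, one segment of document 1
demo_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 0.0]]
demo_labels = [0, 0, 1]
print(document_level_accuracy(demo_logits, demo_labels))
```

In this example the second segment of document 0 is misclassified on its own, yet the summed probabilities still favour class 0, so the document as a whole counts as correctly re-identified.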

## Base language model creation

In [None]:
def create_base_model(base_model_name, tokenizer):
    base_model = AutoModel.from_pretrained(base_model_name)
    base_model.resize_token_embeddings(len(tokenizer))
    print(f"Model size = {sum([np.prod(p.size()) for p in base_model.parameters()])}")

    return base_model

In [None]:
base_model = create_base_model(settings.base_model_name, tokenizer)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model size = 66362880


## Further pre-training

In [None]:
def do_pretraining(model, base_model, model_config, optimizer, train_dataset, eval_datasets_dict, data_collator, pretraining_with_train_dataset, pretraining_with_eval_dataset):
    if pretraining_with_train_dataset:
        train_dataset.set_model_config(model_config)
        trainer = get_trainer(model, model_config, train_dataset, None, None, optimizer, None, data_collator=data_collator)
        trainer.train()
    if pretraining_with_eval_dataset.lower() != "no":
        old_epochs = model_config.epochs
        model_config.epochs *= 5
        eval_dataset = eval_datasets_dict[pretraining_with_eval_dataset]
        eval_dataset.set_model_config(model_config)
        trainer = get_trainer(model, model_config, eval_dataset, None, None, optimizer, None, data_collator=data_collator)
        model_config.epochs = old_epochs
        trainer.train()

In [None]:
if settings.use_pretraining:
    if "train" in settings.pretraining_policy.lower():
        # Create MLM model
        model = AutoModelForMaskedLM.from_pretrained(settings.base_model_name)
        model, optimizer = ini_model(model,
                                    settings.pretraining_config,
                                    base_model,
                                    tokenizer,
                                    settings.device,
                                    settings.base_model_name,
                                    copy_base_model=False,
                                    link_base_model=True)
        data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=settings.pretraining_config.mlm_probability)
        
        # Perform further pretraining
        do_pretraining(model, base_model, settings.pretraining_config, optimizer, train_dataset, eval_datasets_dict, data_collator, settings.pretraining_with_train_dataset, settings.pretraining_with_eval_dataset)
        
        # Move model and base_model to CPU to free GPU
        model = model.cpu()
        base_model = base_model.cpu()

        # Save
        if "save" in settings.pretraining_policy.lower():
            torch.save(base_model.state_dict(), settings.base_model_save_path)
    # Load
    elif "load" in settings.pretraining_policy.lower():
        base_model.load_state_dict(torch.load(settings.base_model_save_path))



Model size = 66985530


Step,Training Loss


## Fine-tuning
After each epoch, the re-identification risk/accuracy for each evaluation dataset will be displayed.

In [None]:
if "train" in settings.finetuning_policy.lower():
    # Create classifier
    num_labels = len(title_to_idx)
    model = AutoModelForSequenceClassification.from_pretrained(settings.base_model_name, num_labels=num_labels)

    # Initialize model
    model, optimizer = ini_model(model,
                                settings.finetuning_config,
                                base_model,
                                tokenizer,
                                settings.device,
                                settings.base_model_name,
                                copy_base_model=settings.use_pretraining)
    
    # Initialize datasets
    train_dataset.set_model_config(settings.finetuning_config)
    for dataset in eval_datasets_dict.values():
        dataset.set_model_config(settings.finetuning_config)
    dev_dataset.set_model_config(settings.finetuning_config)

    # Create trainer and train
    trainer = get_trainer(model, settings.finetuning_config, train_dataset, eval_datasets_dict, dev_dataset, optimizer,
                      settings.results_filepath, training_title=settings.training_setup, use_dev_as_eval=settings.use_dev_as_eval)
    trainer.train()
    
    # Save
    if "save" in settings.finetuning_policy.lower():
        torch.save(model.state_dict(), settings.finetuning_config.save_path)
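elif "load" in settings.finetuning_policy.lower():
    # The classifier must exist before its saved weights can be loaded
    model = AutoModelForSequenceClassification.from_pretrained(settings.base_model_name, num_labels=len(title_to_idx))
    model.load_state_dict(torch.load(settings.finetuning_config.save_path))
    model.to(settings.device)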

Model size = 66991922


Get input and labels texts:   0%|          | 0/100 [00:00<?, ?it/s]

Processing sliding window:   0%|          | 0/100 [00:00<?, ?it/s]

Epoch,Training Loss,Validation Loss,Accuracy
1,2.5813,2.482529,33.333333
1,2.5813,1.173297,82.0
1,2.5813,2.844547,36.0
1,2.5813,2.838217,32.0
1,2.5813,1.876704,70.0
1,2.5813,2.532223,40.0
1,2.5813,2.609035,44.0
1,2.5813,3.517856,16.0
1,2.5813,3.503957,10.0
2,0.9608,2.348228,46.666667


## Evaluation
Evaluation of the attack with custom sentences, and identification of the sentences from the evaluation documents that enabled the re-identification.

In [None]:
def custom_evaluate(model, dataset, tokenizer, batch_size, device, dataset_name=""):
    model.eval()
    dataloader = DataLoader(dataset, batch_size, pin_memory=True, shuffle=False)
    correct_pred_inputs = {}

    predictions_dict = {}
    dataset_header = f"[ {dataset_name} ] " if dataset_name != "" else ""
    with torch.no_grad():
        with tqdm(range(len(dataloader)), desc=f"Evaluating {dataset_header}") as pbar:
            for inputs in iter(dataloader):
                # Move values to device
                for key in inputs.keys():
                    inputs[key] = inputs[key].to(device)
                
                # Process
                outputs = model(**inputs)
                
                # Store logits
                logits = outputs.logits.cpu()                
                labels = inputs["labels"].cpu()
                inputs_ids = inputs["input_ids"].cpu()
                for logit, label, input_ids in zip(logits, labels, inputs_ids):
                    label = int(label.item())    # Convert to int, required to use it as key at dictionary
                    probs = F.softmax(logit, dim=-1)
                    # Accumulate probs
                    current_probs = predictions_dict.get(label, torch.zeros_like(probs))
                    predictions_dict[label] = current_probs.add_(probs)
                    # Check correct prediction
                    pred = int(torch.argmax(probs))
                    if pred == label:
                        inputs_list = correct_pred_inputs.get(label, [])
                        text = tokenizer.decode(input_ids)
                        inputs_list.append(text)
                        correct_pred_inputs[label] = inputs_list

                pbar.update()
            
            # Compute predictions
            num_preds = len(predictions_dict)
            all_preds = torch.zeros(num_preds, device="cpu")
            all_labels = torch.zeros(num_preds, device="cpu")
            for idx, (label, probs) in enumerate(predictions_dict.items()):
                all_labels[idx] = label
                all_preds[idx] = torch.argmax(probs)
            
            # Compute accuracy
            preds_correct = torch.sum(all_preds == all_labels)
            accuracy = float(preds_correct)/len(all_preds)*100
            accuracy_str = "{:.2f}".format(accuracy)
            pbar.set_description(f"{dataset_header}[ Accuracy = {accuracy_str} ]")

    return accuracy, all_preds, all_labels, correct_pred_inputs


def who_has_been_recognized(model, eval_datasets_dict, eval_df, anonymization_method, tokenizer, title_to_idx, idx_to_title, batch_size, device):
    dataset = eval_datasets_dict[anonymization_method]
    texts_list = list(eval_df[anonymization_method])

    # Evaluate
    accuracy, all_preds, all_labels, correct_pred_inputs = custom_evaluate(model, dataset, tokenizer, batch_size, device, dataset_name=anonymization_method)

    print("\nInputs correctly recognized:")
    for idx, inputs_list in correct_pred_inputs.items():
        print(f"[{idx_to_title[idx]}]")
        for input in inputs_list:
            print(f"\t{input}")

    print("\nPeople correctly recognized:")
    preds_correct = all_preds == all_labels
    for idx, is_correct in enumerate(preds_correct):
        if is_correct:
            label_idx = int(all_labels[idx])
            if label_idx < len(texts_list):            
                title = idx_to_title[label_idx]
                text = texts_list[label_idx]
                print(f"[{title}]\n\t{text}")


def who_the_text_talks_about(model, tokenizer, text, idx_to_title, device):
    inputs = tokenizer(text,
                add_special_tokens=False,
                return_tensors="pt").to(device)  # Use the device argument rather than the global settings
    with torch.no_grad():   # Inference only, no gradients needed
        output = model(**inputs)
    logits = output.logits
    pred_idx = int(torch.argmax(logits, dim=1))
    return idx_to_title[pred_idx]

In [None]:
text = "She worked on Harry Potter films"   #@param {type:"string"}
res = who_the_text_talks_about(model, tokenizer, text, idx_to_title, settings.device)
print(f"The text: {text}\nWas recognized as: {res}")

The text: She worked on Harry Potter films
Was recognized as: Emma Watson


In [None]:
who_has_been_recognized(model, eval_datasets_dict, eval_df, "manual_abstract", tokenizer, title_to_idx, idx_to_title, settings.finetuning_config.batch_size, settings.device)

Evaluating [ manual_abstract ] :   0%|          | 0/7 [00:00<?, ?it/s]


Inputs correctly recognized:
[Angelina Jolie]
	[CLS] hollywood actress. she continued successful action star career sensitive sensitive sensitive received critical acclaim performances dramas sensitive sensitive earned nomination sensitive sensitive. her biggest commercial success came fantasy picture sensitive. in sensitive sensitive expanded career sensitive sensitive sensitive in sensitive sensitive sensitive sensitive. in addition film career sensitive noted sensitive received sensitive honorary damehood sensitive honors. she promotes causes including conservation education sensitive noted advocacy sensitive sensitive sensitive. as public figure sensitive cited sensitive american entertainment industry. for number years cited sensitiveby media outlets personal life subject wide publicity [SEP]
[Johnny Depp]
	[CLS] he gained praise reviewers portrayals sensitive sensitive sensitive sensitive sensitive sensitive sensitive fsensitive bsensitivewsensitivein bsensitive dsensitiveis tse

In [None]:
who_has_been_recognized(model, eval_datasets_dict, eval_df, "spacy_abstract", tokenizer, title_to_idx, idx_to_title, settings.finetuning_config.batch_size, settings.device)

Evaluating [ spacy_abstract ] :   0%|          | 0/8 [00:00<?, ?it/s]


Inputs correctly recognized:
[Angelina Jolie]
	[CLS] person org born fac person date norp actress filmmaker humanitarian. she received academy award cardinal screen actors guild awards cardinal golden globe awards cited gpe highest paid actress. person screen debut child alongside father person person get out date. her film career began earnest date low budget production product date followed ordinal leading role major film hackers date. she starred critically acclaimed biographical cable films person date person date won work of art performance drama girl interrupted date. person starring role video game heroine loc lo [SEP]
	[CLS] cable films person date person date won work of art performance drama girl interrupted date. person starring role video game heroine loc loc tomb raider date established leading gpe actress. she continued successful action star career mr. mrs. person date wanted date org date received critical acclaim performances dramas a mighty heart date changeling date