# Flan T5 for Finnish person name recognition

## Goal

input: find person names in: Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä .

output: Ville Oksanen

notes: If there is more than one person name, the names should be separated by commas.
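
The input/output convention above can be sketched in plain Python. The helper names `build_prompt`, `format_names`, and `parse_names` are hypothetical and not part of this notebook, and the `", "` separator is an assumption (the note only says "commas"). The sketch only illustrates the prompt prefix and the comma-separated target format.

```python
PREFIX = "find person names in:"

def build_prompt(sentence):
    """Prepend the task prefix to one input sentence."""
    return f"{PREFIX} {sentence}"

def format_names(names):
    """Join several person names into one target string (separator assumed to be ', ')."""
    return ", ".join(names)

def parse_names(output):
    """Split a generated string back into individual names, ignoring empty pieces."""
    return [name.strip() for name in output.split(",") if name.strip()]

prompt = build_prompt("Sullivan on valmis kippaamaan syyn Yhdysvaltain niskaan .")
target = format_names(["Ville Oksanen", "Sullivan"])
names = parse_names(target)
print(prompt)
print(target)
print(names)
```

Stripping whitespace in `parse_names` keeps the round trip stable whether or not the model emits a space after each comma.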

## Environment

In [1]:
!pip install "transformers==4.27.2" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib --upgrade --quiet

In [2]:
# !pip3 install transformers datasets

import transformers
from datasets import load_dataset, load_metric

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!pip3 install wandb

import wandb
wandb.login()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

wandb: Currently logged in as: christysss29 (585lora). Use `wandb login --relogin` to force relogin


True

## Data

source: https://github.com/mpsilfve/finer-data

Use the `load_dataset` function from the Hugging Face `datasets` library to load the data.

In [5]:
train = load_dataset(path=".", 
                       data_files="/content/drive/MyDrive/Flan-T5_Name-recognition/data/FINER.train.tsv".split(),
                       delimiter="\t",
                       column_names="text ner".split())["train"]

dev = load_dataset(path=".", 
                       data_files="/content/drive/MyDrive/Flan-T5_Name-recognition/data/FINER.dev.tsv".split(),
                       delimiter="\t",
                       column_names="text ner".split())["train"]




In [6]:
print(train)
print(dev)
print(train[0])

Dataset({
    features: ['text', 'ner'],
    num_rows: 3686
})
Dataset({
    features: ['text', 'ner'],
    num_rows: 346
})
{'text': 'Mutta jos takana oli Pohjois-Korea , Sullivan on valmis kippaamaan syyn Yhdysvaltain niskaan .', 'ner': 'Sullivan'}
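
The printed example suggests a two-column, tab-separated layout: the sentence first, the person name(s) second. As a minimal sketch of that layout, the snippet below parses an in-memory sample with the standard `csv` module. The sample rows are illustrative only: they reuse the two sentences quoted in this notebook and are not claimed to be consecutive lines of the actual file.

```python
import csv
import io

# Illustrative sample in the assumed "text<TAB>ner" layout (not read from the real file).
sample = (
    "Mutta jos takana oli Pohjois-Korea , Sullivan on valmis kippaamaan syyn Yhdysvaltain niskaan .\tSullivan\n"
    "Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä .\tVille Oksanen\n"
)

reader = csv.DictReader(
    io.StringIO(sample),
    fieldnames=["text", "ner"],   # same column names as in the load_dataset call
    delimiter="\t",
    quoting=csv.QUOTE_NONE,       # treat quote characters in sentences as plain text
)
rows = list(reader)

for row in rows:
    print(row["ner"], "<-", row["text"][:40])
```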


### Process data

Tokenize with the Flan-T5 tokenizer and prepend the prompt `find person names in:` to every input sentence.

In [7]:
import nltk
nltk.download('punkt')
import string
from transformers import AutoTokenizer

model_checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
PREFIX = "find person names in:"
MAX_INPUT_LENGTH = 128  # sentences are usually short, so 128 (instead of 512) speeds up training
MAX_TARGET_LENGTH = 32  # name lists are short, so 32 tokens are enough

def preprocess_data(examples):
    inputs = [PREFIX + " " + text for text in examples["text"]]
    
    model_inputs = tokenizer(inputs, 
                             max_length=MAX_INPUT_LENGTH, 
                             truncation=True)

    # Tokenize the targets; `text_target` replaces the deprecated
    # `tokenizer.as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples["ner"],
                       max_length=MAX_TARGET_LENGTH,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    
    return model_inputs

In [9]:
tokenized_train = train.map(preprocess_data, batched=True)
tokenized_dev = dev.map(preprocess_data, batched=True)


In [10]:
tokenized_train[0]

{'text': 'Mutta jos takana oli Pohjois-Korea , Sullivan on valmis kippaamaan syyn Yhdysvaltain niskaan .',
 'ner': 'Sullivan',
 'input_ids': [253, 568, 3056, 16, 10, 16601, 17, 9, 7406, 3, 17, 9, 3304, 9, 3,
  4172, 1908, 107, 1927, 159, 18, 439, 32, 864, 3, 6, 3, 23748, 30, 3, 2165, 51,
  159, 3, 2168, 1572, 9, 265, 9, 152, 3, 7, 63, 63, 29, 3, 476, 107, 26, 63, 7,
  2165, 17, 9, 77, 3, 29, 13690, 9, 152, 3, 5, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [3, 23748, 1]}

## Load model and prepare for fine-tuning

import:

- `AutoModelForSeq2SeqLM` loads the model,
- `DataCollatorForSeq2Seq` for data batching,
- `Seq2SeqTrainingArguments` sets all hyperparameters for training,
- `Seq2SeqTrainer` trains the model

hyperparameters:

- `model_dir` -- where to save model checkpoints
- `evaluation_strategy="steps"` -- evaluate every N steps
- `eval_steps=100` -- where N = 100
- `logging_strategy="steps"` -- write information about training loss every N steps
- `logging_steps=1` -- where N = 1
- `save_strategy="steps"` -- save model every N steps
- `save_steps=100` -- where N = 100
- `learning_rate=4e-5` -- initial learning rate for training
- `per_device_train_batch_size=batch_size` -- training batch size
- `per_device_eval_batch_size=batch_size` -- evaluation batch size
- `weight_decay=0.01` -- hyperparameter for weight decay during training
- `save_total_limit=3` -- save a maximum of 3 models
- `num_train_epochs=1` -- number of training epochs
- `predict_with_generate=True` -- generate the actual output during evaluation
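
To see what these settings imply for this run, the arithmetic can be sketched as below. It assumes a single device and no gradient accumulation (neither is set in this notebook), takes the training-set size of 3686 rows from the dataset printout, and uses the batch size of 8 set in the training cell.

```python
import math

num_train_examples = 3686   # rows in the train split
batch_size = 8              # per-device batch size; one device assumed
num_train_epochs = 1
eval_steps = 100
save_steps = 100
save_total_limit = 3

# One optimizer step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which evaluation runs and checkpoints are written.
eval_at = list(range(eval_steps, total_steps + 1, eval_steps))
saved_at = list(range(save_steps, total_steps + 1, save_steps))

# With save_total_limit, only the most recent checkpoints are kept on disk.
retained = saved_at[-save_total_limit:]

print(total_steps)
print(eval_at)
print(retained)
```

Under these assumptions the run has 461 steps, which agrees with the `global_step` reported by the training run, and the oldest checkpoint (step 100) is deleted once the step-400 checkpoint is written.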

In [11]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [14]:
data_collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
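
As a rough illustration of what the collator does to the labels, the pure-Python sketch below pads a batch to its longest sequence. It is not the library implementation. `DataCollatorForSeq2Seq` pads labels with `-100` by default so that padded positions are ignored by the loss, and T5's pad token id is `0`. The helper `pad_to_longest` and the second label row are made up for illustration.

```python
PAD_TOKEN_ID = 0           # pad token id of the T5 tokenizer
LABEL_PAD_TOKEN_ID = -100  # default label padding value in DataCollatorForSeq2Seq

def pad_to_longest(sequences, pad_value):
    """Right-pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# First row is the tokenized target "Sullivan" shown earlier; the second row is hypothetical.
labels = [[3, 23748, 1], [5, 1]]
padded_labels = pad_to_longest(labels, LABEL_PAD_TOKEN_ID)

# -100 is not a valid token id, so it has to be mapped back to the
# pad token id before the labels can be decoded to text.
decodable = [
    [PAD_TOKEN_ID if tok == LABEL_PAD_TOKEN_ID else tok for tok in row]
    for row in padded_labels
]

print(padded_labels)
print(decodable)
```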

In [15]:
batch_size = 8
model_name = "flan-t5-finer"
model_dir = f"/content/drive/MyDrive/Flan-T5_Name-recognition/models/{model_name}"

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="steps",
    save_steps=100,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    run_name="Flan-T5_Name-Recognition",  # name displayed on wandb
    )

## Evaluation metric

The BLEU score (computed with sacreBLEU) is reported on the dev set at every evaluation step during training.

In [16]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!pip3 install sacrebleu
metric = load_metric("sacrebleu")


In [18]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Label padding uses -100 (ignored by the loss) rather than the regular
    # pad token, so replace -100 in the labels before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # For BLEU, we'll need to split the outputs into word tokens
    # Use NLTK word_tokenize.
    decoded_preds = [nltk.word_tokenize(pred.strip())
                      for pred in decoded_preds]
    decoded_labels = [[nltk.word_tokenize(label.strip())]
                      for label in decoded_labels]
    
    # Compute BLEU scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)

    return result

## Training

Initialize trainer and train the model

In [19]:
def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


### Demo before training

In [20]:
model = model_init()
text = """Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä ."""
inputs = ["find person names in: " + text]

print("INPUT:", inputs)
inputs = tokenizer(inputs, max_length=128, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=1, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("OUTPUT:", decoded_output)

INPUT: ['find person names in: Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä .']
OUTPUT: find person names in: Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muitotalinen kannustaakseen edesmenneen Ville Oksasen


### Train

In [21]:
trainer.train()



You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Score | Counts | Totals | Precisions | Bp | Sys Len | Ref Len |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.6971 | 0.319756 | 80.293504 | [1186, 763, 471, 186] | [1268, 922, 576, 230] | [93.53312302839117, 82.75488069414317, 81.77083333333333, 80.8695652173913] | 0.949281 | 1268 | 1334 |
| 200 | 0.0385 | 0.264061 | 82.678954 | [1202, 788, 489, 196] | [1272, 926, 580, 234] | [94.49685534591195, 85.09719222462203, 84.3103448275862, 83.76068376068376] | 0.952427 | 1272 | 1334 |
| 300 | 0.221 | 0.26075 | 84.663112 | [1211, 806, 498, 196] | [1262, 916, 570, 224] | [95.95879556259905, 87.99126637554585, 87.36842105263158, 87.5] | 0.944545 | 1262 | 1334 |
| 400 | 0.1962 | 0.266335 | 84.862224 | [1214, 807, 502, 203] | [1270, 924, 578, 232] | [95.59055118110236, 87.33766233766234, 86.85121107266436, 87.5] | 0.950855 | 1270 | 1334 |
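
The columns of the evaluation log fit together through the standard BLEU formula: the score is the brevity penalty times the geometric mean of the four n-gram precisions. As a sanity check, the sketch below recomputes the step-100 score from the logged counts, totals, and lengths. It is a plain re-derivation, not a call into sacreBLEU, and it needs no smoothing because every count is nonzero.

```python
import math

# Values copied from the step-100 row of the evaluation log.
counts = [1186, 763, 471, 186]   # matched 1- to 4-grams
totals = [1268, 922, 576, 230]   # predicted 1- to 4-grams
sys_len, ref_len = 1268, 1334    # total predicted / reference tokens

# Modified n-gram precisions.
precisions = [c / t for c, t in zip(counts, totals)]

# Brevity penalty: penalizes predictions that are shorter than the references.
bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)

# BLEU = BP * geometric mean of the precisions, scaled to 0-100.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = 100 * bp * geo_mean

print(round(bp, 6), round(bleu, 4))
```

The recomputed values should match the logged `Bp` and `Score` columns up to rounding. A brevity penalty below 1 indicates that the predictions are, in total, somewhat shorter than the reference names.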



TrainOutput(global_step=461, training_loss=0.35768711747280413, metrics={'train_runtime': 7814.907, 'train_samples_per_second': 0.472, 'train_steps_per_second': 0.059, 'total_flos': 536485986975744.0, 'train_loss': 0.35768711747280413, 'epoch': 1.0})

In [None]:
wandb.finish()

### Demo after training

Load the model from checkpoint

In [22]:
model_name = "flan-t5-finer"
model_dir = f"/content/drive/MyDrive/Flan-T5_Name-recognition/models/{model_name}/checkpoint-400"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

max_input_length = 128

In [23]:
text = """Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä ."""
inputs = ["find person names in: " + text]

print("INPUT:", inputs)
inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=1, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("OUTPUT:", decoded_output)

INPUT: ['find person names in: Electronic Frontier Finland ry perustaa muistopalkinnon kannustaakseen muita jatkamaan edesmenneen Ville Oksasen jalanjäljissä .']
OUTPUT: Ville Oksasen
