<a href="https://colab.research.google.com/github/TurkuNLP/Turku-paraphrase-models/blob/main/para_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, install the dependencies, clone the paraphrase classification code, and download the Finnish BERT model.

In [1]:
!pip install transformers
!pip install pytorch-lightning
!git clone https://github.com/TurkuNLP/paraphrase-classification para
!wget http://dl.turkunlp.org/finbert/torch-transformers/bert-large-finnish-cased-rc2/bert-large-finnish-cased-rc2.tar.gz
!tar zxf bert-large-finnish-cased-rc2.tar.gz

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled P

Then load the paraphrase model and the tokenizer, and define helper functions that encode a pair of texts into the model's input tensors.

In [2]:
import torch
import transformers
from para.notebook import para_model

bert_model = "bert-large-finnish-cased-rc2"

model = para_model.ParaMultiOutputAvgModel.load_from_checkpoint(checkpoint_path="http://dl.turkunlp.org/turku-paraphrase/model.ckpt", bert_model=bert_model)

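# Switch to inference mode (disables dropout).
model.eval()
# Move the model to the GPU; this requires a CUDA-enabled runtime.
model.cuda()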

tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

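def compute_masks(mask):
    # mask is the tokenizer's special_tokens_mask: 1 at [CLS] and at both [SEP] positions, 0 elsewhere.
    one_idx = [i for i, b in enumerate(mask) if b]
    zeros = torch.zeros(len(mask), dtype=torch.long)
    # One-hot masks for [CLS], the first [SEP] and the second [SEP]; the unpacking assumes exactly three special tokens.
    cls_mask, sep1_mask, sep2_mask = [torch.scatter(zeros, 0, torch.tensor([i]), torch.tensor([1])) for i in one_idx]
    # Tokens strictly between [CLS] and the first [SEP] belong to the first text.
    # torch.arange keeps the index dtype integral even when the range is empty.
    left_idx = torch.arange(one_idx[0]+1, one_idx[1])
    left_mask = torch.scatter(zeros, 0, left_idx, torch.ones_like(left_idx))
    # Tokens strictly between the two [SEP] tokens belong to the second text.
    right_idx = torch.arange(one_idx[1]+1, one_idx[2])
    right_mask = torch.scatter(zeros, 0, right_idx, torch.ones_like(right_idx))
    return {'cls_mask': cls_mask, 'sep1_mask': sep1_mask, 'sep2_mask': sep2_mask, 'left_mask': left_mask, 'right_mask': right_mask}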

def encode(tokenizer, txt1, txt2):
    # Tokenize each text separately and convert the tokens to vocabulary ids.
    t1_tok = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(txt1))
    t2_tok = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(txt2))
    # Join the pair with [CLS]/[SEP] tokens, truncating to at most 512 tokens.
    encoded = tokenizer.prepare_for_model(t1_tok, t2_tok, return_length=True, return_special_tokens_mask=True, max_length=512, truncation=True, padding='longest', return_tensors='pt')
    r = {"input_ids": encoded.input_ids, "token_type_ids": encoded.token_type_ids, "attention_mask": encoded.attention_mask, "length": encoded.length, **compute_masks(encoded.special_tokens_mask)}
    # Add a batch dimension of size 1 and move the model inputs to the GPU.
    for k in "input_ids", "token_type_ids", "attention_mask", "cls_mask", "sep1_mask", "sep2_mask", "left_mask", "right_mask":
        r[k] = torch.unsqueeze(r[k], 0).to(device='cuda')
    return r

Downloading: "http://dl.turkunlp.org/turku-paraphrase/model.ckpt" to /root/.cache/torch/hub/checkpoints/model.ckpt


  0%|          | 0.00/1.32G [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-large-finnish-cased-rc2 were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  stream(template_mgs % msg_args)
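
The masks returned by `compute_masks` are easier to follow on a toy input. The block below is a minimal sketch that reproduces the same logic with plain Python lists instead of tensors, so it runs without PyTorch or a GPU. The function name `segment_masks` and the toy mask are made up for illustration and are not part of the notebook's code.

```python
def segment_masks(special_tokens_mask):
    """Plain-list illustration of the five masks derived from a special-tokens mask."""
    # Positions flagged as special tokens: expected to be [CLS], [SEP], [SEP].
    special = [i for i, flag in enumerate(special_tokens_mask) if flag]
    if len(special) != 3:
        raise ValueError("expected exactly three special tokens")
    cls_i, sep1_i, sep2_i = special
    n = len(special_tokens_mask)

    def one_hot(positions):
        # 1 at every listed position, 0 elsewhere.
        positions = set(positions)
        return [1 if i in positions else 0 for i in range(n)]

    return {
        "cls_mask": one_hot([cls_i]),
        "sep1_mask": one_hot([sep1_i]),
        "sep2_mask": one_hot([sep2_i]),
        # Tokens strictly between [CLS] and the first [SEP]: the first text.
        "left_mask": one_hot(range(cls_i + 1, sep1_i)),
        # Tokens strictly between the two [SEP] tokens: the second text.
        "right_mask": one_hot(range(sep1_i + 1, sep2_i)),
    }


# Hypothetical layout: [CLS] a b [SEP] c d e [SEP]
toy = [1, 0, 0, 1, 0, 0, 0, 1]
masks = segment_masks(toy)
for name, mask in masks.items():
    print(f"{name:10s} {mask}")
```

In this toy layout, `left_mask` selects positions 1 and 2 and `right_mask` selects positions 4 to 6, so each text of the pair can be addressed separately.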


Finally, run a candidate paraphrase pair through the model and print the predicted label.

In [3]:
with torch.no_grad():
    # r = model(encode(tokenizer, "Tämä kakku on herkullista.", "Onpa hyvää kakkua."))
    r = model(encode(tokenizer, "Tämäpä vasta oli hienoa.", "Se oli ihan sairaan siistiä."))
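    # Pick the highest-scoring class index for each model output.
    indices = {k: torch.argmax(v) for k, v in r.items()}
    # Invert the model's label-to-index maps so that indices can be turned back into labels.
    flag_i2lab = {k: {vd: kd for kd, vd in v.items()} for k, v in model.flag_lab2i.items()}
    label_dict = {k: flag_i2lab[k][v.item()] for k, v in indices.items()}
    # Join the base label with the optional direction, 'i' and 's' flags.
    label = label_dict['base'] + (label_dict['direction'] if label_dict['direction'] != 'None' else '') + ('i' if label_dict['has_i'] else '') + ('s' if label_dict['has_s'] else '')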
    print(label)

4is
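
The printed label is assembled from four separate predictions: a base label, an optional direction, and the `i` and `s` flags. The sketch below walks through that decoding step on its own, without the model. The contents of `lab2i` and `indices` are hypothetical stand-ins chosen for illustration; the real mapping lives in `model.flag_lab2i` and may use different labels or value types.

```python
def invert(lab2i):
    """Turn {output: {label: index}} into {output: {index: label}}."""
    return {out: {idx: lab for lab, idx in mapping.items()} for out, mapping in lab2i.items()}


def compose_label(label_dict):
    """Join the base label with the optional direction, 'i' and 's' flags."""
    direction = label_dict["direction"]
    return (
        label_dict["base"]
        + (direction if direction != "None" else "")
        + ("i" if label_dict["has_i"] else "")
        + ("s" if label_dict["has_s"] else "")
    )


# Hypothetical label-to-index maps (assumed shape, not read from the checkpoint).
lab2i = {
    "base": {"2": 0, "3": 1, "4": 2},
    "direction": {"None": 0, "<": 1, ">": 2},
    "has_i": {False: 0, True: 1},
    "has_s": {False: 0, True: 1},
}
# Hypothetical argmax results, one class index per output.
indices = {"base": 2, "direction": 0, "has_i": 1, "has_s": 1}

i2lab = invert(lab2i)
label_dict = {out: i2lab[out][idx] for out, idx in indices.items()}
label = compose_label(label_dict)
print(label)
```

With these stand-in values the decoded parts are base `4`, no direction, and both flags set, which yields the same string as the notebook output.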
