<a href="https://colab.research.google.com/github/IlyaGusev/HeadlineCause/blob/main/notebooks/HeadlineCauseGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Requirements

In [1]:
!git clone https://github.com/IlyaGusev/HeadlineCause

Cloning into 'HeadlineCause'...
remote: Enumerating objects: 524, done.
remote: Counting objects: 100% (524/524), done.
remote: Compressing objects: 100% (388/388), done.
remote: Total 524 (delta 277), reused 313 (delta 126), pack-reused 0
Receiving objects: 100% (524/524), 3.11 MiB | 16.15 MiB/s, done.
Resolving deltas: 100% (277/277), done.


In [2]:
!pip install --upgrade -r HeadlineCause/requirements.txt

Collecting tensorflow-text>=2.6.0
  Downloading tensorflow_text-2.6.0-cp37-cp37m-manylinux1_x86_64.whl (4.4 MB)
Collecting transformers==4.9.2
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting spacy==3.1.2
  Downloading spacy-3.1.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
Collecting checklist==0.0.11
  Downloading checklist-0.0.11.tar.gz (12.1 MB)
Collecting hnswlib==0.5.2
  Downloading hnswlib-0.5.2.tar.gz (29 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting catboost==0.26.1
  Downloading catboost-0.26.1-cp37-none-manylinux1_x86_64.whl (67.4 MB)

# Data loading

In [3]:
!wget https://github.com/IlyaGusev/HeadlineCause/releases/download/v0/headline_cause_v0.tar.gz
!tar -xzvf headline_cause_v0.tar.gz

--2021-08-27 12:29:23--  https://github.com/IlyaGusev/HeadlineCause/releases/download/v0/headline_cause_v0.tar.gz
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/389190471/6b62c015-f209-4acb-92ca-e88d05df68a2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210827%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210827T122923Z&X-Amz-Expires=300&X-Amz-Signature=e55dac05e1b273c0deb6144bd2699c70bf6a19f0641981784d50fc79dc93761a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=389190471&response-content-disposition=attachment%3B%20filename%3Dheadline_cause_v0.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-08-27 12:29:23--  https://github-releases.githubusercontent.com/389190471/6b62c015-f209-4acb-92ca-e88d05df68a2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Creden

In [4]:
import json

def read_jsonl(file_name):
    """Read a JSON Lines file into a list of dicts."""
    records = []
    with open(file_name, "r", encoding="utf-8") as r:
        for line in r:
            record = json.loads(line)
            records.append(record)
    return records

def fix_records(records):
    """Keep only pairs labeled "left..." or "right..." and give them one orientation."""
    fixed_records = []
    for r in records:
        result = r["simple_result"]
        if not result.startswith("right") and not result.startswith("left"):
            continue
        # Swap the titles of "right..." pairs so every kept pair reads left -> right.
        if result.startswith("right"):
            r["left_title"], r["right_title"] = r["right_title"], r["left_title"]
        fixed_records.append(r)
    return fixed_records

ru_train_records = fix_records(read_jsonl("simple_ru_train.jsonl"))
ru_val_records = fix_records(read_jsonl("simple_ru_val.jsonl"))
ru_test_records = fix_records(read_jsonl("simple_ru_test.jsonl"))
en_train_records = fix_records(read_jsonl("simple_en_train.jsonl"))
en_val_records = fix_records(read_jsonl("simple_en_val.jsonl"))
en_test_records = fix_records(read_jsonl("simple_en_test.jsonl"))
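
To make the effect of `fix_records` concrete, here is a minimal standalone sketch run on three made-up records. The titles and the label strings (`left_right`, `right_left`, `no_relation`) are assumptions for illustration only; the function relies just on the label's `left`/`right` prefix.

```python
def fix_records(records):
    # Same logic as the cell above: keep directed pairs, normalise orientation.
    fixed = []
    for r in records:
        result = r["simple_result"]
        if not result.startswith(("left", "right")):
            continue
        if result.startswith("right"):
            r["left_title"], r["right_title"] = r["right_title"], r["left_title"]
        fixed.append(r)
    return fixed

# Hypothetical toy records, not taken from the dataset.
toy = [
    {"left_title": "A", "right_title": "B", "simple_result": "left_right"},
    {"left_title": "C", "right_title": "D", "simple_result": "right_left"},
    {"left_title": "E", "right_title": "F", "simple_result": "no_relation"},
]
fixed = fix_records(toy)
pairs = [(r["left_title"], r["right_title"]) for r in fixed]
print(pairs)
```

The unrelated pair is dropped and the `right...` pair comes back with its titles swapped. The function mutates the input dicts in place and leaves `simple_result` unchanged.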

In [5]:
import random
import torch
import numpy as np
import os

def set_random_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_random_seed(1337)

In [6]:
import torch
from torch.utils.data import Dataset

class LineByLineTextDataset(Dataset):
    def __init__(self, records, max_tokens, tokenizer):
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens
        self.records = records

    def __len__(self):
        return len(self.records)

    def embed_record(self, record):
        inputs = self.tokenizer(
            text=record["left_title"] +'. '+record["right_title"]+'. <|endoftext|>',
            add_special_tokens=True,
            max_length=self.max_tokens,
            truncation="longest_first",
            padding="max_length",
            return_tensors="pt"
        )
        for key, value in inputs.items():
            value.squeeze_(0)
        return inputs
    
    def __getitem__(self, index):
        record = self.records[index]
        output = self.embed_record(record)
        return output
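
The dataset class above turns each pair into a single training string. The sketch below reproduces only that string construction, without a tokenizer, and shows how a prompt of the form `left_title + '. '` (the form used by the English sampling cell further down) is a prefix of the training text. The helper names `format_pair` and `make_prompt` and the example record are hypothetical.

```python
EOS = "<|endoftext|>"

def format_pair(left_title, right_title):
    # Mirrors the `text=` argument built in embed_record.
    return left_title + ". " + right_title + ". " + EOS

def make_prompt(left_title):
    # Prompt used at generation time: the first title plus the separator.
    return left_title + ". "

# Hypothetical record for illustration.
record = {"left_title": "Storm hits coast", "right_title": "Thousands lose power"}
text = format_pair(record["left_title"], record["right_title"])
prompt = make_prompt(record["left_title"])
continuation = text[len(prompt):]
print(text)
print(continuation)
```

Because the prompt is an exact prefix of the training string, the model is asked at sampling time to continue with what would have been the second title.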

# Russian

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sberbank-ai/rugpt3small_based_on_gpt2"
ru_model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
ru_tokenizer = AutoTokenizer.from_pretrained(model_name)
ru_tokenizer.add_special_tokens({
  "eos_token": "<|endoftext|>",
  "bos_token": "<|beginoftext|>",
  "unk_token": "<|unk|>",
  'pad_token':'<|pad|>',
  'sep_token':'<|sep|>'
})

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


4

In [7]:
from torch.utils.data import DataLoader, RandomSampler

MAX_TOKENS = 120

ru_train_data = LineByLineTextDataset(ru_train_records, MAX_TOKENS, ru_tokenizer)
ru_val_data = LineByLineTextDataset(ru_val_records, MAX_TOKENS, ru_tokenizer)

In [8]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling 

EPOCHS = 6
EVAL_STEPS = 8
WARMUP_STEPS = 8
LR = 6e-05
BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 4

training_args = TrainingArguments(
    output_dir="./gpt2-gen1",
    overwrite_output_dir=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    logging_steps=EVAL_STEPS,
    save_steps=EVAL_STEPS,
    warmup_steps=WARMUP_STEPS,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    report_to="none",
    prediction_loss_only=True,
    load_best_model_at_end=True,
    save_total_limit=1
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=ru_tokenizer, mlm=False,
)

trainer = Trainer(
    model=ru_model,
    args=training_args,
    data_collator=data_collator,    
    train_dataset=ru_train_data,
    eval_dataset=ru_val_data
)

!rm -rf gpt2-gen1
trainer.train()

***** Running training *****
  Num examples = 2045
  Num Epochs = 6
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 4
  Total optimization steps = 96


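| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
| 8 | 5.6175 | 4.823387 |
| 16 | 4.7871 | 4.115697 |
| 24 | 3.9446 | 3.437441 |
| 32 | 3.3684 | 3.03598 |
| 40 | 2.8454 | 2.829731 |
| 48 | 2.6821 | 2.736065 |
| 56 | 2.4401 | 2.699602 |
| 64 | 2.392 | 2.683484 |
| 72 | 2.2865 | 2.671989 |
| 80 | 2.1993 | 2.673416 |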


***** Running Evaluation *****
  Num examples = 177
  Batch size = 32
Saving model checkpoint to ./gpt2-gen1/checkpoint-8
Configuration saved in ./gpt2-gen1/checkpoint-8/config.json
Model weights saved in ./gpt2-gen1/checkpoint-8/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 177
  Batch size = 32
Saving model checkpoint to ./gpt2-gen1/checkpoint-16
Configuration saved in ./gpt2-gen1/checkpoint-16/config.json
Model weights saved in ./gpt2-gen1/checkpoint-16/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 177
  Batch size = 32
Saving model checkpoint to ./gpt2-gen1/checkpoint-24
Configuration saved in ./gpt2-gen1/checkpoint-24/config.json
Model weights saved in ./gpt2-gen1/checkpoint-24/pytorch_model.bin
Deleting older checkpoint [gpt2-gen1/checkpoint-8] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 177
  Batch size = 32
Saving model checkpoint to ./gpt2-gen1/checkpoint-32
Configuration saved in ./gpt2-gen1/checkpoint-3

TrainOutput(global_step=96, training_loss=3.072284996509552, metrics={'train_runtime': 460.0882, 'train_samples_per_second': 26.669, 'train_steps_per_second': 0.209, 'total_flos': 751418726400000.0, 'train_loss': 3.072284996509552, 'epoch': 6.0})
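
The "Total optimization steps" and "Total train batch size" figures in the log above can be reproduced from the hyperparameters. The following is a sketch of that arithmetic, assuming a single device and that incomplete accumulation groups at the end of an epoch are dropped by integer division; the function name is hypothetical.

```python
import math

def total_optimization_steps(num_examples, batch_size, grad_accum_steps, epochs):
    # Batches per epoch; the last batch may be smaller than batch_size.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per grad_accum_steps batches (at least one per epoch).
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(epochs * updates_per_epoch)

# Values from the Russian run: 2045 examples, batch 32, accumulation 4, 6 epochs.
ru_steps = total_optimization_steps(2045, 32, 4, 6)
effective_batch = 32 * 4  # per-device batch size times accumulation steps
print(ru_steps, effective_batch)
```

With warmup over the first 8 updates and evaluation every 8 updates, the run has 96 / 8 = 12 evaluation points.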

In [11]:
import transformers
transformers.logging.set_verbosity_error()

bad_word_ids = [
    [203], # \n
    [225], # weird space 1
    [28664], # weird space 2
    [13298], # weird space 3
    [206], # \r
    [49120], # html
    [25872], # http
    [3886], # amp
    [38512], # nbsp
    [10], # &
    [5436], # & (another)
    [5861], # http
    [372], # yet another line break
    [421, 4395], # МСК (Moscow time)
    [64], # \
    [33077], # https
    [1572], # ru
    [11101], # Источник ("Source")
]

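def sample(model, tokenizer, prefix, n):
    # NOTE: the prompt ends with '<|sep|> ', while LineByLineTextDataset above
    # joins titles with '. '; keep the two formats consistent when retraining.
    input_ids = tokenizer.encode(prefix + '<|sep|> ', add_special_tokens=False, return_tensors="pt").to("cuda")
    # input_ids has shape (1, seq_len); len(input_ids) would return 1.
    input_size = input_ids.shape[1]
    preds = model.generate(
        input_ids,
        top_p=0.95,
        do_sample=True,
        min_length=input_size + 10,
        max_length=input_size + 100,
        num_return_sequences=n,
        temperature=1.0,
        bad_words_ids=bad_word_ids,
        no_repeat_ngram_size=4
    )
    # Keep only the text between the first and the second '<|sep|>'.
    return [tokenizer.decode(preds[r].cpu().numpy()).strip().split("<|sep|> ")[1].split("<|sep|>")[0] for r in range(n)]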

ru_model.eval()
for item in ru_test_records[:30]:
    for _ in range(10):
        res = sample(ru_model, ru_tokenizer, item["left_title"], 10)
        if res:
            print(item["left_title"])
            for r in res:
                print(f'    => {r}')
            print()
            break

Трамп заявил о создании в США сверхбыстрой «супер-пупер-ракеты»
    =>  Пентагон ответил на сообщения о создании в США сверхбольшого (до 12) сверхзвукового пассажирского лайнера
    =>  Пентагон отреагировал на заявление Трампа о «супер-превентивной» атаке на Россию
    =>  Пентагон отреагировал на заявление Трампа о создании сверхбыстрой «суперов-пупер-ракети»
    =>  Трамп заявил о создании в США супербыстрой «супер–пупер-ракеты»: американцы не заметили утечку данных
    =>  США пообещали США повторить рекорд скорости «супер-путлеров»
    =>  В России подтвердили создание сверхбыстрой «суперной» ракеты
    =>  Трамп отверг новость о создании сверхбыстрой «супертрансляционной» системы разведки
    =>  Вашингтон готов дать ответ России по «супер-путлеру»
    =>  Пентагон рассекретил данные о создании в США сверхбольшого космического корабля
    =>  Пентагон не захотел считаться с мнением о «супер-прессе» в США

Представитель Ирана в ОПЕК впал в кому
    =>  Иран заявил, что его предста

KeyboardInterrupt: ignored

# English

In [7]:
from transformers import GPT2LMHeadModel, AutoTokenizer

model_name = "gpt2"
en_model = GPT2LMHeadModel.from_pretrained(model_name).to("cuda")
en_tokenizer = AutoTokenizer.from_pretrained(model_name)
en_tokenizer.pad_token = en_tokenizer.eos_token

In [8]:
from torch.utils.data import DataLoader, RandomSampler

MAX_TOKENS = 100

en_train_data = LineByLineTextDataset(en_train_records, MAX_TOKENS, en_tokenizer)
en_val_data = LineByLineTextDataset(en_val_records, MAX_TOKENS, en_tokenizer)

In [9]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling 

EPOCHS = 10
EVAL_STEPS = 8
WARMUP_STEPS = 8
LR = 5e-05
BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 8

training_args = TrainingArguments(
    output_dir="./en-gpt2-gen1",
    overwrite_output_dir=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    logging_steps=EVAL_STEPS,
    save_steps=EVAL_STEPS,
    warmup_steps=WARMUP_STEPS,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    report_to="none",
    prediction_loss_only=True,
    load_best_model_at_end=True,
    save_total_limit=1
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=en_tokenizer, mlm=False,
)

trainer = Trainer(
    model=en_model,
    args=training_args,
    data_collator=data_collator,    
    train_dataset=en_train_data,
    eval_dataset=en_val_data
)

!rm -rf en-gpt2-gen1
trainer.train()

***** Running training *****
  Num examples = 1111
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 8
  Total optimization steps = 80


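| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
| 8 | 4.9723 | 4.325442 |
| 16 | 4.7042 | 3.990811 |
| 24 | 4.2667 | 3.853289 |
| 32 | 3.9985 | 3.783504 |
| 40 | 3.8116 | 3.753164 |
| 48 | 3.6698 | 3.738491 |
| 56 | 3.5637 | 3.733073 |
| 64 | 3.4715 | 3.732033 |
| 72 | 3.4543 | 3.731657 |
| 80 | 3.4271 | 3.731004 |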


***** Running Evaluation *****
  Num examples = 98
  Batch size = 16
Saving model checkpoint to ./en-gpt2-gen1/checkpoint-8
Configuration saved in ./en-gpt2-gen1/checkpoint-8/config.json
Model weights saved in ./en-gpt2-gen1/checkpoint-8/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16
Saving model checkpoint to ./en-gpt2-gen1/checkpoint-16
Configuration saved in ./en-gpt2-gen1/checkpoint-16/config.json
Model weights saved in ./en-gpt2-gen1/checkpoint-16/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16
Saving model checkpoint to ./en-gpt2-gen1/checkpoint-24
Configuration saved in ./en-gpt2-gen1/checkpoint-24/config.json
Model weights saved in ./en-gpt2-gen1/checkpoint-24/pytorch_model.bin
Deleting older checkpoint [en-gpt2-gen1/checkpoint-8] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16
Saving model checkpoint to ./en-gpt2-gen1/checkpoint-32
Configuration save

TrainOutput(global_step=80, training_loss=3.933984327316284, metrics={'train_runtime': 219.7809, 'train_samples_per_second': 50.55, 'train_steps_per_second': 0.364, 'total_flos': 562543372800000.0, 'train_loss': 3.933984327316284, 'epoch': 9.91})
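
The sampling cell that follows keeps only the second sentence of each generated string and discards short, duplicated, or oddly cased candidates. Here is a minimal standalone sketch of that filtering idea on made-up strings. The function name `second_sentences` and the sample texts are hypothetical, and the check for unusual characters is left out for brevity.

```python
import string

def second_sentences(samples):
    seen, kept = set(), []
    for s in samples:
        parts = s.split(".")
        # Require at least two periods and more than four spaces in the
        # second sentence, as in the notebook's filter.
        if len(parts) < 3 or parts[1].count(" ") <= 4:
            continue
        candidate = parts[1].strip() + "."
        # Keep only candidates starting with an uppercase ASCII letter.
        if candidate[0] not in string.ascii_uppercase:
            continue
        # Drop exact duplicates while preserving the original order.
        if candidate in seen:
            continue
        seen.add(candidate)
        kept.append(candidate)
    return kept

# Hypothetical generated samples: prompt title, then a continuation.
samples = [
    "Storm hits coast. Thousands lose power after the storm. More text.",
    "Storm hits coast. Too short. x",
    "Storm hits coast. thousands lose power after the storm. x",
    "Storm hits coast. Thousands lose power after the storm. Other tail.",
    "Storm hits coast. No second period here at all",
]
kept = second_sentences(samples)
print(kept)
```

Unlike the notebook's version, which deduplicates through a `set`, this sketch preserves the order in which candidates were generated.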

In [11]:
import transformers
transformers.logging.set_verbosity_error()


def sample(model, tokenizer, prefix, n):
    input_ids = tokenizer.encode(prefix + '. ', add_special_tokens=False, return_tensors="pt").to("cuda")
    input_size = input_ids.shape[1]  # input_ids has shape (1, seq_len); len() would return 1
    preds = model.generate(
        input_ids,
        num_return_sequences=n, 
        do_sample=True, 
        top_k=0,
        temperature=0.7,
        top_p=0.92,
        min_length=input_size + 10,
        max_length=input_size + 100,
    )
    samples = [tokenizer.decode(preds[r].cpu().numpy()).strip() for r in range(n)]
    return samples

def simple_filter(items):
    # Keep the second sentence of each sample (the generated headline),
    # requiring more than four spaces in it, and deduplicate.
    candidates = set(
        item.split('.')[1] + '.'
        for item in items
        if item.count('.') > 1 and item.split('.')[1].count(' ') > 4
    )
    res = []
    for item in candidates:
        # Skip samples containing characters typical of degenerate output.
        if any(ch in item for ch in 'ㅋㅜㅠÂㅆ'):
            continue
        # Keep only headlines that start with an uppercase ASCII letter.
        if item.strip()[0] not in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
            continue
        res.append(item.strip())
    return res

en_model.eval()
for item in en_test_records[:30]:
    if '.' in item["left_title"]: continue  # simple_filter splits on '.', so skip titles containing one
    for _ in range(10):
        res = sample(en_model, en_tokenizer, item["left_title"], 10)
        res = simple_filter(res)
        if res:
            print(item["left_title"])
            for r in res:
                print(f'    => {r}')
            print()
            break

She-Ra creator Noelle Stevenson on what’s at stake in the final season
    => Noelle Stevenson shares her thoughts on Noelle Stevenson’s demise.
    => Noelle Stevenson ‘not afraid’ to make a statement, says she’s ‘trying to make a difference’ in the future of Noelle Stevenson's life.
    => Noelle Stevenson’s final season on Twitter.
    => Fellow Noelle Stevenson fans react to Noelle Stevenson’s farewell to 'The People's Show'.
    => Noelle Stevenson ‘the’n’st person who makes sure all fans feel like they have a say on the final season of Noelle Stevenson’s series.

Mary-Kate Olsen And Her Much Older Husband Olivier Sarkozy Divorce After 5 Years
    => Shay Naikwe’s Wife Tributes After Karen Olsen Divorce:  Vive La France’s Greatest Moments From the 2010 Couple.
    => Olivier Sarkozy: I Have Told Him My Story  But I Didn't Know How Much I Had to Pay for Him to Stop Divorce.
    => HBO’s Melissa McCarthy Mourns Carrie Coronavirus’s ‘Truly Motherly’ Coronavirus Victim, Reacts To Her 