<a href="https://colab.research.google.com/github/Nikoschenk/language_model_finetuning/blob/master/text_gen_4_comp_ling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning Transformers on Texts (Domain: Computational Linguistics) for Text Generation
This notebook demonstrates how to fine-tune GPT-2 on custom texts (the ACL reference corpus) using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, and includes text generation examples that compare the base model with the fine-tuned model.

## Installation

### Install the Hugging Face Transformers library.

In [None]:
# Install transformers.
!git clone https://github.com/huggingface/transformers

import os
os.chdir('/content/transformers')

# Use language modeling version as of April 21st.
!git checkout b1ff0b2ae7d368b7db3a8a8472a29cc195d278d8

!pip install .
!pip install -r ./examples/requirements.txt

os.chdir('/content/transformers/examples')

In [3]:
import torch
import run_language_modeling
import run_generation
import collections
import random
import numpy as np

from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelWithLMHead

# Make sure that this version of transformers has the correct evaluate functionality.
from run_language_modeling import evaluate

### Mount your Google Drive
Checkpoints will be saved to Google Drive folders.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Download the CL introductions for fine-tuning.

In [None]:
!mkdir -p '/content/drive/My Drive/Comp-Ling'
!wget -nc -O '/content/drive/My Drive/Comp-Ling/training-texts.txt' https://raw.githubusercontent.com/Nikoschenk/language_model_finetuning/master/acl_papers/introductions-train.txt
!wget -nc -O '/content/drive/My Drive/Comp-Ling/validation-texts.txt' https://raw.githubusercontent.com/Nikoschenk/language_model_finetuning/master/acl_papers/introductions-valid.txt

In [6]:
# Make sure they were downloaded properly.
# Display ten introductions.
!tail -n 10 '/content/drive/My Drive/Comp-Ling/training-texts.txt'

Networked computers can be used to support learning in various ways. In computational linguistics, the predominant pattern of use is twofold: Learning materials are distributed using hypertext, and laboratories are conducted in which students work directly with computational linguistics processors such as parsers and generators. The 'authorware' approach to developing learning materials has not been popular in the teaching of computational linguistics because of the extensive labour involved in encoding content. Since CL is all about the use of powerful general mechanisms and expressive formalisms, the idea of writing learning materials using less expressive tools has little appeal. However, the new technologies of the internet make it easier to combine media to produce integrated learning environments in which pedagogical materials can be intimately connected to mechanisms and resources. Using such approaches can produce payoffs whether or not distance learning is involved. A better i

## Fine-tune and evaluate

### Launch fine-tuning
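
The fine-tuning command in this section combines `--per_gpu_train_batch_size=2` with `--gradient_accumulation_steps=5` and `--block_size=128`. As a rough sketch of what this means for each optimizer update, assuming a single GPU and full batches, the effective batch size can be worked out as below. The helper names are illustrative and are not part of the `transformers` example scripts.

```python
def effective_batch_size(per_gpu_batch_size, gradient_accumulation_steps, n_gpu=1):
    """Number of training blocks that contribute to one optimizer update."""
    # Gradients from several small batches are accumulated before each update.
    return per_gpu_batch_size * max(1, n_gpu) * gradient_accumulation_steps


def tokens_per_update(per_gpu_batch_size, gradient_accumulation_steps, block_size, n_gpu=1):
    """Approximate number of tokens seen per optimizer update (assumes full blocks)."""
    blocks = effective_batch_size(per_gpu_batch_size, gradient_accumulation_steps, n_gpu)
    return blocks * block_size


# Values taken from the fine-tuning command in this notebook (single GPU assumed).
print(effective_batch_size(2, 5))    # 10 blocks per update
print(tokens_per_update(2, 5, 128))  # 1280 tokens per update
```

Gradient accumulation keeps the memory footprint of a batch of 2 while approximating the gradient of a batch of 10, which is useful when GPT-2 medium does not fit larger batches on a Colab GPU.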


In [None]:
# Your models will be stored here.
!mkdir -p '/content/drive/My Drive/Comp-Ling-Models'

# GPT-2 fine-tuning.
!python run_language_modeling.py \
    --output_dir='/content/drive/My Drive/Comp-Ling-Models' \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-medium \
    --save_total_limit=1 \
    --num_train_epochs=10.0 \
    --do_train \
    --evaluate_during_training \
    --logging_steps=2000 \
    --save_steps=2000 \
    --train_data_file='/content/drive/My Drive/Comp-Ling/training-texts.txt' \
    --do_eval \
    --eval_data_file='/content/drive/My Drive/Comp-Ling/validation-texts.txt' \
    --per_gpu_train_batch_size=2 \
    --per_gpu_eval_batch_size=2 \
    --block_size=128 \
    --gradient_accumulation_steps=5

### Compute perplexity of a dataset.
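
For orientation, perplexity is conventionally the exponential of the average cross-entropy loss per token (in nats). The sketch below assumes the evaluation averages per-batch mean losses and exponentiates the result, which is the standard definition. `perplexity_from_losses` is a hypothetical helper written for illustration, not a function from `run_language_modeling`.

```python
import math


def perplexity_from_losses(batch_losses):
    """Exponentiate the mean of per-batch mean cross-entropy losses (in nats)."""
    if not batch_losses:
        raise ValueError("need at least one loss value")
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# A model that is uniformly unsure among 4 tokens has a loss of log(4) per token,
# so its perplexity is (up to floating-point error) 4.
uniform_loss = math.log(4)
print(perplexity_from_losses([uniform_loss, uniform_loss]))
```

Lower perplexity on the validation texts means the model assigns higher probability to them, which is why the fine-tuned checkpoint is compared against `gpt2-medium` on the same file.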


#### Look at what checkpoints are available
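
Instead of reading the directory listing by eye, the most recent checkpoint could also be picked programmatically. This is a minimal sketch that assumes checkpoint directories follow the `checkpoint-<step>` naming used for `CHECKPOINT_PATH` in this notebook. `latest_checkpoint` is a hypothetical helper, not part of the library.

```python
import os
import re


def latest_checkpoint(models_dir):
    """Return the path of the `checkpoint-<step>` subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(models_dir):
        match = pattern.match(name)
        path = os.path.join(models_dir, name)
        # Compare steps numerically so that checkpoint-10000 beats checkpoint-2000.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Example usage (path taken from this notebook):
# CHECKPOINT_PATH = latest_checkpoint('/content/drive/My Drive/Comp-Ling-Models')
```

Because the fine-tuning command sets `--save_total_limit=1`, there will usually be only one such directory, but the helper also works when older checkpoints are kept.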


In [None]:
!ls '/content/drive/My Drive/Comp-Ling-Models/'

#### Helper functions

In [8]:
def load_model(args):
  """Creates a model and loads in weights for it."""
  config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=None)

  model = AutoModelWithLMHead.from_pretrained(
      args.model_name_or_path,
      from_tf=bool(".ckpt" in args.model_name_or_path),
      config=config,
      cache_dir=None
  )
  
  model.to(args.device)
  return model

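def set_seed(seed):
  """Set the random seed for Python, NumPy, and PyTorch (CPU and all GPUs)."""
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  # Use the `seed` argument rather than the global `args`, which may not be defined yet.
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)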

def do_perplexity_eval(args, model, data_file_path):
  """Computes the perplexity of the text in data_file_path according to the provided model."""
  set_seed(args.seed)

  args.eval_data_file=data_file_path

  tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=None)

  args.block_size = min(args.block_size, tokenizer.max_len)

  result = run_language_modeling.evaluate(args, model, tokenizer, prefix="")
  return result

In [9]:
def dict_to_obj(dic):
    """Convert a dict to an object"""
    # https://stackoverflow.com/questions/1305532/convert-nested-python-dict-to-object
    top = type('dummy', (object,), dic)
    seqs = tuple, list, set, frozenset
    for i, j in dic.items():
        if isinstance(j, dict):
            setattr(top, i, dict_to_obj(j))
        elif isinstance(j, seqs):
            setattr(top, i, type(j)(dict_to_obj(sj) if isinstance(sj, dict) else sj for sj in j))
        else:
            setattr(top, i, j)
    return top

In [None]:
# GPT-2 EVALUATION:

# 1. Fine-tuned model.
CHECKPOINT_PATH = '/content/drive/My Drive/Comp-Ling-Models/checkpoint-38000'

# 2. Non-fine-tuned model.
#CHECKPOINT_PATH = "gpt2-medium" 

# Set this to the list of text files you want to evaluate the perplexity of.
DATA_PATHS = ['/content/drive/My Drive/Comp-Ling/validation-texts.txt']

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print("Running on device: ", device)

args = collections.defaultdict(
  model_name_or_path=CHECKPOINT_PATH,
  output_dir=CHECKPOINT_PATH,
  block_size = 128,
  local_rank=-1,
  eval_batch_size=2,
  per_gpu_eval_batch_size=2,
  n_gpu=n_gpu,
  mlm=False,
  device=device,
  line_by_line=False,
  overwrite_cache=None,
  model_type='gpt2',
  seed=42,
)
args = dict_to_obj(args)

model = load_model(args)

for data_path in DATA_PATHS:
  eval_results = do_perplexity_eval(args, model, data_path)
  perplexity = eval_results['perplexity']
  print('{} is the perplexity of {} according to {}'.format(
      perplexity, data_path, CHECKPOINT_PATH))

### Generate samples
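
The generation settings used in this section (`temperature`, `k` for top-k, `p` for nucleus sampling) all reshape the next-token distribution before a token is drawn. The pure-Python sketch below illustrates the idea on a toy list of logits. It is an illustration under simplifying assumptions (top-k is applied first, then nucleus filtering on the renormalized probabilities) and is not the library's implementation. `filtered_probs` and `sample_token` are hypothetical names.

```python
import math
import random


def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sampling probabilities after temperature scaling, top-k and nucleus filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (k=0 disables the filter).
    keep = order[:top_k] if top_k > 0 else order
    probs = _renormalize(probs, set(keep))

    # Nucleus (top-p): keep the smallest prefix whose cumulative mass reaches p.
    if top_p < 1.0:
        cumulative, nucleus = 0.0, []
        for i in keep:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = _renormalize(probs, set(nucleus))
    return probs


def sample_token(logits, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_probs(toy_logits, top_k=2))
print(sample_token(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

With `temperature=1.0`, `top_k=0` and `top_p=1.0` the function returns the plain softmax, which matches the "set this to ... to not use" notes in the generation arguments.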


In [11]:
def generate_samples(args, model, prompt_text):
  """Generating sampling for the provided prompt using the provided model."""
  set_seed(args.seed)

  tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=None)

  requires_preprocessing = args.model_type in run_generation.PREPROCESSING_FUNCTIONS.keys()
  encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
  encoded_prompt = encoded_prompt.to(args.device)

  output_sequences = model.generate(
      input_ids=encoded_prompt,
      max_length=args.length + len(encoded_prompt[0]),
      temperature=args.temperature,
      top_k=args.k,
      top_p=args.p,
      repetition_penalty=args.repetition_penalty,
      do_sample=True,
      num_return_sequences=args.num_return_sequences,
  )

  # Remove the batch dimension when returning multiple sequences
  if len(output_sequences.shape) > 2:
    output_sequences.squeeze_()

  generated_sequences = []

  for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    generated_sequence = generated_sequence.tolist()

    # Decode text
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

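    # Remove all text after the stop token (only if it actually occurs in the text;
    # otherwise `find` returns -1 and the last character would be cut off).
    if args.stop_token and args.stop_token in text:
      text = text[: text.find(args.stop_token)]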

    # Remove the excess text that was used for pre-processing
    text = text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]

    # Add the prompt at the beginning of the sequence.
    total_sequence = prompt_text + text

    generated_sequences.append(total_sequence)

  return generated_sequences

In [15]:
# GPT-2 text generation:

# Set this to the checkpoint you want to use for generation, or to "gpt2-medium"
# to generate with the pre-trained model without finetuning.
CHECKPOINT_PATH = '/content/drive/My Drive/Comp-Ling-Models/checkpoint-38000'
#CHECKPOINT_PATH = 'gpt2-medium'

# Specify your prompt here.
PROMPT = '''This text is computer-generated. It is unique because'''


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print("Running on device: ", device)

args = collections.defaultdict(
  model_name_or_path=CHECKPOINT_PATH,
  output_dir=CHECKPOINT_PATH,
  n_gpu=n_gpu,
  mlm=False,
  device=device,
  model_type='gpt2',
  seed=42,
  stop_token=None, # Set this if your dataset has a special word that indicates the end of a text.
  temperature=1.0,  # temperature sampling. Set this to temperature=1.0 to not use temperature.
  k=50,  # k for top-k sampling. Set this to k=0 to not use top-k.
  p=1.0,  # p for nucleus sampling. Set this to p=1.0 to not use nucleus sampling.
  repetition_penalty=None,
  length=750,  # Number of tokens to generate.
  num_return_sequences=20,  # Number of independently computed samples to generate.
)
args = dict_to_obj(args)

model = load_model(args)
sequences = generate_samples(args, model, PROMPT)
for idx, sequence in enumerate(sequences):
  print('\n====== GENERATION {} ======'.format(idx))
  print(sequence)

08/03/2020 09:33:34 - INFO - transformers.configuration_utils -   loading configuration file /content/drive/My Drive/Comp-Ling-Models/checkpoint-38000/config.json
08/03/2020 09:33:34 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}

08/03/2020 09:33:34 - INFO 

Running on device:  cuda



This text is computer-generated. It is unique because it was written in natural English, Spanish, and Swedish, under the time and place restrictions imposed by the military machine translation system MADAAT. The system had previously been used to translate English from English into Spanish and from English into Swedish. MADAAT was used in five different conditions: a) to translate into Spanish, b) for translating Spanish into French, c) to translate into Swedish, d) to translate into Danish/English, and finally e) to translate into French, Swedish and Danish. For French, Swedish and Danish, the MADAAT system was augmented using language models from the statistical machine translation system, MOSSUM. In this paper, we examine three basic parameters of translation methods in terms of their role in translating unseen text. A new machine translation system, MADAAT, is introduced using the MADAAT method with three parameters: translation cost, input speed, input accuracy, and translation q