# Fine-Tuning GPT2 on Colab GPU… For Free!

This is the Colab notebook for the [associated Medium article](https://medium.com/p/340468c92ed).

## Installing Dependencies
Normally we would run `pip3 install transformers` in Bash, but in a Colab notebook shell commands have to be prefixed with `!`.

In [1]:
!pip3 install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl (1.3MB)

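To confirm that the install took effect in the current runtime, a quick check like the following can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of transformers.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Prints a version string if the package is installed, otherwise None.
print(installed_version("transformers"))
```

If this prints `None`, the later cells that import `transformers` will fail until the install cell is run again.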
## Getting WikiText Data

You can read more about WikiText data here. WikiText comes in two sizes: WikiText-2 and WikiText-103. We're going to use WikiText-2 because it's smaller, and Colab limits both how long we can run on a GPU and how much data we can load into memory. To download and unzip it, run:


In [2]:
%%bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip

Archive:  wikitext-2-raw-v1.zip
   creating: wikitext-2-raw/
  inflating: wikitext-2-raw/wiki.test.raw  
  inflating: wikitext-2-raw/wiki.valid.raw  
  inflating: wikitext-2-raw/wiki.train.raw  


--2020-10-22 04:13:37--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.106.230
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.106.230|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4721645 (4.5M) [application/zip]
Saving to: ‘wikitext-2-raw-v1.zip’

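Once the archive is unpacked, it can be useful to see how large each split is before training. The sketch below counts bytes, lines, and whitespace-separated words; the helper `summarize_text_file` is a hypothetical name, and the word count is only a rough proxy for the number of GPT-2 tokens.

```python
from pathlib import Path

def summarize_text_file(path):
    """Return size in bytes, line count, and whitespace-separated word count."""
    path = Path(path)
    n_lines = 0
    n_words = 0
    with path.open(encoding="utf-8") as f:
        for line in f:
            n_lines += 1
            n_words += len(line.split())
    return {"bytes": path.stat().st_size, "lines": n_lines, "words": n_words}

data_dir = Path("wikitext-2-raw")
for name in ("wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"):
    file_path = data_dir / name
    if file_path.exists():  # skipped quietly if the download step was not run
        print(name, summarize_text_file(file_path))
```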

## Fine-Tuning GPT2

Hugging Face provides a script, `run_language_modeling.py`, to help fine-tune language models. We can download it by running:

In [3]:
! wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py

--2020-10-22 04:13:39--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12022 (12K) [text/plain]
Saving to: ‘run_language_modeling.py’


2020-10-22 04:13:39 (124 MB/s) - ‘run_language_modeling.py’ saved [12022/12022]



Now we are ready to fine-tune.

The script takes many parameters, which you can learn about by reading the manual. I'm only going to go over the ones that matter for basic training.

- `output_dir` is the directory where the trained model will be written
- `model_type` is the model architecture you want to use. In our case, it's `gpt2`
- `model_name_or_path` is the name or path of the pretrained model to start from. If you want to train from scratch, you can leave this blank. In our case, it's also `gpt2`
- `do_train` tells the script to train
- `train_data_file` points to the training file
- `do_eval` tells the script to evaluate after training. Not always required, but good to have
- `eval_data_file` points to the evaluation file

Some extra ones you may care about, though you can also skip them:
- `save_steps` is how often, in training steps, to save a checkpoint. If you have limited storage, you can set this to -1 so it'll skip saving until the end
- `per_gpu_train_batch_size` is the training batch size per GPU. You can increase this if your GPU has enough memory. To be safe, you can start with 1 and ramp it up if you still have memory
- `num_train_epochs` is the number of epochs to train. Since we're fine-tuning, I'm going to set this to 2
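The parameters listed above map onto command-line flags in a regular way: boolean switches such as `do_train` are passed bare, and everything else is passed as `--name=value`. The sketch below makes that mapping explicit. The helper `build_command` is hypothetical (it is not part of transformers), and the values are the ones used in this notebook.

```python
import shlex

def build_command(script, options):
    """Turn a dict of options into an argument list for the script.

    True becomes a bare flag (--do_train); False or None is omitted;
    anything else becomes --name=value.
    """
    command = ["python", script]
    for name, value in options.items():
        if value is True:
            command.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            command.append(f"--{name}={value}")
    return command

options = {
    "output_dir": "output",
    "model_type": "gpt2",
    "model_name_or_path": "gpt2",
    "do_train": True,
    "train_data_file": "wikitext-2-raw/wiki.train.raw",
    "do_eval": True,
    "eval_data_file": "wikitext-2-raw/wiki.test.raw",
    "per_gpu_train_batch_size": 1,
    "save_steps": -1,
    "num_train_epochs": 2,
}
command = build_command("run_language_modeling.py", options)
print(shlex.join(command))  # the list could also be passed to subprocess.run
```

Keeping the options in a dict makes it easy to change one value, such as the batch size, and rerun without editing a long shell command.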


In [4]:
%%bash
export TRAIN_FILE=wikitext-2-raw/wiki.train.raw
export TEST_FILE=wikitext-2-raw/wiki.test.raw
export MODEL_NAME=gpt2
export OUTPUT_DIR=output

python run_language_modeling.py \
    --output_dir=$OUTPUT_DIR \
    --model_type=$MODEL_NAME \
    --model_name_or_path=$MODEL_NAME \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=2

{'loss': 3.30685888671875, 'learning_rate': 4.470563320626853e-05, 'epoch': 0.2117746717492588}
{'loss': 3.16893115234375, 'learning_rate': 3.9411266412537063e-05, 'epoch': 0.4235493434985176}
{'loss': 3.1467919921875, 'learning_rate': 3.4116899618805594e-05, 'epoch': 0.6353240152477764}
{'loss': 3.13719140625, 'learning_rate': 2.882253282507412e-05, 'epoch': 0.8470986869970352}
{'loss': 3.0864306640625, 'learning_rate': 2.352816603134265e-05, 'epoch': 1.058873358746294}
{'loss': 2.9623623046875, 'learning_rate': 1.8233799237611182e-05, 'epoch': 1.2706480304955527}
{'loss': 2.9396484375, 'learning_rate': 1.2939432443879713e-05, 'epoch': 1.4824227022448115}
{'loss': 2.93838671875, 'learning_rate': 7.645065650148241e-06, 'epoch': 1.6941973739940703}
{'loss': 2.9472109375, 'learning_rate': 2.3506988564167727e-06, 'epoch': 1.9059720457433291}


2020-10-22 04:13:45.277213: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
10/22/2020 04:13:47 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=1, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=2.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Oct22_04-13-47_0d47b7a3886f', logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_c
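The `learning_rate` values in the training log are consistent with a linear decay from the initial rate of 5e-05 down to zero with no warmup (`warmup_steps=0` in the parameters above). The sketch below reproduces them under two assumptions inferred from the log rather than stated in it: that the schedule is linear, and that training runs for 4722 update steps in total (step 500 corresponds to epoch 0.2118, which suggests about 2361 steps per epoch over 2 epochs).

```python
def linear_decay_lr(step, total_steps, initial_lr=5e-05):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / total_steps

TOTAL_STEPS = 4722  # assumption inferred from the logged epoch fractions
for step in range(500, TOTAL_STEPS, 500):  # the log reports every 500 steps
    print(step, linear_decay_lr(step, TOTAL_STEPS))
```

Comparing the printed values with the log is a quick way to check that the run used the schedule you expected.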

## Results

To use the fine-tuned model, you can run something like:

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import numpy as np

OUTPUT_DIR = "./output"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and fine-tuned model saved by the training script
tokenizer = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
model = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR)
model = model.to(device)

def choose_from_top(probs, n=5):
    """Sample one token id from the n most probable tokens."""
    ind = np.argpartition(probs, -n)[-n:]  # indices of the n largest probabilities
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob)  # renormalize so they sum to 1
    choice = np.random.choice(n, 1, p=top_prob)
    return int(ind[choice][0])

def generate(input_str, length=250, n=5):
    """Extend input_str by `length` tokens using top-n sampling."""
    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
    model.eval()
    with torch.no_grad():
        for _ in range(length):
            # GPT-2 has a 1024-token context window, so feed only the most recent tokens
            logits = model(cur_ids[:, -1024:])[0]
            probs = torch.softmax(logits[0, -1], dim=0)  # distribution over the next token
            next_token_id = choose_from_top(probs.to("cpu").numpy(), n=n)
            next_token = torch.tensor([[next_token_id]], dtype=torch.long, device=device)
            cur_ids = torch.cat([cur_ids, next_token], dim=1)
    return tokenizer.decode(cur_ids[0].tolist())

generated_text = generate("The University of Illinois at Urbana–Champaign (U of I, Illinois, or colloquially the University of Illinois or UIUC)[7][8] is a public land-grant research university in Illinois in the twin cities of Champaign and Urbana. It is the flagship institution of the University of Illinois system and was founded in 1867.")
print(generated_text)

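The sampling step in `choose_from_top` can be tried out without loading a model. The following is a standard-library sketch of the same idea (keep the `n` most probable entries, then sample among them in proportion to their probabilities). It is an illustration rather than the notebook's implementation: the function name `choose_from_top_n` and the toy probabilities are made up for the example.

```python
import random

def choose_from_top_n(probs, n=5, rng=random):
    """Sample an index from the n most probable entries of `probs`."""
    # Indices of the n largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    weights = [probs[i] for i in top]
    # random.choices works with relative weights, so they need not sum to 1.
    return rng.choices(top, weights=weights, k=1)[0]

toy_probs = [0.02, 0.40, 0.03, 0.25, 0.05, 0.15, 0.10]
rng = random.Random(0)  # seeded so repeated runs give the same draws
samples = [choose_from_top_n(toy_probs, n=3, rng=rng) for _ in range(1000)]
print({index: samples.count(index) for index in sorted(set(samples))})
```

With `n=3`, only indices 1, 3, and 5 can ever be drawn, which is why a small `n` keeps the generated text close to the model's most likely continuations.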

## Compressing/Zipping Model

To preserve the model after the Colab session ends, we should compress it and save it somewhere. This can be done with:

In [None]:
! tar -czf gpt2-tuned.tar.gz output/

This creates a file called `gpt2-tuned.tar.gz`.
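The same archive can be produced from Python, which also makes it easy to list what ended up inside. This is a minimal sketch using the standard `tarfile` module; the helper name `archive_directory` is hypothetical.

```python
import tarfile
from pathlib import Path

def archive_directory(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive and return its member names."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps member paths relative (output/...) rather than absolute.
        archive.add(source_dir, arcname=source_dir.name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()

if Path("output").is_dir():  # only runs once training has produced the directory
    for member in archive_directory("output", "gpt2-tuned.tar.gz"):
        print(member)
```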

## Saving it to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now you can copy the compressed model to your Google Drive by running:

In [None]:
!cp gpt2-tuned.tar.gz /content/drive/My\ Drive/
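In a later session, the reverse steps bring the model back: mount Drive, copy the archive into the working directory, and unpack it. The sketch below assumes the archive was saved at the Drive path used above; the helper `restore_model` is a hypothetical name.

```python
import shutil
import tarfile
from pathlib import Path

def restore_model(archive_path, work_dir="."):
    """Copy the archive into work_dir, unpack it, and return the top-level names."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    local_copy = Path(shutil.copy(archive_path, work_dir))
    with tarfile.open(local_copy, "r:gz") as archive:
        # Only unpack archives you created yourself; tar files can contain unsafe paths.
        archive.extractall(work_dir)
        return sorted({name.split("/")[0] for name in archive.getnames()})

drive_archive = Path("/content/drive/My Drive/gpt2-tuned.tar.gz")
if drive_archive.exists():  # requires Drive to be mounted first
    print(restore_model(drive_archive))
```

After this, the `output` directory is back in place and can be loaded with `from_pretrained` as in the Results section.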