# Fine-Tuning GPT2 on Colab GPU… For Free!

This is the Colab notebook for the [associated Medium article](https://medium.com/p/340468c92ed).

## Installing Dependencies
Normally we would run `pip3 install transformers` in Bash, but in a Colab notebook shell commands have to be prefixed with `!`.

In [None]:
# !pip3 install transformers
!python -m pip install git+https://github.com/huggingface/transformers.git
!pip install datasets

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-pbizc8jb
  Running command git clone -q https://github.com/huggingface/transformers.git /tmp/pip-req-build-pbizc8jb
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-3.4.0-cp36-none-any.whl size=1275545 sha256=da0734b0e7ba96dc7c51bfa1bbe354f5a1533b80d1dbf015b201752f9c783eea
  Stored in directory: /tmp/pip-ephem-wheel-cache-45lma276/wheels/33/eb/3b/4bf5dd835e865e472d4fc0754f35ac0edb08fe852e8f21655f
Successfully built transformers
Collecting datasets
  Downloading https://files.pythonhosted.org/packages/f0/f4/2a3d6aee93ae7fce6c936dda2d7f534ad5f044a21238f85e28f0

## Getting WikiText Data

You can read more about WikiText data here. WikiText comes in two sizes, WikiText-2 and WikiText-103. We're going to use WikiText-2 because it's smaller: Colab limits both how long we can run on a GPU and how much data we can load into memory. To download and extract the data, mount Google Drive, change to the project directory, and run the download cell:

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/My Drive/CS196Project"

Mounted at /content/drive
/content/drive/My Drive/CS196Project


In [None]:
%%bash
# wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
# unzip wikitext-2-raw-v1.zip

# wget https://cdn-datasets.huggingface.co/summarization/pegasus_data/newsroom.tar.gz
# tar -xzf newsroom.tar.gz  # a .tar.gz archive is extracted with tar, not unzip

Archive:  wikitext-2-raw-v1.zip
   creating: wikitext-2-raw/
  inflating: wikitext-2-raw/wiki.test.raw  
  inflating: wikitext-2-raw/wiki.valid.raw  
  inflating: wikitext-2-raw/wiki.train.raw  
Archive:  newsroom.tar.gz


## Fine-Tuning GPT2

Hugging Face provides a script, `run_language_modeling.py`, to help fine-tune models. We can download it by running:

In [2]:
! wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py

--2020-10-28 22:01:34--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13076 (13K) [text/plain]
Saving to: ‘run_language_modeling.py’


2020-10-28 22:01:34 (9.54 MB/s) - ‘run_language_modeling.py’ saved [13076/13076]



Now we are ready to fine-tune.

The script takes many parameters, which are described in its documentation. I'm only going to cover the ones that matter for basic training.

- `output_dir` is the directory where the model will be written
- `model_type` is the type of model you want to use. In our case, it's gpt2
- `model_name_or_path` is the path to the model. If you want to train from scratch, you can leave this blank. In our case, it's also gpt2
- `do_train` tells it to train
- `train_data_file` points to the training file
- `do_eval` tells it to evaluate afterwards. Not always required, but good to have
- `eval_data_file` points to the evaluation file

Some extra ones you MAY care about, but you can also skip these.
- `save_steps` is how often (in steps) to save checkpoints. If you have limited disk space, you can set this to -1 so it'll skip saving until the end
- `per_gpu_train_batch_size` is the batch size per GPU. You can increase this if your GPU has enough memory. To be safe, start with 1 and ramp it up while you still have memory to spare
- `num_train_epochs` is the number of epochs to train. Since we're fine-tuning, I'm going to set this to 2
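
As a rough sketch of how these flags fit together, the snippet below assembles the command line in Python. The helper `build_command` and the example file paths are illustrative assumptions, not part of the Hugging Face script; only the flag names come from the list above. Passing arguments as a list avoids shell word-splitting, which matters because the Drive mount path contains a space ("My Drive").

```python
import shlex

def build_command(script, options, switches=()):
    """Build an argv list for a training script (hypothetical helper).

    options  -- mapping of flag name to value, e.g. {"output_dir": "out"}
    switches -- flags that take no value, e.g. ("do_train", "do_eval")
    """
    argv = ["python", script]
    for name, value in options.items():
        argv.append(f"--{name}={value}")
    for name in switches:
        argv.append(f"--{name}")
    return argv

# Example values only; substitute your own paths and model.
command = build_command(
    "run_language_modeling.py",
    {
        "output_dir": "output",
        "model_type": "gpt2",
        "model_name_or_path": "gpt2",
        "train_data_file": "/content/drive/My Drive/train.txt",
        "eval_data_file": "/content/drive/My Drive/valid.txt",
        "per_gpu_train_batch_size": 1,
        "save_steps": -1,
        "num_train_epochs": 2,
    },
    switches=("do_train", "do_eval"),
)

# shlex.join (Python 3.8+) quotes arguments that contain spaces, so the
# printed command can be pasted into a shell safely.
print(shlex.join(command))
```

To actually launch training from Python, the same list could be handed to `subprocess.run(command, check=True)`.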


In [None]:
%%bash
export TRAIN_FILE="/content/drive/My Drive/CS196Project/newsroom/train.source"
export TEST_FILE=newsroom_test.txt
export MODEL_NAME="distilbart-cnn-12-6"
export OUTPUT_DIR=output_mine

# Quote the variables: TRAIN_FILE contains a space ("My Drive"),
# which would otherwise be split into two arguments.
python run_language_modeling.py \
    --output_dir="$OUTPUT_DIR" \
    --model_type="$MODEL_NAME" \
    --model_name_or_path="$MODEL_NAME" \
    --do_train \
    --train_data_file="$TRAIN_FILE" \
    --do_eval \
    --eval_data_file="$TEST_FILE" \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=2


2020-10-24 20:15:33.623236: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
10/24/2020 20:15:35 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='output_mine', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=1, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=2.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Oct24_20-15-35_6c1e22b6e772', logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_

In [None]:
### PYTHON EQUIVALENT OF ABOVE ###
# Note: this cell does not run as written; it ends in the AttributeError shown below.

from transformers import Trainer, TrainingArguments
from transformers import glue_convert_examples_to_features
from transformers import AutoTokenizer, AutoModelWithLMHead
from datasets import load_dataset

model_name = "sshleifer/distilbart-cnn-12-6"
model = AutoModelWithLMHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the Newsroom dataset from the files stored on Drive
newsroom = load_dataset('newsroom', data_dir='/content/drive/My Drive/CS196Project/newsroom')

# glue_convert_examples_to_features is written for GLUE classification tasks
# (here 'mrpc'), not for summarization data such as Newsroom, and
# shuffle/batch/repeat are tf.data.Dataset methods.
train_dataset = glue_convert_examples_to_features(newsroom['train'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)

validation_dataset = glue_convert_examples_to_features(newsroom['validation'], tokenizer, max_length=128, task='mrpc')
validation_dataset = validation_dataset.shuffle(100).batch(32).repeat(2)

training_args = TrainingArguments(
    output_dir='/content/drive/My Drive/CS196Project/results',          # output directory
    num_train_epochs=2,              # total # of training epochs
    per_device_train_batch_size=1,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='/content/drive/My Drive/CS196Project/logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=validation_dataset      # evaluation dataset
)

Using custom data configuration default
Reusing dataset newsroom (/root/.cache/huggingface/datasets/newsroom/default/1.0.0/4b405ccd64e15f685065870ea563a1e6a034d1bd269a5427f40146d81549095e)


AttributeError: ignored

## Results

To use the model, you can run something like the following:

In [None]:
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
import torch
import numpy as np

OUTPUT_DIR = "./output_mine"
device = 'cpu'

# To load the model fine-tuned above, use OUTPUT_DIR instead:
# tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
# model = AutoModelWithLMHead.from_pretrained(OUTPUT_DIR)

# Here we load an already-trained summarization checkpoint by name
tokenizer = AutoTokenizer.from_pretrained("yuvraj/summarizer-cnndm")
model = AutoModelWithLMHead.from_pretrained("yuvraj/summarizer-cnndm")

model = model.to(device)

summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)

# def generate(input_str, model=model, tokenizer=tokenizer, length=250, n=5):
#   output_text = input_str
#   summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)
#   for word_num in range(length):
#     data = unmasker(output_text)
#     probs = []
#     current_prob = 0
#     for word in data:
#       current_prob += word['score']
#       probs.append((current_prob, word['token_str']))
#     choice = np.random.uniform(high=current_prob)
#     probs.append(choice, "")
#     probs = sorted(probs)
#     index = probs.index((choice, ""))
#     output_text = output_text + probs[index-1][1] + " "
#   return output_text

# generated_text = generate(" = University of Illinois = \n")
# print(generated_text)

# Take the first article and keep only its first 691 space-separated words
text = newsroom[0]['text'].split(' ')
text = ' '.join(text[:691])

print(text + "\n")
print(summarizer(text))

print("\n\n")

# summarizer(text)



HAMBURG, Germany, June 3  As he left the soccer field after a club match in the eastern German city of Halle on March 25, the Nigerian forward Adebowale Ogungbure was spit upon, jeered with racial remarks and mocked with monkey noises. In rebuke, he placed two fingers under his nose to simulate a Hitler mustache and thrust his arm in a Nazi salute.

Marc Zoro, right, an Ivory Coast native, was a target of racial slurs from the home fans in Messina, Italy. Adriano, a star with Inter Milan, tried to persuade him to stay on the field.

From now until its conclusion on July 9, Jeff Z. Klein and other staff members of The Times and International Herald Tribune will track the world's most popular sporting event.

Your guide to the games in Germany: teams, rosters, schedules, statistics, venues and more.

In April, the American defender Oguchi Onyewu, playing for his professional club team in Belgium, dismissively gestured toward fans who were making simian chants at him. Then, as he went to
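
The cell above keeps only the first 691 space-separated words of the article before summarizing it. The notebook does not say why; a likely reason (an assumption on my part) is to keep the input within the model's maximum input length. A minimal sketch of that truncation as a reusable helper, where `truncate_words` is a hypothetical name:

```python
def truncate_words(text, max_words):
    """Return the first `max_words` space-separated words of `text`."""
    words = text.split(" ")
    return " ".join(words[:max_words])

sample = "one two three four five"
short = truncate_words(sample, 3)
print(short)
```

Keep in mind that a word count is only a proxy: tokenizers count subword tokens, so the number of tokens is usually higher than the number of words.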

## Compressing/Zipping Model


To preserve this model, we should compress it and save it somewhere. This is easily done with:

In [None]:
! tar -czf gpt2-tuned.tar.gz output_mine/

which creates a file called `gpt2-tuned.tar.gz`.
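
The same kind of archive could also be built from Python with the standard-library `tarfile` module. This is only a sketch: `make_archive` is a hypothetical helper name, and the demo uses a temporary stand-in directory with a placeholder file rather than a real model.

```python
import os
import tarfile
import tempfile

def make_archive(source_dir, archive_path):
    """Create a gzip-compressed tar archive of `source_dir`."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores the directory under its own name, without parent paths
        tar.add(source_dir, arcname=os.path.basename(os.path.normpath(source_dir)))
    return archive_path

# Demo with a stand-in directory; point source_dir at your model folder instead.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    archive = make_archive(model_dir, os.path.join(tmp, "gpt2-tuned.tar.gz"))

    # Read the archive back to confirm what it contains
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())
    print(names)
```

Listing the archive contents right after writing it is a cheap check that the model files actually made it in before the Colab session is recycled.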

## Saving it to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now you can copy your output model to your Google Drive by running

In [None]:
!cp gpt2-tuned.tar.gz /content/drive/My\ Drive/
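
If you prefer to stay in Python, `shutil.copy` does the same job and makes it easy to check that the copy succeeded. A minimal sketch: `copy_to_drive` is a hypothetical helper, and the demo uses temporary directories in place of the mounted Drive folder.

```python
import os
import shutil
import tempfile

def copy_to_drive(archive_path, drive_dir):
    """Copy `archive_path` into `drive_dir` and confirm the sizes match."""
    os.makedirs(drive_dir, exist_ok=True)
    destination = shutil.copy(archive_path, drive_dir)
    if os.path.getsize(destination) != os.path.getsize(archive_path):
        raise IOError(f"Incomplete copy: {destination}")
    return destination

# Demo with temporary stand-ins; in Colab, drive_dir would be
# "/content/drive/My Drive" once the drive is mounted.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "gpt2-tuned.tar.gz")
    with open(src, "wb") as f:
        f.write(b"placeholder bytes")

    dest = copy_to_drive(src, os.path.join(tmp, "drive"))
    copied_ok = os.path.exists(dest)
    dest_name = os.path.basename(dest)
    print(dest_name, copied_ok)
```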