# GPT-2 117M Fine Tuning

Adaptation of https://github.com/ak9250/gpt-2-colab

Includes code for building a single large English training text file from Project Gutenberg (books) or Access to Insight (Dharma texts). ATI seems to overfit quickly since it is only 6 MB, but a 99 MB subset of Gutenberg produced quite convincing samples after only 4 hours of training. Your mileage may vary.

## Setup

1) Make sure the GPU is enabled: go to Edit -> Notebook settings and set Hardware accelerator to GPU.

2) Make a copy in your Google Drive: click "Copy to Drive" in the panel.

Note: Colab resets after 12 hours, so save your model checkpoints to Google Drive around the 10-11 hour mark or earlier, then go to Runtime -> Reset all runtimes. Afterwards, copy your trained model back into Colab and resume training from the previous checkpoint.

### Initialize

clone the repo

In [0]:
!git clone https://github.com/nshepperd/gpt-2.git

install requirements

In [0]:
!pip3 install --upgrade tensorflow-gpu beautifulsoup4
!pip3 install -r gpt-2/requirements.txt

In [0]:
cd gpt-2

download the model

In [0]:
!python3 download_model.py 117M

set encoding (use `%env` so the variable persists for later cells; `!export` only affects its own subshell)

In [0]:
%env PYTHONIOENCODING=UTF-8

### Mount Google Drive

mount Google Drive so checkpoints can be saved and restored later

In [0]:
from google.colab import drive
drive.mount('/content/drive')

(optional) fetch checkpoints if you have them saved in google drive

In [0]:
!cp -r /content/drive/My\ Drive/checkpoint/ /content/gpt-2/ 

## Download + Prepare Training Data

In [0]:
cd /content/gpt-2

In [0]:
mkdir data

In [0]:
cd data

### Project Gutenberg

Download through their [Robot Access](http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages)

In [0]:
mkdir books

download and unzip (this will take a while)

In [0]:
!wget -w 0.5 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"

In [0]:
!find . -name "*.zip" -not -name "*-8.zip" | while read -r filename; do unzip -o -d "`basename -s .zip "$filename"`" "$filename"; done;

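collect text files for conversion (into the `books` directory created above)

In [0]:
!find . -name "*.txt" -not -path "./books/*" | while read -r filename; do cp "$filename" /content/gpt-2/data/books/; done;

convert charset to utf8 (skipping files that are already converted)

In [0]:
!find books -name "*.txt" -not -name "*-utf8.txt" | while read -r filename; do iconv -f ascii -t utf8 "$filename" > "$filename-utf8.txt" ; done;

combine into single text file

In [0]:
!find books -name "*-utf8.txt" -print0 | xargs -0 cat > allbooks-utf8.txt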
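
The full Gutenberg dump is far larger than the 99 MB subset mentioned in the introduction. One way to build a size-capped subset is sketched below. This is an assumption about how such a subset could be produced, not the procedure actually used for the 99 MB file; `build_corpus` and its parameters are hypothetical names.

```python
from pathlib import Path


def build_corpus(src_dir, out_path, max_bytes):
    """Concatenate .txt files from src_dir into out_path, stopping before max_bytes is exceeded.

    Returns the number of bytes written.
    """
    out_path = Path(out_path)
    written = 0
    with open(out_path, "wb") as out:
        # Sort so that the same books are selected on every run.
        for path in sorted(Path(src_dir).glob("*.txt")):
            # Never read the output file back in if it lives in src_dir.
            if path.resolve() == out_path.resolve():
                continue
            data = path.read_bytes()
            if written + len(data) > max_bytes:
                break
            out.write(data)
            written += len(data)
    return written
```

For example, `build_corpus("books", "gutenberg-subset.txt", 99 * 1024 * 1024)` would write roughly 99 MB of text, where `gutenberg-subset.txt` is an illustrative file name to pass to `--dataset` afterwards.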

### Access to Insight

Bulk download from [this page](https://accesstoinsight.org/tech/download/bulk.html)

In [0]:
!wget "http://accesstoinsight.org/tech/download/ati.zip"

unzip the archive

In [0]:
!unzip ati.zip

In [0]:
cd /content/gpt-2/data/ati

#### Convert HTML to Text

Parser for Access to Insight HTML dump based on [this script](https://codereview.stackexchange.com/questions/128515/parsing-locally-stored-html-files) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html)

In [0]:
from bs4 import BeautifulSoup
import contextlib
import glob


def parser(out_path="../dhamma_cleaned.txt"):
    # Run from /content/gpt-2/data/ati. The output lands in /content/gpt-2/data,
    # which is the path passed to train.py in the training step.
    # redirect_stdout restores the notebook's own stdout afterwards, even on
    # errors; resetting to sys.__stdout__ would silence later cell output.
    with open(out_path, "w", encoding="utf8") as out, contextlib.redirect_stdout(out):
        for file in glob.iglob('tipitaka/**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            # Keep only the elements that carry the body text.
            for item in soup.find_all(["blockquote", "h4", "p"]):
                print(item.get_text())
                print('\n')


parser()

## Train the Model

enter the directory

In [0]:
cd /content/gpt-2

start training (point `--dataset` at the data file created in the previous step)

In [0]:
!PYTHONPATH=src ./train.py --dataset /content/gpt-2/data/dhamma_cleaned.txt

save our checkpoints to start training again later

In [0]:
!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/
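
Because Colab resets after 12 hours, it can help to keep more than one checkpoint copy on Drive. The following is a minimal sketch of a timestamped backup with pruning, to be run once training has been stopped. `backup_checkpoints`, the `checkpoint-` folder prefix, and the `keep` parameter are hypothetical names chosen for illustration, not part of the gpt-2 repository.

```python
import shutil
import time
from pathlib import Path


def backup_checkpoints(src, dest_root, keep=3):
    """Copy src into a timestamped folder under dest_root and keep only the newest `keep` copies."""
    dest_root = Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    target = dest_root / time.strftime("checkpoint-%Y%m%d-%H%M%S")
    shutil.copytree(src, target)
    # Timestamped names sort chronologically, so the oldest backups come first.
    backups = sorted(p for p in dest_root.iterdir() if p.name.startswith("checkpoint-"))
    for old in backups[:-keep]:
        shutil.rmtree(old)
    return target
```

A call such as `backup_checkpoints("/content/gpt-2/checkpoint", "/content/drive/My Drive/gpt2-backups")` would then replace the plain `cp`; the `gpt2-backups` folder name is an assumption.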

copy re-trained (fine-tuned) model into the main directory

In [0]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

## Use the Trained Model

The sampling scripts accept a few flags, shown here with their default values:

* `seed = None` || a random seed is generated unless specified. Pass a fixed integer to reproduce the same results later.
* `nsamples = 1` || number of samples to print.
* `length = None` || number of tokens to generate per sample. Tokens are byte-pair-encoded subword units, so a token is often shorter than a word. If unset, the length is derived from the model's context size.
* `batch_size = 1` || number of samples generated in parallel. It affects speed and memory use rather than the text produced, and `nsamples` must be divisible by it.
* `temperature = 1` || divides the logits before the softmax. Values below 1 make sampling more conservative; values above 1 make it more random.
* `top_k = 0` || restricts sampling to the `k` most likely tokens. `0` disables the restriction.
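
To make the effect of `temperature` and `top_k` concrete, here is a small pure-Python sketch of the sampling step. It illustrates the idea only and is not the repository's TensorFlow implementation; `sample_token` and its `rng` parameter are hypothetical names.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, rng=random):
    """Pick an index from logits using temperature scaling and top-k truncation."""
    # Temperature must be positive; smaller values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        # Keep the k largest logits (ties at the cutoff are all kept);
        # everything else gets zero probability.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]
```

With `top_k=1` the function always returns the index of the largest logit, which corresponds to greedy decoding; with `top_k=0` every token remains a candidate.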

### Conditional samples

In [0]:
!python3 /content/gpt-2/src/interactive_conditional_samples.py --top_k=40 --nsamples=3 --temperature=0.7 --length=100

### Unconditional samples

In [0]:
!python3 src/generate_unconditional_samples.py | tee /tmp/samples