Welcome to the colab notebook for [GPTNeo](https://github.com/EleutherAI/GPTNeo) - a fully open source implementation of GPT-like models for mesh-tensorflow by [EleutherAI](https://eleuther.ai).

Our library provides training and inference for GPT models up to GPT3 sizes on both TPUs and GPUs. 

In this notebook we walk you through TPU training (or finetuning!) and sampling using the freely available colab TPUs.

If you find our repo useful, come join [our discord](https://discord.gg/BK2v3EJ) and say hi! 😬

Before we get going - make sure you are running this notebook with a TPU available. Go to Runtime -> Change Runtime Type and select 'TPU' under hardware accelerator.




In [1]:
#@title Setup
%tensorflow_version 2.x
!git clone https://github.com/EleutherAI/GPTNeo
%cd GPTNeo
!pip3 install -q -r requirements.txt

Cloning into 'GPTNeo'...
remote: Enumerating objects: 82, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 3567 (delta 37), reused 26 (delta 10), pack-reused 3485
Receiving objects: 100% (3567/3567), 1.32 MiB | 2.18 MiB/s, done.
Resolving deltas: 100% (2048/2048), done.
/content/GPTNeo

## Set Up Google Cloud

To train on TPUs we need to store our data in a Google Cloud Storage bucket, as TPUs can't read from local filesystems.

You can set up a bucket by signing up for a free trial here: https://console.cloud.google.com/

Make a bucket at https://console.cloud.google.com/storage and come back when that's done.

The next cell sets up google authentication and gives the notebook read and write access to your bucket.


In [2]:
from google.colab import auth
auth.authenticate_user()
!gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'
core:
  account: stellabiderman@gmail.com

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

Choose the account you would like to use to perform operations for 
this configuration:
 [1] stellabiderman@gmail.com
 [2] Log in with a new account
Please enter your numeric choice:  1

You are logged in as: [stellabiderman@gmail.com].

Pick cloud project to use: 
 [1

In [7]:
path_to_cloud_bucket = 'gs://eleutherai' #@param {type:"string"}
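
Later cells in this notebook append a trailing slash to this path and also derive the bucket root from it. The following is a minimal sketch of that normalization in plain Python. The helper names `normalize_bucket_path` and `bucket_root` are illustrative and are not part of GPTNeo.

```python
def normalize_bucket_path(path: str) -> str:
    """Check the gs:// scheme and return the path with exactly one trailing slash."""
    if not path.startswith("gs://"):
        raise ValueError(f"expected a gs:// path, got {path!r}")
    rest = path[len("gs://"):].strip("/")
    if not rest:
        raise ValueError("bucket name is missing")
    return "gs://" + rest + "/"


def bucket_root(path: str) -> str:
    """Return only gs://<bucket-name>, dropping any sub-directories."""
    return "gs://" + path.replace("gs://", "", 1).split("/")[0]


print(normalize_bucket_path("gs://eleutherai"))
print(bucket_root("gs://eleutherai/datasets/ytsubs"))
```

Validating the path once, right after it is entered, tends to surface typos before a long tokenization or upload step has already run.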

## Set Up Dataset

We first need to download and tokenize a dataset - you can choose from:

*   Sampling Only - choose this option if you only wish to sample from our trained models.

*   OpenWebText - an open-source clone of OpenAI's WebText dataset, the original training data of GPT2.

*   YoutubeSubtitles - a dataset of subtitles scraped from YouTube videos.

*   HackerNews - comments scraped from Hacker News.

*   NIHExporter - data relating to various projects from the National Institutes of Health.

*   Custom - if this option is chosen you will be prompted to enter the path to your own dataset. It should be a directory containing .txt or .jsonl files.

All these datasets are from EleutherAI's side project - [The Pile™](https://github.com/EleutherAI/The-Pile) - an effort to gather a general purpose, diverse and open source plain text dataset large enough to train 1T+ parameter language models.

Even the smallest datasets are fairly large files, so this step will likely take a while. Select a dataset in the next cell, then run the next two cells, and go grab a snack and a cup of tea 😊

Alternatively, you can provide your own dataset in the form of a folder or gzip archive of .txt files. Simply select 'Custom' below and input the path to your data and the name of your dataset when prompted.
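
Before choosing 'Custom', it can help to confirm that the folder actually contains files with the expected extensions. The sketch below is only an illustration: `find_dataset_files` is a hypothetical helper, it looks at top-level files only, and it assumes the `.txt` and `.jsonl` extensions listed above. It does not reproduce how the tokenization script discovers files.

```python
import tempfile
from pathlib import Path


def find_dataset_files(folder, extensions=(".txt", ".jsonl")):
    """List top-level files in `folder` whose suffix is in `extensions`."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix in extensions)


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.jsonl", "notes.md"):
        (Path(tmp) / name).write_text("hello\n", encoding="utf-8")
    found = [p.name for p in find_dataset_files(tmp)]
    print(found)
```

An empty result would suggest the path is wrong or the files need converting before the tokenization step.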

In [3]:
# Select a Dataset:
import os
dataset = 'Sampling_Only' #@param ["Sampling_Only", "OpenWebText", "YoutubeSubtitles", "HackerNews", "NIHExporter", "Custom"]

if dataset == "Sampling_Only":
  pass
elif dataset == 'OpenWebText':
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar -O openwebtext.tar.xz
  !tar xf openwebtext.tar.xz
  dataset_path = "openwebtext"
  dataset_name = dataset_path
  out_name = dataset_name + "_tokenized"
elif dataset == 'YoutubeSubtitles':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/yt_subs.jsonl.zst -O data/yt_subs.jsonl.zst
  dataset_path = 'data'
  dataset_name = 'ytsubs'
  out_name = dataset_name + "_tokenized"
elif dataset == 'HackerNews':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/hn.tar.gz -O data/hn.tar.gz
  dataset_path = 'data'
  dataset_name = 'hackernews'
  out_name = dataset_name + "_tokenized"
elif dataset == "NIHExporter":
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst -O data/NIH_ExPORTER_awarded_grant_text.jsonl.zst
  dataset_path = 'data'
  # No move needed: wget -O above already saved the file under ./data
  dataset_name = 'nihexporter'
  out_name = dataset_name + "_tokenized"
elif dataset == "Custom":
  dataset_path = input('Enter the path to the folder containing your data: ')
  dataset_name = input('Enter the name of your dataset: ')
  out_name = dataset_name + "_tokenized"
else:
  raise NotImplementedError('please select from available options: ["Sampling_Only", "OpenWebText", "YoutubeSubtitles", "HackerNews", "NIHExporter", "Custom"]')


### Tokenize and Upload Data

Now tokenize the dataset and copy it over to your Google Cloud bucket. You may skip this step if you are sampling from a pre-trained model.

In [None]:
# Tokenize Data
!python data/create_tfrecords.py --input_dir /content/GPTNeo/$dataset_path --name $dataset_name --files_per 1000 --output_dir $out_name --write_dataset_config --processes 1

# copy the data to your bucket
if not path_to_cloud_bucket.endswith('/'):
  path_to_cloud_bucket += '/'
copy_loc = path_to_cloud_bucket + "datasets/" + dataset
!gsutil -m cp -r /content/GPTNeo/$out_name $copy_loc
!gsutil ls $path_to_cloud_bucket

Before starting training - you'll need to edit your dataset & model configs to point to your buckets / data. You need to do this even if you are sampling from a pre-trained model.

*   First change the writefile path to point to your chosen dataset - e.g. `%%writefile configs/dataset_configs/ytsubs.json`
*   Change the "path" field to point to your cloud bucket location - e.g. `gs://neo_lmdatasets/datasets/ytsubs_*.tfrecords`
*   Change `dataset_name` in `%%writefile configs/dataset_configs/dataset_name.json` to the name of your chosen dataset.
*   Once you've made the edits, run the cell below to overwrite the existing files.
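
The edits listed above can also be expressed as a small function that builds the dataset config from a bucket and a dataset name. This is a sketch under assumptions: `make_dataset_config` is a hypothetical helper, the field values are copied from the `Sampling_Only.json` cell in this notebook, and the `datasets/<name>/<name>*.tfrecords` layout follows that example, so it may not match where your files ended up.

```python
import json


def make_dataset_config(bucket, dataset_name):
    """Build a dataset config dict that points at tfrecords in the given bucket."""
    bucket = bucket.rstrip("/")
    return {
        # Assumed layout, taken from the example config in this notebook.
        "path": f"{bucket}/datasets/{dataset_name}/{dataset_name}*.tfrecords",
        "eval_path": "",
        "n_vocab": 50256,
        "tokenizer_is_pretrained": True,
        "tokenizer_path": "gpt2",
        "eos_id": 50256,
        "padding_id": 50257,
    }


config = make_dataset_config("gs://eleutherai/", "Sampling_Only")
print(json.dumps(config, indent=2))
```

Writing the printed JSON to `configs/dataset_configs/<dataset_name>.json` would have the same effect as editing the `%%writefile` cell by hand.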




In [4]:
%%writefile configs/dataset_configs/Sampling_Only.json

{
  "path": "gs://eleutherai/datasets/Sampling_Only/Sampling_Only*.tfrecords",
  "eval_path": "",
  "n_vocab": 50256,
  "tokenizer_is_pretrained": true,
  "tokenizer_path": "gpt2",
  "eos_id": 50256,
  "padding_id": 50257
}


Writing configs/dataset_configs/Sampling_Only.json


## Set Model Configs

The model below is identical to our pretrained GPT3XL model (1.3B Params). 
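
As a rough sanity check on the 1.3B figure, the sketch below applies a common approximation for GPT-style models: about `12 * n_layer * n_embd^2` parameters in the transformer blocks, plus the token and position embeddings. The formula assumes a 4x feed-forward expansion and learned position embeddings, and it ignores biases and layer norms. It is not taken from the GPTNeo code. The input values come from the `GPT3_XL` config in this notebook.

```python
def approx_gpt_params(n_layer, n_embd, n_vocab, n_ctx):
    """Approximate parameter count of a GPT-style model (biases and layer norms ignored)."""
    blocks = 12 * n_layer * n_embd ** 2   # attention (4*d^2) + feed-forward (8*d^2) per layer
    token_embedding = n_vocab * n_embd
    position_embedding = n_ctx * n_embd
    return blocks + token_embedding + position_embedding


total = approx_gpt_params(n_layer=24, n_embd=2048, n_vocab=50257, n_ctx=2048)
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```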

If you want to use a smaller model, you can modify any of the config files in ../configs/ ending in _8.json, all of which are designed to train on tpu-v8s.

For a more detailed breakdown on what each item in the configuration file means - please read through our training and config guides in our [github README](https://github.com/EleutherAI/GPTNeo#training-guide). 

You'll want to change the first item in the `datasets` list to the name of your chosen dataset. (the filename minus .json in ./configs/dataset_configs)

You'll also want to modify the `model_path` field to point to your google cloud bucket, so checkpoints get saved to there.
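
The two changes described above amount to overwriting two fields in the config dictionary. Below is a minimal sketch: `point_config_at` is a hypothetical helper, and the `[name, null, null, null]` entry format is copied from the config in this notebook rather than from a documented schema.

```python
import copy


def point_config_at(config, dataset_name, model_path):
    """Return a copy of `config` with the first dataset name and model_path replaced."""
    updated = copy.deepcopy(config)          # leave the caller's dict untouched
    updated["datasets"][0][0] = dataset_name
    updated["model_path"] = model_path
    return updated


base = {
    "datasets": [["HackerNews", None, None, None]],
    "model_path": "gs://eleutherai/GPT3_XL",
    "n_layer": 24,
}
mine = point_config_at(base, "ytsubs", "gs://my-bucket/GPT3_XL")
print(mine["datasets"], mine["model_path"])
```

Here `gs://my-bucket` is a placeholder. Substitute the bucket you created earlier.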

In [5]:
%%writefile configs/GPT3_XL.json

{
    "n_head": 16,
    "n_vocab": 50257,
    "embed_dropout": 0,
    "lr": 0.0002,
    "lr_decay": "cosine",
    "warmup_steps": 3000,
    "beta1": 0.9,
    "beta2": 0.95,
    "epsilon": 1e-8,
    "opt_name": "adam",
    "weight_decay": 0,
    "train_batch_size": 256,
    "attn_dropout": 0,
    "train_steps": 600000,
    "eval_steps": 0,
    "predict_steps": 1,
    "res_dropout": 0,
    "eval_batch_size": 4,
    "predict_batch_size": 1,
    "iterations": 100,
    "n_embd": 2048,
    "datasets": [["HackerNews", null, null, null]],
    "model": "GPT",
    "model_path": "gs://eleutherai/GPT3_XL",
    "n_ctx": 2048,
    "n_layer": 24,
    "scale_by_depth": true,
    "scale_by_in": false,
    "attention_types" :  [[["global", "local"],12]],
    "mesh_shape": "x:4,y:2",
    "layout": "intermediate_expanded:x,heads:x,vocab:n_vocab,memory_length:y,embd:y",
    "activation_function": "gelu",
    "recompute_grad": true,
    "gradient_clipping": 1.0,
    "tokens_per_mb_per_replica": 2048,
    "precision": "bfloat16"
}

Writing configs/GPT3_XL.json


## Training from Scratch

Now we will begin to train the model. If no previous model is found in "model_path", the model will start training from scratch. If you'd prefer to finetune from pretrained, skip to the `Pretrained Model` section.

If everything's set up correctly, you can now run the main.py script to start training!

In [None]:
!python3 main.py --model colab_XL --steps_per_checkpoint 500 --tpu colab

## Pretrained Model

If you want to sample from or finetune a pretrained model, EleutherAI has pretrained two models for release: one with [1.3B parameters](https://the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL/) and another with [2.7B](https://the-eye.eu/eleuther_staging/gptneo-release/GPT3_2-7B/).

Select an option below to download the weights locally. You will then need to upload them to your cloud bucket in order to finetune from them. If the download command isn't working, try the commented out code to download from a different source.

The 2-7B model likely won't fit into the Colab TPU's memory, and you may have to get some larger pods to finetune from it.

Sampling from it, however, works just fine.


In [8]:
# @title Download pretrained model weights:
pretrained_model = 'GPT3_XL' #@param ["GPT3_XL", "GPT3_2-7B"]

!wget -m -np -c -U "eye02" -w 2 -R "index.html*" "https://the-eye.eu/eleuther_staging/gptneo-release/$pretrained_model/"
# f-string so the selected model name is substituted into the Python variable
path_to_local_weights = f"/content/GPTNeo/the-eye.eu/eleuther_staging/gptneo-release/{pretrained_model}"

# URL = f"http://eaidata.bmk.sh/data/gptneo-release/{pretrained_model}/"
# FOLDER_NAME = "GPT3_XL"
# !curl $URL | grep -i "</a>" | sed -n 's/.*href="\([^"]*\).*/\1/p' | sed "s|^|$URL|" | xargs -n 1 -P 4 wget -P $pretrained_model
# path_to_local_weights = pretrained_model


--2021-03-22 03:32:03--  (try: 3)  https://the-eye.eu/robots.txt
Connecting to the-eye.eu (the-eye.eu)|162.213.130.242|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable
Retrying.

^C


In [9]:
# upload to your bucket
bucket_base = "gs://" + path_to_cloud_bucket.replace('gs://', '').split('/')[0]
!gsutil -m cp -r $path_to_local_weights $bucket_base

Copying file:///content/GPTNeo/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL/model.ckpt-362000.data-00020-of-00032 [Content-Type=application/octet-stream]...
/ [0/37 files][    0.0 B/ 12.3 GiB]   0% Done
Copying file:///content/GPTNeo/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL/model.ckpt-362000.data-00001-of-00032 [Content-Type=application/octet-stream]...
/ [0/37 files][    0.0 B/ 12.3 GiB]   0% Done
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil 

If everything has worked successfully you should now see your model listed in your bucket below.

In [10]:
!gsutil ls $bucket_base

gs://eleutherai/GPT3_XL/
gs://eleutherai/gptneo-release/


Now we want to make a few modifications to the model config in order to get training working on colab and finetune on your chosen dataset. If you are sampling from our pretrained models, you do not need to make any modifications.

You can change parameters below. 

* `path_to_model` should point to the model weights location in your cloud bucket, and will default to `$bucket_base/${pretrained_model}` if nothing is entered.

* `batch_size` is your train batch size - if you're encountering memory errors, try lowering this.

* `dset` is the name of your dataset. If nothing is entered, this defaults to the dataset you selected in the `Set Up Dataset` section.

* `mesh_shape` specifies the way the model will be divided up across the TPU cores. We suggest leaving this alone unless you know what you're doing.

* `train_steps` specifies how many steps you want the model to finetune for. We set this to 1000 for demonstrative purposes but you may need to increase this a little depending on your goals.

* `steps_per_checkpoint` specifies how often you want to save model weights during training.
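
To make two of these parameters concrete, the sketch below parses a `mesh_shape` string into its named dimensions and computes the step at which finetuning stops. The helper names are hypothetical. The `name:size` format and the `start_step + train_steps` arithmetic follow the values used in this notebook. Treating the product of the mesh dimensions as the number of TPU cores is an assumption about how mesh-tensorflow lays out the model.

```python
def parse_mesh_shape(mesh_shape):
    """Turn a string like 'x:4,y:2' into {'x': 4, 'y': 2}."""
    dims = {}
    for part in mesh_shape.split(","):
        name, size = part.split(":")
        dims[name.strip()] = int(size)
    return dims


def mesh_size(mesh_shape):
    """Product of all mesh dimensions (assumed to equal the number of cores used)."""
    total = 1
    for size in parse_mesh_shape(mesh_shape).values():
        total *= size
    return total


def final_train_step(start_step, train_steps):
    """Finetuning resumes at the checkpoint step and runs `train_steps` more."""
    return start_step + train_steps


print(parse_mesh_shape("x:4,y:2"), mesh_size("x:4,y:2"))
print(final_train_step(362000, 1000))
```

With the defaults in this notebook, the mesh covers 8 cores and the `GPT3_XL` run stops at step 363000.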



In [None]:
# @title Modify config for colab. 
  
import json
from pprint import pprint

path_to_model = "" #@param {type:"string"}
batch_size = 16 #@param {type:"integer"}
dset = ""  #@param {type:"string"}
mesh_shape = "x:4,y:2" #@param {type:"string"}
train_steps = 1000 #@param {type:"integer"}
steps_per_checkpoint = 500 #@param {type:"integer"}
start_step = 400000 if pretrained_model == "GPT3_2-7B" else 362000

if path_to_model == "":
  path_to_model = f'{bucket_base.strip("/")}/{pretrained_model}'
print(f'MODEL PATH: {path_to_model}\n')

if dset == "":
  dset = dataset

def pad_to_multiple_of(n, mult):
  """
  pads n to a multiple of mult
  """
  extra = n % mult
  if extra > 0:
      n = n + mult - extra
  return n

with open(f'/content/GPTNeo/the-eye.eu/eleuther_staging/gptneo-release/{pretrained_model}/config.json', 'r') as f:
  data = json.load(f)
  pprint(data)
  mods = {
          "mesh_shape": mesh_shape,
          "layout": "intermediate_expanded:x,heads:x,memory_length:y,embd:y",
          "model_path": path_to_model,
          "datasets": [[dset, None, None, None]],  # use dset so the form field above takes effect
          "train_steps": start_step + train_steps,
          "eval_steps": 0,
          "train_batch_size": batch_size
        }
  data.update(mods)
  print('\n--->\n')
  pprint(data)
  with open(f'configs/{pretrained_model}.json', 'w') as outfile:
    json.dump(data, outfile, indent=2)

## Begin Fine-Tuning

If you are fine-tuning the pretrained model, this line of code will begin the training.

In [None]:
!python3 main.py --model $pretrained_model --steps_per_checkpoint $steps_per_checkpoint --tpu colab

## Sample from your model

Once training is finished, you can run the same command with the --predict flag to sample from your model.

To pass in a prompt, save it to a .txt file, and pass in the name of the file with the --prompt flag.
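
Outside of a notebook cell, the same prompt file can be written from plain Python. This is a small sketch: `save_prompt` is a hypothetical helper, and trimming surrounding whitespace is a choice made here, not something main.py is known to require.

```python
import tempfile
from pathlib import Path


def save_prompt(text, path):
    """Write the prompt to `path` as UTF-8, trimmed, with one trailing newline."""
    path = Path(path)
    path.write_text(text.strip() + "\n", encoding="utf-8")
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = save_prompt(
        "  In a shocking finding, scientists discovered a herd of unicorns.  ",
        Path(tmp) / "example_prompt.txt",
    )
    saved = prompt_file.read_text(encoding="utf-8")
    print(repr(saved))
```

In the notebook you would call `save_prompt(text, "example_prompt.txt")` and pass that filename to the `--prompt` flag.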

Use the cell below to enter your prompt, and run it to save it to example_prompt.txt.

In [None]:
%%writefile example_prompt.txt
In a shocking finding, scientists discovered a herd of unicorns living in a remote,
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English.

In [None]:
!python3 main.py --model $pretrained_model --steps_per_checkpoint 500 --tpu colab --predict --prompt example_prompt.txt