<a href="https://colab.research.google.com/github/JayThibs/gpt-experiments/blob/main/notebooks/Fine_Tuning_GPT_2_with_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to GPT

GPT stands for "Generative Pre-Trained Transformer."

* Generative because it is used to generate text.
* Pre-trained because it was first trained on a large corpus of unstructured text, so its weights already encode things like structure, syntax, and general knowledge.
* Transformer because it uses the `decoder` part of the transformer architecture. In other words, you can give it text and it will decode what it needs to output in response (by guessing the next words that follow the input text).
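
To make "guessing the next words" concrete, here is a minimal sketch of the autoregressive loop that a decoder-only model runs at inference time. The lookup table `TOY_NEXT_WORD` and the function `toy_generate` are invented for illustration and stand in for the real network. GPT-2 itself works on subword tokens and conditions on the whole preceding context, not just the last word.

```python
# Toy stand-in for a language model: maps the last word to its most likely successor.
# This table is invented for illustration; a real GPT scores every token in its vocabulary.
TOY_NEXT_WORD = {
    "the": "rocket",
    "rocket": "landed",
    "landed": "safely",
}

def toy_generate(prompt, max_new_words=5):
    """Greedy autoregressive decoding: repeatedly append the predicted next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = TOY_NEXT_WORD.get(words[-1])
        if next_word is None:  # the toy model has no prediction, so stop early
            break
        words.append(next_word)  # the output becomes part of the next input
    return " ".join(words)

print(toy_generate("the"))
```

The key point is the feedback step: each predicted word is appended to the input before the next prediction is made.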

GPT-3 is a massive model, much too large to fit in the memory of a puny Google Colab GPU. We'll therefore use its predecessor, GPT-2, since we can actually fit it on our Colab machine.

GPT models are particularly cool because they can be applied to many downstream NLP tasks without fine-tuning. Through few-shot learning, where a handful of worked examples are placed in the prompt, the model can predict what should come next. One important limitation is the window size: a model like GPT-J accepts at most 2048 tokens of input, so in some cases we can't fit enough examples into the prompt to get fantastic results. And in a production environment, it is often worth fine-tuning a GPT model on your type of data so that it performs better.
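
The window-size limit can be sketched as a simple budgeting problem. The helper names `count_tokens` and `build_few_shot_prompt`, and the sample reviews, are hypothetical. Counting one token per whitespace-separated word is a crude assumption, since a real tokenizer such as GPT-2's BPE usually produces more tokens than that.

```python
def count_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    # A real tokenizer (e.g. GPT-2's BPE) usually produces more tokens than this.
    return len(text.split())

def build_few_shot_prompt(examples, query, max_tokens=2048):
    """Keep adding examples, in order, until the next one would overflow the window."""
    budget = max_tokens - count_tokens(query)
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if cost > budget:
            break
        kept.append(example)
        budget -= cost
    return "\n\n".join(kept + [query])

# Made-up examples for a sentiment task
examples = [
    "Review: loved it\nSentiment: positive",
    "Review: total waste of money\nSentiment: negative",
]
# A deliberately tiny window so that the second example no longer fits
prompt = build_few_shot_prompt(examples, "Review: works great\nSentiment:", max_tokens=12)
print(prompt)
```

With a 12-token window only the first example survives. This is the situation where fine-tuning becomes attractive, because the model no longer needs the examples in the prompt.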

Models like GPT-3 have been shown to achieve strong results across many different tasks without fine-tuning, even when compared to a model like BERT that was fine-tuned specifically on the task data. However, fine-tuning is still often recommended to get even better performance. And in specialized domains such as medical data, GPT-3 doesn't necessarily perform well compared to a model like BioBERT.

Perhaps a good rule of thumb is to start with creative prompt engineering and see how far that gets you, and only then decide whether to fine-tune the model for even better accuracy.

In [1]:
!nvidia-smi

Fri Nov 26 03:17:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Installations

In [2]:
!pip install git+https://github.com/huggingface/transformers pytorch-lightning beautifulsoup4 datasets --quiet


# Imports

In [3]:
import os
import re
import torch
import random
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel
pd.set_option('display.max_colwidth', None)

# Mounting Google Drive

Here we will mount our Google Drive so that we can load the data, store the HuggingFace training script, and save the model once we've fine-tuned it.

In [4]:
# For saving the data locally
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd drive/MyDrive/code-projects/fine-tune-gpt

/content/drive/MyDrive/code-projects/fine-tune-gpt


# Getting the Data

We'll be fine-tuning GPT-2 on Elon Musk tweets to see if we can start taking the first steps towards an Elon AI.

In [6]:
directory = 'data/elon-musk/tweets-2010-2021/'

# One CSV file per year, 2010 through 2021
years = range(2010, 2022)

# DataFrame.append was removed in pandas 2.0, so read every file and concatenate once
musk_tweets = pd.concat(
    [pd.read_csv(f'{directory}{year}.csv') for year in years],
    ignore_index=True,
)

In [7]:
musk_tweets.head(3)

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,user_id,user_id_str,username,name,day,hour,link,urls,photos,video,thumbnail,retweet,nlikes,nreplies,nretweets,quote_url,search,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,0,15434727182,15434727182,1275676000000.0,2010-06-04 18:31:57,0,,"Please ignore prior tweets, as that was someone pretending to be me :) This is actually me.",en,[],[],44196397,44196397,elonmusk,Elon Musk,5,18,https://twitter.com/elonmusk/status/15434727182,[],[],0,,False,4652,391,348,,,,,,,,,[],,,,
1,0,152153637639028736,152151847614943233,1325111000000.0,2011-12-28 22:27:08,0,,@TheOnion So true :),en,[],[],44196397,44196397,elonmusk,Elon Musk,3,22,https://twitter.com/elonmusk/status/152153637639028736,[],[],0,,False,12,7,1,,,,,,,,,[],,,,
2,1,151809315026636800,151809315026636800,1325029000000.0,2011-12-27 23:38:55,0,,If you ever wanted to know the *real* truth about the moon landings ...(best Onion article ever) http://t.co/pgNEJsjI,en,[],[],44196397,44196397,elonmusk,Elon Musk,2,23,https://twitter.com/elonmusk/status/151809315026636800,['http://j.mp/vLhhov'],[],0,,False,39,13,34,,,,,,,,,[],,,,


In [8]:
musk_tweets.rename(columns={'tweet': 'text'}, inplace=True)
musk_tweets = musk_tweets['text']
musk_tweets.head(2)

0    Please ignore prior tweets, as that was someone pretending to be me :)  This is actually me.
1                                                                            @TheOnion So true :)
Name: text, dtype: object

In [9]:
# Strip @mentions, links, and hashtags from every tweet
musk_tweets.replace(to_replace=r'@[A-Za-z0-9]+', value="", regex=True, inplace=True)
musk_tweets.replace(to_replace=r'http\S+', value="", regex=True, inplace=True)
musk_tweets.replace(to_replace=r'#[A-Za-z0-9]+', value="", regex=True, inplace=True)
# Keep only tweets that are 20 to 50 characters long after cleaning
musk_tweets = musk_tweets[musk_tweets.str.len()>=20]
musk_tweets = musk_tweets[musk_tweets.str.len()<=50]
# Optional: wrap each tweet in GPT-2's end-of-text token
# musk_tweets = "<|endoftext|>" + musk_tweets + "<|endoftext|>"
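
To see what the three substitutions in the cell above do to an individual tweet, here is a small sketch that applies the same patterns with plain `re.sub`. The function name `clean_tweet` and the second sample tweet are made up for illustration.

```python
import re

def clean_tweet(text):
    """Apply the same three substitutions as the cell above to a single string."""
    text = re.sub(r"@[A-Za-z0-9]+", "", text)  # mentions
    text = re.sub(r"http\S+", "", text)        # links
    text = re.sub(r"#[A-Za-z0-9]+", "", text)  # hashtags
    return text

# The first sample appears in the dataset preview; the second is made up
print(repr(clean_tweet("@TheOnion So true :)")))
print(repr(clean_tweet("@some_user check #Mars https://t.co/abc")))
```

Two things are worth noticing. The mention pattern stops at the first character outside `[A-Za-z0-9]`, so a handle containing an underscore leaves a fragment such as `_user` behind; a wider pattern like `@\w+` would catch it. The whitespace around removed items also stays in place, so it still counts toward the 20 to 50 character length filter.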

In [10]:
musk_tweets.head(20)

15             Yum! Even better than deep fried butter:  
23                      That was a total non sequitur btw
28                   Interesting premise. I will read it.
29                                    V cute! Merry Xmas.
30                 Cowboy riding the rocket no problemo  
31      Single camera view of the 40 meter rocket hover  
74                   Original article on Model S from :  
77                       Alexander Hamilton was awesome  
87           Just wrote a blog piece about Tesla stores  
91                               An update about Tesla   
92                               Sorry, meant to say EDT.
94                            Review of the Model S by   
103        Amazing series of space pics assembled by     
105    This piece about Mars in the NYT is worth a read  
130          Now it's just a song that you used to know  
151                    "The Girl Who Fixed the Umlaut"   
153                 Meant to post this link for Merlin:  
155           

In [11]:
len(musk_tweets)

9446

In [12]:
musk_tweets.dropna(inplace=True)
len(musk_tweets)

9446

## Training Splits

In [13]:
train, val = train_test_split(musk_tweets, test_size=0.1)

In [14]:
train_path = f'{directory}' + 'train.csv'
val_path = f'{directory}' + 'val.csv'

train.to_csv(train_path, index=False)
val.to_csv(val_path, index=False)

# Fine-Tuning GPT-2

Models found on the HuggingFace model hub are much easier to fine-tune, since HuggingFace provides example training scripts for them.

From the `transformers` repo:

> There are two sets of scripts provided. The first set leverages the Trainer API. The second set with no_trainer in the suffix uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

You can learn more about it here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling

We will be using the script that leverages the Trainer API. We can download it by running:

In [15]:
# Download the script only if we don't already have a local copy
if not os.path.exists('gpt-2/run_clm.py'):
    !wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/language-modeling/run_clm.py -P gpt-2/

# Train

In [16]:
!python gpt-2/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file data/elon-musk/tweets-2010-2021/train.csv \
    --validation_file data/elon-musk/tweets-2010-2021/val.csv \
    --do_train \
    --do_eval \
    --per_device_eval_batch_size=2 \
    --per_device_train_batch_size=2 \
    --output_dir gpt-2/tmp/elon-test-clm \
    --overwrite_output_dir

11/26/2021 03:18:36 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gpt-2/tmp/elon-test-clm/runs/Nov26_03-18

# Let's use the model!

In [17]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import numpy as np

OUTPUT_DIR = "gpt-2/tmp/elon-test-clm"
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

tokenizer = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
model = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR)
model = model.to(device)
                                        
def choose_from_top(probs, n=5):
    """Sample one token id from the n most probable tokens."""
    ind = np.argpartition(probs, -n)[-n:]   # indices of the n largest probabilities (unordered)
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob)  # renormalize so the n probabilities sum to 1
    choice = np.random.choice(n, 1, p=top_prob)
    token_id = ind[choice][0]
    return int(token_id)

def generate(input_str, length=250, n=5):
    """Append `length` new tokens to input_str using top-n sampling."""
    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
    model.eval()
    with torch.no_grad():
        for _ in range(length):
            # GPT-2's context window is 1024 tokens, so feed only the most recent 1024.
            # No labels are passed: we need the logits, not a loss.
            logits = model(cur_ids[:, -1024:]).logits
            probs = torch.softmax(logits[0, -1], dim=0)
            next_token_id = choose_from_top(probs.to('cpu').numpy(), n=n)
            next_token = torch.tensor([[next_token_id]], device=device)
            cur_ids = torch.cat([cur_ids, next_token], dim=1)
    return tokenizer.decode(cur_ids.squeeze(0).tolist())

generated_text = generate("Tesla")
print(generated_text)

I went to Space X and was so excited to see it! I love space. It's amazing. I love it. That's what we do. I think that's the best part about it, too :)

What's up with that? I don't have a Twitter handle. It's a great one! I love your tweets! Thanks for letting me know :) 🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣 🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣 🤣 🤣🤣🤣       Yeah, it was a real pain in the neck, but hopefully it goes away soon     It was just me, so it's just the way it should be   It was just me, so it's just the way it should be      Yeah, I agree. I love space. It's amazing! I love it. That's what we do
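
The sampling step inside `choose_from_top` is easier to follow on a toy distribution. The sketch below re-expresses the same idea (keep the `n` most probable tokens, renormalize, sample) using only the standard library. The function name `sample_top_n` and the six-token distribution are made up for illustration, and this is not the NumPy code used in the cell above.

```python
import random

def sample_top_n(probs, n=5, rng=random):
    """Keep the n most probable token ids, renormalize, and sample one of them."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    total = sum(probs[i] for i in top_ids)
    weights = [probs[i] / total for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]

# Made-up distribution over a six-token vocabulary
toy_probs = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]

rng = random.Random(0)
draws = [sample_top_n(toy_probs, n=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))            # only ids 1 and 3 can ever be drawn
print(sample_top_n(toy_probs, n=1))  # n=1 always picks the argmax (greedy decoding)
```

A small `n` keeps the output close to the single most likely continuation, while a larger `n` lets less likely tokens through and makes the text more varied.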


In [18]:
generated_text = generate("Tesla")
print(generated_text)

Tesla, Tesla Model S, and more.

And, of course, Tesla will be in the news for that as well.

The Model 3 is a real deal! pic.twitter.com/9Zn9fW1vYXn — Tesla CEO Elon Musk (@elonmusk) November 9, 2016

And yes, that will include Tesla's own Twitter.

What are your thoughts on Tesla's new Model S? What's next for the Model 3 and future of the Model X? Sound off in the comments below!

Read or Share this story: http://usat.ly/10vfZn0f<|endoftext|>The latest issue of Dragon Age, available now on iphone,      It's hard to imagine anything better. I'm glad you like it, Dragon Age is one of those games! 🤣️   It would be great for you too 🤣️🤣️ 🤣️🤣️  It's hard to imagine anything better. I'm glad you like it, Dragon Age is one of those games! 🤣️🤣️ 🤣️🤣️�
