<a href="https://colab.research.google.com/github/JayThibs/gpt-experiments/blob/main/notebooks/Fine_Tuning_GPT_2_with_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to GPT

GPT stands for "Generative Pre-Trained Transformer."

* Generative because it is used to generate text.
* Pre-trained because it was trained on a large corpus of unstructured text to make its weights learn things from text like structure, syntax, and general knowledge.
* Transformer because it uses the `decoder` part of the transformer architecture. In other words, you can give it text and it will decode what it needs to output in response (by guessing the next words that follow the input text).

GPT-3 is a massive model, far too large to fit on a Google Colab GPU and its RAM. We'll therefore use its predecessor, GPT-2, since we can actually fit it on our Colab machine.

GPT models are particularly cool because they can be applied to many downstream NLP tasks without fine-tuning. Through few-shot learning, where a handful of worked examples are placed in the prompt, the model can predict what should come next. The main limitation is the model's context window: a model like GPT-J can only take 2048 tokens as input, so in some cases we can't fit in enough examples to get great results. In a production environment, it can therefore often be worth fine-tuning a GPT model on your type of data so that it performs better.
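
To make the few-shot idea and the window limit concrete, here is a minimal sketch. The helper names (`build_few_shot_prompt`, `rough_token_count`, `fits_in_window`), the sentiment task, and the example tweets are all hypothetical and chosen only for illustration. The word-based counter is a crude stand-in for a real tokenizer, so treat its numbers as rough.

```python
from typing import Callable, List, Tuple

CONTEXT_WINDOW = 2048  # token limit quoted above for GPT-J


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format labelled examples, then the query with its label left blank."""
    blocks = [f"Tweet: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Tweet: {query}\nSentiment:")
    return "\n\n".join(blocks)


def rough_token_count(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fits_in_window(
    prompt: str,
    count_tokens: Callable[[str], int] = rough_token_count,
    window: int = CONTEXT_WINDOW,
) -> bool:
    """Return True if the prompt's token count does not exceed the window."""
    return count_tokens(prompt) <= window


# Made-up examples for a made-up sentiment task.
examples = [
    ("I love this rocket launch!", "positive"),
    ("The delay is really frustrating.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Just landed, what a day.")
print(prompt)
print(fits_in_window(prompt))
```

A real subword tokenizer usually produces more tokens than there are words, so in practice you would pass a counter based on the model's own tokenizer, for example `lambda text: len(tokenizer.encode(text))`, as `count_tokens`.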

Models like GPT-3 have been shown to produce strong results across many different tasks without fine-tuning, even when compared to a model like BERT that was specifically fine-tuned on the task data. However, fine-tuning is still often recommended to get even better performance. And in specialized domains such as medical data, GPT-3 doesn't necessarily perform well compared to a model like BioBERT.

Perhaps a good rule of thumb is to start by doing creative prompt engineering with your GPT model first to try to get great results, and then you can decide afterwards if you'd like to fine-tune the model for even better accuracy.

# Installations

In [91]:
!pip install git+https://github.com/huggingface/transformers pytorch-lightning beautifulsoup4 datasets --quiet

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for transformers (PEP 517) ... done


# Imports

In [66]:
import os
import re
import torch
import random
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel
pd.set_option('display.max_colwidth', None)

# Mounting Google Drive

Here we mount our Google Drive so that we can load the data, store the HuggingFace scripts, and save the model once we've fine-tuned it.

In [None]:
# For saving the data locally
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive/code-projects/fine-tune-gpt

/content/drive/MyDrive/code-projects/fine-tune-gpt


# Getting the Data

We'll be fine-tuning GPT-2 on Elon Musk tweets to see if we can start taking the first steps towards an Elon AI.

In [71]:
directory = 'data/elon-musk/tweets-2010-2021/'

# Read one CSV per year and stack them into a single DataFrame.
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead.
years = range(2010, 2022)
musk_tweets = pd.concat(
    [pd.read_csv(f'{directory}{year}.csv') for year in years],
    ignore_index=True,
)

In [72]:
musk_tweets.head(3)

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,user_id,user_id_str,username,name,day,hour,link,urls,photos,video,thumbnail,retweet,nlikes,nreplies,nretweets,quote_url,search,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,0,15434727182,15434727182,1275676000000.0,2010-06-04 18:31:57,0,,"Please ignore prior tweets, as that was someone pretending to be me :) This is actually me.",en,[],[],44196397,44196397,elonmusk,Elon Musk,5,18,https://twitter.com/elonmusk/status/15434727182,[],[],0,,False,4652,391,348,,,,,,,,,[],,,,
1,0,152153637639028736,152151847614943233,1325111000000.0,2011-12-28 22:27:08,0,,@TheOnion So true :),en,[],[],44196397,44196397,elonmusk,Elon Musk,3,22,https://twitter.com/elonmusk/status/152153637639028736,[],[],0,,False,12,7,1,,,,,,,,,[],,,,
2,1,151809315026636800,151809315026636800,1325029000000.0,2011-12-27 23:38:55,0,,If you ever wanted to know the *real* truth about the moon landings ...(best Onion article ever) http://t.co/pgNEJsjI,en,[],[],44196397,44196397,elonmusk,Elon Musk,2,23,https://twitter.com/elonmusk/status/151809315026636800,['http://j.mp/vLhhov'],[],0,,False,39,13,34,,,,,,,,,[],,,,


In [73]:
musk_tweets = musk_tweets['tweet']
musk_tweets.head(2)

0    Please ignore prior tweets, as that was someone pretending to be me :)  This is actually me.
1                                                                            @TheOnion So true :)
Name: tweet, dtype: object

In [74]:
# Strip @mentions, URLs, and #hashtags so the model only sees the tweet text.
# \w also matches underscores, so handles like @some_user are removed whole.
musk_tweets = musk_tweets.replace(to_replace=r'@\w+', value='', regex=True)
musk_tweets = musk_tweets.replace(to_replace=r'http\S+', value='', regex=True)
musk_tweets = musk_tweets.replace(to_replace=r'#\w+', value='', regex=True)
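
To see what these patterns do to a single tweet, here is a small standalone sketch using only the `re` module. The `clean_tweet` helper is hypothetical. It applies the same three removals in the same order as the cell above, using `\w` so that handles containing underscores are removed whole, and it additionally collapses leftover whitespace, which the cell above does not do.

```python
import re

MENTION = re.compile(r"@\w+")      # @handles, including underscores
URL = re.compile(r"http\S+")       # anything starting with http up to the next space
HASHTAG = re.compile(r"#\w+")      # #hashtags
WHITESPACE = re.compile(r"\s+")    # runs of spaces left behind by the removals


def clean_tweet(text: str) -> str:
    """Remove mentions, URLs, and hashtags, then tidy the remaining whitespace."""
    for pattern in (MENTION, URL, HASHTAG):
        text = pattern.sub("", text)
    # Extra step (not in the notebook cell): collapse and trim whitespace.
    return WHITESPACE.sub(" ", text).strip()


samples = [
    "@TheOnion So true :)",
    "Best Onion article ever http://t.co/pgNEJsjI #moon",
]
for sample in samples:
    print(repr(clean_tweet(sample)))
```

The same function could be applied to the whole column with `musk_tweets.apply(clean_tweet)` if the whitespace cleanup is wanted.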

In [75]:
musk_tweets.head(20)

0                                                 Please ignore prior tweets, as that was someone pretending to be me :)  This is actually me.
1                                                                                                                                   So true :)
2                                           If you ever wanted to know the *real* truth about the moon landings ...(best Onion article ever)  
3                                                                Walked around a neighborhood recently rebuilt with help from APJ and others  
4                                            It was Xmas, so we brought presents for the kids at the orphanage. They don't usually get much.  
5                  Met with UNICEF, Doctors Without Borders and Artists for Peace & Justice. I support them and would recommend others do too.
6                          Just returned from a trip to Haiti. Covered a lot of ground and saw many tough situations. They need a lot of help.

In [77]:
len(musk_tweets)

43074

## Training Splits

In [81]:
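# Hold out 10% of the tweets for validation.
# random_state is an arbitrary fixed seed so the split is reproducible.
train, val = train_test_split(musk_tweets, test_size=0.1, random_state=42)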

In [83]:
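train_path = f'{directory}train.csv'
val_path = f'{directory}val.csv'

# index=False keeps the pandas index out of the files,
# so the tweet text is the only column the training script sees.
train.to_csv(train_path, index=False)
val.to_csv(val_path, index=False)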

# Fine-Tuning GPT-2

Fine-tuning models from the HuggingFace model hub is much easier because HuggingFace provides ready-made training scripts.

From the `transformers` repo:

> There are two sets of scripts provided. The first set leverages the Trainer API. The second set with no_trainer in the suffix uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

You can learn more about it here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling

We will be using the script that leverages the Trainer API. We can download it by running:

In [84]:
# Download the script only if we don't already have a local copy.
if not os.path.exists('gpt-2/run_clm.py'):
    !wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/language-modeling/run_clm.py -P gpt-2/

# Train

In [95]:
train_path = '/content/drive/MyDrive/code-projects/fine-tune-gpt/data/elon-musk/tweets-2010-2021/train.csv'
val_path = '/content/drive/MyDrive/code-projects/fine-tune-gpt/data/elon-musk/tweets-2010-2021/val.csv'

In [96]:
!python gpt-2/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train_path \
    --validation_file val_path \
    --do_train \
    --do_eval \
    --output_dir gpt-2/tmp/elon-test-clm

Traceback (most recent call last):
  File "gpt-2/run_clm.py", line 526, in <module>
    main()
  File "gpt-2/run_clm.py", line 203, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.7/dist-packages/transformers/hf_argparser.py", line 206, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 14, in __init__
  File "gpt-2/run_clm.py", line 186, in __post_init__
    assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file."
AssertionError: `train_file` should be a csv, a json or a txt file.
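
The assertion fails because the `!python` cell passes the literal strings `train_path` and `val_path` to the script rather than the values of the Python variables, so the script sees a file name with no `.csv` extension. In a notebook, `$train_path` or `{train_path}` inside the `!` command interpolates the variable. As an alternative, here is a minimal sketch that builds the argument list in Python and runs the same extension check before launching anything. `check_extension` and `build_clm_command` are hypothetical helpers, not part of `run_clm.py`, and the relative paths are the ones used earlier in this notebook.

```python
import sys
from typing import List

# Extensions listed in the assertion message above.
ALLOWED_EXTENSIONS = ("csv", "json", "txt")


def check_extension(path: str) -> None:
    """Fail early if the path does not end in an accepted extension."""
    extension = path.split(".")[-1]
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"{path!r} should be a csv, a json or a txt file.")


def build_clm_command(
    train_path: str,
    val_path: str,
    output_dir: str,
    script: str = "gpt-2/run_clm.py",
    model: str = "gpt2",
) -> List[str]:
    """Return the argument list for the run_clm.py call shown above."""
    check_extension(train_path)
    check_extension(val_path)
    return [
        sys.executable, script,
        "--model_name_or_path", model,
        "--train_file", train_path,
        "--validation_file", val_path,
        "--do_train",
        "--do_eval",
        "--output_dir", output_dir,
    ]


train_path = "data/elon-musk/tweets-2010-2021/train.csv"
val_path = "data/elon-musk/tweets-2010-2021/val.csv"
command = build_clm_command(train_path, val_path, "gpt-2/tmp/elon-test-clm")
print(" ".join(command))
# To start training, pass the list to subprocess.run(command, check=True).
```

Passing the literal string `"train_path"` to `build_clm_command` raises a `ValueError` immediately, which reproduces the failure mode above without starting the script.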
