<a href="https://colab.research.google.com/github/JayThibs/gpt-experiments/blob/main/notebooks/gpt_2_alignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning GPT-2 on Alignment Texts Dataset

This notebook is meant for initial experimentation of fine-tuning on the alignment text dataset.

In [1]:
!nvidia-smi

Sun Jun 26 21:54:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Installations

In [6]:
!pip install git+https://github.com/huggingface/transformers pytorch-lightning beautifulsoup4 datasets jsonlines --quiet

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone


# Imports

In [20]:
import os
import re
import torch
import random
import jsonlines
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
from torch.utils.data import Dataset
import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, AutoTokenizer, TrainingArguments, Trainer, GPT2LMHeadModel
pd.set_option('display.max_colwidth', None)

# Mounting Google Drive

Here we will mount our Google Drive so that we can grab data and save the HuggingFace scripts, and save the model once we've fine-tuned it.

In [4]:
# For saving the data locally
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd drive/MyDrive/data/ai-alignment-dataset/

/content/drive/MyDrive/data/ai-alignment-dataset


# Getting the Data

We'll be fine-tuning GPT-2 on Elon Musk tweets to see if we can start taking the first steps towards an Elon AI.

In [21]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

In [27]:
i = 0 
with jsonlines.open("alignment_texts.jsonl") as reader:
    for line in reader:
        if i > 0:
            break
        text = line["text"]
        try:
            if text != "":
                print(len(text.split()))
                print(text)
                encoding = tokenizer(text)
                total_len = len(encoding.tokens())
                print(encoding.tokens)
            i += 1
        except:
            pass


306
Webinar with Congressman Ro Khanna: Challenges in IT Law and Governance


 Download as PDF

On Friday, February 19, the AI Pulse project hosted a web conversation on current issues in IT Governance with Congressman Ro Khanna, a leading progressive thinker on a wide range of law and technology issues in the United States Congress. Rep. Khanna represents California’s 17th district in the House of Representatives, where he chairs the Environment Subcommittee of the House Committee on Oversight and Reform, and serves as Deputy Whip of the Congressional Progressive Caucus. He is a passionate advocate of using technology to bring economic opportunity to rural and small-town America. In 2018, at the request of Speaker Pelosi, he authored a widely praised set of principles for an Internet Bill of Rights. Prior to serving in Congress, Rep. Khanna worked as an intellectual-property lawyer and served in the Obama Administration as Deputy Assistant Secretary of Commerce. He holds an undergradu

In [26]:
print(len(encoding.tokens()))

387


In [None]:
directory = 'data/elon-musk/tweets-2010-2021/'
musk_tweets = pd.read_csv(f'{directory}' + '2010.csv')

list_of_years = ['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']

for year in list_of_years:
    temp_df = pd.read_csv(f'{directory}' + year + '.csv')
    musk_tweets = musk_tweets.append(temp_df, ignore_index=True)

In [None]:
musk_tweets.head(3)

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,user_id,user_id_str,username,name,day,hour,link,urls,photos,video,thumbnail,retweet,nlikes,nreplies,nretweets,quote_url,search,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,0,15434727182,15434727182,1275676000000.0,2010-06-04 18:31:57,0,,"Please ignore prior tweets, as that was someone pretending to be me :) This is actually me.",en,[],[],44196397,44196397,elonmusk,Elon Musk,5,18,https://twitter.com/elonmusk/status/15434727182,[],[],0,,False,4652,391,348,,,,,,,,,[],,,,
1,0,152153637639028736,152151847614943233,1325111000000.0,2011-12-28 22:27:08,0,,@TheOnion So true :),en,[],[],44196397,44196397,elonmusk,Elon Musk,3,22,https://twitter.com/elonmusk/status/152153637639028736,[],[],0,,False,12,7,1,,,,,,,,,[],,,,
2,1,151809315026636800,151809315026636800,1325029000000.0,2011-12-27 23:38:55,0,,If you ever wanted to know the *real* truth about the moon landings ...(best Onion article ever) http://t.co/pgNEJsjI,en,[],[],44196397,44196397,elonmusk,Elon Musk,2,23,https://twitter.com/elonmusk/status/151809315026636800,['http://j.mp/vLhhov'],[],0,,False,39,13,34,,,,,,,,,[],,,,


In [None]:
musk_tweets.rename(columns={'tweet': 'text'}, inplace=True)
musk_tweets = musk_tweets['text']
musk_tweets.head(2)

0    Please ignore prior tweets, as that was someone pretending to be me :)  This is actually me.
1                                                                            @TheOnion So true :)
Name: text, dtype: object

In [None]:
musk_tweets.replace(to_replace="@[A-Za-z0-9]+", value="", regex=True, inplace=True)
musk_tweets.replace(to_replace=r'http\S+', value="", regex=True, inplace=True)
musk_tweets.replace(to_replace=r'#[A-Za-z0-9]+', value="", regex=True, inplace=True)
musk_tweets = musk_tweets[musk_tweets.str.len()>=20]
# musk_tweets = "<endoftext>" + musk_tweets + "<endoftext>"

In [None]:
musk_tweets.head(20)

0                                                 Please ignore prior tweets, as that was someone pretending to be me :)  This is actually me.
2                                           If you ever wanted to know the *real* truth about the moon landings ...(best Onion article ever)  
3                                                                Walked around a neighborhood recently rebuilt with help from APJ and others  
4                                            It was Xmas, so we brought presents for the kids at the orphanage. They don't usually get much.  
5                  Met with UNICEF, Doctors Without Borders and Artists for Peace & Justice. I support them and would recommend others do too.
6                          Just returned from a trip to Haiti. Covered a lot of ground and saw many tough situations. They need a lot of help.
7                                                                        Single character Tweets are the ulitmate extension of the Twitmeme...

In [None]:
len(musk_tweets)

33935

In [None]:
musk_tweets.dropna(inplace=True)
len(musk_tweets)

33935

## Training Splits

In [None]:
train, val = train_test_split(musk_tweets, test_size=0.2)
test, val = train_test_split(val, test_size=0.5)

In [None]:
print("Number of Train examples: " + str(len(train)))
print("Number of Val examples: " + str(len(val)))
print("Number of Test examples: " + str(len(test)))

Number of Train examples: 27148
Number of Val examples: 3394
Number of Test examples: 3393


In [None]:
train_path = f'{directory}' + 'train.csv'
val_path = f'{directory}' + 'val.csv'
test_path = f'{directory}' + 'test.csv'

train.to_csv(train_path, index=False)
val.to_csv(val_path, index=False)
test.to_csv(test_path, index=False)

# Fine-Tuning GPT-2

If we're looking to fine-tune models which are found on the HuggingFace model hub, then it becomes much easier to fine-tune the models since HuggingFace provides us with scripts.

From the `transformers` repo:

> There are two sets of scripts provided. The first set leverages the Trainer API. The second set with no_trainer in the suffix uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

You can learn more about it here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling

We will be using the script that leveraged the Trainer API. We can download the script by running:

In [None]:
if os.path.exists('/gpt-2/run_clm.py'):
    !wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/language-modeling/run_clm.py -P gpt-2/

# Train

In [None]:
!python gpt-2/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file data/elon-musk/tweets-2010-2021/train.csv \
    --validation_file data/elon-musk/tweets-2010-2021/val.csv \
    --do_train \
    --do_eval \
    --per_device_eval_batch_size=2 \
    --per_device_train_batch_size=2 \
    --output_dir gpt-2/tmp/elon-test-clm \
    --overwrite_output_dir

11/30/2021 00:14:04 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gpt-2/tmp/elon-test-clm/runs/Nov30_00-14

# Let's use the model!

In [None]:
OUTPUT_DIR = "gpt-2/tmp/elon-test-clm"
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

tokenizer = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
model = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR)
model = model.to(device)
                                        
def generate(input_str, length=250, n=5):
  cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
  model.eval()
  with torch.no_grad():
    for i in range(length):
      outputs = model(cur_ids[:, -1024:], labels=cur_ids[:, -1024:])
      loss, logits = outputs[:2]
      softmax_logits = torch.softmax(logits[0,-1], dim=0)
      next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
      cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim=1)
    output_list = list(cur_ids.squeeze().to('cpu').numpy())
    output_text = tokenizer.decode(output_list)
    return output_text

def choose_from_top(probs, n=5):
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob) # Normalize
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return int(token_id)

generated_text = generate("Just dropping some")
print(generated_text)

Just dropping some of my old stuff in the trunk. It’s been a while since I last used a car.      _AA_Carmack   Yeah, I think we should do something about it.  We have a long way to go. Will be interesting to see what happens to those who don’t support this.   I think it will be great. It will be a lot more than a mere cameo. _AA_Carmack  Yes, it’s a great game. I think we should do something about it.   _Station _Ryan _AA_Carmack       _Station _AA_Carmack  I’m not a big fan of the Tesla Autopilot software, but I do like the idea of having a car capable of recognizing pedestrians and cyclists. It’s a great idea, especially with the high speed autotracing. _Station I love you         _Padival         _Station  Yes, it will have a lot of new features coming to the Tesla Model S, including the ability to drive from the garage to the


In [None]:
generated_text = generate("Just dropping some")
print(generated_text)

Just dropping some of our own resources into the ocean                   Yeah, we have to make sure we have a good product. That will be a priority.  We will make a new version of Falcon Heavy for free.  _AA_Carmack      _Sword _Sword _Sword    _AA_Carmack _Sword          We’ll try to get that done, but I think we’ll be better off with a more advanced, reusable, reusable rocket booster.        Yeah, I love it :)       I love the idea of having a Tesla in the car. It’s awesome.  _Ryan    Yeah, that's what we should do   Yeah, we will make the Model S a lot faster, but we have a lot of work to do to get it right, as we did in the beginning. We’ve had a lot of setbacks._Gardi  _Ryan Yeah, that's exactly right _AA_Carmack I’m just trying to be as polite


# Compressing the Model

Let's save the model as a `tar.gz` file so that we can save it in Google Drive.

In [None]:
!tar -czf gpt-2-elon-tweets.tar.gz gpt-2/tuned-models/