# GPT-2 FINE TUNING AND GENERATION

First, let's install the required packages. If running on Colab, run the following pip commands after connecting to a runtime and mounting your Google Drive. If these installs become deprecated, use the torch install option (currently commented out) instead. To use TPUs, uncomment the last line as well.


In [11]:
!pip install git+https://github.com/huggingface/transformers
!pip install wandb
!pip install evaluate

# For special cases and TPU use run:
# !pip install torch transformers wandb -qqq 
# !pip install cloud-tpu-client==0.10 torch==1.13.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.13-cp38-cp38-linux_x86_64.whl

Collecting git+https://github.com/huggingface/transformers

  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers 'C:\Users\andre\AppData\Local\Temp\pip-req-build-dcj1apwe'



  Cloning https://github.com/huggingface/transformers to c:\users\andre\appdata\local\temp\pip-req-build-dcj1apwe
  Resolved https://github.com/huggingface/transformers to commit d994473b05a83ea398d9f10ca458855df095e22d
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
     ---------------------------------------- 81.4/81.4 kB 4.4 MB/s eta 0:00:00
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
     ---------------------------------------- 132.9/132.9 kB ? eta 0:00:00
Collecting datasets>=2.0.0
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
     -------------------------------------- 451.7/45

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import (
            AutoTokenizer, AutoModelForCausalLM,
            TextDataset, DataCollatorForLanguageModeling,
            Trainer, TrainingArguments,
            get_cosine_schedule_with_warmup,
            EarlyStoppingCallback,  IntervalStrategy)
import os
import numpy as np
import re
import pandas as pd
import tensorflow as tf
import torch
import pathlib
import random
import evaluate

# TRAINING STEPS
The following steps require .txt files in the appropriate format, generated by "Scrapping and Cleaning.ipynb" and/or "full_model_processing.ipynb".

In [None]:
# Make sure your wandb account is set up so you can view the training evaluations.
# Have your API token ready. If no login prompt appears (e.g. when running in VS Code),
# run `wandb login` in a terminal first.
import wandb

wandb.login()
wandb.init(project="my-awesome-project")

Next, let's load our data and update the tokenizer. Setting "use_special_tokens" to True adds new special tokens to the tokenizer's vocabulary.
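
To illustrate what "special" means here, the following is a minimal, library-free sketch and not the tokenizer's actual implementation. It assumes a special token is simply a string that must survive as one piece, and that new tokens receive the next free ids after the base vocabulary (GPT-2's base vocabulary has 50257 entries). The helper names `split_keeping_special` and `assign_new_ids` are hypothetical.

```python
import re

# Hypothetical subset of the special tokens used in this notebook.
SPECIAL_TOKENS = ['<|endoftext|>', '<|elonmusk|>', '<|neutral|>']

def split_keeping_special(text, special_tokens):
    """Split text so that each special token stays one unbroken piece."""
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    # re.split with a capturing group keeps the matched tokens in the result.
    return [piece for piece in re.split(pattern, text) if piece]

def assign_new_ids(new_tokens, base_vocab_size):
    """Give each new token the next free id after the existing vocabulary."""
    return {tok: base_vocab_size + i for i, tok in enumerate(new_tokens)}

pieces = split_keeping_special('<|elonmusk|><|neutral|>Launch day!', SPECIAL_TOKENS)
new_ids = assign_new_ids(['<|elonmusk|>', '<|neutral|>'], base_vocab_size=50257)
print(pieces)
print(new_ids)
```

Because the vocabulary grows, the model's embedding matrix needs one extra row per new token, which is why the model is resized after the tokenizer is updated.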

In [None]:
#####################################################################################
# SETUP:
# Select the text file used for training and the evaluation file used to monitor progress
file_path =  "data/special_token_versions/textfiles/train_data.txt"  # Training set. Use train_data.txt for full scale
file_path_val = "data/special_token_versions/textfiles/test_data.txt" # Test set. Use test_data.txt for full scale
use_special_tokens = True # True: add the custom tokens below; False: keep the base GPT-2 vocabulary

#Add custom tokens explicitly:
special_tokens_manual= ['<|sensander|>', '<|paddingtonbear|>', '<|hankgreen|>', '<|joerogan|>', '<|elonmusk|>', '<|polite|>', '<|impolite|>','<|neutral|>']
#Add tokens from csv file:
token_file_path = "data/special_token_versions/keys.csv"
#####################################################################################


new_special_tokens = pd.read_csv(token_file_path)["Keys"].values.tolist() + special_tokens_manual #add these to special tokens
special_tokens_dict = {'additional_special_tokens': new_special_tokens} # use for direct add
"""
Here we specify the special tokens used during training. These are tokens
that the tokenizer should never split into smaller pieces.

Subject tokens, user tokens, and politeness tokens are included; tokens added
by hand go in special_tokens_manual.
"""


Initialize the model and tokenizer. If special tokens are enabled, the tokenizer and model are updated to use them.

In [None]:
#Load in base GPT-2 model and corresponding tokenizer:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

#Add special tokens and update model if required:
if use_special_tokens:
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict) #adds special tokens
    model.resize_token_embeddings(len(tokenizer))


block_size = tokenizer.model_max_length
train_dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=block_size, overwrite_cache=True)
evaluation_dataset = TextDataset(tokenizer=tokenizer, file_path=file_path_val, block_size=block_size, overwrite_cache=True)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Set up wandb:
%env WANDB_PROJECT=tweet_analysis

wandb.run.name = file_path
wandb.run.save()
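
As a rough picture of what `block_size` does to the training file, here is a minimal sketch that cuts a long list of token ids into fixed-length blocks. It assumes, as I understand `TextDataset` to behave, that a trailing block shorter than `block_size` is dropped; the function name `chunk_token_ids` and the toy ids are hypothetical.

```python
def chunk_token_ids(token_ids, block_size):
    """Cut a long id sequence into consecutive blocks of exactly block_size ids."""
    blocks = []
    # Stop early enough that every block is full; the remainder is discarded.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

toy_ids = list(range(10))              # stand-in for a tokenized text file
blocks = chunk_token_ids(toy_ids, block_size=4)
print(blocks)                          # ids 8 and 9 do not fill a block
```

With `block_size = tokenizer.model_max_length` (1024 for GPT-2), each training example is a full-length context window, so a file shorter than one block would yield no examples under this assumption.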

In [7]:
#Check that you have a GPU connected
!nvidia-smi

Tue Dec 13 23:29:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 526.98       Driver Version: 526.98       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P8    N/A /  N/A |    184MiB /  2048MiB |     21%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Next, we define our training parameters. Early stopping is handled by a callback that monitors an evaluation metric; the metric comes from a compute_metrics function, which can be changed to use different metrics (e.g. accuracy, precision, F1).
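
The idea behind patience-based early stopping can be sketched without any library. This is a simplified illustration under the assumption that a lower metric (such as evaluation loss) is better; it is not the `EarlyStoppingCallback` source, and the function name and loss values are made up.

```python
def early_stopping_step(eval_values, patience=3, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for index, value in enumerate(eval_values):
        if best is None or value < best - threshold:
            best = value                      # new best: reset the counter
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1    # no improvement at this evaluation
        if evals_without_improvement >= patience:
            return index
    return None

toy_eval_losses = [3.0, 2.5, 2.6, 2.7, 2.55, 2.4]  # hypothetical values
stop_at = early_stopping_step(toy_eval_losses, patience=3)
print(stop_at)
```

With a patience of 3, training halts after three consecutive evaluations that fail to beat the best value seen so far, so the later improvement in the toy list is never reached.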

In [8]:
# START: COPIED FROM <https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG- >
ALLOW_NEW_LINES = False     # seems to work better <--- from source
LEARNING_RATE = 1.372e-4
EPOCHS = 4
seed = random.randint(0,2**32-1)
# END: COPIED FROM <https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG- >

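# Load the metric once rather than on every evaluation call.
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_preds):
    """Turn logits into class predictions and score them against the labels."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)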



training_args = TrainingArguments(
    report_to="wandb", #Remove if not using wandb 
    output_dir="./model_files", # change this to a new location if a /model_files folder already exists; it will be overwritten otherwise
    overwrite_output_dir=True,
    do_train=True,
    evaluation_strategy = 'steps',# num_train_epochs=1, #new
    eval_steps = 5000, #
    per_device_train_batch_size=1,
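    prediction_loss_only=True, # evaluation returns only the loss; set to False if compute_metrics should receive logits and labels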
    logging_steps=5,
    save_steps=0,
    seed=seed,
    learning_rate = LEARNING_RATE,
    metric_for_best_model = 'f1',#new
    load_best_model_at_end = True, #new
    )

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset = evaluation_dataset,
    compute_metrics=compute_metrics, #new
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]) #new

# START: COPIED FROM <https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG- >

# Replace the default scheduler with a cosine learning-rate schedule (no warmup)
train_dataloader = trainer.get_train_dataloader()
num_train_steps = len(train_dataloader)
trainer.create_optimizer_and_scheduler(num_train_steps)
trainer.lr_scheduler = get_cosine_schedule_with_warmup(
    trainer.optimizer,
    num_warmup_steps=0,
    num_training_steps=num_train_steps)
# END: COPIED FROM <https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG- >



PyTorch: setting up devices
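
To show the shape of the learning-rate curve set up above, here is a minimal sketch of a cosine schedule with optional linear warmup. It follows the half-cosine formula that, to my understanding, `get_cosine_schedule_with_warmup` uses by default, but it is a standalone approximation and not the library code; `cosine_lr` and the step count of 1000 are hypothetical.

```python
import math

def cosine_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then half-cosine decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

LEARNING_RATE = 1.372e-4
# Sample the schedule at the start, midpoint, and end of a 1000-step run.
schedule = [cosine_lr(s, LEARNING_RATE, num_training_steps=1000) for s in (0, 500, 1000)]
print(schedule)
```

With `num_warmup_steps=0`, the rate starts at its full value, reaches half at the midpoint, and decays to zero at `num_training_steps`.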


In [None]:
#Train new model
trainer.train()
wandb.finish() #Exit wandb recording.

In [9]:
#configure model task
trainer.model.config.task_specific_params['text-generation'] = {
    'do_sample': True,
    'min_length': 15,
    'max_length': 100,
    'temperature': 100, # very high; the generation cells later in this notebook use 1
    'top_p': 0.95,
    'prefix': '<|endoftext|>'}
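
The effect of the `temperature` value can be seen in a small sketch that divides logits by the temperature before applying a softmax. This is a generic illustration of temperature scaling, not the model's generation code; the logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]                     # hypothetical scores for three tokens
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
plain = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=100.0)
print(sharp, plain, flat)
```

Low temperatures concentrate probability on the top token, while a value as large as 100 makes the three tokens almost equally likely, which is why sampling at that setting tends to produce near-random text.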

In [10]:
#Save Model:
trainer.save_model()

Saving model checkpoint to ./model_files
Configuration saved in ./model_files/config.json
Model weights saved in ./model_files/pytorch_model.bin
tokenizer config file saved in ./model_files/tokenizer_config.json
Special tokens file saved in ./model_files/special_tokens_map.json


In [11]:
#Example to view training history if wandb not used
a = trainer.state.log_history
print(a[0])

{'loss': 2.9403, 'learning_rate': 0.0001371792092297936, 'epoch': 0.01, 'step': 5}
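
Since each entry of the log history is a plain dictionary like the one printed above, a loss curve can be pulled out with a short helper. This is a sketch under the assumption that training entries carry a `loss` key and evaluation entries carry an `eval_loss` key; only the first entry below comes from the output above, the other two are hypothetical.

```python
def metric_curve(log_history, key="loss"):
    """Collect (step, value) pairs for every log entry that contains the key."""
    return [(entry["step"], entry[key]) for entry in log_history if key in entry]

example_history = [
    {'loss': 2.9403, 'learning_rate': 0.0001371792092297936, 'epoch': 0.01, 'step': 5},
    {'loss': 2.81, 'learning_rate': 0.000137, 'epoch': 0.02, 'step': 10},   # hypothetical
    {'eval_loss': 2.75, 'epoch': 0.02, 'step': 10},                         # hypothetical
]

train_curve = metric_curve(example_history, key="loss")
eval_curve = metric_curve(example_history, key="eval_loss")
print(train_curve)
print(eval_curve)
```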


# USE EXISTING MODEL
Load a trained model and its tokenizer. In the first two lines, point the tokenizer and model to the folder containing the fine-tuned model.

In [None]:
###########################################################################################################
#To load models
tokenizer = AutoTokenizer.from_pretrained('./model_files_no_special_tokens') # change to match as needed
model = AutoModelForCausalLM.from_pretrained('./model_files_no_special_tokens') # change to match as needed
###########################################################################################################


ALLOW_NEW_LINES = False     
LEARNING_RATE = 1.372e-4
seed = random.randint(0,2**32-1)
training_args = TrainingArguments(
    report_to="wandb",
    output_dir="./model_files2",
    overwrite_output_dir=True,
    do_train=True,
    evaluation_strategy = IntervalStrategy.STEPS, # num_train_epochs=1, #new
    eval_steps = 1000, #new
    num_train_epochs=1,
    save_total_limit = 10, 
    per_device_train_batch_size=1,
    prediction_loss_only=True,
    logging_steps=5,
    save_steps=0,
    seed=seed,
    learning_rate = LEARNING_RATE,
    metric_for_best_model = None,#new
    load_best_model_at_end = True)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    compute_metrics=None) #new
  
trainer.model.config.task_specific_params['text-generation'] = {
    'do_sample': True,
    'min_length': 10,
    'max_length': 160,
    'temperature': 1.,
    'top_p': 0.95,
    'prefix': '<|endoftext|>'}


# PREDICT
Generate predictions. We can change our control parameters as well as the decoding methods.
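
To make the sampling options concrete, here is a minimal sketch of top-k filtering followed by nucleus (top-p) filtering on a toy next-token distribution. It assumes the two filters are applied in sequence with renormalization in between; it is an illustration of the technique and not the `generate` implementation, and the words and probabilities are made up.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p mass."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    top_k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / top_k_total            # mass within the top-k set
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}

toy_probs = {'the': 0.5, 'a': 0.3, 'rocket': 0.15, 'zebra': 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)

rng = random.Random(0)                           # fixed seed so reruns agree
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=5)
print(filtered)
print(sampled)
```

Unlikely tokens are removed before sampling, so the output stays varied without drifting into the low-probability tail. Beam search, by contrast, keeps the highest-scoring sequences and is deterministic unless combined with `do_sample`.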

In [20]:
########################################################################
# Control parameters. 
tag = "<|elonmusk|>" #options are <|sensander|>, <|paddingtonbear|>, <|hankgreen|>, <|joerogan|>, <|elonmusk|>
polite_tag = "<|neutral|>" #options are <|polite|>, <|neutral|>, <|impolite|>,
topic1 = '<|spacex|>' # topic choice 1
topic2 = '<|failure|>' # topic choice 2. To use one topic, set to <|undefined|>
#########################################################################

"""
Notes:
Temperature: trade-off between variety and how clearly the politeness tag comes through.
Beam search vs top-k/top-p: beam search is a lot more coherent, at the cost of variety.

Options:

Naive beam search: num_beams = 10, all else off
Top-k with nucleus sampling: top_p = 0.95, top_k = 10-20, do_sample = True
Beam-search multinomial sampling: num_beams = 10 + do_sample = True
Diverse beam-search decoding: num_beams = 10 + num_beam_groups = 2
"""
start = ""
predictions = []
start_with_bos = '<|endoftext|>'+tag+polite_tag+topic1+topic2 + start
encoded_prompt = trainer.tokenizer(start_with_bos, add_special_tokens=False, return_tensors="pt").input_ids
encoded_prompt = encoded_prompt.to(trainer.model.device)


output_sequences = trainer.model.generate(
###################################################################################  
# Change the model's decoding strategy here by commenting/uncommenting one of the blocks below.
# BEAM Naive (uncomment below to use)
    # num_beams=10, #on or off

# TOP-K + Nucleus (currently active; comment out to use another strategy)
    do_sample=True, # for multinomial beam search and top sampling
    top_p = 0.95, #0.95
    top_k = 50, #10-20

# MULTINOMIAL BEAM SEARCH (uncomment below to use)
    # num_beams=10, #on or off
    # do_sample=True, # for multinomial beam search and top sampling

# DIVERSE BEAM SEARCH (uncomment below to use)
    # num_beams=10, #on or off  
    # num_beam_groups = 2, # on or off; num_beams must be divisible by num_beam_groups
#####################################################################################
    
    
    num_return_sequences= 10, # must be <= num_beams when beam search is used
    input_ids=encoded_prompt,
    max_length=160, #originally 160
    min_length=10, #originally 10
    temperature = 1, #originally 1
    no_repeat_ngram_size=2,   
    
    )
# START: COPIED FROM <https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG- >

generated_sequences = []

# decode prediction
for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    generated_sequence = generated_sequence.tolist()
    text = trainer.tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=True)
    if not ALLOW_NEW_LINES:
        limit = text.find('\n')
        text = text[: limit if limit != -1 else None]
    generated_sequences.append(text.strip())

for i, g in enumerate(generated_sequences):
    predictions.append([start, g])
# END: COPIED FROM <https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG- >


for pair in predictions:
  print(pair[1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  "Passing `max_length` to BeamSearchScorer is deprecated and has no effect. "


Mr. Brown, I'm very pleased that you would like me to send you a reminder about my new film #Paddington2. I think I'll get a megaphone like Mr Curry's to remind everyone.
Mr. Brown, I'm very pleased that you would like me to send you a reminder about my new film #Paddington2. I suggest preparing some marmalade sandwiches for an extra special elevenses!
Mr. Brown, I'm very pleased that you would like me to send you a reminder about my new film #Paddington2. I suggest preparing some marmalade sandwiches for an extra special elevenses!
Mr. Brown, I'm very pleased that you would like me to send you a reminder about my new film #Paddington2. I think I'll get a megaphone like Mr Curry's to remind everyone.
Mr. Brown, I'm very pleased that you would like me to send you a reminder about my new film #Paddington2. I think I'll get a meg megaphone like Mr Curry's to remind everyone.
Mr. Brown, I’m very pleased that you would like me to send you a reminder about my new film #Paddington2. I think I

# SAVE AS CSV
If we like our set of generations, we can save it to a CSV file.

In [None]:
# Note which decoding algorithm you used below before running
decoder = "diverse" # beam, top (top_k + top_p), multinomial (num_beams + do_sample), diverse (num_beams + num_beam_groups)


import pandas as pd
username = tag #twitter user
type_tweet = polite_tag #polite, impolite, neutral
topics = topic1+topic2


df = pd.DataFrame(columns = ["Target","Prompt","Tweets","Type"])
target_col = [username]*len(predictions) # target col
type_col = [type_tweet]*len(predictions) # type col
prompt_col = []
tweets_col = []

for pair in predictions:
    prompt_col.append(pair[0])
    tweets_col.append(pair[1])

df["Target"] = target_col
df["Prompt"] = prompt_col
df["Tweets"] = tweets_col
df["Type"] = type_col
df.reset_index()
# print(df)

os.makedirs('responses', exist_ok=True) # make sure the output folder exists
df.to_csv('responses/{}_{}_{}.csv'.format(username,type_tweet,decoder), index=False)
#check writing:

df_test = pd.read_csv('responses/{}_{}_{}.csv'.format(username,type_tweet,decoder))
# print(df_test)
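
One practical caveat: the tags contain `<`, `|`, and `>`, which some filesystems (Windows in particular) do not allow in file names. A minimal sketch of one way to build a safer path is shown below; the helpers `safe_filename_part` and `response_path` are hypothetical and not part of the notebook.

```python
import re

def safe_filename_part(tag):
    """Drop every character that is not a letter, digit, '-' or '_'."""
    return re.sub(r'[^A-Za-z0-9_-]', '', tag)

def response_path(username, type_tweet, decoder, folder='responses'):
    """Build the CSV path from cleaned-up tag names."""
    parts = [safe_filename_part(p) for p in (username, type_tweet, decoder)]
    return '{}/{}.csv'.format(folder, '_'.join(parts))

path = response_path('<|elonmusk|>', '<|neutral|>', 'diverse')
print(path)
```

The resulting string could be passed to `df.to_csv` and `pd.read_csv` in place of the formatted path used above.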

