This notebook develops the code that will produce LLM-generated data augments for the RETIPS data, using a variety of LLMs.

The data has been split (in another notebook) into four train/test folds, so that augments produced from real RETIPS examples can be evaluated on a test set that does not include those examples. For each train/test fold, each LLM will be used to create a large number of augments.

Once the code has been developed here, it should be moved into a .py script so that it can be run on the Cluster without having a Jupyter session open.

In [2]:
# Load libraries
from utils.load_llm_model import prepare_to_load_model, load_model
from utils.make_prompt import make_prompt
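# NOTE: this import only puts PromptTemplate in the notebook's namespace. If
# utils.make_prompt uses PromptTemplate (the likely source of the NameError at
# the end of this notebook), that module needs its own import.
from langchain import PromptTemplate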
import os
import pandas as pd

In [2]:
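# These will all need to be command-line arguments in the final script
train_idx = 1                             # which train/test fold to generate augments for
model_id = 'nomic-ai/gpt4all-13b-snoozy'  # Hugging Face model ID
username = 'cehrett'                      # passed to prepare_to_load_model
temperature = 1
top_p = 0.95
max_new_tokens = 60
max_length = 2048
num_beams = 5
num_beam_groups = 5
repetition_penalty = 1
num_return_sequences = 5
# NOTE: if load_model forwards these unchanged to Hugging Face generate(),
# do_sample=True together with num_beam_groups > 1 may be rejected, since
# group beam search does not sample. Worth checking before long runs.
do_sample = True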
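
As a rough sketch of how the settings above might become command-line arguments in the final script, the block below uses `argparse` from the standard library. The flag names and the `parse_args` and `str_to_bool` helpers are assumptions made for illustration; the defaults simply mirror the values in this cell.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse a command-line string such as 'true' or '0' into a bool."""
    lowered = value.lower()
    if lowered in {'true', '1', 'yes'}:
        return True
    if lowered in {'false', '0', 'no'}:
        return False
    raise argparse.ArgumentTypeError(f'Expected a boolean, got {value!r}')


def parse_args(argv=None):
    """Hypothetical CLI for the augmentation script; defaults mirror the notebook."""
    parser = argparse.ArgumentParser(
        description='Generate LLM augments for one train/test fold.')
    parser.add_argument('--train_idx', type=int, default=1)
    parser.add_argument('--model_id', type=str, default='nomic-ai/gpt4all-13b-snoozy')
    parser.add_argument('--username', type=str, required=True)
    parser.add_argument('--temperature', type=float, default=1.0)
    parser.add_argument('--top_p', type=float, default=0.95)
    parser.add_argument('--max_new_tokens', type=int, default=60)
    parser.add_argument('--max_length', type=int, default=2048)
    parser.add_argument('--num_beams', type=int, default=5)
    parser.add_argument('--num_beam_groups', type=int, default=5)
    parser.add_argument('--repetition_penalty', type=float, default=1.0)
    parser.add_argument('--num_return_sequences', type=int, default=5)
    parser.add_argument('--do_sample', type=str_to_bool, default=True)
    # argv=None makes argparse read sys.argv; a list can be passed for testing.
    return parser.parse_args(argv)


# Example: override the fold index and keep every other default.
args = parse_args(['--username', 'cehrett', '--train_idx', '2'])
print(args.train_idx, args.model_id, args.do_sample)
```

Passing an explicit list to `parse_args` keeps the function usable from a notebook as well as from the command line.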

In [3]:
# load training data
train_loc = os.path.join('data','stratified_data_splits',str(train_idx),'train.csv')
train = pd.read_csv(train_loc)

In [4]:
# Set up our API keys and cache directory so that models can be loaded from Hugging Face

prepare_to_load_model(username)

Okay, using /scratch/cehrett/hf_cache for huggingface cache. Models will be stored there.
Huggingface API key loaded.


In [5]:
model = load_model(model_id, 
                   temperature=temperature, 
                   top_p=top_p, 
                   max_length=max_length,
                   max_new_tokens=max_new_tokens, 
                   num_beams=num_beams,
                   num_beam_groups=num_beam_groups,
                   repetition_penalty=repetition_penalty,
                   num_return_sequences=num_return_sequences,
                   do_sample=do_sample
                  )

HF cache: /scratch/cehrett/hf_cache/
Loading tokenizer
Loading model





Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/cehrett/.conda/envs/llms_env/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/cehrett/.conda/envs/llms_env/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

Instantiating pipeline


In [6]:
# Create data in a loop

# Initialize some variables that will help us run the loop smoothly
categories = ['Resource-related', 'Non-resource-related']
resource_labels = {'Resource-related': 1, 'Non-resource-related': 0}
count = 0

# NOTE: this loop has no exit condition; it runs until the kernel (or job) is interrupted
while True:
    
    # Get new augments
    new_prompts_and_augs = make_prompt(n_in=5, data=train)
    for category in categories:
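        prompt = new_prompts_and_augs[category]['prompt']
        raw_new_augs = model(prompt)
        # Each output begins with the prompt, so drop that prefix, keep only the
        # text before a '\n2.' marker (in case the model starts a second numbered
        # item), and strip surrounding whitespace and quote characters.
        new_augs = [aug['generated_text'][len(prompt):].split('\n2.')[0].strip('\n\'" \t')
                    for aug in raw_new_augs]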
        new_prompts_and_augs[category]['augments'] = new_augs
    
    # Get path in which augments are stored
    augs_path = os.path.join('data','stratified_data_splits',str(train_idx),'augments.csv')
    
    # Load existing augments
    try:
        augs_df = pd.read_csv(augs_path)
    except FileNotFoundError: # If these are the first augs
        augs_df = pd.DataFrame()
    
    # Prepare new augments in df with other relevant metadata
    for category in categories:
        new_prompts_and_augs[category]['new_augs_df'] = pd.DataFrame({
            'text': new_prompts_and_augs[category]['augments'],
            'model_id': model_id,
            'temperature': temperature,
            'top_p': top_p,
            'max_new_tokens': max_new_tokens,
            'max_length': max_length,
            'num_beams': num_beams,
            'num_beam_groups': num_beam_groups,
            'repetition_penalty': repetition_penalty,
            'do_sample': do_sample,
            'num_return_sequences': num_return_sequences,
            'prompt': new_prompts_and_augs[category]['prompt'],
            'resources': resource_labels[category],
        })
        
    
    # Add new augments df to the existing augments
    combined_augs_df = pd.concat([augs_df,
                                 new_prompts_and_augs['Resource-related']['new_augs_df'],
                                 new_prompts_and_augs['Non-resource-related']['new_augs_df']])
    combined_augs_df.reset_index(drop=True, inplace=True)
                                   
    # Save the newly increased set of augments
    combined_augs_df.to_csv(augs_path, index=False)
                                                                     
    # Output an update
    count += 1
    print(f'Total new augments created so far for each category: {count * num_return_sequences}')
    

NameError: name 'PromptTemplate' is not defined
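
The post-processing step in the loop above (drop the leading prompt, keep only the text before a `'\n2.'` marker, strip whitespace and quotes) could be pulled into a small function so that it can be checked without loading a model. This is a minimal sketch: `extract_augment` is a hypothetical name, the example strings are made up and are not RETIPS data, and the slicing assumes, as the notebook code does, that the generated text begins with the prompt.

```python
def extract_augment(generated_text: str, prompt: str) -> str:
    """Return the first generated item from an output that starts with the prompt."""
    # Drop the prompt prefix (assumes generated_text starts with prompt).
    continuation = generated_text[len(prompt):]
    # Keep only the text before a second numbered item, if there is one.
    first_item = continuation.split('\n2.')[0]
    # Strip surrounding newlines, quotes, spaces, and tabs.
    return first_item.strip('\n\'" \t')


# Made-up example strings, for illustration only.
example_prompt = 'Write a new example.\n1.'
example_output = example_prompt + ' "The form was easy to fill out."\n2. "Another one"'
print(extract_augment(example_output, example_prompt))
```

With a helper like this, the list comprehension in the loop would reduce to one call per element of `raw_new_augs`.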