# DEBATE: Few-Shot Training

This tutorial demonstrates how to train the Political DEBATE models in a few-shot setting (<= 100 examples). We recommend using the large model for few-shot training. If you're not already familiar with how to use the models for classification, the [zero-shot tutorial](https://colab.research.google.com/drive/1zi-8pMx_x-vo0m8XVmYVtw0Dhdw7vfFD#scrollTo=2c49c53a) contains explanations and code.

Most of the code here is boilerplate that can be copied and used without changes. Simply substitute the example data with your own and train the model.

# Libraries

In [None]:
!pip install datasets


In [None]:
from datasets import Dataset, DatasetDict # The datasets library allows us to import the data directly from the huggingface hub and puts it in an efficient format for training
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer # Transformers has the classes and functions for training the model
import torch # Transformers is built on top of the pytorch library. This also allows us to interact with the GPU and check its availability.

# numpy, pandas, and sklearn are used for data manipulation and performance metrics.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
import pandas as pd

# Organizing your Data

When doing few-shot training, we want to train the model for entailment classification. *If you train the model as a standard classifier rather than for entailment, your model will likely perform very poorly.*

Entailment works by classifying sentence pairs that consist of your original document, and a reference statement known as the "hypothesis." The model is trained to determine if the hypothesis is true, given the content of your original document. Doing this is mostly a matter of how we format our data.

Suppose we have the following dataframe with binary topic labels:

In [None]:
# Example documents
texts = [
    "The new AI model shows promising results in medical diagnosis",
    "Scientists discover ancient ruins in the Amazon rainforest",
    "Latest smartphone features advanced privacy protection",
    "Research suggests meditation improves mental health",
    "Electric vehicles set new sales records worldwide"
]

# Binary labels for whether or not a text is about a topic
data = {
    'text': texts,
    'technology': [1, 0, 1, 0, 1],
    'science': [1, 1, 0, 1, 0],
    'health': [1, 0, 0, 1, 0],
    'environment': [0, 1, 0, 0, 1],
    'privacy': [0, 0, 1, 0, 0]
}

# Create DataFrame
df = pd.DataFrame(data)

df

| | text | technology | science | health | environment | privacy |
|---|---|---|---|---|---|---|
| 0 | The new AI model shows promising results in me... | 1 | 1 | 1 | 0 | 0 |
| 1 | Scientists discover ancient ruins in the Amazo... | 0 | 1 | 0 | 1 | 0 |
| 2 | Latest smartphone features advanced privacy pr... | 1 | 0 | 0 | 0 | 1 |
| 3 | Research suggests meditation improves mental h... | 0 | 1 | 1 | 0 | 0 |
| 4 | Electric vehicles set new sales records worldwide | 1 | 0 | 0 | 1 | 0 |


To train the model, we need to pair each of our documents with a hypothesis statement for each topic category. We can do this with a few simple dataframe manipulations. First, we'll melt the dataframe to convert it to a long format.

In [None]:
# Melt the dataframe so that there is a row for each document-topic label
df = pd.melt(
    df,
    id_vars=['text'],
    value_vars=['technology', 'science', 'health', 'environment', 'privacy'],
    var_name='topic',
    value_name='label'
)
df.head(10)

| | text | topic | label |
|---|---|---|---|
| 0 | The new AI model shows promising results in me... | technology | 1 |
| 1 | Scientists discover ancient ruins in the Amazo... | technology | 0 |
| 2 | Latest smartphone features advanced privacy pr... | technology | 1 |
| 3 | Research suggests meditation improves mental h... | technology | 0 |
| 4 | Electric vehicles set new sales records worldwide | technology | 1 |
| 5 | The new AI model shows promising results in me... | science | 1 |
| 6 | Scientists discover ancient ruins in the Amazo... | science | 1 |
| 7 | Latest smartphone features advanced privacy pr... | science | 0 |
| 8 | Research suggests meditation improves mental h... | science | 1 |
| 9 | Electric vehicles set new sales records worldwide | science | 0 |


Now we can just replace the topic labels with our hypotheses, and that's it! Our data is now formatted to train for entailment classification.

In [None]:
# Replace topic names with entailment hypotheses
# Assigning the result back (instead of inplace=True on a column) avoids the pandas chained-assignment warning
df['topic'] = df['topic'].replace({'technology': 'This headline is about technology.',
                                   'science': 'This headline is about science.',
                                   'health': 'This headline is about health.',
                                   'environment': 'This headline is about the environment.',
                                   'privacy': 'This headline is about privacy.'})

df = df.rename(columns={'topic': 'hypothesis'})

# Recode labels so that 0 = entailment and 1 = not entailment, following the convention for entailment labels
df['label'] = df['label'].replace({0: 1, 1: 0})
# This is what your training data should look like when properly formatted.
df.head(10)


| | text | hypothesis | label |
|---|---|---|---|
| 0 | The new AI model shows promising results in me... | This headline is about technology. | 0 |
| 1 | Scientists discover ancient ruins in the Amazo... | This headline is about technology. | 1 |
| 2 | Latest smartphone features advanced privacy pr... | This headline is about technology. | 0 |
| 3 | Research suggests meditation improves mental h... | This headline is about technology. | 1 |
| 4 | Electric vehicles set new sales records worldwide | This headline is about technology. | 0 |
| 5 | The new AI model shows promising results in me... | This headline is about science. | 0 |
| 6 | Scientists discover ancient ruins in the Amazo... | This headline is about science. | 0 |
| 7 | Latest smartphone features advanced privacy pr... | This headline is about science. | 1 |
| 8 | Research suggests meditation improves mental h... | This headline is about science. | 0 |
| 9 | Electric vehicles set new sales records worldwide | This headline is about science. | 1 |
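
If you need to repeat this reshaping for several datasets, the melt, hypothesis mapping, and label recoding steps can be wrapped in one function. The following is only a sketch: the function name to_entailment_format and the two-row example frame are hypothetical and not part of the tutorial's code, and it assumes your labels are 0/1 columns as in the example above.

```python
import pandas as pd

def to_entailment_format(df, text_col, hypotheses):
    """Reshape a wide frame of 0/1 topic columns into (text, hypothesis, label) rows.

    `hypotheses` maps each label column name to its hypothesis sentence.
    """
    long = pd.melt(
        df,
        id_vars=[text_col],
        value_vars=list(hypotheses),
        var_name="hypothesis",
        value_name="label",
    )
    # Swap each column name for its hypothesis sentence
    long["hypothesis"] = long["hypothesis"].map(hypotheses)
    # Convention used in this tutorial: 0 = entailment, 1 = not entailment
    long["label"] = 1 - long["label"].astype(int)
    return long

# Hypothetical two-document example
example = pd.DataFrame({
    "text": [
        "Latest smartphone features advanced privacy protection",
        "Research suggests meditation improves mental health",
    ],
    "technology": [1, 0],
    "health": [0, 1],
})
hypotheses = {
    "technology": "This headline is about technology.",
    "health": "This headline is about health.",
}

pairs = to_entailment_format(example, "text", hypotheses)
print(pairs)
```

Keeping the hypotheses in a single dictionary makes it harder for a column to end up paired with the wrong sentence.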


# Few Shot Training

We have three different components for this task:

1.   The dataset
2.   The tokenizer
3.   The model

The dataset contains our training and testing data. The tokenizer will convert the dataset into numeric representations of the tokens that will be passed to the model during training. The tokenizer doesn't need to be trained, and is for preparing the dataset to be passed to the model.

In this block we define variables that will later be passed to other functions. First we define the model: modname is the name of the model on the Hugging Face Hub.

In [None]:
modname = "mlburnham/Political_DEBATE_large_v1.0"
training_directory ='few_shot' # this is where the trained model will be saved. You can rename it to anything.

# Use a GPU if one is available, else the CPU. You will want GPU access for training.
# On an Apple-silicon Mac, check torch.backends.mps.is_available() instead and set the device to 'mps' (make sure the 'accelerate' library is installed).
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

Device: cuda


## Preparing and Tokenizing the Data

Now we import our training data. For this example I'll use a CSV of tweets labeled for stance towards President Trump (1 = support, 0 = not support). Usually you won't have much validation data for few-shot training, but we'll use the full validation set for illustrative purposes.

In [None]:
# Import example train and test data as pandas dataframes.
train = pd.read_csv('https://raw.githubusercontent.com/MLBurnham/stance_detection_tutorials/main/data/train.csv')
validate = pd.read_csv('https://raw.githubusercontent.com/MLBurnham/stance_detection_tutorials/main/data/test.csv')
# This is what our data looks like
train.head()

| | text | labels |
|---|---|---|
| 0 | everyone’s favorite candidate did great and th... | 0.0 |
| 1 | reading the headline about the #dnc2020 and it... | 0.0 |
| 2 | @USER @USER @USER @USER i hope president trump... | 1.0 |
| 3 | if donald trump was innocent regarding e. jean... | 0.0 |
| 4 | My presidential vote goes to whoever provides ... | 0.0 |


Now we need to reformat the dataset for entailment classification as shown above. Since we are only classifying stance towards President Trump, we can just add a column with a single hypothesis.

In [None]:
train['hypothesis'] = 'The author of this tweet supports Trump.' # add entailment hypothesis
train['labels'] = train['labels'].replace({0: 1, 1: 0}) # recode labels so that 0 = entail and 1 = not entail
validate['hypothesis'] = 'The author of this tweet supports Trump.' # add entailment hypothesis
validate['labels'] = validate['labels'].replace({0: 1, 1: 0}) # same recoding for the validation set
# create a smaller set for few-shot training
fs_train = train[0:25]

fs_train.head()


| | text | labels | hypothesis |
|---|---|---|---|
| 0 | everyone’s favorite candidate did great and th... | 1.0 | The author of this tweet supports Trump. |
| 1 | reading the headline about the #dnc2020 and it... | 1.0 | The author of this tweet supports Trump. |
| 2 | @USER @USER @USER @USER i hope president trump... | 0.0 | The author of this tweet supports Trump. |
| 3 | if donald trump was innocent regarding e. jean... | 1.0 | The author of this tweet supports Trump. |
| 4 | My presidential vote goes to whoever provides ... | 1.0 | The author of this tweet supports Trump. |
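
The cell above simply takes the first 25 rows. If your labels are imbalanced, you may prefer to draw a fixed number of examples per class instead. The sketch below is one way to do that with pandas; the helper name sample_few_shot and the toy data are hypothetical, and this is an alternative to, not a description of, what the tutorial does.

```python
import pandas as pd

def sample_few_shot(df, label_col, n_per_class, seed=1):
    """Draw up to n_per_class rows from each label value, then shuffle the result."""
    parts = [
        group.sample(n=min(n_per_class, len(group)), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are interleaved, and reset the index for a clean frame
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data standing in for the tweet dataset (hypothetical values)
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(20)],
    "labels": [0] * 5 + [1] * 15,
    "hypothesis": "The author of this tweet supports Trump.",
})

fs_toy = sample_few_shot(toy, "labels", n_per_class=4)
print(fs_toy["labels"].value_counts())
```

With the real data you would call something like sample_few_shot(train, 'labels', n_per_class=12) in place of train[0:25].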


In [None]:
# convert the data to a huggingface dataset for ease of use
val_ds = Dataset.from_pandas(validate)
fs_ds = Dataset.from_pandas(fs_train)
ds = DatasetDict()

ds['few_shot'] = fs_ds
ds['validate'] = val_ds

Finally, we use the tokenizer to convert the data to numeric vectors that are ready to pass to the model. Notably, our tokenize function passes both the text and hypothesis to the tokenizer as a pair. If you just pass the text as you would with other classification approaches, you will train the model on the wrong data.

In [None]:
# import the tokenizer using the modname variable we defined above
tokenizer = AutoTokenizer.from_pretrained(modname)

# Define a generic tokenizing function.
# Padding adds empty tokens to the end of documents so that all documents are the same length. This is generally required for passing documents through the model.
# Truncation cuts off any portion of the document longer than the model's maximum accepted length.
def tokenize_function(docs):
    return tokenizer(docs['text'], docs['hypothesis'], padding = 'max_length', truncation = True)

# Now we tokenize the dataset by applying padding, truncation, and converting each document to a tensor of numbers.
dstok = ds.map(tokenize_function)


## Preparing the Model and Trainer

Here we import the model. Change the num_labels variable to match the number of classes in your dataset. If the number of labels differs from 2, you must set ignore_mismatched_sizes=True. This tells the model to replace the classifier head of the neural network with a new one that has the appropriate number of labels.

id2label is a dictionary that makes sure the model associates each number with the right label. Because we recoded the labels above, 0 is 'entailment' (the author supports Trump) and 1 is 'not_entailment' (the author does not support Trump). Change this to whatever is appropriate for how many labels you have and what they represent.

If you are training a model for general entailment classification, you might use {0:'entail', 1:'not entail'}.
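
To make the role of id2label concrete, here is a small sketch of how class indices map back to label names once the model produces logits. The helper logits_to_labels and the logit values are made up for illustration; only the id2label dictionary mirrors the one used in this tutorial.

```python
import numpy as np

# Same mapping as the tutorial; label2id is simply its inverse
id2label = {0: "entailment", 1: "not_entailment"}
label2id = {name: idx for idx, name in id2label.items()}

def logits_to_labels(logits, id2label):
    """Convert a (n_docs, n_classes) array of logits to label names and probabilities."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax over the class axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    pred_ids = probs.argmax(axis=1)
    return [id2label[int(i)] for i in pred_ids], probs

# Made-up logits for two documents
fake_logits = [[2.0, -1.0], [-0.5, 1.5]]
names, probs = logits_to_labels(fake_logits, id2label)
print(names)
```

If the dictionary order were flipped, the same logits would be reported under the opposite label names, which is why it has to match the way the training labels were coded.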

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(modname, num_labels = 2, ignore_mismatched_sizes=True, id2label = {0:'entailment', 1:'not_entailment'})


This function will be used to calculate performance metrics during training. You can pass a different custom function, but this is a good default set of metrics that you can just copy into your code.


In [None]:
# this function will be used to calculate performance metrics during training. You can pass a different custom function, but this is a good default set of metrics
def compute_metrics(eval_pred, label_text_alphabetical=list(model.config.id2label.values())):
    # Extract labels
    labels = eval_pred.label_ids
    pred_logits = eval_pred.predictions
    preds_max = np.argmax(pred_logits, axis=1)

    # Compute the metrics
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')
    precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(labels, preds_max, average='micro')
    acc_balanced = balanced_accuracy_score(labels, preds_max)
    acc_not_balanced = accuracy_score(labels, preds_max)

    # Pass computed metrics to a dictionary for printing
    metrics = {'f1_macro': f1_macro,
            'f1_micro': f1_micro,
            'accuracy_balanced': acc_balanced,
            'accuracy': acc_not_balanced,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'precision_micro': precision_micro,
            'recall_micro': recall_micro,
            }

    # Print results
    print("Aggregate metrics: ", {key: metrics[key] for key in metrics if key not in ["label_gold_raw", "label_predicted_raw"]} )
    print("Detailed metrics: ", classification_report(
        labels, preds_max, labels=np.sort(pd.factorize(label_text_alphabetical, sort=True)[0]),
        target_names=label_text_alphabetical, sample_weight=None,
        digits=2, output_dict=True, zero_division='warn'),
    "\n")

    return metrics

Below we set the training arguments for the model. You probably don't need to fiddle with these at all.

When doing few-shot training, we generally assume the number of training samples is small and that we want to maximize what the model learns from each sample. Setting the training batch size to a small number will help with this.

In [None]:
training_args = TrainingArguments(output_dir=training_directory,
    logging_dir=f'{training_directory}/logs',
    lr_scheduler_type= "linear",
    group_by_length=False,
    report_to='none', # change this if you're using a library like weights & biases to track model training
    learning_rate = 9e-6,# use this learning rate for the large model
    #learning_rate = 2e-5, # use this learning rate for the small model

    # batch size controls how many documents are passed through the model at once. Higher batch sizes train faster but demand more memory. lower the batch size if you are running out of memory
    per_device_train_batch_size = 2, # A smaller training batch size is generally better for few-shot learning. This means the model will learn more from each training example.
    per_device_eval_batch_size = 16, # This just determines how fast the model will go through documents during the evaluation phase
    gradient_accumulation_steps = 1,

    num_train_epochs=5, # number of times to pass the entire training set through the model. 3-5 is generally good for few-shot training.
    warmup_ratio=0.06,  # warmup length before learning rate scheduler kicks in
    weight_decay=0.01, # weight regularization

    fp16=False,   # set to True to train with 16-bit (half precision) mixed precision. This reduces memory use and speeds up training on supported GPUs, but can have a slight effect on performance.
    fp16_full_eval=False,

    # eval_strategy defines how often the model evaluates performance on the validation set. In a few-shot context we assume there is no validation set.
    eval_strategy="no",
    seed=1,

    # save_strategy determines how frequently a checkpoint of the model is saved. Change to 'epoch' for saving after each epoch.
    save_strategy="no",
    dataloader_num_workers = 4, # this determines how many CPU worker processes are used to load data to the model. This usually isn't very important but could offer a small speed boost.
)

In [None]:
# Initialize the trainer, passing the model, tokenizer, data, and all of the arguments set above to the trainer.
trainer = Trainer(
    model = model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dstok['few_shot'],
    eval_dataset=dstok['validate'],
    compute_metrics=lambda x: compute_metrics(x, label_text_alphabetical=list(model.config.id2label.values()))
)


## Training and Evaluation

In [None]:
# Train the model
trainer.train()




TrainOutput(global_step=65, training_loss=0.20199363415057842, metrics={'train_runtime': 66.3707, 'train_samples_per_second': 1.883, 'train_steps_per_second': 0.979, 'total_flos': 116492206848000.0, 'train_loss': 0.20199363415057842, 'epoch': 5.0})
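
The global_step of 65 in the output follows directly from the training arguments. The arithmetic below is a rough sketch that assumes a single device; the warmup step count additionally assumes the warmup ratio is rounded up to a whole number of steps, which may differ between library versions.

```python
import math

n_examples = 25      # size of the few-shot training set
batch_size = 2       # per_device_train_batch_size
grad_accum = 1       # gradient_accumulation_steps
epochs = 5           # num_train_epochs
warmup_ratio = 0.06

# The last batch of each epoch is partial (25 is not divisible by 2)
batches_per_epoch = math.ceil(n_examples / batch_size)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = updates_per_epoch * epochs

# Assumption: warmup steps are the ratio times total steps, rounded up
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(batches_per_epoch, total_steps, warmup_steps)
```

Doubling the batch size would roughly halve the number of weight updates, which is one reason small batches are preferred when there are only a few dozen examples.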

You can evaluate the model's performance using trainer.evaluate(). This will use the trained model to classify documents in the validation set.

In [None]:
trainer.evaluate()

Aggregate metrics:  {'f1_macro': 0.8496912370834784, 'f1_micro': 0.8713656387665198, 'accuracy_balanced': 0.8406704675350334, 'accuracy': 0.8713656387665198, 'precision_macro': 0.861562310440116, 'recall_macro': 0.8406704675350334, 'precision_micro': 0.8713656387665198, 'recall_micro': 0.8713656387665198}
Detailed metrics:  {'entailment': {'precision': 0.8378378378378378, 'recall': 0.7520215633423181, 'f1-score': 0.7926136363636364, 'support': 371.0}, 'not_entailment': {'precision': 0.885286783042394, 'recall': 0.9293193717277487, 'f1-score': 0.9067688378033205, 'support': 764.0}, 'accuracy': 0.8713656387665198, 'macro avg': {'precision': 0.861562310440116, 'recall': 0.8406704675350334, 'f1-score': 0.8496912370834784, 'support': 1135.0}, 'weighted avg': {'precision': 0.8697770397200237, 'recall': 0.8713656387665198, 'f1-score': 0.8694546706366924, 'support': 1135.0}} 




{'eval_loss': 1.3968520164489746,
 'eval_f1_macro': 0.8496912370834784,
 'eval_f1_micro': 0.8713656387665198,
 'eval_accuracy_balanced': 0.8406704675350334,
 'eval_accuracy': 0.8713656387665198,
 'eval_precision_macro': 0.861562310440116,
 'eval_recall_macro': 0.8406704675350334,
 'eval_precision_micro': 0.8713656387665198,
 'eval_recall_micro': 0.8713656387665198,
 'eval_runtime': 159.4478,
 'eval_samples_per_second': 7.118,
 'eval_steps_per_second': 0.445,
 'epoch': 5.0}
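
To see how the aggregate numbers relate to the per-class numbers, the sketch below recomputes accuracy and balanced accuracy from the recall and support values printed in the detailed metrics. The per-class counts of correct predictions are derived here as recall times support and do not appear in the output themselves.

```python
# Per-class values copied from the detailed metrics above
support = {"entailment": 371, "not_entailment": 764}
recall = {
    "entailment": 0.7520215633423181,
    "not_entailment": 0.9293193717277487,
}

# Correctly classified documents per class (recall * support, rounded to a whole count)
correct = {name: round(recall[name] * support[name]) for name in support}

# Accuracy weights every document equally, so the larger class dominates
accuracy = sum(correct.values()) / sum(support.values())

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = sum(recall.values()) / len(recall)

print(correct, accuracy, balanced_accuracy)
```

Because the validation set has about twice as many 'not_entailment' documents, plain accuracy sits closer to that class's recall, while balanced accuracy treats both classes equally.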

# Saving and Loading Your Model

If you didn't save the model during training, you can save a copy of it with trainer.save_model().

The model will be saved in the specified folder. To download the model you need to download the entire folder.

In [None]:
trainer.save_model('./few_shot')

You can then use something like the huggingface pipeline to load the model. To do so, simply point the pipeline to the folder containing the model and then use it as you would any other classifier.

In [None]:
from transformers import pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("zero-shot-classification", model='./few_shot', device = device, batch_size = 32)

In [None]:
test_doc = 'The Trump presidency has been a disaster!'
test_labels = ['The author of this text supports Trump', 'The author of this text opposes Trump']
pipe(test_doc, test_labels, hypothesis_template = '{}', multi_label = True)

{'sequence': 'The Trump presidency has been a disaster!',
 'labels': ['The author of this text opposes Trump',
  'The author of this text supports Trump'],
 'scores': [0.9999997019767761, 1.1872538152601919e-07]}
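
The pipeline returns labels sorted by score. With multi_label = True each hypothesis is scored independently, so the scores do not need to sum to one. If you want a single decision per document, a small helper like the one below can pick the top hypothesis and fall back to None when nothing clears a threshold. The helper name top_label and the 0.5 threshold are illustrative assumptions; the result dictionary is copied from the output above.

```python
# Output copied from the pipeline call above
result = {
    "sequence": "The Trump presidency has been a disaster!",
    "labels": [
        "The author of this text opposes Trump",
        "The author of this text supports Trump",
    ],
    "scores": [0.9999997019767761, 1.1872538152601919e-07],
}

def top_label(result, threshold=0.5):
    """Return the highest-scoring hypothesis, or None if its score is below threshold."""
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= threshold else None

print(top_label(result))
```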