# Stance Detection: Classification With In-context Learners (GPT-3)

#### This tutorial demonstrates stance detection using a few-shot approach with GPT-3. In-context learning and few-shot classification are fast-moving fields, and this is a relatively simple implementation, so consider it a starting point rather than a comprehensive guide. The guide uses the openai library to access GPT-3 through OpenAI's cloud computing services. OpenAI charges based on usage, which can be fairly expensive. However, new accounts are given a small stipend to play with, which should be enough to run through this tutorial.

OpenAI Documentation: https://beta.openai.com/docs/api-reference/introduction 

#### Requirements:
1. Basic Python skills
2. An OpenAI account and API key

In [2]:
import pandas as pd
import openai
import os
import pickle
import time
import re

In [None]:
openai.api_key = 'Insert your API key here'

#### For this example we will use the same training and test data as in Burnham (2022). The data set consists of tweets about President Trump with two sets of labels: a plain-text label that is either "support", "against", or "none", and a corresponding binary label coded 1 for support and 0 for against or none.

In [33]:
train_df = pd.read_csv('https://raw.githubusercontent.com/MLBurnham/stance_detection_tutorials/main/data/gpt3_train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/MLBurnham/stance_detection_tutorials/main/data/gpt3_test.csv')

train_df.head()

Unnamed: 0,text,stance,labels
0,I'd like #TedCruz to talk more policies that d...,none,0.0
1,"""the past"" and biden references obama years as...",none,0.0
2,Quando mi prende lo sconforto penso che qualcu...,none,0.0
3,@USER @USER think about it in ny where most a...,none,0.0
4,#GOPDebate think what #Trump2016 is saying tha...,none,0.0
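
To make the relationship between the two label columns concrete, here is a minimal sketch on a hand-made toy frame. The rows and the `toy_df` and `label_check` names are illustrative assumptions, not part of the tutorial data; on the real data you would cross-tabulate `train_df['stance']` against `train_df['labels']` instead.

```python
import pandas as pd

# Toy rows for illustration only; these are not taken from the tutorial data
toy_df = pd.DataFrame({
    'text': ['tweet a', 'tweet b', 'tweet c', 'tweet d'],
    'stance': ['support', 'against', 'none', 'support'],
})

# Binary coding described above: 1 for support, 0 for against or none
toy_df['labels'] = (toy_df['stance'] == 'support').astype(float)

# Cross-tabulate the two label columns to confirm they agree
label_check = pd.crosstab(toy_df['stance'], toy_df['labels'])
print(label_check)
```

Every "support" row should fall in the 1.0 column and every other row in the 0.0 column; anything else would point to a coding problem in the data.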


#### Classification with in-context learning works by treating classification as a next-word prediction task. We supply the model with a prompt that consists of a description of the task and/or a few labeled examples, followed by a document with a blank space where its label should be. The model then predicts which word goes into that label space.
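
As a small illustration of that layout, the sketch below builds a prompt from two hand-written labeled examples and one unlabeled document. The tweets and the `labeled`, `unlabeled`, and `demo_prompt` names are made up for this example and do not come from the data set.

```python
# Two hand-written labeled examples (hypothetical, not from the data set)
labeled = [
    ("I can't wait to vote for Trump again.", "support"),
    ("Trump's policies have been a disaster.", "against"),
]
# The document we want the model to label
unlabeled = "Trump held a rally in Ohio last night."

# Each labeled example gets a text field, a filled-in label field, and a separator
demo_prompt = ""
for text, stance in labeled:
    demo_prompt += f"Tweet: {text}\nStance: {stance}\n###\n"

# The final document follows the same pattern, but its label field is left blank
demo_prompt += f"Tweet: {unlabeled}\nStance:"
print(demo_prompt)
```

Because the prompt ends at the empty "Stance:" field, the most natural next word for the model to produce is one of the labels it has just seen.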

#### Our first task is to generate a prompt. We'll do this by getting a random sample of labeled documents and putting them in a consistent format. Below are two helper functions to expedite this process.

In [24]:
def gen_shots(df, shots, seed):
    """
    A function to construct the training examples for few-shot classification. Returns a string of examples and labels

    df: A data frame with 'text' and 'stance' columns

    shots: The number of examples to give the prompt

    seed: A random seed for sampling the data frame
    """
    # get a random sample from the df
    text_sample = df.sample(n = shots, random_state = seed)

    # format each sampled tweet and its stance, ending every example with '###'
    examples = ''
    for tweet, stance in zip(text_sample['text'], text_sample['stance']):
        examples += 'Tweet: ' + tweet + '\n' + 'Stance: ' + stance + '\n###\n'

    return examples

def gen_prompt(tweet, examples):
    """
    Generate a classification prompt for a few-shot model

    tweet: The text of the tweet to be classified

    examples: A string of training examples created by gen_shots
    """
    prompt = examples + 'Tweet: ' + tweet + '\n' + 'Stance:'

    return prompt

#### Using these functions, let's first generate a prompt template by combining and formatting 10 labeled examples. In practice, you will likely want more than 10 examples.

In [29]:
# Generate formatted and labeled examples for the prompt
examples = gen_shots(train_df, shots = 10, seed = 1)

#### Below is what our template looks like. Documents are preceded by "Tweet:" and stances are preceded by "Stance:". Each document-stance pair is followed by "###" and a new line.

In [31]:
print(examples)

Tweet: @USER @USER where does "doing 'the wave' with castro" rate?
potus obama took the fam to cuba and partied with raúl, fidel's brother &amp; successor; what does one think happened to people castro didn't want obama &amp; the world to see? #demdebate HTTP
HTTP
Stance: none
###
Tweet: Remember when the national media were busy slamming @HillaryClinton for emails while giving #DonaldTrump a pass for everything? #HillaryMen
Stance: none
###
Tweet: Oh come on @BBCr4today. Reason for giving Lawson unchallenged airtime on climate: it's what Trump thinks too.
Stance: against
###
Tweet: @USER @USER @USER hmmm so why not even one conviction? are you incompetent, lazy or both or just blowing smoke? meanwhile, you’re taking my property without compensation: america’s hostages: #fanniegate: the long gse wail continues... HTTP
Stance: none
###
Tweet: @mattyglesias There was no reason to think they wouldn't. Whereas Trump is a disaster every day so it's continually shocking people still like him

#### We then prepend these examples to each tweet we want to classify and leave the final "Stance:" field blank. This produces one prompt per document to be classified. The gen_prompt() helper function defined above does this.

In [32]:
prompts = [gen_prompt(tweet, examples) for tweet in test_df['text']]

#### To classify the data, we then pass each prompt to GPT-3 through the OpenAI API. The code below is currently set to use the Curie model, which is a smaller and cheaper version of GPT-3. For accurate stance detection, however, you will need the Davinci model, which is 10x more expensive.

In [None]:
%%time
labels = [] # Create an empty list to hold the results
for prompt in prompts:
    # openai.Completion is the legacy (pre-1.0) interface of the openai library
    res = openai.Completion.create(
        #model="text-davinci-002",
        model="text-curie-001",
        prompt=prompt,
        max_tokens=1, # Increase this if your label is multiple words or a multi-token word.
        temperature=0, # Temperature controls the random variation in results.
        #logprobs=1, # logprobs determines whether log probabilities for likely words are returned in addition to the label.
        echo=False # Do not repeat the prompt in the returned text.
    )
    labels.append(res['choices'][0]['text'].strip()) # Append the most likely label, without surrounding whitespace, to the labels list
    time.sleep(6) # The API is rate limited, so a short wait between documents is a good idea so that you don't get locked out of the platform.
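
The fixed `time.sleep(6)` pause is the simplest way to stay under the rate limit. As an alternative, the sketch below retries a failed call with exponentially growing waits. `call_with_backoff` is a hypothetical helper written for this example and is not part of the openai library, and `flaky_call` is a stand-in that simulates an API call failing twice. For brevity it catches any `Exception`; in practice you would probably narrow this to the rate-limit error raised by your installed version of the library.

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """
    Call fn() and retry with exponentially growing waits if it raises.
    Hypothetical helper for illustration; not part of the openai library.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            # Give up after the final attempt and surface the error
            if attempt == max_retries - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before trying again
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice before succeeding
attempts = {'count': 0}

def flaky_call():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('simulated rate limit')
    return ' none'

# Record the waits instead of actually sleeping so the demo runs instantly
waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

In the classification loop, the API call could be wrapped as `call_with_backoff(lambda: openai.Completion.create(...))`, keeping the same arguments as before.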

#### Now that we have our labels, we can simply add them to our data frame and analyze the results.

In [None]:
test_df['GPT3_labs'] = labels
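
One possible way to analyze the results is sketched below on toy data. It normalizes the raw completions, flags anything that is not one of the three expected labels, and compares the predictions with the hand-coded "stance" column. The `clean_label` helper, the `GPT3_clean` column, and the toy rows are assumptions made for this example; with real output you would apply the same steps to `test_df`.

```python
import pandas as pd

VALID_LABELS = {'support', 'against', 'none'}

def clean_label(raw):
    """Lower-case and trim a raw completion; return 'invalid' if it is not an expected label."""
    label = raw.strip().lower()
    return label if label in VALID_LABELS else 'invalid'

# Toy stand-in for test_df after the GPT3_labs column has been added
toy_results = pd.DataFrame({
    'stance':    ['support', 'against', 'none', 'against'],
    'GPT3_labs': [' support', ' against', ' support', ' Against\n'],
})

toy_results['GPT3_clean'] = toy_results['GPT3_labs'].apply(clean_label)

# Share of documents where the cleaned prediction matches the hand label
accuracy = (toy_results['GPT3_clean'] == toy_results['stance']).mean()

# Confusion table: rows are true stances, columns are predictions
confusion = pd.crosstab(toy_results['stance'], toy_results['GPT3_clean'])
print(accuracy)
print(confusion)
```

Keeping an explicit "invalid" category makes it easy to spot completions that fall outside the label set, which can happen when `max_tokens` cuts a label short or the model drifts from the prompt format.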