# Using Large Language Models for Content Analysis

By Indira Sen

In this notebook, you will:
- download a text-based dataset which has *labeled* examples, for instance a sentiment analysis dataset with tweets and their corresponding sentiment labels;
- define a function that prompts a large language model to label the dataset with predicted labels. You can do this in different modes, such as zero-shot and few-shot;
- investigate the LLM's predicted labels.

## 1. Get the data

In [7]:
import pandas as pd
import os
from daacs.infrastructure.bootstrap import Bootstrap
b = Bootstrap() 


As an example, we will use the PStance dataset, here the training file of tweets about Joe Biden, each labeled with a stance (FAVOR or AGAINST).

You could also try other types of labeling, e.g., with one of the first hate speech datasets, https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master, from the 2017 paper '[Automated Hate Speech Detection and the Problem of Offensive Language](https://ojs.aaai.org/index.php/ICWSM/article/view/14955)'.

In [6]:
data = pd.read_csv(f'{b.DATA_DIR}/pol/raw_train_biden.csv')
data.head()

Unnamed: 0,Tweet,Target,Stance
0,Joe Biden is looking to gather votes from unsu...,Joe Biden,AGAINST
1,Check out the latest podcast conversation betw...,Joe Biden,FAVOR
2,Thank you Secretary Clinton for your endorseme...,Joe Biden,FAVOR
3,Happening now: @JoeBiden kicking off #Hispanic...,Joe Biden,FAVOR
4,Thank you Mayor @KeishaBottoms for opening our...,Joe Biden,FAVOR


In [78]:
data

Unnamed: 0,Tweet,Target,Stance
0,Joe Biden is looking to gather votes from unsu...,Joe Biden,AGAINST
1,Check out the latest podcast conversation betw...,Joe Biden,FAVOR
2,Thank you Secretary Clinton for your endorseme...,Joe Biden,FAVOR
3,Happening now: @JoeBiden kicking off #Hispanic...,Joe Biden,FAVOR
4,Thank you Mayor @KeishaBottoms for opening our...,Joe Biden,FAVOR
...,...,...,...
5801,Call me stubborn but I just dont think I wanna...,Joe Biden,AGAINST
5802,Crazy liberals lol progressive policies work b...,Joe Biden,AGAINST
5803,Lots of students @UWMadison awaiting @JoeBiden...,Joe Biden,FAVOR
5804,"Other than the terrible grammar, is #Biden jus...",Joe Biden,AGAINST


## 2. Make the Prompt

In [8]:
def make_prompt(task, options, instance, **kwargs):
    # options ---> all possible labels, rendered as ' 1) FAVOR 2) AGAINST'
    options_str = ''
    for i in range(len(options)):
        options_str = options_str + ' %d) %s' %(i+1, options[i])
    prompt = 'Given a piece of text, you have to label whether it is %s or not.\
    Please return one of the following options with only the text and no number:%s.'\
    %(task, options_str)

    if kwargs['zero_shot']:
        return prompt + ' What is the label of this text: "' + instance+ '"'
    else: # for few-shot: kwargs['examples'] is a list of [text, label] pairs
        examples_str = ''
        for example in kwargs['examples']:
            examples_str = examples_str + 'text: %s, label: %s\n' %(example[0], example[1])
        return prompt + ' Here are some examples of instances and their labels:\
    \n%sWhat is the label of this text: ' %(examples_str) + instance

In [9]:
task = 'in favor of Joe Biden'
options = ['FAVOR', 'AGAINST']
examples = [] # the first two instances of the dataset are used as few-shot examples
for _, row in data.iterrows():
    examples.append([row['Tweet'], row['Stance']])
    if len(examples) == 2:
        break
instance = "Ugh, this was true yesterday and it's also true now: Biden is an idiot" # you can replace this with instances in the dataset
instance

"Ugh, this was true yesterday and it's also true now: Biden is an idiot"

In [10]:
make_prompt(task, options, instance, zero_shot = True)

'Given a piece of text, you have to label whether it is in favor of Joe Biden or not.    Please return one of the following options with only the text and no number: 1) FAVOR 2) AGAINST. What is the label of this text: "Ugh, this was true yesterday and it\'s also true now: Biden is an idiot"'

In [82]:
print(make_prompt(task, options, instance, zero_shot = False, examples = examples))

Given a piece of text, you have to label whether it is in favor of Joe Biden or not.    Please return one of the following options with only the text and no number: 1) FAVOR 2) AGAINST. Here are some examples of instances and their labels:    
text: Joe Biden is looking to gather votes from unsuspecting voters. One must remember, Good Ole Boy Joe supported a Grand Wizard of the KKK. Joe cannot deny it., label: AGAINST
text: Check out the latest podcast conversation between @JoeBiden and @AndrewYang. #HeresTheDeal #UnitedForJoe #BarnstormersForAmerica #ITrustJoe, label: FAVOR
What is the label of this text: Ugh, this was true yesterday and it's also true now: Biden is an idiot


In [11]:
prompt = make_prompt(task, options, instance, zero_shot = False, examples = examples)

## 3. Call the LLM with the prompt

In [12]:
runs = 3 # specify how many labels we want per instance.

We will try this with an open source model like Flan-T5.

In [85]:
# ! pip install transformers

In [86]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small", max_new_tokens = 500)
model.cuda()
inputs = tokenizer("A step by step recipe to make bolognese pasta:",
                   return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))



['Pour a cup of bolognese into a large bowl and add the pasta']


The code to prompt Flan-T5 is a bit complex.

We use the Hugging Face Transformers library to perform sequence-to-sequence (seq2seq) language modeling with a pre-trained model called "google/flan-t5-small".

- AutoModelForSeq2SeqLM is used to load a pre-trained seq2seq model.

- AutoTokenizer is used to load the tokenizer associated with the model.

Load the pre-trained model and tokenizer:

The code loads a pre-trained sequence-to-sequence model named "google/flan-t5-small" using AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small"). This model is an instruction-tuned variant of the T5 (Text-to-Text Transfer Transformer) architecture.

We then create the tokenizer:

- The code creates a tokenizer for the "google/flan-t5-small" model using AutoTokenizer.from_pretrained("google/flan-t5-small", max_new_tokens=500).

Note that max_new_tokens is a generation setting rather than a tokenizer setting: it caps how many tokens model.generate() may produce beyond the input. Passed to the tokenizer, as it is here, it should not affect the length of the generated text; to allow longer outputs, pass it to model.generate() instead.

We then move the model to the GPU (CUDA):

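model.cuda() moves the loaded model to the GPU for faster inference. It requires a CUDA-capable GPU and raises an error if none is available. Here the model ends up on the "cuda:0" device, the same device the inputs are moved to below.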

inputs = tokenizer("A step by step recipe to make bolognese pasta:", return_tensors="pt").to("cuda:0") tokenizes the input text "A step by step recipe to make bolognese pasta:" using the tokenizer. The return_tensors="pt" option returns PyTorch tensors. The resulting tokenized input is then moved to the GPU.

We then generate a sequence from the model:

outputs = model.generate(**inputs) generates a sequence based on the tokenized input using the loaded model. The generate method takes the tokenized input and produces a sequence of output tokens.

Decode and print the generated sequence:

tokenizer.batch_decode(outputs, skip_special_tokens=True) decodes the generated output tokens into text, skipping any special tokens that are not part of the final result.

In summary, this code loads a pre-trained seq2seq model, tokenizes an input text, generates a sequence based on the input using the model, and then prints the generated text. It uses the "google/flan-t5-small" model, the smallest Flan-T5 variant, which is suitable for quick experiments on various text-to-text tasks. The code assumes that a GPU is available for faster inference.

In [87]:
responses = []
for n in range(0, runs):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs)
    responses.append(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

In [88]:
responses

['AGAINST', 'AGAINST', 'AGAINST']
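
All three runs return the same label here. That is expected if generation uses greedy decoding (the usual default when no sampling options are passed to `generate`), since repeated runs only differ once sampling is enabled. When the runs do disagree, one common way to collapse them into a single label is a majority vote. The helper below is a minimal sketch: `majority_vote` is a hypothetical name, not part of the notebook, and ties are resolved in favor of the label that appears first.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        return None
    counts = Counter(labels)
    # most_common keeps equal counts in first-seen order (Python 3.7+)
    return counts.most_common(1)[0][0]

print(majority_vote(['AGAINST', 'AGAINST', 'AGAINST']))
print(majority_vote(['FAVOR', 'AGAINST', 'FAVOR']))
```

With an odd number of runs and two labels, a tie can only occur if some responses fall outside the label set.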

Now we will try Llama 2, an LLM from Meta.

In [13]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [90]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4


In [91]:
model_name_or_path = "TheBloke/Llama-2-7B-chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q2_K.bin" # the model is in bin format

In [92]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

In [93]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

In [94]:
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32, # Change this value based on your model and your GPU VRAM pool.
    n_ctx=2048
    )

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 


In [95]:
prompt = "Write a linear regression in python"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''

In [96]:
response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a linear regression in python

ASSISTANT:
Wow! That's a great question! Linear regression is an important machine learning algorithm used to find the best-fitting line between two variables. In Python, you can use scikit-learn library which has built-in functions for linear regression such as SimpleLinearRegression and Ridge. Here are some examples of how you could implement these in Python:

1. **Simple Linear Regression**
```
from sklearn.linear_model import SimpleLinearRegression
X = [1, 2, 3, 4]
y = [5, 7, 9, 10]
reg = SimpleLinearRegression()
reg.fit(X, y)
print("Coefs:", reg.coeff_)
```
This code fits a simple linear regression model to the data provided in `X` and `y`. The output will contain the coefficients of the linear regression equation.

2. **Ridge Regression**
```
from sklearn.linear_model import RidgeRegressor
X = [1, 2, 3, 4]
y = [5, 7, 9, 10]
reg = R


In [97]:
# let's try the political stance prompt again
prompt = make_prompt(task, options, instance, zero_shot = False, examples = examples)
prompt

"Given a piece of text, you have to label whether it is in favor of Joe Biden or not.    Please return one of the following options with only the text and no number: 1) FAVOR 2) AGAINST. Here are some examples of instances and their labels:    \ntext: Joe Biden is looking to gather votes from unsuspecting voters. One must remember, Good Ole Boy Joe supported a Grand Wizard of the KKK. Joe cannot deny it., label: AGAINST\ntext: Check out the latest podcast conversation between @JoeBiden and @AndrewYang. #HeresTheDeal #UnitedForJoe #BarnstormersForAmerica #ITrustJoe, label: FAVOR\nWhat is the label of this text: Ugh, this was true yesterday and it's also true now: Biden is an idiot"

In [102]:
prompt_template=f'''SYSTEM: Provide an honest answer.

USER: {prompt}

ASSISTANT:
'''

In [99]:
response=lcpp_llm(prompt=prompt_template, max_tokens=100, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  # echo=True
                  )
print(response["choices"][0]["text"])

Llama.generate: prefix-match hit


The label for that specific piece of text would be AGAINST.
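
The SYSTEM/USER/ASSISTANT template is rebuilt several times in this notebook. A small helper keeps it identical everywhere and avoids a subtle pitfall: a triple-quoted f-string written inside an indented block, such as a loop, carries that indentation into the prompt. The sketch below is one possible way to do this; `build_chat_prompt` is a hypothetical helper, not part of llama-cpp-python.

```python
def build_chat_prompt(user_message, system_message="Provide an honest answer."):
    """Assemble the SYSTEM/USER/ASSISTANT template used in this notebook."""
    # Joining explicit lines means the result does not depend on how
    # deeply the calling code happens to be indented.
    lines = [
        f"SYSTEM: {system_message}",
        "",
        f"USER: {user_message}",
        "",
        "ASSISTANT:",
        "",
    ]
    return "\n".join(lines)

print(build_chat_prompt("Is this text in favor of Joe Biden?"))
```

The returned string can be passed as the `prompt` argument in the same way as `prompt_template`.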


Now do this for all the instances in your dataset.
**Hint**: Use a loop over your dataframe. When doing few-shot labeling, make sure that the examples are not the same as the instance to be labeled.

- Try both zero-shot and few-shot and compare their performance.
- Try Flan-T5 small
- Try to get the label from the LLM output. Is it always as expected and can it always be used as is for quantitative analysis?
- At least for the first 50 instances in your dataset, use metrics like accuracy and F1 score to assess the performance of the LLMs against the ground-truth labels.
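
For the hint about few-shot labeling, one simple approach is to draw the examples from the rest of the dataset and skip any row whose text matches the instance being labeled. The sketch below works on plain (text, label) pairs so that it runs on its own; `pick_examples` and its parameters are hypothetical names, and in the notebook the pairs would come from the `Tweet` and `Stance` columns.

```python
def pick_examples(pairs, instance, k=2):
    """Return the first k [text, label] pairs whose text differs from instance."""
    chosen = []
    for text, label in pairs:
        if text == instance:
            continue  # never show the model the instance it has to label
        chosen.append([text, label])
        if len(chosen) == k:
            break
    return chosen

# Toy data, made up for illustration only
pairs = [("tweet a", "FAVOR"), ("tweet b", "AGAINST"), ("tweet c", "FAVOR")]
print(pick_examples(pairs, instance="tweet a"))
```

The result has the same shape as the `examples` list passed to `make_prompt`, so it can be used as a drop-in replacement inside the labeling loop.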

Bonus:
- try varying the wording of the prompts
- try giving an explicit definition of the task in the prompt
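
For the bonus tasks, one way to add an explicit task definition is to place it before the instruction. The function below is only a sketch: `make_prompt_with_definition` is a hypothetical variant of `make_prompt`, and the wording of the definition is an assumption meant to be varied, not a recommended phrasing.

```python
def make_prompt_with_definition(definition, options, instance):
    """Zero-shot prompt that states the task definition before the instruction."""
    options_str = ', '.join(options)
    return (
        f'{definition} '
        f'Label the following text with exactly one of these options: {options_str}. '
        f'Text: "{instance}" Label:'
    )

# Illustrative definition; the exact wording is an assumption to experiment with
definition = ('Stance is the attitude an author expresses towards a target, '
              'here Joe Biden.')
print(make_prompt_with_definition(definition, ['FAVOR', 'AGAINST'],
                                  'Biden is an idiot'))
```

Keeping the definition as a separate argument makes it easy to compare several wordings on the same instances.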

In [100]:
data_subset = data.head(100)
data_subset.head()

Unnamed: 0,Tweet,Target,Stance
0,Joe Biden is looking to gather votes from unsu...,Joe Biden,AGAINST
1,Check out the latest podcast conversation betw...,Joe Biden,FAVOR
2,Thank you Secretary Clinton for your endorseme...,Joe Biden,FAVOR
3,Happening now: @JoeBiden kicking off #Hispanic...,Joe Biden,FAVOR
4,Thank you Mayor @KeishaBottoms for opening our...,Joe Biden,FAVOR


In [101]:
from tqdm import tqdm # to help you keep track of how many instances have been labeled
import time # to deal w/ rate limits; important for commercial models

# another advantage of your own model is that you aren't rate limited
all_responses = []
for _, row in tqdm(data_subset.iterrows(), total=data_subset.shape[0]):
    prompt = make_prompt(task, options, zero_shot = True, instance = row['Tweet'])
    responses = []
    for n in range(0, runs):
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
        outputs = model.generate(**inputs)
        responses.append(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
    response_list = [row['Tweet'], row['Stance']]
    response_list.extend(responses)
    all_responses.append(response_list)

100%|██████████| 100/100 [00:14<00:00,  6.94it/s]


In [103]:
# the second column holds the gold stance label from the dataset
flant5_results = pd.DataFrame(all_responses, columns = ['tweet', 'stance', 'flant5_pred_1',
                                      'flant5_pred_2',
                                      'flant5_pred_3'])
flant5_results.head()

Unnamed: 0,tweet,stance,flant5_pred_1,flant5_pred_2,flant5_pred_3
0,Joe Biden is looking to gather votes from unsu...,AGAINST,AGAINST,AGAINST,AGAINST
1,Check out the latest podcast conversation betw...,FAVOR,NOT,NOT,NOT
2,Thank you Secretary Clinton for your endorseme...,FAVOR,NOT,NOT,NOT
3,Happening now: @JoeBiden kicking off #Hispanic...,FAVOR,AGAINST,AGAINST,AGAINST
4,Thank you Mayor @KeishaBottoms for opening our...,FAVOR,AGAINST,AGAINST,AGAINST
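
To assess these predictions quantitatively, accuracy and F1 can be computed against the gold labels. The sketch below does this in plain Python on the five rows shown above (first prediction per row), so that the arithmetic is visible. `accuracy`, `f1_for_label` and `macro_f1` are hypothetical helper names; in practice a library such as scikit-learn would typically be used instead. Predictions outside the label set, such as `NOT`, simply count as errors here.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold label exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for_label(gold, pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, l) for l in labels) / len(labels)

gold = ['AGAINST', 'FAVOR', 'FAVOR', 'FAVOR', 'FAVOR']
pred = ['AGAINST', 'NOT', 'NOT', 'AGAINST', 'AGAINST']
print(accuracy(gold, pred))                        # 0.2
print(macro_f1(gold, pred, ['FAVOR', 'AGAINST']))  # 0.25
```

Five rows are far too few to draw conclusions from; the same functions can be applied to the full `data_subset` once all predictions are collected.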


In [104]:
all_responses = []
for _, row in tqdm(data_subset.iterrows(), total=data_subset.shape[0]):
    prompt = make_prompt(task, options, zero_shot = True, instance = row['Tweet'])
    prompt_template=f'''SYSTEM: Provide an honest answer.

    USER: {prompt}

    ASSISTANT:
    '''
    responses = []
    for n in range(0, runs):
        response=lcpp_llm(prompt=prompt_template, max_tokens=100, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  # echo=True
                  )
        responses.append(response["choices"][0]["text"])
    response_list = [row['Tweet'], row['Stance']]
    response_list.extend(responses)
    all_responses.append(response_list)

  0%|          | 0/100 [00:00<?, ?it/s]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
  1%|          | 1/100 [00:09<14:56,  9.05s/it]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
  2%|▏         | 2/100 [00:10<07:45,  4.75s/it]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
  3%|▎         | 3/100 [00:13<06:13,  3.85s/it]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
  4%|▍         | 4/100 [00:16<05:26,  3.40s/it]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
  5%|▌         | 5/100 [00:36<15:09,  9.58s/it]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
  6%|▌         | 6/100 [00:57<21:06, 13.47s/it]Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: pr

In [106]:
# the second column holds the gold stance label from the dataset
llama_results = pd.DataFrame(all_responses, columns = ['tweet', 'stance', 'llama_pred_1',
                                      'llama_pred_2',
                                      'llama_pred_3'])
llama_results.tail()

Unnamed: 0,tweet,stance,llama_pred_1,llama_pred_2,llama_pred_3
95,#Biden Today down at The Hollywood Walk of fam...,FAVOR,1) FAVOR,1) FAVOR,1) FAVOR
96,FACT: @JoeBiden introduced the first climate c...,FAVOR,1) FAVOR 2) AGAINST,1) FAVOR 2) AGAINST,1) FAVOR 2) AGAINST
97,".@JoeBiden was born and raised in Scranton, Pe...",FAVOR,1) FACTS - This statement is factually incorre...,"Honestly, I cannot provide a label for that t...",1. FACTUAL: This statement appears to be based...
98,I put my full support behind the Joe Schiavoni...,FAVOR,1) FAVOR 2) AGENT,1) FAVOR,1) FAVOR 2) AGENT
99,Outgoing Presidents leave letters to their suc...,FAVOR,1) FAVOR,"The label for this text is ""FAVOR"".",1) FAVOR
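
The Llama outputs above are not always a bare label: some repeat the option number, some list both options, and some refuse or invent labels. Before any quantitative analysis, each response has to be mapped to a valid label or marked as unusable. The sketch below shows one conservative rule, which is an assumption rather than the notebook's method: accept a response only if exactly one valid label appears in it as an upper-case word. `extract_label` is a hypothetical helper.

```python
import re

def extract_label(response, options=('FAVOR', 'AGAINST')):
    """Return the single valid label found in response, or None if unclear."""
    pattern = r'\b(' + '|'.join(re.escape(o) for o in options) + r')\b'
    # Case-sensitive on purpose, so "in favor of" in an explanation is ignored
    found = set(re.findall(pattern, response))
    if len(found) == 1:
        return found.pop()
    return None  # no label, or several competing labels

for text in ['1) FAVOR',
             'The label for this text is "FAVOR".',
             '1) FAVOR 2) AGAINST',
             '1) FAVOR 2) AGENT']:
    print(repr(text), '->', extract_label(text))
```

Responses mapped to `None` are worth counting separately, since the share of unusable outputs is itself a finding about the model. Whether a response such as `1) FAVOR 2) AGENT` should be accepted as FAVOR, as this rule does, is a judgment call to document in your analysis.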
