## Classification using different prompts
Test different prompts for classifying messages as stress or no stress, using messages from Dreaddit: A Reddit Dataset for Stress Analysis in Social Media. See: https://aclanthology.org/D19-6213.

Several prompt functions are provided, each generating a different prompt.  Select the one to use by changing the function called in:

  messages = function_name(dreadit_text_df.iloc[i:i + rows_per_set])

Models are loaded from Hugging Face. Set your access token in: login('put your User Access Tokens here')

To select a model, update: model_name = "model name"

You can control the rows used to generate prompts by setting:
    # Define the starting row index
    starting_row = 0
    # Define ending row index
    ending_row =  199

A single prompt can score several messages to save time and resources.  However, Dreaddit messages can be long,
and a prompt containing several of them can carry too much context.  Classification performance with 1 message per prompt
was far better than with 5 messages per prompt.  To set the number:
    # Define the number of rows per set
    rows_per_set = 1
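
As a rough illustration of how these three settings interact, the sketch below lists the row slices visited by a loop of the form `range(starting_row, ending_row, rows_per_set)` combined with `iloc[i:i + rows_per_set]`. The helper name `batch_slices` is hypothetical and not part of the notebook. With this form of loop `ending_row` is exclusive, so the values above cover rows 0 through 198 (199 messages).

```python
def batch_slices(starting_row: int, ending_row: int, rows_per_set: int):
    """Yield (start, stop) row slices, one per prompt; stop is exclusive.

    Hypothetical helper that mirrors a loop of the form
    range(starting_row, ending_row, rows_per_set) with iloc[i:i + rows_per_set].
    """
    for start in range(starting_row, ending_row, rows_per_set):
        yield start, start + rows_per_set


slices = list(batch_slices(0, 199, 1))
print(len(slices))   # 199 prompts
print(slices[-1])    # (198, 199): the last row scored is row 198
```

When `rows_per_set` does not divide the range evenly, the last slice can extend past `ending_row`; for example `batch_slices(0, 199, 5)` ends with `(195, 200)`.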


In [None]:
# Use PyTorch 2.0 Kernel on Vertex AI

# flash attention support
#!pip install flash-attn --no-build-isolation

!pip install pandas
!pip install tqdm
!pip install transformers
!pip install -U bitsandbytes
!pip install accelerate


Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.3


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
import accelerate
#import bitsandbytes
import sys
from importlib import reload

import pandas as pd
import tqdm

### Login to Huggingface

In [None]:
### the safer way
#from huggingface_hub import notebook_login
#notebook_login()

### alternative: https://huggingface.co/settings/tokens
from huggingface_hub import login
login('put your User Access Tokens here') # put your User Access Tokens here
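
Pasting a token into a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read the token from an environment variable. The helper name `get_hf_token` is hypothetical, and the choice of the `HF_TOKEN` variable is an assumption about how the environment is set up.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    Raises RuntimeError if the variable is unset or empty, so a missing
    token is reported before any model download is attempted.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token")
    return token


# Usage (needs huggingface_hub and a real token, so not executed here):
#   from huggingface_hub import login
#   login(token=get_hf_token())
```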

### Prompt generation functions
Example prompts for testing

#### Dreaddit Prompt
Zero shot prompt based on stress definition used to label the Dreaddit dataset.

In [None]:
import pandas as pd
# Messages can be classified to indicate stress in the speaker. The following classes are used:

def create_dreaddit_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified as follows:

 '1' is the class that indicates a state of mental or emotional strain or tension resulting
 from adverse or demanding circumstances.

 '0' is the class that indicates the speaker is not expressing a state of mental or emotional strain or tension resulting
 from adverse or demanding circumstances.

 Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":

 '''.format(len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

#### Goode Prompt
Zero shot prompt based on William J. Goode's 1960 study, "A Theory of Role Strain".

In [None]:
import pandas as pd
def create_goode_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. Stress is defined as:

* Role Overload: The inability to meet the sheer quantity of demands due to a lack of time or resources.
* Role Conflict: Situations where the expectations of one role are incompatible with the expectations of another.
* Role Ambiguity: Uncertainty or lack of clarity about role expectations, which can heighten tension.

'0' is the class that indicates the speaker is not expressing role overload or role conflict or role ambiguity
from adverse or demanding circumstances.

'1' is the class that indicates the speaker is expressing role overload or role conflict or role ambiguity
from adverse or demanding circumstances.

Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":

 '''.format(len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

#### Lazarus Prompt
Zero shot prompt based on "Stress, Appraisal, and Coping" by Richard S. Lazarus and Susan Folkman.

In [None]:
import pandas as pd
def create_lazarus_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. The following classes are used:

'1' is the class that indicates the speaker's appraisal of the demands of a situation exceed available resources and thus endangering well-being.

'0' is the class that indicates the speaker's appraisal of the demands of a situation do not exceed available resources and thus do not endanger well-being.

 Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":

 '''.format(len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

#### Examples Prompt

Few shot prompt.

In [None]:
import pandas as pd
def create_examples_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. The classes are "1" and "0" and are shown in the following examples:

 class: "0" message :"Is there anyway I could persuade you to go a view you love (if not any place is fine) and write something similar on a card  or piece of paper [I attached an example here]. Thank you all if you consider helping! It means the world. I'll probably post again as the date gets closer. I aim to continue trying until I can get at least 100"

 class: "0" message :"After college for about 2 years, I focused on my career and wasn't totally putting myself out there. Now, for the last year, I've actively put myself in the dating pool. It never seemed appealing to hang around in clubs or bars to hopefully pick someone up, so I've mostly stuck to online dating. I was mostly pretty casually looking, and would go on there when the mood struck me or I was speaking to someone I saw potential with. Over the course of the last about 2.5 months, I've been on 3 dates."

 class: "1" message :"I got 6 stitches. My parents love my bf and in fact, my mom and him are so close that they go on walks and go to the movies together. I felt uncomfortable and unwanted by his family. To be fair, I was also a little closed off with them but mainly because I felt so unwanted. I have put on some weight, which I am sure his dad noticed because he absolutely hates fat people."

 class: "1" message :"I've never spoken to anyone about my anxiety but I'm pretty sure I have generalized anxiety disorder. When I was young I used to be very bright and would take charge of projects and doing assignments. As time went on I became lazier but still fairly on top of things. When I went into college I suffered and things never clicked. Doing even the most simple of tasks or assignments were just so difficult for me."

 class: "1" message :"I have four kids full time, almost a year ago their dad was removed because of substantial abuse. It's been incredibly hard making ends meet by myself and although I get rent paid, all other bills are stuck on the back burner. Our electric bill is over $400 and they're demanding $225 as a minimum payment. It's scheduled for disconnect today and I can't put it off any longer. I know it's a long shot but seriously needing a miracle at the moment."

 class: "0" message :"I was never really close to him, our conversations never went past the 'nice weather we're having' area. But I coincidentally ran into him at a bar recently and I initially tried to keep my distance from him because I know he's friends with Zach, but then he told me that not only does he and Zach are no longer in contact with each other, they actually really dislike each other now. I'm guessing something went down after Zach and I broke up. After that, I was more comfortable around him. So, as the night went along and we got drunker, we ended up hooking up."

 Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":

 '''.format(len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

In [None]:
import pandas as pd
def create_examples2_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. The classes are "1" and "0" and are shown in the following examples:

 class: "1" message : "Why did I open my mouth I should’ve just said “I’m fine” ~ I don’t need help Or maybe I do"

 class: "1" message : "I don’t know. Was this okay? Should I hate him? Or was it just something new? I really don’t know what to make of the situation."

 class: "1" message : "Idk Do I tell someone? Do I just quit? Do I talk to her about what she did? Please, any advice would be really really helpful to me!"

 class: "0" message: "i faced up to myself. i completed probation. it's not the drugs i need. it's to leave my environment and everything i know; it's to get a fresh start. i'm only 22."

 class: "0" message : "So I'm between paychecks and I've managed to get of my act together to pay most of my bills by asking the church and through private donations."

 class: "1" message : "Every day I hope she messages me, calls me, or post on my Facebook. Any advice would mean the world to me. How to get over my ex!"

 class: "1" message : "I can have it in front of me and still overthink and ask my self over and over. Any advice or opinions? Thanks. P.S. I don’t suffer a lot when I’m busy at work or with friends."

 class: "0" message : "One night, early, early into this, we were kind of flirting. He suggested we shower together. I was scared. Uncomfortable. Not sure."

 class: "0" message : "* Sleeping bag. * Solar-powered Lamps. * A raincoat. * Non-perishable food/MREs/trailmix. Anything else I should invest in?"

 class: "0" message : "Maybe a couple more days will get me back to normal. Definitely quitting the alcohol. It's an obvious trigger."

 Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":

 '''.format(len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages
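
The prompt functions above share the same message loop and closing tags; only the instruction text differs. The sketch below shows one way the shared parts could be factored out. The helper `build_prompt`, the `{n}` placeholder and the sample messages are hypothetical and for illustration only. It takes a plain list of strings so that it runs without a DataFrame, and its whitespace differs slightly from the functions above, which may affect model output.

```python
from typing import List


def build_prompt(instructions: str, texts: List[str]) -> str:
    """Assemble a prompt from an instruction template and a list of messages.

    Every `{n}` in `instructions` is replaced by the number of messages.
    Hypothetical helper, not part of the notebook.
    """
    header = "<|user|>\n" + instructions.replace("{n}", str(len(texts))) + "\n"
    # One "message: '...'" entry per text, in the same form as the loops above
    body = "".join(f"\nmessage: '{text}'\n" for text in texts)
    footer = "\n<|end|>\n<|assistant|>\n"
    return header + body + footer


instructions = (
    "Provide only one class for each of the following {n} messages. "
    "Output a single line of {n} comma separated classes beginning "
    'with the word "Classifications":'
)
# Made-up messages for illustration only
prompt = build_prompt(instructions, ["I cannot sleep before exams.", "Nice weather today."])
print(prompt)
```

With a DataFrame, the list could be obtained with `bodies_df["text"].tolist()`.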

### Load messages


In [None]:
# URL of the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/SocialHealthAI/SDOH-Models/refs/heads/main/LLM%20Classification/dreaddit-test.csv"

# Load the CSV file into a DataFrame
dreadit_df = pd.read_csv(csv_url)
dreadit_df = dreadit_df.head(400)         # limit the test data
split_index = len(dreadit_df) // 2
dreadit_d = dreadit_df.iloc[split_index:]

dreadit_text_df = dreadit_df[['text']]

# Display the first few rows of the DataFrame
print(dreadit_df.head())

# last_week_df.dropna(subset=['body'], inplace=True)
# last_week_df.sort_values(by='date_created', inplace=True, ascending=False)

      id      subreddit post_id sentence_range  \
0    896  relationships  7nu7as       [50, 55]   
1  19059        anxiety  680i6d        (5, 10)   
2   7977           ptsd  8eeu1t        (5, 10)   
3   1214           ptsd  8d28vu         [2, 7]   
4   1965  relationships  7r1e85       [23, 28]   

                                                text  label  confidence  \
0  Its like that, if you want or not.“ ME: I have...      0         0.8   
1  I man the front desk and my title is HR Custom...      0         1.0   
2  We'd be saving so much money with this new hou...      1         1.0   
3  My ex used to shoot back with "Do you want me ...      1         0.5   
4  I haven’t said anything to him yet because I’m...      0         0.8   

   social_timestamp  social_karma  syntax_ari  ...  lex_dal_min_pleasantness  \
0      1.514981e+09            22   -1.238793  ...                    1.0000   
1      1.493348e+09             5    7.684583  ...                    1.4000   
2      1

### Load Model


In [None]:
# Load 4-bit quantized model
model_name  = "microsoft/Phi-3-medium-4K-instruct"  #good scores, deterministic output
model = AutoModelForCausalLM.from_pretrained(model_name,
                                                load_in_4bit=True,  # Activates 4-bit quantization
                                                device_map="auto"   # Automatically assigns the model to an available GPU
                                            )

config.json:   0%|          | 0.00/934 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/20.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/6 [00:00<?, ?it/s]

model-00001-of-00006.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00002-of-00006.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00003-of-00006.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00004-of-00006.safetensors:   0%|          | 0.00/4.77G [00:00<?, ?B/s]

model-00005-of-00006.safetensors:   0%|          | 0.00/4.77G [00:00<?, ?B/s]

model-00006-of-00006.safetensors:   0%|          | 0.00/3.61G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]
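
The deprecation warning in the output above asks for a `BitsAndBytesConfig` object passed through `quantization_config`. A minimal sketch of that form follows. The function name `load_4bit_model` is hypothetical, and the NF4 quantization type and bfloat16 compute dtype are assumptions rather than settings taken from this notebook. The imports sit inside the function so that defining it does not require the libraries or a GPU.

```python
def load_4bit_model(model_name: str):
    """Load a causal LM with 4-bit weights using a BitsAndBytesConfig.

    Assumptions (not from the notebook): NF4 quantization and bfloat16
    compute dtype. Needs transformers, bitsandbytes, accelerate and a GPU.
    """
    # Imported here so that defining the function does not need the libraries
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit weights
        bnb_4bit_quant_type="nf4",              # assumed quantization type
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the model on available devices
    )


# Usage (downloads the model, so not executed here):
#   model = load_4bit_model("microsoft/Phi-3-medium-4K-instruct")
```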

### Generate output

Run generation in batches so that prompts are not too large.  rows_per_set determines the number of messages to score in each prompt. The range of messages to score is set by starting_row and ending_row.  To help generate deterministic output the following are set:
* set_seed(42)
* temperature=0.0
* do_sample=False
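
With `do_sample=False`, generation picks the highest-scoring token at every step (greedy decoding). The toy sketch below illustrates why the temperature then has no influence on which token is chosen. The function names and the score values are made up for illustration, and no model is involved.

```python
import math


def softmax(scores, temperature=1.0):
    """Convert raw scores to probabilities, scaled by a positive temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(values):
    """Return the index of the largest value, as greedy decoding would."""
    return max(range(len(values)), key=lambda i: values[i])


toy_scores = [1.2, 3.4, 0.5]  # made-up scores for three candidate tokens

# Temperature reshapes the probabilities but never changes their ranking,
# so the greedy choice is the same for every positive temperature.
for temperature in (0.1, 1.0, 2.0):
    probs = softmax(toy_scores, temperature)
    print(temperature, greedy_pick(probs))
```

To our understanding of the transformers generation API, sampling parameters such as `temperature`, `top_k` and `top_p` are only applied when `do_sample=True`, so `do_sample=False` is the setting that makes the output repeatable here.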


In [None]:
import torch
import time

# Define the number of rows per set
rows_per_set = 1
# Define the starting row index
starting_row = 0
# Define ending row index
#ending_row =  len(last_week_df)
ending_row =  199
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set a seed for reproducibility
set_seed(42)
# separate batches of generated text
generated_texts = []

# check starting row
if starting_row < 0 or starting_row >= len(dreadit_text_df):
    print("Invalid starting row index. Input data has length: " + str(len(dreadit_text_df)))
    sys.exit(1)

# check ending row
if ending_row < starting_row or ending_row >= len(dreadit_text_df):
    print("Invalid ending row index. Input data has length: " + str(len(dreadit_text_df)))
    sys.exit(1)

# acceleration support for input data
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
#model.to(device)

# Loop through the DataFrame in sets of rows_per_set rows
start_time = time.time()  # Record the start time
for i in range(starting_row, ending_row, rows_per_set):

  # Build the prompt and tokenize it
  messages = create_examples2_prompt(dreadit_text_df.iloc[i:i + rows_per_set])
  input_ids = tokenizer.encode(messages, return_tensors='pt')
  # Move the inputs to the model's device. Not strictly necessary if the model is loaded with bitsandbytes support, as it is assigned to the correct device
  input_ids = input_ids.to(device)

  output = model.generate(
        input_ids,
        #max_length=1200,
        max_new_tokens=200,
        temperature=0.0,
        top_k=1,
        top_p=0.0,
        do_sample=False,  # Ensure deterministic output
        pad_token_id=tokenizer.eos_token_id
  )

  # Decode the output and store it for parsing
  generated_texts.append(tokenizer.decode(output[0], skip_special_tokens=True))

# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
print(f"Time required by the function call: {elapsed_time:.4f} seconds")


tokenizer_config.json:   0%|          | 0.00/3.15k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/568 [00:00<?, ?B/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Time required by the function call: 563.8473 seconds


In [None]:
print(*generated_texts, sep = "\n")


  Messages can be classified to indicate stress in the speaker. The classes are "1" and "0" and are shown in the following examples:

 class: "1" message: "Why did I open my mouth I should’ve just said “I’m fine” ~ I don’t need help Or maybe I do"

 class: "1" message: "I don’t know. Was this okay? Should I hate him? Or was it just something new? I really don’t know what to make of the situation."

 class: "1" message: "Idk Do I tell someone? Do I just quit? Do I talk to her about what she did? Please, any advice would be really really helpful to me!"

 class: "0" message: "One night, early, early into this, we were kind of flirting. He suggested we shower together. I was scared. Uncomfortable. Not sure."

 class: "0" message: "* Sleeping bag. * Solar-powered Lamps. * A raincoat. * Non-perishable food/MREs/trailmix. Anything else I should invest in?"

 class: "0" message: "Maybe a couple more days will get me back to normal. Definitely quitting the alcohol. It's an obvious trigger." 


### Parse output to find Classifications
Look for "Classifications:" in each batch of inferences, extract the scores from the same line, and append them to a dataframe of scores.


In [None]:
import re
output_df = pd.DataFrame(columns=['score'])

for batch in generated_texts:

  #print("*********************Start Batch\n")
  #print(batch)
  #print("*********************End Batch\n")

  # Split the text into lines
  lines = batch.split('\n')

  # Count the messages in the batch

  message_count = 0
  for i, line in enumerate(lines):
    if "message: " in line:
      message_count += 1

  # Find the scores in the batch, making sure we match the number of messages.  The
  # transformer sometimes misses a message (1 in 2500).  We append 0 (no stress) scores
  # if needed.  All scores are appended to a dataframe.

  for i, line in enumerate(lines):
    if "Classifications:" in line:
      numbers = re.findall(r'\d+', lines[i])
      if len(numbers) == 0:
        print("Warning, no inferences found")
      # Convert the extracted strings to integers
      numbers = [int(num) for num in numbers]
      while len(numbers) < message_count:
        numbers.append(0)
        print("Warning, score 0 appended as score number in: ", numbers, " is less than: ", message_count)
      #print(numbers)
      # Create a DataFrame from the numbers list
      batch_df = pd.DataFrame(numbers, columns=['score'])
      # Append the new numbers to the output DataFrame
      output_df = pd.concat([output_df, batch_df], ignore_index=True)
      break
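
The parsing step can also be written as a small standalone function, sketched below under a few assumptions. The name `parse_classifications` is hypothetical. Unlike the cell above, it takes the expected number of scores as an argument (for example `rows_per_set`) instead of counting "message: " lines, since the decoded text echoes the prompt and that count may also pick up few-shot example lines. It returns None when no "Classifications:" line is found, so the caller can decide how to handle that batch.

```python
import re


def parse_classifications(generated: str, expected_count: int, fill: int = 0):
    """Extract 0/1 classes from the first line containing 'Classifications:'.

    Missing classes are padded with `fill` and extras are dropped, so the
    result always has `expected_count` entries. Returns None when no
    'Classifications:' line is present.
    """
    for line in generated.splitlines():
        if "Classifications:" in line:
            # Only look at the text after the marker
            tail = line.split("Classifications:", 1)[1]
            found = [int(n) for n in re.findall(r"\b[01]\b", tail)]
            found = found[:expected_count]
            return found + [fill] * (expected_count - len(found))
    return None


print(parse_classifications("Classifications: 1, 0", 2))   # [1, 0]
print(parse_classifications("Classifications: 1", 2))      # [1, 0] after padding
print(parse_classifications("no answer given", 2))         # None
```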



### Classification Report
Classification report based on target labels provided in the Dreaddit dataset.

In [None]:
from sklearn.metrics import classification_report
scores = output_df['score'].to_numpy().tolist()
# Treat any non-zero score as class 1
scores = [1 if x > 0 else x for x in scores]
# classification_report expects the true labels first, then the predictions
print(classification_report(
  dreadit_df['label'].head(199).to_numpy().tolist(),
  scores,
  target_names = ['0', '1']
  ))

# print(confusion_matrix(
#   output_df['score'].to_numpy(),
#   dreadit_df['label'].head(100).to_numpy()
#   ))
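
`classification_report` takes the true labels as its first argument and the predictions as its second. Passing them the other way round swaps precision and recall for each class. The sketch below computes both by hand on made-up labels to show the effect. The helper `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class from two equal-length lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels      = [1, 1, 1, 0, 0, 0]   # made-up ground truth
predictions = [1, 0, 0, 0, 0, 0]   # made-up model output

print(precision_recall(labels, predictions))   # precision 1.0, recall about 0.33
print(precision_recall(predictions, labels))   # swapped: precision about 0.33, recall 1.0
```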

In [None]:
scores

In [None]:
dreadit_df['label'].head(200).to_numpy().tolist()

In [None]:
len(output_df['score'].to_numpy().tolist())

In [None]:
dreadit_df['label'].head(100).to_numpy().tolist()