## Classification using different prompts
Test different prompts to classify messages as stress or no stress, using messages from Dreaddit: A Reddit Dataset for Stress Analysis in Social Media. See: https://aclanthology.org/D19-6213.

Commonly updated parameters are in the Parameters block below.

**test_count:** The number of messages to use in generating prompts.

**model_name:** The ID of the model to download from Hugging Face.

**access_token:** The Hugging Face access token.

**prompt_function:** The name of the prompt function to use. Choose from the prompt functions defined below.

**rows_per_set:** The number of messages to score in each prompt. Scoring multiple messages per prompt reduces resources and time. Note that Dreaddit messages can be large, so the resulting prompt can carry too much context. Model accuracy for 1 message per prompt is typically better than for 5 messages per prompt, but inference is slower.
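
As a rough illustration of how prompt_function and rows_per_set interact, the sketch below selects a prompt builder by name and walks a list of messages in sets. The names `PROMPT_BUILDERS`, `iter_sets`, `build_prompts`, and the stand-in `toy_prompt` are hypothetical and not part of this notebook, which looks the function up with `globals()` and slices a DataFrame instead. Plain lists of strings are assumed here to keep the example self-contained.

```python
from typing import Callable, Dict, Iterator, List

def toy_prompt(messages: List[str]) -> str:
    """Stand-in for a real prompt function (hypothetical)."""
    body = "".join(f"\nmessage: '{m}'\n" for m in messages)
    return f"Classify the following {len(messages)} messages.\n{body}"

# Explicit registry: an alternative to looking names up in globals()
PROMPT_BUILDERS: Dict[str, Callable[[List[str]], str]] = {"toy_prompt": toy_prompt}

def iter_sets(messages: List[str], rows_per_set: int) -> Iterator[List[str]]:
    """Yield consecutive sets of at most rows_per_set messages."""
    if rows_per_set < 1:
        raise ValueError("rows_per_set must be at least 1")
    for start in range(0, len(messages), rows_per_set):
        yield messages[start:start + rows_per_set]

def build_prompts(messages: List[str], prompt_function: str, rows_per_set: int) -> List[str]:
    """Build one prompt per set using the builder registered under prompt_function."""
    builder = PROMPT_BUILDERS[prompt_function]
    return [builder(chunk) for chunk in iter_sets(messages, rows_per_set)]

prompts = build_prompts(["a", "b", "c", "d", "e"], "toy_prompt", rows_per_set=2)
print(len(prompts))  # 3 prompts: sets of 2, 2, and 1 messages
```

With five messages and rows_per_set = 2, the last prompt carries a single message, so every message is scored exactly once.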



#### Parameters

In [None]:
test_count = 400
model_name  = "microsoft/Phi-3-medium-4K-instruct"
access_token = ''
#prompt_function = 'create_dreaddit_prompt'
prompt_function = 'create_examples_prompt'
rows_per_set = 2

In [None]:
# Use PyTorch 2.0 Kernel on Vertex AI

# flash attention support
#!pip install flash-attn --no-build-isolation

!pip install pandas
!pip install tqdm
!pip install transformers
!pip install -U bitsandbytes
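!pip install accelerate
# scikit-learn and matplotlib are used in the evaluation cells
!pip install scikit-learn matplotlib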


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
import accelerate
#import bitsandbytes
import sys
from importlib import reload

import pandas as pd
import tqdm

### Log in to Hugging Face

In [None]:
### the safer way
#from huggingface_hub import notebook_login
#notebook_login()

### alternative: https://huggingface.co/settings/tokens
from huggingface_hub import login
login(access_token)

### Prompt generation functions
Example prompts for testing
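
The prompt functions in this section share one layout: a `<|user|>` marker, the instructions, one `message: '...'` line per message, then `<|end|>` and `<|assistant|>`. A minimal sketch of that layout follows. The helper names `format_messages` and `assemble_prompt` and the `{n}` placeholder are hypothetical, and plain lists of strings are assumed instead of a DataFrame.

```python
from typing import List

USER_TAG = "<|user|>"
END_TAG = "<|end|>"
ASSISTANT_TAG = "<|assistant|>"

def format_messages(texts: List[str]) -> str:
    """Render each text on its own "message: '...'" line."""
    return "".join(f"\nmessage: '{text}'\n" for text in texts)

def assemble_prompt(instructions: str, texts: List[str]) -> str:
    """Join the instructions, the messages, and the chat markers into one prompt."""
    # Only the instructions are formatted, so braces inside messages are left alone
    header = instructions.format(n=len(texts))
    return f"{USER_TAG}\n{header}\n{format_messages(texts)}\n{END_TAG}\n{ASSISTANT_TAG}\n"

instructions = (
    'Output a single line of {n} comma separated classes '
    'beginning with the word "Classifications":'
)
prompt = assemble_prompt(instructions, ["first post", "second post"])
print(prompt)
```

Keeping the message lines in a fixed `message: ` format matters later, because the parsing step counts those lines to know how many scores to expect.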

#### Dreaddit Prompt
Zero-shot prompt based on the stress definition used to label the Dreaddit dataset.

In [None]:
import pandas as pd
# Messages can be classified to indicate stress in the speaker. The following classes are used:

def create_dreaddit_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified as follows:

 '1' is the class that indicates a state of mental or emotional strain or tension resulting
 from adverse or demanding circumstances.

 '0' is the class that indicates the speaker is not expressing a state of mental or emotional strain or tension resulting
 from adverse or demanding circumstances.

 Output a single line of comma separated classes for the {} messages beginning with the word "Classifications":  There should be {} classes in the line.
 Output a single line of comma separated probabilities for class '1' beginning with the word "Probabilities":  There should be {} probabilities in the line.

 '''.format(len(bodies_df), len(bodies_df),len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

#### Goode Prompt
Zero-shot prompt based on William J. Goode's 1960 study, "A Theory of Role Strain".

In [None]:
import pandas as pd
def create_goode_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. Stress is defined as:

* Role Overload: The inability to meet the sheer quantity of demands due to a lack of time or resources.
* Role Conflict: Situations where the expectations of one role are incompatible with the expectations of another.
* Role Ambiguity: Uncertainty or lack of clarity about role expectations, which can heighten tension.

'0' is the class that indicates the speaker is not expressing role overload or role conflict or role ambiguity
from adverse or demanding circumstances.

'1' is the class that indicates the speaker is expressing role overload or role conflict or role ambiguity
from adverse or demanding circumstances.

Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":
Provide one probability for each of the following {} messages. Output a single line of {} comma separated probabilities beginning with the word "Probabilities":

 '''.format(len(bodies_df), len(bodies_df),len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

#### Lazarus Prompt
Zero-shot prompt based on "Stress, Appraisal, and Coping" by Richard S. Lazarus and Susan Folkman.

In [None]:
import pandas as pd
def create_lazarus_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. The following classes are used:

'1' is the class that indicates the speaker's appraisal of the demands of a situation exceed available resources and thus endangering well-being.

'0' is the class that indicates the speaker's appraisal of the demands of a situation do not exceed available resources and thus do not endanger well-being.

 Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":
 Provide one probability for each of the following {} messages. Output a single line of {} comma separated probabilities beginning with the word "Probabilities":

 '''.format(len(bodies_df), len(bodies_df),len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

#### Examples Prompt
Few-shot prompt based on labeled example messages.

In [None]:
import pandas as pd
def create_examples_prompt(bodies_df: pd.DataFrame) -> str:
  messagesStart = \
  '''
 <|user|>
 Messages can be classified to indicate stress in the speaker. The classes are "1" and "0" and are shown in the following examples:

 class: "1" message : "Why did I open my mouth I should’ve just said “I’m fine” ~ I don’t need help Or maybe I do"

 class: "1" message : "I don’t know. Was this okay? Should I hate him? Or was it just something new? I really don’t know what to make of the situation."

 class: "1" message : "Idk Do I tell someone? Do I just quit? Do I talk to her about what she did? Please, any advice would be really really helpful to me!"

 class: "0" message : "i faced up to myself. i completed probation. it's not the drugs i need. it's to leave my environment and everything i know; it's to get a fresh start. i'm only 22."

 class: "0" message : "So I'm between paychecks and I've managed to get of my act together to pay most of my bills by asking the church and through private donations."

 class: "1" message : "Every day I hope she messages me, calls me, or post on my Facebook. Any advice would mean the world to me. How to get over my ex!"

 class: "1" message : "I can have it in front of me and still overthink and ask my self over and over. Any advice or opinions? Thanks. P.S. I don’t suffer a lot when I’m busy at work or with friends."

 class: "0" message : "One night, early, early into this, we were kind of flirting. He suggested we shower together. I was scared. Uncomfortable. Not sure."

 class: "0" message : "* Sleeping bag. * Solar-powered Lamps. * A raincoat. * Non-perishable food/MREs/trailmix. Anything else I should invest in?"

 class: "0" message : "Maybe a couple more days will get me back to normal. Definitely quitting the alcohol. It's an obvious trigger."

 Provide only one class for each of the following {} messages. Output a single line of {} comma separated classes beginning with the word "Classifications":
 Provide one probability for each of the following {} messages. Output a single line of {} comma separated probabilities beginning with the word "Probabilities":

 '''.format(len(bodies_df), len(bodies_df),len(bodies_df), len(bodies_df))
  # Loop through the DataFrame
  bodies_list = ''
  for i in range(0, len(bodies_df)):
    # print("add message to prompt: ", bodies_df.iloc[i].text)
    message = "\n" + "message: " + "'" + bodies_df.iloc[i].text +"'\n"
    bodies_list = bodies_list + message

  messagesEnd = \
  '''
  <|end|>
  <|assistant|>
  '''

  messages = messagesStart + bodies_list + messagesEnd

  return messages

### Load messages and limit them to test_count rows


In [None]:
# URL of the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/SocialHealthAI/SDOH-Models/refs/heads/main/LLM%20Classification/dreaddit-test.csv"

# Load the CSV file into a DataFrame
dreadit_df = pd.read_csv(csv_url)
dreadit_df = dreadit_df.head(test_count)         # limit the test data

dreadit_text_df = dreadit_df[['text']]

# Display the first few rows of the DataFrame
print(dreadit_df.head())


### Load Model


In [None]:
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             load_in_4bit=True,  # Activates 4-bit quantization (bitsandbytes)
                                             device_map="auto"   # Automatically assigns the model to available devices
                                            )

### Generate output

Run generation in batches so that prompts are not too large. rows_per_set determines the number of messages to score in each prompt. The range of messages to score is set by starting_row and ending_row. To help generate deterministic output, the following are set:
* set_seed(42)
* temperature=0.0
* do_sample=False
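
Whether a prompt is "too large" depends on the model's context window. A minimal sketch of a budget check follows. The function `fits_context`, the `count_tokens` callable, and the 4096-token window are assumptions made for illustration (the window is inferred from the "4K" in the model name). The whitespace counter is only a stand-in for something like `len(tokenizer.encode(prompt))`.

```python
from typing import Callable

def fits_context(prompt: str,
                 count_tokens: Callable[[str], int],
                 context_window: int = 4096,   # assumed from the "4K" model name
                 max_new_tokens: int = 200) -> bool:
    """Return True if the prompt plus the generation budget fits in the window."""
    return count_tokens(prompt) + max_new_tokens <= context_window

def word_count(text: str) -> int:
    """Stand-in counter; real token counts are usually higher than word counts."""
    return len(text.split())

short_prompt = "message: 'I cannot sleep before exams.'"
long_prompt = "word " * 5000

print(fits_context(short_prompt, word_count))
print(fits_context(long_prompt, word_count))
```

A check like this could be run per set before generation, lowering rows_per_set when a prompt does not fit.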


In [None]:
import torch
import time

# Define the starting row index
starting_row = 0
# Define ending row index
#ending_row =  len(last_week_df)
ending_row =  test_count - 1
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set a seed for reproducibility
set_seed(42)
# separate batches of generated text
generated_texts = []

# check starting row
if starting_row < 0 or starting_row >= len(dreadit_text_df):
    print("Invalid starting row index. Input data has length: " + str(len(dreadit_text_df)))
    sys.exit(1)

# check ending row
if ending_row < starting_row or ending_row >= len(dreadit_text_df):
    print("Invalid ending row index. Input data has length: " + str(len(dreadit_text_df)))
    sys.exit(1)

# acceleration support for input data
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
#model.to(device)

# Loop through the DataFrame in sets of rows_per_set rows
start_time = time.time()  # Record the start time
for i in range(starting_row, ending_row + 1, rows_per_set):  # ending_row is inclusive

  # Set up the prompt definition and tokenize; never read past ending_row
  batch_end = min(i + rows_per_set, ending_row + 1)
  messages = globals()[prompt_function](dreadit_text_df.iloc[i:batch_end])
  input_ids = tokenizer.encode(messages, return_tensors='pt')
  # Acceleration support. This may not be necessary if the model is loaded with
  # bitsandbytes support, as it is assigned to the correct device.
  input_ids = input_ids.to(device)

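  # Greedy decoding: with do_sample=False the sampling parameters
  # (temperature, top_k, top_p) are not used to pick tokens and may only trigger warnings.
  output = model.generate(
        input_ids,
        #max_length=1200,
        max_new_tokens=200,
        temperature=0.0,
        top_k=1,
        top_p=0.0,
        do_sample=False,  # Ensure deterministic output
        pad_token_id=tokenizer.eos_token_id
  )

  # Decode the full sequence (prompt plus completion) and store it for parsing
  generated_texts.append(tokenizer.decode(output[0], skip_special_tokens=True))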

# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
print(f"Time required by the function call: {elapsed_time:.4f} seconds")


In [None]:
print(*generated_texts, sep = "\n")

### Parse output to find Classifications and Probabilities
Look for "Classifications:" in each batch of inferences, find the numbers on the same line, and append them to a dataframe of scores.
Look for "Probabilities:" in each batch of inferences, find the numbers on the same line, and append them to a dataframe of probabilities.
Produce output_df with scores and probabilities.
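
A small worked example of this parsing on a made-up completion is sketched below. The function `parse_completion` is hypothetical. The padding values (0 for a missing class, 0.5 for a missing probability) follow the notebook, while trimming extra values is an added assumption to keep rows aligned with the messages.

```python
import re

def parse_completion(text: str, expected: int):
    """Extract classes and probabilities, padded or trimmed to `expected` values."""
    classes, probs = [], []
    for line in text.splitlines():
        if "Classifications:" in line and not classes:
            tail = line.split("Classifications:", 1)[1]
            classes = [int(n) for n in re.findall(r"\d+", tail)]
        elif "Probabilities:" in line and not probs:
            tail = line.split("Probabilities:", 1)[1]
            # Accept "0.85", ".2", and bare integers such as "1"
            probs = [float(n) for n in re.findall(r"\d*\.\d+|\d+", tail)]
    classes = (classes + [0] * expected)[:expected]
    probs = (probs + [0.5] * expected)[:expected]
    return classes, probs

sample = "Classifications: 1, 0\nProbabilities: 0.85, .2"
print(parse_completion(sample, expected=2))                # ([1, 0], [0.85, 0.2])
print(parse_completion("Classifications: 1", expected=2))  # ([1, 0], [0.5, 0.5])
```

The second call shows the fallback: one class is missing and the probabilities line is absent, so neutral values fill the gaps.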

In [None]:
import re

def parse_batch_line(lines, keyword, pattern, cast, default, expected_count):
  """Return expected_count values parsed from the first line containing keyword."""
  values = []
  for line in lines:
    if keyword in line:
      # Parse only the text that follows the keyword
      values = [cast(num) for num in re.findall(pattern, line.split(keyword, 1)[1])]
      print(line, values)
      break
  # The transformer sometimes misses a message (1 in 2500). Pad with the neutral
  # default so the rows stay aligned with the input messages.
  if len(values) < expected_count:
    print("Warning, found", len(values), "values for", keyword, "but expected",
          expected_count, "- padding with", default)
    values += [default] * (expected_count - len(values))
  # Drop any extra values for the same reason
  return values[:expected_count]

scores = []
probabilities = []

for batch in generated_texts:
  # Split the text into lines
  lines = batch.split('\n')

  # Count the messages in the batch. Scored messages start a line with "message: ",
  # which excludes the in-line examples of the few-shot prompt.
  message_count = sum(1 for line in lines if line.lstrip().startswith("message: "))

  scores += parse_batch_line(lines, "Classifications:", r'\d+', int, 0, message_count)
  probabilities += parse_batch_line(lines, "Probabilities:", r'\d*\.\d+|\d+', float, 0.5, message_count)

output_score_df = pd.DataFrame(scores, columns=['score'])
output_probability_df = pd.DataFrame(probabilities, columns=['probability'])
output_df = pd.concat([output_score_df, output_probability_df], axis=1)

In [None]:
output_df

### Use F1-score to determine best threshold for class 1

In [None]:
from sklearn.metrics import f1_score
import numpy as np

# Get true labels and predicted probabilities
y_true = dreadit_df['label'].values
y_probs = output_df['probability'].values.astype(float)

# Search for best threshold by maximizing F1-score.
# Use >= so the search matches how the threshold is applied to produce y_pred.
thresholds = np.arange(0.0, 1.01, 0.01)
f1_scores = [f1_score(y_true, (y_probs >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[np.argmax(f1_scores)]

print(f"Best threshold based on F1-score: {best_threshold:.2f}")
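
To see what the threshold sweep does, here is a worked example on six made-up labels and probabilities, written in plain Python. The functions `f1` and `best_f1_threshold` are hypothetical stand-ins for the `f1_score` and `np.argmax` calls used in the cell above, and the data is invented for illustration only.

```python
def f1(y_true, y_pred):
    """F1-score for the positive class (1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, y_probs, step=0.01):
    """Sweep thresholds from 0 to 1 and return the first one with the highest F1."""
    best_t, best_score = 0.0, -1.0
    for k in range(int(round(1 / step)) + 1):
        t = k * step
        preds = [1 if p >= t else 0 for p in y_probs]
        score = f1(y_true, preds)
        if score > best_score:   # strict comparison keeps the lowest threshold on ties
            best_t, best_score = t, score
    return best_t, best_score

y_true = [0, 0, 1, 1, 1, 0]
y_probs = [0.105, 0.405, 0.355, 0.805, 0.905, 0.605]
threshold, score = best_f1_threshold(y_true, y_probs)
print(f"{threshold:.2f} {score:.3f}")  # 0.61 0.800
```

Here the best cut sits just above the highest-scoring negative (0.605): it gives up one true positive (0.355) to remove all three false positives.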


### Classification Report
Classification report based on the target labels provided in the Dreaddit dataset.

In [None]:
from sklearn.metrics import classification_report

# Convert probabilities to labels based on best threshold
output_df['y_pred'] = (output_df['probability'] >= best_threshold).astype(int)

# Print classification report
print(classification_report(y_true, output_df['y_pred']))

### ROC Curve
ROC curve and AUC based on the predicted probabilities for class 1.


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Convert labels and class 1 probabilities to NumPy arrays
true_labels = dreadit_df['label'].to_numpy()
positive_probs = output_df['probability'].to_numpy(dtype=float)

# Compute ROC curve
fpr, tpr, _ = roc_curve(true_labels, positive_probs)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--')  # Diagonal line for random chance
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
output_probability_df