# 0. File overview
This file uses the fine-tuned BERT classifiers to label the filtered questions that participants in the PRISM study (Kirk et al., 2024) asked,
and then uses the defined logical operator tree to assign the final interrogative classification according to Belnap & Steel (1976).

*Notation convention*: 0 always means NO, and 1 always means YES.

**How is this notebook structured?**

0. File overview
1. Call and apply the fine-tuned BERTs to classify the entire PRISM dataset
2. Use assigned labels to attribute final question classifications
3. Apply logical operators to gold-standard labelled data to get interrogative types 


In [None]:
##############################################
# Setup
##############################################

# (1) Because this is the first Python file, note that a separate environment should be created for this
# entire project to avoid conflicts between packages. I created a .venv virtual environment and installed the required libraries into it.
# The dependencies for the current environment are saved under 00_environment_setup; you may load them to ensure this code runs.

# (2) Load libraries
import os
import sys
import pandas as pd
from transformers import pipeline

# (3) Define paths
PROJECT_ROOT = "/Users/carolinewagner/Desktop/Local/MY498-capstone-main"

# Where is the cleaned PRISM dataset? 
clean_prism_path = os.path.join(
    PROJECT_ROOT,
    "01_data",
    "05_utterance_classification",
    "PRISM_filtered.csv"
)

# Where are the helper functions?
helper_path = os.path.join(
    PROJECT_ROOT,
    "02_code",
    "01_helper-functions"
)

# Where should the classified dataset be saved to?
output_path = os.path.join(
    PROJECT_ROOT,
    "01_data",
    "05_utterance_classification",
    "PRISM_final_categories.csv"
)

# (4) Load the required data
clean_prism = pd.read_csv(clean_prism_path)

# (5) Source the helper functions (to classify the output of the classifiers into the final categories)
sys.path.append(helper_path)
from general_python_helper import assign_question_types_taxonomy  # the only helper used in this notebook
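
The setup above hard-codes an absolute path to one machine. A minimal sketch of one way to make the project root configurable is shown below; the environment variable name `MY498_PROJECT_ROOT` and both function names are assumptions made for illustration and are not part of the original project.

```python
import os
from pathlib import Path


def resolve_project_root(env_var="MY498_PROJECT_ROOT", default=None):
    """Read the project root from an environment variable, else use `default`.

    The variable name is hypothetical; pick whatever fits your setup.
    """
    root = os.environ.get(env_var, default)
    if root is None:
        raise RuntimeError(f"Set {env_var} or pass a default project root.")
    return Path(root).expanduser()


def build_paths(project_root):
    """Build the three paths used in this notebook from a single root."""
    root = Path(project_root)
    data_dir = root / "01_data" / "05_utterance_classification"
    return {
        "clean_prism": data_dir / "PRISM_filtered.csv",
        "helpers": root / "02_code" / "01_helper-functions",
        "output": data_dir / "PRISM_final_categories.csv",
    }


# Example with a placeholder root directory
paths = build_paths("some_project_root")
```

The folder and file names are the ones used in the cell above; only the way the root is obtained differs.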

# 1. Call and apply the fine-tuned BERTs to classify the entire PRISM dataset

In [None]:
# (1) Apply BERT 1A to classify questions

# (a) Load BERT
pipe = pipeline("text-classification", model="carowagner/classify-questions-1A")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0] 
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_1A'] = labels
clean_prism['score_1A'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: y, Score: 0.988
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: y, Score: 0.989
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: y, Score: 0.987
[4/8002] Prompt: "Hello..." → Label: n, Score: 0.574
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: y, Score: 0.989
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: y, Score: 0.989
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: y, Score: 0.988
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: y, Score: 0.988
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: y, Score: 0.988
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: y, Score: 0.988
[11/8002] Prompt: "What

In [None]:
# (2) Apply BERT 1B to classify questions

# (a) Load BERT
pipe = pipeline("text-classification", model="carowagner/classify-questions-1B")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0]  # get top prediction
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_1B'] = labels
clean_prism['score_1B'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: M, Score: 0.999
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: M, Score: 0.999
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: M, Score: 0.999
[4/8002] Prompt: "Hello..." → Label: n, Score: 0.697
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: M, Score: 0.999
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: M, Score: 0.999
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: M, Score: 0.999
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: M, Score: 0.999
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: M, Score: 0.999
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: M, Score: 0.999
[11/8002] Prompt: "What

In [5]:
# (3) Run the data through BERT 2A to classify questions

# (a) Load BERT 
pipe = pipeline("text-classification", model="carowagner/classify-questions-2A")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0]  # get top prediction
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_2A'] = labels
clean_prism['score_2A'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: n, Score: 0.992
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: n, Score: 0.993
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: n, Score: 0.991
[4/8002] Prompt: "Hello..." → Label: M, Score: 0.828
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: n, Score: 0.993
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: n, Score: 0.992
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: n, Score: 0.992
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: n, Score: 0.992
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: n, Score: 0.993
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: y, Score: 0.943
[11/8002] Prompt: "What

In [6]:
# (4) Apply BERT 2B to classify questions

# (a) Load BERT 
pipe = pipeline("text-classification", model="carowagner/classify-questions-2B")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0]  # get top prediction
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_2B'] = labels
clean_prism['score_2B'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: n, Score: 0.931
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: n, Score: 0.979
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: n, Score: 0.515
[4/8002] Prompt: "Hello..." → Label: M, Score: 0.708
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: n, Score: 0.985
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: n, Score: 0.978
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: n, Score: 0.959
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: n, Score: 0.966
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: n, Score: 0.986
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: n, Score: 0.972
[11/8002] Prompt: "What

In [7]:
# (5) Apply BERT 2C to classify questions

# (a) Load BERT 
pipe = pipeline("text-classification", model="carowagner/classify-questions-2C")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0]  # get top prediction
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_2C'] = labels
clean_prism['score_2C'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: u, Score: 0.988
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: u, Score: 0.987
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: u, Score: 0.644
[4/8002] Prompt: "Hello..." → Label: 0, Score: 0.422
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: u, Score: 0.988
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: u, Score: 0.988
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: u, Score: 0.988
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: u, Score: 0.987
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: u, Score: 0.987
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: 2, Score: 0.983
[11/8002] Prompt: "What

In [8]:
# (6) Apply BERT 3A to classify questions

# (a) Load BERT 
pipe = pipeline("text-classification", model="carowagner/classify-questions-3A")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0]  # get top prediction
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_3A'] = labels
clean_prism['score_3A'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: n, Score: 0.998
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: n, Score: 0.998
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: n, Score: 0.998
[4/8002] Prompt: "Hello..." → Label: M, Score: 0.869
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: n, Score: 0.998
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: n, Score: 0.998
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: n, Score: 0.998
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: n, Score: 0.998
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: n, Score: 0.998
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: n, Score: 0.998
[11/8002] Prompt: "What

In [None]:
# (7) Apply BERT 4A to classify questions
# Note: the published name of this model contains a typo ("clasify"), so the string below is intentional.

# (a) Load BERT 
pipe = pipeline("text-classification", model="carowagner/clasify-questions-4A")

# (b) Prepare empty lists for results
labels = []
scores = []

# (c) Loop through prompts one at a time
for i, prompt in enumerate(clean_prism['user_prompt']):
    result = pipe(prompt)[0]  # get top prediction
    label = result['label']
    score = result['score']
    
    # Store results
    labels.append(label)
    scores.append(score)
    
    # Print status message
    print(f"[{i+1}/{len(clean_prism)}] Prompt: \"{prompt[:60]}...\" → Label: {label}, Score: {score:.3f}")

# (d) Add to DataFrame
clean_prism['label_4A'] = labels
clean_prism['score_4A'] = scores

Device set to use mps:0


[1/8002] Prompt: "What can you do about the inequality  of wealth?..." → Label: y, Score: 0.964
[2/8002] Prompt: "What can I do to start making extra money on the side to red..." → Label: y, Score: 0.984
[3/8002] Prompt: "Who is right in the Hamas-Israeli war?  Hamas or the Israeli..." → Label: n, Score: 0.948
[4/8002] Prompt: "Hello..." → Label: y, Score: 0.383
[5/8002] Prompt: "How do I become financially stable on a low income..." → Label: y, Score: 0.982
[6/8002] Prompt: "what side do you take in the current Israel-Palestine war..." → Label: o, Score: 0.948
[7/8002] Prompt: "How can I increase productivity based on scientific findings..." → Label: y, Score: 0.984
[8/8002] Prompt: "How do I deal with a confrontational coworker that does not ..." → Label: y, Score: 0.982
[9/8002] Prompt: "How do I bake a moist chocolate cake..." → Label: y, Score: 0.981
[10/8002] Prompt: "Do you know the meme or trend going around asking males abou..." → Label: n, Score: 0.949
[11/8002] Prompt: "What
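
The seven cells above repeat the same loop with a different model and column suffix. A minimal sketch of how that loop could be factored into one helper follows. The function `apply_classifier` and the toy classifier are hypothetical; the helper only assumes that the classifier is a callable that takes a list of strings and returns one `{'label': ..., 'score': ...}` dictionary per string.

```python
import pandas as pd


def apply_classifier(df, classify, suffix, text_col="user_prompt", batch_size=32):
    """Return a copy of `df` with label_<suffix> and score_<suffix> columns.

    `classify` is any callable mapping a list of strings to a list of
    {'label': ..., 'score': ...} dicts (one per input string).
    """
    texts = df[text_col].astype(str).tolist()
    labels, scores = [], []
    # Send the prompts in batches rather than one at a time
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for result in classify(batch):
            labels.append(result["label"])
            scores.append(result["score"])
    out = df.copy()
    out[f"label_{suffix}"] = labels
    out[f"score_{suffix}"] = scores
    return out


def toy_classifier(batch):
    """Stand-in for a real model: 'y' if the text ends with a question mark."""
    return [
        {"label": "y" if text.strip().endswith("?") else "n", "score": 1.0}
        for text in batch
    ]


toy = pd.DataFrame({"user_prompt": ["Hello", "How do I bake a cake?"]})
toy = apply_classifier(toy, toy_classifier, "1A")
```

In the notebook, the `pipe` object could be passed in place of `toy_classifier`, on the assumption that it returns one top prediction per prompt when called with a list; that behaviour depends on how the pipeline is configured and should be checked before relying on it.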

In [10]:
# Check that this was implemented as intended
clean_prism.head()

Unnamed: 0,utterance_id,interaction_id,conversation_id,user_id,conversation_type,user_prompt,score,label_1A,score_1A,label_1B,...,label_2A,score_2A,label_2B,score_2B,label_2C,score_2C,label_3A,score_3A,label_4A,score_4A
0,ut0,int0,c0,user0,unguided,What can you do about the inequality of wealth?,17,y,0.988066,M,...,n,0.992459,n,0.930954,u,0.987513,n,0.997926,y,0.963778
1,ut10,int5,c1,user1,unguided,What can I do to start making extra money on t...,79,y,0.989479,M,...,n,0.992981,n,0.9793,u,0.987013,n,0.997823,y,0.983638
2,ut16,int3686,c2,user4,controversy guided,Who is right in the Hamas-Israeli war? Hamas ...,50,y,0.986575,M,...,n,0.991211,n,0.515267,u,0.644497,n,0.997808,n,0.948489
3,ut22,int7559,c3,user6,unguided,Hello,100,n,0.5741,n,...,M,0.827535,M,0.708444,0,0.422073,M,0.86902,y,0.38278
4,ut28,int11323,c4,user2,unguided,How do I become financially stable on a low in...,100,y,0.98903,M,...,n,0.992883,n,0.984511,u,0.987585,n,0.997824,y,0.981564
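
Because `head()` truncates the middle columns, a programmatic check can complement the visual one. The sketch below lists any expected `label_*` or `score_*` column that is absent; the function name is hypothetical, and the suffix list simply mirrors the seven classifiers applied above.

```python
EXPECTED_SUFFIXES = ["1A", "1B", "2A", "2B", "2C", "3A", "4A"]


def missing_classifier_columns(columns, suffixes=EXPECTED_SUFFIXES):
    """Return the expected label_/score_ column names absent from `columns`."""
    expected = [f"{kind}_{s}" for s in suffixes for kind in ("label", "score")]
    present = set(columns)
    return [name for name in expected if name not in present]


# Toy example: score_1B has not been added yet
example_columns = ["user_prompt", "label_1A", "score_1A", "label_1B"]
missing = missing_classifier_columns(example_columns, suffixes=["1A", "1B"])
```

Calling `missing_classifier_columns(clean_prism.columns)` after the seven classifier cells would be expected to return an empty list.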


# 2. Use assigned labels to attribute final question classifications

In [None]:
# (1) Define the column names your function returns
question_type_cols = ['M', 'why_q', 'whether_q', 'which_q', 'whathow_q', 'hobsons_c']

# (2) Apply the taxonomy function row-wise; this uses the sourced helper function that assigns the different categories.
category_assignments = clean_prism.apply(assign_question_types_taxonomy, axis=1)

# (3) Add them to the dataframe
clean_prism[question_type_cols] = category_assignments

# (4) Check the result
print(clean_prism[question_type_cols].sum())

# (5) Verify whether the categories are mutually exclusive
# (a) Count how many categories each row was assigned
clean_prism['num_matched_categories'] = clean_prism[question_type_cols].sum(axis=1)

# (b) Count and print rows assigned to more than one category
num_multi_cat = (clean_prism['num_matched_categories'] > 1).sum()
print(f"\nNumber of rows assigned to more than one category: {num_multi_cat}")

M             177
why_q         449
whether_q    2365
which_q      1428
whathow_q    2633
hobsons_c     778
dtype: int64

Number of rows assigned to more than one category: 0


In [None]:
# Note that these assignments add up to 7830, not 8002: this operationalisation of the
# taxonomy of interrogatives is not collectively exhaustive, a limitation discussed in the report.

# Sanity check: look at the rows that were filtered out (M == 1); these should not be interrogatives
print(clean_prism.loc[clean_prism['M'] == 1, 'user_prompt'].head(15))

3                          Hello
56                         Hello
59                         Hello
62                            Hi
172                        hello
196                        Hello
206                        hello
216                  hello there
230                     Fuck you
255                        hello
292    Hello you fool I love you
393                      history
460                          Hi!
520                       Hello!
522                          hi!
Name: user_prompt, dtype: object
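
The gap between 7830 and 8002 can also be inspected directly. The sketch below, run on toy data, counts rows with no category, exactly one, and more than one; `coverage_report` is a hypothetical helper, and the column list is the one defined in the cell above.

```python
import pandas as pd

question_type_cols = ['M', 'why_q', 'whether_q', 'which_q', 'whathow_q', 'hobsons_c']


def coverage_report(df, cols):
    """Count rows by how many of the 0/1 category columns they match."""
    n_assigned = df[cols].sum(axis=1)
    return {
        "none": int((n_assigned == 0).sum()),
        "exactly_one": int((n_assigned == 1).sum()),
        "more_than_one": int((n_assigned > 1).sum()),
    }


# Toy data: one M row, one whether-question, one row with no category
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    columns=question_type_cols,
)
report = coverage_report(toy, question_type_cols)
```

By the counts printed above, 8002 - 7830 = 172 rows would fall in the "none" bucket, given that no row matched more than one category. Filtering on `n_assigned == 0` would show which prompts those are.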


In [23]:
# (6) Save the dataset with all the attributed columns. 
clean_prism.to_csv(output_path, index=False)

# 3. Apply logical operators to gold-standard labelled data to get interrogative types 
The chunk of code below is not directly relevant to classifying the questions. However, to implement the design-based
supervised learning estimator in the next step, I need the final assigned categories for the gold-standard labels.
To ensure that these are assigned in exactly the same way as for the data labelled using the BERT classifiers, I
implement it here in Python using the same sourced function, rather than in the next R script (04_design_based_surverised-learning.R).

In [9]:
# (0) Where are the gold-standard labelled data saved? 
gstan_path = os.path.join(
    PROJECT_ROOT,
    "01_data",
    "06_design-based_supervised-learning",
    "labelled_gold_standard.csv"
)

# (1) Load data
gstan_data = pd.read_csv(gstan_path)

# (2) Define label mapping
label_map = {
    "Q1A": "label_1A",
    "Q1B": "label_1B",
    "Q2A": "label_2A",
    "Q2B": "label_2B",
    "Q2C": "label_2C",
    "Q3A": "label_3A",
    "Q4A": "label_4A"
}

# (3) Rename so that the column names fit the sourced function
gstan_renamed = gstan_data.rename(columns=label_map)

# (4) Apply classification logic
category_assignments = gstan_renamed.apply(assign_question_types_taxonomy, axis=1)

# (5) Add new columns with final assigned categories
question_type_cols = ['M', 'why_q', 'whether_q', 'which_q', 'whathow_q', 'hobsons_c']
gstan_renamed[question_type_cols] = category_assignments

# (6) Go back to the old column names
reverse_label_map = {v: k for k, v in label_map.items()}
gstan_renamed = gstan_renamed.rename(columns=reverse_label_map)

# (7) Save back to CSV - overwrite to avoid too many different files. 
gstan_renamed.to_csv(gstan_path, index=False)
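
The rename, apply, rename-back pattern used above can be illustrated on toy data. This is only a sketch: `toy_taxonomy` is a hypothetical stand-in for the sourced `assign_question_types_taxonomy`, whose actual logic is not shown in this notebook, and the two-entry `label_map` and the `any_q` column are invented for the example.

```python
import pandas as pd

# Shortened, illustrative version of the label mapping
label_map = {"Q1A": "label_1A", "Q1B": "label_1B"}
reverse_label_map = {v: k for k, v in label_map.items()}


def toy_taxonomy(row):
    """Hypothetical stand-in: returns [M, any_q] from label_1A only."""
    is_question = row["label_1A"] == "y"
    return [0 if is_question else 1, 1 if is_question else 0]


gold = pd.DataFrame({"Q1A": ["y", "n"], "Q1B": ["M", "n"]})

# Rename so the function sees the column names it expects
renamed = gold.rename(columns=label_map)

# Expand the returned list into one column per category
flags = renamed.apply(toy_taxonomy, axis=1, result_type="expand")
flags.columns = ["M", "any_q"]

# Attach the new columns, then restore the original column names
result = renamed.join(flags).rename(columns=reverse_label_map)
```

Because the reverse mapping only touches the renamed label columns, the newly added category columns keep their names after the round trip.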