# Summarizing Data for Results Graphics in "R2: Perceived risks and benefits qualitative"

There are two central steps involved:


**Check Szenario Texts:**
- to check if topic was already mentioned for the category specific graph

**Summarizing Data:**
- for A, B graph
- for category specific graph



***
**Coding sources**

Code snippets from my (Julius Fenn) Large Language Model Workshop: https://github.com/FennStatistics/introductory-workshop-in-LLMs/


I extend the code provided and explained in the following documenations:
* Build a PDF ingestion and Question/Answering system: https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/

* https://python.langchain.com/v0.2/docs/tutorials/llm_chain/ (Build a Simple LLM Application with LCEL)
* https://python.langchain.com/v0.2/docs/how_to/structured_output/ (How to return structured data from a model)
* https://python.langchain.com/v0.2/docs/how_to/llm_token_usage_tracking/ (How to track token usage for LLMs)


***
## If you facing issues running your Code:

It could be the case that chroma and langchain-core are not compatible.

In [1]:
## run in your terminal; could be necessary to downgrade package:
# pip uninstall langchain-core
# pip install langchain-core==0.3.10

## or install older version of chromadb:
# pip install --upgrade chromadb==0.5.0

## Get API, local supabase server key(s)

In [2]:
import os
import sys

# Assuming 'src' is one level down (in the current directory or a subdirectory)
path_to_src = os.path.join('src')  # Moves one level down to 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key

## include self-written functions

In [3]:
import src.forChromaApproach as di_drg

# Check Szenario Texts

## Load Scenario Texts
This should be the final scenario texts in English of the two robots:

* rescue robot
* socially assistive robot

In [4]:
path_to_PDFs = os.path.join('data/scenario texts')  # Moves one level up to 'PDFs' folder

pdf_pages = di_drg.load_pdfs_by_filename(path_to_PDFs, verbose=False)

# Optional: Print the loaded pages by filename
for filename, pages in pdf_pages.items():
    print(f"\nPDF: {filename}")
    print(f"Total Pages: {len(pages)}")
    # print(pages[0])


PDF: rescue robot.pdf
Total Pages: 6

PDF: socially assistive robot.pdf
Total Pages: 6


## Data Storage: Text chunks are converted into vector embeddings and stored in a vector database (Vector DB) next to their respective text chunks.

In [5]:
pdf_chunks = di_drg.split_pdf_pages_into_chunks(pdf_pages, chunk_size=500, chunk_overlap=150, verbose=False)

# Optional: Print a summary of chunks created per PDF
for filename, chunks in pdf_chunks.items():
    print(f"\nPDF: {filename}")
    print(f"Total Chunks: {len(chunks)}")


PDF: rescue robot.pdf
Total Chunks: 15

PDF: socially assistive robot.pdf
Total Chunks: 16


In [6]:
path_to_Chroma = os.path.join('DB_Chroma')  # Moves one level up to 'PDFs' folder

sources_DB = di_drg.inspect_chrom(CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
print("Number of sources in DB:", len(sources_DB))
print("\nSources:\n", sources_DB)

# Remove the "PDFs\\" prefix from all entries
cleaned_sources_DB = [pdf.replace('PDFs\\', '').replace('data/scenario texts\\', '') for pdf in sources_DB]

# Print the result
print("\nCleaned sources:\n", cleaned_sources_DB)

  db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OpenAIEmbeddings(api_key=openAI_key))
⚠️ It looks like you upgraded from a version below 0.6 and could benefit from vacuuming your database. Run chromadb utils vacuum --help for more information.


Number of sources in DB: 2

Sources:
 ['data/scenario texts\\socially assistive robot.pdf', 'data/scenario texts\\rescue robot.pdf']

Cleaned sources:
 ['socially assistive robot.pdf', 'rescue robot.pdf']


In [7]:
# if you want to remove your DB:
## di_drg.remove_chrom(CHROMA_PATH=path_to_Chroma)

# pdf_chunks is a dictionary as such we can run over the keys:
for pdf in pdf_chunks.keys():
    if pdf not in cleaned_sources_DB:
        print(f"The PDF '{pdf}' is not included in the DB, as such:")
        print("create DB for", pdf)
        di_drg.save_to_chrom(chunks=pdf_chunks[pdf], CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)

In [8]:
sources_DB = di_drg.inspect_chrom(CHROMA_PATH=path_to_Chroma, openAI_key=key.openAI_key)
print("Number of sources in DB:", len(sources_DB))
print("\nSources:\n", sources_DB)

Number of sources in DB: 2

Sources:
 ['data/scenario texts\\socially assistive robot.pdf', 'data/scenario texts\\rescue robot.pdf']


## Data Retrieval and Generation

your prompt template (system message):

In [9]:
PROMPT_TEMPLATE = """
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"

---

Answer the question based on the above context: {question}
"""

In [10]:
# Question query 2
question = """
What are the advantages of soft robots?
"""

In [11]:
response, source_page_pairs, filtered_hits, all_hits = di_drg.retrieveGenerate(query_text=question, prompt_template=PROMPT_TEMPLATE, openAI_key=key.openAI_key, chroma_path=path_to_Chroma, 
                                                                            docsReturn=10, thresholdSimilarity=0.8)

Number of requested results 150 is greater than number of elements in index 31, updating n_results = 31
  response_text = model.predict(prompt)


Number of possible relevant text chunks found with a threshold similarity of 0.8: 11
Query: Human: 
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "

"
    "must be left to wait  
●  The soft and  adaptable nature of soft robots could potentially create 
challenges in ensuring the safety and reliability of the robot in hazardous 
environments  
●  Rescue robots could be misused, particularly in conflict zones for activities 
such as bomb deployment

---

precision and accuracy  
●  Autonomous  rescue capabilities, allowing robots to carry and transport 
victims to safety   
Possible risks of soft robots for search and rescue missions might be:  
●  Algorithms guiding soft robots may be biased, leading to unfair or 
discriminatory outcomes, regarding  (i) where to co

In [12]:
print(response)

The advantages of soft robots include access to unreachable or dangerous areas for human rescuers, delivery of essential supplies to victims, and reduced risk of injury due to their flexibility and adaptability.


In [13]:
print(source_page_pairs)

[('data/scenario texts\\rescue robot.pdf', 4), ('data/scenario texts\\rescue robot.pdf', 1), ('data/scenario texts\\rescue robot.pdf', 3), ('data/scenario texts\\rescue robot.pdf', 5), ('data/scenario texts\\socially assistive robot.pdf', 3), ('data/scenario texts\\rescue robot.pdf', 4), ('data/scenario texts\\socially assistive robot.pdf', 3), ('data/scenario texts\\socially assistive robot.pdf', 3), ('data/scenario texts\\rescue robot.pdf', 4), ('data/scenario texts\\socially assistive robot.pdf', 5)]


In [14]:
print(len(all_hits))
print(all_hits[0])

11
(Document(metadata={'page': 4, 'source': 'data/scenario texts\\rescue robot.pdf'}, page_content='––––––––––––  Second  Page in Experiment (Intervention) ––––––––––––   \nSoft Robots for Search and Rescue Missions  \nBenefits of soft robots for search and rescue missions might be:  \n●  Access to areas unreachable or too dangerous for human rescuers  \n●  Delivery  of essential supplies (water, food, medicine) until victims are safely  \nextracted  \n●  Reduced risk of injury to victims due to their flexibility and adaptability'), 0.8404701183282403)


# Summarizing Data
- for A, B graph
- for category specific graph


## Load .xlsx files (lists of words)

* rescue robot_multipleSheets
* socially assistive robot_multipleSheets
* rescue robot_socially assistive robot_multipleSheets

> there are seperate columns for new (added concepts), deleted (deleted concepts), constant (not changed concepts)


In [15]:
## set working environment
#> Get the current working directory
print(os.getcwd())
directory = os.getcwd()

/home/fenn/Desktop/Publications/soft robot intervention/Analyses/main study - LLM


In [16]:
import pandas as pd

## Load the xlsx file of the rescue robot and the socially assistive robot combined
# Path to your Excel file
file_path = directory + "/data/" + "rescue robot_social assistance robot_multipleSheets_new" + ".xlsx"
# Load the Excel file
excel_data = pd.ExcelFile(file_path)
# Print the sheet names
print("Sheet names combined:", excel_data.sheet_names)
# Load all sheets into a dictionary of dataframes
all_sheets_Combined = {sheet_name: excel_data.parse(sheet_name) for sheet_name in excel_data.sheet_names}



## Load the xlsx file of the rescue robot and the socially assistive robot combined
# Path to your Excel file
file_path = directory + "/data/" + "rescue robot_multipleSheets_new" + ".xlsx"
# Load the Excel file
excel_data = pd.ExcelFile(file_path)
# Print the sheet names
print("Sheet names RR:", excel_data.sheet_names)
# Load all sheets into a dictionary of dataframes
all_sheets_RR = {sheet_name: excel_data.parse(sheet_name) for sheet_name in excel_data.sheet_names}

## Load the xlsx file of the rescue robot and the socially assistive robot combined
# Path to your Excel file
file_path = directory + "/data/" + "social assistance robot_multipleSheets_new" + ".xlsx"
# Load the Excel file
excel_data = pd.ExcelFile(file_path)
# Print the sheet names
print("Sheet names SAR:", excel_data.sheet_names)
# Load all sheets into a dictionary of dataframes
all_sheets_SAR = {sheet_name: excel_data.parse(sheet_name) for sheet_name in excel_data.sheet_names}

Sheet names combined: ['RCPP', 'LC', 'T', 'SIP', 'HRIP', 'AN', 'SIN', 'R', 'HC', 'RCN', 'SA', 'TP', 'TL', 'RCPN', 'HRIN', 'MT', 'RCA', 'AP']
Sheet names RR: ['RCPP', 'LC', 'T', 'SIP', 'HRIP', 'AN', 'SIN', 'R', 'HC', 'RCN', 'SA', 'TP', 'TL', 'RCPN', 'HRIN', 'MT', 'RCA', 'AP']
Sheet names SAR: ['RCPP', 'LC', 'T', 'SIP', 'HRIP', 'AN', 'SIN', 'R', 'HC', 'RCN', 'SA', 'TP', 'TL', 'RCPN', 'HRIN', 'MT', 'RCA', 'AP']


## Additional dictonaries to provide LLM context

In [17]:
abbreviations_dict = {
    'RCPP': 'perceived positive usefulness (rest category, refers to a classification of arguments that do not fit into any of the predefined categories)',
    'LC': 'perceived low costs',
    'T': 'perceived trust',
    'SIP': 'perceived positive social impact',
    'HRIP': 'perceived positive Human-Robot-Interaction',
    'AN': 'perceived negative anthropomorphism',
    'SIN': 'perceived positive social impact',
    'R': 'perceived risks',
    'HC': 'perceived high costs',
    'RCN': 'neutral rest category (rest category refers to a classification of arguments that do not fit into any of the predefined categories)',
    'SA': 'perceived safety',
    'TP': 'perceived technological possibilities',
    'TL': 'perceived technological limitations',
    'RCPN': 'perceived negative usefulness (rest category, refers to a classification of arguments that do not fit into any of the predefined categories)',
    'HRIN': 'perceived negative Human-Robot-Interaction',
    'MT': 'perceived mistrust',
    'RCA': 'ambivalent rest category (rest category refers to a classification of arguments that do not fit into any of the predefined categories)',
    'AP': 'perceived positive anthropomorphism'
}

print(abbreviations_dict)
print(abbreviations_dict.keys())
print(abbreviations_dict['AN'])

{'RCPP': 'perceived positive usefulness (rest category, refers to a classification of arguments that do not fit into any of the predefined categories)', 'LC': 'perceived low costs', 'T': 'perceived trust', 'SIP': 'perceived positive social impact', 'HRIP': 'perceived positive Human-Robot-Interaction', 'AN': 'perceived negative anthropomorphism', 'SIN': 'perceived positive social impact', 'R': 'perceived risks', 'HC': 'perceived high costs', 'RCN': 'neutral rest category (rest category refers to a classification of arguments that do not fit into any of the predefined categories)', 'SA': 'perceived safety', 'TP': 'perceived technological possibilities', 'TL': 'perceived technological limitations', 'RCPN': 'perceived negative usefulness (rest category, refers to a classification of arguments that do not fit into any of the predefined categories)', 'HRIN': 'perceived negative Human-Robot-Interaction', 'MT': 'perceived mistrust', 'RCA': 'ambivalent rest category (rest category refers to a c

keep only subset of list (which have meaningfull differences):

In [18]:
# Keys to keep
keys_to_keep = ['TP', 'TL', 'SA', 'R', 'HRIP', 'HRIN', 'AP', 'AN']

# Filter the dictionary
abbreviations_dict = {key: abbreviations_dict[key] for key in keys_to_keep}

# Display the filtered dictionary
print(abbreviations_dict)

{'TP': 'perceived technological possibilities', 'TL': 'perceived technological limitations', 'SA': 'perceived safety', 'R': 'perceived risks', 'HRIP': 'perceived positive Human-Robot-Interaction', 'HRIN': 'perceived negative Human-Robot-Interaction', 'AP': 'perceived positive anthropomorphism', 'AN': 'perceived negative anthropomorphism'}


Use dictionaries to map each word to its comment:

In [19]:
import numpy as np

def create_multivalue_dict(df, key_col, value_col):
    """
    Create a dictionary from a DataFrame where each key maps to a list of values.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    key_col (str): The column name to be used as keys.
    value_col (str): The column name to be used as values.
    
    Returns:
    dict: A dictionary where each key maps to a list of values.
    """
    # Remove rows with NaN in the key columns
    df = df.dropna(subset=[key_col])

    # Create a dictionary to map items to their comments, allowing for multiple comments per key
    multivalue_dict = {}
    for key, value in zip(df[key_col], df[value_col]):
        if key in multivalue_dict:
            multivalue_dict[key].append(value)
        else:
            multivalue_dict[key] = [value]

    return multivalue_dict

# Example usage
data = {
    'constant': ['a', 'b', 'a', 'c', 'b', np.nan],
    'constant_comments': ['comment1', 'comment2', 'comment3', 'comment4', 'comment5', 'comment6']
}
df = pd.DataFrame(data)

df_mapping = create_multivalue_dict(df, 'constant', 'constant_comments')
print(df_mapping)


{'a': ['comment1', 'comment3'], 'b': ['comment2', 'comment5'], 'c': ['comment4']}


test create_multivalue_dict() function:

In [20]:
sheet_name = "SA"

In [21]:
# Remove rows with NaN in the key columns as they cannot be used as dictionary keys
#> not sensitive to multiple identical keys: dict(zip(df['constant'], df['constant_comments']))
df = all_sheets_Combined[sheet_name]

constant_comments_mapping = create_multivalue_dict(df, 'constant', 'constant_comments')
print("mapping constant x comments:", constant_comments_mapping)
print(len(constant_comments_mapping))

new_comments_mapping = create_multivalue_dict(df, 'new', 'new_comments')
print("mapping new x comments:", new_comments_mapping)
print(len(new_comments_mapping))

deleted_comments_mapping = create_multivalue_dict(df, 'deleted', 'deleted_comments')
print("mapping deleted x comments:", deleted_comments_mapping)
print(len(deleted_comments_mapping))

mapping constant x comments: {'quick hygienic help': [nan], 'efficient': ['Work on statistics can be made more efficient through probabilities.', nan, nan, nan], 'safety': ['Prevents people from having to work under difficult or dangerous conditions to increase safety.', 'Enables deployment in difficult situations where human rescuers would put themselves in danger', nan, nan, nan, 'through reliability', 'No people are being put in danger'], 'remote control': ['Can be operated by experts via remote control or autonomous systems to perform operations from a safe distance.', nan, nan, nan, nan, 'Minimizes risks associated with autonomy'], 'speed and efficiency': ['Robots can often act faster than humans and increase the speed of emergency interventions', 'Improvement of speed and efficiency of rescue operations', nan], 'safe': ['People do not have to put themselves in dangerous situations'], 'reliable': ['Can be precisely controlled', nan, nan, nan, nan], 'health monitoring': [nan, 'Peop

In [22]:
def combine_dicts(dict1, dict2):
    """
    Combine two dictionaries where each key maps to a list of values.
    
    Parameters:
    dict1 (dict): The first dictionary.
    dict2 (dict): The second dictionary.
    
    Returns:
    dict: A combined dictionary where each key maps to a concatenated list of values.
    """
    combined_dict = dict1.copy()
    for key, values in dict2.items():
        if key in combined_dict:
            combined_dict[key].extend(values)
        else:
            combined_dict[key] = values
    return combined_dict

test combine_dicts() function:

In [23]:
constant_new_comments_mapping = combine_dicts(constant_comments_mapping, new_comments_mapping)
print("mapping constant, new x comments:", constant_new_comments_mapping)
print(len(constant_new_comments_mapping))


mapping constant, new x comments: {'quick hygienic help': [nan], 'efficient': ['Work on statistics can be made more efficient through probabilities.', nan, nan, nan], 'safety': ['Prevents people from having to work under difficult or dangerous conditions to increase safety.', 'Enables deployment in difficult situations where human rescuers would put themselves in danger', nan, nan, nan, 'through reliability', 'No people are being put in danger', 'Soft material reduces risk of injury'], 'remote control': ['Can be operated by experts via remote control or autonomous systems to perform operations from a safe distance.', nan, nan, nan, nan, 'Minimizes risks associated with autonomy'], 'speed and efficiency': ['Robots can often act faster than humans and increase the speed of emergency interventions', 'Improvement of speed and efficiency of rescue operations', nan], 'safe': ['People do not have to put themselves in dangerous situations'], 'reliable': ['Can be precisely controlled', nan, nan

## Data for Bubble Graph (G2)

Prompt to get list of arguments and explenations:

In [24]:
from langchain_core.prompts import ChatPromptTemplate


system_template = """
You are a researcher summarizing two word lists that represent people's assessments of rigid and soft robots, whereby laypersons were informed about the risks and benefits of {robots} through scenario texts.

Participants shared their views on traditional rigid robots in a "rigid" list and on flexible, electronic-free soft robots in a "soft" list after learning about their respective risks and benefits. 
The overall theme is {topicCategory}.

Both "rigid" and "soft" lists are dictionaries with argument keys and comment values. If [nan] appears, it means no comment was provided; repeated entries or [nan] values indicate that the argument was emphasized multiple times.

Your task:
Summarize the main points for each category into a JSON object. Fill the two arrays within the JSON object, "rigid_arguments" for the "rigid" list and "soft_arguments" for the "soft" list. 
Summarize the main points for each category in a JSON object. Populate the two arrays, "rigid_arguments" and "soft_arguments," with up to five key arguments each. 
Include a brief, interrelated explanation (up to two sentences) for each argument, derived from the provided lists.


Output Format:

{{
  "assessments": {{
    "rigid_arguments": [
      {{
        "argument": "argument1",
        "explanation": "explanation of argument1"
      }},
      {{
        "argument": "argument2",
        "explanation": "explanation of argument2"
      }},
      ...
    ],
    "soft_arguments": [
      {{
        "argument": "argument1",
        "explanation": "explanation of argument1"
      }},
      {{
        "argument": "argument2",
        "explanation": "explanation of argument2"
      }},
      ...
    ]
  }}
}}

Please respond with the entire JSON structure as specified, providing up to five arguments for each list, and without any additional commentary or context.
"""


user_template = """List "rigid": 
{rigid}

List "soft": 
{soft}"""

# rescue robots and socially assistive robots
prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template), ("user", user_template)]
)

template_out = prompt_template.invoke({"robots": "rescue robots and socially assistive robots", "topicCategory": abbreviations_dict[sheet_name], "rigid": constant_comments_mapping, "soft": new_comments_mapping})
print(template_out)

print("template_out:", template_out)
print("template_out.to_messages():", template_out.to_messages())

messages=[SystemMessage(content='\nYou are a researcher summarizing two word lists that represent people\'s assessments of rigid and soft robots, whereby laypersons were informed about the risks and benefits of rescue robots and socially assistive robots through scenario texts.\n\nParticipants shared their views on traditional rigid robots in a "rigid" list and on flexible, electronic-free soft robots in a "soft" list after learning about their respective risks and benefits. \nThe overall theme is perceived safety.\n\nBoth "rigid" and "soft" lists are dictionaries with argument keys and comment values. If [nan] appears, it means no comment was provided; repeated entries or [nan] values indicate that the argument was emphasized multiple times.\n\nYour task:\nSummarize the main points for each category into a JSON object. Fill the two arrays within the JSON object, "rigid_arguments" for the "rigid" list and "soft_arguments" for the "soft" list. \nSummarize the main points for each catego

### Single Run

Function to call LLM:

In [25]:
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback

def huggingface_API_call(
    prompt,
    
    robots,
    topicCategory,
    dictonaryRigid,
    dictonarySoft,
    
    api_key=key.hugging_api_key,

    model_name="meta-llama/Meta-Llama-3-70B-Instruct",
    json_schema=None,
    max_tokens=1000,
    temperature=0.2,
    verbose=True
):

    # Initialize the ChatOpenAI model for Hugging Face
    model = ChatOpenAI(
        model=model_name,
        openai_api_key=api_key,
        openai_api_base="https://api-inference.huggingface.co/v1/",
        max_tokens=max_tokens,
        temperature=temperature
    )

    # Check if structured output is required and configure it
    if json_schema:
        structured_llm = model.with_structured_output(json_schema, include_raw=True)
        chain = prompt | structured_llm
    else:
        chain = prompt | model

    # Execute the model and output response details
    with get_openai_callback() as cb:
        response = chain.invoke(
            {"robots": robots, "topicCategory": topicCategory, "rigid": dictonaryRigid, "soft": dictonarySoft}
        )
        
        if cb.total_tokens > max_tokens:
            print("Warning: The response may be incomplete due to exceeding the maximum token limit.")
        
        if verbose:
            print(cb)
            print(f"Total Tokens: {cb.total_tokens}")
            print(f"Prompt Tokens: {cb.prompt_tokens}")
            print(f"Completion Tokens: {cb.completion_tokens}")
            print(f"Total Cost (USD): ${cb.total_cost}")

    return response

In [26]:
response = huggingface_API_call(
    prompt=prompt_template,
    
    robots="rescue robots and socially assistive robots",
    topicCategory=abbreviations_dict[sheet_name],
    dictonaryRigid=constant_comments_mapping,
    dictonarySoft=new_comments_mapping,
    
    api_key=key.hugging_api_key,
    model_name="meta-llama/Meta-Llama-3-70B-Instruct",

    json_schema=None,
    max_tokens=4000,
    temperature=0.2,
    verbose=True
)

Tokens Used: 4407
	Prompt Tokens: 3949
	Completion Tokens: 458
Successful Requests: 1
Total Cost (USD): $0.0
Total Tokens: 4407
Prompt Tokens: 3949
Completion Tokens: 458
Total Cost (USD): $0.0


result - structured JSON output:

In [27]:
response

AIMessage(content='{\n  "assessments": {\n    "rigid_arguments": [\n      {\n        "argument": "safety",\n        "explanation": "Rigid robots can prevent people from having to work under difficult or dangerous conditions, increasing safety and reliability."\n      },\n      {\n        "argument": "efficiency",\n        "explanation": "Rigid robots can work faster and more efficiently than humans, improving the speed and effectiveness of rescue operations."\n      },\n      {\n        "argument": "reliability",\n        "explanation": "Rigid robots can be precisely controlled and are always ready, making them a reliable option for rescue operations."\n      },\n      {\n        "argument": "strength",\n        "explanation": "Rigid robots can perform tasks that require great strength, such as lifting debris, and can operate in environments that would be toxic to humans."\n      },\n      {\n        "argument": "accessibility",\n        "explanation": "Rigid robots can access areas th

In [28]:
response.content

'{\n  "assessments": {\n    "rigid_arguments": [\n      {\n        "argument": "safety",\n        "explanation": "Rigid robots can prevent people from having to work under difficult or dangerous conditions, increasing safety and reliability."\n      },\n      {\n        "argument": "efficiency",\n        "explanation": "Rigid robots can work faster and more efficiently than humans, improving the speed and effectiveness of rescue operations."\n      },\n      {\n        "argument": "reliability",\n        "explanation": "Rigid robots can be precisely controlled and are always ready, making them a reliable option for rescue operations."\n      },\n      {\n        "argument": "strength",\n        "explanation": "Rigid robots can perform tasks that require great strength, such as lifting debris, and can operate in environments that would be toxic to humans."\n      },\n      {\n        "argument": "accessibility",\n        "explanation": "Rigid robots can access areas that are difficult o

In [29]:
import json
import re

try:
    # Attempt to parse the response directly as JSON
    data = json.loads(response.content)
    # print("Valid JSON object:", json.dumps(data, indent=2))
except json.JSONDecodeError:
    # If not valid JSON, handle extraction using regex to match JSON block between triple backticks (```)
    json_match = re.search(r'```(.*?)```', response.content, re.DOTALL)
    # If JSON block is found, parse it
    if json_match:
        json_text = json_match.group(1).strip()  # Extract JSON text and strip whitespace
        try:
            data = json.loads(json_text)   # Parse JSON
            print("Valid JSON object after regex:", json.dumps(data, indent=2))
        except json.JSONDecodeError as e:
            print("Failed to parse JSON:", e)
    else:
        print("No JSON object found.")

In [30]:
# Extract rigid and soft arguments and format them into a DataFrame
arguments = []
for category, items in data['assessments'].items():
    for item in items:
        arguments.append({
            'type': category.split('_')[0],  # Extracts 'rigid' or 'soft' from 'rigid_arguments'/'soft_arguments'
            'argument': item['argument'],
            'explanation': item['explanation']
        })

# Create DataFrame
df = pd.DataFrame(arguments)

df["category"] = sheet_name

# Display the DataFrame
print(df)

    type              argument  \
0  rigid                safety   
1  rigid            efficiency   
2  rigid           reliability   
3  rigid              strength   
4  rigid         accessibility   
5   soft  lower risk of injury   
6   soft           flexibility   
7   soft           reliability   
8   soft          adaptability   
9   soft       care of victims   

                                         explanation category  
0  Rigid robots can prevent people from having to...       SA  
1  Rigid robots can work faster and more efficien...       SA  
2  Rigid robots can be precisely controlled and a...       SA  
3  Rigid robots can perform tasks that require gr...       SA  
4  Rigid robots can access areas that are difficu...       SA  
5  Soft robots pose a lower risk of injury to vic...       SA  
6  Soft robots are flexible and can access narrow...       SA  
7  Soft robots are reliable and can be used to de...       SA  
8  Soft robots can adapt to different situations 

### Multiple Runs

Higher order function to call LLM for all categories:

In [31]:
import json
import pandas as pd
import re

def process_robot_data(type_robot):
    # Define naming based on type_robot
    if type_robot == "RR":
        all_sheets = all_sheets_RR
        naming_robots = "rescue robots"
    elif type_robot == "SAR":
        all_sheets = all_sheets_SAR
        naming_robots = "socially assistive robots"
    elif type_robot == "Combined":
        all_sheets = all_sheets_Combined
        naming_robots = "rescue robots and socially assistive robots"
    else:
        raise ValueError("Invalid type_robot specified.")

    # Initialize an empty DataFrame for concatenation
    final_df = pd.DataFrame()

    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")

        # Load the specific DataFrame for the current category
        df = all_sheets[category]

        # Generate constant and new comments mappings
        constant_comments_mapping = create_multivalue_dict(df, 'constant', 'constant_comments')
        new_comments_mapping = create_multivalue_dict(df, 'new', 'new_comments')

        # Call API with specified parameters
        response = huggingface_API_call(
            prompt=prompt_template,
            robots=naming_robots,
            topicCategory=abbreviations_dict[category],
            dictonaryRigid=constant_comments_mapping,
            dictonarySoft=new_comments_mapping,
            api_key=key.hugging_api_key,
            model_name="meta-llama/Meta-Llama-3-70B-Instruct",
            json_schema=None,
            max_tokens=4200,
            temperature=0.0,
            verbose=False
        )
        
        # print("response.content:\n", response.content)
        #  data = json.loads(response.content)
        # Regular expression to match JSON block between triple backticks (```)
        try:
            # Attempt to parse the response directly as JSON
            data = json.loads(response.content)
            # print("Valid JSON object:", json.dumps(data, indent=2))
        except json.JSONDecodeError:
            # If not valid JSON, handle extraction using regex to match JSON block between triple backticks (```)
            json_match = re.search(r'```(.*?)```', response.content, re.DOTALL)
            # If JSON block is found, parse it
            if json_match:
                json_text = json_match.group(1).strip()  # Extract JSON text and strip whitespace
                try:
                    data = json.loads(json_text)   # Parse JSON
                    # print("Valid JSON object after regex:", json.dumps(data, indent=2))
                except json.JSONDecodeError as e:
                    print("Failed to parse JSON:", e)
                    # break # !!!
            else:
                print("No JSON object found.")
                # break # !!!

        # Extract arguments and format them into a temporary DataFrame
        arguments = []
        for arg_category, items in data['assessments'].items():
            for item in items:
                arguments.append({
                    'type': arg_category.split('_')[0],
                    'argument': item['argument'],
                    'explanation': item['explanation']
                })

        df_tmp = pd.DataFrame(arguments)
        df_tmp["category"] = category
        
        print(f"length of df_tmp: {len(df_tmp)}")

        # Concatenate the current DataFrame to the final DataFrame
        final_df = pd.concat([final_df, df_tmp], ignore_index=True)

    return final_df

logic if process_robot_data() should be run:

In [32]:
run_process_robot_data = False # True

for rescue robots:

In [33]:
# Path to your Excel file
file_path = directory + "/output/G2/" + "rescue robots" + ".xlsx"

if run_process_robot_data:
    df_RR = process_robot_data(type_robot="RR")
    # save the dataframe to an Excel file
    df_RR.to_excel(file_path, index=False)
else:
    df_RR = pd.read_excel(file_path)

for socially assistive robots:

In [34]:
# Path to your Excel file
file_path = directory + "/output/G2/" + "socially assistive robots" + ".xlsx"
    
if run_process_robot_data:
    df_SAR = process_robot_data(type_robot="SAR")
    # save the dataframe to an Excel file
    df_SAR.to_excel(file_path, index=False)
else:
    df_SAR = pd.read_excel(file_path)

for rescue robots and socially assistive robots:

In [35]:
# Path to your Excel file
file_path = directory + "/output/G2/" + "rescue robots AND socially assistive robots" + ".xlsx"

if run_process_robot_data:
    df_Combined = process_robot_data(type_robot="Combined")
    # save the dataframe to an Excel file
    df_Combined.to_excel(file_path, index=False)
else:
    df_Combined = pd.read_excel(file_path)

### Summarize Generated Data for A, B graph (G2)

In [36]:
tmp_rigid = df_RR[(df_RR["category"] == "TP") & (df_RR["type"] == "rigid")]
tmp_soft = df_RR[(df_RR["category"] == "TP") & (df_RR["type"] == "soft")]

tmp_string_rigid = 'Arguments for "rigid" robots:'
for index, row in tmp_rigid.iterrows():
    tmp_string_rigid += " \n " + row["argument"]
    tmp_string_rigid += ": " + row["explanation"]
    
tmp_string_soft = 'Arguments for "soft" robots:'
for index, row in tmp_soft.iterrows():
    tmp_string_soft += " \n " + row["argument"]
    tmp_string_soft += ": " + row["explanation"]

In [37]:
tmp_string_rigid

'Arguments for "rigid" robots: \n new places: Rigid robots can search in places where people cannot reach, such as underwater caves or narrow openings. \n resilience: Rigid robots can withstand adverse conditions, making them more effective in disaster areas. \n special abilities: Rigid robots can perform special tasks like flying, hacking doors, or sending images to the control center with cameras. \n environment-independent: Rigid robots can operate in various environments, including air, water, and ground, and can withstand toxic or narrow environments. \n more power: Rigid robots can have more strength than humans, allowing them to perform tasks that require heavy lifting or drilling.'

In [38]:
tmp_string_soft

'Arguments for "soft" robots: \n deliver supplies: Soft robots can deliver essential goods like food, water, and medicine to victims in hard-to-reach areas. \n accessible: Soft robots can reach inaccessible places due to their small size and high flexibility, allowing them to supply victims with vital resources. \n care for victims: Soft robots can provide care for victims during the rescue operation, such as delivering food and medicine. \n adaptability: Soft robots can adapt to complex problems and changing situations, making them effective in disaster areas. \n temporary supply: Soft robots can provide temporary supply of vital resources to victims until human rescuers arrive.'

Prompt to get summary (focus on overlapping, diverging arguments) for A, B graph regarding single categories:

In [39]:
from langchain_core.prompts import ChatPromptTemplate


system_template = """
You are a researcher summarizing central arguments and their explenations of people's assessments of rigid and soft robots, 
whereby laypersons were informed about the risks and benefits of {robots} through scenario texts.

Participants shared their central arguments and explenations on traditional rigid robots in arguments for "rigid" robots 
and on flexible, electronic-free soft robots in arguments for "soft" robots.

The overall theme of these arguments is {topicCategory}.

Your task:

Write a paragraph highlighting the commonalities of the arguments for rigid and soft robots followed by a brief discussion of the main differences, 
focusing stronger on the arguments for soft robots. 

The paragraph should be limited to four sentences. Provide only the paragraph without any additional commentary or context.
"""


user_template = """arguments for "rigid" robots: 
{rigid}

arguments for "soft" robots: 
{soft}"""

# rescue robots and socially assistive robots
prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template), ("user", user_template)]
)

template_out = prompt_template.invoke({"robots": "rescue robots", "topicCategory": abbreviations_dict["TP"], "rigid": tmp_string_rigid, "soft": tmp_string_soft})
print(template_out)

print("template_out:", template_out)
print("template_out.to_messages():", template_out.to_messages())

messages=[SystemMessage(content='\nYou are a researcher summarizing central arguments and their explenations of people\'s assessments of rigid and soft robots, \nwhereby laypersons were informed about the risks and benefits of rescue robots through scenario texts.\n\nParticipants shared their central arguments and explenations on traditional rigid robots in arguments for "rigid" robots \nand on flexible, electronic-free soft robots in arguments for "soft" robots.\n\nThe overall theme of these arguments is perceived technological possibilities.\n\nYour task:\n\nWrite a paragraph highlighting the commonalities of the arguments for rigid and soft robots followed by a brief discussion of the main differences, \nfocusing stronger on the arguments for soft robots. \n\nThe paragraph should be limited to four sentences. Provide only the paragraph without any additional commentary or context.\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='arguments for "rigid" robots: \n

#### Single Run

Function to call LLM:

In [40]:
response = huggingface_API_call(
    prompt=prompt_template,
    
    robots="rescue robots",
    topicCategory=abbreviations_dict["TP"],
    dictonaryRigid=tmp_string_rigid,
    dictonarySoft=tmp_string_soft,
    
    api_key=key.hugging_api_key,
    model_name="meta-llama/Meta-Llama-3-70B-Instruct",

    json_schema=None,
    max_tokens=4000,
    temperature=0.2,
    verbose=True
)

Tokens Used: 573
	Prompt Tokens: 444
	Completion Tokens: 129
Successful Requests: 1
Total Cost (USD): $0.0
Total Tokens: 573
Prompt Tokens: 444
Completion Tokens: 129
Total Cost (USD): $0.0


In [41]:
response

AIMessage(content='The commonalities between the arguments for rigid and soft robots lie in their perceived ability to access and operate in challenging environments, as well as their potential to provide essential resources and care to victims in disaster areas. Both types of robots are seen as capable of reaching inaccessible places and performing tasks that humans cannot. However, the arguments for soft robots emphasize their adaptability, flexibility, and ability to provide care and temporary supply of vital resources, highlighting their potential for more nuanced and gentle interactions with victims. In contrast, the arguments for rigid robots focus more on their strength, resilience, and special abilities, suggesting a more robust and task-oriented approach.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 129, 'prompt_tokens': 444, 'total_tokens': 573, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': '

In [42]:
response.content

'The commonalities between the arguments for rigid and soft robots lie in their perceived ability to access and operate in challenging environments, as well as their potential to provide essential resources and care to victims in disaster areas. Both types of robots are seen as capable of reaching inaccessible places and performing tasks that humans cannot. However, the arguments for soft robots emphasize their adaptability, flexibility, and ability to provide care and temporary supply of vital resources, highlighting their potential for more nuanced and gentle interactions with victims. In contrast, the arguments for rigid robots focus more on their strength, resilience, and special abilities, suggesting a more robust and task-oriented approach.'

### Multiple Runs

Higher order function to call LLM for all categories:

In [43]:
import pandas as pd

def summarize_processed_robot_data(type_robot):
    # Define naming based on type_robot
    if type_robot == "RR":
        tmp_df = df_RR
        naming_robots = "rescue robots"
    elif type_robot == "SAR":
        tmp_df = df_SAR
        naming_robots = "socially assistive robots"
    elif type_robot == "Combined":
        tmp_df = df_Combined
        naming_robots = "rescue robots and socially assistive robots"
    else:
        raise ValueError("Invalid type_robot specified.")


    # Initialize an empty DataFrame for concatenation
    data_array = []
    
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")

        # Load the specific DataFrame for the current category
        tmp_rigid = tmp_df[(tmp_df["category"] == category) & (tmp_df["type"] == "rigid")]
        tmp_soft = tmp_df[(tmp_df["category"] == category) & (tmp_df["type"] == "soft")]
        
        tmp_string_rigid = 'Arguments for "rigid" robots:'
        for index, row in tmp_rigid.iterrows():
            tmp_string_rigid += " \n " + row["argument"]
            tmp_string_rigid += ": " + row["explanation"]

        tmp_string_soft = 'Arguments for "soft" robots:'
        for index, row in tmp_soft.iterrows():
            tmp_string_soft += " \n " + row["argument"]
            tmp_string_soft += ": " + row["explanation"]
        
        
        # Call API with specified parameters
        response = huggingface_API_call(
            prompt=prompt_template,
            
            robots=naming_robots,
            topicCategory=abbreviations_dict[category],
            dictonaryRigid=tmp_string_rigid,
            dictonarySoft=tmp_string_soft,
            
            api_key=key.hugging_api_key,
            model_name="meta-llama/Meta-Llama-3-70B-Instruct",

            json_schema=None,
            max_tokens=4200,
            temperature=0.0,
            verbose=False
        )
        
        data_array.append({'category': category, 'summary':response.content})

    final_df = pd.DataFrame(data_array)
    return final_df

In [44]:
run_summary_processed_robot_data = False # True

for rescue robots - summary:

In [45]:
# Path to your Excel file
file_path = directory + "/output/G2/" + "rescue robots summary" + ".xlsx"

if run_summary_processed_robot_data:
    df_RR_summary = summarize_processed_robot_data(type_robot="RR")
    # save the dataframe to an Excel file
    df_RR_summary.to_excel(file_path, index=False)
else:
    df_RR_summary = pd.read_excel(file_path)

for socially assistive robots - summary:

In [46]:
# Path to your Excel file
file_path = directory + "/output/G2/" + "socially assistive robots summary" + ".xlsx"
    
if run_summary_processed_robot_data:
    df_SAR_summary = summarize_processed_robot_data(type_robot="SAR")
    # save the dataframe to an Excel file
    df_SAR_summary.to_excel(file_path, index=False)
else:
    df_SAR_summary = pd.read_excel(file_path)

## Data for Category Specific Graph (G3)

All the loaded files are originally from the following GitHub page: https://github.com/PerttuHamalainen/LLMCode

In [47]:
import os
import sys

# Assuming 'src' is one level down (in the current directory or a subdirectory)
path_to_src = os.path.join('src/LLMCode')  # Moves one level down to 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your modules
#import llms as LLMCode_LLMS
#import coding as LLMCode_coding

import src.LLMCode as LLMCode

load openAI key into environment:

In [48]:
import os

os.environ["OPENAI_API_KEY"] = key.openAI_key

initalize LLM:

In [49]:
LLMCode.init(API="OpenAI")

## Single Run

### prepare data

In [50]:
print("sheet_name:", sheet_name)

# Remove rows with NaN in the key columns as they cannot be used as dictionary keys
#> not sensitive to multiple identical keys: dict(zip(df['constant'], df['constant_comments']))
df = all_sheets_RR[sheet_name]

constant_comments_mapping = create_multivalue_dict(df, 'constant', 'constant_comments')
print("mapping constant x comments:", constant_comments_mapping)
print(len(constant_comments_mapping))

new_comments_mapping = create_multivalue_dict(df, 'new', 'new_comments')
print("mapping new x comments:", new_comments_mapping)
print(len(new_comments_mapping))

deleted_comments_mapping = create_multivalue_dict(df, 'deleted', 'deleted_comments')
print("mapping deleted x comments:", deleted_comments_mapping)
print(len(deleted_comments_mapping))

sheet_name: SA
mapping constant x comments: {'safety': ['Prevents people from having to work under difficult or dangerous conditions to increase safety.', 'Enables deployment in difficult situations where human rescuers would put themselves in danger', nan, nan, nan, 'through reliability', 'No people are being put in danger'], 'remote control': ['Can be operated by experts via remote control or autonomous systems to perform operations from a safe distance.', nan, nan, nan, nan, 'Minimizes risks associated with autonomy'], 'speed and efficiency': ['Robots can often act faster than humans and increase the speed of emergency interventions', 'Improvement of speed and efficiency of rescue operations', nan], 'safe': ['People do not have to put themselves in dangerous situations'], 'reliable': ['Can be precisely controlled', nan, nan, nan, nan], 'no people necessary': [nan], 'stronger than humans': [nan, 'In many situations, e.g. when clearing rubble, a great advantage.', nan, nan], 'faster t

In [51]:
constant_new_comments_mapping = combine_dicts(constant_comments_mapping, new_comments_mapping)
print("mapping constant, new x comments:", constant_new_comments_mapping)
print(len(constant_new_comments_mapping))


mapping constant, new x comments: {'safety': ['Prevents people from having to work under difficult or dangerous conditions to increase safety.', 'Enables deployment in difficult situations where human rescuers would put themselves in danger', nan, nan, nan, 'through reliability', 'No people are being put in danger'], 'remote control': ['Can be operated by experts via remote control or autonomous systems to perform operations from a safe distance.', nan, nan, nan, nan, 'Minimizes risks associated with autonomy'], 'speed and efficiency': ['Robots can often act faster than humans and increase the speed of emergency interventions', 'Improvement of speed and efficiency of rescue operations', nan], 'safe': ['People do not have to put themselves in dangerous situations'], 'reliable': ['Can be precisely controlled', nan, nan, nan, nan], 'no people necessary': [nan], 'stronger than humans': [nan, 'In many situations, e.g. when clearing rubble, a great advantage.', nan, nan], 'faster than humans

In [52]:
# Sample dictionary with NaN values
dictionary = constant_new_comments_mapping

# Initialize the list to store single concepts
single_concepts = []

# Iterate through the dictionary
for key, values in dictionary.items():
    for value in values:
        # Check if the value is not NaN
        if isinstance(value, str):
            single_concepts.append(f"{key}: {value}")
        else:
            single_concepts.append(key)

# Print the result
print(len(single_concepts))


# Filter unique entries with more than one word
single_concepts_unique = list({entry for entry in single_concepts if len(entry.split()) > 1})

# Print the result
print(len(single_concepts_unique))

262
209


In [53]:
print(single_concepts_unique)

['toxic places: Places where people can only breathe with special equipment (gas mask)', 'saves lives', 'no people necessary', 'more flexible than humans', 'reaching hard-to-access places: Robots and drones are small and agile. Therefore, they can reach places that are difficult or even dangerous for humans to reach.', 'reduced risk: By using robots, no human forces are exposed to dangerous situations', 'fewer mistakes', 'quick help', 'quick deployment', 'less risk: Specifically less risk to further human lives, as rescue teams can use robots', 'access to areas: Access to areas difficult or only reachable through considerable dangers for people', 'are mentally superior: No trauma for rescuers', 'stronger than human: Specially equipped machines could help trapped victims quicker and more efficiently.', 'sparing human life: If robots are used instead of human rescuers, the lives and health of the rescuers do not have to be put at risk.', 'low risks: Soft robots pose only a low risk to th

In [54]:
import random

print(len(single_concepts))

# Draw 30 random entries (if the list has less than 30 entries, it will return the entire list)
single_concepts = random.sample(single_concepts_unique, min(20, len(single_concepts_unique)))

print(len(single_concepts))

262
20


In [55]:
single_concepts

['rescue of people',
 'quick action',
 'faster intervention',
 'save many people',
 'less risk',
 'access to hard-to-reach places',
 'fire department support',
 'accuracy of execution',
 'speed and efficiency',
 'Stronger than human: As already mentioned, many people lost their lives in the earthquake in Turkey, and this was because human strength was not enough at a certain point. For example, the walls that fell on people could not be lifted, which is why arms and legs had to be cut off. If the robot was there, it could show more strength and lift the wall.',
 'eyes and ears: for the rescue forces from a safe distance',
 'no people necessary',
 'lower risk of injury: Lower risk of injury due to the adaptability of the robot',
 'danger: less human danger',
 'delivery of goods: Robots can, for example, deliver goods in war zones, which would be dangerous for humans.',
 'Faster than human: In an earthquake, for example, recently in Turkey, the help arrived too late and this led to more 

### run LLMCode (inductive)

a toolkit for AI-assisted qualitative data analysis

In [56]:
import nest_asyncio
nest_asyncio.apply()



# Sample inputs for function arguments

texts = single_concepts

research_question = "What are the key perceived benefits and risks regarding safety of the search and rescue robot?"


few_shot_examples = pd.DataFrame({
    "text": [
        "The game and it's graphics, music and story made me feel calm and happy  in a way nothing else could at the time. Playing it felt like a journey to another, better place , and that's art to me.",
        "I played the game Kairo, or Cairo I can't remember, it was an atmospheric puzzle game with big rooms filled with mist and interesting lighting, all the textures were concrete"],
    "coded_text": [
        "The game and it's graphics, music and story **made me feel calm and happy**<sup>emotional response</sup> in a way nothing else could at the time. **Playing it felt like a journey to another, better place**<sup>setting; immersion</sup>, and that's art to me.",
        "I played the game Kairo, or Cairo I can't remember, **it was an atmospheric puzzle game with big rooms filled with mist and interesting lighting, all the textures were concrete**<sup>setting; creativity</sup>"
    ]
})



gpt_model = "gpt-4o"  # or another preferred GPT model
use_cache = True
max_tokens = 150  # Specify maximum tokens for each prompt if necessary
verbose = True
topicCategory = abbreviations_dict[sheet_name]


# Now you can run:
coded_texts, code_descriptions = LLMCode.code_inductively_with_code_consistency_adj(
    texts=texts,
    research_question=research_question,
    topicCategory=topicCategory,
    few_shot_examples=few_shot_examples,
    gpt_model=gpt_model,
    use_cache=use_cache,
    max_tokens=max_tokens,
    verbose=verbose
)

Original text: "danger: less human danger"
LLM output: "I'm sorry, but the text provided does not contain any insightful statements relevant to the research question, so there are no coded statements to highlight."
Text reconstruction successful

Had to reconstruct 1 texts due to LLM errors
Original text: "are mentally superior: No trauma for rescuers"
LLM output: "I'm sorry, but the text provided does not contain any statements that can be highlighted or coded based on the given instructions and examples."
Text reconstruction successful

Had to reconstruct 1 texts due to LLM errors
 |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 


In [57]:
print(len(coded_texts))
coded_texts

20


['rescue of people',
 'quick action',
 'faster intervention',
 'save many people',
 'less risk',
 'access to hard-to-reach places',
 'fire department support',
 'accuracy of execution',
 'speed and efficiency',
 '**Stronger than human**<sup>robot capability</sup>: As already mentioned, many people lost their lives in the earthquake in Turkey, and this was because **human strength was not enough at a certain point**<sup>human limitation</sup>. For example, the walls that fell on people could not be lifted, which is why arms and legs had to be cut off. **If the robot was there, it could show more strength and lift the wall**<sup>robot capability; potential life-saving</sup>.',
 '**eyes and ears: for the rescue forces from a safe distance**<sup>safety benefit; human limitation</sup>',
 'no people necessary',
 '**lower risk of injury: Lower risk of injury due to the adaptability of the robot**<sup>safety benefit</sup>',
 'danger: less human danger',
 'delivery of goods: **Robots can, for e

In [58]:
print(len(code_descriptions))
code_descriptions

4


{'robot capability': "Captures instances where participants highlight the robot's superior physical strength compared to humans, emphasizing its ability to perform tasks that require significant force or endurance in search and rescue operations.",
 'human limitation': 'Captures instances where participants highlight the constraints of human physical capabilities in search and rescue operations, emphasizing the need for robotic assistance when human strength and endurance reach their limits.',
 'potential life-saving': "Captures instances where participants perceive the robot's ability to perform physically demanding tasks, such as lifting heavy debris, as a crucial factor in potentially saving lives during search and rescue operations.",
 'safety benefit': 'Captures instances where participants highlight the advantage of using search and rescue robots to enhance safety by allowing rescue personnel to assess hazardous environments remotely, thereby reducing their exposure to potential 

In [59]:
import json

my_array = [
    {'robot': 'RR', 'category': 'SA', 'coded_texts': coded_texts, 'code_descriptions': code_descriptions}
]
print("my_array:\n", my_array)

file_path = directory + "/output/G3/"


# Save to JSON file
with open(file_path + 'output_LLMcode_test.json', 'w') as file:
    json.dump(my_array, file, indent=4)

# To load the data back
with open(file_path + 'output_LLMcode_test.json', 'r') as file:
    loaded_array = json.load(file)

# Print loaded data
print("loaded_array:\n", loaded_array)

my_array:
 [{'robot': 'RR', 'category': 'SA', 'coded_texts': ['rescue of people', 'quick action', 'faster intervention', 'save many people', 'less risk', 'access to hard-to-reach places', 'fire department support', 'accuracy of execution', 'speed and efficiency', '**Stronger than human**<sup>robot capability</sup>: As already mentioned, many people lost their lives in the earthquake in Turkey, and this was because **human strength was not enough at a certain point**<sup>human limitation</sup>. For example, the walls that fell on people could not be lifted, which is why arms and legs had to be cut off. **If the robot was there, it could show more strength and lift the wall**<sup>robot capability; potential life-saving</sup>.', '**eyes and ears: for the rescue forces from a safe distance**<sup>safety benefit; human limitation</sup>', 'no people necessary', '**lower risk of injury: Lower risk of injury due to the adaptability of the robot**<sup>safety benefit</sup>', 'danger: less human d

## Multiple Runs

### run LLMCode (inductive)

a toolkit for AI-assisted qualitative data analysis


few shots examples are fixed for all runs:

In [60]:
few_shot_examples = pd.DataFrame({
    "text": [
        "The game and it's graphics, music and story made me feel calm and happy  in a way nothing else could at the time. Playing it felt like a journey to another, better place , and that's art to me.",
        "I played the game Kairo, or Cairo I can't remember, it was an atmospheric puzzle game with big rooms filled with mist and interesting lighting, all the textures were concrete"],
    "coded_text": [
        "The game and it's graphics, music and story **made me feel calm and happy**<sup>emotional response</sup> in a way nothing else could at the time. **Playing it felt like a journey to another, better place**<sup>setting; immersion</sup>, and that's art to me.",
        "I played the game Kairo, or Cairo I can't remember, **it was an atmospheric puzzle game with big rooms filled with mist and interesting lighting, all the textures were concrete**<sup>setting; creativity</sup>"
    ]
})

function to run multiple times...

In [61]:
import random
import pandas as pd
import json

def process_robot_data(type_robot, category, abbreviations_dict, all_sheets_RR, all_sheets_SAR, all_sheets_Combined, 
                       directory, few_shot_examples, create_multivalue_dict, combine_dicts, LLMCode, 
                       gpt_model="gpt-3.5-turbo", use_cache=True, max_tokens=200, verbose=True, 
                       random_entries=True, random_entries_count=20):
    """
    Processes data for the specified robot type and category and saves the output as Excel and JSON.

    Parameters:
        type_robot (str): The type of robot ("SAR", "RR", or "Combined").
        category (str): The category to process.
        abbreviations_dict (dict): Dictionary containing abbreviations and their meanings.
        all_sheets_RR (dict): Data for rescue robots.
        all_sheets_SAR (dict): Data for socially assistive robots.
        all_sheets_Combined (dict): Data for combined robots.
        directory (str): Directory to save output files.
        few_shot_examples (list): Few-shot examples for the LLM.
        create_multivalue_dict (function): Function to create multi-value mappings.
        combine_dicts (function): Function to combine dictionaries.
        LLMCode (object): An object with the code induction method.
        gpt_model (str): GPT model to use. Defaults to "gpt-3.5-turbo".
        use_cache (bool): Whether to use cache. Defaults to True.
        max_tokens (int): Maximum token length for the GPT response. Defaults to 200.
        verbose (bool): Whether to print verbose output. Defaults to True.
        random_entries (bool): Whether to sample random entries. Defaults to True.
        random_entries_count (int): Number of random entries to sample. Defaults to 20.

    Returns:
        None
    """
    # Determine the appropriate data source
    if type_robot == "RR":
        tmp_df = all_sheets_RR
        naming_robots = "rescue robots"
    elif type_robot == "SAR":
        tmp_df = all_sheets_SAR
        naming_robots = "socially assistive robots"
    elif type_robot == "Combined":
        tmp_df = all_sheets_Combined
        naming_robots = "rescue robots and socially assistive robots"
    else:
        raise ValueError("Invalid type_robot specified.")
    
    # Ensure category exists
    if category not in abbreviations_dict:
        raise ValueError(f"Category '{category}' is not in the abbreviations dictionary.")
    
    topic_category = abbreviations_dict[category]
    df = tmp_df[category]
    
    # Create combined dictionary
    constant_comments_mapping = create_multivalue_dict(df, 'constant', 'constant_comments')
    new_comments_mapping = create_multivalue_dict(df, 'new', 'new_comments')
    constant_new_comments_mapping = combine_dicts(constant_comments_mapping, new_comments_mapping)
    
    # Generate single concepts
    single_concepts = []
    for key, values in constant_new_comments_mapping.items():
        for value in values:
            if isinstance(value, str):
                single_concepts.append(f"{key}: {value}")
            else:
                single_concepts.append(key)
    
    # Filter for entries with more than one word
    single_concepts = list({entry for entry in single_concepts if len(entry.split()) > 1})
    
    if random_entries:
        single_concepts = random.sample(single_concepts, min(random_entries_count, len(single_concepts)))

    research_question = f"What are the mentioned key benefits and risks regarding the {topic_category} of {naming_robots}?"
    
    if verbose:
        print(f"Processing category: {category}")
        print("Research Question:", research_question)
    
    # Perform LLM coding
    coded_texts, code_descriptions = LLMCode.code_inductively_with_code_consistency_adj(
        texts=single_concepts,
        research_question=research_question,
        topicCategory=topic_category,
        few_shot_examples=few_shot_examples,
        gpt_model=gpt_model,
        use_cache=use_cache,
        max_tokens=max_tokens,
        verbose=verbose
    )
    
    # Prepare data for saving
    json_data = {'robot': type_robot, 'category': category, 'coded_texts': coded_texts, 'code_descriptions': code_descriptions}

    # Save as JSON
    if type_robot == "RR":
        json_file_path = f"{directory}/output/G3/json RR/{type_robot} - {category}.json"
    elif type_robot == "SAR":
        json_file_path = f"{directory}/output/G3/json SAR/{type_robot} - {category}.json"
    else: 
        json_file_path = f"{directory}/output/G3/{type_robot} - {category}.json"

    with open(json_file_path, 'w') as file:
        json.dump([json_data], file, indent=4)
    
    if verbose:
        print(f"JSON file saved to {json_file_path}")

#### run LLMCode (inductive) for RR

In [62]:
run_inductive_coding = False # True

In [63]:
type_robot = "RR"

if run_inductive_coding:
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")

        process_robot_data(
            type_robot=type_robot,
            category=category,
            abbreviations_dict=abbreviations_dict,
            all_sheets_RR=all_sheets_RR,
            all_sheets_SAR=all_sheets_SAR,
            all_sheets_Combined=all_sheets_Combined,
            directory=directory,
            few_shot_examples=few_shot_examples,
            create_multivalue_dict=create_multivalue_dict,
            combine_dicts=combine_dicts,
            LLMCode=LLMCode,
            gpt_model="gpt-3.5-turbo",
            use_cache=True,
            max_tokens=200,
            verbose=True,
            random_entries=False,
            random_entries_count=None
        )

store data into one combined file:

In [64]:
import os
import json


if run_inductive_coding:
    # Path to the folder containing the JSON files
    folder_path = directory + "/output/G3/json RR"

    # Initialize an empty list to store combined data
    combined_data = []

    # Loop through all files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith(".json"):  # Check if the file is a JSON file
            file_path = os.path.join(folder_path, filename)
            with open(file_path, "r", encoding="utf-8") as file:
                try:
                    data = json.load(file)
                    if isinstance(data, list):  # Check if the JSON content is a list
                        combined_data.extend(data)  # Add the list's content to the combined data
                    else:
                        print(f"Skipping {filename}: not a list")
                except json.JSONDecodeError as e:
                    print(f"Error reading {filename}: {e}")

    # Output the combined data to a new JSON file
    output_file = directory + "/output/G3/" + "combined_RR.json"
    with open(output_file, "w", encoding="utf-8") as file:
        json.dump(combined_data, file, indent=4, ensure_ascii=False)

    print(f"Combined JSON data saved to {output_file}")

#### run LLMCode (inductive) for SAR

In [65]:
type_robot = "SAR"

if run_inductive_coding:
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")

        process_robot_data(
            type_robot=type_robot,
            category=category,
            abbreviations_dict=abbreviations_dict,
            all_sheets_RR=all_sheets_RR,
            all_sheets_SAR=all_sheets_SAR,
            all_sheets_Combined=all_sheets_Combined,
            directory=directory,
            few_shot_examples=few_shot_examples,
            create_multivalue_dict=create_multivalue_dict,
            combine_dicts=combine_dicts,
            LLMCode=LLMCode,
            gpt_model="gpt-3.5-turbo",
            use_cache=True,
            max_tokens=200,
            verbose=True,
            random_entries=False,
            random_entries_count=None
        )

store data into one combined file:

In [66]:
import os
import json

if run_inductive_coding:
    # Path to the folder containing the JSON files
    folder_path = directory + "/output/G3/json SAR"

    # Initialize an empty list to store combined data
    combined_data = []

    # Loop through all files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith(".json"):  # Check if the file is a JSON file
            file_path = os.path.join(folder_path, filename)
            with open(file_path, "r", encoding="utf-8") as file:
                try:
                    data = json.load(file)
                    if isinstance(data, list):  # Check if the JSON content is a list
                        combined_data.extend(data)  # Add the list's content to the combined data
                    else:
                        print(f"Skipping {filename}: not a list")
                except json.JSONDecodeError as e:
                    print(f"Error reading {filename}: {e}")

    # Output the combined data to a new JSON file
    output_file = directory + "/output/G3/" + "combined_SAR.json"
    with open(output_file, "w", encoding="utf-8") as file:
        json.dump(combined_data, file, indent=4, ensure_ascii=False)

    print(f"Combined JSON data saved to {output_file}")

### run LLMCode (deductive)

a toolkit for AI-assisted qualitative data analysis


prepare data:

In [67]:
import os
import json

# Define the path to the folder containing your JSON files
folder_path = directory + "/output/G3/" + "json RR - improved"

# Create an empty dictionary to store the combined JSON data
array_categories_RR = []
array_codingDescriptions_RR = []


# Iterate through all files in the folder
for filename in os.listdir(folder_path):
    # Check if the file has a .json extension
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r') as json_file:
            try:
                # Load the JSON content and update the combined dictionary
                file_data = json.load(json_file)
                array_categories_RR.append(file_data[0]["category"])
                array_codingDescriptions_RR.append(file_data[0]["code_descriptions"])
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON from file {filename}: {e}")
                



# Define the path to the folder containing your JSON files
folder_path = directory + "/output/G3/" + "json SAR - improved"    

   
# Create an empty dictionary to store the combined JSON data
array_categories_SAR = []
array_codingDescriptions_SAR = []


# Iterate through all files in the folder
for filename in os.listdir(folder_path):
    # Check if the file has a .json extension
    if filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r') as json_file:
            try:
                # Load the JSON content and update the combined dictionary
                file_data = json.load(json_file)
                array_categories_SAR.append(file_data[0]["category"])
                array_codingDescriptions_SAR.append(file_data[0]["code_descriptions"])
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON from file {filename}: {e}")

In [68]:
# Function to get description based on matching index
def get_description(match, categories, descriptions):
    try:
        # Find the index of the match in categories
        index = categories.index(match)
        # Retrieve the corresponding description
        return descriptions[index]
    except ValueError:
        # Return None if match is not found
        return None

# Example usage
match = 'HRIP'  # Replace with the category you want to match
description = get_description('HRIP', array_categories_RR, array_codingDescriptions_RR)

print(f"Category: {match}")
print(f"Code Description: {description}")


Category: HRIP
Code Description: {'sustained performance': 'Highlights physical and mental stamina in Human-Robot-Interaction, focusing on the importance of sustaining performance and functionality during prolonged rescue operations, without unnecessary details or quotes.', 'collaborative support': 'Identifies instances where rescue robots provide tangible support in rescue scenarios, including the delivery of essential resources and enhancing human capabilities through collaboration and support, rather than replacement.', 'emotional resilience': 'Focuses on emotional reassurance provided by rescue robots through their soft design, adaptable forms, or demeanor, emphasizing their ability to offer calm and optimism to victims in distress.'}


function to run multiple times...

In [69]:
import pandas as pd

def process_robot_data_deductively(type_robot, category, 
                                   abbreviations_dict, 
                                   all_sheets_RR, all_sheets_SAR,
                                   array_categories, array_codingDescriptions,
                                   directory, 
                                   create_multivalue_dict, combine_dicts, LLMCode,
                                   gpt_model="gpt-3.5-turbo", use_cache=True, verbose=True):
    """
    Processes data for the specified robot type and category and saves the output as Excel and JSON.

    Parameters:

    Returns:
        None
    """
    # Determine the appropriate data source
    if type_robot == "RR":
        tmp_df = all_sheets_RR
        naming_robots = "rescue robots"
    elif type_robot == "SAR":
        tmp_df = all_sheets_SAR
        naming_robots = "socially assistive robots"
    else:
        raise ValueError("Invalid type_robot specified.")
    
    # Ensure category exists
    if category not in abbreviations_dict:
        raise ValueError(f"Category '{category}' is not in the abbreviations dictionary.")
    
    topic_category = abbreviations_dict[category]
    df = tmp_df[category]
    
    # Create combined dictionary
    constant_comments_mapping = create_multivalue_dict(df, 'constant', 'constant_comments')
    new_comments_mapping = create_multivalue_dict(df, 'new', 'new_comments')
    
    # constant_new_comments_mapping = combine_dicts(constant_comments_mapping, new_comments_mapping)    
    ### hardly any cases:  
    # deleted_comments_mapping = create_multivalue_dict(df, 'deleted', 'deleted_comments')
    # constant_new_deleted_comments_mapping = combine_dicts(constant_new_comments_mapping, deleted_comments_mapping)

    ### Generate single concepts for CONSTANT
    single_concepts_constant = []
    for key, values in constant_comments_mapping.items():
        for value in values:
            if isinstance(value, str):
                single_concepts_constant.append(f"{key}: {value}")
            else:
                single_concepts_constant.append(key)
    
    # Filter for entries with more than one word
    single_concepts_constant = list({entry for entry in single_concepts_constant if len(entry.split()) > 1})

    ### Generate single concepts for NEW
    single_concepts_new = []
    for key, values in new_comments_mapping.items():
        for value in values:
            if isinstance(value, str):
                single_concepts_new.append(f"{key}: {value}")
            else:
                single_concepts_new.append(key)
    
    # Filter for entries with more than one word
    single_concepts_new = list({entry for entry in single_concepts_new if len(entry.split()) > 1})



    research_question = f"What are the mentioned key benefits and risks regarding the {topic_category} of {naming_robots}?"
    
    if verbose:
        print(f"Processing category: {category}")
        print("Research Question:", research_question)
    

    code_descriptions = get_description(category, array_categories, array_codingDescriptions)

  
    coded_texts_deductively_constant = LLMCode.code_deductively(texts=single_concepts_constant,
                     research_question=research_question,
                     codebook=code_descriptions,
                     gpt_model=gpt_model,
                     few_shot_examples=None,
                     use_cache=use_cache,
                     verbose=verbose)
    
    coded_texts_deductively_new = LLMCode.code_deductively(texts=single_concepts_new,
                     research_question=research_question,
                     codebook=code_descriptions,
                     gpt_model=gpt_model,
                     few_shot_examples=None,
                     use_cache=use_cache,
                     verbose=verbose)
    
    ### get a dictionary of code highlights and return a frequency table for CONSTANT
    code_highlights_constant = LLMCode.get_codes_and_highlights(coded_texts_deductively_constant)

    code_counts_constant = [(code, len(highlights)) for code, highlights in code_highlights_constant.items()]
    df_codes_constant = pd.DataFrame(code_counts_constant, columns=['Code', 'Count'])
    df_codes_constant = df_codes_constant.sort_values(by='Count', ascending=False).reset_index(drop=True)
    
    ### get a dictionary of code highlights and return a frequency table for NEW
    code_highlights_new = LLMCode.get_codes_and_highlights(coded_texts_deductively_new)

    code_counts_new = [(code, len(highlights)) for code, highlights in code_highlights_new.items()]
    df_codes_new = pd.DataFrame(code_counts_new, columns=['Code', 'Count'])
    df_codes_new = df_codes_new.sort_values(by='Count', ascending=False).reset_index(drop=True)

    # Merge the two frequency tables
    df_merged = pd.merge(df_codes_constant, df_codes_new, on='Code', how='left', suffixes=('_1', '_2'))

    # Rename the columns
    df_merged.rename(columns={'Count_1': 'rigid', 'Count_2': 'soft'}, inplace=True)

    # Fill missing values with 0 for codes not present in the second DataFrame
    df_merged['rigid'] = df_merged['rigid'].fillna(0).astype(int)

    df_merged.insert(0, "Category", category)
    df_merged.insert(1, "Robot", type_robot)
 
    json_data = {'robot': type_robot, 'category': category, 'coded_texts_deductively_constant': coded_texts_deductively_constant, 'coded_texts_deductively_new': coded_texts_deductively_new}

    return {'coded_texts': json_data, "frequency_codes": df_merged}  # Returning as a dictionary

test single run:

In [70]:
type_robot = "RR"
category = "HRIN"

tmp_out = process_robot_data_deductively(type_robot=type_robot,
                               category=category,
                               abbreviations_dict=abbreviations_dict, 
                               all_sheets_RR=all_sheets_RR,
                               all_sheets_SAR=all_sheets_SAR,
                               array_categories=array_categories_RR, 
                               array_codingDescriptions=array_codingDescriptions_RR,
                               directory=directory,
                               create_multivalue_dict=create_multivalue_dict,
                               combine_dicts=combine_dicts,
                               LLMCode=LLMCode,
                               gpt_model="gpt-3.5-turbo", use_cache=True, verbose=True)


Processing category: HRIN
Research Question: What are the mentioned key benefits and risks regarding the perceived negative Human-Robot-Interaction of rescue robots?
 |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
Original text: "emotional coldness: The use of humans in emergency situations is probably more pleasant for the victims than the use of cold/emotionless robots."
LLM output: "**The use of humans in emergency situations is probably more pleasant for the victims than the use of cold/emotionless robots**<sup>emotional coldness</sup>."
Text reconstruction successful

Original text: "lack of humanity: The person to be rescued may be afraid of robots"
LLM output: "**The person to be rescued may be afraid of robots**<sup>fear</sup>"
Text reconstruction successful

Had to reconstruct 2 texts due to LLM errors


In [71]:
tmp_out["frequency_codes"]

Unnamed: 0,Category,Robot,Code,rigid,soft
0,HRIN,RR,lack of emotional understanding,8,1.0
1,HRIN,RR,fear,6,
2,HRIN,RR,reliance and trust,5,1.0
3,HRIN,RR,emotional coldness,2,


In [72]:
tmp_out["coded_texts"]

{'robot': 'RR',
 'category': 'HRIN',
 'coded_texts_deductively_constant': ['**problems with contacting**<sup>reliance and trust</sup>',
  '**own decisions**<sup>reliance and trust</sup>',
  '**human closeness is missing**<sup>emotional coldness; lack of emotional understanding</sup>: In the interaction with the victims, the human closeness is lacking.',
  'emotional coldness: **The use of humans in emergency situations is probably more pleasant for the victims than the use of cold/emotionless robots**<sup>emotional coldness</sup>.',
  '**victim uncertainty:** In crisis situations, **human communication can be vital**<sup>reliance and trust</sup>, for example in shock.',
  '**lack of interpersonal contact**<sup>lack of emotional understanding</sup>',
  '**no compassion: a robot does not know what pain is**<sup>lack of emotional understanding</sup>',
  '**impersonal: lack of human closeness**<sup>lack of emotional understanding</sup>, which the victim may need',
  '**could cause fear**<s

#### for RR

In [73]:
run_deductive_coding = False # False

In [74]:
type_robot = "RR"

file_path_xlsx = directory + "/output/G3/rescue robot frequency table codes" + ".xlsx"
file_path_json = directory + "/output/G3/rescue robot coded texts" + ".json"

if run_deductive_coding:
    
    json_objects = []
    dataframes = []
    
    
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")
        
        tmp_out = process_robot_data_deductively(type_robot=type_robot,
                               category=category,
                               abbreviations_dict=abbreviations_dict, 
                               all_sheets_RR=all_sheets_RR,
                               all_sheets_SAR=all_sheets_SAR,
                               array_categories=array_categories_RR, 
                               array_codingDescriptions=array_codingDescriptions_RR,
                               directory=directory,
                               create_multivalue_dict=create_multivalue_dict,
                               combine_dicts=combine_dicts,
                               LLMCode=LLMCode,
                               gpt_model="gpt-3.5-turbo", use_cache=True, verbose=True)
        
        json_objects.append(tmp_out["coded_texts"])
        dataframes.append(tmp_out["frequency_codes"])
       
    
    combined_df = pd.concat(dataframes, ignore_index=True)    
    combined_df.to_excel(file_path_xlsx, index=False)

    with open(file_path_json, 'w', encoding='utf-8') as json_file:
        json.dump(json_objects, json_file, indent=4, ensure_ascii=False)

#### for SAR

In [75]:
type_robot = "SAR"

file_path_xlsx = directory + "/output/G3/socially assistive robot frequency table codes" + ".xlsx"
file_path_json = directory + "/output/G3/socially assistive robot coded texts" + ".json"

if run_deductive_coding:
    
    json_objects = []
    dataframes = []
    
    
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")
        
        tmp_out = process_robot_data_deductively(type_robot=type_robot,
                               category=category,
                               abbreviations_dict=abbreviations_dict, 
                               all_sheets_RR=all_sheets_RR,
                               all_sheets_SAR=all_sheets_SAR,
                               array_categories=array_categories_SAR, 
                               array_codingDescriptions=array_codingDescriptions_SAR,
                               directory=directory,
                               create_multivalue_dict=create_multivalue_dict,
                               combine_dicts=combine_dicts,
                               LLMCode=LLMCode,
                               gpt_model="gpt-3.5-turbo", use_cache=True, verbose=True)
        
        json_objects.append(tmp_out["coded_texts"])
        dataframes.append(tmp_out["frequency_codes"])
       
    
    combined_df = pd.concat(dataframes, ignore_index=True)    
    combined_df.to_excel(file_path_xlsx, index=False)

    with open(file_path_json, 'w', encoding='utf-8') as json_file:
        json.dump(json_objects, json_file, indent=4, ensure_ascii=False)

## Textual Description of Category Specific Graph (G3)

### Load Data

load coded text files:

In [76]:
import json

# Define the file path
file_path_json = directory + "/output/G3/rescue robot coded texts.json"

# Load the JSON file
try:
    with open(file_path_json, 'r') as file:
        codedTexts_RR_json = json.load(file)  # Parse the JSON data
        print("JSON data loaded successfully.")
except FileNotFoundError:
    print(f"File not found at {file_path_json}")
except json.JSONDecodeError as e:
    print(f"Error decoding JSON: {e}")



# Define the file path
file_path_json = directory + "/output/G3/socially assistive robot coded texts.json"

# Load the JSON file
try:
    with open(file_path_json, 'r') as file:
        codedTexts_SAR_json = json.load(file)  # Parse the JSON data
        print("JSON data loaded successfully.")
except FileNotFoundError:
    print(f"File not found at {file_path_json}")
except json.JSONDecodeError as e:
    print(f"Error decoding JSON: {e}")

JSON data loaded successfully.
JSON data loaded successfully.


load frequency and code descriptions:

In [77]:
import pandas as pd

# define directory path
relative_path = os.path.join(
    "create tables, graphics G3", "outputs"
)

# Get the absolute path based on the current working directory
target_dir = os.path.abspath(relative_path)

# Define the file path
file_path = target_dir + "/RR_codes_complete.xlsx"

# Load the Excel file
try:
    # Read the Excel file into a pandas DataFrame
    frequencyCodedes_RR = pd.read_excel(file_path)
    # Display the first few rows of the DataFrame
    print("Data loaded successfully.")
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"An error occurred while loading the file: {e}")


# Define the file path
file_path = target_dir + "/SAR_codes_complete.xlsx"

# Load the Excel file
try:
    # Read the Excel file into a pandas DataFrame
    frequencyCodedes_SAR = pd.read_excel(file_path)
    # Display the first few rows of the DataFrame
    print("Data loaded successfully.")
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"An error occurred while loading the file: {e}")

Data loaded successfully.
Data loaded successfully.


### Prepare Data

In [78]:
# Add 'soft' and 'rigid' columns
#frequencyCodedes_RR['soft'] = frequencyCodedes_RR['soft'] + frequencyCodedes_RR['rigid']
#frequencyCodedes_SAR['soft'] = frequencyCodedes_SAR['soft'] + frequencyCodedes_SAR['rigid']


print(frequencyCodedes_RR.shape)
print(frequencyCodedes_SAR.shape)

(90, 6)
(73, 6)


### Multiple Runs

prompt to combine the following information:

- code descriptions
- frequency of given codes (= weighting)
- marked passages by single codes

In [79]:
system_template = """
You are a researcher summarizing information regarding the risks and benefits of {topicCategory} for {typerobot}. Participants differentiated in their perceptions between:

1. Traditional rigid robots ("rigid")
2. Flexible, electronic-free soft robots ("soft")


Your task:
Write a **concise scientific paragraph** integrating all provided arguments. The paragraph must include:

1. Start with the most frequent and significant argument, avoiding listing all categories explicitly.
2. Integrate shared perceptions between rigid and soft robots, emphasizing areas of overlap.
3. Highlight key distinguishing features of soft robots, showcasing their unique attributes.
4. Conclude with a cohesive summary emphasizing the most significant insights, with a focus on the potential risks or benefits of soft robots.


Each argument is structured as follows:
- **Code name**: The specific label assigned to the argument.
- **Code description**: A concise explanation of the argument's meaning.
- **Importance**: The relative weight of the argument, represented by its frequency for "rigid" and "soft" robots. Arguments with higher frequencies should be given greater emphasis in your analysis.
- **Marked text passages**: Contextual examples or supporting statements from participants for each argument seperatedly for "rigid" and "soft" robots.

You must incorporate all arguments provided, but prioritize those with higher frequencies in your summary. Use this information to integrate and synthesize the arguments into a cohesive scientific paragraph.


Guidelines:
- Avoid explicitly labeling or numbering sections.
- Do not list individual codes or categories; synthesize arguments fluidly.
- Do not include numerical values (e.g., argument frequencies).
- Ensure the text is suitable for a scientific article and integrates all arguments into a cohesive narrative.
- Prioritize frequent arguments while addressing all relevant details provided.
- Avoid making direct comparisons where unnecessary (e.g., "rigid robots have more than soft robots"), unless critical for clarity.
"""

user_template = """
Input Details:
Arguments:
{arguments}

Your output should align with the scientific paragraph structure outlined above.
"""

load API key again:

In [80]:
import os
import sys

# Assuming 'src' is one level down (in the current directory or a subdirectory)
path_to_src = os.path.join('src')  # Moves one level down to 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key

function to run multiple times...

In [81]:
from collections import defaultdict
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import ChatPromptTemplate

def summarize_graphic(
    type_robot,
    category,
    codedTexts_RR_json,
    codedTexts_SAR_json,
    frequencyCodedes_RR,
    frequencyCodedes_SAR,
    numWordsKeep,
    model_name,
    max_tokens,
    temperature,
    api_key,
    verbose=False
):
    if type_robot == "RR":
        codedTexts = codedTexts_RR_json
        frequencyCodedes = frequencyCodedes_RR
        naming_robots = "rescue robots"
    elif type_robot == "SAR":
        codedTexts = codedTexts_SAR_json
        frequencyCodedes = frequencyCodedes_SAR
        naming_robots = "socially assistive robots"
    else:
        raise ValueError("Invalid type_robot specified.")
    
    # Load codedTexts
    index_category = [index for index, item in enumerate(codedTexts) if item.get('category') == category]
    codedTexts = codedTexts[int(''.join(map(str, index_category)))]
    
    # Load frequencyCodedes
    frequencyCodedes = frequencyCodedes[frequencyCodedes['Abbrevation'] == category].copy()
    frequencyCodedes.loc[:, 'Code'] = frequencyCodedes['Code'].str.lower()

    # Get topic for prompt
    topicCategory = ', '.join(frequencyCodedes["Category"].unique())

    # Process codedTexts separately for "constant" and "new"
    codedTexts_highlights_constant = LLMCode.get_codes_and_highlights(codedTexts["coded_texts_deductively_constant"])
    codedTexts_highlights_new = LLMCode.get_codes_and_highlights(codedTexts["coded_texts_deductively_new"])

    # Filter codedTexts_highlights_constant
    codedTexts_highlights_constant = defaultdict(list, {
        key: [sentence for sentence in sentences if len(sentence.split()) > numWordsKeep]
        for key, sentences in codedTexts_highlights_constant.items()
    })
    codedTexts_highlights_constant = defaultdict(list, {key: value for key, value in codedTexts_highlights_constant.items() if value})

    # Filter codedTexts_highlights_new
    codedTexts_highlights_new = defaultdict(list, {
        key: [sentence for sentence in sentences if len(sentence.split()) > numWordsKeep]
        for key, sentences in codedTexts_highlights_new.items()
    })
    codedTexts_highlights_new = defaultdict(list, {key: value for key, value in codedTexts_highlights_new.items() if value})

    if verbose:
        print("\nfrequencyCodedes:\n", frequencyCodedes)
        print("\ncodedTexts_highlights_constant:\n", dict(codedTexts_highlights_constant))
        print("\ncodedTexts_highlights_new:\n", dict(codedTexts_highlights_new))

    # Set up structured arguments for prompt
    arguments = []
    for index, row in frequencyCodedes.iterrows():
        argument = {
            "code_name": row['Code'],
            "description": row['Description'],
            "importance": {
                "rigid": row['rigid'],
                "soft": row['soft']
            },
            "marked_passages": {
                "rigid": None,
                "soft": None
            }
        }
        arguments.append(argument)

    for arg in arguments:
        code_name = arg['code_name']
        arg['marked_passages']['rigid'] = [
            passage.replace('*', '') for passage in codedTexts_highlights_constant.get(code_name, [])
        ] if codedTexts_highlights_constant.get(code_name) else None

        arg['marked_passages']['soft'] = [
            passage.replace('*', '') for passage in codedTexts_highlights_new.get(code_name, [])
        ] if codedTexts_highlights_new.get(code_name) else None

    if verbose:
        print("\narguments:\n", arguments)

    # Format combined arguments
    formatted_arguments = "\n\n".join([
        f"- Code: {arg['code_name']}\n"
        f"  Description: {arg['description']}\n"
        f"  Importance: Rigid: {arg['importance']['rigid']}, Soft: {arg['importance']['soft']}\n"
        f"  Marked Passages:\n"
        f"    Rigid: {', '.join(arg['marked_passages']['rigid']) if arg['marked_passages']['rigid'] else 'None'}\n"
        f"    Soft: {', '.join(arg['marked_passages']['soft']) if arg['marked_passages']['soft'] else 'None'}"
        for arg in arguments
    ])

    if verbose:
        print("\nformatted_arguments:\n", formatted_arguments)

    # Fill the system and user templates
    filled_system_message = system_template.format(topicCategory=topicCategory, typerobot=naming_robots)
    filled_user_message = user_template.format(arguments=formatted_arguments)

    # Combine into a ChatPromptTemplate
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", filled_system_message),
            ("human", filled_user_message),
        ]
    )

    if verbose:
        print("\nprompt:\n", prompt)

    # Initialize the ChatOpenAI model for Hugging Face
    model = ChatOpenAI(
        model=model_name,
        openai_api_key=api_key,
        openai_api_base="https://api-inference.huggingface.co/v1/",
        max_tokens=max_tokens,
        temperature=temperature
    )


    #chain = LLMChain(llm=model, prompt=prompt)
    chain = prompt | model

    # Execute the chain and output response details
    with get_openai_callback() as cb:
        response = chain.invoke({})  # Invoke without arguments as they're included in the prompt template

        if cb.total_tokens > max_tokens:
            print("Warning: The response may be incomplete due to exceeding the maximum token limit.")
        
        if verbose:
            print(cb)
            print(f"Total Tokens: {cb.total_tokens}")
            print(f"Prompt Tokens: {cb.prompt_tokens}")
            print(f"Completion Tokens: {cb.completion_tokens}")
            print(f"Total Cost (USD): ${cb.total_cost}")


    if verbose:
        print("\nresponse:\n", response)

    return response


In [82]:
createSummaryGraphics = True # False

for rescue robot:

In [83]:
import pandas as pd

type_robot = "RR"

file_path_xlsx = directory + "/output/G3/summaryGraphics_RR" + ".xlsx"


# List of valid categories
valid_categories = ["AN", "AP", "HRIN", "HRIP", "R", "SA", "TL", "TP"]

# Create an empty list to store results
results = []

if createSummaryGraphics:
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")

        # Check if the category is in the valid list
        if category in valid_categories:
            response = summarize_graphic(
                type_robot=type_robot,
                category=category,
                codedTexts_RR_json=codedTexts_RR_json,
                codedTexts_SAR_json=codedTexts_SAR_json,
                frequencyCodedes_RR=frequencyCodedes_RR,
                frequencyCodedes_SAR=frequencyCodedes_SAR,
                numWordsKeep=3,
                model_name="meta-llama/Meta-Llama-3-70B-Instruct",
                max_tokens=4000,
                temperature=0.0,
                api_key=key.hugging_api_key,
                verbose=False
            )
            # Append the type_robot, category, and response content to results
            results.append({
                "Type Robot": type_robot,
                "Category": category,
                "Response Content": response.content
            })
        else:
            print(f"Skipping category {category} as it is not in the valid list.")

    # Convert results to a DataFrame
    df_results = pd.DataFrame(results)

    # Save to an Excel file
    df_results.to_excel(file_path_xlsx, index=False)
    print(f"Responses saved to {file_path_xlsx}")

index: 0, category: TP


  model = ChatOpenAI(


index: 1, category: TL
index: 2, category: SA
index: 3, category: R
index: 4, category: HRIP
index: 5, category: HRIN
index: 6, category: AP
index: 7, category: AN
Responses saved to /home/fenn/Desktop/Publications/soft robot intervention/Analyses/main study - LLM/output/G3/summaryGraphics_RR.xlsx


for socially assistive robots:

In [84]:
import pandas as pd

type_robot = "SAR"

file_path_xlsx = directory + "/output/G3/summaryGraphics_SAR" + ".xlsx"


# List of valid categories
valid_categories = ["AN", "AP", "HRIN", "HRIP", "R", "SA", "TL", "TP"]

# Create an empty list to store results
results = []

if createSummaryGraphics:
    for index, category in enumerate(abbreviations_dict.keys()):
        print(f"index: {index}, category: {category}")

        # Check if the category is in the valid list
        if category in valid_categories:
            response = summarize_graphic(
                type_robot=type_robot,
                category=category,
                codedTexts_RR_json=codedTexts_RR_json,
                codedTexts_SAR_json=codedTexts_SAR_json,
                frequencyCodedes_RR=frequencyCodedes_RR,
                frequencyCodedes_SAR=frequencyCodedes_SAR,
                numWordsKeep=3,
                model_name="meta-llama/Meta-Llama-3-70B-Instruct",
                max_tokens=5000,
                temperature=0.0,
                api_key=key.hugging_api_key,
                verbose=False
            )
            # Append the type_robot, category, and response content to results
            results.append({
                "Type Robot": type_robot,
                "Category": category,
                "Response Content": response.content
            })
        else:
            print(f"Skipping category {category} as it is not in the valid list.")

    # Convert results to a DataFrame
    df_results = pd.DataFrame(results)

    # Save to an Excel file
    df_results.to_excel(file_path_xlsx, index=False)
    print(f"Responses saved to {file_path_xlsx}")

index: 0, category: TP
index: 1, category: TL
index: 2, category: SA
index: 3, category: R
index: 4, category: HRIP
index: 5, category: HRIN
index: 6, category: AP
index: 7, category: AN
Responses saved to /home/fenn/Desktop/Publications/soft robot intervention/Analyses/main study - LLM/output/G3/summaryGraphics_SAR.xlsx
