# Hierarchical coding frames

## Takeaways:


- Sentiment seems to be accurate, even for responses with mixed sentiments.
- Including `topic` and `specific_theme` helps get a good level of specificity.
- I tried getting it to return a list of codes per row, and returning a response id per code, but neither showed any improvement over just asking for a single list of codes.

- Maybe using shorter ids would be better?
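
One way to test the shorter-id idea is to swap the UUIDs for short sequential ids before prompting and map them back afterwards. This is a minimal sketch using only the standard library; `build_short_id_maps` and the `r1`, `r2`, ... format are hypothetical, not part of `thematic_analysis`. The two ids are taken from the sample output further down.

```python
def build_short_id_maps(ids):
    """Map each original id to a short id such as 'r1', and back again."""
    to_short = {}
    for original in ids:
        # Keep the first short id if the same original id appears twice.
        if original not in to_short:
            to_short[original] = f"r{len(to_short) + 1}"
    to_original = {short: original for original, short in to_short.items()}
    return to_short, to_original


example_ids = [
    "7ceaafd1-fd8a-41e7-96e4-e84fff04ed62",
    "4aa655c9-2dbf-4f8a-b5b6-88e584c8a0b9",
]
to_short, to_original = build_short_id_maps(example_ids)
print(to_short)
```

The short ids would go into the prompt, and `to_original` would translate each returned `response_id` back before joining to the DataFrame.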

## Main resources used:
- https://journals.sagepub.com/doi/10.1177/08944393231220483
- https://fastercapital.com/content/Thematic-Analysis--Uncovering-Patterns--Thematic-Analysis-in-Qualitative-Research.html#
- https://getthematic.com/insights/coding-qualitative-data/

## Setup

In [47]:
import asyncio
import polars as pl
from pathlib import Path
from typing import Literal, List, Iterable
from pydantic import BaseModel, Field
from thematic_analysis.llm import get_chat_completion_structured
from thematic_analysis.types import UserMessage, SystemMessage
from thematic_analysis.utils.cleaning import clean_answers, format_list

In [73]:
# Read in DF, pick a question column, and clean the answers
df = pl.read_csv(Path("../data/wallmart.csv"))

question = df.columns[2]
df = df.select(["Session ID", question]).rename({
    "Session ID" : "id",
    question : "answer"
})
df = clean_answers(df)
df.head(2)

id,answer
str,str
"""94f2d4c3-b513-411c-b505-a11290…","""The customer service that your…"
"""1797c6f2-c501-44b7-b549-a33c29…","""I would complain about how mor…"


## Initial Coding
Goal: play around with the output models and prompts to see what works.

In [74]:
# Create output models

class Code(BaseModel):
    response_id: str = Field(..., description="The id of the response this code was generated from.")
    topic: str = Field(..., description="The broad subject of the response (e.g., 'Customer Service', 'Pricing').")
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        ..., description="The sentiment of the response related to the topic."
    )
    specific_theme: str = Field(..., description="A specific attribute or issue described in the response.")

class ReponseCodes(BaseModel):
    # scratch_pad: str = Field(..., description="An area for outputting your reasoning when generating codes.")
    codes: List[Code] = Field(..., description="A list of codes generated from survey responses.")

    def __str__(self) -> str:
        # Show each code on a new line.
        cls_name = self.__class__.__name__
        codes = "\n".join(str(code) for code in self.codes)  # join() needs strings, not Code objects
        return f"{cls_name}(\n{codes}\n)"

In [75]:
# Format responses
def format_list(responses: Iterable[str], ids: Iterable[str]) -> str:
    """
    Format the answers of a ThemeRequest for a prompt.
    """
    formatted_responses = [
        f"(id={r_id}, response={response})"
        for r_id, response
        in zip(ids, responses)  # ids first, to match the unpacking order
    ]
    return "\n".join(formatted_responses)

formatted_answers = format_list(df["answer"], df["id"])

In [76]:
# Create prompts for initial codes

instructions = """
You are an expert in qualitative analysis.

You will be given a survey question and a list of responses. Your task is to categorize each response into one or more codes based on the topics it covers.

# **Instructions**:

For each response, follow these steps:

1. Identify the main topics mentioned in the response.
2. For each topic, determine:
    - Sentiment: Classify as positive, negative, or neutral based on the tone, wording, and context of the question.
    - Specific Theme: Create a clear, concise phrase capturing the core issue or attribute described.
3. Generate a code for each topic using the identified sentiment and theme.
4. Repeat this process until at least one code is assigned to each response.

# **Example**:

Response: "The product was fine, but the pricing felt a little high compared to competitors."

Generated codes:
```
[
    {
        "topic": "Product",
        "sentiment": "neutral",
        "specific_theme": "Acceptable quality"
    },
    {
        "topic": "Pricing",
        "sentiment": "negative",
        "specific_theme": "Higher than competitors"
    }
]
```

# **Guidelines**:
- If a response mentions multiple topics, generate separate codes for each one.
- If a topic expresses mixed sentiments, create a separate code for each sentiment.
- Ensure specific themes are precise and meaningful (e.g., "Difficult refund process" instead of just "Refunds").
- If a response is too vague, use "General feedback" as the topic.
- One response can generate one or more codes.

You are now going to be given the survey question and a list of the responses to code.
"""

survey_responses = """
# **Question**:
> "{question}"


# **Responses**:
{formatted_answers}
"""

In [77]:
# Call the LLM
prompts = [
    SystemMessage(content=instructions),
    UserMessage(content=survey_responses.format(question=question, formatted_answers=formatted_answers))
]
response = await get_chat_completion_structured(prompts, ReponseCodes)

In [78]:
# Check basic numbers

print(f"Number of survey responses: {df.shape[0]}")
print(f"Number of codes returned: {len(response.codes)}")

unique_ids = {code.response_id for code in response.codes}
print(f"Number of unique ids: {len(unique_ids)}")

Number of survey responses: 253
Number of codes returned: 144
Number of unique ids: 144


### Test batching
Goal: check if the model does better with fewer responses per call.

In [89]:
async def process_batch(batch: pl.DataFrame):
    print(f"Processing {batch.shape[0]} rows...")
    formatted_responses = format_list(batch["answer"], batch["id"])
    prompts = [
        SystemMessage(content=instructions),
        UserMessage(content=survey_responses.format(question=question, formatted_answers=formatted_responses))
    ]
    return await get_chat_completion_structured(prompts, ReponseCodes)

In [90]:
BATCH_SIZE = 50

# TODO: create a function that generates balanced batches given a max size (or just check itertools...)
batched_response_futures = [
    process_batch(df.slice(start, BATCH_SIZE))
    for start in range(0, df.height, BATCH_SIZE)
]
batched_responses = await asyncio.gather(*batched_response_futures)

Processing 50 rows...
Processing 50 rows...
Processing 50 rows...
Processing 50 rows...
Processing 50 rows...
Processing 3 rows...
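
The last batch above holds only 3 rows. A possible shape for the balanced-batch helper mentioned in the TODO is sketched here; `balanced_batch_bounds` is a hypothetical name, and the sketch only computes offsets and lengths, so it does not depend on polars.

```python
import math


def balanced_batch_bounds(n_rows, max_size):
    """Return (offset, length) pairs that split n_rows into near-equal batches."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if n_rows <= 0:
        return []
    # Use the smallest number of batches that respects max_size,
    # then spread the rows as evenly as possible across them.
    n_batches = math.ceil(n_rows / max_size)
    base, extra = divmod(n_rows, n_batches)
    bounds = []
    offset = 0
    for i in range(n_batches):
        length = base + 1 if i < extra else base
        bounds.append((offset, length))
        offset += length
    return bounds


bounds = balanced_batch_bounds(253, 50)
print([length for _, length in bounds])
```

Each pair could then be passed to `df.slice(offset, length)` in place of the fixed-size slices.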


In [132]:
# Check basic numbers

print(f"Number of survey responses: {df.shape[0]}")

response_code_numbers = [len(response.codes) for response in batched_responses]
print(f"Responses in each batch: {response_code_numbers}, Total codes: {sum(response_code_numbers)}")

all_ids = [code.response_id for response_codes in batched_responses for code in response_codes.codes]
print(f"Number of unique ids: {len(set(all_ids))}")

Number of survey responses: 253
Responses in each batch: [50, 51, 50, 52, 49, 6], Total codes: 258
Number of unique ids: 251


### Check ids
Goal: get a sense of how well using ids works.

In [102]:
real_ids = set(df["id"])
returned_ids = set(all_ids)

In [107]:
# Check intersection
print(f"Number of returned ids that match ids in the df: {len(real_ids & returned_ids)}")
print(f"Number of made up ids: {len(returned_ids - real_ids)}")
print(f"Number of missed ids: {len(real_ids - returned_ids)}")

print("Made up:")
print(returned_ids - real_ids)
print("Missed:")
print(real_ids - returned_ids)

Number of returned ids that match ids in the df: 249
Number of made up ids: 2
Number of missed ids: 4
Made up:
{'e97b337c-d73b-4949-937e-d28db451a024', '5c070d8d-08bf-44b6-9b6d-4eb282e56530'}
Missed:
{'5c070d8d-08bf-44ac-9b6d-4eb282e56530', 'b8f3432f-bb8e-4e69-8ee8-d0f236dd5bc3', 'e97b337c-d73c-4949-937e-d28db451a024', 'b90b66ce-4b2a-4547-a4e8-1bc9eabde81f'}
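
Both made-up ids differ from a missed id by only one or two characters, so they look like copy errors rather than inventions. A sketch of mapping them back with `difflib.get_close_matches` from the standard library follows; `repair_ids` and the 0.9 cutoff are assumptions for illustration, and the ids are the ones printed above.

```python
import difflib


def repair_ids(made_up, missed, cutoff=0.9):
    """Map each unknown id to its closest missed id, when one is similar enough."""
    candidates = sorted(missed)
    repaired = {}
    for bad_id in made_up:
        # n=1 keeps only the best candidate; cutoff rejects weak matches.
        matches = difflib.get_close_matches(bad_id, candidates, n=1, cutoff=cutoff)
        if matches:
            repaired[bad_id] = matches[0]
    return repaired


made_up = {
    "e97b337c-d73b-4949-937e-d28db451a024",
    "5c070d8d-08bf-44b6-9b6d-4eb282e56530",
}
missed = {
    "5c070d8d-08bf-44ac-9b6d-4eb282e56530",
    "b8f3432f-bb8e-4e69-8ee8-d0f236dd5bc3",
    "e97b337c-d73c-4949-937e-d28db451a024",
    "b90b66ce-4b2a-4547-a4e8-1bc9eabde81f",
}
repaired = repair_ids(made_up, missed)
print(repaired)
```

This would leave two ids genuinely missed. The sketch does not guard against two made-up ids resolving to the same missed id, which would need a check on larger data.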


### Check response quality
Goal: get a sense of how well the codes match the responses and how good the codes are.

In [118]:
# Note: an id with several codes keeps only its last code here.
# Too few duplicates to worry about, and we'll check them after.
code_id_mapping = {code.response_id : code for response_codes in batched_responses for code in response_codes.codes}
len(code_id_mapping)

251


In [122]:
# Note: too few duplicates to worry about.. and we'll check them after
code_id_mapping = {
    code.response_id : code
    for response_codes in batched_responses
    for code in response_codes.codes
}

for _ in range(5):
    id_value, answer = df.sample(1).row(0)
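    matched = code_id_mapping.get(id_value)
    if matched is None:
        # A few ids were never returned by the model (see "Check ids")
        print(f"ID: {id_value} has no matching code\n")
        continue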
    print(f"ID: {id_value}")
    print(f"Response: {answer}")
    print(f"Topic: {matched.topic}")
    print(f"Sentiment: {matched.sentiment}")
    print(f"Context: {matched.specific_theme}\n")

ID: 7ceaafd1-fd8a-41e7-96e4-e84fff04ed62
Response: There is nothing I would complain about.
Topic: General feedback
Sentiment: neutral
Context: No complaints

ID: 4aa655c9-2dbf-4f8a-b5b6-88e584c8a0b9
Response: I don't particularly like the self-checkout
Topic: Checkout Experience
Sentiment: negative
Context: Dislike of self-checkout

ID: a1e30d9d-f1a1-4e9b-87f0-105453335261
Response: Nothing just keep the restrooms clean
Topic: Restroom Facilities
Sentiment: positive
Context: Keep restrooms clean

ID: f073c9ee-43f3-46f7-89ba-c1e099676f45
Response: You are a multi billion dollar company. Please pay your employees a livable wage. Stop promoting high wages and health insurance, only to make sure said employee never works the number of hours to qualify for those things. I love self checkout, but you need to open your registers! I'm tired of only 1 or 2 lanes being open and having to stand in long lines, when I Need to deal with a cashier for my transaction. If you insist on making us check
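
Because `code_id_mapping` keeps one code per id, a response that received several codes only shows its last one. A sketch of grouping every code under its response id is below; `ExampleCode` is a stand-in for the notebook's `Code` model so the block runs on its own, `group_codes_by_id` is a hypothetical helper, and the ids `r1` and `r2` are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExampleCode:
    """Stand-in for the notebook's Code model."""
    response_id: str
    topic: str
    sentiment: str
    specific_theme: str


def group_codes_by_id(codes):
    """Collect all codes for each response id, preserving their order."""
    grouped = defaultdict(list)
    for code in codes:
        grouped[code.response_id].append(code)
    return dict(grouped)


codes = [
    ExampleCode("r1", "Product", "neutral", "Acceptable quality"),
    ExampleCode("r1", "Pricing", "negative", "Higher than competitors"),
    ExampleCode("r2", "General feedback", "neutral", "No complaints"),
]
grouped = group_codes_by_id(codes)
for response_id, response_codes in grouped.items():
    print(response_id, [code.topic for code in response_codes])
```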

In [133]:
# Sorry!
u_ids = set(all_ids)
duplicates = [rid for rid in all_ids if rid not in u_ids]

In [134]:
duplicates

[]
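
The empty list does not rule out repeated ids: `u_ids` is built from `all_ids`, so every id passes the membership test. With 258 codes and 251 unique ids reported earlier, some ids do repeat. A count-based check could look like the sketch below, which uses `collections.Counter`; `find_duplicates` is a hypothetical helper and the ids are placeholders.

```python
from collections import Counter


def find_duplicates(ids):
    """Return each id that appears more than once, with its count."""
    counts = Counter(ids)
    return {rid: n for rid, n in counts.items() if n > 1}


# Placeholder ids for illustration only.
example_ids = ["a", "b", "a", "c", "b", "a"]
print(find_duplicates(example_ids))
```

In the notebook this would be called as `find_duplicates(all_ids)`.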