# Coding frames

## Takeaways:
- Including topic and specific_theme helps capture a good level of granularity.
- Asking it to include IDs, even if those IDs aren't 100% correct, results in more codes being returned.
- Prompts to generate codes need improvement; it's still not returning multiple codes per response.
- Batching gives far better results when getting codes, though it is slower.
- Generating a list of just theme names instead of full theme objects leads to more and better themes being returned.
- IDs are helpful for understanding how it arrived at the answer, but it might be worth dropping them after coding.
- It would be interesting to see if shorter IDs improved performance (maybe the first ~7 characters of a hash).
- The current approach won't handle large datasets well.
- If we needed to link themes back to topics, maybe an embedding similarity search using the final theme would be better.
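
A minimal sketch of the shorter-ID idea, assuming a truncated SHA-1 hash is acceptable as a stand-in for the full UUID in the prompt. The helper name `build_short_ids` is hypothetical and not part of the notebook. It keeps a lookup so the short IDs can be mapped back after coding.

```python
import hashlib

def build_short_ids(ids, length=7):
    """Map the first `length` hex chars of each id's SHA-1 hash back to the full id."""
    short_to_full = {}
    for full_id in ids:
        short = hashlib.sha1(full_id.encode("utf-8")).hexdigest()[:length]
        # Fail loudly on a collision instead of silently merging two responses.
        if short in short_to_full and short_to_full[short] != full_id:
            raise ValueError(f"Short id collision on {short!r}; increase length")
        short_to_full[short] = full_id
    return short_to_full

example_ids = [
    "4264ccf6-2239-4241-a373-14f58398a1e7",
    "d0fb022c-0793-426c-81fe-475d9edb4e4c",
]
short_to_full = build_short_ids(example_ids)
full_to_short = {full: short for short, full in short_to_full.items()}
```

Whether shorter IDs actually reduce copying errors is untested here; this only shows the bookkeeping.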

Overall, there is a lot to be improved upon with this approach... however, I'm not convinced that this is the best approach to begin with.


## Main resources used:
- https://journals.sagepub.com/doi/10.1177/08944393231220483
- https://fastercapital.com/content/Thematic-Analysis--Uncovering-Patterns--Thematic-Analysis-in-Qualitative-Research.html#
- https://getthematic.com/insights/coding-qualitative-data/

## Setup

In [1]:
import asyncio
import polars as pl
from pathlib import Path
from typing import Literal, List, Iterable
from pydantic import BaseModel, Field
from thematic_analysis.llm import get_chat_completion_structured
from thematic_analysis.types import UserMessage, SystemMessage
from thematic_analysis.utils.cleaning import clean_answers, format_list

In [2]:
# Read in DF, pick a question column, and clean the answers
df = pl.read_csv(Path("../data/wallmart.csv"))

question = df.columns[2]
df = df.select(["Session ID", question]).rename({
    "Session ID" : "id",
    question : "answer"
})
df = clean_answers(df)
df.head(2)

id,answer
str,str
"""94f2d4c3-b513-411c-b505-a11290…","""The customer service that your…"
"""1797c6f2-c501-44b7-b549-a33c29…","""I would complain about how mor…"


## Initial Coding
Goal: play around with the output models and prompts to see what works.

In [3]:
# Create output models

class Code(BaseModel):
    response_id: str = Field(..., description="The id of the response this code was generated from.")
    topic: str = Field(..., description="The broad subject of the response (e.g., 'Customer Service', 'Pricing').")
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        ..., description="The sentiment of the response related to the topic."
    )
    specific_theme: str = Field(..., description="A specific attribute or issue described in the response.")

    def __str__(self) -> str:
        return f'(id="{self.response_id}", code="{self.topic}: {self.specific_theme}", sentiment="{self.sentiment}")'

class ReponseCodes(BaseModel):
    # scratch_pad: str = Field(..., description="An area for outputting your reasoning when generating codes.")
    codes: List[Code] = Field(..., description="A list of codes generated from survey responses.")

    def __str__(self) -> str:
        # Show each code on a new line.
        cls_name = self.__class__.__name__
        codes = "\n".join(str(code) for code in self.codes)
        return f"{cls_name}(\n{codes}\n)"

In [4]:
# Format responses
def format_list(responses: Iterable[str], ids: Iterable[str]) -> str:
    """
    Format survey responses and their ids, one "(id=..., response=...)" line each.
    """
    formatted_responses = [
        f"(id={r_id}, response={response})"
        # ids come first in the zip so each id is paired with its own response
        for r_id, response in zip(ids, responses)
    ]
    return "\n".join(formatted_responses)

formatted_answers = format_list(df["answer"], df["id"])

In [5]:
# Create prompts for initial codes

instructions = """
You are an expert in qualitative analysis.

You will be given a survey question and a list of responses. Your task is to categorize each response into one or more codes based on the topics it covers.

# **Instructions**:

For each response, follow these steps:

1. Identify the main topics mentioned in the response.
2. For each topic, determine:
    - Sentiment: Classify as positive, negative, or neutral based on the tone, wording, and context of the question.
    - Specific Theme: Create a clear, concise phrase capturing the core issue or attribute described.
3. Generate a code for each topic using the identified sentiment and theme.
4. Repeat this process until at least one code is assigned to each response.

# **Example**:

Response: "The product was fine, but the pricing felt a little high compared to competitors."

Generated codes:
```
[
    {
        "topic": "Product",
        "sentiment": "neutral",
        "specific_theme": "Acceptable quality"
    },
    {
        "topic": "Pricing",
        "sentiment": "negative",
        "specific_theme": "Higher than competitors"
    }
]
```

# **Guidelines**:
- If a response mentions multiple topics, generate separate codes for each one.
- If a topic expresses mixed sentiments, create a separate code for each sentiment.
- Ensure specific themes are precise and meaningful (e.g., "Difficult refund process" instead of just "Refunds").
- If a response is too vague, use "General feedback" as the topic.
- One response can generate one or more codes.

You are now going to be given the survey question and a list of the responses to code.
"""

survey_responses = """
# **Question**:
> "{question}"


# **Responses**:
{formatted_answers}
"""

In [6]:
# Call the LLM
prompts = [
    SystemMessage(content=instructions),
    UserMessage(content=survey_responses.format(question=question, formatted_answers=formatted_answers))
]
response = await get_chat_completion_structured(prompts, ReponseCodes)

In [7]:
# Check basic numbers

print(f"Number of survey responses: {df.shape[0]}")
print(f"Number of codes returned: {len(response.codes)}")

unique_ids = {code.response_id for code in response.codes}
print(f"Number of unique ids: {len(unique_ids)}")

Number of survey responses: 253
Number of codes returned: 123
Number of unique ids: 123


### Test batching
Goal: check whether the model does better when given fewer responses per call.

In [8]:
async def process_batch(batch: pl.DataFrame):
    print(f"Processing {batch.shape[0]} rows...")
    formatted_responses = format_list(batch["answer"], batch["id"])
    prompts = [
        SystemMessage(content=instructions),
        UserMessage(content=survey_responses.format(question=question, formatted_answers=formatted_responses))
    ]
    return await get_chat_completion_structured(prompts, ReponseCodes)

In [9]:
BATCH_SIZE = 50

# TODO: create a function that generates balanced batches given a max size (or just check itertools...)
batched_response_futures = [
    process_batch(df.slice(start, BATCH_SIZE))
    for start in range(0, df.height, BATCH_SIZE)
]
batched_responses = await asyncio.gather(*batched_response_futures)

Processing 50 rows...
Processing 50 rows...
Processing 50 rows...
Processing 50 rows...
Processing 50 rows...
Processing 3 rows...
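
The last batch above holds only 3 rows. A possible sketch of the balanced batching mentioned in the TODO, assuming we only need `(offset, length)` pairs to feed into `df.slice`. The function name `balanced_batch_bounds` is hypothetical.

```python
import math

def balanced_batch_bounds(n_rows, max_size):
    """Return (offset, length) pairs that split n_rows into near-equal batches."""
    if n_rows <= 0:
        return []
    n_batches = math.ceil(n_rows / max_size)
    # Every batch gets `base` rows; the first `extra` batches get one more.
    base, extra = divmod(n_rows, n_batches)
    bounds = []
    offset = 0
    for i in range(n_batches):
        length = base + (1 if i < extra else 0)
        bounds.append((offset, length))
        offset += length
    return bounds

bounds = balanced_batch_bounds(253, 50)
print([length for _, length in bounds])  # [43, 42, 42, 42, 42, 42]
```

Each pair could then be passed as `df.slice(offset, length)` in place of the fixed-size slices.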


In [10]:
# Check basic numbers

print(f"Number of survey responses: {df.shape[0]}")

response_code_numbers = [len(response.codes) for response in batched_responses]
print(f"Responses in each batch: {response_code_numbers}, Total codes: {sum(response_code_numbers)}")

all_ids = [code.response_id for response_codes in batched_responses for code in response_codes.codes]
print(f"Number of unique ids: {len(set(all_ids))}")

Number of survey responses: 253
Responses in each batch: [49, 51, 50, 52, 49, 6], Total codes: 257
Number of unique ids: 250


### Check ids
Goal: get a sense of how well using ids works.

In [11]:
real_ids = set(df["id"])
returned_ids = set(all_ids)

In [12]:
# Check intersection
print(f"Number of returned ids that match ids in the df: {len(real_ids & returned_ids)}")
print(f"Number of made up ids: {len(returned_ids - real_ids)}")
print(f"Number of missed ids: {len(real_ids - returned_ids)}")

print(f"Made up:")
print(returned_ids - real_ids)
print(f"Missed:")
print(real_ids - returned_ids)

Number of returned ids that match ids in the df: 247
Number of made up ids: 3
Number of missed ids: 6
Made up:
{'489b761c-386b-4f9b-a1b-1a1bdeb6baf2', 'e97b337c-d73b-4949-937e-d28db451a024', '5c070d8d-08bf-44b6-9b6d-4eb282e56530'}
Missed:
{'b90b66ce-4b2a-4547-a4e8-1bc9eabde81f', '489b761c-386b-4f9b-a009-4a1bdeb6baf2', 'e97b337c-d73c-4949-937e-d28db451a024', '5c070d8d-08bf-44ac-9b6d-4eb282e56530', '0e307185-0c9a-4a3d-8574-5dda17b6fed7', '5389cf13-5371-476f-a22f-f7f2ab5a9892'}
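
The three made-up ids are near-copies of three of the missed ids, so they look like transcription slips rather than pure inventions. A rough sketch of repairing them with the standard library's `difflib.get_close_matches`; the helper `repair_ids` and the 0.8 cutoff are assumptions, not something the notebook uses.

```python
from difflib import get_close_matches

def repair_ids(returned_ids, known_ids, cutoff=0.8):
    """Map each unknown returned id to its closest known id, or None if nothing is close."""
    known = list(known_ids)
    known_set = set(known)
    repairs = {}
    for rid in returned_ids:
        if rid in known_set:
            continue  # already a real id, nothing to repair
        matches = get_close_matches(rid, known, n=1, cutoff=cutoff)
        repairs[rid] = matches[0] if matches else None
    return repairs

known = [
    "489b761c-386b-4f9b-a009-4a1bdeb6baf2",
    "e97b337c-d73c-4949-937e-d28db451a024",
    "5c070d8d-08bf-44ac-9b6d-4eb282e56530",
    "b90b66ce-4b2a-4547-a4e8-1bc9eabde81f",
]
made_up = [
    "489b761c-386b-4f9b-a1b-1a1bdeb6baf2",
    "e97b337c-d73b-4949-937e-d28db451a024",
    "5c070d8d-08bf-44b6-9b6d-4eb282e56530",
]
repairs = repair_ids(made_up, known)
```

A repaired id should still be spot-checked against the response text, since a close string match does not prove the code belongs to that response.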


### Check response quality
Goal: get a sense of how well the codes match the responses and how good the codes are.

In [13]:
# Note: too few duplicates to worry about.. and we'll check them after
code_id_mapping = {code.response_id : code for response_codes in batched_responses for code in response_codes.codes}

In [14]:
# Note: too few duplicates to worry about...
code_id_mapping = {
    code.response_id : code
    for response_codes in batched_responses
    for code in response_codes.codes
}

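for _ in range(5):
    id_value, answer = df.sample(1).row(0)
    matched = code_id_mapping.get(id_value)
    if matched is None:
        # A few ids were missed by the model, so not every response has a code.
        print(f"ID: {id_value} has no matching code\n")
        continue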
    print(f"ID: {id_value}")
    print(f"Response: {answer}")
    print(f"Topic: {matched.topic}")
    print(f"Sentiment: {matched.sentiment}")
    print(f"Context: {matched.specific_theme}\n")

ID: 4264ccf6-2239-4241-a373-14f58398a1e7
Response: Wouldn't write a complaint letter.
Topic: General feedback
Sentiment: neutral
Context: No complaints

ID: d0fb022c-0793-426c-81fe-475d9edb4e4c
Response: Self checkout is a a pain, not enough personnel to check out"
Topic: Checkout Process
Sentiment: negative
Context: Issues with self-checkout and personnel availability

ID: 413662a7-1496-439f-9554-db4819f547de
Response: I miss the availability of being able to shop after midnight with out the hassle of having to rush be cuase you are closing.
Topic: Store Hours
Sentiment: negative
Context: Inconvenient closing hours

ID: 4a2ccec1-a19e-47a3-9174-edb010353a89
Response: Customer service in store. Only feels like employees r restocking shelf's and don't say hi or welcome or have a nice day.
Topic: Customer Service
Sentiment: negative
Context: Lack of employee engagement

ID: 97600a69-192b-4d64-9ae1-40a294e4b7f2
Response: That will be the bathroom they could be more clean. Normally they are
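
The dict comprehension above keeps only one code per `response_id`, so any response that did get several codes loses all but the last. A small sketch of grouping instead, assuming each code exposes a `response_id` attribute as `Code` does. The helper name and the `SimpleNamespace` stand-ins are illustrative only.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_codes_by_response(codes, get_id=lambda code: code.response_id):
    """Collect every code under its response id instead of overwriting duplicates."""
    grouped = defaultdict(list)
    for code in codes:
        grouped[get_id(code)].append(code)
    return dict(grouped)

# Stand-in objects with a hypothetical id "r1"; real Code instances would work the same way.
sample_codes = [
    SimpleNamespace(response_id="r1", topic="Product", specific_theme="Acceptable quality"),
    SimpleNamespace(response_id="r1", topic="Pricing", specific_theme="Higher than competitors"),
    SimpleNamespace(response_id="r2", topic="Store Hours", specific_theme="Inconvenient closing hours"),
]
grouped = group_codes_by_response(sample_codes)
multi_coded = [rid for rid, items in grouped.items() if len(items) > 1]
```

The `multi_coded` list also gives a direct count of how many responses received more than one code.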

## Grouping codes
Goal: play around with grouping strategies to find an approach that works well (hopefully).

In [15]:
class Theme(BaseModel):
    code_ids: List[str] = Field(
        ..., description="List of code IDs contributing to this theme."
    )
    theme: str = Field(
        ..., description="A concise name capturing the core issue (e.g., 'Long Checkout Lines')."
    )
    summary: str = Field(
        ..., description="A short explanation that highlights key trends and patterns within this theme."
    )
    sentiment: Literal["positive", "neutral", "negative", "mixed"] = Field(
        ..., description="Overall sentiment of the theme."
    )

    def __str__(self) -> str:
        return f'(theme="{self.theme}", summary="{self.summary}", sentiment="{self.sentiment}", n_codes={len(self.code_ids)})'

class ResponseThemes(BaseModel):
    themes: List[Theme] = Field(
        ..., description="A list of the main themes."
    )

In [16]:
# Create prompts for grouping codes into themes
# List the relevant code IDs: Include all the code IDs that are strongly associated with this theme.

instructions = """
You are part of a team conducting a thematic analysis of responses to a survey question.

Your task is to group related codes into distinct, meaningful themes.
The objective is to identify patterns and trends while ensuring that the themes are as distinct as possible, minimizing overlap between them.

# **Instructions**:

1. Identify Recurring Patterns: Review the codes and identify recurring patterns or themes. Look for consistent ideas, sentiments, or issues that appear across the responses.
2. Generate Themes:
    - For each identified pattern, create a theme name that captures the core idea.
    - Write a summary that highlights the key trend or insight represented by the theme.
    - Assign an overall sentiment (e.g., positive, negative, neutral) based on the pattern.
    - List the code IDs: Include all the code IDs that are associated with this theme.
3. Repeat this process until you’ve identified all the main themes.

# **Guidelines**:
- Grouping Codes: Focus on identifying clear patterns in the responses. Consider:
    - Similar concepts: Are the codes describing similar ideas or experiences (e.g., long wait times, slow service)?
    - Common sentiment: Do the codes express a shared sentiment or feeling (e.g., frustration, satisfaction)?
    - Similar actions or issues: Do the codes relate to similar actions or problems (e.g., technical issues, product quality concerns)?
- Keep Themes Distinct: Aim for clarity and ensure the themes are distinct.
- Extract Actionable Insights: Focus on identifying actionable trends that can help improve products, services, or customer experience.
- Comprehensiveness: Ensure that all major trends or patterns in the responses are captured.

You are now going to be given the list of codes to group.
"""

In [17]:
formatted_codes = "\n".join([str(code) for response_codes in batched_responses for code in response_codes.codes])

In [18]:
# Call the LLM
prompts = [
    SystemMessage(content=instructions),
    UserMessage(content=formatted_codes)
]
themes = await get_chat_completion_structured(prompts, ResponseThemes)

In [19]:
print(len(themes.themes))
for theme in themes.themes:
    print(theme)

5
(theme="Customer Service Issues", summary="A significant number of responses highlight dissatisfaction with customer service, including rude, unhelpful, and poorly trained staff. Many customers express frustration over the lack of assistance and negative experiences during their visits.", sentiment="negative", n_codes=10)
(theme="Checkout Process Challenges", summary="Many customers report issues with the checkout process, including long lines, insufficient cashiers, and frustrations with self-checkout options. This theme indicates a need for improved efficiency and staffing during peak hours.", sentiment="negative", n_codes=10)
(theme="Pricing and Product Availability", summary="Responses indicate concerns about pricing, including high prices and lack of promotions, as well as frequent stock shortages and product availability issues. Customers express a desire for better pricing strategies and more consistent stock levels.", sentiment="negative", n_codes=10)
(theme="Store Environmen

### Test getting a short list of themes, and then creating theme objects
Another idea is to do a first pass that just identifies the main themes it sees, and then in the next step ask it to generate a theme object for each theme.

In [20]:
instructions = """
You are an expert in qualitative analysis and thematic coding.

You will be given a list of codes generated from responses to a survey question. Your task is to analyze these codes and return a list of the main themes found within them.
Each theme should:
- Be short and concise, getting to the key point of the theme.
- Be distinct from other themes.
- Represent a key pattern or trend that emerged from the codes, even if that theme is positive or neutral, not just issues.
- Be ordered by frequency of occurrence, with the most common theme listed first.

The themes should capture the core idea reflected by the codes, and you should avoid creating overly broad or vague themes. Be specific, and focus on the key recurring concepts or patterns.
"""

class ThemeList(BaseModel):
    themes: List[str] = Field(...,description="A list of distinct themes")

prompts = [
    SystemMessage(content=instructions),
    UserMessage(content=formatted_codes)
]
themes = await get_chat_completion_structured(prompts, ThemeList)

In [21]:
themes

ThemeList(themes=['Customer Service Issues', 'Checkout Process Challenges', 'Store Cleanliness Concerns', 'Stock Availability Problems', 'Employee Treatment and Staffing', 'Pricing and Value Perception', 'General Feedback and Suggestions', 'Product Quality and Variety', 'Store Organization and Navigation', 'Accessibility and Convenience'])

In [22]:
instructions = """
You are an expert in qualitative analysis and thematic coding.

You will be given a list of codes generated from responses to a survey question. Additionally, you will receive a list of themes.

Your task is to identify the codes that are relevant to each theme and use them to add context and detail to each theme.

# **Instructions**:

For each of the provided themes:
1. Identify Related Codes: Find all codes that are relevant to the theme.
2. Refine the Theme Name: Update the theme name, if needed, to better capture its essence based on the related codes.
3. Write a Summary: Write a brief summary that highlights the key trend or insight from the related codes, ensuring it’s relevant to the survey question.
4. Assess Sentiment: Assign an overall sentiment to the theme.
5. List Code IDs: List all the code IDs associated with the theme.
6. Repeat for each provided theme.

# **Guidelines**:
- Group Relevant Codes: Only include codes that clearly fit the theme.
- Ensure Theme Clarity: The theme name, summary, and sentiment should clearly reflect the patterns in the codes.
- Contextual Relevance: Ensure that the summaries are specifically relevant to the survey question, capturing the key insights or trends from the responses.
- Maintain Theme Distinctness: Each theme should remain distinct from others with minimal overlap.

You will now be provided with the survey question, a list of codes, and a list of themes.
"""

survey_responses = """
# **Question**:
"{question}"

# **Themes**:
{formatted_themes}

# **Codes**:
{formatted_codes}
"""


In [23]:
formatted_themes = ", ".join(themes.themes)

In [24]:
# Call the LLM
prompts = [
    SystemMessage(content=instructions),
    UserMessage(content=survey_responses.format(question=question, formatted_themes=formatted_themes, formatted_codes=formatted_codes))
]
response = await get_chat_completion_structured(prompts, ResponseThemes)

In [25]:
for t in response.themes:
    print(t)
    print("\n")

(theme="Customer Service Issues", summary="A significant number of respondents expressed dissatisfaction with customer service, highlighting experiences of rudeness, unhelpfulness, and inadequate support from staff. Many noted that employees appeared indifferent or poorly trained, leading to a frustrating shopping experience. This sentiment reflects a broader concern about the quality of service provided at Walmart, which is crucial for customer retention and satisfaction.", sentiment="negative", n_codes=20)


(theme="Checkout Process Challenges", summary="Many respondents reported significant challenges during the checkout process, including long lines, insufficient open registers, and frustrations with self-checkout options. The need for more cashiers and better management of checkout lanes was a common theme, indicating that the current system is not meeting customer needs effectively. This can lead to a negative shopping experience and deter customers from returning.", sentiment="n

### Quick check of ids
Goal: check the ids returned with the themes; I'd be surprised if the majority are correct.

In [50]:
theme_ids = [code_id for theme in response.themes for code_id in theme.code_ids]

real_ids = set(df["id"])
returned_ids = set(theme_ids)

In [51]:
# Check intersection
print(f"Number of returned ids that match ids in the df: {len(real_ids & returned_ids)}")
print(f"Number of made up ids: {len(returned_ids - real_ids)}")

print(f"Made up:")
print(returned_ids - real_ids)

Number of returned ids that match ids in the df: 61
Number of made up ids: 3
Made up:
{'5b863d22-3d3f-49bb-bf0572187a29', 'b8b83b3c-948f-44b2-b2d6-1d7b1fbe743c', 'cdd5ec1a-8d44-4587-8e11-e354b7428950'}
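
Only 61 known ids came back across all themes, so most responses were not attached to any theme. A sketch of a coverage check that reports unassigned, multi-assigned, and unknown ids; the function name `theme_coverage` and the toy ids are assumptions for illustration.

```python
from collections import Counter

def theme_coverage(theme_to_ids, known_ids):
    """Summarise which known ids are unassigned, in several themes, or not known at all."""
    known = set(known_ids)
    # Count each id once per theme, even if a theme lists it twice.
    counts = Counter(i for ids in theme_to_ids.values() for i in set(ids))
    return {
        "unassigned": known - set(counts),
        "multi_assigned": {i for i, n in counts.items() if n > 1 and i in known},
        "unknown": set(counts) - known,
    }

toy_themes = {
    "Checkout Process Challenges": ["a", "b", "x"],
    "Customer Service Issues": ["b", "c"],
}
report = theme_coverage(toy_themes, known_ids=["a", "b", "c", "d"])
```

In the notebook the first argument could be built as `{t.theme: t.code_ids for t in response.themes}` and the second taken from `df["id"]`.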


In [None]:
# {'489b761c-386b-4f9b-a1b-1a1bdeb6baf2', 'e97b337c-d73b-4949-937e-d28db451a024', '5c070d8d-08bf-44b6-9b6d-4eb282e56530'}

### Quick check of comments

In [54]:
test_theme = response.themes[1]
test_ids = test_theme.code_ids

In [56]:
for a in df.filter(
    pl.col("id").is_in(test_ids)
)["answer"]:
    print(a)

I would complain about how more registers need to be available throughout the day.
I would complain that there is such a long line at self checkout while there are so many unmanned unused registers to check out ar.
My biggest complaint is the lack of non self checkout registers. If I am going to be forced to use the self checkout lane due to regular checkout lanes being closed, I should get a good discount since I am doing the work of the cashier."
Why are there so many check out lines and only 1 is open other than the self checkout
NEVER enough cash registers open. We should not have to do our own check out and bagging, especially when there are employees just standing around."
It's dirty
Checkout lanes without checkers. 20 lanes and only 3 open
Eveytime I shop at wal-mart there are never enough cashiers and the lines are super long, its very frustrating."
More checkers available during peak shopping hours


In [57]:
test_theme

Theme(code_ids=['1797c6f2-c501-44b7-b549-a33c29224edc', '577ab489-2682-4ad6-83cf-9651d6c37c2c', '10b8a26f-ee5c-40ad-84fc-c279dc084816', 'b7420340-7dd0-4a00-b7ad-075b333644ee', 'cdd5ec1a-8d44-4587-8e11-e354b7428950', 'ed7d65ee-540d-4323-b73d-d393f0213c37', '6cdd4eb6-e013-470d-aaa0-137975ad67ee', 'dec5facf-30c4-47dd-b322-ebdeca833bab', 'c2977827-47cc-4f88-ab0e-8767c109b217', '73a54f02-1635-4727-817d-f3468fde2d9c'], theme='Checkout Process Challenges', summary='Many respondents reported significant challenges during the checkout process, including long lines, insufficient open registers, and frustrations with self-checkout options. The need for more cashiers and better management of checkout lanes was a common theme, indicating that the current system is not meeting customer needs effectively. This can lead to a negative shopping experience and deter customers from returning.', sentiment='negative')
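
The "It's dirty" answer listed under the checkout theme suggests id-based assignment is noisy. A rough sketch of the similarity check floated in the takeaways: rank answers by cosine similarity to the theme text. `toy_embed` is a bag-of-words stand-in, NOT a real embedding model, and all names here are hypothetical; a real sentence-embedding function could be passed in as `embed`.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in 'embedding': lowercase word counts. Swap in a real model for actual use."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_answers(theme_text, answers, embed=toy_embed):
    """Return (score, answer) pairs, most similar to the theme first."""
    theme_vec = embed(theme_text)
    scored = [(cosine(theme_vec, embed(a)), a) for a in answers]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

theme_text = "Checkout Process Challenges: long lines, insufficient open registers, self-checkout"
answers = [
    "Checkout lanes without checkers. 20 lanes and only 3 open",
    "It's dirty",
]
ranked = rank_answers(theme_text, answers)
```

Answers that score at or near zero against their assigned theme are candidates for manual review. Word overlap is a crude proxy, so the threshold would need tuning with real embeddings.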