In [None]:
import pandas as pd
import numpy as np
import llmcode
from IPython.display import HTML, Markdown, display
import getpass
import os
import html
import lxml
import re
import plotly.graph_objects as go
import plotly.express as px
import textwrap
import random
import math
import json
from itertools import chain
from collections import Counter, defaultdict

Initialise the LLMCode library. Set the llm_API variable to "Aalto" to use Aalto's GDPR-safe Azure OpenAI API endpoints, which are suitable for processing confidential data. Here, we use the OpenAI API because it is faster and makes this notebook usable for people outside Aalto.

When you run the code, it will ask you to input an appropriate API key.

In [None]:
llm_API="OpenAI"
if llm_API=="OpenAI":
    if os.environ.get("OPENAI_API_KEY") is None:
        print("Please input an OpenAI API key")
        api_key = getpass.getpass()
        os.environ["OPENAI_API_KEY"] = api_key
elif llm_API=="Aalto":
    if os.environ.get("AALTO_OPENAI_API_KEY") is None:
        print("Please input an Aalto OpenAI API key")
        api_key = getpass.getpass()
        os.environ["AALTO_OPENAI_API_KEY"] = api_key
else:
    # Stop here instead of initialising the library with an unsupported API type
    raise ValueError(f"Invalid API type: {llm_API}")
llmcode.init(API=llm_API)

In [None]:
# Jupyter is already running an asyncio event loop => need this hack for async OpenAI API calling
import nest_asyncio
nest_asyncio.apply()

In [None]:
# Choose the GPT model to use
gpt_model = "gpt-4o-mini"

# Load codes

Let's start with defining the research question and loading the codes you created in the previous notebook.

In [None]:
research_question = "How do people experience games as art?"

In [None]:
coded_texts_path = "coding_output/coded_texts_inductive_with_code_consistency.csv"
code_descriptions_path = "coding_output/code_descriptions_inductive_with_code_consistency.csv"

coded_texts_df = pd.read_csv(coded_texts_path)
coded_texts = coded_texts_df.coded_text.tolist()

code_descriptions_df = pd.read_csv(code_descriptions_path)
code_descriptions = dict(zip(code_descriptions_df.code, code_descriptions_df.description))

In [None]:
# Parse all codes and highlights in LLM output
code_highlights = llmcode.get_codes_and_highlights(coded_texts)

# Print all LLM-created codes
print("\nLLM-generated codes:\n")
for code, highlights in sorted(code_highlights.items()):
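    # Fall back to a placeholder if a code has no entry in the descriptions file
    print(f"{code} ({len(highlights)}): {code_descriptions.get(code, '(no description)')}\n")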

# Generate themes

In this notebook, we use LLMs to group the codes you generated in the previous notebook under wider themes, as is often done as part of qualitative analysis.

Before we begin, let's take a moment to critically reflect on the role of agency and transparency when using LLMs for this purpose. While LLMs can efficiently organize large amounts of data, they may lack the nuanced understanding of human context and intentions. This raises concerns about researcher agency—how much control do researchers retain over the interpretation of their data? Similarly, the transparency of LLMs' decision-making processes is limited, making it difficult to trace how specific themes were generated, which may obscure valuable insights or introduce unintended biases. Balancing automation with researcher input is crucial to maintain both rigor and interpretive depth in the analysis.

## Simple example

Let's first look at a simple example of how an LLM may be prompted to generate themes for a set of codes. Feel free to modify the prompt to see how the output changes.

In [None]:
prompt = """You are an expert qualitative researcher. You are given a list of qualitative codes at the end of this prompt. Please carry out the following task:
- Group these codes into overarching themes that relate to the research question.
- Assign codes to the themes provided in the list of user-defined themes and generate new themes when needed.
- The theme names should be detailed and expressive.
- Output a list of Theme objects, containing the theme name and a list of codes that are included in that theme. Start this list with the user-defined themes.
- Include each of the codes under exactly one theme.
- Give your output as valid JSON.

USER-DEFINED THEMES:
Appreciation of Craftsmanship and Aesthetics
Interactive Experience and Player Involvement

CODES:
novelty
player agency
realism
craftsmanship
sacrifice
setting
beauty

RESEARCH QUESTION: How do people experience games as art?
"""

response=llmcode.query_LLM(prompt, model=gpt_model)
print("LLM output:\n")
try:
    display(Markdown(response))
except Exception as e:
    print(e)
    print(response)
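
Because the prompt asks for JSON and for each code to appear under exactly one theme, it can be useful to check the reply programmatically rather than by eye. The following is a minimal sketch, not part of LLMCode: the helper names are hypothetical, the key names "theme" and "codes" are an assumption (the prompt above does not fix them), and the example reply is made up for illustration.

```python
import json
from collections import Counter

def parse_theme_list(response_text):
    """Extract the JSON list from an LLM reply and return {theme name: [codes]}.

    Assumes each Theme object uses the keys "theme" and "codes".
    """
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("No JSON list found in the response")
    theme_objects = json.loads(response_text[start:end + 1])
    return {t["theme"]: list(t["codes"]) for t in theme_objects}

def check_assignment(themes, codes):
    """Return (unassigned codes, codes assigned more than once, codes not in the input)."""
    counts = Counter(code for theme_codes in themes.values() for code in theme_codes)
    missing = sorted(set(codes) - set(counts))
    duplicated = sorted(code for code, n in counts.items() if n > 1)
    unknown = sorted(set(counts) - set(codes))
    return missing, duplicated, unknown

# Hypothetical reply, for illustration only (not real model output)
example_response = """Here are the themes:
[
  {"theme": "Appreciation of Craftsmanship and Aesthetics",
   "codes": ["craftsmanship", "beauty", "realism"]},
  {"theme": "Interactive Experience and Player Involvement",
   "codes": ["player agency", "sacrifice", "beauty"]}
]"""
example_codes = ["novelty", "player agency", "realism", "craftsmanship",
                 "sacrifice", "setting", "beauty"]

example_themes = parse_theme_list(example_response)
missing, duplicated, unknown = check_assignment(example_themes, example_codes)
print("Unassigned:", missing)
print("Assigned more than once:", duplicated)
print("Not in the input list:", unknown)
```

In this made-up reply, two codes are left without a theme and one appears twice, which is the kind of imperfection the next section deals with.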

## Generating themes with LLMCode

We want the LLM to output data in a structured format, i.e. a list of themes, each containing a theme name and a sub-list of codes. LLM output is not always perfect, so it is best to use LLMCode's get_themes() function for this task, which automatically corrects such cases.

For example, the LLM may sometimes be unable to assign all codes to a theme in one pass. Particularly with long inputs, in this case a potentially lengthy list of codes, the attention mechanism underlying LLMs may not be able to "focus" on all of the codes at once. One solution is to solve the task iteratively: we can set max_retries to an integer N to make the function repeat the analysis up to N times for the unassigned codes.
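
To make the iterative idea concrete, here is a rough sketch of a retry loop. It is not LLMCode's actual implementation: assign_with_retries and group_once are hypothetical names, and group_once stands in for a single LLM call that returns a {theme: [codes]} mapping for whatever subset of the pending codes it managed to place.

```python
def assign_with_retries(codes, group_once, max_retries=0):
    """Repeat the grouping step for codes that are still unassigned.

    group_once(pending_codes, existing_theme_names) is a hypothetical callable
    that returns {theme name: [codes]} for some subset of pending_codes.
    """
    themes = {}
    unassigned = set(codes)
    # One initial pass plus up to max_retries further passes
    for _ in range(max_retries + 1):
        if not unassigned:
            break
        new_groups = group_once(sorted(unassigned), list(themes))
        for theme, theme_codes in new_groups.items():
            # Ignore codes that are already themed or were never in the input
            valid = set(theme_codes) & unassigned
            if valid:
                themes.setdefault(theme, set()).update(valid)
                unassigned -= valid
    return themes, unassigned

# Stand-in for an LLM call that only manages two codes per pass
def fake_group_once(pending, existing_themes):
    return {f"Theme {len(existing_themes) + 1}": pending[:2]}

example_codes = ["a", "b", "c", "d", "e"]
one_pass = assign_with_retries(example_codes, fake_group_once, max_retries=0)
three_passes = assign_with_retries(example_codes, fake_group_once, max_retries=2)
print(one_pass)
print(three_passes)
```

With no retries, the stand-in leaves three codes unassigned; with two retries, every code ends up under a theme.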

Alternatively, you may wish to take part in assigning the unthemed codes yourself, in what is referred to as the [human-in-the-loop](https://en.wikipedia.org/wiki/Human-in-the-loop) approach. To do so, you may set max_retries=0 and assign the codes in unthemed_codes to the themes yourself.
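
A manual assignment step could look roughly like the sketch below. It assumes that themes is a dictionary mapping theme names to collections of codes and that unthemed_codes is a collection of code names; assign_manually is a hypothetical helper, and the example values are made up.

```python
def assign_manually(themes, unthemed_codes, manual_assignments):
    """Return updated copies of themes and unthemed codes after manual assignment.

    manual_assignments maps each code to the theme the researcher chose for it.
    """
    updated = {theme: set(codes) for theme, codes in themes.items()}
    remaining = set(unthemed_codes)
    for code, theme in manual_assignments.items():
        if code not in remaining:
            raise KeyError(f"{code!r} is not among the unthemed codes")
        # Create the theme if the researcher introduces a new one
        updated.setdefault(theme, set()).add(code)
        remaining.discard(code)
    return updated, remaining

# Made-up values standing in for the output of a run with max_retries=0
example_themes = {"Appreciation of Craftsmanship and Aesthetics": {"beauty", "craftsmanship"}}
example_unthemed = {"sacrifice", "setting", "novelty"}

example_themes, example_unthemed = assign_manually(
    example_themes,
    example_unthemed,
    {
        "setting": "Appreciation of Craftsmanship and Aesthetics",
        "sacrifice": "Emotional Weight of Player Choices",  # a new, researcher-defined theme
    },
)
print(example_themes)
print(example_unthemed)
```

Returning copies rather than modifying the inputs makes it easy to compare the LLM's grouping with your own before committing to either.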

The following list gives an overview of some of the customisable parameters to get_themes():

* codes (list): A list of codes to be thematically grouped.
* prior_themes (dict or list): Existing themes to which the LLM should add new codes. If a dict is provided, keys are theme names and values are sets of codes already associated with each theme. If a list is provided, it's assumed to be a list of theme names with no associated codes.
* code_descriptions (dict, optional): Optional descriptions for each code, providing additional context to the LLM.
* max_retries (int, optional): Maximum number of retries to attempt assigning themes to all codes. Defaults to 0 (no retries).

Because of the complexity of this task, we recommend using the more advanced GPT-4o model here rather than smaller models like GPT-4o-mini. Larger models generally utilise attention more effectively and can therefore handle a larger amount of input at once.

In [None]:
# Initialise the themes as a list of theme names, or use an empty dictionary {} to have the LLM make the first pass
themes = ["How games fit within a traditional view of art"]

In [None]:
codes = set(code_highlights.keys())

themes, unthemed_codes = llmcode.get_themes(
    codes=codes,
    prior_themes=themes,
    code_descriptions=code_descriptions,
    max_retries=3,
    research_question=research_question,
    gpt_model="gpt-4o",
)

Let's take a look at the generated themes, and check if any codes were left without a theme.

In [None]:
for theme, codes in themes.items():
    print(f"Theme: {theme}")
    print("Codes: " + "; ".join(codes))
    print("")

if unthemed_codes:
    print(f"{len(unthemed_codes)} codes weren't assigned a theme: " + "; ".join(unthemed_codes))
else:
    print("All codes were assigned a theme.")

# Communicating the findings

Finally, we take a look at two approaches to communicating the research findings. The first is simply producing a table of all the identified themes, and for each theme:
* the list of included codes;
* the number of mentions, calculated as the total count of all segments across the input texts annotated by any of the included codes; and
* an example quotation chosen from the annotated segments.

In [None]:
# Produce table
# Number of mentions: total highlighted segments across all codes in the theme
theme_mentions = {theme: sum(len(code_highlights[code]) for code in codes) for theme, codes in themes.items()}
# Example quotation: one highlight picked at random from the theme's codes
theme_examples = {theme: random.choice(list(chain.from_iterable(code_highlights[code] for code in codes))) for theme, codes in themes.items()}
theme_data = [(theme, "; ".join(codes), theme_mentions[theme], theme_examples[theme]) for theme, codes in themes.items()]
pd.DataFrame(theme_data, columns=["theme", "codes", "mentions", "example quotation"])

We can also ask LLMCode to write a report about the findings, using the themes, codes, and quotations as inputs. In academic writing, it is absolutely crucial that the LLM does not make up or "hallucinate" incorrect information. The write_report function automatically checks for and removes any hallucinated quotes from the output, but it is important for you as a researcher to verify that the findings here reflect your personal insights about the data. What other potential issues do you see with using LLMs to communicate research findings?

The following list gives an overview of some of the customisable parameters to write_report():
* themes (dict): A dictionary where keys are theme names and values are sets of codes that belong to each theme.
* code_highlights (dict): A dictionary where keys are codes and values are lists of highlights (quotes or text) related to those codes.
* max_themes (int, optional): The maximum number of themes to include in the report. If None, all themes with at least three highlights are included. Defaults to None.

In [None]:
report = llmcode.write_report(
    themes=themes,
    code_highlights=code_highlights,
    max_themes=4,
    research_question=research_question,
    gpt_model=gpt_model
)

In [None]:
display(Markdown(report))
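
As an independent check on the quotations in a report, you could compare every quoted passage against the coded highlights. The sketch below is not how write_report works internally; find_unsupported_quotes is a hypothetical helper that assumes quotations appear in double quotes and treats a quotation as supported only if it occurs verbatim (ignoring case and whitespace) in at least one highlight. The example data are made up.

```python
import re

def normalise(text):
    """Collapse whitespace and lowercase, so that minor formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_unsupported_quotes(report_text, highlights, min_words=3):
    """Return quoted passages in the report that do not occur in any highlight."""
    sources = [normalise(h) for h in highlights]
    # Match passages in straight or curly double quotes
    quoted = re.findall(r'"([^"]+)"|“([^”]+)”', report_text)
    unsupported = []
    for straight, curly in quoted:
        quote = straight or curly
        if len(quote.split()) < min_words:
            continue  # skip short quoted terms such as code names
        needle = normalise(quote).rstrip(".,;:!?")
        if not any(needle in source for source in sources):
            unsupported.append(quote)
    return unsupported

# Made-up data for illustration
example_highlights = [
    "The music made me cry, it was so beautiful.",
    "I felt like I was inside a painting",
]
example_report = (
    'One participant said "it was so beautiful," while another described '
    'being "inside a painting of light and shadow".'
)
unsupported = find_unsupported_quotes(example_report, example_highlights)
print(unsupported)
```

With the real data, the highlights could be gathered from code_highlights, for example with list(chain.from_iterable(code_highlights.values())). A verbatim check like this cannot tell you whether a genuine quotation is used in a misleading context, so reading the report against the data remains necessary.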