# Tutorial 4: Theme generation
*This notebook is part of the [LLMCode library](https://github.com/PerttuHamalainen/LLMCode).*

*A note on data privacy: The user experience of this notebook is better on Google Colab, but if you are processing data that cannot be sent to Google and OpenAI servers, you should run this notebook locally using the "Aalto" LLM API.*

**Learning goals**

In this notebook, you'll learn to use LLMs to group qualitative codes under wider themes, and to communicate the resulting findings as a table, a sunburst chart, and an LLM-generated report.

**How to use this Colab notebook?**
* Select the LLM API and model to use below. The default values are recommended, but some of the examples may produce better quality results using the more expensive "gpt-4o" model. For details about the models, see [OpenAI documentation](https://platform.openai.com/docs/models).
* Select "Run all" from the Runtime menu above.
* Enter your API key below when prompted. This will be provided to you at the workshop. You can also create your own OpenAI account at https://platform.openai.com/signup. The initial free quota you get with the account should be enough for the exercises of this notebook. To create an API key, follow [OpenAI's instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).
* Proceed top-down, following the instructions.

**New to Colab notebooks?**

Colab notebooks are browser-based learning environments consisting of *cells* that include either text or code. The code is executed in a Google virtual machine instead of your own computer. You can run code cell-by-cell (click the "play" symbol of each code cell), and selecting "Run all" as instructed above is usually the first step to verify that everything works. For more info, see Google's [Intro video](https://www.youtube.com/watch?v=inN8seMm7UI) and [curated example notebooks](https://colab.google/notebooks/).


In [None]:
#Initial setup code. If you opened this notebook in Colab, this code is hidden
#by default to avoid unnecessary user interface clutter

#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
llm_API="OpenAI" # @param ["OpenAI", "Aalto"]
gpt_model="gpt-4o-mini" #@param ["gpt-4o-mini","gpt-4o"]


#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

#Import packages
import pandas as pd
import numpy as np
from IPython.display import HTML, Markdown, display, clear_output
import getpass
import os
import html
import lxml
import re
import plotly.graph_objects as go
import plotly.express as px
import textwrap
import random
import math
from itertools import chain
from collections import Counter, defaultdict
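try:
    from google.colab import files  # only available when running on Colab
except ImportError:
    files = None  # running locally, where the Colab upload helper does not exist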

original_dir = os.getcwd()

#determine if we are running in Colab
import sys
RunningInCOLAB = 'google.colab' in sys.modules
if RunningInCOLAB:
  import plotly.io as pio
  pio.renderers.default = "colab"
  if not os.path.exists("LLMCode"):
    if not os.getcwd().endswith("LLMCode"):
      print("Cloning the LLMCode repository...")
      #until the repo is public, we download this working copy instead of cloning
      #(shared as: anyone with the link can view)
      #!wget "https://drive.google.com/uc?export=download&id=1ylMQn96JuKBB-YU9mHLyEhtm6Qin1Kgh" -O LLMCode.zip
      #!mkdir LLMCode
      #!unzip -q LLMCode.zip -d LLMCode
      !git clone https://github.com/PerttuHamalainen/LLMCode.git
  if not os.getcwd().endswith("LLMCode"):
    os.chdir("LLMCode")
    print("Installing dependencies...")
    !pip install -r requirements_notebooks.txt
import llmcode

os.chdir(original_dir)

#Jupyter is already running an asyncio event loop => need this hack for async OpenAI API calling
import nest_asyncio
nest_asyncio.apply()

#Prompt the user for an API key if not provided via a system variable
clear_output()
if llm_API=="OpenAI":
    if os.environ.get("OPENAI_API_KEY") is None:
        print("Please input an OpenAI API key")
        api_key = getpass.getpass()
        os.environ["OPENAI_API_KEY"] = api_key
elif llm_API=="Aalto":
    if os.environ.get("AALTO_OPENAI_API_KEY") is None:
        print("Please input an Aalto OpenAI API key")
        api_key = getpass.getpass()
        os.environ["AALTO_OPENAI_API_KEY"] = api_key
else:
    print(f"Invalid API type: {llm_API}")

#Initialize the LLMCode library
llmcode.init(API=llm_API)
llmcode.set_cache_directory("data_exploration_cache")

# Themes

In this notebook, we use LLMs to group the codes you generated in the previous notebook "Inductive and deductive coding" under wider themes, as is often done as part of qualitative analysis.

Before we begin, let's take a moment to critically reflect on the role of agency and transparency when using LLMs for this purpose. While LLMs can efficiently organize large amounts of data, they may lack the nuanced understanding of human context and intentions. This raises concerns about researcher agency—how much control do researchers retain over the interpretation of their data? Similarly, the transparency of LLMs' decision-making processes is limited, making it difficult to trace how specific themes were generated, which may obscure valuable insights or introduce unintended biases. Balancing automation with researcher input is crucial to maintain rigor and interpretive depth in the analysis.

# Loading the data

For this notebook, you may choose to use the codes you generated in the previous notebook, or use a set of codes we have generated based on the [Games As Art](https://osf.io/ryvt6/) survey dataset. If you wish to use these default codes, run the following code with the default inputs.

**Loading your own codes**

To use your own codes, choose one of the `coded_texts` .csv files you created in the previous notebook. You may use the codes from any of the three methods explored there (inductive, inductive with code consistency, deductive), each of which was stored in a separate file.

Optionally, you may also upload a `coded_descriptions` .csv file containing descriptions for each of the codes in `coded_texts`, which may increase the accuracy of the theme generation. Descriptions were automatically generated for inductive coding with code consistency in the previous notebook. Please ensure that the `coded_descriptions` file matches the `coded_texts` file, i.e. both were generated together using the same LLMCode function.

Please upload the files using the file browser on the left and input the file names to the corresponding fields. Make sure to also define your own research question.
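
Before running the loading cell, it can help to check that an uploaded file has the columns the cell reads: `coded_text` for the coded texts, and `code` and `description` for the code descriptions. The snippet below is a minimal sketch of such a check using only the standard library. The helper `missing_columns` and the toy file contents are hypothetical and not part of LLMCode.

```python
import csv
import io

def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]

# Toy file contents standing in for uploaded files (hypothetical data)
coded_texts_csv = "text,coded_text\nI loved the visuals,I loved the visuals\n"
descriptions_csv = "code,definition\nbeauty,Appreciation of visual beauty\n"

print(missing_columns(coded_texts_csv, ["coded_text"]))            # []
print(missing_columns(descriptions_csv, ["code", "description"]))  # ['description']
```

To check a real file, read its contents with `open(path, newline="").read()` and pass the result to the helper.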

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
research_question = 'How do people experience games as art?' # @param {type:"string"}
coded_texts_file = "LLMCode/test_data/games_as_art/bopp_test_coded_texts.csv" # @param {type:"string"}
code_descriptions_file = "LLMCode/test_data/games_as_art/bopp_test_code_descriptions.csv" # @param {type:"string"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

coded_texts_df = pd.read_csv(coded_texts_file).dropna()  # Drop any nan values
coded_texts = coded_texts_df.coded_text.tolist()

if code_descriptions_file:
    code_descriptions_df = pd.read_csv(code_descriptions_file)
    code_descriptions = dict(zip(code_descriptions_df.code, code_descriptions_df.description))
else:
    code_descriptions = None

def plot_generated_codes(code_highlights, title):
    code_counts = [(code, len(highlights)) for code, highlights in code_highlights.items()]
    df_codes = pd.DataFrame(code_counts, columns=['Code', 'Count'])
    df_codes = df_codes.sort_values(by='Count', ascending=False).reset_index(drop=True)

    # Create a vertical bar plot using Plotly with angled x-axis labels
    fig = px.bar(df_codes, x='Code', y='Count', title=title)

    # Update layout to angle x-axis labels at 45 degrees
    fig.update_layout(xaxis_tickangle=-45)
    fig.show()

# Parse all codes and highlights in LLM output
code_highlights = llmcode.get_codes_and_highlights(coded_texts)
plot_generated_codes(code_highlights, 'All codes')

# Theme generation with LLMs: a simple example

Before getting into the LLMCode functions, let's first look at a simple example of how an LLM may be prompted to generate themes for a set of codes. In the prompt, we include a set of `example_themes` and the `codes` we would like the system to group under themes. Click "Show code" to view the entire prompt. You may also edit the prompt instructions and see what effect this has on the output.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

example_themes = "Appreciation of Craftsmanship and Aesthetics; Interactive Experience and Player Involvement" # @param {type:"string"}
codes = "novelty; player agency; realism; craftsmanship; sacrifice; setting; beauty"  # @param {type:"string"}

#-------------------------------------------------------------------
#Implementation. Feel free to edit the prompt

prompt = """You are an expert qualitative researcher. You are given a list of qualitative codes at the end of this prompt. Please carry out the following task:
- Group these codes into overarching themes.
- Assign codes to the themes provided in the list of user-defined themes and generate new themes when needed.
- The theme names should be detailed and expressive.
- Output a list of Theme objects, containing the theme name and a list of codes that are included in that theme. Start this list with the user-defined themes.
- Include each of the codes under exactly one theme.
- Give your output as valid JSON.

THEME EXAMPLES:
{}

CODES:
{}
""".format(
    "\n".join([s.strip() for s in example_themes.split(";")]),
    "\n".join([s.strip() for s in codes.split(";")])
)

response=llmcode.query_LLM(prompt, model=gpt_model)
print("LLM output:\n")
try:
    display(Markdown(response))
except Exception as e:
    print(e)
    print(response)

# Generating themes with LLMCode

We want the LLM to output data in a structured format, i.e. a list of themes, each containing a theme name and a sub-list of codes. LLM output is not always well-formed, so it is best to use LLMCode's `get_themes()` function for this task, as it automatically corrects any errors.
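
To make the kinds of errors concrete, here is a minimal sketch of how structured theme output could be checked against the input codes. It is an illustration only, not the implementation of `get_themes()`. The JSON layout (a list of objects with `name` and `codes` keys) and the helper `check_theme_output` are assumptions made for this example.

```python
import json

def check_theme_output(raw_output, codes):
    """Parse JSON theme output and report problems with the code assignment."""
    themes = json.loads(raw_output)  # raises json.JSONDecodeError on invalid JSON
    assigned = {}       # code -> theme name (first assignment wins)
    duplicates = set()  # codes placed under more than one theme
    unknown = set()     # codes that were not in the input list
    for theme in themes:
        for code in theme["codes"]:
            if code not in codes:
                unknown.add(code)
            elif code in assigned:
                duplicates.add(code)
            else:
                assigned[code] = theme["name"]
    unassigned = set(codes) - set(assigned)
    return assigned, unassigned, duplicates, unknown

# Hypothetical LLM output with three typical problems
raw = (
    '[{"name": "Aesthetics", "codes": ["beauty", "craftsmanship", "colour"]},'
    ' {"name": "Interaction", "codes": ["player agency", "beauty"]}]'
)
codes = {"beauty", "craftsmanship", "player agency", "novelty"}

assigned, unassigned, duplicates, unknown = check_theme_output(raw, codes)
print(unassigned)  # {'novelty'}: never placed under a theme
print(duplicates)  # {'beauty'}: placed under two themes
print(unknown)     # {'colour'}: not one of the input codes
```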

For example, the LLM may sometimes be unable to assign all codes to a theme in one pass. With long inputs, in this case a potentially lengthy list of codes, the attention mechanism underlying LLMs may not be able to "focus" on all of the codes at once. One solution is to solve the task iteratively: setting the function's `max_retries` parameter to an integer N makes the function repeat the analysis up to N times for the codes that remain unassigned.
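
The iterative idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code behind `get_themes()`: the helper `assign_with_retries` is hypothetical, it assumes one initial pass followed by up to `max_retries` further passes, and `toy_assign` is a stand-in for an LLM call that only manages two codes per pass.

```python
def assign_with_retries(codes, assign_fn, max_retries):
    """Call assign_fn repeatedly on the codes that are still unassigned."""
    themes = {}
    remaining = set(codes)
    # One initial pass plus up to max_retries further passes (an assumption)
    for _ in range(1 + max_retries):
        if not remaining:
            break
        for theme, theme_codes in assign_fn(remaining).items():
            # Keep only codes that are still waiting for a theme, without repeats
            valid = [c for c in dict.fromkeys(theme_codes) if c in remaining]
            themes.setdefault(theme, []).extend(valid)
            remaining -= set(valid)
    return themes, remaining

# Hypothetical code-to-theme lookup used by the stand-in below
LOOKUP = {
    "beauty": "Aesthetics",
    "realism": "Aesthetics",
    "novelty": "Originality",
    "player agency": "Interaction",
}

def toy_assign(remaining):
    """Stand-in for an LLM call: themes only the first two codes alphabetically."""
    result = {}
    for code in sorted(remaining)[:2]:
        result.setdefault(LOOKUP[code], []).append(code)
    return result

themes_once, left_once = assign_with_retries(LOOKUP, toy_assign, max_retries=0)
themes_retry, left_retry = assign_with_retries(LOOKUP, toy_assign, max_retries=1)
print(left_once)   # two codes remain unassigned after a single pass
print(left_retry)  # set(): the retry pass picks up the remaining codes
```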

If you already have some themes in mind, you may write these below in `prior_themes` separated by a semicolon.

Because of the complexity of this task, we recommend the GPT-4o model over smaller models such as GPT-4o-mini. Larger models are able to utilise attention more effectively, and can therefore handle a larger number of codes in a single pass.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

prior_themes = "" # @param {"type":"string","placeholder":"Themes separated by ;"}
max_retries = 3 # @param {"type":"integer"}
gpt_model_for_themes = "gpt-4o" # @param ["gpt-4o-mini", "gpt-4o"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

prior_themes = [t.strip() for t in prior_themes.split(";")] if prior_themes else []
codes = set(code_highlights.keys())

themes, unthemed_codes = llmcode.get_themes(
    codes=codes,
    prior_themes=prior_themes,
    code_descriptions=code_descriptions,
    max_retries=max_retries,
    research_question=research_question,
    gpt_model=gpt_model_for_themes,
)

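# Use a separate loop variable so that the `codes` set defined above is not overwritten
for theme, theme_codes in themes.items():
    print(f"Theme: {theme}")
    print("Codes: " + "; ".join(theme_codes))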
    print("")

if unthemed_codes:
    print(f"{len(unthemed_codes)} codes weren't assigned a theme: " + "; ".join(unthemed_codes))
else:
    print("All codes were assigned a theme.")

# Communicating the findings

Finally, we take a look at three approaches to communicating the research findings.

## Approach 1: Table

The first approach, often seen in research papers, involves producing a table of all the identified themes, and for each theme:
* the list of included codes;
* the number of mentions, calculated as the total count of all highlights across the input texts annotated by any of the included codes; and
* example quotations chosen from the highlights: you can choose the number of quotations by changing `quotes_per_theme`.
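
As a small worked example of the mention count, the snippet below applies the same summation to toy data. The themes, codes, and quotations are made up for illustration.

```python
# Toy data: each code maps to the list of highlights annotated with it
code_highlights = {
    "beauty": ["the visuals were stunning", "gorgeous scenery"],
    "craftsmanship": ["every detail felt deliberate"],
    "player agency": ["my choices mattered"],
}
themes = {
    "Aesthetics": ["beauty", "craftsmanship"],
    "Interaction": ["player agency"],
}

# Mentions per theme = total number of highlights over all codes in the theme
theme_mentions = {
    theme: sum(len(code_highlights[code]) for code in theme_codes)
    for theme, theme_codes in themes.items()
}
print(theme_mentions)  # {'Aesthetics': 3, 'Interaction': 1}
```

Note that with this definition, a passage annotated with two codes from the same theme contributes two mentions to that theme.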

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

quotes_per_theme = 3 # @param {"type":"integer"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

def sample_quotes(theme_codes, code_highlights, quotes_per_theme):
    all_quotes = list(chain(*(code_highlights[code] for code in theme_codes)))
    quotes = random.sample(all_quotes, min(len(all_quotes), quotes_per_theme))
    return "<br>".join("\"{}\"".format(quote) for quote in quotes)

# Produce table
theme_mentions = {theme: sum(len(code_highlights[code]) for code in themes[theme]) for theme in themes}
theme_examples = {theme: sample_quotes(themes[theme], code_highlights, quotes_per_theme) for theme in themes}
theme_data = [(theme, "<br>".join(codes), theme_mentions[theme], theme_examples[theme]) for theme, codes in themes.items()]
df_themes = pd.DataFrame(theme_data, columns=["theme", "codes", "mentions", "example quotations"])

# Render the table as HTML so that the <br> line breaks inside the cells are displayed
display(HTML(df_themes.to_html(escape=False)))

## Approach 2: Sunburst chart

We might want to communicate the findings in a more visual and interactive manner. One option is to produce a sunburst chart, which is apt for visualising hierarchical data. The inner level visualises the themes and the outer level their associated codes, while the sizes of the segments correspond to the number of mentions for each theme and code. You may hover your mouse over the code segments to reveal a randomly chosen example quotation for each.

In [None]:
#@title Produce sunburst chart

code_mentions = {code: len(code_highlights[code]) for theme in themes for code in themes[theme]}
code_examples = {code: random.choice(code_highlights[code]) for theme in themes for code in themes[theme]}
code_data = [(theme, code, code_mentions[code], code_examples[code]) for theme, codes in themes.items() for code in codes]
df_codes = pd.DataFrame(code_data, columns=["theme", "code", "mentions", "example quotation"])

# Function to add line breaks to long quotations
def format_quotation(quotation, max_length=60):
    words = quotation.split()
    lines = []
    line = ""
    for word in words:
        if len(line) + len(word) + 1 > max_length:
            lines.append(line)
            line = word
        else:
            line += " " + word if line else word
    lines.append(line)  # Add the last line
    return "<br>".join(lines)

# Apply the function to format the quotations
df_codes["example quotation"] = df_codes["example quotation"].apply(format_quotation)

# Create the sunburst chart with custom data for hover
fig = px.sunburst(df_codes, path=['theme', 'code'], values='mentions', title="Themes and codes", hover_data=["example quotation"])

# Add custom data (quotations) for hover information
fig.update_traces(
    hovertemplate='<b>%{label}</b><br>Mentions: %{value}<br>Example quotation: %{customdata[0]}'
)

# Update layout for larger display
fig.update_layout(width=800, height=800)

fig.show()

## Approach 3: LLM-generated research report

We can also ask LLMCode to write a report about the findings, using the themes, codes, and quotations as inputs. In academic writing, it is absolutely crucial that the LLM does not make up or "hallucinate" incorrect information. The `write_report` function automatically checks for and removes any hallucinated quotations from the output, but it is important for you as a researcher to verify that the findings here reflect your personal insights about the data. What other potential issues do you see with using LLMs to communicate research findings?
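
To illustrate what a quotation check involves, here is a minimal sketch that flags quoted passages which do not occur verbatim in the source highlights. It is not the check that `write_report` performs. The helper `find_unverified_quotes` is hypothetical, and it assumes that quotations are delimited by straight double quotes and that differences in case and whitespace can be ignored.

```python
import re

def find_unverified_quotes(report, source_texts):
    """Return quoted passages in the report that occur in none of the source texts."""
    def normalise(s):
        # Ignore case and collapse runs of whitespace
        return " ".join(s.lower().split())

    sources = [normalise(text) for text in source_texts]
    quotes = re.findall(r'"([^"]+)"', report)
    return [q for q in quotes if not any(normalise(q) in s for s in sources)]

# Toy data (made up for illustration)
highlights = ["The visuals were stunning", "my choices mattered"]
report = (
    'One participant said "the visuals were stunning", '
    'while another felt that "the story changed my life".'
)

print(find_unverified_quotes(report, highlights))  # ['the story changed my life']
```

A check like this only catches invented quotations. It cannot tell whether a genuine quotation is used to support a claim that the data does not actually back up, which is why your own verification remains necessary.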

You may choose the themes for the report yourself or leave `themes_for_report` empty, in which case the report will cover the four themes with the most mentions.

**Choosing the themes yourself**

Generating the report takes some time, so if you choose the themes yourself, please specify only up to four themes, separated by a semicolon in `themes_for_report`. Make sure to choose these themes from the LLM-generated themes above. Each chosen theme should have at least three mentions in the input texts, as these will be used to illustrate the themes with quotations.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

themes_for_report = "" # @param {"type":"string","placeholder":"Themes separated by ;"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

themes_for_report = [t.strip().lower() for t in themes_for_report.split(";")] if themes_for_report else None

if themes_for_report:
    for theme in themes_for_report:
        if theme not in [t.lower() for t in themes]:
            raise ValueError("Theme '{}' not found in the generated themes. Choose from the themes: {}".format(
                theme, "; ".join(themes.keys())
            ))
    themes_for_report = {theme: codes for theme, codes in themes.items() if theme.lower() in themes_for_report}
else:
    themes_for_report = themes

report = llmcode.write_report(
    themes=themes_for_report,
    code_highlights=code_highlights,
    max_themes=4,
    research_question=research_question,
    gpt_model=gpt_model
)

display(Markdown(report))