# Inductive coding of Reddit data
*This notebook is part of the [LLMCode library](https://github.com/PerttuHamalainen/LLMCode).*

*A note on data privacy: The user experience of this notebook is better on Google Colab, but if you are processing data that cannot be sent to Google and OpenAI servers, you should run this notebook locally using the "Aalto" LLM API.*

**Learning goals**

In this notebook, you'll learn how to use LLMs to code qualitative data inductively and how to evaluate the human-likeness of the output.

**How to use this Colab notebook?**
* Select the LLM API as well as the GPT and embedding models to use below. The default values are recommended, but some of the examples may produce higher-quality results with the more expensive "gpt-4o" model. For details about the models, see [OpenAI documentation](https://platform.openai.com/docs/models).
* Select "Run all" from the Runtime menu above.
* Enter your API key below when prompted. This will be provided to you at the workshop. You can also create your own OpenAI account at https://platform.openai.com/signup. The initial free quota you get with the account should be enough for the exercises of this notebook. To create an API key, follow [OpenAI's instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).
* Proceed top-down, following the instructions.

**New to Colab notebooks?**

Colab notebooks are browser-based learning environments consisting of *cells* that include either text or code. The code is executed in a Google virtual machine instead of your own computer. You can run code cell-by-cell (click the "play" symbol of each code cell), and selecting "Run all" as instructed above is usually the first step to verify that everything works. For more info, see Google's [Intro video](https://www.youtube.com/watch?v=inN8seMm7UI) and [curated example notebooks](https://colab.google/notebooks/).


In [None]:
#Initial setup code. If you opened this notebook in Colab, this code is hidden
#by default to avoid unnecessary user interface clutter

#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
llm_API="OpenAI" # @param ["OpenAI", "Aalto"]
gpt_model="gpt-4o-mini" #@param ["gpt-4o-mini","gpt-4o"]
embedding_model = "text-embedding-3-small" #@param ["text-embedding-ada-002","text-embedding-3-small", "text-embedding-3-large"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

#Import packages
import pandas as pd
import numpy as np
from IPython.display import HTML, Markdown, display, clear_output
import getpass
import os
import html
import lxml
import re
import plotly.graph_objects as go
import plotly.express as px
import textwrap
import random
import math
from collections import Counter, defaultdict

original_dir = os.getcwd()

#determine if we are running in Colab
import sys
RunningInCOLAB = 'google.colab' in sys.modules
if RunningInCOLAB:
  from google.colab import files
  import plotly.io as pio
  pio.renderers.default = "colab"
  if not os.path.exists("LLMCode"):
    if not os.getcwd().endswith("LLMCode"):
      print("Cloning the LLMCode repository...")
      #until the repo is public, we download this working copy instead of cloning
      #(shared as: anyone with the link can view)
      #!wget "https://drive.google.com/uc?export=download&id=1ylMQn96JuKBB-YU9mHLyEhtm6Qin1Kgh" -O LLMCode.zip
      #!mkdir LLMCode
      #!unzip -q LLMCode.zip -d LLMCode
      !git clone https://github.com/PerttuHamalainen/LLMCode.git
  if not os.getcwd().endswith("LLMCode"):
    os.chdir("LLMCode")
    print("Installing dependencies...")
    !pip install -r requirements_notebooks.txt
import llmcode

os.chdir(original_dir)

#Jupyter is already running an asyncio event loop => need this hack for async OpenAI API calling
import nest_asyncio
nest_asyncio.apply()

#Prompt the user for an API key if not provided via a system variable
clear_output()
if llm_API=="OpenAI":
    if os.environ.get("OPENAI_API_KEY") is None:
        print("Please input an OpenAI API key")
        api_key = getpass.getpass()
        os.environ["OPENAI_API_KEY"] = api_key
elif llm_API=="Aalto":
    if os.environ.get("AALTO_OPENAI_API_KEY") is None:
        print("Please input an Aalto OpenAI API key")
        api_key = getpass.getpass()
        os.environ["AALTO_OPENAI_API_KEY"] = api_key
else:
    print(f"Invalid API type: {llm_API}")

#Initialize the LLMCode library
llmcode.init(API=llm_API)
llmcode.set_cache_directory("data_exploration_cache")

# Qualitative coding

Qualitative coding involves the analysis of texts through highlighting relevant parts of the text and assigning each highlight a code or a set of codes that communicate its relevance to the research. For example, a coded excerpt from the [Games As Art](https://osf.io/ryvt6/) research project may look like the following:

> When you take in consideration **the music, the graphic design, the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. Its in games like this when you are aware of be **playing an artwork, rather than just another game**<sup>comparison to conventional art forms</sup>.

In this example, the text has been annotated with two highlights that have been assigned the codes _audio_, _visuals_, _narrative_ (for the first highlight), and _comparison to conventional art forms_ (for the second highlight).

In this notebook, we accept files that have been coded using Microsoft Word. Make sure that the file is formatted according to the following requirements:
* Individual texts are separated by five dashes "-----"
* The texts are coded using [Word comments](https://support.microsoft.com/en-us/office/using-modern-comments-8d3f868a-867e-4df2-8c68-bf96671641e2) by highlighting the relevant part and typing the code(s) in the comment field. Multiple codes are separated by a semicolon ";".

While LLMCode contains functions that can automate much of the coding process, it requires at least a few human-coded examples (called _few-shot_ examples) to get started. Additionally, in this notebook, the LLM-generated codes are evaluated using human-generated codes as a baseline. Therefore, we request that you use the provided [manual coding tool](https://perttuhamalainen.github.io/LLMCode/) to code **at least 50 texts** before proceeding with this notebook. Please prepare two Excel files: one for the human-coded texts and one for any additional uncoded texts you would like to code using the LLMCode methods.

You may be curious about the code annotations in the coded Excel file you get as output from the manual coding tool. Because LLMs deal with unstructured text, we give them the code annotations in the [Markdown](https://en.wikipedia.org/wiki/Markdown) format. The example given above in Markdown, printed as a plain string, looks like the following:

> ```When you take in consideration **the music, the graphic design, the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. Its in games like this when you are aware of be **playing an artwork, rather than just another game**<sup>comparison to conventional art forms</sup>.```
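
To make the annotation format concrete, here is a minimal sketch of how such a string could be split into highlights and codes with a regular expression. The notebook itself relies on `llmcode.parse_codes()` for this; the `parse_annotations` helper below is a hypothetical stand-in written for illustration, not the library's implementation.

```python
import re

# Matches **highlight**<sup>code; code</sup>
ANNOTATION_RE = re.compile(r"\*\*(.+?)\*\*<sup>(.*?)</sup>", re.DOTALL)

def parse_annotations(coded_text):
    """Return a list of (highlight, [codes]) tuples found in coded_text.

    Hypothetical helper for illustration only; not part of LLMCode.
    """
    result = []
    for match in ANNOTATION_RE.finditer(coded_text):
        highlight = match.group(1)
        # Multiple codes are separated by semicolons
        codes = [c.strip() for c in match.group(2).split(";") if c.strip()]
        result.append((highlight, codes))
    return result

example = (
    "When you take in consideration **the music, the graphic design, "
    "the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. "
    "Its in games like this when you are aware of be **playing an artwork, "
    "rather than just another game**<sup>comparison to conventional art forms</sup>."
)

for highlight, codes in parse_annotations(example):
    print(codes, "->", highlight)
```

Each highlight is paired with the list of codes from the `<sup>` tag that immediately follows it.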

# Preliminaries
First, let's take a quick look at some demonstrations on how to prompt an LLM with Python code.

## Prompting an LLM using Python
Prompting an LLM is straightforward, as shown below. Note that **the lines starting with "#" are not code**, but *comments* that describe what the code below them is doing.

If you want to learn more about Python basics such as variables and functions, check out this [YouTube playlist](https://www.youtube.com/playlist?list=PLUaB-1hjhk8GHKfndKjyDMHPg_HlQ4vpK).

In [None]:
# Define the prompt and store it in a variable (a container for some data)
# called "my_prompt".
my_prompt="Hi!"

# Call the query_LLM() function from the LLMCode library.
# Functions are pieces of Python code that perform some functionality.
# Here, the query_LLM() function takes in the "prompts" and "model" parameters
# and sends the prompts to the LLM. The "gpt_model" is the model you defined above.
# The LLM response is stored in the "response" variable.
response = llmcode.query_LLM(prompts=my_prompt,
                             model=gpt_model)

# Print out the response.
print("LLM response:")
print(response)

## Simple coding example

We start exploring LLM code generation with a simple, prompt-only example, asking the model to code a single text. The prompt utilizes _few-shot learning_ as it includes some examples that "teach" the model how we want it to code the text. Feel free to edit the `text_to_code` or the `prompt` instructions and re-run the cell to see what effect this has on the output.

In [None]:
# The text that we will ask the LLM to code
text_to_code = "I was playing the Legend of Zelda: Breath of the Wild in a sad day. A sunset on a perfect landscape between two biomes almost made me cry. It was PERFECT." # @param {type:"string"}

# The prompt giving instructions on how to do the coding
prompt = """You are an expert qualitative researcher conducting a research project.
You are given a text to code inductively. Please carry out the following task:
- Respond by repeating the original text, but highlighting the coded statements by surrounding the statements with double asterisks, as if they were bolded text in a Markdown document.
- Include the associated code(s) immediately after the statement, separated by a semicolon and enclosed in <sup></sup> tags, as if they were superscript text in a Markdown document.
- Preserve exact formatting of the original text. Do not correct typos or remove unnecessary spaces.

Below, I first give you examples of the output you should produce given an example input. After that, I give you the actual input to process.

EXAMPLE INPUT:
When you take in consideration the music, the graphic design, the writing, and of course the gameplay. Its in games like this when you are aware of be playing an artwork, rather than just another game.

EXAMPLE OUTPUT:
When you take in consideration **the music, the graphic design, the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. Its in games like this when you are aware of be **playing an artwork, rather than just another game**<sup>comparison to conventional art forms</sup>.

EXAMPLE INPUT:
The protagonist has no fighting skills. The environment (empty space) is naturally hostile and makes the player feel isolated an alone. The style of the space ship feels inhabited and dead at the same time (like a carcass). The body horror is immensly unsettling and detailed.

EXAMPLE OUTPUT:
The protagonist has no fighting skills. The environment (empty space) is naturally hostile and **makes the player feel isolated an alone**<sup>emotional response</sup>. The style of the space ship feels inhabited and dead at the same time (like a carcass).

"""

# Add the text we want to process into the prompt
prompt += f"ACTUAL INPUT:\n{text_to_code}"

# Call the query_LLM() function with the prompt
response = llmcode.query_LLM(prompt, model=gpt_model)

# Parse and display the response, which should be formatted correctly in Markdown format as we've instructed in the prompt
parsed_response = Markdown(response)
display(parsed_response)

# Loading your data

Below, please define the `research_question` for your project as well as the data you would like to analyze. To load your data in Colab, upload your .xlsx (Excel) files using the file browser on the left and input the file names to the corresponding fields.

* `coded_file`: A coded file you exported from the coding tool, containing only texts you have annotated. This may include texts that contain no highlights or codes, i.e. texts that you inspected and deemed to not include anything relevant. If you didn't have time to get to the end of all of the texts that you uploaded to the coding tool, please make sure to **remove any texts that you did not read from this file**. You can add these rows to the start of `uncoded_file`.
* `uncoded_file`: Contains texts that you scraped but have not coded manually, that are to be coded by LLMCode. The columns 'score' and 'url' are not required for this notebook, and may be removed if you are appending any uncoded rows from the coding tool output.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
research_question = "" # @param {type:"string"}
coded_file = "" # @param {type:"string"}
uncoded_file = "" # @param {type:"string"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

df_coded = pd.read_excel(coded_file)
df_uncoded = pd.read_excel(uncoded_file)

# Set id as index
df_coded.set_index("id", inplace=True)
df_uncoded.set_index("id", inplace=True)

# Check uniqueness of id within and across each DataFrame
df_all = pd.concat([df_coded, df_uncoded])
if not df_all.index.is_unique:
    raise ValueError("The data contains rows with duplicate ids. Please ensure all 'id' values are unique within and across the two files.")

def create_ancestor_dict(df_all):
    # Create a dictionary mapping each index (id) to its immediate `parent_id`
    parent_dict = df_all["parent_id"].to_dict()

    # Initialize an empty dictionary to store all ancestors for each `id`
    ancestor_dict = {}

    # Helper function to find all ancestors of a given `id` in top-down order
    def find_ancestors(id, parent_dict):
        ancestors = []
        parent = parent_dict.get(id)
        while not pd.isna(parent):
            ancestors.append(parent)
            parent = parent_dict.get(parent)
        # Reverse the list to get top-down order
        return ancestors[::-1]

    # Build the ancestor dictionary with a list of ancestors in top-down order
    for id in parent_dict:
        ancestor_dict[id] = find_ancestors(id, parent_dict)

    return ancestor_dict

ancestor_dict = create_ancestor_dict(df_all)

# Remove leading and trailing whitespace in dataset to simplify analysis
df_coded['text'] = df_coded['text'].str.strip()
df_coded['coded_text'] = df_coded['coded_text'].str.strip()
df_uncoded['text'] = df_uncoded['text'].str.strip()

# Define embedding context based on the research question
embedding_context = f", in the context of the research question: {research_question}"

print(f"Loaded {len(df_coded)} coded texts and {len(df_uncoded)} uncoded texts")

# Trustworthy coding with LLMCode

If you want to use LLMs to code multiple texts, it is best to use the coding functions in the LLMCode package, for a few reasons:
* LLMCode can automatically and dynamically construct a prompt for each text, and takes care of details such as randomizing the order of the few-shot examples to avoid recency bias.
* LLMCode automatically parses the Markdown output from the model and is able to detect *hallucinations*. Here, hallucinations refer to cases where the model outputs a modified version of the input text instead of reproducing it exactly, which is a problem because the research data must stay intact. This often happens, for example, when the input data contains spelling errors. In most cases, LLMCode is able to automatically correct such hallucinations.
* LLMCode supports a variety of coding methods, with specialized prompts for each. In this notebook, we will focus on inductive coding.
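
As a rough illustration of the hallucination check described above, the sketch below strips the annotations from a coded output and compares the result to the original text. The helper names `strip_annotations` and `is_faithful` are hypothetical, and this is not how LLMCode necessarily implements detection or correction; it only shows the basic idea under the assumption that the original text contains no `**` or `<sup>` sequences of its own.

```python
import re

def strip_annotations(coded_text):
    """Remove <sup>...</sup> code lists and ** markers, keeping the highlighted text.

    Hypothetical helper; assumes the original text has no literal ** or <sup> tags.
    """
    without_codes = re.sub(r"<sup>.*?</sup>", "", coded_text, flags=re.DOTALL)
    return without_codes.replace("**", "")

def is_faithful(original_text, coded_text):
    """True if the coded output reproduces the original text exactly."""
    return strip_annotations(coded_text) == original_text

original = "The body horror is immensly unsettling and detailed."
faithful = "The body horror is **immensly unsettling**<sup>emotional response</sup> and detailed."
# Here the model "helpfully" fixed a typo, which alters the research data
corrected = "The body horror is **immensely unsettling**<sup>emotional response</sup> and detailed."

print(is_faithful(original, faithful))
print(is_faithful(original, corrected))
```

The second output fails the check because the model silently corrected the spelling of "immensly".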

In reflexive coding, your own personal understanding of the data is important in shaping the insights you gain from analysing it. Therefore, before jumping straight into coding the entire dataset with LLMCode, it is wise to investigate just how well the system can replicate your own coding style. In this section, we will initially run the LLM coding with texts from `coded_file` that have also been human-annotated, so that we have a baseline to compare the LLM output to. While this may feel redundant, it helps us calibrate our trust in the system before we code the entire dataset.

## Defining the data counts

The LLMCode functions utilize few-shot learning, requiring a small set of human-coded examples. Next, we define the number of few-shot examples (`n_examples`) we would like to use. Once you have made your first pass through this notebook, you may try running it again with different numbers of examples and see if the model's performance increases with additional examples. Additionally, we define data counts for the so-called *validation* and *test* sets (explained later).

These three non-overlapping datasets are randomly sampled from the texts in `coded_file`. Make sure that the sum of the three values does not exceed the number of texts in `coded_file`. Smaller data counts make working with this tutorial faster and cheaper but produce less reliable quality metrics. We recommend testing the notebook with the default values.

The randomly chosen few-shot examples are displayed for you to verify. Make sure that the examples in the `coded_text` column are representative of your coding and the dataset, and include a variety of different kinds of examples (for example, heavily coded and sparsely coded texts). If not, you may re-run this cell to generate a different selection of examples.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

n_examples = 10 # @param {"type":"integer"}
n_validation = 50 # @param {"type":"integer"}
n_test = 50 # @param {"type":"integer"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

n_sum = n_examples + n_validation + n_test
if n_sum > len(df_coded):
    raise ValueError(f"Sum of the three inputs ({n_sum}) cannot be higher than the number of coded texts ({len(df_coded)})")

# Create a shuffled copy of df_coded
df_shuffled = df_coded.sample(frac=1)

# Split the shuffled copy into df_few_shot, df_val, and df_test
df_few_shot = df_shuffled.iloc[:n_examples]
df_val = df_shuffled.iloc[n_examples:n_examples + n_validation]
df_test = df_shuffled.iloc[n_examples + n_validation:n_sum]

# Prepare list of codes in the few-shot examples
few_shot_codes = set(code for coded_text in df_few_shot.coded_text for _, code in llmcode.parse_codes(coded_text))

def get_ancestor_texts(id):
    ancestors = ancestor_dict[id]
    if ancestors:
        return df_all.loc[ancestors].text.tolist()
    else:
        return []

# Define ancestors for few-shot examples
few_shot_ancestors = [get_ancestor_texts(id) for id in df_few_shot.index]

print(f"Few-shot examples: {len(df_few_shot)}")
print(f"Validation examples: {len(df_val)}")
print(f"Test examples: {len(df_test)}")
print('')
print("Few-shot examples:")
df_few_shot

Let's plot all the human-annotated codes in the validation dataset and the number of times they appear in the texts:

In [None]:
# @title
def get_codes_df(df):
    codes = [code for coded_text in df.coded_text for _, code in llmcode.parse_codes(coded_text)]
    code_counts = Counter(codes)
    df_codes = pd.DataFrame(list(code_counts.items()), columns=['Code', 'Count'])
    df_codes = df_codes.sort_values(by='Count', ascending=False).reset_index(drop=True)
    return df_codes

df_codes = get_codes_df(df_val)

# Create a vertical bar plot using Plotly with angled x-axis labels
fig = px.bar(df_codes, x='Code', y='Count', title='Human-annotated codes')

# Update layout to angle x-axis labels at 45 degrees
fig.update_layout(xaxis_tickangle=-45)

fig.show()

## Coding and quality metrics

In this notebook, we will be using the `code_inductively_with_code_consistency()` function from the LLMCode package. This function codes the input texts given:
* the research question, defined earlier
* the set of few-shot examples, defined earlier
* a list of *ancestors* for each text in the input data and few-shot examples, providing the thread context of the text to the model
* a bulleted list of coding instructions, given as a string.

After coding the texts, the function attempts to correct some common errors that the LLM may make, such as fixing typos in the original text or omitting some of its non-coded sentences. For further analysis, it is important that the LLM preserves the exact formatting of the original texts.

The term *code consistency* in the function name refers to the function's attempt to generate consistent codes across the different texts, similarly to a human coder. The function processes the texts sequentially, to allow the reuse of codes between text instances instead of creating an entirely new set of possibly redundant codes for each text. This is done by keeping track of a list of previous codes that is added as input to each new prompt.
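
The sequential process described above can be sketched as a simple loop that carries a growing list of codes from one text to the next. This is only an illustration of the idea, not the actual body of `code_inductively_with_code_consistency()`: the `code_one_text` callable and the `fake_coder` stand-in are hypothetical, and a real version would build a prompt and call the LLM instead.

```python
def code_texts_sequentially(texts, code_one_text):
    """Code texts one at a time, passing previously used codes to each call.

    code_one_text(text, known_codes) is a hypothetical callable that returns
    the list of codes assigned to `text`; in practice it would wrap an LLM call
    whose prompt includes `known_codes`.
    """
    known_codes = []  # codes generated so far, in order of first appearance
    results = []
    for text in texts:
        codes = code_one_text(text, known_codes)
        results.append(codes)
        # Remember any new codes so that later texts can reuse them
        for code in codes:
            if code not in known_codes:
                known_codes.append(code)
    return results, known_codes

# Stand-in for an LLM: reuses a known code when it appears in the text,
# otherwise invents a new code from the first word
def fake_coder(text, known_codes):
    for code in known_codes:
        if code in text:
            return [code]
    return [text.split()[0].lower()]

texts = ["Music sets the mood", "The music was haunting", "Visuals were stunning"]
results, codebook = code_texts_sequentially(texts, fake_coder)
print(results)
print(codebook)
```

Because the second text sees the code created for the first one, it reuses that code instead of producing a near-duplicate.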

If you take a closer look at the code, you may notice that the function also generates a description for each code. This is in order to avoid generating duplicate codes with the same meaning. We store these descriptions at the end of this notebook alongside the coded texts, as they may be useful in further processing of the codes, by elaborating on their meaning.

Run the following cell to code the input texts with the `code_inductively_with_code_consistency()` function. This may take a few moments.

In [None]:
# @title

def code_inductively(df_input, coding_instructions):
    input_texts = df_input.text.tolist()
    input_ancestors = [get_ancestor_texts(id) for id in df_input.index]
    return llmcode.code_inductively_with_code_consistency(
        texts=input_texts,
        text_ancestors=input_ancestors,
        research_question=research_question,
        coding_instructions=coding_instructions,
        few_shot_examples=df_few_shot,
        few_shot_ancestors=few_shot_ancestors,
        gpt_model=gpt_model
    )

# Instructions that are appended to the LLM prompt
coding_instructions = f"""- Ignore text that is not insightful with regards to the research question: {research_question}."""

# Do inductive coding
coded_texts, code_descriptions = code_inductively(df_val, coding_instructions)

print("Coding complete, proceed to the next cell to see the output")

Next, we evaluate the LLM's output against human-coded texts from our validation set, i.e. some instances from `coded_file` that weren't used as few-shot examples.

We print out a table that contains the LLM- and human-coded texts side by side. This table also contains two metrics to help you evaluate the model's output: [Intersection over Union (IoU, a.k.a. Jaccard Index)](https://en.wikipedia.org/wiki/Jaccard_index) and a modified [Hausdorff distance measure](https://en.wikipedia.org/wiki/Hausdorff_distance).

**IoU** calculates the overlap between LLM- and human-highlighted parts of each text, measuring how similarly the model can identify interesting parts in the text to a human coder. For the IoU score, 0 means no overlap and 1 means perfect overlap, i.e., identical human and LLM highlights.
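
To illustrate the idea, the sketch below computes a character-level IoU between the highlights of two coded versions of the same text. It assumes that overlap is measured per character and that two texts without any highlights count as a perfect match; LLMCode's own computation may differ in these details. The helper names are hypothetical.

```python
import re

def highlighted_chars(coded_text):
    """Return the set of character positions (in the plain text) that are highlighted."""
    positions = set()
    plain_len = 0  # length of the plain text reconstructed so far
    # Drop the code lists first so that only the ** markers remain
    text = re.sub(r"<sup>.*?</sup>", "", coded_text, flags=re.DOTALL)
    # Splitting on ** alternates between non-highlighted and highlighted segments
    for i, segment in enumerate(text.split("**")):
        if i % 2 == 1:
            positions.update(range(plain_len, plain_len + len(segment)))
        plain_len += len(segment)
    return positions

def highlight_iou(coded_a, coded_b):
    """Intersection over Union of the highlighted character positions."""
    a, b = highlighted_chars(coded_a), highlighted_chars(coded_b)
    if not a and not b:
        return 1.0  # assumption: no highlights on either side counts as agreement
    return len(a & b) / len(a | b)

human = "A **sunset**<sup>visuals</sup> almost made me cry."
llm = "A **sunset almost**<sup>visuals</sup> made me cry."
print(highlight_iou(human, llm))
```

In this example the two highlights share the 6 characters of "sunset", while their union covers the 13 characters of "sunset almost", giving an IoU of 6/13.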

IoU does not tell us anything about the codes that are assigned to each highlight, for which reason we also calculate the **Hausdorff distance** between each pair of LLM- and human-coded texts. Note that the coded parts of the text may vary between annotators, which makes a highlight level evaluation of codes difficult. In order to address this issue, we compare the codes on a text level (as opposed to highlight level), by merging all the codes that are assigned to any highlight in the text. The Hausdorff distance measure calculated by the LLMCode package then utilizes code embeddings to evaluate the semantic similarity between the LLM and human-generated codes for each text.
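
As a point of reference, the following sketch computes the classic (unmodified) Hausdorff distance between two sets of embedding vectors using cosine distance. The modified measure in LLMCode may be defined differently, and the toy 2D vectors below merely stand in for real code embeddings.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def hausdorff_distance(a, b):
    """Classic Hausdorff distance between two non-empty sets of vectors."""
    d = cosine_distance_matrix(a, b)
    # For each code in one set, find the closest code in the other set,
    # then take the worst case over both directions
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy 2D "embeddings": the human coder used two unrelated codes,
# while the LLM produced only the first of them
human_codes = np.array([[1.0, 0.0], [0.0, 1.0]])
llm_codes = np.array([[1.0, 0.0]])

print(hausdorff_distance(human_codes, human_codes))
print(hausdorff_distance(human_codes, llm_codes))
```

Identical code sets give a distance of zero, whereas a code that has no close counterpart in the other set drives the distance up.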

You can choose to sort the texts by either of the two metrics. This enables you to easily spot cases where the model is doing well or underperforming. Note that a **higher IoU is better** (higher IoU = more overlap between highlighted segments), while a **lower Hausdorff distance is better** (lower distance = codes are more similar). Be sure to also compare the LLM- and human-annotated texts yourself, as sometimes the errors may be due to inconsistencies in your own coding.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

sort_by = "IoU" # @param ["IoU","Hausdorff","Input order"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

html_report, df_eval = llmcode.run_coding_eval(
    llm_coded_texts=coded_texts,
    human_coded_texts=df_val.coded_text.tolist(),
    embedding_context=embedding_context,
    embedding_model=embedding_model,
    sort_by=sort_by
)

avg_iou = np.mean(df_eval["IoU"])
avg_hausdorff = np.mean(df_eval["Hausdorff"])
print(f"Average IoU: {avg_iou:.4f}")
print(f"Average Hausdorff distance: {avg_hausdorff:.4f}")

display(HTML(html_report))

Next, let's visualize all the LLM-generated codes and the number of times they appear. You may wish to compare these to the human-annotated code distribution above, and see how well they match. Can you find any new codes, or ones that are missing?

In [None]:
# @title
def plot_generated_codes(code_highlights, title):
    code_counts = [(code, len(highlights)) for code, highlights in code_highlights.items()]
    df_codes = pd.DataFrame(code_counts, columns=['Code', 'Count'])
    df_codes = df_codes.sort_values(by='Count', ascending=False).reset_index(drop=True)

    # Create a vertical bar plot using Plotly with angled x-axis labels
    fig = px.bar(df_codes, x='Code', y='Count', title=title)

    # Update layout to angle x-axis labels at 45 degrees
    fig.update_layout(xaxis_tickangle=-45)
    fig.show()

# Parse all codes and highlights in LLM output
code_highlights = llmcode.get_codes_and_highlights(coded_texts)
plot_generated_codes(code_highlights, 'LLM-generated codes')

In order to compare the human- and LLM-generated codes in more detail, let's visualize the codes with the help of _word embeddings_, which capture the meaning of words and can therefore be used to explore code similarities. We will use OpenAI's embedding models for this. Word embeddings are typically very high-dimensional vectors, so we will reduce the dimensionality of the vectors to two in order to plot them onto a two-dimensional plane.
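
To give a feel for the dimensionality reduction step, here is a minimal sketch that projects high-dimensional vectors onto two dimensions with principal component analysis (PCA). PCA is used here only as a simple, well-known example; the method behind `llmcode.get_2d_code_embeddings()` may be a different one, and the random vectors below are placeholders for real embeddings.

```python
import numpy as np

def reduce_to_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 16))  # 5 codes, 16 dimensions each
points = reduce_to_2d(fake_embeddings)
print(points.shape)
```

Each row of `points` is an (x, y) coordinate that could be passed to a scatter plot.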

We create an interactive visualization of the code embeddings using Plotly, so that you can hover over the plot to reveal the code name and an example of an associated highlight from the texts. The size of each marker corresponds to the total number of highlights for that code in the texts. The visualization contains markers for three kinds of codes:
* LLM code (new): A code entirely generated by the LLM.
* LLM code (few-shot): A code generated by the LLM that was included in the few-shot examples given to it.
* Human code: A code in the human-annotated input texts, for comparison.

Note that there is likely a lot of overlap between the markers as both LLM- and human-annotated codes are plotted, so be sure to hover your mouse over the plot to view details of any overlapping markers.

In [None]:
# @title
def prepare_code_vis_df(code_highlights, human_code_highlights, few_shot_codes, embedding_context, embedding_model):
    # Find code embeddings for all codes
    all_codes = set(code_highlights.keys()).union(set(human_code_highlights.keys()))
    df_em = llmcode.get_2d_code_embeddings(list(all_codes), embedding_context, embedding_model)

    # Create DataFrame of LLM-generated codes
    df_llm = pd.DataFrame([(c,) for c in code_highlights.keys()], columns=["code"])
    df_llm["code_count"] = df_llm["code"].apply(lambda code: len(code_highlights[code]))
    df_llm["example"] = df_llm["code"].apply(lambda code: code_highlights[code][0])
    df_llm["group"] = df_llm.code.apply(lambda code: "LLM code (few-shot)" if code in few_shot_codes else "LLM code (new)")

    # Create DataFrame of human-generated codes
    df_human = pd.DataFrame([(c,) for c in human_code_highlights.keys()], columns=["code"])
    df_human["code_count"] = df_human["code"].apply(lambda code: len(human_code_highlights[code]))
    df_human["example"] = df_human["code"].apply(lambda code: human_code_highlights[code][0])
    df_human["group"] = "Human code"

    # Concatenate code DataFrames and merge with embeddings
    df_em_codes = pd.concat([df_llm, df_human])
    df_em_codes = df_em_codes.merge(df_em, on="code", validate="many_to_one")
    return df_em_codes

def visualise_2d_embeddings(df_em):
    # Prepare labels for visualisation
    hover_texts = []
    colors = []  # List to store color categories
    for _, row in df_em.iterrows():
        text = f"{row.code} ({row.code_count})</br></br>"

        # Add an example of a code highlight
        text += '"' + "</br>".join(textwrap.wrap(row.example, width=60)) + '"'

        hover_texts.append(text)

        # Determine color category based on group
        colors.append(row.group)

    df_vis = pd.DataFrame()
    df_vis["Hover"] = hover_texts
    df_vis["Size"] = [c / df_em["code_count"].max() for c in df_em["code_count"]]
    df_vis["x"] = df_em["code_2d_0"]
    df_vis["y"] = df_em["code_2d_1"]
    df_vis["Color"] = colors

    # Plot the codes in 2D
    fig = px.scatter(df_vis,
                     width=1000, height=800,
                     x="x",
                     y="y",
                     size="Size",
                     color="Color",  # Set color categories
                     hover_name="Hover",
                     title="Codes Visualised in 2D")


    fig.show()

human_code_highlights = llmcode.get_codes_and_highlights(df_val.coded_text)
df_em = prepare_code_vis_df(
    code_highlights,
    human_code_highlights,
    few_shot_codes,
    embedding_context,
    embedding_model
)
visualise_2d_embeddings(df_em)

## Prompt iteration

Now that we've gotten a handle on how coding with LLMCode works, as well as the metrics by which we can evaluate the coding quality, let's see if we can increase the performance of the model by iterating on the coding instructions given to the model as part of the prompt. For reflexive coding, the coder's own understanding of the topic and the data is important in shaping the findings. By tuning the instructions, you can help the model understand how you personally coded the data.  

First, run the following cell that prepares a function for both coding and the subsequent evaluation of the coded data, allowing us to neatly test the entire pipeline with different instructions.

In [None]:
# @title

def code_and_eval(
    df_input,
    coding_instructions,
    sort_by="IoU",
):
    # Perform inductive coding
    coded_texts, _ = code_inductively(df_input, coding_instructions)

    # Run eval
    html_report, df_eval = llmcode.run_coding_eval(
        llm_coded_texts=coded_texts,
        human_coded_texts=df_input.coded_text.tolist(),
        embedding_context=embedding_context,
        embedding_model=embedding_model,
        sort_by=sort_by
    )

    # Print averages
    avg_iou = np.mean(df_eval["IoU"])
    avg_hausdorff = np.mean(df_eval["Hausdorff"])
    print(f"Average IoU: {avg_iou:.4f}")
    print(f"Average Hausdorff distance: {avg_hausdorff:.4f}")

    # Visualise results
    code_highlights = llmcode.get_codes_and_highlights(coded_texts)
    human_code_highlights = llmcode.get_codes_and_highlights(df_input.coded_text)
    df_em = prepare_code_vis_df(
        code_highlights,
        human_code_highlights,
        few_shot_codes,
        embedding_context,
        embedding_model
    )
    visualise_2d_embeddings(df_em)

    # Print table
    display(HTML(html_report))

print("Instruction iteration function ready, proceed")

The following cell runs the coding and evaluation with a `coding_instructions` string variable that you can modify to your liking. The instructions defined here are appended to the LLM prompt, to help it understand how you would like the texts to be coded. Output format instructions and the few-shot examples are already added to the prompt automatically by LLMCode, so these should not be included in `coding_instructions`.

The variable is initialised with the same instructions given to the model in the previous section: a single item instructing the model to focus on the defined research question. Based on the evaluations you ran in the previous section, try to improve upon these instructions by adding new lines and running the cell to see if you can improve the model's performance. Place each new instruction on its own line beginning with a dash.

Below, we have provided three identical cells so you may iterate on the instructions by adding lines to the `coding_instructions` variable and running the coding multiple times until you're happy with the output. This way, we store a trace of the results you got with the previous instructions. Before running each cell, copy the instructions from the previous cell and add one or more additional lines.
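
If you prefer to keep your instructions as a list while iterating, the dash-prefixed format described above can be assembled programmatically. The following is a minimal sketch, not part of LLMCode: `build_instructions` is a hypothetical helper, the `research_question` value is a placeholder for the variable defined earlier in the notebook, and the second instruction is only an illustration of what you might add.

```python
def build_instructions(items):
    """Join instruction strings into a block with one dash-prefixed line each."""
    lines = [f"- {item.strip()}" for item in items if item.strip()]
    return "\n" + "\n".join(lines) + "\n"


# Placeholder: in the notebook, research_question is defined in an earlier cell
research_question = "<your research question>"

instructions = [
    f"Ignore text that is not insightful with regards to the research question: {research_question}.",
    "Keep each highlight as short as possible.",  # illustrative addition only
]

coding_instructions = build_instructions(instructions)
print(coding_instructions)
```

The resulting string has the same shape as the triple-quoted string in the cells below, so it could be passed to `code_and_eval` in the same way.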

In [None]:
# Add more instructions, based on the evaluation results in the previous section
coding_instructions = f"""
- Ignore text that is not insightful with regards to the research question: {research_question}.
"""

# Run coding
code_and_eval(df_val, coding_instructions)

In [None]:
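# Copy the instructions from the previous cell and add even more instructions, based on the evaluation results in the previous cell
# TODO: Replace the placeholder below with your copied and improved coding instructions
coding_instructions = f"""
- Ignore text that is not insightful with regards to the research question: {research_question}.
"""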

# Run coding
code_and_eval(df_val, coding_instructions)

In [None]:
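# Copy the instructions from the previous cell and add even more instructions, based on the evaluation results in the previous cell
# TODO: Replace the placeholder below with your copied and improved coding instructions
coding_instructions = f"""
- Ignore text that is not insightful with regards to the research question: {research_question}.
"""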

# Run coding
code_and_eval(df_val, coding_instructions)

## Validation and test data

A common danger in all AI and machine learning work is to [overfit](https://en.wikipedia.org/wiki/Overfitting) one's model or approach to the data at hand, so that it generalizes poorly to new data.

**The more you iterate on your prompt instructions and examples, the more you are in danger of overfitting.**

This is why it is a standard practice to split one's data into [three distinct parts](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets):

1. Training data: This is used to train a model. Whilst we're not actually training the LLM here, this is analogous to the few-shot examples we chose for the prompts since they are data for which the ground truth (i.e. the human codes) is shown to the model.

2. Validation data: This is typically used to search for the best possible [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)), such as when to stop training or how many layers to use in a neural network model.

3. Test data: This is used to test the performance of the final model after the hyperparameter tuning. *Separating test and validation data avoids overly optimistic test results caused by overfitting the hyperparameters to the validation data*.

The prompt instructions, which we tuned in this notebook, can be considered a hyperparameter. Therefore, **one should ideally iterate on and optimize the prompt with validation data and, when done, verify the performance with a separate set of test data**. This is especially important if your human-defined reference dataset is small.

**For academic research, we recommend using at least 100 texts for both the validation and test data**, i.e., the data file should have at least 200 texts with human-annotated ground truth highlights, as the first 100 would be used for validation and the next 100 for testing.

**For industry research**, the designer or researcher should use their own judgement: how crucial is it to be able to measure the performance accurately?
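
The split recommended above for academic research (first 100 annotated texts for validation, next 100 for testing) can be sketched with plain Python lists. This is only an illustration of the idea, not the code this notebook uses for its own split: `split_validation_test` is a hypothetical helper and the `annotated` list is made-up stand-in data.

```python
def split_validation_test(texts, n_validation=100, n_test=100):
    """Return the first n_validation texts and the following n_test texts."""
    if len(texts) < n_validation + n_test:
        raise ValueError(
            f"Need at least {n_validation + n_test} annotated texts, got {len(texts)}"
        )
    validation = texts[:n_validation]
    test = texts[n_validation:n_validation + n_test]
    return validation, test


# Made-up stand-in for 200 human-annotated texts
annotated = [f"text {i}" for i in range(200)]

validation, test = split_validation_test(annotated)
print(len(validation), len(test))
```

Because the two slices never overlap, no text tuned on during prompt iteration ends up in the final test measurement.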

### How to report LLM use in qualitative research?

There is currently no established best practice for reporting the use of LLM-based qualitative analysis tools. If you use LLM-based highlighting, you could report at least the full prompt with examples, the number of validation and test texts, the validation and test IoUs, and a table with examples of the best- and worst-case LLM performance, so that readers can judge for themselves whether the LLM performance is acceptable.
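
As a rough illustration of how such a report could be assembled, the sketch below computes the number of texts, the mean IoU, and the best and worst examples from a list of per-text scores. It assumes each record holds a text and its IoU, loosely mirroring the `IoU` column of the evaluation table; `summarise_iou` is a hypothetical helper and the scores are made up purely for demonstration.

```python
from statistics import mean


def summarise_iou(records, k=1):
    """Summarise per-text IoU scores: count, mean, and the k worst and best texts."""
    ranked = sorted(records, key=lambda r: r["IoU"])  # ascending: worst first
    return {
        "n_texts": len(records),
        "mean_iou": mean(r["IoU"] for r in records),
        "worst": ranked[:k],
        "best": ranked[-k:][::-1],  # highest IoU first
    }


# Made-up scores, for illustration only
test_records = [
    {"text": "example A", "IoU": 0.2},
    {"text": "example B", "IoU": 0.5},
    {"text": "example C", "IoU": 0.8},
    {"text": "example D", "IoU": 1.0},
]

summary = summarise_iou(test_records, k=1)
print(f"n = {summary['n_texts']}, mean IoU = {summary['mean_iou']:.3f}")
print("worst:", summary["worst"][0]["text"])
print("best:", summary["best"][0]["text"])
```

Running the same summary separately on the validation and test records gives the two sets of numbers suggested above.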

### Evaluate the final prompt on the test data

You may recall how we earlier divided the data in `coded_file` into validation and test sets. So far, we have only used the validation data. Once you're done iterating on your `coding_instructions`, run the code below to evaluate the model using the test data. This will provide a final reading of the model's performance. Note: this will re-use the final instructions defined in the above section.

Inspect the results. Does the LLM perform differently than on the validation data? Can you spot any further human annotation errors or inconsistencies that should perhaps be corrected?

In [None]:
# @title

# Run coding and eval on test data
code_and_eval(df_test, coding_instructions)

# Coding all of the data

Now that we've tuned the coding instructions to our liking and run the final evaluation on the test data, it is time for us to finally code the entire dataset, including texts in `uncoded_file` that weren't annotated by a human.

When dealing with large datasets, it is wise to define a ceiling for the number of texts that are coded, as each additional text incurs a small API cost. You may edit `max_number_of_texts` based on how many texts you wish to code. If your files contain more texts, the system will code the first `max_number_of_texts` from the two files.

In [None]:
max_number_of_texts = 300 # @param {type:"integer"}

# Do inductive coding on all data, capping to max_number_of_texts
df_final = df_all.head(max_number_of_texts)
coded_texts, code_descriptions = code_inductively(df_final, coding_instructions)

def print_coded_texts(coded_texts, sample_n=None):
    if sample_n:
        coded_texts = random.sample(coded_texts, sample_n)
    for coded_text in coded_texts:
        display(Markdown(coded_text))
        display(Markdown("-----"))

print("Coding complete, output:\n")
print_coded_texts(coded_texts)

# Downloading the results

Run the following code to download the generated codes and code descriptions as .csv files compressed into a single zip file. You may use these files as input to the following notebook on theme generation.

In [None]:
#@title

# Create output dir
output_dir = "coding_output"
os.makedirs(output_dir, exist_ok=True)

# Save raw and coded texts
data = list(zip(df_final.text.tolist(), coded_texts))
df_out = pd.DataFrame(data, columns=["text", "coded_text"])
file_path = f"{output_dir}/coded_texts.csv"
df_out.to_csv(file_path, index=False)

# Save code descriptions
data = list(code_descriptions.items())
df_out = pd.DataFrame(data, columns=["code", "description"])
file_path = f"{output_dir}/code_descriptions.csv"
df_out.to_csv(file_path, index=False)

# Zip directory and save locally
zip_file = f"{output_dir}.zip"
!zip -r {zip_file} {output_dir}

if RunningInCOLAB:
    # Use Colab's download function
    files.download(zip_file)
else:
    # For local Jupyter notebook, print the path
    print(f"Zip file saved locally: {zip_file}")
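
If you want to inspect the saved files outside this notebook, they can be read back with Python's standard `csv` module. The sketch below is a self-contained illustration under stated assumptions: it first writes a tiny stand-in `coded_texts.csv` with the same `text` and `coded_text` columns to a temporary directory, so it does not touch your real output, and `read_rows` is a hypothetical helper.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Write a tiny stand-in file so this sketch runs on its own
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "coded_texts.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "coded_text"])
    writer.writeheader()
    writer.writerow(
        {"text": "placeholder raw text", "coded_text": "placeholder coded text"}
    )

rows = read_rows(path)
print(rows[0]["text"], "->", rows[0]["coded_text"])
```

To read your actual results, point `read_rows` at the unzipped `coding_output/coded_texts.csv` instead of the temporary file.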