# Tutorial 3: Inductive and deductive coding
*This notebook is part of the [LLMCode library](https://github.com/PerttuHamalainen/LLMCode).*

*A note on data privacy: The user experience of this notebook is better on Google Colab, but if you are processing data that cannot be sent to Google and OpenAI servers, you should run this notebook locally using the "Aalto" LLM API.*

**Learning goals**

In this notebook, you'll learn to utilize LLMs to code qualitative data inductively and deductively, as well as how to evaluate the human-likeness of the output.

**How to use this Colab notebook?**
* Select the LLM API as well as the GPT and embedding models to use below. The default values are recommended, but some of the examples may produce better quality results using the more expensive "gpt-4o" model. For details about the models, see [OpenAI documentation](https://platform.openai.com/docs/models).
* Select "Run all" from the Runtime menu above.
* Enter your API key below when prompted. This will be provided to you at the workshop. You can also create your own OpenAI account at https://platform.openai.com/signup. The initial free quota you get with the account should be enough for the exercises of this notebook. To create an API key, follow [OpenAI's instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).
* Proceed top-down following the instructions

**New to Colab notebooks?**

Colab notebooks are browser-based learning environments consisting of *cells* that include either text or code. The code is executed in a Google virtual machine instead of your own computer. You can run code cell-by-cell (click the "play" symbol of each code cell), and selecting "Run all" as instructed above is usually the first step to verify that everything works. For more info, see Google's [Intro video](https://www.youtube.com/watch?v=inN8seMm7UI) and [curated example notebooks](https://colab.google/notebooks/).


In [None]:
#Initial setup code. If you opened this notebook in Colab, this code is hidden
#by default to avoid unnecessary user interface clutter

#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
llm_API="OpenAI" # @param ["OpenAI", "Aalto"]
gpt_model="gpt-4o-mini" #@param ["gpt-4o-mini","gpt-4o"]
embedding_model = "text-embedding-3-small" #@param ["text-embedding-ada-002","text-embedding-3-small", "text-embedding-3-large"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

#Import packages
import pandas as pd
import numpy as np
from IPython.display import HTML, Markdown, display, clear_output
import getpass
import os
import html
import lxml
import re
import plotly.graph_objects as go
import plotly.express as px
import textwrap
import random
import math
from collections import Counter, defaultdict
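try:
    from google.colab import files  # only available when running on Colab
except ImportError:
    files = None  # running locally: Colab file downloads are unavailable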

original_dir = os.getcwd()

#determine if we are running in Colab
import sys
RunningInCOLAB = 'google.colab' in sys.modules
if RunningInCOLAB:
  import plotly.io as pio
  pio.renderers.default = "colab"
  if not os.path.exists("LLMCode"):
    if not os.getcwd().endswith("LLMCode"):
      print("Cloning the LLMCode repository...")
      #until the repo is public, we download this working copy instead of cloning
      #(shared as: anyone with the link can view)
      !wget "https://drive.google.com/uc?export=download&id=1ylMQn96JuKBB-YU9mHLyEhtm6Qin1Kgh" -O LLMCode.zip
      !mkdir LLMCode
      !unzip -q LLMCode.zip -d LLMCode
      #!git clone https://github.com/PerttuHamalainen/LLMCode.git
  if not os.getcwd().endswith("LLMCode"):
    os.chdir("LLMCode")
    print("Installing dependencies...")
    !pip install -r requirements_notebooks.txt
import llmcode

os.chdir(original_dir)

#Jupyter is already running an asyncio event loop => need this hack for async OpenAI API calling
import nest_asyncio
nest_asyncio.apply()

#Prompt the user for an API key if not provided via a system variable
clear_output()
if llm_API=="OpenAI":
    if os.environ.get("OPENAI_API_KEY") is None:
        print("Please input an OpenAI API key")
        api_key = getpass.getpass()
        os.environ["OPENAI_API_KEY"] = api_key
elif llm_API=="Aalto":
    if os.environ.get("AALTO_OPENAI_API_KEY") is None:
        print("Please input an Aalto OpenAI API key")
        api_key = getpass.getpass()
        os.environ["AALTO_OPENAI_API_KEY"] = api_key
else:
    print(f"Invalid API type: {llm_API}")

#Initialize the LLMCode library
llmcode.init(API=llm_API)
llmcode.set_cache_directory("data_exploration_cache")

# Qualitative coding

Qualitative coding involves analyzing texts by highlighting relevant parts and assigning each highlight a code, or a set of codes, that communicates its relevance to the research. For example, in the Games as Art research project, a coded text may look like the following:

> When you take in consideration **the music, the graphic design, the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. Its in games like this when you are aware of be **playing an artwork, rather than just another game**<sup>comparison to conventional art forms</sup>.

In this example, the text has been annotated with two highlights that have been assigned the codes _audio_, _visuals_, _narrative_ (for the first highlight), and _comparison to conventional art forms_ (for the second highlight).

There exist various commercial tools such as Atlas.ti that provide a neat user interface for coding, but licenses for these can be expensive. Therefore, in this notebook, we accept files that have been coded using Microsoft Word. Make sure that the file is formatted according to the following requirements:
* Individual texts are separated by five dashes "-----"
* The texts are coded using [Word comments](https://support.microsoft.com/en-us/office/using-modern-comments-8d3f868a-867e-4df2-8c68-bf96671641e2) by highlighting the relevant part and typing the code(s) in the comment field. Multiple codes are separated by a semicolon ";".

While LLMCode contains functions that can automate much of the coding process, it requires at least a few human-coded examples (called _few-shot_ examples) to get started. Additionally, in the last section of this notebook, the LLM-generated codes are evaluated using human-generated codes as a baseline. Therefore, if you wish to analyze your own dataset, **we request that you code at least five texts** to use as examples before proceeding with LLM coding. Please separate these examples into their own Word file. In order to run the evaluation against human-generated codes, we ask you to code the remaining texts in the other Word file as well.

Because LLMs deal with unstructured text, we instruct them with code annotations in the [Markdown](https://en.wikipedia.org/wiki/Markdown) format. The example given above in Markdown, printed as a plain string, looks like the following:

> ```When you take in consideration **the music, the graphic design, the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. Its in games like this when you are aware of be **playing an artwork, rather than just another game**<sup>comparison to conventional art forms</sup>.```

LLMCode's functions take care of the conversion between this format and the more structured code representations that we humans prefer to deal with, such as annotated Word files.
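
To make the format concrete, the following minimal sketch extracts (highlight, code) pairs from an annotated string with a regular expression. It is an illustration only and not LLMCode's own parser (the library's `parse_codes` function may behave differently); the helper name `parse_annotated_text` is hypothetical, and the pattern assumes that every bolded span is immediately followed by a `<sup>` tag.

```python
import re

# Matches **highlight**<sup>code1; code2</sup>
ANNOTATION_PATTERN = re.compile(r"\*\*(.+?)\*\*<sup>(.*?)</sup>", re.DOTALL)

def parse_annotated_text(coded_text):
    """Return a list of (highlight, code) pairs found in a Markdown-annotated text."""
    pairs = []
    for highlight, codes in ANNOTATION_PATTERN.findall(coded_text):
        # A highlight may carry several codes separated by semicolons
        for code in codes.split(";"):
            code = code.strip()
            if code:
                pairs.append((highlight, code))
    return pairs

demo_text = (
    "Its in games like this when you are aware of be "
    "**playing an artwork, rather than just another game**"
    "<sup>comparison to conventional art forms</sup>."
)
print(parse_annotated_text(demo_text))
```

Each highlight is repeated once per code, which makes it straightforward to count how often each code occurs.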

# Loading the data

Below, please define the research question you would like to answer as well as the data you would like to analyze by selecting a Word file that is formatted according to the requirements detailed above. You may run this cell with the default values to analyze the [Games As Art](https://osf.io/ryvt6/) survey dataset annotated by us with the research question: How do people experience games as art? Alternatively, you may upload your own research data by following the instructions below.

To reduce processing time and API cost, we limit the number of processed input texts to `max_number_of_texts`. We recommend keeping this at the default value of 40 texts.

**Using your own data**

Please upload your .docx (Word) files using the file browser on the left and enter the file names in the corresponding fields. The file `examples_file` should contain at least five fully coded texts you wish to use as examples for the model, and `input_file` should contain the rest of the dataset, which the model will code. If you wish to evaluate the model's codes against human codes, the `input_file` should also be fully coded: in this case, please make sure the `input_file_is_coded` box is ticked.

Remember to define your own research question.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
research_question = 'How do people experience games as art?' # @param {type:"string"}
examples_file = "LLMCode/test_data/bopp_test_examples.docx" # @param {type:"string"}
input_file = "LLMCode/test_data/bopp_test_input.docx" # @param {type:"string"}
input_file_is_coded = True # @param {"type":"boolean"}
max_number_of_texts = 40 # @param {type:"integer"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

df_examples = llmcode.open_docx_and_process_codes(examples_file)
df_input = llmcode.open_docx_and_process_codes(input_file)

# Remove leading and trailing whitespace in dataset to simplify analysis
for df in [df_examples, df_input]:
    df['text'] = df['text'].str.strip()
    df['coded_text'] = df['coded_text'].str.strip()

# Limit input to max_number_of_texts
df_input = df_input.head(max_number_of_texts)

# Collect all human code input
df_human = pd.concat([df_examples, df_input]).reset_index(drop=True)

# Define embedding context based on the research question
embedding_context = f", in the context of the research question: {research_question}"

print(f"Using {len(df_examples)} example texts and {len(df_input)} input texts")

Let's plot all the human-annotated codes and the number of times they appear in the example and input texts:

In [None]:
# @title
def get_codes_df(df):
    codes = [code for coded_text in df.coded_text for _, code in llmcode.parse_codes(coded_text)]
    code_counts = Counter(codes)
    df_codes = pd.DataFrame(list(code_counts.items()), columns=['Code', 'Count'])
    df_codes = df_codes.sort_values(by='Count', ascending=False).reset_index(drop=True)
    return df_codes

df_codes = get_codes_df(df_human)

# Create a vertical bar plot using Plotly with angled x-axis labels
fig = px.bar(df_codes, x='Code', y='Count', title='Human-annotated codes')

# Update layout to angle x-axis labels at 45 degrees
fig.update_layout(xaxis_tickangle=-45)

fig.show()

# Coding with LLMs: a simple example

We start exploring LLM code generation with a simple, prompt-only example, asking the model to code a text with a single prompt. The prompt utilizes _few-shot learning_ as it includes some examples that teach the model how we want it to code the text. Click "Show code" to view the prompt. You may edit the instructions and see what effect this has on the output.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

text_to_code = "I was playing the Legend of Zelda: Breath of the Wild in a sad day. A sunset on a perfect landscape between two biomes almost made me cry. It was PERFECT." # @param {type:"string"}

#-------------------------------------------------------------------
#Implementation. Feel free to edit the prompt

prompt = """You are an expert qualitative researcher conducting a research project.
You are given a text to code inductively. Please carry out the following task:
- Respond by repeating the original text, but highlighting the coded statements by surrounding the statements with double asterisks, as if they were bolded text in a Markdown document.
- Include the associated code(s) immediately after the statement, separated by a semicolon and enclosed in <sup></sup> tags, as if they were superscript text in a Markdown document.
- Preserve exact formatting of the original text. Do not correct typos or remove unnecessary spaces.

Below, I first give you examples of the output you should produce given an example input. After that, I give you the actual input to process.

EXAMPLE INPUT:
When you take in consideration the music, the graphic design, the writing, and of course the gameplay. Its in games like this when you are aware of be playing an artwork, rather than just another game.

EXAMPLE OUTPUT:
When you take in consideration **the music, the graphic design, the writing, and of course the gameplay**<sup>audio; visuals; narrative</sup>. Its in games like this when you are aware of be **playing an artwork, rather than just another game**<sup>comparison to conventional art forms</sup>.

EXAMPLE INPUT:
The protagonist has no fighting skills. The environment (empty space) is naturally hostile and makes the player feel isolated an alone. The style of the space ship feels inhabited and dead at the same time (like a carcass). The body horror is immensly unsettling and detailed.

EXAMPLE OUTPUT:
The protagonist has no fighting skills. The environment (empty space) is naturally hostile and **makes the player feel isolated an alone**<sup>emotional response</sup>. The style of the space ship feels inhabited and dead at the same time (like a carcass). The body horror is immensly unsettling and detailed.
"""

prompt += f"ACTUAL INPUT:\n{text_to_code}"
response = llmcode.query_LLM(prompt, model=gpt_model)
display(Markdown(response))

# Coding with LLMCode

If you want to use LLMs to code multiple texts, it is best to use the coding functions in the LLMCode package, for a few reasons:
* LLMCode can automatically and dynamically construct a prompt for each text, and takes care of details such as randomizing the order of the few-shot examples to avoid recency bias.
* LLMCode automatically parses the Markdown output from the model, and can detect and often correct hallucinations. Here, hallucinations refer to cases where the model outputs a modified version of the input text instead of reproducing it verbatim, which is a problem because the research data must remain intact. This often happens, for example, when the input data contains spelling errors.
* LLMCode supports three coding methods, with specialized prompts for each: inductive coding with and without code consistency as well as deductive coding. We will try out each of these three methods.
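
The dynamic prompt construction mentioned in the first point can be sketched as follows. This is a simplified illustration under our own assumptions, not LLMCode's actual prompt or implementation; `build_few_shot_prompt` and its parameters are hypothetical names.

```python
import random

def build_few_shot_prompt(instructions, examples, text, seed=None):
    """Assemble a few-shot prompt, shuffling the example order on each call.

    examples: list of (input_text, coded_output) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(examples)  # copy so that the caller's list is not modified
    rng.shuffle(shuffled)

    parts = [instructions.strip(), ""]
    for example_input, example_output in shuffled:
        parts += ["EXAMPLE INPUT:", example_input, "",
                  "EXAMPLE OUTPUT:", example_output, ""]
    parts += ["ACTUAL INPUT:", text]
    return "\n".join(parts)

demo_examples = [
    ("The music was great.", "**The music was great**<sup>audio</sup>."),
    ("It made me cry.", "**It made me cry**<sup>emotional response</sup>."),
]
demo_prompt = build_few_shot_prompt(
    "Code the text inductively.", demo_examples, "The art style is stunning.", seed=0
)
print(demo_prompt)
```

Shuffling per call means that no single example is always the last one the model sees before the actual input.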

## Defining the few-shot examples

Each of the LLMCode coding functions utilises few-shot learning, requiring a set of human-coded examples as input. Next, we define the number of few-shot examples we would like to use: 5-10 is a good number to start with. You may later try running this notebook with different numbers of examples and see if the accuracy increases with more examples. The examples are randomly sampled from the provided `examples_file`. If you wish to use all the examples, set `number_of_examples` to -1 (the default).

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

number_of_examples = -1 # @param {"type":"integer"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

if number_of_examples == -1:
    number_of_examples = len(df_examples)
if number_of_examples > len(df_examples):
    raise ValueError(f"Number of examples higher than example dataset {len(df_examples)}")

# Randomly sample the few-shot examples from the examples file (kept separate from the input texts)
few_shot_examples = df_examples.sample(n=number_of_examples).reset_index(drop=True)

# Parse all codes with respective highlights in human-annotated input texts, for later comparison with LLM-generated codes
human_code_highlights = llmcode.get_codes_and_highlights(df_input.coded_text)

# Prepare list of codes in the few-shot examples
few_shot_codes = set(code for coded_text in few_shot_examples.coded_text for _, code in llmcode.parse_codes(coded_text))

print(f"Input data count: {len(df_input)}")
print(f"Few-shot examples count: {len(few_shot_examples)}")
print("Few-shot examples:")
few_shot_examples

## Inductive coding

We first try out the `code_inductively()` function from the LLMCode package, which codes a list of texts given a research question and a DataFrame containing few-shot examples. The function prompts the LLM with batches of texts, which is faster than prompting with one text at a time. Because further analyses require the LLM to preserve the exact formatting of the original text, the function also attempts to correct some common LLM errors, such as silently fixing typos or omitting non-coded sentences of the original text.
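
To check whether a coded text still preserves the original, one simple approach is to strip the annotations and compare the result with the input. The sketch below illustrates the idea only; it is not how LLMCode performs its checks or corrections, the function names are hypothetical, and it assumes that the original text contains no literal `**` or `<sup>` markup.

```python
import re

def strip_annotations(coded_text):
    """Remove <sup>...</sup> code lists and ** highlight markers from a coded text."""
    without_codes = re.sub(r"<sup>.*?</sup>", "", coded_text, flags=re.DOTALL)
    return without_codes.replace("**", "")

def preserves_original(original_text, coded_text):
    """Return True if the coded text equals the original once annotations are removed."""
    return strip_annotations(coded_text) == original_text

original = "Its in games like this when you are aware of be playing an artwork."
faithful = "Its in games like this when you are aware of be **playing an artwork**<sup>art</sup>."
corrected = "It's in games like this when you are aware of **playing an artwork**<sup>art</sup>."

print(preserves_original(original, faithful))   # True
print(preserves_original(original, corrected))  # False: the model "fixed" the text
```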

In [None]:
# @title
# Perform inductive coding
coded_texts_ind = llmcode.code_inductively(
    texts=df_input.text.tolist(),
    research_question=research_question,
    few_shot_examples=few_shot_examples,
    gpt_model=gpt_model
)

print("Coding complete")

Next, we evaluate the output. The method depends on whether you ticked `input_file_is_coded` when loading the data.

**Evaluation against a coded input file**

If `input_file_is_coded` was ticked, we print out a table comparing the LLM-coded texts to the human-coded ones. The table contains two metrics to help you evaluate the model's output: IoU, introduced in the preceding "Relevant data extraction" notebook, and a modified [Hausdorff distance measure](https://en.wikipedia.org/wiki/Hausdorff_distance). As you may recall, IoU (intersection over union) calculates the overlap between LLM- and human-highlighted parts of the text, measuring how closely the model's choice of interesting passages matches that of a human coder. However, IoU does not tell us anything about the codes that are assigned to each highlight.
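
As a rough illustration of IoU on highlights, the sketch below computes it over character positions. Whether LLMCode measures overlap at the character, word, or some other level is not stated here, so the character-level choice, the handling of texts without highlights, and the function names are all assumptions.

```python
import re

def highlighted_chars(coded_text):
    """Return the character positions (in the unannotated text) that are highlighted."""
    # Drop the <sup> code lists, then split into alternating plain/highlighted segments
    text = re.sub(r"<sup>.*?</sup>", "", coded_text, flags=re.DOTALL)
    positions = set()
    offset = 0  # position in the text with all annotations removed
    for i, segment in enumerate(text.split("**")):
        if i % 2 == 1:  # odd segments sit between a pair of ** markers
            positions.update(range(offset, offset + len(segment)))
        offset += len(segment)
    return positions

def highlight_iou(coded_a, coded_b):
    """Intersection over union of the highlighted character positions."""
    a, b = highlighted_chars(coded_a), highlighted_chars(coded_b)
    if not a and not b:
        return 1.0  # convention chosen here: two texts without highlights agree fully
    return len(a & b) / len(a | b)

human = "The **music**<sup>audio</sup> was great."
llm = "The **music was**<sup>audio</sup> great."
print(round(highlight_iou(human, llm), 2))  # 5 shared characters out of 9 -> 0.56
```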

Note that the coded parts of the text may vary between annotators, which makes a highlight level evaluation of codes difficult. In order to address this issue, we compare the codes on a text (as opposed to highlight) level, by merging all the codes that are assigned to any highlight in the text. The Hausdorff distance measure calculated by the LLMCode package then utilizes code embeddings to evaluate the semantic similarity between the LLM and human-generated codes for each text.
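
To give an intuition for a set-to-set distance of this kind, the sketch below implements one common modification of the Hausdorff distance, which averages nearest-neighbour distances instead of taking their maximum. The exact variant and distance function used by LLMCode are not specified here, so this choice, the use of cosine distance, and the toy two-dimensional vectors standing in for code embeddings are all assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def directed_distance(set_a, set_b):
    """Average distance from each vector in set_a to its nearest neighbour in set_b."""
    return sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)

def modified_hausdorff(set_a, set_b):
    """Symmetric distance between two non-empty sets of vectors."""
    return max(directed_distance(set_a, set_b), directed_distance(set_b, set_a))

human_codes = [[1.0, 0.0], [0.0, 1.0]]    # two distinct codes
llm_similar = [[1.0, 0.0], [0.0, 1.0]]    # the same two codes
llm_different = [[1.0, 0.0], [1.0, 0.0]]  # misses the second human code

print(modified_hausdorff(human_codes, llm_similar))    # 0.0
print(modified_hausdorff(human_codes, llm_different))  # 0.5
```

The second result is non-zero because one human code has no close counterpart among the LLM codes, even though every LLM code matches a human code exactly.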

You can choose to sort the texts by either of the two metrics, which makes it easy to spot cases where the model is underperforming. Note that a **higher IoU is better** (higher IoU = more overlap between highlighted segments), while a **lower Hausdorff distance is better** (lower distance = codes are more similar). Remember to also compare the LLM- and human-annotated texts yourself, as sometimes the errors may be due to inconsistencies in human coding.

**Evaluation without a coded input file**

In the case where the input file is not coded by a human, we print out all of the LLM-coded texts in the same order as the input. You may visually inspect the codes to see if you can spot any coding errors.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

sort_by = "Hausdorff" # @param ["IoU","Hausdorff","Input order"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

def print_coded_texts(coded_texts, sample_n=None):
    if sample_n:
        coded_texts = random.sample(coded_texts, sample_n)
    for coded_text in coded_texts:
        display(Markdown(coded_text))
        display(Markdown("-----"))

if input_file_is_coded:
    html_report_ind, df_eval_ind = llmcode.run_coding_eval(
        llm_coded_texts=coded_texts_ind,
        human_coded_texts=df_input.coded_text.tolist(),
        embedding_context=embedding_context,
        embedding_model=embedding_model,
        sort_by=sort_by
    )

    avg_iou = np.mean(df_eval_ind["IoU"])
    avg_hausdorff = np.mean(df_eval_ind["Hausdorff"])
    print(f"Average IoU: {avg_iou:.4f}")
    print(f"Average Hausdorff distance: {avg_hausdorff:.4f}")

    display(HTML(html_report_ind))
else:
    print("Input file not set as coded, printing all coded texts:")
    print_coded_texts(coded_texts_ind)

Next, let's visualize all the LLM-generated codes and the number of times they appear. Compare these to the human-annotated code distribution above. What differences can you spot?

In [None]:
# @title
def plot_generated_codes(code_highlights, title):
    code_counts = [(code, len(highlights)) for code, highlights in code_highlights.items()]
    df_codes = pd.DataFrame(code_counts, columns=['Code', 'Count'])
    df_codes = df_codes.sort_values(by='Count', ascending=False).reset_index(drop=True)

    # Create a vertical bar plot using Plotly with angled x-axis labels
    fig = px.bar(df_codes, x='Code', y='Count', title=title)

    # Update layout to angle x-axis labels at 45 degrees
    fig.update_layout(xaxis_tickangle=-45)
    fig.show()

# Parse all codes and highlights in LLM output
code_highlights_ind = llmcode.get_codes_and_highlights(coded_texts_ind)
plot_generated_codes(code_highlights_ind, 'LLM-generated inductive codes')

In order to compare the human- and LLM-generated codes in more detail, let's visualize the codes with the help of _word embeddings_, which capture the meaning of words and can therefore be used to explore code similarities. We will use OpenAI's embedding models for this. Word embeddings are typically high dimensional vectors, so we will reduce the dimensionality of the vectors to two in order to plot them onto a two-dimensional plane.
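
The dimensionality reduction step can be illustrated with principal component analysis (PCA). This is a sketch of one possible technique, not necessarily the one LLMCode's `get_2d_code_embeddings` uses; the helper name `reduce_to_2d` is hypothetical and the random vectors are stand-ins for real embeddings.

```python
import numpy as np

def reduce_to_2d(embeddings):
    """Project row vectors onto their first two principal components (PCA via SVD)."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 64))  # 10 codes, 64 dimensions (arbitrary)
points_2d = reduce_to_2d(fake_embeddings)
print(points_2d.shape)  # (10, 2)
```

Codes with similar embeddings end up close to each other in the plane, although some of the original structure is inevitably lost in the projection.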

We create an interactive visualization of the code embeddings using Plotly, so that you can hover over the plot to reveal the code name and an example of an associated highlight from the texts. The size of each marker corresponds to the total number of highlights for that code in the texts. The visualization contains markers for three kinds of codes:
* LLM code: A code entirely generated by the LLM.
* LLM code (few-shot): A code generated by the LLM that was included in the few-shot examples given to it.
* Human code: A code in the human-annotated input texts, for comparison. These are only visualized if `input_file` is coded.

Note that there is likely a lot of overlap between the markers as both LLM- and human-annotated codes are plotted, so be sure to hover your mouse over the plot to view details of any overlapping markers. Can you spot a difference between the counts of the codes entirely generated by the LLM and the codes included in the few-shot examples?

In [None]:
# @title
def prepare_code_vis_df(code_highlights, human_code_highlights, few_shot_codes, embedding_context, embedding_model):
    # Find code embeddings for all codes
    all_codes = set(code_highlights.keys()).union(set(human_code_highlights.keys()))
    df_em = llmcode.get_2d_code_embeddings(list(all_codes), embedding_context, embedding_model)

    # Create DataFrame of LLM-generated codes
    df_llm = pd.DataFrame([(c,) for c in code_highlights.keys()], columns=["code"])
    df_llm["code_count"] = df_llm["code"].apply(lambda code: len(code_highlights[code]))
    df_llm["example"] = df_llm["code"].apply(lambda code: code_highlights[code][0])
    df_llm["group"] = df_llm.code.apply(lambda code: "LLM code (few-shot)" if code in few_shot_codes else "LLM code")

    # Create DataFrame of human-generated codes
    df_human = pd.DataFrame([(c,) for c in human_code_highlights.keys()], columns=["code"])
    df_human["code_count"] = df_human["code"].apply(lambda code: len(human_code_highlights[code]))
    df_human["example"] = df_human["code"].apply(lambda code: human_code_highlights[code][0])
    df_human["group"] = "Human code"

    # Concatenate code DataFrames and merge with embeddings
    df_em_codes = pd.concat([df_llm, df_human])
    df_em_codes = df_em_codes.merge(df_em, on="code", validate="many_to_one")
    return df_em_codes

def visualise_2d_embeddings(df_em):
    # Prepare labels for visualisation
    hover_texts = []
    colors = []  # List to store color categories
    for _, row in df_em.iterrows():
        text = f"{row.code} ({row.code_count})</br></br>"

        # Add an example of a code highlight
        text += '"' + "</br>".join(textwrap.wrap(row.example, width=60)) + '"'

        hover_texts.append(text)

        # Determine color category based on group
        colors.append(row.group)

    df_vis = pd.DataFrame()
    df_vis["Hover"] = hover_texts
    df_vis["Size"] = [c / df_em["code_count"].max() for c in df_em["code_count"]]
    df_vis["x"] = df_em["code_2d_0"]
    df_vis["y"] = df_em["code_2d_1"]
    df_vis["Color"] = colors

    # Plot the codes in 2D
    fig = px.scatter(df_vis,
                     width=1000, height=800,
                     x="x",
                     y="y",
                     size="Size",
                     color="Color",  # Set color categories
                     hover_name="Hover",
                     title="Codes Visualised in 2D")


    fig.show()

df_em = prepare_code_vis_df(
    code_highlights_ind,
    human_code_highlights,
    few_shot_codes,
    embedding_context,
    embedding_model
)
visualise_2d_embeddings(df_em)

## Inductive coding with code consistency

You may have noticed that the above inductive coding approach leads to a large quantity of codes, some of which may carry essentially the same meaning. This is because `code_inductively()` passes each text through the model individually and in parallel, meaning that the system cannot keep track of an internal codebook as a human coder would. Many of the codes are assigned to only a single highlight, which does not provide much value to qualitative analysis.

Below, we try an alternative inductive coding function in the LLMCode package, `code_inductively_with_code_consistency()`. This function processes the texts sequentially, to allow the reuse of codes between texts instead of creating an entirely new set of possibly redundant codes for each text. The system does this by keeping track of a list of previous codes that is added as input to each prompt.

If you take a closer look at the code, you may notice that the `code_inductively_with_code_consistency()` function also generates a description for each code, in order to avoid generating duplicate codes with the same meaning. We store these descriptions, as they elaborate on the meaning of each code and may be useful in further processing.

Note that, because the texts are processed sequentially, the function may take longer to complete, especially with larger datasets.
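
The idea of carrying a running list of codes from one text to the next can be sketched as follows. This is a simplified illustration, not the implementation of `code_inductively_with_code_consistency()`; all function names are hypothetical, and `fake_llm` is a stand-in for a real LLM call that would build a prompt containing the known codes.

```python
import re

def extract_codes(coded_text):
    """Collect all codes listed inside <sup>...</sup> tags."""
    codes = []
    for code_list in re.findall(r"<sup>(.*?)</sup>", coded_text, flags=re.DOTALL):
        codes += [c.strip() for c in code_list.split(";") if c.strip()]
    return codes

def code_sequentially(texts, code_one_text):
    """Code texts one at a time, passing the codes seen so far to each call.

    code_one_text(text, known_codes) is a caller-supplied function returning
    the annotated text.
    """
    known_codes = []  # running codebook, in order of first appearance
    coded_texts = []
    for text in texts:
        coded = code_one_text(text, list(known_codes))
        coded_texts.append(coded)
        for code in extract_codes(coded):
            if code not in known_codes:
                known_codes.append(code)
    return coded_texts, known_codes

def fake_llm(text, known_codes):
    # Stand-in for an LLM: reuse an existing code if there is one, else invent "visuals"
    code = known_codes[0] if known_codes else "visuals"
    return f"**{text}**<sup>{code}</sup>"

demo_coded, codes_so_far = code_sequentially(["Nice sunset.", "Great lighting."], fake_llm)
print(codes_so_far)  # ['visuals']
```

Because each call depends on the codes produced by the previous ones, the loop cannot be parallelized in the same way as independent per-text prompts.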

In [None]:
# @title
coded_texts_ind_con, code_descriptions_ind_con = llmcode.code_inductively_with_code_consistency(
    texts=df_input.text.tolist(),
    research_question=research_question,
    few_shot_examples=few_shot_examples,
    gpt_model=gpt_model
)

Let's again evaluate the output, using the same method as previously:

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

sort_by = "Hausdorff" # @param ["IoU","Hausdorff","Input order"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

if input_file_is_coded:
    html_report_ind_con, df_eval_ind_con = llmcode.run_coding_eval(
        llm_coded_texts=coded_texts_ind_con,
        human_coded_texts=df_input.coded_text.tolist(),
        embedding_context=embedding_context,
        embedding_model=embedding_model,
        sort_by=sort_by
    )

    avg_iou = np.mean(df_eval_ind_con["IoU"])
    avg_hausdorff = np.mean(df_eval_ind_con["Hausdorff"])
    print(f"Average IoU: {avg_iou:.4f}")
    print(f"Average Hausdorff distance: {avg_hausdorff:.4f}")

    display(HTML(html_report_ind_con))
else:
    print("Input file not set as coded, printing all coded texts:")
    print_coded_texts(coded_texts_ind_con)

Similarly, let's again print a visualization showing all of the generated codes:

In [None]:
# @title
# Parse all codes and highlights in LLM output
code_highlights_ind_con = llmcode.get_codes_and_highlights(coded_texts_ind_con)
plot_generated_codes(code_highlights_ind_con, 'LLM-generated inductive codes (with consistency)')

Let's plot the codes again on a two-dimensional plane. Use the visualizations to compare these codes to the codes generated in the previous section, for example in terms of:
- Redundancy
- Alignment with human codes and few-shot codes
- Coverage of what you view as the most important topics
- Creativity

In [None]:
# @title
df_em = prepare_code_vis_df(
    code_highlights_ind_con,
    human_code_highlights,
    few_shot_codes,
    embedding_context,
    embedding_model
)
visualise_2d_embeddings(df_em)

## Deductive coding

In this section, we demonstrate deductive coding with the LLMCode package. Whereas in inductive coding the codes are created as the texts are coded, deductive coding involves applying a set of predefined codes, referred to as a _codebook_, to a dataset.
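
One practical consequence is that every code in deductively coded output should come from the codebook. A check for codes outside the codebook could be sketched as below; this is our own illustration rather than part of LLMCode, the function name is hypothetical, and the case-insensitive comparison is an assumption.

```python
import re

def find_out_of_codebook_codes(coded_texts, codebook):
    """Return the set of codes used in the coded texts that are not in the codebook."""
    allowed = {code.strip().lower() for code in codebook}
    unexpected = set()
    for coded_text in coded_texts:
        for code_list in re.findall(r"<sup>(.*?)</sup>", coded_text, flags=re.DOTALL):
            for code in code_list.split(";"):
                code = code.strip()
                if code and code.lower() not in allowed:
                    unexpected.add(code)
    return unexpected

demo_codebook = ["audio", "visuals", "narrative"]
demo_output = [
    "**The soundtrack**<sup>audio</sup> was moving.",
    "**The ending**<sup>narrative; plot twist</sup> surprised me.",
]
print(find_out_of_codebook_codes(demo_output, demo_codebook))  # {'plot twist'}
```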

We initialise the codebook with all the codes present in the few-shot examples. If `extract_codebook_from_all_human_codes` is ticked, any additional human-annotated ground truth codes from `examples_file` and `input_file` are also included in the codebook. You may extend the codebook by defining your own codes in `additional_codes`, separated by semicolons.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

extract_codebook_from_all_human_codes = True # @param {"type":"boolean"}
additional_codes = "" # @param {"type":"string","placeholder":"Codes separated by ;"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

if extract_codebook_from_all_human_codes:
    # Initialise codebook with all human codes
    codebook = {code for coded_text in df_human.coded_text for _, code in llmcode.parse_codes(coded_text)}
else:
    # Initialise codebook with few-shot codes
    codebook = {code for coded_text in few_shot_examples.coded_text for _, code in llmcode.parse_codes(coded_text)}

if additional_codes:
    # Add user-defined codes
    codebook = codebook | {code.strip() for code in additional_codes.split(";") if code.strip()}

codebook = list(codebook)
print("Codebook:")
for code in sorted(codebook):
    print(f"- {code}")

Next, we run the `code_deductively()` function from the LLMCode package using the above codebook:

In [None]:
# @title
# Deductive coding
coded_texts_ded = llmcode.code_deductively(
    texts=df_input.text.tolist(),
    research_question=research_question,
    codebook=codebook,
    few_shot_examples=few_shot_examples,
    gpt_model=gpt_model
)

Next, we once again evaluate the output, plot the distribution of generated codes, and visualize the codes in 2D.

In [None]:
#@title Evaluate LLM-generated deductive codes

#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

sort_by = "Hausdorff" # @param ["IoU","Hausdorff","Input order"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

if input_file_is_coded:
    html_report_ded, df_eval_ded = llmcode.run_coding_eval(
        llm_coded_texts=coded_texts_ded,
        human_coded_texts=df_input.coded_text.tolist(),
        embedding_context=embedding_context,
        embedding_model=embedding_model,
        sort_by=sort_by
    )

    avg_iou = np.mean(df_eval_ded["IoU"])
    avg_hausdorff = np.mean(df_eval_ded["Hausdorff"])
    print(f"Average IoU: {avg_iou:.4f}")
    print(f"Average Hausdorff distance: {avg_hausdorff:.4f}")

    display(HTML(html_report_ded))
else:
    print("Input file not set as coded, printing all coded texts:")
    print_coded_texts(coded_texts_ded)

In [None]:
#@title Show distribution of LLM-generated deductive codes

# Parse all codes and highlights in LLM output
code_highlights_ded = llmcode.get_codes_and_highlights(coded_texts_ded)
plot_generated_codes(code_highlights_ded, 'LLM-generated deductive codes')

In [None]:
#@title Plot LLM-generated deductive codes

df_em = prepare_code_vis_df(
    code_highlights_ded,
    human_code_highlights,
    few_shot_codes,
    embedding_context,
    embedding_model,
)
visualise_2d_embeddings(df_em)

## Comparing coding performance across the different functions

We may use the calculated IoU and Hausdorff distance metrics to compare the performance of the different functions against each other. What differences can you spot?

In [None]:
# @title

if input_file_is_coded:
    methods = {
        "Inductive": df_eval_ind,
        "Inductive with code consistency": df_eval_ind_con,
        "Deductive": df_eval_ded
    }

    for method, df_eval in methods.items():
        print(f"Performance of {method} method:")
        avg_iou = np.mean(df_eval["IoU"])
        avg_hausdorff = np.mean(df_eval["Hausdorff"])
        print(f"Average IoU: {avg_iou:.4f}")
        print(f"Average Hausdorff distance: {avg_hausdorff:.4f}")
        print("")
else:
    print("Input file not set as coded, skipping evaluation")

Both the IoU and Hausdorff distance metrics measure how well the LLM-generated codes align with human-generated codes. However, given the subjective nature of qualitative coding, one may also be interested in comparing the diversity of useful codes generated by the LLM methods. Some of these might be new codes that do not exist in the human-coded data. One way to carry out such an analysis is to evaluate the codes visually, using the plots we generated at the end of each coding section. For the inductive coding functions, can you find any new and interesting codes that weren't present in the human-coded data?
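
Besides visual inspection, a quick way to list candidate new codes is a set difference between the LLM and human code names. The sketch below assumes dictionaries that map each code to a list of its highlights, and uses toy data; `find_new_codes` is a hypothetical helper. Note that exact name matching will also report near-synonyms of human codes as "new", so the result still needs manual review.

```python
def find_new_codes(llm_code_highlights, human_code_highlights):
    """List LLM codes absent from the human codes, most frequent first."""
    human_codes = {code.lower() for code in human_code_highlights}
    new_codes = [
        (code, len(highlights))
        for code, highlights in llm_code_highlights.items()
        if code.lower() not in human_codes
    ]
    # Sort by descending count, then alphabetically for a stable order
    return sorted(new_codes, key=lambda item: (-item[1], item[0]))

demo_llm = {
    "visuals": ["a sunset on a perfect landscape"],
    "nostalgia": ["reminded me of my childhood", "just like the old days"],
    "player agency": ["my choices mattered"],
}
demo_human = {"visuals": ["the graphic design"], "audio": ["the music"]}

print(find_new_codes(demo_llm, demo_human))
# [('nostalgia', 2), ('player agency', 1)]
```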

# Downloading the results

Run the following code to download the generated codes (and code descriptions where applicable) as .csv files compressed into a single zip file. You may use these files as input to the following notebook on "Theme generation" (if you do not wish to download your results, you may also run the following notebook with default codes provided by us).

In [None]:
#@title

coded_texts_by_method = {
    "Inductive": locals().get("coded_texts_ind"),
    "Inductive with code consistency": locals().get("coded_texts_ind_con"),
    "Deductive": locals().get("coded_texts_ded")
}

# Create output dir
output_dir = "coding_output"  # TODO: Prompt from user
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Save raw and coded texts for each method
for name, coded_texts in coded_texts_by_method.items():
    if coded_texts is not None:
        data = [(t, t_coded) for t, t_coded in zip(df_input.text.tolist(), coded_texts)]
        df_out = pd.DataFrame(data, columns=["text", "coded_text"])
        file_path = "{}/coded_texts_{}.csv".format(output_dir, name.lower().replace(" ", "_"))
        df_out.to_csv(file_path, index=False)

# Save code descriptions for ind con
if locals().get("code_descriptions_ind_con"):
    data = code_descriptions_ind_con.items()
    df_out = pd.DataFrame(data, columns=["code", "description"])
    file_path = f"{output_dir}/code_descriptions_inductive_with_code_consistency.csv"
    df_out.to_csv(file_path, index=False)

# Zip directory and save locally
zip_file = f"{output_dir}.zip"
!zip -r {zip_file} {output_dir}
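if RunningInCOLAB:
    files.download(zip_file)
else:
    # Outside Colab there is no browser download; report where the file was saved
    print(f"Results saved to {os.path.abspath(zip_file)}")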