# **Using LLMs for Deductive Coding in Qualitative Research**


# **1. Getting Started with Gemini: Your First API Call**

Let's make our first call to the Gemini model! We'll go step-by-step to set up the environment, plug in the API key, and send a simple prompt.

In [2]:
# 🧱 **Step 1: Install the Gemini SDK**
!pip install -q google-generativeai

In [3]:
# 🔑 **Step 2: Import Libraries and Set Up Your API Key**
import os
import google.generativeai as genai

# Paste your API key here (⚠️ do not share in public notebooks)
os.environ["GOOGLE_API_KEY"] = ""  # <--- REPLACE THIS with your own api key

# Configure the API key
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
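
Pasting the key into the notebook, as in the cell above, is convenient for a workshop but easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is either already set in the environment or can be typed at a hidden prompt; `load_api_key` is a hypothetical helper, not part of the Gemini SDK:

```python
import os
from getpass import getpass


def load_api_key(env_var="GOOGLE_API_KEY", ask=getpass):
    """Return the API key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Prompt without echoing the key, then cache it for the rest of the session.
        key = ask(f"Enter {env_var}: ").strip()
        os.environ[env_var] = key
    return key
```

With this helper, the configuration line would become `genai.configure(api_key=load_api_key())`, and no key is stored in the notebook file.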

In [4]:
# ⚙️ **Step 3: Initialize the Gemini Model**
model = genai.GenerativeModel("gemini-1.5-flash")

In [5]:
# 💬 **Step 4: Try Your First Prompt!**
response = model.generate_content("List three benefits of using LLMs for qualitative coding.")

# Print the result
print(response.text)

Three benefits of using LLMs for qualitative coding are:

1. **Increased Speed and Efficiency:** LLMs can process vast amounts of textual data significantly faster than a human coder. This allows researchers to analyze larger datasets and complete coding projects in a shorter timeframe, accelerating the research process.

2. **Improved Consistency and Reliability:**  Human coders can introduce bias and inconsistencies in their coding decisions. LLMs, trained on large datasets, can apply coding schemes more consistently, reducing inter-rater reliability issues and enhancing the objectivity of the analysis.  This leads to more robust and reproducible findings.

3. **Enhanced Identification of Subtle Themes and Patterns:** LLMs can identify subtle relationships and patterns within the data that might be missed by human coders due to cognitive limitations or fatigue. Their ability to analyze language nuances and contextual information can uncover hidden themes and enrich the qualitative un

# **2. Qualitative Coding Task with Gemini**

In this session, we'll walk through a simple example of using Gemini for a deductive qualitative coding task.

## **2.1 Task Description**

In this example, we'll use a deductive coding task from a research project conducted at CCRC. In the study, we aimed to code how well extracted skills align with course descriptions.

The course data comes from publicly available course descriptions on a higher education institution's website, while the skills are drawn from O*NET's list of over 2,000 Detailed Work Activities (DWAs).

Below is the coding rubric we used to assess the alignment level between each skill and the course description.

*   Direct Match (5): The task is a core learning objective of the course and is explicitly covered.
*   Related (4): The task aligns with the course content, and students should be able to perform it after completing the course.
*   Transferable (3): The task is not explicitly covered, but students may develop transferable skills that can be applied to it.
*   Same Domain (2): The task falls within the same discipline or broader field but is not relevant to the course content.
*   Unrelated (1): The task is outside the scope of the course and belongs to a different domain.
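
Because the same rubric is pasted into every prompt later in this notebook, it can help to keep it in one data structure and render the prompt text from it. The sketch below is one way to do that; `RUBRIC` and `rubric_to_text` are illustrative names, not part of the original study code, and the definitions are copied from the prompt used in Section 2.2.

```python
# Rubric stored once: score -> (label, definition).
RUBRIC = {
    5: ("Direct Match", "The real-world task is a core learning objective of the course and is explicitly covered in the course description."),
    4: ("Related", "The real-world task aligns with the course content, and students should be able to perform this task upon completing the course."),
    3: ("Transferable", "The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it."),
    2: ("Same Domain", "The task falls within the same discipline or broader field but is not relevant to the course content."),
    1: ("Unrelated", "The task is outside the scope of the course and belongs to a different domain."),
}


def rubric_to_text(rubric):
    """Render the rubric as 'Score N (Label): definition' lines, highest score first."""
    lines = []
    for score in sorted(rubric, reverse=True):
        label, definition = rubric[score]
        lines.append(f"Score {score} ({label}): {definition}")
    return "\n".join(lines)


print(rubric_to_text(RUBRIC))
```

Editing a definition in `RUBRIC` then updates every prompt that is built from it, which keeps the prompt variants comparable.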





## **2.2 Performing Qualitative Coding with Gemini**
In this part, we'll run a live example of qualitative coding using the Gemini API to demonstrate how LLMs can be guided to perform structured deductive coding in a real research context.


**Step 1: Loading the data...**

In [6]:
# read files
import pandas as pd

url = "https://raw.githubusercontent.com/ZhenXu-0/Deductive-Coding-using-LLMs/refs/heads/main/human_labeled_data_100.csv"
df = pd.read_csv(url)

In [7]:
df = df.head(10)
print(df)

                               domain     code                    title  \
0  Business Management and Accounting  ACC 122  accounting principles i   
1  Business Management and Accounting  ACC 122  accounting principles i   
2  Business Management and Accounting  ACC 122  accounting principles i   
3  Business Management and Accounting  ACC 122  accounting principles i   
4  Business Management and Accounting  ACC 122  accounting principles i   
5  Business Management and Accounting  ACC 122  accounting principles i   
6  Business Management and Accounting  ACC 122  accounting principles i   
7  Business Management and Accounting  ACC 122  accounting principles i   
8  Business Management and Accounting  ACC 122  accounting principles i   
9  Business Management and Accounting  ACC 122  accounting principles i   

                                         description  \
0  the course covers the fundamental principles o...   
1  the course covers the fundamental principles o...   
2  the

**Step 2: Prompting the Gemini Model to Perform the Coding Task** ...

In [10]:
import os
import pandas as pd
import google.generativeai as genai

# --- Step 1: Setup Gemini API ---
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

# --- Step 2: Define your rubric as a string ---
instruction = """
You are given a course and a skill.

Rate the alignment of the skill with the course using the following 5-point scale:

Score 5 (Direct Match): The real-world task is a core learning objective of the course and is explicitly covered in the course description.
Score 4 (Related): The real-world task aligns with the course content, and students should be able to perform this task upon completing the course.
Score 3 (Transferable): The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it.
Score 2 (Same Domain): The task falls within the same discipline or broader field but is not relevant to the course content.
Score 1 (Unrelated): The task is outside the scope of the course and belongs to a different domain.

Only return the numeric score (1–5).
"""

# --- Step 3: Define a prompting function ---
def score_alignment(domain, title, description, skill):
    prompt = (
        f"{instruction}\n\n"
        f"Course Domain: {domain}\n"
        f"Course Title: {title}\n"
        f"Course Description: {description}\n"
        f"Skill: {skill}\n"
        f"Score:"
    )
    try:
        response = model.generate_content(prompt)
        return response.text.strip().split()[0]  # Just return the score (first token)
    except Exception as e:
        return f"Error: {e}"

# --- Step 4: Apply to DataFrame ---

llm_scores = []
for i, row in enumerate(df.itertuples(), 1):
    print(f"Scoring row {i}/{len(df)}: {row.title[:50]}...")  # show part of the title
    score = score_alignment(row.domain, row.title, row.description, row.skill)
    llm_scores.append(score)
df["llm_score"] = llm_scores


Scoring row 1/10: accounting principles i...
Scoring row 2/10: accounting principles i...
Scoring row 3/10: accounting principles i...
Scoring row 4/10: accounting principles i...
Scoring row 5/10: accounting principles i...
Scoring row 6/10: accounting principles i...
Scoring row 7/10: accounting principles i...
Scoring row 8/10: accounting principles i...
Scoring row 9/10: accounting principles i...
Scoring row 10/10: accounting principles i...
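
`score_alignment` keeps the first whitespace-separated token of the reply, so a reply such as `**4**` or `Score: 4` would later be coerced to a missing value. A slightly more forgiving parser could look like the sketch below; `parse_score` is a hypothetical helper and assumes the first standalone digit from 1 to 5 in the reply is the rating.

```python
import re


def parse_score(text):
    """Return the first standalone digit 1-5 found in the reply, or None."""
    match = re.search(r"\b[1-5]\b", text)
    return int(match.group()) if match else None


print(parse_score("Score: 4"))    # a labelled reply
print(parse_score("no rating"))   # nothing usable in the reply
```

This is a heuristic: an error message that happens to contain a lone digit would also be parsed, so API errors are better handled separately from parsing.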


In [11]:
print(df)

                               domain     code                    title  \
0  Business Management and Accounting  ACC 122  accounting principles i   
1  Business Management and Accounting  ACC 122  accounting principles i   
2  Business Management and Accounting  ACC 122  accounting principles i   
3  Business Management and Accounting  ACC 122  accounting principles i   
4  Business Management and Accounting  ACC 122  accounting principles i   
5  Business Management and Accounting  ACC 122  accounting principles i   
6  Business Management and Accounting  ACC 122  accounting principles i   
7  Business Management and Accounting  ACC 122  accounting principles i   
8  Business Management and Accounting  ACC 122  accounting principles i   
9  Business Management and Accounting  ACC 122  accounting principles i   

                                         description  \
0  the course covers the fundamental principles o...   
1  the course covers the fundamental principles o...   
2  the

## **2.3 Checking the Reliability of LLM Coding**
After obtaining the LLM-generated labels, we compare them with human-labeled ground truth to evaluate the reliability of the model's output.

The ground truth in this dataset (the 'human_score' column) was created by two trained human raters. Their scores were calibrated through an initial round of alignment, and inter-rater agreement was assessed.

The following code calculates the agreement between LLM-generated labels and human-coded ground truth using two common reliability metrics: Cohen's Kappa (quadratically weighted, since the scores are ordinal) and the Intraclass Correlation Coefficient (ICC).

In [12]:
!pip install -q pingouin scikit-learn
import pandas as pd
from sklearn.metrics import cohen_kappa_score
import pingouin as pg  # for ICC

# Make sure scores are numeric
df["human_score"] = pd.to_numeric(df["human_score"], errors="coerce")
df["llm_score"] = pd.to_numeric(df["llm_score"], errors="coerce")

# Drop rows with missing values
df_scores = df[["human_score", "llm_score"]].dropna()

In [13]:
# 🔹 Cohen's Kappa (quadratic weights penalize large disagreements more than small ones)
kappa = cohen_kappa_score(df_scores["human_score"], df_scores["llm_score"], weights="quadratic")
print(f"Cohen's Kappa: {kappa:.3f}")

# 🔹 ICC
# Melt into long format; the original row index identifies each rated item
df_long = df_scores.reset_index().melt(
    id_vars="index",                  # this becomes the 'targets' column
    value_vars=["human_score", "llm_score"],
    var_name="rater",
    value_name="score"
)

# Calculate ICC(3): two-way mixed effects, single rater, consistency
icc_table = pg.intraclass_corr(
    data=df_long,
    targets="index",
    raters="rater",
    ratings="score"
)

# Filter ICC(3) result
icc3_value = icc_table[icc_table["Type"] == "ICC3"]["ICC"].values[0]
print(f"ICC(3): {icc3_value:.3f}")


Cohen's Kappa: 0.277
ICC(3): 0.427
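
To make the weighted kappa less of a black box, here is a plain-Python sketch of the quadratic-weighted formula, kappa = 1 - sum(w * observed) / sum(w * expected), with weights w = (i - j)^2 / (K - 1)^2. It assumes the full 1 to 5 scale as the category set. As far as I know, scikit-learn infers the categories from the values that actually occur unless `labels=` is passed, so the two can differ when some scores never appear.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5)):
    """Quadratic-weighted Cohen's kappa for two equal-length lists of ratings."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n  # agreement expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    # Undefined (division by zero) if both raters use a single identical category.
    return 1.0 - weighted_observed / weighted_expected
```

Identical ratings give 1.0, and disagreements cost more the further apart the two scores are, which is why a 5-versus-1 mismatch hurts far more than a 5-versus-4 mismatch.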


# **3. Improving LLM Deductive Coding Reliability through Prompt Engineering**

In this session, we'll explore how different prompt engineering strategies can help improve the reliability of LLMs in deductive coding tasks. By adjusting how we frame instructions, provide context, or structure examples, we can guide the model to produce outputs that better align with human-coded results.

## **3.1 Prompt Engineering Basics**

In this section, we'll walk through three common prompting strategies to improve LLM performance on deductive coding tasks:

*   Zero-shot prompting: The model is given only the instructions and the input text, without any examples. This tests how well the model can follow general guidance.
*   Few-shot prompting: The prompt includes a few example inputs and their correct coded outputs. This helps the model learn from context and apply similar logic to new data.
*   Chain-of-thought (CoT) prompting: We ask the model to explain its reasoning before giving the final code. This encourages step-by-step thinking and often improves accuracy and consistency.
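
The three strategies differ only in what is placed between the rubric and the case to be coded. The sketch below makes that explicit; `build_prompt` and its parameters are illustrative names and are not used by the cells in this notebook, which edit the `instruction` string directly.

```python
def build_prompt(rubric, case, examples=None, reasoning_hint=None):
    """Assemble a prompt from a rubric, optional examples, an optional reasoning cue, and the case."""
    parts = [rubric.strip()]
    if examples:
        # Few-shot: show worked examples before the case.
        parts.append("Here are some examples:")
        parts.extend(example.strip() for example in examples)
    if reasoning_hint:
        # Chain-of-thought: ask for step-by-step reasoning.
        parts.append(reasoning_hint.strip())
    parts.append(case.strip())
    parts.append("Score:")
    return "\n\n".join(parts)


rubric = "Rate the alignment of the skill with the course on a 1-5 scale."
case = "Course Title: accounting principles i\nSkill: Verify accuracy of records."

zero_shot = build_prompt(rubric, case)
few_shot = build_prompt(rubric, case, examples=["Treat medical emergencies. Rating:5"])
cot = build_prompt(rubric, case,
                   examples=["Treat medical emergencies. Rating:5"],
                   reasoning_hint="Think step by step.")
```

Keeping the rubric and case fixed while varying only `examples` and `reasoning_hint` makes it easier to attribute a change in agreement to the strategy itself.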


## **3.2 Testing Prompt Strategies**

We'll now test each prompting strategy (zero-shot, few-shot, and chain-of-thought) using the same input. By comparing the outputs, we'll evaluate how each approach influences the clarity, consistency, and alignment of LLM coding with human-coded results.

### **Zero-Shot**

In [None]:
# --- Step 1: Edit prompt as zero-shot ---
instruction = """
You are given a course and a skill.

Rate the alignment of the skill with the course using the following 5-point scale:

Score 5 (Direct Match): The real-world task is a core learning objective of the course and is explicitly covered in the course description.
Score 4 (Related): The real-world task aligns with the course content, and students should be able to perform this task upon completing the course.
Score 3 (Transferable): The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it.
Score 2 (Same Domain): The task falls within the same discipline or broader field but is not relevant to the course content.
Score 1 (Unrelated): The task is outside the scope of the course and belongs to a different domain.

Only return the numeric score (1–5).
"""

# --- Step 2: Apply to the data ---
llm_scores = []
for i, row in enumerate(df.itertuples(), 1):
    print(f"Scoring row {i}/{len(df)}: {row.title[:50]}...")  # show part of the title
    score = score_alignment(row.domain, row.title, row.description, row.skill)
    llm_scores.append(score)
df["llm_score"] = llm_scores



Scoring row 1/10: accounting principles i...
Scoring row 2/10: accounting principles i...
Scoring row 3/10: accounting principles i...
Scoring row 4/10: accounting principles i...
Scoring row 5/10: accounting principles i...
Scoring row 6/10: accounting principles i...
Scoring row 7/10: accounting principles i...
Scoring row 8/10: accounting principles i...
Scoring row 9/10: accounting principles i...
Scoring row 10/10: accounting principles i...
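
Each loop in this notebook makes one API call per row, and `score_alignment` turns any exception into an `Error: ...` string that later becomes a missing value. For transient failures such as rate limits, retrying with a growing delay is a common pattern. The wrapper below is a generic sketch; `call_with_retries` is a hypothetical helper, and the retry count and delays are arbitrary choices.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying on any exception with exponentially growing delays."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay before the next try.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Because `score_alignment` already catches every exception, the wrapper would have to go around the API call inside it, for example `call_with_retries(model.generate_content, prompt)`.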


In [None]:
# Make sure scores are numeric
df["human_score"] = pd.to_numeric(df["human_score"], errors="coerce")
df["llm_score"] = pd.to_numeric(df["llm_score"], errors="coerce")

# Drop rows with missing values
df_scores = df[["human_score", "llm_score"]].dropna()

# 🔹 Cohen's Kappa (quadratic weights penalize large disagreements more than small ones)
kappa = cohen_kappa_score(df_scores["human_score"], df_scores["llm_score"], weights="quadratic")
print(f"Cohen's Kappa: {kappa:.3f}")

# 🔹 ICC
# Melt into long format; the original row index identifies each rated item
df_long = df_scores.reset_index().melt(
    id_vars="index",                  # this becomes the 'targets' column
    value_vars=["human_score", "llm_score"],
    var_name="rater",
    value_name="score"
)

# Calculate ICC(3): two-way mixed effects, single rater, consistency
icc_table = pg.intraclass_corr(
    data=df_long,
    targets="index",
    raters="rater",
    ratings="score"
)

# Filter ICC(3) result
icc3_value = icc_table[icc_table["Type"] == "ICC3"]["ICC"].values[0]
print(f"ICC(3): {icc3_value:.3f}")


Cohen's Kappa: 0.277
ICC(3): 0.427
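
The ICC(3) value that `pingouin` reports as `ICC3` is the single-rater consistency form, (MSR - MSE) / (MSR + (k - 1) * MSE), where MSR is the between-item mean square, MSE the residual mean square, and k the number of raters. A plain-Python sketch of that formula, assuming a complete table with one row per item and one column per rater (`icc3_single` is an illustrative name):

```python
def icc3_single(ratings):
    """ICC(3,1) from a list of rows: one row per rated item, one column per rater."""
    n = len(ratings)       # number of items
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
```

Because this is a consistency measure, a rater who is always exactly one point higher than the other still yields an ICC(3) of 1; systematic offsets between the LLM and the human coders are not penalized by this metric.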


### **Few-Shot**

In [None]:
# --- Step 1: Edit prompt as few-shot ---
instruction = """
You are given a course and a skill.

Rate the alignment of the skill with the course using the following 5-point scale:

Score 5 (Direct Match): The real-world task is a core learning objective of the course and is explicitly covered in the course description.
Score 4 (Related): The real-world task aligns with the course content, and students should be able to perform this task upon completing the course.
Score 3 (Transferable): The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it.
Score 2 (Same Domain): The task falls within the same discipline or broader field but is not relevant to the course content.
Score 1 (Unrelated): The task is outside the scope of the course and belongs to a different domain.

Here are some examples:

Example 1:
description: this course is a training program to provide the students with the necessary basic skills and knowledge to deal with a broad spectrum of illness and injuries in the pre hospital care phase of emergency medicine upon successful completion of the course students will take the new york state emergency medical technical certification examination once certified and upon completion of certain fundamental core courses the student will be eligible to take the advanced paramedic level courses of the program the course will be offered in the fall and spring semesters only
Real-world Task:
Treat medical emergencies, Rating:5
Develop emergency response plans or procedures. Rating:4
Provide medical or cosmetic advice for clients. Rating:3
Teach health management classes. Rating:2
Sell products or services. Rating:1

Example 2:
description: this is a course in fundamental engineering drawing and industrial drafting room practice lettering orthographic projection auxiliary views sessions and conventions pictorials threads and fasteners tolerances detail drawing dimension ing and electrical drawing introduction to computer aided graphics are covered
Real-world Task:
Create graphical representations of mechanical equipment. Rating:5
Document technical design details. Rating:4
Develop equipment or component configurations. Rating:3
Develop tools to diagnose or assess needs. Rating:2
Plan community programs or activities for the general public. Rating:1


Now, please evaluate this case

Only return the numeric score (1–5).

"""

# --- Step 2: Apply to the data ---
llm_scores = []
for i, row in enumerate(df.itertuples(), 1):
    print(f"Scoring row {i}/{len(df)}: {row.title[:50]}...")  # show part of the title
    score = score_alignment(row.domain, row.title, row.description, row.skill)
    llm_scores.append(score)
df["llm_score"] = llm_scores



Scoring row 1/10: accounting principles i...
Scoring row 2/10: accounting principles i...
Scoring row 3/10: accounting principles i...
Scoring row 4/10: accounting principles i...
Scoring row 5/10: accounting principles i...
Scoring row 6/10: accounting principles i...
Scoring row 7/10: accounting principles i...
Scoring row 8/10: accounting principles i...
Scoring row 9/10: accounting principles i...
Scoring row 10/10: accounting principles i...


In [None]:
# Make sure scores are numeric
df["human_score"] = pd.to_numeric(df["human_score"], errors="coerce")
df["llm_score"] = pd.to_numeric(df["llm_score"], errors="coerce")

# Drop rows with missing values
df_scores = df[["human_score", "llm_score"]].dropna()

# 🔹 Cohen's Kappa (quadratic weights penalize large disagreements more than small ones)
kappa = cohen_kappa_score(df_scores["human_score"], df_scores["llm_score"], weights="quadratic")
print(f"Cohen's Kappa: {kappa:.3f}")

# 🔹 ICC
# Melt into long format; the original row index identifies each rated item
df_long = df_scores.reset_index().melt(
    id_vars="index",                  # this becomes the 'targets' column
    value_vars=["human_score", "llm_score"],
    var_name="rater",
    value_name="score"
)

# Calculate ICC(3): two-way mixed effects, single rater, consistency
icc_table = pg.intraclass_corr(
    data=df_long,
    targets="index",
    raters="rater",
    ratings="score"
)

# Filter ICC(3) result
icc3_value = icc_table[icc_table["Type"] == "ICC3"]["ICC"].values[0]
print(f"ICC(3): {icc3_value:.3f}")


Cohen's Kappa: 0.371
ICC(3): 0.541


### **Chain-of-Thought**

In [None]:
# --- Step 1: Edit prompt as chain-of-thought (few-shot examples plus a step-by-step cue) ---
instruction = """
You are given a course and a skill.

Rate the alignment of the skill with the course using the following 5-point scale:

Score 5 (Direct Match): The real-world task is a core learning objective of the course and is explicitly covered in the course description.
Score 4 (Related): The real-world task aligns with the course content, and students should be able to perform this task upon completing the course.
Score 3 (Transferable): The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it.
Score 2 (Same Domain): The task falls within the same discipline or broader field but is not relevant to the course content.
Score 1 (Unrelated): The task is outside the scope of the course and belongs to a different domain.

Here are some examples:

Example 1:
description: this course is a training program to provide the students with the necessary basic skills and knowledge to deal with a broad spectrum of illness and injuries in the pre hospital care phase of emergency medicine upon successful completion of the course students will take the new york state emergency medical technical certification examination once certified and upon completion of certain fundamental core courses the student will be eligible to take the advanced paramedic level courses of the program the course will be offered in the fall and spring semesters only
Real-world Task:
Treat medical emergencies, Rating:5
Develop emergency response plans or procedures. Rating:4
Provide medical or cosmetic advice for clients. Rating:3
Teach health management classes. Rating:2
Sell products or services. Rating:1

Example 2:
description: this is a course in fundamental engineering drawing and industrial drafting room practice lettering orthographic projection auxiliary views sessions and conventions pictorials threads and fasteners tolerances detail drawing dimension ing and electrical drawing introduction to computer aided graphics are covered
Real-world Task:
Create graphical representations of mechanical equipment. Rating:5
Document technical design details. Rating:4
Develop equipment or component configurations. Rating:3
Develop tools to diagnose or assess needs. Rating:2
Plan community programs or activities for the general public. Rating:1


Now, please evaluate this case, think step by step

Only return the numeric score (1–5).

"""

# --- Step 2: Apply to the data ---
llm_scores = []
for i, row in enumerate(df.itertuples(), 1):
    print(f"Scoring row {i}/{len(df)}: {row.title[:50]}...")  # show part of the title
    score = score_alignment(row.domain, row.title, row.description, row.skill)
    llm_scores.append(score)
df["llm_score"] = llm_scores


Scoring row 1/10: accounting principles i...
Scoring row 2/10: accounting principles i...
Scoring row 3/10: accounting principles i...
Scoring row 4/10: accounting principles i...
Scoring row 5/10: accounting principles i...
Scoring row 6/10: accounting principles i...
Scoring row 7/10: accounting principles i...
Scoring row 8/10: accounting principles i...
Scoring row 9/10: accounting principles i...
Scoring row 10/10: accounting principles i...


In [None]:
# Make sure scores are numeric
df["human_score"] = pd.to_numeric(df["human_score"], errors="coerce")
df["llm_score"] = pd.to_numeric(df["llm_score"], errors="coerce")

# Drop rows with missing values
df_scores = df[["human_score", "llm_score"]].dropna()

# 🔹 Cohen's Kappa (quadratic weights penalize large disagreements more than small ones)
kappa = cohen_kappa_score(df_scores["human_score"], df_scores["llm_score"], weights="quadratic")
print(f"Cohen's Kappa: {kappa:.3f}")

# 🔹 ICC
# Melt into long format; the original row index identifies each rated item
df_long = df_scores.reset_index().melt(
    id_vars="index",                  # this becomes the 'targets' column
    value_vars=["human_score", "llm_score"],
    var_name="rater",
    value_name="score"
)

# Calculate ICC(3): two-way mixed effects, single rater, consistency
icc_table = pg.intraclass_corr(
    data=df_long,
    targets="index",
    raters="rater",
    ratings="score"
)

# Filter ICC(3) result
icc3_value = icc_table[icc_table["Type"] == "ICC3"]["ICC"].values[0]
print(f"ICC(3): {icc3_value:.3f}")


Cohen's Kappa: 0.371
ICC(3): 0.541
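
In this run the chain-of-thought prompt produced the same agreement figures as the few-shot prompt. The prompt asks the model to think step by step but also to return only the number, and `score_alignment` keeps only the first token, so no written reasoning is ever collected. One possible variant, shown here as an assumption rather than the method used in this notebook, lets the model write out its reasoning and then reads the score from a fixed final line. `COT_SUFFIX` and `extract_final_score` are hypothetical names.

```python
import re

# Instruction appended to the prompt in place of "Only return the numeric score".
COT_SUFFIX = (
    "Explain your reasoning in two or three sentences, "
    "then end with a line of the form 'Final score: N' where N is 1-5."
)


def extract_final_score(reply):
    """Return the score from the last 'Final score: N' line, or None if absent."""
    matches = re.findall(r"Final score:\s*([1-5])\b", reply, flags=re.IGNORECASE)
    return int(matches[-1]) if matches else None


reply = "The description covers recording transactions.\nFinal score: 4"
print(extract_final_score(reply))
```

Using the last match means a score mentioned mid-reasoning does not override the model's final answer, and the stored reasoning can be reviewed when the LLM and human codes disagree.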


# **4. Try Your Own Prompt**

Now, try writing your own prompt and see how the model responds...

## Improved CoT: Add Step-by-Step Evaluation Strategy in CoT

In [8]:
# --- Step 1: Write your own prompt ---
instruction = """

You are given a course and a skill.

Rate the alignment of the skill with the course using the following 5-point scale:

Score 5 (Direct Match): The real-world task is a core learning objective of the course and is explicitly covered in the course description.
Score 4 (Related): The real-world task aligns with the course content, and students should be able to perform this task upon completing the course.
Score 3 (Transferable): The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it.
Score 2 (Same Domain): The task falls within the same discipline or broader field but is not relevant to the course content.
Score 1 (Unrelated): The task is outside the scope of the course and belongs to a different domain.

Here are some examples:

Example 1:
description: this course is a training program to provide the students with the necessary basic skills and knowledge to deal with a broad spectrum of illness and injuries in the pre hospital care phase of emergency medicine upon successful completion of the course students will take the new york state emergency medical technical certification examination once certified and upon completion of certain fundamental core courses the student will be eligible to take the advanced paramedic level courses of the program the course will be offered in the fall and spring semesters only
Real-world Task:
Treat medical emergencies, Rating:5
Develop emergency response plans or procedures. Rating:4
Provide medical or cosmetic advice for clients. Rating:3
Teach health management classes. Rating:2
Sell products or services. Rating:1

Example 2:
description: this is a course in fundamental engineering drawing and industrial drafting room practice lettering orthographic projection auxiliary views sessions and conventions pictorials threads and fasteners tolerances detail drawing dimension ing and electrical drawing introduction to computer aided graphics are covered
Real-world Task:
Create graphical representations of mechanical equipment. Rating:5
Document technical design details. Rating:4
Develop equipment or component configurations. Rating:3
Develop tools to diagnose or assess needs. Rating:2
Plan community programs or activities for the general public. Rating:1


Now, please evaluate this case, think step by step

### Step-by-Step Evaluation Strategy:
1. Read the course description and identify the main goals and content areas.
2. Understand the task and its required skills or knowledge.
3. Ask yourself:
   - Is this task explicitly covered in the course?
   - Is the task relevant, indirectly supported, or completely unrelated to the course?
4. Assign a score (1–5) using the definitions above.

Only return the numeric score (1–5).

"""


In [14]:
# --- Step 2: Apply to the data ---
llm_scores = []
for i, row in enumerate(df.itertuples(), 1):
    print(f"Scoring row {i}/{len(df)}: {row.title[:50]}...")  # show part of the title
    score = score_alignment(row.domain, row.title, row.description, row.skill)
    llm_scores.append(score)
df["llm_score"] = llm_scores


# Make sure scores are numeric
df["human_score"] = pd.to_numeric(df["human_score"], errors="coerce")
df["llm_score"] = pd.to_numeric(df["llm_score"], errors="coerce")

# Drop rows with missing values
df_scores = df[["human_score", "llm_score"]].dropna()

# 🔹 Cohen's Kappa (quadratic weights penalize large disagreements more than small ones)
kappa = cohen_kappa_score(df_scores["human_score"], df_scores["llm_score"], weights="quadratic")
print(f"Cohen's Kappa: {kappa:.3f}")

# 🔹 ICC
# Melt into long format; the original row index identifies each rated item
df_long = df_scores.reset_index().melt(
    id_vars="index",                  # this becomes the 'targets' column
    value_vars=["human_score", "llm_score"],
    var_name="rater",
    value_name="score"
)

# Calculate ICC(3): two-way mixed effects, single rater, consistency
icc_table = pg.intraclass_corr(
    data=df_long,
    targets="index",
    raters="rater",
    ratings="score"
)

# Filter ICC(3) result
icc3_value = icc_table[icc_table["Type"] == "ICC3"]["ICC"].values[0]
print(f"ICC(3): {icc3_value:.3f}")

Scoring row 1/10: accounting principles i...
Scoring row 2/10: accounting principles i...
Scoring row 3/10: accounting principles i...
Scoring row 4/10: accounting principles i...
Scoring row 5/10: accounting principles i...
Scoring row 6/10: accounting principles i...
Scoring row 7/10: accounting principles i...
Scoring row 8/10: accounting principles i...
Scoring row 9/10: accounting principles i...
Scoring row 10/10: accounting principles i...
Cohen's Kappa: 0.512
ICC(3): 0.564


## CCoT: Add Incorrect Examples Based on Improved CoT

In [17]:
# --- Step 1: Write your own prompt ---
instruction = """

You are given a course and a skill.

Rate the alignment of the skill with the course using the following 5-point scale:

Score 5 (Direct Match): The real-world task is a core learning objective of the course and is explicitly covered in the course description.
Score 4 (Related): The real-world task aligns with the course content, and students should be able to perform this task upon completing the course.
Score 3 (Transferable): The task is not explicitly covered in the course, but students may develop transferable skills that can be applied to it.
Score 2 (Same Domain): The task falls within the same discipline or broader field but is not relevant to the course content.
Score 1 (Unrelated): The task is outside the scope of the course and belongs to a different domain.

Here are some correct examples and incorrect examples:

Correct Example 1:
description: this course is a training program to provide the students with the necessary basic skills and knowledge to deal with a broad spectrum of illness and injuries in the pre hospital care phase of emergency medicine upon successful completion of the course students will take the new york state emergency medical technical certification examination once certified and upon completion of certain fundamental core courses the student will be eligible to take the advanced paramedic level courses of the program the course will be offered in the fall and spring semesters only
Real-world Task:
Treat medical emergencies, Rating:5
Develop emergency response plans or procedures. Rating:4
Provide medical or cosmetic advice for clients. Rating:3
Teach health management classes. Rating:2
Sell products or services. Rating:1

Correct Example 2:
description: this is a course in fundamental engineering drawing and industrial drafting room practice lettering orthographic projection auxiliary views sessions and conventions pictorials threads and fasteners tolerances detail drawing dimension ing and electrical drawing introduction to computer aided graphics are covered
Real-world Task:
Create graphical representations of mechanical equipment. Rating:5
Document technical design details. Rating:4
Develop equipment or component configurations. Rating:3
Develop tools to diagnose or assess needs. Rating:2
Plan community programs or activities for the general public. Rating:1

Incorrect Example 1:
description: this course provides an overview of financial accounting principles, including balance sheet preparation, income statements, and journal entries. Students will gain foundational skills in accounting software and transaction recording.
Real-world Task:
Assemble machine tools, parts, or fixtures. Incorrect rating: 4, Correct rating: 1 (Unrelated)
Reason: This task belongs to the mechanical/engineering domain and has no relation to financial accounting.
Verify accuracy of records. Incorrect rating: 2, Correct rating: 4 (Related)
Reason: Verifying records is a directly relevant task in accounting and is likely practiced in journal entry and balance sheet work.
Maintain personnel records. Incorrect rating: 5, Correct rating: 2 (Same Domain)
Reason: While this involves record-keeping, it's specific to HR functions and not the focus of financial accounting.

Incorrect Example 2:
description: this course focuses on digital marketing strategies including SEO, PPC advertising, email campaigns, social media branding, and customer analytics.
Real-world Task:
Develop online advertising strategies. Incorrect rating: 2, Correct rating: 5 (Direct Match)
Reason: This is explicitly covered in the course (SEO, PPC).
Write content for digital platforms. Incorrect rating: 3, Correct rating: 4 (Related)
Reason: This task is a likely component of social media branding and email campaigns.
Plan community outreach events. Incorrect rating: 4, Correct rating: 2 (Same Domain)
Reason: This task is not mentioned in the course and is more aligned with public relations or nonprofit work.

Now, please evaluate this case, think step by step

### Step-by-Step Evaluation Strategy:
1. Read the course description and identify the main goals and content areas.
2. Understand the task and its required skills or knowledge.
3. Ask yourself:
   - Is this task explicitly covered in the course?
   - Is the task relevant, indirectly supported, or completely unrelated to the course?
4. Assign a score (1–5) using the definitions above.
5. Before finalizing your rating, compare your judgment with the provided examples. If your reasoning resembles an incorrect example, reconsider your rating.


Only return the numeric score (1–5).

"""


In [18]:
# --- Step 2: Apply to the data ---
llm_scores = []
for i, row in enumerate(df.itertuples(), 1):
    print(f"Scoring row {i}/{len(df)}: {row.title[:50]}...")  # show part of the title
    score = score_alignment(row.domain, row.title, row.description, row.skill)
    llm_scores.append(score)
df["llm_score"] = llm_scores


# Make sure scores are numeric
df["human_score"] = pd.to_numeric(df["human_score"], errors="coerce")
df["llm_score"] = pd.to_numeric(df["llm_score"], errors="coerce")

# Drop rows with missing values
df_scores = df[["human_score", "llm_score"]].dropna()

# 🔹 Cohen's Kappa (quadratic weights penalize large disagreements more than small ones)
kappa = cohen_kappa_score(df_scores["human_score"], df_scores["llm_score"], weights="quadratic")
print(f"Cohen's Kappa: {kappa:.3f}")

# 🔹 ICC
# Melt into long format; the original row index identifies each rated item
df_long = df_scores.reset_index().melt(
    id_vars="index",                  # this becomes the 'targets' column
    value_vars=["human_score", "llm_score"],
    var_name="rater",
    value_name="score"
)

# Calculate ICC(3): two-way mixed effects, single rater, consistency
icc_table = pg.intraclass_corr(
    data=df_long,
    targets="index",
    raters="rater",
    ratings="score"
)

# Filter ICC(3) result
icc3_value = icc_table[icc_table["Type"] == "ICC3"]["ICC"].values[0]
print(f"ICC(3): {icc3_value:.3f}")

Scoring row 1/10: accounting principles i...
Scoring row 2/10: accounting principles i...
Scoring row 3/10: accounting principles i...
Scoring row 4/10: accounting principles i...
Scoring row 5/10: accounting principles i...
Scoring row 6/10: accounting principles i...
Scoring row 7/10: accounting principles i...
Scoring row 8/10: accounting principles i...
Scoring row 9/10: accounting principles i...
Scoring row 10/10: accounting principles i...
Cohen's Kappa: 0.624
ICC(3): 0.722



Here, we tried two additional prompting strategies: (1) Improved CoT, which adds a step-by-step evaluation strategy to the CoT prompt (Cohen's Kappa: 0.512, ICC(3): 0.564), and (2) CCoT, which adds incorrect examples on top of Improved CoT (Cohen's Kappa: 0.624, ICC(3): 0.722).

Based on these results, more carefully designed prompts brought the LLM's codes closer to the human codes: on this 10-row sample, ICC(3) rose from 0.427 with the zero-shot prompt to 0.722 with CCoT.
