In [None]:
# Make sure to set your OpenAI API key in the .env file
import openai
import os
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Fun Use Case! 

## __Understanding Papers with Prompt Engineering__

### The Five High-Level Questions

1. What problem is the paper trying to solve?
2. Why is the problem interesting?
3. What is the primary contribution?
4. How did they do it?
5. What are the key take-aways?

In [1]:
from openai import OpenAI
client = OpenAI()

def get_response(prompt_question):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "system", "content": "You are a helpful research and programming assistant"},
                  {"role": "user", "content": prompt_question}]
    )
    
    return response.choices[0].message.content

In [3]:
paper_contents = """

1
Towards A Human-in-the-Loop LLM Approach to
Collaborative Discourse Analysis
Clayton Cohn1[0000−0003−0856−9587], Caitlin Snyder1[0000−0002−3341−0490], Justin
Montenegro2, and Gautam Biswas1[0000−0002−2752−3878]
1 Vanderbilt University, Nashville, TN 37240, USA
clayton.a.cohn@vanderbilt.edu2 Martin Luther King, Jr. Academic Magnet High School, Nashville, TN 37203, USA
Abstract. LLMs have demonstrated proficiency in contextualizing their
outputs using human input, often matching or beating human-level per-
formance on a variety of tasks. However, LLMs have not yet been used to
characterize synergistic learning in students’ collaborative discourse. In
this exploratory work, we take a first step towards adopting a human-in-
the-loop prompt engineering approach with GPT-4-Turbo to summarize
and categorize students’ synergistic learning during collaborative dis-
course. Our preliminary findings suggest GPT-4-Turbo may be able to
characterize students’ synergistic learning in a manner comparable to
humans and that our approach warrants further investigation.
Keywords: LLM ·Collaborative Learning ·Human-in-the-Loop ·Dis-
course Analysis ·K12 STEM.
1 Introduction
Computational modeling of scientific processes has been shown to effectively
foster students’ Science, Technology, Engineering, Mathematics, and Comput-
ing (STEM+C) learning [5], but task success necessitates synergistic learning
(i.e., the simultaneous development and application of science and computing
knowledge to address modeling tasks), which can lead to student difficulties
[1]. Research has shown that problem-solving environments promoting synergis-
tic learning in domains such as physics and computing often facilitate a bet-
ter understanding of physics and computing concepts and practices when com-
pared to students taught via a traditional curriculum [5]. Analyzing students’
collaborative discourse offers valuable insights into their application of both
domains’ concepts as they construct computational models [8]. Unfortunately,
manually analyzing students’ discourse to identify their synergistic processes is
time-consuming, and programmatic approaches are needed.
In this paper, we take an exploratory first step towards adopting a human-in-
the-loop LLM approach from previous work called Chain-of-Thought Prompting
+ Active Learning [3] (detailed in Section 3) to characterize the synergistic con-
tent in students’ collaborative discourse. We use a large language model (LLM)
to summarize conversation segments in terms of how physics and computing
arXiv:2405.03677v1  [cs.CL]  6 May 2024
2 C. Cohn et al.
concepts are interwoven to support students’ model building and debugging
tasks. We evaluate our approach by comparing the LLM’s summaries to human-
produced ones (using an expert human evaluator to rank them) and by qualita-
tively analyzing the summaries to discern the LLM’s strengths and weaknesses
alongside a physics and computer science teacher (the Educator) with experi-
ence teaching the C2STEM curriculum (see Section 3.1). Within this framework,
we analyze data from high school students working in pairs to build kinematics
models and answer the following research questions: RQ1) How does the quality
of human- and LLM-generated summaries and synergistic learning characteri-
zations of collaborative student discourse compare?, and RQ2) What are the
LLM’s strengths, and where does it struggle, in summarizing and characterizing
synergistic learning in physics and computing?
As this work is exploratory, due to the small sample size, we aim not to
present generalizable findings but hope that our results will inform subsequent
research as we work towards forging a human-AI partnership by providing teach-
ers with actionable, LLM-generated feedback and recommendations to help them
guide students in their synergistic learning.
2 Background
Roschelle and Teasley [7] define collaboration as “a coordinated, synchronous ac-
tivity that is a result of a continuous attempt to construct and maintain a shared
conception of a problem.” This development of a shared conceptual understand-
ing necessitates multi-faceted collaborative discourse across multiple dimensions:
social (e.g., navigating the social intricacies of forming a consensus [12]), cogni-
tive (e.g., the development of context-specific knowledge [8]), and metacognitive
(e.g., socially shared regulation [4]). Researchers have developed and leveraged
frameworks situated within learning theory to classify and analyze collaborative
problem solving (CPS) both broadly (i.e., across dimensions [6]) and narrowly
(i.e., by focusing on one CPS aspect to gain in-depth insight, e.g., argumentative
knowledge construction [12]). In this paper, we focus on one dimension of CPS
that is particularly important to the context of STEM+C learning: students’
cognitive integration of synergistic domains.
Leveraging CPS frameworks to classify student discourse has traditionally
been done through hand-coding utterances. However, this is time-consuming
and laborious, leading researchers to leverage automated classification methods
such as rule-based approaches, supervised machine learning methods, and (more
recently) LLMs [10]. Utilizing LLMs can help extend previous work on classifying
synergistic learning discourse, which has primarily relied on the frequency counts
of domain-specific concept codes [8,5]. In particular, the use of LLMs can help
address the following difficulties encountered while employing traditional meth-
ods: (1) concept codes are difficult to identify programmatically, as rule-based
approaches like regular expressions (regex) have difficulties with misspellings
and homonyms; (2) the presence or absence of concept codes is not analyzed in
a conversational context; and (3) the presence of cross-domain concept codes is
Towards A HITL LLM Approach to Collaborative Discourse Analysis 3
not necessarily indicative of synergistic learning, as synergistic learning requires
students to form connections between concepts in both domains.
Recent advances in LLM performance capabilities have allowed researchers
to find new and creative ways to apply these powerful models to education using
in-context learning (ICL) [2] (i.e., providing the LLM with labeled instances dur-
ing inference) in lieu of traditional training that requires expensive parameter
updates. One prominent extension of ICL is chain-of-thought reasoning (CoT)
[11], which augments the labeled instances with “reasoning chains” that explain
the rationale behind the correct answer and help guide the LLM towards the cor-
rect solution. Recent work has found success in leveraging CoT towards scoring
and explaining students’ formative assessment responses in the Earth Science
domain [3]. In this work, we investigate this approach as a means to summarize
and characterize synergistic learning in students’ collaborative discourse.
3 Methods
This paper extends the previous work of 1) Snyder et al. on log-segmented dis-
course summarization defined by students’ model building segments extracted
from their activity logs [9], and 2) Cohn et al. on a human-in-the-loop prompt
engineering approach called Chain-of-Thought Prompting + Active Learning [3]
(the Method) for scoring and explaining students’ science formative assessment
responses. The original Method is a three-step process: 1) Response Scoring,
where two human reviewers manually label a sample of students’ formative as-
sessment responses and identify disagreements (i.e., sticking points) the LLM
may similarly struggle with; 2) Prompt Development, which employs few-shot
CoT prompting to address the sticking points and help align the LLM with
the humans’ scoring consensus; and 3) Active Learning, where a knowledgeable
human (e.g., a domain expert, researcher, or instructor) acts as an “oracle” and
identifies the LLM’s reasoning errors on a validation set, then appends additional
few-shot instances that the LLM struggled with to the prompt and uses CoT
reasoning to help correct the LLM’s misconceptions. We illustrate the Method
in Figure 1. For a complete description of the Method, please see [3].
In this work, we combine log-based discourse segmentation [9] and CoT
prompting [3] to generate more contextualized summaries of students’ discourse
segments to study students’ synergistic learning processes by linking their model
construction and debugging activities with their conversations during each probl-
em-solving segment. We provide Supplementary Materials3that include 1) addi-
tional information about the learning environment, 2) method application details
(including our final prompt and few-shot example selection methodology), 3) a
more in depth look at our conversation with the Educator, and 4) a more detailed
analysis of the LLM’s strengths and weaknesses while applying the Method.
3https://github.com/oele-isis-vanderbilt/AIED24_LBR
4 C. Cohn et al.
3.1 STEM+C Learning Environment, Curriculum, and Data
Our work in this paper centers on the C2STEM learning environment [5], where
students learn kinematics by building computational models of the 1- and 2-D
motion of objects. C2STEM combines block-based programming with domain-
specific modeling blocks to support the development and integration of science
and computing knowledge as students create partial or complete models that
simulate behaviors governed by scientific principles. This paper focuses on the
1-D Truck Task, where students use their knowledge of kinematic equations to
model the motion of a truck that starts from rest, accelerates to a speed limit,
cruises at that speed, then decelerates to come to a stop at a stop sign.
Our study, approved by our university Institutional Review Board, included
26 consented high school students (aged 14-15) who completed the C2STEM
kinematics curriculum. Students’ demographic information was not collected as
part of this study (we began collecting it in later studies). Data collection in-
cluded logged actions in the C2STEM environment, saved project files, and video
and audio data (collected using laptop webcams and OBS software). Our data
analysis included 9 dyads (one group had a student who did not consent to
data collection, so we did not analyze that group; and we had technical issues
with audio data from other groups). The dataset includes 9 hours of discourse
transcripts and over 2,000 logged actions collected during one day of the study.
Student discourse was transcribed using Otter.ai and edited for accuracy.
3.2 Approach
We extend the Method, previously used for formative assessment scoring and
feedback, to prompt GPT-4-Turbo to summarize segments of students’ discourse
and identify the Discourse Category (defined momentarily) by answering the fol-
lowing question: “Given a discourse segment, and its environment task context
and actions, is the students’ conversation best characterized as physics-focused
(i.e., the conversation is primarily focused on the physics domain), computing-
focused (i.e., the conversation is primarily focused on the computing domain),
physics-and-computing-synergistic (i.e., students discuss concepts from both do-
mains, interleaving them throughout the conversation, and making connections
Fig. 1. Chain-of-Thought Prompting + Active Learning, identified by the green box,
where each blue diamond is a step in the Method. Yellow boxes represent the process’s
application to the classroom detailed in prior work [3].
Towards A HITL LLM Approach to Collaborative Discourse Analysis 5
between them), or physics-and-computing-separate (i.e., students discuss both
domains but do so separately without interleaving)?” We use the recently re-
leased GPT-4-Turbo LLM (gpt-4-0125-preview) because it provides an extended
context window (128,000 tokens).
We selected 10 training instances and 12 testing instances (10 additional
segments were used as a validation set to perform Active Learning) prior to
Response Scoring, using stratified sampling to approximate a uniform distribu-
tion across Discourse Categories for both the train and test sets. Note that the
student discourse was segmented based on which element of the model the stu-
dents were working on (identified automatically via log data). During Response
Scoring, the first two authors of this paper (Reviewers R1 and R2, respectively)
independently evaluated the training set segments, classifying each segment as
belonging to one of the four Discourse Categories. For each segment the Re-
viewers disagreed on, the reason for disagreement was noted as a sticking point,
and the segment was discussed until a consensus was reached on the specific
Discourse Category for that segment. R1 and R2 initially struggled to agree on
segments’ Discourse Categories (Cohen’s k = 0.315). This is because segments
often contained concepts from both domains that may or may not have been
interwoven, so it was not always clear which Discourse Category a segment be-
longed to. Because of this, the Reviewers ultimately opted to label all segments
via consensus coding.
During Prompt Development, we provided the LLM with explicit task instruc-
tions, curricular and environment context, and general guidelines (e.g., instruct-
ing the LLM to cite evidence directly from the students’ discourse to support
its summary decisions and Discourse Category choice). We supplemented the
prompt with extensive contextual information not found in previous work [9],
including the Discourse Categories, C2STEM variables and their values, physics
and computing concepts and their definitions, and students’ actions in the learn-
ing environment (derived from environment logs). Four labeled instances were
initially appended to the prompt as few-shot examples (one per Discourse Cate-
gory). Active Learning was performed for a total of two rounds over 10 validation
set instances, at the end of which one additional few-shot instance was added.
Before testing, R1 wrote summaries (and labeled Discourse Categories) for
the 12 test instances. R2 then compared the human-generated summaries to two
LLMs’ summaries: GPT-4-Turbo and GPT-4. We compare GPT-4 to GPT-4-
Turbo to see which LLM is most promising for use in future work. To evaluate
RQ1, R2 used “ranked choice” to rank the three summaries from best to worst
for each test set instance without knowledge of whether the summaries were gen-
erated by a human, GPT-4-Turbo, or GPT-4 (the Competitors). Three rankings
were used for the scoring: (1) Wins (the number of times each Competitor was
ranked higher than another Competitor across all instances, i.e., the best Com-
petitor for an individual segment receives two “wins” for outranking the other
two Competitors for that segment); (2) Best (the number of instances each Com-
petitor was selected as the best choice); and (3) Worst (the number of instances
each Competitor was selected as the worst choice). To answer RQ1, we used
"""

get_response(f"What is the main idea of the following paper: '''{paper_contents}'''")

'The paper "Towards A Human-in-the-Loop LLM Approach to Collaborative Discourse Analysis" primarily explores the integration of a human-in-the-loop (HITL) approach utilizing a large language model (LLM) for the analysis of collaborative discourse, specifically within the context of students engaging in synergistic learning processes in STEM+C (Science, Technology, Engineering, Mathematics, and Computing) education.\n\nThe authors present an exploratory study using the GPT-4-Turbo model to summarize and categorize the synergistic learning observed during students\' collaborative discourse, particularly focusing on how they integrate physics and computing concepts in their dialogues during problem-solving tasks. Such an approach seeks to address the labor-intensive nature of manually analyzing student interactions, aiming for a more automated, yet insightful, analysis that could ideally match human evaluators in effectiveness.\n\nThe study includes the design and implementation of a meth

In [6]:
def prompt_template_paper_questions(prompt: str, paper_contents: str):
    return f"Consider this paper: '''{paper_contents}'''. \n {prompt}"

prompt1 = "What problem is the paper trying to solve?"
prompt2 = "Why is the problem interesting?"
prompt3 = "What is the primary contribution?"
prompt4 = "How did they do it?"
prompt5 = "What are the key take-aways?"

prompt_high_level_question_list = [prompt_template_paper_questions(p, paper_contents) for p in [prompt1, prompt2, prompt3, prompt4, prompt5]]
prompt_high_level_question_list

["Consider this paper: '''\n\n1\nTowards A Human-in-the-Loop LLM Approach to\nCollaborative Discourse Analysis\nClayton Cohn1[0000−0003−0856−9587], Caitlin Snyder1[0000−0002−3341−0490], Justin\nMontenegro2, and Gautam Biswas1[0000−0002−2752−3878]\n1 Vanderbilt University, Nashville, TN 37240, USA\nclayton.a.cohn@vanderbilt.edu2 Martin Luther King, Jr. Academic Magnet High School, Nashville, TN 37203, USA\nAbstract. LLMs have demonstrated proficiency in contextualizing their\noutputs using human input, often matching or beating human-level per-\nformance on a variety of tasks. However, LLMs have not yet been used to\ncharacterize synergistic learning in students’ collaborative discourse. In\nthis exploratory work, we take a first step towards adopting a human-in-\nthe-loop prompt engineering approach with GPT-4-Turbo to summarize\nand categorize students’ synergistic learning during collaborative dis-\ncourse. Our preliminary findings suggest GPT-4-Turbo may be able to\ncharacterize stu

In [7]:
output_high_level_questions = []
for prompt in prompt_high_level_question_list:
    output_high_level_questions.append(get_response(prompt))

In [8]:
prompt_questions = [prompt1, prompt2, prompt3, prompt4, prompt5]
for q,o in zip(prompt_questions,output_high_level_questions):
    print(q)
    print(o)
    print("****")

What problem is the paper trying to solve?
The paper is addressing the challenge of analyzing students' collaborative discourse to better understand and characterize their synergistic learning processes in STEM+C (Science, Technology, Engineering, Mathematics, and Computing) environments. Synergistic learning, which involves the simultaneous development and application of knowledge in science and computing to solve modeling tasks, can be complex and challenging for students. The traditional approach of manually analyzing collaborative discourse to identify and characterize these synergistic learning processes is time-consuming and labor-intensive. The paper explores the use of a human-in-the-loop large language model (LLM) to automatically summarize and categorize the content of students' discussions during collaborative tasks, aiming to make this analysis more efficient and effective.
****
Why is the problem interesting?
The problem addressed in the paper is interesting for several re

In [9]:
def prompt_template_paper_questions(prompt: str, paper_contents: str):
    # Each question already ends with "?", so no extra period is added after {prompt}.
    return f"Consider this paper: '''{paper_contents}'''. \n {prompt} At the end of your answer, include at least one citation from the original paper to validate your response."

In [10]:
prompt_questions = [prompt1, prompt2, prompt3, prompt4, prompt5]

prompt_high_level_question_list = [prompt_template_paper_questions(p, paper_contents) for p in prompt_questions]

In [11]:
output_high_level_questions = []
for prompt in prompt_high_level_question_list:
    output_high_level_questions.append(get_response(prompt))

In [13]:
from IPython.display import Markdown

Markdown(prompt1 + "\n\n" + output_high_level_questions[0])

What problem is the paper trying to solve?

The paper addresses the challenge of analyzing students' collaborative discourse in STEM+C (Science, Technology, Engineering, Mathematics, and Computing) learning environments to identify and characterize synergistic learning—which involves the integrated use of physics and computing knowledge. The main problem it seeks to solve is the time-consuming and labor-intensive nature of manually analyzing such discourse to understand how students interweave concepts from different domains during problem-solving activities. The paper proposes leveraging a Large Language Model (LLM) with a human-in-the-loop approach to automate and potentially enhance the efficiency and effectiveness of this discourse analysis process, aiming to provide useful insights and feedback that can aid in educational settings.

The significance of addressing this problem is underscored by the benefits of synergistic learning in improving students' understanding of complex concepts when compared to traditional learning methods, as students engaging in synergistic learning often achieve a better grasp of the interconnected concepts of physics and computing. Therefore, developing automated tools to analyze such learning processes can provide crucial support for educational research and instructional design.

A direct citation that supports this problem description is: "Research has shown that problem-solving environments promoting synergistic learning in domains such as physics and computing often facilitate a better understanding of physics and computing concepts and practices when compared to students taught via a traditional curriculum [5]. Analyzing students’ collaborative discourse offers valuable insights into their application of both domains’ concepts as they construct computational models [8]. Unfortunately, manually analyzing students’ discourse to identify their synergistic processes is time-consuming, and programmatic approaches are needed" (Cohn et al.)
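A model can misquote or invent a citation, so it is worth checking that any quoted passage really appears in the paper. The sketch below is one possible check, not part of the original notebook: `normalize_text`, `extract_quotes`, and `quote_in_paper` are hypothetical helper names. It assumes the paper text came from a PDF, so words may be hyphenated across line breaks, and it therefore ignores hyphens, case, and whitespace when comparing.

```python
import re

def normalize_text(text: str) -> str:
    """Make PDF-extracted text and model output comparable (hypothetical helper)."""
    # Use straight quotes everywhere so curly/straight differences do not matter.
    for curly, straight in (("’", "'"), ("‘", "'"), ("“", '"'), ("”", '"')):
        text = text.replace(curly, straight)
    # Rejoin words split across line breaks, e.g. "synergis-\ntic" -> "synergistic".
    text = text.replace("-\n", "")
    # Drop remaining hyphens so "human-in-\nthe-loop" and "human-in-the-loop" compare equal.
    text = text.replace("-", "")
    # Collapse all runs of whitespace (including newlines) to single spaces.
    return re.sub(r"\s+", " ", text).strip().lower()

def extract_quotes(answer: str, min_chars: int = 40) -> list:
    """Return passages wrapped in straight double quotes that are at least min_chars long."""
    return [q for q in re.findall(r'"([^"]+)"', answer) if len(q) >= min_chars]

def quote_in_paper(quote: str, paper: str) -> bool:
    """True if the quote appears in the paper after normalization."""
    return normalize_text(quote) in normalize_text(paper)

# Small self-contained demonstration with a line-wrapped sample.
sample_paper = "environments promoting synergis-\ntic learning use a human-in-\nthe-loop approach"
print(quote_in_paper("promoting synergistic learning", sample_paper))
print(quote_in_paper("promoting traditional learning", sample_paper))
```

In the notebook, the same check could be run as `[quote_in_paper(q, paper_contents) for q in extract_quotes(output_high_level_questions[0])]`. The comparison is deliberately loose, so a match means the wording is present in the paper, not that the quote supports the claim it is attached to.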

In this notebook, we apply some of the techniques and ideas we've learned. Rather than running a rigid experiment, we play around with these ideas, using LLMs in different ways to better understand the contents of scientific papers.

# Potential Extensions

- The LLM outlines, as bullet points, the background knowledge required to understand the paper.
- The LLM quizzes the learner about the contents of the paper.
- The LLM extracts the citations in the paper.
- The LLM extracts quotes from the paper to validate statements.
- The LLM summarizes the paper in different formats.
- The LLM provides an opportunity for discussion about the contents of the paper.
- The LLM gives feedback on the learner's understanding of the paper, based on sections of the paper.
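A few of these extensions can reuse the prompt template pattern from earlier in the notebook. The following is a minimal sketch under stated assumptions: `EXTENSION_TASKS` and `build_extension_prompt` are hypothetical names, and the instruction wording is illustrative, not tested against any model.

```python
# Hypothetical task instructions for three of the extensions listed above.
EXTENSION_TASKS = {
    "background": "List, as bullet points, the background knowledge a reader needs to understand this paper.",
    "quiz": "Write three short questions that test a learner's understanding of this paper, then give the answers in a separate section.",
    "citations": "List every work cited in the text of this paper, using the bracketed reference numbers exactly as they appear.",
}

def build_extension_prompt(task: str, paper_contents: str) -> str:
    """Wrap the paper and a task instruction in the same template used for the five questions."""
    if task not in EXTENSION_TASKS:
        raise KeyError(f"Unknown task {task!r}; expected one of {sorted(EXTENSION_TASKS)}")
    return f"Consider this paper: '''{paper_contents}'''. \n {EXTENSION_TASKS[task]}"

# Example with a placeholder string standing in for the real paper text.
demo_prompt = build_extension_prompt("quiz", "PAPER TEXT GOES HERE")
print(demo_prompt)
```

Each prompt built this way can be passed to `get_response` exactly like the five high-level questions, for example `get_response(build_extension_prompt("background", paper_contents))`.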