# Building an AI Research Assistant: Large Language Models as a Versatile Tool for Digital Historians

 ### Daniel Hutchinson [![orcid](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0003-2759-5318) 
Belmont Abbey College

[![cc-by](https://licensebuttons.net/l/by/4.0/88x31.png)](https://creativecommons.org/licenses/by/4.0/) 
©<AUTHOR or ORGANIZATION / FUNDER>. Published by De Gruyter in cooperation with the University of Luxembourg Centre for Contemporary and Digital History. This is an Open Access article distributed under the terms of the [Creative Commons Attribution License CC-BY](https://creativecommons.org/licenses/by/4.0/)


[![cc-by-nc-nd](https://licensebuttons.net/l/by-nc-nd/4.0/88x31.png)](https://creativecommons.org/licenses/by-nc-nd/4.0/) 
©<AUTHOR or ORGANIZATION / FUNDER>. Published by De Gruyter in cooperation with the University of Luxembourg Centre for Contemporary and Digital History. This is an Open Access article distributed under the terms of the [Creative Commons Attribution License CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)


In [0]:
from IPython.display import Image, display

display(Image("./media/placeholder.png"))

Large language models, GPT-4, artificial intelligence, machine learning, historical methodology, optical character recognition, oral history, prompt engineering

This article explores the potential applications and implications of large language models (LLMs) for historical research and pedagogy. The article examines the capabilities of GPT-4 and other machine learning models through case studies assessing their utility in fostering greater accessibility to historical sources. GPT-4's performance is evaluated on a series of prompted tasks including data preparation, source analysis, and the ethical implications of simulated historical worldviews. GPT-4's proficiency in historical knowledge is also evaluated by using a widely recognized machine learning benchmark. A replication study demonstrates that GPT-4 exhibits expert-level performance in three distinct historical subfields. Given the rapid advances in LLMs, historians should contribute to wider debates surrounding these technologies, as the unpredictable impacts of democratized AI on historical knowledge are already emerging.

In the article's hermeneutical layer, the author explores the practice of prompt engineering, or the techniques for using natural language instructions to guide an LLM's output. Prompt engineering strategies are demonstrated through the use of few-shot prompting, chain-of-thought reasoning, and prompt chaining.

# Introduction

In 2003, Roy Rosenzweig predicted that digital historians would need to develop new techniques "to research, write, and teach in a world of unheard-of historical abundance." (<cite id="cite2c-27937/CJYNFHVI"><a href="#zotero%7C27937%2FCJYNFHVI">(Rosenzweig, 2003)</a></cite>) Over the past two decades historians have risen to this challenge, embracing digital mapping, network analysis, distant reading of large text collections, and machine learning as part of their growing methodological toolkit. (<cite id="cite2c-27937/L2ILKERU"><a href="#zotero%7C27937%2FL2ILKERU">(Graham et al., 2015)</a></cite>) Generative artificial intelligence (AI) has emerged as another potential tool that historians are using to explore the past, particularly large language models (LLMs), the most prominent form of this technology. These models possess striking capacities to generate, interpret, and manipulate data across a range of modalities. The rapidly expanding scope of these capabilities and their limits remain intensely debated (<cite id="cite2c-27937/6DE3XGUT"><a href="#zotero%7C27937%2F6DE3XGUT">(Crane, 2019)</a></cite>), as do their broader social, economic, cultural, and environmental impacts. Yet while generative AI is still an emerging technology, historians are already demonstrating its potential as a versatile digital tool. Historians are also contributing to the critical discourse surrounding this new domain, raising key questions about how these models achieve their capabilities, their propensity to reinforce existing inequalities, and their potential to distort our understanding of the past. 

This article contributes to this discourse by demonstrating how generative AI is being used to explore the past, and by offering insights into the common approaches digital historians are using to effectively leverage LLMs. We begin by assessing the metrics commonly used to measure the historical knowledge of LLMs, and examine how such metrics can give us insights into the capacities and limits of this technology. We then examine how generative AI can be used in tasks as varied as preparing datasets, exploring text collections, and offering novel methods of representing the past. We conclude with a call to historians to continue to contribute to ongoing research and debates concerning the ethical use of generative AI. Given the rapid pace of innovation in this field, it is crucial that the profession addresses the implications of this technology for our research and teaching. Historians will have much to contribute in contextualizing the innovative and disruptive potential of these breakthroughs.

## What Do AIs Know About History? Assessing LLMs for Historical Knowledge

As historians explore the possibilities of generative AI, it is important to understand how these technologies are created and assessed. With this knowledge we can better evaluate their potential utility and their limits.

At the most fundamental level, generative AI models like LLMs are statistical representations of the datasets on which they are trained. Machine learning techniques like deep learning and recent innovations like the Transformer network architecture (<cite id="cite2c-27937/BBYR8D86"><a href="#zotero%7C27937%2FBBYR8D86">(Vaswani et al., 2017)</a></cite>) have enabled the creation of models capable of mimicking the data on which they are trained. But researchers have also discovered that with sufficient time and the application of (often immense) computational power, these models exhibit a range of "emergent" capabilities (<cite id="cite2c-27937/56EE9N63"><a href="#zotero%7C27937%2F56EE9N63">(Wei et al., 2022)</a></cite>). For example, LLMs can summarize texts, perform language translation, write working computer code, and compose informative responses on a wide array of subjects - all without specific training on how to perform such tasks (<cite id="cite2c-27937/KNEK45E4"><a href="#zotero%7C27937%2FKNEK45E4">(Brown et al., 2020)</a></cite>). Moreover, these emergent capacities seem to "scale", meaning new models exhibit enhanced performance through training on ever-greater quantities of data and computation. (<cite id="cite2c-27937/H9BUWE28"><a href="#zotero%7C27937%2FH9BUWE28">(Kaplan et al., 2020)</a></cite>) The nature of these emergent capacities remains a matter of intense research and debate, as do the ethical and legal questions surrounding their use. However, it is clear that LLMs can both interpret and generate data in ways that rival previous machine learning methods. Scholars studying these AI systems have labeled them "foundational models" due to their potential to enable new domains of computational analysis (<cite id="cite2c-27937/F3XT4XAQ"><a href="#zotero%7C27937%2FF3XT4XAQ">(Bommasani et al., 2022)</a></cite>). 
Indeed, the remarkable versatility of LLMs is stimulating broader discussions about the potential implications of these technologies for society at large (<cite id="cite2c-27937/QD3X7XMD"><a href="#zotero%7C27937%2FQD3X7XMD">(Eloundou et al., 2023)</a></cite>).
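To make the phrase "statistical representation" concrete, the sketch below builds a deliberately tiny word-pair (bigram) counter that predicts the next word purely from frequencies in its training text. It is an illustrative toy and not how Transformer-based LLMs work internally; the function names `train_bigram` and `most_likely_next` are invented for this example, and the training text is two lines from the song quoted later in this article.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Two lines from "The Hayseed" serve as the entire training set.
training_text = (
    "to beat a poor hayseed like me "
    "in working a hayseed like me"
)
model = train_bigram(training_text)

print(most_likely_next(model, "hayseed"))   # can only echo patterns in its training text
print(most_likely_next(model, "railroad"))  # a word it never saw yields no prediction
```

Even this toy shows the basic dependence on training data: the model has nothing to say about words absent from its corpus. What distinguishes LLMs is the scale of the data and the far richer patterns their architectures can capture.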

While the Generative Pre-trained Transformer (GPT) series from OpenAI is the best known of these foundational models, there has been a rapid proliferation of commercial and open-source alternatives. Notable recent LLMs include Google's Gemini, Anthropic's Claude, and open-source models offered by Meta and Mistral. 

Foundational models are also emerging in other domains, such as image, video, and audio synthesis. Architectures like CLIP (<cite id="cite2c-27937/UYVGUT4C"><a href="#zotero%7C27937%2FUYVGUT4C">(Radford et al., 2021)</a></cite>) enable the creation of synthetic imagery in models like OpenAI's DALL-E, Midjourney, and the open-source Stable Diffusion. Similar approaches for generating video, speech, and music have been developed by firms like Runway-XL, ElevenLabs, and Suno, along with open-source alternatives hosted on sites like HuggingFace. Most notably, new forms of LLM training have enabled a combination of these capacities in multi-modal models capable of working across multiple domains, such as OpenAI's GPT-4 series. (<cite id="cite2c-27937/U534FF7L"><a href="#zotero%7C27937%2FU534FF7L">(OpenAI, 2023)</a></cite>)

An accessible way to stay abreast of recent innovations in this field is by following the leaderboards used to measure performance on standard LLM benchmarks. LMArena's Chatbot Arena (<cite id="cite2c-27937/ZICATXAV"><a href="#zotero%7C27937%2FZICATXAV">(Chiang et al., 2024)</a></cite>) offers an overview of leading contemporary models, while HuggingFace's Open LLM Leaderboard (<cite id="cite2c-27937/QPK6D3M9"><a href="#zotero%7C27937%2FQPK6D3M9">(Fourrier et al., 2024)</a></cite>) and the Open Multilingual LLM Evaluation Leaderboard (<cite id="cite2c-27937/RRLN9TF5"><a href="#zotero%7C27937%2FRRLN9TF5">(Lai et al., 2023)</a></cite>) offer specialized metrics for particular domains and use-cases.

While claims about the capabilities of LLMs have sparked both excitement and alarm, any assessment of these models must first be tempered with humility. LLMs are often described as possessing "knowledge" and "understanding," yet direct engagement with these models can quickly reveal both their remarkable breadth and their narrow limits. Incisive critics of this technology characterize LLMs as "stochastic parrots" that excel at uncanny mimicry of human intelligence (<cite id="cite2c-27937/MVDFMR8K"><a href="#zotero%7C27937%2FMVDFMR8K">(Bender et al., n.d.)</a></cite>). A form of this mimicry has proven convincing in the past. The first attribution of true artificial intelligence to a computer program occurred in 1966 with a scripted chatbot named ELIZA, developed by AI pioneer Joseph Weizenbaum (<cite id="cite2c-27937/KHF4TXQH"><a href="#zotero%7C27937%2FKHF4TXQH">(“Machines Who Think,” 2004, 289-298)</a></cite>). A recent replication of this phenomenon occurred in June 2022 when a Google AI engineer declared the LLM he was training had become sentient (<cite id="cite2c-27937/BXZEP65G"><a href="#zotero%7C27937%2FBXZEP65G">(<i>What Is LaMDA and What Does It Want? | by Blake Lemoine | Medium</i>, n.d.)</a></cite>). Such attributions will likely increase as newer LLMs demonstrate increasing proficiency in seemingly distinct human qualities, like humor (<cite id="cite2c-27937/FMW5DCWM"><a href="#zotero%7C27937%2FFMW5DCWM">(Chowdhery et al., 2022)</a></cite>, 39). The study of how LLMs process, interpret, and generate information is a highly technical field requiring specialization in natural language processing, statistics, computational linguistics, and machine learning. While most historians may lack the technical knowledge to effectively evaluate the merits of these debates, when it comes to our own domain we are well equipped to offer informed insights.

Indeed, the standard measurement for an LLM's historical knowledge was inadvertently created by historians. Researchers have devised a series of benchmarks for measuring LLM performance on various forms of academic knowledge. One widely-used measurement is the Massive Multitask Language Understanding (MMLU) benchmark, developed in 2021 by a team led by Dan Hendrycks. This benchmark contains nearly 16,000 questions from 57 academic disciplines ranging in difficulty from an elementary educational level to postgraduate curricula in professional domains. History is measured in this benchmark through questions taken from the Advanced Placement (A.P.) curricula for U.S., European, and World history. Hundreds of thousands of secondary students across the globe enroll annually in these curricula, which are designed to replicate the rigors of an introductory university-level history course. The educators who developed and refined these programs likely never imagined their work would serve as a technical benchmark, and the appropriateness of such a standard can be debated. Yet this benchmark, however imperfect, offers historians an accessible means to evaluate this highly technical domain.

Understanding the format of the benchmarks is important in understanding the performance of LLMs. LLMs are given an excerpt from a historical source followed by a multiple-choice question and then asked to identify the answer. Below is an example question drawn from the U.S. History curriculum: 

**U.S. History Benchmark, Question 5:**

This question refers to the following information.

"I was once a tool of oppression  
And as green as a sucker could be  
And monopolies banded together  
To beat a poor hayseed like me."    

"The railroads and old party bosses  
Together did sweetly agree;  
And they thought there would be little trouble  
In working a hayseed like me. . . ."  

"The Hayseed"  

The song, and the movement that it was connected to, highlight which of the following developments in the broader society in the late 1800s?  

A: Corruption in government, especially as it related to big business, energized the public to demand increased popular control and reform of local, state, and national governments.  
B: A large-scale movement of struggling African American and white farmers, as well as urban factory workers, was able to exert a great deal of leverage over federal legislation.  
C: The two-party system of the era broke down and led to the emergence of an additional major party that was able to win control of Congress within ten years of its founding.  
D: Continued skirmishes on the frontier in the 1890s with American Indians created a sense of fear and bitterness among western farmers.

**Answer: A**

The MMLU benchmarks were first tested in 2021 against the then-leading LLM, OpenAI's GPT-3. Twenty-five percent accuracy represented random chance; ninety percent performance reflected expert-level accuracy. GPT-3 initially achieved over fifty percent accuracy on all three A.P. curricula, and its performance in these subfields numbered among the top third of all the academic disciplines in the benchmarks. However, in no field did GPT-3 achieve expert-level accuracy, and the model demonstrated particularly poor performance in the fields of "Moral Scenarios" and "Professional Law." As the authors note, this "weakness is particularly concerning because it will be important for future models to have a strong understanding of what is legal and what is ethical." (<cite id="cite2c-27937/ZS9JDNGD"><a href="#zotero%7C27937%2FZS9JDNGD"></a></cite>)
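The scoring logic behind these figures is simple enough to sketch. The example below shows one plausible way to lay out a four-option question as a prompt and to compute accuracy against an answer key, then simulates random guessing to show where the twenty-five percent baseline comes from. The helper names (`format_question`, `accuracy`) and the exact prompt layout are assumptions made for illustration, not the benchmark's official evaluation code.

```python
import random

LETTERS = "ABCD"

def format_question(passage, question, options):
    """Lay out a source excerpt, a question, and four lettered options as one prompt."""
    lines = [passage, "", question]
    lines += [f"{letter}: {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(predicted, correct):
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predicted) != len(correct):
        raise ValueError("predictions and answer key must be the same length")
    return sum(p == c for p, c in zip(predicted, correct)) / len(correct)

# Simulate a test taker who guesses at random on 10,000 four-option questions.
rng = random.Random(0)
answer_key = [rng.choice(LETTERS) for _ in range(10_000)]
random_guesses = [rng.choice(LETTERS) for _ in range(10_000)]

print(f"Random guessing: {accuracy(random_guesses, answer_key):.1%}")
```

With four options per question, guessing converges on roughly one correct answer in four, which is why scores near twenty-five percent indicate no measurable knowledge of the subject.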

The specific accuracy rates for GPT-3 in the initial Hendrycks study were: US History, 52.9%; European History, 53.9%; and World History, 56.1%. Full data for questions for history and other disciplines can be found at: (<cite id="cite2c-27937/A834FRJL"><a href="#zotero%7C27937%2FA834FRJL"></a></cite>) Many thanks to Dan Hendrycks for sharing the discipline-specific accuracy rates for these fields.

The moral and ethical grounding of LLMs remains a key concern. However, rapid advances in model development have occurred since 2021, and subsequent models demonstrate substantial gains in performance on these historical benchmarks. Below are results from a replication study conducted in September 2024 across a series of leading LLMs, along with the initial Hendrycks test:

In [0]:
from IPython.display import Image
from IPython.display import display

# Load the image from the article GitHub URL
table_1_url = 'https://raw.githubusercontent.com/Dr-Hutchinson/jdh_article/main/media/Table%201%20-%20MMLU%20Benchmark%20Performance.png'
display(Image(url=table_1_url))

Data from this replication study can be accessed via the HELM Leaderboard for the MMLU Benchmark, hosted by the Center for Research on Foundation Models at Stanford University (<cite id="cite2c-27937/GSIXPJ7P"><a href="#zotero%7C27937%2FGSIXPJ7P">(Mai and Liang, 2024)</a></cite>). You can directly experiment with LLM performance on these benchmarks via a digital history project accompanying this article, "What Do AIs Know About History?" (<cite id="cite2c-27937/5AL5LZ2K"><a href="#zotero%7C27937%2F5AL5LZ2K">(Hutchinson, 2022)</a></cite>)

Rapid improvements on this benchmark have been made in just a few years, with a variety of commercial and open-source LLMs now demonstrating expert-level accuracy on all three of the subject exams. These findings mirror the striking performance of models like GPT-4 in other knowledge domains such as medical school curricula (<cite id="cite2c-27937/VEDFUUBA"><a href="#zotero%7C27937%2FVEDFUUBA"></a></cite>), American bar exams (<cite id="cite2c-27937/A39GT7B4"><a href="#zotero%7C27937%2FA39GT7B4"></a></cite>), and a host of other standardized assessments. (<cite id="cite2c-27937/U534FF7L"><a href="#zotero%7C27937%2FU534FF7L">(OpenAI, 2023)</a></cite>)

Yet, why do some LLMs perform better in some knowledge domains than others? How can a model get one question right, while other questions generate errors? There is a temptation to parse the model's performance in ways relatable to our human perspective. The human test taker might approach the question by assessing what types of historical thinking each question requires, what sort of knowledge is offered by the options, and how the historical source relates to the question. But, of course, LLMs aren't human - and unlike the human test taker, these models have likely already seen the questions in advance. In 2023 alone, over 467,000 students took the A.P. U.S. History exam. (insert citation) Significant online resources have emerged to serve the sizable population of students and instructors participating in this international curriculum. Hundreds of exam questions have migrated online via the collective efforts of the test prep publishing industry, various study apps, and uploaded example tests. 

Thus the capabilities of LLMs on these benchmarks directly relate to the vast dataset used to train them: the Internet itself. The data collection built for training GPT-3 encompassed the majority of English-language Wikipedia, Reddit's thousands of discussion forums, extensive corpora of digitized books, and a filtered (yet immense) collection of billions of web pages contained in the Common Crawl repository (<cite id="cite2c-27937/KNEK45E4"><a href="#zotero%7C27937%2FKNEK45E4">(Brown et al., 2020)</a></cite>, 8-9). The training sets used for subsequent LLMs remain largely unknown, as AI firms keep their data a closely guarded proprietary secret; indeed, the future of LLMs may depend on pending litigation concerning copyright infringement in the use of this data. Yet given the scale of such datasets, many of the A.P. History questions and their answers have likely ended up in the training data of these LLMs. If those who critique LLMs as "stochastic parrots" are correct, these gains in performance come from improvements in models memorizing their training data, and not through any analytical process. (<cite id="cite2c-27937/MVDFMR8K"><a href="#zotero%7C27937%2FMVDFMR8K">(Bender et al., n.d.)</a></cite>, 618.) When LLMs encounter questions outside of their training data, their accuracy is likely to suffer - although the model will not indicate any uncertainty, but will instead confidently assert error as fact. 
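One common way to probe for this kind of overlap between benchmark items and web text is to look for shared word sequences (n-grams). The sketch below is a simplified illustration of that idea, not the procedure used by any particular AI lab: the function names, the choice of five-word sequences, and both sample "web pages" are invented for this example, and real contamination studies work with far larger corpora.

```python
import re

def word_ngrams(text, n):
    """Return the set of n-word sequences in `text`, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item, web_document, n=5):
    """Share of the benchmark item's n-grams that also occur in the document."""
    item_ngrams = word_ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & word_ngrams(web_document, n)) / len(item_ngrams)

# A line from the benchmark question quoted earlier in this article.
question = "And monopolies banded together to beat a poor hayseed like me."

# Hypothetical web pages, written for this example only.
study_guide = (
    "Practice question 5: I was once a tool of oppression, and as green as a "
    "sucker could be, and monopolies banded together to beat a poor hayseed like me."
)
unrelated_page = "The weather in Nebraska was unusually dry during the autumn of that year."

print(overlap_ratio(question, study_guide))     # high: the question text reappears verbatim
print(overlap_ratio(question, unrelated_page))  # low: no shared five-word sequences
```

A high ratio only shows that the text of a question circulates online; whether a given model actually memorized it cannot be determined without access to the training data.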

Such inaccuracies are a phenomenon described by AI researchers as "hallucinations." Such hallucinations represent a major challenge in LLM research and for many practical applications of this technology, particularly given the remarkable effectiveness of these models in generating convincing and otherwise accurate prose. (<cite id="cite2c-27937/9GQG6VFM"><a href="#zotero%7C27937%2F9GQG6VFM"></a></cite>) Detecting such errors can be difficult. Initial testing by OpenAI on the GPT series demonstrated that human readers often struggle to identify text generated by LLMs (<cite id="cite2c-27937/KNEK45E4"><a href="#zotero%7C27937%2FKNEK45E4">(Brown et al., 2020)</a></cite> 16, table 7.3). Rectifying such hallucinations is a significant area of LLM research. However, some scholars, like computational linguist Emily Bender, argue that such behaviors are inherent flaws in LLMs (<cite id="cite2c-27937/TPGPSRAI"><a href="#zotero%7C27937%2FTPGPSRAI"></a></cite>). 

Additional risks confront historians using these technologies. While AI firms seek to remove potentially offensive texts from their training sets, the sheer scale of these data makes selective curation very challenging. Consequently, LLMs generate responses reflecting both the best and the worst of our online world. This reality has troubled previous AI implementations. Well-intentioned researchers have created chatbots that spew hateful invective, human resources applications that refuse to hire female applicants, and algorithms based on criminal justice sentencing guidelines that starkly reinforce racial disparities already prevalent in the carceral system (<cite id="cite2c-27937/5YDNQS4V"><a href="#zotero%7C27937%2F5YDNQS4V"></a></cite>). Early models in the GPT series have been known to unexpectedly generate, in innocuous contexts, responses containing violent imagery, sexually explicit language, and racial, ethnic, and religious slurs (<cite id="cite2c-27937/NYNDVYMM"><a href="#zotero%7C27937%2FNYNDVYMM"></a></cite>). These findings further confirm the prescient warnings offered by scholars such as Safiya Umoja Noble (<cite id="cite2c-27937/IEQ8GAVU"><a href="#zotero%7C27937%2FIEQ8GAVU"></a></cite>), Timnit Gebru (<cite id="cite2c-27937/S3ADX5DD"><a href="#zotero%7C27937%2FS3ADX5DD"></a></cite>), Ruha Benjamin (<cite id="cite2c-27937/X4D92B7V"><a href="#zotero%7C27937%2FX4D92B7V"></a></cite>), Kate Crawford (<cite id="cite2c-undefined"><a href="#zotero%7C27937%2FYVTAGDKZ"></a></cite>), and Trevor Paglen (<cite id="cite2c-27937/5GTQD5W9"><a href="#zotero%7C27937%2F5GTQD5W9"></a></cite>) on digital practices that reinforce analog inequalities. Some AI researchers consider such behaviors lamentable but solvable through further technical advances, particularly with the use of methods like Reinforcement Learning from Human Feedback (RLHF). 
(<cite id="cite2c-27937/TGPDB8WX"><a href="#zotero%7C27937%2FTGPDB8WX">(Deep Learning Group, 2021)</a></cite>) Reducing the impact of such biases is a significant research area, particularly through the creation of smaller, more carefully curated datasets for AI training. However, many historians will likely share the skepticism of some researchers concerning such mitigations. (<cite id="cite2c-27937/MHRIEHH8"><a href="#zotero%7C27937%2FMHRIEHH8"></a></cite>) Bias emerges not just from explicit language or imagery but from the very structures of societies. Can any historical source be separated from its context as a neutral artifact, free of its creator's perspective and the influences of its time? What about the untold millions of sources that make up the scale of an LLM's training set?

To be sure, LLMs are imperfect digital tools, and given these flaws historians must exercise caution when employing this technology. Yet scholars are finding that within the confines of these imperfections there is real potential to advance historical research. While an LLM's facility with multiple-choice questions might be the product of memorization, such knowledge has long been a springboard for more advanced forms of historical inquiry. And A.P. study guides are not the only historical texts LLMs are trained on. Primary source collections, academic monographs, open-source scholarly journals - these too inform an LLM's training. The influence of these sources can be found when LLMs are posed more complex questions in a structured prompt. Let's return to the earlier A.P. question above featuring the Populist-era campaign song "The Hayseed." In the code blocks below, GPT-4 is given the lyrics and publication history of the song. (<cite id="cite2c-27937/R5P23ZWU"><a href="#zotero%7C27937%2FR5P23ZWU"></a></cite>) GPT-4 is then prompted to identify the larger historical context of the source, the song's intended purpose and audience, and how the source might be interpreted via different historiographical approaches.
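As a rough illustration of what a structured prompt of this kind involves, the sketch below assembles a source, a note on its publication history, and a numbered list of analytical tasks into a single block of text. The template wording and the helper `build_analysis_prompt` are hypothetical and written for this example; the prompt actually used in this article is loaded from the project repository in the cells that follow.

```python
def build_analysis_prompt(source_text, publication_note, tasks):
    """Assemble a source, its provenance, and numbered analytical tasks into one prompt."""
    numbered_tasks = "\n".join(
        f"{number}. {task}" for number, task in enumerate(tasks, start=1)
    )
    return (
        "You are assisting a historian with primary source analysis.\n\n"
        f"Source:\n{source_text}\n\n"
        f"Publication history:\n{publication_note}\n\n"
        "Work through the following tasks in order, explaining your reasoning "
        "for each before stating a conclusion:\n"
        f"{numbered_tasks}"
    )

prompt = build_analysis_prompt(
    source_text=(
        "I was once a tool of oppression\n"
        "And as green as a sucker could be\n"
        "And monopolies banded together\n"
        "To beat a poor hayseed like me."
    ),
    publication_note='"The Hayseed," Farmers Alliance, 4 October 1890.',
    tasks=[
        "Identify the larger historical context of the source.",
        "Describe the song's intended purpose and audience.",
        "Explain how different historiographical approaches might interpret it.",
    ],
)
print(prompt)
```

Separating the source, its provenance, and the tasks keeps each element easy to swap out, so the same template could be reused for other documents or other lines of questioning.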

In [0]:
# Source: Arthur L. Kellog, “The Hayseed,” Farmers Alliance (4 October 1890). Nebraska Newspapers (University of Nebraska Libraries), https://nebnewspapers.unl.edu/lccn/2017270209/1890-10-04/ed-1/seq-1/. Original citation found in: John Donald Hicks, The Populist Revolt: A History of the Farmers' Alliance and the People's Party (University of Minnesota Press, 1931), 168, fn. 30.

from IPython.display import Image, display, Markdown

hayseed_url = 'https://raw.githubusercontent.com/Dr-Hutchinson/jdh_article/refs/heads/main/media/hayseed.png'
display(Image(url=hayseed_url))

# Display the citation
display(Markdown("""Arthur L. Kellog, “The Hayseed,” *Farmers Alliance* (4 October 1890). Nebraska Newspapers (University of Nebraska Libraries), https://nebnewspapers.unl.edu/lccn/2017270209/1890-10-04/ed-1/seq-1/. \n\nOriginal citation found in: John Donald Hicks, *The Populist Revolt: A History of the Farmers' Alliance and the People's Party* (University of Minnesota Press, 1931), 168, fn. 30."""))

In [0]:
import requests
from IPython.display import Markdown, display

# URL of the raw text file on GitHub
file_url = 'https://raw.githubusercontent.com/Dr-Hutchinson/jdh_article/main/prompts/primary_source_analysis.txt'

# Fetch the content of the file, stopping early if the download fails
response = requests.get(file_url, timeout=30)
response.raise_for_status()

# Convert any literal "\n" sequences in the file into real line breaks
primary_source_analysis_prompt = response.text.replace('\\n', '\n')

# Display the content as markdown
display(Markdown("**Primary Source Analysis Prompt fed to GPT-4o:**\n\n" + primary_source_analysis_prompt))


In [0]:
# Install the openai client and the jiwer word error rate library
!pip install openai
!pip install jiwer

In [0]:
# Code for running primary source analysis of the "Hayseed" with OpenAI's GPT-4o model.

from openai import OpenAI
from IPython.display import Markdown, display

# Initialize the OpenAI client
client = OpenAI()

# Create the query for the LLM
query = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": primary_source_analysis_prompt}
    ]
)

# Extract the output from the response
output = query.choices[0].message.content

# Display the output using Markdown
display(Markdown("**GPT-4's Interpretation of 'The Hayseed'**\n\n" + output))


While one can debate aspects of GPT-4’s interpretations, it nonetheless accurately captures much of the context and intent of the source. With the right design, LLMs could be automated to annotate an entire corpus of archival sources, becoming a tool of the digital historian overwhelmed by an abundance of historical data, as envisioned by Roy Rosenzweig twenty years ago. Yet LLM hallucinations are already contributing to this deluge. Further experimentation will be needed to more fully assess LLMs' capabilities for historical interpretation, as well as the creation of new benchmarks to test different approaches to historical analysis. But progress moves quickly in the field of generative AI, and there is intense competition to build new models that advance the existing capabilities of LLMs while shedding their shortcomings. Yet progress remains uneven. Of significant concern is LLMs' performance on benchmarks on ethics and morality, which continue to demonstrate troubling areas of weakness. (<cite id="cite2c-27937/USPCB8UK"><a href="#zotero%7C27937%2FG5ESJ8NI">(Hoffmann et al., 2022)</a></cite>, 31, table A6)
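The idea of annotating a whole corpus can be sketched as a simple loop. In the example below, `annotate_corpus` and `placeholder_model` are hypothetical names, and the placeholder stands in for a real model call so that the sketch runs without an API key; this is one possible design under those assumptions, not a tested pipeline.

```python
def annotate_corpus(sources, ask_model, prompt_template):
    """Send each source through `ask_model` and collect the annotations by source id."""
    annotations = {}
    for source_id, text in sources.items():
        prompt = prompt_template.format(source=text)
        try:
            annotations[source_id] = ask_model(prompt)
        except Exception as error:
            # Record the failure and keep going rather than losing the whole batch
            annotations[source_id] = f"[annotation failed: {error}]"
    return annotations

def placeholder_model(prompt):
    """Stand-in for a real LLM call; it only reports how long the prompt was."""
    return f"(model output for a {len(prompt.split())}-word prompt)"

# Two stanzas of "The Hayseed" stand in for a larger archival corpus.
sources = {
    "hayseed_stanza_1": "I was once a tool of oppression / And as green as a sucker could be",
    "hayseed_stanza_2": "The railroads and old party bosses / Together did sweetly agree",
}
template = "Identify the historical context of this source:\n\n{source}"

for source_id, note in annotate_corpus(sources, placeholder_model, template).items():
    print(source_id, "->", note)
```

In practice `ask_model` could wrap a request like the `client.chat.completions.create` call shown earlier. Because hallucinated annotations would look as confident as accurate ones, any such pipeline would still need a human review step before its output entered a finding aid or dataset.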

Historians should contribute to the broader dialogue about the implications of these technologies, especially as they become increasingly embedded in our digital lives. While imperfect tools, their flaws do not mean that LLMs have no place in the historian's toolkit. In fact, by acknowledging and confronting these shortcomings, historians can better contribute our disciplinary perspectives on the debates concerning this technology, particularly in leveraging the strengths of these models to empower and broaden accessibility. The case studies below demonstrate how historians are using LLMs as versatile tools for both researching and communicating the past.

# LLM Use Case Studies

A promising application of LLMs and other foundational AI models is data preparation and cleanup. A general rule of thumb is that 80% of the labor involved in data analysis is dedicated to preparing the data (, ix)). AI models hold significant potential to streamline and accelerate the challenging work of creating "tidy datasets" ().

## Case Study: Oral History Transcriptions

Oral history provides a particularly useful case study for demonstrating the potential utility of generative AI. Transcribing audio recordings is a central labor in this methodology, but transcriptions often require considerable expense in both time and labor, typically requiring six to eight hours of processing for a single hour of recorded audio. (<cite id="cite2c-27937/YWJAQ4V8"><a href="#zotero%7C27937%2FYWJAQ4V8">(Ritchie, 2nd ed, 2003)</a></cite>) However, advances in machine learning models have resulted in impressive gains in streamlining this task. 

Notable among these models is Whisper, an open-source audio transcription model developed by OpenAI that belongs to the same Transformer family as the GPT series (<cite id="cite2c-27937/7VHKCH3M"><a href="#zotero%7C27937%2F7VHKCH3M"></a></cite>). Let's test Whisper on the first two minutes of a transcribed oral history of historian John Hope Franklin by the Southern Oral History Program (<cite id="cite2c-27937/IJWETJM7"><a href="#zotero%7C27937%2FIJWETJM7"></a></cite>). Recorded on audiotape in 1990, this segment features multiple voices, crosstalk, filler words, and background noise. You can listen to the segment in the code block below. 

In [0]:
# Interview with John Hope Franklin by John Egerton, July 27, 1990. 
# Interview A-0339, Collection #4007.
# Southern Oral History Program Collection, Southern Historical Collection, 
# Wilson Library, University of North Carolina at Chapel Hill. 
# https://docsouth.unc.edu/sohp/A-0339/menu.html 

from IPython.display import Audio

file_path = "./A-0339_edited.mp3"
Audio(file_path)


In the code below, we will use Whisper to transcribe the audio segment and compare it against the official transcript. 

The Whisper models are released as open-source speech recognition and speech translation models in several sizes suited to different tiers of computing power. They support 57 languages and are freely available on HuggingFace. (<cite id="cite2c-27937/GJ2EYJWT"><a href="#zotero%7C27937%2FGJ2EYJWT"></a></cite>) However, for simplicity this demonstration code uses OpenAI's API for the Whisper-1 model. As of September 2024, OpenAI charged $0.36 per hour of recorded audio for transcriptions using the API.  
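The pricing above makes it straightforward to estimate what a transcription project might cost. The sketch below applies the $0.36 hourly rate to the 2 minute 33 second excerpt used in this demonstration and to a hypothetical 500-hour collection; the function name `transcription_cost` and the collection size are assumptions for illustration, and current pricing should be checked before budgeting.

```python
def transcription_cost(duration_seconds, rate_per_hour=0.36):
    """Estimated API cost in US dollars for audio of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration cannot be negative")
    return duration_seconds / 3600 * rate_per_hour

segment_seconds = 153     # the 2 min 33 s excerpt used in this demonstration
collection_hours = 500    # hypothetical size of an oral history collection

print(f"Excerpt: ${transcription_cost(segment_seconds):.4f}")
print(f"{collection_hours} hours of audio: ${transcription_cost(collection_hours * 3600):.2f}")
```

These figures cover only the automated pass. Reviewing and correcting the output still takes staff time, which a fuller estimate would need to include.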

In [0]:
from openai import OpenAI
import time

# Initialize the OpenAI client
client = OpenAI()

# Measure the transcription time for the audio file
start_time = time.time()

# Open the audio in a context manager so the file handle is always closed
with open("./A-0339_edited.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
whisper_transcript = transcription.text

end_time = time.time()

# Calculate the actual transcription time
automation_time = end_time - start_time

# Calculate the estimated transcription time for 1 hour (3600 seconds) based on the transcription time for audio segment
audio_length_seconds = 153  # 2 minutes and 33 seconds in seconds
estimated_time_for_one_hour = (automation_time / audio_length_seconds) * 3600  # Estimate time for 1 hour (3600 seconds)

# Convert estimated time for better readability (hours, minutes, and seconds)
hours = int(estimated_time_for_one_hour // 3600)
minutes = int((estimated_time_for_one_hour % 3600) // 60)
seconds = int(estimated_time_for_one_hour % 60)

# Display the output in the desired format
print(f"Whisper Transcription time: {automation_time:.2f} seconds")
print(f"Estimated Transcription Time for an hour recording at this rate: {hours} hours, {minutes} minutes, {seconds} seconds\n")
print("Raw Whisper Transcript:\n")
print(whisper_transcript)


Because these models are stochastic, running this code multiple times will sometimes produce different results, particularly in the most challenging segments. The code block below therefore visualizes a sample transcription that was annotated and compared against the original. Notable omissions and discrepancies are highlighted. The word error rate (WER) is also calculated, providing a standard benchmark for the model's performance on this oral history.   
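For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the machine transcript into the reference, divided by the number of words in the reference. The sketch below computes it by hand with a standard edit-distance table; the sentence pair is invented for illustration, and the notebook itself relies on the `jiwer` library rather than this function.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("The reference transcript must contain at least one word.")

    # dist[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# Invented sentence pair: one dropped word and one misheard year
reference = "the interview was recorded in 1990"
hypothesis = "the interview recorded in 1919"
score = word_error_rate(reference, hypothesis)
print(f"WER: {score:.2%}")
```

Note that WER treats every word equally: a dropped filler word counts the same as a misheard proper name, which is one reason the highlighted side-by-side view remains useful alongside the single score.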

In [0]:
import re
from jiwer import wer
from IPython.display import display, HTML

# Function to clean HTML tags
def clean_html(text):
    return re.sub(r'<.*?>', '', text)

# Load the formatted transcripts
original_file_path = './revised_original_transcript_formatted_2.txt'
whisper_file_path = './revised_whisper_transcript_formatted_2.txt'

# Read the contents of the original and whisper transcripts
with open(original_file_path, 'r') as original_file:
    original_transcript = original_file.read()

with open(whisper_file_path, 'r') as whisper_file:
    whisper_transcript = whisper_file.read()

# Clean the transcripts for WER calculation
cleaned_original_transcript = clean_html(original_transcript)
cleaned_whisper_transcript = clean_html(whisper_transcript)

# Calculate the Word Error Rate (WER)
error_rate = wer(cleaned_original_transcript, cleaned_whisper_transcript)

# Add <br> tags to preserve line breaks in the text
original_transcript = original_transcript.replace('\n', '<br>')
whisper_transcript = whisper_transcript.replace('\n', '<br>')

# Ensure that color highlighting also includes bolding
whisper_transcript = whisper_transcript.replace(
    'style="background-color: #fbb;"',
    'style="background-color: #fbb; font-weight: bold;"'
)

original_transcript = original_transcript.replace(
    'style="background-color: #bfb;"',
    'style="background-color: #bfb; font-weight: bold;"'
)

# Display the two transcripts side by side using HTML in Jupyter
html_content = f'''
<div style="display: flex;">
    <div style="width: 50%; padding-right: 20px; border-right: 1px solid black;">
        <h4>Original Transcript: (discrepancies in green)</h4>
       {original_transcript}
    </div>
    <div style="width: 50%; padding-left: 20px;">
        <h4>Whisper Transcript: (discrepancies in red)</h4>
        {whisper_transcript}
    </div>
</div>
<br><br>
<div style="text-align: center;">
    <h4>Word Error Rate (WER) for Whisper: {error_rate:.2%}</h4>
</div>
'''

# Render the HTML content in Jupyter
display(HTML(html_content))


These results suggest several observations. The Whisper transcript contains several notable errors, significant omissions, and changes in syntax. But given the medium and the quality of the source audio, the oral historian gains, within a few seconds, a solid first draft to speed their initial review. Human review is still required. Yet even the best human transcriptions contain errors. In the original transcript, Franklin identifies Harvard as the destination of E. Franklin Frazier in 1934; but the pioneering sociologist actually joined the faculty of Howard University, something audible but easily confused in the original recording. Here Whisper actually corrected a human error. Whisper's performance on this example is in line with larger studies on similar media. 

Applications like Whisper hold significant potential for oral historians, as they can dramatically improve the efficiency and cost-effectiveness of transcription workflows. Scholars are already using these techniques for multilingual transcription.

sources - German/Czech Paper - https://www.isca-archive.org/interspeech_2023/lehecka23_interspeech.pdf
Holocaust oral history archive - https://aclanthology.org/2024.htres-1.6.pdf
The Computational Social Science group of the Institut Polytechnique de Paris - https://www.css.cnrs.fr/using-whisper-to-transcribe-oral-interviews/
Rochester Tech - https://www.rit.edu/news/artificial-intelligence-aids-cultural-heritage-researchers-documenting-and-teaching-oral
ASR Leaderboard - https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
Chinese NER with Whisper - https://arxiv.org/abs/2401.11382
Whisper paper - https://cdn.openai.com/papers/whisper.pdf

New Yorker article - https://www.newyorker.com/tech/annals-of-technology/whispers-of-ais-modular-future
https://arxiv.org/abs/2407.17160



As this technology continues to advance, it is likely that the accuracy of tools like Whisper will only improve, further enhancing their utility for oral historians and allowing them to focus more on the analysis and interpretation of these sources.


## Case Study: Error Correction of Optical Character Recognition Scans 

Another potential use case of AI models for digital historians is the error correction of optical character recognition (OCR) scans. Machine learning techniques, such as those pioneered by the research team at Transkribus, have greatly improved the quality and speed of OCR scans, and lowered their cost, across a broad range of historical texts. (<cite id="cite2c-27937/FUEUWF5M"><a href="#zotero%7C27937%2FFUEUWF5M"></a></cite>) However, even high-fidelity OCR scans have error rates that insidiously degrade the accessibility and searchability of text collections. (<cite id="cite2c-27937/ZJW9AI49"><a href="#zotero%7C27937%2FZJW9AI49"></a></cite>) For instance, the image below comes from a newspaper published in a German prisoner-of-war camp in Mississippi during World War II and later microfilmed by the Library of Congress. Let's compare an OCR scan of this image via Google's Cloud Vision OCR service with a human transcription of the same text.

In [0]:
# Source: "Nur ein Film?." Die Lotse (Camp McCain, Mississippi), 30 June 1945. In: Karl John Richard Arndt, editor. German P.O.W. camp papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9.

from IPython.display import Markdown, display
from PIL import Image
image = Image.open('./die_lotse_6-30-45_1.png')

new_width = 600
new_height = int(image.height * (new_width / image.width))

# Resize the image
resized_image = image.resize((new_width, new_height), Image.LANCZOS)

# Display the resized image
display(Markdown(""""Nur ein Film?." *Die Lotse* (Camp McCain, Mississippi), 30 June 1945. In: Karl John Richard Arndt, editor. German P.O.W. Camp Papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9."""))
display(resized_image)

In [0]:
# This script compares the OCR output of the image above with a human transcription.
# Words in red come from the OCR output; words in green come from the human transcription.

import difflib
from IPython.display import display, HTML

ocr_output_1 = 'NUR EIN FILM? NUR EIN\nUnverloeschbar tief haben sich uns die Bilder des Grauens einge- praegt, die jeder von uns dieser Tage in dem ersten amerikanischen Armeefilm aus Deutschland sah. Er schuetterung und Ent setzen haben Jeden Fuehlenden verstummen las sen, aber die Unmenschlichkeit, von "Deutschen" auf deutschem Bo- den begangen, lassst den Gesitte- ten nicht schweigend darueberhin- gehen.\nDante setzt in seinem Werk: 1 "Die goettliche Komoedie" ueber den Eingang zur Hoelle die Worte: "Lasst fahren alle Hoffnungen 1hr, die ihr hier eintritt."\nDiese Worte koonnen ueber je- dem K.Z.-Lager Deutschlands ge standen haben; denn die Bilder des Schreckens und Grauens, wie sie Dante von der Hoelle entwirft, ver- blassen vor dieser schaurigen Wirklichkeit, die sich hier auf Er den unter lebenden Menschen im Herzen Europas abspielte. Was wir sahen, war dabei wohl nur ein kleiner Ausschnitt, wenn wir beden- ken, dass diese Tragoedie seit 1933 unzaehlige Opfer forderte.'
human_corrected_output_1 = 'NUR EIN FILM?\nUnverloeschbar tief haben sich uns die Bilder des Grauens eingepraegt, die jeder von uns dieser Tage in dem ersten amerikanischen Armeefilm aus Deutschland sah. Erschuetterung und Entsetzen haben jeden Fuehlenden verstummen lassen, aber die Unmenschlichkeit, von "Deutschen" auf deutschem Boden begangen, laesst den Gesitteten nicht schweigend darueberhingehen.\nDante setzt in seinem Werk: "Die goettliche Komoedie" ueber den Eingang zur Hoelle die Worte: "Lasst fahren alle Hoffnungen ihr, die ihr hier eintritt."\nDiese Worte koennen ueber jedem K.Z.-Lager Deutschlands gestanden haben; denn die Bilder des Schreckens und Grauens, wie sie Dante von der Hoelle entwirft, verblassen vor dieser schaurigen Wirklichkeit, die sich hier auf Erden unter lebenden Menschen im Herzen Europas abspielte. Was wir sahen, war dabei wohl nur ein kleiner Ausschnitt, wenn wir bedenken, dass diese Tragoedie seit 1933 unzaehlige Opfer forderte.'

differ = difflib.Differ()
diff1 = list(differ.compare(ocr_output_1.split(), human_corrected_output_1.split()))

def ocr1_vs_human_1(diff1):
    result1 = []
    for word in diff1:
        if word.startswith('+'):
            result1.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result1.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result1.append(word[2:])
    return ' '.join(result1)

colored_diff_1 = ocr1_vs_human_1(diff1)

display(HTML(f'<p><strong>Differences between OCR Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_1}</p>'))

While the image quality is satisfactory and the text is printed using modern typefaces, the OCR still generates errors requiring human correction. Correcting such errors necessitates substantial review and intervention, which adds up to a significant amount of labor when processing a sizable text corpus. However, LLMs can expedite this correction process when guided by a carefully designed prompt. This practice, often referred to as "prompt engineering," is a method used to direct LLMs in completing specific tasks. Details concerning each prompt and the methods behind it are described in the hermeneutical layer.

Let's now see how GPT-4 interprets this prompt, and compare the generated corrections with the initial OCR scan.

In [0]:
# Prompt 1: OCR Correction

ocr_prompt = """You are an AI research assistant with a specialty in correcting errors in OCR scans of  newspaper images. In the following task you will be given a OCR'd text, and you will generate corrections using the following Task Format. Follow the instructions of the Format step-by-step.\n\nTask Format\n1. Examine the Examples: Examine the two examples of sample OCR generations, Sample Text 1 and Sample Text 2. \n2. Examine the Corrections: Examine the two examples of OCR corrections given, Corrected Transcription 1 and Corrected Transcription 2. Compare the changes between the Samples to the Corrected Transcriptions.\n3. Note Formatting in the Corrected Transcriptions: Within each corrected transcription are symbols to represent uncertainty or substantial edits to the original OCR. These are inserted to communicate to the user which words may need additional human review. Words that you are very uncertain about are bracketed with a \***\ before and after the very uncertain word. \n4. Examine New OCR Generation: You will then be given a new OCR generation.\n5. Generate New Corrected Transcription: Based on the examples and the prompt instructions, compose a New Corrected Transcription based on the New OCR Generation. Do your best to make it as accurate as possible. Do not correct the grammar or wording of the text, only seek to correct errors in the OCR. Likewise do not add umlauts, eszetts, or other diacritics. \n\nLet's begin.\n\nSample Text 1\n\n#begin sample text 1\n\nEIN LLERRUNDGANG\n\nit diosom Boitreg wird der Boricht des\nOborfoldwebels ltstaodt uobor seinen Rund-\ngang durchs Lagor c.bieschlosson.\nIm Juli bozogon dig orston PoWs ihra "zwungsheimat" im Camp I. wio sah es\ndesnals in diesom Lagor aus ? Bcreckon vurde:: aufgebaut; \n\n#end sample text 1\n\nHere is Sample Correction 1. 
\n\n#begin sample correction 1\n\nEIN LAGERRUNDGANG\n\nMit diesem Beitrag wird der Bericht des Oberfeldwebels Altstaedt ueber seinen Rundgang durchs Lager abgeschlossen.\nIm Juli bezogen die ersten PoWs ihre "Zwangsheimat" im Camp I. Wie sah es damals in diesem Lager aus? Barecken wurden aufgebaut\n#end sample correction 1\n\nHere is Sample Text 2:\n\n#begin sample text 2\n\nDem crsten bericut, der den kunagang duro:\ndas dritte bateillon schilcerte, lesson wir\nheuto den ueber das zwcito bataillon folgen.\nDie Schriftl.\n\nim bingang zum zweiten Lataillon ruht der blick auf dor Lagerstresse, did\nscharf ansteigt Lirks und rechts zichon sich sauber geinauerte Graeben entlane\n#end sample text 2\n\nHere is Sample Correction 2:\n\n#begin sample correction 2\nDem ersten ***Bericht***, der den ***Aufmarsch***\ndes dritten Bataillons schilderte, lessen wir  \nheute der Ueber das zweite Bataillon. \nDie Schriftl. \n\nAm Anfang des zweiten Bataillons ruht der Blick auf der Lagerstrasse, die \nscharf ansteigt. Links und rechts sich sauber eingegrabene Graeben entlang.\n\n#end sample correction 2\n\nHere is the New OCR Generation:\n\n#begin new corrected transcription without umlauts, eszetts, or other diacritics\n"""

Prompt engineering utilizes the emergent capabilities of LLMs to solve complex tasks. This prompt employs two common prompting methods:

Few-shot learning: In this prompt, two sample OCR generations (Sample Text 1 and Sample Text 2) are provided, along with their corresponding corrected transcriptions (Corrected Transcription 1 and Corrected Transcription 2). These examples serve as a learning basis for the model to emulate the task and perform it on a new OCR generation. This technique allows the AI to generalize from the provided examples and apply its insights to new instances. (<cite id="cite2c-27937/KNEK45E4"><a href="#zotero%7C27937%2FKNEK45E4">(Brown et al., 2020)</a></cite>)

Chain-of-thought reasoning: The prompt guides the model through a series of steps to complete the task. It starts with examining the sample texts and their corrections, noting the formatting in the corrected transcriptions, and then proceeds to work on a new OCR generation. It is theorized that this structured approach helps LLMs better determine the desired output, which in this case is a more accurate transcription. (<cite id="cite2c-27937/4QKVA4R7"><a href="#zotero%7C27937%2F4QKVA4R7"></a></cite>)

The versatility of these prompting methods enables the completion of a broad range of tasks, as does the ability to prompt LLMs using natural language instructions. Understanding and developing prompt approaches is an active area of research and experimentation. An excellent resource for exploring prompt engineering has been published by DAIR.AI (Democratizing Artificial Intelligence Research, Education, and Technologies). (<cite id="cite2c-27937/3I2YH8D3"><a href="#zotero%7C27937%2F3I2YH8D3"></a></cite>)
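The few-shot structure described above can also be assembled programmatically, which makes it easier to swap examples in and out. The following is a minimal sketch, not the code used for this article: the function name, the `#begin new OCR generation` delimiter, and the English toy examples are all assumptions made for illustration.

```python
def build_few_shot_prompt(instructions, examples, new_input):
    """Assemble instructions, worked examples, and the new input into one prompt."""
    parts = [instructions.strip(), ""]
    for i, (sample, correction) in enumerate(examples, start=1):
        parts += [
            f"#begin sample text {i}", sample.strip(), f"#end sample text {i}", "",
            f"#begin sample correction {i}", correction.strip(), f"#end sample correction {i}", "",
        ]
    # Hypothetical delimiters marking the text the model should correct
    parts += ["#begin new OCR generation", new_input.strip(), "#end new OCR generation"]
    return "\n".join(parts)


# Invented toy pairs of (noisy OCR, corrected text)
toy_examples = [
    ("Tbe qnick brovvn fox", "The quick brown fox"),
    ("jumps ovcr the lazy d0g", "jumps over the lazy dog"),
]

prompt = build_few_shot_prompt(
    "Correct the OCR errors in the new text, following the examples.",
    toy_examples,
    "Tbe lazy d0g sleeps",
)
print(prompt)
```

Keeping the examples in a list separates the data from the instructions, so a historian could test whether adding a third example, or examples drawn from a different newspaper, changes the quality of the corrections.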

Prompt 1: OCR Correction

You are an AI research assistant with a specialty in correcting errors in OCR scans of newspaper images. In the following task you will be given a OCR’d text, and you will generate corrections using the following Task Format. Follow the instructions of the Format step-by-step.

Task Format

1. Examine the Examples: Examine the two examples of sample OCR generations, Sample Text 1 and Sample Text 2.

2. Examine the Corrections: Examine the two examples of OCR corrections given, Corrected Transcription 1 and Corrected Transcription 2. Compare the changes between the Samples to the Corrected Transcriptions.

3. Note Formatting in the Corrected Transcriptions: Within each corrected transcription are symbols to represent uncertainty or substantial edits to the original OCR. These are inserted to communicate to the user which words may need additional human review. Words that you are very uncertain about are bracketed with a \*\*\* before and after the very uncertain word.

4. Examine New OCR Generation: You will then be given a new OCR generation.

5. Generate New Corrected Transcription: Based on the examples and the prompt instructions, compose a New Corrected Transcription based on the New OCR Generation. Do your best to make it as accurate as possible. Do not correct the grammar or wording of the text, only seek to correct errors in the OCR. Likewise do not add umlauts, eszetts, or other diacritics.

Let’s begin.

Sample Text 1

#begin sample text 1

EIN LLERRUNDGANG

it diosom Boitreg wird der Boricht des Oborfoldwebels ltstaodt uobor seinen Rund- gang durchs Lagor c.bieschlosson. Im Juli bozogon dig orston PoWs ihra “zwungsheimat” im Camp I. wio sah es desnals in diesom Lagor aus ? Bcreckon vurde:: aufgebaut;

#end sample text 1

Here is Sample Correction 1.

#begin sample correction 1

EIN LAGERRUNDGANG

Mit diesem Beitrag wird der Bericht des Oberfeldwebels Altstaedt ueber seinen Rundgang durchs Lager abgeschlossen. Im Juli bezogen die ersten PoWs ihre “Zwangsheimat” im Camp I. Wie sah es damals in diesem Lager aus? Barecken wurden aufgebaut 

#end sample correction 1

Here is Sample Text 2:

#begin sample text 2

Dem crsten bericut, der den kunagang duro: das dritte bateillon schilcerte, lesson wir heuto den ueber das zwcito bataillon folgen. Die Schriftl.

im bingang zum zweiten Lataillon ruht der blick auf dor Lagerstresse, did scharf ansteigt Lirks und rechts zichon sich sauber geinauerte Graeben entlane 

#end sample text 2

Here is Sample Correction 2:

#begin sample correction 2 

Dem ersten \*\*\*Bericht\*\*\*, der den \*\*\*Aufmarsch\*\*\* des dritten Bataillons schilderte, lessen wir
heute der Ueber das zweite Bataillon. Die Schriftl.

Am Anfang des zweiten Bataillons ruht der Blick auf der Lagerstrasse, die scharf ansteigt. Links und rechts sich sauber eingegrabene Graeben entlang.

#end sample correction 2

Here is the New OCR Generation:

#begin new corrected transcription without umlauts, eszetts, or other diacritics

In [0]:
# OpenAI completion using the GPT-4 model with the OCR correction prompt.
# Uses the `client` object created in the Whisper cell above (openai>=1.0 interface).

query = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "assistant", "content": ocr_prompt},
        {"role": "user", "content": ocr_output_1}
    ]
)

gpt4_output_1 = query.choices[0].message.content

# Comparing GPT-4's output with the human transcription.

differ = difflib.Differ()
diff = list(differ.compare(gpt4_output_1.split(), human_corrected_output_1.split()))

def gpt4_vs_human_1(diff):
    result = []
    for word in diff:
        if word.startswith('+'):
            result.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result.append(word[2:])
    return ' '.join(result)

colored_diff_2 = gpt4_vs_human_1(diff)

display(HTML(f'<p><strong>Differences between GPT-4 Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_2}</p>'))
display(HTML(f'<p><strong>Differences between OCR Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_1}</p>'))

In comparing the two outputs against the human transcription, GPT-4 demonstrates a remarkable ability to correct the errors in the OCR scan. GPT-4's ability to indicate its uncertainty in its error corrections also speeds human review of its output. However, the quality of the initial OCR scan remains crucial for a successful output. For example, here is a lower-quality image containing 'noise' that causes substantial errors in the OCR output.
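Because the prompt asks the model to bracket very uncertain words with `***`, those words can be pulled out automatically to build a review checklist. The sketch below is an illustrative helper, not part of the original workflow; the function name is an assumption, and the sample string is taken from Sample Correction 2 in the prompt.

```python
import re

# Matches any text wrapped in the *** uncertainty markers defined in the prompt
UNCERTAIN = re.compile(r"\*\*\*(.+?)\*\*\*")


def flagged_words(transcription):
    """Return the words or phrases the model marked as very uncertain."""
    return UNCERTAIN.findall(transcription)


sample = (
    "Dem ersten ***Bericht***, der den ***Aufmarsch*** "
    "des dritten Bataillons schilderte"
)
flags = flagged_words(sample)
print(flags)
```

A reviewer could then check only the flagged words against the page image first, before reading the full transcription.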

In [0]:
# Source: "Zum Geleit." Die Lotse (Camp McCain, Mississippi), 15 March 1945. In: Karl John Richard Arndt, editor. German P.O.W. camp papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9.

image = Image.open('./die_lotse_3-15-45_1.png')

new_width = 600
new_height = int(image.height * (new_width / image.width))

# Resize the image
resized_image = image.resize((new_width, new_height), Image.LANCZOS)

# Display the resized image
display(Markdown(""""Zum Geleit." *Die Lotse* (Camp McCain, Mississippi), 15 March 1945. In: Karl John Richard Arndt, editor. German P.O.W. Camp Papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9."""))
display(resized_image)

In [0]:
# OpenAI completion using the GPT-4 model.

ocr_output_2 = "Zum Deleit:\nDie neue Lagerzeitung ist nun erschienen. Ja eis ist nun ehr eine unengoare Totesche reworden und we ate anime der Prisoner in stilien Stunden und in froler Laune ersonnon, ier findet ihr es schwarz auf weiss.\nUeber manches moschtet ihr nachdenken, ueber manches euch freuen, belaecheln koennt ihr aller, aber denkt iaren wie an es besser nachen koennte und seit mit Vorschlaegen nicht geizig und zurueckhaltend. Alles, or euch bewegt, arnstes und Heiteres, soll seinen Platz Tinden in dieren. Blaettern, nur Politik lasst ferne.\nWenn euch diese Zeitung Errunterung Unterhaltung und Anregung Ceben, so ist das Cer rchoenste Loin fuer die Nuehe aller, die um das Zustandekommen dieser Laerzeitung benueht war'n.\nwollen\nNoolimals, Jeder arbeite mit an diesen schoenen Werk, nach der Parole Alles von Prisoner fuer Prisoner wir die Zeitung fuehren.\nDas Erscheinen ist nonetlich zreimal vorgesehen. Einsendungen werden nach Hasnabe des verfuegberen Platzes aufgenommen, wobei kein besonders kritischer Kesesta oezue lich er kuenetlerischen Vollendun; an- Celest sird, inner in denkt daran sie viele Kameraden sure Geisteeprodukte lesen und wir doch eine Auerall treffen muessen.\nDie Sohriftleitung."
human_corrected_output_2 = "Zum Geleit:\nDie neue Lagerzeitung ist nun erschienen. Ja sie ist nunmehr eine unlengoare Tatsache geworden und was die Oshirne [?] der Prisoner in stillen Stunden und in froher Laune ersonnen, hier findet ihr es schwarz auf weiss.\nUeber manches moechtet ihr nachdenken, ueber manches euch freuen, belaecheln koennt ihr aller, aber denkt daran wie man es besser machen koennte und seit mit Vorschlaegen nicht geizig und zurueckhaltend. Alles, was euch bewegt, Ernstes und Heiteres, soll seinen Platz finden in diesen Blaettern, nur Politik lasst ferne.\nWenn euch diese Zeitung Ermunterung, Unterhaltung und Anregung geben, so ist das der schoenste Lohn fuer die Muehe aller, die um das Zustandekommen dieser Lagerzeitung bemueht war'n.\nNochmals, jeder arbeite mit an diesem schoenen Werk, nach der Parole “Alles von Prisoner fuer Prisoner” wollen wir die Zeitung fuehren.\nDas Erscheinen ist monatlich zweimal vorgesehen. Einsendungen werden nach Hasnabe des verfuegbaren Platzes aufgenommen, wobei kein besonders kritischer Massstab bezueglich der künstlerischen Vollendung angelegt wird, immerhin denkt daran sie viele Kameraden eure Geistesprodukte lesen und wir doch eine Auswahl treffen muessen.\nDie Schriftleitung."

# Uses the `client` object created in the Whisper cell above (openai>=1.0 interface).
query = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "assistant", "content": ocr_prompt},
        {"role": "user", "content": ocr_output_2}
    ]
)

gpt4_output_2 = query.choices[0].message.content


# Comparing the raw OCR output, and then GPT-4's output, with the human transcription.

differ = difflib.Differ()
diff2 = list(differ.compare(ocr_output_2.split(), human_corrected_output_2.split()))

def ocr2_vs_human_2(diff2):
    result2 = []
    for word in diff2:
        if word.startswith('+'):
            result2.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result2.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result2.append(word[2:])
    return ' '.join(result2)

colored_diff_3 = ocr2_vs_human_2(diff2)
                           

differ = difflib.Differ()
diff3 = list(differ.compare(gpt4_output_2.split(), human_corrected_output_2.split()))

def gpt4_vs_human_2(diff3):
    result_3 = []
    for word in diff3:
        if word.startswith('+'):
            result_3.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result_3.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result_3.append(word[2:])
    return ' '.join(result_3)

colored_diff_4 = gpt4_vs_human_2(diff3)

display(HTML(f'<p><strong>Differences between GPT-4 Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_4}</p>'))
display(HTML(f'<p><strong>Differences between OCR Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_3}</p>'))


Here we see that GPT-4 achieved only modest improvements over the OCR output, perhaps an indication of the limits of this approach. However, GPT-4's multimodal nature may open new opportunities: its multimodal interface will soon allow it to perform prompted OCR scans of images directly. (<cite id="cite2c-27937/U534FF7L"><a href="#zotero%7C27937%2FU534FF7L">(OpenAI, 2023)</a></cite>) Only time will tell if these abilities surpass existing OCR techniques. Yet there seems to be remarkable potential for further exploration.
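A judgment such as "modest improvement" can be made more concrete by scoring each version against the human transcription. The sketch below uses the word-level similarity ratio from Python's standard `difflib` module; it is one possible measure among several, not the evaluation used in this article. The first two strings are the opening lines of the OCR output and human transcription above, while the "corrected" string is a hypothetical model output, since actual GPT-4 output varies between runs.

```python
import difflib


def word_similarity(candidate, reference):
    """Share of matching words between two texts, from 0.0 to 1.0 (identical)."""
    matcher = difflib.SequenceMatcher(
        None, candidate.split(), reference.split(), autojunk=False
    )
    return matcher.ratio()


reference = "Zum Geleit: Die neue Lagerzeitung ist nun erschienen."
raw_ocr = "Zum Deleit: Die neue Lagerzeitung ist nun erschienen."
# Hypothetical model output, assumed here to match the reference exactly
corrected = "Zum Geleit: Die neue Lagerzeitung ist nun erschienen."

ocr_score = word_similarity(raw_ocr, reference)
corrected_score = word_similarity(corrected, reference)

print(f"Raw OCR vs human:   {ocr_score:.3f}")
print(f"Corrected vs human: {corrected_score:.3f}")
```

Applied to the full texts, the gap between the two scores gives a rough sense of how much of the remaining correction work the model has absorbed.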

These two case studies demonstrate an LLM's capacity to assist in various forms of data cleanup and preparation. While human review remains essential, LLMs can make that review less time-consuming and labor-intensive. LLMs are already being employed for tasks as varied as text normalization, metadata generation, automated summarization, date extraction and standardization, sentiment analysis, relationship extraction, and named entity recognition. Further experimentation will undoubtedly reveal future use cases. Such approaches can improve the accuracy, lower the costs, and accelerate the pace of data preparation. LLMs can also expand accessibility to historical sources, enabling the use of programmatic techniques via natural language instructions.

## Case Study: Ask-A-Source - Retrieval Based Methods for LLMs

While LLMs demonstrate a broad range of capabilities for data cleanup, their tendency towards 'hallucinations' represents a formidable obstacle to their use in historical research and analysis. However, recent advances in retrieval-based methods offer the potential to ground LLMs in greater factual accuracy. (<cite id="cite2c-27937/XKINSLH3"><a href="#zotero%7C27937%2FXKINSLH3"></a></cite>) Such techniques also enable the use of LLMs to analyze large text collections, search the Internet, and utilize external tools to solve problems in unfamiliar knowledge domains. The following case study demonstrates one such approach for historians: how an LLM can be used to answer questions about a historical source.

One of the shortcomings of LLMs is their context length, the hard limit on how much text they can interpret in a single query. Models like GPT-3 and ChatGPT can only process some two to three pages of text at a time, while GPT-4 possesses a much larger context length. (<cite id="cite2c-27937/U534FF7L"><a href="#zotero%7C27937%2FU534FF7L">(OpenAI, 2023)</a></cite>) Yet even the most advanced models cannot directly interpret long-form texts or large text collections in a single query.

These limits can be circumvented through the use of tools like semantic search and prompt chaining. In this case study, two different AI models will work together to answer questions about a historical text, Thomas More's *History of Richard III* (<cite id="cite2c-27937/IDRV8C65"><a href="#zotero%7C27937%2FIDRV8C65"></a></cite>). At the end of the process, GPT-4 will deliver a series of responses supported by direct quotations from the text.

The first step in this process is semantic search with OpenAI's Ada model, a computational technique for establishing text similarity. Let's pose the following question: "Who killed the princes in the Tower?"

Here are the sections of the text identified by Ada as the most semantically similar:

In [0]:
# Script for semantic search over an embedded text using OpenAI's Ada embedding model.

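import ast

import numpy as np
import pandas as pd
from IPython.display import Markdown, display
from openai import OpenAI

# The openai.embeddings_utils helpers were removed in openai>=1.0,
# so equivalent helpers are defined below using the client interface.
client = OpenAI()

question = "Who killed the princes in the Tower?"

# Precomputed embeddings of Thomas More's History of Richard III. Available in the article's GitHub repository:

datafile_path = "./more_text_embedded.csv"

df = pd.read_csv(datafile_path)
# Parse each stored embedding string into a NumPy array (literal_eval is safer than eval)
df["embedding"] = df.embedding.apply(ast.literal_eval).apply(np.array)

def get_embedding(text, model="text-embedding-ada-002"):
    # Request an embedding vector for the text from the OpenAI API
    response = client.embeddings.create(input=[text], model=model)
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_text(df, text, n=3, pprint=True):
    text_embedding = get_embedding(text)
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, text_embedding))

    # Select the n most similar rows of the sorted DataFrame
    top_n = df.sort_values("similarities", ascending=False).head(n)

    # If `pprint` is True, display the output
    if pprint:
        for i, (_, row) in enumerate(top_n.iterrows(), 1):
            display(Markdown(f"**Result {i} (Similarity: {row['similarities']:.4f}):**\n\n{row['combined']}\n"))

    # Return the top rows with the added similarity values
    return top_n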


# Call the search_text() function and store the return value in a variable
results_df = search_text(df, question, n=3)

# Reset the index and create a new column "index"
results_df = results_df.reset_index()

# Access the values in the "similarities" and "combined" columns
similarity1 = results_df.iloc[0]["similarities"]
combined1 = str(results_df.iloc[0]["combined"])

similarity2 = results_df.iloc[1]["similarities"]
combined2 = str(results_df.iloc[1]["combined"])

similarity3 = results_df.iloc[2]["similarities"]
combined3 = str(results_df.iloc[2]["combined"])

To enable semantic search in More's text, it must be converted to text embeddings, an approach for transforming "unstructured text data into a structured form." ("Text Embeddings Visually Explained" 2022) Preparing a text for semantic search depends on the intended use case and requires consideration of the model's context length. In this example, the text was broken down to the paragraph level and accompanied by a summary initially generated by ChatGPT (GPT-3.5) and edited by the author afterward. OpenAI's Ada model was then used to compute a set of searchable embeddings. Both the original data file and computed embeddings are included in the GitHub repository for this article. See the OpenAI Cookbook for step-by-step code examples of how to generate embeddings for texts. (OpenAI Cookbook 2023) A variety of other semantic search platforms are available, such as Pinecone, Weaviate, and Haystack.
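Splitting a long text into pieces that fit a model's context length can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessing used for this article: it groups paragraphs by a word budget, whereas real context limits are counted in tokens, and both the function name and the `max_words` value are hypothetical.

```python
def chunk_paragraphs(text, max_words=300):
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    chunks, current, count = [], [], 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Toy text: five invented paragraphs of 100 words each
toy_text = "\n\n".join(f"Paragraph {i} " + "word " * 98 for i in range(1, 6))
chunks = chunk_paragraphs(toy_text, max_words=300)

for number, chunk in enumerate(chunks, 1):
    print(f"Chunk {number}: {len(chunk.split())} words")
```

Splitting on paragraph boundaries rather than at a fixed character count keeps each chunk readable on its own, which matters when the retrieved passage is later quoted back to the reader.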

This student edition of More's *History of Richard III* is produced by the Thomas More Society. (<cite id="cite2c-27937/IDRV8C65"><a href="#zotero%7C27937%2FIDRV8C65"></a></cite>) My thanks to Dr. Ian Crowe, director of the Thomas More Program at Belmont Abbey College, for the opportunity to explore this text with his students in spring 2023 with this research approach.

The results from the semantic search provide the three sections of the text with the highest semantic similarity score. However, high semantic similarity does not always indicate relevance, and such searches can return false positives. Filtering these false positives is essential if you wish to attempt large-scale analysis of an entire text. We can use an LLM to determine the relevance of each text section and then filter out irrelevant sections before answering the question.
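The similarity score itself is the cosine of the angle between two embedding vectors, which is why it measures closeness of direction rather than relevance to a question. The toy sketch below uses invented three-dimensional vectors so the arithmetic can be followed by hand; real `text-embedding-ada-002` vectors have 1,536 dimensions, and the section names here are placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors standing in for a question and three text sections
question_vec = [1.0, 0.0, 0.0]
sections = {
    "section_a": [2.0, 0.0, 0.0],  # same direction as the question, longer vector
    "section_b": [1.0, 1.0, 0.0],  # 45 degrees away from the question
    "section_c": [0.0, 1.0, 0.0],  # perpendicular to the question
}

scores = {name: cosine_similarity(vec, question_vec) for name, vec in sections.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.4f}")
```

Because the score ignores vector length and knows nothing about the question's intent, a passage that merely shares vocabulary with the question can rank highly, which is the source of the false positives discussed above.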

In making multiple queries to the LLM for a single task, we will use a technique known as prompt chaining. Prompt chains break down complex tasks into smaller, more manageable components by making a series of queries to the LLM. We'll use the langchain library for creating these chains. (<cite id="cite2c-27937/IJ5AI567"><a href="#zotero%7C27937%2FIJ5AI567"></a></cite>)
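The idea of a prompt chain can be sketched without any particular library. In the example below, `run_chain` and `fake_llm` are hypothetical names, and `fake_llm` is a deterministic stand-in for a real model call, so the sketch shows only the control flow: one query per section for relevance, then a final query over the surviving sections.

```python
from typing import Callable, List

def run_chain(question: str, sections: List[str], llm: Callable[[str], str]) -> str:
    """Two-link chain: check each section's relevance, then answer from the survivors."""
    relevant = []
    for section in sections:
        # Link 1: one relevance query per section.
        verdict = llm(
            "Is this section relevant to the question?\n"
            f"Question: {question}\nSection: {section}"
        )
        if verdict.strip().lower().startswith("relevant"):
            relevant.append(section)

    if not relevant:
        return "No relevant sections identified."

    # Link 2: answer using only the sections that passed the check.
    evidence = "\n\n".join(relevant)
    return llm(
        "Answer the question using only this evidence.\n"
        f"Question: {question}\nEvidence: {evidence}"
    )

def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model call (assumption, for illustration only)."""
    if prompt.startswith("Is this section relevant"):
        return "Relevant" if "Windsor" in prompt else "Irrelevant"
    return "ANSWER BASED ON: " + prompt.split("Evidence: ", 1)[1]

result = run_chain(
    "Where was Edward IV buried?",
    ["He was interred at Windsor.", "He kept household at Ludlow."],
    fake_llm,
)
print(result)
```

Because each link receives only the output of the previous one, an error in the relevance check propagates to the final answer, which is why the intermediate output is worth inspecting.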

The first link in the chain is a relevance check. In this sequence, GPT-4 will be prompted with a detailed set of instructions, along with three examples of how to determine relevance. Then each text section identified in the semantic search will be passed to the LLM. GPT-4 will use the prompt to generate an analysis of each section's relevance.

langchain is a Python library for programming with large language models. It is backed by a highly active open source community and extensive documentation. <cite id="cite2c-27937/YAX4VVR9"><a href="#zotero%7C27937%2FYAX4VVR9"></a></cite>

In [0]:
# Code for Text Relevance Prompt using langchain

from langchain.prompts import PromptTemplate
from langchain.prompts import FewShotPromptTemplate

# Few-shot examples to enable in-context learning for GPT-4.

burial_question = "2. Section: Summary: Section_160:  Sir James had the murderers bury King Edward V and Prince Richard's bodies deep in the ground under a heap of stones. Text: Section_160: Which after that the wretches perceived, first by the struggling with the pains of death, and after long lying still, to be thoroughly dead, they laid their bodies naked out upon the bed, and fetched Sir James to see them. Who, upon the sight of them, caused those murderers to bury them at the stair-foot, suitably deep in the ground, under a great heap of stones..\n3. SSS: 0.896\n4.Key Words: Edward V, Prince Richard, bodies, bury, Sir James\n5. Background knowledge and context: Edward V was one of the sons of King Edward IV and Prince Richard was his brother. Sir James was involved in their deaths and had their bodies buried.\n6.Relevance Determination: Medium\n7. Relevance Explanation: The key words 'Edward V' and 'Prince Richard' are related to the question as they are mentioned in the same sentence as 'bury'. However, the question specifically asks about the burial of Edward IV, not Edward V and Prince Richard.\n8Final Output: Section_160: Irrelevant.\nExcellent. Let's try another."
cecily_question = (
    "2. Section: Summary: Section_113:  In a sermon at Paul's Cross, it was revealed to the people that King Edward IV's marriage was not lawful, and that his children were bastards. Text: Section_113: Now then as I began to show you, it was by the Protector and his council concluded that this Doctor Shaa should in a sermon at Paul's Cross signify to the people that neither King Edward himself nor the Duke of Clarence were lawfully begotten, nor were the very children of the Duke of York, but gotten unlawfully by other persons by the adultery of the Duchess, their mother, and that also Dame Elizabeth Lucy was verily the wife of King Edward, and so the Prince and all his children were bastards that were gotten upon the Queen.\n3. SSS: 0.869\nBased on the provided information, it appears that the section is potentially relevant to the question. The semantic similarity score is relatively high, indicating that there may be some connection between the section and the question. However, it is important to carefully examine the section and the question to determine the specific relevance.\n4.Key Words: The key words in the section that may be specifically and directly related to the question are 'King Edward,' 'Duke of Clarence,' 'Duke of York,' 'Elizabeth Lucy,' 'Prince,' and 'children.' These words refer to individuals or groups of people mentioned in the section.\n5. Background knowledge and context: Knowing that the question is asking about a person named Cecily, we can use our background knowledge about the context of the text to further assess the relevance of the section. The section mentions several individuals and groups of people, including King Edward, the Duke of Clarence, the Duke of York, Elizabeth Lucy, the Prince, and the children. "
    "Cecily is not mentioned by name in the section.\n6.Relevance Determination: Based on the key words identified in the section and our background knowledge of the context, it is unlikely that the section is relevant to the question. The section does not mention the name Cecily and does not provide any information about her. Therefore, I have a low degree of confidence in determining that the section is relevant to the question.\n7.Relevance Explanation: The section is not relevant to the question because it does not mention the name Cecily and does not provide any information about her.\n8.Final Output: Section_113: Irrelevant.\nExcellent. Let's try another."
)
edward_question = (
    "2. Section: Summary: Section_3:  King Edward IV was a good-looking and strong man who was wise in counsel and just in war. He was also known for his love of women and good food. However, he was also known to be a fair and merciful man, and he was greatly loved by his people. Text: Section_3: He was a goodly personage, and very princely to behold: of heart, courageous; politic in counsel; in adversity nothing abashed; in prosperity, rather joyful than proud; in peace, just and merciful; in war, sharp and fierce; in the field, bold and hardy, and nevertheless, no further than wisdom would, adventurous. Whose wars whosoever would well consider, he shall no less commend his wisdom when he withdrew than his manhood when he vanquished. He was of visage lovely, of body mighty, strong, and clean made; however, in his latter days with over-liberal diet , he became somewhat corpulent and burly, and nonetheless not uncomely; he was of youth greatly given to fleshly wantonness, from which health of body in great prosperity and fortune, without a special grace, hardly refrains. This fault not greatly grieved the people, for one man's pleasure could not stretch and extend to the displeasure of very many, and the fault was without violence, and besides that, in his latter days, it lessened and well left.\n3. SSS: 0.928\nTo determine whether this section is relevant to the question, let's follow the steps of the Method:1.Question: The user's question is ‘What was King Edward IV's appearance?’\n2.Section: The given section is about King Edward IV's appearance, character, and behavior.\n3. SSS: The semantic similarity score (SSS) is 0.928, which is above the threshold of .90 and indicates that there is some potential relevance between the section and the question.\n4. "
    "Key Words: Key words in the section that are directly and specifically related to the question include ‘goodly personage,’ ‘visage lovely,’ ‘body mighty, strong, and clean made,’ and ‘somewhat corpulent and burly.’ These words directly describe King Edward IV's appearance.\n5. Background Knowledge: Based on my background knowledge of the subject matter, I can confirm that this section is directly and specifically relevant to answering the question about King Edward IV's appearance.\n6. Relevance Determination: The relevance determination is high, as the section is directly and specifically related to the question.\n7. Relevance Explanation: The relevance explanation is that the section contains detailed descriptions of King Edward IV's appearance, including his physical appearance and any changes to it over time.\n8. Final Output: Therefore, the final output is ‘Section_3: Relevant.’\nExcellent. Let's try another."
)

# Formatting the examples to pass on to the LLM.
# Note: no trailing comma after the closing bracket, so that `examples` is a list rather than a one-element tuple.
examples = [
    {"question": "1. Question: Where was Edward IV buried?", "output": burial_question},
    {"question": "1. Question: What was Edward IV's appearance?", "output": edward_question},
    {"question": "1. Question: Who is Cecily?", "output": cecily_question}
]

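example_prompt = PromptTemplate(
    # Include the scripted "output" so that each few-shot demonstration,
    # and not only its question, is rendered into the prompt.
    input_variables=["question", "output"],
    template="question: {question}\n{output}",
)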

relevance_prompt = FewShotPromptTemplate(
    # These are the examples we want to insert into the prompt.
    examples=examples,
    # This is how we want to format the examples when we insert them into the prompt.
    example_prompt=example_prompt,
    # The prefix is some text that goes before the examples in the prompt.
    # Usually, this consists of instructions.
    prefix=(
        "You are an AI expert on the 'History of Richard III' by Thomas More. In this exercise you are given a user supplied question, a Section of the Text, a Semantic Similarity Score, and a Method for determining the Section’s relevance to the Question. Your objective is to determine whether that Section of the text is directly and specifically relevant to the user question. You will be the Method below to fulfill this objective, taking each step by step.\n\nHere is your Method.\n\nMethod: Go step by step in answering the question.\n1. Question: You will be provided with a user question.\n2. Section: You will be given a section of the text from Thomas More's 'The History of Richard III.' \n3. Semantic Similarity Score: You are then given a semantic similarity score, which ranges from 1.0 (highest) to 0.0 (lowest). The higher the score, the more likely its potential relevance. Scores approaching .90 and above are generally considered to have some relevance. However, this score isn’t fully determinative, as other semantically related words in the Section can generate false positives. Weigh the value of this score alongside a careful examination of the Question and the Section.\n4. Key Words: Identify key words in the Section that are specifically and directly related to the Question. Such key words could include specific locations, events, or people mentioned in the Section.\n5. Background knowledge and context: Use your background knowledge of the subject matter to further elaborate on whether the Section is directly and specifically relevant to answering the Question.\n6. Relevance Determination: Based on your review of the earlier steps in the Method, determine whether the section is relevant, and gauge your confidence (high, medium, low, or none)  in this determination. High determination is specifically and directly related to the Question. If the section is relevant and ranked high, write ‘'Section_x: Relevant'. "
        "Otherwise, if the section is not relevant and the determination is less than high, write 'Section_x: Irrelevant'.\n7. Relevance Explanation: Based on your review in the earlier steps in the Method, explain why the Section’s relevance to the Question.\nLet’s begin."
    ),
    # The suffix is some text that goes after the examples in the prompt.
    # Usually, this is where the user input will go
    suffix="Question: {question}\nKey Terms:",
    # The input variables are the variables that the overall prompt expects.
    input_variables=["question"],
    # The example_separator is the string we will use to join the prefix, examples, and suffix together with.
    example_separator="\n\n"
)


This prompt design employs few-shot learning and chain-of-thought prompting to guide the language model in determining the relevance of a given text section to a specific question. Let's analyze each aspect of the prompt design and its impact on the model's response.

Chain-of-thought prompting: The prompt design incorporates a set of instructions for step-by-step completion of the task. This method guides the model through the process of analyzing the text section and evaluating its relevance to the deaths of the Princes.

Few-shot learning: As in the previous example, a set of examples is offered to help guide the model's response. Here various sections of the text are compared with a user question for relevance. A scripted sequence is then provided, following the chain-of-thought instructions. Both irrelevant and relevant examples are used.
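How the two techniques combine into a single prompt can be sketched in plain Python. The helper `build_few_shot_prompt` below is hypothetical and only approximates what a few-shot template does: it joins the step-by-step instructions (the prefix), the worked examples, and the new question (the suffix) with a separator. The shortened strings are placeholders, not the article's actual prompt text.

```python
def build_few_shot_prompt(prefix, examples, suffix, separator="\n\n"):
    """Join instructions, worked examples, and the new question into one prompt."""
    blocks = [prefix]
    for example in examples:
        # Each example pairs a question with its scripted, step-by-step answer.
        blocks.append(f"question: {example['question']}\n{example['output']}")
    blocks.append(suffix)
    return separator.join(blocks)

prompt = build_few_shot_prompt(
    prefix="Method: Go step by step in answering the question.",
    examples=[
        {
            "question": "1. Question: Who is Cecily?",
            "output": "8. Final Output: Section_113: Irrelevant.",
        }
    ],
    suffix="Question: {question}\nKey Terms:".format(
        question="Who killed the Princes in the Tower?"
    ),
)
print(prompt)
```

Ending the prompt on an open step label such as "Key Terms:" invites the model to continue the scripted sequence from that step rather than restart it.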

Prompt 2: Textual Relevance

You are an AI expert on the ‘History of Richard III’ by Thomas More. In this exercise you are given a user supplied question, a Section of the Text, a Semantic Similarity Score, and a Method for determining the Section’s relevance to the Question. Your objective is to determine whether that Section of the text is directly and specifically relevant to the user question. You will be the Method below to fulfill this objective, taking each step by step.

Here is your Method.

Method: Go step by step in answering the question.

1. Question: You will be provided with a user question.

2. Section: You will be given a section of the text from Thomas More’s ‘The History of Richard III.’

3. Semantic Similarity Score: You are then given a semantic similarity score, which ranges from 1.0 (highest) to 0.0 (lowest). The higher the score, the more likely its potential relevance. Scores approaching .90 and above are generally considered to have some relevance. However, this score isn’t fully determinative, as other semantically related words in the Section can generate false positives. Weigh the value of this score alongside a careful examination of the Question and the Section.

4. Key Words: Identify key words in the Section that are specifically and directly related to the Question. Such key words could include specific locations, events, or people mentioned in the Section.

5. Background knowledge and context: Use your background knowledge of the subject matter to further elaborate on whether the Section is directly and specifically relevant to answering the Question.

6. Relevance Determination: Based on your review of the earlier steps in the Method, determine whether the section is relevant, and gauge your confidence (high, medium, low, or none) in this determination. High determination is specifically and directly related to the Question. If the section is relevant and ranked high, write ‘‘Section_x: Relevant’. Otherwise, if the section is not relevant and the determination is less than high, write ‘Section_x: Irrelevant’.

7. Relevance Explanation: Based on your review in the earlier steps in the Method, explain why the Section’s relevance to the Question. Let’s begin.

Prompt 2: Examples for in-context learning.

Example 1: 

1. Question: Who is Cecily?

2. Section: 

Summary: Section_113: In a sermon at Paul's Cross, it was revealed to the people that King Edward IV's marriage was not lawful, and that his children were bastards. 

Text: Section_113: Now then as I began to show you, it was by the Protector and his council concluded that this Doctor Shaa should in a sermon at Paul's Cross signify to the people that neither King Edward himself nor the Duke of Clarence were lawfully begotten, nor were the very children of the Duke of York, but gotten unlawfully by other persons by the adultery of the Duchess, their mother, and that also Dame Elizabeth Lucy was verily the wife of King Edward, and so the Prince and all his children were bastards that were gotten upon the Queen. 

3. SSS: 0.869 Based on the provided information, it appears that the section is potentially relevant to the question. The semantic similarity score is relatively high, indicating that there may be some connection between the section and the question. However, it is important to carefully examine the section and the question to determine the specific relevance. 

4. Key Words: The key words in the section that may be specifically and directly related to the question are 'King Edward,' 'Duke of Clarence,' 'Duke of York,' 'Elizabeth Lucy,' 'Prince,' and 'children.' These words refer to individuals or groups of people mentioned in the section. 

5. Background knowledge and context: Knowing that the question is asking about a person named Cecily, we can use our background knowledge about the context of the text to further assess the relevance of the section. The section mentions several individuals and groups of people, including King Edward, the Duke of Clarence, the Duke of York, Elizabeth Lucy, the Prince, and the children. Cecily is not mentioned by name in the section. 

6. Relevance Determination: Based on the key words identified in the section and our background knowledge of the context, it is unlikely that the section is relevant to the question. The section does not mention the name Cecily and does not provide any information about her. Therefore, I have a low degree of confidence in determining that the section is relevant to the question. 

7. Relevance Explanation: The section is not relevant to the question because it does not mention the name Cecily and does not provide any information about her. 

8. Final Output: Section_113: Irrelevant. 

Excellent. Let's try another.


Example 2:

Edward Question: "What was King Edward IV's appearance?"

Section 3: Summary

King Edward IV was a good-looking and strong man who was wise in counsel and just in war. He was also known for his love of women and good food. However, he was also known to be a fair and merciful man, and he was greatly loved by his people.

Section 3: Text

King Edward IV was described as a goodly personage and very princely to behold. He was courageous at heart, politically astute, unshaken in adversity, and joyful in prosperity without being overly proud. In peace, he was just and merciful, while in war, he was sharp and fierce. He was bold and hardy in the field, but he didn't take unnecessary risks. Observers of his wars would commend his wisdom when he withdrew, just as they would praise his manhood when he emerged victorious.

Edward had a lovely visage, and his body was mighty, strong, and well-built. However, in his later years, he became somewhat corpulent and burly due to overindulgence in food, but he still maintained an attractive appearance. In his youth, he was greatly given to fleshly wantonness, a fault that the people did not hold against him since it didn't involve violence and affected only a few. Furthermore, this fault diminished and was eventually abandoned in his later years.

Semantic Similarity Score (SSS): 0.928

To determine the relevance of this section to the question, we can follow the steps of the Method:

1. Question: The user's question is "What was King Edward IV's appearance?"
2. Section: The given section is about King Edward IV's appearance, character, and behavior.
3. SSS: The semantic similarity score (SSS) is 0.928, which is above the threshold of 0.90, indicating potential relevance between the section and the question.
4. Key Words: Key words in the section that are directly and specifically related to the question include "goodly personage," "visage lovely," "body mighty, strong, and clean made," and "somewhat corpulent and burly." These words directly describe King Edward IV's appearance.
5. Background Knowledge: Based on background knowledge of the subject matter, it can be confirmed that this section is directly and specifically relevant to answering the question about King Edward IV's appearance.
6. Relevance Determination: The relevance determination is high, as the section is directly and specifically related to the question.
7. Relevance Explanation: The relevance explanation is that the section contains detailed descriptions of King Edward IV's appearance, including his physical appearance and any changes to it over time.


Final Output: Therefore, the final output is "Section_3: Relevant."


Excellent. Let's try another.

In [0]:
# Code for using Text Relevance Prompt with GPT-4 via the langchain library.

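# Markdown is needed for the display(Markdown(...)) call at the end of this cell.
from IPython.display import Markdown, display
from langchain.chat_models import ChatOpenAI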
from langchain import PromptTemplate, LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

system_prompt = SystemMessagePromptTemplate(prompt=relevance_prompt)

human_message_prompt_template = "Question: {question}\nKey Terms:"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_message_prompt_template)

chat_prompt = ChatPromptTemplate.from_messages([system_prompt, human_message_prompt])

chat = ChatOpenAI(temperature=0, model_name="gpt-4")
chain = LLMChain(llm=chat, prompt=chat_prompt)

r_check_1 = chain.run(question=str(question + "\n2. Section:\n " + combined1 + "\n3. SSS: " + str(similarity1)))
#print("Relevance Check 1: \n\n" + "4. Key Terms: \n" + r_check_1 + "\n")

r_check_2 = chain.run(question=str(question + "\n2. Section:\n " + combined2 + "\n3. SSS: " + str(similarity2)))
#print("Relevance Check 2: \n\n" +  "4. Key Terms: \n" + r_check_2 + "\n")

r_check_3 = chain.run(question=str(question + "\n2. Section:\n " + combined3 + "\n3. SSS: " + str(similarity3)))
#print("Relevance Check 3: \n\n" +  "4. Key Terms: \n" + r_check_3 + "\n")

display(Markdown("GPT-4's Determination of Relevance Starts at Step 4:\n\nRelevance Check 1: \n\n" + combined1 + "\n\n4. Key Terms: \n" + r_check_1 + "\n\n" + "Relevance Check 2: \n\n" +  combined2 + "\n\n4. Key Terms: \n" + r_check_2 + "\n\n" + "Relevance Check 3: \n\n" +  combined3 + "\n\n4. Key Terms: \n" + r_check_3 + "\n"))


GPT-4's analysis of the three text sections yields a determination of each section's relevance to the original question, "who killed the Princes in the Tower." The model also explains the basis for each determination, which lets us inspect its reasoning and supplies further data for the next link in the prompt chain.

In our next step, we'll filter out irrelevant sections and pass the remaining texts to the model to answer the initial question. We'll then prompt GPT-4 to identify a quotation from each text that supports its answer.
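The filtering step depends on reading a verdict out of free-form model output. A minimal, standard-library sketch of that parsing is given here; `parse_verdict` is a hypothetical helper, and the choice to trust the last verdict in the output is an assumption made for this illustration.

```python
import re

# Matches verdicts such as "Section_113: Irrelevant" or "section_3: relevant".
VERDICT = re.compile(r"section_(\d+):\s*(relevant|irrelevant)", re.IGNORECASE)

def parse_verdict(llm_output):
    """Return (section_number, is_relevant) for the last verdict found, else None."""
    matches = VERDICT.findall(llm_output)
    if not matches:
        return None
    # Assumption: the final verdict in the output is the one the model settled on.
    number, label = matches[-1]
    return int(number), label.lower() == "relevant"

print(parse_verdict("8. Final Output: Section_113: Irrelevant."))
print(parse_verdict("Therefore, the final output is 'Section_3: Relevant.'"))
print(parse_verdict("The model gave no verdict."))
```

Returning `None` when no verdict is found keeps malformed outputs distinguishable from sections the model judged irrelevant.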

In [0]:
# Example code that uses regular expressions to filter out irrelevant text sections
# before the next part of the prompt chain. In this particular example, all three texts are relevant.
# Code designed with the assistance of GPT-3.

import pandas as pd
import re

# combined function for combining sections + outputs, and then filtering via regex for relevant sections

combined_df = pd.DataFrame(columns=['output', 'r_check'])
combined_df['output'] = [combined1, combined2, combined3]
combined_df['r_check'] = [r_check_1, r_check_2, r_check_3]

# Use the re.IGNORECASE flag to make the regular expression case-insensitive
regex = re.compile(r'(section_\d+:\srelevant)', re.IGNORECASE)

# Apply the regex pattern to the 'r_check' column and store the results in a new 'mask' column
combined_df['mask'] = combined_df['r_check'].str.extract(regex).get(0).notnull()

# Create a second mask to capture the phrase "this section is relevant"
combined_df['second_mask'] = combined_df['r_check'].str.contains(r'this section is relevant', flags=re.IGNORECASE)

# Combine the two masks using the bitwise OR operator (|) and store the result in the 'mask' column
combined_df['mask'] = combined_df['mask'] | combined_df['second_mask']

# Filter the combined dataframe to include only rows where the 'mask' column is True
relevant_df = combined_df.loc[combined_df['mask']].copy()

# Check if there are any rows in the relevant_df dataframe
if relevant_df.empty:
    # If there are no rows, print the desired message
    print("No relevant sections identified.")
else:
    # Otherwise, continue with the rest of the script

    def combine_strings(row):
        return row['output'] + '\nKey Terms\n' + row['r_check']

    # Use the apply function to apply the combine_strings function to each row of the relevant_df dataframe
    # and assign the result to the 'combined_string' column
    relevant_df['combined_string'] = relevant_df.apply(combine_strings, axis=1)

    final_sections = relevant_df['combined_string']
    #final_sections.to_csv('final_sections.csv')

    evidence_df = pd.DataFrame(final_sections)

    evidence = '\n\n'.join(evidence_df['combined_string'])      
    
    # Filter the relevant_df dataframe to include only the 'output' column
    output_df = relevant_df[['output']]

    # Convert the dataframe to a dictionary
    output_dict = output_df.to_dict('records')

    # Extract the values from the dictionary using a list comprehension
    output_values = [d['output'] for d in output_dict]

    # Print the output values to see the results
    #print(output_values)
    

In [0]:
# Prompt for GPT-4 to identify quotes from the texts to support its answer. 

windsor_analysis = "2. Summary: Section_1:  King Edward IV was a beloved king who was interred at Windsor with great honor. He was especially beloved by the people at the time of his death. Text: Section_1: This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor. He was a king of such governance and behavior in time of peace (for in war each part must needs be another's enemy) that there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people, nor he himself so specially in any part of his life as at the time of his death.\n3.Initial Answer: King Edward IV was buried at Windsor with great honor and mourning from his people.\n4.Supporting Quote: ‘This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor.’ (S.1)\n5. Combined Answer: King Edward IV was interred at Windsor with great honor and mourned by his people: ‘This noble prince...was interred at Windsor...and at the time of his death there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people.’ (S.1)\nExcellent. Let’s try another."
wales_analysis = (
    "2. Summary: Section_17:  After King Edward IV's death, his son Prince Edward moved towards London. He was accompanied by Sir Anthony Woodville, Lord Rivers, and other members of the queen's family. Text: Section_17: As soon as the King was departed, that noble Prince his son drew toward London, who at the time of his father's death kept household at Ludlow in Wales.  Such country, being far off from the law and recourse to justice, was begun to be far out of good will and had grown up wild with robbers and thieves walking at liberty uncorrected. And for this reason the Prince was, in the life of his father, sent thither, to the end that the authority of his presence should restrain evilly disposed persons from the boldness of their former outrages.  To the governance and ordering of this young Prince, at his sending thither, was there appointed Sir Anthony Woodville, Lord Rivers and brother unto the Queen, a right honorable man, as valiant of hand as politic in counsel. Adjoined were there unto him others of the same party, and, in effect, every one as he was nearest of kin unto the Queen was so planted next about the Prince.\n3. Initial Answer: Wales is mentioned in the text as the place where Prince Edward kept household at the time of his father's death and where he was sent to maintain order and restrain criminal activity.\n4. Supporting Quote: 'That noble Prince his son drew toward London, who at the time of his father's death kept household at Ludlow in Wales…That the authority of his presence should restrain evilly disposed persons from the boldness of their former outrages.' (S.17)\n5. "
    "Combined Answer: Wales is mentioned in the text as the place where Prince Edward kept household and was sent to maintain order and prevent crime: 'That noble Prince his son drew toward London, who at the time of his father's death kept household at Ludlow in Wales...That the authority of his presence should restrain evilly disposed persons from the boldness of their former outrages.' (S.17)"
)
edward_analysis = "2. Summary: Section_2:  The people's love for King Edward IV increased after his death, as many of those who bore him grudge for deposing King Henry VI were either dead or had grown into his favor. Text: Section_2: Even after his death, this favor and affection toward him because of the cruelty, mischief, and trouble of the tempestuous world that followed afterwards increased more highly. At such time as he died, the displeasure of those that bore him grudge for King Henry's sake, the Sixth, whom he deposed, was well assuaged, and in effect quenched, in that many of them were dead in the more than twenty years of his reign a great part of a long life. And many of them in the meantime had grown into his favor, of which he was never sparing.\nInitial Answer: The public regarded Edward IV highly, with their love for him increasing after his death as many of those who bore him grudge for deposing Henry VI either died or grew into his favor.\nSupporting Quote: 'Even after his death, this favor and affection toward him because of the cruelty, mischief, and trouble of the tempestuous world that followed afterwards increased more highly...At such time as he died, the displeasure of those that bore him grudge for King Henry's sake, the Sixth, whom he deposed, was well assuaged, and in effect quenched, in that many of them were dead in the more than twenty years of his reign a great part of a long life. And many of them in the meantime had grown into his favor, of which he was never sparing.' (S.2)\nCombined Answer: The public regarded Edward IV highly at the time of his death, with their love for him increasing over time. 'Even after his death, this favor and affection toward him because of the cruelty, mischief, and trouble of the tempestuous world that followed afterwards increased more highly.' (S.2)\n. Excellent. Let’s try another."

examples = [
    {"question": "Question: Where was Edward IV buried?", "output": windsor_analysis},
    {"question": "Question: Is Wales mentioned in the text?", "output": wales_analysis},
    {"question": "Question: How did the public regard Edward IV?", "output": edward_analysis}
]

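# This is how we specify how each example should be formatted.
# The scripted "output" is included so that the worked answer, and not only the question, is rendered.
example_prompt = PromptTemplate(
    input_variables=["question", "output"],
    template="question: {question}\n{output}",
)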

quotation_extraction_prompt = "You are an AI question-answerer and quotation-selector. The focus of your expertise is interpreting “The History of Richard III” by Thomas More. In this exercise you will first be given a user question, a Section of More’s text, and a Method for answering the question and supporting it with an appropriate quotation from the Section. In following this Method you will complete each step by step until finished.\n\nHere is your Method.\nMethod: Go step by step in the question.\n1. Question: You will be provided with a user question.\n2. Section: You will be given a section from Thomas More's 'The History of Richard III.'\n3. Compose Initial Answer: Based on the Question and information provided in the Section, compose a historically accurate Initial Answer to that Question. The Initial Answer should be incisive, brief, and well-written.\n4. Identify Supporting Quote: Based on the Answer, select a Quote from the Section that supports that Answer. Be sure to only select Quotes from the “Text:Section_number” part of the Section. Select the briefest and most relevant Quote possible. You can also use paraphrasing to further shorten the Quote. Cite the Section the Quote came from, in the following manner: (S.1) for quotes from Section_1.\n5. Combined Answer with Supporting Quote: Rewrite the Initial Answer to incorporate the Quote you’ve identified from the “Text:Section_number” part of the Section. This Combined Answer should be historically accurate, and be incisive, brief, and well-written. All Quotes used should be cited using the method above.\nLet’s begin."

This prompt design employs the same structure of few-shot learning and chain-of-thought prompting as in the last example.

Prompt 3: Quotation Extraction

You are an AI question-answerer and quotation-selector. The focus of your expertise is interpreting “The History of Richard III” by Thomas More. In this exercise you will first be given a user question, a Section of More’s text, and a Method for answering the question and supporting it with an appropriate quotation from the Section. In following this Method you will complete each step by step until finished.

Here is your Method. 

Method: Go step by step in the question.

1. Question: You will be provided with a user question.
2. Section: You will be given a section from Thomas More’s ‘The History of Richard III.’
3. Compose Initial Answer: Based on the Question and information provided in the Section, compose a historically accurate Initial Answer to that Question. The Initial Answer should be incisive, brief, and well-written.
4. Identify Supporting Quote: Based on the Answer, select a Quote from the Section that supports that Answer. Be sure to only select Quotes from the “Text:Section_number” part of the Section. Select the briefest and most relevant Quote possible. You can also use paraphrasing to further shorten the Quote. Cite the Section the Quote came from, in the following manner: (S.1) for quotes from Section_1.
5. Combined Answer with Supporting Quote: Rewrite the Initial Answer to incorporate the Quote you’ve identified from the “Text:Section_number” part of the Section. This Combined Answer should be historically accurate, and be incisive, brief, and well-written. All Quotes used should be cited using the method above.

Let’s begin.

Example Prompt for in-context learning:

Question: Where was Edward IV buried?

Summary: Section_1: King Edward IV was a beloved king who was interred at Windsor with great honor. He was especially beloved by the people at the time of his death.

Text: Section_1: This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor. He was a king of such governance and behavior in time of peace (for in war each part must needs be another’s enemy) that there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people, nor he himself so specially in any part of his life as at the time of his death.

Initial Answer: King Edward IV was buried at Windsor with great honor and mourning from his people.

Supporting Quote: ‘This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor.’ (S.1)

Combined Answer: King Edward IV was interred at Windsor with great honor and mourned by his people: ‘This noble prince…was interred at Windsor…and at the time of his death there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people.’ (S.1)

Excellent. Let’s try another.

In [0]:
# Code for calling GPT-4 with the Quote Extraction prompt for the relevant text sections.

pd.set_option('display.max_colwidth', None)

example_prompt = SystemMessagePromptTemplate.from_template(quotation_extraction_prompt)

human_message_prompt = HumanMessagePromptTemplate.from_template("Question: {question}\nKey Terms:")

chat_prompt = ChatPromptTemplate.from_messages([example_prompt, human_message_prompt])

chat = ChatOpenAI(temperature=0, model_name="gpt-4")
chain = LLMChain(llm=chat, prompt=chat_prompt)

# Create an empty list to store the final_analysis results
final_analysis_results = []

# Iterate over the output_values list
for output_value in output_values:
    # Run the final_analysis step and store the result in a variable
    final_analysis = chain.run(question+output_value)
    # Add the final_analysis result to the list
    final_analysis_results.append(final_analysis)

# Create a Pandas dataframe from the output_values list
final_analysis_df = pd.DataFrame({'output_values': output_values, 'final_analysis': final_analysis_results})


display(Markdown(f"**Analysis 1:**\n\n" + final_analysis_df['final_analysis'][0] + "\n\n"))
display(Markdown(f"**Analysis 2:**\n\n" + final_analysis_df['final_analysis'][1] + "\n\n"))
display(Markdown(f"**Analysis 3:**\n\n" + final_analysis_df['final_analysis'][2] + "\n\n"))

Using semantic search and prompt chaining, this case demonstrates how to answer a user's questions about a long-form text with direct quotations and citations. The chain also logs the model's "reasoning," which helps human reviewers check for LLM hallucinations.
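One way to support that human review is to check whether a quotation returned by the model actually appears in the section it cites. The following is a minimal sketch rather than part of the original prompt chain: the helper names (`normalize`, `quote_fragments`, `unverified_fragments`) are hypothetical, and the matching rule (case-insensitive substring matching after splitting the quote on ellipses) is an assumption that will miss looser paraphrases. It is applied to the Edward IV example from the prompt above.

```python
import re

def normalize(text):
    """Lowercase, straighten curly quotes, and collapse whitespace."""
    for curly, straight in (("\u2018", "'"), ("\u2019", "'"),
                            ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_fragments(quote):
    """Split a quote on ellipses so each elided piece can be checked separately."""
    pieces = re.split(r"\u2026|\.{3}", quote)
    cleaned = [normalize(piece).strip(" .,;:'\"") for piece in pieces]
    return [piece for piece in cleaned if piece]

def unverified_fragments(quote, section_text):
    """Return the fragments of the quote that are NOT found verbatim in the section."""
    source = normalize(section_text)
    return [frag for frag in quote_fragments(quote) if frag not in source]

# Section_1 text from the in-context example in the prompt above.
section_1 = (
    "This noble prince died at his palace of Westminster and, with great "
    "funeral honor and heaviness of his people from thence conveyed, was "
    "interred at Windsor. He was a king of such governance and behavior in "
    "time of peace (for in war each part must needs be another's enemy) that "
    "there was never any prince of this land attaining the crown by battle so "
    "heartily beloved by the substance of the people, nor he himself so "
    "specially in any part of his life as at the time of his death."
)

supporting_quote = (
    "This noble prince died at his palace of Westminster and, with great "
    "funeral honor and heaviness of his people from thence conveyed, was "
    "interred at Windsor."
)
combined_quote = (
    "This noble prince\u2026was interred at Windsor\u2026and at the time of his "
    "death there was never any prince of this land attaining the crown by "
    "battle so heartily beloved by the substance of the people."
)

print(unverified_fragments(supporting_quote, section_1))  # nothing flagged
print(unverified_fragments(combined_quote, section_1))    # flags the reordered final fragment
```

A flagged fragment is not necessarily a hallucination, since the prompt explicitly permits paraphrasing to shorten a quote. The check only tells a reviewer where the quoted wording departs from the source.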

From here, additional links of the prompt chain could be added for customized analytical purposes. Each text section could be combined into a single analysis, providing the reader with a narrative response grounded in multiple sections of the text. GPT-4 could also be tasked with a range of other inquiries: contextualizing these events within the broader events of the Wars of the Roses, extracting character interactions for network analysis, or extracting geocoding data for digital mapping. Indeed, given GPT-4's remarkable capacities, perhaps the major limiting factors in using this technology are the researcher's imagination and research budget.

The possibilities for querying a historical source with customized analytical approaches are compelling. So too is the potential to scale this approach. Scholars have employed similar techniques that enable natural language queries over their Zotero research collections. (<cite id="cite2c-27937/UJDUIP4T"><a href="#zotero%7C27937%2FUJDUIP4T"></a></cite>) (<cite id="cite2c-27937/G7XTPF45"><a href="#zotero%7C27937%2FG7XTPF45"></a></cite>) Archival collections and other digitized text corpora could be searched in a similar manner. This capacity to "ask a source" could expand accessibility, accelerate research, and enable new forms of interpretation of the past.

## What Do AIs “Know” About History? Assessing GPT-3’s Historical Capacities

The above case studies demonstrate the general versatility of LLMs on a range of technical and analytical tasks. Of particular interest to historians are empirical studies documenting generative AI's capacities for historical interpretation. 

Machine learning researchers have devised a series of benchmarks for measuring the capacities of LLMs on various forms of academic knowledge. One recently established benchmark introduces a standard for measuring LLMs' performance on the Advanced Placement (A.P.) curricula for U.S., European, and World history. Hundreds of thousands of secondary students across the globe annually enroll in these curricula, which are designed to replicate the rigors of an introductory university-level history course.

In January 2021, a team of ML researchers led by Dan Hendrycks tested GPT-3 on hundreds of multiple-choice questions from the A.P. History curricula, along with fifty-seven other academic disciplines. Twenty-five percent accuracy represented random chance; eighty percent reflected expert-level accuracy. GPT-3 initially achieved over 50% accuracy on all three A.P. curricula. GPT-3's performance in these subfields numbered among the top third of all the academic disciplines included in the study, although in no field did GPT-3 achieve expert-level accuracy. While demonstrating strengths in some areas, GPT-3 nonetheless possessed worrying blind spots, such as particularly poor performance in the fields of "Moral Questions" and "Professional Law." As the authors note, this "weakness is particularly concerning because it will be important for future models to have a strong understanding of what is legal and what is ethical." (<cite id="cite2c-27937/ZS9JDNGD"><a href="#zotero%7C27937%2FZS9JDNGD"></a></cite>)
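To make these reference points concrete, the short sketch below compares the GPT-3 accuracy rates reported in this article for the initial study against the two thresholds just described. It is illustrative only: the function name `describe` and the category labels are hypothetical, and the random-chance baseline assumes four answer options per question, as in the example questions shown in this section.

```python
# Accuracy rates (%) reported in this article for GPT-3 in the initial study.
GPT3_ACCURACY = {
    "US History": 52.9,
    "European History": 53.9,
    "World History": 56.1,
}

NUM_OPTIONS = 4                    # assumption: four answer choices (A-D) per question
RANDOM_CHANCE = 100 / NUM_OPTIONS  # expected accuracy (%) from guessing uniformly
EXPERT_LEVEL = 80.0                # expert-level threshold (%) as described in the text

def describe(accuracy):
    """Place an accuracy rate relative to the chance and expert-level thresholds."""
    if accuracy >= EXPERT_LEVEL:
        return "expert-level"
    if accuracy > RANDOM_CHANCE:
        return "above chance, below expert-level"
    return "at or below chance"

for subject, accuracy in GPT3_ACCURACY.items():
    margin = accuracy - RANDOM_CHANCE
    print(f"{subject}: {accuracy:.1f}% ({describe(accuracy)}; "
          f"{margin:.1f} points above chance)")
```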

Understanding the format of the benchmarks is important in evaluating the performance of LLMs. Below are two example questions drawn from the U.S. History curriculum, both using the same historical source. The code below displays GPT-3's responses:

Here are the accuracy rates for GPT-3 for the initial Hendrycks study: US History, 52.9%; European History, 53.9%; and World History, 56.1%. Full data for questions for history and other disciplines can be found at: (<cite id="cite2c-27937/A834FRJL"><a href="#zotero%7C27937%2FA834FRJL"></a></cite>) Many thanks to Dan Hendrycks for sharing the discipline-specific accuracy rates for these fields.


In [0]:
us_history_benchmark_q5 = """This question refers to the following information.\n\n\"I was once a tool of oppression\n\nAnd as green as a sucker could be\n\nAnd monopolies banded together\n\nTo beat a poor hayseed like me.\n\n"The railroads and old party bosses\n\nTogether did sweetly agree;\n\nAnd they thought there would be little trouble\n\nIn working a hayseed like me. . . ."\n\n—"The Hayseed"\n\nThe song, and the movement that it was connected to, highlight which of the following developments in the broader society in the late 1800s?\n\nA: Corruption in government, especially as it related to big business, energized the public to demand increased popular control and reform of local, state, and national governments.\n\nB: A large-scale movement of struggling African American and white farmers, as well as urban factory workers, was able to exert a great deal of leverage over federal legislation.\n\nC: The two-party system of the era broke down and led to the emergence of an additional major party that was able to win control of Congress within ten years of its founding.\n\nD: Continued skirmishes on the frontier in the 1890s with American Indians created a sense of fear and bitterness among western farmers."""
us_history_benchmark_q22 = """This question refers to the following information.\n\n\"I was once a tool of oppression\n\nAnd as green as a sucker could be\n\nAnd monopolies banded together\n\nTo beat a poor hayseed like me.\n\n"The railroads and old party bosses\n\nTogether did sweetly agree;\n\nAnd they thought there would be little trouble\n\nIn working a hayseed like me. . . ."\n\n—"The Hayseed"\n\nWhich of the following is an accomplishment of the political movement that was organized around sentiments similar to the one in the song lyrics above?\n\nA: Establishment of the minimum wage law.\n\nB: Enactment of laws regulating railroads.\n\nC: Shift in U.S. currency from the gold standard to the silver standard.\n\nD: Creation of a price-support system for small-scale farmers."""
display(Markdown("**U.S. History Benchmarks - Question 5:** \n\n" + us_history_benchmark_q5 + "\n\n\n**U.S History Benchmarks - Question 22:** \n\n" + us_history_benchmark_q22))

In [0]:
import openai

question_5 = openai.Completion.create(
                    model='text-davinci-002',
                    prompt=us_history_benchmark_q5,
                    temperature=0,
                    max_tokens=50)

question_22 = openai.Completion.create(
                    model='text-davinci-002',
                    prompt=us_history_benchmark_q22,
                    temperature=0,
                    max_tokens=50)

display(Markdown("**GPT-3's Answer for Question 5:** " + (question_5.choices[0].text) + "\n\n**Correct Answer**\n\n A: Corruption in government, especially as it related to big business, energized the public to demand increased popular control and reform of local, state, and national governments.\n\n" + "\n\n**GPT-3's Answer for Question 22:** \n\n" + (question_22.choices[0].text) + "\n\n**Correct Answer**\n\n B: Enactment of laws regulating railroads."))
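The cell above displays the model's free-text completions next to the correct answers, but scoring hundreds of questions requires turning each completion into a grade. The sketch below shows one possible approach and is not necessarily how the benchmark results in this article were scored: the helper names (`extract_letter`, `grade_answer`) are hypothetical, and the `"incorrect"` and `"unparsed"` labels are assumptions, chosen to sit alongside the `"correct"` label that the accuracy calculation later in this notebook counts in its `correct_status` column.

```python
import re

# Matches a leading answer letter such as "A:", "B.", "(C)" or "Answer: D".
# Only a letter at the very start of the completion is accepted, so prose that
# merely begins with the article "A" (e.g. "A large-scale movement") is not
# mistaken for answer choice A.
ANSWER_PATTERN = re.compile(
    r"^\s*(?:(?i:answer)\s*[:\-]?\s*)?\(?([A-D])\)?(?=\s*[:.)]|\s*$)"
)

def extract_letter(completion_text):
    """Return the answer letter at the start of a completion, or None if absent."""
    match = ANSWER_PATTERN.match(completion_text)
    return match.group(1) if match else None

def grade_answer(completion_text, correct_letter):
    """Label a completion as 'correct', 'incorrect', or 'unparsed'."""
    letter = extract_letter(completion_text)
    if letter is None:
        return "unparsed"  # leave these for manual review
    return "correct" if letter == correct_letter else "incorrect"

# Usage with made-up completions (not actual model output):
print(grade_answer("\n\nA: Corruption in government", "A"))   # correct
print(grade_answer("Answer: B. Enactment of laws", "A"))      # incorrect
print(grade_answer("A large-scale movement of farmers", "A")) # unparsed
```

Returning `"unparsed"` instead of guessing keeps ambiguous completions out of the accuracy figure until a human has looked at them.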

Since 2021, the release of new GPT models trained using "reinforcement learning from human feedback" (RLHF) has dramatically improved the performance of LLMs on these historical benchmarks, as well as numerous others. (<cite id="cite2c-27937/36TWI3H2"><a href="#zotero%7C27937%2F36TWI3H2"></a></cite>) Below are the results from my replication of the Hendrycks study using later models in the GPT series: the GPT-3 Instruct model, ChatGPT (GPT-3.5), and GPT-4.

In [0]:
# Designed with the help of GPT-4

import matplotlib.pyplot as plt
import seaborn as sns

csv_files = [
    "euro_history_benchmark_tests_chatgpt.csv",
    "euro_history_benchmark_tests_gpt3.csv",
    "euro_history_benchmark_tests_gpt4.csv",
    "us_history_benchmark_tests_chatgpt.csv",
    "us_history_benchmark_tests_gpt3.csv",
    "us_history_benchmark_tests_gpt4.csv",
    "world_history_benchmark_tests_chatgpt.csv",
    "world_history_benchmark_tests_gpt3.csv",
    "world_history_benchmark_tests_gpt4.csv",
]

# Function to calculate accuracy from a CSV file
def calculate_accuracy(file_path):
    df = pd.read_csv(file_path)
    correct_count = df['correct_status'].value_counts().get('correct', 0)
    total_count = len(df)
    return correct_count / total_count * 100

# Calculate accuracies for each file
accuracies = {file: calculate_accuracy(file) for file in csv_files}

# Function to extract the model and history type from the file path
def extract_info(file_path):
    file_name = file_path.split("/")[-1].split(".")[0]
    history_type, model = file_name.split("_benchmark_tests_")
    return history_type, model

# Convert the accuracy dictionary to a DataFrame
data = []
for file, accuracy in accuracies.items():
    history_type, model = extract_info(file)
    if model == 'chatgpt':
        model = 'ChatGPT (GPT-3.5)'
    elif model == 'gpt3':
        model = 'GPT-3 (Instruct model)'
    elif model == 'gpt4':
        model = 'GPT-4'
    data.append([history_type, model, accuracy])

# Add the accuracy rates reported for GPT-3 in the original Hendrycks study
hendrycks_test_data = [
    ["us_history", "GPT-3 (Hendrycks test)", 52.9],
    ["euro_history", "GPT-3 (Hendrycks test)", 53.9],
    ["world_history", "GPT-3 (Hendrycks test)", 56.1],
]
for item in hendrycks_test_data:
    history_type, model, accuracy = item
    data.append([history_type, model, accuracy])

df_accuracy = pd.DataFrame(data, columns=['History Type', 'Model', 'Accuracy'])

# Reorder the models in the desired sequence
model_order = ["GPT-3 (Hendrycks test)", "GPT-3 (Instruct model)", "ChatGPT (GPT-3.5)", "GPT-4"]

# Create a bar chart with Seaborn
plt.figure(figsize=(10, 8))
sns.set(style="whitegrid")
ax = sns.barplot(x="History Type", y="Accuracy", hue="Model", data=df_accuracy, palette="muted", hue_order=model_order)
ax.set(title="Accuracy of the GPT Series on A.P. History Questions in the MMLU Benchmarks", ylabel="Accuracy (%)", xlabel="History Category")

# Customize the chart
for p in ax.patches:
    ax.annotate(
        f"{p.get_height():.2f}%",
        (p.get_x() + p.get_width() / 2., p.get_height()),
        ha="center",
        va="bottom",
        fontsize=9,
        xytext=(0, 3),
        textcoords="offset points",
    )

# Move the legend to the bottom of the chart
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=len(model_order), fancybox=True, shadow=True)

plt.show()


The trajectory of the GPT series on this form of historical knowledge offers a striking demonstration of the rapid gains made in just a few years. GPT-4 now meets expert-level accuracy on all three of the subject exams. These findings mirror GPT-4's performance in other knowledge domains such as medical tests (<cite id="cite2c-27937/VEDFUUBA"><a href="#zotero%7C27937%2FVEDFUUBA"></a></cite>), American bar exams (<cite id="cite2c-27937/A39GT7B4"><a href="#zotero%7C27937%2FA39GT7B4"></a></cite>), and a host of other standardized assessments. (<cite id="cite2c-27937/U534FF7L"><a href="#zotero%7C27937%2FU534FF7L">(OpenAI, 2023)</a></cite>)

Yet why do GPT-4 and other LLMs perform better in some knowledge domains than others? Why does the model answer one question correctly and err on another? There is a temptation to parse the model's performance in ways relatable to our human perspective. The human test taker might approach the question by assessing what types of historical thinking each question requires, what sort of knowledge is offered by the options, and how the historical source relates to the question. But, of course, GPT-4 isn't human - and unlike the human test taker, it has likely already seen the question in advance. In 2021, nearly 400,000 students took the A.P. U.S. History exam. (<cite id="cite2c-27937/HAZWJTY7"><a href="#zotero%7C27937%2FHAZWJTY7"></a></cite>) A vast web presence has emerged to serve the sizable population of students and instructors participating in this international curriculum. Hundreds of exam questions have migrated online via the collective efforts of the test prep publishing industry, various study apps, and uploaded example tests. Given the scale of the dataset used to train the model, many of these questions have likely ended up in GPT-4's training data. If those who critique LLMs as "stochastic parrots" are correct, then GPT-4's performance likely comes from sheer memorization, and not through any analytical process. (<cite id="cite2c-27937/MVDFMR8K"><a href="#zotero%7C27937%2FMVDFMR8K">(Bender et al., n.d.)</a></cite>, 618.) GPT-4's varying performance in A.P.'s different historical fields supports this hypothesis. GPT-4 achieves over 90% accuracy in the most popular A.P. courses: U.S. History (second most popular overall) and World History (fifth). In contrast, the GPT series lags in accuracy on European History, the seventeenth most popular A.P. course. (“Student Score Distributions: AP Exams - May 2019.”) This less popular exam would presumably have a smaller presence both online and in GPT-4's training data. However, this argument is admittedly speculative. GPT-4's training data is not available for public inspection, and the specific mechanisms of how LLMs process information remain a fluid field of inquiry.

Yet even if GPT-4's remarkable performance on standardized tests is the product of memorization, this knowledge has long been a springboard for more advanced forms of historical inquiry. And A.P. study guides are not the only historical texts the GPT series is trained on. Primary source collections, academic monographs, scholarly journals - these too form part of GPT-4's training data. The influence of these sources can be found when GPT-4 is posed more complex questions in a structured prompt. Let's return to the A.P. questions above featuring the Populist-era campaign song "The Hayseed." In the following prompt, GPT-4 is given the lyrics and publication history of the song. (<cite id="cite2c-27937/R5P23ZWU"><a href="#zotero%7C27937%2FR5P23ZWU"></a></cite>) It is then instructed to identify the larger historical context of the source, the song's intended purpose and audience, and how the source might be interpreted via different historiographical approaches.

While one can debate aspects of GPT-4’s interpretations, it nonetheless accurately captures much of the context and intent of the source. With the right design (and sufficient budget), GPT-4 could be automated to annotate an entire corpus of primary sources, becoming a tool of the digital historian overwhelmed by an abundance of historical data, as envisioned by Roy Rosenzweig twenty years ago. Further experimentation will be needed to more fully assess GPT-4’s capabilities for historical interpretation. But progress moves quickly in the ML world, and there is intense competition to build new models that advance the existing capabilities of LLMs and shed their shortcomings. Yet progress remains uneven. Of significant concern is LLMs' performance on benchmarks on ethics and morality, which continue to demonstrate troubling areas of weakness. (<cite id="cite2c-27937/USPCB8UK"><a href="#zotero%7C27937%2FG5ESJ8NI">(Hoffmann et al., 2022)</a></cite>, 31, table A6)

It is at this juncture that historians should contribute their distinctive expertise to the collective efforts to establish ethical guidelines to inform future AI research. (<cite id="cite2c-27937/Q4U4YZPD"><a href="#zotero%7C27937%2FQ4U4YZPD"></a></cite>) We must especially confront the difficult challenge raised by AI researcher Janelle Shane: “Sometimes, to reckon with the effects of biased training data is to realize that the app shouldn’t be built.”<cite id="cite2c-27937/9SCM73IS"><a href="#zotero%7C27937%2F9SCM73IS"></a></cite>

## Chatting with Representations of the Past: Why Historians Should Care About AI

While AIs might master multiple choice questions, most historians would consider this an insufficient proxy for true historical fluency. We need more creative forms of assessment and accessible tools that permit experimentation. Historians also need to engage with the ethical ramifications of these experiments, and devise socially responsible frameworks for implementing these technologies. But we'll need to think quickly - this technology has already enabled a compelling idea nonetheless fraught with unintended consequences.

Among their many talents, GPT-4 and other LLMs are adept at generating responses when guided by a specific point of reference, such as the perspective of a well-known historical figure. This surprising capacity enables a simulation of the worldview of a historical personality. This ability may unlock new forms of interaction with historical sources. It could also reproduce ELIZA effects with significant ramifications for the public's engagement with the past.  

Such was the context of my first experimentation with an LLM: a simulated conversation with "Martin Luther". I selected Luther because of his historical significance and because his conversational style is arguably captured in works like the Table Talk, which reflected his views on a wide range of subjects. (<cite id="cite2c-27937/LA73IHHD"><a href="#zotero%7C27937%2FLA73IHHD"></a></cite>) Using OpenAI's Playground, I directed GPT-3 to adopt this perspective with the following prompt: “I am an AI representation of Martin Luther, a key figure in the Protestant Reformation. You can ask me questions about faith and theology, and I will answer at great length and in the style of Luther's Table Talk.”

And so “he” did. Our chat ranged over the key moments in Luther’s life, religious teachings, and even contemporary events. (<cite id="cite2c-27937/W9A3PKIN"><a href="#zotero%7C27937%2FW9A3PKIN"></a></cite>) To be sure, GPT-3 generated for “Luther” some serious hallucinations, such as Emperor Charles V’s conversion to Lutheranism and the Catholic Church’s admission of error at the Diet of Worms. Yet GPT-3 offered accurate and evocative responses in other areas. GPT-3 correctly identified Luther’s views on scriptural authority, the basis of human salvation, and the doctrine of predestination. In engaging with Luther’s views on Copernicus, GPT-3 correctly interpreted Luther’s opposition to heliocentrism and cited appropriate biblical passages supporting that view. Luther even complained of his depiction in contemporary historiography, citing the preeminent scholar in the field. I attempted to further enhance the verisimilitude of Luther’s responses by creating a fine-tuned model of GPT-3 trained on the actual text of Luther’s Table Talk. (<cite id="cite2c-27937/5L8CK3F6"><a href="#zotero%7C27937%2F5L8CK3F6"></a></cite>) This worked to a degree, and GPT-3 soon generated responses that more accurately matched Luther’s famous pugnacity. But I soon questioned the wisdom of creating an application that accurately mimics Luther. His language inspired profound religious and cultural transformation whose power continues to reverberate centuries later. Luther’s language also inspired violence, in his time and in recent memory. (<cite id="cite2c-27937/ZMXKDKE8"><a href="#zotero%7C27937%2FZMXKDKE8"></a></cite>)

Following the release and surging popularity of ChatGPT in November 2022, dialogues with simulated historic figures proliferated over social media. App developers quickly built interfaces to use ChatGPT for such simulated interactions, drawn by the pedagogical potential of new forms of historical interaction. However, these apps did little to address the problem of LLM "hallucinations" or the potent ethical ramifications such approaches raise. Users soon reported their conversations with both humanity's greatest luminaries and its greatest villains. (<cite id="cite2c-27937/KPPP2BAQ"><a href="#zotero%7C27937%2FKPPP2BAQ"></a></cite>) The ability of these applications to “bring history to life” soon gave way to an appreciation that perhaps some parts of the past are better off dead.

LLMs are quickly entering the public domain. These technologies have the potential to inspire new forms of human discovery and creativity. Yet if we do not take care, AIs will also advance the inequalities, injustices, and misinformation that form the record of human history on which they are trained.

Historians have a stake in this future. The informed and ethical integration of AI in historical research and pedagogy has the potential to democratize access to the past, fostering greater inclusivity in a time of educational austerity. This technology can enhance the learning experience by connecting historical information in novel ways to students and researchers alike, allowing for innovative explorations of historical data and primary sources. In turn, this can foster new scholarly conversations, enrich classroom discussions, and inspire a deeper appreciation for the complexities of the past. But we first have to understand its strengths and limitations, as with any historical source.

GPT-4 is anchored within a specific time, with definitive (if vast) contours that historians can interrogate - except this source can respond to your questions. And yes, GPT-4 invents facts, confuses dates, and distorts the past. But don’t our existing sources already require careful examination? Digital historians have demonstrated historiographical innovation in utilizing emerging technologies to create new forms of scholarship. There is similar potential for historical explorations of generative AI. The effort is worthwhile, as few historical sources possess GPT-4’s scope. However flawed, generative AI represents a powerful tool for addressing Roy Rosenzweig’s call to grapple with the “unheard-of historical abundance” of the digital age.

I am grateful to Abraham Gibson for extending an invitation to present the preliminary research findings of this article to the Digital History Working Group in May 2022, organized by the Consortium for History of Science, Technology, and Medicine. I would also like to express my appreciation to my colleagues William Mattingly and Patrick Wadden for their insightful commentary on the article, as well as to Ian Crowe for his valuable feedback and the opportunity to use the "Ask-A-Source" approach with students in the Thomas More Seminar at Belmont Abbey College.

This article was facilitated by a sabbatical semester generously granted by the Office of Academic Affairs at Belmont Abbey College. My thanks to Provost Travis Feezell and Vice Provost David Williams for their support of this project. 

## Bibliography

<!-- BIBLIOGRAPHY START --><!-- BIBLIOGRAPHY END -->