# Large Language Models

In their most common implementations, Large Language Models are prompt-based. This means that to get the model to carry out a task, you simply tell it what you want in plain text. Probably the best-known implementation of this principle is ChatGPT. As we have learned, this behavior is rooted in their architecture. However, it has some implications for how we can utilize LLMs in our research.

## Messages

Large Language Models such as those provided by OpenAI will most commonly employ a **message** object in order to deal with input. Think of it as the chat history with your favorite AI chatbot. We will learn more sophisticated ways of dealing with this chat history later. For now it is enough to know it is usually referred to as a list of messages like this:

```json
[
    <message 1>,
    <message 2>,
    ...
]
```

Each message typically has some additional data associated with it. Apart from the content, the most important piece is the **role** attribute. This lets the model know who is talking, so to speak. The most important distinction is between *system* and *user*.

- **system**: The system prompt, that is, the prompt defining the behavior of the model. For example, this would instruct your model to be a helpful assistant, a translator, or a sentiment classification system. This is usually hidden when using a chat interface such as ChatGPT.

- **user**: The user input, that is, what you pass to the model directly. In a chatbot, this is what you type into the chat window. If you use an LLM as a classification system, this would be the input you are looking to classify.

There are two more roles that are typically employed for more sophisticated applications such as prolonged chats or Retrieval Augmented Generation (RAG). We will look at them in more detail later, but they are:

- **assistant**: The model's output. In a chat model, this would be the answer to your question. In a classification setup, this would be the classification output. Flagging this especially helps keep track of prolonged conversations.

- **tool**: When using Agents with tools, this flags the tool output. For example, when using RAG, the retriever output would typically be a tool message. More on this later.

Each message is usually stored as a dictionary, and the conversation as a list of these dictionaries. In practice, this looks something like this:

In [1]:
# we will task the model to classify the sentiment of a sentence
instructions = """ 
You are a sentiment classification system. 

You will be given a sentence and you have to determine if the sentence is "positive", "neutral", or "negative".

Only reply with "positive", "neutral", or "negative".
""" # the triple quotes allow us to have a multi-line string

messages = [
    # we pass the instructions as system message
    {
      "role": "system",
      "content": instructions
    },
    # and the content we wish to classify as user message
    {
      "role": "user",
      "content": "I love this movie!"
    }
]

In [2]:
print(messages)

[{'role': 'system', 'content': 'You are a sentiment classification system. \n\nYou will be given a sentence and you have to determine if the sentence is "positive", "neutral", or "negative".\n\nOnly reply with "positive", "neutral", or "negative".'}, {'role': 'user', 'content': 'I love this movie!'}]


After being passed through the model, the conversation would look something like this:

In [3]:
[
    {
      "role": "system",
      "content": instructions
    },
    {
      "role": "user",
      "content": "I love this movie!"
    },
    {
      "role": "assistant",
      "content": "positive"
    }
]

[{'role': 'system',
  'content': ' \nYou are a sentiment classification system. \n\nYou will be given a sentence and you have to determine if the sentence is "positive", "neutral", or "negative".\n\nOnly reply with "positive", "neutral", or "negative".\n'},
 {'role': 'user', 'content': 'I love this movie!'},
 {'role': 'assistant', 'content': 'positive'}]
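
To continue a conversation, the model's reply is appended to the list as an assistant message before the next user message is added. The following is a minimal sketch of that bookkeeping without any API call. `add_turn` is a hypothetical helper name used only for illustration, not part of the OpenAI library, and the second user sentence is made up.

```python
def add_turn(messages, role, content):
    """Return a new message list with one more message appended."""
    # tool messages carry additional fields and are left out of this sketch
    allowed_roles = {"system", "user", "assistant"}
    if role not in allowed_roles:
        raise ValueError(f"unknown role: {role}")
    return messages + [{"role": role, "content": content}]


history = [
    {"role": "system", "content": "You are a sentiment classification system."},
    {"role": "user", "content": "I love this movie!"},
]

# store the model's answer, then add the next input to classify
history = add_turn(history, "assistant", "positive")
history = add_turn(history, "user", "The plot was boring.")

for message in history:
    print(f'{message["role"]}: {message["content"]}')
```

Returning a new list instead of modifying the old one keeps earlier states of the conversation intact, which can be handy when comparing prompts.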

## OpenAI

Let's try this by using the OpenAI API.

For this, we need to set up a client to communicate with the API using our API key from the .env file.

In [1]:
import os
from dotenv import load_dotenv # to load .env file
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) # client = communicates with API

In [87]:
# using the messages from above....
instructions = """ 
You are a sentiment classification system. 

You will be given a sentence and you have to determine if the sentence is "positive", "neutral", or "negative".

Only reply with "positive", "neutral", or "negative".
""" # the triple quotes allow us to have a multi-line string

messages = [
    # we pass the instructions as system message
    {
      "role": "system",
      "content": instructions
    },
    # and the content we wish to classify as user message
    {
      "role": "user",
      "content": "I love this movie!"
    }
]

response = client.chat.completions.create( # completions = create responses
    model="gpt-4o",
    messages=messages
)

In [3]:
response

ChatCompletion(id='chatcmpl-B4r6elwTQE7PzvGJ0ymBc3N5vlPU2', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='positive', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740497152, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=63, total_tokens=65, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

The response of the model comes with a lot of additional info, such as model and token use, if the request was rejected, etc. What we're actually interested in is the `choices` object and, more specifically, the message returned by the model.

In [4]:
response.choices[0].message

ChatCompletionMessage(content='positive', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)

There we can see the message content and some info such as the `role` we talked about before. To get straight to the answer, we can refer to the `content` of the message:

In [5]:
response.choices[0].message.content

'positive'

We can also have the model return the (log) probabilities of the given answer, effectively giving us an idea how sure the model is about its answer:

In [89]:
response = client.chat.completions.create( # completions = create responses
    model="gpt-4o",
    messages=messages,
    logprobs=True,
    top_logprobs=3 # this sets the number of alternative tokens for which the logprobs are returned
)

In [90]:
print(response.choices[0].message.content)
print(response.choices[0].logprobs.content[0].logprob)

positive
-0.0011918948730453849


We can check what other tokens the model considered:

In [92]:
response.choices[0].logprobs.content[0].top_logprobs

[TopLogprob(token='positive', bytes=[112, 111, 115, 105, 116, 105, 118, 101], logprob=-0.0011918948730453849),
 TopLogprob(token='"', bytes=[34], logprob=-6.751192092895508),
 TopLogprob(token='Positive', bytes=[80, 111, 115, 105, 116, 105, 118, 101], logprob=-10.751192092895508)]
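
Log probabilities are easier to read once converted back into plain probabilities with the exponential function. The sketch below does this for the three values printed above; the dictionary is typed in by hand here rather than read from the response object.

```python
import math

# log probabilities copied from the output above
top_logprobs = {
    "positive": -0.0011918948730453849,
    '"': -6.751192092895508,
    "Positive": -10.751192092895508,
}

# exp() undoes the logarithm and returns a probability between 0 and 1
probabilities = {token: math.exp(lp) for token, lp in top_logprobs.items()}

for token, p in probabilities.items():
    print(f"{token!r}: {p:.6f}")
```

The token `positive` comes out at roughly 0.999, so the model puts almost all of its probability mass on this answer. Keep in mind that these values describe the first generated token only.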

There are a number of additional parameters for the API call that you can inspect with `?`:

In [34]:
client.chat.completions.create?

Signature:
client.chat.completions.create(
    *,
    messages: 'Iterable[ChatCompletionMessageParam]',
    model: 'Union[str, ChatModel]',
    audio: 'Optional[ChatCompletionAudioParam] | NotGiven' = NOT_GIVEN,
    frequency_penalty: 'Optional[float] | NotGiven' = NOT_GIVEN,
    function_call: 'completion_create_params.FunctionCall | NotGiven' = NOT_GIVEN,
    functions: 'Iterable[completion_create_params.Function] | NotGiven' = NOT_GIVEN,
    logit_bias: 

Among other things, you can provide a maximum number of tokens to generate. This can be helpful when trying to force the model to give a specific answer (e.g. not longer than the longest label we want returned) or when trying to make sure you do not use more tokens per answer than you can afford (as you are charged by tokens).

Additionally, there's the **_temperature_** parameter, which effectively changes how deterministic the model's answer will be. Simply put, a higher temperature makes it more likely that tokens with a lower log probability are generated. This can be helpful in, e.g., chatbots, where you do not want exactly the same answer every time. In classification tasks, however, you usually want to set it to 0.
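
To build some intuition for the temperature, here is a minimal sketch assuming the textbook formulation, in which the raw token scores are divided by the temperature before a softmax turns them into probabilities. The scores are made up, and how OpenAI implements temperature internally is not something this sketch can confirm.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, rescaled by a temperature > 0."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# made-up scores for three candidate tokens, for illustration only
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates the probability on the highest-scoring token, while a high temperature flattens the distribution. A temperature of exactly 0 cannot be plugged into this formula (it would divide by zero); it is commonly understood as always picking the most likely token.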

### Classification

Technically, the sentiment example above was already a classification task. But let's try the same multi-class classification with CAP categories that we previously attempted with RoBERTa models.

In [36]:
# we're loading up our trusty UK media dataset and do some minor data cleaning

import pandas as pd

uk_media = pd.read_csv('data/uk_media.csv')

# fillna() makes sure missing values don't result in NaN entries
uk_media['text'] = uk_media['description'].fillna('') + ' ' + uk_media['subtitle'].fillna('')

# we'll also drop duplicates indicated by the filter_duplicate column
uk_media = uk_media[uk_media['filter_duplicate'] == 0]

# we'll also drop rows where text is NaN (missing due to missing headlines)
uk_media = uk_media[uk_media['text'].notna()]

uk_media.head()

Unnamed: 0,id,year,speech_year,date,description,subtitle,words,majortopic,filter_international,filter_duplicate,text
0,1,1960,52,06/01/1960,Engineering Unions Press Pay And Hours Claim,Apparition Of Inflationary Spiral Laid,1042.0,5,0,0,Engineering Unions Press Pay And Hours Claim A...
1,2,1960,52,06/01/1960,In Disaster Pit,,796.0,8,0,0,In Disaster Pit
2,3,1960,52,06/01/1960,Peeping Toms,,159.0,12,0,0,Peeping Toms
3,4,1960,52,06/01/1960,Police Officer Praised For Car Struggle,,173.0,12,0,0,Police Officer Praised For Car Struggle
4,5,1960,52,06/01/1960,540 Tugmen End Strike,Employers Agree To Supply Deck Boys,411.0,10,0,0,540 Tugmen End Strike Employers Agree To Suppl...


A major advantage of LLMs in contrast to our RoBERTa model is that they do not need to "learn" the categories via training. Instead, we provide them with instructions on how to label the data, much as you would with a human labeler. However, this makes it necessary to provide the model with descriptive labels and extensive coding instructions, which increase the number of tokens you use per call, and therefore the cost.

Here we are using a codebook based on the Minor Policy Codes of the official [CAP codebook](https://www.comparativeagendas.net/pages/master-codebook):

In [45]:
cap_labels = {
    "1": "Issues related to general domestic macroeconomic policy; Interest Rates; Unemployment Rate; Monetary Policy; National Budget; Tax Code; Industrial Policy; Price Control; other macroeconomics subtopics",
    "2": "Issues related generally to civil rights and minority rights; Minority Discrimination; Gender Discrimination; Age Discrimination; Handicap Discrimination; Voting Rights; Freedom of Speech; Right to Privacy; Anti-Government; other civil rights subtopics",
    "3": "Issues related generally to health care, including appropriations for general health care government agencies; Health Care Reform; Insurance; Drug Industry; Medical Facilities; Insurance Providers; Medical Liability; Manpower; Disease Prevention; Infants and Children; Mental Health; Long-term Care; Drug Coverage and Cost; Tobacco Abuse; Drug and Alcohol Abuse; health care research and development; issues related to other health care topics",
    "4": "Issues related to general agriculture policy, including appropriations for general agriculture government agencies; agricultural foreign trade; Subsidies to Farmers; Food Inspection & Safety; Food Marketing & Promotion; Animal and Crop Disease; Fisheries & Fishing; agricultural research and development; issues related to other agricultural subtopics",
    "5": "Issues generally related to labor, employment, and pensions, including appropriations for government agencies regulating labor policy; Worker Safety; Employment Training; Employee Benefits; Labor Unions; Fair Labor Standards; Youth Employment; Migrant and Seasonal workers; Issues related to other labor policy",
    "6": "Issues related to General education policy, including appropriations for government agencies regulating education policy; Higher education, student loans and education finance, and the regulation of colleges and universities; Elementary & Secondary education; Underprivileged students; Vocational education; Special education and education for the physically or mentally handicapped; Education Excellence; research and development in education; issues related to other subtopics in education policy",
    "7": "Issues related to General environmental policy, including appropriations for government agencies regulating environmental policy; Drinking Water; Waste Disposal; Hazardous Waste; Air Pollution; Recycling; Indoor Hazards; Species & Forest; Land and Water Conservation; research and development in environmental technology, not including alternative energy; issues related to other environmental subtopics",
    "8": "Issues generally related to energy policy, including appropriations for government agencies regulating energy policy; Nuclear energy, safety and security, and disposal of nuclear waste; Electricity; Natural Gas & Oil; Coal; Alternative & Renewable Energy; Issues related to energy conservation and energy efficiency; issues related to energy research and development; issues related to other energy subtopics",
    "9": "Issues related to immigration, refugees, and citizenship",
    "10": "Issues related generally to transportation, including appropriations for government agencies regulating transportation policy; mass transportation construction, regulation, safety, and availability; public highway construction, maintenance, and safety; Air Travel; Railroad Travel; Maritime transportation; Infrastructure and public works, including employment initiatives; transportation research and development; issues related to other transportation subtopics",
    "12": "Issues related to general law, crime, and family issues; law enforcement agencies, including border, customs, and other specialized enforcement agencies and their appropriations; White Collar Crime; Illegal Drugs; Court Administration; Prisons; Juvenile Crime; Child Abuse; Family Issues, domestic violence, child welfare, family law; Criminal & Civil Code; Crime Control; Police; issues related to other law, crime, and family subtopics",
    "13": "Issues generally related to social welfare policy; Low-Income Assistance; Elderly Assistance; Disabled Assistance; Volunteer Associations; Child Care; issues related to other social welfare policy subtopics",
    "14": "Issues related generally to housing and urban affairs; housing and community development, neighborhood development, and national housing policy; urban development and general urban issues; Rural Housing; economic, infrastructure, and other developments in non-urban areas; housing for low-income individuals and families, including public housing projects and housing affordability programs; housing for military veterans and their families, including subsidies for veterans; housing for the elderly, including housing facilities for the handicapped elderly; housing for the homeless and efforts to reduce homelessness ; housing and community development research and development; Other issues related to housing and community development",
    "15": "Issues generally related to domestic commerce, including appropriations for government agencies regulating domestic commerce; Banking; Securities & Commodities; Consumer Finance; Insurance Regulation; personal, commercial, and municipal bankruptcies; corporate mergers, antitrust regulation, corporate accounting and governance, and corporate management; Small Businesses; Copyrights and Patents; Disaster Relief; Tourism; Consumer Safety; Sports Regulation; domestic commerce research and development; other domestic commerce policy subtopics",
    "16": "Issues related generally to defense policy, and appropriations for agencies that oversee general defense policy; defense alliance and agreement, security assistance, and UN peacekeeping activities; military intelligence, espionage, and covert operations; military readiness, coordination of armed services air support and sealift capabilities, and national stockpiles of strategic materials.; Nuclear Arms; Military Aid; military manpower, military personel and their dependents, military courts, and general veterans' issues; military procurement, conversion of old equipment, and weapons systems evaluation; military installations, construction, and land transfers; military reserves and reserve affairs; military nuclear and hazardous waste disposal and military environmental compliance; domestic civil defense, national security responses to terrorism, and other issues related to homeland security; non-contractor civilian personnel, civilian employment in the defense industry, and military base closings; military contractors and contracting, oversight of military contrators and fraud by military contractors; Foreign Operations; claims against the military, settlements for military dependents, and compensation for civilians injured in military operations; defense research and development; other defense policy subtopics",
    "17": "Issues related to general space, science, technology, and communications; government use of space and space resource exploitation agreements, government space programs and space exploration, military use of space; regulation and promotion of commercial use of space, commercial satellite technology, and government efforts to encourage commercial space development; science and technology transfer and international science cooperation; Telecommunications; Broadcast; Weather Forecasting; computer industry, regulation of the internet, and cyber security; space, science, technology, and communication research and development not mentioned in other subtopics.; other issues related to space, science, technology, and communication research and development",
    "18": "Issues generally related to foreign trade and appropriations for government agencies generally regulating foreign trade; Trade Agreements; Exports; Private Investments; productivity of competitiveness of domestic businesses and balance of payments issues; Tariff & Imports; Exchange Rates; other foreign trade policy subtopics",
    "19": "Issues related to general international affairs and foreign aid, including appropriations for general government foreign affairs agencies; Foreign Aid; Resources Exploitation; Developing Countries; International Finance; Western Europe; issues related specifically to a foreign country or region not codable using other codes, assessment of political issues in other countries, relations between individual countries; Human Rights; International organizations, NGOs, the United Nations, International Red Cross, UNESCO, International Olympic Committee, International Criminal Court; international terrorism, hijacking, and acts of piracy in other countries, efforts to fight international terrorism, international legal mechanisms to combat terrorism; diplomats, diplomacy, embassies, citizens abroad, foreign diplomats in the country, visas and passports; issues related to other international affairs policy subtopics",
    "20": "Issues related to general government operations, including appropriations for multiple government agencies; Intergovernmental Relations; Bureaucracy; Postal Service; issues related to civil employees not mentioned in other subtopics, government pensions and general civil service issues; issues related to nominations and appointments not mentioned elsewhere; issues related the currency, national mints, medals, and commemorative coins; government procurement, government contractors, contractor and procurement fraud, and procurement processes and systems; government property management, construction, and regulation; Tax Administration; public scandal and impeachment; government branch relations, administrative issues, and constitutional reforms; regulation of political campaigns, campaign finance, political advertising and voter registration; Census & Statistics; issues related to the capital city; claims against the government, compensation for the victims of terrorist attacks, compensation policies without other substantive provisions; National Holidays; other government operations subtopics",
    "21": "Issues related to general public lands, water management, and territorial issues; National Parks; Indigenous Affairs; natural resources, public lands, and forest management, including forest fires, livestock grazing; water resources, water resource development and civil works, flood control, and research; territorial and dependency issues and devolution; other public lands policy subtopics",
    "23": "Issues related to general cultural policy issues",
    "99": "Other issues, where none of the above is appropriate.", # dummy category
}

Now, we will turn this into a system prompt so the model knows what to do. For complex tasks, it is recommended to use a modular system prompt for troubleshooting:

In [50]:
# give the model some context for the task it is about to perform
context = """
You are a political scientist tasked with annotating documents into policy categories. 
The documents can be classified as one of the following numbered categories. 
A description of each category is following the ':' sign.
"""

# turn the CAP dictionary into a string
labels_definitions = ""

for code, description in cap_labels.items():
    labels_definitions += f'{code}: {description}\n'

# finally, the question we want the model to answer, including specific instructions for the output
question = """
Which policy category does this document belong to? 
Answer only with the number of the category, and only with a single category.
"""

# now we combine the parts into the system prompt
system_message = f"{context}\n{labels_definitions}\n\n{question}"

print(system_message)


You are a political scientist tasked with annotating documents into policy categories. 
The documents can be classified as one of the following numbered categories. 
A description of each category is following the ':' sign.

1: Issues related to general domestic macroeconomic policy; Interest Rates; Unemployment Rate; Monetary Policy; National Budget; Tax Code; Industrial Policy; Price Control; other macroeconomics subtopics
2: Issues related generally to civil rights and minority rights; Minority Discrimination; Gender Discrimination; Age Discrimination; Handicap Discrimination; Voting Rights; Freedom of Speech; Right to Privacy; Anti-Government; other civil rights subtopics
3: Issues related generally to health care, including appropriations for general health care government agencies; Health Care Reform; Insurance; Drug Industry; Medical Facilities; Insurance Providers; Medical Liability; Manpower; Disease Prevention; Infants and Children; Mental Health; Long-term Care; Drug Covera

**Important:** Check the length of your prompt! Depending on your model's context window and the number of tokens you are willing to spend per call, you might need to shorten your prompt. Note that `len()` counts characters, not tokens.

In [51]:
len(system_message)

10702
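
Since `len()` returns the number of characters, a rough conversion to tokens can help with budgeting. The sketch below assumes the common rule of thumb of about four characters per token for English text. This is only a heuristic; an exact count would require a tokenizer such as OpenAI's `tiktoken` package. The function name and the number of calls are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; the 4-characters-per-token ratio is only a rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


# stand-in string with the same length as the system prompt above
example_prompt = "x" * 10702

tokens_per_call = estimate_tokens(example_prompt)
n_calls = 500  # hypothetical number of documents to classify

print(tokens_per_call, tokens_per_call * n_calls)
```

Because the system prompt is sent with every single call, its length is multiplied by the number of documents, which is why shortening it can pay off quickly.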

Next, we will write a simple wrapper function to make running it over multiple rows easier:

In [118]:
import re

def classify_text(text, system_message, model, max_tokens, logprobs, top_logprobs):

  # clean the text by removing extra spaces
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message}, # this will contain all instructions for the model
    # user input
    {"role": "user", "content": text}, # text here is the input text to be classified
  ]

  response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.0,
    max_tokens=max_tokens, # restrict max tokens for more consistent/concise output
    logprobs=logprobs, # include log probs
    top_logprobs=top_logprobs, # number of top logprobs to return
  )

  
  result = response.choices[0].message.content

  if logprobs: # a simple check if we wanted logprobs to get returned
    logprobs = response.choices[0].logprobs.content[0].logprob
    top_logprobs = response.choices[0].logprobs.content[0].top_logprobs
    return result, logprobs, top_logprobs
  
  else: # if not, return only the result
    return result
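
The model is not guaranteed to reply with a bare category number. The logprobs shown earlier had a quotation mark as the second most likely first token, for instance. A minimal sketch of a defensive parsing step could look like this; `parse_label` is a hypothetical helper, and the label set is a shortened stand-in for the keys of `cap_labels`.

```python
import re


def parse_label(raw_output, valid_labels):
    """Return the first number in the model output if it is a known label, else None."""
    match = re.search(r"\d+", raw_output or "")
    if match and match.group() in valid_labels:
        return match.group()
    return None


# hypothetical subset of the codebook keys, for illustration
valid_labels = {"1", "2", "3", "12", "19", "99"}

print(parse_label("12", valid_labels))
print(parse_label('"3"', valid_labels))
print(parse_label("Category 19.", valid_labels))
print(parse_label("I am not sure", valid_labels))
```

Returning `None` for anything unexpected makes it easy to count and inspect failed outputs afterwards instead of silently treating them as a wrong class.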

As previously observed, the UK media dataset includes some policy codes that are not part of the official CAP codebook. For evaluation purposes, we will drop these.

In [52]:
# drop rows with majortopic code 0
uk_media = uk_media[uk_media['majortopic'] != 0]

# only keep rows below 24 OR equal to 99
uk_media = uk_media[(uk_media['majortopic'] < 24) | (uk_media['majortopic'] == 99)]

# drop category 22 not in the CAP
uk_media = uk_media[uk_media['majortopic'] != 22]

# turn the majortopic column into a string
uk_media['majortopic'] = uk_media['majortopic'].astype(str)


For testing purposes, we will only run a subsample of the data. This is useful when testing out different prompts without using all your API credit.

In [124]:
# set a seed before sampling for reproducibility
seed = 20250228 # I'm using today's date as seed here

uk_media_sample = uk_media.sample(n = 500, random_state = seed) # we'll sample 500 out of the 17746 rows

uk_media_sample.reset_index(drop = True, inplace = True) # reset index

Let's classify! OpenAI provides different models. You can find an overview (including their context windows) here: https://platform.openai.com/docs/models

We'll use the latest 4o-mini model (currently gpt-4o-mini-2024-07-18). We could also use a model with stronger reasoning capabilities such as o1 or a larger model such as the one used in the ChatGPT application. Note that the pricing differs for these models. You can find more information on the pricing here: https://platform.openai.com/docs/pricing

In [127]:
model = "gpt-4o-mini-2024-07-18"

classification_results = [classify_text(text, 
                                        system_message = system_message, 
                                        model = model, 
                                        logprobs = True,
                                        top_logprobs=5, # number of alternative logprobs
                                        max_tokens = 10) for text in uk_media_sample['text']] # we're looping our function over the texts

In [129]:
# save results
import pickle

with open('data/openai_classifications.pkl', 'wb') as f:
    pickle.dump(classification_results, f)

In [130]:
# load results later like this:
import pickle

with open('data/openai_classifications.pkl', 'rb') as f:
    classification_results = pickle.load(f)

In [131]:
# add the classification results to the sample

classification_results_df = pd.concat([uk_media_sample, 
                                       pd.DataFrame(classification_results, 
                                                    columns = ['result', 'logprobs', 'top_logprobs'])],
                                        axis = 1)
classification_results_df

Unnamed: 0,id,year,speech_year,date,description,subtitle,words,majortopic,filter_international,filter_duplicate,text,result,logprobs,top_logprobs
0,18704,2004,97,09/06/2004,Sugary drink risk,,52.0,3,0,0,Sugary drink risk,3,0.000000e+00,"[TopLogprob(token='3', bytes=[51], logprob=0.0..."
1,20612,2007,100,04/04/2007,Iran softens stance over captured crew but Bec...,,632.0,16,0,0,Iran softens stance over captured crew but Bec...,19,-3.128163e-07,"[TopLogprob(token='19', bytes=[49, 57], logpro..."
2,14946,1994,87,09/11/1994,Woolworths death charge - Ian Karl Kay,,39.0,12,0,0,Woolworths death charge - Ian Karl Kay,99,-1.278886e-02,"[TopLogprob(token='99', bytes=[57, 57], logpro..."
3,7837,1975,68,19/11/1975,Leftist units in Lisbon placed on the alert,,451.0,19,1,0,Leftist units in Lisbon placed on the alert,19,-1.109613e-03,"[TopLogprob(token='19', bytes=[49, 57], logpro..."
4,16857,2000,93,24/05/2000,Tourism chief to take over at the Dome - New M...,,269.0,15,0,0,Tourism chief to take over at the Dome - New M...,15,-3.650519e-06,"[TopLogprob(token='15', bytes=[49, 53], logpro..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,4702,1968,60,23/10/1968,Blake 'hid under nose of police,,712.0,12,0,0,Blake 'hid under nose of police,12,-1.147242e-06,"[TopLogprob(token='12', bytes=[49, 50], logpro..."
496,3965,1967,59,15/03/1967,Witness 'heard plot to kill Kennedy,Clay Shaw accused at hearing,578.0,19,0,0,Witness 'heard plot to kill Kennedy Clay Shaw ...,12,-8.410085e-03,"[TopLogprob(token='12', bytes=[49, 50], logpro..."
497,638,1961,53,22/03/1961,Twelve Months For Steel Thefts,"Over 1,000 Tons Said To Be Involved",436.0,12,0,0,"Twelve Months For Steel Thefts Over 1,000 Tons...",12,-7.469591e-05,"[TopLogprob(token='12', bytes=[49, 50], logpro..."
498,3468,1966,58,06/04/1966,Mr. Wilson To Call Farm Talks,,441.0,4,0,0,Mr. Wilson To Call Farm Talks,4,0.000000e+00,"[TopLogprob(token='4', bytes=[52], logprob=0.0..."


_Note_: Classifying 500 rows of our data with the 4o-mini model and the system prompt above (~10,700 characters long) costs approximately €0.11.

### Evaluation

As our data provides a gold standard, we can evaluate the results as we did before.

In [133]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(classification_results_df["majortopic"], classification_results_df["result"])
precision = precision_score(classification_results_df["majortopic"], classification_results_df["result"], average="weighted")
recall = recall_score(classification_results_df["majortopic"], classification_results_df["result"], average="weighted")
f1 = f1_score(classification_results_df["majortopic"], classification_results_df["result"], average="weighted")

print(
    f'Accuracy: {accuracy}\n'
    f'Precision: {precision}\n'
    f'Recall: {recall}\n'
    f'F1: {f1}'
)

Accuracy: 0.448
Precision: 0.6545391707391707
Recall: 0.448
F1: 0.4797436468680533




In [135]:
from sklearn.metrics import classification_report

print(classification_report(classification_results_df["majortopic"], classification_results_df["result"]))

              precision    recall  f1-score   support

           1       0.55      0.82      0.66        28
          10       0.90      0.39      0.54        49
          12       0.71      0.67      0.69        70
          13       0.00      0.00      0.00         4
          14       0.60      0.46      0.52        13
          15       0.27      0.12      0.17        24
          16       0.81      0.21      0.34        80
          17       0.71      0.23      0.34        22
          18       0.67      1.00      0.80         4
          19       0.40      0.47      0.43        73
           2       0.27      1.00      0.42         4
          20       0.78      0.19      0.31        36
          21       0.00      0.00      0.00         5
          23       0.00      0.00      0.00         0
           3       0.84      0.84      0.84        25
           4       0.69      0.75      0.72        12
           5       0.17      0.40      0.24        10
           6       1.00    



What do you think about these results?
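
One way to start answering this question is to look at which categories get mixed up most often. The following sketch counts pairs of gold label and predicted label that disagree. The labels are toy values for illustration, and `top_confusions` is a hypothetical helper; in the notebook you would pass the `majortopic` and `result` columns instead.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=5):
    """Count (gold label, predicted label) pairs where the two disagree."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# toy labels for illustration; replace with the majortopic and result columns
gold = ["16", "16", "19", "12", "16", "3"]
pred = ["19", "19", "19", "12", "16", "99"]

for (true_label, predicted_label), count in top_confusions(gold, pred):
    print(f"gold {true_label} -> predicted {predicted_label}: {count}x")
```

Frequent confusions between two specific categories often point to overlapping label descriptions in the prompt rather than to a general weakness of the model.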