# Ollama

[Ollama](https://ollama.com/) (Omni-Layer Learning Language Acquisition Model) is a very powerful tool that allows you to run Large Language Models on your local machine or server. It is an open-source tool can deploy a large number of LLMs (see the [model library](https://ollama.com/search) for available models). Typically, the models you load up with Ollama are quantized, which makes it possible to load large models with relatively small hardware. You can even run models on the CPU, without the need for a GPU!

However, some limitations apply. For one, larger models will _still_ require a fair bit of compute to run - especially if you do not have a GPU. Hence, if your hardware ressources are limited (often the case if you're running ollama on a local machine), it is usually best to stick to the smaller models. Larger models may still run, but inference (answering your requests) will be very slow. The good news is that for specific tasks, small models are often enough. If you need a larger model, you can always rent some remote compute at AWS, Google, Azure... and set up an ollama instance there to run them (but this will usually cost you some money). **If you are working with sensitive data where privacy considerations apply** (e.g. personal data), however, you will probably want to run ollama on a local machine (or your own server) to adhere to your institution's rules and regulations.

Model size is usually reported in billion parameters. For example, Llama 3.1 comes in the model sizes 8B, 70B, and 405B, while the smaller Llama 3.2 version comes in model sizes in 1B and 3B. Similarly, DeepSeek R1 comes in anything from 1.5B to 671B. My advice is to **test the smallest possible model and work your way up to larger models** if the small models do not fulfill your task adequately. Additionally, **different models are good at different things**. While some may be better at reasoning, other may be better at text understanding or translation. This (often as benchmarks) is usually reported on the model card in the ollama [model library](https://ollama.com/search). Before moving to larger models, it is often helpful to try a different model architecture (e.g. Deepseek instead of Llama).

To start, we'll first check if we have a running ollama instance. Note that in Python, you can directly access your terminal by preceding your code with a `!`.

In [2]:
# this prints out the ollama version and gives a warning if there is no ollama instance running

!ollama -v

ollama version is 0.5.4


If you have an ollama instance running in the background (e.g. because you started the program, or it is in your autostart), the above will print out the ollama version. If you get a warning that there is no running ollama instance, you will need to start one. Either by starting the app externally (ollama.exe) or via the command line with `ollama serve`. We could technically do this within a code chunk in a notebook, but as the _serve_ command is running continuisly (and stops serving when terminated), this would "block" our notebook (there are ways of detaching, but they are OS-specific our require additional overhead). Instead, we use an external terminal. You can either start your OS terminal (like PowerShell on Windows) or use the _Terminal_ available on the bottom of VSCode. When you enter `ollama serve` there, it will start an ollama instance until you close the terminal.

Before you can use a model, you need to download it to your local machine. This can be done with terminal commands, either in a terminal window (which is not currently serving ollama) or with the `!`prefix in python. Note that models can be quite large (the largest will have hundreds of GB). You can set the model directory (e.g. to safe the models on a differnt drive) by setting the `OLLAMA_MODELS` environment variable, see [here](https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored). 

Note that, if you run the commands below directly in a terminal, you need to leave out the `!`.

In [None]:
# this downloads llama3.2 (default is 3B)

!ollama pull llama3.2

[?25lpulling manifest ⠋ [?25h[?25l[2K[1Gpulling manifest ⠙ [?25h[?25l[2K[1Gpulling manifest ⠹ [?25h[?25l[2K[1Gpulling manifest ⠸ [?25h[?25l[2K[1Gpulling manifest ⠼ [?25h[?25l[2K[1Gpulling manifest ⠴ [?25h[?25l[2K[1Gpulling manifest ⠦ [?25h[?25l[2K[1Gpulling manifest ⠧ [?25h[?25l[2K[1Gpulling manifest ⠇ [?25h[?25l[2K[1Gpulling manifest 
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB                         
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB                         
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB                         
pulling 56bb8bd477a5... 100% ▕████████████████▏   96 B                         
pulling 34bb5ab01051... 100% ▕████████████████▏  561 B                         
verifying sha256 digest 
writing manifest 
success [?25h


In [5]:
# for the 1b version, we need to specify 

!ollama pull llama3.2:1b

[?25lpulling manifest ⠋ [?25h[?25l[2K[1Gpulling manifest ⠙ [?25h[?25l[2K[1Gpulling manifest ⠹ [?25h[?25l[2K[1Gpulling manifest ⠸ [?25h[?25l[2K[1Gpulling manifest ⠼ [?25h[?25l[2K[1Gpulling manifest ⠴ [?25h[?25l[2K[1Gpulling manifest ⠦ [?25h[?25l[2K[1Gpulling manifest ⠇ [?25h[?25l[2K[1Gpulling manifest 
pulling 74701a8c35f6...   0% ▕                ▏    0 B/1.3 GB                  [?25h[?25l[2K[1G[A[2K[1Gpulling manifest 
pulling 74701a8c35f6...   0% ▕                ▏    0 B/1.3 GB                  [?25h[?25l[2K[1G[A[2K[1Gpulling manifest 
pulling 74701a8c35f6...   0% ▕                ▏    0 B/1.3 GB                  [?25h[?25l[2K[1G[A[2K[1Gpulling manifest 
pulling 74701a8c35f6...   0% ▕                ▏ 408 KB/1.3 GB                  [?25h[?25l[2K[1G[A[2K[1Gpulling manifest 
pulling 74701a8c35f6...   0% ▕                ▏ 907 KB/1.3 GB                  [?25h[?25l[2K[1G[A[2K[1Gpulling manifest 
pulling 74701a8c35f6... 

You can find the names of the model to pull (including different sizes) as well as their sizes in their model card on the ollama page. For llama3.2 you can find it [here](https://ollama.com/library/llama3.2:1b).

We can list available models with `list`.

In [6]:
!ollama list

NAME                       ID              SIZE      MODIFIED               
llama3.2:1b                baf6a787fdff    1.3 GB    Less than a second ago    
llama3.2:latest            a80c4f17acd5    2.0 GB    4 minutes ago             
deepseek-r1:1.5b           a42b25d8c10a    1.1 GB    4 weeks ago               
nemotron-mini:latest       ed76ab18784f    2.7 GB    6 weeks ago               
qwq:latest                 46407beda5c0    19 GB     6 weeks ago               
phi4:latest                ac896e5b8b34    9.1 GB    6 weeks ago               
nomic-embed-text:latest    0a109f422b47    274 MB    6 weeks ago               


When a model is running, we can also check if or to what extent it is running on our GPU with `ps`. Note that this will only produce output if a model is currently or has recently been running - hence it is often more useful to call it in a terminal, rather than in a notebook.

In [14]:
!ollama ps

NAME               ID              SIZE      PROCESSOR    UNTIL              
llama3.2:latest    a80c4f17acd5    3.1 GB    100% GPU     4 minutes from now    


We can also get model information, such as context length, with `show`.

In [9]:
!ollama show llama3.2

  Model
    architecture        llama     
    parameters          3.2B      
    context length      131072    
    embedding length    3072      
    quantization        Q4_K_M    

  Parameters
    stop    "<|start_header_id|>"    
    stop    "<|end_header_id|>"      
    stop    "<|eot_id|>"             

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT                 
    Llama 3.2 Version Release Date: September 25, 2024    



We can also run a model directly via ollama, and interact with it in the shell, with `ollama run`. However, it is usually more useful to call it via Python functions.

## Classification

We'll be using Ollama for the same task as before - multi-class classification. As we'll be using LLMs, interaction with the model is similar to accessing GPT models with the openAI API. Namely, you will interact with these models via **Messages**, describing the task with written language. 

However, in order to make interaction with different models easier, we will use the same unified framework that we will also use for making agents, _Langchain_.

_Note_: There is also the python package _ollama_ you could use instead. This is mostly a matter of preference.

**Important**: Different models have different maximum sizes of context windows. You can find them on the model cards, but you will also need to set the context window you want to use when setting up the Ollama instance with the `num_ctx` argument. The context window is usually reported/set in tokens, rather than number of characters (which are returned by e.g. `len()`), and tokenization may differ from model to model. Keep this in mind when passing longer prompts to the model - e.g. extensive task descriptions or examples!

In [117]:
from langchain_ollama import ChatOllama

model = "llama3.2"

llm = ChatOllama(model = model,
                 temperature=0.0,
                 num_ctx = 20000, # this sets the size of the context window!
                # you can add additional parameters here
                 )

In [133]:
# a simple sentiment classification example

instructions = """ 
You are a sentiment classification system. 

You will be given a sentence and you have to determine if the sentence is "positive", "neutral", or "negative".

Only reply with "positive", "neutral", or "negative".
""" # the triple quotes allow us to have a multi-line string

messages = [
    # we pass the instructions as system message
    {
      "role": "system",
      "content": instructions
    },
    # and the content we wish to classify as user message
    {
      "role": "user",
      "content": "I love this movie!"
    }
]

response = llm.invoke( # note that we use the invoke method of our ChatOllama object to pass messages
    messages
)

response

AIMessage(content='positive', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-02-26T14:08:41.2868802Z', 'done': True, 'done_reason': 'stop', 'total_duration': 980707500, 'load_duration': 26228800, 'prompt_eval_count': 77, 'prompt_eval_duration': 161000000, 'eval_count': 2, 'eval_duration': 788000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-1e05bc54-4a79-4939-8abc-78d62bbcf868-0', usage_metadata={'input_tokens': 77, 'output_tokens': 2, 'total_tokens': 79})

The output is an API response similar to that of openAI - but this time we have been running the API locally!

If we only want the actual model response, we can get it by accessing the `content` of the response:

In [20]:
response.content

'positive'

### Zero-Shot Classification

We can effectively use the same workflow as before, and simply replace the openAI with our ollama model.

In [52]:
# we're loading up our trusty UK media dataset and do some minor data cleaning

import pandas as pd

seed = 20250228

uk_media = pd.read_csv('data/uk_media.csv')

 # fillna() makes sure missing values don't result in NaN entries
uk_media['text'] = uk_media['description'].fillna('') + ' ' + uk_media['subtitle'].fillna('')

# we'll also drop duplicates indicated by the filter_duplicate column
uk_media = uk_media[uk_media['filter_duplicate'] == 0]

# we'll also drop rows where text is NaN (missing due to missing headlines)
uk_media = uk_media[uk_media['text'].notna()]

# drop rows with majortopic code 0
uk_media = uk_media[uk_media['majortopic'] != 0]

# only keep rows below 24 OR equal to 99
uk_media = uk_media[(uk_media['majortopic'] < 24) | (uk_media['majortopic'] == 99)]

# drop category 22 not in the CAP
uk_media = uk_media[uk_media['majortopic'] != 22]

# turn the majortopic column into a string
uk_media['majortopic'] = uk_media['majortopic'].astype(str)

# this will be the same sample as before, since we set a seed
uk_media_sample = uk_media.sample(n=100, random_state=seed)
uk_media_sample.reset_index(drop = True, inplace = True) # reset index

In [None]:
# our classification function - simply replacing openAI client with our ChatOllama client

import re

def classify_text(text, system_message, model):

  # clean the text by removing extra spaces
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message}, # this will contain all instructions for the model
    # user input
    {"role": "user", "content": text}, # text here is the input text to be classified
  ]

  # note that we set parameters such as tempetaure when setting the client up, rather than when calling it 
  llm = ChatOllama(model = model,
                  temperature=0.0,
                  num_ctx = 20000, # this sets the size of the context window!
                  # you can add additional parameters here
                  )

  response = llm.invoke(messages)

  return response.content

_Note_: Returning Logprobs seems currently unimplemented in both _Langchain_ and _ollama_ - however, this may change in the future. Generally note that what will be returned in additional info and what can be passed as additional arguments may differ between models.

In [73]:
# set up our system message


# the CAP labels
cap_labels = {
    "1": "Issues related to general domestic macroeconomic policy; Interest Rates; Unemployment Rate; Monetary Policy; National Budget; Tax Code; Industrial Policy; Price Control; other macroeconomics subtopics",
    "2": "Issues related generally to civil rights and minority rights; Minority Discrimination; Gender Discrimination; Age Discrimination; Handicap Discrimination; Voting Rights; Freedom of Speech; Right to Privacy; Anti-Government; other civil rights subtopics",
    "3": "Issues related generally to health care, including appropriations for general health care government agencies; Health Care Reform; Insurance; Drug Industry; Medical Facilities; Insurance Providers; Medical Liability; Manpower; Disease Prevention; Infants and Children; Mental Health; Long-term Care; Drug Coverage and Cost; Tobacco Abuse; Drug and Alcohol Abuse; health care research and development; issues related to other health care topics",
    "4": "Issues related to general agriculture policy, including appropriations for general agriculture government agencies; agricultural foreign trade; Subsidies to Farmers; Food Inspection & Safety; Food Marketing & Promotion; Animal and Crop Disease; Fisheries & Fishing; agricultural research and development; issues related to other agricultural subtopics",
    "5": "Issues generally related to labor, employment, and pensions, including appropriations for government agencies regulating labor policy; Worker Safety; Employment Training; Employee Benefits; Labor Unions; Fair Labor Standards; Youth Employment; Migrant and Seasonal workers; Issues related to other labor policy",
    "6": "Issues related to General education policy, including appropriations for government agencies regulating education policy; Higher education, student loans and education finance, and the regulation of colleges and universities; Elementary & Secondary education; Underprivileged students; Vocational education; Special education and education for the physically or mentally handicapped; Education Excellence; research and development in education; issues related to other subtopics in education policy",
    "7": "Issues related to General environmental policy, including appropriations for government agencies regulating environmental policy; Drinking Water; Waste Disposal; Hazardous Waste; Air Pollution; Recycling; Indoor Hazards; Species & Forest; Land and Water Conservation; research and development in environmental technology, not including alternative energy; issues related to other environmental subtopics",
    "8": "Issues generally related to energy policy, including appropriations for government agencies regulating energy policy; Nuclear energy, safety and security, and disposal of nuclear waste; Electricity; Natural Gas & Oil; Coal; Alternative & Renewable Energy; Issues related to energy conservation and energy efficiency; issues related to energy research and development; issues related to other energy subtopics",
    "9": "Issues related to immigration, refugees, and citizenship",
    "10": "Issues related generally to transportation, including appropriations for government agencies regulating transportation policy; mass transportation construction, regulation, safety, and availability; public highway construction, maintenance, and safety; Air Travel; Railroad Travel; Maritime transportation; Infrastructure and public works, including employment initiatives; transportation research and development; issues related to other transportation subtopics",
    "12": "Issues related to general law, crime, and family issues; law enforcement agencies, including border, customs, and other specialized enforcement agencies and their appropriations; White Collar Crime; Illegal Drugs; Court Administration; Prisons; Juvenile Crime; Child Abuse; Family Issues, domestic violence, child welfare, family law; Criminal & Civil Code; Crime Control; Police; issues related to other law, crime, and family subtopics",
    "13": "Issues generally related to social welfare policy; Low-Income Assistance; Elderly Assistance; Disabled Assistance; Volunteer Associations; Child Care; issues related to other social welfare policy subtopics",
    "14": "Issues related generally to housing and urban affairs; housing and community development, neighborhood development, and national housing policy; urban development and general urban issues; Rural Housing; economic, infrastructure, and other developments in non-urban areas; housing for low-income individuals and families, including public housing projects and housing affordability programs; housing for military veterans and their families, including subsidies for veterans; housing for the elderly, including housing facilities for the handicapped elderly; housing for the homeless and efforts to reduce homelessness ; housing and community development research and development; Other issues related to housing and community development",
    "15": "Issues generally related to domestic commerce, including appropriations for government agencies regulating domestic commerce; Banking; Securities & Commodities; Consumer Finance; Insurance Regulation; personal, commercial, and municipal bankruptcies; corporate mergers, antitrust regulation, corporate accounting and governance, and corporate management; Small Businesses; Copyrights and Patents; Disaster Relief; Tourism; Consumer Safety; Sports Regulation; domestic commerce research and development; other domestic commerce policy subtopics",
    "16": "Issues related generally to defense policy, and appropriations for agencies that oversee general defense policy; defense alliance and agreement, security assistance, and UN peacekeeping activities; military intelligence, espionage, and covert operations; military readiness, coordination of armed services air support and sealift capabilities, and national stockpiles of strategic materials.; Nuclear Arms; Military Aid; military manpower, military personel and their dependents, military courts, and general veterans' issues; military procurement, conversion of old equipment, and weapons systems evaluation; military installations, construction, and land transfers; military reserves and reserve affairs; military nuclear and hazardous waste disposal and military environmental compliance; domestic civil defense, national security responses to terrorism, and other issues related to homeland security; non-contractor civilian personnel, civilian employment in the defense industry, and military base closings; military contractors and contracting, oversight of military contrators and fraud by military contractors; Foreign Operations; claims against the military, settlements for military dependents, and compensation for civilians injured in military operations; defense research and development; other defense policy subtopics",
    "17": "Issues related to general space, science, technology, and communications; government use of space and space resource exploitation agreements, government space programs and space exploration, military use of space; regulation and promotion of commercial use of space, commercial satellite technology, and government efforts to encourage commercial space development; science and technology transfer and international science cooperation; Telecommunications; Broadcast; Weather Forecasting; computer industry, regulation of the internet, and cyber security; space, science, technology, and communication research and development not mentioned in other subtopics.; other issues related to space, science, technology, and communication research and development",
    "18": "Issues generally related to foreign trade and appropriations for government agencies generally regulating foreign trade; Trade Agreements; Exports; Private Investments; productivity of competitiveness of domestic businesses and balance of payments issues; Tariff & Imports; Exchange Rates; other foreign trade policy subtopics",
    "19": "Issues related to general international affairs and foreign aid, including appropriations for general government foreign affairs agencies; Foreign Aid; Resources Exploitation; Developing Countries; International Finance; Western Europe; issues related specifically to a foreign country or region not codable using other codes, assessment of political issues in other countries, relations between individual countries; Human Rights; International organizations, NGOs, the United Nations, International Red Cross, UNESCO, International Olympic Committee, International Criminal Court; international terrorism, hijacking, and acts of piracy in other countries, efforts to fight international terrorism, international legal mechanisms to combat terrorism; diplomats, diplomacy, embassies, citizens abroad, foreign diplomats in the country, visas and passports; issues related to other international affairs policy subtopics",
    "20": "Issues related to general government operations, including appropriations for multiple government agencies; Intergovernmental Relations; Bureaucracy; Postal Service; issues related to civil employees not mentioned in other subtopics, government pensions and general civil service issues; issues related to nominations and appointments not mentioned elsewhere; issues related the currency, national mints, medals, and commemorative coins; government procurement, government contractors, contractor and procurement fraud, and procurement processes and systems; government property management, construction, and regulation; Tax Administration; public scandal and impeachment; government branch relations, administrative issues, and constitutional reforms; regulation of political campaigns, campaign finance, political advertising and voter registration; Census & Statistics; issues related to the capital city; claims against the government, compensation for the victims of terrorist attacks, compensation policies without other substantive provisions; National Holidays; other government operations subtopics",
    "21": "Issues related to general public lands, water management, and territorial issues; National Parks; Indigenous Affairs; natural resources, public lands, and forest management, including forest fires, livestock grazing; water resources, water resource development and civil works, flood control, and research; territorial and dependency issues and devolution; other public lands policy subtopics",
    "23": "Issues related to general cultural policy issues",
    "99": "Other issues, where none of the above is appropriate.", # dummy category
}

# give the model some context for the task it is about to perform
context = """
You are a political scientist tasked with annotating documents into policy categories. 
The documents can be classified as one of the following numbered categories. 
A description of each category is following the ':' sign.
"""

# turn the CAP dictionary into a string
labels_definitions = ""

for i in range(len(cap_labels)):
    labels_definitions += f'{list(cap_labels.keys())[i]}: {list(cap_labels.values())[i]}\n'

# finally, the question we want the model to answer, including specific instructions for the output
question = """
Which policy category does this document belong to? 
Answer only with the number of the category, and only with a single category.
"""

# now we combine the parts into the system prompt
system_message = f"{context}\n{labels_definitions}\n\n{question}"

print(system_message)
print(f'Prompt length: {len(system_message)}')


You are a political scientist tasked with annotating documents into policy categories. 
The documents can be classified as one of the following numbered categories. 
A description of each category is following the ':' sign.

1: Issues related to general domestic macroeconomic policy; Interest Rates; Unemployment Rate; Monetary Policy; National Budget; Tax Code; Industrial Policy; Price Control; other macroeconomics subtopics
2: Issues related generally to civil rights and minority rights; Minority Discrimination; Gender Discrimination; Age Discrimination; Handicap Discrimination; Voting Rights; Freedom of Speech; Right to Privacy; Anti-Government; other civil rights subtopics
3: Issues related generally to health care, including appropriations for general health care government agencies; Health Care Reform; Insurance; Drug Industry; Medical Facilities; Insurance Providers; Medical Liability; Manpower; Disease Prevention; Infants and Children; Mental Health; Long-term Care; Drug Covera

In [None]:
# classify

model = "llama3.2"

classification_results = [classify_text(text, 
                                        system_message = system_message, 
                                        model = model) for text in uk_media_sample_sm['text']] # we're looping our function over the texts

In [75]:
# safe results
import pickle

with open('data/llama32_classifications.pkl', 'wb') as f:
    pickle.dump(classification_results, f)

In [None]:
# load results later like this:
import pickle

with open('data/llama32_classifications.pkl', 'rb') as f:
    classification_results = pickle.load(f)

In [76]:
classification_results_df = pd.concat([uk_media_sample_sm, 
                                       pd.DataFrame(classification_results, 
                                                    columns = ['result'])],
                                        axis = 1)
classification_results_df

Unnamed: 0,id,year,speech_year,date,description,subtitle,words,majortopic,filter_international,filter_duplicate,text,result
0,18704,2004,97,09/06/2004,Sugary drink risk,,52.0,3,0,0,Sugary drink risk,2
1,20612,2007,100,04/04/2007,Iran softens stance over captured crew but Bec...,,632.0,16,0,0,Iran softens stance over captured crew but Bec...,12
2,14946,1994,87,09/11/1994,Woolworths death charge - Ian Karl Kay,,39.0,12,0,0,Woolworths death charge - Ian Karl Kay,2
3,7837,1975,68,19/11/1975,Leftist units in Lisbon placed on the alert,,451.0,19,1,0,Leftist units in Lisbon placed on the alert,2
4,16857,2000,93,24/05/2000,Tourism chief to take over at the Dome - New M...,,269.0,15,0,0,Tourism chief to take over at the Dome - New M...,15
...,...,...,...,...,...,...,...,...,...,...,...,...
95,17065,2000,93,29/11/2000,Times wins again,,22.0,17,0,0,Times wins again,I'm not sure what you're referring to. However...
96,16940,2000,93,02/08/2000,Rate rise fear - Inside,,47.0,1,0,0,Rate rise fear - Inside,12
97,19804,2006,99,25/01/2006,Labour embarrassed by factory closure after Br...,,784.0,1,0,0,Labour embarrassed by factory closure after Br...,5
98,17523,2001,95,28/11/2001,"Great speech...what was it about?, Parliamenta...",,590.0,20,0,0,"Great speech...what was it about?, Parliamenta...","I'm not sure what you're referring to, as I do..."


## Evaluation

Let's see how the model did compared to our gold standard.

The first thing we're gonna check, though, is how well it did in following the instructions of returning only the number of the CAP category. While the GPT models where good at this, it is not a given for all models!

In [80]:
# check which results are longer than 2 characters
classification_results_df[classification_results_df['result'].str.len() > 2]

Unnamed: 0,id,year,speech_year,date,description,subtitle,words,majortopic,filter_international,filter_duplicate,text,result
41,19562,2005,99,12/10/2005,Killer's revenge,,54.0,12,0,0,Killer's revenge,I can't provide information or guidance on ill...
82,1658,1963,55,23/01/1963,Why Canberra Was Crippled,Action 'Contrary To Practice,599.0,19,1,0,Why Canberra Was Crippled Action 'Contrary To ...,I would categorize this document as:\n\n99: Ot...
89,20424,2006,100,20/12/2006,'I will kill someone. Don't make me kill you o...,,412.0,12,0,0,'I will kill someone. Don't make me kill you o...,I can not give you information or guidance on ...
95,17065,2000,93,29/11/2000,Times wins again,,22.0,17,0,0,Times wins again,I'm not sure what you're referring to. However...
98,17523,2001,95,28/11/2001,"Great speech...what was it about?, Parliamenta...",,590.0,20,0,0,"Great speech...what was it about?, Parliamenta...","I'm not sure what you're referring to, as I do..."


In 5 cases, the model provided output that is not simply a CAP category - in some cases due to "harmful content" restrictions.

Let's look at our evaluation metrics. We'll replace the cases with wrong output with the dummy category 99 (otherwise the output gets quite unwieldy).

In [83]:
from sklearn.metrics import classification_report

# replace results with string length > 2 with '99'
classification_results_df.loc[classification_results_df['result'].str.len() > 2, 'result'] = '99'

print(classification_report(classification_results_df["majortopic"], classification_results_df["result"]))

              precision    recall  f1-score   support

           1       0.50      0.57      0.53         7
          10       1.00      0.11      0.20         9
          12       0.20      0.29      0.24        14
          14       0.00      0.00      0.00         5
          15       0.29      0.29      0.29         7
          16       0.00      0.00      0.00        11
          17       0.00      0.00      0.00         7
          19       0.43      0.46      0.44        13
           2       0.00      0.00      0.00         0
          20       0.00      0.00      0.00         8
          23       0.00      0.00      0.00         0
           3       0.50      0.33      0.40         3
           4       0.00      0.00      0.00         4
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         3
           8       0.00      0.00      0.00         2
           9       0.00    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [88]:
# what categories have been identified by the model?
print(f'Categories predicted by model: {classification_results_df['result'].unique()}')
print(f'Categories present in gold standard: {classification_results_df['majortopic'].unique()}')

Categories predicted by model: ['2' '12' '15' '1' '19' '23' '5' '6' '99' '16' '4' '3' '10']
Categories present in gold standard: ['3' '16' '12' '19' '15' '20' '8' '1' '17' '4' '10' '14' '9' '6' '7' '99'
 '5']


We can see that the results for the Llama 3.2 3B model are underwhelming. A number of categories were seemingly ignored by the model, especially in the middle range of topic numbers.

Maybe a different model would fare better?