# Most Experimental Lesson of the workshop - How to label with LLMs

### Disclaimer for those who do not know me:
* I'm no professional instructor
* This was written while I was working on something so everything's jumbled
* This is not a full-on demonstration; I simplified things to keep it small and timely

### Notes:
* This is all a result of my trial and error
* Of course context changes everything
* We're working with Ollama because it's free. In a workplace you'll probably be able to use APIs; other than the initial setup and the specific functions (which differ per API), the work is pretty much the same

### Concepts to keep in mind:
* Golden set/ground truth - it'll be super helpful if we have a correctly labeled dataset to compare results against
* Do not attempt to do 2 things at once with LLMs. Always break up the task as much as possible: instead of 1 question with 2 parts, ask 2 questions separately; first extract a response and only afterwards parse it, etc.

### My lessons learned the hard way:
* Label simple questions. I call them 10-agorot questions
* Do not prompt engineer here. The simpler the better. Ask a direct question with direct answers.
* You can break down a complex question into simple questions for better results. LLMs are not smart. Treat them like crowd labelers
* If you run an LLM over a big dataset, no matter what, eventually you get weird answers, so keep the labeling in small chunks
* Structure your output in a way you can QA it later
* Ask for your output on a scale with an even number of points (so there is no neutral middle value). That way you get both the answer itself and a sense of how clear-cut it is.
* Scale slowly. I find it convenient to start with a few examples and edit the prompt until they come out OK, then take a small sample (let's say 150 lines) and use it as a golden set to iterate until the process is good enough; only then check it on a larger sample
* Something that really helps improve quality: ask for the reasoning first and the score/answer after (learned this in PyData!!!)
* To make manual inspection quicker and less confusing, work by category, such as positive predictions (and then mark each as a true or false positive; by the way, what constitutes a true positive is itself a decision that needs to be made), and use any other data you might have (previous checks, other sources of truth...)
* I have this feeling that there's always going to be a data category that the process just won't work for, and it's better to try to identify it and remove it from the dataset for LLM labeling
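
The "small chunks" and "structure your output so you can QA it" points can be sketched roughly like this. It is only a minimal illustration, not the code used in the workshop: `chunked`, `label_in_chunks` and `label_fn` are made-up names, and `label_fn` stands in for whatever function actually calls the model.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `chunk_size` items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def label_in_chunks(
    texts: List[str],
    label_fn: Callable[[str], str],  # hypothetical: wraps the real LLM call
    chunk_size: int = 50,
) -> List[Dict[str, object]]:
    """Label texts chunk by chunk, keeping the raw response and chunk number for QA."""
    rows: List[Dict[str, object]] = []
    for chunk_no, chunk in enumerate(chunked(texts, chunk_size)):
        for text in chunk:
            rows.append({"chunk": chunk_no, "text": text, "raw_response": label_fn(text)})
        # A natural point to save `rows` and look at the latest chunk before continuing.
    return rows


# Demo with a stand-in labeler instead of a real model call.
demo = label_in_chunks(["a", "b", "c"], label_fn=lambda t: t.upper(), chunk_size=2)
print(demo)
```

Keeping the chunk number next to each raw response makes it easier to spot the point where the answers start getting weird.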

### How to work towards a goal:
* Evaluate the data and how much you can improve. There will always be a small number of false positives/negatives, so decide which you care about more (it's a context/business decision)
* I haven't yet built any process that goes directly to production; it always stops at a human station on the way, so part of my goal is to make the output easier for the human to read and understand
* Easier said than done: keep documentation of what you tried and how well it worked, it will really help later to explain your decisions to non-believers
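
One low-effort way to keep that documentation is a log with one JSON line per experiment. The sketch below is just an assumption about how such a log could look: `log_experiment`, `read_experiments` and all field names and values are invented for illustration.

```python
import io
import json


def log_experiment(stream, **fields):
    """Write one experiment record to `stream` as a single JSON line."""
    stream.write(json.dumps(fields, ensure_ascii=False) + "\n")


def read_experiments(stream):
    """Read back all records written by `log_experiment`."""
    return [json.loads(line) for line in stream if line.strip()]


# Demo with an in-memory buffer; in practice this would be a file opened in append mode.
# The records are illustrative values, not measured results.
buffer = io.StringIO()
log_experiment(buffer, prompt="v1: plain yes/no question", model="tinyllama", rows=10, note="free-text answers")
log_experiment(buffer, prompt="v2: JSON schema in prompt", model="phi3", rows=10, note="needs output cleanup")
buffer.seek(0)
for record in read_experiments(buffer):
    print(record)
```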

### Avenues I haven't explored yet:
* Difference when putting info in the system prompt vs. regular prompt
* Ask for output in natural language and parse it myself later (example output: review mentions brand, review does not mention brand)
* Fine-tuning a small model for specific labeling tasks (I did it at eBay with a big model, but there I also had a convenient platform and lots of resources)


## Setup

In [1]:
import ollama

response = ollama.chat(
    model='tinyllama',
    messages=[{'role': 'user', 'content': 'What is the capital of France?'}]
)

response['message']['content']

'The capital of France is Paris.'

In [3]:
response

ChatResponse(model='tinyllama', created_at='2025-11-20T13:18:17.5170768Z', done=True, done_reason='stop', total_duration=4153239700, load_duration=2769273600, prompt_eval_count=41, prompt_eval_duration=997091000, eval_count=8, eval_duration=376594100, message=Message(role='assistant', content='The capital of France is Paris.', images=None, tool_calls=None))

In [5]:
import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect(r'C:\Users\ydool\Downloads\workshop_db.db')

In [7]:
pd.read_sql("""select r.product_id, r.review_text, r.review_id, p.brand, p.model
from reviews r
left join labeled_data l on l.review_id=r.review_id
left join products p on p.product_id=r.product_id
where l.review_id is NULL
limit 10""", conn)

Unnamed: 0,product_id,review_text,review_id,brand,model
0,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""This phone is exactly what I was looking for ...",1bc28563-cd53-45f9-975d-2e3100047573,Google,Pixel 3a
1,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google Pixeel 3a Silver 256 is an excelle...",85f03c45-5bfa-4a49-9f57-4ef69855e31f,Google,Pixel 3a
2,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google Pixeel 3a Silver 256 is an excelle...",5e9bae6f-2f52-420e-b83d-61eaa66bb790,Google,Pixel 3a
3,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The phone is perfect for people who need a ba...",86d5861c-d801-492f-a6c5-fc3d2b39506f,Google,Pixel 3a
4,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google PixeL 3a Silver 256 is a reliable ...",75bdf9af-ce48-4d69-8d38-38d48c62e50b,Google,Pixel 3a
5,f4cba840-7701-463b-976f-8cc064bc4be9,"""I've been using my brand new Apple iPhone 3GS...",920ee619-16b0-4692-a469-c0c142d17ebc,Apple,iPhone 3GS
6,f4cba840-7701-463b-976f-8cc064bc4be9,"""As an avid Apple fan, I've used this latest s...",6c2aaf63-f36f-470e-af4b-06f9bf693ecd,Apple,iPhone 3GS
7,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Sorry, but this 11-inch gold-colored version ...",0cc1c19e-1a9c-4ee4-8127-cc9843f91d5e,Apple,iPhone 11
8,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Overall, I'm impressed with this latest itera...",48a07eef-0ef8-482f-9e58-c943d7c4c638,Apple,iPhone 11
9,e553e8d1-a8de-4476-a2c7-f6b5858929b0,"""The Samsung Galaxy A51 Blue 256 is an excelle...",1a1ee561-5b0f-45bb-b12f-448e84811d7c,Samsung,Galaxy A51


In [9]:
sample = pd.read_sql("""select r.product_id, r.review_text, r.review_id, p.brand, p.model
from reviews r
left join labeled_data l on l.review_id=r.review_id
left join products p on p.product_id=r.product_id
where l.review_id is NULL
limit 10""", conn)

## Exploratory stage

Let's say our goal is to label which phone features are mentioned in the review. Let's break it down:
* First question: are features mentioned
* If so, which features
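
As a rough sketch of this breakdown (not the notebook's actual code), the two questions can be asked separately, with the second one asked only when the first gets a clear yes. Everything here is hypothetical: `parse_yes_no`, `label_review`, the prompt wording, and `ask`, which stands in for any function that sends a prompt to a model and returns its reply.

```python
from typing import Callable, Optional


def parse_yes_no(reply: str) -> Optional[bool]:
    """Return True/False if the reply starts with yes/no, else None (needs a human look)."""
    words = reply.strip().lower().split(maxsplit=1)
    if not words:
        return None
    token = words[0].strip(".,!:;\"'")
    if token == "yes":
        return True
    if token == "no":
        return False
    return None


def label_review(review: str, ask: Callable[[str], str]) -> dict:
    """Ask the two questions separately; the second only after a clear yes."""
    mentions_raw = ask("Does this review mention phone features? Answer yes or no: " + review)
    mentions = parse_yes_no(mentions_raw)
    features_raw = None
    if mentions is True:
        features_raw = ask("List the phone features mentioned in this review: " + review)
    return {"mentions_features": mentions, "mentions_raw": mentions_raw, "features_raw": features_raw}


# Demo with a fake model so the example runs without Ollama.
def fake_ask(prompt: str) -> str:
    return "Yes." if prompt.startswith("Does") else "camera, display"


print(label_review("Great camera and display.", fake_ask))
```

Replies that are neither a clear yes nor a clear no come back as `None`, so they can be routed to manual inspection instead of being guessed.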

In [11]:
sample['naive check'] = sample['review_text'].str.contains('features')

In [13]:
sample

Unnamed: 0,product_id,review_text,review_id,brand,model,naive check
0,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""This phone is exactly what I was looking for ...",1bc28563-cd53-45f9-975d-2e3100047573,Google,Pixel 3a,False
1,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google Pixeel 3a Silver 256 is an excelle...",85f03c45-5bfa-4a49-9f57-4ef69855e31f,Google,Pixel 3a,True
2,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google Pixeel 3a Silver 256 is an excelle...",5e9bae6f-2f52-420e-b83d-61eaa66bb790,Google,Pixel 3a,True
3,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The phone is perfect for people who need a ba...",86d5861c-d801-492f-a6c5-fc3d2b39506f,Google,Pixel 3a,False
4,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google PixeL 3a Silver 256 is a reliable ...",75bdf9af-ce48-4d69-8d38-38d48c62e50b,Google,Pixel 3a,False
5,f4cba840-7701-463b-976f-8cc064bc4be9,"""I've been using my brand new Apple iPhone 3GS...",920ee619-16b0-4692-a469-c0c142d17ebc,Apple,iPhone 3GS,False
6,f4cba840-7701-463b-976f-8cc064bc4be9,"""As an avid Apple fan, I've used this latest s...",6c2aaf63-f36f-470e-af4b-06f9bf693ecd,Apple,iPhone 3GS,True
7,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Sorry, but this 11-inch gold-colored version ...",0cc1c19e-1a9c-4ee4-8127-cc9843f91d5e,Apple,iPhone 11,False
8,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Overall, I'm impressed with this latest itera...",48a07eef-0ef8-482f-9e58-c943d7c4c638,Apple,iPhone 11,True
9,e553e8d1-a8de-4476-a2c7-f6b5858929b0,"""The Samsung Galaxy A51 Blue 256 is an excelle...",1a1ee561-5b0f-45bb-b12f-448e84811d7c,Samsung,Galaxy A51,False


In [15]:
#positive example
positive_example = sample[sample['naive check']==True]['review_text'][1]
positive_example

'"The Google Pixeel 3a Silver 256 is an excellent smartphone that boasts top-notch features at a reasonable price point. Its high resolution display, efficient processor, and impressive camera make it an ideal choice for those looking for an affordable, high-performance device. With its user-friendly interface and reliable performance, this phone is an excellent investment for anyone in the market for a dependable smartphone."'

In [17]:
#negative example
negative_example = sample[sample['naive check']!=True]['review_text'][3]
negative_example

'"The phone is perfect for people who need a basic and reliable smartphone. It has a sturdy build, a functional design, and performs well with basic tasks like calling and messaging. The camera quality, as expected from this price range, is good, but not outstanding. Overall, this phone offers a decent value for money, making it an excellent option for anyone who wants to stay connected on the go."'

### First stage is to just get some correct answers with the simplest prompt possible.
I started with "does this review explicitly mention phone features", then added "answer in yes or no", then added "special features" to get what I wanted

In [25]:
prompt = "does this review explicitly mention phone special features, answer in yes or no: "

In [27]:
response = ollama.chat(
    model='tinyllama',
    messages=[{'role': 'user', 'content': prompt + positive_example}]
)

response['message']['content']

"The passage does not explicitly mention phone special features, but it does provide some information about phone performance and a detailed review of the Google Pixeel 3a Silver 256. The article mentions that the phone's user-friendly interface, reliable performance, and top-notch features make it an excellent investment for those in the market for a dependable smartphone. It does not specifically mention any specific phone special features such as a high resolution display or efficient processor. However, overall, the phone's combination of good design, decent hardware, and strong software makes it a great choice for anyone looking for an affordable, high-performing device."

In [28]:
response = ollama.chat(
    model='tinyllama',
    messages=[{'role': 'user', 'content': prompt + negative_example}],
    options =  {"temperature" : 0.1}
)

response['message']['content']

'Yes, the review explicitly mentions phone special features such as:\n\n- Phone Special Features: The phone is perfect for people who need a basic and reliable smartphone. It has a sturdily built design with functional features like calling and messaging. The camera quality is good, but not outstanding, and overall, it offers a decent value for money, making it an excellent option for anyone who wants to stay connected on the go.'

#### Let's force a specific answer format:

In [None]:
from pydantic import BaseModel,ValidationError

class RatingModel(BaseModel):
    relevance: int
    answer: bool
    reason: str
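
`RatingModel` describes the three fields we expect back. As a dependency-free sketch of the same kind of check, a parsed response can be validated with plain Python. `validate_rating` is a made-up helper, and the 1-10 range is an assumption taken from the prompt used in this notebook, not something the model class itself enforces.

```python
from typing import Any, List


def validate_rating(obj: Any) -> List[str]:
    """Return a list of problems; an empty list means the object fits the schema."""
    if not isinstance(obj, dict):
        return ["response is not a JSON object"]
    problems: List[str] = []
    relevance = obj.get("relevance")
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    if isinstance(relevance, bool) or not isinstance(relevance, int):
        problems.append("relevance is not an integer")
    elif not 1 <= relevance <= 10:
        problems.append("relevance is outside 1-10")
    if not isinstance(obj.get("answer"), bool):
        problems.append("answer is not a boolean")
    reason = obj.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        problems.append("reason is missing or empty")
    return problems


print(validate_rating({"reason": "Mentions the camera.", "relevance": 8, "answer": True}))
print(validate_rating({"reason": "", "relevance": 11, "answer": "yes"}))
```

Returning the list of problems, rather than just pass/fail, gives you something to group by when you QA the rejected rows.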

In [34]:
prompt = """Does the following review explicitly mention special phone features?"""

format_prompt = """

relevance should be on a scale of 1-10 where 1 means no special features are mentioned, 10 means special features are explicitly mentioned
Return ONLY a valid JSON object that matches this schema:
{
  "reason": str,
  "relevance": int, 
  "answer": bool
  
}

"""

In [36]:
response = ollama.chat(
    model='phi3',
    messages=[{'role': 'user', 'content': prompt + positive_example + format_prompt }],
    options =  {"temperature" : 0.1}
)["message"]["content"]

print(response)

```json

{

  "reason": "The review does not explicitly mention any special phone features.",

  "relevance": 5,

  "answer": false

}

```

In this case, the relevance is a mid-range score because while there are hints at potential unique selling points (e.g., high resolution display and efficient processor), they're not described as special features but rather standard expectations for modern smartphones. The answer field indicates that no explicit mention of phone feature uniqueness was found in the text provided, hence it is set to false.


In [37]:
response = ollama.chat(
    model='phi3',
    messages=[{'role': 'user', 'content': prompt + negative_example + format_prompt}],
    options =  {"temperature" : 0.1}
)["message"]["content"]

print(response)

```json

{

  "reason": "The review does not explicitly mention any special phone features.",

  "relevance": 3,

  "answer": false

}

```

In this case, the relevance score is a bit higher than what might be expected because while there's no direct reference to unique or standout features like facial recognition, augmented reality capabilities, etc., some readers may infer that since it performs well with basic tasks and offers decent value for money in its price range, these could count as implicit special features. However, the review does not explicitly state any such phone-specific attributes beyond general performance expectations of a smartphone at this tier.


## Label a sample

I recommend first only extracting a response and then processing it separately. Somehow, trying to extract a response and process it simultaneously with LLMs doesn't really work

In [None]:
# def helper(review):
#     response = ollama.chat(
#     model='tinyllama',
#     messages=[{'role': 'user', 'content': prompt + review}],
#     options =  {"temperature" : 0.1})["message"]["content"]
#     return response

In [40]:
import re
import json


def helper(review):
    response = ollama.chat(
        model='phi3',
        messages=[{'role': 'user', 'content': prompt + review + format_prompt}],
        options={"temperature": 0.1}
    )["message"]["content"]

    return response

In [42]:
sample['llm_response'] = sample['review_text'].apply(helper)

In [70]:
print(sample['llm_response'][0])

{

  "reason": "The review does not explicitly mention any special phone features such as unique hardware capabilities or software-specific functions.",

  "relevance": 5,

  "answer": false

}




In [44]:
sample['llm_response'][0][9:-5]

'{\n\n  "reason": "The review does not explicitly mention any special phone features.",\n\n  "relevance": 3,\n\n  "answer": false\n\n}\n\n```\n\nIn this case, the relevance is a bit higher than what might be expected because while there are some implied benefits that could relate to \'special\' aspects of modern phones (like long battery life and camera quality), these features aren\'t explicitly highlighted as special or unique. The answer field reflects whether any phone-specific, standout feature is mentioned—in this case, it isn\'t directly stated in the review prov'

In [45]:
sample['llm_response'] = sample['llm_response'].apply(lambda x: x[9:-5])
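
The fixed slice `[9:-5]` only works when the reply is exactly a Markdown code fence around the JSON with nothing after it. The output above shows the model adding commentary after the closing fence, and those rows then fail to parse. A more forgiving sketch, using only the standard library's `json.JSONDecoder.raw_decode`, looks for the first parsable JSON object anywhere in the reply. `extract_first_json_object` is a made-up name, and this is one possible approach rather than the one used in this notebook.

```python
import json
from typing import Optional


def extract_first_json_object(text: object) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None if there is none."""
    if not isinstance(text, str):
        return None
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Demo reply shaped like the model output: fenced JSON plus trailing commentary.
fence = "`" * 3  # built this way only to avoid a literal fence inside this example
reply = (
    fence + 'json\n\n{\n\n  "reason": "No special features.",\n\n'
    '  "relevance": 3,\n\n  "answer": false\n\n}\n\n' + fence
    + "\n\nIn this case, the relevance is low."
)
print(extract_first_json_object(reply))
```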

In [46]:
def safe_json_loads(x):
    try:
        return json.loads(x)
    except Exception:
        return None

# Usage with pandas apply
sample['llm_response_json'] = sample['llm_response'].apply(safe_json_loads)


In [47]:
sample = sample[~sample['llm_response_json'].isna()].copy()

In [48]:
# Reuse the already-parsed dicts instead of parsing each response three more times
sample['relevance'] = sample['llm_response_json'].apply(lambda d: d['relevance'])
sample['answer'] = sample['llm_response_json'].apply(lambda d: d['answer'])
sample['reason'] = sample['llm_response_json'].apply(lambda d: d['reason'])

In [49]:
sample

Unnamed: 0,product_id,review_text,review_id,brand,model,naive check,llm_response,llm_response_json,relevance,answer,reason
4,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google PixeL 3a Silver 256 is a reliable ...",75bdf9af-ce48-4d69-8d38-38d48c62e50b,Google,Pixel 3a,False,"{\n\n ""reason"": ""The review does not explicit...",{'reason': 'The review does not explicitly men...,5,False,The review does not explicitly mention any spe...
7,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Sorry, but this 11-inch gold-colored version ...",0cc1c19e-1a9c-4ee4-8127-cc9843f91d5e,Apple,iPhone 11,False,"{\n\n ""reason"": ""The review does not explicit...",{'reason': 'The review does not explicitly men...,1,False,The review does not explicitly mention any spe...
9,e553e8d1-a8de-4476-a2c7-f6b5858929b0,"""The Samsung Galaxy A51 Blue 256 is an excelle...",1a1ee561-5b0f-45bb-b12f-448e84811d7c,Samsung,Galaxy A51,False,"{\n\n ""reason"": ""The review does not explicit...",{'reason': 'The review does not explicitly men...,5,False,The review does not explicitly mention any spe...


In [50]:
ground_truth = pd.read_csv('ground_truth.csv')

In [51]:
ground_truth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   product_id    9 non-null      object
 1   review_text   9 non-null      object
 2   review_id     9 non-null      object
 3   ground_truth  9 non-null      bool  
dtypes: bool(1), object(3)
memory usage: 357.0+ bytes


In [72]:
ground_truth

Unnamed: 0,product_id,review_text,review_id,ground_truth
0,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""This phone is exactly what I was looking for ...",1bc28563-cd53-45f9-975d-2e3100047573,False
1,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google Pixeel 3a Silver 256 is an excelle...",85f03c45-5bfa-4a49-9f57-4ef69855e31f,True
2,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google Pixeel 3a Silver 256 is an excelle...",5e9bae6f-2f52-420e-b83d-61eaa66bb790,True
3,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The phone is perfect for people who need a ba...",86d5861c-d801-492f-a6c5-fc3d2b39506f,False
4,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google PixeL 3a Silver 256 is a reliable ...",75bdf9af-ce48-4d69-8d38-38d48c62e50b,True
5,f4cba840-7701-463b-976f-8cc064bc4be9,"""I've been using my brand new Apple iPhone 3GS...",920ee619-16b0-4692-a469-c0c142d17ebc,False
6,f4cba840-7701-463b-976f-8cc064bc4be9,"""As an avid Apple fan, I've used this latest s...",6c2aaf63-f36f-470e-af4b-06f9bf693ecd,False
7,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Sorry, but this 11-inch gold-colored version ...",0cc1c19e-1a9c-4ee4-8127-cc9843f91d5e,False
8,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Overall, I'm impressed with this latest itera...",48a07eef-0ef8-482f-9e58-c943d7c4c638,True


In [52]:
sample = sample.merge(ground_truth, how='left', on=['product_id', 'review_text', 'review_id'])

In [53]:
sample['answer same as ground truth'] = sample['answer']==sample['ground_truth']

In [54]:
sample

Unnamed: 0,product_id,review_text,review_id,brand,model,naive check,llm_response,llm_response_json,relevance,answer,reason,ground_truth,answer same as ground truth
0,06bfafe7-060d-4eb8-8e38-aaa2cbe9b99b,"""The Google PixeL 3a Silver 256 is a reliable ...",75bdf9af-ce48-4d69-8d38-38d48c62e50b,Google,Pixel 3a,False,"{\n\n ""reason"": ""The review does not explicit...",{'reason': 'The review does not explicitly men...,5,False,The review does not explicitly mention any spe...,True,False
1,431c56b2-2550-4f9d-b983-3f92a3fb7bab,"""Sorry, but this 11-inch gold-colored version ...",0cc1c19e-1a9c-4ee4-8127-cc9843f91d5e,Apple,iPhone 11,False,"{\n\n ""reason"": ""The review does not explicit...",{'reason': 'The review does not explicitly men...,1,False,The review does not explicitly mention any spe...,False,True
2,e553e8d1-a8de-4476-a2c7-f6b5858929b0,"""The Samsung Galaxy A51 Blue 256 is an excelle...",1a1ee561-5b0f-45bb-b12f-448e84811d7c,Samsung,Galaxy A51,False,"{\n\n ""reason"": ""The review does not explicit...",{'reason': 'The review does not explicitly men...,5,False,The review does not explicitly mention any spe...,,False


## Evaluating results

### Confusion matrix
Mark your results as:
* True Positive (TP) - LLM labeled as true (=review mentions features) and indeed the review mentions features
* True Negative (TN) - LLM labeled as false (=review does NOT mention features) and indeed the review does not mention features
* False Positive (FP) - LLM labeled as true (=review mentions features) BUT the review does not mention features
* False Negative (FN) - LLM labeled as false (=review does NOT mention features) BUT the review does actually mention features
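
A minimal sketch of counting these four buckets from two lists of plain Python booleans. `confusion_counts` is a made-up helper. Rows where either value is missing are counted separately instead of as errors: in the merged table above the third row has no ground truth, so comparing it with `==` marks it as a mismatch even though we simply don't know the right answer.

```python
from typing import Dict, Iterable, Optional


def confusion_counts(
    predicted: Iterable[Optional[bool]],
    actual: Iterable[Optional[bool]],
) -> Dict[str, int]:
    """Count TP/TN/FP/FN, skipping pairs where either value is not a plain bool."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0, "skipped": 0}
    for pred, truth in zip(predicted, actual):
        if not isinstance(pred, bool) or not isinstance(truth, bool):
            counts["skipped"] += 1
        elif pred and truth:
            counts["TP"] += 1
        elif not pred and not truth:
            counts["TN"] += 1
        elif pred:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


# The three parsed rows from the merged table: all predicted False,
# ground truth True / False / missing.
print(confusion_counts([False, False, False], [True, False, None]))
```

This assumes the DataFrame columns are first converted to plain Python `True`/`False`/`None` values; anything else (such as `NaN`) lands in `skipped`.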

### Now you can calculate some metrics for keeping track of your experiments:  
* Recall - How many of the actual positives the LLM found - TP / (TP + FN)
* Precision - How many of the positive predictions are correct - TP / (TP + FP)
* Accuracy - Correct rows out of everything - (TP + TN) / (TP + TN + FP + FN)
* F1 - Harmonic mean of precision and recall - 2 x precision x recall / (precision + recall)
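
On small samples some of these denominators can be zero (for example, no positive predictions at all means precision is 0/0). A minimal sketch that returns `None` in that case instead of crashing; `classification_metrics` is a made-up helper and the demo counts are invented.

```python
from typing import Dict, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Divide, or return None when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else None


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute recall, precision, accuracy and F1 from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}


# Invented counts, just to show the call.
print(classification_metrics(tp=3, tn=4, fp=1, fn=2))
# No positive predictions at all: precision and F1 come back as None.
print(classification_metrics(tp=0, tn=5, fp=0, fn=1))
```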

More details on this whole thing: https://www.geeksforgeeks.org/machine-learning/confusion-matrix-machine-learning/

Personal note from me: metrics for me are more of a reporting task, I mostly use the confusion matrix to understand what's going on - what's working, what isn't

### Let's write down numbers:

In [78]:
true_positive = 0
true_negative = 2
false_positive = 0
false_negative = 1

total = true_positive+true_negative+false_positive+false_negative

#recall = true_positive / (true_positive+false_negative)
#precision = true_positive / (true_positive+false_positive)
accuracy = (true_positive + true_negative) / total


In [80]:
total

3

In [82]:
accuracy

0.6666666666666666

### Now let's try to do something else with the prompt and do a comparison

In [None]:
prompt = """Does the following review explicitly mention special phone features?"""

format_prompt = """

relevance should be on a scale of 1-10 where 1 means no special features are mentioned, 10 means special features are explicitly mentioned
Return ONLY a valid JSON object that matches this schema:
{
  "relevance": int, 
  "answer": bool, 
  "reason": str
}

"""

In [None]:
sample['llm_response 2'] = sample['review_text'].apply(helper)

In [None]:
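# Strip the Markdown fence the same way as for the first run, before parsing
sample['llm_response 2'] = sample['llm_response 2'].apply(lambda x: x[9:-5])
sample['llm_response_json 2'] = sample['llm_response 2'].apply(safe_json_loads)
sample = sample[~sample['llm_response_json 2'].isna()].copy()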

In [None]:
# Reuse the already-parsed dicts instead of parsing each response three more times
sample['relevance 2'] = sample['llm_response_json 2'].apply(lambda d: d['relevance'])
sample['answer 2'] = sample['llm_response_json 2'].apply(lambda d: d['answer'])
sample['reason 2'] = sample['llm_response_json 2'].apply(lambda d: d['reason'])

In [None]:
sample['answer 2 same as 1'] = sample['answer 2'] == sample['answer']

In [None]:
sample['relevancies diff'] = sample['relevance 2'] - sample['relevance'] 
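
To turn the two comparison columns into something you can write down next to each prompt version, a small summary helps. This is a sketch on plain Python lists; `summarize_comparison` is a made-up helper and the demo values are invented, not results from this notebook.

```python
from typing import Dict, List


def summarize_comparison(
    answers_1: List[bool],
    answers_2: List[bool],
    relevances_1: List[int],
    relevances_2: List[int],
) -> Dict[str, object]:
    """Summarize how much two labeling runs agree on the same rows."""
    same = [a == b for a, b in zip(answers_1, answers_2)]
    diffs = [r2 - r1 for r1, r2 in zip(relevances_1, relevances_2)]
    return {
        "rows": len(same),
        "agreement_rate": sum(same) / len(same) if same else None,
        "mean_abs_relevance_diff": sum(abs(d) for d in diffs) / len(diffs) if diffs else None,
        # Positions where the yes/no answer flipped: the first rows to inspect by hand.
        "flipped_rows": [i for i, s in enumerate(same) if not s],
    }


# Invented values for three rows labeled by two prompt versions.
print(summarize_comparison([False, False, True], [False, True, True], [5, 1, 8], [4, 6, 9]))
```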