<a href="https://colab.research.google.com/github/CherpanovNazim/learn-llm/blob/using-ollama/notebooks/01_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


> [GitHub Repo](https://github.com/CherpanovNazim/learn-llm)

## Setup

In [None]:
#clone git repository
!git clone -b using-ollama -q https://github.com/CherpanovNazim/learn-llm.git

In [None]:
%%time
# wait ~1 min for installations

!pip install -qU openai==1.40.3 langchain-ollama==0.1.3 langchain==0.2.15 transformers==4.44.0

In [None]:
%%time
# wait ~1 min for installations

import sys

def load_model():
  !curl -fsSL https://ollama.com/install.sh | sh
  !nohup ollama serve &
  !sleep 5
  return

load_model()

In [None]:
import json

# Load the default model
with open('learn-llm/configs/llama_3_8B.json', 'r') as config_file:
    DEFAULT_MODEL = json.load(config_file)

!ollama run {DEFAULT_MODEL['model']} &

In [None]:
!python3 learn-llm/notebooks/utils/explainer.py
sys.path.append('learn-llm/notebooks/utils')

from explainer import Explainer

explain = Explainer(DEFAULT_MODEL)
# use this class if you want to get some explanations

In [None]:
import json
import openai
import pandas as pd
from tqdm import tqdm
from functools import partial

# Set the base URL and API key
# For production apps, use a secret management system instead of storing the key in the git repo :)

client = openai.OpenAI(
    base_url = DEFAULT_MODEL['api_base'],
    api_key = DEFAULT_MODEL['api_key']
    )
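
The comment in the cell above recommends keeping keys out of the repository. One possible way to do that is to let environment variables override the values from the config file. This is only a minimal sketch: the helper `load_client_settings` and the variable names `LLM_API_BASE` and `LLM_API_KEY` are hypothetical and not part of the repo.

```python
import os

def load_client_settings(defaults):
    """Return connection settings, preferring environment variables over config defaults.

    LLM_API_BASE and LLM_API_KEY are hypothetical variable names chosen for this sketch.
    """
    return {
        "base_url": os.environ.get("LLM_API_BASE", defaults["api_base"]),
        "api_key": os.environ.get("LLM_API_KEY", defaults["api_key"]),
    }

# Placeholder defaults; in the notebook these would come from DEFAULT_MODEL
example_defaults = {"api_base": "http://localhost:11434/v1", "api_key": "placeholder"}
settings = load_client_settings(example_defaults)
print(sorted(settings))
```

In the notebook the result could then be passed on as `openai.OpenAI(**load_client_settings(DEFAULT_MODEL))`.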

# Load dataset
This is a sample from the [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews) dataset.

The table contains:
* ProductId - Unique identifier for the product
* Score - Rating between 1 and 5
* Text - Text of the review

In [None]:
#Load the dataset
reviews_df = pd.read_csv('learn-llm/data/amazon_food_reviews_sample.csv')
reviews_df.head()

In [None]:
reviews_df.Score.value_counts().plot.barh(title='Score distribution')

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# For small datasets K-fold cross-validation is preferable, but for educational purposes we use a single train/test split
train_df, test_df = train_test_split(reviews_df, stratify=reviews_df.Score, test_size=0.1, random_state=42)

## Baseline TF-IDF + Logistic Regression model

***Note: in this notebook, due to limited memory, we use the Llama-3 8B model, which results in unstable performance. Don't be surprised if advanced techniques yield lower accuracy. At the end, we present a table of results obtained with a larger model, where accuracy tends to improve.***

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(class_weight='balanced'))

clf.fit(train_df['Text'], train_df['Score'])

print('TF-IDF + LogReg Accuracy:', accuracy_score(test_df['Score'], clf.predict(test_df['Text'])))

In [None]:
def text_completion(prompt, temperature=0, max_tokens=2, return_completion_only=True, **kwargs):
    completion_response = client.completions.create(
                            model=DEFAULT_MODEL["model"],
                            temperature=temperature,
                            max_tokens=max_tokens,
                            prompt=prompt,
                            **kwargs)
    if return_completion_only:
        return completion_response.choices[0].text.strip()
    else:
        return completion_response

def chat_completion(prompts, temperature=0, max_tokens=2, system_prompt: str = None, **kwargs):
    if system_prompt is None:
        system_prompt = "Just follow user instructions and don't communicate like \"Sure!\" or \"I hope this helps\""

    completion = client.chat.completions.create(
        model=DEFAULT_MODEL["model"],
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": prompts}],
        **kwargs
        )
    return completion.choices[0].message.content.strip()

## Zero-shot Classification

* We'll first assess how well the base model classifies reviews using a simple prompt.
* If the model starts generating garbage (anything that is not a number), we return -1
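
The "return -1 on garbage" rule can also be made more lenient. The sketch below is a hypothetical alternative to the strict `int(prediction)` parsing used in the notebook's `llm_predict`: it accepts the first number in the valid range found anywhere in the output, so an answer like `Score: 4` still counts. The helper name `parse_score` is an assumption made for illustration.

```python
import re

def parse_score(raw_output, valid_scores=(1, 2, 3, 4, 5)):
    """Return the first number in raw_output that is a valid score, or -1 if there is none."""
    for number in re.findall(r"\d+", raw_output):
        if int(number) in valid_scores:
            return int(number)
    return -1

for output in ["4", " Score: 5\n", "Sure!", "10"]:
    print(repr(output), "->", parse_score(output))
```

Whether lenient parsing is desirable depends on the task: it lowers the number of -1 predictions, but it can also hide the fact that the model is not following the instructions.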

In [None]:
explain('What is Zero-shot classification in LLMs like ChatGPT ? Make a short answer')

In [None]:
def llm_predict(review_text, template):
    prompt = template.format(review_text=review_text)
    prediction = chat_completion(prompt).lower().strip()

    try:
        return int(prediction)
    except ValueError:
        # if the model generates garbage (not a number), return -1
        return -1

In [None]:
zero_shot_template = """\
You will get a review and you should predict the score of the review.
Score is a number from 1 to 5. Answer with only one number and nothing else.

Review: "{review_text}"
Score:
"""

zero_shot_predictions = list(map(partial(llm_predict, template=zero_shot_template), tqdm(test_df.Text)))

print('Accuracy:', accuracy_score(test_df['Score'], zero_shot_predictions))

In [None]:
pd.Series(zero_shot_predictions).value_counts()

* Using just a description of what we want as a prompt (without any training data), we already beat the supervised TF-IDF + LogReg baseline
* Notice that the model sometimes generates garbage (-1)
* This happens because the model does not always fully follow instructions (GPT-4, for example, does this much better)
* Fortunately, we can force the LLM to follow the output structure we want
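
To see how much of the error comes from garbage outputs rather than wrong scores, the two can be reported separately. This is a small sketch on toy data; the helper `summarize_predictions` is hypothetical and not part of the notebook.

```python
def summarize_predictions(y_true, y_pred, garbage_label=-1):
    """Report the share of garbage predictions and accuracy with and without them."""
    total = len(y_pred)
    garbage = sum(1 for p in y_pred if p == garbage_label)
    valid_pairs = [(t, p) for t, p in zip(y_true, y_pred) if p != garbage_label]
    correct = sum(1 for t, p in valid_pairs if t == p)
    return {
        "garbage_rate": garbage / total if total else 0.0,
        "accuracy_overall": correct / total if total else 0.0,
        "accuracy_on_valid": correct / len(valid_pairs) if valid_pairs else 0.0,
    }

# Toy labels and predictions, not results from the notebook
print(summarize_predictions([5, 1, 3, 4], [5, -1, 3, 2]))
```

In the notebook it could be called as `summarize_predictions(test_df['Score'], zero_shot_predictions)`.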

## Force output structure

In [None]:
zero_shot_structural_template = """\
You will get a review and you should predict the score of the review.
Score is a number from 1 to 5. Answer with only one number and nothing else.

Review: "Very nice!"
Score: 5

Review: "Garbage!"
Score: 1

Review: "Kind of okay"
Score: 3

Review: "{review_text}"
Score:
"""

zero_shot_structural_predictions = list(map(partial(llm_predict, template=zero_shot_structural_template), tqdm(test_df.Text)))

print('Accuracy:', accuracy_score(test_df['Score'], zero_shot_structural_predictions))

# check for garbage predictions
assert all([x != -1 for x in zero_shot_structural_predictions]), 'There are some garbage predictions'

* The model doesn't generate garbage anymore
* We just added a few synthetic examples to help the LLM understand the structure we want
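
The same kind of prompt can be assembled from a list of examples instead of being written by hand, which makes it easier to swap the synthetic examples. A minimal sketch, assuming a hypothetical helper `build_structured_prompt`; the instruction text mirrors the template used in this section.

```python
INSTRUCTION = (
    "You will get a review and you should predict the score of the review.\n"
    "Score is a number from 1 to 5. Answer with only one number and nothing else.\n"
)

def build_structured_prompt(examples, review_text):
    """Render (review, score) pairs followed by the unanswered target review."""
    blocks = [f'Review: "{text}"\nScore: {score}' for text, score in examples]
    blocks.append(f'Review: "{review_text}"\nScore:')
    return INSTRUCTION + "\n" + "\n\n".join(blocks)

synthetic_examples = [("Very nice!", 5), ("Garbage!", 1), ("Kind of okay", 3)]
print(build_structured_prompt(synthetic_examples, "Tasty, but arrived late"))
```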

## [Chain of Thought](https://learnprompting.org/docs/intermediate/chain_of_thought) for Zero-shot Classification
* It's well known that asking an LLM to reason before committing to an answer can produce better results

In [None]:
prompt_template = """\
You will get a review and you should predict the score of the review.
Score is a number from 1 to 5.
First give a short explanation, then answer with only one number for the score.

------------------
Review: "Very nice!"
Reasoning: ...
Score: 5

Review: "Garbage!"
Reasoning: ...
Score: 1

Review: "Kind of okay"
Reasoning: ...
Score: 3

Review: "{review_text}"
Reasoning: """

def llm_predict(review_text):
    prompt = prompt_template.format(review_text=review_text)

    # Step 1: generate some reasoning
    reasoning = ''
    while len(reasoning)<10:
        # try to generate till we get desired reasoning structure
        # we use stop keyword in order to stop generation if model tries to answer right away, before generating reasoning
        # we also stop generation if model starts repeating original review text
        reasoning = chat_completion(prompt, max_tokens=100, stop=['score:', 'Score:', review_text], temperature=0.7).strip().replace('\n', '')

    # Step 2: join the original prompt with the generated reasoning and ask for the final score
    # Two steps are not strictly necessary, but decomposing gives more control over generation (especially for open-source LLMs)
    # Note: we use the text_completion API here; for GPT-3.5 and GPT-4 it's not required
    final_prompt = prompt + reasoning + '\nScore: '
    final_answer = text_completion(final_prompt, max_tokens=2).strip()

    try:
        return int(final_answer)
    except ValueError:
        # if the model generates garbage (not a number), return -1
        return -1

chain_of_thought_predictions = list(map(llm_predict, tqdm(test_df.Text)))

# Accuracy may vary since we use a non-zero temperature for generation
print('Accuracy:', accuracy_score(test_df['Score'], chain_of_thought_predictions))

# check for garbage predictions
assert all([x != -1 for x in chain_of_thought_predictions]), 'There are some garbage predictions'

* Unfortunately, generation is much slower (~10x), since the model has to generate the reasoning before it can generate the final answer
* It's possible to give more precise instructions on how to reason for your specific task
* You can also provide examples of reasoning to help the LLM understand the kind of reasoning you want
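
As an illustration of the last point, the `Reasoning: ...` placeholders can be replaced with short worked explanations. The sketch below only builds such a prompt; the example reasonings are invented for illustration and the helper `build_cot_prompt` is hypothetical.

```python
# Illustrative (review, reasoning, score) triples; the reasonings are made up for this sketch
COT_EXAMPLES = (
    ("Very nice!", "The reviewer is enthusiastic and mentions no drawbacks.", 5),
    ("Garbage!", "The reviewer rejects the product outright.", 1),
    ("Kind of okay", "The reviewer is lukewarm, neither pleased nor upset.", 3),
)

def build_cot_prompt(review_text, examples=COT_EXAMPLES):
    """Build a chain-of-thought prompt that ends where the model should start reasoning."""
    header = (
        "You will get a review and you should predict the score of the review.\n"
        "Score is a number from 1 to 5. Give a short explanation, then the score.\n\n"
    )
    parts = [f'Review: "{r}"\nReasoning: {why}\nScore: {s}' for r, why, s in examples]
    parts.append(f'Review: "{review_text}"\nReasoning: ')
    return header + "\n\n".join(parts)

print(build_cot_prompt("Good coffee, but the bag was torn"))
```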

# CoT with Self-Consistency

* Accuracy can be improved even further with the [Self-Consistency](https://learnprompting.org/docs/intermediate/self_consistency) trick, but it also increases generation time
* It's basically an ensemble that uses a majority vote
* It is N times slower than using CoT alone, where N is the number of runs

> <br>**Important:** you need to use temperature > 0 to get some variance in generated answers
>
> <br>
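
The voting step itself can be sketched independently of the LLM. The variant below additionally ignores garbage predictions (-1) when counting votes, which is an assumption on top of the plain majority vote described above; the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(predictions, garbage_label=-1):
    """Return the most common non-garbage prediction, or garbage_label if none are valid."""
    valid = [p for p in predictions if p != garbage_label]
    if not valid:
        return garbage_label
    # On ties, Counter.most_common keeps the value that was encountered first
    return Counter(valid).most_common(1)[0][0]

print(majority_vote([4, 5, 4]))
print(majority_vote([-1, -1, 3]))
print(majority_vote([-1, -1]))
```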

In [None]:
from collections import Counter

def self_consistency(review_text, nb_generations=3):
    # generate predictions several times with CoT
    # (make sure temperature > 0 is used for generation)
    # then take the most common prediction
    results = [llm_predict(review_text) for _ in range(nb_generations)]
    final_result = Counter(results).most_common(1)[0][0]
    print('Predictions:', results, 'Final prediction:', final_result)
    return final_result

# just illustrate how it works on a few reviews
chain_of_thought_self_cons_predictions = list(map(self_consistency, tqdm(test_df.Text.iloc[:5])))

## Few-shot learning
* Now we'll use few-shot learning to further improve accuracy
* LLMs can do In-Context Learning [(ICL)](https://thegradient.pub/in-context-learning-in-context/): the LLM learns to solve a new task at inference time (without any change to its weights) by being fed a prompt with examples of that task

In [None]:
def sample_few_shot_examples(dataset: pd.DataFrame, samples_per_class=1, seed=None):

    # sample examples from each class
    examples = dataset.groupby('Score').apply(lambda x: x.sample(samples_per_class, random_state=seed))

    # shuffle sampled examples
    examples = examples.sample(frac=1, random_state=seed)

    # construct final string
    string = ''
    # after groupby().apply() the index is a (Score, original row index) pair
    for (score, _), row in examples.iterrows():
        review = str(row.Text).replace('\n','')
        string += f'Review: "{review}"\nScore: {score}\n\n'

    return string.strip()

# sample from the training split so that test reviews never leak into the prompt
sampled_few_shot_examples = sample_few_shot_examples(train_df, samples_per_class=1, seed=0)
print(sampled_few_shot_examples)

In [None]:
prompt_template = """\
You will get a review and you should predict the score of the review.
Score is a number from 1 to 5. Answer with only one number and nothing else.

------------------
{few_shot_examples}

Review: "{review_text}"
Score:"""

def llm_predict(review_text, few_shot_examples, llm_kwargs=None):
    prompt = prompt_template.format(review_text=review_text, few_shot_examples=few_shot_examples)

    # Note: we use the text-completion API for open-source LLMs; for GPT models it's not required
    prediction = text_completion(prompt, **(llm_kwargs or {})).lower().strip()

    try:
        return int(prediction)
    except ValueError:
        # if the model generates garbage (not a number), return -1
        return -1

few_shot_predictions = list(map(partial(llm_predict, few_shot_examples=sampled_few_shot_examples), tqdm(test_df.Text)))

print('Accuracy:', accuracy_score(test_df['Score'], few_shot_predictions))

# check for garbage predictions
assert all([x != -1 for x in few_shot_predictions]), 'There are some garbage predictions'

* Using only 5 training examples (one per class) from real data, we get accuracy comparable to the CoT results
* Computation time is significantly lower than with CoT (almost 4x) and a bit higher than with zero-shot
* * Generating new tokens is much slower (and more expensive) than enlarging the prompt
* The selected few-shot examples significantly affect accuracy, so it's important to choose them carefully
* It's well known that even the order of the sampled examples affects accuracy
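
Since both the choice and the order of examples matter, one option is to search over orderings and keep the one that scores best on a validation set. The sketch below shows only the search loop: `score_fn` is a hypothetical callback that would return validation accuracy for a given ordering, and here it is replaced by a toy function so the code runs without an LLM.

```python
import itertools

def best_example_order(examples, score_fn):
    """Try every ordering of the examples and keep the one with the highest score."""
    best_order, best_score = None, float("-inf")
    for order in itertools.permutations(examples):
        score = score_fn(order)
        if score > best_score:
            best_order, best_score = list(order), score
    return best_order, best_score

# Toy scoring function standing in for "accuracy on a validation set":
# it simply prefers orderings where example "c" comes earlier
toy_score = lambda order: -order.index("c")
print(best_example_order(["a", "b", "c"], toy_score))
```

With 5 examples there are 120 orderings, and each one requires a full pass over the validation set, so in practice a random subset of orderings may be a more realistic budget.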

## Classes (tokens) probability

* We can also use LLMs to get the probability of each class (token) given the prompt
* It can be used to get some insight into the uncertainty of the LLM's predictions
* For that we need to pass the [logprobs](https://platform.openai.com/docs/api-reference/completions/create#completions/create-logprobs) parameter to the API call
* Usually it's better to have classes that are each represented by a single token (as in our case)
* In real use cases it's better to use [logit_bias](https://platform.openai.com/docs/api-reference/completions/create#completions/create-logit_bias) as well, to maximize the chance that the model returns all the classes we want in the logprobs. Unfortunately, it is not yet implemented in vLLM, which we use for our experiments
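
The conversion from log-probabilities to class probabilities, and one way to turn them into an uncertainty score, can be sketched on toy numbers without calling the API. The helpers `logprobs_to_probs` and `entropy` are hypothetical, and the uniform fallback for the case where no class token appears in the logprobs is an assumption of this sketch.

```python
import math

def logprobs_to_probs(token_logprobs, classes):
    """Exponentiate the logprobs of the class tokens and renormalize over the classes."""
    unnorm = {c: math.exp(token_logprobs[c]) if c in token_logprobs else 0.0 for c in classes}
    total = sum(unnorm.values())
    if total == 0:
        # no class token was returned at all: fall back to a uniform distribution
        return {c: 1.0 / len(classes) for c in classes}
    return {c: v / total for c, v in unnorm.items()}

def entropy(probs):
    """Shannon entropy in nats; higher values mean a less certain prediction."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

# Toy top-logprobs for a single position, including a non-class token (a space)
toy_logprobs = {"5": math.log(0.6), "4": math.log(0.3), " ": math.log(0.1)}
probs = logprobs_to_probs(toy_logprobs, ["1", "2", "3", "4", "5"])
print({c: round(p, 3) for c, p in probs.items()}, "entropy:", round(entropy(probs), 3))
```

Renormalizing over the class tokens only means that probability mass assigned to other tokens (like the space above) is discarded.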

In [None]:
import numpy as np

def get_tokens_probability(raw_prediction, classes_tokens):
    # extract the per-position top logprobs from the API response
    top_logprobs = raw_prediction.choices[0].logprobs.top_logprobs

    # the first predicted token can be a space rather than the class token itself,
    # so pick the position whose candidates overlap most with the class tokens
    classes_tokens_intersect = [len(set(token.keys()) & set(classes_tokens)) for token in top_logprobs]
    position_logprobs = top_logprobs[np.argmax(classes_tokens_intersect)]

    # exponentiate and normalize to convert logprobs into probabilities;
    # a class missing from the logprobs gets logprob -1000 (probability ~0)
    classes_prob_unnorm = [np.exp(position_logprobs.get(c, -1000)) for c in classes_tokens]
    classes_prob = np.array(classes_prob_unnorm) / np.sum(classes_prob_unnorm)

    # return the class (token) probabilities
    return {c: p for c, p in zip(classes_tokens, classes_prob)}



results = list()
for review_text in tqdm(test_df.Text.iloc[:10]):
    prompt = prompt_template.format(review_text=review_text, few_shot_examples=sampled_few_shot_examples)

    # we use param logprobs which will return logprob of tokens for each position
    prediction = text_completion(prompt, return_completion_only=False, logprobs=10)

    # extract probabilities for each class
    classes_prob = get_tokens_probability(prediction, classes_tokens=list(map(str, range(1, 6))))
    results.append(classes_prob)

# show class probability for each review
# columns - our classes, rows - reviews
pd.DataFrame.from_records(results).round(2).style.background_gradient(axis=1)

## Embeddings and LogReg
- ***(Run only if you have additional GPU memory; otherwise, an error will occur.)***

In [None]:
!pip install sentence-transformers==3.0.1

In [None]:
from sentence_transformers import SentenceTransformer

# (!) This downloads the model weights and runs inference on your device
# Long texts will be truncated to at most 512 tokens.
model = SentenceTransformer('intfloat/e5-small-v2')

train_embeddings = model.encode(('query: ' + train_df.Text).to_list())
test_embeddings = model.encode(('query: ' + test_df.Text).to_list())

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(train_embeddings, train_df['Score'])

print('E5 embeddings + LogReg Accuracy:',
      accuracy_score(test_df['Score'], clf.predict(test_embeddings)))

# Summary

These are the results we obtained using the larger and more stable Llama 3 70B model. As shown, accuracy tends to improve.

| Approach             | Accuracy | Training data        |
|----------------------|----------|----------------------|
| TF-IDF + LogReg      | 0.604    | Full (2187 examples) |
| Zero-Shot naive      | 0.720    | No                   |
| Zero-Shot structured | 0.732    | No                   |
| Zero-Shot with CoT   | 0.741    | No                   |
| Few-Shot             | 0.757    | 5 examples           |
| E5 + LogReg          | 0.749    | Full (2187 examples) |

* If you don't have training data at all - you can use Zero-Shot approach
* If you have some training data (at least 1 sample per class) - you can use Few-Shot approach
* CoT can be used to improve accuracy of Zero-Shot approach as well as Few-Shot
* You can use LLM to get probability of each class (token)
* You can also use LLM to label your data -> train more classical models (like E5 + LogReg) -> use it for inference

# Homework
* Play with the prompt in the [Force output structure](#Force-output-structure) cell to get better accuracy, e.g. provide more context on the task
* Sample different / more examples per class in the [Few-shot learning](#Few-shot-learning) cell and see how it affects accuracy
* * Extra: Try to find the best set of examples from the training set using a validation set (not the test set)
* Implement the self-consistency trick for the [Few-shot learning](#Few-shot-learning) cell
* Try other models (like Mistral 7B) from other config files
* TODO: Add one more dataset for multi-class classification

# Extra
* There are libraries which can help you to use LLMs
* * [guidance](https://github.com/guidance-ai/guidance) - enables you to control modern LLMs more efficiently than traditional prompting or chaining
* * [DSPy](https://github.com/stanfordnlp/dspy/tree/main) - provides composable and declarative modules for instructing LMs in a familiar Pythonic syntax