# Praktikum 5 - Large Language Model

Note: the praktikums are for your own practice. They will **not be graded**!

Remember to make a copy of this notebook to your own Colab. Changes made directly here will not be stored!

Whenever you see an ellipsis `...` and/or a TODO comment, you're supposed to insert code or text answers.



In this praktikum, we will walk you through how to work with large language models (LLMs).

We will use an LLM to solve the task of Sentiment Analysis.

## Using LLMs with OpenAI API

One way of using LLMs is via the OpenAI API.

The OpenAI API is not limited to OpenAI's own models (which you would have to pay to use). The same client can also access LLMs served by OpenAI-compatible backends such as [Text Generation Inference by Huggingface](https://huggingface.co/docs/text-generation-inference/en/index) and [llama.cpp](https://github.com/ggerganov/llama.cpp).

For this praktikum, I have hosted an LLM on our server, which you can access via the OpenAI API. The LLM used is [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).

First, we need to install OpenAI:

In [None]:
!pip install openai

Then, create a `client`. Since you are using my hosted LLM, you do not need an OpenAI API key, but you do need to set `base_url` to the address where the LLM is hosted:

In [None]:
from openai import OpenAI

api_key = "we_dont_need_this"
base_url="https://i13hpc51.isl.iar.kit.edu/v1"

client = OpenAI(
    base_url=base_url,
    timeout=900,
    api_key=api_key
)

Let's try sending your first request!

In [None]:
response = client.chat.completions.create(
    model="mistral",
    seed=0,
    messages=[
        {"role": "user", "content": "What's the weather like today?"}
    ]
)

Inspect the output:

In [None]:
response

Extract the response content:

In [None]:
response.choices[0].message.content
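
Beyond the answer text, the response object carries metadata that can help when debugging prompts. The sketch below assumes the server fills in the standard `finish_reason` and `usage` fields of the OpenAI chat completion layout, which a self-hosted backend may leave empty. `summarize_response` is a hypothetical helper written for this illustration, not part of the `openai` package.

```python
def summarize_response(response):
    """Collect the answer text and basic metadata from a chat completion.

    Hypothetical helper; assumes the OpenAI chat completion layout
    (choices[0].message.content, choices[0].finish_reason, usage).
    """
    choice = response.choices[0]
    usage = getattr(response, "usage", None)  # may be missing on some servers
    return {
        "content": choice.message.content,
        # "stop": the model finished on its own; "length": the answer was cut off
        "finish_reason": choice.finish_reason,
        "total_tokens": usage.total_tokens if usage is not None else None,
    }
```

Calling `summarize_response(response)` on the response from the cell above would return a small dictionary that is easier to read than the full object.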

## Task: Sentiment Analysis

We will work on the task of Sentiment Analysis. Specifically, we will make use of the [Multiclass Sentiment Analysis Dataset](https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset), which contains English sentences and their corresponding sentiment labels among `[positive, negative, neutral]`.

First, download and load the dataset. We will truncate this dataset for faster inference speed.

In [None]:
!wget https://bwsyncandshare.kit.edu/s/M5sDCcBrf8earHk/download/Sp1786--multiclass-sentiment-analysis-dataset.zip
!unzip Sp1786--multiclass-sentiment-analysis-dataset.zip

In [None]:
import pandas as pd


train_df = pd.read_csv("Sp1786--multiclass-sentiment-analysis-dataset/train_df.csv")
test_df = pd.read_csv("Sp1786--multiclass-sentiment-analysis-dataset/test_df.csv")[:5]

Inspect the data:

In [None]:
test_df.head()

## LLM for Sentiment Analysis

We will now see how we can make use of the LLM for Sentiment Analysis. As the most naive approach, we can try directly asking the model:

In [None]:
response = client.chat.completions.create(
    model="mistral",
    seed=0,
    messages=[
        {
            "role": "user",
            "content": "What is the sentiment of this sentence: "\
                       "\"getting cds ready for tour\"? " \
                       "The sentiment is one of the following: positive, negative, neutral"
        }
    ]
)

response.choices[0].message.content

As you can see, the answer from the LLM is freeform. Therefore, we need to extract the sentiment labels from this freeform answer. Fill in the function below for this purpose:

In [None]:
def extract_label(llm_output):
    ...

extract_label(response.choices[0].message.content)
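
One possible starting point, not the reference solution, is to search the freeform answer for the first label word. The name `extract_label_sketch` and the `"unknown"` fallback are assumptions made for this illustration. The sketch returns label strings; if the `label` column of the dataset stores something else (check `test_df.head()`), an additional mapping step would be needed.

```python
import re

LABELS = ("positive", "negative", "neutral")

def extract_label_sketch(llm_output):
    """Return the label that is mentioned first in the answer, or "unknown"."""
    # \b keeps the pattern from matching inside longer words
    match = re.search(r"\b(" + "|".join(LABELS) + r")\b", llm_output.lower())
    return match.group(1) if match else "unknown"

print(extract_label_sketch("The sentiment of this sentence is Neutral."))
```

This heuristic is deliberately simple: an answer such as "not positive" would still be read as `positive`, so inspecting a few raw outputs before trusting it is a good idea.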


Let's now do the inference more systematically. Define a function that takes an English sentence as input and returns the LLM's answer. Feel free to redesign your prompt, i.e., ask the LLM in a different way.

In [None]:
def llm_sa(en_sent):
    ...
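
As a rough sketch of the prompt-building part only, the messages can be assembled in a separate function so that the wording is easy to change. The function name `build_zero_shot_messages` and the prompt wording are assumptions for illustration; the request itself would be sent as in the earlier cells.

```python
def build_zero_shot_messages(en_sent):
    """Build the chat messages for a zero-shot sentiment request."""
    prompt = (
        "Classify the sentiment of the sentence below as positive, negative, or neutral. "
        "Answer with a single word.\n\n"
        f"Sentence: {en_sent}"
    )
    return [{"role": "user", "content": prompt}]

messages = build_zero_shot_messages("getting cds ready for tour")
# These messages could then be sent the same way as in the earlier cells:
# client.chat.completions.create(model="mistral", seed=0, messages=messages)
```

Asking for a single-word answer tends to make label extraction easier, although the model is not guaranteed to comply.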

In [None]:
llm_output = test_df['text'].apply(lambda x: llm_sa(x))

Now, let's evaluate the performance of the LLM:

In [None]:
from sklearn import metrics

# First extract the labels from the LLM's answer
llm_labels = llm_output.apply(lambda x: extract_label(x))

print(metrics.accuracy_score(test_df['label'], llm_labels))
print(metrics.f1_score(test_df['label'], llm_labels, average='micro'))
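
When every sentence has exactly one true label and one predicted label, micro-averaged F1 equals accuracy, because each wrong prediction counts as one false positive and one false negative. The pure-Python sketch below illustrates this on made-up toy labels; the helper names are hypothetical and the code is not a replacement for `sklearn.metrics`.

```python
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def micro_f1(y_true, y_pred):
    # Sum true positives, false positives and false negatives over all labels
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += (p == label and t == label)
            fp += (p == label and t != label)
            fn += (p != label and t == label)
    return 2 * tp / (2 * tp + fp + fn)

# Toy labels, invented for this illustration
y_true = ["positive", "negative", "neutral", "neutral", "positive"]
y_pred = ["positive", "neutral", "neutral", "negative", "positive"]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

If a score that weighs the three classes equally is of interest, `average='macro'` in `metrics.f1_score` is an alternative worth comparing.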


## In-context learning

In order to improve the performance of the LLM, we can provide it with some examples. This is called *In-context Learning*, or *Few-shot Prompting*.

First, write the function to select the examples from the training data. We will try out 3 different strategies:
- Select the examples randomly
- Select the most similar examples using a transformer sentence embedding model
- Select the most similar examples using a traditional metric, e.g., CHRF
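
The two families of strategies (random versus similarity-based) can be sketched with the standard library alone. The sketch below uses a character trigram overlap as a rough stand-in for a similarity score; it is not CHRF and not a sentence embedding, and all function names as well as the toy `pool` are assumptions made for illustration. In the exercise, the scoring function would be replaced by CHRF or by cosine similarity between sentence embeddings.

```python
import random

def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap of character n-grams; a rough stand-in, not CHRF itself."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def select_random(pool, nr_shots, seed=0):
    # A fixed seed keeps the selection reproducible between runs
    return random.Random(seed).sample(pool, nr_shots)

def select_closest(en_sent, pool, nr_shots, score_fn=ngram_overlap):
    # pool: list of (text, label) pairs; keep the nr_shots highest-scoring ones
    ranked = sorted(pool, key=lambda ex: score_fn(en_sent, ex[0]), reverse=True)
    return ranked[:nr_shots]

# Toy examples, invented for this illustration
pool = [
    ("I love this song", "positive"),
    ("the bus was late again", "negative"),
    ("packing cds for the trip", "neutral"),
]
print(select_closest("getting cds ready for tour", pool, nr_shots=1))
```

Passing the scoring function as a parameter keeps the ranking logic identical across similarity measures, so only `score_fn` changes between strategies.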


In [None]:
!pip install -U sentence-transformers
!pip install sacrebleu

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer, util
from sacrebleu.metrics import CHRF

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def select_shot(en_sent, nr_shots, method='random', similarity_model=None):
    if method == 'random':
        ...
    elif method == 'closest_sentences_transformer':
        ...
    elif method == 'closest_sentences_chrf':
        ...
    else:
        raise NotImplementedError(f"Unknown shot selection method: {method}")

Now write the function to perform LLM inference for Sentiment Analysis with few-shot prompting:

In [None]:
def llm_sa_few_shot(en_sent, nr_shots=2, shot_selection_method='random'):
    ...
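
One way to lay out a few-shot prompt, sketched here under assumptions, is to present each selected example as a user turn followed by an assistant turn with the gold label, and to end with the new sentence as the last user turn. The function name `build_few_shot_messages` and the prompt wording are hypothetical; packing all examples into a single user message is an equally valid layout.

```python
def build_few_shot_messages(en_sent, shots):
    """shots: list of (text, label) pairs used as in-context examples."""
    instruction = (
        "Classify the sentiment of the sentence as positive, negative, or neutral. "
        "Answer with a single word.\n\n"
    )
    messages = []
    for i, (text, label) in enumerate(shots):
        # Put the instruction only in the first user turn
        prefix = instruction if i == 0 else ""
        messages.append({"role": "user", "content": f"{prefix}Sentence: {text}"})
        messages.append({"role": "assistant", "content": label})
    # With no shots, the instruction goes into the only user turn
    prefix = instruction if not shots else ""
    messages.append({"role": "user", "content": f"{prefix}Sentence: {en_sent}"})
    return messages
```

The returned list can be passed as `messages` to `client.chat.completions.create` in the same way as in the zero-shot case.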



In [None]:
nr_shots = 2

for shot_selection_method in ['random', 'closest_sentences_transformer', 'closest_sentences_chrf']:
    print(f'shot_selection_method: {shot_selection_method}')
    llm_output = test_df['text'].apply(lambda x: llm_sa_few_shot(x, nr_shots, shot_selection_method))

    # First extract the labels from the LLM's answer
    llm_labels = llm_output.apply(lambda x: extract_label(x))

    print(metrics.accuracy_score(test_df['label'], llm_labels))
    print(metrics.f1_score(test_df['label'], llm_labels, average='micro'))
    print('------------------------------------------------------')


## Chain-of-thought prompting

Another way to improve the performance of the LLM is through Chain-of-thought Prompting, i.e., asking the model to first provide the reasoning before giving the final output.

Write the function to perform LLM inference for Sentiment Analysis with Chain-of-thought prompting:

In [None]:
def llm_sa_cot(en_sent):
    ...
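
With chain-of-thought answers, the reasoning may mention several labels before settling on one, so it helps to ask for the final label in a fixed format and parse only that part. The sketch below is one possible setup under assumptions: the prompt wording, the `Sentiment: <label>` convention, and both function names are hypothetical, and the model may not always follow the requested format.

```python
import re

COT_SUFFIX = (
    "First explain your reasoning step by step. "
    "Then finish with a final line of the form 'Sentiment: <label>', "
    "where <label> is positive, negative, or neutral."
)

def build_cot_messages(en_sent):
    return [{"role": "user", "content": f"Sentence: {en_sent}\n\n{COT_SUFFIX}"}]

def extract_final_label(llm_output):
    """Return the label from the last 'Sentiment: <label>' occurrence, or "unknown"."""
    found = re.findall(r"sentiment:\s*(positive|negative|neutral)", llm_output.lower())
    return found[-1] if found else "unknown"

answer = "The sentence is not negative, it merely states a fact.\nSentiment: Neutral"
print(extract_final_label(answer))
```

Taking the last match rather than the first keeps label words inside the reasoning from overriding the final verdict.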

In [None]:
llm_output = test_df['text'].apply(lambda x: llm_sa_cot(x))

# First extract the labels from the LLM's answer
llm_labels = llm_output.apply(lambda x: extract_label(x))

print(metrics.accuracy_score(test_df['label'], llm_labels))
print(metrics.f1_score(test_df['label'], llm_labels, average='micro'))


## In-context learning + Chain-of-thought prompting

Now let's combine the two methods. Write the function to perform LLM inference for Sentiment Analysis with Few-shot Prompting and Chain-of-thought Prompting:

In [None]:
def llm_sa_few_shot_cot(en_sent, nr_shots=2):
    ...

In [None]:
nr_shots = 2

llm_output = test_df['text'].apply(lambda x: llm_sa_few_shot_cot(x, nr_shots))

# First extract the labels from the LLM's answer
llm_labels = llm_output.apply(lambda x: extract_label(x))

print(metrics.accuracy_score(test_df['label'], llm_labels))
print(metrics.f1_score(test_df['label'], llm_labels, average='micro'))


## LLM attack

In this section, we try to attack the LLM and make it output whatever we want. Let's try to make it output "LLMs are evil."

In [None]:
llm_sa(...)

In [None]:
llm_sa_few_shot(...)


In [None]:
llm_sa_cot(...)


In [None]:
llm_sa_few_shot_cot(...)
