---

<div align='center'>
<font size="+2">

Text Mining and Natural Language Processing  
2023-2024

<b>SelectWise</b>

Alessandro Ghiotto 513944

</font>
</div>

---

# Notebook 4 - LLM Prompting:

- Zero-Shot Prompting
- Zero-Shot Chain of Thought Prompting
- Few-Shot Prompting
- RAG inspired Few-Shot Prompting

---

Data

In [1]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import random
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import Dataset
sns.set_theme(style="darkgrid")

# SEED
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
seed = 8
set_seed(seed)

# DEVICE and DTYPE
mydevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.set_default_device(mydevice) # default tensor device
# torch.set_default_dtype(torch.float32) # default tensor dtype

# DATASET
dataset = load_dataset("allenai/qasc")
n_train_sample = 7323
dataset_train = dataset['train'].select(range(n_train_sample))
dataset_val = dataset['train'].select(range(n_train_sample, len(dataset['train'])))
dataset_test = dataset['validation']

def format_choices(example):
    if example['choices']['label'] == ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']:
        example['choices'] = example['choices']['text']
    else:
        print("The order of the choices is not the same for all the examples")
    example['answerKey_int'] = ord(example['answerKey']) - 65
    return example

dataset_train = dataset_train.map(format_choices)
dataset_val = dataset_val.map(format_choices)

# Display the dataset
dataset_train[0]

{'id': '3E7TUJ2EGCLQNOV1WEAJ2NN9ROPD9K',
 'question': 'What type of water formation is formed by clouds?',
 'choices': ['pearls',
  'streams',
  'shells',
  'diamonds',
  'rain',
  'beads',
  'cooled',
  'liquid'],
 'answerKey': 'F',
 'fact1': 'beads of water are formed by water vapor condensing',
 'fact2': 'Clouds are made of water vapor.',
 'combinedfact': 'Beads of water can be formed by clouds.',
 'formatted_question': 'What type of water formation is formed by clouds? (A) pearls (B) streams (C) shells (D) diamonds (E) rain (F) beads (G) cooled (H) liquid',
 'answerKey_int': 5}

---

# LLM: `'DeciLM-7B-instruct'`

<https://huggingface.co/Deci/DeciLM-7B-instruct>

DeciLM-7B-instruct is a model for short-form instruction following. It is built by LoRA fine-tuning on the SlimOrca dataset.

![picture](../imgs/4_LLMprompting.png)

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

model_name = "Deci/DeciLM-7B-instruct"

device = "cuda" 

dtype_kwargs = dict(
    quantization_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
))

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    **dtype_kwargs
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.9,
    do_sample=True,
    device_map="auto",
    max_new_tokens=256,
    return_full_text=False
)


I follow the prompt format recommended for this specific LLM:

```Python
"""
### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.
### User:
{prompt}
"""
```
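As a minimal sketch, the template above can also be filled in by hand. The helper name `build_prompt` is hypothetical, and the trailing `### Assistant:` line is an assumption based on the prompt printed by the notebook further down; in the notebook itself the tokenizer's chat template does this job.

```python
SYSTEM_PROMPT = (
    "You are an AI assistant that follows instruction extremely well. "
    "Help as much as you can."
)

def build_prompt(user_prompt, system_prompt=SYSTEM_PROMPT):
    """Fill the '### System / ### User / ### Assistant' template by hand."""
    return (
        f"### System:\n{system_prompt}\n"
        f"### User:\n{user_prompt}\n"
        f"### Assistant:\n"
    )

print(build_prompt("How to train a dragon?"))
```

Building the string by hand is only useful for inspecting the format; `apply_chat_template` remains the safer choice because it follows whatever template ships with the tokenizer.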

In [4]:
def get_response(user_prompt, pipeline=pipe):
    system_prompt = "You are an AI assistant that follows instruction extremely well. Help as much as you can."
    prompt = pipeline.tokenizer.apply_chat_template([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ], tokenize=False, add_generation_prompt=True)
    return pipeline(prompt)[0]['generated_text']


system_prompt = "You are an AI assistant that follows instruction extremely well. Help as much as you can."
user_prompt = "How to train a dragon?"
prompt = tokenizer.apply_chat_template([
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
], tokenize=False, add_generation_prompt=True)

response = pipe(prompt)[0]['generated_text']
print(prompt + response)

### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.
### User:
How to train a dragon?
### Assistant:
 Training a dragon can be a complex and potentially dangerous task. However, here are some general steps that can be taken to train a dragon:

1. Establish a bond: Start by building a strong bond with the dragon. This can be done through positive interactions, such as feeding it, playing with it, and providing a safe and comfortable environment.

2. Teach basic commands: Once the dragon trusts you, you can begin teaching it basic commands, such as "come," "sit," and "stay."

3. Teach more advanced commands: As the dragon becomes more comfortable with you, you can teach it more advanced commands, such as "roll over," "jump through a hoop," and "fly."

4. Use positive reinforcement: When the dragon successfully completes a command, reward it with a treat or praise. This will encourage the dragon to continue learning and performing the comma

## **Zero-Shot Prompting**

We directly ask the question we are interested in, without giving any example.

```Python
f"""
Question: {item['question']}
fact1: {item['fact1']}
fact2: {item['fact2']}
A) {item['choices'][0]}
B) {item['choices'][1]}
...
H) {item['choices'][7]}
Choose the correct choice. Answer with the corresponding letter only. 
"""
```

In [29]:
import timeit

def evaluate_model(dataset, question_format, pipeline=pipe):
  total = 0
  correct = 0
  skipped = 0
  t0 = timeit.default_timer()
  # loop over all the items in the dataset
  for item in dataset:
    true_label = item["answerKey"]
    prompt = question_format(item)
    answer = get_response(prompt, pipeline)
    # keep only the first character of the answer ("" if the model returned nothing)
    output_label = answer.upper().replace("\n", " ").strip()[:1]
    # output_label = output_label.replace("ANSWER:", "").strip()[0]

    # if the answer is not in the choices, we skip
    if output_label not in ['A','B','C','D','E','F','G','H']:
      print(answer)
      skipped+=1 # counter of skipped sentences
      continue # we simply continue the loop

    if output_label == true_label: # CORRECT
      correct+=1
    total+=1

  print(f"elapsed : {timeit.default_timer()-t0:.2f} seconds")
  print(f"skipped : {skipped}")
  print(f"correct : {correct}")
  print(f"accuracy: {correct/total:.5f}")


In [17]:
def question_format_(item):
  question_format = f"""\
  Question: {item['question']}
  fact1: {item['fact1']}
  fact2: {item['fact2']}
  A) {item['choices'][0]}
  B) {item['choices'][1]}
  C) {item['choices'][2]}
  D) {item['choices'][3]}
  E) {item['choices'][4]}
  F) {item['choices'][5]}
  G) {item['choices'][6]}
  H) {item['choices'][7]}
  Choose the correct choice. Answer with the corresponding letter only."""
  return question_format

evaluate_model(dataset_val, question_format_)


elapsed : 186.94 seconds
skipped : 0
correct : 743
accuracy: 0.91615


Since causal language models read the input from left to right (unlike BERT-like models, in which each token attends to every other token), it is interesting to see how the order of the elements in the prompt influences the performance.
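To make the difference concrete, here is a toy sketch of the two attention patterns. It treats each prompt element as a single token, which is a simplifying assumption (real models work on sub-word tokens), and `attention_mask` is a hypothetical helper, not a library function.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) or not causal for j in range(n_tokens)]
            for i in range(n_tokens)]

tokens = ["question", "fact1", "fact2", "answer"]
causal = attention_mask(len(tokens), causal=True)
bidirectional = attention_mask(len(tokens), causal=False)

# In the causal case the question (position 0) cannot see the facts that follow it
print(causal[0][1])         # False
print(bidirectional[0][1])  # True
```

With the question first, its representation is computed without access to the facts; with the facts first, the question tokens can attend to them. This may be one reason why the order matters.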


In [36]:
# swap the order between the facts and the question

def question_format_(item):
  question_format = f"""\
  fact1: {item['fact1']}
  fact2: {item['fact2']}
  Question: {item['question']}
  A) {item['choices'][0]}
  B) {item['choices'][1]}
  C) {item['choices'][2]}
  D) {item['choices'][3]}
  E) {item['choices'][4]}
  F) {item['choices'][5]}
  G) {item['choices'][6]}
  H) {item['choices'][7]}
  Choose the correct choice. Answer with the corresponding letter only."""
  return question_format

evaluate_model(dataset_val, question_format_)

elapsed : 159.64 seconds
skipped : 0
correct : 780
accuracy: 0.96178


Just swapping the order of the facts and the question raises the accuracy remarkably, from 0.91615 to 0.96178.

In [32]:
# put together the facts and the question

def question_format_(item):
  question_format = f"""\
  Question: {item['fact1']}. {item['fact2']} {item['question']}
  A) {item['choices'][0]}
  B) {item['choices'][1]}
  C) {item['choices'][2]}
  D) {item['choices'][3]}
  E) {item['choices'][4]}
  F) {item['choices'][5]}
  G) {item['choices'][6]}
  H) {item['choices'][7]}
  Choose the correct choice. Answer with the corresponding letter only."""
  return question_format

evaluate_model(dataset_val, question_format_)


elapsed : 163.18 seconds
skipped : 0
correct : 740
accuracy: 0.91245


In [37]:
# put together the choices

def question_format_(item):
  question_format = f"""\
  {item['fact1']}. {item['fact2']}
  Question: {item['question']}
  (A) {item['choices'][0]} (B) {item['choices'][1]} (C) {item['choices'][2]} (D) {item['choices'][3]} \
  (E) {item['choices'][4]} (F) {item['choices'][5]} (G) {item['choices'][6]} (H) {item['choices'][7]}
  Choose the correct choice. Answer with the corresponding letter only (i.e. A, B...)."""
  return question_format

evaluate_model(dataset_val, question_format_)

 (G)
 (C) harmful substances
elapsed : 154.66 seconds
skipped : 2
correct : 763
accuracy: 0.94314
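The two answers printed above (`(G)` and `(C) harmful substances`) were skipped only because they start with a parenthesis, so the first-character check rejects them. A slightly more tolerant parser could look like the sketch below; `extract_label` is a hypothetical helper, not part of the notebook, and the regular expression is just one possible heuristic.

```python
import re

def extract_label(answer):
    """Return the leading choice letter (A-H), or None if the answer does not start with one."""
    # optional whitespace, optional "(", then a single letter A-H that is not part of a longer word
    match = re.match(r"\s*\(?([A-Ha-h])\b", answer)
    return match.group(1).upper() if match else None

print(extract_label(" (G)"))                                # G
print(extract_label(" (C) harmful substances"))             # C
print(extract_label("The correct choice is D) animals."))   # None
```

Note that this is stricter than the original check in one respect: an answer such as `Beads` would not be read as `B`, because the letter must stand alone.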


In [10]:
# move the instructions to the beginning

def question_format_(item):
  question_format = f"""\
  This is a multiple choice question, choose the correct choice. Answer with the corresponding letter only.
  fact1: {item['fact1']}
  fact2: {item['fact2']}
  Question: {item['question']}
  A) {item['choices'][0]}
  B) {item['choices'][1]}
  C) {item['choices'][2]}
  D) {item['choices'][3]}
  E) {item['choices'][4]}
  F) {item['choices'][5]}
  G) {item['choices'][6]}
  H) {item['choices'][7]}
  ANSWER:"""
  return question_format

evaluate_model(dataset_val, question_format_)

elapsed : 163.67 seconds
skipped : 0
correct : 774
accuracy: 0.95438


The results are already very good, considering that we have not trained the model: we are just using it out of the box. On the other hand, inference takes much more time than with BERT-like models. DeciLM is also very good at following the requested answer format: only the prompt with the choices on a single line produced invalid answers (2 skipped).

## **Zero-Shot Chain of Thought Prompting**

I ask the model to reason about the answer by adding "Let's think step by step" to the prompt.

In [31]:
# Let's think step by step

def question_format_(item):
  question_format = f"""\
  Let's think step by step.
  fact1: {item['fact1']}
  fact2: {item['fact2']}
  Question: {item['question']}
  A) {item['choices'][0]}
  B) {item['choices'][1]}
  C) {item['choices'][2]}
  D) {item['choices'][3]}
  E) {item['choices'][4]}
  F) {item['choices'][5]}
  G) {item['choices'][6]}
  H) {item['choices'][7]}
  Choose the correct choice and motivate your answer."""
  return question_format

evaluate_model(dataset_val, question_format_)


 The correct choice is D) animals.

Fight-or-flight is the same in humans and animals because both species exhibit a similar response to threatening situations. This response is a survival mechanism that helps them cope with danger.
 The correct choice is E) adrenaline to surge.

Animal attacking another animal may cause adrenaline to surge.

This is because when an animal is threatened, it experiences a fight-or-flight response, which is a physiological reaction that prepares the animal to either fight or flee. This response is triggered by the release of adrenaline, a hormone produced by the adrenal glands. The adrenaline surge causes the animal's body to prepare for action, such as increased heart rate, blood flow to the muscles, and heightened senses.
 The correct choice is E) Store it for later.

The body stores calories for later use when it is not immediately used for energy. This is because calories are the energy stored in food, and the body needs them to maintain its function

In [32]:
# specify better how to answer

def question_format_(item):
  question_format = f"""\
  Let's think step by step.
  fact1: {item['fact1']}
  fact2: {item['fact2']}
  Question: {item['question']}
  A) {item['choices'][0]}
  B) {item['choices'][1]}
  C) {item['choices'][2]}
  D) {item['choices'][3]}
  E) {item['choices'][4]}
  F) {item['choices'][5]}
  G) {item['choices'][6]}
  H) {item['choices'][7]}
  Give the correct choice and a short motivation. start the sentence with the letter of the choice."""
  return question_format

evaluate_model(dataset_val, question_format_)


elapsed : 1730.24 seconds
skipped : 0
correct : 773
accuracy: 0.95314


This method takes much longer (1730.24 seconds, versus roughly 160 seconds for the zero-shot prompts), because asking for the rationale behind the answer makes the model generate many more tokens. Tokens are generated sequentially, so the run time grows with the length of the output.
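A quick back-of-the-envelope calculation, using only the figures reported in the outputs above, gives an idea of the cost per item. The number of validation items is not printed anywhere, so it is inferred here from `correct / accuracy`, which is an assumption of this sketch.

```python
# Figures copied from the runs above
elapsed_zero_shot = 163.67   # seconds, zero-shot prompt with the instructions first
elapsed_cot = 1730.24        # seconds, chain-of-thought prompt with a short motivation

# Number of validation items, inferred from correct / accuracy (743 / 0.91615)
n_items = round(743 / 0.91615)

per_item_zero_shot = elapsed_zero_shot / n_items
per_item_cot = elapsed_cot / n_items
slowdown = elapsed_cot / elapsed_zero_shot

print(f"items           : {n_items}")
print(f"zero-shot       : {per_item_zero_shot:.2f} s/item")
print(f"chain of thought: {per_item_cot:.2f} s/item")
print(f"slowdown        : {slowdown:.1f}x")
```

The chain-of-thought run is roughly ten times slower, for an accuracy (0.95314) comparable to the zero-shot prompts.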

## **Few-Shot Prompting**

We give one example (or more), as a pattern to be followed by the model to give the answer.

```Python
f"""
Input: {text_1}
Output: {label_1}

...

Input: {text_n}
Output: {label_n}

Input: {target_text}
Output:
"""
```
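The pattern above can be assembled programmatically. The following is a minimal, generic sketch: `few_shot_prompt` is a hypothetical helper, and the target question in the usage line is a made-up placeholder, not taken from the dataset.

```python
def few_shot_prompt(examples, target_text):
    """Build an Input/Output prompt from (text, label) pairs plus the target text."""
    blocks = [f"Input: {text}\nOutput: {label}" for text, label in examples]
    # the last block is left open so that the model completes the label
    blocks.append(f"Input: {target_text}\nOutput:")
    return "\n\n".join(blocks)

examples = [("What type of water formation is formed by clouds?", "F")]
print(few_shot_prompt(examples, "<target question goes here>"))
```

Ending the prompt with an open `Output:` is what lets the model infer both the task and the answer format from the examples alone.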

In [52]:
# one example

def question_format_(item, examples):
    # examples : list of items 
    question_format = ""
    for example in examples:
        question_format += f"""\
    fact1: {example['fact1']}
    fact2: {example['fact2']}
    Question: {example['question']}
    A) {example['choices'][0]}
    B) {example['choices'][1]}
    C) {example['choices'][2]}
    D) {example['choices'][3]}
    E) {example['choices'][4]}
    F) {example['choices'][5]}
    G) {example['choices'][6]}
    H) {example['choices'][7]}
    Answer: {example['answerKey']}\n\n"""

    question_format += f"""\
    fact1: {item['fact1']}
    fact2: {item['fact2']}
    Question: {item['question']}
    A) {item['choices'][0]}
    B) {item['choices'][1]}
    C) {item['choices'][2]}
    D) {item['choices'][3]}
    E) {item['choices'][4]}
    F) {item['choices'][5]}
    G) {item['choices'][6]}
    H) {item['choices'][7]}
    Answer:"""
    return question_format

examples = [dataset_train[0]]
question_format_train0 = lambda item : question_format_(item, examples)
evaluate_model(dataset_val, question_format_train0)


elapsed : 177.33 seconds
skipped : 0
correct : 790
accuracy: 0.97411


I didn't specify the task or the expected answer format, yet with just one example the model gave only valid answers (none skipped).

In [55]:
# try with 2 examples

examples = [dataset_train[0], dataset_train[10]]
question_format_train0 = lambda item : question_format_(item, examples)
evaluate_model(dataset_val, question_format_train0)


elapsed : 195.24 seconds
skipped : 0
correct : 784
accuracy: 0.96671


More examples are not needed: the model already understands the task with one example, and adding a second one does not improve the accuracy (0.96671 versus 0.97411).

In [56]:
# specify differently the examples

def question_format_(item, examples):
    # examples : list of items 
    question_format = ""
    for example in examples:
        question_format += f"""\
    INPUT:
        fact1: {example['fact1']}
        fact2: {example['fact2']}
        Question: {example['question']}
        A) {example['choices'][0]}
        B) {example['choices'][1]}
        C) {example['choices'][2]}
        D) {example['choices'][3]}
        E) {example['choices'][4]}
        F) {example['choices'][5]}
        G) {example['choices'][6]}
        H) {example['choices'][7]}
    OUTPUT: {example['answerKey']}\n\n"""

    question_format += f"""\
    INPUT:
        fact1: {item['fact1']}
        fact2: {item['fact2']}
        Question: {item['question']}
        A) {item['choices'][0]}
        B) {item['choices'][1]}
        C) {item['choices'][2]}
        D) {item['choices'][3]}
        E) {item['choices'][4]}
        F) {item['choices'][5]}
        G) {item['choices'][6]}
        H) {item['choices'][7]}
    OUTPUT:"""
    return question_format

examples = [dataset_train[0]]
question_format_train0 = lambda item : question_format_(item, examples)
evaluate_model(dataset_val, question_format_train0)

elapsed : 169.38 seconds
skipped : 0
correct : 784
accuracy: 0.96671


With just one simple example we increased the accuracy slightly (0.97411 versus 0.96178 for the best zero-shot prompt), without increasing the run time much.

## **RAG inspired Few-Shot Prompting**

Instead of giving the same arbitrary example as context every time, for each question I search for a similar question to use as context.

The example is taken from the training set: I choose the one with the highest **cosine similarity** to the target on the **tf-idf** representation. The sentences are preprocessed before computing the tf-idf weights.
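To show the retrieval idea in isolation, here is a from-scratch sketch on three made-up questions. It uses plain whitespace tokenization (no stop-word removal or stemming) and a smoothed idf in the style of scikit-learn's default, so it illustrates the principle rather than reproducing the notebook's exact weights. All function names are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf(doc, idf):
    """Sparse tf-idf vector as a dict; terms unseen at fit time are ignored."""
    counts = Counter(doc.lower().split())
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# toy "training" questions, made up for illustration
train_questions = [
    "what is formed by clouds",
    "what do plants need to grow",
    "which animal can fly",
]
idf = fit_idf(train_questions)
train_vectors = [tfidf(q, idf) for q in train_questions]

query = tfidf("what can clouds form", idf)
similarities = [cosine(query, v) for v in train_vectors]
best = max(range(len(similarities)), key=similarities.__getitem__)
print(train_questions[best])
```

The retrieved question is then used as the single few-shot example for the target item.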

In [5]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

nltk.download('punkt')
nltk.download('stopwords')
stopwords_list = set(stopwords.words('english'))
punctuations = set(string.punctuation)
porter = PorterStemmer()

def nltk_tokenizer(sentence):
    # Lowercase all sentences
    sentence = sentence.lower()

    # Tokenize using nltk
    my_tokenized_tokens = word_tokenize(sentence)

    # Removing stop words and punctuations
    mytokens = [word for word in my_tokenized_tokens if word not in stopwords_list and word not in punctuations]

    # Stemming
    mytokens = [porter.stem(word) for word in mytokens]

    # join the tokens back into a single string
    sentence_preprocessed = ' '.join(mytokens)

    return sentence_preprocessed


[nltk_data] Downloading package punkt to /home/max/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/max/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Fit the tf-idf representation on the questions only.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [nltk_tokenizer(sentence) for sentence in dataset_train['question']]

tfidf_vectorizer = TfidfVectorizer()
tfidf_train = tfidf_vectorizer.fit_transform(tokenized_sentences)

tfidf_train.shape

(7323, 2710)

In [None]:
import scipy

def question_format_fewshot(item, examples):
    # examples : list of items 
    question_format = ""
    for example in examples:
        question_format += f"""\
    fact1: {example['fact1']}
    fact2: {example['fact2']}
    Question: {example['question']}
    A) {example['choices'][0]}
    B) {example['choices'][1]}
    C) {example['choices'][2]}
    D) {example['choices'][3]}
    E) {example['choices'][4]}
    F) {example['choices'][5]}
    G) {example['choices'][6]}
    H) {example['choices'][7]}
    Answer: {example['answerKey']}\n\n"""

    question_format += f"""\
    fact1: {item['fact1']}
    fact2: {item['fact2']}
    Question: {item['question']}
    A) {item['choices'][0]}
    B) {item['choices'][1]}
    C) {item['choices'][2]}
    D) {item['choices'][3]}
    E) {item['choices'][4]}
    F) {item['choices'][5]}
    G) {item['choices'][6]}
    H) {item['choices'][7]}
    Answer:"""
    return question_format


def cosine_similarity_btw2vec(v1, v2):
    # Check if the vectors are sparse matrices
    if scipy.sparse.issparse(v1):
        v1 = v1.toarray().flatten()
    if scipy.sparse.issparse(v2):
        v2 = v2.toarray().flatten()
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0 
    return dot_product / (norm_v1 * norm_v2)

In [17]:
total = 0
correct = 0
skipped = 0
t0 = timeit.default_timer()
# loop over all the items in the validation set
for item in dataset_val:
    true_label = item["answerKey"]

    # get the most similar example
    question = nltk_tokenizer(item["question"])
    question_tfidf = tfidf_vectorizer.transform([question])
    cos_similarity = [cosine_similarity_btw2vec(tfidf_train[i], question_tfidf) for i in range(tfidf_train.shape[0])]
    most_similar_example = np.argmax(cos_similarity)
    example = [dataset_train[int(most_similar_example)]]

    prompt = question_format_fewshot(item, example)
    answer = get_response(prompt, pipeline=pipe)
    output_label = answer.upper().replace("\n", " ").strip()[:1]  # "" if the answer is empty

    # if the answer is not in the choices, we skip
    if output_label not in ['A','B','C','D','E','F','G','H']:
        print(answer)
        skipped+=1 # counter of skipped sentences
        continue # we simply continue the loop

    if output_label == true_label: # CORRECT
        correct+=1
    total+=1

print(f"elapsed : {timeit.default_timer()-t0:.2f} seconds")
print(f"skipped : {skipped}")
print(f"correct : {correct}")
print(f"accuracy: {correct/total:.5f}")



elapsed : 353.71 seconds
skipped : 0
correct : 782
accuracy: 0.96424
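Part of the extra time (353.71 seconds versus 177.33 with a fixed example) probably comes from computing the similarities one row at a time in Python. A possible speed-up is to compute all of them with a single matrix-vector product. The sketch below shows the idea on a small dense NumPy array; `most_similar_row` is a hypothetical helper, and with the sparse tf-idf matrix the same idea would need the sparse equivalents of these operations.

```python
import numpy as np

def most_similar_row(train_matrix, query_vector):
    """Index of the row with the highest cosine similarity to query_vector, plus all similarities."""
    train_norms = np.linalg.norm(train_matrix, axis=1)
    query_norm = np.linalg.norm(query_vector)
    denom = train_norms * query_norm
    # all-zero vectors have a zero dot product, so dividing by 1 gives similarity 0
    denom[denom == 0] = 1.0
    similarities = (train_matrix @ query_vector) / denom
    return int(np.argmax(similarities)), similarities

train_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vector = np.array([2.0, 2.0])
best, similarities = most_similar_row(train_matrix, query_vector)
print(best)  # 2
```

The result is the same as looping over the rows with `cosine_similarity_btw2vec`, but the work is done by one vectorized operation.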


Compute tf-idf on `'question' + 'fact1' + 'fact2' + choices` instead of just the question

In [27]:
def get_full_text(item):
    return item['fact1'] + " " + item['fact2'] + " " + item['question'] + " " + " ".join(item['choices'])

full_text_sentences = []
for item in dataset_train:
    full_text_sentences.append(get_full_text(item))

tokenized_sentences = [nltk_tokenizer(sentence) for sentence in full_text_sentences]

tfidf_vectorizer = TfidfVectorizer()
tfidf_train = tfidf_vectorizer.fit_transform(tokenized_sentences)

tfidf_train.shape

(7323, 6286)

In [28]:
total = 0
correct = 0
skipped = 0
t0 = timeit.default_timer()

for item in dataset_val:
    true_label = item["answerKey"]

    # get the most similar example
    full_text = nltk_tokenizer(get_full_text(item))
    text_tfidf = tfidf_vectorizer.transform([full_text])
    cos_similarity = [cosine_similarity_btw2vec(tfidf_train[i], text_tfidf) for i in range(tfidf_train.shape[0])]
    most_similar_example = np.argmax(cos_similarity)
    example = [dataset_train[int(most_similar_example)]]

    prompt = question_format_fewshot(item, example)
    answer = get_response(prompt, pipeline=pipe)
    output_label = answer.upper().replace("\n", " ").strip()[:1]  # "" if the answer is empty

    # if the answer is not in the choices, we skip
    if output_label not in ['A','B','C','D','E','F','G','H']:
        print(answer)
        skipped+=1 # counter of skipped sentences
        continue # we simply continue the loop

    if output_label == true_label: # CORRECT
        correct+=1
    total+=1

print(f"elapsed : {timeit.default_timer()-t0:.2f} seconds")
print(f"skipped : {skipped}")
print(f"correct : {correct}")
print(f"accuracy: {correct/total:.5f}")


elapsed : 369.51 seconds
skipped : 0
correct : 777
accuracy: 0.95808


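In the end, we didn't gain anything by trying to give a more suitable example for each prompt: both retrieval variants (0.96424 and 0.95808) score below the fixed example `dataset_train[0]` (0.97411).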

---

### **Best result notebook 4 -> Few-Shot Prompting**

The best result is obtained with the following prompt for each item in the dataset, with just one example (`dataset_train[0]`) as context:

```Python
f"""
    fact1: beads of water are formed by water vapor condensing
    fact2: Clouds are made of water vapor.
    Question: What type of water formation is formed by clouds?
    A) pearls
    B) streams
    C) shells
    D) diamonds
    E) rain
    F) beads
    G) cooled
    H) liquid
    Answer: F

    fact1: {item['fact1']}
    fact2: {item['fact2']}
    Question: {item['question']}
    A) {item['choices'][0]}
    B) {item['choices'][1]}
    C) {item['choices'][2]}
    D) {item['choices'][3]}
    E) {item['choices'][4]}
    F) {item['choices'][5]}
    G) {item['choices'][6]}
    H) {item['choices'][7]}
    Answer:
"""
```

| Metric          | Validation |
|-----------------|------------|
| Accuracy        | $0.97411$  |

The results are very good, especially considering that we didn't train anything. However, encoder-only models like BERT are better suited to single-label classification tasks like this one, for two reasons:

- they require fewer resources
- they always give a valid output, i.e. one of the expected labels
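The second point follows from how a classification head works: it scores a fixed set of labels and returns the best one, so no free-form text has to be parsed. A toy sketch, with made-up scores standing in for the model's logits (`predict_label` is a hypothetical helper):

```python
LABELS = ["A", "B", "C", "D", "E", "F", "G", "H"]

def predict_label(scores):
    """Pick the label with the highest score; the result is always one of LABELS."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best_index]

# made-up scores for the eight choices
scores = [0.1, -1.2, 0.3, 0.0, 2.5, 4.1, -0.7, 1.9]
print(predict_label(scores))  # F
```

With a generative model, by contrast, the answer has to be extracted from generated text, which is why some prompts above produced skipped items.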