<a href="https://colab.research.google.com/github/Miliyas/Generative_AI/blob/main/LLM_Fine_Tuning_Mib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LLM Project to Build and Fine Tune a Large Language Model**

In today's data-driven world, the ability to process and generate natural language text at scale has become a transformative force across industries. Large Language Models (LLMs) represent a cutting-edge advancement in natural language processing, enabling businesses to extract valuable insights, automate tasks, and enhance user experiences. By harnessing the power of LLMs, organizations can improve customer service, automate content creation, and gain a competitive edge in the digital landscape.

This project builds the foundation for Large Language Models by diving deep into the details of their inner workings. Moreover, It shows how to optimize their use through prompt engineering and fine-tuning techniques such as LoRA.

Prompt engineering techniques involve crafting specific instructions or queries given to the language model to influence its output will be introduced to guide LLMs in generating desired responses through zero-shot, one-shot, and few-shot inferences.

Fine-tuning entails training a pre-trained language model on a specific task or dataset to adapt it for a particular application. It explores full fine-tuning and Parameter Efficient Fine Tuning (PEFT), a technique that optimizes the fine-tuning process by focusing on a subset of the model's parameters, making it more resource-efficient.

The project also involves the application of Retrieval Augmented Generation (RAG) using OpenAI's GPT-3.5 Turbo, resulting in the development of a chatbot for online shopping for knowledge grounding. Knowledge grounding with Retrieval Augmented Generation (RAG) is implemented to mitigate hallucinations and provide trustworthy and reliable responses. This is achieved by incorporating information from external sources to validate and support the generated text.

For example, in the context of an e-commerce chatbot using RAG, knowledge grounding ensures that product information, availability, and prices are sourced from a trusted database or e-commerce platform. This prevents the chatbot from generating inaccurate or fictional details and instead provides responses based on real-world data.


![image](https://images.pexels.com/photos/18069697/pexels-photo-18069697/free-photo-of-an-artist-s-illustration-of-artificial-intelligence-ai-this-illustration-depicts-language-models-which-generate-text-it-was-created-by-wes-cockx-as-part-of-the-visualising-ai-project-l.png?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)

## **Learning Outcomes**

* Understand Large Language Models (LLMs) and how they work.
* Gain practical experience in implementing generative AI projects.
* Understand fundamental NLP concepts like RNNs, Transformers, and Attention Mechanism.
* Explore tokenization, embeddings, and internal workings of Transformers.
* Generate text and summarize dialogues using LLMs.
* Learn optimization techniques like Prompt Engineering, Fine-Tuning, and PEFT.
* Apply Prompt Engineering techniques for better responses.
* Fine-tune LLMs for improved performance on tasks.
* Evaluate model performance using the ROUGE metric.
* Understand RLHF for improved model output.
* Implement Retrieval Augmented Generation (RAG) for knowledge grounding.
* Build a chatbot application for online shopping.


In [None]:
# python- 3.8.10
# !pip install --upgrade pip
# !pip install transformers
# !pip install datasets --quiet
# !pip install torchdata
# !pip install torch
# !pip install streamlit
# !pip install openai
# !pip install langchain
# !pip install unstructured
# !pip install sentence-transformers
# !pip install chromadb
# !pip install evaluate==0.4.0
# !pip install rouge_score==0.1.2
# !pip install loralib==0.1.1
# !pip install peft==0.3.0

In [2]:
pip install evaluate datasets

Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/84.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m26.1 MB/s[0m eta [36m0:00:00[0m
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m15.1 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from evaluate)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m24.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
import torch
import evaluate
import time
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoModelForCausalLM,
                          AutoTokenizer, GenerationConfig, TrainingArguments, Trainer)
from transformers import AutoTokenizer
from transformers import GenerationConfig


## **Refresher**

## **Text Generation**

In recent years, there has been an increasing interest in open-ended language generation thanks to the rise of large transformer-based language models trained on millions of webpages, including OpenAI's ChatGPT and Meta's LLaMA. The results on conditioned open-ended language generation are impressive, having shown to generalize to new tasks, handle code, or take non-text data as input. Besides the improved transformer architecture and massive unsupervised training data, better decoding methods have also played an important role.

Currently most prominent decoding methods, mainly Greedy search, Beam search, and Sampling.

In [8]:
DEVICE = 'cuda'

In [9]:
torch_device = torch.device(DEVICE)

In [8]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

## Greedy Search

Greedy search is the simplest decoding method. It selects the word with the highest probability as its next word:

![title](images/greedy_search.png)

Starting from the word "The", the algorithm greedily chooses the next word of highest probability "nice" and so on, so that the final generated word sequence is ("The", "nice", "woman")
having an overall probability of 0.5×0.4=0.2




In [8]:
# encode context the generation is conditioned on
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)

# generate 40 new tokens
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure


We have generated our first short text with GPT2!

The generated words following the context are reasonable, but the model quickly start repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search. The major drawback of greedy search though is that it misses high probability words hidden behind low probability word as can be seen in the sketch above.

The word "has" with its high conditional probability of 0.9 hidden behind the word "dog", which has only the second-highest conditional probability, so that greedy search misses the word sequence "The", "dog", "has".

## **Beam Search**

Beam search reduces the risk of missing hidden high probability word sequences by keeping the most likely num_beams of hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Let's illustrate with num_beams=2:

![title](images/beam_search.png)

At time step 1, besides the most likely hypothesis ("The", "nice"), beam search also keeps track of the second most likely one ("The", "dog"). At time step 2, beam search finds that the word sequence ("The", "dog", "has") with 0.36 higher probability than ("the", "nice", "woman") which has 0.2. Great, it has found the most likely word sequence in the example.

Beam search will always find an output sequence with higher probability than greedy search, but its not guaranteed to find the most likely output

In [9]:
# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with my dog again.

I'm not sure if I'll ever be able to walk with my dog again, but I don


While the result is arguably more fluent, the output still includes repetitions of the same word sequences. One of the available remedies is to introduce n-grams. The most common n-grams penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0.

In [10]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to


We can see that the repetition does not appear anymore. Nevertheless, n-gram penalties have to be used with care. An article generated about the city New York should not use a 2-gram penalty or otherwise, the name of the city would only appear once in the whole text!

Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.

In transformers, we simply set the parameter num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!

In [11]:
# set return_num_sequences > 1
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 3 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea to
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time to take a
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea

As can be seen, the five beam hypotheses are only marginally different to each other - which should not be too surprising when using only 5 beams.

## **Sampling**

In its most basic form, sampling means randomly picking the next word according to its conditional probability distribution

![title](images/sampling_search.png)

It becomes obvious that language generation using sampling is not deterministic anymore. The word ("car") is sampled from the conditioned probability distribution P(w|"The"), followed by sampling ("drives") from P(w|"The", "car")

In transformers, we set do_sample=True and deactivate Top-K sampling (more on this later) via top_k=0. In the following, we will fix the random seed for illustration purposes. Feel free to change the set_seed argument to obtain different results, or to remove it for non-determinism.

In [12]:
# set seed to reproduce results. Feel free to change the seed though to get different results
from transformers import set_seed
set_seed(42)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this had me staying in an unusual room and not going on nights out with friends (which will always be wondered for a mere minute or so at this point).


Interesting! The text seems alright - but when taking a closer look, it is not very coherent and doesn't sound like it was written by a human. That is the big problem when sampling word sequences: The models often generate incoherent gibberish

A trick is to make the distribution sharper (increasing the likelihood of high probability words and decreasing the likelihood of low probability words) by lowering the so called temperature of the softmax

<img src="images/sampling_search_with_temp.png" width="800" height="400">

The conditional next word distribution of step T=1 becomes much sharper leaving almost no chance for word ("car") to be selected.

In [13]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't like to chew on it. I like to eat it and not chew on it. I like to be able to walk with my dog."

So how did you decide


There were less weird n-grams and the output is a bit more coherent now. However, while applying temperature can make a distribution less random, in its limit, when setting temperature -> 0, temperature scaled sampling becomes equal to greedy decoding and will suffer from the same problems as before.

## **Top-K Sampling**

In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

<img src="images/top_k_sampling.png" width="1200" height="600">

In [14]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k to 50
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=35,
    do_sample=True,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this time it was hard for me to figure out what to do with it. (One reason I asked this for a few months back


Not bad at all! The text is arguably the most human-sounding text so far. One concern though with Top-K sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution. This can be problematic as some words might be sampled from a very sharp distribution (distribution on the right in the graph above), whereas others from a much more flat distribution (distribution on the left in the graph above).

## **Top-p (nucleus) sampling**

Instead of sampling only from the most likely K words, in Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (a.k.a the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. Ok, that was very wordy, let's visualize.

<img src="images/top_p_sampling.png" width="1200" height="600">

Having set p = 0.92, Top-p sampling picks the minimum number of words to exceed together p = 92% of the probability mass. In the first example, this included the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%.

In [15]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k to 50
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=35,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this had me staying in an unusual room and not going on nights out with friends (which will always be my yearning for such a


While in theory, Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.

In [16]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(40)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog and it is a wonderful little dog. His personality is spot on, very friendly and very nice. I can barely stop walking when he runs along the sidewalk around the corner."

The dog would
1: I enjoy walking with my cute dog, but I don't like to spend it alone, because I don't want to go to the gym or eat with her because I don't want to show it to her at all. I get
2: I enjoy walking with my cute dog!" As the words started coming out of her mouth, she was really close to bursting. "I never see anybody in the village."

"Really? Why are you so happy about it?"


# **Dialogue Summarization**

In this use case, we will be generating a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging face.

Let's upload some simple dialogues from the DialogSum Hugging Face dataset. This dataset contains 10.000+ dialogues with the corresponding manually labeled summaries and topics.

In [17]:
torch_device = torch.device(DEVICE)

In [18]:
huggingface_dataset_name = 'knkarthick/dialogsum'
dataset = load_dataset(huggingface_dataset_name)

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [21]:
example_indices = [40, 200]

In [22]:
dash_line = "-".join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i+1)
    print(dash_line)
    print('Input Dialogue: ')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('Baseline Human Summary: ')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()


---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
Input Dialogue: 
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
Baseline Human Summary: 
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
E

## **FLAN-T5 Model**

<img src="images/flan2_architecture.jpg" width="1000" height="500">

#### Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models

In [23]:
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

## Training

<img src="images/flan_t5_tasks.png" width="900" height="450">

#### These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

## Inference

In [25]:
sentence = "What time is it, Tom?"

In [26]:
sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(sentence_encoded["input_ids"][0], skip_special_tokens=True)

print(f'ENCODED SENTENCE:\n {sentence_encoded["input_ids"][0]}')
print(f'DECODED SENTENCE: {sentence_decoded}')

ENCODED SENTENCE:
 tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])
DECODED SENTENCE: What time is it, Tom?


In [27]:
example_indices

[40, 200]

In [28]:
def summarize_dialogues(example_indices, dataset, prompt = "%s"):
    for i, index in enumerate(example_indices):
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        input = prompt % (dialogue)

        inputs = tokenizer(input, return_tensors='pt')
        pred = model.generate(inputs["input_ids"], max_new_tokens=50)[0]
        output = tokenizer.decode(pred, skip_special_tokens=True)

        print(dash_line)
        print(f'Example {i+1}')
        print(dash_line)
        print(f'Input Prompt: \n {dialogue}')
        print(dash_line)
        print(f'Baseline Human Summary: \n {summary}')
        print(dash_line)
        print(f'Model Generation: \n{output}\n')

In [29]:
summarize_dialogues(example_indices, dataset)

---------------------------------------------------------------------------------------------------
Example 1
---------------------------------------------------------------------------------------------------
Input Prompt: 
 #Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
Baseline Human Summary: 
 #Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
Model Generation: 
Person1: It's ten to nine.

--------------------------------------------------------

### Zero Shot Inference with an Instruction Prompt

In [31]:
prompt = f'Summarize the following conversation. \n%s\nSummary:'
print(prompt)

Summarize the following conversation. 
%s
Summary:


In [32]:
summarize_dialogues(example_indices, dataset, prompt)

---------------------------------------------------------------------------------------------------
Example 1
---------------------------------------------------------------------------------------------------
Input Prompt: 
 #Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
Baseline Human Summary: 
 #Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
Model Generation: 
The train is about to leave.

------------------------------------------------------

In [33]:
prompt = f'Dialogue: \n%s\n\nWhat Happened?'
print(prompt)

Dialogue: 
%s

What Happened?


In [34]:
summarize_dialogues(example_indices, dataset, prompt)

---------------------------------------------------------------------------------------------------
Example 1
---------------------------------------------------------------------------------------------------
Input Prompt: 
 #Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
Baseline Human Summary: 
 #Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
Model Generation: 
Tom is late.

----------------------------------------------------------------------

# One Shot Inference

In [35]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        prompt += f"""Dialogue:\n{dialogue}\n\nWhat was going on?\n{summary}\n\n\n"""
        dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f'Dialogue:\n{dialogue}\n\nWhat was going on?'
    return prompt

In [36]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)

Dialogue:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.


Dialogue:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a

In [37]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(inputs["input_ids"], max_new_tokens=50)[0], skip_special_tokens=True
)

print(dash_line)
print(f'Baseline Human Summary: \n{summary}\n')
print(dash_line)
print(f'Model Generation - One Shot:\n{output}')

---------------------------------------------------------------------------------------------------
Baseline Human Summary: 
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
Model Generation - One Shot:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to add a CD-ROM drive.


# Few Shot Inference

In [38]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)

Dialogue:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.


Dialogue:
#Person1#: May, do you mind helping me prepare for the picnic?
#Person2#: Sure. Have you checked the weather report?
#Person1#: Yes. It says it will be sunny all day. No sign of rain at all. This is your father's favorite sausage. Sandwiches for you and Daniel.
#Person2#: No, thanks Mom. I'd like some toast and chicken wings.
#Person1#: Okay. Please take some fruit salad and crackers for me.
#Person2#: Done. Oh, don't forget to take napkins disposable plates, cups and picnic blanket.
#Person1#: All set. May,

In [39]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(inputs["input_ids"], max_new_tokens=50)[0], skip_special_tokens=True
)

print(dash_line)
print(f'Baseline Human Summary: \n{summary}\n')
print(dash_line)
print(f'Model Generation - Few Shot:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (818 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
Baseline Human Summary: 
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
Model Generation - Few Shot:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to upgrade his hardware.


# Model Fine Tuning

In [11]:
torch_device = torch.device(DEVICE)

## Load Dataset and LLM

In [5]:
hugging_face_dataset_name = "knkarthick/dialogsum"

In [6]:
dataset = load_dataset(hugging_face_dataset_name)

In [7]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [8]:
def number_of_trainable_model_parameters(model):
        trainable_model_params = 0
        all_model_params = 0
        for _, param in model.named_parameters():
            all_model_params += param.numel()
            if param.requires_grad:
                trainable_model_params += param.numel()
        result = f"trainable model parameters: {trainable_model_params}\n"
        result += f"all model parameters: {all_model_params}\n"
        result += f"Percentage of model params: {(trainable_model_params/all_model_params)*100}"
        return result

In [9]:
print(number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
Percentage of model params: 100.0


## Test the Model with Zero Shot Inferencing

In [47]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to(torch_device)
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)
dash_line = "-".join("" for x in range(100))
print(dash_line)
print(f"Input Prompt:\n{prompt}")
print(dash_line)
print(f"Baseline Human Summary:\n{summary}\n")
print(dash_line)
print(f"Model Generation - Zero Shot: \n{output}")


---------------------------------------------------------------------------------------------------
Input Prompt:

Summarize the following conversation

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

--------------------------------------------------------------------

## Perform Full Fine-Tunning

### Preprocess the Dialog-Summary dataset

Convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction to the start of the dialog with 'Summarize the following conversation' and the start of the summary with 'Summary as follows'

In [10]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

In [11]:
# The dataset actually contains 3 diff splits: train, validation, test
# The tokenize_function code is handling all data accross all splits in batches

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

To save some time, we will subsample the dataset:

In [19]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

NameError: name 'tokenized_datasets' is not defined

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})


### Fine-Tune the model with the Preprocessed Dataset

Now utilize the built-in Hugging Face Trainer class.

In [None]:
output_dir = f"./dialogue-summary-training-{str(int(time.time()))}"

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [None]:
trainer.train()

In [None]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained('full/').to(torch_device)
original_model = original_model.to(torch_device)

## Evaluate the Model Qualitatively

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids

original_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_text_output = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

instruct_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_text_output = tokenizer.decode(instruct_outputs[0], skip_special_tokens=True)

dash_line = "-".join("" for x in range(100))
print(dash_line)
print(f"Input Prompt:\n{prompt}")
print(dash_line)
print(f"Baseline Human Summary:\n{human_baseline_summary}\n")
print(dash_line)
print(f"Original Model Generation - Zero Shot: \n{original_text_output}")
print(dash_line)
print(f"Instruct Model Generation - Fine Tune: \n{instruct_text_output}")

---------------------------------------------------------------------------------------------------
Input Prompt:

Summarize the following conversation

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

--------------------------------------------------------------------

## Evaluate the Model Quantitatively (with ROUGE Metric)

In [None]:
rouge = evaluate.load('rouge')

Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 6.27MB/s]


In [None]:
dialogue = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
for _, dialogue in enumerate(dialogue):
    prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
    """
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids

    original_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_text_output = tokenizer.decode(original_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_text_output)

    instruct_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_text_output = tokenizer.decode(instruct_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human', 'original', 'instruct'])

In [None]:
df

Unnamed: 0,human,original,instruct
0,Ms. Dawson helps #Person1# to write a memo to ...,Employees are required to use instant messagin...,#Person1# asks Ms. Dawson to take a dictation ...
1,In order to prevent employees from wasting tim...,This memo will be sent to all employees by thi...,#Person1# asks Ms. Dawson to take a dictation ...
2,Ms. Dawson takes a dictation for #Person1# abo...,Employees are required to use the Office of In...,#Person1# asks Ms. Dawson to take a dictation ...
3,#Person2# arrives late because of traffic jam....,People are talking about the traffic in this c...,#Person2# got stuck in traffic again. #Person1...
4,#Person2# decides to follow #Person1#'s sugges...,#Person1: I'm finally here!,#Person2# got stuck in traffic again. #Person1...
5,#Person2# complains to #Person1# about the tra...,#Person1: I'm sorry to hear that you're stuck ...,#Person2# got stuck in traffic again. #Person1...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are divorced.,#Person1# tells Kate Masha and Hero are gettin...
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are divorced.,#Person1# tells Kate Masha and Hero are gettin...
8,#Person1# and Kate talk about the divorce betw...,#Person1: Masha and Hero are getting a divorce.,#Person1# tells Kate Masha and Hero are gettin...
9,#Person1# and Brian are at the birthday party ...,"#Person1#: Brian, thank you for coming to the ...",Brian's birthday is coming. Brian dances with ...


In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True
)

In [None]:
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)

print(f"Original Model: \n{original_model_results}")
print(f"Instruct Model: \n{instruct_model_results}")


Original Model: 
{'rouge1': 0.261052062988671, 'rouge2': 0.08531489481944488, 'rougeL': 0.224821552384684, 'rougeLsum': 0.22788611265447228}
Instruct Model: 
{'rouge1': 0.38857220563277894, 'rouge2': 0.13135692283806472, 'rougeL': 0.28167162470172985, 'rougeLsum': 0.28344342480768214}


# Parameter Efficient Fine Tunning with LoRA

Now lets perform Parameter Efficient Fine-Tunning (PEFT). Opposed to full fine tunning, PEFT is a form of instruction fine-tunnin hat is much more efficient than full fine-tunning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tunning (which is not the same as prompt engineering). In most cases when someone says PEFT, they typically mean LoRA, at a very high level allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU).

## Setup the PEFT/LoRA model for Fine-Tunning

In [None]:
pip install peft

Collecting peft
  Downloading peft-0.11.1-py3-none-any.whl (251 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/251.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m245.8/251.6 kB[0m [31m9.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.6/251.6 kB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
Collecting accelerate>=0.21.0 (from peft)
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m33.4 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cu

In [12]:
from peft import LoraConfig, get_peft_model, TaskType, PeftModel, PeftConfig

In [13]:
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=['q','v'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.SEQ_2_SEQ_LM
)

Learn More About LORA:https://docs.google.com/document/d/1ZDCrdrXwRn2vVOQ_jlb3-3nQRUB6nPi-L8IbLf9Gsik/edit?usp=sharing

In [14]:
peft_model = get_peft_model(original_model, lora_config)
print(number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
Percentage of model params: 1.4092820552029972


In [17]:
pip install transformers[torch]



In [18]:
pip install accelerate -U



In [15]:
output_dir = f"./peft-dialogue-summary-training-{str(int(time.time()))}"

training_args = TrainingArguments(
    auto_find_batch_size=True,
    output_dir=output_dir,
    learning_rate=1e-3,
    num_train_epochs=100,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

max_steps is given, it will override any value given in num_train_epochs


In [16]:
peft_trainer.train()

Step,Training Loss
1,50.1473


TrainOutput(global_step=1, training_loss=50.147274017333984, metrics={'train_runtime': 1.2147, 'train_samples_per_second': 3.293, 'train_steps_per_second': 0.823, 'total_flos': 2782515953664.0, 'train_loss': 50.147274017333984, 'epoch': 0.00032102728731942215})

In [17]:
# Convert LoRA configuration to a serializable format
lora_config_dict = lora_config.__dict__.copy()
lora_config_dict['target_modules'] = list(lora_config_dict['target_modules'])


In [18]:
torch.save(peft_model.state_dict(), '/content/fine_tuned_lora_model.pth')


In [19]:
import json
# Save the LoRA configuration as JSON
with open('lora_config.json', 'w') as f:
    json.dump(lora_config_dict, f)

In [20]:
# To load the model and configuration later
# Load the LoRA configuration from JSON
with open('/content/lora_config.json', 'r') as f:
    lora_config_dict = json.load(f)
    lora_config_dict['target_modules'] = set(lora_config_dict['target_modules'])
    lora_config = LoraConfig(**lora_config_dict)

In [21]:


# Save the base model configuration and tokenizer
original_model.save_pretrained("/content/")
tokenizer.save_pretrained("/content/")

('/content/tokenizer_config.json',
 '/content/special_tokens_map.json',
 '/content/spiece.model',
 '/content/added_tokens.json',
 '/content/tokenizer.json')

In [23]:
lora_config_dict

{'peft_type': 'LORA',
 'auto_mapping': None,
 'base_model_name_or_path': 'google/flan-t5-base',
 'revision': None,
 'task_type': 'SEQ_2_SEQ_LM',
 'inference_mode': False,
 'r': 32,
 'target_modules': {'q', 'v'},
 'lora_alpha': 32,
 'lora_dropout': 0.05,
 'fan_in_fan_out': False,
 'bias': 'none',
 'use_rslora': False,
 'modules_to_save': None,
 'init_lora_weights': True,
 'layers_to_transform': None,
 'layers_pattern': None,
 'rank_pattern': {},
 'alpha_pattern': {},
 'megatron_config': None,
 'megatron_core': 'megatron.core',
 'loftq_config': {},
 'use_dora': False,
 'layer_replication': None}

In [53]:
# Define your base model
model_name = "/content/model/"
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Some weights of the model checkpoint at /content/model/ were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.0.SelfAttention.q.base_layer.weight', 'decoder.block.0.layer.0.SelfAttention.q.lora_A.default.weight', 'decoder.block.0.layer.0.SelfAttention.q.lora_B.default.weight', 'decoder.block.0.layer.0.SelfAttention.v.base_layer.weight', 'decoder.block.0.layer.0.SelfAttention.v.lora_A.default.weight', 'decoder.block.0.layer.0.SelfAttention.v.lora_B.default.weight', 'decoder.block.0.layer.1.EncDecAttention.q.base_layer.weight', 'decoder.block.0.layer.1.EncDecAttention.q.lora_A.default.weight', 'decoder.block.0.layer.1.EncDecAttention.q.lora_B.default.weight', 'decoder.block.0.layer.1.EncDecAttention.v.base_layer.weight', 'decoder.block.0.layer.1.EncDecAttention.v.lora_A.default.weight', 'decoder.block.0.layer.1.EncDecAttention.v.lora_B.default.weight', 'decoder.block.1.layer.0.SelfAttention.q.base_layer.weight', 'decoder.block.1.layer.0.SelfAttention.q.lora_

In [26]:
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    "/content/",
).to(torch_device)
original_model = original_model.to(torch_device)

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/'. Use `repo_type` argument if needed.

In [60]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids

original_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)).to(torch_device)
original_text_output = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

peft_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_text_output = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

dash_line = "-".join("" for x in range(100))
print(dash_line)
print(f"Input Prompt:\n{prompt}")
print(dash_line)
print(f"Baseline Human Summary:\n{human_baseline_summary}\n")
print(dash_line)
print(f"Original Model Generation - Zero Shot: \n{original_text_output}")
print(dash_line)
print(f"Instruct Model Generation - Zero Shot: \n{peft_text_output}")

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

In [None]:
dialogue = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []
for _, dialogue in enumerate(dialogue):
    prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
    """
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids

    peft_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_text_output = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human', 'original', 'peft'])

In [None]:
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)

print(f"Original Model: \n{original_model_results}")
print(f"Instruct Model: \n{instruct_model_results}")
print(f"Peft Model: \n{peft_model_results}")

Original Model: 
{'rouge1': 0.261052062988671, 'rouge2': 0.08531489481944488, 'rougeL': 0.224821552384684, 'rougeLsum': 0.22788611265447228}
Instruct Model: 
{'rouge1': 0.38857220563277894, 'rouge2': 0.13135692283806472, 'rougeL': 0.28167162470172985, 'rougeLsum': 0.28344342480768214}
Peft Model: 
{'rouge1': 0.33176278482581173, 'rouge2': 0.08811333505050914, 'rougeL': 0.2509677309788697, 'rougeLsum': 0.25262149176905513}


# **Knowledge Grounding**

Knowledge grounding in Natural Language Processing (NLP) refers to the process of connecting or linking information in text to real-world entities or concepts. It involves ensuring that the language model understands and can relate the information it processes to factual, contextual, or external knowledge.

For instance, if a sentence mentions "Einstein's theory of relativity," knowledge grounding would involve the model recognizing that Einstein refers to a renowned physicist and the theory of relativity is a fundamental concept in physics.

Knowledge grounding is crucial for NLP applications that require a deeper understanding of the world, as it enables the model to go beyond surface-level patterns in text and make meaningful inferences based on its understanding of the underlying concepts. This is particularly important in tasks like question-answering, where the model needs to provide accurate and contextually relevant responses.
Knowledge grounding with Retrieval Augmented Generation (RAG) is implemented to mitigate hallucinations and provide trustworthy and reliable responses. This is achieved by incorporating information from external sources to validate and support the generated text.



## To know more

### RAG

https://towardsdatascience.com/build-industry-specific-llms-using-retrieval-augmented-generation-af9e98bb6f68

### Langchain

https://python.langchain.com/docs/get_started/introduction

### OpenAI API Key

Note: Please note that the use of the OpenAI API may require the utilization of allocated free credits or additional purchases for implementing the project. Kindly take note of the free credits limit provided for your usage.

https://www.howtogeek.com/885918/how-to-get-an-openai-api-key/#:~:text=Go%20to%20OpenAI's%20Platform%20website,generate%20a%20new%20API%20key.

In [None]:
!streamlit run llm_app.py

[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Local URL: [0m[1mhttp://localhost:8501[0m
[34m  Network URL: [0m[1mhttp://172.27.12.174:8501[0m
[0m
[34m[1m  For better performance, install the Watchdog module:[0m

  $ xcode-select --install
  $ pip install watchdog
            [0m
a5255313ec7d762337e1d1f02b6be7c9
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mSystem: Use the following pieces of context to answer the users question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Printable Design for Apparel Products

Available in: AU, CA, EU, NZ, US with minimum quantity per order of 1 piece and maximum of 50 pieces. Printing goes direct to garment printing. Adult T Shirts for Men and Women a

## **Conclusion**

In today's world, we see language models doing some pretty amazing things. They help businesses understand text on a big scale and make our online experiences better.

This project was all about getting to know these language models inside out. We looked at how they work, how we can use them better, and even gave them a bit of a tune-up. We made them smarter by teaching them how to answer questions and summarize conversations.

We also built a cool shopping chatbot. This bot is smart because it doesn't make up stuff; it gives you real info about products.

With this project, you've learned a lot about these language models, and you're all set to use them for exciting tasks, from writing text to building smart applications.


Start experimenting !!

## **Interview Questions**


* Can you explain the significance of Large Language Models in today's data-driven world?
* What is the main purpose of prompt engineering in the context of LLMs?
* How does fine-tuning enhance the performance of a pre-trained language model?
* What is the difference between full fine-tuning and Parameter Efficient Fine Tuning (PEFT)?
* Can you explain the concept of Retrieval Augmented Generation (RAG) and how it was implemented in the project?
* How does knowledge grounding improve the reliability of responses generated by the chatbot?
* What is Reinforcement Learning from Human Feedback (RLHF)?

* How did you evaluate the performance of the models in this project, and what metrics did you use?
How did you handle the potential issue of hallucinations in the chatbot's responses?
* What challenges did you encounter during the implementation of fine-tuning techniques, and how did you overcome them?
* How would you address potential ethical concerns related to the use of language models in real-world applications?
* What are some potential future improvements or extensions you would consider for this project?

In [3]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset, load_metric
from peft import LoraConfig, get_peft_model, TaskType



In [4]:
# Load the dataset
dataset = load_dataset('knkarthick/dialogsum')



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [10]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base').to(torch_device)



In [11]:
# Prepare the model for PEFT with LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,  # Rank of the adaptation
    lora_alpha=32,  # Alpha parameter for LoRA
    lora_dropout=0.05,  # Dropout for LoRA,
    bias='none',  # Bias type for LoRA,
    target_modules=['q', 'v']  # Target modules for LoRA
)
model = get_peft_model(model, lora_config)




In [12]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

In [13]:
# The dataset actually contains 3 diff splits: train, validation, test
# The tokenize_function code is handling all data accross all splits in batches

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [14]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./model_save',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

In [15]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)



In [16]:
# Train and save the model
trainer.train()
trainer.save_model('./model_save')



Step,Training Loss
10,49.6283
20,50.3885
30,49.403
40,49.8004
50,49.5841
60,49.8598
70,48.982
80,48.8152
90,49.1589
100,48.2318




Step,Training Loss
10,49.6283
20,50.3885
30,49.403
40,49.8004
50,49.5841
60,49.8598
70,48.982
80,48.8152
90,49.1589
100,48.2318




In [17]:
# For loading the model, you can use the following:
loaded_model = AutoModelForSeq2SeqLM.from_pretrained('./model_save')





In [18]:
# Inference
def generate_summary(batch):
    inputs = tokenizer(batch['dialogue'], return_tensors='pt', padding=True, truncation=True)
    outputs = loaded_model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'])
    batch['generated_summary'] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return batch



In [19]:
results = dataset['validation'].map(generate_summary, batched=True, batch_size=8)



Map:   0%|          | 0/500 [00:00<?, ? examples/s]



In [21]:
pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=e07473d0d670199b9f093186744ec69aa1b35a449138b0080b4f024adb20371d
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [22]:
# Calculate ROUGE scores
rouge = load_metric('rouge')
scores = rouge.compute(predictions=results['generated_summary'], references=results['summary'])

print(scores)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'rouge1': AggregateScore(low=Score(precision=0.4817555474182683, recall=0.2592502907067705, fmeasure=0.3236973929825437), mid=Score(precision=0.4996706455309398, recall=0.27257498737180663, fmeasure=0.33792473505834475), high=Score(precision=0.5170327857844771, recall=0.2858596723722431, fmeasure=0.3512306555784262)), 'rouge2': AggregateScore(low=Score(precision=0.18806036907536938, recall=0.09793891439334962, fmeasure=0.12326742470385582), mid=Score(precision=0.20409839604839625, recall=0.10735714197495547, fmeasure=0.1345998730673798), high=Score(precision=0.22187068070818075, recall=0.11831323041312973, fmeasure=0.1467474038441383)), 'rougeL': AggregateScore(low=Score(precision=0.4154926014549176, recall=0.2251146229465615, fmeasure=0.28043059898844136), mid=Score(precision=0.4320726011733367, recall=0.2366806373171207, fmeasure=0.29265945110805147), high=Score(precision=0.44946870724455285, recall=0.24903701568173156, fmeasure=0.30491159305058696)), 'rougeLsum': AggregateScore(low

In [24]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base').to(torch_device)


In [30]:
torch_device

device(type='cuda')

In [33]:
loaded_model.to('cuda:0')

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): lora.Linear(
                (base_layer): Linear(in_features=768, out_features=768, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=768, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k): Linear(in_features=768, out_features=768, bias=False)
              

In [34]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to('cuda:0')

original_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)).to('cuda:0')
original_text_output = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

peft_outputs = loaded_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)).to('cuda:0')
peft_text_output = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

dash_line = "-".join("" for x in range(100))
print(dash_line)
print(f"Input Prompt:\n{prompt}")
print(dash_line)
print(f"Baseline Human Summary:\n{human_baseline_summary}\n")
print(dash_line)
print(f"Original Model Generation - Zero Shot: \n{original_text_output}")
print(dash_line)
print(f"Instruct Model Generation - Zero Shot: \n{peft_text_output}")

---------------------------------------------------------------------------------------------------
Input Prompt:

Summarize the following conversation

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

--------------------------------------------------------------------

In [35]:
dialogue = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []
for _, dialogue in enumerate(dialogue):
    prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
    """
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to('cuda:0')

    peft_outputs = loaded_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)).to('cuda:0')
    peft_text_output = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human', 'original', 'peft'])

In [37]:
df.head()

Unnamed: 0,human,original,peft


In [40]:
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)

print(f"Peft Model: \n{peft_model_results}")

Peft Model: 
{'rouge1': AggregateScore(low=Score(precision=0.21454029800781568, recall=0.31216363826232246, fmeasure=0.24679564322523886), mid=Score(precision=0.24807690906275334, recall=0.42296787720936324, fmeasure=0.3001855426906946), high=Score(precision=0.2844887165350385, recall=0.520475745401829, fmeasure=0.3539571484485634)), 'rouge2': AggregateScore(low=Score(precision=0.028995005473453763, recall=0.05783471511074065, fmeasure=0.038307021604553675), mid=Score(precision=0.05528424640493605, recall=0.10339573717476194, fmeasure=0.07005298598963097), high=Score(precision=0.08067416156640293, recall=0.14414163614163614, fmeasure=0.09667834828451334)), 'rougeL': AggregateScore(low=Score(precision=0.17469492400641287, recall=0.2530867654876943, fmeasure=0.20480030957275383), mid=Score(precision=0.19257718917097078, recall=0.32340460852070757, fmeasure=0.2296022285068855), high=Score(precision=0.20811826772424596, recall=0.37957985427025676, fmeasure=0.2521499415260069)), 'rougeLsum'

In [41]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [44]:
val_dataset = dataset.map(preprocess_function, batched=True)


NameError: name 'preprocess_function' is not defined

In [47]:
#references = dataset['validation']["summary"]
#hypotheses = generate_summary(val_dataset[:10])

rouge_output = rouge.compute(predictions=results['generated_summary'], references=results['summary'])

# Convert ROUGE output to DataFrame
rouge_df = pd.DataFrame({
    "rouge-1_precision": [rouge_output["rouge1"].mid.precision],
    "rouge-1_recall": [rouge_output["rouge1"].mid.recall],
    "rouge-1_fmeasure": [rouge_output["rouge1"].mid.fmeasure],
    "rouge-2_precision": [rouge_output["rouge2"].mid.precision],
    "rouge-2_recall": [rouge_output["rouge2"].mid.recall],
    "rouge-2_fmeasure": [rouge_output["rouge2"].mid.fmeasure],
    "rouge-l_precision": [rouge_output["rougeL"].mid.precision],
    "rouge-l_recall": [rouge_output["rougeL"].mid.recall],
    "rouge-l_fmeasure": [rouge_output["rougeL"].mid.fmeasure],
})
rouge_df

Unnamed: 0,rouge-1_precision,rouge-1_recall,rouge-1_fmeasure,rouge-2_precision,rouge-2_recall,rouge-2_fmeasure,rouge-l_precision,rouge-l_recall,rouge-l_fmeasure
0,0.499671,0.272575,0.337925,0.204098,0.107357,0.1346,0.432073,0.236681,0.292659
