# Fine-Tuning and Retrieval-Augmented Generation (RAG) for FAQ Assistance

This notebook demonstrates a complete workflow for fine-tuning a model with **Unsloth** and for building a **Retrieval-Augmented Generation (RAG)** pipeline. Section 3 combines both methods, with the aim of improving the accuracy and relevance of responses for FAQ assistance, and Section 4 evaluates the generated responses. The notebook is exploratory: it compares the approaches before the production chatbot is implemented.

---

## 1. Fine-Tuning with Unsloth

### Install Required Packages


In [17]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


### Load and Configure Model for Fine-Tuning

We load a pre-quantized, pre-trained model with `FastLanguageModel` to save memory and compute.
- **Model Selection**: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`, a 4-bit build of Llama 3.1 8B Instruct.
- **Quantization**: 4-bit weights reduce GPU memory usage, which lets the model fit on a single Tesla T4.
- **Fine-Tuning Parameters**: LoRA (Low-Rank Adaptation) with rank 16 on the attention and MLP projections, so only small adapter matrices are trained.
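
As a rough sanity check on how small the LoRA adapters are, the sketch below counts their parameters. It assumes the standard LoRA parameterization (two matrices of shape `(in, r)` and `(r, out)` per adapted module) and uses the layer shapes that the model printout in this notebook reports for the 8B model. The helper name `lora_param_count` is hypothetical.

```python
# Layer shapes (in_features, out_features) as reported by the model printout
# for Meta-Llama-3.1-8B; each decoder layer has one of each projection.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_param_count(shapes, rank, num_layers):
    """Count LoRA parameters: A is (in, r) and B is (r, out) per module."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(LAYER_SHAPES, rank=16, num_layers=32)
print(f"{total:,}")  # 41,943,040
```

The result agrees with the "Number of trainable parameters" line in the training log, which is a small fraction of the 8B base weights.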

---


In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`
Unsloth 2024.10.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Preparing Dataset for Fine-Tuning

We load a JSON dataset and format it to include conversational pairs (`user` and `assistant` roles) needed for instruction-following models.


In [3]:
from datasets import load_dataset

dataset = load_dataset('json', data_files='/content/FAQ.json', split='train')

def create_conversations(examples):
    conversations = [
        [{"role": "user", "content": instr}, {"role": "assistant", "content": output}]
        for instr, output in zip(examples["instruction"], examples["output"])
    ]
    return {"conversations": conversations}

dataset = dataset.map(create_conversations, batched=True)
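
To make the batched mapping concrete, here is a minimal sketch that runs the same conversion on a toy batch. With `batched=True`, the mapping function receives a dictionary of columns (lists) rather than a single record. The two FAQ entries below are invented for illustration and are not taken from `FAQ.json`.

```python
# Hypothetical two-record batch in the column-oriented form that
# `Dataset.map(..., batched=True)` passes to the mapping function.
toy_batch = {
    "instruction": ["Do I need a visa?", "Is there a canteen?"],
    "output": ["It depends on your nationality.", "Yes."],
}

def create_conversations(examples):
    conversations = [
        [{"role": "user", "content": instr}, {"role": "assistant", "content": out}]
        for instr, out in zip(examples["instruction"], examples["output"])
    ]
    return {"conversations": conversations}

result = create_conversations(toy_batch)
print(result["conversations"][0])
```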



### Formatting Prompts for Training

We apply the Llama 3.1 chat template so that each conversation is rendered as a single training string with the role headers the model expects.


In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    texts = [tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=False) for conv in examples["conversations"]]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
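
For intuition, the sketch below renders a conversation in a simplified Llama-3-style layout. It is only an approximation: the real `llama-3.1` template applied by `get_chat_template` may add a system block and differ in other details. The function name `render_llama3_style` is hypothetical.

```python
def render_llama3_style(conversation):
    """Simplified approximation of a Llama-3-style chat rendering."""
    text = "<|begin_of_text|>"
    for message in conversation:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

# Invented example conversation, not taken from FAQ.json.
example = [
    {"role": "user", "content": "Do I need a visa?"},
    {"role": "assistant", "content": "It depends on your nationality."},
]
print(render_llama3_style(example))
```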



### Fine-Tuning the Model

We use **SFTTrainer** for supervised fine-tuning. Memory use stays low thanks to the 4-bit base model, LoRA adapters, gradient checkpointing, and an 8-bit AdamW optimizer.
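
The batch-size and epoch figures in the training log ("Total batch size = 8", "Num Epochs = 12") can be reproduced with a little arithmetic. The sketch below is one plausible reconstruction and not necessarily the trainer's exact internal logic. It assumes that the last partial batch is kept and that only full groups of accumulated batches count as an update within an epoch.

```python
import math

num_examples = 45          # dataset size reported in the training log
per_device_batch_size = 2  # per_device_train_batch_size
grad_accum_steps = 4       # gradient_accumulation_steps
max_steps = 60

# Examples consumed per optimizer update on a single GPU.
effective_batch_size = per_device_batch_size * grad_accum_steps

# Assumption: the last partial batch is kept, and only full groups of
# accumulated batches count as an update within an epoch.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
num_epochs = math.ceil(max_steps / updates_per_epoch)

print(effective_batch_size, updates_per_epoch, num_epochs)  # 8 5 12
```

With only 45 examples, 60 steps mean the model sees each example about a dozen times, which is worth keeping in mind when judging overfitting.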


In [5]:
# Import necessary libraries
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Step 1: Initialize Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

# Step 2: Fine-Tune on Responses Only
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Debugging/Preview Data
# Print the decoded inputs and labels for one sample; masked (-100) label positions are shown as spaces
print(tokenizer.decode(trainer.train_dataset[5]["input_ids"]))
space = tokenizer(" ", add_special_tokens=False).input_ids[0]
print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]]))

# Start the training process and capture stats
trainer_stats = trainer.train()
model.save_pretrained('finetuned_model')
tokenizer.save_pretrained('finetuned_model')
torch.cuda.empty_cache()
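
The idea behind training on responses only can be sketched with plain lists: every label up to and including the assistant header is set to `-100`, so the loss is computed on the answer tokens alone. This is a simplified single-turn illustration with made-up token IDs, not Unsloth's implementation. The helper `mask_prompt_tokens` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_prompt_tokens(input_ids, response_marker):
    """Return labels where everything up to and including the first
    occurrence of `response_marker` is replaced by IGNORE_INDEX."""
    n, m = len(input_ids), len(response_marker)
    for start in range(n - m + 1):
        if input_ids[start:start + m] == response_marker:
            cut = start + m
            return [IGNORE_INDEX] * cut + input_ids[cut:]
    # Marker not found: nothing to train on in this example.
    return [IGNORE_INDEX] * n

# Hypothetical token IDs: 1-3 are the user turn, [90, 91] marks the
# start of the assistant turn, 7-9 are the assistant's answer.
input_ids = [1, 2, 3, 90, 91, 7, 8, 9]
labels = mask_prompt_tokens(input_ids, response_marker=[90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 7, 8, 9]
```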


max_steps is given, it will override any value given in num_train_epochs


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 45 | Num Epochs = 12
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


| Step | Training Loss |
|------|---------------|
| 1 | 3.1867 |
| 2 | 3.0448 |
| 3 | 3.6241 |
| 4 | 2.8877 |
| 5 | 2.3398 |
| 6 | 2.3842 |
| 7 | 2.0795 |
| 8 | 2.3996 |
| 9 | 2.0777 |
| 10 | 2.0921 |


### Generating Responses to Questions

After fine-tuning, we initialize the model for inference and generate responses for a list of FAQs.


In [6]:
import csv
from unsloth.chat_templates import get_chat_template

# Step 1: Initialize Model and Tokenizer for Inference
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
model = FastLanguageModel.for_inference(model)

# Step 2: Define List of Questions for Testing
questions = [
    "Am I allowed to work as a student?",
    "Do I need a French social insurance?",
    "Does the university have accommodation on the campus?",
    "How do I get to the university?",
    "Is there a canteen at the university?",
    "What is the accommodation like?",
    "What is the language of teaching?",
    "Once I complete my studies, when do I receive my transcript?",
    "What can I do after I graduate?",
    "Do I need a visa?"
]

# Placeholder to store results
results = []
method = "Finetuning"

# Step 3: Define Function to Generate Responses
def generate_response(question):
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=64,
        use_cache=True,
        temperature=1.5,
        min_p=0.1,
    )
    # Decode only the newly generated tokens, so the prompt is never part of the answer
    generated_tokens = outputs[:, inputs.shape[1]:]
    response_cleaned = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0].strip()
    return response_cleaned

# Step 4: Generate and Store Responses
for idx, question in enumerate(questions, start=1):
    response = generate_response(question)
    results.append({
        "Method": method,
        "Question": idx,
        "Input": question,
        "Output": response
    })

# Step 5: Save Responses to CSV
with open("questions_responses.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["Method", "Question", "Input", "Output"])
    writer.writeheader()
    writer.writerows(results)

print("Responses saved to questions_responses.csv")


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Responses saved to questions_responses.csv


## 2. Retrieval-Augmented Generation (RAG)

### Install Additional Libraries for RAG

We install necessary packages for document loading, text splitting, and vector storage.


In [7]:
!pip install langchain chromadb pdfplumber sentence-transformers accelerate bitsandbytes einops
!pip install -U langchain-community


### Load and Process PDF Document

We use **LangChain** to load the PDF and split it into overlapping chunks (500 characters with a 200-character overlap) for context-based retrieval.


In [8]:
import uuid
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PDFPlumberLoader("FAQ.pdf")
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
chunks = text_splitter.split_documents(pages)
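
To illustrate what `chunk_size=500` and `chunk_overlap=200` mean, the sketch below uses a plain character sliding window. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its real chunks will differ. The function `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(1200))  # 1200 placeholder characters
demo_chunks = sliding_window_chunks(text, chunk_size=500, chunk_overlap=200)
print([len(c) for c in demo_chunks])  # [500, 500, 500, 300]
```

Each window starts 300 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That overlap helps keep an answer that straddles a boundary retrievable.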


### Set up Embeddings and Vectorstore for RAG

Embeddings are computed using `HuggingFaceEmbeddings` and stored in a Chroma vector database for fast similarity-based retrieval.
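
Similarity-based retrieval can be sketched in a few lines of plain Python. The example uses cosine similarity as a common illustrative choice; Chroma's actual distance metric and index structure may differ. The vectors are invented 3-dimensional stand-ins, whereas `all-MiniLM-L6-v2` produces 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings for four chunks and one query.
doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.8, 0.0, 0.2]
print(top_k(query_vec, doc_vecs, k=2))  # [0, 1]
```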


In [9]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

def create_vectorstore(chunks, embedding_function, vectorstore_path="vectorstore"):
    ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in chunks]
    unique_chunks = list({id: chunk for id, chunk in zip(ids, chunks)}.values())
    vectorstore = Chroma.from_documents(unique_chunks, embedding=embedding_function, persist_directory=vectorstore_path)
    vectorstore.persist()
    return vectorstore

vectorstore = create_vectorstore(chunks=chunks, embedding_function=embedding_function)
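
The deduplication step relies on `uuid.uuid5` being deterministic: the same text always yields the same ID, so repeated chunks collapse onto one dictionary key. A minimal sketch with invented strings (the helper `dedupe_by_content` is hypothetical):

```python
import uuid

def dedupe_by_content(texts):
    """Keep one entry per distinct text, keyed by a deterministic UUIDv5."""
    by_id = {}
    for text in texts:
        content_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, text))
        by_id[content_id] = text  # identical text maps to the same key
    return by_id

sample_texts = ["Visa rules", "Canteen hours", "Visa rules"]
unique = dedupe_by_content(sample_texts)
print(len(unique))  # 2
```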




### Define Prompt and Retrieve Context

We define a short prompt template with `context` and `question` slots, and a retriever that returns the three most similar chunks (`k=3`) for each question.


In [10]:
from langchain.prompts import PromptTemplate

PROMPT = """
You are a concise and informative assistant. Use the context below to answer the question accurately.
Context: {context}
Question: {question}
Answer:
"""

prompt = PromptTemplate(input_variables=["context", "question"], template=PROMPT)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
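
The sketch below shows how the template is filled once chunks have been retrieved, using plain `str.format` in place of `PromptTemplate`. The chunks and the answer are invented placeholders. It also assumes that the generated text echoes the prompt, which is why the text after the final `Answer:` label is extracted.

```python
PROMPT_TEMPLATE = (
    "You are a concise and informative assistant. "
    "Use the context below to answer the question accurately.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:\n"
)

# Hypothetical retrieved chunks standing in for the retriever's output.
retrieved_chunks = ["Chunk about visas.", "Chunk about residence permits."]
context = "\n\n".join(retrieved_chunks)

filled = PROMPT_TEMPLATE.format(context=context, question="Do I need a visa?")
print(filled)

# Assumption: the model output repeats the prompt and then continues after
# the final "Answer:" label, so only the text after that label is kept.
completion = filled + "It depends on your nationality."
print(completion.split("Answer:")[-1].strip())  # It depends on your nationality.
```

Splitting on the label is fragile if a retrieved chunk or the answer itself contains the string `Answer:`, so it is worth spot-checking a few outputs.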


### Define the LLM


In [11]:
# Load language model and configure it for inference
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True
)
FastLanguageModel.for_inference(model)




LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,),

### Set up Language Model Pipeline

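We configure the text-generation pipeline with moderate sampling (`temperature=0.5`, `top_p=0.9`) and a mild repetition penalty (`repetition_penalty=1.1`). Note that `max_length=512` counts the prompt tokens as well as the generated ones, so long retrieved contexts leave less room for the answer.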


In [12]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.1,
    device_map='auto',
    trust_remote_code=True
)

llm = HuggingFacePipeline(pipeline=pipe)
llm_chain = LLMChain(llm=llm, prompt=prompt)




### Define List of Questions for Testing RAG

These questions are intended to test the model's ability to retrieve relevant information and generate contextually accurate responses.


In [13]:
# Define list of questions to process
questions = [
    "Am I allowed to work as a student?",
    "Do I need a French social insurance?",
    "Does the university have accommodation on the campus?",
    "How do I get to the university?",
    "Is there a canteen at the university?",
    "What is the accommodation like?",
    "What is the language of teaching?",
    "Once I complete my studies, when do I receive my transcript?",
    "What can I do after I graduate?",
    "Do I need a visa?"
]


### Generate and Save Responses with RAG

For each question, we use a retriever to fetch relevant document chunks. The language model then leverages these chunks to answer questions, enhancing the accuracy and relevance of its responses.


In [14]:
import pandas as pd

# Load existing data to append new results
csv_file = "questions_responses.csv"
existing_data = pd.read_csv(csv_file)
method = "RAG"

# Iterate over questions, generate responses, and append to existing data
for idx, question in enumerate(questions, start=1):
    # Retrieve relevant document chunks for the question
    # (`invoke` replaces the deprecated `get_relevant_documents`)
    docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in docs])

    # Run the language model with the prompt and relevant context
    # (`invoke` replaces the deprecated `run`; the generated text is under "text")
    answer = llm_chain.invoke({"context": context, "question": question})["text"]

    # Clean the answer text
    answer_cleaned = answer.split("Answer:")[-1].strip()

    # Create a new DataFrame row for this response
    new_row = pd.DataFrame([{
        "Method": method,
        "Question": idx,
        "Input": question,
        "Output": answer_cleaned
    }])

    # Append the new row to the existing data
    existing_data = pd.concat([existing_data, new_row], ignore_index=True)

# Save the updated responses back to the CSV file
existing_data.to_csv(csv_file, index=False, encoding="utf-8")
print("Responses appended to questions_responses.csv")




Responses appended to questions_responses.csv


## 3. Combining Fine-Tuning and RAG



### Installing the packages

In [1]:
# Install necessary packages
!pip install unsloth langchain chromadb pdfplumber python-dotenv pandas transformers
!pip install -U langchain-community
!pip install sentence-transformers
!pip install bitsandbytes
!pip install accelerate
!pip install einops



### Importing the libraries

In [2]:
# Import required libraries
import torch
import csv
import pandas as pd
from datasets import load_dataset
from transformers import TrainingArguments, DataCollatorForSeq2Seq, pipeline
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from trl import SFTTrainer
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import uuid



### Loading the model and the data

In [3]:
# Set parameters
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load the base model and tokenizer
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Prepare the model for PEFT (Parameter-Efficient Fine-Tuning)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Initialize the tokenizer with the chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Load your JSON file into a dataset
dataset = load_dataset('json', data_files='FAQ.json', split='train')



### Formatting the data

In [4]:
# Function to create the 'conversations' field from 'instruction' and 'output'
def create_conversations(examples):
    conversations = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        convo = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output}
        ]
        conversations.append(convo)
    return {"conversations": conversations}

# Apply the function to create 'conversations' field
dataset = dataset.map(create_conversations, batched=True)

# Function to format prompts
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

# Apply formatting function
dataset = dataset.map(formatting_prompts_func, batched=True)



### Training the model

In [5]:
# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

# Adjust training to focus on responses
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Start training
trainer.train()

# Save the fine-tuned model
model.save_pretrained('finetuned_model')
tokenizer.save_pretrained('finetuned_model')

# Release GPU memory
del trainer
torch.cuda.empty_cache()


| Step | Training Loss |
|------|---------------|
| 1 | 3.1867 |
| 2 | 3.0448 |
| 3 | 3.6241 |
| 4 | 2.8877 |
| 5 | 2.3398 |
| 6 | 2.3842 |
| 7 | 2.0795 |
| 8 | 2.3996 |
| 9 | 2.0777 |
| 10 | 2.0921 |


### Using RAG with the fine-tuned model

In [6]:
# Define the method name
method = "FineTuning + RAG"

# Load the PDF file
loader = PDFPlumberLoader("FAQ.pdf")
pages = loader.load()

# Split the text for better context continuity
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
chunks = text_splitter.split_documents(pages)

# Define the embedding function
def get_embedding_function():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return embeddings

embedding_function = get_embedding_function()

# Function to create a vectorstore from deduplicated chunks
def create_vectorstore(chunks, embedding_function, vectorstore_path):
    # Deterministic IDs derived from chunk text; identical chunks collapse to one entry
    unique_by_id = {
        str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)): doc for doc in chunks
    }

    # Create Chroma database with unique documents
    vectorstore = Chroma.from_documents(
        documents=list(unique_by_id.values()),
        ids=list(unique_by_id.keys()),
        embedding=embedding_function,
        persist_directory=vectorstore_path
    )
    vectorstore.persist()
    return vectorstore

# Create and load vectorstore
vectorstore_path = "vectorstore"
vectorstore = create_vectorstore(chunks=chunks, embedding_function=embedding_function, vectorstore_path=vectorstore_path)
vectorstore = Chroma(persist_directory=vectorstore_path, embedding_function=embedding_function)

# Define prompt template
PROMPT = """
You are a concise and informative assistant. Use the context below to answer the question briefly and accurately.
If the answer is unclear, say "I don’t have enough information."
Context: {context}
Question: {question}
Answer:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=PROMPT,
)

# Set up retriever to get relevant chunks
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Load the fine-tuned model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='finetuned_model',
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
model = FastLanguageModel.for_inference(model)

# Set up the LLM pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.1,
    device_map='auto',
    trust_remote_code=True,
)

llm = HuggingFacePipeline(pipeline=pipe)
llm_chain = LLMChain(llm=llm, prompt=prompt)

# List of questions to process
questions = [
    "Am I allowed to work as a student?",
    "Do I need a French social insurance?",
    "Does the university have accommodation on the campus?",
    "How do I get to the university?",
    "Is there a canteen at the university?",
    "What is the accommodation like?",
    "What is the language of teaching?",
    "Once I complete my studies, when do I receive my transcript?",
    "What can I do after I graduate?",
    "Do I need a visa?"
]


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausal

### Getting the results

In [7]:
# Placeholder for storing results
results = []

# Iterate over each question and generate responses
for idx, question in enumerate(questions, start=1):
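    # `invoke` replaces the deprecated `get_relevant_documents`
    docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in docs])

    # Run the chain with the prompt and relevant context
    # (`invoke` replaces the deprecated `run`; the generated text is under "text")
    answer = llm_chain.invoke({"context": context, "question": question})["text"]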

    # Extract only the answer part after "Answer:" label
    answer_cleaned = answer.split("Answer:")[-1].strip()

    results.append({
        "Method": method,
        "Question": idx,
        "Input": question,
        "Output": answer_cleaned
    })

# Write results to CSV
csv_file = "questions_responses.csv"

# Check if the CSV file exists
try:
    existing_data = pd.read_csv(csv_file)
    # Append new results
    new_data = pd.DataFrame(results)
    updated_data = pd.concat([existing_data, new_data], ignore_index=True)
    updated_data.to_csv(csv_file, index=False, encoding="utf-8")
except FileNotFoundError:
    # If the file doesn't exist, create it
    with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["Method", "Question", "Input", "Output"])
        writer.writeheader()
        writer.writerows(results)

print("Responses saved to questions_responses.csv")




Responses saved to questions_responses.csv


## 4. Evaluation of Generated Responses

### Calculating Flesch Reading Ease Score

The Flesch Reading Ease score estimates how easy each generated response is to read; higher scores indicate easier text.
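
For reference, the score is a linear function of average sentence length and average syllables per word. The sketch below computes it from raw counts. `textstat` derives those counts with its own sentence and syllable heuristics, so its results on real text may differ slightly. The function name and the counts are hypothetical.

```python
def flesch_reading_ease_from_counts(num_words, num_sentences, num_syllables):
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical counts: 20 words, 2 sentences, 30 syllables.
score = flesch_reading_ease_from_counts(20, 2, 30)
print(round(score, 3))  # 69.785
```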


In [8]:
! pip install textstat



In [9]:
import pandas as pd
from textstat import flesch_reading_ease

# Score the readability of every generated response and save the result
data = pd.read_csv("questions_responses.csv")
data['Flesch Reading Ease'] = data["Output"].apply(flesch_reading_ease)
data.to_csv("questions_responses_flesch.csv", index=False)
data.head()


| | Method | Question | Input | Output | Flesch Reading Ease |
|---|---|---|---|---|---|
| 0 | Finetuning | 1 | Am I allowed to work as a student? | Yes, it’s allowed, however, it is difficult to... | 62.68 |
| 1 | Finetuning | 2 | Do I need a French social insurance? | Yes, French social insurance is mandatory in F... | 65.22 |
| 2 | Finetuning | 3 | Does the university have accommodation on the ... | Yes, the student accommodation is managed by t... | 50.53 |
| 3 | Finetuning | 4 | How do I get to the university? | IMT Mines Alès is located in southern France b... | 49.15 |
| 4 | Finetuning | 5 | Is there a canteen at the university? | Yes, there are two places to eat at the univer... | 53.17 |


### Evaluating Response Quality with ROUGE, BLEU, and METEOR Scores

This step scores each generated response against its ground-truth answer from `FAQ.json` using three lexical-overlap metrics: ROUGE (ROUGE-1, ROUGE-2, ROUGE-L), BLEU, and METEOR.
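
To make the overlap idea concrete, here is a minimal sketch of a ROUGE-1 style F-measure computed from clipped unigram matches. `rouge1_f` is a hypothetical helper: it lowercases and splits on whitespace, whereas the `rouge_score` package applies its own tokenizer and optional stemming, so the numbers may differ from the library's.

```python
from collections import Counter

def rouge1_f(generated, reference):
    """Unigram-overlap F-measure on lowercased, whitespace-split tokens."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches)
    overlap = sum((gen & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())  # share of generated tokens found in the reference
    recall = overlap / sum(ref.values())     # share of reference tokens covered by the output
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat", "the cat sat on the mat"))
```

In this example every generated token appears in the reference (precision 1.0), but only half of the reference is covered (recall 0.5), so the F-measure lands between the two.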


In [11]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=ef491c30b01e29cb8db69b629a0b14d6d6d359028d500b9ccd2d09ff7fa58884
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [13]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

import pandas as pd
import json
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

# Load the CSV file with generated outputs
csv_file_path = "/content/questions_responses_flesch.csv"
data = pd.read_csv(csv_file_path)

# Load the JSON file with expected (ground truth) outputs
json_file_path = "/content/FAQ.json"
with open(json_file_path, 'r', encoding='utf-8') as f:
    ground_truth_data = json.load(f)

# Define functions to compute ROUGE, BLEU, and METEOR
# The scorer is named `scorer` so it does not shadow the imported `rouge_scorer` module,
# which would break this cell when it is run a second time
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def compute_rouge(generated, reference):
    scores = scorer.score(reference, generated)
    return {
        "ROUGE-1": scores['rouge1'].fmeasure,
        "ROUGE-2": scores['rouge2'].fmeasure,
        "ROUGE-L": scores['rougeL'].fmeasure
    }

def compute_bleu(generated, reference):
    reference_tokens = reference.split()
    generated_tokens = generated.split()
    return sentence_bleu([reference_tokens], generated_tokens)

def compute_meteor(generated, reference):
    # Tokenize the generated and reference texts
    generated_tokens = generated.split()
    reference_tokens = reference.split()
    return meteor_score([reference_tokens], generated_tokens)

# Helper function to find the expected output for a given input
def find_reference_output(input_text):
    for entry in ground_truth_data:
        if entry["instruction"] == input_text:
            return entry["output"]
    return ""  # No exact match: the row is then scored against an empty reference

# Initialize columns for scores
data["ROUGE-1"] = 0.0
data["ROUGE-2"] = 0.0
data["ROUGE-L"] = 0.0
data["BLEU"] = 0.0
data["METEOR"] = 0.0

# Calculate scores for each row
for idx, row in data.iterrows():
    input_text = row["Input"]
    generated_output = row["Output"]

    # Retrieve the expected output from the JSON file
    reference_output = find_reference_output(input_text)

    # Compute ROUGE scores
    rouge_scores = compute_rouge(generated_output, reference_output)
    data.at[idx, "ROUGE-1"] = rouge_scores["ROUGE-1"]
    data.at[idx, "ROUGE-2"] = rouge_scores["ROUGE-2"]
    data.at[idx, "ROUGE-L"] = rouge_scores["ROUGE-L"]

    # Compute BLEU score
    data.at[idx, "BLEU"] = compute_bleu(generated_output, reference_output)

    # Compute METEOR score
    data.at[idx, "METEOR"] = compute_meteor(generated_output, reference_output)

# Save the updated dataframe with scores to a new CSV file
output_file = "/content/questions_responses_ground_truth.csv"
data.to_csv(output_file, index=False)

# Display the updated data with scores
data.head()


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
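
The warnings above arise because BLEU combines the 1-gram to 4-gram precisions as a geometric mean, so a single order with no matches sets the whole score to zero. The sketch below is a simplified, standard-library illustration of that effect and of epsilon smoothing; `simple_bleu` and its `epsilon` parameter are hypothetical and only loosely modelled on the smoothing the warning suggests, so it is not a substitute for NLTK's implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=4, epsilon=None):
    """Single-reference BLEU sketch with optional epsilon smoothing."""
    ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    if not hyp_tokens:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if matches == 0:
            if epsilon is None:
                return 0.0  # one empty order zeroes the geometric mean
            matches = epsilon  # smoothing: replace the zero count with a small value
        log_precisions.append(math.log(matches / total))
    # Brevity penalty discourages hypotheses shorter than the reference
    if len(hyp_tokens) > len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat"
hyp = "the cat is on the mat"
print(simple_bleu(ref, hyp))               # no 4-gram overlap, so the score is zero
print(simple_bleu(ref, hyp, epsilon=0.1))  # smoothed score is small but positive
```

With NLTK itself, the usual way to apply smoothing is to pass `smoothing_function=SmoothingFunction().method1` to `sentence_bleu`, importing `SmoothingFunction` from `nltk.translate.bleu_score`. Smoothed and unsmoothed scores should not be mixed in the same comparison.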


| | Method | Question | Input | Output | Flesch Reading Ease | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | METEOR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finetuning | 1 | Am I allowed to work as a student? | Yes, it’s allowed, however, it is difficult to... | 62.68 | 1.0 | 1.0 | 1.0 | 1.0 | 0.999987 |
| 1 | Finetuning | 2 | Do I need a French social insurance? | Yes, French social insurance is mandatory in F... | 65.22 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Finetuning | 3 | Does the university have accommodation on the ... | Yes, the student accommodation is managed by t... | 50.53 | 0.735294 | 0.731343 | 0.735294 | 0.499195 | 0.593829 |
| 3 | Finetuning | 4 | How do I get to the university? | IMT Mines Alès is located in southern France b... | 49.15 | 0.93578 | 0.934579 | 0.93578 | 0.820199 | 0.890035 |
| 4 | Finetuning | 5 | Is there a canteen at the university? | Yes, there are two places to eat at the univer... | 53.17 | 1.0 | 1.0 | 1.0 | 0.906187 | 0.999979 |
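
Row 1 ("Do I need a French social insurance?") scores 0.0 on every metric. One possible cause, which the outputs shown here cannot confirm, is that `find_reference_output` found no exact match for the question and silently returned an empty reference. The sketch below shows a more forgiving lookup that normalizes case, whitespace, and curly apostrophes, and returns `None` so that unmatched rows can be reported instead of scored. `normalize`, `build_reference_index`, `lookup_reference`, and the sample entry are hypothetical.

```python
import re

def normalize(text):
    """Lowercase, unify curly apostrophes, and collapse whitespace."""
    text = text.replace("\u2019", "'").lower()
    return re.sub(r"\s+", " ", text).strip()

def build_reference_index(entries):
    """Map each normalized instruction to its expected output."""
    return {normalize(e["instruction"]): e["output"] for e in entries}

def lookup_reference(index, input_text):
    """Return the reference answer, or None when no entry matches."""
    return index.get(normalize(input_text))

# Hypothetical entry in the same shape the notebook reads from FAQ.json
faq = [{"instruction": "Is there a canteen at the university? ",
        "output": "placeholder answer"}]
index = build_reference_index(faq)

print(lookup_reference(index, "is there a  canteen at the university?"))
print(lookup_reference(index, "A question that is not in the FAQ"))
```

Rows for which the lookup returns `None` can then be skipped or logged, which keeps missing references from being averaged in as genuine zero scores.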
