##Task 5
Find an LLM (LLaMA 2 / 3, Mistral, Claude) for RAG and try running it on our data. We need at least one example of how it works in practice and a baseline for future evaluation against other models.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os

path = 'nlp/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/nlp/Project'

Install datasets library

In [None]:
!pip install datasets



Download Retrieval-Augmented Generation (RAG) Dataset 12000

In [None]:
from datasets import load_dataset
rag_dataset_train = load_dataset('neural-bridge/rag-dataset-12000', split='train')
rag_dataset_test = load_dataset('neural-bridge/rag-dataset-12000', split='test')

print(f"Train dataset size: {len(rag_dataset_train)} ")
print(f"Test dataset size: {len(rag_dataset_test)} ")

Train dataset size: 9600 
Test dataset size: 2400 


In [None]:
print(f"Train dataset features: {rag_dataset_train.column_names}")
rag_dataset_train

Train dataset features: ['context', 'question', 'answer']


Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 9600
})

In [None]:
print(f"Test dataset features: {rag_dataset_test.column_names}")
rag_dataset_test

Test dataset features: ['context', 'question', 'answer']


Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 2400
})

In [None]:
# Extract the 'context', 'question', and 'answer' features from the train data
train_contexts = rag_dataset_train['context']
train_questions = rag_dataset_train['question']
train_answers = rag_dataset_train['answer']

print("Train questions:", train_questions[:2])
print("Train answers:", train_answers[:2])

Train questions: ['What is the Berry Export Summary 2028 and what is its purpose?', 'What are some of the benefits reported from having access to Self-supply water sources?']
Train answers: ['The Berry Export Summary 2028 is a dedicated export plan for the Australian strawberry, raspberry, and blackberry industries. It maps the sectors’ current position, where they want to be, high-opportunity markets, and next steps. The purpose of this plan is to grow their global presence over the next 10 years.', 'Benefits reported from having access to Self-supply water sources include convenience, less time spent for fetching water and access to more and better quality water. In some areas, Self-supply sources offer important added values such as water for productive use, income generation, family safety and improved food security.']


In [None]:
# Extract the 'context', 'question', and 'answer' features from the test data
test_contexts = rag_dataset_test['context']
test_questions = rag_dataset_test['question']
test_answers = rag_dataset_test['answer']

print("Test questions:", test_questions[:2])
print("Test answers:", test_answers[:2])

Test questions: ['Who is the music director of the Quebec Symphony Orchestra?', 'Who were the four students of the University of Port Harcourt that were allegedly murdered?']
Test answers: ['The music director of the Quebec Symphony Orchestra is Fabien Gabel.', 'The four students of the University of Port Harcourt that were allegedly murdered were Chiadika Lordson, Ugonna Kelechi Obusor, Mike Lloyd Toku and Tekena Elkanah.']


##Retrieval by SentenceTransformer and Reranking by CrossEncoder

Import the two models for Retrieval and Reranking

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

Embed the contexts (Checkpoint the embeddings to avoid repeating the computation each time)

In [None]:
import os
import pickle

# Define the embeddings cache path
embeddings_cache_path = './rag_context_train_embeddings_cache.pkl'

# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        context_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    context_embeddings = semb_model.encode(train_contexts, convert_to_tensor=True, show_progress_bar=True)
    # Save the embeddings to a file for future loading
    print(f'Saving embeddings to: \'{embeddings_cache_path}\'')
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(context_embeddings, f)

Loading embeddings cache


Index the embeddings

In [None]:
!pip -q install hnswlib

In [None]:
import os
import hnswlib

# Create empty index
index = hnswlib.Index(space='cosine', dim=384)

# Define hnswlib index path
index_path = './rag_context_train_hnswlib.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Started creating HNSWLIB index')
    index.init_index(max_elements=context_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    index.add_items(context_embeddings.cpu(), list(range(len(context_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

Loading index...


Try with FLAN T5

In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
# Move the model to the detected device instead of hard-coding "cuda"
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", torch_dtype=torch.bfloat16).to(device)

In [None]:
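import random

random.seed(1995)

# The outputs below use the first training question; to sample one at random,
# use idx = random.choice(range(len(train_questions))) instead.
idx = 0

question = train_questions[idx]
target_answer = train_answers[idx]

# The question text already ends with '?', so none is appended here
print(f'Question {idx}: {question}')

Question 0: What is the Berry Export Summary 2028 and what is its purpose?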


Embed the question

In [None]:
question_embedding = semb_model.encode(question, convert_to_tensor=True)

Retrieve relevant documents keeping top $k$ matches

In [None]:
corpus_ids, distances = index.knn_query(question_embedding.cpu(), k=64)
scores = 1 - distances

print("Cosine similarity model search results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx, score in zip(corpus_ids[0][:5], scores[0][:5]):
    print(f"Score: {score:.4f}\nDocument: \"{train_contexts[idx][:100]}\"\n\n")

Cosine similarity model search results
Query: "What is the Berry Export Summary 2028 and what is its purpose?"
---------------------------------------
Score: 0.5690
Document: "Caption: Tasmanian berry grower Nic Hansen showing Macau chef Antimo Merone around his property as p"


Score: 0.4219
Document: "Amazon Produce Network adds position
Amazon Produce Network, Mullica Hill, N.J., has added a grower "


Score: 0.3912
Document: "Ingredients
- Spring water from Norwich
- Grape 6%
- Cranberry 1%
- Raspberry 1%
- Sugar
- Citric ac"


Score: 0.3812
Document: "Her Majesty Queen Elizabeth II, looking smashing in a pale yellow dress with gold buttons, and a mat"


Score: 0.3630
Document: "ps- check out The One Smith for more tunes: CDbaby The download collected works vol. 49 of tracking "
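
hnswlib's `cosine` space returns a distance equal to one minus the cosine similarity, which is why the cell above computes `scores = 1 - distances`. On a small corpus, the approximate search can be sanity-checked against an exact brute-force ranking. The sketch below illustrates this in plain Python; `cosine_similarity` and `brute_force_top_k` are hypothetical helper names, and the toy vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, corpus, k):
    """Return (index, similarity) pairs for the k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional "embeddings" (illustrative only, not real model output)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

top = brute_force_top_k(toy_query, toy_corpus, k=2)
for i, sim in top:
    print(f"doc {i}: similarity={sim:.4f}, cosine distance={1 - sim:.4f}")
```

With the real embeddings, the same comparison could be run on a few hundred contexts to check that the HNSW index returns the same top matches as the exact ranking.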




Re-rank retrieved documents

In [None]:
import numpy as np

model_inputs = [(question, train_contexts[idx]) for idx in corpus_ids[0]]
cross_scores = xenc_model.predict(model_inputs)

print("Cross-encoder model re-ranking results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:5]:
    print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{train_contexts[corpus_ids[0][idx]]}\"\n\n")

Cross-encoder model re-ranking results
Query: "What is the Berry Export Summary 2028 and what is its purpose?"
---------------------------------------
Score: 4.1088
Document: "Caption: Tasmanian berry grower Nic Hansen showing Macau chef Antimo Merone around his property as part of export engagement activities.
THE RISE and rise of the Australian strawberry, raspberry and blackberry industries has seen the sectors redouble their international trade focus, with the release of a dedicated export plan to grow their global presence over the next 10 years.
Driven by significant grower input, the Berry Export Summary 2028 maps the sectors’ current position, where they want to be, high-opportunity markets and next steps.
Hort Innovation trade manager Jenny Van de Meeberg said the value and volume of raspberry and blackberry exports rose by 100 per cent between 2016 and 2017. She said the Australian strawberry industry experienced similar success with an almost 30 per cent rise in export volum
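
Every question in this dataset is paired with its own context, so retrieval quality can be measured without extra labels: question `i` counts as a hit when context `i` appears among its top-k results. One possible baseline, sketched below, is hit rate at k together with mean reciprocal rank. The names `hit_rate_at_k` and `mean_reciprocal_rank` and the toy rankings are hypothetical, and the sketch assumes contexts are not shared between questions. In the notebook, the rankings would come from `index.knn_query` or from the cross-encoder ordering.

```python
def hit_rate_at_k(rankings, gold_ids, k):
    """Fraction of queries whose gold document id appears in the top-k ranking."""
    hits = sum(1 for ranked, gold in zip(rankings, gold_ids) if gold in list(ranked)[:k])
    return hits / len(gold_ids)

def mean_reciprocal_rank(rankings, gold_ids):
    """Average of 1/rank of the gold document (0 when it was not retrieved)."""
    total = 0.0
    for ranked, gold in zip(rankings, gold_ids):
        ranked = list(ranked)
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Toy example: 3 queries, each with a ranked list of retrieved document ids
toy_rankings = [[0, 5, 7], [4, 1, 9], [8, 6, 3]]
toy_gold = [0, 1, 2]  # gold document id for each query

print("hit@1:", hit_rate_at_k(toy_rankings, toy_gold, k=1))
print("hit@3:", hit_rate_at_k(toy_rankings, toy_gold, k=3))
print("MRR:  ", mean_reciprocal_rank(toy_rankings, toy_gold))
```

Computing these numbers before and after re-ranking would show how much the cross-encoder improves on the bi-encoder ordering.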

Use best match to answer (and compare to reference answer)

In [None]:
context_idx = np.argsort(-cross_scores)[0]
context = train_contexts[corpus_ids[0][context_idx]]

input_text = f"Given the following passage, answer the related question.\n\nPassage:\n\n{context}\n\nQ: {question}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
print(input_text, "\n")

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("A (generated): ", output_text, "\n")

print(f"A (target): {target_answer}")

Token indices sequence length is longer than the specified maximum sequence length for this model (753 > 512). Running this sequence through the model will result in indexing errors


Given the following passage, answer the related question.

Passage:

Caption: Tasmanian berry grower Nic Hansen showing Macau chef Antimo Merone around his property as part of export engagement activities.
THE RISE and rise of the Australian strawberry, raspberry and blackberry industries has seen the sectors redouble their international trade focus, with the release of a dedicated export plan to grow their global presence over the next 10 years.
Driven by significant grower input, the Berry Export Summary 2028 maps the sectors’ current position, where they want to be, high-opportunity markets and next steps.
Hort Innovation trade manager Jenny Van de Meeberg said the value and volume of raspberry and blackberry exports rose by 100 per cent between 2016 and 2017. She said the Australian strawberry industry experienced similar success with an almost 30 per cent rise in export volume and a 26 per cent rise in value to $32.6M over the same period.
“Australian berry sectors are in a firm p
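
The tokenizer warning above reports that the prompt (753 tokens) is longer than the 512-token maximum it has on record for this model. Truncating the prompt naively from the right would cut the question first, since it sits at the end. One way to handle this, sketched below, is to trim the passage to a token budget before building the prompt. `build_prompt` and `count_tokens` are hypothetical; whitespace splitting stands in for the real tokenizer, so the budget here is only approximate.

```python
def build_prompt(context, question, max_tokens=512, count_tokens=None):
    """Trim `context` so that the whole prompt stays within `max_tokens`."""
    # Stand-in token counter (assumption): whitespace-separated words
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    template = (
        "Given the following passage, answer the related question.\n\n"
        "Passage:\n\n{context}\n\nQ: {question}"
    )

    # Drop words from the end of the passage until the full prompt fits.
    # Note: splitting and re-joining collapses the passage's original whitespace.
    words = context.split()
    while words and count_tokens(
        template.format(context=" ".join(words), question=question)
    ) > max_tokens:
        words.pop()

    return template.format(context=" ".join(words), question=question)

long_context = "berry " * 1000
prompt = build_prompt(long_context, "What is the plan?", max_tokens=50)
print(len(prompt.split()), "whitespace tokens")
print(prompt[-20:])
```

In the notebook, `count_tokens` could be replaced by a function that returns `len(tokenizer(text).input_ids)`, so that the budget is measured in the model's own tokens.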

Using Gemma3

In [None]:
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3


In [None]:
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "google/gemma-3-4b-it"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

In [None]:
context_idx = np.argsort(-cross_scores)[0]
context = train_contexts[corpus_ids[0][context_idx]]

prompt = f"Given the following passage, answer the related question.\n\nPassage:\n\n{context}\n\nQ: {question}"
print(prompt, "\n")

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [ {"type": "text", "text": prompt}]
    }]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=400, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt
    generation = generation[0][input_len:]

output_text = processor.decode(generation, skip_special_tokens=True)
print("A (generated): ", output_text, "\n")

print(f"A (target): {target_answer}")

Given the following passage, answer the related question.

Passage:

Caption: Tasmanian berry grower Nic Hansen showing Macau chef Antimo Merone around his property as part of export engagement activities.
THE RISE and rise of the Australian strawberry, raspberry and blackberry industries has seen the sectors redouble their international trade focus, with the release of a dedicated export plan to grow their global presence over the next 10 years.
Driven by significant grower input, the Berry Export Summary 2028 maps the sectors’ current position, where they want to be, high-opportunity markets and next steps.
Hort Innovation trade manager Jenny Van de Meeberg said the value and volume of raspberry and blackberry exports rose by 100 per cent between 2016 and 2017. She said the Australian strawberry industry experienced similar success with an almost 30 per cent rise in export volume and a 26 per cent rise in value to $32.6M over the same period.
“Australian berry sectors are in a firm p



A (generated):  The Berry Export Summary 2028 is a document that maps the current position, desired future state, high-opportunity markets, and next steps for the Australian strawberry, raspberry, and blackberry industries. Its purpose is to grow the sectors’ global presence over the next 10 years by outlining a strategy for increased international trade. 

A (target): The Berry Export Summary 2028 is a dedicated export plan for the Australian strawberry, raspberry, and blackberry industries. It maps the sectors’ current position, where they want to be, high-opportunity markets, and next steps. The purpose of this plan is to grow their global presence over the next 10 years.
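
To turn the side-by-side comparison above into a number that can serve as a baseline, one common choice for question answering is token-level F1 between the generated and reference answers, in the style of SQuAD evaluation. The sketch below is a simplified version. `normalize` and `token_f1` are hypothetical helpers, the normalization (lowercasing, stripping punctuation) is an assumption rather than a fixed standard, and the `generated` string is invented for illustration.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative pair: the reference is a test-set answer, the prediction is made up
generated = "The music director is Fabien Gabel."
target = "The music director of the Quebec Symphony Orchestra is Fabien Gabel."
print(f"F1: {token_f1(generated, target):.3f}")
```

Averaging `token_f1` over the test split for each model (FLAN-T5, Gemma 3, and any later candidates) would give one comparable score per model, although lexical overlap may undervalue correct paraphrases.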
