In [1]:
# !pip install llama-cpp-python

# Taking the model from my huggingface repo

In [2]:
from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="AkinduH/Qwen2.5-3B-Instruct-Fine-Tuned-on-Deepseek-Research-Papers",
	filename="unsloth.Q4_K_M.gguf",
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.68G [00:00<?, ?B/s]

llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /root/.cache/huggingface/hub/models--AkinduH--Qwen2.5-3B-Instruct-Fine-Tuned-on-Deepseek-Research-Papers/snapshots/64179f9d4366b5df59fe37591cc408b7f1aa39d4/./unsloth.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 7b Instruct Unsloth Bnb 4bit
llama_model_loader: - kv   3:                       general.organization str              = Unsloth
llama_model_loader: - kv   4:                           general.finetune str              = instruct-unsloth-bnb-4bit
llama_model_loader: - kv   5:                           general.basename

## Inferencing check

In [3]:
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an AI assistant that answers research-based questions factually."},
        {"role": "user", "content": "Can you explain how AlpacaEval2.0 contributes to the performance of DeepSeek-R1 in reasoning tasks?"}
    ],
    temperature=0.7,
    max_tokens=1000
)
response["choices"][0]["message"]["content"]

llama_perf_context_print:        load time =   27416.04 ms
llama_perf_context_print: prompt eval time =   27415.78 ms /    50 tokens (  548.32 ms per token,     1.82 tokens per second)
llama_perf_context_print:        eval time =   85602.57 ms /   136 runs   (  629.43 ms per token,     1.59 tokens per second)
llama_perf_context_print:       total time =  113238.72 ms /   186 tokens


"AlpacaEval2.0 plays a significant role in enhancing the performance of DeepSeek-R1 by providing a benchmark for evaluating large language models (LLMs) on reasoning tasks. This evaluation helps identify the strengths and weaknesses of DeepSeek-R1, allowing for targeted improvements. The results indicate that DeepSeek-R1 demonstrates competitive performance across various benchmarks, particularly excelling in STEM-related questions. The model's ability to surpass previous open-source models and achieve performance on par with commercial models like Qwen and Llama suggests its effectiveness in reasoning tasks. Additionally, the evaluation highlights the model's limitations, such as weaker performance in mathematics, which can guide future training and enhancement efforts."

# Let's create a unseen dataset for evaluation

In [5]:
import pandas as pd

df = pd.read_csv("/content/merged_df.csv")

In [9]:
# pip install datasets

In [10]:
from datasets import Dataset

data_examples = df.apply(lambda row: {
    "conversations": [
        {"from": "human", "value": row["user_input"]},
        {"from": "gpt", "value": row["reference"]}
    ]
}, axis=1).tolist()

dataset = Dataset.from_list(data_examples)

In [11]:
dataset_split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = dataset_split["train"]
val_dataset = dataset_split["test"]

In [12]:
val_dataset

Dataset({
    features: ['conversations'],
    num_rows: 110
})

Dataset creating is complete. This the same one we had in training for evaluation

# Integrating RAG components

Let's add the RAG Wrapper around the model for more accurate results

In [13]:
def retrieve_context(query, vector_store, top_k=1):
    results = vector_store.similarity_search_with_score(query, k=top_k)
    weighted_context = ""
    for doc, score in results:
        weighted_context += f"{doc.page_content} (relevance: {score})\n\n"
    return weighted_context

In [19]:
# pip install langchain_community langchain_huggingface faiss-cpu

In [22]:
import os
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

path = "/content/Vector_DB"

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.load_local(path, embeddings, allow_dangerous_deserialization=True)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [23]:
def generate_response(user_query):

  # Retrieve relevant context from the vector store
  context = retrieve_context(user_query, vector_store)

  response = llm.create_chat_completion(
      messages=[
          {"role": "system", "content": "You are an AI assistant that answers research-based questions factually."},
          {"role": "user", "content": user_query},
          {"role": "user", "content": f"Context: {context[1000]}"}

      ],
      temperature=0.7,
      max_tokens=1000
  )
  return response["choices"][0]["message"]["content"]

In [24]:
generate_response("Who developed the DualPipe algorithm?")

Llama.generate: 21 prefix-match hit, remaining 20 prompt tokens to eval
llama_perf_context_print:        load time =   27416.04 ms
llama_perf_context_print: prompt eval time =   10300.52 ms /    20 tokens (  515.03 ms per token,     1.94 tokens per second)
llama_perf_context_print:        eval time =   13789.61 ms /    22 runs   (  626.80 ms per token,     1.60 tokens per second)
llama_perf_context_print:       total time =   24127.55 ms /    42 tokens


'The DualPipe algorithm was developed by Jiashi Li, Chengqi Deng, and Wenfeng Liang.'

# Evaluation

Now we are going to evaluate the model on the unseen dataset

Let's take a sample from the unseen dataset around 50 samples

In [26]:
question = val_dataset['conversations'][1][0]['value']
answer = val_dataset['conversations'][1][1]['value']

In [27]:
question

'What is FP8 in the context of Deepseek v3 model training?'

In [28]:
answer

'FP8 refers to the mixed precision training implemented in the Deepseek v3 model, which utilizes FP8 precision to optimize the training process.'

In [31]:
# pip install sentence-transformers torch rouge-score bert-score

In [30]:
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer
import bert_score

# Load model for embeddings
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Function to calculate similarity using embeddings
def cosine_similarity(answer1, answer2):
    emb1 = embedding_model.encode(answer1, convert_to_tensor=True)
    emb2 = embedding_model.encode(answer2, convert_to_tensor=True)
    return util.pytorch_cos_sim(emb1, emb2).item()

# Function to calculate ROUGE score
def rouge_score_metric(answer1, answer2):
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = scorer.score(answer1, answer2)
    return scores['rougeL'].fmeasure  # F1 score

# Function to calculate BERTScore
def bert_score_metric(answer1, answer2):
    P, R, F1 = bert_score.score([answer1], [answer2], lang="en")
    return F1.item()

The following code is for evaluation

It will calculate the cosine similarity, rouge score and bert score for the unseen dataset

In [37]:
results = []

for n in range(50):
    question = val_dataset['conversations'][n][0]['value']
    ground_truth = val_dataset['conversations'][n][1]['value']

    generated_response = generate_response(question)

    cos_sim = cosine_similarity(generated_response, ground_truth)
    rouge = rouge_score_metric(generated_response, ground_truth)
    bert = bert_score_metric(generated_response, ground_truth)

    results.append({
        "question": question,
        "ground_truth": ground_truth,
        "generated_response": generated_response,
        "cosine_similarity": cos_sim,
        "rouge_score": rouge,
        "bert_score": bert
    })

Llama.generate: 21 prefix-match hit, remaining 37 prompt tokens to eval
llama_perf_context_print:        load time =   27416.04 ms
llama_perf_context_print: prompt eval time =   26499.96 ms /    37 tokens (  716.22 ms per token,     1.40 tokens per second)
llama_perf_context_print:        eval time =   91025.59 ms /   135 runs   (  674.26 ms per token,     1.48 tokens per second)
llama_perf_context_print:       total time =  117771.48 ms /   172 tokens
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Results

In [36]:
import numpy as np

cosine_similarities = [res['cosine_similarity'] for res in results]
rouge_scores = [res['rouge_score'] for res in results]
bert_scores = [res['bert_score'] for res in results]

avg_cosine_similarity = np.mean(cosine_similarities)
avg_rouge_score = np.mean(rouge_scores)
avg_bert_score = np.mean(bert_scores)

print(f"Average Cosine Similarity: {avg_cosine_similarity:.4f}")
print(f"Average ROUGE Score: {avg_rouge_score:.4f}")
print(f"Average BERT Score: {avg_bert_score:.4f}")


Average Cosine Similarity: 0.8255
Average ROUGE Score: 0.2754
Average BERT Score: 0.8968


We evaluated the model on an unseen dataset using 50 samples. The evaluation metrics used were:

-Cosine Similarity (0.8255): This measures how similar the model’s responses are to the expected answers in terms of vector space. A high value indicates that the model's generated responses closely align with the reference answers.

-ROUGE Score (0.2754): This assesses the overlap between the generated and reference text in terms of word sequences. A relatively low value suggests that while the model captures the essence of the answers, exact wording might differ.

-BERT Score (0.8968): This evaluates semantic similarity using contextual embeddings. A high score implies that the model preserves the meaning of the expected answers well.