All necessary libraries

In [1]:
import re
import pymupdf
import llama_cpp
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np



Path to the PDF

In [19]:
path="D:\\personalCode\\ragAgentFitness\\sample.pdf"

PDF text reading and text splitting functions

In [3]:
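def read_pdf_text(path):
    # Concatenate the plain text of every page in the PDF.
    doc=pymupdf.open(path)
    full_text=""
    for page in doc:
        full_text+=page.get_text()
    return full_text

def split_into_chunks(text, chunk_size=500, overlap=50):
    # Fixed-size character windows; consecutive chunks share `overlap` characters.
    if not 0<=overlap<chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks=[]
    for i in range(0, len(text), chunk_size-overlap):
        chunks.append(text[i:i+chunk_size])
    return chunks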
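
To make the sliding window concrete, the sketch below applies the same stride logic to a short string. The sizes (chunk_size=10, overlap=3) are toy values chosen only for illustration, and `chunk_demo` is a hypothetical helper name, not part of the notebook.

```python
def chunk_demo(text, chunk_size=10, overlap=3):
    """Same stride logic as split_into_chunks, with toy sizes for illustration."""
    step = chunk_size - overlap  # the window advances by 7 characters here
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_demo(sample)
for chunk in demo_chunks:
    print(repr(chunk))
```

Each chunk repeats the last three characters of the previous one. With the notebook's defaults (500 and 50) the stride is 450 characters, so the final chunk is often shorter than the rest and may lie largely inside the previous chunk's window.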

In [20]:
text=read_pdf_text(path)
chunks=split_into_chunks(text)

Embedding and storage of vectors

In [21]:
embedder=SentenceTransformer('all-MiniLM-L6-v2')
vectors=embedder.encode(chunks)

In [22]:
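# encode() returns a 2-D array with one row per chunk; FAISS expects float32.
vectors=np.asarray(vectors, dtype="float32")
dimension=vectors.shape[1]
index=faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
index.add(vectors)

id_to_text={i: chunk for i, chunk in enumerate(chunks)}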
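
An L2 index ranks neighbours by squared Euclidean distance. If the embeddings are unit length (an assumption here; it is worth checking the norms of `vectors` before relying on it), squared L2 distance and cosine similarity are tied by d² = 2 − 2·cos, so both give the same ranking. The pure-Python sketch below checks that identity on two made-up vectors; the helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_unit(a, b):
    # Plain dot product; equals cosine similarity only for unit vectors.
    return sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])
lhs = squared_l2(a, b)
rhs = 2 - 2 * cosine_unit(a, b)
print(abs(lhs - rhs) < 1e-12)
```

Under the same assumption, a Euclidean distance of 1.0 between two embeddings corresponds to a cosine similarity of 0.5.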

In [23]:
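def search_chunks(query, top_k=3):
    # Embed the query and return the texts of its top_k nearest chunks.
    query_vec=embedder.encode([query])
    D, I=index.search(np.asarray(query_vec, dtype="float32"), top_k)
    # FAISS pads the result with -1 when fewer than top_k vectors are indexed.
    return [id_to_text[i] for i in I[0] if i!=-1]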
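
What the flat index does for each query can be sketched without FAISS: compute the squared L2 distance to every stored vector and keep the smallest `top_k`. The function and the toy 2-D vectors below are hypothetical, written only to illustrate that exhaustive search, not to replace the index.

```python
def brute_force_search(query_vec, vectors, top_k=3):
    """Return (distances, indices) of the top_k closest vectors by squared L2."""
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search([0.9, 0.1], toy_vectors, top_k=2)
print(indices, distances)
```

The cost grows linearly with the number of chunks, which is usually acceptable for a single PDF.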

Model being used

In [24]:
model_path_gguf="D:\\personalCode\\ragAgentFitness\\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf"
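# The prompts below are written by hand in ChatML and sent through create_completion,
# so chat_format is not applied to them; "chatml" is set only to match the prompt style.
model=llama_cpp.Llama(model_path=model_path_gguf, chat_format="chatml", n_ctx=8192)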

llama_model_loader: loaded meta data with 81 key-value pairs and 255 tensors from D:\personalCode\ragAgentFitness\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Dolphin 3.0 Llama 3.2 3B
llama_model_loader: - kv   3:                       general.organization str              = Cognitivecomputations
llama_model_loader: - kv   4:                           general.basename str              = dolphin-3.0-Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              

In [9]:
print(model.context_params.n_ctx)

8192


User input and search

In [25]:
user_query=input("Please enter query: ")
context="\n\n".join(search_chunks(user_query))

context_tokens = model.tokenize(context.encode("utf-8"))
context_token_count = len(context_tokens)
print(f"Total tokens used by the context: {context_token_count}")

Total tokens used by the context: 464
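
The token count matters because the prompt and the generated tokens share one context window. A minimal sketch of that budget check follows; the two helpers are hypothetical, and the figures plugged in are the ones this notebook reports (n_ctx of 8192, max_tokens of 512, and a full prompt of 560 tokens according to the generation log).

```python
def fits_context(prompt_tokens, max_new_tokens, n_ctx):
    """True if the prompt plus the generation budget fits in the context window."""
    return prompt_tokens + max_new_tokens <= n_ctx

def remaining_budget(prompt_tokens, n_ctx):
    """Tokens left for generation once the prompt has been evaluated."""
    return max(0, n_ctx - prompt_tokens)

print(fits_context(560, 512, 8192))
print(remaining_budget(560, 8192))
```

Counting the tokens of the complete prompt, rather than the retrieved context alone, gives the number that actually has to fit, since the system message and the question add to it.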


In [26]:
final_prompt=f"""<|im_start|>system
You are a helpful assistant. If the answer is not present in the context, print "Insufficient context" and nothing else. Structure your response in markdown, using bullet points or headings if appropriate. Ensure that if there is no relevant information, you provide "Insufficient context" and nothing else at all. <|im_end|>
<|im_start|>user
Use the following context to answer the question.

Context:
{context}

Question:
{user_query}<|im_end|>
<|im_start|>assistant
"""

temp=0.7
max_tokens=512

response=model.create_completion(
    prompt=final_prompt,
    temperature=temp,
    max_tokens=max_tokens
)

llama_perf_context_print:        load time =   14894.28 ms
llama_perf_context_print: prompt eval time =   14893.70 ms /   560 tokens (   26.60 ms per token,    37.60 tokens per second)
llama_perf_context_print:        eval time =   43092.42 ms /   511 runs   (   84.33 ms per token,    11.86 tokens per second)
llama_perf_context_print:       total time =   58957.66 ms /  1071 tokens
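
The answer prompt and the later fact-checking prompt share the same ChatML layout: a system turn, a user turn, and an open assistant turn. A small builder keeps the markers consistent. This is a sketch; `build_chatml_prompt` and `build_rag_user_message` are hypothetical names, and the demo strings are made up.

```python
def build_chatml_prompt(system, user):
    """Assemble a single-turn ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def build_rag_user_message(context, question):
    """User turn that carries the retrieved context ahead of the question."""
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )

demo_prompt = build_chatml_prompt(
    "You are a helpful assistant.",
    build_rag_user_message("Push-ups train the chest.", "What do push-ups train?"),
)
print(demo_prompt)
```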


In [27]:
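assistant_reply=response['choices'][0]['text']
# The prompt is ChatML, so strip a stray end-of-turn marker; "[/INST]" is a Llama-2 marker kept from the original cleanup.
assistant_reply=assistant_reply.replace("<|im_end|>", "").replace("[/INST]", "").strip()
print(assistant_reply)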

"**How to do push ups?**

* Begin in a standard push-up position, with your hands directly under your shoulders and your feet on the ground.
* Straighten your arms and legs, and push your butt as far into the air as possible.
* Keeping your arms straight and your back in line, swoop your upper body down in an arc, sticking your chest out.
* Slowly return to the starting position, maintaining control throughout the movement.

* For those who don't think they can ever do a single good push-up, I'll show you how to easily work into it:
	+ Start with your hands on the ground, but your knees raised, forming a plank position. Do as many of these as you can.
	+ Once you can do several in a row without resting, lower your knees to the ground, and do push-ups.
	+ The more you practice, the easier it will become.

* **Posture and Variations:**
	+ Keep your core tight throughout the movement. Imagine you're pulling your belly button towards your tailbone.
	+ If you're not ready for the standard p

In [28]:
hallucination_prompt = f"""<|im_start|>system
You are an expert fact-checking assistant. Your job is to verify whether the assistant's answer is fully supported by the given context. If parts of the answer are not present in the context, clearly identify them. Be strict and objective in your judgment.<|im_end|>
<|im_start|>user
Context:
{context}

Answer:
{assistant_reply}

Task:
Determine whether the answer is hallucinated. List any parts of the answer that are not supported by the context. If answer is determined to not be hallucinated respond with "Assistant reply is not hallucinated."<|im_end|>
<|im_start|>assistant
"""

hal_response = model.create_completion(
    prompt=hallucination_prompt,
    temperature=temp,
    max_tokens=256
)
assistant_hallucination = hal_response['choices'][0]['text']
print(assistant_hallucination)

Llama.generate: 6 prefix-match hit, remaining 1083 prompt tokens to eval
llama_perf_context_print:        load time =   14894.28 ms
llama_perf_context_print: prompt eval time =   29017.74 ms /  1083 tokens (   26.79 ms per token,    37.32 tokens per second)
llama_perf_context_print:        eval time =     738.54 ms /     7 runs   (  105.51 ms per token,     9.48 tokens per second)
llama_perf_context_print:       total time =   29770.26 ms /  1090 tokens


Assistant reply is not hallucinated.
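
Because the judge is asked to reply with a fixed sentence when it finds nothing unsupported, its free-text verdict can be reduced to a boolean. The sketch below is one hedged way to do that; `is_flagged` is a hypothetical helper, and the second test string is invented for illustration.

```python
NOT_HALLUCINATED = "assistant reply is not hallucinated"

def is_flagged(verdict_text):
    """True when the judge's reply lacks the agreed 'clean' sentence."""
    # Lowercase and collapse whitespace so formatting differences do not matter.
    normalized = " ".join(verdict_text.lower().split())
    return NOT_HALLUCINATED not in normalized

print(is_flagged("Assistant reply is not hallucinated."))
print(is_flagged("The claim about knee push-ups is not supported by the context."))
```

A substring match is crude: a verdict that both lists unsupported claims and repeats the fixed sentence would pass as clean, so the raw text is still worth reading.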


In [15]:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import SentenceTransformer
from collections import Counter
import math

In [16]:
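def compute_semantic_entropy(responses, distance_threshold=1.0):
    # Embed the sampled responses so they can be grouped by meaning.
    embeddings=embedder.encode(responses)

    # Average-linkage clustering on Euclidean distance; with n_clusters=None,
    # distance_threshold decides how many clusters are formed.
    clustering=AgglomerativeClustering(n_clusters=None, distance_threshold=distance_threshold, linkage="average")
    labels=clustering.fit_predict(embeddings)

    # Shannon entropy (in bits) of the cluster-size distribution.
    # A single cluster yields -0.0, which is simply zero entropy.
    counts=Counter(labels)
    total=sum(counts.values())
    probabilities=[count/total for count in counts.values()]
    entropy=-sum(p*math.log2(p) for p in probabilities)

    return entropy, labels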
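
With five samples per trial, the entropy can take only seven values, one for each way of splitting five responses into clusters. The sketch below enumerates them. `cluster_entropy` is a hypothetical stand-alone version of the entropy step above that takes cluster labels directly, so it needs neither the embedder nor scikit-learn.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Shannon entropy in bits of the cluster-size distribution."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Every way to partition five responses into clusters, named by cluster sizes.
partitions = {
    "5": [0, 0, 0, 0, 0],
    "4+1": [0, 0, 0, 0, 1],
    "3+2": [0, 0, 0, 1, 1],
    "3+1+1": [0, 0, 0, 1, 2],
    "2+2+1": [0, 0, 1, 1, 2],
    "2+1+1+1": [0, 0, 1, 2, 3],
    "1+1+1+1+1": [0, 1, 2, 3, 4],
}
entropies = {name: cluster_entropy(labels) for name, labels in partitions.items()}
for name, value in entropies.items():
    print(f"{name:>10}: {value:.4f}")
```

Read against this table, the values recorded in this notebook are a 4+1 split (0.7219), a 3+2 split (0.971) and a 3+1+1 split (1.3710). A threshold of 1.5 is crossed only when no cluster holds three or more of the five responses, and the maximum possible value is log2(5), about 2.32.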

In [18]:
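# 20 trials of 5 samples each. Note that this samples from the bare user_query,
# with no chat template and no retrieved context, so it measures how consistent
# the model is without grounding.
for i in range(20):
    semantic_response=[model.create_completion(prompt=user_query, temperature=temp, max_tokens=256)['choices'][0]['text'] for _ in range(5)]
    entropy, labels=compute_semantic_entropy(semantic_response, distance_threshold=1.0)
    print(f"At iteration {i} Entropy:{entropy}")
    if entropy>=1.5:
        print(f"iteration {i} detected possible hallucination")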

Llama.generate: 6 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =   14684.57 ms
llama_perf_context_print: prompt eval time =   37345.64 ms /     2 tokens (18672.82 ms per token,     0.05 tokens per second)
llama_perf_context_print:        eval time =    8787.68 ms /   107 runs   (   82.13 ms per token,    12.18 tokens per second)
llama_perf_context_print:       total time =    9039.19 ms /   109 tokens

At iteration 0 Entropy:-0.0
At iteration 1 Entropy:-0.0
At iteration 2 Entropy:-0.0
At iteration 3 Entropy:0.7219280948873623
At iteration 4 Entropy:-0.0
At iteration 5 Entropy:-0.0
At iteration 6 Entropy:-0.0
At iteration 7 Entropy:-0.0
At iteration 8 Entropy:-0.0
At iteration 9 Entropy:-0.0
At iteration 10 Entropy:1.3709505944546687
At iteration 11 Entropy:-0.0
At iteration 12 Entropy:-0.0
At iteration 13 Entropy:-0.0
At iteration 14 Entropy:-0.0
At iteration 15 Entropy:-0.0
At iteration 16 Entropy:-0.0
At iteration 17 Entropy:1.3709505944546687
At iteration 18 Entropy:-0.0
At iteration 19 Entropy:-0.0

In [29]:
prompt=final_prompt
semantic_response=[model.create_completion(prompt=prompt, temperature=temp, max_tokens=256)['choices'][0]['text'] for _ in range(5)]

entropy, labels=compute_semantic_entropy(semantic_response)
print(f"Entropy: {entropy: .3f}")

# Same cutoff as the 20-trial loop above (>=1.5).
if entropy>=1.5:
    print("Possible hallucination.")
else:
    print("Response likely grounded")


Llama.generate: 6 prefix-match hit, remaining 554 prompt tokens to eval
llama_perf_context_print:        load time =   14894.28 ms
llama_perf_context_print: prompt eval time =   14753.43 ms /   554 tokens (   26.63 ms per token,    37.55 tokens per second)
llama_perf_context_print:        eval time =   21498.60 ms /   255 runs   (   84.31 ms per token,    11.86 tokens per second)
llama_perf_context_print:       total time =   36652.24 ms /   809 tokens
Llama.generate: 559 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =   14894.28 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     379.05 ms /     4 runs   (   94.76 ms per token,    10.55 tokens per second)
llama_perf_context_print:       total time =     385.26 ms /     5 tokens
Llama.generate: 559 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_con

Entropy:  0.971
Response likely grounded
