All necessary libraries

In [1]:
import pymupdf
import llama_cpp
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np


Path to the PDF

In [4]:
#path="D:\\personalCode\\ragAgentFitness\\sample.pdf"
path="D:\\college books\\sem4\\CSD204 - OS\\textbook.pdf"

PDF text extraction and chunking functions

In [5]:
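def read_pdf_text(path):
    # Open the PDF, concatenate the plain text of every page, then close the file.
    with pymupdf.open(path) as doc:
        return "".join(page.get_text() for page in doc)

def split_into_chunks(text, chunk_size=500, overlap=50):
    # Fixed-size character windows; consecutive chunks share `overlap` characters.
    if not 0<=overlap<chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step=chunk_size-overlap
    return [text[i:i+chunk_size] for i in range(0, len(text), step)]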
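
To see how the sliding window behaves, the sketch below computes only the (start, end) character offsets that the chunking function would slice, using the same defaults. `chunk_spans` is a hypothetical helper introduced for illustration; it is not part of the notebook.

```python
def chunk_spans(text_length, chunk_size=500, overlap=50):
    """Return the (start, end) offsets of each chunk (hypothetical helper)."""
    step = chunk_size - overlap  # each window starts `step` characters after the previous one
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [(start, min(start + chunk_size, text_length))
            for start in range(0, text_length, step)]

print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

Note that with a 950-character text the last span would be (900, 950), which lies entirely inside the previous chunk, so a trailing chunk can duplicate text that is already covered.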

In [6]:
text=read_pdf_text(path)
chunks=split_into_chunks(text)

Embedding and storage of vectors

In [7]:
embedder=SentenceTransformer('all-MiniLM-L6-v2')
vectors=embedder.encode(chunks)



In [8]:
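dimension=vectors.shape[1]
index=faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 index, starts empty
# FAISS expects a contiguous float32 matrix of shape (n_chunks, dimension)
index.add(np.ascontiguousarray(vectors, dtype=np.float32))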

id_to_text={i: chunk for i, chunk in enumerate(chunks)}

In [9]:
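def search_chunks(query, top_k=3):
    query_vec=embedder.encode([query])
    # index.search returns (distances, ids); ids are -1 when fewer than top_k vectors exist
    distances, ids=index.search(np.asarray(query_vec, dtype=np.float32), top_k)
    return [id_to_text[i] for i in ids[0] if i!=-1]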
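
`IndexFlatL2` searches exhaustively and ranks by squared Euclidean distance. The pure-Python sketch below mimics that ranking on toy 2-D vectors so the retrieval step can be followed by hand. `flat_l2_search` and the toy data are assumptions made for illustration, not FAISS code.

```python
def flat_l2_search(vectors, query, top_k=3):
    """Brute-force nearest neighbours by squared L2 distance (illustrative only)."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings": four chunks in a 2-D space.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(toy_vectors, [0.9, 0.1], top_k=2)
print(ids)  # [1, 0]
```

The returned ids play the same role as the keys of `id_to_text`: they point back to the chunk text that will be pasted into the prompt.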

Model being used

In [11]:
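# Escape every backslash in the Windows path ("\m" is an invalid escape sequence).
model_path_gguf="D:\\personalCode\\RAG-Toolkit\\models\\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf"
# The prompts below are written in ChatML by hand; chat_format only affects create_chat_completion.
model=llama_cpp.Llama(model_path=model_path_gguf, chat_format="chatml", n_ctx=8192, n_gpu_layers=-1)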

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Laptop GPU) - 7099 MiB free
llama_model_loader: loaded meta data with 81 key-value pairs and 255 tensors from D:\personalCode\RAG-Toolkit\models\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Dolphin 3.0 Llama 3.2 3B
llama_model_loader: - kv   3:                       general.organization str              = Cognitiv

In [12]:
print(model.context_params.n_ctx) 

8192


User input and search

In [23]:
user_query=input("Please enter query: ")
context="\n\n".join(search_chunks(user_query))

context_tokens = model.tokenize(context.encode("utf-8"))
context_token_count = len(context_tokens)
print(f"Total tokens used by the context: {context_token_count}")

Total tokens used by the context: 321
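
The token count matters because the retrieved context, the rest of the prompt, and the generated reply all share the 8192-token window. A minimal sketch of that arithmetic follows; `prompt_headroom` and the `reply_reserve` value are assumptions chosen for illustration.

```python
def prompt_headroom(n_ctx, context_tokens, reply_reserve):
    """Tokens left for the system prompt and question after reserving room for the reply."""
    headroom = n_ctx - context_tokens - reply_reserve
    if headroom < 0:
        raise ValueError("context plus reserved reply length exceeds the context window")
    return headroom

# 8192-token window, 321 context tokens, 2048 tokens reserved for the answer (assumed).
print(prompt_headroom(n_ctx=8192, context_tokens=321, reply_reserve=2048))  # 5823
```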


In [24]:
final_prompt=f"""<|im_start|>system
You are a helpful assistant. If the answer is not present in the context, print "Insufficient context" and nothing else. Structure your response in markdown, using bullet points or headings if appropriate. Ensure that if there is no relevant information, you provide "Insufficient context" and nothing else at all. <|im_end|>
<|im_start|>user
Use the following context to answer the question.

Context:
{context}

Question:
{user_query}<|im_end|>
<|im_start|>assistant
"""

temp=0.7
max_tokens=2048

response=model.create_completion(
    prompt=final_prompt,
    temperature=temp,
    max_tokens=max_tokens
)

Llama.generate: 6 prefix-match hit, remaining 409 prompt tokens to eval
llama_perf_context_print:        load time =     778.50 ms
llama_perf_context_print: prompt eval time =     321.40 ms /   409 tokens (    0.79 ms per token,  1272.55 tokens per second)
llama_perf_context_print:        eval time =    4968.82 ms /   284 runs   (   17.50 ms per token,    57.16 tokens per second)
llama_perf_context_print:       total time =    5578.42 ms /   693 tokens
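
The prompt above is assembled by hand in ChatML, with the assistant turn left open so that the model continues from there. One possible way to factor that out is sketched below; `build_chatml_prompt` and `build_rag_user_message` are hypothetical helpers, not functions from llama-cpp-python.

```python
def build_chatml_prompt(system, user):
    """Assemble a single-turn ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def build_rag_user_message(context, question):
    """Combine retrieved context and the question into the user turn."""
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )

demo_prompt = build_chatml_prompt(
    "You are a helpful assistant.",
    build_rag_user_message("An OS manages hardware.", "What is an OS?"),
)
print(demo_prompt)
```

Keeping the template in one function means the answer prompt and the fact-checking prompt can share the same delimiters.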


In [25]:
assistant_reply=response['choices'][0]['text']
assistant_reply=assistant_reply.replace("[/INST]", "")
print(assistant_reply)

### Deﬁning Operating Systems

**Operating Systems:** An operating system is a type of software that manages the computer hardware and provides essential services for computer programs. It acts as an intermediary between the computer's hardware and the software applications running on it.

**Key Functions:**
1. **Resource Management:** Operating systems manage the allocation of resources like memory, storage, and processing power.
2. **Input/Output (I/O) Control:** They manage the communication between the computer's hardware and the software applications, ensuring that the hardware operates correctly and that programs do not interfere with system operations.
3. **Process Execution:** Operating systems control the execution of user programs, ensuring that they run efficiently and without causing errors or improper system use.
4. **User Interface:** They provide the user interface through which users can interact with the computer system.
5. **Error Handling:** Operating systems are des

In [19]:
hallucination_prompt = f"""<|im_start|>system
You are an expert fact-checking assistant. Your job is to verify whether the assistant's answer is fully supported by the given context. If parts of the answer are not present in the context, clearly identify them. Be strict and objective in your judgment.<|im_end|>
<|im_start|>user
Context:
{context}

Answer:
{assistant_reply}

Task:
Determine whether the answer is hallucinated. List any parts of the answer that are not supported by the context. If answer is determined to not be hallucinated respond with "Assistant reply is not hallucinated."<|im_end|>
<|im_start|>assistant
"""

hal_response = model.create_completion(
    prompt=hallucination_prompt,
    temperature=temp,
    max_tokens=256
)
assistant_hallucination = hal_response['choices'][0]['text']
print(assistant_hallucination)

Llama.generate: 6 prefix-match hit, remaining 605 prompt tokens to eval
llama_perf_context_print:        load time =     778.50 ms
llama_perf_context_print: prompt eval time =     336.30 ms /   605 tokens (    0.56 ms per token,  1799.01 tokens per second)
llama_perf_context_print:        eval time =     117.94 ms /     7 runs   (   16.85 ms per token,    59.35 tokens per second)
llama_perf_context_print:       total time =     462.76 ms /   612 tokens


Assistant reply is not hallucinated.
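
Because the judge is asked to reply with a fixed sentence when it finds nothing unsupported, its verdict can be turned into a boolean. The sketch below does this with a strict comparison; `is_supported` is a hypothetical helper, and treating every other reply as "flagged" is an assumption about how the verdict should be used.

```python
SENTINEL = "assistant reply is not hallucinated"

def is_supported(verdict):
    """True only if the judge returned exactly the agreed sentinel sentence."""
    normalized = verdict.strip().rstrip(".").lower()
    return normalized == SENTINEL

print(is_supported("\nAssistant reply is not hallucinated.\n"))            # True
print(is_supported("The answer mentions caching, which is unsupported."))  # False
```

A small model may not follow the instruction verbatim, so a strict match errs on the side of flagging replies for review.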



In [20]:
# numpy and SentenceTransformer are already imported in the first cell.
from sklearn.cluster import AgglomerativeClustering
from collections import Counter
import math

In [17]:
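def compute_semantic_entropy(responses, distance_threshold=1.0):
    # Embed each sampled response, then group responses whose embeddings are close.
    embeddings=embedder.encode(responses)

    # n_clusters=None lets distance_threshold decide how many clusters form.
    clustering=AgglomerativeClustering(n_clusters=None, distance_threshold=distance_threshold, linkage="average")
    labels=clustering.fit_predict(embeddings)

    # Shannon entropy (in bits) of the cluster-size distribution.
    counts=Counter(labels)
    total=sum(counts.values())
    probabilities=[count/total for count in counts.values()]
    # A single cluster gives p=1, so the result prints as -0.0.
    entropy=-sum(p*math.log2(p) for p in probabilities)

    return entropy, labels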
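
The entropy depends only on how the sampled responses are distributed over clusters. As a worked example, the sketch below applies the same formula directly to cluster labels for five samples. `entropy_from_labels` is a hypothetical helper that skips the embedding and clustering steps.

```python
import math
from collections import Counter

def entropy_from_labels(labels):
    """Shannon entropy in bits of the cluster-size distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(round(entropy_from_labels([0, 0, 0, 0, 1]), 3))  # 0.722 (4/1 split)
print(round(entropy_from_labels([0, 0, 0, 1, 1]), 3))  # 0.971 (3/2 split)
print(round(entropy_from_labels([0, 0, 1, 1, 2]), 3))  # 1.522 (2/2/1 split)
```

With five samples the maximum is log2(5), about 2.32 bits, and a threshold of 1.5 is only reached once the responses fall into at least three clusters, for example a 2/2/1 split.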

In [18]:
# Note: the raw question is sent here, without the ChatML template or the retrieved context.
for i in range(20):
    # Sample 5 completions per iteration and measure how much they disagree.
    semantic_response=[model.create_completion(prompt=user_query, temperature=temp, max_tokens=256)['choices'][0]['text'] for _ in range(5)]
    entropy, labels=compute_semantic_entropy(semantic_response, distance_threshold=1.0)
    print(f"At iteration {i} Entropy:{entropy}")
    if entropy>=1.5:
        print(f"iteration {i} detected possible hallucination")

Llama.generate: 1 prefix-match hit, remaining 7 prompt tokens to eval
llama_perf_context_print:        load time =     737.43 ms
llama_perf_context_print: prompt eval time =     222.18 ms /     7 tokens (   31.74 ms per token,    31.51 tokens per second)
llama_perf_context_print:        eval time =    1908.22 ms /   138 runs   (   13.83 ms per token,    72.32 tokens per second)
llama_perf_context_print:       total time =    2259.40 ms /   145 tokens
Llama.generate: 7 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     737.43 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    2588.46 ms /   187 runs   (   13.84 ms per token,    72.24 tokens per second)
llama_perf_context_print:       total time =    2765.17 ms /   188 tokens
Llama.generate: 7 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_p

At iteration 0 Entropy:-0.0


At iteration 1 Entropy:-0.0


At iteration 2 Entropy:-0.0


At iteration 3 Entropy:-0.0


At iteration 4 Entropy:-0.0


At iteration 5 Entropy:-0.0


At iteration 6 Entropy:-0.0


At iteration 7 Entropy:-0.0


At iteration 8 Entropy:-0.0


At iteration 9 Entropy:-0.0


At iteration 10 Entropy:0.7219280948873623


At iteration 11 Entropy:0.7219280948873623


At iteration 12 Entropy:-0.0


At iteration 13 Entropy:-0.0


At iteration 14 Entropy:-0.0


At iteration 15 Entropy:-0.0


At iteration 16 Entropy:-0.0


At iteration 17 Entropy:-0.0


At iteration 18 Entropy:-0.0


At iteration 19 Entropy:-0.0


In [None]:
prompt=final_prompt
semantic_response=[model.create_completion(prompt=prompt, temperature=temp, max_tokens=256)['choices'][0]['text'] for _ in range(5)]

entropy, labels=compute_semantic_entropy(semantic_response)
print(f"Entropy: {entropy: .3f}")

if entropy>1.5:
    print("Possible hallucination.")

else:
    print("Response likely grounded")


Llama.generate: 6 prefix-match hit, remaining 554 prompt tokens to eval
llama_perf_context_print:        load time =   14894.28 ms
llama_perf_context_print: prompt eval time =   14753.43 ms /   554 tokens (   26.63 ms per token,    37.55 tokens per second)
llama_perf_context_print:        eval time =   21498.60 ms /   255 runs   (   84.31 ms per token,    11.86 tokens per second)
llama_perf_context_print:       total time =   36652.24 ms /   809 tokens
Llama.generate: 559 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =   14894.28 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     379.05 ms /     4 runs   (   94.76 ms per token,    10.55 tokens per second)
llama_perf_context_print:       total time =     385.26 ms /     5 tokens
Llama.generate: 559 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_con

Entropy:  0.971
Response likely grounded


In [None]:
def ask_query():
    user_query=input("What is your question: ")
    response=model.create_completion(prompt=user_query, temperature=temp, max_tokens=256)['choices'][0]['text']
    print(response)

In [None]:
ask_query()

NameError: name 'model' is not defined
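
The `NameError` shows that `model` was not defined in the session when this cell ran, which can happen if the model-loading cell was not executed first. One way to make that dependency explicit is to pass the model in as an argument, as in the sketch below. `EchoModel` is a hypothetical stand-in that exists only so the example runs without loading a GGUF file; it mirrors the response shape the notebook reads from `create_completion`.

```python
def ask_query(model, question, temperature=0.7, max_tokens=256):
    """Send a raw question to any object exposing create_completion()."""
    result = model.create_completion(
        prompt=question, temperature=temperature, max_tokens=max_tokens
    )
    return result["choices"][0]["text"]

class EchoModel:
    """Stand-in model (assumption): returns the prompt in the expected dict shape."""
    def create_completion(self, prompt, temperature, max_tokens):
        return {"choices": [{"text": f"echo: {prompt}"}]}

print(ask_query(EchoModel(), "What is an operating system?"))
# echo: What is an operating system?
```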