All necessary libraries

In [12]:
import pymupdf
import llama_cpp
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import json
from sklearn.cluster import KMeans

Path to the PDF

In [2]:
path="D:\\personalCode\\RAG-Toolkit\\documents\\sample.pdf"

PDF text reading function, text splitting function

In [10]:
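def read_pdf_text(path):
    """Return the concatenated text of every page, plus the open document."""
    doc=pymupdf.open(path)
    full_text=""
    for page in doc:
        full_text+=page.get_text()
    return full_text, doc

def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-character windows that share overlap characters."""
    if not 0<=overlap<chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks=[]
    # each window starts chunk_size-overlap characters after the previous one
    for i in range(0, len(text), chunk_size-overlap):
        chunks.append(text[i:i+chunk_size])
    return chunks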
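
To make the overlap easier to see, here is a minimal sketch that applies the same stride (chunk_size minus overlap) to a short string and records where each window starts. The helper name split_with_offsets and the sample string are illustrative assumptions, not part of the notebook.

```python
def split_with_offsets(text, chunk_size=500, overlap=50):
    """Return (start, chunk) pairs using the same stride as split_into_chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive window starts
    return [(start, text[start:start + chunk_size])
            for start in range(0, len(text), step)]


# Hypothetical toy input: 26 characters, windows of 10 sharing 3 characters.
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_offsets(sample, chunk_size=10, overlap=3)
for start, chunk in pieces:
    print(start, repr(chunk))
# 0 'abcdefghij'
# 7 'hijklmnopq'
# 14 'opqrstuvwx'
# 21 'vwxyz'
```

Each window repeats the last three characters of the previous one, and the final window is simply shorter. With the notebook defaults (500 and 50) the windows start every 450 characters.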

In [11]:
text, doc=read_pdf_text(path)
chunks=split_into_chunks(text)

Embedding and storage of vectors

In [7]:
embedder=SentenceTransformer('all-MiniLM-L6-v2')
vectors=embedder.encode(chunks)

Model being used

In [8]:
model_path_gguf="D:\\personalCode\\RAG-Toolkit\\models\\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf"
model=llama_cpp.Llama(model_path=model_path_gguf, chat_format="llama-2", n_ctx=8192, n_gpu_layers=-1)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Laptop GPU) - 7099 MiB free
llama_model_loader: loaded meta data with 81 key-value pairs and 255 tensors from D:\personalCode\RAG-Toolkit\models\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Dolphin 3.0 Llama 3.2 3B
llama_model_loader: - kv   3:                       general.organization str              = Cognitiv

In [9]:
print(model.context_params.n_ctx) 

8192


In [14]:
def clustering(vectors, num_clusters):
    """
    input: embeddings from given pdf text as vectors
    output: sorted indices of the chunks closest to each cluster centre
    """
    kmeans=KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
    closest_indices=[]

    for i in range(num_clusters):
        # Euclidean distance from every vector to the i-th cluster centre
        distances=np.linalg.norm(vectors-kmeans.cluster_centers_[i], axis=1)
        # the nearest chunk acts as the representative of this cluster
        closest_indices.append(int(np.argmin(distances)))

    # keep document order so the summaries read in sequence
    selected_indices=sorted(closest_indices)
    return selected_indices
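
The selection step inside clustering can be illustrated without KMeans. The sketch below assumes the cluster centres are already known and only shows how one representative index per centre is picked; the function name closest_to_centroids, the 2-D points and the centres are made-up values for illustration.

```python
import math


def closest_to_centroids(points, centroids):
    """For each centroid, return the index of the nearest point (Euclidean)."""
    picked = []
    for centre in centroids:
        distances = [math.dist(point, centre) for point in points]
        picked.append(distances.index(min(distances)))
    return sorted(picked)  # document order, as in clustering()


# Hypothetical 2-D "embeddings" forming two obvious groups.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (5.0, 5.0), (5.1, 4.9), (6.0, 6.0)]
centroids = [(0.4, 0.4), (5.3, 5.3)]  # assumed centres, not fitted by KMeans
representatives = closest_to_centroids(points, centroids)
print(representatives)  # [1, 3]
```

Like the notebook function, this does not deduplicate: if two centres happen to share the same nearest point, that index appears twice.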


In [59]:
selected_indices=clustering(vectors, 10)
print(selected_indices)

[42, 123, 189, 233, 282, 390, 506, 547, 551, 657]


In [60]:
print(chunks[233])


excuses not to buckle down and reach your optimum level of fitness. The ironic thing is 
that people often feel they have to put themselves through far harsher and lengthy routines 
in the gym than the more effective bodyweight programs explained in this book.  
I’ve visited hundreds of gyms in my career. And the proof is in the pudding. I look at 
the people there. Then I look at my SpecOps troops. The difference is night and day. And 
you can achieve this difference with an amazingly small sa


In [61]:
summary_list=[]
temp=0.7
max_tokens=150

# map step: summarise each representative chunk on its own
for j, i in enumerate(selected_indices, start=1):
    section=chunks[i]
    map_prompt=f"""
    Act as a concise summariser.
    Summarise the given text into 2-3 lines, no more. Ensure you completely cover the content of the text. This text will be enclosed in triple backticks (```)
    The output should be the summary of the user supplied text.
    Be concise and precise in your behaviour.

    ```{section}```
    SUMMARY: 
    """

    response=model.create_completion(
        prompt=map_prompt,
        temperature=temp,
        max_tokens=max_tokens
    )

    summary=response['choices'][0]['text']
    summary=summary.replace("[/INST]", "")
    print(summary)
    print(i)
    summary_list.append(summary)
    print(f"Summary for chunk{i} is ready, {j} indices covered")

summaries="\n".join(summary_list)


Llama.generate: 2 prefix-match hit, remaining 195 prompt tokens to eval
llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =     293.00 ms /   195 tokens (    1.50 ms per token,   665.53 tokens per second)
llama_perf_context_print:        eval time =    1131.93 ms /    86 runs   (   13.16 ms per token,    75.98 tokens per second)
llama_perf_context_print:       total time =    1488.76 ms /   281 tokens
Llama.generate: 73 prefix-match hit, remaining 102 prompt tokens to eval


 The text describes 9 weeks of intense training in a challenging underwater environment. The trainees are required to commit fully, tying three different knots perfectly underwater. The instructors aim to make the trainees quit, but the full commitment to the training and tasks leads to success. The training environment is described as challenging, and the trainees learn to commit, stay down, and overcome the initial discomfort. Success is achieved through full commitment.
42
Summary for chunk42 is ready, 1 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      34.08 ms /   102 tokens (    0.33 ms per token,  2992.69 tokens per second)
llama_perf_context_print:        eval time =    1305.77 ms /   103 runs   (   12.68 ms per token,    78.88 tokens per second)
llama_perf_context_print:       total time =    1415.47 ms /   205 tokens
Llama.generate: 73 prefix-match hit, remaining 113 prompt tokens to eval


 The text focuses on the resting metabolic rate (RMR) which is crucial for maintaining a lean body. RMR is influenced significantly by body composition, particularly the presence of muscle. Muscle is the most effective calorie burner. The text emphasizes the importance of making positive changes in body composition, specifically gaining muscle, rather than just focusing on weight loss. Losing muscle weight is detrimental and counterproductive to achieving a lean body. The concept of calories in vs. calories out is discussed in relation to body composition changes.```


123
Summary for chunk123 is ready, 2 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      36.16 ms /   113 tokens (    0.32 ms per token,  3124.91 tokens per second)
llama_perf_context_print:        eval time =     779.90 ms /    62 runs   (   12.58 ms per token,    79.50 tokens per second)
llama_perf_context_print:       total time =     863.00 ms /   175 tokens
Llama.generate: 72 prefix-match hit, remaining 115 prompt tokens to eval


 The text provides information about the importance of the post workout meal, which consists of 30-50 grams of lean protein and 30-50 grams of high glycemic index carbohydrates. The lean protein is essential to ensure that the body absorbs the nutrients properly and efficiently, as fat slows down this absorption process.
189
Summary for chunk189 is ready, 3 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      36.34 ms /   115 tokens (    0.32 ms per token,  3164.12 tokens per second)
llama_perf_context_print:        eval time =     682.95 ms /    54 runs   (   12.65 ms per token,    79.07 tokens per second)
llama_perf_context_print:       total time =     756.68 ms /   169 tokens
Llama.generate: 72 prefix-match hit, remaining 146 prompt tokens to eval


 The text discusses how people often give excuses for not reaching their optimal fitness level. It highlights that more effective bodyweight programs can be found in a book than harsher gym routines. The author shares their career experience visiting gyms and compares it to their SpecOps troops.
233
Summary for chunk233 is ready, 4 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      46.38 ms /   146 tokens (    0.32 ms per token,  3148.18 tokens per second)
llama_perf_context_print:        eval time =    1005.13 ms /    79 runs   (   12.72 ms per token,    78.60 tokens per second)
llama_perf_context_print:       total time =    1108.47 ms /   225 tokens
Llama.generate: 73 prefix-match hit, remaining 141 prompt tokens to eval


 The text discusses the use of resistance bands in a fitness program. The bands are divided into four sections: Push, Pull, Core, and Legs. Each muscle group needs to be worked once a week. The standard gym training regimen can also be used. The muscle groups are broken down into specific exercises such as shoulders, triceps, chest, lats, and biceps and forearms.
282
Summary for chunk282 is ready, 5 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      45.82 ms /   141 tokens (    0.32 ms per token,  3077.06 tokens per second)
llama_perf_context_print:        eval time =     801.01 ms /    63 runs   (   12.71 ms per token,    78.65 tokens per second)
llama_perf_context_print:       total time =     891.36 ms /   204 tokens
Llama.generate: 73 prefix-match hit, remaining 212 prompt tokens to eval


 The text describes a modified push-up exercise called "Press shoulders, triceps (2-4)" where you perform the exercise with shoulder-width apart hands, similar to a Chinese Push Up. The exercise can be increased in difficulty by placing your hands on a raised surface, allowing your head to come below your hands.
390
Summary for chunk390 is ready, 6 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      59.41 ms /   212 tokens (    0.28 ms per token,  3568.12 tokens per second)
llama_perf_context_print:        eval time =     351.39 ms /    27 runs   (   13.01 ms per token,    76.84 tokens per second)
llama_perf_context_print:       total time =     429.82 ms /   239 tokens
Llama.generate: 73 prefix-match hit, remaining 263 prompt tokens to eval


 Sumo Squat - Lift yourself up until your legs are straight again. YOU ARE YOUR OWN GYM - 104.
    ```

506
Summary for chunk506 is ready, 7 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      75.14 ms /   263 tokens (    0.29 ms per token,  3499.99 tokens per second)
llama_perf_context_print:        eval time =    1969.64 ms /   149 runs   (   13.22 ms per token,    75.65 tokens per second)
llama_perf_context_print:       total time =    2160.52 ms /   412 tokens
Llama.generate: 73 prefix-match hit, remaining 205 prompt tokens to eval


 - The text describes a fitness routine involving explosive movements.
     - It also mentions a specific exercise called "Pistols" and provides instructions for its performance.
     - The text ends with an instruction to "bring your butt all the way down to the heel of your working foot" during the exercise.
     - Finally, it provides a summary of the fitness routine, suggesting that the reader should "really ready to kick it up a notch."
    ```
    ```

The given text seems to be about fitness routine or exercise instructions. However, it lacks the context or purpose of such instructions. It primarily focuses on a specific exercise called "Pistols" and provides step-by-step instructions for its performance. The text also mentions a fitness
547
Summary for chunk547 is ready, 8 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      59.46 ms /   205 tokens (    0.29 ms per token,  3447.46 tokens per second)
llama_perf_context_print:        eval time =     635.11 ms /    49 runs   (   12.96 ms per token,    77.15 tokens per second)
llama_perf_context_print:       total time =     730.73 ms /   254 tokens
Llama.generate: 73 prefix-match hit, remaining 174 prompt tokens to eval


 - Squats target thighs, hamstrings, and glutes.
     - Mimics leg extensions with more muscle involvement.
     - Additional resistance options available (e.g., heavy object or backpack).
     - One-leg squats can improve balance.
551
Summary for chunk551 is ready, 9 indices covered


llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =      52.21 ms /   174 tokens (    0.30 ms per token,  3332.63 tokens per second)
llama_perf_context_print:        eval time =     913.96 ms /    71 runs   (   12.87 ms per token,    77.68 tokens per second)
llama_perf_context_print:       total time =    1017.21 ms /   245 tokens


 This week's workout focuses on core stability and functional movements. Exercises include 1-legged hip extensions, supermans, push-ups with feet elevated, assisted dips, alternating 1-legged RDLs on a pillow, box jumps with reverse grip, V-ups, and Russian twists. The undulating block schedule ensures variety and intensity are maintained throughout the week.
657
Summary for chunk657 is ready, 10 indices covered
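
Several of the summaries above end in stray triple backticks, and one (chunk 547) keeps going after them. A minimal post-processing sketch, assuming everything after the first stray fence can be discarded, is shown below. The helper name clean_summary is hypothetical and is not used elsewhere in the notebook.

```python
FENCE = "`" * 3  # the triple-backtick delimiter used in the prompt


def clean_summary(raw):
    """Drop the [/INST] marker, cut at the first stray fence, trim whitespace."""
    text = raw.replace("[/INST]", "")
    text = text.split(FENCE, 1)[0]  # keep only what precedes the fence
    return text.strip()


# Raw text shaped like the chunk 506 output above.
raw = (" Sumo Squat - Lift yourself up until your legs are straight again. "
       "YOU ARE YOUR OWN GYM - 104.\n    " + FENCE + "\n")
print(clean_summary(raw))
```

One caveat: if the model starts its answer with a fence, this returns an empty string, so it is worth checking for empty summaries before joining them. Passing stop sequences to create_completion is another option that avoids generating the extra text in the first place.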


In [62]:
print(summaries)

 The text describes 9 weeks of intense training in a challenging underwater environment. The trainees are required to commit fully, tying three different knots perfectly underwater. The instructors aim to make the trainees quit, but the full commitment to the training and tasks leads to success. The training environment is described as challenging, and the trainees learn to commit, stay down, and overcome the initial discomfort. Success is achieved through full commitment.
 The text focuses on the resting metabolic rate (RMR) which is crucial for maintaining a lean body. RMR is influenced significantly by body composition, particularly the presence of muscle. Muscle is the most effective calorie burner. The text emphasizes the importance of making positive changes in body composition, specifically gaining muscle, rather than just focusing on weight loss. Losing muscle weight is detrimental and counterproductive to achieving a lean body. The concept of calories in vs. calories out is di

Combining the summaries into a final summary

In [63]:
final_prompt = f"""
You are a precise and concise summariser.
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```).
Your task is to write a detailed summary of what was covered in the book.

The output should be a detailed and coherent summary that captures all the key information present in the provided summaries. Combine each summary into one whole summary.
The goal is to help a reader understand the entire content of the book from this single collated summary.

Do not add any external information. Base your answer only on what is provided. Ensure it is a single stream of text, and not split up. Combine parts to form a bigger whole.
Capture the sentiment of the book.

```{summaries}```

SUMMARY:
Here is the detailed summary of the book:
"""

temp=0.7
max_tokens=3000

response=model.create_completion(
    prompt=final_prompt,
    temperature=temp,
    max_tokens=max_tokens
)

Llama.generate: 2 prefix-match hit, remaining 906 prompt tokens to eval
llama_perf_context_print:        load time =     540.71 ms
llama_perf_context_print: prompt eval time =     601.47 ms /   906 tokens (    0.66 ms per token,  1506.32 tokens per second)
llama_perf_context_print:        eval time =    1405.57 ms /   105 runs   (   13.39 ms per token,    74.70 tokens per second)
llama_perf_context_print:       total time =    2083.15 ms /  1011 tokens


In [64]:
print(response)

{'id': 'cmpl-3c2d24af-33b1-47dd-a81b-26558e290c70', 'object': 'text_completion', 'created': 1747928271, 'model': 'D:\\personalCode\\RAG-Toolkit\\models\\Dolphin3.0-Llama3.2-3B-Q5_K_M.gguf', 'choices': [{'text': 'The book focuses on fitness routines and exercise instructions, particularly explosive movements and core stability. It provides step-by-step instructions for specific exercises such as "Pistols" and a modified push-up exercise called "Press shoulders, triceps (2-4)". The book emphasizes the importance of muscle involvement and provides various resistance options for squats. It also highlights the role of bodyweight programs in achieving fitness goals. The text describes fitness routines that mimic military training and encourages readers to "really ready to kick it up a notch."', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 908, 'completion_tokens': 105, 'total_tokens': 1013}}


In [65]:
assistant_reply=response['choices'][0]['text']
assistant_reply=assistant_reply.replace("[/INST]", "")
print(assistant_reply)

The book focuses on fitness routines and exercise instructions, particularly explosive movements and core stability. It provides step-by-step instructions for specific exercises such as "Pistols" and a modified push-up exercise called "Press shoulders, triceps (2-4)". The book emphasizes the importance of muscle involvement and provides various resistance options for squats. It also highlights the role of bodyweight programs in achieving fitness goals. The text describes fitness routines that mimic military training and encourages readers to "really ready to kick it up a notch."
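
The response dictionary printed earlier also reports why generation stopped and how many tokens were used. The sketch below reads those fields from a hand-built dictionary that copies the finish_reason and usage values shown above (the text is abbreviated). The helper name check_completion is an assumption, and n_ctx=8192 is the value set when the model was loaded.

```python
def check_completion(response, n_ctx):
    """Summarise a completion response: text, truncation flag, context share."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "text": choice["text"].strip(),
        # "length" means generation hit max_tokens before finishing
        "truncated": choice["finish_reason"] == "length",
        # fraction of the context window consumed by prompt plus completion
        "context_used": usage["total_tokens"] / n_ctx,
    }


# Stand-in for the real response; the text is shortened for illustration.
sample_response = {
    "choices": [{"text": " The book focuses on fitness routines. ",
                 "index": 0, "logprobs": None, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 908, "completion_tokens": 105, "total_tokens": 1013},
}
report = check_completion(sample_response, n_ctx=8192)
print(report["truncated"], round(report["context_used"], 3))  # False 0.124
```

Here the run used roughly an eighth of the 8192-token window and stopped on its own, so neither max_tokens nor the context size limited the final summary.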
