## **Project Microbot**

A simple retrieval-augmented generation (RAG) pipeline, built for learning about microservices.

In [None]:
#Download the PDF
import requests
import os

context_path = "context"
pdf_path = context_path + "/Building Microservices: Designing Fine-Grained Systems.pdf"
pdf_download_url = "https://github.com/rootusercop/Free-DevOps-Books-1/raw/refs/heads/master/book/Building%20Microservices%20-%20Designing%20Fine-Grained%20Systems.pdf"

if not os.path.exists(context_path):
    print('[INFO] Context folder does not exist. Creating new folder...')
    os.mkdir(context_path)
    print('[INFO] Context folder created')
else:
    print('[INFO] Context folder already exists')

# Check for the PDF independently of the folder check, so a missing PDF
# is still downloaded when the folder already exists
if not os.path.exists(pdf_path):
    print('[INFO] PDF does not exist. Downloading PDF...')
    res = requests.get(pdf_download_url)
    if res.status_code == 200:
        with open(pdf_path, 'wb') as f:
            f.write(res.content)
        print('[INFO] PDF downloaded')
    else:
        print(f'[ERROR] PDF download failed with status code {res.status_code}')
else:
    print('[INFO] PDF already exists')

[INFO] Context folder does not exist. Creating new folder...
[INFO] Context folder created
[INFO] PDF does not exist. Downloading PDF...
[INFO] PDF downloaded


In [None]:
if "COLAB_GPU" in os.environ:
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python
    !pip install tqdm # for progress bars
    !pip install sentence-transformers # for embedding models
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.13-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.24.13
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting flash-attn
  Downloading flash_attn-2.7.0.post2.tar.gz (2.7 MB)

In [None]:
import fitz
import random
from tqdm.auto import tqdm

# Read the PDF, clean each page's text, and collect rough per-page statistics
def preprocess_data(text_data: str) -> str:
    """Replace newlines with spaces and collapse double spaces."""
    cleaned_text_data = text_data.replace("\n", " ").strip()
    cleaned_text_data = cleaned_text_data.replace("  ", " ")
    return cleaned_text_data

def open_and_read(pdf_path: str) -> list[dict]:
    """Return one dict per page holding the cleaned text and its statistics."""
    pdf = fitz.open(pdf_path)
    pages_and_text = []
    for page_no, page in tqdm(enumerate(pdf)):
        text_data = preprocess_data(page.get_text())
        pages_and_text.append({'page_no': page_no - 19,  # shift so the first 19 PDF pages get negative numbers
                               'char_count': len(text_data),
                               'word_count': len(text_data.split(" ")),
                               'sentence_count': len(text_data.split(". ")),  # naive split on ". "
                               'token_count': len(text_data) / 4,  # rough estimate: 1 token ~= 4 characters
                               'page_content': text_data
                               })
    return pages_and_text

pages_and_text = open_and_read(pdf_path = pdf_path)
random.sample(pages_and_text, k = 3)


[{'page_no': 227,
  'char_count': 3127,
  'word_count': 558,
  'sentence_count': 24,
  'token_count': 781.75,
  'page_content': 'First, with HTTP, we can use cache-control directives in our responses to clients. These tell clients if they should cache the resource at all, and if so how long they should cache it for in seconds. We also have the option of setting an Expires header, where instead of saying how long a piece of content can be cached for, we specify a time and date at which a resource should be considered stale and fetched again. The nature of the resources you are sharing determines which one is most likely to fit. Standard static website content like CSS or images often fit well with a simple cache- control time to live (TTL). On the other hand, if you know in advance when a new version of a resource will be updated, setting an Expires header will make more sense. All of this is very useful in stopping a client from even needing to make a request to the server in the first

In [None]:
import pandas as pd

df = pd.DataFrame(pages_and_text)
df.head(2)

Unnamed: 0,page_no,char_count,word_count,sentence_count,token_count,page_content
0,-19,64,7,1,16.0,Sam Newman Building Microservices DESIGNING FI...
1,-18,1989,282,9,497.25,PROGRAMMING Building Microservices ISBN: 978-1...
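
The per-page statistics are rough heuristics rather than real tokenizer output. A minimal sketch on a toy string shows what each one measures. `page_stats` is a hypothetical helper name, and the 4-characters-per-token ratio is the same rule of thumb used in `open_and_read`.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics that open_and_read stores per page."""
    return {
        "char_count": len(text),
        "word_count": len(text.split(" ")),       # whitespace-delimited words
        "sentence_count": len(text.split(". ")),  # naive split on ". "
        "token_count": len(text) / 4,             # rule of thumb: 1 token ~= 4 characters
    }

sample = "Hello world. This is a test."
stats = page_stats(sample)
print(stats)

# An empty page still reports one word and one sentence,
# because "".split(" ") returns [""]
empty_stats = page_stats("")
print(empty_stats)
```

The empty-string case explains why the first row above can be checked by hand (64 characters / 4 = 16.0 tokens) and why a page with no text still counts as one word.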


In [None]:
df.describe().round(2)
# The PDF has 280 pages and, according to the stats,
# the mean estimated token count is about 571 per page

Unnamed: 0,page_no,char_count,word_count,sentence_count,token_count
count,280.0,280.0,280.0,280.0,280.0
mean,120.5,2281.99,433.38,19.99,570.5
std,80.97,697.67,305.57,17.57,174.42
min,-19.0,0.0,1.0,1.0,0.0
25%,50.75,1928.25,319.0,14.0,482.06
50%,120.5,2478.0,423.5,19.0,619.5
75%,190.25,2779.75,480.75,22.0,694.94
max,260.0,3282.0,2286.0,187.0,820.5


In [None]:
#splitting the text content into sentences
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
for page in tqdm(pages_and_text):
    page['sentences'] = sent_tokenize(page['page_content'])
    page['sentences'] = [str(sentence) for sentence in page['sentences']]
    page['page_sent_count_nltk'] = len(page['sentences'])


In [None]:
random.sample(pages_and_text, k = 1)

[{'page_no': 155,
  'char_count': 1956,
  'word_count': 338,
  'sentence_count': 17,
  'token_count': 489.0,
  'page_content': 'CHAPTER 8 Monitoring As I’ve hopefully shown so far, breaking our system up into smaller, fine-grained microservices results in multiple benefits. It also, however, adds complexity when it comes to monitoring the system in production. In this chapter, we’ll look at the chal‐ lenges associated with monitoring and identifying problems in our fine-grained sys‐ tems, and I’ll outline some of the things you can do to have your cake and eat it too! Picture the scene. It’s a quiet Friday afternoon, and the team is looking forward to sloping off early to the pub as a way to start a weekend away from work. Then sud‐ denly the emails arrive. The website is misbehaving! Twitter is ablaze with your com‐ pany’s failings, your boss is chewing your ear off, and the prospects of a quiet weekend vanish. What’s the first thing you need to know? What the hell has gone wrong? In 

In [None]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

Unnamed: 0,page_no,char_count,word_count,sentence_count,token_count,page_sent_count_nltk
count,280.0,280.0,280.0,280.0,280.0,280.0
mean,120.5,2281.99,433.38,19.99,570.5,21.45
std,80.97,697.67,305.57,17.57,174.42,17.98
min,-19.0,0.0,1.0,1.0,0.0,0.0
25%,50.75,1928.25,319.0,14.0,482.06,16.0
50%,120.5,2478.0,423.5,19.0,619.5,21.0
75%,190.25,2779.75,480.75,22.0,694.94,24.0
max,260.0,3282.0,2286.0,187.0,820.5,189.0


In [None]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into consecutive slices of the desired size
# and returns a list of sentence chunks (the last chunk may be shorter)
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:

    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for page in tqdm(pages_and_text):
    page["sentence_chunks"] = split_list(input_list=page['sentences'],
                                         slice_size=num_sentence_chunk_size)
    page["num_chunks"] = len(page["sentence_chunks"])


In [None]:
random.sample(pages_and_text, k=1)

[{'page_no': 85,
  'char_count': 2066,
  'word_count': 372,
  'sentence_count': 15,
  'token_count': 516.5,
  'page_content': 'Figure 5-3. Post removal of the foreign key relationship At this point it becomes clear that we may well end up having to make two database calls to generate the report. This is correct. And the same thing will happen if these are two separate services. Typically concerns around performance are now raised. I have a fairly easy answer to those: how fast does your system need to be? And how fast is it now? If you can test its current performance and know what good perfor‐ mance looks like, then you should feel confident in making a change. Sometimes making one thing slower in exchange for other things is the right thing to do, espe‐ cially if slower is still perfectly acceptable. But what about the foreign key relationship? Well, we lose this altogether. This becomes a constraint we need to now manage in our resulting services rather than in the database level. T

In [None]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

# According to these stats, there are around 3 sentence chunks per page (mean 2.65), built from about 21 sentences per page

Unnamed: 0,page_no,char_count,word_count,sentence_count,token_count,page_sent_count_nltk,num_chunks
count,280.0,280.0,280.0,280.0,280.0,280.0,280.0
mean,120.5,2281.99,433.38,19.99,570.5,21.45,2.65
std,80.97,697.67,305.57,17.57,174.42,17.98,1.81
min,-19.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,50.75,1928.25,319.0,14.0,482.06,16.0,2.0
50%,120.5,2478.0,423.5,19.0,619.5,21.0,3.0
75%,190.25,2779.75,480.75,22.0,694.94,24.0,3.0
max,260.0,3282.0,2286.0,187.0,820.5,189.0,19.0
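
The relationship between sentence count and chunk count can be checked on a toy list. This is a minimal sketch that repeats `split_list` so it runs on its own; the 25 placeholder sentences are made up for illustration.

```python
import math

def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split a list into consecutive slices of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# 25 placeholder sentences split into chunks of 10
sentences = [f"Sentence {n}." for n in range(25)]
chunks = split_list(sentences, slice_size=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [10, 10, 5]

# The number of chunks is ceil(sentence_count / slice_size)
expected_chunks = math.ceil(len(sentences) / 10)
print(expected_chunks)  # 3

# The page with the most sentences in the stats above (189) yields 19 chunks
print(math.ceil(189 / 10))  # 19
```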


In [None]:
import re

# Split each chunk into its own item

pages_and_chunks = []  # Main list of chunk dicts from this point on

for page in tqdm(pages_and_text):
    for sentence_chunk in page["sentence_chunks"]:
        # Create a new dictionary to store the chunk info
        chunk_dict = {}
        chunk_dict["page_number"] = page["page_no"]

        # Join the sentences back into a paragraph-like structure, aka a chunk
        # (the text was only split into sentences in order to build the chunks)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        # ".A" -> ". A" for any full stop followed by a capital letter
        # (sentences ending in "?" or "!" stay joined to the next one)
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~= 4 characters

        pages_and_chunks.append(chunk_dict)

print(f"Total no of text chunks after processing: {len(pages_and_chunks)}")


Total no of text chunks after processing: 742
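
Because the sentences are joined without a separator, the regex is what restores the space between them, and it only looks for a full stop. The sketch below reproduces that behaviour on made-up sentences and compares it with a broader pattern. The broader pattern is a hypothetical variation, not what this notebook uses, and adopting it would change the chunk statistics recorded here.

```python
import re

sentences = ["Is it fast?", "It is.", "Great!", "Ship it."]

# Same joining logic as the chunking cell: no separator, then repair ".X" boundaries
joined = "".join(sentences).replace("  ", " ").strip()
period_only = re.sub(r'\.([A-Z])', r'. \1', joined)
print(period_only)

# Hypothetical broader pattern: also repair "?X" and "!X" boundaries
any_terminator = re.sub(r'([.?!])([A-Z])', r'\1 \2', joined)
print(any_terminator)
```

The first result still contains `fast?It` and `Great!Ship`, which is the same artifact visible in some retrieved chunks.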


In [None]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 124,
  'sentence_chunk': 'Inside the VMs, we get what looks like completely different hosts. They can run their own operating systems, with their own kernels. They can be considered almost her‐ metically sealed machines, kept isolated from the underlying physical host and the other virtual machines by the hypervisor. The problem is that the hypervisor here needs to set aside resources to do its job. This takes away CPU, I/O, and memory that could be used elsewhere. The more hosts the hypervisor manages, the more resources it needs. At a certain point, this overhead becomes a constraint in slicing up your physical infrastructure any further. In prac‐ tice, this means that there are often diminishing returns in slicing up a physical box into smaller and smaller parts, as proportionally more and more resources go into the overhead of the hypervisor. Vagrant Vagrant is a very useful deployment platform, which is normally used for dev and test rather than production. Vagran

In [None]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)
# The largest chunk has an estimated token count of 601

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,742.0,742.0,742.0,742.0
mean,107.62,847.92,150.96,211.98
std,84.18,503.19,103.72,125.8
min,-19.0,3.0,1.0,0.75
25%,31.0,413.0,76.0,103.25
50%,108.0,966.5,168.5,241.62
75%,180.75,1219.25,209.0,304.81
max,260.0,2404.0,1156.0,601.0


In [None]:
# Inspect chunks with a low token count, which are candidates for removal
min_token_length = 30  # Chunks with 30 or fewer estimated tokens are filtered out (threshold may need tuning)
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 10.5 | Text: 18 | Chapter 2: The Evolutionary Architect
Chunk token count: 2.5 | Text: ..........
Chunk token count: 2.5 | Text: ..........
Chunk token count: 21.75 | Text: Many years later, that process remains a work in progress!Conway’s Law in Reverse | 201
Chunk token count: 27.75 | Text: This shifts things somewhat. Now we could bake the common tools into our own image. When we Custom Images | 111


In [None]:
filtered_pages_and_chunks = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
filtered_pages_and_chunks[:2]

[{'page_number': -18,
  'sentence_chunk': 'PROGRAMMING Building Microservices ISBN: 978-1-491-95035-7 US $49.99\t CAN $57.99 “ The Microservices architecture has many appealing qualities, but the road towards it has painful traps for the unwary. This book will help you figure out if this path is for you, and how to avoid those traps on your journey.” —Martin Fowler Chief Scientist, ThoughtWorks Twitter: @oreillymedia facebook.com/oreilly Distributed systems have become more fine-grained in the past 10 years, shifting from code-heavy monolithic applications to smaller, self-contained microservices. But developing these systems brings its own set of headaches. With lots of examples and practical advice, this book takes a holistic view of the topics that system architects and administrators must consider when building, managing, and evolving microservice architectures. Microservice technologies are moving quickly. Author Sam Newman provides you with a firm grounding in the concepts while 

In [None]:
df = pd.DataFrame(filtered_pages_and_chunks)
df.describe().round(2)  # The chunk count is now reduced to 642

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,642.0,642.0,642.0,642.0
mean,120.69,976.53,173.95,244.13
std,78.16,411.92,92.21,102.98
min,-18.0,123.0,24.0,30.75
25%,53.25,705.0,127.25,176.25
50%,121.0,1035.5,180.0,258.88
75%,187.0,1250.0,215.0,312.5
max,260.0,2404.0,1156.0,601.0


In [None]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu")

In [None]:
%%time
# TODO: Remove this cell; it is only a quick test of the embedding model on CPU
single_sentence = "Hellooo from microbot!!"
single_embedding = embedding_model.encode(single_sentence)
print(f"Sentence: {single_sentence}")
print(f"Embedding:\n{single_embedding}")
print(f"Embedding size: {single_embedding.shape}")

Sentence: Hellooo from microbot!!
Embedding:
[ 5.79969548e-02 -4.07439284e-02 -1.61629226e-02 -5.22268005e-03
  5.84756508e-02  4.92249569e-03  3.09310481e-02  7.30430009e-04
  9.90403001e-04  7.73363048e-03  3.26934643e-02  3.25390100e-02
 -3.58189009e-02  9.50230584e-02  2.79407781e-02 -4.20339443e-02
  5.37104532e-02 -2.61030439e-02 -6.49524629e-02  1.01029444e-02
  5.45813963e-02  6.68614078e-03 -9.47924331e-03  4.88695018e-02
 -3.29577774e-02 -2.73885857e-02 -1.04041295e-02  3.99433114e-02
  2.99903415e-02 -4.48150635e-02  9.02634952e-03 -3.36240418e-02
  2.39242725e-02  7.76743889e-02  1.90300045e-06 -1.41998613e-02
 -7.75742065e-03  2.27753110e-02 -3.15696485e-02 -3.10772993e-02
  2.05999259e-02 -7.36647332e-03 -2.61037927e-02  3.38850021e-02
  1.77306365e-02 -1.73701625e-02  2.96300147e-02  2.36570626e-03
  2.63083857e-02  2.89837364e-02 -2.35963184e-02  1.23626374e-01
 -3.40511911e-02  9.33346525e-03  5.24827465e-02  7.55690038e-02
 -1.00305267e-02  5.18335141e-02  2.63888054e

In [None]:
embedding_model.to("cuda") #Changing the model to use GPU instead of CPU

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [None]:
%%time

single_sentence = "Hellooo from microbot!!"
single_embedding = embedding_model.encode(single_sentence)
print(f"Sentence: {single_sentence}")
print(f"Embedding:\n{single_embedding}")
print(f"Embedding size: {single_embedding.shape}")


Sentence: Hellooo from microbot!!
Embedding:
[ 5.79970628e-02 -4.07439284e-02 -1.61628872e-02 -5.22268331e-03
  5.84755801e-02  4.92250873e-03  3.09310537e-02  7.30411033e-04
  9.90413828e-04  7.73363514e-03  3.26935127e-02  3.25390026e-02
 -3.58189493e-02  9.50230137e-02  2.79407762e-02 -4.20339108e-02
  5.37104160e-02 -2.61030495e-02 -6.49525151e-02  1.01029323e-02
  5.45813516e-02  6.68617152e-03 -9.47925448e-03  4.88694981e-02
 -3.29577848e-02 -2.73886062e-02 -1.04041817e-02  3.99433300e-02
  2.99903508e-02 -4.48150225e-02  9.02629737e-03 -3.36240940e-02
  2.39242986e-02  7.76743665e-02  1.90300057e-06 -1.41998008e-02
 -7.75738480e-03  2.27752514e-02 -3.15696746e-02 -3.10772751e-02
  2.05998942e-02 -7.36642536e-03 -2.61037815e-02  3.38850021e-02
  1.77305732e-02 -1.73701067e-02  2.96299923e-02  2.36568577e-03
  2.63083670e-02  2.89837383e-02 -2.35963091e-02  1.23626366e-01
 -3.40512060e-02  9.33349878e-03  5.24827391e-02  7.55690485e-02
 -1.00304922e-02  5.18335141e-02  2.63888054e

* In this run, moving the embedding model from CPU to GPU gave at least a 5x speedup

In [None]:
# Add new key/val pair including the embeddings
for item in tqdm(filtered_pages_and_chunks):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])


In [None]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in filtered_pages_and_chunks]

In [None]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # encode the chunks in batches of 32 instead of one at a time, which is faster
                                               convert_to_tensor=True)

text_chunk_embeddings

CPU times: user 8.72 s, sys: 5.76 ms, total: 8.72 s
Wall time: 8.52 s


tensor([[ 0.0535, -0.0013, -0.0196,  ...,  0.0390,  0.0505, -0.0132],
        [ 0.0106,  0.0138, -0.0255,  ...,  0.0064, -0.0293, -0.0110],
        [-0.0034, -0.0685, -0.0184,  ..., -0.0363, -0.0169, -0.0073],
        ...,
        [ 0.0197,  0.0063, -0.0372,  ...,  0.0163,  0.0297, -0.0425],
        [-0.0107,  0.0404, -0.0043,  ..., -0.0085,  0.0348, -0.0265],
        [ 0.0647,  0.0453, -0.0398,  ...,  0.0240,  0.0135,  0.0216]],
       device='cuda:0')
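
As a conceptual illustration of what `batch_size=32` means for 642 chunks, the sketch below groups placeholder strings into batches. It is not how `sentence-transformers` implements batching internally; `iter_batches` is a hypothetical helper.

```python
def iter_batches(items: list, batch_size: int):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 642 placeholder strings stand in for the filtered text chunks
chunks = [f"chunk {i}" for i in range(642)]
batches = list(iter_batches(chunks, batch_size=32))

print(len(batches))      # 21
print(len(batches[-1]))  # 2
```

So the model runs 21 forward passes (20 full batches plus one with the last 2 chunks) rather than 642 single-item passes.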

In [None]:
from google.colab import drive #To save the embedding to drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
save_path = '/content/drive/My Drive/embeddings/' #NOTE: Change accordingly if running this locally

if not os.path.exists(save_path):
  os.makedirs(save_path)

text_chunks_and_embeddings_df = pd.DataFrame(filtered_pages_and_chunks)
embeddings_df_file_name = "text_chunks_and_embeddings_df.csv"
embeddings_save_path = save_path + embeddings_df_file_name
text_chunks_and_embeddings_df.to_csv(embeddings_save_path, index=False)

In [None]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-18,PROGRAMMING Building Microservices ISBN: 978-1...,1988,281,497.0,[ 5.35305664e-02 -1.32836215e-03 -1.96003914e-...
1,-16,978-1-491-95035-7 [LSI] Building Microservices...,1480,196,370.0,[ 1.05742468e-02 1.37516996e-02 -2.55072583e-...
2,-16,Use of the information and instructions contai...,338,56,84.5,[-3.41875851e-03 -6.84858263e-02 -1.84098165e-...
3,-15,"....1 What Are Microservices?2 Small, and Focu...",805,445,201.25,[ 4.90625054e-02 -2.43801214e-02 -3.40706035e-...
4,-15,........13 Inaccurate Comparisons ...,303,174,75.75,[-8.73544719e-03 3.87647003e-02 -2.41016690e-...


In [None]:
import random

import torch
import numpy as np
import pandas as pd

# Paths are redefined here so this cell can run on its own after a lost runtime,
# without re-running the earlier cells (unless the embedding model or the chunks changed)
save_path = '/content/drive/My Drive/embeddings/'
embeddings_df_file_name = "text_chunks_and_embeddings_df.csv"
embeddings_save_path = save_path + embeddings_df_file_name

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv(embeddings_save_path)

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([642, 768])
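
The CSV stores each embedding as its printed string, which is why the column has to be parsed back into numbers after loading. The sketch below does the same conversion in plain Python on a shortened, three-value example. `parse_embedding` is a hypothetical helper shown only for illustration; the notebook itself uses `np.fromstring(..., sep=" ")`.

```python
def parse_embedding(cell: str) -> list[float]:
    """Parse a whitespace-separated vector such as '[ 5.3e-02 -1.3e-03]' into floats."""
    # Drop the surrounding brackets, then split on any whitespace (spaces or newlines)
    return [float(value) for value in cell.strip("[]").split()]

# Shortened example of how one embedding cell looks in the saved CSV
cell = "[ 5.35305664e-02 -1.32836215e-03 -1.96003914e-02]"
vector = parse_embedding(cell)
print(len(vector))
print(vector)
```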

In [None]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device) # choose the device to load the model to

In [None]:
# Run a sample query against the chunk embeddings

query = "what is a microservice"  # Queries should relate to microservices, since the index is built from a microservices book
print(f"Query: {query}")

# NOTE: Embed the query with the same model used for the text chunks
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# Print the top-k results
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: what is a microservice
Time taken to get scores on 642 embeddings: 0.00017 seconds.


torch.return_types.topk(
values=tensor([0.7197, 0.7160, 0.6787, 0.6731, 0.6649], device='cuda:0'),
indices=tensor([ 25,  26, 619, 628,   0], device='cuda:0'))
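
The model printout earlier ends with a `Normalize()` layer, so the embeddings should be unit length, and for unit-length vectors the dot product equals cosine similarity. A minimal sketch in plain Python with made-up 2-D vectors illustrates this; the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query_vec = normalize([3.0, 4.0])
chunk_vec = normalize([4.0, 3.0])

print(round(dot(query_vec, chunk_vec), 4))     # 0.96
print(round(cosine(query_vec, chunk_vec), 4))  # 0.96
```

This is why ranking by `util.dot_score` gives the same order as ranking by cosine similarity here, while skipping the extra normalization step.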

In [None]:
#Helper function to print wrapped text
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [None]:
# Format the dot-prod output to print the actual text instead of tensors

print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'what is a microservice'

Results:
Score: 0.7197
Text:
At the same time, many organizations were experimenting with finer-grained
archi‐ tectures to accomplish similar goals, but also to achieve things like
improved scaling, increasing autonomy of teams, or to more easily embrace new
technologies. My own experiences, as well as those of my colleagues at
ThoughtWorks and elsewhere, rein‐ forced the fact that using larger numbers of
services with their own independent life‐ cycles resulted in more headaches that
had to be dealt with. In many ways, this book was imagined as a one-stop shop
that would help encompass the wide variety of top‐ ics that are necessary for
understanding microservices—something that would have helped me greatly in the
past!A Word on Microservices Today Microservices is a fast-moving topic.
Although the idea is not new (even if the term itself is), experiences from
people all over the world, along with the emergence of new technologies, are
having a profoun

In [None]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,
                                   convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """

    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)

    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

In [None]:
query = "monolithic architecture"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

[INFO] Time taken to get scores on 642 embeddings: 0.00010 seconds.


(tensor([0.5825, 0.5170, 0.4956, 0.4870, 0.4844], device='cuda:0'),
 tensor([223, 402,  49, 227,  20], device='cuda:0'))
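
The returned indices are positions in `pages_and_chunks`, so each score maps straight back to a chunk and its page number. A dependency-free sketch of that top-k lookup, using made-up scores and records (`top_k` is a hypothetical stand-in for `torch.topk`):

```python
import heapq

def top_k(scores: list[float], k: int) -> list[tuple[float, int]]:
    """Return the k highest (score, index) pairs in descending score order."""
    return heapq.nlargest(k, ((score, idx) for idx, score in enumerate(scores)))

# Toy scores and chunk records; the real pipeline scores 642 chunks on the GPU
scores = [0.12, 0.58, 0.31, 0.49, 0.05]
records = [{"sentence_chunk": f"chunk {i}", "page_number": i + 1} for i in range(5)]

results = top_k(scores, k=3)
for score, idx in results:
    print(f"{score:.2f} | page {records[idx]['page_number']} | {records[idx]['sentence_chunk']}")
```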

In [None]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 642 embeddings: 0.00008 seconds.
Query: monolithic architecture

Results:
Score: 0.5825
CHAPTER 5 Splitting the Monolith We’ve discussed what a good service looks like,
and why smaller servers may be better for us. We also previously discussed the
importance of being able to evolve the design of our systems. But how do we
handle the fact that we may already have a large num‐ ber of codebases lying
about that don’t follow these patterns?How do we go about decomposing these
monolithic applications without having to embark on a big-bang rewrite?The
monolith grows over time. It acquires new functionality and lines of code at an
alarming rate. Before long it becomes a big, scary giant presence in our
organization that people are scared to touch or change. But all is not lost!With
the right tools at our disposal, we can slay this beast. It’s All About Seams We
discussed in Chapter 3 that we want our services to be highly cohesive and
loosely coupled.
Page n

In [None]:
# Get the GPU's total memory (total_memory is device capacity, not currently free memory)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")


# Defaults, so the prints below do not raise NameError when no model fits
use_quantization_config = None
model_id = None

# Choose an appropriate Gemma model based on the GPU memory
if gpu_memory_gb < 5.1:
    print("Your available GPU memory is insufficient to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print("Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print("Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else:  # 19 GB or more (a plain "> 19.0" check would skip exactly 19 GB)
    print("Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

Available GPU memory: 15 GB
Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: google/gemma-2b-it
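
The selection logic is easier to check at the thresholds when written as a pure function. This is a minimal sketch using the same thresholds as the cell above; `recommend_gemma` is a hypothetical helper, and returning `(None, None)` for too little memory is an assumption about how that case should be signalled.

```python
def recommend_gemma(gpu_memory_gb: float):
    """Return (model_id, use_quantization_config), or (None, None) if memory is too low."""
    if gpu_memory_gb < 5.1:
        return None, None
    if gpu_memory_gb < 8.1:
        return "google/gemma-2b-it", True   # Gemma 2B in 4-bit precision
    if gpu_memory_gb < 19.0:
        return "google/gemma-2b-it", False  # Gemma 2B in float16
    return "google/gemma-7b-it", False      # 19 GB or more

for memory_gb in (4, 8, 15, 19, 24):
    print(memory_gb, recommend_gemma(memory_gb))
```

With the 15 GB reported above, this returns `google/gemma-2b-it` without quantization, matching the output of the cell.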


In [None]:
!huggingface-cli login



    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `Microbot-Rag-Token` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: 

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# 1. Create quantization config for smaller model loading (optional)
# Requires !pip install bitsandbytes accelerate, see: https://github.com/TimDettmers/bitsandbytes, https://huggingface.co/docs/accelerate/
# For models that require 4-bit quantization (use this if you have low GPU memory available)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Bonus: Setup Flash Attention 2 for faster inference, default to "sdpa" or "scaled dot product attention" if it's not available
# Flash Attention 2 requires NVIDIA GPU compute capability of 8.0 or above, see: https://developer.nvidia.com/cuda-gpus
# Requires !pip install flash-attn, see: https://github.com/Dao-AILab/flash-attention
if is_flash_attn_2_available() and torch.cuda.get_device_capability(0)[0] >= 8:
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")

# 2. Pick a model we'd like to use (this will depend on how much GPU memory you have available)
# model_id was already set by the GPU memory check above; uncomment the next line to override it
# model_id = "google/gemma-7b-it"
print(f"[INFO] Using model_id: {model_id}")

# 3. Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config: # quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: google/gemma-2b-it


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


In [None]:
llm_model

def get_model_num_params(model: torch.nn.Module):
    """Returns the total number of parameters in the model."""
    return sum(param.numel() for param in model.parameters())

get_model_num_params(llm_model)

2506172416

In [None]:
def get_model_mem_size(model: torch.nn.Module):
    """Returns the memory taken by the model's parameters and buffers in bytes, MB and GB."""

    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate various model sizes
    model_mem_bytes = mem_params + mem_buffers # in bytes
    model_mem_mb = model_mem_bytes / (1024**2) # in megabytes
    model_mem_gb = model_mem_bytes / (1024**3) # in gigabytes

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2),
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(llm_model)

{'model_mem_bytes': 5012354048, 'model_mem_mb': 4780.15, 'model_mem_gb': 4.67}
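
The reported size follows almost directly from the parameter count: 2,506,172,416 parameters at 2 bytes each (float16) come to 5,012,344,832 bytes, and the small remainder is presumably buffers. Below is a minimal sketch of that arithmetic, which can help estimate the footprint at other precisions before loading a model. The helper name `estimate_weights_gb` is hypothetical, and the estimate covers the weights only (no activations, KV cache, or quantization overhead).

```python
def estimate_weights_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (1024**3 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return round(total_bytes / (1024 ** 3), 2)

# Parameter count reported for google/gemma-2b-it in the cell above
num_params = 2_506_172_416

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{estimate_weights_gb(num_params, bits)} GB")
```

The float16 row should agree with the `model_mem_gb` value measured above, which is a quick sanity check that the model really was loaded in half precision.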

In [None]:
input_text = "What are microservices and why it is so popular in modern software engineering?"
print(f"Input text:\n{input_text}")

dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# gemma-2b-it is an instruction-tuned model, so the prompt has to follow its chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
What are microservices and why it is so popular in modern software engineering?

Prompt (formatted):
<bos><start_of_turn>user
What are microservices and why it is so popular in modern software engineering?<end_of_turn>
<start_of_turn>model
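
Judging from the formatted prompt above, the Gemma chat template wraps each turn in `<start_of_turn>` / `<end_of_turn>` markers and ends with an open `model` turn for the answer. Here is a rough pure-Python sketch of that layout. `format_gemma_prompt` is a hypothetical helper written only to illustrate the structure, the mapping of an `assistant` role to `model` is an assumption, and `tokenizer.apply_chat_template` remains the source of truth.

```python
def format_gemma_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Illustrative re-creation of the prompt layout printed above (not the real template)."""
    prompt = "<bos>"
    for message in messages:
        # Assumption: an "assistant" turn is labelled "model", as in the generation prompt above
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        prompt += f"<start_of_turn>{role}\n{content}<end_of_turn>\n"
    if add_generation_prompt:
        # Leave a model turn open so generation continues from here
        prompt += "<start_of_turn>model\n"
    return prompt

example_prompt = format_gemma_prompt([{"role": "user", "content": "What are microservices?"}])
print(example_prompt)
```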



In [None]:
%%time

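# Tokenize the input text (turn it into numbers) and send it to GPU
# Note: the chat template already prepends <bos> and the tokenizer adds another one by default,
# hence the doubled <bos> in the output below; pass add_special_tokens=False to avoid it
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input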
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=512) # define the maximum number of new tokens to create
# print(f"Model output (tokens):\n{outputs[0]}\n")

# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model input (tokenized):
{'input_ids': tensor([[     2,      2,    106,   1645,    108,   1841,    708,  19048,   5209,
            578,   3165,    665,    603,    712,   5876,    575,   5354,   6815,
          13299, 235336,    107,    108,    106,   2516,    108]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]], device='cuda:0')}

Model output (decoded):
<bos><bos><start_of_turn>user
What are microservices and why it is so popular in modern software engineering?<end_of_turn>
<start_of_turn>model
**Microservices**

Microservices are a software architecture style that breaks down a large software application into smaller, independent services. Each service is responsible for a specific functionality, and it is deployed and scaled independently.

**Benefits of Microservices Architecture:**

* **Agility and Speed:** Microservices architecture allows developers to build and deploy software features i
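
The decoded output above echoes the prompt because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. One way to keep only the answer is to slice off the prompt-length prefix before decoding. The sketch below uses plain Python lists in place of tensors, and the IDs for the new tokens are made up for illustration. With the real objects, the equivalent slice would be something like `outputs[0][input_ids["input_ids"].shape[1]:]`.

```python
def drop_prompt_tokens(output_ids: list[int], prompt_length: int) -> list[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

example_prompt_ids = [2, 2, 106, 1645, 108]            # first few prompt token IDs from the output above
example_output_ids = example_prompt_ids + [101, 102, 103]  # made-up IDs standing in for new tokens

new_token_ids = drop_prompt_tokens(example_output_ids, len(example_prompt_ids))
print(new_token_ids)
```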

In [None]:
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into a single bulleted list (one "- " item per chunk)
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt that tells the model how to use the retrieved context
    # Note: this is very customizable, e.g. a few examples of the desired answer style could be added.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please formulate an answer to the given query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
If the provided context items are seems to be incomplete or irrelavant, discard them.
But make sure you check the context items before discarding any of them as irrelavant
\nNow use the following context items to answer the user query if relevant:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
         "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                           tokenize=False,
                                           add_generation_prompt=True)
    return prompt
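
The context-joining step can be tried on its own, without a model or tokenizer. Here is a small sketch with two made-up context items (the real ones come from `pages_and_chunks`). It assumes only that each item has a `sentence_chunk` key.

```python
# Made-up stand-ins for retrieved chunks; only the "sentence_chunk" key is assumed
example_items = [
    {"sentence_chunk": "Microservices are small, autonomous services that work together."},
    {"sentence_chunk": "Each service can be deployed independently."},
]

# Same joining logic as in prompt_formatter: one "- " bullet per chunk
example_context = "- " + "\n- ".join(item["sentence_chunk"] for item in example_items)
print(example_context)
```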

In [None]:
query = "What are microservices and why it is so popular in modern software engineering?"
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: What are microservices and why it is so popular in modern software engineering?
[INFO] Time taken to get scores on 642 embeddings: 0.00008 seconds.
<bos><start_of_turn>user
Based on the following context items, please formulate an answer to the given query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
If the provided context items are seems to be incomplete or irrelavant, discard them. 
But make sure you check the context items before discarding any of them as irrelavant

Now use the following context items to answer the user query if relevant:
- vices give us significantly more freedom to react and make different decisions, allowing us to respond faster to the inevitable change that impacts all of us. What Are Microservices?Microservices are small, autonomous services that work together. Let’s break that def‐ initi

In [None]:
%%time

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate an output of tokens
outputs = llm_model.generate(**input_ids,
                             temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                             do_sample=True, # whether or not to use sampling, see https://huyenchip.com/2024/01/16/sampling.html for more
                             max_new_tokens=512) # how many new tokens to generate from prompt

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

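# Strip the prompt and special tokens so only the generated answer remains.
# The last replace only matches this exact preamble; variants such as
# "...to the user's query:" pass through unchanged.
output_text = (output_text.replace(prompt, "")
                          .replace("<bos>", "")
                          .replace("<eos>", "")
                          .replace("Sure, here is the answer to the user query:\n\n", ""))

print(f"Query: {query}")
print("RAG answer:\n")
print_wrapped(output_text)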

Query: What are microservices and why it is so popular in modern software engineering?
RAG answer:

Sure, here is the answer to the user's query:  Microservices are small,
autonomous services that work together to achieve a specific goal. Microservices
are popular in modern software engineering because they offer several benefits,
including:  - **Agility:** Microservices allow developers to build and deploy
software changes independently, reducing downtime and increasing development
speed. - **Scalability:** Microservices can be scaled horizontally, allowing
organizations to handle increasing demand. - **Resilience:** Microservices are
isolated from each other, making it easier to roll back changes or recover from
failures. - **Maintainability:** Microservices are easier to maintain and debug
than monolithic applications. - **Technology flexibility:** Microservices allow
developers to use different technologies to build their applications, which can
be more efficient and cost-effective
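
The `temperature` argument in the generation call rescales the model's logits before sampling. Each logit is divided by the temperature, so values below 1 sharpen the distribution and values above 1 flatten it. Here is a toy sketch with three made-up logits (not taken from Gemma) to make that effect visible.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for temperature in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs]}")
```

At low temperature nearly all probability mass sits on the highest-scoring token, which is why lower values give more repeatable answers when `do_sample=True`.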