# Experiment Architecture
## RAG (Retrieval-Augmented Generation)
#### Differences
- Delta is not used for data operations in this experiment
- The input data are PDF files, not XML files
#### Problem Statement
- We want Dolly to answer questions about leader election in distributed computing systems, using the retrieved papers as context
#### Flow
<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/main/images/product/llm-dolly/llm-dolly-full.png" width="1000" />


# Reading PDF papers
##### All the downloaded academic papers are about "Leader Election in Distributed Computing Systems"
The papers have been downloaded from **arXiv**

<img src="https://info.arxiv.org/brand/images/brand-logo-primary.jpg" width="250" />


In [15]:
import PyPDF2
from tqdm import tqdm
import os


def clean_chunk(raw_chunk):
    chunk = raw_chunk.replace('-\n','').replace('\n',' ')
    return chunk


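# Keep only files whose name ends in ".pdf" (a plain substring test would also match e.g. "notes.pdf.bak")
pdf_paths = [x for x in os.listdir("data/pdf/leader_election") if x.endswith(".pdf")]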

print("\n".join(pdf_paths))

bodies = []
titles = []

for pdf_path in pdf_paths:
    with open(f"data/pdf/leader_election/{pdf_path}", "rb") as file:
        reader = PyPDF2.PdfReader(file)
        body = []
        for page in tqdm(reader.pages):
            text = page.extract_text()
            body.append(clean_chunk(text))
        bodies.append(body[:-1])  # drop the last page, assumed to contain only references
        titles.append(pdf_path)

Modified Bully Algorithm using Election.pdf
Improved Bully Election Algorithm for Distributed Systems.pdf
Improved Tradeoffs for Leader Election.pdf
ZePoP A Distributed Leader Election Protocol using the Delay-based Closeness Centrality for Peer-to-Peer Applications.pdf
A Survey and Taxonomy of Leader Election Algorithms in Distributed Systems.pdf
Distributed Consensus in Content Centric Networking.pdf
Fault Toleran Leader Election in Distributed Systems.pdf


100%|██████████| 8/8 [00:00<00:00, 19.39it/s]
100%|██████████| 22/22 [00:00<00:00, 29.85it/s]
100%|██████████| 33/33 [00:01<00:00, 19.45it/s]
100%|██████████| 9/9 [00:01<00:00,  7.87it/s]
100%|██████████| 17/17 [00:00<00:00, 25.29it/s]
100%|██████████| 3/3 [00:00<00:00, 26.54it/s]
100%|██████████| 8/8 [00:00<00:00, 29.54it/s]
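
##### What `clean_chunk` does to extracted text
A small sketch of the cleaning step on a made-up snippet (the `raw` string is hypothetical, not taken from the papers). It re-joins words that were hyphenated across a line break and replaces the remaining newlines with spaces. One limitation worth knowing: a genuine compound split at its hyphen loses the hyphen too.

```python
def clean_chunk(raw_chunk):
    # Same logic as the notebook's helper: re-join words hyphenated across
    # line breaks, then turn the remaining newlines into spaces.
    return raw_chunk.replace('-\n', '').replace('\n', ' ')


# Hypothetical snippet imitating raw PDF text extraction output
raw = "Leader elec-\ntion is a classical\nproblem."
cleaned = clean_chunk(raw)
print(cleaned)  # Leader election is a classical problem.

# Limitation: a real compound broken at its hyphen is merged into one word
merged = clean_chunk("fault-\ntolerant")
print(merged)  # faulttolerant
```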


# Embeddings
##### Multidimensional Vector Representation of Semantic Meanings
<img src="https://corpling.hypotheses.org/files/2018/04/Screen-Shot-2018-04-25-at-13.21.44.png" width="400" />

In [18]:
from sentence_transformers import SentenceTransformer
import itertools

model_name = "sentence-transformers/all-MiniLM-L12-v2"

model = SentenceTransformer(model_name, device='cuda')

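# Flatten the per-paper page lists into one list; each "sentence" here is a full page of text
sentences = list(itertools.chain.from_iterable(bodies))

# Encode all pages in one batched call and convert to plain Python floats for Chroma
embeddings = model.encode(sentences).tolist()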
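
##### How closeness between embeddings can be measured
Texts with similar meaning should map to vectors that point in similar directions. A minimal sketch of cosine similarity in plain Python, using hand-made 3-dimensional vectors as stand-ins for real embeddings (the vectors and their names are hypothetical; actual model output has far more dimensions):

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"
leader = [0.9, 0.1, 0.0]
coordinator = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(leader, coordinator)
sim_far = cosine_similarity(leader, banana)
print(sim_close > sim_far)  # True
```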

# Database
##### Creating a Vector Database of Embeddings using the open-source ChromaDB
<img src="https://www.mlq.ai/content/images/2023/08/1_admwyPyR6v_IZI0EYE--eA-1.webp" width="250" />

In [19]:
import chromadb
from chromadb.utils import embedding_functions


chroma_client = chromadb.Client()

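# The collection's embedding function is only invoked when documents or queries are passed
# without precomputed vectors; below, the SentenceTransformer embeddings are supplied explicitly
default_ef = embedding_functions.DefaultEmbeddingFunction()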

collection = chroma_client.get_or_create_collection(name="leader_election_distributed_systems", embedding_function=default_ef)

collection.add(
    documents=sentences,
    embeddings=embeddings,
    ids=[str(x) for x in range(len(embeddings))]
)
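
##### What a similarity query does, conceptually
Querying the collection amounts to ranking the stored vectors by their distance to the query vector. The sketch below does this by brute force in plain Python. It is only an illustration of the idea: Chroma relies on an approximate nearest-neighbour index rather than an exhaustive scan, and the `top_k` helper, the squared Euclidean distance, and the toy vectors are assumptions made for this example.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two vectors of equal length
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, stored, k=1):
    # stored maps a document id to its vector; return the ids of the k closest vectors
    ranked = sorted(stored, key=lambda doc_id: squared_l2(query, stored[doc_id]))
    return ranked[:k]


# Hypothetical 2-dimensional vectors keyed by document id
stored = {"0": [0.0, 1.0], "1": [1.0, 0.0], "2": [0.7, 0.7]}
nearest = top_k([0.9, 0.1], stored, k=2)
print(nearest)  # ['1', '2']
```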

# Fine-Tuned Model
##### Based on Databricks' open-source Dolly
<img src="https://www.databricks.com/sites/default/files/2023-04/Dolly-logo.png" width="300" />

In [27]:
from transformers import pipeline
import torch
from langchain import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain


def build_qa_chain():

    model_name = "databricks/dolly-v2-3b" # Dolly smallest version (3 billion params)

    instruct_pipeline = pipeline(model=model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
                                 return_full_text=True, max_new_tokens=4096, top_p=0.95, top_k=50,
                                 device='cuda')  # run the model on the GPU

    template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert about leader election algorithms in distributed systems.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions, no images.

    {context}

    Question: {question}

    Response:
    """

    prompt = PromptTemplate(input_variables=['context', 'question'], template=template)

    hf_pipe = HuggingFacePipeline(pipeline=instruct_pipeline)

    return load_qa_chain(llm=hf_pipe, chain_type="stuff", prompt=prompt, verbose=True)

In [28]:
# Building the chain will load Dolly and can take several minutes depending on the model size
qa_chain = build_qa_chain()
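
##### What the "stuff" chain type does
The `stuff` chain type inserts the text of all input documents into the `{context}` slot of the prompt and sends the result to the model in a single call. A rough sketch of that formatting step without LangChain is shown here; the `stuff_prompt` helper and the shortened template are hypothetical, and joining documents with a blank line is an assumption about the default separator.

```python
# Shortened, hypothetical version of a prompt template with two slots
template = (
    "Instruction:\n"
    "You are an expert about leader election algorithms in distributed systems.\n\n"
    "{context}\n\n"
    "Question: {question}\n\n"
    "Response:\n"
)


def stuff_prompt(pages, question):
    # "Stuff" every retrieved page into one context string, then fill the template
    context = "\n\n".join(pages)
    return template.format(context=context, question=question)


prompt_text = stuff_prompt(["Page A text.", "Page B text."], "What is a leader?")
print(prompt_text)
```

Because everything is placed in one prompt, the combined length of the retrieved pages must fit within the model's context window.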

In [29]:
class Document():
    def __init__(self, content):
        self.page_content = content
        self.metadata = {"metadata": "leader election paper page from arxiv"}

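def get_similar_docs(question):
    results = collection.query(
        query_embeddings=[model.encode(question).tolist()],
        n_results=1
    )
    # Chroma returns one list of documents per query; take the hits for our single query
    # so that each Document receives a string rather than a list of strings
    return results["documents"][0]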

def answer_question(question):
    similar_docs = [Document(x) for x in get_similar_docs(question)]
    result = qa_chain({"input_documents": similar_docs, "question": question})
    return result
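
##### Shape of a Chroma query result
Chroma groups query results per query, so `results["documents"]` is a list of lists: one inner list of document strings for each query embedding. The sketch below uses a hand-written dictionary as a stand-in for a real result (the ids and texts are made up) to show how to reach the strings for a single query.

```python
# Hand-written stand-in for a query result with one query and two hits
results = {
    "ids": [["3", "7"]],
    "documents": [["text of page 3", "text of page 7"]],
}

# Index 0 selects the hits that belong to the first (and only) query
hits_for_first_query = results["documents"][0]
print(hits_for_first_query)  # ['text of page 3', 'text of page 7']

# Each hit is a plain string, suitable as page content for a document object
print(all(isinstance(hit, str) for hit in hits_for_first_query))  # True
```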

In [30]:
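import os
# Workaround: an empty CURL_CA_BUNDLE is commonly used to bypass SSL certificate
# verification in requests; use only in trusted environments
os.environ['CURL_CA_BUNDLE'] = ''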


question = "Why distributed systems need a leader?"

answer = answer_question(question)

print(answer["output_text"])



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert about leader election algorithms in distributed systems.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions, no images.

    ['International Journal of Computer Science & Informa tion Technology (IJCSIT) Vol 9, No 1, February 2017   DOI:10.5121/ijcsit.2017.9102                                                                                                                        13   FAULT TOLERANT LEADER ELECTION IN  DISTRIBUTED SYSTEMS     Marius Rafailescu    The Faculty of Automatic Control and Computers, POL ITEHNICA University,  Bucharest    ABSTRACT     There are many distributed systems which use a lead er in their logic. When such systems need to be f

In [32]:
question = "Explain shortly the modified bully algorithm for leader election in distributed systems"

answer = answer_question(question)

print(answer["output_text"])



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert about leader election algorithms in distributed systems.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions, no images.

    ['arXiv:1403.3255v1  [cs.DC]  28 Feb 2014Improved Bully Election Algorithm for Distributed Systems P Beaulah Soundarabaia, Ritesh Sahaia, Thriveni Jb, K R Venugopalb, L M Patnaikc aDepartment of Computer Science , Christ University, Bangalore 56 0 029 India, Contact: beaulah.s@christuniversity.in bUniversity Visvesvaraya College of Engineering, Bangalore Universit y, Bangalore. cHonorary Professor, Indian Institute of Science, Bangalore. Electing a leader is a classical problem in distributed comp uting system. Synchronization between p

In [35]:
question = "Explain shortly the ZePoP distributed leader election protocol"

answer = answer_question(question)

print(answer["output_text"])



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert about leader election algorithms in distributed systems.
    You use a simple language to explain concepts.
    You reply using only short textual descriptions, no images.

    ['ZePoP: A Distributed Leader Election Prot ocol using  the Delay -based  Closeness  Centrality  for Peer -to-Peer Applications Md Amjad Hossain   School of Business, Emporia State University   Emporia, KS, 66801   Email: mhossai1@emporia.edu      Javed I. Khan   Department of Computer Science, Kent  State University   Kent, OH, 66801   Email: javed@cs.kent.ed u        Abstract — This paper presents ZePoP, a leader election  protocol  for distributed systems , optimizing a delay -based  closeness centrality . We design the protocol spec