# Experiment Architecture
## Summarization + RAG (Retrieval Augmented Generation)
#### Differences
- Delta is not used for data operations in this experiment
- The input data are HTML not XML files
#### Problem Statement
- We want to get Dolly knowledgeable about the Amadeus open source DataIO framework
#### Flow
<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/main/images/product/llm-dolly/llm-dolly-full.png" width="1000" />

# Reading HTML pages from the DataIO documentation
<img src="https://cdn-icons-png.flaticon.com/512/4248/4248142.png" width="250" />



In [18]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import re


def slice_html_page(html_page):
  chunks = re.findall(r"(<.+>)+", html_page)
  merged_chunks = []
  # split in chunks with no overlap
  for i in range(len(chunks), 50):
    merged_chunks.append(chunks[i:i+50])
  # append the last part, in case len(chunks)%50!=0
  # the possible overalp of the last 2 chunks is not an issue for the LLM
  merged_chunks.append(chunks[-50:])
  return merged_chunks


page_list = [
    "https://amadeusitgroup.github.io/dataio-framework/getting-started.html",
    "https://amadeusitgroup.github.io/dataio-framework/main-concepts.html",
    "https://amadeusitgroup.github.io/dataio-framework/writing-processors.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/configuration.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/processors.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/schema-definitions.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/paths.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/date-filters.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/pipes/pipes.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/pipes/storage.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/pipes/elk.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/pipes/kafka.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/distributors/distributors.html",
    "https://amadeusitgroup.github.io/dataio-framework/configuration/distributors/email.html",
    "https://amadeusitgroup.github.io/dataio-framework/advanced/custom-processor-traits.html"
]

html_page_list = []

for page_link in tqdm(page_list):
  response = requests.get(page_link)
  html_doc = response.text
  soup = BeautifulSoup(html_doc, 'html.parser')
  str_html = soup.prettify()
  html_page_list.append(slice_html_page(str_html))

100%|██████████| 15/15 [00:06<00:00,  2.39it/s]


# Embeddings
##### Multidimensional Vector Representation of Semantic Meanings
<img src="https://corpling.hypotheses.org/files/2018/04/Screen-Shot-2018-04-25-at-13.21.44.png" width="400" />

In [19]:
from sentence_transformers import SentenceTransformer
import itertools

model_name = "sentence-transformers/all-MiniLM-L12-v2"

model = SentenceTransformer(model_name, device='cuda')

sentences = [["".join(chunks) for chunks in chunk_list] for chunk_list in html_page_list]

flattened_sentences = sentences = list(itertools.chain(*sentences))

embeddings = [[float(x) for x in model.encode(s)] for s in sentences]

# Database
##### Creating a Vector Database of Embeddings using the open-source ChromaDB
<img src="https://www.mlq.ai/content/images/2023/08/1_admwyPyR6v_IZI0EYE--eA-1.webp" width="250" />

In [None]:
import chromadb
from chromadb.utils import embedding_functions


chroma_client = chromadb.Client()

default_ef = embedding_functions.DefaultEmbeddingFunction()

collection = chroma_client.get_or_create_collection(name="data_io_doc_chunks", embedding_function=default_ef)

collection.add(
    documents=html_page_list,
    embeddings=embeddings,
    ids=[str(x) for x in range(len(embeddings))]
)

# ChatBot
##### Using the open-source Databricks' Dolly model fine-tuned through the RAG techinique
<img src="https://www.databricks.com/sites/default/files/2023-04/Dolly-logo.png" width="300" />

In [23]:
from transformers import pipeline
import torch
from langchain import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain


def build_qa_chain():

    model_name = "databricks/dolly-v2-3b" # Dolly smallest version (3 billion params)

    instruct_pipeline = pipeline(model=model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
                                 return_full_text=True, max_new_tokens=4096, top_p=0.95, top_k=50,
                                 device=0) #cuda

    template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert the Amadeus DataIO Framework.
    You use a simple language to explain concepts.
    You reply using textual descriptions and examples take from the Amadeus DataIO documentation pages.

    {context}

    Question: {question}

    Response:
    """

    prompt = PromptTemplate(input_variables=['context', 'question'], template=template)

    hf_pipe = HuggingFacePipeline(pipeline=instruct_pipeline)

    return load_qa_chain(llm=hf_pipe, chain_type="stuff", prompt=prompt, verbose=True)

In [24]:
# Building the chain will load Dolly and can take several minutes depending on the model size
qa_chain = build_qa_chain()

In [47]:
class Document():
    def __init__(self, content):
        self.page_content = content
        self.metadata = {"metadata": "DataIO documentation"}

def get_similar_docs(question):
    results = collection.query(
        query_embeddings=[float(x) for x in model.encode(question)],
        n_results=1
    )
    return results["documents"]

def answer_question(question):
    similar_docs = [Document(x) for x in get_similar_docs(question)]
    result = qa_chain({"input_documents": similar_docs, "question": question})
    return result

In [50]:
import os
os.environ['CURL_CA_BUNDLE'] = ''

question = "What is Amadeus' DataIO framework?"

answer = answer_question(question)

print(answer["output_text"])



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mBelow is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert the Amadeus DataIO Framework.
    You use a simple language to explain concepts.
    You reply using textual descriptions and examples take from the Amadeus DataIO documentation pages.

    [None]

    Question: What is Amadeus' DataIO framework?

    Response:
    [0m





[1m> Finished chain.[0m

[1m> Finished chain.[0m

The Amadeus DataIO framework is a set of technologies to collect, store, analyse and use data in the workplace. It consists of the following parts.
1. The Data IO framework and it's components
2. The APIs that use the Data IO framework
3. Processes that use the APIs


In [51]:
question = "Can I use Amadeus' DataIO to read from Kafka?"

answer = answer_question(question)

print(answer["output_text"])



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mBelow is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert the Amadeus DataIO Framework.
    You use a simple language to explain concepts.
    You reply using textual descriptions and examples take from the Amadeus DataIO documentation pages.

    [None]

    Question: Can I use Amadeus' DataIO to read from Kafka?

    Response:
    [0m





[1m> Finished chain.[0m

[1m> Finished chain.[0m

You can use Amadeus' DataIO to read from Kafka, if you use the Kafka API in your Kafka setup and provide the configuration needed to connect to Kafka from the DataIO framework, you can read from Kafka.


In [52]:
question = "Can you provide a configuration file for reading from Kafka Input in text format?"

answer = answer_question(question)

print(answer["output_text"])



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mBelow is an instruction that describes a task. Write a response that appropriately completes the request.

    Instruction:
    You are an expert the Amadeus DataIO Framework.
    You use a simple language to explain concepts.
    You reply using textual descriptions and examples take from the Amadeus DataIO documentation pages.

    [None]

    Question: Can you provide a configuration file for reading from Kafka Input in text format?

    Response:
    [0m





[1m> Finished chain.[0m

[1m> Finished chain.[0m

Sure, here is a sample configuration file for a Kafka Input:

[Kafka Input]
Topic = your_kafka_topic
Broker = your_kafka_broker_id
GroupBy = customer,product
FieldName = your_field_name
FieldType = string
DataFormat = json
DataEncoding = UTF8
HeaderEncoding = UTF8
Validation = true
FieldChecking = true
RecordChecking = true
MaxFieldLength = 256
MaxBytes = 4194304
MinFieldLength = 1
MinBytes = 1
