### DESCRIPTION
Load a long text into Azure Data Explorer (ADX) as embedded chunks; this notebook uses the full text of Moby-Dick.
Use ADX's low-latency vector similarity search to retrieve the chunk most relevant to a question, then pass it to an LLM to generate an answer following the Retrieval Augmented Generation (RAG) pattern.
In short: ADX acts as the vector store for the embeddings, and Generative AI turns the retrieved text into an answer.  


### PREPARATION
* An ADX (Azure Data Explorer or Kusto) cluster  
* In ADX, create a Database named "embeddings"  
    <img src="images/1.png" alt="Create Kusto cluster" /> 

* Create an AAD app registration for Authentication - see below   
    [Create an Azure Active Directory application registration in Azure Data Explorer](https://learn.microsoft.com/en-us/azure/data-explorer/provision-azure-ad-app)


In [1]:
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table

from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from IPython.display import display, HTML, JSON, Markdown

from dotenv import load_dotenv
import pandas as pd
from ast import literal_eval
import os
from tenacity import retry, wait_random_exponential, stop_after_attempt

# Configure environment variables
load_dotenv()

AAD_TENANT_ID = os.getenv("AAD_TENANT_ID")
KUSTO_CLUSTER = os.getenv("KUSTO_CLUSTER")
KUSTO_DATABASE = os.getenv("KUSTO_DATABASE")
KUSTO_TABLE = os.getenv("KUSTO_TABLE")
KUSTO_MANAGED_IDENTITY_APP_ID = os.getenv("KUSTO_MANAGED_IDENTITY_APP_ID")
KUSTO_MANAGED_IDENTITY_SECRET = os.getenv("KUSTO_MANAGED_IDENTITY_SECRET")

# Configure OpenAI API
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_DEPLOYMENT_ENDPOINT = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")
OPENAI_DEPLOYMENT_NAME = os.getenv("OPENAI_DEPLOYMENT_NAME")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")
OPENAI_DEPLOYMENT_VERSION = os.getenv("OPENAI_DEPLOYMENT_VERSION")

OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME = os.getenv("OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME")
OPENAI_ADA_EMBEDDING_MODEL_NAME = os.getenv("OPENAI_ADA_EMBEDDING_MODEL_NAME")
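The settings above are read with `os.getenv`, which silently returns `None` for anything missing from the `.env` file. A minimal sketch of an early check is shown below; the helper name `find_missing_settings` and the choice of which settings count as required are assumptions made for illustration, not part of the original notebook.

```python
import os

# Names taken from the configuration cell above; which ones are truly
# required is an assumption, so adjust the tuple to your setup.
REQUIRED_SETTINGS = (
    "AAD_TENANT_ID",
    "KUSTO_CLUSTER",
    "KUSTO_DATABASE",
    "KUSTO_MANAGED_IDENTITY_APP_ID",
    "KUSTO_MANAGED_IDENTITY_SECRET",
    "OPENAI_API_KEY",
    "OPENAI_DEPLOYMENT_ENDPOINT",
    "OPENAI_DEPLOYMENT_NAME",
    "OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME",
    "OPENAI_ADA_EMBEDDING_MODEL_NAME",
)


def find_missing_settings(names=REQUIRED_SETTINGS, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_settings()
if missing:
    print("Missing settings: " + ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces configuration gaps before any call to ADX or Azure OpenAI is made.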

In [2]:
embeddingmodel = AzureOpenAIEmbeddings(
    deployment=OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME,
    model=OPENAI_ADA_EMBEDDING_MODEL_NAME,
    openai_api_base=OPENAI_DEPLOYMENT_ENDPOINT,
    chunk_size=1,
)




#### IMPORTANT!! Embeddings Creation Section - Run this only once !!!
You only need to run this section once to create the embeddings and save them to Azure Data Explorer.   
After that, reuse the existing database and table in Azure Data Explorer for retrieval.

In [3]:
# You can add as many URLs as you want; this example uses only one.
# The full text of "Moby-Dick" is available online at the URL below.
urls = ["https://www.gutenberg.org/files/2701/2701-0.txt"]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

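# Use a chunk size of 1000 characters with 10% overlap to reduce the chance of cutting sentences in the middle.
# The lookbehind pattern is meant to split after a sentence-ending period so the period stays at the end of the chunk.
# Note: recent langchain versions treat separators as literal strings unless is_separator_regex=True is passed.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separators=["\n\n", "\n", r"(?<=\. )", " ", ""])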
chunks = text_splitter.split_documents(documents)
len(chunks)

1819
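To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping chunking. It is a plain fixed-size sliding window, not the recursive separator-aware algorithm that `RecursiveCharacterTextSplitter` uses, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2):
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary retrievable from at least one chunk.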

In [4]:
# Use the tenacity library to add delays and retries when calling Azure OpenAI, to avoid hitting throttling limits
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def calc_embeddings(text):
    # Replace newlines, which can negatively affect embedding quality
    text = text.replace("\n", " ")
    return embeddingmodel.embed_query(text)
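The retry decorator above waits a random, exponentially growing delay between attempts. The sketch below illustrates that idea in plain Python; it is a simplified approximation, not tenacity's actual implementation, and `call_with_backoff` and its parameters are hypothetical names chosen for this example.

```python
import random
import time


def call_with_backoff(func, *args, max_attempts=6, min_wait=1.0, max_wait=20.0,
                      sleep=time.sleep):
    """Call func(*args); on failure wait a random, growing delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # The upper bound doubles with each attempt but never exceeds max_wait.
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


# Usage with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return text.upper()


# sleep is replaced with a no-op so the example finishes immediately.
print(call_with_backoff(flaky, "ahab", sleep=lambda seconds: None))
```

The random component spreads out retries from parallel callers, so they do not all hit the service again at the same moment.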

In [5]:
# Save all the chunks into a pandas DataFrame (built in one step instead of concatenating row by row)
records = [
    {'document_name': ch.metadata['source'], 'content': ch.page_content, 'embedding': ""}
    for ch in chunks
]
df = pd.DataFrame(records, columns=['document_name', 'content', 'embedding'])
df.head()

|   | document_name | content | embedding |
|---|---|---|---|
| 0 | https://www.gutenberg.org/files/2701/2701-0.txt | The Project Gutenberg eBook of Moby-Dick; or T... | |
| 1 | https://www.gutenberg.org/files/2701/2701-0.txt | CONTENTS\n\nETYMOLOGY.\n\nEXTRACTS (Supplied b... | |
| 2 | https://www.gutenberg.org/files/2701/2701-0.txt | CHAPTER 33. The Specksnyder.\n\nCHAPTER 34. Th... | |
| 3 | https://www.gutenberg.org/files/2701/2701-0.txt | CHAPTER 58. Brit.\n\nCHAPTER 59. Squid.\n\nCHA... | |
| 4 | https://www.gutenberg.org/files/2701/2701-0.txt | CHAPTER 85. The Fountain.\n\nCHAPTER 86. The T... | |


In [6]:
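# Calculate the embeddings with Azure OpenAI (one request per chunk, so this can take a while)
os.makedirs('data/adx', exist_ok=True)  # make sure the output folder exists
df["embedding"] = df["content"].apply(calc_embeddings)
df.to_csv('data/adx/adx_embeddings.csv', index=False)
print(df.head(10))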

                                     document_name  \
0  https://www.gutenberg.org/files/2701/2701-0.txt   
1  https://www.gutenberg.org/files/2701/2701-0.txt   
2  https://www.gutenberg.org/files/2701/2701-0.txt   
3  https://www.gutenberg.org/files/2701/2701-0.txt   
4  https://www.gutenberg.org/files/2701/2701-0.txt   
5  https://www.gutenberg.org/files/2701/2701-0.txt   
6  https://www.gutenberg.org/files/2701/2701-0.txt   
7  https://www.gutenberg.org/files/2701/2701-0.txt   
8  https://www.gutenberg.org/files/2701/2701-0.txt   
9  https://www.gutenberg.org/files/2701/2701-0.txt   

                                             content  \
0  The Project Gutenberg eBook of Moby-Dick; or T...   
1  CONTENTS\n\nETYMOLOGY.\n\nEXTRACTS (Supplied b...   
2  CHAPTER 33. The Specksnyder.\n\nCHAPTER 34. Th...   
3  CHAPTER 58. Brit.\n\nCHAPTER 59. Squid.\n\nCHA...   
4  CHAPTER 85. The Fountain.\n\nCHAPTER 86. The T...   
5  CHAPTER 111. The Pacific.\n\nCHAPTER 112. The ...   
6  CHAPTER 13

In [7]:
#save to local file
df.to_csv('data/adx/adx_embeddings.csv', index=False)
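When the DataFrame is written to CSV, each embedding list is stored as its string form (for example `"[0.1, -0.2, 0.3]"`). If you later read the file back in Python, the column has to be parsed into numbers again. Below is a minimal sketch using only the standard library; the helper name `read_embedding_rows` and the sample values are made up for illustration.

```python
import csv
import io
from ast import literal_eval


def read_embedding_rows(csv_file):
    """Yield one dict per CSV row with the 'embedding' column parsed into floats."""
    for row in csv.DictReader(csv_file):
        # literal_eval safely turns the "[...]" text back into a Python list.
        row["embedding"] = [float(x) for x in literal_eval(row["embedding"])]
        yield row


# Tiny in-memory stand-in for data/adx/adx_embeddings.csv (values are made up).
sample_csv = io.StringIO(
    'document_name,content,embedding\n'
    'doc-1,"Call me Ishmael.","[0.1, -0.2, 0.3]"\n'
)
rows = list(read_embedding_rows(sample_csv))
print(len(rows[0]["embedding"]))
```

With pandas, the same idea would be to apply `literal_eval` to the `embedding` column after `pd.read_csv`.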

### Ingest the embeddings into Azure Data Explorer


* Use one-click ingestion in Azure Data Explorer to load the data from ["./data/adx/adx_embeddings.csv"](./data/adx/adx_embeddings.csv) into a table called "books"   
    <img src="images/2.png" alt="Create Kusto cluster" /> 

In [8]:
# Connect to ADX using the AAD app registration
cluster = KUSTO_CLUSTER
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
    cluster, KUSTO_MANAGED_IDENTITY_APP_ID, KUSTO_MANAGED_IDENTITY_SECRET, AAD_TENANT_ID
)
client = KustoClient(kcsb)
kusto_db = KUSTO_DATABASE
table_name = "books"

In [14]:
# Test that the connection to Kusto works: sample query that takes 10 rows from the books table
query = table_name + " | take 10"

response = client.execute(kusto_db, query)
for row in response.primary_results[0]:
    txt = (row["content"])[0:10]
    print("Title :{}".format(txt))

Title :The Projec
Title :CONTENTS


Title :CHAPTER 33
Title :CHAPTER 58
Title :CHAPTER 85
Title :CHAPTER 11
Title :CHAPTER 13
Title :“While you
Title :EXTRACTS. 
Title :So fare th


In [15]:
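def get_answer_from_adx(question, nr_of_answers=1):
    # Embed the question with the same model that was used for the stored chunks
    searched_embedding = calc_embeddings(question)
    # series_cosine_similarity_fl is a user-defined function from the Kusto functions library
    # and must already exist in the database; the trailing 1, 1 are the pre-computed vector norms
    kusto_query = (
        f"{table_name} "
        f"| extend similarity = series_cosine_similarity_fl(dynamic({searched_embedding}), embedding, 1, 1) "
        f"| top {nr_of_answers} by similarity desc"
    )
    response = client.execute(kusto_db, kusto_query)
    # Join all returned chunks so that nr_of_answers > 1 is honored
    return "\n\n".join(row["content"] for row in response.primary_results[0])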
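The query above ranks rows by cosine similarity between the question embedding and each stored embedding. The following sketch shows the same ranking in plain Python on toy vectors; it illustrates the math only, not how ADX evaluates `series_cosine_similarity_fl`, and the helper names `cosine_similarity` and `top_matches` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_matches(query_vec, rows, k=1):
    """Return the content of the k rows most similar to query_vec.

    rows is a list of (content, embedding) pairs.
    """
    ranked = sorted(rows, key=lambda r: cosine_similarity(query_vec, r[1]), reverse=True)
    return [content for content, _ in ranked[:k]]


# Toy 2-dimensional "embeddings" (made up for illustration).
rows = [("about whales", [1.0, 0.1]), ("about coffins", [0.1, 1.0])]
print(top_matches([0.2, 0.9], rows, k=1))
```

When both vectors already have unit length, the denominator is 1 and the similarity reduces to the dot product, which is why the query can pass `1, 1` as the norms if the stored embeddings are normalized.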

In [16]:
# This is the question we want to ask, converted to its embedding vector
question_embedding = calc_embeddings("Why does the coffin prepared for Queequeg become Ishmael's life buoy once the Pequod sinks?")
print('Embeddings: {}'.format(question_embedding))

Embeddings: [0.01388722391538789, -0.02785565915458321, 0.01125460704859909, -0.011153091962517278, -0.01700034452955324, 0.0024566581448270523, -0.020235282129429738, 0.012134402288222238, 0.0021724166488559975, 0.002757818511370902, 0.01796135202430634, 0.0334863457079393, 0.020993259452019217, 0.001238480487889455, -0.0024414307887825285, 0.0043008436754292155, 0.028776060987432597, 0.00841219302021106, 0.0006949533888357305, -0.015240754981629462, -0.0055393238140727255, 0.015389642719050595, 0.013602982419416834, -0.007153409119293172, -0.018286199182181113, 0.015592672891214217, 0.008452798682114777, -0.015633278553117934, 0.009007745832540635, 0.011999048529672316, 0.005251698823383868, -0.007945225114350763, -0.029263332655567274, -0.009224310914897988, -0.0313477684300433, -0.007985830776254478, 0.004629074793683049, -0.016729638875098436, 0.01312924659279841, -0.006033362471643008, -0.008303910013371123, -0.0027036774736121936, -0.018245593520277395, 0.003080974154717402, -0.

In [17]:
# Here we get the most relevant chunk, but it is a long passage rather than a concise answer
get_answer_from_adx("Why does the coffin prepared for Queequeg become Ishmael's life buoy once the Pequod sinks?",1)

'Leaning over in his hammock, Queequeg long regarded the coffin with an attentive eye. He then called for his harpoon, had the wooden stock drawn from it, and then had the iron part placed in the coffin along with one of the paddles of his boat. All by his own request, also, biscuits were then ranged round the sides within: a flask of fresh water was placed at the head, and a small bag of woody earth scraped up in the hold at the foot; and a piece of sail-cloth being rolled up for a pillow, Queequeg now entreated to be lifted into his final bed, that he might make trial of its comforts, if any it had. He lay without moving a few minutes, then told one to go to his bag and bring out his little god, Yojo. Then crossing his arms on his breast with Yojo between, he called for the coffin lid (hatch he called it) to be placed over him. The head part turned over with a leather hinge, and there lay Queequeg in his coffin with little but his composed countenance in view. “Rarmai” (it will do;'

In [27]:
from openai import AzureOpenAI

clientOpenAI = AzureOpenAI(
    azure_endpoint=OPENAI_DEPLOYMENT_ENDPOINT,
    api_key=OPENAI_API_KEY,
    api_version="2023-05-15",
)

def call_openAI(messages):
    # messages is a list of {"role": ..., "content": ...} dictionaries
    response = clientOpenAI.chat.completions.create(
        model=OPENAI_DEPLOYMENT_NAME,
        messages=messages,
        temperature=0.7,
        max_tokens=800,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
    )

    # Return only the text of the first completion
    return response.choices[0].message.content

In [28]:

question = "Why does the coffin prepared for Queequeg become Ishmael's life buoy once the Pequod sinks?"
retrieved_answer_from_adx = get_answer_from_adx(question,1)

prompt = 'Question: {}'.format(question) + '\n' + 'Information: {}'.format(retrieved_answer_from_adx)

# prepare prompt
messages = [{"role": "system", "content": "You are a HELPFUL assistant answering users questions. Answer the question using the provided information and do not add anything else."},
            {"role": "user", "content": prompt}]

result = call_openAI(messages)
display(HTML(result))

In [29]:
question = "Why does Ahab pursue Moby Dick?"
retrieved_answer_from_adx = get_answer_from_adx(question,1)

prompt = 'Question: {}'.format(question) + '\n' + 'Information: {}'.format(retrieved_answer_from_adx)

# prepare prompt
messages = [{"role": "system", "content": "You are a HELPFUL assistant answering users questions. Answer the question using the provided information and do not add anything else."},
            {"role": "user", "content": prompt}]

result = call_openAI(messages)
display(HTML(result))
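The two cells above build the same message list and differ only in the question. One way to avoid the repetition is a small helper like the sketch below; `build_messages` and `SYSTEM_PROMPT` are hypothetical names, and the prompt text is copied unchanged from the cells above.

```python
SYSTEM_PROMPT = (
    "You are a HELPFUL assistant answering users questions. "
    "Answer the question using the provided information and do not add anything else."
)


def build_messages(question, information):
    """Build the chat messages for a question plus the text retrieved from ADX."""
    prompt = "Question: {}\nInformation: {}".format(question, information)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]


# Usage with placeholder text standing in for the retrieved chunk.
messages = build_messages("Why does Ahab pursue Moby Dick?", "<retrieved chunk>")
print(messages[1]["content"])
```

In the notebook, the second argument would be the value returned by `get_answer_from_adx(question, 1)`, and the resulting list would be passed to `call_openAI`.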