# RAG with Elastic and Llama3 using LlamaIndex

This interactive notebook uses `LlamaIndex` to process fictional workplace documents. `Llama3`, running locally under `Ollama`, turns these documents into embeddings, which are stored in `Elasticsearch`. We then ask a question, retrieve the relevant documents from `Elasticsearch`, and use `Llama3` to generate a response.

**_Note_** : _Llama3 is expected to be running using `Ollama` on the same machine where you will be running this notebook._
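Before running the rest of the notebook, it can help to confirm that the local Ollama server is reachable. The sketch below is one way to do that, assuming Ollama listens on its default address `http://localhost:11434` and exposes its model list at `/api/tags`. The helper name `list_ollama_models` and the `opener` parameter are illustrative and not part of the original notebook.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def list_ollama_models(base_url="http://localhost:11434", timeout=5, opener=urlopen):
    """Return the model tags a local Ollama server reports, or None if it is unreachable."""
    try:
        # Assumption: the server answers GET /api/tags with {"models": [{"name": ...}, ...]}.
        with opener(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.loads(resp.read())
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure and similar network errors.
        return None
    return [model["name"] for model in payload.get("models", [])]
```

Calling `list_ollama_models()` returns `None` when nothing answers on that address; otherwise it returns the model tags, which should include the model you plan to use.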

## Requirements

For this example, you will need:

- An Elastic deployment
  - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook))
- A local LLM
  - We'll be using [Ollama](https://ollama.com/) and [Llama3](https://ollama.com/library/llama3), configured locally.

### Use Elastic Cloud

If you don't have an Elastic Cloud deployment, follow these steps to create one.

1. Go to [Elastic Cloud registration](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) and sign up for a free trial
2. Select **Create Deployment** and follow the instructions

## Install required dependencies for LlamaIndex and Elasticsearch

First we install the packages we need for this example.

In [2]:
!pip install llama-index llama-index-cli llama-index-core llama-index-embeddings-elasticsearch llama-index-embeddings-ollama llama-index-legacy llama-index-llms-ollama llama-index-readers-elasticsearch llama-index-readers-file llama-index-vector-stores-elasticsearch llamaindex-py-client






## Import packages
Imports are placed in the cells where they are first needed, rather than collected in a single cell here.

## Prompt user to provide Cloud ID and API Key
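We now prompt the user for the API key using `getpass`, so that the secret is not stored in the notebook. Both values are available from the deployment. The Cloud ID prompt is left commented out because the cells below connect to the cluster by URL.

In [None]:
from getpass import getpass


# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
# Uncomment when connecting through a Cloud ID instead of a URL.
# ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic API Key: ")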
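For non-interactive runs, a common pattern is to read the secret from an environment variable and fall back to a prompt only when it is missing. The sketch below illustrates that idea. The helper `read_secret` and the environment variable name used in the usage note are assumptions, not part of the original notebook.

```python
import os
from getpass import getpass


def read_secret(env_var, prompt, environ=os.environ, ask=getpass):
    """Return the secret from the environment if set, otherwise prompt for it."""
    value = environ.get(env_var, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return ask(prompt)
```

In the notebook this could be used as `ELASTIC_API_KEY = read_secret("ELASTIC_API_KEY", "Elastic API Key: ")`.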

## Prepare documents for chunking and ingestion
We now convert the data into [Document](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/) objects so that [LlamaIndex](https://docs.llamaindex.ai/en/stable/) can process them.

In [4]:
import json
from urllib.request import urlopen
from llama_index.core import Document

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json"

response = urlopen(url)
workplace_docs = json.loads(response.read())

# Building Document required by LlamaIndex.
documents = [
    Document(
        text=doc["content"],
        metadata={
            "name": doc["name"],
            "summary": doc["summary"],
            "rolePermissions": doc["rolePermissions"],
        },
    )
    for doc in workplace_docs
]
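The list comprehension above raises a `KeyError` if any record lacks one of the four keys it reads. The sketch below is one hedged way to check the records first. The helper `split_valid_records` and the two sample records are hypothetical and only mirror the shape of the keys used above.

```python
REQUIRED_KEYS = ("content", "name", "summary", "rolePermissions")


def split_valid_records(records, required_keys=REQUIRED_KEYS):
    """Separate records that have every required key from those that do not."""
    valid, invalid = [], []
    for record in records:
        missing = [key for key in required_keys if key not in record]
        if missing:
            # Remember which record failed and which keys it lacks.
            invalid.append((record.get("name", "<unnamed>"), missing))
        else:
            valid.append(record)
    return valid, invalid


# Hypothetical records for illustration only, not taken from the dataset.
sample = [
    {"content": "text", "name": "Doc A", "summary": "s", "rolePermissions": ["demo"]},
    {"content": "text", "name": "Doc B"},
]
valid, invalid = split_valid_records(sample)
print(len(valid), invalid)
```

In the notebook, `split_valid_records(workplace_docs)` could be called before building `documents`, so that incomplete records are reported instead of stopping the cell.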

## Define Elasticsearch and ingest pipeline in LlamaIndex for document processing. Use Llama3 for generating embeddings.
We now define the `ElasticsearchStore` with the index name, the text field, and the field that holds the associated embeddings. We use `Llama3` to generate the embeddings. We will run semantic search on the index to find documents relevant to the user's query. We use the `SentenceSplitter` provided by `LlamaIndex` to chunk the documents. All of this runs as part of an `IngestionPipeline` provided by the `LlamaIndex` framework.

In [5]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.elasticsearch import ElasticsearchStore
from elasticsearch import AsyncElasticsearch

# Async client for a cluster reached directly by URL.
# verify_certs=False skips TLS certificate verification; use it only for local testing.
es = AsyncElasticsearch(
    "https://192.168.123.151:9200",
    api_key=ELASTIC_API_KEY,
    verify_certs=False,
    request_timeout=60,
)

es_vector_store = ElasticsearchStore(
    index_name="workplace_index",
    vector_field="content_vector",
    text_field="content",
    # On Elastic Cloud, pass es_cloud_id=ELASTIC_CLOUD_ID and
    # es_api_key=ELASTIC_API_KEY here instead of es_client.
    es_client=es,
)




In [6]:

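# Check the Elasticsearch status with a synchronous client
from elasticsearch import Elasticsearch

# A separate name keeps the AsyncElasticsearch client `es` defined above intact.
es_sync = Elasticsearch("https://192.168.123.151:9200", api_key=ELASTIC_API_KEY, verify_certs=False)

print(es_sync.info())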


{'name': 'dongkook-es1', 'cluster_name': 'elastic8-dongkook', 'cluster_uuid': 'kVeFudZZTxWZ1VqUzNg4xA', 'version': {'number': '9.0.0', 'build_flavor': 'default', 'build_type': 'rpm', 'build_hash': '112859b85d50de2a7e63f73c8fc70b99eea24291', 'build_date': '2025-04-08T15:13:46.049795831Z', 'build_snapshot': False, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.18.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'}




In [7]:
# Embedding model for local embedding through Ollama.
# Any model served by Ollama can be named here; this run uses the exaone-deep:latest tag.
ollama_embedding = OllamaEmbedding("exaone-deep:latest")
# LlamaIndex pipeline configured to take care of chunking, embedding,
# and storing the embeddings in the vector store.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=100),  # try adjusting these values and re-running
        ollama_embedding,
    ],
    vector_store=es_vector_store,
)
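To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a list of words into overlapping windows. It is a deliberately simplified illustration: `SentenceSplitter` counts tokens and prefers sentence boundaries, while this hypothetical `sliding_chunks` helper counts words and ignores sentences.

```python
def sliding_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in sliding_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, so text that falls on a chunk boundary appears in both neighbouring chunks. A larger overlap preserves more context across boundaries at the cost of more chunks to embed and store.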

## Execute pipeline 
This will chunk the data, generate embeddings using `Llama3`, and ingest the chunks into the `Elasticsearch` index, with the embeddings stored in a `dense_vector` field.

In [8]:
pipeline.run(show_progress=True, documents=documents)

The embeddings are stored in a dense vector field of dimension `4096`. The dimension equals the length of the embedding vectors that `Llama3` generates, so a different embedding model will generally produce a different dimension.
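One way to confirm the dimension for a given model is to embed a short probe string and measure the length of the resulting vector. The sketch below assumes only that the embedder is a callable that maps a string to a list of floats. `embedding_dimension` and `fake_embed` are hypothetical names used for illustration.

```python
def embedding_dimension(embed_fn, probe_text="dimension probe"):
    """Return the length of the vector that embed_fn produces for a short probe string."""
    vector = embed_fn(probe_text)
    return len(vector)


def fake_embed(text):
    # Stand-in embedder for illustration only: always returns a 4-dimensional vector.
    return [0.0, 0.0, 0.0, float(len(text))]


print(embedding_dimension(fake_embed))
```

In the notebook, `embedding_dimension(ollama_embedding.get_text_embedding)` should report the dimension used for the `content_vector` field, provided the Ollama server is reachable.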

## Define LLM settings. 
This connects to your local LLM. Please refer to https://ollama.com/library/llama3 for details on steps to run Llama3 locally. 

_If you have sufficient resources (at least 64 GB of RAM and a GPU), you could try the 70B parameter version of Llama3._

In [9]:
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings

Settings.embed_model = ollama_embedding
local_llm = Ollama(
    model="exaone-deep:latest",
    request_timeout=600
)

### Set up semantic search and integrate with Llama3
We now configure `Elasticsearch` as the vector store for the `LlamaIndex` query engine. The query engine then uses `Llama3` to answer your questions with contextually relevant data from `Elasticsearch`.
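Conceptually, the retrieval step ranks stored chunk vectors by their similarity to the query vector and keeps the best `k`. The toy sketch below shows that idea with exact cosine similarity over hypothetical 3-dimensional vectors. It assumes cosine similarity as the metric and does not reflect how Elasticsearch searches internally, which typically relies on an approximate nearest-neighbour index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Return the names of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in chunk_vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical vectors for illustration only; real embeddings have thousands of dimensions.
chunks = {
    "sales strategy": [0.9, 0.1, 0.0],
    "vacation policy": [0.0, 0.2, 0.9],
    "sales overview": [0.7, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

The `similarity_top_k` argument in the next cell plays the role of `k` here: it controls how many retrieved chunks are passed to the LLM as context.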

In [10]:
from llama_index.core import VectorStoreIndex, QueryBundle

index = VectorStoreIndex.from_vector_store(es_vector_store)
query_engine = index.as_query_engine(llm=local_llm, similarity_top_k=5)

# Customer query
query = "Based on the document, what are the organization's sales goals?"
# Embed the query explicitly so retrieval uses the same embedding model as ingestion.
bundle = QueryBundle(
    query_str=query, embedding=Settings.embed_model.get_query_embedding(query=query)
)

response = query_engine.query(bundle)

print(response.response)

<thought>

Okay, let me try to figure this out. The user is asking about the organization's sales goals based on the provided context. I need to look through the context information given and extract the relevant details.

First, I'll scan through each document in the context. The first document is the "Sales Organization Overview" which mentions the structure of the sales regions and their roles. However, the sales goals aren't mentioned here. The next document is the "Fy2024 Company Sales Strategy" which seems more promising. Let me read that carefully.

The summary states that the primary goal is to increase revenue, expand market share, and strengthen customer relationships in target markets. Then under the sections C, D, and IV, there are more details. The C. Partner Ecosystem section talks about strengthening partnerships and expanding market reach. The D. Customer Success section mentions improving retention and satisfaction. The IV. Monitoring and Evaluation section sets KPIs a

_You could now try experimenting with other questions._

In [None]:
# Requires the deepeval package and an OpenAI API key for the judge model.
import os

from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

os.environ["OPENAI_API_KEY"] = "-"  # placeholder: replace with your own key

contextual_precision = ContextualPrecisionMetric(model="gpt-3.5-turbo")
contextual_recall = ContextualRecallMetric(model="gpt-3.5-turbo")
contextual_relevancy = ContextualRelevancyMetric(model="gpt-3.5-turbo")

# Use a distinct name to avoid shadowing the built-in input().
test_input = query
expected_output = "Increase revenue by 20% compared to fiscal year 2023, expand market share in segments by 15%, retain 95% of existing customers, and increase customer satisfaction ratings."
# Text of the chunks that the query engine retrieved from Elasticsearch.
retrieval_context = [node.text for node in response.source_nodes]

test_case = LLMTestCase(
    input=test_input,
    actual_output=response.response,
    retrieval_context=retrieval_context,
    expected_output=expected_output,
)

evaluate(
    [test_case],
    metrics=[contextual_recall, contextual_precision, contextual_relevancy],
)

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:08,  8.96s/test case]



Metrics Summary

  - ❌ Contextual Recall (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the expected output does not align with any parts of the retrieval context., error: None)
  - ✅ Contextual Precision (score: 0.5, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.50 because the relevant nodes, such as the second node which outlines the sales strategy for fiscal year 2024, are ranked higher than irrelevant nodes like the first, third, fourth, and fifth nodes., error: None)
  - ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the retrieval context does not contain any statements relevant to the input, focusing instead on various aspects of the organization's structure, objectives, and activities that do not directly address the sales goals., error: None)

For test case:

  - input: Based on th




EvaluationResult(test_results=[TestResult(name='test_case_0', success=False, metrics_data=[MetricData(name='Contextual Recall', threshold=0.5, success=False, score=0.0, reason='The score is 0.00 because the expected output does not align with any parts of the retrieval context.', strict_mode=False, evaluation_model='gpt-3.5-turbo', error=None, evaluation_cost=0.0014165, verbose_logs='Verdicts:\n[\n    {\n        "verdict": "no",\n        "reason": "The sentence does not align with any parts of the retrieval context."\n    },\n    {\n        "verdict": "no",\n        "reason": "The sentence does not align with any parts of the retrieval context."\n    },\n    {\n        "verdict": "no",\n        "reason": "The sentence does not align with any parts of the retrieval context."\n    },\n    {\n        "verdict": "no",\n        "reason": "The sentence does not align with any parts of the retrieval context."\n    }\n]'), MetricData(name='Contextual Precision', threshold=0.5, success=True, sc
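The verbose log for Contextual Recall lists one verdict per statement in the expected output. Assuming the score is the fraction of statements judged attributable to the retrieval context (an assumption about the metric, not a reading of the deepeval source), the reported `0.0` follows directly from four "no" verdicts. The helper `recall_from_verdicts` below is hypothetical.

```python
def recall_from_verdicts(verdicts):
    """Fraction of verdicts that are 'yes'; 0.0 when there are no verdicts."""
    if not verdicts:
        return 0.0
    attributable = sum(1 for verdict in verdicts if verdict.strip().lower() == "yes")
    return attributable / len(verdicts)


# The four verdicts shown in the verbose log above were all "no".
print(recall_from_verdicts(["no", "no", "no", "no"]))
```

Under the same assumption, two "yes" verdicts out of four would give `0.5`, which would just meet the threshold shown in the summary.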