# Use Case Demonstration of EVE
In this notebook, we explore practical applications of EVE (Earth Virtual Explorer), a large language model specialized in Earth Observation (EO). EVE is designed to understand, analyze, and generate text related to EO data and topics, making it a valuable tool for researchers, analysts, and decision-makers in the field.

We will demonstrate two key use cases of EVE:

- **Summarization**: In this task, EVE is given a document related to Earth Observation and asked to generate a concise and informative summary. This is useful for quickly understanding lengthy reports, scientific papers, or satellite data documentation.

- **Question Answering (Q&A)**: In this use case, we enhance EVE’s performance by integrating it with a retrieval system, using a technique known as Retrieval-Augmented Generation (RAG). Here, the model first retrieves relevant context from a knowledge base before answering questions, leading to more accurate and grounded responses.

These examples highlight EVE’s potential to streamline information processing and support decision-making in Earth Observation workflows.

In [None]:
# Install the required libraries
!pip3 install -q -U bitsandbytes
!pip3 install -q datasets
!pip3 install -q langchain_community
!pip3 install -q pypdf
!pip3 install -q sentence-transformers
!pip3 install -q faiss-cpu
!pip3 install -q langchain_huggingface
!pip3 install -q langchain_runpod
!pip3 install -q qdrant_client
!pip3 install -q langchain_aws
!pip3 install numpy==1.26.4


## Summarization

In [None]:
from datasets import load_dataset
import random
# Load documents
docs = load_dataset('eve-esa/eve-cpt-sample-v0.2')['train']

# Select a random doc from the dataset
doc = docs.select(random.sample(range(len(docs)), 1))[0]

In [None]:
from langchain_aws import BedrockLLM

# Load the model
llm = BedrockLLM(model_id='arn:aws:bedrock:us-west-2:637423382292:imported-model/7i06g1utels3', region_name='us-west-2', provider='meta')

In [None]:
# Prompt message

message = f"""
<|system|>
You are an helpful assistant expert in Earth Observation, help the user with his tasks.
<|end|>
<|user|>
Summarize the following document focusing on the main concepts and ideas.

The document starts
{doc['text']}

<|end|>
<|assistant|>
"""


In [None]:
# Run this cell to show the final prompt given to the model
print(message)

In [None]:
output = llm.invoke(message)

In [None]:
output

# Retrieval Augmented Generation (RAG) with LangChain

In this notebook, we implement a complete RAG pipeline for answering questions based on a given context. Using the LangChain library, we'll walk through the entire process—from retrieving relevant context to generating accurate answers.


**Roadmap**

1. **Indexing**: Organize the raw documents into a structured format suitable for processing, such as splitting them into chunks or passages for more efficient retrieval.

2. **Embedding**: Convert each text chunk into a dense vector representation using a pre-trained embedding model. These embeddings capture the semantic meaning of the content.

3. **Vector Store**: Store the embeddings in a vector database (Qdrant in our case), allowing fast and scalable similarity search across the document collection.

4. **Retrieval and Generation**: Given a user query, retrieve the most relevant document chunks from the vector store and feed them into a language model (EVE) to generate a context-aware, accurate response.

## Load dataset of Q&A
Let's load our dataset of Q&A pairs about EO. Each sample is composed of a question and an answer.

In [None]:
from datasets import load_dataset

qa = load_dataset('eve-esa/eve-is-open-ended')['train']

idx = 120

question = qa[idx]['question']
answer = qa[idx]['answer']


print('Question: ', question)
print('Answer: ', answer)

## Indexing

The first part of a RAG pipeline is called **indexing**. This is the process of ingesting data from a source and indexing it. The indexing process is composed of three steps:
- **Load**: process and load the data in text format.
- **Split**: break large documents into smaller chunks. This helps both indexing and generation, since large chunks are harder to search over and may not fit in a model's finite context window.
- **Store**: we need somewhere to store and index our splits, so that they can be searched over later. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) and [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.

Once the indexing step is done, our knowledge base of scientific papers is indexed and ready to be used as context in the generation step.
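To make the **Split** step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the splitter used to build the pre-indexed collection in this notebook; the function name `split_text` and the parameters `chunk_size` and `overlap` are illustrative assumptions, and production pipelines usually split on sentence or paragraph boundaries rather than raw character counts.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most `chunk_size` characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighbouring chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text, repeated to make it long enough to split
sample = "Sentinel-2 carries a multispectral instrument. " * 10
for i, chunk in enumerate(split_text(sample, chunk_size=120, overlap=20)):
    print(i, len(chunk), repr(chunk[:40]))
```

The overlap trades a little storage for robustness: without it, a fact that straddles two chunks could be missed by the retriever.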


<div>
<img src="https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png" width="800"/>
</div>


### Embeddings

An embeddings model in Retrieval-Augmented Generation (RAG) is a neural network that converts text into dense vector representations (embeddings) in a **high-dimensional space**. These models take text as input and produce a fixed-length array of numbers, a numerical fingerprint of the text's semantic meaning. Embeddings allow a search system to find relevant documents based not just on keyword matches, but on semantic understanding.

Embeddings models are trained on large text corpora using unsupervised learning techniques. They learn to encode the semantic meaning of words, sentences, and documents in a way that captures relationships between them. For example, embeddings models can learn that "cat" and "dog" are similar because they are both animals, or that "apple" and "orange" are similar because they are both fruits.
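As a rough illustration of what "similar" means in embedding space, the sketch below compares hand-made toy vectors with cosine similarity. The three-dimensional values are invented for this example and do not come from any real model; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only)
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_apple = cosine_similarity(toy_embeddings["cat"], toy_embeddings["apple"])

print(f"cat vs dog:   {sim_cat_dog:.3f}")
print(f"cat vs apple: {sim_cat_apple:.3f}")
```

With these toy values, the two animals score higher than the animal and fruit pair, which is the behaviour a trained embedding model is expected to exhibit on real text.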

There are many pre-trained embedding models available, each suited to different types of data and use cases. For our application, we use Indus, a fine-tuned encoder-only transformer model trained specifically on scientific journals and articles related to NASA’s Science Mission Directorate (SMD).

Choosing the right embedding model is a critical step in building an effective retrieval system. Ideally, the embedding model should be trained—or at least fine-tuned—on data similar to the target documents. Since our corpus consists of scientific texts focused on Earth Observation, Indus is a better fit than a general-purpose model, as it captures domain-specific terminology and semantics more accurately.

<img src="https://weaviate.io/assets/images/embedding-models-0c04d93c0be28dd63a0e8781c4e8685d.jpg" width='800px'>




In [None]:
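# HuggingFaceEmbeddings now lives in langchain_huggingface (installed above);
# the langchain_community import is deprecated
from langchain_huggingface import HuggingFaceEmbeddings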

# Load the embeddings model
model_name = "nasa-impact/nasa-smd-ibm-st-v2"
encode_kwargs = {"normalize_embeddings": True}
indus_embd = HuggingFaceEmbeddings(
    model_name=model_name,  encode_kwargs=encode_kwargs
)

### Vector Store

Vector stores are specialized databases designed to efficiently index and retrieve information using vector representations of data. Vector stores leverage the dense representation by reducing the task of finding similar documents to a search in a high-dimensional space. This search compares the vector representation of the **query** with the vector representations of the **documents** in the database. The documents whose vectors are closest to the query vector are considered the most similar to the query.

To wrap up, the retrieval process consists of the following steps:
- **Documents embedding**
- **Store the embeddings in a VectorStore**
- **Query embedding**
- **Retrieve** the most similar documents to the query


The most popular and simplest setup uses **cosine similarity** to compare the vectors and retrieves the **top k** most similar ones.
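The four steps above can be sketched with a tiny in-memory store. This is only an illustration of the idea, not how Qdrant works internally (real vector databases use approximate nearest-neighbour indexes to scale); the class name `TinyVectorStore`, the toy vectors, and the text snippets are all assumptions made for this example. Because every vector is normalized to unit length, the dot product equals the cosine similarity.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class TinyVectorStore:
    """Brute-force vector store: keeps everything in a list and scans it on search."""

    def __init__(self):
        self._items = []  # list of (unit_vector, text) pairs

    def add(self, vector, text):
        self._items.append((normalize(vector), text))

    def search(self, query_vector, k=2):
        query = normalize(query_vector)
        scored = (
            (sum(q * v for q, v in zip(query, vec)), text)
            for vec, text in self._items
        )
        # Keep only the k highest-scoring (score, text) pairs, best first
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "Sentinel-1 SAR imaging")
store.add([0.8, 0.3, 0.1], "Radar backscatter over sea ice")
store.add([0.0, 0.2, 0.9], "Ocean colour and chlorophyll")

results = store.search([1.0, 0.2, 0.0], k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```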


<div>
<img src="https://python.langchain.com/assets/images/vectorstores-2540b4bc355b966c99b0f02cfdddb273.png" width="800"/>
</div>


## Connect to Qdrant

To save time, the embedding and indexing of documents have already been completed prior to this notebook. These steps can be computationally intensive, so we’ve pre-processed the data to streamline the workflow.

We are using Qdrant as our vector store, which has been preloaded with all the relevant documents needed for retrieval. In this section, we will connect to the Qdrant instance and select the specific collection that contains our indexed data. This will enable us to perform efficient semantic searches and support the Retrieval-Augmented Generation (RAG) process used in our Q&A tasks.

In [None]:
# Example retrieval pipeline using the embedding function and the Qdrant API
from qdrant_client import QdrantClient

qdrant_url = 'https://e186510c-4dd9-45c7-99a5-ae38c4c8bc36.us-east-1-0.aws.cloud.qdrant.io:6333'
api_key = 'rZYblMkzsiqiiuPqxXxmckfyMFIZ9Yg9EpxYxhbeFZj82MEOIbT5Fg'

# Establish a connection with the vector store
client = QdrantClient(
    url=qdrant_url,
    api_key=api_key
)

# Embed the query
query_emb = indus_embd.embed_query(question)

# Perform similarity search using the computed embeddings
search_result = client.search(
    collection_name="indus-test",
    query_vector=query_emb,
    limit=1,
)

data = search_result[0].payload
# Payload containing metadata and text
for key, value in data['metadata'].items():
  print(f'{key}: {value}')

print('Retrieved chunk:\n', data['page_content'])

In [None]:
from typing import List

from langchain_core.documents import Document
from qdrant_client import QdrantClient

# Let's define our retriever class to have a nice interface
class RunpodRetriever():
    def __init__(self, embedding, collection_name='indus-test', k: int = 3):
        self._client = QdrantClient(url="https://e186510c-4dd9-45c7-99a5-ae38c4c8bc36.us-east-1-0.aws.cloud.qdrant.io:6333",
        api_key="rZYblMkzsiqiiuPqxXxmckfyMFIZ9Yg9EpxYxhbeFZj82MEOIbT5Fg")
        self.embedding = embedding
        self.collection_name = collection_name
        self.k = k


    def get_relevant_documents(self, query: str) -> List[Document]:
        query_emb = self.embedding.embed_query(query)

        search_result = self._client.search(
            collection_name=self.collection_name,
            query_vector=query_emb,
            limit=self.k,
        )

        docs = []
        for hit in search_result:
            # Adjust based on your actual data structure
            data = hit.payload
            content = data.get("page_content", "")
            metadata = data.get("metadata", {})
            docs.append(Document(page_content=content, metadata=metadata))

        return docs


In [None]:
# Let's define our retriever
retriever = RunpodRetriever(indus_embd, k=3)

# Format retrieved documents:
def format_docs(docs):
  doc_str = ''
  for i, doc in enumerate(docs):
    doc_str += f'Document n. {i+1}\n'
    doc_str += f'TITLE: {doc.metadata.get("source_name", "No title")}\n' # Add the paper's title
    doc_str += f'URL: {doc.metadata.get("source", "No url")}\n\n' # Add URL of the paper
    doc_str += f'{doc.page_content}\n\n'
  return doc_str



print('Question: ')
print(question)
print()
docs = retriever.get_relevant_documents(question)
print(format_docs(docs))

## Retrieval and generation

The pipeline consists of the following key components:

- Retriever: This component queries the vector store (Qdrant) to fetch the most relevant document chunks based on the user’s question. It performs a semantic search using the pre-computed embeddings to find contextually similar content.

- LLM (Large Language Model): Once the relevant context is retrieved, it is passed to EVE, our Earth Observation-specialized language model. EVE then generates a coherent and informed response based on both the query and the retrieved context.

This approach ensures that the generated answers are grounded in the source documents, improving accuracy and reducing hallucination.

### Prompt

First, we define the prompt using three different templates:
- **SystemMessagePromptTemplate**: the system message gives the model guidelines on how to interact with the user and interpret the conversation.
- **AIMessagePromptTemplate**: the AI message represents a message generated by the model.
- **HumanMessagePromptTemplate**: the human message represents the message sent by the user.

The code below shows the structure of the prompt and the LangChain templates used to build it. In particular, the prompt contains some **special tokens**:
- **<|system|> | <|user|> | <|assistant|>**: special tokens that tell the model who each message belongs to.
- **<|end|>**: a special token that marks the end of a message.
- **{message} | {context} | {question}**: placeholders that are replaced with the actual message, context, and question.
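To see how the special tokens and placeholders fit together without any library, here is a minimal sketch that assembles the same prompt layout with plain `str.format`. The constant names, the `build_prompt` helper, and the sample system message, context, and question are assumptions made for illustration; the notebook itself builds the prompt with LangChain templates.

```python
# Plain-string versions of the templates, using the same special tokens
SYSTEM_TEMPLATE = "<|system|>\n{message}\n<|end|>\n"
USER_TEMPLATE = (
    "<|user|>\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
    "<|end|>\n"
    "<|assistant|>\n"  # left open so the model writes the assistant turn
)


def build_prompt(system_message: str, context: str, question: str) -> str:
    """Fill the placeholders and concatenate the system and user turns."""
    return (
        SYSTEM_TEMPLATE.format(message=system_message)
        + USER_TEMPLATE.format(context=context, question=question)
    )


prompt = build_prompt(
    system_message="Answer using only the provided context.",
    context="Sentinel-2 has 13 spectral bands.",
    question="How many spectral bands does Sentinel-2 have?",
)
print(prompt)
```

Note that the prompt ends with an open `<|assistant|>` tag: the model continues from there, and its reply is expected to end with `<|end|>`.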

In [None]:
# Customize the prompt templates
from langchain_core.prompts import SystemMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate

template= '''<|system|>
{message}
<|end|>
'''

# A human message will contain the question and the context. The context will be automatically added by the retriever.
human_template = '''<|user|>
Context: {context}

Question is below:

Question: {question}

<|end|>
<|assistant|>
'''


assistant_template = '''<|assistant|>
{message}
<|end|>
'''

# Define the templates
SystemMessageTemplate = SystemMessagePromptTemplate.from_template(template)
HumanMessageTemplate = HumanMessagePromptTemplate.from_template(human_template)
AIMessageTemplate = AIMessagePromptTemplate.from_template(assistant_template)


In [None]:
# Define the system message

system_message = '''You are an expert assistant that answers questions about different topics.

You are given some extracted parts from science papers along with a question.

If you don't know the answer, just say "I don't know." Don't try to make up an answer.

Use only the following pieces of context to answer the question at the end.

Do not use any prior knowledge.'''


system_msg = SystemMessageTemplate.format(message=system_message)

system_msg

Now that we have defined the different templates, we can define the chat prompt. LangChain requires a specific structure for the chat prompt: a list of messages. In the code below, our chat template is composed of two messages, the **system message** and the **human message**, which contains the input from the user.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    messages=[
        system_msg,            # already-formatted system message
        HumanMessageTemplate,  # human template with {context} and {question}
    ]
)

# As we can see, our prompt is expecting two variables to be filled
print(chat_template)

### Model initialization


In [None]:
from langchain_aws import BedrockLLM

llm = BedrockLLM(model_id='arn:aws:bedrock:us-west-2:637423382292:imported-model/7i06g1utels3', region_name='us-west-2', provider='meta')

### Langchain pipelines

LangChain pipelines are a powerful tool for assembling and coordinating different components. Our pipeline will look something like this:

$$\text{user query} \rightarrow \text{retriever} \rightarrow \text{chat prompt} \rightarrow \text{LLM} \rightarrow \text{answer}$$

In LangChain, we use the chain operator '|' to assemble our components in series. The chain operator is part of the **LangChain Expression Language** (LCEL), a declarative way to build pipelines. In LCEL, the output of whatever is on the left of '|' becomes the input of whatever is on the right.

Let's build our first pipeline to understand how it works. In the sample pipeline below, we dynamically create a dictionary that is passed as input to our chat template (N.B. as we saw above, our chat template takes two input variables, `context` and `question`).

From the code we can see that the context value is created by taking the question (from the input dict given to the chain) and passing it to our retriever. The retriever's output is then formatted by the `format_docs` function.
The question, instead, is passed through unchanged.


A chain is called with the **invoke** method, which takes as its argument a dictionary representing the input of the first element of the pipeline.
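To build intuition for how '|' composes steps, here is a toy re-implementation of the idea in plain Python. It is not LangChain's actual implementation: the `Step` class, the `as_step` helper, and the fake retriever are hypothetical names used only for this sketch. It mimics two behaviours used in the pipeline below: chaining steps in series, and turning a dictionary into a step that runs every branch on the same input.

```python
from operator import itemgetter


class Step:
    """A tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # self | other: run self first, then feed its output to other
        nxt = as_step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # other | self, where other is a plain callable or a dict
        prev = as_step(other)
        return Step(lambda value: self.invoke(prev.invoke(value)))


def as_step(obj):
    """Coerce callables and dicts into Step objects."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: as_step(val) for key, val in obj.items()}
        # Every branch receives the same input; results are collected by key
        return Step(lambda value: {k: s.invoke(value) for k, s in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"Cannot use {type(obj).__name__} as a pipeline step")


# A fake retriever that returns one made-up chunk per query
fake_retriever = Step(lambda query: [f"chunk about {query}"])

chain = (
    {
        "question": itemgetter("question"),
        "context": itemgetter("question") | fake_retriever | "\n".join,
    }
    | Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
)

result = chain.invoke({"question": "SAR"})
print(result)
```

The dictionary at the start of the chain plays the same role as in the real pipeline: it fans the input out to several branches and hands the collected results to the next step.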



In [None]:
from operator import itemgetter
from langchain_core.runnables import RunnableLambda


# Build the pipeline
rag_chain_from_docs = (
    {
        "question": itemgetter('question'),
        "context": itemgetter('question') | RunnableLambda(retriever.get_relevant_documents) | format_docs,
    }
    | RunnableLambda(lambda inputs: {
        **inputs,
        "prompt": chat_template.invoke(inputs)  # Add the rendered prompt explicitly
    })
    | {
        "model_out": itemgetter("prompt") | llm,
        "prompt": itemgetter("prompt"),
    }
)

output = rag_chain_from_docs.invoke({"question": question})

In [None]:
# Print the prompt
print(output['prompt'].to_string())
print()
# Print the model output
print(output['model_out'])