# Use cases of EVE


In [4]:
# Install the required libraries
!pip3 install -U bitsandbytes
!pip3 install datasets
!pip3 install langchain_community
!pip3 install pypdf
!pip3 install sentence-transformers
!pip3 install faiss-cpu
!pip3 install langchain_huggingface

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m487.4/487.4 kB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.

## Summarization

In [1]:
from datasets import  load_dataset
import random
# Load documents
docs = load_dataset('eve-esa/eve-cpt-sample-v0.2')['train']

# Select a random doc from the dataset
doc = docs.select(random.sample(range(len(docs)), 1))

In [2]:
#from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

model_id = "eve-esa/eve-sft-instruct-qa"
# Load the tokenizer

# Quantize the model
# Quantize the model
quantization_config = {'load_in_4bit': True, 'bnb_4bit_compute_dtype':"float16"}  # Can be "float16", "bfloat16", etc.}

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=quantization_config)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [6]:
messages = [
    {"role": "system", "content": "You are an helpful assistant expert in Earth Observation."},
    {"role": "user", "content": f"Given the following document create an exhaustive summary.\nDocument:\n\n{doc['text']}"},
]

In [7]:
pipeline_gen = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device_map="auto"
)

Device set to use cuda:0


In [8]:
pipeline_gen(messages)

[{'generated_text': [{'role': 'system',
    'content': 'You are an helpful assistant expert in Earth Observation.'},
   {'role': 'user',
    'content': 'Given the following document create an exhaustive summary.\nDocument:\n\n'},
   {'role': 'assistant',
    'content': 'Earth Observation is the study of the physical and human environments and their related interactions. It is an interdisciplinary'}]}]

# Retrieval Augmented Generation (RAG) with Langchain

In this notebook we will implement a full RAG pipeline for answering questions based on a given context. We will use the Langchain library to implement the pipeline. The data used in this notebook is a collection of [science papers](https://huggingface.co/datasets/loukritia/science-journal-for-kids-data) that we will use to answer questions. We will see the whole process from obtaining the data needed, ingest them and creating the pipeline.

**Roadmap**:
1. **Scraping**: the data we will be obtained by scraping the pdfs from the urls provided in the dataset.
2. **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.
3. **Retrieval and generation**: a langchain pipeline will be initialized with the necessary components.

## Requirements

Installation of libraries and packages required for the notebook.



In [3]:
from google.colab import userdata
import os

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [5]:
!env

SHELL=/bin/bash
NV_LIBCUBLAS_VERSION=12.5.3.2-1
NVIDIA_VISIBLE_DEVICES=all
COLAB_JUPYTER_TRANSPORT=ipc
NV_NVML_DEV_VERSION=12.5.82-1
NV_CUDNN_PACKAGE_NAME=libcudnn9-cuda-12
CGROUP_MEMORY_EVENTS=/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.22.3-1+cuda12.5
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.22.3-1
VM_GCE_METADATA_HOST=169.254.169.253
HOSTNAME=c0ac44894ddd
LANGUAGE=en_US
TBE_RUNTIME_ADDR=172.28.0.1:8011
COLAB_TPU_1VM=
GCE_METADATA_TIMEOUT=3
NVIDIA_REQUIRE_CUDA=cuda>=12.5 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,dr

## Load dataset of Q&A


In [4]:
from datasets import load_dataset

qa = load_dataset('eve-esa/eve-is-open-ended')['train']

qa

Dataset({
    features: ['question', 'answer'],
    num_rows: 313
})

## Indexing

The first part of a RAG pipeline is called **indexing**. This is the process of ingesting data from a source and indexing it. This usually happens offline. In this case we will index the pdfs we downloaded in the previous step. The indexing process is composed of three steps:
- **Load**: process and load data in text format.
- **Split**: this is useful both for indexing data and passing it into a model, as large chunks are harder to search over and won't fit in a model's finite context window.
- **Store**: we need somewhere to store and index our splits, so that they can be searched over later. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) and [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.
Once the Indexing step is done we will have our knowledge base made of scientific papers indexed and ready to be used in the generation steps as context.


<div>
<img src="https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png" width="800"/>
</div>


### Embeddings

An embeddings model in Retrieval-Augmented Generation (RAG) is a neural network that converts text into dense vector representations (embeddings) in a **high-dimensional space**. These models take text as input and produce a fixed-length array of numbers, a numerical fingerprint of the text's semantic meaning. Embeddings allow search system to find relevant documents not just based on keyword matches, but on semantic understanding.

Embeddings models are trained on large text corpora using unsupervised learning techniques. They learn to encode the semantic meaning of words, sentences, and documents in a way that captures relationships between them. For example, embeddings models can learn that "cat" and "dog" are similar because they are both animals, or that "apple" and "orange" are similar because they are both fruits.

There exists several pre-trained embeddings models that can be used according to the use case. For our use case we will use the [bge-small-en](https://huggingface.co/BAAI/bge-small-en) a small-scale English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) as part of their FlagEmbedding project.

<img src="https://weaviate.io/assets/images/embedding-models-0c04d93c0be28dd63a0e8781c4e8685d.jpg" width='800px'>




In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load the embeddings model
model_name = "nasa-impact/nasa-smd-ibm-st-v2"
#model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}
indus_embd = HuggingFaceEmbeddings(
    model_name=model_name,  encode_kwargs=encode_kwargs
)

  indus_embd = HuggingFaceEmbeddings(


### VectorStore

Vector stores are specialized databases designed to efficiently index and retrieve information using vector representations of data. VectorStores leverages the dense representation by reducing the task of finding similar documents to a search in a high-dimensional space. This search is made by comparing the vector representation of the **query** with the vector representation of the **documents** in the database. The documents that are closer to the query vector are considered more similar to the query.

Wrapping up the retrieval process is composed of:
- **Documents embedding**
- **Store the embeddings in a VectorStore**
- **Query embedding**
- **Retrieve** the most similar documents to the query


The most popular and simple setup is using the **cosine similarity** to compare the vectors and retrieve the **top k** most similar ones


<div>
<img src="https://python.langchain.com/assets/images/vectorstores-2540b4bc355b966c99b0f02cfdddb273.png" width="800"/>
</div>




Langchain supports several VectorStore, we will use the [FAISS](https://github.com/facebookresearch/faiss) implementation.


In [24]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import FAISS
from datasets import load_dataset
import random

# Load documents
rag_docs = load_dataset('eve-esa/eve-cpt-sample-v0.2')['train']

# Select a random sample of 100 docs
rag_docs = rag_docs.select(random.sample(range(len(rag_docs)), 100))

# Initialize the text splitter class
text_splitter  = RecursiveCharacterTextSplitter(chunk_size=1800, chunk_overlap=0)

docs = []
# Perform loading and splitting of the documents
for doc in rag_docs:
    splits = text_splitter.split_text(doc['text'])
    docs.extend(splits)

# Initialize the VectorStore by passing the documents and the embeddings model
vectore_store = FAISS.from_texts(docs, indus_embd)

# Initialize the retriever defining the search type and the amount of similar documents to retrieve
retriever = vectore_store.as_retriever(search_type = "similarity", search_kwargs = {"k": 3})

In [7]:
docs


['Atmos. Chem. Phys., 10, 3261-3272, 2010 www.atmos-chem-phys.net/10/3261/2010/\n\nO Author(s) 2010. This work is distributed under\n\nthe Creative Commons Attribution 3.0 License.\n\n## 1 Introduction\n\nEnvironmental concerns over aviation emissions on the current and projected climate change have increased as air traffic and aviation industry continue to grow (Wuebbles, 2006). Among aviation emissions, condensation trails (contrails) formed from water vapor emissions behind aircraft engines has gained greater attention during recent years. Contrails are formed under certain thermodynamic constraints (Schmidt, 1941; Appleman, 1953; Schumann, 2005) and can evolve into cirrus clouds under favorable conditions (Minnis et al., 2004). Both contrails and contrail-induced cirrus clouds could play important roles in the global climate change via affecting the radiation budget. The Intergovernmental Panel on Climate Change (IPCC) identified contrails as the most uncertain components of the av

In [18]:
# Let's try our retriever with a question:
def print_docs(docs):
  for i, doc in enumerate(docs):
    print('Document n. ', i+1)
    print(doc.page_content)
    print()
question = qa[0]['question']
print('Question: ')
print(question)
print()
docs = retriever.invoke("What is the role of microbes in fruit flies?")
print_docs(docs)

Question: 
What are the impacts of a warming climate on the cryosphere?

Document n.  1
* [43] Yelvington, P. E., Herndon, S. C., Wormhoudt, J. C., Jayne, J. T., Miake-Lye, R. C., Knighton, W. B., and Wey, C.: Chemical speciation of hydrocarbon emissions from a commercial aircraft engine, J. Propul. Power, 23, 912-918, 2007.
* [44] Zhang, R., Suh, I., Zhao, J., Zhang, D., Fortner, E. C., Tie, X., Molina, L. T., and Molina, M. J.: Atmospheric new particle formation enhanced by organic acids, Science, 304, 1487-1490, 2004.

Document n.  2
* Segal et al. (2007) Segal, Y., Pinsky, M., and Khain, A.: The Role of Competition Effect in the Raindrop Formation, Atmos. Res., 83, 106-118, 2007.
* Singh et al. (1982) Singh, C., Singh, R. N., and Pillai, P. C.: A Study of Geometrical Factor in Optical Particle Counters, Opt. Applicata, 12, 231-242, 1982.
* Shmeter (1987) Shmeter, S. M.: Thermodynamics and Physics of the Convective Clouds, in Russian, Leningrad, Gidrometeoizdat, 1987.
* Silverman (2

## Retrieval and generation

Once we have our documents processed and indexed we can start the costruction of our pipeline. The pipeline will be composed of the following components:
- **Retriever**: the vector store we created in the previous step.
- **LLM**: a language model that will generate the answer to the question based on the context.



### Prompt

As first think we will define the prompt to be used in the chat. Since our final goal is to create a chat system between we are going to structure the prompt using three different templates:
- **SystemMessagePromptTemplate**: the system message represents guidelines for the model on how to interact with the user and interpret the conversation.
- **AIMessagePromptTemplate**: the AI message represents a message generate by the model.
- **HumanMessagePromptTemplate**: the human message represents the message sent by the user.

From the code below we can see the structure of the prompt and the templates used to create it using langchain. Specifically we could see some **special tokens** used in the prompt:
- **<|system|> | <|user|> | <|assistant|>**: are special tokens that helps the model to understand to who belongs that specific message.
- **<|end|>**: is a special token that indicates the end of a message.
- **{message} | {context} | {question}**: are placeholders that will be replaced with the actual message, context and question.

In [9]:
import os  # Customize SystemPromptTemplate
from langchain.prompts import SystemMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate

template= '''<|system|>
{message}
<|end|>
'''

# A human message will contain the question and the context. The context will be automatically added by the retriever.
human_template = '''<|user|>
Context: {context}

Question is below:

Question: {question}

<|end|>
<|assistant|>
'''


assistant_template = '''<|assistant|>
{message}
<|end|>
'''

# Define the templates
SystemMessageTemplate = SystemMessagePromptTemplate.from_template(template)
HumanMessageTemplate = HumanMessagePromptTemplate.from_template(human_template)
AIMessageTemplate = AIMessagePromptTemplate.from_template(assistant_template)


In [10]:
from langchain.memory import ChatMessageHistory

# Define the system message

system_message = '''You are an expert assistant that answers questions about different topics.

You are given some extracted parts from science papers along with a question.

If you don't know the answer, just say "I don't know." Don't try to make up an answer.

Use only the following pieces of context to answer the question at the end.

Do not use any prior knowledge.'''


system_msg = SystemMessageTemplate.format(message=system_message)

system_msg

SystemMessage(content='<|system|>\nYou are an expert assistant that answers questions about different topics.\n\nYou are given some extracted parts from science papers along with a question.\n\nIf you don\'t know the answer, just say "I don\'t know." Don\'t try to make up an answer.\n\nUse only the following pieces of context to answer the question at the end.\n\nDo not use any prior knowledge.\n<|end|>\n', additional_kwargs={}, response_metadata={})

Now that we have the definition of different templates we can define the chat prompt. Langchain requireres a specific structure for the chat prompt that is composed of a list of messages. In the code below we can see that our chat template will be composed of two messages, the **system message** and the **human message** that contains the input from the user.

In [11]:
from langchain.prompts import MessagesPlaceholder, PromptTemplate, ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    messages=[
    system_msg,
    human_template,
    ]
)

# As we can see our prompt is expecting three variables to be filled
print(chat_template)

input_variables=['context', 'question'] input_types={} partial_variables={} messages=[SystemMessage(content='<|system|>\nYou are an expert assistant that answers questions about different topics.\n\nYou are given some extracted parts from science papers along with a question.\n\nIf you don\'t know the answer, just say "I don\'t know." Don\'t try to make up an answer.\n\nUse only the following pieces of context to answer the question at the end.\n\nDo not use any prior knowledge.\n<|end|>\n', additional_kwargs={}, response_metadata={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='<|user|>\nContext: {context}\n\nQuestion is below:\n\nQuestion: {question}\n\n<|end|>\n<|assistant|>\n'), additional_kwargs={})]


### Model initialization

Let's load the model and the tokenizer and create an HuggingFace pipeline that will tokenize the input and generate the answer to the question.

In [18]:
!pip install -U bitsandbytes



In [1]:
#from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.chains.retrieval_qa.base import RetrievalQA
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

model_id = "eve-esa/eve-sft-instruct-qa"
# Load the tokenizer

# Quantize the model
# Quantize the model
quantization_config = {'load_in_8bit': True}

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=quantization_config,
)


# Create the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2000)

model = HuggingFacePipeline(pipeline=pipe)

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:  11%|#         | 535M/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

Device set to use cuda:0


### Langchain pipelines

Langchain pipelines are a powerful tool used to assemble and coordinated different components. Our pipeline will look something like this

$$\text{user query} → \text{retriever} → \text{chat prompt} → \text{LLM} → \text{answer} $$

In langchain we will use the chain '|' operator to assemble in series our components. The chain operator is part of the **LangChain Expression Language** a declarative method to build pipelines. In the LCEL language the output of what is on the left of '|' will be the input on what there is on the right of the pipeline.

Let's build our first pipeline to understand how they works. In our sample pipeline below, we can see that we are dynamically creating a dictionary that will be given in input to our chat template (N.B. as we saw above our chat template takes in input three variables)

From the code we can see that the context value is created by taking the question (from the input dict given to the chain) and using it as input to our retriever. The output of the retriever will be then formatted by the format_docs function.
The question instead will remain as it is.


A chain will be called by the **invoke** method. The invoke methods takes as argument a dictionary that will represent the input of the first element of the pipeline.



In [25]:
from operator import itemgetter

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


retrieve_docs = (lambda x: x["question"]) | retriever | format_docs

In [26]:
docs = retrieve_docs.invoke({'question': question})

In [27]:
# Print the result of the chain
print(docs)

Figure 6: Percentage contribution of blowing snow source to the simulated tropospheric column BrO in **(a)** March and **(b)** September of 1998. Values are calculated according to the formula: (BASE run – OCEAN run)/(BASE run)\(\times\) 100%).

Figure 7: BASE run surface BrO and ozone mixing ratios at different longitude locations along two model latitudes of 74.\({}^{\circ}\) N (**a** and **b)** and 69.2\({}^{\circ}\) N (**c** and **d**) in March 1998.

contribution throughout the high latitude troposphere where a monthly mean ozone difference of up to 8% is obtained (Fig. 5c and d). This ozone difference reflects ozone loss just due to blowing snow events and is in addition to ozone losses caused by bromine emitted from the open ocean sea salt and released from bromocarbons. Comparing the BASE run with the noBr run, we find that the bromine chemistry in the troposphere reduces mean tropospheric ozone amounts by 5-30% (Fig. 5e and f), which is consistent with previous estimates (von 

In [28]:
retriever = vectore_store.as_retriever(search_type = "similarity", search_kwargs = {"k": 3})


# Retrieval chain
retrieve_docs = (lambda x: x["question"]) | retriever

rag_chain_from_docs = (
    {
        "question": itemgetter('question'),  # keep the input query
        "context": itemgetter('question') | retriever | format_docs,  # format the context
    } # In this dict we have defined our three variables needed for our chat template
    |
    chat_template | model
)

output = rag_chain_from_docs.invoke({"question": question})



In [29]:
print(output)

System: <|system|>
You are an expert assistant that answers questions about different topics.

You are given some extracted parts from science papers along with a question.

If you don't know the answer, just say "I don't know." Don't try to make up an answer.

Use only the following pieces of context to answer the question at the end.

Do not use any prior knowledge.
<|end|>

Human: <|user|>
Context: Figure 6: Percentage contribution of blowing snow source to the simulated tropospheric column BrO in **(a)** March and **(b)** September of 1998. Values are calculated according to the formula: (BASE run – OCEAN run)/(BASE run)\(\times\) 100%).

Figure 7: BASE run surface BrO and ozone mixing ratios at different longitude locations along two model latitudes of 74.\({}^{\circ}\) N (**a** and **b)** and 69.2\({}^{\circ}\) N (**c** and **d**) in March 1998.

contribution throughout the high latitude troposphere where a monthly mean ozone difference of up to 8% is obtained (Fig. 5c and d). Th