# Camel-5b: Loading a PDF using LlamaIndex and Llama connectors (HuggingFace Embeddings)

In this example, rather than training a model from scratch, we leverage a pre-trained model and enhance its knowledge with information extracted from a PDF using Llama connectors and LlamaIndex. This approach, known as in-context learning, reduces resource consumption while tailoring the model's answers to geoscience.
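
To make the idea concrete, here is a minimal sketch of how retrieved text can be placed in front of a question before it is sent to a pre-trained model. The function name `build_context_prompt`, the prompt wording, and the placeholder passages are assumptions made for illustration; they are not LlamaIndex's internal prompt.

```python
def build_context_prompt(question, passages):
    """Place numbered context passages before the question (illustrative only)."""
    # Number each passage so the model can tell them apart.
    context = "\n\n".join(
        f"[{i + 1}] {passage.strip()}" for i, passage in enumerate(passages)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using the context above, answer the question.\n"
        f"Question: {question}\n"
    )


# Placeholder passages; in the notebook these would come from the PDF.
example_prompt = build_context_prompt(
    "What is the benefit of using segmentation on seismic inversion?",
    ["<passage retrieved from the PDF>", "<another retrieved passage>"],
)
print(example_prompt)
```

The model's weights are never updated; only the prompt changes, which is why this is cheaper than fine-tuning.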

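The LLM (Camel-5b) and the embedding model used below are both loaded from HuggingFace. LlamaIndex falls back to OpenAI models when no LLM predictor or embedding model is supplied, so the next cell also sets an OpenAI API key.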

In [14]:
import os
os.environ['OPENAI_API_KEY'] = "insert your openai key "
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 


from llama_index import GPTListIndex, SimpleDirectoryReader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext
from llama_index.llm_predictor import HuggingFaceLLMPredictor
from llama_index import GPTVectorStoreIndex, GPTEmptyIndex

from pathlib import Path #needed for the pdf connector
from llama_index import download_loader

# setup prompts - specific to Camel-5b
from llama_index.prompts.prompts import SimpleInputPrompt

import torch


## Benchmark

First, we pose geoscience questions to establish a benchmark. This step gauges how well the pre-trained model can answer without any prior knowledge or context about the specific problem introduced later on.

Loading Model with HuggingFace

In [15]:
# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)
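
As a rough illustration of what the wrapper does to a query, the sketch below fills the same template with plain `str.format`. The names `CAMEL_TEMPLATE` and `wrap_query` are hypothetical, and this does not reproduce the internals of `SimpleInputPrompt`; it only shows the text that the template is meant to produce.

```python
# Same instruction template as in the cell above, kept as a plain string.
CAMEL_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


def wrap_query(query_str: str) -> str:
    """Insert the user's query into the instruction template."""
    return CAMEL_TEMPLATE.format(query_str=query_str)


wrapped = wrap_query(
    "What is the benefit of using segmentation on 4D seismic inversion?"
)
print(wrapped)
```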

In [16]:
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048, 
    max_new_tokens=256,
    # temperature=0.25,
    # do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # model_kwargs={"torch_dtype": torch.bfloat16}
    model_kwargs={"torch_dtype": torch.float16},
)

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

service_context = ServiceContext.from_defaults(chunk_size_limit=512, llm_predictor=hf_predictor, embed_model=embed_model)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
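
The `chunk_size_limit=512` argument caps how much text goes into each indexed chunk. The sketch below illustrates the idea with whitespace-separated words standing in for tokens. The helper `chunk_words` and its `overlap` parameter are hypothetical, and LlamaIndex's own splitter should be expected to count tokens with a tokenizer rather than words.

```python
def chunk_words(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most chunk_size words (words stand in for tokens)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Synthetic text of 1200 "words", used only to show the chunk sizes.
sample_text = " ".join(f"w{i}" for i in range(1200))
print([len(chunk.split()) for chunk in chunk_words(sample_text, chunk_size=512)])
```

Smaller chunks make retrieval more precise but give the LLM less surrounding text per chunk, so the limit is a trade-off rather than a fixed rule.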

To do this, we create an empty index, which lets us get an answer directly from the LLM.

In [17]:
# build empty index
empty_index = GPTEmptyIndex(service_context=service_context)

Creating the query and obtaining the response

In [18]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark = query_engine.query(
    "What is the benefit of using segmentation on 4D seismic inversion?",
)
print(response_benchmark)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Segmentation on 4D seismic inversion allows for the analysis of subsurface structures and properties by dividing the data into smaller subsets or segments. This improves the accuracy and resolution of the inversion results, allowing for better understanding of the subsurface structure, composition, and fluid content.


In [19]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark_2 = query_engine.query(
    "What type of regularization should I use for post-stack seismic inversion?",
)
print(response_benchmark_2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Regularization techniques for post-stack seismic inversion include Tikhonov regularization, L1-regularization, and L2-regularization. Tikhonov regularization is simple and effective, but may lead to overfitting. L1-regularization is suitable for problems with a large number of parameters, but may result in noisy solutions. L2-regularization is more robust and can handle both large and small parameter values, but it may lead to underfitting. It's recommended to use a combination of techniques, such as L1-regularization with Tikhonov regularization for robustness and L2-regularization for better accuracy.


With this, we have obtained the benchmark answers. Now we will proceed to inject the knowledge from a PDF (an EAGE abstract).

## From Context


PDF reader from Llama connectors

In [20]:
CJKPDFReader = download_loader("CJKPDFReader")
loader = CJKPDFReader()

Loading the PDF file to introduce our dataset and create the index

In [21]:
documents = loader.load_data(file=Path('data_test/315.pdf'))
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
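
Conceptually, a vector store index embeds each chunk of the PDF and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below shows that retrieval step with cosine similarity. It uses toy three-dimensional vectors in place of real `all-mpnet-base-v2` embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical, not LlamaIndex API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
toy_chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query_vec = [1.0, 0.0, 0.0]
print(top_k(toy_query_vec, toy_chunk_vecs, k=2))
```

The text of the selected chunks is then passed to the LLM as context alongside the question.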

Creating the query and the response

In [22]:
query_engine = index.as_query_engine()
response_context = query_engine.query("What is the benefit of using segmentation on seismic inversion?")
print(response_context)

Token indices sequence length is longer than the specified maximum sequence length for this model (964 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Segmentation enhances the interpretation of seismic data by dividing the subsurface into distinct layers, allowing for more accurate reservoir property estimation and better risk assessment.


In [23]:
query_engine = index.as_query_engine()
response_context_2 = query_engine.query("What type of regularization should I use for post-stack seismic inversion?")
print(response_context_2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


For post-stack seismic inversion, you should use the 4D JIS algorithm (Ravasi and Birnie, 2022; Romero et al., 2022).


## Asking about the paper

In [24]:
query_engine = index.as_query_engine(response_mode="compact")
response_context_3 = query_engine.query("What is the novelty of 4D joint inversion")
print(response_context_3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The 4D joint inversion method is novel as it combines the advantages of 4D seismic data analysis with the capability of
segmenting and inversion of 4D seismic data, leading to improved subsurface model resolution and reduced noise in the
inverted difference.
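
The `compact` response mode, as described in the LlamaIndex documentation, concatenates retrieved chunks before prompting so that fewer LLM calls are needed. A minimal sketch of that packing idea follows, assuming a word budget in place of a token budget; the helper `pack_chunks` is hypothetical and does not reproduce the library's implementation.

```python
def pack_chunks(chunks, budget_words):
    """Greedily merge consecutive chunks so each merged text stays within budget_words.

    A single chunk that exceeds the budget on its own is kept as its own entry.
    """
    packed, current, used = [], [], 0
    for chunk in chunks:
        n_words = len(chunk.split())
        # Flush the current group if adding this chunk would exceed the budget.
        if current and used + n_words > budget_words:
            packed.append("\n\n".join(current))
            current, used = [], 0
        current.append(chunk)
        used += n_words
    if current:
        packed.append("\n\n".join(current))
    return packed


# Placeholder chunks; four chunks fit into two prompts with a budget of 5 words.
toy_chunks = ["a b c", "d e", "f g h i", "j"]
print(len(pack_chunks(toy_chunks, budget_words=5)))
```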


In [25]:
query_engine = index.as_query_engine(response_mode="compact")
response_context_4 = query_engine.query("give me a summary of the document")
print(response_context_4)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The document is a summary of the Sleipner dataset, including the well-analysis workflow, well-to-seismic tie, and time-shift distribution.
