# Camel 5b: Loading a PDF using LlamaIndex and Llama connectors (HuggingFace Embeddings)

In this example, rather than training a model from scratch, we take a pre-trained model and extend what it can answer by extracting text from a PDF with a Llama connector and indexing it with LlamaIndex. At query time, the relevant passages are retrieved and placed in the prompt. This approach, known as in-context learning, avoids the cost of training or fine-tuning while still giving us answers tailored to geoscience.
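The idea of supplying extracted text to the model at query time, instead of retraining it, can be sketched in a few lines. This is a minimal illustration, not the prompt that the library actually uses: the helper `build_context_prompt`, the wording of the template, and the sample chunks are all assumptions made for this example.

```python
def build_context_prompt(question, context_chunks):
    """Assemble a prompt that places retrieved text before the question.

    Hypothetical helper: the template wording is illustrative only.
    """
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        f"answer the question: {question}\n"
    )


# Made-up chunks standing in for text extracted from a PDF.
chunks = [
    "Model-based inversion relies on a physical forward model.",
    "Data-driven inversion learns a mapping from training data.",
]
prompt = build_context_prompt(
    "What is the difference between model-based and data-driven inversion?",
    chunks,
)
print(prompt)
```

The model's weights never change; only the text it is asked to read does.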

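The next cell also sets an OpenAI API key. Note that both the LLM (Camel-5b) and the embedding model used below are loaded from Hugging Face, so the key only matters if some component falls back to the OpenAI defaults.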

In [1]:
import os
os.environ['OPENAI_API_KEY'] = "insert your openai key "
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 


from llama_index import GPTListIndex, SimpleDirectoryReader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext
from llama_index.llm_predictor import HuggingFaceLLMPredictor
from llama_index import GPTVectorStoreIndex, GPTEmptyIndex

from pathlib import Path #needed for the pdf connector
from llama_index import download_loader

# setup prompts - specific to Camel-5b
from llama_index.prompts.prompts import SimpleInputPrompt

import torch


## Benchmark

First, we will begin by posing a "geoscience question" in order to establish a benchmark. This step is intended to gauge the pretrained model's ability to provide an answer without any prior knowledge or context regarding a specific problem that will be introduced later on.

Loading Model with HuggingFace

In [2]:
# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)
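To see what the wrapper does to a question, the same template can be filled in with plain string formatting. This is only a sketch of the substitution step; `CAMEL_TEMPLATE` and `wrap_query` are hypothetical names for this example, and the library performs the equivalent substitution internally.

```python
# Same template text as the notebook cell above, kept as a plain string.
CAMEL_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


def wrap_query(query_str):
    """Fill the {query_str} slot with the user's question."""
    return CAMEL_TEMPLATE.format(query_str=query_str)


wrapped = wrap_query("What is post-stack seismic inversion?")
print(wrapped)
```

The prompt ends with `### Response:`, which cues the instruction-tuned model to start its answer there.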

In [3]:
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048, 
    max_new_tokens=256,
    # temperature=0.25,
    # do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # model_kwargs={"torch_dtype": torch.bfloat16}
    model_kwargs={"torch_dtype": torch.float16},
)

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

service_context = ServiceContext.from_defaults(chunk_size_limit=512, llm_predictor=hf_predictor, embed_model=embed_model)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
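The `chunk_size_limit=512` setting caps the size of each piece of a document before it is embedded and indexed. A rough sketch of the idea follows, under a loud assumption: whitespace-separated words stand in for tokens, whereas real splitters typically count tokenizer tokens. `split_into_chunks` is a hypothetical helper, not a library function.

```python
def split_into_chunks(text, chunk_size=512):
    """Split text into consecutive chunks of at most `chunk_size` words.

    Assumption: words approximate tokens; a real splitter uses a tokenizer.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# A synthetic 1200-word document.
sample = " ".join(f"w{i}" for i in range(1200))
chunks = split_into_chunks(sample, chunk_size=512)
print([len(c.split()) for c in chunks])  # [512, 512, 176]
```

Smaller chunks give more focused retrieval but less surrounding context per chunk, so the limit is a trade-off rather than a fixed rule.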

For the benchmark we create an empty index. Because it contains no documents, no context is retrieved and the answer comes directly from the LLM.

In [4]:
# build empty index
empty_index = GPTEmptyIndex(service_context=service_context)

Creating the query, and obtaining the response

In [5]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark = query_engine.query(
    "What is the difference between model-based and data-driven seismic invesion?",
)
print(response_benchmark)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Model-based seismic invasion is a predictive method that uses mathematical models to simulate the behavior of a subsurface structure under seismic loading. These models are based on the underlying rock properties, such as elasticity, density, and fault geometry. The models are calibrated using historical data and then used to predict the likelihood of seismic events, such as ground shaking or ground deformation.

Data-driven seismic invasion, on the other hand, involves collecting and analyzing large amounts of seismic data from real-world events to identify patterns and relationships between the seismic signals and the subsurface structure. This data-driven approach relies on the interpretation of the seismic data to identify potential vulnerabilities and risks to human and infrastructure safety.


In [6]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark_2 = query_engine.query(
    "When is better to use Supervised Learning for post-stack seismic inversion?",
)
print(response_benchmark_2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Supervised Learning is better used for post-stack seismic inversion when the data is clean and well-labeled, with clear boundaries between the ground and the subsurface structures.


With this we have obtained the benchmark answers. Now we will proceed to inject the knowledge from a PDF (an EAGE abstract).

## From Context


PDF reader from Llama connectors

In [7]:
CJKPDFReader = download_loader("CJKPDFReader")
loader = CJKPDFReader()

Loading the PDF file to introduce our dataset and create the index

In [8]:
documents = loader.load_data(file=Path('data_test/158.pdf'))
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
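Conceptually, a vector store index embeds every chunk of the document and, at query time, returns the chunks whose embeddings are closest to the query embedding. The toy version below illustrates that retrieval step only. It is a sketch under stated assumptions: `embed` uses word counts instead of the sentence-transformer model configured above, `ToyVectorIndex` is a hypothetical class, and the sample sentences are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (NOT a sentence-transformer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, embedding) pairs and ranks them against a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=1):
        q = embed(query)
        ranked = sorted(
            self.entries, key=lambda entry: cosine(q, entry[1]), reverse=True
        )
        return [chunk for chunk, _ in ranked[:top_k]]


toy_index = ToyVectorIndex([
    "supervised learning needs labelled training data",
    "plug and play priors use a denoiser as a regularizer",
    "seismic inversion estimates acoustic impedance from seismic data",
])
best = toy_index.retrieve("what does seismic inversion estimate", top_k=1)
print(best)
```

In the notebook, the retrieved chunks are then passed to Camel-5b as context, which is what separates these answers from the benchmark ones.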

Creating the query and the response

In [9]:
query_engine = index.as_query_engine()
response_context = query_engine.query("What is the difference between model-based and data-driven seismic invesion?")
print(response_context)

Token indices sequence length is longer than the specified maximum sequence length for this model (937 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Model-based inversion relies on a mathematical model to invert seismic data, while data-driven inversion uses training data to learn the relationship between acoustic impedance and seismic data.


In [10]:
query_engine = index.as_query_engine()
response_context_2 = query_engine.query("When is better to use Supervised Learning for post-stack seismic inversion?")
print(response_context_2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


When the data is scarce or the background model is unavailable, using Supervised Learning for post-stack seismic inversion is better as it requires no train-
ing data and can handle noisy data effectively.


## Asking about the paper

In [11]:
query_engine = index.as_query_engine(response_mode="compact")
response_context_4 = query_engine.query("give me a summary of the document")
print(response_context_4)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The document discusses the application of PnP priors in seismic inversion, focusing on the use of learned denoising neural networks as a regularizer in solving inverse problems. It compares model-based and data-driven methods, highlighting the advantages of supervised regularization, and provides numerical examples.
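The `response_mode="compact"` option asks the query engine to pack retrieved chunks together before calling the LLM, so that fewer calls are needed. The packing step might look roughly like the sketch below. It assumes a word budget in place of a token budget, and `pack_chunks` is a hypothetical helper rather than the library's implementation.

```python
def pack_chunks(chunks, max_words=512):
    """Greedily concatenate chunks into groups of at most `max_words` words.

    A single chunk longer than the budget is kept as its own group.
    """
    packed, current, current_len = [], [], 0
    for chunk in chunks:
        n_words = len(chunk.split())
        # Start a new group when adding this chunk would exceed the budget.
        if current and current_len + n_words > max_words:
            packed.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(chunk)
        current_len += n_words
    if current:
        packed.append("\n\n".join(current))
    return packed


# Three synthetic 200-word chunks.
retrieved = [" ".join([letter] * 200) for letter in "abc"]
groups = pack_chunks(retrieved, max_words=512)
print(len(groups))  # 2
```

Here three chunks collapse into two prompts: the first two fit within the budget together, and the third starts a new group.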
