# GPT: Loading a PDF using LlamaIndex and Llama connectors

In this example, rather than training a model from scratch, we take a pre-trained model and extend what it can answer by extracting information from a PDF with Llama connectors and LlamaIndex. This approach, known as in-context learning, supplies the relevant text to the model at query time, which reduces resource consumption while tailoring the answers to geoscience.

Here we will use an LLM from OpenAI, so we need an OpenAI API key.

In [1]:
import os
os.environ['OPENAI_API_KEY'] = "insert your openai key"  # replace with your own key
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from pathlib import Path  # needed for the PDF connector
from llama_index import download_loader

from llama_index import (
    GPTVectorStoreIndex,
    GPTEmptyIndex,
    GPTTreeIndex,
    GPTListIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
)
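
Hardcoding the key in the notebook makes it easy to leak when the file is shared. As a minimal sketch (the helper name `get_openai_key` is our own and not part of any library), the key can instead be read from the environment, with an interactive prompt as a fallback:

```python
import os
from getpass import getpass


def get_openai_key(env=None, prompt_fn=getpass):
    """Return the OpenAI key from `env`, prompting only if it is missing.

    `get_openai_key` is a hypothetical helper written for this notebook.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Ask interactively so the key never appears in the notebook source.
        key = prompt_fn("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    # Store it where the OpenAI client looks for it.
    env["OPENAI_API_KEY"] = key
    return key
```

Calling `get_openai_key()` once at the top of the notebook would then replace the hardcoded assignment.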


## Benchmark

First, we will begin by posing a "geoscience question" in order to establish a benchmark. This step is intended to gauge the pretrained model's ability to provide an answer without any prior knowledge or context regarding a specific problem that will be introduced later on.

For this we will create an empty index, which lets us get an answer directly from the LLM.

In [2]:
# configure
service_context = ServiceContext.from_defaults(chunk_size=256)
storage_context = StorageContext.from_defaults()

# build empty index
empty_index = GPTEmptyIndex(service_context=service_context, storage_context=storage_context)

Creating the query and obtaining the response

In [3]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark = query_engine.query(
    "What is the difference between model-based and data-driven seismic inversion?",
)
print(response_benchmark)



Model-based seismic inversion is a process of interpreting seismic data by using a priori knowledge of the subsurface. This method relies on a model of the subsurface that is built from geological and geophysical data. The model is then used to generate synthetic seismic data that is compared to the actual seismic data. The differences between the two are used to refine the model and improve the accuracy of the interpretation.

Data-driven seismic inversion is a process of interpreting seismic data without relying on a priori knowledge of the subsurface. This method uses machine learning algorithms to analyze the seismic data and identify patterns that can be used to infer the subsurface structure. Data-driven seismic inversion does not require a model of the subsurface and can be used to generate more accurate interpretations of the subsurface.


In [10]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark_2 = query_engine.query(
    "When is it better to use Supervised Learning for post-stack seismic inversion?",
)
print(response_benchmark_2)



Supervised learning is best used for post-stack seismic inversion when the seismic data is of high quality and the desired output is well-defined. Supervised learning can be used to identify patterns in the seismic data and to create a model that can be used to predict the desired output. This type of learning is particularly useful when the data is complex and the desired output is difficult to define.


With this we have obtained the benchmark answers. Next, we will inject the knowledge from a PDF (an EAGE abstract).
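
The difference between the benchmark and the context-based queries comes down to what text reaches the LLM. The sketch below illustrates the idea with a hypothetical `build_prompt` helper; the template wording is our own assumption and is not the exact prompt LlamaIndex uses. With no context chunks, the question is sent as-is, which mirrors the empty-index benchmark.

```python
def build_prompt(question, context_chunks=()):
    """Assemble the text sent to the LLM (illustrative template only)."""
    if not context_chunks:
        # Benchmark case: nothing retrieved, so the question goes through unchanged.
        return question
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using the context above, answer the question.\n"
        f"Question: {question}\n"
    )
```

The model weights never change in either case; only the prompt does.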

## From Context


PDF reader from Llama connectors

In [5]:
CJKPDFReader = download_loader("CJKPDFReader")
loader = CJKPDFReader()

Loading the PDF file to introduce our dataset and create the index

In [6]:
documents = loader.load_data(file=Path('data_test/158.pdf'))
index = GPTVectorStoreIndex.from_documents(documents)
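
To give an intuition for what a vector store index does, here is a toy sketch using only the standard library. It is not how `GPTVectorStoreIndex` is implemented: we assume word-based chunking instead of token-based splitting, and a bag-of-words count vector instead of a learned embedding model. All function names are our own.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=256):
    """Split text into chunks of at most `chunk_size` words (stand-in for token splitting)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding: word counts. A real index would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the `top_k` chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

At query time, the retrieved chunks are what gets passed to the LLM together with the question.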

Creating the query and the response

In [7]:
query_engine = index.as_query_engine()
response_context = query_engine.query("What is the difference between model-based and data-driven seismic inversion?")
print(response_context)


Model-based seismic inversion is a method of solving a minimization problem to estimate the acoustic impedance model from seismic data. It relies on a linear approximation of the reflectivity equation and is solved using an iterative solver with a background acoustic impedance model as an initial guess. 

Data-driven seismic inversion, on the other hand, relies on the supervised learning paradigm. It uses a neural network to predict the acoustic impedance profile from seismic traces, and can also include a data-consistency loss to compare the output of the network with the data. This approach does not require a linear approximation of the reflectivity equation and does not rely on an initial guess.


In [12]:
query_engine = index.as_query_engine(response_mode='tree_summarize')
response_context_2 = query_engine.query("When is it better to use Supervised Learning for post-stack seismic inversion?")
print(response_context_2)


Supervised learning is best used for post-stack seismic inversion when there is a dense well coverage and a background trend in the input dataset. Additionally, a data-consistency loss should be included for seismic traces without matching acoustic impedance profiles. When less than six wells are available in the training process, the supervised approach becomes less accurate than the Plug-and-Play (PnP) approach.
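
The `tree_summarize` response mode used above combines the retrieved chunks hierarchically instead of answering from them one after another. A rough sketch of that idea follows, assuming a fixed group size (`fanout`) and a hypothetical `summarize` callable that would wrap an LLM call including the question. The real grouping in LlamaIndex depends on how much text fits in the model's context window, so treat this as an illustration only.

```python
def tree_summarize(chunks, summarize, fanout=2):
    """Merge groups of `fanout` texts with `summarize` until a single text remains.

    `summarize` is a hypothetical callable: it takes a list of strings and
    returns one string (in practice, an LLM call that also sees the question).
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(chunks)
    while True:
        # Build the next level of the tree from groups of the current level.
        level = [summarize(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        if len(level) == 1:
            return level[0]
```

For a quick check without an LLM, `summarize` can be any function that joins strings, for example `lambda group: " ".join(group)`.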


## Asking about the paper

In [8]:
query_engine = index.as_query_engine()
response_context_4 = query_engine.query("What is the paper about?")
print(response_context_4)


This paper is about the comparison of different seismic inversion methods, including deep image prior based seismic inversion, plug-and-play seismic inversion, and supervised learning approaches. The comparison is performed on a synthetic dataset created from the Marmousi model.
