# GPT: Loading a PDF using LlamaIndex and Llama connectors

In this example, rather than training a model from scratch, we leverage a pre-trained model and extend its knowledge with information extracted from a PDF using Llama connectors and LlamaIndex. This approach, known as in-context learning, supplies the relevant passages to the model at query time, so the model itself is not retrained. It lets us reduce resource consumption while tailoring the answers to geoscience.

Here we use the LLM from OpenAI, so an OpenAI API key is required.

In [1]:
import os
os.environ['OPENAI_API_KEY'] = "sk-..."  # placeholder: set your own key and never commit a real one
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 

from pathlib import Path  # needed for the PDF connector

from llama_index import (
    GPTVectorStoreIndex,
    GPTEmptyIndex,
    GPTTreeIndex,
    GPTListIndex,
    SimpleDirectoryReader,  # simple reader for plain-text files
    ServiceContext,
    StorageContext,
    download_loader,
)
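
Rather than pasting the key into a notebook cell, it can be read from the environment and requested interactively only when it is missing. The following is a minimal sketch under that assumption; `get_openai_key` is a hypothetical helper written for this note, not part of LlamaIndex or the OpenAI client.

```python
import os
from getpass import getpass


def get_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key is not echoed into the cell output
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty; an API key is required")
    # Store the cleaned value so libraries that read the environment can find it
    os.environ[env_var] = key
    return key
```

Calling `get_openai_key()` once at the top of the notebook, before any index is built, keeps the secret out of the saved file.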


## Benchmark

First, we pose a geoscience question to establish a benchmark. This step gauges the pre-trained model's ability to answer without any context about the specific problem that will be introduced later on.

For this we create an empty index, which lets us get an answer directly from the LLM.

In [2]:
# configure
service_context = ServiceContext.from_defaults(chunk_size=256)
storage_context = StorageContext.from_defaults()

# build empty index
empty_index = GPTEmptyIndex(service_context=service_context, storage_context=storage_context)
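
As a rough illustration of why an empty index works as a benchmark: with no retrieved text, the prompt reaching the LLM is essentially the question alone, whereas a populated index prepends retrieved passages. The `build_prompt` helper and its template wording below are hypothetical, written for this note, and should not be read as LlamaIndex's actual prompt templates.

```python
from typing import Sequence


def build_prompt(question: str, context_chunks: Sequence[str] = ()) -> str:
    """Assemble the text sent to an LLM (hypothetical template, for illustration only)."""
    if not context_chunks:
        # Benchmark case: the model can only rely on what it learned in pre-training
        return f"Answer the question: {question}\n"
    # Context case: retrieved passages are placed in front of the question
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        f"Using the context above, answer the question: {question}\n"
    )
```

Printing `build_prompt(question)` next to `build_prompt(question, chunks)` makes the difference between the two setups visible: the model is the same, only the prompt changes.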

Creating the query and obtaining the response

In [3]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark = query_engine.query(
    "What is the benefit of using segmentation on 4D seismic inversion?",
)
print(response_benchmark)



Segmentation on 4D seismic inversion can help to improve the accuracy of the inversion results by allowing the inversion to focus on specific areas of interest. This can help to reduce noise and improve the resolution of the inversion results. Additionally, segmentation can help to reduce the computational cost of the inversion by allowing the inversion to focus on smaller areas of interest. This can help to reduce the time and resources needed to complete the inversion.


In [4]:
query_engine = empty_index.as_query_engine(
    response_mode='generation'
)
response_benchmark_2 = query_engine.query(
    "What type of regularization should I use for post-stack seismic inversion?",
)
print(response_benchmark_2)



The type of regularization used for post-stack seismic inversion depends on the type of data being inverted and the desired outcome. Generally, it is recommended to use a combination of Tikhonov regularization, total variation regularization, and sparsity-promoting regularization. Tikhonov regularization is used to reduce noise and stabilize the inversion, while total variation regularization is used to preserve sharp features in the data. Sparsity-promoting regularization is used to promote sparsity in the inversion, which can help reduce the number of parameters needed to accurately represent the data.


With this we have obtained the benchmark answers. Now we proceed to inject the knowledge from a PDF (an EAGE abstract).

## From Context


PDF reader from Llama connectors

In [5]:
CJKPDFReader = download_loader("CJKPDFReader")
loader = CJKPDFReader()

Loading the PDF file to introduce our dataset and create the index

In [6]:
documents = loader.load_data(file=Path('data_test/315.pdf'))
index = GPTVectorStoreIndex.from_documents(documents)
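
To give an intuition for what happens when a vector index is built and queried: the document is split into chunks, each chunk is turned into a vector, and at query time the chunks most similar to the question are passed to the LLM as context. The sketch below mimics this with plain word counts and cosine similarity instead of learned embeddings. `split_into_chunks`, `bag_of_words`, `cosine`, and `retrieve` are hypothetical helpers written for this note, not part of LlamaIndex, and the sample chunks are made-up toy sentences rather than text from the PDF.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=40):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bag_of_words(text):
    """Toy stand-in for an embedding: counts of lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks ranked by similarity to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


# Made-up toy chunks, for illustration only
toy_chunks = [
    "total variation regularization preserves sharp edges",
    "the survey was acquired in the north sea",
    "segmentation classifies time-lapse changes",
]
print(retrieve("which regularization preserves edges", toy_chunks, top_k=1))
```

A real index replaces the word counts with dense embeddings, which can match passages that share meaning without sharing exact words, but the retrieve-then-prompt flow is the same idea.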

Creating the query and the response

In [7]:
query_engine = index.as_query_engine()
response_context = query_engine.query("What is the benefit of using segmentation on seismic inversion?")
print(response_context)


The benefit of using segmentation on seismic inversion is that it can mitigate the non-repeatable noise imprint from the data in the inverted models by jointly inverting for the baseline and monitor datasets, produce high-resolution baseline and model estimates (with better-defined geological units and CO2 plume) due to the presence of Total-Variation and segmentation constraints, and classify the time-lapse changes into expected 4D scenarios. The segmentation product helps the 4D interpretation process and might be used as input for reservoir simulations.


In [8]:
query_engine = index.as_query_engine()
response_context_2 = query_engine.query("What type of regularization should I use for post-stack seismic inversion?")
print(response_context_2)


For post-stack seismic inversion, a Total-Variation (TV) regularization should be used. This type of regularization ensures smooth time-shift estimates in the spatial or time axis, and is used in the nonlinear inversion approach proposed by Rickett et al. (2007). Additionally, it is used in the 4D JIS approach to retrieve the classification of the segmented baseline-monitor differences.


## Asking about the paper

In [9]:
query_engine = index.as_query_engine()
response_context_3 = query_engine.query("What is the novelty of 4D joint inversion")
print(response_context_3)

and segmentation?

The novelty of 4D joint inversion and segmentation is that it produces high-resolution acoustic impedance models and strongly reduces the non-repeatable noise in the inverted difference. This allows for the subsurface changes due to CO2 injection to be clearly visible, and provides a segmented volume of the expected time-lapse changes that can ease 4D seismic interpretation and might be used as the input geobodies for reservoir simulations.


In [10]:
query_engine = index.as_query_engine()
response_context_4 = query_engine.query("What is the paper about")
print(response_context_4)


This paper is about using the 4D joint inversion-segmentation (JIS) algorithm to analyze two vintages of the 4D Sleipner seismic dataset around the CO2 plume in the Norwegian North Sea. The JIS algorithm is used to estimate reliable reservoir property changes through time, mitigate non-repeatable noise imprint from the data in the inverted models, produce high-resolution baseline and model estimates, and classify the time-lapse changes into expected 4D scenarios.
