# Dataset creation for SFT and continued pretraining
From EAGE abstracts released for the Annual Hackathon in 2023

In [1]:
# !pip install gdown --quiet

In [3]:

# !pip install langchain --quiet
# !pip install langchain_nvidia_ai_endpoints --quiet
# !pip install pypdf --quiet

In [4]:
!nvidia-smi

Mon Jun  3 18:26:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA H100 PCIe               On  | 00000000:01:00.0 Off |                    0 |
| N/A   45C    P0              53W / 310W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Imports

In [9]:
# import the relevant libraries
import json
import os

import tqdm
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from multiprocessing import Pool
from pypdf import PdfReader

In [10]:

from pprint import PrettyPrinter
pprint = PrettyPrinter(indent=4).pprint
# os.environ['NVIDIA_API_KEY'] = "<YOUR NVIDIA API KEY HERE>"

## Download and extract documents

In [4]:
!pwd

/


In [5]:
# !gdown 1HmxAZerbIQfHo3evhys1nGX2lspda1RX -O /workspace/data/documents.zip

In [6]:
# !unzip -o /workspace/data/documents.zip -d /workspace/data/

## Extract raw texts

In [7]:
PDF_PATH = "/workspace/data/Annual 2023_proceedings for Hackathon/"

Create a list of all .pdf filenames

In [11]:
pdf_files = []
folder_content = os.listdir(PDF_PATH)
for file in tqdm.tqdm(folder_content):
    if file.endswith(".pdf"):
        pdf_files.append(os.path.join(PDF_PATH, file))

100%|██████████| 879/879 [00:00<00:00, 181920.12it/s]


Spawn a pool of worker processes for text parsing

In [12]:
def extract_text_from_pdf(pdf_path):
    # Read every page of the PDF and join the page texts with newlines
    reader = PdfReader(pdf_path)
    texts = [page.extract_text() for page in reader.pages]
    return "\n".join(texts)

with Pool(64) as pool:
    documents = pool.map(extract_text_from_pdf, pdf_files)

Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 11 0 (offset 0)
Ignoring wrong pointing object 11 0 (offset 0)
Ignoring wrong pointing object 11 0 (offset 0)
Ignoring wrong pointing object 13 0 (offset 0)
Ignoring wrong pointing object 13 0 (offset 0)
Ignoring wrong pointing object 13 0 (offset 0)
Ignoring wrong pointing object 15 0 (offset 0)
Ignoring wrong pointing object 15 0 (offset 0)
Ignoring wrong pointing object 15 0 (offset 0)
Ignoring wrong pointing object 17 0 (offset 0)
Ignoring wrong pointing object 17 0 (offset 0)
Ignoring wrong pointing object 24 0 (offset 0)
Ignoring wrong pointing object 11 0 (offset 0)
Ignoring wrong pointing object 14 0 (offset 0)
Ignoring wrong pointing object 35 0 (offset 0)
Ignoring wrong poin

Populate .jsonl from extracted .pdf files

In [24]:
os.makedirs("/workspace/data/raw", exist_ok=True)
with open("/workspace/data/raw/documents.jsonl", "w") as f:
    for document in tqdm.tqdm(documents):
        document = document.strip().replace("84th EAGE Annual Conference & Exhibition  ", "")
        f.write(json.dumps({"text": document}) + "\n")

100%|██████████| 879/879 [00:00<00:00, 6101.15it/s]
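
Because `json.dumps` escapes the newlines inside each document, every record stays on a single line of the `.jsonl` file. A minimal sanity check for that layout could look like the sketch below. The helper name `count_jsonl_records` is hypothetical and not part of the original notebook; it accepts any iterable of lines, so an open file handle works as well as the in-memory list used here.

```python
import json


def count_jsonl_records(lines, key="text"):
    """Count JSONL records, raising ValueError if a record lacks `key`."""
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict) or key not in record:
            raise ValueError(f"line {number} has no '{key}' field")
        count += 1
    return count


# In-memory stand-in for the file written above (assumed sample data)
sample = [
    json.dumps({"text": "first document"}) + "\n",
    json.dumps({"text": "second\ndocument"}) + "\n",
]
print(count_jsonl_records(sample))  # 2
```

To check the real file, pass the handle returned by `open("/workspace/data/raw/documents.jsonl")` instead of `sample`.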


In [7]:
!head /workspace/data/raw/documents.jsonl

{"text": "Multisensor UAS testing to support avalanche forecasting and monitoring  \n \nIntroduction  \n Snow avalanches pose a major natural hazard in Arctic and Alpine countries. Transport networks in \nNorway are especially vulnerable and for this reason, the GEOSFAIR (GEO hazard Survey F rom the \nAIR) research project is investigating how uncrewed aerial systems (UAS) can help to forecast and \nmonitor roadside avalanches and other natural hazards.  The main objective of the project is  to develop \nmethodologies and workflows  for UAS data collect ion in the Norwegian Public Road s Administration  \n(NPRA)  decision support system , for faster and better assessment of roadside  avalanche hazard s.  \n \nRemote sensing provides good spatial coverage to improve avalanche monitoring, from ground, air or \nspace and using optical, laser or radar sensors (Eckerstorfer  et al., 2016). UAS  remote sensing and \nsurface mapping has been  used for improving avalanche forecasting using eit

## Cleaning raw documents with NeMo Curator

In [26]:
!text_cleaning --help

usage: 
Text cleaning and language filtering

Takes as input a directory consisting of .jsonl files with one
document per line and outputs to a separate directory the text
with fixed unicode. Also, performs language filtering using
the 'language' field within each JSON object.

       [-h] [--input-data-dir INPUT_DATA_DIR]
       [--input-local-data-dir INPUT_LOCAL_DATA_DIR] --output-clean-dir
       OUTPUT_CLEAN_DIR
       [--output-removed-document-dir OUTPUT_REMOVED_DOCUMENT_DIR]
       [--log-dir LOG_DIR] [--min-document-length MIN_DOCUMENT_LENGTH]
       [--output-language OUTPUT_LANGUAGE] [--cpus-per-node CPUS_PER_NODE]

options:
  -h, --help            show this help message and exit
  --input-data-dir INPUT_DATA_DIR
                        Input directory consisting of .jsonl files that are
                        accessible to all nodes. Use this for a distributed
                        file system (default: None)
  --input-local-data-dir INPUT_LOCAL_DATA_DIR
                

In [27]:
!text_cleaning --input-data-dir /workspace/data/raw --output-clean-dir /workspace/data/clean

In [28]:
!head /workspace/data/clean/documents.jsonl

{"text": "Multisensor UAS testing to support avalanche forecasting and monitoring  \n \nIntroduction  \n Snow avalanches pose a major natural hazard in Arctic and Alpine countries. Transport networks in \nNorway are especially vulnerable and for this reason, the GEOSFAIR (GEO hazard Survey F rom the \nAIR) research project is investigating how uncrewed aerial systems (UAS) can help to forecast and \nmonitor roadside avalanches and other natural hazards.  The main objective of the project is  to develop \nmethodologies and workflows  for UAS data collect ion in the Norwegian Public Road s Administration  \n(NPRA)  decision support system , for faster and better assessment of roadside  avalanche hazard s.  \n \nRemote sensing provides good spatial coverage to improve avalanche monitoring, from ground, air or \nspace and using optical, laser or radar sensors (Eckerstorfer  et al., 2016). UAS  remote sensing and \nsurface mapping has been  used for improving avalanche forecasting using eit

## Preparing data for continued pretraining
Creating `*.idx` and `*.bin` files

In [30]:
!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input /workspace/data/clean/documents.jsonl \
    --json-keys text \
    --tokenizer-library sentencepiece \
    --tokenizer-model /workspace/models/Llama-2-7b-chat-hf/tokenizer.model \
    --output-prefix /workspace/data/clean/hackathon

[NeMo I 2024-06-03 11:20:47 tokenizer_utils:185] Getting SentencePiece with model: /workspace/models/Llama-2-7b-chat-hf/tokenizer.model
Vocab size: 32000
Output prefix: /workspace/data/clean/hackathon
Time to startup: 0.033612966537475586
Processing file /workspace/data/clean/documents.jsonl 1/1
[NeMo I 2024-06-03 11:20:47 tokenizer_utils:185] Getting SentencePiece with model: /workspace/models/Llama-2-7b-chat-hf/tokenizer.model
Processed 100 documents (74.32878124697075 docs/s, 0.8518148535852177 MB/s).
Processed 200 documents (75.22731791273134 docs/s, 0.8709619475826994 MB/s).
Processed 300 documents (75.90809837632872 docs/s, 0.8812034476742628 MB/s).
Processed 400 documents (77.02688587987312 docs/s, 0.885677219310585 MB/s).
Processed 500 documents (77.18279159749733 docs/s, 0.885471173599083 MB/s).
Processed 600 documents (77.9499979758134 docs/s, 0.8895413917972526 MB/s).
Processed 700 documents (77.33576908335414 docs/s, 0.8886260467578556 MB/s).
Processed 800 documents (77.674

In [33]:
!ls -nh /workspace/data/clean/

total 17M
-rw-r--r-- 1 0 0  11M Jun  3 11:09 documents.jsonl
-rw-r--r-- 1 0 0 6.2M Jun  3 11:20 hackathon_text_document.bin
-rw-r--r-- 1 0 0  18K Jun  3 11:20 hackathon_text_document.idx


## Preparing data for LLM tuning

Split text into overlapping chunks

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=1500)

with open("/workspace/data/clean/documents.jsonl", "r") as f:
    documents = [json.loads(line)["text"] for line in f.readlines()]
document_chunks = [text_splitter.split_text(document) for document in documents]
document_chunks_flat = [chunk for chunks in document_chunks for chunk in chunks]
print(f'{len(document_chunks_flat)} chunks extracted out of {len(documents)} pdf documents')

5551 chunks extracted out of 879 pdf documents
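
To illustrate what `chunk_size=3000` and `chunk_overlap=1500` mean, here is a simplified sketch of overlapping character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks, so its chunks vary in length). The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 3000, chunk_overlap: int = 1500) -> list[str]:
    """Cut `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers make the overlap visible: each chunk repeats the last
# two characters of the previous one.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of half the chunk size, most of the text appears in two chunks, which gives the LLM two chances to see each passage with different surrounding context.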


Define LLM and prompt

In [12]:
INSTRUCTION_PROMPT = """Given the paragraph after <INPUT_START> tag, create a very good geoscience-related question and answer pair. Your output should be in a .json format containing the following fields: ['question', 'answer']
Restrict the question to the context information provided. The questions should use information from the passage, but should not refer to the originating text explicitly (you cannot use 'according to', 'based on', and similar).
Respond only with .json output, add no other comments. If generating a good question and answer pair is not possible, output <skip> instead.
<INPUT_START>"""
# CHUNKS_TO_PROCESS = 10
CHUNKS_TO_PROCESS = None # means all

llm = ChatNVIDIA(
    model="ai-llama3-70b",
    temperature=0.2,
    max_tokens=256
)



Submit batched requests to the LLM

**WARNING! Generating the QA pairs takes about 30 minutes**

In [13]:
# qa_pairs = await llm.abatch(['\n'.join([INSTRUCTION_PROMPT, chunk]) for chunk in document_chunks_flat[:CHUNKS_TO_PROCESS]], 
#                             config={"max_concurrency": 10})
# qa_pairs = [qa_pair.content for qa_pair in qa_pairs if qa_pair.content != "<skip>"]
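
The filter above drops a response only when it is exactly `<skip>`. A slightly more tolerant check, sketched below, also catches responses with surrounding whitespace, different casing, or trailing punctuation. The helper `is_skip` is hypothetical, and the sample responses are made up for illustration.

```python
def is_skip(content: str) -> bool:
    """Return True when the model declined to produce a QA pair."""
    return "<skip>" in content.strip().lower()


# Assumed sample responses, not actual model output
responses = ['{"question": "Q?", "answer": "A"}', "<skip>", " <skip>\n", "<SKIP>."]
kept = [r for r in responses if not is_skip(r)]
print(len(kept))  # 1
```

The substring test would also discard a genuine QA pair whose text happens to contain `<skip>`, which seems unlikely for geoscience abstracts but is worth keeping in mind.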

In [16]:
pprint(qa_pairs[0])

('{\n'
 '"question": "What type of climate is characterized by large amounts of '
 'precipitation and a large influence of snow wind transport on avalanche '
 'risk?",\n'
 '"answer": "Coastal climate"\n'
 '}')


Parse the generated QA pairs into a valid .jsonl document

In [20]:
failed_count = 0
warning_count = 0
with open("/workspace/data/clean/documents_sft.jsonl", "w") as f:
    for qa_pair in qa_pairs:
        # Checking if json is correct
        try:
            json.loads(qa_pair)
        except json.JSONDecodeError:
            print(f'Failed to read {qa_pair} as a valid JSON')
            failed_count += 1
            continue
        jsonl_line = qa_pair.replace("\n", "").replace('"question":', '"input":').replace('"answer":', '"output":').strip()
        json_line_obj = json.loads(jsonl_line)
        if isinstance(json_line_obj, list):
            print(f'WARNING: {jsonl_line}')
            jsonl_line = json.dumps(json_line_obj[0])
            warning_count += 1
        f.write(jsonl_line + "\n")

print('Done')
print(f'Failed\t{failed_count} / {len(qa_pairs)}')
print(f'Warnings\t{warning_count} / {len(qa_pairs)}')

Done
Failed	0 / 5518


In [21]:
pprint(jsonl_line)

('{"input": "What is the primary advantage of using the SPAFSM method over the '
 'standard FSM method in complex media?", "output": "The accuracy of travel '
 'time calculation is effectively improved."}')


## Split dataset into train / val / test

In [None]:
import os
import json


def read_and_split(fname: str, out_dir: str):
    # Open the original file
    with open(fname, 'r') as original_file:
        lines = original_file.readlines()

    # Calculate partition sizes
    total_lines = len(lines)
    test_size = int(total_lines * 0.1)
    val_size = int(total_lines * 0.1)
    # The rest goes to the train partition

    print(f'Total {total_lines} lines --> test: {test_size}, val: {val_size}, train: {total_lines - test_size - val_size}')
    print(f'Iterate over {len(lines)} lines in {fname}')

    with open(os.path.join(out_dir, 'data_test.jsonl'), 'w') as test_file, \
         open(os.path.join(out_dir, 'data_val.jsonl'), 'w') as val_file, \
         open(os.path.join(out_dir, 'data_train.jsonl'), 'w') as train_file:

        # Iterate over each line in the original file
        for i, line in enumerate(lines):
            # Parse each line to verify it is valid JSON before writing it out
            json.loads(line)
            json_line = line

            # Write to appropriate file based on index
            if i < test_size:
                test_file.write(json_line)
            elif i < test_size + val_size:
                val_file.write(json_line)
            else:
                train_file.write(json_line)

In [None]:
read_and_split('/workspace/data/clean/documents_sft.jsonl', '/workspace/data/')
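
As a worked example of the partition arithmetic in `read_and_split`, the sketch below reproduces the size calculation for the 5518 QA pairs reported earlier. The helper `split_sizes` is hypothetical and only mirrors the two `int(total_lines * 0.1)` lines from the function.

```python
def split_sizes(total_lines: int, test_frac: float = 0.1, val_frac: float = 0.1):
    """Return (test, val, train) sizes; int() truncates, so train gets the remainder."""
    test_size = int(total_lines * test_frac)
    val_size = int(total_lines * val_frac)
    return test_size, val_size, total_lines - test_size - val_size


print(split_sizes(5518))  # (551, 551, 4416)
```

Note that the split is sequential and unshuffled: the first 10% of lines go to test, the next 10% to validation, and the rest to train. Because neighbouring chunks overlap, this keeps most chunks from the same abstract in the same partition, although an abstract that straddles a partition boundary can still appear in two of them.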

In [1]:
!head /workspace/data/data_train.jsonl

{"input": "What is a major challenge in exploration and production that can be addressed by improving the predictive power of subsurface models?", "output": "Significant risks associated with exploration and production"}
{"input": "What is a key requirement for producing subsurface models with a higher predictive value in Stratigraphic Forward Models?","output": "Calibration of SFMs, which involves defining initial parameters and conditions leading to plausible models."}
{"input": "What is a major challenge in automatically finding the best-fit subsurface model using forward models?", "output": "The complexity of typical input parameters, such as subsidence history, makes it difficult to test them in an automated way."}
{"input": "What are the two paths to achieve more predictive subsurface models?","output": "Improving the way SFMs are formulated and calibrated, or through the combination of different numerical techniques (hybrid models)"}
{"input": "What can improve the calibration w