# Dataset Preprocessing: Grouping by Document ID

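This notebook groups the sentence-level records of the SciER dataset by document ID. For each document it collects the sentences in file order and merges the relations (`rel_plus`) of every sentence into one list. It then splits each grouped document into chunks for RAG.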

In [1]:
import json
from collections import defaultdict
from pathlib import Path

In [2]:
def process_dataset(input_file, output_file=None):
    """
    Process the JSONL dataset to group sentences by document ID.
    Each document will contain:
    - doc_id: The document identifier
    - content: List of all sentences in the document, in file order
    - relations: All extracted relations (rel_plus) from all sentences

    If output_file is None, nothing is written and the grouped list is only returned.
    """
    # Dictionary to group data by document ID
    documents = defaultdict(lambda: {
        'doc_id': None,
        'content': [],
        'relations': []
    })
    
    # Read the JSONL file
    with open(input_file, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which json.loads would reject
            data = json.loads(line)
            doc_id = data['doc_id']
            
            # Initialize doc_id if not set
            if documents[doc_id]['doc_id'] is None:
                documents[doc_id]['doc_id'] = doc_id
            
            # Add sentence to content
            documents[doc_id]['content'].append(data['sentence'])
            
            # Add relations (using rel_plus as it includes entity types)
            if data.get('rel_plus'):
                documents[doc_id]['relations'].extend(data['rel_plus'])
    
    # Convert to list and write to JSON file
    documents_list = list(documents.values())
    
    # Only write to file if output_file is provided
    if output_file is not None:
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(documents_list, f, indent=2, ensure_ascii=False)
    
    return documents_list
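
To make the grouping concrete, the sketch below applies the same idea to three in-memory records instead of a file. The record values are invented for illustration; only the field names `doc_id`, `sentence`, and `rel_plus` come from the dataset as used above. `group_records` is a hypothetical helper, not part of this notebook.

```python
import json
from collections import defaultdict

# Toy JSONL lines (invented values; field names follow the usage above)
toy_lines = [
    json.dumps({"doc_id": "d1", "sentence": "B is part of A .",
                "rel_plus": [["B:Method", "Part-Of", "A:Method"]]}),
    json.dumps({"doc_id": "d2", "sentence": "C is new .", "rel_plus": []}),
    json.dumps({"doc_id": "d1", "sentence": "A is fast .", "rel_plus": []}),
]

def group_records(lines):
    """Group JSONL records by doc_id, keeping documents in first-seen order."""
    grouped = defaultdict(lambda: {"doc_id": None, "content": [], "relations": []})
    for line in lines:
        record = json.loads(line)
        doc = grouped[record["doc_id"]]
        doc["doc_id"] = record["doc_id"]
        doc["content"].append(record["sentence"])
        # A missing or empty rel_plus contributes no relations
        doc["relations"].extend(record.get("rel_plus") or [])
    return list(grouped.values())

toy_docs = group_records(toy_lines)
for doc in toy_docs:
    print(doc["doc_id"], len(doc["content"]), len(doc["relations"]))
```

Records that share a `doc_id` do not have to be adjacent in the input: the second `d1` record is appended to the same document even though a `d2` record sits between them.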

## Process Train Dataset

In [3]:
# Define paths
train_input = Path('dataset/SciER/LLM/train.jsonl')
train_output = Path('dataset/SciER/LLM/train_by_document.json')

# Process the train dataset
train_docs = process_dataset(train_input, train_output)

print(f"Processed {len(train_docs)} documents from train.jsonl")
print(f"Output saved to: {train_output}")

# Show example of first document
if train_docs:
    print("\nExample - First Document:")
    print(f"Doc ID: {train_docs[0]['doc_id']}")
    print(f"Number of sentences: {len(train_docs[0]['content'])}")
    print(f"Number of relations: {len(train_docs[0]['relations'])}")
    print(f"\nFirst sentence: {train_docs[0]['content'][0][:100]}...")
    if train_docs[0]['relations']:
        print(f"First relation: {train_docs[0]['relations'][0]}")

Processed 80 documents from train.jsonl
Output saved to: dataset\SciER\LLM\train_by_document.json

Example - First Document:
Doc ID: 51923817
Number of sentences: 51
Number of relations: 57

First sentence: We propose CornerNet , a new approach to object detection where we detect an object bounding box as ...
First relation: ['convolution neural network:Method', 'Part-Of', 'CornerNet:Method']
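
Each `rel_plus` entry printed above is a triple whose first and last elements are strings of the form `name:Type`. A minimal sketch for splitting such a triple into its parts follows. `parse_entity` and `parse_relation` are hypothetical helpers, and splitting on the last colon is an assumption that entity types never contain a colon.

```python
def parse_entity(tagged):
    """Split 'name:Type' on the last colon; return (name, None) if no colon."""
    name, sep, entity_type = tagged.rpartition(":")
    if not sep:
        return tagged, None
    return name, entity_type

def parse_relation(triple):
    """Turn a [subject, label, object] triple into a flat dictionary."""
    subject, label, obj = triple
    subject_name, subject_type = parse_entity(subject)
    object_name, object_type = parse_entity(obj)
    return {
        "subject": subject_name,
        "subject_type": subject_type,
        "label": label,
        "object": object_name,
        "object_type": object_type,
    }

# The first relation shown in the output above
example = ['convolution neural network:Method', 'Part-Of', 'CornerNet:Method']
parsed = parse_relation(example)
print(parsed)
```

Using `rpartition` rather than `split(":")` keeps entity names intact when the name itself contains a colon.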


## Process Remaining Datasets (Dev, Test, Test OOD)

In [4]:
# Process all datasets
datasets = {
    'dev': ('dataset/SciER/LLM/dev.jsonl', 'dataset/SciER/LLM/dev_by_document.json'),
    'test': ('dataset/SciER/LLM/test.jsonl', 'dataset/SciER/LLM/test_by_document.json'),
    'test_ood': ('dataset/SciER/LLM/test_ood.jsonl', 'dataset/SciER/LLM/test_ood_by_document.json')
}

results = {}
for name, (input_file, output_file) in datasets.items():
    input_path = Path(input_file)
    output_path = Path(output_file)
    
    if input_path.exists():
        docs = process_dataset(input_path, output_path)
        results[name] = len(docs)
        print(f"✓ {name}: {len(docs)} documents -> {output_path}")
    else:
        print(f"✗ {name}: File not found - {input_path}")

# Summary
print("\n" + "="*50)
print("Summary:")
print("="*50)
print(f"Train:    {len(train_docs)} documents")
for name, count in results.items():
    print(f"{name.capitalize():9} {count} documents")

✓ dev: 10 documents -> dataset\SciER\LLM\dev_by_document.json
✓ test: 10 documents -> dataset\SciER\LLM\test_by_document.json
✓ test_ood: 6 documents -> dataset\SciER\LLM\test_ood_by_document.json

Summary:
Train:    80 documents
Dev       10 documents
Test      10 documents
Test_ood  6 documents
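
One way to sanity-check the grouping is to confirm that nothing was lost: the total number of sentences and relations in a grouped file should match the totals in its source JSONL. The sketch below shows such a check on small temporary files with invented contents. `count_source` and `count_grouped` are hypothetical helpers, not part of this notebook.

```python
import json
import tempfile
from pathlib import Path

def count_source(jsonl_path):
    """Count records and rel_plus entries in a sentence-level JSONL file."""
    n_records = n_relations = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            n_records += 1
            n_relations += len(record.get("rel_plus") or [])
    return n_records, n_relations

def count_grouped(grouped_path):
    """Count sentences and relations in a grouped-by-document JSON file."""
    with open(grouped_path, "r", encoding="utf-8") as f:
        docs = json.load(f)
    n_sentences = sum(len(d["content"]) for d in docs)
    n_relations = sum(len(d["relations"]) for d in docs)
    return n_sentences, n_relations

# Demo on temporary files (toy data, not SciER)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "toy.jsonl"
    dst = Path(tmp) / "toy_by_document.json"
    records = [
        {"doc_id": "d1", "sentence": "s1", "rel_plus": [["a:Method", "Part-Of", "b:Method"]]},
        {"doc_id": "d1", "sentence": "s2", "rel_plus": []},
    ]
    src.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    grouped = [{"doc_id": "d1", "content": ["s1", "s2"],
                "relations": [["a:Method", "Part-Of", "b:Method"]]}]
    dst.write_text(json.dumps(grouped), encoding="utf-8")
    source_counts = count_source(src)
    grouped_counts = count_grouped(dst)

print(source_counts, grouped_counts)
assert source_counts == grouped_counts, "grouping lost or duplicated data"
```

With the files from this notebook, the same comparison would be `count_source(train_input) == count_grouped(train_output)`.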


## Inspect Sample Document Structure

In [5]:
# Display a complete sample document
if train_docs:
    sample_doc = train_docs[0]
    
    print("="*80)
    print(f"DOCUMENT ID: {sample_doc['doc_id']}")
    print("="*80)
    
    print("\nCONTENT:")
    print("-"*80)
    for i, sentence in enumerate(sample_doc['content'][:3], 1):
        print(f"{i}. {sentence}")
    if len(sample_doc['content']) > 3:
        print(f"... and {len(sample_doc['content']) - 3} more sentences")
    
    print("\nRELATIONS:")
    print("-"*80)
    for i, relation in enumerate(sample_doc['relations'][:5], 1):
        print(f"{i}. {relation}")
    if len(sample_doc['relations']) > 5:
        print(f"... and {len(sample_doc['relations']) - 5} more relations")
    
    print("\n" + "="*80)
    print(f"Total Sentences: {len(sample_doc['content'])}")
    print(f"Total Relations: {len(sample_doc['relations'])}")
    print("="*80)

DOCUMENT ID: 51923817

CONTENT:
--------------------------------------------------------------------------------
1. We propose CornerNet , a new approach to object detection where we detect an object bounding box as a pair of keypoints , the top - left corner and the bottom - right corner , using a single convolution neural network .
2. Experiments show that CornerNet achieves a 4 2 . 2 % AP on MS COCO , outperforming all existing one - stage detectors .
3. Object detectors based on convolutional neural networks ( ConvNets ) ( Krizhevsky et al. , 2 0 1 2 ; Simonyan and Zisserman , 2 0 1 4 ; He et al. , 2 0 1 6 ) have achieved state - of - the - art results on various challenging benchmarks ( Lin et al. , 2 0 1 4 ; Deng et al. , 2 0 0 9 ; Everingham et al. , 2 0 1 5 ) .
... and 48 more sentences

RELATIONS:
--------------------------------------------------------------------------------
1. ['convolution neural network:Method', 'Part-Of', 'CornerNet:Method']
2. ['CornerNet:Method', 'Used

## Chunk Documents for RAG

In [6]:
# --- Chunking for RAG ---
def chunk_document(doc, num_chunks=3):
    """
    Split one grouped document into num_chunks contiguous chunks.

    Grouped documents hold a flat relation list with no sentence mapping,
    so relations are split proportionally by position. A chunk's relations
    are therefore only approximately those of its own sentences.
    """
    sentences = doc['content']
    relations = doc['relations']
    n = len(sentences)
    # Ceil division; trailing chunks can be empty for very short documents
    chunk_size = (n + num_chunks - 1) // num_chunks
    rel_chunk_size = (len(relations) + num_chunks - 1) // num_chunks
    chunks = []
    for i in range(num_chunks):
        start = i * chunk_size
        end = min((i + 1) * chunk_size, n)
        chunk_sentences = sentences[start:end]
        # Take the i-th proportional slice of the flat relation list
        rel_start = i * rel_chunk_size
        rel_end = min((i + 1) * rel_chunk_size, len(relations))
        chunk_relations = relations[rel_start:rel_end]
        chunks.append({
            'doc_id': doc['doc_id'],
            'chunk_id': i + 1,
            'content': chunk_sentences,
            'relations': chunk_relations
        })
    return chunks

def process_and_chunk_dataset(input_file, output_file, num_chunks=3):
    """
    Process the JSONL dataset, group by document, then split each document into chunks for RAG.
    Each chunk contains:
    - doc_id
    - chunk_id
    - content: sentences in the chunk
    - relations: a proportional slice of the document's relation list (not aligned to the chunk's sentences)
    """
    # Use previous process_dataset to group by document
    documents = process_dataset(input_file, None)
    all_chunks = []
    for doc in documents:
        chunks = chunk_document(doc, num_chunks=num_chunks)
        all_chunks.extend(chunks)
    
    # Create output directory if it doesn't exist (accepts str or Path)
    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_chunks, f, indent=2, ensure_ascii=False)
    return all_chunks

# --- Process all datasets with chunking ---
output_dir = Path('postprocessed-dataset')
output_dir.mkdir(exist_ok=True)

datasets_to_chunk = {
    'train': (Path('dataset/SciER/LLM/train.jsonl'), output_dir / 'train_chunks.json'),
    'dev': (Path('dataset/SciER/LLM/dev.jsonl'), output_dir / 'dev_chunks.json'),
    'test': (Path('dataset/SciER/LLM/test.jsonl'), output_dir / 'test_chunks.json'),
    'test_ood': (Path('dataset/SciER/LLM/test_ood.jsonl'), output_dir / 'test_ood_chunks.json')
}

print("="*80)
print("Processing and Chunking All Datasets for RAG")
print("="*80)

chunk_results = {}
for name, (input_path, output_path) in datasets_to_chunk.items():
    if input_path.exists():
        print(f"\nProcessing {name}...")
        chunks = process_and_chunk_dataset(input_path, output_path, num_chunks=3)
        chunk_results[name] = chunks
        print(f"✓ {name}: {len(chunks)} chunks saved to {output_path}")
        
        # Show example from first chunk
        if chunks:
            print(f"  Example - Doc ID: {chunks[0]['doc_id']}, Chunk: {chunks[0]['chunk_id']}")
            print(f"  Sentences: {len(chunks[0]['content'])}, Relations: {len(chunks[0]['relations'])}")
    else:
        print(f"✗ {name}: File not found - {input_path}")

# Summary
print("\n" + "="*80)
print("Summary - Chunked Datasets:")
print("="*80)
for name, chunks in chunk_results.items():
    num_docs = len(set(chunk['doc_id'] for chunk in chunks))
    print(f"{name.upper():10} {len(chunks):4} chunks from {num_docs:3} documents")
print("="*80)

Processing and Chunking All Datasets for RAG

Processing train...
✓ train: 240 chunks saved to postprocessed-dataset\train_chunks.json
  Example - Doc ID: 51923817, Chunk: 1
  Sentences: 17, Relations: 19

Processing dev...
✓ dev: 30 chunks saved to postprocessed-dataset\dev_chunks.json
  Example - Doc ID: 53719258, Chunk: 1
  Sentences: 36, Relations: 73

Processing test...
✓ test: 30 chunks saved to postprocessed-dataset\test_chunks.json
  Example - Doc ID: 192546007, Chunk: 1
  Sentences: 19, Relations: 50

Processing test_ood...
✓ test_ood: 18 chunks saved to postprocessed-dataset\test_ood_chunks.json
  Example - Doc ID: AAAI2024, Chunk: 1
  Sentences: 29, Relations: 25

Summary - Chunked Datasets:
TRAIN       240 chunks from  80 documents
DEV          30 chunks from  10 documents
TEST         30 chunks from  10 documents
TEST_OOD     18 chunks from   6 documents
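
The proportional split used above cannot guarantee that a chunk's relations come from its own sentences, because the grouped documents store one flat relation list. A possible alternative, sketched below, keeps relations per sentence and slices both together. It assumes a hypothetical `sentence_relations` field (one list of relations per sentence, parallel to `content`), which the grouping step in this notebook does not produce. `balanced_bounds` and `chunk_aligned` are likewise hypothetical names.

```python
def balanced_bounds(n, num_chunks):
    """Return (start, end) pairs covering range(n); sizes differ by at most 1."""
    k = min(num_chunks, n)  # never create more chunks than sentences
    if k <= 0:
        return []
    base, extra = divmod(n, k)
    bounds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        bounds.append((start, start + size))
        start += size
    return bounds

def chunk_aligned(doc, num_chunks=3):
    """Chunk sentences and attach only the relations of those sentences."""
    chunks = []
    bounds = balanced_bounds(len(doc["content"]), num_chunks)
    for chunk_id, (start, end) in enumerate(bounds, 1):
        relations = [rel
                     for rels in doc["sentence_relations"][start:end]
                     for rel in rels]
        chunks.append({
            "doc_id": doc["doc_id"],
            "chunk_id": chunk_id,
            "content": doc["content"][start:end],
            "relations": relations,
        })
    return chunks

# Toy document (invented values; `sentence_relations` is an assumed field)
toy_doc = {
    "doc_id": "toy",
    "content": ["s1", "s2", "s3", "s4"],
    "sentence_relations": [
        [],
        [["a:Method", "Part-Of", "b:Method"]],
        [],
        [["c:Method", "Part-Of", "d:Method"]],
    ],
}
toy_chunks = chunk_aligned(toy_doc, num_chunks=3)
for chunk in toy_chunks:
    print(chunk["chunk_id"], chunk["content"], len(chunk["relations"]))
```

For four sentences and three chunks, ceil division yields chunk sizes 2, 2, and 0, whereas the balanced bounds yield 2, 1, and 1, so no chunk is empty.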
