# BERT Named Entity Recognition for Court Decisions

This notebook fine-tunes a BERT model for Named Entity Recognition (NER) on Indonesian court decision documents.

## Dataset:

- **HasilAnotasiPutusanBERT.txt**: Annotated data in tab-separated format (one `token<TAB>label` pair per line) for BERT training

## Legal Entities Recognized:

- **B_DEFN/I_DEFN**: Defendant name
- **B_PROS/I_PROS**: Public prosecutor
- **B_JUDG/I_JUDG**: Associate (member) judge
- **B_JUDP/I_JUDP**: Presiding judge
- **B_VERN/I_VERN**: Decision number
- **B_TIMV/I_TIMV**: Decision date
- **B_ARTV/I_ARTV**: Criminal Code (KUHP) article
- **B_CRIA/I_CRIA**: Criminal charge/indictment
- **B_PENA/I_PENA**: Sentence demanded by the prosecution
- **B_PUNI/I_PUNI**: Sentence imposed
- **B_REGI/I_REGI**: Court clerk
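
The tag set above follows a BIO-style scheme with an underscore separator: each entity type contributes a `B_` (begin) and an `I_` (inside) tag, and `O` marks tokens outside any entity. A minimal sketch of how the 23 labels arise from the 11 entity codes; `ENTITY_CODES` and `build_label_set` are illustrative names, not part of the notebook:

```python
# Entity codes copied from the list above; the helper name is hypothetical.
ENTITY_CODES = ["DEFN", "PROS", "JUDG", "JUDP", "VERN", "TIMV",
                "ARTV", "CRIA", "PENA", "PUNI", "REGI"]

def build_label_set(codes):
    """Return the sorted tag list: B_/I_ per entity code plus 'O'."""
    labels = {"O"}
    for code in codes:
        labels.add(f"B_{code}")
        labels.add(f"I_{code}")
    return sorted(labels)

labels = build_label_set(ENTITY_CODES)
label_to_id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # 11 entity types * 2 + 1 = 23
```

Sorting places all `B_` tags first, then the `I_` tags, then `O`, which is the same ordering the notebook obtains from `sorted(set(all_labels))`.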

## Model: Indo-LegalBERT (`archi-ai/Indo-LegalBERT`)


In [1]:
# Import libraries for BERT NER
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, 
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import pandas as pd
import numpy as np
from collections import Counter
import os
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")


✅ Libraries imported successfully!
PyTorch version: 2.5.1+cu121
CUDA available: True
Using device: cuda


## 1. Load and Analyze BERT Dataset


In [2]:
def load_bert_data(file_path):
    """
    Load tab-separated BERT training data.
    Format: token\tlabel
    """
    sentences = []
    current_tokens = []
    current_labels = []
    
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            
            # Skip comments and document markers
            if line.startswith('#') or line.startswith('-DOCSTART-'):
                continue
            
            # An empty line marks the end of a sentence
            elif not line:
                if current_tokens:
                    sentences.append((current_tokens, current_labels))
                    current_tokens = []
                    current_labels = []
            
            else:
                # Parse token and label
                parts = line.split('\t')
                if len(parts) >= 2:
                    token = parts[0].strip()
                    label = parts[1].strip()
                    current_tokens.append(token)
                    current_labels.append(label)
    
    # Add last sentence if exists
    if current_tokens:
        sentences.append((current_tokens, current_labels))
    
    return sentences

# Load dataset
data_path = '../../Datasets/ANOTASI/HasilAnotasiPutusanBERT.txt'
sentences = load_bert_data(data_path)

print(f"📊 Total sentences/documents: {len(sentences):,}")
print(f"📝 Sample sentence:")
print(f"   Tokens: {sentences[0][0][:10]}...")
print(f"   Labels: {sentences[0][1][:10]}...")

📊 Total sentences/documents: 235
📝 Sample sentence:
   Tokens: ['PUTUSAN', 'Nomor', '192/Pid.', 'B/2019/PN', 'Bkl', 'DEMI', 'KEADILAN', 'BERDASARKAN', 'KETUHANAN', 'YANG']...
   Labels: ['O', 'O', 'B_VERN', 'I_VERN', 'I_VERN', 'O', 'O', 'O', 'O', 'O']...


In [3]:
# Analyze the dataset
all_tokens = []
all_labels = []
sent_lengths = []

for tokens, labels in sentences:
    all_tokens.extend(tokens)
    all_labels.extend(labels)
    sent_lengths.append(len(tokens))

# Label distribution
label_counts = Counter(all_labels)
unique_labels = sorted(list(set(all_labels)))

print("📈 Dataset Statistics:")
print(f"   Total tokens: {len(all_tokens):,}")
print(f"   Unique labels: {len(unique_labels)}")
print(f"   Average sentence length: {np.mean(sent_lengths):.1f} tokens")
print(f"   Max sentence length: {max(sent_lengths)} tokens")

print("\n🏷️ Label Distribution:")
for label, count in label_counts.most_common():
    percentage = (count / len(all_labels)) * 100
    print(f"   {label}: {count:,} ({percentage:.2f}%)")

print(f"\n🎯 Entity ratio: {((len(all_labels) - label_counts['O']) / len(all_labels) * 100):.2f}%")

📈 Dataset Statistics:
   Total tokens: 1,513,881
   Unique labels: 23
   Average sentence length: 6442.0 tokens
   Max sentence length: 26277 tokens

🏷️ Label Distribution:
   O: 1,482,761 (97.94%)
   I_DEFN: 7,521 (0.50%)
   I_ARTV: 3,114 (0.21%)
   I_CRIA: 2,745 (0.18%)
   I_JUDG: 2,721 (0.18%)
   B_DEFN: 2,711 (0.18%)
   I_JUDP: 1,932 (0.13%)
   I_PENA: 1,660 (0.11%)
   I_VERN: 1,622 (0.11%)
   I_REGI: 1,145 (0.08%)
   I_PROS: 1,056 (0.07%)
   I_PUNI: 726 (0.05%)
   B_ARTV: 668 (0.04%)
   B_VERN: 666 (0.04%)
   I_TIMV: 570 (0.04%)
   B_JUDG: 437 (0.03%)
   B_PUNI: 356 (0.02%)
   B_CRIA: 355 (0.02%)
   B_REGI: 282 (0.02%)
   B_JUDP: 270 (0.02%)
   B_PROS: 234 (0.02%)
   B_TIMV: 171 (0.01%)
   B_PENA: 158 (0.01%)

🎯 Entity ratio: 2.06%


## 2. Setup BERT Model and Tokenizer


In [4]:
# Set up the BERT model and tokenizer
model_name = "archi-ai/Indo-LegalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create label mappings
label_to_id = {label: i for i, label in enumerate(unique_labels)}
id_to_label = {i: label for label, i in label_to_id.items()}
num_labels = len(unique_labels)

print(f"🤖 Model: {model_name}")
print(f"🏷️ Number of labels: {num_labels}")
print(f"📝 Labels: {unique_labels}")

# Initialize model
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id_to_label,
    label2id=label_to_id
)

model.to(device)
print(f"✅ Model loaded and moved to {device}")

🤖 Model: archi-ai/Indo-LegalBERT
🏷️ Number of labels: 23
📝 Labels: ['B_ARTV', 'B_CRIA', 'B_DEFN', 'B_JUDG', 'B_JUDP', 'B_PENA', 'B_PROS', 'B_PUNI', 'B_REGI', 'B_TIMV', 'B_VERN', 'I_ARTV', 'I_CRIA', 'I_DEFN', 'I_JUDG', 'I_JUDP', 'I_PENA', 'I_PROS', 'I_PUNI', 'I_REGI', 'I_TIMV', 'I_VERN', 'O']


Some weights of BertForTokenClassification were not initialized from the model checkpoint at archi-ai/Indo-LegalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded and moved to cuda


## 3. Data Preprocessing for BERT


In [5]:
def tokenize_and_align_labels(examples, tokenizer, label_to_id, max_length=512):
    """
    Tokenize text and align word-level labels with subword tokens
    """
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        padding='max_length',
        max_length=max_length,
        return_tensors="pt"
    )
    
    labels = []
    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        
        for word_idx in word_ids:
            if word_idx is None:
                # Special token
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First subword of word
                label_ids.append(label_to_id[label[word_idx]])
            else:
                # Subsequent subwords - use -100 to ignore in loss
                label_ids.append(-100)
            previous_word_idx = word_idx
        
        labels.append(label_ids)
    
    tokenized_inputs["labels"] = torch.tensor(labels)
    return tokenized_inputs

print("✅ Tokenization function defined")

✅ Tokenization function defined
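
The alignment rule in `tokenize_and_align_labels` can be seen in isolation on a hand-written `word_ids` list. This is a minimal sketch: the `word_ids` values below are made up for illustration (a real tokenizer decides how words split), and `align_labels` is a hypothetical helper that mirrors the loop above.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_labels, label_to_id):
    """Give each word's first subword the word's label id; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:            # special tokens such as [CLS], [SEP], [PAD]
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous:      # first subword of a new word
            aligned.append(label_to_id[word_labels[word_idx]])
        else:                           # continuation subword
            aligned.append(IGNORE_INDEX)
        previous = word_idx
    return aligned

# Hypothetical split: word 1 is assumed to break into three subwords.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = ["O", "B_VERN", "I_VERN"]
label_to_id = {"B_VERN": 0, "I_VERN": 1, "O": 2}

print(align_labels(word_ids, word_labels, label_to_id))
# [-100, 2, 0, -100, -100, 1, -100]
```

Only one position per word carries a real label, so the loss and the metrics are computed per word rather than per subword.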


In [6]:
class NERDataset(Dataset):
    def __init__(self, sentences, tokenizer, label_to_id, max_length=512):
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.label_to_id = label_to_id
        self.max_length = max_length
    
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, idx):
        tokens, labels = self.sentences[idx]
        
        # Tokenize
        encoding = self.tokenizer(
            tokens,
            truncation=True,
            is_split_into_words=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors="pt"
        )
        
        # Align labels
        word_ids = encoding.word_ids()
        aligned_labels = []
        previous_word_idx = None
        
        for word_idx in word_ids:
            if word_idx is None:
                aligned_labels.append(-100)
            elif word_idx != previous_word_idx:
                aligned_labels.append(self.label_to_id[labels[word_idx]])
            else:
                aligned_labels.append(-100)
            previous_word_idx = word_idx
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(aligned_labels, dtype=torch.long)
        }

print("✅ NER Dataset class defined")

✅ NER Dataset class defined
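
With `max_length=512` and `truncation=True`, each item is cut to its first 512 subword tokens, while the dataset statistics report an average of about 6,442 tokens per item, so most annotated tokens never reach the model. One common workaround is to split each document into overlapping word windows before building the dataset. The sketch below is an assumption about how that could be done, not the notebook's method; `chunk_document`, `window`, and `stride` are hypothetical names, and the window is measured in words, so it should stay well below 512 to leave room for subword expansion.

```python
def chunk_document(tokens, labels, window=200, stride=150):
    """Split one (tokens, labels) document into overlapping word windows."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + window, len(tokens))
        chunks.append((tokens[start:end], labels[start:end]))
        if end == len(tokens):
            break
        start += stride  # stride < window, so consecutive windows overlap
    return chunks

# Toy document of 500 words, all labelled "O".
doc_tokens = [f"w{i}" for i in range(500)]
doc_labels = ["O"] * 500
chunks = chunk_document(doc_tokens, doc_labels)
print(len(chunks), [len(c[0]) for c in chunks])  # 3 [200, 200, 200]
```

Applying such a helper after the document-level train/validation/test split keeps windows from the same document inside one split. The overlap reduces, but does not eliminate, the chance that an entity is cut at a window boundary.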


## 4. Split Data for Training, Validation, and Testing


In [7]:
# Split data: 70% train, 15% validation, 15% test
train_sentences, temp_sentences = train_test_split(
    sentences, test_size=0.3, random_state=42
)

val_sentences, test_sentences = train_test_split(
    temp_sentences, test_size=0.5, random_state=42
)

print(f"📊 Data Split:")
print(f"   Training: {len(train_sentences):,} sentences")
print(f"   Validation: {len(val_sentences):,} sentences")
print(f"   Testing: {len(test_sentences):,} sentences")

# Create datasets
train_dataset = NERDataset(train_sentences, tokenizer, label_to_id)
val_dataset = NERDataset(val_sentences, tokenizer, label_to_id)
test_dataset = NERDataset(test_sentences, tokenizer, label_to_id)

print(f"✅ Datasets created successfully")

📊 Data Split:
   Training: 164 sentences
   Validation: 35 sentences
   Testing: 36 sentences
✅ Datasets created successfully


## 5. Setup Training Configuration


In [8]:
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=2)
    
    # Remove ignored index (special tokens)
    true_predictions = [
        [id_to_label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id_to_label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    # Flatten for sklearn metrics
    flat_true_labels = [label for sublist in true_labels for label in sublist]
    flat_predictions = [pred for sublist in true_predictions for pred in sublist]
    
    # Calculate metrics
    accuracy = accuracy_score(flat_true_labels, flat_predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        flat_true_labels, flat_predictions, average='weighted', zero_division=0
    )
    
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Training arguments
training_args = TrainingArguments(
    output_dir='./legal-ner-model',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
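    warmup_ratio=0.1,  # 500 warmup steps would exceed the ~63 total steps of this run
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",  # step-based evaluation every 500 steps would never trigger here
    save_strategy="epoch",  # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    save_total_limit=2,
    report_to="none",  # Disable wandb and other reporters
    dataloader_pin_memory=False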
)

print("✅ Training configuration ready")

✅ Training configuration ready
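
Step-based settings are easiest to sanity-check against the number of optimizer steps the run will actually take. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (`total_training_steps` is a hypothetical helper):

```python
import math

def total_training_steps(num_samples, batch_size, epochs, grad_accum=1):
    """Optimizer steps for a run, assuming one device."""
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs

steps = total_training_steps(num_samples=164, batch_size=8, epochs=3)
print(steps)  # 21 steps per epoch * 3 epochs = 63
```

With 164 training items, a batch size of 8, and 3 epochs, the run takes 63 steps, so any step-based warmup, evaluation, or checkpoint threshold larger than 63 would never be reached.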


## 6. Train BERT Model


In [9]:
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

print("🚀 Starting BERT training...")
print(f"   Model: {model_name}")
print(f"   Training samples: {len(train_dataset):,}")
print(f"   Validation samples: {len(val_dataset):,}")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Device: {device}")

# Start training
trainer.train()

print("✅ Training completed!")

🚀 Starting BERT training...
   Model: archi-ai/Indo-LegalBERT
   Training samples: 164
   Validation samples: 35
   Epochs: 3
   Batch size: 8
   Device: cuda


Step,Training Loss,Validation Loss


✅ Training completed!


## 7. Evaluate Model Performance


In [10]:
# Evaluate on test set
print("📊 Evaluating on test set...")
test_results = trainer.evaluate(eval_dataset=test_dataset)

print("\n🎯 Test Results:")
for key, value in test_results.items():
    if key.startswith('eval_'):
        metric_name = key.replace('eval_', '').title()
        print(f"   {metric_name}: {value:.4f}")

# Detailed predictions for analysis
predictions = trainer.predict(test_dataset)
test_predictions = np.argmax(predictions.predictions, axis=2)

# Convert to readable format
test_true_labels = []
test_pred_labels = []

for i in range(len(test_predictions)):
    true_labels = []
    pred_labels = []
    
    for j in range(len(test_predictions[i])):
        if predictions.label_ids[i][j] != -100:
            true_labels.append(id_to_label[predictions.label_ids[i][j]])
            pred_labels.append(id_to_label[test_predictions[i][j]])
    
    test_true_labels.extend(true_labels)
    test_pred_labels.extend(pred_labels)

# Classification report
print("\n📋 Detailed Classification Report:")
print(classification_report(test_true_labels, test_pred_labels, digits=4))

📊 Evaluating on test set...



🎯 Test Results:
   Loss: 0.3590
   Accuracy: 0.9121
   F1: 0.8763
   Precision: 0.8561
   Recall: 0.9121
   Runtime: 71.1608
   Samples_Per_Second: 0.5060
   Steps_Per_Second: 0.0700

📋 Detailed Classification Report:
              precision    recall  f1-score   support

      B_ARTV     0.0000    0.0000    0.0000        11
      B_CRIA     0.0000    0.0000    0.0000        34
      B_DEFN     0.0000    0.0000    0.0000        87
      B_PENA     0.0000    0.0000    0.0000        18
      B_PUNI     0.0000    0.0000    0.0000        10
      B_VERN     0.0000    0.0000    0.0000        77
      I_ARTV     0.0000    0.0000    0.0000        45
      I_CRIA     0.0000    0.0000    0.0000       315
      I_DEFN     0.7478    0.3007    0.4289       286
      I_PENA     0.0000    0.0000    0.0000       156
      I_PUNI     0.0000    0.0000    0.0000        20
      I_VERN     0.7500    0.0833    0.1500       216
           O     0.9138    0.9975    0.9538     12404

    accuracy           
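
The weighted averages reported above are dominated by the `O` class, which accounts for 12,404 of the 13,679 evaluated tokens. Recomputing the averages from the per-class rows makes the gap visible. The sketch below only re-uses the F1 and support values printed in the report; `macro_and_weighted` is a hypothetical helper.

```python
# (f1, support) per class, copied from the classification report above.
REPORT = {
    "B_ARTV": (0.0, 11), "B_CRIA": (0.0, 34), "B_DEFN": (0.0, 87),
    "B_PENA": (0.0, 18), "B_PUNI": (0.0, 10), "B_VERN": (0.0, 77),
    "I_ARTV": (0.0, 45), "I_CRIA": (0.0, 315), "I_DEFN": (0.4289, 286),
    "I_PENA": (0.0, 156), "I_PUNI": (0.0, 20), "I_VERN": (0.1500, 216),
    "O": (0.9538, 12404),
}

def macro_and_weighted(report):
    """Return (macro F1, support-weighted F1) from {label: (f1, support)}."""
    total_support = sum(support for _, support in report.values())
    macro = sum(f1 for f1, _ in report.values()) / len(report)
    weighted = sum(f1 * support for f1, support in report.values()) / total_support
    return macro, weighted

macro, weighted = macro_and_weighted(REPORT)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
# macro F1 = 0.1179, weighted F1 = 0.8762
```

The weighted value matches the reported test F1 up to rounding, while the macro average shows that most entity classes are not being recognized at all.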

## 8. Test Predictions on New Text


In [11]:
def predict_entities(text, model, tokenizer, id_to_label, device):
    """
    Predict entities in new text
    """
    # Split on whitespace so predictions map back to whole words
    tokens = text.split()

    # Encode; keep the BatchEncoding so word_ids() stays available
    encoding = tokenizer(
        tokens,
        is_split_into_words=True,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    word_ids = encoding.word_ids(batch_index=0)

    # Move tensors to the model's device
    inputs = {k: v.to(device) for k, v in encoding.items()}

    # Predict
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()

    # Use the first subword's prediction for each word;
    # words cut off by truncation keep the default 'O'
    predicted_labels = ['O'] * len(tokens)
    seen_words = set()
    for position, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx not in seen_words:
            predicted_labels[word_idx] = id_to_label[predictions[position]]
            seen_words.add(word_idx)

    return list(zip(tokens, predicted_labels))

# Test with a sample court-decision text
sample_text = "Terdakwa AHMAD BIN SITI berusia 25 tahun terbukti melanggar pasal 362 KUHP dan dijatuhi hukuman penjara 2 tahun oleh Hakim Ketua Dr. BUDI SANTOSO dengan Nomor Putusan 123/Pid.B/2023/PN.Jkt.Sel"

print("🔍 Testing prediction on sample text:")
print(f"Text: {sample_text}\n")

prediction_results = predict_entities(sample_text, model, tokenizer, id_to_label, device)

print("📝 Prediction Results:")
for token, label in prediction_results:
    if label != 'O':
        print(f"   {token} → {label}")
    else:
        print(f"   {token} → -")

🔍 Testing prediction on sample text:
Text: Terdakwa AHMAD BIN SITI berusia 25 tahun terbukti melanggar pasal 362 KUHP dan dijatuhi hukuman penjara 2 tahun oleh Hakim Ketua Dr. BUDI SANTOSO dengan Nomor Putusan 123/Pid.B/2023/PN.Jkt.Sel

📝 Prediction Results:
   Terdakwa → -
   AHMAD → -
   BIN → I_DEFN
   SITI → -
   berusia → -
   25 → -
   tahun → -
   terbukti → -
   melanggar → -
   pasal → -
   362 → -
   KUHP → -
   dan → -
   dijatuhi → -
   hukuman → -
   penjara → -
   2 → -
   tahun → -
   oleh → -
   Hakim → -
   Ketua → -
   Dr. → -
   BUDI → -
   SANTOSO → -
   dengan → -
   Nomor → -
   Putusan → -
   123/Pid.B/2023/PN.Jkt.Sel → -
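
Token-level tags are usually merged into entity spans before they are used downstream. A minimal sketch, assuming the `B_`/`I_` underscore convention used in this notebook; `group_entities` is a hypothetical helper and the input pairs are hand-written for illustration, not model output.

```python
def group_entities(token_label_pairs):
    """Merge (token, tag) pairs into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in token_label_pairs:
        prefix, _, etype = tag.partition("_")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A new entity starts (a stray I_ tag is treated as a start).
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O" closes any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

pairs = [("Terdakwa", "O"), ("AHMAD", "B_DEFN"), ("BIN", "I_DEFN"),
         ("SITI", "I_DEFN"), ("melanggar", "O"), ("pasal", "B_ARTV"),
         ("362", "I_ARTV"), ("KUHP", "I_ARTV")]
print(group_entities(pairs))
# [('DEFN', 'AHMAD BIN SITI'), ('ARTV', 'pasal 362 KUHP')]
```

Treating a stray `I_` tag as the start of a span is a lenient choice; a stricter decoder could drop such tokens instead.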


## 9. Save Trained Model


In [12]:
# Save model and tokenizer
model_save_path = "./legal-ner-bert-model"
os.makedirs(model_save_path, exist_ok=True)

# Save the model
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

# Save label mappings
import json
label_mappings = {
    'label_to_id': label_to_id,
    'id_to_label': id_to_label,
    'unique_labels': unique_labels
}

with open(os.path.join(model_save_path, 'label_mappings.json'), 'w') as f:
    json.dump(label_mappings, f, indent=2)

print(f"✅ Model saved to: {model_save_path}")

# Final summary
print("\n" + "="*60)
print("           🎉 BERT NER MODEL TRAINING COMPLETED")
print("="*60)
print(f"📊 Dataset: {len(sentences):,} sentences from legal documents")
print(f"🏷️ Entity tokens: {len([l for l in all_labels if l != 'O']):,} non-O tokens")
print(f"🤖 Model: {model_name}")
print(f"🎯 Test Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"📈 Test F1-Score: {test_results['eval_f1']:.4f}")
print(f"💾 Model Location: {model_save_path}")
print("\n🚀 Model ready for legal document entity extraction!")
print("="*60)

✅ Model saved to: ./legal-ner-bert-model

           🎉 BERT NER MODEL TRAINING COMPLETED
📊 Dataset: 235 sentences from legal documents
🏷️ Entity tokens: 31,120 non-O tokens
🤖 Model: archi-ai/Indo-LegalBERT
🎯 Test Accuracy: 0.9121
📈 Test F1-Score: 0.8763
💾 Model Location: ./legal-ner-bert-model

🚀 Model ready for legal document entity extraction!
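
One detail to keep in mind when reloading `label_mappings.json`: JSON object keys are always strings, so the integer keys of `id_to_label` come back as `"0"`, `"1"`, and so on. A minimal sketch of the round trip, using a temporary directory and a three-label mapping as stand-ins for the saved model folder and the full label set:

```python
import json
import os
import tempfile

# Small stand-in for the notebook's id_to_label mapping.
id_to_label = {0: "B_DEFN", 1: "I_DEFN", 2: "O"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_mappings.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"id_to_label": id_to_label}, f, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

raw = loaded["id_to_label"]                      # keys are now strings
restored = {int(k): v for k, v in raw.items()}   # convert back to int keys
print(list(raw)[:1], list(restored)[:1])  # ['0'] [0]
```

Without the `int(...)` conversion, a lookup such as `id_to_label[0]` on the reloaded mapping raises `KeyError`.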
