# OCR MODULE - AUTOMATED DOCUMENT VERIFICATION SYSTEM

**Windows 11 Version**

---

## Description
Text extraction from PDF and image documents using Tesseract OCR.

**Workflow:**
```
Input Image → Orientation/Skew Correction → Grayscale → Adaptive Denoise → Tesseract OCR (Grayscale, PSM 3, OEM 3) → Post-processing (garbage line removal) → Single-line Evaluation
```

**Pipeline Notes:**
- CLAHE, sharpening, Otsu thresholding, and morphology are **disabled** (pass-through)
- In this pipeline, Tesseract's LSTM engine (the default under OEM 3 when LSTM data is available) is more accurate on **grayscale** images than on binary ones
- **Post-processing**: garbage line removal cleans up watermark and stamp artifacts
- Evaluation uses **single-line** mode (layout and newlines are ignored)

**Performance Target:** <10 seconds per document

**Framework:** Python + Tesseract OCR + OpenCV

The OCR system is tested for accuracy, WER (word error rate), and CER (character error rate).

Added: testing with jiwer (an alternative to FastWER that is easier to install on Windows)
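
To make the metrics concrete, here is a minimal pure-Python sketch of single-line evaluation. It is not the jiwer implementation: `to_single_line`, `edit_distance`, `simple_cer`, and `simple_wer` are illustrative helpers that assume the standard definitions (edit distance divided by reference length), and the sample strings are made up.

```python
def to_single_line(text):
    """Collapse newlines and runs of whitespace so layout is ignored."""
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def simple_cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    ref, hyp = to_single_line(reference), to_single_line(hypothesis)
    return edit_distance(ref, hyp) / len(ref)


def simple_wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref = to_single_line(reference).split()
    hyp = to_single_line(hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)


# Made-up sample: the OCR output misreads the final zero as the letter O
reference = "TOTAL BAYAR\nRp 150.000"
hypothesis = "TOTAL  BAYAR Rp 150.00O"
print(f"CER: {simple_cer(reference, hypothesis):.4f}")
print(f"WER: {simple_wer(reference, hypothesis):.4f}")
```

Because both texts are flattened first, the line break in the reference and the double space in the hypothesis do not count as errors; only the misread character does.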

In [1]:
# =============================================
# PYTHON PACKAGE INSTALLATION
# =============================================

# matplotlib is included because the preprocessing cells use it for previews
!pip install pytesseract pdf2image Pillow opencv-python python-Levenshtein jiwer matplotlib

print("\n✅ Package installation finished!")
print("⚠️  Make sure Tesseract OCR and Poppler are installed!")

Defaulting to user installation because normal site-packages is not writeable

✅ Package installation finished!
⚠️  Make sure Tesseract OCR and Poppler are installed!


In [2]:
# ================================================
# TESSERACT & POPPLER PATH CONFIGURATION
# ================================================
import os

# ADJUST THESE PATHS TO MATCH YOUR INSTALLATION!
TESSERACT_PATH = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
POPPLER_PATH = r'C:\Program Files\poppler-25.07.0\Library\bin'

# Verify the Tesseract path
if os.path.exists(TESSERACT_PATH):
    print(f"✅ Tesseract found at: {TESSERACT_PATH}")
else:
    print(f"❌ Tesseract NOT found at: {TESSERACT_PATH}")
    print("⚠️  Adjust TESSERACT_PATH to match your installation!")

# Verify the Poppler path
if os.path.exists(POPPLER_PATH):
    print(f"✅ Poppler found at: {POPPLER_PATH}")
else:
    print(f"❌ Poppler NOT found at: {POPPLER_PATH}")
    print("⚠️  Adjust POPPLER_PATH to match your installation!")

✅ Tesseract found at: C:\Program Files\Tesseract-OCR\tesseract.exe
✅ Poppler found at: C:\Program Files\poppler-25.07.0\Library\bin
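
A hard-coded path breaks as soon as the tool is installed elsewhere. As a hedged sketch, the hypothetical helper `resolve_executable` below (not part of this notebook) first tries the configured path and then falls back to a lookup on `PATH` with the standard-library `shutil.which`.

```python
import os
import shutil


def resolve_executable(configured_path, command_name):
    """Return configured_path if it is an existing file, else search PATH.

    Returns None when neither location yields an executable.
    """
    if configured_path and os.path.isfile(configured_path):
        return configured_path
    return shutil.which(command_name)


# Example with an assumed install location; prints None if nothing is found
print(resolve_executable(r'C:\Program Files\Tesseract-OCR\tesseract.exe', 'tesseract'))
```

The result could be assigned to `pytesseract.pytesseract.tesseract_cmd` once it is known to be non-`None`.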


In [3]:
# ================================================
# LIBRARY IMPORTS WITH ERROR HANDLING
# ================================================
try:
    import pytesseract
    from pdf2image import convert_from_path
    from IPython.display import display
    from PIL import Image
    import io
    import os
    import cv2
    import numpy as np
    import Levenshtein
    import re
    import json
    import time
    from pathlib import Path
    from jiwer import wer, cer  # jiwer provides the WER and CER metrics

    # Point pytesseract at the Tesseract executable on Windows
    pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH

    print("✅ All libraries imported successfully")
    print("📦 Using jiwer for WER and CER calculation")

except ImportError as e:
    print(f"❌ Error importing library: {e}")
    print("⚠️  Make sure all libraries are installed")
    raise

✅ All libraries imported successfully
📦 Using jiwer for WER and CER calculation


In [4]:
# ================================================
# TESSERACT OCR ENGINE VERIFICATION
# ================================================
try:
    tesseract_version = pytesseract.get_tesseract_version()
    print(f"✅ Tesseract engine version: {tesseract_version}")
    
    # Check which languages are available
    languages = pytesseract.get_languages()
    print(f"\n📚 Available languages: {', '.join(languages)}")
    
    if 'ind' in languages:
        print("✅ Indonesian is available")
    else:
        print("⚠️  Indonesian is not available")
        print("   Download tessdata from: https://github.com/tesseract-ocr/tessdata")
        
except pytesseract.TesseractNotFoundError:
    print("❌ Tesseract not found!")
    print(f"   Path tried: {TESSERACT_PATH}")
    print("   Make sure Tesseract is installed and the path is correct")
    raise

✅ Tesseract engine version: 5.5.0.20241111

📚 Available languages: eng, ind, osd
✅ Indonesian is available


In [4]:
# ================================================
# GROUND TRUTH CONFIGURATION FROM LOCAL TXT FILES
# ================================================

import os
from pathlib import Path

# ADJUST THIS PATH TO YOUR DOCUMENT FOLDER
# GROUND_TRUTH_FOLDER = r'E:\Softwares\Jupyter\Projects\OCR\dokumen\dokumen_normal\ground_truth'
BASE_FOLDER = r'E:\Softwares\Jupyter\Projects\OCR\data\dokumen'

# Pick the category folder to process
CATEGORY = 'listrik'  # or 'dokumen_blur', 'dokumen_noisy', etc.
GROUND_TRUTH_FOLDER = os.path.join(BASE_FOLDER, CATEGORY, 'ground_truth')
DOCUMENTS_FOLDER = os.path.join(BASE_FOLDER, CATEGORY)

print("🔍 Reading ground truth from local files...")
print("=" * 60)

GROUND_TRUTH = {}

# Extensions tried, in order, when matching a ground-truth file to its document
POSSIBLE_EXTENSIONS = ['.pdf', '.jpg', '.jpeg', '.png']

# Check that the folder exists
if not os.path.exists(GROUND_TRUTH_FOLDER):
    print(f"❌ Folder not found: {GROUND_TRUTH_FOLDER}")
    print("⚠️  Adjust BASE_FOLDER and CATEGORY to match your folder layout")
else:
    print(f"✅ Folder found: {GROUND_TRUTH_FOLDER}\n")
    
    # Read every .txt file in the folder
    txt_files = list(Path(GROUND_TRUTH_FOLDER).glob('*.txt'))
    
    if not txt_files:
        print("⚠️  No .txt files found in the folder")
    else:
        for txt_file in txt_files:
            try:
                # Read the file content
                with open(txt_file, 'r', encoding='utf-8') as f:
                    content = f.read()
                
                filename_base = txt_file.stem  # File name without extension
                
                # Look for the matching document (PDF or image) in the parent folder.
                # A plain existence check avoids glob surprises with names
                # that contain characters such as "[" or "]".
                actual_file = None
                for ext in POSSIBLE_EXTENSIONS:
                    candidate = os.path.join(DOCUMENTS_FOLDER, f"{filename_base}{ext}")
                    if os.path.isfile(candidate):
                        actual_file = os.path.basename(candidate)
                        break
                
                # If no matching document is found, default to a PDF name
                if actual_file is None:
                    actual_file = f"{filename_base}.pdf"
                
                key = actual_file
                GROUND_TRUTH[key] = content
                
                char_count = len(content)
                print(f"✅ {txt_file.name} ({char_count} characters)")
                print(f"   → Key: {key}")
                
            except Exception as e:
                print(f"❌ Error reading {txt_file.name}: {e}")

print("\n" + "=" * 60)
print(f"📊 Total ground truth loaded: {len(GROUND_TRUTH)} files")
print("=" * 60)

if GROUND_TRUTH:
    print("\n📋 Available ground truth:")
    for filename, content in GROUND_TRUTH.items():
        char_count = len(content)
        line_count = content.count('\n') + 1
        print(f"   • {filename}: {char_count} characters, {line_count} lines")
    
    print("\n⚠️  IMPORTANT:")
    print("   1. Make sure each .txt file name matches the name of the document to be OCR'd")
    print("   2. Example: 'Struk 1.txt' for 'Struk 1.pdf'")
    print("   3. Ground truth must contain RAW text without normalization")
else:
    print("\n⚠️  No ground truth was loaded!")
    print("   Please check:")
    print(f"   1. Folder path: {GROUND_TRUTH_FOLDER}")
    print("   2. Make sure the folder contains .txt files")

🔍 Reading ground truth from local files...
✅ Folder found: E:\Softwares\Jupyter\Projects\OCR\data\dokumen\listrik\ground_truth

✅ BANDARBARU (Listrik).txt (940 characters)
   → Key: BANDARBARU (Listrik).pdf
✅ MEDANPAYAGELI (Listrik).txt (942 characters)
   → Key: MEDANPAYAGELI (Listrik).pdf
✅ NAMORAMBE (Listrik).txt (945 characters)
   → Key: NAMORAMBE (Listrik).pdf
✅ PALANGGA (Listrik).txt (638 characters)
   → Key: PALANGGA (Listrik).pdf
✅ PANCURBATU (Listrik).txt (947 characters)
   → Key: PANCURBATU (Listrik).pdf

📊 Total ground truth loaded: 5 files

📋 Available ground truth:
   • BANDARBARU (Listrik).pdf: 940 characters, 32 lines
   • MEDANPAYAGELI (Listrik).pdf: 942 characters, 32 lines
   • NAMORAMBE (Listrik).pdf: 945 characters, 32 lines
   • PALANGGA (Listrik).pdf: 638 characters, 27 lines
   • PANCURBATU (Listrik).pdf: 947 characters, 33 lines

⚠️  IMPORTANT:
   1. Make sure each .txt file name matches the name of the document to be OCR'd
   2. 
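
The loader falls back to a `.pdf` key when no matching document exists, so a misnamed ground-truth file can go unnoticed until evaluation. The sketch below is one possible pre-flight check. `find_document_for` and `unmatched_ground_truth` are hypothetical helpers, and the folder layout (a `ground_truth` subfolder next to the documents) is taken from the cell above.

```python
from pathlib import Path


def find_document_for(stem, doc_folder, extensions=(".pdf", ".jpg", ".jpeg", ".png")):
    """Return the name of the first file `<stem><ext>` in doc_folder, or None."""
    for ext in extensions:
        candidate = Path(doc_folder) / f"{stem}{ext}"
        if candidate.is_file():
            return candidate.name
    return None


def unmatched_ground_truth(gt_folder, doc_folder):
    """List ground-truth .txt files that have no matching document."""
    return sorted(
        txt.name
        for txt in Path(gt_folder).glob("*.txt")
        if find_document_for(txt.stem, doc_folder) is None
    )
```

Calling `unmatched_ground_truth(GROUND_TRUTH_FOLDER, DOCUMENTS_FOLDER)` before the OCR run returns an empty list when every ground-truth file has a document to be compared against.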

In [5]:
# ================================================
# INPUT FILES
# ================================================
# Automatic approach: scan the folder for documents
from pathlib import Path

def get_all_documents(folder_path, extensions=('.pdf', '.jpg', '.jpeg', '.png')):
    """Automatically collect all document files from a folder."""
    folder = Path(folder_path)
    all_files = []
    
    for ext in extensions:
        all_files.extend(folder.glob(f'*{ext}'))
        # all_files.extend(folder.glob(f'*{ext.upper()}'))
    
    return sorted([str(f) for f in all_files])

# Usage (same folder as BASE_FOLDER/CATEGORY in the ground-truth cell; keep them in sync):
DOCUMENTS_FOLDER = r'E:\Softwares\Jupyter\Projects\OCR\data\dokumen\listrik'
FILE_PATHS = get_all_documents(DOCUMENTS_FOLDER)
print(f"✅ Found {len(FILE_PATHS)} documents")
# Example output: ✅ Found 5 documents

✅ Found 5 documents
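
The commented-out `ext.upper()` glob hints at files such as `SCAN.PDF`. Adding a second glob can list a file twice on case-insensitive file systems, so a possible alternative is to compare lower-cased suffixes instead. `list_documents` below is a hypothetical variant, not the function used by the rest of the notebook.

```python
from pathlib import Path


def list_documents(folder_path, extensions=(".pdf", ".jpg", ".jpeg", ".png")):
    """Return sorted paths of files whose suffix matches, ignoring case."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        str(p)
        for p in Path(folder_path).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )
```

Each file is visited once, so no de-duplication step is needed regardless of how the file system treats case.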


In [None]:
# ================================================
# FILE-TO-IMAGE CONVERSION
# ================================================

print("🔄 Processing files...")
print("=" * 60)

# Initialise the global timer and per-document tracking
ocr_pipeline_start = time.time()
timing_per_doc = {}

def _record_timing(doc_name, step, duration):
    """Accumulate processing time per document and per stage."""
    if doc_name not in timing_per_doc:
        timing_per_doc[doc_name] = {}
    timing_per_doc[doc_name][step] = timing_per_doc[doc_name].get(step, 0) + duration

all_images = []
file_info = []  # Source file name for each image (one entry per page)

for file_path in FILE_PATHS:
    print(f"\n📄 Processing: {file_path}")
    _t_doc = time.time()
    
    try:
        if file_path.lower().endswith('.pdf'):
            # Convert each PDF page to an image
            images = convert_from_path(
                file_path,
                dpi=300,
                poppler_path=POPPLER_PATH
            )
            print(f"   ✅ PDF converted to {len(images)} page(s)")
        else:
            # Read the image file directly; convert to RGB so later steps
            # (white fill on rotation, RGB-to-gray conversion) see 3 channels
            images = [Image.open(file_path).convert('RGB')]
            print(f"   ✅ Image loaded successfully")
        
        # Store every image together with its source file name
        for img in images:
            all_images.append(img)
            file_info.append(os.path.basename(file_path))
        
        _record_timing(os.path.basename(file_path), 'Konversi', time.time() - _t_doc)
        
        # Preview the first page
        if images:
            print(f"\n   🔍 Preview (Page 1):")
            display(images[0])
            
    except Exception as e:
        print(f"   ❌ Error: {e}")
        continue

print(f"\n{'=' * 60}")
print(f"📊 Total images ready for processing: {len(all_images)}")
print(f"{'=' * 60}")

images = all_images
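
Every stage repeats the same `_t_doc = time.time()` / `_record_timing(...)` pair. As a sketch of one way to cut the repetition, a context manager can accumulate the elapsed time automatically. The names `timings` and `timed` are assumptions for illustration and are not used elsewhere in this notebook.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# timings[document][stage] -> accumulated seconds
timings = defaultdict(lambda: defaultdict(float))


@contextmanager
def timed(doc_name, step):
    """Add the wall-clock time of the enclosed block to timings[doc_name][step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[doc_name][step] += time.perf_counter() - start


# Two blocks for the same document and stage accumulate, like a multi-page PDF
for _ in range(2):
    with timed("example.pdf", "Konversi"):
        time.sleep(0.01)

print(dict(timings["example.pdf"]))
```

The `finally` clause records the time even when the block raises, which the manual pattern does not do.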

In [None]:
# ================================================
# PREPROCESSING 0: ORIENTATION & SKEW CORRECTION
# ================================================
import matplotlib.pyplot as plt

MAX_DISPLAY_ORIENT = 5

print("🔄 Starting Orientation & Skew Correction...")
print("=" * 60)
start_time = time.time()

corrected_images = []

for i, img in enumerate(images):
    _t_doc = time.time()
    source_file = file_info[i]
    rotation_applied = False
    skew_applied = False
    rotation_angle = 0
    rotation_conf = 0.0
    skew_angle = 0.0
    osd_error = None

    # === STEP 1: Detect & correct rotation (OSD) ===
    try:
        osd = pytesseract.image_to_osd(img)
        for line in osd.split('\n'):
            if 'Rotate:' in line:
                rotation_angle = int(line.split(':')[-1].strip())
            if 'Orientation confidence:' in line:
                rotation_conf = float(line.split(':')[-1].strip())
    except pytesseract.TesseractError as e:
        osd_error = str(e)

    # Rotate when a rotation is detected (confidence > 1.0).
    # Tesseract reports "Rotate" as the clockwise correction to apply, while
    # PIL's Image.rotate() turns counter-clockwise, hence the negated angle.
    if osd_error is None and rotation_angle != 0 and rotation_conf > 1.0:
        img = img.rotate(-rotation_angle, expand=True, fillcolor=(255, 255, 255))
        rotation_applied = True

    # === STEP 1b: 180° rotation fallback ===
    # If OSD failed or detected no rotation, check whether the document is upside down
    # by comparing OCR confidence for the normal and the 180°-rotated orientation
    if not rotation_applied:
        try:
            img_np_temp = np.array(img)
            h, w = img_np_temp.shape[:2]
            # Use a centre crop for a quick test
            center_crop = img_np_temp[h//4:3*h//4, w//4:3*w//4]
            
            # Confidence in the normal orientation
            data_normal = pytesseract.image_to_data(
                Image.fromarray(center_crop),
                lang='ind+eng',
                config='--psm 6 --oem 3',
                output_type=pytesseract.Output.DICT
            )
            # float() tolerates confidences reported as "96.06" as well as "96"
            conf_normal = [float(c) for c in data_normal['conf'] if float(c) > 0]
            avg_conf_normal = sum(conf_normal) / len(conf_normal) if conf_normal else 0
            
            # Confidence after a 180° rotation
            rotated_180 = np.rot90(img_np_temp, 2)
            center_crop_180 = rotated_180[h//4:3*h//4, w//4:3*w//4]
            data_rotated = pytesseract.image_to_data(
                Image.fromarray(center_crop_180),
                lang='ind+eng',
                config='--psm 6 --oem 3',
                output_type=pytesseract.Output.DICT
            )
            conf_rotated = [float(c) for c in data_rotated['conf'] if float(c) > 0]
            avg_conf_rotated = sum(conf_rotated) / len(conf_rotated) if conf_rotated else 0
            
            # Rotate by 180° only if the confidence is clearly higher
            if avg_conf_rotated > avg_conf_normal + 10:
                img = Image.fromarray(rotated_180)
                rotation_applied = True
                rotation_angle = 180
                rotation_conf = avg_conf_rotated
        except Exception:
            pass  # Fallback failed; continue with the original orientation

    # === STEP 2: Detect & correct skew (small tilt) ===
    img_np = np.array(img)
    gray_temp = cv2.cvtColor(img_np, cv2.COLOR_RGB2GRAY) if len(img_np.shape) == 3 else img_np

    _, binary = cv2.threshold(gray_temp, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # minAreaRect expects 32-bit points; np.where may return int64 indices
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)

    if len(coords) >= 10:
        angle = cv2.minAreaRect(coords)[-1]
        # Fold the angle into (-45, 45] so that both OpenCV conventions work:
        # [-90, 0) in older releases and (0, 90] in newer ones
        angle = angle % 90
        if angle > 45:
            angle -= 90
        skew_angle = -angle
        # Tilts above 15° are treated as false detections and ignored
        if abs(skew_angle) > 15:
            skew_angle = 0.0

    if abs(skew_angle) >= 0.5:
        (h, w) = img_np.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), skew_angle, 1.0)
        border = (255, 255, 255) if len(img_np.shape) == 3 else 255
        img_np = cv2.warpAffine(img_np, M, (w, h), flags=cv2.INTER_CUBIC,
                                borderMode=cv2.BORDER_CONSTANT, borderValue=border)
        skew_applied = True

    corrected_pil = Image.fromarray(img_np)
    corrected_images.append(corrected_pil)
    _record_timing(source_file, 'Orientasi', time.time() - _t_doc)

    # === DISPLAY ===
    should_display = (MAX_DISPLAY_ORIENT is None) or (i < MAX_DISPLAY_ORIENT)

    if should_display:
        print(f"\n✅ Orientation & Skew Correction, image {i+1} of {len(images)}:")
        print(f"   📄 File: {source_file}")

        if osd_error:
            print(f"   ⚠️  OSD Error: {osd_error}")
        
        # A 180° correction can come from OSD or from the confidence fallback
        if rotation_angle == 180 and rotation_applied:
            print(f"   🔄 180° rotation detected (conf: {rotation_conf:.2f}) → Corrected")
        elif rotation_applied:
            print(f"   🔍 Rotation: {rotation_angle}° (conf: {rotation_conf:.2f}) → Corrected")
        else:
            print(f"   🔍 Rotation: Not needed")

        print(f"   🔍 Skew: {skew_angle:.2f}° → {'Corrected' if skew_applied else 'Not needed'}")

        fig, axes = plt.subplots(1, 2, figsize=(12, 6))
        axes[0].imshow(np.array(images[i]))
        axes[0].set_title('Original')
        axes[0].axis('off')
        axes[1].imshow(np.array(corrected_pil))
        axes[1].set_title(f'Corrected (rot:{rotation_angle}° skew:{skew_angle:.1f}°)')
        axes[1].axis('off')
        plt.tight_layout()
        plt.show()
        print("-" * 60)
    else:
        if i == MAX_DISPLAY_ORIENT:
            print(f"\n📊 Processing images {i+1} - {len(images)}...")
        print(f"   ✅ Image {i+1} done", end="\r")

if MAX_DISPLAY_ORIENT is not None and len(images) > MAX_DISPLAY_ORIENT:
    print(f"\n\n💡 {len(images) - MAX_DISPLAY_ORIENT} more images were processed")

elapsed = time.time() - start_time
print(f"\n⏱️  Orientation & Skew Correction time: {elapsed:.2f} seconds")
print(f"📊 Total images processed: {len(corrected_images)}")
print("=" * 60)
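
The OSD step relies on parsing the plain-text report returned by `pytesseract.image_to_osd`. The sketch below isolates that parsing and the confidence gate so they can be checked without running Tesseract. `parse_osd` and `should_rotate` are hypothetical helpers, and the sample report is made up for illustration.

```python
import re


def parse_osd(osd_text):
    """Extract (rotate_degrees, orientation_confidence) from an OSD report."""
    rotate = re.search(r"Rotate:\s*(\d+)", osd_text)
    conf = re.search(r"Orientation confidence:\s*([\d.]+)", osd_text)
    return (
        int(rotate.group(1)) if rotate else 0,
        float(conf.group(1)) if conf else 0.0,
    )


def should_rotate(rotate, confidence, min_confidence=1.0):
    """Mirror the cell's rule: rotate only for a non-zero, confident result."""
    return rotate != 0 and confidence > min_confidence


# Illustrative report text, not captured from a real run
sample = (
    "Page number: 0\n"
    "Orientation in degrees: 270\n"
    "Rotate: 90\n"
    "Orientation confidence: 4.17\n"
    "Script: Latin\n"
    "Script confidence: 1.11\n"
)
rotate, confidence = parse_osd(sample)
print(rotate, confidence, should_rotate(rotate, confidence))
```

Missing fields fall back to `0` and `0.0`, which `should_rotate` treats as "leave the image alone".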

In [None]:
# ================================================
# PREPROCESSING 1: GRAYSCALING
# ================================================

MAX_DISPLAY_GRAY = 5

print("🔄 Starting Grayscaling...")
print("=" * 60)
start_time = time.time()

grayscale_images = []

for i, img in enumerate(corrected_images):
    _t_doc = time.time()
    open_cv_image = np.array(img)
    
    # Convert RGB to grayscale (already-gray images pass through)
    if len(open_cv_image.shape) == 3:
        img_gray = cv2.cvtColor(open_cv_image, cv2.COLOR_RGB2GRAY)
    else:
        img_gray = open_cv_image
    
    grayscale_images.append(img_gray)
    _record_timing(file_info[i], 'Grayscale', time.time() - _t_doc)

    should_display = (MAX_DISPLAY_GRAY is None) or (i < MAX_DISPLAY_GRAY)

    if should_display:
        print(f"\n✅ Grayscale image {i+1} of {len(corrected_images)}:")
        print(f"   Dimensions: {img_gray.shape[1]} x {img_gray.shape[0]} pixels")
        display(Image.fromarray(img_gray))
        print("-" * 60)
    else:
        if i == MAX_DISPLAY_GRAY:
            print(f"\n📊 Processing images {i+1} - {len(corrected_images)}...")
        print(f"   ✅ Image {i+1} done", end="\r")

if MAX_DISPLAY_GRAY is not None and len(corrected_images) > MAX_DISPLAY_GRAY:
    print(f"\n\n💡 {len(corrected_images) - MAX_DISPLAY_GRAY} more images were processed")

elapsed = time.time() - start_time
print(f"\n⏱️  Grayscaling time: {elapsed:.2f} seconds")
print(f"📊 Total images processed: {len(grayscale_images)}")
print("=" * 60)
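
For intuition about what the RGB-to-gray conversion does, the sketch below applies the luminance weights that OpenCV documents for this conversion (0.299, 0.587, 0.114) to a single pixel. `rgb_to_gray` is a hypothetical pure-Python helper; OpenCV's own fixed-point arithmetic may round an individual pixel differently.

```python
def rgb_to_gray(r, g, b):
    """Weighted luminance of one RGB pixel, rounded to the nearest integer."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


for name, rgb in [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, rgb_to_gray(*rgb))
```

Green contributes most to the gray value and blue least, which is why a blue stamp ends up darker than a green one of the same intensity.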

In [None]:
# ================================================
# PREPROCESSING 2: ADAPTIVE NOISE REMOVAL
# ================================================
import matplotlib.pyplot as plt
MAX_DISPLAY_DENOISE = 5
print("🔄 Starting Adaptive Noise Removal...")
print("=" * 60)
start_time = time.time()

denoised_images = []

def estimate_noise(gray_img):
    """Estimate the noise level from the variance of the Laplacian.
    High values mean lots of detail or noise; low values mean a clean, smooth image."""
    return cv2.Laplacian(gray_img, cv2.CV_64F).var()

for i, gray_img in enumerate(grayscale_images):
    _t_doc = time.time()
    
    # Estimate the noise level to pick a suitable method
    noise_level = estimate_noise(gray_img)
    
    if noise_level > 1500:
        # Very noisy image (low-quality scan) → stronger blur
        denoised = cv2.GaussianBlur(gray_img, (5, 5), 0)
        method = "Gaussian 5x5 (noisy)"
    elif noise_level > 500:
        # Moderate noise → light blur
        denoised = cv2.GaussianBlur(gray_img, (3, 3), 0)
        method = "Gaussian 3x3 (moderate)"
    else:
        # Clean image (digital PDF) → no blur, so the text stays sharp
        denoised = gray_img.copy()
        method = "No blur (clean)"
    
    denoised_images.append(denoised)
    _record_timing(file_info[i], 'Denoise', time.time() - _t_doc)
    
    should_display = (MAX_DISPLAY_DENOISE is None) or (i < MAX_DISPLAY_DENOISE)
    if should_display:
        print(f"\n✅ Denoised image {i+1} of {len(grayscale_images)}:")
        print(f"   Dimensions: {denoised.shape[1]} x {denoised.shape[0]} pixels")
        print(f"   📊 Noise level: {noise_level:.0f} → {method}")
        
        fig, axes = plt.subplots(1, 2, figsize=(12, 6))
        axes[0].imshow(gray_img, cmap='gray')
        axes[0].set_title('Grayscale')
        axes[0].axis('off')
        axes[1].imshow(denoised, cmap='gray')
        axes[1].set_title(f'Denoised ({method})')
        axes[1].axis('off')
        plt.tight_layout()
        plt.show()
        print("-" * 60)
    else:
        if i == MAX_DISPLAY_DENOISE:
            print(f"\n📊 Processing images {i+1} - {len(grayscale_images)}...")
        print(f"   ✅ Image {i+1} done", end="\r")

if MAX_DISPLAY_DENOISE is not None and len(grayscale_images) > MAX_DISPLAY_DENOISE:
    print(f"\n\n💡 {len(grayscale_images) - MAX_DISPLAY_DENOISE} more images were processed")

elapsed = time.time() - start_time
print(f"\n⏱️  Adaptive Noise Removal time: {elapsed:.2f} seconds")
print(f"📊 Total images processed: {len(denoised_images)}")
print("=" * 60)
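
To show how the Laplacian-variance score drives the choice of blur, here is a small pure-Python sketch on toy 5x5 images. `laplacian_variance` and `choose_kernel` are hypothetical helpers. The first only approximates `cv2.Laplacian(...).var()`, since it uses the 4-neighbour kernel on interior pixels and skips OpenCV's border handling. The thresholds are the ones used in the cell above.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior of a 2D list."""
    values = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            values.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def choose_kernel(noise_level, high=1500, moderate=500):
    """Return the Gaussian kernel size for a noise level, or None for no blur."""
    if noise_level > high:
        return (5, 5)
    if noise_level > moderate:
        return (3, 3)
    return None


flat = [[128] * 5 for _ in range(5)]  # uniform page: no edges at all
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]  # extreme speckle

for name, img in [("flat", flat), ("checker", checker)]:
    level = laplacian_variance(img)
    print(name, round(level), choose_kernel(level))
```

Sharp text also raises the score, so the thresholds act as a heuristic that may need tuning per document category rather than as a true noise measurement.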

In [None]:
# ================================================
# PREPROCESSING 3: CONTRAST ENHANCEMENT (CLAHE) - DISABLED
# ================================================
# NOTE: CLAHE is disabled because:
# 1. In this pipeline, Tesseract's LSTM engine works better on the original grayscale
# 2. CLAHE amplifies watermarks and background artifacts,
#    which makes Tesseract read the watermark as text
# 3. Clean documents do not need CLAHE
#
# To re-enable it for experiments:
#   clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8, 8))
#   enhanced = clahe.apply(denoised_img)

print("⏭️  CLAHE disabled: Tesseract LSTM is more accurate on the original grayscale")
start_time = time.time()

enhanced_images = []
for i, denoised_img in enumerate(denoised_images):
    _t_doc = time.time()
    enhanced_images.append(denoised_img)  # Pass-through, unchanged
    _record_timing(file_info[i], 'CLAHE', time.time() - _t_doc)

elapsed = time.time() - start_time
print(f"📊 Total images: {len(enhanced_images)} (pass-through, {elapsed:.2f} seconds)")
print("=" * 60)

In [None]:
# ================================================
# PREPROCESSING 4: SHARPENING - DISABLED
# ================================================
# NOTE: Sharpening is disabled because:
# 1. It amplifies noise, watermarks, and JPEG artifacts
# 2. Tesseract's LSTM engine does not need extra sharpening
# 3. At 300 DPI the text is already sharp enough
#
# To re-enable it for experiments:
#   sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
#   sharpened = cv2.filter2D(enhanced_img, -1, sharpen_kernel)

print("⏭️  Sharpening disabled: it can amplify noise and watermarks")
start_time = time.time()

sharpened_images = []
for i, enhanced_img in enumerate(enhanced_images):
    _t_doc = time.time()
    sharpened_images.append(enhanced_img)  # Pass-through, unchanged
    _record_timing(file_info[i], 'Sharpen', time.time() - _t_doc)

elapsed = time.time() - start_time
print(f"📊 Total images: {len(sharpened_images)} (pass-through, {elapsed:.2f} seconds)")
print("=" * 60)

In [None]:
# ================================================
# PREPROCESSING 5: THRESHOLDING - DISABLED
# ================================================
# NOTE: Binarization (Otsu) is disabled because:
# 1. Tesseract accepts grayscale input and already binarizes internally
#    where it needs to, so an external pass is not required
# 2. Binarization discards gradient information that can help
#    tell similar characters apart (0 vs O, 1 vs l vs I)
# 3. Reference: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
#
# To re-enable it for experiments:
#   _, img_thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print("⏭️  Otsu Thresholding disabled: LSTM is more accurate on grayscale")
start_time = time.time()

thresh_images_adaptive = []
for i, img_sharp in enumerate(sharpened_images):
    _t_doc = time.time()
    thresh_images_adaptive.append(img_sharp)  # Pass-through, unchanged
    _record_timing(file_info[i], 'Threshold', time.time() - _t_doc)

elapsed = time.time() - start_time
print(f"📊 Total images: {len(thresh_images_adaptive)} (pass-through, {elapsed:.2f} seconds)")
print("=" * 60)

In [None]:
# ================================================
# PREPROCESSING 6: MORPHOLOGICAL OPERATIONS - DISABLED
# ================================================
# NOTE: Morphological operations are disabled because:
# 1. This step was designed for binary images, and binarization is disabled
# 2. Even on binary images, Opening+Closing with a 2x2 kernel
#    can thin or thicken fine character strokes at 300 DPI
#
# To re-enable it (Otsu must be enabled as well):
#   kernel = np.ones((2, 2), np.uint8)
#   opened = cv2.morphologyEx(thresh_img, cv2.MORPH_OPEN, kernel)
#   closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

print("⏭️  Morphological Operations disabled: not needed without binarization")
start_time = time.time()

morphed_images = []
for i, thresh_img in enumerate(thresh_images_adaptive):
    _t_doc = time.time()
    morphed_images.append(thresh_img)  # Pass-through, unchanged
    _record_timing(file_info[i], 'Morfologi', time.time() - _t_doc)

elapsed = time.time() - start_time
print(f"📊 Total images: {len(morphed_images)} (pass-through, {elapsed:.2f} seconds)")
print(f"💡 Final preprocessed images ready for OCR (GRAYSCALE)")
print("=" * 60)
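
The four disabled stages above are kept as separate pass-through cells. A possible alternative, sketched here under assumptions, is a single list of `(name, function, enabled)` entries so that a stage is toggled by flipping one flag. `run_stages` and the toy stage functions are hypothetical and operate on a flat list of gray values instead of real images.

```python
def run_stages(pixels, stages):
    """Apply each enabled stage in order; disabled stages are skipped."""
    applied = []
    for name, func, enabled in stages:
        if enabled:
            pixels = func(pixels)
            applied.append(name)
    return pixels, applied


# Toy stage functions for illustration; real stages would call OpenCV
def lift_shadows(pixels):
    return [max(p, 10) for p in pixels]


def binarize(pixels):
    return [255 if p > 127 else 0 for p in pixels]


STAGES = [
    ("Denoise", lift_shadows, True),
    ("Threshold", binarize, False),  # disabled, as in this pipeline
]

result, applied = run_stages([0, 100, 200], STAGES)
print(result, applied)
```

Re-enabling a stage for an experiment then becomes a one-word change, and the list of applied stages documents what actually ran.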

In [None]:
# ================================================
# TEXT EXTRACTION WITH TESSERACT + POST-PROCESSING
# ================================================

print("🔄 Starting text extraction with Tesseract...")
print("   Input: GRAYSCALE images (not binary)")
print("   Config: PSM 3 (auto segmentation) + OEM 3 (default engine, LSTM when available)")
print("   Post-processing: Whitespace cleanup + garbage line removal")
print("=" * 60)
start_time_total = time.time()

extracted_texts = {}

def clean_garbage_lines(text):
    """
    Remove lines that are most likely garbage text
    (OCR output from watermarks, stamps, or visual artifacts).
    
    A line is treated as garbage when any of these holds:
    1. It has at most 2 characters and is not purely digits
    2. It is shorter than 20 characters and less than 40% alphanumeric
    3. It is shorter than 30 characters, has 3+ tokens, and more than
       60% of them are single characters (for example "a b c D e")
    
    Returns:
        tuple: (cleaned_text, removed_count, removed_examples)
    """
    lines = text.split('\n')
    cleaned_lines = []
    removed_count = 0
    removed_examples = []
    
    for line in lines:
        stripped = line.strip()
        
        # Empty line → keep it (preserves paragraph breaks)
        if not stripped:
            cleaned_lines.append(line)
            continue
        
        is_garbage = False
        
        # --- Check 1: Very short line (1-2 characters) ---
        # For example: "a", ";"
        # Digit-only lines are kept. Note that a stand-alone two-letter
        # token such as "Rp" or "VA" is removed by this rule.
        if len(stripped) <= 2 and not stripped.isdigit():
            is_garbage = True
        
        # --- Check 2: Alphanumeric ratio too low ---
        # Normal lines consist mostly of letters (a-z, A-Z) and digits (0-9)
        # Garbage text often has many symbols and few letters
        if not is_garbage and len(stripped) > 0:
            alnum_count = sum(1 for c in stripped if c.isalnum())
            alnum_ratio = alnum_count / len(stripped)
            # Less than 40% alphanumeric AND a short line → garbage
            if alnum_ratio < 0.4 and len(stripped) < 20:
                is_garbage = True
        
        # --- Check 3: Isolated single characters ---
        # For example: "a b c D e F" or "N N P A I a"
        # Normal lines contain words of 2+ letters
        if not is_garbage:
            words = stripped.split()
            if len(words) >= 3:
                single_char_count = sum(1 for w in words if len(w) == 1)
                single_char_ratio = single_char_count / len(words)
                # More than 60% single-character tokens → garbage
                if single_char_ratio > 0.6 and len(stripped) < 30:
                    is_garbage = True
        
        if is_garbage:
            removed_count += 1
            if len(removed_examples) < 10:
                removed_examples.append(stripped)
        else:
            cleaned_lines.append(line)
    
    cleaned_text = '\n'.join(cleaned_lines)
    return cleaned_text, removed_count, removed_examples

for i, final_img in enumerate(morphed_images):
    start_time = time.time()
    
    # Tesseract config:
    # --psm 3 : Fully automatic page segmentation
    # --oem 3 : Default engine mode (uses the LSTM engine when available)
    # Input   : GRAYSCALE image
    text = pytesseract.image_to_string(
        final_img,
        lang='ind+eng',
        config='--psm 3 --oem 3'
    )
    
    # ============================================
    # POST-PROCESSING
    # ============================================
    # 1. Collapse multiple spaces → single space
    text = re.sub(r' +', ' ', text)
    
    # 2. Remove trailing spaces on each line
    text = re.sub(r' +\n', '\n', text)
    
    # 3. Remove garbage lines (watermarks, artifacts)
    text, garbage_count, garbage_examples = clean_garbage_lines(text)
    
    # 4. Normalize multiple newlines → double newline
    text = re.sub(r'\n\s*\n+', '\n\n', text)
    
    # 5. Trim leading/trailing whitespace
    text = text.strip()
    
    # Store the result (one list entry per page)
    source_file = file_info[i]
    if source_file not in extracted_texts:
        extracted_texts[source_file] = []
    extracted_texts[source_file].append(text)
    
    elapsed = time.time() - start_time
    _record_timing(source_file, 'Ekstraksi', elapsed)
    
    print(f"\n✅ Page {i+1}/{len(morphed_images)} (File: {source_file})")
    print(f"   ⏱️  Time: {elapsed:.2f} seconds | Characters: {len(text)}")
    if garbage_count > 0:
        print(f"   🧹 Garbage lines removed: {garbage_count}")
        if garbage_examples:
            examples = ', '.join([f'"{ex}"' for ex in garbage_examples[:5]])
            print(f"      Examples: {examples}")
    print(f"   📄 Preview: {text[:200]}...")
    print("-" * 60)

elapsed_total = time.time() - start_time_total
# Guard against an empty image list to avoid a ZeroDivisionError
avg_time = elapsed_total / len(morphed_images) if morphed_images else 0.0

print(f"\n{'=' * 60}")
print(f"⏱️  Total extraction time: {elapsed_total:.2f} seconds")
print(f"📊 Average: {avg_time:.2f} seconds/page")

if avg_time < 10:
    print(f"✅ Performance target met (average <10 seconds per page)")
else:
    print(f"⚠️  Performance not yet optimal (target: <10 seconds)")

print(f"✅ Output stored in variable: extracted_texts")
print("=" * 60)
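
The whitespace steps of the post-processing (1, 2, 4, and 5 above) can be checked in isolation. The sketch below wraps the same four regular-expression calls in a hypothetical helper, `normalize_whitespace`, and runs it on a made-up OCR fragment.

```python
import re


def normalize_whitespace(text):
    """Apply the whitespace cleanup steps from the extraction loop."""
    text = re.sub(r' +', ' ', text)            # collapse runs of spaces
    text = re.sub(r' +\n', '\n', text)         # drop spaces before a newline
    text = re.sub(r'\n\s*\n+', '\n\n', text)   # squeeze blank-line runs
    return text.strip()                        # trim both ends


# Made-up raw OCR output with irregular spacing
raw = "  PLN   PREPAID  \n\n\n\nIDPEL :  123  \n"
print(repr(normalize_whitespace(raw)))
```

Paragraph breaks survive as a single blank line. The single-line evaluation mode later discards them as well.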

In [None]:
# ================================================
# OUTPUT LIST FOR THE NER MODULE
# ================================================
# Output format: list of dicts, where each item contains:
#   - 'nama_lampiran': document file name (string)
#   - 'hasil_ocr': OCR text (string)
#
# This list can be passed directly to the NER module
# for entity extraction (amounts, IDPEL, names, etc.)

ocr_results_list = []

for filename, ocr_texts in extracted_texts.items():
    # Join all pages into a single text
    full_text = '\n'.join(ocr_texts)

    ocr_results_list.append({
        'nama_lampiran': filename,
        'hasil_ocr': full_text
    })

# Show the result
print("=" * 60)
print("📋 OUTPUT LIST FOR THE NER MODULE")
print("=" * 60)
print(f"\n📊 Total documents: {len(ocr_results_list)}")
print(f"📦 Variable: ocr_results_list\n")

for i, item in enumerate(ocr_results_list, 1):
    preview = item['hasil_ocr'][:100].replace('\n', ' ')
    print(f"  [{i}] {item['nama_lampiran']}")
    print(f"      Preview: {preview}...")
    print(f"      Length: {len(item['hasil_ocr'])} characters")
    print()

print("=" * 60)
print("💡 Use ocr_results_list as the input to the NER module")
print("   Example access:")
print("   >>> ocr_results_list[0]['nama_lampiran']")
print("   >>> ocr_results_list[0]['hasil_ocr']")
print("=" * 60)
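
As a rough illustration of how a downstream step might consume this list, the sketch below scans each `hasil_ocr` text for rupiah amounts. The NER module itself is not shown in this notebook, so the regular expression, the `find_amounts` helper, and the sample data are all hypothetical stand-ins.

```python
import re

# Made-up sample in the same shape as ocr_results_list
sample_results = [
    {'nama_lampiran': 'invoice_a.pdf', 'hasil_ocr': 'IDPEL 123456789012\nTotal Rp 150.000'},
    {'nama_lampiran': 'invoice_b.pdf', 'hasil_ocr': 'Tagihan listrik tanpa nominal'},
]

# Hypothetical pattern: "Rp" followed by digits with optional dot separators
AMOUNT_PATTERN = re.compile(r'Rp\s*([\d.]+)')

def find_amounts(results):
    """Map each attachment name to the amount strings found in its OCR text."""
    return {
        item['nama_lampiran']: AMOUNT_PATTERN.findall(item['hasil_ocr'])
        for item in results
    }

amounts = find_amounts(sample_results)
print(amounts)
```

A real entity extractor would likely need to tolerate OCR confusions such as `O` for `0`, which a plain pattern like this one does not handle.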

In [None]:
# ============================================
# OVERALL OCR PIPELINE TIMING SUMMARY
# ============================================
ocr_pipeline_total = time.time() - ocr_pipeline_start

print(f"{'=' * 120}")
print(f"⏱️  OCR PIPELINE TIMING SUMMARY")
print(f"{'=' * 120}")

# Stage order. The disabled stages (CLAHE, Sharpen, Threshold, Morfologi) stay in
# the table as pass-through columns and should show ~0.00 s.
STEPS = ['Konversi', 'Orientasi', 'Grayscale', 'Denoise', 'CLAHE', 'Sharpen', 'Threshold', 'Morfologi', 'Ekstraksi']

# Table header
header = f"{'No':<4} {'Document':<30} "
for step in STEPS:
    header += f"{step:<10} "
header += f"{'TOTAL':<10}"
print(f"\n{header}")
print("-" * 120)

# One table row per document
doc_totals = []
for idx, (doc_name, steps) in enumerate(timing_per_doc.items(), 1):
    row = f"{idx:<4} {doc_name:<30} "
    total = 0
    for step in STEPS:
        t = steps.get(step, 0)
        total += t
        row += f"{t:<10.2f} "
    row += f"{total:<10.2f}"
    doc_totals.append(total)
    print(row)

# Averages (guarded so an empty timing_per_doc does not divide by zero)
print("-" * 120)
n_docs = len(timing_per_doc)
avg_row = f"{'AVERAGE':<34} "
for step in STEPS:
    avg_val = sum(timing_per_doc[d].get(step, 0) for d in timing_per_doc) / n_docs if n_docs else 0
    avg_row += f"{avg_val:<10.2f} "
avg_total = sum(doc_totals) / len(doc_totals) if doc_totals else 0
avg_row += f"{avg_total:<10.2f}"
print(avg_row)
print("=" * 120)

print(f"\n⏱️  Total OCR pipeline time: {ocr_pipeline_total:.2f} s")
print(f"📊 Average per document: {avg_total:.2f} s")

if avg_total < 10:
    print(f"✅ Performance target met (<10 s/document)")
else:
    print(f"⚠️  Performance target not met (target: <10 s/document)")

print(f"\n💡 Note: CLAHE, Sharpen, Threshold, and Morfologi are disabled (pass-through, ~0.00 s)")
print(f"💡 Time unit: seconds")
print("=" * 120)
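
The summary table reads from `timing_per_doc`, which the loop fills through `_record_timing`. That helper is defined elsewhere in the notebook, so the sketch below is only an assumption about how such a structure could be filled: it uses a nested dict of `{document: {step: seconds}}` and adds up the time of multi-page files. The names `timing_sketch` and `record_timing_sketch` are hypothetical, and the real helper may overwrite rather than accumulate.

```python
from collections import defaultdict

# Assumed structure: {document name: {step name: seconds}}
timing_sketch = defaultdict(dict)

def record_timing_sketch(doc_name, step, seconds):
    """Add seconds to one step of one document (pages of the same file add up)."""
    timing_sketch[doc_name][step] = timing_sketch[doc_name].get(step, 0.0) + seconds

record_timing_sketch('doc_a.pdf', 'Ekstraksi', 1.5)
record_timing_sketch('doc_a.pdf', 'Ekstraksi', 2.0)   # second page of the same file
record_timing_sketch('doc_a.pdf', 'Grayscale', 0.25)

# Per-document total, as computed row by row in the summary table
total_doc_a = sum(timing_sketch['doc_a.pdf'].values())
print(dict(timing_sketch), total_doc_a)
```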

## Accuracy, WER, and CER Testing with jiwer

In [None]:
# ================================================
# METRIC FUNCTIONS (SINGLE-LINE)
# ================================================
#
# OCR EVALUATION METRICS:
# ==========================================
# 1. Raw Document Similarity (SequenceMatcher)
#    - Measures overall similarity between the ground truth and the OCR output
#    - Based on difflib's matching-block algorithm (similar to Ratcliff/Obershelp):
#      ratio = 2 * matched characters / (len(ground truth) + len(OCR output))
#    - Scale: 0-100% (higher is better)
#    - Strength: intuitive, describes "how similar" the two texts are
#    - Weakness: sensitive to extra text (insertions)
#
# 2. WER (Word Error Rate), via jiwer
#    - Measures errors at the WORD level
#    - Formula: (Substitutions + Insertions + Deletions) / total words in the ground truth
#    - Scale: 0%+ (lower is better; can exceed 100% when there are many insertions)
#    - Suited to: checking whether key words were read correctly
#
# 3. CER (Character Error Rate), via jiwer
#    - Measures errors at the CHARACTER level
#    - Formula: (Substitutions + Insertions + Deletions) / total characters in the ground truth
#    - Scale: 0%+ (lower is better)
#    - MOST RELEVANT for verifying numeric fields (Rp amounts, IDPEL, receipt numbers, etc.)
#      because it registers an error as small as a single digit
#
# RECOMMENDATION: use ALL THREE metrics. CER is the primary metric for
# numeric-field verification, WER complements it at the word level, and
# Raw Similarity gives an overall picture of document similarity.
# ==========================================

# re and jiwer may already be imported earlier in the notebook;
# they are repeated here so this cell is self-contained.
import re
from difflib import SequenceMatcher
from jiwer import wer, cer

def normalize_to_single_line(text):
    """
    Collapse all whitespace (newlines, tabs, multiple spaces) into single spaces.
    For text verification, what matters is the CONTENT, not the layout or line position.
    """
    return re.sub(r'\s+', ' ', text.strip())

def calculate_raw_document_similarity(ground_truth, ocr_output):
    """
    Measure document similarity in SINGLE-LINE mode.
    All whitespace is normalized before comparison,
    so layout/newline differences do not affect the score.

    Uses difflib.SequenceMatcher, which recursively finds the longest
    contiguous matching blocks (similar to Ratcliff/Obershelp).

    Returns a dict with the similarity as a percentage (0-100%),
    the number of matching characters, and both normalized lengths.
    """
    # Normalize to single-line before comparing
    gt_line = normalize_to_single_line(ground_truth)
    ocr_line = normalize_to_single_line(ocr_output)

    matcher = SequenceMatcher(None, gt_line, ocr_line)
    similarity = matcher.ratio() * 100

    # The last matching block is a zero-size sentinel, so it is skipped
    matching_blocks = matcher.get_matching_blocks()
    matching_chars = sum(block.size for block in matching_blocks[:-1])

    return {
        'raw_similarity': similarity,
        'matching_chars': matching_chars,
        'gt_length': len(gt_line),
        'ocr_length': len(ocr_line)
    }

def calculate_wer_cer_jiwer(ground_truth, ocr_output):
    """
    Compute WER and CER in SINGLE-LINE mode using jiwer.

    WER (Word Error Rate): percentage of wrong words (substitutions + insertions + deletions)
    CER (Character Error Rate): percentage of wrong characters

    Lower is better. 0% = perfect.
    """
    # Normalize to single-line
    gt_line = normalize_to_single_line(ground_truth)
    ocr_line = normalize_to_single_line(ocr_output)

    # Edge case: treat an empty string on either side as 100% error
    # instead of calling jiwer
    if not gt_line or not ocr_line:
        return {'wer': 100.0, 'cer': 100.0}

    wer_value = wer(gt_line, ocr_line)
    cer_value = cer(gt_line, ocr_line)

    return {
        'wer': wer_value * 100,
        'cer': cer_value * 100
    }

print("✅ Metric functions are ready (mode: SINGLE-LINE)")
print("📦 Raw Document Similarity: SequenceMatcher (overall similarity)")
print("📦 WER: jiwer library (word-level errors)")
print("📦 CER: jiwer library (character-level errors, most relevant for numeric-field verification)")
print("📝 All comparisons are done in single-line mode (layout is ignored)")
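
The formula quoted in the comments, (Substitutions + Insertions + Deletions) / N, can be checked by hand with a plain edit-distance routine. The sketch below does not use jiwer; `edit_distance` and `error_rate_percent` are illustrative names, and jiwer applies its own default text transforms, so its numbers may differ slightly on real data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate_percent(ref_units, hyp_units):
    """(S + I + D) / N * 100, where N is the number of reference units."""
    return edit_distance(ref_units, hyp_units) / len(ref_units) * 100

gt = "Total Rp 150.000"
ocr = "Total Rp 160.000"   # one digit misread

wer_manual = error_rate_percent(gt.split(), ocr.split())  # 1 wrong word out of 3
cer_manual = error_rate_percent(gt, ocr)                  # 1 wrong character out of 16
print(f"WER: {wer_manual:.2f}%  CER: {cer_manual:.2f}%")
```

A single misread digit costs a whole word at the word level but only one character at the character level, which is why the two rates are reported side by side.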

In [None]:
# ================================================
# ACCURACY, WER, AND CER TESTING (SINGLE-LINE)
# ================================================

print("\n" + "=" * 80)
print("📊 OCR ACCURACY TESTING (Mode: Single-Line)")
print("=" * 80)

testing_results = []

for filename, ocr_texts in extracted_texts.items():
    print(f"\n{'=' * 80}")
    print(f"📄 Testing File: {filename}")
    print(f"{'=' * 80}")

    if filename not in GROUND_TRUTH:
        print(f"⚠️  No ground truth available for {filename}")
        continue

    ground_truth = GROUND_TRUTH[filename]
    ocr_output = '\n'.join(ocr_texts)

    # Show single-line info
    gt_single = normalize_to_single_line(ground_truth)
    ocr_single = normalize_to_single_line(ocr_output)

    print(f"\n📏 Basic Information (Single-Line):")
    print(f"   Ground Truth: {len(gt_single)} characters")
    print(f"   OCR Output:   {len(ocr_single)} characters")
    print(f"   Difference:   {abs(len(gt_single) - len(ocr_single))} characters")

    # Compute Raw Document Similarity
    similarity_metrics = calculate_raw_document_similarity(ground_truth, ocr_output)
    print(f"\n📊 Raw Document Similarity: {similarity_metrics['raw_similarity']:.2f}%")
    print(f"   Matching Characters: {similarity_metrics['matching_chars']}")

    # Compute WER and CER
    wer_cer_metrics = calculate_wer_cer_jiwer(ground_truth, ocr_output)
    print(f"📊 WER: {wer_cer_metrics['wer']:.2f}%")
    print(f"📊 CER: {wer_cer_metrics['cer']:.2f}%")

    # Store the result
    result = {
        'filename': filename,
        'ground_truth_length': len(gt_single),
        'ocr_output_length': len(ocr_single),
        'raw_similarity': similarity_metrics['raw_similarity'],
        'matching_chars': similarity_metrics['matching_chars'],
        'wer': wer_cer_metrics['wer'],
        'cer': wer_cer_metrics['cer'],
        'ground_truth': ground_truth,
        'ocr_output': ocr_output
    }
    testing_results.append(result)

    # Single-line comparison preview (first 150 characters)
    print(f"\n📝 Single-Line Preview (first 150 characters):")
    print(f"   GT:  {gt_single[:150]}...")
    print(f"   OCR: {ocr_single[:150]}...")

print(f"\n\n{'=' * 80}")
print("✅ TESTING COMPLETE")
print(f"{'=' * 80}")
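
To see where the Raw Document Similarity number comes from, the sketch below reproduces `SequenceMatcher.ratio()` by hand as 2 * M / T, where M is the total size of the matching blocks and T is the combined length of both strings. The sample strings are made up, and `to_single_line` is a local stand-in with the same idea as `normalize_to_single_line`.

```python
import re
from difflib import SequenceMatcher

def to_single_line(text):
    """Collapse every whitespace run into one space."""
    return re.sub(r'\s+', ' ', text.strip())

gt = "Total\nRp 150.000"      # ground truth with a line break
ocr = "Total   Rp 160.000"    # OCR output with extra spaces and one wrong digit

a, b = to_single_line(gt), to_single_line(ocr)
matcher = SequenceMatcher(None, a, b)

# Sum of block sizes (the trailing zero-size sentinel adds nothing)
matched = sum(block.size for block in matcher.get_matching_blocks())
ratio_by_hand = 2 * matched / (len(a) + len(b))

print(matched, ratio_by_hand, matcher.ratio())
```

After normalization the layout differences disappear, and only the misread digit lowers the score.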

In [None]:
# ================================================
# TEST RESULT SUMMARY (SINGLE-LINE)
# ================================================

if testing_results:
    print("\n" + "=" * 80)
    print("📊 TEST RESULT SUMMARY FOR ALL DOCUMENTS (Single-Line)")
    print("=" * 80)

    avg_raw_sim = sum(r['raw_similarity'] for r in testing_results) / len(testing_results)
    avg_wer = sum(r['wer'] for r in testing_results) / len(testing_results)
    avg_cer = sum(r['cer'] for r in testing_results) / len(testing_results)

    print(f"\n{'No':<4} {'Document':<35} {'Sim (%)':<12} {'WER (%)':<12} {'CER (%)':<12}")
    print("-" * 80)

    for i, result in enumerate(testing_results, 1):
        print(f"{i:<4} {result['filename']:<35} "
              f"{result['raw_similarity']:<12.2f} "
              f"{result['wer']:<12.2f} "
              f"{result['cer']:<12.2f}")

    print("-" * 80)
    print(f"{'AVERAGE':<39} {avg_raw_sim:<12.2f} {avg_wer:<12.2f} {avg_cer:<12.2f}")
    print("=" * 80)

    print(f"\n📈 Interpretation:")

    if avg_raw_sim >= 95:
        print(f"   ✅ Raw Document Similarity: EXCELLENT ({avg_raw_sim:.2f}%)")
    elif avg_raw_sim >= 90:
        print(f"   ✅ Raw Document Similarity: GOOD ({avg_raw_sim:.2f}%)")
    elif avg_raw_sim >= 85:
        print(f"   ⚠️  Raw Document Similarity: FAIR ({avg_raw_sim:.2f}%)")
    else:
        print(f"   ❌ Raw Document Similarity: POOR ({avg_raw_sim:.2f}%)")

    if avg_wer <= 10:
        print(f"   ✅ WER: EXCELLENT ({avg_wer:.2f}%)")
    elif avg_wer <= 20:
        print(f"   ✅ WER: GOOD ({avg_wer:.2f}%)")
    elif avg_wer <= 30:
        print(f"   ⚠️  WER: FAIR ({avg_wer:.2f}%)")
    else:
        print(f"   ❌ WER: POOR ({avg_wer:.2f}%)")

    if avg_cer <= 5:
        print(f"   ✅ CER: EXCELLENT ({avg_cer:.2f}%)")
    elif avg_cer <= 10:
        print(f"   ✅ CER: GOOD ({avg_cer:.2f}%)")
    elif avg_cer <= 15:
        print(f"   ⚠️  CER: FAIR ({avg_cer:.2f}%)")
    else:
        print(f"   ❌ CER: POOR ({avg_cer:.2f}%)")

    print(f"\n💡 Notes:")
    print(f"   • Evaluation mode: SINGLE-LINE (layout/newlines are ignored)")
    print(f"   • Raw Similarity: overall text similarity (0-100%)")
    print(f"   • WER: word-level errors (lower is better)")
    print(f"   • CER: character-level errors (lower is better)")

else:
    print("\n⚠️  No test results. Make sure the ground truth has been configured.")
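
The three if/elif ladders above share one shape, so the rating bands could also be kept as data. The sketch below is one possible arrangement, not the notebook's code: `grade_metric` and the `*_BANDS` tables are hypothetical names, the cut-offs are copied from the cell above, and the labels are English renderings of the four rating levels.

```python
def grade_metric(value, bands, higher_is_better):
    """Return the label of the first band the value satisfies, else 'POOR'."""
    for limit, label in bands:
        passed = value >= limit if higher_is_better else value <= limit
        if passed:
            return label
    return 'POOR'

# Cut-offs copied from the summary cell, ordered from best to worst
SIM_BANDS = [(95, 'EXCELLENT'), (90, 'GOOD'), (85, 'FAIR')]   # higher is better
WER_BANDS = [(10, 'EXCELLENT'), (20, 'GOOD'), (30, 'FAIR')]   # lower is better
CER_BANDS = [(5, 'EXCELLENT'), (10, 'GOOD'), (15, 'FAIR')]    # lower is better

print(grade_metric(96.2, SIM_BANDS, higher_is_better=True))
print(grade_metric(12.0, CER_BANDS, higher_is_better=False))
print(grade_metric(31.0, WER_BANDS, higher_is_better=False))
```

Keeping the bands in one place makes it easier to adjust a cut-off without touching the printing logic.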

In [None]:
# ================================================
# DETAILED PER-DOCUMENT ANALYSIS
# ================================================
# Shows the ground truth text next to the OCR output
# to help identify error patterns and areas for improvement.

if testing_results:
    print("=" * 80)
    print("🔍 DETAILED PER-DOCUMENT ANALYSIS")
    print("=" * 80)

    for i, result in enumerate(testing_results, 1):
        print(f"\n{'─' * 80}")
        print(f"📄 [{i}] {result['filename']}")
        print(f"   Similarity: {result['raw_similarity']:.2f}% | "
              f"WER: {result['wer']:.2f}% | CER: {result['cer']:.2f}%")
        print(f"   GT: {result['ground_truth_length']} chars | "
              f"OCR: {result['ocr_output_length']} chars | "
              f"Difference: {result['ocr_output_length'] - result['ground_truth_length']:+d} chars")
        print(f"{'─' * 80}")

        gt_single = normalize_to_single_line(result['ground_truth'])
        ocr_single = normalize_to_single_line(result['ocr_output'])

        # Show the full text (single-line, truncated when too long)
        max_display = 500
        print(f"\n   📗 Ground Truth (single-line):")
        if len(gt_single) > max_display:
            print(f"   {gt_single[:max_display]}... [{len(gt_single)} total chars]")
        else:
            print(f"   {gt_single}")

        print(f"\n   📘 OCR Output (single-line):")
        if len(ocr_single) > max_display:
            print(f"   {ocr_single[:max_display]}... [{len(ocr_single)} total chars]")
        else:
            print(f"   {ocr_single}")

        # Compare UNIQUE lowercase words (sets ignore repeats and word order)
        gt_words = set(gt_single.lower().split())
        ocr_words = set(ocr_single.lower().split())
        common = gt_words & ocr_words
        missing = gt_words - ocr_words
        extra = ocr_words - gt_words

        print(f"\n   📊 Word Analysis (unique words):")
        if gt_words:
            print(f"      Matching words: {len(common)}/{len(gt_words)} "
                  f"({len(common)/len(gt_words)*100:.0f}%)")
        if missing:
            # Sorted so the sample is the same on every run
            missing_sample = sorted(missing)[:10]
            print(f"      Missing words:  {len(missing)}, e.g.: {', '.join(missing_sample)}")
        if extra:
            extra_sample = sorted(extra)[:10]
            print(f"      Extra words:    {len(extra)}, e.g.: {', '.join(extra_sample)}")

    # Overall summary
    print(f"\n\n{'=' * 80}")
    print(f"📊 ANALYSIS CONCLUSION")
    print(f"{'=' * 80}")

    best = max(testing_results, key=lambda r: r['raw_similarity'])
    worst = min(testing_results, key=lambda r: r['raw_similarity'])

    print(f"\n   🏆 Best:  {best['filename']} ({best['raw_similarity']:.2f}%)")
    print(f"   📉 Worst: {worst['filename']} ({worst['raw_similarity']:.2f}%)")

    avg_sim = sum(r['raw_similarity'] for r in testing_results) / len(testing_results)
    avg_cer_val = sum(r['cer'] for r in testing_results) / len(testing_results)

    print(f"\n   📈 Average Similarity: {avg_sim:.2f}%")
    print(f"   📈 Average CER: {avg_cer_val:.2f}%")

    if avg_cer_val <= 10:
        print(f"\n   ✅ CER at or below 10%: good enough for numeric-field verification")
    elif avg_cer_val <= 20:
        print(f"\n   ⚠️  CER 10-20%: needs improvement for accurate verification")
    else:
        print(f"\n   ❌ CER above 20%: further optimization is still needed")

    print(f"\n   💡 Tips for improvement:")
    print(f"      • Scan/photo quality strongly affects OCR results")
    print(f"      • Documents with watermarks/stamps tend to have a higher CER")
    print(f"      • Clean digital documents (such as PALANGGA) can reach a CER below 1%")

else:
    print("⚠️  No test results to analyze.")
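
The word analysis above compares sets, so a word that appears several times in the ground truth but only once in the OCR output is not reported as missing. The sketch below contrasts the set view with a `collections.Counter` multiset view on made-up text; it is an optional alternative, not part of the notebook's metrics.

```python
from collections import Counter

gt_text = "rp 150.000 rp 150.000 total"   # amount appears twice
ocr_text = "rp 150.000 total"             # OCR kept only one occurrence

# Set-based view: duplicates collapse, so nothing appears to be missing
missing_set = set(gt_text.split()) - set(ocr_text.split())

# Multiset view: Counter subtraction keeps how many occurrences were lost
missing_counts = Counter(gt_text.split()) - Counter(ocr_text.split())

print(missing_set)
print(missing_counts)
```

For documents where the same amount or ID is repeated, the multiset view gives a closer picture of what was dropped.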