# Receipt OCR → JSON Pipeline 

## Purpose
This notebook documents an end-to-end pipeline that turns receipt
images into structured data (a final JSON document) ready for use
by an application.

The pipeline covers:
1. OCR (PaddleOCR)
2. Item-line classification with an ML model (Random Forest, frozen)
3. Multi-line item grouping
4. Item parsing (name, quantity, price)
5. Receipt metadata extraction (merchant, date, payment)
6. Final JSON assembly

Important notes:
- No ML training happens in this notebook
- The ML model used here has already been validated and frozen
- The focus of this notebook is engineering and system integration
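
As a rough picture of where the pipeline is heading, the sketch below assembles one receipt into JSON. The schema (`merchant`, `datetime`, `payment_method`, `items`, `grand_total`) is an assumption made for illustration, not the notebook's confirmed output format; the values are taken from the sample receipt processed later in this notebook.

```python
import json

# Hypothetical output schema: every field name here is an assumption.
receipt = {
    "merchant": "Jenar Kopi Kal1asem",   # raw OCR text, left uncorrected
    "datetime": "2025-12-25T11:43:00",
    "payment_method": None,               # unknown until metadata extraction runs
    "items": [
        {"name": "Es Rosella Susu Regular", "qty": 1,
         "unit_price": 20000, "total": 20000},
    ],
    "grand_total": 44000,
}

# ensure_ascii=False keeps any non-ASCII merchant or item names readable
receipt_json = json.dumps(receipt, ensure_ascii=False, indent=2)
print(receipt_json)
```

Keeping amounts as integers (rupiah without decimals) avoids float rounding when totals are summed downstream.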


In [278]:
# standard library
import json
import re
from pathlib import Path
from typing import List, Dict, Any

# data
import pandas as pd
import numpy as np

# OCR
from paddleocr import PaddleOCR

# model loading
import joblib


## Step 2 — Load ML Model (Item Line Classifier)

In this step we load:
- The frozen Random Forest model
- The feature list (feature contract) that must be used at inference time

The model has a single job, which is to answer:
"Is this OCR line part of a purchased item or not?"

No training happens here.


In [279]:
# path to the model artifact
MODEL_PATH = Path("../models/rf_itemline_bundle.pkl")

artifact = joblib.load(MODEL_PATH)

rf_model = artifact["model"]
FEATURE_COLUMNS = artifact["features"]

print("Model loaded.")
print("Number of features:", len(FEATURE_COLUMNS))


Model loaded.
Number of features: 12
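
The feature contract only helps if inference code enforces it. A minimal sketch of such a check, in plain Python, is shown below; `align_to_contract` is a hypothetical helper, and the three-name `contract` list merely stands in for the 12-entry `FEATURE_COLUMNS` loaded above.

```python
from typing import Dict, List


def align_to_contract(row: Dict[str, float], feature_columns: List[str]) -> List[float]:
    """Return feature values in contract order; fail loudly on a missing feature."""
    missing = [c for c in feature_columns if c not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Extra keys in `row` are ignored; the order comes only from the contract.
    return [row[c] for c in feature_columns]


contract = ["x", "y", "has_price"]  # stand-in for FEATURE_COLUMNS
row = {"has_price": 1, "y": 0.49, "x": 0.17, "extra": 99}
print(align_to_contract(row, contract))
```

Failing on a missing feature is usually preferable to filling it with a default, because a silently defaulted column changes predictions without raising any error.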


## Step 3 — Initialize the OCR Engine

OCR extracts the text and its coordinates from the receipt image.
We use PaddleOCR with:
- document orientation classification and unwarping (for rotated or crumpled receipts)
- text-line orientation classification (for skewed text)
- a Latin-script recognition model (covers the text commonly found on Indonesian receipts)

OCR runs per image (1 receipt per call).


In [280]:
# Init OCR (only needs to run once)
ocr_model = PaddleOCR(
    # 1. Document orientation (image rotation)
    use_doc_orientation_classify=True,  # remember to set this to True
    doc_orientation_classify_model_name='PP-LCNet_x1_0_doc_ori',

    # 2. Document unwarping (flattens crumpled paper)
    use_doc_unwarping=True,             # remember to set this to True
    doc_unwarping_model_name='UVDoc',

    # 3. Text detection (locates text boxes)
    text_detection_model_name='PP-OCRv5_server_det',

    # 4. Per-line text orientation
    use_textline_orientation=True,      # optional; set to False for more speed
    textline_orientation_model_name='PP-LCNet_x1_0_textline_ori',

    # 5. Text recognition (reads the characters)
    text_recognition_model_name='latin_PP-OCRv5_mobile_rec',

    # 6. Detection input size (important for long receipts)
    text_det_limit_side_len=1200,
    text_det_limit_type='max',

    # 7. Thresholding (fine-tunes detection)
    text_det_thresh=0.4,
    text_det_box_thresh=0.5,
    text_det_unclip_ratio=2,

    # 8. Additional parameters (if needed)
    text_rec_score_thresh=0.5,          # minimum confidence score for OCR results
    return_word_box=False,              # False for line-level boxes, True for word-level
)


Creating model: ('PP-LCNet_x1_0_doc_ori', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\igust\.paddlex\official_models\PP-LCNet_x1_0_doc_ori`.
Creating model: ('UVDoc', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\igust\.paddlex\official_models\UVDoc`.
Creating model: ('PP-LCNet_x1_0_textline_ori', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\igust\.paddlex\official_models\PP-LCNet_x1_0_textline_ori`.
Creating model: ('PP-OCRv5_server_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `C:\Users\igust\.paddlex\official_models\PP-OCRv5_server_det`.
Creating model: ('latin_PP-OCRv5_mobile_rec', None)
Model files already exist. Using cac

## Step 4 — OCR Function (`run_ocr_from_engine`)

This function:
- Takes the path to a receipt image
- Runs OCR
- Converts the PaddleOCR output into a DataFrame of OCR lines

The resulting DataFrame is the main input for the ML and grouping steps.
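
The geometric columns in that DataFrame are normalized by the image size, so they stay comparable across receipts photographed at different resolutions. The arithmetic can be sketched on a single box as follows; `normalize_box` is a hypothetical helper and the box and image size are made-up numbers.

```python
def normalize_box(box, image_width, image_height):
    """Convert an (x_min, y_min, x_max, y_max) pixel box into normalized features."""
    x_min, y_min, x_max, y_max = box
    return {
        "x_center_norm": ((x_min + x_max) / 2) / image_width,
        "y_center_norm": ((y_min + y_max) / 2) / image_height,
        "box_width_norm": (x_max - x_min) / image_width,
    }


# Made-up example: a 200 px wide box on a 1000 x 2000 px image
feats = normalize_box((100, 200, 300, 240), image_width=1000, image_height=2000)
print(feats)
```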


In [281]:
from PIL import Image

def run_ocr_from_engine(image_path: Path, ocr_engine) -> pd.DataFrame:
    """
    Run PaddleOCR directly on an image and return line-level OCR DataFrame
    using rec_boxes (x_min, y_min, x_max, y_max).

    This version is CONSISTENT with ML training features.
    """
    # read the image size (needed for normalization); the file is closed on exit
    with Image.open(image_path) as img:
        IMAGE_WIDTH, IMAGE_HEIGHT = img.size

    # run OCR
    result = ocr_engine.predict(str(image_path))

    if not result or len(result) == 0:
        # same columns as the non-empty case, so downstream code sees a stable schema
        return pd.DataFrame(columns=[
            "line_id",
            "text",
            "ocr_conf",
            "x_center_norm",
            "y_center_norm",
            "box_width_norm",
        ])

    page = result[0]

    texts = page.get("rec_texts", [])
    boxes = page.get("rec_boxes", [])
    scores = page.get("rec_scores", [])

    rows = []

    for line_id, text in enumerate(texts):
        if line_id >= len(boxes):
            continue

        x_min, y_min, x_max, y_max = boxes[line_id]
        score = scores[line_id] if line_id < len(scores) else None

        # compute center and width (MUST MATCH TRAINING)
        x_center_norm = ((x_min + x_max) / 2) / IMAGE_WIDTH
        y_center_norm = ((y_min + y_max) / 2) / IMAGE_HEIGHT
        box_width_norm = (x_max - x_min) / IMAGE_WIDTH

        rows.append({
            "line_id": line_id,
            "text": text,
            "ocr_conf": round(float(score), 4) if score is not None else None,
            "x_center_norm": float(x_center_norm),
            "y_center_norm": float(y_center_norm),
            "box_width_norm": float(box_width_norm),
        })

    return pd.DataFrame(rows)


## Step 5 — Feature Engineering (Inference)

In this step we:
- Use the same `build_features` function as in training
- Guarantee that features are consistent between training and inference
- Produce the feature matrix for ML prediction

Important:
- The features must not be changed at this stage
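
Two of the features are regex flags, and it helps to see what they fire on. The sketch below copies the two patterns used by the helper functions in this step and applies them to a few strings; `text_flags` is a hypothetical wrapper, and the last two sample strings are made up.

```python
import re

# Patterns copied from this step's helper functions
price_regex = re.compile(r"\d+[.,]\d+")
qty_regex = re.compile(r"(\b\d+\s*[xX@]\s*\d+\b)|(\b\d+\s*[xX]\b)")


def text_flags(text):
    """Return (has_price, has_qty) as 0/1 integers."""
    return (
        int(bool(price_regex.search(text))),
        int(bool(qty_regex.search(text))),
    )


for sample in ["@20.000", "1x", "Es Rosella Susu Regular", "2 x 15.000", "v1.5"]:
    print(f"{sample!r:28} -> {text_flags(sample)}")
```

The price pattern accepts any digits on both sides of the separator, so a string such as `v1.5` also raises the flag. Because the model is frozen, the pattern has to stay as it was during training; the flag is best read as a soft signal rather than proof of a price.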


In [282]:

price_regex = re.compile(r"\d+[.,]\d+") 

def has_price_pattern(text):
    """
    Check whether the text contains a number that looks like a price or amount.

    The pattern matches one or more digits, a separator (dot or comma), and
    one or more digits (e.g. '10.500' or '99,99'). The number of digits after
    the separator is not restricted.

    Args:
        text (str): Input string to check.

    Returns:
        int: 1 if the pattern is found, 0 otherwise (also 0 for non-string input).
    """
    if not isinstance(text, str):
        return 0
    return int(bool(price_regex.search(text)))


# compiled once at import time instead of on every call
qty_regex = re.compile(r"(\b\d+\s*[xX@]\s*\d+\b)|(\b\d+\s*[xX]\b)")


def has_qty_pattern(text):
    """
    Check whether the text contains a quantity or dimension pattern.

    Detects "number x number" and "number @ number", as well as a number
    followed by 'x' alone (e.g. '1x'). The 'x' matches in either case.

    Args:
        text (str): Input string to check.

    Returns:
        int: 1 if the pattern is found, 0 otherwise (also 0 for non-string input).
    """
    if not isinstance(text, str):
        return 0
    return int(bool(qty_regex.search(text)))


def digit_ratio(text):
    """
    Compute the density of digits in a string.

    The value is the number of digits divided by the total number of
    characters in the text (including spaces, punctuation, etc.).

    Args:
        text (str): Input string to analyze.

    Returns:
        float: Ratio between 0.0 and 1.0.
               Returns 0.0 for an empty string or non-string input.
    """
    if not isinstance(text, str) or len(text) == 0:
        return 0.0
    return sum(c.isdigit() for c in text) / len(text)

 
def upper_ratio(text):
    """
    Compute the ratio of uppercase letters in a string.

    Non-alphabetic characters (digits, spaces, symbols) are filtered out
    before the ratio is computed.

    Args:
        text (str): Input string to analyze.

    Returns:
        float: Ratio between 0.0 and 1.0.
               Returns 0.0 if the text contains no letters at all.
    """
    if not isinstance(text, str) or len(text) == 0:
        return 0.0
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def contains_keywords(text, keywords):
    """
    Detect whether any of the given keywords occurs in the text (binary feature).

    The text is lowercased before matching, so keywords should be given in
    lowercase. The function returns an integer so the result can be used
    directly as a numeric input to the ML model.

    Args:
        text (str): Input string to check.
        keywords (list of str): Keywords to look for.

    Returns:
        int: 1 if at least one keyword is found, 0 if none is found.
    """
    if not isinstance(text, str):
        return 0
    text = text.lower()
    return int(any(k in text for k in keywords))


In [283]:
def build_features(df):
    """
    Build the feature set (feature engineering) from raw OCR data.

    Combines spatial features (position), contextual features (distance
    between lines), text statistics, and semantic signals (keywords) into a
    single DataFrame.

    These features are used to classify whether an OCR text line is part of
    a purchased item or not.

    Args:
        df (pd.DataFrame): Input DataFrame that MUST contain the columns:
            - 'receipt_id': Unique ID of each receipt.
            - 'text': String read by OCR.
            - 'x_center_norm': Normalized horizontal position.
            - 'y_center_norm': Normalized vertical position (also used for
              the distance to the previous and next line).

    Returns:
        pd.DataFrame: New DataFrame of numeric feature columns, ready for
                      model training or inference.
    """

    feats = pd.DataFrame()

    # position
    feats["x"] = df["x_center_norm"]
    feats["y"] = df["y_center_norm"]

    # dy prev / next
    feats["dy_prev"] = df.groupby("receipt_id")["y_center_norm"].diff().fillna(0)
    feats["dy_next"] = df.groupby("receipt_id")["y_center_norm"].diff(-1).fillna(0).abs()

    # text stats
    feats["text_len"] = df["text"].str.len()
    feats["digit_ratio"] = df["text"].apply(digit_ratio)
    feats["upper_ratio"] = df["text"].apply(upper_ratio)

    # regex flags
    feats["has_price"] = df["text"].apply(has_price_pattern).astype(int)
    feats["has_qty"] = df["text"].apply(has_qty_pattern).astype(int)

    # keyword hints (soft signal)
    feats["kw_total"] = df["text"].apply(
        lambda x: contains_keywords(x, ["total", "subtotal", "grand total", "total belanja",])
    )
    feats["kw_cash"] = df["text"].apply(
        lambda x: contains_keywords(x, ["cash", "tunai", "qris", "pembayaran",])
    )
    feats["kw_disc"] = df["text"].apply(
        lambda x: contains_keywords(x, ["disc", "ppn", "bkp",])
    )

    return feats


In [284]:
# Make sure build_features, digit_ratio, upper_ratio, etc.
# are already defined / imported in this notebook

def predict_item_lines(df_ocr: pd.DataFrame) -> pd.DataFrame:
    """
    Predict which OCR lines are item lines using the frozen ML model.
    """
    df = df_ocr.copy()
    df["receipt_id"] = "temp_receipt"  # dummy ID for grouping

    X = build_features(df)
    X = X[FEATURE_COLUMNS]

    df["is_item_line"] = rf_model.predict(X)
    return df


## Step 6 — Sanity Check OCR + ML Output

In this cell we:
- Run OCR on 1 sample image
- Look at the predicted is_item_line values

The goal is:
- To confirm that the OCR → ML pipeline runs without errors
- NOT to evaluate performance


In [285]:
test_img = Path('../struk/46.jpeg')

df_ocr = run_ocr_from_engine(test_img, ocr_model)
df_pred = predict_item_lines(df_ocr)

df_pred.head(30)


Unnamed: 0,line_id,text,ocr_conf,x_center_norm,y_center_norm,box_width_norm,receipt_id,is_item_line
0,0,Jenar Kopi Kal1asem,0.9643,0.325391,0.158854,0.269531,temp_receipt,0
1,1,No,0.9996,0.052344,0.238542,0.04375,temp_receipt,0
2,2,: JKK01202512250148,0.9739,0.328516,0.2375,0.269531,temp_receipt,0
3,3,Tangga 1,0.923,0.089063,0.284375,0.110937,temp_receipt,0
4,4,: 25-12-2025 11:43,0.9774,0.322266,0.28125,0.253906,temp_receipt,0
5,5,Mode,0.9918,0.071094,0.325,0.071875,temp_receipt,0
6,6,: DINE IN,0.9095,0.262891,0.323437,0.130469,temp_receipt,0
7,7,Kasir,0.999,0.080469,0.367708,0.085938,temp_receipt,0
8,8,: PASEK,0.9594,0.247656,0.364063,0.10625,temp_receipt,0
9,9,Es Rosella Susu Regular,0.9996,0.205078,0.448437,0.325781,temp_receipt,0


## Initial Interpretation

Look at the `is_item_line` column:
- 1 → candidate item line
- 0 → header, footer, metadata

At this stage:
- False positives are still acceptable
- False negatives should be kept to a minimum

Lines with `is_item_line == 1` are processed further
in the grouping and parsing steps.
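
One common way to trade extra false positives for fewer false negatives is to threshold the positive-class probability below 0.5 instead of using the hard prediction. The notebook itself calls `rf_model.predict`; the sketch below is only an illustration of the idea on made-up probabilities, assuming such scores are available (scikit-learn's `RandomForestClassifier` exposes them through `predict_proba`). The helper name and the 0.3 threshold are hypothetical.

```python
def flag_item_lines(probabilities, threshold=0.3):
    """Mark a line as an item candidate when its positive-class probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]


probs = [0.05, 0.35, 0.62, 0.48]             # made-up scores for four OCR lines
print(flag_item_lines(probs, threshold=0.5))  # stricter: fewer candidates
print(flag_item_lines(probs, threshold=0.3))  # looser: more candidates, fewer misses
```

Any change to the threshold would need to be validated on labeled receipts, since the frozen model was validated with its default decision rule.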


In [286]:
# ============================================
# STEP A.1 — SORT OCR LINES
# ============================================

# REQUIRED:
# - grouping is only valid once the lines are
#   sorted from top to bottom
df_pred = df_pred.sort_values(
    by=["y_center_norm"]
).reset_index(drop=True)

df_pred.head(35)

Unnamed: 0,line_id,text,ocr_conf,x_center_norm,y_center_norm,box_width_norm,receipt_id,is_item_line
0,0,Jenar Kopi Kal1asem,0.9643,0.325391,0.158854,0.269531,temp_receipt,0
1,2,: JKK01202512250148,0.9739,0.328516,0.2375,0.269531,temp_receipt,0
2,1,No,0.9996,0.052344,0.238542,0.04375,temp_receipt,0
3,4,: 25-12-2025 11:43,0.9774,0.322266,0.28125,0.253906,temp_receipt,0
4,3,Tangga 1,0.923,0.089063,0.284375,0.110937,temp_receipt,0
5,6,: DINE IN,0.9095,0.262891,0.323437,0.130469,temp_receipt,0
6,5,Mode,0.9918,0.071094,0.325,0.071875,temp_receipt,0
7,8,: PASEK,0.9594,0.247656,0.364063,0.10625,temp_receipt,0
8,7,Kasir,0.999,0.080469,0.367708,0.085938,temp_receipt,0
9,9,Es Rosella Susu Regular,0.9996,0.205078,0.448437,0.325781,temp_receipt,0


In [287]:
# ============================================
# STEP A.2 — INITIALIZE GROUP ID
# ============================================

# group_id is filled in sequentially
df_pred["group_id"] = -1


In [288]:
# ============================================
# FIX: ENSURE NUMERIC DTYPES
# ============================================

numeric_cols = [
    "x_center_norm",
    "y_center_norm",
    "box_width_norm",
    "ocr_conf"
]

for col in numeric_cols:
    df_pred[col] = pd.to_numeric(
        df_pred[col],
        errors="coerce"   # junk values → NaN
    )


In [289]:
# ============================================
# STEP A.3 — GROUP BY VERTICAL DISTANCE
# ============================================

# VERTICAL GAP THRESHOLD
# ----------------------
# 0.03–0.04 is stable for most receipts
Y_GAP_THRESHOLD = 0.04

current_group_id = 0

# Process ONE RECEIPT at a time
for receipt_id, g in df_pred.groupby("receipt_id", sort=False):

    # Row indices for this receipt
    idxs = g.index.tolist()

    # Assign the first row to the first group
    df_pred.loc[idxs[0], "group_id"] = current_group_id

    # Iterate over the remaining rows
    for i in range(1, len(idxs)):
        prev_idx = idxs[i - 1]
        curr_idx = idxs[i]

        # Vertical distance between consecutive rows
        dy = (df_pred.loc[curr_idx, "y_center_norm"] - df_pred.loc[prev_idx, "y_center_norm"])

        # Still close → same group
        if dy < Y_GAP_THRESHOLD:
            df_pred.loc[curr_idx, "group_id"] = current_group_id
        else:
            # Far apart → start a new group
            current_group_id += 1
            df_pred.loc[curr_idx, "group_id"] = current_group_id

    # After finishing one receipt
    current_group_id += 1


In [290]:
# ============================================
# DEBUG: INSPECT THE GROUPING RESULT
# ============================================

for gid, g in df_pred.groupby("group_id"):
    print(f"\n--- GROUP {gid} ---")
    for _, r in g.iterrows():
        print(
            f"{r.y_center_norm:.3f} | "
            f"{r.text:25} | "
            f"ML={r.is_item_line}"
        )



--- GROUP 0 ---
0.159 | Jenar Kopi Kal1asem       | ML=0

--- GROUP 1 ---
0.237 | : JKK01202512250148       | ML=0
0.239 | No                        | ML=0

--- GROUP 2 ---
0.281 | : 25-12-2025 11:43        | ML=0
0.284 | Tangga 1                  | ML=0
0.323 | : DINE IN                 | ML=0
0.325 | Mode                      | ML=0
0.364 | : PASEK                   | ML=0
0.368 | Kasir                     | ML=0

--- GROUP 3 ---
0.448 | Es Rosella Susu Regular   | ML=0
0.487 | @20.000                   | ML=1
0.488 | 20.000                    | ML=1
0.489 | 1x                        | ML=1
0.529 | Indomie Goreng Double     | ML=0
0.567 | @15.000                   | ML=1
0.568 | 1x                        | ML=1
0.568 | 15.000                    | ML=1
0.608 | Telur Mata Sapi Setengah Matang | ML=0
0.647 | @5.000                    | ML=1
0.648 | 5.000                     | ML=1
0.650 | 1x                        | ML=0

--- GROUP 4 ---
0.724 | 3 item                    | ML=0
0.763 |

In [291]:
group_stats = (
    df_pred
    .groupby("group_id")
    .agg(
        n_rows=("is_item_line", "count"),
        n_item=("is_item_line", "sum")
    )
)

group_stats["item_ratio"] = (
    group_stats["n_item"] / group_stats["n_rows"]
)


In [292]:
group_stats.sort_values("item_ratio", ascending=False)


group_id,n_rows,n_item,item_ratio
3,12,8,0.666667
7,2,1,0.5
5,2,1,0.5
4,5,1,0.2
0,1,0,0.0
1,2,0,0.0
2,6,0,0.0
6,2,0,0.0
8,2,0,0.0


In [293]:
ITEM_ZONE_THRESHOLD = 0.5

item_group_ids = group_stats[
    group_stats["item_ratio"] >= ITEM_ZONE_THRESHOLD
].index

df_item_zone = df_pred[
    df_pred["group_id"].isin(item_group_ids)
].copy()


In [294]:
df_item_zone.head(30)

Unnamed: 0,line_id,text,ocr_conf,x_center_norm,y_center_norm,box_width_norm,receipt_id,is_item_line,group_id
9,9,Es Rosella Susu Regular,0.9996,0.205078,0.448437,0.325781,temp_receipt,0,3
10,11,@20.000,0.992,0.167969,0.4875,0.104688,temp_receipt,1,3
11,12,20.000,0.9972,0.583984,0.488021,0.097656,temp_receipt,1,3
12,10,1x,0.9956,0.065234,0.489063,0.039844,temp_receipt,1,3
13,13,Indomie Goreng Double,0.975,0.192969,0.528646,0.29375,temp_receipt,0,3
14,15,@15.000,0.9677,0.169141,0.566667,0.103906,temp_receipt,1,3
15,14,1x,0.9814,0.066797,0.567708,0.039844,temp_receipt,1,3
16,16,15.000,0.9965,0.583984,0.568229,0.092969,temp_receipt,1,3
17,17,Telur Mata Sapi Setengah Matang,0.9993,0.260937,0.607812,0.43125,temp_receipt,0,3
18,19,@5.000,0.9596,0.162891,0.647396,0.091406,temp_receipt,1,3


#### Classify EVERY line

#### Group lines into one group per item

In [295]:
def group_item_lines_with_ml(
    df_item_zone,
    y_gap_item=0.01,
):
    """
    Group OCR lines into item candidates by Y distance,
    carrying the is_item_line flag along for later filtering.
    """

    items = []
    current_item = None

    df_item_zone = df_item_zone.sort_values("y_center_norm")

    for _, row in df_item_zone.iterrows():
        text = row["text"]
        y = row["y_center_norm"]
        ml = int(row["is_item_line"])

        if current_item is None:
            current_item = {
                "item": [text],
                "ys": [y],
                "ml_flags": [ml],
            }
            continue

        dy = y - current_item["ys"][-1]

        # still the same item (small gap)
        if dy <= y_gap_item:
            current_item["item"].append(text)
            current_item["ys"].append(y)
            current_item["ml_flags"].append(ml)
            continue

        # large gap → close the previous item
        items.append(current_item)
        current_item = {
            "item": [text],
            "ys": [y],
            "ml_flags": [ml],
        }

    if current_item:
        items.append(current_item)

    return items


In [296]:
items_ = group_item_lines_with_ml(df_item_zone)
items_

[{'item': ['Es Rosella Susu Regular'], 'ys': [0.4484375], 'ml_flags': [0]},
 {'item': ['@20.000', '20.000', '1x'],
  'ys': [0.4875, 0.48802083333333335, 0.4890625],
  'ml_flags': [1, 1, 1]},
 {'item': ['Indomie Goreng Double'],
  'ys': [0.5286458333333334],
  'ml_flags': [0]},
 {'item': ['@15.000', '1x', '15.000'],
  'ys': [0.5666666666666667, 0.5677083333333334, 0.5682291666666667],
  'ml_flags': [1, 1, 1]},
 {'item': ['Telur Mata Sapi Setengah Matang'],
  'ys': [0.6078125],
  'ml_flags': [0]},
 {'item': ['@5.000', '5.000', '1x'],
  'ys': [0.6473958333333333, 0.6484375, 0.65],
  'ml_flags': [1, 1, 0]},
 {'item': ['44.000', 'Grand Total :'],
  'ys': [0.8614583333333333, 0.8666666666666667],
  'ml_flags': [1, 0]},
 {'item': ['56.000', 'Kembalian :'],
  'ys': [1.003125, 1.0041666666666667],
  'ml_flags': [1, 0]}]

### Filter item groups by the proportion of lines the ML model flags as `is_item_line`

* This is needed because the previous grouping step can still let metadata, header, or footer lines through.
* The threshold applied is 0.5: at least 50% of the text lines in a group must be flagged.
  This tolerates 1 false negative per group, for example a missed item name or another important component of the item.

#### Helper function: `is_mostly_item`
Used to filter the item grouping results further.

In [297]:
def is_mostly_item(ml_flags, threshold=0.5):
    """
    Return True if the proportion of is_item_line=1 flags reaches the threshold
    """
    if not ml_flags:
        return False
    return sum(ml_flags) / len(ml_flags) >= threshold


#### Filter groups whose `is_item_line` proportion is at least 50%

In [298]:
final_items = [item for item in items_ if is_mostly_item(item["ml_flags"], threshold=0.5)]
final_items


[{'item': ['@20.000', '20.000', '1x'],
  'ys': [0.4875, 0.48802083333333335, 0.4890625],
  'ml_flags': [1, 1, 1]},
 {'item': ['@15.000', '1x', '15.000'],
  'ys': [0.5666666666666667, 0.5677083333333334, 0.5682291666666667],
  'ml_flags': [1, 1, 1]},
 {'item': ['@5.000', '5.000', '1x'],
  'ys': [0.6473958333333333, 0.6484375, 0.65],
  'ml_flags': [1, 1, 0]},
 {'item': ['44.000', 'Grand Total :'],
  'ys': [0.8614583333333333, 0.8666666666666667],
  'ml_flags': [1, 0]},
 {'item': ['56.000', 'Kembalian :'],
  'ys': [1.003125, 1.0041666666666667],
  'ml_flags': [1, 0]}]

#### Helper functions
- These support the logic that merges multi-line items on a receipt

In [299]:
def is_alpha_line(text):
    return isinstance(text, str) and any(c.isalpha() for c in text)

def is_digit_line(text):
    return isinstance(text, str) and any(c.isdigit() for c in text)

def group_alpha_ratio(lines):
    return sum(is_alpha_line(t) for t in lines) / max(len(lines), 1)

def group_digit_ratio(lines):
    return sum(is_digit_line(t) for t in lines) / max(len(lines), 1)


#### Merging multi-line items
This step is needed when a receipt splits one item across 2 separate lines.
1.  Common 1-line pattern:
    - `item name` `qty` `unit price` `price`
2.  Common 2-line pattern:
    - `item name`
    - `qty` `unit price` `price`

Notes:
  - `item name` --> usually letters, or a string with more letters than digits when digits are present
  - `qty` --> usually a single number, or the pattern `qty` + optional symbol ('x', '@') + `unit price`
  - `unit price` and `price` --> usually written as 1.000 or 1,000
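
The 1-line pattern can be illustrated with a single regular expression. This is a sketch under stated assumptions, not the notebook's parser: the names `ONE_LINE_ITEM`, `money_to_int`, and `parse_one_line_item` are hypothetical, both prices are assumed to carry a thousands separator, and the sample line is made up.

```python
import re

# Assumed layout: "<name> <qty> [x|@] <unit price> <price>",
# with "." or "," as the thousands separator in both amounts.
ONE_LINE_ITEM = re.compile(
    r"^(.+?)\s+(\d+)\s*[xX@]?\s*"
    r"(\d{1,3}(?:[.,]\d{3})+)\s+(\d{1,3}(?:[.,]\d{3})+)$"
)


def money_to_int(text):
    """'15.000' or '15,000' -> 15000"""
    return int(text.replace(".", "").replace(",", ""))


def parse_one_line_item(line):
    """Parse a 1-line item; return None when the line does not fit the assumed layout."""
    m = ONE_LINE_ITEM.match(line.strip())
    if not m:
        return None
    return {
        "name": m.group(1),
        "qty": int(m.group(2)),
        "unit_price": money_to_int(m.group(3)),
        "total": money_to_int(m.group(4)),
    }


print(parse_one_line_item("Indomie Goreng Double 2 x 15.000 30.000"))
print(parse_one_line_item("Kasir"))
```

Receipts that follow the 2-line pattern do not match this expression, which is why the merge step that follows joins the name line with the numeric line first.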

In [None]:
def merge_alpha_digit_groups(
    grouped_items,
    y_gap_merge=0.06,
    alpha_ratio_min=0.6,
    digit_ratio_min=0.6,
):
    merged = []
    i = 0

    while i < len(grouped_items):
        curr = grouped_items[i]

        # merge candidate only if a next group exists
        if i + 1 < len(grouped_items):
            nxt = grouped_items[i + 1]

            curr_alpha = group_alpha_ratio(curr["item"])
            nxt_digit = group_digit_ratio(nxt["item"])

            y_gap = nxt["ys"][0] - curr["ys"][-1]

            # ===============================
            # MERGE CONDITION
            # ===============================
            if (
                curr_alpha >= alpha_ratio_min
                and nxt_digit >= digit_ratio_min
                and y_gap <= y_gap_merge
            ):
                merged.append({
                    "Item": curr["item"] + nxt["item"]
                })
                i += 2
                continue

        # default: no merge
        merged.append({
            "Item": curr['item']
        })
        i += 1

    return merged


In [301]:
item_lines = merge_alpha_digit_groups(final_items)
item_lines

[{'Item': ['@20.000', '20.000', '1x']},
 {'Item': ['@15.000', '1x', '15.000']},
 {'Item': ['@5.000', '5.000', '1x']},
 {'Item': ['44.000', 'Grand Total :']},
 {'Item': ['56.000', 'Kembalian :']}]

#### Parsing items

In [302]:
import re

def to_int(text):
    if not text:
        return None
    nums = re.findall(r"\d+", text.replace(".", "").replace(",", ""))
    return int(nums[0]) if nums else None


In [303]:
def alpha_ratio(s):
    if not s:
        return 0
    a = sum(c.isalpha() for c in s)
    return a / max(len(s), 1)

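# NOTE: this redefines the `digit_ratio` used by `build_features`. It is kept
# behaviorally identical to the earlier definition so the features stay
# unchanged when `predict_item_lines` runs again after this cell.
def digit_ratio(s):
    if not isinstance(s, str) or len(s) == 0:
        return 0.0
    return sum(c.isdigit() for c in s) / len(s)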


In [304]:
def extract_name(lines):
    name_parts = []

    for l in lines:
        if alpha_ratio(l) >= 0.4:
            name_parts.append(l)

    if not name_parts:
        return None

    name = " ".join(name_parts)
    name = re.sub(r"\s+", " ", name).strip()
    return name


In [305]:
def extract_qty(lines):
    """
    Extract quantity from OCR item lines.
    Handles formats:
    - '12 @913'
    - '2x20000'
    - '1', '1.'
    """
    for l in lines:
        l = l.strip()

        # 12 @913 or 1 x
        m = re.search(r"(\d+)\s*[@xX]", l)
        if m:
            return int(m.group(1))

        # 2x20000
        m = re.search(r"(\d+)\s*x\s*\d+", l)
        if m:
            return int(m.group(1))

        # single small digit
        if re.fullmatch(r"\d+[.,]?", l):
            q = to_int(l)
            if q and q <= 100:
                return q

    return None



In [306]:
def extract_total(lines, min_price=100):
    candidates = []

    for l in lines:
        s = l.strip()

        # collect EVERY number pattern (including thousands-separated ones)
        nums = re.findall(r"\d{1,3}(?:[.,]\d{3})+|\d+", s)

        for n in nums:
            # skip SKUs (long digit runs without a separator)
            if re.fullmatch(r"\d{5,}", n):
                continue

            v = to_int(n)
            if v and v >= min_price:
                candidates.append(v)

    if not candidates:
        return None

    return max(candidates)


In [307]:
def extract_unit_price(lines, qty, total):
    nums = []

    for l in lines:
        if digit_ratio(l) >= 0.6:
            v = to_int(l)
            if v:
                nums.append(v)

    for v in nums:
        if qty and total and v * qty == total:
            return v

    # fallback: the largest candidate that is still below the total
    if total and nums:
        nums = sorted(nums)
        for v in reversed(nums):
            if v < total:
                return v

    return None


In [308]:
def parse_item(group):
    lines = group["Item"]

    # =====================
    # 1. CLEAN & NORMALIZE
    # =====================
    clean_lines = [l.strip() for l in lines if l.strip()]

    # =====================
    # 2. DIGIT LINES
    # =====================

    digits = []
    for l in clean_lines:
        if re.search(r"\d", l):
            digits.append(l)

    # =====================
    # 3. NAME (join all alphabetic lines)
    # =====================
    name = extract_name(lines)

    # =====================
    # 4. QTY (reuse the existing function)
    # =====================
    qty = extract_qty(digits)
    if qty is None:
        qty = 1

    # =====================
    # 5. TOTAL & UNIT PRICE
    # =====================
    total = extract_total(lines)

    unit_price = None
    if total and qty > 0:
        up = total // qty
        if up > 0:
            unit_price = up

    # =====================
    # 6. CONFIDENCE
    # =====================
    confidence = "high" if name and total else "medium"

    return {
        "name": name,
        "qty": qty,
        "unit_price": unit_price,
        "total": total,
        #"raw_lines": clean_lines,
        #"confidence": confidence
    }


In [309]:
parsed_items = []
for g in item_lines:
    parsed_items.append(parse_item(g))

parsed_items 

[{'name': '1x', 'qty': 1, 'unit_price': 20000, 'total': 20000},
 {'name': '1x', 'qty': 1, 'unit_price': 15000, 'total': 15000},
 {'name': '1x', 'qty': 1, 'unit_price': 5000, 'total': 5000},
 {'name': 'Grand Total :', 'qty': 1, 'unit_price': 44000, 'total': 44000},
 {'name': 'Kembalian :', 'qty': 1, 'unit_price': 56000, 'total': 56000}]

### Final item results

This output is intended to work for most receipts when extracting line-item details. In the sample above, however, the item-name lines (ML=0) were dropped by the proportion filter, so `name` falls back to `1x`, and the `Grand Total` and `Kembalian` rows still come through as items.

### Extract other metadata from the receipt
1. Store name
2. Datetime
    * date and
    * time
3. Total spend
4. Payment method

The result is then combined with `parsed_items` so the receipt information is complete.
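
For the datetime field, a minimal sketch is shown below. It assumes the `DD-MM-YYYY HH:MM` layout seen on the sample receipt (`: 25-12-2025 11:43`); other receipts may use other layouts. `DATETIME_RE` and `extract_datetime` are hypothetical names, not part of the notebook.

```python
import re
from datetime import datetime

# Assumed layout: DD-MM-YYYY followed by HH:MM, as on the sample receipt
DATETIME_RE = re.compile(r"(\d{2}-\d{2}-\d{4})\s+(\d{2}:\d{2})")


def extract_datetime(lines):
    """Return the first parseable datetime found in the OCR lines, or None."""
    for text in lines:
        m = DATETIME_RE.search(text)
        if not m:
            continue
        try:
            return datetime.strptime(f"{m.group(1)} {m.group(2)}", "%d-%m-%Y %H:%M")
        except ValueError:
            # digits matched the shape but are not a valid date, e.g. 99-99-2025
            continue
    return None


ocr_lines = ["Jenar Kopi Kal1asem", "Tangga 1", ": 25-12-2025 11:43"]
print(extract_datetime(ocr_lines))
```

Searching the text rather than the label makes the extraction independent of OCR errors in the label itself, such as `Tangga 1` for `Tanggal`.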

In [310]:
# DataFrame of OCR output plus the ML-predicted is_item_line flag
test_img = Path('../struk/46.jpeg')

df_ocr_meta = run_ocr_from_engine(test_img, ocr_model)
df_pred_meta = predict_item_lines(df_ocr_meta)

df_pred_meta.head(30)

Unnamed: 0,line_id,text,ocr_conf,x_center_norm,y_center_norm,box_width_norm,receipt_id,is_item_line
0,0,Jenar Kopi Kal1asem,0.9643,0.325391,0.158854,0.269531,temp_receipt,0
1,1,No,0.9996,0.052344,0.238542,0.04375,temp_receipt,0
2,2,: JKK01202512250148,0.9739,0.328516,0.2375,0.269531,temp_receipt,0
3,3,Tangga 1,0.923,0.089063,0.284375,0.110937,temp_receipt,0
4,4,: 25-12-2025 11:43,0.9774,0.322266,0.28125,0.253906,temp_receipt,0
5,5,Mode,0.9918,0.071094,0.325,0.071875,temp_receipt,0
6,6,: DINE IN,0.9095,0.262891,0.323437,0.130469,temp_receipt,0
7,7,Kasir,0.999,0.080469,0.367708,0.085938,temp_receipt,0
8,8,: PASEK,0.9594,0.247656,0.364063,0.10625,temp_receipt,0
9,9,Es Rosella Susu Regular,0.9996,0.205078,0.448437,0.325781,temp_receipt,0


In [311]:
def extract_store_name(df):
    """
    Take the store name from the topmost OCR line
    """
    df_sorted = df.sort_values("y_center_norm")
    return df_sorted.iloc[0]["text"]


In [312]:
store_name = extract_store_name(df_pred_meta)
store_name


'Jenar Kopi Kal1asem'

In [313]:
def assign_vertical_groups(
    df,
    y_gap_threshold=0.04,
    receipt_col="receipt_id",
    y_col="y_center_norm",
    group_col="group_id",
):
    """
    Group OCR lines by vertical distance (y_center_norm) within each receipt.

    Rows are compared in their existing order and the function does not sort,
    so sort `df` by `y_col` first to get top-to-bottom groups.

    Parameters
    ----------
    df : pd.DataFrame
        OCR output with ML predictions
    y_gap_threshold : float
        Vertical gap threshold (default 0.04)
    receipt_col : str
        Name of the receipt id column
    y_col : str
        Name of the normalized y-coordinate column
    group_col : str
        Name of the output group id column

    Returns
    -------
    pd.DataFrame
        DataFrame with the group_id column filled in
    """

    df = df.copy()
    current_group_id = 0

    for receipt_id, g in df.groupby(receipt_col, sort=False):

        idxs = g.index.tolist()

        # assign the first row
        df.loc[idxs[0], group_col] = current_group_id

        for i in range(1, len(idxs)):
            prev_idx = idxs[i - 1]
            curr_idx = idxs[i]

            dy = df.loc[curr_idx, y_col] - df.loc[prev_idx, y_col]

            if dy < y_gap_threshold:
                df.loc[curr_idx, group_col] = current_group_id
            else:
                current_group_id += 1
                df.loc[curr_idx, group_col] = current_group_id

        # bump group_id once a receipt is finished
        current_group_id += 1

    df[group_col] = df[group_col].astype(int)
    return df


In [314]:
df_meta = assign_vertical_groups(df_pred_meta)

In [315]:
# ============================================
# DEBUG: INSPECT GROUPING RESULTS
# ============================================

for gid, g in df_meta.groupby("group_id"):
    print(f"\n--- GROUP {gid} ---")
    for _, r in g.iterrows():
        print(
            f"{r.y_center_norm:.3f} | "
            f"{r.text:25} | "
            f"ML={r.is_item_line}"
        )



--- GROUP 0 ---
0.159 | Jenar Kopi Kal1asem       | ML=0

--- GROUP 1 ---
0.239 | No                        | ML=0
0.237 | : JKK01202512250148       | ML=0

--- GROUP 2 ---
0.284 | Tangga 1                  | ML=0
0.281 | : 25-12-2025 11:43        | ML=0

--- GROUP 3 ---
0.325 | Mode                      | ML=0
0.323 | : DINE IN                 | ML=0

--- GROUP 4 ---
0.368 | Kasir                     | ML=0
0.364 | : PASEK                   | ML=0

--- GROUP 5 ---
0.448 | Es Rosella Susu Regular   | ML=0

--- GROUP 6 ---
0.489 | 1x                        | ML=1
0.487 | @20.000                   | ML=1
0.488 | 20.000                    | ML=1

--- GROUP 7 ---
0.529 | Indomie Goreng Double     | ML=0
0.568 | 1x                        | ML=1
0.567 | @15.000                   | ML=1
0.568 | 15.000                    | ML=1
0.608 | Telur Mata Sapi Setengah Matang | ML=0

--- GROUP 8 ---
0.650 | 1x                        | ML=0
0.647 | @5.000                    | ML=1
0.648 | 5.000        
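
The grouping rule can be reproduced on a handful of y-values. The sketch below is a simplified stand-in for `assign_vertical_groups`: the helper name `group_by_vertical_gap` is hypothetical, and it works on a plain list rather than a DataFrame. It uses rounded y-values from the output above and suggests why "Indomie Goreng Double" (0.529) lands in the same group as the quantity line that follows it (0.568): the gap of about 0.039 is just under the 0.04 threshold.

```python
def group_by_vertical_gap(y_values, y_gap_threshold=0.04):
    """Assign a group id to each y-value, in the order given.

    A new group starts when the step from the previous value is at least
    y_gap_threshold; smaller (or negative) steps stay in the current group.
    """
    group_ids = []
    current = 0
    for i, y in enumerate(y_values):
        if i > 0 and (y - y_values[i - 1]) >= y_gap_threshold:
            current += 1
        group_ids.append(current)
    return group_ids


# Rounded y_center_norm values in OCR row order. This is only a subset of the
# receipt, so the ids differ from the notebook's numbering.
ys = [0.159, 0.239, 0.237, 0.284, 0.281, 0.529, 0.568]
groups = group_by_vertical_gap(ys)
print(groups)
```

Lowering the threshold slightly would likely separate those two lines, at the cost of splitting other rows, so any change should be checked against more than one receipt.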

In [316]:
def get_groups_by_keywords(
    df,
    keywords,
    text_col="text",
    group_col="group_id"
):
    """
    Return the group_ids whose text contains any of the given keywords
    (case-insensitive substring match).
    """
    keywords = [k.upper() for k in keywords]

    mask = df[text_col].str.upper().apply(
        lambda t: any(k in t for k in keywords)
    )

    return df.loc[mask, group_col].unique().tolist()


In [317]:
TOTAL_KEYS = [
    "TOTAL",
    "GRAND",
    "TOTAL BELANJA",
    "TOTAL NET",
    "TOTAL BAYAR",
    "JUMLAH",
]
grup_total = get_groups_by_keywords(df_meta, TOTAL_KEYS)
grup_total

[10, 12]
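
Because the match is a plain substring test, a keyword such as "TOTAL" also hits "Subtotal". The sketch below illustrates this on made-up rows and contrasts it with a whole-word variant. The helper `groups_matching`, the `whole_word` flag, and the toy rows are assumptions for illustration, not part of the pipeline.

```python
import re


def groups_matching(rows, keywords, whole_word=False):
    """rows: iterable of (group_id, text) pairs.

    Returns the ids of groups whose text matches any keyword,
    case-insensitively, in first-seen order.
    """
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = rf"\b(?:{alternation})\b" if whole_word else alternation
    regex = re.compile(pattern, re.IGNORECASE)

    hits = []
    for group_id, text in rows:
        if regex.search(text) and group_id not in hits:
            hits.append(group_id)
    return hits


# Toy rows; the group ids are illustrative
toy_rows = [
    (4, "Kasir"),
    (10, "Subtotal"),
    (10, ": 40.000"),
    (12, "Grand Total"),
    (12, ": 44.000"),
]

substring_hits = groups_matching(toy_rows, ["TOTAL", "GRAND"])
whole_word_hits = groups_matching(toy_rows, ["TOTAL", "GRAND"], whole_word=True)
print(substring_hits, whole_word_hits)
```

Whether the subtotal group should be kept depends on what the later steps do with it; here the later `max()` over candidates makes the extra match harmless.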

In [318]:
def merge_group_lines(
    df_group,
    y_gap_merge=0.02,
    x_col="x_center_norm",
    y_col="y_center_norm",
    text_col="text"
):
    """
    Merge the OCR fragments of one group into logical lines:
    fragments with close y values form one line, ordered left to right.
    """
    df_group = df_group.sort_values(y_col)

    merged_lines = []
    current = []

    for _, row in df_group.iterrows():
        if not current:
            current = [row]
            continue

        dy = row[y_col] - current[-1][y_col]

        if dy <= y_gap_merge:
            current.append(row)
        else:
            merged_lines.append(current)
            current = [row]

    if current:
        merged_lines.append(current)

    # Order each line's fragments by x (left to right)
    results = []
    for line in merged_lines:
        line_sorted = sorted(line, key=lambda r: r[x_col])
        text = " ".join(r[text_col] for r in line_sorted)
        results.append(text)

    return results


In [319]:
# TOTAL_KEYS is reused from the earlier cell (In [317]).

total_group_ids = get_groups_by_keywords(df_meta, TOTAL_KEYS)

total_lines = []

for gid in total_group_ids:
    g = df_meta[df_meta["group_id"] == gid]
    lines = merge_group_lines(g, y_gap_merge=0.02)
    total_lines.extend(lines)

total_lines


['Subtotal : 40.000', 'Grand Total : 44.000']
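
The merge step can be illustrated without pandas. The sketch below is a minimal re-statement of the same idea on `(text, x, y)` tuples. The helper name `merge_fragments` and the coordinates are made up for illustration; only the texts mirror the output above.

```python
def merge_fragments(fragments, y_gap_merge=0.02):
    """fragments: list of (text, x, y) tuples.

    Fragments whose y values are within y_gap_merge of the previous
    fragment (after sorting by y) are treated as one logical line,
    then joined left to right.
    """
    ordered = sorted(fragments, key=lambda f: f[2])

    lines, current = [], []
    for frag in ordered:
        if current and frag[2] - current[-1][2] > y_gap_merge:
            lines.append(current)
            current = []
        current.append(frag)
    if current:
        lines.append(current)

    return [
        " ".join(f[0] for f in sorted(line, key=lambda f: f[1]))
        for line in lines
    ]


# Made-up coordinates: label on the left, amount on the right
fragments = [
    (": 44.000", 0.80, 0.901),
    ("Grand Total", 0.20, 0.903),
    ("Subtotal", 0.20, 0.850),
    (": 40.000", 0.80, 0.848),
]
merged = merge_fragments(fragments)
print(merged)
```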

In [320]:
def extract_grand_total_from_strings(lines):
    candidates = []

    for text in lines:
        t = text.upper()

        # only process lines that contain a total keyword
        if not any(k in t for k in TOTAL_KEYS):
            continue

        # collect ALL price candidates: thousands-grouped numbers
        # (e.g. 44.000) or plain runs of 4+ digits
        nums = re.findall(r"\d{1,3}(?:[.,]\d{3})+|\d{4,}", t)

        for n in nums:
            v = to_int(n)
            candidates.append(v)

    return max(candidates) if candidates else None

In [321]:
grand_total = extract_grand_total_from_strings(total_lines)
grand_total

44000
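
To make the price-candidate step concrete, the sketch below runs the same regular expression on the two merged lines. `to_int` itself is not shown in this part of the notebook, so `to_int_sketch` is a hypothetical stand-in that assumes both `.` and `,` are thousands separators.

```python
import re

PRICE_RE = re.compile(r"\d{1,3}(?:[.,]\d{3})+|\d{4,}")


def to_int_sketch(s):
    """Hypothetical stand-in for to_int: drop '.' and ',' separators."""
    return int(s.replace(".", "").replace(",", ""))


def largest_amount(lines):
    """Return the largest price-like number found in the lines, or None."""
    values = [
        to_int_sketch(match)
        for line in lines
        for match in PRICE_RE.findall(line)
    ]
    return max(values) if values else None


lines = ["Subtotal : 40.000", "Grand Total : 44.000"]
amount = largest_amount(lines)
print(amount)
```

Taking the maximum is a heuristic: it works here because the grand total is larger than the subtotal, but a line containing the amount tendered could win instead if it also matched a total keyword.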

In [322]:
DATE_RE = (
    r"\b("
    r"(?:\d{1,2}[./-]\d{1,2}[./-]\d{2,4})"   # DD-MM-YYYY / DD-MM-YY
    r"|"
    r"(?:\d{2,4}[./-]\d{1,2}[./-]\d{1,2})"   # YYYY-MM-DD / YY-MM-DD
    r")\b"
)

TIME_RE = r"\b(\d{1,2}:\d{2}(?::\d{2})?)\b"

def extract_datetime_from_df(df):
    date, time = None, None

    for _, row in df.iterrows():
        text = str(row["text"])

        if not date:
            m = re.search(DATE_RE, text)
            if m:
                date = m.group(1)

        if not time:
            m = re.search(TIME_RE, text)
            if m:
                time = m.group(1)

        if date and time:
            break

    return {
        "date": date,
        "time": time,
        "confidence": "high" if date or time else "low"
    }



In [323]:
date = extract_datetime_from_df(df_meta)
date

{'date': '25-12-2025', 'time': '11:43', 'confidence': 'high'}
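
Downstream consumers often prefer a single ISO 8601 timestamp over separate strings. The sketch below shows one way to combine the two fields, assuming the date is day-first (`DD-MM-YYYY`), as in this receipt. The helper `to_iso` is hypothetical, and other receipts may need a different `date_format`.

```python
from datetime import datetime


def to_iso(date_str, time_str, date_format="%d-%m-%Y"):
    """Combine extracted date and time strings into an ISO 8601 string.

    Assumes a day-first date by default; raises ValueError if the
    strings do not match the expected formats.
    """
    time_format = "%H:%M:%S" if time_str.count(":") == 2 else "%H:%M"
    parsed = datetime.strptime(
        f"{date_str} {time_str}", f"{date_format} {time_format}"
    )
    return parsed.isoformat()


iso = to_iso("25-12-2025", "11:43")
print(iso)
```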

In [324]:
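PAYMENT_KEYWORDS = {
    # "NON TUNAI" means non-cash, so it is not listed under CASH.
    # Note: substring matching still finds "TUNAI" inside "NON TUNAI".
    "CASH": ["CASH", "TUNAI"],
    "DEBIT": ["DEBIT", "ATM", "KARTU DEBIT"],
    "CREDIT": ["CREDIT", "KARTU KREDIT"],
    "QRIS": ["QRIS"],
    "EWALLET": ["OVO", "GOPAY", "DANA", "SHOPEEPAY", "LINKAJA"]
}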


In [325]:
def get_payment_groups(df, payment_keywords):
    matched_groups = set()

    for gid, g in df.groupby("group_id"):
        text_blob = " ".join(g["text"].str.upper())

        for kws in payment_keywords.values():
            if any(k in text_blob for k in kws):
                matched_groups.add(gid)

    return matched_groups


In [326]:
def extract_payment_method(df, payment_keywords):
    payment_groups = get_payment_groups(df, payment_keywords)

    detected_methods = []

    for gid in payment_groups:
        g = df[df["group_id"] == gid]
        texts = " ".join(g["text"].str.upper())

        for method, kws in payment_keywords.items():
            if any(k in texts for k in kws):
                detected_methods.append(method)

    # several methods may match: prefer the most specific one
    if "QRIS" in detected_methods:
        return "QRIS"
    if "EWALLET" in detected_methods:
        return "EWALLET"
    if "CREDIT" in detected_methods:
        return "CREDIT"
    if "DEBIT" in detected_methods:
        return "DEBIT"
    if "CASH" in detected_methods:
        return "CASH"

    return None
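
The chain of `if` statements encodes a priority order. A minimal sketch of the same rule as a data-driven lookup is shown below. `METHOD_PRIORITY` and `pick_method` are hypothetical names, and the order is copied from the function above.

```python
# Most specific first, same order as the if-chain in extract_payment_method
METHOD_PRIORITY = ["QRIS", "EWALLET", "CREDIT", "DEBIT", "CASH"]


def pick_method(detected_methods, priority=METHOD_PRIORITY):
    """Return the highest-priority detected method, or None if none matched."""
    for method in priority:
        if method in detected_methods:
            return method
    return None


print(pick_method(["CASH", "QRIS"]))
print(pick_method([]))
```

Keeping the order in one list makes it easier to add a method later without touching the control flow.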


In [327]:
payment_method = extract_payment_method(df_meta, PAYMENT_KEYWORDS)
payment_method

'CASH'

In [328]:
def build_receipt_result(
    store_name,
    datetime_info,
    grand_total,
    payment_method,
    item_lines,
    receipt_id,
):
    """
    Assemble the final receipt JSON.

    item_lines and receipt_id come from earlier pipeline steps; they are
    passed in explicitly instead of being read from notebook globals.
    """
    parsed_items = []
    for g in item_lines:
        item = parse_item(g)
        if item:
            parsed_items.append(item)

    result = {
        "receipt_id": receipt_id,
        "store": {
            "name": store_name
        },
        "datetime": datetime_info,
        "payment": {
            "method": payment_method
        },
        "totals": {
            "grand_total": grand_total
        },
        "items": parsed_items,
        "meta": {
            "item_count": len(parsed_items),
        }
    }

    return result


In [329]:

hasil_akhir = build_receipt_result(
    store_name=store_name,
    datetime_info=date,
    grand_total=grand_total,
    payment_method=payment_method,
    item_lines=item_lines,
    receipt_id=receipt_id,
)

hasil_akhir

{'receipt_id': 'temp_receipt',
 'store': {'name': 'Jenar Kopi Kal1asem'},
 'datetime': {'date': '25-12-2025', 'time': '11:43', 'confidence': 'high'},
 'payment': {'method': 'CASH'},
 'totals': {'grand_total': 44000},
 'items': [{'name': '1x', 'qty': 1, 'unit_price': 20000, 'total': 20000},
  {'name': '1x', 'qty': 1, 'unit_price': 15000, 'total': 15000},
  {'name': '1x', 'qty': 1, 'unit_price': 5000, 'total': 5000},
  {'name': 'Grand Total :', 'qty': 1, 'unit_price': 44000, 'total': 44000},
  {'name': 'Kembalian :', 'qty': 1, 'unit_price': 56000, 'total': 56000}],
 'meta': {'item_count': 5}}
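
The result above still contains summary rows ("Grand Total :", "Kembalian :") among the items. One possible post-processing step is sketched below: drop items whose name contains a summary keyword, then compare the remaining sum with the totals. `SUMMARY_KEYWORDS` and `drop_summary_rows` are hypothetical, and the keyword list is an assumption that would need tuning on more receipts.

```python
# Hypothetical keyword list; "KEMBALIAN" is the change returned to the customer
SUMMARY_KEYWORDS = ("TOTAL", "KEMBALIAN")


def drop_summary_rows(items, keywords=SUMMARY_KEYWORDS):
    """Keep only items whose name contains none of the summary keywords."""
    return [
        item for item in items
        if not any(k in item["name"].upper() for k in keywords)
    ]


# Items copied from the output above
items = [
    {"name": "1x", "qty": 1, "unit_price": 20000, "total": 20000},
    {"name": "1x", "qty": 1, "unit_price": 15000, "total": 15000},
    {"name": "1x", "qty": 1, "unit_price": 5000, "total": 5000},
    {"name": "Grand Total :", "qty": 1, "unit_price": 44000, "total": 44000},
    {"name": "Kembalian :", "qty": 1, "unit_price": 56000, "total": 56000},
]

kept = drop_summary_rows(items)
items_sum = sum(item["total"] for item in kept)
print(len(kept), items_sum)
```

The remaining sum equals the "Subtotal : 40.000" line seen earlier. The difference from the grand total of 44000 is not itemized in the lines shown here, so a check like this is better suited to flagging receipts for review than to rejecting them.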