# Leksara Black-Box Feature Validation Notebook
This notebook exercises Leksara's public APIs as an end user would. It covers CartBoard dashboards, cleaning primitives, PII redaction helpers, review normalisation, presets, benchmarking, runtime customisations, and logging hooks.

## How to use this notebook
1. Ensure Leksara and its optional dependencies (`regex`, `emoji`, `Sastrawi`, `pandas`) are installed in your environment.
2. Execute the cells in order. Outputs are intentionally verbose so you can visually confirm behaviour without digging into implementation details.
3. Treat each section as a standalone scenario you can adapt for your own datasets or pipelines.

In [None]:
# Core imports for the feature tour
import json
from pathlib import Path

import pandas as pd

from leksara import leksara, ReviewChain, get_preset
from leksara.frames.cartboard import CartBoard, get_flags, get_stats, noise_detect
from leksara.function import (
    remove_tags,
    case_normal,
    remove_stopwords,
    remove_whitespace,
    remove_punctuation,
    remove_digits,
    remove_emoji,
    replace_url,
    replace_rating,
    shorten_elongation,
    replace_acronym,
    normalize_slangs,
    expand_contraction,
    word_normalization,
)
from leksara.pattern import (
    replace_phone,
    replace_address,
    replace_email,
    replace_id,
)
from leksara.core.logging import setup_logging, log_pipeline_step

## 1. Quick health check with CartBoard dashboards

In [19]:
# Sample dataset mirroring Indonesian marketplace reviews
raw_reviews = pd.DataFrame(
    {
        "review_id": [101, 102, 103],
        "channel": ["Tokopedia", "Shopee", "WhatsApp"],
        "text": [
            "Barang mantul!!! Email: user@example.com ⭐⭐⭐⭐⭐",
            "Pengiriman lambat :( Hubungi 0812-3456-7890 segera",
            "Halo admin, alamat saya Jl. Melati No. 8 RT 02 RW 04, Bandung",
        ],
    }
)

cartboard_flags = get_flags(raw_reviews, text_column="text")
cartboard_stats = get_stats(raw_reviews, text_column="text")
cartboard_noise = noise_detect(raw_reviews, text_column="text", include_normalized=False)

display(cartboard_flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
display(cartboard_stats[["review_id", "stats"]])
display(cartboard_noise[["review_id", "detect_noise"]])

single_card = CartBoard(raw_text=raw_reviews.loc[0, "text"], rating=5)
single_card.to_dict()

,review_id,pii_flag,rating_flag,non_alphabetical_flag
0,101,True,True,True
1,102,True,False,False
2,103,False,False,False


,review_id,stats
0,101,"{'length': 46, 'word_count': 6, 'stopwords': 0..."
1,102,"{'length': 50, 'word_count': 4, 'stopwords': 1..."
2,103,"{'length': 61, 'word_count': 10, 'stopwords': ..."


,review_id,detect_noise
0,101,"{'urls': [], 'html_tags': [], 'emails': ['user..."
1,102,"{'urls': [], 'html_tags': [], 'emails': [], 'p..."
2,103,"{'urls': [], 'html_tags': [], 'emails': [], 'p..."


{'original_text': 'Barang mantul!!! Email: user@example.com ⭐⭐⭐⭐⭐',
 'rating': 5,
 'pii_flag': True,
 'non_alphabetical_flag': True}
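
The tables above are meant for visual confirmation. If you would rather have a cell fail loudly when behaviour drifts, the same check can be written as a comparison against expectations. The sketch below is stand-alone: `observed` is copied by hand from the flag table printed above, while `expected_pii` and `diff_flags` are hypothetical names introduced here and are not part of Leksara. The expectation that review 103 should be flagged is an assumption (its text contains a street address); adjust it to whatever your own policy says.

```python
# Rows copied from the flag table printed above (not recomputed here).
observed = [
    {"review_id": 101, "pii_flag": True, "rating_flag": True, "non_alphabetical_flag": True},
    {"review_id": 102, "pii_flag": True, "rating_flag": False, "non_alphabetical_flag": False},
    {"review_id": 103, "pii_flag": False, "rating_flag": False, "non_alphabetical_flag": False},
]

# Assumption: a reviewer's expectation, keyed by review_id.
# Review 103 contains a street address, so it is expected to be flagged here.
expected_pii = {101: True, 102: True, 103: True}


def diff_flags(rows, expected, flag="pii_flag"):
    """Return (review_id, observed, expected) for every row that disagrees."""
    return [
        (row["review_id"], row[flag], expected[row["review_id"]])
        for row in rows
        if row[flag] != expected[row["review_id"]]
    ]


mismatches = diff_flags(observed, expected_pii)
print(mismatches)
```

Inside the notebook, `observed` could instead be built with `cartboard_flags.to_dict("records")`, so the comparison runs against live output rather than a hand-copied table.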

## 2. Exercising cleaning primitives as standalone helpers

In [20]:
sample_text = "<p>MANTUULLL banget! Promo di https://shop.id, email cs@shop.id 😍😍</p>"

step_html = remove_tags(sample_text)
step_case = case_normal(step_html)
step_url = replace_url(step_case, mode="replace")
step_emoji = remove_emoji(step_url, mode="replace")
step_stopwords = remove_stopwords(step_emoji)
step_punct = remove_punctuation(step_stopwords)
step_whitespace = remove_whitespace(step_punct)

print("Original:", sample_text)
print("Cleaned:", step_whitespace)

Original: <p>MANTUULLL banget! Promo di https://shop.id, email cs@shop.id 😍😍</p>
Cleaned: mantuulll banget promo URL email csURL suka banget suka banget
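
The cleaned output above contains `csURL`, which suggests that the URL step matched the domain part of `cs@shop.id` because nothing had masked the email first. That reading is an inference from the printed output, not confirmed Leksara behaviour. The stand-alone sketch below reproduces the ordering effect with two deliberately simple, hypothetical regular expressions (`URL_RE` and `EMAIL_RE` are not Leksara's patterns).

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
URL_RE = re.compile(r"(?:https?://)?[a-z0-9-]+(?:\.[a-z]{2,})+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "promo di https://shop.id, email cs@shop.id"

# URL replacement first: the email's domain is consumed as a URL,
# so the email pattern no longer finds anything to mask.
url_first = EMAIL_RE.sub("[EMAIL]", URL_RE.sub("URL", text))

# Email masking first: the address is replaced as a whole,
# and the URL pattern only sees the real link.
email_first = URL_RE.sub("URL", EMAIL_RE.sub("[EMAIL]", text))

print("URL first:  ", url_first)
print("Email first:", email_first)
```

If the same effect applies to the helpers used above, calling `replace_email` before `replace_url` in a hand-built chain should keep the address intact until it is masked.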


## 3. Validating PII masking helpers

In [21]:
pii_sample = ("Hubungi saya di 0812 9876 5432 atau email rani+vip@contoh.co.id. "
             "Alamat: Jl. Kenanga No. 5 RT 03 RW 09, Jakarta. NIK 3276120705010003 ")

masked = replace_id(pii_sample, mode="replace")
masked = replace_phone(masked, mode="replace")
masked = replace_email(masked, mode="replace")
masked = replace_address(masked, mode="replace")

print(masked)

Hubungi saya di [PHONE_NUMBER] atau email [EMAIL]. Alamat: [ADDRESS]. NIK [NIK]
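
A black-box check on masking can go one step further than reading the printed string: scan the result for anything that still looks like raw PII. The sketch below is a minimal example under stated assumptions. `LEAK_PATTERNS` and `find_leaks` are hypothetical helpers, the patterns are intentionally loose, and `masked_output` is copied from the output above rather than recomputed.

```python
import re

# Copied from the masked output printed above.
masked_output = (
    "Hubungi saya di [PHONE_NUMBER] atau email [EMAIL]. "
    "Alamat: [ADDRESS]. NIK [NIK]"
)

# Hypothetical leak detectors: loose on purpose, so they over-report
# rather than miss a surviving phone number, NIK, or email address.
LEAK_PATTERNS = {
    "digit_run": re.compile(r"\d{4,}"),
    "email_like": re.compile(r"\S+@\S+"),
}


def find_leaks(text):
    """Return {detector_name: matches} for every detector that fires."""
    return {
        name: pattern.findall(text)
        for name, pattern in LEAK_PATTERNS.items()
        if pattern.search(text)
    }


print(find_leaks(masked_output))           # no detector fires on the masked text
print(find_leaks("NIK 3276120705010003"))  # an unmasked NIK is reported
```

In the notebook, `assert not find_leaks(masked)` would turn the visual check into a failing cell whenever a helper stops masking one of these values.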


## 4. Review normalisation workflow

In [22]:
review_text = "Mantuuul ⭐⭐⭐⭐⭐ abis, cs nya grg bgt tp overall 4/5 kok!"

normalized = replace_rating(review_text)
normalized = shorten_elongation(normalized, max_repeat=2)
normalized = normalize_slangs(normalized, mode="replace")
normalized = replace_acronym(normalized, mode="replace")
normalized = expand_contraction(normalized)
normalized = word_normalization(normalized, method="stem", mode="keep")

print("Original:", review_text)
print("Normalised:", normalized)

Original: Mantuuul ⭐⭐⭐⭐⭐ abis, cs nya grg bgt tp overall 4/5 kok!
Normalised: mantuul 5 0 abis cs nya grg bgt tapi overall 4 0 kok


## 5. Running preset and custom pipelines with benchmarking

In [23]:
reviews = pd.Series([
    "Email saya customer@mart.id, rating 5/5, kurir ramah.",
    "Alamat pengiriman: Jl. Durian No. 3 RT 05 RW 07, Bandung.",
])

preset_results, preset_metrics = leksara(reviews, preset="ecommerce_review", benchmark=True)

custom_pipeline = {
    "patterns": [
        (replace_phone, {"mode": "replace"}),
        (replace_email, {"mode": "replace"}),
    ],
    "functions": [
        case_normal,
        replace_rating,
        remove_digits,
        remove_stopwords,
        remove_punctuation,
        remove_whitespace,
    ],
}

chain = ReviewChain.from_steps(**custom_pipeline)
chain_results, chain_metrics = chain.transform(reviews, benchmark=True)

display(pd.DataFrame({
    "preset_output": preset_results,
    "custom_output": chain_results,
}))

print("Preset timings:", preset_metrics)
print("Custom timings:", chain_metrics)

,preset_output,custom_output
0,email [EMAIL] rating 5 5 kurir ramah,email [EMAIL] rating kurir ramah
1,alamat kirim [ADDRESS],alamat pengiriman jl durian no rt rw bandung


Preset timings: {'n_steps': 15, 'total_time_sec': 0.00033039999652828556, 'per_step': [('word_normalization', 7.610000102431513e-05), ('replace_address', 7.379999806289561e-05), ('mask_whitelist', 3.2500000088475645e-05), ('remove_emoji', 2.6399999114801176e-05), ('remove_stopwords', 2.6399999114801176e-05), ('replace_phone', 1.9299999621580355e-05), ('unmask_whitelist', 1.9099999917671084e-05), ('replace_url', 1.1899999663000926e-05), ('remove_punctuation', 1.1300000551273115e-05), ('replace_email', 8.999999408842996e-06), ('shorten_elongation', 6.800000846851617e-06), ('replace_id', 5.899999450775795e-06), ('remove_whitespace', 5.699999746866524e-06), ('remove_tags', 4.19999923906289e-06), ('case_normal', 2.0000006770715117e-06)]}
Custom timings: {'n_steps': 10, 'total_time_sec': 0.00019700000484590419, 'per_step': [('replace_rating', 8.800000068731606e-05), ('remove_stopwords', 2.9200000426499173e-05), ('mask_whitelist', 2.200000199081842e-05), ('unmask_whitelist', 1.720000182103831
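
The raw timing dictionaries are hard to scan. The sketch below shows one way to turn `per_step` into microseconds and a share of the total. It is stand-alone: the three pairs and the total are copied from the preset timings printed above, and `summarise` is a hypothetical helper, not a Leksara function. With only two short reviews every step finishes in well under a millisecond, so treat the shares as indicative rather than as a stable ranking.

```python
# A few (step, seconds) pairs copied from the preset timings printed above.
per_step = [
    ("word_normalization", 7.610000102431513e-05),
    ("replace_address", 7.379999806289561e-05),
    ("mask_whitelist", 3.2500000088475645e-05),
]
total_time_sec = 0.00033039999652828556


def summarise(steps, total):
    """Return (step, microseconds, percent of total) rows, slowest first."""
    ordered = sorted(steps, key=lambda item: item[1], reverse=True)
    return [
        (name, round(seconds * 1e6, 1), round(100 * seconds / total, 1))
        for name, seconds in ordered
    ]


for name, micros, share in summarise(per_step, total_time_sec):
    print(f"{name:<20} {micros:>8.1f} us {share:>5.1f} %")
```

In the notebook the same helper could be called as `summarise(preset_metrics["per_step"], preset_metrics["total_time_sec"])`, using the keys visible in the printed dictionary.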

## 6. Listing and customising presets

In [24]:
ecommerce_preset = get_preset("ecommerce_review")
print("Patterns:", ecommerce_preset["patterns"])
print("Functions:", ecommerce_preset["functions"])

# Extend preset with additional address masking depth
extended = get_preset("ecommerce_review")
extended["patterns"].append((replace_address, {"mode": "replace", "street": True, "city": True}))
extended_results = leksara(reviews, pipeline=extended)
extended_results

Patterns: [(<function replace_phone at 0x0000020045F06A20>, {'mode': 'replace'}), (<function replace_email at 0x0000020045F06D40>, {'mode': 'replace'}), (<function replace_address at 0x0000020045F06CA0>, {'mode': 'replace'}), (<function replace_id at 0x0000020045F06DE0>, {'mode': 'replace'})]
Functions: [<function remove_tags at 0x0000020045F06FC0>, <function case_normal at 0x0000020045F07100>, (<function replace_url at 0x0000020045F07420>, {'mode': 'replace'}), (<function remove_emoji at 0x0000020045F074C0>, {'mode': 'replace'}), <function word_normalization at 0x0000020045F82CA0>, <function remove_stopwords at 0x0000020045F071A0>, <function shorten_elongation at 0x0000020045F82840>, <function remove_punctuation at 0x0000020045F07380>, <function remove_whitespace at 0x0000020045F07240>]


0    email [EMAIL] rating 5 5 kurir ramah
1                  alamat kirim [ADDRESS]
dtype: object
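
The cell above appends to the lists inside the dictionary returned by `get_preset`. Whether `get_preset` hands back a fresh copy on every call or a shared object is not visible from the outside. If it were shared, the `.append` would also change what later callers receive. A defensive pattern, sketched below with stand-in data, is to deep-copy before extending. `fake_step` and `base_preset` are hypothetical placeholders with the same shape as the printed preset; they are not Leksara objects.

```python
import copy


def fake_step(text):
    """Stand-in for a pattern or cleaning function."""
    return text


# Hypothetical preset with the same shape as the one printed above.
base_preset = {
    "patterns": [(fake_step, {"mode": "replace"})],
    "functions": [fake_step],
}

# Deep-copy so that the nested lists and option dicts are independent.
# Functions are treated as atomic by deepcopy, so they are reused, not cloned.
extended = copy.deepcopy(base_preset)
extended["patterns"].append((fake_step, {"mode": "replace", "street": True}))

print(len(base_preset["patterns"]), len(extended["patterns"]))
```

A shallow `dict(base_preset)` would not be enough here, because both dictionaries would still point at the same `patterns` list.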

## 7. Runtime dictionary tweaks for experimentation

In [25]:
print(normalize_slangs("sokap msh bingung sama proses garansi"))

tingkah msh bingung sama proses garansi
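
In the output above, `msh` passes through unchanged, which suggests it is not covered by the slang dictionary in use. This notebook does not show how Leksara's own dictionaries are overridden at runtime, so the sketch below does not attempt that. Instead it illustrates the general idea of layering a small override mapping on top of an existing result. `overrides` and `apply_overrides` are hypothetical names, and the single entry (`msh` to `masih`) is an assumed addition for this example.

```python
import re

# Hypothetical overlay applied after the library call.
overrides = {"msh": "masih"}


def apply_overrides(text, mapping):
    """Replace whole-word tokens found in mapping; leave everything else as is."""
    if not mapping:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


# Input copied from the normalize_slangs output printed above.
print(apply_overrides("tingkah msh bingung sama proses garansi", overrides))
```

Keeping overrides in a separate mapping like this makes experiments easy to revert, since the packaged dictionary is never modified.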
