# SD/MI Scraper for Kabupaten Semarang
### Retrieving School Profile Data from Kemendikbud (Referensi + Sekolah.Data)

This project automatically collects data on **SD** (elementary school) and **MI** (madrasah ibtidaiyah) schools in Kabupaten Semarang from two official Kemendikbud sources. The data supports marketing canvassing for Fams Medika Holistic Care Center.

---

## **Data Sources**

1. **referensi.data.kemendikdasmen.go.id**
   - Provides the school *listing* per kecamatan (district)
   - Data collected:
     - School name (`Nama Sekolah`)
     - NPSN (national school identification number)
     - Status (Negeri/Swasta, i.e. public/private)
     - Kelurahan (village)
     - Link to the school's referensi page (used to obtain the UUID)

2. **sekolah.data.kemendikdasmen.go.id (Profil Sekolah)**
   - Provides the detailed profile of each school
   - Data collected:
     - Address (`Alamat`)
     - Principal (`Kepala Sekolah`)
     - Phone (`Telepon`)
     - Email
     - Website
     - Number of male students (`Jumlah Siswa Laki-laki`)
     - Number of female students (`Jumlah Siswa Perempuan`)

---

## **Data Collection Method**

### 1. **School Listing (UC Driver)**
The school listing is retrieved with **undetected_chromedriver (UC Driver)** to avoid bot detection and to make sure the referensi pages render reliably.
Only **one UC driver** is used for the entire listing process, for efficiency.

---

## **Method Version 1: Full Selenium (No ThreadPool)**
This version:
- Uses **one UC driver** for the listing
- Uses **one standard Selenium driver** per school
- Fetches all details **sequentially** (one school at a time)
- Advantage: more stable
- Drawback: scraping takes a long time because the schools are processed one after another

---

## **Method Version 2: Optimized Parallel Detail Scraper (ThreadPool)**
This optimized version:
- **Keeps the UC driver for the listing (1 instance)**
- **Fetches school details in parallel** using:
  - `ThreadPoolExecutor(max_workers=N)`
  - A separate **standard Selenium driver** for each worker
- Still obtains the UUID of each school with `requests` + `BeautifulSoup`
- Advantage: **very fast**, potentially 5–10× faster than Version 1
- Drawback: needs more CPU and RAM

---

## Output
Scraping results are saved as a CSV file in the `output` folder:
`output/list_sd_mi_{nama_kecamatan}.csv`

The CSV columns can be chosen as needed (Nama Sekolah, NPSN, Alamat, Kepala Sekolah, and so on).
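
The column selection described above can be sketched as follows. This is a minimal illustration and not the notebook's own function: `build_output_path`, `write_selected_columns`, the shortened `ALL_COLUMNS` list, and the sample row are all assumptions made for the example.

```python
import csv
import os
import tempfile

# Canonical column order (shortened here); names mirror the fields listed above.
ALL_COLUMNS = ["Kelurahan", "Nama Sekolah", "NPSN", "Status", "Kepala Sekolah", "Alamat"]


def build_output_path(nama_kecamatan, folder="output"):
    """Return <folder>/list_sd_mi_{nama_kecamatan}.csv with a normalized name."""
    safe_name = nama_kecamatan.replace(" ", "_").lower()
    return os.path.join(folder, f"list_sd_mi_{safe_name}.csv")


def write_selected_columns(rows, selected, path):
    """Write only the selected columns, keeping the canonical column order."""
    columns = [c for c in ALL_COLUMNS if c in selected]
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that were scraped but not selected.
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return columns


# Made-up sample row, for illustration only.
sample_rows = [{"Nama Sekolah": "SD Contoh 01", "NPSN": "00000000", "Status": "Negeri"}]
demo_path = build_output_path("Ambarawa", folder=tempfile.mkdtemp())
written_columns = write_selected_columns(sample_rows, ["NPSN", "Nama Sekolah"], demo_path)
```

The `Status` key in the sample row is not selected, so it is silently left out of the file rather than causing an error.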

---

## Notes
- The UC Driver is used for the *listing* so that requests are not blocked
- Standard Selenium is used for the detail pages because it is more stable on the Angular frontend
- `requests + BeautifulSoup` is used only to parse the legacy referensi pages (static HTML), so the UUID can be retrieved quickly without Selenium



## Installing the Required Libraries

In [1]:
! pip install selenium
! pip install undetected-chromedriver
! pip install requests
! pip install beautifulsoup4
! pip install gradio


## Retrieving the List of Kecamatan Codes and Names

This step **retrieves and maps the codes and names of the kecamatan (districts)** in Kabupaten Semarang. The mapping serves two purposes in the system:

1. **As the basis for building dynamic URLs for scraping**, and
2. **As the data source for the kecamatan dropdown in the user interface (deployment)**

---
Every school listing page on the official **Kemendikdasmen** site follows a fixed URL pattern:
`https://referensi.data.kemendikdasmen.go.id/pendidikan/dikdas/{kode_kecamatan}/3`

The `{kode_kecamatan}` part is the **unique numeric identifier** of each kecamatan.
Example:
- Kecamatan *Ambarawa* has the code `032210`.
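
A minimal sketch of how such a URL could be assembled from a kecamatan name. The helper `build_listing_url` and the one-entry `KECAMATAN_CODES` dictionary are assumptions made for illustration; only the URL pattern and the Ambarawa code come from the text above.

```python
BASE_URL = "https://referensi.data.kemendikdasmen.go.id/pendidikan/dikdas"

# Illustrative subset of the name-to-code mapping (only Ambarawa is shown).
KECAMATAN_CODES = {"Ambarawa": "032210"}


def build_listing_url(nama_kecamatan, codes=KECAMATAN_CODES):
    """Return the listing URL for a kecamatan, or raise if the name is unknown."""
    try:
        kode = codes[nama_kecamatan]
    except KeyError:
        raise ValueError(f"Unknown kecamatan: {nama_kecamatan!r}") from None
    # The trailing /3 is the segment used for kecamatan-level pages in the pattern above.
    return f"{BASE_URL}/{kode}/3"


ambarawa_url = build_listing_url("Ambarawa")
```

Raising on an unknown name keeps a typo from silently producing a URL with `None` in it.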

To obtain these codes systematically, the system reads a local JSON file (for example `kecamatan_kab_semarang.json`) that maps each kecamatan name to its code:

```json
{
  "Kab. Semarang": {
    "kode": "032200",
    "kecamatan": {
      "Ambarawa": "032210",
      "Tengaran": "032202",
      "Susukan": "032203",
      ....
    }
  }
}
```




In [None]:
import os
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC


def setup_driver(headless=True):
    options = Options()
    if headless:
        options.add_argument("--headless=new")
    # options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--log-level=3")
    options.add_argument("--blink-settings=imagesEnabled=false")
    options.page_load_strategy = "eager"
    driver = webdriver.Chrome(options=options)
    driver.set_page_load_timeout(30)
    return driver


def get_table_rows(driver, url):
    """Load a referensi page and return all rows of its table."""
    driver.get(url)
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table#table1 tbody tr"))
    )
    try:
        select = Select(driver.find_element(By.NAME, "table1_length"))
        select.select_by_value("100")
    except Exception:
        # The page-length selector is optional; keep the default page size if it is missing.
        pass
    return driver.find_elements(By.CSS_SELECTOR, "table#table1 tbody tr")


def get_kecamatan_jateng_by_kode():
    """
    List the kabupaten/kota of Jawa Tengah (code 030000), prompt for a
    kabupaten/kota code (e.g. 032200), then save its kecamatan to
    list_kecamatan/kecamatan_<nama_kab>.json as hierarchical JSON.
    """
    base_url = "https://referensi.data.kemendikdasmen.go.id/pendidikan/dikdas/"
    driver = setup_driver(headless=True)
    kecamatan_list = []

    try:
        # 1. Fetch the list of kabupaten/kota in Jawa Tengah province
        print("Mengambil daftar kabupaten/kota di Provinsi Jawa Tengah...")
        kab_url = base_url + "030000/1"
        rows_kab = get_table_rows(driver, kab_url)

        kabupaten_list = []
        for r in rows_kab:
            try:
                nama = r.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text.strip()
                href = r.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                kode = href.split("/")[-2]
                kabupaten_list.append({"nama": nama, "kode": kode})
            except Exception:
                # Skip rows that have no name cell or link.
                continue

        print("\nDaftar Kabupaten/Kota di Jawa Tengah:")
        print("=" * 60)
        for k in kabupaten_list:
            display(f"{k['kode']} - {k['nama']}")
        print("=" * 60)

        kode_input = input("\nMasukkan KODE kabupaten/kota yang ingin discrap (contoh: 032200): ").strip()

        kab = next((k for k in kabupaten_list if k["kode"] == kode_input), None)
        if not kab:
            print("Kode tidak ditemukan dalam daftar.")
            driver.quit()
            return []

        print(f"\nMengambil daftar kecamatan di {kab['nama']}...")

        # 2. Fetch the list of kecamatan for the selected kabupaten/kota
        kec_url = base_url + f"{kab['kode']}/2"
        rows_kec = get_table_rows(driver, kec_url)

        for r in rows_kec:
            try:
                nama = r.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text.strip()
                href = r.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                kode = href.split("/")[-2]
                kecamatan_list.append({"nama": nama, "kode": kode, "url": href})
            except Exception:
                # Skip rows that have no name cell or link.
                continue

        print(f"Berhasil ambil {len(kecamatan_list)} kecamatan dari {kab['nama']}")

        # 3. Save the result to list_kecamatan/
        os.makedirs("list_kecamatan", exist_ok=True)
        safe_name = kab["nama"].replace(" ", "_").replace(".", "").lower()
        save_path = os.path.join("list_kecamatan", f"kecamatan_{safe_name}.json")

        # Hierarchical format: {kabupaten name: {"kode": ..., "kecamatan": {name: code}}}
        json_data = {
            kab["nama"]: {
                "kode": kab["kode"],
                "kecamatan": {k["nama"]: k["kode"] for k in kecamatan_list}
            }
        }

        with open(save_path, "w", encoding="utf-8") as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)

        print(f"Data disimpan ke: {save_path}")

    except Exception as e:
        print(f"Terjadi kesalahan: {e}")

    driver.quit()
    return kecamatan_list


# ==========================
# MAIN
# ==========================
if __name__ == "__main__":
    kecamatan_data = get_kecamatan_jateng_by_kode()

Mengambil daftar kabupaten/kota di Provinsi Jawa Tengah...

Daftar Kabupaten/Kota di Jawa Tengah:


'030100 - Kab. Cilacap'

'030200 - Kab. Banyumas'

'030300 - Kab. Purbalingga'

'030400 - Kab. Banjarnegara'

'030500 - Kab. Kebumen'

'030600 - Kab. Purworejo'

'030700 - Kab. Wonosobo'

'030800 - Kab. Magelang'

'030900 - Kab. Boyolali'

'031000 - Kab. Klaten'

'031100 - Kab. Sukoharjo'

'031200 - Kab. Wonogiri'

'031300 - Kab. Karanganyar'

'031400 - Kab. Sragen'

'031500 - Kab. Grobogan'

'031600 - Kab. Blora'

'031700 - Kab. Rembang'

'031800 - Kab. Pati'

'031900 - Kab. Kudus'

'032000 - Kab. Jepara'

'032100 - Kab. Demak'

'032200 - Kab. Semarang'

'032300 - Kab. Temanggung'

'032400 - Kab. Kendal'

'032500 - Kab. Batang'

'032600 - Kab. Pekalongan'

'032700 - Kab. Pemalang'

'032800 - Kab. Tegal'

'032900 - Kab. Brebes'

'036000 - Kota Magelang'

'036100 - Kota Surakarta'

'036200 - Kota Salatiga'

'036300 - Kota Semarang'

'036400 - Kota Pekalongan'

'036500 - Kota Tegal'


Mengambil daftar kecamatan di Kab. Semarang...
Berhasil ambil 19 kecamatan dari Kab. Semarang
Data disimpan ke: data\kecamatan_kab_semarang.json


# GRADIO DEMO

## Version 1: Full Selenium (No Parallelism / ThreadPool)

### Summary
The first version of this scraper uses **Selenium** for both the school listing and the detail pages. Everything runs **sequentially** (one school at a time), which makes it highly stable but relatively slow.

---

## Data Sources

1. **referensi.data.kemendikdasmen.go.id**
   Used to retrieve:
   - School name
   - NPSN
   - Status
   - Kelurahan
   - URL of the school's detail page (still in the legacy HTML format)

2. **sekolah.data.kemendikdasmen.go.id (Profil Sekolah)**
   Used to retrieve the full profile data:
   - Address
   - Principal
   - Phone
   - Email
   - Website
   - Number of male students
   - Number of female students

---

## Technologies Used

### 1. **Undetected ChromeDriver (UC Driver): Listing**
- Opens the school listing pages
- More resistant to blocking by Cloudflare / security checks
- Only one driver is used for the entire listing process

### 2. **Standard Selenium Chrome Driver: School Details**
- Each school is opened with a standard Selenium Chrome driver
- More stable when loading the Angular pages on `sekolah.data.kemendikdasmen.go.id`
- Called sequentially

### 3. **BeautifulSoup + Requests (minimal)**
- Used only to parse the referensi HTML (to obtain the UUID)
- The legacy referensi pages are lightweight, so Selenium is not needed for them

---

## Data Collection Method

### **A. School Listing (1 UC Driver)**
- Open the referensi page for the kecamatan
- Set the table to show 100 rows
- Collect the basic information from the table
- Collect the referensi detail URL of each school

### **B. Get the UUID (requests + BeautifulSoup)**
- Fetch the legacy referensi page
- Find the `profil-sekolah/{uuid}` link
- Use this UUID to access the new profile page
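
The UUID lookup can be sketched without any third-party dependency. The notebook uses `requests` and `BeautifulSoup` for this step; the version below swaps in the standard-library `html.parser` so it runs on its own, and the class `ProfileLinkFinder`, the function `extract_uuid`, the HTML fragment, and the all-zero UUID are placeholders invented for the example.

```python
from html.parser import HTMLParser


class ProfileLinkFinder(HTMLParser):
    """Remember the first <a href> that contains 'profil-sekolah'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            href = dict(attrs).get("href") or ""
            if "profil-sekolah" in href:
                self.href = href


def extract_uuid(html):
    """Return the last path segment of the profile link, or None if absent."""
    finder = ProfileLinkFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    return finder.href.rstrip("/").split("/")[-1]


# Made-up HTML fragment and placeholder UUID, for illustration only.
sample_html = (
    '<p><a href="/other">x</a>'
    '<a href="https://sekolah.data.kemendikdasmen.go.id/profil-sekolah/'
    '00000000-0000-0000-0000-000000000000">Profil</a></p>'
)
sample_uuid = extract_uuid(sample_html)
```

Taking the last path segment after stripping a trailing slash mirrors what the notebook's `extract_uuid_from_referensi` does with the matched link.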

### **C. Get School Details (Fully Sequential Selenium)**
For each school:
1. Open the school's profile page
2. Wait for specific elements to appear (Angular rendering)
3. Read all profile data
4. Close the driver
5. Move on to the next school

Everything runs **without parallelism**, which is safer but slow.
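
The per-school lifecycle above (open, read, close, next) can be sketched with a stand-in driver so that it runs without a browser. `FakeDriver`, its `read_profile` method, and `scrape_sequential` are hypothetical names for this illustration; a real Selenium driver has `get` and `quit`, but not `read_profile`.

```python
class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    open_count = 0  # number of drivers currently open

    def __init__(self):
        FakeDriver.open_count += 1

    def get(self, url):
        if "bad" in url:
            raise RuntimeError(f"cannot load {url}")
        self.url = url

    def read_profile(self):
        return {"Alamat": f"address from {self.url}"}

    def quit(self):
        FakeDriver.open_count -= 1


def scrape_sequential(urls, driver_factory=FakeDriver):
    """Visit each URL in order, one driver per school, always closing it."""
    results = []
    for url in urls:
        detail = {"Alamat": "-"}  # default kept when the page fails
        driver = driver_factory()
        try:
            driver.get(url)
            detail = driver.read_profile()
        except RuntimeError as exc:
            print("DETAIL ERROR:", exc)
        finally:
            driver.quit()  # runs on success and on failure
        results.append(detail)
    return results


rows = scrape_sequential(["profil/a", "profil/bad", "profil/c"])
```

Putting `quit()` in a `finally` block means a failed page still releases its browser before the loop moves on.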

---

## Estimated Scraping Time (Version 1)
Estimates for a single kecamatan:

| Number of schools | Time (approx.) |
|-------------------|----------------|
| 20 schools        | 2–4 minutes    |
| 40 schools        | 5–8 minutes    |
| 60 schools        | 9–14 minutes   |
| 80 schools        | 12–18 minutes  |

Factors that affect the duration:
- Internet speed
- CPU performance (one headless Chrome per page)
- Angular render time (the school profile page is fairly heavy)

This version is **very stable**, but execution takes a long time because every school profile opens a new driver and waits for the page to render.

---

## Advantages & Drawbacks

### Advantages
- The most stable option
- Timeout errors almost never occur
- Easy to deploy to a server
- Does not need many resources

### Drawbacks
- Very slow for kecamatan with more than 50 schools
- Many Chrome instances are created and destroyed one by one
- Not ideal for bulk scraping (for example 20 kecamatan at once)

---

In [2]:
import os
import csv
import json
import time
import requests
from bs4 import BeautifulSoup

# Selenium (standard + uc fallback)
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options as ChromeOptions
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait, Select
    from selenium.webdriver.support import expected_conditions as EC
    STANDARD_SELENIUM_AVAILABLE = True
except ImportError:
    STANDARD_SELENIUM_AVAILABLE = False

import undetected_chromedriver as uc

import logging
uc.logger.setLevel(logging.ERROR)

import gradio as gr
from time import sleep

# ============================ CONFIG ============================
HEADLESS = True
TIMEOUT_PAGE = 15
# ================================================================


# =====================================================
#  UC DRIVER -> used only for LISTING
# =====================================================
def setup_uc_driver(headless=True):
    opts = uc.ChromeOptions()
    if headless:
        opts.add_argument("--headless=new")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    opts.add_argument("--window-size=1600,900")
    opts.add_argument("--blink-settings=imagesEnabled=false")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--log-level=3")

    try:
        driver = uc.Chrome(options=opts)
    except Exception:
        # UC could not start; fall back to the standard Selenium driver.
        driver = setup_standard_driver(headless=headless)

    driver.set_page_load_timeout(TIMEOUT_PAGE)
    return driver


# =====================================================
#  STANDARD SELENIUM -> used only for DETAIL pages
# =====================================================
def setup_standard_driver(headless=True):
    try:
        opts = ChromeOptions()
        if headless:
            opts.add_argument("--headless=new")

        opts.add_argument("--no-sandbox")
        opts.add_argument("--disable-dev-shm-usage")
        opts.add_argument("--window-size=1600,900")
        opts.add_argument("--blink-settings=imagesEnabled=false")
        opts.add_argument("--disable-gpu")
        opts.add_argument("--log-level=3")
        opts.add_experimental_option("excludeSwitches", ["enable-automation"])
        opts.add_experimental_option("useAutomationExtension", False)

        driver = webdriver.Chrome(options=opts)
        driver.set_page_load_timeout(TIMEOUT_PAGE)
        return driver
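    except Exception as exc:
        # Do not fall back to setup_uc_driver(): it already falls back to this
        # function, so a mutual fallback would recurse until RecursionError.
        raise RuntimeError("Could not start a standard Chrome driver") from exc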


# =====================================================
#  REQUEST FAST SESSION
# =====================================================
def create_fast_session():
    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    return s


# =====================================================
#  READ JSON
# =====================================================
def get_kode_kecamatan_from_json(nama_kecamatan, json_path="./list_kecamatan/kecamatan_kab_semarang.json"):
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    wilayah = next(iter(data))
    return data[wilayah]["kecamatan"].get(nama_kecamatan)


# =====================================================
#  GET UUID FROM REFERENSI (legacy HTML)
# =====================================================
def extract_uuid_from_referensi(url, session):
    try:
        r = session.get(url, timeout=10)
        soup = BeautifulSoup(r.text, "html.parser")
        a = soup.find("a", href=lambda x: x and "profil-sekolah" in x)
        if a:
            return a["href"].rstrip("/").split("/")[-1]
    except Exception:
        # Network error, timeout, or unexpected markup: treat the UUID as unavailable.
        return None
    return None


# =====================================================
#  DETAIL SCRAPER (NO THREADPOOL)
# =====================================================
def fetch_detail_single(href, base):
    session = create_fast_session()
    uuid = extract_uuid_from_referensi(href, session)
    if not uuid:
        return base

    url = f"https://sekolah.data.kemendikdasmen.go.id/profil-sekolah/{uuid}"

    driver = setup_standard_driver(headless=HEADLESS)

    detail = {
        "Alamat": "-",
        "Kepala Sekolah": "-",
        "Telepon": "-",
        "Email": "-",
        "Website": "-",
        "Jumlah Siswa Laki-laki": "-",
        "Jumlah Siswa Perempuan": "-"
    }

    try:
        driver.get(url)

        # address (the paragraph right after the page title)
        try:
            WebDriverWait(driver, TIMEOUT_PAGE).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1 + p"))
            )
            detail["Alamat"] = driver.find_element(By.CSS_SELECTOR, "h1 + p").text.strip()
        except:
            pass

        # info blocks (principal, phone, email, website)
        try:
            blocks = driver.find_elements(By.CSS_SELECTOR, "div.grid div.flex")
            for blk in blocks:
                try:
                    label = blk.find_element(By.CSS_SELECTOR, ".text-slate-500").text.lower().strip()
                except:
                    continue

                if "kepala sekolah" in label:
                    detail["Kepala Sekolah"] = blk.find_element(By.CSS_SELECTOR, ".font-semibold").text.strip()

                elif "telepon" in label:
                    try:
                        detail["Telepon"] = blk.find_element(By.TAG_NAME, "a").text.strip()
                    except:
                        pass

                elif "email" in label:
                    try:
                        detail["Email"] = blk.find_element(By.TAG_NAME, "a").text.strip()
                    except:
                        pass

                elif "website" in label:
                    try:
                        detail["Website"] = blk.find_element(By.TAG_NAME, "a").get_attribute("href")
                    except:
                        pass
        except:
            pass

        # statistics (student counts)
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "section div.grid"))
            )
            stat_blocks = driver.find_elements(By.CSS_SELECTOR, "section div.grid div.flex")

            for blk in stat_blocks:
                try:
                    lbl = blk.find_element(By.CSS_SELECTOR, "div.text-slate-600").text.lower()
                    val = blk.find_element(By.CSS_SELECTOR, "div.text-2xl").text.strip()
                except:
                    continue

                if "laki" in lbl or "lak" in lbl or "pa" in lbl:
                    detail["Jumlah Siswa Laki-laki"] = val

                if "perempuan" in lbl or "putri" in lbl:
                    detail["Jumlah Siswa Perempuan"] = val
        except:
            pass

    except Exception as e:
        print("DETAIL ERROR:", e)

    try:
        driver.quit()
    except:
        pass

    return {**base, **detail}


# =====================================================
#  MAIN SCRAPER (NO THREADPOOL)
# =====================================================
def get_sd_mi_schools_final(kode_kecamatan, nama_kecamatan, selected_fields):

    list_driver = setup_uc_driver(headless=HEADLESS)

    sekolah_list = []

    need_detail = any(f in selected_fields for f in [
        "Alamat", "Kepala Sekolah", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan"
    ])

    # ============= SCHOOL LISTING =============
    for jenjang, value in [("SD", "5"), ("MI", "9")]:
        url = f"https://referensi.data.kemendikdasmen.go.id/pendidikan/dikdas/{kode_kecamatan}/3/all/{value}/all"

        list_driver.get(url)

        WebDriverWait(list_driver, 15).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table#table1 tbody tr"))
        )

        # 100 rows
        try:
            Select(list_driver.find_element(By.NAME, "table1_length")).select_by_value("100")
            sleep(0.3)
        except:
            pass

        rows = list_driver.find_elements(By.CSS_SELECTOR, "table#table1 tbody tr")

        for r in rows:
            data = {}

            if "Nama Sekolah" in selected_fields:
                data["Nama Sekolah"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(3)").text.strip()
            if "NPSN" in selected_fields:
                data["NPSN"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text.strip()
            if "Status" in selected_fields:
                data["Status"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(6)").text.strip()
            if "Kelurahan" in selected_fields:
                data["Kelurahan"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(5)").text.strip()

            # DETAIL (no parallel)
            if need_detail:
                href = r.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                full_data = fetch_detail_single(href, data)
                sekolah_list.append(full_data)
            else:
                sekolah_list.append(data)

    try:
        list_driver.quit()
    except:
        pass

    # SORT
    if selected_fields:
        sort_key = selected_fields[0]
        sekolah_list.sort(key=lambda x: str(x.get(sort_key, "")).lower())

    return sekolah_list


# =====================================================
#  SAVE CSV
# =====================================================
def save_school_list_by_kecamatan(nama_kecamatan, selected_fields):
    kode = get_kode_kecamatan_from_json(nama_kecamatan)
    data = get_sd_mi_schools_final(kode, nama_kecamatan, selected_fields)

    os.makedirs("output", exist_ok=True)
    path = f"output/list_sd_mi_{nama_kecamatan.replace(' ', '_').lower()}.csv"

    all_cols = [
        "Kelurahan", "Nama Sekolah", "NPSN", "Status",
        "Kepala Sekolah", "Alamat", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan"
    ]
    cols = [c for c in all_cols if c in selected_fields]

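    with open(path, "w", newline="", encoding="utf-8") as f:
        # Detail rows always carry all seven profile keys, so ignore the ones
        # that were not selected instead of letting DictWriter raise ValueError.
        w = csv.DictWriter(f, fieldnames=cols, extrasaction="ignore")
        w.writeheader()
        w.writerows(data)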

    return f"{len(data)} sekolah disimpan ke {path}", path


# =====================================================
#  GRADIO
# =====================================================
def run_scraper(nama_kecamatan, selected_fields):
    if nama_kecamatan == "-- Pilih Kecamatan --":
        return "Pilih kecamatan.", gr.update(visible=False)

    if not selected_fields:
        return "Pilih minimal 1 kolom.", gr.update(visible=False)

    start = time.time()
    status, path = save_school_list_by_kecamatan(nama_kecamatan, selected_fields)
    dur = int(time.time() - start)

    m, s = divmod(dur, 60)
    waktu = f"\nWaktu: {m} menit {s} detik" if m else f"\nWaktu: {s} detik"

    return status + waktu, gr.update(value=path, visible=True)


def create_gradio_ui(json_path="./list_kecamatan/kecamatan_kab_semarang.json"):
    with open(json_path, "r", encoding="utf-8") as f:
        d = json.load(f)

    wilayah = next(iter(d))
    kec_list = sorted(d[wilayah]["kecamatan"].keys())

    fields = [
        "Kelurahan", "Nama Sekolah", "NPSN", "Status",
        "Kepala Sekolah", "Alamat", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan"
    ]

    with gr.Blocks(title="Scraper SD/MI Kabupaten Semarang - No ThreadPool") as demo:
        gr.Markdown("## Scraper SD/MI Kabupaten Semarang — Selenium Full (Tanpa ThreadPool)")

        kec = gr.Dropdown(choices=["-- Pilih Kecamatan --"] + kec_list,
                          value="-- Pilih Kecamatan --",
                          label="Pilih Kecamatan")

        kolom = gr.CheckboxGroup(label="Kolom CSV", choices=fields)

        btn = gr.Button("Mulai Scrape")
        status = gr.Textbox(label="Status", lines=5)
        file = gr.File(label="CSV Output", visible=False)

        btn.click(run_scraper, [kec, kolom], [status, file], queue=False)

    return demo


ui = create_gradio_ui()
ui.launch(inbrowser=True, share=False)


* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.




## Version 2: Optimized Hybrid (UC + Selenium + ThreadPool)

### Summary
Version 2 is a **major performance improvement** over Version 1.
Where Version 1 runs every step *sequentially*, Version 2 optimizes the process by:

- Still using the **UC Driver** for the *school listing* (to resist blocking and stay stable)
- Using the **standard Selenium Chrome driver** for *detail scraping* (more stable on the Angular pages)
- Fetching details in **parallel (ThreadPoolExecutor)**, which makes scraping **5–10× faster** than Version 1

This version is designed for high performance while remaining stable, and it suits large-scale bulk scraping.

---

## Data Sources

1. **referensi.data.kemendikdasmen.go.id**
   Retrieved via the UC Driver
   - School name
   - NPSN
   - Status
   - Kelurahan
   - Referensi detail URL of the school

2. **sekolah.data.kemendikdasmen.go.id (Profil Sekolah)**
   Retrieved via standard Selenium (in parallel)
   - Address
   - Principal
   - Phone
   - Email
   - Website
   - Number of male students
   - Number of female students

3. **BeautifulSoup + Requests**
   - Retrieves the UUID from the legacy referensi page
   - Lightweight and fast, so Selenium is not needed for this step

---

## Technologies Used

### 1. **Undetected ChromeDriver (UC Driver): Listing**
- Used only to scrape the school table
- Resists bot detection
- Only one instance for the whole process

### 2. **Standard Selenium ChromeDriver: Details (Parallel)**
- Each worker runs one Selenium instance
- Stable on the Angular pages (school profile)
- The number of workers is controlled by `MAX_WORKERS` (default: 12)

### 3. **ThreadPoolExecutor**
- Runs the school detail scraping in parallel
- Speeds up execution by up to 10×
- Most effective for kecamatan with many schools

### 4. **Requests + BeautifulSoup**
- Retrieves the UUID from the legacy HTML page
- Lightweight and fast
- Reduces the use of Selenium

---

## Data Collection Method

### **A. School Listing (UC Driver)**
- Collect all schools in the kecamatan (SD + MI)
- Set the table to 100 rows for efficiency
- Store the link to each school's referensi page

### **B. Getting the UUID**
- Fetch the page with requests
- Parse it with BeautifulSoup
- Build the URL of the current school profile page

### **C. School Details (Parallel Selenium Workers)**
For each school:
1. Take the profile URL
2. Submit a task to the ThreadPool
3. Each worker opens the profile page with standard Selenium
4. The worker reads:
   - address
   - principal
   - email, phone, website
   - number of male / female students
5. The data is returned to the main thread
6. The worker's driver is closed automatically
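
The submit-and-collect pattern in section C can be sketched with a stub worker in place of Selenium. `fetch_detail_stub` and `scrape_parallel` are hypothetical names for this illustration, and the sleep stands in for page load time; the `(link, base)` tuple follows the shape that the notebook's `fetch_detail_worker(link_base_tuple)` receives.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4  # the notebook uses 12; a smaller value keeps this demo light


def fetch_detail_stub(task):
    """Stand-in worker: receives (link, base) and returns base plus details."""
    link, base = task
    time.sleep(0.05)  # simulates page load and render time
    return {**base, "Alamat": f"address for {link}"}


def scrape_parallel(tasks, worker=fetch_detail_stub, max_workers=MAX_WORKERS):
    """Run the worker over all tasks in a thread pool and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, t): t for t in tasks}
        for fut in as_completed(futures):
            link, base = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:  # keep the base row if one worker fails
                print("DETAIL ERROR:", link, exc)
                results.append(base)
    # as_completed yields in finish order, so sort for a stable output.
    results.sort(key=lambda row: row["NPSN"])
    return results


tasks = [(f"ref/{i}", {"NPSN": f"{i:08d}"}) for i in range(8)]
parallel_rows = scrape_parallel(tasks)
```

Threads fit this workload because each task spends most of its time waiting on a browser or the network rather than running Python code.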

---

## Estimated Time: Version 2 (Parallel)

| Number of schools | Estimated time  |
|-------------------|-----------------|
| 20 schools        | 20–40 seconds   |
| 40 schools        | 40–70 seconds   |
| 60 schools        | 60–100 seconds  |
| 80 schools        | 90–140 seconds  |

Speed depends on:
- The value of `MAX_WORKERS`
- Server CPU
- Internet latency
- Angular render time (school profile)

With a capable CPU, Version 2 can run **up to 12× faster than Version 1**; the estimates in the tables for both versions correspond to roughly 6–9×.
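
As a rough check, the speedup implied by the two estimate tables (Version 1 and Version 2) can be computed from the midpoints of each range. The helper names `midpoint` and `speedup_by_size` are illustrative, and comparing midpoints is one assumption among several possible ways to compare the ranges.

```python
# Estimated durations in seconds as (low, high), copied from the two tables.
V1_SECONDS = {20: (120, 240), 40: (300, 480), 60: (540, 840), 80: (720, 1080)}
V2_SECONDS = {20: (20, 40), 40: (40, 70), 60: (60, 100), 80: (90, 140)}


def midpoint(low_high):
    """Return the middle of a (low, high) range."""
    low, high = low_high
    return (low + high) / 2


def speedup_by_size(v1=V1_SECONDS, v2=V2_SECONDS):
    """Return {school_count: v1_midpoint / v2_midpoint}, rounded to 1 decimal."""
    return {n: round(midpoint(v1[n]) / midpoint(v2[n]), 1) for n in v1}


speedups = speedup_by_size()
# speedups == {20: 6.0, 40: 7.1, 60: 8.6, 80: 7.8}
```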

---

## Version 1 vs Version 2

| Feature / Aspect | Version 1 (Sequential) | Version 2 (Parallel Optimized) |
|------------------|------------------------|--------------------------------|
| School listing | UC Driver | UC Driver |
| School details | Standard Selenium (1×) | Standard Selenium + ThreadPool (multi-worker) |
| Requests + BS4 | Yes | Yes |
| Speed | Slow (1 school = 3–8 seconds) | Very fast (12 schools in parallel) |
| Stability | Very stable | Stable and fast |
| CPU usage | Low | Medium to high |
| Best for | Small jobs (1–2 kecamatan) | Bulk scraping (many kecamatan) |
| Driver usage | 1 per school | 12 per batch |

---

In [None]:
import os
import csv
import json
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

# Selenium imports (standard + fallback)
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options as ChromeOptions
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait, Select
    from selenium.webdriver.support import expected_conditions as EC
    STANDARD_SELENIUM_AVAILABLE = True
except Exception:
    STANDARD_SELENIUM_AVAILABLE = False

import undetected_chromedriver as uc  # for listing only (single instance)
import logging
uc.logger.setLevel(logging.ERROR)


# =================== CONFIG ===================
MAX_WORKERS = 12  
HEADLESS = True
TIMEOUT_PAGE = 8
# ===============================================


# =====================================================
#  UC Driver (for LISTING only → 1 instance)
# =====================================================
def setup_uc_driver(headless=True):
    opts = uc.ChromeOptions()
    if headless:
        opts.add_argument("--headless=new")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    opts.add_argument("--window-size=1600,900")
    opts.add_argument("--blink-settings=imagesEnabled=false")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--log-level=3")

    try:
        driver = uc.Chrome(options=opts)
    except Exception:
        # UC could not start; fall back to the standard Selenium driver.
        driver = setup_standard_driver(headless=headless)

    driver.set_page_load_timeout(TIMEOUT_PAGE)
    return driver


# =====================================================
#  Standard Chrome Selenium (for DETAIL)
# =====================================================
def setup_standard_driver(headless=True):
    try:
        opts = ChromeOptions()
        if headless:
            opts.add_argument("--headless=new")
        opts.add_argument("--no-sandbox")
        opts.add_argument("--disable-dev-shm-usage")
        opts.add_argument("--window-size=1600,900")
        opts.add_argument("--blink-settings=imagesEnabled=false")
        opts.add_argument("--disable-gpu")
        opts.add_argument("--log-level=3")

        opts.add_experimental_option("excludeSwitches", ["enable-automation"])
        opts.add_experimental_option("useAutomationExtension", False)

        driver = webdriver.Chrome(options=opts)
        driver.set_page_load_timeout(TIMEOUT_PAGE)
        return driver
    except:
        return setup_uc_driver(headless=headless)


# =====================================================
#  Requests Session FastPool
# =====================================================
def create_fast_session():
    s = requests.Session()
    adapter = requests.adapters.HTTPAdapter(pool_connections=50, pool_maxsize=50, max_retries=2)
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    return s


# =====================================================
#  Kecamatan Code Reader
# =====================================================
def get_kode_kecamatan_from_json(nama_kecamatan, json_path="./list_kecamatan/kecamatan_kab_semarang.json"):
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    wilayah = next(iter(data))
    return data[wilayah]["kecamatan"].get(nama_kecamatan)


# =====================================================
#  Extract UUID from referensi.data
# =====================================================
def extract_uuid_from_referensi(url, session):
    try:
        r = session.get(url, timeout=10)
        soup = BeautifulSoup(r.text, "html.parser")
        a = soup.find("a", href=lambda x: x and "profil-sekolah" in x)
        if a:
            return a["href"].rstrip("/").split("/")[-1]
    except:
        return None
    return None


# =====================================================
#  Worker → Detail sekolah.data via Selenium
# =====================================================
def fetch_detail_worker(link_base_tuple):
    link, base = link_base_tuple
    session = create_fast_session()

    uuid = extract_uuid_from_referensi(link, session)
    if not uuid:
        return base

    url = f"https://sekolah.data.kemendikdasmen.go.id/profil-sekolah/{uuid}"

    # Use standard selenium for parallel workers
    driver = setup_standard_driver(headless=HEADLESS)

    detail = {
        "Alamat": "-",
        "Kepala Sekolah": "-",
        "Telepon": "-",
        "Email": "-",
        "Website": "-",
        "Jumlah Siswa Laki-laki": "-",
        "Jumlah Siswa Perempuan": "-"
    }

    try:
        driver.get(url)

        # Wait for the Angular header section to render
        try:
            WebDriverWait(driver, TIMEOUT_PAGE).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1 + p"))
            )
        except:
            time.sleep(1)

        # ========================================
        # 1) Alamat
        # ========================================
        try:
            detail["Alamat"] = driver.find_element(By.CSS_SELECTOR, "h1 + p").text.strip()
            
        except:
            pass

        # ========================================
        # 2) Kepala Sekolah / Email / Telepon / Website
        # ========================================
        try:
            blocks = driver.find_elements(By.CSS_SELECTOR, "div.grid div.flex")

            for blk in blocks:
                try:
                    label = blk.find_element(By.CSS_SELECTOR, ".text-slate-500").text.lower().strip()
                except:
                    continue

                if "kepala sekolah" in label:
                    detail["Kepala Sekolah"] = blk.find_element(By.CSS_SELECTOR, ".font-semibold").text.strip()

                elif "telepon" in label:
                    try:
                        detail["Telepon"] = blk.find_element(By.TAG_NAME, "a").text.strip()
                    except:
                        pass

                elif "email" in label:
                    try:
                        detail["Email"] = blk.find_element(By.TAG_NAME, "a").text.strip()
                    except:
                        pass

                elif "website" in label:
                    try:
                        detail["Website"] = blk.find_element(By.TAG_NAME, "a").get_attribute("href")
                    except:
                        pass
        except:
            pass

        # ========================================
        # 3) Student statistics
        # ========================================
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "section div.grid"))
            )
            time.sleep(0.5)

            stat_blocks = driver.find_elements(By.CSS_SELECTOR,
                "section div.grid div.flex, section div.grid > div"
            )

            for sblk in stat_blocks:
                try:
                    lbl = sblk.find_element(By.CSS_SELECTOR, "div.text-slate-600").text.lower().strip()
                    val = sblk.find_element(By.CSS_SELECTOR, "div.text-2xl").text.strip()
                except:
                    continue

                # Match only "laki"; a bare "pa" test also matches unrelated labels
                if "laki" in lbl:
                    detail["Jumlah Siswa Laki-laki"] = val

                if ("perempuan" in lbl) or ("putri" in lbl):
                    detail["Jumlah Siswa Perempuan"] = val

        except Exception as e:
            print("[STAT ERR]", e)

    finally:
        try:
            driver.quit()
        except:
            pass

    return {**base, **detail}


# =====================================================
#  MAIN SCRAPER
# =====================================================
def get_sd_mi_schools_final(kode_kecamatan, nama_kecamatan, selected_fields, progress=None):

    list_driver = setup_uc_driver(headless=HEADLESS)

    sekolah_list = []
    urls = []

    need_detail = any(f in selected_fields for f in [
        "Alamat", "Kepala Sekolah", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan"
    ])

    # ---------------- SCHOOL LISTING ----------------
    for jenjang, value in [("SD", "5"), ("MI", "9")]:

        url = f"https://referensi.data.kemendikdasmen.go.id/pendidikan/dikdas/{kode_kecamatan}/3/all/{value}/all"

        list_driver.get(url)

        WebDriverWait(list_driver, 20).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table#table1 tbody tr"))
        )

        # set 100 rows
        try:
            Select(list_driver.find_element(By.NAME, "table1_length")).select_by_value("100")
            time.sleep(0.3)
        except:
            pass

        rows = list_driver.find_elements(By.CSS_SELECTOR, "table#table1 tbody tr")

        for r in rows:
            data = {}
            if "Nama Sekolah" in selected_fields:
                data["Nama Sekolah"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(3)").text.strip()
            if "NPSN" in selected_fields:
                data["NPSN"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(2)").text.strip()
            if "Status" in selected_fields:
                data["Status"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(6)").text.strip()
            if "Kelurahan" in selected_fields:
                data["Kelurahan"] = r.find_element(By.CSS_SELECTOR, "td:nth-child(5)").text.strip()

            if need_detail:
                urls.append((r.find_element(By.CSS_SELECTOR, "a").get_attribute("href"), data))
            else:
                sekolah_list.append(data)

    try:
        list_driver.quit()
    except:
        pass

    # ---------------- DETAIL (ThreadPool) ----------------
    if need_detail and urls:

        results = []
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
            futures = [ex.submit(fetch_detail_worker, lb) for lb in urls]
            for i, fut in enumerate(as_completed(futures)):
                try:
                    row = fut.result()
                except:
                    row = {}
                results.append(row)

                if progress and (i + 1) % 5 == 0:
                    progress((i + 1) / len(futures))

        sekolah_list.extend(results)

    # SORT
    if selected_fields:
        key = selected_fields[0]
        sekolah_list.sort(key=lambda x: str(x.get(key, "")).lower())

    return sekolah_list


# =====================================================
#  SAVE CSV
# =====================================================
def save_school_list_by_kecamatan(nama_kecamatan, selected_fields):
    kode = get_kode_kecamatan_from_json(nama_kecamatan)
    data = get_sd_mi_schools_final(kode, nama_kecamatan, selected_fields)
    if not data:
        return f"Tidak ada data '{nama_kecamatan}'", None

    os.makedirs("output", exist_ok=True)
    path = f"output/list_sd_mi_{nama_kecamatan.lower().replace(' ', '_')}.csv"

    cols = [c for c in [
        "Kelurahan", "Nama Sekolah", "NPSN", "Status",
        "Kepala Sekolah", "Alamat", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan"
    ] if c in selected_fields]

    with open(path, "w", newline="", encoding="utf-8") as f:
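        # Workers return every detail field; ignore the ones that were not
        # selected instead of letting DictWriter raise ValueError on extra keys.
        w = csv.DictWriter(f, fieldnames=cols, extrasaction="ignore")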
        w.writeheader()
        w.writerows(data)

    return f"{len(data)} sekolah disimpan ke '{path}'", path


# =====================================================
#  Gradio UI
# =====================================================
import gradio as gr

def run_scraper(nama_kecamatan, selected_fields):
    if nama_kecamatan in ["", "-- Pilih Kecamatan --"]:
        yield "Pilih kecamatan.", gr.update(visible=False)
        return

    if not selected_fields:
        yield "Pilih minimal 1 kolom.", gr.update(visible=False)
        return

    t0 = time.time()
    s, p = save_school_list_by_kecamatan(nama_kecamatan, selected_fields)
    d = int(time.time() - t0)
    m, s_ = divmod(d, 60)
    txt = f"{s}\nWaktu: {m} menit {s_} detik" if m else f"{s}\nWaktu: {s_} detik"

    yield txt, gr.update(value=p, visible=True)


def create_gradio_ui(json_path="./list_kecamatan/kecamatan_kab_semarang.json"):
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    wilayah = next(iter(data))
    kec_list = sorted(data[wilayah]["kecamatan"].keys())

    fields = [
        "Kelurahan", "Nama Sekolah", "NPSN", "Status",
        "Kepala Sekolah", "Alamat", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan"
    ]

    with gr.Blocks(title="Scraper SD/MI Semarang — Selenium Full (dengan ThreadPool)") as demo:
        gr.Markdown("## Scraper SD/MI Kab. Semarang — Selenium Full (dengan ThreadPool)")

        kec = gr.Dropdown(label="Pilih Kecamatan",
                          choices=["-- Pilih Kecamatan --"] + kec_list,
                          value="-- Pilih Kecamatan --")

        kolom = gr.CheckboxGroup(label="Kolom CSV", choices=fields)

        btn = gr.Button("Mulai Scrape")
        stat = gr.Textbox(label="Status", lines=4)
        file = gr.File(label="File CSV", visible=False)

        btn.click(run_scraper, [kec, kolom], [stat, file])

    return demo


if __name__ == "__main__":
    ui = create_gradio_ui()
    ui.launch(inbrowser=True, share=False)



* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
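
Each detail worker in the cell above returns every profile field, while the CSV header holds only the columns ticked in the UI, so the writer has to tolerate extra keys. The following is a minimal sketch of that behaviour with `csv.DictWriter`; the helper name `rows_to_csv` and the sample rows are hypothetical and not part of the scraper.

```python
import csv
import io


def rows_to_csv(rows, selected_fields):
    """Serialize rows to CSV text, keeping only the selected columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=selected_fields,
        extrasaction="ignore",  # drop keys that are not in fieldnames
        restval="-",            # fill columns a row does not provide
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical rows: the first has extra keys, the second lacks "Alamat"
sample_rows = [
    {"Nama Sekolah": "SD Contoh 1", "NPSN": "00000001", "Alamat": "Jl. Contoh 1", "Email": "-"},
    {"Nama Sekolah": "SD Contoh 2", "NPSN": "00000002"},
]

csv_text = rows_to_csv(sample_rows, ["Nama Sekolah", "Alamat"])
print(csv_text)
```

With the default `extrasaction="raise"`, the first sample row would trigger a `ValueError`, because it carries keys that are not listed in `fieldnames`.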


## Final update: listing data source changed because the Kemendikdasmen reference website is down or no longer available

In [4]:
import os
import csv
import json
import time
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

# Selenium imports (standard + fallback)
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options as ChromeOptions
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    STANDARD_SELENIUM_AVAILABLE = True
except Exception:
    STANDARD_SELENIUM_AVAILABLE = False

import undetected_chromedriver as uc
uc.logger.setLevel(logging.ERROR)

# =================== CONFIG ===================
MAX_WORKERS = 12
HEADLESS = True
TIMEOUT_PAGE = 20  
# ===============================================

# =================== Logging ===================
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger(__name__)
# ===============================================


# =====================================================
#  LISTING DRIVER → UC CHROME
# =====================================================
def setup_uc_driver(headless=True):
    opts = uc.ChromeOptions()
    if headless:
        opts.add_argument("--headless=new")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    opts.add_argument("--window-size=1600,900")
    opts.add_argument("--blink-settings=imagesEnabled=false")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--log-level=3")

    try:
        driver = uc.Chrome(options=opts)
    except Exception:
        # Fall back to plain Selenium, but forbid it from bouncing back to UC:
        # the two setup functions fall back to each other and would otherwise
        # recurse without end when Chrome cannot start at all.
        driver = setup_standard_driver(headless, allow_uc_fallback=False)

    driver.set_page_load_timeout(TIMEOUT_PAGE)
    return driver


# =====================================================
#  DETAIL DRIVER → STANDARD SELENIUM
# =====================================================
def setup_standard_driver(headless=True, allow_uc_fallback=True):
    try:
        opts = ChromeOptions()
        if headless:
            opts.add_argument("--headless=new")
        opts.add_argument("--no-sandbox")
        opts.add_argument("--disable-dev-shm-usage")
        opts.add_argument("--window-size=1600,900")
        opts.add_argument("--blink-settings=imagesEnabled=false")
        opts.add_argument("--disable-gpu")

        # anti-automation
        opts.add_experimental_option("excludeSwitches", ["enable-automation"])
        opts.add_experimental_option("useAutomationExtension", False)

        driver = webdriver.Chrome(options=opts)
        driver.set_page_load_timeout(TIMEOUT_PAGE)
        return driver

    except Exception as e:
        # Re-raise when called as the UC fallback, so the chain stops here
        if not allow_uc_fallback:
            raise
        log.warning(f"Standard Chrome gagal, fallback ke UC: {e}")
        return setup_uc_driver(headless)


# =====================================================
#  READ KECAMATAN CODE FROM JSON
# =====================================================
def get_kode_kecamatan_from_json(nama_kecamatan, json_path="./list_kecamatan/kecamatan_kab_semarang.json"):
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    wilayah = next(iter(data))
    return data[wilayah]["kecamatan"].get(nama_kecamatan)


# =====================================================
#  AUTO-DETECT DAPO SP TABLE
# =====================================================
def wait_for_table_rows(driver, timeout=20):

    # The table id changes often, so try several selectors in order
    TABLE_CHOICES = [
        "table#dataTables tbody tr",
        "table#example tbody tr",
        "table#myTable tbody tr",
        "table.table tbody tr",
        "tbody tr"
    ]

    end = time.time() + timeout
    while time.time() < end:
        for sel in TABLE_CHOICES:
            try:
                rows = driver.find_elements(By.CSS_SELECTOR, sel)
                if rows:
                    return sel
            except Exception:
                pass
        time.sleep(0.25)

    return None


# =====================================================
#  AUTO RETRY + AUTO REFRESH FOR THE DAPO LISTING
# =====================================================
def load_listing_with_retry(driver, url, max_retry=4):
    """
    Load the listing page with retries:
    - open the page
    - detect the table
    - if detection fails → refresh → retry
    - copes with Dapo pages that load slowly or come back blank
    """

    for attempt in range(1, max_retry + 1):
        log.info(f"[LISTING] Attempt {attempt}/{max_retry}: {url}")

        try:
            driver.get(url)
        except:
            pass

        selector = wait_for_table_rows(driver, timeout=12)
        if selector:
            log.info(f"[LISTING] Tabel ditemukan: {selector}")
            return selector

        log.warning("[LISTING] Tabel belum muncul, refresh halaman...")
        try:
            driver.refresh()
        except:
            pass

        time.sleep(2)  # give the server time to recover

    log.error("[LISTING] Gagal memuat tabel setelah semua retry")
    return None


# =====================================================
#  FETCH DETAIL FROM SEKOLAH.DATA (FINAL & STABLE)
# =====================================================
def fetch_detail_worker(npsn_and_base):
    npsn, base = npsn_and_base
    driver = None

    detail = {
        "Alamat": "-",
        "Kepala Sekolah": "-",
        "Telepon": "-",
        "Email": "-",
        "Website": "-",
        "Akreditasi": "-",
        "Yayasan": "-",
        "Jumlah Siswa Laki-laki": "-",
        "Jumlah Siswa Perempuan": "-",
    }

    try:
        driver = setup_standard_driver(headless=HEADLESS)

        # 1. open the search page
        search_url = f"https://sekolah.data.kemendikdasmen.go.id/sekolah?keyword={npsn}&page=0&size=12"
        driver.get(search_url)

        # wait for the results grid
        try:
            WebDriverWait(driver, 12).until(
                EC.presence_of_element_located((By.XPATH, "//article"))
            )
        except:
            return base

        # find the article that contains the NPSN
        articles = driver.find_elements(By.XPATH, "//article")
        target = None
        for a in articles:
            if str(npsn) in a.text:
                target = a
                break

        # fall back to the first search result when no card shows the NPSN text
        if target is None:
            target = articles[0]

        # click the "Lihat" button
        try:
            lihat_btn = target.find_element(By.XPATH, ".//button[contains(.,'Lihat')]")
            driver.execute_script("arguments[0].click();", lihat_btn)
        except:
            return base

        # wait until the detail page has settled
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, "//h1"))
        )

        # extra pause so Angular finishes rendering every section
        time.sleep(1.3)

        # =============================== ALAMAT ===============================
        try:
            h1 = driver.find_element(By.XPATH, "//h1")
            alamat = h1.find_element(By.XPATH, "following-sibling::p[1]").text.strip()
            detail["Alamat"] = alamat
        except:
            pass

        # =============================== INFO GRID ===============================
        info_blocks = driver.find_elements(
            By.XPATH,
            "//div[contains(@class,'grid') and contains(@class,'gap-x-6')]//div[contains(@class,'flex')]"
        )

        for blk in info_blocks:
            try:
                # read the label
                label = blk.find_element(
                    By.XPATH,
                    ".//div[contains(@class,'text-slate-500')]"
                ).text.lower().strip()

                # read the default value from <div class='font-semibold'>
                try:
                    value_div = blk.find_element(
                        By.XPATH,
                        ".//div[contains(@class,'font-semibold')]"
                    ).text.strip()
                except:
                    value_div = ""

                # ================== AKREDITASI ==================
                if "akreditasi" in label:
                    detail["Akreditasi"] = value_div

                # ================== KEPALA SEKOLAH ==================
                elif "kepala sekolah" in label:
                    detail["Kepala Sekolah"] = value_div

                # ================== YAYASAN ==================
                elif "yayasan" in label:
                    detail["Yayasan"] = value_div

                # ================== TELEPON ==================
                elif "telepon" in label:
                    try:
                        detail["Telepon"] = blk.find_element(
                            By.XPATH, ".//a[starts-with(@href,'tel')]"
                        ).text.strip()
                    except:
                        detail["Telepon"] = value_div or "-"

                # ================== EMAIL ==================
                elif "email" in label:
                    try:
                        detail["Email"] = blk.find_element(
                            By.XPATH, ".//a[starts-with(@href,'mailto')]"
                        ).text.strip()
                    except:
                        detail["Email"] = value_div or "-"

                # ================== WEBSITE ==================
                elif "website" in label:
                    try:
                        href = blk.find_element(By.XPATH, ".//a").get_attribute("href")
                        detail["Website"] = href if href.startswith("http") else "-"
                    except:
                        detail["Website"] = "-"

            except:
                continue

        # =============================== STUDENT STATISTICS ===============================
        stat_blocks = driver.find_elements(
            By.XPATH,
            "//div[contains(@class,'rounded-xl')]//div[contains(@class,'px-3')]"
        )

        for s in stat_blocks:
            try:
                lbl = s.find_element(By.XPATH, ".//div[contains(@class,'line-clamp-1')]").text.lower()
                val = s.find_element(By.XPATH, ".//div[contains(@class,'text-2xl')]").text.strip()

                if "laki" in lbl:
                    detail["Jumlah Siswa Laki-laki"] = val

                elif "perempuan" in lbl:
                    detail["Jumlah Siswa Perempuan"] = val

            except:
                continue

        return {**base, **detail}

    except Exception as e:
        log.exception(f"DETAIL ERROR {npsn}: {e}")
        return base

    finally:
        if driver:
            try:
                driver.quit()
            except:
                pass


# =====================================================
#  MAIN SCRAPER LISTING
# =====================================================
def get_sd_mi_schools_final(kode_kecamatan, nama_kecamatan, selected_fields, progress=None):

    list_driver = None
    sekolah_list = []
    urls = []

    need_detail = any(f in selected_fields for f in [
        "Alamat", "Kepala Sekolah", "Telepon", "Email",
        "Website", "Jumlah Siswa Laki-laki",
        "Jumlah Siswa Perempuan", "Akreditasi", "Yayasan"
    ])

    try:
        list_driver = setup_uc_driver(headless=HEADLESS)

        if not kode_kecamatan:
            return []

        url = f"https://dapo.kemendikdasmen.go.id/sp/3/{kode_kecamatan}"

        # === LOAD PAGE + AUTO RETRY ===
        selector = load_listing_with_retry(list_driver, url, max_retry=4)
        if not selector:
            return []

        rows = list_driver.find_elements(By.CSS_SELECTOR, selector)

        for r in rows:
            cols = r.find_elements(By.TAG_NAME, "td")
            if len(cols) < 5:
                continue

            data = {}

            if "Nama Sekolah" in selected_fields:
                data["Nama Sekolah"] = cols[1].text.strip()

            if "NPSN" in selected_fields:
                data["NPSN"] = cols[2].text.strip()

            if "Status" in selected_fields:
                data["Status"] = cols[4].text.strip()

            if need_detail:
                urls.append((cols[2].text.strip(), data))
            else:
                sekolah_list.append(data)

    except Exception as e:
        log.exception(f"LISTING ERROR {nama_kecamatan}: {e}")

    finally:
        if list_driver:
            try:
                list_driver.quit()
            except:
                pass

    # === DETAIL MULTI THREAD ===
    if need_detail and urls:
        results = []
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
            futures = [ex.submit(fetch_detail_worker, u) for u in urls]
            for i, fut in enumerate(as_completed(futures)):
                try:
                    results.append(fut.result())
                except:
                    results.append({})
        sekolah_list.extend(results)

    return sekolah_list


# =====================================================
#  SAVE CSV
# =====================================================
def save_school_list_by_kecamatan(nama_kecamatan, selected_fields):
    kode = get_kode_kecamatan_from_json(nama_kecamatan)
    data = get_sd_mi_schools_final(kode, nama_kecamatan, selected_fields)

    if not data:
        return f"Tidak ada data '{nama_kecamatan}'", None

    os.makedirs("output", exist_ok=True)
    path = f"output/list_sd_mi_{nama_kecamatan.lower().replace(' ', '_')}.csv"

    cols = selected_fields.copy()

    with open(path, "w", newline="", encoding="utf-8") as f:
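        # Workers return every detail field; ignore the ones that were not
        # selected instead of letting DictWriter raise ValueError on extra keys.
        writer = csv.DictWriter(f, fieldnames=cols, extrasaction="ignore")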
        writer.writeheader()
        writer.writerows(data)

    return f"{len(data)} sekolah disimpan ke '{path}'", path



# =====================================================
#  Gradio UI
# =====================================================
import gradio as gr

def run_scraper(nama_kecamatan, selected_fields):
    # quick validation
    if nama_kecamatan in ["", "-- Pilih Kecamatan --"]:
        yield "Pilih kecamatan.", gr.update(visible=False)
        return

    if not selected_fields:
        yield "Pilih minimal 1 kolom.", gr.update(visible=False)
        return

    t0 = time.time()
    s, p = save_school_list_by_kecamatan(nama_kecamatan, selected_fields)
    d = int(time.time() - t0)
    m, s_ = divmod(d, 60)
    txt = f"{s}\nWaktu: {m} menit {s_} detik" if m else f"{s}\nWaktu: {s_} detik"

    yield txt, gr.update(value=p, visible=True)


def create_gradio_ui(json_path="./list_kecamatan/kecamatan_kab_semarang.json"):
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    wilayah = next(iter(data))
    kec_list = sorted(data[wilayah]["kecamatan"].keys())

    fields = [
        "Nama Sekolah", "NPSN", "Status",
        "Kepala Sekolah", "Alamat", "Telepon", "Email", "Website",
        "Jumlah Siswa Laki-laki", "Jumlah Siswa Perempuan", "Akreditasi", "Yayasan"
    ]

    with gr.Blocks(title="Scraper SD/MI Semarang — Selenium Full (UPDATED) with ThreadPool") as demo:
        gr.Markdown("## Scraper SD/MI Kab. Semarang — Selenium Full (dapo -> sekolah.data) with ThreadPool")

        kec = gr.Dropdown(label="Pilih Kecamatan",
                          choices=["-- Pilih Kecamatan --"] + kec_list,
                          value="-- Pilih Kecamatan --")

        kolom = gr.CheckboxGroup(label="Kolom CSV", choices=fields)

        btn = gr.Button("Mulai Scrape")
        stat = gr.Textbox(label="Status", lines=4)
        file = gr.File(label="File CSV", visible=False)

        btn.click(run_scraper, [kec, kolom], [stat, file])

    return demo


if __name__ == "__main__":
    ui = create_gradio_ui()
    ui.launch(inbrowser=True, share=False)


2025-11-25 09:46:27,836 [INFO] HTTP Request: GET http://127.0.0.1:7863/gradio_api/startup-events "HTTP/1.1 200 OK"
2025-11-25 09:46:27,845 [INFO] HTTP Request: HEAD http://127.0.0.1:7863/ "HTTP/1.1 200 OK"
2025-11-25 09:46:27,845 [INFO] HTTP Request: HEAD https://huggingface.co/api/telemetry/https%3A/api.gradio.app/gradio-initiated-analytics "HTTP/1.1 200 OK"


* Running on local URL:  http://127.0.0.1:7863
* To create a public link, set `share=True` in `launch()`.


2025-11-25 09:46:28,125 [INFO] HTTP Request: HEAD https://huggingface.co/api/telemetry/https%3A/api.gradio.app/gradio-launched-telemetry "HTTP/1.1 200 OK"
2025-11-25 09:46:28,479 [INFO] HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
2025-11-25 09:46:41,014 [INFO] patching driver executable C:\Users\ASUS\appdata\roaming\undetected_chromedriver\undetected_chromedriver.exe
2025-11-25 09:46:43,705 [INFO] [LISTING] Attempt 1/4: https://dapo.kemendikdasmen.go.id/sp/3/032211
2025-11-25 09:47:06,345 [INFO] [LISTING] Attempt 2/4: https://dapo.kemendikdasmen.go.id/sp/3/032211
2025-11-25 09:47:19,085 [INFO] [LISTING] Tabel ditemukan: table#dataTables tbody tr
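
In the log above the table only appears on the second attempt. The retry idea behind `load_listing_with_retry` can be tried without a browser. The sketch below is a simplified stand-in, not the notebook's function: `retry_until` and `fake_load` are hypothetical names, and the fake page simply returns nothing on its first load.

```python
import time


def retry_until(action, attempts=4, delay=0.0, sleep=time.sleep):
    """Call action() until it returns a truthy value or attempts run out.

    Returns a (result, attempts_used) tuple; result is None on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = action()
        except Exception:
            result = None  # treat an error like "table not found yet"
        if result:
            return result, attempt
        if attempt < attempts:
            sleep(delay)  # pause before the next try
    return None, attempts


# Simulated page that only shows its table on the second load
calls = {"n": 0}


def fake_load():
    calls["n"] += 1
    return "table#dataTables tbody tr" if calls["n"] >= 2 else None


selector, used = retry_until(fake_load, attempts=4)
print(selector, used)
```

Returning the attempt count alongside the result makes it easy to log how often the source needed a second load, which is the pattern visible in the run above.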
