<a href="https://colab.research.google.com/github/Febri-ElectricalEngineering/CapstoneProject/blob/main/Capstone_Hoax_Indonesia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone: Indonesian News Hoax Detection + Neutral Summarization (IBM Granite)

**Platform:** Google Colab (GPU) → commit the notebook to GitHub via *File → Save a copy in GitHub*  
**Model:** `ibm-granite/granite-3.1-3b-a800m-instruct` (open, long context)  
**Main dataset:** IDNHoaxCorpus (CSV) from GitHub  

📌 *This notebook runs end to end from scratch: EDA → baseline (TF‑IDF + Logistic Regression) → Granite few‑shot classification → Granite summarization → evaluation → insights → recommendations.*


## 0) Colab runtime setup
1. **Runtime → Change runtime type → GPU (T4/A100)**  
2. Run the installation cell below.


In [None]:
!pip -q install --upgrade pip
!pip -q install transformers accelerate bitsandbytes datasets scikit-learn evaluate rouge-score matplotlib pandas numpy tqdm
import os, re, json, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import torch
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
print('CUDA available:', torch.cuda.is_available())


CUDA available: False


## 1) Fetch & prepare the data (no Drive/Kaggle)
We **download the repo ZIP** of IDNHoaxCorpus from GitHub, then load the CSV from its `dataset/` folder.

**Label note:** the dataset provides the labels `Hoax`, `Valid`, and `?`.  
For **binary** classification we use:
- `Hoax` → `HOAX`
- `Valid` → `NON-HOAX`
- `?` → **dropped** (unclear)
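
The mapping above can be sketched on a few made-up label strings. This is only an illustration: `to_binary_label` and the sample values are hypothetical, and the case/whitespace normalization is assumed to mirror what the cleaning cell does later in the notebook.

```python
# Toy illustration of the binary label mapping; the sample labels are made up.
MAP_BIN = {"hoax": "HOAX", "valid": "NON-HOAX"}

def to_binary_label(raw):
    """Return 'HOAX' / 'NON-HOAX', or None for unclear labels such as '?'."""
    return MAP_BIN.get(str(raw).strip().lower())

raw_labels = ["Hoax", " valid ", "?", "HOAX"]
mapped = [to_binary_label(x) for x in raw_labels]

# Rows whose label maps to None are the ones that get dropped.
kept = [m for m in mapped if m is not None]
print(mapped)
print(kept)
```

Returning `None` for anything outside the two known labels keeps the "drop unclear rows" rule explicit instead of relying on a default class.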


In [None]:
%%bash
set -e
URL_ZIP="https://codeload.github.com/9uz/IDNHoaxCorpus/zip/refs/heads/main"
OUT_ZIP="IDNHoaxCorpus.zip"
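if [ ! -f "$OUT_ZIP" ]; then
  wget -q "$URL_ZIP" -O "$OUT_ZIP"
fi
# Extract only when the folder is missing, so the cell can be re-run safely
if [ ! -d "IDNHoaxCorpus-main" ]; then
  unzip -q "$OUT_ZIP"
fi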
ls -R | head -n 40


.:
IDNHoaxCorpus-main
IDNHoaxCorpus.zip
sample_data

./IDNHoaxCorpus-main:
DataCorpus.ipynb
dataset
LICENSE
README.md

./IDNHoaxCorpus-main/dataset:
datasetUMPOHoax.csv
index

./sample_data:
anscombe.json
california_housing_test.csv
california_housing_train.csv
mnist_test.csv
mnist_train_small.csv
README.md


In [None]:
import glob, os, pandas as pd
base = 'IDNHoaxCorpus-main'
csv_candidates = sorted(glob.glob(os.path.join(base, 'dataset', '*.csv')))  # sorted for a deterministic pick
assert len(csv_candidates) > 0, 'No CSV found in dataset/. Check the repo structure.'
csv_path = csv_candidates[0]
print('Using file:', csv_path)
df = pd.read_csv(csv_path)
print('Columns:', df.columns.tolist())
df.head(3)

Using file: IDNHoaxCorpus-main/dataset/datasetUMPOHoax.csv
Columns: ['topik', 'keyword', 'tweet', 'gambar', 'url', 'label']


Unnamed: 0,topik,keyword,tweet,gambar,url,label
0,Air Garam,Air garam sumber energi,Mahasiswa ITS buat pembangkit listrik dari kol...,,https://twitter.com/Klik_iT_indo/status/370802...,hoax
1,Air Garam,Air garam sumber energi,Lampu Dengan Sumber Energi Air Garam Ini Berpo...,,https://twitter.com/Robin_nugraha/status/62854...,hoax
2,Air Garam,Air garam sumber energi,#didUknow Lampu LED yang menggunakan air dan g...,,https://twitter.com/SerbaTauID/status/24731045...,hoax


### 1.1 Column cleanup & label mapping
We locate the main text column (`tweet` or `text`), apply minimal normalization (strip URLs, collapse whitespace), and map the labels to the two binary classes.


In [None]:
# Pick the first candidate column name that exists in the CSV
text_col = next((c for c in ['text', 'tweet', 'content', 'judul', 'title'] if c in df.columns), None)
assert text_col is not None, 'No text column found. Check the column names in the CSV.'

label_col = next((c for c in ['label', 'kelas', 'class', 'category'] if c in df.columns), None)
assert label_col is not None, 'No label column found. Check the column names in the CSV.'

def clean_text(s):
    s = str(s)
    s = re.sub(r'https?://\S+', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

df = df[[text_col, label_col]].dropna()
df[text_col] = df[text_col].apply(clean_text)
df[label_col] = df[label_col].astype(str).str.strip().str.lower()

map_bin = {'hoax':'HOAX', 'valid':'NON-HOAX'}
df = df[df[label_col].isin(['hoax','valid'])].copy()
df['label_bin'] = df[label_col].map(map_bin)
df = df[df[text_col].str.len() > 10]
df.head(5)

Unnamed: 0,tweet,label,label_bin
0,Mahasiswa ITS buat pembangkit listrik dari kol...,hoax,HOAX
1,Lampu Dengan Sumber Energi Air Garam Ini Berpo...,hoax,HOAX
2,#didUknow Lampu LED yang menggunakan air dan g...,hoax,HOAX
3,#tekno Lampu Dengan Sumber Energi Air Garam In...,hoax,HOAX
4,"Green House,lampu darurat dg sumber energi gar...",hoax,HOAX
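
To make the cleaning step concrete, here is a small standalone sketch that applies the same two regular expressions to a made-up tweet. The sample string and the `example.com` URL are invented for illustration; the function body is copied from the cell above.

```python
import re

def clean_text(s):
    s = str(s)
    s = re.sub(r'https?://\S+', ' ', s)   # replace URLs with a space
    s = re.sub(r'\s+', ' ', s).strip()    # collapse runs of whitespace, trim the ends
    return s

# Made-up input with a URL, a newline and repeated spaces
sample = "Lampu  air garam\nhemat energi https://example.com/abc  #tekno"
cleaned = clean_text(sample)
print(repr(cleaned))

# The notebook then keeps only rows whose cleaned text is longer than 10 characters
print(len(cleaned) > 10)
```

Hashtags and mentions are left untouched by this normalization, so they remain available as features for the TF‑IDF baseline.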


### 1.2 Train/Valid/Test split (stratified)


In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.15, random_state=SEED, stratify=df['label_bin'])
train_df, valid_df = train_test_split(train_df, test_size=0.1765, random_state=SEED, stratify=train_df['label_bin'])
# => total ~15% test, 15% valid, 70% train
for name, d in [('train', train_df), ('valid', valid_df), ('test', test_df)]:
    print(name, d.shape, d['label_bin'].value_counts(normalize=True).to_dict())


train (2626, 3) {'HOAX': 0.8092155369383092, 'NON-HOAX': 0.19078446306169078}
valid (564, 3) {'HOAX': 0.8085106382978723, 'NON-HOAX': 0.19148936170212766}
test (563, 3) {'HOAX': 0.8099467140319716, 'NON-HOAX': 0.19005328596802842}
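
The second-stage fraction `0.1765` can look arbitrary, so the arithmetic behind the 70/15/15 split is sketched below. It uses only the split sizes printed above; the `ceil` rounding of the held-out size is an assumption about how scikit-learn resolves fractional sizes.

```python
import math

n_total = 2626 + 564 + 563   # split sizes printed above
test_frac = 0.15
second_stage_frac = 0.1765   # applied to the 85% that remains after the test split

# Effective share of the full dataset that lands in each split
valid_frac = (1 - test_frac) * second_stage_frac
train_frac = 1 - test_frac - valid_frac

# Assumption: the held-out size is rounded up (ceil), the rest goes to the other side
n_test = math.ceil(test_frac * n_total)
n_rest = n_total - n_test
n_valid = math.ceil(second_stage_frac * n_rest)
n_train = n_rest - n_valid

print(f"valid share: {valid_frac:.4f}, train share: {train_frac:.4f}")
print("exact second-stage fraction:", round(test_frac / (1 - test_frac), 4))
print(n_train, n_valid, n_test)
```

In other words, `0.1765` is 0.15 / 0.85 rounded to four decimals. Under the rounding assumption the counts come out to 2626 / 564 / 563, matching the shapes printed above.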


## 2) Quick baseline: TF‑IDF + linear model
Purpose of the baseline: a reference point before bringing in Granite.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

vec = TfidfVectorizer(max_features=30000, ngram_range=(1,2))
Xtr = vec.fit_transform(train_df[text_col]); Xva = vec.transform(valid_df[text_col]); Xte = vec.transform(test_df[text_col])
ytr = (train_df['label_bin']=='HOAX').astype(int)
yva = (valid_df['label_bin']=='HOAX').astype(int)
yte = (test_df['label_bin']=='HOAX').astype(int)

clf = LogisticRegression(max_iter=3000, class_weight='balanced', n_jobs=None)
clf.fit(Xtr, ytr)
pred_va = clf.predict(Xva)
print('VALIDATION REPORT (Baseline)')
print(classification_report(yva, pred_va, target_names=['NON-HOAX','HOAX']))


VALIDATION REPORT (Baseline)
              precision    recall  f1-score   support

    NON-HOAX       0.51      0.57      0.54       108
        HOAX       0.90      0.87      0.88       456

    accuracy                           0.81       564
   macro avg       0.70      0.72      0.71       564
weighted avg       0.82      0.81      0.82       564
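
To read the report above, it helps to recompute the aggregate rows by hand. The sketch below uses only the rounded precision, recall and support values from the table, so the results can differ from scikit-learn's in the last digit; the variable names are illustrative.

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# (precision, recall, support) copied from the validation report above
report = {"NON-HOAX": (0.51, 0.57, 108), "HOAX": (0.90, 0.87, 456)}

f1s = {name: f1(p, r) for name, (p, r, _) in report.items()}
total = sum(n for _, _, n in report.values())

macro_f1 = sum(f1s.values()) / len(f1s)                                   # every class counts equally
weighted_f1 = sum(f1s[k] * n for k, (_, _, n) in report.items()) / total  # weighted by support

# Accuracy of a trivial classifier that always answers HOAX
majority_acc = report["HOAX"][2] / total

print({k: round(v, 2) for k, v in f1s.items()})
print(round(macro_f1, 2), round(weighted_f1, 2), round(majority_acc, 2))
```

The recomputed macro and weighted F1 round to 0.71 and 0.82, as in the report. An always-HOAX predictor would already reach about 0.81 accuracy on this split, so the macro average and the NON-HOAX row are the more informative numbers here.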



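## 3) IBM Granite 3.1 3B *Instruct* (4‑bit) for classification (few‑shot)
We use a **JSON-only** prompt so the output is easy to parse. Decoding is greedy (`do_sample=False`) for consistency. Note that the template in the cell below contains no worked examples, so as written the prompt is effectively zero-shot.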
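
A JSON-only prompt contains literal braces, which is why the cell below builds its prompts with `string.Template` rather than `str.format`. The following sketch shows the difference on a shortened, made-up template; `RAW` and `tweet` are illustrative names, not part of the notebook.

```python
from string import Template

# Same shape as the notebook's templates: literal JSON braces plus one $text slot
RAW = 'Output HANYA JSON valid: {"label": "..."}\nTEKS: $text'

# Template.substitute leaves the braces alone and does not re-scan the inserted value
tweet = "Diskon {50%} cuma $5 hari ini"   # made-up input containing braces and a dollar sign
prompt = Template(RAW).substitute(text=tweet)

# The naive str.format equivalent trips over the unescaped JSON braces
try:
    RAW.replace("$text", "{text}").format(text=tweet)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(prompt)
print("str.format failed:", format_failed)
```

With `str.format`, every literal brace in the template would have to be doubled (`{{` and `}}`). With `Template`, only a literal `$` in the template itself would need escaping, as `$$`.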


In [None]:
# === GRANITE DROP-IN FIX (safe to run at any time) ===
import re, json, numpy as np
from tqdm import tqdm
from string import Template

# Make sure the model/pipeline exists; load it if it does not.
try:
    gen, tok, model
except NameError:
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
    import torch
    model_id = 'ibm-granite/granite-3.1-3b-a800m-instruct'
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', quantization_config=bnb)
    # the model is already placed by device_map='auto' above, so the pipeline needs no device argument
    gen = pipeline('text-generation', model=model, tokenizer=tok)

# Avoid the pad_token warning
if tok.pad_token is None and tok.eos_token is not None:
    tok.pad_token = tok.eos_token
    if hasattr(model, "config"):
        model.config.pad_token_id = tok.eos_token_id

# ===== Prompt templates (safe: no .format, so literal JSON braces need no escaping) =====
CLS_TMPL = Template(
    "You are a precise Indonesian news verifier.\n"
    "Klasifikasikan teks berikut sebagai 'HOAX' atau 'NON-HOAX'.\n"
    "Output HANYA JSON valid tanpa teks lain: {\"label\": \"...\", \"evidence\": [\"...\", \"...\"]}.\n"
    "TEKS: $text"
)
SUM_TMPL = Template(
    "Ringkas isi teks berikut (3–5 kalimat) dalam Bahasa Indonesia, netral, tanpa opini.\n"
    "Sertakan 'key_claims' bullet pendek.\n"
    "Output HANYA JSON valid: {\"summary\": \"...\", \"key_claims\": [\"...\", \"...\"]}.\n"
    "TEKS: $text"
)

def build_cls_prompt(text:str) -> str:
    return CLS_TMPL.substitute(text=text)

def build_sum_prompt(text:str) -> str:
    return SUM_TMPL.substitute(text=text)

# ===== Robust JSON parser =====
def parse_json(s:str):
    # try small, brace-free objects first (avoids picking up the prompt/examples)
    small = re.findall(r'\{[^{}]*\}', s)
    for cand in reversed(small):
        try:
            return json.loads(cand)
        except Exception:
            continue
    # fall back to the widest {...} block
    m = re.search(r'\{.*\}', s, re.S)
    if m:
        try:
            return json.loads(m.group(0))
        except Exception:
            pass
    # nothing parseable: default object (note that parse failures are then counted as NON-HOAX)
    return {"label":"NON-HOAX","evidence":[]}

# ===== Classification (greedy decoding; no temperature) =====
def granite_predict(texts, max_new_tokens=128):
    preds = []
    for t in tqdm(texts):
        prompt = build_cls_prompt(t)
        out_full = gen(prompt, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"]
        completion = out_full[len(prompt):]  # keep only the model's answer
        obj = parse_json(completion)
        label = str(obj.get("label", "NON-HOAX")).upper().strip()
        preds.append(1 if label == "HOAX" else 0)
    return np.array(preds)

# ===== Summarization =====
def granite_summarize(texts, max_new_tokens=256):
    outs = []
    for t in texts:
        prompt = build_sum_prompt(t)
        out_full = gen(prompt, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"]
        completion = out_full[len(prompt):]
        outs.append(parse_json(completion))
    return outs
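
The behaviour of the parser above is easiest to see on a few hand-written completions. The sketch below repeats the same parsing logic in a standalone form; the three sample strings are made up and are not real model outputs.

```python
import json
import re

def parse_json(s):
    # Same logic as the notebook's parser: innermost brace-free objects first, last one wins
    for cand in reversed(re.findall(r'\{[^{}]*\}', s)):
        try:
            return json.loads(cand)
        except Exception:
            continue
    m = re.search(r'\{.*\}', s, re.S)
    if m:
        try:
            return json.loads(m.group(0))
        except Exception:
            pass
    return {"label": "NON-HOAX", "evidence": []}

# Made-up completions, not real model outputs
wrapped = 'Jawaban: {"label": "HOAX", "evidence": ["klaim tanpa sumber"]} Semoga membantu.'
no_json = 'Teks ini tampaknya hoaks.'
nested = '{"summary": "x", "meta": {"n": 1}}'

print(parse_json(wrapped))   # JSON embedded in surrounding chatter is recovered
print(parse_json(no_json))   # no JSON at all: falls back to the NON-HOAX default
print(parse_json(nested))    # nested objects: the innermost object is returned
```

Two consequences are worth keeping in mind when reading the metrics: completions without any JSON are silently scored as NON-HOAX, and a nested object yields only its innermost part. Counting how often the fallback fires would show how much of the score depends on it.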


This part takes a while to run, and the network connection has to be stable.

In [None]:
from sklearn.metrics import classification_report

sample_valid = valid_df[text_col].tolist()[:64]  # subset for a quick demo
pred_va_gra = granite_predict(sample_valid)
yva_sub = (valid_df.iloc[:len(sample_valid)]['label_bin']=='HOAX').astype(int).values
print('VALIDATION (Granite few-shot) — subset')
print(classification_report(yva_sub, pred_va_gra, target_names=['NON-HOAX','HOAX']))


## 4) Summarization (Granite)
A 3–5 sentence summary plus a list of *key_claims*, returned as JSON.


In [None]:
def sum_prompt(text):
    return (
        "Ringkas isi teks berikut (3–5 kalimat) dalam Bahasa Indonesia, netral, tanpa opini.\n"
        "Sertakan 'key_claims' bullet pendek.\n"
        "Balas JSON: {{\"summary\": \"...\", \"key_claims\": [\"...\", \"...\"]}}.\n"
        "TEKS: {text}"
    ).format(text=text)

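# Note: this overrides the greedy granite_summarize defined in the earlier cell.
def granite_summarize(texts, max_new_tokens=256, temperature=0.3):
    outs = []
    for t in texts:
        prompt = sum_prompt(t)
        # temperature is only used when sampling is enabled, so set do_sample=True explicitly
        out_full = gen(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)[0]["generated_text"]
        completion = out_full[len(prompt):]  # keep only the model's answer
        outs.append(parse_json(completion))
    return outs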

demo_texts = test_df[text_col].tolist()[:5]
sums = granite_summarize(demo_texts)
for i, obj in enumerate(sums, 1):
    print(f'[{i}] SUMMARY:\n', obj.get('summary',''))
    print('key_claims:', obj.get('key_claims', []))
    print('-'*80)


## 5) Insights & quick visualizations
- Class distribution
- Average text length per class
- Top terms per class (simple unigram counts)


In [None]:
train_df = train_df.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
train_df['len'] = train_df[text_col].str.split().apply(len)  # length in whitespace-separated tokens
ax = train_df['label_bin'].value_counts().plot(kind='bar', title='Label distribution (train)')
plt.show()
print('Average text length per label:')
print(train_df.groupby('label_bin')['len'].mean())


In [None]:
from collections import Counter
def top_terms(texts, k=20):
    toks = []
    for x in texts:
        toks.extend(re.findall(r"[\w']+", x.lower()))
    cnt = Counter(toks)
    return cnt.most_common(k)
print('Top terms HOAX:')
print(top_terms(train_df[train_df['label_bin']=='HOAX'][text_col].tolist()))
print('\nTop terms NON-HOAX:')
print(top_terms(train_df[train_df['label_bin']=='NON-HOAX'][text_col].tolist()))
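
Raw frequency counts like the ones above tend to be dominated by function words. One possible refinement, sketched here under assumptions, is to filter a stopword list before counting. `STOPWORDS` is a tiny illustrative set rather than a complete Indonesian stopword list, and `top_terms_filtered` and the toy sentences are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption; a real list would be much longer)
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "dengan", "dari", "untuk", "ke"}

def top_terms_filtered(texts, k=20, stopwords=STOPWORDS, min_len=3):
    """Count lowercase word tokens, skipping stopwords and very short tokens."""
    cnt = Counter()
    for x in texts:
        for tok in re.findall(r"[\w']+", x.lower()):
            if len(tok) >= min_len and tok not in stopwords:
                cnt[tok] += 1
    return cnt.most_common(k)

# Made-up sentences standing in for train_df[text_col]
toy = ["Lampu dengan sumber energi air garam", "Air garam dan lampu LED yang hemat"]
result = top_terms_filtered(toy, k=3)
print(result)
```

On the real data the function could be called with the same per-class text lists that are passed to `top_terms` above.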


## 6) Save result artifacts (optional)
Saves a small metadata file (seed and split sizes) and the sample summaries for documentation.


In [None]:
metrics = {
  'seed': SEED,
  'n_train': int(train_df.shape[0]),
  'n_valid': int(valid_df.shape[0]),
  'n_test': int(test_df.shape[0])
}
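# ensure_ascii=False writes non-ASCII characters as-is, so fix the file encoding explicitly
with open('metrics.json', 'w', encoding='utf-8') as f:
    json.dump(metrics, f, ensure_ascii=False, indent=2)
with open('sample_summaries.json', 'w', encoding='utf-8') as f:
    json.dump(sums, f, ensure_ascii=False, indent=2)
print('Saved: metrics.json, sample_summaries.json')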


## 7) How to submit (Colab → GitHub)
1) Open the menu **File → Save a copy in GitHub**  
2) Choose your repo (e.g. `capstone-granite-hoax`) and folder (root).  
3) Add a commit message and click **OK**.  
4) Once the notebook is in the repo, create **README.md** directly in the GitHub web UI → *Add file → Create new file*.  
5) Copy the *README template* from the bottom of this notebook.


## 8) README template (copy to GitHub)
```markdown
# Indonesian News Hoax Detection + Neutral Summarization (IBM Granite)

## Project Overview
- **Problem**: The spread of hoaxes affects public opinion.
- **Goal**: Classify `HOAX` vs `NON-HOAX` and produce neutral 3–5 sentence summaries to support moderation.
- **Scope**: Short Indonesian-language texts (tweets/text), without personal metadata.

## Raw Dataset Link
- GitHub: IDNHoaxCorpus (see the `dataset/` folder).

## Analysis Process
1. EDA & data cleaning (URL removal, simple normalization)
2. Baseline: TF‑IDF + Logistic Regression
3. IBM Granite 3.1 3B (few‑shot) for classification (JSON output)
4. IBM Granite 3.1 3B for summarization (JSON: `summary`, `key_claims`)
5. Evaluation: F1/precision/recall; ROUGE (optional) + manual inspection

## Insight & Findings (example)
- Dominant hoax themes: politics/health.
- Lexical cues: clickbait phrasing, certain modality markers.
- *Error analysis*: ambiguous or satirical cases.

## Conclusion & Recommendations
- A list of trigger keywords & initial `flagging` rules.
- Moderation workflow: (1) classification → (2) review prioritization → (3) summary for the editor.
- A monthly model/prompt maintenance plan.

## AI Support Explanation
- **Granite 3.1 3B Instruct**: open, long context (128K), supports classification & summarization.
- JSON-only prompts for stable parsing; greedy or low-temperature decoding for consistency.

## How to Reproduce (Colab)
1. Enable the GPU in Colab and run the installation cell.
2. Run all cells in order.
3. Commit the notebook via *File → Save a copy in GitHub*.
```
