# 02: Labeling (TextBlob), Window of Days 1–10 per Month (2024)

Goals:
- Assign a **sentiment label** (`Positif`/`Netral`/`Negatif`, i.e. positive/neutral/negative) using **TextBlob (PatternAnalyzer)**.
- Read the preprocessed *window* file (days 1–10 of each month).
- Write the results to a new CSV that is ready for training.

Notes:
- If `TextBlob` (PatternAnalyzer) is not available in the environment, the first code cell installs it.
- Default thresholds: polarity `≤ -0.05` maps to `Negatif`, polarity `≥ +0.05` maps to `Positif`, and anything in between is `Netral`. Adjust as needed.


In [1]:
%%time
# Install & imports
%pip install -q textblob

import os, json, warnings
import numpy as np
import pandas as pd

from textblob import TextBlob

warnings.filterwarnings("ignore")
np.random.seed(42)
print("✅ Imports ready (TextBlob)")


Note: you may need to restart the kernel to use updated packages.
✅ Imports ready (TextBlob)
CPU times: total: 1.55 s
Wall time: 5.49 s


## Configuration & Column Validation

- `INPUT_FILE`: the preprocessed *window* file (days 1–10) for 2024.
- `OUT_LABELED`: name of the labeled output file.
- `POS_THR`, `NEG_THR`: polarity thresholds used to map scores to labels.
- `REQUIRED_COLS`: columns that must be present.


In [2]:
%%time
# ====== CONFIG ======
INPUT_FILE  = "reddit_opinion_PSE_ISR_2024_window_clean.csv"
OUT_LABELED = "reddit_opinion_PSE_ISR_2024_window_labeled_textblob.csv"

POS_THR =  0.05   # raise to 0.10 for a wider neutral band
NEG_THR = -0.05   # lower to -0.10 for a stricter negative class

REQUIRED_COLS = ["comment_id","created_time","self_text","score","subreddit","final_text","month"]

# Safety: delete any old output so the append below does not duplicate rows
if os.path.exists(OUT_LABELED):
    os.remove(OUT_LABELED)

assert os.path.exists(INPUT_FILE), f"Not found: {INPUT_FILE}"
print("✅ Config ready")


✅ Config ready
CPU times: total: 0 ns
Wall time: 347 μs
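
The later cells add any missing required column as an empty one instead of stopping. If a strict check is preferred, a minimal sketch could read only the CSV header and report what is absent. The helper name `missing_columns` and the in-memory sample are hypothetical and not part of the notebook.

```python
import io
import pandas as pd

REQUIRED_COLS = ["comment_id", "created_time", "self_text", "score",
                 "subreddit", "final_text", "month"]

def missing_columns(source, required=REQUIRED_COLS):
    """Return the required columns that are absent from the CSV header.

    `source` may be a path or a file-like object; nrows=0 parses the header only.
    """
    header = pd.read_csv(source, nrows=0).columns
    return [col for col in required if col not in header]

# Tiny in-memory CSV standing in for the real input file (illustrative only).
sample = io.StringIO("comment_id,final_text,month\nabc,some text,2024-01\n")
print(missing_columns(sample))
```

In the notebook this could be called as `missing_columns(INPUT_FILE)` and followed by an `assert` on an empty result, so that a malformed input fails early rather than producing empty columns.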


## Labeling Functions

- `polarity_score(text)`: returns the **TextBlob polarity** (−1 to +1).
- `map_label(p)`: converts a score to **Positif / Netral / Negatif** based on the thresholds.
- Robust to `NaN` and empty text: both are scored as 0.0.


In [3]:
%%time
def polarity_score(text: str) -> float:
    if not isinstance(text, str) or not text.strip():
        return 0.0
    try:
        return float(TextBlob(text).sentiment.polarity)
    except Exception:
        # safe fallback: treat text that cannot be scored as neutral
        return 0.0

def map_label(p: float, pos_thr=POS_THR, neg_thr=NEG_THR) -> str:
    if p >= pos_thr:
        return "Positif"
    if p <= neg_thr:
        return "Negatif"
    return "Netral"

# quick test
for s in ["i love this", "this is okay", "i hate this"]:
    sc = polarity_score(s)
    print(f"{s!r} → {sc:.3f} → {map_label(sc)}")


'i love this' → 0.500 → Positif
'this is okay' → 0.500 → Positif
'i hate this' → -0.800 → Negatif
CPU times: total: 46.9 ms
Wall time: 57 ms
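
Both thresholds are inclusive, so a score exactly at `±0.05` already leaves the neutral band. The sketch below mirrors the rule in `map_label` under a hypothetical name, `label_for`, and compares the default thresholds with the wider `±0.10` band suggested in the config comments.

```python
def label_for(p, pos_thr=0.05, neg_thr=-0.05):
    """Same rule as map_label: both thresholds are inclusive."""
    if p >= pos_thr:
        return "Positif"
    if p <= neg_thr:
        return "Negatif"
    return "Netral"

# Scores at and just inside the default boundaries.
for p in (0.05, 0.049, 0.0, -0.049, -0.05):
    default = label_for(p)
    wider = label_for(p, pos_thr=0.10, neg_thr=-0.10)
    print(f"{p:+.3f} -> default: {default:8s} wider band: {wider}")
```

Note that a score of 0.500, as in the quick test for `'this is okay'`, stays `Positif` under either setting; widening the band only affects scores close to zero.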


## Preview (5 rows): Before/After

Shows a small sample to confirm that the scores and labels look reasonable.


In [4]:
%%time
PREVIEW_N = 5
prev = pd.read_csv(INPUT_FILE, nrows=PREVIEW_N)
for col in REQUIRED_COLS:
    if col not in prev.columns:
        prev[col] = None

prev["polarity"] = prev["final_text"].fillna("").astype(str).apply(polarity_score)
prev["label"]    = prev["polarity"].apply(map_label)
display(prev[["comment_id","month","self_text","final_text","polarity","label"]])


| | comment_id | month | self_text | final_text | polarity | label |
|---|---|---|---|---|---|---|
| 0 | m1g0pi5 | 2024-12 | doesn't the PM have parliamentary immunity whi... | pm parliamentari immun offic | 0.0 | Netral |
| 1 | m1g0okb | 2024-12 | I have read the history of the Levant. And the... | read histori levant | 0.0 | Netral |
| 2 | m1g0ok1 | 2024-12 | WAS being the operative word and he made no se... | oper word made secret save civilian live | -0.131818 | Negatif |
| 3 | m1g0m2l | 2024-12 | Obviously there's multiple reasons why we don'... | obvious multipl reason bomb nk respect armisti... | -0.3 | Negatif |
| 4 | m1g0l5s | 2024-12 | Somehow I think Lockmart is fine with this. It... | somehow think lockmart fine make new plane | 0.276515 | Positif |


CPU times: total: 15.6 ms
Wall time: 47 ms


## Main Processing (Streaming)

- Read the file in **chunks** to save RAM.
- Compute `polarity` on **final_text**, then derive `label`.
- Write the key columns plus the label to the output CSV (append mode, header written once).


In [5]:
%%time
CHUNKSIZE = 500_000  # lower this if RAM is limited

wrote_header = False
total_in = total_out = 0

for chunk in pd.read_csv(INPUT_FILE, chunksize=CHUNKSIZE, low_memory=False):
    total_in += len(chunk)
    # make sure the required columns exist
    for col in REQUIRED_COLS:
        if col not in chunk.columns:
            chunk[col] = None

    # compute polarity & label (fill NaN before casting so missing text stays empty)
    text = chunk["final_text"].fillna("").astype(str)
    chunk["polarity"] = text.apply(polarity_score)
    chunk["label"]    = chunk["polarity"].apply(map_label)

    # select output columns
    out_cols = ["comment_id","created_time","month","subreddit","score","self_text","final_text","polarity","label"]
    out = chunk[out_cols].copy()

    out.to_csv(OUT_LABELED, index=False, mode="a", header=not wrote_header)
    wrote_header = True
    total_out += len(out)
    print(f"Chunk labeled → {len(out):,} rows (total {total_out:,})")

print("\n✅ DONE labeling (TextBlob)")
print(f"Total input : {total_in:,}")
print(f"Total wrote : {total_out:,}")
print(f"Output file : {OUT_LABELED}")


Chunk labeled → 500,000 rows (total 500,000)
Chunk labeled → 58,844 rows (total 558,844)

✅ DONE labeling (TextBlob)
Total input : 558,844
Total wrote : 558,844
Output file : reddit_opinion_PSE_ISR_2024_window_labeled_textblob.csv
CPU times: total: 1min 54s
Wall time: 1min 57s
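
Almost all of the wall time above is spent scoring text row by row. If the input contains repeated `final_text` values, one possible speed-up is to score each distinct text once and map the result back. This is a sketch under that assumption; the helper `score_unique` and the call-counting `toy_scorer` are hypothetical stand-ins, and the actual gain depends on how many duplicates the data contains.

```python
import pandas as pd

def score_unique(texts: pd.Series, scorer) -> pd.Series:
    """Score each distinct text once, then broadcast the result to every row."""
    cleaned = texts.fillna("").astype(str)
    lookup = {t: scorer(t) for t in cleaned.unique()}
    return cleaned.map(lookup)

# Stand-in scorer that records its calls; the notebook would pass polarity_score.
calls = []
def toy_scorer(text):
    calls.append(text)
    return 1.0 if "good" in text else 0.0

demo = pd.Series(["good point", "good point", None, "meh", "meh"])
scores = score_unique(demo, toy_scorer)
print(scores.tolist())
print("scorer calls:", len(calls), "for", len(demo), "rows")
```

Inside the chunk loop this would replace `text.apply(polarity_score)` with `score_unique(chunk["final_text"], polarity_score)`.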


## Quick Summary

- Overall label distribution.
- Mean `polarity` per label (quality check).


In [6]:
%%time
df = pd.read_csv(OUT_LABELED, usecols=["label","polarity"], low_memory=False)
print("Label distribution:")
print(df["label"].value_counts())

print("\nMean polarity per label:")
print(df.groupby("label")["polarity"].mean().round(4))


Label distribution:
label
Netral     259871
Positif    174951
Negatif    124022
Name: count, dtype: int64

Mean polarity per label:
label
Negatif   -0.2761
Netral     0.0005
Positif    0.2723
Name: polarity, dtype: float64
CPU times: total: 2.72 s
Wall time: 3.08 s
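
As a quick consistency check, the three label counts should add up to the 558,844 rows written by the main loop. The sketch below copies the counts from the output above into a small Series and also derives each label's share; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the label distribution printed by the summary cell.
counts = pd.Series({"Netral": 259871, "Positif": 174951, "Negatif": 124022})

total = int(counts.sum())
shares = (counts / total).round(4)

print("total rows:", total)
print(shares)
```

On the real file, `df["label"].value_counts(normalize=True)` would give the same shares directly.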


## (Optional) Monthly Summary

For an initial evaluation before training: label proportions per month.


In [7]:
%%time
dfm = pd.read_csv(OUT_LABELED, usecols=["month","label"], low_memory=False)
tbl = (dfm.groupby(["month","label"]).size()
       .unstack(fill_value=0)
       .reindex(columns=["Positif","Netral","Negatif"], fill_value=0)
       .sort_index())
tbl["Total"] = tbl.sum(axis=1)
for c in ["Positif","Netral","Negatif"]:
    tbl[c+"_pct"] = (tbl[c]/tbl["Total"]).round(4)

display(tbl.head(12))


| month | Positif | Netral | Negatif | Total | Positif_pct | Netral_pct | Negatif_pct |
|---|---|---|---|---|---|---|---|
| 2024-01 | 20084 | 29888 | 14379 | 64351 | 0.3121 | 0.4645 | 0.2234 |
| 2024-02 | 14382 | 20932 | 10171 | 45485 | 0.3162 | 0.4602 | 0.2236 |
| 2024-03 | 16351 | 23800 | 11952 | 52103 | 0.3138 | 0.4568 | 0.2294 |
| 2024-04 | 17271 | 26746 | 12686 | 56703 | 0.3046 | 0.4717 | 0.2237 |
| 2024-05 | 18955 | 28610 | 13601 | 61166 | 0.3099 | 0.4677 | 0.2224 |
| 2024-06 | 14228 | 20668 | 9782 | 44678 | 0.3185 | 0.4626 | 0.2189 |
| 2024-07 | 9597 | 13763 | 6726 | 30086 | 0.319 | 0.4575 | 0.2236 |
| 2024-08 | 13459 | 20068 | 9565 | 43092 | 0.3123 | 0.4657 | 0.222 |
| 2024-09 | 12538 | 18162 | 8760 | 39460 | 0.3177 | 0.4603 | 0.222 |
| 2024-10 | 17757 | 26612 | 12410 | 56779 | 0.3127 | 0.4687 | 0.2186 |


CPU times: total: 2.19 s
Wall time: 2.33 s
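
The per-month shares can also be computed in one step with `pd.crosstab` and `normalize="index"`, which divides each row by its own total. This is a compact alternative sketch on a made-up five-row frame, not the notebook's data.

```python
import pandas as pd

# Made-up rows standing in for the (month, label) columns of the labeled CSV.
demo = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "label": ["Positif", "Netral", "Netral", "Negatif", "Netral"],
})

# normalize="index" turns counts into row-wise proportions (each month sums to 1).
share = (pd.crosstab(demo["month"], demo["label"], normalize="index")
           .reindex(columns=["Positif", "Netral", "Negatif"], fill_value=0.0)
           .round(4))
print(share)
```

The explicit `groupby`/`unstack` version in the cell above keeps the raw counts and the `Total` column, which `crosstab` with normalization does not.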


# 02b: Labeling (TextBlob), Window of Days 1–10 per Month (2025)

Goals:
- Label the **2025 window** (for later evaluation, trend analysis, and inference checks).
- Use the same process as for 2024 so the results stay consistent.
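
Because the 2025 pass repeats the 2024 steps, the streaming loop could be wrapped in one function and called once per year. The sketch below shows this under stated assumptions: `label_file`, `toy_scorer`, `toy_labeler`, and the shortened `OUT_COLS` are hypothetical, and the demo runs on a three-row temporary file rather than the real data.

```python
import os
import tempfile
import pandas as pd

OUT_COLS = ["comment_id", "final_text", "polarity", "label"]  # shortened for the demo

def label_file(input_path, output_path, scorer, labeler, chunksize=500_000):
    """Stream input_path in chunks, add polarity/label, append to output_path."""
    if os.path.exists(output_path):
        os.remove(output_path)  # avoid appending to a stale file
    total = 0
    for i, chunk in enumerate(pd.read_csv(input_path, chunksize=chunksize)):
        text = chunk["final_text"].fillna("").astype(str)
        chunk["polarity"] = text.apply(scorer)
        chunk["label"] = chunk["polarity"].apply(labeler)
        # write the header only with the first chunk
        chunk[OUT_COLS].to_csv(output_path, index=False, mode="a", header=(i == 0))
        total += len(chunk)
    return total

# Stand-ins for polarity_score and map_label so the demo needs no TextBlob.
def toy_scorer(text):
    return 0.5 if "good" in text else 0.0

def toy_labeler(p):
    return "Positif" if p >= 0.05 else "Netral"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    pd.DataFrame({"comment_id": ["a", "b", "c"],
                  "final_text": ["good", None, "bad"]}).to_csv(src, index=False)
    n_rows = label_file(src, dst, toy_scorer, toy_labeler, chunksize=2)
    result = pd.read_csv(dst)

print(n_rows, result["label"].tolist())
```

In the notebook the call would look like `label_file(INPUT_2025, OUT_2025_LB, polarity_score, map_label)`, with `OUT_COLS` extended to the full output column list.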


In [8]:
%%time
# ====== CONFIG 2025 ======
INPUT_2025  = "reddit_opinion_PSE_ISR_2025_window_clean.csv"
OUT_2025_LB = "reddit_opinion_PSE_ISR_2025_window_labeled_textblob.csv"

# Safety: delete any old output
if os.path.exists(OUT_2025_LB):
    os.remove(OUT_2025_LB)

assert os.path.exists(INPUT_2025), f"Not found: {INPUT_2025}"
print("✅ Config 2025 ready")


✅ Config 2025 ready
CPU times: total: 0 ns
Wall time: 178 μs


## Preview 2025 (5 rows)


In [9]:
%%time
PREVIEW_N = 5
prev = pd.read_csv(INPUT_2025, nrows=PREVIEW_N)
for col in REQUIRED_COLS:
    if col not in prev.columns:
        prev[col] = None

prev["polarity"] = prev["final_text"].fillna("").astype(str).apply(polarity_score)
prev["label"]    = prev["polarity"].apply(map_label)
display(prev[["comment_id","month","self_text","final_text","polarity","label"]])


| | comment_id | month | self_text | final_text | polarity | label |
|---|---|---|---|---|---|---|
| 0 | ndjojnm | 2025-09 | That’s all old stuff. It has been already disc... | old stuff alreadi discuss ad nauseam two year ... | 0.025 | Netral |
| 1 | ndjohss | 2025-09 | May be stop sending Iranian drones. ? \nThere ... | may stop send iranian drone reason israel atta... | 0.0 | Netral |
| 2 | ndjo93p | 2025-09 | Google was founded by Page and Brin. Larry Pag... | googl found page brin larri page michigan serg... | 0.2 | Positif |
| 3 | ndjo760 | 2025-09 | The difference is that it appears the Qatari k... | differ appear qatari knew allow proceed canadi... | 0.033333 | Netral |
| 4 | ndjo435 | 2025-09 | Take a look at the video on instagram. The da... | take look video instagram damag small video cl... | -0.025 | Netral |


CPU times: total: 31.2 ms
Wall time: 16.6 ms


## Main Processing 2025 (Streaming)


In [10]:
%%time
CHUNKSIZE = 500_000

wrote_header = False
total_in = total_out = 0

for chunk in pd.read_csv(INPUT_2025, chunksize=CHUNKSIZE, low_memory=False):
    total_in += len(chunk)
    for col in REQUIRED_COLS:
        if col not in chunk.columns:
            chunk[col] = None

    text = chunk["final_text"].fillna("").astype(str)
    chunk["polarity"] = text.apply(polarity_score)
    chunk["label"]    = chunk["polarity"].apply(map_label)

    out_cols = ["comment_id","created_time","month","subreddit","score","self_text","final_text","polarity","label"]
    out = chunk[out_cols].copy()

    out.to_csv(OUT_2025_LB, index=False, mode="a", header=not wrote_header)
    wrote_header = True
    total_out += len(out)
    print(f"Chunk labeled (2025) → {len(out):,} rows (total {total_out:,})")

print("\n✅ DONE labeling 2025 (TextBlob)")
print(f"Total input : {total_in:,}")
print(f"Total wrote : {total_out:,}")
print(f"Output file : {OUT_2025_LB}")


Chunk labeled (2025) → 294,411 rows (total 294,411)

✅ DONE labeling 2025 (TextBlob)
Total input : 294,411
Total wrote : 294,411
Output file : reddit_opinion_PSE_ISR_2025_window_labeled_textblob.csv
CPU times: total: 1min 2s
Wall time: 1min 4s


## Monthly Summary 2025 (optional)


In [11]:
%%time
dfm = pd.read_csv(OUT_2025_LB, usecols=["month","label"], low_memory=False)
tbl = (dfm.groupby(["month","label"]).size()
       .unstack(fill_value=0)
       .reindex(columns=["Positif","Netral","Negatif"], fill_value=0)
       .sort_index())
tbl["Total"] = tbl.sum(axis=1)
for c in ["Positif","Netral","Negatif"]:
    tbl[c+"_pct"] = (tbl[c]/tbl["Total"]).round(4)

display(tbl.head(12))


| month | Positif | Netral | Negatif | Total | Positif_pct | Netral_pct | Negatif_pct |
|---|---|---|---|---|---|---|---|
| 2025-01 | 7680 | 11301 | 5069 | 24050 | 0.3193 | 0.4699 | 0.2108 |
| 2025-02 | 4387 | 6790 | 2802 | 13979 | 0.3138 | 0.4857 | 0.2004 |
| 2025-03 | 7172 | 10526 | 4724 | 22422 | 0.3199 | 0.4694 | 0.2107 |
| 2025-04 | 10000 | 14540 | 7130 | 31670 | 0.3158 | 0.4591 | 0.2251 |
| 2025-05 | 10818 | 15521 | 7235 | 33574 | 0.3222 | 0.4623 | 0.2155 |
| 2025-06 | 14338 | 19875 | 9706 | 43919 | 0.3265 | 0.4525 | 0.221 |
| 2025-07 | 11389 | 16991 | 8486 | 36866 | 0.3089 | 0.4609 | 0.2302 |
| 2025-08 | 14400 | 20692 | 10068 | 45160 | 0.3189 | 0.4582 | 0.2229 |
| 2025-09 | 13479 | 20116 | 9176 | 42771 | 0.3151 | 0.4703 | 0.2145 |

CPU times: total: 1.36 s
Wall time: 1.44 s
