# 01 — Preprocessing

Goals:
- Clean and normalize the text:
  **cleaning → lower → social/slang normalize → token → stopword → stemming**.
- Outputs:
  - `reddit_opinion_PSE_ISR_2024_window_clean.csv`
  - `reddit_opinion_PSE_ISR_2025_window_clean.csv`

Notes:
- The pipeline is identical across years (train/test consistency).
- Files are streamed in chunks (`chunksize`) to save RAM.
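
As a minimal sketch of the streaming note above (not the notebook's actual run), the snippet below reads a small in-memory CSV in chunks with `pandas.read_csv(..., chunksize=...)`. The sample rows and the chunk size of 2 are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large window file (illustrative data).
csv_text = (
    "comment_id,self_text\n"
    "a,first comment\n"
    "b,second comment\n"
    "c,third comment\n"
)

chunk_sizes = []
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [2, 1]
```

Peak memory then scales with the chunk size rather than with the full file, which is the reason for passing `CHUNKSIZE` through the pipeline.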


In [None]:
%%time
%pip install -q ftfy ekphrasis emoji scikit-learn

import os, re, hashlib, warnings
import pandas as pd
from ftfy import fix_text

# NLTK components that need no external corpus
from nltk.stem import PorterStemmer
from nltk.tokenize.toktok import ToktokTokenizer

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from emoji import demojize

warnings.filterwarnings("ignore")
print("✅ Imports ready")


Note: you may need to restart the kernel to use updated packages.
✅ Imports ready
CPU times: total: 2.5 s
Wall time: 8.23 s


## Window file configuration & stopwords


In [2]:
%%time
IN_2024 = "reddit_opinion_PSE_ISR_2024_window.csv"
IN_2025 = "reddit_opinion_PSE_ISR_2025_window.csv"

OUT_2024 = "reddit_opinion_PSE_ISR_2024_window_clean.csv"
OUT_2025 = "reddit_opinion_PSE_ISR_2025_window_clean.csv"

CHUNKSIZE    = 500_000
PREVIEW_ROWS = 5
KEEP_COLS    = ["comment_id","created_time","self_text","score","subreddit"]

# Stopwords (NLTK if available; otherwise fall back to sklearn)
try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except Exception:
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    STOPWORDS = set(ENGLISH_STOP_WORDS)
STOPWORDS |= {"im","u","ur","btw","lol","thx"}

STEMMER = PorterStemmer()
TOK     = ToktokTokenizer()

print("✅ Config ready")


✅ Config ready
CPU times: total: 15.6 ms
Wall time: 19.4 ms


## (Optional) Slang & emoji lexicons + building `text_processor`

- If no lexicon CSV is available, the pipeline runs with a small built-in fallback.
- `text_processor` (ekphrasis) handles social-text normalization: url/email/hashtag/elongated, etc.
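
For orientation, here is a rough sketch of the CSV layout the lexicon loader in this notebook expects: one key column and one value column, here `slang` and `normalized`. The file contents are hypothetical, and the sketch uses the standard-library `csv` module instead of pandas to stay dependency-free.

```python
import csv
import io

# Hypothetical contents of english_slang.csv; the column names match the
# ones the loader asks for ("slang", "normalized").
sample_csv = "slang,normalized\nIDK ,I do not know\ntbh,to be honest\n"

lexicon = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    # Strip and lowercase both sides, as the loader does.
    key = row["slang"].strip().lower()
    val = row["normalized"].strip().lower()
    if key and val:
        lexicon[key] = val

print(lexicon)  # {'idk': 'i do not know', 'tbh': 'to be honest'}
```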


In [3]:
%%time
def load_lexicon_csv(path, key_col, val_col):
    if os.path.exists(path):
        df = pd.read_csv(path)
        df = df[[key_col, val_col]].dropna()
        df[key_col] = df[key_col].astype(str).str.strip().str.lower()
        df[val_col] = df[val_col].astype(str).str.strip().str.lower()
        print(f"Loaded lexicon: {path} ({len(df):,} entries)")
        return dict(zip(df[key_col], df[val_col]))
    print(f"(Optional) Not found: {path}")
    return {}

EN_SLANG  = load_lexicon_csv("english_slang.csv", "slang", "normalized")
ID_SLANG  = load_lexicon_csv("indonesian_slang.csv", "slang", "normalized")
EMOJI_MAP = load_lexicon_csv("emoji_map.csv", "emoji", "normalized")

def sanitize_token(s: str) -> str:
    return re.sub(r"[^\w']", "", s.lower())

def sanitize_map(m: dict) -> dict:
    out = {}
    for k, v in m.items():
        sk = sanitize_token(k)
        sv = sanitize_token(v)
        if sk and sv:
            out[sk] = sv
    return out

SLANG_MAP = {}
SLANG_MAP.update(sanitize_map(EN_SLANG))
SLANG_MAP.update(sanitize_map(ID_SLANG))
if not SLANG_MAP:
    SLANG_MAP.update({
        "idk":"i do not know","imo":"in my opinion","imho":"in my humble opinion",
        "btw":"by the way","tbh":"to be honest","omg":"oh my god","wtf":"what the fuck",
        "ikr":"i know right","thx":"thanks","pls":"please","u":"you","ur":"your",
        "dont":"do not","cant":"cannot","wont":"will not","ive":"i have",
        "isnt":"is not","wasnt":"was not","arent":"are not","w":"win","l":"loss"
    })

def build_text_processor():
    return TextPreProcessor(
        normalize=['url','email','percent','money','phone','time','date'],
        annotate={'hashtag','allcaps','elongated','repeated'},
        fix_html=True,
        segmenter='twitter',
        corrector='twitter',
        unpack_hashtags=True,
        unpack_contractions=True,
        tokenizer=SocialTokenizer(lowercase=False).tokenize,
        dicts=[]
    )

text_processor = build_text_processor()
print("✅ text_processor ready | slang entries:", len(SLANG_MAP))


(Optional) Not found: english_slang.csv
(Optional) Not found: indonesian_slang.csv
(Optional) Not found: emoji_map.csv
Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...
✅ text_processor ready | slang entries: 21
CPU times: total: 6.17 s
Wall time: 6.54 s
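
The sketch below illustrates how a map like `SLANG_MAP` can be applied token by token after sanitizing each token. The two-entry map and the sample sentence are illustrative assumptions; `sanitize_token` mirrors the helper defined above.

```python
import re

# Tiny illustrative subset of a slang map.
SLANG = {"idk": "i do not know", "u": "you"}

def sanitize_token(s: str) -> str:
    # Lowercase and drop everything except word characters and apostrophes.
    return re.sub(r"[^\w']", "", s.lower())

def slang_replace(text: str) -> str:
    # Look up the sanitized form; keep the original token when there is no match.
    return " ".join(SLANG.get(sanitize_token(t), t) for t in text.split())

print(slang_replace("idk, u tell me"))  # i do not know you tell me
print(slang_replace("hello, world"))    # hello, world
```

Note that punctuation attached to a matched token (the comma after `idk`) disappears with the replacement, while unmatched tokens keep theirs.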


## Helper functions: cleaning → lower → social/slang → token → stopword → stemming


In [4]:
%%time
URL_RE       = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MD_LINK_RE   = re.compile(r"\[.*?\]\(.*?\)")
HTML_TAG_RE  = re.compile(r"<.*?>")
MENTION_RE   = re.compile(r"@\w+")
NONPRINT_RE  = re.compile(r"[\x00-\x1f\x7f]")

def is_irrelevant(text: str) -> bool:
    if not isinstance(text, str): return True
    t = text.strip()
    if not t: return True
    if t.lower() in ("[deleted]","[removed]"): return True
    t2 = URL_RE.sub(" ", t)
    t2 = MD_LINK_RE.sub(" ", t2)
    t2 = HTML_TAG_RE.sub(" ", t2)
    words = [w for w in re.split(r"\s+", t2.strip()) if w]
    if len(words) <= 2: return True
    if re.fullmatch(r"[^A-Za-z]+", (t2.strip() or "")): return True
    return False

def cleaning_step(text: str) -> str:
    if not isinstance(text, str): return ""
    x = fix_text(text)
    x = URL_RE.sub(" ", x)
    x = MD_LINK_RE.sub(" ", x)
    x = HTML_TAG_RE.sub(" ", x)
    x = MENTION_RE.sub(" ", x)
    x = NONPRINT_RE.sub(" ", x)
    x = re.sub(r"\s+", " ", x).strip()
    return x

def case_folding_step(text: str) -> str:
    return text.lower().strip()

def social_normalize_step(text: str) -> str:
    toks = text_processor.pre_process_doc(text)
    sent = " ".join(toks)
    sent = demojize(sent)
    if EMOJI_MAP:
        for emo, rep in EMOJI_MAP.items():
            sent = sent.replace(emo, rep)
    return re.sub(r"\s+", " ", sent).strip()

def slang_replace_step(text: str) -> str:
    tokens = re.split(r"\s+", text)
    out = []
    for t in tokens:
        base = sanitize_token(t)
        out.append(SLANG_MAP.get(base, t))
    return " ".join(out)

def tokenize_step(text: str) -> list:
    toks = TOK.tokenize(text)
    return [t for t in toks if t.isalpha()]

def stopword_removal_step(tokens: list) -> list:
    return [t for t in tokens if t not in STOPWORDS]

def stemming_step(tokens: list) -> list:
    return [STEMMER.stem(t) for t in tokens]

def to_final_text(tokens: list) -> str:
    return " ".join(tokens)

def month_str(dt_series: pd.Series) -> pd.Series:
    return dt_series.dt.to_period("M").astype(str)

print("✅ Helpers ready")


✅ Helpers ready
CPU times: total: 0 ns
Wall time: 986 μs
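
To make the regex-based cleaning more concrete, here is a simplified stand-alone variant that uses only `re`. The patterns are stand-ins for illustration, not the exact ones above, and the sample comment is made up. In this variant markdown links are stripped before bare URLs, because a URL pattern ending in `\S+` would otherwise consume the closing parenthesis of the link.

```python
import re

MD_LINK = re.compile(r"\[.*?\]\(.*?\)")                       # [text](target)
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)   # bare URLs
MENTION = re.compile(r"@\w+")                                 # @username

def clean(text: str) -> str:
    x = MD_LINK.sub(" ", text)   # remove whole markdown links first
    x = URL.sub(" ", x)          # then any remaining bare URLs
    x = MENTION.sub(" ", x)      # then user mentions
    return re.sub(r"\s+", " ", x).strip()  # collapse whitespace

sample = "See [this report](https://example.com/a) and www.example.org, thanks @mod!"
print(clean(sample))  # See and thanks !
```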


## Small preview: first 5 relevant rows (2024 window)


In [5]:
%%time
def preview_steps(file_path, n=5):
    prev = None
    for chunk in pd.read_csv(file_path, chunksize=200_000, low_memory=False):
        if "self_text" not in chunk.columns:
            raise ValueError("Column 'self_text' not found.")
        sub = chunk[~chunk["self_text"].apply(is_irrelevant)].copy()
        if len(sub) >= n:
            prev = sub.head(n).copy()
            break
    if prev is None:
        print("No relevant rows available for preview.")
        return

    tmp = prev.copy()
    tmp["after"] = tmp["self_text"].apply(cleaning_step)
    display(tmp.loc[:, ["self_text","after"]].rename(columns={"self_text":"before"}))

    tmp = tmp.rename(columns={"after":"cleaned"})
    tmp["after"] = tmp["cleaned"].apply(case_folding_step)
    display(tmp.loc[:, ["cleaned","after"]].rename(columns={"cleaned":"before"}))

    tmp = tmp.rename(columns={"after":"lowered"})
    tmp["soc"] = tmp["lowered"].apply(social_normalize_step)
    tmp["after"] = tmp["soc"].apply(slang_replace_step)
    display(tmp.loc[:, ["lowered","after"]].rename(columns={"lowered":"before"}))

    tmp = tmp.rename(columns={"after":"slang_replaced"})
    tmp["after"] = tmp["slang_replaced"].apply(tokenize_step)
    display(tmp.loc[:, ["slang_replaced","after"]].rename(columns={"slang_replaced":"before"}))

    tmp = tmp.rename(columns={"after":"tokens"})
    tmp["after"] = tmp["tokens"].apply(stopword_removal_step)
    display(tmp.loc[:, ["tokens","after"]].rename(columns={"tokens":"before"}))

    tmp = tmp.rename(columns={"after":"no_stop"})
    tmp["after"] = tmp["no_stop"].apply(stemming_step)
    display(tmp.loc[:, ["no_stop","after"]].rename(columns={"no_stop":"before"}))

    tmp = tmp.rename(columns={"after":"stemmed"})
    tmp["final_text"] = tmp["stemmed"].apply(to_final_text)
    display(tmp.loc[:, ["slang_replaced","final_text"]])

print("### PREVIEW 2024 WINDOW ###")
if os.path.exists(IN_2024):
    preview_steps(IN_2024, PREVIEW_ROWS)
else:
    print("File not found:", IN_2024)


### PREVIEW 2024 WINDOW ###


Unnamed: 0,before,after
0,doesn't the PM have parliamentary immunity whi...,doesn't the PM have parliamentary immunity whi...
1,I have read the history of the Levant. And the...,I have read the history of the Levant. And the...
2,WAS being the operative word and he made no se...,WAS being the operative word and he made no se...
3,Obviously there's multiple reasons why we don'...,Obviously there's multiple reasons why we don'...
4,Somehow I think Lockmart is fine with this. It...,Somehow I think Lockmart is fine with this. It...


Unnamed: 0,before,after
0,doesn't the PM have parliamentary immunity whi...,doesn't the pm have parliamentary immunity whi...
1,I have read the history of the Levant. And the...,i have read the history of the levant. and the...
2,WAS being the operative word and he made no se...,was being the operative word and he made no se...
3,Obviously there's multiple reasons why we don'...,obviously there's multiple reasons why we don'...
4,Somehow I think Lockmart is fine with this. It...,somehow i think lockmart is fine with this. it...


Unnamed: 0,before,after
0,doesn't the pm have parliamentary immunity whi...,does not the pm have parliamentary immunity wh...
1,i have read the history of the levant. and the...,i have read the history of the levant . and th...
2,was being the operative word and he made no se...,was being the operative word and he made no se...
3,obviously there's multiple reasons why we don'...,obviously there ' s multiple reasons why we do...
4,somehow i think lockmart is fine with this. it...,somehow i think lockmart is fine with this . i...


Unnamed: 0,before,after
0,does not the pm have parliamentary immunity wh...,"[does, not, the, pm, have, parliamentary, immu..."
1,i have read the history of the levant . and th...,"[i, have, read, the, history, of, the, levant,..."
2,was being the operative word and he made no se...,"[was, being, the, operative, word, and, he, ma..."
3,obviously there ' s multiple reasons why we do...,"[obviously, there, s, multiple, reasons, why, ..."
4,somehow i think lockmart is fine with this . i...,"[somehow, i, think, lockmart, is, fine, with, ..."


Unnamed: 0,before,after
0,"[does, not, the, pm, have, parliamentary, immu...","[pm, parliamentary, immunity, office]"
1,"[i, have, read, the, history, of, the, levant,...","[read, history, levant]"
2,"[was, being, the, operative, word, and, he, ma...","[operative, word, made, secret, saved, civilia..."
3,"[obviously, there, s, multiple, reasons, why, ...","[obviously, multiple, reasons, bomb, nk, respe..."
4,"[somehow, i, think, lockmart, is, fine, with, ...","[somehow, think, lockmart, fine, make, new, pl..."


Unnamed: 0,before,after
0,"[pm, parliamentary, immunity, office]","[pm, parliamentari, immun, offic]"
1,"[read, history, levant]","[read, histori, levant]"
2,"[operative, word, made, secret, saved, civilia...","[oper, word, made, secret, save, civilian, live]"
3,"[obviously, multiple, reasons, bomb, nk, respe...","[obvious, multipl, reason, bomb, nk, respect, ..."
4,"[somehow, think, lockmart, fine, make, new, pl...","[somehow, think, lockmart, fine, make, new, pl..."


Unnamed: 0,slang_replaced,final_text
0,does not the pm have parliamentary immunity wh...,pm parliamentari immun offic
1,i have read the history of the levant . and th...,read histori levant
2,was being the operative word and he made no se...,oper word made secret save civilian live
3,obviously there ' s multiple reasons why we do...,obvious multipl reason bomb nk respect armisti...
4,somehow i think lockmart is fine with this . i...,somehow think lockmart fine make new plane


CPU times: total: 5.8 s
Wall time: 6.11 s
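
The preview shows a stray `s` token in row 3 (from `there ' s`) surviving tokenization and only disappearing at the stopword step. A small sketch of the alphabetic filter explains why. It uses plain `str.split` in place of the Toktok tokenizer, which is an assumption made to keep the example dependency-free, and the sample string is made up.

```python
def keep_alpha(tokens):
    # Same filter idea as tokenize_step: keep purely alphabetic tokens.
    return [t for t in tokens if t.isalpha()]

tokens = "there ' s 3 reasons covid19 why".split()
print(keep_alpha(tokens))  # ['there', 's', 'reasons', 'why']
```

Digits and mixed alphanumeric tokens such as `covid19` are dropped entirely by this filter, which may matter for downstream features.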


## Main function: streaming preprocessing per file (window)


In [6]:
%%time
def preprocess_file(input_path, output_path, chunksize=500_000):
    if os.path.exists(output_path):
        os.remove(output_path)

    wrote_header = False
    total_in = total_out = 0

    for chunk in pd.read_csv(input_path, chunksize=chunksize, low_memory=False, parse_dates=["created_time"]):
        total_in += len(chunk)
        if "self_text" not in chunk.columns:
            raise ValueError("Column 'self_text' not found.")

        # drop irrelevant rows
        chunk = chunk[~chunk["self_text"].apply(is_irrelevant)].copy()
        if chunk.empty:
            continue

        # text pipeline
        chunk["cleaned"] = chunk["self_text"].apply(cleaning_step)
        chunk["lowered"] = chunk["cleaned"].apply(case_folding_step)
        chunk["soc"] = chunk["lowered"].apply(social_normalize_step)
        chunk["slang_replaced"] = chunk["soc"].apply(slang_replace_step)
        chunk["tokens"] = chunk["slang_replaced"].apply(tokenize_step)
        chunk["no_stop"] = chunk["tokens"].apply(stopword_removal_step)
        chunk["stemmed"] = chunk["no_stop"].apply(stemming_step)
        chunk["final_text"] = chunk["stemmed"].apply(to_final_text)

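        # drop empty rows & keep only rows with a valid date
        chunk = chunk[chunk["final_text"].str.strip() != ""].copy()
        # coerce unparseable dates to NaT so the .dt accessor below cannot fail
        chunk["created_time"] = pd.to_datetime(chunk["created_time"], errors="coerce")
        chunk = chunk[chunk["created_time"].notna()].copy()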
        if chunk.empty:
            continue

        # month column
        chunk["month"] = month_str(chunk["created_time"])

        # make sure the columns to keep exist
        for c in KEEP_COLS:
            if c not in chunk.columns:
                chunk[c] = None

        out = chunk[KEEP_COLS + ["final_text","month"]].copy()
        out.to_csv(output_path, mode="a", index=False, header=not wrote_header)
        wrote_header = True
        total_out += len(out)

        print(f"Chunk → wrote {len(out):,} (total {total_out:,})")

    print(f"\n✅ DONE preprocess: {input_path}")
    print(f"  Total input: {total_in:,}")
    print(f"  Total kept : {total_out:,}")
    print(f"  Output     : {output_path}")


CPU times: total: 0 ns
Wall time: 9.06 μs
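
The append-with-header-once pattern used in `preprocess_file` can be tried in isolation. The sketch below is a minimal illustration with made-up frames and a temporary output path (both hypothetical), not a replacement for the function above.

```python
import os
import tempfile
import pandas as pd

# Two small frames standing in for processed chunks (illustrative data).
chunks = [
    pd.DataFrame({"comment_id": ["a", "b"], "final_text": ["read histori", "pm offic"]}),
    pd.DataFrame({"comment_id": ["c"], "final_text": ["make new plane"]}),
]

out_path = os.path.join(tempfile.mkdtemp(), "demo_clean.csv")  # hypothetical output file
wrote_header = False
for chunk in chunks:
    # Append each chunk; emit the header row only on the first write.
    chunk.to_csv(out_path, mode="a", index=False, header=not wrote_header)
    wrote_header = True

result = pd.read_csv(out_path)
print(len(result), list(result.columns))  # 3 ['comment_id', 'final_text']
```

Because the file is opened in append mode, removing any stale output before the loop (as `preprocess_file` does) avoids mixing rows from an earlier run.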


## Run preprocessing for the 2024 window


In [7]:
%%time
if os.path.exists(IN_2024):
    preprocess_file(IN_2024, OUT_2024, CHUNKSIZE)
else:
    print("2024 window file not found:", IN_2024)


Chunk → wrote 477,577 (total 477,577)
Chunk → wrote 81,267 (total 558,844)

✅ DONE preprocess: reddit_opinion_PSE_ISR_2024_window.csv
  Total input: 585,053
  Total kept : 558,844
  Output     : reddit_opinion_PSE_ISR_2024_window_clean.csv
CPU times: total: 12min 17s
Wall time: 12min 39s


## Run preprocessing for the 2025 window


In [8]:
%%time
if os.path.exists(IN_2025):
    preprocess_file(IN_2025, OUT_2025, CHUNKSIZE)
else:
    print("2025 window file not found:", IN_2025)


Chunk → wrote 294,411 (total 294,411)

✅ DONE preprocess: reddit_opinion_PSE_ISR_2025_window.csv
  Total input: 307,018
  Total kept : 294,411
  Output     : reddit_opinion_PSE_ISR_2025_window_clean.csv
CPU times: total: 7min 5s
Wall time: 7min 18s
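
As a quick arithmetic check on the two runs, the reported input and kept totals can be turned into retention rates. The numbers below are copied from the logs above; the variable names are illustrative.

```python
# (total input rows, total kept rows) as printed by the two runs above.
counts = {
    "2024": (585_053, 558_844),
    "2025": (307_018, 294_411),
}

summary = {}
for year, (total_in, total_kept) in counts.items():
    dropped = total_in - total_kept
    kept_pct = round(100 * total_kept / total_in, 2)
    summary[year] = (dropped, kept_pct)
    print(f"{year}: dropped {dropped:,} rows, kept {kept_pct}%")

# 2024: dropped 26,209 rows, kept 95.52%
# 2025: dropped 12,607 rows, kept 95.89%
```

The two windows retain a very similar share of rows, which is consistent with the same filtering being applied to both years.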


## Quick verification: size & top 5 rows


In [9]:
%%time
for out in [OUT_2024, OUT_2025]:
    if os.path.exists(out):
        sz = os.path.getsize(out)/(1024**2)
        print(f"{out}: {sz:.2f} MB")
        display(pd.read_csv(out, nrows=5))
    else:
        print("Not found:", out)


reddit_opinion_PSE_ISR_2024_window_clean.csv: 237.57 MB


Unnamed: 0,comment_id,created_time,self_text,score,subreddit,final_text,month
0,m1g0pi5,2024-12-10 23:59:55,doesn't the PM have parliamentary immunity whi...,1,worldnews,pm parliamentari immun offic,2024-12
1,m1g0okb,2024-12-10 23:59:45,I have read the history of the Levant. And the...,-25,worldnews,read histori levant,2024-12
2,m1g0ok1,2024-12-10 23:59:45,WAS being the operative word and he made no se...,1,IsraelPalestine,oper word made secret save civilian live,2024-12
3,m1g0m2l,2024-12-10 23:59:21,Obviously there's multiple reasons why we don'...,-8,CombatFootage,obvious multipl reason bomb nk respect armisti...,2024-12
4,m1g0l5s,2024-12-10 23:59:12,Somehow I think Lockmart is fine with this. It...,7,NonCredibleDefense,somehow think lockmart fine make new plane,2024-12


reddit_opinion_PSE_ISR_2025_window_clean.csv: 136.80 MB


Unnamed: 0,comment_id,created_time,self_text,score,subreddit,final_text,month
0,ndjojnm,2025-09-10 23:59:37,That’s all old stuff. It has been already disc...,3,IsraelPalestine,old stuff alreadi discuss ad nauseam two year ...,2025-09
1,ndjohss,2025-09-10 23:59:19,May be stop sending Iranian drones. ? \nThere ...,310,worldnews,may stop send iranian drone reason israel atta...,2025-09
2,ndjo93p,2025-09-10 23:57:58,Google was founded by Page and Brin. Larry Pag...,1,IsraelPalestine,googl found page brin larri page michigan serg...,2025-09
3,ndjo760,2025-09-10 23:57:39,The difference is that it appears the Qatari k...,24,worldnews,differ appear qatari knew allow proceed canadi...,2025-09
4,ndjo435,2025-09-10 23:57:11,Take a look at the video on instagram. The da...,12,worldnews,take look video instagram damag small video cl...,2025-09


CPU times: total: 15.6 ms
Wall time: 72.2 ms
