# 00: Filter Window (Days 1–10 of Each Month)

Goal:
- From the large raw CSV, keep only comments dated on days **1–10** of each month.
- Outputs:
  - `reddit_opinion_PSE_ISR_2024_window.csv`
  - `reddit_opinion_PSE_ISR_2025_window.csv`

Notes:
- Make sure the `created_time` column exists and can be parsed as datetime.
- Process the file as a stream (`chunksize`) to keep RAM usage low.
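
The first note can be checked cheaply before a full run. The sketch below is a minimal illustration, not part of the original notebook: `check_created_time` is a hypothetical helper, and the toy CSV stands in for the real raw file. It reads only a small sample and reports the share of `created_time` values that parse.

```python
import io
import pandas as pd

def check_created_time(source, sample_rows=1000):
    """Return the share of sampled rows whose created_time parses as datetime.

    Hypothetical helper: reads only the first `sample_rows` rows of `source`
    (a path or a file-like object).
    """
    sample = pd.read_csv(source, nrows=sample_rows)
    if "created_time" not in sample.columns:
        raise KeyError("column 'created_time' is missing")
    parsed = pd.to_datetime(sample["created_time"], errors="coerce")
    return float(parsed.notna().mean())

# Toy data standing in for the raw file (values are made up for illustration)
toy = io.StringIO(
    "comment_id,created_time\n"
    "a1,2024-08-03 12:00:00\n"
    "a2,not-a-date\n"
)
share_ok = check_created_time(toy)
print(f"parseable share: {share_ok:.2f}")
```

A share well below 1.0 on the real file would suggest that many rows will be dropped at the parsing step.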


In [1]:
%%time
import os
import pandas as pd
import numpy as np

# ===== Configuration =====
RAW_FILE = "reddit_opinion_PSE_ISR.csv"  # change to your raw file name (combined 2024+2025)
OUT_2024 = "reddit_opinion_PSE_ISR_2024_window.csv"
OUT_2025 = "reddit_opinion_PSE_ISR_2025_window.csv"

CHUNKSIZE = 1_000_000
USECOLS   = ["comment_id","created_time","self_text","score","subreddit"]

# Remove old outputs so appends start from a clean file
for p in [OUT_2024, OUT_2025]:
    if os.path.exists(p):
        os.remove(p)

print("Input:", RAW_FILE)


Input: reddit_opinion_PSE_ISR.csv
CPU times: total: 1.03 s
Wall time: 1.52 s


## Function: keep only days 1–10 of each month (split 2024 vs 2025)

- Parse `created_time` into datetime.
- Select rows whose **day of month is 1 through 10**.
- Split by **year** (2024 vs 2025).
- Light per-year deduplication on `comment_id`.
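
The steps above can be tried on a few rows before running them on the full file. This is a small sketch with made-up values. It uses `drop_duplicates` as a stand-in for the deduplication step, which is an assumption for illustration and not necessarily how the notebook's function handles repeated IDs across chunks.

```python
import pandas as pd

# Toy rows (made-up values) covering both years, inside and outside the window
toy = pd.DataFrame({
    "comment_id": ["a", "b", "c", "c", "d"],
    "created_time": [
        "2024-08-03 10:00:00",
        "2024-08-15 10:00:00",   # day 15: outside the window
        "2025-05-10 23:59:59",   # day 10: inside (bounds are inclusive)
        "2025-05-10 23:59:59",   # repeated comment_id
        "bad value",             # fails to parse, becomes NaT
    ],
})

# Parse, then drop rows that could not be parsed
toy["created_time"] = pd.to_datetime(toy["created_time"], errors="coerce")
toy = toy.dropna(subset=["created_time"])

# Keep days 1-10, then split by year and deduplicate on comment_id
window = toy[toy["created_time"].dt.day.between(1, 10)]
year = window["created_time"].dt.year
by_year = {
    y: window[year == y].drop_duplicates(subset="comment_id")
    for y in (2024, 2025)
}

for y, part in by_year.items():
    print(y, part["comment_id"].tolist())
```

Note that `Series.between` includes both endpoints by default, so day 1 and day 10 are kept.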


In [2]:
%%time
def filter_window(raw_path, out_2024, out_2025, chunksize=1_000_000):
    wrote_2024 = False
    wrote_2025 = False
    seen_id_24, seen_id_25 = set(), set()   # light per-year dedup
    
    total_in = total_24 = total_25 = 0
    
    for chunk in pd.read_csv(
        raw_path,
        chunksize=chunksize,
        usecols=USECOLS,
        on_bad_lines="skip",
        low_memory=False
    ):
        total_in += len(chunk)

        # parse datetime; unparseable values become NaT
        # (infer_datetime_format is deprecated in pandas 2.x, so it is omitted)
        chunk["created_time"] = pd.to_datetime(
            chunk["created_time"], errors="coerce"
        )
        chunk = chunk[chunk["created_time"].notna()].copy()
        if chunk.empty:
            continue

        # keep only days 1–10
        day = chunk["created_time"].dt.day
        chunk = chunk[day.between(1, 10)].copy()
        if chunk.empty:
            continue

        # split 2024 vs 2025 (year is computed after the day filter so the
        # boolean mask shares the filtered chunk's index)
        year = chunk["created_time"].dt.year
        c24 = chunk[year == 2024].copy()
        c25 = chunk[year == 2025].copy()

        # light per-year dedup against IDs seen in earlier chunks
        if not c24.empty and "comment_id" in c24.columns:
            mask = ~c24["comment_id"].astype(str).isin(seen_id_24)
            c24 = c24[mask]
            seen_id_24.update(c24["comment_id"].astype(str).tolist())
        if not c25.empty and "comment_id" in c25.columns:
            mask = ~c25["comment_id"].astype(str).isin(seen_id_25)
            c25 = c25[mask]
            seen_id_25.update(c25["comment_id"].astype(str).tolist())

        # write
        if not c24.empty:
            c24.to_csv(out_2024, mode="a", index=False, header=not wrote_2024)
            wrote_2024 = True
            total_24  += len(c24)
        if not c25.empty:
            c25.to_csv(out_2025, mode="a", index=False, header=not wrote_2025)
            wrote_2025 = True
            total_25  += len(c25)

        print(f"Chunk done → kept 2024: {len(c24):5d} | 2025: {len(c25):5d}")

    print("\n✅ DONE filter window 1–10:")
    print(f"  Total input rows: {total_in:,}")
    print(f"  2024 kept       : {total_24:,} → {out_2024}")
    print(f"  2025 kept       : {total_25:,} → {out_2025}")


CPU times: total: 0 ns
Wall time: 7.63 μs


## Run the filter (streaming)


In [3]:
%%time
filter_window(RAW_FILE, OUT_2024, OUT_2025, CHUNKSIZE)


Chunk done → kept 2024:     0 | 2025: 283834
Chunk done → kept 2024: 315007 | 2025: 23184
Chunk done → kept 2024: 270046 | 2025:     0
Chunk done → kept 2024:     0 | 2025:     0

✅ DONE filter window 1–10:
  Total input rows: 3,321,074
  2024 kept       : 585,053 → reddit_opinion_PSE_ISR_2024_window.csv
  2025 kept       : 307,018 → reddit_opinion_PSE_ISR_2025_window.csv
CPU times: total: 1min 9s
Wall time: 1min 11s


## Quick check of the per-month distribution

- Count rows per `month` to confirm the days 1–10 subset was written. The check reads only the first 200,000 rows of each output file.
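
Per-month counts show which months are present, but they do not confirm that every row falls on days 1–10. A complementary check, sketched here under assumptions, scans a file in chunks and returns the smallest and largest day of month it finds. `day_of_month_range` is a hypothetical helper, and the toy CSV stands in for an output file.

```python
import io
import pandas as pd

def day_of_month_range(source, chunksize=200_000):
    """Return (min_day, max_day) of created_time over the whole file.

    Hypothetical helper: reads `source` in chunks so memory use stays low.
    """
    lo, hi = None, None
    for chunk in pd.read_csv(source, usecols=["created_time"], chunksize=chunksize):
        days = pd.to_datetime(chunk["created_time"], errors="coerce").dt.day.dropna()
        if days.empty:
            continue
        chunk_lo, chunk_hi = int(days.min()), int(days.max())
        lo = chunk_lo if lo is None else min(lo, chunk_lo)
        hi = chunk_hi if hi is None else max(hi, chunk_hi)
    return lo, hi

# Toy data standing in for an output file (values are made up for illustration)
toy = io.StringIO(
    "comment_id,created_time\n"
    "a,2024-08-01 00:00:00\n"
    "b,2024-09-10 23:59:59\n"
    "c,2024-10-05 12:30:00\n"
)
lo, hi = day_of_month_range(toy, chunksize=2)
print(lo, hi)
```

On a correctly filtered file the result should stay within 1 and 10. Passing `OUT_2024` or `OUT_2025` as `source` would run the same check on the real outputs.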


In [4]:
%%time
def to_month_str(s):
    s = pd.to_datetime(s, errors="coerce")
    return s.dt.to_period("M").astype(str)

for out in [OUT_2024, OUT_2025]:
    if os.path.exists(out):
        df = pd.read_csv(out, nrows=200_000)
        df["month"] = to_month_str(df["created_time"])
        print("\nFile:", out)
        print(df.groupby("month").size().sort_index().head(20))
    else:
        print("Not found:", out)



File: reddit_opinion_PSE_ISR_2024_window.csv
month
2024-08    31250
2024-09    41181
2024-10    59354
2024-11    38219
2024-12    29996
dtype: int64

File: reddit_opinion_PSE_ISR_2025_window.csv
month
2025-05    24369
2025-06    45803
2025-07    38502
2025-08    46881
2025-09    44445
dtype: int64
CPU times: total: 2.59 s
Wall time: 2.82 s
