# Preprocessing

This stage preprocesses the NO2 time series: it handles missing values, builds lag features, and converts the data into a supervised learning format.

## 1. Import Libraries and Load Data

Import the libraries needed for preprocessing and load the NO2 time series dataset.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("./dataset/NO2_sampang.csv")

print("Preview Data:")
print(df.head())

print("\nInfo Data Awal:")
df.info()  # prints its report directly and returns None, so no print() wrapper

print(f"\nShape dataset: {df.shape}")
print(f"Kolom dataset: {df.columns.tolist()}")

Preview Data:
            t  NO2
0  2021-01-01  NaN
1  2021-01-01  NaN
2  2021-01-01  NaN
3  2021-01-01  NaN
4  2021-01-02  NaN

Info Data Awal:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6916 entries, 0 to 6915
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   t       6916 non-null   object 
 1   NO2     3610 non-null   float64
dtypes: float64(1), object(1)
memory usage: 108.2+ KB

Shape dataset: (6916, 2)
Kolom dataset: ['t', 'NO2']


## 2. Date Conversion and Sorting

Convert the date column to the datetime type and sort the rows by time so that the time series is in the correct order.

In [6]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler

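# Parse the date strings, then sort chronologically. The file has several rows
# per date, so a stable sort (mergesort) keeps their original relative order.
df["t"] = pd.to_datetime(df["t"], format="%Y-%m-%d")
df = df.sort_values("t", kind="mergesort").reset_index(drop=True)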

print("Data setelah konversi datetime:")
print(df.head())
print(f"Rentang tanggal: {df['t'].min()} sampai {df['t'].max()}")

Data setelah konversi datetime:
           t  NO2
0 2021-01-01  NaN
1 2021-01-01  NaN
2 2021-01-01  NaN
3 2021-01-01  NaN
4 2021-01-02  NaN
Rentang tanggal: 2021-01-01 00:00:00 sampai 2025-10-18 00:00:00
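
The preview shows several rows with the same date, so the order of rows within a day is only preserved if the sort is stable. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `toy_sorted`, and `rows_per_date` are illustrative, not taken from the dataset) showing a stable sort and a per-date row count:

```python
import pandas as pd

# Toy frame (values are made up): two rows per date, deliberately out of date order.
toy = pd.DataFrame({
    "t": pd.to_datetime(["2021-01-02", "2021-01-01", "2021-01-02", "2021-01-01"]),
    "NO2": [3.0, 1.0, 4.0, 2.0],
})

# A stable sort keeps rows that share a date in their original relative order.
toy_sorted = toy.sort_values("t", kind="mergesort").reset_index(drop=True)

# Count how many rows each date contributes.
rows_per_date = toy_sorted.groupby("t").size()

print(toy_sorted)
print(rows_per_date)
```

The same `groupby("t").size()` check can be run on `df` to confirm how many observations each date has before building lag features.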


## 3. Handling Missing Values

Count the missing values in the dataset, then fill them with linear interpolation. Interpolation cannot fill gaps at the start of a series, so any remaining NaN values are forward-filled and then backward-filled. The preprocessed data is saved to a new file.

In [7]:
print("Missing values sebelum preprocessing:")
missing_before = df.isnull().sum()
print(missing_before)

missing_percent = (df.isnull().sum() / len(df)) * 100
print("\nPersentase missing values:")
for col, percent in missing_percent.items():
    if percent > 0:
        print(f"{col}: {percent:.2f}%")

if df.isnull().sum().sum() > 0:
    print("\nMelakukan interpolasi linier untuk missing values...")
    
    df_processed = df.copy()
    
    numeric_columns = df_processed.select_dtypes(include=[np.number]).columns
    
    for col in numeric_columns:
        if df_processed[col].isnull().sum() > 0:
            print(f"Interpolasi kolom {col}: {df_processed[col].isnull().sum()} missing values")
            
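            # Linear interpolation fills gaps that have a valid value before them.
            df_processed[col] = df_processed[col].interpolate(method='linear')
            
            # Fill any NaN left at the edges. fillna(method=...) is deprecated in
            # recent pandas releases, so call ffill() and bfill() directly.
            df_processed[col] = df_processed[col].ffill().bfill()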
    
    print("\nMissing values setelah interpolasi:")
    missing_after = df_processed.isnull().sum()
    print(missing_after)

    output_filename = "./dataset/NO2_sampang_preprocessed.csv"
    df_processed.to_csv(output_filename, index=False)
    
    print(f"\nData berhasil diproses dan disimpan ke: {output_filename}")
    print(f"Shape data asli: {df.shape}")
    print(f"Shape data processed: {df_processed.shape}")

Missing values sebelum preprocessing:
t         0
NO2    3306
dtype: int64

Persentase missing values:
NO2: 47.80%

Melakukan interpolasi linier untuk missing values...
Interpolasi kolom NO2: 3306 missing values

Missing values setelah interpolasi:
t      0
NO2    0
dtype: int64

Data berhasil diproses dan disimpan ke: ./dataset/NO2_sampang_preprocessed.csv
Shape data asli: (6916, 2)
Shape data processed: (6916, 2)
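
To illustrate why interpolation is followed by a forward and backward fill, here is a small sketch on a made-up series (the values and the names `s`, `interpolated`, and `filled` are illustrative only). It assumes the default settings of `Series.interpolate`:

```python
import numpy as np
import pandas as pd

# Toy series (values are made up) with a leading gap, an interior gap, and a trailing gap.
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0, np.nan])

# With default settings, linear interpolation fills the interior gap and carries
# the last valid value into the trailing gap, but leaves the leading NaNs untouched.
interpolated = s.interpolate(method="linear")

# Forward fill, then backward fill, removes whatever is left at the edges.
filled = interpolated.ffill().bfill()

print(interpolated.tolist())
print(filled.tolist())
```

Note that backward filling copies the first valid observation into every row before it, so a long leading gap becomes a run of identical values.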



## 4. Building Supervised Data with Lag Features

Convert the time series into a supervised learning format by creating lag features (the values of the preceding observations) and adding temporal features such as year, month, day, and their cyclical representations.
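
As a rough illustration of the lag transformation, the sketch below applies `shift` to a made-up five-row series (the values and the names `toy` and `supervised` are illustrative). Each lag column holds the target value from a fixed number of rows earlier:

```python
import pandas as pd

# Toy series (values are made up); each row is one observation in time order.
toy = pd.DataFrame({"NO2": [10.0, 11.0, 12.0, 13.0, 14.0]})

n_lags = 2
for i in range(1, n_lags + 1):
    # shift(i) moves values down by i rows, so row k sees the value from row k - i.
    toy[f"t-{i}"] = toy["NO2"].shift(i)

# The first n_lags rows have no complete history and are dropped.
supervised = toy.dropna().reset_index(drop=True)
print(supervised)
```

Each remaining row now pairs a target value with the two values that preceded it, which is the shape a regression model expects.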

### Create Supervised Data Function

This function builds lag features by shifting the target column down by 1 to n_lags rows, then adds temporal features:
- **Lag features**: NO2 values from the preceding rows. The dataset contains several rows per date, so a lag of one row is not always a lag of one day.
- **Temporal features**: year, month, day, dayofweek, dayofyear
- **Cyclical features**: sine and cosine of month and day, so that the end of a cycle (December, day 31) sits next to its start (January, day 1)
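
The effect of the cyclical encoding can be checked with a short sketch. The helper `encode_month` is hypothetical and written only for this illustration; it uses the same sine and cosine formula with a period of 12:

```python
import numpy as np

def encode_month(month):
    """Hypothetical helper: map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Distance between encoded months: December to January vs. December to June.
dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
dec_jun = np.linalg.norm(encode_month(12) - encode_month(6))

print(f"Dec-Jan distance: {dec_jan:.3f}")
print(f"Dec-Jun distance: {dec_jun:.3f}")
```

With the raw month number, December (12) and January (1) are 11 units apart. On the circle they are neighbours, while December and June sit on opposite sides.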

In [9]:
def create_supervised_data(data, n_lags=7, target_col='NO2'):
    """Return a copy of `data` with lag, calendar, and cyclical features added."""
    df_supervised = data.copy()
    
    # Stable sort so that rows sharing a date keep their original order.
    df_supervised = df_supervised.sort_values('t', kind='mergesort').reset_index(drop=True)
    
    print(f"Membuat {n_lags} lag features untuk prediksi...")
    
    # Column 't-i' holds the target value from i rows earlier.
    for i in range(1, n_lags + 1):
        df_supervised[f't-{i}'] = df_supervised[target_col].shift(i)
    
    # Calendar features taken from the timestamp.
    df_supervised['year'] = df_supervised['t'].dt.year
    df_supervised['month'] = df_supervised['t'].dt.month
    df_supervised['day'] = df_supervised['t'].dt.day
    df_supervised['dayofweek'] = df_supervised['t'].dt.dayofweek
    df_supervised['dayofyear'] = df_supervised['t'].dt.dayofyear
    
    # Cyclical encodings; dividing the day by 31 is an approximation for shorter months.
    df_supervised['month_sin'] = np.sin(2 * np.pi * df_supervised['month'] / 12)
    df_supervised['month_cos'] = np.cos(2 * np.pi * df_supervised['month'] / 12)
    df_supervised['day_sin'] = np.sin(2 * np.pi * df_supervised['day'] / 31)
    df_supervised['day_cos'] = np.cos(2 * np.pi * df_supervised['day'] / 31)
    
    # Drop rows containing NaN, i.e. the first n_lags rows that lack a full lag history.
    df_supervised = df_supervised.dropna().reset_index(drop=True)
    
    return df_supervised


df_clean = pd.read_csv("./dataset/NO2_sampang_preprocessed.csv")
df_clean['t'] = pd.to_datetime(df_clean['t'])
print("Menggunakan data preprocessed...")

# Pick the columns whose name contains "NO2", ignoring case.
target_columns = [col for col in df_clean.columns if 'NO2' in col.upper()]
if target_columns:
    target_col = target_columns[0]
    print(f"Kolom target: {target_col}")
else:
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    target_col = numeric_cols[0] if len(numeric_cols) > 0 else 'value'
    print(f"Kolom target (otomatis): {target_col}")

lag_options = [4, 30]
for n_lags in lag_options:
    print(f"\nMEMBUAT SUPERVISED DATA DENGAN {n_lags} LAG")

    df_supervised = create_supervised_data(df_clean, n_lags=n_lags, target_col=target_col)

    print(f"Shape supervised data: {df_supervised.shape}")
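    # The label below is hard-coded. create_supervised_data actually adds
    # 9 temporal columns (5 calendar + 4 cyclical), as the X shape confirms.
    print(f"Features yang dibuat: {n_lags} lag features + 8 temporal features")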

    feature_cols = [col for col in df_supervised.columns 
                   if col not in ['t', target_col] and not col.startswith('Unnamed')]

    X = df_supervised[feature_cols]
    y = df_supervised[target_col]

    print(f"Features (X) shape: {X.shape}")
    print(f"Target (y) shape: {y.shape}")
    print(f"Feature columns: {feature_cols}")

    output_filename = f"./dataset/supervised_data_lag_{n_lags}.csv"
    df_supervised.to_csv(output_filename, index=False)
    print(f"Supervised data disimpan ke: {output_filename}")

    print(f"\nContoh supervised data (5 baris pertama):")
    print(df_supervised[['t'] + feature_cols[:5] + [target_col]].head())

print("\nFile yang dibuat:")
for n_lags in lag_options:
    print(f"- supervised_data_lag_{n_lags}.csv")

Menggunakan data preprocessed...
Kolom target: NO2

MEMBUAT SUPERVISED DATA DENGAN 4 LAG
Membuat 4 lag features untuk prediksi...
Shape supervised data: (6912, 15)
Features yang dibuat: 4 lag features + 8 temporal features
Features (X) shape: (6912, 13)
Target (y) shape: (6912,)
Feature columns: ['t-1', 't-2', 't-3', 't-4', 'year', 'month', 'day', 'dayofweek', 'dayofyear', 'month_sin', 'month_cos', 'day_sin', 'day_cos']
Supervised data disimpan ke: ./dataset/supervised_data_lag_4.csv

Contoh supervised data (5 baris pertama):
           t       t-1       t-2       t-3       t-4  year       NO2
0 2021-01-02  0.000034  0.000034  0.000034  0.000034  2021  0.000034
1 2021-01-02  0.000034  0.000034  0.000034  0.000034  2021  0.000034
2 2021-01-02  0.000034  0.000034  0.000034  0.000034  2021  0.000034
3 2021-01-02  0.000034  0.000034  0.000034  0.000034  2021  0.000034
4 2021-01-03  0.000034  0.000034  0.000034  0.000034  2021  0.000034

MEMBUAT SUPERVISED DATA DENGAN 30 LAG
Membuat 30 lag 