# Traffic Machine Learning Analysis
## Random Forest Regression for Flow Prediction

This notebook walks through an in-depth analysis of traffic data, from data exploration and data cleaning through feature engineering to training a model with the **Random Forest** algorithm.

### 1. Environment Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings

warnings.filterwarnings('ignore')
sns.set(style="whitegrid")

### 2. Loading the Data
We load the `torino.csv` dataset, which contains traffic measurements from multiple sensors (detectors).

In [None]:
# Load the data
df = pd.read_csv('torino.csv')

# Show the first 5 rows
print("Top 5 Records:")
display(df.head())

# Summarize the structure of the data
print("\nData Information:")
df.info()

### 3. Data Cleaning
Handle missing values and infinite values.

In [None]:
def clean_data(data):
    df_clean = data.copy()
    
    # Critical numeric columns
    numeric_cols = ['flow', 'occ', 'speed']

    for col in numeric_cols:
        if col in df_clean.columns:
            # Replace infinity with NaN
            df_clean[col] = df_clean[col].replace([np.inf, -np.inf], np.nan)
            # Fill NaN with the column mean
            df_clean[col] = df_clean[col].fillna(df_clean[col].mean())

    # Forward-fill, then back-fill, gaps in the time and ID columns
    other_cols = ['day', 'interval', 'detid']
    for col in other_cols:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].ffill().bfill()
            
    return df_clean

df = clean_data(df)
print("Missing values after cleaning:")
print(df.isnull().sum())
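
The imputation above can be illustrated on a toy frame. The sketch below uses made-up values and a hypothetical two-detector layout. It also shows one possible alternative, filling with the per-detector mean instead of the global mean, which may be worth considering when detectors sit on roads with very different typical speeds. It is not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: detector A has an infinite reading, detector B a missing one
toy = pd.DataFrame({
    "detid": ["A", "A", "A", "B", "B", "B"],
    "speed": [40.0, np.inf, 60.0, 100.0, np.nan, 120.0],
})

# Step 1: treat +/- infinity as missing
speed = toy["speed"].replace([np.inf, -np.inf], np.nan)

# Step 2a: global mean fill (what clean_data does); mean() skips NaN by default
global_fill = speed.fillna(speed.mean())

# Step 2b: alternative, fill with the mean of the same detector
per_detector_fill = speed.fillna(speed.groupby(toy["detid"]).transform("mean"))

print(global_fill.tolist())        # [40.0, 80.0, 60.0, 100.0, 80.0, 120.0]
print(per_detector_fill.tolist())  # [40.0, 50.0, 60.0, 100.0, 110.0, 120.0]
```

With the global mean, both gaps receive 80, a value that neither detector ever reports. The per-detector fill keeps each gap within the range of its own sensor.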

### 4. Feature Engineering
This step is crucial: it derives time-based and per-detector features so that the model can learn temporal patterns and road characteristics.

In [None]:
# a. Extract time features
# Assumes `interval` is seconds since midnight and `day` is a 'YYYY-MM-DD' string
df['hour'] = df['interval'] // 3600
day_parsed = pd.to_datetime(df['day'], format='%Y-%m-%d')
df['weekday'] = day_parsed.dt.weekday
df['month'] = day_parsed.dt.month

# b. Binary features (rush hour and weekday)
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 17, 18, 19]).astype(int)
df['is_weekday'] = df['weekday'].isin([0, 1, 2, 3, 4]).astype(int)
df['is_peak_traffic'] = ((df['is_rush_hour'] == 1) & (df['is_weekday'] == 1)).astype(int)

# c. Per-detector aggregate statistics
# Note: the means use every row, including rows that later land in the test split
det_agg = df.groupby('detid').agg({
    'flow': 'mean',
    'speed': 'mean',
    'occ': 'mean'
}).reset_index()
det_agg.columns = ['detid', 'detector_mean_flow', 'detector_mean_speed', 'detector_mean_occ']
df = df.merge(det_agg, on='detid', how='left')

# d. Per-hour aggregation
hour_agg = df.groupby('hour').agg({'flow': 'mean'}).reset_index()
hour_agg.columns = ['hour', 'hourly_mean_flow']
df = df.merge(hour_agg, on='hour', how='left')

# e. Traffic Index (congestion index)
def calculate_traffic_index(row):
    # Normalize the components (flow and occupancy capped at 1, speed factor floored at 0)
    flow_norm = min(row['flow'] / 500, 1.0)
    occ_norm = min(row['occ'] / 100, 1.0)
    speed_factor = max(1 - (row['speed'] / 120), 0)

    # Weights: 40% flow, 30% occupancy, 30% speed
    index = (flow_norm * 0.4 + occ_norm * 0.3 + speed_factor * 0.3) * 100
    return max(0, min(100, index))

df['traffic_index'] = df.apply(calculate_traffic_index, axis=1)
df.head()
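
To make the index arithmetic concrete, the sketch below applies the same formula to three hypothetical readings and compares it with a vectorized variant. The function names `traffic_index` and `traffic_index_vectorized` and the toy values are illustrative assumptions, not part of the notebook. The vectorized form is offered only as a possible speed-up over `df.apply(..., axis=1)` on large frames.

```python
import pandas as pd

def traffic_index(flow, occ, speed):
    """Scalar version of the notebook's formula (same constants and weights)."""
    flow_norm = min(flow / 500, 1.0)
    occ_norm = min(occ / 100, 1.0)
    speed_factor = max(1 - speed / 120, 0)
    index = (flow_norm * 0.4 + occ_norm * 0.3 + speed_factor * 0.3) * 100
    return max(0, min(100, index))

def traffic_index_vectorized(frame):
    """Column-wise equivalent using Series.clip instead of min/max per row."""
    flow_norm = (frame["flow"] / 500).clip(upper=1.0)
    occ_norm = (frame["occ"] / 100).clip(upper=1.0)
    speed_factor = (1 - frame["speed"] / 120).clip(lower=0)
    index = (flow_norm * 0.4 + occ_norm * 0.3 + speed_factor * 0.3) * 100
    return index.clip(lower=0, upper=100)

# Hypothetical readings: light traffic, heavy traffic, and a saturated standstill
toy = pd.DataFrame({
    "flow": [100, 450, 900],
    "occ": [10, 80, 100],
    "speed": [90, 15, 0],
})

row_wise = [traffic_index(r.flow, r.occ, r.speed) for r in toy.itertuples()]
vectorized = traffic_index_vectorized(toy)

print([f"{v:.2f}" for v in row_wise])  # ['18.50', '86.25', '100.00']
```

For the first row the terms are 0.2 × 0.4 + 0.1 × 0.3 + 0.25 × 0.3 = 0.185, which gives an index of 18.5. Because `flow` enters the index directly, the index is a descriptive score and would leak the target if it were used as a feature for predicting `flow`.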

### 5. Visual Exploratory Analysis
Inspect how traffic varies over time.

In [None]:
plt.figure(figsize=(15, 6))

# Plot 1: mean flow per hour
plt.subplot(1, 2, 1)
sns.lineplot(data=df, x='hour', y='flow', hue='is_weekday')
plt.title('Mean Traffic Flow per Hour (Weekday vs. Weekend)')
plt.legend(title='Weekday')

# Plot 2: Traffic Index distribution
plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='hour', y='traffic_index')
plt.title('Traffic Index Distribution per Hour')

plt.tight_layout()
plt.show()

### 6. Model Training
Use a Random Forest Regressor to predict `flow`.

In [None]:
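# Select the features for the model
# Note: scikit-learn needs numeric inputs, so `detid` must be numeric or encoded beforehand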
feature_columns = [
    'hour', 'weekday', 'month', 'is_rush_hour', 'is_weekday', 
    'is_peak_traffic', 'detector_mean_flow', 'detector_mean_speed',
    'detector_mean_occ', 'hourly_mean_flow', 'occ', 'speed', 'detid'
]

X = df[feature_columns]
y = df['flow']

# Optionally subsample large datasets to speed up training
if len(X) > 50000:
    rng = np.random.default_rng(42)  # seeded so the sample is reproducible
    indices = rng.choice(len(X), 50000, replace=False)
    X_sample = X.iloc[indices]
    y_sample = y.iloc[indices]
else:
    X_sample, y_sample = X, y

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.2, random_state=42)

# Initialize the model
rf = RandomForestRegressor(n_estimators=50, max_depth=15, random_state=42, n_jobs=-1)

# Train the model
print("Training model...")
rf.fit(X_train, y_train)
print("Training complete!")
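
The aggregate features `detector_mean_flow` and `hourly_mean_flow` are built from the target over the whole dataset, so test-set targets contribute to features that the model sees. One possible way to avoid this is to split first and compute the aggregates from the training rows only. The sketch below shows the idea on synthetic data. The toy frame, the detector IDs, and the `fallback` value for unseen detectors are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: three hypothetical detectors
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "detid": rng.integers(1, 4, size=200),
    "flow": rng.uniform(0, 500, size=200),
})

# Split BEFORE building any feature that depends on the target
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Aggregate the target using training rows only
train_means = train.groupby("detid")["flow"].mean()
fallback = train["flow"].mean()  # used for detectors that never appear in training

# Map the training-derived means onto both splits
train["detector_mean_flow"] = train["detid"].map(train_means)
test["detector_mean_flow"] = test["detid"].map(train_means).fillna(fallback)

print(test[["detid", "detector_mean_flow"]].drop_duplicates().sort_values("detid"))
```

The same pattern would apply to the hourly mean. Whether the extra bookkeeping is worthwhile depends on how much of the model's score currently comes from these aggregates.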

### 7. Model Evaluation
Measure how accurately the model predicts `flow` on the held-out test set.

In [None]:
y_pred = rf.predict(X_test)

print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred):.4f}")
print(f"Root Mean Squared Error (RMSE): {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Plot feature importance
plt.figure(figsize=(10, 6))
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Most Important Features')
plt.show()
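
For readers who want to see what the three reported metrics measure, the sketch below recomputes them by hand on four made-up values and checks the results against scikit-learn. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted flows
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as flow
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large errors more
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")  # MAE=15.00 RMSE=15.81 R2=0.98

# Cross-check against scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

An R2 of 0 corresponds to always predicting the mean of the observed values, which makes it a useful baseline when reading the score above.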

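### 8. Conclusions
Based on the analysis above:
1. Traffic patterns depend strongly on the **hour of day (`hour`)** and on **whether the day is a weekday (`is_weekday`)**.
2. The per-detector aggregate features (such as `detector_mean_flow`) give the model a strong signal. Because these means are computed from `flow` over the full dataset before the train/test split, part of that signal may reflect target leakage.
3. The Traffic Index condenses flow, occupancy, and speed into a single 0-100 congestion score that can be compared across time intervals.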
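
The Traffic Index is a continuous score from 0 to 100. If discrete congestion levels are needed, one option is to bin it with `pd.cut`. The sketch below uses hypothetical thresholds (25, 50, 75) and hypothetical labels; neither comes from the notebook, and both would need tuning against the real data.

```python
import pandas as pd

# Hypothetical index values
scores = pd.Series([5.0, 18.5, 45.0, 70.0, 86.25], name="traffic_index")

# Assumed thresholds and labels, chosen only for illustration
bins = [0, 25, 50, 75, 100]
labels = ["low", "moderate", "high", "severe"]

# Intervals are right-closed: (0, 25], (25, 50], ...; include_lowest keeps a score of 0
levels = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(levels.tolist())  # ['low', 'low', 'moderate', 'high', 'severe']
```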