# **1. INTRODUCTION**

### **1.1 Background**
This notebook is part of a multi-class classification competition to predict the daily economic status. The data come from 10 different datasets covering several regions, including Europe, Asia, the Americas, and Africa, which makes the combined data highly heterogeneous.

### **1.2 Strategy**
This notebook focuses on the modeling and prediction stages. The raw data, which originally had 75% missing values, were merged and imputed in a separate notebook. Our approach combines feature engineering, hyperparameter tuning with Randomized Search, and a Random Forest Classifier trained on the tuned settings.

# **2. SETUP**

#### **2.1 Import Libraries**
This section imports the libraries needed for data analysis, feature engineering, modeling, and evaluation, such as Pandas, NumPy, and Scikit-learn.

In [147]:
import numpy as np
import pandas as pd
import pickle as pkl
import warnings

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import mutual_info_classif
import hdbscan
from sklearn.model_selection import cross_val_score, KFold, train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN


warnings.filterwarnings('ignore')
%matplotlib inline

#### **2.2 Load Data**
The dataset, already merged, imputed, and prepared in the previous notebook, is loaded here to start the machine learning workflow.

In [148]:
train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')

#### **2.3 Initialize**
Set the random seed and the name of the target column.

In [149]:
SEED   = 42
TARGET = 'economic_day_status'

#### **2.4 Helper Functions**

In [150]:
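def make_mi_scores(X, y):
    """Return mutual information scores per feature, sorted in descending order."""
    # random_state makes the nearest-neighbor based estimate repeatable
    mi_scores = mutual_info_classif(X, y, random_state=SEED)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def plot_mi_scores(scores):
    """Plot the scores as a horizontal bar chart, largest at the top."""
    scores = scores.sort_values(ascending=True)
    positions = np.arange(len(scores))  # y-axis positions of the bars
    ticks = list(scores.index)
    plt.barh(positions, scores)
    plt.yticks(positions, ticks)
    plt.title("Mutual Information Scores")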
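The helper above can be tried without the competition files. The following is a minimal sketch on synthetic data; the data set, the feature names `feat_0` to `feat_5`, and the fixed `random_state` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the competition data (assumption: 3 classes, 6 numeric features).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Same computation as make_mi_scores, with random_state fixed for repeatability.
demo_scores = pd.Series(
    mutual_info_classif(X_demo, y_demo, random_state=42),
    name="MI Scores",
    index=X_demo.columns,
).sort_values(ascending=False)

print(demo_scores)
```

Features with a score near zero carry little information about the target on their own, although they may still help in combination with other features.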

# **3. PRE-PROCESSING**

#### **3.1 Feature Engineering**
New, more informative features, such as lagged values, rolling statistics, and the export-to-import ratio, are derived from existing columns to improve the model's predictive power.

In [151]:
# Build a row ID from the country code and the date with hyphens removed
train_df["id"] = train_df["country_code"] + "_" + train_df["date"].str.replace("-", "", regex=False)
train_df.set_index("id", inplace=True)

test_df["id"] = test_df["country_code"] + "_" + test_df["date"].str.replace("-", "", regex=False)
test_df.set_index("id", inplace=True)

In [152]:
# Check for any remaining missing values (columns with at least one NaN)
missing_counts = train_df.isna().sum()
missing_cols = missing_counts[missing_counts > 0].index

In [153]:
# Convert 'Date' Feature into a DateTime
train_df["date"] = pd.to_datetime(train_df["date"])
test_df["date"]  = pd.to_datetime(test_df["date"])

In [None]:
def detailed_feature_engineering(data):
    # 1. Lagged Features
    # NOTE: shift() follows the frame's row order; it is not grouped by country.
    data['inflation_rate_lag_1'] = data['inflation_rate'].shift(1)
    data['unemployment_rate_lag_1'] = data['unemployment_rate'].shift(1)
    data['exchange_rate_lag_1'] = data['exchange_rate'].shift(1)
    data['business_confidence_index_lag_1'] = data['business_confidence_index'].shift(1)
    data['inflation_rate_lag_1'] = data['inflation_rate_lag_1'].fillna(0)
    data['unemployment_rate_lag_1'] = data['unemployment_rate_lag_1'].fillna(0)
    data['exchange_rate_lag_1'] = data['exchange_rate_lag_1'].fillna(0)
    data['business_confidence_index_lag_1'] = data['business_confidence_index_lag_1'].fillna(0)

    # 2. Moving Averages (using a 7-day window)
    data['inflation_rate_ma_7']    = data['inflation_rate'].rolling(window=7).mean()
    data['unemployment_rate_ma_7'] = data['unemployment_rate'].rolling(window=7).mean()
    data['inflation_rate_ma_7']    = data['inflation_rate_ma_7'].fillna(0)
    data['unemployment_rate_ma_7'] = data['unemployment_rate_ma_7'].fillna(0)

    # 3. Rate of Change / Momentum
    data['inflation_rate_pct_change']  = data['inflation_rate'].pct_change()
    data['exchange_rate_daily_change'] = data['exchange_rate'].diff()
    data['biz_confidence_change']      = data['business_confidence_index'].diff()
    data['inflation_rate_pct_change']  = data['inflation_rate_pct_change'].fillna(0)
    data['exchange_rate_daily_change'] = data['exchange_rate_daily_change'].fillna(0)
    data['biz_confidence_change']      = data['biz_confidence_change'].fillna(0)

    # 4. Interaction & Ratio Features
    data['exports_imports_ratio']    = data['exports_usd'] / data['imports_usd']
    data['debt_gdp_x_interest_rate'] = data['debt_gdp_ratio'] * data['interest_rate']

    # 5. Volatility Features (using a 7-day standard deviation)
    data['exchange_rate_std_7'] = data['exchange_rate'].rolling(window=7).std()
    data['exchange_rate_std_7'] = data['exchange_rate_std_7'].fillna(0)

    # 6. Threshold-based Binary Features
    data['high_inflation']    = np.where(data['inflation_rate'] > 0.05, 1, 0)
    data['high_unemployment'] = np.where(data['unemployment_rate'] > 0.06, 1, 0)

    # 8. Modify the Categorical Variables
    data["trade_bloc"]           =  data["trade_bloc"].fillna("No Bloc")
    data["is_crisis_event"]      = ~data["crisis_event"].isna()
    data["trade_balance_status"] =  data["trade_balance_status"].replace({
        "Deficit":-1,
        "Neutral":0,
        "Surplus":1
    })
    data["political_stability"] = data["political_stability"].replace({
        "Unstable":-1,
        "Moderate":0,
        "Stable":1
    })
    data["migration_trend"] = data["migration_trend"].replace({
        "Outflow":-1,
        "Neutral":0,
        "Inflow":1
    })
    data["income_group"] = data["income_group"].replace({
        "Low":-1,
        "Lower-Middle":0,
        "Upper-Middle":1
    })
    data["governance_quality"] = data["governance_quality"].replace({
        "Low":-1,
        "Medium":0,
        "High":1
    })
    data["climate_impact_level"] = data["climate_impact_level"].replace({
        "Low":-1,
        "Medium":0,
        "High":1
    })
    data["financial_access"] = data["financial_access"].replace({
        "Low":-1,
        "Medium":0,
        "High":1
    })
    
    # 9. Clustering (disabled)
    # Economic
    # features = ["inflation_rate", "unemployment_rate", "exchange_rate"]
    # scaler = StandardScaler()
    # X_scaled = scaler.fit_transform(data[features].fillna(0))
    # dbscan = hdbscan.HDBSCAN(
    #     min_cluster_size=100,      # clusters need at least 100 points (for stability)
    #     min_samples=10,            # set explicitly; defaults to min_cluster_size when None
    #     metric="euclidean",        # common and stable after scaling
    #     cluster_selection_method="eom",  # default, more adaptive
    #     core_dist_n_jobs=-1        # use all CPU cores
    # )
    # clusters = dbscan.fit_predict(X_scaled)
    # data['economic'] = clusters


    return data

In [155]:
train_df.columns

Index(['country_code', 'region', 'income_group', 'trade_balance_status',
       'political_stability', 'economic_sector_dominant', 'currency_type',
       'policy_framework', 'season', 'crisis_event', 'governance_quality',
       'climate_impact_level', 'trade_bloc', 'financial_access',
       'migration_trend', 'date', 'year', 'quarter', 'gdp_per_capita',
       'inflation_rate', 'unemployment_rate', 'interest_rate', 'exchange_rate',
       'exports_usd', 'imports_usd', 'debt_gdp_ratio', 'investment_pct_gdp',
       'consumption_pct_gdp', 'population_million', 'urbanization_pct',
       'internet_penetration', 'energy_consumption_mwh',
       'renewable_energy_pct', 'income_inequality_gini', 'education_index',
       'healthcare_index', 'poverty_rate', 'fdi_inflows_usd',
       'business_confidence_index', 'manufacturing_pmi',
       'economic_day_status'],
      dtype='object')

In [156]:
# Train Data
train_df = detailed_feature_engineering(train_df)
# Test Data
test_df  = detailed_feature_engineering(test_df)
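Because the training frame stacks several countries, lag and rolling features computed over the raw row order can mix values from different countries at the boundaries. The sketch below shows one possible per-country variant on a toy frame; the values, the 2-day window, and the column name `inflation_rate_ma_2` are hypothetical, and this is not the feature set the reported score was obtained with.

```python
import pandas as pd

# Toy frame with two countries (hypothetical values), deliberately unsorted.
toy = pd.DataFrame({
    "country_code": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-01",
                            "2020-01-03", "2020-01-03", "2020-01-02"]),
    "inflation_rate": [0.02, 0.10, 0.01, 0.12, 0.03, 0.11],
})

# Sort so that "previous row" means "previous day of the same country".
toy = toy.sort_values(["country_code", "date"]).reset_index(drop=True)
by_country = toy.groupby("country_code")["inflation_rate"]

# Lag and 2-day moving average computed within each country only.
toy["inflation_rate_lag_1"] = by_country.shift(1)
toy["inflation_rate_ma_2"] = by_country.transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
print(toy)
```

With this grouping, the first row of each country has no lag value, so whether to fill it with 0 (as the function above does) or with another value becomes an explicit modeling choice.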

#### **3.2 Encoding**
The remaining categorical features (for example, policy_framework and currency_type) are converted to numeric columns with One-Hot Encoding so the model can process them. Ordinal features such as trade_balance_status were already mapped to integers in the feature engineering step.

In [157]:
train_df = train_df.drop(columns=['season', 'crisis_event', "date"])
test_df  = test_df.drop( columns=['season', 'crisis_event', "date"])

num_cols = train_df.select_dtypes(exclude="object").columns
cat_cols = train_df.select_dtypes(include="object").columns

cat_cols = [col for col in cat_cols if col not in [TARGET]]

train_df = pd.get_dummies(train_df, columns=cat_cols)
test_df  = pd.get_dummies(test_df,  columns=cat_cols)
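Encoding the two frames separately can produce different dummy columns when a category appears in only one of them. The following sketch, on toy frames with hypothetical category values, shows one way to align the test columns with the training columns using `reindex`.

```python
import pandas as pd

# Toy frames (hypothetical categories): the test split lacks "Fixed" and has an unseen "Pegged".
train_demo = pd.DataFrame({"currency_type": ["Floating", "Fixed"], "gdp_per_capita": [1.0, 2.0]})
test_demo = pd.DataFrame({"currency_type": ["Floating", "Pegged"], "gdp_per_capita": [3.0, 4.0]})

train_enc = pd.get_dummies(train_demo, columns=["currency_type"])
test_enc = pd.get_dummies(test_demo, columns=["currency_type"])

# Force the test frame onto the training columns: missing dummies become 0, unseen ones are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

In this notebook the training frame still contains the target column at this point, so the alignment would be done against the feature columns only (for example `X.columns` after the split step).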

#### **3.3 Data Splitting**

In [158]:
X = train_df.drop(columns=TARGET)
y = train_df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
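The split above is purely random. If the classes are imbalanced, passing `stratify` keeps the class proportions the same in both parts. The sketch below illustrates this on a toy target; the 60/30/10 proportions are an assumption, not the actual distribution of economic_day_status.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target (hypothetical proportions): 60% Low, 30% Medium, 10% High.
y_toy = pd.Series(["Low"] * 60 + ["Medium"] * 30 + ["High"] * 10)
X_toy = pd.DataFrame({"x": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the 20-row hold-out keeps the 60/30/10 class mix.
print(y_te.value_counts().to_dict())
```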

#### **3.4 Data Evaluation**

In [159]:
# mi_scores = make_mi_scores(X,y)

In [160]:
# mi_scores.sort_values(ascending=False)

In [161]:
# plot_mi_scores(mi_scores[::3])

# **4. MODELING**

#### **4.1 Model Selection**
A Random Forest Classifier was chosen because it handles heterogeneous datasets well and because averaging many trees helps mitigate overfitting.

#### **4.2 Hyperparameter Tuning**
We use Randomized Search to look for a good combination of hyperparameters (n_estimators, max_depth, etc.). It samples a fixed number of combinations instead of evaluating the full grid, which cuts computation time substantially at the cost of possibly missing the best combination.

In [162]:
# Encode target
mapping = {"Low": 0, "Medium": 1, "High": 2}
y_train = y_train.replace(mapping).astype("int32")
y_test  = y_test.replace(mapping).astype("int32")

# # Random Forest Model
# rf = RandomForestClassifier()

In [163]:
# # Generate hyperparams
# param_grid = {
#     "n_estimators": [200, 500, 1000, 1200, 1500],
#     "max_depth": [13, 15, 17, 20],
#     "max_features": ["sqrt", "log2"],
#     "min_samples_split": [2, 5, 12, 14],
#     "min_samples_leaf": [1, 2, 4],
#     "bootstrap": [True, False],
#     "n_jobs": [-1],
#     "random_state": [SEED]
# }

In [164]:
# # Search for the best params
# random_search = RandomizedSearchCV(
#     estimator=rf,
#     param_distributions=param_grid,
#     n_iter=50,
#     scoring="f1_macro",
#     cv=3,
#     verbose=2,
#     random_state=SEED,
#     n_jobs=-1
# )

# # Fit to search best params
# random_search.fit(X_train, y_train)

In [165]:
# print("Best parameters found:", random_search.best_params_)
# print("Best CV F1-macro:", random_search.best_score_)
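The commented-out search above can be exercised on a small synthetic problem. This is a minimal sketch: the data set and the deliberately tiny search space are assumptions chosen so that it runs in seconds, not the grid used for the competition model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class problem standing in for X_train / y_train (assumption).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Deliberately tiny search space so the sketch runs in seconds.
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # samples 5 of the 18 combinations
    scoring="f1_macro",
    cv=3,
    random_state=42,
)
search.fit(X_syn, y_syn)

print("Best parameters found:", search.best_params_)
print("Best CV F1-macro:", search.best_score_)
```

The `n_iter` argument is what bounds the cost: the number of model fits is `n_iter` times the number of folds, regardless of how large the search space is.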

#### **4.3 Model Training**
The Random Forest model is trained with the best hyperparameters found in the tuning step.

In [166]:
# # Initialize a new RF with the best parameters
# rf_best = RandomForestClassifier(**random_search.best_params_)

# # Fit the new model to the training data
# rf_best.fit(X_train, y_train)

In [167]:
# Initialize a new RF with the best parameters
rf = RandomForestClassifier(
    n_estimators=805,
    max_depth=17,
    max_features='sqrt',
    min_samples_split=14,
    min_samples_leaf=4,
    bootstrap=False,
    class_weight="balanced",
    random_state=SEED,
    n_jobs=-1
)

# Fit the new model to the training data
rf.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| n_estimators | 805 |
| criterion | 'gini' |
| max_depth | 17 |
| min_samples_split | 14 |
| min_samples_leaf | 4 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | False |


# **5. PREDICTION & SUBMISSION**

#### **5.1 Prediction**
The trained model predicts the target class (Low, Medium, or High) on the hold-out split, and the macro-averaged F1 score is reported.

In [168]:
# Evaluate on Test Set
y_pred = rf.predict(X_test)
f1 = f1_score(y_test, y_pred, average="macro")

print("Test F1-macro:", f1)

Test F1-macro: 0.4359036713493793
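Macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many rows it has. The sketch below checks this on a few hypothetical labels; the arrays are made up for illustration and are unrelated to the score printed above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical encoded labels (0 = Low, 1 = Medium, 2 = High).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 0, 2, 1])

# average=None returns one F1 per class; macro is their plain mean.
per_class = f1_score(y_true_demo, y_pred_demo, average=None)
macro = f1_score(y_true_demo, y_pred_demo, average="macro")

print("Per-class F1:", per_class)
print("Macro F1:", macro)
```

Looking at the per-class values can show whether a low macro score comes from one weak class or from all of them.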


#### **5.2 Creating the *Submission* File**
The model is refit on the full training set, and its predictions on the competition test data are encoded with the mapping dictionary and saved as submission.csv for upload to Kaggle.

In [None]:
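train_cols = X.columns

# Refit on the full training data; y still holds the original string labels here
rf.fit(X, y)

# Keep only the training columns, in the same order as during fitting
test_df = test_df[train_cols]
test_pred = rf.predict(test_df)

# Encode the predicted labels as integers via mapping (Low=0, Medium=1, High=2)
test_pred = [mapping[pred] for pred in test_pred]

submission_df = pd.DataFrame({'id': test_df.index, TARGET: test_pred})
submission_df.to_csv('submission.csv', index=False)
print("Submission file created successfully!")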

Submission file created successfully!


#### **5.3 Model Saving**

In [171]:
# Create the filename from the hold-out F1 score and the model name
# (f1 was measured before the refit on the full training set)
model_name = type(rf).__name__
filename = f"{f1:.4f}-{model_name}.pkl"

# Save the trained model; the context manager closes the file handle
with open(filename, "wb") as f:
    pkl.dump(rf, f)
print(f"Model saved successfully as {filename}!")

Model saved successfully as 0.4359-RandomForestClassifier.pkl!
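A saved model is only useful if it can be loaded back and gives the same predictions. The following is a minimal round-trip sketch with a small stand-in model and a temporary directory; the data, the model size, and the file name `demo-RandomForestClassifier.pkl` are assumptions for illustration.

```python
import pickle as pkl
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (assumption: not the tuned model from this notebook).
X_small, y_small = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo-RandomForestClassifier.pkl"

    # Context managers close the file handles even if dump/load raises.
    with open(path, "wb") as f:
        pkl.dump(model, f)
    with open(path, "rb") as f:
        restored = pkl.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = (restored.predict(X_small) == model.predict(X_small)).all()
print("Round trip OK:", same_predictions)
```

Pickle files should be loaded with the same scikit-learn version that wrote them, and only from trusted sources, since unpickling can execute arbitrary code.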
