# Notebook 03 — Data Cleaning & Feature Engineering

**Objectives:**  
1. Handle missing values in depth, SST, and salinity.  
2. Create a combined depth column (`depth_m`).  
3. Prepare the final dataset for SDM modeling in Notebook 04.

## 1️⃣ Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## 2️⃣ Load Dataset from Notebook 01 / 02

In [2]:
df = pd.read_csv("../data/processed/occurrence_with_environment.csv")
print("Initial dataset shape:", df.shape)
df.head()

Initial dataset shape: (12762, 10)


| | scientificName | decimalLatitude | decimalLongitude | eventDate | minimumDepthInMeters | maximumDepthInMeters | verbatimDepth | basisOfRecord | sst_mean | salinity_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Undaria pinnatifida f. pinnatifida | 45.32 | 141.05 | 2005-05-21 | NaN | NaN | NaN | PreservedSpecimen | 10.479646 | 33.538726 |
| 1 | Saccharina japonica var. ochotensis | 45.32 | 141.05 | 2005-05-21 | NaN | NaN | NaN | PreservedSpecimen | 10.479646 | 33.538726 |
| 2 | Saccharina japonica var. ochotensis | 45.32 | 141.05 | 2005-05-21 | NaN | NaN | NaN | PreservedSpecimen | 10.479646 | 33.538726 |
| 3 | Costaria costata | 45.32 | 141.05 | 2005-05-21 | NaN | NaN | NaN | PreservedSpecimen | 10.479646 | 33.538726 |
| 4 | Undaria pinnatifida f. pinnatifida | 45.12 | 141.18 | 2005-05-22 | NaN | NaN | NaN | PreservedSpecimen | 10.634725 | 33.447089 |


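## 3️⃣ Handle Missing Depth Values

- Create the `depth_m` column as the row-wise mean of `minimumDepthInMeters` and `maximumDepthInMeters`.  
- If only one of the two is present, `depth_m` takes that value, because `mean(axis=1)` skips `NaN` by default.  
- If both are missing, `depth_m` stays `NaN`.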

In [3]:
df["depth_m"] = df[["minimumDepthInMeters","maximumDepthInMeters"]].mean(axis=1)

missing_depth = df["depth_m"].isna().sum() / len(df) * 100
print(f"Missing depth (%): {missing_depth:.2f}%")

Missing depth (%): 98.71%
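
The row-wise mean used for `depth_m` can be checked on a small toy frame. The sketch below uses made-up depth values (not taken from the dataset) to cover the three cases: both bounds present, only one present, and both missing.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical depths: both bounds, one bound, no bounds
toy = pd.DataFrame({
    "minimumDepthInMeters": [2.0, 5.0, np.nan],
    "maximumDepthInMeters": [4.0, np.nan, np.nan],
})

# DataFrame.mean skips NaN by default (skipna=True), so a single
# available bound is used as-is; a row with no bounds stays NaN.
toy["depth_m"] = toy[["minimumDepthInMeters", "maximumDepthInMeters"]].mean(axis=1)
print(toy)
```

If a record should only receive a depth when both bounds are known, passing `skipna=False` to `mean` would be one way to enforce that; the notebook itself keeps the default.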


## 4️⃣ Handle Missing Values in Environmental Variables

- SST and salinity, extracted from Bio-ORACLE, are missing for roughly 35% of records.  
- Drop every record that lacks either environmental variable.

In [4]:
missing_sst = df["sst_mean"].isna().sum() / len(df) * 100
missing_sal = df["salinity_mean"].isna().sum() / len(df) * 100
print(f"Missing SST (%): {missing_sst:.2f}%")
print(f"Missing Salinity (%): {missing_sal:.2f}%")

# Drop rows with missing SST or salinity
df_clean = df.dropna(subset=["sst_mean","salinity_mean"]).reset_index(drop=True)
print("Dataset shape after dropping missing env:", df_clean.shape)

Missing SST (%): 35.22%
Missing Salinity (%): 35.22%
Dataset shape after dropping missing env: (8267, 11)
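
SST and salinity show the same missing percentage, which suggests, but does not prove, that the same rows are missing both. A minimal sketch of how that could be checked, and of how `dropna(subset=...)` ignores gaps in other columns such as `depth_m`, is shown below on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with different missingness patterns
toy = pd.DataFrame({
    "depth_m": [np.nan, 3.0, np.nan, np.nan],
    "sst_mean": [10.5, np.nan, 12.1, np.nan],
    "salinity_mean": [33.5, np.nan, np.nan, 33.4],
})
env_cols = ["sst_mean", "salinity_mean"]

# Rows missing both variables vs. rows missing at least one
missing_both = int(toy[env_cols].isna().all(axis=1).sum())
missing_any = int(toy[env_cols].isna().any(axis=1).sum())
print("Missing both:", missing_both, "| missing any:", missing_any)

# Only the env columns decide which rows are dropped;
# a NaN in depth_m does not remove a row.
toy_clean = toy.dropna(subset=env_cols).reset_index(drop=True)
print(toy_clean)
```

When `missing_both` equals `missing_any` on the real data, the two variables are missing on exactly the same records, so dropping on either column alone would give the same result.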


## 5️⃣ Encode Presence

- Every occurrence record is treated as `presence = 1`.  
- Pseudo-absences will be generated in Notebook 04 during modeling.

In [5]:
df_clean["presence"] = 1

## 6️⃣ Save Cleaned Dataset

- This dataset is the **final input for SDM modeling** in Notebook 04.

In [6]:
output_file = "../data/processed/occurrence_cleaned_for_sdm.csv"
df_clean.to_csv(output_file, index=False)
print(f"Cleaned, modeling-ready dataset saved: {output_file}")

Cleaned, modeling-ready dataset saved: ../data/processed/occurrence_cleaned_for_sdm.csv
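
Before handing the file to the next notebook, a quick round-trip check can confirm that it reloads as expected. The sketch below is an optional sanity check, not part of the original workflow: it writes a toy frame to a temporary file (the file name `toy_cleaned.csv` is hypothetical) and verifies the reloaded columns, shape, and values.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_clean (a small subset of columns for illustration)
toy_clean = pd.DataFrame({
    "scientificName": ["Costaria costata", "Undaria pinnatifida f. pinnatifida"],
    "sst_mean": [10.479646, 10.634725],
    "salinity_mean": [33.538726, 33.447089],
    "presence": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")  # hypothetical file name
    # index=False keeps the row index out of the file, so no
    # "Unnamed: 0" column appears when the file is read back.
    toy_clean.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should match what was written
assert list(reloaded.columns) == list(toy_clean.columns)
assert reloaded.shape == toy_clean.shape
assert reloaded[["sst_mean", "salinity_mean"]].notna().all().all()
assert (reloaded["presence"] == 1).all()
print("Round-trip check passed:", reloaded.shape)
```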


### ✅ Summary Notebook 03

- Records with missing SST or salinity were dropped (12762 → 8267 rows).  
- The `depth_m` column was created as the mean of min/max depth; it was `NaN` for 98.71% of records before the drop and is not imputed here.  
- The `presence` column was added for SDM modeling.  
- The final dataset is saved in the `data/processed/` folder, ready for use in Notebook 04 (SDM & Probability Map).