# Notebook 01 — Data Acquisition & Initial Preprocessing

**Objectives:**  
1. Load and prepare the seaweed **occurrence** dataset downloaded from OBIS.  
2. Retrieve environmental data from **Bio-ORACLE v3** (SST and salinity).  
3. Save the resulting dataset for use in the following notebooks.

## 1️⃣ Import Libraries

In [1]:
import pandas as pd
import numpy as np
import xarray as xr
import os

# Create the output folders if they do not exist yet
os.makedirs("../data/raw", exist_ok=True)
os.makedirs("../data/processed", exist_ok=True)
os.makedirs("../figures", exist_ok=True)

## 2️⃣ Load OBIS Occurrence Data

- The OBIS data can be a previously downloaded CSV or TXT file.  
- Key columns: `scientificName`, `decimalLatitude`, `decimalLongitude`, `eventDate`, `minimumDepthInMeters`, `maximumDepthInMeters`, `verbatimDepth`.


In [2]:
raw_path = "../data/raw/kelp_occurrence_raw.txt"

df_raw = pd.read_csv(
    raw_path,
    sep="\t",        # OBIS usually uses TAB
    encoding="utf-8",
    low_memory=False
)

print("Raw data shape:", df_raw.shape)
df_raw.head()

Raw data shape: (20248, 189)


| | id | type | modified | language | license | rightsHolder | accessRights | bibliographicCitation | references | institutionID | ... | infraspecificEpithet | cultivarEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName | nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | urn:catalog:S9-5:Kelp:100708 | | 2019-06-26T15:23:58.713+09:00 | en | ccby | | | Herbarium, Graduate School of Science, Hokkaid... | http://www.godac.jamstec.go.jp/bismal/record/S... | | ... | | | form | | | | | | | |
| 1 | urn:catalog:S9-5:Kelp:100712 | | 2019-06-26T15:23:58.713+09:00 | en | ccby | | | Herbarium, Graduate School of Science, Hokkaid... | http://www.godac.jamstec.go.jp/bismal/record/S... | | ... | | | variety | | (Miyabe) Yotsukura, Kawashima, T. Kawai, T. Ab... | | | | | |
| 2 | urn:catalog:S9-5:Kelp:100713 | | 2019-06-26T15:23:58.714+09:00 | en | ccby | | | Herbarium, Graduate School of Science, Hokkaid... | http://www.godac.jamstec.go.jp/bismal/record/S... | | ... | | | variety | | (Miyabe) Yotsukura, Kawashima, T. Kawai, T. Ab... | | | | | |
| 3 | urn:catalog:S9-5:Kelp:100722 | | 2019-06-26T15:23:58.714+09:00 | en | ccby | | | Herbarium, Graduate School of Science, Hokkaid... | http://www.godac.jamstec.go.jp/bismal/record/S... | | ... | | | species | | (C. Agardh) De A. Saunders | | | | | |
| 4 | urn:catalog:S9-5:Kelp:100723 | | 2019-06-26T15:23:58.715+09:00 | en | ccby | | | Herbarium, Graduate School of Science, Hokkaid... | http://www.godac.jamstec.go.jp/bismal/record/S... | | ... | | | species | | Dumortier | | | | | |
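
With 189 columns in the raw export, it can help to confirm that the Darwin Core fields used later are actually present before subsetting. The following is a minimal sketch; the helper `missing_columns` and the toy frame standing in for `df_raw` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Darwin Core fields used in the rest of this notebook
REQUIRED = [
    "scientificName", "decimalLatitude", "decimalLongitude", "eventDate",
]

def missing_columns(frame: pd.DataFrame, required: list) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Toy stand-in for df_raw (hypothetical values, eventDate deliberately absent)
toy = pd.DataFrame({
    "scientificName": ["Costaria costata"],
    "decimalLatitude": [45.32],
    "decimalLongitude": [141.05],
})

absent = missing_columns(toy, REQUIRED)
print("Missing columns:", absent)
```

Running the same check on `df_raw` before the column selection would turn a later `KeyError` into an explicit list of missing fields.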


## 3️⃣ Filter Target Species

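- Focus on the main relevant seaweed taxa:
  - Undaria pinnatifida
  - Saccharina japonica
  - Sargassum spp.
  - Eisenia bicyclis
- The filter in cell [7] keeps the 20 most frequently recorded taxa inside the bounding box, which include these groups.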

In [3]:
columns_needed = [
    "scientificName",
    "decimalLatitude",
    "decimalLongitude",
    "eventDate",
    "minimumDepthInMeters",
    "maximumDepthInMeters",
    "verbatimDepth",
    "basisOfRecord"
]

df = df_raw[columns_needed].copy()
print("After column selection:", df.shape)
df.head()

After column selection: (20248, 8)


| | scientificName | decimalLatitude | decimalLongitude | eventDate | minimumDepthInMeters | maximumDepthInMeters | verbatimDepth | basisOfRecord |
|---|---|---|---|---|---|---|---|---|
| 0 | Undaria pinnatifida f. pinnatifida | 45.32 | 141.05 | 2005-05-21 | | | | PreservedSpecimen |
| 1 | Saccharina japonica var. ochotensis | 45.32 | 141.05 | 2005-05-21 | | | | PreservedSpecimen |
| 2 | Saccharina japonica var. ochotensis | 45.32 | 141.05 | 2005-05-21 | | | | PreservedSpecimen |
| 3 | Costaria costata | 45.32 | 141.05 | 2005-05-21 | | | | PreservedSpecimen |
| 4 | Agarum clathratum | 45.32 | 141.05 | 2005-05-21 | | | | PreservedSpecimen |


In [4]:
# Drop records that have no coordinates
df = df.dropna(subset=["decimalLatitude", "decimalLongitude"])
df.shape

(20248, 8)

In [5]:
# Bounding box in decimal degrees (20–50°N, 120–150°E), bounds inclusive
lat_min, lat_max = 20, 50
lon_min, lon_max = 120, 150

df = df[
    (df["decimalLatitude"].between(lat_min, lat_max)) &
    (df["decimalLongitude"].between(lon_min, lon_max))
]
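
The bounding-box cell prints nothing, so the number of records it removes is not visible. As a rough sketch on toy coordinates (the helper `in_bbox` is a hypothetical name), the same inclusive filter can be wrapped in a function that also makes it easy to count dropped rows:

```python
import pandas as pd

def in_bbox(frame, lat_range, lon_range):
    """Boolean mask of rows inside an inclusive lat/lon bounding box."""
    lat_ok = frame["decimalLatitude"].between(*lat_range)
    lon_ok = frame["decimalLongitude"].between(*lon_range)
    return lat_ok & lon_ok

# Toy coordinates: two inside (one exactly on the corner), two outside
toy = pd.DataFrame({
    "decimalLatitude": [45.32, 20.0, 51.0, 35.0],
    "decimalLongitude": [141.05, 120.0, 140.0, 119.9],
})

mask = in_bbox(toy, (20, 50), (120, 150))
print("Kept:", int(mask.sum()), "Dropped:", int((~mask).sum()))
```

`Series.between` includes both bounds by default, so a point exactly on the edge of the box is kept.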

In [6]:
species_counts = df["scientificName"].value_counts()

species_counts.head(20)

scientificName
Undaria pinnatifida f. pinnatifida     1392
Saccharina japonica var. religiosa     1371
Eisenia bicyclis                       1162
Sargassum                               895
Sargassum patens                        813
Sargassum fulvellum                     791
Sargassum horneri                       790
Ecklonia kurome                         613
Saccharina japonica var. japonica       573
Saccharina japonica var. ochotensis     481
Gelidium elegans                        471
Sargassum fusiforme                     430
Ecklonia cava                           420
Saccharina longissima                   420
Sargassum macrocarpum                   416
Saccharina angustata                    376
Saccharina coriacea                     365
Sargassum siliquastrum                  341
Ulva pertusa                            329
Costaria costata                        313
Name: count, dtype: int64

In [7]:
target_species = [
    "Undaria pinnatifida f. pinnatifida",
    "Saccharina japonica var. religiosa",
    "Eisenia bicyclis",
    "Sargassum",
    "Sargassum patens",
    "Sargassum fulvellum",
    "Sargassum horneri",
    "Ecklonia kurome",
    "Saccharina japonica var. japonica",
    "Saccharina japonica var. ochotensis",
    "Gelidium elegans",
    "Sargassum fusiforme",
    "Ecklonia cava",
    "Saccharina longissima",
    "Sargassum macrocarpum",
    "Saccharina angustata",
    "Saccharina coriacea",
    "Sargassum siliquastrum",
    "Ulva pertusa",
    "Costaria costata"
]

df = df[df["scientificName"].isin(target_species)].reset_index(drop=True)
print(f"Number of records after species filter: {df.shape[0]}")

Number of records after species filter: 12762
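
The filter matches full `scientificName` strings, so forms and varieties such as `Saccharina japonica var. religiosa` are treated as separate taxa. If species-level or genus-level grouping is wanted later, one possible approach, sketched here on a few names from the list, is to derive two helper columns; the variable names `genus` and `binomial` are illustrative assumptions.

```python
import pandas as pd

names = pd.Series([
    "Undaria pinnatifida f. pinnatifida",
    "Saccharina japonica var. religiosa",
    "Sargassum",
    "Sargassum horneri",
])

parts = names.str.split()

# First word is the genus
genus = parts.str[0]

# First two words give the binomial; genus-only records stay as the genus
binomial = parts.apply(lambda words: " ".join(words[:2]))

print(pd.DataFrame({"scientificName": names, "genus": genus, "binomial": binomial}))
```

Grouping on `genus` would pool all *Sargassum* records, which matches the "Sargassum spp." entry in the target list above.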


## 4️⃣ Handle Missing Coordinates

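- Drop records that lack `decimalLatitude`, `decimalLongitude`, or `eventDate`. Missing coordinates were already removed in cell [4], so this step mainly guards against missing dates.


In [8]:
df = df.dropna(subset=["decimalLatitude","decimalLongitude","eventDate"])
print(f"Number of records after dropping missing coordinates and dates: {df.shape[0]}")

Number of records after dropping missing coordinates and dates: 12762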
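
Since the record count did not change, it may be useful to see how many values are missing per column before dropping anything. A small sketch on toy data (all values hypothetical):

```python
import numpy as np
import pandas as pd

required = ["decimalLatitude", "decimalLongitude", "eventDate"]

# Toy records: one missing latitude, one missing date
toy = pd.DataFrame({
    "decimalLatitude": [45.32, np.nan, 35.0],
    "decimalLongitude": [141.05, 140.0, 139.5],
    "eventDate": ["2005-05-21", "2005-05-21", None],
})

# Count missing values per required column before dropping
missing_per_column = toy[required].isna().sum()
print(missing_per_column)

# Drop incomplete rows and renumber the index
cleaned = toy.dropna(subset=required).reset_index(drop=True)
print("Rows kept:", len(cleaned))
```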


## 5️⃣ Load Bio-ORACLE Environmental Layers

- SST (`thetao_mean`) and salinity (`so_mean`)  


In [9]:
# SST: open the layer and keep the first time step
sst_ds = xr.open_dataset("../data/raw/sst-2010-2020.nc")
sst = sst_ds["thetao_mean"].isel(time=0)

# Salinity: same procedure
sal_ds = xr.open_dataset("../data/raw/salinity-2010-2020.nc")
sal = sal_ds["so_mean"].isel(time=0)


## 6️⃣ Assign Environmental Values to Occurrence Points

- Extract SST and salinity values from Bio-ORACLE for each occurrence point using nearest-neighbour lookup.
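
To illustrate how paired nearest-neighbour lookup behaves in xarray, here is a sketch on a small synthetic grid. The grid values and coordinates are invented for illustration; only the coordinate names `latitude` and `longitude` follow the sampling cell in this notebook.

```python
import numpy as np
import xarray as xr

# Synthetic 3x4 grid standing in for a Bio-ORACLE layer (hypothetical values)
grid = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    coords={
        "latitude": [30.0, 35.0, 40.0],
        "longitude": [130.0, 135.0, 140.0, 145.0],
    },
    dims=("latitude", "longitude"),
)

# Indexers that share the "points" dimension pair lat[i] with lon[i]
lat = xr.DataArray([34.0, 41.0], dims="points")
lon = xr.DataArray([136.0, 129.0], dims="points")
pointwise = grid.sel(latitude=lat, longitude=lon, method="nearest")

# Plain lists are applied per axis and return a 2x2 block instead
outer = grid.sel(latitude=[34.0, 41.0], longitude=[136.0, 129.0], method="nearest")

print(pointwise.shape, outer.shape)
```

The pointwise form returns one value per occurrence record, which is what is needed to fill a DataFrame column.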


In [10]:
# Bio-ORACLE sampling function
def sample_biooracle(da, lat, lon):
    """Return the value of the grid cell nearest to each (lat, lon) pair."""
    # Indexers sharing the "points" dimension select paired coordinates
    lat_idx = xr.DataArray(np.asarray(lat), dims="points")
    lon_idx = xr.DataArray(np.asarray(lon), dims="points")
    return da.sel(latitude=lat_idx, longitude=lon_idx, method="nearest").values

df["sst_mean"] = sample_biooracle(sst, df["decimalLatitude"], df["decimalLongitude"])
df["salinity_mean"] = sample_biooracle(sal, df["decimalLatitude"], df["decimalLongitude"])

# Check for missing values
print(df[["sst_mean","salinity_mean"]].isna().sum())

sst_mean         4495
salinity_mean    4495
dtype: int64
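
The check above reports 4,495 records without SST and the same number without salinity. One plausible cause, not verified in this notebook, is that coastal occurrence points fall nearest to a grid cell that is masked as land. If that is the case, one possible remedy is to fall back to the nearest cell that has a value. The sketch below uses plain NumPy on a toy grid; the function `nearest_valid` is a hypothetical helper, and distance is measured in degrees, which is only an approximation.

```python
import numpy as np

def nearest_valid(values, grid_lat, grid_lon, lat, lon):
    """Value of the closest non-NaN grid cell for each (lat, lon) point.

    Brute-force search using squared distance in degrees (an approximation).
    `values` must have shape (len(grid_lat), len(grid_lon)).
    """
    lat2d, lon2d = np.meshgrid(grid_lat, grid_lon, indexing="ij")
    valid = ~np.isnan(values)
    v_lat, v_lon, v_val = lat2d[valid], lon2d[valid], values[valid]
    lat = np.asarray(lat, dtype=float)[:, None]
    lon = np.asarray(lon, dtype=float)[:, None]
    # Distance from every point (rows) to every valid cell (columns)
    dist2 = (v_lat[None, :] - lat) ** 2 + (v_lon[None, :] - lon) ** 2
    return v_val[dist2.argmin(axis=1)]

# Toy 3x3 grid with a masked (NaN) centre cell
grid_lat = np.array([30.0, 35.0, 40.0])
grid_lon = np.array([130.0, 135.0, 140.0])
values = np.array([
    [10.0, 11.0, 12.0],
    [13.0, np.nan, 15.0],
    [16.0, 17.0, 18.0],
])

# First point is nearest to the masked centre cell, second to a valid cell
result = nearest_valid(values, grid_lat, grid_lon, [35.5, 30.2], [134.0, 139.0])
print(result)
```

The search builds a points-by-cells distance matrix, so on a real layer it would be sensible to crop the grid to the study area first and apply the fallback only to records that came back as NaN. With the layers loaded above, the inputs would be `sst.values`, `sst.latitude.values`, and `sst.longitude.values`, assuming the array dimensions are ordered (latitude, longitude).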


## 7️⃣ Save Processed Dataset

- This dataset will be used in Notebooks 02–04.


In [11]:
output_file = "../data/processed/occurrence_with_environment.csv"
df.to_csv(output_file, index=False)
print(f"Processed dataset saved: {output_file}")


Processed dataset saved: ../data/processed/occurrence_with_environment.csv
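
Before moving on to the next notebook, a quick round-trip check can confirm that the CSV reloads with the expected shape and columns. The sketch below writes a toy frame to a temporary directory rather than the project path; the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the processed DataFrame
toy = pd.DataFrame({
    "scientificName": ["Eisenia bicyclis", "Ecklonia cava"],
    "sst_mean": [18.5, None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "occurrence_with_environment.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded file should match the frame that was written
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```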


### ✅ Summary Notebook 01

- The seaweed occurrence dataset has been loaded and filtered to the main taxa  
- Environmental data (SST and salinity) from Bio-ORACLE have been added; SST and salinity are each missing for 4,495 of the 12,762 records  
- The CSV file is ready for **EDA and SDM modeling** in the next notebooks