# 🌾 Tutorial Overview — AI in Soil Spectroscopy
Welcome to the *AI in Agriculture* hands-on module! In this exercise, you'll learn how to use **machine learning** to predict **Soil Organic Carbon (SOC)** from soil spectral data.

---
## 🎯 Objectives
By the end of this tutorial, you will be able to:
1. Understand how soil spectra represent soil composition.
2. Visualize raw and preprocessed spectra.
3. Apply simple preprocessing (SNV, Savitzky–Golay).
4. Train a baseline **PLSR model** to predict SOC.
5. Visualize and interpret prediction results.

---
## 📁 Dataset Description
The dataset `soil_spectra_teaching.csv` contains:
- **Spectral columns:** named `wl_400` to `wl_2500`, representing reflectance at each wavelength (nm).  
- **Target column:** `SOC` (Soil Organic Carbon, %).  
- **Other optional columns:** may include `pH`, `Clay`, etc.
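
To make the column layout above concrete, here is a minimal sketch that builds a tiny synthetic table in the same shape and recovers the wavelength axis from the column names. The 10 nm step, the five samples, the random values, and the names `toy`, `toy_spec_cols`, and `toy_wavelengths` are assumptions for illustration only; the real file's sampling interval and contents may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for soil_spectra_teaching.csv:
# 5 samples on an assumed 10 nm grid from 400 to 2500 nm.
rng = np.random.default_rng(0)
wl_grid = np.arange(400, 2501, 10)
toy = pd.DataFrame(
    rng.uniform(0.1, 0.6, size=(5, wl_grid.size)),
    columns=[f"wl_{w}" for w in wl_grid],
)
toy["SOC"] = rng.uniform(0.5, 4.0, size=5)  # made-up target values, in %

# Spectral columns are identified by their "wl_" prefix ...
toy_spec_cols = [c for c in toy.columns if c.startswith("wl_")]
# ... and the wavelength (nm) is the number after the underscore.
toy_wavelengths = np.array([float(c.split("_")[1]) for c in toy_spec_cols])

print(len(toy_spec_cols), toy_wavelengths[0], toy_wavelengths[-1])
```

Parsing the wavelength out of the column name keeps the x-axis tied to the data, so the plots stay correct even if the file uses a different grid.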

---
Let's start by loading and exploring the data!



import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

DATA_PATH = Path("../data/soil_spectra_teaching.csv")
df = pd.read_csv(DATA_PATH)
df.head()


# 📊 Basic Information
print("Rows:", df.shape[0], " Columns:", df.shape[1])
spec_cols = [c for c in df.columns if c.startswith("wl_")]
print("Spectral columns detected:", len(spec_cols))
print("Target columns available:", [c for c in df.columns if c not in spec_cols])


# 🌈 Plot Raw Spectra
wavelengths = np.array([float(c.split('_')[1]) for c in spec_cols])
plt.figure(figsize=(8,5))
for i in range(min(50, len(df))):  # plot the first 50 spectra (or all of them if fewer)
    plt.plot(wavelengths, df.iloc[i][spec_cols].values, alpha=0.5)
plt.xlabel("Wavelength (nm)")
plt.ylabel("Reflectance")
plt.title("Raw Soil Spectra (Reflectance vs Wavelength)")
plt.show()


# ⚙️ Preprocessing — SNV + Savitzky–Golay
from scipy.signal import savgol_filter

def snv(mat):
    return (mat - mat.mean(axis=1, keepdims=True)) / (mat.std(axis=1, keepdims=True) + 1e-12)

X_raw = df[spec_cols].values.astype(float)
X_snv = snv(X_raw)
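# The window must not exceed the number of bands; keep it odd (polyorder=2 needs at least 3 points).
n_bands = X_snv.shape[1]
window = 11 if n_bands >= 21 else (n_bands if n_bands % 2 else n_bands - 1)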
X_sg = savgol_filter(X_snv, window_length=window, polyorder=2, deriv=1, axis=1)

plt.figure(figsize=(8,5))
for i in range(min(50, len(df))):
    plt.plot(wavelengths, X_sg[i], alpha=0.5)
plt.xlabel("Wavelength (nm)")
plt.ylabel("Processed Reflectance (1st Derivative)")
plt.title("Preprocessed Soil Spectra (SNV + SG)")
plt.show()


# 🔢 Prepare Data for Modeling (PLSR)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score, mean_squared_error

y = df['SOC'].values
X_train, X_test, y_train, y_test = train_test_split(X_sg, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

pls = PLSRegression(n_components=10)
pls.fit(X_train_s, y_train)
y_pred = pls.predict(X_test_s).ravel()

r2 = r2_score(y_test, y_pred)
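# Take the square root explicitly: the `squared=False` argument was deprecated and later removed in scikit-learn.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))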
print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")


# 📉 Parity (Scatter) Plot — Observed vs Predicted SOC
plt.figure(figsize=(5,5))
plt.scatter(y_test, y_pred, alpha=0.7, edgecolor='none')
mn, mx = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot([mn, mx], [mn, mx], '--', color='black')
plt.xlabel("Observed SOC (%)")
plt.ylabel("Predicted SOC (%)")
plt.title(f"PLSR Model — SOC Prediction (R²={r2:.2f})")
plt.tight_layout()
plt.show()


# 💭 Reflection Questions
1. How do the raw and preprocessed spectra differ visually?
2. Why is preprocessing important for spectral modeling?
3. What does R² tell you about model performance?
4. What other factors (like soil moisture or calibration transfer) could affect model accuracy?
5. Try changing `n_components` in the PLSR model — what happens?

---
### ✅ Next Steps
Now that you understand the basics:
- Explore `01_plsr_baseline.ipynb` for a more detailed PLSR workflow.  
- Try `02_cnn_1d.ipynb` to see how deep learning handles spectral data!
