# Feature Engineering — Soil & Fertilizer

1. Encode categoricals (Soil Type, Crop Type)
2. Add optional features (NPK total, Temp×Humidity, moisture bins)
3. Scale numeric features

Run the cells in order, starting with **Load data & clean**: every later cell depends on the `X` and `y` it defines.

## Load data & clean

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

DATA_PATH = "../data/data/raw/data_core.csv"
TARGET_COL = "Fertilizer Name"

df = pd.read_csv(DATA_PATH)
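# Impute missing values: numeric columns get their column mean,
# then any remaining (non-numeric) gaps are filled with "Unknown"
df = df.fillna(df.mean(numeric_only=True)).fillna("Unknown")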

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

print("Shape after load:", X.shape)
print("Columns:", X.columns.tolist())
X.head()
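
To make the two-step imputation concrete, here is a minimal sketch on a toy frame. The values are hypothetical and are not taken from `data_core.csv`; only the `fillna` chain mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric gap and one categorical gap
toy = pd.DataFrame({
    "Nitrogen": [10.0, np.nan, 30.0],
    "Soil Type": ["Sandy", None, "Clayey"],
})

# Step 1: numeric NaNs get the column mean; non-numeric columns are untouched
step1 = toy.fillna(toy.mean(numeric_only=True))

# Step 2: whatever is still missing (the categorical gap) becomes "Unknown"
filled = step1.fillna("Unknown")

print(filled)
```

Because the chain runs on `df` before the target is split off, a missing `Fertilizer Name` would also become the label `"Unknown"`. If that is not wanted, dropping rows with a missing target first is one option.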

## 1. Encode categorical variables (One-Hot)

One-hot encode `Soil Type` and `Crop Type` so that models receive purely numeric input.

In [None]:
# dtype=int yields 0/1 indicator columns (pandas 2.0+ would otherwise emit True/False)
X_encoded = pd.get_dummies(X, columns=["Soil Type", "Crop Type"], prefix=["Soil", "Crop"], dtype=int)
print("Shape after one-hot encoding:", X_encoded.shape)
print("New columns:", X_encoded.columns.tolist())
X_encoded.head()
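
As a small illustration of what the encoding produces, the sketch below runs the same `pd.get_dummies` call on a three-row toy frame. The soil and crop values are made up for the example and may not match the categories in the real data.

```python
import pandas as pd

# Hypothetical rows; the category names are placeholders
toy = pd.DataFrame({
    "Moisture": [38, 45, 62],
    "Soil Type": ["Sandy", "Loamy", "Sandy"],
    "Crop Type": ["Maize", "Wheat", "Maize"],
})

# Each listed column is replaced by one indicator column per category,
# named <prefix>_<category>; untouched columns stay in front
encoded = pd.get_dummies(toy, columns=["Soil Type", "Crop Type"], prefix=["Soil", "Crop"])

print(encoded.columns.tolist())
# ['Moisture', 'Soil_Loamy', 'Soil_Sandy', 'Crop_Maize', 'Crop_Wheat']
```

Every row has exactly one active indicator per original column, so the number of added columns equals the number of distinct categories.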

## 2. Optional new features

- **NPK total** (Nitrogen + Phosphorous + Potassium)
- **Temp × Humidity** (environment interaction)
- **Moisture level** (Low / Medium / High bins)

In [None]:
X_fe = X_encoded.copy()

# NPK total (nutrient load)
X_fe["NPK_total"] = X_fe["Nitrogen"] + X_fe["Phosphorous"] + X_fe["Potassium"]

# Temperature × Humidity interaction
X_fe["Temp_Humidity"] = X_fe["Temperature"] * X_fe["Humidity"]

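# Moisture bins: Low/Medium/High terciles (quantile-based, so roughly equal-sized groups)
X_fe["Moisture_level"] = pd.qcut(X_fe["Moisture"], q=3, labels=["Low", "Medium", "High"])
# One-hot encode the bins; dtype=int yields 0/1 columns (pandas 2.0+ would otherwise emit True/False)
X_fe = pd.get_dummies(X_fe, columns=["Moisture_level"], prefix="Moisture", dtype=int)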

print("Shape after new features:", X_fe.shape)
X_fe.head()
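
The quantile-based binning can be checked on a small series. This is only a sketch with hypothetical moisture readings; it shows that `pd.qcut` with `q=3` splits the values into three groups of roughly equal size rather than three ranges of equal width.

```python
import pandas as pd

# Nine hypothetical moisture readings, evenly spaced for readability
moisture = pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65], name="Moisture")

# Tercile cut: the bin edges are the 1/3 and 2/3 quantiles of the data
levels = pd.qcut(moisture, q=3, labels=["Low", "Medium", "High"])

print(levels.value_counts(sort=False))
```

If a column has many repeated values, two quantile edges can coincide and `pd.qcut` raises a `ValueError` about non-unique bin edges. In that case fixed thresholds with `pd.cut` may be a simpler choice.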

## 3. Scale numeric features

Scale the continuous columns so that features with large ranges do not dominate the model. Binary (0/1) columns are left as-is.

In [None]:
# Columns to scale: original numeric + new continuous features
to_scale = ["Temperature", "Humidity", "Moisture", "Nitrogen", "Potassium", "Phosphorous", "NPK_total", "Temp_Humidity"]
to_scale = [c for c in to_scale if c in X_fe.columns]

scaler = StandardScaler()
X_fe[to_scale] = scaler.fit_transform(X_fe[to_scale])

print("Scaled columns:", to_scale)
print("Final feature matrix shape:", X_fe.shape)
X_fe.head()

## 4. Final feature matrix & target

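Use `X_fe` and `y` for the train-test split and model training. Note that the scaler in step 3 was fit on all rows, so the scaling statistics include rows that will later land in the test split. If that matters for your evaluation, refit the scaler on the training rows only.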

In [None]:
print("Feature matrix X_fe:", X_fe.shape)
print("Target y:", y.shape)
print("\nAll feature columns:")
print(X_fe.columns.tolist())
print("\nTarget value counts:")
print(y.value_counts().head(10))
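
One possible way to do the split without leaking test-set statistics into the scaler is sketched below. It uses synthetic stand-in data (`X_demo`, `y_demo`, and the two class labels are hypothetical, not the notebook's `X_fe` and `y`), and the 80/20 split ratio and `random_state` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled feature matrix and target (hypothetical data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Temperature": rng.uniform(20, 40, size=100),
    "Humidity": rng.uniform(40, 80, size=100),
    "Soil_Sandy": rng.integers(0, 2, size=100),
})
y_demo = pd.Series(rng.choice(["Urea", "DAP"], size=100), name="Fertilizer Name")

# Split first; stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training rows only, then reuse it on the test rows
num_cols = ["Temperature", "Humidity"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print("Train:", X_train.shape, "Test:", X_test.shape)
```

The test rows are transformed with the training mean and standard deviation, so their scaled columns will not be exactly zero-mean. That is expected and mirrors how unseen data would be handled at prediction time.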