<a href="https://colab.research.google.com/github/BeastHunter0041/csci_4170_s26/blob/main/02_modeling_and_decision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1 (Week 2)

In this section, I (1) train an SVM-family model with light tuning using 3-fold cross validation, (2) compare it to the Week 1 baseline using the same split and preprocessing, and (3) perform decision thresholding under a simple cost model. I keep random seeds fixed for reproducibility and log each run (baseline + SVM variants). Finally, I calibrate SVM scores into probabilities (Platt scaling) and choose a classification threshold on the validation set that minimizes expected cost, then report metrics and confusion matrices at both the default (0.5) and chosen thresholds.

In [15]:

import numpy as np
import pandas as pd

df = pd.read_csv("/content/GL_FishBiodiversity_first_2100.csv")
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Shape:", df.shape)
df.head(3)



Shape: (2100, 93)


Unnamed: 0,Project Name,Field Number,Date,Day,Month,Year,Date fished,Waterbody Name,WaterbodyType,Arrival Time,...,Stream depth (m),depth_>_recorded,Water velocity (msec),Bin Number,Bin_time_s,Species,Number Captured,Caught > Number captured,Minimum (mm),Maximum (mm)
0,Ausable Channel Sampling 2002,AUCR02-01-03-BEF,24-Sep-02,24,9,2002,,Ausable Channel,Stream,,...,2.0,False,,0.0,,Micropterus nigricans,2.0,False,0.0,0.0
1,Ausable Channel Sampling 2002,AUCR02-01-03-BEF,24-Sep-02,24,9,2002,,Ausable Channel,Stream,,...,2.0,False,,0.0,,Lepomis peltastes,3.0,False,39.4,44.94
2,Ausable Channel Sampling 2002,AUCR02-01-03-BEF,24-Sep-02,24,9,2002,,Ausable Channel,Stream,,...,2.0,False,,0.0,,Notemigonus crysoleucas,2.0,False,32.6,68.7


### Why we convert this dataset into a binary classification task

This dataset’s Species column contains many categories (multi-class). Part D (decision thresholding with a 2×2 cost matrix) is cleanest in a binary setting, because false positives and false negatives are defined relative to one “positive” class. To keep the choice deterministic and reproducible, I derive the label from the Number Captured column: a row is positive (1) if at least one fish was captured and negative (0) otherwise. This allows threshold selection to be justified using a cost model.

In [3]:

TARGET = "Number Captured"
if TARGET not in df.columns:
    raise ValueError(f"Expected '{TARGET}' column not found.")

y = (df[TARGET] > 0).astype(int)

print("Target distribution (1 = captured/present, 0 = none/absent):")
print(y.value_counts())
print("Positive rate:", y.mean())


Target distribution (1 = captured/present, 0 = none/absent):
Number Captured
1    2082
0      18
Name: count, dtype: int64
Positive rate: 0.9914285714285714


To keep the prediction setup fair and valid, Week 1 drops columns that behave like identifiers, text that can encode location descriptions, the outcome itself, derived post-outcome fields, and post-catch measurements. I reuse the same drop list here to prevent leakage and stay consistent with Week 1.

In [4]:
DROP_COLS = [
    "Project Name", "Field Number", "Narrative Locality Description",
    "Species",                       # prevents trivial memorization / label proxy
    "Number Captured",               # target source
    "Caught > Number captured",      # post-outcome / derived
    "Minimum (mm)", "Maximum (mm)"   # post-catch measurements (post-outcome)
]

X = df.drop(columns=[c for c in DROP_COLS if c in df.columns], errors="ignore")

print("X shape:", X.shape)
print("Remaining columns sample:", X.columns[:10].tolist())


X shape: (2100, 85)
Remaining columns sample: ['Date', 'Day', 'Month', 'Year', 'Date fished', 'Waterbody Name', 'WaterbodyType', 'Arrival Time', 'Departure Time', 'Start Time']


### Train/Validation/Test split strategy (no leakage)

To compare models fairly, I use the same fixed split for the baseline and the SVM. I also keep a dedicated test set that is never used for tuning or threshold selection. The split is stratified to preserve the positive rate across subsets, which matters here because the dataset is heavily imbalanced. Concretely, 20% of the data is held out for test, and the remaining 80% is split again into train (60% of the total) and validation (20% of the total).

Week 1 used a single train/test split. For Week 2, I still hold out a final test set, but also create a validation set for tuning and threshold selection. This consistency is necessary for a fair comparison: the baseline and the SVM use exactly the same split, and stratification keeps the captured/not-captured ratio similar across subsets.

In [5]:
from sklearn.model_selection import train_test_split

# First, hold out a final test set (20%)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=RANDOM_STATE,
    stratify=y
)

# Then create a validation set from the training portion (20% of total)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval,
    test_size=0.25,   # 0.25 * 0.80 = 0.20
    random_state=RANDOM_STATE,
    stratify=y_trainval
)

print("Train:", X_train.shape, "Val:", X_val.shape, "Test:", X_test.shape)
print("Pos rate train/val/test:", y_train.mean(), y_val.mean(), y_test.mean())


Train: (1260, 85) Val: (420, 85) Test: (420, 85)
Pos rate train/val/test: 0.9920634920634921 0.9904761904761905 0.9904761904761905


### Preprocessing (consistent for all models)

SVMs are sensitive to feature scaling, so numeric features must be standardized. Real datasets also commonly have missing values and categorical columns. To ensure a fair comparison between models, I define one preprocessing pipeline and reuse it for baseline and all SVM variants:

- Numeric: median imputation → standardization
- Categorical: most-frequent imputation → one-hot encoding
- OneHotEncoder uses handle_unknown="ignore" so unseen categories in validation/test do not crash the pipeline.
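
As a small illustration of the last point, the sketch below fits an encoder on a toy column and then transforms a value it has never seen. The category values are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy category column (hypothetical values, for illustration only).
train_cats = np.array([["Stream"], ["Lake"], ["Stream"]])
new_cats = np.array([["Lake"], ["Wetland"]])  # "Wetland" was never seen in fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_cats)

encoded = enc.transform(new_cats).toarray()
print(enc.categories_[0])  # categories learned from train_cats
print(encoded)             # the unseen category becomes an all-zero row
```

With the default `handle_unknown="error"`, the same `transform` call would raise instead of returning an all-zero row.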

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np

num_cols = X_train.select_dtypes(include=[np.number]).columns
cat_cols = X_train.select_dtypes(exclude=[np.number]).columns

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols)
    ],
    remainder="drop"
)

print("Numeric features:", len(num_cols), "Categorical features:", len(cat_cols))


Numeric features: 66 Categorical features: 19


### Metrics, confusion matrix, and cost model

I evaluate models using standard classification metrics: accuracy, precision, recall, F1, and ROC-AUC. ROC-AUC is threshold-free and works with either probabilities or continuous scores, which is useful because SVMs may output uncalibrated decision scores. For Part D, I also define a simple cost model where:

- False positive cost = cost_fp
- False negative cost = cost_fn

The expected cost is then FP * cost_fp + FN * cost_fn, where FP and FN are the false-positive and false-negative counts from the confusion matrix. Threshold selection minimizes this cost on validation predictions.
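
To make the arithmetic concrete, here is a minimal sketch of the cost formula on hypothetical error counts, using cost_fp = 1 and cost_fn = 5. It also computes the threshold that standard decision theory would suggest if the scores were perfectly calibrated probabilities. That last part is an assumption about the scores, not something the notebook verifies, and the helper name `expected_cost_from_counts` is made up for this example.

```python
def expected_cost_from_counts(fp, fn, cost_fp=1.0, cost_fn=5.0):
    """Total cost of the errors summarized by FP and FN counts."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical counts, chosen only to illustrate the arithmetic.
cost_a = expected_cost_from_counts(fp=4, fn=0)  # 4*1 + 0*5
cost_b = expected_cost_from_counts(fp=1, fn=2)  # 1*1 + 2*5

# If p is a calibrated probability of the positive class, predicting positive
# costs (1 - p) * cost_fp in expectation and predicting negative costs
# p * cost_fn, so positive is the cheaper call when
# p >= cost_fp / (cost_fp + cost_fn).
cost_fp, cost_fn = 1.0, 5.0
theoretical_threshold = cost_fp / (cost_fp + cost_fn)

print(cost_a, cost_b, round(theoretical_threshold, 3))
```

Because false negatives are five times as expensive here, the theoretical threshold sits well below 0.5. An empirical sweep on a small validation set may land elsewhere, especially when many thresholds produce the same confusion matrix.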

In [7]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix
)
import numpy as np

def eval_at_threshold(y_true, y_score, threshold=0.5):
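    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    # Pass labels explicitly so the matrix is always 2x2, even if a class is missing.
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # [[TN, FP],[FN, TP]]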
    tn, fp, fn, tp = cm.ravel()
    metrics = {
        "threshold": float(threshold),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_score),
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp)
    }
    return metrics, cm

def expected_cost(cm, cost_fp=1.0, cost_fn=5.0):
    tn, fp, fn, tp = cm.ravel()
    return fp * cost_fp + fn * cost_fn

experiment_log = []


### Week 1 baseline model (KNN, k=7)
My Lab 1 baseline model is KNN with n_neighbors=7 wrapped in the same preprocessing pipeline. I refit it on the Week 2 train split and evaluate on validation. This is the reference used for the Week 2 comparison.
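
One property of this baseline is worth keeping in mind when reading its ROC-AUC and thresholded metrics: with uniform weights, a k=7 classifier's `predict_proba` is the fraction of the 7 neighbors in each class, so it can take at most 8 distinct values. The sketch below shows this on synthetic data (the toy arrays are assumptions, not the lab dataset).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_toy, y_toy)
proba = knn.predict_proba(X_toy)[:, 1]

# Each probability is (number of positive neighbors) / 7.
neighbor_counts = np.unique(np.round(proba * 7).astype(int))
print(neighbor_counts)
```

Any two thresholds that fall between the same pair of adjacent levels (for example 0.45 and 0.55, both between 3/7 and 4/7) give identical predictions.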



In [8]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

baseline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=7))
])

baseline.fit(X_train, y_train)

# KNN supports predict_proba
val_proba_base = baseline.predict_proba(X_val)[:, 1]
val_metrics_base, val_cm_base = eval_at_threshold(y_val, val_proba_base, threshold=0.5)

experiment_log.append({
    "run": "baseline_knn_k7",
    "model": "KNeighborsClassifier",
    "params": {"n_neighbors": 7},
    "seed": RANDOM_STATE,
    "val_f1@0.5": val_metrics_base["f1"],
    "val_roc_auc": val_metrics_base["roc_auc"]
})

val_metrics_base, val_cm_base


 'Turbidity (ntu)' 'Sample Area Width (m)' 'Sample Area Length (m)'
 'Dominant Vegetation' 'Bank Slope (degrees)' 'Wind Speed (km/h)'
 'TDS (g/L)' 'Salinity' 'Bedrock' 'Hardpan' 'Concrete' 'Unknown_substrate'
 'Substrate_not_determined' 'Not_recorded_aqua_veg' 'Unknown_aqua_veg'
 'None' 'Unknown' 'Not recorded' 'Water velocity (msec)' 'Bin_time_s']. At least one non-missing value is needed for imputation with strategy='median'.


({'threshold': 0.5,
  'accuracy': 0.9904761904761905,
  'precision': 0.9904761904761905,
  'recall': 1.0,
  'f1': 0.9952153110047847,
  'roc_auc': np.float64(0.7247596153846154),
  'tn': 0,
  'fp': 4,
  'fn': 0,
  'tp': 416},
 array([[  0,   4],
        [  0, 416]]))

### SVM light tuning with 3-fold CV
I evaluate a small set of SVM-family configurations using 3-fold stratified cross validation on the training set only. This meets the “light tuning” requirement without over-searching hyperparameters. I pick the best variant by mean CV F1; mean ROC-AUC, which is also recorded, would be a reasonable alternative criterion.

In [17]:
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
import numpy as np
import pandas as pd

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

svm_variants = [
    ("LinearSVC_C0.1", LinearSVC(C=0.1, class_weight="balanced", random_state=RANDOM_STATE)),
    ("LinearSVC_C1.0", LinearSVC(C=1.0, class_weight="balanced", random_state=RANDOM_STATE)),
    ("RBF_SVC_C10",    SVC(C=10.0, kernel="rbf", gamma="scale",
                           class_weight="balanced", random_state=RANDOM_STATE))
]

cv_rows = []
for name, model in svm_variants:
    pipe = Pipeline([
        ("preprocess", preprocess),
        ("model", model)
    ])
    scores = cross_validate(
        pipe, X_train, y_train,
        cv=cv,
        scoring=["f1", "roc_auc", "recall", "precision"],
        n_jobs=-1
    )
    cv_rows.append({
        "run": name,
        "mean_f1": float(np.mean(scores["test_f1"])),
        "mean_auc": float(np.mean(scores["test_roc_auc"])),
        "mean_recall": float(np.mean(scores["test_recall"])),
        "mean_precision": float(np.mean(scores["test_precision"]))
    })

cv_df = pd.DataFrame(cv_rows).sort_values("mean_f1", ascending=False)
cv_df


Unnamed: 0,run,mean_f1,mean_auc,mean_recall,mean_precision
1,LinearSVC_C1.0,0.974294,0.427283,0.957617,0.99173
2,RBF_SVC_C10,0.974294,0.439458,0.957617,0.99173
0,LinearSVC_C0.1,0.971397,0.470904,0.952011,0.991678


In [18]:
now = pd.Timestamp.now()

# Log each SVM variant from CV
for row in cv_df.to_dict(orient="records"):
    experiment_log.append({
        "run": row["run"],
        "model": "SVM_variant",
        "params": dict(svm_variants)[row["run"]].get_params(),
        "seed": RANDOM_STATE,
        "timestamp": now,
        "cv_mean_f1": row["mean_f1"],
        "cv_mean_auc": row["mean_auc"],
        "cv_mean_recall": row["mean_recall"],
        "cv_mean_precision": row["mean_precision"],
        "eval_source": "3-fold CV on train"
    })

# The baseline run was already logged (without a timestamp) when it was evaluated,
# so it is not appended again here. Re-running this cell appends the SVM rows again.


### Fit best SVM and compare to baseline on validation
After selecting the best SVM variant, I train it on the full training set and evaluate on validation. Before calibration, LinearSVC outputs decision scores (not probabilities), so I compare using ROC-AUC, which is valid for scores and doesn’t depend on a threshold.
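
The reason ROC-AUC is safe to use on raw decision scores is that it depends only on how the examples are ranked. The sketch below checks this on synthetic scores: applying a strictly increasing transform (here a logistic squashing) leaves the AUC unchanged. The data and variable names are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and decision scores, for illustration only.
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)
raw_scores = rng.normal(size=300) + 1.5 * y_toy

# Strictly increasing map from real-valued scores into (0, 1).
squashed = 1.0 / (1.0 + np.exp(-raw_scores))

auc_raw = roc_auc_score(y_toy, raw_scores)
auc_squashed = roc_auc_score(y_toy, squashed)
print(auc_raw, auc_squashed)
```

This invariance holds only for a monotone transform of the same scores. A model that is refit, for example inside cross-validated calibration, produces different rankings and can therefore have a different AUC.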

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

best_run = cv_df.iloc[0]["run"]
best_model = dict(svm_variants)[best_run]

svm_pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", best_model)
])

svm_pipe.fit(X_train, y_train)

svm_val_scores = svm_pipe.decision_function(X_val)
svm_val_auc = roc_auc_score(y_val, svm_val_scores)

experiment_log.append({
    "run": best_run,
    "model": type(best_model).__name__,
    "params": best_model.get_params(),
    "seed": RANDOM_STATE,
    "val_roc_auc_scores": float(svm_val_auc)
})

print("Baseline (KNN) val ROC-AUC:", val_metrics_base["roc_auc"])
print("SVM val ROC-AUC (scores):  ", svm_val_auc)




Baseline (KNN) val ROC-AUC: 0.7247596153846154
SVM val ROC-AUC (scores):   0.912860576923077




### Display experiment log

In [26]:
import pandas as pd


# Convert experiment log to DataFrame
exp_log_df = pd.DataFrame(experiment_log)

exp_log_df

# Write to CSV
#output_path = "experiment_log_week2.csv"
#exp_log_df.to_csv(output_path, index=False)

#print(f"Experiment log saved to: {output_path}")

Unnamed: 0,run,model,params,seed,val_f1@0.5,val_roc_auc,val_roc_auc_scores,timestamp,cv_mean_f1,cv_mean_auc,cv_mean_recall,cv_mean_precision,eval_source
0,baseline_knn_k7,KNeighborsClassifier,{'n_neighbors': 7},42,0.995215,0.72476,,NaT,,,,,
1,LinearSVC_C1.0,LinearSVC,"{'C': 1.0, 'class_weight': 'balanced', 'dual':...",42,,,0.912861,NaT,,,,,
2,LinearSVC_C1.0,SVM_variant,"{'C': 1.0, 'class_weight': 'balanced', 'dual':...",42,,,,2026-02-05 16:05:34.140178,0.974294,0.427283,0.957617,0.99173,3-fold CV on train
3,RBF_SVC_C10,SVM_variant,"{'C': 10.0, 'break_ties': False, 'cache_size':...",42,,,,2026-02-05 16:05:34.140178,0.974294,0.439458,0.957617,0.99173,3-fold CV on train
4,LinearSVC_C0.1,SVM_variant,"{'C': 0.1, 'class_weight': 'balanced', 'dual':...",42,,,,2026-02-05 16:05:34.140178,0.971397,0.470904,0.952011,0.991678,3-fold CV on train
5,LinearSVC_C1.0,SVM_variant,"{'C': 1.0, 'class_weight': 'balanced', 'dual':...",42,,,,2026-02-05 16:05:56.138672,0.974294,0.427283,0.957617,0.99173,3-fold CV on train
6,RBF_SVC_C10,SVM_variant,"{'C': 10.0, 'break_ties': False, 'cache_size':...",42,,,,2026-02-05 16:05:56.138672,0.974294,0.439458,0.957617,0.99173,3-fold CV on train
7,LinearSVC_C0.1,SVM_variant,"{'C': 0.1, 'class_weight': 'balanced', 'dual':...",42,,,,2026-02-05 16:05:56.138672,0.971397,0.470904,0.952011,0.991678,3-fold CV on train
8,LinearSVC_C1.0,LinearSVC,"{'C': 1.0, 'class_weight': 'balanced', 'dual':...",42,,,0.912861,NaT,,,,,


### Calibration (Platt scaling) to get probabilities

Thresholding under a cost model is most interpretable using probabilities. Linear SVM scores are not calibrated by default, so I use CalibratedClassifierCV(method="sigmoid") (Platt scaling) to map scores to probabilities using cross validation on the training set. If calibration were skipped, I would need to explicitly state that thresholding uses uncalibrated scores.
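
For intuition about what the sigmoid method does, the sketch below approximates Platt scaling by hand: it fits a one-feature logistic regression that maps SVM decision scores to probabilities. This is a simplified stand-in under stated assumptions. It uses synthetic data and a single held-out calibration split, and scikit-learn's own sigmoid calibration differs in details such as target smoothing and the absence of regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Fit the SVM on one half; learn the sigmoid on the other half's scores.
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_cal, y_cal = X_toy[200:], y_toy[200:]

svm = LinearSVC(C=1.0, random_state=0).fit(X_fit, y_fit)
cal_scores = svm.decision_function(X_cal).reshape(-1, 1)

# One-feature logistic regression: p = sigmoid(a * score + b).
platt = LogisticRegression().fit(cal_scores, y_cal)
proba = platt.predict_proba(cal_scores)[:, 1]

print("slope a:", platt.coef_[0, 0], "intercept b:", platt.intercept_[0])
```

The calibration scores come from data the SVM was not trained on, which is the same idea behind the `cv` argument of `CalibratedClassifierCV`.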

In [12]:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline

calibrated_svm = Pipeline([
    ("preprocess", preprocess),
    ("cal", CalibratedClassifierCV(
        estimator=best_model,
        method="sigmoid",
        cv=3
    ))
])

calibrated_svm.fit(X_train, y_train)

svm_val_proba = calibrated_svm.predict_proba(X_val)[:, 1]
svm_val_metrics_default, svm_val_cm_default = eval_at_threshold(y_val, svm_val_proba, threshold=0.5)

svm_val_metrics_default, svm_val_cm_default




({'threshold': 0.5,
  'accuracy': 0.9904761904761905,
  'precision': 0.9904761904761905,
  'recall': 1.0,
  'f1': 0.9952153110047847,
  'roc_auc': np.float64(0.671875),
  'tn': 0,
  'fp': 4,
  'fn': 0,
  'tp': 416},
 array([[  0,   4],
        [  0, 416]]))

### Choose threshold by minimizing expected cost (validation)

I define a simple 2×2 cost model where false negatives are more expensive than false positives. I then sweep thresholds on validation probabilities and select the threshold that minimizes expected cost.

In [13]:
import numpy as np
import pandas as pd

cost_fp = 1.0
cost_fn = 5.0

rows = []
for t in np.linspace(0.0, 1.0, 501):
    m, cm = eval_at_threshold(y_val, svm_val_proba, threshold=t)
    rows.append({
        **m,
        "expected_cost": float(expected_cost(cm, cost_fp=cost_fp, cost_fn=cost_fn))
    })

thr_df = pd.DataFrame(rows).sort_values("expected_cost", ascending=True)
best_threshold = float(thr_df.iloc[0]["threshold"])

best_threshold, thr_df.head(5)


(0.654,
      threshold  accuracy  precision  recall        f1   roc_auc  tn  fp  fn  \
 327      0.654  0.990476   0.990476     1.0  0.995215  0.671875   0   4   0   
 342      0.684  0.990476   0.990476     1.0  0.995215  0.671875   0   4   0   
 341      0.682  0.990476   0.990476     1.0  0.995215  0.671875   0   4   0   
 340      0.680  0.990476   0.990476     1.0  0.995215  0.671875   0   4   0   
 339      0.678  0.990476   0.990476     1.0  0.995215  0.671875   0   4   0   
 
       tp  expected_cost  
 327  416            4.0  
 342  416            4.0  
 341  416            4.0  
 340  416            4.0  
 339  416            4.0  )
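
The top rows of the sweep all share the same expected cost, so which threshold ends up in row 0 depends on how the sort happens to order ties. One way to make the choice explicit is to collect every cost-minimizing threshold and apply a stated tie-break rule. The sketch below uses a hypothetical sweep table and an assumed rule (prefer the tied threshold closest to the default 0.5). Other rules, such as taking the midpoint of the tied range, would be equally defensible.

```python
import numpy as np
import pandas as pd

# Hypothetical sweep results with a flat region of tied minimum cost.
sweep = pd.DataFrame({
    "threshold": np.round(np.linspace(0.0, 1.0, 11), 1),
    "expected_cost": [4, 4, 4, 4, 4, 4, 4, 9, 14, 50, 200],
})

min_cost = sweep["expected_cost"].min()
tied = sweep.loc[sweep["expected_cost"] == min_cost, "threshold"]

# Assumed tie-break: prefer the tied threshold closest to the default 0.5.
chosen = float(tied.iloc[(tied - 0.5).abs().argmin()])
print(tied.tolist(), chosen)
```

Reporting the full tied range alongside the chosen value makes it clear when the validation set cannot distinguish between thresholds.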

### Report metrics at default vs chosen threshold (test set)

To avoid biased reporting, I evaluate both thresholds on the held-out test set and report metrics and confusion matrices at:

- default threshold = 0.5
- chosen threshold = cost-minimizing threshold from validation

In [14]:
svm_test_proba = calibrated_svm.predict_proba(X_test)[:, 1]

test_default_metrics, test_default_cm = eval_at_threshold(y_test, svm_test_proba, threshold=0.5)
test_best_metrics, test_best_cm = eval_at_threshold(y_test, svm_test_proba, threshold=best_threshold)

test_default_cost = expected_cost(test_default_cm, cost_fp=cost_fp, cost_fn=cost_fn)
test_best_cost = expected_cost(test_best_cm, cost_fp=cost_fp, cost_fn=cost_fn)

print("=== TEST @ threshold 0.5 ===")
print(test_default_metrics)
print("Expected cost:", test_default_cost)
print("Confusion matrix [[TN, FP],[FN, TP]]:\n", test_default_cm)

print("\n=== TEST @ chosen threshold ===")
print(test_best_metrics)
print("Expected cost:", test_best_cost)
print("Confusion matrix [[TN, FP],[FN, TP]]:\n", test_best_cm)


=== TEST @ threshold 0.5 ===
{'threshold': 0.5, 'accuracy': 0.9904761904761905, 'precision': 0.9904761904761905, 'recall': 1.0, 'f1': 0.9952153110047847, 'roc_auc': np.float64(0.6938100961538461), 'tn': 0, 'fp': 4, 'fn': 0, 'tp': 416}
Expected cost: 4.0
Confusion matrix [[TN, FP],[FN, TP]]:
 [[  0   4]
 [  0 416]]

=== TEST @ chosen threshold ===
{'threshold': 0.654, 'accuracy': 0.9904761904761905, 'precision': 0.9904761904761905, 'recall': 1.0, 'f1': 0.9952153110047847, 'roc_auc': np.float64(0.6938100961538461), 'tn': 0, 'fp': 4, 'fn': 0, 'tp': 416}
Expected cost: 4.0
Confusion matrix [[TN, FP],[FN, TP]]:
 [[  0   4]
 [  0 416]]


