In [24]:
import pandas as pd

# Load the transformed dataset
df = pd.read_csv("../../data/processed/train_dataset_formatted_no_missing_transformed.csv")
df.head()

| | longitude | latitude | source | gravity_iso_residual | gravity_cscba | gravity_cscba_1vd | mag_uc_1_2km | mag_uc_2_4km | mag_uc_4_8km | mag_uc_8_12km | ... | radio_u_ppm_log | radio_k_pct_log | radio_u_th_ratio_log | radio_th_k_ratio_log | radio_u_k_ratio_log | mag_uc_1_2km_clipped | mag_uc_2_4km_clipped | mag_uc_4_8km_clipped | mag_uc_8_12km_clipped | mag_uc_12_16km_clipped |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 134.324653 | -27.294063 | blank_area | -204.018005 | -426.38882 | -309.04285 | -19.665098 | -33.036575 | -46.971394 | -30.871347 | ... | 0.786896 | 0.564769 | 0.120143 | 2.592046 | 0.946886 | -19.665098 | -33.036575 | -46.971394 | -30.871347 | -21.795435 |
| 1 | 148.050504 | -32.937903 | positive | 93.298981 | -186.38301 | -2674.9626 | -44.330212 | 23.895699 | 107.432144 | 82.715973 | ... | 0.874188 | 0.767738 | 0.151853 | 2.132374 | 0.795274 | -44.330212 | 23.895699 | 107.432144 | 82.715973 | 57.587337 |
| 2 | 119.0271 | -22.9757 | other_deposit | -200.687836 | -739.63226 | -1203.5088 | -442.748383 | -354.920288 | -211.401382 | -84.787079 | ... | 0.45261 | 0.11898 | 0.16455 | 2.83828 | 1.351199 | -211.120219 | -161.594283 | -139.436635 | -71.818292 | -49.249933 |
| 3 | 121.464232 | -23.649192 | blank_area | -163.918274 | -592.99493 | -536.7037 | 18.632324 | 31.867907 | 50.295372 | 35.592964 | ... | 0.314299 | 0.169076 | 0.107061 | 2.853893 | 1.046089 | 18.632324 | 31.867907 | 50.295372 | 35.592964 | 25.073792 |
| 4 | 142.4699 | -35.1686 | other_deposit | -81.172989 | -139.28423 | -507.28357 | -0.266748 | -0.374094 | -0.820666 | -0.999457 | ... | 0.669835 | 0.345898 | 0.182548 | 2.530576 | 1.19728 | -0.266748 | -0.374094 | -0.820666 | -0.999457 | -1.124462 |


## 1. Apply Feature Scaling in Geoscience-Based Machine Learning

In traditional geoscientific workflows, raw feature values (e.g., gravity anomalies in mGal, radiometric concentrations in ppm) are preserved to maintain their **physical interpretability**. However, in machine learning-based mineral exploration, feature scaling becomes a critical preprocessing step for the following reasons:

### 1.1. Numerical Stability and Model Compatibility
- Distance-, gradient-, and variance-based algorithms (e.g., SVM, KNN, Logistic Regression, PCA) are sensitive to feature magnitude and work best when inputs are on comparable numerical scales.
- Unscaled geoscience features may span vastly different ranges (e.g., gravity from -1200 to +600 mGal vs. radiometric K from 0 to 4.5%).
- Without scaling, large-magnitude features can dominate distance calculations and gradient updates simply because of their units, not because they are more informative.
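
As a rough illustration of the last point, the sketch below compares Euclidean distances before and after min-max scaling. The three samples are hypothetical values picked inside the ranges quoted above, not rows from the dataset, and the `scale` helper is introduced only for this example.

```python
import math

# Hypothetical samples: (gravity anomaly in mGal, radiometric K in %).
a = (-300.0, 0.5)
b = (-290.0, 4.0)   # similar gravity to a, very different K
c = (-200.0, 0.6)   # different gravity from a, similar K

# Feature ranges quoted in the text above: gravity, then K.
ranges = [(-1200.0, 600.0), (0.0, 4.5)]

def scale(sample, ranges):
    """Min-max scale each value to [0, 1] using a (low, high) pair per feature."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(sample, ranges))

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(scale(a, ranges), scale(b, ranges))
scaled_ac = math.dist(scale(a, ranges), scale(c, ranges))

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, `b` looks closer to `a` than `c` does, because the 10 mGal gravity difference outweighs the large difference in K. After scaling the ordering flips, which is the behavior a distance-based model such as KNN would inherit.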

### 1.2. Handling Skewed Distributions and Outliers
- Geochemical features and ratios often exhibit heavy right-skew and extreme values.
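- Applying a log transform followed by `RobustScaler` compresses the right tail and limits the influence of rare anomalies, because `RobustScaler` centers on the median and scales by the interquartile range rather than the mean and standard deviation.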
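
The following sketch shows why median/IQR statistics are less affected by a single anomaly than the standard deviation. The values are made up for illustration, and the statistics are computed directly with NumPy rather than through scikit-learn; `RobustScaler` with default settings uses the same median and 25th to 75th percentile range.

```python
import numpy as np

# Hypothetical right-skewed ratio values with one extreme anomaly (illustrative only).
values = np.array([0.8, 1.0, 1.1, 1.3, 1.5, 1.7, 2.0, 60.0])
logged = np.log1p(values)  # log(1 + x) compresses the long right tail

def spread_stats(x):
    """Return (standard deviation, interquartile range) of a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return x.std(), q3 - q1

std_all, iqr_all = spread_stats(logged)          # anomaly included
std_core, iqr_core = spread_stats(logged[:-1])   # anomaly removed

# Relative change in each spread estimate caused by the single anomaly
std_shift = abs(std_all - std_core) / std_core
iqr_shift = abs(iqr_all - iqr_core) / iqr_core

print(f"std changes by {std_shift:.0%}, IQR changes by {iqr_shift:.0%}")
```

Even after the log transform, the standard deviation changes several-fold when the anomaly is included, while the IQR moves far less. A scaler based on the IQR therefore keeps the bulk of the data on a stable scale.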

### 1.3. Improving Model Interpretability and Convergence
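- Putting features on a common scale makes coefficient magnitudes in linear models comparable across features, keeps regularization penalties from depending on measurement units, and typically speeds up convergence in gradient-based models.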

### 1.4. When to Retain Physical Meaning

If **preserving the physical units** of certain features (e.g., gravity anomalies for geophysical interpretation or map visualization) is necessary, **do not overwrite the original values**. Instead:

- Store **scaled features in a separate dataset** (`train_dataset_scaled.csv`).
- Retain **raw features in parallel** (`train_dataset_original.csv`) for interpretability, post-analysis, or visual validation.
- Use scaled features **only for modeling**, not for geological interpretation or cross-project reuse.

This separation ensures that modeling accuracy and geoscientific explainability are both preserved.
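
A minimal sketch of this separation is shown below, assuming a small hypothetical table. The column names follow the dataset, but the values and the variable names `df_raw` and `df_model` are illustrative only. Scaling is done on a copy, and the fitted minimum and maximum are kept so that model-space values can be mapped back to physical units.

```python
import pandas as pd

# Hypothetical raw table; values are made up for illustration.
df_raw = pd.DataFrame({
    "gravity_cscba": [-426.4, -186.4, -739.6, -593.0],
    "radio_k_pct_log": [0.56, 0.77, 0.12, 0.17],
})

# Work on a copy so the raw values in df_raw are never overwritten.
df_model = df_raw.copy()
col = "radio_k_pct_log"
col_min, col_max = df_raw[col].min(), df_raw[col].max()
df_model[col] = (df_raw[col] - col_min) / (col_max - col_min)

# With the fitted parameters stored, scaled values can be mapped back to raw units.
recovered = df_model[col] * (col_max - col_min) + col_min

print(df_raw[col].tolist())               # raw values, unchanged
print(df_model[col].round(3).tolist())    # scaled values in [0, 1]
```

In practice each frame would be written to its own file with `to_csv`, so interpretation and modeling never read from the same table.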


## 2. Select Features for Scaling

Only features selected for final modeling input are scaled.  
Original features with corresponding `*_log` or `*_clipped` versions are **excluded from scaling** to avoid unnecessary processing and reduce duplication.

This ensures clarity, modeling efficiency, and cleaner datasets.


In [25]:
# Drop non-feature columns
drop_cols = ['label', 'source', 'longitude', 'latitude']
feature_cols = [col for col in df.columns if col not in drop_cols and df[col].dtype in ['float64', 'int64']]

def remove_redundant_originals(columns):
    filtered = []
    for col in columns:
        if (col + "_log") in columns or (col + "_clipped") in columns:
            continue
        filtered.append(col)
    return filtered

final_model_features = remove_redundant_originals(feature_cols)
print("Selected numeric features for scaling:", final_model_features)

Selected numeric features for scaling: ['gravity_iso_residual', 'gravity_cscba', 'gravity_cscba_1vd', 'radio_th_ppm_log', 'radio_u_ppm_log', 'radio_k_pct_log', 'radio_u_th_ratio_log', 'radio_th_k_ratio_log', 'radio_u_k_ratio_log', 'mag_uc_1_2km_clipped', 'mag_uc_2_4km_clipped', 'mag_uc_4_8km_clipped', 'mag_uc_8_12km_clipped', 'mag_uc_12_16km_clipped']


From a geoscience perspective, the choice of scaler must account for both the statistical properties and the geological meaning of each variable. The rules below are applied per feature, based on measured skewness rather than on data type alone. Strongly skewed features, which here include most clipped magnetic bands and the uranium-related radiometric features, receive `RobustScaler`. Near-symmetric features such as `radio_th_ppm_log` receive `StandardScaler`. Moderately skewed transformed features receive `MinMaxScaler`. Gravity features are left in their original scale to preserve their physical interpretability.

1. Is the feature a gravity-derived variable that should remain geophysically interpretable?
   - Yes → Retain the original values unless scaling is mandatory for the model
   - No → Proceed

2. Is the feature strongly skewed (|skewness| > 1) or does it have extreme outliers?
   - Yes → Use `RobustScaler`
   - No → Proceed

3. Is the feature approximately symmetric (|skewness| < 0.5, used here as a proxy for near-normality)?
   - Yes → Use `StandardScaler`
   - No → Use `MinMaxScaler` if the feature is already log-transformed or clipped; otherwise fall back to `StandardScaler`

In [26]:
import pandas as pd
from scipy.stats import skew

# Evaluate skewness
skewness = df[final_model_features].apply(skew).sort_values(ascending=False)

# Classify based on rules
scaler_recommendations = {}

for col in final_model_features:
    skew_val = skewness[col]
    # Flag covers every pre-transformed column (log-transformed or clipped)
    is_log_transformed = col.endswith(('_log', '_clipped'))

    # Gravity features keep their physical units
    if 'gravity' in col:
        scaler_recommendations[col] = 'KEEP ORIGINAL (GEOPHYSICAL)'
    elif abs(skew_val) > 1:
        scaler_recommendations[col] = 'RobustScaler'
    elif abs(skew_val) < 0.5:
        scaler_recommendations[col] = 'StandardScaler'
    elif is_log_transformed:
        scaler_recommendations[col] = 'MinMaxScaler (log-transformed)'
    else:
        scaler_recommendations[col] = 'StandardScaler (fallback)'

# Output as DataFrame for inspection
scaler_df = pd.DataFrame({
    'Feature': scaler_recommendations.keys(),
    'Skewness': [round(skewness[col], 3) for col in scaler_recommendations.keys()],
    'Recommended Scaler': scaler_recommendations.values()
})

print("Recommended Scalers by Feature:")
display(scaler_df)  


Recommended Scalers by Feature:


| | Feature | Skewness | Recommended Scaler |
|---|---|---|---|
| 0 | gravity_iso_residual | -0.944 | KEEP ORIGINAL (GEOPHYSICAL) |
| 1 | gravity_cscba | -0.766 | KEEP ORIGINAL (GEOPHYSICAL) |
| 2 | gravity_cscba_1vd | 0.232 | KEEP ORIGINAL (GEOPHYSICAL) |
| 3 | radio_th_ppm_log | -0.141 | StandardScaler |
| 4 | radio_u_ppm_log | 1.093 | RobustScaler |
| 5 | radio_k_pct_log | 0.565 | MinMaxScaler (log-transformed) |
| 6 | radio_u_th_ratio_log | 6.692 | RobustScaler |
| 7 | radio_th_k_ratio_log | 0.853 | MinMaxScaler (log-transformed) |
| 8 | radio_u_k_ratio_log | 1.262 | RobustScaler |
| 9 | mag_uc_1_2km_clipped | 2.619 | RobustScaler |
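
For reference, the skewness statistic behind this table can be sketched in plain NumPy. With default settings, `scipy.stats.skew` returns the biased Fisher-Pearson coefficient, the third central moment divided by the second central moment raised to the power 1.5. The helper names `sample_skewness` and `recommend` and the two toy arrays are assumptions made for this example. The gravity check from the cell above is left out for brevity.

```python
import numpy as np

def sample_skewness(x):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 using population moments."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    return m3 / m2 ** 1.5

def recommend(skew_val, is_transformed):
    """Threshold rule mirroring the classification cell (gravity check omitted)."""
    if abs(skew_val) > 1:
        return "RobustScaler"
    if abs(skew_val) < 0.5:
        return "StandardScaler"
    return "MinMaxScaler" if is_transformed else "StandardScaler (fallback)"

# Toy data: one symmetric sample and one with a single large value.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 1.0, 10.0]

for name, data in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    g1 = sample_skewness(data)
    print(name, round(g1, 3), recommend(g1, is_transformed=False))
```

The symmetric sample has zero skewness and falls under `StandardScaler`, while the single large value pushes the second sample above the threshold of 1 and into the `RobustScaler` group.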


In [27]:
# Define Feature Groups Based on Skewness Analysis

# Columns to exclude
exclude_cols = ['label', 'source', 'longitude', 'latitude']

# Grouped based on geophysical knowledge and skewness analysis
keep_original = [
    'gravity_iso_residual', 
    'gravity_cscba', 
    'gravity_cscba_1vd'
]

standard_scale_cols = [
    'radio_th_ppm_log'
]

robust_scale_cols = [
    'radio_u_ppm_log',
    'radio_u_th_ratio_log',
    'radio_u_k_ratio_log',
    'mag_uc_1_2km_clipped',
    'mag_uc_2_4km_clipped',
    'mag_uc_4_8km_clipped',
    'mag_uc_8_12km_clipped'
]

minmax_scale_cols = [
    'radio_k_pct_log',
    'radio_th_k_ratio_log',
    'mag_uc_12_16km_clipped'
]
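
The groups above are typed by hand. As a hedged alternative, they could be derived from the recommendation labels so that the lists stay in sync with the skewness analysis. The sketch below uses a hypothetical `recommendations` dictionary holding a subset of the labels from the table above; in the notebook, the `scaler_recommendations` dictionary would play that role.

```python
from collections import defaultdict

# Subset of the recommendation labels, copied here for illustration.
recommendations = {
    "gravity_cscba": "KEEP ORIGINAL (GEOPHYSICAL)",
    "radio_th_ppm_log": "StandardScaler",
    "radio_u_ppm_log": "RobustScaler",
    "radio_k_pct_log": "MinMaxScaler (log-transformed)",
    "mag_uc_1_2km_clipped": "RobustScaler",
}

groups = defaultdict(list)
for feature, label in recommendations.items():
    # The first word of each label names the scaler ("KEEP" means no scaling).
    groups[label.split()[0]].append(feature)

# Every feature should land in exactly one group.
all_grouped = [f for cols in groups.values() for f in cols]
assert sorted(all_grouped) == sorted(recommendations)

for scaler_name, cols in groups.items():
    print(scaler_name, cols)
```

Splitting on the first word also folds the `StandardScaler (fallback)` label into the `StandardScaler` group.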

In [29]:
# Apply Scalers Accordingly
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler

df_scaled = df.copy()

# Apply RobustScaler
robust_scaler = RobustScaler()
df_scaled[robust_scale_cols] = robust_scaler.fit_transform(df_scaled[robust_scale_cols])

# Apply StandardScaler
standard_scaler = StandardScaler()
df_scaled[standard_scale_cols] = standard_scaler.fit_transform(df_scaled[standard_scale_cols])

# Apply MinMaxScaler
minmax_scaler = MinMaxScaler()
df_scaled[minmax_scale_cols] = minmax_scaler.fit_transform(df_scaled[minmax_scale_cols])

print("Applied RobustScaler, StandardScaler and MinMaxScaler based on feature grouping.")

Applied RobustScaler, StandardScaler and MinMaxScaler based on feature grouping.
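
The cell above fits each scaler on the training table, which is appropriate for training data. If a held-out set is scaled later (an assumption, since none appears in this notebook), the statistics learned from training should be reused rather than refitted. In scikit-learn terms that means calling `transform` instead of `fit_transform`. The sketch below shows the idea with a hand-written median/IQR scaler; `fit_robust` and `apply_robust` are hypothetical helpers, and the data are toy values.

```python
import numpy as np

def fit_robust(x):
    """Learn median and IQR from the training column only."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {"center": med, "scale": q3 - q1}

def apply_robust(x, params):
    """Scale any array with previously fitted parameters."""
    return (np.asarray(x, dtype=float) - params["center"]) / params["scale"]

# Toy train / held-out values for a single feature.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
holdout = np.array([3.0, 7.0])

params = fit_robust(train)                       # fitted once, on training data
train_scaled = apply_robust(train, params)
holdout_scaled = apply_robust(holdout, params)   # reuses the training statistics

print(params)
print(holdout_scaled)
```

Refitting on the held-out values would place the two sets on different scales and leak held-out statistics into preprocessing.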


In [30]:
# Inspect Results
print("Scaled feature preview:")
display(df_scaled[robust_scale_cols + minmax_scale_cols].describe())
display(df_scaled[final_model_features].apply(skew).sort_values(ascending=False))

Scaled feature preview:


| | radio_u_ppm_log | radio_u_th_ratio_log | radio_u_k_ratio_log | mag_uc_1_2km_clipped | mag_uc_2_4km_clipped | mag_uc_4_8km_clipped | mag_uc_8_12km_clipped | radio_k_pct_log | radio_th_k_ratio_log | mag_uc_12_16km_clipped |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2850.0 | 2850.0 | 2850.0 | 2850.0 | 2850.0 | 2850.0 | 2850.0 | 2850.0 | 2850.0 | 2850.0 |
| mean | 0.052331 | 0.267696 | 0.1347946 | 0.445827 | 0.359912 | 0.2920935 | 0.282576 | 0.329787 | 0.394324 | 0.434179 |
| std | 0.948315 | 1.640975 | 0.9095323 | 2.594104 | 1.655676 | 1.192079 | 1.062884 | 0.193722 | 0.115974 | 0.182811 |
| min | -2.469537 | -2.366471 | -1.78331 | -7.79667 | -4.257698 | -2.802553 | -2.229055 | 0.0 | 0.0 | 0.0 |
| 25% | -0.514015 | -0.452681 | -0.4803271 | -0.394543 | -0.3430951 | -0.3250325 | -0.317871 | 0.188188 | 0.324908 | 0.328146 |
| 50% | 0.0 | 0.0 | -1.038666e-16 | 0.0 | 1.477225e-18 | -1.3552529999999999e-19 | 0.0 | 0.309158 | 0.384217 | 0.388328 |
| 75% | 0.485985 | 0.547319 | 0.5196729 | 0.605457 | 0.6569049 | 0.6749675 | 0.682129 | 0.449514 | 0.444739 | 0.503403 |
| max | 8.084046 | 30.107331 | 5.141409 | 15.417224 | 8.571295 | 4.643792 | 3.734272 | 1.0 | 1.0 | 1.0 |


radio_u_th_ratio_log      6.692259
mag_uc_1_2km_clipped      2.619252
mag_uc_2_4km_clipped      1.832600
radio_u_k_ratio_log       1.262309
radio_u_ppm_log           1.092741
mag_uc_4_8km_clipped      1.046205
mag_uc_8_12km_clipped     1.014158
mag_uc_12_16km_clipped    0.950692
radio_th_k_ratio_log      0.853399
radio_k_pct_log           0.564662
gravity_cscba_1vd         0.231727
radio_th_ppm_log         -0.140971
gravity_cscba            -0.766423
gravity_iso_residual     -0.943989
dtype: float64
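
The skewness values printed after scaling match the values computed before scaling. This is expected: all three scalers apply a transformation of the form `(x - center) / scale` with a positive scale, and skewness is unchanged by such shifts and rescalings. They change a feature's range, not the shape of its distribution. The sketch below checks this on synthetic log-normal data, which is an assumption made for illustration. The scalers are written out by hand in NumPy so the example does not depend on scikit-learn.

```python
import numpy as np

def skewness(x):
    """Biased Fisher-Pearson skewness (third moment over second moment ** 1.5)."""
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic right-skewed data

# Hand-written equivalents of the three scalers: (x - center) / scale
q1, med, q3 = np.percentile(x, [25, 50, 75])
standard = (x - x.mean()) / x.std()
robust = (x - med) / (q3 - q1)
minmax = (x - x.min()) / (x.max() - x.min())

for name, arr in [("raw", x), ("standard", standard), ("robust", robust), ("minmax", minmax)]:
    print(f"{name:9s} skew = {skewness(arr):.6f}")
```

Reducing skewness itself requires a nonlinear step such as the earlier log transform or clipping.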

## 3. Final Scaler Strategy Summary

- **StandardScaler** was used for near-symmetric transformed features (here only `radio_th_ppm_log`).
- **RobustScaler** was applied to strongly skewed or outlier-prone features (|skewness| > 1): the uranium-related radiometric features and the four shallowest clipped magnetic bands.
- **MinMaxScaler** was used for moderately skewed transformed features (|skewness| between 0.5 and 1), bounding them within [0, 1]: `radio_k_pct_log`, `radio_th_k_ratio_log`, and `mag_uc_12_16km_clipped`.
- **Gravity-based geophysical features** were retained in their original scale for physical interpretability.

This approach balances statistical normalization with geoscientific meaning.


In [31]:
# Store scaled data
df_scaled.to_csv("../../data/processed/train_dataset_scaled.csv", index=False)
print("Scaled data saved to 'train_dataset_scaled.csv'")

Scaled data saved to 'train_dataset_scaled.csv'
