In [2]:
import pandas as pd

# Load formatted dataset from previous step
df = pd.read_csv("../../data/processed/train_dataset_formatted.csv")
df.head()

Unnamed: 0,longitude,latitude,source,gravity_iso_residual,gravity_cscba,gravity_cscba_1vd,mag_uc_1_2km,mag_uc_2_4km,mag_uc_4_8km,mag_uc_8_12km,mag_uc_12_16km,radio_k_pct,radio_th_ppm,radio_u_ppm,radio_th_k_ratio,radio_u_k_ratio,radio_u_th_ratio,label
0,134.324653,-27.294063,blank_area,-204.018005,-426.38882,-309.04285,-19.665098,-33.036575,-46.971394,-30.871347,-21.795435,0.759041,9.374757,1.196567,12.357069,1.57767,0.127658,0
1,148.050504,-32.937903,positive,93.298981,-186.38301,-2674.9626,-44.330212,23.895699,107.432144,82.715973,57.587337,1.154886,8.526972,1.396929,7.434864,1.215049,0.16399,1
2,119.0271,-22.9757,other_deposit,-200.687836,-739.63226,-1203.5088,-442.748383,-354.920288,-211.401382,-84.787079,-53.003239,0.126347,3.217272,0.57241,16.086359,2.862052,0.178862,0
3,121.464232,-23.649192,blank_area,-163.918274,-592.99493,-536.7037,18.632324,31.867907,50.295372,35.592964,25.073792,0.18421,3.271043,0.3693,16.355215,1.846498,0.113002,0
4,142.4699,-35.1686,other_deposit,-81.172989,-139.28423,-507.28357,-0.266748,-0.374094,-0.820666,-0.999457,-1.124462,0.413259,4.779623,0.953914,11.560739,2.311097,0.200272,0


## 1. Missing Values Summary
The radiometric features (`radio_k_pct`, `radio_th_ppm`, `radio_u_ppm`, and the derived ratios) each have ~3.2% missing values, and the gravity feature `gravity_iso_residual` has ~0.3%. The five `mag_uc_*` magnetic features are each missing a single value (~0.04%). All other columns are complete.

In [3]:
# Count missing values for each column
missing_summary = df.isnull().sum()
missing_summary = missing_summary[missing_summary > 0].sort_values(ascending=False)

print("Missing Values Summary:")
display(missing_summary)

# Show percentage for easier interpretation
missing_pct = (df.isnull().sum() / len(df)) * 100
print("Missing Value Percentage (%):")
display(missing_pct[missing_pct > 0].sort_values(ascending=False))

Missing Values Summary:


radio_k_pct             91
radio_th_ppm            91
radio_u_ppm             91
radio_th_k_ratio        91
radio_u_k_ratio         91
radio_u_th_ratio        91
gravity_iso_residual     9
mag_uc_1_2km             1
mag_uc_2_4km             1
mag_uc_4_8km             1
mag_uc_8_12km            1
mag_uc_12_16km           1
dtype: int64

Missing Value Percentage (%):


radio_k_pct             3.192982
radio_th_ppm            3.192982
radio_u_ppm             3.192982
radio_th_k_ratio        3.192982
radio_u_k_ratio         3.192982
radio_u_th_ratio        3.192982
gravity_iso_residual    0.315789
mag_uc_1_2km            0.035088
mag_uc_2_4km            0.035088
mag_uc_4_8km            0.035088
mag_uc_8_12km           0.035088
mag_uc_12_16km          0.035088
dtype: float64
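
The identical count of 91 across all six radiometric columns suggests the gaps fall in the same rows, although the summary alone does not prove it. A minimal sketch of how that could be checked, using a made-up toy frame (`toy` and its values are illustrative assumptions, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "radio_k_pct": [0.7, np.nan, 0.1, np.nan],
    "radio_th_ppm": [9.3, np.nan, 3.2, np.nan],
    "gravity_iso_residual": [-204.0, 93.3, np.nan, -163.9],
})

radio_cols = ["radio_k_pct", "radio_th_ppm"]
radio_missing = toy[radio_cols].isnull()

# Rows where every radiometric column is missing vs. rows where any is missing
all_missing = radio_missing.all(axis=1)
any_missing = radio_missing.any(axis=1)

# If the two masks agree on every row, the gaps occur in the same rows
same_rows = bool((all_missing == any_missing).all())
print(int(all_missing.sum()), int(any_missing.sum()), same_rows)
```

On the real data, the same check would use the full list of radiometric columns in place of `radio_cols`.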

## 2. Missing Value Imputation Strategy

Based on the EDA results, the dataset contains missing values primarily in the radiometric features (approximately 3%), marginally in one gravity feature (`gravity_iso_residual`, ~0.3%), and in a single row of the magnetic features. The imputation strategy is determined by both statistical distribution and geological domain characteristics:

- **Radiometric Features** (`radio_k_pct`, `radio_th_ppm`, `radio_u_ppm`, and derived ratios) exhibit strong right-skewed distributions and long tails due to natural variations in geochemical enrichment. These are highly sensitive to outliers.

- **Gravity Feature** (`gravity_iso_residual`) is nearly normally distributed, with smooth variation and minimal skew, but has a small number of missing entries.

- **Magnetic Features** (`mag_uc_*`): the five upward-continued magnetic anomaly features each contain one missing value.

A single median-based approach is applied to all three groups, which keeps the treatment consistent across the geophysical features while remaining robust to outliers and local noise.

### Chosen Strategy:
- **Median imputation** is used for the radiometric, gravity, and magnetic features. The median is robust to outliers and gives a stable replacement value, which suits skewed or noisy geoscientific data.
- **Group-wise median imputation by `source`** is the variant applied in the code below: each missing value is replaced by the median of its own `source` group, which preserves localized geological context (e.g., different behavior in blank vs. deposit areas). A global median is kept as a simpler, commented-out alternative.
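
To make the difference between the two variants concrete, here is a small sketch on a made-up frame. The `toy` data and its values are assumptions chosen for illustration; only the `source` and `radio_u_ppm` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two source groups with very different radiometric levels
toy = pd.DataFrame({
    "source": ["blank_area", "blank_area", "blank_area",
               "positive", "positive", "positive"],
    "radio_u_ppm": [1.0, 3.0, np.nan, 10.0, 20.0, np.nan],
})

# Global median: every gap receives the same value (median of all observed values)
global_filled = toy["radio_u_ppm"].fillna(toy["radio_u_ppm"].median())

# Group-wise median: each gap receives the median of its own 'source' group
group_filled = toy.groupby("source")["radio_u_ppm"].transform(
    lambda s: s.fillna(s.median())
)

print(pd.DataFrame({"global": global_filled, "group_wise": group_filled}))
```

In this toy case the global median fills both gaps with 6.5, a value that is typical of neither group, while the group-wise version fills them with 2.0 and 15.0.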

In [7]:
# List of features to impute (with missing values)
radiometric_features = [
    'radio_k_pct', 'radio_th_ppm', 'radio_u_ppm',
    'radio_th_k_ratio', 'radio_u_k_ratio', 'radio_u_th_ratio'
]

gravity_features = ['gravity_iso_residual']

magnetic_features = ['mag_uc_1_2km', 'mag_uc_2_4km', 'mag_uc_4_8km', 'mag_uc_8_12km', 'mag_uc_12_16km']

# Columns to impute, combined once for reuse
impute_cols = radiometric_features + gravity_features + magnetic_features

# --- Option 1: Global Median Imputation (Simple, Robust) ---
# Uncomment to use instead of Option 2
#df[impute_cols] = df[impute_cols].fillna(df[impute_cols].median())
#print("Applied global median imputation.")

# --- Option 2: Group-wise Median Imputation by 'source' (used here) ---
# Each missing value is filled with the median of its own 'source' group
for col in impute_cols:
    df[col] = df.groupby('source')[col].transform(lambda x: x.fillna(x.median()))
print("Applied group-wise median imputation by source.")


Applied group-wise median imputation by source.


In [8]:
# Recheck missing values after imputation
remaining_missing = df.isnull().sum()
print("Remaining missing values:")
display(remaining_missing[remaining_missing > 0])

Remaining missing values:


Series([], dtype: int64)
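
No gaps remain in this dataset, but group-wise imputation can leave some behind in general: if every value in a `source` group is missing, that group's median is itself NaN. One possible safeguard is to fall back to the global median. The helper below, `impute_groupwise_median`, is a hypothetical name introduced for this sketch and is not part of the notebook; the toy values are made up.

```python
import numpy as np
import pandas as pd

def impute_groupwise_median(frame, cols, group_col="source"):
    """Fill NaNs with the group median, then with the global median as a fallback."""
    out = frame.copy()
    for col in cols:
        # Per-row median of the row's group (NaN if the whole group is missing)
        group_median = out.groupby(group_col)[col].transform("median")
        # Global median of the observed values, used only where the group median is NaN
        global_median = out[col].median()
        out[col] = out[col].fillna(group_median).fillna(global_median)
    return out

# Toy data: the 'other_deposit' group has no observed value at all
toy = pd.DataFrame({
    "source": ["blank_area", "blank_area", "positive", "positive", "other_deposit"],
    "gravity_iso_residual": [1.0, np.nan, 5.0, 7.0, np.nan],
})

result = impute_groupwise_median(toy, ["gravity_iso_residual"])
print(result)
```

Here the `blank_area` gap is filled from its own group, while the `other_deposit` gap falls back to the global median of the observed values.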

In [9]:
df.to_csv("../../data/processed/train_dataset_formatted_no_missing.csv", index=False)
print("Cleaned dataset saved to 'train_dataset_formatted_no_missing.csv'")

Cleaned dataset saved to 'train_dataset_formatted_no_missing.csv'


## Missing Value Handling Summary

- **Columns affected**: The six radiometric concentrations and ratios, `gravity_iso_residual`, and the five `mag_uc_*` magnetic features.
- **Imputation method used**: Group-wise median imputation by `source`.
  - Chosen mainly because the radiometric features are strongly right-skewed, so the median is more robust to outliers than the mean.
  - The same method is applied to the near-normal gravity feature and the magnetic features for consistency.
- **Validation**: No missing values remain after imputation.
- **Domain Justification**: Radiometric values are spatially interpolated from field measurements. Filling with the median keeps imputed values within the observed range and avoids pulling them toward extreme anomalies.
