<a href="https://colab.research.google.com/github/Menon-Vineet/Books/blob/main/Geo_AI_project_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSV ML Trainer (Colab-only) — Rare Event Targeting (V3.1)

**Author:** Vineet Menon  
**Last updated:** 2026-01-15

## Purpose
Train an interpretable, imbalance-aware ML model from a user-provided CSV.
The CSV is expected to contain geological data; the example run below uses a magmatic Ni dataset (`Ni_v1_2025.csv`).

## Key outputs
- Automated dataset description
- User-guided feature retention (Y/N)
- Rare-event diagnostics
- Probability ranking (Top-K)
- Saved artifacts

## 1) Install dependencies

In [None]:

!pip -q install pandas numpy scikit-learn joblib

## 2) Upload CSV

In [None]:

from google.colab import files
import io, pandas as pd

uploaded = files.upload()
if not uploaded:
    raise RuntimeError("No file uploaded.")

csv_name = list(uploaded.keys())[0]
raw_bytes = uploaded[csv_name]

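def read_csv_safely(raw):
    # Try the UTF-8 variants first. latin1 maps every possible byte, so it is
    # the final fallback; any encoding listed after it would never be tried.
    for enc in ["utf-8", "utf-8-sig", "latin1"]:
        try:
            return pd.read_csv(io.BytesIO(raw), encoding=enc), enc
        except UnicodeDecodeError:
            # Only decoding failures trigger a retry; parsing errors surface as-is.
            continue
    raise ValueError("Unable to decode CSV.")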

df, encoding_used = read_csv_safely(raw_bytes)
print(f"Loaded {csv_name} using encoding: {encoding_used}")
print("Shape:", df.shape)
df.head()

Saving Ni_v1_2025.csv to Ni_v1_2025.csv
Loaded Ni_v1_2025.csv using encoding: latin1
Shape: (1820788, 4)


| | H3 | Magmatic Ni / Ni Magmatique | Magmatic Ni / Ni Magmatique (Standard Deviation / Écart type) | Positive Sample / Échantillon positif |
|---|---|---|---|---|
| 0 | 8712e579bffffff | 0.0 | 0.0 | 0 |
| 1 | 8712e579affffff | 0.0 | 0.0 | 0 |
| 2 | 8712e56b4ffffff | 0.0 | 0.0 | 0 |
| 3 | 8712e56b5ffffff | 0.0 | 0.0 | 0 |
| 4 | 8712e56a6ffffff | 0.0 | 0.0 | 0 |


## 3) Dataset description (analytics view)

In [None]:

import pandas as pd

summary = []
for c in df.columns:
    summary.append({
        "column": c,
        "dtype": str(df[c].dtype),
        "missing_%": round(df[c].isna().mean()*100, 3),
        "unique_values": df[c].nunique(dropna=True)
    })

summary_df = pd.DataFrame(summary)
summary_df

| | column | dtype | missing_% | unique_values |
|---|---|---|---|---|
| 0 | H3 | object | 0.0 | 1820788 |
| 1 | Magmatic Ni / Ni Magmatique | float64 | 0.0 | 1201484 |
| 2 | Magmatic Ni / Ni Magmatique (Standard Deviatio... | float64 | 0.0 | 1253 |
| 3 | Positive Sample / Échantillon positif | int64 | 0.0 | 2 |


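### Interpretation guidance
- Numeric columns → measurements
- Object columns → IDs, categories, or spatial keys
- A non-numeric column whose unique count is close to the row count (here `H3`) is usually an identifier, not a feature. Continuous measurements also have many unique values, so judge by dtype as well as by count.

You will now decide what to keep.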
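
The unique-count heuristic from the guidance above can be sketched as a small helper. This is an illustration only, not part of the notebook's pipeline: the function name `flag_identifier_like`, the 0.95 threshold, and the toy frame are assumptions made for the example.

```python
import pandas as pd

def flag_identifier_like(frame: pd.DataFrame, min_unique_ratio: float = 0.95) -> list:
    """Return non-float columns whose values are (almost) all distinct.

    Hypothetical helper: the 0.95 threshold is an arbitrary illustrative choice.
    """
    flagged = []
    for col in frame.columns:
        series = frame[col]
        # Continuous measurements are naturally near-unique, so skip floats.
        if pd.api.types.is_float_dtype(series):
            continue
        n_present = int(series.notna().sum())
        if n_present == 0:
            continue
        unique_ratio = series.nunique(dropna=True) / n_present
        if unique_ratio >= min_unique_ratio:
            flagged.append(col)
    return flagged

# Toy data standing in for the uploaded CSV (not real values).
toy = pd.DataFrame({
    "cell_id": ["a1", "b2", "c3", "d4"],
    "grade": [0.10, 0.25, 0.40, 0.55],
    "label": [0, 0, 0, 1],
})
print(flag_identifier_like(toy))
```

In the notebook, calling `flag_identifier_like(df)` before the Y/N prompts would give a suggestion list; the final decision still rests with the user.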

## 4) Choose target column

In [None]:

target_column = df.columns[-1]  # default
print("Target column:", target_column)
print("Unique target values:", df[target_column].nunique())

Target column: Positive Sample / Échantillon positif
Unique target values: 2
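
Because the target defaults to the last column, a quick sanity check can catch a wrong pick before training. The sketch below assumes the target should be coded strictly as 0/1, as in the example run; `check_binary_target` is a hypothetical helper name.

```python
import pandas as pd

def check_binary_target(labels: pd.Series) -> None:
    """Raise ValueError unless the labels are exactly the two classes 0 and 1."""
    if labels.isna().any():
        raise ValueError("Target contains missing values.")
    values = set(labels.unique().tolist())
    if values != {0, 1}:
        found = sorted(values, key=str)
        raise ValueError(f"Expected target values 0 and 1, found {found}")

# Passes silently on a well-formed toy target.
check_binary_target(pd.Series([0, 0, 1, 0]))
```

In the notebook this would be called as `check_binary_target(df[target_column])`.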


## 5) User-guided feature selection

In [None]:

X_full = df.drop(columns=[target_column])
y = df[target_column]

retain_cols = []
drop_cols = []

print("Answer Y/N for each feature:\n")

for c in X_full.columns:
    print(f"Column: {c}")
    print(f"  dtype={X_full[c].dtype}, unique={X_full[c].nunique()}")
    ans = input("  Retain this column for modeling? (Y/N): ").strip().lower()
    if ans == "y":
        retain_cols.append(c)
    else:
        drop_cols.append(c)

X = X_full[retain_cols]

print("\nRetained columns:", retain_cols)
print("Dropped columns:", drop_cols)

Answer Y/N for each feature:

Column: H3
  dtype=object, unique=1820788
  Retain this column for modeling? (Y/N): n
Column: Magmatic Ni / Ni Magmatique
  dtype=float64, unique=1201484
  Retain this column for modeling? (Y/N): y
Column: Magmatic Ni / Ni Magmatique (Standard Deviation / Écart type)
  dtype=float64, unique=1253
  Retain this column for modeling? (Y/N): y

Retained columns: ['Magmatic Ni / Ni Magmatique', 'Magmatic Ni / Ni Magmatique (Standard Deviation / Écart type)']
Dropped columns: ['H3']


## 6) Class imbalance diagnostics

In [None]:

target_dist = y.value_counts(normalize=True).rename("proportion")
target_dist

| Positive Sample / Échantillon positif | proportion |
|---|---|
| 0 | 0.999319 |
| 1 | 0.000681 |
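
Proportions alone hide how few positives there are in absolute terms, and that count limits how stable any test-set estimate can be. A minimal sketch of a fuller report follows; `imbalance_report` and the toy labels are assumptions for illustration, and the expected test count assumes a stratified split.

```python
import pandas as pd

def imbalance_report(labels: pd.Series, test_size: float = 0.2) -> dict:
    """Summarize class counts and the positives a stratified test split would hold."""
    n_pos = int((labels == 1).sum())
    n_neg = int((labels != 1).sum())
    return {
        "positives": n_pos,
        "negatives": n_neg,
        "negatives_per_positive": n_neg / n_pos if n_pos else float("inf"),
        # With stratify=y, the test split keeps roughly this many positives.
        "expected_test_positives": n_pos * test_size,
    }

# Toy labels standing in for y (5 positives among 1000 rows).
toy_y = pd.Series([0] * 995 + [1] * 5)
report = imbalance_report(toy_y)
print(report)
```

In the notebook, `imbalance_report(y)` would show the absolute positive count behind the 0.000681 proportion.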


## 7) Build imbalance-aware model

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # n_jobs is omitted: with a binary target it does not parallelize the fit.
    ("model", LogisticRegression(
        class_weight="balanced",
        max_iter=1000
    ))
])

pipe.fit(X_train, y_train)

proba_test = pipe.predict_proba(X_test)[:,1]
pr_auc = average_precision_score(y_test, proba_test)

print("PR-AUC:", round(pr_auc, 4))

PR-AUC: 0.0159
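
Since the stated purpose is an interpretable model, it may help to read the fitted coefficients. The sketch below assumes a pipeline shaped like the one above (a `StandardScaler` followed by a step named `"model"`); `coefficient_table` is a hypothetical helper and the data is synthetic. Because the scaler comes first, each coefficient is the change in log-odds per one standard deviation of the feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def coefficient_table(fitted_pipe, feature_names, model_step="model"):
    """List standardized coefficients and odds ratios, largest magnitude first."""
    coefs = fitted_pipe.named_steps[model_step].coef_.ravel()
    table = pd.DataFrame({
        "feature": list(feature_names),
        "coef_per_std": coefs,
        "odds_ratio_per_std": np.exp(coefs),
    })
    order = table["coef_per_std"].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)

# Synthetic demo: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
noise = rng.normal(size=2000)
demo_X = pd.DataFrame({"signal": signal, "noise": noise})
demo_y = (signal + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
]).fit(demo_X, demo_y)

demo_table = coefficient_table(demo_pipe, demo_X.columns)
print(demo_table)
```

In the notebook the equivalent call would be `coefficient_table(pipe, retain_cols)`. Note that `class_weight="balanced"` mainly shifts the intercept, so predicted probabilities are useful for ranking but are not calibrated to the true base rate.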


## 8) Top-K ranking evaluation

In [None]:

test_df = X_test.copy()
test_df["true_label"] = y_test.values
test_df["probability"] = proba_test
test_df = test_df.sort_values("probability", ascending=False)

def topk_capture(ranked_df, k_frac):
    # Examine at least one row so a tiny fraction never yields an empty slice.
    k = max(1, int(len(ranked_df) * k_frac))
    top = ranked_df.head(k)
    return {
        "k_fraction": k_frac,
        "rows_examined": k,
        "positives_found": int(top["true_label"].sum()),
        # Share of the examined rows that are true positives.
        "precision": round(top["true_label"].mean(), 4)
    }

pd.DataFrame([
    topk_capture(test_df, 0.01),
    topk_capture(test_df, 0.005),
    topk_capture(test_df, 0.001)
])

| | k_fraction | rows_examined | positives_found | precision |
|---|---|---|---|---|
| 0 | 0.01 | 3641 | 86 | 0.0236 |
| 1 | 0.005 | 1820 | 32 | 0.0176 |
| 2 | 0.001 | 364 | 13 | 0.0357 |
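
The table above reports precision only. Recall, the share of all test positives that land in the top slice, is a useful companion when deciding how deep to go. A minimal sketch, assuming a frame already sorted by probability with a `true_label` column as in `test_df`; the function name and the toy data are illustrative.

```python
import pandas as pd

def topk_precision_recall(ranked, k_frac, label_col="true_label"):
    """Precision and recall for the top k_frac of an already-ranked frame."""
    k = max(1, int(len(ranked) * k_frac))
    top = ranked.head(k)
    found = int(top[label_col].sum())
    total = int(ranked[label_col].sum())
    return {
        "k_fraction": k_frac,
        "rows_examined": k,
        "positives_found": found,
        "precision": found / k,
        # Share of all positives in the frame that the top slice captures.
        "recall": found / total if total else float("nan"),
    }

# Toy ranking: 10 rows, 3 positives, sorted by descending probability.
toy_ranked = pd.DataFrame({
    "probability": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    "true_label": [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
}).sort_values("probability", ascending=False)

result = topk_precision_recall(toy_ranked, 0.5)
print(result)
```

In the notebook, `topk_precision_recall(test_df, 0.01)` would add the recall figure to the 1% row.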


## 9) Train final model on full dataset

In [None]:

pipe.fit(X, y)
print("Final model trained")

Final model trained


## 10) Save artifacts

In [None]:

import joblib, json
from datetime import datetime, timezone

joblib.dump(pipe, "model.joblib")

# Note: only the retained feature columns and the probability are written;
# dropped columns such as identifiers are not part of ranked_targets.csv.
ranked = X.copy()
ranked["probability"] = pipe.predict_proba(X)[:,1]
ranked = ranked.sort_values("probability", ascending=False)
ranked.to_csv("ranked_targets.csv", index=False)

metadata = {
    "author": "Vineet Menon",
    "source_file": csv_name,
    "rows": int(df.shape[0]),
    "retained_features": retain_cols,
    "dropped_features": drop_cols,
    "target_column": target_column,
    "class_distribution": y.value_counts(normalize=True).to_dict(),
    "model": "LogisticRegression (balanced)",
    "primary_metric": "PR-AUC",
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp.
    "trained_at_utc": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
}

with open("metadata.json","w") as f:
    json.dump(metadata, f, indent=2)

print("Saved artifacts")

Saved artifacts
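
The ranked file is written from `X` with `index=False`, so when an identifier column is dropped during feature selection the rows can no longer be tied back to a location. One way to keep the link is sketched below, assuming an identifier column such as `H3` exists in the original frame; `rank_with_ids` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_with_ids(frame, id_col, feature_cols, scores):
    """Return id + features + probability, sorted by descending probability.

    `scores` must be aligned row-for-row with `frame`.
    """
    out = frame[[id_col] + list(feature_cols)].copy()
    out["probability"] = np.asarray(scores)
    return out.sort_values("probability", ascending=False).reset_index(drop=True)

# Toy frame and scores standing in for df and pipe.predict_proba(X)[:, 1].
toy_df = pd.DataFrame({
    "H3": ["cell_a", "cell_b", "cell_c"],
    "ni": [0.1, 0.9, 0.4],
    "label": [0, 1, 0],
})
toy_scores = [0.2, 0.7, 0.5]

ranked_toy = rank_with_ids(toy_df, "H3", ["ni"], toy_scores)
print(ranked_toy)
```

In the notebook this would look like `rank_with_ids(df, "H3", retain_cols, pipe.predict_proba(X)[:, 1]).to_csv("ranked_targets.csv", index=False)`.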


## 11) Download outputs

In [None]:

from google.colab import files
files.download("model.joblib")
files.download("metadata.json")
files.download("ranked_targets.csv")

## 12) Model performance explainer

In [None]:

# ================================
# Model Performance Explainer Cell
# ================================
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

# ---- REQUIRED: set these if your names differ ----
# y: full target
# y_test: test target
# proba_test: predicted probability on test set
# pipe: trained sklearn pipeline

# 1) Baseline (random) performance = positive rate
# A random ranking has an expected PR-AUC close to the share of positives.
pos_rate = float(np.mean(np.asarray(y) == 1))
baseline_pr_auc = pos_rate

# 2) Model PR-AUC
pr_auc = float(average_precision_score(y_test, proba_test))

# 3) Improvement factor over random
improvement = pr_auc / baseline_pr_auc if baseline_pr_auc > 0 else np.nan

# 4) Top-K ranking diagnostics (actionable targeting)
df_rank = pd.DataFrame({"y_true": np.asarray(y_test), "proba": np.asarray(proba_test)}).sort_values("proba", ascending=False)

def topk_stats(k_frac: float):
    k = max(1, int(len(df_rank) * k_frac))
    top = df_rank.iloc[:k]
    positives_found = int(top["y_true"].sum())
    precision_in_topk = float(top["y_true"].mean())
    return {
        "top_fraction": k_frac,
        "rows_examined": k,
        "positives_found": positives_found,
        "precision_in_topk": precision_in_topk,
        "lift_vs_random": (precision_in_topk / pos_rate) if pos_rate > 0 else np.nan
    }

top_table = pd.DataFrame([topk_stats(0.01), topk_stats(0.005), topk_stats(0.001)])

# 5) Human interpretation (rule-of-thumb bands)
def band_from_improvement(x):
    if np.isnan(x):
        return "Unknown (baseline is zero?)"
    if x < 2:
        return "≈ random (not learning useful signal)"
    if x < 5:
        return "weak signal (may still be usable for broad screening)"
    if x < 15:
        return "solid signal (useful for prioritization / targeting)"
    return "strong signal (very useful for prioritization)"

verdict = band_from_improvement(improvement)

# 6) Print explanation
print("\n" + "="*70)
print("MODEL PERFORMANCE — SUMMARY")
print("="*70)

print("\n1) What metric are we using and why?")
print("- We use PR-AUC (Precision–Recall AUC) because positives are rare.")
print("- PR-AUC summarizes precision over all recall levels: 'As we work down the ranked list, how often are the flagged rows right?'")

print("\n2) What is the default (random) value?")
print(f"- Positive rate (base rate) = {pos_rate:.6f}  (~{pos_rate*100:.4f}%)")
print(f"- Baseline PR-AUC (random guess) ≈ {baseline_pr_auc:.6f}")

print("\n3) What did the model achieve?")
print(f"- Model PR-AUC = {pr_auc:.6f}")
print(f"- Improvement over random ≈ {improvement:.1f}×")
print(f"- Interpretation: {verdict}")

print("\n4) What does this mean operationally? (Top-K targeting)")
print("- Instead of asking 'is this definitely positive?', we rank locations by probability.")
print("- Below: if you only investigate the top fraction, how concentrated are positives?")
display(top_table)

print("\nHow to read the Top-K table:")
print("- precision_in_topk: fraction of rows in the top slice that are true positives")
print("- lift_vs_random: precision_in_topk divided by the base rate (how many times better than random search the top slice is)")


======================================================================
MODEL PERFORMANCE — SUMMARY
======================================================================

1) What metric are we using and why?
- We use PR-AUC (Precision–Recall AUC) because positives are rare.
- PR-AUC summarizes precision over all recall levels: 'As we work down the ranked list, how often are the flagged rows right?'

2) What is the default (random) value?
- Positive rate (base rate) = 0.000681  (~0.0681%)
- Baseline PR-AUC (random guess) ≈ 0.000681

3) What did the model achieve?
- Model PR-AUC = 0.015887
- Improvement over random ≈ 23.3×
- Interpretation: strong signal (very useful for prioritization)

4) What does this mean operationally? (Top-K targeting)
- Instead of asking 'is this definitely positive?', we rank locations by probability.
- Below: if you only investigate the top fraction, how concentrated are positives?

| | top_fraction | rows_examined | positives_found | precision_in_topk | lift_vs_random |
|---|---|---|---|---|---|
| 0 | 0.01 | 3641 | 86 | 0.02362 | 34.682905 |
| 1 | 0.005 | 1820 | 32 | 0.017582 | 25.817625 |
| 2 | 0.001 | 364 | 13 | 0.035714 | 52.442051 |

How to read the Top-K table:
- precision_in_topk: fraction of rows in the top slice that are true positives
- lift_vs_random: precision_in_topk divided by the base rate (how many times better than random search the top slice is)
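
As a worked reading of one table row, the sketch below recomputes the lift for the top 1% slice and compares the positives found with what a random pick of the same size would be expected to yield. The inputs are copied from the output above; the base rate is the rounded printed value, so the result matches the table only approximately. The function names are illustrative.

```python
def expected_random_hits(rows_examined: int, base_rate: float) -> float:
    """Positives a random sample of this size would contain on average."""
    return rows_examined * base_rate

def lift(positives_found: int, rows_examined: int, base_rate: float) -> float:
    """Precision in the slice divided by the base rate."""
    return (positives_found / rows_examined) / base_rate

# Top 1% row: 3641 rows examined, 86 positives found, base rate about 0.000681.
random_hits = expected_random_hits(3641, 0.000681)
slice_lift = lift(86, 3641, 0.000681)
print(f"random search would find about {random_hits:.1f} positives; the model found 86")
print(f"lift is about {slice_lift:.1f}x")
```

Read this way, the 1% slice holds 86 positives where random selection of 3641 rows would be expected to hold between two and three.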
