# K2 Exoplanet Modeling (Flask-ready)

**Purpose:** A modular notebook that loads the K2 dataset, trains models, saves artifacts for a Flask app, and provides prediction utilities that accept values for the top-5 features from the web UI.

The notebook follows a step-by-step workflow and saves models, plots, and results under `../static/` so Flask can serve them.

## Step 1 — Imports & Global Configuration
Import libraries and define global directories.

In [1]:
# -*- coding: utf-8 -*-
import warnings
warnings.filterwarnings('ignore')

import os
import json
import uuid
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
                             confusion_matrix, roc_curve)
import joblib

# Directories for Flask to read
BASE_MODEL_DIR = '../static/models'
PLOTS_DIR = '../static/plots'
RESULTS_DIR = '../static/results'
DATA_DIR = '../../Data Sources'  # expected location for K2.csv

os.makedirs(BASE_MODEL_DIR, exist_ok=True)
os.makedirs(PLOTS_DIR, exist_ok=True)
os.makedirs(RESULTS_DIR, exist_ok=True)

RANDOM_STATE = 42
sns.set(style='whitegrid')




## Step 2 — Dataset Path Helper
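Map dataset names to file paths. Place `K2.csv` under `../../Data Sources/` (the `K2` key), or use the `local_k2` key, which expects `Data Sources/k2.csv` relative to the working directory. To load the file from elsewhere, such as `/mnt/data/k2.csv`, add an entry to `dataset_map`.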

In [2]:
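def get_dataset_path(dataset_name):
    """Return the CSV path registered for dataset_name, or None if it is unknown."""
    dataset_map = {
        'K2': os.path.join('..', '..', 'Data Sources', 'K2.csv'),
        # Local copy relative to the working directory; os.path.join keeps the
        # separator portable instead of hard-coding a Windows backslash
        'local_k2': os.path.join('Data Sources', 'k2.csv'),
    }
    return dataset_map.get(dataset_name)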
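
If the CSV may live in more than one place, the lookup could return the first candidate that actually exists. The following is a minimal sketch only: `resolve_dataset_path` and the `candidates` mapping are hypothetical names, not part of this notebook.

```python
import os

def resolve_dataset_path(dataset_name, candidates):
    """Return the first existing path registered for dataset_name, else None.

    `candidates` maps a dataset name to a list of paths to try in order.
    Both the helper and the mapping are illustrative assumptions.
    """
    for path in candidates.get(dataset_name, []):
        if os.path.exists(path):
            return path
    return None
```

Returning `None` keeps the same contract as `get_dataset_path`, so `load_raw_dataset` would still raise its `FileNotFoundError` for a missing file.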

## Step 3 — Load, Select Columns & Rename
Load the K2 dataset, select relevant columns and create the target label.

In [3]:
def load_raw_dataset(csv_path):
    if csv_path is None:
        raise FileNotFoundError('Dataset path is None')
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f'File not found: {csv_path}')
    df = pd.read_csv(csv_path)
    print(f'Loaded {csv_path} -> shape: {df.shape}')
    return df

def preprocess_initial(df):
    # Basic cleaning: strip column names
    df = df.rename(columns=lambda x: x.strip())
    return df

def select_and_prepare_k2(df):
    # Keep a conservative set of useful columns; any that are absent from the CSV are skipped
    keep = [
        'pl_orbper','pl_rade','pl_radj','st_rad','st_mass','sy_dist',
        'st_teff','st_logg','disposition','discoverymethod','disc_facility','soltype','pl_name'
    ]
    existing = [c for c in keep if c in df.columns]
    df_sel = df[existing].copy()
    # Drop rows with no disposition info
    if 'disposition' in df_sel.columns:
        df_sel = df_sel[df_sel['disposition'].notna()]
        # Exclude refuted planets, then map the remaining dispositions to integer labels
        df_sel = df_sel[df_sel['disposition'] != 'REFUTED']
        df_sel['Target'] = df_sel['disposition'].map({'CONFIRMED':2,'CANDIDATE':1,'FALSE POSITIVE':0})
    else:
        raise KeyError('disposition column not found in K2 dataset')
    # Fill missing numeric values with the column median; skip Target so that an
    # unmapped disposition stays NaN (and is dropped later) instead of getting a made-up label
    num_cols = df_sel.select_dtypes(include=['float64','int64']).columns.tolist()
    num_cols = [c for c in num_cols if c != 'Target']
    for c in num_cols:
        df_sel[c] = df_sel[c].fillna(df_sel[c].median())
    # Fill categorical with mode
    cat_cols = df_sel.select_dtypes(include='object').columns.tolist()
    for c in cat_cols:
        df_sel[c] = df_sel[c].fillna(df_sel[c].mode().iloc[0] if not df_sel[c].mode().empty else '')
    return df_sel
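
The label construction can be checked on a toy frame. This is a sketch with made-up rows standing in for `K2.csv`; the `unmapped` counter is an added illustration for spotting dispositions that the mapping does not cover.

```python
import pandas as pd

# Toy rows standing in for K2.csv (values are made up for illustration)
toy = pd.DataFrame({
    'pl_name': ['a', 'b', 'c', 'd'],
    'disposition': ['CONFIRMED', 'CANDIDATE', 'FALSE POSITIVE', 'REFUTED'],
})

label_map = {'CONFIRMED': 2, 'CANDIDATE': 1, 'FALSE POSITIVE': 0}

# Same order of operations as select_and_prepare_k2: filter first, then map
kept = toy[toy['disposition'] != 'REFUTED'].copy()
kept['Target'] = kept['disposition'].map(label_map)

# Any disposition missing from label_map would show up here as NaN
unmapped = int(kept['Target'].isna().sum())
```

A non-zero `unmapped` count would mean the CSV contains a disposition string that the mapping does not know about.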


## Step 4 — Conservative Outlier Handling
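Remove only extreme outliers: a row is dropped when any numeric value lies more than 3× the 1st–99th percentile range beyond those percentiles. This is an IQR-style rule applied to the 1%/99% quantiles rather than the quartiles, so the bounds are wide enough to preserve genuine astrophysical extremes.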

In [4]:
def remove_extreme_outliers(df, numeric_cols, lower_q=0.01, upper_q=0.99, multiplier=3.0):
    q_low = df[numeric_cols].quantile(lower_q)
    q_high = df[numeric_cols].quantile(upper_q)
    # Spread between the two quantiles (the 1st-99th percentile range by default, not the true IQR)
    spread = q_high - q_low
    lower = q_low - multiplier * spread
    upper = q_high + multiplier * spread
    # Keep only rows whose numeric values all fall inside [lower, upper]
    mask = ~((df[numeric_cols] < lower) | (df[numeric_cols] > upper)).any(axis=1)
    return df[mask].copy()
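
To see how permissive these bounds are, here is a small worked example on synthetic data (100 evenly spaced values plus one absurd entry). The numbers are invented for illustration and the column name is borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# 100 evenly spaced "periods" plus one absurd value
vals = np.append(np.linspace(1.0, 100.0, 100), 1e6)
df = pd.DataFrame({'pl_orbper': vals})

q_low = df['pl_orbper'].quantile(0.01)
q_high = df['pl_orbper'].quantile(0.99)
spread = q_high - q_low

# Same rule as remove_extreme_outliers with multiplier=3.0
lower = q_low - 3.0 * spread
upper = q_high + 3.0 * spread

kept = df[(df['pl_orbper'] >= lower) & (df['pl_orbper'] <= upper)]
```

With these inputs the upper bound lands at roughly 394, so all 100 regular values survive and only the `1e6` entry is removed.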


## Step 5 — Build Preprocessor (Scaler + Encoder)

In [5]:
def build_preprocessor(numeric_features, categorical_features):
    num_pipe = Pipeline([('scaler', StandardScaler())])
    cat_pipe = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])
    preprocessor = ColumnTransformer([('num', num_pipe, numeric_features),
                                      ('cat', cat_pipe, categorical_features)],
                                     remainder='drop')
    return preprocessor

## Step 6 — Training, Evaluation & Plotting Helpers

In [6]:
from sklearn.base import clone

def train_models(X_train, y_train, preprocessor, save_dir=BASE_MODEL_DIR):
    """
    Train multiple ML models and save them as pipelines.

    Args:
        X_train: Training features
        y_train: Training target
        preprocessor: Unfitted preprocessor; each pipeline fits its own copy
        save_dir: Directory to save models

    Returns:
        dict: Dictionary of trained model pipelines
    """
    models = {
        'RandomForest': RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1),
        'LogisticRegression': LogisticRegression(random_state=RANDOM_STATE, max_iter=1000),
        'XGBoost': XGBClassifier(random_state=RANDOM_STATE, eval_metric='logloss')
    }

    trained_models = {}

    for name, clf in models.items():
        print(f'Training {name}...')

        # Create the pipeline with its own copy of the preprocessor, so the saved
        # pipelines do not share (and repeatedly refit) a single transformer object
        pipeline = Pipeline([
            ('preprocessor', clone(preprocessor)),
            ('clf', clf)
        ])
        
        # Fit the pipeline
        pipeline.fit(X_train, y_train)
        
        # Save the trained model
        model_path = os.path.join(save_dir, f'{name}_pipeline.pkl')
        joblib.dump(pipeline, model_path)
        print(f'Saved {name} to {model_path}')
        
        # Store in dictionary
        trained_models[name] = pipeline
    
    return trained_models


In [7]:
def evaluate_and_save(models_dict, X_test, y_test, dataset_name, plots_dir=PLOTS_DIR, results_dir=RESULTS_DIR):
    results = {'meta': {'dataset': dataset_name, 'run_id': str(uuid.uuid4())}, 'results': {}}
    for name, model in models_dict.items():
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:,1] if hasattr(model, 'predict_proba') and len(model.classes_) == 2 else None # Only get proba for binary
        metrics = {
            'accuracy': float(accuracy_score(y_test, y_pred)),
            'precision': float(precision_score(y_test, y_pred, average='weighted', zero_division=0)),
            'recall': float(recall_score(y_test, y_pred, average='weighted', zero_division=0)),
            'f1': float(f1_score(y_test, y_pred, average='weighted', zero_division=0)),
            'auc': float(roc_auc_score(y_test, y_proba)) if y_proba is not None else None
        }
        results['results'][name] = metrics

        # confusion matrix plot
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(4,3))
        sns.heatmap(cm, annot=True, fmt='d', cbar=False)
        plt.title(f'{dataset_name} - {name} Confusion Matrix')
        cm_path = os.path.join(plots_dir, f'{dataset_name}_{name}_confusion.png')
        plt.savefig(cm_path, bbox_inches='tight')
        plt.close()

        # ROC
        if y_proba is not None:
            fpr, tpr, _ = roc_curve(y_test, y_proba)
            plt.figure(figsize=(5,4))
            plt.plot(fpr, tpr, label=f'AUC={metrics["auc"]:.3f}')
            plt.plot([0,1],[0,1],'--')
            plt.xlabel('FPR'); plt.ylabel('TPR'); plt.title(f'{dataset_name} - {name} ROC')
            plt.legend(loc='lower right')
            roc_path = os.path.join(plots_dir, f'{dataset_name}_{name}_roc.png')
            plt.savefig(roc_path, bbox_inches='tight')
            plt.close()

    # save metrics json
    out_path = os.path.join(results_dir, f'{dataset_name}_metrics.json')
    with open(out_path, 'w') as f:
        json.dump(results, f, indent=2)
    print(f'Saved metrics to {out_path}')
    return results
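
Because `Target` has three classes, the binary-only branch above leaves `auc` as `None`. One possible way to still report an AUC is scikit-learn's one-vs-rest mode of `roc_auc_score`. The sketch below uses invented labels and probabilities, and `multiclass_auc` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def multiclass_auc(y_true, proba, labels):
    """Weighted one-vs-rest ROC AUC from a full (n_samples, n_classes) probability matrix.

    Hypothetical helper; `proba` would come from model.predict_proba(X_test).
    """
    return float(roc_auc_score(y_true, proba, multi_class='ovr',
                               average='weighted', labels=labels))

# Made-up labels and class probabilities (each row sums to 1)
y_true = np.array([0, 1, 2, 2, 1, 0])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])

auc = multiclass_auc(y_true, proba, labels=[0, 1, 2])
```

The per-class ROC plot would need a similar one-vs-rest treatment (one curve per class); that part is not sketched here.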

## Step 7 — Feature Importances & Save Top-5

In [8]:
def extract_and_save_top_features(models_dict, feature_names, dataset_name, top_k=5, results_dir=RESULTS_DIR):
    all_top = {}
    for name, model in models_dict.items():
        clf = model.named_steps['clf']
        imp = None
        if hasattr(clf, 'feature_importances_'):
            imp = np.array(clf.feature_importances_)
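        elif hasattr(clf, 'coef_'):
            # coef_ has shape (n_classes, n_features) for multi-class models, so flattening it
            # would not line up with the feature names; average the absolute weights across
            # classes to get one score per feature
            imp = np.abs(np.atleast_2d(clf.coef_)).mean(axis=0)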
        else:
            print(f'No importances for {name}')
            continue

        # Get preprocessor output feature names after fitting
        try:
            pre = model.named_steps['preprocessor']
            processed_names = pre.get_feature_names_out()
        except Exception as e:
            print(f'Error getting feature names out for {name}: {e}')
            # Fallback: use provided feature_names (numeric + categorical before transform)
            processed_names = feature_names

        if len(imp) != len(processed_names):
            print(f'Warning: importance len {len(imp)} != feature len {len(processed_names)} for {name}')
            # If lengths still don't match, something is fundamentally wrong or
            # the model type isn't handled correctly. Skip saving top features for this model.
            continue

        s = pd.Series(imp, index=processed_names).sort_values(ascending=False)
        top = s.head(top_k).index.tolist()
        all_top[name] = top
        with open(os.path.join(results_dir, f'{dataset_name}_{name}_top_features.json'), 'w') as f:
            json.dump(top, f, indent=2)

        plt.figure(figsize=(6, max(2, len(top)*0.5)))
        sns.barplot(x=s.head(top_k).values, y=s.head(top_k).index)
        plt.title(f'{name} Top {top_k} features')
        plt.tight_layout()
        plt.savefig(os.path.join(PLOTS_DIR, f'{dataset_name}_{name}_topk.png'), bbox_inches='tight')
        plt.close()
        print(f'Saved top{top_k} for {name}')
    return all_top

def save_feature_medians(df, dataset_name, path=RESULTS_DIR):
    med = df.median(numeric_only=True).to_dict()
    with open(os.path.join(path, f'{dataset_name}_feature_medians.json'), 'w') as f:
        json.dump(med, f, indent=2)
    print('Saved medians for dataset:', dataset_name)
    return med
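
Note that the saved top-feature lists use the preprocessor's output names (such as `num__pl_orbper`), whereas `predict_from_top5` checks its keys against the raw training columns. A UI that reads the top-5 JSON would therefore need to translate names back. The helper below is a hypothetical sketch that assumes scikit-learn's default `transformer__column` prefix and the `column_category` suffix used for one-hot columns.

```python
def to_raw_feature(processed_name, raw_columns):
    """Map a ColumnTransformer output name back to its source column.

    Assumes the default 'name__column' prefixing and, for one-hot columns,
    the 'column_category' suffix. Hypothetical helper for illustration.
    Returns None when no raw column matches.
    """
    # Drop the 'num__' / 'cat__' prefix if present
    _, _, rest = processed_name.partition('__')
    rest = rest or processed_name
    # A raw column matches exactly (numeric) or as a prefix followed by '_' (one-hot)
    matches = [c for c in raw_columns if rest == c or rest.startswith(c + '_')]
    # Prefer the longest match so a short column name cannot shadow a longer one
    return max(matches, key=len) if matches else None
```

For example, `to_raw_feature('cat__disc_facility_K2', ['disc_facility', 'pl_orbper'])` resolves to `'disc_facility'`.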

## Step 8 — Prediction Helpers

In [9]:
def predict_full_input(model_name, dataset_name, input_full):
    # dataset_name is currently unused by this helper
    model_path = os.path.join(BASE_MODEL_DIR, f'{model_name}_pipeline.pkl')
    model = joblib.load(model_path)
    X = pd.DataFrame([input_full], columns=model.named_steps['preprocessor'].feature_names_in_)
    pred = int(model.predict(X)[0])
    # Report the probability of the predicted class; with three classes, column 1 alone
    # would always be the CANDIDATE probability
    proba = float(model.predict_proba(X)[0].max()) if hasattr(model, 'predict_proba') else None
    return {'prediction': pred, 'probability': proba}

def predict_from_top5(model_name, dataset_name, top5_dict):
    med_path = os.path.join(RESULTS_DIR, f'{dataset_name}_feature_medians.json')
    if not os.path.exists(med_path):
        raise FileNotFoundError('Medians not found. Run training to save medians first.')
    with open(med_path,'r') as f:
        medians = json.load(f)
    top_path = os.path.join(RESULTS_DIR, f'{dataset_name}_{model_name}_top_features.json')
    if not os.path.exists(top_path):
        raise FileNotFoundError('Top features file not found. Run extract_and_save_top_features first.')
    with open(top_path,'r') as f:
        top_feats = json.load(f)
    cols_path = os.path.join(RESULTS_DIR, f'{dataset_name}_training_columns.json')
    if not os.path.exists(cols_path):
        raise FileNotFoundError('Training columns file not found. Run training step that saves training columns.')
    with open(cols_path,'r') as f:
        training_cols = json.load(f)
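    row = {}
    for c in training_cols:
        if c in medians:
            row[c] = medians[c]
        else:
            # Non-numeric columns have no saved median; an empty string is an unseen
            # category, which OneHotEncoder(handle_unknown='ignore') encodes as all zeros
            row[c] = ''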
    # overwrite with provided top5
    for k,v in top5_dict.items():
        if k not in row:
            raise ValueError(f'Feature {k} not in training columns')
        row[k] = v
    df_row = pd.DataFrame([row], columns=training_cols)
    model_path = os.path.join(BASE_MODEL_DIR, f'{model_name}_pipeline.pkl')
    model = joblib.load(model_path)
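    pred = model.predict(df_row)[0]
    # Report the probability of the predicted class; with three classes, column 1 alone
    # would always be the CANDIDATE probability
    proba = model.predict_proba(df_row)[0].max() if hasattr(model, 'predict_proba') else None
    return {'prediction': int(pred), 'probability': float(proba) if proba is not None else None}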
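
Values arriving from a web form are strings, so they need converting before they reach `predict_from_top5`. The sketch below shows one way to do that; `parse_top5_form` is a hypothetical helper, and it assumes every top-5 feature exposed in the UI is numeric.

```python
def parse_top5_form(form, allowed_features):
    """Convert UI strings to floats, rejecting unknown or non-numeric entries.

    `form` is any mapping of feature name -> raw string (for example a web form).
    Hypothetical helper; assumes all exposed features are numeric.
    """
    parsed = {}
    for name, raw in form.items():
        if name not in allowed_features:
            raise ValueError(f'Unexpected feature: {name}')
        try:
            parsed[name] = float(raw)
        except (TypeError, ValueError):
            raise ValueError(f'Feature {name} must be numeric, got {raw!r}') from None
    return parsed
```

The resulting dictionary has the shape expected for `top5_dict`, for example `parse_top5_form({'pl_orbper': '3.5'}, ['pl_orbper', 'st_teff'])`.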


## Step 9 — Example: Full workflow runner
Call `run_full_workflow('K2')` to execute everything for the K2 dataset.

In [10]:
def run_full_workflow(dataset_name='K2'):
    path = get_dataset_path(dataset_name)
    if path is None:
        raise FileNotFoundError(f'Dataset mapping for {dataset_name} not found. Put K2.csv under ../../Data Sources/ or use local_k2 mapping.')
    df0 = load_raw_dataset(path)
    df1 = preprocess_initial(df0)
    df_sel = select_and_prepare_k2(df1)
    print('After selection:', df_sel.shape)
    numeric_cols = df_sel.select_dtypes(include=['float64','int64']).columns.tolist()
    numeric_cols = [c for c in numeric_cols if c not in ['Target']]
    df_clean = remove_extreme_outliers(df_sel, numeric_cols, lower_q=0.01, upper_q=0.99, multiplier=3.0)
    print('After outlier removal:', df_clean.shape)
    df_clean = df_clean.dropna(subset=['Target'])
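    # Exclude the label source and the planet identifier from the features:
    # 'disposition' determines Target exactly, so keeping it leaks the answer and
    # produces trivially perfect test scores. Other columns (for example 'soltype')
    # may also encode the disposition and are worth auditing.
    non_feature_cols = {'Target', 'disposition', 'pl_name'}
    training_cols = [c for c in df_clean.columns if c not in non_feature_cols]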
    with open(os.path.join(RESULTS_DIR, f'{dataset_name}_training_columns.json'), 'w') as f:
        json.dump(training_cols, f, indent=2)
    save_feature_medians(df_clean[training_cols], dataset_name)
    X = df_clean[training_cols]
    y = df_clean['Target']
    numeric_features = X.select_dtypes(include=['float64','int64']).columns.tolist()
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()
    preprocessor = build_preprocessor(numeric_features, categorical_features)
    if len(X) < 10:
        raise ValueError('Not enough data after cleaning to train models.')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y if len(y.unique())>1 else None)
    trained = train_models(X_train, y_train, preprocessor, save_dir=BASE_MODEL_DIR)
    res = evaluate_and_save(trained, X_test, y_test, dataset_name)
    extract_and_save_top_features(trained, numeric_features + categorical_features, dataset_name, top_k=5, results_dir=RESULTS_DIR)
    print('Workflow done for', dataset_name)
    return res


In [11]:
results = run_full_workflow('local_k2')
display(results)

Loaded Data Sources\k2.csv -> shape: (4004, 129)
After selection: (3982, 14)
After outlier removal: (3943, 14)
Saved medians for dataset: local_k2
Training RandomForest...


Saved RandomForest to ../static/models\RandomForest_pipeline.pkl
Training LogisticRegression...
Saved LogisticRegression to ../static/models\LogisticRegression_pipeline.pkl
Training XGBoost...


Saved XGBoost to ../static/models\XGBoost_pipeline.pkl


Saved metrics to ../static/results\local_k2_metrics.json


Saved top5 for RandomForest


Saved top5 for XGBoost
Workflow done for local_k2


{'meta': {'dataset': 'local_k2',
  'run_id': '2f1af870-a6f1-428c-861f-331992445424'},
 'results': {'RandomForest': {'accuracy': 1.0,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0,
   'auc': None},
  'LogisticRegression': {'accuracy': 1.0,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0,
   'auc': None},
  'XGBoost': {'accuracy': 1.0,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0,
   'auc': None}}}

## Step 10 — Final Notes
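After running the workflow, Flask can serve the JSON and PNG artifacts from `../static/results` and `../static/plots`, and the prediction helpers can accept top-5 feature inputs from the UI.

Treat the perfect scores recorded in Step 9 with suspicion: a score of exactly 1.0 for every model is the signature of target leakage (for example `disposition`, from which `Target` is derived, being used as a feature), not of real predictive skill. The `auc` values are `None` because the target has three classes and the AUC is only computed for binary problems.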
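
On the Flask side, the saved metrics file can be used to decide which model to load by default. The sketch below reads a file with the same structure that `evaluate_and_save` writes; `best_model` is a hypothetical helper, and choosing by `f1` is only one possible criterion.

```python
import json

def best_model(metrics_path, metric='f1'):
    """Return the model name with the highest value of `metric` in a metrics JSON file.

    Expects the {'meta': ..., 'results': {model_name: {...}}} layout.
    Hypothetical helper; metrics stored as null are ignored.
    """
    with open(metrics_path) as f:
        payload = json.load(f)
    scores = {name: m[metric] for name, m in payload['results'].items()
              if m.get(metric) is not None}
    if not scores:
        raise ValueError(f'No usable values for metric {metric!r}')
    # On ties, max() keeps the first model encountered
    return max(scores, key=scores.get)
```

When every model reports the same score, as in the run recorded above, the first model listed in the file is returned.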