<a href="https://colab.research.google.com/github/Moe-phantom/Beyondinfinity/blob/main/TESS_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the necessary **libraries**

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import joblib
import warnings
warnings.filterwarnings('ignore')

In [23]:
df = pd.read_csv('/content/TOI_2025.09.28_05.51.22.csv',
                 comment='#',            # ignore lines that start with #
                 on_bad_lines='skip')
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

This section of the preprocessing script addresses severe class imbalance, which is common in astronomical catalogs where Planet Candidates (PC) often vastly outnumber Confirmed Planets or False Positives.

Ambiguous Class Removal: Rows classified as APC (Ambiguous Planet Candidate) are initially removed to simplify the classification task and improve label quality.

Downsampling: The script identifies the PC class (the majority class) and randomly samples a fixed number of rows (2,400) from it. All other minority classes (Confirmed Planets, False Positives) are retained in full.

Recombination: The downsampled PC set is then merged back with the complete set of other classes.

In [13]:
# Optional: Remove ambiguous classes first
df = df[~df['tfopwg_disp'].isin(['APC'])]
# Separate the classes
df_pc = df[df['tfopwg_disp'] == 'PC']          # Planet Candidates (to downsample)
df_other = df[df['tfopwg_disp'] != 'PC']       # All other classes (keep all)

# Downsample PC to 2,400 rows (matches n below)
df_pc_downsampled = df_pc.sample(n=2400, random_state=42)

# Combine
df = pd.concat([df_pc_downsampled, df_other], ignore_index=True)

# Display new class distribution
print("Class distribution after downsampling PC to 2,400:")
print(df['tfopwg_disp'].value_counts())

Class distribution after downsampling PC to 2,400:
tfopwg_disp
PC    2400
FP    1196
CP     683
KP     583
FA      98
Name: count, dtype: int64
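
The downsampling step can also be written as a small reusable helper. This is a minimal sketch on toy data: the helper name `downsample_class` and the toy frame are assumptions for illustration, not part of the notebook. The `min(...)` guard is added because `DataFrame.sample` raises a `ValueError` when `n` is larger than the number of available rows and sampling is done without replacement.

```python
import pandas as pd

def downsample_class(df, label_col, label, n, random_state=42):
    """Randomly keep at most `n` rows of one class; keep every other row."""
    is_target = df[label_col] == label
    target = df[is_target]
    n_keep = min(n, len(target))  # never ask for more rows than exist
    sampled = target.sample(n=n_keep, random_state=random_state)
    return pd.concat([sampled, df[~is_target]], ignore_index=True)

# Toy data (hypothetical): 10 PC rows, 3 FP rows, 2 CP rows
toy = pd.DataFrame({"tfopwg_disp": ["PC"] * 10 + ["FP"] * 3 + ["CP"] * 2})
balanced = downsample_class(toy, "tfopwg_disp", "PC", n=4)
counts = balanced["tfopwg_disp"].value_counts()
print(counts)
```

Only the named class shrinks; the other classes keep all of their rows, which mirrors the behaviour described above.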


In [14]:
# 1. Check class distribution
print("\n1. Class Distribution:")
print(df['tfopwg_disp'].value_counts())
print("\nPercentages:")
print(df['tfopwg_disp'].value_counts(normalize=True) * 100)


1. Class Distribution:
tfopwg_disp
PC    2400
FP    1196
CP     683
KP     583
FA      98
Name: count, dtype: int64

Percentages:
tfopwg_disp
PC    48.387097
FP    24.112903
CP    13.770161
KP    11.754032
FA     1.975806
Name: proportion, dtype: float64


This code block implements feature engineering: creating new variables from existing columns to improve the performance and interpretability of a machine learning model.

The goal here is to introduce physics-informed features relevant to exoplanet detection.

In [15]:
print("\n" + "="*80)
print("⚙️ ADVANCED FEATURE ENGINEERING")
print("="*80)

def engineer_features(df):
    """
    Create physics-informed features that improve classification
    Based on exoplanet detection theory
    """
    df = df.copy()

    print("Creating engineered features...")

    # 1. Transit signal-to-noise proxy: depth scaled by sqrt(relative flux),
    #    since flux ∝ 10^(-0.4 * Tmag) and relative photon noise ∝ 1/sqrt(flux)
    if 'pl_trandep' in df.columns and 'st_tmag' in df.columns:
        df['transit_snr'] = df['pl_trandep'] * (10 ** (-df['st_tmag'] / 5))
        print(" ✅ transit_snr (Photon-noise-limited SNR)")

    # 2. Planet-to-star temperature ratio
    if 'pl_eqt' in df.columns and 'st_teff' in df.columns:
        df['temp_ratio'] = df['pl_eqt'] / df['st_teff']
        print("   ✅ temp_ratio (Planet/Star temperature)")


    # 3. Log-transformed features (these quantities span several orders of
    #    magnitude); the 1e-6 offset avoids log10(0)
    log_cols = ['pl_orbper', 'pl_rade', 'pl_insol']
    for col in log_cols:
        if col in df.columns:
            df[f'{col}_log'] = np.log10(df[col] + 1e-6)
            print(f"   ✅ {col}_log")

    # 4. Stellar density proxy (mass / radius^3); skipped when 'st_mass'
    #    is not present in the table
    if 'st_mass' in df.columns and 'st_rad' in df.columns:
        df['stellar_density'] = df['st_mass'] / (df['st_rad'] ** 3 + 1e-6)
        print("   ✅ stellar_density")

    return df

df_engineered = engineer_features(df)

print(f"\n✅ Original features: {len(df.columns)}")
print(f"✅ Total features now: {len(df_engineered.columns)}")



⚙️ ADVANCED FEATURE ENGINEERING
Creating engineered features...
 ✅ transit_snr (Photon-noise-limited SNR)
   ✅ temp_ratio (Planet/Star temperature)
   ✅ pl_orbper_log
   ✅ pl_rade_log
   ✅ pl_insol_log

✅ Original features: 27
✅ Total features now: 32
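
The exponent in `transit_snr` may look arbitrary, so here is a minimal numeric check of where it comes from. Assuming stellar flux scales as 10^(-0.4 * Tmag) and the signal-to-noise of a photon-noise-limited measurement scales with the square root of the flux, the depth-times-sqrt(flux) proxy reduces to the `10 ** (-Tmag / 5)` form used in the cell above. The toy depths and magnitudes are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical targets with the same transit depth but different brightness
toy = pd.DataFrame({
    "pl_trandep": [1000.0, 1000.0],
    "st_tmag": [8.0, 13.0],
})

# Relative flux from magnitude, then depth scaled by sqrt(flux)
rel_flux = 10 ** (-0.4 * toy["st_tmag"])
snr_proxy = toy["pl_trandep"] * np.sqrt(rel_flux)

# The expression used in engineer_features
notebook_feature = toy["pl_trandep"] * 10 ** (-toy["st_tmag"] / 5)

same = np.allclose(snr_proxy, notebook_feature)
# A star 5 magnitudes brighter has 100x the flux, so about 10x the proxy value
brightness_gain = snr_proxy.iloc[0] / snr_proxy.iloc[1]
print(same, brightness_gain)
```

The feature is a relative proxy only: it ignores transit duration, the number of observed transits, and instrumental noise, so its absolute scale carries no physical meaning.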


Target Label Coarsening (Hierarchical Classification)
This step simplifies the multi-class prediction problem so that the first-stage model has fewer targets to separate.

Instead of training the model to distinguish between all fine-grained dispositions (FP, FA, PC, CP, KP), this function implements a hierarchical strategy by grouping them into three coarse categories:

NOT_PLANET: False Positives (FP) and False Alarms (FA).

CANDIDATE: Planet Candidates (PC).

PLANET: Confirmed Planets (CP) and Known Planets (KP).



In [16]:
print("🎯 HIERARCHICAL CLASSIFICATION STRATEGY")
print("="*80)

def create_coarse_labels(df, target_col='tfopwg_disp'):
    """
    Map the 5 remaining fine classes (APC was removed earlier) to 3 coarse classes.
    Coarser targets are intended to make the first-stage problem easier.
    """
    mapping = {
        'FP': 'NOT_PLANET',      # False Positive
        'FA': 'NOT_PLANET',      # False Alarm
        'PC': 'CANDIDATE',       # Planet Candidate
        'CP': 'PLANET',          # Confirmed Planet
        'KP': 'PLANET'           # Known Planet
    }

    df['coarse_class'] = df[target_col].map(mapping)

    print("✅ Coarse Classification Mapping:")
    print("   NOT_PLANET ← [FP, FA]")
    print("   CANDIDATE  ← [PC]")
    print("   PLANET     ← [CP, KP]")

    return df

df_engineered = create_coarse_labels(df_engineered)

print("\n📊 Coarse Class Distribution:")
print(df_engineered['coarse_class'].value_counts())


🎯 HIERARCHICAL CLASSIFICATION STRATEGY
✅ Coarse Classification Mapping:
   NOT_PLANET ← [FP, FA]
   CANDIDATE  ← [PC]
   PLANET     ← [CP, KP]

📊 Coarse Class Distribution:
coarse_class
CANDIDATE     2400
NOT_PLANET    1294
PLANET        1266
Name: count, dtype: int64
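
One detail worth checking after the mapping: `Series.map` with a dictionary returns NaN for any label that is not a key, so an unexpected disposition would silently become a missing target. A minimal sketch of that check on toy data follows; the toy frame, which deliberately keeps one `APC` row, is an assumption for illustration.

```python
import pandas as pd

mapping = {
    "FP": "NOT_PLANET",
    "FA": "NOT_PLANET",
    "PC": "CANDIDATE",
    "CP": "PLANET",
    "KP": "PLANET",
}

# Toy labels (hypothetical); 'APC' is intentionally absent from the mapping
toy = pd.DataFrame({"tfopwg_disp": ["PC", "FP", "APC", "KP"]})
toy["coarse_class"] = toy["tfopwg_disp"].map(mapping)

# Labels that did not receive a coarse class
unmapped = toy.loc[toy["coarse_class"].isna(), "tfopwg_disp"].unique().tolist()
print(unmapped)
```

In the notebook the three coarse counts sum to 4,960, the same total as the fine-class counts, which suggests no rows were left unmapped.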


This phase finalizes the dataset, ensuring only high-quality, relevant features are passed to the machine learning model.

Feature Selection: Identifier columns (toi, tid, ra, dec) and the target labels (tfopwg_disp, coarse_class) are explicitly excluded. The remaining columns, including the newly engineered features, are selected as the final input feature set.

Missing Value Handling: Missing data points (NaN) within the selected feature columns are addressed using Median Imputation. This replaces missing values with the median of their respective columns, preventing data loss while mitigating the influence of potential outliers.

Infinity Removal: Any infinite values (±∞) that may have resulted from mathematical operations (such as division by zero during feature engineering) are converted to NaN, and rows that still contain NaN in a feature column are dropped.

Result: A fully cleaned dataset with a complete, consistent set of numerical features, ready for scaling and training.

In [19]:
# ============================================================================
# 4. DATA PREPROCESSING
# ============================================================================

# Select features (exclude identifiers and targets)
exclude_cols = ['toi', 'tid', 'ra', 'dec', 'tfopwg_disp', 'coarse_class']
feature_cols = [col for col in df_engineered.columns if col not in exclude_cols]

print(f"\n📋 Using {len(feature_cols)} features for classification")

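# Handle missing values with median imputation. Assign the result back
# instead of calling fillna(inplace=True) on a column selection, which
# newer pandas versions deprecate.
df_clean = df_engineered.copy()
for col in feature_cols:
    if df_clean[col].isnull().any():
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())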

# Remove infinite values
df_clean = df_clean.replace([np.inf, -np.inf], np.nan)
df_clean = df_clean.dropna(subset=feature_cols)

print(f"✅ Clean dataset: {len(df_clean)} samples")


📋 Using 27 features for classification
✅ Clean dataset: 4960 samples
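
The cell above computes each median over the full dataset, before the train/test split. A common variation, shown here only as a sketch and not as the notebook's method, is to learn the medians from the training rows and reuse them for the test rows, so that no test-set statistics enter preprocessing. The toy frame and the positional split are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy feature table (hypothetical values) with missing entries
toy = pd.DataFrame({
    "pl_rade": [1.0, 2.0, np.nan, 4.0, 100.0, np.nan],
    "st_teff": [5000.0, np.nan, 5200.0, 5400.0, 6000.0, 5800.0],
})

# Simple positional split for illustration: first 4 rows train, last 2 test
train, test = toy.iloc[:4], toy.iloc[4:]

# Medians are learned from the training rows only
train_medians = train.median()

# fillna with a Series fills each column using the matching index entry
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)
print(test_filled)
```

Here the missing test value of `pl_rade` is filled with the training median, so the outlying 100.0 in the test rows does not influence the imputed value.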


Selecting the numeric features

In [20]:
numeric_features = df_clean[feature_cols].select_dtypes(include=[np.number]).columns.tolist()

print(f"\n✅ Keeping {len(numeric_features)} numeric features:")
print(numeric_features)

# Restrict the feature list to the numeric columns
feature_cols = numeric_features

# Build the feature matrix from numeric columns only
X = df_clean[feature_cols].values


✅ Keeping 23 numeric features:
['rowid', 'toipfx', 'ctoi_alias', 'pl_pnum', 'st_pmra', 'st_pmdec', 'pl_tranmid', 'pl_orbper', 'pl_trandurh', 'pl_trandep', 'pl_rade', 'pl_insol', 'pl_eqt', 'st_tmag', 'st_dist', 'st_teff', 'st_logg', 'st_rad', 'transit_snr', 'temp_ratio', 'pl_orbper_log', 'pl_rade_log', 'pl_insol_log']
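
The list above still contains `rowid`, `toipfx` and `ctoi_alias`. Assuming these are catalog bookkeeping or identifier columns rather than physical measurements (worth confirming against the archive's column definitions), they could be filtered out in the same way as `toi` and `tid`. The sketch below is a hypothetical extra step and is not applied in the notebook.

```python
# Numeric feature list as printed by the notebook
numeric_features = [
    'rowid', 'toipfx', 'ctoi_alias', 'pl_pnum', 'st_pmra', 'st_pmdec',
    'pl_tranmid', 'pl_orbper', 'pl_trandurh', 'pl_trandep', 'pl_rade',
    'pl_insol', 'pl_eqt', 'st_tmag', 'st_dist', 'st_teff', 'st_logg',
    'st_rad', 'transit_snr', 'temp_ratio', 'pl_orbper_log', 'pl_rade_log',
    'pl_insol_log',
]

# Assumption: these three columns are identifiers, not physical quantities
id_like_cols = {'rowid', 'toipfx', 'ctoi_alias'}

physical_features = [c for c in numeric_features if c not in id_like_cols]
print(len(physical_features))
```

The accuracies reported later were obtained with all 23 columns, so removing any of them would require retraining and would change those numbers.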


Preparing the data for training

In [24]:
print("\n" + "="*80)
print("🎯 STAGE 1: COARSE CLASSIFICATION (3 classes)")
print("="*80)

#---------------------------------------------------------------------------------------------------------------------------
X = df_clean[feature_cols].values
y_coarse = df_clean['coarse_class'].values

# Encode labels
le_coarse = LabelEncoder()
y_coarse_encoded = le_coarse.fit_transform(y_coarse)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_coarse_encoded, test_size=0.2, random_state=RANDOM_STATE, stratify=y_coarse_encoded
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

# Train ensemble models
print("\n🤖 Training Stage 1 Models...")



🎯 STAGE 1: COARSE CLASSIFICATION (3 classes)
Training set: (3968, 23)
Test set: (992, 23)

🤖 Training Stage 1 Models...
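
Two properties of the objects fitted in this cell matter later. `LabelEncoder` stores its classes in sorted order, which fixes the row and column order of any confusion matrix built from the encoded labels. The fitted scaler and encoder are also needed again whenever a saved model is used on new data. The sketch below illustrates both on toy data; the bundle file name and the toy arrays are hypothetical, and it is not a step the notebook performs.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy labels and features (hypothetical values)
labels = np.array(["PLANET", "CANDIDATE", "NOT_PLANET", "CANDIDATE"])
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

le = LabelEncoder().fit(labels)
scaler = StandardScaler().fit(X_toy)

# Classes are stored sorted: index 0, 1, 2 in that order
class_order = list(le.classes_)

# Persist both preprocessing objects together, then reload them
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessing_bundle.pkl")  # hypothetical name
    joblib.dump({"scaler": scaler, "label_encoder": le}, path)
    restored = joblib.load(path)

decoded = restored["label_encoder"].inverse_transform([0, 1, 2]).tolist()
scaled = restored["scaler"].transform(X_toy)
print(class_order, decoded)
```

With the three coarse labels, encoded class 0 is CANDIDATE, 1 is NOT_PLANET and 2 is PLANET.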


**XGBoost** training

In [26]:

# Model 1: XGBoost (a common strong baseline for tabular data)
print("\n1️⃣ XGBoost...")
xgb1 = XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=RANDOM_STATE,
    eval_metric='mlogloss'  # multi-class log loss
    # scale_pos_weight omitted: it targets binary tasks and 1 is its default
)
xgb1.fit(X_train_scaled, y_train)
acc_xgb1 = accuracy_score(y_test, xgb1.predict(X_test_scaled))
print(f"   Accuracy: {acc_xgb1:.4f} ({acc_xgb1*100:.2f}%)")



1️⃣ XGBoost...
   Accuracy: 0.7591 (75.91%)
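
To put this accuracy in context, it helps to compare it with a majority-class baseline, meaning a classifier that always predicts the most frequent class. The sketch below computes that baseline from the coarse class counts printed earlier. Because the split is stratified, the test set should have nearly the same class proportions, so the full-dataset ratio is used here as an approximation.

```python
# Coarse class counts as printed by the notebook
coarse_counts = {"CANDIDATE": 2400, "NOT_PLANET": 1294, "PLANET": 1266}

total = sum(coarse_counts.values())
majority_class = max(coarse_counts, key=coarse_counts.get)

# Accuracy of always predicting the most frequent class
baseline_acc = coarse_counts[majority_class] / total

reported_acc = 0.7591  # XGBoost test accuracy from the output above
gain = reported_acc - baseline_acc
print(f"{majority_class}: baseline={baseline_acc:.4f}, gain={gain:.4f}")
```

The baseline is a little under 50%, so the model adds roughly 27 percentage points over always predicting CANDIDATE.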


LightGBM training

In [28]:
# Model 2: LightGBM with DART boosting
print("\n2️⃣ LightGBM...")
lgb1 = LGBMClassifier(
    n_estimators=500,
    boosting_type='dart',
    max_depth=8,
    learning_rate=0.05,
    random_state=RANDOM_STATE,
    num_class=3,
    verbose=-1
)
lgb1.fit(X_train_scaled, y_train)
acc_lgb1 = accuracy_score(y_test, lgb1.predict(X_test_scaled))
print(f"   Accuracy: {acc_lgb1:.4f} ({acc_lgb1*100:.2f}%)")



2️⃣ LightGBM...
   Accuracy: 0.7530 (75.30%)


model evaluation and saving
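
As a reading aid for the reports in this section, the sketch below recomputes per-class precision, recall and overall accuracy directly from a confusion matrix. The matrix values are copied from the XGBoost output in this section. Rows are true classes and columns are predicted classes (the scikit-learn convention), in the encoder's sorted class order.

```python
import numpy as np

# XGBoost confusion matrix from the notebook output (rows = true, cols = predicted)
cm = np.array([
    [418, 30, 32],
    [79, 168, 12],
    [75, 11, 167],
])
classes = ["CANDIDATE", "NOT_PLANET", "PLANET"]

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # correct / all true members of the class
precision = correct / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The first column shows where most errors go: 79 NOT_PLANET and 75 PLANET objects are predicted as CANDIDATE, which is why CANDIDATE has high recall but the lowest precision.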

In [29]:
# Evaluate models and save them
print("\n📊 Evaluating Stage 1 Models...")

# Evaluate XGBoost
y_pred_xgb1 = xgb1.predict(X_test_scaled)
print("\nXGBoost Classification Report:")
print(classification_report(y_test, y_pred_xgb1, target_names=le_coarse.classes_))
print("XGBoost Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_xgb1))


# Evaluate LightGBM
y_pred_lgb1 = lgb1.predict(X_test_scaled)
print("\nLightGBM Classification Report:")
print(classification_report(y_test, y_pred_lgb1, target_names=le_coarse.classes_))
print("LightGBM Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lgb1))

# Save models
print("\n💾 Saving Stage 1 Models...")
joblib.dump(xgb1, 'xgb_coarse_model.pkl')
joblib.dump(lgb1, 'lgb_coarse_model.pkl')
print("✅ Models saved: xgb_coarse_model.pkl, lgb_coarse_model.pkl")


📊 Evaluating Stage 1 Models...

XGBoost Classification Report:
              precision    recall  f1-score   support

   CANDIDATE       0.73      0.87      0.79       480
  NOT_PLANET       0.80      0.65      0.72       259
      PLANET       0.79      0.66      0.72       253

    accuracy                           0.76       992
   macro avg       0.78      0.73      0.74       992
weighted avg       0.77      0.76      0.76       992

XGBoost Confusion Matrix:
[[418  30  32]
 [ 79 168  12]
 [ 75  11 167]]

LightGBM Classification Report:
              precision    recall  f1-score   support

   CANDIDATE       0.73      0.86      0.79       480
  NOT_PLANET       0.80      0.64      0.71       259
      PLANET       0.76      0.67      0.71       253

    accuracy                           0.75       992
   macro avg       0.76      0.72      0.74       992
weighted avg       0.76      0.75      0.75       992

LightGBM Confusion Matrix:
[[411  30  39]
 [ 78 166  15]
 [ 72  11 17