# Multiple ML Classifiers

Names: Christian Juarez, Analiese Gonzalez, Alyssa Amancio

This notebook implements four ML classifiers (logistic regression, random forest, gradient boosting, and k-nearest neighbors) for the planet-host prediction problem using the preprocessed Kepler features.

- Target: `label_lenient` (planet-host vs non-host)
- Features: all engineered columns from `table_v1.parquet`
- Split: pre-defined `train/val/test` from `split_v1.csv`

Note: This submission focuses on implementing and fitting the models. Full evaluation, model comparison, and hyperparameter tuning will be done in the next assignment.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Adjust ROOT to your local checkout. When this notebook lives in notebooks/,
# the data folder is at ../data/processed/.
ROOT = Path("/Users/chrisjuarez/CPSC483_ML_Project")

X_full = pd.read_parquet(ROOT / 'data/processed/features/table_v1.parquet')
y_full = pd.read_csv(ROOT / 'data/processed/labels/labels_v1.csv')
splits = pd.read_csv(ROOT / 'data/processed/splits/split_v1.csv')

# Merge into a single DataFrame
df = X_full.merge(y_full, on='kepid').merge(splits, on='kepid')
df.head()

| Unnamed: 0 | kepid | teff | logg | feh | radius | mass | kepmag | rrmscdpp03p0 | rrmscdpp06p0 | rrmscdpp12p0 | ... | detection_eff | rrmscdpp03p0_log | rrmscdpp06p0_log | rrmscdpp12p0_log | nconfp | nkoi | ntce | label_strict | label_lenient | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000785 | 5333.0 | 4.616 | -1.0 | 0.65 | 0.635 | 15.749 | 445.41 | 499.98 | 589.3 | ... | 5.4e-05 | 6.101238 | 6.216566 | 6.380631 | 0 | 0 | 2 | 0 | 0 | train |
| 1 | 10000797 | 6289.0 | 4.27 | -0.44 | 1.195 | 0.968 | 13.994 | 80.767 | 60.264 | 45.939 | ... | 0.001693 | 4.403874 | 4.115192 | 3.848849 | 0 | 0 | 0 | 0 | 0 | train |
| 2 | 10000800 | 5692.0 | 4.547 | -0.04 | 0.866 | 0.965 | 15.379 | 226.348 | 184.595 | 158.22 | ... | 0.000264 | 5.426482 | 5.223567 | 5.070287 | 0 | 0 | 0 | 0 | 0 | test |
| 3 | 10000823 | 6580.0 | 4.377 | -0.16 | 1.169 | 1.191 | 15.558 | 181.468 | 148.879 | 132.14 | ... | 0.00059 | 5.206575 | 5.009828 | 4.891401 | 0 | 0 | 0 | 0 | 0 | val |
| 4 | 10000827 | 5648.0 | 4.559 | -0.1 | 0.841 | 0.939 | 14.841 | 124.834 | 92.096 | 67.532 | ... | 0.000517 | 4.834964 | 4.533631 | 4.227301 | 0 | 0 | 0 | 0 | 0 | train |
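
The two chained `merge` calls above use pandas' default inner join, so any `kepid` missing from one of the tables would be dropped without a warning. A minimal sketch of how the join could be checked, using tiny made-up tables rather than the real Kepler files (`features`, `labels`, and `splits_demo` are hypothetical names):

```python
import pandas as pd

# Tiny hypothetical stand-ins for the three tables (not the real Kepler data).
features = pd.DataFrame({"kepid": [1, 2, 3], "teff": [5333.0, 6289.0, 5692.0]})
labels = pd.DataFrame({"kepid": [1, 2, 3], "label_lenient": [0, 0, 1]})
splits_demo = pd.DataFrame({"kepid": [1, 2, 3], "split": ["train", "test", "val"]})

# validate="one_to_one" raises pandas.errors.MergeError if kepid repeats in either table.
merged = (
    features
    .merge(labels, on="kepid", validate="one_to_one")
    .merge(splits_demo, on="kepid", validate="one_to_one")
)

# An inner join silently drops unmatched rows, so compare row counts explicitly.
assert len(merged) == len(features), "rows were lost in the merge"
print(merged)
```

If the row-count check fails on the real data, switching to `how="outer"` with `indicator=True` is one way to see which side the unmatched `kepid` values come from.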


## Train / Val / Test Split

We use the existing `split` column to create train, validation, and test sets. The prediction target is `label_lenient` (binary classification).

In [2]:
# Target column (binary classification)
target_col = 'label_lenient'

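# Features: drop non-feature columns.
# Note: `kepid` (an identifier) and the count columns `nconfp`, `nkoi`, `ntce` are still
# kept as features here; add them to this list if they should be excluded
# (for example, if they turn out to leak the label).
non_feature_cols = ['split', 'label_lenient', 'label_strict']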
feature_cols = [c for c in df.columns if c not in non_feature_cols]

X = df[feature_cols].copy()
y = df[target_col].copy()

X_train = X[df['split'] == 'train']
y_train = y[df['split'] == 'train']

X_val = X[df['split'] == 'val']
y_val = y[df['split'] == 'val']

X_test = X[df['split'] == 'test']
y_test = y[df['split'] == 'test']

X_train.shape, X_val.shape, X_test.shape

((105532, 27), (15077, 27), (30153, 27))
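
Because the split is read from a file rather than drawn in this notebook, it can be worth checking how many rows and how many positive labels land in each partition. A small sketch on a made-up frame (`demo` is hypothetical; in the notebook the same `groupby` would presumably be applied to `df`):

```python
import pandas as pd

# Made-up labels and split assignments, only to illustrate the summary.
demo = pd.DataFrame({
    "split": ["train"] * 4 + ["val"] * 2 + ["test"] * 2,
    "label_lenient": [0, 0, 0, 1, 0, 1, 0, 0],
})

# Rows, positive labels, and positive rate per partition.
summary = demo.groupby("split")["label_lenient"].agg(
    n_rows="size",
    n_hosts="sum",
    host_rate="mean",
)
print(summary)
```

A partition with very few or no positives, like the made-up `test` rows here, would make recall and ROC AUC on that partition unreliable or undefined.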

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier


### 1. Logistic Regression (with Standardization)

A linear classification model that estimates the probability of a star being a planet host. We use an L2-penalized logistic regression with standardized inputs.

In [4]:
log_reg_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('log_reg', LogisticRegression(max_iter=200, n_jobs=-1))
])

log_reg_clf.fit(X_train, y_train)
print('Logistic Regression model fitted.')

Logistic Regression model fitted.
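
To make the "estimates the probability" part concrete, the sketch below refits the same kind of pipeline on synthetic data (an assumption; `X_demo` and `y_demo` are not the Kepler features) and reproduces `predict_proba` by hand: standardize, take the linear score, and pass it through the logistic function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the Kepler features.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(max_iter=200)),
])
pipe.fit(X_demo, y_demo)

scaler = pipe.named_steps["scaler"]
model = pipe.named_steps["log_reg"]

# Linear score on the standardized inputs, then the logistic (sigmoid) function.
z = scaler.transform(X_demo) @ model.coef_.ravel() + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = pipe.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# Number of solver iterations actually used; a value equal to max_iter
# suggests the solver may have stopped before converging.
print("iterations used:", model.n_iter_[0])
```

Checking `n_iter_` is relevant here because the notebook silences warnings globally, so a convergence warning from the solver would not be shown.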


### 2. Random Forest Classifier

An ensemble of decision trees trained with bootstrap aggregation (bagging). This is similar to the model already used in the earlier notebook, but we keep it here as one of the required ML techniques.

In [5]:
rf_clf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

rf_clf.fit(X_train, y_train)
print('Random Forest model fitted.')

Random Forest model fitted.
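
As an illustration of the aggregation step, the sketch below fits a small forest on synthetic data (an assumption, not the Kepler features) and checks that the forest's predicted probabilities are the average of the per-tree probabilities. It also prints the out-of-bag score, which is one possible sanity check when trees are grown without a depth limit; `rf_demo` and `oob_score=True` are choices made for this example, not settings used in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the Kepler features.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Each fitted tree produces its own class probabilities; the forest averages them.
per_tree = np.stack([tree.predict_proba(X_demo) for tree in rf_demo.estimators_])
p_manual = per_tree.mean(axis=0)
print(np.allclose(p_manual, rf_demo.predict_proba(X_demo)))

# Accuracy on the training rows versus accuracy estimated from rows each tree
# did not see in its bootstrap sample (out-of-bag rows).
print("train accuracy:      ", rf_demo.score(X_demo, y_demo))
print("out-of-bag accuracy: ", rf_demo.oob_score_)
```

Training accuracy for fully grown trees is typically optimistic, so the out-of-bag estimate or the validation split is usually the more informative number.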


### 3. Gradient Boosting Classifier

A boosting-based ensemble that builds trees sequentially, where each new tree tries to correct the errors of the previous ensemble. This often performs well on tabular data with structured features.

In [6]:
gb_clf = GradientBoostingClassifier(
    random_state=42
)

gb_clf.fit(X_train, y_train)
print('Gradient Boosting model fitted.')

Gradient Boosting model fitted.
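
The sequential nature of boosting can be inspected directly. The sketch below, on synthetic data (an assumption, not the Kepler features), prints the training loss after the first and last tree and uses `staged_predict_proba` to confirm that the final stage matches the fitted model's output; `gb_demo` and `n_estimators=50` are choices for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; not the Kepler features.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_demo.fit(X_demo, y_demo)

# train_score_[i] is the training loss after i + 1 trees have been added.
print("loss after first tree:", gb_demo.train_score_[0])
print("loss after last tree: ", gb_demo.train_score_[-1])

# staged_predict_proba yields the ensemble's probabilities after each added tree.
stages = list(gb_demo.staged_predict_proba(X_demo))
print(len(stages))
print(np.allclose(stages[-1], gb_demo.predict_proba(X_demo)))
```

Evaluating the staged predictions on a validation set instead of the training set is a common way to pick the number of trees.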


### 4. K-Nearest Neighbors (KNN) Classifier

A simple instance-based learner that classifies each sample based on the majority label among its `k` nearest neighbors in feature space (after scaling).

In [7]:
knn_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=15))
])

knn_clf.fit(X_train, y_train)
print('KNN model fitted.')

KNN model fitted.
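
The scaling step matters for KNN because Euclidean distance is dominated by whichever feature has the largest numeric range. A small NumPy sketch with made-up stars described by `teff` (kelvin) and `logg` shows the nearest neighbor changing once the features are standardized; all values and the helper `nearest` are hypothetical.

```python
import numpy as np

# Hypothetical stars as (teff in K, logg); values are made up for illustration.
train = np.array([
    [5750.0, 2.0],   # similar temperature, very different surface gravity
    [6000.0, 4.5],   # somewhat hotter, same surface gravity
    [5000.0, 4.4],
    [6500.0, 4.6],
])
query = np.array([5700.0, 4.5])


def nearest(points, q):
    """Index of the row in `points` closest to `q` in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


# Raw units: differences in teff (tens to hundreds of kelvin) swamp logg.
raw_nearest = nearest(train, query)

# Standardize with the training mean and population std, as StandardScaler does.
mean, std = train.mean(axis=0), train.std(axis=0)
scaled_nearest = nearest((train - mean) / std, (query - mean) / std)

print(raw_nearest, scaled_nearest)  # → 0 1
```

In raw units the query is matched to the star with a very different `logg`; after standardization it is matched to the star that agrees on `logg` and differs moderately in `teff`.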


# We can delete what's below; it is just a quick check.

In [8]:
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

In [9]:
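def evaluate_model(name, clf, X_train, y_train, X_val, y_val, X_test, y_test):
    """Print accuracy and ROC AUC on each split, plus a test-set classification report."""
    print(f"\n== {name} ==")

    # Local name chosen so it does not shadow the global `splits` DataFrame.
    split_data = {
        "Train": (X_train, y_train),
        "Val":   (X_val,   y_val),
        "Test":  (X_test,  y_test),
    }

    for split_name, (X_split, y_split) in split_data.items():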
        y_pred = clf.predict(X_split)
        acc = accuracy_score(y_split, y_pred)
        
        y_score = None
        if hasattr(clf, "predict_proba"):
            y_score = clf.predict_proba(X_split)[:, 1]
        elif hasattr(clf, "decision_function"):
            y_score = clf.decision_function(X_split)
        
        if y_score is not None:
            auc = roc_auc_score(y_split, y_score)
            print(f"{split_name:5s} | accuracy = {acc:.4f} | ROC AUC = {auc:.4f}")
        else:
            print(f"{split_name:5s} | accuracy = {acc:.4f} | ROC AUC = N/A (no scores)")
    
    
    print("\nTest set classification report:")
    print(classification_report(y_test, clf.predict(X_test)))
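
Since accuracy is one of the metrics printed by this function, a majority-class baseline is a useful reference point. The sketch below uses the test-set support counts from the classification report printed in this notebook (29428 non-hosts, 725 hosts) to compute the accuracy of a hypothetical classifier that labels every star as a non-host.

```python
# Support counts copied from the test-set classification report in this notebook.
n_non_host = 29428
n_host = 725
n_total = n_non_host + n_host

# Accuracy of a hypothetical baseline that labels every star as a non-host.
baseline_accuracy = n_non_host / n_total
host_rate = n_host / n_total

print(f"host rate:         {host_rate:.4f}")          # → 0.0240
print(f"baseline accuracy: {baseline_accuracy:.4f}")  # → 0.9760
```

With only about 2.4% positives, an accuracy near 0.99 is a modest gain over this baseline, so the precision and recall for class 1 in the report are likely the more informative numbers.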

In [10]:
models = {
    "Logistic Regression":      log_reg_clf,
    "Random Forest":           rf_clf,
    "Gradient Boosting":       gb_clf,
    "K-Nearest Neighbors":     knn_clf,   # add your model if you want to see its output
}

for name, clf in models.items():
    evaluate_model(
        name,
        clf,
        X_train, y_train,
        X_val,   y_val,
        X_test,  y_test,
    )


== Logistic Regression ==
Train | accuracy = 0.9911 | ROC AUC = 0.9973
Val   | accuracy = 0.9899 | ROC AUC = 0.9965
Test  | accuracy = 0.9916 | ROC AUC = 0.9975

Test set classification report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     29428
           1       0.93      0.70      0.80       725

    accuracy                           0.99     30153
   macro avg       0.96      0.85      0.90     30153
weighted avg       0.99      0.99      0.99     30153


== Random Forest ==
Train | accuracy = 1.0000 | ROC AUC = 1.0000
Val   | accuracy = 0.9910 | ROC AUC = 0.9974
Test  | accuracy = 0.9918 | ROC AUC = 0.9978

Test set classification report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     29428
           1       0.91      0.73      0.81       725

    accuracy                           0.99     30153
   macro avg       0.95      0.87      0.90     30153
weighted avg     