<a href="https://colab.research.google.com/github/Bishtrahulsingh/Kepler_exoplanet_classification/blob/main/kepler_exoplanets_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Predicting Exoplanet Candidates from Kepler Data

**Author:** Rahul singh bisht

### 1. Introduction & Objective

The goal of this project was to build a machine learning model to accurately classify planet candidates from the NASA Kepler space telescope.

The dataset contains "Kepler Objects of Interest" (KOI) which are labeled in one of three ways:
* **CONFIRMED:** A verified exoplanet.
* **FALSE POSITIVE:** An object that looked like a planet, but was later found to be something else.
* **CANDIDATE:** An unverified object.

To create a clean, binary classification problem, I filtered the data to only include "CONFIRMED" (Class 1) and "FALSE POSITIVE" (Class 0) objects. The primary objective was to train a model that could accurately predict this classification based *only* on the scientific measurements.

In [84]:
import pandas as pd
import numpy as np
from sklearn import linear_model, svm , model_selection, preprocessing

In [85]:
kepler = pd.read_csv('https://raw.githubusercontent.com/Bishtrahulsingh/Datacsv/refs/heads/main/cumulative.csv')

### 2. Data Cleaning & Preparation

The initial dataset contained 7,316 rows and 44 columns. The cleaning process was a critical part of this project.

1.  **Filtering:** Dropped all rows where `koi_disposition` was "CANDIDATE" to create a binary (0 vs. 1) problem.
2.  **Column Removal:** Dropped non-scientific or empty columns.
    * **Empty Columns:** `koi_teq_err1` and `koi_teq_err2` were 100% null and were removed.
    * **Irrelevant Columns:** `rowid`, `kepid`, `kepoi_name`, etc., were removed as they are identifiers, not predictive features.
3.  **Handling Missing Data:** The dataset had many `NaN` (null) values.
    * A simple `dropna()` would have removed 1,100 rows, or **15.5% of the entire dataset**. This would lose too much information and bias the model.
    * **Solution:** I chose to impute (fill in) the missing values using the **median** of each column. The median is more robust to outliers than the mean, which is common in astronomical data.

In [86]:
kepler.head(5)

Unnamed: 0,rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,1,10797460,K00752.01,Kepler-227 b,CONFIRMED,CANDIDATE,1.0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,2,10797460,K00752.02,Kepler-227 c,CONFIRMED,CANDIDATE,0.969,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,3,10811496,K00753.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,4,10848459,K00754.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
4,5,10854555,K00755.01,Kepler-664 b,CONFIRMED,CANDIDATE,1.0,0,0,0,...,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509


In [87]:
kepler_clean = kepler.drop(columns=['rowid','kepid','kepoi_name','kepler_name','koi_tce_delivname','koi_pdisposition']).copy()


In [88]:
kepler_clean = kepler_clean[kepler.koi_disposition != 'CANDIDATE']

In [89]:
kepler_clean.head()

Unnamed: 0,koi_disposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,CONFIRMED,1.0,0,0,0,0,9.488036,2.775e-05,-2.775e-05,170.53875,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,CONFIRMED,0.969,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,FALSE POSITIVE,0.0,0,1,0,0,19.89914,1.494e-05,-1.494e-05,175.850252,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,FALSE POSITIVE,0.0,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
4,CONFIRMED,1.0,0,0,0,0,2.525592,3.761e-06,-3.761e-06,171.59555,...,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509


In [90]:
kepler_clean['koi_disposition'] = kepler['koi_disposition'].map({'CONFIRMED':1, 'FALSE POSITIVE' : 0 })

In [91]:
kepler_clean.sample(5)

Unnamed: 0,koi_disposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
7914,0.0,,1,0,0,0,534.86668,0.01408,-0.01408,209.0227,...,-282.0,4.074,0.205,-0.205,1.791,0.893,-0.52,286.67697,41.826874,13.845
8841,0.0,0.0,1,0,0,0,1.262742,1e-06,-1e-06,132.19792,...,-167.0,3.621,0.84,-0.21,3.15,0.947,-2.209,295.32907,51.137798,13.573
6943,0.0,0.0,0,1,1,0,5.758437,2.1e-05,-2.1e-05,135.11279,...,-120.0,2.194,0.033,-0.03,25.287,0.546,-10.374,291.98096,38.97118,13.819
662,1.0,0.998,0,0,0,0,59.878026,0.00015,-0.00015,151.1847,...,-79.0,4.658,0.039,-0.015,0.599,0.023,-0.037,295.3045,42.475288,14.995
9094,0.0,0.001,0,0,1,0,2.932251,4e-05,-4e-05,133.7648,...,-232.0,4.487,0.048,-0.192,0.966,0.285,-0.095,295.85266,47.503578,15.662


In [92]:
kepler_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7316 entries, 0 to 9563
Data columns (total 44 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    7316 non-null   float64
 1   koi_score          6257 non-null   float64
 2   koi_fpflag_nt      7316 non-null   int64  
 3   koi_fpflag_ss      7316 non-null   int64  
 4   koi_fpflag_co      7316 non-null   int64  
 5   koi_fpflag_ec      7316 non-null   int64  
 6   koi_period         7316 non-null   float64
 7   koi_period_err1    6939 non-null   float64
 8   koi_period_err2    6939 non-null   float64
 9   koi_time0bk        7316 non-null   float64
 10  koi_time0bk_err1   6939 non-null   float64
 11  koi_time0bk_err2   6939 non-null   float64
 12  koi_impact         7016 non-null   float64
 13  koi_impact_err1    6939 non-null   float64
 14  koi_impact_err2    6939 non-null   float64
 15  koi_duration       7316 non-null   float64
 16  koi_duration_err1  6939 non-n

In [93]:
# 25  koi_teq_err1       0 non-null      float64
# 26  koi_teq_err2       0 non-null      float64
kepler_clean = kepler_clean.drop(columns=['koi_teq_err1','koi_teq_err2'])

In [94]:
kepler_clean.columns

Index(['koi_disposition', 'koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss',
       'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period', 'koi_period_err1',
       'koi_period_err2', 'koi_time0bk', 'koi_time0bk_err1',
       'koi_time0bk_err2', 'koi_impact', 'koi_impact_err1', 'koi_impact_err2',
       'koi_duration', 'koi_duration_err1', 'koi_duration_err2', 'koi_depth',
       'koi_depth_err1', 'koi_depth_err2', 'koi_prad', 'koi_prad_err1',
       'koi_prad_err2', 'koi_teq', 'koi_insol', 'koi_insol_err1',
       'koi_insol_err2', 'koi_model_snr', 'koi_tce_plnt_num', 'koi_steff',
       'koi_steff_err1', 'koi_steff_err2', 'koi_slogg', 'koi_slogg_err1',
       'koi_slogg_err2', 'koi_srad', 'koi_srad_err1', 'koi_srad_err2', 'ra',
       'dec', 'koi_kepmag'],
      dtype='object')

### 3. The Modeling Process: A Data Leakage Investigation

This project became a deep dive into identifying and fixing data leakage.

#### Attempt 1: The "0.99 Score" Trap (Feature Leakage)

My very first model attempt achieved a score of `0.99`. This is "too good to be true" and a classic sign of data leakage.

* **Cause:** **Feature Leakage.** I had included columns like `koi_score` and `koi_fpflag_` (False Positive Flags). These columns are *not* raw measurements; they are the *result* of a different analysis. The model was simply reading the answer.


In [95]:
kepler_data_leak = kepler_clean.fillna(kepler_clean.median()).copy()

In [96]:
kepler_data_leak.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7316 entries, 0 to 9563
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    7316 non-null   float64
 1   koi_score          7316 non-null   float64
 2   koi_fpflag_nt      7316 non-null   int64  
 3   koi_fpflag_ss      7316 non-null   int64  
 4   koi_fpflag_co      7316 non-null   int64  
 5   koi_fpflag_ec      7316 non-null   int64  
 6   koi_period         7316 non-null   float64
 7   koi_period_err1    7316 non-null   float64
 8   koi_period_err2    7316 non-null   float64
 9   koi_time0bk        7316 non-null   float64
 10  koi_time0bk_err1   7316 non-null   float64
 11  koi_time0bk_err2   7316 non-null   float64
 12  koi_impact         7316 non-null   float64
 13  koi_impact_err1    7316 non-null   float64
 14  koi_impact_err2    7316 non-null   float64
 15  koi_duration       7316 non-null   float64
 16  koi_duration_err1  7316 non-n

* **Fix:** I dropped all leaky columns: `koi_score`, `koi_fpflag_nt`, `koi_fpflag_ss`, `koi_fpflag_co`, and `koi_fpflag_ec`.

In [97]:
leaky_cols = ['koi_disposition', 'koi_score', 'koi_fpflag_nt',
              'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec']
x = kepler_data_leak.drop(columns=leaky_cols)
y = kepler_data_leak['koi_disposition']

In [98]:
#splitting the data in to training and test data
x_train,x_test,y_train,y_test = model_selection.train_test_split(x,y,test_size=0.23)

In [99]:
x_test.shape

(1683, 36)

In [100]:
x_train.shape

(5633, 36)

In [101]:
#scaling the date to prevent bais
scaler = preprocessing.StandardScaler()

In [102]:
scaled_x_train = scaler.fit_transform(x_train)
scaled_x_test = scaler.transform(x_test)

In [103]:
#using Logistic regression model and get the score
from sklearn import linear_model

In [104]:
reg = linear_model.LogisticRegression()

In [105]:
reg.fit(scaled_x_train,y_train)

In [106]:
reg.score(scaled_x_test,y_test)

0.9263220439691028

In [107]:
reg.score(scaled_x_train,y_train)

0.9139002307828865

In [108]:
# this is a data leakage problem here

In [109]:
# reimplementing the model to prevent this

In [110]:
kepler_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7316 entries, 0 to 9563
Data columns (total 42 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    7316 non-null   float64
 1   koi_score          6257 non-null   float64
 2   koi_fpflag_nt      7316 non-null   int64  
 3   koi_fpflag_ss      7316 non-null   int64  
 4   koi_fpflag_co      7316 non-null   int64  
 5   koi_fpflag_ec      7316 non-null   int64  
 6   koi_period         7316 non-null   float64
 7   koi_period_err1    6939 non-null   float64
 8   koi_period_err2    6939 non-null   float64
 9   koi_time0bk        7316 non-null   float64
 10  koi_time0bk_err1   6939 non-null   float64
 11  koi_time0bk_err2   6939 non-null   float64
 12  koi_impact         7016 non-null   float64
 13  koi_impact_err1    6939 non-null   float64
 14  koi_impact_err2    6939 non-null   float64
 15  koi_duration       7316 non-null   float64
 16  koi_duration_err1  6939 non-n

In [111]:
#splitting the data first to prevent data leakage

In [112]:
leaky_cols = ['koi_disposition', 'koi_score', 'koi_fpflag_nt',
              'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec']
x = kepler_clean.drop(columns=leaky_cols)
y = kepler_clean['koi_disposition']

In [113]:
x_train,x_test,y_train,y_test = model_selection.train_test_split(x,y,test_size=0.2,random_state=42,stratify=y)

In [114]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5852 entries, 3729 to 9333
Data columns (total 36 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_period         5852 non-null   float64
 1   koi_period_err1    5557 non-null   float64
 2   koi_period_err2    5557 non-null   float64
 3   koi_time0bk        5852 non-null   float64
 4   koi_time0bk_err1   5557 non-null   float64
 5   koi_time0bk_err2   5557 non-null   float64
 6   koi_impact         5621 non-null   float64
 7   koi_impact_err1    5557 non-null   float64
 8   koi_impact_err2    5557 non-null   float64
 9   koi_duration       5852 non-null   float64
 10  koi_duration_err1  5557 non-null   float64
 11  koi_duration_err2  5557 non-null   float64
 12  koi_depth          5621 non-null   float64
 13  koi_depth_err1     5557 non-null   float64
 14  koi_depth_err2     5557 non-null   float64
 15  koi_prad           5621 non-null   float64
 16  koi_prad_err1      5621 no

In [115]:
#handling test and train data null values seprately

In [116]:
data_med = x_train.median()
cleaned_x_train = x_train.fillna(data_med)

In [117]:
cleaned_x_test = x_test.fillna(data_med)

In [118]:
cleaned_x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5852 entries, 3729 to 9333
Data columns (total 36 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_period         5852 non-null   float64
 1   koi_period_err1    5852 non-null   float64
 2   koi_period_err2    5852 non-null   float64
 3   koi_time0bk        5852 non-null   float64
 4   koi_time0bk_err1   5852 non-null   float64
 5   koi_time0bk_err2   5852 non-null   float64
 6   koi_impact         5852 non-null   float64
 7   koi_impact_err1    5852 non-null   float64
 8   koi_impact_err2    5852 non-null   float64
 9   koi_duration       5852 non-null   float64
 10  koi_duration_err1  5852 non-null   float64
 11  koi_duration_err2  5852 non-null   float64
 12  koi_depth          5852 non-null   float64
 13  koi_depth_err1     5852 non-null   float64
 14  koi_depth_err2     5852 non-null   float64
 15  koi_prad           5852 non-null   float64
 16  koi_prad_err1      5852 no

###Model1 : Logistic Regression

In [119]:
from sklearn import linear_model, preprocessing

In [120]:
#scale the test and train data
scaler = preprocessing.StandardScaler()
scaled_cleaned_x_train = scaler.fit_transform(cleaned_x_train)
scaled_cleaned_x_test = scaler.transform(cleaned_x_test)

In [121]:
reg_correct = linear_model.LogisticRegression(max_iter=1000)

In [122]:
reg_correct.fit(scaled_cleaned_x_train,y_train)

In [123]:
reg_correct.score(scaled_cleaned_x_test,y_test)

0.907103825136612

In [124]:
reg_correct.score(scaled_cleaned_x_train,y_train)

0.9174641148325359

###Model 2 : Support vector machine with rbf kernel

In [125]:
#lets use svm
from sklearn.svm import SVC


In [126]:
model = SVC()

In [127]:
model.fit(scaled_cleaned_x_test,y_test)

In [128]:
model.score(scaled_cleaned_x_train,y_train)

0.8880724538619276

In [129]:
model.score(scaled_cleaned_x_test,y_test)

0.9098360655737705

###Model 3 : support vector machine with linear kernel

In [130]:
svm_linear_model = svm.SVC(kernel='linear')

In [131]:
svm_linear_model.fit(scaled_cleaned_x_train,y_train)

In [132]:
svm_linear_model.score(scaled_cleaned_x_train,y_train)


0.9215652768284347

In [133]:
svm_linear_model.score(scaled_cleaned_x_test, y_test)

0.9118852459016393

#### Analysis of Results

* **Linear Models (LogReg & Linear SVM):** Both linear models performed the best and had virtually identical scores. This strongly suggests the data is **linearly separable**.

#### Model Performance Comparison

| Model | Training Accuracy | Test Accuracy |
| :--- | :--- | :--- |
| **Logistic Regression** | 0.9174 | 0.9071 |
| **SVM (Linear Kernel)** | 0.921 | 0.9118 |
| **SVM (RBF Kernel)** | 0.8880 | 0.9098 |

#Conclusion:
The data is highly linearly seprable with 91% accuracy usig SVM linear kernel.