<a href="https://colab.research.google.com/github/Shashanksai6/255_finalproject/blob/main/CMPE_255_Exoplanets_Classification_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Packages

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Installing dependencies
!pip install MarkupSafe==2.1.1 
!pip install lazypredict
!pip install -U pandas-profiling

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting MarkupSafe==2.1.1
  Downloading MarkupSafe-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Installing collected packages: MarkupSafe
  Attempting uninstall: MarkupSafe
    Found existing installation: MarkupSafe 2.1.2
    Uninstalling MarkupSafe-2.1.2:
      Successfully uninstalled MarkupSafe-2.1.2
Successfully installed MarkupSafe-2.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandas-profiling
  Downloading pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)

In [None]:
# Import packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Model selection utilities, sampling distributions, and XGBoost
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from scipy.stats import uniform, randint
import xgboost as xgb


# Sklearn Evaluation Metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Exploratory Data Analysis (EDA) 
from pandas_profiling import ProfileReport
pd.set_option('display.max_columns', None)

# Ignoring warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Project Objective:

In this project, our goal is to build a model that can predict the existence of an exoplanet (i.e., a planet that orbits a distant star) given the light intensity readings from that star over time. The dataset we’ll be using comes from NASA’s Kepler space telescope. This project demonstrates how predictive classification modeling can help determine whether an observed object is an exoplanet or not.

## Data Extraction

Data extraction is the process of acquiring and processing raw data of various forms and types to improve the operational paradigms of an organization.

It is perhaps the most important operation of the Extract/Transform/Load (ETL) process because it is the foundation for critical analyses and decision-making processes. It enables consolidation, analysis, and refinement of data so that it can be converted into meaningful information that can be stored for further use and manipulation. The extracted data can help with decision making, customer base expansion, service improvements, sales prediction, and cost optimization, among other things.

In our use case, we use the NASA Exoplanet Archive API (https://exoplanetarchive.ipac.caltech.edu/docs/program_interfaces.html) to retrieve the information captured by the Kepler telescope. We converted the JSON data returned by the API to CSV using Excel to make it available for machine learning analysis.
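The JSON-to-CSV conversion described above was done in Excel. As an alternative, the sketch below shows how the same conversion could be done with pandas. The two records and their values are hypothetical placeholders shaped like KOI rows, not real API output, and the snippet makes no network call.

```python
import io
import json

import pandas as pd

# Hypothetical sample: a JSON list of records, with made-up placeholder values.
sample_json = json.dumps([
    {"kepid": 1, "koi_pdisposition": "CANDIDATE", "koi_period": 9.5},
    {"kepid": 2, "koi_pdisposition": "FALSE POSITIVE", "koi_period": 19.9},
])

# Parse the JSON text and build a DataFrame with one row per record.
records = json.loads(sample_json)
koi_df = pd.DataFrame.from_records(records)

# Write CSV to an in-memory buffer; a file path would work the same way.
csv_buffer = io.StringIO()
koi_df.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` instead of the buffer would produce a file that `pd.read_csv` can load directly.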

In [None]:
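# Drive is already mounted in the first cell, so only the load is needed here.
df = pd.read_csv('/content/drive/My Drive/dataset/exoplanet/exoplanets_nasa.csv')

# Repeat every row 25 times to expand the dataset. Repeating via the index
# keeps the column names and dtypes, which np.repeat on df.values would discard.
df = df.loc[df.index.repeat(25)].reset_index(drop=True)

print(df.shape[0])
df.head(2)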

## Exploratory Data Analysis

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, html={'style':{'full_width':True}})
#profile.to_notebook_iframe()
profile.to_file("/content/drive/My Drive/dataset/expanded_exoplanets_profile_report.html")

## Feature Engineering and Transformation

Feature engineering is the process of transforming features, extracting features, and creating new variables from the original data, to train machine learning models.

Data in its original format can almost never be used straightaway to train classification or regression models. Instead, data scientists devote a huge chunk of their time to data preprocessing to train machine learning algorithms. Feature engineering is key to improving the performance of machine learning algorithms. Yet, it is very time-consuming. Fortunately, there are many Python libraries that we can use for data preparation.

Some techniques work better with certain algorithms or datasets, while others are beneficial in all cases. For this dataset, we apply the following transformations:

- Rename the columns
- Transform the target variable
- Drop some columns
- Handle missing values
- Remove duplicate instances

In [None]:
# 1. Rename the columns
df = df.rename(columns={'kepid': 'KepID',
                        'kepoi_name': 'KOIName',
                        'kepler_name': 'KeplerName',
                        'koi_disposition': 'ExoplanetArchiveDisposition',
                        'koi_pdisposition': 'DispositionUsingKeplerData',
                        'koi_score': 'DispositionScore',
                        'koi_fpflag_nt': 'NotTransit-LikeFalsePositiveFlag',
                        'koi_fpflag_ss': 'koi_fpflag_ss',
                        'koi_fpflag_co': 'CentroidOffsetFalsePositiveFlag',
                        'koi_fpflag_ec': 'EphemerisMatchIndicatesContaminationFalsePositiveFlag',
                        'koi_period': 'OrbitalPeriod.days',
                        'koi_period_err1': 'OrbitalPeriodUpperUnc.days',
                        'koi_period_err2': 'OrbitalPeriodLowerUnc.days',
                        'koi_time0bk': 'TransitEpoch.BKJD',
                        'koi_time0bk_err1': 'TransitEpochUpperUnc.BKJD',
                        'koi_time0bk_err2': 'TransitEpochLowerUnc.BKJD',
                        'koi_impact': 'ImpactParamete',
                        'koi_impact_err1': 'ImpactParameterUpperUnc',
                        'koi_impact_err2': 'ImpactParameterLowerUnc',
                        'koi_duration': 'TransitDuration.hrs',
                        'koi_duration_err1': 'TransitDurationUpperUnc.hrs',
                        'koi_duration_err2': 'TransitDurationLowerUnc.hrs',
                        'koi_depth': 'TransitDepth.ppm',
                        'koi_depth_err1': 'TransitDepthUpperUnc.ppm',
                        'koi_depth_err2': 'TransitDepthLowerUnc.ppm',
                        'koi_prad': 'PlanetaryRadius.Earthradii',
                        'koi_prad_err1': 'PlanetaryRadiusUpperUnc.Earthradii',
                        'koi_prad_err2': 'PlanetaryRadiusLowerUnc.Earthradii',
                        'koi_teq': 'EquilibriumTemperature.K',
                        'koi_teq_err1': 'EquilibriumTemperatureUpperUnc.K',
                        'koi_teq_err2': 'EquilibriumTemperatureLowerUnc.K',
                        'koi_insol': 'InsolationFlux.Earthflux',
                        'koi_insol_err1': 'InsolationFluxUpperUnc.Earthflux',
                        'koi_insol_err2': 'InsolationFluxLowerUnc.Earthflux',
                        'koi_model_snr': 'TransitSignal-to-Nois',
                        'koi_tce_plnt_num': 'TCEPlanetNumbe',
                        'koi_tce_delivname': 'TCEDeliver',
                        'koi_steff': 'StellarEffectiveTemperature.K',
                        'koi_steff_err1': 'StellarEffectiveTemperatureUpperUnc.K',
                        'koi_steff_err2': 'StellarEffectiveTemperatureLowerUnc.K',
                        'koi_slogg': 'StellarSurfaceGravity.log10(cm/s**2)',
                        'koi_slogg_err1': 'StellarSurfaceGravityUpperUnc.log10(cm/s**2)',
                        'koi_slogg_err2': 'StellarSurfaceGravityLowerUnc.log10(cm/s**2)',
                        'koi_srad': 'StellarRadius.Solarradii',
                        'koi_srad_err1': 'StellarRadiusUpperUnc.Solarradii',
                        'koi_srad_err2': 'StellarRadiusLowerUnc.Solarradii',
                        'ra': 'RA.decimaldegrees',
                        'dec': 'Decdecimaldegrees',
                        'koi_kepmag': 'Kepler-band.mag'
                        })
df.head()

In [None]:
# 2. Transform the target variable: 1 for CANDIDATE, 0 otherwise
df['ExoplanetCandidate'] = (df['DispositionUsingKeplerData'] == 'CANDIDATE').astype(int)
#df['ExoplanetConfirmed'] = df['ExoplanetArchiveDisposition'].apply(lambda x: 2 if x == 'CONFIRMED' else 1 if x == 'CANDIDATE' else 0 )

In [None]:
# 3. Drop some columns
df.drop(columns=['KeplerName', 'KOIName', 'EquilibriumTemperatureUpperUnc.K',
                 'KepID', 'ExoplanetArchiveDisposition', 'DispositionUsingKeplerData',
                 'NotTransit-LikeFalsePositiveFlag', 'koi_fpflag_ss', 'CentroidOffsetFalsePositiveFlag',
                 'EphemerisMatchIndicatesContaminationFalsePositiveFlag', 'TCEDeliver',
                 'EquilibriumTemperatureLowerUnc.K'], inplace=True)

In [None]:
# 4. Handle missing values by dropping rows that contain any
df.dropna(inplace=True)

In [None]:
# 5. Remove duplicate rows
df.drop_duplicates(inplace=True)
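One interaction between these steps is worth checking: rows that are repeated when the data is loaded are exact duplicates, so `drop_duplicates` removes them again. The sketch below illustrates this on a hypothetical three-row frame (the values are made up), using the same repeat factor of 25.

```python
import pandas as pd

# Hypothetical toy frame; the third row has a missing value.
toy = pd.DataFrame({
    "koi_period": [9.5, 19.9, None],
    "koi_pdisposition": ["CANDIDATE", "FALSE POSITIVE", "CANDIDATE"],
})

# Repeat every row 25 times, as done when the dataset is loaded.
expanded = toy.loc[toy.index.repeat(25)].reset_index(drop=True)

# Steps 4 and 5: drop rows with missing values, then drop exact duplicates.
cleaned = expanded.dropna().drop_duplicates()

print(len(toy), len(expanded), len(cleaned))  # prints: 3 75 2
```

Under these assumptions, only the unique complete rows survive, so the repetition does not add training examples once duplicates are removed.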

## Sampling of Data for Training and Testing Models

The data is split into a training set and a test set. The training set contains known outputs, and the model learns from it so that it can generalize to other data later on. The test set is held out so that we can evaluate the model’s predictions on data it has not seen.

We allocate 80% of the data to training and the remaining 20% to testing.

### Train/Test Split


In [None]:
features = df.drop(columns=['ExoplanetCandidate'])
target = df.ExoplanetCandidate
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1, test_size=.20)
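To make the 80/20 allocation concrete, here is a minimal sketch on hypothetical toy data. The `stratify` argument is an optional variation that the split above does not use; it keeps the class proportions equal in both subsets, which can help when the target is imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, balanced binary target.
toy_features = np.arange(200).reshape(100, 2)
toy_target = np.array([0, 1] * 50)

# test_size=0.20 sends 20% of the rows to the test set;
# stratify preserves the 50/50 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.20, random_state=1, stratify=toy_target
)

print(len(X_tr), len(X_te))  # prints: 80 20
print(y_te.mean())           # prints: 0.5
```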

## Training Benchmark and Hyperparameter Tuning

The best way to think about hyperparameters is as the settings of an algorithm that can be adjusted to optimize performance, just as we might turn the knobs of an AM radio to get a clear signal. When creating a machine learning model, we are presented with design choices about how to define the model architecture. Often, we don't immediately know what the optimal architecture should be for a given model, so we would like to explore a range of possibilities. In true machine learning fashion, we ideally ask the machine to perform this exploration and select the optimal model architecture automatically.

In [None]:
# Creating Benchmark

from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

100%|██████████| 29/29 [00:23<00:00,  1.22it/s]

In [None]:
models

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| XGBClassifier | 0.96 | 0.96 | 0.96 | 0.96 | 1.48 |
| RandomForestClassifier | 0.96 | 0.96 | 0.96 | 0.96 | 2.95 |
| ExtraTreesClassifier | 0.96 | 0.96 | 0.96 | 0.96 | 0.74 |
| LGBMClassifier | 0.96 | 0.96 | 0.96 | 0.96 | 1.02 |
| LogisticRegression | 0.96 | 0.96 | 0.96 | 0.96 | 0.20 |
| BaggingClassifier | 0.95 | 0.96 | 0.96 | 0.95 | 2.11 |
| AdaBoostClassifier | 0.95 | 0.95 | 0.95 | 0.95 | 1.41 |
| CalibratedClassifierCV | 0.95 | 0.95 | 0.95 | 0.95 | 2.43 |
| LinearSVC | 0.95 | 0.95 | 0.95 | 0.95 | 0.72 |
| SGDClassifier | 0.95 | 0.95 | 0.95 | 0.95 | 0.15 |
| KNeighborsClassifier | 0.95 | 0.95 | 0.95 | 0.95 | 0.35 |
| SVC | 0.95 | 0.95 | 0.95 | 0.95 | 0.72 |
| PassiveAggressiveClassifier | 0.95 | 0.95 | 0.95 | 0.95 | 0.04 |
| LinearDiscriminantAnalysis | 0.95 | 0.95 | 0.95 | 0.95 | 0.14 |
| RidgeClassifier | 0.95 | 0.95 | 0.95 | 0.95 | 0.04 |
| RidgeClassifierCV | 0.95 | 0.95 | 0.95 | 0.95 | 0.06 |
| NuSVC | 0.94 | 0.94 | 0.94 | 0.94 | 2.69 |
| ExtraTreeClassifier | 0.94 | 0.94 | 0.94 | 0.94 | 0.03 |
| DecisionTreeClassifier | 0.93 | 0.93 | 0.93 | 0.93 | 0.45 |
| LabelSpreading | 0.93 | 0.93 | 0.93 | 0.93 | 3.33 |
| LabelPropagation | 0.93 | 0.93 | 0.93 | 0.93 | 2.32 |
| Perceptron | 0.92 | 0.92 | 0.92 | 0.92 | 0.05 |
| NearestCentroid | 0.91 | 0.91 | 0.91 | 0.91 | 0.04 |
| BernoulliNB | 0.89 | 0.89 | 0.89 | 0.89 | 0.04 |
| QuadraticDiscriminantAnalysis | 0.77 | 0.76 | 0.76 | 0.76 | 0.07 |
| GaussianNB | 0.74 | 0.73 | 0.73 | 0.72 | 0.03 |
| DummyClassifier | 0.52 | 0.50 | 0.50 | 0.36 | 0.02 |

## Random searching of hyperparameters

Random search does not provide a discrete set of values to explore for each hyperparameter; rather, it provides a statistical distribution for each hyperparameter from which values are randomly sampled.

Conceptually, we define a sampling distribution for each hyperparameter. We can also define how many iterations to run when searching for the optimal model. For each iteration, the hyperparameter values of the model are set by sampling from the defined distributions. One of the primary theoretical motivations for random search is that, in most cases, hyperparameters are not equally important.
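The sampling distributions can be inspected directly before running a search. The sketch below uses SciPy's `uniform(loc, scale)`, which samples from the interval `[loc, loc + scale]`, and `randint(low, high)`, which samples integers from `low` up to but excluding `high`. The variable names and the sample size of 1000 are illustrative assumptions.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], here [0.03, 0.33].
learning_rate_dist = uniform(0.03, 0.3)

# randint(low, high) excludes the upper bound, so this yields 2, 3, 4, or 5.
max_depth_dist = randint(2, 6)

# Draw reproducible samples to check the ranges empirically.
lr_samples = learning_rate_dist.rvs(size=1000, random_state=0)
depth_samples = max_depth_dist.rvs(size=1000, random_state=0)

print(lr_samples.min() >= 0.03, lr_samples.max() <= 0.03 + 0.3)
print(sorted(set(depth_samples.tolist())))
```

Note that the second argument of `uniform` is a width, not an upper bound, which is easy to misread when defining a search space.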

In [None]:
# Hyperparameter tuning

# Each distribution below is sampled once per search iteration.
params = {
    "colsample_bytree": uniform(0.7, 0.3),  # uniform on [0.7, 1.0]
    "gamma": uniform(0, 0.5),               # uniform on [0, 0.5]
    "learning_rate": uniform(0.03, 0.3),    # uniform on [0.03, 0.33]
    "max_depth": randint(2, 6),             # integers 2 to 5
    "n_estimators": randint(100, 150),      # integers 100 to 149
    "subsample": uniform(0.6, 0.4),         # uniform on [0.6, 1.0]
}

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42, eval_metric="auc")

search_model = RandomizedSearchCV(
    xgb_model,
    param_distributions=params,
    random_state=42,
    n_iter=100,
    cv=3,
    verbose=1,
    n_jobs=-1,
    return_train_score=True,
)

search_model.fit(X_train, y_train)

best_model = search_model.best_estimator_

Fitting 3 folds for each of 100 candidates, totalling 300 fits
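
Once a search has been fitted, its results can be read from the `best_params_`, `best_score_`, and `cv_results_` attributes of `RandomizedSearchCV`. The sketch below is a self-contained illustration, not the notebook's actual run: it assumes synthetic data from `make_classification` and swaps in `LogisticRegression` with a small search space so that it finishes quickly.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the Kepler dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)

# Stand-in estimator and search space, kept small so the sketch runs fast.
demo_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": uniform(0.01, 10)},
    n_iter=10,
    cv=3,
    random_state=42,
)
demo_search.fit(X_tr, y_tr)

# Sampled hyperparameters of the best candidate.
best_params = demo_search.best_params_
# Mean cross-validated score of that candidate.
best_cv_score = demo_search.best_score_
# Accuracy of the refitted best model on the held-out test set.
test_accuracy = demo_search.best_estimator_.score(X_te, y_te)

print(best_params, round(best_cv_score, 3), round(test_accuracy, 3))
```

The same three attributes are available on `search_model` above, so the tuned XGBoost model can be inspected in the same way.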