<a href="https://colab.research.google.com/github/Shashanksai6/255_finalproject/blob/main/CMPE_255_Exoplanets_Classification_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Packages

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Installing dependencies
!pip install MarkupSafe==2.1.1 
!pip install lazypredict
!pip install -U pandas-profiling

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting MarkupSafe==2.1.1
  Downloading MarkupSafe-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Installing collected packages: MarkupSafe
  Attempting uninstall: MarkupSafe
    Found existing installation: MarkupSafe 2.1.2
    Uninstalling MarkupSafe-2.1.2:
      Successfully uninstalled MarkupSafe-2.1.2
Successfully installed MarkupSafe-2.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandas-profiling
  Downloading pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)

In [None]:
# Import packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Model selection, hyperparameter search distributions, and XGBoost
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from scipy.stats import uniform, randint
import xgboost as xgb


# Sklearn Evaluation Metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Exploratory Data Analysis (EDA) 
from pandas_profiling import ProfileReport
pd.set_option('display.max_columns', None)

# Ignoring warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

## Project Objective

In this project, our goal is to build a model that can predict the existence of an exoplanet (i.e., a planet that orbits a distant star) given the light intensity readings from that star over time. The dataset we'll be using comes from NASA's Kepler space telescope. This project demonstrates how predictive classification modeling can help determine whether an observed object is an exoplanet or not.

## Data Extraction

Data extraction is the process of acquiring and processing raw data of various forms and types to improve the operational paradigms of an organization.

It is perhaps the most important operation of the Extract/Transform/Load (ETL) process because it is the foundation for critical analyses and decision-making processes. It enables consolidation, analysis, and refining of data so that it can be converted into meaningful information that can be stored for further use and manipulation. The extracted data can help in decision making, customer base expansion, service improvements, predicting sales, and optimizing costs, among other things.

In our use case, we use the NASA Exoplanet Archive API, hosted by Caltech (https://exoplanetarchive.ipac.caltech.edu/docs/program_interfaces.html), to retrieve the information captured by the Kepler telescope. We converted the JSON data returned by the API to CSV using Excel to make it available for machine learning analysis.
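
The JSON-to-CSV conversion could also be done in pandas rather than by hand. The sketch below is only an illustration under stated assumptions: the two records are made up, they are assumed to arrive as a JSON array of row objects, and nothing is fetched from the API.

```python
import io
import json

import pandas as pd

# Hypothetical sample shaped like a JSON array of row objects.
# The values are made up for illustration only.
raw_json = """
[
  {"kepid": 1, "koi_pdisposition": "CANDIDATE", "koi_period": 9.5},
  {"kepid": 2, "koi_pdisposition": "FALSE POSITIVE", "koi_period": 19.9}
]
"""

records = json.loads(raw_json)                   # list of dicts, one per row
sample_df = pd.DataFrame.from_records(records)   # one column per JSON key

# Write the CSV text to an in-memory buffer.
# Passing a file path to to_csv instead would save it to disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Doing the conversion in code keeps the step reproducible, since the same script can be rerun whenever the source table is refreshed.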

In [None]:
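# Drive is already mounted in the first cell, so the dataset can be read directly.
df = pd.read_csv('/content/drive/My Drive/dataset/exoplanet/exoplanets_nasa.csv')
# Repeat every row 25 times while keeping the column names and dtypes
# (wrapping df.values in a new DataFrame would discard both).
df = df.loc[df.index.repeat(25)].reset_index(drop=True)

df.head(2)
df.shape[0]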

## Exploratory Data Analysis

In [None]:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, html={'style':{'full_width':True}})
#profile.to_notebook_iframe()
profile.to_file("/content/drive/My Drive/dataset/expanded_exoplanets_profile_report.html")

## Feature Engineering and Transformation

Feature engineering is the process of transforming features, extracting features, and creating new variables from the original data, to train machine learning models.

Data in its original format can almost never be used straightaway to train classification or regression models. Instead, data scientists devote a huge chunk of their time to data preprocessing to train machine learning algorithms. Feature engineering is key to improving the performance of machine learning algorithms. Yet, it is very time-consuming. Fortunately, there are many Python libraries that we can use for data preparation.

Some techniques work better with particular algorithms or datasets, while others are beneficial in almost all cases. For this dataset, we apply the following transformations:

- Rename the columns
- Transform the target variable
- Drop unused columns
- Handle missing values
- Remove duplicate instances
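
Before running these steps on the full table, it can help to see them on a tiny example. The sketch below is a minimal illustration, not the notebook's actual pipeline: the values are made up, only three columns are used, and `clean_toy_table` is a hypothetical helper name. The target is encoded before its source column is dropped.

```python
import numpy as np
import pandas as pd

# Toy table with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "kepid": [1, 2, 2, 3],
    "koi_pdisposition": ["CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE", "CANDIDATE"],
    "koi_period": [9.5, 19.9, 19.9, np.nan],
})


def clean_toy_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the five transformations in order and return a new frame."""
    # Rename the columns to descriptive names.
    out = frame.rename(columns={
        "kepid": "KepID",
        "koi_pdisposition": "DispositionUsingKeplerData",
        "koi_period": "OrbitalPeriod.days",
    })
    # Encode the target as 1 (candidate) or 0 (anything else).
    out["ExoplanetCandidate"] = (
        out["DispositionUsingKeplerData"] == "CANDIDATE"
    ).astype(int)
    # Drop the identifier and the raw label column.
    out = out.drop(columns=["KepID", "DispositionUsingKeplerData"])
    # Drop rows that contain missing values.
    out = out.dropna()
    # Drop rows that exactly repeat an earlier row.
    out = out.drop_duplicates()
    return out.reset_index(drop=True)


cleaned = clean_toy_table(toy)
print(cleaned)
```

Returning a new frame instead of modifying the input in place makes it easy to compare the table before and after cleaning.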

In [None]:
# 1. Rename the columns
df = df.rename(columns={'kepid': 'KepID',
                        'kepoi_name': 'KOIName',
                        'kepler_name': 'KeplerName',
                        'koi_disposition': 'ExoplanetArchiveDisposition',
                        'koi_pdisposition': 'DispositionUsingKeplerData',
                        'koi_score': 'DispositionScore',
                        'koi_fpflag_nt': 'NotTransit-LikeFalsePositiveFlag',
                        'koi_fpflag_ss': 'koi_fpflag_ss',
                        'koi_fpflag_co': 'CentroidOffsetFalsePositiveFlag',
                        'koi_fpflag_ec': 'EphemerisMatchIndicatesContaminationFalsePositiveFlag',
                        'koi_period': 'OrbitalPeriod.days',
                        'koi_period_err1': 'OrbitalPeriodUpperUnc.days',
                        'koi_period_err2': 'OrbitalPeriodLowerUnc.days',
                        'koi_time0bk': 'TransitEpoch.BKJD',
                        'koi_time0bk_err1': 'TransitEpochUpperUnc.BKJD',
                        'koi_time0bk_err2': 'TransitEpochLowerUnc.BKJD',
                        'koi_impact': 'ImpactParamete',
                        'koi_impact_err1': 'ImpactParameterUpperUnc',
                        'koi_impact_err2': 'ImpactParameterLowerUnc',
                        'koi_duration': 'TransitDuration.hrs',
                        'koi_duration_err1': 'TransitDurationUpperUnc.hrs',
                        'koi_duration_err2': 'TransitDurationLowerUnc.hrs',
                        'koi_depth': 'TransitDepth.ppm',
                        'koi_depth_err1': 'TransitDepthUpperUnc.ppm',
                        'koi_depth_err2': 'TransitDepthLowerUnc.ppm',
                        'koi_prad': 'PlanetaryRadius.Earthradii',
                        'koi_prad_err1': 'PlanetaryRadiusUpperUnc.Earthradii',
                        'koi_prad_err2': 'PlanetaryRadiusLowerUnc.Earthradii',
                        'koi_teq': 'EquilibriumTemperature.K',
                        'koi_teq_err1': 'EquilibriumTemperatureUpperUnc.K',
                        'koi_teq_err2': 'EquilibriumTemperatureLowerUnc.K',
                        'koi_insol': 'InsolationFlux.Earthflux',
                        'koi_insol_err1': 'InsolationFluxUpperUnc.Earthflux',
                        'koi_insol_err2': 'InsolationFluxLowerUnc.Earthflux',
                        'koi_model_snr': 'TransitSignal-to-Nois',
                        'koi_tce_plnt_num': 'TCEPlanetNumbe',
                        'koi_tce_delivname': 'TCEDeliver',
                        'koi_steff': 'StellarEffectiveTemperature.K',
                        'koi_steff_err1': 'StellarEffectiveTemperatureUpperUnc.K',
                        'koi_steff_err2': 'StellarEffectiveTemperatureLowerUnc.K',
                        'koi_slogg': 'StellarSurfaceGravity.log10(cm/s**2)',
                        'koi_slogg_err1': 'StellarSurfaceGravityUpperUnc.log10(cm/s**2)',
                        'koi_slogg_err2': 'StellarSurfaceGravityLowerUnc.log10(cm/s**2)',
                        'koi_srad': 'StellarRadius.Solarradii',
                        'koi_srad_err1': 'StellarRadiusUpperUnc.Solarradii',
                        'koi_srad_err2': 'StellarRadiusLowerUnc.Solarradii',
                        'ra': 'RA.decimaldegrees',
                        'dec': 'Decdecimaldegrees',
                        'koi_kepmag': 'Kepler-band.mag'
                        })
df.head()

In [None]:
# 2. Transform the target variable: 1 for CANDIDATE, 0 otherwise
df['ExoplanetCandidate'] = df['DispositionUsingKeplerData'].apply(lambda x: 1 if x == 'CANDIDATE' else 0)
#df['ExoplanetConfirmed'] = df['ExoplanetArchiveDisposition'].apply(lambda x: 2 if x == 'CONFIRMED' else 1 if x == 'CANDIDATE' else 0 )
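
An equivalent way to build the binary label is a vectorized comparison, which avoids calling a Python lambda once per row. The sketch below uses a small made-up Series rather than the real data, and also computes the class share, which is worth checking before training a classifier.

```python
import pandas as pd

# Made-up disposition labels for illustration.
disposition = pd.Series([
    "CANDIDATE", "FALSE POSITIVE", "CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE",
])

# Row-wise version, matching the lambda used in the notebook cell.
label_apply = disposition.apply(lambda x: 1 if x == "CANDIDATE" else 0)

# Vectorized version: boolean comparison cast to integers.
label_vectorized = (disposition == "CANDIDATE").astype(int)

# Fraction of rows in each class.
class_share = label_vectorized.value_counts(normalize=True)
print(class_share)
```

Both versions produce the same labels; the vectorized form is usually faster on large tables.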

In [None]:
# 3. Drop identifier, disposition, and flag columns that are not used as features
df.drop(columns=['KeplerName', 'KOIName', 'EquilibriumTemperatureUpperUnc.K',
                 'KepID', 'ExoplanetArchiveDisposition', 'DispositionUsingKeplerData',
                 'NotTransit-LikeFalsePositiveFlag', 'koi_fpflag_ss', 'CentroidOffsetFalsePositiveFlag',
                 'EphemerisMatchIndicatesContaminationFalsePositiveFlag', 'TCEDeliver',
                 'EquilibriumTemperatureLowerUnc.K'], inplace=True)

In [None]:
#4. Handling Missing values
df.dropna(inplace=True)

In [None]:
# 5. Remove duplicate rows
df.drop_duplicates(inplace=True)
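
Because `dropna` and `drop_duplicates` can remove many rows silently, it may be useful to record how much data each step discards. The following is a minimal sketch on a made-up frame (not the real dataset) that counts missing values and duplicates and tracks the row count through both steps.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing value and one exact duplicate.
toy = pd.DataFrame({
    "OrbitalPeriod.days": [9.5, 19.9, 19.9, np.nan],
    "ExoplanetCandidate": [1, 0, 0, 1],
})

missing_per_column = toy.isna().sum()          # number of NaN values in each column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

rows_before = len(toy)
after_dropna = toy.dropna()                    # remove rows with any missing value
after_dedup = after_dropna.drop_duplicates()   # remove exact repeats

print(missing_per_column)
print(f"duplicates: {duplicate_rows}")
print(f"rows: {rows_before} -> {len(after_dropna)} -> {len(after_dedup)}")
```

The same three counts can be printed for `df` before the two cleaning cells to confirm how many rows remain for modeling.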