# Voyager for Exoplanets: Mining on Imbalanced Data

In [1]:
import voyager

In [2]:
voyager.MultiUnderSamplerEstimator.__dict__

mappingproxy({'__dict__': <attribute '__dict__' of 'MultiUnderSamplerEstimator' objects>,
              '__doc__': None,
              '__init__': <function voyager.MultiUnderSamplerEstimator.__init__>,
              '__module__': 'voyager',
              '__weakref__': <attribute '__weakref__' of 'MultiUnderSamplerEstimator' objects>,
              'fit': <function voyager.MultiUnderSamplerEstimator.fit>,
              'predict': <function voyager.MultiUnderSamplerEstimator.predict>,
              'predict_proba': <function voyager.MultiUnderSamplerEstimator.predict_proba>})

## Ensemble of under-sampled SVMs
- A single round of under-sampling gives results that depend on which majority-class samples happen to be drawn.
- Over-sampling, such as SMOTE, risks overfitting the model.
- Ensemble:
    - The idea is borrowed from bagging.
    - We under-sample the dataset and train an SVM on the sampled data. We repeat this 100 times.
    - Each of the 100 resulting models is scored by its AUC under 10-fold cross-validation.
    - The higher a model's score, the more weight it receives when the models' outputs are combined into the final result.
    - Predict:
        - If probability = False, the result is a binary 0/1 label: 1 if the combined score is > 0.5, otherwise 0.
        - If probability = True, the result is the probability itself.
        - We can then choose the probability threshold that gives the results we want.
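
The steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the code inside voyager: the function names `undersample`, `combine` and `predict` are hypothetical, the base learner (an SVM in this notebook) is left out, and each ensemble member is represented only by its positive-class probability and a weight standing in for its cross-validated AUC.

```python
import random


def undersample(labels, rng):
    """Return indices of a balanced subset: every minority-class sample plus
    an equally sized random draw (without replacement) from the majority."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


def combine(member_probs, weights):
    """Weighted average of the members' positive-class probabilities.
    The weights are assumed to be per-model AUC scores."""
    return sum(w * p for w, p in zip(weights, member_probs)) / sum(weights)


def predict(member_probs, weights, probability=False, threshold=0.5):
    """Return the combined probability, or a 0/1 label when probability=False."""
    p = combine(member_probs, weights)
    if probability:
        return p
    return 1 if p > threshold else 0


# Toy usage: 5 positives among 100 samples, three ensemble members.
labels = [1] * 5 + [0] * 95
subset = undersample(labels, random.Random(0))   # 10 indices, 5 per class
label = predict([0.9, 0.4, 0.2], [0.95, 0.80, 0.60])
```

In a full version, each call to `undersample` would feed one SVM fit, and the fitted model's AUC would become its entry in `weights`.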

In [3]:
voyager.kepler_train_test()

Accuracy on whether exoplanet or not: 0.99
Accuracy on test: 0.98
kepler_train_test execution time: 1.49 s


- Here, we split the Kepler candidate dataset into training and test sets (70%/30%).
- This measures the accuracy of classifying objects as exoplanets or non-exoplanets.
- The result is quite promising.
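
For readers who want to reproduce this kind of check on their own data, here is a minimal sketch of a 70%/30% split and an accuracy computation. It assumes a plain shuffled split. Whether `kepler_train_test` shuffles or stratifies is not shown in this notebook, and both helper names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and cut them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


train_idx, test_idx = train_test_split_indices(10)   # 7 train, 3 test
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])         # 3 of 4 correct
```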

In [4]:
voyager.kepler_candidate_dataset(probability = 0.99)

Accuracy on whether exoplanet or not: 0.99
75.22% of the candidates are comfirmed.
19.52% of the confirmed planets are habitable.
4.61% of the confirmed planets have 99.00% probability habitable.
kepler_candidate_dataset execution time: 9.74 s


Unnamed: 0,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_dicco_msky,koi_dicco_msky_err,koi_dikco_mra,koi_dikco_mra_err,koi_dikco_mdec,koi_dikco_mdec_err,koi_dikco_msky,koi_dikco_msky_err,habitable,pre_confirmed


- We can switch the ensemble on or off to make comparisons.
- First, we train a model to distinguish exoplanets from non-exoplanets and apply it to the candidates.
- Then, we train a second model on the confirmed data to distinguish habitable exoplanets.
- After that, we apply the second model to the pre-confirmed exoplanets (the candidates flagged by the first model) to see whether they are habitable.
- Note that we can adjust the probability threshold here to get different results.
- Since the predicted share of habitable planets is too high to be plausible, more work is needed, such as gathering more data or finding more discriminative attributes.
- The printed DataFrame output is not useful here.
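
The two-stage procedure and the three percentages in the output can be illustrated as follows. This is a sketch under assumptions: `two_stage_summary` is a hypothetical helper, the 0.5 cut-off for the "confirmed" and "habitable" shares is inferred from the ensemble description, and the use of `>=` for the high-confidence threshold is a guess.

```python
def two_stage_summary(exoplanet_probs, habitable_probs, threshold=0.99):
    """Summarise a two-stage prediction over the candidates.

    exoplanet_probs[i]: stage-1 probability that candidate i is an exoplanet.
    habitable_probs[i]: stage-2 probability that candidate i is habitable
                        (only looked at for pre-confirmed candidates).
    """
    confirmed = [i for i, p in enumerate(exoplanet_probs) if p > 0.5]
    habitable = [i for i in confirmed if habitable_probs[i] > 0.5]
    confident = [i for i in confirmed if habitable_probs[i] >= threshold]
    n_confirmed = len(confirmed) or 1   # avoid dividing by zero
    return {
        "confirmed_share": len(confirmed) / len(exoplanet_probs),
        "habitable_share": len(habitable) / n_confirmed,
        "confident_share": len(confident) / n_confirmed,
    }


# Toy usage with four candidates (made-up probabilities).
summary = two_stage_summary([0.9, 0.8, 0.2, 0.7], [0.995, 0.6, 0.99, 0.1])
```

Raising `threshold` can only shrink `confident_share`, which is why the share at 99% is much smaller than the share at the default cut-off.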

In [12]:
voyager.kepler_candidate_dataset(ensemble = False, probability = 0.99)

Accuracy on whether exoplanet or not: 0.99
75.22% of the candidates are comfirmed.
Recall: 0.87
26.91% of the confirmed planets are habitable.
7.87% of the confirmed planets have 99.00% probability habitable.
kepler_candidate_dataset execution time: 6.29 s


Unnamed: 0,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_dicco_msky,koi_dicco_msky_err,koi_dikco_mra,koi_dikco_mra_err,koi_dikco_mdec,koi_dikco_mdec_err,koi_dikco_msky,koi_dikco_msky_err,habitable,pre_confirmed


- This is just a run without the ensemble, to show that ensembles are more robust and reduce the side effects of random sampling.

In [5]:
voyager.confirmed_exoplanets_dataset_onebyone()

85.37% of the habitable planets are classified correctly.
confirmed_exoplanets_dataset_onebyone execution time: 178.92 s


- On the confirmed dataset, we hold out the habitable planets one by one.
- Each held-out planet is then used as the test sample to see whether it is classified correctly.
- With SMOTE, the result is near 0, which suggests that the ensemble of under-sampling works better.
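
The one-by-one evaluation is a leave-one-out loop restricted to the positive class. The sketch below shows the loop on made-up one-dimensional data. A toy nearest-centroid classifier stands in for the real model purely to keep the example self-contained, and all function names are hypothetical.

```python
def fit_centroids(X, y):
    """Toy stand-in for the real model: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_one(centroids, x):
    """Return the label of the closest class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def one_by_one_hit_rate(X, y, positive=1):
    """Hold out each positive sample in turn, train on everything else,
    and report the fraction of held-out positives classified correctly."""
    positives = [i for i, t in enumerate(y) if t == positive]
    hits = 0
    for held_out in positives:
        X_train = [x for i, x in enumerate(X) if i != held_out]
        y_train = [t for i, t in enumerate(y) if i != held_out]
        model = fit_centroids(X_train, y_train)
        hits += predict_one(model, X[held_out]) == positive
    return hits / len(positives)


# Made-up data: the last positive sits inside the negative cluster.
X = [[0.0], [0.2], [0.1], [0.3], [5.0], [5.2], [0.4]]
y = [0, 0, 0, 0, 1, 1, 1]
rate = one_by_one_hit_rate(X, y)
```

Because the model is retrained once per held-out planet, the cost grows with the number of positives, which fits the much longer run time of the ensemble version.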

In [11]:
voyager.confirmed_exoplanets_dataset_onebyone(ensemble = False)

Recall: 0.70
Recall: 0.75
Recall: 0.70
Recall: 0.72
Recall: 0.75
Recall: 0.80
Recall: 0.45
Recall: 0.68
Recall: 0.75
Recall: 0.72
Recall: 0.72
Recall: 0.62
Recall: 0.68
Recall: 0.88
Recall: 0.82
Recall: 0.80
Recall: 0.82
Recall: 0.78
Recall: 0.78
Recall: 0.75
Recall: 0.70
Recall: 0.62
Recall: 0.75
Recall: 0.70
Recall: 0.72
Recall: 0.62
Recall: 0.70
Recall: 0.80
Recall: 0.75
Recall: 0.78
Recall: 0.68
Recall: 0.68
Recall: 0.80
Recall: 0.72
Recall: 0.75
Recall: 0.75
Recall: 0.85
Recall: 0.68
Recall: 0.80
Recall: 0.78
Recall: 0.68
78.05% of the habitable planets are classified correctly.
confirmed_exoplanets_dataset_onebyone execution time: 6.37 s


- Without the ensemble, the result is still acceptable but less robust: it varies from one training run to the next.
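
To put a number on that variation, the 41 Recall values printed by the non-ensemble run above can be summarised with the standard library. The values are copied from that output; the variable names are only illustrative.

```python
import statistics

# Recall values printed by the non-ensemble run, in order.
recalls = [
    0.70, 0.75, 0.70, 0.72, 0.75, 0.80, 0.45, 0.68, 0.75, 0.72,
    0.72, 0.62, 0.68, 0.88, 0.82, 0.80, 0.82, 0.78, 0.78, 0.75,
    0.70, 0.62, 0.75, 0.70, 0.72, 0.62, 0.70, 0.80, 0.75, 0.78,
    0.68, 0.68, 0.80, 0.72, 0.75, 0.75, 0.85, 0.68, 0.80, 0.78,
    0.68,
]

spread = {
    "n": len(recalls),
    "min": min(recalls),
    "max": max(recalls),
    "mean": statistics.mean(recalls),
    "stdev": statistics.stdev(recalls),   # sample standard deviation
}
print(spread)
```

The gap between the minimum and maximum shows how much a single under-sampled model depends on the particular random draw.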