# Because the models performed poorly when predicting the discovery method out of 10 methods, I am changing the scope of the question
## Instead of classifying 10 different discovery methods, we will fit a logistic regression to determine whether a given exoplanet was found through TRANSIT (the majority discovery method class)
## This is a response to the heavily imbalanced classes. By merging every discovery method except the majority class into a single "other" class, training a good model becomes more straightforward, and we can still see which features carry the most weight in predicting whether an exoplanet was discovered by "transit" or not

In [2]:
import pandas as pd

composite_preprocessed = pd.read_csv('Composite_preprocessed_NO_MV.csv')
# show every column and row when displaying DataFrames
pd.options.display.max_columns = None
pd.options.display.max_rows = None
composite_preprocessed.head()

Unnamed: 0,Number of Stars,Number of Planets,Number of Moons,Circumbinary Flag,Discovery Year,Detected by Radial Velocity Variations,Detected by Pulsar Timing Variations,Detected by Pulsation Timing Variations,Detected by Transits,Detected by Astrometric Variations,Detected by Orbital Brightness Modulations,Detected by Microlensing,Detected by Eclipse Timing Variations,Detected by Imaging,Detected by Disk Kinematics,Controversial Flag,Galactic Latitude [deg],Galactic Longitude [deg],Ecliptic Latitude [deg],Ecliptic Longitude [deg],Number of Photometry Time Series,Number of Radial Velocity Time Series,Number of Stellar Spectra Measurements,Number of Emission Spectroscopy Measurements,Number of Transmission Spectroscopy Measurements
0,2,1,0,0,2007,1,0,0,0,0,0,0,0,0,0,0,78.28058,264.13775,18.33392,177.4179,1,2,0,0,0
1,1,1,0,0,2009,1,0,0,0,0,0,0,0,0,0,0,41.04437,108.719,74.95821,141.64699,1,1,0,0,0
2,1,1,0,0,2008,1,0,0,0,0,0,0,0,0,0,0,-21.05141,106.41269,38.22901,11.95935,1,1,0,0,0
3,1,2,0,0,2002,1,0,0,0,0,0,0,0,0,0,0,46.94447,69.16849,62.87885,223.24717,1,4,1,0,0
4,3,1,0,0,1996,1,0,0,0,0,0,0,0,0,0,0,13.20446,83.33558,69.46803,321.21176,1,4,3,0,0


In [12]:
target_dummies = ['Detected by Radial Velocity Variations',
           'Detected by Pulsar Timing Variations',
           'Detected by Pulsation Timing Variations',
           'Detected by Transits',
           'Detected by Astrometric Variations',
           'Detected by Orbital Brightness Modulations',
           'Detected by Microlensing',
           'Detected by Eclipse Timing Variations',
           'Detected by Imaging',
           'Detected by Disk Kinematics']

In [13]:
# keep only rows with at most one discovery method flagged, for more straightforward training
# note: `<= 1` also keeps rows where no method is flagged

composite_preprocessed = composite_preprocessed[composite_preprocessed[target_dummies].sum(axis=1) <= 1]
composite_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4511 entries, 0 to 5601
Data columns (total 25 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Number of Stars                                   4511 non-null   int64  
 1   Number of Planets                                 4511 non-null   int64  
 2   Number of Moons                                   4511 non-null   int64  
 3   Circumbinary Flag                                 4511 non-null   int64  
 4   Discovery Year                                    4511 non-null   int64  
 5   Detected by Radial Velocity Variations            4511 non-null   int64  
 6   Detected by Pulsar Timing Variations              4511 non-null   int64  
 7   Detected by Pulsation Timing Variations           4511 non-null   int64  
 8   Detected by Transits                              4511 non-null   int64  
 9   Detected by Astrometric 

In [6]:
# get an idea of the observation count for each discovery method
composite_preprocessed[target_dummies].sum(axis=0)

Detected by Radial Velocity Variations         982
Detected by Pulsar Timing Variations             7
Detected by Pulsation Timing Variations          2
Detected by Transits                          3218
Detected by Astrometric Variations               2
Detected by Orbital Brightness Modulations       5
Detected by Microlensing                       210
Detected by Eclipse Timing Variations           17
Detected by Imaging                             56
Detected by Disk Kinematics                      1
dtype: int64
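
As a sanity check on the filter, the per-method counts can be compared against the row count. The sketch below hard-codes the counts printed above and the 4511 rows reported by `info()`. It assumes both outputs describe the same filtered frame, which the out-of-order cell numbers do not guarantee. Under that assumption, every remaining row has at most one flag set, so any shortfall is the number of rows with no discovery method flagged. Those rows would end up in the "not transit" class.

```python
# Counts copied from the output above; 4511 is the row count reported by info().
method_counts = {
    "Radial Velocity Variations": 982,
    "Pulsar Timing Variations": 7,
    "Pulsation Timing Variations": 2,
    "Transits": 3218,
    "Astrometric Variations": 2,
    "Orbital Brightness Modulations": 5,
    "Microlensing": 210,
    "Eclipse Timing Variations": 17,
    "Imaging": 56,
    "Disk Kinematics": 1,
}
n_rows = 4511

# Each row carries at most one flag, so the flag total counts flagged rows.
n_flagged = sum(method_counts.values())
n_unflagged = n_rows - n_flagged
print(n_flagged, n_unflagged)  # 4500 11
```

If those unflagged rows are not wanted in the negative class, filtering with `== 1` instead of `<= 1` would exclude them, at the cost of changing the counts shown in the rest of the notebook.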

In [18]:
# a binary logistic regression needs a two-class target, so every discovery method besides "transits" is merged into one class
# we can do this by dropping every 'Detected by...' column except for transits

dummies_to_drop = ['Detected by Radial Velocity Variations',
           'Detected by Pulsar Timing Variations',
           'Detected by Pulsation Timing Variations',
           'Detected by Astrometric Variations',
           'Detected by Orbital Brightness Modulations',
           'Detected by Microlensing',
           'Detected by Eclipse Timing Variations',
           'Detected by Imaging',
           'Detected by Disk Kinematics']

targets = composite_preprocessed['Detected by Transits']
# (number of transit discoveries, total number of observations)
targets.sum(), targets.shape[0]

(3218, 4511)

In [19]:
composite_dummies_dropped = composite_preprocessed.drop(dummies_to_drop, axis=1)
composite_dummies_dropped.head()

Unnamed: 0,Number of Stars,Number of Planets,Number of Moons,Circumbinary Flag,Discovery Year,Detected by Transits,Controversial Flag,Galactic Latitude [deg],Galactic Longitude [deg],Ecliptic Latitude [deg],Ecliptic Longitude [deg],Number of Photometry Time Series,Number of Radial Velocity Time Series,Number of Stellar Spectra Measurements,Number of Emission Spectroscopy Measurements,Number of Transmission Spectroscopy Measurements
0,2,1,0,0,2007,0,0,78.28058,264.13775,18.33392,177.4179,1,2,0,0,0
1,1,1,0,0,2009,0,0,41.04437,108.719,74.95821,141.64699,1,1,0,0,0
2,1,1,0,0,2008,0,0,-21.05141,106.41269,38.22901,11.95935,1,1,0,0,0
3,1,2,0,0,2002,0,0,46.94447,69.16849,62.87885,223.24717,1,4,1,0,0
4,3,1,0,0,1996,0,0,13.20446,83.33558,69.46803,321.21176,1,4,3,0,0


In [21]:
targets.sum() / targets.shape[0]

0.7133673243183329

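## 71% of all target observations are 1 (discovered by transits)
## a logistic regression trained on imbalanced classes tends to favor the majority class, so we bring the ratio closer to 50/50
## to do that we apply SMOTE (Synthetic Minority Over-sampling Technique), which synthesizes new observations for the minority class (here, not discovered by transits)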
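
To make the idea behind SMOTE concrete, here is a minimal standard-library sketch of its core step: pick a minority row, pick one of its k nearest minority neighbors, and place a new row at a random point on the segment between them. The function names `nearest_neighbors` and `synthesize` and the toy data are hypothetical, written only for illustration. This is not imblearn's implementation, which differs in its details.

```python
import math
import random


def nearest_neighbors(points, index, k):
    """Indices of the k points closest (Euclidean) to points[index], excluding itself."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


def synthesize(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))                  # random minority row
        j = rng.choice(nearest_neighbors(minority, i, k))  # one of its neighbors
        lam = rng.random()                                # weight in [0, 1)
        base, neighbor = minority[i], minority[j]
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with two hypothetical numeric features.
minority_rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.3)]
new_rows = synthesize(minority_rows, n_new=3, k=2)
print(new_rows)
```

Because every synthetic row is a blend of two real rows, it always falls within the range of the existing minority data. The same interpolation can produce fractional values for flag or count columns, which is worth keeping in mind when interpreting the resampled features.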

In [26]:
from imblearn.over_sampling import SMOTE

# the features are all the columns except our target column
features = composite_dummies_dropped.drop(['Detected by Transits'], axis=1)
# fixing random_state makes the synthetic samples reproducible across runs
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(features, targets)

In [27]:
y_resampled.sum() / y_resampled.shape[0]

0.5

### A ratio of exactly 0.5 confirms that the resampled classes are balanced
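
The arithmetic behind the balanced ratio is short. Assuming imblearn's default `sampling_strategy` (oversample the minority class until it matches the majority), the class counts from the cells above determine how many synthetic rows are created and how large the resampled data becomes:

```python
n_total = 4511     # rows before resampling
n_transit = 3218   # majority class (target == 1)

n_other = n_total - n_transit        # minority class (target == 0)
n_synthetic = n_transit - n_other    # rows SMOTE must create to match the majority
n_resampled = n_total + n_synthetic  # rows after resampling
ratio = n_transit / n_resampled      # share of transit rows after resampling

print(n_other, n_synthetic, n_resampled, ratio)  # 1293 1925 6436 0.5
```

Under that assumption, roughly 60% of the non-transit rows in the resampled data (1925 of 3218) are synthetic rather than observed.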

In [28]:
# the row count rose from 4511 to 6436 (2 × 3218), confirming that minority-class samples were synthesized
x_resampled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6436 entries, 0 to 6435
Data columns (total 15 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Number of Stars                                   6436 non-null   int64  
 1   Number of Planets                                 6436 non-null   int64  
 2   Number of Moons                                   6436 non-null   int64  
 3   Circumbinary Flag                                 6436 non-null   int64  
 4   Discovery Year                                    6436 non-null   int64  
 5   Controversial Flag                                6436 non-null   int64  
 6   Galactic Latitude [deg]                           6436 non-null   float64
 7   Galactic Longitude [deg]                          6436 non-null   float64
 8   Ecliptic Latitude [deg]                           6436 non-null   float64
 9   Ecliptic Longitude 

# Select features for logistic regression

In [None]:
unscaled_features = 