# Concluding the Exoplanet Composite Discovery Method Predictor

## Preprocessing

### In the preprocessing ipynb, I started by using pandas to read the raw csv, made a copy and displayed the raw csv info
### Using a missing value threshold of 0, I removed every column that exceeded it, meaning any column with at least one missing value. A threshold of 0 made the most sense because it simplifies model training and reduces problem complexity.
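### A minimal sketch of this column filter, assuming the threshold is the allowed fraction of missing values per column. The toy table and the names `raw`, `threshold` and `filtered` are illustrative, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw archive table (illustrative columns and values)
raw = pd.DataFrame({
    "pl_name": ["planet a", "planet b", "planet c"],
    "sy_pnum": [1, 2, 3],
    "pl_orbper": [3.5, np.nan, 12.1],  # contains one missing value
})

threshold = 0.0  # assumed meaning: maximum allowed fraction of missing values

# Fraction of missing values in each column
missing_fraction = raw.isna().mean()

# Keep only the columns whose missing fraction does not exceed the threshold
filtered = raw.loc[:, missing_fraction <= threshold].copy()

print(filtered.columns.tolist())  # ['pl_name', 'sy_pnum']
```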
### After reducing the number of features, I used the exoplanet archive column mapping csv to map the table name (ex: pl_name) to the Description (ex: Planet name) using a dictionary
### I looped over the 358 database column names and their definitions in the mapping file; whenever the database column name in row i matched a column name in the filtered raw data, I added that name (key) and its definition (value) to the dictionary, stripping leading and trailing whitespace
### Used a boolean check to confirm that every database column name in the data was mapped to its definition
### Then I replaced the raw data column names with the corresponding dictionary values (the descriptions)
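### The mapping steps above can be sketched as follows. The mapping-file headers ("Database Column Name", "Description") and the toy rows are assumptions for illustration; the real mapping csv may use different headers.

```python
import pandas as pd

# Hypothetical excerpt of the column mapping file (headers are assumed)
mapping = pd.DataFrame({
    "Database Column Name": [" pl_name ", "hostname", "sy_pnum", "pl_orbper"],
    "Description": ["Planet name ", " Host name", "Number of Planets", "Orbital Period"],
})

# Toy stand-in for the filtered raw data
filtered = pd.DataFrame({"pl_name": ["x"], "hostname": ["y"], "sy_pnum": [1]})

# Build {database column name: description} for columns present in the data
column_names = {}
for i in range(len(mapping)):
    key = mapping.loc[i, "Database Column Name"].strip()
    if key in filtered.columns:
        column_names[key] = mapping.loc[i, "Description"].strip()

# Boolean check: every column in the data received a description
all_mapped = all(col in column_names for col in filtered.columns)

# Replace the database names with their descriptions
renamed = filtered.rename(columns=column_names)
print(all_mapped, renamed.columns.tolist())
```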
### Checkpoint where we create a copy of the current dataset we are working with
### Then I checked unique values in the "Discovery Method" column in the dataset to get an idea of the discovery methods possible
### Using value_counts, I got an idea of the frequency of each discovery method in the column. Transit was by far the most frequent
### Some of the discovery methods only had 10 instances, so I used SMOTE (synthetic minority oversampling technique) to synthetically generate instances of the minority class for a better class balance. This is essential for machine learning models to properly learn how to predict the discovery methods and not just predict the most frequent discovery method. 
### Realized that Discovery Facility, Discovery Telescope and Discovery Instrument have 70-90 unique values EACH, so encoding them was deemed infeasible
### So I dropped them along with Planet name and Host name, which would be irrelevant to model training
### Then I removed all the one-hot encoded "Detected by..." columns except for "Detected by Transits"
### This is because if "Detected by Transits" is 0, we already know that it has to be one of the other 10 discovery methods
### Before doing that, I removed instances of exoplanets that were found by more than one discovery method for more straightforward model training
### This was done by summing the target dummies for each exoplanet; a sum greater than 1 means the exoplanet was discovered by more than one method, warranting its removal
### After checking the new observation count for each of the "Detected by..." methods, we dropped the unnecessary "Detected by..." columns
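### A small sketch of the two steps above: dropping exoplanets flagged by more than one method, then keeping only the transit flag. The toy rows and the non-transit column names ("Detected by Radial Velocity", "Detected by Imaging") are illustrative assumptions.

```python
import pandas as pd

# Toy data; only "Detected by Transits" is a column name confirmed by the text
df = pd.DataFrame({
    "Detected by Transits": [1, 0, 1, 0],
    "Detected by Radial Velocity": [0, 1, 1, 0],
    "Detected by Imaging": [0, 0, 0, 1],
    "Number of Planets": [1, 2, 3, 1],
})

detected_cols = [c for c in df.columns if c.startswith("Detected by")]

# A row sum greater than 1 means more than one discovery method
method_count = df[detected_cols].sum(axis=1)
single_method = df[method_count <= 1].copy()

# Keep only the transit flag as the binary target
other_flags = [c for c in detected_cols if c != "Detected by Transits"]
single_method = single_method.drop(columns=other_flags)

print(single_method)
```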
### Checking the ratio of "Detected by Transits" to the total number of instances, I saw that transits made up roughly 71% of the discovery method instances
### Imported SMOTE and specified my features (all variables except "Detected by Transits") and targets ("Detected by Transits")
### The new ratio was 50%, implying SMOTE worked as intended (50% exoplanets discovered by Transits, 50% were not)
### Then I proceeded with standardization. I set aside the columns that should not be standardized (Discovery Year, Circumbinary Flag, Controversial Flag) and added them back after standardizing
### Imported the StandardScaler class from sklearn.preprocessing and fit the scaler to all the unscaled features.
### Then applied scaler.transform to all the unscaled features besides the ones excluded above. Checked the shapes of the features, the targets and the excluded variables; their row counts must all match
### Since StandardScaler returns a NumPy array, the scaled features had to be converted back to a pandas DataFrame before they could be combined with the excluded variables. Then I added the excluded variables to this new DataFrame of scaled features
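### The standardization steps can be sketched like this. The toy values are made up, and the sketch assumes the scaler is fit and applied on the same set of columns (everything except the three excluded ones).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature table (values are illustrative)
features = pd.DataFrame({
    "Galactic Latitude deg": [10.0, -20.0, 35.0, 5.0],
    "Number of Planets": [1, 2, 1, 4],
    "Discovery Year": [2009, 2014, 2016, 2020],
    "Circumbinary Flag": [0, 0, 1, 0],
    "Controversial Flag": [0, 1, 0, 0],
})

excluded_cols = ["Discovery Year", "Circumbinary Flag", "Controversial Flag"]
excluded = features[excluded_cols]
to_scale = features.drop(columns=excluded_cols)

scaler = StandardScaler()
scaler.fit(to_scale)
scaled_array = scaler.transform(to_scale)  # NumPy array, column names are lost

# Rebuild a DataFrame with the original column names and index so rows align
scaled = pd.DataFrame(scaled_array, columns=to_scale.columns, index=to_scale.index)

# Row counts must match before the excluded columns are added back
assert scaled.shape[0] == excluded.shape[0]
combined = pd.concat([scaled, excluded], axis=1)
print(combined.shape)
```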
### Converted discovery year to be treated as a categorical variable for the purposes of this project
### One last checkpoint (creating copy of the current dataframe we are working with) 
### Finally export the preprocessed dataframe and turn it into a csv to be saved in the current directory

## XGBoost ALL Features

### Why XGB?

### After the initial preprocessing I got started with the first model I deployed for this analysis
### XGBoost was a great first option because it is a powerful implementation of gradient boosting algorithms designed for tabular data
#### XGBoost captures complex patterns in data by combining the predictions of multiple weak learners (typically decision trees)
### XGB also provides clear metrics for feature importance, giving a better understanding of which features most influence the model's predictions. This is useful for my task of predicting exoplanet discovery methods and finding out which features drove those predictions
### Additionally, XGB handles imbalanced data well. Although we addressed the imbalance in the preprocessing, it is still beneficial that XGB has techniques for imbalanced data, such as scale_pos_weight, which assigns more weight to the positive class in a binary problem so that an underrepresented positive class is predicted better
### XGB is also very fast and high performing, as it handles sparse data, uses parallel processing and regularization techniques that prevent overfitting 
### XGB has flexibility with the feature types, including categorical and continuous features, without needing to extensively preprocess
### Finally, XGBoost has been widely adopted in scientific fields like astronomy for its accuracy and robustness in both classification and regression tasks. It is very handy in managing non-linear relationships and interactions between features 

### By including all the features in this XGB model, I aimed to reduce the dimensionality of the problem by identifying which features weighed the most according to the model
### First, I imported the preprocessed data using pandas and set up my train test split, features and target (the features are all columns except the target variable we are predicting, "Detected by Transits")
### Then I inspected the feature correlations to check for any moderate to high correlation between variables, which could interfere with model training and performance. Had I found any notable correlation, I would have removed one of the two correlated features
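### One way to run such a correlation check is sketched below. The feature names, the synthetic data and the 0.5 cutoff for "moderate to high" are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=200)

# Synthetic features: feature_b is almost a rescaled copy of feature_a
X = pd.DataFrame({
    "feature_a": base,
    "feature_b": base * 2.0 + rng.normal(scale=0.1, size=200),
    "feature_c": rng.normal(size=200),
})

# Absolute Pearson correlation between every pair of features
corr = X.corr().abs()

# Keep the upper triangle only so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

cutoff = 0.5  # assumed cutoff, not a value from the notebook
flagged = pairs[pairs > cutoff]
print(flagged)
```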
### After this I set a random seed for reproducibility using the random and numpy libraries. This means that every time I set the random state to 42, the data is shuffled in the same pseudo-random way
### Then I actually split the data into training, testing and validation (80% training, 10% testing, 10% validation). This was achieved through 2 separate splits (one to get 80:20, then to get that 20 into 10:10)
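### A sketch of the seeding and the two-stage 80/10/10 split using sklearn's train_test_split. The toy data frame is synthetic; only the target name "Detected by Transits" comes from the text.

```python
import random

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Synthetic stand-in for the preprocessed data
data = pd.DataFrame({
    "Number of Planets": np.random.randint(1, 6, size=100),
    "Galactic Latitude deg": np.random.uniform(-90, 90, size=100),
    "Detected by Transits": np.random.randint(0, 2, size=100),
})
X = data.drop(columns="Detected by Transits")
y = data["Detected by Transits"]

# First split: 80% training, 20% held out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
# Second split: the held-out 20% becomes 10% validation and 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=SEED
)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```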
### Following the split, I used the XGBClassifier class from the xgboost library to create my model, with logloss as the evaluation metric. logloss is appropriate for binary classification because it scores the predicted probabilities and heavily penalizes confident wrong predictions
### Since XGB cannot have "[" or "]" or "<" or ">" characters in the feature names, I used a lambda function to replace these characters with empty strings
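### A minimal sketch of the renaming, assuming a regular expression inside the lambda; the notebook may have chained str.replace calls instead. The bracketed column name is an illustrative example.

```python
import re

import pandas as pd

# Toy training features with a bracketed unit in one column name
X_train = pd.DataFrame({"Galactic Latitude [deg]": [10.0], "Number of Planets": [2]})

# Remove "[", "]", "<" and ">" from every column name
X_train_clean = X_train.rename(columns=lambda name: re.sub(r"[\[\]<>]", "", name))

print(X_train_clean.columns.tolist())  # ['Galactic Latitude deg', 'Number of Planets']
```

### The same function has to be applied to the validation and test features so that all three sets share identical column names.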
### Then I fit the model on the cleaned training data, using the cleaned validation data for evaluation, with early stopping set to 10 so that training stops if the validation logloss does not improve for 10 consecutive boosting rounds. This prevents overfitting and makes the overall process more efficient
### After fitting the model, I used .predict method on the cleaned test set, so I can see what predictions the model makes on each instance with my own eyes
### Performance was evaluated using a classification report, accuracy score, ROC AUC score, and a confusion matrix. The evaluation compares the model's predictions against y_test (the actual values of whether or not each exoplanet was discovered by transits)
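### The four metrics can be computed with sklearn.metrics as sketched below. The label arrays are made up, and passing hard class predictions to roc_auc_score is an assumption based on the use of .predict above; the score is more commonly computed from predicted probabilities.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Made-up true labels and predictions (1 = discovered by transits)
y_test = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 1])

accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
report = classification_report(y_test, y_pred)

print(accuracy)
print(cm)
print(report)
```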
### The XGB model with all features had an accuracy of a whopping 96.42% on the test set, with an ROC AUC of 96.46%
### Right after the evaluation metrics, I looked at the K-fold cross validation score using the sklearn library, with 5 shuffled folds. I calculated the average cross validation accuracy (96.77%) and the average cross validation ROC AUC (99.3%)
### I wanted to see which features the model deemed important, so I made a feature importance table where the importance type is the WEIGHT of the feature, i.e. the number of times the feature is used to split the data across all trees
### After extracting the weight of each feature from the fitted xgboost model using .get_booster().get_score(importance_type='weight'), and the feature names from the training features, I created a pandas dataframe with the Features and their corresponding Importance
### First I had to turn the feature_importance items into a list, set the column names, and sort the values by Importance in descending order
### I then plotted this feature importance table using matplotlib, and used xgb.plot_importance to plot directly from the xgb model
### This plot shows the F score, which here is the weight described above (the number of splits that use the feature). A higher F score indicates the feature is deemed more important by the model
### For the rest of my models, to reduce dimensionality, I only included features that had an F score of 30 or above. These being: 
| Index | Feature | Importance |
|---|---|---|
| 4 | Ecliptic Latitude deg | 271.0 |
| 2 | Galactic Latitude deg | 250.0 |
| 3 | Galactic Longitude deg | 223.0 |
| 5 | Ecliptic Longitude deg | 201.0 |
| 11 | Discovery Year | 132.0 |
| 1 | Number of Planets | 78.0 |
| 6 | Number of Photometry Time Series | 32.0 |

### With this revised list, I had a better idea of the features I wanted to train the rest of my models with