# This notebook uses combined BLS and PCA features generated from simulated TESS data to test how dropping features affects the score of an XGBoost classifier with default hyperparameters




In [106]:
import pandas as pd
from sklearn import metrics
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import cross_val_predict

# open Chelsea's combined feature file
# remove the last row to make the dimensions match with the raw LC file
data_combined_features = pd.read_csv("TESSfield_05h_01d_combinedfeatures.csv",
                                     header=0, index_col=0)
data_combined_features = data_combined_features.drop(data_combined_features.index[-1])

# drop the columns that aren't features and get targets 
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

y = data_combined_features['CombinedY']

The first methods we'll try are the "lazy" ones: we simply delete one feature at a time from the tail (or the head) of the feature list, regardless of its effect on the score.

In [107]:
xgb1 = XGBClassifier(objective='binary:logistic')



def modelfit(alg, X, y, cv_folds=4):
    
    # cross_val_predict uses StratifiedKFold automatically for a binary classifier
    # when cv is an integer. With the default method='predict', y_pred holds the
    # predicted class labels (0 or 1), not probabilities; pass
    # method='predict_proba' to score probabilities instead.
    # pr_auc is the average precision, which summarizes the precision-recall
    # curve as a step-wise sum and does not use the trapezoid rule
    y_pred = cross_val_predict(alg, X, y, cv=cv_folds)
    pr_auc = metrics.average_precision_score(y, y_pred)
    return pr_auc

def feature_testing(alg, X, y):
    score_list = []
    print('testing features linearly')
    X = X.copy()  # work on a copy so the caller's DataFrame is not emptied
    while len(X.columns) > 0:
        score = modelfit(alg, X, y)
        print(score)
        score_list.append(score)
        print('dropping {0}'.format(X.columns[-1]))
        X.drop(X.columns[-1], axis=1, inplace=True)
    print('the max score was {0}'.format(max(score_list)))
    

feature_testing(xgb1, X, y)

testing features linearly
0.699410990974
dropping P19
0.70257779817
dropping P18
0.701695051255
dropping P17
0.708411308736
dropping P16
0.705255657316
dropping P15
0.703905570515
dropping P14
0.699666604825
dropping P13
0.697122253662
dropping P12
0.693576606645
dropping P11
0.702092662611
dropping P10
0.714701020128
dropping P9
0.706627389138
dropping P8
0.695744191453
dropping P7
0.696236533057
dropping P6
0.691200028507
dropping P5
0.7007406319
dropping P4
0.693643577149
dropping P3
0.698521507342
dropping P2
0.710733512794
dropping P1
0.698922212224
dropping P0
0.698922212224
dropping BLS_SignaltoPinknoise_1_0
0.672466013053
dropping BLS_Whitenoise_1_0
0.67003291743
dropping BLS_Rednoise_1_0
0.675139052105
dropping BLS_Npointsaftertransit_1_0
0.665400745733
dropping BLS_Npointsbeforetransit_1_0
0.665400745733
dropping BLS_Ntransits_1_0
0.665400745733
dropping BLS_Npointsintransit_1_0
0.680849933295
dropping BLS_fraconenight_1_0
0.67003291743
dropping BLS_deltaChi2_1_0
0.6487081362
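
The scores printed above are average-precision values. As a reading aid, here is a minimal pure-Python sketch of the step-wise sum AP = sum over n of (R_n - R_{n-1}) * P_n that scikit-learn documents for `average_precision_score`. The function name `average_precision` and the toy inputs are assumptions made for illustration and are not part of the notebook; in practice the library routine should be preferred.

```python
def average_precision(y_true, y_score):
    """Step-wise average precision for 0/1 labels (illustrative sketch)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Sort sample indices by descending score.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    ap, tp, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(order):
        j = i
        # Samples that share a score form a single threshold.
        while j < len(order) and y_score[order[j]] == y_score[order[i]]:
            tp += y_true[order[j]]
            j += 1
        precision = tp / float(j)      # j samples are predicted positive
        recall = tp / float(n_pos)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


labels = [0, 0, 1, 1]
print(round(average_precision(labels, [0.1, 0.4, 0.35, 0.8]), 4))  # graded scores
print(average_precision(labels, [0, 1, 0, 1]))                     # hard 0/1 labels
```

With hard 0/1 predictions there are only two thresholds, so the sum has just two terms. That is the situation in `modelfit`, because `cross_val_predict` returns class labels unless `method='predict_proba'` is requested.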

Not bad: just by deleting features from the tail, the score in the output above rises from 0.699 with all features to a peak of 0.715 once P19 through P10 are gone. What if we go in the other direction?

In [108]:
# drop the columns that aren't features and get targets 
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

xgb1 = XGBClassifier(objective='binary:logistic')



def modelfit(alg, X, y, cv_folds=4):
    
    # cross_val_predict uses StratifiedKFold automatically for a binary classifier
    # when cv is an integer. With the default method='predict', y_pred holds the
    # predicted class labels (0 or 1), not probabilities; pass
    # method='predict_proba' to score probabilities instead.
    # pr_auc is the average precision, which summarizes the precision-recall
    # curve as a step-wise sum and does not use the trapezoid rule
    y_pred = cross_val_predict(alg, X, y, cv=cv_folds)
    pr_auc = metrics.average_precision_score(y, y_pred)
    return pr_auc

def feature_testing(alg, X, y):
    score_list = []
    print('testing features linearly')
    X = X.copy()  # work on a copy so the caller's DataFrame is not emptied
    while len(X.columns) > 0:
        score = modelfit(alg, X, y)
        print(score)
        score_list.append(score)
        print('dropping {0}'.format(X.columns[0]))
        X.drop(X.columns[0], axis=1, inplace=True)
    print('the max score was {0}'.format(max(score_list)))
    

feature_testing(xgb1, X, y)

testing features linearly
0.699410990974
dropping BLS_Period_1_0
0.7007406319
dropping BLS_Tc_1_0
0.708888928806
dropping BLS_SN_1_0
0.690747751006
dropping BLS_SR_1_0
0.677529510817
dropping BLS_SDE_1_0
0.69309749028
dropping BLS_Depth_1_0
0.703466408039
dropping BLS_Qtran_1_0
0.685181619696
dropping BLS_Qingress_1_0
0.686898037606
dropping BLS_OOTmag_1_0
0.688631705948
dropping BLS_i1_1_0
0.686568538615
dropping BLS_i2_1_0
0.684800265022
dropping BLS_deltaChi2_1_0
0.644127599675
dropping BLS_fraconenight_1_0
0.66559493405
dropping BLS_Npointsintransit_1_0
0.660582119336
dropping BLS_Ntransits_1_0
0.665502019408
dropping BLS_Npointsbeforetransit_1_0
0.66576554877
dropping BLS_Npointsaftertransit_1_0
0.484877534492
dropping BLS_Rednoise_1_0
0.452175682734
dropping BLS_Whitenoise_1_0
0.426157160814
dropping BLS_SignaltoPinknoise_1_0
0.174016767494
dropping P0
0.138532423404
dropping P1
0.158821904145
dropping P2
0.25659510827
dropping P3
0.196968527742
dropping P4
0.217706219381
droppin

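As is intuitive, the BLS features are much more important than the PCA features: with only the PCA features left, the score falls to about 0.17. Even so, dropping the first two features (BLS_Period_1_0 and BLS_Tc_1_0) still raises the score from 0.699 to 0.709.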
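
The two runs print only the maximum score, not where it occurs. A hedged sketch of the same head/tail deletion that also reports how many features had been dropped at the peak could look like this. It is library-free: `score_fn` is a hypothetical callable that maps a list of feature names to a score (in the notebook it might wrap `modelfit`), and `toy_score` with its integer weights is made up purely for the demonstration.

```python
def sequential_drop(features, score_fn, from_tail=True):
    """Score, drop one feature from one end, repeat; report the best point."""
    remaining = list(features)
    history = []  # (number of features dropped so far, score)
    n_dropped = 0
    while remaining:
        history.append((n_dropped, score_fn(remaining)))
        if from_tail:
            remaining.pop()     # remove the last feature
        else:
            remaining.pop(0)    # remove the first feature
        n_dropped += 1
    # max() returns the first maximum, i.e. the one with the fewest drops
    best_dropped, best_score = max(history, key=lambda pair: pair[1])
    return history, best_dropped, best_score


def toy_score(cols):
    # made-up scorer: two useful columns, every other column costs one point
    useful = {"signal_a": 40, "signal_b": 30}
    return sum(useful.get(c, -1) for c in cols)


features = ["signal_a", "signal_b", "noise_1", "noise_2"]
_, n_tail, score_tail = sequential_drop(features, toy_score, from_tail=True)
_, n_head, score_head = sequential_drop(features, toy_score, from_tail=False)
print(n_tail, score_tail)  # best point when deleting from the tail
print(n_head, score_head)  # best point when deleting from the head
```

Working on a plain list copy also avoids emptying the caller's feature table, which is why the notebook has to rebuild `X` at the start of every cell.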

Next, we'll remove a feature only if removing it increases the score.

In [72]:
# reuses modelfit and xgb1 from the cells above

# drop the columns that aren't features and get targets 
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

y = data_combined_features['CombinedY']


def feature_testing(alg, X, y):
    score_perm = modelfit(alg, X, y)  # baseline score with every feature present
    drop_list = []
    print('testing features linearly')
    X = X.copy()  # work on a copy so the caller's DataFrame is left intact
    for column in X.columns:
        df_temp = X.drop(column, axis=1)  # temporarily drop the feature
        score_temp = modelfit(alg, df_temp, y)  # score with the feature dropped
        print('The score with feature {0} is: {1}, the score without that feature is {2}'.format(
            column, score_perm, score_temp))

        if score_temp > score_perm:
            X.drop(column, axis=1, inplace=True)  # make the drop permanent
            score_perm = modelfit(alg, X, y)
            drop_list.append(column)

            print('The score is higher without feature {0} so it has been dropped'.format(column))
    print('the dropped features are: {0}'.format(drop_list))

feature_testing(xgb1, X, y)

testing features linearly
The score with feature BLS_Period_1_0 is: 0.699410990974, the score without that feature is 0.7007406319
The score is higher without feature BLS_Period_1_0 so it has been dropped
The score with feature BLS_Tc_1_0 is: 0.7007406319, the score without that feature is 0.708888928806
The score is higher without feature BLS_Tc_1_0 so it has been dropped
The score with feature BLS_SN_1_0 is: 0.708888928806, the score without that feature is 0.690747751006
The score with feature BLS_SR_1_0 is: 0.708888928806, the score without that feature is 0.6917438811
The score with feature BLS_SDE_1_0 is: 0.708888928806, the score without that feature is 0.7007406319
The score with feature BLS_Depth_1_0 is: 0.708888928806, the score without that feature is 0.701899562157
The score with feature BLS_Qtran_1_0 is: 0.708888928806, the score without that feature is 0.692558483229
The score with feature BLS_Qingress_1_0 is: 0.708888928806, the score without that feature is 0.7075874556

We can see that conditioning each drop on a score increase gives the best score yet, 0.721419055805.

We'll follow suit and try the same in reverse.

In [75]:
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

def feature_testing(alg, X, y):
    score_perm = modelfit(alg, X, y)  # baseline score with every feature present
    drop_list = []
    print('testing features linearly from right to left')
    X = X.copy()  # work on a copy so the caller's DataFrame is left intact
    for column in reversed(X.columns):
        df_temp = X.drop(column, axis=1)  # temporarily drop the feature
        score_temp = modelfit(alg, df_temp, y)  # score with the feature dropped
        print('The score with feature {0} is: {1}, the score without that feature is {2}'.format(
            column, score_perm, score_temp))

        if score_temp > score_perm:
            X.drop(column, axis=1, inplace=True)  # make the drop permanent
            score_perm = modelfit(alg, X, y)
            drop_list.append(column)

            print('The score is higher without feature {0} so it has been dropped'.format(column))
    print('the dropped features are: {0}'.format(drop_list))

feature_testing(xgb1, X, y)

testing features linearly from right to left
The score with feature P19 is: 0.699410990974, the score without that feature is 0.70257779817
The score is higher without feature P19 so it has been dropped
The score with feature P18 is: 0.70257779817, the score without that feature is 0.701695051255
The score with feature P17 is: 0.70257779817, the score without that feature is 0.697122253662
The score with feature P16 is: 0.70257779817, the score without that feature is 0.7007406319
The score with feature P15 is: 0.70257779817, the score without that feature is 0.692558483229
The score with feature P14 is: 0.70257779817, the score without that feature is 0.703905570515
The score is higher without feature P14 so it has been dropped
The score with feature P13 is: 0.703905570515, the score without that feature is 0.688955495642
The score with feature P12 is: 0.703905570515, the score without that feature is 0.7007406319
The score with feature P11 is: 0.703905570515, the score without that f

Not as good as last time, but an increase nonetheless. 

Let's try introducing a threshold, to see whether dropping only the features whose removal raises the score by more than a set amount yields better results.

In [110]:
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

def feature_testing(alg, X, y, threshold):
    score_perm = modelfit(alg, X, y)  # baseline score with every feature present
    drop_list = []
    print('testing features linearly based on threshold')
    X = X.copy()  # work on a copy so the caller's DataFrame is left intact
    for column in X.columns:
        df_temp = X.drop(column, axis=1)  # temporarily drop the feature
        score_temp = modelfit(alg, df_temp, y)  # score with the feature dropped
        print('The score with feature {0} is: {1}, the score without that feature is {2}'.format(
            column, score_perm, score_temp))

        if score_temp - score_perm > threshold:  # gain must exceed the threshold
            X.drop(column, axis=1, inplace=True)  # make the drop permanent
            score_perm = modelfit(alg, X, y)
            drop_list.append(column)

            print('The score is higher without feature {0} so it has been dropped'.format(column))
    print('the dropped features are: {0}'.format(drop_list))

feature_testing(xgb1, X, y, .005)

testing features linearly based on threshold
The score with feature BLS_Period_1_0 is: 0.699410990974, the score without that feature is 0.7007406319
The score with feature BLS_Tc_1_0 is: 0.699410990974, the score without that feature is 0.696236533057
The score with feature BLS_SN_1_0 is: 0.699410990974, the score without that feature is 0.689364968058
The score with feature BLS_SR_1_0 is: 0.699410990974, the score without that feature is 0.699410990974
The score with feature BLS_SDE_1_0 is: 0.699410990974, the score without that feature is 0.703466408039
The score with feature BLS_Depth_1_0 is: 0.699410990974, the score without that feature is 0.721974298407
The score is higher without feature BLS_Depth_1_0 so it has been dropped
The score with feature BLS_Qtran_1_0 is: 0.721974298407, the score without that feature is 0.6843419566
The score with feature BLS_Qingress_1_0 is: 0.721974298407, the score without that feature is 0.705737072347
The score with feature BLS_OOTmag_1_0 is: 0.7

As we see, introducing a threshold reduced the number of dropped features but had a bigger impact on the score, raising the maximum to 0.721974298407.

In [111]:
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

def feature_testing(alg, X, y, threshold):
    score_perm = modelfit(alg, X, y)  # baseline score with every feature present
    drop_list = []
    print('testing features linearly in reverse')
    X = X.copy()  # work on a copy so the caller's DataFrame is left intact
    for column in reversed(X.columns):
        df_temp = X.drop(column, axis=1)  # temporarily drop the feature
        score_temp = modelfit(alg, df_temp, y)  # score with the feature dropped
        print('The score with feature {0} is: {1}, the score without that feature is {2}'.format(
            column, score_perm, score_temp))

        if score_temp - score_perm > threshold:  # gain must exceed the threshold
            X.drop(column, axis=1, inplace=True)  # make the drop permanent
            score_perm = modelfit(alg, X, y)
            drop_list.append(column)

            print('The score is higher without feature {0} so it has been dropped'.format(column))
    print('the dropped features are: {0}'.format(drop_list))

feature_testing(xgb1, X, y, .005)

testing features linearly in reverse
The score with feature P19 is: 0.699410990974, the score without that feature is 0.70257779817
The score with feature P18 is: 0.699410990974, the score without that feature is 0.696762713372
The score with feature P17 is: 0.699410990974, the score without that feature is 0.688955495642
The score with feature P16 is: 0.699410990974, the score without that feature is 0.699410990974
The score with feature P15 is: 0.699410990974, the score without that feature is 0.698922212224
The score with feature P14 is: 0.699410990974, the score without that feature is 0.697122253662
The score with feature P13 is: 0.699410990974, the score without that feature is 0.698521507342
The score with feature P12 is: 0.699410990974, the score without that feature is 0.703466408039
The score with feature P11 is: 0.699410990974, the score without that feature is 0.702092662611
The score with feature P10 is: 0.699410990974, the score without that feature is 0.694387979516
The 

Going in the opposite direction once again increases the score, but not by as much as the forward pass did.
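
The four conditional variants tried so far differ only in the visiting direction and the threshold, so they could be folded into one helper. The following is a library-free sketch under stated assumptions: `score_fn` is a hypothetical callable mapping a list of feature names to a score (for example a wrapper around `modelfit`), `toy_score` is invented for the demonstration, and the score of an accepted trial is reused rather than recomputed, which is equivalent only if the scorer is deterministic.

```python
def conditional_drop(features, score_fn, threshold=0.0, reverse=False):
    """Visit each feature once; drop it if the score gain exceeds threshold."""
    kept = list(features)
    best = score_fn(kept)          # baseline with every feature present
    dropped = []
    order = list(reversed(features)) if reverse else list(features)
    for name in order:
        trial = [f for f in kept if f != name]
        if not trial:
            break                  # never score an empty feature set
        score = score_fn(trial)
        if score - best > threshold:
            kept, best = trial, score   # make the drop permanent
            dropped.append(name)
    return kept, dropped, best


def toy_score(cols):
    # made-up scorer: two useful columns, every other column costs one point
    useful = {"signal_a": 40, "signal_b": 30}
    return sum(useful.get(c, -1) for c in cols)


features = ["signal_a", "noise_1", "signal_b", "noise_2"]
print(conditional_drop(features, toy_score))                 # left to right
print(conditional_drop(features, toy_score, reverse=True))   # right to left
print(conditional_drop(features, toy_score, threshold=5))    # stricter cut
```

Setting `threshold=0.0` reproduces the plain "drop on any increase" rule, while a positive value such as the notebook's `.005` filters out small gains.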

Let's see what happens if we drop one feature at a time, test its effect on the score, and then return it. At the end, we'll single out the feature whose removal causes the greatest score increase.

In [121]:
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

def feature_testing(alg, X, y):
    score_perm = modelfit(alg, X, y)  # baseline score with every feature present
    differences = []
    for column in X.columns:
        df_temp = X.drop(column, axis=1)  # temporarily drop the feature
        score_temp = modelfit(alg, df_temp, y)  # score with the feature dropped
        differences.append(score_temp - score_perm)
    best = max(differences)  # largest score gain from removing a single feature
    print('The largest difference was: {0}, it was caused by feature {1}'.format(
        best, X.columns[differences.index(best)]))

feature_testing(xgb1, X, y)

testing features linearly from left to right
The largest difference was: 0.0225633074325, it was caused by feature BLS_Depth_1_0


This is exactly the gain we found earlier with the threshold set to .005, where dropping BLS_Depth_1_0 raised the score from 0.699410990974 to 0.721974298407.
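
As a quick arithmetic check, the numbers printed in the two cells can be compared directly. The variable names here are illustrative; the values are copied from the outputs above.

```python
baseline = 0.699410990974        # score with every feature present
without_depth = 0.721974298407   # score after dropping BLS_Depth_1_0
reported = 0.0225633074325       # largest difference printed by the last cell

gain = without_depth - baseline
# The printed scores are rounded, so allow a small tolerance.
print(abs(gain - reported) < 1e-11)
# The gain is also well above the .005 threshold used earlier.
print(gain > 0.005)
```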

Across the tests the best score we got was 0.721974298407 by deleting 'BLS_Depth_1_0'. 

We also observed that the maximum score depends both on the order in which we deleted the features and on the method used to delete them.

A next step could be to delete features in groups of 2, 3, etc. at a time and observe the effect. Also, instead of deleting the features linearly from one side to the other, we could delete them in random order.
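
Both ideas can be sketched without committing to a particular model. This is only an outline under assumptions: `score_fn` is a hypothetical callable mapping a list of feature names to a score (it could wrap `modelfit`), and `toy_score` is invented for the demonstration. Note that exhaustive groups grow quickly: with the 40 features used here there are 780 pairs and 9880 triples, each needing a full cross-validated fit.

```python
import itertools
import random


def best_group_to_drop(features, score_fn, group_size=2):
    """Return the group whose joint removal raises the score the most."""
    baseline = score_fn(list(features))
    best_group, best_gain = None, None
    for group in itertools.combinations(features, group_size):
        remaining = [f for f in features if f not in group]
        if not remaining:
            continue  # never score an empty feature set
        gain = score_fn(remaining) - baseline
        if best_gain is None or gain > best_gain:
            best_group, best_gain = group, gain
    return best_group, best_gain


def random_order_scores(features, score_fn, seed=0):
    """Drop features one at a time in a shuffled order, recording each score."""
    order = list(features)
    random.Random(seed).shuffle(order)       # seeded so the run is repeatable
    remaining = list(features)
    history = [(None, score_fn(remaining))]  # baseline entry, nothing dropped
    for name in order[:-1]:                  # stop while one feature is left
        remaining.remove(name)
        history.append((name, score_fn(remaining)))
    return history


def toy_score(cols):
    # made-up scorer: two useful columns, every other column costs one point
    useful = {"signal_a": 40, "signal_b": 30}
    return sum(useful.get(c, -1) for c in cols)


features = ["signal_a", "noise_1", "signal_b", "noise_2"]
print(best_group_to_drop(features, toy_score, group_size=2))
print(random_order_scores(features, toy_score, seed=1))
```

Repeating `random_order_scores` over several seeds would give a rough sense of how much of the observed gain depends on deletion order.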