## Testing the TSNE features on XGBoost 

The TSNE reductions used below were generated in two steps: TruncatedSVD (TSVD) first reduced the data to 50 features, and TSNE then reduced those 50 features to 2, 3, 10, and 20 dimensions.
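
The two-step reduction described above can be sketched as follows. This is a minimal illustration on synthetic data, not the script that produced the reduction files: the array `light_curves`, the helper `tsne_reduce`, and the perplexity value are assumptions made for the example. One detail worth noting is that scikit-learn's default Barnes-Hut solver only supports fewer than 4 output dimensions, so the sketch switches to the exact solver for higher-dimensional embeddings.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Synthetic stand-in for the real input matrix (rows = objects, columns = raw features).
rng = np.random.RandomState(0)
light_curves = rng.normal(size=(80, 100))

# Step 1: TruncatedSVD down to 50 features.
svd = TruncatedSVD(n_components=50, random_state=0)
reduced_50 = svd.fit_transform(light_curves)


def tsne_reduce(features, n_dims, perplexity=10.0):
    """Hypothetical helper: embed `features` into `n_dims` dimensions with TSNE."""
    # Barnes-Hut only works for fewer than 4 output dimensions in scikit-learn,
    # so larger embeddings use the (slower) exact solver.
    method = 'barnes_hut' if n_dims < 4 else 'exact'
    tsne = TSNE(n_components=n_dims, method=method, perplexity=perplexity,
                init='random', random_state=0)
    return tsne.fit_transform(features)


# Step 2: TSNE from 50 features down to the target dimensionality.
embeddings = {n_dims: tsne_reduce(reduced_50, n_dims) for n_dims in (2, 10)}
```

The exact solver scales quadratically with the number of samples, so the 10D and 20D reductions are likely to be much slower than the 2D and 3D ones on a real data set.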

In the first cell, the LC and phase-space reductions are each concatenated to the combined feature file and scored with XGBoost's default parameters. In the second cell, the PCA features are removed first.
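
The score reported by the cells below is the area under the precision-recall curve, computed from stratified cross-validated probabilities with `average_precision_score`. The sketch below shows how that value relates to a trapezoid-rule integration of the same curve. It is an illustration only: the data are synthetic, and `GradientBoostingClassifier` is swapped in for `XGBClassifier` so the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Imbalanced synthetic data standing in for the combined feature file.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

# Stand-in classifier (assumption: not the XGBClassifier used in the notebook).
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)

# An integer cv with a classifier and a binary target uses StratifiedKFold.
proba = cross_val_predict(clf, X_demo, y_demo, cv=4,
                          method='predict_proba')[:, 1]

# Average precision: a step-wise sum over the precision-recall curve.
ap = average_precision_score(y_demo, proba)

# Trapezoid-rule area under the same curve, for comparison.
precision, recall, _ = precision_recall_curve(y_demo, proba)
trapezoid_auc = auc(recall, precision)

print('average precision: {0:.4f}'.format(ap))
print('trapezoid PR AUC:  {0:.4f}'.format(trapezoid_auc))
```

The two numbers are usually close but not identical, because the trapezoid rule interpolates linearly between points on the curve while average precision does not.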

In [7]:
import pandas as pd
from sklearn import metrics
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import cross_val_predict

# these are the various reduction files 

# 10 dimension reduction
tend_LC = pd.read_csv('tend_LC.csv', header=0, index_col=0)
tend_phase = pd.read_csv('tend_phase.csv', header=0, index_col=0)

# 3 dimension reduction
threed_LC = pd.read_csv('threed_LC.csv', header=0, index_col=0)
threed_phase = pd.read_csv('threed_phase.csv', header=0, index_col=0)

# 20 dimension reduction
twentyd_LC = pd.read_csv('twentyd_LC.csv', header=0, index_col=0)
twentyd_phase = pd.read_csv('twentyd_phase.csv', header=0, index_col=0)

# 2 dimension reduction 
twod_LC = pd.read_csv('twod_LC.csv', header=0, index_col=0)
twod_phase = pd.read_csv('twod_phase.csv', header=0, index_col=0)

# opening the combined feature file  

data_combined_features = pd.read_csv("TESSfield_05h_01d_combinedfeatures.csv",
                                     header=0, index_col=0)
data_combined_features = data_combined_features.drop(data_combined_features.index[-1])

# drop the columns that aren't features, and additionally BLS_Depth_1_0
X_2 = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 
                                 'SNR', 'BLS_Depth_1_0'],
                                axis=1)

# drop the columns that aren't features and get targets 
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 
                                 'SNR'],
                                axis=1)
y = data_combined_features['CombinedY']


def modelfit(alg, X, y, cv_folds=4):
    # cross_val_predict uses StratifiedKFold automatically for binary classification
    # y_pred holds the cross-validated probability that each sample belongs to class 1
    # pr_auc is the area under the precision-recall curve, computed as average
    # precision (a step-wise sum, not the trapezoid rule)
    y_pred = cross_val_predict(alg, X, y, cv=cv_folds, 
                               method='predict_proba')[:, 1]
    pr_auc = metrics.average_precision_score(y, y_pred)
    return pr_auc

xgb1 = XGBClassifier(objective='binary:logistic')

print('the base score on the combined feature file is: {0}'.format(modelfit(xgb1, X, y)))

# list of reduction references 
reductions_LC = [twod_LC, threed_LC, tend_LC, twentyd_LC]
reductions_phase = [twod_phase, threed_phase, tend_phase, twentyd_phase]

# concatenate each LC and phase-space reduction to the combined feature file
# and score every variation
for reductions in reductions_LC:
    combination = pd.concat([reductions, X], axis=1)
    print('Concatenating {0}D reduction of LC'.format(len(reductions.columns)))
    print(modelfit(xgb1, combination, y))

for reductions in reductions_phase:
    reductions = reductions.drop(reductions.index[-1])  # drop the last row so the files line up
    print('Concatenating {0}D reduction of phase_space transformation'.format(len(reductions.columns)))
    combination = pd.concat([reductions, X], axis=1)
    print(modelfit(xgb1, combination, y))

the base score on the combined feature file is: 0.793499017458
Concatenating 2D reduction of LC
0.788332824602
Concatenating 3D reduction of LC
0.791029337864
Concatenating 10D reduction of LC
0.791350644396
Concatenating 20D reduction of LC
0.78543797189
Concatenating 2D reduction of phase_space transformation
0.793499017458
Concatenating 3D reduction of phase_space transformation
0.793499017458
Concatenating 10D reduction of phase_space transformation
0.793499017458
Concatenating 20D reduction of phase_space transformation
0.793499017458


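As the output shows, appending the reductions to the combined feature file does not improve the score. Each LC reduction lowers it slightly (0.785 to 0.791, against a base score of 0.793), and the phase-space reductions leave it exactly at the base score for every dimensionality, which may indicate that the model never splits on those columns.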
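
The loops above drop the last row of each phase-space reduction so that it lines up with the feature file. Because `pd.concat(..., axis=1)` aligns on the index, it can be worth checking that alignment explicitly, along with whether the appended columns carry any information at all. The sketch below uses toy frames, and `check_reduction` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy frames: the reduction has one more row than the feature file.
features = pd.DataFrame({'f1': [0.1, 0.2, 0.3]}, index=[0, 1, 2])
reduction = pd.DataFrame({'tsne_0': [1.0, 2.0, 3.0, 4.0]}, index=[0, 1, 2, 3])

# concat on axis=1 does an outer join on the index, so the extra row
# survives and its feature columns are filled with NaN.
misaligned = pd.concat([reduction, features], axis=1)

# Dropping the last row of the reduction makes the two indexes match.
trimmed = reduction.drop(reduction.index[-1])
aligned = pd.concat([trimmed, features], axis=1)


def check_reduction(reduction, features):
    """Hypothetical helper: basic diagnostics before concatenating."""
    return {
        # True when both frames have identical index labels in the same order
        'same_index': reduction.index.equals(features.index),
        # total number of missing values in the reduction
        'n_missing': int(reduction.isnull().sum().sum()),
        # columns with at most one distinct non-missing value carry no signal
        'constant_columns': [c for c in reduction.columns
                             if reduction[c].nunique() <= 1],
    }


report = check_reduction(trimmed, features)
print(report)
```

A reduction whose columns are all missing or constant would leave a tree-based model's score unchanged, so a report like this is one way to investigate scores that match the base score to every digit.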

Next, we test whether substituting the TSNE reductions for the PCA features gives a better score.

In [8]:
# repeat the test after dropping the PCA features (only the first 20 columns are kept)

X = X.drop(X.columns[20:], axis=1)

print(X.columns)  # making sure columns are dropped

for reductions in reductions_LC:
    combination = pd.concat([reductions, X], axis=1)
    print('Concatenating {0}D reduction of LC'.format(len(reductions.columns)))
    print(modelfit(xgb1, combination, y))

for reductions in reductions_phase:
    reductions = reductions.drop(reductions.index[-1])  # drop the last row so the files line up
    print('Concatenating {0}D reduction of phase_space'.format(len(reductions.columns)))
    combination = pd.concat([reductions, X], axis=1)
    print(modelfit(xgb1, combination, y))

Index([u'BLS_Period_1_0', u'BLS_Tc_1_0', u'BLS_SN_1_0', u'BLS_SR_1_0',
       u'BLS_SDE_1_0', u'BLS_Depth_1_0', u'BLS_Qtran_1_0', u'BLS_Qingress_1_0',
       u'BLS_OOTmag_1_0', u'BLS_i1_1_0', u'BLS_i2_1_0', u'BLS_deltaChi2_1_0',
       u'BLS_fraconenight_1_0', u'BLS_Npointsintransit_1_0',
       u'BLS_Ntransits_1_0', u'BLS_Npointsbeforetransit_1_0',
       u'BLS_Npointsaftertransit_1_0', u'BLS_Rednoise_1_0',
       u'BLS_Whitenoise_1_0', u'BLS_SignaltoPinknoise_1_0'],
      dtype='object')
Concatenating 2D reduction of LC
0.787309206428
Concatenating 3D reduction of LC
0.782768907446
Concatenating 10D reduction of LC
0.779340888349
Concatenating 20D reduction of LC
0.780738668802
Concatenating 2D reduction of phase_space
0.784428393876
Concatenating 3D reduction of phase_space
0.784428393876
Concatenating 10D reduction of phase_space
0.784428393876
Concatenating 20D reduction of phase_space
0.784428393876


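With the PCA features removed, every reduction scores below the base score of 0.793 (the best, the 2D LC reduction, reaches 0.787), so the PCA features perform better than the TSNE/TSVD reductions in this configuration.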

Of course, these results cover only one TSNE/TSVD configuration. Tuning the parameters of either method could change the reductions, and therefore the scores, considerably.
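
One way to explore that variation is to sweep a single TSNE parameter and score each embedding with the same cross-validated precision-recall metric. The sketch below varies only the perplexity. It is an assumption-laden illustration: the data are synthetic, the perplexity values are arbitrary, `score_embedding` is a hypothetical helper, and `GradientBoostingClassifier` stands in for `XGBClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the 50-feature TSVD output and its labels.
X_demo, y_demo = make_classification(n_samples=150, n_features=30,
                                     n_informative=8, random_state=0)


def score_embedding(perplexity):
    """Hypothetical helper: embed with TSNE, then score the embedding."""
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     init='random', random_state=0).fit_transform(X_demo)
    # Stand-in classifier (assumption: not the XGBClassifier used above).
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
    proba = cross_val_predict(clf, embedding, y_demo, cv=4,
                              method='predict_proba')[:, 1]
    return average_precision_score(y_demo, proba)


# Arbitrary perplexity values chosen for illustration.
scores = {p: score_embedding(p) for p in (5, 15, 30)}
for p, s in sorted(scores.items()):
    print('perplexity={0}: PR AUC={1:.3f}'.format(p, s))
```

The same loop could be extended to other settings, such as the number of TSVD components or the TSNE output dimensionality, at the cost of recomputing the embedding for each combination.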

## References

- TSNE: https://lvdmaaten.github.io/tsne/
- TSVD: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
- XGBoost: https://xgboost.readthedocs.io/en/latest/