# TSFRESH Features Applied to XGBoost's Fitting of the TESS Simulated Dataset

The following notebook uses the TSFRESH software for feature extraction from time series data. Documentation can be found at: 

The following implementation steps are taken: 

1) Data is imported and converted from a row-based time series to a column-based time series

2) The features are extracted 

3) XGBoost with default parameters is trained on the extracted features

4) The features are concatenated with the Combined_Features file and then used to train XGBoost to see 
the impact they have on the score. 

In [16]:
# import relevant modules 
import pandas as pd
import numpy as np

# open Chelsea's combined feature file
# remove the last row to make the dimensions match with the raw LC file
data_combined_features = pd.read_csv("TESSfield_05h_01d_combinedfeatures.csv", header=0, index_col = 0)
data_combined_features = data_combined_features.drop(data_combined_features.index[-1])

# drop the columns that aren't features
X = data_combined_features.drop(['Ids', 'CatalogY', 'ManuleY', 'CombinedY',
                                 'Catalog_Period', 'Depth', 'Catalog_Epoch', 'SNR'],
                                axis=1)

# get the target values 
# In order to get the index of y to match with X 
# we need to add one to the index values 
y = data_combined_features['CombinedY'].copy()
y.index = y.index + 1

# load the raw light curve data file to get the time series information
# delete the first row (which only contains NaN values)
data = np.load("TESSfield_05h_01d.npy")
data_raw = np.delete(data, (0), axis=0)


# tsfresh's algorithm takes a pandas DataFrame, so 
# we convert the npy array;
# tsfresh also needs the IDs
df_raw = pd.DataFrame(data_raw)
df_raw['Ids'] = data_combined_features['Ids']

# build the Ids column of the column-based (long) time series:
# each ID is repeated once per time step of its light curve
n_points = len(df_raw.columns) - 1  # exclude the 'Ids' column itself
df_raw_transpose = pd.DataFrame({'Ids': np.repeat(df_raw['Ids'].values, n_points)})

# flatten the light curves row by row so the flux values of each star
# run down a single column
df_raw_transpose['x'] = data_raw.ravel()
# time index 0..n-1 (480 points per light curve here), repeated for every star;
# np.tile also works on Python 3, where range(...) * n raises a TypeError
df_raw_transpose['time'] = np.tile(np.arange(data_raw.shape[1]), len(data_raw))
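
The reshaping done in the cell above can be illustrated on a toy array. This is only a sketch: the flux values and the star IDs (101, 102) are made up for illustration, and the column names simply mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# toy stand-in for the raw light curves: 2 stars, 3 time steps each
flux = np.array([[1.0, 1.1, 0.9],
                 [2.0, 2.2, 1.8]])
ids = np.array([101, 102])  # hypothetical star IDs

n_stars, n_points = flux.shape
long_df = pd.DataFrame({
    "Ids": np.repeat(ids, n_points),                 # each ID once per time step
    "time": np.tile(np.arange(n_points), n_stars),   # 0..n-1 restarted per star
    "x": flux.ravel(),                               # rows laid end to end
})
print(long_df)
```

Each row of the wide array becomes a block of consecutive rows in the long table, which is the layout that `column_id` and `column_sort` describe to tsfresh.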

In [17]:
# the following is the implementation of the tsfresh algorithm
from tsfresh import extract_features
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

extracted_features = extract_features(df_raw_transpose, column_id="Ids",
                                      column_sort="time")

# impute replaces NaN and infinite values in place (NaN -> column median,
# +inf/-inf -> column max/min) so that feature selection can run
impute(extracted_features)
# select_features keeps only the features that are statistically relevant to y,
# leaving a DataFrame that can be used directly for training
features_filtered = select_features(extracted_features, y)
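
To make the imputation step concrete, here is a minimal pandas sketch of the replacement rule described in the tsfresh documentation (NaN to the column median, +inf and -inf to the column max and min, all-non-finite columns to zero). The function `impute_sketch` is a hypothetical helper written for this illustration and is not tsfresh's own code; unlike `impute`, it returns a copy instead of modifying the frame in place.

```python
import numpy as np
import pandas as pd

def impute_sketch(df):
    """Hypothetical re-implementation of the documented imputation rule."""
    out = df.copy()
    for col in out.columns:
        values = out[col].astype(float)
        finite = values[np.isfinite(values)]
        if finite.empty:
            # no finite value to borrow from: fill the column with zeros
            out[col] = 0.0
            continue
        values = values.replace(np.inf, finite.max())
        values = values.replace(-np.inf, finite.min())
        out[col] = values.fillna(finite.median())
    return out

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.inf],
                    "b": [np.nan] * 4})
imputed = impute_sketch(toy)
print(imputed)
```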




In total, 113 features are selected. A preview of some of them: 

In [22]:
print(features_filtered.columns)

Index([                                             u'x__skewness',
                                                    u'x__kurtosis',
                               u'x__ar_coefficient__k_10__coeff_1',
                                            u'x__count_above_mean',
                                            u'x__count_below_mean',
                                 u'x__binned_entropy__max_bins_10',
                                      u'x__autocorrelation__lag_1',
                                      u'x__autocorrelation__lag_2',
                                        u'x__mean_autocorrelation',
                                   u'x__longest_strike_below_mean',
       ...
                    u'x__time_reversal_asymmetry_statistic__lag_3',
       u'x__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_9__w_2',
                               u'x__ar_coefficient__k_10__coeff_3',
                   u'x__mean_abs_change_quantiles__qh_0.2__ql_0.0',
                   u'x__mean_abs_chan
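
To give a feel for what a few of these feature names measure, the sketch below computes hand-rolled analogues of `count_above_mean`, `count_below_mean`, and a lag-1 autocorrelation on a toy series. The formulas are assumptions based on the usual definitions and may differ in detail from tsfresh's implementations.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])  # toy series, not TESS data
mean = x.mean()

# number of samples strictly above / below the series mean
count_above_mean = int(np.sum(x > mean))
count_below_mean = int(np.sum(x < mean))

# lag-1 autocorrelation: mean product of neighbouring deviations,
# normalised by the (population) variance
lag = 1
centered = x - mean
autocorr_lag_1 = np.sum(centered[:-lag] * centered[lag:]) / ((len(x) - lag) * x.var())

print(count_above_mean, count_below_mean, autocorr_lag_1)
```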

We can now take the extracted features and use them to train the algorithm. First we'll do it
without the combined features, and then we'll append them together.

In [18]:
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_predict

# function to fit and get scores
def modelfit(alg, X, y, cv_folds=4):

    # cross_val_predict uses StratifiedKFold automatically for a classifier
    # with an integer cv; with the default method='predict', y_pred holds the
    # predicted class labels (0 or 1), not probabilities
    # (pass method='predict_proba' to score on probabilities instead)
    y_pred = cross_val_predict(alg, X, y, cv=cv_folds)
    # pr_auc is the area under the precision-recall curve; average_precision_score
    # computes it as a weighted sum of precisions and does not use the trapezoid rule
    pr_auc = metrics.average_precision_score(y, y_pred)

    # Print model report:
    print("pr_auc model score: {0}".format(pr_auc))

# initialize model and call fitting function
xgb1 = XGBClassifier(
    objective='binary:logistic')

modelfit(xgb1, features_filtered, y)

pr_auc model score: 0.602695503001
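
Since `cross_val_predict` is called here with its default `method='predict'`, the score above is computed from hard 0/1 labels and not from probabilities. The pure-Python sketch below shows, on a four-point toy example, how average precision is built as a sum of recall increments weighted by precision (no trapezoid rule), and how thresholding the scores into labels can change the result. The helper `average_precision` is a hypothetical illustration, not scikit-learn's code.

```python
def average_precision(y_true, scores):
    """AP = sum over distinct thresholds of (R_n - R_{n-1}) * P_n."""
    n_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    # walk the distinct score values from highest to lowest threshold
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        precision = tp / float(sum(predicted))
        recall = tp / float(n_pos)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]                     # toy classifier scores
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

ap_from_probabilities = average_precision(y_true, probabilities)
ap_from_labels = average_precision(y_true, hard_labels)
print(ap_from_probabilities, ap_from_labels)
```

In this toy case the probability-based score is about 0.83, while the label-based score is 0.75, so the two variants are not interchangeable when comparing runs.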


We can now concatenate the two feature sets to see the effect on the score. 

In [19]:
X_concat = pd.concat([features_filtered, X], axis=1)
X_concat = X_concat.drop([0]) # drop to make sure it aligns 

modelfit(xgb1, X_concat, y)

pr_auc model score: 0.609236214129
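
The row dropped after the concatenation is a consequence of how `pd.concat(..., axis=1)` works: it aligns on the index and, by default, keeps the union of both indices, filling gaps with NaN. The toy frames below (made-up values and column names) sketch that behaviour, along with `join="inner"` as one possible way to keep only rows present in both frames.

```python
import pandas as pd

a = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=[1, 2, 3])
b = pd.DataFrame({"f2": [10, 20, 30]}, index=[0, 1, 2])

# default outer join: rows 0 and 3 exist in only one frame and get NaN
joined = pd.concat([a, b], axis=1)
rows_with_gaps = int(joined.isnull().any(axis=1).sum())

# inner join: keep only the index labels shared by both frames
aligned = pd.concat([a, b], axis=1, join="inner")

print(joined)
print(aligned)
```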


In the next notebook I can try dropping some of the features like in the feature_selection notebook 
to see the effect it has on the score. 