# Feature Extraction

Feature extraction (also called feature engineering) transforms raw data into features suitable for modeling. It is a preprocessing step for most machine learning algorithms.

Feature extraction is used for text, images, geospatial data, dates and times, and time series.

Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations. Feature extraction is a dimensionality reduction process, where an initial set of raw variables is reduced to more manageable groups (features) for processing, while still accurately and completely describing the original data set. (Wikipedia)

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. (Wikipedia)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

# fix_yahoo_finance (since renamed to yfinance) is used to fetch data
import fix_yahoo_finance as yf
yf.pdr_override()

In [39]:
# input
symbol = 'AMD'
start = '2014-01-01'
end = '2018-08-27'

# Read data 
dataset = yf.download(symbol,start,end)

[*********************100%***********************]  1 of 1 downloaded


In [40]:
dataset.head()

Date,Open,High,Low,Close,Adj Close,Volume
2014-01-02,3.85,3.98,3.84,3.95,3.95,20548400
2014-01-03,3.98,4.0,3.88,4.0,4.0,22887200
2014-01-06,4.01,4.18,3.99,4.13,4.13,42398300
2014-01-07,4.19,4.25,4.11,4.18,4.18,42932100
2014-01-08,4.23,4.26,4.14,4.18,4.18,30678700


## Create new features from the original data

In [41]:
# Derive new features and labels from the raw OHLCV columns
# shift(-1) aligns the next trading day's value with the current row
dataset['Increase_Decrease'] = np.where(dataset['Volume'].shift(-1) > dataset['Volume'],'Increase','Decrease')
dataset['Buy_Sell_on_Open'] = np.where(dataset['Open'].shift(-1) > dataset['Open'],1,0)
dataset['Buy_Sell'] = np.where(dataset['Adj Close'].shift(-1) > dataset['Adj Close'],1,0)
# Daily return, plus the row-wise mean and sample standard deviation of the four prices
dataset['Returns'] = dataset['Adj Close'].pct_change()
dataset['Average'] = dataset[['Open','High','Low','Adj Close']].mean(axis=1)
dataset['Std'] = dataset[['Open','High','Low','Adj Close']].std(axis=1)
# Drop the first row, whose return is NaN
dataset = dataset.dropna()
dataset.head()

Date,Open,High,Low,Close,Adj Close,Volume,Increase_Decrease,Buy_Sell_on_Open,Buy_Sell,Returns,Average,Std
2014-01-03,3.98,4.0,3.88,4.0,4.0,22887200,Increase,1,1,0.012658,3.965,0.057446
2014-01-06,4.01,4.18,3.99,4.13,4.13,42398300,Increase,1,1,0.0325,4.0775,0.09215
2014-01-07,4.19,4.25,4.11,4.18,4.18,42932100,Decrease,1,0,0.012107,4.1825,0.057373
2014-01-08,4.23,4.26,4.14,4.18,4.18,30678700,Decrease,0,0,0.0,4.2025,0.053151
2014-01-09,4.2,4.23,4.05,4.09,4.09,30667600,Decrease,0,1,-0.021531,4.1425,0.086168
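The label columns rely on `shift(-1)`, which lines up the next trading day's value with the current row. The following is a minimal sketch of that alignment, using only the first five adjusted closes from the earlier `dataset.head()` output; the names `close`, `next_close`, and `demo` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# First five adjusted closes, copied from the first dataset.head() output.
close = pd.Series(
    [3.95, 4.00, 4.13, 4.18, 4.18],
    index=pd.to_datetime(
        ["2014-01-02", "2014-01-03", "2014-01-06", "2014-01-07", "2014-01-08"]
    ),
    name="Adj Close",
)

next_close = close.shift(-1)                    # tomorrow's close on today's row
buy_sell = np.where(next_close > close, 1, 0)   # 1 if the next close is higher
returns = close.pct_change()                    # change relative to the previous close

demo = pd.DataFrame(
    {"Adj Close": close, "Next": next_close, "Buy_Sell": buy_sell, "Returns": returns}
)
print(demo)
```

Two edge cases are worth noting. The first row has no previous close, so its return is `NaN` and `dropna()` removes it. The last row has no next close; the comparison against `NaN` evaluates to `False`, so its label silently becomes 0 (or `'Decrease'`) instead of being marked as unknown.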


In [42]:
dataset['Month'] = dataset.index.month 
dataset['Day'] = dataset.index.day 
dataset['Year'] = dataset.index.year

In [43]:
dataset['Norm_Price'] = (dataset['Adj Close'] - dataset['Average']) / dataset['Std'] 
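`Norm_Price` is a per-row z-score: the adjusted close minus the mean of that day's four prices, divided by their sample standard deviation. As a quick check, the sketch below recomputes it by hand for 2014-01-03 using the values printed earlier; the variable names are illustrative only.

```python
import numpy as np

# Open, High, Low, Adj Close for 2014-01-03, copied from the notebook output.
open_, high, low, adj_close = 3.98, 4.00, 3.88, 4.00
prices = np.array([open_, high, low, adj_close])

average = prices.mean()
std = prices.std(ddof=1)   # pandas .std() defaults to the sample estimate (ddof=1)
norm_price = (adj_close - average) / std

print(round(average, 4), round(std, 6), round(norm_price, 6))
```

The result agrees with the `Average`, `Std`, and `Norm_Price` values the notebook reports for that date (3.965, 0.057446, and 0.609272).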

In [44]:
dataset['Date_Stamp'] = pd.to_datetime(dataset.index)

In [45]:
dataset.head()

Date,Open,High,Low,Close,Adj Close,Volume,Increase_Decrease,Buy_Sell_on_Open,Buy_Sell,Returns,Average,Std,Month,Day,Year,Norm_Price,Date_Stamp
2014-01-03,3.98,4.0,3.88,4.0,4.0,22887200,Increase,1,1,0.012658,3.965,0.057446,1,3,2014,0.609272,2014-01-03
2014-01-06,4.01,4.18,3.99,4.13,4.13,42398300,Increase,1,1,0.0325,4.0775,0.09215,1,6,2014,0.569722,2014-01-06
2014-01-07,4.19,4.25,4.11,4.18,4.18,42932100,Decrease,1,0,0.012107,4.1825,0.057373,1,7,2014,-0.043574,2014-01-07
2014-01-08,4.23,4.26,4.14,4.18,4.18,30678700,Decrease,0,0,0.0,4.2025,0.053151,1,8,2014,-0.423324,2014-01-08
2014-01-09,4.2,4.23,4.05,4.09,4.09,30667600,Decrease,0,1,-0.021531,4.1425,0.086168,1,9,2014,-0.609272,2014-01-09


## Time Series Feature Extraction

In [46]:
from tsfresh import extract_features
import tsfresh

In [76]:
df = dataset.reset_index()
df.head()

,Date,Open,High,Low,Close,Adj Close,Volume,Increase_Decrease,Buy_Sell_on_Open,Buy_Sell,Returns,Average,Std,Month,Day,Year,Norm_Price,Date_Stamp
0,2014-01-03,3.98,4.0,3.88,4.0,4.0,22887200,Increase,1,1,0.012658,3.965,0.057446,1,3,2014,0.609272,2014-01-03
1,2014-01-06,4.01,4.18,3.99,4.13,4.13,42398300,Increase,1,1,0.0325,4.0775,0.09215,1,6,2014,0.569722,2014-01-06
2,2014-01-07,4.19,4.25,4.11,4.18,4.18,42932100,Decrease,1,0,0.012107,4.1825,0.057373,1,7,2014,-0.043574,2014-01-07
3,2014-01-08,4.23,4.26,4.14,4.18,4.18,30678700,Decrease,0,0,0.0,4.2025,0.053151,1,8,2014,-0.423324,2014-01-08
4,2014-01-09,4.2,4.23,4.05,4.09,4.09,30667600,Decrease,0,1,-0.021531,4.1425,0.086168,1,9,2014,-0.609272,2014-01-09


In [80]:
df = df.dropna(how='all')

In [82]:
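# Treat each day of the month as one series id, ordered by date;
# drop feature columns that contain NaN for any id
new_features = extract_features(df[['Norm_Price', 'Day', 'Date_Stamp']], 
                              column_id="Day", column_sort="Date_Stamp", 
                              column_value="Norm_Price", n_jobs=0).dropna(axis=1)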

Feature Extraction: 100%|██████████| 31/31 [00:03<00:00,  9.43it/s]
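With `column_id="Day"`, every distinct day of the month (1 through 31, hence the 31 items in the progress bar) is treated as its own time series, and the result has one row of features per id. The sketch below mimics that grouping with plain pandas on a hypothetical toy frame. It is not how tsfresh is implemented, and it computes only two quantities that correspond to tsfresh's `abs_energy` and `mean` calculators.

```python
import pandas as pd

# Hypothetical toy data: two "Day" ids, two observations each.
toy = pd.DataFrame({
    "Day": [3, 6, 3, 6],
    "Date_Stamp": pd.to_datetime(
        ["2014-01-03", "2014-01-06", "2014-02-03", "2014-02-06"]
    ),
    "Norm_Price": [0.5, -1.0, 1.5, 2.0],
})

# Sort by time, then build one series per id.
grouped = toy.sort_values("Date_Stamp").groupby("Day")["Norm_Price"]

per_day = pd.DataFrame({
    "abs_energy": grouped.apply(lambda s: (s ** 2).sum()),  # sum of squares
    "mean": grouped.mean(),
})
print(per_day)
# Day 3 -> abs_energy 2.5, mean 1.0
# Day 6 -> abs_energy 5.0, mean 0.5
```

Grouping by day of month pools observations from different months and years into the same series, so the extracted features describe "all 3rds of the month" rather than one contiguous stretch of prices.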


In [83]:
new_features

id,Norm_Price__abs_energy,Norm_Price__absolute_sum_of_changes,"Norm_Price__agg_autocorrelation__f_agg_""mean""","Norm_Price__agg_autocorrelation__f_agg_""median""","Norm_Price__agg_autocorrelation__f_agg_""var""","Norm_Price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""intercept""","Norm_Price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""rvalue""","Norm_Price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""slope""","Norm_Price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""stderr""","Norm_Price__agg_linear_trend__f_agg_""max""__chunk_len_5__attr_""intercept""",...,Norm_Price__time_reversal_asymmetry_statistic__lag_1,Norm_Price__time_reversal_asymmetry_statistic__lag_2,Norm_Price__time_reversal_asymmetry_statistic__lag_3,Norm_Price__value_count__value_-inf,Norm_Price__value_count__value_0,Norm_Price__value_count__value_1,Norm_Price__value_count__value_inf,Norm_Price__value_count__value_nan,Norm_Price__variance,Norm_Price__variance_larger_than_standard_deviation
1,10.069981,22.073335,0.011036,-2.9e-05,0.088425,0.723716,0.234293,0.028313,0.083072,0.593362,...,0.017921,-0.041509,-0.002697,0.0,0.0,0.0,0.0,0.0,0.284603,0.0
2,15.014855,23.650433,-0.068024,-0.032363,0.221351,0.765494,-0.232963,-0.044569,0.131558,0.717905,...,0.003614,-0.000761,0.010993,0.0,0.0,0.0,0.0,0.0,0.394286,0.0
3,10.007918,22.794906,-0.031558,0.007704,0.12522,0.650171,0.034617,0.00592,0.120853,0.721209,...,-0.016315,-0.0223,-0.032399,0.0,3.0,0.0,0.0,0.0,0.260806,0.0
4,7.622961,17.675706,-2.4e-05,0.018148,0.098133,0.740912,-0.195852,-0.047454,0.168012,0.649473,...,-0.004962,-0.023041,0.028284,0.0,0.0,0.0,0.0,0.0,0.224188,0.0
5,9.924642,20.377309,-0.068299,-0.006735,0.160274,0.665402,0.561299,0.058399,0.060886,0.551799,...,0.050581,-0.03149,-0.019719,0.0,0.0,0.0,0.0,0.0,0.257632,0.0
6,12.313897,23.168678,-0.006238,-0.01187,0.153771,0.714396,0.628668,0.03306,0.023611,0.533903,...,-0.034664,0.016124,-0.026203,0.0,1.0,0.0,0.0,0.0,0.289533,0.0
7,9.634994,24.596423,-0.010482,0.036056,0.12804,0.712917,-0.083901,-0.008115,0.068147,0.555363,...,0.040656,0.028055,0.025381,0.0,1.0,0.0,0.0,0.0,0.242038,0.0
8,12.531512,28.226676,-0.016934,-0.049526,0.066014,0.776807,-0.169103,-0.005606,0.023103,0.584344,...,-0.014838,-0.006964,-0.050904,0.0,1.0,0.0,0.0,0.0,0.312205,0.0
9,11.037042,23.457618,0.014349,0.003432,0.111533,1.0738,-0.703534,-0.315431,0.183959,0.647062,...,-0.035327,-0.017039,-0.018024,0.0,0.0,0.0,0.0,0.0,0.250425,0.0
10,13.181257,25.853322,-0.005722,-0.044875,0.072481,0.696297,0.087233,0.009606,0.077565,0.526149,...,-0.008724,0.027061,-0.045334,0.0,1.0,0.0,0.0,0.0,0.325188,0.0


In [84]:
# One series per calendar month (12 ids), using the adjusted close as the value
X = extract_features(df, column_id='Month', column_sort='Date', column_value='Adj Close')

Feature Extraction: 100%|██████████| 12/12 [00:02<00:00,  5.12it/s]


In [85]:
X

id,Adj Close__abs_energy,Adj Close__absolute_sum_of_changes,"Adj Close__agg_autocorrelation__f_agg_""mean""","Adj Close__agg_autocorrelation__f_agg_""median""","Adj Close__agg_autocorrelation__f_agg_""var""","Adj Close__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""intercept""","Adj Close__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""rvalue""","Adj Close__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""slope""","Adj Close__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""stderr""","Adj Close__agg_linear_trend__f_agg_""max""__chunk_len_50__attr_""intercept""",...,Adj Close__time_reversal_asymmetry_statistic__lag_1,Adj Close__time_reversal_asymmetry_statistic__lag_2,Adj Close__time_reversal_asymmetry_statistic__lag_3,Adj Close__value_count__value_-inf,Adj Close__value_count__value_0,Adj Close__value_count__value_1,Adj Close__value_count__value_inf,Adj Close__value_count__value_nan,Adj Close__variance,Adj Close__variance_larger_than_standard_deviation
1,6019.4169,26.54,0.354006,0.432519,0.175319,1.730545,0.860689,1.337879,0.279805,4.47,...,28.749453,57.362763,83.149545,0.0,0.0,0.0,0.0,0.0,18.781676,1.0
2,6705.4881,27.86,0.275627,0.274121,0.205653,1.630727,0.823661,1.454727,0.354107,3.72,...,22.531925,45.286815,65.917476,0.0,0.0,0.0,0.0,0.0,24.022986,1.0
3,7702.3745,31.76,0.370745,0.409414,0.170077,1.773636,0.767598,1.246727,0.346998,6.456667,...,6.10831,14.899758,27.105037,0.0,0.0,0.0,0.0,0.0,22.68231,1.0
4,6168.877,33.65,0.310897,0.334679,0.17117,2.242727,0.757335,1.106545,0.318047,6.611667,...,12.330922,23.026106,32.808871,0.0,0.0,0.0,0.0,0.0,18.26571,1.0
5,6890.726,28.45,0.467041,0.551048,0.136986,1.721364,0.85327,1.247364,0.254106,5.736667,...,30.335133,61.237832,91.868611,0.0,0.0,0.0,0.0,0.0,17.71398,1.0
6,9564.220168,34.500002,0.489733,0.566027,0.143772,1.204091,0.912175,1.584273,0.237248,6.716667,...,42.316454,86.95303,135.130659,0.0,0.0,0.0,0.0,0.0,27.107963,1.0
7,10788.664338,33.990004,0.475292,0.568334,0.152314,1.007727,0.919332,1.764636,0.25176,6.636667,...,85.408064,170.623756,225.458101,0.0,0.0,0.0,0.0,0.0,33.772204,1.0
8,13088.049644,39.550004,0.470143,0.476228,0.092382,-0.120455,0.934536,2.199909,0.27924,8.03,...,187.013637,315.085442,423.155964,0.0,0.0,0.0,0.0,0.0,41.308127,1.0
9,4536.335,23.91,0.204146,0.215385,0.239502,1.587333,0.866586,1.467333,0.319369,7.51,...,31.14806,62.606912,98.878568,0.0,0.0,0.0,0.0,0.0,16.83362,1.0
10,5179.5372,24.0,0.29145,0.317743,0.219575,0.524889,0.91208,1.731,0.294112,6.97,...,18.665421,43.788124,80.62326,0.0,0.0,0.0,0.0,0.0,20.254749,1.0


In [86]:
tsfresh.feature_extraction.feature_calculators.abs_energy(dataset['Adj Close'])

85643.58435012001

In [87]:
tsfresh.feature_extraction.feature_calculators.absolute_sum_of_changes(dataset['Adj Close'])

200.18001000000004
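The two calculators called above have simple closed forms: `abs_energy` is the sum of squared values, and `absolute_sum_of_changes` is the sum of absolute differences between consecutive values. A NumPy-only sketch of both follows, run on the first five adjusted closes of the cleaned dataset. The functions are illustrative re-implementations written for this note, not tsfresh's own code.

```python
import numpy as np

def abs_energy(x):
    """Sum of squared values: sum_i x_i^2."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(x, x))

def absolute_sum_of_changes(x):
    """Sum of absolute consecutive differences: sum_i |x_{i+1} - x_i|."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(np.diff(x))))

# Adjusted closes for 2014-01-03 through 2014-01-09 from the notebook output.
closes = [4.00, 4.13, 4.18, 4.18, 4.09]

print(abs_energy(closes))               # approximately 84.7298
print(absolute_sum_of_changes(closes))  # approximately 0.27
```

Both quantities grow with the length and level of the series, which is why the full-history values above (about 85643.58 and 200.18) are so much larger than the five-day figures.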