# Feature Engineering
Using the data we sourced in the previous step, we will select and preprocess the attributes we need for our machine learning model.

In [17]:
%run config.ipynb

In [18]:
# Connect to Cortex
cortex = Cortex.client()

In [19]:
# Our stock symbols
symbols = ['fb', 'xlf', 'aapl']

## Preparing Features
Using one of our stock price datasets, we will begin defining the features we plan to use for our model.

In [20]:
s = symbols[0]
print('Loading stock prices for %s' % s)
ds = cortex.dataset('demo/stock-prices-%s' % s)
c = ds.contract('stock-prices')

Loading stock prices for fb


In [21]:
df = c.get_source_data()
c.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations,Knowledge
0,change,float64,0.0,1259,647,max=14.66 | min=-41.24 | mean=0.11,
1,changeOverTime,float64,0.0,1259,1200,max=4.910326087 | min=-0.0065217391 | mean=2.02,
2,changePercent,float64,0.0,1259,1126,max=15.521 | min=-18.961 | mean=0.14,
3,close,float64,0.0,1259,1200,max=217.5 | min=36.56 | mean=111.24,
4,date,object,0.0,1259,1259,Sample: 2014-02-28 | 2016-03-24 | 2017-06-08,
5,high,float64,0.0,1259,1200,max=218.62 | min=37.07 | mean=112.29,
6,label,object,0.0,1259,1259,"Sample: Aug 27, 15 | Oct 24, 13 | May 22, 15",
7,low,float64,0.0,1259,1198,max=214.27 | min=36.0201 | mean=110.04,
8,open,float64,0.0,1259,1184,max=215.715 | min=36.36 | mean=111.2,
9,unadjustedVolume,int64,0.0,1259,1259,max=248809006 | min=5913066 | mean=33023851.93,


## Feature Preprocessing
Standardizing a dataset is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look more or less like standard normally distributed data (zero mean, unit variance). scikit-learn provides several utilities for preprocessing features before building an ML model.
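
To make "standard normally distributed" concrete, here is a minimal sketch of what `StandardScaler` computes. The price values are hypothetical and chosen only for illustration. The manual calculation assumes the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical closing prices, used only for illustration.
prices = np.array([[36.56], [80.0], [111.24], [150.0], [217.5]])

scaler = StandardScaler()
scaled = scaler.fit_transform(prices)

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (prices - prices.mean(axis=0)) / prices.std(axis=0)

print(np.allclose(scaled, manual))   # do both approaches agree?
print(scaled.mean(), scaled.std())   # mean and spread after scaling
```

After scaling, the column is centered near zero with unit spread, which is the property many estimators rely on.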

In [22]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

In [23]:
def scale_columns(df, scaler, cols_to_scale):
    # Fit the scaler on the selected columns and overwrite them in place
    df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
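
Note that `scale_columns` modifies the DataFrame it receives and returns nothing. The sketch below illustrates this on a hypothetical toy frame (the column values are made up), using `MinMaxScaler` simply because its output is easy to check by eye.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_columns(df, scaler, cols_to_scale):
    # Same helper as above: fits the scaler and overwrites the columns in place.
    df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Hypothetical toy data, not taken from the stock prices dataset.
toy = pd.DataFrame({'close': [10.0, 20.0, 30.0], 'open': [9.0, 21.0, 29.0]})
original = toy.copy()  # keep a copy if the raw values are still needed

result = scale_columns(toy, MinMaxScaler(), ['close'])

print(result)    # the helper has no return value
print(toy)       # 'close' is rescaled, 'open' is untouched
print(original)  # the copy still holds the raw values
```

If the unscaled values are needed later, copy the frame before calling the helper, as done here with `original`.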

### Feature Scaling
We will use only the closing price for this simple model. The fitted scaler is itself a model (it stores the mean and standard deviation of the column), so we will persist it for later use when we want to visualize our predictions.
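
One reason to keep the fitted scaler is that it can map scaled values back to prices. The following is a minimal sketch of that round trip with `inverse_transform`. The prices are hypothetical, and `predicted_scaled` is only a stand-in for model output, not a real prediction.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical closing prices, used only for illustration.
prices = np.array([[37.0], [38.5], [40.0], [39.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(prices)

# Stand-in for model output expressed in scaled units.
predicted_scaled = scaled[-2:]

# Map the scaled values back to the original price units.
predicted_prices = scaler.inverse_transform(predicted_scaled)
print(predicted_prices)
```

Without the original scaler, the stored mean and standard deviation would be lost and scaled predictions could not be converted back for plotting.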

In [24]:
price_scaler = StandardScaler()
scale_columns(df, price_scaler, ['close'])
df.head()

Unnamed: 0,change,changeOverTime,changePercent,close,date,high,label,low,open,unadjustedVolume,volume,vwap
0,-0.827,0.0,-2.198,-1.682742,2013-07-31,38.31,"Jul 31, 13",36.33,37.96,154729632,154729632,37.3518
1,0.689,0.018723,1.872,-1.667168,2013-08-01,38.29,"Aug 1, 13",36.92,37.3,105923556,105923556,37.825
2,0.561,0.033967,1.496,-1.654486,2013-08-02,38.49,"Aug 2, 13",37.5,37.66,72753482,72753482,38.031
3,1.139,0.064918,2.993,-1.62874,2013-08-05,39.32,"Aug 5, 13",38.25,38.43,79851445,79851445,39.0148
4,-0.639,0.047554,-1.631,-1.643184,2013-08-06,39.25,"Aug 6, 13",37.94,39.11,63861878,63861878,38.5606


In [25]:
c.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations,Knowledge
0,change,float64,0.0,1259,647,max=14.66 | min=-41.24 | mean=0.11,
1,changeOverTime,float64,0.0,1259,1200,max=4.910326087 | min=-0.0065217391 | mean=2.02,
2,changePercent,float64,0.0,1259,1126,max=15.521 | min=-18.961 | mean=0.14,
3,close,float64,0.0,1259,1200,max=2.4019017239971596 | min=-1.6881672929783968 | mean=-0.0,
4,date,object,0.0,1259,1259,Sample: 2018-01-02 | 2018-07-30 | 2017-05-12,
5,high,float64,0.0,1259,1200,max=218.62 | min=37.07 | mean=112.29,
6,label,object,0.0,1259,1259,"Sample: Jan 23, 17 | Aug 4, 15 | Apr 7, 14",
7,low,float64,0.0,1259,1198,max=214.27 | min=36.0201 | mean=110.04,
8,open,float64,0.0,1259,1184,max=215.715 | min=36.36 | mean=111.2,
9,unadjustedVolume,int64,0.0,1259,1259,max=248809006 | min=5913066 | mean=33023851.93,


### Dropping Columns
Let's drop everything except the date and close price columns.
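
For readers without access to Cortex, the same selection can be sketched in plain pandas. This assumes that `filter_columns` with `drop=False` keeps only the listed headers, as the output below suggests. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with a few of the stock price columns.
toy = pd.DataFrame({
    'date': ['2013-07-31', '2013-08-01'],
    'close': [36.8, 37.5],
    'open': [37.96, 37.3],
    'high': [38.31, 38.29],
})

# Keep only the columns of interest; everything else is discarded.
kept = toy[['date', 'close']]
print(kept)
```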

In [26]:
df = c.clean.filter_columns(df, headers=['date', 'close'], drop=False)
df.head()

Unnamed: 0,close,date
0,-1.682742,2013-07-31
1,-1.667168,2013-08-01
2,-1.654486,2013-08-02
3,-1.62874,2013-08-05
4,-1.643184,2013-08-06


### Re-indexing
Convert the date column to a datetime type and make it the index of the dataset, so that we can easily partition the data by date for the training/test split.
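
To show why a datetime index helps, here is a minimal sketch of a date-based split on a toy frame. The cutoff `split_date` and the values are hypothetical; the notebook's actual split may differ.

```python
import pandas as pd

# Hypothetical toy data shaped like the date/close frame.
toy = pd.DataFrame({
    'date': ['2013-07-31', '2013-08-01', '2013-08-02', '2013-08-05', '2013-08-06'],
    'close': [-1.68, -1.67, -1.65, -1.63, -1.64],
})
toy['date'] = pd.to_datetime(toy['date'], errors='coerce')
toy = toy.set_index('date').sort_index()

# Hypothetical cutoff: rows up to and including this date go to training.
split_date = pd.Timestamp('2013-08-02')
train = toy.loc[toy.index <= split_date]
test = toy.loc[toy.index > split_date]

print(len(train), len(test))
```

Splitting on the index keeps the test set strictly later than the training set, which matters for time series data.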

In [27]:
# pd (pandas) is assumed to be imported by config.ipynb
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
df.set_index('date', inplace=True)
df.head()

date,close
2013-07-31,-1.682742
2013-08-01,-1.667168
2013-08-02,-1.654486
2013-08-05,-1.62874
2013-08-06,-1.643184


## Saving the Contract State
We persist the feature file and the fitted scaler so that later steps can reuse them.

In [28]:
c.save_feature_file(data=df, tag='prices_%s' % s)
c.save_model_file(data=price_scaler, tag='prices_scaler_%s' % s)
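
How Cortex stores these files is not shown here. As a rough analogy only, the sketch below round-trips a fitted scaler through Python's standard `pickle` module and checks that the restored object transforms data identically. The training values and `sample` are hypothetical.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on hypothetical values.
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

# Serialize to bytes and restore; a file on disk would work the same way.
blob = pickle.dumps(scaler)
restored = pickle.loads(blob)

sample = np.array([[2.5]])
print(np.allclose(restored.transform(sample), scaler.transform(sample)))
```

Pickled scikit-learn objects should be loaded with the same library version that produced them, and only from trusted sources.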