# Feature Engineering
Using the data we sourced in the previous step, we will select and preprocess the attributes we need for our machine learning model.

In [1]:
%run config.ipynb

In [2]:
# Connect to Cortex
cortex = Cortex.client()

In [3]:
# Our stock symbols
symbols = ['fb', 'xlf', 'aapl']

## Preparing Features
Using one of our stock price datasets, we will begin defining the features we plan to use for our model.

In [4]:
s = symbols[0]
print('Loading stock prices for %s' % s)
ds = cortex.dataset('demo/stock-prices-%s' % s)
c = ds.contract('stock-prices')

Loading stock prices for fb


In [5]:
df = c.get_source_data()
c.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations,Knowledge
0,change,float64,0.0,1259,647,max=14.66 | min=-41.24 | mean=0.12,
1,changeOverTime,float64,0.0,1259,1201,max=4.6493506494 | min=-0.0503896104 | mean=1.91,
2,changePercent,float64,0.0,1259,1126,max=15.521 | min=-18.961 | mean=0.14,
3,close,float64,0.0,1259,1201,max=217.5 | min=36.56 | mean=112.03,
4,date,object,0.0,1259,1259,Sample: 2014-03-11 | 2016-04-05 | 2017-06-19,
5,high,float64,0.0,1259,1199,max=218.62 | min=37.07 | mean=113.08,
6,label,object,0.0,1259,1259,"Sample: Sep 8, 15 | Nov 4, 13 | Jun 3, 15",
7,low,float64,0.0,1259,1197,max=214.27 | min=36.0201 | mean=110.81,
8,open,float64,0.0,1259,1184,max=215.715 | min=36.36 | mean=111.97,
9,unadjustedVolume,int64,0.0,1259,1259,max=248809006 | min=5913066 | mean=32745700.14,


## Feature Preprocessing
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (zero mean and unit variance). Many useful utilities are available to help preprocess features before attempting to build an ML model.
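
To make the idea concrete, here is a minimal sketch of standardization written with only the Python standard library. The `standardize` helper and the sample prices are hypothetical and are not part of this notebook or its dataset. The sketch divides by the population standard deviation, which matches the biased estimator that `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(v - mean) / std for v in values]


# Hypothetical closing prices, not taken from the dataset
closes = [38.5, 38.22, 37.02, 36.65, 36.56]
scaled = standardize(closes)

# After scaling, the values are centred on 0 with unit spread
print([round(z, 3) for z in scaled])
```

Note that a constant column has zero standard deviation, so this naive version would raise a `ZeroDivisionError` on it.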

In [6]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

In [7]:
def scale_columns(df, scaler, cols_to_scale):
    # Fits the scaler on the given columns and overwrites them in place in df
    df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

### Feature Scaling
We will use only the closing price for this simple model. The fitted scaler is itself a model, so we will persist it for later use when we want to visualize our predictions.

In [9]:
price_scaler = StandardScaler()
scale_columns(df, price_scaler, ['close'])
df.head()

Unnamed: 0,change,changeOverTime,changePercent,close,date,high,label,low,open,unadjustedVolume,volume,vwap
0,-0.04,0.0,-0.104,-1.663954,2013-08-09,38.74,"Aug 9, 13",38.01,38.59,43553093,43553093,38.5242
1,-0.28,-0.007273,-0.727,-1.670291,2013-08-12,38.5,"Aug 12, 13",38.1,38.2,31102475,31102475,38.3061
2,-1.201,-0.038468,-3.142,-1.69747,2013-08-13,38.32,"Aug 13, 13",36.77,38.24,65277931,65277931,37.3722
3,-0.369,-0.048052,-0.997,-1.705821,2013-08-14,37.55,"Aug 14, 13",36.62,36.83,48350658,48350658,37.044
4,-0.09,-0.05039,-0.246,-1.707858,2013-08-15,37.07,"Aug 15, 13",36.0201,36.36,56439536,56439536,36.5183


In [10]:
c.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations,Knowledge
0,change,float64,0.0,1259,647,max=14.66 | min=-41.24 | mean=0.12,
1,changeOverTime,float64,0.0,1259,1201,max=4.6493506494 | min=-0.0503896104 | mean=1.91,
2,changePercent,float64,0.0,1259,1126,max=15.521 | min=-18.961 | mean=0.14,
3,close,float64,0.0,1259,1201,max=2.3869951845748765 | min=-1.7078579600615227 | mean=0.0,
4,date,object,0.0,1259,1259,Sample: 2015-02-20 | 2017-06-23 | 2016-12-29,
5,high,float64,0.0,1259,1199,max=218.62 | min=37.07 | mean=113.08,
6,label,object,0.0,1259,1259,"Sample: Mar 29, 16 | Jan 30, 17 | Mar 16, 16",
7,low,float64,0.0,1259,1197,max=214.27 | min=36.0201 | mean=110.81,
8,open,float64,0.0,1259,1184,max=215.715 | min=36.36 | mean=111.97,
9,unadjustedVolume,int64,0.0,1259,1259,max=248809006 | min=5913066 | mean=32745700.14,


### Dropping Columns
Let's drop everything except the date and close price columns.

In [11]:
df = c.clean.filter_columns(df, headers=['date', 'close'], drop=False)
df.head()

Unnamed: 0,date,close
0,2013-08-09,-1.663954
1,2013-08-12,-1.670291
2,2013-08-13,-1.69747
3,2013-08-14,-1.705821
4,2013-08-15,-1.707858


### Re-indexing
Convert the dates to a datetime type and set the date as the index of our dataset so that we can easily partition the data into training and test sets.

In [12]:
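import pandas as pd  # harmless if config.ipynb has already imported it
# Note: infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
df.set_index('date', inplace=True)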
df.head()

date,close
2013-08-09,-1.663954
2013-08-12,-1.670291
2013-08-13,-1.69747
2013-08-14,-1.705821
2013-08-15,-1.707858
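
With a datetime index in place, a chronological split comes down to comparing the index against a cutoff date. The following is a minimal sketch using a hypothetical frame and a hypothetical `split_date`; the actual split used for this model is not shown in this section.

```python
import pandas as pd

# Hypothetical values indexed by business day, standing in for the scaled prices
dates = pd.date_range('2018-01-01', periods=10, freq='B')
prices = pd.DataFrame({'close': range(10)}, index=dates)
prices.index.name = 'date'

# Everything before the cutoff is used for training, the rest for testing
split_date = pd.Timestamp('2018-01-10')
train = prices.loc[prices.index < split_date]
test = prices.loc[prices.index >= split_date]

print(len(train), len(test))
```

Splitting by date rather than by random sampling keeps the test set strictly later than the training set, which is usually what a time series model needs.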


## Saving the Contract State
We can now persist the two artifacts we created, the feature data and the fitted price scaler, for use in later steps.

In [13]:
c.save_feature_file(data=df, tag='prices_%s' % s)
c.save_model_file(data=price_scaler, tag='prices_scaler_%s' % s)