# Feature Transformation

Feature transformation (data preprocessing) changes the representation of the input data so that a learning algorithm can fit it more accurately.

Common transformations include normalization and changing the distribution (scaling), adding interaction terms, and filling in missing values.

The transforms shown here include Rescale Data, Standardize Data, Normalize Data, and Binarize Data (Make Binary).
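
Interaction terms and missing-value filling are named above but not demonstrated in this notebook. The following is a minimal sketch of both, assuming scikit-learn's `SimpleImputer` and `PolynomialFeatures` and a small hypothetical matrix `toy` that is not part of the AMD dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy matrix with one missing value (not from the AMD dataset)
toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, 6.0]])

# Fill the missing entry with the column mean: (1 + 3) / 2 = 2
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(toy)

# Append the pairwise interaction term x0 * x1 as a third column
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
with_interactions = interactions.fit_transform(filled)
print(with_interactions)
```

The mean strategy is only one choice; a median or constant fill may suit skewed data better.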

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

# fix_yahoo_finance (since renamed to yfinance) is used to fetch data
import fix_yahoo_finance as yf
yf.pdr_override()

In [2]:
# input
symbol = 'AMD'
start = '2014-01-01'
end = '2018-08-27'

# Read data 
dataset = yf.download(symbol,start,end)

[*********************100%***********************]  1 of 1 downloaded


In [3]:
# Add derived columns: next-day direction labels, daily returns, and per-row price statistics
dataset['Increase_Decrease'] = np.where(dataset['Volume'].shift(-1) > dataset['Volume'],'Increase','Decrease')
dataset['Buy_Sell_on_Open'] = np.where(dataset['Open'].shift(-1) > dataset['Open'],1,0)
dataset['Buy_Sell'] = np.where(dataset['Adj Close'].shift(-1) > dataset['Adj Close'],1,0)
dataset['Returns'] = dataset['Adj Close'].pct_change()
dataset['Average'] = dataset[['Open','High','Low','Adj Close']].mean(axis=1)
dataset['Std'] = dataset[['Open','High','Low','Adj Close']].std(axis=1)
dataset = dataset.dropna()
dataset.head()

Date,Open,High,Low,Close,Adj Close,Volume,Increase_Decrease,Buy_Sell_on_Open,Buy_Sell,Returns,Average,Std
2014-01-03,3.98,4.0,3.88,4.0,4.0,22887200,Increase,1,1,0.012658,3.965,0.057446
2014-01-06,4.01,4.18,3.99,4.13,4.13,42398300,Increase,1,1,0.0325,4.0775,0.09215
2014-01-07,4.19,4.25,4.11,4.18,4.18,42932100,Decrease,1,0,0.012107,4.1825,0.057373
2014-01-08,4.23,4.26,4.14,4.18,4.18,30678700,Decrease,0,0,0.0,4.2025,0.053151
2014-01-09,4.2,4.23,4.05,4.09,4.09,30667600,Decrease,0,1,-0.021531,4.1425,0.086168
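
The label columns above compare each row with the next one through `shift(-1)`. The sketch below replays that logic on the first five `Adj Close` values from the table; the names `toy_close`, `next_close`, and `label` are illustrative only. Note that the final row has no successor, so its shifted value is NaN, the comparison is False, and it is labelled 0.

```python
import numpy as np
import pandas as pd

# First five Adj Close values from the table above
toy_close = pd.Series([4.00, 4.13, 4.18, 4.18, 4.09])

# shift(-1) aligns each row with the NEXT row's value; the last row gets NaN
next_close = toy_close.shift(-1)

# 1 when the next value is strictly higher, otherwise 0 (NaN compares as False)
label = np.where(next_close > toy_close, 1, 0)
print(label)
```

In the full dataset the last trading day is labelled the same way, so its label reflects missing data rather than a real price move.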


In [15]:
dataset.shape

(1171, 12)

In [23]:
# Use one feature (Open) and one target (Adj Close), each as a column vector
X = np.array(dataset['Open']).reshape(-1, 1)
Y = np.array(dataset['Adj Close']).reshape(-1, 1)

In [24]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

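# Hold out a test split (default 25%) before fitting any transformer
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
# Map the feature onto a uniform [0, 1] distribution using its empirical quantiles
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
# Fit on the training split only, then reuse the same mapping on the test split
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)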

In [25]:
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])

array([ 1.62    ,  2.7     ,  4.2     , 11.4275  , 24.940001])

In [26]:
np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])

array([9.99999998e-08, 2.50250250e-01, 4.99499499e-01, 7.50250250e-01,
       9.99999900e-01])

In [27]:
# Raw training-set percentiles again, for comparison with the transformed test set below
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])

array([ 1.62    ,  2.7     ,  4.2     , 11.4275  , 24.940001])

In [28]:
np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])

array([0.00342075, 0.25725726, 0.55948203, 0.77948967, 0.99173318])
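
The percentiles above show the quantile transform spreading the skewed prices evenly over the zero-to-one range. The sketch below reproduces that effect on synthetic log-normal data, which is an assumption made for illustration and not the AMD prices; `n_quantiles` is set equal to the sample count so it does not exceed it.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed data (an assumption; not the AMD prices)
rng = np.random.RandomState(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(n_quantiles=1000, output_distribution="uniform", random_state=0)
uniform = qt.fit_transform(skewed)

# After the transform the quartiles sit near 0, 0.25, 0.5, 0.75 and 1
quartiles = np.percentile(uniform[:, 0], [0, 25, 50, 75, 100])
print(quartiles)
```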

Rescale Data

Rescaling is the simplest transformation: take the range of the data and map it onto a zero-to-one scale.

In [33]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.101]
 [0.102]
 [0.11 ]
 [0.112]
 [0.111]]
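
As a cross-check on what `MinMaxScaler` computes, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to the first five `Open` prices from the table and compares the result with the scaler. The values differ from the output above because the minimum and maximum here come from five rows only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First five Open prices from the table above
opens = np.array([[3.98], [4.01], [4.19], [4.23], [4.20]])

# Manual min-max rescaling, computed per column
col_min = opens.min(axis=0)
col_max = opens.max(axis=0)
manual = (opens - col_min) / (col_max - col_min)

# The scaler should give the same result
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(opens)
print(np.allclose(manual, scaled))
```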


Standardize Data

Standardization transforms attributes that have a Gaussian distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It works best with techniques that benefit from rescaled data, such as linear regression, logistic regression, and linear discriminant analysis.

In [34]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[-0.622]
 [-0.616]
 [-0.579]
 [-0.571]
 [-0.577]]
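
The same kind of cross-check works for standardization. This sketch computes the z-score `(x - mean) / std` by hand on the first five `Open` prices and compares it with `StandardScaler`, which uses the population standard deviation (`ddof=0`), the same default as `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Open prices from the table above
opens = np.array([[3.98], [4.01], [4.19], [4.23], [4.20]])

# Manual z-score using the population standard deviation (ddof=0)
manual = (opens - opens.mean(axis=0)) / opens.std(axis=0)

# The scaler should give the same result
scaled = StandardScaler().fit_transform(opens)
print(np.allclose(manual, scaled))
```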


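Normalize Data

Normalizing rescales each observation (row) to have a length of 1 (called a unit norm in linear algebra). This method is very useful if the dataset has many zeros (sparse data). Because X has a single column here, every row is divided by its own value and becomes 1.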

In [35]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[1.]
 [1.]
 [1.]
 [1.]
 [1.]]
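
Row normalization is easier to see with more than one column. The sketch below, assuming the default L2 norm, normalizes the first three `Open` and `High` pairs from the table and confirms that each row ends up with length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# First three (Open, High) pairs from the table above
rows = np.array([[3.98, 4.00],
                 [4.01, 4.18],
                 [4.19, 4.25]])

# Normalizer is stateless: each row is divided by its own L2 norm
unit_rows = Normalizer(norm="l2").fit_transform(rows)

# Every row now has Euclidean length 1
row_lengths = np.linalg.norm(unit_rows, axis=1)
print(unit_rows)
print(row_lengths)
```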


Binarize Data (Make Binary)

Binarizing transforms data with a threshold: values above the threshold are marked 1, and values equal to or below it are marked 0. This method is useful when you have probabilities that you want to turn into crisp values. Because every Open price is above the threshold of 0.0 used here, every row becomes 1.

In [36]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1.]
 [1.]
 [1.]
 [1.]
 [1.]]
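
A threshold of 0.0 is more informative on a signed column. As a sketch, the block below binarizes the first five `Returns` values from the table, so that days with a positive return become 1 and flat or negative days become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# First five Returns values from the table above
returns = np.array([[0.012658], [0.0325], [0.012107], [0.0], [-0.021531]])

# Values strictly above the threshold become 1; values at or below it become 0
binary = Binarizer(threshold=0.0).fit_transform(returns)
print(binary.ravel())
```

The fourth value is exactly 0.0, so it maps to 0, matching the "equal to or below" rule.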
