# Pre-Processing 

Preprocessing is an important step in data mining and machine learning. It guards against the "garbage in, garbage out" problem: data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc., and a model trained on such data produces unreliable results.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

# fix_yahoo_finance (since renamed yfinance) is used to fetch data
import fix_yahoo_finance as yf
yf.pdr_override()

In [2]:
# input
symbol = 'AMD'
start = '2007-01-01'
end = '2018-12-31'

# Read data 
dataset = yf.download(symbol,start,end)

# View Columns
dataset.head()

[*********************100%***********************]  1 of 1 downloaded


Date,Open,High,Low,Close,Adj Close,Volume
2007-01-03,20.08,20.4,19.35,19.52,19.52,28350300
2007-01-04,19.66,19.860001,19.32,19.790001,19.790001,23652500
2007-01-05,19.540001,19.91,19.540001,19.709999,19.709999,15902400
2007-01-08,19.709999,19.860001,19.370001,19.469999,19.469999,15814800
2007-01-09,19.450001,19.709999,19.370001,19.65,19.65,14494200


In [3]:
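# Label columns: 1 if the next day's value is higher than today's, else 0
dataset['Increase_Decrease'] = np.where(dataset['Volume'].shift(-1) > dataset['Volume'],1,0)
dataset['Buy_Sell_on_Open'] = np.where(dataset['Open'].shift(-1) > dataset['Open'],1,0)
dataset['Buy_Sell'] = np.where(dataset['Adj Close'].shift(-1) > dataset['Adj Close'],1,0)
# Daily return: percentage change in Adj Close from the previous day
dataset['Returns'] = dataset['Adj Close'].pct_change()
# pct_change leaves NaN in the first row, which dropna removes
dataset = dataset.dropna()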
dataset.head()

Date,Open,High,Low,Close,Adj Close,Volume,Increase_Decrease,Buy_Sell_on_Open,Buy_Sell,Returns
2007-01-04,19.66,19.860001,19.32,19.790001,19.790001,23652500,0,0,0,0.013832
2007-01-05,19.540001,19.91,19.540001,19.709999,19.709999,15902400,0,1,0,-0.004043
2007-01-08,19.709999,19.860001,19.370001,19.469999,19.469999,15814800,0,0,1,-0.012177
2007-01-09,19.450001,19.709999,19.370001,19.65,19.65,14494200,1,1,1,0.009245
2007-01-10,19.639999,20.02,19.5,20.01,20.01,19783200,1,1,1,0.018321


In [4]:
X = dataset[['Open', 'High', 'Low', 'Volume']].values
y = dataset['Adj Close'].values

## Rescaling Data

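Rescaling data means multiplying each member of a data set by a constant k; that is, transforming each number x to f(x), where f(x) = kx and k and x are both real numbers. Rescaling changes the spread of the data as well as the position of the data points. (https://www.statisticshowto.datasciencecentral.com/what-is-rescaling-data/) The MinMaxScaler used here is a shift-and-scale variant: for each feature (column) it computes (x - min) / (max - min), which maps the column onto the requested feature_range, in this case 0 to 1.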

In [23]:
from sklearn.preprocessing import MinMaxScaler

In [25]:
scaler=MinMaxScaler(feature_range=(0,1))
rescaledX=scaler.fit_transform(X)
np.set_printoptions(precision=3) #Setting precision for the output
rescaledX[0:5,:]

array([[ 0.572,  0.56 ,  0.579,  0.073],
       [ 0.568,  0.561,  0.586,  0.049],
       [ 0.573,  0.56 ,  0.581,  0.049],
       [ 0.565,  0.555,  0.581,  0.045],
       [ 0.571,  0.565,  0.585,  0.061]])
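
To make the min-max formula concrete, here is a minimal NumPy sketch that computes it by hand. The array `X_toy` and its values are made up for illustration and are not the AMD data; the names `col_min`, `col_max`, and `X_minmax` are likewise assumptions of this sketch.

```python
import numpy as np

# Toy feature matrix (illustrative values only): a price-like and a volume-like column
X_toy = np.array([[10.0, 200.0],
                  [20.0, 400.0],
                  [15.0, 1000.0]])

# Column-wise min-max scaling: (x - min) / (max - min) maps each feature onto [0, 1]
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_minmax = (X_toy - col_min) / (col_max - col_min)

print(X_minmax)
```

Each column is scaled independently, so the smallest value in a column becomes 0 and the largest becomes 1 regardless of the column's original units.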

## Standardizing Data

Standardizing Data: A standardized variable (sometimes called a z-score or a standard score) is a variable that has been rescaled to have a mean of zero and a standard deviation of one. (https://stats.idre.ucla.edu/stata/faq/how-do-i-standardize-variables-in-stata/) For each feature, the transformation is z = (x - mean) / std.

In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler=StandardScaler().fit(X)
rescaledX=scaler.transform(X)
rescaledX[0:5,:]

array([[ 2.45048633,  2.40507386,  2.48078494, -0.2809095 ],
       [ 2.42666905,  2.41477762,  2.5256097 , -0.53851508],
       [ 2.46041008,  2.40507386,  2.49097254, -0.54142682],
       [ 2.40880595,  2.37596163,  2.49097254, -0.58532225],
       [ 2.44651656,  2.4361263 ,  2.51745957, -0.40952117]])
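
The z-score transformation can also be written out directly. The following is a small sketch on made-up values (`X_toy` is hypothetical, not the AMD data) that uses the population standard deviation, which is the convention StandardScaler follows.

```python
import numpy as np

# Toy feature matrix (illustrative values only)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Column-wise z-score: subtract the mean, divide by the standard deviation
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population std (ddof=0)
X_std = (X_toy - mean) / std

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```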

## Normalizing Data

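Normalizing Data: scikit-learn's Normalizer rescales each sample (row), not each feature (column), to unit norm, using the L2 norm by default. Every entry therefore lies between -1 and 1, or between 0 and 1 for non-negative data such as this. Because Volume is millions of times larger than the prices, it dominates each row's norm: in the output, the Volume entry is approximately 1 and the price entries are close to 0.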

In [9]:
from sklearn.preprocessing import Normalizer

In [10]:
scaler=Normalizer().fit(X)
normalizedX=scaler.transform(X)
normalizedX[0:5,:]

array([[  8.31201776e-07,   8.39657584e-07,   8.16826974e-07,
          1.00000000e+00],
       [  1.22874541e-06,   1.25201227e-06,   1.22874541e-06,
          1.00000000e+00],
       [  1.24630087e-06,   1.25578578e-06,   1.22480215e-06,
          1.00000000e+00],
       [  1.34191615e-06,   1.35985422e-06,   1.33639670e-06,
          1.00000000e+00],
       [  9.92761484e-07,   1.01196975e-06,   9.85684823e-07,
          1.00000000e+00]])
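
The row-wise behaviour is easy to reproduce by hand. This is a minimal sketch on hypothetical values (`X_toy` is made up), dividing each row by its own L2 norm; the second row mimics a small price next to a large volume.

```python
import numpy as np

# Toy samples (illustrative values only); the second row has one dominant entry
X_toy = np.array([[3.0, 4.0],
                  [1.0, 1000.0]])

# Row-wise L2 normalization: divide each row by its Euclidean length
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)
X_unit = X_toy / row_norms

print(X_unit)
```

The first row becomes `[0.6, 0.8]`, while in the second row the large entry ends up very close to 1 and the small one close to 0, the same pattern as in the output above. Scaling the features first, or normalizing prices and volume separately, would avoid one column swamping the others.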

## Binarizing Data

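Binarization is the process of transforming data features of any entity into vectors of binary numbers to make classifier algorithms more efficient. (https://deepai.org/machine-learning-glossary-and-terms/binarization) Binarizer maps every value greater than the threshold to 1 and every other value to 0. With threshold=0.0 and strictly positive prices and volumes, every entry of X becomes 1.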

In [11]:
from sklearn.preprocessing import Binarizer

In [12]:
binarizer=Binarizer(threshold=0.0).fit(X)
binaryX=binarizer.transform(X)
binaryX[0:5,:]

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
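
A threshold of 0.0 is more informative on a column that takes both signs. As a sketch, the same rule applied by hand to the first five Returns values from the table earlier in this notebook (the variable names are assumptions of this example):

```python
import numpy as np

# First five daily returns, copied from the dataset.head() output earlier in the notebook
returns = np.array([0.013832, -0.004043, -0.012177, 0.009245, 0.018321])

# Binarize: values strictly greater than the threshold become 1, everything else 0
threshold = 0.0
binary_returns = (returns > threshold).astype(float)

print(binary_returns)  # [1. 0. 0. 1. 1.]
```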

## Mean Removal

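Mean removal subtracts the mean from each column (feature) so that it is centered on zero. Note that scale also divides each column by its standard deviation by default, which is why the standard deviations of the result are all 1.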

In [13]:
from sklearn.preprocessing import scale

In [14]:
data_standardized=scale(dataset)
data_standardized.mean(axis=0)

array([  1.60042749e-16,  -2.07114146e-16,  -8.94356541e-17,
         1.88285587e-17,   1.88285587e-17,   5.64856762e-17,
         3.29499778e-17,  -2.35356984e-18,   6.58999556e-17,
         2.35356984e-18])

In [15]:
data_standardized.std(axis=0)

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
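
For comparison, pure mean removal without the division by the standard deviation looks like this. The sketch uses made-up values (`X_toy` is hypothetical); with scikit-learn, passing `with_std=False` to `scale` should give the same centering-only behaviour.

```python
import numpy as np

# Toy feature matrix (illustrative values only)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [6.0, 200.0]])

# Mean removal: subtract each column's mean so every column is centered on zero
X_centered = X_toy - X_toy.mean(axis=0)

print(X_centered)
print(X_centered.mean(axis=0))  # approximately 0 for every column
```

Unlike the standardized output above, the centered columns keep their original spread.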

## One Hot Encoding

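One-hot encoding converts categorical variables into a form that ML algorithms can use more effectively for prediction: each category becomes its own binary column. (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) Note that X here holds continuous values, so this cell only demonstrates the API; one-hot encoding is meant for categorical columns.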

In [16]:
from sklearn.preprocessing import OneHotEncoder

In [17]:
encoder=OneHotEncoder()
encoder.fit(X)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values='auto', sparse=True)
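
To show what the encoding looks like on genuinely categorical data, here is a minimal NumPy sketch. The `labels` array is a made-up example, and this is a hand-rolled illustration rather than what OneHotEncoder does internally.

```python
import numpy as np

# Hypothetical categorical column (company names)
labels = np.array(['Intel', 'Apple', 'Tesla', 'Apple'])

# Map each label to an integer code; categories come back sorted
categories, codes = np.unique(labels, return_inverse=True)

# One-hot: pick the row of the identity matrix that matches each code
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of its category, so no artificial ordering between categories is implied.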

## Label Encoding

Label encoding converts categorical values into numbers by assigning each distinct class an integer code.

In [18]:
from sklearn.preprocessing import LabelEncoder

In [19]:
label_encoder=LabelEncoder()
input_classes=['Apple','Intel','Microsoft','Google','Tesla'] # We will use company names
label_encoder.fit(input_classes)

LabelEncoder()

In [20]:
for i,companies in enumerate(label_encoder.classes_):
    print(companies,'-->',i)

Apple --> 0
Google --> 1
Intel --> 2
Microsoft --> 3
Tesla --> 4


In [21]:
labels=['Apple','Intel','Microsoft']
label_encoder.transform(labels)

array([0, 2, 3], dtype=int64)

In [22]:
label_encoder.inverse_transform(label_encoder.transform(labels))

array(['Apple', 'Intel', 'Microsoft'],
      dtype='<U9')
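
The mapping itself is simple enough to sketch in plain Python. The following reproduces the codes shown above under the assumption that classes are numbered in sorted order; the dictionary names are hypothetical.

```python
# Same company names as in the cell above
input_classes = ['Apple', 'Intel', 'Microsoft', 'Google', 'Tesla']

# Assign integer codes to the sorted unique classes
class_to_code = {name: code for code, name in enumerate(sorted(set(input_classes)))}
code_to_class = {code: name for name, code in class_to_code.items()}

labels = ['Apple', 'Intel', 'Microsoft']
encoded = [class_to_code[name] for name in labels]
decoded = [code_to_class[code] for code in encoded]

print(encoded)  # [0, 2, 3]
print(decoded)  # ['Apple', 'Intel', 'Microsoft']
```

The integer codes carry an implicit order (Apple < Google < ...), which is harmless for target labels but can mislead models when used on input features.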

## DictVectorizer

DictVectorizer is used for data stored as dictionaries that map feature names (labels) to numbers. It extracts the values into a numeric array with one column per feature name.

In [31]:
from sklearn.feature_extraction import DictVectorizer

In [36]:
companies = [{'Apple':180.25,'Intel':45.30,'Microsoft':30.26,'Google':203.75,'Tesla':302.18}] # We will use company names
vec = DictVectorizer()

In [37]:
vec.fit_transform(companies).toarray()

array([[ 180.25,  203.75,   45.3 ,   30.26,  302.18]])

In [38]:
vec.get_feature_names()  # on scikit-learn >= 1.0, use vec.get_feature_names_out()

['Apple', 'Google', 'Intel', 'Microsoft', 'Tesla']
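
With more than one dictionary, the vectorizer has to align keys that do not appear in every record. A plain-Python sketch of that idea, on made-up records, assuming sorted feature names and zero-filling for missing keys:

```python
# Hypothetical records; each dict maps a feature name to a number
records = [{'Apple': 180.25, 'Intel': 45.30},
           {'Intel': 46.10, 'Tesla': 302.18}]

# Column order: the sorted union of all keys
feature_names = sorted({key for row in records for key in row})

# One row per record; keys absent from a record are filled with 0.0
matrix = [[row.get(name, 0.0) for name in feature_names] for row in records]

print(feature_names)  # ['Apple', 'Intel', 'Tesla']
print(matrix)         # [[180.25, 45.3, 0.0], [0.0, 46.1, 302.18]]
```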

## Polynomial Features

PolynomialFeatures generates polynomial and interaction features. It produces a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. With interaction_only=True, only products of distinct features are kept, so squared terms are dropped.

In [26]:
from sklearn.preprocessing import PolynomialFeatures

In [27]:
poly = PolynomialFeatures(2)
poly.fit_transform(X)

array([[  1.000e+00,   1.966e+01,   1.986e+01, ...,   3.733e+02,
          4.570e+08,   5.594e+14],
       [  1.000e+00,   1.954e+01,   1.991e+01, ...,   3.818e+02,
          3.107e+08,   2.529e+14],
       [  1.000e+00,   1.971e+01,   1.986e+01, ...,   3.752e+02,
          3.063e+08,   2.501e+14],
       ..., 
       [  1.000e+00,   1.743e+01,   1.774e+01, ...,   2.703e+02,
          1.831e+09,   1.240e+16],
       [  1.000e+00,   1.753e+01,   1.831e+01, ...,   2.938e+02,
          1.872e+09,   1.193e+16],
       [  1.000e+00,   1.815e+01,   1.851e+01, ...,   3.186e+02,
          1.512e+09,   7.180e+15]])

In [28]:
poly = PolynomialFeatures(interaction_only=True)
poly.fit_transform(X)

array([[  1.000e+00,   1.966e+01,   1.986e+01, ...,   3.837e+02,
          4.697e+08,   4.570e+08],
       [  1.000e+00,   1.954e+01,   1.991e+01, ...,   3.890e+02,
          3.166e+08,   3.107e+08],
       [  1.000e+00,   1.971e+01,   1.986e+01, ...,   3.847e+02,
          3.141e+08,   3.063e+08],
       ..., 
       [  1.000e+00,   1.743e+01,   1.774e+01, ...,   2.916e+02,
          1.976e+09,   1.831e+09],
       [  1.000e+00,   1.753e+01,   1.831e+01, ...,   3.138e+02,
          2.000e+09,   1.872e+09],
       [  1.000e+00,   1.815e+01,   1.851e+01, ...,   3.304e+02,
          1.568e+09,   1.512e+09]])

## Imputer

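An imputer fills in missing values. With the default mean strategy, each missing value is replaced by the mean of its column (feature). X contains no missing values here, because dropna was applied earlier, so the output is identical to X.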

In [29]:
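# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

In [30]:
imputer = SimpleImputer(strategy='mean')  # mean is the default strategy
print(imputer.fit_transform(X))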

[[  1.966e+01   1.986e+01   1.932e+01   2.365e+07]
 [  1.954e+01   1.991e+01   1.954e+01   1.590e+07]
 [  1.971e+01   1.986e+01   1.937e+01   1.581e+07]
 ..., 
 [  1.743e+01   1.774e+01   1.644e+01   1.114e+08]
 [  1.753e+01   1.831e+01   1.714e+01   1.092e+08]
 [  1.815e+01   1.851e+01   1.785e+01   8.473e+07]]
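
Since the AMD matrix has nothing to fill in, the effect of mean imputation is easier to see on a toy array with gaps. A minimal NumPy sketch, with made-up values and hypothetical variable names:

```python
import numpy as np

# Toy matrix with missing entries (illustrative values only)
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 30.0],
                  [3.0, np.nan]])

# Column means computed while ignoring NaN
col_means = np.nanmean(X_toy, axis=0)

# Replace each NaN with the mean of its column (col_means broadcasts across rows)
X_filled = np.where(np.isnan(X_toy), col_means, X_toy)

print(X_filled)
```

The missing entry in the first column becomes 2.0 and the one in the second column becomes 20.0, while observed values are left unchanged.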
