# Data Mining
This notebook outlines the steps taken to gather stock price data and compute the technical indicators that will be used later in feature engineering.

In [33]:
import pandas_datareader.data as web
import pandas as pd
import numpy as np
from numpy import diff, sign  # the only NumPy names used unqualified below
from scipy.signal import argrelextrema
import datetime
import os
import sys
print(sys.version)
# Please note the following version of Python being used (3.5.2)

3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


## Import Trade Data
Currently, data on trades is collected daily in an automated fashion.  Trade information is gathered via Twitter and stored in a standard MySQL database.  This database is not yet available to the public, but will be made so at a future date.

In [34]:
url = "https://raw.githubusercontent.com/AdrianGPrado/StockMarket-ML/CK/trades.csv"
trades = pd.read_csv(url,index_col=0,parse_dates=[0])

### A glance at the trade data
The data tracks the Ticker, Strike, Option Type, Selling/Buying Activity, the initial volume (size) of the trade in contracts, and the date and time of the trade.  We also have the upcoming earnings dates, in case we choose to use that information later.

Each row corresponds to one trade, so we can see that we currently have the following number of trades.

In [35]:
trades.shape

(13604, 16)

In [36]:
trades.head()

Ticker,Strike,OptionType,ActivityType,InitialVolume,IS_Flag,TweetTimeStamp,TradeDate,TradeTime,ExpDate,startOpen,startLow,startHigh,startClose,startDayDelt,EarningsDate,EarningsTime
GGP,29.0,Calls,BUYING,1300,,2016-07-12 14:59:17,2016-07-12,14:59:17,2016-07-15,31.09,30.48,31.18,31.06,-0.000965,2016-08-01,after
SPY,218.0,Calls,SELLING,2503,,2016-07-12 14:58:51,2016-07-12,14:58:51,2016-08-05,214.53,213.43,215.3,214.95,0.001958,,
JNPR,22.0,Calls,BUYING,493,,2016-07-12 14:39:41,2016-07-12,14:39:41,2016-07-15,23.05,22.97,23.29,23.1,0.002169,2016-07-26,after
RLGY,30.0,Calls,BUYING,500,,2016-07-12 14:38:01,2016-07-12,14:38:01,2016-08-19,29.34,29.33,29.94,29.71,0.012611,2016-08-04,before
GLD,127.0,Calls,BUYING,10000,,2016-07-12 14:29:40,2016-07-12,14:29:40,2016-07-15,128.52,126.99,128.54,127.15,-0.01066,,
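
The date-like columns in the preview can be parsed as datetimes at load time. The sketch below is only an illustration: it uses a small in-memory CSV built from two rows of the preview (reduced to a few columns) instead of the real file, and the `DaysToExp` column is a hypothetical name introduced here, not part of the dataset.

```python
import io
import pandas as pd

# Two rows copied from the preview above, reduced to a few columns.
sample_csv = io.StringIO(
    "Ticker,Strike,OptionType,TradeDate,ExpDate,EarningsDate\n"
    "GGP,29.0,Calls,2016-07-12,2016-07-15,2016-08-01\n"
    "SPY,218.0,Calls,2016-07-12,2016-08-05,\n"
)

# Name the date columns explicitly so pandas converts them to datetimes.
sample = pd.read_csv(
    sample_csv,
    index_col=0,
    parse_dates=["TradeDate", "ExpDate", "EarningsDate"],
)

# Hypothetical derived column: calendar days from trade to expiration.
sample["DaysToExp"] = (sample["ExpDate"] - sample["TradeDate"]).dt.days
print(sample.dtypes)
print(sample["DaysToExp"])
```

A missing earnings date (as for SPY) is read as `NaT`, so the column keeps a datetime dtype.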


## Pulling Historical Quote Data

Our model will use technical indicators that are derived from prior price action for the given stocks (for example, a 100-day moving average).

To calculate these technical indicators, we need enough days of price data.  

Since the first trades tracked as part of this project are from mid-2016, the starting date selected for this analysis is late 2015.
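
To see why extra history is needed, consider a rolling mean: it stays undefined until its window is full. The sketch below uses ten synthetic prices and a window of 4, both arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1.0, 11.0))  # ten synthetic closing prices
window = 4
moving_avg = prices.rolling(window=window).mean()

# The first (window - 1) entries are NaN because the window is not yet full.
print(moving_avg.isna().sum())
print(moving_avg)
```

The same warm-up applies to the longer windows used in this notebook, which is why the quote history has to begin well before the first tracked trade.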

In [37]:
#set start and end dates
start = datetime.datetime(2015,11,1)
end = datetime.date.today()

### Initialize Empty Dataframe
We will create an empty dataframe to store the quote data from our selected stocks over the given time period.

In [38]:
#initialize df with mock data to retrieve columns headers
target = web.DataReader("F", 'yahoo', start, end)
target.reset_index(level=0, inplace=True)
target['Ticker'] = ""

#keep only the column headers (zero rows) to get an empty df
target2 = target.iloc[:0]

Next we will create a new df containing the distinct list of tickers from the trades df. We will loop through this list to collect the quote data.

In [39]:
trades.reset_index(level=0, inplace=True)
trades2 = pd.DataFrame(trades["Ticker"].unique())
trades2.columns = ['Ticker']
trades2.head()

Unnamed: 0,Ticker
0,GGP
1,SPY
2,JNPR
3,RLGY
4,GLD


In [40]:
trades2.shape

(1419, 1)

### Pulling the Quotes
We can see above how many tickers' worth of data we will be pulling. This process might take some time, depending on your machine's specs and network connection.

In [41]:
#pull quotes by ticker

for index, row in trades2.iterrows():
    print(row['Ticker'])
    ticker = row['Ticker']
    try:
        target_x = web.DataReader(ticker, 'yahoo',start,end)
        target_x['Ticker'] = ticker
        # target_x.reset_index(level=0, inplace=True)
        target2 = pd.concat([target2, target_x])
    except Exception:
        # skip tickers whose quotes cannot be downloaded
        pass

GGP
SPY
JNPR
RLGY
GLD
PCAR
TEVA
CMI
HUM
PRU
APC
COF
PAH
ALLY
CLF
MGM
VOYA
EPD
KITE
C
DRII
MDRX
SPLK
ACM
PAGP
KKR
CHGG
NXPI
CDE
AMX
WETF
TCK
WMT
KMI
INTC
NGD
UAL
ANET
GLNG
ETSY
SAND
YELP
CAG
WYN
SIG
LUV
EXPE
AAL
RIO
HFC
X
COH
LVLT
AR
ETFC
LNG
AIG
CSX
EEM
XLE
MPEL
TGT
F
SO
AOS
STX
SM
CERS
IWM
YHOO
DAL
JD
HPE
CS
TSLA
CVLT
ACN
MS
FIT
JCP
JPM
XLF
PIR
DIS
MCD
PFE
MSFT
USB
RCL
MRVL
GDX
CLR
SGMS
NVAX
DB
FEYE
MA
VNR
VALE
IOC
PBI
KSU
EGO
MDLZ
LYB
HUN
HP
AUY
PG
ETE
VLO
CTL
VGZ
EEP
ACHN
WNR
UNP
ABX
SPWR
WLL
CTSH
MTG
NTAP
CTRP
FB
OZM
ZAYO
JBLU
APA
FCX
CELG
BID
RIG
ADI
LOCK
CERN
CG
HAL
CPB
BK
BBY
GPRE
CCJ
EA
NEM
SHAK
GM
SBUX
QQQ
LRCX
TAP
RF
ESRX
CMCSA
AMAT
AAPL
PBA
VMW
MET
FCEL
NKE
PGNX
SWKS
BMY
CAT
MAN
IMPV
ZG
LBTYA
PYPL
QRVO
HOLX
SQ
CCL
RMP
INFY
LHO
AGN
NUE
WBA
DE
KO
XOP
AA
VIPS
OMC
HLF
ZION
SRPT
EXP
CAVM
CYBR
VXX
XOM
ATVI
DK
GSAT
AZN
GE
NRG
MRO
UA
AU
YPF
EBAY
QCOM
KATE
ARLZ
XBI
MU
TWTR
SFM
EQC
JBL
TRIP
MXIM
COL
JOY
FSM
XLNX
DVN
NOK
BWA
NBL
PXD
WMB
MRK
K
KHC
COP
WRK
MMM
YNDX
MJN
IMGN
BERY
ERIC
SBG
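
Growing a dataframe inside a loop copies the accumulated rows on every iteration. A common alternative is to collect the per-ticker frames in a list and concatenate once at the end. The sketch below assumes a hypothetical `fetch_quotes` function standing in for `web.DataReader`. It returns synthetic rows so that the example runs without network access, and it also records which tickers failed instead of silently skipping them.

```python
import pandas as pd

def fetch_quotes(ticker):
    """Hypothetical stand-in for web.DataReader; returns synthetic rows."""
    if ticker == "BAD":
        raise ValueError("no data for " + ticker)
    dates = pd.date_range("2016-01-04", periods=3, freq="B")
    return pd.DataFrame({"Adj Close": [1.0, 2.0, 3.0]}, index=dates)

frames, failed = [], []
for ticker in ["GGP", "BAD", "SPY"]:
    try:
        frame = fetch_quotes(ticker)
    except Exception as exc:
        # Remember the failure so it can be reviewed or retried later.
        failed.append((ticker, str(exc)))
        continue
    frame["Ticker"] = ticker
    frames.append(frame)

# One concatenation at the end instead of one per ticker.
all_quotes = pd.concat(frames)
print(all_quotes.shape)
print(failed)
```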

In [75]:
os.chdir("/Users/Collier/Dropbox/Skills/Python/Projects/Stocks/StockMarket-ML/")
target2.to_csv("all_quotes.csv", sep='\t', encoding='utf-8')

In [44]:
target2.shape

(465411, 8)

## Calculating Technical Indicators

In [57]:
#import quote data
quotes = target2
#url = "https://raw.githubusercontent.com/AdrianGPrado/StockMarket-ML/CK/all_quotes.csv"
#quotes = pd.read_csv(url,index_col=0,parse_dates=[0], sep='\t', encoding='utf-8')

Again, we will initialize an empty dataframe which will store the result of our calculations on the quotes dataframe.  We will name this 'quotesAll'.

We will also be looping through our ticker list to calculate indicators stock by stock.

In [63]:
quotesAll = quotes.iloc[:0].copy()
tickerList = pd.DataFrame(quotes['Ticker'].unique())
tickerList.columns = ['Ticker']

Below is the list of technical indicators we will be calculating. We will define the functions for some of these indicators below.

In [64]:
quotesAll["upper_band"] = ""
quotesAll["lower_band"] = ""
quotesAll["9d"] = ""
quotesAll["21d"] = ""
quotesAll["50d"] = ""
quotesAll["100d"] = ""
quotesAll["200d"] = ""
quotesAll["RSI"] = ""
quotesAll["MACD"] = ""

### Define Functions for Technical Indicators

#### Bollinger Bands
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:bollinger_bands

In [65]:
# http://quant.stackexchange.com/questions/11264/calculating-bollinger-band-correctly
def Bollinger_Bands(stock_price, window_size, num_of_std):
    rolling_mean = stock_price.rolling(window=window_size).mean()
    rolling_std  = stock_price.rolling(window=window_size).std()
    upper_band = pd.Series(rolling_mean + (rolling_std*num_of_std))
    lower_band = pd.Series(rolling_mean - (rolling_std*num_of_std))
    bands = pd.concat([upper_band, lower_band], axis=1)
    bands.columns = ['upper_band','lower_band']
    return bands
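
As a quick sanity check of the band calculation, the sketch below applies the same rolling mean and standard deviation logic to a synthetic random-walk price series. The lower-case function name and the synthetic data are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

def bollinger_bands(price, window_size, num_of_std):
    # Same calculation as above: rolling mean +/- a multiple of the rolling std.
    rolling_mean = price.rolling(window=window_size).mean()
    rolling_std = price.rolling(window=window_size).std()
    return pd.DataFrame({
        "upper_band": rolling_mean + num_of_std * rolling_std,
        "lower_band": rolling_mean - num_of_std * rolling_std,
    })

rng = np.random.RandomState(0)
price = pd.Series(100 + rng.randn(60).cumsum())  # synthetic random walk
bands = bollinger_bands(price, 20, 2)
width = bands["upper_band"] - bands["lower_band"]

print(bands.tail())
print(bands["upper_band"].isna().sum())  # rows lost to the 20-day warm-up
```

The band width equals `2 * num_of_std` rolling standard deviations, so it widens when recent prices are more volatile.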

#### Relative Strength Index (RSI)
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:relative_strength_index_rsi

In [66]:
# http://stackoverflow.com/questions/20526414/relative-strength-index-in-python-pandas
def RSI(series, period):
    delta = series.diff().dropna()
    u = delta * 0
    d = u.copy()
    u[delta > 0] = delta[delta > 0]
    d[delta < 0] = -delta[delta < 0]
    u[u.index[period-1]] = np.mean( u[:period] ) #seed with the average gain over the first period
    u = u.drop(u.index[:(period-1)])
    d[d.index[period-1]] = np.mean( d[:period] ) #seed with the average loss over the first period
    d = d.drop(d.index[:(period-1)])
    #Series.ewm replaces the deprecated pd.stats.moments.ewma
    rs = u.ewm(com=period-1, adjust=False).mean() / \
         d.ewm(com=period-1, adjust=False).mean()
    return 100 - 100 / (1 + rs)
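
The formula behind the function is RSI = 100 - 100 / (1 + RS), where RS is the smoothed average gain divided by the smoothed average loss. The sketch below is a simplified variant, not the function above: it skips the simple-average seeding step and relies on `ewm` alone, using `alpha=1/period`, which is the same smoothing factor as `com=period-1`. The function name and the synthetic prices are assumptions.

```python
import numpy as np
import pandas as pd

def rsi_sketch(close, period=14):
    delta = close.diff()
    gain = delta.clip(lower=0)    # positive moves, zero otherwise
    loss = -delta.clip(upper=0)   # magnitude of negative moves, zero otherwise
    # alpha = 1/period gives the same smoothing factor as com = period - 1.
    avg_gain = gain.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

rng = np.random.RandomState(1)
close = pd.Series(50 + rng.randn(100).cumsum())  # synthetic random walk
rsi = rsi_sketch(close)
print(rsi.dropna().describe())
```

By construction the result is bounded between 0 and 100; readings near the extremes indicate that recent moves were almost all in one direction.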

#### Moving Average Convergence/Divergence (MACD)
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_average_convergence_divergence_macd

In [67]:
# http://stackoverflow.com/questions/38270524/cannot-calculate-macd-via-python-pandas
def MACD(group, nslow=26, nfast=12):
    #Series.ewm replaces the deprecated pd.ewma
    emaslow = group.ewm(span=nslow, min_periods=1).mean()
    emafast = group.ewm(span=nfast, min_periods=1).mean()
    result = pd.DataFrame({'MACD': emafast-emaslow, 'emaSlw': emaslow, 'emaFst': emafast})
    return result
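
Two synthetic series help build intuition for the sign of the MACD line: on a flat series both averages coincide, while on a steadily rising series the fast average sits above the slow one. The helper name `macd_line` and both series are assumptions made for this sketch.

```python
import pandas as pd

def macd_line(close, nslow=26, nfast=12):
    # Fast EMA minus slow EMA, with the same spans as the function above.
    ema_slow = close.ewm(span=nslow, min_periods=1).mean()
    ema_fast = close.ewm(span=nfast, min_periods=1).mean()
    return ema_fast - ema_slow

flat = pd.Series([50.0] * 40)                      # no trend
rising = pd.Series([float(i) for i in range(40)])  # steady uptrend

print(macd_line(flat).abs().max())   # effectively zero
print(macd_line(rising).iloc[-1])    # positive in an uptrend
```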

Other Indicators we will be using, but have not defined functions for:
#### Moving Averages
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_averages
#### Sign Changes (Min/Max Local Extrema)
http://stackoverflow.com/questions/4624970/finding-local-maxima-minima-with-numpy-in-a-1d-numpy-array
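
The sign-change approach finds the positions where the first difference switches sign: from negative to positive at a local minimum, and from positive to negative at a local maximum. A minimal sketch on a hand-made array (the values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 4.0, 1.0])

# Sign of each step: +1 when rising, -1 when falling.
step_sign = np.sign(np.diff(x))
# A change in step sign marks a turning point; +1 realigns it with x.
turns = np.diff(step_sign)
local_min = (turns > 0).nonzero()[0] + 1
local_max = (turns < 0).nonzero()[0] + 1

print(local_min, x[local_min])
print(local_max, x[local_max])
```

Endpoints are never reported, since a turning point needs a neighbor on both sides.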

### Begin Calculation Iteration
The for loop below iterates through the ticker list and subsets the quotes dataframe to the corresponding stock's prices.  The technical indicators are then calculated, and the resulting dataframe is appended to the master dataframe 'quotesAll'.

In [68]:
tstart = datetime.datetime.now()
n = len(tickerList)
for index, row in tickerList.iterrows():
    ticker = row['Ticker']
    #copy the rows for this ticker so new columns do not touch 'quotes'
    subquote = pd.DataFrame(quotes[quotes.Ticker == ticker])
    series = pd.Series(subquote['Adj Close'])
    #Bollinger Bands
    #==============================================================
    BB = Bollinger_Bands(series, 20, 2)
    subquote["upper_band"] = BB["upper_band"]
    subquote["lower_band"] = BB["lower_band"]
    #Moving averages
    #==============================================================
    subquote["9d"] = np.round(subquote["Adj Close"].rolling(window = 9, center = False).mean(), 2)
    subquote["21d"] = np.round(subquote["Adj Close"].rolling(window = 21, center = False).mean(), 2)
    subquote["50d"] = np.round(subquote["Adj Close"].rolling(window = 50, center = False).mean(), 2)
    subquote["100d"] = np.round(subquote["Adj Close"].rolling(window = 100, center = False).mean(), 2)
    subquote["200d"] = np.round(subquote["Adj Close"].rolling(window = 200, center = False).mean(), 2)
    #RSI
    #==============================================================
    RSIx = RSI(series, 14)
    subquote["RSI"] = RSIx
    #MACD
    #==============================================================
    MACDx = MACD(series)
    subquote["MACD"] = MACDx["MACD"]
    #Sign-changes
    #==============================================================
    turns = np.diff(np.sign(np.diff(series))) # change in slope direction
    b = (turns > 0).nonzero()[0] + 1 # local min
    c = (turns < 0).nonzero()[0] + 1 # local max
    subquote['min'] = series.iloc[b]
    subquote['max'] = series.iloc[c]
    ## Append DF's
    #==============================================================
    quotesAll = pd.concat([quotesAll, subquote])
    print ((float(index)/n)*100) # percent of tickers processed
tend = datetime.datetime.now()
print(tend-tstart)


0.0
0.07052186177715092
0.14104372355430184
0.21156558533145275
0.2820874471086037
0.3526093088857546
0.4231311706629055
0.4936530324400564
0.5641748942172073
0.6346967559943583
0.7052186177715092
0.7757404795486601
0.846262341325811
0.9167842031029618
0.9873060648801129
1.0578279266572637
1.1283497884344147
1.1988716502115657
1.2693935119887165
1.3399153737658673
1.4104372355430184
1.4809590973201692
1.5514809590973202
1.622002820874471
1.692524682651622
1.7630465444287728
1.8335684062059237
1.904090267983075
1.9746121297602257
2.0451339915373765
2.1156558533145273
2.1861777150916786
2.2566995768688294
2.32722143864598
2.3977433004231314
2.4682651622002822
2.538787023977433
2.609308885754584
2.6798307475317347
2.7503526093088855
2.8208744710860367
2.8913963328631875
2.9619181946403383
3.0324400564174896
3.1029619181946404
3.173483779971791
3.244005641748942
3.314527503526093
3.385049365303244
3.4555712270803953
3.5260930888575457
3.596614950634697
3.6671368124118473
3.7376586741889986
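
The per-ticker loop can also be expressed with `groupby`, which applies a calculation within each ticker without explicit subsetting. The sketch below is an alternative formulation, not the method used in this notebook. It computes a 3-day moving average on a toy dataframe, where the data and the `3d` column name are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Ticker": ["A"] * 4 + ["B"] * 4,
    "Adj Close": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# transform keeps the original row order and restarts the window per ticker.
toy["3d"] = toy.groupby("Ticker")["Adj Close"].transform(
    lambda s: s.rolling(window=3).mean()
)
print(toy)
```

Because the rolling window restarts for each group, prices from one ticker never leak into another ticker's average.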

### Inspect Final Dataframe

In [70]:
quotesAll.tail()

Unnamed: 0,100d,200d,21d,50d,9d,Adj Close,Close,Date,High,Low,MACD,Open,RSI,Ticker,Volume,lower_band,max,min,upper_band
2017-02-17,44.18,41.11,47.7,48.27,47.37,48.350051,48.59,NaT,48.59,47.689999,-0.071556,47.990002,55.097859,RHI,811800,45.560843,,,49.734231
2017-02-21,44.28,41.16,47.68,48.28,47.51,48.36,48.599998,NaT,48.799999,48.080002,-0.009367,48.200001,55.169172,RHI,1013100,45.5618,48.36,,49.729294
2017-02-22,44.38,41.21,47.67,48.27,47.68,48.25,48.25,NaT,48.610001,47.77,0.030688,48.380001,54.145273,RHI,1038200,45.614453,,,49.589021
2017-02-23,44.5,41.26,47.62,48.27,47.8,47.98,47.98,NaT,48.560001,47.849998,0.040182,48.189999,51.613194,RHI,2292700,45.749866,,,49.295206
2017-02-24,44.6,41.31,47.54,48.26,47.91,47.959999,47.959999,NaT,47.959999,46.98,0.045567,47.450001,51.421357,RHI,741900,45.963838,,,48.90889


In [74]:
quotesAll.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 465411 entries, 2015-11-02 to 2017-02-24
Data columns (total 19 columns):
100d          325075 non-null float64
200d          183894 non-null float64
21d           437051 non-null float64
50d           395929 non-null float64
9d            454067 non-null float64
Adj Close     465411 non-null float64
Close         465411 non-null float64
Date          0 non-null datetime64[ns]
High          465411 non-null float64
Low           465411 non-null float64
MACD          465411 non-null float64
Open          465411 non-null float64
RSI           445559 non-null float64
Ticker        465411 non-null object
Volume        465411 non-null int64
lower_band    438469 non-null float64
max           122234 non-null float64
min           122165 non-null float64
upper_band    438469 non-null float64
dtypes: datetime64[ns](1), float64(16), int64(1), object(1)
memory usage: 71.0+ MB


In [73]:
##save DF
os.chdir("/Users/Collier/Dropbox/Skills/Python/Projects/Stocks/StockMarket-ML/")
quotesAll.to_csv("all_quotes_features.csv", sep='\t', encoding='utf-8')
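
When the saved file is loaded again, the same separator has to be passed, and the date index has to be parsed explicitly. The sketch below round-trips a two-row sample through a temporary file. The file name and the values are placeholders, not the real output.

```python
import os
import tempfile
import pandas as pd

dates = pd.to_datetime(["2017-02-23", "2017-02-24"])
features = pd.DataFrame(
    {"Ticker": ["RHI", "RHI"], "9d": [47.5, 48.25]},
    index=dates,
)

# Write and read back with the same tab separator and encoding.
path = os.path.join(tempfile.mkdtemp(), "features_sample.csv")
features.to_csv(path, sep="\t", encoding="utf-8")
restored = pd.read_csv(path, sep="\t", index_col=0, parse_dates=[0], encoding="utf-8")

print(restored)
print(restored.index.dtype)
```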

This concludes the Data Mining portion of the analysis.