# Data Mining & Feature Engineering (Part 1)
This notebook outlines the steps taken to gather stock price data and compute the technical indicators that will later be used in feature engineering. Some of our features must be calculated iteratively, one stock at a time.

In [886]:
import pandas_datareader.data as web
import pandas as pd
import numpy as np
from numpy import *
from scipy.signal import argrelextrema
import datetime
import os
import sys
print(sys.version)
# Please note the following version of Python being used (3.5.2)
pd.__version__
# Please note the following version of Pandas being used (0.19.1)
pd.set_option('display.max_columns', 99)

3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


## Import Trade Data
Currently, trade data is collected daily in an automated fashion. Trade information is gathered via Twitter and stored in a local MySQL database. This database is not yet hosted in the cloud, but will be at a future date.

In [992]:
url = "https://raw.githubusercontent.com/AdrianGPrado/StockMarket-ML/CK/trades.csv"
trades = pd.read_csv(url,parse_dates=[0], sep=',', encoding='utf-8')

# pd.read_csv(filename,index_col=0,parse_dates=[0], sep='\t', encoding='utf-8')

### A glance at the trade data
The data tracks the Ticker, Strike, Option Type, Selling/Buying Activity, and initial volume (size) of the trade in contracts, as well as the date and time of the trade. We also have the upcoming earnings dates, should we choose to use that information later.

Each row corresponds to one trade, so the row count below is the number of trades we currently have.

In [993]:
trades['TradeDate'] = pd.to_datetime(trades['TradeDate'], errors='coerce')
# shape goes last so the notebook displays it as the cell output
trades.shape

(13604, 17)

In [994]:
trades.head()

Unnamed: 0,Ticker,Strike,OptionType,ActivityType,InitialVolume,IS_Flag,TweetTimeStamp,TradeDate,TradeTime,ExpDate,startOpen,startLow,startHigh,startClose,startDayDelt,EarningsDate,EarningsTime
0,GGP,29.0,Calls,BUYING,1300,,2016-07-12 14:59:17,2016-07-12,14:59:17,2016-07-15,31.09,30.48,31.18,31.06,-0.000965,2016-08-01,after
1,SPY,218.0,Calls,SELLING,2503,,2016-07-12 14:58:51,2016-07-12,14:58:51,2016-08-05,214.53,213.43,215.3,214.95,0.001958,,
2,JNPR,22.0,Calls,BUYING,493,,2016-07-12 14:39:41,2016-07-12,14:39:41,2016-07-15,23.05,22.97,23.29,23.1,0.002169,2016-07-26,after
3,RLGY,30.0,Calls,BUYING,500,,2016-07-12 14:38:01,2016-07-12,14:38:01,2016-08-19,29.34,29.33,29.94,29.71,0.012611,2016-08-04,before
4,GLD,127.0,Calls,BUYING,10000,,2016-07-12 14:29:40,2016-07-12,14:29:40,2016-07-15,128.52,126.99,128.54,127.15,-0.01066,,


## Pulling Historical Quote Data

Our model will use technical indicators derived from prior price action for the given stocks (for example, a 100-day moving average).

To calculate these technical indicators, we need enough days of price data.  

Since the first trades tracked as part of this project are from mid-2016, the starting date selected for this analysis is late 2015.
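As a rough illustration of why extra history is needed, the sketch below uses a synthetic price series (made-up values, not project data) to show that an N-day simple moving average is undefined for the first N-1 rows. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic price series: 250 business days of made-up prices (illustration only)
dates = pd.bdate_range("2015-11-02", periods=250)
prices = pd.Series(np.linspace(20.0, 30.0, len(dates)), index=dates)

# Count the leading rows that have no value for each moving-average window
warmup = {}
for window in (5, 21, 50, 100, 200):
    sma = prices.rolling(window=window).mean()
    warmup[window] = int(sma.isnull().sum())

print(warmup)

# First date on which the 200-day average is available
first_valid_200 = prices.rolling(window=200).mean().first_valid_index()
print(first_valid_200)
```

The longest window therefore determines how far before the first trade the quote history has to start.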

In [995]:
#set start and end dates
start = datetime.datetime(2015,11,1)
end = datetime.date.today()

### Initialize Empty Dataframe
We will create an empty dataframe to store the quote data from our selected stocks over the given time period.

In [996]:
#initialize df with mock data to retrieve columns headers
target = web.DataReader("F", 'yahoo', start, end)
target.reset_index(level=0, inplace=True)
target['Ticker'] = ""

#remove all rows, leaving an empty df (column headers only) to store the quote data in
quotesAll = target.iloc[:0]

Next we will create a new df containing the distinct list of tickers from the trades df. We will loop through this list to collect the quote data.

In [997]:
trades.reset_index(level=0, inplace=True)
trades2 = pd.DataFrame(trades["Ticker"].unique())
trades2.columns = ['Ticker']
trades2.head()

Unnamed: 0,Ticker
0,GGP
1,SPY
2,JNPR
3,RLGY
4,GLD


In [998]:
trades2.shape

(1419, 1)

### Pulling the Quotes
We can see above how many tickers' worth of data we will be pulling. This process might take some time depending on your machine's specs.

In [999]:
import os.path

filename = "all_quotes.csv"

#Run quotes query if quotes not in directory (warning, can take a long time)
if not os.path.isfile(filename):
    #pull quotes by ticker, looping over the distinct ticker list
    for index, row in trades2.iterrows():
        ticker = row['Ticker']
        try:
            target_x = web.DataReader(ticker, 'yahoo', start, end)
            target_x['Ticker'] = ticker
            # target_x.reset_index(level=0, inplace=True)
            quotesAll = quotesAll.append(target_x)
        except Exception:
            #skip tickers for which no quote data could be retrieved
            pass
    quotesAll.to_csv(filename, sep='\t', encoding='utf-8')
else:
    #otherwise load the previously saved quotes
    quotesAll = pd.read_csv(filename, parse_dates=[0], sep='\t', encoding='utf-8')

In [1001]:
quotesAll = quotesAll.rename(columns = {'Unnamed: 0':'TradeDate'})

In [1000]:
quotesAll.shape

(465401, 9)

In [1002]:
quotesAll.head()

Unnamed: 0,TradeDate,Adj Close,Close,Date,High,Low,Open,Ticker,Volume
0,2015-11-02,28.210024,29.549999,,29.559999,28.860001,28.91,GGP,6466600
1,2015-11-03,27.39857,28.700001,,29.08,28.42,28.57,GGP,9076300
2,2015-11-04,27.379476,28.68,,28.83,28.559999,28.719999,GGP,3623100
3,2015-11-05,27.532221,28.84,,29.049999,28.57,28.67,GGP,4137500
4,2015-11-06,26.252984,27.5,,28.42,27.32,28.15,GGP,6539300


## Joining 'Quotes' & 'Trades' Datasets
We will join the trades and quotes datasets because our features (technical indicators) are derived from the price data as of the time of each trade.

In [1008]:
# JOIN DATA FRAMES
quotesAll_join = pd.merge(quotesAll, trades, how='left', on=['Ticker','TradeDate'])


quotes = quotesAll_join
#10K rows for a test
# quotes = quotes.ix[:10000]

In [1009]:
quotes.head()

Unnamed: 0,TradeDate,Adj Close,Close,Date,High,Low,Open,Ticker,Volume,index,Strike,OptionType,ActivityType,InitialVolume,IS_Flag,TweetTimeStamp,TradeTime,ExpDate,startOpen,startLow,startHigh,startClose,startDayDelt,EarningsDate,EarningsTime
0,2015-11-02,28.210024,29.549999,,29.559999,28.860001,28.91,GGP,6466600,,,,,,,,,,,,,,,,
1,2015-11-03,27.39857,28.700001,,29.08,28.42,28.57,GGP,9076300,,,,,,,,,,,,,,,,
2,2015-11-04,27.379476,28.68,,28.83,28.559999,28.719999,GGP,3623100,,,,,,,,,,,,,,,,
3,2015-11-05,27.532221,28.84,,29.049999,28.57,28.67,GGP,4137500,,,,,,,,,,,,,,,,
4,2015-11-06,26.252984,27.5,,28.42,27.32,28.15,GGP,6539300,,,,,,,,,,,,,,,,


In [1014]:
quotes.Strike.notnull().sum()

13586
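The count above (13,586 rows with a strike) is slightly below the 13,604 trades loaded earlier, which may indicate that a few trades found no matching quote row. The sketch below, on made-up toy frames, shows one way to check this with the `indicator` option of `pd.merge`; the `*_demo` names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the quotes and trades frames (values are made up)
quotes_demo = pd.DataFrame({
    "Ticker": ["GGP", "GGP", "SPY"],
    "TradeDate": pd.to_datetime(["2016-07-11", "2016-07-12", "2016-07-12"]),
    "Close": [31.0, 31.06, 214.95],
})
trades_demo = pd.DataFrame({
    "Ticker": ["GGP", "SPY", "XYZ"],
    "TradeDate": pd.to_datetime(["2016-07-12", "2016-07-12", "2016-07-12"]),
    "Strike": [29.0, 218.0, 10.0],
})

# Left join keeps every quote row; trade columns are NaN on days with no trade
joined = pd.merge(quotes_demo, trades_demo, how="left",
                  on=["Ticker", "TradeDate"], indicator=True)
matched = int((joined["_merge"] == "both").sum())

# Trades without a matching quote row do not appear in a left join at all
unmatched_trades = len(trades_demo) - matched
print(joined)
print(matched, unmatched_trades)
```

In this toy case the `XYZ` trade has no quote row, so it is absent from the joined frame.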

## Calculating Features

Again, we will initialize an empty dataframe which will store the result of our calculations on the quotes dataframe.  We will name this 'quotesAll', overwriting the prior dataframe for just the quotes.

As we do not want our numerical calculations to overlap between stocks, we will be looping through our ticker list to calculate indicators stock by stock.

In [857]:
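#.iloc[:0] keeps the column headers but no rows (label-based .ix[:0] would keep the row labeled 0)
quotesAll = quotes.iloc[:0]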
tickerList = pd.DataFrame(quotes['Ticker'].unique())
tickerList.columns = ['Ticker']
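To illustrate why calculations must not overlap between stocks, here is a minimal sketch on made-up data. It compares a rolling mean computed over the whole frame with one computed per ticker via `groupby`; this is an alternative to an explicit loop, shown only for illustration, and `quotes_demo` and the `3d_sma` column are hypothetical.

```python
import pandas as pd

# Two made-up tickers stacked in one frame, as in the quotes dataframe
quotes_demo = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Adj Close": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Naive rolling mean over the whole column mixes AAA and BBB prices at the boundary
naive = quotes_demo["Adj Close"].rolling(window=3).mean()

# Per-ticker rolling mean: each ticker starts its own window
quotes_demo["3d_sma"] = (
    quotes_demo.groupby("Ticker")["Adj Close"]
    .transform(lambda s: s.rolling(window=3).mean())
)
print(naive.iloc[4], quotes_demo["3d_sma"].iloc[4])
```

The first `BBB` row gets a blended value in the naive version and no value in the per-ticker version.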

The full list of columns we will be calculating is defined further below. First, we define functions for some of these indicators and features.

### Define Functions for Technical Indicators

#### Bollinger Bands
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:bollinger_bands

http://quant.stackexchange.com/questions/11264/calculating-bollinger-band-correctly

In [858]:
def Bollinger_Bands(stock_price, window_size, num_of_std):
    rolling_mean = stock_price.rolling(window=window_size).mean()
    rolling_std  = stock_price.rolling(window=window_size).std()
    upper_band = pd.Series(rolling_mean + (rolling_std*num_of_std))
    lower_band = pd.Series(rolling_mean - (rolling_std*num_of_std))
    bands = pd.concat([upper_band, lower_band], axis=1)
    bands.columns = ['upper_band','lower_band']
    return bands
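A short usage sketch on synthetic prices follows. It re-declares an equivalent helper under a hypothetical name (`bollinger_bands_demo`) so the block runs on its own. One detail worth knowing: pandas' rolling `std()` defaults to the sample standard deviation (`ddof=1`), so band widths may differ slightly from sources that use the population standard deviation.

```python
import numpy as np
import pandas as pd

def bollinger_bands_demo(price, window_size, num_of_std):
    """Rolling mean +/- num_of_std rolling standard deviations (hypothetical demo helper)."""
    rolling_mean = price.rolling(window=window_size).mean()
    rolling_std = price.rolling(window=window_size).std()  # sample std (ddof=1) by default
    return pd.DataFrame({
        "upper_band": rolling_mean + num_of_std * rolling_std,
        "lower_band": rolling_mean - num_of_std * rolling_std,
    })

# Made-up prices: a gentle oscillation around 100
price = pd.Series(100 + 5 * np.sin(np.arange(60) / 5.0))
bands = bollinger_bands_demo(price, window_size=20, num_of_std=2)

# Position of the price relative to the upper band, as in 'upper_band_pct'
upper_band_pct = (price - bands["upper_band"]) / bands["upper_band"]
print(bands.tail(3))
print(upper_band_pct.tail(3))
```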

#### Relative Strength Index (RSI)
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:relative_strength_index_rsi

In [859]:
def RSI(series, period):
    delta = series.diff().dropna()
    #split the daily changes into gains (u) and losses (d), both as non-negative values
    u = delta * 0
    d = u.copy()
    u[delta > 0] = delta[delta > 0]
    d[delta < 0] = -delta[delta < 0]
    u[u.index[period-1]] = np.mean( u[:period] ) #seed value is the average gain over the first period
    u = u.drop(u.index[:(period-1)])
    d[d.index[period-1]] = np.mean( d[:period] ) #seed value is the average loss over the first period
    d = d.drop(d.index[:(period-1)])
    #exponentially smoothed average gain / average loss (com=period-1 corresponds to alpha=1/period)
    rs = u.ewm(com=period-1, adjust=False).mean() / \
         d.ewm(com=period-1, adjust=False).mean()
    return 100 - 100 / (1 + rs)
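As a sanity check on the RSI idea, the sketch below is a compact variant written for illustration. It is not the function above: `rsi_wilder` is a hypothetical name, and it seeds the smoothing differently (it starts from the first change rather than a simple average). It shows the two boundary cases: a series that only rises gives an RSI of 100, and one that only falls gives 0.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close, period=14):
    """RSI with exponential smoothing, alpha = 1/period (hypothetical demo helper)."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive changes, 0 otherwise
    loss = (-delta).clip(lower=0)     # magnitude of negative changes, 0 otherwise
    avg_gain = gain.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

rising = pd.Series(np.arange(1.0, 41.0))     # made-up, strictly increasing prices
falling = pd.Series(np.arange(40.0, 0.0, -1.0))  # made-up, strictly decreasing prices
rsi_up = rsi_wilder(rising)
rsi_down = rsi_wilder(falling)
print(rsi_up.iloc[-1], rsi_down.iloc[-1])

# A noisy made-up series stays within the 0..100 range
rng = np.random.RandomState(0)
noisy = pd.Series(100 + rng.randn(200).cumsum())
rsi_noisy = rsi_wilder(noisy)
```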

#### Moving Average Convergence/Divergence (MACD)
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_average_convergence_divergence_macd

http://stackoverflow.com/questions/20526414/relative-strength-index-in-python-pandas

In [860]:
# http://stackoverflow.com/questions/38270524/cannot-calculate-macd-via-python-pandas
def MACD(group, nslow=26, nfast=12):
    #MACD line = fast EMA minus slow EMA of the price series
    emaslow = group.ewm(span=nslow, min_periods=1).mean()
    emafast = group.ewm(span=nfast, min_periods=1).mean()
    result = pd.DataFrame({'MACD': emafast-emaslow, 'emaSlw': emaslow, 'emaFst': emafast})
    return result
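A brief illustration of how the MACD line behaves, using made-up prices and the same 12/26 spans: on a steadily rising series the fast average sits above the slow one, so the MACD is positive after the first row. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up, strictly increasing prices
price = pd.Series(np.linspace(50.0, 80.0, 60))

ema_fast = price.ewm(span=12, min_periods=1).mean()
ema_slow = price.ewm(span=26, min_periods=1).mean()
macd_line = ema_fast - ema_slow

# Both averages equal the first price on row 0, so the MACD starts at zero
print(macd_line.iloc[0], macd_line.iloc[-1])
```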

#### Moving Average Cross

These functions mark the row in the chronological date series where one line has crossed another.  We will use this for moving average crosses (versus each other and versus price).

In [861]:
def pos_ma_cross(df,crosser,line):
    #1 where 'crosser' is above 'line' on this row but was below it on the previous row
    result = np.where(np.logical_and(df[crosser] > df[line],
                                     df[crosser].shift(1) < df[line].shift(1)),1,0)
    return result

In [862]:
def neg_ma_cross(df,crosser,line):
    #1 where 'crosser' is below 'line' on this row but was above it on the previous row
    result = np.where(np.logical_and(df[crosser] < df[line],
                                     df[crosser].shift(1) > df[line].shift(1)),1,0)
    return result
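A small worked example of cross detection on made-up numbers follows. The helper `upward_cross` is hypothetical and compares each row with the previous one, so a flag never depends on later data. It uses `<=` for the prior row, which is an assumption made here so that touching the line and then moving above it also counts as a cross.

```python
import numpy as np
import pandas as pd

def upward_cross(crosser, line):
    """1 where `crosser` is above `line` after being at or below it on the prior row."""
    above_now = crosser > line
    below_before = crosser.shift(1) <= line.shift(1)  # NaN on the first row compares as False
    return (above_now & below_before).astype(int)

fast = pd.Series([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 5.0])
slow = pd.Series([3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0])
flags = upward_cross(fast, slow)
print(flags.tolist())
```

Here the flags land on the rows holding 4.0 and 5.0, the first rows on which the fast line is above the slow one.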

#### Days Since Moving Average Cross & Local Min/Max Occurrences
http://stackoverflow.com/questions/25119524/pandas-conditional-rolling-count

This function will compute the cumulative number of days (rows) since the most recent moving average cross or sign change (local minimum/maximum).

In [863]:
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count +=1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0 #static variable
rolling_count.previous = None #static variable

Currently, the 'rolling_count' function above behaves unexpectedly: it also produces counts for rows before the triggering condition first occurs. While I have not yet discovered the reason for this, I have created the following 'cleanup' function, which blanks out these erroneous counts.

In [864]:
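def rolling_count_cleanup(df,cross,days):
    #blank out counts on rows before the first event (or on every row if there is no event)
    cross_locs = np.where(df[cross]==1)[0]
    if len(cross_locs) == 0:
        df[days] = np.nan
    elif cross_locs[0] > 0:
        df.loc[(df[days]>0) & (df['index_col'] < cross_locs[0]+1), days] = np.nan
    #always return the column so the caller's assignment never receives None
    return df[days]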
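Reading `rolling_count`, one likely explanation for the unexpected counts is that it measures the length of the current run of equal values, so a run of 0s before the first event is counted like any other run. Its state is also kept between calls, so it can carry over from one column or ticker to the next. As a possible alternative, the sketch below computes "rows since the last event" without shared state. The helper name `days_since_event` is hypothetical, and note that it returns 0 on the event row itself, whereas `rolling_count` returns 1 there.

```python
import numpy as np
import pandas as pd

def days_since_event(flag):
    """Rows elapsed since the most recent 1 in `flag`; NaN before the first event."""
    flag = flag.astype(int)
    event_id = flag.cumsum()  # 0 before the first event, then 1, 2, ...
    # position within each event's block: 0 on the event row, then 1, 2, ...
    counts = flag.groupby(event_id).cumcount().astype(float)
    counts[event_id == 0] = np.nan  # no event has happened yet
    return counts

flag = pd.Series([0, 0, 1, 0, 0, 1, 0])  # made-up event flags
since = days_since_event(flag)
print(since.tolist())
```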

#### Performance Since Conditional Event
The first function below returns the positions of a given conditional event (such as a moving average cross or a local min/max). The two functions after it compute the percentage change since that conditional event.

In [865]:
def find_cond_events(df,compflag):
    if df[compflag].any()==1:
        cross_locs = np.where(df[compflag]==1)
        cross_locs, = cross_locs
    else: cross_locs = "pass"    
    return(cross_locs)

In [866]:
def cross_sectional_perf(df,compcol,min_,max_,refcol=""):
    if refcol:
        df.loc[(df["index_col"]>cross_locs[min_]) & 
               (df["index_col"]<=cross_locs[max_]), 
               "return_since_last_"+str(compcol)] = (df['Adj Close'] - 
                                         df[refcol].iloc[cross_locs[min_]])/df[refcol].iloc[cross_locs[min_]]
    else:
        df.loc[(df["index_col"]>cross_locs[min_]) & 
               (df["index_col"]<=cross_locs[max_]), 
               "return_since_last_"+str(compcol)] = (df['Adj Close'] - 
                                                 df[compcol].iloc[cross_locs[min_]])/df[compcol].iloc[cross_locs[min_]]
    return(df)

In [867]:
def cross_sectional_perf2(df,compcol,min_,max_,refcol=""):   
    if refcol:
        df.loc[(df["index_col"]>cross_locs[min_]) & 
               (df["index_col"]<=max_), 
               "return_since_last_"+str(compcol)] = (df['Adj Close'] - 
                                                 df[refcol].iloc[cross_locs[min_]])/df[refcol].iloc[cross_locs[min_]]
    else:
        df.loc[(df["index_col"]>cross_locs[min_]) & 
               (df["index_col"]<=max_), 
               "return_since_last_"+str(compcol)] = (df['Adj Close'] - 
                                                 df[compcol].iloc[cross_locs[min_]])/df[compcol].iloc[cross_locs[min_]]
    return(df)
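The same "return since the last event" idea can be sketched in vectorized form. This is an illustrative alternative on made-up data, not the notebook's implementation: it carries the price at the most recent event forward and measures each row against it. The name `return_since_last_event` is hypothetical.

```python
import numpy as np
import pandas as pd

def return_since_last_event(price, flag):
    """Percentage change of `price` relative to its value at the most recent flagged row."""
    ref = price.where(flag == 1).ffill()  # price at the latest event, carried forward
    return (price - ref) / ref

price = pd.Series([10.0, 11.0, 12.0, 9.0, 10.0, 12.0])  # made-up prices
flag = pd.Series([0, 1, 0, 0, 1, 0])                    # made-up event flags
since_event = return_since_last_event(price, flag)
print(since_event.tolist())
```

Rows before the first event stay empty, and each event row itself has a return of 0.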

Other Indicators we will be using, but have not defined functions for:
#### Moving Averages
http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_averages
#### Sign Changes ( Local Minima/Maxima)
http://stackoverflow.com/questions/4624970/finding-local-maxima-minima-with-numpy-in-a-1d-numpy-array
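To make the sign-change technique concrete, here is a minimal worked example on a made-up series. A local minimum is a point where the day-over-day change turns from negative to positive, and a local maximum is the reverse. Note that a turning point at row t can only be identified once row t+1 is known.

```python
import numpy as np
import pandas as pd

series = pd.Series([3.0, 2.0, 1.0, 2.0, 3.0, 2.5, 4.0])  # made-up prices

# Sign of each day-over-day change, then where that sign flips
turns = np.diff(np.sign(np.diff(series.values)))
local_min_pos = np.nonzero(turns > 0)[0] + 1  # falling then rising
local_max_pos = np.nonzero(turns < 0)[0] + 1  # rising then falling

print(local_min_pos.tolist(), series.iloc[local_min_pos].tolist())
print(local_max_pos.tolist(), series.iloc[local_max_pos].tolist())
```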

### Define Calculation Columns

In [868]:
#Response Variables
quotesAll['5Day_FutureReturn_pct'] = ""
quotesAll['10Day_FutureReturn_pct'] = ""
quotesAll['20Day_FutureReturn_pct'] = ""
quotesAll['30Day_FutureReturn_pct'] = ""

#Features - Past Performance
quotesAll['5Day_PriorReturn_pct'] = ""
quotesAll['10Day_PriorReturn_pct'] = ""
quotesAll['20Day_PriorReturn_pct'] = ""
quotesAll['30Day_PriorReturn_pct'] = ""

#Features - Technical Indicators
quotesAll["upper_band"] = ""
quotesAll["upper_band_pct"] = ""
quotesAll["lower_band"] = ""
quotesAll["lower_band_pct"] = ""
quotesAll["5d_sma"] = ""
quotesAll["5d_sma_pct"] = ""
quotesAll["9d_sma"] = ""
quotesAll["9d_sma_pct"] = ""
quotesAll["21d_sma"] = ""
quotesAll["21d_sma_pct"] = ""
quotesAll["50d_sma"] = ""
quotesAll["50d_sma_pct"] = ""
quotesAll["100d_sma"] = ""
quotesAll["100d_sma_pct"] = ""
quotesAll["200d_sma"] = ""
quotesAll["200d_sma_pct"] = ""
quotesAll["RSI"] = ""
quotesAll["MACD"] = ""
quotesAll["Volume_5d_sma"] = ""
quotesAll["Volume_5d_sma_pct"] = ""

#Features - Sign Changes
quotesAll['local_min'] = ""
quotesAll['local_max'] = ""
quotesAll['local_min_flag'] = ""
quotesAll['local_max_flag'] = ""
quotesAll["return_since_last_local_min"] = ""
quotesAll["return_since_last_local_max"] = ""
quotesAll['local_min_Daycount'] = ""
quotesAll['local_max_Daycount'] = ""

#Price through MA Crosses
quotesAll["price_thru5_pos"] = ""
quotesAll["price_thru9_pos"] = ""
quotesAll["price_thru21_pos"] = ""
quotesAll["price_thru50_pos"] = ""
quotesAll["price_thru5_neg"] = ""
quotesAll["price_thru9_neg"] = ""
quotesAll["price_thru21_neg"] = ""
quotesAll["price_thru50_neg"] = ""
quotesAll["price_thru5_pos_Daycount"] = ""
quotesAll["price_thru9_pos_Daycount"] = ""
quotesAll["price_thru21_pos_Daycount"] = ""
quotesAll["price_thru50_pos_Daycount"] = ""
quotesAll["price_thru5_neg_Daycount"] = ""
quotesAll["price_thru9_neg_Daycount"] = ""
quotesAll["price_thru21_neg_Daycount"] = ""
quotesAll["price_thru50_neg_Daycount"] = ""

#Moving Average Crosses
quotesAll["5thru9_pos"] = ""
quotesAll["5thru21_pos"] = ""
quotesAll["5thru50_pos"] = ""
quotesAll["9thru21_pos"] = ""
quotesAll["9thru50_pos"] = ""
quotesAll["21thru50_pos"] = ""
quotesAll["5thru9_neg"] = ""
quotesAll["5thru21_neg"] = ""
quotesAll["5thru50_neg"] = ""
quotesAll["9thru21_neg"] = ""
quotesAll["9thru50_neg"] = ""
quotesAll["21thru50_neg"] = ""
quotesAll['5thru9_pos_DayCount'] = ""
quotesAll['5thru21_pos_DayCount'] = ""
quotesAll['5thru50_pos_DayCount'] = ""
quotesAll['9thru21_pos_DayCount'] = ""
quotesAll['9thru50_pos_DayCount'] = ""
quotesAll['21thru50_pos_DayCount'] = ""
quotesAll['5thru9_neg_DayCount'] = ""
quotesAll['5thru21_neg_DayCount'] = ""
quotesAll['5thru50_neg_DayCount'] = ""
quotesAll['9thru21_neg_DayCount'] = ""
quotesAll['9thru50_neg_DayCount'] = ""
quotesAll['21thru50_neg_DayCount'] = ""

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [874]:
col_list = pd.DataFrame({'condition':['5thru9_pos',
                                '5thru21_pos',
                                "5thru50_pos",
                                "9thru21_pos",
                                "9thru50_pos",
                                "21thru50_pos",
                                "5thru9_neg",
                                "5thru21_neg",
                                "5thru50_neg",
                                "9thru21_neg",
                                "9thru50_neg",
                                "21thru50_neg",
                                "price_thru5_pos",
                                "price_thru9_pos",
                                "price_thru21_pos",
                                "price_thru50_pos",
                                "price_thru5_neg",
                                "price_thru9_neg",
                                "price_thru21_neg",
                                "price_thru50_neg"],
                   'count':['5thru9_pos_DayCount',
                            '5thru21_pos_DayCount',
                            '5thru50_pos_DayCount',
                            '9thru21_pos_DayCount',
                            '9thru50_pos_DayCount',
                            '21thru50_pos_DayCount',
                            '5thru9_neg_DayCount',
                            '5thru21_neg_DayCount',
                            '5thru50_neg_DayCount',
                            '9thru21_neg_DayCount',
                            '9thru50_neg_DayCount',
                            '21thru50_neg_DayCount',
                            "price_thru5_pos_Daycount",
                            "price_thru9_pos_Daycount",
                            "price_thru21_pos_Daycount",
                            "price_thru50_pos_Daycount",
                            "price_thru5_neg_Daycount",
                            "price_thru9_neg_Daycount",
                            "price_thru21_neg_Daycount",
                            "price_thru50_neg_Daycount"]})

In [881]:
# TEST LOOP 1
# for idx,j in col_list.iterrows():
#     cross_locs = find_cond_events(subquote,j[0])
#     if cross_locs != "pass":
#         for i,item in enumerate(cross_locs):
#             if item != cross_locs.max():
#                 subquote = cross_sectional_perf(subquote,j[0],i,i+1,"Adj Close")
#             else: 
#                 subquote = cross_sectional_perf2(subquote,j[0],i,len(subquote),"Adj Close")
#     print(j[0])

5thru9_pos
5thru21_pos
5thru50_pos
9thru21_pos
9thru50_pos
21thru50_pos
5thru9_neg
5thru21_neg
5thru50_neg
9thru21_neg
9thru50_neg
21thru50_neg
price_thru5_pos
price_thru9_pos
price_thru21_pos
price_thru50_pos
price_thru5_neg
price_thru9_neg
price_thru21_neg
price_thru50_neg




In [882]:
# TEST LOOP 2
# for idx,j in col_list.iterrows():
#     subquote[j[1]] = subquote[j[0]].apply(rolling_count)
#     subquote[j[1]] = rolling_count_cleanup(subquote,j[0],j[1])
# #     print(j[0])
    
    
#     subquote['5thru9_pos_DayCount'] = subquote['5thru9_pos'].apply(rolling_count)
#     subquote['5thru9_pos_DayCount'] = rolling_count_cleanup(subquote,'5thru9_pos','5thru9_pos_DayCount')

5thru9_pos
5thru21_pos
5thru50_pos
9thru21_pos
9thru50_pos
21thru50_pos
5thru9_neg
5thru21_neg
5thru50_neg
9thru21_neg
9thru50_neg
21thru50_neg
price_thru5_pos
price_thru9_pos
price_thru21_pos
price_thru50_pos
price_thru5_neg
price_thru9_neg
price_thru21_neg
price_thru50_neg


### Begin Calculation Iteration
The for loop below iterates through the ticker list and subsets the quotes dataframe to the corresponding stock's prices. The technical indicators are then calculated, and the resulting dataframe is appended to the master dataframe 'quotesAll'.

In [877]:
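quotes = quotesAll_join
#10K rows for a test
#NOTE: .ix slices by label and includes the end label, so .ix[:10000] keeps 10,001 rows
#and .ix[:0] keeps the row labeled 0 rather than returning an empty frame
quotes = quotes.ix[:10000]
quotesAll = quotes.ix[:0]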

In [878]:
from datetime import datetime
tstart = datetime.now()
n = len(tickerList)
# %%timeit
for index, row in tickerList.iterrows():
# for index, row in tickerList.itertuples():
    ticker = row['Ticker']
    
    # subquote = pd.DataFrame(quotes2[quotes2.Ticker == ticker])
    subquote = pd.DataFrame(quotes[quotes.Ticker == ticker])
    subquote['index_col'] = range(1, len(subquote) + 1)
    series = pd.Series(subquote['Adj Close'])
    
    #Bollinger Bands
    #==============================================================
    BB = Bollinger_Bands(series, 20, 2)
    subquote["upper_band"] = BB["upper_band"]
    subquote["lower_band"] = BB["lower_band"]
    
    subquote["upper_band_pct"] = (subquote["Adj Close"] - 
                                  subquote["upper_band"])/subquote["upper_band"]
    
    subquote["lower_band_pct"] = (subquote["Adj Close"] - 
                                  subquote["lower_band"])/subquote["lower_band"]
    
    #Moving averages
    #==============================================================
    subquote["5d_sma"] = np.round(subquote["Adj Close"].rolling(window = 5, center = False).mean(), 2)
    subquote["9d_sma"] = np.round(subquote["Adj Close"].rolling(window = 9, center = False).mean(), 2)
    subquote["21d_sma"] = np.round(subquote["Adj Close"].rolling(window = 21, center = False).mean(), 2)
    subquote["50d_sma"] = np.round(subquote["Adj Close"].rolling(window = 50, center = False).mean(), 2)
    subquote["100d_sma"] = np.round(subquote["Adj Close"].rolling(window = 100, center = False).mean(), 2)
    subquote["200d_sma"] = np.round(subquote["Adj Close"].rolling(window = 200, center = False).mean(), 2)
    
    subquote["5d_sma_pct"] = (subquote["Adj Close"] - 
                              subquote["5d_sma"])/subquote["5d_sma"]
    
    subquote["9d_sma_pct"] = (subquote["Adj Close"] - 
                              subquote["9d_sma"])/subquote["9d_sma"]
    
    subquote["21d_sma_pct"] = (subquote["Adj Close"] - 
                               subquote["21d_sma"])/subquote["21d_sma"]
    
    subquote["50d_sma_pct"] = (subquote["Adj Close"] - 
                               subquote["50d_sma"])/subquote["50d_sma"]
    
    subquote["100d_sma_pct"] = (subquote["Adj Close"] - 
                                subquote["100d_sma"])/subquote["100d_sma"]
    
    subquote["200d_sma_pct"] = (subquote["Adj Close"] - 
                                subquote["200d_sma"])/subquote["200d_sma"]
    
    #RSI
    #==============================================================
    RSIx = RSI(series, 14)
    subquote["RSI"] = RSIx
    
    #MACD
    #==============================================================    
    MACDx = MACD(series)
    subquote["MACD"] = MACDx["MACD"]
    
    #Sign-changes
    #==============================================================    
    b = (diff(sign(diff(series))) > 0).nonzero()[0] + 1 # local min
    c = (diff(sign(diff(series))) < 0).nonzero()[0] + 1 # local max
    subquote['local_min'] = series.iloc[b]    
    subquote['local_min_flag'] = np.where(np.isnan(subquote['local_min']), 0,1)
    
    subquote['local_max'] = series.iloc[c]
    subquote['local_max_flag'] = np.where(np.isnan(subquote['local_max']), 0,1)
    
    subquote['local_min_Daycount'] = subquote['local_min_flag'].apply(rolling_count)
    subquote['local_min_Daycount'] = rolling_count_cleanup(subquote,'local_min_flag','local_min_Daycount')
    
    subquote['local_max_Daycount'] = subquote['local_max_flag'].apply(rolling_count)
    subquote['local_max_Daycount'] = rolling_count_cleanup(subquote,'local_max_flag','local_max_Daycount')
    
    ##Percent difference from local min/max (populated only on the extremum rows)
    subquote['local_min_pct'] = (subquote["Adj Close"] - 
                                 subquote["local_min"])/subquote["local_min"]
    
    subquote['local_max_pct'] = (subquote["Adj Close"] - 
                                 subquote["local_max"])/subquote["local_max"]
    
    ##Performance Since last min/max
    #===============================================================
    #Min
    cross_locs = find_cond_events(subquote,"local_min_flag")
    if cross_locs != "pass":
        for i,item in enumerate(cross_locs):
            if item != cross_locs.max():
                subquote = cross_sectional_perf(subquote,'local_min',i,i+1)
            else: 
                subquote = cross_sectional_perf2(subquote,'local_min',i,len(subquote))

    #Max
    cross_locs = find_cond_events(subquote,"local_max_flag")
    if cross_locs != "pass":
        for i,item in enumerate(cross_locs):
            if item != cross_locs.max():
                subquote = cross_sectional_perf(subquote,'local_max',i,i+1)
            else: 
                subquote = cross_sectional_perf2(subquote,'local_max',i,len(subquote))

    
    #Moving Average Crosses
    #==============================================================
    #Positive
    subquote['5thru9_pos'] = pos_ma_cross(subquote,"5d_sma","9d_sma")
    subquote['5thru21_pos'] = pos_ma_cross(subquote,"5d_sma","21d_sma")
    subquote['5thru50_pos'] = pos_ma_cross(subquote,"5d_sma","50d_sma")
    subquote['9thru21_pos'] = pos_ma_cross(subquote,"9d_sma","21d_sma")
    subquote['9thru50_pos'] = pos_ma_cross(subquote,"9d_sma","50d_sma")
    subquote['21thru50_pos'] = pos_ma_cross(subquote,"21d_sma","50d_sma")
     
    #Negative
    subquote['5thru9_neg']  = neg_ma_cross(subquote,"5d_sma","9d_sma")
    subquote['5thru21_neg'] = neg_ma_cross(subquote,"5d_sma","21d_sma")
    subquote['5thru50_neg'] = neg_ma_cross(subquote,"5d_sma","50d_sma")
    subquote['9thru21_neg'] = neg_ma_cross(subquote,"9d_sma","21d_sma")
    subquote['9thru50_neg'] = neg_ma_cross(subquote,"9d_sma","50d_sma")
    subquote['21thru50_neg'] = neg_ma_cross(subquote,"21d_sma","50d_sma")
    
    #Price Through Moving Average Crosses
    #===============================================================
    #Positive
    subquote["price_thru5_pos"] = pos_ma_cross(subquote,'Adj Close','5d_sma')
    subquote["price_thru9_pos"] = pos_ma_cross(subquote,'Adj Close','9d_sma')
    subquote["price_thru21_pos"] = pos_ma_cross(subquote,'Adj Close','21d_sma')
    subquote["price_thru50_pos"] = pos_ma_cross(subquote,'Adj Close','50d_sma')
    
    #Negative
    subquote["price_thru5_neg"] = neg_ma_cross(subquote,"Adj Close","5d_sma")
    subquote["price_thru9_neg"] = neg_ma_cross(subquote,"Adj Close","9d_sma")
    subquote["price_thru21_neg"] = neg_ma_cross(subquote,"Adj Close","21d_sma")
    subquote["price_thru50_neg"] = neg_ma_cross(subquote,"Adj Close","50d_sma")
    
    ## Performance Since Last Moving Average Cross
    #==============================================================
    for idx,j in col_list.iterrows():
        cross_locs = find_cond_events(subquote,j[0])
        if cross_locs != "pass":
            for i,item in enumerate(cross_locs):
                if item != cross_locs.max():
                    subquote = cross_sectional_perf(subquote,j[0],i,i+1,"Adj Close")
                else: 
                    subquote = cross_sectional_perf2(subquote,j[0],i,len(subquote),"Adj Close")
#         print(j[0])
    
    #Days Since Moving Average Crosses
    #=============================================================
    for idx,j in col_list.iterrows():
        subquote[j[1]] = subquote[j[0]].apply(rolling_count)
        subquote[j[1]] = rolling_count_cleanup(subquote,j[0],j[1])
#         print(j[0])
    
    
    #Prior Price Performance    
    subquote['5Day_PriorReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(5))/
                                        subquote['Adj Close'].shift(5))

    subquote['10Day_PriorReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(10))/
                                         subquote['Adj Close'].shift(10))
    
    subquote['20Day_PriorReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(20))/
                                         subquote['Adj Close'].shift(20))
    
    subquote['30Day_PriorReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(30))/
                                         subquote['Adj Close'].shift(30))

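    #Future Price Performance (Dependent Variables)
    #NOTE: as written, each value is (price today - price N days ahead) / price N days ahead,
    #so its sign is opposite to the conventional forward return (future - today) / today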
    subquote['5Day_FutureReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(-5))/
                                         subquote['Adj Close'].shift(-5))

    subquote['10Day_FutureReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(-10))/
                                         subquote['Adj Close'].shift(-10))
    
    subquote['20Day_FutureReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(-20))/
                                         subquote['Adj Close'].shift(-20))
    
    subquote['30Day_FutureReturn_pct'] = ((subquote['Adj Close'] - 
                                         subquote['Adj Close'].shift(-30))/
                                         subquote['Adj Close'].shift(-30))
    
    #Volume SMA & Trend
    subquote["Volume_5d_sma"] = np.round(subquote["Volume"].rolling(window = 5, center = False).mean(), 2)
    
    subquote["Volume_5d_sma_pct"] = (subquote["Volume"] - 
                                     subquote["Volume_5d_sma"])/subquote["Volume_5d_sma"]
    
    ## Append DF's
    #==============================================================
    quotesAll = quotesAll.append(subquote)
#     del(subquote)
#     print ((float(index)/n)*100)
tend = datetime.now()
print(tend-tstart)



0:00:47.253121
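For reference, the prior and forward return conventions can be sketched with `shift` on a made-up series. Under the conventional definition, the forward return at row t is (P[t+n] - P[t]) / P[t], which is the prior return observed n rows later. The loop above divides by the future price instead; if r is the conventional forward return, that quantity equals -r / (1 + r). Variable names here are illustrative.

```python
import pandas as pd

price = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0, 99.0])  # made-up prices
n = 2

# Prior return: change over the last n rows, known at row t
prior_return = price / price.shift(n) - 1

# Conventional forward return: change over the next n rows, unknown at row t
forward_return = price.shift(-n) / price - 1

# Quantity computed in the loop above, expressed with the same series
loop_value = (price - price.shift(-n)) / price.shift(-n)

print(prior_return.tolist())
print(forward_return.tolist())
```

The last n rows have no forward return, since the future price is not yet available.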


In [885]:
quotesAll.ActivityType

0        NaN
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5        NaN
6        NaN
7        NaN
8        NaN
9        NaN
10       NaN
11       NaN
12       NaN
13       NaN
14       NaN
15       NaN
16       NaN
17       NaN
18       NaN
19       NaN
20       NaN
21       NaN
22       NaN
23       NaN
24       NaN
25       NaN
26       NaN
27       NaN
28       NaN
        ... 
9971     NaN
9972     NaN
9973     NaN
9974     NaN
9975     NaN
9976     NaN
9977     NaN
9978     NaN
9979     NaN
9980     NaN
9981     NaN
9982     NaN
9983     NaN
9984     NaN
9985     NaN
9986     NaN
9987     NaN
9988     NaN
9989     NaN
9990     NaN
9991     NaN
9992     NaN
9993     NaN
9994     NaN
9995     NaN
9996     NaN
9997     NaN
9998     NaN
9999     NaN
10000    NaN
Name: ActivityType, dtype: object

In [883]:
quotesAll.shape

(10002, 123)

In [879]:
quotesAll.head(100)

Unnamed: 0,100d_sma,100d_sma_pct,10Day_FutureReturn_pct,10Day_PriorReturn_pct,200d_sma,200d_sma_pct,20Day_FutureReturn_pct,20Day_PriorReturn_pct,21d_sma,21d_sma_pct,21thru50_neg,21thru50_neg_DayCount,21thru50_pos,21thru50_pos_DayCount,30Day_FutureReturn_pct,30Day_PriorReturn_pct,50d_sma,50d_sma_pct,5Day_FutureReturn_pct,5Day_PriorReturn_pct,5d_sma,5d_sma_pct,5thru21_neg,5thru21_neg_DayCount,5thru21_pos,5thru21_pos_DayCount,5thru50_neg,5thru50_neg_DayCount,5thru50_pos,5thru50_pos_DayCount,5thru9_neg,5thru9_neg_DayCount,5thru9_pos,5thru9_pos_DayCount,9d_sma,9d_sma_pct,9thru21_neg,9thru21_neg_DayCount,9thru21_pos,9thru21_pos_DayCount,9thru50_neg,9thru50_neg_DayCount,9thru50_pos,9thru50_pos_DayCount,ActivityType,Adj Close,Close,Date,EarningsDate,...,local_min_flag,local_min_pct,lower_band,lower_band_pct,price_thru21_neg,price_thru21_neg_Daycount,price_thru21_pos,price_thru21_pos_Daycount,price_thru50_neg,price_thru50_neg_Daycount,price_thru50_pos,price_thru50_pos_Daycount,price_thru5_neg,price_thru5_neg_Daycount,price_thru5_pos,price_thru5_pos_Daycount,price_thru9_neg,price_thru9_neg_Daycount,price_thru9_pos,price_thru9_pos_Daycount,return_since_last_21thru50_neg,return_since_last_21thru50_pos,return_since_last_5thru21_neg,return_since_last_5thru21_pos,return_since_last_5thru50_neg,return_since_last_5thru50_pos,return_since_last_5thru9_neg,return_since_last_5thru9_pos,return_since_last_9thru21_neg,return_since_last_9thru21_pos,return_since_last_9thru50_neg,return_since_last_9thru50_pos,return_since_last_local_max,return_since_last_local_min,return_since_last_price_thru21_neg,return_since_last_price_thru21_pos,return_since_last_price_thru50_neg,return_since_last_price_thru50_pos,return_since_last_price_thru5_neg,return_since_last_price_thru5_pos,return_since_last_price_thru9_neg,return_since_last_price_thru9_pos,startClose,startDayDelt,startHigh,startLow,startOpen,upper_band,upper_band_pct
0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,28.210024,29.549999,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,,,0.162470,,,,0.131750,,,,0.0,,0.0,,0.112434,,,,0.109651,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,28.210024,29.549999,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,0.142516,,,,0.131257,,,,0.0,,0.0,,0.054444,,,,0.069300,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,27.398570,28.700001,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,0.128245,,,,0.132254,,,,0.0,,0.0,,0.061171,,,,0.072550,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,27.379476,28.680000,,,...,1.0,0.0,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,,0.000000,,,,,,,,,,,,,,,
3,,,0.132757,,,,0.093667,,,,0.0,,0.0,,0.091917,,,,0.093667,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,27.532221,28.840000,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,0.000000,0.005579,,,,,,,,,,,,,,,
4,,,0.066305,,,,0.052430,,,,0.0,,0.0,,0.042375,,,,0.092137,,27.35,-0.040110,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,26.252984,27.500000,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,-0.046463,-0.041144,,,,,,,,,,,,,,,
5,,,0.028185,,,,0.009860,,,,0.0,,0.0,,0.005941,,,,0.047600,-0.098816,26.80,-0.051402,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,25.422435,26.629999,,,...,1.0,0.0,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,-0.076630,0.000000,,,,,,,,,,,,,,,
6,,,0.046802,,,,0.015513,,,,0.0,,0.0,,-0.012065,,,,0.068471,-0.064808,26.44,-0.030903,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,25.622913,26.840000,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,0.000000,0.007886,,,,,,,,,,,,,,,
7,,,0.046166,,,,0.024521,,,,0.0,,0.0,,-0.017930,,,,0.051928,-0.067643,26.07,-0.020811,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,0.0,,0.0,,0.0,,0.0,,,25.527447,26.740000,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,-0.003726,0.004131,,,,,,,,,,,,,,,
8,,,0.018147,,,,0.009566,,,,0.0,,0.0,,-0.036155,,,,0.035742,-0.085645,25.60,-0.016632,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,26.50,-0.050029,0.0,,0.0,,0.0,,0.0,,,25.174226,26.370001,,,...,0.0,,,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,,,,,,,,,,,,,-0.017511,-0.009763,,,,,,,,,,,,,,,
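
The leading rows above are mostly empty because a rolling indicator has no value until its window is full: `5d_sma` first appears at index 4 and `9d_sma` at index 8. The sketch below reproduces the first `5d_sma` and `5d_sma_pct` values from the `Adj Close` figures printed for rows 0 to 5. It assumes the moving average is taken over `Adj Close` and rounded to two decimals, which matches the output shown here but is not confirmed by the code in this section.

```python
import pandas as pd

# 'Adj Close' for rows 0-5, copied from the quotesAll.head(100) output above
adj_close = pd.Series(
    [28.210024, 27.398570, 27.379476, 27.532221, 26.252984, 25.422435]
)

# Assumption: SMA over 'Adj Close', rounded to 2 decimals (inferred from the output)
sma_5d = adj_close.rolling(window=5).mean().round(2)

# Relative distance of the price from its moving average
sma_5d_pct = (adj_close - sma_5d) / sma_5d

warmup = pd.DataFrame(
    {"Adj Close": adj_close, "5d_sma": sma_5d, "5d_sma_pct": sma_5d_pct}
)
print(warmup)
```

The first four rows of `5d_sma` are NaN, so any model trained on these features needs to drop or impute the warm-up rows of the longest window in use (200 days for `200d_sma`).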


### Save Final Dataframe

In [884]:
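# Save the final DataFrame to disk.
# Note: the file is tab-separated despite its .csv extension,
# so it must be read back with sep='\t'.
os.chdir("/Users/Collier/Dropbox/Skills/Python/Projects/Stocks/StockMarket-ML/")
quotesAll.to_csv("all_quotes_features.csv", sep='\t', encoding='utf-8')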
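
As a quick check that the saved file can be loaded again, here is a minimal round-trip sketch. It uses a small stand-in frame (`demo`, a hypothetical name, holding two rows of values from the output above) and a temporary directory rather than the real `quotesAll` and project path. Passing `index_col=0` on read keeps the saved index from coming back as an extra `Unnamed: 0` column.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for quotesAll, using two of its column names
demo = pd.DataFrame(
    {"Adj Close": [28.210024, 27.398570], "Close": [29.549999, 28.700001]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "all_quotes_features.csv")

    # Same settings as the save cell: tab separator, UTF-8 encoding
    demo.to_csv(path, sep="\t", encoding="utf-8")

    # Read back with the matching separator; column 0 is the saved index
    restored = pd.read_csv(path, sep="\t", index_col=0, encoding="utf-8")

print(restored.shape)
print(np.allclose(restored.values, demo.values))
```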

This concludes the Data Mining portion of the analysis.