In [1]:
from IPython.display import display
import pandas as pd
import numpy as np
import datetime

### Predicting S&P500 stock returns using neural networks: acquiring and preparing data
 &nbsp;

This notebook downloads daily price and volume data for S&P 500 stocks from quandl.com and builds the dataset that is later used to train and test neural networks. A free API key is required to access Quandl data. For each ticker-day, the *adjusted* closing price and the trading volume are needed. The adjusted closing price accounts for splits and dividends, which makes the price series comparable over time. Quandl provides stock data for free, but its ETF data are not free. Earlier versions of this work used Yahoo Finance's API, which no longer seems to be available (February 2018); that source also provided adjusted prices and volume (number of shares traded), and it covered ETFs as well. Google's finance API doesn't seem to provide the adjusted closing price.
 &nbsp;
 
The inputs to the program are (1) a set of ticker symbols and (2) a time frame, i.e., the beginning and end dates. For this project, I use S&P 500 index constituents and 1/1/2000 - 1/31/2018.
  &nbsp;
  
Adjusted prices, which are used to compute returns, and volume (dollar trading volume, obtained by multiplying the closing price by share volume; the code uses the unadjusted close for this) are first converted to a 20 (=N) trading-day frequency, so that returns are computed over nonoverlapping 20-day windows and the dollar volume is the average daily dollar volume over those 20 days. Roughly speaking, this corresponds to a monthly frequency, and I refer to the dataset as monthly. This scheme sidesteps issues associated with calendar-month trading (end of the month / beginning of the month). It is relatively easy to modify this program to use calendar months or to investigate other frequencies (weekly, quarterly, etc.).
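
The 20-day conversion described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the prices and volumes are random numbers rather than real quotes, and the names `toy` and `monthly` are hypothetical and do not appear in the notebook's own cells.

```python
import numpy as np
import pandas as pd

N = 20  # window length in trading days

# Synthetic daily data for one hypothetical stock (not real prices)
rng = np.random.RandomState(0)
days = pd.bdate_range("2017-01-02", periods=5 * N + 1)
toy = pd.DataFrame({
    "P": 100 * np.cumprod(1 + rng.normal(0, 0.01, len(days))),  # adjusted close
    "V": rng.uniform(50, 150, len(days)),                        # daily dollar volume
}, index=days)

# N-day return ending today, and average dollar volume over the last N days
toy["R"] = toy["P"] / toy["P"].shift(N) - 1
toy["AV"] = toy["V"].rolling(N).mean()

# Keep every N-th day so that the return windows do not overlap
monthly = toy.dropna().iloc[::N]
print(monthly[["R", "AV"]])
```

Because the windows do not overlap, compounding the sampled returns recovers the total return over the whole sample, which is a convenient check on the construction.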
  &nbsp;
  
Since scaled data is important for network training, I standardize raw returns and volumes in two ways. Time-series scaling uses a rolling 12-month frame (this month plus the preceding 11 months): for each stock and month, the frame's mean is subtracted from the raw value and the result is divided by the frame's standard deviation. This gives a scaled return / volume for each stock relative to its own past year, so a large scaled value implies a high return relative to the stock's performance over the last year. I also scale returns and volume cross-sectionally, across all stocks on each date. Here, a large scaled return implies better performance than other stocks in that month. Finally, 12 lags of these scaled variables are recorded.
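
In symbols, the two scalings of returns can be written as follows. The notation is mine and is only meant as a reading aid: $R_{i,t}$ is the raw return of stock $i$ in month $t$, $n_t$ is the number of stocks available in month $t$, and the superscripts TS and CS mark time-series and cross-sectional statistics. Volume is treated the same way. The standard deviations are sample standard deviations, which is what pandas' `std` computes by default.

$$
RT_{i,t} = \frac{R_{i,t} - \bar{R}^{TS}_{i,t}}{s^{TS}_{i,t}}, \qquad
\bar{R}^{TS}_{i,t} = \frac{1}{12}\sum_{k=0}^{11} R_{i,t-k}, \qquad
s^{TS}_{i,t} = \sqrt{\frac{1}{11}\sum_{k=0}^{11}\left(R_{i,t-k} - \bar{R}^{TS}_{i,t}\right)^2}
$$

$$
RC_{i,t} = \frac{R_{i,t} - \bar{R}^{CS}_{t}}{s^{CS}_{t}}, \qquad
\bar{R}^{CS}_{t} = \frac{1}{n_t}\sum_{i=1}^{n_t} R_{i,t}, \qquad
s^{CS}_{t} = \sqrt{\frac{1}{n_t - 1}\sum_{i=1}^{n_t}\left(R_{i,t} - \bar{R}^{CS}_{t}\right)^2}
$$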
  &nbsp;
  
The output is a dataset that has all data items for a stock: the time-series and cross-sectional scaled data plus the lagged variables. 
 &nbsp;

Last revision (March 29, 2018): I added a volatility measure, standard deviation of daily returns. Also, I captured the lags of raw returns, average volumes, and standard deviation of daily returns in the dataset. I also removed parts of the program that were not relevant to this version.
 &nbsp;

 
 
March 2018

Murat Aydogdu

In [2]:
pd.set_option('display.max_columns', None)
pd.options.display.max_rows = 100
pd.options.display.float_format = '{:20,.4f}'.format

In [3]:
#Remove BF.B and BRK.B from the list as B-shares seem to be problematic for quandl
tickers = ['A','AAL','AAP','AAPL','ABBV','ABC','ABT','ACN','ADBE','ADI','ADM','ADP','ADS','ADSK','AEE','AEP','AES',
           'AET','AFL','AGN','AIG','AIV','AIZ','AJG','AKAM','ALB','ALGN','ALK','ALL','ALLE','ALXN','AMAT','AMD','AME',
           'AMG','AMGN','AMP','AMT','AMZN','ANDV','ANSS','ANTM','AON','AOS','APA','APC','APD','APH','APTV','ARE',
           'ARNC','ATVI','AVB','AVGO','AVY','AWK','AXP','AYI','AZO','BA','BAC','BAX','BBT','BBY','BDX',
           'BEN','BHF','BHGE','BIIB','BK','BLK','BLL','BMY',
           'BSX','BWA','BXP','C','CA','CAG','CAH','CAT','CB','CBG',
           'CBOE','CBS','CCI','CCL','CDNS','CELG','CERN','CF','CFG','CHD','CHK','CHRW','CHTR','CI','CINF','CL','CLX',
           'CMA','CMCSA','CME','CMG','CMI','CMS','CNC','CNP','COF','COG','COL','COO','COP','COST','COTY','CPB','CRM',
           'CSCO','CSRA','CSX','CTAS','CTL','CTSH','CTXS','CVS','CVX','CXO','D','DAL','DE','DFS','DG','DGX','DHI',
           'DHR','DIS','DISCA','DISCK','DISH','DLR','DLTR','DOV','DPS','DRE','DRI','DTE','DUK','DVA','DVN','DWDP',
           'DXC','EA','EBAY','ECL','ED','EFX','EIX','EL','EMN','EMR','EOG','EQIX','EQR','EQT','ES','ESRX','ESS',
           'ETFC','ETN','ETR','EVHC','EW','EXC','EXPD','EXPE','EXR','F','FAST','FB','FBHS','FCX','FDX','FE','FFIV',
           'FIS','FISV','FITB','FL','FLIR','FLR','FLS','FMC','FOX','FOXA','FRT','FTI','FTV','GD','GE','GGP','GILD',
           'GIS','GLW','GM','GOOG','GOOGL','GPC','GPN','GPS','GRMN','GS','GT','GWW','HAL','HAS','HBAN','HBI','HCA',
           'HCN','HCP','HD','HES','HIG','HII','HLT','HOG','HOLX','HON','HP','HPE','HPQ','HRB','HRL','HRS','HSIC',
           'HST','HSY','HUM','IBM','ICE','IDXX','IFF','ILMN','INCY','INFO','INTC','INTU','IP','IPG','IQV','IR','IRM',
           'ISRG','IT','ITW','IVZ','JBHT','JCI','JEC','JNJ','JNPR','JPM','JWN','K','KEY','KHC','KIM','KLAC','KMB',
           'KMI','KMX','KO','KORS','KR','KSS','KSU','L','LB','LEG','LEN','LH','LKQ','LLL','LLY','LMT','LNC','LNT',
           'LOW','LRCX','LUK','LUV','LYB','M','MA','MAA','MAC','MAR','MAS','MAT','MCD','MCHP','MCK','MCO','MDLZ',
           'MDT','MET','MGM','MHK','MKC','MLM','MMC','MMM','MNST','MO','MON','MOS','MPC','MRK','MRO','MS','MSFT',
           'MSI','MTB','MTD','MU','MYL','NAVI','NBL','NCLH','NDAQ','NEE','NEM','NFLX','NFX','NI','NKE','NLSN','NOC',
           'NOV','NRG','NSC','NTAP','NTRS','NUE','NVDA','NWL','NWS','NWSA','O','OKE','OMC','ORCL','ORLY','OXY','PAYX',
           'PBCT','PCAR','PCG','PCLN','PDCO','PEG',
           'PEP','PFE','PFG','PG','PGR','PH','PHM','PKG','PKI','PLD','PM','PNC',
           'PNR','PNW','PPG','PPL','PRGO','PRU','PSA','PSX','PVH','PWR','PX','PXD','PYPL','QCOM','QRVO','RCL','RE',
           'REG','REGN','RF','RHI','RHT','RJF','RL','RMD','ROK','ROP','ROST','RRC','RSG','RTN','SBAC','SBUX','SCG',
           'SCHW','SEE','SHW','SIG','SJM','SLB','SLG','SNA','SNI','SNPS','SO','SPG','SPGI','SRCL','SRE','STI','STT',
           'STX','STZ','SWK','SWKS','SYF','SYK','SYMC','SYY','T','TAP','TDG','TEL','TGT','TIF','TJX','TMK','TMO','TPR',
           'TRIP','TROW','TRV','TSCO','TSN','TSS','TWX','TXN','TXT','UA','UAA','UAL','UDR','UHS','ULTA','UNH','UNM',
           'UNP','UPS','URI','USB','UTX','V','VAR','VFC','VIAB','VLO','VMC','VNO','VRSK','VRSN','VRTX','VTR','VZ',
           'WAT','WBA','WDC','WEC','WFC','WHR','WLTW','WM','WMB','WMT','WRK','WU','WY','WYN','WYNN','XEC','XEL','XL',
           'XLNX','XOM','XRAY','XRX','XYL','YUM','ZBH','ZION','ZTS']

In [13]:
#!pip install quandl
import quandl

In [10]:
# I ran this program three times
# First, quandl threw an error for B-class shares
# Second, I got a nondescript error
# For the first run, create the output file. Afterwards, append to the output file
new = 1
for symbol in tickers:
    df = quandl.get("WIKI/{}".format(symbol), authtoken = "...", 
                    start_date="2000-1-1", end_date="2018-1-31").reset_index()
    df['Ticker'] = symbol
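    # V: daily dollar volume in millions (unadjusted close times share volume)
    df['V'] = df['Close']*df['Volume'] / 1000000
    # P: closing price adjusted for splits and dividends
    df['P'] = df['Adj. Close']
    dfkeep = df[['Ticker','Date','V','P']]
    print(symbol)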
    if new == 1:
        dfkeep.to_csv('SP500_Data.CSV', index = False, float_format='%.2f')
        new = 0
    else:
        dfkeep.to_csv('SP500_Data.CSV', index = False, mode = 'a', header = False, float_format='%.2f')

A
AAL
AAP
AAPL
ABBV
ABC
ABT
ACN
ADBE
ADI
ADM
ADP
ADS
ADSK
AEE
AEP
AES
AET
AFL
AGN
AIG
AIV
AIZ
AJG
AKAM
ALB
ALGN
ALK
ALL
ALLE
ALXN
AMAT
AMD
AME
AMG
AMGN
AMP
AMT
AMZN
ANDV
ANSS
ANTM
AON
AOS
APA
APC
APD
APH
APTV
ARE
ARNC
ATVI
AVB
AVGO
AVY
AWK
AXP
AYI
AZO
BA
BAC
BAX
BBT
BBY
BDX
BEN
BHF
BHGE
BIIB
BK
BLK
BLL
BMY
BSX
BWA
BXP
C
CA
CAG
CAH
CAT
CB
CBG
CBOE
CBS
CCI
CCL
CDNS
CELG
CERN
CF
CFG
CHD
CHK
CHRW
CHTR
CI
CINF
CL
CLX
CMA
CMCSA
CME
CMG
CMI
CMS
CNC
CNP
COF
COG
COL
COO
COP
COST
COTY
CPB
CRM
CSCO
CSRA
CSX
CTAS
CTL
CTSH
CTXS
CVS
CVX
CXO
D
DAL
DE
DFS
DG
DGX
DHI
DHR
DIS
DISCA
DISCK
DISH
DLR
DLTR
DOV
DPS
DRE
DRI
DTE
DUK
DVA
DVN
DWDP
DXC
EA
EBAY
ECL
ED
EFX
EIX
EL
EMN
EMR
EOG
EQIX
EQR
EQT
ES
ESRX
ESS
ETFC
ETN
ETR
EVHC
EW
EXC
EXPD
EXPE
EXR
F
FAST
FB
FBHS
FCX
FDX
FE
FFIV
FIS
FISV
FITB
FL
FLIR
FLR
FLS
FMC
FOX
FOXA
FRT
FTI
FTV
GD
GE
GGP
GILD
GIS
GLW
GM
GOOG
GOOGL
GPC
GPN
GPS
GRMN
GS
GT
GWW
HAL
HAS
HBAN
HBI
HCA
HCN
HCP
HD
HES
HIG
HII
HLT
HOG
HOLX
HON
HP
HPE
HPQ
HRB
HRL
HRS
HSIC
HST
HSY
HUM
IBM
ICE
IDXX
IF
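
Since the download had to be restarted after errors, one option is a loop that records failures and keeps going. The sketch below is an assumption-laden illustration, not the code used for this dataset: `download_all` and `fake_fetch` are hypothetical names, and `fetch` stands in for a function that would wrap the `quandl.get` call and return the columns to keep.

```python
import os
import tempfile
import pandas as pd

def download_all(tickers, fetch, path):
    """Append one ticker at a time to `path`; return the tickers that failed.

    `fetch` is a hypothetical callable that takes a ticker symbol and
    returns a DataFrame.
    """
    failed = []
    for symbol in tickers:
        try:
            frame = fetch(symbol)
        except Exception as err:  # keep going and retry the failures later
            print("{}: {}".format(symbol, err))
            failed.append(symbol)
            continue
        # Write the header only when the file does not exist yet
        first_write = not os.path.exists(path)
        frame.to_csv(path, index=False, mode="a", header=first_write)
    return failed

# Demo with a stand-in fetch function (no network access needed)
def fake_fetch(symbol):
    if "." in symbol:
        raise ValueError("unsupported ticker")
    return pd.DataFrame({"Ticker": [symbol], "P": [1.0]})

out_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
failed = download_all(["A", "BRK.B", "AAL"], fake_fetch, out_path)
print(failed)
```

The returned list can be passed back in for a second attempt, so a single bad ticker does not force renaming and recombining partial files.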

In [11]:
df = pd.read_csv("SP500_Data.CSV")

# If the data acquisition works in one shot, all data will be in one file.
# Otherwise, that step may have to run multiple times.
# After each run, I rename SP500_Data.CSV using the first and last ticker in that run,
# then I combine those files.
#df1 = pd.read_csv("SP500_Data_A_APD.CSV")
#df2 = pd.read_csv("SP500_Data_APH_XOM.CSV")
#df3 = pd.read_csv("SP500_Data_XRAY_ZTS.CSV")
#df = pd.concat([df1, df2], axis=0)
#df = pd.concat([df, df3], axis=0)
#del df1, df2, df3

In [12]:
df.sort_values(by = ['Ticker','Date'], ascending=True, inplace=True)
print(df.shape)
display(df)

(2036919, 4)


Unnamed: 0,Ticker,Date,V,P
0,A,2000-01-03,240.7400,49.1200
1,A,2000-01-04,226.6700,45.3700
2,A,2000-01-05,253.5800,42.0000
3,A,2000-01-06,108.7700,40.9300
4,A,2000-01-07,131.1000,44.3500
5,A,2000-01-10,105.9500,47.0300
6,A,2000-01-11,90.2800,46.3900
7,A,2000-01-12,68.1500,45.4600
8,A,2000-01-13,54.8700,46.1400
9,A,2000-01-14,64.4100,46.6500


In [15]:
# Summary statistics by ticker
def f(x):
    d = {}
    d['date_count'] = x['Date'].count()
    d['date_min'] = x['Date'].min()
    d['date_max'] = x['Date'].max()
    return pd.Series(d, index=['date_count', 'date_min', 'date_max'])

In [16]:
# Obtain the longest data series for a reasonably large sample

df_summary = df.groupby('Ticker').apply(f)
df_summary = df_summary.sort_values(['date_max', 'date_count', 'date_min'], ascending=[True, False, True])
pd.options.display.max_rows = 1000
print(df_summary.shape)
display(df_summary)

(503, 3)


Ticker,date_count,date_min,date_max
MMM,4500,2000-01-03,2017-12-05
FAST,4549,2000-01-03,2018-01-31
LOW,4549,2000-01-03,2018-01-31
A,4548,2000-01-03,2018-01-31
ABC,4548,2000-01-03,2018-01-31
ADI,4548,2000-01-03,2018-01-31
ADM,4548,2000-01-03,2018-01-31
ADP,4548,2000-01-03,2018-01-31
ADSK,4548,2000-01-03,2018-01-31
AEE,4548,2000-01-03,2018-01-31
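
The same per-ticker summary can also be produced with a built-in aggregation instead of a custom function. This is a sketch on a hypothetical toy frame, not a replacement for the cell above. As in the notebook, dates are ISO-formatted strings, so `min` and `max` on the strings follow chronological order.

```python
import pandas as pd

# Hypothetical toy data: ticker A has three dates, ticker B has two
toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "B", "B"],
    "Date": ["2000-01-03", "2000-01-04", "2000-01-05", "2000-01-04", "2000-01-05"],
})

# Count, first date, and last date for each ticker
summary = toy.groupby("Ticker")["Date"].agg(["count", "min", "max"])
summary.columns = ["date_count", "date_min", "date_max"]
print(summary)
```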


In [17]:
# Two tickers (FAST, LOW) have 4549 observations. Most others have 4548 observations
# and some 4547 observations
# MMM ends on 2017-12-05
# Keep tickers whose series runs through 2018-01-31 and starts on or before 2002-01-01.
# Dates are ISO-formatted strings, so string comparison orders them chronologically.
keeplist = df_summary.where((df_summary['date_max'] == '2018-01-31') & (df_summary['date_min'] <= '2002-01-01')).reset_index()
keeplist = keeplist.dropna(axis=0, how='any')
#display(keeplist)
keeplist = keeplist['Ticker'].tolist()
print(len(keeplist), keeplist)

409 ['FAST', 'LOW', 'A', 'ABC', 'ADI', 'ADM', 'ADP', 'ADSK', 'AEE', 'AEP', 'AES', 'AET', 'AFL', 'AGN', 'AIG', 'AIV', 'AJG', 'ALB', 'ALK', 'ALL', 'ALXN', 'AMAT', 'AMD', 'AME', 'AMG', 'AMGN', 'AMT', 'ANSS', 'AON', 'AOS', 'APA', 'APC', 'APD', 'APH', 'ARE', 'ARNC', 'AVB', 'AVY', 'AXP', 'AZO', 'BA', 'BAC', 'BAX', 'BBT', 'BBY', 'BDX', 'BEN', 'BIIB', 'BK', 'BLK', 'BLL', 'BMY', 'BSX', 'BWA', 'BXP', 'C', 'CA', 'CAG', 'CAH', 'CAT', 'CB', 'CCI', 'CCL', 'CDNS', 'CERN', 'CHD', 'CHK', 'CHRW', 'CI', 'CINF', 'CL', 'CLX', 'CMA', 'CMI', 'CMS', 'CNP', 'COF', 'COG', 'COO', 'COP', 'CPB', 'CSCO', 'CSX', 'CTAS', 'CTL', 'CTSH', 'CTXS', 'CVS', 'CVX', 'D', 'DE', 'DGX', 'DHI', 'DHR', 'DIS', 'DISH', 'DLTR', 'DOV', 'DRE', 'DRI', 'DTE', 'DUK', 'DVA', 'DVN', 'EA', 'EBAY', 'ECL', 'ED', 'EFX', 'EIX', 'EL', 'EMN', 'EMR', 'EOG', 'EQR', 'EQT', 'ES', 'ESRX', 'ESS', 'ETFC', 'ETN', 'ETR', 'EXC', 'EXPD', 'F', 'FCX', 'FDX', 'FE', 'FFIV', 'FISV', 'FITB', 'FL', 'FLIR', 'FLS', 'FMC', 'FOX', 'FOXA', 'FRT', 'GD', 'GE', 'GGP', 'GIL

In [18]:
# Select data based on tickers that have a long enough time series
# Then select the observations in the date range
dtdf = df[df['Ticker'].isin(keeplist)]
dtdf = dtdf[dtdf['Date'] >= '2002-01-01']
print(dtdf.shape)
display(dtdf)

(1655623, 4)


Unnamed: 0,Ticker,Date,V,P
500,A,2002-01-02,63.1600,19.9600
501,A,2002-01-03,101.4000,21.2200
502,A,2002-01-04,167.7700,22.3600
503,A,2002-01-07,124.3700,22.2800
504,A,2002-01-08,81.7200,22.3400
505,A,2002-01-09,67.6700,21.8100
506,A,2002-01-10,39.2700,21.6400
507,A,2002-01-11,43.9600,21.1800
508,A,2002-01-14,67.8300,20.7500
509,A,2002-01-15,56.6400,20.7700


In [19]:
# This is the remaining data set
df_summary = dtdf.groupby('Ticker').apply(f)
df_summary = df_summary.sort_values(['date_max', 'date_count', 'date_min'], ascending=[True, False, True])
display(df_summary)

Ticker,date_count,date_min,date_max
FAST,4049,2002-01-02,2018-01-31
LOW,4049,2002-01-02,2018-01-31
A,4048,2002-01-02,2018-01-31
AAP,4048,2002-01-02,2018-01-31
ABC,4048,2002-01-02,2018-01-31
ACN,4048,2002-01-02,2018-01-31
ADI,4048,2002-01-02,2018-01-31
ADM,4048,2002-01-02,2018-01-31
ADP,4048,2002-01-02,2018-01-31
ADS,4048,2002-01-02,2018-01-31


Construct variables using lags and leads.

Today is day $t$. For each day $t$, we need $P_t$, $P_{t-N}$, $P_{t+N}$, and $AV_t$.

$AV_t$ is the average dollar volume of the last $N$ days, ending on (i.e., including) today. This parallels how returns are calculated.
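
A small sketch of how lags and leads can be taken within each ticker: a positive shift looks back and a negative shift looks forward. The frame `toy` and the column names `P_lag` and `P_lead` are hypothetical, and N is set to 2 only to keep the example readable.

```python
import pandas as pd

N = 2  # small window so the toy example stays readable
toy = pd.DataFrame({
    "Ticker": ["A"] * 5 + ["B"] * 5,
    "P": [10.0, 11.0, 12.0, 13.0, 14.0, 50.0, 51.0, 52.0, 53.0, 54.0],
})

# Lag: the price N rows earlier within the same ticker
toy["P_lag"] = toy.groupby("Ticker")["P"].shift(N)
# Lead: the price N rows later within the same ticker
toy["P_lead"] = toy.groupby("Ticker")["P"].shift(-N)
print(toy)
```

The first N rows of each ticker have no lag and the last N rows have no lead, so those cells are missing rather than filled from a neighboring ticker.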

In [20]:
dtdf.sort_values(by = ['Ticker','Date'], ascending=True, inplace=True)

In [21]:
# m: minus, p: plus
# Negative values can be used for leads ("forward lags")
# Return will be measured based on Pt and Pt-N
# Average dollar volume will be measured over the N days Vt-N+1 through Vt

N = 20
lagN = N*1
leadN = N*-1

# Price: we need the N-lead and N-lag values
# Also, get the volatility of daily returns
cname = 'P'+'m'+str(1).zfill(2)
dtdf[cname] = dtdf.groupby('Ticker')['P'].shift(1)
dtdf['DR'] = (dtdf['P'] / dtdf['Pm01']) - 1
cname = 'P'+'m'+str(N).zfill(2)
dtdf[cname] = dtdf.groupby('Ticker')['P'].shift(lagN)
cname = 'P'+'p'+str(N).zfill(2)
dtdf[cname] = dtdf.groupby('Ticker')['P'].shift(leadN)
cname = 'AV'
# Average dollar volume of last N days, ending (including) today
dtdf[cname] = dtdf.groupby('Ticker')['V'].rolling(lagN).mean().reset_index(0,drop=True)
dtdf['DRSD'] = dtdf.groupby('Ticker')['DR'].rolling(lagN).std().reset_index(0,drop=True)

In [22]:
pd.options.display.max_rows = 50    
display(dtdf)

Unnamed: 0,Ticker,Date,V,P,Pm01,DR,Pm20,Pp20,AV,DRSD
500,A,2002-01-02,63.1600,19.9600,,,,20.7100,,
501,A,2002-01-03,101.4000,21.2200,19.9600,0.0631,,20.2000,,
502,A,2002-01-04,167.7700,22.3600,21.2200,0.0537,,19.5000,,
503,A,2002-01-07,124.3700,22.2800,22.3600,-0.0036,,18.5300,,
504,A,2002-01-08,81.7200,22.3400,22.2800,0.0027,,18.4100,,
505,A,2002-01-09,67.6700,21.8100,22.3400,-0.0237,,18.0000,,
506,A,2002-01-10,39.2700,21.6400,21.8100,-0.0078,,17.7200,,
507,A,2002-01-11,43.9600,21.1800,21.6400,-0.0213,,18.2400,,
508,A,2002-01-14,67.8300,20.7500,21.1800,-0.0203,,17.8500,,
509,A,2002-01-15,56.6400,20.7700,20.7500,0.0010,,18.3200,,


In [23]:
# Compute the N-day return 
# and next period return that will determine Y

dtdf['R'] = (dtdf['P'] / dtdf['Pm20']) - 1
dtdf['YR'] = (dtdf['Pp20'] / dtdf['P']) - 1

dtdf = dtdf.dropna()
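
Written out, with $N = 20$, the two returns computed in this cell are:

$$
R_t = \frac{P_t}{P_{t-N}} - 1, \qquad YR_t = \frac{P_{t+N}}{P_t} - 1 = R_{t+N}
$$

So the target $YR_t$ is simply the next period's $R$. This can be seen in the monthly output for ticker A further down, where YR on 2002-01-31 (0.0869) equals R on 2002-03-01.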

In [24]:
# Get every 20th date. These will be the month identifiers
dates = dtdf['Date'].drop_duplicates()
dates = sorted(dates)
selected = dates[::20]
#display(dates)
display(selected)

['2002-01-31',
 '2002-03-01',
 '2002-04-01',
 '2002-04-29',
 '2002-05-28',
 '2002-06-25',
 '2002-07-24',
 '2002-08-21',
 '2002-09-19',
 '2002-10-17',
 '2002-11-14',
 '2002-12-13',
 '2003-01-14',
 '2003-02-12',
 '2003-03-13',
 '2003-04-10',
 '2003-05-09',
 '2003-06-09',
 '2003-07-08',
 '2003-08-05',
 '2003-09-03',
 '2003-10-01',
 '2003-10-29',
 '2003-11-26',
 '2003-12-26',
 '2004-01-27',
 '2004-02-25',
 '2004-03-24',
 '2004-04-22',
 '2004-05-20',
 '2004-06-21',
 '2004-07-20',
 '2004-08-17',
 '2004-09-15',
 '2004-10-13',
 '2004-11-10',
 '2004-12-09',
 '2005-01-07',
 '2005-02-07',
 '2005-03-08',
 '2005-04-06',
 '2005-05-04',
 '2005-06-02',
 '2005-06-30',
 '2005-07-29',
 '2005-08-26',
 '2005-09-26',
 '2005-10-24',
 '2005-11-21',
 '2005-12-20',
 '2006-01-20',
 '2006-02-17',
 '2006-03-20',
 '2006-04-18',
 '2006-05-16',
 '2006-06-14',
 '2006-07-13',
 '2006-08-10',
 '2006-09-08',
 '2006-10-06',
 '2006-11-03',
 '2006-12-04',
 '2007-01-04',
 '2007-02-02',
 '2007-03-05',
 '2007-04-02',
 '2007-05-

In [25]:
# This will get the monthly observations
dtdf = dtdf[dtdf['Date'].isin(selected)]
display(dtdf)

Unnamed: 0,Ticker,Date,V,P,Pm01,DR,Pm20,Pp20,AV,DRSD,R,YR
520,A,2002-01-31,59.8500,20.7100,20.0900,0.0309,19.9600,22.5100,66.3755,0.0305,0.0376,0.0869
540,A,2002-03-01,147.8800,22.5100,21.2500,0.0593,20.7100,24.9200,70.0370,0.0384,0.0869,0.1071
560,A,2002-04-01,94.3100,24.9200,23.8500,0.0449,22.5100,20.2300,105.6845,0.0307,0.1071,-0.1882
580,A,2002-04-29,85.2900,20.2300,20.4400,-0.0103,24.9200,18.5200,98.7330,0.0244,-0.1882,-0.0845
600,A,2002-05-28,36.0500,18.5200,18.6300,-0.0059,20.2300,16.0700,69.0350,0.0359,-0.0845,-0.1323
620,A,2002-06-25,46.2000,16.0700,16.4800,-0.0249,18.5200,12.9200,61.8465,0.0277,-0.1323,-0.1960
640,A,2002-07-24,85.6800,12.9200,13.4700,-0.0408,16.0700,11.6300,57.4480,0.0345,-0.1960,-0.0998
660,A,2002-08-21,44.7500,11.6300,11.0500,0.0525,12.9200,9.6000,43.8820,0.0515,-0.0998,-0.1745
680,A,2002-09-19,34.7600,9.6000,10.0600,-0.0457,11.6300,8.2600,47.3275,0.0347,-0.1745,-0.1396
700,A,2002-10-17,31.9400,8.2600,7.6800,0.0755,9.6000,9.5400,34.9660,0.0475,-0.1396,0.1550


In [26]:
# Rolling 12-month means and standard deviations for each ticker. 
# They will be used to standardize returns and volumes with respect to each ticker's own history
frame = 12
dtdf['RTM'] = dtdf.groupby('Ticker')['R'].rolling(frame).mean().reset_index(0,drop=True)
dtdf['RTSD'] = dtdf.groupby('Ticker')['R'].rolling(frame).std().reset_index(0,drop=True)
dtdf['AVTM'] = dtdf.groupby('Ticker')['AV'].rolling(frame).mean().reset_index(0,drop=True)
dtdf['AVTSD'] = dtdf.groupby('Ticker')['AV'].rolling(frame).std().reset_index(0,drop=True)
dtdf['SDTM'] = dtdf.groupby('Ticker')['DRSD'].rolling(frame).mean().reset_index(0,drop=True)
dtdf['SDTSD'] = dtdf.groupby('Ticker')['DRSD'].rolling(frame).std().reset_index(0,drop=True)

dtdf['RT'] = (dtdf['R'] - dtdf['RTM']) / dtdf['RTSD']
dtdf['AVT'] = (dtdf['AV'] - dtdf['AVTM']) / dtdf['AVTSD']
dtdf['SDT'] = (dtdf['DRSD'] - dtdf['SDTM']) / dtdf['SDTSD']

In [27]:
# Cross-sectional (per period) means and standard deviations. 
# They will be used to standardize returns and volumes cross-sectionally per period
dtdf['RCM'] = dtdf['R'].groupby(dtdf['Date']).transform('mean')
dtdf['RCSD'] = dtdf['R'].groupby(dtdf['Date']).transform('std')
dtdf['AVCM'] = dtdf['AV'].groupby(dtdf['Date']).transform('mean')
dtdf['AVCSD'] = dtdf['AV'].groupby(dtdf['Date']).transform('std')
dtdf['SDCM'] = dtdf['DRSD'].groupby(dtdf['Date']).transform('mean')
dtdf['SDCSD'] = dtdf['DRSD'].groupby(dtdf['Date']).transform('std')
dtdf['RC'] = (dtdf['R'] - dtdf['RCM']) / dtdf['RCSD']
dtdf['AVC'] = (dtdf['AV'] - dtdf['AVCM']) / dtdf['AVCSD']
dtdf['SDC'] = (dtdf['DRSD'] - dtdf['SDCM']) / dtdf['SDCSD']
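
A quick sanity check for cross-sectional scaling is that, within each date, the scaled values have mean 0 and standard deviation 1. The sketch below uses random numbers and a hypothetical frame `toy`. It mirrors the `transform` pattern above but is not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four stocks on each of two dates
rng = np.random.RandomState(1)
toy = pd.DataFrame({
    "Date": np.repeat(["2002-01-31", "2002-03-01"], 4),
    "R": rng.normal(0.01, 0.05, 8),
})

# Standardize each return against the other stocks on the same date
grouped = toy.groupby("Date")["R"]
toy["RC"] = (toy["R"] - grouped.transform("mean")) / grouped.transform("std")

# Per-date mean and standard deviation of the scaled values
check = toy.groupby("Date")["RC"].agg(["mean", "std"])
print(check.round(6))
```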

In [None]:
display(dtdf)

In [28]:
# After the time series and cross-sectional scaling, the mean and st dev variables are no longer needed
# Also past and future prices are no longer needed
dtdf.drop(['Pm01','Pm20', 'Pp20','DR', 'RTM', 'RTSD','SDTM','SDTSD',
           'AVTM', 'AVTSD', 'SDCM','SDCSD','RCM', 'RCSD', 'AVCM', 'AVCSD'], axis=1, inplace = True)
display(dtdf)

Unnamed: 0,Ticker,Date,V,P,AV,DRSD,R,YR,RT,AVT,SDT,RC,AVC,SDC
520,A,2002-01-31,59.8500,20.7100,66.3755,0.0305,0.0376,0.0869,,,,0.3305,-0.1465,0.7593
540,A,2002-03-01,147.8800,22.5100,70.0370,0.0384,0.0869,0.1071,,,,0.6892,-0.1453,0.9853
560,A,2002-04-01,94.3100,24.9200,105.6845,0.0307,0.1071,-0.1882,,,,0.6493,0.0472,0.7992
580,A,2002-04-29,85.2900,20.2300,98.7330,0.0244,-0.1882,-0.0845,,,,-1.4129,0.0204,0.2572
600,A,2002-05-28,36.0500,18.5200,69.0350,0.0359,-0.0845,-0.1323,,,,-1.1436,-0.1194,0.9097
620,A,2002-06-25,46.2000,16.0700,61.8465,0.0277,-0.1323,-0.1960,,,,-0.5221,-0.1666,0.3297
640,A,2002-07-24,85.6800,12.9200,57.4480,0.0345,-0.1960,-0.0998,,,,-0.5168,-0.2492,-0.1237
660,A,2002-08-21,44.7500,11.6300,43.8820,0.0515,-0.0998,-0.1745,,,,-1.3120,-0.2627,0.6217
680,A,2002-09-19,34.7600,9.6000,47.3275,0.0347,-0.1745,-0.1396,,,,-0.9627,-0.1818,0.6997
700,A,2002-10-17,31.9400,8.2600,34.9660,0.0475,-0.1396,0.1550,,,,-1.0662,-0.3216,0.4437


In [29]:
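pd.options.mode.chained_assignment = None  # default='warn'
# J lags of returns, average volumes, and daily-return volatility (raw and scaled)
J = 12
lag_vars = ['R','AV','DRSD','RT','AVT','SDT','RC','AVC','SDC']
for i in lag_vars:
    for j in range(1, J+1):
        cname = i+str(j).zfill(2)
        # Shift within each ticker so that lags never come from another ticker's rows
        dtdf[cname] = dtdf.groupby('Ticker')[i].shift(j)
display(dtdf)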

Unnamed: 0,Ticker,Date,V,P,AV,DRSD,R,YR,RT,AVT,SDT,RC,AVC,SDC,R01,R02,R03,R04,R05,R06,R07,R08,R09,R10,R11,R12,AV01,AV02,AV03,AV04,AV05,AV06,AV07,AV08,AV09,AV10,AV11,AV12,DRSD01,DRSD02,DRSD03,DRSD04,DRSD05,DRSD06,DRSD07,DRSD08,DRSD09,DRSD10,DRSD11,DRSD12,RT01,RT02,RT03,RT04,RT05,RT06,RT07,RT08,RT09,RT10,RT11,RT12,AVT01,AVT02,AVT03,AVT04,AVT05,AVT06,AVT07,AVT08,AVT09,AVT10,AVT11,AVT12,SDT01,SDT02,SDT03,SDT04,SDT05,SDT06,SDT07,SDT08,SDT09,SDT10,SDT11,SDT12,RC01,RC02,RC03,RC04,RC05,RC06,RC07,RC08,RC09,RC10,RC11,RC12,AVC01,AVC02,AVC03,AVC04,AVC05,AVC06,AVC07,AVC08,AVC09,AVC10,AVC11,AVC12,SDC01,SDC02,SDC03,SDC04,SDC05,SDC06,SDC07,SDC08,SDC09,SDC10,SDC11,SDC12
520,A,2002-01-31,59.8500,20.7100,66.3755,0.0305,0.0376,0.0869,,,,0.3305,-0.1465,0.7593,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
540,A,2002-03-01,147.8800,22.5100,70.0370,0.0384,0.0869,0.1071,,,,0.6892,-0.1453,0.9853,0.0376,,,,,,,,,,,,66.3755,,,,,,,,,,,,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.3305,,,,,,,,,,,,-0.1465,,,,,,,,,,,,0.7593,,,,,,,,,,,
560,A,2002-04-01,94.3100,24.9200,105.6845,0.0307,0.1071,-0.1882,,,,0.6493,0.0472,0.7992,0.0869,0.0376,,,,,,,,,,,70.0370,66.3755,,,,,,,,,,,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.6892,0.3305,,,,,,,,,,,-0.1453,-0.1465,,,,,,,,,,,0.9853,0.7593,,,,,,,,,,
580,A,2002-04-29,85.2900,20.2300,98.7330,0.0244,-0.1882,-0.0845,,,,-1.4129,0.0204,0.2572,0.1071,0.0869,0.0376,,,,,,,,,,105.6845,70.0370,66.3755,,,,,,,,,,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.6493,0.6892,0.3305,,,,,,,,,,0.0472,-0.1453,-0.1465,,,,,,,,,,0.7992,0.9853,0.7593,,,,,,,,,
600,A,2002-05-28,36.0500,18.5200,69.0350,0.0359,-0.0845,-0.1323,,,,-1.1436,-0.1194,0.9097,-0.1882,0.1071,0.0869,0.0376,,,,,,,,,98.7330,105.6845,70.0370,66.3755,,,,,,,,,0.0244,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.4129,0.6493,0.6892,0.3305,,,,,,,,,0.0204,0.0472,-0.1453,-0.1465,,,,,,,,,0.2572,0.7992,0.9853,0.7593,,,,,,,,
620,A,2002-06-25,46.2000,16.0700,61.8465,0.0277,-0.1323,-0.1960,,,,-0.5221,-0.1666,0.3297,-0.0845,-0.1882,0.1071,0.0869,0.0376,,,,,,,,69.0350,98.7330,105.6845,70.0370,66.3755,,,,,,,,0.0359,0.0244,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.1436,-1.4129,0.6493,0.6892,0.3305,,,,,,,,-0.1194,0.0204,0.0472,-0.1453,-0.1465,,,,,,,,0.9097,0.2572,0.7992,0.9853,0.7593,,,,,,,
640,A,2002-07-24,85.6800,12.9200,57.4480,0.0345,-0.1960,-0.0998,,,,-0.5168,-0.2492,-0.1237,-0.1323,-0.0845,-0.1882,0.1071,0.0869,0.0376,,,,,,,61.8465,69.0350,98.7330,105.6845,70.0370,66.3755,,,,,,,0.0277,0.0359,0.0244,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.5221,-1.1436,-1.4129,0.6493,0.6892,0.3305,,,,,,,-0.1666,-0.1194,0.0204,0.0472,-0.1453,-0.1465,,,,,,,0.3297,0.9097,0.2572,0.7992,0.9853,0.7593,,,,,,
660,A,2002-08-21,44.7500,11.6300,43.8820,0.0515,-0.0998,-0.1745,,,,-1.3120,-0.2627,0.6217,-0.1960,-0.1323,-0.0845,-0.1882,0.1071,0.0869,0.0376,,,,,,57.4480,61.8465,69.0350,98.7330,105.6845,70.0370,66.3755,,,,,,0.0345,0.0277,0.0359,0.0244,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.5168,-0.5221,-1.1436,-1.4129,0.6493,0.6892,0.3305,,,,,,-0.2492,-0.1666,-0.1194,0.0204,0.0472,-0.1453,-0.1465,,,,,,-0.1237,0.3297,0.9097,0.2572,0.7992,0.9853,0.7593,,,,,
680,A,2002-09-19,34.7600,9.6000,47.3275,0.0347,-0.1745,-0.1396,,,,-0.9627,-0.1818,0.6997,-0.0998,-0.1960,-0.1323,-0.0845,-0.1882,0.1071,0.0869,0.0376,,,,,43.8820,57.4480,61.8465,69.0350,98.7330,105.6845,70.0370,66.3755,,,,,0.0515,0.0345,0.0277,0.0359,0.0244,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.3120,-0.5168,-0.5221,-1.1436,-1.4129,0.6493,0.6892,0.3305,,,,,-0.2627,-0.2492,-0.1666,-0.1194,0.0204,0.0472,-0.1453,-0.1465,,,,,0.6217,-0.1237,0.3297,0.9097,0.2572,0.7992,0.9853,0.7593,,,,
700,A,2002-10-17,31.9400,8.2600,34.9660,0.0475,-0.1396,0.1550,,,,-1.0662,-0.3216,0.4437,-0.1745,-0.0998,-0.1960,-0.1323,-0.0845,-0.1882,0.1071,0.0869,0.0376,,,,47.3275,43.8820,57.4480,61.8465,69.0350,98.7330,105.6845,70.0370,66.3755,,,,0.0347,0.0515,0.0345,0.0277,0.0359,0.0244,0.0307,0.0384,0.0305,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.9627,-1.3120,-0.5168,-0.5221,-1.1436,-1.4129,0.6493,0.6892,0.3305,,,,-0.1818,-0.2627,-0.2492,-0.1666,-0.1194,0.0204,0.0472,-0.1453,-0.1465,,,,0.6997,0.6217,-0.1237,0.3297,0.9097,0.2572,0.7992,0.9853,0.7593,,,
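
Lagged values should not cross from one ticker into the next. The toy example below contrasts a plain `Series.shift` with a group-wise shift. The frame and the column names `R01_plain` and `R01_group` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "B", "B", "B"],
    "R": [0.01, 0.02, 0.03, 0.10, 0.20, 0.30],
})

# Plain shift ignores ticker boundaries
toy["R01_plain"] = toy["R"].shift(1)
# Group-wise shift restarts for each ticker
toy["R01_group"] = toy.groupby("Ticker")["R"].shift(1)
print(toy)
```

In the first row of ticker B, the plain shift picks up ticker A's last return, while the group-wise shift leaves the cell missing. In this dataset the first rows of every ticker appear to be removed by the `dropna()` that follows in any case, because the time-series scaled columns need 12 periods of history, but the group-wise form does not depend on that.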


In [30]:
dtdf = dtdf.dropna()
dtdf.shape

(72802, 122)

In [31]:
print(list(dtdf))

['Ticker', 'Date', 'V', 'P', 'AV', 'DRSD', 'R', 'YR', 'RT', 'AVT', 'SDT', 'RC', 'AVC', 'SDC', 'R01', 'R02', 'R03', 'R04', 'R05', 'R06', 'R07', 'R08', 'R09', 'R10', 'R11', 'R12', 'AV01', 'AV02', 'AV03', 'AV04', 'AV05', 'AV06', 'AV07', 'AV08', 'AV09', 'AV10', 'AV11', 'AV12', 'DRSD01', 'DRSD02', 'DRSD03', 'DRSD04', 'DRSD05', 'DRSD06', 'DRSD07', 'DRSD08', 'DRSD09', 'DRSD10', 'DRSD11', 'DRSD12', 'RT01', 'RT02', 'RT03', 'RT04', 'RT05', 'RT06', 'RT07', 'RT08', 'RT09', 'RT10', 'RT11', 'RT12', 'AVT01', 'AVT02', 'AVT03', 'AVT04', 'AVT05', 'AVT06', 'AVT07', 'AVT08', 'AVT09', 'AVT10', 'AVT11', 'AVT12', 'SDT01', 'SDT02', 'SDT03', 'SDT04', 'SDT05', 'SDT06', 'SDT07', 'SDT08', 'SDT09', 'SDT10', 'SDT11', 'SDT12', 'RC01', 'RC02', 'RC03', 'RC04', 'RC05', 'RC06', 'RC07', 'RC08', 'RC09', 'RC10', 'RC11', 'RC12', 'AVC01', 'AVC02', 'AVC03', 'AVC04', 'AVC05', 'AVC06', 'AVC07', 'AVC08', 'AVC09', 'AVC10', 'AVC11', 'AVC12', 'SDC01', 'SDC02', 'SDC03', 'SDC04', 'SDC05', 'SDC06', 'SDC07', 'SDC08', 'SDC09', 'SDC10', 

In [32]:
# Portions of dataset
# This can be used to leave out some features if needed
id_cols = ['Ticker','Date']
other_cols= ['V','P','AV','R','DRSD','YR']
raw_ret = ['R01', 'R02', 'R03', 'R04', 'R05', 'R06', 'R07', 'R08', 'R09', 'R10', 'R11', 'R12']
raw_vol = ['AV01', 'AV02', 'AV03', 'AV04', 'AV05', 'AV06', 'AV07', 'AV08', 'AV09', 'AV10', 'AV11', 'AV12']
raw_drsd = ['DRSD01', 'DRSD02', 'DRSD03', 'DRSD04', 'DRSD05', 'DRSD06', 'DRSD07', 'DRSD08', 'DRSD09', 'DRSD10', 'DRSD11', 'DRSD12']
cs_ret = [col for col in dtdf if col.startswith('RC')]
cs_vol = [col for col in dtdf if col.startswith('AVC')]
cs_drsd = [col for col in dtdf if col.startswith('SDC')]
ts_ret = [col for col in dtdf if col.startswith('RT')]
ts_vol = [col for col in dtdf if col.startswith('AVT')]
ts_drsd = [col for col in dtdf if col.startswith('SDT')]
print(id_cols, other_cols, raw_ret, raw_vol, raw_drsd, cs_ret, cs_vol, cs_drsd, ts_ret, ts_vol, ts_drsd)

['Ticker', 'Date'] ['V', 'P', 'AV', 'R', 'DRSD', 'YR'] ['R01', 'R02', 'R03', 'R04', 'R05', 'R06', 'R07', 'R08', 'R09', 'R10', 'R11', 'R12'] ['AV01', 'AV02', 'AV03', 'AV04', 'AV05', 'AV06', 'AV07', 'AV08', 'AV09', 'AV10', 'AV11', 'AV12'] ['DRSD01', 'DRSD02', 'DRSD03', 'DRSD04', 'DRSD05', 'DRSD06', 'DRSD07', 'DRSD08', 'DRSD09', 'DRSD10', 'DRSD11', 'DRSD12'] ['RC', 'RC01', 'RC02', 'RC03', 'RC04', 'RC05', 'RC06', 'RC07', 'RC08', 'RC09', 'RC10', 'RC11', 'RC12'] ['AVC', 'AVC01', 'AVC02', 'AVC03', 'AVC04', 'AVC05', 'AVC06', 'AVC07', 'AVC08', 'AVC09', 'AVC10', 'AVC11', 'AVC12'] ['SDC', 'SDC01', 'SDC02', 'SDC03', 'SDC04', 'SDC05', 'SDC06', 'SDC07', 'SDC08', 'SDC09', 'SDC10', 'SDC11', 'SDC12'] ['RT', 'RT01', 'RT02', 'RT03', 'RT04', 'RT05', 'RT06', 'RT07', 'RT08', 'RT09', 'RT10', 'RT11', 'RT12'] ['AVT', 'AVT01', 'AVT02', 'AVT03', 'AVT04', 'AVT05', 'AVT06', 'AVT07', 'AVT08', 'AVT09', 'AVT10', 'AVT11', 'AVT12'] ['SDT', 'SDT01', 'SDT02', 'SDT03', 'SDT04', 'SDT05', 'SDT06', 'SDT07', 'SDT08', 'SDT09',

In [33]:
df_long = dtdf[id_cols + other_cols + raw_ret + raw_vol + raw_drsd + cs_ret + cs_vol + cs_drsd + ts_ret + ts_vol + ts_drsd]

display(df_long)

Unnamed: 0,Ticker,Date,V,P,AV,R,DRSD,YR,R01,R02,R03,R04,R05,R06,R07,R08,R09,R10,R11,R12,AV01,AV02,AV03,AV04,AV05,AV06,AV07,AV08,AV09,AV10,AV11,AV12,DRSD01,DRSD02,DRSD03,DRSD04,DRSD05,DRSD06,DRSD07,DRSD08,DRSD09,DRSD10,DRSD11,DRSD12,RC,RC01,RC02,RC03,RC04,RC05,RC06,RC07,RC08,RC09,RC10,RC11,RC12,AVC,AVC01,AVC02,AVC03,AVC04,AVC05,AVC06,AVC07,AVC08,AVC09,AVC10,AVC11,AVC12,SDC,SDC01,SDC02,SDC03,SDC04,SDC05,SDC06,SDC07,SDC08,SDC09,SDC10,SDC11,SDC12,RT,RT01,RT02,RT03,RT04,RT05,RT06,RT07,RT08,RT09,RT10,RT11,RT12,AVT,AVT01,AVT02,AVT03,AVT04,AVT05,AVT06,AVT07,AVT08,AVT09,AVT10,AVT11,AVT12,SDT,SDT01,SDT02,SDT03,SDT04,SDT05,SDT06,SDT07,SDT08,SDT09,SDT10,SDT11,SDT12
980,A,2003-11-26,41.8300,19.1200,76.6760,0.1630,0.0178,-0.0026,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,0.1592,0.0810,0.0483,-0.3756,0.1577,0.2296,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,30.0275,26.7770,29.2850,49.8305,39.6860,56.3460,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,0.0169,0.0308,0.0268,0.0606,0.0314,0.0609,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,0.4699,0.2130,0.3951,-2.9512,1.4034,1.4170,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.3805,-0.3799,-0.3433,-0.2001,-0.2535,-0.1893,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-0.3047,0.5283,0.4193,3.4683,0.8623,2.0291,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.9778,0.7010,0.6447,-1.7186,1.1496,1.7708,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.0043,-1.3657,-1.1513,-0.3456,-0.8459,-0.2420,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112,-1.5806,-0.7915,-1.0288,1.6532,-0.6505,2.1364
1000,A,2003-12-26,13.2200,19.0700,61.1830,-0.0026,0.0166,0.2916,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,0.1592,0.0810,0.0483,-0.3756,0.1577,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,30.0275,26.7770,29.2850,49.8305,39.6860,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,0.0169,0.0308,0.0268,0.0606,0.0314,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,0.4699,0.2130,0.3951,-2.9512,1.4034,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.3805,-0.3799,-0.3433,-0.2001,-0.2535,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-0.3047,0.5283,0.4193,3.4683,0.8623,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.9778,0.7010,0.6447,-1.7186,1.1496,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.0043,-1.3657,-1.1513,-0.3456,-0.8459,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112,-1.5806,-0.7915,-1.0288,1.6532,-0.6505
1020,A,2004-01-27,790.5600,24.6300,138.7625,0.2916,0.0226,-0.0597,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,0.1592,0.0810,0.0483,-0.3756,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,30.0275,26.7770,29.2850,49.8305,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,0.0169,0.0308,0.0268,0.0606,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,0.4699,0.2130,0.3951,-2.9512,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.3805,-0.3799,-0.3433,-0.2001,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-0.3047,0.5283,0.4193,3.4683,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.9778,0.7010,0.6447,-1.7186,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.0043,-1.3657,-1.1513,-0.3456,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112,-1.5806,-0.7915,-1.0288,1.6532
1040,A,2004-02-25,88.0800,23.1600,129.9435,-0.0597,0.0214,-0.1377,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,0.1592,0.0810,0.0483,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,30.0275,26.7770,29.2850,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,0.0169,0.0308,0.0268,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,0.4699,0.2130,0.3951,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.3805,-0.3799,-0.3433,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-0.3047,0.5283,0.4193,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.9778,0.7010,0.6447,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.0043,-1.3657,-1.1513,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112,-1.5806,-0.7915,-1.0288
1060,A,2004-03-24,113.9700,19.9700,107.8090,-0.1377,0.0276,0.0035,-0.0597,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,0.1592,0.0810,129.9435,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,30.0275,26.7770,0.0214,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,0.0169,0.0308,-1.9335,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,0.4699,0.2130,0.0190,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.3805,-0.3799,1.2698,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-0.3047,0.5283,-1.5687,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.9778,0.7010,1.0583,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.0043,-1.3657,1.5750,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112,-1.5806,-0.7915
1080,A,2004-04-22,78.6300,20.0400,77.3330,0.0035,0.0212,-0.1727,-0.1377,-0.0597,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,0.1592,107.8090,129.9435,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,30.0275,0.0276,0.0214,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,0.0169,-0.6473,-1.9335,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,0.4699,-0.1382,0.0190,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.3805,0.5909,1.2698,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-0.3047,-0.4129,-1.5687,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.9778,0.1269,1.0583,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.0043,-0.1445,1.5750,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112,-1.5806
1100,A,2004-05-20,96.8400,16.5800,92.1400,-0.1727,0.0155,0.0525,0.0035,-0.1377,-0.0597,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,0.1508,77.3330,107.8090,129.9435,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,44.7175,0.0212,0.0276,0.0214,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,0.0275,-1.7944,-0.6473,-1.9335,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,0.5442,-0.0632,-0.1382,0.0190,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,-0.3006,-0.1287,0.5909,1.2698,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.6648,-1.4241,-0.4129,-1.5687,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,0.7946,0.4515,0.1269,1.0583,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,0.3571,-1.4355,-0.1445,1.5750,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842,-0.8112
1120,A,2004-06-21,41.2900,17.4500,75.9650,0.0525,0.0163,0.0080,-0.1727,0.0035,-0.1377,-0.0597,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,0.1537,92.1400,77.3330,107.8090,129.9435,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,43.3825,0.0155,0.0212,0.0276,0.0214,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.0180,-0.0080,-1.7944,-0.6473,-1.9335,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,1.4818,-0.0846,-0.0632,-0.1382,0.0190,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,-0.2898,0.2718,-0.1287,0.5909,1.2698,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.0362,0.2201,-1.4241,-0.4129,-1.5687,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,0.6779,-0.1137,0.4515,0.1269,1.0583,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,0.3887,-1.1161,-1.4355,-0.1445,1.5750,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570,-1.2842
1140,A,2004-07-20,78.6600,17.5900,78.5570,0.0080,0.0253,-0.1779,0.0525,-0.1727,0.0035,-0.1377,-0.0597,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,-0.0392,75.9650,92.1400,77.3330,107.8090,129.9435,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,47.4470,0.0163,0.0155,0.0212,0.0276,0.0214,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,0.0221,0.1982,-0.0080,-1.7944,-0.6473,-1.9335,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.1506,-0.1060,-0.0846,-0.0632,-0.1382,0.0190,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,-0.2717,1.3935,0.2718,-0.1287,0.5909,1.2698,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,0.2040,-0.1325,0.2201,-1.4241,-0.4129,-1.5687,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.4034,-0.1197,-0.1137,0.4515,0.1269,1.0583,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,0.7739,1.1191,-1.1161,-1.4355,-0.1445,1.5750,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906,-0.8570
1160,A,2004-08-17,69.5400,14.4600,77.4715,-0.1779,0.0353,0.0560,0.0080,0.0525,-0.1727,0.0035,-0.1377,-0.0597,0.2916,-0.0026,0.1630,0.0830,-0.0916,0.1759,78.5570,75.9650,92.1400,77.3330,107.8090,129.9435,138.7625,61.1830,76.6760,38.4710,57.9205,51.6960,0.0253,0.0163,0.0155,0.0212,0.0276,0.0214,0.0226,0.0166,0.0178,0.0182,0.0257,0.0216,-2.3988,0.1982,-0.0080,-1.7944,-0.6473,-1.9335,-0.8789,2.8893,-0.5247,1.5208,0.3687,-1.2496,1.0478,-0.0998,-0.1060,-0.0846,-0.0632,-0.1382,0.0190,0.1337,0.1924,-0.1632,-0.0965,-0.3192,-0.2085,-0.1926,1.4614,1.3935,0.2718,-0.1287,0.5909,1.2698,0.6231,0.7865,0.2908,0.1530,0.2441,0.9537,0.6928,-1.2528,-0.1325,0.2201,-1.4241,-0.4129,-1.5687,-1.2857,1.8161,-0.2834,0.6691,0.1323,-0.9448,0.6541,-0.2373,-0.1197,-0.1137,0.4515,0.1269,1.0583,1.8632,2.7908,1.0123,2.3063,-0.4309,1.4226,1.1347,2.3009,1.1191,-1.1161,-1.4355,-0.1445,1.5750,-0.0431,0.1245,-0.7110,-0.7257,-0.7805,-0.4249,-0.7906


In [34]:
# Summarize the final "long" data set: number of observations and first/last date per ticker
df_summary = df_long.groupby('Ticker').apply(f)
df_summary = df_summary.sort_values(['date_max', 'date_count', 'date_min'], ascending=[True, False, True])
display(df_summary)

Ticker,date_count,date_min,date_max
A,178,2003-11-26,2017-12-19
AAP,178,2003-11-26,2017-12-19
AAPL,178,2003-11-26,2017-12-19
ABC,178,2003-11-26,2017-12-19
ABT,178,2003-11-26,2017-12-19
ACN,178,2003-11-26,2017-12-19
ADBE,178,2003-11-26,2017-12-19
ADI,178,2003-11-26,2017-12-19
ADM,178,2003-11-26,2017-12-19
ADP,178,2003-11-26,2017-12-19
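
The helper `f` passed to `apply` is defined earlier in the notebook and is not reproduced here. As a rough sketch only, a per-ticker summary with the same three columns can also be built with pandas named aggregation. The date column name `Date`, the toy frame `toy`, and the function `summarize_dates` are assumptions made for illustration, not names taken from the notebook.

```python
import pandas as pd

# Toy stand-in for df_long; the column name 'Date' is an assumption.
toy = pd.DataFrame({
    'Ticker': ['A', 'A', 'A', 'AAP', 'AAP'],
    'Date': pd.to_datetime(['2003-11-26', '2003-12-26', '2004-01-27',
                            '2003-12-26', '2004-01-27']),
})


def summarize_dates(df, date_col='Date'):
    """Return per-ticker observation count plus first and last date."""
    summary = df.groupby('Ticker')[date_col].agg(
        date_count='count', date_min='min', date_max='max')
    # Same ordering as the cell above: tickers whose data end earliest
    # come first, then the longest histories, then the earliest starts.
    return summary.sort_values(['date_max', 'date_count', 'date_min'],
                               ascending=[True, False, True])


toy_summary = summarize_dates(toy)
print(toy_summary)
```

Sorting by `date_max` ascending puts tickers whose history ends early at the top of the table, which makes truncated series easy to spot.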


In [35]:
# This is the dataset that will be used for neural network training and testing
df_long.to_csv('SP500_Long_V4.CSV', index=False, header=True, float_format='%.4f')
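
Because `float_format='%.4f'` writes every float with four decimal places, values read back from the file are rounded copies of the originals, and dates come back as strings unless they are parsed explicitly. The sketch below illustrates this round trip on a small made-up frame using an in-memory buffer instead of a file. The column names `Date` and `Ret` and the values are hypothetical.

```python
import io

import pandas as pd

# Small made-up frame; column names and values are for illustration only.
toy = pd.DataFrame({
    'Ticker': ['A', 'A'],
    'Date': pd.to_datetime(['2003-12-26', '2004-01-27']),
    'Ret': [-0.00262345, 0.29161234],
})

# Write with the same options as the cell above, but to an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=True, float_format='%.4f')

# Read back, asking pandas to parse the date column into datetimes.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['Date'])

print(restored.dtypes)
print(restored)
```

If the extra precision matters downstream, a wider format such as `'%.6f'` could be used at the cost of a larger file.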