# ETL For Top Performers

In this notebook I will download and transform datasets for the top mean/std performers identified previously. Prepared datasets should include:

- Daily diffed % returns for the last year of data
- Additional summary statistics
- Distribution of returns
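
As a rough sketch of the last two items, the snippet below computes a few summary statistics and a binned distribution from a handful of daily returns. The four values are copied from the EXPI output later in this notebook. The choice of statistics and the number of bins are my assumptions, not a fixed specification.

```python
import numpy as np
import pandas as pd

# Four daily returns copied from the EXPI output later in this notebook,
# used only as a tiny stand-in for a full year of data.
returns = pd.Series([-0.018770, -0.018066, -0.047619, -0.036364], name="pct_change")
pct_returns = returns * 100  # express as percent

# Summary statistics (the selection here is an assumption)
stats = pct_returns.agg(["mean", "std", "min", "max"])
stats["skew"] = pct_returns.skew()

# Empirical distribution: counts per equal-width bin
counts, bin_edges = np.histogram(pct_returns, bins=4)

print(stats)
print(counts, bin_edges)
```

With a full year of data, the same calls apply unchanged; only the bin count would likely need to grow.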

For the sine wave analysis, I only need the return and volume columns. I intend to run the analysis on both the price data itself (actual close data, potentially scaled) and the diffed return data.

After fitting a wave to the price data, I can compute its derivative (the cosine of the fitted waveform) and find the zeros, which should let me predict changes in trend. From the statistics on the dataset, I may also be able to predict how long trends will last. Some error is expected, and in the future I should work out how to quantify it.
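
To make the zero-finding idea concrete, here is a minimal sketch. It assumes the wave has already been fitted as `sin(omega*t + phi)`; the 40-day period and zero phase are made-up values for illustration, and `turning_points` is a hypothetical helper, not part of this project's modules.

```python
import math

def turning_points(omega, phi, t_start, t_end):
    """Times in [t_start, t_end] where the derivative of sin(omega*t + phi) is zero.

    The derivative is omega*cos(omega*t + phi), and cos(x) = 0 at x = pi/2 + k*pi.
    Assumes omega > 0.
    """
    k_min = math.ceil((omega * t_start + phi - math.pi / 2) / math.pi)
    k_max = math.floor((omega * t_end + phi - math.pi / 2) / math.pi)
    return [(math.pi / 2 + k * math.pi - phi) / omega for k in range(k_min, k_max + 1)]

# Hypothetical fitted wave: 40-trading-day period, zero phase
omega = 2 * math.pi / 40
phi = 0.0
points = turning_points(omega, phi, 0, 100)

# The sign of the wave at each zero of the derivative tells peak from trough
kinds = ["peak" if math.sin(omega * t + phi) > 0 else "trough" for t in points]
print(list(zip(points, kinds)))
```

The spacing between consecutive turning points is half the period, which is one way the fitted wave could suggest how long a trend might last.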

By identifying the characteristics of the diffed wave, I should be able to time daily entries and exits on the selected stocks.

#### The cumulative goal becomes:

1) Identify the best stocks (accomplished previously)  
2) Identify macro trends in the data (derivative of the wave on the data itself)  
3) Identify entries and exits in the short run (derivative of the diffed wave, akin to the second derivative)  
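
On sampled data, goals 2 and 3 could be approximated with finite differences instead of analytic derivatives. The sketch below is one way to do that on a synthetic wave. The wave parameters and the `sign_changes` helper are assumptions for illustration only.

```python
import numpy as np

def sign_changes(series):
    """Indices i where series[i-1] and series[i] have opposite signs."""
    signs = np.sign(series)
    return np.where(signs[1:] * signs[:-1] < 0)[0] + 1

t = np.arange(100)
# Synthetic stand-in for a fitted price wave; the 0.3 phase is arbitrary and
# keeps the zero crossings from landing exactly on a sample.
wave = np.sin(2 * np.pi * t / 40 + 0.3)

first_diff = np.diff(wave)         # discrete analogue of the first derivative
second_diff = np.diff(first_diff)  # discrete analogue of the second derivative

trend_changes = sign_changes(first_diff)       # candidate macro trend reversals
short_run_signals = sign_changes(second_diff)  # candidate short-run entry/exit points

print(trend_changes, short_run_signals)
```

The returned indices refer to positions in the differenced arrays, so they are offset by one or two samples from the original series.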

#### Steps to accomplish

1) Download 1 Year data for previously identified stocks  
2) Format any necessary statistics  
3) Save the datasets  
4) Create a meta dataset with pertinent information for all stocks under consideration  
5) Write a learning algorithm to select stocks for investment based on the above criteria.  

This notebook covers steps 1-3 above. Steps 4-5 will be covered in another notebook.



#### 1) Download 1 Year data for previously identified stocks

In [12]:
from extract import get_history
from transform import format_date, base_returns
from time import sleep
from IPython import display

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

In [2]:
key_df = pd.read_csv('./data/screens/volume600k/highmeanreturn_lowstd_top_performers_2_months.csv')
key_df.head()

,Unnamed: 0,mean,std,alpha
0,EXPI,2.768366,5.129503,0.539695
1,OSTK,4.301383,8.244235,0.521744
2,GNRC,1.213634,2.374996,0.511005
3,EVH,2.035792,4.379593,0.464836
4,GSS,1.760608,3.880173,0.453745


In [3]:
key_df = key_df.rename(columns={'Unnamed: 0': 'SYM'})
key_df.head()

,SYM,mean,std,alpha
0,EXPI,2.768366,5.129503,0.539695
1,OSTK,4.301383,8.244235,0.521744
2,GNRC,1.213634,2.374996,0.511005
3,EVH,2.035792,4.379593,0.464836
4,GSS,1.760608,3.880173,0.453745
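
Judging by the rows shown, the `alpha` column looks like the ratio of `mean` to `std`. That is my inference from the numbers rather than something the screen file states, so the quick check below simply compares the two on the five rows above.

```python
import pandas as pd

# Rows copied from the key_df.head() output above
sample = pd.DataFrame({
    "SYM": ["EXPI", "OSTK", "GNRC", "EVH", "GSS"],
    "mean": [2.768366, 4.301383, 1.213634, 2.035792, 1.760608],
    "std": [5.129503, 8.244235, 2.374996, 4.379593, 3.880173],
    "alpha": [0.539695, 0.521744, 0.511005, 0.464836, 0.453745],
})

# Assumed relationship: alpha = mean / std
ratio = sample["mean"] / sample["std"]
max_gap = (ratio - sample["alpha"]).abs().max()
print(max_gap)
```

A gap at the level of rounding error on every row would support reading `alpha` as a mean-to-std ratio for this screen.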


In [6]:
key_df.shape

(136, 4)

In [5]:
stocks = ['EXPI']  # Smoke test on a single symbol to confirm the pipeline works

for stock in stocks:
    stock_data = get_history(stock)
    stock_df = format_date(stock_data)
    stock_df['SYMBOL'] = stock
    stock_df = base_returns(stock_df)
    
stock_df

,open,high,low,close,volume,date,SYMBOL,prev_close,diff_1,pct_change,log_return
0,9.49,9.8000,9.32,9.59,125705,2019-08-19,EXPI,,,,
1,9.54,9.7500,9.38,9.41,98463,2019-08-20,EXPI,9.59,-0.18,-0.018770,-0.018948
2,9.48,9.4800,9.10,9.24,98149,2019-08-21,EXPI,9.41,-0.17,-0.018066,-0.018231
3,9.26,9.3500,8.75,8.80,209154,2019-08-22,EXPI,9.24,-0.44,-0.047619,-0.048790
4,8.77,8.8500,8.34,8.48,281435,2019-08-23,EXPI,8.80,-0.32,-0.036364,-0.037041
...,...,...,...,...,...,...,...,...,...,...,...
248,30.98,32.9199,30.34,30.91,1503190,2020-08-12,EXPI,29.99,0.92,0.030677,0.030216
249,31.54,35.3300,31.20,33.59,1868868,2020-08-13,EXPI,30.91,2.68,0.086703,0.083149
250,33.75,36.2000,33.08,34.29,2064018,2020-08-14,EXPI,33.59,0.70,0.020840,0.020625
251,34.80,36.3700,32.51,33.54,2206337,2020-08-17,EXPI,34.29,-0.75,-0.021872,-0.022115
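
The derived columns in this output are consistent with simple shift-and-divide arithmetic on `close`. The function below is a hypothetical stand-in that reproduces those four columns. The real `base_returns` in transform.py may be implemented differently.

```python
import numpy as np
import pandas as pd

def base_returns_sketch(df):
    """Hypothetical stand-in for transform.base_returns; the real helper may differ."""
    out = df.copy()
    out["prev_close"] = out["close"].shift(1)               # previous day's close
    out["diff_1"] = out["close"] - out["prev_close"]        # absolute change
    out["pct_change"] = out["diff_1"] / out["prev_close"]   # simple return
    out["log_return"] = np.log(out["close"] / out["prev_close"])
    return out

# First five EXPI closes from the output above
demo = pd.DataFrame({"close": [9.59, 9.41, 9.24, 8.80, 8.48]})
result = base_returns_sketch(demo)
print(result)
```

The first row has no previous close, so its derived columns are NaN, matching the empty cells in the first row of the output.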


In [9]:
def extract_stock(stock, periodType='year', frequencyType='daily', frequency='1', periods=1):
    """Download price history for one symbol, add return columns, and pickle the result."""
    stock_data = get_history(stock,
                             periodType=periodType,
                             frequencyType=frequencyType,
                             frequency=frequency,
                             periods=periods)

    stock_df = format_date(stock_data)
    stock_df['SYMBOL'] = stock
    stock_df = base_returns(stock_df)
    # Reverse the row order (newest first, given the oldest-first output above)
    stock_df = stock_df.iloc[::-1]
    # Save as pickle; the file name carries a fixed 081920 suffix
    stock_df.to_pickle('./data/screens/volume600k/high_alpha_12_months/{}081920.pickle'.format(stock))
    # return stock_df

In [13]:
for i, sym in enumerate(key_df['SYM']):
    display.clear_output()
    extract_stock(sym)
    print('{:.2f}%'.format(i/key_df['SYM'].shape[0]*100))
    sleep(.5)

99.26%


One year of stock data for the top performers has been downloaded and saved to './data/screens/volume600k/high_alpha_12_months/'.

Furthermore, `extract_stock` has been added to extract.py for future use as a standalone extractor function, with the ability to save and optionally return the stock DataFrame.
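
For illustration, a save-and-optionally-return variant might look like the sketch below. The version in extract.py is not shown here, so the parameter names `fetch`, `out_dir`, and `return_df` are hypothetical, and a fake fetcher stands in for `get_history` to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

def extract_stock_sketch(stock, fetch, out_dir=None, return_df=False):
    """Hypothetical variant: fetch, label, optionally pickle, optionally return."""
    stock_df = fetch(stock).copy()
    stock_df["SYMBOL"] = stock
    if out_dir is not None:
        # Save only when an output directory is given
        stock_df.to_pickle(Path(out_dir) / "{}.pickle".format(stock))
    return stock_df if return_df else None

def fake_fetch(symbol):
    """Stand-in for get_history so the example runs without network access."""
    return pd.DataFrame({"close": [9.59, 9.41]})

with tempfile.TemporaryDirectory() as tmp:
    df = extract_stock_sketch("EXPI", fake_fetch, out_dir=tmp, return_df=True)
    saved = pd.read_pickle(Path(tmp) / "EXPI.pickle")

print(df)
```

Passing the fetcher in as an argument is a design choice made for this sketch; it lets the function be exercised offline before pointing it at the real data source.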