In [1]:
import dill
import numpy as np
import pandas as pd

# Normalizing stock prices

Checking data slices for annotation can be done with raw stock prices, but a model will probably benefit from normalized data.

Let's start by loading the dataframes containing all our stock prices.

In [2]:
with open('Prices_and_FDA_Dates.pkl', 'rb') as f:  # pickles are binary, so read in 'rb' mode
    company_dataframes = dill.load(f)

We'll start with a small subset of the data to make sure the frames can be joined predictably and with adequately informative column names.

In [3]:
df1 = company_dataframes['AAAP']
df1_small = df1.loc['2016-06-12':'2016-06-25']
df2 = company_dataframes['ABBV']
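df2_small = df2.loc['2016-06-12':'2016-06-25'].copy()
# Rename the small copy, not df2 itself, so the frame stored in company_dataframes keeps its column names.
df2_small.columns = ['a','b','c','d','e','f']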

So we can join AAAP and ABBV in a predictable fashion.
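
As a minimal sketch of that check, the cell below joins two small synthetic frames on their date index. The prices and the two-column layout are invented for illustration, and the `-TICKER` suffix naming is an assumption rather than the scheme necessarily used on the real data.

```python
import pandas as pd

# Synthetic stand-ins for two tickers' price slices (values are invented).
dates = pd.to_datetime(["2016-06-13", "2016-06-14", "2016-06-15"])
aaap_small = pd.DataFrame(
    {"close": [30.1, 30.4, 29.9], "volume": [1000, 1200, 900]}, index=dates
)
abbv_small = pd.DataFrame(
    {"close": [61.0, 60.5], "volume": [5000, 5200]}, index=dates[:2]
)

# Tag every column with its ticker before joining so no names collide.
# add_suffix returns a new frame, so the originals are left untouched.
joined = aaap_small.add_suffix("-AAAP").join(
    abbv_small.add_suffix("-ABBV"), how="outer"
)
print(joined)
```

With an outer join, dates missing from one ticker show up as NaN in that ticker's columns instead of being dropped.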

Let's loop that same process over the entire pharma sector.

In [19]:
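first = True
for ticker, comp_df in company_dataframes.items():
    if first:
        # Seed the market frame with the first company, labelling each column with its ticker.
        market_df = comp_df
        market_df.columns = ["volume-"+ticker,
                             "close-"+ticker,
                             "high-"+ticker,
                             "open-"+ticker,
                             "low-"+ticker,
                             "pdufa-"+ticker]
        first = False
    else:
        # Outer-join on the date index. Suffixes apply only to column names already present
        # in market_df, so the second company merged keeps its bare column names.
        #comp_df.columns = ["volume-"+ticker,"close-"+ticker,"high-"+ticker,"open-"+ticker,"low-"+ticker,"pdufa-"+ticker]
        market_df = pd.merge(market_df, comp_df, how='outer', left_index=True, right_index=True, suffixes=('', '-'+ticker))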
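
One way to give every ticker the same labelling, sketched here on toy frames, is to suffix each frame before combining them. The tickers `AAA`, `BBB`, `CCC` and their values are hypothetical, and `pd.concat` is swapped in for the pairwise `pd.merge`; this is an alternative to consider, not the approach the notebook ran.

```python
import pandas as pd

# Three toy per-ticker frames sharing the same column names (values invented).
idx = pd.to_datetime(["2016-06-13", "2016-06-14"])
frames = {
    "AAA": pd.DataFrame({"close": [1.0, 2.0]}, index=idx),
    "BBB": pd.DataFrame({"close": [3.0, 4.0]}, index=idx),
    "CCC": pd.DataFrame({"close": [5.0, 6.0]}, index=idx),
}

# add_suffix returns a renamed copy, so the frames in the dict are not mutated.
labelled = [df.add_suffix("-" + ticker) for ticker, df in frames.items()]

# Align all frames on the date index in one step, keeping every date.
market = pd.concat(labelled, axis=1, join="outer")
print(market.columns.tolist())
```

Because every column carries its ticker, a later `filter(regex='close')` still picks up all close columns.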

Use a regular expression filter to select every column of close prices, then calculate the mean and standard deviation across those columns for each row. Merging the two results yields a dataframe of the mean and standard deviation of close prices on a day-by-day basis.

In [7]:
price_mean = market_df.filter(regex='close').mean(axis=1, skipna=True)
price_stdv = market_df.filter(regex='close').std(axis=1, skipna=True)

In [8]:
stats_df = pd.merge(price_mean.to_frame(),
                    price_stdv.to_frame(), 
                    left_index=True, 
                    right_index=True, 
                    how='inner')
stats_df.rename(columns={u'0_x':"CP_mean", u'0_y':"CP_stdv"}, inplace=True)
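
The row-wise statistics can be checked on a toy frame. The sketch below uses invented tickers and values, and builds the stats frame directly from the two Series instead of merging and renaming; that shortcut is an assumption about what is equivalent here, not what the cells above do.

```python
import pandas as pd

# Toy market frame: two close columns and one volume column (values invented).
idx = pd.to_datetime(["2000-01-03", "2000-01-04"])
toy = pd.DataFrame(
    {"close-AAA": [10.0, 12.0], "close-BBB": [20.0, None], "volume-AAA": [100, 200]},
    index=idx,
)

closes = toy.filter(regex="close")  # keeps only columns whose name contains "close"
stats = pd.DataFrame({
    "CP_mean": closes.mean(axis=1, skipna=True),  # row-wise mean across tickers
    "CP_stdv": closes.std(axis=1, skipna=True),   # row-wise sample std (ddof=1)
})
print(stats)
```

On a date where only one ticker has a close price, the sample standard deviation is NaN, which is worth keeping in mind before dividing by `CP_stdv`.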

In [23]:
stats_df

date,CP_mean,CP_stdv
2000-01-03,27.105697,28.486195
2000-01-04,25.714626,27.195048
2000-01-05,25.839977,27.284835
2000-01-06,26.257429,27.431282
2000-01-07,28.418530,30.458220
2000-01-10,29.682323,31.791633
2000-01-11,29.124288,30.836872
2000-01-12,28.900803,30.421695
2000-01-13,29.900409,32.670777
2000-01-14,30.369091,32.851835


In [9]:
result = pd.merge(market_df, stats_df, 
                  left_index=True, right_index=True, how='inner')

In [10]:
result

date,volume-ENTA,close-ENTA,high-ENTA,open-ENTA,low-ENTA,pdufa-ENTA,volume,close,high,open,...,low-OPK,pdufa?-OPK,volume-SGYP,close-SGYP,high-SGYP,open-SGYP,low-SGYP,pdufa?-SGYP,CP_mean,CP_stdv
2000-01-03,,,,,,,,,,,...,7.500,False,,,,,,,27.105697,28.486195
2000-01-04,,,,,,,,,,,...,7.500,False,,,,,,,25.714626,27.195048
2000-01-05,,,,,,,,,,,...,7.625,False,,,,,,,25.839977,27.284835
2000-01-06,,,,,,,,,,,...,7.313,False,,,,,,,26.257429,27.431282
2000-01-07,,,,,,,,,,,...,7.188,False,,,,,,,28.418530,30.458220
2000-01-10,,,,,,,,,,,...,7.750,False,,,,,,,29.682323,31.791633
2000-01-11,,,,,,,,,,,...,7.563,False,,,,,,,29.124288,30.836872
2000-01-12,,,,,,,,,,,...,7.313,False,,,,,,,28.900803,30.421695
2000-01-13,,,,,,,,,,,...,7.063,False,,,,,,,29.900409,32.670777
2000-01-14,,,,,,,,,,,...,7.063,False,,,,,,,30.369091,32.851835


Now we've got a full dataframe of the pharmaceutical sector, along with a sector-wide index (the daily mean and standard deviation of close prices). It joins cleanly to the stock prices on date, so we've got a usable index for any given company in the pharma sector that is available through API calls on AlphaVantage.

Let's serialize both dataframes. The close price stats frame will be useful for normalizing stock prices, and the whole dataframe is already here, so we may as well cache it.

In [12]:
with open("dataframe_with_mean_stdv_price.pkl", "wb") as f:  # write pickles in binary mode
    dill.dump(result, f)

In [24]:
with open("close_price_stats_frame.pkl", "wb") as f:  # write pickles in binary mode
    dill.dump(stats_df, f)
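
One plausible use of the cached stats frame is a z-score against the sector: subtract `CP_mean` from a ticker's close and divide by `CP_stdv`. The notebook does not show the normalization step itself, so treat this as a sketch under that assumption; the `close-AAA` series and all numbers are hypothetical.

```python
import pandas as pd

# Toy stats frame and a toy close-price series on the same dates (values invented).
idx = pd.to_datetime(["2000-01-03", "2000-01-04"])
stats = pd.DataFrame({"CP_mean": [27.0, 25.0], "CP_stdv": [28.0, 27.0]}, index=idx)
close = pd.Series([41.0, 11.5], index=idx, name="close-AAA")

# Assumed normalization: how many sector standard deviations the close
# sits above or below the sector mean on each date.
z = (close - stats["CP_mean"]) / stats["CP_stdv"]
print(z)
```

Pandas aligns the Series on their date index, so dates missing from either side produce NaN and are not silently shifted.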