# TDI Capstone - Final Report

### Abstract:
Phase III clinical trials have a massive effect on the market capitalizations of pharmaceutical companies. A successful phase III trial for a novel drug allows a company to begin marketing the drug for a previously unserved clinical indication, yielding large revenue streams. To reach approval, a drug must pass through roughly 10 years of research and be tested on thousands (or tens of thousands) of patients.

The long time scales of trials and the sheer number of people involved (patients, research scientists, clinicians, research coordinators, government regulators...) mean that privileged information regarding trial status has an unusually high level of exposure compared to other high-tech industries. [Recent research](https://academic.oup.com/jnci/article/103/20/1507/904625/Company-Stock-Prices-Before-and-After-Public) suggests that this information exposure can cause detectable movement in public stock markets.

If leaks of privileged information can affect the valuations of pharmaceutical companies, can the movements of these valuations be identified using machine learning and used to inform smarter trading decisions?

### Gathering Data:

To start on this problem, we need to select a range of target companies (in the pharma sector), find the daily closing prices of their stocks, and get the dates of their relevant approval announcements. With this information, I'll attempt to extract features and train a machine learning model.

Data Sources: 
* [fdaTracker.com's free PDUFA Calendar](https://www.fdatracker.com/fda-calendar/)
* [Biopharm Catalyst's upcoming PDUFA Calendar](https://www.biopharmcatalyst.com/calendars/fda-calendar)
* [AlphaVantage's Stock Price API](https://www.alphavantage.co/)

###### First:
First, let's get the historical PDUFA (FDA announcement) dates:

In [1]:
from urllib2 import urlopen, HTTPError  # HTTPError is caught in the download loop
import ics
import re
from datetime import datetime
from alpha_vantage.timeseries import TimeSeries
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
import dill

tickerRe = re.compile(r"\A[A-Z]{3,4}\W")
today = datetime.today()

In [2]:
FdaUrl = "https://calendar.google.com/calendar/ical/5dso8589486irtj53sdkr4h6ek%40group.calendar.google.com/public/basic.ics"
FdaCal = ics.Calendar(urlopen(FdaUrl).read().decode('iso-8859-1'))
FdaCal

<Calendar with 551 events>

In [3]:
past_pdufa_syms = set()
for event in FdaCal.events:
    matches = re.findall(tickerRe, event.name)
    if len(matches) >=1:
        eComp = str(matches[0]).strip().strip(".")
        past_pdufa_syms.add(eComp)

In [4]:
print past_pdufa_syms

set(['ENTA', 'AAAP', 'PAR', 'KITE', 'BSTC', 'DNDN', 'ELTP', 'ZSPH', 'GSK', 'CLVS', 'ADLR', 'XENT', 'IRWD', 'VNDA', 'RPTP', 'ACUR', 'DEPO', 'ARIA', 'CHMA', 'VRTX', 'OREX', 'THRX', 'HPTX', 'PCRX', 'ENDP', 'NAVB', 'DRRX', 'OTIC', 'EXEL', 'FLXN', 'ZGEN', 'DSCO', 'SRPT', 'PSDV', 'BCRX', 'MRK', 'ADMP', 'ADMS', 'ISTA', 'AMRN', 'BMY', 'ITMN', 'CEMP', 'KMPH', 'RLYP', 'ANAC', 'SUPN', 'JNJ', 'AERI', 'SNY', 'VION', 'AMAG', 'HZNP', 'REGN', 'SOMX', 'COLL', 'CRTX', 'LPCN', 'VCEL', 'GTXI', 'CTIC', 'LLY', 'AFFY', 'CYPB', 'HGSI', 'OCUL', 'AEGR', 'HOLX', 'OSI', 'DVAX', 'TEVA', 'SGEN', 'TTNP', 'ACOR', 'RPRX', 'EGRX', 'JAZZ', 'AUXL', 'QCOR', 'FLML', 'NEOS', 'AGRX', 'RARE', 'RDUS', 'NGSX', 'RMTI', 'NPSP', 'CLDA', 'DYAX', 'ALTH', 'ARLZ', 'VVUS', 'LXRX', 'NVO', 'MELA', 'ANX', 'SCMP', 'POZN', 'HRTX', 'ACAD', 'AGIO', 'TSRO', 'RGEN', 'FURX', 'SPPI', 'AMGN', 'ONXX', 'BMTI', 'BPAX', 'REPH', 'DOR', 'NEW', 'WCRX', 'ASTX', 'KTOV', 'ISIS', 'AVEO', 'BIOD', 'GNBT', 'VRX', 'IGXT', 'LGND', 'MDVN', 'AMLN', 'OMER', 'KMDA', 
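
To make the ticker-matching rule concrete, here is a minimal sketch of how a pattern like `tickerRe` behaves. The event names are hypothetical (the real calendar entries are not reproduced in this report), and the capture group is a small variation of the original pattern that returns the symbol without its trailing delimiter. The sketch is written for Python 3.

```python
import re

# Variant of tickerRe with a capture group, so the trailing delimiter is dropped.
TICKER_RE = re.compile(r"\A([A-Z]{3,4})\W")

def extract_ticker(event_name):
    """Return the leading 3-4 letter symbol, or None if the name has none."""
    match = TICKER_RE.match(event_name)
    return match.group(1) if match else None

# Hypothetical event names, for illustration only.
examples = [
    "ENTA PDUFA date",       # 4-letter symbol
    "JNJ. sNDA decision",    # 3-letter symbol followed by a period
    "FDA advisory meeting",  # false positive: any leading 3-4 capitals match
    "AB decision",           # too short, ignored
    "ABCDE decision",        # too long, ignored
]
tickers = [extract_ticker(name) for name in examples]
print(tickers)
```

Two things stand out: symbols shorter than three or longer than four letters are skipped, and any event name that starts with a short capitalized word is treated as a ticker. False positives of that kind would presumably be filtered out later, when the price lookup for the bogus symbol fails.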

In [5]:
# Read the API key from a local file; the context manager closes the handle for us
with open("alphavantage.apikey", "r") as av_key_handle:
    ts = TimeSeries(key=av_key_handle.read().strip(), output_format='pandas')

In [6]:
dataframes = dict()
value_errors = set()
http_errors = set()
for ticker in tqdm_notebook(past_pdufa_syms):
    try:
        df, meta = ts.get_daily(symbol=ticker, outputsize='full')
        dataframes[meta["2. Symbol"]] = df
    except ValueError:
        value_errors.add(ticker)
    except HTTPError:
        http_errors.add(ticker)




Now we'll run through our past FDA dates and join the FDA actions to each dataframe

In [7]:
company_list = dataframes.keys()

In [8]:
price_and_fda = dict()
for company in tqdm_notebook(company_list):
    company_events = []
    for event in FdaCal.events:
        matches = re.findall(tickerRe, event.name)
        if len(matches)>=1:
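            # Compare the symbol exactly (dropping the trailing delimiter matched by \W);
            # a substring test would also match e.g. a 3-letter ticker inside a 4-letter one
            if matches[0][:-1] == company:
                company_events.append((event.begin.datetime.strftime("%Y-%m-%d"), True))
    price = dataframes[company]
    raw_dates = pd.DataFrame(company_events, columns = ["date", "pdufa?"])
    dates = raw_dates.set_index("date")
    final = price.join(dates,rsuffix='_y')
    # Days without an FDA action have no match in the join, so mark them False
    final['pdufa?'] = final['pdufa?'].fillna(False)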
    price_and_fda[company] = final




That leaves us with a dict of dataframes containing every company's stock price, and FDA action dates

In [32]:
price_and_fda['ENTA'].head(3)

|  | volume | close | high | open | low | pdufa? |
|---|---|---|---|---|---|---|
| 2013-03-21 | 1763600.0 | 17.18 | 17.85 | 14.51 | 14.31 | False |
| 2013-03-22 | 75200.0 | 16.81 | 17.57 | 17.57 | 16.71 | False |
| 2013-03-25 | 24100.0 | 16.83 | 17.4 | 17.4 | 16.8 | False |
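
The join step can be illustrated on a toy frame. This is a minimal sketch with a made-up event date (2013-03-22 is not a real PDUFA date for any company in the data set); only the three closing prices are borrowed from the output shown here. It is written for Python 3.

```python
import pandas as pd

# Three days of closing prices, indexed by date string.
prices = pd.DataFrame(
    {"close": [17.18, 16.81, 16.83]},
    index=["2013-03-21", "2013-03-22", "2013-03-25"],
)

# One hypothetical event date, flagged True.
events = pd.DataFrame(
    [("2013-03-22", True)], columns=["date", "pdufa?"]
).set_index("date")

# A left join keeps every trading day; days without an event get NaN,
# which is then replaced by False.
joined = prices.join(events)
joined["pdufa?"] = joined["pdufa?"].fillna(False).astype(bool)
print(joined)
```

Because the join is a left join on the price index, an event that falls on a non-trading day (a weekend or holiday) finds no matching row and is silently dropped.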


Now that I've got a good crop of downloaded data, let's cache it for good measure.

In [22]:
# Pickle streams are binary, so open the cache file in "wb" mode
with open("Prices_and_PDUFAs_final", "wb") as cache_file:
    dill.dump(price_and_fda, cache_file)

### Checkpoint 1 - FDA Action Dates Joined to Equity Prices

In [23]:
with open("Prices_and_PDUFAs_final", "rb") as cache_file:
    price_and_fda = dill.load(cache_file)
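
The checkpoint pattern used throughout this report is a plain dump-and-reload round trip. The sketch below uses the standard-library `pickle` module as a stand-in, since `dill` exposes the same `dump`/`load` interface; the dictionary contents and the file name are placeholders. Files are opened in binary mode and closed by a context manager.

```python
import os
import pickle
import tempfile

# Placeholder payload standing in for the dict of per-company frames.
data = {"AAA": [1.0, 2.0], "BBB": [3.0]}

# Write to a scratch directory so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")

with open(path, "wb") as handle:   # binary write
    pickle.dump(data, handle)

with open(path, "rb") as handle:   # binary read
    restored = pickle.load(handle)

print(restored == data)
```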

So that's every company's stock prices, with PDUFA dates going back to around 2006, and pricing data going back to 2001. More than enough data for our analysis.

Let's unify all the prices into one frame to construct a pharmaceutical price index. This will give us a base price to normalize our stock prices against, insulating our model from general economic events (the '08 housing crash) or events affecting the whole pharmaceutical sector (passage of new FDA regulations).

In [10]:
with open("Prices_and_PDUFAs_final", "rb") as cache_file:
    price_and_fda = dill.load(cache_file)

In [11]:
market_df = None
for ticker, comp_df in price_and_fda.iteritems():
    # Suffix every column with its ticker so the merged columns stay unique
    # (e.g. "close" becomes "close-ENTA"), regardless of column order
    suffixed = comp_df.add_suffix("-" + ticker)
    if market_df is None:
        market_df = suffixed
    else:
        # Outer join on the date index keeps days on which only some companies traded
        market_df = pd.merge(market_df, suffixed, how='outer', left_index=True, right_index=True)

In [12]:
price_mean = market_df.filter(regex='close').mean(axis = 1, skipna = True)
price_stdv = market_df.filter(regex='close').std(axis = 1, skipna = True)

In [13]:
# Both series share market_df's date index, so they line up row for row
stats_df = pd.DataFrame({"CP_mean": price_mean, "CP_stdv": price_stdv},
                        columns=["CP_mean", "CP_stdv"])

In [30]:
stats_df.head()

|  | CP_mean | CP_stdv |
|---|---|---|
| 2000-01-03 | 28.520755 | 29.266429 |
| 2000-01-04 | 27.095726 | 28.010693 |
| 2000-01-05 | 27.200238 | 28.108746 |
| 2000-01-06 | 27.628152 | 28.330073 |
| 2000-01-07 | 29.792918 | 31.265757 |


This is as good a place as any to cache the closing price index

In [15]:
with open("close_price_stats_frame_final.pkl", "wb") as cache_file:
    dill.dump(stats_df, cache_file)

### Checkpoint 2

In [24]:
with open("close_price_stats_frame_final.pkl", "rb") as cache_file:
    stats_df = dill.load(cache_file)

Now I have the mean and standard deviation of close prices (`stats_df`) for every day of my data coverage. This will make it easy to normalize prices for every slice of time relevant to an FDA trial. 
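
The normalization this enables is a per-day z-score: each close is measured against the same-day mean and standard deviation across all companies. A minimal sketch on made-up prices (the tickers `AAA`, `BBB`, `CCC` and all values are illustrative, not taken from the data set), written for Python 3:

```python
import numpy as np
import pandas as pd

# Toy closing prices for three hypothetical tickers; BBB has no quote on the last day.
closes = pd.DataFrame(
    {
        "close-AAA": [10.0, 11.0, 12.0],
        "close-BBB": [20.0, 19.0, np.nan],
        "close-CCC": [30.0, 30.0, 33.0],
    },
    index=["2017-01-03", "2017-01-04", "2017-01-05"],
)

# Cross-sectional statistics: one mean and one standard deviation per day.
daily_mean = closes.mean(axis=1, skipna=True)
daily_stdv = closes.std(axis=1, skipna=True)

# z-score: (price - same-day mean) / same-day standard deviation.
z_scores = closes.sub(daily_mean, axis=0).div(daily_stdv, axis=0)
print(z_scores)
```

One consequence of this design is that the index is an unweighted average of raw share prices, so companies with high nominal share prices pull the mean and standard deviation more than low-priced ones.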

Time to cut time slices for each clinical trial and generate a population of clinical trials and normalized prices.

In [25]:
norm_data = []
for company in tqdm_notebook(company_list):
    df = price_and_fda[company].join(stats_df, how='left').reset_index()
    pdufa_dates = df.index[df['pdufa?']].tolist()
    if len(pdufa_dates) > 0:
        for date in pdufa_dates:
            pRange = range(date-120, date-7)
            pCloses, pVolumes = [], []
            for i in pRange:
                try:
                    row = df.loc[i]
                    # z-score the close against the same-day cross-company mean and stdev
                    norm_close = (row['close'] - row['CP_mean']) / row['CP_stdv']
                    pCloses.append((row['index'], norm_close))
                    pVolumes.append((row['index'], row['volume']))
                except KeyError:
                    # the window reaches back past the first available row
                    pCloses.append(None)
                    pVolumes.append(None)
            norm_data.append((company, df.loc[date]['index'], (pCloses, pVolumes)))
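
Note that the window in the loop above is defined in row offsets, which correspond to trading days rather than calendar days. The sketch below isolates that slicing logic; `pre_event_window` is a hypothetical helper (not part of the notebook) and the frame is synthetic. Written for Python 3.

```python
import numpy as np
import pandas as pd

def pre_event_window(df, event_row, start_offset=120, end_offset=7):
    """Rows from `start_offset` before the event up to, but excluding,
    `end_offset` before it. Returns None when the history is too short."""
    start, stop = event_row - start_offset, event_row - end_offset
    if start < 0:
        return None
    return df.iloc[start:stop]

# Synthetic price history: 300 trading days on a 0..299 integer index.
history = pd.DataFrame({"close": np.arange(300.0)})

window = pre_event_window(history, event_row=200)
print(len(window), window.index[0], window.index[-1])

too_early = pre_event_window(history, event_row=50)
print(too_early)
```

With the default offsets each slice holds 113 rows, ending 8 trading days before the event. Events with fewer than 120 prior trading days have no complete window; the helper returns None for them.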




Now that we have normalized slices, let's add the annotations from our score sheet.

In [26]:
# Each line of the score sheet is whitespace-separated: ticker, date, outcome
with open("score_sheet_complete.txt", "r") as score_file:
    scores = [line.split() for line in score_file]

In [27]:
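norm_data_annotated = []
mismatches = []
for datum in tqdm_notebook(norm_data):
    for score in scores:
        # match on (ticker, PDUFA date); score[2] is the outcome label
        if datum[0] == score[0] and datum[1] == score[1]:
            norm_data_annotated.append((datum[0], datum[1], score[2], datum[2]))
            break
    else:
        # no score-sheet entry for this slice; keep a record of it
        mismatches.append((datum[0], datum[1]))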
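
The same matching can be done with a dictionary keyed on (ticker, date), which avoids rescanning the score sheet for every slice. This is a sketch under the assumption that each score-sheet line holds a ticker, a date, and an outcome separated by whitespace; the tickers, dates, and payloads below are placeholders. Written for Python 3.

```python
# Hypothetical score-sheet lines: ticker, PDUFA date, outcome (1 = approved, 0 = not).
score_lines = ["AAA 2016-06-01 1", "BBB 2015-10-12 0"]

# Placeholder slices in the same (ticker, date, payload) layout as norm_data.
slices = [
    ("AAA", "2016-06-01", "payload-1"),
    ("CCC", "2014-01-02", "payload-2"),
]

# Build the lookup once: (ticker, date) -> outcome string.
score_lookup = {}
for line in score_lines:
    ticker, date, outcome = line.split()
    score_lookup[(ticker, date)] = outcome

annotated, unmatched = [], []
for ticker, date, payload in slices:
    outcome = score_lookup.get((ticker, date))
    if outcome is None:
        unmatched.append((ticker, date))
    else:
        annotated.append((ticker, date, outcome, payload))

print(annotated)
print(unmatched)
```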




In [58]:
with open("normalized_training_data.pkl", "wb") as cache_file:
    dill.dump(norm_data_annotated, cache_file)

### Checkpoint 3

In [59]:
with open("normalized_training_data.pkl", "rb") as cache_file:
    norm_data_annotated = dill.load(cache_file)

Now we have normalized stock prices in slices that run from 120 trading days before each FDA action date up to, but not including, 7 trading days before it. Let's pull those back into smaller pandas frames for feature extraction.

In [33]:
def assemble_frame(datum):
    df = pd.DataFrame(datum[3][0], columns=['date','norm_price'])
    df['event'] = datum[0]+"/"+datum[1]
    df['outcome'] = int(datum[2])
    return df

In [35]:
frames = []
for line in tqdm_notebook(norm_data_annotated):
    try:
        frames.append(assemble_frame(line))
    except Exception:
        # slices that cannot be assembled into a frame are reported and skipped
        print line[0], line[1], "failed"
# Concatenate once at the end instead of growing the frame inside the loop
agg_data = pd.concat(frames, ignore_index=True)

COLL 2015-10-12 failed
NEOS 2015-11-09 failed



In [39]:
agg_data['date_stamp'] = pd.to_datetime(agg_data['date'])
event_labels = pd.factorize(agg_data['event'])
agg_data["event_stamp"] = event_labels[0]
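
`pd.factorize` is what turns the `ticker/date` event labels into the integer ids used later to group rows by event. A minimal sketch with placeholder labels, written for Python 3:

```python
import pandas as pd

# Placeholder event labels in the same "ticker/date" format as the 'event' column.
events = pd.Series(
    ["AAA/2016-06-01", "AAA/2016-06-01", "BBB/2015-01-15", "AAA/2016-06-01"]
)

# codes: one integer per row; uniques: the distinct labels in order of first appearance.
codes, uniques = pd.factorize(events)
print(list(codes))
print(list(uniques))
```

Because the codes depend on the order in which labels first appear, ids assigned in one frame are not comparable with ids assigned in another.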

Now let's remove the rows with null prices, which occur on some days (due to either acquisitions or bankruptcies).

In [41]:
# Count the missing values in each row
agg_data['null'] = agg_data.isnull().sum(axis=1)
cleaned_agg = agg_data[agg_data['null'] == 0]
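
The null filter counts missing values per row and keeps only rows where the count is zero. A small sketch on made-up values (Python 3):

```python
import numpy as np
import pandas as pd

# Toy slice with one missing normalized price.
frame = pd.DataFrame(
    {
        "date": ["2015-12-08", "2015-12-09", "2015-12-10"],
        "norm_price": [-0.13, np.nan, -0.12],
    }
)

# Number of missing values in each row, then keep the complete rows only.
frame["null"] = frame.isnull().sum(axis=1)
cleaned = frame[frame["null"] == 0]
print(cleaned)
```

This removes individual rows rather than whole events, so an event with a few missing days stays in the data set with a shorter series.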

In [42]:
cleaned_agg.head()

|  | date | norm_price | event | outcome | date_stamp | event_stamp | null |
|---|---|---|---|---|---|---|---|
| 0 | 2015-12-08 | -0.132904 | AAAP/2016-06-01 | 1 | 2015-12-08 | 0 | 0 |
| 1 | 2015-12-09 | -0.127848 | AAAP/2016-06-01 | 1 | 2015-12-09 | 0 | 0 |
| 2 | 2015-12-10 | -0.126276 | AAAP/2016-06-01 | 1 | 2015-12-10 | 0 | 0 |
| 3 | 2015-12-11 | -0.11862 | AAAP/2016-06-01 | 1 | 2015-12-11 | 0 | 0 |
| 4 | 2015-12-14 | -0.110472 | AAAP/2016-06-01 | 1 | 2015-12-14 | 0 | 0 |


In [44]:
with open('final_cleaned_price_slices.pkl', 'wb') as cache_file:
    dill.dump(cleaned_agg, cache_file)

### Checkpoint 3 - Training data preprocessed

In [45]:
with open('final_cleaned_price_slices.pkl', 'rb') as cache_file:
    cleaned_agg = dill.load(cache_file)

That's a ready-to-extract package of every clinical trial scraped. Let's go ahead and make a train/test split now, while it's easy and convenient.

In [46]:
# sklearn.cross_validation was deprecated in scikit-learn 0.18 and later removed
from sklearn.model_selection import train_test_split



In [48]:
train_data, test_data = train_test_split(norm_data_annotated, train_size = .8)
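
The split is made at the level of whole events, so all price rows for a given trial land on the same side. The sketch below shows the same call on placeholder events; the `random_state` argument is an addition for reproducibility and is not used in the cell above. Written for Python 3 with `sklearn.model_selection`.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder events in a (ticker, date, outcome) layout.
events = [("T%02d" % i, "2016-01-%02d" % (i + 1), i % 2) for i in range(10)]

# 80/20 split over events; a fixed seed makes the partition repeatable.
train_events, test_events = train_test_split(events, train_size=0.8, random_state=0)

print(len(train_events), len(test_events))
```

Without a fixed seed, rerunning the notebook produces a different partition each time, which changes the cached train and test frames.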

In [51]:
frames = []
for line in tqdm_notebook(train_data):
    try:
        frames.append(assemble_frame(line))
    except Exception:
        # slices that cannot be assembled into a frame are reported and skipped
        print line[0], line[1], "failed"
train_df = pd.concat(frames, ignore_index=True)

train_df['date_stamp'] = pd.to_datetime(train_df['date'])
event_labels = pd.factorize(train_df['event'])
train_df["event_stamp"] = event_labels[0]

train_df['null'] = train_df.isnull().sum(axis=1)
train_clean = train_df[train_df['null'] == 0]

COLL 2015-10-12 failed
NEOS 2015-11-09 failed



In [52]:
frames = []
for line in tqdm_notebook(test_data):
    try:
        frames.append(assemble_frame(line))
    except Exception:
        # slices that cannot be assembled into a frame are reported and skipped
        print line[0], line[1], "failed"
test_df = pd.concat(frames, ignore_index=True)
test_df['date_stamp'] = pd.to_datetime(test_df['date'])
event_labels = pd.factorize(test_df['event'])
test_df["event_stamp"] = event_labels[0]

test_df['null'] = test_df.isnull().sum(axis=1)
test_clean = test_df[test_df['null'] == 0]




That's the two halves of the split, each as its own dataframe. May as well cache them.

In [56]:
with open("final_train_df.pkl", "wb") as cache_file:
    dill.dump(train_clean, cache_file)
with open("final_test_df.pkl", "wb") as cache_file:
    dill.dump(test_clean, cache_file)

### Checkpoint 4 - Test Train Split

In [57]:
with open("final_train_df.pkl", "rb") as cache_file:
    train_clean = dill.load(cache_file)
with open("final_test_df.pkl", "rb") as cache_file:
    test_clean = dill.load(cache_file)

Now for the serious work, extracting features from the pricing data in each case. 

I'll be using [tsfresh](http://tsfresh.readthedocs.io/en/latest/text/quick_start.html) to do the hard computing here, and then selecting the most relevant features. While I am able to compute almost 800 features for these data points, I'm going to narrow down to around ten of the most meaningful or important features. 

In [61]:
from tsfresh import extract_features



In [64]:
train_feats = extract_features(train_clean[['norm_price', 'event_stamp', 'date_stamp']], 
                              column_id="event_stamp", column_sort="date_stamp", 
                              column_value="norm_price", n_jobs=0).dropna(axis=1)

Feature Extraction: 100%|██████████| 186/186 [00:00<00:00, 28523.29it/s]
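
To give a feel for what the extracted columns mean, three of the simpler features can be reproduced by hand with NumPy. The definitions below reflect my understanding of the tsfresh feature calculators (sum of squares, sum of absolute differences, and population variance) and are meant as an illustration, not as tsfresh's own code. The input series is made up.

```python
import numpy as np

def abs_energy(x):
    """Sum of squared values."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(x, x))

def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(np.diff(x))))

def variance(x):
    """Population variance (divides by n, not n - 1)."""
    return float(np.var(np.asarray(x, dtype=float)))

# Made-up series, for illustration only.
series = [1.0, 2.0, 4.0, 2.0]
features = {
    "abs_energy": abs_energy(series),
    "absolute_sum_of_changes": absolute_sum_of_changes(series),
    "variance": variance(series),
}
print(features)
```

Since the inputs are z-scored prices, a large `abs_energy` mostly reflects a stock that trades far from the sector mean, while `absolute_sum_of_changes` and `variance` capture how much it moves within the window.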


In [66]:
train_feats

variable,norm_price__abs_energy,norm_price__absolute_sum_of_changes,"norm_price__agg_autocorrelation__f_agg_""mean""","norm_price__agg_autocorrelation__f_agg_""median""","norm_price__agg_autocorrelation__f_agg_""var""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""intercept""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""rvalue""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""slope""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""stderr""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_50__attr_""intercept""",...,norm_price__time_reversal_asymmetry_statistic__lag_1,norm_price__time_reversal_asymmetry_statistic__lag_2,norm_price__time_reversal_asymmetry_statistic__lag_3,norm_price__value_count__value_-inf,norm_price__value_count__value_0,norm_price__value_count__value_1,norm_price__value_count__value_inf,norm_price__value_count__value_nan,norm_price__variance,norm_price__variance_larger_than_standard_deviation
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,23.248568,0.514081,0.095439,-0.078043,0.160151,-0.438395,-0.122746,-0.000826,0.002112,-0.409927,...,-2.035117e-05,-1.796431e-04,-0.000358,0.0,0.0,0.0,0.0,0.0,0.000591,0.0
1,1.418535,1.196955,0.052072,-0.096692,0.129843,0.065291,0.574040,0.011233,0.005067,0.128214,...,1.752007e-04,3.610690e-04,0.000560,0.0,0.0,0.0,0.0,0.0,0.003454,0.0
2,48.026962,0.405354,0.359570,0.243541,0.053321,-0.671843,0.908484,0.005422,0.000789,-0.646393,...,1.253378e-03,2.455376e-03,0.003703,0.0,0.0,0.0,0.0,0.0,0.000370,0.0
3,44.799473,0.385884,0.434219,0.344674,0.087981,-0.597607,-0.894397,-0.004672,0.000739,-0.600684,...,-3.457448e-04,-8.506540e-04,-0.001571,0.0,0.0,0.0,0.0,0.0,0.000300,0.0
4,475.644683,1.875218,0.550580,0.638116,0.100106,2.335165,-0.966852,-0.049848,0.004163,2.327938,...,-7.533831e-02,-1.510242e-01,-0.227098,0.0,0.0,0.0,0.0,0.0,0.029567,0.0
5,2.667522,1.412181,0.089519,0.070594,0.253695,0.211239,-0.423223,-0.011411,0.007725,0.231063,...,-3.545368e-05,-2.424558e-05,-0.000011,0.0,0.0,0.0,0.0,0.0,0.007694,0.0
6,2.264999,1.340597,0.029158,-0.102241,0.092511,0.091644,-0.509509,-0.021242,0.011345,0.236231,...,-1.560140e-05,-6.412647e-05,-0.000075,0.0,0.0,0.0,0.0,0.0,0.015324,0.0
7,0.165810,0.950526,0.159892,0.105955,0.116260,-0.026346,0.638034,0.007515,0.002868,0.021707,...,8.107382e-06,1.174256e-05,0.000014,0.0,0.0,0.0,0.0,0.0,0.001410,0.0
8,32.957543,0.345893,0.095419,0.088259,0.119776,-0.526871,-0.532786,-0.001242,0.000624,-0.523353,...,-5.861163e-05,-7.247478e-05,-0.000113,0.0,0.0,0.0,0.0,0.0,0.000098,0.0
9,39.281838,0.348014,0.350386,0.356012,0.161056,-0.555634,-0.719993,-0.005169,0.001576,-0.546830,...,-5.014792e-04,-9.616962e-04,-0.001417,0.0,0.0,0.0,0.0,0.0,0.000492,0.0
