# TDI Capstone - Final Report

### Abstract:
Phase III clinical trials have a massive effect on the market capitalizations of pharmaceutical companies. A successful phase III trial for a novel drug allows a company to begin marketing the drug for a previously unserved clinical indication, yielding large revenue streams. To win approval, a drug must pass through roughly 10 years of research and be tested on thousands (or tens of thousands) of patients.

The long time scales of trials and the sheer number of people involved (patients, research scientists, clinicians, research coordinators, government regulators...) mean that privileged information regarding trial status has an unusually high level of exposure compared to other high-tech industries. [Recent research](https://academic.oup.com/jnci/article/103/20/1507/904625/Company-Stock-Prices-Before-and-After-Public) suggests that this information exposure can cause detectable movement in the public stock markets.

If leaks of privileged information can affect the valuations of pharmaceutical companies, can the movements of these valuations be identified using machine learning and used to inform smarter trading decisions?

### Gathering Data:

To get started on this problem, we need to select a range of target companies (in the pharma sector), find the daily closing prices of their stocks, and get the dates of their relevant approval announcements. With this information, I'll attempt to extract features and train a machine learning model.

Data Sources: 
* [fdaTracker.com's free PDUFA Calendar](https://www.fdatracker.com/fda-calendar/)
* [Biopharm Catalyst's upcoming PDUFA Calendar](https://www.biopharmcatalyst.com/calendars/fda-calendar)
* [AlphaVantage's Stock Price API](https://www.alphavantage.co/)

###### First:
First, let's get the historical PDUFA (FDA announcement) dates:

In [1]:
from urllib2 import urlopen
import ics
import re
from datetime import datetime
from alpha_vantage.timeseries import TimeSeries
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
import dill

In [2]:
tickerRe = re.compile(r"\A[A-Z]{3,4}\W")
today = datetime.today()

FdaUrl = "https://calendar.google.com/calendar/ical/5dso8589486irtj53sdkr4h6ek%40group.calendar.google.com/public/basic.ics"
FdaCal = ics.Calendar(urlopen(FdaUrl).read().decode('iso-8859-1'))
FdaCal

<Calendar with 552 events>

In [3]:
past_pdufa_syms = set()
for event in FdaCal.events:
    matches = re.findall(tickerRe, event.name)
    if len(matches) >=1:
        eComp = str(matches[0]).strip().strip(".")
        past_pdufa_syms.add(eComp)
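
To illustrate what the pattern above accepts, here is a minimal sketch that applies the same regular expression to a few made-up event titles. The titles and the helper name `extract_ticker` are assumptions for demonstration only; the real calendar entries may be formatted differently.

```python
import re

# Same pattern as above: 3-4 capital letters at the very start of the title,
# followed by one non-word character (space, period, dash, ...)
ticker_re = re.compile(r"\A[A-Z]{3,4}\W")

def extract_ticker(event_name):
    """Return the leading ticker symbol, or None if the title has none."""
    match = ticker_re.match(event_name)
    if match is None:
        return None
    # Drop the trailing separator that the pattern captured
    return match.group(0).strip().strip(".")

# Hypothetical event titles, for illustration only
titles = ["ENTA PDUFA date", "GSK. advisory committee", "AB too short", "ABCDE too long"]
tickers = [extract_ticker(t) for t in titles]
print(tickers)
```

Two-letter and five-letter symbols fall outside this pattern, so any company with such a ticker would be skipped by the extraction step.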

In [4]:
print past_pdufa_syms

set(['ENTA', 'AAAP', 'PAR', 'KITE', 'BSTC', 'DNDN', 'ELTP', 'ZSPH', 'GSK', 'CLVS', 'ADLR', 'XENT', 'IRWD', 'VNDA', 'RPTP', 'ACUR', 'DEPO', 'ARIA', 'CHMA', 'VRTX', 'OREX', 'THRX', 'HPTX', 'PCRX', 'ENDP', 'NAVB', 'DRRX', 'OTIC', 'EXEL', 'FLXN', 'ZGEN', 'DSCO', 'SRPT', 'PSDV', 'BCRX', 'MRK', 'ADMP', 'ADMS', 'ISTA', 'AMRN', 'BMY', 'ITMN', 'CEMP', 'KMPH', 'RLYP', 'ANAC', 'SUPN', 'JNJ', 'AERI', 'SNY', 'VION', 'AMAG', 'HZNP', 'REGN', 'SOMX', 'COLL', 'CRTX', 'LPCN', 'VCEL', 'GTXI', 'CTIC', 'LLY', 'AFFY', 'CYPB', 'HGSI', 'OCUL', 'AEGR', 'HOLX', 'OSI', 'DVAX', 'TEVA', 'SGEN', 'TTNP', 'ACOR', 'RPRX', 'EGRX', 'JAZZ', 'AUXL', 'QCOR', 'FLML', 'NEOS', 'AGRX', 'RARE', 'RDUS', 'NGSX', 'RMTI', 'NPSP', 'CLDA', 'DYAX', 'ALTH', 'ARLZ', 'VVUS', 'LXRX', 'NVO', 'MELA', 'ANX', 'SCMP', 'POZN', 'HRTX', 'ACAD', 'AGIO', 'TSRO', 'RGEN', 'FURX', 'SPPI', 'AMGN', 'ONXX', 'BMTI', 'BPAX', 'REPH', 'DOR', 'NEW', 'WCRX', 'ASTX', 'KTOV', 'ISIS', 'AVEO', 'BIOD', 'GNBT', 'VRX', 'IGXT', 'LGND', 'MDVN', 'AMLN', 'OMER', 'KMDA', 

In [5]:
# Read the API key from a local file; the context manager closes the handle for us
with open("alphavantage.apikey", "r") as av_key_handle:
    ts = TimeSeries(key=av_key_handle.read().strip(), output_format='pandas')

In [13]:
dataframes = dict()
value_errors = set()
other_errors = set()
for ticker in tqdm_notebook(past_pdufa_syms):
    try:
        df, meta = ts.get_daily(symbol=ticker, outputsize='full')
        dataframes[meta["2. Symbol"]] = df
    except ValueError:
        value_errors.add(ticker)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        other_errors.add(ticker)




In [15]:
print value_errors
print other_errors

set(['VION', 'PCYC', 'FURX', 'DNDN', 'AIS', 'GEVA', 'BPAX', 'INSV', 'ZSPH', 'ADLR', 'CBRX', 'PPDI', 'ISIS', 'CYPB', 'HGSI', 'AEGR', 'ARIA', 'THRX', 'MDVN', 'HPTX', 'APPA', 'SLXP', 'SNTS', 'AVNR', 'AUXL', 'TSPT', 'QCOR', 'FLML', 'NEOL', 'ZGEN', 'DSCO', 'XNPT', 'SVNT', 'ALXA', 'FRX', 'NGSX', 'ISTA', 'NPSP', 'CLDA', 'RPTP', 'DYAX', 'CHTP', 'ITMN', 'RLYP', 'BIOD', 'ANAC', 'DRTX', 'KYTH', 'MELA', 'ANX', 'POZN'])
set(['ASTX', 'WCRX'])


In [39]:
with open('final_raw_dataframe_dict.pkl', 'wb') as f:  # pickles are binary, so use 'wb'
    dill.dump(dataframes, f)

###### Mini Checkpoint for slow API calls

In [40]:
with open('final_raw_dataframe_dict.pkl', 'rb') as f:
    dataframes = dill.load(f)
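
The dump-then-load checkpoint pattern used here can be wrapped in a small helper so the slow step only runs when no cache exists. This is a minimal sketch, not the code used for this report: it uses the standard-library `pickle` as a stand-in for `dill`, and the helper name `load_or_compute`, the file name, and the sample data are all hypothetical.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Load a pickled result from `path`, or compute it and cache it there."""
    if os.path.exists(path):
        with open(path, "rb") as handle:   # pickles are binary data
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result

# Stand-in for the slow download loop (hypothetical data)
calls = []
def slow_download():
    calls.append(1)
    return {"ENTA": [17.18, 16.81, 16.83]}

cache_path = os.path.join(tempfile.mkdtemp(), "raw_prices.pkl")
first = load_or_compute(cache_path, slow_download)    # runs the download
second = load_or_compute(cache_path, slow_download)   # served from disk
print(len(calls))
```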

Now we'll run through our past FDA dates and join the FDA actions to each dataframe

In [41]:
company_list = dataframes.keys()

In [42]:
price_and_fda = dict()
for company in tqdm_notebook(company_list):
    company_events = []
    for event in FdaCal.events:
        matches = re.findall(tickerRe, event.name)
        if len(matches)>=1:
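            # Compare the cleaned ticker exactly; a substring test would let 'PAR' match a longer symbol that contains it
            if str(matches[0]).strip().strip(".") == company: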
                company_events.append((event.begin.datetime.strftime("%Y-%m-%d"), True))
    price = dataframes[company]
    raw_dates = pd.DataFrame(company_events, columns = ["date", "pdufa?"])
    dates = raw_dates.set_index("date")
    final = price.join(dates,rsuffix='_y')
    final['pdufa?'] = final['pdufa?'].fillna(False).astype(bool)  # non-event days become False
    price_and_fda[company] = final




That leaves us with a dict of dataframes containing every company's stock prices and FDA action dates.

In [43]:
price_and_fda['ENTA'].head(3)

Unnamed: 0,volume,close,high,open,low,pdufa?
2013-03-21,1763600.0,17.18,17.85,14.51,14.31,False
2013-03-22,75200.0,16.81,17.57,17.57,16.71,False
2013-03-25,24100.0,16.83,17.4,17.4,16.8,False
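
The join above keeps every price row and flags the rows whose date carries an FDA event. A pure-Python sketch of that left-join behaviour follows; the closing prices come from the ENTA rows shown above, while the event dates and the helper name `flag_event_dates` are made up for illustration.

```python
def flag_event_dates(price_rows, event_dates):
    """Left-join style flagging: keep every price row, mark the event days."""
    events = set(event_dates)
    flagged = []
    for row in price_rows:
        new_row = dict(row)                       # copy so the input is untouched
        new_row["pdufa?"] = row["date"] in events
        flagged.append(new_row)
    return flagged

prices = [
    {"date": "2013-03-21", "close": 17.18},
    {"date": "2013-03-22", "close": 16.81},
    {"date": "2013-03-25", "close": 16.83},
]
# Hypothetical event dates; 2013-03-23 is a Saturday with no price row
flagged = flag_event_dates(prices, ["2013-03-22", "2013-03-23"])
print([row["pdufa?"] for row in flagged])
```

One consequence worth keeping in mind: an event dated on a day without a price row (a weekend or market holiday) has nothing to attach to, so a left join from the price table drops it silently.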


Now that I've got a good crop of downloaded data, let's cache it for good measure.

In [187]:
with open("final_Prices_and_PDUFAs.pkl", "wb") as f:
    dill.dump(price_and_fda, f)

### Checkpoint 1 - FDA Action Dates Joined to Equity Prices

In [188]:
with open("final_Prices_and_PDUFAs.pkl", "rb") as f:
    price_and_fda = dill.load(f)

So that's every company's stock prices, with PDUFA dates going back to around 2006 and pricing data going back to 2001. More than enough data for our analysis.

Let's unify all the prices into one frame to construct a pharmaceutical price index. This will give us a baseline to normalize our stock prices against, insulating our model from general economic events (the '08 housing crash) or events affecting the whole pharmaceutical sector (passage of new FDA regulations).

In [46]:
# Reload from the Checkpoint 1 cache file written above
with open("final_Prices_and_PDUFAs.pkl", "rb") as f:
    price_and_fda = dill.load(f)

In [47]:
market_df = None
for ticker, comp_df in price_and_fda.iteritems():
    # Tag every column with its ticker so the merged frame has unambiguous names
    renamed = comp_df.add_suffix("-" + ticker)
    if market_df is None:
        market_df = renamed
    else:
        market_df = pd.merge(market_df, renamed, how='outer', left_index=True, right_index=True)

In [48]:
price_mean = market_df.filter(regex='close').mean(axis = 1, skipna = True)
price_stdv = market_df.filter(regex='close').std(axis = 1, skipna = True)

In [49]:
# Both series share the market_df index, so they line up row for row
stats_df = pd.DataFrame({"CP_mean": price_mean, "CP_stdv": price_stdv})

In [50]:
stats_df.head()

Unnamed: 0,CP_mean,CP_stdv
2000-01-03,28.502039,29.517285
2000-01-04,27.066072,28.246109
2000-01-05,27.154608,28.343677
2000-01-06,27.534455,28.559617
2000-01-07,29.768633,31.529028


This is as good a place as any to cache the closing price index.

In [51]:
with open("close_price_stats_frame_final.pkl", "wb") as f:
    dill.dump(stats_df, f)

### Checkpoint 2

In [52]:
with open("close_price_stats_frame_final.pkl", "rb") as f:
    stats_df = dill.load(f)

Now I have the mean and standard deviation of close prices (`stats_df`) for every day of my data coverage. This will make it easy to normalize prices for every slice of time relevant to an FDA trial. 
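
Concretely, each close is converted to a z-score against the same day's cross-company mean and standard deviation. The following is a minimal pure-Python sketch with made-up tickers and prices; the helper names are hypothetical. It uses `statistics.stdev`, the sample standard deviation, which is also what pandas computes by default.

```python
import statistics

def daily_stats(closes_by_ticker):
    """Mean and sample standard deviation of one day's closes, skipping gaps."""
    values = [v for v in closes_by_ticker.values() if v is not None]
    return statistics.mean(values), statistics.stdev(values)

def normalize(close, mean, stdv):
    """Express a close as a z-score relative to that day's index."""
    return (close - mean) / stdv

# Hypothetical closes for a single trading day; None marks a missing price
day = {"AAA": 10.0, "BBB": 20.0, "CCC": 30.0, "DDD": None}
mean, stdv = daily_stats(day)
z = normalize(day["AAA"], mean, stdv)
print(mean, stdv, z)
```

A stock that merely rises with the rest of the sector keeps roughly the same z-score, which is the insulation from market-wide events described earlier.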

Time to cut time slices for each clinical trial and generate a population of clinical trials and normalized prices.

In [53]:
norm_data = []
for company in tqdm_notebook(company_list):
    df = price_and_fda[company].join(stats_df, how='left').reset_index()
    pdufa_dates = df.index[df['pdufa?']].tolist()
    if len(pdufa_dates) > 0:
        for date in pdufa_dates:
            pRange = range(date-120, date-7)
            pCloses, pVolumes = [], []
            for i in pRange:
                try:
                    close_price = df.loc[i]['close']
                    volume = df.loc[i]['volume']
                    mean_price = df.loc[i]['CP_mean']
                    stdv_price = df.loc[i]['CP_stdv']
                    pCloses.append(( df.loc[i]['index'],(close_price-mean_price)/(stdv_price) ))
                    pVolumes.append(( df.loc[i]['index'], volume ))
                except Exception:  # e.g. the window reaches back past the first price row
                    pCloses.append(None)
                    pVolumes.append(None)
            norm_data.append((company, df.loc[date]['index'], (pCloses, pVolumes)))
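
The window logic above works on row positions rather than calendar dates. Here is a small sketch of the same idea without pandas, assuming a plain list stands in for the price rows; the function name `pre_event_window` is hypothetical.

```python
def pre_event_window(rows, event_pos, start=120, stop=7):
    """Rows from `start` positions before the event up to, but not including,
    `stop` positions before it. Positions outside the history become None."""
    window = []
    for i in range(event_pos - start, event_pos - stop):
        window.append(rows[i] if 0 <= i < len(rows) else None)
    return window

history = list(range(300))                   # stand-in for 300 daily price rows
full = pre_event_window(history, 200)        # enough history before the event
short = pre_event_window(history, 50)        # event too close to the first row
print(len(full), full[0], full[-1])
print(short.count(None))
```

Every window has 113 entries. When an event falls too close to the start of a company's price history, the leading entries are `None`, which later steps have to handle.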




Now that we have normalized slices, let's add the annotations from our score sheet.

In [54]:
scores = [line.split() for line in open("score_sheet_complete.txt", "r").readlines()]

In [55]:
norm_data_annotated = []
mismatches = []
for datum in tqdm_notebook(norm_data):
    for score in scores:
        if datum[0] == score[0] and datum[1] == score[1]:
            norm_data_annotated.append((datum[0], datum[1], score[2], datum[2]))
            break
    else:
        # No score-sheet row matched this (ticker, date) pair
        mismatches.append((datum[0], datum[1]))
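
The matching step above can also be expressed as a dictionary lookup keyed on (ticker, date), which avoids rescanning the score sheet for every event. This is a sketch under the assumption that each score-sheet line holds a ticker, a date, and an outcome separated by whitespace, as the indexing above implies. The function name `annotate` and the sample lines are hypothetical.

```python
def annotate(events, score_lines):
    """Attach an outcome to each (ticker, date, payload) event."""
    outcomes = {}
    for line in score_lines:
        ticker, date, outcome = line.split()[:3]
        # Keep the first entry for a key, like the loop-and-break version
        outcomes.setdefault((ticker, date), outcome)
    annotated, unmatched = [], []
    for ticker, date, payload in events:
        key = (ticker, date)
        if key in outcomes:
            annotated.append((ticker, date, outcomes[key], payload))
        else:
            unmatched.append(key)
    return annotated, unmatched

# Hypothetical score-sheet lines and events, for illustration only
score_lines = ["AAAP 2016-06-01 1", "XXXX 2015-01-01 0"]
events = [("AAAP", "2016-06-01", "slice-a"), ("YYYY", "2014-02-03", "slice-b")]
annotated, unmatched = annotate(events, score_lines)
print(annotated)
print(unmatched)
```

Collecting the unmatched keys makes it easy to see which events are missing from the score sheet.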




In [56]:
with open("normalized_training_data.pkl", "wb") as f:
    dill.dump(norm_data_annotated, f)

### Checkpoint 3

In [57]:
with open("normalized_training_data.pkl", "rb") as f:
    norm_data_annotated = dill.load(f)

Now we have normalized stock prices in slices covering the 113 trading days from 120 to 8 days before each FDA action date. Let's pull those back into smaller pandas frames for feature extraction.

In [58]:
def assemble_frame(datum):
    df = pd.DataFrame(datum[3][0], columns=['date','norm_price'])
    df['event'] = datum[0]+"/"+datum[1]
    df['outcome'] = int(datum[2])
    return df

In [59]:
first = True

for line in tqdm_notebook(norm_data_annotated):
    try:
        if first:
            agg_data = assemble_frame(line)
            first = False
        else:
            tmp_data = assemble_frame(line)
            agg_data = pd.concat([agg_data, tmp_data],ignore_index=True)
    except Exception:  # slices containing missing days cannot be assembled
        print line[0], line[1], "failed"

COLL 2015-10-12 failed
NEOS 2015-11-09 failed



In [60]:
agg_data['date_stamp'] = pd.to_datetime(agg_data['date'])
event_labels = pd.factorize(agg_data['event'])
agg_data["event_stamp"] = event_labels[0]

Now let's drop the days with null prices (due to either acquisitions or bankruptcies).

In [61]:
agg_data['null'] = agg_data.isnull().sum(axis=1)  # number of missing values in each row
cleaned_agg = agg_data[agg_data['null'] == 0]

In [62]:
cleaned_agg.head()

Unnamed: 0,date,norm_price,event,outcome,date_stamp,event_stamp,null
0,2015-12-08,-0.133607,AAAP/2016-06-01,1,2015-12-08,0,0
1,2015-12-09,-0.128218,AAAP/2016-06-01,1,2015-12-09,0,0
2,2015-12-10,-0.12667,AAAP/2016-06-01,1,2015-12-10,0,0
3,2015-12-11,-0.119063,AAAP/2016-06-01,1,2015-12-11,0,0
4,2015-12-14,-0.11084,AAAP/2016-06-01,1,2015-12-14,0,0


In [63]:
with open('final_cleaned_price_slices.pkl', 'wb') as f:
    dill.dump(cleaned_agg, f)

### Checkpoint 3 - Training data preprocessed

In [64]:
with open('final_cleaned_price_slices.pkl', 'rb') as f:
    cleaned_agg = dill.load(f)

That's a ready-to-extract package of every clinical trial scraped. Let's go ahead and make a train/test split now, while it's easy and convenient.

In [65]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated

In [96]:
train_data, test_data = train_test_split(norm_data_annotated, train_size = .9)
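
Because the split is made on whole events rather than on individual days, all the prices belonging to one trial land on the same side of the split. The sketch below mimics that idea in pure Python with a fixed seed so the split is repeatable. It is an illustration rather than a reproduction of `train_test_split`; the function name `split_events` and the event labels are hypothetical.

```python
import random

def split_events(events, train_fraction=0.9, seed=0):
    """Shuffle whole events with a fixed seed, then cut into train and test."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical event labels, one per clinical trial
events = ["event-%d" % i for i in range(20)]
train, test = split_events(events)
print(len(train), len(test))
```

Fixing the seed means the same events end up in the test set on every run, which keeps later results comparable.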

In [97]:
first = True

for line in tqdm_notebook(train_data):
    try:
        if first:
            train_df = assemble_frame(line)
            first = False
        else:
            tmp_df = assemble_frame(line)
            train_df = pd.concat([train_df, tmp_df],ignore_index=True)
    except Exception:
        print line[0], line[1], "failed"

train_df['date_stamp'] = pd.to_datetime(train_df['date'])
event_labels = pd.factorize(train_df['event'])
train_df["event_stamp"] = event_labels[0]

train_df['null'] = train_df.isnull().sum(axis=1)  # number of missing values in each row
train_clean = train_df[train_df['null'] == 0]

COLL 2015-10-12 failed
NEOS 2015-11-09 failed



In [98]:
first = True

for line in tqdm_notebook(test_data):
    try:
        if first:
            test_df = assemble_frame(line)
            first = False
        else:
            tmp_df = assemble_frame(line)
            test_df = pd.concat([test_df, tmp_df],ignore_index=True)
    except Exception:
        print line[0], line[1], "failed"
test_df['date_stamp'] = pd.to_datetime(test_df['date'])
event_labels = pd.factorize(test_df['event'])
test_df["event_stamp"] = event_labels[0]

test_df['null'] = test_df.isnull().sum(axis=1)  # number of missing values in each row
test_clean = test_df[test_df['null'] == 0]




That's the two halves of our split dataframe. May as well cache them.

In [99]:
with open("final_train_df.pkl", "wb") as f:
    dill.dump(train_clean, f)
with open("final_test_df.pkl", "wb") as f:
    dill.dump(test_clean, f)

### Checkpoint 4 - Test Train Split

In [100]:
with open("final_train_df.pkl", "rb") as f:
    train_clean = dill.load(f)
with open("final_test_df.pkl", "rb") as f:
    test_clean = dill.load(f)

Now for the serious work, extracting features from the pricing data in each case. 

I'll be using [tsfresh](http://tsfresh.readthedocs.io/en/latest/text/quick_start.html) to do the heavy lifting here, and then selecting the most relevant features. While I am able to compute almost 800 features for these data points, I'm going to narrow them down to around ten of the most meaningful ones.

In [101]:
from tsfresh import extract_features

In [102]:
train_feats = extract_features(train_clean[['norm_price', 'event_stamp', 'date_stamp']], 
                              column_id="event_stamp", column_sort="date_stamp", 
                              column_value="norm_price", n_jobs=0).dropna(axis=1)

Feature Extraction: 100%|██████████| 209/209 [00:00<00:00, 25615.38it/s]


In [103]:
train_feats.head()

variable,norm_price__abs_energy,norm_price__absolute_sum_of_changes,"norm_price__agg_autocorrelation__f_agg_""mean""","norm_price__agg_autocorrelation__f_agg_""median""","norm_price__agg_autocorrelation__f_agg_""var""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""intercept""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""rvalue""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""slope""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""stderr""","norm_price__agg_linear_trend__f_agg_""max""__chunk_len_50__attr_""intercept""",...,norm_price__time_reversal_asymmetry_statistic__lag_1,norm_price__time_reversal_asymmetry_statistic__lag_2,norm_price__time_reversal_asymmetry_statistic__lag_3,norm_price__value_count__value_-inf,norm_price__value_count__value_0,norm_price__value_count__value_1,norm_price__value_count__value_inf,norm_price__value_count__value_nan,norm_price__variance,norm_price__variance_larger_than_standard_deviation
0,8.267646,0.632868,0.338676,0.291667,0.0892,-0.193869,-0.807651,-0.011369,0.002625,-0.2092,...,-0.0005965052,-0.001119,-0.001715,0.0,0.0,0.0,0.0,0.0,0.00252,0.0
1,0.050117,0.364996,0.216699,0.061411,0.115088,-0.011717,0.330647,0.001689,0.001524,0.019494,...,5.207836e-07,2e-06,2e-06,0.0,0.0,0.0,0.0,0.0,0.000323,0.0
2,6.148967,1.010344,0.110301,0.130618,0.115312,-0.21915,0.246763,0.003207,0.003983,-0.111155,...,-9.008112e-05,-0.00013,-0.000176,0.0,0.0,0.0,0.0,0.0,0.001734,0.0
3,706.617635,2.724921,0.062402,-0.030932,0.079144,2.515227,0.432196,0.007583,0.005003,2.611413,...,0.04545126,0.089641,0.125181,0.0,0.0,0.0,0.0,0.0,0.006103,0.0
4,4.240612,0.6622,-0.015751,-0.165876,0.104243,-0.178728,-0.017051,-7.9e-05,0.001462,-0.167307,...,1.025592e-05,2.6e-05,5.2e-05,0.0,0.0,0.0,0.0,0.0,0.000349,0.0


In [104]:
train_y =\
train_df[['event_stamp', 'outcome']]\
.groupby('event_stamp')\
.head(1).set_index('event_stamp')['outcome']

In [105]:
train_y.head()

event_stamp
0    0
1    1
2    0
3    1
4    0
Name: outcome, dtype: int64

In [106]:
test_feats = extract_features(test_clean[['norm_price', 'event_stamp', 'date_stamp']], 
                              column_id="event_stamp", column_sort="date_stamp", 
                              column_value="norm_price", n_jobs=0).dropna(axis=1)

Feature Extraction: 100%|██████████| 24/24 [00:00<00:00, 22201.87it/s]


In [107]:
test_feats.shape

(24, 622)

In [108]:
test_y =\
test_df[['event_stamp', 'outcome']]\
.groupby('event_stamp')\
.head(1).set_index('event_stamp')['outcome']

In [109]:
test_y.shape

(24,)

In [110]:
with open('final_train_features.pkl', 'wb') as f:
    dill.dump(train_feats, f)
with open('final_test_features.pkl', 'wb') as f:
    dill.dump(test_feats, f)

### Checkpoint 4 - Extracted Features

In [111]:
with open("final_train_features.pkl", "rb") as f:
    train_feats = dill.load(f)
with open("final_test_features.pkl", "rb") as f:
    test_feats = dill.load(f)

Now it's time to pick out 10 or so meaningful features from the 622 possible features. Time for some reading. Then it'll be time to apply those to a classification model.

In [112]:
print"\n".join(list(train_feats.columns.values))

norm_price__abs_energy
norm_price__absolute_sum_of_changes
norm_price__agg_autocorrelation__f_agg_"mean"
norm_price__agg_autocorrelation__f_agg_"median"
norm_price__agg_autocorrelation__f_agg_"var"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"intercept"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"rvalue"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"slope"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"stderr"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_50__attr_"intercept"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_50__attr_"rvalue"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_50__attr_"slope"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_50__attr_"stderr"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_5__attr_"intercept"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_5__attr_"rvalue"
norm_price__agg_linear_trend__f_agg_"max"__chunk_len_5__attr_"slope"
norm_price__agg_li

In [159]:
features_of_interest = ['norm_price__mean',
                        'norm_price__median',
                        'norm_price__mean_change',
                        #'norm_price__mean_abs_change',
                        'norm_price__first_location_of_maximum',
                        'norm_price__first_location_of_minimum',
                        'norm_price__linear_trend__attr_"slope"',
                        'norm_price__count_above_mean',
                        'norm_price__count_below_mean'
                       ]
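
To make the selected features more concrete, here is a hand-rolled sketch of what each one measures for a single price slice. These follow my understanding of the tsfresh definitions (locations are expressed as a fraction of the series length, and the slope is a least-squares fit against position); tsfresh's own documentation is the authority. The function name `summarize` and the sample series are hypothetical.

```python
import statistics

def summarize(series):
    """Approximate the selected features for one list of normalized prices."""
    n = len(series)
    mean = statistics.mean(series)
    # Least-squares slope of the values against their position 0..n-1
    t_mean = (n - 1) / 2.0
    slope = (sum((t - t_mean) * (x - mean) for t, x in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return {
        "mean": mean,
        "median": statistics.median(series),
        "mean_change": (series[-1] - series[0]) / float(n - 1),
        "first_location_of_maximum": series.index(max(series)) / float(n),
        "first_location_of_minimum": series.index(min(series)) / float(n),
        "slope": slope,
        "count_above_mean": sum(1 for x in series if x > mean),
        "count_below_mean": sum(1 for x in series if x < mean),
    }

features = summarize([1.0, 3.0, 2.0, 5.0, 4.0])
print(features)
```

Taken together, these describe the level of the slice (mean, median), its drift (mean change, slope), and where its extremes fall within the window.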

In [160]:
print train_feats[features_of_interest].shape
train_feats[features_of_interest].head()

(209, 8)


variable,norm_price__mean,norm_price__median,norm_price__mean_change,norm_price__first_location_of_maximum,norm_price__first_location_of_minimum,"norm_price__linear_trend__attr_""slope""",norm_price__count_above_mean,norm_price__count_below_mean
0,-0.265791,-0.249245,-0.001468,0.353982,0.99115,-0.001221,79.0,34.0
1,-0.01096,-0.008207,0.000124,0.283186,0.097345,0.000295,70.0,43.0
2,-0.229526,-0.226107,-0.001035,0.0,0.380531,0.000399,60.0,53.0
3,2.49943,2.514329,0.001829,0.80531,0.00885,0.001168,66.0,47.0
4,-0.192816,-0.190763,0.000123,0.946903,0.725664,-9.3e-05,58.0,55.0


In [161]:
print test_feats[features_of_interest].shape
test_feats[features_of_interest].head()

(24, 8)


variable,norm_price__mean,norm_price__median,norm_price__mean_change,norm_price__first_location_of_maximum,norm_price__first_location_of_minimum,"norm_price__linear_trend__attr_""slope""",norm_price__count_above_mean,norm_price__count_below_mean
0,-0.421011,-0.409471,-0.000908,0.557522,0.929204,-0.000749,81.0,32.0
1,-0.231476,-0.232116,-0.000483,0.539823,0.256637,-2.4e-05,56.0,57.0
2,-0.557929,-0.558836,1.1e-05,0.955752,0.628319,-0.0001,51.0,62.0
3,-0.251067,-0.24005,6.4e-05,0.123894,0.699115,-0.000282,82.0,31.0
4,-0.284424,-0.289434,-0.000335,0.318584,0.867257,-0.000814,54.0,59.0


That's our split data, with our features of interest. Let's begin modeling.

In [166]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

In [178]:
scaler = StandardScaler()
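# SVC defaults to the RBF kernel, which ignores `degree` and `coef0`,
# so only `C` actually changes the fitted model in the grid below
classifier = SVC(C=1, coef0=1, degree=1)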
params = {"C":range(1,5),
          "degree":range(1,3),
          "coef0":range(1,3)
         }
classifier_gs = GridSearchCV(classifier, params)

In [179]:
# Note: this fits on every extracted column in train_feats, not only features_of_interest
classifier_gs.fit(scaler.fit_transform(train_feats), train_y)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=1, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1, 2, 3, 4], 'coef0': [1, 2], 'degree': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [180]:
classifier_gs.best_params_

{'C': 1, 'coef0': 1, 'degree': 1}

In [181]:
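# Note: cross_val_score refits fresh clones of the base `classifier` on folds of the
# test set; it does not score the grid-searched model held in `classifier_gs`
cross_val_score(classifier, scaler.transform(test_feats), y=test_y)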

array([ 0.66666667,  0.75      ,  0.71428571])
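
The scaling used in the cells above follows a fit-on-train, apply-to-test pattern: the per-column mean and standard deviation are learned from the training features only, then reused unchanged on the test features. Here is a minimal pure-Python sketch of that pattern with made-up numbers. The helper names are hypothetical, and the sketch assumes no column is constant (a zero standard deviation would divide by zero).

```python
import math

def fit_scaler(rows):
    """Learn per-column mean and population standard deviation."""
    n = float(len(rows))
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Population standard deviation (divide by n), as StandardScaler does
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def transform(rows, means, stds):
    """Apply previously learned statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical two-feature data
train_rows = [[1.0, 10.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]
means, stds = fit_scaler(train_rows)            # statistics come from train only
train_scaled = transform(train_rows, means, stds)
test_scaled = transform(test_rows, means, stds)
print(train_scaled, test_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into the preprocessing step.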

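That's a trained model, with cross-validated accuracy between 0.67 and 0.75 on the held-out events. Let's pickle it for safekeeping.

In [185]:
# `classifier` itself is never fitted (GridSearchCV fits clones of it),
# so save the refit best estimator from the grid search instead
with open("final_trained_svc.pkl", "wb") as f:
    dill.dump(classifier_gs.best_estimator_, f)

### Checkpoint 5 - Trained Model

In [186]:
with open("final_trained_svc.pkl", "rb") as f:
    classifier = dill.load(f)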

Now that we have a working predictor, let's play with some visualizations to show how powerful it can be.