# Finding historical PDUFA dates:
After getting in touch with the maintainers of [FDA Tracker](https://www.fdatracker.com/fda-calendar/), I had a better understanding of the resources at hand. Unfortunately, the FDA does not publish PDUFA dates. Some groups do, however, and I was able to use their Google Calendar to streamline the process.

After validating the model, I can hopefully justify the time commitment to scrape the websites of 278 individual pharmaceutical companies, or begin keeping a running database of PDUFA dates scraped in my other notebook, `scrapingFuturePdufas.ipynb`.

In [1]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, inspect
from datetime import datetime
from tqdm import tqdm_notebook

In [2]:
engine = create_engine('sqlite:///capstone.db')

First things first, let's import the needed libraries for working with the DB

...and one for working with the [.ics-formatted list of historical FDA dates](https://calendar.google.com/calendar/ical/5dso8589486irtj53sdkr4h6ek%40group.calendar.google.com/public/basic.ics), which I [pulled down from Google Calendar](https://support.google.com/calendar/answer/37111?hl=en).

In [3]:
from urllib2 import urlopen
import ics

...Let's open up that calendar

In [4]:
FdaUrl = "https://calendar.google.com/calendar/ical/5dso8589486irtj53sdkr4h6ek%40group.calendar.google.com/public/basic.ics"

In [5]:
FdaCal = ics.Calendar(urlopen(FdaUrl).read().decode('iso-8859-1'))

In [6]:
FdaCal

<Calendar with 543 events>

In [7]:
FdaCal.events

[<all-day Event 'NEW RIVER PHARMACEUTICALS INC PDUFA' 2006-10-06>,
 <all-day Event 'GENTA INC DE PDUFA' 2006-10-29>,
 <all-day Event 'ENCYSIVE PHARMACEUTICALS INC PDUFA' 2006-11-02>,
 <all-day Event 'MILLENNIUM PHARMACEUTICALS INC PDUFA' 2006-12-09>,
 <all-day Event 'INDEVUS PHARMACEUTICALS INC PDUFA' 2007-05-03>,
 <all-day Event 'Valera Pharmaceuticals Inc PDUFA' 2007-05-03>,
 <all-day Event 'SHPGY Shire plc PDUFA' 2007-05-21>,
 <all-day Event 'MiddleBrook Pharmaceuticals Inc PDUFA' 2007-05-22>,
 <all-day Event 'CRITICAL THERAPEUTICS INC PDUFA' 2007-05-31>,
 <all-day Event 'ENCYSIVE PHARMACEUTICALS INC PDUFA' 2007-06-15>,
 <all-day Event 'DOR BIOPHARMA INC PDUFA' 2007-07-21>,
 <all-day Event 'KV KV PHARMACEUTICAL CO DE PDUFA' 2007-07-29>,
 <all-day Event 'INDEVUS PHARMACEUTICALS INC PDUFA' 2007-08-13>,
 <all-day Event 'SPPI SPECTRUM PHARMACEUTICALS INC PDUFA' 2007-08-15>,
 <all-day Event 'NBIX NEUROCRINE BIOSCIENCES INC PDUFA' 2007-08-21>,
 <all-day Event 'TERCICA INC PDUFA' 2007-08-3

So I'm gonna need to get the stock ticker symbols out of each event name, along with the start/end dates; most of these appear to be all-day, single-day events. Sounds like a job for...
###### REGULAR EXPRESSIONS DUM-DA-DUM

In [8]:
import re

In [9]:
# 3-4 capital letters at the very start of the name, followed by one non-word character
tickerRe = re.compile(r"\A[A-Z]{3,4}\W")

In [10]:
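past_pdufa = []
for event in FdaCal.events:
    # Look for a leading ticker-like token in the event name
    matches = tickerRe.findall(event.name)
    if matches:
        eDate = event.begin.datetime         # event start as a datetime
        eComp = str(matches[0]).strip()      # strip the trailing whitespace the pattern captured
        past_pdufa.append((eComp, eDate))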
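
To get a feel for what the pattern above does and doesn't catch, here is a small standalone check against a few event names copied from the calendar output earlier. It is only a sketch: `ticker_re`, `sample_names`, and `extract_ticker` are names made up for this illustration, and it uses `match` instead of `findall`, which is equivalent here because the pattern is anchored with `\A`.

```python
import re

# Same pattern as tickerRe above, redefined so this snippet runs on its own
ticker_re = re.compile(r"\A[A-Z]{3,4}\W")

# Event names copied from the calendar output shown earlier
sample_names = [
    "NEW RIVER PHARMACEUTICALS INC PDUFA",
    "SHPGY Shire plc PDUFA",
    "KV KV PHARMACEUTICAL CO DE PDUFA",
    "SPPI SPECTRUM PHARMACEUTICALS INC PDUFA",
    "NBIX NEUROCRINE BIOSCIENCES INC PDUFA",
    "GENTA INC DE PDUFA",
]


def extract_ticker(name):
    """Return the leading 3-4 letter token, or None if the name has none."""
    match = ticker_re.match(name)
    return match.group(0).strip() if match else None


results = {name: extract_ticker(name) for name in sample_names}
for name in sample_names:
    print(repr(name), "->", results[name])
```

Two things stand out. `NEW` comes back as a "ticker" because the first word of the company name happens to be three capital letters, and symbols outside the 3-4 letter range, such as `SHPGY` (five letters) and `KV` (two letters), are skipped entirely.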

In [15]:
print(len(past_pdufa))

475


So here is our list of **475** ticker/date pairs from the past PDUFA dataset, including a few strings that slipped past the regex, like `NEW` and `INC`. I'm going to run this list of ticker symbols against AlphaVantage to finish rounding out my training dataset. That _should_ weed out the non-tickers, and I can begin feature extraction tomorrow.
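
The clean-up step could look roughly like the sketch below. It assumes some callable that answers "is this a real symbol?"; here that callable is a hypothetical stand-in backed by a hard-coded set, not an actual AlphaVantage request, and `filter_valid_tickers`, `symbol_exists`, and `known_symbols` are all names invented for this example. Answers are cached so each distinct ticker is only looked up once.

```python
def filter_valid_tickers(candidates, symbol_exists):
    """Split (ticker, date) pairs into those that pass symbol_exists and those that don't."""
    cache = {}            # ticker -> bool, so repeated tickers cost one lookup
    kept, dropped = [], []
    for ticker, date in candidates:
        if ticker not in cache:
            cache[ticker] = bool(symbol_exists(ticker))
        if cache[ticker]:
            kept.append((ticker, date))
        else:
            dropped.append((ticker, date))
    return kept, dropped


# Hypothetical stand-in for a real symbol lookup (an assumption, not AlphaVantage itself)
known_symbols = {"SPPI", "NBIX"}

candidates = [
    ("NEW", "2006-10-06"),   # first word of a company name, not a ticker
    ("SPPI", "2007-08-15"),
    ("NBIX", "2007-08-21"),
]

kept, dropped = filter_valid_tickers(candidates, lambda t: t in known_symbols)
print("kept:", kept)
print("dropped:", dropped)
```

Keeping the rejected pairs in `dropped` makes it easy to eyeball what the lookup threw away before moving on to feature extraction.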