## Let's go shopping for Stock Ticker Symbols!

In [1]:
import requests
from retrying import retry
from bs4 import BeautifulSoup
import dill
from tqdm import tqdm_notebook

###### Starting from [BioPharm Catalyst](https://www.biopharmcatalyst.com/calendars/fda-calendar)
We can use `requests` to pull the page and `bs4` to parse it, collecting the stock ticker symbols into the set `bpcTickers`

In [2]:
bpcReq = requests.get("https://www.biopharmcatalyst.com/calendars/fda-calendar")

In [3]:
bpcSup = BeautifulSoup(bpcReq.text, "lxml")

In [4]:
bpcTickers = set()
for tag in bpcSup.find_all('a', {'class': 'ticker'}):
    bpcTickers.add(tag.text)
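To see what that loop is doing without hitting the live site, here is a rough, dependency-free sketch of the same idea: collect the text of every `<a>` tag carrying the `ticker` class into a set. It swaps BeautifulSoup for the standard library's `html.parser` (Python 3), and the markup, the `TickerParser` class and the `AAAA`/`BBBB` symbols are all made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Made-up markup for illustration only; the real page may look different.
SAMPLE_HTML = """
<table>
  <tr><td><a class="ticker" href="/company/AAAA">AAAA</a></td></tr>
  <tr><td><a class="ticker" href="/company/BBBB">BBBB</a></td></tr>
  <tr><td><a class="other" href="/about">About</a></td></tr>
  <tr><td><a class="ticker" href="/company/AAAA">AAAA</a></td></tr>
</table>
"""


class TickerParser(HTMLParser):
    """Hypothetical helper: gathers the text of <a> tags with class 'ticker'."""

    def __init__(self):
        super().__init__()
        self.tickers = set()
        self._in_ticker_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "ticker" in classes:
            self._in_ticker_link = True

    def handle_data(self, data):
        # Only keep text that sits inside a ticker link
        if self._in_ticker_link:
            self.tickers.add(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_ticker_link = False


parser = TickerParser()
parser.feed(SAMPLE_HTML)
sample_tickers = parser.tickers
print(sorted(sample_tickers))
```

Because the results go into a set, the repeated `AAAA` row only counts once, which is the same reason `bpcTickers` is a set rather than a list.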

###### Let's now pull a set from another site
How about [valinv](https://www.valinv.com/pdufa/history/1)?
This will require requesting multiple pages.

In [5]:
import re

In [6]:
valTickers = set()
valTagPref = re.compile(r'/stock/[A-Z]{1,5}')
#A RegEx for matching valinv's stock links
maxPgNum = 4
#One more than the highest page number to request from valinv's FDA calendar
#(at time of writing, pages 1-3 were available)
for pgnum in range(1,maxPgNum):
    valReq = requests.get("https://www.valinv.com/pdufa/history/%s" % pgnum)
    valSup = BeautifulSoup(valReq.text, "lxml")
    for tag in valSup.find_all('a', {'href':valTagPref}):
        valTickers.add(tag.text)
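A quick offline check of what that pattern will and won't accept may help here. When a compiled regex is passed as an attribute filter, Beautiful Soup tests it with `search()`, so the pattern can match anywhere inside the `href`. The sample hrefs below are invented for illustration, and the anchored `strict_pref` variant is only an assumption about what a tighter filter could look like, not something the notebook uses.

```python
import re

# Same pattern as valTagPref
val_tag_pref = re.compile(r'/stock/[A-Z]{1,5}')

# Hypothetical href values, invented for illustration
sample_hrefs = [
    "/stock/ABCD",                         # matches
    "/stock/ABCDEFG",                      # also matches: "/stock/ABCDE" is enough
    "/stock/abcd",                         # no match: lowercase
    "/pdufa/history/2",                    # no match: not a stock link
    "https://example.com/stock/XYZ/news",  # matches: pattern may appear anywhere
]

# search() looks for the pattern anywhere in the string
matched = [h for h in sample_hrefs if val_tag_pref.search(h)]

# An anchored variant (an assumption) accepts only exact "/stock/XXXXX" links
strict_pref = re.compile(r'^/stock/[A-Z]{1,5}$')
strict_matched = [h for h in sample_hrefs if strict_pref.search(h)]

print(matched)
print(strict_matched)
```

The loose pattern is fine as long as every matching link really has a ticker as its text; the anchored form is there in case stray links start sneaking into the set.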

So how many ticker symbols have we gathered?

###### Let's pull from another site
How about [RTT News](http://www.rttnews.com/CorpInfo/FDACalendar.aspx?PageNum=1)?

In [13]:
rttTickers = set()
rttTagPref = re.compile(r'http://www\.rttnews\.com/symbolsearch\.aspx\?symbol=[A-Z]{1,5}')
#A RegEx for matching RTT's symbol links
maxPgNum = 6
#One more than the highest page number to request from RTT's FDA calendar
#(at time of writing, pages 1-5 were available)
for pgnum in range(1,maxPgNum):
    rttReq = requests.get("http://www.rttnews.com/CorpInfo/FDACalendar.aspx?PageNum=%s" % pgnum)
    rttSup = BeautifulSoup(rttReq.text, "lxml")
    for tag in rttSup.find_all('a', {'href':rttTagPref}):
        rttTickers.add(tag.text)

In [14]:
print(rttTickers)

set([u'XENT', u'AGIO', u'KITE', u'MDCO', u'AMGN', u'PTCT', u'LLY', u'RARE', u'PTLA', u'VRX', u'HALO', u'DVAX', u'TEVA', u'LGND', u'BMY', u'ARRY', u'MYL', u'RHHBY', u'EGRX', u'KMDA', u' AZN', u'GILD', u'JAZZ', u'FLXN', u'KIND', u'NEOS', u'ATRS', u'OCUL', u'TSRO', u'MRK', u'NVS', u'RDUS', u'ADMS', u'PBYI', u'OTIC', u'CELG', u'PFE', u'ABBV', u'IRWD', u'ACRX', u'IPCI', u' AEZS', u' GSK', u'NVO', u'JNJ', u'SGYP'])


###### ...SO?
We've got every ticker symbol from the first three sites I found dealing with PDUFA dates. Which companies are represented, and how many, follows below.

In [15]:
allTicks = set()
allTicks.update(valTickers)
allTicks.update(bpcTickers)
allTicks.update(rttTickers)

In [19]:
print(len(allTicks))

284
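For a feel of how the combined count relates to the per-site counts, here is a small sketch with made-up stand-ins for the three sets (`bpc`, `val`, `rtt` and their symbols are hypothetical). It also shows how a symbol with stray whitespace gets counted as a separate entry, so a total taken before cleaning could be slightly inflated.

```python
# Hypothetical stand-ins for bpcTickers, valTickers and rttTickers
bpc = {"AAAA", "BBBB", "CCCC"}
val = {"BBBB", "DDDD"}
rtt = {"CCCC", " DDDD", "EEEE"}  # note the leading space on " DDDD"

# Set union drops exact duplicates across sites
all_ticks = bpc | val | rtt

# Per-site sizes, for comparison with the size of the union
per_site = {"bpc": len(bpc), "val": len(val), "rtt": len(rtt)}

# Symbols that show up on every site
on_every_site = bpc & val & rtt

print(per_site, len(all_ticks), sorted(on_every_site))
```

Here the three sites list eight symbols between them, the union holds six, and `"DDDD"` and `" DDDD"` are still counted separately.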


Okay, so we've got our symbols. Quandl had issues because some of them had whitespace, so let's nip this in the bud right here and be done with it.

In [17]:
cleanTicks = set()
for tick in allTicks:
    cleanTicks.add(tick.strip())
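The same cleanup can be written as a set comprehension. In this sketch `u' AZN'` and `u' GSK'` are taken from the RTT output above, while the unpadded `u'AZN'` is added as an assumption to show that a padded and an unpadded copy of the same symbol collapse into one entry.

```python
# ' AZN' and ' GSK' appear in the RTT output; the unpadded 'AZN' is assumed
raw_ticks = {u'AZN', u' AZN', u' GSK', u'JNJ'}

# strip() removes leading and trailing whitespace from each symbol
clean_ticks = {tick.strip() for tick in raw_ticks}

print(len(raw_ticks), len(clean_ticks))
print(sorted(clean_ticks))
```

So the cleaned set can end up smaller than the raw one, and comparing the two lengths is a cheap sanity check.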

Let's pickle this for later use, so we can save my poor cloud box some RAM

...we're gonna need it when we start pulling stock prices for all these companies

In [18]:
with open('Set_of_Ticker_Symbols.pkl', 'wb') as pklFile:  #binary mode, since pickles are bytes
    dill.dump(cleanTicks, pklFile)
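For the "later use" part, here is a minimal round-trip sketch. It uses the standard library's `pickle` as a stand-in so it runs without extra installs; `dill.dump` and `dill.load` take the same arguments. The sample set and the temporary directory are assumptions for illustration, so the real `Set_of_Ticker_Symbols.pkl` is left alone.

```python
import os
import pickle
import tempfile

# Hypothetical sample standing in for cleanTicks
clean_ticks = {"AZN", "GSK", "JNJ"}

# Write into a throwaway directory rather than next to the notebook
path = os.path.join(tempfile.mkdtemp(), "Set_of_Ticker_Symbols.pkl")

# Pickles are bytes, so the file is opened in binary mode both ways
with open(path, "wb") as pkl_file:
    pickle.dump(clean_ticks, pkl_file)

with open(path, "rb") as pkl_file:
    restored = pickle.load(pkl_file)

print(restored == clean_ticks)
```

Loading the file back right after writing it is a cheap way to confirm the pickle is usable before the notebook that needs it gets started.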

That's all there is to see here, folks