# Financial information scraper


### Import scraping tools

We can use a Yahoo Finance scraper that I found, which I will post in the repo: github.com/lukaszbanasiak/yahoo-finance. It can be built from source using the setup.py method (pip seemed to be broken). For now I'm downloading a Russell 3000 PDF (www.ftserussell.com/files/support-documents/membership-russell-3000; I'm unsure how stable that URL is) and scraping the tickers from that file using pyPdf (pip install pyPdf).

In [18]:
import yahoo_finance as yf #for finance data
import pyPdf as pp         #for reading pdf
import re                  #for regex matching
import numpy as np         #for building array
from tqdm import tqdm
import pickle as pkl

### Brief example of yahoo finance package

The finance package lets us get a range of data about a specific stock. In addition to price data, we can also get some historical data, trading volumes, etc.

In [4]:
yahoo = yf.Share("FLWS") #Let's show the big dog some respect and see how the king of fantasy sports is doing

In [5]:
print yahoo.get_open() #Today's opening price

9.39


In [6]:
print yahoo.get_200day_moving_avg() #200 day moving average of stock price

8.35


In [7]:
print yahoo.get_volume() #The daily trade volume

101298


## Let's Get Down to Business

To defeat. The Huns. Did they send me nonsense, when I asked for tickers?

I've pulled a PDF of the membership list from Russell's website. This one is dated June 27th, 2016. I'm not sure how often they update it (quarterly? as needed?). Either way, we probably don't need many copies of it lying around; we just want up-to-date ticker data.

The workflow for scraping a PDF like this is as follows. First we connect a file reader to the PDF using the pyPdf package. Once this connection is open, we can iterate through the pages of the PDF and extract the text. I've noticed that the PDF has a specific structure we can exploit to match tickers: in the extracted text, each ticker follows a newline, is one to four characters long, and is followed by four whitespace characters.

This structure depends on how Russell lays out the PDF, so any change they make might break this.

Once the raw text is extracted, we can use the pattern to find tickers.

In [8]:
russ = pp.PdfFileReader(open("ru3000_membershiplist_20160627.pdf", "rb"))

In [9]:
ticker_pattern = re.compile(r"\n(\w\w?\w?\w?)\s\s\s\s")
tickers = []
for page in tqdm(russ.pages):
    raw_text = page.extractText()
    tickers.extend(ticker_pattern.findall(raw_text))

100%|██████████| 33/33 [00:06<00:00,  5.23it/s]
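To see what the pattern does and does not pick up, here is a minimal sketch run against a made-up page of text. The `sample_text` layout is an assumption for illustration only; the real text that pyPdf extracts may look different. The pattern is written in a compact form, `\w{1,4}` and `\s{4}`, that matches the same strings as the one used above.

```python
import re

# Compact form of the notebook's pattern: newline, 1-4 word characters,
# then exactly four whitespace characters.
ticker_pattern = re.compile(r"\n(\w{1,4})\s{4}")

# Hypothetical extracted text; the real PDF layout is not guaranteed to match.
sample_text = (
    "Russell 3000 Membership\n"
    "FLWS    1-800-FLOWERS.COM\n"
    "SRCE    1ST SOURCE CORP\n"
    "A    AGILENT TECHNOLOGIES\n"
    "GOOGL    ALPHABET INC\n"
)

found = ticker_pattern.findall(sample_text)
print(found)  # ['FLWS', 'SRCE', 'A']
```

Note that `GOOGL` is skipped: the pattern only accepts one to four word characters, so any longer ticker, or one containing a dot, would be missed.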


Note the use of extend vs append. Each call to ticker_pattern.findall() returns a list, so using the append method would give us a list of lists. With extend, each list returned by ticker_pattern.findall() is instead added as a continuation of the current list. We can inspect the result below.
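The difference is easy to check on a toy input. In this sketch, `page_results` is a stand-in for the per-page lists that findall() would return; the values are just the first few tickers from the list.

```python
# Stand-in for the list that findall() returns for each page.
page_results = [["FLWS", "SRCE"], ["FOXA", "FOX"]]

appended = []
extended = []
for result in page_results:
    appended.append(result)  # adds the whole list as one element
    extended.extend(result)  # adds each ticker individually

print(appended)  # [['FLWS', 'SRCE'], ['FOXA', 'FOX']]
print(extended)  # ['FLWS', 'SRCE', 'FOXA', 'FOX']
```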

In [10]:
print tickers

[u'FLWS', u'SRCE', u'FOXA', u'FOX', u'TWOU', u'DDD', u'MMM', u'EGHT', u'AVHI', u'ATEN', u'AAC', u'AAON', u'AIR', u'AAN', u'ABAX', u'ABT', u'ABBV', u'ANF', u'ABMD', u'ABM', u'AXAS', u'ACIA', u'ACTG', u'ACHC', u'ACAD', u'AKR', u'AXDX', u'XLRN', u'ACN', u'ANCX', u'ACCO', u'ARAY', u'ACRX', u'ACET', u'ACHN', u'ACIW', u'ACRS', u'ACNB', u'ACOR', u'ATVI', u'ACTA', u'ATU', u'AYI', u'ACXM', u'ADMS', u'AE', u'ADUS', u'ADPT', u'ADBE', u'ADTN', u'ADRO', u'AAP', u'WMS', u'AEIS', u'AMD', u'ADXS', u'ADVM', u'ABCO', u'ACM', u'AEGN', u'AEPI', u'AERI', u'HIVE', u'AJRD', u'AVAV', u'AES', u'AET', u'AMG', u'AFL', u'MITT', u'AGCO', u'AGEN', u'AGRX', u'A', u'AGYS', u'AGIO', u'GAS', u'ADC', u'AGFS', u'AIMT', u'AL', u'AIRM', u'APD', u'ATSG', u'AYR', u'AKS', u'AKAM', u'AKBA', u'AKRX', u'ALG', u'ALRM', u'ALK', u'AIN', u'AMRI', u'ALB', u'AA', u'ALDR', u'ALR', u'ALEX', u'ALX', u'ARE', u'ALXN', u'ALCO', u'ALGN', u'ALJJ', u'ALKS', u'Y', u'ATI', u'ABTX', u'ALGT', u'ALLE', u'AGN', u'ALE', u'ADS', u'AOI', u'LNT', u'AMOT

Ticker idea for flame capital: LGTM because that Looks Good to Me, baby!!!

## Scraping Open and Close from a date range

Now that we have a set of tickers, we can iterate through them (and try our hardest not to get banhammered by Yahoo) to get historical data.

First we're going to set dates. In a script setting, we can probably set these as input options to the script.

In [15]:
start_date = '2015-09-03' 
end_date = '2016-09-03'       
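If this moves into a script, the dates could become command-line options. The following is one possible sketch using argparse from the standard library. The option names `--start-date` and `--end-date` and the `valid_date` helper are hypothetical choices, not something the notebook already defines; the defaults are the dates used above.

```python
import argparse
from datetime import datetime

def valid_date(text):
    """Return text unchanged if it parses as YYYY-MM-DD, else raise an argparse error."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError("expected YYYY-MM-DD, got %r" % text)
    return text

def build_parser():
    # Hypothetical option names; defaults mirror the notebook's date range.
    parser = argparse.ArgumentParser(description="Scrape historical prices for a ticker list")
    parser.add_argument("--start-date", type=valid_date, default="2015-09-03")
    parser.add_argument("--end-date", type=valid_date, default="2016-09-03")
    return parser

# Passing an explicit list keeps this runnable inside a notebook;
# a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["--start-date", "2016-01-01"])
start_date, end_date = args.start_date, args.end_date
```

The dates stay as strings because that is the form the notebook already passes to get_historical().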

For each ticker, we take the historical data and store the results in a list. This time we use append, so that each ticker keeps its own entry in the list. This takes a while to run (about two and a half hours for the 2,964 tickers in the run below).

In [16]:
ticker_pricedata_raw = []
ticker_metadata_raw = []
failed_tickers = []
for tick in tqdm(tickers):
    ticker_reader = yf.Share(tick)
    try:
        ticker_pricedata_raw.append(ticker_reader.get_historical(start_date, 
                                                             end_date))
        ticker_metadata_raw.append(ticker_reader.get_info())
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        ticker_pricedata_raw.append([])
        ticker_metadata_raw.append([])
        failed_tickers.append(tick)

100%|██████████| 2964/2964 [2:30:12<00:00,  2.77s/it]
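The loop above records failures but never pauses or retries. One way to be gentler on the server is to wrap each request in a small retry helper that sleeps between attempts. This is only a sketch: `fetch_with_retry` and `flaky_fetch` are hypothetical names, the retry count and pause length are guesses, and I don't know what rate Yahoo actually tolerates.

```python
import time

def fetch_with_retry(fetch, tick, retries=3, pause=1.0):
    """Call fetch(tick), sleeping between attempts; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(tick)
        except Exception:
            # Wait a little longer after each failure before trying again.
            time.sleep(pause * (attempt + 1))
    return None

# Simulated fetcher that fails twice and then succeeds, so the sketch runs offline.
calls = {"n": 0}

def flaky_fetch(tick):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated network error")
    return [{"Symbol": tick}]

result = fetch_with_retry(flaky_fetch, "FLWS", pause=0)
```

In the notebook, `fetch` could be something like `lambda t: yf.Share(t).get_historical(start_date, end_date)`, and a `None` result would mark the ticker as failed.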


In [21]:
with open('20160903_20150903_Rus3000.pkl', 'wb') as save_data:
    pkl.dump(ticker_pricedata_raw, save_data)

In [24]:
with open('20160627_Rus3000_meta.pkl', 'wb') as save_data:
    pkl.dump(ticker_metadata_raw, save_data)
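To use the saved data later, the pickle can be loaded back and paired with the ticker list. Because failed tickers were stored as empty lists, the data stays aligned with `tickers` position by position. The sketch below uses small stand-in data and a temporary file; the record fields shown (`Date`, `Close`) are an assumption about what get_historical() returns, and the values are placeholders.

```python
import os
import pickle as pkl
import tempfile

# Stand-in data; in the notebook these would be tickers and ticker_pricedata_raw.
tickers = ["FLWS", "SRCE", "FOXA"]
ticker_pricedata_raw = [
    [{"Date": "2016-09-02", "Close": "1.00"}],  # placeholder record
    [],                                         # a ticker whose download failed
    [{"Date": "2016-09-02", "Close": "2.00"}],  # placeholder record
]

# Write to a temporary file so the sketch does not touch the real pickles.
path = os.path.join(tempfile.mkdtemp(), "prices.pkl")
with open(path, "wb") as save_data:
    pkl.dump(ticker_pricedata_raw, save_data)

with open(path, "rb") as saved:
    loaded = pkl.load(saved)

# zip() pairs each ticker with its rows; empty rows mean the download failed.
prices_by_ticker = {tick: rows for tick, rows in zip(tickers, loaded) if rows}
missing = [tick for tick, rows in zip(tickers, loaded) if not rows]
```

Since the pickles hold only the data lists, it may be worth saving `tickers` (and `failed_tickers`) alongside them so the pairing can be rebuilt without rerunning the PDF scrape.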