# Olson Family 2017 summer project
# Part 2: Stock Market

## Attributions

##### Information on TensorFlow, Keras, etc.:

- "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron
- Aurélien's GitHub TensorFlow collection https://github.com/ageron/awesome-tensorflow
- Keras website https://keras.io/ - this is a higher-level abstraction that can run on TensorFlow and others, soon to be incorporated directly in TensorFlow
- McFly GitHub https://github.com/NLeSC/mcfly - this is a higher-level abstraction that runs on Keras, used to do preliminary exploration of model architectures and hyperparameters

##### Data for Stock Market:

- Originally from Yahoo Finance website - we will use up to 9 years of irregularly collected data on each of 557 individual stocks that are listed on web pages.
This includes items of interest to humans such as Market Cap, Trailing and Forward Price/Earnings ratio, Revenue per Share, Short Ratio, Trailing Annual Dividend Rate, Dividend Date and Ex-Dividend Date, etc.
- "sentdex" YouTube video of a different approach to analyzing the data - where I found out about this data set https://www.youtube.com/watch?v=AleGZ9dkfPs&list=PLQVvvaa0QuDd0flgGphKCej-9jp-QdzZ3&index=3
- Data itself collected and available at https://pythonprogramming.net/static/downloads/machine-learning-data/intraQuarter.zip
- That is quite a bit of data; I will let you download it. The data we use should go in the directory `StockMarket\TestData\intraQuarter\_KeyStats`
- The processed spreadsheet version of this data can be found in `StockMarket\mdo.xlsx` - it will be combined with some other data from below

##### U.S. Government statistics

###### This is quite the rabbit-hole!!! I haven't yet narrowed it down to what I want
- Summary http://guides.library.cornell.edu/c.php?g=31400&p=199827
- Got some statistics from https://www.federalreserve.gov/datadownload/
- https://www.federalreserve.gov/releases/h15/

| Series | Available From | Available To | Observations | Description |
| --- | --- | --- | --- | --- |
| `H15/H15/RIFSGFSY01_N.M` | 1959-07 | 2017-07 | 697 | 1-year Treasury bill secondary market rate, discount basis |
| `H15/H15/RIFSPBLP_N.M` | 1949-01 | 2017-07 | 823 | Average majority prime rate charged by banks on short-term loans to business, quoted on an investment basis |

- https://www.data.gov/
- https://nces.ed.gov/partners/fedstat.asp
- https://www.treasury.gov/resource-center/faqs/Interest-Rates/Pages/faq.aspx
- https://www.bls.gov/cpi/#data
- http://www.esa.doc.gov/
- https://www.census.gov//foreign-trade/statistics/historical/index.html

##### Industry Codes

- We will probably base this on http://www.nasdaq.com/screening/companies-by-industry.aspx and https://www.sec.gov/cgi-bin/browse-edgar?CIK= - note that these together do not include all our stocks
- Place where I found out about some of these resources - https://mktstk.com/2015/03/03/sic-lookup-by-stock-symbol/

### Other Interesting Related Web Resources

- S. Raval - Predict Stock Prices - https://www.youtube.com/watch?v=ftMq5ps503w - https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo
- RoboBuffett - Predict future stock performance based on textual analysis of SEC filings - https://github.com/dandelionmane/RoboBuffett.git

## Introduction

We will create a system designed to tell us when to buy, sell, or hold individual stocks. Of course, caveat emptor: it may give bad advice.

##### That has been tried before and it didn't work

Many clever people work full-time on this and nobody has cornered the market yet.

In our summer project we don't expect a breakthrough. Still, this will introduce us to the kind of analysis used in many problems of this type and the difficulties involved. And who knows - we might get rich!

##### We will use Deep Learning and Time-Series Analysis

Deep Learning uses neural networks in a many-layered structure. We will use the libraries Keras and TensorFlow for the actual implementation of the neural network.

Time-Series means that the data has a time tag associated with it, that the data is in time order, and that we will be predicting data that is beyond the latest data.

##### Time-Series Analysis is subject to change in the underlying reality

One interesting aspect of this is that a model optimized for stock market predictions in one time period may give results that are correct for that period but do not apply in a later one. For instance, the most important factors affecting steel producer stocks in the early 1900s would not have included factors regarding aluminum producers. Even during the training itself it is important to feed the data to the training algorithm in time order so that the latest data has a bit more impact than earlier data. The common K-fold cross-validation we did for the housing project, with its static dataset, would not necessarily apply here.

The magnitude of this effect would in general be different for each and every stock. An interesting fact to take into account in neural network design!
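
To make the contrast with K-fold concrete, here is a minimal sketch of a "walk-forward" split, where every test block comes strictly after the data used for training. The function name `walk_forward_splits` and its parameters are my own illustration, not something from the project code, and it assumes the samples are already sorted oldest to newest.

```python
def walk_forward_splits(n_samples, n_splits, min_train):
    """Yield (train_indices, test_indices) pairs in strict time order.

    Assumes index 0 is the oldest sample and n_samples - 1 the newest.
    The training window grows with each split; the test block always
    lies after it, so the model never "sees the future".
    """
    if n_splits < 1 or min_train < 1 or min_train >= n_samples:
        raise ValueError("need n_splits >= 1 and 0 < min_train < n_samples")
    test_size = (n_samples - min_train) // n_splits
    if test_size < 1:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        train_end = min_train + k * test_size
        test_end = train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical toy case: 10 time-ordered samples, 3 splits.
splits = list(walk_forward_splits(n_samples=10, n_splits=3, min_train=4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

scikit-learn ships a similar helper, `TimeSeriesSplit`, which would probably be the more natural choice in a real pipeline; the hand-rolled version above is only meant to show the idea.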

### Reading and Pre-Processing the data

This time the data does not come in a convenient spreadsheet form. We get it from the Yahoo Finance website.

We have a nearly 10-year collection of the data. It includes typographical errors, inconsistencies, and all kinds of information in various forms that we will need to interpret.

In addition to the data being numerical and categorical information that was known at a sequence of times, it contains other dates of interest inside it such as when the last set of financial reports were filed, when the next stockholder meeting is, etc. We will need to decide how to present these dates to the neural net.
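
One possible way to present such dates, sketched below, is to turn each one into a signed number of days relative to the date the page was captured. This is only an assumption about what might work, not a decision we have made; the helper `days_until`, the feature names, and the example dates are all hypothetical.

```python
from datetime import date


def days_until(snapshot, event):
    """Signed day count from the snapshot date to the event date.

    Positive: the event is still in the future at snapshot time.
    Negative: the event has already happened.
    None: the event date was missing in the source data.
    """
    if event is None:
        return None
    return (event - snapshot).days


snapshot = date(2012, 3, 1)        # hypothetical date the page was captured
ex_dividend = date(2012, 3, 15)    # hypothetical Ex-Dividend Date on that page
last_report = date(2011, 12, 31)   # hypothetical most recent report date

features = {
    "days_to_ex_dividend": days_until(snapshot, ex_dividend),
    "days_to_last_report": days_until(snapshot, last_report),
    "days_to_next_meeting": days_until(snapshot, None),  # date not available
}
print(features)
```

A relative encoding like this keeps the numbers small and comparable across years, but it still leaves open how to feed a missing date (`None`) to the network.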

We will augment this data with other data obtained from other sources such as the U.S. Treasury department to get interest rates and possibly other information not uniquely associated with a particular stock. This turns out to be more difficult than you would think.
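
One piece of this that can be sketched is lining up a monthly series (such as an interest rate) with stock snapshots taken on irregular dates. The sketch below looks up the latest value dated on or before each snapshot, so a snapshot never uses a rate from its own future. The rate values are made-up placeholders, not actual Federal Reserve data, and `rate_as_of` is a hypothetical helper name.

```python
from bisect import bisect_right
from datetime import date

# Placeholder monthly series: (first day of month, rate in percent).
# These numbers are invented for illustration only.
rates = [
    (date(2012, 1, 1), 0.12),
    (date(2012, 2, 1), 0.16),
    (date(2012, 3, 1), 0.19),
]
rate_dates = [d for d, _ in rates]  # must be sorted ascending for bisect


def rate_as_of(day):
    """Return the latest rate dated on or before `day`, or None if there is none."""
    i = bisect_right(rate_dates, day)
    return rates[i - 1][1] if i > 0 else None


print(rate_as_of(date(2012, 2, 17)))   # uses the February value
print(rate_as_of(date(2011, 12, 31)))  # before the series starts
```

Since we are already using pandas, its `merge_asof` function does the same kind of "most recent value on or before" join on whole tables and would likely be the practical choice.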

We will also classify the stocks according to stock groups, such as "mining", "automotive", "services", etc. I am still investigating this; there are many such "standard" sets and none of them have every stock in our dataset.
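
Since no single source covers every stock, one simple approach is to try the sources in priority order and remember which one answered. The sketch below assumes two lookup tables; the table names, tickers, and labels are invented for illustration and are not taken from the actual NASDAQ or SEC data.

```python
# Hypothetical lookups from two sources; neither covers every ticker.
nasdaq_industry = {"AAPL": "Technology", "F": "Capital Goods"}
sec_sic_group = {"F": "Motor Vehicles", "X": "Steel Works"}


def classify(ticker):
    """Return (label, source), trying each source in priority order.

    Tickers found in no source get the label "UNKNOWN" and source None,
    so they stay visible instead of silently dropping out.
    """
    for source, table in (("nasdaq", nasdaq_industry), ("sec", sec_sic_group)):
        if ticker in table:
            return table[ticker], source
    return "UNKNOWN", None


for ticker in ("AAPL", "F", "X", "ZZZZ"):
    print(ticker, classify(ticker))
```

Recording the source matters because the two standards use different vocabularies; a label from one cannot be compared directly with a label from the other without some mapping between them.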

At first I had planned to make the cleanup of this data part of our summer project, but after I bought A. Géron's great book I decided to add the Housing part of our project based on his chapter 2. As a result, I went through the many trials and tribulations of reading the web pages myself and will simply hand over the resulting data. I think we will include one of the following parts: either the U.S. Government statistics (which show how to hide data in plain sight) or the industry codes (which show how to deal with multiple partially overlapping data standards).

As before we will use the pandas library for reading in the data - at least from the Yahoo web pages.

- I tried a library called Beautiful Soup but it didn't seem easier than pandas for what I was doing https://www.crummy.com/software/BeautifulSoup/
- A higher-level library for reading (scraping) website (*.html) files is Scrapy https://doc.scrapy.org
- I found it easier to use pandas as we did in the Housing portion of our summer project. If we were going to return to Yahoo Finance repeatedly to get more data, I probably would have gone with Scrapy

The program I wound up with to read the Yahoo web pages is in the repository: readStockMarketData_pandasFrame.py. By default it writes a meta-cell for each data cell, giving info on what problems or edge cases it might have encountered. This was extremely useful diagnostic info for making the code work with all the weird and sometimes just plain misspelled or wrongly entered date fields.
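
To illustrate the meta-cell idea (not the actual logic in readStockMarketData_pandasFrame.py, which is not reproduced here), the sketch below parses one raw cell and returns both a value and a diagnostic note. It uses numeric cells for brevity; the function name `parse_cell`, the suffix table, and the note strings are all assumptions of mine.

```python
def parse_cell(raw):
    """Return (value, note): a float or None, plus a diagnostic note ('' if clean)."""
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    text = raw.strip().replace(",", "")
    if text in ("", "N/A", "NaN", "-"):
        return None, "missing"
    note = ""
    scale = 1.0
    if text[-1].upper() in multipliers:
        # Suffix such as "B" for billions.
        scale = multipliers[text[-1].upper()]
        text = text[:-1]
    elif text.endswith("%"):
        scale = 0.01
        text = text[:-1]
        note = "percent converted to fraction"
    try:
        return float(text) * scale, note
    except ValueError:
        # Keep the raw text in the note so the bad cell can be traced later.
        return None, f"unparseable: {raw!r}"


for raw in ("12.5B", "1,234.5", "3.2%", "N/A", "1O.2"):
    print(repr(raw), "->", parse_cell(raw))
```

Writing the note next to every value means a bad cell (like the letter O typed for a zero in the last example) shows up in the output instead of quietly becoming a missing number.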

Note that in the time since our Yahoo data was collected and converted to a spreadsheet format, Yahoo changed how they populate their data tables. See these pages for how to get current data. Of course, the format could easily change again...
- describes the issue: https://pythonprogramming.net/current-yahoo-data-for-machine-learning/
- points to this code: https://github.com/tomgs/sentdexworkarounds