When starting out in quantitative finance, it is easy to feel overwhelmed (or at least disoriented) about where to start, which data to get, and where to get it.

Even when such financial data is retrievable, its accessibility is limited: it is either available only to academics or priced out of reach for retail traders (e.g. \$24,000/year for [Bloomberg Terminal](https://www.bloomberg.com/professional/solution/bloomberg-terminal/), \$13,000/year for [S&P Capital IQ](https://www.spglobal.com/marketintelligence/en/solutions/sp-capital-iq-platform)). Websites that offer free data (e.g. [Yahoo Finance](https://ca.finance.yahoo.com/), [Wall Street Journal](https://www.wsj.com/), [MarketWatch](https://www.marketwatch.com/)) only cover the past few years (typically five) and can miss important entries. After reading countless posts about recurring issues on forums such as [r/algotrading](https://www.reddit.com/r/algotrading/) and [Quant Stack Exchange](https://quant.stackexchange.com/), we can summarize the required data and its uses as follows:

* [Macro](#Data-for-Macro): Industry Classification, Risk-Free Rate, Gross National Product, etc., mostly for factor models
* [Fundamental Analysis](#Data-for-Fundamental-Analysis): Balance Sheets, Income Statements, and Cash Flow Statements, used to compute financial ratios, perform valuations, etc.
* [Technical Analysis](#Data-for-Technical-Analysis): Open, High, Low, Close price data and Volume, used to compute technical indicators, chart patterns, candlestick patterns, etc.

We are going to make this data publicly available and easy to load into Python data structures, and go over some scraping techniques for collecting it.

After that, we will be able to analyze companies, construct and optimize a portfolio, and compare the performance of common investing styles. This is the first part of the series.


In [None]:
import traceback
import pandas as pd
import pickle
import re
import requests
from bs4 import BeautifulSoup
# from selenium.webdriver.common.keys import Keys
from time import sleep
# from selenium import webdriver
# from webdriver_manager.chrome import ChromeDriverManager
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
from pprint import pprint

# Data for Macro

## Global Industry Classification Standard

In 1999, MSCI and S&P Dow Jones Indices developed the **Global Industry Classification Standard (GICS)**, seeking to offer an efficient investment tool to capture the breadth, depth and evolution of industry sectors. It is regarded as the definitive categorization system for industry groups in the United States, enabling market participants to identify and analyze companies using a common global standard.

GICS is a four-tiered, hierarchical industry classification system. Companies are classified quantitatively and qualitatively. Each company is assigned a single GICS classification at the Sub-Industry level according to its principal business activity.

MSCI and S&P Dow Jones Indices use revenues as a key factor in determining a firm’s principal business activity. Earnings and market perception, however, are also recognized as important and relevant information for classification purposes, and are taken into account during the annual review process.

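* A **sector** is the broadest tier of the hierarchy, identified by a two-digit code.
* An **industry group** is a subdivision of a sector, identified by a four-digit code that starts with the sector's code.
* An **industry** is a subdivision of an industry group, identified by a six-digit code that starts with the industry group's code.
* A **sub-industry** is the most granular tier, identified by an eight-digit code that starts with the industry's code; this is the level at which each company is classified.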
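Because GICS codes are nested by prefix (two digits per tier, which is what the scraper further down relies on), a sub-industry code can be split into the codes of its parent tiers. The helper below is a minimal sketch; the name `split_gics_code` and the dictionary keys are chosen here for illustration and are not part of any official API.

```python
def split_gics_code(code):
    """Split an 8-digit GICS sub-industry code into the codes of its four tiers."""
    code = str(code)
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f'expected an 8-digit GICS code, got {code!r}')
    return {
        'sector': code[:2],          # first two digits
        'industry_group': code[:4],  # first four digits
        'industry': code[:6],        # first six digits
        'sub_industry': code,        # all eight digits
    }


# 10101010 is the Oil & Gas Drilling sub-industry within the Energy sector (code 10).
print(split_gics_code('10101010'))
# {'sector': '10', 'industry_group': '1010', 'industry': '101010', 'sub_industry': '10101010'}
```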

By gathering such data, we will be able to analyze various securities and sectors to determine which companies or industries are best positioned for growth.

We can scrape such data from the [Wikipedia page](https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) for the Global Industry Classification Standard.

In [None]:
def get_gics():
    """Scrape the GICS hierarchy from Wikipedia.

    Returns a nested dict: {sector: {industry group: {industry: [sub-industries]}}}.
    """
    url = 'https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard'
    response = requests.get(url)
    response.raise_for_status()  # fail early if the page could not be fetched
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_='wikitable')

    names = {}  # GICS code -> name, used to look up the parents of each entry
    gics = {}
    for td in table.find_all('td'):
        code = td.text.strip()
        # Codes are 2, 4, 6 or 8 digits long; each code cell is followed by its name cell
        if not re.fullmatch(r'(\d{2}){1,4}', code):
            continue
        name = td.find_next_sibling('td').text.strip()
        names[code] = name
        if len(code) == 2:    # two-digit code: sector
            gics[name] = {}
        elif len(code) == 4:  # four-digit code: industry group
            gics[names[code[:2]]][name] = {}
        elif len(code) == 6:  # six-digit code: industry
            gics[names[code[:2]]][names[code[:4]]][name] = []
        else:                 # eight-digit code: sub-industry
            gics[names[code[:2]]][names[code[:4]]][names[code[:6]]].append(name)
    return gics

pprint(get_gics())

{'Communication Services': {'Communication Services': {'Diversified Telecommunication Services': ['Alternative '
                                                                                                  'Carriers',
                                                                                                  'Integrated '
                                                                                                  'Telecommunication '
                                                                                                  'Services'],
                                                       'Wireless Telecommunication Services': ['Wireless '
                                                                                               'Telecommunication '
                                                                                               'Services']},
                            'Media & Entertainment': {'Entertainment': ['Movies '
                     

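A nested dictionary like the one printed above is convenient for browsing, but a flat table is usually easier to join against per-company data. The sketch below flattens a dictionary of that shape into one row per sub-industry; `flatten_gics` is a hypothetical helper name, and the sample is a small excerpt copied from the output above.

```python
def flatten_gics(nested):
    """Turn {sector: {group: {industry: [sub_industry, ...]}}} into flat rows."""
    rows = []
    for sector, groups in nested.items():
        for group, industries in groups.items():
            for industry, sub_industries in industries.items():
                for sub_industry in sub_industries:
                    rows.append((sector, group, industry, sub_industry))
    return rows


# Small excerpt of the scraped hierarchy, copied from the printed output.
sample = {
    'Communication Services': {
        'Communication Services': {
            'Diversified Telecommunication Services': [
                'Alternative Carriers',
                'Integrated Telecommunication Services',
            ],
            'Wireless Telecommunication Services': [
                'Wireless Telecommunication Services',
            ],
        },
    },
}

for row in flatten_gics(sample):
    print(row)
```

The resulting list of tuples can be passed to `pd.DataFrame(rows, columns=['sector', 'industry_group', 'industry', 'sub_industry'])` to get a DataFrame with one row per sub-industry.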
To get all tradeable symbols, we can turn to NASDAQ, which makes this information available via [FTP](http://www.nasdaqtrader.com/trader.aspx?id=symboldirdefs) and updates it every night.

You'll notice two files: `nasdaqlisted.txt` and `otherlisted.txt`. Together, these two files give you the entire list of tradeable symbols, where they are listed, their name/description, and an indicator of whether they are an ETF.

* **Financial Status**: Indicates when an issuer has failed to submit its regulatory filings on a timely basis (i.e. *delinquent*), has failed to meet NASDAQ's continuing listing standards (i.e. *deficient*), and/or has filed for bankruptcy (i.e. *bankrupt*).
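As a rough sketch of how these files can be read once downloaded: they are assumed here to be pipe-delimited text with a header row and a trailing "File Creation Time" footer line, as described on NASDAQ's symbol directory definitions page. The sample rows below are made up, and the column names and single-letter status codes are assumptions to check against that page before relying on them.

```python
import csv
import io

# Made-up rows in the assumed pipe-delimited layout of nasdaqlisted.txt.
SAMPLE = (
    "Symbol|Security Name|Market Category|Test Issue|Financial Status|Round Lot Size|ETF|NextShares\n"
    "AAAA|Example Corp - Common Stock|Q|N|N|100|N|N\n"
    "BBBB|Example Index ETF|G|N|N|100|Y|N\n"
    "CCCC|Late Filer Inc - Common Stock|S|N|E|100|N|N\n"
    "File Creation Time: 0101202000:00|||||||\n"
)

# Assumed meaning of the single-letter Financial Status codes.
FINANCIAL_STATUS = {'N': 'normal', 'D': 'deficient', 'E': 'delinquent', 'Q': 'bankrupt'}


def parse_symbol_file(text):
    """Parse a pipe-delimited symbol directory file into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter='|', quoting=csv.QUOTE_NONE)
    first_column = reader.fieldnames[0]
    # The final line is a "File Creation Time" footer rather than a security.
    return [row for row in reader
            if not row[first_column].startswith('File Creation Time')]


rows = parse_symbol_file(SAMPLE)
etfs = [row['Symbol'] for row in rows if row['ETF'] == 'Y']
flagged = {row['Symbol']: FINANCIAL_STATUS.get(row['Financial Status'], 'other')
           for row in rows if row['Financial Status'] != 'N'}
print(etfs)     # ['BBBB']
print(flagged)  # {'CCCC': 'delinquent'}
```

The parsed rows can be loaded with `pd.DataFrame(rows)` if a DataFrame is preferred. Filtering on the first column rather than on `Symbol` keeps the footer check usable for files whose first column has a different name.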

# Data for Fundamental Analysis

In this section, we will go over scraping financial statement data from [Macrotrends](https://www.macrotrends.net/), which provides quarterly and annual data going back to 2005. While it does a pretty good job for a free service, it misses some data (and sometimes even has incorrect data), so I am currently writing a script to scrape (and normalize) the statements directly from the companies' filings (10-K for annual, 10-Q for quarterly) in the [SEC archives](https://www.sec.gov/edgar/searchedgar/companysearch.html). I expect it to be 95% done by the end of this month (July 2020); the other 5% is testing for weird edge cases, where diminishing marginal returns set in. I am not planning on releasing the code itself, but will 

(Fun fact: in this [video](https://www.youtube.com/watch?v=LhISBNDO2Oo), Martin Shkreli talks about how it is possible to become an expert in an industry after reading SEC filings for 5 years.)

For those who are s

# Data for Technical Analysis

# API

