<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import" data-toc-modified-id="Import-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import</a></span></li><li><span><a href="#Companies" data-toc-modified-id="Companies-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Companies</a></span></li><li><span><a href="#Scraping-the-desired-stock-prices" data-toc-modified-id="Scraping-the-desired-stock-prices-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scraping the desired stock prices</a></span></li></ul></div>

# Downloading the stock prices

In this notebook, we'll write a script to download the S&P500 stock prices for two five-year windows: January 2010 until December 2014 (training) and January 2015 until December 2019 (testing). I've been greatly helped by this [script](https://github.com/CNuge/kaggle-code/blob/master/stock_data/getSandP.py).

## Import

First, we need to import some functions and libraries.
1. ``BeautifulSoup`` scrapes the code names of the S&P500 companies from Wikipedia.
2. ``datetime`` sets the start and end dates of the data we'll download.
3. ``futures`` lets us run several downloads in parallel, which speeds up the process.
4. ``web`` (the ``pandas_datareader.data`` module) fetches the price data and returns it as a dataframe. To use it, I first need to install the ``pandas_datareader`` library.

(Here, I had to install some libraries which weren't available on my computer.)

In [None]:
pip install pandas_datareader

In [None]:
pip install requests

In [None]:
pip install beautifulsoup4

In [None]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from concurrent import futures
import pandas_datareader.data as web

## Companies

First, we'll look up the code names of the S&P500 companies on the Wikipedia page. I've noticed that they are links with the class "external text", so I search the page for those elements.

In [None]:
URL = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find_all(class_="external text")
results[:2]

When printing the results of my query, I see that company code names can be 1 to 5 characters long. Every other result is not a company code name but a link to a report. Hence, I'm interested in the 1 to 5 characters just before the closing ``</a>``. Below is a function that takes the HTML elements shown above as its argument and returns the company code names.

In [None]:
def list_companies(list_scraping):
    res = []
    for x in list_scraping:
        name = str(x)[-9:-4]                       #The 5 characters before "</a>": a code name is at most 5 characters long
        if name != 'ports':                         #We don't want the "reports" lines
            if '>' in name:                        #in case the company code name is shorter than 5 characters
                res.append(name.split('>')[1])
            else:
                res.append(name)
    return res
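
To see why the slice ``[-9:-4]`` works, here is a small standalone sketch. The anchor strings are made up for illustration (the ``example.com`` URLs are placeholders, not what Wikipedia returns), and ``extract_code`` is a hypothetical helper that applies the same logic as above to a single string.

```python
# Hypothetical anchor strings, shaped like the scraped "external text" links
samples = [
    '<a class="external text" href="https://example.com/MMM" rel="nofollow">MMM</a>',
    '<a class="external text" href="https://example.com/r" rel="nofollow">reports</a>',
    '<a class="external text" href="https://example.com/GOOGL" rel="nofollow">GOOGL</a>',
    '<a class="external text" href="https://example.com/T" rel="nofollow">T</a>',
]

def extract_code(tag_text):
    """Return the code name found just before '</a>', or None for a report link."""
    tail = tag_text[-9:-4]          # the 5 characters preceding the 4-character '</a>'
    if tail == 'ports':             # the end of 'reports': not a company
        return None
    if '>' in tail:                 # short code: the slice also caught the end of the opening tag
        return tail.split('>')[1]
    return tail

codes = [extract_code(s) for s in samples]
print(codes)
```

For a short code such as ``T``, the slice also captures the end of the opening tag, which is why the text after ``>`` is kept.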

In [None]:
first_step = list_companies(results)
first_step

Here, we can see that the last elements of our list are not what we want. Company code names are written in capital letters, so we drop trailing elements until the last one is uppercase:

In [None]:
def remove_non_companies(list_scraping):
    res = list_scraping.copy()
    #Drop trailing elements until the last one is uppercase (stop if the list becomes empty)
    while res and not res[-1].isupper():
        res.pop()
    return res
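
The function above only trims the end of the list. A different approach, which is not the one used in this notebook, would be to check every element against a pattern. The sketch below assumes a code name is 1 to 5 capital letters, optionally followed by a dot and one capital letter (as in ``BRK.B``); both that pattern and the sample list are assumptions made for illustration.

```python
import re

# Assumed shape of a code name: 1 to 5 capitals, optionally ".X" (e.g. BRK.B)
CODE_PATTERN = re.compile(r'[A-Z]{1,5}(\.[A-Z])?')

def keep_code_like(candidates):
    """Keep only the strings that fully match the assumed code-name pattern."""
    return [c for c in candidates if CODE_PATTERN.fullmatch(c)]

# Hypothetical scraped values, with two leftovers that are not code names
sample = ['MMM', 'BRK.B', 'T', 'rs.', 'ndex', 'GOOGL']
kept = keep_code_like(sample)
print(kept)
```

Unlike the trailing trim, this also removes unwanted entries from the middle of the list, at the cost of relying on an assumed pattern.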

In [None]:
S_P500 = remove_non_companies(first_step)
S_P500

In [None]:
len(S_P500)

The S&P500 comprises 505 common stocks issued by 500 large-cap companies, so we've got just that. Now we need to download the stock prices of those companies for the training and testing periods.

## Scraping the desired stock prices

We set the two time windows for which we want to obtain the data: 2010 to 2014 for training and 2015 to 2019 for testing.

In [None]:
train_start_time = datetime(2010, 1, 1)
train_end_time = datetime(2014, 12, 31)

test_start_time = datetime(2015, 1, 1)
test_end_time = datetime(2019, 12, 31)

In [None]:
wrong_codes = [] #Code names of the companies for which the download failed

In [None]:
def download_stock_train(company_code): #One call per company (no for loop) so that we can parallelize the downloads
    #The scraped company code name may be wrong, so we catch errors and record the failure
    try:
        stock_df = web.DataReader(company_code, 'yahoo', train_start_time, train_end_time)
        stock_df.to_csv('training/' + company_code + '.csv')
    except Exception:
        wrong_codes.append(company_code)

In [None]:
def download_stock_test(company_code): #One call per company (no for loop) so that we can parallelize the downloads
    #The scraped company code name may be wrong, so we catch errors and record the failure
    try:
        stock_df = web.DataReader(company_code, 'yahoo', test_start_time, test_end_time)
        stock_df.to_csv('testing/' + company_code + '.csv')
    except Exception:
        wrong_codes.append(company_code)
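
Both functions write into the ``training/`` and ``testing/`` folders. If a folder is missing, ``to_csv`` raises an error, and the ``try`` block would record the company in ``wrong_codes`` even though its code name is fine. A minimal sketch for creating the folders beforehand is shown here; ``ensure_output_dirs`` is a hypothetical helper, and the temporary directory is only used to keep the demonstration self-contained.

```python
import tempfile
from pathlib import Path

def ensure_output_dirs(base_dir='.', names=('training', 'testing')):
    """Create the output folders if they are missing and return their paths."""
    paths = [Path(base_dir) / name for name in names]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)   # no error if the folder already exists
    return paths

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_output_dirs(tmp)
    all_exist = all(p.is_dir() for p in created)
    created_names = [p.name for p in created]

print(created_names, all_exist)
```

In the notebook itself, calling ``ensure_output_dirs()`` with no arguments before the downloads would create both folders next to the notebook.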

Here, we parallelize the downloads, which lets us obtain the data much faster.

In [None]:
workers = len(S_P500)

with futures.ThreadPoolExecutor(workers) as executor:
    res = executor.map(download_stock_train, S_P500)

In [None]:
workers = len(S_P500)

with futures.ThreadPoolExecutor(workers) as executor:
    res = executor.map(download_stock_test, S_P500)
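
As a variation on the cells above, each worker could return its outcome instead of appending to a shared list, so that successes and failures are read from the results of ``executor.map``. The sketch below is only an illustration under stated assumptions: ``fake_download`` is a made-up stand-in for the real download, and the pool size of 2 is arbitrary.

```python
from concurrent import futures

def fake_download(company_code):
    """Hypothetical stand-in for a real download: fails on non-uppercase codes."""
    if not company_code.isupper():
        raise ValueError(f"unknown code name: {company_code}")
    return company_code + '.csv'

def try_download(company_code):
    """Return (code, True) if the download succeeded and (code, False) otherwise."""
    try:
        fake_download(company_code)
        return company_code, True
    except Exception:
        return company_code, False

codes = ['MMM', 'T', 'oops', 'GOOGL']

# executor.map yields the results in the same order as the inputs
with futures.ThreadPoolExecutor(max_workers=2) as executor:
    outcomes = list(executor.map(try_download, codes))

succeeded = [code for code, ok in outcomes if ok]
failed = [code for code, ok in outcomes if not ok]
print(succeeded, failed)
```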

In [None]:
len(S_P500) * 2 - len(wrong_codes)

Good news! 737 of the 1,010 downloads (505 code names × 2 periods) succeeded, which should be enough to train our model.