# Webscraping 201: Gathering Minute-to-Minute Stock Data

This tutorial assumes that you have at least looked through **Webscraping 101**. If any topic covered in this notebook is unclear, please use that tutorial as a reference.

Many websites and APIs let you extract stock data. Typically this data comes as one row per day: an open, high, low, and close price for a company for each day. However, stock prices fluctuate constantly, and minute-by-minute snapshots of stock data almost always require an expensive subscription.

Fortunately, webscraping can help us get around this by collecting minute-by-minute stock data for us and saving it for later.

### Dependencies:

The dependencies here are the same as for **Webscraping 101**, with the addition of time and datetime for scheduling purposes. The code below also imports pandas, numpy, and os to build and save the final table.

* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [urllib.request](https://docs.python.org/3.0/library/urllib.request.html)
* [ssl](https://docs.python.org/2/library/ssl.html)
* [time](https://docs.python.org/3/library/time.html)
* [datetime](https://docs.python.org/3/library/datetime.html)

## Step 1: Find a Stock Website to Scrape

For this tutorial, we will scrape stock data from Yahoo Finance. That said, you can use any site that permits scraping. Navigating to Yahoo Finance and searching for a stock (say Google), we find that the URL is:

https://finance.yahoo.com/quote/GOOG?p=GOOG

Notice that the URL simply inserts the stock ticker into a fixed URL string. If, say, we swapped **GOOG** for **AAPL**, we would land on Apple's Yahoo Finance page. This lets us set the URL dynamically.
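
As a minimal sketch of that idea, the snippet below fills a ticker into the URL pattern shown above. The names `YAHOO_TEMPLATE` and `build_quote_url` are made up for illustration, and `STOCK` is just a placeholder marking where the ticker goes.

```python
# Template based on the URL above; STOCK marks where the ticker goes.
# (YAHOO_TEMPLATE and build_quote_url are illustrative names.)
YAHOO_TEMPLATE = 'https://finance.yahoo.com/quote/STOCK?p=STOCK'


def build_quote_url(ticker, template=YAHOO_TEMPLATE):
    """Return the quote URL for a ticker by filling in the placeholder."""
    return template.replace('STOCK', ticker.upper())


print(build_quote_url('aapl'))
```

Swapping the ticker is then a single function call, with no hand-editing of the URL.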

## Step 2: Import Dependencies and Set Meta-parameters

Meta-parameters here mean anything that needs to be a global variable. Due to the way the following code is nested, the parameters below need to be global in scope to be recognized by the necessary functions. 

First, we have a dictionary of URLs that can be changed for scraping. Though we are only writing code to scrape from Yahoo, creating functions to scrape from other websites would be a great way to further explore webscraping. If you can create a function to do this, we will merge your code into the master branch for all to see!

Last, we have our unverified SSL context, which skips certificate verification when a page is opened.

In [2]:
###########################################
#
# Import Dependencies and Data Collection
#
###########################################

from bs4 import BeautifulSoup
import urllib.request as req 
import ssl
import pandas as pd
import datetime as dt
import time
import numpy as np
import os

###########################################
#
#          Set Meta-parameters
#
###########################################

#URL dictionary of sites that could be web-scraped
URL_dict = {'yahoo': 'https://finance.yahoo.com/quote/STOCK?p=STOCK',
            'NASDAQ': 'http://www.nasdaq.com/symbol/STOCK',
            'bloomberg': 'https://www.bloomberg.com/quote/STOCK:US',
            'reuters': 'http://www.reuters.com/finance/stocks/overview?symbol=STOCK.O'}

#Set ssl context to allow for an unverified handshake with a network site
ssl_context = ssl._create_unverified_context()

## Step 3: Create Scraping Engine

Now, let's create a function that will scrape stock price and volume data from Yahoo Finance. We want to make this dynamic, so we can get data for any stock that we want.

In [7]:
'''
This is the yahoo ohlcv (open, high, low, close, volume)
webscraping engine. This function will return the most 
recent stock price and volume as found on the yahoo
finance webpage.

# ticker = any stock ticker, lower or uppercase
'''
def yahoo_minute_ohlcv_data(ticker):

    #Creates the link needed to contact yahoo finance
    link  = URL_dict['yahoo'].replace('STOCK',ticker.lower())

    #Opens the URL link created above
    page = req.urlopen(link, context = ssl_context)

    #Allows for any scraping exceptions to be caught and handled
    try:
        #Extracts all html data from the page opened above
        soup = BeautifulSoup(page, "html.parser")

        #Digging through html to find correct tags (classes) for stock price
        price = soup.find('span', class_= 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)').find(text = True)

        #Refining down html tags for volume [cell at index 6 of the vol_table, within vol_class]
        vol_class = soup.find('div', class_='D(ib) W(1/2) Bxz(bb) Pend(12px) Va(t) ie-7_D(i)')
        vol_table = vol_class.findAll('td', class_= 'Ta(end) Fw(b) Lh(14px)')
        volume = vol_table[6].find(text = True).replace(',','')

        #Returns stock data and price as float values
        price, volume = float(price), float(volume)

        return price, volume

    # If stock data ill-formatted, scraping attempt is skipped
    # Sometimes data is reported as N/A briefly as it changes
    except Exception as e:
        print("\nError in scraping data, current scrape attempt skipped")
        print("Message: %s"%(e))
        print("Re-scraping\n")
        #Return the result of the retry so the caller still receives (price, volume)
        return yahoo_minute_ohlcv_data(ticker)

Let's walk through the code above. 

    1. A link is created by dynamically adding the stock ticker to the URL provided by the URL_dict meta-parameter
    2. The webpage found at the link previously created is opened, and its html contents are saved 
    3. A try and except clause ensures that a failed scrape attempt does not fail the entire process
    4. After going to the yahoo website and inspecting its HTML contents, we find the correct paths to the data 
    5. We pull the most current stock price and volume
    6. Finally, we re-format the data to be float values and return them.
    7. If an exception is raised, a message is printed detailing the issue and the function tries to pull the data again
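
The retry in step 7 has no upper limit, so a page that keeps failing would be retried forever. One possible variation, sketched below, caps the number of attempts. The helper `scrape_with_retries` and the stand-in `flaky_engine` are hypothetical names that are not part of the notebook, and the stand-in returns fixed numbers instead of contacting any website.

```python
import time


def scrape_with_retries(scrape_fn, ticker, max_attempts=3, delay_seconds=0.0):
    """Call scrape_fn(ticker) until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return scrape_fn(ticker)
        except Exception as e:
            last_error = e
            print("Attempt %d failed: %s" % (attempt, e))
            time.sleep(delay_seconds)
    raise RuntimeError(
        "Giving up on %s after %d attempts" % (ticker, max_attempts)
    ) from last_error


# Stand-in engine (hypothetical): fails twice, then returns a fixed
# (price, volume) pair, imitating a page that briefly reports N/A.
calls = {'count': 0}


def flaky_engine(ticker):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ValueError("could not convert string to float: 'N/A'")
    return 908.14, 1569344.0


result = scrape_with_retries(flaky_engine, 'GOOG')
print(result)
```

In real use, `scrape_fn` would be the scraping engine itself, with the retry removed from its own `except` branch so that the two mechanisms do not stack.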


Awesome! If we give it a run, we can see that it is indeed working as expected. 

In [8]:
ticker = 'GOOG'

price, volume = yahoo_minute_ohlcv_data(ticker)

print("Price: " + str(price))
print("Volume: " + str(volume))

Price: 908.14
Volume: 1569344.0


## Step 4: Build Automated Scrape Scheduler

Now that we have a webscraping engine built properly, let's create a function that will schedule our scraper to run and save data every minute. The code can be seen below.

In [9]:
'''
This is the master function for webscraping intraday stock
data on a minute to minute basis. After waiting until the top
of the next minute, the get_ohlcv_data function notifies the
user of the time that the scrape starts. It then pulls stock
price and volume data a total of five times over the course
of a given minute. The open, high, low, close, and average
volume are then calculated from this. Every minute, a new
row is added with ohlcv features, as well as a timestamp.
Once the duration (in minutes) is over, all of the collected
data is saved as a csv file, named in the format:

[*stock_ticker*]_Intraday_Stock_Data_[*date*].csv

# ticker - Any stock ticker
# provider - any website provider in URL_dict
  *NOTE* ONLY THE YAHOO SCRAPING ENGINE HAS BEEN CREATED
# duration - number of minutes for which the user wants data
'''
def get_ohlcv_data(ticker, provider, duration):

    #Stores list of all scraped data, one row per minute.
    #This is what is stored as a csv eventually.
    master_stock_data_list = []

    #Creates a string with the code needed to execute the given engine
    #(e.g. "yahoo_minute_ohlcv_data('GOOG')")
    engine = str(provider) + '_minute_ohlcv_data(\'' + ticker + '\')'

    #Initializing count variables for duration and scrape counts
    scrape_count = 0
    actual_duration = 0

    #Initializing price and volume lists for intra-minute scrapes
    price_list = []
    vol_list = []

    #Will turn to true at the top of the next minute. Allows Web-scrape pulls
    # to be organized.
    scrape_start = False

    #Initializes all datetime values for starting the scrape
    now = dt.datetime.now()
    #Drops the microseconds from the current time
    now = dt.datetime.strptime(now.strftime('%Y-%m-%d %H:%M:%S'), '%Y-%m-%d %H:%M:%S')
    #Adds a minute and subtracts the elapsed seconds to get the top of the next minute
    start = now + dt.timedelta(minutes = 1, seconds = -now.second)

    #Pre-execution of web-scraping engine
    while not scrape_start:

        #If the time has reached the next minute, begin execute sequence by changing scrape_start
        if dt.datetime.now() >= start:
            scrape_start = True
            next_scrape = dt.datetime.strptime(dt.datetime.now().strftime('%Y-%m-%d %H:%M'), '%Y-%m-%d %H:%M')
            print('\nScrape started at %s and will run for %s minute(s)\n'%(str(dt.datetime.now())[11:-7], duration))

        #Notify the user that the scraping engine is still waiting.
        else:
            print('Scrape starting in %s seconds'%str((start - dt.datetime.now()))[5:-7])
            #Pauses the sequence for 1 second
            time.sleep(1)

    #Main execution loop
    while scrape_start and actual_duration < duration:

        #Resets after the scraper has pulled 5 times in one minute
        while scrape_count < 5:

            if dt.datetime.now() >= next_scrape:

                #Gets price and volume data
                price, volume = eval(engine)

                #Add price and volume data to respective lists
                price_list.append(price)
                vol_list.append(volume)

                #Amend next_scrape time to be ten seconds later (allows pause)
                next_scrape += dt.timedelta(seconds = 10)

                #Increase scrape count by one
                scrape_count += 1

            #Otherwise, pause briefly so the wait loop does not spin the CPU
            else:
                time.sleep(0.1)

        #Gather all necessary data after scraping five times. (Date + ohlcv)
        Date_time = dt.datetime.now().strftime('%Y-%m-%d %H:%M')
        min_open = price_list[0]
        min_high = max(price_list)
        min_low = min(price_list)
        min_close = price_list[len(price_list) - 1]
        min_avg_volume = int(np.mean(vol_list))

        #Create a row with all necessary data
        row = [Date_time, min_open, min_high, min_low, min_close, min_avg_volume]

        #Add row to master data list
        master_stock_data_list.append(row)

        #Notify user of the added row
        print("[%s] Row Stored"%Date_time)

        #Reset time parameters, counts, and temporary lists
        scrape_count = 0
        next_scrape += dt.timedelta(seconds = 10)
        price_list = []
        vol_list = []
        actual_duration += 1

    #Notify user when the webscrape has finished
    print("\n[%s] Scrape Finished.\n"%(dt.datetime.now().strftime('%Y-%m-%d %H:%M')))

    #Create a DataFrame of all the gathered stock data, and remove dummy index
    stock_df = pd.DataFrame(master_stock_data_list, columns = ['DateTime', 'Open', 'High', 'Low', 'Close', "Avg_Volume"])
    stock_df.set_index('DateTime', inplace = True)

    #print(stock_df)

    #Parameters to create file name
    date = dt.datetime.now().strftime('%Y-%m-%d')
    file_name = ticker.upper() + '_Intraday_Stock_Data_' + date + '.csv'

    #Store stock data scrapings as a csv file
    stock_df.to_csv(file_name, sep = ',')
    print('File saved at location: %s'%(os.getcwd()))

### What the Function Does

Whoa, that is a lot of code. Before delving into it, let's discuss what it is doing from a practical standpoint.

    1. Once started, the program waits until the start of the next minute to run, notifying the user throughout
    2. After starting, the program pulls stock data for the provided ticker every ten seconds, five times in a row
    3. Since all of these prices and volumes are for the same minute, they are saved to a list
    4. Once data has been pulled five times, the open, high, low, and close price for the stock is found
    5. Items in the stock price list are chronological, making the previous step easy
    6. An average is taken of the volume list (vol_list), and saved as the min_avg_volume variable
    7. Finally, all stock values (ohlcv) are saved to a list and stored.
    8. After a certain number of minutes has passed (equal to duration value), the stored data is saved as a csv 
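
Steps 4 through 7 boil down to a small aggregation over the five samples. The sketch below shows that calculation on its own. The function name `aggregate_minute` and the sample numbers are made up for illustration, and `statistics.mean` stands in for `np.mean` to keep the snippet free of third-party imports.

```python
from statistics import mean


def aggregate_minute(timestamp, price_list, vol_list):
    """Collapse one minute of chronological samples into a single ohlcv row."""
    return [
        timestamp,
        price_list[0],        # open: first sample of the minute
        max(price_list),      # high
        min(price_list),      # low
        price_list[-1],       # close: last sample of the minute
        int(mean(vol_list)),  # average volume, truncated to an integer
    ]


# Made-up samples, listed in the order they would have been scraped
prices = [10.0, 10.5, 9.8, 10.2, 10.1]
volumes = [1000, 1010, 1020, 1030, 1040]

row = aggregate_minute('2017-08-11 13:22', prices, volumes)
print(row)
```

Because the samples are appended in the order they are scraped, the first and last entries can serve as the open and close without any sorting.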
    
### Technical Explanation

To add to our practical explanation, let me add some technical color. First, the engine variable holds a string containing the function call we want to run. For example, if we were trying to get **ibm** stock prices from Yahoo, the engine variable would equal **`"yahoo_minute_ohlcv_data('ibm')"`**. By passing this string to the **`eval`** function, we can run our scraping engine dynamically for **any stock and any provider** that has a matching `[provider]_minute_ohlcv_data` function. Note that **`eval`** executes whatever string it is given, so the ticker and provider should only come from input you trust.
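
An alternative to building a string for `eval` is to look the engine up in a dictionary of functions. The snippet below is a sketch of that approach, not the notebook's method. `ENGINES`, `run_engine`, and `yahoo_engine_stub` are hypothetical names, and the stub returns fixed numbers in place of a real scrape.

```python
def yahoo_engine_stub(ticker):
    """Placeholder standing in for yahoo_minute_ohlcv_data; returns fixed values."""
    return 100.0, 5000.0


# Maps each provider name to the function that scrapes it
ENGINES = {'yahoo': yahoo_engine_stub}


def run_engine(provider, ticker):
    """Look up the engine for a provider and call it with the ticker."""
    try:
        engine = ENGINES[provider]
    except KeyError:
        raise ValueError("No scraping engine for provider %r" % provider)
    return engine(ticker)


print(run_engine('yahoo', 'ibm'))
```

With this layout, supporting a new provider means adding one entry to `ENGINES`, and an unknown provider fails with a clear message.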

The first while loop runs during the seconds leading up to the top of the next minute. If the program is started at 11:30:34, the loop prints a countdown message 26 times, letting the user know when the program will begin scraping data.
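
The arithmetic behind that countdown can be checked in isolation. The sketch below computes the top of the next minute for the 11:30:34 example. `next_minute_start` is an illustrative helper, and the date is arbitrary.

```python
import datetime as dt


def next_minute_start(now):
    """Return the first instant of the minute after `now`."""
    return now.replace(second=0, microsecond=0) + dt.timedelta(minutes=1)


now = dt.datetime(2017, 8, 10, 11, 30, 34)
start = next_minute_start(now)
wait_seconds = int((start - now).total_seconds())

print(start.strftime('%H:%M:%S'), wait_seconds)
```

For a start time of 11:30:34 this gives a start of 11:31:00 and a wait of 26 seconds, matching the 26 countdown messages described above.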

Next, when scrape_start is set to **`True`** and the actual_duration is less than the planned duration set in the parameters, the scraping engine will be run every ten seconds, up to five times. These values are stored in lists, which are then used to identify high, low, open, and close prices. Finally, this data is added to the **`master_stock_data_list`** variable. 
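
To make the timing concrete, the sketch below lists when the five scrapes of a single minute are due, assuming the ten-second spacing used in the main loop. `scrape_times` is a hypothetical helper written only for this illustration.

```python
import datetime as dt


def scrape_times(minute_start, scrapes_per_minute=5, spacing_seconds=10):
    """Return the times at which each scrape in one minute is due."""
    return [
        minute_start + dt.timedelta(seconds=spacing_seconds * i)
        for i in range(scrapes_per_minute)
    ]


times = scrape_times(dt.datetime(2017, 8, 10, 15, 58))
print([t.strftime('%H:%M:%S') for t in times])
```

The last scrape is due at second 40, which leaves the final twenty seconds of the minute idle before the next minute's first scrape.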

Once the scraping process is finished, the data is fed into a pandas DataFrame and saved as a csv using the file name:

[*stock_ticker*]_Intraday_Stock_Data_[*date*].csv
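
The sketch below shows how the `file_name` expression in `get_ohlcv_data` produces a name, and what the saved file looks like. It uses the standard-library `csv` module in place of pandas so that it runs without extra installs. `build_file_name` and `rows_to_csv_text` are illustrative names, and the single row is copied from the Tesla sample at the end of this notebook.

```python
import csv
import datetime as dt
import io

COLUMNS = ['DateTime', 'Open', 'High', 'Low', 'Close', 'Avg_Volume']


def build_file_name(ticker, date):
    """Mirror the file_name expression used in get_ohlcv_data."""
    return ticker.upper() + '_Intraday_Stock_Data_' + date.strftime('%Y-%m-%d') + '.csv'


def rows_to_csv_text(rows):
    """Render the header and rows as CSV text in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


file_name = build_file_name('tsla', dt.date(2017, 8, 11))
csv_text = rows_to_csv_text(
    [['2017-08-11 13:18', 357.5353, 357.6, 357.42, 357.42, 2905183]]
)

print(file_name)
print(csv_text)
```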

Now that we have our code created, **let's make a main method and run our code**.

In [10]:
'''
Main method. All variables are defined as they were for get_ohlcv_data.
Only new addition is changing the working directory.
'''
def main(ticker, provider, duration):

    #Change this to your current working directory of choice
    os.chdir('/Users/Sam/Documents/Python/DPUDS/DPUDS_Meetings/Fall_2017/Webscrape_201')

    get_ohlcv_data(ticker, provider, duration)

Ok. We are now ready to see what our scraping automator does. 

In [11]:
ticker = 'ibm'

main(ticker,'yahoo',5)

Scrape starting in 05 seconds
Scrape starting in 04 seconds
Scrape starting in 03 seconds
Scrape starting in 02 seconds
Scrape starting in 01 seconds
Scrape starting in 00 seconds

Scrape started at 15:58:00 and will run for 5 minute(s)

[2017-08-10 15:58] Row Stored
[2017-08-10 15:59] Row Stored
[2017-08-10 16:00] Row Stored
[2017-08-10 16:01] Row Stored
[2017-08-10 16:02] Row Stored

[2017-08-10 16:02] Scrape Finished.

File saved at location: /Users/Sam/Documents/Python/DPUDS/DPUDS_Meetings/Fall_2017/Webscrape_Stock_Data


### Final Thoughts

By following the file path printed out, we can actually view the data we collected. Below is a data sample collected earlier for Tesla. 

The scraper ran for 90 minutes, but in theory it could run the entire day (just don't let your computer fall asleep or the operation will terminate and you will lose your data).
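
One way to reduce the risk of losing data is to append each row to the file as soon as it is collected, so an interrupted run keeps everything gathered so far. The following is a minimal sketch of that idea using the standard-library `csv` module. `append_row` is a hypothetical helper, and the demo writes made-up rows to a temporary directory.

```python
import csv
import os
import tempfile

COLUMNS = ['DateTime', 'Open', 'High', 'Low', 'Close', 'Avg_Volume']


def append_row(path, row):
    """Append one row to the csv, writing the header first if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow(row)


# Demo with made-up rows in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'DEMO_Intraday_Stock_Data.csv')
append_row(demo_path, ['2017-08-11 13:18', 1.0, 2.0, 0.5, 1.5, 100])
append_row(demo_path, ['2017-08-11 13:19', 1.5, 1.8, 1.2, 1.4, 110])

with open(demo_path, newline='') as f:
    saved = list(csv.reader(f))

print(len(saved))
```

In the scraper, the call to `append_row` would go where each row is currently added to `master_stock_data_list`.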

In [4]:
tsla_stock_df = pd.read_csv('TSLA_Intraday_Stock_Data_2017-08-11.csv')

print(tsla_stock_df)

            DateTime      Open      High       Low     Close  Avg_Volume
0   2017-08-11 13:18  357.5353  357.6000  357.4200  357.4200     2905183
1   2017-08-11 13:19  357.5390  357.6100  357.4000  357.4000     2910566
2   2017-08-11 13:20  357.4300  357.4300  357.3612  357.3612     2915401
3   2017-08-11 13:21  357.3794  357.3794  357.0000  357.0000     2922222
4   2017-08-11 13:22  357.0000  357.2600  356.6900  356.6900     2930593
5   2017-08-11 13:23  356.7200  356.7200  356.7059  356.7059     2935380
6   2017-08-11 13:24  356.7900  356.7900  356.7900  356.7900     2937465
7   2017-08-11 13:25  356.9636  356.9636  356.7200  356.7600     2938731
8   2017-08-11 13:26  356.9900  357.0341  356.9900  357.0300     2941034
9   2017-08-11 13:27  356.9100  356.9100  356.8900  356.8900     2942831
10  2017-08-11 13:28  357.1500  357.1500  356.7900  356.7900     2958126
11  2017-08-11 13:29  356.7000  356.7200  356.6625  356.6625     2970375
12  2017-08-11 13:30  356.6800  356.8200  356.6800 

### Further Exploration

If you are looking to gain more experience in this area, here are a few things that could make this program even better (and way cooler). These increase in difficulty as you move down. Let us know if you have questions!

     1. Since this notebook was created, Yahoo's HTML has changed. 
        The code in this notebook needs to be tweaked to work again. 
        The raw code includes these changes and works, but try to investigate and fix the code on your own first.
     2. Create scraping functions for other (non-yahoo) URLs listed in the URL dictionary meta-parameter
     3. Find a way to have the program send you a text / email when the scrape is finished
     4. Find a way for the program to run while the computer sleeps
     5. Find a way to have the scraper start at market open and run until market close automatically
     6. Find a way to have the scraper scrape multiple stocks at the same time, in parallel
     7. Find a way to make the scraper scrape a week straight, saving to the same file, but only during market hours
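
As a starting point for item 5, here is a rough sketch of a market-hours check. It assumes regular US trading hours of 9:30 to 16:00 Eastern on weekdays, ignores market holidays, and expects a naive datetime that is already in Eastern time. `is_market_open` is a hypothetical helper.

```python
import datetime as dt

# Assumed regular US trading hours, Eastern time (holidays are ignored)
MARKET_OPEN = dt.time(9, 30)
MARKET_CLOSE = dt.time(16, 0)


def is_market_open(now_eastern):
    """Return True on weekdays between the assumed open and close times."""
    if now_eastern.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return MARKET_OPEN <= now_eastern.time() < MARKET_CLOSE


print(is_market_open(dt.datetime(2017, 8, 10, 15, 58)))
print(is_market_open(dt.datetime(2017, 8, 12, 12, 0)))
```

A full solution would also need time zone conversion and a holiday calendar, which are left as part of the exercise.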