Beautiful Soup is a Python library for scraping data from web pages.

There are a few tutorials out there. Here is one of them:

    https://www.dataquest.io/blog/web-scraping-tutorial-python/


Before using this tool for a *real* application, we illustrate its use with some simple HTML.
The sample page we use contains some HTML taken from this site:

    https://www.webnots.com/sample-html-table-codes-for-websites/

We'll start by grabbing the sample page with the requests library. Its URL is:

    http://www.ams.jhu.edu/~dan/FinancialComputingWorkshop/simple.html

In [1]:
import requests
from bs4 import BeautifulSoup

req=requests.get("http://www.ams.jhu.edu/~dan/FinancialComputingWorkshop/simple.html")
text=req.text

In [2]:
print(text)

<html>
<title> The most interesting title ever.</title>
<body>
<h1>The most interesting paragraph you will ever read</h1>
<p>
Here is a paragraph. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.
Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.
Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.
</p>
<h2> Here is another very interesting paragraph
</h2>

<p>
Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.
Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.
Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique senectus et net

Just like XML, the content of an HTML document is structured like a tree. The root node corresponds to the <html> tag. This node has a child node corresponding to the <body> tag, and the body in turn has multiple children. To navigate this HTML file, we create a *soup* object.

In [3]:
# No parser is named here, so Beautiful Soup picks the best one installed (and warns about it).
soup=BeautifulSoup(text)
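
To make the tree structure concrete, here is a minimal sketch that prints one indented line per tag. The HTML string and the helper name outline are made up for illustration and are not part of the sample page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A small made-up document, used only for illustration.
html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

def outline(node, depth=0):
    """Return one indented line per tag, following the tree downward."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip plain text nodes
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(soup)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one step down the tree, so <b> appears under <p>, which appears under <body>.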

One way to navigate is to use tag names as attributes to recover portions of the file.

In [5]:
soup.html

<html>
<head><title> The most interesting title ever.</title>
</head><body>
<h1>The most interesting paragraph you will ever read</h1>
<p>
Here is a paragraph. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.
Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.
Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.
</p>
<h2> Here is another very interesting paragraph
</h2>
<p>
Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.
Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.
Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique sen

In [6]:
soup.html.body

<body>
<h1>The most interesting paragraph you will ever read</h1>
<p>
Here is a paragraph. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.
Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.
Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.
</p>
<h2> Here is another very interesting paragraph
</h2>
<p>
Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.
Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.
Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Nunc ac magna. M

When a node contains text, we can extract it using get_text().

In [7]:
soup.html.body.p

<p>
Here is a paragraph. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.
Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.
Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.
</p>

In [8]:
soup.html.body.p.get_text()

'\nHere is a paragraph. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.\nNunc viverra imperdiet enim. Fusce est. Vivamus a tellus.\nPellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.\nAenean nec lorem. In porttitor. Donec laoreet nonummy augue.\n'
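
The string returned by get_text() keeps the newlines from the source HTML. A minimal sketch of one way to tidy it up, using a shortened made-up paragraph rather than the page itself:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the paragraph above (illustrative only).
html = "<p>\nHere is a paragraph.\nNunc viverra <b>imperdiet</b> enim.\n</p>"
p = BeautifulSoup(html, "html.parser").p

raw = p.get_text()              # keeps the newlines found in the HTML
clean = " ".join(raw.split())   # collapse all runs of whitespace to single spaces

print(repr(raw))
print(repr(clean))
```

Splitting on whitespace and rejoining is plain Python string handling, so it works the same way whatever the tag contains.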

But this can only go so far: an attribute such as .p returns only the first matching tag. We want to be able to iterate over all of the children of a node.

In [9]:
for ch in soup.html.body.children:
    print("child = ")
    print(ch)

child = 


child = 
<h1>The most interesting paragraph you will ever read</h1>
child = 


child = 
<p>
Here is a paragraph. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.
Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.
Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.
</p>
child = 


child = 
<h2> Here is another very interesting paragraph
</h2>
child = 


child = 
<p>
Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.
Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.
Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique
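
The empty-looking children in the output are the newline strings that sit between tags. A minimal sketch of keeping only the tag children, on a small made-up <body>:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up body with the same shape as the sample page (illustrative only).
html = ("<body>\n<h1>Heading</h1>\n<p>First paragraph.</p>\n"
        "<h2>Subheading</h2>\n<p>Second paragraph.</p>\n</body>")
body = BeautifulSoup(html, "html.parser").body

all_children = list(body.children)                          # tags and newline strings
tag_children = [ch for ch in all_children if isinstance(ch, Tag)]
names = [ch.name for ch in tag_children]

print(len(all_children), names)
```

Checking isinstance(ch, Tag) is one option; testing whether ch.name is None would be another.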

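We can search the file for tables (i.e. <table> tags). Note that find_all also returns tables nested inside other tables, so the nested table below is printed twice: once inside its parent and once on its own.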

In [12]:
tables=soup.find_all('table')
for tab in tables:
    print("table = ")
    print(tab)
    print("\n\n\n")

table = 
<table border="1">
<caption>Table Caption</caption>
<tr>
<th>Movie</th>
<th>Rating</th>
<th>Ticket</th>
</tr>
<tr align="Center">
<td>Movie 1</td>
<td>Good</td>
<td>$30</td></tr>
<tr align="Center">
<td>Movie 2</td>
<td>Poor</td>
<td>$20</td></tr>
<tr align="Center">
<td>Movie 3</td>
<td>Bad</td>
<td>$10</td>
</tr>
</table>




table = 
<table bgcolor="lightgrey" border="1">
<tr>
<td>
<table bgcolor="skyblue" border="1">
<thead align="center">
This is a table inside a cell
</thead><tr>
<th>Plan</th>
<th>Price</th>
</tr>
<tr>
<td>One Year Plan</td>
<td>60 USD</td>
</tr>
<tr>
<td>Two Year Plan</td>
<td>50 USD</td>
</tr>
</table>
</td>
<td> 
This is another cell of main table
</td>
</tr>
<tr>
<td>Main table cell</td>
<td>Main table cell</td>
</tr>
</table>




table = 
<table bgcolor="skyblue" border="1">
<thead align="center">
This is a table inside a cell
</thead><tr>
<th>Plan</th>
<th>Price</th>
</tr>
<tr>
<td>One Year Plan</td>
<td>60 USD</td>
</tr>
<tr>
<td>Two Year Plan</td
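
If only the outermost tables are wanted, one possible approach is to drop every table that has a <table> ancestor. This is a sketch on made-up HTML; the id attributes are added here only so the result is easy to read.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one table containing a nested table, followed by a plain table.
html = """
<table id="outer">
  <tr>
    <td><table id="inner"><tr><td>60 USD</td></tr></table></td>
    <td>Another cell</td>
  </tr>
</table>
<table id="plain"><tr><td>Movie 1</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

all_tables = soup.find_all("table")
# find_parent returns None when no enclosing <table> exists.
top_level = [t for t in all_tables if t.find_parent("table") is None]

all_ids = [t["id"] for t in all_tables]
top_ids = [t["id"] for t in top_level]
print(all_ids, top_ids)
```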

We can pick a table to extract data from. Let's deal with the first one.

In [14]:
tab1=tables[0]
print(tab1)

<table border="1">
<caption>Table Caption</caption>
<tr>
<th>Movie</th>
<th>Rating</th>
<th>Ticket</th>
</tr>
<tr align="Center">
<td>Movie 1</td>
<td>Good</td>
<td>$30</td></tr>
<tr align="Center">
<td>Movie 2</td>
<td>Poor</td>
<td>$20</td></tr>
<tr align="Center">
<td>Movie 3</td>
<td>Bad</td>
<td>$10</td>
</tr>
</table>


We want to extract the entries in the header and the entries in each subsequent row. We can find all descendants of the table that are rows (<tr> tags).

In [15]:
rows=tab1.find_all('tr') 
for row in rows:
    print("row = ")
    print(row)
    print("\n")

row = 
<tr>
<th>Movie</th>
<th>Rating</th>
<th>Ticket</th>
</tr>


row = 
<tr align="Center">
<td>Movie 1</td>
<td>Good</td>
<td>$30</td></tr>


row = 
<tr align="Center">
<td>Movie 2</td>
<td>Poor</td>
<td>$20</td></tr>


row = 
<tr align="Center">
<td>Movie 3</td>
<td>Bad</td>
<td>$10</td>
</tr>




The cell contents are the text between the <td> ... </td> tags (or the <th> ... </th> tags in the header row).

In [16]:
rows=tab1.find_all('tr') 
for row in rows:
    print(row.get_text())


Movie
Rating
Ticket


Movie 1
Good
$30

Movie 2
Poor
$20

Movie 3
Bad
$10



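So we see how to get what we need. We can store the table in a list of lists. Since the header row contains only <th> cells, searching it for <td> yields an empty list, which we drop afterwards.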

In [17]:
rows=tab1.find_all('tr') 
TABLE1=[]
for row in rows:
    cols=row.find_all('td')
    row_text=[]
    for col in cols:
        txt=col.get_text()
        row_text.append(txt)
    TABLE1.append(row_text)
print(TABLE1)

[[], ['Movie 1', 'Good', '$30'], ['Movie 2', 'Poor', '$20'], ['Movie 3', 'Bad', '$10']]


In [18]:
TABLE1=TABLE1[1:]
print(TABLE1)

[['Movie 1', 'Good', '$30'], ['Movie 2', 'Poor', '$20'], ['Movie 3', 'Bad', '$10']]
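
Rather than discarding the header row, one could use it to label the columns. The following is a sketch only: it rebuilds the first table as a string so that it runs on its own, and the names header and records are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# The first table from the sample page, copied into a string.
html = """
<table border="1">
<caption>Table Caption</caption>
<tr><th>Movie</th><th>Rating</th><th>Ticket</th></tr>
<tr align="Center"><td>Movie 1</td><td>Good</td><td>$30</td></tr>
<tr align="Center"><td>Movie 2</td><td>Poor</td><td>$20</td></tr>
<tr align="Center"><td>Movie 3</td><td>Bad</td><td>$10</td></tr>
</table>
"""
tab1 = BeautifulSoup(html, "html.parser").table

header = []
records = []
for row in tab1.find_all("tr"):
    ths = row.find_all("th")
    tds = row.find_all("td")
    if ths:
        # Header row: remember the column names.
        header = [th.get_text() for th in ths]
    elif tds:
        # Data row: pair each cell with its column name.
        records.append(dict(zip(header, [td.get_text() for td in tds])))

for rec in records:
    print(rec)
```

Each row then becomes a dictionary such as {'Movie': 'Movie 1', 'Rating': 'Good', 'Ticket': '$30'}, which can be easier to work with than a positional list.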


To illustrate the use of this tool on a real application, we'll try extracting the data from a table of futures prices at the MRCI website

    https://www.mrci.com

where daily futures prices are posted. The URL for each day has a particular format.

We will pick a particular date to focus on, Dec 27, 2018. The URL for this day is:

    https://www.mrci.com/ohlc/2018/181227.php
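
Judging from this single example, the format appears to be the four-digit year as a folder followed by the date written as yymmdd. A minimal sketch under that assumption (the function name mrci_url is hypothetical, and the pattern has not been checked against other dates):

```python
from datetime import date

def mrci_url(day):
    """Build the daily price URL, assuming the pattern seen for Dec 27, 2018."""
    return "https://www.mrci.com/ohlc/{}/{}.php".format(day.year, day.strftime("%y%m%d"))

url = mrci_url(date(2018, 12, 27))
print(url)
```

A page presumably exists only for trading days, so a caller would still need to handle dates for which the request fails.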
 
We start by extracting the text using the requests library.

In [19]:
import os
from bs4 import BeautifulSoup
import requests

req=requests.get("https://www.mrci.com/ohlc/2018/181227.php")
text=req.text

In [5]:
text

'\n \n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-gb" lang="en-gb">\n\n<head>\n\n  <link href="https://www.mrci.com/web/templates/ja_kennedy/favicon.ico" rel="shortcut icon" type="image/x-icon" />\n  <script type="text/javascript" src="/web/includes/js/joomla.javascript.js"></script>\n  <script type="text/javascript" src="/web/media/system/js/mootools.js"></script>\n  <script type="text/javascript" src="/web/media/system/js/caption.js"></script>\n\n\n    \n\t\t<script type="text/javascript" language="javascript" src="/js/jquery.min.js"></script>\n\t\t<script type="text/javascript" language="javascript" src="/js/AnyChartStock.js?v=1.0.0r7488"></script>\n\t\t\n\n\n\n\n  <script type="text/javascript">\n  if (top.location != location) {\n\t    top.location.href = document.location.href ;\n\t  }\n\t\twindow.addEvent(\'domready\', function(){ var JTooltips

Next we create a soup object to work with and find all tables.

In [66]:
soup = BeautifulSoup(text,"html.parser")
tables=soup.find_all('table')
len(tables)

7

Inspecting each table, we see that the table we want is nested inside the first table.

In [69]:
table=tables[0]
print(table)

<table><tr>
<td valign="top">
<p>
<table border="0" width="640">
<tr><td align="CENTER">
<table border="3" cellpadding="0" cellspacing="0">
<tr>
<td><img alt="MRCI Logo" border="0" height="68" src="/graphics/mlogo7c.gif" usemap="#mlogo7c" width="559"/></td>
</tr>
<tr>
<th bgcolor="YELLOW" class="title">MRCI's <em>End of Day Prices</em></th>
</tr>
</table>
</td></tr></table>
<map name="mlogo7c">
<area coords="403,3,481,23" href="/ohlc/" shape="rect"/>
<area coords="483,3,555,23" href="/" shape="rect"/>
<area coords="403,24,481,43" href="/ohlc/ohlc-10.php" shape="rect"/>
<area coords="483,24,555,43" href="mailto:sales@mrci.com" shape="rect"/>
<area coords="403,44,481,64" href="/ohlc/ohlc-01.php" shape="rect"/>
<area coords="483,44,555,64" href="/client/symbols.php" shape="rect"/>
<area nohref="" shape="default"/>
</map>
<p></p>
<table border="3" bordercolordark="BLACK" bordercolorlight="BLACK" cellpadding="1" cellspacing="1" class="strat" width="640">
<tr>
<th class="title1" colspan="10"

In [76]:
tables2=table.find_all('table')
for tab in tables2:
    print("table = ")
    print(tab)
    print("\n\n\n")


table = 
<table border="0" width="640">
<tr><td align="CENTER">
<table border="3" cellpadding="0" cellspacing="0">
<tr>
<td><img alt="MRCI Logo" border="0" height="68" src="/graphics/mlogo7c.gif" usemap="#mlogo7c" width="559"/></td>
</tr>
<tr>
<th bgcolor="YELLOW" class="title">MRCI's <em>End of Day Prices</em></th>
</tr>
</table>
</td></tr></table>




table = 
<table border="3" cellpadding="0" cellspacing="0">
<tr>
<td><img alt="MRCI Logo" border="0" height="68" src="/graphics/mlogo7c.gif" usemap="#mlogo7c" width="559"/></td>
</tr>
<tr>
<th bgcolor="YELLOW" class="title">MRCI's <em>End of Day Prices</em></th>
</tr>
</table>




table = 
<table border="3" bordercolordark="BLACK" bordercolorlight="BLACK" cellpadding="1" cellspacing="1" class="strat" width="640">
<tr>
<th class="title1" colspan="10">Daily Futures Price Listing Thu December 27, 2018</th>
</tr>
<tr>
<th class="colhead" colspan="7">Most Recent Information</th>
<th class="colhead" colspan="3">Previous Day</th>
</tr>
<tr sty

So we want the entry at index 2.

In [78]:
tab=tables2[2]
print(tab)

<table border="3" bordercolordark="BLACK" bordercolorlight="BLACK" cellpadding="1" cellspacing="1" class="strat" width="640">
<tr>
<th class="title1" colspan="10">Daily Futures Price Listing Thu December 27, 2018</th>
</tr>
<tr>
<th class="colhead" colspan="7">Most Recent Information</th>
<th class="colhead" colspan="3">Previous Day</th>
</tr>
<tr style="border-bottom: solid 3px">
<th class="colhead">Mth</th>
<th class="colhead">Date</th>
<th class="colhead">Open</th>
<th class="colhead">High</th>
<th class="colhead">Low</th>
<th class="colhead">Close</th>
<th class="colhead">Change</th>
<th class="colhead">Volume</th>
<th class="colhead">Open Int</th>
<th class="colhead">Change</th>
</tr>
<tr>
<th class="note1" colspan="10">Soybeans(CBOT)</th>
</tr>
<tr>
<td align="CENTER">
Jan19</td>
<td align="CENTER">181227</td>
<td align="RIGHT">870.75</td>
<td align="RIGHT">876.00</td>
<td align="RIGHT">867.00</td>
<td align="RIGHT">869.00</td>
<td align="RIGHT">-1.00<img src="/graphics/down.gif
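
Selecting the table by position works, but it depends on the page layout staying the same. Since the price table carries class="strat", a possible alternative is to search on that attribute. This sketch uses a cut-down copy of the markup above and assumes, without having checked the full page, that the first table with this class is the one we want.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the page structure shown above (illustrative only).
html = """
<table><tr><td>
<table border="0" width="640"><tr><td>
<table border="3"><tr><th class="title">MRCI's End of Day Prices</th></tr></table>
</td></tr></table>
<table border="3" class="strat" width="640">
<tr><th class="title1" colspan="10">Daily Futures Price Listing Thu December 27, 2018</th></tr>
</table>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with the trailing underscore) filters on the HTML class attribute.
tab = soup.find("table", class_="strat")
title = tab.find("th").get_text()
print(title)
```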

And now we can extract the cell text as we did above.

In [82]:
TABLE=[]
rows=tab.find_all('tr')
for row in rows:
    rowdata=[]
    cols=row.find_all('td')
    for col in cols:
        coldata=col.get_text()
        rowdata.append(coldata)
    if len(rowdata)>0:
        TABLE.append(rowdata)
for row in TABLE:
    print(row)

['\r\nJan19', '181227', '870.75', '876.00', '867.00', '869.00', '-1.00', '30,790', '67,772', '-11,500']
['\r\nMar19', '181227', '883.75', '889.00', '880.50', '882.50', '-0.50', '51,735', '325,990', '+4,261']
['\r\nMay19', '181227', '897.00', '902.00', '893.50', '895.50', '-0.75', '11,366', '125,016', '+1,134']
['\r\nJul19', '181227', '910.00', '915.25', '906.50', '908.50', '-0.75', '5,981', '114,879', '+542']
['\r\nAug19', '181227', '916.00', '919.50', '911.50', '913.50', '-0.75', '391', '9,078', '+97']
['\r\nSep19', '181227', '921.75', '922.75', '915.25', '917.00', '-0.50', '125', '3,496', '+3']
['\r\nNov19', '181227', '926.00', '931.00', '923.50', '925.25', '-0.50', '2,127', '49,906', '+137']
['\r\nJan20', '181227', '939.75', '942.75', '934.50', '936.00', '+0.25', '52', '1,675', '+1']
['\r\nMar20', '181227', '948.25', '948.50', '942.25', '944.00', '+0.75', '2', '1,095', '+0']
['\r\nMay20', '181227', '952.25', '956.25', '950.75', '952.25', '+0.75', '2', '393', '-2']
['\r\nJul20', '181

Note that we missed some headers, especially the ones that tell us the commodity/asset being considered (Soybeans, Oil, etc.). So we really want to capture more than just the <td> cells.

In [87]:
TABLE=[]
rows=tab.find_all('tr')
for row in rows:
    rowdata=[]
    cols=row.find_all('td')
    headers=row.find_all('th')
    if (len(headers)>0):
        hdata=[]
        for h in headers:
            hdata.append(h.get_text())
        TABLE.append(hdata)
    if (len(cols)>0):
        for col in cols:
            coldata=col.get_text()
            rowdata.append(coldata)
        TABLE.append(rowdata)
for row in TABLE:
    print(row)

['Daily Futures Price Listing Thu December 27, 2018']
['Most Recent Information', 'Previous Day']
['Mth', 'Date', 'Open', 'High', 'Low', 'Close', 'Change', 'Volume', 'Open Int', 'Change']
['Soybeans(CBOT)']
['\r\nJan19', '181227', '870.75', '876.00', '867.00', '869.00', '-1.00', '30,790', '67,772', '-11,500']
['\r\nMar19', '181227', '883.75', '889.00', '880.50', '882.50', '-0.50', '51,735', '325,990', '+4,261']
['\r\nMay19', '181227', '897.00', '902.00', '893.50', '895.50', '-0.75', '11,366', '125,016', '+1,134']
['\r\nJul19', '181227', '910.00', '915.25', '906.50', '908.50', '-0.75', '5,981', '114,879', '+542']
['\r\nAug19', '181227', '916.00', '919.50', '911.50', '913.50', '-0.75', '391', '9,078', '+97']
['\r\nSep19', '181227', '921.75', '922.75', '915.25', '917.00', '-0.50', '125', '3,496', '+3']
['\r\nNov19', '181227', '926.00', '931.00', '923.50', '925.25', '-0.50', '2,127', '49,906', '+137']
['\r\nJan20', '181227', '939.75', '942.75', '934.50', '936.00', '+0.25', '52', '1,675', '
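
The commodity name could also be kept attached to its rows by grouping them in a dictionary. The sketch below assumes every commodity heading uses class="note1", as the Soybeans row does. It runs on a tiny fragment with only two columns; the second asset and its value are placeholders and not real data.

```python
from bs4 import BeautifulSoup

# Tiny fragment in the style of the price table. "Placeholder(XYZ)" and its
# row are invented so that the grouping has a second key to show.
html = """
<table class="strat">
<tr><th class="colhead">Mth</th><th class="colhead">Close</th></tr>
<tr><th class="note1" colspan="2">Soybeans(CBOT)</th></tr>
<tr><td>Jan19</td><td>869.00</td></tr>
<tr><td>Mar19</td><td>882.50</td></tr>
<tr><th class="note1" colspan="2">Placeholder(XYZ)</th></tr>
<tr><td>Jan19</td><td>1.00</td></tr>
</table>
"""
tab = BeautifulSoup(html, "html.parser").table

by_asset = {}
current = None
for row in tab.find_all("tr"):
    name_cell = row.find("th", class_="note1")
    if name_cell is not None:
        # A commodity heading starts a new group.
        current = name_cell.get_text().strip()
        by_asset[current] = []
        continue
    cells = [td.get_text().strip() for td in row.find_all("td")]
    if cells and current is not None:
        by_asset[current].append(cells)

for asset, rows in by_asset.items():
    print(asset, rows)
```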

There are more things to do with this:

1) Create a dictionary of tables with keys being the assets (Soybeans, Wheat, Corn, etc.)
2) Clean up the cells:
    a) Get rid of the \r\n in the first column
    b) Remove the total volume and open interest rows
    c) Remove the + and - signs in the change columns
    d) Remove the commas in the Volume and Open Int columns
    e) Convert the numerical entries from text to numbers.
3) Write a function that allows a user to enter a date and creates a table or dictionary of tables for that trading date.
4) Combine 3) with some graphical summaries.
4) Combine 3) with some graphical summaries.
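
As a starting point for the cell clean-up in item 2, here is a minimal sketch applied to the first Soybeans row printed earlier. The helper names to_number and clean_row are hypothetical. The sketch keeps the sign of the change columns as part of the number instead of removing it, and it does not handle the total volume and open interest rows, whose layout is not shown here.

```python
def to_number(text):
    """Convert a cell such as '30,790' or '+4,261' to a float."""
    return float(text.strip().replace(",", ""))

def clean_row(row):
    """Keep month and date as text; convert the remaining cells to numbers."""
    month, day = row[0].strip(), row[1].strip()
    return [month, day] + [to_number(c) for c in row[2:]]

raw = ['\r\nJan19', '181227', '870.75', '876.00', '867.00', '869.00',
       '-1.00', '30,790', '67,772', '-11,500']
cleaned = clean_row(raw)
print(cleaned)
```

Cells that are blank or non-numeric would raise a ValueError in to_number, so a real version would need to decide how to treat them.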