
# Automating the Extraction of Financial Data for Stocks

## Goals

Data collection is a huge part of being a data professional. Very often, we build real-time analytics that must pull data continuously from different sources, or we need additional data to enrich our analyses. In some cases, this data is publicly available on the Internet, but it may be scattered across various websites that are updated continuously. Extracting this data manually is a very tedious task.

The action of extracting data from websites is commonly known as **web scraping**. The main goal of this case is to learn the basics of web scraping using Python and the `BeautifulSoup` library. By the end of this case, you will have learned some basics of Hypertext Markup Language (HTML) jargon and the most important methods of `BeautifulSoup`, which will help you get started with web scraping.

## Introduction

**Business Context.** You have recently joined the data science division of a multinational bank. Over the past month you've been working with a variety of stock data and are looking to gather fundamental data on a select group of energy stocks. Your firm is specifically interested in making investments in one of the following five energy sector companies:

1. Dominion Energy Inc. (Stock Symbol: D)
2. Exelon Corp. (Stock Symbol: EXC)
3. NextEra Energy Inc. (Stock Symbol: NEE)
4. Southern Co. (Stock Symbol: SO)
5. Duke Energy Corp. (Stock Symbol: DUK)

Your firm wants you to gather information about each stock's earnings-per-share (EPS), price-to-earnings ratio (PE ratio), and market capitalization data in order to make their investment decision. However, the firm has no experience doing this in an automated fashion, instead relying on time-consuming manual labor up to this point.

**Business Problem.** Your boss has posed the following question to you: **"How can we automate the gathering of earnings-per-share (EPS), price-to-earnings ratio (PE ratio), and market capitalization data?"**

**Analytical Context.**  In this case, you will learn the key skill of **web scraping** – the practice of automatically grabbing information off of online webpages, then parsing and transforming that information into a format amenable to further analysis.

In this case, you will: (1) learn the basics of HTML, which governs almost all static webpages; (2) parse a sample HTML document; (3) extract the necessary info from a single stock's HTML document; (4) scale this process to all symbols; and (5) learn how to scrape the contents of an HTML document from a live webpage in real time.

In [1]:
# Libraries needed for basic web scraping
from IPython.display import HTML, IFrame  # render HTML strings and files inside the notebook
from bs4 import BeautifulSoup
import urllib.request  # required to interact with live webpages; a bare `import urllib` does not guarantee the `request` submodule is loaded
import pandas as pd  # used to store the data from the webpage

## Basics of Hypertext Markup Language (HTML)

In order to automate the scraping and processing of stock data, you must become familiar with Hypertext Markup Language (HTML). HTML is a markup language for creating web pages and applications; you interact with it constantly while browsing the web, as the vast majority of pages are written in HTML. Some important points to keep in mind as we go over the basics of HTML:

1. HTML is traditionally used to design static (i.e. non-interactive) web pages
2. HTML uses a nested data structure with tags to instruct browsers how to display content
3. HTML is platform independent
4. HTML can be combined with other languages (e.g. JavaScript)
5. HTML can be created using any text editor

An HTML document is made up of a series of tags. These tags instruct a browser on how to display content to the user. Different tags will cause different output styles to be displayed.

Let's begin by discussing a simple HTML formatted string, ```custom_html_doc```.

In [2]:
custom_html_doc = """
<html>
<head>
<title>HTML Page Title</title>
</head>
<h1>Head: Important Header: Global News</h1>
<br>
<h2>Head: Less Imporant Header: Global News</h2>
<body>
<p class="title"><b>Paragraph: Financial news</b></p>
<p class="story"> Stocks had a volatile week, where
<a href="https://finance.yahoo.com/quote/duk/" target="_blank" class="stock" id="link1">DUK</a>,
<a href="https://finance.yahoo.com/quote/d/" target="_blank" class="stock" id="link2">D</a>,
<a href="https://finance.yahoo.com/quote/exc/" target="_blank" class="stock" id="link3">EXC</a>,
<a href="https://finance.yahoo.com/quote/nee/" target="_blank" class="etf" id="link4">NEE</a>,
<a href="https://finance.yahoo.com/quote/so/" target="_blank" class="stock" id="link5">SO</a>,
were all making headlines.</p>
<p class="details">End of HTML document.</p>
"""

While there is a wealth of tags available in HTML, the above example highlights the fundamentals we need to get started with the language. The four vital tags of any HTML document are:

1. `<html>` Instructs the browser that your web page is in HTML format.
2. `<head>` Holds webpage metadata, which external sources (such as search engines) can use.
3. `<title>` Sets the page title, which viewers see in the browser toolbar, when the page is added to favorites, and in search engine results.
4. `<body>` Defines the body block, which contains the content of the page, including text and images.

Other structurally useful tags include:

1. `<p>` Defines a paragraph block, which primarily contains text to be displayed to the user
2. `<a>` Defines a hyperlink
3. `<h1>` Defines an important header
4. `<h2>` Defines a less important header
5. `<br>` Defines a line break

We can view how ```custom_html_doc``` will render by passing it to IPython's ```HTML()```:

In [3]:
# View the HTML as it would appear in a web browser
HTML(custom_html_doc)

We see that the most important header < h1 > tag is responsible for the largest bold text in the document. Moreover, the paragraph tags cause the text nested in between their tags to be displayed in a regular-sized, non-bolded font. The hyperlink tags introduce the website links for each stock symbol (DUK, D, etc.).

### Exercise 1:

Do all tags in an HTML document require an end tag?

(a) Yes, all tags must be terminated for the browser to properly display the webpage

(b) No, there are some tags in HTML that do not require an end tag

**Answer.**

------------

### Exercise 2:

The following HTML document was found with all of its end tags missing. Starting from top to bottom, determine the correct order in which end tags should be added to eliminate the issues with the document.

```
<h1>This is a Heading
<p>This is a paragraph.
<br>
<p>Another paragraph
<br>
```

(a) < /h1 >, < /p >, < /br >, < /p >

(b) < /h1 >, < /p >,  < /br >, < /p >, < /br >

(c) < /h1 >, < /p >, < /p >

(d) < /h1 >, < /p >

**Answer.**

------------

Now that we've covered the basics of an HTML document, let's move forward and discuss methods of loading and extracting info from HTML documents in Python. Fortunately, Python offers the package ```BeautifulSoup``` to aid with this task.

## Using ```BeautifulSoup``` to navigate an HTML document

```BeautifulSoup``` transforms an HTML document into a navigable tree structure, which makes the document amenable to programmatic, automated parsing. Specifically, ```BeautifulSoup``` is a Python library that sits on top of an HTML parser, and:

1. Offers a variety of ways to search the HTML document
2. Allows you to make edits to the HTML document
3. Offers techniques to extract information from an HTML document

Let's begin by using ```BeautifulSoup``` to analyze ```custom_html_doc```.

### Parsing the simple HTML document

In ```BeautifulSoup```, each tag object corresponds to an HTML tag in the original document. Python's built-in ```html.parser``` is the standard choice for parsing a simple HTML-formatted string. We will also use the ```prettify()``` method to show the parsed HTML string with indents included, which illustrates how ```BeautifulSoup``` views the HTML document as a tree-structured hierarchy of tags:

In [4]:
# Use the standard html.parser to convert the HTML document into a BeautifulSoup data structure
soup = BeautifulSoup(custom_html_doc, 'html.parser')

# Print the HTML to the screen with indents included
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Page Title
  </title>
 </head>
 <h1>
  Head: Important Header: Global News
 </h1>
 <br/>
 <h2>
  Head: Less Imporant Header: Global News
 </h2>
 <body>
  <p class="title">
   <b>
    Paragraph: Financial news
   </b>
  </p>
  <p class="story">
   Stocks had a volatile week, where
   <a class="stock" href="https://finance.yahoo.com/quote/duk/" id="link1" target="_blank">
    DUK
   </a>
   ,
   <a class="stock" href="https://finance.yahoo.com/quote/d/" id="link2" target="_blank">
    D
   </a>
   ,
   <a class="stock" href="https://finance.yahoo.com/quote/exc/" id="link3" target="_blank">
    EXC
   </a>
   ,
   <a class="etf" href="https://finance.yahoo.com/quote/nee/" id="link4" target="_blank">
    NEE
   </a>
   ,
   <a class="stock" href="https://finance.yahoo.com/quote/so/" id="link5" target="_blank">
    SO
   </a>
   ,
were all making headlines.
  </p>
  <p class="details">
   End of HTML document.
  </p>
 </body>
</html>


Here we see that ```BeautifulSoup``` has fully read in the HTML document string ```custom_html_doc```. Let's take a look at a few of the basic ```BeautifulSoup``` features to view the contents inside of ```soup```.

First, we can select tags by name using dot notation, i.e. a ```.``` followed by the tag name:

In [5]:
# Select the first 'a' tag in the soup (by default the first appearance of a tag is selected)
tag = soup.a

# Print the tag
print(tag)

<a class="stock" href="https://finance.yahoo.com/quote/duk/" id="link1" target="_blank">DUK</a>


In [6]:
# Show the type of the tag
type(tag)

bs4.element.Tag

Notice how the ```tag``` above has the type ```bs4.element.Tag```. This is the object inside which ```BeautifulSoup``` stores tags.

```BeautifulSoup``` tags have attributes and methods. Attributes are essentially properties of the tag object, whereas methods are ways to call functions on the tag object. Let's take a look at a couple of examples of tag properties: 

In [10]:
print("Tag's name: ", tag.name) #The name of the tag.
print("Tag's text: ", tag.text) #Extract the text embedded in the tag.
print("Tag's parent name: ", tag.parent.name) #Our tag is inside a <p> element. 

Tag's name:  a
Tag's text:  DUK
Tag's parent name:  p


Importantly, tags can have multiple HTML attributes. We can access these attributes using the ```attrs``` property of the tag.

In [11]:
# Show tag attributes
print(tag.attrs)

{'href': 'https://finance.yahoo.com/quote/duk/', 'target': '_blank', 'class': ['stock'], 'id': 'link1'}


HTML attributes configure how an element looks and behaves on the webpage: `class` and `id` are hooks that Cascading Style Sheets (CSS) rules and scripts use to select the element, while `href` and `target` control where the link points and where it opens. We can access and modify these attributes very easily:

In [12]:
# Access the hyperlink attribute (use tag like a dictionary to access)
print(tag['href'])

https://finance.yahoo.com/quote/duk/


In [13]:
# Add a new attribute (modifies soup)
tag['new_attr'] = 100

# Look at results
print(tag.attrs)

{'href': 'https://finance.yahoo.com/quote/duk/', 'target': '_blank', 'class': ['stock'], 'id': 'link1', 'new_attr': 100}


### Extracting all tags of a certain kind from an HTML file

As we saw earlier, if we simply use ```soup.tag_name```, for some `tag_name` of interest such as a hyperlink tag ```a```, we only receive the first tag back. How do we receive all the tags of a certain kind in a document?

Fortunately, ```BeautifulSoup``` provides the ability to navigate its data structure through a variety of search methods. The one that returns all tags of a given type is ```find_all()```:

In [14]:
# View all hyperlink tags in custom_html_doc
soup.find_all('a')

[<a class="stock" href="https://finance.yahoo.com/quote/duk/" id="link1" new_attr="100" target="_blank">DUK</a>,
 <a class="stock" href="https://finance.yahoo.com/quote/d/" id="link2" target="_blank">D</a>,
 <a class="stock" href="https://finance.yahoo.com/quote/exc/" id="link3" target="_blank">EXC</a>,
 <a class="etf" href="https://finance.yahoo.com/quote/nee/" id="link4" target="_blank">NEE</a>,
 <a class="stock" href="https://finance.yahoo.com/quote/so/" id="link5" target="_blank">SO</a>]

Notice that the `find_all()` method returns a list! This is convenient, as we can then iterate over it with a loop to extract the desired information.
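
`find_all()` can also filter on attributes. As a minimal sketch (the two-link `sample_html` string below is a made-up stand-in for ```custom_html_doc```), passing `class_` keeps only the tags that carry that class:

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, standing in for custom_html_doc
sample_html = """
<p class="story">
<a href="https://finance.yahoo.com/quote/duk/" class="stock" id="link1">DUK</a>,
<a href="https://finance.yahoo.com/quote/nee/" class="etf" id="link4">NEE</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ (with a trailing underscore) is used because `class` is a Python keyword
stock_links = sample_soup.find_all('a', class_='stock')
stock_symbols = [link.text for link in stock_links]
print(stock_symbols)
```

Here only the link tagged `stock` is returned; the `etf` link is skipped.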

### Exercise 3:

Write a script to print all of the hyperlinks present in ```soup```. 

**Answer.**

In [19]:
print(*[item["href"] for item in soup.find_all('a')],sep="\n")

https://finance.yahoo.com/quote/duk/
https://finance.yahoo.com/quote/d/
https://finance.yahoo.com/quote/exc/
https://finance.yahoo.com/quote/nee/
https://finance.yahoo.com/quote/so/


------------

Notice that this ```BeautifulSoup``` structure greatly simplifies parsing an HTML document. The document has been encoded as a simple navigable tree, with operations to access each subpart of the full document.

### Exercise 4:

From ```custom_html_doc``` above, use ```BeautifulSoup``` to print the symbol, class, and href attributes for all < a > tags. For example, the first line of output should print:

```python
DUK,stock,https://finance.yahoo.com/quote/duk/
```

**Answer.**

In [23]:
print(*["{},{},{}".format(item.text,item["class"][0],item["href"]) for item in soup.find_all('a')],sep="\n")

DUK,stock,https://finance.yahoo.com/quote/duk/
D,stock,https://finance.yahoo.com/quote/d/
EXC,stock,https://finance.yahoo.com/quote/exc/
NEE,etf,https://finance.yahoo.com/quote/nee/
SO,stock,https://finance.yahoo.com/quote/so/


------------

## Processing an HTML document corresponding to a real webpage

Recall that we are interested in scraping fundamental stock data off of Yahoo! Finance in order to facilitate making a stock recommendation. We are specifically interested in a company's EPS, PE ratio, and market capitalization.

We have pre-downloaded real Yahoo! Finance webpages and saved the HTML files for each of the five energy sector symbols under study. We will first focus on Duke Energy Corporation, an electric power holding company with the stock symbol DUK. Let's render the webpage in the notebook using ```IFrame``` and take a look at its contents.

In [24]:
# IFrame will allow us to view the HTML document
IFrame(src='DUK_Yahoo.html', width=800, height=400)

Scrolling over the IFrame viewer in the notebook, we see that the webpage for DUK indeed contains a variety of fundamental data quantities, including market capitalization, PE ratio, and EPS (as well as other information like beta, average volume, and forward dividend yield). Hence, this webpage will suffice for our analysis. 

Let's use ```BeautifulSoup``` to analyze this HTML document and extract the fundamental data:

In [26]:
# Open a file and pass the file handle (here file handle is f) to BeautifulSoup
file_name = 'DUK_Yahoo.html'
with open(file_name, encoding='utf8') as f:  # explicit encoding; the Windows default may fail to decode the file
    stock_soup = BeautifulSoup(f, 'html.parser')

**Note:** The `encoding='utf8'` option matters mainly on Windows, where the default encoding may fail to load the webpage into Python.

In [27]:
# Look at first 1000 characters to see head of the document (don't want to print too much or it's messy)
print(stock_soup.prettify()[:1000])

<!DOCTYPE html>
<html class="NoJs chrome desktop" id="atomic" lang="en-US">
 <head prefix="og: http://ogp.me/ns#">
  <script>
   window.performance && window.performance.mark && window.performance.mark('PageStart');
  </script>
  <meta charset="utf-8"/>
  <title>
   DUK : Summary for Duke Energy Corporation (Holdin - Yahoo Finance
  </title>
  <meta content="DUK, Duke Energy Corporation (Holdin, DUK stock chart, Duke Energy Corporation (Holdin stock chart, stock chart, stocks, quotes, finance" name="keywords"/>
  <meta content="on" http-equiv="x-dns-prefetch-control"/>
  <meta content="on" property="twitter:dnt"/>
  <meta content="90376669494" property="fb:app_id"/>
  <meta content="#400090" name="theme-color"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="View the basic DUK stock chart on Yahoo Finance. Change the date range, chart type and compare Duke Energy Corporation (Holdin against other companies." lang="en-US" name="description"/>
  

We see here that while the structure of a real webpage is more complex than our sample document from earlier, it still retains the same HTML structure: a series of nested tags, each of which identifies a different element of the document.

One useful tool for debugging and checking the components in an HTML document is to look at the number of occurrences of every type of tag in the document.

### Exercise 5:

Write a script to determine the number of occurrences of every tag in ```stock_soup```. Print to the screen a dictionary where each key is a tag name, and the corresponding value for each key is the number of occurrences of that particular tag.

**Hint:** You can use `stock_soup.find_all()` to create a list containing ALL tags present in the HTML file.

**Answer.**

In [38]:
stock_soup.find_all()[6]

<meta content="on" http-equiv="x-dns-prefetch-control"/>
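
The cell above only inspects a single element of that list. A minimal sketch of the counting step itself, shown on a made-up miniature document (`mini_html` is hypothetical; the same pattern would apply to ```stock_soup```):

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical miniature document; stock_soup would be used in the same way
mini_html = "<html><body><p>One</p><p>Two <a href='#'>link</a></p><br></body></html>"
mini_soup = BeautifulSoup(mini_html, 'html.parser')

# find_all() with no arguments returns every tag; Counter tallies the tag names
tag_counts = dict(Counter(tag.name for tag in mini_soup.find_all()))
print(tag_counts)
```
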

Counting tags in this way is often used to diagnose missing components of a web page. Let's continue by moving on to extract our first fundamental data quantity of interest: market capitalization.

### Question:

Imagine you are hired as a data scientist who needs to collect data using web scraping every day for a certain period of time. Why would Exercise 5 be helpful in such a situation?

## Extracting market capitalization from the HTML document

Viewing the DUK stock's HTML document, we see there is a table that contains our fundamental data of interest. In HTML, the ```<td>``` tag defines a cell in a table. Since we know the stock market data is stored in a table in the HTML document, we choose to look at all the table cell tags:

In [57]:
# List every table cell; the next step narrows the search to the market capitalization cell for DUK
stock_soup.find_all("td")

[<td class="C(black) W(51%)" data-reactid="38"><span data-reactid="39">Previous Close</span></td>,
 <td class="Ta(end) Fw(b) Lh(14px)" data-reactid="40" data-test="PREV_CLOSE-value"><span class="Trsdu(0.3s)" data-reactid="41">85.27</span></td>,
 <td class="C(black) W(51%)" data-reactid="43"><span data-reactid="44">Open</span></td>,
 <td class="Ta(end) Fw(b) Lh(14px)" data-reactid="45" data-test="OPEN-value"><span class="Trsdu(0.3s)" data-reactid="46">85.21</span></td>,
 <td class="C(black) W(51%)" data-reactid="48"><span data-reactid="49">Bid</span></td>,
 <td class="Ta(end) Fw(b) Lh(14px)" data-reactid="50" data-test="BID-value"><span class="Trsdu(0.3s)" data-reactid="51">84.19 x 900</span></td>,
 <td class="C(black) W(51%)" data-reactid="53"><span data-reactid="54">Ask</span></td>,
 <td class="Ta(end) Fw(b) Lh(14px)" data-reactid="55" data-test="ASK-value"><span class="Trsdu(0.3s)" data-reactid="56">84.80 x 800</span></td>,
 <td class="C(black) W(51%)" data-reactid="58"><span data-re

Since there are so many table cells, we need to narrow our search. To find the market capitalization value, open the HTML file in your browser (or directly in your notebook) and inspect the value to the right of the `Market Cap` label (right-click -> Inspect). You should see the following HTML code in your browser:

```html
<td class="Ta(end) Fw(b) Lh(14px)" data-test="MARKET_CAP-value" data-reactid="81"><span class="Trsdu(0.3s) " data-reactid="82">60.317B</span></td>
```

We see that the market capitalization value is inside a `<td>` element that has the attribute `data-test="MARKET_CAP-value"`. We can use this identifier to locate the market capitalization value using the method `find()`:

In [58]:
stock_soup.find("td", {"data-test" : 'MARKET_CAP-value'}).text

'60.317B'

Different web pages name different elements differently, so any parsing analysis will need to be customized to a specific website's structure. However, once the rules of a given webpage are established, then parsing becomes much easier as you can employ the power of ```BeautifulSoup```.
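
Because identifiers differ between sites (and can change over time), it helps to guard against a lookup that finds nothing: `find()` returns `None` when no tag matches. The sketch below illustrates this with a hypothetical helper, `get_data_test_value`, and a made-up one-row table modeled on the snippet above:

```python
from bs4 import BeautifulSoup

def get_data_test_value(soup, identifier):
    """Return the text of the <td> with the given data-test value, or None."""
    cell = soup.find("td", {"data-test": identifier})
    # find() returns None when nothing matches, so guard before reading .text
    return cell.text.strip() if cell is not None else None

# Hypothetical one-row table modeled on the Market Cap cell shown above
row_html = '<table><tr><td data-test="MARKET_CAP-value"><span>60.317B</span></td></tr></table>'
row_soup = BeautifulSoup(row_html, 'html.parser')

market_cap = get_data_test_value(row_soup, "MARKET_CAP-value")
missing = get_data_test_value(row_soup, "NOT_A_REAL_FIELD-value")
print(market_cap, missing)
```

Without the guard, the second lookup would raise an `AttributeError` when `.text` is read from `None`.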

Let's practice extracting basic elements of an HTML document:

### Exercise 6:

Write a script to print all of the available ```data-test``` identifiers present in the table (i.e. present in the ```td``` tags of ```stock_soup```). Your output should print:

```
PREV_CLOSE-value
OPEN-value
BID-value
ASK-value
DAYS_RANGE-value
FIFTY_TWO_WK_RANGE-value
TD_VOLUME-value
AVERAGE_VOLUME_3MONTH-value
MARKET_CAP-value
BETA_3Y-value
PE_RATIO-value
EPS_RATIO-value
EARNINGS_DATE-value
DIVIDEND_AND_YIELD-value
EX_DIVIDEND_DATE-value
ONE_YEAR_TARGET_PRICE-value
```

**Answer.**

------------

### Exercise 7:

Print the Bid, Ask, Volume, and Average Volume of the stock symbol DUK in ```DUK_Yahoo.html```.

**Answer.**

------------

## Search and process multiple HTML documents

Now we'd like to automate the parsing task above for all 5 symbols. We'd like to build a function that can parse ANY stock symbol using a systematic method to extract information. This automation will speed up future data analysis and increase productivity. Let's take a look at how to perform this task:

In [23]:
# Define a list of symbols that we'd like to parse
symbol_list = ['NEE','DUK','D','SO','EXC'] # list of stock symbols of interest

In [24]:
def process_yahoo(symbol):
    # Load the previously downloaded file
    file_name = symbol + '_Yahoo.html'
    with open(file_name, encoding='utf8') as f:  # explicit encoding, matching the earlier DUK example
        s = BeautifulSoup(f, 'html.parser')
    
    # Parse the specific stock data of interest and store in a dictionary object
    info_dict = {'MARKET_CAP' : s.find("td", {"data-test" : 'MARKET_CAP-value'}).text}
    
    return info_dict

# Loop through all the symbols, applying the parsing function to each of the symbol's corresponding HTML file
fundamental_dict = {}
for sym in symbol_list:
    fundamental_dict[sym] = process_yahoo(sym)

In [25]:
# Look at the result
fundamental_dict

{'NEE': {'MARKET_CAP': '83.98B'},
 'DUK': {'MARKET_CAP': '60.317B'},
 'D': {'MARKET_CAP': '52.519B'},
 'SO': {'MARKET_CAP': '47.957B'},
 'EXC': {'MARKET_CAP': '44.279B'}}

Here we see that through the use of one function, we can now systematically parse stock information for symbols of our choosing. This has powerful implications for the efficiency of subsequent data analysis.

### Exercise 8:

Modify the ```process_yahoo()``` function to process and return all three fundamental data quantities of interest, namely the market capitalization, PE ratio, and EPS. The function should return a dictionary where the keys are the ```data-test``` identifiers, and the values are the corresponding fundamental data. Loop through all the symbols, applying the parsing function to each symbol's corresponding HTML file and print each dictionary of fundamental data to the screen.

**Answer.**

### Exercise 9:

After obtaining the preliminary results with the three fundamental data quantities of interest, your manager has requested that you add additional statistics to help determine the liquidity of the stock relative to its average. This will help indicate if a stock has been trading at higher or lower volumes recently. Write a function named ```scrape_volume_ratio``` that takes a symbol name string as an input, and returns the volume ratio, where volume ratio = volume / average volume. All of the data needed to calculate this ratio is available in the HTML documents for each symbol.

**Hints:**

1. If you need to remove commas from a string, use the `replace()` method on that string

2. Once commas are removed, you can convert the string to a float using the built-in `float()` function
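
A minimal sketch of the two hints combined, using a made-up volume string rather than a value taken from the HTML files:

```python
# Hypothetical volume string, formatted with thousands separators
raw_volume = "1,234,567"

# Strip the commas first, then convert the cleaned string to a float
volume = float(raw_volume.replace(",", ""))
print(volume)
```
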

Once you've defined the function, loop through all the symbols, applying the parsing function to each symbol's corresponding HTML file. The resulting output should print:

{'NEE': 0.9837109088236352,
 'DUK': 0.7994789356696934,
 'D': 1.2231660648789393,
 'SO': 1.0092279816663878,
 'EXC': 0.8167456931073666}

**Answer.**

Now that we've extracted the required fundamental data from our saved HTML documents, let's take a quick look at how to perform web scraping on a live webpage in real time.

## Live web scraping of fundamental stock data

**IMPORTANT: Be careful not to get blocked by a website due to excessive scraping. Do not run a loop that continually scrapes a webpage, or the website may block you for sending too many requests.**
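
One common way to limit the request rate is to pause between downloads. The sketch below is only an illustration: `fetch_html`, `fetch_politely`, and the five-second default delay are hypothetical choices, not values recommended by any particular website.

```python
import time
import urllib.request

def fetch_html(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')

def fetch_politely(urls, delay_seconds=5.0, fetch=fetch_html):
    """Fetch each URL once, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages
```

The `fetch` parameter makes it possible to substitute a stand-in downloader, so the pacing logic can be tried without contacting a real site.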

Let's explore scraping data from a Yahoo! Finance page. **(NOTE: Do NOT run this code block as having everyone do it at once may cause you to get blocked.)**

In [28]:
# Scrape data from website
site_url='https://finance.yahoo.com/quote/DUK?p=DUK'
r = urllib.request.urlopen(site_url)
site_content = r.read().decode('utf-8')

# Saving scraped HTML to .html file (for later processing)
with open('saved_page.html', 'w', encoding='utf-8') as f:  # write with the same encoding used to decode the page
    f.write(site_content)

# Use html.parser to create soup
s = BeautifulSoup(site_content, 'html.parser')

In [29]:
# Look at the soup object by using prettify() method
print(s.prettify()[:500]) # Only show portion of text as it is very long

<!DOCTYPE html>
<html class="NoJs featurephone" id="atomic" lang="en-US">
 <head prefix="og: http://ogp.me/ns#">
  <script>
   window.performance && window.performance.mark && window.performance.mark('PageStart');
  </script>
  <meta charset="utf-8"/>
  <title>
   Duke Energy Corporation (Holdin (DUK) Stock Price, Quote, History &amp; News
  </title>
  <meta content="DUK, Duke Energy Corporation (Holdin, DUK stock chart, Duke Energy Corporation (Holdin stock chart, stock chart, stocks, quotes, f


Here we've used the ```urllib``` package to request the website to send us its HTML document. We then passed this HTML document into ```BeautifulSoup``` for parsing. Let's extract the three fundamental data quantities using live web scraping.

### Grabbing data for all five stocks

**(NOTE: Do NOT run this code block as having everyone do it at once may cause you to get blocked.)**

In [30]:
symbol_list = ['NEE','DUK','D','SO','EXC'] # stocks of interest

def scrape_yahoo(symbol):
    symbol_url='https://finance.yahoo.com/quote/' + symbol
    MARKET_CAP= "MARKET_CAP"
    PE_RATIO = "PE_RATIO"
    EPS_RATIO = "EPS_RATIO"
    
    # Scrape
    r = urllib.request.urlopen(symbol_url)
    c = r.read().decode('utf-8')
    s = BeautifulSoup(c, 'html.parser')
    
    info_dict = {MARKET_CAP : s.find("td", {"data-test" : MARKET_CAP+'-value'}).text,
                 PE_RATIO : s.find("td", {"data-test" : PE_RATIO+'-value'}).text,
                 EPS_RATIO : s.find("td", {"data-test" : EPS_RATIO+'-value'}).text
                }
    
    return info_dict

# Scrape the data, and store in a dictionary
symbol_dict = {}
for symbol in symbol_list:
    print("Scraping Symbol: " + symbol)
    symbol_dict[symbol] = scrape_yahoo(symbol)
    
# Display the parsed data
fundamental_df = pd.DataFrame.from_dict(symbol_dict, orient='index')
fundamental_df

Scraping Symbol: NEE
Scraping Symbol: DUK
Scraping Symbol: D
Scraping Symbol: SO
Scraping Symbol: EXC


| | MARKET_CAP | PE_RATIO | EPS_RATIO |
|---|---|---|---|
| D | 65.251B | 67.97 | 1.2 |
| DUK | 70.084B | 21.28 | 4.52 |
| EXC | 46.617B | 20.5 | 2.34 |
| NEE | 113.405B | 33.34 | 6.96 |
| SO | 64.679B | 14.57 | 4.25 |


Now that we've acquired the data, here is how the data is used to make recommendations:

1. A higher EPS value is seen as more attractive from an investment standpoint
2. PE ratios are often compared among stocks in the same industry. Within a single industry, the lower the PE ratio, the more undervalued the stock generally is
3. Market capitalization is important as it signals the size of the company. Smaller companies are more speculative and generally riskier

The firm would like to invest in the stock with the lowest PE ratio and the highest EPS which still has a market capitalization of at least 10 billion. Recall that from the static HTML files:

```
{'NEE': {'MARKET_CAP': '83.98B', 'PE_RATIO': '10.00', 'EPS_RATIO': '17.57'},
 'DUK': {'MARKET_CAP': '60.317B', 'PE_RATIO': '20.61', 'EPS_RATIO': '4.11'},
 'D': {'MARKET_CAP': '52.519B', 'PE_RATIO': '14.57', 'EPS_RATIO': '4.80'},
 'SO': {'MARKET_CAP': '47.957B', 'PE_RATIO': '19.45', 'EPS_RATIO': '2.40'},
 'EXC': {'MARKET_CAP': '44.279B', 'PE_RATIO': '11.91', 'EPS_RATIO': '3.84'}}
```
 
Here we see the best investment for the firm (using the metrics outlined) is to invest in NEE, as it has the lowest PE ratio and the highest EPS, while maintaining a market capitalization above 10 billion.
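
The selection rule can also be expressed in code. The sketch below is one possible reading of it (filter by market capitalization, then prefer the lowest PE ratio and break ties with the highest EPS); the `parse_market_cap` helper and the `M`/`B`/`T` suffix table are assumptions introduced for illustration, and the data is copied from the static results above.

```python
# Static results copied from the dictionary shown above
static_data = {
    'NEE': {'MARKET_CAP': '83.98B', 'PE_RATIO': '10.00', 'EPS_RATIO': '17.57'},
    'DUK': {'MARKET_CAP': '60.317B', 'PE_RATIO': '20.61', 'EPS_RATIO': '4.11'},
    'D': {'MARKET_CAP': '52.519B', 'PE_RATIO': '14.57', 'EPS_RATIO': '4.80'},
    'SO': {'MARKET_CAP': '47.957B', 'PE_RATIO': '19.45', 'EPS_RATIO': '2.40'},
    'EXC': {'MARKET_CAP': '44.279B', 'PE_RATIO': '11.91', 'EPS_RATIO': '3.84'},
}

# Assumed suffix convention: M = millions, B = billions, T = trillions
SUFFIXES = {'M': 1e6, 'B': 1e9, 'T': 1e12}

def parse_market_cap(text):
    """Convert a string such as '60.317B' into a plain number."""
    return float(text[:-1]) * SUFFIXES[text[-1]]

# Keep only the companies with a market capitalization of at least 10 billion
eligible = {
    sym: data for sym, data in static_data.items()
    if parse_market_cap(data['MARKET_CAP']) >= 10e9
}

# Lowest PE ratio first; ties are broken by the highest EPS
best = min(
    eligible,
    key=lambda sym: (float(eligible[sym]['PE_RATIO']), -float(eligible[sym]['EPS_RATIO'])),
)
print(best)
```
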

Given that we just scraped live data for these stocks, let's take a look to see if our investment decision has changed with the updated data.

## Conclusions

In this case, we've introduced a framework for automating web scraping tasks to produce a stock recommendation based on fundamental data. This general web scraping framework can be customized to address a user's unique needs on data requirements and parsing requirements.

We found that web scraping the HTML document for the five energy sector symbols required an analysis of the structure and content of the HTML documents to parse out the three fundamental data quantities of interest: market capitalization, PE ratio, and EPS. We utilized these three statistics alongside the firm's investment objectives to arrive at a recommendation to invest in the stock NEE.

## Takeaways

In this case, you learned the basics of HTML, ```BeautifulSoup```, and `urllib`. You found that ```BeautifulSoup``` greatly simplifies HTML parsing and extraction of useful information.

```BeautifulSoup``` is a library that has a vast array of capabilities that extend far beyond what is covered here. Hence, we encourage anyone looking to do more advanced web scraping to explore some of the more complex methods available in the library. The contents covered in this case should serve as an excellent base to build upon.