# CIV1498 - Introduction to Data Science
## Lecture 3.2 - Importing Data from Different Sources

### Lecture Structure
1. [Structured Text Data](#section1)
2. [Unstructured Text Data](#section2)
3. [JSON](#section3)
4. [XML](#section4)
5. [HDF5](#section5)
6. [API](#section6)
7. [HTML](#section7)

## Setup Notebook

In [None]:
# Import standard library and 3rd party libraries
import os
import json 
import requests
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import xml.etree.ElementTree as ET

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")

### Install and Import h5py

In [None]:
!pip install h5py

In [None]:
import h5py

<a id='section1'></a>
## 1. Structured Text Data
Text files are human readable, in contrast to binary files, and can be opened using any text editor (Notepad, Sublime, etc.). Tabular (structured) data stored as a text file is usually delimited using commas `,`, tabs `\t`, spaces ` `, or pipes `|`. These characters separate the columns (fields) of a table, and any character can serve as the delimiter as long as it is used consistently and does not appear unescaped inside the field values. Rows are delimited by a newline character `\n` at the end of each row.
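
To make the role of the delimiter concrete, here is a minimal sketch using Python's built-in `csv` module on in-memory text. The sample rows and the helper name `parse_delimited` are made up for illustration; they are not taken from the grocery dataset used below.

```python
import csv
import io

def parse_delimited(text, delimiter):
    """Split delimited text into a list of rows, each a list of field strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Hypothetical sample data: the same small table written with three delimiters
comma_text = "member,date,item\n1808,21-07-2015,tropical fruit\n"
tab_text = "member\tdate\titem\n1808\t21-07-2015\ttropical fruit\n"
pipe_text = "member|date|item\n1808|21-07-2015|tropical fruit\n"

rows_comma = parse_delimited(comma_text, ",")
rows_tab = parse_delimited(tab_text, "\t")
rows_pipe = parse_delimited(pipe_text, "|")

# All three parse to the same header row and data row
print(rows_comma)
print(rows_comma == rows_tab == rows_pipe)
```

Only the `delimiter` argument changes between the three calls, which is the same idea as the `sep` argument of `pd.read_csv()`.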

### Comma-Separated Values (CSV)
Let's try opening a `.csv`. This file contains 38765 rows of purchase orders from people at a grocery store. 

In [None]:
# Read the raw text; the with-statement closes the file automatically
with open('groceries_dataset.csv', 'r') as f:
    raw_text = f.read()
raw_text[0:1000]

In [None]:
groceries = pd.read_csv('groceries_dataset.csv')
groceries.head(10)

### Tab-Separated Values (TSV)
Let's try opening a `.tsv`. 

In [None]:
# Read the raw text; the with-statement closes the file automatically
with open('groceries_dataset.tsv', 'r') as f:
    raw_text = f.read()
raw_text[0:1000]

In [None]:
groceries = pd.read_csv('groceries_dataset.tsv', sep='\t')
groceries.head(10)

### Pipe-Separated Values (PSV) ??
Let's try opening a `.psv` (I just made this up). 

In [None]:
# Read the raw text; the with-statement closes the file automatically
with open('groceries_dataset.psv', 'r') as f:
    raw_text = f.read()
raw_text[0:1000]

In [None]:
groceries = pd.read_csv('groceries_dataset.psv', sep='|')
groceries.head(10)

Most of the time, you'll be dealing with `.csv` files when working with tabular files.

<a id='section2'></a>
## 2. Unstructured Text Data
The [NOAA - Great Lakes Environmental Research Laboratory](https://www.glerl.noaa.gov/res/glcfs/) (GLERL) dataset contains forecasts and measurements for Ice Cover, Wave Height, Current Direction, Wind Speed, and others. 
<br>
<img src="images/noaa.gif" alt="drawing" width="700"/>
<br>
The file is human readable; however, it is not tabular and therefore cannot be easily imported into a Pandas DataFrame. Don't believe me? Let's give it a try.

First, let's take a look at the file in a text editor. I recommend using [Sublime](https://www.sublimetext.com/). From the text editor, we can see that the fields seem to be delimited by tabs or spaces, so let's use `pd.read_csv()` to import our file.

In [None]:
noaa = pd.read_csv('e202027712.0.wav', sep='\t')
noaa.head(10)

Something doesn't look right here. Let's try opening the file using the `open()` function, which returns a file object with helpful methods (`.read()`, `.readline()`) for reading the content of the file.

In [None]:
noaa = open('e202027712.0.wav', 'r')
print(noaa)

We can see that the `noaa` variable is a file object. File objects have a method, `.readline()`, for reading one line of a file at a time.

In [None]:
print(noaa.readline())

In [None]:
print(noaa.readline())

In [None]:
print(noaa.readline())

In [None]:
print(noaa.readline())

We can also use `.readline()` in a `for` loop.

In [None]:
for _ in range(10):
    print(noaa.readline())

If we want to return to the beginning of the file, we can open the file again (calling `noaa.seek(0)` on the open file object would also work).

In [None]:
noaa = open('e202027712.0.wav', 'r')
for _ in range(10):
    print(noaa.readline())

The `.read()` method, on the other hand, reads the whole file as a single string, which allows for relatively easy file-wide manipulations. Below, let's display the first 1000 characters from the file.

In [None]:
noaa = open('e202027712.0.wav', 'r')
noaa.read()[0:1000]

And now let's print the first 1000 characters from the file. With `print()`, each `\n` newline character is rendered as an actual line break.

In [None]:
noaa = open('e202027712.0.wav', 'r')
print(noaa.read()[0:1000])
noaa.close()
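
The file-object methods used in this section can be sketched together in one self-contained example. It writes a small temporary file first (the file content is made up), then reads it with `.readline()`, rewinds with `.seek(0)`, and iterates over the lines. A `with` block is used so the file is closed automatically.

```python
import os
import tempfile

# Write a small hypothetical text file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("header line\n1 2 3\n4 5 6\n")
    path = tmp.name

with open(path, 'r') as f:
    first = f.readline()        # reads up to and including the first '\n'
    second = f.readline()       # the file position has advanced to line 2
    f.seek(0)                   # move back to the start without reopening
    first_again = f.readline()
    f.seek(0)
    # A file object is iterable, yielding one line at a time
    all_lines = [line.rstrip('\n') for line in f]

os.remove(path)                 # clean up the temporary file

print(repr(first), repr(second))
print(all_lines)
```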

<a id='section3'></a>
## 3. JSON
JavaScript Object Notation (JSON) is a lightweight data format that is easy for humans to read and write and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard, but it is a text format that is programming language independent and can be easily parsed using Python, Ruby, Perl, and many others.

JSON is built on two structures:

1. A collection of name/value pairs. This is realized as an object, record, dictionary, hash table, keyed list, or associative array.
2. An ordered list of values. This is realized as an array, vector, list, or sequence.

There are several third-party packages for working with JSON files; however, Python includes a built-in `json` module, which we imported at the start of this notebook.

```python
import json
```

The JSON format is used primarily to transmit data between a server and a web application, which we'll see more of in Section 6 [APIs](#section6) and Section 7 [HTML](#section7). As an example, we'll be working with the [Twitter](https://twitter.com/) JSON structure for tweets.

Let's take a look at the Twitter tweet below,
<br>
<img src="images/tweet.png" alt="drawing" width="500"/>
<br>
and the associated JSON file.
```json
{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {   
  },
  "entities": {
    "hashtags": [      
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [     
    ]
  }
}
```
You'll notice that the data structure is very similar to Python Dictionaries. 

#### Import
First, let's learn how to load a JSON file into Python as a Dictionary. There are two methods for parsing JSON, `json.load()` and `json.loads()`. 

**`json.load()`** can deserialize a file (it accepts a file object).

In [None]:
# Open the file in a with-statement so it is closed after parsing
with open('tweet.json') as f:
    tweet = json.load(f)
tweet

The following is a useful function for printing JSONs with the correct indentation to improve readability.

In [None]:
def json_print(obj):
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

In [None]:
json_print(tweet)

The variable `tweet` is now a Python Dictionary.

In [None]:
type(tweet)
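
Because the parsed tweet is an ordinary dictionary, nested values can be reached by chaining keys and list indices. The sketch below uses a hand-copied subset of the tweet shown earlier (the name `tweet_subset` is introduced here only for illustration) so that it runs without `tweet.json`.

```python
# Abbreviated, hand-copied subset of the tweet shown above
tweet_subset = {
    "id_str": "850006245121695744",
    "user": {"name": "Twitter Dev", "screen_name": "TwitterDev"},
    "entities": {
        "hashtags": [],
        "urls": [{"url": "https://t.co/XweGngmxlP"}],
    },
}

# JSON objects became dicts, so nested fields are reached key by key
screen_name = tweet_subset["user"]["screen_name"]

# JSON arrays became lists, so elements are reached by index
first_url = tweet_subset["entities"]["urls"][0]["url"]

# .get() returns a default instead of raising KeyError when a key is missing
place = tweet_subset.get("place", {})
n_hashtags = len(tweet_subset["entities"]["hashtags"])

print(screen_name, first_url, place, n_hashtags)
```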

**`json.loads()`** deserializes a string (the `s` stands for String - "load string").

In [None]:
json_string = '{"name": "Sebastian", "age": 35}'
json_dict = json.loads(json_string)
json_dict

#### Save
Like with importing JSONs, there are two functions for saving JSON files, `json.dump()` and `json.dumps()`.

**`json.dump()`** serializes a Python object and writes it to a file as JSON-formatted data. For example, if we take the `tweet` dictionary, we can save it to a file using `json.dump()`.

In [None]:
with open('test_file.json', 'w') as f:
    json.dump(tweet, f)

**`json.dumps()`** encodes a Python object into a JSON-formatted string. For example, if we take the `tweet` dictionary, we can convert it to a string using `json.dumps()`.

In [None]:
json_string = json.dumps(tweet)
json_string 

Although they appear similar, there are some important differences between Python Dictionaries and JSONs.

| Python | JSON   |
|--------|--------|
| dict   | Object |
| list   | Array  |
| tuple  | Array  |
| str    | String |
| int    | Number |
| float  | Number |
| True   | true   |
| False  | false  |
| None   | null   |

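For example, this means that JSON cannot distinguish between a tuple and a list: both will be encoded as an array, and an array is decoded back into a Python list. Similarly, types that are missing from the table, such as a NumPy array, cannot be encoded by `json.dumps()` directly. Let's give it a try.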

In [None]:
sample_data = {'counts': np.array([0, 5, 4, 10, 45])} 
json_string = json.dumps(sample_data)
json_string 
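
The cell above is expected to fail with a `TypeError`, because a NumPy array has no JSON equivalent. The sketch below illustrates the same behaviour with built-in types only: a tuple survives the round trip as a list, and a `set` (used here purely as a stand-in for any unsupported type) raises `TypeError` until it is converted to a supported type. For a NumPy array, the analogous conversion is its `.tolist()` method.

```python
import json

# A tuple is written as a JSON array, so it comes back as a list
original = {'point': (1, 2), 'values': [3, 4]}
round_trip = json.loads(json.dumps(original))
print(round_trip)

# Types with no JSON equivalent (here a set, as a stand-in) raise TypeError
try:
    json.dumps({'tags': {'a'}})
    raised = False
except TypeError:
    raised = True
print(raised)

# Converting to a supported type first avoids the error
encoded = json.dumps({'tags': sorted({'b', 'a'})})
print(encoded)
```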

You can learn more about the `json` module by visiting its official [page](https://docs.python.org/3.6/library/json.html) on the Python website.

<a id='section4'></a>
## 4. XML
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is similar to HTML. Python contains a native module for reading XML which we imported at the start of the notebook.
```python
import xml.etree.ElementTree as ET
```
Below is an example of the XML file structure containing the names of employees at a company.
```xml
<employees>
  <employee>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </employee>
  <employee>
    <firstName>Anna</firstName> <lastName>Smith</lastName>
  </employee>
  <employee>
    <firstName>Peter</firstName> <lastName>Jones</lastName>
  </employee>
</employees>
```
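
As a quick sketch of how this structure maps onto `ElementTree`, the employee example above can be parsed directly from a string with `ET.fromstring()`. The variable names are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# The employee example from above, stored as a string
employees_xml = """<employees>
  <employee>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </employee>
  <employee>
    <firstName>Anna</firstName> <lastName>Smith</lastName>
  </employee>
  <employee>
    <firstName>Peter</firstName> <lastName>Jones</lastName>
  </employee>
</employees>"""

# fromstring() parses text and returns the root element (<employees>)
employees_root = ET.fromstring(employees_xml)

# findall() returns the direct children with the given tag;
# find(...).text gives the text inside a child element
names = [
    (employee.find('firstName').text, employee.find('lastName').text)
    for employee in employees_root.findall('employee')
]
print(employees_root.tag, names)
```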
Let's try importing a sample XML file.

In [None]:
tree = ET.parse('books.xml')
root = tree.getroot()

The file we opened `books.xml` looks like this.
```xml
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.</description>
   </book>
</catalog>
```
From the root of the XML tree, we can get all children, which should correspond to the 12 book tags we see. We can use the Python `list()` function to do this.

In [None]:
for book in list(root):
    print(book.tag, book.attrib)
    print(book.find('author').text)
    print(book.find('title').text)
    print()
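
A common next step is to collect each book into a dictionary so the catalog can be filtered or tabulated. The sketch below uses only the first two books from `books.xml` (with some fields omitted) so that it is self-contained; the field selection and variable names are choices made for this example.

```python
import xml.etree.ElementTree as ET

# First two books from the catalog above, abbreviated
catalog_xml = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
   </book>
</catalog>"""

catalog_root = ET.fromstring(catalog_xml)

records = []
for book in catalog_root.findall('book'):
    records.append({
        'id': book.get('id'),                     # attribute on the <book> tag
        'author': book.findtext('author'),        # text of a child element
        'title': book.findtext('title'),
        'genre': book.findtext('genre'),
        'price': float(book.findtext('price')),   # XML text is always a string
    })

fantasy_titles = [r['title'] for r in records if r['genre'] == 'Fantasy']
print(records)
print(fantasy_titles)
```

A list of dictionaries like `records` can be passed to `pd.DataFrame(records)` to obtain a table with one row per book.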

JSON and XML are similar in some ways

- Both JSON and XML are "self describing" (human readable)
- Both JSON and XML are hierarchical (values within values)
- Both JSON and XML can be parsed and used by lots of programming languages
- Both JSON and XML can be fetched with an XMLHttpRequest

and different in other ways.

- JSON doesn't use end tags
- JSON is shorter
- JSON is quicker to read and write
- JSON can use arrays

<a id='section5'></a>
## 5. Hierarchical Data Format (HDF)
Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. It was originally developed at the **National Center for Supercomputing Applications** and is now supported by [The HDF Group](https://www.hdfgroup.org/solutions/hdf5/), a non-profit corporation.
<br>
<img src="images/hdf5_structure4.jpg" alt="drawing" width="550"/>
<br>
Let's see what [The HDF Group](https://www.hdfgroup.org/solutions/hdf5/) says about the HDF5 file structure.

In [None]:
%%HTML
<iframe src="https://player.vimeo.com/video/226008481" width="640" height="360" frameborder="0" allow="autoplay; fullscreen" allowfullscreen></iframe>

The Python package [h5py](https://www.h5py.org/) is required for working with HDF5 files. You'll see we installed and imported that package at the start of the Notebook.

```python
!pip install h5py

import h5py
```

To explore the HDF5 file structure, we'll be working with hyperspectral imagery, a type of dataset that Civil & Mineral Engineers might run into.
<br>
<img src="images/light_spectrum.jpg" alt="drawing" width="800"/>
<br>
Hyperspectral imaging collects information from across the electromagnetic spectrum, which can be seen in the image above. Most familiar to us is the visible light spectrum ranging from red (700 nanometers) to violet (380 nanometers). Hyperspectral imagery is typically acquired across the visible and near-infrared range, between 400 and 1100 nanometers.
<br>
<img src="images/hyperspectral_image.png" alt="drawing" width="900"/>
<br>
A typical JPEG image that you might capture with your smartphone will have three color channels (Red, Green, and Blue). So, if your image is 500 pixels x 500 pixels (spatial dimensions), there will be a 3rd dimension of size three (3 color channels). When imported into Python as an array, the shape of the array would be (500, 500, 3). With hyperspectral imagery, you're acquiring a continuous spectrum and thus, each pixel contains many more than three values as displayed in the image above.

Let's use [h5py](https://www.h5py.org/) to import a hyperspectral image.

In [None]:
hyperspec = h5py.File('NEON_hyperspectral_dataset.h5', mode='r') 

First, let's see what groups are inside the file object.

In [None]:
print(hyperspec.keys())

and what groups are inside that group.

In [None]:
print(hyperspec['SJER'].keys())

and what groups are inside that group.

In [None]:
print(hyperspec['SJER']['Reflectance'].keys())

and what groups are inside the Metadata group.

In [None]:
print(hyperspec['SJER']['Reflectance']['Metadata'].keys())

You're starting to get the picture, right? HDF5 is just a hierarchical folder structure, as the name suggests.

Quick tip: We can make the same call as above using the following command.

In [None]:
print(hyperspec['SJER/Reflectance/Metadata'].keys())
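
The equivalence between chained keys and a slash-separated path can be illustrated with a plain-Python analogy. This is not how h5py is implemented; it is only a sketch that mimics the group names seen above with nested dictionaries, and the helper `lookup` is hypothetical.

```python
# Plain-Python stand-in for the group hierarchy above (values are placeholders)
fake_file = {
    'SJER': {
        'Reflectance': {
            'Metadata': {'Coordinate_System': {}, 'Spectral_Data': {}},
            'Reflectance_Data': 'dataset placeholder',
        }
    }
}

def lookup(tree, path):
    """Walk a nested dictionary one level per path component."""
    node = tree
    for name in path.split('/'):
        node = node[name]
    return node

# Both styles of access arrive at the same nested "group"
by_path = lookup(fake_file, 'SJER/Reflectance/Metadata')
by_keys = fake_file['SJER']['Reflectance']['Metadata']

print(by_path is by_keys)
print(sorted(by_path.keys()))
```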

In [None]:
def plot_hyperspectral_data():
    # Open the file read-only; it is closed automatically when the block ends
    with h5py.File('NEON_hyperspectral_dataset.h5', mode='r') as hyperspec:
        sjer_reflectance = hyperspec['SJER']['Reflectance']
        sjer_reflectance_array = sjer_reflectance['Reflectance_Data']
        reflectance_shape = sjer_reflectance_array.shape

        # Map_Info is read as a comma-separated string; this code takes fields 3-4
        # as the upper-left corner (xMin, yMax) and fields 5-6 as the pixel size
        sjer_mapInfo = sjer_reflectance['Metadata']['Coordinate_System']['Map_Info']
        mapInfo_string = str(sjer_mapInfo[()])
        mapInfo_split = mapInfo_string.split(',')
        res = float(mapInfo_split[5]), float(mapInfo_split[6])
        xMin = float(mapInfo_split[3])
        yMax = float(mapInfo_split[4])

        # Derive the remaining corners from the image shape (rows, columns, bands)
        xMax = xMin + (reflectance_shape[1] * res[0])
        yMin = yMax - (reflectance_shape[0] * res[1])
        sjer_ext = (xMin, xMax, yMin, yMax)

        # Read band 56 (index 55) into memory before the file is closed
        b56 = sjer_reflectance_array[:, :, 55].astype(float)

    # Display the band as an image, positioned using the computed extent
    fig, ax = plt.subplots()
    img = ax.imshow(b56, extent=sjer_ext, cmap='jet')
    ax.set_xlabel('Longitude', fontsize=16)
    ax.set_ylabel('Latitude', fontsize=16)
    plt.colorbar(img, label='Reflectance', ax=ax)
    plt.show()

In [None]:
plot_hyperspectral_data()
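
The extent arithmetic inside `plot_hyperspectral_data()` can be isolated in a small function. The sketch below follows the same field positions as the function above (fields 3 and 4 for the upper-left corner, fields 5 and 6 for the pixel size), but the map-info string and array shape are made-up values, not the contents of the NEON file, and the function name is hypothetical.

```python
def extent_from_map_info(map_info, shape):
    """Return (x_min, x_max, y_min, y_max) from a comma-separated map-info
    string and an array shape of (rows, columns, bands)."""
    fields = map_info.split(',')
    x_min, y_max = float(fields[3]), float(fields[4])   # upper-left corner
    res_x, res_y = float(fields[5]), float(fields[6])   # pixel size
    n_rows, n_cols = shape[0], shape[1]
    x_max = x_min + n_cols * res_x    # columns extend to the right
    y_min = y_max - n_rows * res_y    # rows extend downward
    return (x_min, x_max, y_min, y_max)

# Hypothetical values chosen so the arithmetic is easy to check by hand
example_map_info = 'UTM,1,1,1000.0,5000.0,2.0,2.0,11,North'
example_shape = (50, 40, 426)

extent = extent_from_map_info(example_map_info, example_shape)
print(extent)
```

With 40 columns and 50 rows of 2-unit pixels, the image spans 80 units horizontally and 100 units vertically from the upper-left corner.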

A more in-depth look at the HDF file structure is beyond the scope of this course. For those interested in learning more about HDF, Python, and hyperspectral data, check out the excellent resources from 
[The National Science Foundation's National Ecological Observatory Network (NEON)](https://www.neonscience.org/neon-aop-hdf5-py).

<a id='section6'></a>
## 6. APIs
The term API is an acronym, and it stands for **Application Programming Interface**. An API is a software intermediary that allows two applications to talk to each other. APIs provide a means of collecting data with certain advantages over the file formats previously discussed. When would we want to use an API instead of a static CSV file you can download from the web?
1. When you only want a small piece of a much larger dataset. For example, Donald Trump likes to Tweet a lot. If you're only interested in Trump's Tweets from the last 12 hours, the Twitter API will save you from having to download all of Trump's Tweets and then searching for the ones you're interested in.
2. When the data is changing quickly. For example, if you're trying to build a high frequency trading algorithm for the stock market, your algorithm is going to need access to "real-time" data. Stock Exchange APIs can provide this kind of "real-time" data.
3. When the API provides some kind of computational functionality that you need for your analysis. For example, you might be interested in using drone imagery to estimate the rooftop solar potential in your neighbourhood. To do this, you need to be able to segment the rooftops in an image from their surrounding background. Perhaps a research group has made their Machine Learning rooftop segmentation model available via an API endpoint. In this case, you could send the API a JPEG and get back a rooftop mask.

#### How does an API work?
An API typically runs on a remote server and uses HTTP as its transfer protocol. You can send data to an API and also retrieve data; retrieving data will be the focus of this lecture. We use APIs all the time, probably without knowing it. For example, if you click on this link [FiveThirtyEight](https://fivethirtyeight.com/) (cool website, check it out) your web browser (Chrome, Explorer, Firefox, Safari) will make a request to get the webpage data so it can display it for you.
<br>
<img src="images/API.png" alt="drawing" width="600"/>
<br>
There are many different types of requests that you can make, such as GET, POST, PUT, DELETE, HEAD, and OPTIONS, but for this lecture, we'll explore the GET request. The diagram above presents the basic function of a GET request. We send information to the API specifying what data we want, and the API returns the data we requested along with a response code indicating the status of the request.

Common response codes:
- **200 (OK)** | Indicates that the API successfully carried out whatever action the client requested.
- **400 (Bad Request)** | The generic client-side error status indicating that you messed something up with your request. Check for errors in your code and try again.
- **500 (Internal Server Error)** | Something went wrong on the server side and it is not the client's fault. It is reasonable for the client to retry the same request that triggered this response and hope to get a different response.

You can learn more about status codes [here](https://restfulapi.net/http-status-codes/).
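
The first digit of a status code indicates its general class, which is often all a script needs to decide what to do next. Below is a minimal sketch using the standard library's `http.HTTPStatus` enumeration; the helper `describe_status` and its category labels are written for this example only.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of a standard HTTP status code.

    HTTPStatus(code) raises ValueError for codes it does not know.
    """
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = 'success'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'other'
    return '{} ({}): {}'.format(code, phrase, category)

for code in (200, 400, 500):
    print(describe_status(code))
```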

As an example, consider the foreign exchange rates API **https://exchangeratesapi.io/**. To make requests programmatically in Python, we'll be using the **requests** package that was imported at the start of the Notebook.

```Python
import requests
```

For documentation explaining the API's functionality, check out this [link](https://exchangeratesapi.io/).

Let's try getting current exchange rates.

In [None]:
access_key = '0cef27412b30145ee895a37d5d55cf9d'
url = 'http://api.exchangeratesapi.io/v1/latest?access_key={}'.format(access_key)
response = requests.get(url)
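
The URL above is an endpoint followed by a query string of `name=value` pairs. As a sketch, the same URL can be assembled from a dictionary of parameters with the standard library, which also takes care of escaping special characters. `YOUR_ACCESS_KEY` is a placeholder, not a working key, and the variable names are chosen for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = 'http://api.exchangeratesapi.io/v1/latest'
params = {'access_key': 'YOUR_ACCESS_KEY'}   # placeholder value

# urlencode() turns the dictionary into 'name=value' pairs joined by '&'
example_url = endpoint + '?' + urlencode(params)
print(example_url)

# The query string can be parsed back into a dictionary of lists
parsed_query = parse_qs(urlsplit(example_url).query)
print(parsed_query)
```

The `requests` package can do this step for you: `requests.get(endpoint, params=params)` appends the encoded query string to the endpoint before sending the request.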

Remember that a response contains the data we requested and the response code. Let's first check the response code.

In [None]:
response.status_code

Success! The request was successful. Let's see what response code we get if the URL has a typo.

In [None]:
url = 'http://api.exchangeratesapi.io/v1/lates?access_key={}'.format(access_key)
response = requests.get(url)
response.status_code

Bad request!

In [None]:
url = 'http://api.exchangeratesapi.io/v1/latest?access_key={}'.format(access_key)
response = requests.get(url)

Ok, so we got a 200 response, so now let's check out what data we got back.

We can get the data as text.

In [None]:
response.text

We can also get the data as a JSON converted to a Python Dictionary.

In [None]:
response.json()

We can use our ```json_print()``` function from earlier to see a more structured view of the JSON data.

In [None]:
def json_print(obj):
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

In [None]:
json_print(response.json())
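
Once the response has been converted to a dictionary, the rates can be used like any other Python data. The sketch below works on a made-up response string rather than a live request; the field names (`base`, `date`, `rates`) and all numbers are assumptions for illustration and should be checked against the actual output above.

```python
import json

# Made-up response text; field names and values are assumptions for illustration
sample_text = '{"base": "EUR", "date": "2021-03-01", "rates": {"CAD": 1.5, "USD": 1.25}}'
data = json.loads(sample_text)

def convert(amount, currency, rates):
    """Convert an amount in the base currency into the given currency."""
    return amount * rates[currency]

# 10 units of the base currency expressed in CAD
base_to_cad = convert(10, 'CAD', data['rates'])

# A cross rate between two non-base currencies is the ratio of their rates
usd_to_cad = data['rates']['CAD'] / data['rates']['USD']

print(data['base'], base_to_cad, usd_to_cad)
```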

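By default, the base currency is the Euro (EUR). The API documentation describes a `base` query parameter for choosing a different base currency, although this option may depend on your subscription plan. The request below uses the default base.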

In [None]:
url = 'http://api.exchangeratesapi.io/v1/latest?access_key={}'.format(access_key)

response = requests.get(url)
json_print(response.json())

<a id='section7'></a>
## 7. HTML
HTML stands for Hyper Text Markup Language and is the language that defines the structure of every webpage you visit. Most browsers have a built-in debug tool that will allow you to see the HTML associated with each webpage you visit. Check out [the CivMin](https://civmin.utoronto.ca/) site and press F12 on your keyboard.

Below is an example of some simple HTML.

```html
<html>
    <head>
        <title>
            A Simple HTML Document
        </title>
    </head>
    <body>
        <p>This is a very simple HTML document</p>
        <p>It only has two paragraphs</p>
    </body>
</html>
```

Text surrounded by `< >` is called a `tag` and, as you can see from the HTML above, there are many different types of tags.

- HTML tag: It is the root of the html document which is used to specify that the document is html. <br>
  `<html> Statements... </html>` <br>
- Head tag: It is used to contain all the head elements in the html file. It contains the title, style, meta, … etc. tags. <br>
  `<head> Statements... </head>` <br>
- Body tag: It is used to define the body of html document. It contains image, tables, lists, … etc. <br>
  `<body> Statements... </body>` <br>
- Title tag: It is used to define the title of html document. <br>
 `<title> Statements... </title>` <br>
- Heading tag: It is used to define the heading of html document. <br>
  `<h1> Statements... </h1>` <br>
  `<h2> Statements... </h2>` <br>
  `<h3> Statements... </h3>` <br>
  `<h4> Statements... </h4>` <br>
  `<h5> Statements... </h5>` <br>
  `<h6> Statements... </h6>` <br>
- Paragraph tag: It is used to define paragraph content in html document. <br>
  `<p> Statements... </p>` <br>
- Anchor tag: It is used to link one page to another page. <br>
  `<a href="..."> Statements... </a>` <br>
- List item tag: It is used to define an item inside a list. <br>
  `<li> Statements... </li>` <br>
- Ordered List tag: It is used to list the content in a particular order. <br>
  `<ol> Statements... </ol>` <br>
- Unordered List tag: It is used to list the content without order. <br>
  `<ul> Statements... </ul>`
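
To see tags being recognized programmatically, here is a minimal sketch that walks through the simple HTML document shown earlier using the standard library's `html.parser`. It is used here only to keep the example self-contained (BeautifulSoup, introduced below, is more convenient), and the class name `TagCollector` is made up for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record every opening tag and the text inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'p':
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_p = False

    def handle_data(self, data):
        # Called with the text found between tags
        if self._in_p:
            self.paragraphs.append(data.strip())

simple_html = """<html>
    <head>
        <title>
            A Simple HTML Document
        </title>
    </head>
    <body>
        <p>This is a very simple HTML document</p>
        <p>It only has two paragraphs</p>
    </body>
</html>"""

collector = TagCollector()
collector.feed(simple_html)
print(collector.tags)
print(collector.paragraphs)
```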
  
Let's first check out a very simple webpage. Remember the NOAA file we worked with in [Section 2: Unstructured Text Data](#section2)? This [website](https://www.glerl.noaa.gov/emf/waves/GLERL-Donelan-Archive/2021/) contains forecast files and measurements for wave parameters such as height and direction. Let's say we want to analyze this data and need a way to programmatically extract the download links for all wave height forecast files. The HTML on this webpage looks like this (go to the webpage and hit F12 to check it out for yourself).

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /emf/waves/GLERL-Donelan-Archive/2021</title>
 </head>
 <body>
  <h1>Index of /emf/waves/GLERL-Donelan-Archive/2021</h1>
  <table>
   <tbody>   
    <tr></tr>
    <tr></tr>
    <tr></tr> 
    <tr>
     <td></td>
     <td></td>
     <td>2021-02-23 06:12</td>  
     <td>67M</td>
     <td></td>   
    </tr>
```

Let's introduce a useful Python package for scraping HTML. It's called `BeautifulSoup` and we imported it at the start of the notebook.

```python
from bs4 import BeautifulSoup
```

First, we need to get the webpage content using the `requests.get()` function we used in the last section.

In [None]:
response = requests.get('https://www.glerl.noaa.gov/emf/waves/GLERL-Donelan-Archive/2021/')
response.text[0:1000]

Next, we'll use BeautifulSoup to parse the HTML.

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

So, we want to extract table rows where a wave forecast file is linked. 

```html
<tr>
    <td valign="top">
     <img alt="[   ]" src="/icons/unknown.gif"/>
    </td>
    <td>
     <a href="c2021_01.out1.nc">
      c2021_01.out1.nc
     </a>
    </td>
    <td align="right">
     2021-02-23 06:12
    </td>
    <td align="right">
     67M
    </td>
    <td>
    </td>
</tr>
```

Let's use BeautifulSoup to find all the `<tr>` tags.

In [None]:
# find_all() is the current name of the older findAll() alias
table_rows = soup.find_all('tr')
for row in table_rows[:5]:
    print('{}\n'.format(row))

We can see that the table data `<td>` we're interested in only appears after row 3, so let's grab everything from the 4th row on. The slice `[3:-1]` also drops the final row of the table.

In [None]:
table_rows = table_rows[3:-1]
print('{}\n'.format(table_rows[0]))
print('{}\n'.format(table_rows[1]))
print('{}\n'.format(table_rows[2]))
print('{}\n'.format(table_rows[3]))

Within a row, how can we find the `<td></td>` tags?

In [None]:
table_rows[0]

And, how can we find the file name?

In [None]:
table_rows[0].find_all('td')

In [None]:
table_rows[0].find_all('td')[1].find_all('a')

In [None]:
table_rows[0].find_all('td')[1].find_all('a')[0].contents
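
A file name on its own is not yet a download link, because the `href` values on this page are relative to the page's address. As a sketch of the final step, the standard library's `urljoin()` combines the page URL with each file name. The first file name is the one shown in the table row earlier; the other two are hypothetical entries added to show the filtering.

```python
from urllib.parse import urljoin

page_url = 'https://www.glerl.noaa.gov/emf/waves/GLERL-Donelan-Archive/2021/'

# File names as they might be collected from the table rows;
# the second and third entries are hypothetical
file_names = ['c2021_01.out1.nc', 'c2021_02.out1.nc', 'README.txt']

# Keep only the .nc files and resolve each against the page URL
download_links = [
    urljoin(page_url, name.strip())
    for name in file_names
    if name.strip().endswith('.nc')
]

for link in download_links:
    print(link)
```

The `.strip()` call guards against stray whitespace around the text extracted from a tag.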