# HTML

In this activity, we will learn how to load HTML tables directly into Pandas and cover the basics of web scraping, a very popular way of gathering data.

In [32]:
from bs4 import BeautifulSoup
import pandas as pd
import requests as re  # note: this alias shadows Python's built-in `re` (regex) module

In [33]:
url = 'https://www.worldcoinindex.com'
crypto_url = re.get(url)
crypto_url

<Response [200]>

In [34]:
body = crypto_url.text
body

'\n<!DOCTYPE html>\n<html style="background-color: #FFF">\n<head>\n<meta http-equiv="Content-Type" content="text/html">\n<meta name="mobile-web-app-capable" content="yes">\n<meta charset="utf-8" />\n<meta name="description" content="Cryptocoins ranked by 24hr trading volume, price info, charts, market cap and news" />\n<meta name="keywords" content="Coin, index, worldcoin, coinindex, worldindex, cryptocoins, crypto, traded, exchanges, ranked, marketcap, coins, cryptocurrency ,coinprices, price, last trade time, trade, market cap, trading, traded, volume, rate, buy cryptocurrency, sell cryptocurrency" />\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable = no">\n<meta name="apple-mobile-web-app-capable" content="yes" />\n<meta name="propeller" content="efad6eed15cdd45139ef34d287cc06ff" />\n<meta property="og:image" content="https://www.worldcoinindex.com/content/img/worldcoinindex_social.png?v=3" />\n<title>Cryptocoin price index and market cap - WorldCoi

`body` contains the full HTML source of the webpage. If that source includes a table, marked by the HTML tags `<table> </table>`, Pandas can extract it with `read_html()`, which returns a list with one DataFrame per table found.

In [35]:
crypto_data = pd.read_html(body)
print(type(crypto_data))
print(len(crypto_data))

<class 'list'>
1


Whenever HTML is passed to `read_html()`, Pandas returns a DataFrame for each table it finds; if the HTML contains no table, it raises a `ValueError` instead.

In [36]:
crypto_data = crypto_data[0]
crypto_data.head()

Unnamed: 0,#,Unnamed: 1,Name,Ticker,Last price,%,24 high,24 low,Price Charts 7d,24 volume,# Coins,Market cap
0,1,,Bitcoin,BTC,"$ 16,962",-1.19%,"$ 17,241","$ 16,870",,$ 6.79B,19.22M,$ 326.05B
1,2,,Ethereum,ETH,"$ 1,275.45",-1.46%,"$ 1,295.73","$ 1,263.25",,$ 2.95B,122.37M,$ 156.08B
2,3,,Binanceusd,BUSD,$ 1.00,-0.01%,$ 1.00,$ 0.997948,,$ 912.05M,1.68B,$ 1.68B
3,4,,Binancecoin,BNB,$ 291.56,-2.89%,$ 301.87,$ 290.11,,$ 639.95M,154.53M,$ 45.05B
4,5,,Dogecoin,DOGE,$ 0.101530,-5.03%,$ 0.107704,$ 0.100442,,$ 636.46M,129.40B,$ 13.13B
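The behaviour of `read_html()` can be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML string in place of the live page; the helper `first_table_or_none` and both sample strings are hypothetical names used only for illustration. Recent Pandas versions expect literal HTML to be wrapped in `StringIO`, and `read_html()` needs a parser library such as `lxml` (or `bs4` with `html5lib`) to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample documents, not taken from the live site.
html_with_table = """
<table>
  <thead>
    <tr><th>Name</th><th>Ticker</th></tr>
  </thead>
  <tbody>
    <tr><td>Bitcoin</td><td>BTC</td></tr>
    <tr><td>Ethereum</td><td>ETH</td></tr>
  </tbody>
</table>
"""
html_without_table = "<html><body><p>No tables here.</p></body></html>"


def first_table_or_none(html):
    """Return the first <table> in `html` as a DataFrame, or None if there is none."""
    try:
        # read_html returns a list with one DataFrame per table found.
        tables = pd.read_html(StringIO(html))
    except ValueError:
        # Pandas raises ValueError when the document contains no tables.
        return None
    return tables[0]


coins = first_table_or_none(html_with_table)
missing = first_table_or_none(html_without_table)
print(coins)
print(missing)
```

When the helper returns `None`, the data has to be pulled out of the HTML some other way, which is where scraping comes in.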


## In the case where there is no table in the HTML...

**Scraping** is the other way to extract information from HTML. Python has a package for this called `Beautiful Soup`. [Here](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/) is a tutorial on scraping.

# The Fundamentals of Web Scraping

Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don't offer these convenient options.

In this tutorial, web scraping is performed with the `Beautiful Soup` library. It will be used to scrape weather forecasts from the [National Weather Service](https://www.weather.gov/), which are then analyzed using the Pandas library.

In [37]:
page = re.get('https://dataquestio.github.io/web-scraping-pages/simple.html')
print(page.status_code)
page.content  # the raw response body as bytes: the whole HTML document on one line, hard to read

200


b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
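The `b'...'` prefix in the output shows that `page.content` is a `bytes` object, while `page.text` (used earlier) is the decoded string. As a rough offline sketch, assuming the bytes shown above are pasted in by hand and are UTF-8 encoded, both forms can be given to `BeautifulSoup` and yield the same text:

```python
from bs4 import BeautifulSoup

# Bytes copied from the output above, standing in for page.content.
raw = (b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n'
       b'    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n'
       b'    </body>\n</html>')

# Decoding gives a str, comparable to what page.text would hold (UTF-8 is assumed here).
text = raw.decode('utf-8')

soup_from_bytes = BeautifulSoup(raw, 'html.parser')
soup_from_text = BeautifulSoup(text, 'html.parser')

print(type(raw), type(text))
print(soup_from_bytes.get_text() == soup_from_text.get_text())
```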

In [38]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [39]:
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [40]:
len(list(soup.children))

3

The above tells us that there are two tags at the top level of the page: the initial `<!DOCTYPE html>` tag and the `<html>` tag. There is also a newline character (`\n`) in the list.

In [41]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

1. The first is a `Doctype` object, which contains information about the type of the document.
2. The second is a `NavigableString`, which represents text found in the HTML document.
3. The final item is a `Tag` object, which contains other nested tags.
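Because the whitespace between tags shows up as `NavigableString` items, it can be convenient to keep only the `Tag` objects when walking the tree. A minimal sketch, assuming the same simple page pasted in as a string (the variable names are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The simple example page, pasted in so the sketch runs offline.
simple_html = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(simple_html, 'html.parser')

# Keep only Tag objects; this drops the Doctype and the newline strings.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
html_tag = top_level_tags[0]
nested_tags = [child for child in html_tag.children if isinstance(child, Tag)]

print([tag.name for tag in top_level_tags])  # ['html']
print([tag.name for tag in nested_tags])     # ['head', 'body']
```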

In [43]:
# calling the third item and listing its children
html = list(soup.children)[2]
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [45]:
# calling the body tag 
body = list(html.children)[3]
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [46]:
# calling the p tag
p = list(body.children)[1]
p.get_text()

'Here is some simple content for this page.'
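Indexing into `children` depends on the exact position of each newline, so it breaks easily if the page layout changes. Beautiful Soup also allows a tag's name to be used as an attribute, which returns the first matching tag below the current one. A short sketch of the same walk, assuming the simple page pasted in as a string:

```python
from bs4 import BeautifulSoup

# The simple example page, pasted in so the sketch runs offline.
simple_html = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(simple_html, 'html.parser')

# soup.html -> first <html> tag, .body -> first <body> inside it, .p -> first <p> inside that.
paragraph_text = soup.html.body.p.get_text()
print(paragraph_text)
```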

## Finding all instances of a tag at once

The approach above is useful for learning how to navigate a page, but it takes many commands to do something simple. To extract every instance of a tag at once, the `find_all` method can be used.

In [48]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that `find_all` returns a list, so we'll have to loop through, or use list indexing, to extract text:

In [50]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
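When a page has several matching tags, looping is usually more practical than indexing. A minimal sketch, assuming a small made-up document with three paragraphs (the HTML is hypothetical, not one of the tutorial pages):

```python
from bs4 import BeautifulSoup

# Hypothetical document with several <p> tags.
html = """
<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Loop over every match and collect its text, stripped of surrounding whitespace.
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all('p')]
print(paragraph_texts)
```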

If only the first instance of a tag is needed, `find` can be used instead; it returns a single `Tag` object rather than a list.

In [51]:
soup.find('p')

<p>Here is some simple content for this page.</p>
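One difference worth keeping in mind: when nothing matches, `find` returns `None`, whereas `find_all` returns an empty list. A small sketch of guarding against that, assuming a made-up one-paragraph document:

```python
from bs4 import BeautifulSoup

# Hypothetical document that has a <p> tag but no <table>.
soup = BeautifulSoup('<html><body><p>Only a paragraph.</p></body></html>', 'html.parser')

first_p = soup.find('p')      # a Tag, because a <p> exists
table = soup.find('table')    # None, because there is no <table>

# Check for None before calling methods such as get_text().
if table is None:
    message = 'no <table> on this page'
else:
    message = table.get_text()

no_tables = soup.find_all('table')  # empty result, safe to loop over

print(first_p.get_text())
print(message)
print(len(no_tables))
```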

## Searching for tags by class and id

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. When scraping, they can also be used to specify the elements to be scraped.

In [52]:
page = re.get('https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

We can use the `find_all` method to search for items by class or by id.

In [54]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [55]:
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [56]:
soup.find_all(id='first')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
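An element can carry several classes, and `class_` matches if any one of them equals the given name. Class and id filters can also be combined in a single call. The sketch below assumes a condensed copy of the page above, pasted in as a string so that it runs offline:

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes page, pasted in for offline use.
html = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# 'first-item' is only one of two classes on each matching tag, yet both are found.
first_items = soup.find_all('p', class_='first-item')
first_item_ids = [p['id'] for p in first_items]

# Combining filters: the tag must have class 'outer-text' AND id 'second'.
outer_second = soup.find_all('p', class_='outer-text', id='second')

print(first_item_ids)                 # ['first', 'second']
print(first_items[0]['class'])        # ['inner-text', 'first-item']
print(outer_second[0].get_text(strip=True))
```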