# HTML

In this activity, we will learn how to load HTML tables directly into Pandas, and cover the basics of web scraping, a very popular way of gathering data.

In [73]:
from bs4 import BeautifulSoup
import pandas as pd
import requests as re  # note: this alias shadows the name of Python's built-in regex module, re

In [74]:
url = 'https://www.worldcoinindex.com'
crypto_url = re.get(url)
crypto_url

<Response [200]>
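A request can fail or come back with an error status, so it may help to check the response before parsing it. The sketch below wraps the call in a hypothetical helper, `fetch_html` (the name and the 10-second default timeout are assumptions, not part of the original notebook), and imports `requests` under its own name rather than the `re` alias used above.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page's HTML text, raising an exception on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the page source could be fetched as `fetch_html(url)`, and a failed request would stop the notebook early instead of passing an error page on to the parser.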

In [75]:
body = crypto_url.text
body

'\n<!DOCTYPE html>\n<html style="background-color: #FFF">\n<head>\n<meta http-equiv="Content-Type" content="text/html">\n<meta name="mobile-web-app-capable" content="yes">\n<meta charset="utf-8" />\n<meta name="description" content="Cryptocoins ranked by 24hr trading volume, price info, charts, market cap and news" />\n<meta name="keywords" content="Coin, index, worldcoin, coinindex, worldindex, cryptocoins, crypto, traded, exchanges, ranked, marketcap, coins, cryptocurrency ,coinprices, price, last trade time, trade, market cap, trading, traded, volume, rate, buy cryptocurrency, sell cryptocurrency" />\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable = no">\n<meta name="apple-mobile-web-app-capable" content="yes" />\n<meta name="propeller" content="efad6eed15cdd45139ef34d287cc06ff" />\n<meta property="og:image" content="https://www.worldcoinindex.com/content/img/worldcoinindex_social.png?v=3" />\n<title>Cryptocoin price index and market cap - WorldCoi

`body` holds the full HTML source of the webpage. If that source contains a table, marked by the HTML tags `<table>` and `</table>`, Pandas can extract it with `read_html()`, which returns a list of DataFrames, one per table found.

In [76]:
crypto_data = pd.read_html(body)
print(type(crypto_data))
print(len(crypto_data))

<class 'list'>
1


Whenever HTML containing at least one table is passed to `read_html()`, each table comes back as a DataFrame. Here the list holds a single table, so we take its first element.

In [77]:
crypto_data = crypto_data[0]
crypto_data.head()

| | # | Unnamed: 1 | Name | Ticker | Last price | % | 24 high | 24 low | Price Charts 7d | 24 volume | # Coins | Market cap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | | Bitcoin | BTC | $ 16,974 | -0.02% | $ 16,981 | $ 16,958 | | $ 6.58B | 19.22M | $ 326.27B |
| 1 | 2 | | Ethereum | ETH | $ 1,275.20 | -0.05% | $ 1,276.07 | $ 1,273.93 | | $ 2.85B | 122.37M | $ 156.05B |
| 2 | 3 | | Binanceusd | BUSD | $ 0.999923 | -0.02% | $ 1.00 | $ 0.999790 | | $ 895.65M | 1.68B | $ 1.68B |
| 3 | 4 | | Binancecoin | BNB | $ 291.96 | 0.00% | $ 292.38 | $ 291.63 | | $ 625.57M | 154.53M | $ 45.11B |
| 4 | 5 | | Dogecoin | DOGE | $ 0.102035 | +0.13% | $ 0.102108 | $ 0.101303 | | $ 606.67M | 129.40B | $ 13.20B |
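The price columns come back as strings such as `$ 16,974`, so they need cleaning before any arithmetic. The following is a minimal sketch, assuming the values look like the ones printed above; the `prices` Series is built by hand here so the example runs without a network connection.

```python
import pandas as pd

# Sample values copied from the "Last price" column shown above
prices = pd.Series(["$ 16,974", "$ 1,275.20", "$ 0.999923", "$ 291.96", "$ 0.102035"])

# Drop the dollar sign, whitespace, and thousands separators, then convert to numbers
prices_num = pd.to_numeric(prices.str.replace(r"[$,\s]", "", regex=True))
print(prices_num.tolist())
```

The same pattern could be applied to `crypto_data['Last price']`, although columns with suffixes such as `B` and `M` would need extra handling.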


## In the case where there is no table in the HTML...

**Scraping** is the other way to extract information from HTML. Python has a package for this called `Beautiful Soup`. [Here](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/) is a tutorial on scraping.

# The Fundamentals of Web Scraping

Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don't offer these convenient options.

In this tutorial, web scraping will be performed using the `Beautiful Soup` library. We will scrape weather forecasts from the [National Weather Service](https://www.weather.gov/) and then analyze them using the Pandas library.

In [78]:
page = re.get('https://dataquestio.github.io/web-scraping-pages/simple.html')
print(page.status_code)
page.content  # the raw response body as bytes, before any parsing

200


b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [79]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [80]:
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [81]:
len(list(soup.children))

3

The above tells us that there are two tags at the top level of the page: the initial `<!DOCTYPE html>` declaration and the `<html>` tag. There is a newline character (`\n`) in the list as well.

In [82]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

1. The first is a `Doctype` object, which contains information about the type of the document.
2. The second is a `NavigableString`, which represents text found in the HTML document.
3. The final item is a `Tag` object, which contains other nested tags.
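Because whitespace strings and the doctype sit alongside real tags, it can be handy to keep only the `Tag` children. The sketch below does this with an `isinstance` check; it parses an inline copy of the example page (an assumption made so that it runs offline).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
    <head><title>A simple example page</title></head>
    <body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children, skipping the Doctype and the newline strings
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])

# The same filter works one level down, on the children of <html>
html_tag = top_level_tags[0]
html_child_names = [child.name for child in html_tag.children if isinstance(child, Tag)]
print(html_child_names)
```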

In [83]:
# calling the third item and listing its children
html = list(soup.children)[2]
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [84]:
# calling the body tag 
body = list(html.children)[3]
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [85]:
# calling the p tag
p = list(body.children)[1]
p.get_text()

'Here is some simple content for this page.'
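Indexing into `children` by position is fragile, because the positions shift whenever whitespace changes. As an alternative sketch, Beautiful Soup also lets a tag name be used as an attribute, which returns the first matching descendant. The inline HTML below is a copy of the example page, used so the snippet runs offline.

```python
from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html>
    <head><title>A simple example page</title></head>
    <body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each attribute access returns the first descendant tag with that name
text = soup.html.body.p.get_text()
print(text)
```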

## Finding all instances of a tag at once

The approach above is useful for understanding how a page is structured, but to extract every instance of a tag at once, the `find_all` method can be used.

In [86]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that `find_all` returns a list, so we'll have to loop through, or use list indexing, to extract text:

In [87]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

If only the first instance of the tag is needed, `find` can be used instead.

In [88]:
soup.find('p')

<p>Here is some simple content for this page.</p>
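The two methods behave differently when nothing matches: `find_all` returns an empty list, while `find` returns `None`. A short sketch of looping over results and guarding against a missing tag, using an inline copy of the example page:

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Loop over every match instead of indexing into the list
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)

# find() returns None when the tag is absent, so check before using the result
table = soup.find("table")
if table is None:
    print("no <table> tag on this page")

# find_all() returns an empty list in the same situation
tables = soup.find_all("table")
print(len(tables))
```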

## Searching for tags by class and id

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. When scraping, they can also be used to specify which elements to extract.

In [89]:
page = re.get('https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

We can use the `find_all` method to search for items by class or by id.

In [90]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [91]:
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [92]:
soup.find_all(id='first')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
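When an element has several classes, `class_` matches against each of them individually, and filters can be combined. The sketch below illustrates this on an inline, whitespace-trimmed copy of the example page (an assumption, so that it runs offline).

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# "first-item" is one of two classes on each of these paragraphs
first_items = soup.find_all("p", class_="first-item")
first_item_ids = [p["id"] for p in first_items]
print(first_item_ids)

# Filters can be combined: tag name, class, and id together
outer_second = soup.find_all("p", class_="outer-text", id="second")
print(len(outer_second))
```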

## Using CSS Selectors

CSS selectors are how the CSS language lets developers specify which HTML tags to style. `BeautifulSoup` objects support searching a page via CSS selectors using the `select` method. For example, we can use a CSS selector to find all the `p` tags in the page that are inside a `div`.

In [93]:
soup.select('div p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]
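A few other selector forms are worth knowing. The sketch below shows a class selector, an id selector with `select_one` (which returns a single element or `None`), and a direct-child selector, again on an inline copy of the example page.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# p tags that carry the outer-text class
outer = soup.select("p.outer-text")

# The single element with id="first"
first = soup.select_one("#first")

# p tags that are direct children of body (the ones inside the div are excluded)
body_children = soup.select("body > p")

print(len(outer), first.get_text(strip=True), len(body_children))
```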

## Extracting data from the national weather page

In [94]:
page = re.get('https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html class="no-js">
<head>
<!-- Meta -->
<meta content="width=device-width" name="viewport"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/><title>National Weather Service</title><meta content="National Weather Service" name="DC.title"><meta content="NOAA National Weather Service National Weather Service" name="DC.description"/><meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/><meta content="" name="DC.date.created" scheme="ISO8601"/><meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/><meta content="weather, National Weather Service" name="DC.keywords"/><meta content="NOAA's National Weather Service" name="DC.publisher"/><meta content="National Weather Service" name="DC.contributor"/><meta content="//www.weather.gov/disclaimer.php" name="DC.rights"/><meta content="General" name="rating"/><meta content="index,follow" name="robots"/>
<!-- Icons -->
<link href="./images/favicon.ico" rel="shortcut 

In [95]:
seven_day = soup.find(id = 'seven-day-forecast')
forecast_items = seven_day.find_all(class_ = 'tombstone-container')
forecast_items

[<div class="tombstone-container">
 <p class="period-name">This<br/>Afternoon</p>
 <p><img alt="This Afternoon: Rain likely before 5pm.  Mostly cloudy, with a steady temperature around 52. West northwest wind 9 to 11 mph.  Chance of precipitation is 70%. New precipitation amounts of less than a tenth of an inch possible. " class="forecast-icon" src="newimages/medium/ra70.png" title="This Afternoon: Rain likely before 5pm.  Mostly cloudy, with a steady temperature around 52. West northwest wind 9 to 11 mph.  Chance of precipitation is 70%. New precipitation amounts of less than a tenth of an inch possible. "/></p><p class="short-desc">Rain Likely</p><p class="temp temp-high">High: 52 °F</p></div>,
 <div class="tombstone-container">
 <p class="period-name">Tonight<br/><br/></p>
 <p><img alt="Tonight: Mostly clear, with a low around 39. North wind 6 to 13 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 39. North wind 6 to 13 mph

In [96]:
tonight = forecast_items[1]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Mostly clear, with a low around 39. North wind 6 to 13 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 39. North wind 6 to 13 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Clear
 </p>
 <p class="temp temp-low">
  Low: 39 °F
 </p>
</div>


In [97]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Tonight
Mostly Clear
Low: 39 °F


In [98]:
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Mostly clear, with a low around 39. North wind 6 to 13 mph. 


In [99]:
period_tags = seven_day.select('.tombstone-container .period-name')
periods = [pt.get_text() for pt in period_tags]
periods

['ThisAfternoon',
 'Tonight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday']
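Names such as `'ThisAfternoon'` are run together because the words are separated by a `<br/>` tag, which contributes no text. One possible fix, sketched here on two period tags copied from the output above, is to pass a separator to `get_text`.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="period-name">This<br/>Afternoon</p>'
    '<p class="period-name">Tonight<br/><br/></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
period_tags = soup.select(".period-name")

# Default behaviour: text pieces are concatenated with nothing in between
periods_joined = [pt.get_text() for pt in period_tags]

# With a separator, pieces split by <br/> are joined with a space
periods_spaced = [pt.get_text(separator=" ", strip=True) for pt in period_tags]

print(periods_joined)
print(periods_spaced)
```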

In [100]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Rain Likely', 'Mostly Clear', 'Sunny', 'Partly Cloudythen ChanceRain', 'Rain Likely', 'Rain', 'Rain', 'Rain Likely', 'Chance Rain']
['High: 52 °F', 'Low: 39 °F', 'High: 54 °F', 'Low: 42 °F', 'High: 54 °F', 'Low: 46 °F', 'High: 58 °F', 'Low: 46 °F', 'High: 56 °F']
['This Afternoon: Rain likely before 5pm.  Mostly cloudy, with a steady temperature around 52. West northwest wind 9 to 11 mph.  Chance of precipitation is 70%. New precipitation amounts of less than a tenth of an inch possible. ', 'Tonight: Mostly clear, with a low around 39. North wind 6 to 13 mph. ', 'Friday: Sunny, with a high near 54. Northeast wind 6 to 9 mph. ', 'Friday Night: A 30 percent chance of rain after 5am.  Partly cloudy, with a low around 42. East northeast wind 5 to 9 mph.  New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: Rain likely, mainly after 11am.  Mostly cloudy, with a high near 54. East wind 5 to 7 mph.  Chance of precipitation is 70%. New precipitation amounts betwe

In [101]:
weather = pd.DataFrame({
  'period': periods,
  'short_desc': short_descs,
  'temp': temps,
  'desc': descs  
})

weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | ThisAfternoon | Rain Likely | High: 52 °F | This Afternoon: Rain likely before 5pm. Mostl... |
| 1 | Tonight | Mostly Clear | Low: 39 °F | Tonight: Mostly clear, with a low around 39. N... |
| 2 | Friday | Sunny | High: 54 °F | Friday: Sunny, with a high near 54. Northeast ... |
| 3 | FridayNight | Partly Cloudythen ChanceRain | Low: 42 °F | Friday Night: A 30 percent chance of rain afte... |
| 4 | Saturday | Rain Likely | High: 54 °F | Saturday: Rain likely, mainly after 11am. Mos... |
| 5 | SaturdayNight | Rain | Low: 46 °F | Saturday Night: Rain. Low around 46. Chance o... |
| 6 | Sunday | Rain | High: 58 °F | Sunday: Rain. High near 58. Chance of precipi... |
| 7 | SundayNight | Rain Likely | Low: 46 °F | Sunday Night: Rain likely. Mostly cloudy, wit... |
| 8 | Monday | Chance Rain | High: 56 °F | Monday: A chance of rain. Partly sunny, with ... |
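Building the DataFrame from four separate lists assumes every forecast container has all four elements; if one were missing, the lists would differ in length or fall out of alignment. A more defensive sketch collects one row per container. The HTML below is a made-up fragment that mimics the page structure, with a second container that deliberately lacks a temperature and an image, and `text_or_none` is a hypothetical helper.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the second container has no temp or img
html_doc = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight<br/><br/></p>
    <p><img title="Tonight: Mostly clear, with a low around 39." src="x.png"/></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp temp-low">Low: 39 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def text_or_none(parent, selector):
    """Return the text of the first match under parent, or None if absent."""
    tag = parent.select_one(selector)
    return tag.get_text(" ", strip=True) if tag is not None else None


rows = []
for item in soup.select("#seven-day-forecast .tombstone-container"):
    img = item.find("img")
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img["title"] if img is not None else None,
    })

weather_rows = pd.DataFrame(rows)
print(weather_rows)
```

Each row stays aligned with its own container, and missing fields show up as empty values rather than shifting the data that follows.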


In [112]:
temp_nums = weather["temp"].str.extract(r'([0-9]+)', expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    52
1    39
2    54
3    42
4    54
5    46
6    58
7    46
8    56
Name: temp, dtype: object
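Note that `temp_nums` still holds strings (`dtype: object`); only the `temp_num` column has been converted to integers. As a variation, a pattern with named groups can pull out the High/Low label and the number in one pass. This is a sketch on three sample values from the table above, and the optional minus sign is an assumption added to cover sub-zero forecasts.

```python
import pandas as pd

temps = pd.Series(["High: 52 °F", "Low: 39 °F", "High: 54 °F"])

# With more than one group, str.extract returns a DataFrame;
# named groups become the column names
parts = temps.str.extract(r"(?P<kind>High|Low): (?P<value>-?\d+)")
parts["value"] = parts["value"].astype(int)
print(parts)
```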

In [113]:
weather['temp_num'].mean()

49.666666666666664

In [114]:
is_night = weather['temp'].str.contains('Low')
weather['is_night'] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [117]:
weather[is_night]

| | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight | Mostly Clear | Low: 39 °F | Tonight: Mostly clear, with a low around 39. N... | 39 | True |
| 3 | FridayNight | Partly Cloudythen ChanceRain | Low: 42 °F | Friday Night: A 30 percent chance of rain afte... | 42 | True |
| 5 | SaturdayNight | Rain | Low: 46 °F | Saturday Night: Rain. Low around 46. Chance o... | 46 | True |
| 7 | SundayNight | Rain Likely | Low: 46 °F | Sunday Night: Rain likely. Mostly cloudy, wit... | 46 | True |
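With the `is_night` flag in place, daytime and nighttime temperatures can be compared directly. A minimal sketch using `groupby`, with the `temp` column rebuilt by hand from the values shown above so the example runs offline:

```python
import pandas as pd

weather = pd.DataFrame({
    "temp": ["High: 52 °F", "Low: 39 °F", "High: 54 °F", "Low: 42 °F", "High: 54 °F",
             "Low: 46 °F", "High: 58 °F", "Low: 46 °F", "High: 56 °F"],
})
weather["temp_num"] = weather["temp"].str.extract(r"([0-9]+)", expand=False).astype(int)
weather["is_night"] = weather["temp"].str.contains("Low")

# Average temperature for daytime (False) and nighttime (True) periods
means = weather.groupby("is_night")["temp_num"].mean()
print(means)
```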
