## Webscraping: The Basics!

Keep in mind: when possible, it's preferable to use an API provided by the website you're trying to get information from. What's an API?

https://www.youtube.com/watch?v=s7wmiS2mSXY

It's better to use an API than to scrape a website with a bot. Why?

1.) With an API you're working _with_ the information provider, and letting them decide how to provide that information to you.

2.) Downloading everything on their website over and over again can overload their servers, and they might even block you.

3.) If the structure of their website changes, they'll maintain the API, so you usually won't have to update your own code. You might understand this argument a bit better once I show you how webscraping works.
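To make point 3 a bit more concrete, here is a minimal sketch of what consuming an API response tends to look like. The JSON payload and its field names (`items`, `name`, `price`) are invented for illustration and do not come from any real service.

```python
import json

# Hypothetical API response body; the field names are made up for illustration.
api_body = (
    '{"items": ['
    '{"name": "Watch", "price": 66.68}, '
    '{"name": "Watch2", "price": 56.68}'
    ']}'
)

# The provider hands us structured data, so there is no page layout to pick apart.
data = json.loads(api_body)
prices = {item["name"]: item["price"] for item in data["items"]}
print(prices)
```

As long as the provider keeps those field names stable, this code keeps working no matter how the website itself is redesigned.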

In [None]:
import bs4  # python package for webscraping
import re   # regular expressions, in case we need to do some pattern finding
import requests   # Getting data from a URL
import webbrowser as wb  # lets you open a page in your web browser straight from this notebook ;)

In [3]:
URL = "http://www.gutenberg.org/ebooks/100" # The complete works of Shakespeare

In [72]:
wb.open(URL)

True

In [14]:
input_text = """
<body>
<div id="listings_prices">
 <div class="item">
  <li class="item_name">Watch</li>
  <div class="main_price">Price: $66.68</div>
       <div class="discounted_price">Discounted price: $46.68</div>
   </div>
   <div class="item">
  <li class="item_name">Watch2</li>
  <div class="main_price">Price: $56.68</div>
   </div>
</div>
</body>"""

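Before we continue, I'd like you to have another look at the HTML above. Notice that the tags are nested: each `item` div holds an `item_name` and a `main_price`, and only the first item also has a `discounted_price`.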

In [15]:
soup = bs4.BeautifulSoup(input_text, "lxml")  # name the parser explicitly so bs4 doesn't have to guess (requires the lxml package)


In [45]:
soup.body

<body>
<div id="listings_prices">
<div class="item">
<li class="item_name">Watch</li>
<div class="main_price">Price: $66.68</div>
<div class="discounted_price">Discounted price: $46.68</div>
</div>
<div class="item">
<li class="item_name">Watch2</li>
<div class="main_price">Price: $56.68</div>
</div>
</div>
</body>

In [46]:
soup.div

<div id="listings_prices">
<div class="item">
<li class="item_name">Watch</li>
<div class="main_price">Price: $66.68</div>
<div class="discounted_price">Discounted price: $46.68</div>
</div>
<div class="item">
<li class="item_name">Watch2</li>
<div class="main_price">Price: $56.68</div>
</div>
</div>

In [50]:
soup.div.li

<li class="item_name">Watch</li>
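Attribute access such as `soup.div.li` only ever returns the first matching tag. The sketch below, which is one possible approach rather than the only one, walks every `item` instead and copes with the missing `discounted_price` on the second watch. It assumes the built-in `"html.parser"` so that it runs without lxml installed.

```python
import bs4

html = """
<div id="listings_prices">
  <div class="item">
    <li class="item_name">Watch</li>
    <div class="main_price">Price: $66.68</div>
    <div class="discounted_price">Discounted price: $46.68</div>
  </div>
  <div class="item">
    <li class="item_name">Watch2</li>
    <div class="main_price">Price: $56.68</div>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# soup.div.li is shorthand for "first div, then first li inside it".
first_name = soup.div.li.get_text()

# Visit every item; find() returns None when a tag is absent.
items = []
for item in soup.find_all("div", class_="item"):
    name = item.find(class_="item_name").get_text()
    discount = item.find(class_="discounted_price")
    items.append((name, discount.get_text() if discount else None))

print(first_name)
print(items)
```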

In [51]:
soup.find_all(class_='main_price')

[<div class="main_price">Price: $66.68</div>,
 <div class="main_price">Price: $56.68</div>]

In [59]:
# you can also restrict the search to a specific tag (here: div) and pass the class through attrs
soup.find_all('div', attrs={'class':'main_price'})

[<div class="main_price">Price: $66.68</div>,
 <div class="main_price">Price: $56.68</div>]
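The tags returned by `find_all` still contain text like `Price: $66.68`, not numbers. Here is a small sketch of how `re` could pull the amounts out as floats. The pattern and the variable names are my own choices, and `"html.parser"` is assumed so the snippet runs without lxml.

```python
import re
import bs4

html = """
<div class="item">
  <li class="item_name">Watch</li>
  <div class="main_price">Price: $66.68</div>
  <div class="discounted_price">Discounted price: $46.68</div>
</div>
<div class="item">
  <li class="item_name">Watch2</li>
  <div class="main_price">Price: $56.68</div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# A dollar sign followed by digits, optionally with a decimal part.
price_pattern = re.compile(r"\$(\d+(?:\.\d+)?)")

main_prices = []
for tag in soup.find_all("div", class_="main_price"):
    match = price_pattern.search(tag.get_text())
    if match:  # skip tags whose text holds no recognisable price
        main_prices.append(float(match.group(1)))

print(main_prices)
```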

In [80]:
URL = "https://coinmarketcap.com"
wb.open(URL)   # returns True if it succeeds

True

In [81]:
response = requests.get(URL)
response  # displays <Response [200]> if the request succeeded
# Fun fact: a status code between 400 and 499 means the mistake is on your side (e.g. "404 Not Found").
# A status code between 500 and 599 means the mistake is on the server side (e.g. "503 Service Unavailable").

<Response [200]>
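The status code ranges mentioned in the comments can be written down as a tiny helper. This is only a sketch: `describe_status` is a hypothetical name, and it relies on the standard library's `http.HTTPStatus` for the official reason phrases.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"  # the mistake is on your side
    elif 500 <= code < 600:
        kind = "server error"  # the mistake is on the server side
    else:
        kind = "other"
    return f"{code} {phrase}: {kind}"


for code in (200, 404, 503):
    print(describe_status(code))
```

With a real response you would pass `response.status_code` to the helper.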

In [82]:
response.text     # here's the HTML behind the website

'<!doctype html>\n<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html lang="en"> <!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge"><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"VQ4BV1dWDxABVFdQAQIEX1M="};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.

In [83]:
soup = bs4.BeautifulSoup(response.text, "lxml")  # name the parser explicitly so bs4 doesn't have to guess (requires the lxml package)


In [113]:
for item in soup.find_all(class_ = 'price'):
    print(item)

<a class="price" data-btc="1.0" data-usd="6439.78227986" href="/currencies/bitcoin/#markets">$6439.78</a>
<a class="price" data-btc="0.03303785668" data-usd="212.486941027" href="/currencies/ethereum/#markets">$212.49</a>
<a class="price" data-btc="5.77844940898e-05" data-usd="0.371406552673" href="/currencies/ripple/#markets">$0.371407</a>
<a class="price" data-btc="0.0679436776067" data-usd="436.987918333" href="/currencies/bitcoin-cash/#markets">$436.99</a>
<a class="price" data-btc="0.000828804266433" data-usd="5.33055412735" href="/currencies/eos/#markets">$5.33</a>
<a class="price" data-btc="3.31067847066e-05" data-usd="0.212791977704" href="/currencies/stellar/#markets">$0.212792</a>
<a class="price" data-btc="0.00853668452264" data-usd="54.8690547482" href="/currencies/litecoin/#markets">$54.87</a>
<a class="price" data-btc="0.000155872528712" data-usd="1.00186182223" href="/currencies/tether/#markets">$1.00</a>
<a class="price" data-btc="1.16228234801e-05" data-usd="0.07475358

In [120]:
for item in soup.find_all(class_ = 'price'):
    print(item['href'].split(sep = '/')[2], end = '\t\t\t')
    print(item['data-usd'])

bitcoin			6439.78227986
ethereum			212.486941027
ripple			0.371406552673
bitcoin-cash			436.987918333
eos			5.33055412735
stellar			0.212791977704
litecoin			54.8690547482
tether			1.00186182223
cardano			0.0747535843898
monero			112.561163465
dash			192.841111661
iota			0.537131883383
tron			0.0206356079239
neo			17.5801779815
ethereum-classic			10.7964708842
tezos			1.56530958532
binance-coin			9.75132323782
nem			0.0874050213205
vechain			0.0134441526615
dogecoin			0.00571735524768
zcash			114.120040502
omisego			3.27526878837
lisk			3.37286020539
bitcoin-gold			20.9826520331
bytecoin-bcn			0.00187223985093
nano			2.45317982062
ontology			1.74639780823
bitshares			0.117029865774
decred			36.2465206079
qtum			3.36562784274
0x			0.540150840116
maker			393.635702896
bitcoin-diamond			1.78671769137
digibyte			0.025332437867
zilliqa			0.0338155072775
icon			0.606533068907
steem			0.801240812938
waves			2.22743996881
aeternity			0.944054248904
verge			0.0142417150918
siacoin			0.005385317
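Why index `[2]`? Splitting an `href` like `/currencies/bitcoin/#markets` on `/` gives an empty string first, because the path starts with a slash. The sketch below shows that, plus a slightly more defensive alternative using `urllib.parse`; `coin_slug` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

href = "/currencies/bitcoin/#markets"

# The leading slash produces an empty first element, so the name sits at index 2.
parts = href.split("/")
print(parts)
name_by_index = parts[2]


def coin_slug(href):
    """Return the last path segment, ignoring the #fragment."""
    path = urlsplit(href).path  # "/currencies/bitcoin/"
    return path.strip("/").split("/")[-1]


print(coin_slug(href))
```

The second version does not depend on how many segments come before the coin name, but it still breaks if the site changes its URL scheme, which is exactly the fragility argued in the introduction.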

In [126]:
numCoins = 5
coinNames = []
prices = []
for item in soup.find_all(class_ = 'price')[:numCoins]:
    coinName = item['href'].split(sep = '/')[2]
    coinNames.append(coinName)
    prices.append(float(item['data-usd']))

In [131]:
print(*list(zip(coinNames, prices)), sep = '\n') 

('bitcoin', 6439.78227986)
('ethereum', 212.486941027)
('ripple', 0.371406552673)
('bitcoin-cash', 436.987918333)
('eos', 5.33055412735)
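Two parallel lists are a bit awkward to work with. As a possible next step, the sketch below turns them into a dictionary and into CSV text. The values are copied from the output above; the column names `coin` and `usd` are my own choice.

```python
import csv
import io

coin_names = ["bitcoin", "ethereum", "ripple", "bitcoin-cash", "eos"]
prices = [6439.78227986, 212.486941027, 0.371406552673, 436.987918333, 5.33055412735]

# Look up a price by coin name.
price_by_coin = dict(zip(coin_names, prices))
print(price_by_coin["ethereum"])

# Write the same data as CSV into an in-memory buffer (swap in open(...) for a real file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["coin", "usd"])
writer.writerows(zip(coin_names, prices))
csv_text = buffer.getvalue()
print(csv_text)
```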


In [133]:
# response = requests.get('https://www.accuweather.com/en/gb/oxford/ox1-3/hourly-weather-forecast/330217')
# print(response)
# input_text = response.text

# soup = bs4.BeautifulSoup(input_text)
# print(soup.find_all(class_ = "hourly-table overview-hourly"))

ConnectionError: ('Connection aborted.', OSError("(60, 'ETIMEDOUT')",))
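Requests can fail like this at any time, so it helps to set a timeout and catch the error instead of letting the notebook crash. Below is a minimal sketch under a few assumptions: `fetch_html` and `fake_get` are hypothetical names, the URL is a placeholder, and the stand-in getter simulates the timeout so the example runs without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page HTML, or None if the request fails."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx answers into exceptions
    except requests.exceptions.RequestException as error:
        print(f"Could not fetch {url}: {error}")
        return None
    return response.text


# Stand-in for requests.get that simulates a timed-out connection.
def fake_get(url, timeout):
    raise requests.exceptions.ConnectionError("simulated ETIMEDOUT")


result = fetch_html("https://example.invalid/forecast", get=fake_get)
print(result)
```

In real use you would call `fetch_html(URL)` and only build the soup when the result is not `None`.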