## Bloomberg Web Scraping

*Prepared by:*  
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

<sup>```Last run: 2021-07-06 11:39PM (GMT +8)```</sup>

This notebook shows how to scrape a real-world website. We will scrape the currency data on the Bloomberg website.

As of the time this notebook was last updated, this is what the Bloomberg Currencies webpage looks like:

<img width=1000 src="images/Bloomberg Currencies.png" />

### Using `requests` + `BeautifulSoup`

Let's use our boilerplate code for scraping and parsing website contents with the `requests` and `BeautifulSoup` libraries. If you have not installed `BeautifulSoup` yet, you can install it with either of the following commands:

```conda install -c anaconda beautifulsoup4``` or
```pip install beautifulsoup4```

For this exercise, we will be scraping the following webpage: https://www.bloomberg.com/markets/currencies.

In [3]:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bloomberg.com/markets/currencies")

# Feed the response body into BeautifulSoup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Bloomberg - Are you a robot?
  </title>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://assets.bwbx.io/font-service/css/BWHaasGrotesk-55Roman-Web,BWHaasGrotesk-75Bold-Web,BW%20Haas%20Text%20Mono%20A-55%20Roman/font-face.css" rel="stylesheet" type="text/css"/>
  <style rel="stylesheet" type="text/css">
   html, body, div, span, applet, object, iframe,
        h1, h2, h3, h4, h5, h6, p, blockquote, pre,
        a, abbr, acronym, address, big, cite, code,
        del, dfn, em, img, ins, kbd, q, s, samp,
        small, strike, strong, sub, sup, tt, var,
        b, u, i, center,
        dl, dt, dd, ol, ul, li,
        fieldset, form, label, legend,
        table, caption, tbody, tfoot, thead, tr, th, td,
        article, aside, canvas, details, embed,
        figure, figcaption, footer, header, hgroup,
        menu, nav, output, ruby, section, summary,
        time, mark, audio, video {
            mar

### Bot Blocker

You will notice that we did not get the contents of the webpage we were trying to access. Instead, the server returned a bot-detection page titled "Bloomberg - Are you a robot?". Note that this is different from the site's `robots.txt` file (https://www.bloomberg.com/robots.txt), which tells crawlers which page(s) they are asked not to scrape.
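As a rough illustration of how the rules in a `robots.txt` file can be checked before scraping, here is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the `my-scraper` user agent are made up for this example and are not Bloomberg's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT the actual content of https://www.bloomberg.com/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /markets/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a hypothetical user-agent name.
allowed_markets = parser.can_fetch(
    "my-scraper", "https://www.bloomberg.com/markets/currencies"
)
allowed_search = parser.can_fetch("my-scraper", "https://www.bloomberg.com/search")

print("markets allowed:", allowed_markets)
print("search allowed:", allowed_search)
```

To check a live site instead of sample text, `RobotFileParser` also provides `set_url(...)` followed by `read()`, which downloads the file before you call `can_fetch(...)`.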

To get past this bot check, we pass a `headers` parameter containing key-value pairs of metadata that make the request look like it comes from an ordinary web browser. We will be using the same boilerplate code, but with the `headers` parameter added.

In [4]:
headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
           "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}

page = requests.get("https://www.bloomberg.com/markets/currencies", headers=headers)

# Feed the response body into BeautifulSoup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
# print(soup.prettify())  # commented out because the output is very long
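Since the full output is no longer printed, it can help to confirm that the response is the real page and not the bot-detection page. The sketch below is one possible check, assuming the block page keeps the title seen earlier ("Bloomberg - Are you a robot?"). The names `TitleParser` and `is_bot_block_page` are hypothetical helpers, and the example uses only the standard library so it runs without extra packages.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> tag of an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def is_bot_block_page(html: str) -> bool:
    """Return True if the page title looks like the bot-detection page.

    Assumption: the block page title contains "Are you a robot",
    as in the output shown earlier in this notebook.
    """
    parser = TitleParser()
    parser.feed(html)
    return "are you a robot" in parser.title.lower()


# Small sample documents standing in for real responses.
blocked_html = "<html><head><title>\n Bloomberg - Are you a robot?\n</title></head></html>"
normal_html = "<html><head><title>Currencies</title></head><body></body></html>"

print(is_bot_block_page(blocked_html))
print(is_bot_block_page(normal_html))
```

In the notebook, the same check could be applied to the response with `is_bot_block_page(page.text)`. With `soup` already built, inspecting `soup.title` directly gives the same information.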

### Exercise

Get the table of currencies on the webpage and turn it into a Pandas DataFrame.

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> judemichaelteves@gmail.com or jude.teves@dlsu.edu.ph</sup><br>