## Bloomberg Web Scraping

*Prepared by:*  
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook shows how to scrape a real-world website. We will scrape the currency table on the Bloomberg website. This is the solution to the previous notebook.

**Reminder**

> *"With great power, comes great responsibility"*
    
Remember to perform web scraping with extra caution and to not abuse it. The boundaries are not so clear when it comes to what you can and cannot legally do with scraping. Use your own judgment to determine if what you are about to do is unethical or illegal.
<hr>

<sup>```Last run: 2021-07-12 10:37PM (GMT +8)```</sup>

As of the time this notebook was last updated, this is what the Bloomberg Currencies webpage looks like:

<center><img width=1000 src="../images/Bloomberg Currencies.png" /></center>

### Using `requests` + `BeautifulSoup`

Let's use our boilerplate code for scraping and parsing website contents with the `requests` and `BeautifulSoup` libraries. If you have not installed `BeautifulSoup` yet, you can do so with either of the following:

```conda install -c anaconda beautifulsoup4``` or
```pip install beautifulsoup4```

For this exercise, we will be scraping the following webpage: https://www.bloomberg.com/markets/currencies.

In [1]:
import requests
from bs4 import BeautifulSoup

# page = requests.get("https://www.bloomberg.com/markets/currencies")

# #feed it into beautiful soup for parsing
# soup = BeautifulSoup(page.text, 'html.parser')
# print(soup.prettify())

### Bot Blocker

You will notice that we did not get the actual contents of the webpage we were trying to access. Instead, we were redirected to the `robots.txt` page of the Bloomberg website: https://www.bloomberg.com/robots.txt. This page lists the paths that the site asks automated crawlers not to access.
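
To check programmatically what a `robots.txt` file permits, Python's standard library provides `urllib.robotparser`. The sketch below parses a small made-up rule set (not Bloomberg's actual file) so that it runs without network access. The rules, the `example.com` host, and the helper name `is_allowed` are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; NOT the real Bloomberg robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(path, user_agent="*"):
    """Return True if the sample rules let `user_agent` fetch `path`."""
    return parser.can_fetch(user_agent, "https://example.com" + path)


print(is_allowed("/markets/currencies"))
print(is_allowed("/account/settings"))
```

For a live check, calling `parser.set_url(...)` with the site's `robots.txt` URL and then `parser.read()` would load the real rules instead of the sample string.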

To get past this, we include a `headers` parameter containing key-value pairs of metadata that make the request look like it comes from an ordinary web browser. We will use the same boilerplate code, this time with the `headers` parameter.

In [2]:
headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
           "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}

page = requests.get("https://www.bloomberg.com/markets/currencies", headers=headers)

# feed it into BeautifulSoup for parsing
soup = BeautifulSoup(page.text, 'html.parser')
# print(soup.prettify())  # commented out because the output is very long
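
Changing the headers does not guarantee that the real page came back, so it can help to inspect the response before parsing it. The following is a minimal sketch: `looks_blocked` is a hypothetical helper, and the marker phrases it searches for are assumptions about what a bot-check page might contain, not text confirmed from Bloomberg.

```python
# Phrases that MIGHT appear on a bot-check page (assumed, not verified).
BLOCK_MARKERS = ("are you a robot", "unusual activity", "captcha")


def looks_blocked(status_code, html):
    """Heuristically decide whether a response is a bot-check page.

    `status_code` is the HTTP status (e.g. page.status_code) and
    `html` is the response body (e.g. page.text).
    """
    if status_code in (403, 429):  # forbidden / too many requests
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


# With the objects from the cell above, a call would look like:
#   looks_blocked(page.status_code, page.text)
print(looks_blocked(200, "<table><tr><td>EUR-USD</td></tr></table>"))
print(looks_blocked(200, "<h1>Are you a robot?</h1>"))
print(looks_blocked(403, ""))
```

A heuristic like this can miss block pages or flag real ones, so treat its result as a hint rather than a definitive answer.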

### Exercise

Get the table of currencies on the webpage and turn it into a Pandas DataFrame.

In [3]:
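from io import StringIO
import pandas as pd

# Parse the first <table>; StringIO avoids the literal-HTML deprecation in newer pandas.
pd.read_html(StringIO(str(soup.find_all('table')[0])))[0]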

|   | Currency | Value | Change | Net Change | Time (EDT) |
|---|----------|-------|--------|------------|------------|
| 0 | EUR-USD | 1.187 | -0.0006 | -0.05% | 10:31 AM |
| 1 | USD-JPY | 110.29 | 0.15 | +0.14% | 10:31 AM |
| 2 | GBP-USD | 1.39 | -0.0001 | -0.01% | 10:31 AM |
| 3 | AUD-USD | 0.7487 | -0.0001 | -0.01% | 10:31 AM |
| 4 | USD-CAD | 1.2458 | 0.0011 | +0.09% | 10:31 AM |
| 5 | USD-CHF | 0.9148 | 0.0001 | +0.01% | 10:31 AM |
| 6 | EUR-JPY | 130.92 | 0.12 | +0.09% | 10:32 AM |
| 7 | EUR-GBP | 0.854 | -0.0005 | -0.06% | 10:31 AM |
| 8 | USD-HKD | 7.7671 | -0.0003 | 0.00% | 10:31 AM |
| 9 | EUR-CHF | 1.0859 | 0.0002 | +0.02% | 10:31 AM |
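
The percentage values in the output above arrive as text such as `+0.14%`, which cannot be used in arithmetic directly. A minimal sketch of converting such strings to floats in plain Python follows. `parse_percent` is a hypothetical helper name, and the sample values are copied from the output above.

```python
def parse_percent(text):
    """Convert a string like '+0.14%' or '-0.05%' to a float in percent units."""
    return float(text.strip().rstrip("%"))


samples = ["-0.05%", "+0.14%", "0.00%"]
converted = [parse_percent(s) for s in samples]
print(converted)
```

If the scraped table is stored in a DataFrame named `df` (an assumed name), the same function could be applied to a column with `df["Net Change"].map(parse_percent)`.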


## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> <a href="mailto:judemichaelteves@gmail.com">judemichaelteves@gmail.com</a> or <a href="mailto:jude.teves@dlsu.edu.ph">jude.teves@dlsu.edu.ph</a></sup><br>