<div>
<img src="https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg" width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and rewrite the code as needed.
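
Rule 2 can be partly expressed in code. The sketch below is one possible approach and is not part of the original lab: it filters URLs against a site's `robots.txt` rules using the standard-library `urllib.robotparser`, then pauses between consecutive requests. The helper names `allowed_urls` and `fetch_politely`, the `lab-scraper` user agent, the `example.com` URLs and the sample rules are all illustrative assumptions, and `fetch` stands in for whatever download function is in use. Checking `robots.txt` does not replace reading the Terms and Conditions.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_lines, user_agent, urls):
    """Return the URLs that the given robots.txt rules permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already held in memory
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # one request per second by default
        results.append(fetch(url))
    return results


# Hypothetical rules and URLs, used only to show the calls.
sample_rules = ["User-agent: *", "Disallow: /private/"]
sample_urls = ["https://example.com/table/", "https://example.com/private/data"]
permitted = allowed_urls(sample_rules, "lab-scraper", sample_urls)
print(permitted)
```

In a real run, the rules would come from the site's own `robots.txt` file, and `fetch` would wrap the request code used later in this lab.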

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with. The worked example below uses a currency table from X-Rates instead.

Open a web page with the browser and inspect it.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, check which HTML tags enclose the main text; it usually sits a few levels deep.

In [29]:
## Import Libraries
import re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup
from urllib.request import urlopen

import warnings
warnings.filterwarnings('ignore')

In [30]:
#!pip install bs4

### Define the content to retrieve (webpage's URL)

In [31]:
# specify the url
quote_page = 'https://www.x-rates.com/table/?from=SGD&amount=1'

### Retrieve the page
- Require Internet connection

In [32]:
# query the website and store the returned HTML in the variable `page`
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 45506


### Convert the stream of bytes into a BeautifulSoup representation

In [33]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [34]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="This currency rates table lets you compare an amount in Singapore Dollar to all other currencies." name="description"/>
  <meta content="SGD EUR, currency exchange table, exchange rate table, convert, euro, american dollar, british pound, canadian dollar, australian dollar, x-rates" name="keywords"/>
  <link href="https://www.x-rates.com/table/?from=SGD&amp;amount=1" rel="canonical"/>
  <script type="text/javascript">
   var e9AdSlots  = { 
				  output_lb : {site:'ExchangeRates', adSpace:'Homepage', size:'728x90,468x60', noAd: '1'},
output_rs : {site:'ExchangeRates', adSpace:'Homepage', size:'300x250,300x600,160x600', noAd: '1'},
ros_ls : {site:'XEInternal', adSpace:'HRROS', size:'300x250', rsize: '238x230', noAd: '1', async: false},
ros_ms : {site:'XEInternal', adSpace:'HRROS', size:'468x60', rsize: '300x90', n

### Check the HTML's Title

In [35]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Currency Exchange Table (Singapore Dollar - SGD) - X-Rates</title>:
Title text:Currency Exchange Table (Singapore Dollar - SGD) - X-Rates:


### Find the main content
- Check if it is possible to use only the relevant data

In [36]:
tag = 'tbody'
article = soup.find_all(tag)[0]
print('Type of the variable \'article\':', article.__class__.__name__)

Type of the variable 'article': Tag


### Get some of the text
- Plain text without HTML tags

In [37]:
# show the text after collapsing runs of blank lines into single newlines
print(re.sub(r'\n\n+', '\n', article.text))


US Dollar
0.736885
1.357064
Euro
0.624227
1.601980
British Pound
0.535020
1.869089
Indian Rupee
54.995864
0.018183
Australian Dollar
0.995837
1.004180
Canadian Dollar
0.929545
1.075795
Swiss Franc
0.677270
1.476516
Malaysian Ringgit
3.101916
0.322381
Japanese Yen
81.125696
0.012327
Chinese Yuan Renminbi
4.774424
0.209449
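
The flat text above can be regrouped into records. The following is a minimal sketch, assuming each table row contributes exactly three non-empty lines in the order shown in the output: a currency name, a rate, and an inverse rate. The function name `parse_rates`, the dictionary keys, and the reading of the two numeric columns are assumptions made for illustration, and `sample_text` is a shortened copy of the output above.

```python
def parse_rates(text):
    """Group non-empty lines into currency / rate / inverse-rate records."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = []
    # Step through the lines three at a time; an incomplete trailing group is skipped.
    for i in range(0, len(lines) - 2, 3):
        name, rate, inverse = lines[i:i + 3]
        records.append({
            "currency": name,
            "rate": float(rate),
            "inverse": float(inverse),
        })
    return records


# Shortened copy of the text printed above.
sample_text = "\n\nUS Dollar\n0.736885\n1.357064\nEuro\n0.624227\n1.601980\n"
rates = parse_rates(sample_text)
print(rates[0])
```

In the notebook, the same function could be called as `parse_rates(article.text)`. If the site changes its table layout, the three-lines-per-row assumption would need to be rechecked.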



### Find the links in the text

In [38]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in article.find_all(tag)]
tag_list

['https://www.x-rates.com/graph/?from=SGD&to=USD',
 'https://www.x-rates.com/graph/?from=USD&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=EUR',
 'https://www.x-rates.com/graph/?from=EUR&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=GBP',
 'https://www.x-rates.com/graph/?from=GBP&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=INR',
 'https://www.x-rates.com/graph/?from=INR&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=AUD',
 'https://www.x-rates.com/graph/?from=AUD&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=CAD',
 'https://www.x-rates.com/graph/?from=CAD&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=CHF',
 'https://www.x-rates.com/graph/?from=CHF&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=MYR',
 'https://www.x-rates.com/graph/?from=MYR&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=JPY',
 'https://www.x-rates.com/graph/?from=JPY&to=SGD',
 'https://www.x-rates.com/graph/?from=SGD&to=CNY',
 'https://www.x-rates.com/graph

### Create a filter for unwanted types of links

In [39]:
# keep only the links that start with the graph URL prefix, and strip that prefix
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[:36] == 'https://www.x-rates.com/graph/?from=':
        wiki_link = link[36:]
        wiki_tag_list.append(wiki_link)

# Equivalent list comprehension:
# wiki_tag_list = [link[36:] for link in tag_list if link is not None and link[:36] == 'https://www.x-rates.com/graph/?from=']

print('Size of \'wiki_tag_list\':', len(wiki_tag_list))
wiki_tag_list

Size of 'wiki_tag_list': 20


['SGD&to=USD',
 'USD&to=SGD',
 'SGD&to=EUR',
 'EUR&to=SGD',
 'SGD&to=GBP',
 'GBP&to=SGD',
 'SGD&to=INR',
 'INR&to=SGD',
 'SGD&to=AUD',
 'AUD&to=SGD',
 'SGD&to=CAD',
 'CAD&to=SGD',
 'SGD&to=CHF',
 'CHF&to=SGD',
 'SGD&to=MYR',
 'MYR&to=SGD',
 'SGD&to=JPY',
 'JPY&to=SGD',
 'SGD&to=CNY',
 'CNY&to=SGD']
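
Slicing at a fixed offset works only while the URL prefix keeps its exact length. As an alternative sketch, the standard-library `urllib.parse` module can read the `from` and `to` parameters directly from each link. The helper name `currency_pair` is an assumption for illustration, and `sample_links` copies two of the links listed earlier.

```python
from urllib.parse import urlparse, parse_qs


def currency_pair(link):
    """Return (from_code, to_code) read from a link's query string, or None."""
    query = parse_qs(urlparse(link).query)
    if "from" in query and "to" in query:
        return query["from"][0], query["to"][0]
    return None


sample_links = [
    "https://www.x-rates.com/graph/?from=SGD&to=USD",
    "https://www.x-rates.com/graph/?from=USD&to=SGD",
    None,  # <a> tags without an href attribute yield None
]
pairs = [currency_pair(link) for link in sample_links if link is not None]
print(pairs)
```

In the notebook, `tag_list` would take the place of `sample_links`; links without both parameters come back as `None` and can be filtered out afterwards.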



---






© 2021 Institute of Data





