# Web Scraping
Today, we are going to walk through web scraping. Web scraping is a technique for extracting data from web pages when no API is available. It usually means downloading the HTML from the server, parsing it, and extracting the relevant information with XPath queries, regular expressions, or even machine learning.

# Real Life Use Case
Web scraping is a common technique with the following applications.

## __A__. Price Monitoring

E-commerce sites monitor their competitors' websites to track product prices and adjust their own prices accordingly.

![Amazon PS4 Price](price_example.png)

## __B__. Search Engine Indexing

![Google Web Master](seo-google-webmaster-tools.jpg)

Google develops __"Googlebot"__ to crawl all kinds of websites, index them, and calculate __"PageRank"__ so that its search engine can return the best results.

## __C__. Social Network / Forum Analysis

![LIGHK](lighk.png)

To understand the behavior of internet users, web scraping can be used to extract user information. Insights such as topic trends, interactions, and top users can be discovered after scraping the data.

# Libraries / Tools

Today, we are going to go through the following libraries.

1. `requests`
2. Beautiful Soup
3. Scrapy

We will be using the Hong Kong Observatory website in today's tutorial.

![weather](weather.png)

# Downloading HTML

The first step, of course, is to download the HTML source from the website. Here, we will use the `requests` library. First, we import the library.


In [1]:
import requests

The target URL is "http://www.hko.gov.hk/wxinfo/currwx/current.htm".

In [2]:
response = requests.get("http://www.hko.gov.hk/wxinfo/currwx/current.htm")

`get` sends an HTTP GET request to "http://www.hko.gov.hk/wxinfo/currwx/current.htm". After it returns, the HTML content is available as `response.text` (decoded text) or `response.content` (raw bytes).
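
A request can fail or hang, so it may be worth checking the result before using the body. The sketch below wraps `requests.get` in a helper; the name `fetch_html` and the 10-second timeout are our own choices for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        # Without a timeout, requests may wait indefinitely for a response.
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html("http://www.hko.gov.hk/wxinfo/currwx/current.htm")` would then return the same text as `response.text` when the request succeeds, and `None` otherwise.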

In [3]:
response.text[0:2000]

'<!DOCTYPE html>\r\n<HTML  lang="en" >\r\n<HEAD>\r\n<TITLE>Local Weather Report</TITLE>\r\n<META http-equiv="Content-Type" content="text/html; CHARSET=UTF-8">\r\n<meta name="Comments" content="WCAG2.0_Verified" />\r\n<meta name="Description" content="Current weather report of Hong Kong issued by the Hong Kong Observatory" />\r\n<meta name="Keywords" content="Current weather, Hong Kong Observatory, HKO" />\r\n\r\n</HEAD>\r\n<BODY  >\r\n<script type="text/javascript" src="/js/jquery/jquery-1.6.4.min.js"></script>\r\n<script language="JavaScript" src="/clf.js?20181105"></script>\r\n<script language=\'javascript\' src="/JSMenu_MAIN.js"></script>\r\n<script language=\'javascript\' src="/enJSMenu.js"></script>\r\n<script language=\'javascript\' src=\'/additionalMenu_EN.js\'></script>\r\n<script language=\'javascript\' src=\'/common/common_page_en_2018.js\'></script>\r\n<LINK rel=stylesheet type=text/css href="/engimages/clf.css?20181026">\r\n<LINK rel="SHORTCUT ICON" href="/Logo.ico">\r\n<LI

Too much text? Let's write it to a file first.

In [4]:
# Write with an explicit encoding so non-ASCII characters are saved safely.
with open('weather.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

In `weather.html`, you will eventually find the HTML block below:

```html
<div class="hkoweb_outer hkoweb_bg_f2">
	<div class="hkoweb_title hkoweb_bg_f2"></div>
	<div class="hkoweb_inner hkoweb_bg_default">
<a href="http://maps.weather.gov.hk/ocf/index_e.html" target="_blank"><img src="/img/arwf_banner_e.png" border="0" alt="Automatic Regional Weather Forecast in Hong Kong &amp; Pearl River Delta Region" ></a>
<p><a href="flw.htm">Local Weather Forecast</a>&nbsp;|&nbsp;<a href="../frt/frt.htm">Regional Temperature Forecast in HK</a><br>
<a href="fnd.htm">9-day Weather Forecast</a>&nbsp;|
<a href="../uvindex/english/uvfcst.htm">UV Index Forecast</a><br>
<a href="ffish.htm">Weather Information for South China Coastal Waters</a></p>
<span style="font-style:italic">Bulletin issued at 12:02 HKT 17/Apr/2018</span><br/>
<br/>
<div style="display:block; width: 470px;">
At noon at the Hong Kong Observatory the air temperature was 20 degrees Celsius and the relative humidity 82 per cent.<br/>
During the past hour the mean UV Index recorded at King's Park was 2. The intensity of UV radiation was low.<br/>
<br/>
The air temperatures at other places were:<br/>
<br/>
```

Now that you have successfully grabbed the HTML source, the next step is to parse it and retrieve the weather information.

# Beautiful Soup
Now we are going to extract information from the __"hkoweb_outer hkoweb_bg_f2"__ HTML node. First, we need to import `BeautifulSoup`.

In [5]:
from bs4 import BeautifulSoup

The next step is to pass `response.content` to `BeautifulSoup` to parse the HTML content.

In [6]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.string)

Local Weather Report


We can parse the content and retrieve the title with two simple lines. Now let's try to get a `div` element by its class.

In [7]:
weather_content_container = soup.find('div', {'class': 'hkoweb_outer hkoweb_bg_f2'})
weather_content_container

<div class="hkoweb_outer hkoweb_bg_f2">
<div class="hkoweb_title hkoweb_bg_f2"></div>
<div class="hkoweb_inner hkoweb_bg_default">
<a href="http://maps.weather.gov.hk/ocf/index_e.html" target="_blank"><img alt="Automatic Regional Weather Forecast in Hong Kong &amp; Pearl River Delta Region" border="0" src="/img/arwf_banner_e.png"/></a>
<p><a href="flw.htm">Local Weather Forecast</a> | <a href="../frt/frt.htm">Regional Temperature Forecast in HK</a><br>
<a href="fnd.htm">9-day Weather Forecast</a> |
<a href="../uvindex/english/uvfcst.htm">UV Index Forecast</a><br>
<a href="ffish.htm">Weather Information for South China Coastal Waters</a></br></br></p>
<span style="font-style:italic">Bulletin issued at 14:02 HKT 17/Apr/2018</span><br/>
<br/>
<div style="display:block; width: 470px;">
At 2 p.m. at the Hong Kong Observatory the air temperature was 22 degrees Celsius and the relative humidity 73 per cent.<br/>
During the past hour the mean UV Index recorded at King's Park was 4. The intensi
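
The `find` call above passes the whole `class` string, which Beautiful Soup compares against the attribute exactly as written, including the order of the class names. Here is a small sketch of two more forgiving alternatives; the HTML snippet is made up for illustration, with the two classes deliberately swapped.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: same classes as the real page, but in reverse order.
html = '<div class="hkoweb_bg_f2 hkoweb_outer"><p>Bulletin</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact attribute string: no match, because the class order differs.
exact = soup.find("div", {"class": "hkoweb_outer hkoweb_bg_f2"})

# Single class name: matches any div that carries this class.
by_one_class = soup.find("div", class_="hkoweb_outer")

# CSS selector: matches a div that has both classes, in any order.
by_selector = soup.select_one("div.hkoweb_outer.hkoweb_bg_f2")

print(exact)                  # None
print(by_one_class.p.string)  # Bulletin
print(by_selector.p.string)   # Bulletin
```

On the real page the class order matches, so the original call works; the alternatives only matter if the site reorders or adds classes.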

In [8]:
weather_content_container.text

"\n\n\n\nLocal Weather Forecast\xa0|\xa0Regional Temperature Forecast in HK\n9-day Weather Forecast\xa0|\r\nUV Index Forecast\nWeather Information for South China Coastal Waters\nBulletin issued at 14:02 HKT 17/Apr/2018\n\n\r\nAt 2 p.m. at the Hong Kong Observatory the air temperature was 22 degrees Celsius and the relative humidity 73 per cent.\r\nDuring the past hour the mean UV Index recorded at King's Park was 4. The intensity of UV radiation was moderate.\n\r\nThe air temperatures at other places were:\n\nKing's Park\r\n22\xa0degrees;\n\nWong Chuk Hang\r\n22\xa0degrees;\n\nTa Kwu Ling\r\n24\xa0degrees;\n\nLau Fau Shan\r\n22\xa0degrees;\n\nTai Po\r\n21\xa0degrees;\n\nSha Tin\r\n22\xa0degrees;\n\nTuen Mun\r\n23\xa0degrees;\n\nTseung Kwan O\r\n21\xa0degrees;\n\nSai Kung\r\n22\xa0degrees;\n\nCheung Chau\r\n21\xa0degrees;\n\nChek Lap Kok\r\n21\xa0degrees;\n\nTsing Yi\r\n22\xa0degrees;\n\nShek Kong\r\n24\xa0degrees;\n\nTsuen Wan Ho Koon\r\n21\xa0degrees;\n\nTsuen Wan Shing Mun Valley\r\

What a mess... Let's split the string into lines and drop the empty ones.

In [9]:
lines = weather_content_container.text.splitlines()
lines = [line for line in lines if line != '']
lines

['Local Weather Forecast\xa0|\xa0Regional Temperature Forecast in HK',
 '9-day Weather Forecast\xa0|',
 'UV Index Forecast',
 'Weather Information for South China Coastal Waters',
 'Bulletin issued at 14:02 HKT 17/Apr/2018',
 'At 2 p.m. at the Hong Kong Observatory the air temperature was 22 degrees Celsius and the relative humidity 73 per cent.',
 "During the past hour the mean UV Index recorded at King's Park was 4. The intensity of UV radiation was moderate.",
 'The air temperatures at other places were:',
 "King's Park",
 '22\xa0degrees;',
 'Wong Chuk Hang',
 '22\xa0degrees;',
 'Ta Kwu Ling',
 '24\xa0degrees;',
 'Lau Fau Shan',
 '22\xa0degrees;',
 'Tai Po',
 '21\xa0degrees;',
 'Sha Tin',
 '22\xa0degrees;',
 'Tuen Mun',
 '23\xa0degrees;',
 'Tseung Kwan O',
 '21\xa0degrees;',
 'Sai Kung',
 '22\xa0degrees;',
 'Cheung Chau',
 '21\xa0degrees;',
 'Chek Lap Kok',
 '21\xa0degrees;',
 'Tsing Yi',
 '22\xa0degrees;',
 'Shek Kong',
 '24\xa0degrees;',
 'Tsuen Wan Ho Koon',
 '21\xa0degrees;',
 '
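
In the list above, the entries after `'The air temperatures at other places were:'` alternate between a place name and a reading. A rough sketch of turning them into a dictionary follows; `region_lines` holds a few entries copied from the output, and `pair_regional_temperatures` is a hypothetical helper that assumes the strict name/reading alternation seen in this run.

```python
import re

# A few entries copied from the output above. Note the non-breaking
# space (\xa0) between the number and "degrees".
region_lines = [
    "King's Park", "22\xa0degrees;",
    "Wong Chuk Hang", "22\xa0degrees;",
    "Ta Kwu Ling", "24\xa0degrees;",
]


def pair_regional_temperatures(entries):
    """Map each place name to its temperature, assuming strict alternation."""
    temperatures = {}
    # Even positions are place names, odd positions are readings.
    for place, reading in zip(entries[0::2], entries[1::2]):
        # \s also matches the non-breaking space in Python 3 string patterns.
        m = re.match(r"(-?\d+)\s*degrees", reading)
        if m:  # skip pairs whose second item is not a reading
            temperatures[place.strip()] = int(m.group(1))
    return temperatures


regional = pair_regional_temperatures(region_lines)
print(regional)
```

With the full list, the readings in this particular run start at `lines[8]`, but that position depends on the page layout and should be checked before relying on it.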

So now we are going to extract information from the 6th line.

In [10]:
print(lines[5])

At 2 p.m. at the Hong Kong Observatory the air temperature was 22 degrees Celsius and the relative humidity 73 per cent.


Here, we are going to do some string processing with a regular expression. We first import the `re` module and then call `re.match(pattern, string)`, which returns a match object if the pattern matches from the start of the string and `None` otherwise. You can call `m.group(x)` to get the text captured by the x-th pair of parentheses.

In [11]:
import re
temperature_line = lines[5]
m = re.match(r'.*was (\d+) degrees.*humidity (\d+) per.*', temperature_line)
temperature = int(m.group(1))
humidity = int(m.group(2))
print("Temperature: %d degrees Celsius, Humidity: %d%%" % (temperature, humidity))

Temperature: 22 degrees Celsius, Humidity: 73%
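
The cell above assumes the pattern always matches. If the page wording changes, `re.match` returns `None` and `m.group(1)` raises an `AttributeError`. Below is a slightly more defensive sketch; `parse_reading` is a hypothetical helper name, and the pattern is our own variation on the one used above.

```python
import re

# re.search scans the whole line, so no leading or trailing ".*" is needed.
READING_PATTERN = re.compile(
    r"air temperature was (-?\d+) degrees Celsius.*?relative humidity (\d+) per cent"
)


def parse_reading(line):
    """Return (temperature, humidity) as ints, or None if the line does not match."""
    m = READING_PATTERN.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


sample = ("At 2 p.m. at the Hong Kong Observatory the air temperature was "
          "22 degrees Celsius and the relative humidity 73 per cent.")
print(parse_reading(sample))           # (22, 73)
print(parse_reading("No data today"))  # None
```

Applying `parse_reading` to every entry in `lines` and keeping the first non-`None` result would also avoid hard-coding the index 5.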


# That's It! 👍
We just built a simple weather web scraper in under 20 lines of code! 👍

# Scrapy

![Scrapy](https://scrapy.org/img/scrapylogo.png)

Scrapy is a popular web scraping framework that helps you develop complex web scraping programs quickly.

_Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival._

_Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler._

From [Scrapy Overview](https://docs.scrapy.org/en/latest/intro/overview.html)


# Example Spider

In [34]:
! python spider.py

香港電台網站
http://news.rthk.hk/rthk/ch/component/k2/1391618-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391617-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391614-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391613-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391612-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391611-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391609-20180417.htm
http://news.rthk.hk/rthk/ch/component/k2/1391608-20180417.htm
美國制裁中興通訊　分析料或阻礙中國5G通訊市場發展 - RTHK

			美國商務部禁止美國企業未來7年向中興通訊出售零部件，因為中興通訊違反制裁條款，以及多次作出虛假陳述。

中興通訊被指向伊朗與北韓出售通訊設備，違反美國制裁禁令，去年3月與美國政府達成和解，支付近12億美元的罰款，當中3億美元罰款暫緩。美國商務部指，中興通訊未有執行和解協議，並且做出虛假陳述，誤導商務部，不但未有懲處相關員工，反而給予獎勵。

中興通訊A、H股今日起停牌，正評估美國出口禁令的潛在影響，並與各方溝通。中國商務部指，密切關注事件進展，隨時準備採取必要措施，維護中國企業合法權益。

市場預料，中興通訊生產設備中，有25-30%的零部件由美國企業提供，有關做法等同切斷中興的產品供應鏈。騰祺基金管理投資管理董事沈慶洪表示，事件令人認為美國有意抗衡中國科技崛起，有機會加劇兩國緊張關係。他又說，中興在中國電訊設備市佔率達30%，如果生產受阻，或阻礙中國5G通訊市場發展，影響深遠。		
陳淑莊歡迎政府公布「上網電價」水平 - RTHK

			立法會環境事務委員會主席、公民黨的陳淑莊歡迎政府公布「上網電價」的

# Conclusion

Although web scraping is a common and powerful technique for data extraction, please be mindful of how your crawler behaves and try not to send too many requests to the target server. Otherwise, your spider may be treated as an **attack program**!
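
One way to stay polite is to respect the site's `robots.txt` and to pause between requests. The sketch below uses only the standard library; the helper names, the one-second default delay, and the example rules and URLs are hypothetical.

```python
import time
from urllib import robotparser


def build_robot_rules(robots_txt_lines):
    """Parse robots.txt content that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt_lines)
    return rules


def polite_urls(urls, rules, user_agent="*", delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, pausing after each one."""
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip paths the site asked crawlers to avoid
        yield url
        time.sleep(delay_seconds)  # simple fixed delay between requests


# Hypothetical rules and URLs, for illustration only.
rules = build_robot_rules(["User-agent: *", "Disallow: /private/"])
urls = ["http://example.com/index.html", "http://example.com/private/data.html"]
print(list(polite_urls(urls, rules, delay_seconds=0)))
# ['http://example.com/index.html']
```

Scrapy offers similar behavior through its `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` settings.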