# Chapter 8: Extracting Data from the Internet
## Part 3: Parsing XML APIs by SAX + *regular expression: re*

2020-12-08 \
Revised by 鍾 2019-11-30 \
Ported to Python 3 by 士超 2019-04-25 \
Prepared with help from 逸勳 2018-05-22 

# Four ways to parse XML APIs:
1. DOM (covered in Part 2)
2. SAX (covered in Part 3)
3. *regular expression (regex) (touched on briefly)*
4. ElementTree (covered in Part 4; not taught in the textbook)

# Parsing XML APIs

There are two API approaches for reading pages in XML format:

1. DOM (Document Object Model): You load **the whole thing into memory**, which makes it a massive memory hog. You can run out of memory even with medium-sized documents. In return, you can use XPath, traverse the tree, and so on.

2. SAX (Simple API for XML): A **stream-based processor**. Only a tiny part of the document is in memory at any time, and you "sniff" the XML stream by implementing callback code for events like tagStarted() (in Python's `xml.sax` the callback is named `startElement()`). It uses almost no memory, but you can't do "DOM" things such as using XPath or traversing the tree.

## What is the difference between SAX and DOM?
https://stackoverflow.com/questions/6828703/what-is-the-difference-between-sax-and-dom

**SAX is event-based and DOM is tree model**

In SAX, events are triggered while the XML is being parsed. When the parser encounters a start tag (e.g. `<something>`), it triggers the tagStarted event (the actual name of the event might differ). Similarly, when it meets the end of the tag (`</something>`), it triggers tagEnded. Using a SAX parser means you need to handle these events and make sense of the data returned with each event.
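To make the event order concrete, here is a minimal sketch that records every callback Python's `xml.sax` makes while reading a tiny document. The `EventLogger` class and the `sample` document are made up for illustration; only `xml.sax.ContentHandler` and `xml.sax.parseString` come from the standard library.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record every SAX event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, content):
        if content.strip():  # skip whitespace-only chunks
            self.events.append(('text', content))

sample = b'<a><b>hi</b><c/></a>'  # hypothetical document, not the weather feed
logger = EventLogger()
xml.sax.parseString(sample, logger)
for event in logger.events:
    print(event)
```

Note that the self-closing `<c/>` still produces both a start and an end event, which is why a handler can always pair the two.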

In DOM, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML.

In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it.
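For contrast, a DOM-style sketch using the standard library's `xml.dom.minidom`: the whole document is parsed first, and only then do we navigate the tree. The `sample` string is a shortened, made-up fragment modeled on the weather response.

```python
from xml.dom import minidom

# Hypothetical fragment shaped like the <location> block of the weather feed
sample = '<location><name>Taipei</name><country>TW</country></location>'

dom = minidom.parseString(sample)  # the entire tree is built here

# After parsing, any node can be reached in any order
name_node = dom.getElementsByTagName('name')[0]
country_node = dom.getElementsByTagName('country')[0]

city = name_node.firstChild.data        # text node inside <name>
country = country_node.firstChild.data  # text node inside <country>
print(city, country)
```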

## Let's look at the XML content of the weather data page:
https://codebeautify.org/xmlviewer
Use "Load URL" \
http://api.openweathermap.org/data/2.5/forecast?q=Taipei,%20TW&mode=xml&appid=d1deefb25fb63cf70eea21a43dad94f7

![XML介紹](http://www.ukoln.ac.uk/metadata/dcmi/dc-elem-prop/image/Slide1.png)

# The SAX (Simple API for XML) method

## python weather_xml_sax.py Taipei,TW 

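## Simulate how we would prepare the report we want while reading through the XML content
1. First find the city and the country.
2. Then extract the time, the weather symbol, and the temperature for each of the 40 time slots.

**Let's first do a trial walk through the XML file.** As we read the content line by line, we look specifically for the relevant tags and pull the data we need from the attributes of those tags.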

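## Summary: when parsing an XML document, the most important actions are:
1. Each time a new tag appears, check whether it is one of the tags we read data from.
2. Each time a field marked as data appears, check whether it holds data we want.
3. When a tag ends, the whole section belonging to a tag we did not use can be ignored.
4. Throughout this process, keep a tag buffer that reminds us where in the tree we currently are: it remembers the path from the top-level node's tag down to the current tag, and tags along the way that are no longer in use do not appear in the tag buffer.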
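The tag buffer in point 4 behaves like a stack: push on every start tag, pop on every end tag. The sketch below, with a made-up `PathTracker` class and a shortened sample document, records the buffer's contents at each start tag so the "where am I in the tree" idea can be inspected directly.

```python
import xml.sax

class PathTracker(xml.sax.ContentHandler):
    """Keep a stack of open tags and record the path at every start tag."""

    def __init__(self):
        super().__init__()
        self._tag_buffer = []  # open tags, root first
        self.paths = []        # snapshot of the buffer at each start tag

    def startElement(self, tag, attributes):
        self._tag_buffer.append(tag)
        self.paths.append('/'.join(self._tag_buffer))

    def endElement(self, tag):
        self._tag_buffer.pop()  # a closed tag leaves the buffer

# Hypothetical, shortened version of the weather document
sample = (b'<weatherdata><location><name>Taipei</name><country>TW</country>'
          b'</location><forecast><time/></forecast></weatherdata>')

tracker = PathTracker()
xml.sax.parseString(sample, tracker)
for path in tracker.paths:
    print(path)
```

Because `location` has already been popped by the time `forecast` starts, the path for `time` contains only its own ancestors.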

```XML
<weatherdata>
<location>
<name>Taipei</name>
<type/>
<country>TW</country>
<timezone>28800</timezone>
<location altitude="0" latitude="25.0478" longitude="121.5319" geobase="geonames" geobaseid="1668341"/>
</location>
```

```XML
<time from="2020-12-06T06:00:00" to="2020-12-06T09:00:00">
<symbol number="500" name="light rain" var="10d"/>
<precipitation probability="0.36" unit="3h" value="0.11" type="rain"/>
<windDirection deg="68" code="ENE" name="East-northeast"/>
<windSpeed mps="5.62" unit="m/s" name="Moderate breeze"/>
<temperature unit="kelvin" value="292.54" min="292.25" max="292.54"/>
<feels_like value="291.13" unit="kelvin"/>
<pressure unit="hPa" value="1019"/>
<humidity value="88" unit="%"/>
<clouds value="broken clouds" all="75" unit="%"/>
<visibility value="10000"/>
</time>
```

In [1]:
from datetime import datetime
import urllib.request  # urlopen lives in the urllib.request submodule
import xml.sax # SAX module
from string import Template
import sys

In [3]:
help(xml.sax.ContentHandler)

Help on class ContentHandler in module xml.sax.handler:

class ContentHandler(builtins.object)
 |  Interface for receiving logical document content events.
 |  
 |  This is the main callback interface in SAX, and the one most
 |  important to applications. The order of events in this interface
 |  mirrors the order of the information in the document.
 |  
 |  Methods defined here:
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  characters(self, content)
 |      Receive notification of character data.
 |      
 |      The Parser will call this method to report each chunk of
 |      character data. SAX parsers may return all contiguous
 |      character data in a single chunk, or they may split it into
 |      several chunks; however, all of the characters in any single
 |      event must come from the same external entity so that the
 |      Locator provides useful information.
 |  
 |  endDocument(self)
 |      Receive notificatio

In [4]:
'''
Subclass of xml.sax.ContentHandler
'''
class WeatherHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self._tag_buffer = list() # the chain of open tags from the root down to the current one

        self._location_name = '?'
        self._location_country = '?'

        self._time_string= ''
        self._overview_string = ''
        self._temperature_string = ''

    ###  Called when an XML start tag is encountered; receives the current tag and its attributes
    def startElement(self, tag, attributes):
        self._tag_buffer.append(tag)
        if self._tag_buffer[-2:] == ['weatherdata', 'location']:
            self._location_name = '?'
            self._location_country = '?'

        elif self._tag_buffer[-2:] == ['forecast', 'time']:
            time = datetime.strptime(attributes['from'], '%Y-%m-%dT%H:%M:%S')
            self._time_string = datetime.strftime(time, '%H:%M %d %B')

            self._overview_string = '?'
            self._temperature_string = '?'

        elif self._tag_buffer[-2:] == ['time', 'symbol']:
            if 'name' in attributes:
                self._overview_string = attributes['name']

        elif self._tag_buffer[-2:] == ['time', 'temperature']:
            if 'value' in attributes:
                self._temperature_string = '%.1f' % (float(attributes['value']) - 273.15)

    ###  Called when an XML end tag is encountered
    def endElement(self, tag):
        if self._tag_buffer[-2:] == ['weatherdata', 'location']:
            print('5 day forecast for %s, %s.\n' % (self._location_name, self._location_country))

        elif self._tag_buffer[-2:] == ['forecast', 'time']:
            print(u'%s: %s, %s°C' % (
                    self._time_string,
                    self._overview_string,
                    self._temperature_string))

        self._tag_buffer.pop()

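    ###  Called for the text content inside an element.
    ###  The parser may deliver one piece of text in several chunks, so append instead of overwriting.
    def characters(self, content):
        if self._tag_buffer[-2:] == ['location', 'name']:
            if self._location_name == '?':  # still the placeholder: start from empty
                self._location_name = ''
            self._location_name += content

        elif self._tag_buffer[-2:] == ['location', 'country']:
            if self._location_country == '?':  # still the placeholder: start from empty
                self._location_country = ''
            self._location_country += content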
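The two conversions inside `startElement` can be tried in isolation. This is a small sketch, with a hypothetical helper name `format_slot`, that applies the same `strptime`/`strftime` formats and the same Kelvin-to-Celsius offset to attribute strings copied from the sample `<time>` element shown earlier.

```python
from datetime import datetime

def format_slot(from_attr, kelvin_attr):
    """Format one forecast slot from raw attribute strings (hypothetical helper)."""
    # Same input and output formats as the handler uses
    start = datetime.strptime(from_attr, '%Y-%m-%dT%H:%M:%S')
    # Attributes arrive as strings, so convert before subtracting
    celsius = float(kelvin_attr) - 273.15
    return '%s: %.1f°C' % (start.strftime('%H:%M %d %B'), celsius)

result = format_slot('2020-12-06T06:00:00', '292.54')
print(result)
```

The month name produced by `%B` depends on the active locale, so the exact text can differ between machines.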

In [5]:
URL_TEMPLATE = Template('http://api.openweathermap.org/data/2.5/forecast?q=${location}&mode=xml&appid=d1deefb25fb63cf70eea21a43dad94f7')

# search_location = sys.argv[1]
search_location = "Taipei,TW"

api_url = URL_TEMPLATE.substitute(location=search_location)
api_url

'http://api.openweathermap.org/data/2.5/forecast?q=Taipei,TW&mode=xml&appid=d1deefb25fb63cf70eea21a43dad94f7'
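String substitution works for a simple value like `Taipei,TW`, but a location containing spaces or other special characters needs escaping. One possible alternative, sketched here under the assumption that the API accepts standard percent-encoded query strings, is to build the query with `urllib.parse.urlencode`. The function name `build_forecast_url` and the `YOUR_API_KEY` placeholder are illustrative only.

```python
from urllib.parse import urlencode

BASE_URL = 'http://api.openweathermap.org/data/2.5/forecast'

def build_forecast_url(location, api_key):
    """Build the forecast URL with every query value percent-encoded."""
    query = urlencode({'q': location, 'mode': 'xml', 'appid': api_key})
    return BASE_URL + '?' + query

# 'YOUR_API_KEY' is a placeholder, not a working key
example_url = build_forecast_url('Taipei, TW', 'YOUR_API_KEY')
print(example_url)
```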

In [6]:
response = urllib.request.urlopen(api_url)
xml_response = response.read()

In [7]:
print(xml_response)

b'<?xml version="1.0" encoding="UTF-8"?>\n<weatherdata><location><name>Taipei</name><type></type><country>TW</country><timezone>28800</timezone><location altitude="0" latitude="25.0478" longitude="121.5319" geobase="geonames" geobaseid="1668341"></location></location><credit></credit><meta><lastupdate></lastupdate><calctime>0</calctime><nextupdate></nextupdate></meta><sun rise="2020-12-08T22:27:42" set="2020-12-09T09:05:02"></sun><forecast><time from="2020-12-09T06:00:00" to="2020-12-09T09:00:00"><symbol number="500" name="light rain" var="10d"></symbol><precipitation probability="0.74" unit="3h" value="0.57" type="rain"></precipitation><windDirection deg="71" code="ENE" name="East-northeast"></windDirection><windSpeed mps="4.47" unit="m/s" name="Gentle Breeze"></windSpeed><temperature unit="kelvin" value="293.14" min="292.73" max="293.14"></temperature><feels_like value="292.93" unit="kelvin"></feels_like><pressure unit="hPa" value="1016"></pressure><humidity value="90" unit="%"></hum

Only part of the response is listed below, showing just time[0]:
```XML
<weatherdata>
    <location>
        <name>Taipei</name>
        <type/>
        <country>TW</country>
        <timezone>28800</timezone>
        <location altitude="0" latitude="25.0375" longitude="121.5637" geobase="geonames" geobaseid="1668341"/>
    </location>
    <credit/>
    <meta>
        <lastupdate/>
        <calctime>0</calctime>
        <nextupdate/>
    </meta>
<sun rise="2019-11-30T22:21:33" set="2019-12-01T09:03:52"/>
<forecast>
    <time from="2019-11-30T15:00:00" to="2019-11-30T18:00:00">
        <symbol number="800" name="clear sky" var="01n"/>
        <precipitation/>
        <windDirection deg="138" code="SE" name="SouthEast"/>
        <windSpeed mps="1.12" unit="m/s" name="Calm"/>
        <temperature unit="kelvin" value="291.92" min="291.15" max="291.92"/>
        <pressure unit="hPa" value="1018"/>
        <humidity value="84" unit="%"/>
        <clouds value="clear sky" all="1" unit="%"/>
    </time>
    </forecast>
</weatherdata>
```

In [8]:
'''
A handler subclass like this has to be written for each specific scraping need
'''
content_handler = WeatherHandler()  

xml.sax.parseString(xml_response, content_handler)

5 day forecast for Taipei, TW.

06:00 09 December: light rain, 20.0°C
09:00 09 December: light rain, 19.9°C
12:00 09 December: light rain, 20.1°C
15:00 09 December: light rain, 20.2°C
18:00 09 December: light rain, 20.2°C
21:00 09 December: light rain, 20.4°C
00:00 10 December: light rain, 20.6°C
03:00 10 December: light rain, 20.5°C
06:00 10 December: light rain, 20.2°C
09:00 10 December: light rain, 20.0°C
12:00 10 December: light rain, 19.8°C
15:00 10 December: light rain, 19.8°C
18:00 10 December: light rain, 19.8°C
21:00 10 December: light rain, 20.0°C
00:00 11 December: overcast clouds, 20.8°C
03:00 11 December: overcast clouds, 21.1°C
06:00 11 December: overcast clouds, 20.5°C
09:00 11 December: light rain, 20.5°C
12:00 11 December: light rain, 20.4°C
15:00 11 December: light rain, 20.1°C
18:00 11 December: light rain, 19.8°C
21:00 11 December: light rain, 19.7°C
00:00 12 December: light rain, 20.2°C
03:00 12 December: light rain, 19.2°C
06:00 12 December: light rain, 18.4°C
09:
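A response fetched over the network can arrive truncated or malformed, in which case `xml.sax.parseString` raises `xml.sax.SAXParseException`. The following sketch, using a made-up helper `try_parse` and made-up inputs, shows one way to catch it and report where parsing stopped.

```python
import xml.sax

def try_parse(data):
    """Parse XML bytes with a do-nothing handler and report the outcome."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return 'ok'
    except xml.sax.SAXParseException as exc:
        # The exception carries the position where the parser gave up
        return 'malformed at line %d: %s' % (exc.getLineNumber(), exc.getMessage())

print(try_parse(b'<a><b></b></a>'))  # well-formed
print(try_parse(b'<a><b></a>'))      # <b> is never closed
```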

### Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)
https://www.youtube.com/watch?v=K8L6KVGG-7o

In [9]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/K8L6KVGG-7o" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Below I use a regular expression to test matching the content between an opening < and a closing >

In [12]:
import re
temperatures = re.findall(r'<temperature .*?>', xml_response.decode('utf-8'))  # decode the bytes instead of taking their repr

In [13]:
len(temperatures)

40
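Matching the whole tag is a first step; a capture group can pull out just the attribute we care about. The sketch below runs on a two-tag sample string rather than the live response, and the pattern is one possible way to write it, assuming the `value` attribute is always double-quoted as in the response shown above.

```python
import re

# Two tags written in the same shape as the API response
sample = ('<temperature unit="kelvin" value="293.14" min="292.73" max="293.14">'
          '<temperature unit="kelvin" value="293.09" min="292.97" max="293.09">')

# [^>]*? stays inside one tag; the parentheses capture only the number
values = re.findall(r'<temperature [^>]*?\bvalue="([^"]+)"', sample)

# Same Kelvin-to-Celsius conversion as in the SAX handler
celsius = ['%.1f' % (float(v) - 273.15) for v in values]
print(values)
print(celsius)
```

The `\b` before `value` keeps the pattern from matching inside a longer attribute name that merely ends in "value".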

In [14]:
for tem in temperatures:
    print(tem)

<temperature unit="kelvin" value="293.14" min="292.73" max="293.14">
<temperature unit="kelvin" value="293.09" min="292.97" max="293.09">
<temperature unit="kelvin" value="293.25" min="293.23" max="293.25">
<temperature unit="kelvin" value="293.4" min="293.4" max="293.4">
<temperature unit="kelvin" value="293.38" min="293.38" max="293.38">
<temperature unit="kelvin" value="293.5" min="293.5" max="293.5">
<temperature unit="kelvin" value="293.74" min="293.74" max="293.74">
<temperature unit="kelvin" value="293.63" min="293.63" max="293.63">
<temperature unit="kelvin" value="293.38" min="293.38" max="293.38">
<temperature unit="kelvin" value="293.11" min="293.11" max="293.11">
<temperature unit="kelvin" value="292.99" min="292.99" max="292.99">
<temperature unit="kelvin" value="292.94" min="292.94" max="292.94">
<temperature unit="kelvin" value="292.92" min="292.92" max="292.92">
<temperature unit="kelvin" value="293.17" min="293.17" max="293.17">
<temperature unit="kelvin" value="293.95

In [15]:
import re
times = re.findall(r'<time .*?>', xml_response.decode('utf-8'))
for t in times:
    print(t)

<time from="2020-12-09T06:00:00" to="2020-12-09T09:00:00">
<time from="2020-12-09T09:00:00" to="2020-12-09T12:00:00">
<time from="2020-12-09T12:00:00" to="2020-12-09T15:00:00">
<time from="2020-12-09T15:00:00" to="2020-12-09T18:00:00">
<time from="2020-12-09T18:00:00" to="2020-12-09T21:00:00">
<time from="2020-12-09T21:00:00" to="2020-12-10T00:00:00">
<time from="2020-12-10T00:00:00" to="2020-12-10T03:00:00">
<time from="2020-12-10T03:00:00" to="2020-12-10T06:00:00">
<time from="2020-12-10T06:00:00" to="2020-12-10T09:00:00">
<time from="2020-12-10T09:00:00" to="2020-12-10T12:00:00">
<time from="2020-12-10T12:00:00" to="2020-12-10T15:00:00">
<time from="2020-12-10T15:00:00" to="2020-12-10T18:00:00">
<time from="2020-12-10T18:00:00" to="2020-12-10T21:00:00">
<time from="2020-12-10T21:00:00" to="2020-12-11T00:00:00">
<time from="2020-12-11T00:00:00" to="2020-12-11T03:00:00">
<time from="2020-12-11T03:00:00" to="2020-12-11T06:00:00">
<time from="2020-12-11T06:00:00" to="2020-12-11T09:00:00

In [16]:
name = re.findall(r'<name>(.*?)</name>', xml_response.decode('utf-8'))

In [17]:
name

['Taipei']

In [18]:
print(name[0])

Taipei


In [19]:
country = re.findall(r'<country>(.*?)</country>', xml_response.decode('utf-8'))
print(country[0])

TW
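Indexing with `[0]` raises `IndexError` when the tag is absent. A small sketch of a safer lookup follows, using `re.search` and a fallback value; the helper name `first_text` and the `'?'` default (borrowed from the placeholder the SAX handler uses) are assumptions made for illustration.

```python
import re

def first_text(tag, xml_text, default='?'):
    """Return the text of the first <tag>...</tag> pair, or a default if absent."""
    pattern = r'<%s>(.*?)</%s>' % (re.escape(tag), re.escape(tag))
    match = re.search(pattern, xml_text, re.DOTALL)
    return match.group(1) if match else default

# Hypothetical fragment shaped like the <location> block
sample = '<location><name>Taipei</name><country>TW</country></location>'

print(first_text('name', sample))
print(first_text('country', sample))
print(first_text('timezone', sample))  # not present in the sample
```

This only handles tags without attributes, which is enough for `<name>` and `<country>` in this response.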


### A complete, detailed guide to Python regular expressions (Python正則表達式最全詳解！你學會了嗎？)

Original article: https://kknews.cc/code/n6ayvyg.html
https://kknews.cc/zh-tw/code/n6ayvyg.html