## Material:
- XML Path Language (XPath) 3.1 https://www.w3.org/TR/2017/REC-xpath-31-20170321/
- python package `lxml`: https://lxml.de/
- Python XML Tutorial with ElementTree: Beginner's Guide. DataCamp. https://www.datacamp.com/tutorial/python-xml-elementtree
- XPath Tutorial: https://www.w3school.com.cn/xpath/index.asp


## Introduction
XPath (XML Path Language) is a query language for locating and extracting information from XML and HTML documents.
## XPath
XPath locates elements by their position in the document tree (for HTML, the DOM structure).
### Useful Expressions
| Expression | Meaning |
|--|--|
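| `nodename` | Selects all child elements of the current node with this name |
| `/` | Selects a direct child of the current node; at the start of an expression, selects from the root node |
| `//` | Selects descendants of the current node at any depth; at the start of an expression, searches the whole document |
| `.` | Selects the current node |
| `..` | Selects the parent of the current node |
| `@` | Selects an attribute |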

For example, `//tagname[@attribute='value']` selects every `tagname` element whose `attribute` equals `value`.

In [2]:
pip install lxml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
from lxml import etree

In [4]:
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
# etree.HTML() repairs the HTML text: here it adds the missing </li> to the last <li>
# and wraps the fragment in <html> and <body> nodes to complete the structure.
html = etree.HTML(text)
# serialize the repaired tree with etree.tostring(), which returns bytes
result = etree.tostring(html)
# use .decode() to convert the bytes to a string
print(result.decode('utf-8'))

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>
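As a small variation on the cell above, `etree.tostring()` can return a `str` directly when called with `encoding='unicode'`, which avoids the separate `.decode()` step. The sketch below uses a shortened, made-up fragment (`fragment` and `repaired` are illustrative names, not part of the original notebook).

```python
from lxml import etree

# Shortened, hypothetical fragment with an unclosed <li>, similar to the cell above.
fragment = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

root = etree.HTML(fragment)

# encoding='unicode' makes tostring() return str instead of bytes;
# method='html' serializes with HTML rules rather than XML rules.
repaired = etree.tostring(root, encoding='unicode', method='html')
print(repaired)
```

The printed markup should contain the closing `</li>` and the added `<html>` and `<body>` wrappers, just like the decoded bytes output above.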


### Read an HTML file directly with `.parse()`
`
HTMLParser(self, encoding=None, remove_blank_text=False, remove_comments=False, remove_pis=False, strip_cdata=True, no_network=True, target=None, schema: XMLSchema =None, recover=True, compact=True, collect_ids=True, huge_tree=False)
`
   - By default, `HTMLParser` recovers from broken HTML; to turn this off, set `recover=False`.
   - `no_network`: prevent network access for related files (default: True)
   - `remove_blank_text`: discard empty text nodes that are ignorable
   - `default_doctype`: add a default doctype even if it is not found in the HTML (default: True)

For more information, see https://lxml.de/api/lxml.etree.HTMLParser-class.html
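To illustrate how parser options change the resulting tree, the sketch below passes a configured `HTMLParser` to `etree.HTML()`. It uses `remove_comments`, one of the keyword arguments in the signature above, on a made-up snippet (`snippet`, `kept`, and `removed` are illustrative names).

```python
from lxml import etree

# Hypothetical fragment containing one comment node.
snippet = '<div><!-- internal note --><p>Hello</p></div>'

# The default parser keeps comments in the tree.
default_root = etree.HTML(snippet)
# A parser configured with remove_comments=True drops them while parsing.
clean_root = etree.HTML(snippet, etree.HTMLParser(remove_comments=True))

# comment() is the XPath node test that matches comment nodes.
kept = default_root.xpath('//comment()')
removed = clean_root.xpath('//comment()')
print(len(kept), len(removed))  # 1 0
```

The same parser object can also be passed as the second argument to `etree.parse()`, as the next cell does with a default `HTMLParser()`.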



In [5]:
html = etree.parse('/content/index.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>
    <meta name="description" content="Omnifood is an AI-powered food subscription that will make you eat healthy again, 365 days per year.It's tailored to your personal tastes and nutritional needs."/>

    <!-- Always include this line of code!!! -->
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    
    <link rel="icon" href="img/favicon.png"/>
    <link rel="apple-touch-icon" href="img/apple-touch-icon.png"/>
    <link rel="manifest" href="manifest.webmanifest"/>
    <link rel="preconnect" href="https://fonts.gstatic.com"/>
    <link href="https://fonts.googleapis.com/css2?family=Rubik:wght@400;500;600;700&amp;display=swap" rel="stylesheet"/>

    <link rel="stylesheet" href="css/general.css"/>
    <link rel="stylesheet" href="css/style.css"/>
    <link rel="stylesheet" href="css/queries.css"/>

    <script type="module" src="

## Get All Nodes
Use `.xpath('//*')` to select every element node in the document.

`*` matches elements with any tag name. The result is returned as a list whose items print as `<Element tagname at address>`.

In [6]:
html = etree.parse('/content/index.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

[<Element html at 0x7efc88867500>, <Element head at 0x7efc88869640>, <Element meta at 0x7efc88867910>, <Element meta at 0x7efc888678c0>, <Element meta at 0x7efc88867550>, <Element meta at 0x7efc88865fa0>, <Element link at 0x7efc88865f50>, <Element link at 0x7efc88865f00>, <Element link at 0x7efc88865eb0>, <Element link at 0x7efc88865e60>, <Element link at 0x7efc88865e10>, <Element link at 0x7efc88865a00>, <Element link at 0x7efc8810ec30>, <Element link at 0x7efc8810eaf0>, <Element script at 0x7efc8810ee60>, <Element script at 0x7efc8810ee10>, <Element script at 0x7efc8810eeb0>, <Element script at 0x7efc8810ef00>, <Element title at 0x7efc8810ef50>, <Element body at 0x7efc8810efa0>, <Element header at 0x7efc88120910>, <Element a at 0x7efc88120050>, <Element img at 0x7efc881200a0>, <Element nav at 0x7efc881200f0>, <Element ul at 0x7efc88120140>, <Element li at 0x7efc88120190>, <Element a at 0x7efc881201e0>, <Element li at 0x7efc88120230>, <Element a at 0x7efc88120280>, <Element li at 0x7e
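The printed `<Element ...>` objects are hard to read at a glance. One possible way to inspect the result is to read each element's `.tag` attribute; the sketch below does this on a small made-up snippet rather than on `index.html`.

```python
from lxml import etree

# Hypothetical fragment; etree.HTML() adds the <html> and <body> wrappers.
snippet = '<div><ul><li><a href="link1.html">first item</a></li></ul></div>'
root = etree.HTML(snippet)

# '//*' returns every element in document order; .tag is the element name.
tags = [element.tag for element in root.xpath('//*')]
print(tags)  # ['html', 'body', 'div', 'ul', 'li', 'a']
```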

In [7]:
# select only the <li> nodes
result = html.xpath('//li')
print(result)
print(result[0]) # index into the list to get a single element

[<Element li at 0x7efc88120190>, <Element li at 0x7efc88120230>, <Element li at 0x7efc881202d0>, <Element li at 0x7efc88120370>, <Element li at 0x7efc88120410>, <Element li at 0x7efc87e966e0>, <Element li at 0x7efc87e96820>, <Element li at 0x7efc87e96960>, <Element li at 0x7efc87e96d20>, <Element li at 0x7efc87e96e60>, <Element li at 0x7efc87e96fa0>, <Element li at 0x7efc87e98230>, <Element li at 0x7efc87e98320>, <Element li at 0x7efc87e98410>, <Element li at 0x7efc87e98500>, <Element li at 0x7efc87e985f0>, <Element li at 0x7efc87e986e0>, <Element li at 0x7efc87e987d0>, <Element li at 0x7efc87e988c0>, <Element li at 0x7efc87e989b0>, <Element li at 0x7efc87e99dc0>, <Element li at 0x7efc87e99eb0>, <Element li at 0x7efc87e99fa0>, <Element li at 0x7efc87e9a0f0>, <Element li at 0x7efc87e9a460>, <Element li at 0x7efc87e9a5a0>, <Element li at 0x7efc87e9a6e0>, <Element li at 0x7efc87e9a7d0>, <Element li at 0x7efc87e9b910>, <Element li at 0x7efc87e9ba00>, <Element li at 0x7efc87e9baf0>, <Elemen

## Child Node
- Use `/` to get direct child nodes.
- Use `//` to get descendant nodes at any depth.

In [8]:
# get every <a> node that is a direct child of an <li> node in this HTML file
result = html.xpath('//li/a')
print(result)

[<Element a at 0x7efc87e99910>, <Element a at 0x7efc87e99870>, <Element a at 0x7efc87e99050>, <Element a at 0x7efc87e993c0>, <Element a at 0x7efc87e992d0>, <Element a at 0x7efc87e99230>, <Element a at 0x7efc87e990f0>, <Element a at 0x7efc87e991e0>, <Element a at 0x7efc87e990a0>, <Element a at 0x7efc87e980a0>, <Element a at 0x7efc87e98050>, <Element a at 0x7efc87e98140>, <Element a at 0x7efc87e980f0>, <Element a at 0x7efc87e98370>, <Element a at 0x7efc87e982d0>, <Element a at 0x7efc87e98280>, <Element a at 0x7efc87e981e0>, <Element a at 0x7efc87e98190>, <Element a at 0x7efc87e960f0>]


In [9]:
# get every <a> node that is a descendant (at any depth) of a <ul> node in this HTML file
result = html.xpath('//ul//a')
print(result)

[<Element a at 0x7efc87e99910>, <Element a at 0x7efc87e99870>, <Element a at 0x7efc87e99050>, <Element a at 0x7efc87e993c0>, <Element a at 0x7efc87e992d0>, <Element a at 0x7efc87e99230>, <Element a at 0x7efc87e990f0>, <Element a at 0x7efc87e991e0>, <Element a at 0x7efc87e990a0>, <Element a at 0x7efc87e980a0>, <Element a at 0x7efc87e98050>, <Element a at 0x7efc87e98140>, <Element a at 0x7efc87e980f0>, <Element a at 0x7efc87e98370>, <Element a at 0x7efc87e982d0>, <Element a at 0x7efc87e98280>, <Element a at 0x7efc87e981e0>, <Element a at 0x7efc87e98190>, <Element a at 0x7efc87e960f0>]
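On this page the two queries happen to return the same elements, so the difference between `/` and `//` is not visible. The sketch below uses a made-up snippet in which one link is wrapped in an extra `<span>`, so the direct-child and descendant queries diverge.

```python
from lxml import etree

# Hypothetical list: the second <a> is nested one level deeper, inside a <span>.
snippet = (
    '<ul>'
    '<li><a href="a.html">A</a></li>'
    '<li><span><a href="b.html">B</a></span></li>'
    '</ul>'
)
root = etree.HTML(snippet)

# '/' only follows direct children, so the nested link is skipped.
direct = root.xpath('//li/a/@href')
# '//' follows descendants at any depth, so both links are found.
nested = root.xpath('//li//a/@href')

print(direct)  # ['a.html']
print(nested)  # ['a.html', 'b.html']
```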


## Parent Nodes
- Method 1: use `..` to step from a known child node up to its parent, for example `//childNode[@attribute="value"]/../@parentAttribute`
- Method 2: use the `parent::` axis

In [10]:
# select the <a> node whose href is "#how", then read the class attribute of its parent
result = html.xpath('//a[@href="#how"]/../@class')  
print(result)

['hero-text-box']


In [11]:
result = html.xpath('//a[@href="#how"]/parent::*/@class')  
print(result)

['hero-text-box']
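Once an element object is in hand, the parent can also be reached from Python instead of XPath. The sketch below uses lxml's `getparent()` and `get()` element methods on a made-up snippet that mimics the structure queried above.

```python
from lxml import etree

# Hypothetical fragment mirroring the hero section queried above.
snippet = '<div class="hero-text-box"><a href="#how">Learn more</a></div>'
root = etree.HTML(snippet)

# Select the child with XPath, then walk up one level in Python.
link = root.xpath('//a[@href="#how"]')[0]
parent = link.getparent()

# .get() reads an attribute value from the element.
parent_class = parent.get('class')
print(parent.tag, parent_class)  # div hero-text-box
```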


## Attribute Matching
- Use a predicate of the form `[@attributeName="value"]` to keep only the nodes whose attribute has the given value.

In [14]:
result = html.xpath('//p[@class="feature-title"]')  
print(result)

[<Element p at 0x7efc8e5bb140>, <Element p at 0x7efc8883e5f0>, <Element p at 0x7efc87e9c0a0>, <Element p at 0x7efc87e9cf50>]
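When the attribute value comes from a Python variable, lxml's `xpath()` accepts XPath variables as keyword arguments, which avoids building the expression with string formatting. The sketch below assumes a made-up snippet and a hypothetical variable name `cls`.

```python
from lxml import etree

# Hypothetical fragment with two differently classed paragraphs.
snippet = (
    '<div>'
    '<p class="feature-title">No waste</p>'
    '<p class="feature-text">All packaging is reusable.</p>'
    '</div>'
)
root = etree.HTML(snippet)

# $cls is an XPath variable; its value is supplied as a keyword argument.
wanted = 'feature-title'
titles = root.xpath('//p[@class=$cls]/text()', cls=wanted)
print(titles)  # ['No waste']
```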


## Text Acquisition
- Use `/text()` to extract the text located **directly** under the selected node.
- If the predicate selects an outer node and the text sits inside a nested node, there are two options:
   - Method 1: `//outerNode[@...]/innerNode/text()`
      - Returns only the text directly inside `innerNode`.
   - Method 2: `//outerNode[@...]//text()`
      - Returns the text of every descendant, including whitespace-only strings such as newlines and indentation.

In [15]:
result = html.xpath('//p[@class="feature-title"]/text()')  
print(result)

['Never cook again!', 'Local and organic', 'No waste', 'Pause anytime']
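To make the difference between the two methods concrete, the sketch below runs `/text()` and `//text()` on a made-up paragraph that contains a nested `<b>` node. It also shows the XPath `string()` function, which is not covered above but is one way to get the concatenated text as a single string.

```python
from lxml import etree

# Hypothetical paragraph with text both directly inside <p> and inside a nested <b>.
snippet = '<p class="intro">Hello <b>world</b>!</p>'
root = etree.HTML(snippet)

# /text() returns only text nodes that are direct children of <p>.
direct_text = root.xpath('//p[@class="intro"]/text()')
# //text() returns text nodes of <p> and of all its descendants.
all_text = root.xpath('//p[@class="intro"]//text()')
# string() concatenates all descendant text into one string.
joined = root.xpath('string(//p[@class="intro"])')

print(direct_text)  # ['Hello ', '!']
print(all_text)     # ['Hello ', 'world', '!']
print(joined)       # Hello world!
```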


## Get Attribute
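- Use `/@attributeName` as the last step of the path, without `[ ]`, to return attribute values. This differs from attribute matching, where `[@...]` filters nodes instead of returning the attribute.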

In [16]:
result = html.xpath('//li/a/@href')  
print(result)

['#how', '#meals', '#testimonials', '#pricing', '#cta', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#']
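A list of bare `href` values loses the link text they belong to. One possible approach, sketched below on a made-up navigation list, is to select the `<a>` elements themselves and read both `.text` and `.get('href')` from each one.

```python
from lxml import etree

# Hypothetical navigation list modelled on the page's main menu.
snippet = (
    '<ul>'
    '<li><a href="#how">How it works</a></li>'
    '<li><a href="#meals">Meals</a></li>'
    '</ul>'
)
root = etree.HTML(snippet)

# Select elements rather than attributes, then pair each link text with its href.
links = {a.text: a.get('href') for a in root.xpath('//li/a')}
print(links)  # {'How it works': '#how', 'Meals': '#meals'}
```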


## Match Attribute with Multiple Values


In [20]:
text = '''  
<a href="#cta" class="btn btn--full margin-right-sm">Start eating well</a>
'''  
html = etree.HTML(text)  
result = html.xpath('//a[@class="btn"]/text()')  
print(result)

[]


- In this case, the `class` attribute holds three values (`btn`, `btn--full` and `margin-right-sm`), so the exact attribute matching shown earlier compares against the whole string and returns nothing.
- Instead, we use the `contains()` function:
```
//nodeName[contains(@attributeName,'value')]/text()
```

In [21]:
result = html.xpath('//a[contains(@class, "btn")]/text()')  
print(result)

['Start eating well']
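Note that `contains()` is a plain substring test, so `"btn"` also matches a class such as `btn--outline`. The sketch below contrasts it with a commonly used whole-token pattern built from `concat()` and `normalize-space()`; the snippet and variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical links: only the first one actually carries the "btn" class token.
snippet = (
    '<div>'
    '<a class="btn btn--full">Start eating well</a>'
    '<a class="btn--outline">Learn more</a>'
    '</div>'
)
root = etree.HTML(snippet)

# Substring match: "btn" is found inside "btn--outline" as well.
by_substring = root.xpath('//a[contains(@class, "btn")]/text()')

# Token match: pad the class list with spaces and look for " btn " as a whole word.
by_token = root.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " btn ")]/text()'
)

print(by_substring)  # ['Start eating well', 'Learn more']
print(by_token)      # ['Start eating well']
```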


## Multi-Attribute Matching
- Situation: a node must satisfy conditions on several attributes at once; join the conditions with `and`.

In [23]:
text = '''  
<a href="#cta" class="btn btn--full margin-right-sm" name="cta">Start eating well</a>
'''  
html = etree.HTML(text)  
result = html.xpath('//a[contains(@class, "btn") and @name="cta"]/text()')  
print(result)

['Start eating well']
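Besides `and`, XPath predicates also accept `or` and the `not()` function. The sketch below shows both on a made-up set of links; it is only meant to illustrate how conditions combine.

```python
from lxml import etree

# Hypothetical links with different class/name combinations.
snippet = (
    '<div>'
    '<a class="btn" name="cta">Start eating well</a>'
    '<a class="btn" name="how">Learn more</a>'
    '<a name="cta">Plain link</a>'
    '</div>'
)
root = etree.HTML(snippet)

# 'or' keeps nodes that satisfy at least one condition.
either = root.xpath('//a[@name="cta" or @name="how"]/text()')
# not() inverts a condition: buttons whose name is not "cta".
excluded = root.xpath('//a[@class="btn" and not(@name="cta")]/text()')

print(either)    # ['Start eating well', 'Learn more', 'Plain link']
print(excluded)  # ['Learn more']
```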


## Select by Order
- Sometimes several nodes match under one parent, but we only want some of them. Select them by position inside a predicate. Positions start from **1**, unlike Python indexing, which starts from 0.

In [27]:
text='''
<nav class="main-nav">
        <ul class="main-nav-list">
          <li><a class="main-nav-link" href="#how">How it works</a></li>
          <li><a class="main-nav-link" href="#meals">Meals</a></li>
          <li><a class="main-nav-link" href="#testimonials">Testimonials</a></li>
          <li><a class="main-nav-link" href="#pricing">Pricing</a></li>
          <li><a class="main-nav-link nav-cta" href="#cta">Try for free</a></li>
        </ul>
      </nav>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

['How it works']
['Try for free']
['How it works', 'Meals']
['Testimonials']
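A position predicate is evaluated relative to each parent, which matters as soon as a page contains more than one list. The sketch below uses a made-up snippet with two `<ul>` nodes to contrast `//li[1]` with `(//li)[1]`.

```python
from lxml import etree

# Hypothetical document with two separate lists.
snippet = (
    '<div>'
    '<ul><li>a</li><li>b</li></ul>'
    '<ul><li>c</li><li>d</li></ul>'
    '</div>'
)
root = etree.HTML(snippet)

# //li[1] means "every <li> that is the first <li> child of its parent".
first_per_list = root.xpath('//li[1]/text()')
# (//li)[1] collects all <li> nodes first, then takes the first one overall.
first_overall = root.xpath('(//li)[1]/text()')

print(first_per_list)  # ['a', 'c']
print(first_overall)   # ['a']
```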


## Select by Axis
- An axis selects nodes by their relationship to the current node: child, descendant, parent, ancestor, sibling, or attribute.

In [28]:
text='''
<nav class="main-nav">
        <ul class="main-nav-list">
          <li><a class="main-nav-link" href="#how"><span>How it works<span></a></li>
          <li><a class="main-nav-link" href="#meals">Meals</a></li>
          <li><a class="main-nav-link" href="#testimonials">Testimonials</a></li>
          <li><a class="main-nav-link" href="#pricing">Pricing</a></li>
          <li><a class="main-nav-link nav-cta" href="#cta">Try for free</a></li>
        </ul>
      </nav>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*') # return all ancestor nodes: <html>, <body>, <nav>, <ul>
print(result)
result = html.xpath('//li[1]/ancestor::nav') # return only the <nav> ancestor node
print(result)
result = html.xpath('//li[1]/attribute::*') # return the attribute values of the first <li>; empty here because it has no attributes
print(result)
result = html.xpath('//li[1]/child::a[@href="#how"]') # return the direct <a> child whose href is "#how"
print(result)
result = html.xpath('//li[1]/descendant::span') # get all <span> descendant nodes
print(result)
result = html.xpath('//li[1]/following::*[2]') # following:: selects every node after the current node in document order; [2] picks the second one
print(result)
result = html.xpath('//li[1]/following-sibling::*') # get the sibling nodes on the same level that come after the current node
print(result)
print(result)


[<Element html at 0x7efc8cb74280>, <Element body at 0x7efc8cb74eb0>, <Element nav at 0x7efc87447820>, <Element ul at 0x7efc87447c80>]
[<Element nav at 0x7efc87447820>]
[]
[<Element a at 0x7efc8cb74eb0>]
[<Element span at 0x7efc87447820>, <Element span at 0x7efc87447c80>]
[<Element a at 0x7efc8cb74eb0>]
[<Element li at 0x7efc87447820>, <Element li at 0x7efc87447c80>, <Element li at 0x7efc87447410>, <Element li at 0x7efc874479b0>]
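The cell above only looks forward with `following-sibling::`. As a complement, the sketch below uses the `preceding-sibling::` axis on a made-up copy of the navigation list. On this reverse axis, position `[1]` means the nearest sibling before the current node.

```python
from lxml import etree

# Hypothetical, compact copy of the navigation list used above.
snippet = (
    '<ul>'
    '<li><a href="#how">How it works</a></li>'
    '<li><a href="#meals">Meals</a></li>'
    '<li><a href="#testimonials">Testimonials</a></li>'
    '<li><a href="#pricing">Pricing</a></li>'
    '</ul>'
)
root = etree.HTML(snippet)

# Nearest sibling before and after the third <li>.
previous_item = root.xpath('//li[3]/preceding-sibling::li[1]/a/text()')
next_item = root.xpath('//li[3]/following-sibling::li[1]/a/text()')
# Without a position, all preceding siblings come back in document order.
all_previous = root.xpath('//li[3]/preceding-sibling::li/a/text()')

print(previous_item)  # ['Meals']
print(next_item)      # ['Pricing']
print(all_previous)   # ['How it works', 'Meals']
```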
