# 1. Introduction

lxml is a powerful and flexible library for working with **XML** and **HTML** documents in Python. It supports **XPath for querying** documents, making it an excellent tool for parsing and processing XML/HTML data.

In [None]:
pip install lxml

# 2. Parsing XML
You can parse XML documents in two ways: from a string or from a file.

## 2.1 Parsing XML from a String

In [None]:
from lxml import etree

# Sample XML data
xml_data = """<books>
    <book id="1">
        <title>Python Programming</title>
        <author>John Doe</author>
    </book>
    <book id="2">
        <title>Web Scraping with Python</title>
        <author>Jane Smith</author>
    </book>
</books>"""

# Parse XML from a string
root = etree.fromstring(xml_data)

# Access elements
for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    print(f"Title: {title}, Author: {author}")
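
The loop above assumes every `<book>` has both a `<title>` and an `<author>`. When a child is missing, `find()` returns `None`, and reading `.text` from it raises an `AttributeError`. A minimal sketch of a more forgiving version, using `findtext()` with a default value (the sample data with a missing author and the placeholder strings are made up for illustration):

```python
from lxml import etree

# Hypothetical sample: the second book has no <author> element
xml_data = """<books>
    <book id="1"><title>Python Programming</title><author>John Doe</author></book>
    <book id="2"><title>Web Scraping with Python</title></book>
</books>"""

root = etree.fromstring(xml_data)

rows = []
for book in root.findall("book"):
    # findtext() returns the default instead of failing when the child is absent
    title = book.findtext("title", default="(no title)")
    author = book.findtext("author", default="(unknown)")
    rows.append((book.get("id"), title, author))

for row in rows:
    print(row)
```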

## 2.2 Parsing XML from a File

In [None]:
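# Parse XML from a file (assumes books.xml exists in the working directory)
tree = etree.parse('books.xml')
# parse() returns an ElementTree; getroot() gives its top-level element
root = tree.getroot()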

# Access elements
for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    print(f"Title: {title}, Author: {author}")
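
The cell above only works if a `books.xml` file already exists. To try file parsing without preparing one by hand, a sketch like the following writes a small sample into a temporary directory first and then parses it. The file contents are an assumption made for this example, and they are written as bytes because they carry an encoding declaration.

```python
import os
import tempfile
from lxml import etree

# Hypothetical file contents, used only for this example
xml_data = b"""<?xml version="1.0" encoding="UTF-8"?>
<books>
    <book id="1">
        <title>Python Programming</title>
        <author>John Doe</author>
    </book>
</books>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books.xml")
    with open(path, "wb") as f:
        f.write(xml_data)

    # etree.parse() accepts a file path and returns an ElementTree
    tree = etree.parse(path)
    root = tree.getroot()
    titles = [book.findtext("title") for book in root.findall("book")]

print(root.tag, titles)
```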

# 3. XPath Queries

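XPath is a query language for selecting elements, attributes, and text in XML documents. In lxml, you run a query with an element's `xpath()` method, which typically returns a list of matches.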

See the Scrapy tutorial for an explanation of XPath syntax.
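
As a brief illustration in the meantime, a few common query shapes might look like this with lxml. This is a sketch on the sample data from section 2.1, not a full tour of XPath.

```python
from lxml import etree

xml_data = """<books>
    <book id="1">
        <title>Python Programming</title>
        <author>John Doe</author>
    </book>
    <book id="2">
        <title>Web Scraping with Python</title>
        <author>Jane Smith</author>
    </book>
</books>"""

root = etree.fromstring(xml_data)

# Text of every <title> that sits inside a <book>
titles = root.xpath("//book/title/text()")

# Attribute values: the id of every book
ids = root.xpath("//book/@id")

# Predicate on an attribute: the author of the book with id="2"
author_2 = root.xpath("//book[@id='2']/author/text()")

# XPath functions return plain values rather than lists; count() gives a float
n_books = root.xpath("count(//book)")

print(titles)
print(ids)
print(author_2)
print(n_books)
```

Note that even a query matching a single node returns a list, so index into the result (for example `author_2[0]`) to get the value itself.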

# 4. Manipulating XML Data

You can modify XML elements by accessing them and changing their text or attributes.

## 4.1 Modify Element Text

In [None]:
# Change the title of the first book
# (//book)[1] selects the first <book> in document order
first_book = root.xpath("(//book)[1]")[0]
first_book.find('title').text = "Advanced Python Programming"

# Print the updated title
print(first_book.find('title').text)  # Output: Advanced Python Programming

## 4.2 Add New Elements

In [None]:
# Create a new book element
new_book = etree.Element("book", id="3")
new_title = etree.SubElement(new_book, "title")
new_title.text = "Learning XML with Python"
new_author = etree.SubElement(new_book, "author")
new_author.text = "Chris Brown"

# Add the new book to the root
root.append(new_book)

# Print the new XML structure
print(etree.tostring(root, pretty_print=True).decode())

# 5. Parsing HTML

lxml also supports parsing HTML, including malformed or incomplete HTML documents.

## 5.1 Parsing HTML from a String

In [None]:
from lxml import etree

html_data = """<html>
    <body>
        <h1>Welcome to Python</h1>
        <p>This is a <b>tutorial</b> on lxml.</p>
    </body>
</html>"""

# Parse the HTML
root = etree.HTML(html_data)

# Extract elements using XPath
h1 = root.xpath("//h1/text()")
print(h1[0])  # Output: Welcome to Python
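
To see how the parser copes with incomplete markup, the following sketch feeds it a fragment with no `<html>` or `<body>` wrapper and an unclosed `<p>` tag. The fragment is made up for this example; the exact serialized output can vary between libxml2 versions, so it is printed here rather than shown.

```python
from lxml import etree

# Deliberately incomplete markup: no <html>/<body>, and <p> is never closed
broken_html = "<h1>Welcome to Python</h1><p>This is a <b>tutorial</b> on lxml."

root = etree.HTML(broken_html)

# The parser wraps the fragment in a full document structure
print(root.tag)
print(etree.tostring(root, pretty_print=True, method="html").decode())

# Queries work as they would on well-formed input
heading = root.xpath("//h1/text()")
paragraph_text = root.xpath("string(//p)")
print(heading)
print(paragraph_text)
```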

## 5.2 Extracting Data from HTML

You can use **XPath** to extract text, attributes, or specific tags.

#### Example
Extract all paragraphs (\<p>) from HTML:

In [None]:
paragraphs = root.xpath("//p")
for p in paragraphs:
    # p.text stops at the first child tag (<b>), so join all text nodes instead
    print("".join(p.itertext()))
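
Attributes and filtered tags can be extracted with the same method. A short sketch, assuming a page with a couple of links (the HTML snippet, its URLs, and the `intro` class name are example values only):

```python
from lxml import etree

# Hypothetical page fragment with links
html_data = """<html>
    <body>
        <p class="intro">Read the <a href="https://lxml.de/">lxml docs</a>.</p>
        <p>See also <a href="https://www.python.org/">Python</a>.</p>
    </body>
</html>"""

root = etree.HTML(html_data)

# Attribute values: every href in the document
links = root.xpath("//a/@href")

# Specific tags filtered by attribute: paragraphs with class="intro"
intro = root.xpath("//p[@class='intro']")

# Full text of an element, including text inside nested tags
intro_text = "".join(intro[0].itertext())

print(links)
print(intro_text)
```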

# 6. Writing Data Back to an XML/HTML File

Once you’ve manipulated your XML or HTML data, you can write it back to a file.

## 6.1 Writing XML Data to a File

In [None]:
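# root must be the XML tree from sections 2-4; section 5 rebinds root to the
# parsed HTML document, so re-run those cells first if needed
tree = etree.ElementTree(root)
tree.write("modified_books.xml", pretty_print=True, xml_declaration=True, encoding="UTF-8")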

## 6.2 Writing HTML Data to a File

In [None]:
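# root here is the HTML tree parsed in section 5
tree = etree.ElementTree(root)
# write() has no html= argument; method="html" selects HTML serialization
tree.write("modified_page.html", pretty_print=True, method="html", encoding="UTF-8")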
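
To check what actually lands on disk, a round trip like the one below can help. It is a sketch that writes both kinds of document into a temporary directory, using `method="html"` for the HTML output, and reads them back. The small sample documents and the `html_root` name are assumptions for this example.

```python
import os
import tempfile
from lxml import etree

# Small stand-in documents for this example
root = etree.fromstring(
    "<books><book id='1'><title>Python Programming</title></book></books>"
)
html_root = etree.HTML("<html><body><h1>Welcome to Python</h1></body></html>")

with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "modified_books.xml")
    html_path = os.path.join(tmp_dir, "modified_page.html")

    # XML output with a declaration line at the top
    etree.ElementTree(root).write(
        xml_path, pretty_print=True, xml_declaration=True, encoding="UTF-8"
    )
    # HTML output uses the HTML serializer instead of the XML one
    etree.ElementTree(html_root).write(
        html_path, pretty_print=True, method="html", encoding="UTF-8"
    )

    with open(xml_path, "rb") as f:
        xml_bytes = f.read()
    with open(html_path, "rb") as f:
        html_bytes = f.read()

    # Parsing the written file again should give back the same content
    reparsed = etree.parse(xml_path).getroot()

print(xml_bytes.decode("utf-8"))
print(html_bytes.decode("utf-8"))
```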

# 7. Creating XML Files

## 7.1 Creating XML from a Dictionary

In [None]:
from lxml import etree

# Example dictionary
data = {
    'book': [
        {
            'id': '1',
            'title': 'Python Programming',
            'author': 'John Doe'
        },
        {
            'id': '2',
            'title': 'Learning XML',
            'author': 'Jane Smith'
        }
    ]
}

# Create the root element
root = etree.Element('books')

# Iterate through the dictionary to build the XML structure
for book in data['book']:
    book_element = etree.SubElement(root, 'book', id=book['id'])
    title = etree.SubElement(book_element, 'title')
    title.text = book['title']
    author = etree.SubElement(book_element, 'author')
    author.text = book['author']

# Convert the tree to a string and print it
print(etree.tostring(root, pretty_print=True, encoding='UTF-8').decode())
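
The same loop can be wrapped in a small helper. The sketch below uses a hypothetical function name, `build_books_xml`, and assumes each record has `id`, `title`, and `author` keys. It also shows two details that are easy to miss: attribute values must be strings, and special characters in text are escaped automatically when the tree is serialized.

```python
from lxml import etree

def build_books_xml(records):
    """Build a <books> tree from dicts with 'id', 'title', and 'author' keys."""
    root = etree.Element("books")
    for record in records:
        # Attribute values must be strings, so convert numeric ids explicitly
        book = etree.SubElement(root, "book", id=str(record["id"]))
        for field in ("title", "author"):
            child = etree.SubElement(book, field)
            child.text = record[field]
    return root

# Hypothetical record containing characters that need escaping in XML
records = [
    {"id": 1, "title": "Tips & Tricks for <XML>", "author": "John Doe"},
]

root = build_books_xml(records)
xml_text = etree.tostring(root, encoding="unicode")
print(xml_text)

# Parsing the serialized text restores the original characters
restored_title = etree.fromstring(xml_text).findtext("book/title")
print(restored_title)
```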