# lxml: Python library for XML and HTML processing

The lxml XML toolkit is a Pythonic binding for the C libraries **libxml2** and **libxslt**. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API that is mostly compatible with, but superior to, the well-known ElementTree API. At the time of writing, the latest release of lxml works with all CPython versions from 2.7 to 3.8.

lxml is a Python library that allows for easy handling of XML and HTML files, and it can also be used for web scraping. There are many off-the-shelf XML parsers, but for better results developers sometimes prefer to write their own XML and HTML parsing code, and this is where the lxml library comes into play. Its key benefits are that it is easy to use, extremely fast when parsing large documents, very well documented, and provides easy conversion of data to Python data types, resulting in easier file manipulation.

## Advantages

With the continued growth of both Python and XML, there is a plethora of packages that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the lxml package has two big advantages:

* Performance: reading and writing even fairly large XML files is fast.

* Ease of programming: lxml has a simple syntax and is more adaptable than other packages.

## Uses

* lxml can be used to create an XML/HTML structure out of elements.

* lxml can be used to parse an XML/HTML structure and retrieve information from it.

This makes the library useful for **web scraping**, i.e. getting information from web services and web resources, since these are commonly served in XML/HTML format.


# Installation


There are multiple ways to install lxml on your system.

## Method 1 : Using Pip

Pip is a Python package manager used to download and install Python libraries on your local system with ease, i.e. it also downloads and installs all the dependencies of the package you're installing.

If you have pip installed on your system, simply run the following command in terminal or command prompt:

                   pip install lxml

To install a specific version, pin the version number:
                
                pip install lxml==3.4.2
                
## Method 2 : Using apt-get

If you're using a Debian-based Linux distribution such as Ubuntu, you can install lxml with apt-get (apt-get is not available on macOS by default). Run the command that matches your Python version in your terminal.

For Python 2.x:

                    sudo apt-get install python-lxml

For Python 3.x:

                    sudo apt-get install python3-lxml
                    
                    
## Method 3 : Using easy_install

You probably won't get to this part, but if none of the above commands work for you for some reason, try using easy_install. Note that easy_install is deprecated, so treat it as a last resort:

                    easy_install lxml


In [1]:
!pip install lxml
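
To confirm that the installation worked, a quick check is to import the package and print the version tuples that lxml.etree exposes. This is a minimal sketch; the exact numbers printed depend on your environment.

```python
from lxml import etree

# Version of lxml itself, as a tuple of integers
print("lxml:    ", etree.LXML_VERSION)

# Versions of the underlying C libraries that lxml is running against
print("libxml2: ", etree.LIBXML_VERSION)
print("libxslt: ", etree.LIBXSLT_VERSION)
```

If the import fails, the package is not installed for the interpreter you are running.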



# This spotlight includes

1. The ElementTree class of lxml.
2. E-factory.
3. ElementPath.
4. Basics of web scraping using lxml.
5. Example: Extracting the IMDb rating of movies currently in theaters.


# 1. The ElementTree class

## 1.1. Creating HTML/XML Documents

lxml has an etree module (its implementation of the ElementTree API), which we can use to create XML/HTML elements and their subelements. This is very useful when we want to write or manipulate an HTML or XML file.

Let's try to create the basic structure of an HTML file using etree:

In [2]:
# Import the etree module from lxml
from lxml import etree

# etree.Element creates an element. Only the tag name is required;
# keyword arguments become attributes of the element.
root = etree.Element('html', version="5.0")


# Add child elements with etree.SubElement.
# Pass the parent node, the name of the child node, and any number of optional attributes.

etree.SubElement(root, 'head')
etree.SubElement(root, 'title', bgcolor="red", fontsize='22')
etree.SubElement(root, 'body', fontsize="15")

<Element body at 0x23553d2ff08>

In [3]:
# Use pretty_print=True to indent the HTML output
print(etree.tostring(root, pretty_print=True).decode("utf-8"))

<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
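
In the example above every child hangs directly off the root. In a real HTML page the title usually lives inside the head, and the same SubElement call handles that by passing the head as the parent. The following is a small sketch with illustrative names (`doc`, `head`, `title`) that are not part of the original example.

```python
from lxml import etree

doc = etree.Element("html")

# SubElement returns the new child, so it can serve as a parent in turn
head = etree.SubElement(doc, "head")
title = etree.SubElement(head, "title")
title.text = "Sample page"

etree.SubElement(doc, "body")

print(etree.tostring(doc, pretty_print=True).decode("utf-8"))
```

The title element now appears nested inside head in the serialized output.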



## 1.2. Manipulating the HTML/XML Documents

Let's use the HTML element created above as an example for manipulations.

In [4]:
#ACCESSING THE TAGS

# Get the tag name of a specific element, e.g. root
print(root.tag)


#iterate over the child elements and print their tag
for e in root:
    print(e.tag)

html
head
title
body
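
Besides iteration, elements behave much like Python lists of their children, and lxml adds a few navigation helpers of its own. The sketch below uses a throwaway document (`page_el` is an illustrative name) to show indexing, slicing, and the lxml-specific getparent() and getnext() methods.

```python
from lxml import etree

page_el = etree.XML("<html><head/><title/><body/></html>")

print(len(page_el))                     # number of direct children: 3
print(page_el[0].tag)                   # first child: head
print(page_el[-1].tag)                  # last child: body
print([c.tag for c in page_el[1:]])     # slicing works: ['title', 'body']

# getparent() and getnext() are lxml extensions to the ElementTree API
print(page_el[1].getparent().tag)       # html
print(page_el[1].getnext().tag)         # body
```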


In [5]:
#ADDING ATTRIBUTES TO ELEMENTS

print(etree.tostring(root, pretty_print=True).decode("utf-8"))

#add attribute to the element, eg: add attribute named 'newAttribute' valued 'attributeValue' to root element
root.set('newAttribute', 'attributeValue')

print(etree.tostring(root, pretty_print=True).decode("utf-8"))


# ACCESSING THE ATTRIBUTES

# Get the 'bgcolor' attribute of the second child (the title element)
print(root[1].get('bgcolor'))


<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>

<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>

red
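
Attributes can also be reached through the dictionary-like attrib property, and get() accepts a default for attributes that are missing. A short sketch on a standalone element (`el` is an illustrative name):

```python
from lxml import etree

el = etree.Element("title", bgcolor="red", fontsize="22")

print(el.get("bgcolor"))            # red
print(el.get("missing"))            # None, because the attribute does not exist
print(el.get("missing", "n/a"))     # n/a, the supplied default

# attrib supports the usual dictionary operations
el.attrib["lang"] = "en"            # add an attribute
del el.attrib["fontsize"]           # remove an attribute
print(dict(el.attrib))
```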


In [6]:
# ADDING TEXT TO ELEMENTS AND SUBELEMENTS

root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"
print(etree.tostring(root, pretty_print=True).decode("utf-8")) 

<html version="5.0" newAttribute="attributeValue">
  <head>This is the head of that file</head>
  <title bgcolor="red" fontsize="22">This is the title of that file</title>
  <body fontsize="15">This is the body of that file and would contain paragraphs etc</body>
</html>
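
The text property only covers the text before an element's first child. Text that follows a child element is stored on that child's tail property, which matters as soon as markup is mixed with text. A minimal sketch on a made-up paragraph:

```python
from lxml import etree

p = etree.XML("<p>Hello <b>bold</b> world</p>")
b = p[0]

print(repr(p.text))   # 'Hello '  (text before the first child)
print(repr(b.text))   # 'bold'    (text inside <b>)
print(repr(b.tail))   # ' world'  (text after </b>, stored on b)

# itertext() walks text and tail values in document order
print("".join(p.itertext()))   # Hello bold world
```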



In [7]:
# PARSING RAW XML FROM A STRING, THEN SERIALISING IT

html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Howdy</body></html>')
print(etree.tostring(html, pretty_print=True).decode('utf-8'))

<html>
  <head>Head of HTML</head>
  <title>I am the title!</title>
  <body>Howdy</body>
</html>
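
etree.XML expects well-formed XML and raises an exception otherwise, whereas etree.HTML uses the more forgiving HTML parser. The sketch below contrasts the two on deliberately broken input; the strings and variable names are made up for illustration.

```python
from lxml import etree

broken = "<html><body>Howdy</html>"   # <body> is never closed

parse_failed = False
try:
    etree.XML(broken)
except etree.XMLSyntaxError as err:
    parse_failed = True
    print("Not well-formed:", err)

# The HTML parser repairs missing tags instead of failing
recovered = etree.HTML("<p>Howdy")
print(etree.tostring(recovered).decode("utf-8"))
```

For scraped web pages, which are rarely well-formed XML, the HTML parser is usually the safer choice.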



In [8]:
# PARSING FROM FILES OR FILE LIKE OBJECTS - parse() function


from io import BytesIO
some_file_or_file_like_object = BytesIO(b"<html><head></head><body>Howdy from File</body></html>")
tree = etree.parse(some_file_or_file_like_object)
print(etree.tostring(tree, pretty_print=True).decode('utf-8'))

<html>
  <head/>
  <body>Howdy from File</body>
</html>
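
Unlike etree.XML, which returns the root element, parse() returns an ElementTree object that wraps the whole document. A brief sketch of moving between the two, using illustrative names (`source`, `doc_tree`, `root_el`):

```python
from io import BytesIO
from lxml import etree

source = BytesIO(b"<html><head/><body>Howdy from File</body></html>")
doc_tree = etree.parse(source)

# getroot() gives the root Element of the parsed tree
root_el = doc_tree.getroot()
print(root_el.tag)                        # html

# getpath() builds an XPath expression that locates an element in the tree
print(doc_tree.getpath(root_el[1]))       # /html/body
```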



# 2. The E-factory
The E-factory provides a simple and compact syntax for generating XML and HTML:

In [9]:
from lxml.builder import E

def CLASS(*args): # class is a reserved word in Python
    return {"class":' '.join(args)}

html = page = (
    E.html(       # create an Element called "html"
     E.head(
       E.title("This is a sample document")
     ),
     E.body(
       E.h1("Hello!", CLASS("title")),
       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
       E.p("This is another paragraph, with a", "\n      ",
         E.a("link", href="http://www.python.org"), "."),
       E.p("Here are some reserved characters: <spam&egg>."),
       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
     )
   )
)
  


print(etree.tostring(page, pretty_print=True).decode('utf-8'))

<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="http://www.python.org">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>
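
Because E-factory calls are ordinary Python expressions, they combine naturally with loops and comprehensions when the markup is driven by data. A small sketch, assuming a made-up list of tag names:

```python
from lxml import etree
from lxml.builder import E

# Hypothetical data to render
game_tags = ["Indie", "Adventure", "RPG"]

# A dict passed positionally sets attributes; child elements are unpacked from a list
ul = E.ul({"class": "tags"}, *[E.li(name) for name in game_tags])

print(etree.tostring(ul, pretty_print=True).decode("utf-8"))
```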



# 3. ElementPath
The ElementTree library comes with a simple XPath-like path language called ElementPath. It is used to find elements within an element tree through the following methods:

* iterfind() iterates over all Elements that match the path expression
* findall() returns a list of matching Elements
* find() efficiently returns only the first match
* findtext() returns the .text content of the first match

In [10]:
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

#Find a child of an Element:
print(root.find("b"))
print(root.find("a").tag)

#Find an Element anywhere in the tree:
print(root.find(".//b").tag)

#Find Elements with a certain attribute:
print(root.findall(".//a[@x]")[0].tag)  # finds all <a> elements in the tree that have an attribute x, and prints the tag of the first match
print(root.findall(".//a[@y]"))  # finds all <a> elements that have an attribute y; none exist, so the list is empty



None
a
b
a
[]
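
The cell above exercises find() and findall(). The remaining two methods, iterfind() and findtext(), can be sketched on the same document (`tree_root` is an illustrative name):

```python
from lxml import etree

tree_root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# iterfind() yields matches lazily instead of building a list
for el in tree_root.iterfind(".//b"):
    print(el.tag)

# findtext() returns the .text of the first match
print(tree_root.findtext("a"))                      # aText

# With no match it returns None, or the default if one is given
print(tree_root.findtext("missing"))                # None
print(tree_root.findtext("missing", default="-"))   # -
```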


# 4. Basics of web scraping using lxml

Steps to perform web scraping:

1. Send a request to the URL and get the response.
2. Take the body of the response as a byte string.
3. Pass the byte string to the ‘fromstring’ function of the html module in lxml.
4. Locate a particular element using XPath.
5. Use the content according to your need.
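
Steps 2 to 5 can be tried out without any network access. In the sketch below a hard-coded byte string stands in for the body of a downloaded response; the markup and the names `raw` and `sample_doc` are made up for illustration.

```python
import lxml.html

# Stand-in for the bytes a web server would return (step 2)
raw = b"<html><body><div id='news'><p>First</p><p>Second</p></div></body></html>"

# Step 3: parse the byte string into an element tree
sample_doc = lxml.html.fromstring(raw)

# Step 4: locate elements with an XPath expression
paragraphs = sample_doc.xpath('//div[@id="news"]/p')

# Step 5: use the content
texts = [p.text_content() for p in paragraphs]
print(texts)   # ['First', 'Second']
```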

I will use the **requests** Python module, which sends HTTP requests to web URLs and can download a web page’s HTML given its URL.

If you don't have requests installed, you can easily install it by running this command in the terminal:

            pip install requests

In [11]:
import requests
import lxml.html

page = requests.get('https://store.steampowered.com/explore/new/')



In [12]:
# We pass page.content rather than page.text because lxml.html.fromstring implicitly expects bytes as input.

doc = lxml.html.fromstring(page.content) #Returns an object of type HTMLElement. 


## XPath

**XPath** is a way of locating information in structured documents such as HTML or XML documents. XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.

The most useful path expressions are:

* /    Selects from the root node
* //   Selects nodes in the document from the current node that match the selection no matter where they are
* .    Selects the current node
* ..   Selects the parent of the current node
* @    Selects attributes
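
Each of the expressions listed above can be tried on a small document. The following sketch uses a made-up `shelf` document rather than a real web page.

```python
from lxml import etree

shelf = etree.XML(
    "<shelf>"
    "<book id='b1'><title>A</title></book>"
    "<book id='b2'><title>B</title></book>"
    "</shelf>"
)
first_book = shelf[0]

# /  : an absolute path starting at the root
print([e.tag for e in shelf.xpath("/shelf/book")])    # ['book', 'book']

# // : matches anywhere in the document
print([e.text for e in shelf.xpath("//title")])       # ['A', 'B']

# .  : the current node, and .. : its parent
print([e.tag for e in first_book.xpath(".")])         # ['book']
print([e.tag for e in first_book.xpath("..")])        # ['shelf']

# @  : selects attribute values
print(shelf.xpath("//book/@id"))                      # ['b1', 'b2']
```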

In [13]:
#This statement will return a list of all the divs in the HTML page which have an id of tab_newreleases_content

new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')


Breaking down the expression '//div[@id="tab_newreleases_content"]':

* // tells lxml to search the whole HTML document for tags that match our requirements, wherever they are
* div tells lxml that we are searching for divs in the HTML page
* [@id="tab_newreleases_content"] tells lxml that we are only interested in those divs which have an id of tab_newreleases_content

In [14]:
#EXTRACTING TAGS
tags = new_releases[0].xpath('.//div[@class="tab_item_top_tags"]')
total_tags = []
for tag in tags:
    total_tags.append(tag.text_content())
print(total_tags)

['Indie, Adventure, RPG, Puzzle', 'Action, Masterpiece, Great Soundtrack, Classic', 'Management, Simulation, Strategy, Building', 'Action, Strategy, RPG, Hack and Slash', 'Adventure, Casual, Visual Novel, Simulation', 'Adventure, Sexual Content, Visual Novel, Female Protagonist', 'Indie, Simulation, Sports, Difficult', 'Simulation, Indie, Sexual Content, Nudity', 'Strategy, Tactical, World War II, Turn-Based', 'Indie, Strategy, Simulation', 'Simulation, Strategy, Indie, Casual', 'Simulation, Adventure, Casual, Indie', 'Action, Strategy, Indie, Tower Defense', 'Action, FPS, Multiplayer', 'Building, Sandbox, Physics, Destruction', 'Strategy, Free to Play, RTS, Real Time Tactics', 'Post-apocalyptic, Action, Atmospheric, FPS', 'Action, Mechs, Anime, Character Customization', 'RPG, Hack and Slash, Action RPG, Action', 'Free to Play, Casual, Indie, Visual Novel', 'Free to Play, Visual Novel, Sexual Content, Indie', 'Free to Play, Casual, Simulation, Social Deduction', 'Action, Adventure, RPG

In the cell above, we extract the divs that contain the tags for the games. We then loop over the list of extracted divs and pull the text out of each one using the text_content() method, which returns the text contained within an HTML tag without the HTML markup.
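
The difference between text_content() and the plain text property shows up as soon as a tag contains nested markup. A minimal sketch on a made-up fragment:

```python
import lxml.html

fragment = lxml.html.fromstring(
    '<div class="tags"><span>Indie</span>, <span>RPG</span></div>'
)

# .text only holds text that appears before the first child element
print(repr(fragment.text))          # None

# text_content() gathers the text of the element and all its descendants
print(fragment.text_content())      # Indie, RPG
```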

# 5. Example: Extracting the IMDb rating of movies currently in theaters

In this example, let's extract the movies that are currently in theaters along with their IMDb ratings. The ratings can be found on the IMDb website (https://www.imdb.com/?ref_=nv_home).



Step 1: Exploring the webpage

* First of all, find the section of the IMDb website where movies in theaters are displayed.
* Open up Chrome developer tools and see which HTML tags contain the required data.
* In our case, if we take a look, we can see that every separate movie is encapsulated in a div tag.
* The div tags themselves are encapsulated in another div whose class is "in-theaters".


![title](inspect_website.png)

Step 2: Import the libraries and create an HTMLElement from the webpage

In [15]:
import requests
import lxml.html  # the html module of lxml

# Get the webpage
page = requests.get('https://www.imdb.com/?ref_=nv_home')

# Create an object of HTMLElement type from the web page contents
html_tree = lxml.html.fromstring(page.content)


The statement below returns a list of all the divs in the HTML page that have a class of "in-theaters". Only one div on the page has this class name, so we take the first element of the list ([0]) as our required div.

In [16]:
new_releases = html_tree.xpath('//div[@class="in-theaters"]')[0]
print(new_releases)

<Element div at 0x235553e0ef8>


Step 3: Extract the movie names and the IMDb ratings

* The movie titles are written inside the anchor tag whose class is "ipc-poster-card__title-href"
* The IMDb ratings are written inside the span tag whose class is "ipc-rating-star ipc-rating-star--baseAlt ipc-rating-star--imdb"
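
Note that @class="..." compares against the whole class attribute, so the match breaks if the site adds, removes, or reorders a class name. A common XPath 1.0 idiom is to test for a single class token instead. The sketch below demonstrates this on a made-up fragment that reuses one of the class names above; the surrounding markup is hypothetical and is not IMDb's actual page structure.

```python
import lxml.html

card = lxml.html.fromstring(
    '<div>'
    '<span class="ipc-rating-star ipc-rating-star--imdb">7.6</span>'
    '<span class="other">skip</span>'
    '</div>'
)

# Exact comparison fails because the attribute holds two class names
exact = card.xpath('.//span[@class="ipc-rating-star--imdb"]/text()')

# Token match: pad the normalized class list with spaces, then look for " name "
token = card.xpath(
    './/span[contains(concat(" ", normalize-space(@class), " "), '
    '" ipc-rating-star--imdb ")]/text()'
)

print(exact)   # []
print(token)   # ['7.6']
```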

In [17]:

titles = new_releases.xpath('.//a[@class="ipc-poster-card__title-href"]/text()')

In [18]:
print(titles)

['Sonic the Hedgehog', 'The Call of the Wild', 'The Invisible Man', 'Onward', 'Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn', 'Dolittle', 'Bad Boys For Life', 'Fantasy Island', '1917', 'Brahms: The Boy II', 'Impractical Jokers: The Movie', 'Jumanji: The Next Level', 'Emma.', 'My Hero Academia: Heroes Rising', 'The Photograph', 'Parasite', 'Downhill', 'Las Pildoras De Mi Novio', 'Gretel & Hansel', 'Seberg']


In [19]:
imdb = new_releases.xpath('.//span[@class="ipc-rating-star ipc-rating-star--baseAlt ipc-rating-star--imdb"]/text()')

In [20]:
print(imdb)

['6.8', '6.9', '7.6', '7.4', '6.6', '5.5', '7.2', '4.6', '8.4', '4.3', '7.0', '6.8', '6.9', '8.0', '6.1', '8.6', '4.9', '3.8', '5.4', '5.1']


Step 4: Store the scraped data in a CSV file using the pandas library

In [21]:
import pandas as pd

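# Convert the ratings to floats so that sorting is numeric rather than alphabetical
data = {'Movie_Title': titles, 'IMDB Ratings': [float(r) for r in imdb]}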
df = pd.DataFrame(data)

In [22]:
df = df.sort_values('IMDB Ratings',ascending=False)

In [23]:
df.to_csv('Movies_in_Theater-Rated.csv', index=False, header=True)

![title](result.png)
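
The XPath queries return ratings as strings, and strings sort character by character, which can give a surprising order once values reach two digits. The sketch below shows the pitfall and then pairs a few titles and ratings taken from the output above, sorting them numerically and writing CSV with only the standard library. The names `sample_titles`, `sample_ratings`, and `buffer` are illustrative, and an in-memory buffer stands in for a file.

```python
import csv
import io

# String sorting compares characters, so '10.0' sorts before '9.1'
print(sorted(['9.1', '10.0']))                 # ['10.0', '9.1']
print(sorted(['9.1', '10.0'], key=float))      # ['9.1', '10.0']

sample_titles = ['Onward', '1917', 'Dolittle']
sample_ratings = ['7.4', '8.4', '5.5']

# Pair each title with its numeric rating, highest rating first
rows = sorted(
    zip(sample_titles, (float(r) for r in sample_ratings)),
    key=lambda row: row[1],
    reverse=True,
)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Movie_Title', 'IMDB Ratings'])
writer.writerows(rows)
print(buffer.getvalue())
```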

# Resources 

1. https://lxml.de/
2. https://kite.com/python/docs/lxml


# References

1. https://timber.io/blog/an-intro-to-web-scraping-with-lxml-and-python/#step-4-extract-the-titles-prices
2. https://stackabuse.com/introduction-to-the-python-lxml-library/
3. https://www.journaldev.com/18043/python-lxml
