# Alternative libraries

So far, the two main libraries we used for scraping a webpage were **requests** and **BeautifulSoup**. However, there are also alternative libraries that can serve the same purpose.

- **urllib2** - the standard Python 2 library for sending requests to a URL and reading the HTML content (in Python 3 its functionality lives in **urllib.request**). The two main functions are **urlopen()** (similar to **get()** from **requests**) and **read()** (similar to **text** from **requests**, although **read()** returns the raw, undecoded bytes).
- **lxml** - a third-party library (like **BeautifulSoup**) used for parsing **XML** and **HTML** files. Its syntax is very similar to that of **BeautifulSoup**, yet it is generally much faster. The disadvantage is that it is best suited to well-structured webpages, not to loosely structured ones (the "soups").

<blockquote>
It is worth noting that **lxml** has a soupparser module (**lxml.html.soupparser**), which *"mimics"* the **BeautifulSoup** approach. Conversely, the **BeautifulSoup()** function from the library of the same name can take **"lxml"** as its parser argument and use it to parse pages more quickly.
</blockquote>
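As a minimal sketch of the second point, the parser name is passed as the second argument to **BeautifulSoup()**. The HTML string below is a made-up fragment for illustration (it is not the real page markup), and the example assumes both **bs4** and **lxml** are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
snippet = "<table><tr><td>Store Manager</td><td>Zigzag</td></tr></table>"

# "lxml" tells BeautifulSoup to use lxml's HTML parser under the hood
soup = BeautifulSoup(snippet, "lxml")

cells = [td.get_text() for td in soup.find_all("td")]
print(cells)
# ['Store Manager', 'Zigzag']
```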

In [1]:
import urllib2
from lxml import html

In [2]:
url = "https://careercenter.am/ccidxann.php"

In [3]:
response = urllib2.urlopen(url)
page = response.read()

tree = html.document_fromstring(page)
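Note that **urllib2** exists only in Python 2. As a hedged sketch of the same fetch step under Python 3, the standard library's **urllib.request** can be used instead. The helper name **fetch_page** and its **timeout** default are illustrative assumptions, not part of the original notebook.

```python
from urllib.request import urlopen  # Python 3 home of urllib2's urlopen


def fetch_page(url, timeout=10):
    """Return the raw bytes of the page at `url`."""
    # The response is a context manager, so the connection is closed
    # automatically once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage, equivalent to cells In [2] and In [3] above (requires network access):
# page = fetch_page("https://careercenter.am/ccidxann.php")
# tree = html.document_fromstring(page)
```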

The counterpart of BeautifulSoup's **findAll()** in lxml is **cssselect()**, which returns a list of all the elements matching the CSS selector given inside quotes, as follows.

In [4]:
tables = tree.cssselect("table")

In [5]:
len(tables)

5

To get the text content of a tag, call the **text_content()** function on an element of the list.

In [6]:
tables[-1].text_content()

'\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia\n    \n  '
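The string returned by **text_content()** keeps the line breaks and indentation of the page. A minimal sketch of one way to tidy it, using only built-in string methods on the output shown above (the variable names are illustrative):

```python
# Output of tables[-1].text_content(), copied from the cell above
raw = ('\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an '
       'Organization to Purchase Ink System for ATM Cassettes / '
       'Union of Banks of Armenia\n    \n  ')

# Keep only the non-blank lines, with surrounding whitespace removed
lines = [line.strip() for line in raw.splitlines() if line.strip()]
print(lines)
# ['COMPETITIONS', 'Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia']
```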

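We may use the table's attributes to find the table that we are looking for. Each attribute goes inside its own square brackets. Note that separating the selectors with a comma, as below, matches every element that satisfies *either* condition; writing them back to back without a comma (**[width="100%"][border="0"]**) would require *both* conditions to hold: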

In [7]:
our_table = tree.cssselect('[width="100%"],[border="0"]')

To get the text content of each matched element, we loop over the list and print the text content of each item.

In [8]:
for i in our_table:
    print(i.text_content())







    JOB OPPORTUNITIES
    
      
      IT Project Manager, IT Department / HSBC Bank Armenia
    
    
      
      Medical Representative / Les Laboratoires Servier Armenia
    
    
      
      Marketing Manager / Accurate Group
    
    
      
      Technical Specifications Development Specialist / RA Ministry of Health  Centralized Procurement Organization and Coordination Working Group
    
    
      
      Pharmacologist / RA Ministry of Health  Centralized Procurement Organization and Coordination Working Group
    
    
      
      Procurement Specialist / RA Ministry of Health  Centralized Procurement Organization and Coordination Working Group
    
    
      
      National Consultant/ Translator to Assist in Learning Evaluation Mission / GIZ
    
    
      
      Store Manager / Zigzag
    
    
      
      Business Development Manager / Oriflame Cosmetics
    
    
      
      Warehouse Worker / Oriflame Cosmetics
    
    
      
      Chief Accountant / Alf

One advantage of the **lxml** library is that it provides two options for scraping: 1) CSS selectors (similar to **BeautifulSoup**) and 2) XPath. The latter is not supported by **BeautifulSoup**, yet can sometimes be quite handy. **XPath** is the query language for navigating **XML** documents (which **lxml** is meant for working with). In **XPath**, the forward slash (**/**) separates the steps of a path, and the at sign (**@**) inside square brackets (**[ ]**) specifies an attribute. The **//table** path selects all the tables in the document; the result is a list, so **[-1]** below picks the last table.

In [9]:
tree.xpath('//table')[-1].text_content()

'\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia\n    \n  '
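Since **xpath()** returns a plain Python list, the usual indexing applies. A small sketch on a made-up two-table document (the markup is hypothetical, only the section titles are borrowed from the outputs above):

```python
from lxml import html

# Hypothetical two-table document, for illustration only
demo_tree = html.document_fromstring("""
<html><body>
  <table border="0"><tr><td>JOB OPPORTUNITIES</td></tr></table>
  <table border="0"><tr><td>COMPETITIONS</td></tr></table>
</body></html>
""")

all_tables = demo_tree.xpath('//table')    # every <table>, in document order
first_text = all_tables[0].text_content()  # first match
last_text = all_tables[-1].text_content()  # last match, as in the cell above
print(len(all_tables), first_text, last_text)
```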

To find the tables that have a border attribute with a value of 0, put the attribute condition in square brackets after the tag name (again, **[-1]** picks the last match).

In [10]:
tree.xpath('//table[@border="0"]')[-1].text_content()

'\n    COMPETITIONS\n    \n      \n      Open Tender for Choosing an Organization to Purchase Ink System for ATM Cassettes / Union of Banks of Armenia\n    \n  '
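Several attribute conditions can be combined inside one pair of square brackets with the XPath operators **and** and **or**. The sketch below uses a made-up fragment, with **id** attributes added only to label the results:

```python
from lxml import html

# Hypothetical markup, for illustration only
demo_tree = html.document_fromstring("""
<html><body>
  <table id="a" width="100%" border="0"><tr><td>both</td></tr></table>
  <table id="b" width="100%" border="1"><tr><td>width only</td></tr></table>
  <table id="c" width="50%" border="0"><tr><td>border only</td></tr></table>
</body></html>
""")

# Tables that satisfy both conditions
both = demo_tree.xpath('//table[@width="100%" and @border="0"]')
# Tables that satisfy at least one condition
either = demo_tree.xpath('//table[@width="100%" or @border="0"]')

both_ids = [t.get("id") for t in both]
either_ids = [t.get("id") for t in either]
print(both_ids)    # ['a']
print(either_ids)  # ['a', 'b', 'c']
```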

If one is interested in getting the value of an attribute (similar to **get()** in **BeautifulSoup**), then **@** without square brackets can be used after the **/** sign as follows:

In [11]:
tree.xpath('//table/@border')

['0', '0', '0', '0', '0']
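One difference worth keeping in mind: the **/@border** path returns values only for the tables that actually have the attribute, whereas calling **get()** on each element yields **None** for the ones that lack it. A sketch on a made-up fragment (the markup is hypothetical):

```python
from lxml import html

# Hypothetical markup: the second table has no border attribute
demo_tree = html.document_fromstring("""
<html><body>
  <table border="0"><tr><td>first</td></tr></table>
  <table><tr><td>second</td></tr></table>
  <table border="1"><tr><td>third</td></tr></table>
</body></html>
""")

# Attribute values straight from XPath: missing attributes are skipped
borders = demo_tree.xpath('//table/@border')

# Element-by-element lookup: missing attributes show up as None
via_get = [t.get("border") for t in demo_tree.xpath('//table')]

print(borders)  # ['0', '1']
print(via_get)  # ['0', None, '1']
```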