# Notebook №9. Information technologies

by Alex Filozop from IS/b-20-2-o

## Working with the Web Resource API using XML

In [1]:
# import specific class from module
from bs4 import BeautifulSoup

### API and XML

When we analyze web pages and extract information from them, we are trying to write a program that acts like a person, which can be difficult. Fortunately, more and more sites offer information that can be processed easily not only by a person but also by another program. This is done through an API (application programming interface). An ordinary user interface is a way for a person to interact with a program, while an API is a way for one program to interact with another: for example, your Python script with a remote web server.

HTML is used to store web pages that people read. To store arbitrary structured data exchanged between programs, other languages are used, in particular XML, which resembles HTML. More precisely, XML is a metalanguage, that is, a way of describing languages. Unlike HTML, the set of tags in an XML document can be arbitrary (it is determined by the developer of a particular XML dialect). For example, if we wanted to describe a student group in XML, it might look like this:

In [None]:
<group>
  <number>ПИ/б-18-1-о</number>
  <student>
    <firstname>Виталий</firstname>
    <lastname>Иванов</lastname>
  </student>
  <student>
    <firstname>Мария</firstname>
    <lastname>Петрова</lastname>
  </student>
</group>

To process XML files, you can use the same *Beautiful Soup* package that we have already used for HTML. The only difference is that you need to pass the additional parameter `features="xml"` when calling the `BeautifulSoup` constructor, so that the document is parsed as XML rather than HTML.

If the `features="xml"` parameter leads to an error, you need to install the *lxml* package. To do this, open the Anaconda Prompt window and run the `pip install lxml` command.

In [3]:
# assign an XML string to the variable
group = """<group>
  <number>ПИ/б-18-1-о</number>
  <student>
    <firstname>Виталий</firstname>
    <lastname>Иванов</lastname>
  </student>
  <student>
    <firstname>Мария</firstname>
    <lastname>Петрова</lastname>
  </student>
</group>"""

In [4]:
obj = BeautifulSoup(group, features="xml") # parse xml string
print(obj.prettify()) # print parsed string in pretty format

<?xml version="1.0" encoding="utf-8"?>
<group>
 <number>
  ПИ/б-18-1-о
 </number>
 <student>
  <firstname>
   Виталий
  </firstname>
  <lastname>
   Иванов
  </lastname>
 </student>
 <student>
  <firstname>
   Мария
  </firstname>
  <lastname>
   Петрова
  </lastname>
 </student>
</group>


This is how we can find the group number in our XML document:

In [5]:
obj.group.number.string # get a parsed value

'ПИ/б-18-1-о'

This means: "find the *group* tag in the obj object, find the *number* tag inside it, and return what it contains as a string."

And this is how you can list all the students:

In [6]:
# print every student's last and first name in a loop
# (find_all is the current name; findAll is a deprecated alias)
for student in obj.group.find_all('student'):
  print(student.lastname.string, student.firstname.string)

Иванов Виталий
Петрова Мария
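
If Beautiful Soup or lxml is not available, the same traversal can be sketched with the standard library's `xml.etree.ElementTree`. This is an alternative technique, not the approach used in the rest of this notebook; the names `group_xml`, `root` and `students` are illustrative.

```python
import xml.etree.ElementTree as ET

group_xml = """<group>
  <number>ПИ/б-18-1-о</number>
  <student>
    <firstname>Виталий</firstname>
    <lastname>Иванов</lastname>
  </student>
  <student>
    <firstname>Мария</firstname>
    <lastname>Петрова</lastname>
  </student>
</group>"""

# fromstring() returns the root element, here <group>
root = ET.fromstring(group_xml)

# findtext() returns the text of the first matching child element
number = root.findtext("number")

# collect every <student> into a list of (lastname, firstname) tuples
students = [
    (s.findtext("lastname"), s.findtext("firstname"))
    for s in root.findall("student")
]

print(number)
for lastname, firstname in students:
    print(lastname, firstname)
```

Here `findall("student")` plays the role of `find_all('student')` in Beautiful Soup, and `findtext(...)` replaces the `.tag.string` chain.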


### Get a list of articles from the category in Wikipedia

Let's say we need to get a list of all articles in some Wikipedia category. We could open the category in a browser and continue to use the methods discussed above. However, Wikipedia has a convenient API. To learn how to work with it, you will have to read [the documentation](https://www.mediawiki.org/wiki/API:Main_page) (this is the case with any API), but it only seems complicated the first time.

So, let's get started. Interaction with the server through the API works by sending specially formed requests and receiving a response in one of the machine-readable formats. We will be interested in the XML format, although there are others (later we will get acquainted with JSON). For example, we can send a request like this:

[https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmsort=timestamp&cmdir=desc&format=xmlfm](https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmsort=timestamp&cmdir=desc&format=xmlfm)

The string *https://en.wikipedia.org/w/api.php* (everything before the question mark) is the API entry point. Everything after the question mark is the request itself. It resembles a dictionary and consists of "key=value" pairs separated by ampersands (`&`). Some characters have to be encoded in a special way.
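
To see how such a query string is assembled and which characters get encoded, here is a small sketch using the standard library's `urllib.parse`. The parameter values are taken from the link above; the variable names are illustrative and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

entry_point = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Physics",
    "cmsort": "timestamp",
    "cmdir": "desc",
    "format": "xmlfm",
}

# urlencode() joins key=value pairs with "&" and percent-encodes
# reserved characters, so ":" becomes "%3A"
query = urlencode(params)
full_url = entry_point + "?" + query
print(full_url)

# parse_qs() goes the other way: query string -> dictionary of lists
decoded = parse_qs(urlsplit(full_url).query)
print(decoded["cmtitle"])
```

Browsers and servers treat `Category:Physics` and `Category%3APhysics` as the same value, which is why the unencoded form in the link also works.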

For example, the address above says that we want to make a query (*action=query*) that lists the members of a category (*list=categorymembers*), that the category we are interested in is Category:Physics (*cmtitle=Category:Physics*), and it sets a few other parameters. If you follow this link, something like this will open:

In [None]:
<?xml version="1.0"?>
<api batchcomplete="">
<continue cmcontinue="2015-05-30 19:37:50|1653925" continue="-||" />
<query>
  <categorymembers>
    <cm pageid="24293838" ns="0" title="Wigner rotation" />
    <cm pageid="48583145" ns="0" title="Northwest Nuclear Consortium" />
    <cm pageid="48407923" ns="0" title="Hume Feldman" />
    <cm pageid="48249441" ns="0" title="Phase Stretch Transform" />
    <cm pageid="47723069" ns="0" title="Epicatalysis" />
    <cm pageid="2237966" ns="14" title="Category:Surface science" />
    <cm pageid="2143601" ns="14" title="Category:Interaction" />
    <cm pageid="10844347" ns="14" title="Category:Physical systems" />
    <cm pageid="18726608" ns="14" title="Category:Physical quantities" />
    <cm pageid="22688097" ns="0" title="Branches of physics" />
  </categorymembers>
</query>
</api>

Among the various tags here, the ones we are interested in are the `<cm>` tags inside the `<categorymembers>` tag.

Let's make the corresponding request from Python. To do this, we will need the already familiar `requests` module.

In [8]:
# import module
import requests

In [9]:
url = "https://en.wikipedia.org/w/api.php" # create a string (URL)
# create a dictionary (query parameters)
params = {
  'action':'query',
  'list':'categorymembers',
  'cmtitle': 'Category:Physics',
  'format': 'xml'
}
g = requests.get(url, params=params) # perform a GET query and assign the result to the variable

As you can see, we pass the parameters as a regular dictionary. Let's see what came back.

In [10]:
g.ok # True if the status code of the response is less than 400

True

It's all good. Now we use Beautiful Soup to process this XML.

In [11]:
data = BeautifulSoup(g.text, features='xml') # parse the XML document

In [12]:
print(data.prettify()) # print a parsed document in pretty format

<?xml version="1.0" encoding="utf-8"?>
<api batchcomplete="">
 <continue cmcontinue="subcat|383a4e50464c5a0446340448385a4e3a2e4e011601dc15|1310583" continue="-||"/>
 <query>
  <categorymembers>
   <cm ns="0" pageid="22939" title="Physics"/>
   <cm ns="100" pageid="1653925" title="Portal:Physics"/>
   <cm ns="0" pageid="23479" title="Physicalism"/>
   <cm ns="0" pageid="71771866" title="Six Ideas that Shaped Physics"/>
   <cm ns="14" pageid="36477012" title="Category:Concepts in physics"/>
   <cm ns="14" pageid="49740128" title="Category:Subfields of physics"/>
   <cm ns="14" pageid="694942" title="Category:Physicists"/>
   <cm ns="14" pageid="5625591" title="Category:Physics awards"/>
   <cm ns="14" pageid="70983414" title="Category:Physics by country"/>
   <cm ns="14" pageid="71976587" title="Category:Physics events"/>
  </categorymembers>
 </query>
</api>


Find all occurrences of the `<cm>` tag and output their `title` attribute:

In [13]:
# print the title of each category member in a loop
for cm in data.api.query.categorymembers("cm"):
  print(cm['title'])

Physics
Portal:Physics
Physicalism
Six Ideas that Shaped Physics
Category:Concepts in physics
Category:Subfields of physics
Category:Physicists
Category:Physics awards
Category:Physics by country
Category:Physics events


We could have simplified the search for `<cm>` tags by not specifying the "full path" to them:

In [14]:
# same as above, without the full path
for cm in data("cm"):
  print(cm['title'])

Physics
Portal:Physics
Physicalism
Six Ideas that Shaped Physics
Category:Concepts in physics
Category:Subfields of physics
Category:Physicists
Category:Physics awards
Category:Physics by country
Category:Physics events
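
Each `<cm>` element also carries an `ns` attribute. In MediaWiki, namespace 0 holds ordinary articles and namespace 14 holds categories, so the attribute can be used to separate articles from subcategories. The sketch below does this on a shortened copy of the response shown above, using the standard library's `xml.etree.ElementTree` so that it runs without network access; the names `sample`, `articles` and `subcategories` are illustrative.

```python
import xml.etree.ElementTree as ET

# shortened copy of the server response printed earlier
sample = """<api batchcomplete="">
 <query>
  <categorymembers>
   <cm ns="0" pageid="22939" title="Physics"/>
   <cm ns="100" pageid="1653925" title="Portal:Physics"/>
   <cm ns="0" pageid="23479" title="Physicalism"/>
   <cm ns="14" pageid="694942" title="Category:Physicists"/>
   <cm ns="14" pageid="5625591" title="Category:Physics awards"/>
  </categorymembers>
 </query>
</api>"""

root = ET.fromstring(sample)

# iter("cm") walks the whole tree, much like data("cm") in Beautiful Soup
members = list(root.iter("cm"))

# attribute values are read with get() and are always strings
articles = [cm.get("title") for cm in members if cm.get("ns") == "0"]
subcategories = [cm.get("title") for cm in members if cm.get("ns") == "14"]

print(articles)
print(subcategories)
```

With Beautiful Soup the same filter would compare `cm['ns']` in the loop instead of `cm.get("ns")`.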


By default, the server returns a list of 10 items. If we want more, we need to use the `continue` element, which acts as a kind of link to the next 10 items.

In [15]:
data.find("continue")['cmcontinue'] # get the value of the cmcontinue attribute

'subcat|383a4e50464c5a0446340448385a4e3a2e4e011601dc15|1310583'

We had to use the `find()` method instead of simply writing `data.continue`, because `continue` is a reserved keyword in Python.
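
The conflict can be checked directly with the standard `keyword` module. The short sketch below also uses `compile()` to confirm that the attribute form is rejected by the parser before any code runs; the variable `is_valid` is illustrative.

```python
import keyword

# "continue" is reserved, while an ordinary tag name such as "query" is not
print(keyword.iskeyword("continue"))  # True
print(keyword.iskeyword("query"))     # False

# compile() only parses the expression, so this is safe to run
try:
    compile("data.continue", "<example>", "eval")
    is_valid = True
except SyntaxError:
    is_valid = False
print(is_valid)
```

Passing the tag name as a string, as in `data.find("continue")`, avoids the problem because the name never appears as Python syntax.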

Now let's add `cmcontinue` to our request and execute it again:

In [16]:
params['cmcontinue'] = data.api("continue")[0]['cmcontinue']

In [18]:
g = requests.get(url, params=params)
data = BeautifulSoup(g.text, features='xml')
for cm in data.api.query.categorymembers("cm"):
  print(cm['title'])

Category:History of physics
Category:Physics-related lists
Category:Physics literature
Category:Physical modeling
Category:Physics organizations
Category:Physical systems
Category:Works about physics
Category:Physics stubs


We got the next batch of items from the category (eight this time). Continuing in this way, you can download the whole category (although for a large one it will take a lot of time).
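
The manual steps above can be wrapped in a loop that keeps requesting pages until the response no longer contains a `<continue>` element. The following is a minimal sketch under stated assumptions: `iter_category_titles`, `fetch_page` and `fake_fetch` are hypothetical names, the parsing uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup, and two canned responses stand in for the server so that the example runs offline.

```python
import xml.etree.ElementTree as ET


def iter_category_titles(fetch_page, params):
    """Yield titles page by page, following cmcontinue until it disappears.

    fetch_page is any callable that takes a dict of query parameters
    and returns the XML response as text.
    """
    params = dict(params)  # copy, so the caller's dictionary is not modified
    while True:
        root = ET.fromstring(fetch_page(params))
        for cm in root.iter("cm"):
            yield cm.get("title")
        cont = root.find("continue")
        if cont is None:  # no <continue> element: this was the last page
            break
        params["cmcontinue"] = cont.get("cmcontinue")


# canned responses that imitate two pages of server output (made-up data)
PAGES = {
    None: '<api><continue cmcontinue="page2" continue="-||"/>'
          '<query><categorymembers><cm ns="0" title="Physics"/>'
          '</categorymembers></query></api>',
    "page2": '<api batchcomplete=""><query><categorymembers>'
             '<cm ns="14" title="Category:Physics stubs"/>'
             '</categorymembers></query></api>',
}


def fake_fetch(params):
    # pick the page by the cmcontinue value, as the real server would
    return PAGES[params.get("cmcontinue")]


base_params = {"action": "query", "list": "categorymembers",
               "cmtitle": "Category:Physics", "format": "xml"}
titles = list(iter_category_titles(fake_fetch, base_params))
print(titles)
```

Against the real server, `fetch_page` could be something like `lambda p: requests.get(url, params=p).text`, with `url` and `requests` as used earlier in this notebook.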

Working with the many other APIs available on different sites is done in a similar way. Some APIs are completely open (as in Wikipedia); for some you need to register and get an application id and a key to access the API; for some you will even be asked to pay (for example, automated Google search costs something like $5 per 1,000 requests). Some APIs only allow you to read information, while others also allow you to edit it. For example, you can write a script that automatically saves some information to Google Sheets. Whenever you use an API you will have to study its documentation, but in most cases that is still easier than processing HTML code. Sometimes API access can be simplified further by using special libraries.