# Chapter 1: Interaction with the web

## 1. The WEB architecture

Webpages and APIs offer an incredible amount of data for researchers, be it literary texts, statistics or tweets, and are central to the development of research in the Digital Humanities. The difference between a webpage and an API lies in how the data is presented: webpages wrap their data in HTML, a markup language designed for display, whereas APIs return structured data, typically in XML, in JSON or in an RDF serialization.

The World Wide Web is organized around HTTP. [HTTP](https://httpwg.github.io/specs/rfc7540.html) defines the way computers, whether servers or clients, communicate with one another. There are four methods you should know:
- GET: this is the base method for HTTP communication. It retrieves a resource, and you can pass parameters to specify what you want to see. You use it when you search or when you open a webpage.
- POST: this sends data to the server, to update or save information. You use it when you sign up for or log into a website.
- DELETE: this deletes a resource from the server.
- PUT: this saves a resource on the server at a given URL, creating it or replacing what was there.

When browsing the web you'll use the first method 90% of the time, the second one 9.99 % and the remaining two maybe once a year. But those same websites you visit probably use those last two and others listed on the [w3c](http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html) website every second or minute. Most of the time these websites use what's called a REST API.

![REST API](images/rest.png)
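As a rough illustration of the four methods listed earlier, the sketch below assembles the text a client would send for each of them. The host `example.org`, the path `/texts/1` and the helper name `build_raw_request` are made up for the example, and real clients send more headers than this; it only shows where the method appears in a request.

```python
def build_raw_request(method, path, host, body=""):
    """Assemble the text of a simple HTTP/1.1 request (illustration only)."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line: method, resource, protocol version
        f"Host: {host}",              # the server we are talking to
    ]
    if body:
        # Requests that carry data announce the size of that data in bytes
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the headers from the (optional) body
    return "\r\n".join(lines) + "\r\n\r\n" + body


# Hypothetical host and path, used only to show the request line of each method
for method in ("GET", "POST", "PUT", "DELETE"):
    raw = build_raw_request(method, "/texts/1", "example.org")
    print(raw.splitlines()[0])
```

Only the first word of the request line changes from one method to the other; the server decides what to do based on it.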

Now that we have a general idea of how the web is constructed, let's get started with Python.

## 2. Python and the web : getting a page

As you've seen, Python is extremely modular. That means that we'll be using modules to query the web. There are many possibilities, but the one we'll be using is `requests`. To import a library, write the following :

In [1]:
import requests

This line will allow you to query the web. For example, the following lines of code will query the [CTS API](http://cite-architecture.github.io/cts_spec/) of Perseus. Execute it :

In [None]:
url = "http://services2.perseids.org/exist/restxq/cts?request=GetCapabilities&inv=latin"
response = requests.get(url)
print(response)

Can you explain what we just did ? Or why the printed result is `<Response [200]>` ?

Explanation : after setting up a `url` variable, we used the function `get()` from the library `requests`. This function takes as its first parameter a string representing a URL. This function performs a GET query on the url, according to http standards. We then receive a response from the server with the following information : 
- a code, which expresses the status of the request. You might not know 200, because it means "everything went well", but I'm sure you've seen 404. For more codes, see the list on [wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- a header, which tells us about the content of the response.
- a body, which holds the content itself, for example the HTML of the page you see in your browser.
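If you want to look up what a status code means without leaving Python, the standard library ships a table of them in `http.HTTPStatus`. The sketch below is independent of `requests`; the helper name `describe` and the grouping into families are choices made for this example.

```python
from http import HTTPStatus


def describe(code):
    """Return a short, human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes Python does not know
    if 200 <= code < 300:
        family = "success"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    return f"{code} {status.phrase} ({family})"


for code in (200, 404, 500):
    print(describe(code))
```

Codes in the 400s generally mean the request was at fault, while codes in the 500s point at the server.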

The result of a query with `requests` contains all that information:
- `response.status_code` represents the status of a query
- `response.headers` is a dictionary containing the headers
- `response.text` is the content of the body.

In [None]:
# Let's see our headers :
print(response.headers)

In [None]:
# And the few first characters of our text :
print(response.text[0:100])

Great! It works! Now it's your turn to make a request.

**DIY**

We will use the API of Perseids to get the famous first verse of the Aeneid. The text of the answer is contained between two tags `tei:l`. Can you query this page : "http://services2.perseids.org/exist/restxq/cts?request=GetPassage&inv=nemo&urn=urn:cts:latinLit:phi0959.phi001.perseus-lat2:1.1.1" ?

In [None]:
# Write your code here

## 3. Passing parameters to the web

Now that we have seen the basic usage of the `requests.get()` function, it's time to heat things up a little. At some point, you'll probably face a loop such as the following :

- Get page 1 of text A
- Get page 2 of text A
- ...
- Get page 24 of text D

And constructing URLs by hand is far from fun. Thankfully, `requests` graciously offers us a way to deal with that: the named argument `params`. Let's see how the previous query could have been written using `params`:

In [None]:
baseurl = "http://services2.perseids.org/exist/restxq/cts"
parameters = {
        "request" : "GetPassage",
        "inv": "nemo",
        "urn": "urn:cts:latinLit:phi0959.phi001.perseus-lat2:1.1.1"
    }
response = requests.get(baseurl, params=parameters)

`params` takes a dictionary as its value. The keys of the dictionary are the names of the URL parameters, and the values are their values. Plain and simple. To check that the query is correct, we can print the URL using another attribute of the response we haven't seen before:

In [None]:
print(response.url)
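To come back to the loop described at the start of this section, here is a minimal sketch of how a parameters dictionary can be generated for several passages in a row. It uses the standard library's `urlencode` to show roughly what the resulting URLs look like, without sending anything. The helper name `passage_url` is invented for the example, and whether each passage actually exists on the server is not checked.

```python
from urllib.parse import urlencode

baseurl = "http://services2.perseids.org/exist/restxq/cts"
work = "urn:cts:latinLit:phi0959.phi001.perseus-lat2"


def passage_url(book, poem, line):
    """Build the query URL for one passage (nothing is sent over the network)."""
    parameters = {
        "request": "GetPassage",
        "inv": "nemo",
        "urn": f"{work}:{book}.{poem}.{line}",
    }
    # urlencode escapes special characters such as ":" (which becomes %3A)
    return baseurl + "?" + urlencode(parameters)


# Lines 1 to 3 of the first poem of the first book
for line in range(1, 4):
    print(passage_url(1, 1, line))
```

With `requests`, you would pass the same dictionary through `params` inside the loop instead of building the string yourself.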

Now that we have a nice way to add simple parameters, the next question is how to pass a list. You might find yourself in a situation where one parameter needs several values. The web has two conventions for that: either the parameter name is suffixed with `[]`, or the plain name is simply repeated several times in the same URL.

In [19]:
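# Example 1 : the plain name is repeated (?key=a&key=b)
params_repeated = {
        "key" : ["a", "b"]
    }
# Example 2 : the name carries a [] suffix (?key[]=a&key[]=b, with the brackets percent-encoded)
params_suffixed = {
        "key[]" : ["a", "b"]
    }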
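To see what these two conventions produce in a URL, the sketch below encodes both dictionaries with the standard library's `urlencode`. The `doseq=True` option tells it to emit one `name=value` pair per list item, which is the behaviour being illustrated here; it is a stand-in for what happens when such a dictionary is passed as `params`, not a call to `requests` itself.

```python
from urllib.parse import urlencode

# Convention 1: the plain name is repeated once per value
repeated = urlencode({"key": ["a", "b"]}, doseq=True)
print(repeated)  # key=a&key=b

# Convention 2: the name ends with [], which is percent-encoded as %5B%5D
suffixed = urlencode({"key[]": ["a", "b"]}, doseq=True)
print(suffixed)  # key%5B%5D=a&key%5B%5D=b
```

Which convention to use depends on the server you are querying, so check its documentation.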

**Headers**

We have seen that http responses have headers. Did you know that requests do, too ? Headers in http are used to transmit various information, like the version of your software when browsing.

What good is that to us? In some cases, you'll need those headers to request a particular data format. For example, the Ahab API of Perseids can output both XML and JSON. The default format is JSON, but what about getting XML? Just as the method functions have a `params` argument, they also accept a `headers` argument:


In [None]:
url = "http://www.perseids.org/apps-stage/ahab/rest/v1.0/search"
parameters = {
    "query" : "cicero",
    "urn" : "urn:cts:latinLit"
}
header = {
    "Accept" : "application/xml"
}
answer_1 = requests.get(url, params=parameters)
# Note that you can combine params and headers in the same call
answer_2 = requests.get(url, params=parameters, headers=header)


print("Without headers")
print("Content type : " + answer_1.headers["Content-Type"])
print(answer_1.text[0:100])
print("-----")
print("With headers")
print("Content type : " + answer_2.headers["Content-Type"])
print(answer_2.text[0:100])
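Once the `Content-Type` header tells you which format came back, you can pick the matching parser. The following is a minimal sketch using only the standard library; the helper name `parse_body` and the two sample payloads are invented for the example and are not actual output of the Ahab API.

```python
import json
import xml.etree.ElementTree as ET


def parse_body(content_type, text):
    """Pick a parser based on the value of the Content-Type header."""
    # Drop optional details such as "; charset=utf-8"
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(text)
    if media_type in ("application/xml", "text/xml"):
        return ET.fromstring(text)
    return text  # unknown format: hand the raw text back


# Made-up payloads standing in for the two answers
as_json = parse_body("application/json; charset=utf-8", '{"query": "cicero"}')
as_xml = parse_body("application/xml", "<result><query>cicero</query></result>")

print(as_json["query"])
print(as_xml.find("query").text)
```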


**Files**

Sending a file works in a similar way, but through the `files` argument rather than `params`. Pass a dictionary whose value is an open file object (opened in binary mode); `requests` takes care of encoding and transmitting the data.

In [None]:
# DO NOT RUN THIS CELL
url = "https://api.imgur.com/3/image"
# The imgur API takes an image parameter with a file in it.
# The with block closes the file once the upload is done.
with open("images/leonardo.jpg", "rb") as image_file:
    answer = requests.post(url, files={"image": image_file})

**What you have learned**

- Importing requests module
- HTTP Methods
- Difference between API and Webpage
- Querying a resource with `requests`
    - Adding parameters
    - Adding headers
- Reading the response of a request:
    - Reading the text
    - Status codes
    - Headers

## 4. Getting JSON out of APIs

Most of the time you'll encounter two data formats: XML and JSON, whether they carry RDF or a custom description. While XML is a bit more complex to deal with, JSON is quite simple. How so? A JSON document maps directly onto Python dictionaries and lists.

> JSON (/ˈdʒeɪsən/ JAY-sən), or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML.

> Although originally derived from the JavaScript scripting language, JSON is a language-independent data format. Code for parsing and generating JSON data is readily available in many programming languages.

> The JSON format was originally specified by Douglas Crockford. It is currently described by two competing standards, RFC 7159 and ECMA-404. The ECMA standard is minimal, describing only the allowed grammar syntax, whereas the RFC also provides some semantic and security considerations. The official Internet media type for JSON is application/json. The JSON filename extension is .json.

\- https://en.wikipedia.org/wiki/JSON
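To make the link between JSON and Python objects concrete, here is a small sketch with the standard library's `json` module. The JSON text is made up for the example; it only shows how objects, arrays, `true` and `null` come out on the Python side.

```python
import json

# A made-up JSON document, as it could arrive in the body of a response
raw = '{"title": "Aeneid", "author": {"name": "Vergil"}, "books": 12, "opening": ["Arma", "virumque", "cano"], "complete": true, "translator": null}'

data = json.loads(raw)

print(type(data))                 # a JSON object becomes a dict
print(data["author"]["name"])     # nested objects become nested dicts
print(data["opening"][0])         # a JSON array becomes a list
print(data["complete"], data["translator"])  # true -> True, null -> None

# The reverse operation turns Python objects back into JSON text
print(json.dumps(data, indent=2))
```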

Let's see how we would call an API and read its JSON. Do you know Github? Github is a platform for collaborating on a code base and versioning it. It offers a user interface on top of git, with some extensions. For example, we can open bug or enhancement tickets called issues. Those issues can then be tracked through commits, pull requests (asking to merge changes into the main codebase), etc.

Github is one of the few large websites whose API does not require authentication for every request. For example, you can easily get the issues from the former repository of Perseus texts:

In [None]:
url = "https://api.github.com/repos/perseusDL/canonical/issues"
response = requests.get(url)
print(response.text[0:100])

As you can see, the response is JSON. Now, we can simply transform its text into a Python object through a method of the `requests` response:

In [None]:
# Like .text or .headers, the response object also offers a json() method
answer = response.json()
print(type(answer))

Github's API answers with a list of issues, each of which is a dictionary. To get the title and the text of the first issue, we write the following:

In [None]:
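# answer is a list : index 0 is the first issue
print("Title : " + answer[0]["title"])
# The body can be empty (None), so convert it to a string before concatenating
print("Body  : " + str(answer[0]["body"]))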

**DIY**

Without opening the Github website, can you find how many open issues there are in the repository [*Capitains/MyCapytain*](https://github.com/Capitains/MyCapytain)? Can you list the usernames of everyone who has opened issues (without duplicates!)? (Hint: the API is built using the following URL namespace api.github.com/repos/*username*/*repo-name*/issues)

In [8]:
# Put your code here

## 5. Handling errors

The web is great; Python is great. But sometimes, despite their greatness, Python or the web will fail you. Regardless of whether it's your fault or the server's, you will encounter a situation where you're simply unable to retrieve information. To deal with that, there's a general system in Python called error handling.

In Python, you have probably already seen errors, such as the one created by the following snippet:

In [None]:
a = 5/0

To deal with that, Python, like most languages, uses the concept of "try". `try` attempts to run your code and, if an error occurs, hands control to its counterpart `except`, which tells the computer what to execute in case of failure. The rest of your program then continues instead of stopping.

In [None]:
try:
    a = 3 / 0
except:
    pass # pass means : do nothing, just let it go
print("Hello")

This is really powerful, and you'll use it often. To go a little further, `except` takes slightly more complex syntax if required.

In [None]:
try:
    a = 3 / 0
except NotImplementedError:
    print("This function is not implemented")
except ZeroDivisionError:
    print("Error when tried to divide by zero")

As you can see, only the second print is executed here. You can pass a specific error name to `except` to handle an error according to its type. Here, the first is raised when something is not implemented, the second when trying to divide by zero.

In [None]:
try:
    a = 3 / 0
except Exception as e:
    print(e)

`Exception` is the mother of nearly all errors: in Python, every ordinary error descends from it (only a few special cases, such as `KeyboardInterrupt`, do not). With this snippet you therefore handle any such exception, regardless of its type. We have also captured the error: the syntax `except Exception as e` stores the current error in a variable named `e`, which we simply print.
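The family tree of errors can be inspected directly, which helps when deciding what to write after `except`. The sketch below uses only built-in exceptions; the helper name `safe_divide` is invented for the example. Note that handlers are tried from top to bottom, so the most specific one should come first.

```python
# Every class knows its ancestors: ZeroDivisionError descends from Exception
print(ZeroDivisionError.__mro__)
print(issubclass(ZeroDivisionError, ArithmeticError))  # True
print(issubclass(ZeroDivisionError, Exception))        # True
print(issubclass(KeyboardInterrupt, Exception))        # False: one of the special cases


def safe_divide(a, b):
    """Divide a by b, returning None instead of failing."""
    try:
        return a / b
    except ZeroDivisionError as e:
        # Specific handler, tried first
        print("Specific handler:", e)
        return None
    except Exception as e:
        # General handler, catches whatever the handler above did not
        print("General handler:", e)
        return None


print(safe_divide(6, 3))
print(safe_divide(3, 0))
print(safe_divide(3, "x"))
```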

`requests` raises some errors of its own. For example, when a server cannot be reached, it raises a `requests.exceptions.ConnectionError`. A bad HTTP code such as 404 or 400 does not raise anything by itself: it becomes a `requests.exceptions.HTTPError` only when you ask for it by calling `raise_for_status()` on the response.

In [None]:
try:
    requests.get("http://thissitedoesnotexist.digitalhumanities")
except requests.exceptions.ConnectionError as E:
    print(E)

In addition to these errors, `requests` can raise an error when the status code signals a failure, through `raise_for_status()` (a method of the response, just like `json()`):

In [None]:
url = "https://httpbin.org/status/500"

response = requests.get(url)
try:
    response.raise_for_status() # raises an HTTPError for 4xx and 5xx status codes
except requests.exceptions.HTTPError as E:
    print(E)
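To see how `raise_for_status()` behaves for a good and a bad status without touching the network, the sketch below fills in two `requests.Response` objects by hand. Building responses manually is a shortcut taken for this example only (normally they come back from `requests.get()`), and the helper name `describe_failure` is made up.

```python
import requests


def describe_failure(response):
    """Return None if the status is fine, otherwise the error message."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as error:
        return str(error)
    return None


# Hand-built responses, so that nothing is sent over the network
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(describe_failure(ok))
print(describe_failure(missing))
```

A 200 goes through silently, while the 404 is turned into an `HTTPError` whose message starts with the status code.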

**DIY**

Call the URL "https://fr.wikipedia.org/wieki/Erreur_HTTP_404". Make sure the code runs, and reports a failure only if the page has a wrong status code.

In [53]:
url = "https://fr.wikipedia.org/wieki/Erreur_HTTP_404"

## 6. Use case : the CTS API and its AHAB extension

> The Canonical Text Services protocol defines interaction between a client and server providing identification of texts and retrieval of canonically cited passages of texts

> C. Blackwell and N. Smith, http://cite-architecture.github.io/cts_spec/

The CTS API provides a way to retrieve chunks of text based on persistent identifiers. This means that, whatever the implementation, you will be able to perform the same queries with the same identifiers or parameters. This standard was developed to address the problem of interoperability across platforms serving texts, particularly Latin and Greek texts.

For a long time, most Perseus texts were accessible mainly as whole files on Github or through some API. To get a Perseus text from the canonical repository, type the following:

In [None]:
# URL of the Aeneid
url = "https://raw.githubusercontent.com/PerseusDL/canonical-latinLit/master/data/phi0690/phi003/phi0690.phi003.perseus-lat1.xml"
answer = requests.get(url)
print(answer.text[0:200])

With a CTS API, you can ask for particular content. For example, we could ask for the first line of the first poem of the first book of Martial's Epigrammata :

In [None]:
params = {
    "inv" : "nemo",
    "urn" : "urn:cts:latinLit:phi1294.phi002.perseus-lat2:1.1.1",
    "request" : "GetPassage"
}
url = "http://services2.perseids.org/exist/restxq/cts"

answer = requests.get(url, params=params)
print(answer.text[0:200])

CTS APIs are built around CTS URNs. A CTS URN such as "urn:cts:latinLit:phi1294.phi002.perseus-lat2:1.1.1" is composed of the following parts:

- "cts" is the URN namespace
- "latinLit" the CTS namespace (here Latin literature)
- "phi1294.phi002.perseus-lat2" represents the work
    - "phi1294" represents Martial, it's called a textgroup (an author or group of authors)
    - "phi002" represents Epigrammata, the work of the author
    - "perseus-lat2" represents the version of a work, in this case an edition digitized by Perseus. This part is optional.
- 1.1.1 represents a passage. Martial's Epigrammata have 3 levels of citation : Book, Poem and Line.
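The structure described above can be taken apart with plain string operations. The following is a minimal sketch, not a full implementation of the CTS URN syntax: it ignores passage ranges and subreferences, and the function name `parse_cts_urn` and the dictionary keys are choices made for the example.

```python
def parse_cts_urn(urn):
    """Split a simple CTS URN into its named components."""
    parts = urn.split(":")
    # parts is ["urn", "cts", namespace, work identifier, (optional) passage]
    work_parts = parts[3].split(".")
    return {
        "urn_namespace": parts[1],
        "cts_namespace": parts[2],
        "textgroup": work_parts[0],
        "work": work_parts[1] if len(work_parts) > 1 else None,
        "version": work_parts[2] if len(work_parts) > 2 else None,
        # One entry per citation level, e.g. book, poem and line
        "passage": parts[4].split(".") if len(parts) > 4 else [],
    }


martial = parse_cts_urn("urn:cts:latinLit:phi1294.phi002.perseus-lat2:1.1.1")
for name, value in martial.items():
    print(name, ":", value)
```

Going the other way, joining the components back with ":" and "." gives you the URN to pass to the API.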

CTS then has different request types, selected through the parameter "request":

- *GetPassage* : get the passage of a text
- *GetPassagePlus* : get the passage of a text with metadata
- *GetCapabilities* : get the content of the text repository
- and others, which you can read about in the CTS Spec (see the Going further section below)

Those requests take at least one parameter: `inv`, which represents a text inventory. Passage-related requests take a second parameter, `urn`, whose value is a CTS URN.
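Since every CTS request follows the same pattern, the parameters dictionary can be produced by a small helper. This is only a sketch: the function name `cts_parameters` is invented, and it covers just the parameters named in this section.

```python
def cts_parameters(request, inv, urn=None):
    """Build the parameters dictionary for a CTS request."""
    parameters = {"request": request, "inv": inv}
    # Only passage-related requests need a urn
    if urn is not None:
        parameters["urn"] = urn
    return parameters


capabilities = cts_parameters("GetCapabilities", "nemo")
passage = cts_parameters(
    "GetPassage", "nemo", "urn:cts:latinLit:phi1294.phi002.perseus-lat2:1.1.1"
)
print(capabilities)
print(passage)
```

Either dictionary can then be passed as the `params` argument of `requests.get()`.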


**DIY**

Can you choose a Greek text from the inventory "nemo" of Perseids and get its first passage ?

In [62]:
# Write your code here

## Exercises
1\. "Arma virumque cano"

Using the following URL, retrieve the first line of the Aeneid without XML markup (Hint: use regular expressions!)

In [None]:
#Use this url
url = "http://services2.perseids.org/exist/restxq/cts?request=GetPassage&inv=nemo&urn=urn:cts:latinLit:phi0959.phi001.perseus-lat2:1.1.1"


2\. Code study

Can you explain the following code : https://gist.github.com/anonymous/93260e06c985e26bf99c#file-etym-py-L11-L63


3\. Using the previous exercise's output, can you find the etymology of "hood" ?

4\. Can you retrieve the author's name and the title of the book's passage represented by `urn:cts:greekLit:tlg0028.tlg005.perseus-grc1:10` using regular expressions and requests ?

5\. Can you build a dictionary of commits whose keys are the commits' sha values and whose values are dictionaries containing the commit title and the username responsible for it?

In [1]:
url = "https://api.github.com/repos/Capitains/MyCapytain/commits"

## Going further
1. Documentation of `requests` : http://docs.python-requests.org
2. About error handling with `requests` : http://www.mobify.com/blog/http-requests-are-hard/
3. The CTS Norm : http://cite-architecture.github.io/cts_spec/
4. Capitains, an organization built for providing tools for CTS : http://capitains.github.io

-----

In [19]:
# Do not care about this cell, it's just here to make the page nicer.

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>