# Python and REST APIs

Python can be used instead of `curl` for accessing REST APIs. The most useful library for this is [requests](https://requests.readthedocs.io/en/latest/). When combined with the `json` library in Python, we can easily write small programs, and wrap them into command-line utilities.

We simply import the `requests` library, and use it to retrieve data. The OpenAlex API returns JSON data, and the `requests` library makes it easy to access that data as a Python dictionary. Here is a short example.



In [None]:
import requests

req = requests.get('https://api.openalex.org/institutions?search=carnegie+mellon+university')
data = req.json()
data



Now it is easy to replicate the example from last class and show each result with its `works_count` and `cited_by_count`, with some simple column formatting.



In [None]:
for result in data['results']:
    print(f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}')
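
The format specification inside the f-string does the column alignment: `:50s` pads a string to 50 characters, and `:10d` right-aligns an integer in a field 10 characters wide. Here is a minimal sketch you can experiment with without calling the API. The record below is made up for illustration (the name and counts are placeholders, not real OpenAlex data), and `format_row` is a hypothetical helper name.

```python
# Hypothetical record with the same keys the API results use above;
# the values are made up for illustration.
sample = {'display_name': 'Example University',
          'works_count': 12345,
          'cited_by_count': 678901}

def format_row(result):
    """Return one row: a 50-character name field and two 10-character counts."""
    return (f'{result["display_name"]:50s}'
            f'{result["works_count"]:10d}'
            f'{result["cited_by_count"]:10d}')

row = format_row(sample)
print(row)
print(len(row))  # 50 + 10 + 10 = 70
```

Note that a name longer than 50 characters is not truncated; it simply pushes the counts to the right.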



## Our first Python-based shell script



It is convenient to use the notebook for this, but let's convert this to a script that takes an argument. Let's do this in a few pieces.

1. Create a file called `oa_inst.py`, and make it executable (e.g. with `chmod +x oa_inst.py`).

In the file, add these lines:

```
#!/usr/bin/env python

import sys
print(sys.argv)
```

The first line (the so-called shebang line) tells the shell which interpreter to use, in this case Python. Then we import the `sys` module. This module provides basic access to the command-line arguments through the `sys.argv` attribute, which is a list of strings.

Run your script with a few examples:

    ./oa_inst.py
    ./oa_inst.py carnegie mellon university
    
You can see the first element of `sys.argv` is always the script name. All the other elements are what we call the command-line arguments. 



We need to join these with + as we did in the shell script before. It is easy to do in Python. Now, add these lines:

```
query = '+'.join(sys.argv[1:])
url = f'https://api.openalex.org/institutions?search={query}'

import requests

req = requests.get(url)
data = req.json()

for result in data['results']:
    print(f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}')
```

Now you should be able to run this python script like the shell script.
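
Joining with `+` works for simple words, but a search term containing characters such as `&` or `#` would produce a malformed URL. One possible variation, sketched here with the standard library's `urllib.parse.urlencode`, does the encoding for us. The helper name `build_url` is hypothetical and not part of the script above.

```python
from urllib.parse import urlencode

BASE = 'https://api.openalex.org/institutions'

def build_url(argv):
    """Build the search URL from a sys.argv-style list (script name first)."""
    terms = ' '.join(argv[1:])
    # urlencode turns spaces into + and escapes reserved characters.
    query = urlencode({'search': terms})
    return f'{BASE}?{query}'

print(build_url(['oa_inst.py', 'carnegie', 'mellon', 'university']))
```

For plain words this gives the same URL as the `'+'.join(...)` version, so it could be swapped in without changing the rest of the script.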

You can move the script to ~/bin if you put that on your path like we described in the first lecture, and then run that from anywhere.

Our script is not without issues. They aren't big issues, but we can *only* use this script in the shell. We can't import it and use it here in the notebook. If you do import it, all of the top-level code runs at import time. In a notebook, `sys.argv` typically holds the arguments the kernel was started with rather than your search terms, so the query is meaningless and the script doesn't work right.

We need to separate some things out so we can have a script *and* importable library.



In [None]:
import oa_inst
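
A common way to get both behaviors is to put the work in functions and guard the script entry point with `if __name__ == '__main__':`. The following is a minimal sketch of that layout, not the final script; the function names `build_query` and `main` are placeholders chosen for illustration.

```python
import sys

def build_query(terms):
    """Join search terms with + (safe to import, no side effects)."""
    return '+'.join(terms)

def main(argv):
    """Script entry point: argv[0] is the script name, the rest are terms."""
    print(build_query(argv[1:]))

# __name__ is '__main__' only when the file is run as a script,
# so importing this module does not call main().
if __name__ == '__main__':
    main(sys.argv)
```

With this layout, `from some_module import build_query` gives you the function without triggering any request or printing.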



## Getting better than sys.argv

`sys.argv` is really only suitable for the simplest of command line arguments, and it isn't really even great for those. Among the limitations are:

1. No option parsing (or you have to write your own)
2. No built-in help or documentation

Some built-in core libraries in Python can help with this, e.g. [argparse](https://docs.python.org/3/library/argparse.html). There are also third-party libraries like [click](https://palletsprojects.com/p/click/). 
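
For comparison, here is a minimal sketch of what the standard library's `argparse` version might look like. The `parse_args` wrapper is a hypothetical helper; it accepts an explicit list so it can be tried in a notebook without a real command line.

```python
import argparse

def parse_args(argv=None):
    """Parse search terms; argv=None means use the real command line."""
    parser = argparse.ArgumentParser(description='OpenAlex Institutions')
    # nargs='+' collects one or more words into a list.
    parser.add_argument('query', nargs='+', help='search terms')
    return parser.parse_args(argv)

args = parse_args(['carnegie', 'mellon', 'university'])
print(args.query)  # ['carnegie', 'mellon', 'university']
```

`argparse` also generates `--help` output automatically, which addresses both limitations listed above.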

Let's rewrite the script above using click. The principal idea is that we write a function that does what we want with some arguments, use the click library to convert the command-line arguments into variables we use in the function, and then only run the function when we run the script (as opposed to importing from it).

Making this look easy requires some advanced Python skills. Let's work out a reusable function first. I am writing this with some 20/20 hindsight we don't have yet. We will return later to why we wrote the function this specific way. This function should take a list of terms, or a string to query. Either way, we convert it to a string with each word joined by +. Then, we return a formatted string for each result found.



In [None]:
import requests 
from collections.abc import Iterable 

def openalex_institution(query):
    'query is a list of terms in the query, or a string.'
    
    # Replace spaces with +
    if isinstance(query, str):
        query = '+'.join(query.split())

    # If it is not a string, we assume it is an iterable of strings.
    elif isinstance(query, Iterable):
        query = '+'.join(query)
        
    else:
        raise TypeError('query should be a string or an iterable of strings')
        
    url = f'https://api.openalex.org/institutions?search={query}'
    req = requests.get(url)
    data = req.json()

    return [f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}'
            for result in data['results']]
            
openalex_institution('carnegie mellon university')            



In [None]:
# Test with list of words
openalex_institution(['carnegie', 'mellon', 'university'])          



Note our function returns data in the form of a list of strings. Later, we can join them into a single string like this.



In [None]:
print('\n'.join(openalex_institution(['carnegie', 'mellon', 'university'])))



## Basic click usage
That is the independent, reusable part. Now, let's look at how click works. We create a function that does what we want, and then decorate it with click functions. Start by creating a new file, `oa_inst2.py`, with these contents, and make it executable. The `main` function is what will run in our script, and only when we run this as a script.

```
#!/usr/bin/env python
import click

@click.command(help='OpenAlex Institutions')
@click.argument('query', nargs=-1)
def main(query):
    print(query)
    
if __name__ == '__main__':
    main()
```

Now you can see that we automatically get help:



In [None]:
! ./oa_inst2.py --help
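
If you want to try the command without leaving Python, click ships a test helper, `click.testing.CliRunner`, that invokes a command in-process and captures its output. This is a small sketch that repeats the `main` function from the file above; it assumes click is installed in the notebook environment.

```python
import click
from click.testing import CliRunner

@click.command(help='OpenAlex Institutions')
@click.argument('query', nargs=-1)
def main(query):
    print(query)

runner = CliRunner()

# Equivalent to running the script with --help in the shell.
result = runner.invoke(main, ['--help'])
print(result.exit_code)
print(result.output)
```

The same `runner.invoke(main, [...])` call accepts any list of arguments, which makes it convenient for checking a command-line interface in a notebook or a test.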



We can also run the command with a few arguments. Here you see that the arguments become a tuple of strings. 



In [None]:
! ./oa_inst2.py carnegie mellon university
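
The tuple matters because of how we wrote the reusable function: a tuple is an `Iterable`, so it takes the same branch as a list. The sketch below isolates just that normalization step in a hypothetical helper, `to_query`, to show both cases. The string check has to come first, because a `str` is also an `Iterable`.

```python
from collections.abc import Iterable

def to_query(query):
    """Normalize a string or an iterable of strings to a +-joined query."""
    if isinstance(query, str):
        return '+'.join(query.split())
    if isinstance(query, Iterable):
        return '+'.join(query)
    raise TypeError('query should be a string or an iterable of strings')

args = ('carnegie', 'mellon', 'university')  # the kind of tuple click passes to main
print(isinstance(args, Iterable))
print(to_query(args))
print(to_query('carnegie mellon university'))
```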



Now, we combine the function we worked out above with the script. We make one modification to the `main` function here: we join the list of strings returned by `openalex_institution` with `\n`, and then print that string so we can see it on stdout in the shell.

```
#!/usr/bin/env python
import click

import requests 
from collections.abc import Iterable 

def openalex_institution(query):
    'query is a list of terms in the query, or a string.'
    if isinstance(query, str):
        query = '+'.join(query.split())

    # We assume it is an iterable of strings.
    elif isinstance(query, Iterable):
        query = '+'.join(query)
        
    url = f'https://api.openalex.org/institutions?search={query}'
    req = requests.get(url)
    data = req.json()

    return [f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}'
            for result in data['results']]

@click.command(help='OpenAlex Institutions')
@click.argument('query', nargs=-1)
def main(query):
    print('\n'.join(openalex_institution(query)))
    
if __name__ == '__main__':
    main()
```    



In [None]:
! ./oa_inst2.py carnegie mellon university



Finally, we can import the function we wrote and use it in the notebook. This import works because the Python file is in this directory. Later we will learn how to do this more generally. For now, the critical idea is that we have one file that can be used in two different ways: as a script in a shell, and as a Python library from which you can import the same function for use in a notebook, or even in another script.



In [None]:
from oa_inst2 import openalex_institution
print('\n'.join(openalex_institution('carnegie+mellon+university')))



# Back to the author endpoint

You can access an author from a URL like this:
https://api.openalex.org/authors/https://orcid.org/0000-0003-2625-9232. Let's look at a few things. We get the name, the number of works, and a URL to those works. Although the record says there are 172 works, it does not list them. Instead, it provides a URL for retrieving them. Let's follow this URL and see what is there.



In [None]:
import requests

url = 'https://api.openalex.org/authors/https://orcid.org/0000-0003-2625-9232'
data = requests.get(url).json()
data['display_name'], data['works_count'], data['works_api_url']



The response is another set of JSON data.



In [None]:
works = requests.get(data['works_api_url']).json()
works['meta']



This new data has a new feature. There are 172 works, but this "page" of data contains only 25 results. We have to consider how to access all the pages to get the rest of the data. Paging is described at https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging. The gist is that we add parameters to the URL to choose the page and the number of results per page. We can request up to 200 results per page, and this example has only 172 works, so a single page is enough.



In [None]:
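# `params` lets requests merge paging options into the URL's existing query string.
params = {'page': 1, 'per-page': min(data['works_count'], 200)}
works = requests.get(data['works_api_url'], params=params).json()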
works['meta']
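
For an author with more than 200 works, one page is not enough, and we would have to loop over pages. The following is a rough sketch of one way to do that, not the only one. Both function names are hypothetical, and `fetch_json` is an assumed callable that you supply, which keeps the paging logic separate from the network call so it can be tried offline.

```python
import math

def page_params(works_count, per_page=200):
    """Return one params dict per page needed to cover works_count results."""
    n_pages = math.ceil(works_count / per_page)
    return [{'page': page, 'per-page': per_page}
            for page in range(1, n_pages + 1)]

def fetch_all_works(works_api_url, works_count, fetch_json):
    """Collect the results from every page.

    fetch_json(url, params) must return the decoded JSON for one page, e.g.
    lambda url, params: requests.get(url, params=params).json()
    """
    results = []
    for params in page_params(works_count):
        results.extend(fetch_json(works_api_url, params)['results'])
    return results

print(page_params(172))
print(len(page_params(450)))
```

In the notebook you could call `fetch_all_works(data['works_api_url'], data['works_count'], fetch_json)` with the `requests`-based lambda shown in the docstring.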



Our goal now is to retrieve each work, and get the `cited_by_count` for each paper. Then, we will compute the H-index for this list of papers. The H-index is the largest number H such that H papers have at least H citations each. The loop below relies on the works being sorted by descending citation count, as they are here.



In [None]:
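# Sort defensively so the loop is correct even if the API order changes.
citations = sorted((work['cited_by_count'] for work in works['results']), reverse=True)
for i, cite in enumerate(citations, start=1):
    if cite < i:
        print(f'H-index = {i - 1}')
        break
    print(f'{i:3d}{cite:8d}')
else:
    # No break: every paper has at least as many citations as its rank.
    print(f'H-index = {len(citations)}')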
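
The same logic can be wrapped in a small function that returns the value instead of printing it, which makes it easier to reuse and to check against hand-computed cases. This is a minimal sketch; `h_index` is a name chosen here for illustration, and the citation counts are made-up test values.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites < rank:
            break
        h = rank
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4
```

In this example four papers have at least 4 citations, but there are not five papers with at least 5, so the H-index is 4.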



# Group exercise

Work together to create a Python-based shell script that takes an ORCID and computes the H-index for the author.

