<div align="center" style="border:solid 1px gray;">
    <a href="https://openalex.org/">
        <img src="../../resources/img/OpenAlex-banner.png" alt="OpenAlex banner" width="300">
    </a>
</div>

# Calculate the h-index for a given author

<div style='background:#e7edf7'>
    In this notebook we will use data from the OpenAlex API to determine:
    <blockquote>
        <b><i>What is the h-index for a given researcher?</i></b>
    </blockquote>
    To get to the bottom of this, we will use the following API functionalities: 
    <a href="https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists">filtering</a>, 
    <a href="https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/sort-entity-lists">sorting</a> and
    <a href="https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging">paging</a>.
</div>
<br>


Whether you want to check if you're on track to become a Nobel laureate, are looking for a new academic position, or are trying to secure funding, at some point in your academic career you might want (or need) to calculate your h-index.  
In this tutorial, we will guide you through an example of how to compute the h-index of a researcher using metadata from OpenAlex. 

### Steps
The data needed for determining a researcher’s h-index is a list of all their publications along with the number of citations each publication has received.  
To obtain this information from OpenAlex, we will divide the query process into the following three steps:
* 1. Gather a list of the researcher's publications.
* 2. Determine the number of citations each publication has received.
* 3. Sort the publications by citation count (from highest to lowest).

Once we have a sorted list of citation counts, we can move on to the final step:  
* 4. Calculate the h-index


### Input
The only input we need is an identifier for the selected researcher; here we use the researcher's [ORCID](https://info.orcid.org/researchers/).  
The selected researcher for our example will be Heather Piwowar (https://orcid.org/0000-0003-1613-5981):

In [1]:
# input
orcid = 'https://orcid.org/0000-0003-1613-5981'
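
Because everything downstream depends on this one identifier, it can be worth checking it before sending any request. The following is a minimal sketch, not part of the original notebook: the helper names `orcid_check_digit` and `normalize_orcid` are hypothetical, and the check assumes the ISO 7064 MOD 11-2 checksum that ORCID documents for the last character of an iD.

```python
import re

# accepts a bare iD or the full URL form (hypothetical helper, not an OpenAlex API)
ORCID_PATTERN = re.compile(r'^(?:https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$')

def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return 'X' if result == 10 else str(result)

def normalize_orcid(value):
    """Return the ORCID as a full https URL, or raise ValueError if it looks invalid."""
    match = ORCID_PATTERN.match(value.strip())
    if not match:
        raise ValueError(f'not an ORCID: {value!r}')
    bare = match.group(1)
    digits = bare.replace('-', '')
    if orcid_check_digit(digits[:15]) != digits[15]:
        raise ValueError(f'ORCID checksum mismatch: {value!r}')
    return f'https://orcid.org/{bare}'

print(normalize_orcid('0000-0003-1613-5981'))
```

With such a helper, a bare iD and the full URL both end up in the URL form that the `author.orcid` filter in the next step uses.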

All set, so let's dive in!

<hr>

## 1. Gather a list of the researcher's publications
To query the OpenAlex API for a list of publications, we need to put together a suitable URL that specifies the data we are looking for.  
Two pieces of information are needed for this URL:

1. Which **entity type** (author, concept, institution, venue, work) should the query return as a result?
* --> Since we want to query for "_a list of publications,_" the entity type should be `works`.  

2. What are the **criteria** the `works` need to fulfill to fit our purpose?
* Here we need to look into the list of available [filters for works](https://docs.openalex.org/api-entities/works/filter-works) and select the appropriate ones to specify the subset of `works` we are looking for.  
* -->  We want to query for "*a list of the researcher's publications*", so we will filter for the works that:  
  * list the researcher as an author in at least one [authorship](https://docs.openalex.org/api-entities/works/work-object#the-authorship-object):  `author.orcid:https://orcid.org/0000-0003-1613-5981`, 
  * are not [paratext](https://docs.openalex.org/api-entities/works/work-object#is_paratext):  `is_paratext:false`

<br>

From these two pieces of information we can **put the URL together** as follows:
* Starting point is the base URL of the OpenAlex API: `https://api.openalex.org/` 
* We append the entity type to it: `https://api.openalex.org/works` 
* All criteria need to go into the query parameter `filter` that is added after a question mark: `https://api.openalex.org/works?filter=` 
* To construct the filter value we take the criteria we specified and concatenate them using commas as separators: `https://api.openalex.org/works?filter=author.orcid:https://orcid.org/0000-0003-1613-5981,is_paratext:false`

With this URL we can get all of Heather's works from OpenAlex!

In [2]:
def build_author_works_url(orcid):
    # specify endpoint
    endpoint = 'works'

    # build the 'filter' parameter
    filters = (
      f'author.orcid:{orcid}',
      'is_paratext:false'
    )

    # put the URL together
    return f'https://api.openalex.org/{endpoint}?filter={",".join(filters)}'

author_works_url = build_author_works_url(orcid)
print(f'complete URL with filters:\n{author_works_url}')

complete URL with filters:
https://api.openalex.org/works?filter=author.orcid:https://orcid.org/0000-0003-1613-5981,is_paratext:false
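
String concatenation works well for this fixed filter. If further query parameters are added later, a more general builder may be easier to maintain. The sketch below is one possible variant using the standard library's `urllib.parse.urlencode`; the function name `build_works_url` and the choice of which characters to leave unescaped are our assumptions, not something the OpenAlex documentation prescribes.

```python
from urllib.parse import urlencode

def build_works_url(filters, **params):
    """Build an OpenAlex /works URL from filter strings plus extra query parameters."""
    query = {'filter': ','.join(filters), **params}
    # keep ':', '/' and ',' unescaped so the filter stays readable;
    # anything else that is unsafe in a query string gets escaped
    return 'https://api.openalex.org/works?' + urlencode(query, safe=':/,')

url = build_works_url([
    'author.orcid:https://orcid.org/0000-0003-1613-5981',
    'is_paratext:false',
])
print(url)
```

For the two filters used here, this produces the same URL as `build_author_works_url`.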


<hr>

## 2. Determine each publication's citation count
The next step is to determine the citation count for each of Heather's publications.  
Fortunately, we don't have to look too far: the [metadata for each publication](https://docs.openalex.org/api-entities/works/work-object) already contains an attribute called `cited_by_count`, which is defined as *"The number of citations to this work."*  This attribute is exactly what we are looking for!

Let's download the list of publications from OpenAlex (using [paging](https://github.com/ourresearch/openalex-api-tutorials/blob/main/notebooks/getting-started/paging.ipynb)) and extract `cited_by_count` from each publication:

In [3]:
import requests

def get_all_citations(works_url):
    works_url_with_cursor = works_url + '&cursor={}'

    # loop through pages
    cursor = '*'
    citation_counts = []
    while cursor:
        # set cursor value and request page from OpenAlex
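        url = works_url_with_cursor.format(cursor)
        response = requests.get(url)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        page_with_results = response.json()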

        # loop through partial list of results
        # extract citation count from every work
        results = page_with_results['results']
        citation_counts += [work['cited_by_count'] for work in results]

        # update cursor to meta.next_cursor
        cursor = page_with_results['meta']['next_cursor']

    return citation_counts

citation_counts = get_all_citations(author_works_url)
print("complete list of citation counts:\n" + ', '.join(str(x) for x in citation_counts))

complete list of citation counts:
656, 503, 337, 319, 169, 118, 108, 74, 73, 42, 32, 27, 27, 26, 25, 23, 22, 16, 14, 10, 6, 5, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
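
To make the paging logic easier to follow (and to try out without network access), the loop can be separated from the HTTP request. The following is a sketch under our own assumptions: `iter_cursor_results` and `fake_pages` are hypothetical names, and the fake pages only mimic the `results` and `meta.next_cursor` fields used in the function above.

```python
def iter_cursor_results(fetch_page, first_cursor='*'):
    """Yield every result across pages, following meta.next_cursor until it is empty."""
    cursor = first_cursor
    while cursor:
        page = fetch_page(cursor)
        yield from page['results']
        cursor = page['meta']['next_cursor']

# stand-in for the API: two pages with results, then an empty page without a next cursor
fake_pages = {
    '*': {'results': [{'cited_by_count': 5}, {'cited_by_count': 3}],
          'meta': {'next_cursor': 'abc'}},
    'abc': {'results': [{'cited_by_count': 1}],
            'meta': {'next_cursor': 'def'}},
    'def': {'results': [], 'meta': {'next_cursor': None}},
}

counts = [work['cited_by_count'] for work in iter_cursor_results(fake_pages.__getitem__)]
print(counts)  # [5, 3, 1]
```

In the notebook, `fetch_page` would be a small function that inserts the cursor into the works URL and returns the parsed JSON response.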


<hr>

## 3. Sort the publications by citation count
There is one more useful feature of the OpenAlex API that we can take advantage of to make calculating the h-index more efficient:  
Because we need the list of citation counts sorted from highest to lowest (as a prerequisite for h-index calculation), we can instruct the API to deliver the [list of publications already sorted](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/sort-entity-lists) by citation count.

To use this feature, we need to add the `sort` query parameter to our URL, where we specify
* the attribute we want to sort by: `cited_by_count` 
* along with the sorting order: descending

In [4]:
sort_value = 'cited_by_count:desc'
author_works_sorted_url = author_works_url + f'&sort={sort_value}'

print(f'complete URL with sort:\n{author_works_sorted_url}')

complete URL with sort:
https://api.openalex.org/works?filter=author.orcid:https://orcid.org/0000-0003-1613-5981,is_paratext:false&sort=cited_by_count:desc


<br>

Let's use this URL to call the function from the previous step that downloads a list of publications from OpenAlex and extracts their citation count.  
This will give us the list of citation counts but this time sorted from highest to lowest:

In [5]:
sorted_citation_counts = get_all_citations(author_works_sorted_url)
print("complete list of sorted citation counts:\n" + ', '.join(str(x) for x in sorted_citation_counts))

complete list of sorted citation counts:
656, 503, 337, 319, 169, 118, 108, 74, 73, 42, 32, 27, 27, 26, 25, 23, 22, 16, 14, 10, 6, 5, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
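
Since the h-index calculation in the next step relies on the list being in descending order, a quick local check can catch surprises early. This is a small sketch of such a check; the helper name `is_sorted_descending` is ours, and sorting locally with `sorted(..., reverse=True)` is shown only as a fallback assumption in case the order is not as expected.

```python
def is_sorted_descending(values):
    """Return True if no value is larger than the one before it."""
    return all(earlier >= later for earlier, later in zip(values, values[1:]))

sample_counts = [656, 503, 337, 319, 169, 118, 108, 74, 73, 42]  # first values from above

if not is_sorted_descending(sample_counts):
    # fallback: sort locally instead of relying on the API order
    sample_counts = sorted(sample_counts, reverse=True)

print(is_sorted_descending(sample_counts))  # True
```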


<hr>

## 4. Calculate h-index

From a list of citation counts sorted from highest to lowest, we can determine the h-index as the last position (counting from 1) at which the citation count is still greater than or equal to that position:

In [6]:
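# modified binary search over a list sorted from highest to lowest
def calculate_hirsch_index(sorted_citations):
    # returns the last 1-based position whose citation count is >= the position
    def hirsch_rec(low, high):
        # invariant: the 0-based index `low` always satisfies count >= position
        if low >= high:
            return low + 1

        mid = -(-(high + low) // 2)  # ceiling of (high + low) / 2 without importing math
        if sorted_citations[mid] >= mid+1:
            # position mid+1 still qualifies: continue in the right half, keeping mid
            return hirsch_rec(mid, high)
        else:
            # position mid+1 fails: the answer lies strictly to the left of mid
            return hirsch_rec(low, mid-1)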

    # handle edge case: no citations
    if not sorted_citations or sorted_citations[0]==0:
        return 0
    else:
        return hirsch_rec(0, len(sorted_citations)-1)

hindex = calculate_hirsch_index(sorted_citation_counts)
print(f'--> The specified researcher has an h-index of {hindex}.')

--> The specified researcher has an h-index of 17.
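
As a cross-check for the binary search, the same definition can be written as a plain linear scan. This is a minimal sketch (the name `h_index_linear` is ours): it walks the sorted list and stops at the first position whose citation count is smaller than the position. It is slower for very long lists but easy to verify by hand.

```python
def h_index_linear(sorted_citations):
    """h-index of citation counts sorted from highest to lowest."""
    h = 0
    for position, citations in enumerate(sorted_citations, start=1):
        if citations < position:
            break  # every later position fails as well, because the list is sorted
        h = position
    return h

# first 20 values of the sorted list above
sample_counts = [656, 503, 337, 319, 169, 118, 108, 74, 73, 42,
                 32, 27, 27, 26, 25, 23, 22, 16, 14, 10]
print(h_index_linear(sample_counts))  # 17
```

Position 17 holds 22 citations (22 >= 17), while position 18 holds only 16 (16 < 18), which matches the result of the binary search.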


<br>

While this is a quick solution, let's do it one more time **step by step** and visualize the process, so we can follow along:

We take the list of sorted citation counts and **put it in a table**:
* On the left side we add a column called "position" which is simply the publication's position in the sorted list (1,...,n).
* On the right side we add another column that will tell us for each row if the citation count is greater than or equal to its position in the list.

Based on the right column, we **divide the publications into two groups**:  
* the ones where citation count >= position (green) and 
* the ones where citation count < position (yellow).

The **h-index** is now simply the maximum value for position in the green group (circled).

In [7]:
import pandas as pd

def visualize_hirsch_index(citation_counts, hindex):
    # create table with columns position, citations, position<=citations?
    df = pd.DataFrame(citation_counts, columns=['citations'])
    df.insert(0, 'position', range(1, 1 + len(df)))
    df['position<=citations?'] = (df['position'] <= df['citations'])

    # highlight row and hindex
    def highlight_hindex_row(s, hindex):
        if s['position'] < hindex:
            return [''] + [''] + ['background-color: lightgreen;']
        if s['position'] == hindex:
            return ['border-radius: 50%;background-color: pink;border-bottom: 2px solid black;'] \
            + ['border-bottom: 2px solid black;'] \
            + ['background-color: lightgreen;border-bottom: 2px solid black;']
        #else: 
        return [''] + [''] + ['background-color: gold;']

    # style table: center columns, hide index, highlight rows
    df_styled = df.style.hide(axis="index") \
                      .set_properties(**{'text-align': 'center'}) \
                      .apply(highlight_hindex_row, hindex=hindex, axis=1)

    return df, df_styled

df, viz_df = visualize_hirsch_index(sorted_citation_counts, hindex)
display(viz_df)

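| position | citations | position<=citations? |
|---|---|---|
| 1 | 656 | True |
| 2 | 503 | True |
| 3 | 337 | True |
| 4 | 319 | True |
| 5 | 169 | True |
| 6 | 118 | True |
| 7 | 108 | True |
| 8 | 74 | True |
| 9 | 73 | True |
| 10 | 42 | True |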


-----
Now that you know how to determine the h-index using metadata from OpenAlex, feel free to use this notebook to calculate your own h-index.  
Additionally, if you are interested in further exploring the OpenAlex API, you could enhance the algorithm. One potential adaptation would be to exclude self-citations from the citation counts.  
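
As a starting point for the self-citation idea, here is a rough sketch of the counting step only. It assumes that the citing works of a publication have already been downloaded (for instance with the `cites` filter) and that each work carries the `authorships` list with an `author.id`, as described in the work object documentation. The function names and the toy IDs are hypothetical, and "self-citation" is taken here to mean that the citing work shares at least one author with the cited work; other definitions are possible.

```python
def author_ids(work):
    """Collect the author IDs listed in a work's authorships."""
    ids = set()
    for authorship in work.get('authorships', []):
        author_id = (authorship.get('author') or {}).get('id')
        if author_id:
            ids.add(author_id)
    return ids

def count_non_self_citations(work, citing_works):
    """Count citing works that share no author with the cited work."""
    own_authors = author_ids(work)
    return sum(1 for citing in citing_works if own_authors.isdisjoint(author_ids(citing)))

# toy data with made-up IDs, only to show the expected shape
cited = {'authorships': [{'author': {'id': 'A1'}}, {'author': {'id': 'A2'}}]}
citing = [
    {'authorships': [{'author': {'id': 'A1'}}]},                             # self-citation
    {'authorships': [{'author': {'id': 'A3'}}]},                             # external
    {'authorships': [{'author': {'id': 'A4'}}, {'author': {'id': 'A2'}}]},   # self-citation
]
print(count_non_self_citations(cited, citing))  # 1
```

The resulting counts could then be sorted from highest to lowest and passed to `calculate_hirsch_index` in place of `cited_by_count`.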


We hope this tutorial has been helpful, and we look forward to hearing how you are using data from OpenAlex!

Happy exploring! 😎