## Author Sequence Issues

Phase 1 found that some records list multiple first authors or no first author at all. The script below counts the occurrences of 'first' in the 'sequence' field across each record's author list, and flags records that have no authors, multiple first authors, or no first author.

In [1]:
import json

import pandas as pd

# Generator that yields one record at a time from the JSON sample file
def gen(file_name):
    with open(file_name, 'r') as fh:
        for record in json.load(fh):
            yield record

df = pd.DataFrame(columns=['DOI', 'issue'])
data = gen('bigger_sample.json')
for record in data:
    message = record['message']
    doi = message['DOI']
    # Non-articles typically do not have author fields, so skip them
    if message['type'] != 'journal-article':
        continue
    try:
        author_list = message['author']
        counter = 0
        for author in author_list:
            # 'name' is commonly the field name used when presenting
            # an institution as an author
            if 'name' not in author:
                if author['sequence'] == 'first':
                    counter += 1
            elif len(author_list) == 1:
                # A lone institutional author is counted as the first author
                counter += 1
        if counter == 0:
            df.loc[len(df)] = (doi, 'no first author')
        elif counter > 1:
            df.loc[len(df)] = (doi, 'multiple first authors')
    except KeyError:
        # A missing 'author' (or 'sequence') key is reported as 'no authors'
        df.loc[len(df)] = (doi, 'no authors')
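
As a quick sanity check, the same flagging rule can be run against a few hand-built records. This is a minimal sketch, not part of the original analysis: the function name `classify_authors` and the toy records are hypothetical, and the function uses `message.get('type')` so that a record without a 'type' key is skipped rather than raising an error.

```python
def classify_authors(message):
    """Return an issue label for one Crossref-style 'message' dict, or None."""
    # Hypothetical helper: mirrors the loop above for a single record
    if message.get('type') != 'journal-article':
        return None
    try:
        author_list = message['author']
        first_count = 0
        for author in author_list:
            if 'name' not in author:
                if author['sequence'] == 'first':
                    first_count += 1
            elif len(author_list) == 1:
                # A lone institutional author is counted as the first author
                first_count += 1
    except KeyError:
        return 'no authors'
    if first_count == 0:
        return 'no first author'
    if first_count > 1:
        return 'multiple first authors'
    return None


# Toy records (made up for illustration, not taken from the sample file)
ok = {'type': 'journal-article',
      'author': [{'family': 'A', 'sequence': 'first'},
                 {'family': 'B', 'sequence': 'additional'}]}
two_firsts = {'type': 'journal-article',
              'author': [{'family': 'A', 'sequence': 'first'},
                         {'family': 'B', 'sequence': 'first'}]}
no_first = {'type': 'journal-article',
            'author': [{'family': 'A', 'sequence': 'additional'}]}
no_authors = {'type': 'journal-article'}
institution = {'type': 'journal-article',
               'author': [{'name': 'Some Institute', 'sequence': 'first'}]}

for label, rec in [('ok', ok), ('two_firsts', two_firsts),
                   ('no_first', no_first), ('no_authors', no_authors),
                   ('institution', institution)]:
    print(label, '->', classify_authors(rec))
```

Returning a label instead of writing to `df` directly keeps the rule easy to test in isolation; the labels could then be collected into a list and turned into a DataFrame in one step.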

In [2]:
df.shape

(94, 2)

In [3]:
df.value_counts(subset=['issue'])

issue
no authors                84
multiple first authors     9
no first author            1
dtype: int64

In [4]:
df.head(10)

| | DOI | issue |
|---|---|---|
| 0 | 10.1016/0379-6787(80)90068-x | no authors |
| 1 | 10.3176/eco.2010.3.02 | multiple first authors |
| 2 | 10.1007/bf03221267 | no authors |
| 3 | 10.1017/s003329170400251x | no authors |
| 4 | 10.1001/jama.1944.02850110032013 | no authors |
| 5 | 10.1007/bf01451578 | no authors |
| 6 | 10.1175/jcli3586.1 | multiple first authors |
| 7 | 10.1002/ltl.20463 | no authors |
| 8 | 10.1007/s40278-020-81276-7 | no authors |
| 9 | 10.1557/jmr.2015.240 | no authors |


http://api.crossref.org/works/10.1016/0379-6787(80)90068-x

http://api.crossref.org/works/10.3176/eco.2010.3.02

http://api.crossref.org/works/10.1007/bf03221267

Checking the first three results in the DataFrame against the Crossref API confirms that the script flagged them correctly: the first and third records have no authors, and the second has multiple first authors.

However, the check also revealed a problem: the two records without authors are tagged with the 'type' 'journal-article' even though they are an index and a directory, respectively. I'm currently investigating which approach to assessing an item's type would be most accurate and least cumbersome. I'm also looking at including 'editor' fields to cover cases such as books, where editors are listed instead of authors.
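
One possible way to bring editors into the check is sketched below. It is only an illustration of the idea, not a settled approach: the helper name `pick_contributors` is hypothetical, and it assumes that 'editor' entries, when present, are a list of dicts shaped like the 'author' entries (including a 'sequence' key), which should be verified against real records before relying on it.

```python
def pick_contributors(message, roles=('author', 'editor')):
    """Return (role, people) for the first role with a non-empty list.

    Hypothetical helper: falls back from 'author' to 'editor' so that
    editor-only items (e.g. books) are not reported as having no authors.
    """
    for role in roles:
        people = message.get(role)
        if people:
            return role, people
    return None, []


# Toy record (made up for illustration): a book with editors only
book = {'type': 'book',
        'editor': [{'family': 'E', 'sequence': 'first'},
                   {'family': 'F', 'sequence': 'additional'}]}

role, people = pick_contributors(book)
first_count = sum(1 for p in people if p.get('sequence') == 'first')
print(role, first_count)
```

The returned list can then be passed through the same 'first' counting rule, with the role recorded alongside the issue so that editor-based results stay distinguishable from author-based ones.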