# Detecting Initials and Affiliations

Many high-priority issues concern how author information is entered into the record. In this script, we will detect a few of those issues, such as the use of initials in place of names and the listing of affiliations as authors. We'll be using a sample of 1300 records from Crossref.

First, we'll import our libraries and set up our generator and dataframes.

In [2]:
import re
import pandas as pd
import json
import numpy as np

def gen(file_name):
    """Yield the records stored in a JSON file one at a time."""
    with open(file_name, 'r') as fh:
        for record in json.load(fh):
            yield record

# Set up dataframe for holding detected errors
df = pd.DataFrame(columns=['DOI', 'issue'])
# DataFrame to hold non-article types that commonly do not have authors
alt_df = pd.DataFrame(columns=['DOI', 'type'])
data = gen('bigger_sample.json')
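
The generator appears to assume that the file holds a JSON array of Crossref API responses, each wrapping its metadata in a `message` object. As a minimal sketch of that assumed shape, the snippet below writes two made-up records to a temporary file and reads them back with the same generator. The DOIs and field values are invented for illustration and are not part of the sample.

```python
import json
import os
import tempfile

def gen(file_name):
    """Yield the records stored in a JSON file one at a time."""
    with open(file_name, 'r') as fh:
        for record in json.load(fh):
            yield record

# Hypothetical records that mimic the assumed structure of the sample file
sample = [
    {"message": {"DOI": "10.1234/example.1", "type": "journal-article",
                 "author": [{"given": "J.", "family": "Doe"}]}},
    {"message": {"DOI": "10.1234/example.2", "type": "journal-issue"}},
]

# Write the records to a temporary JSON file
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    json.dump(sample, tmp)
    path = tmp.name

# Read them back through the generator and collect the DOIs
dois = [record['message']['DOI'] for record in gen(path)]
os.remove(path)
print(dois)
```

Note that `json.load` reads the whole file before the first record is yielded, so the generator offers a convenient iteration interface rather than true streaming.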

Now we'll iterate through the records and label the issue appropriately.

In [3]:
for record in data:
    message = record['message']
    doi = message['DOI']
    try:
        authorList = message['author']
        for x in authorList:
            try:
                # Flag given names made up only of initials: up to three
                # capital letters, each optionally followed by punctuation
                detector = re.match(r"^(?:[A-Z]\W{,3}\s?){,3}$", x['given'])
                if detector is not None or len(x['given']) < 2:
                    df.loc[len(df)] = (doi, 'given name initials')
                else:
                    # Flag family names that are a single initial or empty
                    last_detect = re.match(r"^(?:[A-Z]\.?\s?)?$", x['family'])
                    if last_detect is not None:
                        df.loc[len(df)] = (doi, 'last name initials')
            except KeyError:
                # The entry lacks 'given' or 'family'. When an affiliation
                # is presented as an author, it almost exclusively uses
                # the field 'name', which breaks the 'given'/'family'
                # convention for authors.
                if 'name' in x:
                    df.loc[len(df)] = (doi, 'institution as author')
    except KeyError:
        # The record has no 'author' field at all
        if message['type'] == 'journal-article':
            df.loc[len(df)] = (doi, 'no authors')
        else:
            alt_df.loc[len(alt_df)] = (doi, message['type'])
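
To make the branching in the loop easier to inspect, here is a minimal sketch that applies the same two regular expressions to a single author entry and returns the label it would receive. The helper name `classify_author` and the sample names are hypothetical and chosen only for illustration. The sketch uses `dict.get` instead of exception handling, which should be equivalent as long as the fields hold strings.

```python
import re

# The same patterns used in the loop above
GIVEN_INITIALS = re.compile(r"^(?:[A-Z]\W{,3}\s?){,3}$")
FAMILY_INITIAL = re.compile(r"^(?:[A-Z]\.?\s?)?$")

def classify_author(author):
    """Return the issue label for one author entry, or None if none applies."""
    given = author.get('given')
    family = author.get('family')
    # A given name made of initials, or shorter than two characters
    if given is not None and (GIVEN_INITIALS.match(given) or len(given) < 2):
        return 'given name initials'
    # Both name parts are present: check the family name for a lone initial
    if given is not None and family is not None:
        if FAMILY_INITIAL.match(family):
            return 'last name initials'
        return None
    # A name part is missing: an affiliation usually sits in 'name'
    if 'name' in author:
        return 'institution as author'
    return None

# Hypothetical author entries
examples = [
    {'given': 'J. R.', 'family': 'Smith'},
    {'given': 'J.-P.', 'family': 'Martin'},
    {'given': 'Jane', 'family': 'S.'},
    {'given': 'Jane', 'family': 'Smith'},
    {'name': 'Example University'},
]
labels = [classify_author(a) for a in examples]
for author, label in zip(examples, labels):
    print(author, '->', label)
```

Because the given-name pattern allows at most three initials, a name such as `A B C D` is not flagged. Whether that limit suits a particular dataset is a judgment call.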

Let's take a look at the outcome:

In [4]:
df.shape

(949, 2)

In [5]:
alt_df.shape

(182, 2)

The script flagged 949 issues in total. Each row is either one problematic author entry or one article that has no authors at all.

In [6]:
len(set(df['DOI']))

384

384 of the 889 articles (about 43%) have issues with the entry of the author names.

In [7]:
df.value_counts(subset=['issue'])

issue
given name initials      819
no authors                84
institution as author     42
last name initials         4
dtype: int64
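
The counts above are per flagged row, so an article with several abbreviated author names contributes several rows. As a minimal sketch, the snippet below contrasts row counts with the number of distinct articles per issue. It uses a small hypothetical frame with invented DOIs rather than the real sample.

```python
import pandas as pd

# Hypothetical flagged rows: article 'a' has two abbreviated author names
toy = pd.DataFrame({
    'DOI': ['10.1234/a', '10.1234/a', '10.1234/b', '10.1234/c'],
    'issue': ['given name initials', 'given name initials',
              'given name initials', 'no authors'],
})

# Number of flagged rows per issue (what value_counts reports)
rows_per_issue = toy['issue'].value_counts()

# Number of distinct articles affected by each issue
articles_per_issue = toy.groupby('issue')['DOI'].nunique()

print(rows_per_issue)
print(articles_per_issue)
```

Running the same `groupby` on `df` would show how many articles, rather than author entries, are affected by each issue.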

In [8]:
df.head(5)

| | DOI | issue |
|---|---|---|
| 0 | 10.5840/philstudies19577039 | institution as author |
| 1 | 10.1016/j.euje.2006.04.001 | given name initials |
| 2 | 10.1016/j.euje.2006.04.001 | given name initials |
| 3 | 10.1016/0040-6090(88)90303-3 | given name initials |
| 4 | 10.1016/0040-6090(88)90303-3 | given name initials |


These five rows cover three distinct records. Let's take a look at them.

http://api.crossref.org/works/10.5840/philstudies19577039

http://api.crossref.org/works/10.1016/j.euje.2006.04.001

http://api.crossref.org/works/10.1016/0040-6090(88)90303-3

After checking each record, we see that they do indeed have the issues detected.
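
The lookup URLs listed above can also be derived from the flagged rows instead of being typed by hand. The following is a minimal sketch that rebuilds the first five rows of `df` as a literal and assumes the same `http://api.crossref.org/works/` prefix. The variable names are illustrative.

```python
import pandas as pd

# The first five flagged rows, copied from the output of df.head(5)
flagged = pd.DataFrame({
    'DOI': ['10.5840/philstudies19577039',
            '10.1016/j.euje.2006.04.001',
            '10.1016/j.euje.2006.04.001',
            '10.1016/0040-6090(88)90303-3',
            '10.1016/0040-6090(88)90303-3'],
    'issue': ['institution as author'] + ['given name initials'] * 4,
})

# Keep each DOI once, preserving the order of first appearance
unique_dois = flagged['DOI'].drop_duplicates().tolist()

# Build one lookup URL per distinct record
urls = ['http://api.crossref.org/works/' + doi for doi in unique_dois]
for url in urls:
    print(url)
```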