# Obtain Editorial Board Makeup

This notebook demonstrates how to download Editorial Board pages and parse editors from them, using a random issue (Volume 15) of an arbitrary journal (Comparative Biochemistry and Physiology Part D: Genomics and Proteomics) as an example.

The XML data returned by the Elsevier API is downloaded and stored in `../data/editorial_board_info/`, and the parsed editors are stored in `../data/SampleEditorsVol15.csv`. To test the API yourself, first apply for an API key at the [Elsevier Developer Portal](https://dev.elsevier.com/), then initialize the global variable `APIKEY` with your key.

Jump to

- [Download editorial info page](#Download-editorial-info-page)
- [Parse editor records](#Parse-editor-records)
- [Extract editors](#Extract-editors)

In [1]:
APIKEY = ""

In [2]:
import sys
sys.path.insert(1, '../src')

from utils import loadText

import requests
import time
import glob
import re

import pandas as pd
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
from unidecode import unidecode

## Download editorial info page

In [3]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
          'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          "X-ELS-APIKey": APIKEY}

In [4]:
# the link to each pdf file that shows the editorial page
links = pd.read_csv('../data/EditorialPageLinks.csv',sep='\t',dtype={'issn':str})

In [5]:
links

| | title | issue | volume | date | date_text | pdf_link | issn | supplement_text | exist |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Comparative Biochemistry and Physiology Part D... | | 15 | 20150901 | September 2015 | https://www.sciencedirect.com/science/article/... | 1744117X | Volume 15 | 1.0 |


In [6]:
collected = set(glob.glob('../data/editorial_board_info/*_*.xml'))

In [7]:
%%time
url = "https://api.elsevier.com/content/article/pii/{}?view=FULL"

for ind, row in tqdm(links.iterrows()):
    l = row['pdf_link'].split('/')
    outfile = f"../data/editorial_board_info/{row['issn']}_{row['supplement_text']}.xml"
    
    if outfile in collected: continue
        
    if l[5] == 'pii':
        res = requests.get(url.format(l[6]), headers=headers)
        assert(res.status_code==200)
        
        with open(outfile, 'w+') as f:
            f.write(res.text)
    else:
        print("error", "l[5] != 'pii', ind is {}, link is {}".format(ind, row['pdf_link']))
        
    time.sleep(2)

CPU times: user 20.8 ms, sys: 996 µs, total: 21.8 ms
Wall time: 21.4 ms
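
The download loop depends on where `pii` sits once the link is split on `/`. The following is a minimal sketch of that indexing, not the notebook's own code: the helper name `extract_pii` and the PII value `S0000000000000000` are made-up placeholders used only for illustration.

```python
# Hypothetical ScienceDirect-style link; the PII below is a placeholder, not a real article.
sample_link = "https://www.sciencedirect.com/science/article/pii/S0000000000000000"

def extract_pii(link):
    """Return the PII from a link shaped like the sample above, else None."""
    parts = link.split('/')
    # parts == ['https:', '', 'www.sciencedirect.com', 'science', 'article', 'pii', '<PII>']
    if len(parts) > 6 and parts[5] == 'pii':
        return parts[6]
    return None

pii = extract_pii(sample_link)
# Same URL template as in the download cell.
api_url = "https://api.elsevier.com/content/article/pii/{}?view=FULL".format(pii)
print(api_url)
```

Returning `None` for links that do not follow this layout mirrors the `else` branch of the loop, which reports such links instead of requesting them.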


## Parse editor records

Parsing editor records takes several steps.

First, we extract the raw text from the XML files.

The raw text lacks delimiters that separate editor records, so we insert newline characters to facilitate further parsing. We first identify the editor roles and insert '\n' before and after each one. Then, observing that each editor record ends in a country name or abbreviation, we insert '\n' after each country.
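
To make the goal concrete, here is a small sketch on a made-up string. It uses regular expressions and two tiny illustrative lists rather than the notebook's own functions and data, so the names `raw`, `roles`, `countries`, and `delimit` are assumptions for this example only.

```python
import re

# Made-up raw text and tiny lookup lists, for illustration only.
raw = ("Editor-in-Chief A. Smith Example University, Canada "
       "Editorial Board B. Jones Sample Institute, Sweden")
roles = ["editor-in-chief", "editorial board"]
countries = ["canada", "sweden"]

def delimit(text, roles, countries):
    # Put a newline before and after every role, matched case-insensitively.
    for role in roles:
        text = re.sub(re.escape(role), lambda m: "\n" + m.group(0) + "\n",
                      text, flags=re.IGNORECASE)
    # Put a newline after every country name that appears as a whole word.
    for country in countries:
        text = re.sub(r"\b" + re.escape(country) + r"\b", lambda m: m.group(0) + "\n",
                      text, flags=re.IGNORECASE)
    # Split on the inserted delimiters and drop empty pieces.
    return [line.strip() for line in text.split("\n") if line.strip()]

lines = delimit(raw, roles, countries)
for line in lines:
    print(line)
```

Each role ends up on its own line, followed by one line per editor record.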

In [8]:
def isOverlap(a, b):
    # spans must not overlap, so that we don't break "associate editors" into "associate \n editor\n s"
    
    if a[0] >= b[0] and a[0] <= b[1]:
        return True
    if a[1] <= b[1] and a[1] >= b[0]:
        return True
    if b[0] >= a[0] and b[0] <= a[1]:
        return True
    if b[1] <= a[1] and b[1] >= a[0]:
        return True
    
    return False
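
The four endpoint checks amount to the usual overlap test for closed intervals. As a sketch, the compact form below (the name `is_overlap_compact` is hypothetical) is compared by brute force against a copy of the four checks, assuming every span satisfies start <= end.

```python
from itertools import product

def is_overlap_original(a, b):
    # Copy of the four endpoint checks used in the notebook.
    if a[0] >= b[0] and a[0] <= b[1]:
        return True
    if a[1] <= b[1] and a[1] >= b[0]:
        return True
    if b[0] >= a[0] and b[0] <= a[1]:
        return True
    if b[1] <= a[1] and b[1] >= a[0]:
        return True
    return False

def is_overlap_compact(a, b):
    # Two closed intervals overlap when each one starts no later than the other ends.
    return a[0] <= b[1] and b[0] <= a[1]

# Brute-force comparison on all small spans with start <= end.
spans = [(s, e) for s, e in product(range(6), repeat=2) if s <= e]
agree = all(is_overlap_original(a, b) == is_overlap_compact(a, b)
            for a in spans for b in spans)
print(agree)
```

Because both ends are treated as inclusive, two spans that merely touch, such as `(0, 5)` and `(5, 9)`, also count as overlapping.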

In [9]:
def match(text, title_list):
    # find the positions of editor-role
    
    span = []
    text = text.lower() # convert to lowercase to match
    
    for i in range(0, len(text)):
        for query in title_list:
            if i+len(query) <= len(text) and text[i:i+len(query)] == query:
                new_span = (i, i+len(query))
                span.append(new_span)
                    
    # only keep the longest match
    keep = [1 for _ in range(len(span))]
    for i in range(len(span)):
        for j in range(len(span)):
            if isOverlap(span[i], span[j]):
                if span[i][1]-span[i][0] > span[j][1] - span[j][0]:
                    keep[j] = 0 # when two overlaps, discard the shorter one
                    
    kept = [span[i] for i in range(len(span)) if keep[i] == 1]
    
    return set(kept)
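
The effect of keeping only the longest match can be seen on a toy input. This is a self-contained sketch with a made-up title list; `find_spans` and `keep_longest` are illustrative names, not functions from the notebook.

```python
def find_spans(text, queries):
    # Collect every (start, end) where a query occurs in the lowercased text.
    text = text.lower()
    spans = []
    for query in queries:
        start = text.find(query)
        while start != -1:
            spans.append((start, start + len(query)))
            start = text.find(query, start + 1)
    return spans

def keep_longest(spans):
    # Drop any span that overlaps a strictly longer span.
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    kept = set()
    for s in spans:
        longer = [t for t in spans
                  if overlaps(s, t) and (t[1] - t[0]) > (s[1] - s[0])]
        if not longer:
            kept.add(s)
    return kept

text = "Associate Editors J. Doe"
queries = ["editor", "associate editor", "associate editors"]  # made-up title list
spans = find_spans(text, queries)
kept = keep_longest(spans)
print(sorted(spans))
print(sorted(kept))
```

Three candidate spans are found, but only the one covering "Associate Editors" survives, so the later newline insertion does not cut the role in the middle.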

### Editor roles

The editor roles (editor-in-chief, associate editor, etc.) are collected from the lists of current editors shown on the website of each Elsevier journal.

In [10]:
## editor titles
title_df = pd.read_csv("../data/EditorTitles.csv", sep="\t", index_col=0)
print(title_df.shape, title_df.normalized.nunique())

(1477, 2) 1278


In [11]:
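# DataFrame.append was removed in pandas 2.0, so the extra row is added with pd.concat
extra = pd.DataFrame([{'original': 'emeritus editor-in-chief', 'normalized': 'emeritus editor in chief'}])
title_df = pd.concat([title_df, extra], ignore_index=True)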

In [12]:
title_df.head()

| | original | normalized |
|---|---|---|
| 0 | editor- in- chief | editor in chief |
| 1 | editorial board | editorial board |
| 2 | editor-in-chief | editor in chief |
| 3 | associate editors | associate editor |
| 4 | founding editor | founding editor |


### Countries

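The country list contains the name and, where available, the official name of every country in `pycountry`.

Common abbreviations and short names are added manually: USA, UK, Russia, and South Korea.

"Jersey" is removed: New Jersey is a US state, and Jersey itself rarely appears as an editor's country.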
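
The reason for dropping "Jersey" shows up on a made-up affiliation. The sketch below uses a small hard-coded sample instead of `pycountry`, and plain substring matching instead of the notebook's matcher; `prepare` and `hits` are illustrative names.

```python
# Small made-up sample; the notebook builds the full list from pycountry.
sample_countries = ["Jersey", "United States", "USA", "UK", "Niger", "Nigeria"]

def prepare(names, drop=()):
    # Remove unwanted names, put longer names first, and lowercase everything.
    kept = [n for n in names if n not in drop]
    kept = sorted(kept, key=len, reverse=True)
    return [n.lower() for n in kept]

def hits(text, countries):
    # Every country from the list that occurs in the lowercased text.
    text = text.lower()
    return [c for c in countries if c in text]

affiliation = "Example College, Trenton, New Jersey, USA"  # made-up record

with_jersey = prepare(sample_countries)
without_jersey = prepare(sample_countries, drop=("Jersey",))
print(hits(affiliation, with_jersey))
print(hits(affiliation, without_jersey))
```

With "Jersey" in the list the record produces two country hits, which would place an extra delimiter in the middle of the record; without it only the trailing "USA" is found.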

In [13]:
import pycountry

In [14]:
# get country list
country_list = []

for c in pycountry.countries:
    country_list.append(c.name)
    try:
        country_list.append(c.official_name)
    except Exception as e:
        continue

# some special change to make
country_list.append("USA")
country_list.append("UK")
country_list.append("Russia")
country_list.append('South Korea')
country_list.remove("Jersey")
country_list = sorted(country_list, key = len, reverse = True)
country_list = [x.lower() for x in country_list]

### Clean raw text and insert delimiters before and after roles

In [15]:
def findRoles(text):
    
    processed = text
    span = match(text, title_df.original)
    start = set([s[0] for s in span])
    end = set([s[1] for s in span])
    start.update(end)
    insert_pos = sorted(start, reverse = True)

    for insert_ind in insert_pos:
        if insert_ind != 0:
            new_entry = processed[:insert_ind] + '\n' + processed[insert_ind:]
            processed = new_entry
    
    return processed
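
Inserting from the largest index down matters because each inserted character shifts everything to its right. A minimal sketch of that idea on a made-up string (`insert_newlines` is an illustrative helper, not part of the notebook):

```python
def insert_newlines(text, positions):
    # Insert '\n' at each index, working from the right so that
    # the remaining (smaller) indices are not shifted by earlier insertions.
    for pos in sorted(set(positions), reverse=True):
        if pos != 0:  # a newline at the very start would only add an empty line
            text = text[:pos] + '\n' + text[pos:]
    return text

text = "Editor A. Smith Canada"
span = (0, 6)  # start and end of "Editor" in the made-up text
print(repr(insert_newlines(text, span)))
```

Processing the positions in ascending order would require adding an offset after every insertion; the reverse order avoids that bookkeeping.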

In [16]:
def removeEmail(text):
    # if email is present (appears after country name), remove email
    email_re = r'[\w\.-]+@[\w\.-]+'
    
    processed = re.sub(email_re, "", text)
    # remove the "Email"/"E-mail" label as a whole word only; stripping the bare
    # substring "ail" would also damage words such as "Thailand"
    processed = re.sub(r'\be-?mail\b', "", processed, flags=re.IGNORECASE)
    processed = processed.replace(":", '').replace(" .", '')

    return processed

In [17]:
def cleanText(df):
    df = df.assign(plaintext = df.text.apply(removeEmail))
    # chain on plaintext; reading df.text again would discard the email removal
    df = df.assign(plaintext = df.plaintext.apply(findRoles))
    
    return df

In [18]:
%%time
rows = []

for file in glob.glob("../data/editorial_board_info/*"):
    issn, issue = file.split('/')[-1].replace('.xml','').split('_')
    
    text = loadText(file)
    soup = BeautifulSoup(text, 'html.parser')
    rawtext = soup.find('xocs:rawtext') # find the raw-text field in the xml
    
    if rawtext is None: # skip files without a raw-text field
        continue
    rows.append({'issn': issn, 'issue': issue, 'text': rawtext.text.strip()})

# build the frame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=['issn', 'issue', 'text'])

CPU times: user 9.06 ms, sys: 0 ns, total: 9.06 ms
Wall time: 8.89 ms


In [19]:
df.shape

(1, 3)

In [20]:
%%time
df = cleanText(df)

CPU times: user 4.72 s, sys: 74 ms, total: 4.79 s
Wall time: 4.98 s


In [21]:
df.head()

| | issn | issue | text | plaintext |
|---|---|---|---|---|
| 0 | 1744117X | Volume 15 | Editor-in-Chief M. Rise Memorial University of... | Editor-in-Chief\n M. Rise Memorial University ... |


### Insert delimiter after countries

In [22]:
def notPart(start, end, text):
    # make sure the "matched" country name is not part of another word
    # Make sure we don't match province, eg. Peru and Perugia
    # or Singapore at the end of "National University of Singapore"
    if start > 0:
        if not (text[start-1] == ' ' or text[start-1] == ',' or text[start-1] == '.'):
            return False
    
    if end < len(text):
        if not (text[end] == ' ' or text[end] == ',' or text[end] == '.'):
            return False
    
    return True

def matchCountry(text, country_list):
    # find the positions of country names
    
    span = []
    text = text.lower() # convert to lowercase to match
    
    for i in range(0, len(text)):
        for query in country_list:
            if i+len(query) <= len(text) and text[i:i+len(query)] == query:
                if notPart(i, i+len(query), text):
                    new_span = (i, i+len(query))
                    span.append(new_span)
                    
    # only keep the longest match
    keep = [1 for _ in range(len(span))]
    for i in range(len(span)):
        for j in range(len(span)):
            if isOverlap(span[i], span[j]):
                if span[i][1]-span[i][0] > span[j][1] - span[j][0]:
                    keep[j] = 0 # when two overlaps, discard the shorter one
                    
    kept = [span[i] for i in range(len(span)) if keep[i] == 1]
    
    return set(kept)
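
The boundary check can be tried in isolation. This is a simplified sketch with made-up affiliations; `is_whole_match` and `find_whole` are illustrative names that follow the same idea as `notPart`, namely that a match must be bounded by a space, comma, period, or the edge of the text.

```python
def is_whole_match(start, end, text, separators=" ,."):
    # The match counts only if a separator (or a text edge) sits on both sides.
    left_ok = start == 0 or text[start - 1] in separators
    right_ok = end == len(text) or text[end] in separators
    return left_ok and right_ok

def find_whole(text, query):
    # All (start, end) positions where query appears as a separate token.
    text, query = text.lower(), query.lower()
    found = []
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        if is_whole_match(start, end, text):
            found.append((start, end))
        start = text.find(query, start + 1)
    return found

inside_word = "University of Perugia, Italy"      # made-up record
own_token = "Universidad Example, Lima, Peru"     # made-up record
print(find_whole(inside_word, "Peru"))
print(find_whole(own_token, "Peru"))
```

A check of this kind only inspects the neighbouring characters, so a country name that appears as a separate word inside an institution name would still be matched.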

In [23]:
def findCountry(text):
    processed = text

    span = matchCountry(processed, country_list)
    start = set([s[0] for s in span])
    end = set([s[1] for s in span])
    insert_pos = sorted(end, reverse = True)

    for insert_ind in insert_pos:
        if insert_ind != 0:
            new_entry = processed[:insert_ind] + '\n' + processed[insert_ind:]
            processed = new_entry


    return processed

In [24]:
%%time
df = df.assign(plaintext = df.plaintext.apply(findCountry))

CPU times: user 1.1 s, sys: 3.04 ms, total: 1.1 s
Wall time: 1.11 s


In [25]:
df.head()

| | issn | issue | text | plaintext |
|---|---|---|---|---|
| 0 | 1744117X | Volume 15 | Editor-in-Chief M. Rise Memorial University of... | Editor-in-Chief\n M. Rise Memorial University ... |


## Extract editors

After the delimiters are inserted, each line is either an editor role or an editor record of the form "editor name, affiliation, country".
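
The grouping of records under roles can be sketched without pandas. The text and the title set below are made up, and `assign_titles` is an illustrative name; the idea is that a role line sets the current title, and every following record inherits it until the next role line.

```python
# Made-up delimited text in the same shape as the `plaintext` column.
plaintext = ("Editor-in-Chief\n A. Smith Example University, Canada\n"
             "Editorial Board\n B. Jones Sample Institute, Sweden\n"
             " C. Lee Demo College, UK")
titles = {"editor-in-chief", "editorial board"}  # illustrative subset of titles

def assign_titles(plaintext, titles):
    records = []
    current = "<NO-TITLE>"  # used until the first role line is seen
    for line in plaintext.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.lower() in titles:
            current = line.lower()  # a role line starts a new group
        else:
            records.append({"editor": line, "title": current})
    return records

records = assign_titles(plaintext, titles)
for rec in records:
    print(rec["title"], "|", rec["editor"])
```

Records that appear before any role line keep the `<NO-TITLE>` placeholder, which makes them easy to inspect later.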

In [26]:
def parseEditorRole(df, title_list):
    # split each line as an entry of editor
    # identify the role of each "editor"
    
    title_set = set(title_list) # set lookup is faster than scanning the array
    rows = []

    for ind, row in tqdm(df.iterrows()):
        currentEditorTitle = "<NO-TITLE>" # placeholder until the first role line appears
        issue = row['issue']
        issn = row['issn']

        for line in row['plaintext'].split('\n'):
            line = line.strip()

            if line.lower() in title_set:
                currentEditorTitle = line.lower()
            elif len(line) < 200: # very long lines are not editor records
                rows.append({'editor': line, 'issn': issn, 'issue': issue, 'title': currentEditorTitle})
                        
    # build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['editor', 'issn', 'issue', 'title'])

In [27]:
def remove_non_alpha(s):
    # remove all the starting non-alphabetical characters
    
    while 1:
        if len(s) == 0:
            break
        if s[0].isalpha():
            break
        s = s[1:]
    
    return s

In [28]:
def remove_bracket(s):
    # if starting with open bracket
    # find the [matching] closing one, and remove everything in between
    # if no matching one is found, simply remove the opening bracket
    
    if s[0] == '(':
        bi = find_matching_bracket(s)
        s = s[bi+1:].strip()
    
    return s

def find_matching_bracket(s):
    count = 0
    for ind in range(len(s)):
        if s[ind] == '(':
            count += 1
        if s[ind] == ')':
            count -= 1
        if count == 0:
            return ind
    return 0

print('Tests:')
print('1. ', remove_bracket('(aaaa) bbb'))
print('2. ', remove_bracket('(a(acc)aaa) 111'))

Tests:
1.  bbb
2.  111


In [29]:
def not_irregular(s):
    s = s.lower()

    if s[:10] == "journal of" or s[:5] == 'phone' or s[:3] == 'fax':
        return False

    return True

def remove_em(s):

    if len(s) >= 3 and s.lower()[:3] == "e-m" :
        s = s[3:].strip()
        
    return s

In [30]:
title_list = title_df.original.values

In [31]:
%%time
editorDf = parseEditorRole(df, title_list)
print(editorDf.shape)

(141, 4)
CPU times: user 297 ms, sys: 4.65 ms, total: 302 ms
Wall time: 324 ms


In [32]:
# remove empty lines
editorDf = editorDf[editorDf.editor.apply(lambda x: len(x) > 1)]
editorDf.shape

(138, 4)

In [33]:
# convert to ascii
editorDf = editorDf.assign(editor = editorDf.editor.apply(unidecode))

In [34]:
# if a line starts with a number, remove the entire line
editorDf = editorDf[editorDf.editor.apply(lambda x: not x[0].isnumeric())]
editorDf.shape

(138, 4)

In [35]:
# if a line starts with a bracket, remove the bracketed part
editorDf = editorDf.assign(editor = editorDf.editor.apply(remove_bracket))

In [36]:
# remove all the starting non-alphabetical characters
editorDf = editorDf.assign(editor = editorDf.editor.apply(remove_non_alpha))

In [37]:
# remove the entries that are obviously not editors
editorDf = editorDf[editorDf.editor.apply(not_irregular)]
editorDf.shape # 138

(138, 4)

In [38]:
# remove the starting 'e-m' of each line (for example, journal 09258388)
editorDf = editorDf.assign(editor = editorDf.editor.apply(remove_em))

In [39]:
# drop entries with two or fewer words (a name plus an affiliation needs at least three)
editorDf = editorDf[editorDf.editor.apply(lambda x: len(x.split()) > 2)]
editorDf.shape # 131

(131, 4)

In [40]:
editorDf.head()

| | editor | issn | issue | title |
|---|---|---|---|---|
| 0 | M. Rise Memorial University of Newfoundland, N... | 1744117X | Volume 15 | editor-in-chief |
| 2 | T.P. Mommsen University of Victoria, Victoria,... | 1744117X | Volume 15 | emeritus editor-in-chief |
| 5 | J. Altimiras Linkoping University, Sweden | 1744117X | Volume 15 | editorial board |
| 6 | G. Anderson University of Manitoba, Winnipeg, ... | 1744117X | Volume 15 | editorial board |
| 7 | N.J. Bernier University of Guelph, Guelph, ON,... | 1744117X | Volume 15 | editorial board |


### Parse editor names

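Parse each editor name into one of the forms "first-name last-name", "first-name middle-name last-name", "first-name middle-initial last-name", or "first-initial middle-initial last-name".

1. Convert to lower case.
2. Remove leading titles: "Dr.", "Prof.", "Professor", and "Doctor".
3. If the first name is a pair of fused initials such as "A.B.", separate it into two.
4. Keep the first two words as the name; while the last word kept is an initial, keep the following word as well.
5. Remove "." and any other non-alphabetic characters.

The remaining words are treated as the affiliation.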
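
The step that separates fused initials can also be written with a regular expression. The sketch below is an alternative formulation, not the notebook's implementation: `split_initials` is a hypothetical helper that assumes the token is already lowercased, and unlike the notebook it also handles more than two initials.

```python
import re

def split_initials(token):
    # Fused initials look like "a.b." or "m.a": single letters each followed
    # by a dot, optionally ending in one bare letter.
    if re.fullmatch(r"(?:[a-z]\.)+[a-z]?", token):
        return " ".join(re.findall(r"[a-z]", token))
    return token  # ordinary words and titles are left unchanged

for token in ["a.b.", "m.a", "t.p.", "j.", "liu", "dr."]:
    print(token, "->", split_initials(token))
```

Tokens such as "liu" or "dr." do not start with a single letter followed by a dot, so they pass through untouched.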

In [41]:
def parse_name(s):
    
    s = s.lower().split()
    name = []
    
    # remove title
    if s[0] in ['dr.', 'dr', 'prof.', 'prof', 'professor', 'doctor']:
        s = s[1:]
        
    if len(s[0]) >=3 and s[0][1] == '.' and s[0][0].isalpha() and s[0][2].isalpha():
        s[0] = s[0][0] + ' ' + s[0][2]
    
    if len(s) <= 2:
        name = ' '.join(s).strip()
        name = ''.join([i for i in name if i.isalpha() or i == ' '])
        return name, ''
    else:
        name.append(s[0])
        name.append(s[1])
        ind = 2
    
    while (len(name[-1]) == 2 and name[-1][1] == '.' and name[-1][0].isalpha()) or len(name[-1]) == 1:
        if ind == len(s):
            break
        name.append(s[ind]) # some ppl may have multiple middle names
        ind += 1
        
    name = ' '.join(name)
    name = ''.join([i for i in name if i.isalpha() or i == ' '])
    aff = ' '.join(s[ind:])
    
    return name, aff

# test cases
print('Tests:')
print(parse_name("Professor J. L Liu,*"))
print(parse_name("Dr. Jay l Lee"))
print(parse_name("mike liu nasa"))
print(parse_name("mike l. na"))
print(parse_name("m.a. liu, university of california"))
print(parse_name("m.a li university"))
print(parse_name("mike, france"))

Tests:
('j l liu', '')
('jay l lee', '')
('mike liu', 'nasa')
('mike l na', '')
('m a liu', 'university of california')
('m a li', 'university')
('mike france', '')


In [42]:
def return_name(s):
    return parse_name(s)[0]

def return_aff(s):
    return parse_name(s)[1]

In [43]:
editorDf['editorName'] = editorDf.editor.apply(return_name)
editorDf['editorAff'] = editorDf.editor.apply(return_aff)

In [44]:
editorDf = editorDf.drop(['editor'],axis=1)

### Get editor year

In [45]:
# the link to each pdf file that shows the editorial page
links = pd.read_csv('../data/EditorialPageLinks.csv',sep='\t',dtype={'issn':str},
                   usecols=['issn','supplement_text','date'])

In [46]:
links = links.rename(columns={'supplement_text':'issue'})

In [47]:
editorDf = editorDf.merge(links, on=['issn','issue'])
editorDf.shape

(131, 6)

In [48]:
editorDf = editorDf.assign(Year = editorDf.date.apply(lambda x: int(str(x)[:4])))
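
Slicing the first four characters works because the `date` column is stored as `YYYYMMDD`. As a sketch of an alternative that also validates the date, the standard library parser can be used; `year_from_date` is an illustrative name, not part of the notebook.

```python
from datetime import datetime

def year_from_date(date_value):
    # Accepts an int or str such as 20150901 and returns the calendar year.
    # strptime raises ValueError on malformed dates, which plain slicing would not.
    return datetime.strptime(str(date_value), "%Y%m%d").year

print(year_from_date(20150901))
```

For well-formed values both approaches give the same year; the parser only adds a safeguard against malformed entries.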

In [49]:
editorDf.drop_duplicates().to_csv("../data/SampleEditorsVol15.csv",sep='\t',index=False) # sample

In [50]:
editorDf.shape, editorDf.drop_duplicates().shape # (131, 7), (90, 7)

((131, 7), (90, 7))

In [51]:
editorDf.head()

| | issn | issue | title | editorName | editorAff | date | Year |
|---|---|---|---|---|---|---|---|
| 0 | 1744117X | Volume 15 | editor-in-chief | m rise | memorial university of newfoundland, newfoundl... | 20150901 | 2015 |
| 1 | 1744117X | Volume 15 | emeritus editor-in-chief | t p mommsen | university of victoria, victoria, bc, canada | 20150901 | 2015 |
| 2 | 1744117X | Volume 15 | editorial board | j altimiras | linkoping university, sweden | 20150901 | 2015 |
| 3 | 1744117X | Volume 15 | editorial board | g anderson | university of manitoba, winnipeg, manitoba, ca... | 20150901 | 2015 |
| 4 | 1744117X | Volume 15 | editorial board | n j bernier | university of guelph, guelph, on, canada | 20150901 | 2015 |
