# Disambiguate sample editors

This notebook identifies the collected editorial board members in MAG. The process relies on the assumption that it is extremely unlikely for two scientists who share the same name to be affiliated with the same institution in a given year. Editors are therefore disambiguated by their first and last name (or first initial, middle initial, and last name), their affiliation, and the year(s) of that affiliation.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd
from fuzzywuzzy import fuzz
from unidecode import unidecode

import re

## Parse editor name

We allow two-letter first names to be interpreted in two ways. "DJ Trump" may actually be "D. J. Trump" that was misread as "Dj Trump" during OCR, but "Dj" may also be someone's actual first name. We therefore keep both possibilities in the first parsing step.

In the later steps, only the correct interpretation can find a match in MAG, because it is extremely unlikely that a "Dj Trump" and a "D. J. Trump" are both affiliated with the same institution in the same year.
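
As a minimal sketch of the ambiguity described above, the hypothetical helper `name_readings` (not part of the notebook) lists the candidate readings of a two-part name. It only covers the two-token case and is meant as an illustration of the idea, not as a replacement for `parse_initial`.

```python
def name_readings(name):
    """Return candidate (first, f_ini, m_ini, last) readings of a two-part name.

    Hypothetical helper for illustration; it assumes exactly two tokens.
    """
    first, last = name.split()
    f_ini = first[0]
    if len(first) == 1:
        # a single letter is an initial, not a first name
        return [('', f_ini, '', last)]
    readings = [(first, f_ini, '', last)]            # "dj" read as a first name
    if len(first) == 2:
        # "dj" read as two fused initials, "d" and "j"
        readings.append(('', first[0], first[1], last))
    return readings


for reading in name_readings('dj trump'):
    print(reading)
```

Both readings are carried forward, and the later affiliation and year checks decide which one, if any, survives.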

In [2]:
def parse_initial(name, mode=0):
    '''
    param: name - string
    mode: 0 or 1, whether to treat length-2 first name as separate initials
    
    return: first, middle, last, first_initial, middle_initial, last_initial
    '''
    if mode not in (0, 1):
        # raising a plain string is itself a TypeError; raise a real exception
        raise ValueError("Mode must be 0 or 1")
        
    l = name.split()
    if len(l) == 0:
        return '', '', '', '', '', ''
    
    if len(l) == 1:
        first, middle, last = '', '', l[0]
        f_ini, m_ini, l_ini = '', '', last[0]
        
        return first, middle, last, f_ini, m_ini, l_ini
    
    # if the name is two parts, first name and last name
    # assumes that if first or middle name is of length 1, it is an initial
    elif len(l) == 2:
        first, middle, last = l[0], '', l[1]
        f_ini, m_ini, l_ini = first[0], '', last[0]
        
        if len(first) == 1:
            first = ''
            
        elif len(first) == 2:
            # if mode == 0, treat as first name
            if mode == 1:
                # treat as combination of two initials
                f_ini, m_ini = first[0], first[1]
                first = ''
        
        return first, middle, last, f_ini, m_ini, l_ini
    
    # if the name is three parts, first, middle, and last
    elif len(l) == 3:
        first, middle, last = l[0], l[1], l[2]
        f_ini, m_ini, l_ini = first[0], middle[0], last[0]
        
        if len(first) == 1:
            first = ''
        if len(middle) == 1:
            middle = ''
            
        return first, middle, last, f_ini, m_ini, l_ini
    
    # if the name is more than three parts, first, middle, ..., last
    elif len(l) > 3:
        first, middle, last = l[0], ' '.join(l[1:-1]), l[-1]
        f_ini, m_ini, l_ini = first[0], middle[0], last[0]
        
        if len(first) == 1:
            first = ''
        if len(middle) == 1:
            middle = ''
        
        return first, middle, last, f_ini, m_ini, l_ini

In [3]:
all_editors = pd.read_csv("../data/SampleEditorsVol15.csv",sep='\t',
                        usecols=['issn','title','editorName','editorAff','Year'],
                        dtype={'issn':str,'Year':int})
print(all_editors.shape)

all_editors = all_editors.drop_duplicates()
print(all_editors.shape)

all_editors = all_editors.rename(columns={'editorName':'normalized_name'})

(90, 5)
(90, 5)


In [4]:
editors_0 = all_editors.assign(first = all_editors.normalized_name.apply(lambda x: parse_initial(x, 0)[0]),
                                middle = all_editors.normalized_name.apply(lambda x: parse_initial(x, 0)[1]),
                                last = all_editors.normalized_name.apply(lambda x: parse_initial(x, 0)[2]),
                                f_ini = all_editors.normalized_name.apply(lambda x: parse_initial(x, 0)[3]),
                                m_ini= all_editors.normalized_name.apply(lambda x: parse_initial(x, 0)[4]))

editors_1 = all_editors.assign(first = all_editors.normalized_name.apply(lambda x: parse_initial(x, 1)[0]),
                                middle = all_editors.normalized_name.apply(lambda x: parse_initial(x, 1)[1]),
                                last = all_editors.normalized_name.apply(lambda x: parse_initial(x, 1)[2]),
                                f_ini = all_editors.normalized_name.apply(lambda x: parse_initial(x, 1)[3]),
                                m_ini= all_editors.normalized_name.apply(lambda x: parse_initial(x, 1)[4]))

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_editors = pd.concat([editors_1, editors_0], ignore_index=True, sort=False)
all_editors.shape

(180, 10)

In [5]:
all_editors = all_editors.dropna(subset=['editorAff'])
print(all_editors.shape)

(180, 10)


In [6]:
all_editors = all_editors.drop_duplicates()
print(all_editors.shape)

(92, 10)


In [7]:
all_editors.head()

Unnamed: 0,issn,title,normalized_name,editorAff,Year,first,middle,last,f_ini,m_ini
0,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,,,rise,m,
1,1744117X,emeritus editor-in-chief,t p mommsen,"university of victoria, victoria, bc, canada",2015,,,mommsen,t,p
2,1744117X,editorial board,j altimiras,"linkoping university, sweden",2015,,,altimiras,j,
3,1744117X,editorial board,g anderson,"university of manitoba, winnipeg, manitoba, ca...",2015,,,anderson,g,
4,1744117X,editorial board,n j bernier,"university of guelph, guelph, on, canada",2015,,,bernier,n,j


## Remove country

This step removes as much noise as possible from institution names.

Elsevier journals sometimes display affiliations as "institution name, city, country", whereas affiliations in MAG are much cleaner, i.e. without the city and country information. To improve accuracy when matching the affiliations we collected against those in MAG, we first remove city names, country names, and country abbreviations from the end of each editor's affiliation. If an affiliation is in the US or Canada, we also remove the two-letter state or province abbreviation when one is found.
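
The order of operations described above can be sketched on a toy vocabulary. The lookup sets and the helper `strip_location` below are assumptions made for illustration only; the notebook builds its real sets from `worldcities.csv`, and its own `clean_aff` checks states and provinces only for US and Canadian affiliations, which this simplified version does not.

```python
import re

# Toy lookup sets, assumed for illustration only.
COUNTRIES = {'canada', 'usa', 'china'}
PROVINCES = {'bc', 'on'}
CITIES = {'victoria', 'beijing'}


def strip_location(aff):
    """Peel country, then state/province, then city off the end of an affiliation."""
    parts = [re.sub(r'[^0-9a-z ]+', '', p.lower()).strip() for p in aff.split(',')]
    # each vocabulary gets one chance to remove the current last part
    for vocabulary in (COUNTRIES, PROVINCES, CITIES):
        if parts and parts[-1] in vocabulary:
            parts.pop()
    return ' '.join(parts)


print(strip_location('university of victoria, victoria, bc, canada'))
print(strip_location('peking university, beijing, china'))
```

Working from the right-hand end matters: the institution name itself may contain a city name (as in "university of victoria"), and only the trailing parts are treated as location noise.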

In [8]:
cities = pd.read_csv("../data/worldcities.csv", sep=",", 
                     usecols=['city_ascii', 'country', 'iso2', 'iso3'],
                    dtype={'city_ascii':str, 'country':str, 'iso2':str, 'iso3':str}).drop_duplicates()
cities.shape

(13879, 4)

In [9]:
city_names = set([x.lower() for x in cities.city_ascii])

country_names = [x.lower() for x in cities.country]
country_names.extend([x.lower() for x in cities.iso2 if type(x) == str])
country_names.extend([x.lower() for x in cities.iso3])
country_names = set(country_names)
country_names.add('scotland')
country_names.add('uk')

us_states = [x.lower() for x in ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"] ]

canada_states = [x.lower() for x in ['AB', 'BC', 'MB', 'NB', 'NL', 'NT', 'NS', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT'] ]

In [10]:
def clean_aff(x):
    # drop trailing country and city names;
    # for US and Canadian affiliations, also drop a two-letter state/province code
    
    l = str(x).split(',')
    
    country = re.sub('[^0-9a-zA-Z ]+', '', l[-1]).strip()
    if country in country_names:
        l = l[:-1]
    
    if len(l) > 0 and country in ['usa', 'us', 'united states', 'the united states']:
        state = re.sub('[^0-9a-zA-Z]+', '', l[-1]).strip()
        if state in us_states:
            l = l[:-1]
            
    if len(l) > 0 and country == 'canada':
        state = re.sub('[^0-9a-zA-Z]+', '', l[-1]).strip()
        if state in canada_states:
            l = l[:-1]
    
    if len(l) > 0:
        city = re.sub('[^0-9a-zA-Z ]+', '', l[-1]).strip()
        if city in city_names:
            l = l[:-1]
    
    return  ' '.join(l).strip()

In [11]:
assert(clean_aff('university of california, ca, usa') == 'university of california')
assert(clean_aff('hong kong') == '')
assert(clean_aff('c a, united states') == '')
assert(clean_aff('peking university, beijing, china') == 'peking university')

In [12]:
all_editors = all_editors.assign(cleaned_aff = all_editors.editorAff.apply(clean_aff))

In [13]:
all_editors.head()

Unnamed: 0,issn,title,normalized_name,editorAff,Year,first,middle,last,f_ini,m_ini,cleaned_aff
0,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,,,rise,m,,memorial university of newfoundland newfoundl...
1,1744117X,emeritus editor-in-chief,t p mommsen,"university of victoria, victoria, bc, canada",2015,,,mommsen,t,p,university of victoria
2,1744117X,editorial board,j altimiras,"linkoping university, sweden",2015,,,altimiras,j,,linkoping university
3,1744117X,editorial board,g anderson,"university of manitoba, winnipeg, manitoba, ca...",2015,,,anderson,g,,university of manitoba winnipeg manitoba
4,1744117X,editorial board,n j bernier,"university of guelph, guelph, on, canada",2015,,,bernier,n,j,university of guelph guelph


## Match editor names

Find potential matches who have the same name as editors.

In [14]:
def filterSameName(merged):
    merged = merged[(merged.first_x == '') | (merged.first_y == '') | (merged.first_x == merged.first_y)]
    merged = merged[(merged.middle_x =='') | (merged.middle_y == '') | (merged.middle_x == merged.middle_y)]
    merged = merged[(merged.m_ini_x == '') | (merged.m_ini_y == '') | (merged.m_ini_x == merged.m_ini_y)]
    merged = merged.drop(['first_x','first_y','middle_x','middle_y','last','f_ini','m_ini_x','m_ini_y'],axis=1)
    
    return merged
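
The matching rule in `filterSameName` treats a blank field on either side as a wildcard. A minimal sketch on made-up data (the names and ids below are invented, and `compatible` is a hypothetical helper) shows how a merge on `last` and `f_ini` followed by that rule keeps or drops candidates:

```python
import pandas as pd

# made-up editors: only initials are known
editors = pd.DataFrame({'last': ['smith', 'lee'], 'f_ini': ['t', 'n'],
                        'first': ['', ''], 'm_ini': ['p', 'j']})
# made-up author records
authors = pd.DataFrame({'last': ['smith', 'smith', 'lee'],
                        'f_ini': ['t', 't', 'n'],
                        'first': ['tom', 'tina', 'nick'],
                        'm_ini': ['p', 'q', ''],
                        'NewAuthorId': [1, 2, 3]})

# shared non-key columns get the suffixes _x (editor side) and _y (author side)
merged = editors.merge(authors, on=['last', 'f_ini'])


def compatible(x, y):
    # a blank on either side acts as a wildcard
    return (x == '') | (y == '') | (x == y)


kept = merged[compatible(merged.first_x, merged.first_y)
              & compatible(merged.m_ini_x, merged.m_ini_y)]
print(kept[['last', 'NewAuthorId']])
```

Author 2 is dropped because its middle initial conflicts with the editor's, while author 3 is kept because its missing middle initial cannot contradict anything.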

In [15]:
%%time
authors = pd.read_csv("../data/mag/AuthorNames.csv", sep='\t', memory_map=True,
                      usecols=['NewAuthorId', 'first','middle','last', 'f_ini', 'm_ini'],
                      dtype={'NewAuthorId':int,'first':str,'last':str,'middle':str,'m_ini':str,'f_ini':str})
print(authors.shape)

(188651, 6)
CPU times: user 72.5 ms, sys: 25.8 ms, total: 98.3 ms
Wall time: 98.8 ms


In [16]:
all_editors = all_editors.fillna('')
authors = authors.fillna('')

In [17]:
%%time
merged = all_editors.merge(authors, on=['last', 'f_ini'])
print(merged.shape)

merged = filterSameName(merged)
print(merged.shape) # (195631, 7)

(253968, 15)
(195631, 7)
CPU times: user 256 ms, sys: 69.7 ms, total: 325 ms
Wall time: 330 ms


In [18]:
merged.head()

Unnamed: 0,issn,title,normalized_name,editorAff,Year,cleaned_aff,NewAuthorId
0,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,memorial university of newfoundland newfoundl...,68738365
1,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,memorial university of newfoundland newfoundl...,7616278
2,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,memorial university of newfoundland newfoundl...,10083403
3,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,memorial university of newfoundland newfoundl...,10444668
4,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",2015,memorial university of newfoundland newfoundl...,11042510


## Match editor affiliations

We combine three similarity measures to decide whether two differently spelled institution names refer to the same affiliation.

`../data/manual_labeled_same.csv` and `../data/manual_labeled_different.csv` contain pairs of institutions that were manually labelled as the same or as different. We use these labelled pairs as validation sets to assess the performance of the chosen parameters. `../data/AllAffiliationSpellings.csv` contains every affiliation that appears as the affiliation of an editor (according to Elsevier) or of a possible match (according to MAG).

In [19]:
def subfuzz(a, b):
    if len(a) < len(b):
        a, b = b, a
    # len(a) >= len(b)
    score = -1
    lenb = len(b)
    
    for start in range(len(a) - len(b)+1):
        score = max(score, fuzz.ratio(a[start:start+lenb], b))
        
    return score

assert(subfuzz('aaab', 'ab') == 100)
assert(subfuzz('ab', 'aaab') == 100)
assert(subfuzz('abcdabcabc', 'bcxy') == 50)
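
To see why a sliding-window score is useful next to a whole-string score, the sketch below compares the two on a cleaned affiliation that still carries a leftover city. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the exact scores may differ slightly from fuzzywuzzy's; `ratio` and `best_window_ratio` are hypothetical names.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # difflib stand-in for fuzz.ratio, scaled to 0-100
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_window_ratio(a, b):
    """Best score of the shorter string against every equal-length window of the longer."""
    if len(a) < len(b):
        a, b = b, a
    return max(ratio(a[i:i + len(b)], b) for i in range(len(a) - len(b) + 1))


editor_aff = 'university of guelph guelph'   # cleaned affiliation with a leftover city
mag_aff = 'university of guelph'

print(ratio(editor_aff, mag_aff), best_window_ratio(editor_aff, mag_aff))
```

The whole-string score is penalised by the trailing city, whereas the window score finds the exact substring and is unaffected by the extra text.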

#### Get affiliation names 

In [21]:
%%time
year_aff = pd.read_csv("../data/mag/AuthorAffYear.csv", sep='\t', memory_map=True,
                      usecols=['Year', 'NewAuthorId', 'AffiliationId'],
                      dtype={'Year':int, 'NewAuthorId':int, 'AffiliationId':int})
print(year_aff.shape)

affiliations = pd.read_csv("../data/mag/Affiliations.txt", sep="\t", memory_map=True,
                          names=['AffiliationId', "Rank", "NormalizedName", "DisplayName", "GridId",
                                 "OfficialPage", "WikiPage", "PaperCount", "CitationCount", "Latitude",
                                 "Longitude", 'CreatedDate'], usecols=['AffiliationId', 'NormalizedName'],
                          dtype={'AffiliationId': int, 'NormalizedName': str}).rename(
    columns={'NormalizedName':'AffName'}).drop_duplicates()
print(affiliations.shape)

(156296, 3)
(25512, 2)
CPU times: user 89 ms, sys: 4.7 ms, total: 93.7 ms
Wall time: 94.5 ms


In [22]:
%%time
merged = merged.merge(year_aff, on=['NewAuthorId','Year'])
print(merged.shape)

merged = merged.merge(affiliations, on='AffiliationId')
print(merged.shape)

(13085, 8)
(13085, 9)
CPU times: user 75.6 ms, sys: 0 ns, total: 75.6 ms
Wall time: 75.3 ms


In [23]:
all_matched = merged.drop(['Year', 'AffiliationId'], axis=1).drop_duplicates() # all potential matches
all_matched = all_matched.rename(columns={'normalized_name':'EditorName'})

all_matched.shape

(10081, 7)

In [24]:
all_matched.head(3)

Unnamed: 0,issn,title,EditorName,editorAff,cleaned_aff,NewAuthorId,AffName
0,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",memorial university of newfoundland newfoundl...,16778074,norwegian university of science and technology
1,1744117X,editorial board,m rise,"memorial university of newfoundland, st. johna...",memorial university of newfoundland st. johna...,16778074,norwegian university of science and technology
2,1744117X,editor-in-chief,m rise,"memorial university of newfoundland, newfoundl...",memorial university of newfoundland newfoundl...,23046221,memorial university of newfoundland


#### Get corpus

In [25]:
same = pd.read_csv("../data/manual_labeled_same.csv", sep="\t", index_col=0)
diff = pd.read_csv("../data/manual_labeled_different.csv", sep="\t", index_col=0)
allAff = pd.read_csv("../data/AllAffiliationSpellings.csv", sep="\t").fillna('')

same.shape, diff.shape, allAff.shape

((3337, 3), (9541, 3), (169128, 1))

In [26]:
allAff.head()

Unnamed: 0,Affiliations
0,"university hospital lausanne lausanne, switzer..."
1,"arias (madrid, spain"
2,instituto de agroquimica y tecnologia de alime...
3,"university of padova, italy"
4,"(ispra, italy"


In [27]:
corpus = list(set(allAff.Affiliations).union(
    set(same.affiliation)).union(set(same.affiliationMag)).union(
    set(diff.affiliation)).union(set(diff.affiliationMag)))

# corpus = [re.sub('[^0-9a-zA-Z]+', ' ', x) for x in corpus]
len(corpus)

169744

In [28]:
m = {}
for ind in range(len(corpus)):
    m[ corpus[ind] ] = ind

In [29]:
%%time
vectorizer = TfidfVectorizer()

sparse_matrix = vectorizer.fit_transform(corpus)

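# keep the matrix sparse: a dense 169744 x 99128 float array needs over 100 GB,
# and cosine_similarity accepts sparse rows directly
doc_term_matrix = sparse_matrix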

print(doc_term_matrix.shape)

(169744, 99128)
CPU times: user 2.5 s, sys: 3.25 s, total: 5.75 s
Wall time: 5.85 s
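
For intuition about what the TF-IDF cosine score rewards, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: the tokenisation and the smoothed idf formula are assumptions chosen to resemble the library's defaults, and `tfidf_vectors` and `cosine` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors (as dicts) with a smoothed idf."""
    tokenised = [re.findall(r'\w\w+', d.lower()) for d in docs]
    n = len(docs)
    # document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenised:
        vec = {t: tf * idf[t] for t, tf in Counter(toks).items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def cosine(u, v):
    # vectors are unit length, so the dot product is the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = ['university of guelph', 'university of guelph guelph', 'linkoping university']
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Common words such as "university" receive a low weight, so two names that share only that word score far lower than two spellings of the same institution.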


#### Three outcomes and filter

In [30]:
%%time
all_matched = all_matched.assign(tfidf=all_matched.apply(lambda row: 
        cosine_similarity(doc_term_matrix[m[row['AffName']]], doc_term_matrix[m[row['cleaned_aff']]])[0][0], axis=1))

CPU times: user 14.3 s, sys: 549 ms, total: 14.9 s
Wall time: 15.1 s


In [31]:
%%time
all_matched = all_matched.assign(fuz = all_matched.apply(
    lambda row: fuzz.ratio(row['cleaned_aff'], row['AffName']), axis=1))

all_matched = all_matched.assign(subfuz = all_matched.apply(
    lambda row: subfuzz(row['cleaned_aff'], row['AffName']), axis=1))

CPU times: user 1.51 s, sys: 20.8 ms, total: 1.53 s
Wall time: 1.57 s


In [32]:
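# NOTE: the second subfuz test is redundant (subfuz >= 94 implies subfuz >= 19);
# the fuz score is computed above but not used in this filter
filtered = all_matched[(all_matched.tfidf>=33/100) & (all_matched.subfuz>=94) & (all_matched.subfuz>=19)]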
filtered.shape # (1915, 11) # (1617, 11)

(183, 10)

In [33]:
# only keep unique matches (each editor (issn, name) is uniquely matched to one author in MAG)
filtered = filtered.drop_duplicates(subset=['issn', 'EditorName', 'NewAuthorId'])
filtered = filtered.drop_duplicates(subset=['issn', 'EditorName'], keep=False)

filtered.shape

(47, 10)
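
The two `drop_duplicates` calls do different jobs, which a small made-up example may make clearer. The editor names and author ids below are invented for illustration.

```python
import pandas as pd

candidates = pd.DataFrame({
    'issn': ['1744117X'] * 4,
    'EditorName': ['a one', 'a one', 'b two', 'b two'],   # made-up editors
    'NewAuthorId': [10, 10, 20, 21],                       # made-up ids
})

# collapse repeated (editor, author) pairs, e.g. one author matched via two rows
pairs = candidates.drop_duplicates(subset=['issn', 'EditorName', 'NewAuthorId'])
# keep=False drops every editor that still has more than one candidate author
unique = pairs.drop_duplicates(subset=['issn', 'EditorName'], keep=False)
print(unique)
```

Editor "a one" survives because both rows point to the same author, while "b two" is removed entirely because two different authors remain plausible.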

In [34]:
# find the first and last year in which each matched editor appears on a board
editors = filtered.merge(all_editors[['issn','normalized_name','Year']].drop_duplicates().rename(
    columns={'normalized_name':'EditorName'}), on=['issn','EditorName']).groupby(
    ['issn','NewAuthorId']).agg({'Year':['min', 'max']}).reset_index()
editors.shape # 47

(47, 4)

In [35]:
editors.columns = [f'{i} {j}'.strip() for i, j in editors.columns]

In [36]:
editors = editors.rename(columns={'Year min':'start_year', 'Year max':'end_year'})

In [37]:
editors.head()

Unnamed: 0,issn,NewAuthorId,start_year,end_year
0,1744117X,345472,2015,2015
1,1744117X,1276115,2015,2015
2,1744117X,1465789,2015,2015
3,1744117X,1504509,2015,2015
4,1744117X,1881740,2015,2015


Since we parsed only one editorial page, the start and end of the editorial career is 2015 for every editor. Applying the same procedure to the entire set of editorial pages yields the complete set of editors, along with accurate start and end years for their editorial careers.

From the set of editors obtained this way, we sample 10 editors from the aforementioned journal, stored at `../data/SampleEditors.csv`, and use them to demonstrate the subsequent analysis.
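
As a rough sketch of how the start and end years would look with more than one year of editorial pages, the example below runs a grouped min/max on made-up matches. The ISSN, author ids, and years are placeholders, and the named-aggregation form of `agg` is used here only for brevity.

```python
import pandas as pd

# made-up multi-year matches for a placeholder journal
matches = pd.DataFrame({
    'issn': ['0000000X'] * 5,
    'NewAuthorId': [10, 10, 10, 20, 20],
    'Year': [2012, 2013, 2015, 2014, 2015],
})

# first and last year in which each author appears on the journal's board
spans = (matches.groupby(['issn', 'NewAuthorId'])
                .agg(start_year=('Year', 'min'), end_year=('Year', 'max'))
                .reset_index())
print(spans)
```

Note that a gap year (2014 for author 10 here) is not visible in the result, since only the first and last appearances are kept.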