# Linking Affiliation with Network Data

In this notebook I start linking the universities/departments that appear in the network dataset to their affiliation IDs in the MAG dataset. There are few enough universities that, where the school name does not match exactly, I can do a little manual work to find the appropriate ID.

But first, I will make it easy for myself: normalize the university names, do a one-to-one match on the normalized name, see what proportion of schools match, and then proceed from there.

Ultimately, this will be helpful because we will link the network dataset directly to the MAG dataset by matching each author's name (direct matches only) and their most recent affiliation. Hopefully that will give us a more accurate look at the citation and publication counts for everyone.
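As a rough illustration of the plan above, the sketch below normalizes a few school names, looks each one up in a toy table of MAG names, and reports the share that matched. The `normalize_school` helper, the `mag_ids` dictionary, and its ID values are made up for the example; the cleaning rules are only meant to mirror the ones used later in this notebook.

```python
import re


def normalize_school(name: str) -> str:
    """Lowercase a school name and strip the punctuation that blocks exact matches."""
    name = name.lower()
    name = re.sub(r"[,.]", "", name)    # drop commas and periods
    name = re.sub(r"[-'&]", " ", name)  # hyphens, apostrophes, ampersands become spaces
    return name.strip()


# Toy inputs: whether a given school matches in the real MAG file is not assumed here.
network_depts = [
    "Kent State University",
    "University of California-Irvine",
    "University of Maryland",
]
mag_ids = {  # hypothetical NormalizedName -> AffiliationId lookup
    "kent state university": 1,
    "university of california irvine": 2,
}

# One-to-one lookup on the normalized name; None marks a school with no exact match.
matches = {dept: mag_ids.get(normalize_school(dept)) for dept in network_depts}
unmatched = [dept for dept, aff_id in matches.items() if aff_id is None]
match_rate = 1 - len(unmatched) / len(network_depts)

print(f"{match_rate:.0%} matched; to resolve by hand: {unmatched}")
```

Whatever lands in `unmatched` is the short list of schools to look up manually.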

In [16]:
import os
import re
import csv
import glob
import json
import functions
import pandas as pd
import dask.dataframe as dd

os.chdir('/home/timothyelder/mag')

path = '/project/jevans/MAG_0802_2021_snap_shot/'

# read in the normalized faculty names from the network data
with open("data/faculty_names.txt", "r") as f:
    faculty_names = json.loads(f.read())

faculty_df_complete = pd.read_csv('data/faculty_df_complete.csv')
faculty_df_complete = faculty_df_complete[faculty_df_complete["year_observed"] >= 2020] # subset to just 2020-21 observations

departments = set(faculty_df_complete['current_dept'].to_list())
departments = list(departments)

In [12]:
dept_df = faculty_df_complete.groupby(['current_dept']).size().reset_index(name='counts')
dept_df["pct"] = (dept_df.counts / dept_df.counts.sum())*100
dept_df.sort_values(by=['pct'], ascending=False)

#dept_df[dept_df["current_dept"] == "University of Maryland"]

Unnamed: 0,current_dept,counts,pct
89,University of California-Los Angeles,75,1.427756
23,Florida International University,74,1.408719
88,University of California-Irvine,73,1.389682
130,University of North Carolina at Greensboro,71,1.351609
156,Virginia Tech,64,1.218351
...,...,...,...
119,University of Mississippi,9,0.171331
143,University of Texas at Dallas,8,0.152294
68,Southern Illinois University-Carbondale,7,0.133257
138,University of South Alabama,6,0.114220


## Normalizing Department Names

In [13]:
for idx, i in enumerate(departments):
    i = i.lower()
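    i = re.sub(r',|\.', '', i)  # drop commas and periods
    i = re.sub('-', ' ', i)     # hyphens become spaces
    i = re.sub("'", ' ', i)     # apostrophes become spaces
    i = re.sub('&', ' ', i)     # ampersands become spaces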
    
    i = i.strip()
    departments[idx] = i

### Load MAG Affiliations and direct match

This gives us a dataframe with the AffiliationId for each department that appears in the network data. We can then merge it into `faculty_df_complete` to bring the ID in, and then match against the large authors table on the normalized name and the affiliation ID.
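One detail of that merge is worth seeing on a small scale: a left merge leaves `NaN` for departments with no MAG match, and pandas then stores the ID column as floats, which is why IDs can display as `181401687.0`. The sketch below uses toy frames; "Doe, Jane" and "Unmatched College" are invented rows, and casting to the nullable `Int64` dtype is one possible way to keep the IDs as integers, not something the notebook itself does.

```python
import pandas as pd

# Toy stand-ins for faculty_df_complete and mag_departments.
faculty = pd.DataFrame({
    "faculty_name": ["Bader, Michael", "Doe, Jane"],             # second row is invented
    "current_dept": ["American University", "Unmatched College"],
})
mag_departments = pd.DataFrame({
    "AffiliationId": [181401687],
    "DisplayName": ["American University"],
})

# A left merge keeps every faculty row, matched or not.
merged = faculty.merge(mag_departments, how="left",
                       left_on="current_dept", right_on="DisplayName")

# The unmatched row holds NaN, so the integer IDs were upcast to float64.
dtype_after_merge = merged["AffiliationId"].dtype
print(dtype_after_merge)

# Optional: the nullable integer dtype keeps IDs integral and marks gaps as <NA>.
merged["AffiliationId"] = merged["AffiliationId"].astype("Int64")
print(merged[["faculty_name", "AffiliationId"]])
```

Comparing `len(merged)` with `len(faculty)`, as the notebook does with its two `print` calls, confirms that the merge did not duplicate rows.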

In [14]:
affiliations_df = dd.read_csv(path + 'Affiliations.txt',
                              dtype={0: 'int64'},  # column 0 is AffiliationId; the file has no header row, so columns are keyed by position
                              sep="\t", header=None,
                              error_bad_lines=False,
                              quoting=csv.QUOTE_NONE,
                              encoding='utf-8')

new_columns = ['AffiliationId', 'Rank', 'NormalizedName', 'DisplayName',
               'GridId', 'OfficialPage', 'WikiPage', 'PaperCount', 
               'PaperFamilyCount', 'CitationCount', 'Iso3166Code', 'Latitude', 
               'Longitude', 'CreatedDate']

affiliations_df = affiliations_df.rename(columns=dict(zip(affiliations_df.columns, new_columns)))

mag_departments = affiliations_df[affiliations_df['NormalizedName'].isin(set(departments))].compute()

mag_departments = mag_departments.drop(columns=['Rank',
               'GridId', 'OfficialPage', 'WikiPage', 'PaperCount', 
               'PaperFamilyCount', 'CitationCount', 'Iso3166Code', 'Latitude', 
               'Longitude', 'CreatedDate'])

mag_departments.head()





Unnamed: 0,AffiliationId,NormalizedName,DisplayName
711,114395901,university of nebraska lincoln,University of Nebraska–Lincoln
714,121847817,the graduate center cuny,"The Graduate Center, CUNY"
917,110378019,southern illinois university carbondale,Southern Illinois University Carbondale
928,157725225,university of illinois at urbana champaign,University of Illinois at Urbana–Champaign
1116,149910238,kent state university,Kent State University


# Strategy 1: Direct Matches between names and affiliation ids

In [15]:
print(len(faculty_df_complete))
faculty_df_complete = faculty_df_complete.merge(mag_departments, how='left', left_on="current_dept", right_on="DisplayName")
print(len(faculty_df_complete))


5253
5253


In [16]:
faculty_df_complete = faculty_df_complete.drop(columns=["DisplayName", "NormalizedName", "position", 
                                  "interests", "highest_degree", "phd_year", 
                                  "source_dept"])

faculty_df_complete = faculty_df_complete.rename(columns={ "faculty_name":"NormalizedName", "AffiliationId":"LastKnownAffiliationId"})



In [17]:
faculty_df_complete.head()

Unnamed: 0,NormalizedName,current_dept,year_observed,LastKnownAffiliationId
0,"Angotti, Nicole",American University,2020,181401687.0
1,"Bader, Michael",American University,2020,181401687.0
2,"Blankenship, Kim M",American University,2020,181401687.0
3,"Castaneda, Ernesto",American University,2020,181401687.0
4,"Dondero, Molly",American University,2020,181401687.0


## Merging Dataframes and Matching with MAG to get Authors

### Normalizing faculty names

In [18]:
pattern = r'(.+,)(.+)'      # "Last, First": everything through the last comma, then the rest
aux_pattern = r'(\S+)(.+)'  # fallback when there is no comma: first token, then the rest

faculty_names = faculty_df_complete['NormalizedName'].to_list()

for idx, i in enumerate(faculty_names):
    i = re.sub(r';|:', ',', i)  # treat ; and : as mistyped commas
    # try the comma pattern first, then fall back to the auxiliary pattern
    match = re.search(pattern, i) or re.search(aux_pattern, i)

    new_name = match.group(2) + ' ' + match.group(1)  # reorder to "first last"
    new_name = re.sub('/', 'l', new_name, count=1)  # replaces / with l, a common error
    new_name = re.sub(',', '', new_name, count=1)
    new_name = re.sub(r'\.', '', new_name, count=1)
    new_name = re.sub('‘', '', new_name, count=1)
    new_name = new_name.lower()
    new_name = new_name.strip()

    faculty_names[idx] = new_name

faculty_df_complete['NormalizedName'] = faculty_names

In [9]:
# Load authors dataframe from MAG
authors_df = dd.read_csv(path + 'Authors.txt',
                                           sep="\t", header=None,
                                           error_bad_lines=False,
                                           quoting=csv.QUOTE_NONE,
                                           encoding='utf-8')

new_columns = ['AuthorId', 'Rank',
               'NormalizedName', 'DisplayName',
               'LastKnownAffiliationId', 'PaperCount',
               'PaperFamilyCount', 'CitationCount', 'CreatedDate']

authors_df = authors_df.rename(columns=dict(zip(authors_df.columns, new_columns)))

In [None]:
#authors_df = authors_df.set_index(['LastKnownAffiliationId'])
#faculty_df_complete = faculty_df_complete.set_index(['LastKnownAffiliationId'])

filtered_authors = authors_df[authors_df['LastKnownAffiliationId'].isin(faculty_df_complete.LastKnownAffiliationId)].compute()
print(len(filtered_authors))

filtered_authors = faculty_df_complete.merge(filtered_authors, how="left", on=["NormalizedName", "LastKnownAffiliationId"])
print(len(filtered_authors))


In [39]:
# Filter authors dataframe for authors that appear in network data.
# filtered_authors = authors_df[authors_df['NormalizedName'].isin(faculty_df_complete.faculty_name)].compute()

In [47]:
df = filtered_authors.dropna()

In [49]:
len(set(df.NormalizedName))/len(set(faculty_df_complete.NormalizedName))

0.38061968408262453

In [44]:
len(set(faculty_df_complete.NormalizedName))

3292

## Renaming Columns and Merging

What I want is to match on each name together with its current affiliation ID.
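A minimal sketch of that two-key match, on toy data: the faculty table has one row per person per `year_observed`, so it is reduced to one row per name and affiliation before merging, and the share of people who found an author record is computed afterwards. The `AuthorId` values and the "jane doe" row are hypothetical, and deduplicating first is a suggestion rather than what the cells below do.

```python
import pandas as pd

# Toy faculty table: the same person appears once per observed year.
faculty = pd.DataFrame({
    "NormalizedName": ["kim m blankenship", "kim m blankenship", "jane doe"],
    "year_observed": [2020, 2021, 2021],
    "LastKnownAffiliationId": [181401687, 181401687, 181401687],
})

# Toy authors table; AuthorId values are placeholders, not real MAG IDs.
authors = pd.DataFrame({
    "AuthorId": [101, 102],
    "NormalizedName": ["kim m blankenship", "jane doe"],
    "LastKnownAffiliationId": [181401687, 555],  # jane doe sits at a different affiliation
})

keys = ["NormalizedName", "LastKnownAffiliationId"]

# One row per person and affiliation, so repeated years do not inflate the result.
people = faculty.drop_duplicates(subset=keys)

# Both keys must agree for a row to pick up an AuthorId.
linked = people.merge(authors, how="left", on=keys)

share = linked["AuthorId"].notna().mean()
print(f"{share:.0%} of people matched on name and affiliation")
```

Here "jane doe" matches on name but not on affiliation, so she stays unmatched; that is the kind of case the fuzzy and name-only passes later in the notebook try to recover.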

In [43]:
new_columns = ['NormalizedName', 'current_dept', 'source_dept', 'phd_year',
       'highest_degree', 'interests', 'position', 'year_observed',
       'LastKnownAffiliationId', 'Rank', 'DisplayName', 'GridId',
       'OfficialPage', 'WikiPage', 'PaperCount', 'PaperFamilyCount',
       'CitationCount', 'Iso3166Code', 'Latitude', 'Longitude', 'CreatedDate']

faculty_df_complete = faculty_df_complete.rename(columns=dict(zip(faculty_df_complete.columns, new_columns)))



In [44]:
df = pd.merge(faculty_df_complete, filtered_authors,  on=['NormalizedName', 'LastKnownAffiliationId'], how = "left", copy=False)
#len(set(df.AuthorId.to_list()))
df = df.dropna()
len(set(df.NormalizedName.to_list()))

2616

In [45]:
newest_df = filtered_authors[filtered_authors["LastKnownAffiliationId"].isin(mag_departments.AffiliationId)]

In [58]:
newest_df[newest_df["NormalizedName"].str.contains('danielle')]

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
246684,1164340995,17204,danielle bessett,Danielle Bessett,63135867.0,29,29,347,2016-06-24
105623,3147025750,20504,danielle kane,Danielle Kane,219193219.0,2,2,0,2021-04-13


In [157]:
len(set(newest_df.NormalizedName.to_list()))/len(set(faculty_df_complete.NormalizedName.to_list()))

29.138686131386862

In [158]:
newest_df

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
471,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24
736,284723,16745,charles n halaby,Charles N. Halaby,135310074.0,9,9,1029,2016-06-24
1602,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24
4756,2033759,18047,georgi derluguian,Georgi Derluguian,111979921.0,16,16,71,2016-06-24
13549,6101366,16127,joanna dreby,Joanna Dreby,392282.0,56,56,1507,2016-06-24
...,...,...,...,...,...,...,...,...,...
855224,3186726551,20463,geoff ward,Geoff Ward,12912129.0,2,2,2,2021-08-02
909291,3186854906,18517,omer r galle,Omer R. Galle,200719446.0,4,4,59,2021-08-02
927882,3186899265,20504,victor chen,Victor Chen,921990950.0,2,2,0,2021-08-02
1970,3186955294,20254,michael toney,Michael Toney,39422238.0,1,1,16,2021-08-02


In [55]:
mag_departments[mag_departments["AffiliationId"] == 39422238]

Unnamed: 0,AffiliationId,Rank,NormalizedName,DisplayName,GridId,OfficialPage,WikiPage,PaperCount,PaperFamilyCount,CitationCount,Iso3166Code,Latitude,Longitude,CreatedDate
20601,39422238,6531,university of illinois at chicago,University of Illinois at Chicago,grid.185648.6,http://www.uic.edu/,http://en.wikipedia.org/wiki/University_of_Ill...,118465,115030,3564311,US,41.871887,-87.64925,2016-06-24


# Strategy 2: Fuzzy Matches

To recap, we want to create a list of probable matches between each faculty name as it appears in the network dataset and the author name that appears in the MAG data. Then we return the AuthorIds and proceed with the publication analysis. In strategy 1 we attempted this by taking only direct matches on the normalized faculty name, the normalized author name, and the affiliation ID.

In strategy 2 we will do something similar, but using fuzzy matches.
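To make the idea of a fuzzy match concrete, here is a small sketch using `difflib.SequenceMatcher` from the standard library. This is a stand-in chosen for illustration: the match files loaded below were produced separately, and the scores here are not on the same scale as their `best_match_score` column. The `best_match` helper and the 0.5 threshold are assumptions for the example.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=0.5):
    """Return (candidate, score) for the most similar candidate, or None if none clears the threshold."""
    if not candidates:
        return None
    # Score every candidate by character-level similarity in [0, 1].
    scored = [(c, SequenceMatcher(None, name, c).ratio()) for c in candidates]
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else None


# A few normalized author names of the kind that appear in the match output.
mag_names = ["stephanie w hartwell", "heather dell", "j l richard"]

print(best_match("stephanie hartwell", mag_names))  # a middle initial is tolerated
print(best_match("zzz qqq", mag_names))             # nothing similar enough
```

A low threshold keeps more true matches but also admits pairs that are clearly different people, so fuzzy candidates still need a second check such as the affiliation ID.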

In [176]:
df = faculty_df_complete.drop(columns=['current_dept', 'source_dept', 'phd_year',
       'highest_degree', 'interests', 'position', 'year_observed',
       'NormalizedName', 'DisplayName'])

In [19]:
# concatenating individual matched dataframes
match_path = r'/home/timothyelder/mag/data/matches'
all_files = glob.glob(match_path + "/*.csv")
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files) # lazily read each match file into a dataframe

df_merged = pd.concat(df_from_each_file, ignore_index=True) # Concatenate pandas dataframes

names = df_merged['network_name'].to_list()

for idx, i in enumerate(names):
    i = re.sub(r'\.', '', i)  # drop periods so initials match MAG's normalized names
    names[idx] = i

df_merged['network_name'] = names


candidate_matches = df_merged[df_merged['network_name'] == df_merged['NormalizedName']]
df_merged = df_merged[df_merged['network_name'] != df_merged['NormalizedName']]
df_merged = df_merged[df_merged['best_match_score'] >= .50]
candidate_matches = pd.concat([candidate_matches, df_merged])  # DataFrame.append was removed in pandas 2.0
candidate_matches = candidate_matches.drop(columns=['Unnamed: 0', 'best_match_score', '__id_left', '__id_right', 'faculty_name'])
candidate_matches = candidate_matches.rename(columns={"network_name": "faculty_name"})
candidate_matches

Unnamed: 0,faculty_name,AuthorId,NormalizedName
65,kim m blankenship,3.048999e+09,kim m blankenship
295,stephen j pfohl,3.045411e+09,stephen j pfohl
302,david h smith,3.043000e+09,david h smith
308,stephen j pfohl,3.045411e+09,stephen j pfohl
343,stephen j pfohl,3.045411e+09,stephen j pfohl
...,...,...,...
2147649,stephanie hartwell,3.068725e+09,stephanie w hartwell
2147658,heather e dillaway,3.069277e+09,heather dell
2147763,richard l gee,3.066568e+09,j l richard
2147787,richard l gee,3.066568e+09,j l richard


In [30]:
candidate_matches = candidate_matches[candidate_matches['faculty_name'].isin(faculty_df_complete["NormalizedName"])].drop_duplicates()
candidate_matches

Unnamed: 0,faculty_name,AuthorId,NormalizedName
65,kim m blankenship,3.048999e+09,kim m blankenship
295,stephen j pfohl,3.045411e+09,stephen j pfohl
704,laura j miller,3.042681e+09,laura j miller
713,karen v hansen,3.047574e+09,karen v hansen
860,john p hoffmann,3.048419e+09,john p hoffmann
...,...,...,...
2145041,michael murphy,3.066969e+09,michael r murphy
2146011,theresa a martinez,3.068348e+09,a martinez torres
2147422,erik johnson,3.066532e+09,johnson erik keith
2147581,heather e dillaway,3.069277e+09,heather dell


In [31]:
len(set(candidate_matches['faculty_name']))/len(set(faculty_df_complete.NormalizedName))

0.551640340218712

In [27]:
len(set(faculty_df_complete.NormalizedName))

3292

In [28]:
len(set(df_merged['faculty_name']))

3630

In [12]:
len(df_merged)
print(df_merged.best_match_score.std())
df_merged.best_match_score.min()
df_merged.best_match_score.max()
df_merged.best_match_score.hist(bins=1000)

NameError: name 'df_merged' is not defined

In [55]:
new_df = candidate_matches.merge(df, how="left", on="NormalizedName")
new_df.dropna()

Unnamed: 0,faculty_name,AuthorId_x,NormalizedName,current_dept,year_observed,LastKnownAffiliationId,AuthorId_y,Rank,DisplayName,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
0,kim m blankenship,3.048999e+09,kim m blankenship,American University,2020.0,181401687.0,2.689600e+09,15490.0,Kim M. Blankenship,78.0,76.0,2876.0,2017-06-30
1,kim m blankenship,3.048999e+09,kim m blankenship,American University,2021.0,181401687.0,2.689600e+09,15490.0,Kim M. Blankenship,78.0,76.0,2876.0,2017-06-30
12,laura j miller,3.042681e+09,laura j miller,Brandeis University,2020.0,6902469.0,2.736977e+09,19588.0,Laura J. Miller,5.0,5.0,0.0,2017-07-31
13,laura j miller,3.042681e+09,laura j miller,Brandeis University,2021.0,6902469.0,2.736977e+09,19588.0,Laura J. Miller,5.0,5.0,0.0,2017-07-31
14,karen v hansen,3.047574e+09,karen v hansen,Brandeis University,2020.0,6902469.0,2.113332e+09,17127.0,Karen V. Hansen,35.0,35.0,297.0,2016-06-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
109204,theresa martinez,2.345058e+09,theresa a martinez,University of Utah,2020.0,223532165.0,2.113419e+09,18553.0,Theresa A. Martinez,5.0,5.0,111.0,2016-06-24
109205,theresa martinez,2.345058e+09,theresa a martinez,University of Utah,2021.0,223532165.0,2.113419e+09,18553.0,Theresa A. Martinez,5.0,5.0,111.0,2016-06-24
109232,raoul lievanos,2.333482e+09,raoul s lievanos,University of Oregon,2021.0,181233156.0,2.333482e+09,17707.0,Raoul S. Liévanos,23.0,22.0,241.0,2016-06-24
109528,deirdre a royster,3.067059e+09,deirdre royster,New York University,2020.0,57206974.0,2.651869e+09,21197.0,Deirdre Royster,1.0,1.0,0.0,2017-06-30


Unnamed: 0,faculty_name,AuthorId_x,NormalizedName,current_dept,year_observed,LastKnownAffiliationId,AuthorId_y,Rank,DisplayName,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
314,michael o emerson,3.049758e+09,michael o emerson,University of Illinois at Chicago,2021.0,39422238.0,3.049758e+09,20058.0,Michael O. Emerson,3.0,3.0,2.0,2020-08-21
591,john g dale,2.134923e+09,john g dale,George Mason University,2020.0,162714631.0,2.134923e+09,17811.0,John G. Dale,27.0,27.0,38.0,2016-06-24
592,john g dale,2.134923e+09,john g dale,George Mason University,2021.0,162714631.0,2.134923e+09,17811.0,John G. Dale,27.0,27.0,38.0,2016-06-24
593,john g dale,2.134923e+09,john g dale,George Mason University,2020.0,162714631.0,2.134923e+09,17811.0,John G. Dale,27.0,27.0,38.0,2016-06-24
594,john g dale,2.134923e+09,john g dale,George Mason University,2021.0,162714631.0,2.134923e+09,17811.0,John G. Dale,27.0,27.0,38.0,2016-06-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
107260,gene theodori,2.981137e+09,gene l theodori,Sam Houston State University,2021.0,191429286.0,2.981137e+09,16309.0,Gene L. Theodori,59.0,59.0,1187.0,2019-10-25
107263,gene theodori,2.981137e+09,gene l theodori,Sam Houston State University,2020.0,191429286.0,2.981137e+09,16309.0,Gene L. Theodori,59.0,59.0,1187.0,2019-10-25
107265,gene theodori,2.981137e+09,gene l theodori,Sam Houston State University,2021.0,191429286.0,2.981137e+09,16309.0,Gene L. Theodori,59.0,59.0,1187.0,2019-10-25
108793,raoul 8 lievanos,2.333482e+09,raoul s lievanos,University of Oregon,2021.0,181233156.0,2.333482e+09,17707.0,Raoul S. Liévanos,23.0,22.0,241.0,2016-06-24


In [182]:
#df_merged = df_merged[df_merged['best_match_score'] >= .50]
len(set(new_df.NormalizedName))

9931

In [192]:
# Load authors dataframe from MAG
authors_df = dd.read_csv(path + 'Authors.txt',
                                           sep="\t", header=None,
                                           error_bad_lines=False,
                                           quoting=csv.QUOTE_NONE,
                                           encoding='utf-8')

new_columns = ['AuthorId', 'Rank',
               'NormalizedName', 'DisplayName',
               'LastKnownAffiliationId', 'PaperCount',
               'PaperFamilyCount', 'CitationCount', 'CreatedDate']

authors_df = authors_df.rename(columns=dict(zip(authors_df.columns, new_columns)))

# Filter authors dataframe for authors that appear in network data.
#filtered_authors = authors_df[authors_df['AuthorId'].isin(df_merged.AuthorId)].compute()
#len(filtered_authors)

newest_df = authors_df.merge(new_df, how="inner", left_on=["AuthorId","LastKnownAffiliationId"], right_on=["AuthorId", "AffiliationId"]).compute()





In [195]:
len(set(newest_df.AuthorId))

17644

In [204]:
import numpy as np
filtered_authors.dropna(subset=["LastKnownAffiliationId"])

Unnamed: 0,AuthorId,Rank,NormalizedName_x,DisplayName_x,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate,faculty_name,current_dept,source_dept,phd_year,highest_degree,interests,position,year_observed,AffiliationId,NormalizedName_y,DisplayName_y
0,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24,charles kurzman,University of North Carolina at Chapel Hill,University of California-Berkeley,1992,PhD,"['Collective Behavior/Social Movements', ' Pol...",Stadter Distinguished Professor,2020,114027177.0,university of north carolina at chapel hill,University of North Carolina at Chapel Hill
1,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24,charles kurzman,University of North Carolina at Chapel Hill,University of California-Berkeley,1992,PhD,"['Collective Behavior/Social Movements', ' Pol...",Stadter Distinguished Professor,2021,114027177.0,university of north carolina at chapel hill,University of North Carolina at Chapel Hill
2,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24,karen a hegtvedt,Emory University,University of Washington,1984,PhD,"['Emotions', ' Small Groups', ' Social Psychol...",Professor,2020,150468666.0,emory university,Emory University
3,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24,karen a hegtvedt,Emory University,University of Washington,1984,PhD,"['Emotions', ' Small Groups', ' Social Psychol...",Professor,2021,150468666.0,emory university,Emory University
4,6101366,16127,joanna dreby,Joanna Dreby,392282.0,56,56,1507,2016-06-24,joanna dreby,"University at Albany, SUNY","The Graduate Center, CUNY",2007,PhD,"['Children and Youth', ' Family', ' Sex and Ge...",Associate Professor,2020,392282.0,university at albany suny,"University at Albany, SUNY"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
345,3184334690,21197,ming wen,Ming Wen,223532165.0,1,1,0,2021-08-02,ming wen,University of Utah,University of Chicago,2003,PhD,"['Medical Sociology', ' Community', ' Migratio...",Professor/Chair,2020,223532165.0,university of utah,University of Utah
346,3184334690,21197,ming wen,Ming Wen,223532165.0,1,1,0,2021-08-02,ming wen,University of Utah,University of Chicago,2003,PhD,"['Medical Sociology', ' Community', ' Migratio...",Professor/Chair,2021,223532165.0,university of utah,University of Utah
181,3186178984,21197,yun zhou,Yun Zhou,27837315.0,1,1,0,2021-08-02,yun zhou,University of Michigan,Harvard University,2017,PhD,"['Family', ' Sex and Gender', ' Demography']",Assistant Professor,2021,27837315.0,university of michigan,University of Michigan
184,3186307041,21197,yi wu,Yi Wu,8078737.0,1,1,0,2021-08-02,yi wu,Clemson University,Columbia University,2010,PhD,"['Legal Anthropology', ' Land Tenure Systems',...",Assistant Professor of\nAnthropology,2020,8078737.0,clemson university,Clemson University


In [16]:
pattern = r'(.+,)(.+)'      # "Last, First": everything through the last comma, then the rest
aux_pattern = r'(\S+)(.+)'  # fallback when there is no comma: first token, then the rest

faculty_df_complete = pd.read_csv("/home/timothyelder/mag/data/faculty_df_complete.csv")
faculty_names = faculty_df_complete['faculty_name'].to_list()

for idx, i in enumerate(faculty_names):
    i = re.sub(r';|:', ',', i)  # treat ; and : as mistyped commas
    # try the comma pattern first, then fall back to the auxiliary pattern
    match = re.search(pattern, i) or re.search(aux_pattern, i)

    new_name = match.group(2) + ' ' + match.group(1)  # reorder to "first last"
    new_name = re.sub('/', 'l', new_name, count=1)  # replaces / with l, a common error
    new_name = re.sub(',', '', new_name, count=1)
    new_name = re.sub(r'\.', '', new_name, count=1)
    new_name = re.sub('‘', '', new_name, count=1)
    new_name = new_name.lower()
    new_name = new_name.strip()

    faculty_names[idx] = new_name

faculty_df_complete["faculty_name"] = faculty_names

filtered_authors_2 = authors_df[authors_df['NormalizedName'].isin(faculty_df_complete["faculty_name"])].compute()

filtered_authors_2.head()





Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
471,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24
736,284723,16745,charles n halaby,Charles N. Halaby,135310074.0,9,9,1029,2016-06-24
1602,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24
4756,2033759,18047,georgi derluguian,Georgi Derluguian,111979921.0,16,16,71,2016-06-24
6408,2828213,18740,albert j bergesen,Albert J. Bergesen,,11,11,11,2016-06-24


In [17]:
filtered_authors = pd.concat([filtered_authors, filtered_authors_2])  # DataFrame.append was removed in pandas 2.0
len(filtered_authors)

357872

In [18]:
len(set(filtered_authors.NormalizedName))

13762

In [19]:
filtered_authors = filtered_authors[filtered_authors["LastKnownAffiliationId"].isin(mag_departments.AffiliationId)]
len(filtered_authors)

11443

In [20]:
len(set(filtered_authors.NormalizedName))/len(set(faculty_df_complete.faculty_name))

0.6551983927674535

In [21]:
filtered_authors[filtered_authors["NormalizedName"].str.contains('raudenbush')]

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
389291,2285384011,12637,stephen w raudenbush,Stephen W. Raudenbush,40347166.0,170,162,50819,2016-06-24
304912,2319179191,20843,danielle t raudenbush,Danielle T. Raudenbush,40347166.0,1,1,9,2016-06-24
626977,2484153218,20238,danielle t raudenbush,Danielle T. Raudenbush,36258959.0,2,2,19,2016-08-23
705812,2618293213,16989,stephen w raudenbush,Stephen W. Raudenbush,136199984.0,4,4,379,2017-06-05
768216,2975987558,18688,stephen l raudenbush,Stephen L. Raudenbush,27837315.0,1,1,87,2019-10-03


In [22]:
authors2papers_df = dd.read_csv(path + 'PaperAuthorAffiliations.txt',
                                           sep="\t", header=None,
                                           error_bad_lines=False,
                                           quoting=csv.QUOTE_NONE,
                                           encoding='utf-8')

new_columns = ['PaperId', 'AuthorId',
               'AffiliationId', 'AuthorSequenceNumber',
               'OriginalAuthor', 'OriginalAffiliation']

authors2papers_df = authors2papers_df.rename(columns=dict(zip(authors2papers_df.columns, new_columns)))

filtered_authors2papers = authors2papers_df[authors2papers_df['AuthorId'].isin(filtered_authors['AuthorId'])].compute()

# filtered_authors2papers.to_csv('/home/timothyelder/mag/data/authors2papers.csv', index=False)

papers_df = dd.read_csv(path + 'Papers.txt',
                        sep="\t", header=None, dtype={16: 'object', 17: 'object',
                                                      18: 'float64', 19: 'float64',
                                                      20: 'float64', 24: 'object',
                                                       7: 'float64', 9: 'object',
                                                       8: 'string', 14: 'float64'},
                                                       error_bad_lines=False, quoting=csv.QUOTE_NONE,
                                                       encoding='utf-8')

new_columns =['PaperId', 'Rank', 'Doi', 'DocType',
              'PaperTitle', 'OriginalTitle',
              'BookTitle', 'Year', 'Date',
              'OnlineDate', 'Publisher',
              'JournalId', 'ConferenceSeriesId',
              'ConferenceInstanceId', 'Volume',
              'Issue', 'FirstPage', 'LastPage',
              'ReferenceCount', 'CitationCount',
              'EstimatedCitation', 'OriginalVenue',
              'FamilyId', 'FamilyRank', 'DocSubTypes',
              'CreatedDate']

papers_df = papers_df.rename(columns=dict(zip(papers_df.columns, new_columns)))

filtered_papers = papers_df[papers_df['PaperId'].isin(filtered_authors2papers['PaperId'])].compute()

# filtered_papers.to_csv('/home/timothyelder/mag/data/papers.csv', index=False)








In [27]:
filtered_papers = filtered_papers[filtered_papers['DocType'] == "Journal"] 

In [25]:
filtered_authors2papers

Unnamed: 0,PaperId,AuthorId,AffiliationId,AuthorSequenceNumber,OriginalAuthor,OriginalAffiliation
3641,33518,1963213065,,1,Jeffery R. Broadbent,
6366,63235,614356337,78577930.0,2,Denise Kandel,Columbia University and New York Psychiatric I...
7984,79956,2203244522,130769515.0,1,Graham B. Spanier,Penn State-University Park#TAB#
10717,109026,2170308787,,1,David Knox,
15730,162270,2167722686,,1,Robert Sampson,
...,...,...,...,...,...,...
266423,3187040820,2922503776,,2,Wendy Wang,
272143,3187051285,2133141321,111979921.0,3,Klaus Weber,Northwestern U.
278540,3187062372,2972101691,,9,Alice Lee,
283509,3187071747,2099755627,193531525.0,1,Amitai Etzioni,The George Washington University


In [28]:
filtered_authors2papers.to_csv('/home/timothyelder/mag/data/authors2papers.csv', index=False)
filtered_papers.to_csv('/home/timothyelder/mag/data/papers.csv', index=False)

# Adding Direct Matches to Candidates

In [18]:
faculty_df_complete = pd.read_csv('data/faculty_df_complete.csv')

pattern = r'(.+,)(.+)'      # "Last, First": everything through the last comma, then the rest
aux_pattern = r'(\S+)(.+)'  # fallback when there is no comma: first token, then the rest

faculty_names = faculty_df_complete['faculty_name'].to_list()

for idx, i in enumerate(faculty_names):
    i = re.sub(r';|:', ',', i)  # treat ; and : as mistyped commas
    # try the comma pattern first, then fall back to the auxiliary pattern
    match = re.search(pattern, i) or re.search(aux_pattern, i)

    new_name = match.group(2) + ' ' + match.group(1)  # reorder to "first last"
    new_name = re.sub('/', 'l', new_name, count=1)  # replaces / with l, a common error
    new_name = re.sub(',', '', new_name, count=1)
    new_name = re.sub(r'\.', '', new_name, count=1)
    new_name = re.sub('‘', '', new_name, count=1)
    new_name = new_name.lower()
    new_name = new_name.strip()

    faculty_names[idx] = new_name

faculty_df_complete['faculty_name'] = faculty_names

In [19]:
faculty_df_complete['faculty_name']

0                michael bader
1            monica biradavolu
2            kim m blankenship
3        andrea malkin brenner
4                    alan dahl
                 ...          
26379               alka menon
26380           rourke o'brien
26381             philip smith
26382         jonathan wyrtzen
26383                emma zang
Name: faculty_name, Length: 26384, dtype: object

In [37]:
# concatenating individual matched dataframes
match_path = r'/home/timothyelder/matches'
all_files = glob.glob(match_path + "/*.csv")
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files) # lazily read each match file into a dataframe

df_merged = pd.concat(df_from_each_file, ignore_index=True) # Concatenate pandas dataframes

df_merged = df_merged[df_merged['best_match_score'] >= .50]

names = df_merged['network_name'].to_list()

for idx, i in enumerate(names):
    i = re.sub(r'\.', '', i)  # drop periods so initials match MAG's normalized names
    names[idx] = i

df_merged['network_name'] = names

df_merged['network_name']

65          kim m blankenship
79          jennie kronenfeld
112           richard j petts
140            sung joon jang
199            sung joon jang
                  ...        
2147649    stephanie hartwell
2147658    heather e dillaway
2147763         richard l gee
2147787         richard l gee
2148138       john richardson
Name: network_name, Length: 94557, dtype: object

In [38]:
len(df_merged[df_merged['network_name']== df_merged['NormalizedName']])

37144

In [39]:
df_merged.head()

Unnamed: 0.1,Unnamed: 0,best_match_score,__id_left,__id_right,faculty_name,network_name,AuthorId,NormalizedName
65,2990,0.659706,74_left,1571423_right,"Blankenship, Kim M.",kim m blankenship,3048999000.0,kim m blankenship
79,3556,0.586469,89_left,1365075_right,"Kronenfeld, Jennie",jennie kronenfeld,3048038000.0,jennie jacobs kronenfeld
112,4876,0.622506,125_left,850412_right,"Petts, Richard J.",richard j petts,3045549000.0,richard j betts
140,6029,0.553071,155_left,126077_right,"Jang, Sung Joon",sung joon jang,3042249000.0,jang sung jin
199,8864,0.553071,216_left,126077_right,"Jang, Sung Joon",sung joon jang,3042249000.0,jang sung jin
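
The rows above show the same faculty member matched to the same `AuthorId` more than once, and one name can also have several candidate authors. A minimal sketch of one way to collapse these, using toy rows shaped like `df_merged` (the values are illustrative, not taken from the data). Keeping only the top-scoring candidate per name is an assumption about what is wanted here, not something the notebook does.

```python
import pandas as pd

# Toy rows shaped like df_merged; IDs and scores are made up for illustration.
toy = pd.DataFrame({
    'network_name': ['sung joon jang', 'sung joon jang',
                     'kim m blankenship', 'kim m blankenship'],
    'AuthorId': [1, 1, 2, 3],
    'best_match_score': [0.55, 0.55, 0.66, 0.52],
})

# Remove repeated (name, author) pairs first.
deduped = toy.drop_duplicates(subset=['network_name', 'AuthorId'])

# Then keep the highest-scoring candidate for each name.
best = (deduped.sort_values('best_match_score', ascending=False)
               .drop_duplicates(subset='network_name', keep='first')
               .reset_index(drop=True))
print(best)
```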


In [25]:
len(set(faculty_df_complete.faculty_name))

7963

In [26]:
faculty_df_complete.faculty_name

0                michael bader
1            monica biradavolu
2            kim m blankenship
3        andrea malkin brenner
4                    alan dahl
                 ...          
26379               alka menon
26380           rourke o'brien
26381             philip smith
26382         jonathan wyrtzen
26383                emma zang
Name: faculty_name, Length: 26384, dtype: object

In [24]:
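# faculty names that have no counterpart among the fuzzy-matched names
matched_names = set(names) # set membership is much faster than scanning a list
not_shared = []
for i in faculty_names:
    if i not in matched_names:
        not_shared.append(i)
print(len(not_shared))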

1147


In [27]:
# Load authors dataframe from MAG
authors_df = dd.read_csv(path + 'Authors.txt',
                         sep="\t", header=None,
                         on_bad_lines='skip', # pandas >= 1.3; replaces the deprecated error_bad_lines=False
                         quoting=csv.QUOTE_NONE,
                         encoding='utf-8')

new_columns = ['AuthorId', 'Rank',
               'NormalizedName', 'DisplayName',
               'LastKnownAffiliationId', 'PaperCount',
               'PaperFamilyCount', 'CitationCount', 'CreatedDate']

authors_df = authors_df.rename(columns=dict(zip(authors_df.columns, new_columns)))

# Filter authors dataframe for authors that appear in network data.
filtered_authors = authors_df[authors_df['NormalizedName'].isin(faculty_df_complete.faculty_name)].compute()

filtered_authors.head()

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
471,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24
736,284723,16745,charles n halaby,Charles N. Halaby,135310074.0,9,9,1029,2016-06-24
1602,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24
4756,2033759,18047,georgi derluguian,Georgi Derluguian,111979921.0,16,16,71,2016-06-24
6408,2828213,18740,albert j bergesen,Albert J. Bergesen,,11,11,11,2016-06-24


In [40]:
len(set(filtered_authors.AuthorId)) + len(set(df_merged.AuthorId))

369548
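
If any `AuthorId` appears in both frames, adding the two set sizes counts it twice. A minimal sketch of counting distinct IDs with a set union instead, on toy data. The series names `exact` and `fuzzy` are hypothetical stand-ins for `filtered_authors.AuthorId` and `df_merged.AuthorId`, and the float column with a missing value mimics how `AuthorId` is stored in `df_merged`.

```python
import pandas as pd

exact = pd.Series([10, 11, 12], name='AuthorId')         # stand-in for the exact-match IDs
fuzzy = pd.Series([12.0, 13.0, None], name='AuthorId')   # floats with a missing value

# Drop missing values and cast to integers so that 12 and 12.0 compare cleanly.
exact_ids = set(exact.dropna().astype('int64'))
fuzzy_ids = set(fuzzy.dropna().astype('int64'))

naive_total = len(exact_ids) + len(fuzzy_ids)   # counts shared IDs twice
distinct_total = len(exact_ids | fuzzy_ids)     # distinct IDs across both sources
print(naive_total, distinct_total)
```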

In [48]:
df_merged.drop(columns=['Unnamed: 0', 'best_match_score', '__id_left', '__id_right'])

Unnamed: 0,faculty_name,network_name,AuthorId,NormalizedName
65,"Blankenship, Kim M.",kim m blankenship,3.048999e+09,kim m blankenship
79,"Kronenfeld, Jennie",jennie kronenfeld,3.048038e+09,jennie jacobs kronenfeld
112,"Petts, Richard J.",richard j petts,3.045549e+09,richard j betts
140,"Jang, Sung Joon",sung joon jang,3.042249e+09,jang sung jin
199,"Jang, Sung Joon",sung joon jang,3.042249e+09,jang sung jin
...,...,...,...,...
2147649,"Hartwell, Stephanie",stephanie hartwell,3.068725e+09,stephanie w hartwell
2147658,"Dillaway, Heather E.",heather e dillaway,3.069277e+09,heather dell
2147763,"Gee, Richard L.",richard l gee,3.066568e+09,j l richard
2147787,"Gee, Richard L.",richard l gee,3.066568e+09,j l richard
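
`AuthorId` is displayed in scientific notation because the column is stored as floats. A minimal sketch of converting it to pandas' nullable integer dtype so the IDs print in full and missing values are kept. The toy values are illustrative, and this assumes every non-missing ID is a whole number.

```python
import pandas as pd

# Toy column shaped like df_merged['AuthorId']; values are illustrative.
toy = pd.DataFrame({'AuthorId': [3048999000.0, 3048038000.0, None]})

# 'Int64' (capital I) is the nullable integer dtype; missing values become <NA>.
toy['AuthorId'] = toy['AuthorId'].astype('Int64')
print(toy['AuthorId'])
```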


In [51]:
more_filtered_authors = authors_df[authors_df['AuthorId'].isin(df_merged.AuthorId)].compute()

In [53]:
print(len(filtered_authors))
print(len(more_filtered_authors))
print(len(filtered_authors)+len(more_filtered_authors))

filtered_authors = pd.concat([filtered_authors, more_filtered_authors]) # DataFrame.append was removed in pandas 2.0

339359
30189
369548


In [54]:
len(set(filtered_authors.NormalizedName))

13769
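
There are far fewer distinct names than rows in `filtered_authors`, so many MAG authors share a normalized name. A minimal sketch of counting candidate authors per name on toy data (the IDs are made up). Names with more than one candidate are the ones that would need the affiliation to disambiguate.

```python
import pandas as pd

# Toy rows shaped like filtered_authors; IDs are made up for illustration.
toy = pd.DataFrame({
    'AuthorId': [1, 2, 3, 4],
    'NormalizedName': ['david a smith', 'david a smith',
                       'charles kurzman', 'david a smith'],
})

# Number of distinct MAG authors behind each normalized name.
candidates = (toy.groupby('NormalizedName')['AuthorId']
                 .nunique()
                 .sort_values(ascending=False))
ambiguous = candidates[candidates > 1]   # names that match more than one author
print(candidates)
```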

In [57]:
filtered_authors.head()

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
471,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24
736,284723,16745,charles n halaby,Charles N. Halaby,135310074.0,9,9,1029,2016-06-24
1602,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24
4756,2033759,18047,georgi derluguian,Georgi Derluguian,111979921.0,16,16,71,2016-06-24
6408,2828213,18740,albert j bergesen,Albert J. Bergesen,,11,11,11,2016-06-24


In [58]:
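names = set(filtered_authors['NormalizedName'].to_list())
faculty_name_set = set(faculty_names) # set membership avoids rescanning the list

# MAG names in the filtered authors that are not in the network data
not_shared = []
for i in names:
    if i not in faculty_name_set:
        not_shared.append(i)
print(len(not_shared))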

8045


In [60]:
filtered_authors.to_csv('/home/timothyelder/mag/data/authors.csv', index=False)

In [1]:
import os
import re
import csv
import glob
import pandas as pd
import dask.dataframe as dd

os.chdir('/home/timothyelder/mag')

path = '/project/jevans/MAG_0802_2021_snap_shot/'

# Reload the filtered authors dataframe saved above.
filtered_authors = pd.read_csv("data/authors.csv")

filtered_authors

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
0,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24
1,284723,16745,charles n halaby,Charles N. Halaby,135310074.0,9,9,1029,2016-06-24
2,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24
3,2033759,18047,georgi derluguian,Georgi Derluguian,111979921.0,16,16,71,2016-06-24
4,2828213,18740,albert j bergesen,Albert J. Bergesen,,11,11,11,2016-06-24
...,...,...,...,...,...,...,...,...,...
369543,3186984193,21197,dellinger kristen,Dellinger Kristen,,1,1,0,2021-08-02
369544,3187001419,21197,jan frankowski,Jan Frankowski,,1,1,0,2021-08-02
369545,3187005615,18086,kimberly greer,Kimberly A. Greer,592451.0,9,9,224,2021-08-02
369546,3187024181,19048,david a smith,David A. Smith,,1,1,27,2021-08-02


In [2]:
# concatenating individual matched dataframes
match_path = r'/home/timothyelder/mag/data/matches'
all_files = glob.glob(match_path + "/*.csv")
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files) # lazily read each match file into a dataframe

df_merged = pd.concat(df_from_each_file, ignore_index=True) # Concatenate pandas dataframes

names = df_merged['network_name'].to_list()

for idx, i in enumerate(names):
    i = re.sub(r'\.', '', i) # strip periods
    names[idx] = i

df_merged['network_name'] = names

In [3]:
df_merged

Unnamed: 0.1,Unnamed: 0,best_match_score,__id_left,__id_right,faculty_name,network_name,AuthorId,NormalizedName
0,0,0.060897,0_left,1617022_right,"Siegenthaler, Jurg",jurg siegenthaler,3.049211e+09,jurg bernhard
1,22,0.038916,1_left,629022_right,"Clark, Leon E.",leon e clark,3.044426e+09,l e de leon
2,27,0.181168,2_left,147719_right,"Young, Gloria (Gay) A.",gloria (gay) a young,3.042337e+09,a c young
3,77,0.440146,3_left,1550995_right,"Stone, Russell",russell stone,3.048906e+09,russell e stone
4,78,0.330313,4_left,1309911_right,"Dickerson, Bette J.",bette j dickerson,3.047763e+09,j dickerson
...,...,...,...,...,...,...,...,...
2148667,919042,0.346524,25465_left,751186_right,"Christakis, Nicholas A.",nicholas a christakis,3.066812e+09,a nicholas
2148668,919053,0.239714,25466_left,1906987_right,"Gorski, Philip S.",philip s gorski,3.068162e+09,s philip
2148669,919059,0.098331,25468_left,762303_right,"Menon, Alka",alka menon,3.066825e+09,m menon
2148670,919086,0.167088,25469_left,1399533_right,"O'Brien, Rourke",rourke o'brien,3.067569e+09,d a rourke
