# Combine authors in Web of Science and investors in flux sites
In this section, we tag the authors who also acted as investors of flux sites, so that we can later examine the relationship between scholarly publications and research facilities. The only available key that connects the Web of Science and flux site datasets is the name. However, the name of a single author/investor can appear in multiple forms, so the primary job here is to find all variant names and bind them to one person.

### 1. Data structure
Before diving into the real datasets, we should design neat data structures to handle authors.

In [1]:
class NameSequence(object):
    def __init__(self, namestring): #Initialize the name from raw data, turn name into three parts
        if ',' in namestring:
            temp = namestring.split(', ')
            self.FP = temp[0].translate(None,punctuation)
            secondpart = temp[1].translate(None,punctuation)
            if len(secondpart.split(' ')) > 1:
                self.SP = secondpart.split(' ')[0]
                self.MD = secondpart.split(' ')[1]
            else:
                self.SP = secondpart
                self.MD = ''
        elif ' ' in namestring:
            temp = namestring.split(' ')
            self.FP = temp[0].translate(None,punctuation)
            self.SP = temp[1].translate(None,punctuation)
            self.MD = ''
        else:
            self.FP = namestring
            self.SP = ' '
            self.MD = ''
    #Print the full name out
    def Printout(self):
        print(self.FP,self.SP,self.MD)
        
#Author definition
class Author(object):
    def __init__(self, *args, **kwargs):
        self.Name = [kwargs['Name']]
        self.Short = [kwargs['Short']]
        self.Paper = [kwargs['Paper']]
        self.POI = [kwargs['POI']]
        self.isReprint = [kwargs['isReprint']]
        self.isFirst = [kwargs['isFirst']]
        self.Multiname = 0
        self.isInvest = 0
        self.Site = []
        self.attr = kwargs
        
    def Attach(self, single):
        self.Name.extend(single.Name)
        self.Short.extend(single.Short)
        self.Paper.extend(single.Paper)
        self.POI.extend(single.POI)
        self.isReprint.extend(single.isReprint)
        self.isFirst.extend(single.isFirst)
        self.Multiname += 1
                
    def PrintName(self):
        for name in self.Name:
            name.Printout()
    
    def PrintPaper(self):
        for paper in self.Paper:
            print(paper)

class Investor(object):
    def __init__(self, name, sites):
        self.Name = name
        self.Sites = sites
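The `translate(None, punctuation)` call used in `NameSequence` is the Python 2 form of `str.translate`. As a hedged sketch for readers on Python 3, the same three-part split could be written roughly as follows. `parse_name` and `_STRIP` are hypothetical names that do not appear in the notebook, and single-token names here fall back to empty strings rather than `' '`.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character (Python 3 form).
_STRIP = str.maketrans('', '', punctuation)

def parse_name(namestring):
    """Return (family, given, middle), mirroring the FP/SP/MD fields above."""
    if ',' in namestring:
        # "Family, Given Middle" form, as found in the WoS AF column.
        family, _, rest = namestring.partition(',')
        parts = rest.translate(_STRIP).split()
        given = parts[0] if parts else ''
        middle = parts[1] if len(parts) > 1 else ''
        return family.translate(_STRIP).strip(), given, middle
    # No comma: the first two whitespace-separated tokens fill FP and SP.
    parts = namestring.translate(_STRIP).split()
    if len(parts) > 1:
        return parts[0], parts[1], ''
    return (parts[0] if parts else ''), '', ''
```

For example, `parse_name('Mahalakshmi, D. V.')` returns `('Mahalakshmi', 'D', 'V')`, the same three parts that `NameSequence` stores for that author.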

### 2. Load libraries and Web of Science data

In [3]:
import pandas as pd
import difflib as dif
from string import punctuation
    
PRecords = pd.read_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Full_Record_WoS.csv', sep = ',', encoding = 'utf-8')
PRecords = PRecords.drop('Unnamed: 0', axis = 1)
PRecords

Unnamed: 0,AB,AF,AR,AU,BE,BN,BP,BS,C1,CA,...,SU,TC,TI,U1,U2,UT,VL,VR,WC,Z9
0,"In this study, net surface radiation (R-n) was...","Mahalakshmi, D. V.|Paul, Arati|Dutta, D.|Ali, ...",,"Mahalakshmi, DV|Paul, A|Dutta, D|Ali, MM|Dadhw...",,,1,,"[Mahalakshmi, D. V.; Ali, M. M.; Dadhwal, V. K...",,...,,0,Estimation of net surface radiation using eddy...,1,1,WOS:000381162400001,33,1.0,Geochemistry & Geophysics,0
1,"To date, direct validation of city-wide emissi...","Vaughan, Adam R.|Lee, James D.|Misztal, Pawel ...",,"Vaughan, AR|Lee, JD|Misztal, PK|Metzger, S|Sha...",,,455,,"[Vaughan, Adam R.] Univ York, Dept Chem, York,...",,...,,3,Spatially resolved flux measurements of NOx fr...,8,10,WOS:000380099700022,189,,"Chemistry, Physical",3
2,Large variability in N2O emissions from manage...,"Grant, Robert F.|Neftel, Albrecht|Calanca, Pie...",,"Grant, RF|Neftel, A|Calanca, P",,,3549,,"[Grant, Robert F.] Univ Alberta, Dept Renewabl...",,...,,0,Ecological controls on N2O emission in surface...,11,12,WOS:000379427700003,13,,"Ecology; Geosciences, Multidisciplinary",0
3,"Conversions of natural ecosystems, e.g., from ...","Merten, Jennifer|Roell, Alexander|Guillaume, T...",5,"Merten, J|Roll, A|Guillaume, T|Meijide, A|Tari...",,,,,"[Merten, Jennifer; Dittrich, Christoph; Faust,...",,...,,2,Water scarcity and oil palm expansion: social ...,16,28,WOS:000380049100006,21,,Ecology; Environmental Studies,2
4,A scheme describing the process of stream-aqui...,"Zeng, Yujin|Xie, Zhenghui|Yu, Yan|Liu, Shuang|...",,"Zeng, YJ|Xie, ZH|Yu, Y|Liu, S|Wang, LY|Jia, BH...",,,2333,,"[Zeng, Yujin; Xie, Zhenghui; Liu, Shuang; Wang...",,...,,3,Ecohydrological effects of stream-aquifer wate...,10,15,WOS:000379419500013,20,,"Geosciences, Multidisciplinary; Water Resources",3
5,There have been few studies conducted on the c...,"Yang, Zesu|Zhang, Qiang|Hao, Xiaocui",6809749,"Yang, ZS|Zhang, Q|Hao, XC",,,,,"[Yang, Zesu] Chengdu Univ Informat Technol, Co...",,...,,0,Evapotranspiration Trend and Its Relationship ...,8,8,WOS:000379433600001,,,Meteorology & Atmospheric Sciences,0
6,The lifetime of nitrogen oxides (NOx) affects ...,"Romer, Paul S.|Duffey, Kaitlin C.|Wooldridge, ...",,"Romer, PS|Duffey, KC|Wooldridge, PJ|Allen, HM|...",,,7623,,"[Romer, Paul S.; Duffey, Kaitlin C.; Wooldridg...",,...,,2,The lifetime of nitrogen oxides in an isoprene...,16,26,WOS:000379417300009,16,,Meteorology & Atmospheric Sciences,2
7,"The emission, dispersion, and photochemistry o...","Su, Luping|Patton, Edward G.|de Arellano, Jord...",,"Su, LP|Patton, EG|de Arellano, JVG|Guenther, A...",,,7725,,"[Su, Luping; Mak, John E.] SUNY Stony Brook, S...",,...,,3,Understanding isoprene photooxidation using ob...,7,10,WOS:000379417300016,16,,Meteorology & Atmospheric Sciences,3
8,"We measured volatile organic compounds (VOCs),...","Rantala, Pekka|Jarvi, Leena|Taipale, Risto|Lau...",,"Rantala, P|Jarvi, L|Taipale, R|Laurila, TK|Pat...",,,7981,,"[Rantala, Pekka; Jarvi, Leena; Taipale, Risto;...",,...,,0,Anthropogenic and biogenic influence on VOC fl...,3,12,WOS:000379417300032,16,,Meteorology & Atmospheric Sciences,0
9,The dry component of total nitrogen and sulfur...,"Rumsey, Ian C.|Walker, John T.",,"Rumsey, IC|Walker, JT",,,2581,,"[Rumsey, Ian C.] Coll Charleston, Dept Phys & ...",,...,,0,Application of an online ion-chromatography-ba...,4,10,WOS:000379397100008,9,,Meteorology & Atmospheric Sciences,0


The data is quite neat and clean here. Each record represents a single paper. Since our goal is to create a profile for each author, the data should be transformed into an Author -> Papers structure.

### 3. Web of Science data transformation
First, let's check how many papers we have over here.

In [3]:
len(PRecords)

5654

Then each author of each paper is extracted into a single Author data structure. These records are later sorted by primary key FP and subkey SP, waiting for further combination. Meanwhile, we tag the authors who are Reprint Authors or First Authors.

In [4]:
#ReGroup the name
PRecords.AF = [names.split('|') for names in PRecords.AF]
PRecords.AU = [names.split('|') for names in PRecords.AU]

#Author Initialization
PreAuthors = []
for index, paper in PRecords.iterrows():
    #Reprint Author Tag (paper.RP == paper.RP is False only when RP is NaN)
    if  paper.RP == paper.RP:
        Reprint = paper.RP.split(' (re')[0]
        for j, Sname in enumerate(paper.AU):
            if Reprint in Sname:
                PRecords.AF[index][j] += '*'
    #First Author Tag
    PRecords.AF[index][0] += '$'
    
    for j, name in enumerate(paper.AF):
        Rptag = 0
        Fstag = 0
        if '*' in name:
            Rptag = 1
        if '$' in name:
            Fstag = 1
        #Append Authors; Short (the abbreviated AU name) is required by Author.__init__
        PreAuthors.append(Author(Name = NameSequence(str(name)), Short = paper.AU[j], Paper = paper.DI, POI = index, isReprint = Rptag, isFirst = Fstag))
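The paper-to-author transformation can also be illustrated without pandas. The following is a minimal sketch, not the notebook's own code: `explode_authors` is a hypothetical helper, the sample record uses made-up names, and the `RP` string format (`'Name (reprint author), ...'`) is an assumption inferred from the `split(' (re')` call above.

```python
def explode_authors(papers):
    """Turn paper records into one row per (author, paper) pair."""
    rows = []
    for poi, paper in enumerate(papers):
        full_names = paper['AF'].split('|')
        short_names = paper['AU'].split('|')
        # The text before ' (re' is taken as the reprint author's short name.
        reprint = (paper.get('RP') or '').split(' (re')[0]
        for j, (full, short) in enumerate(zip(full_names, short_names)):
            rows.append({
                'Name': full, 'Short': short,
                'Paper': paper.get('DI'), 'POI': poi,
                'isFirst': int(j == 0),
                'isReprint': int(bool(reprint) and reprint in short),
            })
    return rows

# Made-up record for illustration only.
sample = [{'AF': 'Doe, Jane A.|Roe, Richard', 'AU': 'Doe, JA|Roe, R',
           'RP': 'Roe, R (reprint author), Example Univ', 'DI': None}]
rows = explode_authors(sample)
```

Here the first row is flagged as first author only, and the second row as reprint author only.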

In [5]:
PreAuthors[0].PrintName()
print(PreAuthors[0].isReprint, PreAuthors[0].isFirst)

('Mahalakshmi', 'D', 'V')
([1], [1])


Finally, we combine all PreAuthors records by name to build one profile per author. The name comparison is a similarity check: the score comes from the dif.SequenceMatcher().ratio() function, which returns 2*M/T, where M is the number of matched characters and T is the total length of both strings. This is a Ratcliff/Obershelp-style matching score rather than a weighted Levenshtein distance, and it is quite time consuming. 
Some similar names are too ambiguous for the algorithm to decide, so a manual check is inserted for those cases. The intermediate result is saved as a CSV file.
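As a small illustration of the similarity score, the sketch below computes `ratio()` for a few name pairs. `name_ratio` is a hypothetical wrapper; the 0.9 threshold it is compared against is the one used in `CheckWOS`.

```python
import difflib

def name_ratio(x, y):
    """Case-insensitive SequenceMatcher ratio, i.e. 2*M/T."""
    return difflib.SequenceMatcher(None, x.lower(), y.lower()).ratio()

exact = name_ratio('Mahalakshmi', 'mahalakshmi')  # identical after lower-casing
close = name_ratio('Mahalakshmi', 'Mahalaksmi')   # one letter dropped
far = name_ratio('Grant', 'Rantala')              # different family names

# 'mahalaks' (8 chars) and 'mi' (2 chars) match, so M = 10 and T = 11 + 10 = 21.
print(exact, round(close, 3), far < 0.9)
```

The dropped-letter pair scores 20/21 (about 0.95), which is above 0.9 but below 1, so in the notebook's scheme it would be a candidate for the manual check rather than an automatic merge.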

In [None]:
#Sort the PreAuthors list alphabetically by FP, then SP (two stable sorts), so similar names end up adjacent
PreAuthors = sorted(PreAuthors, key = lambda author: author.Name[0].SP)
PreAuthors = sorted(PreAuthors, key = lambda author: author.Name[0].FP)

def Get_Match(x,y): #Inner WoS Name check needs accuracy, so we use .ratio function
    return dif.SequenceMatcher(None,x,y).ratio()

def CheckWOS(x,y): #x and y variables are NameSequence Class
    tag = 0
    CisFP = Get_Match(x.FP.lower(),y.FP.lower())
    CisSP = Get_Match(x.SP.lower(),y.SP.lower())

    if ((CisFP < 0.9 and CisSP < 0.9) or (x.SP[0] != y.SP[0])):
        tag = 0
    elif (CisFP == 1 and CisSP == 1):
        tag = 3
    elif (CisFP >= 0.9 and CisSP >= 0.9):
        tag = 1
    return tag

def MannualCheck(namelist, insertname):
    print('^^^^^^^Please Check^^^^^^^')
    for name in namelist:
        name.Printout()
    print('^^^^^^^^^^^^^^^^^^^^^^^^^^')
    for name in insertname:
        name.Printout()
    tg = input('Match or Not?\n')
    return(tg)

#Initialization of name compare
Authors = []
processed = 0
bias = 0
tag = 0
for pauthor in PreAuthors:
    matched = False
    for i,author in enumerate(Authors[-bias:]):
        #Tag Calculate
        tag = 0
        for j,name in enumerate(author.Name):
            tag = max(tag,CheckWOS(pauthor.Name[0],name))

        #Tag Check
        if tag == 1:
            if MannualCheck(author.Name,pauthor.Name):
                author.Attach(pauthor)
                matched = True
                break
        elif tag == 3:
            author.Attach(pauthor)
            author.Multiname -= 1
            matched = True
            break
    #No match in the window (including rejected manual checks): keep as a new author
    if not matched:
        Authors.append(pauthor)
    #Set Check bias to 50 to accelerate the calculation
    bias = min(len(Authors),50)
    processed += 1
    print(processed,tag)
    
    
#Rebuild Authors to DataFrame and Print to csv
def CombineNameSequence(ns): #Turn NameSequence structure into strings for comparison
    return(ns.FP + ' ' + ns.SP + ' ' + ns.MD)

OutputAuthors = []
for index, author in enumerate(Authors):
    for j, item in enumerate(author.Name):
        OutputAuthors.append({'Universal_Name_Tag':index, 'Name':CombineNameSequence(item), 'Short':author.Short[j], 'Paper':author.Paper[j], 'POI':author.POI[j], 'isReprint':author.isReprint[j], 'isFirst':author.isFirst[j]})

OutputAuthors = pd.DataFrame.from_dict(OutputAuthors)
OutputAuthors.to_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/FirstStep_Author_Records.csv')

(1, 0)
(2, 0)
(3, 0)
(4, 3)
(5, 3)
(6, 3)
(7, 3)
(8, 3)
(9, 0)
(10, 0)
(11, 3)
(12, 3)
(13, 3)
(14, 3)
(15, 0)
(16, 3)
(17, 3)
(18, 3)
(19, 3)
(20, 3)
(21, 3)
(22, 0)
(23, 0)
(24, 0)
(25, 0)
(26, 3)
(27, 0)
(28, 0)
(29, 0)
(30, 0)
(31, 0)
(32, 0)
(33, 0)
(34, 3)
(35, 0)
(36, 0)
(37, 0)
(38, 0)
(39, 3)
(40, 0)
(41, 3)
(42, 0)
(43, 0)
(44, 3)
(45, 0)
(46, 3)
(47, 0)
(48, 0)
(49, 3)
(50, 0)
(51, 3)
(52, 0)
(53, 0)
(54, 3)
(55, 0)
(56, 3)
(57, 0)
(58, 0)
(59, 0)
(60, 0)
(61, 3)
(62, 0)
(63, 3)
(64, 0)
(65, 0)
(66, 3)
(67, 0)
(68, 3)
(69, 3)
(70, 3)
(71, 3)
(72, 3)
(73, 3)
(74, 3)
(75, 0)
(76, 3)
(77, 0)
(78, 0)
(79, 0)
(80, 0)
(81, 0)
(82, 0)
(83, 0)
(84, 0)
(85, 0)
(86, 0)
(87, 3)
(88, 3)
(89, 3)
(90, 0)
(91, 3)
(92, 0)
(93, 0)
(94, 0)
(95, 0)
(96, 0)
(97, 0)
(98, 0)
(99, 0)
(100, 0)
(101, 0)
(102, 0)
(103, 3)
(104, 0)
(105, 3)
(106, 0)
(107, 0)
(108, 0)
(109, 3)
(110, 3)
(111, 3)
(112, 3)
(113, 0)
(114, 3)
(115, 3)
(116, 0)
(117, 0)
(118, 3)
(119, 3)
(120, 0)
(121, 0)
(122, 0)
(123, 0)
...

This step may take quite a long time (both automatic and manual work), and if we check the data again, we still find more names that should be combined.
So a second, fully automatic name check is performed to combine neighbouring similar names. This check is based on the Short Name: neighbouring authors whose Short Names agree (the initials of one contained in the other's) are combined.

In [None]:
#Second Check on Short Names
def Picklongest(ShortName): #Longest initials string among an author's Short Names
    temp = ''
    for sn in ShortName:
        secondpart = sn.split(', ')[1]
        if len(secondpart) > len(temp):
            temp = secondpart
    return(temp)

def Family(ShortName): #Family-name part of the first Short Name
    return(ShortName[0].split(', ')[0].lower())

Authors = sorted(Authors, key = lambda author: author.Short[0])
RAuthors = []
for index, author in enumerate(Authors):
    if index == 0:
        RAuthors.append(author)
    else:
        prvS = Picklongest(RAuthors[-1].Short)
        crtS = Picklongest(author.Short)

        #Combine only when the family names agree and one set of initials contains the other
        if (Family(RAuthors[-1].Short) == Family(author.Short)) and ((prvS in crtS) or (crtS in prvS)):
            RAuthors[-1].Attach(author)
        else:
            RAuthors.append(author)
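The neighbour-merging idea can be tried on plain strings. This is a sketch under stated assumptions, not the notebook's implementation: `split_short` and `merge_neighbours` are hypothetical helpers, and the rule used here requires equal family names plus containment of one set of initials in the other.

```python
def split_short(short):
    """'Aalto, PP' -> ('aalto', 'PP')."""
    family, _, initials = short.partition(', ')
    return family.lower(), initials

def merge_neighbours(shorts):
    """Group alphabetically sorted short names that plausibly denote one person."""
    groups = []
    for short in sorted(shorts):
        family, initials = split_short(short)
        if groups:
            prev_family, prev_initials = groups[-1]['key']
            same = family == prev_family and (
                initials in prev_initials or prev_initials in initials)
            if same:
                groups[-1]['members'].append(short)
                # Keep the longest initials seen so far, like Picklongest.
                if len(initials) > len(prev_initials):
                    groups[-1]['key'] = (family, initials)
                continue
        groups.append({'key': (family, initials), 'members': [short]})
    return [g['members'] for g in groups]

merged = merge_neighbours(['Aalto, T', 'Abaimov, AP', 'Aalto, P', 'Aalto, PP'])
```

With this input, `'Aalto, P'` and `'Aalto, PP'` end up in one group while `'Aalto, T'` and `'Abaimov, AP'` stay separate. Whether `P` and `PP` really are the same person is exactly the kind of case the heuristic cannot decide on its own.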

After this combination, we shall just output the result to local disk for fast reload in future work.

In [None]:
##Direct Output Authors
Fileoutput = open('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/UnifieAuthors.txt','w')
for author in RAuthors:
    #Print names
    for name in author.Name:
        ft = name.FP + '$' + name.SP + '$' + name.MD + '@'
        print >> Fileoutput, ft,
    print >> Fileoutput
    #Print PaperDOI
    for paper in author.Paper:  
        if paper != paper:
            print >> Fileoutput, '|',
        else:
            print >> Fileoutput, paper + '|',
    print >> Fileoutput
    #Print POI
    for poi in author.POI:
        print >> Fileoutput, str(poi) + '|',
    print >> Fileoutput
    #Print isRePrint
    for rpt in author.isReprint:
        print >> Fileoutput, str(rpt) + '|',
    print >> Fileoutput   
    #Print isFirst
    for fst in author.isFirst:
        print >> Fileoutput, str(fst) + '|',
    print >> Fileoutput
Fileoutput.close()

It is more convenient to keep this dataset in a pandas DataFrame, so we transform the RAuthors data into a list of dicts and then into a pd.DataFrame.
We also attach the unified author keys to the earlier record table, then save the result to disk.

In [None]:
#Rebuild Authors to DataFrame and Print to csv
def CombineNameSequence(ns): #Turn NameSequence structure into strings for comparison
    return(ns.FP + ' ' + ns.SP + ' ' + ns.MD)
Authors = []

for index, author in enumerate(RAuthors):
    for j, item in enumerate(author.Name):
        Authors.append({'Universal_Name_Tag':index, 'Name':CombineNameSequence(item), 'Paper':author.Paper[j], 'POI':author.POI[j], 'isReprint':author.isReprint[j], 'isFirst':author.isFirst[j]})

Authors = pd.DataFrame.from_dict(Authors)
Authors.to_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Full_Author_Records.csv')

In [None]:
#Attach Unified Key to the table
Pcords = ['' for cols in range(len(PRecords))]
#Mark each paper with the unique numbers of its authors
for index, author in enumerate(RAuthors):
    print(index)
    for Pnum in author.POI:
        Pcords[int(Pnum)] += str(index) + '|'
PRecords['Author_Unique_Key'] = Pcords

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
PRecords.to_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Full_Record_With_AuthorKey.csv')
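The key-attachment step above boils down to inverting an author -> papers mapping into a paper -> authors string. A minimal sketch with plain lists (`build_author_keys` is a hypothetical name, and the input is toy data):

```python
def build_author_keys(n_papers, author_pois):
    """For each paper index, concatenate the ids of its authors as 'id|id|'."""
    keys = ['' for _ in range(n_papers)]
    for author_id, pois in enumerate(author_pois):
        for poi in pois:
            keys[int(poi)] += str(author_id) + '|'
    return keys

# Toy data: author 0 wrote papers 0 and 2, author 1 wrote paper 0, author 2 wrote papers 1 and 2.
keys = build_author_keys(3, [[0, 2], [0], [1, 2]])
# Recover the author ids of paper 0 from its key string.
paper0_authors = [int(k) for k in keys[0].split('|') if k]
```

Because every id is followed by `'|'`, splitting a key on `'|'` yields a trailing empty string, which the last line filters out.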

### 4. Reload author and investigator data
Load the processed investigator data into the Investor class. Because the investigator data was crawled from the internet, we first remove all punctuation from the investor name fields. We don't want to deduplicate the WoS data a second time, so we simply reload the author data saved in the last section.

In [4]:
import pandas as pd
from string import punctuation
import difflib as dif

Fileinput = open('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/SiteInfo/InvestSite.txt')
Investors = []
for lines in Fileinput:
    temp = lines.replace('\n','').split('|')
    Investors.append(Investor(temp[0].translate(None,punctuation),temp[1].split(',')))
Fileinput.close()

Authors = pd.read_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Universal_Author_Scores.csv')
Authors = Authors.drop('Unnamed: 0', axis = 1)
Authors

Unnamed: 0,Citation_Score,NameIndex,Names,Short_Names,Total_Citation,Total_Publication,LeaderRank
0,2.000000,ARCADO TE,ARCADO TE |,"ARCADO, TE|",32,1,0.281736
1,1.000000,Aalto Juho,Aalto Juho |,"Aalto, J|",7,1,0.406272
2,31.529762,Aalto P P,Aalto P |Aalto P |Aalto P P|Aalto P |Aalto P |...,"Aalto, P|Aalto, P|Aalto, PP|Aalto, P|Aalto, P|...",241,7,1.863562
3,34.042857,Aalto Tuula,Aalto T |Aalto T |Aalto T |Aalto T |Aalto T |A...,"Aalto, T|Aalto, T|Aalto, T|Aalto, T|Aalto, T|A...",171,12,1.864732
4,2.833333,Abaimov A P,Abaimov A P|,"Abaimov, AP|",17,1,0.367166
5,0.101695,Abaoui J,Abaoui J |,"Abaoui, J|",6,1,0.284561
6,0.000000,Abdalati Waleed,Abdalati Waleed |,"Abdalati, W|",0,1,0.388332
7,5.581395,Abdalla Mohamed,Abdalla M |Abdalla M |Abdalla Mohamed |,"Abdalla, M|Abdalla, M|Abdalla, M|",156,3,0.425961
8,0.000000,Abdalla Seifeldin H,Abdalla Seifeldin H|,"Abdalla, SH|",0,1,0.596599
9,6.000000,Abdelghani Chehbouni,Abdelghani Chehbouni |,"Abdelghani, C|",6,1,0.395203
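For reference, the investor file is read as one `name|site,site,...` record per line. The sketch below shows a Python 3 version of that parsing step; `parse_investor_line` is a hypothetical helper, the sample line is made up, and unlike the cell above it drops empty site entries.

```python
from string import punctuation

# Translation table that deletes ASCII punctuation (Python 3 form of translate(None, punctuation)).
_STRIP = str.maketrans('', '', punctuation)

def parse_investor_line(line):
    """Split 'name|site,site' into a cleaned name and a list of site codes."""
    name, _, sites = line.rstrip('\n').partition('|')
    return name.translate(_STRIP), [s for s in sites.split(',') if s]

# Made-up record for illustration only.
name, sites = parse_investor_line('J. Doe-Smith|SITE-A,SITE-B\n')
```

Note that stripping punctuation also removes hyphens, so `'J. Doe-Smith'` becomes `'J DoeSmith'`. Site codes are left untouched.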


### 5. Combine two datasets
Now we can merge the Investors into the Author dataset by name similarity. For each investor, the best-scoring author is found. The pair is accepted automatically only when the similarity exceeds 0.9 (the threshold in the code below); otherwise, a manual check decides whether to combine them.

In [None]:
def Get_Quick_Match(x,y): #Name check here is less strict, quick ratio is fine
    return dif.SequenceMatcher(None,x,y).quick_ratio()

def MannualCrossCheck(author, invest):
    print('^^^^^^^Please Check^^^^^^^')
    print(author)
    print('^^^^^^^^^^^^^^^^^^^^^^^^^^')
    print(invest.Name)
    tg = input('Match or Not?\n')
    return(tg)

Sites = [[] for cols in range(len(Authors))]

for num,invest in enumerate(Investors):
    cpvalue = 0
    record = 0
    for i,author in Authors.iterrows():
        temp = 0
        for name in author.Names.replace(' |','|').split('|'):
            temp = max(temp,Get_Quick_Match(invest.Name, name))
        if temp > cpvalue:
            record = i
            cpvalue = temp
            
    if cpvalue > 0.9:
        Sites[record].extend(invest.Sites)
        print(str(num) + ': ' + invest.Name + ' & ' + Authors.NameIndex[record] + '--------' + str(100*cpvalue) + '% Match!')
    elif MannualCrossCheck(Authors.Names[record], invest):
        Sites[record].extend(invest.Sites)
        print(str(num) +  ': ' + invest.Name + ' & ' + Authors.NameIndex[record] + '--------' + str(100*cpvalue) + '% Match!')
    else:
        print(str(num) + ': ' + invest.Name + ' & ' + Authors.NameIndex[record] + '--------' + str(100*cpvalue) + '    Not Match!')

CSites = ['' for cols in range(len(Authors))]
for index, item in enumerate(Sites):
    for sc in item:
        CSites[index] += sc + ','

Authors['Invested_Site'] = CSites
Authors.to_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Sites_LeaderRank_AuthorList.csv')

^^^^^^^Please Check^^^^^^^
Oishi A Christopher|Oishi A Christopher|Oishi A Christopher|Oishi A Christopher|Oishi A Christopher|Oishi A Christopher|Oishi A Christopher|Oishi AC |Oishi C |Ojala A |Ojala Anne |Ojala Anne |Ojala Anne |Ojala Anne |Ojala Anne |Ojala Anne |Ojala Anne |Ojala Anne |Ojala Anne |
^^^^^^^^^^^^^^^^^^^^^^^^^^
A Chris Oishi
0: A Chris Oishi & Oishi A Christopher--------81.25% Match!
1: Abad Chabbi & Chabbi Abad --------100.0% Match!
2: Abel Rodrigues & Rodrigues Abel --------100.0% Match!
3: Achim Grelle & Grelle Achim --------100.0% Match!
4: Adam Wolf & Wolf Annett --------100.0% Match!
5: Adrian Rocha & Rocha Adrian V--------100.0% Match!
6: Aikaterini Trepekli & Trepekli Aikaterini --------100.0% Match!
7: Akira Miyata & Miyata Akira --------100.0% Match!
8: Alan Barr & Barr Alan G--------100.0% Match!
^^^^^^^Please Check^^^^^^^
Karppinen A |Karsanaev S A|Karsanaev S A|Karsanaev S A|Karsanaev S V|
^^^^^^^^^^^^^^^^^^^^^^^^^^
Alan Knapp
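The scores in the output above can be reproduced in isolation. `quick_ratio()` compares the characters of the two strings as multisets, so it ignores order and is documented as an upper bound on `ratio()`. The following sketch (with a hypothetical `best_match` helper and a two-entry candidate list taken from the names printed above) shows both the reordered exact match and the 81.25% case.

```python
import difflib

def best_match(query, candidates):
    """Return (index, score) of the candidate entry with the highest quick_ratio."""
    best_i, best_score = None, 0.0
    for i, names in enumerate(candidates):
        for name in names:
            score = difflib.SequenceMatcher(None, query, name).quick_ratio()
            if score > best_score:
                best_i, best_score = i, score
    return best_i, best_score

candidates = [['Chabbi Abad'], ['Oishi A Christopher', 'Oishi AC']]
swapped = best_match('Abad Chabbi', candidates)    # same characters, different order
partial = best_match('A Chris Oishi', candidates)  # 13 shared chars, 2*13/32
strict = difflib.SequenceMatcher(None, 'Abad Chabbi', 'Chabbi Abad').ratio()
print(swapped, partial, strict < 1.0)
```

`swapped` is `(0, 1.0)` and `partial` is `(1, 0.8125)`, matching the "100.0% Match" and "81.25% Match" lines in the output. The order-sensitive `ratio()` for the swapped pair is below 1, which is why the looser `quick_ratio()` suits "Given Family" versus "Family Given" comparisons but can also over-match names that merely share letters.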
