# Data Transformation with Microsoft Academic Graph 

Microsoft Academic Graph (MAG) is a large database whose tables hold information about publications, authors, affiliations, journals and citations. In this notebook, we will work on a sample of MAG and transform it with Pandas.

In [24]:
# Importing packages for data transformation
import numpy as np
import pandas as pd

In [30]:
# Importing basic MAG tables
Papers = pd.read_csv('./datasets/s4/MAG/Papers.csv')
PaperAuthorAffiliations = pd.read_csv('./datasets/s4/MAG/PaperAuthorAffiliations.csv')
Authors = pd.read_csv('./datasets/s4/MAG/Authors.csv')
Affiliations = pd.read_csv('./datasets/s4/MAG/Affiliations.csv')
Journals = pd.read_csv('./datasets/s4/MAG/Journals.csv') 

In [28]:
# "Papers" is a table with information about publications. It includes the paper title, publication date, DOI and more.
# We can link it to other MAG tables through shared id columns (such as PaperId) to explore the relationships between entities.
Papers.head()

Unnamed: 0,PaperId,Rank,Doi,DocType,PaperTitle,OriginalTitle,BookTitle,Year,Date,OnlineDate,...,ConferenceSeriedId,ConferenceInstanceId,Volume,Issue,FirstPage,LastPage,ReferenceCount,CitationCount,EstimatedCitation,OriginalVenue
0,51264158,27169,,,no has visto nada en treblinka,No has visto nada en Treblinka,,2009,2009-01-01 00:00:00,,...,,,,26.0,73.0,,0,0,0,Cahiers du cinéma: España
1,93781424,22616,,Journal,reading minds how infants come to understand o...,Reading Minds: How Infants Come to Understand ...,,2009,2009-11-01 00:00:00,,...,,,30.0,2.0,28.0,32.0,0,6,6,Zero to Three
2,138145309,22947,,Patent,weight measuring device for cooking appliance,Weight measuring device for cooking appliance,,1996,1996-09-25 00:00:00,,...,,,,,,,8,6,6,
3,214118367,21825,10.1007/BF00324200,Journal,effect of oxygen segregation on the surface st...,Effect of oxygen segregation on the surface st...,,1992,1992-04-01 00:00:00,,...,,,54.0,4.0,350.0,354.0,22,25,25,Applied Physics A
4,267808649,22391,,,is mars sample return required prior to sendin...,Is Mars Sample Return Required Prior to Sendin...,,2012,2012-05-22 00:00:00,,...,,,,,,,0,2,2,


In [45]:
# Selecting the PaperId and PaperTitle columns for the rows labelled 0 and 1
Papers.loc[0:1, ['PaperId', 'PaperTitle']]

Unnamed: 0,PaperId,PaperTitle
0,51264158,no has visto nada en treblinka
1,93781424,reading minds how infants come to understand o...
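
The selection above returns two rows because `.loc` slices by label and includes both endpoints. A minimal sketch of the difference from positional slicing, using a made-up toy table (`toy_papers` and its values are illustrative assumptions, not MAG data):

```python
import pandas as pd

# Toy stand-in for the Papers table (values are made up for illustration)
toy_papers = pd.DataFrame({
    'PaperId': [11, 22, 33, 44],
    'PaperTitle': ['paper a', 'paper b', 'paper c', 'paper d'],
    'Year': [2001, 2002, 2003, 2004],
})

# .loc slices by index label and includes both endpoints: labels 0 and 1
by_label = toy_papers.loc[0:1, ['PaperId', 'PaperTitle']]

# .iloc slices by position and excludes the stop: position 0 only
by_position = toy_papers.iloc[0:1][['PaperId', 'PaperTitle']]

print(len(by_label), len(by_position))
```

By the same rule, `Papers.loc[:10]` selects eleven rows (labels 0 through 10) when the table has the default integer index.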


In [49]:
# By linking PaperId with the "PaperAuthorAffiliations" table and AuthorId with the "Authors" table,
# we can get the author names for publications

Papers.loc[4:5, ['PaperId', 'PaperTitle']].\
    merge(PaperAuthorAffiliations, how = 'inner', on = 'PaperId').\
    merge(Authors, how = 'inner', on = 'AuthorId')[['PaperTitle', 'OriginalAuthor']]

Unnamed: 0,PaperTitle,OriginalAuthor
0,is mars sample return required prior to sendin...,Charles Budney
1,fertility transition latin america and the car...,Cecilia Gayet
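
To make the chained merge easier to follow, here is a minimal sketch on made-up toy tables. The names `toy_papers`, `toy_links`, `toy_authors` and the `AuthorName` column are assumptions for illustration only. The link table plays the role of "PaperAuthorAffiliations": it holds one row per (paper, author) pair, so a paper with two authors yields two rows after the merge.

```python
import pandas as pd

# Made-up toy tables mirroring the paper -> link table -> author structure
toy_papers = pd.DataFrame({'PaperId': [1, 2],
                           'PaperTitle': ['paper a', 'paper b']})
toy_links = pd.DataFrame({'PaperId': [1, 1, 2],
                          'AuthorId': [10, 20, 10]})
toy_authors = pd.DataFrame({'AuthorId': [10, 20],
                            'AuthorName': ['Ada', 'Grace']})

# First merge attaches author ids to papers, second merge attaches names
result = (
    toy_papers
    .merge(toy_links, how='inner', on='PaperId')
    .merge(toy_authors, how='inner', on='AuthorId')
    [['PaperTitle', 'AuthorName']]
)
print(result)
```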


In [52]:
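# By linking PaperId with the "PaperAuthorAffiliations" table and AffiliationId with the "Affiliations" table,
# we can get the affiliations of the authors of the selected papers.
# Both merges are inner joins, so only rows with a matching affiliation are kept.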
Papers.loc[:10, ['PaperId', 'PaperTitle']].\
    merge(PaperAuthorAffiliations, how = 'inner', on = 'PaperId').\
    merge(Affiliations, how = 'inner', on = 'AffiliationId')[['PaperTitle', 'NormalizedName', 'OfficialPage']]

Unnamed: 0,PaperTitle,NormalizedName,OfficialPage
0,effect of oxygen segregation on the surface st...,karlsruhe institute of technology,https://www.kit.edu/english/
1,integrated camera and associated methods,flir systems,http://www.flir.co.uk/
2,1 gbyte s error free optical interconnection u...,nec,http://www.nec.com/
3,polycarbonate with high refractive index,sabic,http://www.sabic.com/
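
The result above has fewer rows than the number of papers selected, which is what an inner join does when some rows have no match. A minimal sketch, on made-up toy data, of how a left join with `indicator=True` can reveal which rows an inner join would drop (all table names and values here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy link table: paper 2 has no affiliation id, paper 3 points to an
# affiliation that is absent from the toy affiliation table
toy_links = pd.DataFrame({
    'PaperId': [1, 2, 3],
    'AffiliationId': [100.0, np.nan, 300.0],
})
toy_affiliations = pd.DataFrame({
    'AffiliationId': [100.0],
    'NormalizedName': ['some university'],
})

# Inner join keeps only rows with a match on both sides
inner = toy_links.merge(toy_affiliations, how='inner', on='AffiliationId')

# Left join keeps every row of the left table; the _merge column
# records whether a match was found
left = toy_links.merge(toy_affiliations, how='left',
                       on='AffiliationId', indicator=True)
unmatched = left[left['_merge'] == 'left_only']

print(len(inner), len(left), list(unmatched['PaperId']))
```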


In [54]:
# Loading tables with field of study information

PaperFields = pd.read_csv('./datasets/s4/MAG/PaperFieldsOfStudy.csv')
Fields = pd.read_csv('./datasets/s4/MAG/FieldsOfStudy.csv')

In [73]:
# Checking the fields of study for selected papers

pd.set_option('display.max_columns', None)  # show every column when displaying DataFrames
Papers.loc[:10, ['PaperId', 'PaperTitle']].\
    merge(PaperFields, how = 'inner', on = 'PaperId').\
    merge(Fields, how = 'inner', on = 'FieldOfStudyId')[['PaperTitle', 'NormalizedName']].\
    groupby('PaperTitle').agg(list)

Unnamed: 0_level_0,NormalizedName
PaperTitle,Unnamed: 1_level_1
1 gbyte s error free optical interconnection using a forward error correcting code,"[computer science, optical computing, soft err..."
amplifier circuit having linear and non linear amplification ranges,"[instrumentation amplifier, fully differential..."
effect of oxygen segregation on the surface structure of single crystalline niobium films on sapphire,"[sapphire, superlattice, crystal, crystal grow..."
fertility transition latin america and the caribbean,"[unintended pregnancy, sub replacement fertili..."
integrated camera and associated methods,"[engineering, die, division, engineering drawing]"
is mars sample return required prior to sending humans to mars,"[martian, astrobiology, in situ resource utili..."
polycarbonate with high refractive index,"[polycarbonate, composite material, materials ..."
reading minds how infants come to understand others,"[differential psychology, child development, p..."
weight measuring device for cooking appliance,"[waste management, food products, meter, engin..."
zespol comela nethertona u 14 miesiecznego dziecka,"[dermatology, medicine, atopic dermatitis, com..."
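
The last step of the cell above collapses the one-row-per-(paper, field) result into one row per paper. A minimal sketch of that step on made-up toy data (`toy_pairs` and its values are illustrative assumptions); the comma-joined variant is an optional alternative, not something the notebook itself uses:

```python
import pandas as pd

# Toy result of the two merges: one row per (paper, field) pair
toy_pairs = pd.DataFrame({
    'PaperTitle': ['paper a', 'paper a', 'paper b'],
    'NormalizedName': ['physics', 'optics', 'medicine'],
})

# Collect all field names of each paper into a Python list
fields_per_paper = toy_pairs.groupby('PaperTitle').agg(list)

# Alternative: join the names into a single string per paper
fields_as_text = toy_pairs.groupby('PaperTitle')['NormalizedName'].agg(', '.join)

print(fields_per_paper)
print(fields_as_text)
```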


In [75]:
# The "Journals" table holds journal metadata

Journals.head()

Unnamed: 0,JournalId,Rank,NormalizedName,DisplayName,Issn,Publisher,Webpage,PaperCount,PaperFamilyCount,CitationCount
0,2764477011,12346,zero to three,Zero to Three,0736-8038,,,792,792,3040
1,52123058,8572,applied physics a,Applied Physics A,0947-8396,,http://www.springer.com/materials/journal/339,17498,17498,268249
2,2738606534,12807,pediatria i medycyna rodzinna,Pediatria i Medycyna Rodzinna,1734-1531,,,726,726,144
3,2764864177,12041,research memorandum,research memorandum,,,,1337,1337,1814
4,58228071,11252,medico chirurgical transactions,Medico-Chirurgical Transactions,,,,1828,1828,5988


In [84]:
# Loading tables with citation information

PaperCitationContexts = pd.read_csv('./datasets/s4/MAG/PaperCitationContexts.csv')
PaperCitation = pd.read_csv('./datasets/s4/MAG/PaperCitation.csv')

In [77]:
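# Each row pairs a citing paper (PaperId) with a cited paper (PaperReferenceId)
# and the sentence in which the citation appears (CitationContext)
PaperCitationContexts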

Unnamed: 0,PaperId,PaperReferenceId,CitationContext
0,2031667581,1964706255,"Recently, we have observed QHR plateaus in lar..."
1,2031667581,2022106697,This energy gap stabilizes the transverse resi...
2,2031667581,2022106697,A collaboration of University laboratories in ...
3,2031667581,2031839987,"Furthermore, graphene could support the produc..."
4,2031667581,2053728405,multiple symmetries in the quantum state of mo...
...,...,...,...
20999,2402562600,1571895365,But when and how can children learn abstract p...
21000,2402562600,2043187947,"However, theoretical advances drawing on Bayes..."
21001,2402562600,2060093866,The earlier literature invoked a “relational s...
21002,2402562600,2118450042,These findings are also relevant to the broade...
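
In the output above, the same (PaperId, PaperReferenceId) pair can appear on more than one row, once per citing sentence. A minimal sketch, on made-up toy data, of counting how many times each reference is mentioned by a paper (`toy_contexts` and the `MentionCount` column name are assumptions for illustration):

```python
import pandas as pd

# Toy citation contexts: paper 1 mentions reference 7 in two sentences
toy_contexts = pd.DataFrame({
    'PaperId': [1, 1, 1],
    'PaperReferenceId': [7, 7, 8],
    'CitationContext': ['first sentence', 'second sentence', 'third sentence'],
})

# Count the rows for each (citing paper, cited paper) pair
mentions = (
    toy_contexts
    .groupby(['PaperId', 'PaperReferenceId'])
    .size()
    .reset_index(name='MentionCount')
)
print(mentions)
```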


In [89]:
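# Listing the references of selected papers: join "PaperCitation" on PaperId,
# then join a renamed copy of "Papers" on PaperReferenceId to get each referenced paper's title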

Papers.loc[:30, ['PaperId', 'PaperTitle']].\
    merge(PaperCitation, how = 'inner', on = 'PaperId').\
    merge(Papers.rename({'PaperId': 'PaperReferenceId', 'PaperTitle': 'PaperReferenceTitle'}, axis = 1)[['PaperReferenceId', 'PaperReferenceTitle']],
          how = 'inner',
          on = 'PaperReferenceId')

Unnamed: 0,PaperId,PaperTitle,PaperReferenceId,PaperReferenceTitle
0,138145309,weight measuring device for cooking appliance,10599529,simultaneous display of net and gross weight o...
1,138145309,weight measuring device for cooking appliance,1523397624,food dispenser dispenser container and method
2,138145309,weight measuring device for cooking appliance,1561302427,weighing and dispensing device
3,138145309,weight measuring device for cooking appliance,2163726600,early product removal alarm
4,138145309,weight measuring device for cooking appliance,2227805926,weight measuring device
...,...,...,...,...
818,2071717597,realization of super resolution imaging by mic...,2011605930,optical virtual imaging at 50 nm lateral resol...
819,2071717597,realization of super resolution imaging by mic...,2054455698,imaging intracellular fluorescent proteins at ...
820,2071717597,realization of super resolution imaging by mic...,2101336703,ultra high resolution imaging by fluorescence ...
821,2071717597,realization of super resolution imaging by mic...,2160762085,scanning near field optical microscopy with ap...
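
The cell above joins the "Papers" table to itself through the citation table; the rename is what keeps the citing and cited titles in separate columns. A minimal sketch of the same pattern on made-up toy tables (all names and values are illustrative assumptions):

```python
import pandas as pd

# Toy papers and a toy citation table: paper 1 cites papers 2 and 3
toy_papers = pd.DataFrame({'PaperId': [1, 2, 3],
                           'PaperTitle': ['paper a', 'paper b', 'paper c']})
toy_citations = pd.DataFrame({'PaperId': [1, 1],
                              'PaperReferenceId': [2, 3]})

# Renamed copy of the papers table, used to look up the cited side
referenced = toy_papers.rename(columns={'PaperId': 'PaperReferenceId',
                                        'PaperTitle': 'PaperReferenceTitle'})

result = (
    toy_papers
    .merge(toy_citations, how='inner', on='PaperId')          # citing side
    .merge(referenced, how='inner', on='PaperReferenceId')    # cited side
)
print(result)
```

Because both joins are inner joins, a reference only appears if the cited paper is itself present in the papers table, which matters when working on a sample.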


In [83]:
# Showing the citation contexts found in the selected papers

Papers.loc[:30, ['PaperId', 'PaperTitle']].\
    merge(PaperCitationContexts, how = 'inner', on = 'PaperId')[['PaperTitle', 'CitationContext']]

Unnamed: 0,PaperTitle,CitationContext
0,characteristics of graphene for quantized hall...,"Recently, we have observed QHR plateaus in lar..."
1,characteristics of graphene for quantized hall...,This energy gap stabilizes the transverse resi...
2,characteristics of graphene for quantized hall...,A collaboration of University laboratories in ...
3,characteristics of graphene for quantized hall...,"Furthermore, graphene could support the produc..."
4,characteristics of graphene for quantized hall...,multiple symmetries in the quantum state of mo...
5,characteristics of graphene for quantized hall...,gating technique [9] developed by the group wo...
6,characteristics of graphene for quantized hall...,The superior electronic properties of two-dime...
7,characteristics of graphene for quantized hall...,The superior electronic properties of two-dime...
8,characteristics of graphene for quantized hall...,"Furthermore, graphene could support the produc..."
