# The Science of Science
by: Dashun Wang and Albert-Laszlo Barabasi

[You can get the textbook here.](https://www.amazon.com/Science-Dashun-Wang/dp/1108716954/ref=asc_df_1108716954/?tag=hyprod-20&linkCode=df0&hvadid=459526655425&hvpos=&hvnetw=g&hvrand=10075848530578766295&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9002059&hvtargid=pla-967727027885&psc=1)

Companion notebooks by Alex Gates.

To start, we need to download the data and preprocess it.  
These steps only need to be run once, when you first download the data.  

Note: We just learned that Microsoft will discontinue its support for MAG as of Dec 2021.  As other data sources become available, we will update this code.

In [None]:
# load the pyscisci package

import pyscisci.all as pyscisci

In [None]:
# you should download the MAG data from Microsoft's website:
# https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

# set this path to where the MAG database is locally stored
path2mag = '/home/ajgates/MAG'


In [None]:
# Create a MAG object

mymag = pyscisci.MAG(path2mag, keep_in_memory=False) 
# set keep_in_memory=False if you want to load the database each time it's needed - good for when you
# can't keep more than one database in memory at a time
# otherwise keep_in_memory=True will keep each database in memory after it's loaded
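
The trade-off behind `keep_in_memory` can be illustrated with a small toy class. This is only a sketch of the general idea (reload on every access versus cache after the first load). `ToyDatabase`, `get`, and `load_count` are hypothetical names made up for this illustration and are not part of pyscisci, whose internals may work differently.

```python
class ToyDatabase:
    """Toy illustration of a keep_in_memory flag (hypothetical, not pyscisci code)."""

    def __init__(self, keep_in_memory=False):
        self.keep_in_memory = keep_in_memory
        self._cache = {}
        self.load_count = 0  # number of simulated reads from disk

    def _read_from_disk(self, name):
        # Stand-in for an expensive read of a preprocessed table.
        self.load_count += 1
        return {"table": name}

    def get(self, name):
        if self.keep_in_memory:
            # Read once, then reuse the cached copy on later calls.
            if name not in self._cache:
                self._cache[name] = self._read_from_disk(name)
            return self._cache[name]
        # Nothing is cached, so every call reads from disk again.
        return self._read_from_disk(name)


reloading = ToyDatabase(keep_in_memory=False)
reloading.get("pub")
reloading.get("pub")

cached = ToyDatabase(keep_in_memory=True)
cached.get("pub")
cached.get("pub")

print(reloading.load_count, cached.load_count)
```

In this toy, the reloading instance reads the table twice while the cached instance reads it once, at the cost of holding the table in memory.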

In [None]:
# before we can start running our analysis, we have to preprocess the raw data into
# DataFrames that are more convenient to work with

# we only need to run this once, but it will take a while
mymag.preprocess(verbose=True)

In [None]:
# MAG contains the following dataframes:

# pub_df - keeps all of the publication information
# columns : ['PublicationId', 'Year', 'JournalId', 'FamilyId',  'Doi', 'Title', 'Date', 'Volume', 'Issue', 'DocType']

# author_df - keeps all of the author information
# columns : ['AuthorId', 'FullName', 'LastName', 'FirstName', 'MiddleName']

# pub2ref_df - links publications to their references or citations
# columns : ['CitingPublicationId', 'CitedPublicationId']

# paa_df - links publications, authors, and affiliations
# columns : ['PublicationId', 'AuthorId', 'AffiliationId', 'AuthorSequence',  'OrigAuthorName', 'OrigAffiliationName']

# author2pub_df - links the authors to their publications
# columns : ['PublicationId', 'AuthorId', 'AuthorOrder']

# field_df - field information
# columns : ['FieldId', 'FieldLevel', 'NumberPublications', 'FieldName']

# pub2field_df - links publications to their fields
# columns : ['PublicationId', 'FieldId']

# affiliation_df - affiliation information
# columns : ['AffiliationId', 'NumberPublications', 'NumberCitations', 'FullName', 'GridId', 'OfficialPage', 'WikiPage', 'Latitude', 'Longitude']

# journal_df - journal information
# columns : ['JournalId', 'FullName', 'Issn', 'Publisher', 'Webpage']


# after additional processing, these DataFrames become available

# pub2refnoself_df - links publications to their references or citations with self-citations removed
# columns : ['CitingPublicationId', 'CitedPublicationId']

# impact_df - precomputed citation counts, columns will depend on which counts are computed
# columns : ['PublicationId', 'Year', ....]
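
To make the relationship between these tables concrete, here is a minimal pandas sketch on made-up toy data. It does not call pyscisci and is not its implementation. It assumes that a self-citation is a citation whose citing and cited publications share at least one author, and the names `CitationCount`, `toy_impact_df`, and `toy_pub2refnoself_df` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Toy stand-ins that follow the column names listed above (IDs are made up).
pub_df = pd.DataFrame({
    "PublicationId": [1, 2, 3, 4],
    "Year": [2000, 2005, 2010, 2012],
})
author2pub_df = pd.DataFrame({
    "PublicationId": [1, 2, 3, 4],
    "AuthorId": [10, 10, 20, 30],
})
pub2ref_df = pd.DataFrame({
    "CitingPublicationId": [2, 3, 4, 4],
    "CitedPublicationId": [1, 1, 1, 3],
})

# Citation count: how many times each publication appears as the cited one.
citation_counts = (
    pub2ref_df.groupby("CitedPublicationId")
    .size()
    .rename("CitationCount")  # hypothetical column name
    .reset_index()
    .rename(columns={"CitedPublicationId": "PublicationId"})
)

# Attach the counts to the publication table; uncited publications get 0.
toy_impact_df = pub_df.merge(citation_counts, on="PublicationId", how="left")
toy_impact_df["CitationCount"] = toy_impact_df["CitationCount"].fillna(0).astype(int)

# Find self-citations: pair every citing author with every cited author
# and keep the citation links where the two author ids match.
citing_authors = author2pub_df.rename(
    columns={"PublicationId": "CitingPublicationId", "AuthorId": "CitingAuthorId"}
)
cited_authors = author2pub_df.rename(
    columns={"PublicationId": "CitedPublicationId", "AuthorId": "CitedAuthorId"}
)
pairs = pub2ref_df.merge(citing_authors, on="CitingPublicationId").merge(
    cited_authors, on="CitedPublicationId"
)
link_cols = ["CitingPublicationId", "CitedPublicationId"]
self_links = pairs.loc[
    pairs["CitingAuthorId"] == pairs["CitedAuthorId"], link_cols
].drop_duplicates()

# Anti-join: keep only the citation links that are not self-citations.
marked = pub2ref_df.merge(self_links, on=link_cols, how="left", indicator=True)
toy_pub2refnoself_df = marked.loc[
    marked["_merge"] == "left_only", link_cols
].reset_index(drop=True)

print(toy_impact_df)
print(toy_pub2refnoself_df)
```

In this toy data, publication 2 cites publication 1 and both are written by author 10, so that single link is dropped from the no-self-citation table while the other three links remain.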