# MAG Stanford AI Index

In early March 2021, Stanford University released its 2021 AI Index Report ([The AI Index Report – Artificial Intelligence Index (stanford.edu)](https://aiindex.stanford.edu/report/)). The report aims to be "the world's most credible and authoritative source for data and insights about AI". It covers topics ranging from AI investment, industry shifts, and diversity challenges among AI researchers to progress on AI ethics. MAG data contributed to the Research and Development chapter of the AI Index. This example provides sample code to reproduce the parts of the AI Index report that were generated from MAG data. To reproduce the AI Index results, use the 2021-01-18 version of MAG.

## Prerequisites

Complete these tasks before you begin this tutorial:

- Set up provisioning of Microsoft Academic Graph to an Azure blob storage account. See [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning).
- Set up the Azure Databricks service. See [Set up Azure Databricks](https://docs.microsoft.com/academic-services/graph/get-started-setup-databricks).
- Install the Python libraries `plotly`, `pycountry`, and `azure-storage-blob` on the cluster where you will run this tutorial.

## Gather the information

Before you begin, you should have these items of information:

- The name of your Azure Storage (AS) account containing the MAG dataset, from [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning#note-azure-storage-account-name-and-primary-key).
- The access key of your Azure Storage (AS) account, from [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning#note-azure-storage-account-name-and-primary-key).
- The name of the container in your Azure Storage (AS) account that contains the MAG dataset.

## Import notebooks

- [Import](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook) samples/pyspark/MagClass.py from the MAG dataset into your working folder.
- [Import](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook) this notebook (samples/pyspark/AIIndex.py) into the same folder.

### Initialize storage account and container details

  | Variable  | Value | Description  |
  | --------- | --------- | --------- |
  | AzureStorageAccount | Replace **`<AzureStorageAccount>`** | The Azure Storage account containing the MAG dataset. |
  | AzureStorageAccessKey | Replace **`<AzureStorageAccessKey>`** | The access key of the Azure Storage account. |
  | MagContainer | Replace **`<MagContainer>`** | The name of the container in the Azure Storage account that holds the MAG dataset, usually in the form `mag-yyyy-mm-dd`. |
  | OutputContainer | Replace **`<OutputContainer>`** | The name of the container in the Azure Storage account that receives the output. Create this container before running the script. |

In [0]:
AzureStorageAccount = '<AzureStorageAccount>'
AzureStorageAccessKey = '<AzureStorageAccessKey>'
MagContainer = '<MagContainer>'
OutputContainer = '<OutputContainer>'
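
Before running the remaining cells, it can help to confirm that every placeholder was replaced. The sketch below is a convenience check and not part of the original notebook: `looks_like_mag_container` and `unreplaced_placeholders` are hypothetical helper names, and the regular expression only encodes the `mag-yyyy-mm-dd` form that the table describes as usual, so a container name that fails the check is not necessarily wrong.

```python
import re

# Hypothetical helpers, not part of MagClass or the MAG samples.
MAG_CONTAINER_PATTERN = re.compile(r"^mag-\d{4}-\d{2}-\d{2}$")

def looks_like_mag_container(name):
    """Return True when the name follows the usual mag-yyyy-mm-dd form."""
    return bool(MAG_CONTAINER_PATTERN.match(name))

def unreplaced_placeholders(settings):
    """Return the names of settings that still hold a '<...>' placeholder."""
    return sorted(
        key for key, value in settings.items()
        if value.startswith("<") and value.endswith(">")
    )

# Example values only; use your own account and container names.
settings = {
    "AzureStorageAccount": "<AzureStorageAccount>",  # still a placeholder
    "MagContainer": "mag-2021-01-18",
}
print(unreplaced_placeholders(settings))
print(looks_like_mag_container(settings["MagContainer"]))
```

An empty list from `unreplaced_placeholders` suggests that all values were filled in.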

### Define MicrosoftAcademicGraph class

Run the MagClass notebook to define the MicrosoftAcademicGraph class.

In [0]:
%run "./MagClass"

### Create a MicrosoftAcademicGraph instance to access MAG dataset
Use account=AzureStorageAccount, key=AzureStorageAccessKey, container=MagContainer.

In [0]:
MAG = MicrosoftAcademicGraph(account=AzureStorageAccount, key=AzureStorageAccessKey, container=MagContainer)

### Create an AzureStorageUtil instance to access other Azure Storage files
Use account=AzureStorageAccount, key=AzureStorageAccessKey, container=OutputContainer.

In [0]:
ASU = AzureStorageUtil(account=AzureStorageAccount, key=AzureStorageAccessKey, container=OutputContainer)

### Import python libraries

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType  # used by the UDF in Step 6.3
from pyspark.sql.window import Window
from plotly.offline import plot
from plotly.graph_objs import *
import numpy as np
import pandas as pd
import pycountry as pc

### Load MAG data

In [0]:
Paper = MAG.getDataframe('Papers')
Fos = MAG.getDataframe('FieldsOfStudy')
FosHierarchy = MAG.getDataframe('FieldOfStudyChildren')
Aff = MAG.getDataframe('Affiliations')
PaperAuthorAff = MAG.getDataframe('PaperAuthorAffiliations')
PaperFos = MAG.getDataframe('PaperFieldsOfStudy')

### Step 1: Get field of study ids of artificial intelligence and selected sub-domains

In [0]:
AIFosId = Fos.where(Fos.DisplayName == 'Artificial intelligence').select(Fos.FieldOfStudyId).first()[0]
RoboticsFosId = Fos.where(Fos.DisplayName == 'Robotics').select(Fos.FieldOfStudyId).first()[0]
ComputerVisionFosId = Fos.where(Fos.DisplayName == 'Computer vision').select(Fos.FieldOfStudyId).first()[0]
PatternRecFosId = Fos.where(Fos.DisplayName == 'Pattern recognition').select(Fos.FieldOfStudyId).first()[0]
MLFosId = Fos.where(Fos.DisplayName == 'Machine learning').select(Fos.FieldOfStudyId).first()[0]
NLPFosId = Fos.where(Fos.DisplayName == 'Natural language processing').select(Fos.FieldOfStudyId).first()[0]

### Step 2: Get all field of study ids in the artificial intelligence domain
- MAG definition (default): `artificial intelligence` only.
- OECD definition: all sub-topics under `artificial intelligence` and `machine learning` in the MAG taxonomy. To use the OECD definition, set `UseOecdDefinition = True`.

In [0]:
AISubFos = Fos.where(Fos.DisplayName == 'Artificial intelligence').select(Fos.FieldOfStudyId)

UseOecdDefinition = False
if (UseOecdDefinition):
  subFos = Fos.where((Fos.FieldOfStudyId  == AIFosId) | (Fos.FieldOfStudyId == MLFosId)) \
    .select(Fos.FieldOfStudyId)
  AISubFos = subFos.select('*')

  while(subFos.count() > 0):
    subFos = FosHierarchy.join(subFos, subFos.FieldOfStudyId == FosHierarchy.FieldOfStudyId, 'inner') \
      .select(FosHierarchy.ChildFieldOfStudyId.alias('FieldOfStudyId')).distinct()
    AISubFos = AISubFos.union(subFos).distinct()
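
The OECD branch above walks the taxonomy level by level: each pass joins the current set of fields against the parent-to-child table and adds the children it finds. The sketch below shows the same idea in plain Python on toy ids. The function name `expand_descendants` and the edge list are illustrative assumptions, and unlike the notebook loop the sketch also skips fields it has already collected, which guards against revisiting a field.

```python
def expand_descendants(root_ids, parent_child_pairs):
    """Collect the roots plus every field reachable through parent -> child edges."""
    children = {}
    for parent, child in parent_child_pairs:
        children.setdefault(parent, set()).add(child)

    collected = set(root_ids)
    frontier = set(root_ids)
    while frontier:
        # One pass corresponds to one join against FieldOfStudyChildren.
        next_level = set()
        for field_id in frontier:
            next_level |= children.get(field_id, set())
        # Keep only fields not seen before, then add them to the result.
        frontier = next_level - collected
        collected |= frontier
    return collected

# Toy ids, not real MAG FieldOfStudyId values.
edges = [(1, 2), (1, 3), (3, 4), (9, 10)]
print(sorted(expand_descendants({1}, edges)))  # [1, 2, 3, 4]
```

Field 9 and its child 10 are not reachable from the root, so they stay out of the result.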

### Step 3: Get all AI papers
- Year from 2000 to 2020
- DocType in `Journal`, `Conference`, `Patent`, and `Repository`
- Steps 3, 4, and 6.5 might each take more than 10 minutes on a Standard_DS3_v2 (14 GB) cluster. A cluster with more memory will reduce the execution time.

In [0]:
AIPapers = Paper.join(PaperFos, Paper.PaperId == PaperFos.PaperId, 'inner') \
  .join(AISubFos, PaperFos.FieldOfStudyId == AISubFos.FieldOfStudyId, 'inner') \
  .where(  (Paper.Year >= 2000) \
         & (Paper.Year <= 2020) \
         & ((Paper.DocType == 'Journal') | (Paper.DocType == 'Conference') | (Paper.DocType == 'Patent') | (Paper.DocType == 'Repository'))) \
  .select(Paper.PaperId, Paper.DocType, Paper.Year, Paper.EstimatedCitation).distinct()

# Convert Spark DataFrame to Pandas DataFrame to expedite graph drawing
AIPapersPandas = AIPapers.toPandas()
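
The selection rule in this step is a simple predicate on each paper. As a minimal sketch with toy records (the helper name `is_selected` and the sample rows are assumptions for illustration), the same rule in plain Python looks like this. A missing year is rejected here, which mirrors how a null comparison drops the row in a Spark `where` clause.

```python
ALLOWED_DOC_TYPES = {"Journal", "Conference", "Patent", "Repository"}

def is_selected(paper):
    """Mirror the Step 3 filter: year within 2000..2020 and an allowed DocType."""
    year = paper.get("Year")
    return (
        year is not None
        and 2000 <= year <= 2020
        and paper.get("DocType") in ALLOWED_DOC_TYPES
    )

# Toy records, not real MAG papers.
papers = [
    {"PaperId": 1, "Year": 1999, "DocType": "Journal"},     # too early
    {"PaperId": 2, "Year": 2010, "DocType": "Book"},        # DocType not selected
    {"PaperId": 3, "Year": 2020, "DocType": "Patent"},      # kept
    {"PaperId": 4, "Year": None, "DocType": "Conference"},  # no year
]
selected_ids = [p["PaperId"] for p in papers if is_selected(p)]
print(selected_ids)  # [3]
```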

### Step 4: Get papers geographic information
- Author's affiliation location is used to derive the paper's geographic region.
- `Iso3166Code` is the two-letter (alpha-2) country code defined in [ISO_3166 Code (ISO.org)](https://www.iso.org/iso-3166-country-codes.html) and [ISO 3166-2 (Wiki)](https://en.wikipedia.org/wiki/ISO_3166-2).

In [0]:
aiPaperRegion = AIPapers \
  .join(PaperAuthorAff, AIPapers.PaperId == PaperAuthorAff.PaperId, 'inner') \
  .join(Aff, PaperAuthorAff.AffiliationId == Aff.AffiliationId, 'left_outer') \
  .select(AIPapers.PaperId, AIPapers.Year, AIPapers.DocType, \
         F.when( ((Aff.Iso3166Code.isNull()) | (Aff.Iso3166Code == '' )), 'Unknown').otherwise(Aff.Iso3166Code).alias('Region'), \
         AIPapers.EstimatedCitation) \
  .distinct()

# convert to Pandas DataFrame for performance purpose
aiPaperRegionPandas = aiPaperRegion.toPandas()
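
Two details of the query above are easy to miss: the left outer join keeps authors without a matching affiliation, and `distinct()` collapses several authors from the same region into one row per paper and region. The plain-Python sketch below illustrates both on toy data. The function name `paper_regions` and the ids are assumptions for illustration only.

```python
def paper_regions(author_rows, affiliation_country):
    """Map (paper_id, affiliation_id) rows to distinct (paper_id, region) pairs."""
    pairs = set()
    for paper_id, affiliation_id in author_rows:
        # .get returns None when there is no match, like a left outer join.
        code = affiliation_country.get(affiliation_id)
        # Missing and empty codes both become 'Unknown'.
        pairs.add((paper_id, code if code else "Unknown"))
    return sorted(pairs)

# Toy ids; affiliation 300 has an empty country code.
affiliation_country = {100: "US", 200: "CN", 300: ""}
author_rows = [(1, 100), (1, 100), (1, 200), (2, None), (3, 300)]
print(paper_regions(author_rows, affiliation_country))
# [(1, 'CN'), (1, 'US'), (2, 'Unknown'), (3, 'Unknown')]
```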

### Step 5: Distribute weights among collaborating geographic locations
Each paper is counted exactly once. When a paper has multiple authors or regions, the credit is equally distributed to the unique regions. For example, if a paper has two authors from the United States, one from China, and one from the United Kingdom, then the United States, China, and the United Kingdom each get one-third credit.

In [0]:
paperRegionCount = aiPaperRegionPandas.groupby(['PaperId'])['Region'].nunique().reset_index(name='NumberOfRegion')

AIPaperRegionNormalized = aiPaperRegionPandas.merge(paperRegionCount, how='inner', on='PaperId')
AIPaperRegionNormalized['NormalizedPaperCount'] = 1/AIPaperRegionNormalized.NumberOfRegion
AIPaperRegionNormalized['NormalizedCitation'] = AIPaperRegionNormalized.EstimatedCitation/AIPaperRegionNormalized.NumberOfRegion
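
The example from the text (two authors from the United States, one from China, one from the United Kingdom) can be checked with a few lines of plain Python. This is a sketch of the weighting rule only: `region_weights` is a hypothetical name, and `Fraction` is used here to keep the arithmetic exact, whereas the notebook works with floating-point values.

```python
from collections import defaultdict
from fractions import Fraction

def region_weights(paper_region_pairs):
    """Give each distinct region of a paper an equal share of one paper credit."""
    regions_by_paper = defaultdict(set)
    for paper_id, region in paper_region_pairs:
        regions_by_paper[paper_id].add(region)
    return {
        (paper_id, region): Fraction(1, len(regions))
        for paper_id, regions in regions_by_paper.items()
        for region in regions
    }

# Paper 1: two US authors, one from China, one from the United Kingdom.
pairs = [(1, "US"), (1, "US"), (1, "CN"), (1, "GB")]
weights = region_weights(pairs)
print(weights[(1, "US")])     # 1/3
print(sum(weights.values()))  # 1
```

Because the weights of a paper always sum to one, summing `NormalizedPaperCount` over all regions gives back the number of papers.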

### Step 6: AI paper distributions

#### Step 6.1: AI paper distribution by DocType

In [0]:
# Count AI papers by DocType
DocTypeCount = AIPapersPandas.groupby(['DocType'])['PaperId'].nunique().reset_index(name='PaperCount')

# Plot AI paper distribution by DocType
# Once this cell finishes running, set Plot Options to: 
#   Display type: Pie chart
#   Plot Options:
#     Keys: DocType
#     Values: PaperCount
display(DocTypeCount)

| DocType | PaperCount |
| --------- | --------- |
| Conference | 691741 |
| Journal | 864674 |
| Patent | 1036811 |
| Repository | 117122 |
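
The pie chart shows each DocType as a share of the total. As a small worked example, the shares can be derived from the counts in the output above. The sketch assumes each paper carries a single DocType, so that the four counts add up to the number of distinct papers.

```python
# Counts copied from the output above.
doc_type_count = {
    "Conference": 691741,
    "Journal": 864674,
    "Patent": 1036811,
    "Repository": 117122,
}

total = sum(doc_type_count.values())
# Percentage share of each DocType, rounded to one decimal place.
shares = {
    doc_type: round(100 * count / total, 1)
    for doc_type, count in doc_type_count.items()
}

print(total)  # 2710348
for doc_type, share in shares.items():
    print(doc_type, share)
```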


#### Step 6.2: AI paper distribution by DocType and Year

In [0]:
# Count AI papers by DocType and year
YearCount = AIPapersPandas.groupby(['DocType','Year'])['PaperId'].nunique().reset_index(name='PaperCount')
# sort_values returns a new DataFrame, so assign the result back
YearCount = YearCount.sort_values(['DocType','Year'])

# Plot AI papers yearly change by DocType
# Once this cell finishes running, set Plot Options to: 
#   Display type: Bar chart (stacked)
#   Plot Options:
#     Keys: Year
#     Series groupings: DocType
#     Values: PaperCount
#     Aggregation: SUM
display(YearCount)

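| DocType | Year | PaperCount |
| --------- | --------- | --------- |
| Conference | 2000 | 11755 |
| Conference | 2001 | 11176 |
| Conference | 2002 | 14697 |
| Conference | 2003 | 16551 |
| Conference | 2004 | 21591 |
| Conference | 2005 | 24416 |
| Conference | 2006 | 28895 |
| Conference | 2007 | 30390 |
| Conference | 2008 | 35168 |
| Conference | 2009 | 40056 |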


#### Step 6.3: AI paper distribution by DocType and Region

In [0]:
# Count AI papers by DocType and Region
RegionCountPd = AIPaperRegionNormalized.groupby(['DocType', 'Region']).agg({'PaperId':[('RowCount','nunique')], 'NormalizedPaperCount':[('NormalizedCount','sum')], 'NormalizedCitation':[('NormalizedCitationCount','sum')]})
RegionCountPd.columns = RegionCountPd.columns.get_level_values(1)
RegionCountPd = RegionCountPd.reset_index() 

# convert back to Spark Dataframe to utilize the world map plot
RegionCount = spark.createDataFrame(RegionCountPd)

iso2_to_iso3 = F.udf(lambda x: pc.countries.get(alpha_2 =x).alpha_3 if x != 'Unknown' else x, StringType())
RegionCount = RegionCount.withColumn('ISO3', iso2_to_iso3(RegionCount.Region))

iso3Count = RegionCount \
  .where(RegionCount.ISO3 != 'Unknown') \
  .select(RegionCount.ISO3, RegionCount.NormalizedCount) \
  .groupBy(RegionCount.ISO3) \
  .sum('NormalizedCount') \
  .withColumnRenamed('sum(NormalizedCount)','NormalizedCount') \
  .orderBy(F.desc('NormalizedCount'))

# Use the Databricks "display" function to visualize the iso3Count DataFrame on a map.
# Once this cell finishes running, set Plot Options to: 
#  Display type: World map
#  Plot Options:
#    Keys: ISO3
#    Values: NormalizedCount
#    Aggregation: SUM
display(iso3Count.take(50))

# Save to Azure blob for further analysis if needed
ASU.save(iso3Count,'AIIndex/AIPaperDistributionByISO3.tsv', coalesce=True)

| ISO3 | NormalizedCount |
| --------- | --------- |
| USA | 302946.30331745854 |
| CHN | 254009.3368631016 |
| JPN | 82776.78897537189 |
| GBR | 62926.24389080522 |
| DEU | 51888.95535156916 |
| IND | 45235.48493171074 |
| FRA | 43807.55232404151 |
| KOR | 40713.81029294409 |
| CAN | 33458.56095707529 |
| TWN | 28257.319031411447 |
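
The world map needs three-letter (alpha-3) codes, so the cell above converts each alpha-2 region code. A code that the lookup does not recognize could make the conversion fail, so a defensive variant that falls back to `Unknown` may be useful. The sketch below illustrates that fallback with a tiny hand-written lookup. The dictionary and the `to_alpha3` name are assumptions for illustration, whereas the notebook relies on `pycountry` for the full country list.

```python
# Tiny illustrative lookup; the notebook uses pycountry for the full list.
ALPHA2_TO_ALPHA3 = {"US": "USA", "CN": "CHN", "JP": "JPN", "GB": "GBR", "DE": "DEU"}

def to_alpha3(alpha2, lookup=ALPHA2_TO_ALPHA3):
    """Convert an alpha-2 code to alpha-3, returning 'Unknown' when no mapping exists."""
    if not alpha2 or alpha2 == "Unknown":
        return "Unknown"
    return lookup.get(alpha2.upper(), "Unknown")

print([to_alpha3(code) for code in ["US", "gb", "Unknown", "ZZ", None]])
# ['USA', 'GBR', 'Unknown', 'Unknown', 'Unknown']
```

Rows that end up as `Unknown` are then excluded by the `where` clause before the map is drawn.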


#### Step 6.4: AI paper distribution by DocType, Year, and Region

In [0]:
# Stats for AI paper distribution by DocType, Year and Region
RegionYearDocTypeCount = AIPaperRegionNormalized.groupby(['Year','DocType','Region']) \
  .agg({'PaperId':[('RowCount','count'), ('PidCount', 'nunique')], 'NormalizedPaperCount':[('NormalizedCount','sum')], 'NormalizedCitation':[('NormalizedCitationCount','sum')]})
RegionYearDocTypeCount.columns = RegionYearDocTypeCount.columns.get_level_values(1)
RegionYearDocTypeCount = RegionYearDocTypeCount.reset_index()

display(RegionYearDocTypeCount.head(10))

# Save to Azure blob for further analysis if needed
ASU.save(spark.createDataFrame(RegionYearDocTypeCount),'AIIndex/AIPaperDistributionByDocTypeYearRegion.tsv', coalesce=True)

| Year | DocType | Region | RowCount | PidCount | NormalizedCount | NormalizedCitationCount |
| --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| 2000 | Conference | AE | 29 | 29 | 21.666666666666668 | 520.6666666666666 |
| 2000 | Conference | AG | 1 | 1 | 0.3333333333333333 | 0.6666666666666666 |
| 2000 | Conference | AO | 2 | 2 | 1.0 | 9.5 |
| 2000 | Conference | AR | 2 | 2 | 2.0 | 25.0 |
| 2000 | Conference | AT | 48 | 48 | 33.916666666666664 | 1087.4999999999998 |
| 2000 | Conference | AU | 254 | 254 | 187.5 | 4705.5 |
| 2000 | Conference | BA | 1 | 1 | 1.0 | 97.0 |
| 2000 | Conference | BE | 71 | 71 | 52.20000000000001 | 2192.7 |
| 2000 | Conference | BG | 7 | 7 | 4.833333333333333 | 42.5 |
| 2000 | Conference | BR | 95 | 95 | 62.50000000000001 | 787.0 |
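
The output columns differ in what they count: `RowCount` and `PidCount` count every paper that touches a region, while `NormalizedCount` and `NormalizedCitationCount` add up only the fractional share credited to it. The plain-Python sketch below reproduces the grouping on three toy rows. The `aggregate` function and the sample values are assumptions for illustration.

```python
from collections import defaultdict

def aggregate(rows):
    """Group rows by (Year, DocType, Region) and compute the Step 6.4 columns."""
    groups = defaultdict(lambda: {
        "RowCount": 0, "PaperIds": set(),
        "NormalizedCount": 0.0, "NormalizedCitationCount": 0.0,
    })
    for row in rows:
        g = groups[(row["Year"], row["DocType"], row["Region"])]
        g["RowCount"] += 1
        g["PaperIds"].add(row["PaperId"])
        g["NormalizedCount"] += row["NormalizedPaperCount"]
        g["NormalizedCitationCount"] += row["NormalizedCitation"]
    return {
        key: {
            "RowCount": g["RowCount"],
            "PidCount": len(g["PaperIds"]),
            "NormalizedCount": g["NormalizedCount"],
            "NormalizedCitationCount": g["NormalizedCitationCount"],
        }
        for key, g in groups.items()
    }

# Toy rows: paper 1 (10 citations) is split between AT and AU; paper 2 belongs to AU only.
rows = [
    {"PaperId": 1, "Year": 2000, "DocType": "Conference", "Region": "AT",
     "NormalizedPaperCount": 0.5, "NormalizedCitation": 5.0},
    {"PaperId": 1, "Year": 2000, "DocType": "Conference", "Region": "AU",
     "NormalizedPaperCount": 0.5, "NormalizedCitation": 5.0},
    {"PaperId": 2, "Year": 2000, "DocType": "Conference", "Region": "AU",
     "NormalizedPaperCount": 1.0, "NormalizedCitation": 4.0},
]
result = aggregate(rows)
print(result[(2000, "Conference", "AU")])
```

In this toy data AU is involved in two papers but is credited with 1.5 papers and 9 citations.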


#### Step 6.5: Get top level domains for AI papers
- Include fields of study level 0 and 1.
- Also include `robotics` from level 3, which is a domain of interest.

In [0]:
AIPaperFos = AIPapers \
  .join(PaperFos, PaperFos.PaperId == AIPapers.PaperId, 'inner') \
  .join(Fos, PaperFos.FieldOfStudyId == Fos.FieldOfStudyId, 'inner') \
  .where( (Fos.Level == 0) | (Fos.Level == 1) | (Fos.FieldOfStudyId == RoboticsFosId)) \
  .select(AIPapers.PaperId, AIPapers.DocType, AIPapers.Year, AIPapers.EstimatedCitation.alias('Citation'), \
          Fos.DisplayName.alias('DomainName'), Fos.FieldOfStudyId, Fos.Level, PaperFos.Score)

# convert Spark DataFrame to Pandas DataFrame for performance purpose
AIPaperFosPandas = AIPaperFos.toPandas()

#### Step 6.6: AI paper distribution by top level domains (level 0)

In [0]:
AIPapersL0 = AIPaperFosPandas[AIPaperFosPandas['Level'] == 0] \
  .groupby(['DomainName','Year','DocType'])['PaperId'].nunique().reset_index(name = 'PaperCount')

ASU.save(spark.createDataFrame(AIPapersL0), 'AIIndex/AIPaperDistributionByYearDocTypeDomainL0.tsv', coalesce=True)

#### Step 6.7: AI paper distribution by selected sub-domains
Each paper is assigned to exactly one of the categories below. Because a paper may belong to more than one sub-domain in MAG, the sub-domain with the highest `Score` in the PaperFieldsOfStudy relationship is selected. Papers whose top-scoring sub-domain is not one of the selected sub-domains fall under `Other`.

- Computer vision
- Pattern recognition
- Machine learning
- Natural language processing
- Robotics
- Other

In [0]:
AIPaperL1_raw = AIPaperFosPandas[ ((AIPaperFosPandas['Level'] == 1 ) | (AIPaperFosPandas['FieldOfStudyId'] == RoboticsFosId )) \
                  & (AIPaperFosPandas['FieldOfStudyId'] != AIFosId)] 
ranks = AIPaperL1_raw.groupby('PaperId')['Score'].rank(method='first', ascending=False)
ranks.name = 'RowCount'
AIPaperL1_rank = pd.concat([AIPaperL1_raw, ranks], axis = 1)
AIPaperL1 = AIPaperL1_rank[AIPaperL1_rank['RowCount'] == 1]

AIPapersL1_dist = AIPaperL1.groupby(['DomainName','FieldOfStudyId'])['PaperId'].nunique().reset_index(name = 'PaperCount')
ASU.save(spark.createDataFrame(AIPapersL1_dist), 'AIIndex/AIPaperDistributionByDomainL1.tsv', coalesce=True)

subdomainPaper1 = AIPaperL1[AIPaperL1['FieldOfStudyId'].isin([ComputerVisionFosId, PatternRecFosId, MLFosId, NLPFosId, RoboticsFosId])]
subdomainPaper1 = subdomainPaper1[['PaperId', 'FieldOfStudyId', 'DomainName']]

otherAIPaper = AIPapersPandas.merge(subdomainPaper1, how ='left', on='PaperId', indicator=True).loc[lambda x : x['_merge']=='left_only']
otherAIPaper['DomainName'] = 'Other'
otherAIPaper = otherAIPaper[['PaperId', 'FieldOfStudyId', 'DomainName']].drop_duplicates()

AIPaper_subdomain = pd.concat([subdomainPaper1, otherAIPaper])
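
The assignment rule can be summarized as: take each paper's highest-scoring candidate field, keep it if it is one of the five selected sub-domains, and label everything else `Other`. The plain-Python sketch below illustrates that rule on toy rows. The `assign_subdomain` name, the scores, and the sample fields are assumptions for illustration.

```python
SELECTED = {
    "Computer vision", "Pattern recognition", "Machine learning",
    "Natural language processing", "Robotics",
}

def assign_subdomain(all_paper_ids, scored_fields):
    """Pick each paper's highest-scoring field; anything outside SELECTED becomes 'Other'."""
    best = {}
    for paper_id, domain, score in scored_fields:
        # Strict '>' keeps the first row on ties, like rank(method='first').
        if paper_id not in best or score > best[paper_id][1]:
            best[paper_id] = (domain, score)
    return {
        pid: best[pid][0] if pid in best and best[pid][0] in SELECTED else "Other"
        for pid in all_paper_ids
    }

# Toy rows of (PaperId, DomainName, Score); paper 3 has no candidate field at all.
scored_fields = [
    (1, "Machine learning", 0.6), (1, "Computer vision", 0.4),
    (2, "Algorithm", 0.7), (2, "Machine learning", 0.5),
]
print(assign_subdomain([1, 2, 3], scored_fields))
# {1: 'Machine learning', 2: 'Other', 3: 'Other'}
```

Paper 2 lands in `Other` even though it is also tagged with machine learning, because only its top-scoring field is considered.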

#### Step 6.8: AI Paper distribution by Region, Year, DocType, and Sub-domain

In [0]:
sub = AIPaperRegionNormalized.merge(AIPaper_subdomain, how='inner', on='PaperId') \
  .groupby(['Region', 'Year', 'DocType', 'DomainName']) \
  .agg({'NormalizedPaperCount':[('NormalizedPaperCount','sum')], 'NormalizedCitation':[('CitationCount','sum')]})
sub.columns = sub.columns.get_level_values(1)
AIPaper_subdomain_region = sub.reset_index()  

ASU.save(spark.createDataFrame(AIPaper_subdomain_region), 'AIIndex/AIPaperDistributionByRegionYearDocTypeDomain.tsv', coalesce=True)