# Location Web Scraping of Microsoft Academics Graph (MAG) Dataset

Jupyter Notebook for the web scraping of the conferences locations of the Microsoft Academics Graph (MAG) dump.

For this process, the following CSV file is needed: ```out_mag_citations_count_and_conferences.csv```. 
The above file must be generated running the ```preprocess_mag.ipynb``` Notebook that is contained in the ```1 - Citation Dumps Preprocess``` folder of this Repository.

In particular, the following operations are going to be executed:
* Opening of the CSV peprocessed dump
* Fix of the conferences names
* Obtaining the missing locations with queries to the DBLP website
* Fix of the locations format

Lastly, the entire preprocessed dump is going to be saved on disk in CSV format

In [43]:
# Libraries Import
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [44]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

## Read of the CSV Preprocessed Dump

In [46]:
df_mag_with_locations_v1 = pd.read_csv(path_file_export + 'out_mag_citations_and_locations.csv', low_memory=False, index_col=[0])
df_mag_with_locations_v1.drop(df_mag_with_locations_v1.filter(regex="Unname"), axis=1, inplace=True)
df_mag_with_locations_v1

Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014,1.131603e+09,4038532.0,12,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014,1.154039e+09,157008481.0,10,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013,1.196984e+09,,20,20,ENTER,Information and Communication Technologies in ...,enter 2013,,"Innsbruck, Austria"
3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002,1.192665e+09,,0,0,DEXA,Database and Expert Systems Applications,dexa 2002,,"Aix-en-Provence, France"
4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006,1.176896e+09,,19,19,ICAISC,International Conference on Artificial Intelli...,icaisc 2006,,"Zakopane, Poland"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409807,3102242761,10.1109/IECON43393.2020.9254316,loss reduction by synchronous rectification in...,Loss Reduction by Synchronous Rectification in...,2020,2.623572e+09,,0,0,IECON,Conference of the Industrial Electronics Society,iecon 2020,,Singapore
4409808,3136855299,10.1109/BMSB49480.2020.9379806,data over cable services improving the bicm ca...,Data Over Cable Services – Improving the BICM ...,2020,2.623662e+09,,0,0,BMSB,International Symposium on Broadband Multimedi...,bmsb 2020,,"Paris, France"
4409809,3145351916,10.1109/ACC.1988.4172843,model reference robust adaptive control withou...,Model Reference Robust Adaptive Control withou...,1988,2.238538e+09,,0,0,ACC,American Control Conference,acc 1988,,
4409810,3151696876,10.1109/ICASSP.2002.1005676,missing data speech recognition in reverberant...,Missing data speech recognition in reverberant...,2002,1.121228e+09,,0,0,ICASSP,"International Conference on Acoustics, Speech,...",icassp 2002,,"Orlando, Florida, USA"


## Fix of the Missing Conferences Names
Some papers have only the indication of the conference series. For this reason, the conference instance and the related conference locations don't have a value.

However, every paper has been published in a specific "instance" of a conference, hence it should have a location. These papers will be "fixed" considering the year of their publication and their conference.

In [None]:
df_mag_preprocessed_subset = df_mag_preprocessed.iloc[:50]
df_mag_preprocessed_subset = df_mag_preprocessed_subset.dropna(subset = ['ConferenceNormalizedName'])
df_mag_preprocessed_subset.iloc[:10][["Year", "ConferenceSeriesNormalizedName", "ConferenceNormalizedName", "ConferenceDisplayName"]]

As you can see in the above test, the ConferenceNormalizedName seems to be made by the concatenation of ConferenceSeriesNormalizedName in lowercase, a space, and the papers' year.

**Note**: in the above subset the ConferenceDisplayName seems to be composed in the same way of ConferenceNormalizedName, but without the lowercase. However, this is not always true!

Now we're going to populate the ConferenceNormalizedName instances that don't have a value.

In [None]:
df_mag_preprocessed.ConferenceNormalizedName.fillna(df_mag_preprocessed.ConferenceSeriesNormalizedName.str.lower() + ' ' + df_mag_preprocessed.Year.astype(str), inplace=True)
df_mag_preprocessed.iloc[:5]

I tried to do a new merge with the Conference Instances dataframe (this time it will be made on the ConferenceNormalizedName column), but I had no luck: these conference instances are missing. That's probably the reason of the NaN values in the ConferenceInstanceID field of the original Papers table.

## Obtaining the Missing Conferences Locations Our Previous Work
The missing conferences locations are going to be obtained from our previous work.

In [47]:
df_place_of_conference = pd.read_csv(path_file_import + 'place_of_conference1.csv', names=['ConferenceNormalizedName', 'ConferenceLocation'], header=0)
df_place_of_conference

Unnamed: 0,ConferenceNormalizedName,ConferenceLocation
0,conf/cscw/2000,"Philadelphia, Pennsylvania, USA"
1,conf/mobicom/2000,"Boston, Massachusetts, USA"
2,conf/crypto/2001,"Santa Barbara, California, USA"
3,conf/wmcsa/1999,"New Orleans, LA, USA"
4,conf/eurocrypt/2005,"Aarhus, Denmark"
...,...,...
29107,conf/egh/2001,
29108,conf/ml4cps/2017,
29109,conf/vlsi/2017socs,
29110,conf/egh/2000,


Drop of the conferences without a location:

In [48]:
# None values are string, so we can't simply use the "fillna" function
df_place_of_conference['ConferenceLocation'].replace('None', np.nan, inplace=True)

df_place_of_conference = df_place_of_conference.dropna(subset = ['ConferenceLocation'])
df_place_of_conference

Unnamed: 0,ConferenceNormalizedName,ConferenceLocation
0,conf/cscw/2000,"Philadelphia, Pennsylvania, USA"
1,conf/mobicom/2000,"Boston, Massachusetts, USA"
2,conf/crypto/2001,"Santa Barbara, California, USA"
3,conf/wmcsa/1999,"New Orleans, LA, USA"
4,conf/eurocrypt/2005,"Aarhus, Denmark"
...,...,...
27166,conf/infoseccd/2014,"Kennesaw, Georgia, USA"
27167,conf/icecsys/2017,"Batumi, Georgia"
27168,conf/async/2017,"San Diego, CA, USA"
27169,conf/icinfa/2017,"Macau, SAR, China"


Conversion of the conference name format:

In [49]:
df_place_of_conference.ConferenceNormalizedName = df_place_of_conference.ConferenceNormalizedName.str.split('/').str[1] + ' ' + df_place_of_conference.ConferenceNormalizedName.str.split('/').str[2]
df_place_of_conference

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_place_of_conference.ConferenceNormalizedName = df_place_of_conference.ConferenceNormalizedName.str.split('/').str[1] + ' ' + df_place_of_conference.ConferenceNormalizedName.str.split('/').str[2]


Unnamed: 0,ConferenceNormalizedName,ConferenceLocation
0,cscw 2000,"Philadelphia, Pennsylvania, USA"
1,mobicom 2000,"Boston, Massachusetts, USA"
2,crypto 2001,"Santa Barbara, California, USA"
3,wmcsa 1999,"New Orleans, LA, USA"
4,eurocrypt 2005,"Aarhus, Denmark"
...,...,...
27166,infoseccd 2014,"Kennesaw, Georgia, USA"
27167,icecsys 2017,"Batumi, Georgia"
27168,async 2017,"San Diego, CA, USA"
27169,icinfa 2017,"Macau, SAR, China"


## Join of the New Location Data with the Original Dataframe

In [51]:
df_mag_with_locations_v1 = pd.merge(df_mag_with_locations_v1, df_place_of_conference, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_with_locations_v1['ConferenceLocation_x'] = df_mag_with_locations_v1['ConferenceLocation_x'].fillna(df_mag_with_locations_v1['ConferenceLocation_y'])
df_mag_with_locations_v1.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_with_locations_v1 = df_mag_with_locations_v1.drop(columns=['ConferenceLocation_y'])

df_mag_with_locations_v1.iloc[:5]

Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation,ConferenceLocation.1
0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014,1131603000.0,4038532.0,12,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX",
1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014,1154039000.0,157008481.0,10,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland",
2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013,1196984000.0,,20,20,ENTER,Information and Communication Technologies in ...,enter 2013,,"Innsbruck, Austria",
3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002,1192665000.0,,0,0,DEXA,Database and Expert Systems Applications,dexa 2002,,"Aix-en-Provence, France",
4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006,1176896000.0,,19,19,ICAISC,International Conference on Artificial Intelli...,icaisc 2006,,"Zakopane, Poland",


Count of how many paper's conference locations are still missing

In [52]:
n_missing = len(df_mag_preprocessed.index) - len(df_mag_preprocessed.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

1888843 missing paper's conference locations


## Write of the Final CSV on Disk

In [None]:
# Write of the resulting CSV on Disk
df_mag_preprocessed.to_csv(path_file_export + 'out_mag_citations_and_locations_2.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_citations_and_locations_2.csv')

Check of the Exported CSV to be sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_and_locations_2.csv', low_memory=False, index_col=[0])
df_mag_exported_csv.drop(df_mag_exported_csv.filter(regex="Unname"), axis=1, inplace=True)
df_mag_exported_csv

Order by citations count descending to see the articles with the most citations

In [None]:
# Order by citations count descending to see the articles with the most citations
df_mag_exported_csv = df_mag_exported_csv.sort_values(by='CitationCount', ascending=False)
df_mag_exported_csv.iloc[:5]