# Location Web Scraping for the Microsoft Academic Graph (MAG) Dataset

Jupyter Notebook for scraping the conference locations that are missing from the Microsoft Academic Graph (MAG) dump.

This process requires the following CSV file: ```out_mag_citations_and_locations.csv```.
The file must be generated by running the ```1 - mag_fix_locations_from_raw_dblp_dump.ipynb``` Notebook, which is in the same folder as this Notebook.

In particular, the notebook performs the following operations:
* Opening the preprocessed CSV dump
* Obtaining the missing locations by querying the DBLP website
* Fixing the location format

Finally, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import platform
import multiprocessing as mp 
import concurrent.futures  # the submodule must be imported explicitly
from location_scraper_multithread_utils import * 

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Multithreading Settings
Settings for the multithreaded queries that gather the missing conference locations from the DBLP website.

Please specify the max number of workers below:

**Important Note**: During our tests, we found that DBLP refuses incoming connections if requests are made too frequently. You can read more about the DBLP server rate limit [here](https://dblp.org/faq/1474706.html).

We suggest **setting the number of workers to 1 if you have a high-bandwidth connection** (over 100 Mbps). Otherwise, you can try a higher value to make requests in parallel.

In [3]:
MAX_WORKERS = 1

You can also set a **sleep delay** (in seconds) between requests if a single worker is not enough to avoid refused connections:

In [4]:
SLEEP_DELAY = 0.3 # Seconds
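
As a rough illustration of how a delay and a retry policy can work together, here is a minimal sketch. It is not the code in `location_scraper_multithread_utils` (which is not shown here); `call_with_backoff`, `fetch`, and `max_retries` are hypothetical names, and `fetch` stands for any callable that raises when a request is refused.

```python
import time


def call_with_backoff(fetch, url, sleep_delay=0.3, max_retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and backing off on failure.

    fetch is any callable that returns a result or raises an exception
    when the request fails. All names here are illustrative assumptions.
    """
    delay = sleep_delay
    for attempt in range(max_retries + 1):
        sleep(delay)  # pause before every attempt
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # give up after the last retry
            delay *= 2  # wait twice as long before the next attempt
```

The `sleep` parameter is injectable only so the pause can be replaced in tests; in normal use the default `time.sleep` applies.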

Operating-system-specific setting for the multiprocessing start method.

**Note**: On macOS we use the spawn method instead of fork because of the security measures in recent macOS releases. Windows supports only spawn, so it takes the same branch.

In [5]:
print(f"Notebook running on {platform.system()} OS: ")

if platform.system() == "Darwin" or platform.system() == "Windows": # MacOS and windows
    mp_ctx = mp.get_context("spawn")
    print("Spawn method has been set")
    
else: # other unix systems
    mp_ctx = mp.get_context("fork")
    print("Fork method has been set")

Notebook running on Darwin OS: 
Spawn method has been set


## Reading the Preprocessed CSV Dump

In [6]:
df_mag_preprocessed = pd.read_csv(path_file_export + 'out_mag_citations_and_locations.csv', low_memory=False)
df_mag_preprocessed

Unnamed: 0.1,Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
3,3,0,"Provence, France",dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002
4,4,19,"Zakopane, Poland",icaisc 2006,International Conference on Artificial Intelli...,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,leukemia prediction from gene expression data ...,2006
...,...,...,...,...,...,...,...,...,...,...,...
4409807,4409807,0,Singapore,iecon 2020,Conference of the Industrial Electronics Society,IECON,10.1109/IECON43393.2020.9254316,0,Loss Reduction by Synchronous Rectification in...,loss reduction by synchronous rectification in...,2020
4409808,4409808,0,"Paris, France",bmsb 2020,International Symposium on Broadband Multimedi...,BMSB,10.1109/BMSB49480.2020.9379806,0,Data Over Cable Services – Improving the BICM ...,data over cable services improving the bicm ca...,2020
4409809,4409809,0,,acc 1988,American Control Conference,ACC,10.1109/ACC.1988.4172843,0,Model Reference Robust Adaptive Control withou...,model reference robust adaptive control withou...,1988
4409810,4409810,0,"Orlando, Florida, USA",icassp 2002,"International Conference on Acoustics, Speech,...",ICASSP,10.1109/ICASSP.2002.1005676,0,Missing data speech recognition in reverberant...,missing data speech recognition in reverberant...,2002


## Obtaining the Missing Conference Locations from the DBLP Website
The missing conference locations are obtained by querying the DBLP website.

In [7]:
df_mag_conferences = df_mag_preprocessed[["ConferenceNormalizedName", "ConferenceLocation"]]

Drop the papers that already have a location and therefore don't need to be fixed.

In [8]:
df_mag_conferences = df_mag_conferences[df_mag_conferences["ConferenceLocation"].isna()]
df_mag_conferences

Unnamed: 0,ConferenceNormalizedName,ConferenceLocation
10,acc 1990,
13,asilomar 1991,
23,ire 1964,
27,ecml 1994,
39,acc 1986,
...,...,...
4409793,icieam 2017,
4409796,dueu 2018,
4409803,ra 2004,
4409806,fnces 2012,


Drop the duplicated conferences. We only need unique values.

In [9]:
df_mag_conferences = df_mag_conferences.drop_duplicates(subset="ConferenceNormalizedName")

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

Now we only need to search for the location of 14911 unique conferences


### Definition of the Web Scraping Function
We scrape two different URL formats, so there are two web scraping phases. Each phase passes its own download function to the scraper as a parameter.

In [10]:
def dblp_location_scraper(conferences_dataframe, mt_downloader_operation_function, dblp_url = "https://dblp.org/db/conf/"):
    dict_conf_locations = {}      
    download_list = list(conferences_dataframe.ConferenceNormalizedName.values)

    # The context manager shuts the pool down once all tasks have finished
    with concurrent.futures.ProcessPoolExecutor(max_workers=int(MAX_WORKERS), mp_context=mp_ctx) as executor:
        # Map each future to its conference name so that failures can be reported
        futures = {executor.submit(mt_downloader_operation_function, conf_name, dblp_url, SLEEP_DELAY): conf_name for conf_name in download_list}

        for future in concurrent.futures.as_completed(futures):
            try:
                k, v = future.result()
            except Exception as e:
                print(f"{futures[future]} throws {e}")
            else:
                dict_conf_locations[k] = v

    # Converting the resulting dictionary to a dataframe
    df_conf_locations = pd.DataFrame(dict_conf_locations.items(), columns=['ConferenceNormalizedName', 'ConferenceLocation'])

    return df_conf_locations

### Web Scraping Phase 1

#### Queries to https://dblp.org/db/conf/CONF_NAME/index.html

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than three hours, depending on your Internet speed.

In [11]:
df_conf_locations_v1 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v1, "https://dblp.org/db/conf/")

https://dblp.org/db/conf/acc/index.html - Year 1990: None
https://dblp.org/db/conf/ecml/index.html - Year 1994: <h2 id="1994">7th ECML 1994: Catania, Italy</h2>
https://dblp.org/db/conf/acc/index.html - Year 1986: None
https://dblp.org/db/conf/acc/index.html - Year 1994: None
https://dblp.org/db/conf/embc/index.html - Year 2003: None
https://dblp.org/db/conf/wcc/index.html - Year 2006: None
https://dblp.org/db/conf/iscas/index.html - Year 1998: <h2 id="1998">ISCAS 1998: Monterey, CA, USA</h2>
https://dblp.org/db/conf/emnets/index.html - Year 2005: <h2 id="2005">EmNets 2005: Sydney, Australia</h2>
https://dblp.org/db/conf/embc/index.html - Year 2000: None
https://dblp.org/db/conf/ecoop/index.html - Year 1997: <h2 id="1997">11th ECOOP 1997: Jyväskylä, Finland</h2>
https://dblp.org/db/conf/icee/index.html - Year 2012: None
https://dblp.org/db/conf/ecoc/index.html - Year 2007: None
https://dblp.org/db/conf/wdag/index.html - Year 1997: <h2 id="1997">11th WDAG 1997: Saarbrücken, Germany</h2>
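
The log lines above suggest that each conference name is split into a series and a year, and that the location is read from the `<h2>` heading whose `id` matches the year. The actual helper `mt_get_mag_conf_location_from_dblp_operation_v1` is not shown in this notebook, so the following is only a sketch of that parsing step under this assumption; `split_conference_name` and `extract_location` are hypothetical names, and real DBLP headings may contain nested markup that this regular expression does not handle.

```python
import re


def split_conference_name(conf_name):
    """Split a normalized name such as 'ecml 1994' into ('ecml', '1994')."""
    series, year = conf_name.rsplit(" ", 1)
    return series, year


def extract_location(index_html, year):
    """Return the location from the <h2 id="YEAR"> heading, or None if absent."""
    pattern = rf'<h2 id="{re.escape(year)}">(.*?)</h2>'
    match = re.search(pattern, index_html, flags=re.S)
    if match is None:
        return None
    heading = match.group(1)
    # Headings in the log look like '7th ECML 1994: Catania, Italy':
    # everything after the first ': ' is treated as the location.
    _, sep, location = heading.partition(": ")
    return location.strip() if sep else None


sample_html = '<h2 id="1994">7th ECML 1994: Catania, Italy</h2>'
series, year = split_conference_name("ecml 1994")
print(series, year, extract_location(sample_html, year))
```

A year without a matching heading yields `None`, which is consistent with the `None` entries in the log.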

Let's see how many conference locations have been fixed.

In [12]:
df_conf_locations_v1 = df_conf_locations_v1.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations_v1.index)} over {len(df_mag_conferences.index)} unique conferences")

Fixed 1029 over 14911 unique conferences


Write the fixed locations to disk:

In [13]:
df_conf_locations_v1.to_csv(path_file_export + 'out_mag_locations_fixed_v1.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v1.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_locations_fixed_v1.csv


### Web Scraping Phase 2

#### Queries to https://dblp.org/db/conf/CONF_NAME/CONF_NAMEYEAR.html

Parallel execution of the queries to the DBLP website.
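
The URL pattern in the heading above can be sketched as follows. This is an assumption about how the per-year URL is assembled, inferred from the logged URLs further down; the real helper `mt_get_mag_conf_location_from_dblp_operation_v2` is not shown, and `build_phase2_url` is a hypothetical name.

```python
def build_phase2_url(conf_name, dblp_url="https://dblp.org/db/conf/"):
    """Build the per-year DBLP URL from a normalized name such as 'acc 1990'."""
    # The last space separates the series name from the year.
    series, year = conf_name.rsplit(" ", 1)
    return f"{dblp_url}{series}/{series}{year}.html"


print(build_phase2_url("acc 1990"))
print(build_phase2_url("approx/random 2014"))
```

A series name that contains a slash, such as `approx/random`, ends up as extra path segments in the URL, which may be one reason why so few lookups succeed in this phase.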

**Note**: this operation should take less than three hours, depending on your Internet speed.

**IMPORTANT**: our tests didn't give us good results: out of roughly 14k distinct conferences, we managed to recover only 3 of them! You can skip this part of the code by setting the following variable to `False`:

In [None]:
download_anyway = True

First, we filter out the conferences whose location has already been obtained:

In [14]:
rows_to_drop = df_mag_conferences["ConferenceNormalizedName"].isin(df_conf_locations_v1["ConferenceNormalizedName"])
# Keep only the rows that still need a location; the explicit copy avoids pandas' SettingWithCopyWarning
df_mag_conferences = df_mag_conferences[~rows_to_drop].copy()

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

Now we only need to search for the location of 13882 unique conferences


In [15]:
if download_anyway:
    df_conf_locations_v2 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v2, "https://dblp.org/db/conf/")

https://dblp.org/db/conf/acc/acc1990.html
https://dblp.org/db/conf/asilomar/asilomar1991.html
https://dblp.org/db/conf/ire/ire1964.html
https://dblp.org/db/conf/acc/acc1986.html
https://dblp.org/db/conf/acc/acc1994.html
https://dblp.org/db/conf/embc/embc2003.html
https://dblp.org/db/conf/wcc/wcc2006.html
https://dblp.org/db/conf/caol/caol2008.html
https://dblp.org/db/conf/lasers/lasers2005.html
https://dblp.org/db/conf/casa/casa2011.html
https://dblp.org/db/conf/isdc/isdc2009.html
https://dblp.org/db/conf/euro-par/euro-par1999.html
https://dblp.org/db/conf/fitme/fitme2010.html
https://dblp.org/db/conf/pesc/pesc2002.html
https://dblp.org/db/conf/lasers/lasers2003.html
https://dblp.org/db/conf/approx/random/approx/random2014.html
https://dblp.org/db/conf/embc/embc2000.html
https://dblp.org/db/conf/itme/itme2009.html
https://dblp.org/db/conf/iccme/iccme2009.html
https://dblp.org/db/conf/mems/mems2004.html
https://dblp.org/db/conf/icee/icee2012.html
https://dblp.org/db/conf/ecoc/ecoc2007.h

Let's see how many conference locations have been fixed.

In [16]:
if download_anyway:
    df_conf_locations_v2 = df_conf_locations_v2.dropna(subset = ['ConferenceLocation'])
    print(f"Fixed {len(df_conf_locations_v2.index)} over {len(df_mag_conferences.index)} unique conferences")

Fixed 3 over 13882 unique conferences


Write the fixed locations to disk:

In [17]:
if download_anyway:
    df_conf_locations_v2.to_csv(path_file_export + 'out_mag_locations_fixed_v2.csv')
    print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v2.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_locations_fixed_v2.csv


## Joining the New Location Data with the Original DataFrame

In [18]:
# Merge with the first location dataframe
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v1, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])

if download_anyway:
    
    # Merge with the second location dataframe
    df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v2, on=['ConferenceNormalizedName'], how='left')

    # Combine the two columns
    df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
    df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
    df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])


df_mag_preprocessed.iloc[:3]

Unnamed: 0.1,Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
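
The merge-and-fill pattern used above is applied twice with identical steps. As a sketch, it could be wrapped in a small helper and checked on toy data; `fill_missing_locations` and the `_fixed` suffix are hypothetical names, and the helper assumes that the lookup DataFrame holds at most one row per `ConferenceNormalizedName` (otherwise the left merge would duplicate papers).

```python
import pandas as pd


def fill_missing_locations(df_papers, df_fixed):
    """Fill NaN ConferenceLocation values in df_papers using df_fixed."""
    # Left merge keeps every paper; the scraped column gets the '_fixed' suffix.
    merged = pd.merge(df_papers, df_fixed, on="ConferenceNormalizedName",
                      how="left", suffixes=("", "_fixed"))
    # Existing locations win; scraped ones are used only where a value is missing.
    merged["ConferenceLocation"] = merged["ConferenceLocation"].fillna(
        merged["ConferenceLocation_fixed"])
    return merged.drop(columns=["ConferenceLocation_fixed"])


papers = pd.DataFrame({
    "ConferenceNormalizedName": ["disc 2014", "ecml 1994", "acc 1990"],
    "ConferenceLocation": ["Austin, TX", None, None],
})
fixed = pd.DataFrame({
    "ConferenceNormalizedName": ["ecml 1994"],
    "ConferenceLocation": ["Catania, Italy"],
})
result = fill_missing_locations(papers, fixed)
print(result)
```

In this toy data, `ecml 1994` receives the scraped location, while `acc 1990` stays empty because no location was found for it.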


Count how many papers are still missing a conference location:

In [19]:
n_missing = len(df_mag_preprocessed.index) - len(df_mag_preprocessed.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

1619628 missing paper's conference locations


## Writing the Final CSV to Disk

In [20]:
# Write the resulting CSV to disk
df_mag_preprocessed.to_csv(path_file_export + 'out_mag_citations_and_locations_final.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_citations_and_locations_final.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_citations_and_locations_final.csv


Check the exported CSV to make sure that everything went fine.

In [21]:
# Check the exported CSV
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_and_locations_final.csv', low_memory=False, index_col=[0])
# Drop the leftover "Unnamed" index columns carried over from earlier exports
df_mag_exported_csv.drop(columns=df_mag_exported_csv.filter(regex="Unname").columns, inplace=True)
df_mag_exported_csv

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
3,0,"Provence, France",dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002
4,19,"Zakopane, Poland",icaisc 2006,International Conference on Artificial Intelli...,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,leukemia prediction from gene expression data ...,2006
...,...,...,...,...,...,...,...,...,...,...
4409807,0,Singapore,iecon 2020,Conference of the Industrial Electronics Society,IECON,10.1109/IECON43393.2020.9254316,0,Loss Reduction by Synchronous Rectification in...,loss reduction by synchronous rectification in...,2020
4409808,0,"Paris, France",bmsb 2020,International Symposium on Broadband Multimedi...,BMSB,10.1109/BMSB49480.2020.9379806,0,Data Over Cable Services – Improving the BICM ...,data over cable services improving the bicm ca...,2020
4409809,0,,acc 1988,American Control Conference,ACC,10.1109/ACC.1988.4172843,0,Model Reference Robust Adaptive Control withou...,model reference robust adaptive control withou...,1988
4409810,0,"Orlando, Florida, USA",icassp 2002,"International Conference on Acoustics, Speech,...",ICASSP,10.1109/ICASSP.2002.1005676,0,Missing data speech recognition in reverberant...,missing data speech recognition in reverberant...,2002


Sort by citation count in descending order to see the most-cited papers.

In [26]:
# Sort by citation count, descending
df_mag_exported_csv = df_mag_exported_csv.sort_values(by='CitationCount', ascending=False)
df_mag_exported_csv.iloc[:5]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
4392494,62329,"Las Vegas, Nevada, USA",cvpr 2016,Computer Vision and Pattern Recognition,CVPR,10.1109/CVPR.2016.90,75544,Deep Residual Learning for Image Recognition,deep residual learning for image recognition,2016
176794,26215,Singapore,icon 2002,International Conference on Networks,ICON,10.1109/ICNN.1995.488968,49377,Particle swarm optimization,particle swarm optimization,2002
562266,23180,"San Diego, CA, USA",cvpr 2005,Computer Vision and Pattern Recognition,CVPR,10.1109/CVPR.2005.177,36647,Histograms of oriented gradients for human det...,histograms of oriented gradients for human det...,2005
3702319,22980,"Miami Beach, Florida",cvpr 2009,Computer Vision and Pattern Recognition,CVPR,10.1109/CVPR.2009.5206848,28822,ImageNet: A large-scale hierarchical image dat...,imagenet a large scale hierarchical image data...,2009
4005340,20853,Munich,miccai 2015,Medical Image Computing and Computer-Assisted ...,MICCAI,10.1007/978-3-319-24574-4_28,26844,U-Net: Convolutional Networks for Biomedical I...,u net convolutional networks for biomedical im...,2015
