# Location Web Scraping of the Microsoft Academic Graph (MAG) Dataset

Jupyter Notebook that scrapes the conference locations for the Microsoft Academic Graph (MAG) dump.

This process requires the following CSV file: ```out_mag_citations_count_and_conferences.csv```.
The file must be generated by running the ```preprocess_mag.ipynb``` Notebook contained in the ```1 - Citation Dumps Preprocess``` folder of this repository.

The notebook performs the following operations:
* Opening the preprocessed CSV dump
* Fixing the conference names
* Obtaining the missing locations by querying the DBLP website
* Fixing the location format

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import platform
import multiprocessing as mp 
import concurrent.futures  # the futures submodule must be imported explicitly
from location_scraper_multithread_utils import * 

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Multithreading Settings
Settings for the parallel queries that gather the missing conference locations from the DBLP website.

Please specify the max number of workers below:

**Important Note**: during our tests we found that DBLP refuses incoming connections if requests are made too frequently. You can read more about the DBLP server rate limit [here](https://dblp.org/faq/1474706.html).

We suggest **setting the number of workers to 1 if you have a high-bandwidth connection** (over 100 Mbps). Otherwise, you can try a higher value to make requests in parallel.

In [3]:
MAX_WORKERS = 1
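
As a complement to limiting the number of workers, a download can also wait and retry when the server signals that requests are arriving too fast. The sketch below is not part of the original utilities: `fetch_with_backoff` and its `fetch` hook are hypothetical names, and it assumes that rate limiting is reported as an HTTP 429 response, optionally with a `Retry-After` header.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, fetch=None, max_retries=4, base_delay=2.0, sleep=time.sleep):
    """Fetch `url`, waiting and retrying when the server answers HTTP 429.

    `fetch` is a hypothetical hook: any callable that returns the page body
    or raises urllib.error.HTTPError. It defaults to a plain urllib download.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as response:
                return response.read().decode("utf-8", errors="replace")

    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Only rate-limit responses are retried; other errors propagate.
            if err.code != 429 or attempt == max_retries:
                raise
            # Honour Retry-After when present, otherwise back off exponentially.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

Passing `sleep` and `fetch` as parameters keeps the retry logic testable without touching the network.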

Operating-system-specific setting for the multiprocessing start method.

**Note**: Due to the security measures in recent macOS releases, we need to use the spawn method instead of fork. Windows does not provide fork at all, so it uses spawn as well.

In [4]:
print(f"Notebook running on {platform.system()} OS: ")

if platform.system() in ("Darwin", "Windows"): # macOS and Windows
    mp_ctx = mp.get_context("spawn")
    print("Spawn method has been set")

else: # other Unix systems
    mp_ctx = mp.get_context("fork")
    print("Fork method has been set")

Notebook running on Darwin OS: 
Spawn method has been set


## Read of the CSV Preprocessed Dump

In [None]:
df_mag_preprocessed = pd.read_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv', low_memory=False)
df_mag_preprocessed

## Fix of the Missing Conferences Names
Some papers only indicate the conference series. As a result, their conference instance and the related conference location have no value.

However, every paper is published in a specific "instance" of a conference, hence it should have a location. These papers are "fixed" by combining their conference series with their year of publication.

In [None]:
df_mag_preprocessed_subset = df_mag_preprocessed.iloc[:50]
df_mag_preprocessed_subset = df_mag_preprocessed_subset.dropna(subset = ['ConferenceNormalizedName'])
df_mag_preprocessed_subset.iloc[:10][["Year", "ConferenceSeriesNormalizedName", "ConferenceNormalizedName", "ConferenceDisplayName"]]

As the sample above shows, ConferenceNormalizedName appears to be the concatenation of the lowercase ConferenceSeriesNormalizedName, a space, and the paper's year.

**Note**: in the above subset the ConferenceDisplayName seems to be composed in the same way as ConferenceNormalizedName, but without the lowercasing. However, this is not always true!

Now we populate the ConferenceNormalizedName values that are missing.

In [None]:
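# Assign the result back: an in-place fillna on a column accessed this way is not guaranteed to update the dataframe
df_mag_preprocessed["ConferenceNormalizedName"] = df_mag_preprocessed["ConferenceNormalizedName"].fillna(df_mag_preprocessed["ConferenceSeriesNormalizedName"].str.lower() + ' ' + df_mag_preprocessed["Year"].astype(str))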
df_mag_preprocessed.iloc[:5]
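
The fill step can be illustrated on a tiny example. The rows below are made up for illustration (only the column names come from the notebook), and the sketch assumes that `Year` is stored as an integer.

```python
import pandas as pd

# Toy data (made up for illustration): the second row lacks the instance name.
df_toy = pd.DataFrame({
    "Year": [2013, 2002],
    "ConferenceSeriesNormalizedName": ["ENTER", "DEXA"],
    "ConferenceNormalizedName": ["enter 2013", None],
})

# Candidate name: lowercase series name, a space, and the publication year.
candidate_names = (
    df_toy["ConferenceSeriesNormalizedName"].str.lower()
    + " "
    + df_toy["Year"].astype(str)
)

# Only the rows where the instance name is missing receive the candidate.
df_toy["ConferenceNormalizedName"] = df_toy["ConferenceNormalizedName"].fillna(candidate_names)

print(df_toy["ConferenceNormalizedName"].tolist())
# ['enter 2013', 'dexa 2002']
```

If `Year` were loaded as a float column, `astype(str)` would yield strings such as `2002.0`, so it is worth checking the column dtype before relying on the generated names.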

I tried a new merge with the Conference Instances dataframe (this time on the ConferenceNormalizedName column), but without success: these conference instances are missing. That is probably the reason for the NaN values in the ConferenceInstanceID field of the original Papers table.

## Obtaining the Missing Conferences Locations from the DBLP Website
The missing conference locations are obtained by querying the DBLP website.

In [None]:
df_mag_conferences = df_mag_preprocessed[["ConferenceNormalizedName", "ConferenceLocation"]]

In [5]:
## TODO TEST
df_mag_conferences = pd.read_csv(path_file_export + 'out_mag_tmp_locations_subset.csv', low_memory=False)

Drop the papers whose location does not need to be fixed.

In [6]:
df_mag_conferences = df_mag_conferences[df_mag_conferences["ConferenceLocation"].isna()]
df_mag_conferences

| Unnamed: 0.1 | Unnamed: 0 | PaperID | ConferenceNormalizedName | ConferenceLocation |
| --- | --- | --- | --- | --- |
| 2 | 2 | 24327294 | enter 2013 | |
| 3 | 3 | 60437532 | dexa 2002 | |
| 4 | 4 | 198056957 | icaisc 2006 | |
| 5 | 5 | 206983337 | interact 2011 | |
| 6 | 6 | 1498014147 | fct 2005 | |
| ... | ... | ... | ... | ... |
| 4409807 | 4409811 | 3102242761 | iecon 2020 | |
| 4409808 | 4409812 | 3136855299 | bmsb 2020 | |
| 4409809 | 4409813 | 3145351916 | acc 1988 | |
| 4409810 | 4409814 | 3151696876 | icassp 2002 | |


Drop the duplicated conferences: we only need unique values.

In [7]:
df_mag_conferences = df_mag_conferences.drop_duplicates(subset="ConferenceNormalizedName")

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

Now we only need to search for the location of 29512 unique conferences


### Definition of the Web Scraping Function
The scraping targets two different URL formats, hence the need for two scraping phases, each with its own operation function passed as a parameter.

In [8]:
def dblp_location_scraper(conferences_dataframe, mt_downloader_operation_function, dblp_url = "https://dblp.org/db/conf/"):
    dict_conf_locations = {}
    download_list = list(conferences_dataframe.ConferenceNormalizedName.values)

    # The context manager shuts the pool down once all the queries have completed
    with concurrent.futures.ProcessPoolExecutor(max_workers=int(MAX_WORKERS), mp_context=mp_ctx) as executor:
        # Map each future to its conference name, so that failures can be reported by name
        futures = {executor.submit(mt_downloader_operation_function, conf_name, dblp_url): conf_name for conf_name in download_list}

        for future in concurrent.futures.as_completed(futures):
            try:
                k, v = future.result()
            except Exception as e:
                print(f"{futures[future]} throws {e}")
            else:
                dict_conf_locations[k] = v

    # Converting the resulting dictionary to a dataframe
    df_conf_locations = pd.DataFrame(list(dict_conf_locations.items()), columns=['ConferenceNormalizedName', 'ConferenceLocation'])

    return df_conf_locations
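
The operation functions themselves live in `location_scraper_multithread_utils` and are not shown in this notebook. As a rough illustration of the two URL formats used in Phase 1 and Phase 2, the following sketch builds both URLs from a normalized conference name. `build_dblp_urls` is a hypothetical helper, and it assumes that the year is the last space-separated token of the name.

```python
def build_dblp_urls(conf_normalized_name, dblp_url="https://dblp.org/db/conf/"):
    """Return the (index page, per-year page) URLs for a name such as 'enter 2013'."""
    # Assumption: the year is the last space-separated token of the name.
    series_name, year = conf_normalized_name.rsplit(" ", 1)
    index_url = f"{dblp_url}{series_name}/index.html"                 # Phase 1 format
    yearly_url = f"{dblp_url}{series_name}/{series_name}{year}.html"  # Phase 2 format
    return index_url, yearly_url


index_url, yearly_url = build_dblp_urls("enter 2013")
print(index_url)   # https://dblp.org/db/conf/enter/index.html
print(yearly_url)  # https://dblp.org/db/conf/enter/enter2013.html
```

Passing `"https://dblp.org/db/series/"` as `dblp_url` gives the corresponding `series` variants of both formats.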

### Web Scraping Phase 1

#### Queries to https://dblp.org/db/conf/CONF_NAME/index.html

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than 1h, depending on your Internet speed.

In [11]:
# TODO TEST
df_mag_conferences_v1_1_subset = df_mag_conferences.iloc[0:3000]
df_conf_locations_v1_1 = dblp_location_scraper(df_mag_conferences_v1_1_subset, mt_get_mag_conf_location_from_dblp_operation_v1, "https://dblp.org/db/conf/")

https://dblp.org/db/conf/enter/index.html - Year 2013: <h2 id="2013">ENTER 2013: Innsbruck, Austria</h2>
https://dblp.org/db/conf/dexa/index.html - Year 2002: <h2 id="2002">13th DEXA 2002: Aix-en-Provence, France</h2>
https://dblp.org/db/conf/icaisc/index.html - Year 2006: <h2 id="2006">8. ICAISC 2006: Zakopane, Poland</h2>
https://dblp.org/db/conf/interact/index.html - Year 2011: <h2 id="2011">INTERACT 2011: Lisbon, Portugal</h2>
https://dblp.org/db/conf/fct/index.html - Year 2005: <h2 id="2005">15th FCT 2005: Lübeck, Germany</h2>
https://dblp.org/db/conf/icdcit/index.html - Year 2006: <h2 id="2006">3rd ICDCIT 2006: Bhubaneswar, India</h2>
https://dblp.org/db/conf/acc/index.html - Year 1990: None
https://dblp.org/db/conf/safecomp/index.html - Year 2002: <h2 id="2002">21st SAFECOMP 2002: Catania, Italy</h2>
https://dblp.org/db/conf/haid/index.html - Year 2006: <h2 id="2006">HAID 2006: Glasgow, UK</h2>
https://dblp.org/db/conf/tsd/index.html - Year 2002: <h2 id="2002">5th TSD 2002: Brno
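
The log lines above show that, on the index page, the location follows the colon inside the `<h2>` header whose `id` is the conference year. A minimal sketch of that extraction is given below. `extract_location` is a hypothetical name, and the real helper in `location_scraper_multithread_utils` may work differently (for example, with an HTML parser rather than a regular expression).

```python
import re


def extract_location(index_html, year):
    """Return the text after the colon in the <h2 id="YEAR"> header, or None."""
    pattern = rf'<h2 id="{re.escape(str(year))}">(.*?)</h2>'
    match = re.search(pattern, index_html, flags=re.DOTALL)
    if match is None:
        # No header for that year, as in the "acc 1990" log line.
        return None
    header_text = match.group(1)
    # Headers look like "13th DEXA 2002: Aix-en-Provence, France".
    _, separator, location = header_text.partition(":")
    return location.strip() if separator else None


sample_html = '<h2 id="2002">13th DEXA 2002: Aix-en-Provence, France</h2>'
print(extract_location(sample_html, 2002))  # Aix-en-Provence, France
print(extract_location(sample_html, 1990))  # None
```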

In [None]:
df_conf_locations_v1_1 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v1, "https://dblp.org/db/conf/")

Let's see how many conference locations have been fixed.

In [12]:
df_conf_locations_v1_1 = df_conf_locations_v1_1.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations_v1_1.index)} over {len(df_mag_conferences.index)} unique conferences")

Fixed 1184 over 29512 unique conferences


Write the fixed locations to disk:

In [None]:
df_conf_locations_v1_1.to_csv(path_file_export + 'out_mag_locations_fixed_v1_1.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v1_1.csv')

#### Queries to https://dblp.org/db/series/CONF_NAME/index.html

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than 1h, depending on your Internet speed.

First, we filter out the conferences whose location has already been obtained:

In [None]:
rows_to_drop = df_mag_conferences["ConferenceNormalizedName"].isin(df_conf_locations_v1_1["ConferenceNormalizedName"])
df_mag_conferences.drop(df_mag_conferences[rows_to_drop].index, inplace=True)

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

In [None]:
df_conf_locations_v1_2 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v1, "https://dblp.org/db/series/")

Let's see how many conference locations have been fixed.

In [None]:
df_conf_locations_v1_2 = df_conf_locations_v1_2.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations_v1_2.index)} over {len(df_mag_conferences.index)} unique conferences")

Write the fixed locations to disk:

In [None]:
df_conf_locations_v1_2.to_csv(path_file_export + 'out_mag_locations_fixed_v1_2.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v1_2.csv')

### Web Scraping Phase 2

#### Queries to https://dblp.org/db/conf/CONF_NAME/CONF_NAMEYEAR.html

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than 1h, depending on your Internet speed.

First, we filter out the conferences whose location has already been obtained:

In [None]:
rows_to_drop = df_mag_conferences["ConferenceNormalizedName"].isin(df_conf_locations_v1_2["ConferenceNormalizedName"])
df_mag_conferences.drop(df_mag_conferences[rows_to_drop].index, inplace=True)

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

In [None]:
df_conf_locations_v2_1 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v2, "https://dblp.org/db/conf/")

Let's see how many conference locations have been fixed.

In [None]:
df_conf_locations_v2_1 = df_conf_locations_v2_1.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations_v2_1.index)} over {len(df_mag_conferences.index)} unique conferences")

Write the fixed locations to disk:

In [None]:
df_conf_locations_v2_1.to_csv(path_file_export + 'out_mag_locations_fixed_v2_1.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v2_1.csv')

#### Queries to https://dblp.org/db/series/CONF_NAME/CONF_NAMEYEAR.html

We're going to try to get more locations by composing the URL in a different way.

**Note**: this operation should take less than 1h, depending on your Internet speed.

**Note**: in my tests, this method gave no results. I decided to leave the original code in place, in case something changes on the DBLP website. You can execute the download anyway, if you want, by editing the following value.

In [None]:
download_anyway = True

First, we filter out the conferences whose location has already been obtained:

In [None]:
rows_to_drop = df_mag_conferences["ConferenceNormalizedName"].isin(df_conf_locations_v2_1["ConferenceNormalizedName"])
df_mag_conferences.drop(df_mag_conferences[rows_to_drop].index, inplace=True)

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

In [None]:
if download_anyway:
    df_conf_locations_v2_2 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v2, "https://dblp.org/db/series/")

Let's see how many conference locations have been fixed.

In [None]:
if download_anyway:
    df_conf_locations_v2_2 = df_conf_locations_v2_2.dropna(subset = ['ConferenceLocation'])

    print(f"Fixed {len(df_conf_locations_v2_2.index)} over {len(df_mag_conferences.index)} unique conferences")

Write the fixed locations to disk:

In [None]:
if download_anyway:
    df_conf_locations_v2_2.to_csv(path_file_export + 'out_mag_locations_fixed_v2_2.csv')
    print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v2_2.csv')

## Join of the New Location Data with the Original Dataframe

In [None]:
# Merge with the first location dataframe
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v1_1, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])


# Merge with the second location dataframe
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v1_2, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])


# Merge with the third location dataframe
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v2_1, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])


if download_anyway:
    # Merge with the fourth  location dataframe
    df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v2_2, on=['ConferenceNormalizedName'], how='left')

    # Combine the two columns
    df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
    df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
    df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])

df_mag_preprocessed.iloc[:5]
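
The merge, fill, rename and drop sequence above is repeated once per location dataframe. One possible way to express it in a single place is sketched below. `fill_missing_locations` is a hypothetical helper that is not part of the notebook, and the toy rows reuse values from the scraping log shown earlier.

```python
import pandas as pd


def fill_missing_locations(df_papers, df_new_locations):
    """Fill NaN ConferenceLocation values in df_papers from df_new_locations."""
    merged = pd.merge(
        df_papers,
        df_new_locations[["ConferenceNormalizedName", "ConferenceLocation"]],
        on="ConferenceNormalizedName",
        how="left",
        suffixes=("", "_new"),
    )
    # Existing locations win; scraped ones only fill the gaps.
    merged["ConferenceLocation"] = merged["ConferenceLocation"].fillna(
        merged["ConferenceLocation_new"]
    )
    return merged.drop(columns=["ConferenceLocation_new"])


# Toy data for illustration.
df_papers = pd.DataFrame({
    "ConferenceNormalizedName": ["enter 2013", "dexa 2002", "acc 1990"],
    "ConferenceLocation": ["Innsbruck, Austria", None, None],
})
df_scraped = pd.DataFrame({
    "ConferenceNormalizedName": ["dexa 2002"],
    "ConferenceLocation": ["Aix-en-Provence, France"],
})

df_filled = fill_missing_locations(df_papers, df_scraped)
print(df_filled)
```

With such a helper, the cell above could loop over the location dataframes and call it once per dataframe. The `suffixes=("", "_new")` argument keeps the original column name, which avoids the separate rename step.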

Count how many papers are still missing a conference location.

In [None]:
n_missing = df_mag_preprocessed['ConferenceLocation'].isna().sum()
print(f"{n_missing} papers are still missing a conference location")

## Write of the Final CSV on Disk

In [None]:
# Write of the resulting CSV on Disk
df_mag_preprocessed.to_csv(path_file_export + 'out_mag_citations_and_locations.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_citations_and_locations.csv')

Check the exported CSV to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_and_locations.csv', low_memory=False)
df_mag_exported_csv
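
When checking the exported file, note that `to_csv` writes the dataframe index by default, and `read_csv` then loads it back as an extra `Unnamed: 0` column. This is the usual origin of columns such as `Unnamed: 0` and `Unnamed: 0.1` in a reloaded dump. The sketch below shows the effect on an in-memory toy dataframe. Whether to pass `index=False` in this pipeline is a design choice that depends on what downstream notebooks expect.

```python
import io

import pandas as pd

df_small = pd.DataFrame({"PaperID": [24327294], "ConferenceNormalizedName": ["enter 2013"]})

# Default export: the index is written as an unnamed first column.
buffer_with_index = io.StringIO()
df_small.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()
print(cols_with_index)
# ['Unnamed: 0', 'PaperID', 'ConferenceNormalizedName']

# Export without the index: the columns round-trip unchanged.
buffer_without_index = io.StringIO()
df_small.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()
print(cols_without_index)
# ['PaperID', 'ConferenceNormalizedName']
```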

Sort by citation count in descending order to see the most-cited articles.

In [None]:
# Sort by citation count in descending order to see the most-cited articles
df_mag_exported_csv = df_mag_exported_csv.sort_values(by='CitationCount', ascending=False)
df_mag_exported_csv.iloc[:5]