
# Location Web Scraping of DBLP Dataset

Jupyter Notebook that scrapes the conference locations missing from the DBLP dump.

This process requires the following CSV file: ```out_dblp_papers_and_locations.csv```. <br>
The file is generated by running the ```1 - dblp_add_locations_from_raw_dblp_dump.ipynb``` Notebook, which is contained in the same folder as this Notebook.

In particular, the following operations are executed:
* Opening the preprocessed CSV dump
* Obtaining the missing locations by querying the DBLP website
* Fixing the format of the locations

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
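# Libraries Import
import concurrent.futures  # "import concurrent" alone does not load the futures submodule
import multiprocessing as mp
import platform

import numpy as np  # needed for np.nan below
import pandas as pd
from location_scraper_multithread_utils import *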

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
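The paths above are later joined to file names by plain string concatenation, which only works when each directory path ends with a separator. A minimal sketch of a more forgiving alternative, using `pathlib` from the standard library; the directory `dumps/export` is a placeholder chosen for illustration, not the layout used in this Notebook:

```python
from pathlib import Path

# Placeholder directory (an assumption for illustration); no trailing slash needed
export_dir = Path("dumps/export")

# The "/" operator inserts the path separator when needed
csv_path = export_dir / "out_dblp_papers_and_locations.csv"

print(csv_path.name)                  # file name only
print(csv_path.parent == export_dir)  # True
```

`pd.read_csv` and `DataFrame.to_csv` both accept `Path` objects, so the rest of the Notebook would not need to change apart from the concatenations.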

### Multithreading Settings
Settings for the parallel queries that gather the missing conference locations from the DBLP website. The queries run in a pool of worker processes.

Please specify the max number of workers below:

**Important Note**: during our tests we found out that DBLP refuses incoming connections if requests are made too frequently. You can read more about the DBLP Servers Rate Limit [here](https://dblp.org/faq/1474706.html).

We suggest **setting the number of workers to 1 if you have a fast connection** (over 100 Mbps). Otherwise, you can try a higher value to make requests in parallel.

In [3]:
MAX_WORKERS = 1

You can also set a **sleep delay** (in seconds) between requests if that's not enough:

In [4]:
SLEEP_DELAY = 0.3 # Seconds
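To illustrate how a delay can work together with the rate limit, here is a minimal sketch of a wrapper that sleeps before each attempt and doubles the wait after a refused connection. The names `call_with_backoff` and `flaky_fetch` are hypothetical; the actual download function lives in `location_scraper_multithread_utils` and may handle delays differently.

```python
import time

def call_with_backoff(func, *args, delay=0.3, retries=3):
    """Call func(*args), sleeping before each attempt and doubling the delay after a failure."""
    last_error = None
    for attempt in range(retries):
        time.sleep(delay * (2 ** attempt))
        try:
            return func(*args)
        except ConnectionError as error:
            last_error = error
    raise last_error

# Stand-in for a real request (hypothetical): fails twice, then succeeds
attempts = {"count": 0}

def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection refused")
    return f"fetched {url}"

result = call_with_backoff(flaky_fetch, "db/conf/coopis/coopis2000.html", delay=0.01)
print(result)
```

The short `delay=0.01` only keeps the demonstration fast; against the real server a value closer to `SLEEP_DELAY` or higher would be more appropriate.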

Operating-system-specific setting for the multiprocessing start method.

**Note**: Due to the security measures of the latest macOS releases, we need to use the spawn method instead of fork. Windows supports only spawn.

In [5]:
print(f"Notebook running on {platform.system()} OS: ")

if platform.system() == "Darwin" or platform.system() == "Windows": # MacOS and windows
    mp_ctx = mp.get_context("spawn")
    print("Spawn method has been set")
    
else: # other unix systems
    mp_ctx = mp.get_context("fork")
    print("Fork method has been set")

Notebook running on Darwin OS: 
Spawn method has been set
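As a quick check, the start methods supported by the current platform can be listed before choosing one. This sketch uses only the standard `multiprocessing` API. With spawn, child processes re-import the code they run, which is presumably why the worker function is kept in a separate module rather than defined in the Notebook.

```python
import multiprocessing as mp

# Start methods supported on this platform
available = mp.get_all_start_methods()
print(available)

# "spawn" is available on every platform; "fork" only on Unix-like systems
ctx = mp.get_context("spawn")
print(ctx.get_start_method())
```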


## Reading the Preprocessed CSV Dump

In [6]:
df_dblp_preprocessed = pd.read_csv(path_file_export + 'out_dblp_papers_and_locations.csv', low_memory=False, index_col=[0])
df_dblp_preprocessed

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,,On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...,...
2484040,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2484041,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2484042,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2484043,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html,2011


## Obtaining the Conferences Locations from the DBLP Website
The conference locations are obtained by querying the DBLP website.

First, we select the rows whose conference has no location:

In [7]:
df_dblp_conferences = df_dblp_preprocessed[df_dblp_preprocessed["ConferenceLocation"].isna()]

Extraction of the conference URLs:

In [8]:
df_dblp_conferences = df_dblp_conferences[["url"]].copy()

Adding the new location column:

In [9]:
df_dblp_conferences['ConferenceLocation'] = np.nan

Drop the duplicated conferences, since each URL only needs to be queried once.

In [10]:
df_dblp_conferences = df_dblp_conferences.drop_duplicates(subset="url")

print(f"Now we only need to search for the location of {len(df_dblp_conferences)} unique conferences")

Now we only need to search for the location of 1043 unique conferences
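The selection steps above can be tried on a toy dataframe. The URLs and values below are invented for illustration; only the column names match the dump.

```python
import pandas as pd

# Toy data (made up): two papers share the same conference page
df = pd.DataFrame({
    "url": ["a.html", "b.html", "b.html", "c.html"],
    "ConferenceLocation": ["Eilat, Israel", None, None, None],
})

# Keep the rows without a location, then one row per conference URL
missing = df[df["ConferenceLocation"].isna()]
unique_missing = missing[["url"]].drop_duplicates(subset="url")

print(unique_missing["url"].tolist())  # ['b.html', 'c.html']
```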


### Definition of the Web Scraping Function
For simplicity, we're going to use the same function used in the MAG web scraper. For this reason, the download function is passed as a parameter.

In [11]:
def dblp_location_scraper(conferences_dataframe, mt_downloader_operation_function, dblp_url = "https://dblp.org/"):
    dict_conf_locations = {}
    download_list = list(conferences_dataframe.url.values)

    # The context manager shuts the pool down once all the futures have completed
    with concurrent.futures.ProcessPoolExecutor(max_workers=int(MAX_WORKERS), mp_context=mp_ctx) as executor:
        # Map each future to its conference URL, so that failures can be reported by URL
        futures = {
            executor.submit(mt_downloader_operation_function, conf_url, dblp_url, SLEEP_DELAY): conf_url
            for conf_url in download_list
        }

        for future in concurrent.futures.as_completed(futures):
            try:
                k, v = future.result()
            except Exception as e:
                print(f"{futures[future]} throws {e}")
            else:
                dict_conf_locations[k] = v

    # Converting the resulting dictionary to a dataframe
    df_conf_locations = pd.DataFrame(list(dict_conf_locations.items()), columns=['url', 'ConferenceLocation'])

    return df_conf_locations
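The submit-and-collect pattern used by the scraper can be tried in isolation, without touching the network. The sketch below makes two assumptions for the sake of a quick test: it uses a `ThreadPoolExecutor` instead of a process pool, and `fake_worker` is a hypothetical stand-in for the real download function, returning a `(url, location)` pair like the one the scraper unpacks.

```python
import concurrent.futures

def fake_worker(conf_url, base_url, delay):
    # Hypothetical stand-in for the real download function
    if "broken" in conf_url:
        raise ValueError("page not found")
    return conf_url, f"location of {conf_url}"

urls = ["db/conf/a.html", "db/conf/broken.html", "db/conf/b.html"]
results, failures = {}, {}

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to its URL so that errors can be attributed
    future_to_url = {executor.submit(fake_worker, u, "https://dblp.org/", 0): u for u in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            key, value = future.result()
        except Exception as error:
            failures[url] = str(error)
        else:
            results[key] = value

print(sorted(results))
print(failures)
```

Keeping the failed URLs in a dictionary, rather than only printing them, makes it possible to retry them later.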

#### Queries to the DBLP Website

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than 12 hours, depending on your Internet speed.

In [12]:
df_conf_locations = dblp_location_scraper(df_dblp_conferences, mt_get_dblp_conf_location_from_dblp_operation, "https://dblp.org/")

https://dblp.org/db/conf/coopis/coopis2004-2.html
https://dblp.org/db/conf/coopis/coopis2004-1.html
https://dblp.org/db/conf/sbbd/sbbd2020.html
https://dblp.org/db/conf/approx/approx2020.html
https://dblp.org/db/conf/icsports/icsports2021.html
https://dblp.org/db/conf/ipas/ipas2021.html
https://dblp.org/db/conf/assistive/assistive1998.html
https://dblp.org/db/conf/cikm/datamod2020.html
https://dblp.org/db/conf/cikm/cikmw94.html
https://dblp.org/db/conf/ismco/ismco2021.html
https://dblp.org/db/conf/pe/pe2001.html
https://dblp.org/db/conf/ctrsa/ctrsa2022.html
https://dblp.org/db/conf/ctrsa/ctrsa2021.html
https://dblp.org/db/conf/rz/rz1980.html
https://dblp.org/db/conf/rz/rz1975.html
https://dblp.org/db/conf/sigada/ada1980.html
https://dblp.org/db/conf/scisec/scisec2021.html
https://dblp.org/db/conf/data/data2020s.html
https://dblp.org/db/conf/data/data2021.html
https://dblp.org/db/conf/conversations/conversations2020.html
https://dblp.org/db/conf/socpros/socpros2011-2.html
https://dblp.o

Let's see how many conference locations have been fixed.

In [13]:
df_conf_locations = df_conf_locations.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations.index)} over {len(df_dblp_conferences.index)} unique conferences")

Fixed 725 over 1043 unique conferences


## Join of the New Location Data with the Original Dataframe

In [14]:
# Merge with the location dataframe
df_dblp_preprocessed = pd.merge(df_dblp_preprocessed, df_conf_locations, on=['url'], how='left')

# Combine the two columns
df_dblp_preprocessed['ConferenceLocation_x'] = df_dblp_preprocessed['ConferenceLocation_x'].fillna(df_dblp_preprocessed['ConferenceLocation_y'])
df_dblp_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_dblp_preprocessed = df_dblp_preprocessed.drop(columns=['ConferenceLocation_y'])

df_dblp_preprocessed.iloc[:5]

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,"Agia Napa, Cyprus - Volume 2",On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
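The merge-then-fill step can be checked on a small example. The data is invented; the `suffixes` argument is a variation that keeps the original column name on the left side, so no rename is needed afterwards.

```python
import pandas as pd

# Toy data (made up): one paper has a location, two do not
papers = pd.DataFrame({
    "key": ["p1", "p2", "p3"],
    "url": ["a.html", "b.html", "c.html"],
    "ConferenceLocation": ["Eilat, Israel", None, None],
})
# Scraping found a location for one of the two missing conferences
scraped = pd.DataFrame({
    "url": ["b.html"],
    "ConferenceLocation": ["Agia Napa, Cyprus - Volume 2"],
})

# The left join keeps every paper; the scraped value lands in a suffixed column
merged = papers.merge(scraped, on="url", how="left", suffixes=("", "_scraped"))

# Fill only the gaps, then drop the helper column
merged["ConferenceLocation"] = merged["ConferenceLocation"].fillna(merged["ConferenceLocation_scraped"])
merged = merged.drop(columns=["ConferenceLocation_scraped"])

print(merged)
```

Existing locations are never overwritten, because `fillna` only touches missing values.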


Count how many papers are still missing a conference location:

In [15]:
n_missing = len(df_dblp_preprocessed.index) - len(df_dblp_preprocessed.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

8812 missing paper's conference locations
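An equivalent and more direct way to count missing values is to sum the boolean mask returned by `isna()`. A small sketch on made-up data:

```python
import pandas as pd

# Toy column (made up) with two missing locations
locations = pd.Series(["Eilat, Israel", None, "Catania, Sicily, Italy", None])

# True counts as 1 when summed
n_missing = int(locations.isna().sum())
print(f"{n_missing} of {len(locations)} locations are missing")
```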


## Cleaning the Conference Locations
Some conference proceedings are split into separate volumes. The volume number is usually indicated in the header of the DBLP page, following the conference location.

For example: *OTM 2004: Agia Napa, Cyprus - Volume 2*

We need to strip this suffix.

In [16]:
df_dblp_preprocessed.ConferenceLocation = df_dblp_preprocessed.ConferenceLocation.str.split(' - ').str[0]

df_dblp_preprocessed.iloc[:5]

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,"Agia Napa, Cyprus",On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
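Splitting on `' - '` removes everything after the first occurrence, so a location that itself contained `' - '` would be truncated as well. Whether such values occur in the dump is not checked here. The sketch below compares the split with a narrower regular expression that removes only a trailing `- Volume N`; the third value is invented to show the difference.

```python
import pandas as pd

locations = pd.Series([
    "Agia Napa, Cyprus - Volume 2",
    "Eilat, Israel",
    "Hypothetical City - Old Town, Nowhere",  # made-up value containing " - "
    None,
])

# Approach used in this Notebook: keep the text before the first " - "
split_result = locations.str.split(" - ").str[0]

# Narrower alternative: remove only a trailing " - Volume <number>"
regex_result = locations.str.replace(r"\s+-\s+Volume\s+\d+$", "", regex=True)

print(split_result.tolist())
print(regex_result.tolist())
```

Both approaches leave missing values untouched.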


## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [17]:
# Write of the resulting CSV on Disk
df_dblp_preprocessed.to_csv(path_file_export + 'out_dblp_papers_and_locations_final.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_dblp_papers_and_locations_final.csv')

Check the exported CSV to be sure that everything went fine.

In [18]:
# Check of the Exported CSV
df_dblp_exported_csv = pd.read_csv(path_file_export + 'out_dblp_papers_and_locations_final.csv', low_memory=False, index_col=[0])
df_dblp_exported_csv

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,"Agia Napa, Cyprus",On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...,...
2434139,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2434140,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2434141,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2434142,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html,2011
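Beyond a visual inspection, the round trip can be verified programmatically. A minimal sketch on toy data, writing to a temporary directory rather than to the export path; the file name `roundtrip_check.csv` is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe (made up) with one missing location
df = pd.DataFrame({
    "ConferenceLocation": ["Eilat, Israel", None, "Catania, Sicily, Italy"],
    "year": [2000, 2004, 2003],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "roundtrip_check.csv"  # temporary file for the demo
    df.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, low_memory=False, index_col=[0])

# Compare shape, column names and the number of missing locations
same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)
same_missing = int(reloaded["ConferenceLocation"].isna().sum()) == int(df["ConferenceLocation"].isna().sum())
print(same_shape, same_columns, same_missing)
```

The same three comparisons could be applied to `df_dblp_preprocessed` and `df_dblp_exported_csv`.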
