
# Location Web Scraping of DBLP Dataset

Jupyter Notebook for web scraping the conference locations of the DBLP dump.

For this process, the following CSV file is needed: ```out_dblp_papers.csv```.
The above file must be generated by running the ```preprocess_dblp.ipynb``` notebook contained in the ```1 - Citation Dumps Preprocess``` folder of this repository.

In particular, the following operations are executed:
* Opening the preprocessed CSV dump
* Obtaining the missing locations with queries to the DBLP website
* Fixing the location format

Lastly, the entire preprocessed dump is saved on disk in CSV format.

In [None]:
# Libraries Import
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [None]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

## Read of the CSV Preprocessed Dump

In [None]:
df_dblp = pd.read_csv(path_file_export + 'out_dblp_papers.csv', low_memory=False)
df_dblp

## Obtaining the Missing Conference Locations from Our Previous Work
The missing conference locations are obtained from our previous work.

In [None]:
df_place_of_conference = pd.read_csv(path_file_import + 'place_of_conference1.csv', names=['crossref', 'ConferenceLocation'], header=0)
df_place_of_conference

Drop of the conferences without a location:

In [None]:
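# Missing locations are stored as the string 'None', so "dropna" alone would not catch them.
# The result is assigned back instead of using inplace=True on a column selection,
# which is not guaranteed to modify the dataframe in recent pandas versions.
df_place_of_conference['ConferenceLocation'] = df_place_of_conference['ConferenceLocation'].replace('None', np.nan)

df_place_of_conference = df_place_of_conference.dropna(subset=['ConferenceLocation'])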
df_place_of_conference
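
The cleaning step above can be tried on a small toy table. This is only a sketch: the `crossref` keys and locations below are invented for illustration and do not come from `place_of_conference1.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the locations file (all values are invented)
toy_places = pd.DataFrame({
    'crossref': ['conf/a/2001', 'conf/b/2002', 'conf/c/2003'],
    'ConferenceLocation': ['Rome, Italy', 'None', 'Paris, France'],
})

# Turn the literal string 'None' into a real missing value...
toy_places['ConferenceLocation'] = toy_places['ConferenceLocation'].replace('None', np.nan)

# ...so that dropna can remove the rows without a location
toy_places = toy_places.dropna(subset=['ConferenceLocation'])

print(toy_places)
```

A possible alternative, assuming the only placeholder in the file is the string `None`, is to pass `na_values=['None']` to `pd.read_csv` so that the values are already missing when the file is loaded.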

## Join of the New Location Data with the Original Dataframe

Adding the new location column:

In [None]:
df_dblp['ConferenceLocation'] = np.nan
df_dblp

In [None]:
df_dblp = pd.merge(df_dblp, df_place_of_conference, on=['crossref'], how='left')

# Combine the two columns
df_dblp['ConferenceLocation_x'] = df_dblp['ConferenceLocation_x'].fillna(df_dblp['ConferenceLocation_y'])
df_dblp.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_dblp = df_dblp.drop(columns=['ConferenceLocation_y'])

df_dblp.iloc[:5]
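
The join above can be sketched on toy data to show where the `_x` and `_y` suffixes come from. The rows below are invented, and the `validate='many_to_one'` argument is an optional safeguard that is not part of the original cell: it makes pandas raise an error if a `crossref` key appears more than once in the locations table, which would otherwise duplicate paper rows.

```python
import numpy as np
import pandas as pd

# Toy papers table (invented rows); every location starts as missing
papers = pd.DataFrame({
    'title': ['P1', 'P2', 'P3'],
    'crossref': ['conf/a/2001', 'conf/b/2002', 'conf/a/2001'],
    'ConferenceLocation': np.nan,
})

# Toy lookup table with one row per crossref key
places = pd.DataFrame({
    'crossref': ['conf/a/2001'],
    'ConferenceLocation': ['Rome, Italy'],
})

# Left join keeps every paper; validate checks that `places` has unique keys
merged = pd.merge(papers, places, on='crossref', how='left', validate='many_to_one')

# Both tables have a ConferenceLocation column, so pandas suffixes them with _x and _y.
# Keep the left values and fill the gaps from the right ones.
merged['ConferenceLocation'] = merged['ConferenceLocation_x'].fillna(merged['ConferenceLocation_y'])
merged = merged.drop(columns=['ConferenceLocation_x', 'ConferenceLocation_y'])

print(merged)
```

In this toy case the two papers that share `conf/a/2001` receive the same location, while the paper with `conf/b/2002` stays missing because the lookup table has no entry for it.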

Count of the papers whose conference location is still missing:

In [None]:
n_missing = len(df_dblp.index) - len(df_dblp.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} papers are still missing a conference location")
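
As a hedged aside, the same count can be obtained directly from the column with `isna().sum()`, which also makes it easy to report the share of missing values. The toy dataframe below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with two missing locations out of four (invented values)
toy = pd.DataFrame({
    'ConferenceLocation': ['Rome, Italy', np.nan, np.nan, 'Paris, France'],
})

# isna() marks the missing entries, sum() counts them
n_missing = int(toy['ConferenceLocation'].isna().sum())
share_missing = n_missing / len(toy)

print(f"{n_missing} of {len(toy)} papers ({share_missing:.1%}) have no conference location")
```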

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
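# Write the resulting CSV on disk (this overwrites the input file).
# index=False avoids saving the row index as an extra unnamed column.
df_dblp.to_csv(path_file_export + 'out_dblp_papers.csv', index=False)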

Check of the exported CSV to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_dblp_exported_csv = pd.read_csv(path_file_export + 'out_dblp_papers.csv', low_memory=False)
df_dblp_exported_csv
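
One thing such a check can catch is an unexpected extra column: by default `to_csv` also writes the row index, which comes back as `Unnamed: 0` when the file is read again. The sketch below illustrates this on an invented toy dataframe written to a temporary directory; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy dataframe (invented rows)
toy = pd.DataFrame({
    'title': ['P1', 'P2'],
    'ConferenceLocation': ['Rome, Italy', np.nan],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    toy.to_csv(with_index)                  # default: row index written as first column
    toy.to_csv(without_index, index=False)  # only the data columns are written

    cols_with_index = list(pd.read_csv(with_index).columns)
    reloaded = pd.read_csv(without_index)

# Simple round-trip checks on columns and shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape

print(cols_with_index)
print(same_columns, same_shape)
```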