# Fix of the Missing DBLP Location Using the DBLP Dump

Jupyter Notebook that recovers the missing conference locations of the DBLP dump from the titles of its proceedings.

For this process, the following CSV files are needed: ```out_dblp_raw_proceedings.csv``` and ```out_dblp_papers_and_locations.csv```. <br>
The first file must be generated by following the instructions contained in the ```preprocess_dblp.ipynb``` Notebook, which is in the ```1 - Citation Dumps Preprocess``` folder of this repository.<br>
The second one must be generated by running the ```1 - dblp_location_scraper.ipynb``` Notebook, which is in this folder.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dump
* Opening of the proceedings raw dump
* Extracting the missing locations from the proceedings titles

Lastly, the entire preprocessed dump is saved on disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import spacy
import numpy as np

pd.set_option('display.max_columns', None)

## Download of the Spacy Pipelines
Before getting started, be sure to download the spaCy pipeline that is needed for the NLP operations.

You can do so by running the following command in a terminal: ```python -m spacy download en_core_web_lg```
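
A quick way to confirm that the pipeline is installed before running the rest of the notebook is to look it up as a Python package, since downloaded spaCy pipelines are installed as regular packages. The sketch below is only a suggestion and relies on the standard library; the helper name `is_pipeline_installed` is hypothetical.

```python
import importlib.util


def is_pipeline_installed(package_name: str) -> bool:
    """Return True if a package with this name can be found (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None


if not is_pipeline_installed("en_core_web_lg"):
    print("Pipeline missing: run `python -m spacy download en_core_web_lg` first")
```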

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
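
Because the notebook concatenates these paths with file names, each path must end with a path separator. One way to avoid depending on that is to build the paths with `pathlib`, as in the sketch below. The directories used are placeholders (assumptions), not the author's paths.

```python
from pathlib import Path

# Placeholder directories (assumptions): replace them with your own
import_dir = Path("dumps") / "Import"
export_dir = Path("dumps") / "Export"

# The / operator inserts the separator, so no trailing slash is needed
proceedings_csv = import_dir / "out_dblp_raw_proceedings.csv"
papers_csv = export_dir / "out_dblp_papers.csv"

print(proceedings_csv)
print(papers_csv.name)
```

`pd.read_csv` and `DataFrame.to_csv` accept `Path` objects directly, so the resulting paths can be passed to them as they are.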

## Read of the CSV Preprocessed Dump

In [7]:
df_dblp = pd.read_csv(path_file_export + 'out_dblp_papers.csv', low_memory=False, index_col=[0])
df_dblp

Unnamed: 0,id,crossref,ee,key,url,year
1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html#ChenD00,2000
2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,2004
4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html#PapastavrouCSP00,2000
5,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html#BultzingsloewenKK97,1997
6,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html#GiacolettoA03,2003
...,...,...,...,...,...,...
2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html#Besold13,2011
2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html#Steiner13,2011
2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html#Armstrong13,2011
2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html#Freed13,2011


## Read of the Proceedings CSV

In [None]:
df_dblp_proceedings = pd.read_csv(path_file_import + 'out_dblp_raw_proceedings.csv', low_memory=False, sep=";")
df_dblp_proceedings

Here the columns that are not needed for the location extraction are removed from the dataframe.

In [None]:
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['address', 'author', 'booktitle', 'cite', 'cite-label', 'editor', 'editor-orcid', 'ee-type', 'i', 'isbn', 'isbn-type', 'journal', 'mdate', 'note', 'note-type', 'number', 'pages', 'publisher', 'publisher-href', 'publtype', 'school', 'sub', 'sup', 'volume', 'year'])
df_dblp_proceedings.loc[0:3]
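
An alternative to dropping columns after loading is to read only the needed ones with the `usecols` parameter of `pd.read_csv`. The sketch below uses a small in-memory sample instead of the real dump. The sample rows and the choice of `title` and `url` as the columns to keep are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the proceedings dump (invented rows, same ';' separator)
sample_csv = io.StringIO(
    "id;title;url;publisher;year\n"
    "1;Proceedings of Conf A, Rome, Italy;db/conf/a/a2000.html;Springer;2000\n"
    "2;Proceedings of Conf B;db/conf/b/b2001.html;IEEE;2001\n"
)

# Only the listed columns are kept in the resulting dataframe
df_sample = pd.read_csv(sample_csv, sep=";", usecols=["title", "url"])
print(df_sample.columns.tolist())
```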

## Fix of the Conferences URL Format
Most of the papers have their URLs in one of the following formats:
* ```/db/conf/CONF_NAME/CONF_NAME+YEAR.html#PAPER_IDENTIFIER```
* ```/db/series/SERIES_NAME/SERIES_NAME+YEAR.html#PAPER_IDENTIFIER```

The final part of the URL, which follows '.html', identifies the single paper and is not needed for our purpose.

We strip it so that all the papers of the same proceedings share one URL, which can then be matched against the proceedings dataframe.

In [8]:
df_dblp.url = df_dblp.url.str.split('#').str[0]
df_dblp

Unnamed: 0,id,crossref,ee,key,url,year
1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
5,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
6,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...
2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html,2011
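
To illustrate the effect on a few rows, the sketch below strips the fragment and then lists the distinct URLs. The sample values are taken from the dataframe preview in this notebook; the variable names are hypothetical.

```python
import pandas as pd

urls = pd.Series([
    "db/conf/coopis/coopis2000.html#ChenD00",
    "db/conf/coopis/coopis2004-2.html#AbdellatifCL04",
    "db/conf/coopis/coopis2000.html#PapastavrouCSP00",
])

# Keep only the part before '#', i.e. the proceedings page
page_urls = urls.str.split("#").str[0]

# Papers of the same proceedings now share one URL
unique_pages = page_urls.drop_duplicates().tolist()
print(unique_pages)
```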


## Extraction of the Conference Location from the Proceeding Title

Adding the location column to the proceedings dataframe:

In [None]:
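# Use the object dtype so that the columns can later hold location strings
df_dblp_proceedings['ConferenceLocation'] = pd.Series(np.nan, index=df_dblp_proceedings.index, dtype='object')
df_dblp['ConferenceLocation'] = pd.Series(np.nan, index=df_dblp.index, dtype='object')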

Reset the index so that it matches the row positions:

In [None]:
df_dblp_proceedings = df_dblp_proceedings.reset_index()

Extraction of the locations using Spacy:

In [None]:
nlp = spacy.load("en_core_web_lg")

for index, row in df_dblp_proceedings.iterrows():
    doc = nlp(row['title'])

    # Collect every geopolitical entity (GPE) that spaCy finds in the title
    gpe_entities = [ent.text for ent in doc.ents if ent.label_ == "GPE"]

    #print([(ent.text, ent.label_) for ent in doc.ents]) # TODO DEBUG

    # Join the entities into one string; keep NaN when no location was found
    conf_location = ", ".join(gpe_entities) if gpe_entities else np.nan
    
    df_dblp_proceedings.at[index, 'ConferenceLocation'] = conf_location

    #print(f"{row['title']} --- Extracted Location: {conf_location}\n\n") # TODO DEBUG
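
Calling `nlp` once per row can be slow on a large dump. A possible alternative, sketched below, streams the titles through `nlp.pipe`, which processes the texts in batches and yields one `Doc` per text in input order. The function name `extract_locations` and the batch size are assumptions; `nlp` is expected to be an already loaded spaCy pipeline, such as the one loaded in this notebook.

```python
def extract_locations(titles, nlp, batch_size=256):
    """Return one location string (or None) per title.

    Hypothetical helper: `nlp` is any loaded spaCy pipeline with an NER component.
    """
    locations = []
    # nlp.pipe yields one Doc per input text, in the same order
    for doc in nlp.pipe(titles, batch_size=batch_size):
        gpe_entities = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
        locations.append(", ".join(gpe_entities) if gpe_entities else None)
    return locations


# Possible usage in this notebook (not executed here):
# titles = df_dblp_proceedings['title'].fillna("").astype(str)
# df_dblp_proceedings['ConferenceLocation'] = extract_locations(titles, nlp)
```

Returning `None` for titles without a location keeps the later `dropna` step working, since pandas treats `None` as a missing value.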

Count how many proceedings locations are still missing:

In [None]:
n_missing = df_dblp_proceedings['ConferenceLocation'].isna().sum()
print(f"{n_missing} proceedings with a missing conference location")

## Join of the New Location Data with the Original Dataframe

Here the remaining columns that are not needed for the merge are removed from the dataframe.

In [None]:
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['id', 'key', 'series', 'series-href', 'title', 'ee'])
df_dblp_proceedings.loc[0:3]

Filtering out the proceedings without a location:

In [None]:
df_dblp_proceedings = df_dblp_proceedings.dropna(subset = ['ConferenceLocation'])
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['index'])
df_dblp_proceedings

Merge with the original DBLP papers dataframe:

In [10]:
# Merge with the location dataframe
df_dblp = pd.merge(df_dblp, df_dblp_proceedings, on=['url'], how='left')

# Combine the two columns
df_dblp['ConferenceLocation_x'] = df_dblp['ConferenceLocation_x'].fillna(df_dblp['ConferenceLocation_y'])
df_dblp.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_dblp = df_dblp.drop(columns=['ConferenceLocation_y'])

df_dblp.iloc[:5]

Unnamed: 0,id,crossref,ee,key,url,year,ConferenceLocation
0,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000,"Eilat, Israel"
1,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004,
2,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000,"Eilat, Israel"
3,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997,"Kiawah Island, South Carolina, USA"
4,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003,"Catania, Sicily, Italy"
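
A left merge repeats a paper row once for each proceedings row with the same `url`, so duplicated URLs in the proceedings dataframe would inflate the result. The sketch below, on invented sample data, removes such duplicates first and passes `validate='many_to_one'` so that pandas raises an error if any remain. Whether the real dump contains duplicated URLs is not checked here, and the `_proc` suffix is an arbitrary choice.

```python
import numpy as np
import pandas as pd

papers = pd.DataFrame({
    "key": ["p1", "p2", "p3"],
    "url": ["db/conf/a/a2000.html", "db/conf/a/a2000.html", "db/conf/b/b2001.html"],
    "ConferenceLocation": [np.nan, "Rome, Italy", np.nan],
})
proceedings = pd.DataFrame({
    "url": ["db/conf/a/a2000.html", "db/conf/a/a2000.html"],
    "ConferenceLocation": ["Rome", "Rome"],
})

# Keep one proceedings row per URL before merging
proceedings = proceedings.drop_duplicates(subset="url")

merged = pd.merge(papers, proceedings, on="url", how="left",
                  suffixes=("", "_proc"), validate="many_to_one")

# Existing locations win; gaps are filled from the proceedings
merged["ConferenceLocation"] = merged["ConferenceLocation"].fillna(merged["ConferenceLocation_proc"])
merged = merged.drop(columns=["ConferenceLocation_proc"])
print(merged)
```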


Count how many papers are still missing their conference location:

In [11]:
n_missing = len(df_dblp.index) - len(df_dblp.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

88095 missing paper's conference locations


## Drop of the Papers Without Location

In [None]:
df_dblp = df_dblp.dropna(subset = ['ConferenceLocation'])
df_dblp

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_dblp.to_csv(path_file_export + 'out_dblp_papers_and_locations.csv')

Reload the exported CSV to check that everything went fine.

In [None]:
# Check of the Exported CSV
df_dblp_exported_csv = pd.read_csv(path_file_export + 'out_dblp_papers_and_locations.csv', low_memory=False, index_col=[0])
df_dblp_exported_csv
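
The visual check can be complemented by a programmatic one. The sketch below writes a small invented dataframe to a temporary file, reads it back and compares shape and values. The sample data and the file name are placeholders, not the real export.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_sample = pd.DataFrame({
    "url": ["db/conf/a/a2000.html", "db/conf/b/b2001.html"],
    "ConferenceLocation": ["Rome, Italy", "Kiawah Island, South Carolina, USA"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample_export.csv"
    df_sample.to_csv(csv_path)
    df_reloaded = pd.read_csv(csv_path, index_col=[0])

# Locations contain commas, so this also checks that quoting survived the round trip
same_shape = df_reloaded.shape == df_sample.shape
same_values = df_reloaded["ConferenceLocation"].tolist() == df_sample["ConferenceLocation"].tolist()
print(same_shape and same_values)
```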