# Add the DBLP Conference Locations Using the DBLP Raw Dump

Jupyter Notebook for adding conference locations to the DBLP papers dump.

This process requires two CSV files: ```out_dblp_raw_proceedings.csv``` and ```out_dblp_papers.csv```. <br>
The first file must be generated by following the instructions contained in the ```preprocess_dblp.ipynb``` Notebook, which is located in the ```1 - Citation Dumps Preprocess``` folder of this repository.<br>
The second file must be generated by running that same ```preprocess_dblp.ipynb``` Notebook.

In particular, the following operations are performed:
* Opening the preprocessed CSV dump
* Opening the raw proceedings dump
* Extracting the locations from the raw proceedings dump
* Filtering out the seminar-related papers

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import spacy
import numpy as np

pd.set_option('display.max_columns', None)

## Download of the spaCy Pipeline
Before getting started, be sure to download the spaCy pipeline that is needed for the NLP operations.

You can do so by running the following command in a terminal: ```python -m spacy download en_core_web_lg```
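
Before loading the pipeline, it can help to check that it is actually installed. The sketch below assumes that the pipeline is installed as a regular Python package, which is what the download command does; the helper name `is_pipeline_installed` is hypothetical.

```python
import importlib.util


def is_pipeline_installed(name: str) -> bool:
    """Return True if a package with the given name can be imported."""
    return importlib.util.find_spec(name) is not None


if not is_pipeline_installed("en_core_web_lg"):
    print("Pipeline missing, run: python -m spacy download en_core_web_lg")
```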

## File Paths
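Set the paths of your working directories below. Each path must end with a path separator, since the file names are appended to it directly.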

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
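
As an alternative to string concatenation, the paths could be built with `pathlib`, which inserts the separator automatically. This is only a sketch: the directory names `dumps/Import` and `dumps/Export` are hypothetical placeholders, and the notebook itself keeps using plain strings.

```python
from pathlib import Path

# Hypothetical directories, replace them with your own
import_dir = Path("dumps") / "Import"
export_dir = Path("dumps") / "Export"

# The "/" operator adds the separator, so no trailing slash is needed
proceedings_csv = import_dir / "out_dblp_raw_proceedings.csv"
papers_csv = export_dir / "out_dblp_papers.csv"

print(proceedings_csv.name)  # → out_dblp_raw_proceedings.csv
```

`pd.read_csv` accepts `Path` objects directly, so these values can be passed to it without conversion.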

## Read of the CSV Preprocessed Dump

In [3]:
df_dblp = pd.read_csv(path_file_export + 'out_dblp_papers.csv', low_memory=False, index_col=[0])
df_dblp

Unnamed: 0,id,crossref,ee,key,url,year
1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html#ChenD00,2000
2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,2004
4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html#PapastavrouCSP00,2000
5,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html#BultzingsloewenKK97,1997
6,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html#GiacolettoA03,2003
...,...,...,...,...,...,...
2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html#Besold13,2011
2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html#Steiner13,2011
2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html#Armstrong13,2011
2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html#Freed13,2011


## Read of the Proceedings CSV

In [4]:
df_dblp_proceedings = pd.read_csv(path_file_import + 'out_dblp_raw_proceedings.csv', low_memory=False, sep=";")
df_dblp_proceedings

Unnamed: 0,id,address,author,booktitle,cite,cite-label,editor,editor-orcid,ee,ee-type,i,isbn,isbn-type,journal,key,mdate,note,note-type,number,pages,publisher,publisher-href,publtype,school,series,series-href,sub,sup,title,url,volume,year
0,90858,,,,,,Amir Hossein Alavi|Amir Hossein Gandomi|Conor ...,0000-0002-2798-0104|0000-0002-7002-5815|0000-0...,https://doi.org/10.1007/978-3-319-20883-1,,,978-3-319-20882-4,,,reference/genetic/2015,2020-03-27,,,,,Springer,,,,,,,,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html,,2015
1,109806,,,,,,Ankur Agarwal|Borko Furht,,https://doi.org/10.1007/978-1-4614-8495-0,,,978-1-4614-8494-3,,,reference/med/2013,2017-05-16,,,,,Springer,,,,,,,,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html,,2013
2,3062827,,,OTM,,,Douglas C. Schmidt|Robert Meersman|Zahir Tari,,https://doi.org/10.1007/b94348,,,3-540-20498-9,,,conf/coopis/2003,2021-04-30,,,,,Springer,,,,Lecture Notes in Computer Science,db/series/lncs/index.html,,,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2888,2003
3,3062933,,,CoopIS/DOA/ODBASE,,,Robert Meersman|Zahir Tari,,https://doi.org/10.1007/b102176,,,3-540-23662-7,,,conf/coopis/2004-2,2019-05-14,,,,,Springer,,,,Lecture Notes in Computer Science,db/series/lncs/index.html,,,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html,3291,2004
4,3062955,,,CoopIS,,,Carlo Batini|Fausto Giunchiglia|Massimo Mecell...,,https://doi.org/10.1007/3-540-44751-2,,,3-540-42524-1,,,conf/coopis/2001,2019-05-14,,,,,Springer,,,,Lecture Notes in Computer Science,db/series/lncs/index.html,,,"Cooperative Information Systems, 9th Internati...",db/conf/coopis/coopis2001.html,2172,2001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50207,8958161,,,,,,Rudi Studer|Steffen Staab,,https://doi.org/10.1007/978-3-540-92673-3|http...,,,978-3-540-70999-2|978-3-540-92673-3,,,series/ihis/2009hoo,2018-11-14,,,,,Springer,,,,International Handbooks on Information Systems,db/series/ihis/index.html,,,Handbook on Ontologies,db/series/ihis/hoo2009.html,,2009
50208,8958192,,,,,,Jan vom Brocke|Michael Rosemann,,https://doi.org/10.1007/978-3-642-45100-3,,,978-3-642-45099-0|978-3-642-45100-3,,,series/ihis/2015bpm1,2017-05-16,,,,,Springer,,,,International Handbooks on Information Systems,db/series/ihis/index.html,,,"Handbook on Business Process Management 1, Int...",db/series/ihis/bpm2015-1.html,,2015
50209,8958224,,,,,,Jan vom Brocke|Michael Rosemann,,https://doi.org/10.1007/978-3-642-45103-4,,,978-3-642-45102-7|978-3-642-45103-4,,,series/ihis/2015bpm2,2017-05-16,,,,,Springer,,,,International Handbooks on Information Systems,db/series/ihis/index.html,,,"Handbook on Business Process Management 2, Str...",db/series/ihis/bpm2015-2.html,,2015
50210,8958650,,,PT-AI,,,Vincent C. Müller,0000-0002-4144-4957,https://doi.org/10.1007/978-3-642-31674-6,,,978-3-642-31673-9,,,series/sapere/2013-5,2019-09-06,,,,,Springer,,,,"Studies in Applied Philosophy, Epistemology an...",db/series/sapere/index.html,,,Philosophy and Theory of Artificial Intelligen...,db/series/sapere/sapere5.html,5,2013


Here the columns that are not needed for the location extraction are removed from the dataframe.

In [5]:
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['address', 'author', 'booktitle', 'cite', 'cite-label', 'editor', 'editor-orcid', 'ee-type', 'i', 'isbn', 'isbn-type', 'journal', 'mdate', 'note', 'note-type', 'number', 'pages', 'publisher', 'publisher-href', 'publtype', 'school', 'sub', 'sup', 'volume', 'year'])
df_dblp_proceedings.loc[0:3]

Unnamed: 0,id,ee,key,series,series-href,title,url
0,90858,https://doi.org/10.1007/978-3-319-20883-1,reference/genetic/2015,,,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html
1,109806,https://doi.org/10.1007/978-1-4614-8495-0,reference/med/2013,,,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html
2,3062827,https://doi.org/10.1007/b94348,conf/coopis/2003,Lecture Notes in Computer Science,db/series/lncs/index.html,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html
3,3062933,https://doi.org/10.1007/b102176,conf/coopis/2004-2,Lecture Notes in Computer Science,db/series/lncs/index.html,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html


## Fix of the Conferences URL Format
Most of the papers have their URLs in one of the following formats:
* ```db/conf/CONF_NAME/CONF_NAME+YEAR.html#PAPER_IDENTIFIER```
* ```db/series/SERIES_NAME/SERIES_NAME+YEAR.html#PAPER_IDENTIFIER```

The final part of the URL, from the '#' onward, is not needed for our purpose.

We strip it so that all the papers of the same proceedings share an identical URL, which lets us filter out the duplicated URLs and speed up our queries.

In [6]:
df_dblp.url = df_dblp.url.str.split('#').str[0]
df_dblp

Unnamed: 0,id,crossref,ee,key,url,year
1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
5,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
6,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...
2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html,2011
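
The effect of the fragment removal can be checked on a handful of URLs. A minimal sketch, using three URLs from the dump preview as sample data:

```python
import pandas as pd

# Sample URLs taken from the dump preview
urls = pd.Series([
    "db/conf/coopis/coopis2000.html#ChenD00",
    "db/conf/coopis/coopis2000.html#PapastavrouCSP00",
    "db/conf/coopis/coopis97.html#BultzingsloewenKK97",
])

# Keep only the part before the '#'
stripped = urls.str.split('#').str[0]

# Two papers of the same proceedings now share one URL
print(stripped.nunique())  # → 2
```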


## Extraction of the Conference Location from the Proceeding Title

Adding the location column to both dataframes:

In [7]:
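# Use the object dtype so that the columns can later hold location strings
df_dblp_proceedings['ConferenceLocation'] = pd.Series(np.nan, index=df_dblp_proceedings.index, dtype=object)
df_dblp['ConferenceLocation'] = pd.Series(np.nan, index=df_dblp.index, dtype=object)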

Reset the index so that it matches the row positions. The old index is kept as an ```index``` column, which is dropped later.

In [8]:
df_dblp_proceedings = df_dblp_proceedings.reset_index()

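Extraction of the locations using spaCy. Every named entity labelled ```GPE``` (countries, cities, states) in the proceedings title is collected, and the entities are joined with commas: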

In [9]:
nlp = spacy.load("en_core_web_lg")

for index, row in df_dblp_proceedings.iterrows():
    doc = nlp(row['title'])

    # Collect the geopolitical entities in order of appearance
    gpe_entities = [ent.text for ent in doc.ents if ent.label_ == "GPE"]

    # Join them into one string, or leave NaN when no location is found
    conf_location = ", ".join(gpe_entities) if gpe_entities else np.nan

    df_dblp_proceedings.at[index, 'ConferenceLocation'] = conf_location
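
The joining logic of the loop can be isolated from spaCy for a quick check. In this sketch the `(text, label)` pairs stand in for the `ent.text` and `ent.label_` attributes of `doc.ents`; the sample entities and the helper name `join_gpe` are illustrative assumptions, not actual model output.

```python
import math


def join_gpe(entities):
    """Join the texts of the GPE entities, or return NaN if there are none."""
    gpe_texts = [text for text, label in entities if label == "GPE"]
    return ", ".join(gpe_texts) if gpe_texts else math.nan


# Hypothetical entities for a proceedings title
sample = [("2003", "DATE"), ("Catania", "GPE"), ("Sicily", "GPE"), ("Italy", "GPE")]
print(join_gpe(sample))  # → Catania, Sicily, Italy
```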

Count of the proceedings whose location is still missing:

In [10]:
n_missing = len(df_dblp_proceedings.index) - len(df_dblp_proceedings.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

2105 missing paper's conference locations
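
The same count can be obtained more directly with `isna().sum()`. A minimal sketch on a toy dataframe (the sample values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ConferenceLocation": ["Eilat, Israel", np.nan, np.nan]})

# Count via dropna, as in the cell above
n_missing_dropna = len(df.index) - len(df.dropna(subset=["ConferenceLocation"]).index)

# Equivalent count of the NaN values
n_missing_isna = int(df["ConferenceLocation"].isna().sum())

print(n_missing_dropna, n_missing_isna)  # → 2 2
```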


## Join of the New Location Data with the Original Dataframe

Here the remaining columns that are not needed for the join are removed from the dataframe.

In [11]:
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['id', 'key', 'series', 'series-href', 'ee', 'index'])
df_dblp_proceedings.loc[0:3]

Unnamed: 0,title,url,ConferenceLocation
0,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html,
1,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html,
2,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,"Catania, Sicily, Italy"
3,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html,


Rename the title column to avoid ambiguity in future joins:

In [12]:
df_dblp_proceedings.rename(columns={'title': 'ConferenceTitle'}, inplace=True)
df_dblp_proceedings.loc[0:3]

Unnamed: 0,ConferenceTitle,url,ConferenceLocation
0,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html,
1,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html,
2,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,"Catania, Sicily, Italy"
3,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html,


Merge the original DBLP dataframe with the proceedings locations dataframe:

In [13]:
# Merge with the location dataframe
df_dblp = pd.merge(df_dblp, df_dblp_proceedings, on=['url'], how='left')

# Combine the two columns
df_dblp['ConferenceLocation_x'] = df_dblp['ConferenceLocation_x'].fillna(df_dblp['ConferenceLocation_y'])
df_dblp.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_dblp = df_dblp.drop(columns=['ConferenceLocation_y'])

# Column sort
df_dblp = df_dblp.reindex(sorted(df_dblp.columns), axis=1)

df_dblp.iloc[:5]

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,id,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,3062808,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,,On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,3062809,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,3062811,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,3062812,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,3062813,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
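
The merge above relies on pandas adding the `_x` and `_y` suffixes when both dataframes contain a column with the same name. A small sketch of that behaviour on toy data follows. The sample rows are made up, and the `validate="many_to_one"` argument is an optional extra check (not used in the cell above) that raises an error if a URL appears more than once in the proceedings dataframe, which would otherwise duplicate paper rows.

```python
import numpy as np
import pandas as pd

papers = pd.DataFrame({
    "key": ["conf/a/P1", "conf/a/P2", "conf/b/P3"],
    "url": ["db/conf/a/a2000.html", "db/conf/a/a2000.html", "db/conf/b/b2001.html"],
    "ConferenceLocation": pd.Series([np.nan] * 3, dtype=object),
})
proceedings = pd.DataFrame({
    "url": ["db/conf/a/a2000.html"],
    "ConferenceLocation": ["Eilat, Israel"],
})

merged = pd.merge(papers, proceedings, on=["url"], how="left", validate="many_to_one")
print(list(merged.columns))
# ['key', 'url', 'ConferenceLocation_x', 'ConferenceLocation_y']

# Fill the empty left-hand column with the values from the proceedings
merged["ConferenceLocation"] = merged["ConferenceLocation_x"].fillna(merged["ConferenceLocation_y"])
merged = merged.drop(columns=["ConferenceLocation_x", "ConferenceLocation_y"])
```

Papers whose URL has no match in the proceedings dataframe keep a missing location, as in the third toy row.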


Count of the papers whose conference location is still missing:

In [14]:
n_missing = len(df_dblp.index) - len(df_dblp.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

88110 missing paper's conference locations


## Filtering the Seminar Related Papers

In [15]:
df_dblp = df_dblp[df_dblp['ConferenceTitle'].str.contains("Seminar") == False]
df_dblp = df_dblp[df_dblp['ConferenceTitle'].str.contains("seminar") == False]
df_dblp

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,id,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,3062808,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,,On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,3062809,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,3062811,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,3062812,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,3062813,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...,...,...
2484040,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,8958667,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2484041,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,8958671,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2484042,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,8958672,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2484043,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,8958682,series/sapere/Freed13,db/series/sapere/sapere5.html,2011
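
One detail of the filter above is worth spelling out: for a missing title, `str.contains` returns a missing value rather than `False`, so the `== False` comparison fails and the row is removed as well. Papers without a matching proceedings title are therefore likely dropped together with the seminar papers. The sketch below illustrates this on made-up titles, assuming an object-dtype column; the single-pass variant with `na=True` is an assumed equivalent, not the notebook's code.

```python
import numpy as np
import pandas as pd

titles = pd.Series(
    ["Dagstuhl Seminar on Databases", "CoopIS 2000", np.nan, "A seminar series"],
    dtype=object,
)

# Filter as in the cell above: NaN == False evaluates to False
keep_original = (titles.str.contains("Seminar") == False) & (titles.str.contains("seminar") == False)

# Single-pass variant: na=True treats missing titles as matches, so they are dropped too
keep_single = ~titles.str.contains("Seminar|seminar", na=True).astype(bool)

print(keep_original.tolist())  # → [False, True, False, False]
```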


Drop of the paper ID column:

In [16]:
df_dblp = df_dblp.drop(columns=['id'])
df_dblp.loc[0:3]

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,,On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997


## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [17]:
# Write of the resulting CSV on Disk
df_dblp.to_csv(path_file_export + 'out_dblp_papers_and_locations.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_dblp_papers_and_locations.csv')

Reload the exported CSV to check that the export completed correctly.

In [18]:
# Check of the Exported CSV
df_dblp_exported_csv = pd.read_csv(path_file_export + 'out_dblp_papers_and_locations.csv', low_memory=False, index_col=[0])
df_dblp_exported_csv

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,,On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...,...
2484040,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2484041,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2484042,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2484043,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html,2011
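
Beyond inspecting the reloaded dataframe by eye, a few assertions can confirm that the export round-trips. The sketch below uses a toy dataframe and an in-memory buffer in place of the real file; the sample values are made up.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ConferenceLocation": ["Eilat, Israel", np.nan],
    "year": [2000, 2004],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer, index_col=[0])

# The shape and the number of missing locations should survive the round trip
assert reloaded.shape == df.shape
assert reloaded["ConferenceLocation"].isna().sum() == df["ConferenceLocation"].isna().sum()
```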
