# Fix the Missing MAG Conferences Location Using the DBLP Raw Dump

Jupyter Notebook that fills in the missing MAG conference locations using the DBLP proceedings.

This process needs two CSV files: ```out_dblp_raw_proceedings.csv``` and ```out_mag_citations_count_and_conferences.csv```. <br>
The first file must be generated by following the instructions contained in the ```preprocess_dblp.ipynb``` Notebook, which is located in the ```1 - Citation Dumps Preprocess``` folder of this repository.<br>
The second one must be generated by running the ```preprocess_mag.ipynb``` Notebook, which is located in the ```1 - Citation Dumps Preprocess``` folder of this repository.

In particular, the following operations are executed:
* Opening the preprocessed MAG CSV dump
* Opening the raw proceedings dump
* Fixing the missing conference names
* Obtaining the locations from the raw proceedings dump

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import spacy
import numpy as np

pd.set_option('display.max_columns', None)

## Download of the spaCy Pipeline
Before getting started, be sure to download the spaCy pipeline that is needed for the NLP operations.

You can do so by running the following command in a terminal: ```python -m spacy download en_core_web_lg```
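
A quick way to confirm that the pipeline is available before running the notebook is to check whether its package can be found, since spaCy pipelines are installed as regular Python packages. The sketch below uses only the standard library; the helper name `is_pipeline_installed` is an assumption made for illustration.

```python
from importlib.util import find_spec


def is_pipeline_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when no such top-level package is installed
    return find_spec(package_name) is not None


if not is_pipeline_installed("en_core_web_lg"):
    print("Pipeline missing: run 'python -m spacy download en_core_web_lg'")
```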

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
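
The cells in this notebook build file names by concatenating a directory string and a file name, which only works when the directory string ends with a separator. As a hedged alternative, `pathlib` joins the parts regardless of trailing slashes. The directory in this sketch is a placeholder, not the author's path.

```python
from pathlib import PurePosixPath

# Placeholder directory, used only for illustration
export_dir = PurePosixPath("/data/dumps/export")

# The "/" operator inserts the separator when needed
mag_csv = export_dir / "out_mag_citations_count_and_conferences.csv"

print(mag_csv)
```

With a concrete `pathlib.Path`, the resulting object can be passed to `pd.read_csv` in place of the concatenated string.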

## Read of the CSV Preprocessed Dump

In [3]:
df_mag = pd.read_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv', low_memory=False, index_col=[0])

# Column sort
df_mag = df_mag.reindex(sorted(df_mag.columns), axis=1)

df_mag

Unnamed: 0,CitationCount,ConferenceDisplayName,ConferenceInstanceID,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesID,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperID,PaperTitle,Year
0,12,DISC 2014,4038532.0,"Austin, TX",disc 2014,International Symposium on Distributed Computing,1.131603e+09,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,14558443,the adaptive priority queue with elimination a...,2014
1,10,ESA 2014,157008481.0,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,1.154039e+09,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,15354235,document retrieval on repetitive collections,2014
2,20,,,,,Information and Communication Technologies in ...,1.196984e+09,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,24327294,socomo marketing for travel and tourism,2013
3,0,,,,,Database and Expert Systems Applications,1.192665e+09,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,60437532,similarity image retrieval system using hierar...,2002
4,19,,,,,International Conference on Artificial Intelli...,1.176896e+09,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,198056957,leukemia prediction from gene expression data ...,2006
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409811,0,,,,,Conference of the Industrial Electronics Society,2.623572e+09,IECON,10.1109/IECON43393.2020.9254316,0,Loss Reduction by Synchronous Rectification in...,3102242761,loss reduction by synchronous rectification in...,2020
4409812,0,,,,,International Symposium on Broadband Multimedi...,2.623662e+09,BMSB,10.1109/BMSB49480.2020.9379806,0,Data Over Cable Services – Improving the BICM ...,3136855299,data over cable services improving the bicm ca...,2020
4409813,0,,,,,American Control Conference,2.238538e+09,ACC,10.1109/ACC.1988.4172843,0,Model Reference Robust Adaptive Control withou...,3145351916,model reference robust adaptive control withou...,1988
4409814,0,,,,,"International Conference on Acoustics, Speech,...",1.121228e+09,ICASSP,10.1109/ICASSP.2002.1005676,0,Missing data speech recognition in reverberant...,3151696876,missing data speech recognition in reverberant...,2002


## Read of the Proceedings CSV

In [4]:
df_dblp_proceedings = pd.read_csv(path_file_import + 'out_dblp_raw_proceedings.csv', low_memory=False, sep=";")
df_dblp_proceedings

Unnamed: 0,id,address,author,booktitle,cite,cite-label,editor,editor-orcid,ee,ee-type,i,isbn,isbn-type,journal,key,mdate,note,note-type,number,pages,publisher,publisher-href,publtype,school,series,series-href,sub,sup,title,url,volume,year
0,90858,,,,,,Amir Hossein Alavi|Amir Hossein Gandomi|Conor ...,0000-0002-2798-0104|0000-0002-7002-5815|0000-0...,https://doi.org/10.1007/978-3-319-20883-1,,,978-3-319-20882-4,,,reference/genetic/2015,2020-03-27,,,,,Springer,,,,,,,,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html,,2015
1,109806,,,,,,Ankur Agarwal|Borko Furht,,https://doi.org/10.1007/978-1-4614-8495-0,,,978-1-4614-8494-3,,,reference/med/2013,2017-05-16,,,,,Springer,,,,,,,,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html,,2013
2,3062827,,,OTM,,,Douglas C. Schmidt|Robert Meersman|Zahir Tari,,https://doi.org/10.1007/b94348,,,3-540-20498-9,,,conf/coopis/2003,2021-04-30,,,,,Springer,,,,Lecture Notes in Computer Science,db/series/lncs/index.html,,,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2888,2003
3,3062933,,,CoopIS/DOA/ODBASE,,,Robert Meersman|Zahir Tari,,https://doi.org/10.1007/b102176,,,3-540-23662-7,,,conf/coopis/2004-2,2019-05-14,,,,,Springer,,,,Lecture Notes in Computer Science,db/series/lncs/index.html,,,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html,3291,2004
4,3062955,,,CoopIS,,,Carlo Batini|Fausto Giunchiglia|Massimo Mecell...,,https://doi.org/10.1007/3-540-44751-2,,,3-540-42524-1,,,conf/coopis/2001,2019-05-14,,,,,Springer,,,,Lecture Notes in Computer Science,db/series/lncs/index.html,,,"Cooperative Information Systems, 9th Internati...",db/conf/coopis/coopis2001.html,2172,2001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50207,8958161,,,,,,Rudi Studer|Steffen Staab,,https://doi.org/10.1007/978-3-540-92673-3|http...,,,978-3-540-70999-2|978-3-540-92673-3,,,series/ihis/2009hoo,2018-11-14,,,,,Springer,,,,International Handbooks on Information Systems,db/series/ihis/index.html,,,Handbook on Ontologies,db/series/ihis/hoo2009.html,,2009
50208,8958192,,,,,,Jan vom Brocke|Michael Rosemann,,https://doi.org/10.1007/978-3-642-45100-3,,,978-3-642-45099-0|978-3-642-45100-3,,,series/ihis/2015bpm1,2017-05-16,,,,,Springer,,,,International Handbooks on Information Systems,db/series/ihis/index.html,,,"Handbook on Business Process Management 1, Int...",db/series/ihis/bpm2015-1.html,,2015
50209,8958224,,,,,,Jan vom Brocke|Michael Rosemann,,https://doi.org/10.1007/978-3-642-45103-4,,,978-3-642-45102-7|978-3-642-45103-4,,,series/ihis/2015bpm2,2017-05-16,,,,,Springer,,,,International Handbooks on Information Systems,db/series/ihis/index.html,,,"Handbook on Business Process Management 2, Str...",db/series/ihis/bpm2015-2.html,,2015
50210,8958650,,,PT-AI,,,Vincent C. Müller,0000-0002-4144-4957,https://doi.org/10.1007/978-3-642-31674-6,,,978-3-642-31673-9,,,series/sapere/2013-5,2019-09-06,,,,,Springer,,,,"Studies in Applied Philosophy, Epistemology an...",db/series/sapere/index.html,,,Philosophy and Theory of Artificial Intelligen...,db/series/sapere/sapere5.html,5,2013


Here the main unneeded columns are removed from the dataframe.

In [5]:
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['address', 'author', 'booktitle', 'cite', 'cite-label', 'editor', 'editor-orcid', 'ee-type', 'i', 'isbn', 'isbn-type', 'journal', 'mdate', 'note', 'note-type', 'number', 'pages', 'publisher', 'publisher-href', 'publtype', 'school', 'sub', 'sup', 'volume'])
df_dblp_proceedings.loc[0:3]

Unnamed: 0,id,ee,key,series,series-href,title,url,year
0,90858,https://doi.org/10.1007/978-3-319-20883-1,reference/genetic/2015,,,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html,2015
1,109806,https://doi.org/10.1007/978-1-4614-8495-0,reference/med/2013,,,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html,2013
2,3062827,https://doi.org/10.1007/b94348,conf/coopis/2003,Lecture Notes in Computer Science,db/series/lncs/index.html,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2003
3,3062933,https://doi.org/10.1007/b102176,conf/coopis/2004-2,Lecture Notes in Computer Science,db/series/lncs/index.html,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html,2004


## Fix of the Missing Conferences Names
Some papers have only the indication of the conference series. For this reason, the conference instance and the related conference locations don't have a value.

However, every paper has been published in a specific "instance" of a conference, hence it should have a location. These papers will be "fixed" considering the year of their publication and their conference.

In [6]:
# Take the first 50 rows and keep only the ones that already have a conference instance name
df_mag_subset = df_mag.iloc[:50].dropna(subset=['ConferenceNormalizedName'])
df_mag_subset.iloc[:10][["Year", "ConferenceSeriesNormalizedName", "ConferenceNormalizedName", "ConferenceDisplayName"]]

Unnamed: 0,Year,ConferenceSeriesNormalizedName,ConferenceNormalizedName,ConferenceDisplayName
0,2014,DISC,disc 2014,DISC 2014
1,2014,ESA,esa 2014,ESA 2014
7,2011,LTC,ltc 2011,LTC 2011
8,2013,CVPR,cvpr 2013,CVPR 2013
14,2008,BMSB,bmsb 2008,BMSB 2008
16,2000,CAV,cav 2000,CAV 2000
19,2008,ISVC,isvc 2008,ISVC 2008
25,2000,CLEO,cleo 2000,CLEO 2000
33,2014,ICC,icc 2014,ICC 2014
35,2000,CRYPTO,crypto 2000,CRYPTO 2000


As the sample above shows, ConferenceNormalizedName appears to be the concatenation of ConferenceSeriesNormalizedName in lowercase, a space, and the paper's year.

**Note**: in the above subset, ConferenceDisplayName seems to be composed in the same way as ConferenceNormalizedName, only without the lowercasing. However, this is not always true!

Now we're going to populate the ConferenceNormalizedName values that are missing.

In [7]:
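# Assign the result back: fillna(inplace=True) on a selected column may not update df_mag in recent pandas versions
df_mag['ConferenceNormalizedName'] = df_mag['ConferenceNormalizedName'].fillna(df_mag['ConferenceSeriesNormalizedName'].str.lower() + ' ' + df_mag['Year'].astype(str))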
df_mag.iloc[:5]

Unnamed: 0,CitationCount,ConferenceDisplayName,ConferenceInstanceID,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesID,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperID,PaperTitle,Year
0,12,DISC 2014,4038532.0,"Austin, TX",disc 2014,International Symposium on Distributed Computing,1131603000.0,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,14558443,the adaptive priority queue with elimination a...,2014
1,10,ESA 2014,157008481.0,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,1154039000.0,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,15354235,document retrieval on repetitive collections,2014
2,20,,,,enter 2013,Information and Communication Technologies in ...,1196984000.0,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,24327294,socomo marketing for travel and tourism,2013
3,0,,,,dexa 2002,Database and Expert Systems Applications,1192665000.0,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,60437532,similarity image retrieval system using hierar...,2002
4,19,,,,icaisc 2006,International Conference on Artificial Intelli...,1176896000.0,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,198056957,leukemia prediction from gene expression data ...,2006
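
The fill rule applied above can be restated for a single record. This is only an illustrative sketch in plain Python (the notebook applies the same rule column-wise with pandas); the function name is an assumption.

```python
from typing import Optional


def fill_normalized_name(current: Optional[str], series_name: str, year: int) -> str:
    """Keep the existing instance name, or derive it from the series and the year."""
    if current is not None:
        return current
    # Same convention observed in the data: lowercase series name, space, year
    return f"{series_name.lower()} {year}"


print(fill_normalized_name("disc 2014", "DISC", 2014))
print(fill_normalized_name(None, "ENTER", 2013))
```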


I tried a new merge with the Conference Instances dataframe (this time on the ConferenceNormalizedName column), but had no luck: these conference instances are missing. That is probably the reason for the NaN values in the ConferenceInstanceID field of the original Papers table.

Drop of the Conference Display Name column:

In [8]:
df_mag = df_mag.drop(columns=['ConferenceDisplayName'])

## Drop of the ID Columns from the MAG Dataframe
These columns are not needed anymore, since they were needed only for the join operations between the MAG tables.

In [9]:
df_mag = df_mag.drop(columns=['ConferenceInstanceID', 'ConferenceSeriesID', 'PaperID'])
df_mag.loc[0:3]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,20,,enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
3,0,,dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002


## Extraction of the Conference Location from the Proceeding Title

Adding the location column to the proceedings dataframe:

In [10]:
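# Use the object dtype so that string locations can be stored later without a dtype mismatch
df_dblp_proceedings['ConferenceLocation'] = pd.Series(np.nan, index=df_dblp_proceedings.index, dtype=object)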

Reset the index so that it matches the row positions:

In [11]:
df_dblp_proceedings = df_dblp_proceedings.reset_index()

Extraction of the locations using Spacy:

In [12]:
nlp = spacy.load("en_core_web_lg")

for index, row in df_dblp_proceedings.iterrows():
    doc = nlp(row['title'])

    # Join every geopolitical entity (GPE) found in the title, in order of appearance
    conf_location = ", ".join(ent.text for ent in doc.ents if ent.label_ == "GPE")

    # An empty string means that no location was found
    df_dblp_proceedings.at[index, 'ConferenceLocation'] = conf_location if conf_location else np.nan

Count of how many proceedings locations are still missing:

In [13]:
n_missing = len(df_dblp_proceedings.index) - len(df_dblp_proceedings.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

2105 missing paper's conference locations
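
The extraction step above keeps only the entities labeled GPE (geopolitical entities such as cities, regions, and countries) and joins them with a comma. The sketch below reproduces that joining logic on hand-written (text, label) pairs that stand in for spaCy's `doc.ents`, so it runs without the pipeline; the pairs and the function name are assumptions for illustration, not real model output.

```python
from typing import List, Optional, Tuple


def join_gpe_entities(entities: List[Tuple[str, str]]) -> Optional[str]:
    """Join the text of every GPE entity, or return None if there is none."""
    places = [text for text, label in entities if label == "GPE"]
    return ", ".join(places) if places else None


# Hand-written stand-ins for (ent.text, ent.label_) pairs
with_location = [("9th", "ORDINAL"), ("Trento", "GPE"), ("Italy", "GPE"), ("2001", "DATE")]
without_location = [("2015", "DATE")]

print(join_gpe_entities(with_location))
print(join_gpe_entities(without_location))
```

Because the result depends entirely on the labels assigned by the model, a mislabeled token ends up in the location. The value Agent that appears in the ConferenceLocation column of one of the tables in this notebook looks like such a case.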


## Removing the Proceedings Data that We Don't Need Anymore

Here the remaining unneeded columns are removed from the dataframe.

In [14]:
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['id', 'key', 'series', 'series-href', 'ee'])
df_dblp_proceedings.loc[0:3]

Unnamed: 0,index,title,url,year,ConferenceLocation
0,0,Handbook of Genetic Programming Applications,db/reference/genetic/genetic2015.html,2015,
1,1,Handbook of Medical and Healthcare Technologies,db/reference/med/med2013.html,2013,
2,2,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2003,"Catania, Sicily, Italy"
3,3,On the Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2004-2.html,2004,


Filtering out the proceedings without a location:

In [15]:
df_dblp_proceedings = df_dblp_proceedings.dropna(subset = ['ConferenceLocation'])
df_dblp_proceedings = df_dblp_proceedings.drop(columns=['index'])
df_dblp_proceedings

Unnamed: 0,title,url,year,ConferenceLocation
2,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2003,"Catania, Sicily, Italy"
4,"Cooperative Information Systems, 9th Internati...",db/conf/coopis/coopis2001.html,2001,"Trento, Italy"
5,Proceedings of the First IFCIS International C...,db/conf/coopis/coopis96.html,1996,"Brussels, Belgium"
6,"Cooperative Information Systems, 7th Internati...",db/conf/coopis/coopis2000.html,2000,"Eilat, Israel"
7,Proceedings of the Third International Confere...,db/conf/coopis/coopis95.html,1995,"Vienna, Austria"
...,...,...,...,...
50158,"Constraint Programming: Basics and Trends, Châ...",db/journals/lncs/lncs910.html,1995,France
50198,Next Frontier in Agent-Based Complex Automated...,db/series/sci/sci596.html,2015,Agent
50204,"Software Engineering Research, Management and ...",db/series/sci/sci578.html,2015,"Kitakyushu, Japan"
50205,Advances in Knowledge Discovery and Management...,db/series/sci/sci471.html,2013,"Brest, France"


## Filtering Out the Seminar-Related Proceedings

In [16]:
# Keep only the proceedings whose title does not mention a seminar
df_dblp_proceedings = df_dblp_proceedings[~df_dblp_proceedings['title'].str.contains("Seminar|seminar")]
df_dblp_proceedings

Unnamed: 0,title,url,year,ConferenceLocation
2,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2003,"Catania, Sicily, Italy"
4,"Cooperative Information Systems, 9th Internati...",db/conf/coopis/coopis2001.html,2001,"Trento, Italy"
5,Proceedings of the First IFCIS International C...,db/conf/coopis/coopis96.html,1996,"Brussels, Belgium"
6,"Cooperative Information Systems, 7th Internati...",db/conf/coopis/coopis2000.html,2000,"Eilat, Israel"
7,Proceedings of the Third International Confere...,db/conf/coopis/coopis95.html,1995,"Vienna, Austria"
...,...,...,...,...
50158,"Constraint Programming: Basics and Trends, Châ...",db/journals/lncs/lncs910.html,1995,France
50198,Next Frontier in Agent-Based Complex Automated...,db/series/sci/sci596.html,2015,Agent
50204,"Software Engineering Research, Management and ...",db/series/sci/sci578.html,2015,"Kitakyushu, Japan"
50205,Advances in Knowledge Discovery and Management...,db/series/sci/sci471.html,2013,"Brest, France"


## Preparing the Dataframes for the Join Operation

Before joining the two dataframes, we need to create some auxiliary columns that will help us during the following join operations:

In [17]:
df_dblp_proceedings['ConferenceNameForJoin'] = df_dblp_proceedings['url'].copy()
df_mag['ConferenceNameForJoin'] = df_mag['ConferenceNormalizedName'].copy()

Now we're going to "prepare" the conference names for the join operations:

In [18]:
df_mag.ConferenceNameForJoin = df_mag.ConferenceNameForJoin.str.split(' ').str[0] + df_mag.ConferenceNameForJoin.str.split(' ').str[1]
df_mag.loc[0:3]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year,ConferenceNameForJoin
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014,disc2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014,esa2014
2,20,,enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013,enter2013
3,0,,dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002,dexa2002


In [19]:
df_dblp_proceedings.ConferenceNameForJoin = df_dblp_proceedings.ConferenceNameForJoin.str.split('/').str[3]
df_dblp_proceedings.ConferenceNameForJoin = df_dblp_proceedings.ConferenceNameForJoin.str.split('.').str[0]

# Operation for the proceedings that have multiple volumes
df_dblp_proceedings.ConferenceNameForJoin = df_dblp_proceedings.ConferenceNameForJoin.str.split('-').str[0]
df_dblp_proceedings = df_dblp_proceedings.drop_duplicates(subset="ConferenceNameForJoin")

Reset the index so that it matches the row positions:

In [20]:
df_dblp_proceedings = df_dblp_proceedings.reset_index(drop=True)
df_dblp_proceedings.loc[:3]

Unnamed: 0,title,url,year,ConferenceLocation,ConferenceNameForJoin
0,On The Move to Meaningful Internet Systems 200...,db/conf/coopis/coopis2003.html,2003,"Catania, Sicily, Italy",coopis2003
1,"Cooperative Information Systems, 9th Internati...",db/conf/coopis/coopis2001.html,2001,"Trento, Italy",coopis2001
2,Proceedings of the First IFCIS International C...,db/conf/coopis/coopis96.html,1996,"Brussels, Belgium",coopis96
3,"Cooperative Information Systems, 7th Internati...",db/conf/coopis/coopis2000.html,2000,"Eilat, Israel",coopis2000
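
The two key-building steps can be summarized as a pair of small functions. This is a plain-Python sketch of the same string operations, with function names chosen here for illustration; the first function assumes that the name contains at least two space-separated tokens.

```python
def mag_name_to_join_key(normalized_name: str) -> str:
    """Concatenate the first two space-separated tokens: 'disc 2014' -> 'disc2014'."""
    tokens = normalized_name.split(" ")
    return tokens[0] + tokens[1]


def dblp_url_to_join_key(url: str) -> str:
    """Take the file name of the DBLP URL, drop the extension and the volume suffix."""
    file_name = url.split("/")[3]
    stem = file_name.split(".")[0]
    return stem.split("-")[0]


print(mag_name_to_join_key("disc 2014"))
print(dblp_url_to_join_key("db/conf/coopis/coopis2004-2.html"))
print(dblp_url_to_join_key("db/conf/coopis/coopis96.html"))
```

Note that a key such as coopis96 keeps the two-digit year used in the URL, so it cannot match a MAG key built from a four-digit year, such as coopis1996.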


## Join of the New Location Data with the Original Dataframe (Based on the Conference Normalized Name)

In [21]:
# Merge with the location dataframe
df_mag = pd.merge(df_mag, df_dblp_proceedings, on=['ConferenceNameForJoin'], how='left')

# Combine the two columns
df_mag['ConferenceLocation_x'] = df_mag['ConferenceLocation_x'].fillna(df_mag['ConferenceLocation_y'])
df_mag.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag = df_mag.drop(columns=['ConferenceLocation_y'])
df_mag = df_mag.drop(columns=['year'])

df_mag.iloc[:4]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year,ConferenceNameForJoin,title,url
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014,disc2014,Distributed Computing - 28th International Sym...,db/conf/wdag/disc2014.html
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014,esa2014,Algorithms - ESA 2014 - 22th Annual European S...,db/conf/esa/esa2014.html
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013,enter2013,Information and Communication Technologies in ...,db/conf/enter/enter2013.html
3,0,"Provence, France",dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002,dexa2002,"Database and Expert Systems Applications, 13th...",db/conf/dexa/dexa2002.html
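
The merge-then-fill pattern used above can be tried on a toy example. The sketch below uses made-up rows (only the column names come from the notebook) and shows how the _x and _y suffixes appear and how the existing location takes precedence over the one coming from the proceedings. The `validate` argument is an optional safeguard added here, not part of the original cell.

```python
import pandas as pd

# Toy data: only the column names mirror the notebook, the rows are made up
papers = pd.DataFrame({
    "ConferenceNameForJoin": ["aaa2001", "bbb2002", "ccc2003"],
    "ConferenceLocation": ["Rome, Italy", None, None],
})
proceedings = pd.DataFrame({
    "ConferenceNameForJoin": ["aaa2001", "bbb2002"],
    "ConferenceLocation": ["Milan, Italy", "Paris, France"],
})

# validate="m:1" raises an error if a key is duplicated on the right side
merged = pd.merge(papers, proceedings, on="ConferenceNameForJoin",
                  how="left", validate="m:1")

# Keep the existing value (_x) and fall back to the proceedings value (_y)
merged["ConferenceLocation"] = merged["ConferenceLocation_x"].fillna(
    merged["ConferenceLocation_y"])
merged = merged.drop(columns=["ConferenceLocation_x", "ConferenceLocation_y"])

n_missing = int(merged["ConferenceLocation"].isna().sum())
print(merged)
print(f"{n_missing} locations still missing")
```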


Reset the index so that it matches the row positions:

In [22]:
df_mag = df_mag.reset_index(drop=True)
df_mag.loc[:2]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year,ConferenceNameForJoin,title,url
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014,disc2014,Distributed Computing - 28th International Sym...,db/conf/wdag/disc2014.html
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014,esa2014,Algorithms - ESA 2014 - 22th Annual European S...,db/conf/esa/esa2014.html
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013,enter2013,Information and Communication Technologies in ...,db/conf/enter/enter2013.html


Count of how many papers' conference locations are still missing:

In [23]:
n_missing = len(df_mag.index) - len(df_mag.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

1876842 missing paper's conference locations


Drop of the columns that we don't need

In [24]:
df_mag = df_mag.drop(columns=['ConferenceNameForJoin', 'title', 'url'])
df_mag.loc[:2]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013


## Extraction of the Conference Location Searching Within the Proceeding Title

Since the join on the normalized conference name that we created did not produce enough matches, we are going to try a match based on a partial string comparison against the proceedings title.

**Note**: we don't have the "Display Names" of the conference instances, only the ConferenceSeriesDisplayName. For this reason, we also use the conference year, so that only the specific proceedings row we need is matched.

Let's get the conferences that need to be fixed:

In [25]:
df_mag_conferences = df_mag[["ConferenceNormalizedName", "ConferenceSeriesDisplayName", "ConferenceLocation", "Year"]]

Drop of the papers that don't need their location to be fixed.

In [26]:
df_mag_conferences = df_mag_conferences[df_mag_conferences["ConferenceLocation"].isna()]
df_mag_conferences

Unnamed: 0,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceLocation,Year
10,acc 1990,American Control Conference,,1990
13,asilomar 1991,"Asilomar Conference on Signals, Systems and Co...",,1991
23,ire 1964,IRE International Convention Record,,1964
26,icet 2007,International Conference on Emerging Technologies,,2007
27,ecml 1994,European conference on Machine Learning,,1994
...,...,...,...,...
4409793,icieam 2017,International Conference on Industrial Enginee...,,2017
4409796,dueu 2018,"International Conference of Design, User Exper...",,2018
4409803,ra 2004,Robotics and Applications,,2004
4409806,fnces 2012,National Conference for Engineering Sciences,,2012


Drop of the duplicated conferences. We only need unique values.

In [27]:
df_mag_conferences = df_mag_conferences.drop_duplicates(subset="ConferenceNormalizedName")

n_conf = len(df_mag_conferences)
print(f"Now we only need to search for the location of {n_conf} unique conferences")

Now we only need to search for the location of 17027 unique conferences


Reset the index so that it matches the row positions (the old index is kept in the index column):

In [28]:
df_mag_conferences = df_mag_conferences.reset_index()
df_mag_conferences

Unnamed: 0,index,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceLocation,Year
0,10,acc 1990,American Control Conference,,1990
1,13,asilomar 1991,"Asilomar Conference on Signals, Systems and Co...",,1991
2,23,ire 1964,IRE International Convention Record,,1964
3,26,icet 2007,International Conference on Emerging Technologies,,2007
4,27,ecml 1994,European conference on Machine Learning,,1994
...,...,...,...,...,...
17022,4391498,qels 1990,Quantum Electronics and Laser Science Conference,,1990
17023,4392151,iait 2014,Advances in Information Technology,,2014
17024,4395100,gi 1998,Graphics Interface,,1998
17025,4405987,iwaal 2016,International Workshop on Ambient Assisted Living,,2016


Searching the locations with a match on the proceeding title:

In [29]:
for mag_index, mag_row in df_mag_conferences.iterrows():

    print(f"Searching the location for the conference {mag_index + 1} out of {n_conf}") # TODO DEBUG

    match = False

    for proc_index, proc_row in df_dblp_proceedings.iterrows():

        if mag_row['ConferenceSeriesDisplayName'] in proc_row['title'] and mag_row['Year'] == proc_row['year']:
            df_mag_conferences.at[mag_index, 'ConferenceLocation'] = proc_row['ConferenceLocation']
            match = True
            break
    
    # If we got a match, we remove the matched proceeding to speed up the next searches
    if match:
        df_dblp_proceedings.drop(proc_index, inplace=True)

df_mag_conferences

Searching the location for the conference 1 out of 17027
Searching the location for the conference 2 out of 17027
Searching the location for the conference 3 out of 17027
Searching the location for the conference 4 out of 17027
Searching the location for the conference 5 out of 17027
Searching the location for the conference 6 out of 17027
Searching the location for the conference 7 out of 17027
Searching the location for the conference 8 out of 17027
Searching the location for the conference 9 out of 17027
Searching the location for the conference 10 out of 17027
Searching the location for the conference 11 out of 17027
Searching the location for the conference 12 out of 17027
Searching the location for the conference 13 out of 17027
Searching the location for the conference 14 out of 17027
Searching the location for the conference 15 out of 17027
Searching the location for the conference 16 out of 17027
Searching the location for the conference 17 out of 17027
...

Unnamed: 0,index,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceLocation,Year
0,10,acc 1990,American Control Conference,,1990
1,13,asilomar 1991,"Asilomar Conference on Signals, Systems and Co...",,1991
2,23,ire 1964,IRE International Convention Record,,1964
3,26,icet 2007,International Conference on Emerging Technologies,"Patras, Greece",2007
4,27,ecml 1994,European conference on Machine Learning,,1994
...,...,...,...,...,...
17022,4391498,qels 1990,Quantum Electronics and Laser Science Conference,,1990
17023,4392151,iait 2014,Advances in Information Technology,India,2014
17024,4395100,gi 1998,Graphics Interface,"Vancouver, BC, Canada",1998
17025,4405987,iwaal 2016,International Workshop on Ambient Assisted Living,,2016
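
The nested loop above compares every conference with every remaining proceedings row. One possible way to reduce the work, sketched here in plain Python under the assumption that the year is an exact-match condition (as in the loop), is to group the proceedings by year first and scan only the group for the conference's year. The records and the function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_year(proceedings: List[dict]) -> Dict[int, List[dict]]:
    """Index the proceedings by year so that each lookup scans one year only."""
    by_year: Dict[int, List[dict]] = defaultdict(list)
    for proc in proceedings:
        by_year[proc["year"]].append(proc)
    return by_year


def find_location(series_name: str, year: int,
                  by_year: Dict[int, List[dict]]) -> Optional[str]:
    """Return the location of the first same-year title containing the series name."""
    candidates = by_year.get(year, [])
    for position, proc in enumerate(candidates):
        if series_name in proc["title"]:
            # Like the loop above, consume the matched proceedings row
            del candidates[position]
            return proc["location"]
    return None


# Made-up records, for illustration only
proceedings = [
    {"title": "Proceedings of the Example Conference on Widgets, Oslo, Norway",
     "year": 2007, "location": "Oslo, Norway"},
    {"title": "Proceedings of the Example Conference on Widgets, Lima, Peru",
     "year": 2008, "location": "Lima, Peru"},
]
by_year = group_by_year(proceedings)

print(find_location("Example Conference on Widgets", 2008, by_year))
print(find_location("Example Conference on Widgets", 1990, by_year))
```

The matching rule itself (series display name contained in the title, same year) is unchanged, so this variant inherits the same limits: a series name that is spelled differently in the proceedings title still goes unmatched.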


Count of how many conferences locations are still missing:

In [30]:
n_missing = len(df_mag_conferences.index) - len(df_mag_conferences.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

14911 missing paper's conference locations


Drop of the conferences that still don't have a location:

In [31]:
df_mag_conferences = df_mag_conferences.dropna(subset = ['ConferenceLocation'])

Drop of the columns that we don't need anymore

In [32]:
df_mag_conferences = df_mag_conferences.drop(columns=['index', 'ConferenceSeriesDisplayName', 'Year'])

## Join of the New Location Data with the Original Dataframe (Based on the Proceeding Title and Year)

Join with the original dataframe

In [34]:
df_mag = pd.merge(df_mag, df_mag_conferences, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag['ConferenceLocation_x'] = df_mag['ConferenceLocation_x'].fillna(df_mag['ConferenceLocation_y'])
df_mag.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag = df_mag.drop(columns=['ConferenceLocation_y'])

df_mag.iloc[:5]

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
3,0,"Provence, France",dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002
4,19,"Zakopane, Poland",icaisc 2006,International Conference on Artificial Intelli...,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,leukemia prediction from gene expression data ...,2006
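
The merge-then-fill pattern used above can be illustrated on a small example. This is only a sketch on hypothetical toy data (`papers` and `recovered` are made-up names). It also passes `validate='many_to_one'` to `pd.merge`, which is an addition not present in the notebook: it makes pandas raise an error if the right-hand dataframe contains duplicate conference names, since those would silently duplicate paper rows in a left join.

```python
import numpy as np
import pandas as pd

# Hypothetical papers dataframe: some locations known, some missing
papers = pd.DataFrame({
    'ConferenceNormalizedName': ['disc 2014', 'gi 1998', 'gi 1998', 'acc 1988'],
    'ConferenceLocation': ['Austin, TX', np.nan, np.nan, np.nan],
})

# Hypothetical recovered locations, one row per conference
recovered = pd.DataFrame({
    'ConferenceNormalizedName': ['gi 1998'],
    'ConferenceLocation': ['Vancouver, BC, Canada'],
})

# Left join keeps every paper; validate guards against duplicated keys on the right
merged = pd.merge(
    papers, recovered,
    on='ConferenceNormalizedName', how='left', validate='many_to_one',
)

# Keep the original location and fall back to the recovered one where it is missing
merged['ConferenceLocation'] = merged['ConferenceLocation_x'].fillna(
    merged['ConferenceLocation_y']
)
merged = merged.drop(columns=['ConferenceLocation_x', 'ConferenceLocation_y'])

print(merged)
```

In this toy case the two `gi 1998` papers receive the recovered location, while `acc 1988` stays missing because no location was recovered for it.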


Count of the papers whose conference location is still missing:

In [35]:
n_missing = len(df_mag.index) - len(df_mag.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

1694764 missing paper's conference locations


## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [36]:
# Write of the resulting CSV on Disk
df_mag.to_csv(path_file_export + 'out_mag_citations_and_locations.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_citations_and_locations.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_citations_and_locations.csv


Check of the exported CSV to make sure that the export succeeded:

In [37]:
# Check of the Exported CSV
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_and_locations.csv', low_memory=False, index_col=[0])
df_mag_exported_csv

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
3,0,"Provence, France",dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002
4,19,"Zakopane, Poland",icaisc 2006,International Conference on Artificial Intelli...,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,leukemia prediction from gene expression data ...,2006
...,...,...,...,...,...,...,...,...,...,...
4409807,0,Singapore,iecon 2020,Conference of the Industrial Electronics Society,IECON,10.1109/IECON43393.2020.9254316,0,Loss Reduction by Synchronous Rectification in...,loss reduction by synchronous rectification in...,2020
4409808,0,"Paris, France",bmsb 2020,International Symposium on Broadband Multimedi...,BMSB,10.1109/BMSB49480.2020.9379806,0,Data Over Cable Services – Improving the BICM ...,data over cable services improving the bicm ca...,2020
4409809,0,,acc 1988,American Control Conference,ACC,10.1109/ACC.1988.4172843,0,Model Reference Robust Adaptive Control withou...,model reference robust adaptive control withou...,1988
4409810,0,"Orlando, Florida, USA",icassp 2002,"International Conference on Acoustics, Speech,...",ICASSP,10.1109/ICASSP.2002.1005676,0,Missing data speech recognition in reverberant...,missing data speech recognition in reverberant...,2002
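
Besides inspecting the reloaded dataframe visually, the round trip can be checked programmatically. The following is a minimal sketch on a hypothetical toy dataframe written to a temporary directory; the file name `toy_export.csv` and the three checks are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_mag
toy = pd.DataFrame({
    'CitationCount': [12, 0],
    'ConferenceLocation': ['Austin, TX', np.nan],
    'ConferenceNormalizedName': ['disc 2014', 'acc 1988'],
})

# Write the CSV to a temporary directory and read it back like the cell above does
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_export.csv')
    toy.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, low_memory=False, index_col=[0])

# Compare shape, column names and number of missing locations
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_missing = bool(
    reloaded['ConferenceLocation'].isna().sum() == toy['ConferenceLocation'].isna().sum()
)

print(same_shape, same_columns, same_missing)
```

The same three comparisons could be applied to `df_mag` and `df_mag_exported_csv` after the export.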
