# Citation and Conference Data Join - DBLP and MAG

This Jupyter Notebook joins the conference and location data of the DBLP and MAG dumps.

It requires the following CSV files: ```out_dblp_papers_and_locations_final.csv``` and ```out_mag_citations_and_locations_final.csv```. <br>
Both files must be generated by running the Notebooks contained in the ```2 - Conference Location Fix``` folder of this Project.

In particular, the notebook performs the following operations:
* Opening the preprocessed CSV dumps
* Creating the conference identifiers for DBLP
* Joining the two datasets
* Dropping the columns that are no longer needed
* Fixing the mismatched data types

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
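
As an optional alternative to concatenating strings, the paths can be built with `pathlib`. The sketch below is only an illustration: `base_dir` and the `Import`/`Export` sub-folder layout are assumptions and must be adapted to your own machine.

```python
from pathlib import Path

# Hypothetical base directory: replace it with your own working directory.
base_dir = Path("data")

# Hypothetical sub-folders mirroring the Import/Export layout used above.
import_dir = base_dir / "Import"
export_dir = base_dir / "Export"

# The "/" operator joins path segments, so a missing trailing slash is not an issue.
dblp_csv = export_dir / "out_dblp_papers_and_locations_final.csv"
mag_csv = export_dir / "out_mag_citations_and_locations_final.csv"

print(dblp_csv.name)
```

`pd.read_csv` and `DataFrame.to_csv` accept `Path` objects, so `dblp_csv` could be passed to them directly.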

## Reading the Preprocessed CSV Dumps

In [3]:
df_dblp = pd.read_csv(path_file_export + 'out_dblp_papers_and_locations_final.csv', low_memory=False, index_col=[0])
df_dblp

Unnamed: 0,ConferenceLocation,ConferenceTitle,crossref,ee,key,url,year
0,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html,2000
1,"Agia Napa, Cyprus",On the Move to Meaningful Internet Systems 200...,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html,2004
2,"Eilat, Israel","Cooperative Information Systems, 7th Internati...",conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html,2000
3,"Kiawah Island, South Carolina, USA",Proceedings of the Second IFCIS International ...,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html,1997
4,"Catania, Sicily, Italy",On The Move to Meaningful Internet Systems 200...,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html,2003
...,...,...,...,...,...,...,...
2434139,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html,2011
2434140,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html,2011
2434141,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html,2011
2434142,"Thessaloniki, Greece",Philosophy and Theory of Artificial Intelligen...,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html,2011


In [4]:
df_mag = pd.read_csv(path_file_export + 'out_mag_citations_and_locations_final.csv', low_memory=False, index_col=[0])
df_mag.drop(df_mag.filter(regex="Unname"), axis=1, inplace=True)
df_mag

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesDisplayName,ConferenceSeriesNormalizedName,Doi,EstimatedCitation,OriginalTitle,PaperTitle,Year
0,12,"Austin, TX",disc 2014,International Symposium on Distributed Computing,DISC,10.1007/978-3-662-45174-8_28,12,The Adaptive Priority Queue with Elimination a...,the adaptive priority queue with elimination a...,2014
1,10,"Wrocław, Poland",esa 2014,European Symposium on Algorithms,ESA,10.1007/978-3-662-44777-2_60,10,Document Retrieval on Repetitive Collections,document retrieval on repetitive collections,2014
2,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,ENTER,10.1007/978-3-319-03973-2_13,20,SoCoMo Marketing for Travel and Tourism,socomo marketing for travel and tourism,2013
3,0,"Provence, France",dexa 2002,Database and Expert Systems Applications,DEXA,10.1007/3-540-46146-9_77,0,Similarity Image Retrieval System Using Hierar...,similarity image retrieval system using hierar...,2002
4,19,"Zakopane, Poland",icaisc 2006,International Conference on Artificial Intelli...,ICAISC,10.1007/11785231_94,19,Leukemia prediction from gene expression data—...,leukemia prediction from gene expression data ...,2006
...,...,...,...,...,...,...,...,...,...,...
4409807,0,Singapore,iecon 2020,Conference of the Industrial Electronics Society,IECON,10.1109/IECON43393.2020.9254316,0,Loss Reduction by Synchronous Rectification in...,loss reduction by synchronous rectification in...,2020
4409808,0,"Paris, France",bmsb 2020,International Symposium on Broadband Multimedi...,BMSB,10.1109/BMSB49480.2020.9379806,0,Data Over Cable Services – Improving the BICM ...,data over cable services improving the bicm ca...,2020
4409809,0,,acc 1988,American Control Conference,ACC,10.1109/ACC.1988.4172843,0,Model Reference Robust Adaptive Control withou...,model reference robust adaptive control withou...,1988
4409810,0,"Orlando, Florida, USA",icassp 2002,"International Conference on Acoustics, Speech,...",ICASSP,10.1109/ICASSP.2002.1005676,0,Missing data speech recognition in reverberant...,missing data speech recognition in reverberant...,2002


## Preparation of the CSV Preprocessed DBLP Dump

### Removing the DOI Website Path from the DOI URL (ee Column)

First, we need to fix the rows that contain multiple URLs separated by the | character. For these rows, only the second URL is kept.

In [None]:
for dblp_index, dblp_row in df_dblp.iterrows():

    # Rows with several URLs: keep only the second one
    ee_parts = str(dblp_row['ee']).split('|')
    if len(ee_parts) >= 2:
        df_dblp.at[dblp_index, 'ee'] = ee_parts[1]
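
On a dump with millions of rows, a row-by-row loop can be slow. A possible vectorized equivalent is sketched below on toy data; the DataFrame `df` and its URLs are made up for illustration and are not taken from the dump.

```python
import pandas as pd

# Toy data: the second row packs two URLs into one field (made-up values).
df = pd.DataFrame({
    "ee": [
        "https://doi.org/10.1007/10722620_29",
        "https://example.org/paper.pdf|https://doi.org/10.1000/example",
        None,
    ]
})

# Split every value on "|"; missing values stay missing.
parts = df["ee"].str.split("|")

# True only for rows with at least two URLs (missing values compare as False).
has_many = parts.str.len() >= 2

# Keep the second URL for those rows, leave the others untouched.
df.loc[has_many, "ee"] = parts[has_many].str[1]
```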

Next, we remove the DOI website path, keeping only the part of the URL that follows ```.org/```:

In [None]:
# The DOI is everything after the first '.org/' (NaN when the URL has no '.org/')
df_dblp['doi'] = df_dblp.ee.str.lower().str.split('.org/', n=1).str[1]

# Column sort
df_dblp = df_dblp.reindex(sorted(df_dblp.columns), axis=1)

df_dblp.loc[:5]
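
To make the DOI extraction concrete, here is a small sketch on three values. The first URL appears in the preview above; the second one is made up, because the preview truncates the real IEEE URLs.

```python
import pandas as pd

ee = pd.Series([
    "https://doi.org/10.1007/10722620_29",
    "http://doi.ieeecomputersociety.org/10.1109/EXAMPLE.1997.1",  # made-up URL
    None,
])

# Lowercase the URL, split once on ".org/" and keep what follows.
# n=1 is used so that the text after the first ".org/" is kept whole.
doi = ee.str.lower().str.split(".org/", n=1).str[1]

print(doi.tolist())
```

Missing URLs simply produce a missing DOI, so no special handling is needed before the split.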

### Creation of the DBLP Conference Instance Identifier
Before proceeding with the join, we need to create a conference identifier that allows us to match the conference data between DBLP and MAG.

The identifier is built in the same way as the ConferenceNormalizedName field of the MAG dump: name of the conference series (lower case) + year, separated by a space (for example, ```disc 2014```).
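
As a quick illustration of the target format, the sketch below builds the identifier for two keys taken from the DBLP preview above. The small DataFrame `df` is only a stand-in for the real dump.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["conf/coopis/ChenD00", "conf/coopis/AbdellatifCL04"],
    "year": [2000, 2004],
})

# The series name is the second "/"-separated segment of the key.
series_name = df["key"].str.split("/").str[1]

# Identifier = series name + space + year, like MAG's ConferenceNormalizedName.
df["ConferenceNormalizedName"] = series_name + " " + df["year"].astype(str)

print(df["ConferenceNormalizedName"].tolist())
```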

#### Extraction of the Conference Series Name from the Column "key"

In [None]:
# The series name is the second segment of the key, e.g. conf/coopis/ChenD00 -> coopis
df_dblp['ConferenceNormalizedName'] = df_dblp.key.str.split('/').str[1]

# Column sort
df_dblp = df_dblp.reindex(sorted(df_dblp.columns), axis=1)
df_dblp.loc[:5]

#### Fix of the Wrong Conference Series Names
The above operation works for most of the rows. However, there are some special cases that we need to fix manually.

For the papers of the following conference series, the name must be extracted from the crossref field:

In [None]:
crossref_list = ['agp', 'aipl', 'amcc', 'amec', 'bcd', 'biotec', 'calculemus', 'cf', 'cnps', 'corr', 'cw', 'daglib', 'dali', 'dbtel', 'ecoopwException', 'express', 'f-egc', 'icsengt', 'ifip5-5', 'infinity', 'isdt', 'iwlcs', 'npiv', 'wcit', 'dsai']

Special cases:

In [None]:
print_counter = 0

for dblp_index, dblp_row in df_dblp.iterrows():

    # Progress message every 25000 rows
    print_counter += 1
    if print_counter == 25000:
        print(f"Row number {dblp_index + 1}")
        print_counter = 0

    crossref = str(dblp_row['crossref'])
    key = str(dblp_row['key'])
    crossref_parts = crossref.split('/')

    # Series listed in crossref_list: take the name from the crossref field
    fixed = len(crossref_parts) >= 2 and crossref_parts[1] in crossref_list
    if fixed:
        df_dblp.at[dblp_index, 'ConferenceNormalizedName'] = crossref_parts[1]

    if not fixed:
        # Crossref values that map to a differently named series
        if "ecoopwException" in crossref:
            df_dblp.at[dblp_index, 'ConferenceNormalizedName'] = "ecoopw"
        elif "icsengt" in crossref:
            df_dblp.at[dblp_index, 'ConferenceNormalizedName'] = "icset"

        # Keys containing 'ali2' or 'ictcs2' map to the series name without the digit
        if "ali2" in key:
            df_dblp.at[dblp_index, 'ConferenceNormalizedName'] = "ali"
        elif "ictcs2" in key:
            df_dblp.at[dblp_index, 'ConferenceNormalizedName'] = "ictcs"
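
The lookup against `crossref_list` could also be expressed without a loop. The sketch below covers only that lookup, not the renaming special cases; the list is shortened and the rows of `df` are made up for illustration.

```python
import pandas as pd

# Shortened, illustrative version of crossref_list.
crossref_list = ["agp", "corr", "express"]

# Made-up rows: only the first crossref belongs to a listed series.
df = pd.DataFrame({
    "crossref": ["conf/agp/2001", "conf/coopis/2000", None],
    "ConferenceNormalizedName": ["lpe", "coopis", "sapere"],
})

# Second "/"-separated segment of crossref (missing when crossref is missing).
crossref_series = df["crossref"].str.split("/").str[1]

# True where that segment is one of the listed series.
use_crossref = crossref_series.isin(crossref_list)

# Replace the name only on the flagged rows.
df["ConferenceNormalizedName"] = df["ConferenceNormalizedName"].mask(
    use_crossref, crossref_series
)
```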


#### Adding the Year to the Conference Series Name

In [None]:
df_dblp.ConferenceNormalizedName = df_dblp.ConferenceNormalizedName + ' ' + df_dblp.year.astype(str)

In [None]:
df_dblp.loc[:5]

### Writing the Join-Ready DBLP Dump to Disk

In [None]:
# Write of the resulting CSV on Disk
df_dblp.to_csv(path_file_export + 'out_dblp_join_ready.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_dblp_join_ready.csv')

## Join Between DBLP and MAG

Rename the DBLP doi column so that it matches the name of the MAG column:

In [6]:
df_dblp = df_dblp.rename(columns={'doi': 'Doi'})

Make sure that all DOIs are in lowercase:

In [7]:
df_dblp.Doi = df_dblp.Doi.str.lower()
df_mag.Doi = df_mag.Doi.str.lower()

Now we can proceed with the join and cleaning operations:

In [8]:
df_mag = pd.merge(df_mag, df_dblp, on=['Doi'], how='outer')

# Combine the conference location columns
df_mag['ConferenceLocation_x'] = df_mag['ConferenceLocation_x'].fillna(df_mag['ConferenceLocation_y'])
df_mag.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag = df_mag.drop(columns=['ConferenceLocation_y'])

# Combine the conference normalized name columns
df_mag['ConferenceNormalizedName_x'] = df_mag['ConferenceNormalizedName_x'].fillna(df_mag['ConferenceNormalizedName_y'])
df_mag.rename(columns = {'ConferenceNormalizedName_x':'ConferenceNormalizedName'}, inplace=True)
df_mag = df_mag.drop(columns=['ConferenceNormalizedName_y'])

# Combine the year columns
df_mag['Year'] = df_mag['Year'].fillna(df_mag['year'])
df_mag = df_mag.drop(columns=['year'])

# Drop of the DBLP columns that are not needed anymore
df_mag = df_mag.drop(columns=['crossref', 'ee', 'key', 'url'])

# Drop of the MAG columns that are not needed anymore
df_mag = df_mag.drop(columns=['PaperTitle', 'ConferenceSeriesDisplayName', 'ConferenceSeriesNormalizedName'])

# Rename of some columns to remove ambiguity
df_mag.rename(columns={'CitationCount': 'CitationCount_Mag', 'EstimatedCitation': 'CitationCount_MagEstimated'}, inplace=True)

# Column sort
df_mag = df_mag.reindex(sorted(df_mag.columns), axis=1)

# Fix of the year column data type
df_mag = df_mag.astype({"Year": int}) 

df_mag

Unnamed: 0,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,12.0,12.0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,10.0,10.0,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,20.0,20.0,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,0.0,0.0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,19.0,19.0,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006
...,...,...,...,...,...,...,...,...
4988315,,,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_9,,2011
4988316,,,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_20,,2011
4988317,,,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_25,,2011
4988318,,,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_12,,2011
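
The join-and-combine pattern used above can be seen on a tiny example. The two frames below, `mag` and `dblp`, are toy stand-ins with made-up DOIs: one paper exists only in MAG, one only in DBLP, and one in both.

```python
import pandas as pd

mag = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/b"],
    "ConferenceLocation": ["Austin, TX", None],
    "Year": [2014, 2002],
})
dblp = pd.DataFrame({
    "Doi": ["10.1/b", "10.1/c"],
    "ConferenceLocation": ["Provence, France", "Eilat, Israel"],
    "year": [2002, 2000],
})

# The outer join keeps the papers of both sources; the shared column name
# gets the _x (MAG) and _y (DBLP) suffixes.
joined = pd.merge(mag, dblp, on="Doi", how="outer")

# Prefer the MAG value and fall back to the DBLP one when it is missing.
joined["ConferenceLocation"] = joined["ConferenceLocation_x"].fillna(
    joined["ConferenceLocation_y"]
)
joined["Year"] = joined["Year"].fillna(joined["year"]).astype(int)

joined = joined.drop(columns=["ConferenceLocation_x", "ConferenceLocation_y", "year"])
```

Note that pandas matches rows whose join key is missing in both frames against each other, so it may be worth checking how many rows have no DOI before running the merge on the real dumps.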


## Converting the NaN Citations to 0

In [9]:
df_mag['CitationCount_Mag'] = df_mag['CitationCount_Mag'].fillna(0)
df_mag['CitationCount_MagEstimated'] = df_mag['CitationCount_MagEstimated'].fillna(0)

After the outer join, the citation columns are stored as floats because of the NaN values, so we convert them back to integers:

In [10]:
df_mag = df_mag.astype({"CitationCount_Mag": int, "CitationCount_MagEstimated": int})
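
The sketch below shows, on a made-up Series, why the conversion is needed: a missing value forces the column to float, and it must be filled before casting to a plain integer. The nullable `Int64` dtype is shown only as a possible alternative that keeps the missing marker; it is not what this notebook uses.

```python
import pandas as pd

# A citation column as it may look after the outer join (made-up values).
counts = pd.Series([12.0, None, 19.0], name="CitationCount_Mag")

# Approach used in this notebook: missing citations become 0, then cast to int.
as_int = counts.fillna(0).astype(int)

# Alternative: pandas' nullable integer dtype keeps "missing" distinct from 0.
as_nullable = counts.astype("Int64")

print(as_int.tolist())
```

Filling with 0 makes papers that are absent from MAG indistinguishable from papers with no citations, which is a design choice to keep in mind when the counts are analysed later.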

In [11]:
df_mag.loc[:5]

Unnamed: 0,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006
5,4,4,"Lisbon, Portugal",interact 2011,HCI International 2011 - Posters' Extended Abs...,10.1007/978-3-642-22095-1_80,Computer Interaction and the Benefits of Socia...,2011


## Writing the Final CSV to Disk

Saving the resulting dataframe on disk in CSV format.

In [12]:
# Write of the resulting CSV on Disk
df_mag.to_csv(path_file_export + 'out_dblp_and_mag_joined.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_dblp_and_mag_joined.csv')

Successfully Exported the Processed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_dblp_and_mag_joined.csv


Reload the exported CSV to make sure that everything went fine.

In [13]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
df_joined_exported_csv

Unnamed: 0,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006
...,...,...,...,...,...,...,...,...
4988315,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_9,,2011
4988316,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_20,,2011
4988317,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_25,,2011
4988318,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_12,,2011
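
Instead of inspecting the reloaded table by eye, the round trip could also be checked programmatically. The sketch below is one possible way to do it, using an in-memory buffer in place of the real file and a two-row stand-in for the joined DataFrame.

```python
import io

import pandas as pd

# Two-row stand-in for the joined DataFrame (values taken from the rows above).
df = pd.DataFrame({
    "CitationCount_Mag": [12, 0],
    "Doi": ["10.1007/978-3-662-45174-8_28", "10.1007/978-3-642-31674-6_9"],
    "Year": [2014, 2011],
})

# Write to an in-memory buffer and read it back, as done with the file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=[0])

# Compare shape, column names and cell values of the two frames.
same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)
same_values = reloaded.values.tolist() == df.values.tolist()

print(same_shape, same_columns, same_values)
```

On the real dump, columns that contain missing values (such as OriginalTitle) would need a NaN-aware comparison, since NaN does not compare equal to itself.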
