# Citation and Conference Data Join - DBLP + MAG and COCI

Jupyter Notebook that joins the conference and location data of the DBLP + MAG dump with the citation counts of the COCI dump.

This process needs the following CSV files: ```out_coci_citations_count.csv``` and ```out_dblp_and_mag_joined.csv```. <br>
The first must be generated by running the Notebook ```preprocess_opencitations.ipynb```, which is contained in the ```1 - Citation Dumps Preprocess``` folder of this project.
The second must be generated by running the Notebook ```1 - DBLP and MAG Data Join Notebook.ipynb```, which is contained in the same folder as this Notebook.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dumps
* Join between the two datasets
* Drop of the useless columns
* Fix of the mismatched data types

Lastly, the entire preprocessed dump is going to be saved on disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
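
Since the cells below fail if one of the two input files is missing, it may help to check for them up front. The following is a minimal sketch, not part of the original Notebook: the helper name `missing_inputs` is hypothetical, and the demo runs on a temporary directory rather than on the real paths.

```python
from pathlib import Path
import tempfile

# File names required by this Notebook (taken from the description above)
REQUIRED_FILES = ('out_coci_citations_count.csv', 'out_dblp_and_mag_joined.csv')

def missing_inputs(directory, file_names=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in file_names if not (directory / name).is_file()]

# Demo on a temporary directory that contains only one of the two files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'out_coci_citations_count.csv').write_text('article,citations_count\n')
    missing = missing_inputs(tmp)

print(missing)
```

In the Notebook itself, the same helper could be called with the directory the CSV files are read from, stopping early if the returned list is not empty.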

## Read of the CSV Preprocessed Dumps

In [3]:
df_coci = pd.read_csv(path_file_export + 'out_coci_citations_count.csv', low_memory=False, index_col=[0])
df_coci

Unnamed: 0,article,citations_count
0,10.10.18045/zbefri.2015.2.207,1
1,10.1000/182,3
2,10.1000/res#test,1
3,10.1001,6
4,10.1001/.391,22
...,...,...
58110410,10.9799/ksfan.2017.30.4.728,1
58110411,10.9799/ksfan.2017.30.5.1035,1
58110412,10.9799/ksfan.2018.31.1.033,1
58110413,10.9799/ksfan.2019.32.4.328,1


In [4]:
df_dblp_and_mag = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
df_dblp_and_mag

Unnamed: 0,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006
...,...,...,...,...,...,...,...,...
4988315,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_9,,2011
4988316,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_20,,2011
4988317,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_25,,2011
4988318,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_12,,2011


## Preparation of the CSV Preprocessed COCI Dump

Renaming the `article` column to `Doi` and converting all the DOIs to lowercase:

In [6]:
df_coci = df_coci.rename(columns={'article': 'Doi'})
df_coci = df_coci.reindex(sorted(df_coci.columns), axis=1)

df_coci.Doi = df_coci.Doi.str.lower()
df_coci.iloc[:5]

Unnamed: 0,Doi,citations_count
0,10.10.18045/zbefri.2015.2.207,1
1,10.1000/182,3
2,10.1000/res#test,1
3,10.1001,6
4,10.1001/.391,22
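
Lowercasing can make two DOIs that differed only by letter case collapse into the same key, which would duplicate rows in a later left join. The sketch below uses toy data (not taken from the real dump) to show how such collisions could be counted and, as one possible resolution, merged by summing their counts. Whether summing is appropriate depends on how the COCI counts were produced, so treat that step as an assumption.

```python
import pandas as pd

# Toy COCI-like data: the first two DOIs differ only by letter case
df_toy = pd.DataFrame({
    'Doi': ['10.1000/ABC', '10.1000/abc', '10.1000/182'],
    'citations_count': [2, 3, 1],
})

df_toy['Doi'] = df_toy['Doi'].str.lower()

# Count the DOIs that now appear more than once
n_duplicated = int(df_toy['Doi'].duplicated().sum())

# Assumption: colliding DOIs refer to the same article, so their counts are summed
df_unique = df_toy.groupby('Doi', as_index=False)['citations_count'].sum()

print(n_duplicated)
print(df_unique)
```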


## Join Between DBLP+MAG and COCI

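Making sure that the DOIs of both datasets are in lowercase before the join:

In [8]:
# The COCI DOIs were already lowercased above; the DBLP+MAG DOIs are lowercased here
df_coci.Doi = df_coci.Doi.str.lower()
df_dblp_and_mag.Doi = df_dblp_and_mag.Doi.str.lower()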

In [9]:
df_dblp_and_mag = pd.merge(df_dblp_and_mag, df_coci, on=['Doi'], how='left')

df_dblp_and_mag.iloc[:5]

Unnamed: 0,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,citations_count
0,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,10.0
1,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014,5.0
2,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013,11.0
3,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002,1.0
4,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006,9.0
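
To see how many DBLP + MAG rows actually found a match in COCI, the left join can be run with the `indicator` and `validate` options of `pd.merge`. This is a sketch on toy data, not the Notebook's own cell. `validate='many_to_one'` raises a `MergeError` if a DOI is repeated on the right side, and the `_merge` column reports where each row came from.

```python
import pandas as pd

# Toy stand-ins for the two datasets
df_left = pd.DataFrame({'Doi': ['10.1/a', '10.1/b', '10.1/c'],
                        'Year': [2014, 2013, 2002]})
df_right = pd.DataFrame({'Doi': ['10.1/a', '10.1/c', '10.1/z'],
                         'citations_count': [10, 1, 7]})

# Left join: every row of df_left is kept, unmatched rows get NaN counts
df_merged = pd.merge(df_left, df_right, on='Doi', how='left',
                     validate='many_to_one', indicator=True)

matched = int((df_merged['_merge'] == 'both').sum())
unmatched = int((df_merged['_merge'] == 'left_only').sum())

print(f'matched: {matched}, unmatched: {unmatched}')
```

The `_merge` column would need to be dropped before the export if this check were added to the real join.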


Column rename and sort:

In [10]:
df_dblp_and_mag.rename(columns={'citations_count': 'CitationCount_COCI'}, inplace=True)
df_dblp_and_mag = df_dblp_and_mag.reindex(sorted(df_dblp_and_mag.columns), axis=1)
df_dblp_and_mag.iloc[:5]

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,10.0,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,5.0,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,11.0,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,1.0,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,9.0,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006


## Converting the NaN Citations to 0

In [11]:
df_dblp_and_mag['CitationCount_COCI'] = df_dblp_and_mag['CitationCount_COCI'].fillna(0)
df_dblp_and_mag['CitationCount_Mag'] = df_dblp_and_mag['CitationCount_Mag'].fillna(0)
df_dblp_and_mag['CitationCount_MagEstimated'] = df_dblp_and_mag['CitationCount_MagEstimated'].fillna(0)

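Fixing the data types: the NaN values introduced by the left join turned `CitationCount_COCI` into a float column, so the citation counts are cast back to integers: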

In [12]:
df_dblp_and_mag = df_dblp_and_mag.astype({
    "CitationCount_COCI": int,
    "CitationCount_Mag": int,
    "CitationCount_MagEstimated": int,
})
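
Filling with 0 makes "no citations found in COCI" indistinguishable from "DOI not present in COCI". If that distinction matters, a possible alternative (an assumption, not what this Notebook does) is pandas' nullable `Int64` dtype, which keeps missing values while still storing integers. The sketch below compares the two approaches on a toy column.

```python
import numpy as np
import pandas as pd

# Toy column as it looks right after the left join: floats with a NaN
s = pd.Series([10.0, np.nan, 1.0], name='CitationCount_COCI')

# Approach used in this Notebook: missing values become 0
as_zero_filled = s.fillna(0).astype(int)

# Alternative: nullable integers, the missing value is preserved as <NA>
as_nullable = s.astype('Int64')

print(as_zero_filled.tolist())
print(as_nullable.tolist())
```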

In [13]:
df_dblp_and_mag.iloc[:5]

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,10,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,5,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,11,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,1,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,9,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006


## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [14]:
# Write of the resulting CSV on Disk
df_dblp_and_mag.to_csv(path_file_export + 'out_citations_and_conferences.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_citations_and_conferences.csv')

Successfully Exported the Processed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_citations_and_conferences.csv


Reloading the exported CSV to check that the export succeeded.

In [15]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_citations_and_conferences.csv', low_memory=False, index_col=[0])
df_joined_exported_csv

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,10,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,5,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,11,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013
3,1,0,0,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002
4,9,19,19,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006
...,...,...,...,...,...,...,...,...,...
4988315,4,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_9,,2011
4988316,4,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_20,,2011
4988317,2,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_25,,2011
4988318,0,0,0,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_12,,2011
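
The visual check above could be complemented by a programmatic one. The following sketch writes a toy dataframe to an in-memory buffer instead of the real export file, reloads it, and compares shape, column names and values. The three `same_*` variable names are illustrative only.

```python
import io
import pandas as pd

# Toy dataframe with a few of the exported columns
df_original = pd.DataFrame({
    'CitationCount_COCI': [10, 5],
    'Doi': ['10.1007/978-3-662-45174-8_28', '10.1007/978-3-662-44777-2_60'],
    'Year': [2014, 2014],
})

# Round trip through CSV, using an in-memory buffer in place of a file on disk
buffer = io.StringIO()
df_original.to_csv(buffer)
buffer.seek(0)
df_reloaded = pd.read_csv(buffer, index_col=[0])

same_shape = df_original.shape == df_reloaded.shape
same_columns = list(df_original.columns) == list(df_reloaded.columns)
same_values = df_original.to_numpy().tolist() == df_reloaded.to_numpy().tolist()

print(same_shape, same_columns, same_values)
```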
