
# Preprocess DBLP Dataset

Jupyter Notebook for the preprocessing of the DBLP dump.

## TODO TODO TODO

For this process, the following tab-separated files are needed: ```ConferenceInstances.txt```, ```ConferenceSeries.txt```, ```Papers.txt```.
They can be found here: https://archive.org/download/mag-2021-06-07/mag/

In particular, the following operations are executed:
* Opening the ConferenceInstances and ConferenceSeries CSVs
* Dropping the unneeded columns
* Chunked processing of the Papers CSV
    * Dropping the unneeded columns
    * Dropping papers without a DOI
    * Dropping rows for papers from journals and books
* Merging with the processed conference data

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import xml.sax

pd.set_option('display.max_columns', None)

# Import of the XML to CSV conversion module
from dblp_xml2csv_process import DBLP_Handler

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
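The cells below build file names by concatenating strings, which only works if each directory string ends with a path separator. As a minimal sketch of an alternative, assuming hypothetical directory names (`data/Import` and `data/Export` are placeholders, not the paths used in this notebook), `pathlib` can join the parts safely:

```python
from pathlib import Path

# Hypothetical directories; replace them with your own working directories.
import_dir = Path("data") / "Import"
export_dir = Path("data") / "Export"

# The "/" operator inserts the separator, so a missing trailing slash is harmless.
raw_csv_path = import_dir / "out_dblp_raw.csv"
export_csv_path = export_dir / "out_dblp_raw.csv"

print(raw_csv_path.as_posix())
```

`pd.read_csv` and `DataFrame.to_csv` accept `Path` objects directly, so the rest of the notebook would not need string conversions.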

### Use a Dump Already Converted to CSV
This saves time when a DBLP dump has already been converted to CSV format, since the conversion does not need to be repeated.

**Note**: the CSV must be in the same format as the one generated by this script.

In [10]:
# Use a Previously Converted DBLP CSV
#
# This saves time when a dump has already been converted to CSV:
# the conversion does not need to be repeated.
#
# Note: the CSV must be in the same format as the one generated by this script
read_already_converted_csv = False
already_converted_csv_path = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_dblp_raw.csv'


## Dump Conversion from XML to CSV
For this initial step, we use a class from the following [GitHub repository](https://github.com/hibernator11/notebook-emerging-topics-corpora). All rights to the class belong to its creator.

**NOTE**: Small modifications have been made so that the resulting file can be saved to a specified path.
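To illustrate how a SAX content handler of this kind works, here is a minimal sketch. It is not the `DBLP_Handler` class from the repository: `TinyHandler` and `SAMPLE_XML` are hypothetical names, and the sample is a made-up miniature of a DBLP-style file. The parser calls `startElement`, `characters`, and `endElement` as it streams through the document, so the whole file never has to be held in memory.

```python
import xml.sax

# Hypothetical miniature of a DBLP-style dump, for illustration only.
SAMPLE_XML = b"""<dblp>
<inproceedings key="conf/x/A20"><title>First Paper</title><year>2020</year></inproceedings>
<article key="journals/y/B21"><title>Second Paper</title><year>2021</year></article>
</dblp>"""


class TinyHandler(xml.sax.ContentHandler):
    """Collects one dict per record found directly under the root element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.depth = 0
        self.record = None
        self.field = None

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2:  # a publication record
            self.record = {"type": name, "key": attrs.get("key")}
        elif self.depth == 3:  # a field of the current record
            self.field = name
            self.record[name] = ""

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        if self.field is not None:
            self.record[self.field] += content

    def endElement(self, name):
        if self.depth == 3:
            self.field = None
        elif self.depth == 2:
            self.rows.append(self.record)
            self.record = None
        self.depth -= 1


handler = TinyHandler()
xml.sax.parseString(SAMPLE_XML, handler)
print(handler.rows)
```

The real dump also relies on a DTD for its character entities, which this sample deliberately avoids.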

In [14]:
if not read_already_converted_csv:
    xml_parser = xml.sax.make_parser()
    dblp_handler = DBLP_Handler()
    xml_parser.setContentHandler(dblp_handler)
    xml_parser.parse(path_file_import + 'dblp-2022-03-01.xml')
    dblp_handler.save(path_file_import)

    print(f'Successfully Exported the converted Dump to {path_file_import}out_dblp_raw.csv')


Processed 0.0 million elements
Processed 0.3 million elements
Processed 0.4 million elements
Processed 0.5 million elements
Processed 0.6 million elements
Duplicated title ('www', 'homepages/165/4664', 'TITLE') Home Page Home Page
Processed 0.7 million elements
Processed 0.8 million elements
Processed 0.9 million elements
Processed 1.0 million elements
Processed 1.1 million elements
Processed 1.2 million elements
Processed 1.3 million elements
Processed 1.4 million elements
Processed 1.5 million elements
Processed 1.6 million elements
Processed 1.7 million elements
Processed 1.8 million elements
Processed 1.9 million elements
Processed 2.0 million elements
Processed 2.1 million elements
Processed 2.2 million elements
Processed 2.3 million elements
Processed 2.4 million elements
Processed 2.5 million elements
Processed 2.6 million elements
Processed 2.7 million elements
Processed 2.8 million elements
Processed 2.9 million elements
Processed 3.0 million elements
Processed 3.1 million ele

## Preprocess of the DBLP Dump
In this phase we use the CSV file generated by the previous conversion phase.

In [14]:
# Read of the DBLP Raw Dump

# The column names follow the official MAG schema documentation
#df_mag_conf_instances_col_names = ['ConferenceInstanceID', 'NormalizedName', 'DisplayName', 'ConferenceSeriesID', 'Location', 'OfficialUrl', 'StartDate', 'EndDate', 'AbstractRegistrationDate', 'SubmissionDeadlineDate', 'NotificationDueDate', 'FinalVersionDueDate', 'PageCount', 'PaperFamilyCount', 'CitationCount', 'Latitude', 'Longitude', 'CreatedDate']

df_dblp_raw = pd.read_csv(path_file_import + 'out_dblp_raw.csv', sep='\t')
df_dblp_raw

Unnamed: 0.1,Unnamed: 0,TYPE,TITLE,YEAR
0,dblpnote/error,article,(error),
1,dblpnote/ellipsis,article,…,
2,dblpnote/neverpublished,article,(was never published),
3,phd/Turpin92,phdthesis,Programming Data Structures in Logic.,1992.0
4,phd/Olken93,phdthesis,Random Sampling from Databases,1993.0
...,...,...,...,...
8960390,tr/gte/TR-0146-06-91-165,article,Towards a Transaction Management System for DOM.,1991.0
8960391,tr/gte/TR-0222-10-92-165,article,DARWIN: On the Incremental Migration of Legacy...,1993.0
8960392,tr/dec/SRC1997-018,article,"The 1995 SQL Reunion: People, Project, and Pol...",1997.0
8960393,tr/ucb/erl-m79-28,article,Muffin: A Distributed Database Machine,1979.0


In [15]:
publtype = df_dblp_raw['TYPE'].unique()
publtype

array(['article', 'phdthesis', 'mastersthesis', 'book', 'incollection',
       'proceedings', 'www', 'inproceedings'], dtype=object)

chmod +x XMLToCSV.py
./XMLToCSV.py dblp-2022-03-01.xml dblp-2019-11-22.dtd out_dblp_raw_v2.csv

In [24]:
# Read of the DBLP Raw Dump

# The column names follow the official MAG schema documentation
#df_mag_conf_instances_col_names = ['ConferenceInstanceID', 'NormalizedName', 'DisplayName', 'ConferenceSeriesID', 'Location', 'OfficialUrl', 'StartDate', 'EndDate', 'AbstractRegistrationDate', 'SubmissionDeadlineDate', 'NotificationDueDate', 'FinalVersionDueDate', 'PageCount', 'PaperFamilyCount', 'CitationCount', 'Latitude', 'Longitude', 'CreatedDate']

df_dblp_raw = pd.read_csv(path_file_import + 'out_dblp_raw_v2_inproceedings.csv', sep=';', on_bad_lines='skip')
df_dblp_raw



Unnamed: 0,id,author,author-aux,author-orcid,booktitle,cdrom,cite,cite-label,crossref,editor,editor-orcid,ee,ee-type,i,key,mdate,month,note,note-type,number,pages,publtype,sub,sup,title,title-bibtex,tt,url,volume,year
0,3062806,Arnon Rosenthal,,,SWEE,,,,conf/swee/1998,,,http://www.mitre.org/support/swee/rosenthal.html,,,www/org/mitre/future,2019-07-30,,,,,,,,,The Future of Classic Data Administration: Obj...,,,db/conf/swee/swee1998.html,,1998
1,3062808,Qiming Chen|Umeshwar Dayal,,,CoopIS,,...|...|...|books/mk/GrayR93|books/mk/elmagarm...,,conf/coopis/2000,,,https://doi.org/10.1007/10722620_29,,,conf/coopis/ChenD00,2017-05-24,,,,,311-322,,,,Multi-Agent Cooperative Transactions for E-Com...,,,db/conf/coopis/coopis2000.html#ChenD00,,2000
2,3062809,Emmanuel Cecchet|Renaud Lachaize|Takoua Abdell...,,,CoopIS/DOA/ODBASE (2),,,,conf/coopis/2004-2,,,https://doi.org/10.1007/978-3-540-30469-2_46,,,conf/coopis/AbdellatifCL04,2017-05-25,,,,,1571-1589,,,,Evaluation of a Group Communication Middleware...,,,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,,2004
3,3062810,Robert Grob|Stefanie Kethers|Stephan Jacobs,,,CoopIS,,,,conf/coopis/1994,,,,,,conf/coopis/GrobJK94,2019-08-08,,,,,134-145,,,,Towards CIS in Quality Management - Integratio...,,,db/conf/coopis/coopis94.html#GrobJK94,,1994
4,3062811,Evaggelia Pitoura|George Samaras|Panos K. Chry...,,,CoopIS,,...|...|...|...|...|...|...|...|...|...|...|.....,,conf/coopis/2000,,,https://doi.org/10.1007/10722620_9,,,conf/coopis/PapastavrouCSP00,2017-05-24,,,,,102-113,,,,An Evaluation of the Java-Based Approaches to ...,,,db/conf/coopis/coopis2000.html#PapastavrouCSP00,,2000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990257,8958667,Tarek Richard Besold,,0000-0002-8002-0049,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_9,,,series/sapere/Besold13,2019-09-06,,,,,121-132,,,,Turing Revisited: A Cognitively-Inspired Decom...,,,db/series/sapere/sapere5.html#Besold13,,2011
2990258,8958671,Pierre Steiner,,,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_20,,,series/sapere/Steiner13,2019-09-06,,,,,265-276,,,,C.S. Peirce and Artificial Intelligence: Histo...,,,db/series/sapere/sapere5.html#Steiner13,,2011
2990259,8958672,Stuart Armstrong,,,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_25,,,series/sapere/Armstrong13,2019-09-06,,,,,335-347,,,,Risks and Mitigation Strategies for Oracle AI.,,,db/series/sapere/sapere5.html#Armstrong13,,2011
2990260,8958682,Sam Freed,,,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_12,,,series/sapere/Freed13,2019-09-06,,,,,167-177,,,,Practical Introspection as Inspiration for AI.,,,db/series/sapere/sapere5.html#Freed13,,2011


In [28]:
df_dblp_raw = df_dblp_raw.sort_values(by='year', ascending=False)
df_dblp_raw

Unnamed: 0,id,author,author-aux,author-orcid,booktitle,cdrom,cite,cite-label,crossref,editor,editor-orcid,ee,ee-type,i,key,mdate,month,note,note-type,number,pages,publtype,sub,sup,title,title-bibtex,tt,url,volume,year
2865272,5976892,Jitender Jamwal|R. C. Hansdah|Ravi Babu Gudivada,,,ICDCN,,,,conf/icdcn/2022,,,https://doi.org/10.1145/3491003.3491021,,,conf/icdcn/HansdahJG22,2022-01-26,,,,,188-197,,,,Dragonshield : An Authentication Enhancement f...,,,db/conf/icdcn/icdcn2022.html#HansdahJG22,,2022
2661408,5768872,Norihisa Fujita|Ryohei Kobayashi|Ryuta Kashino...,,,HPC Asia,,,,conf/hpcasia/2022,,,https://doi.org/10.1145/3492805.3492817,,,conf/hpcasia/KashinoKFB22,2022-01-11,,,,,84-93,,,,Multi-hetero Acceleration by GPU and FPGA for ...,,,db/conf/hpcasia/hpcasia2022.html#KashinoKFB22,,2022
523907,3595158,Abhinav Rajvanshi|Alex Krasner|Han-Pang Chiu|K...,,,WACV,,,,conf/wacv/2022,,,https://doi.org/10.1109/WACV51458.2022.00192,,,conf/wacv/KrasnerSRCMKMVS22,2022-02-17,,,,,1858-1867,,,,SIGNAV: Semantically-Informed GPS-Denied Navig...,,,db/conf/wacv/wacv2022.html#KrasnerSRCMKMVS22,,2022
439396,3509766,Dimitri Lajou|Hervé Hocquard|Julien Bensmail|É...,,,CALDAM,,,,conf/caldam/2022,,,https://doi.org/10.1007/978-3-030-95018-7_1,,,conf/caldam/BensmailHLS22,2022-01-28,,,,,3-14,,,,A Proof of the Multiplicative 1-2-3 Conjecture.,,,db/conf/caldam/caldam2022.html#BensmailHLS22,,2022
2288276,5390045,Charilaos I. Kanatsoulis|Nicholas D. Sidiropoulos,,,WSDM,,,,conf/wsdm/2022,,,https://doi.org/10.1145/3488560.3498467,,,conf/wsdm/KanatsoulisS22,2022-02-18,,,,,439-448,,,,GAGE: Geometry Preserving Attributed Graph Emb...,,,db/conf/wsdm/wsdm2022.html#KanatsoulisS22,,2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1800199,4894975,Jay W. Forrester,,,AIEE-IRE Computer Conference,,,,conf/aieeire/1951,,,https://doi.org/10.1145/1434770.1434789,,,conf/aieeire/Forrester51,2021-04-22,,,,,109-114,,,,Digital computers: present and future trends.,,,db/conf/aieeire/aieeire1951.html#Forrester51,,1951
1800356,4895136,W. H. Mac Williams,,,AIEE-IRE Computer Conference,,,,conf/aieeire/1951,,,https://doi.org/10.1145/1434770.1434771,,,conf/aieeire/Williams51,2021-04-22,,,,,5-6,,,,Keynote address.,,,db/conf/aieeire/aieeire1951.html#Williams51,,1951
1800368,4895149,Maurice V. Wilkes,,,AIEE-IRE Computer Conference,,,,conf/aieeire/1951,,,https://doi.org/10.1145/1434770.1434783,,,conf/aieeire/Wilkes51,2021-04-22,,,,,79-83,,,,The EDSAC computer.,,,db/conf/aieeire/aieeire1951.html#Wilkes51,,1951
1800449,4895230,Robert R. Everett,,,AIEE-IRE Computer Conference,,,,conf/aieeire/1951,,,https://doi.org/10.1145/1434770.1434781|https:...,,,conf/aieeire/Everett51,2021-06-01,,,,,70-74,,,,The Whirlwind I computer.,,,db/conf/aieeire/aieeire1951.html#Everett51,,1951


In [18]:
publtype = df_dblp_raw['publtype'].unique()
publtype

array(['informal', nan, 'withdrawn', 'data', 'informal withdrawn',
       'survey', 'software', 'edited'], dtype=object)
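`unique()` lists the distinct values but not how often each one occurs. As a small sketch on a hypothetical sample (the values below are made up to mimic the `publtype` column), `value_counts(dropna=False)` also reports how many rows have no value at all:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the 'publtype' column.
sample = pd.DataFrame(
    {"publtype": ["informal", np.nan, np.nan, "withdrawn", "informal", np.nan]}
)

# dropna=False keeps the missing values as their own category.
counts = sample["publtype"].value_counts(dropna=False)
print(counts)
```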

In [32]:
publtype = df_dblp_raw.loc[df_dblp_raw.author.str.contains("Bedogni", na=False)]
publtype

Unnamed: 0,id,author,author-aux,author-orcid,booktitle,cdrom,cite,cite-label,crossref,editor,editor-orcid,ee,ee-type,i,key,mdate,month,note,note-type,number,pages,publtype,sub,sup,title,title-bibtex,tt,url,volume,year
1626363,4718043,Francesco Poggi|Luca Bedogni,,,CCNC,,,,conf/ccnc/2022,,,https://doi.org/10.1109/CCNC49033.2022.9700519,,,conf/ccnc/BedogniP22,2022-02-14,,,,,405-410,,,,A Web Of Things Context-Aware IoT System lever...,,,db/conf/ccnc/ccnc2022.html#BedogniP22,,2022
1624326,4715998,Federico Montori|Luca Bedogni|Vincenzo Armandi,,,CCNC,,,,conf/ccnc/2022,,,https://doi.org/10.1109/CCNC49033.2022.9700603,,,conf/ccnc/MontoriAB22,2022-02-14,,,,,411-416,,,,A Hierarchical Architectural Model for IoT End...,,,db/conf/ccnc/ccnc2022.html#MontoriAB22,,2022
2592157,5698117,Federico Montori|Luca Bedogni|Vincenzo Armandi,,,SMARTCOMP,,,,conf/smartcomp/2021,,,https://doi.org/10.1109/SMARTCOMP52413.2021.00083,,,conf/smartcomp/MontoriAB21,2021-10-12,,,,,401-403,,,,IoT End-User Service Composition via a Visual ...,,,db/conf/smartcomp/smartcomp2021.html#MontoriAB21,,2021
1532787,4622758,Kevin Choi|Luca Bedogni|Marco Levorato,,,GLOBECOM,,,,conf/globecom/2020,,,https://doi.org/10.1109/GLOBECOM42002.2020.932...,,,conf/globecom/ChoiBL20,2021-02-01,,,,,1-6,,,,Towards Green Crowdsourced Social Delivery Net...,,,db/conf/globecom/globecom2020.html#ChoiBL20,,2020
911618,3991409,Federico Montori|Luca Bedogni,,0000-0002-9943-4209,PerCom Workshops,,,,conf/percom/2020w,,,https://doi.org/10.1109/PerComWorkshops48775.2...,,,conf/percom/MontoriB20,2021-10-14,,,,,1-6,,,,A Privacy Preserving Framework for Rewarding U...,,,db/conf/percom/percomw2020.html#MontoriB20,,2020
912656,3992460,Federico Montori|Gianluca Iselli|Luca Bedogni|...,,0000-0002-9943-4209,PerCom Workshops,,,,conf/percom/2020w,,,https://doi.org/10.1109/PerComWorkshops48775.2...,,,conf/percom/MontoriBIB20,2021-10-14,,,,,1-6,,,,Delivering IoT Smart Services through Collecti...,,,db/conf/percom/percomw2020.html#MontoriBIB20,,2020
911099,3990886,Luca Bedogni|Luciano Bononi,,0000-0001-9993-4046,PerCom Workshops,,,,conf/percom/2019w,,,https://doi.org/10.1109/PERCOMW.2019.8730753,,,conf/percom/BedogniB19,2020-03-27,,,,,820-825,,,,Vehicular Route Identification Using Mobile De...,,,db/conf/percom/percomw2019.html#BedogniB19,,2019
912949,3992756,Andrea Alcaras|Luca Bedogni|Luciano Bononi,,0000-0001-9993-4046,PerCom Workshops,,,,conf/percom/2019w,,,https://doi.org/10.1109/PERCOMW.2019.8730731,,,conf/percom/BedogniAB19,2020-03-27,,,,,28-33,,,,Permission-free Keylogging through Touch Event...,,,db/conf/percom/percomw2019.html#BedogniAB19,,2019
2055719,5154520,Luca Bedogni|Marco Levorato|Octavian Bujor,,0000-0001-9993-4046,WOWMOM,,,,conf/wowmom/2019,,,https://doi.org/10.1109/WoWMoM.2019.8793032,,,conf/wowmom/BedogniBL19,2020-03-27,,,,,1-9,,,,Texting and Driving Recognition Exploiting Sub...,,,db/conf/wowmom/wowmom2019.html#BedogniBL19,,2019
1632918,4724741,Andrea Capponi|Claudio Fiandrino|Emanuele Cort...,,0000-0001-6945-3196|0000-0001-9993-4046|0000-0...,MSWiM,,,,conf/mswim/2019,,,https://doi.org/10.1145/3345768.3355929,,,conf/mswim/MontoriCBCFB19,2021-10-14,,,,,289-296,,,,CrowdSenSim 2.0: a Stateful Simulation Platfor...,,,db/conf/mswim/mswim2019.html#MontoriCBCFB19,,2019
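The filter above uses a substring match, so it would also select any author whose name merely contains the search string. Since the `author` column stores names separated by `|`, an exact match can be sketched by splitting the string first. The sample rows below are hypothetical (including the made-up name "Anna Bedognini"):

```python
import pandas as pd

# Hypothetical sample using the same pipe-separated author format.
sample = pd.DataFrame({
    "author": [
        "Francesco Poggi|Luca Bedogni",
        "Anna Bedognini",
        None,
        "Luca Bedogni|Luciano Bononi",
    ],
    "title": ["A", "B", "C", "D"],
})

# Substring match, as in the cell above: also selects "Anna Bedognini".
loose = sample[sample["author"].str.contains("Bedogni", na=False)]

# Split into a list of names, one entry per author.
authors = sample["author"].str.split("|")

# Keep rows where the exact name is in the list; missing authors never match.
mask = authors.apply(lambda names: isinstance(names, list) and "Luca Bedogni" in names)
exact = sample[mask]

print(loose["title"].tolist())
print(exact["title"].tolist())
```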


In [10]:
# Read of the DBLP Raw Dump

# The column names follow the official MAG schema documentation
#df_mag_conf_instances_col_names = ['ConferenceInstanceID', 'NormalizedName', 'DisplayName', 'ConferenceSeriesID', 'Location', 'OfficialUrl', 'StartDate', 'EndDate', 'AbstractRegistrationDate', 'SubmissionDeadlineDate', 'NotificationDueDate', 'FinalVersionDueDate', 'PageCount', 'PaperFamilyCount', 'CitationCount', 'Latitude', 'Longitude', 'CreatedDate']

df_dblp_raw = pd.read_csv(path_file_import + 'out_dblp_raw_v2_www.csv', sep=';', on_bad_lines='skip')
df_dblp_raw



Unnamed: 0,id,author,author-bibtex,cite,crossref,editor,ee,key,mdate,note,note-label,note-type,publtype,title,url,url-type,year
0,110493,Omon Monakhov,,,,,,homepages/308/9878,2021-12-17,,,,,Home Page,,,
1,110494,Raden Putra,,,,,,homepages/308/3672,2021-12-09,,,,,Home Page,,,
2,110495,Ana Harumi Grota Suzuki,,,,,,homepages/308/9816,2021-12-17,,,,,Home Page,,,
3,110496,Joey Öhman,,,,,,homepages/308/1161,2021-12-07,,,,,Home Page,,,
4,110497,Markus A. Janout,,,,,,homepages/308/0858,2021-12-07,,,,,Home Page,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2952305,3062804,,,,,,http://www.w3.org/Style/XSL/,www/org/w3/style-xsl,2019-07-10,,,,,W3C: Extensible Stylesheet Language (XSL),,,2001.0
2952306,3062805,,,,,,http://www.aiim.org/wfmc/mainframe.htm,www/org/aiim-wfmc,2019-07-10,,,,,"WfMC Standards: The Workflow Reference Model, ...",,,1995.0
2952307,3062807,,,,,Klaus Tschira,http://www.klaus-tschira-stiftung.de/,www/de/kts,2016-10-21,,,,,"Klaus Tschira Stiftung gemeinnützige GmbH, KTS",,,2012.0
2952308,8940442,Michael Ley,,,,,http://www.informatik.uni-trier.de/~ley/db/abo...,persons/Ley2003,2018-10-18,,,,,ACM SIGMOD Contribution Award 2003 Acceptance ...,,,2003.0


## Preprocess of Conference Instances CSV

In [4]:
# ******************* CONFERENCE INSTANCES ********************

# Read of the Conference Instances File

# The column names follow the official MAG schema documentation
df_mag_conf_instances_col_names = ['ConferenceInstanceID', 'NormalizedName', 'DisplayName', 'ConferenceSeriesID', 'Location', 'OfficialUrl', 'StartDate', 'EndDate', 'AbstractRegistrationDate', 'SubmissionDeadlineDate', 'NotificationDueDate', 'FinalVersionDueDate', 'PageCount', 'PaperFamilyCount', 'CitationCount', 'Latitude', 'Longitude', 'CreatedDate']

df_mag_conf_instances = pd.read_csv(path_file_import + 'ConferenceInstances.txt', sep='\t', names=df_mag_conf_instances_col_names)
df_mag_conf_instances

Unnamed: 0,ConferenceInstanceID,NormalizedName,DisplayName,ConferenceSeriesID,Location,OfficialUrl,StartDate,EndDate,AbstractRegistrationDate,SubmissionDeadlineDate,NotificationDueDate,FinalVersionDueDate,PageCount,PaperFamilyCount,CitationCount,Latitude,Longitude,CreatedDate
0,7785157,time 2008,TIME 2008,2624631009,"Montreal, Canada",http://www.time2008.org/,2008-06-16,2008-06-18,,2008-01-11,,,23,23,319,45.512400,-73.554680,2016-06-24
1,15420687,ipmu 2008,IPMU 2008,1128239323,"Malaga, Spain",http://www.gimac.uma.es/ipmu08,2008-06-22,2008-06-27,,2007-12-07,,,5,5,45,36.718320,-4.420160,2016-06-24
2,16798864,wosn 2010,WOSN 2010,2756885533,"Boston, MA, USA",http://www.usenix.org/events/wosn10/cfp/,2010-06-22,2010-06-22,,2010-02-25,2010-04-30,2010-05-25,9,9,666,42.358660,-71.056740,2016-06-24
3,18230910,sasn 2009,SASN 2009,1128894334,Saint Petersburg (Russia),http://www.ieee-sasn.org/index.html,2009-10-12,2009-10-14,2009-06-19,2009-06-26,2009-07-31,2009-09-11,0,0,0,59.933180,30.306030,2016-06-24
4,31227610,eurocon 2011,EUROCON 2011,1190350587,"Lisbon, Portugal",http://www.eurocon2011.it.pt/,2011-04-27,2011-04-29,,2010-10-30,2011-01-30,2011-02-28,279,279,864,38.725700,-9.150250,2016-06-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16454,2890348533,icn 2019,ICN 2019,1147859159,"Valencia, Spain",http://iaria.org/conferences2019/ICN19.html,2019-03-24,2019-03-28,,2018-11-10,2019-01-10,2019-02-04,179,179,180,39.468990,-0.376860,2018-09-27
16455,2890856803,ismco'2019,ISMCO'2019,2898428697,"Incline Village,us",http://www.ismco.net,2019-04-29,2019-05-01,,2018-12-17,,,6,6,12,39.250090,-119.959267,2018-09-27
16456,2891744759,ro2018,RO2018,2898131225,"Amsterdam,nl",http://www.researchobject.org/ro2018/,2018-10-29,2018-10-29,2018-07-15,2018-07-15,,,1,1,0,52.353218,5.002769,2018-09-27
16457,2892113557,icccn 2019,ICCCN 2019,1137850760,"Valencia, Spain",http://icccn.org/icccn19,2019-07-29,2019-08-01,,2019-03-01,2019-04-26,2019-05-10,126,126,224,39.468990,-0.376860,2018-09-27


Here the unneeded columns are removed from the dataframe.

In [5]:
# Drop of Conference Instances' Useless Columns
df_mag_conf_instances = df_mag_conf_instances.drop(columns=['OfficialUrl', 'AbstractRegistrationDate', 'SubmissionDeadlineDate', 'NotificationDueDate', 'FinalVersionDueDate', 'PageCount', 'PaperFamilyCount', 'CitationCount', 'Latitude', 'Longitude', 'CreatedDate', 'StartDate', 'EndDate'])
df_mag_conf_instances

Unnamed: 0,ConferenceInstanceID,NormalizedName,DisplayName,ConferenceSeriesID,Location
0,7785157,time 2008,TIME 2008,2624631009,"Montreal, Canada"
1,15420687,ipmu 2008,IPMU 2008,1128239323,"Malaga, Spain"
2,16798864,wosn 2010,WOSN 2010,2756885533,"Boston, MA, USA"
3,18230910,sasn 2009,SASN 2009,1128894334,Saint Petersburg (Russia)
4,31227610,eurocon 2011,EUROCON 2011,1190350587,"Lisbon, Portugal"
...,...,...,...,...,...
16454,2890348533,icn 2019,ICN 2019,1147859159,"Valencia, Spain"
16455,2890856803,ismco'2019,ISMCO'2019,2898428697,"Incline Village,us"
16456,2891744759,ro2018,RO2018,2898131225,"Amsterdam,nl"
16457,2892113557,icccn 2019,ICCCN 2019,1137850760,"Valencia, Spain"


The columns are renamed to avoid ambiguity in the later joins.

In [6]:
# Column rename to remove ambiguity for the future joins
df_mag_conf_instances.rename(columns={'NormalizedName': 'ConferenceNormalizedName', 'DisplayName': 'ConferenceDisplayName', 'Location': 'ConferenceLocation'}, inplace=True)
df_mag_conf_instances

Unnamed: 0,ConferenceInstanceID,ConferenceNormalizedName,ConferenceDisplayName,ConferenceSeriesID,ConferenceLocation
0,7785157,time 2008,TIME 2008,2624631009,"Montreal, Canada"
1,15420687,ipmu 2008,IPMU 2008,1128239323,"Malaga, Spain"
2,16798864,wosn 2010,WOSN 2010,2756885533,"Boston, MA, USA"
3,18230910,sasn 2009,SASN 2009,1128894334,Saint Petersburg (Russia)
4,31227610,eurocon 2011,EUROCON 2011,1190350587,"Lisbon, Portugal"
...,...,...,...,...,...
16454,2890348533,icn 2019,ICN 2019,1147859159,"Valencia, Spain"
16455,2890856803,ismco'2019,ISMCO'2019,2898428697,"Incline Village,us"
16456,2891744759,ro2018,RO2018,2898131225,"Amsterdam,nl"
16457,2892113557,icccn 2019,ICCCN 2019,1137850760,"Valencia, Spain"


## Preprocess of Conference Series CSV

In [7]:
# ******************* CONFERENCE SERIES ********************

# Read of the Conference Series File

# The column names follow the official MAG schema documentation
df_mag_conf_series_col_names = ['ConferenceSeriesID', 'Rank', 'NormalizedName', 'DisplayName', 'PaperCount', 'PaperFamilyCount', 'CitationCount', 'CreatedDate']

df_mag_conf_series = pd.read_csv(path_file_import + 'ConferenceSeries.txt', sep='\t', names=df_mag_conf_series_col_names)
df_mag_conf_series

Unnamed: 0,ConferenceSeriesID,Rank,NormalizedName,DisplayName,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
0,1134804816,12817,ICIDS,International Conference on Interactive Digita...,611,610,2945,2016-06-24
1,1165160117,14777,SWAT4LS,Semantic Web Applications and Tools for Life S...,85,85,213,2016-06-24
2,1192093291,12271,TRIDENTCOM,Testbeds and Research Infrastructures for the ...,571,571,5174,2016-06-24
3,1199066382,10155,BIOINFORMATICS,International Conference on Bioinformatics,10692,10692,17021,2016-06-24
4,1201746639,15567,AIS,Autonomous and Intelligent Systems,165,165,1002,2016-06-24
...,...,...,...,...,...,...,...,...
4533,2754809603,14461,IPSS,IEEE International Power Sources Symposium,101,101,188,2017-09-25
4534,2756271167,13527,ECMS,European Conference on Modelling and Simulation,283,283,915,2017-09-25
4535,2756896743,17566,CAI,Conference on Algebraic Informatics,124,124,567,2017-10-06
4536,2757378734,15053,UPGRADE-CN,"Use of P2P, GRID and Agents for the Developmen...",40,40,314,2017-10-06


Here the unneeded columns are removed from the dataframe.

In [8]:
# Drop of Conference Series' Useless Columns
df_mag_conf_series = df_mag_conf_series.drop(columns=['Rank', 'PaperCount', 'PaperFamilyCount', 'CitationCount', 'CreatedDate'])
df_mag_conf_series

Unnamed: 0,ConferenceSeriesID,NormalizedName,DisplayName
0,1134804816,ICIDS,International Conference on Interactive Digita...
1,1165160117,SWAT4LS,Semantic Web Applications and Tools for Life S...
2,1192093291,TRIDENTCOM,Testbeds and Research Infrastructures for the ...
3,1199066382,BIOINFORMATICS,International Conference on Bioinformatics
4,1201746639,AIS,Autonomous and Intelligent Systems
...,...,...,...
4533,2754809603,IPSS,IEEE International Power Sources Symposium
4534,2756271167,ECMS,European Conference on Modelling and Simulation
4535,2756896743,CAI,Conference on Algebraic Informatics
4536,2757378734,UPGRADE-CN,"Use of P2P, GRID and Agents for the Developmen..."


The columns are renamed to avoid ambiguity in the later joins.

In [9]:
# Column rename to remove ambiguity for the future joins
df_mag_conf_series.rename(columns={'NormalizedName': 'ConferenceSeriesNormalizedName', 'DisplayName': 'ConferenceSeriesDisplayName'}, inplace=True)
df_mag_conf_series

Unnamed: 0,ConferenceSeriesID,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName
0,1134804816,ICIDS,International Conference on Interactive Digita...
1,1165160117,SWAT4LS,Semantic Web Applications and Tools for Life S...
2,1192093291,TRIDENTCOM,Testbeds and Research Infrastructures for the ...
3,1199066382,BIOINFORMATICS,International Conference on Bioinformatics
4,1201746639,AIS,Autonomous and Intelligent Systems
...,...,...,...
4533,2754809603,IPSS,IEEE International Power Sources Symposium
4534,2756271167,ECMS,European Conference on Modelling and Simulation
4535,2756896743,CAI,Conference on Algebraic Informatics
4536,2757378734,UPGRADE-CN,"Use of P2P, GRID and Agents for the Developmen..."


## Preprocess of Papers CSV
The Papers CSV is going to be processed in chunks, due to its size.

The following operations are going to be executed:
* Drop of the useless columns
* Filtering of papers without DOI
* Filtering papers that are not related to conferences
* Drop of the doctype column
* Write of the processed file on disk (in CSV format)

In [10]:
# ******************* PAPERS ********************

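# Read of a previously preprocessed CSV
# NOTE: read_preprocessed_papers and preprocessed_papers_csv_path are not defined
# in the cells above; set them before running this cell.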
df_mag_papers = None
if read_preprocessed_papers:
    df_mag_papers = pd.read_csv(preprocessed_papers_csv_path, low_memory=False, index_col=0)
else:
    # The Papers CSV is going to be processed in chunks, due to its size

    # The column names follow the official MAG schema documentation
    df_mag_papers_col_names = ['PaperID', 'Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle', 'BookTitle', 'Year', 'Date', 'OnlineDate', 'Publisher', 'JournalID', 'ConferenceSeriesID', 'ConferenceInstanceID', 'Volume', 'Issue', 'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount', 'EstimatedCitation', 'OriginalVenue', 'FamilyID', 'FamilyRank', 'Retracion', 'CreatedDate']

    # List of processed chunks.
    df_mag_papers_list_of_chunks = list()

    # Define of the chunk size
    chunksize = 10 ** 7

    count = 1
    with pd.read_csv(path_file_import + 'Papers.txt', sep='\t', chunksize=chunksize, low_memory=False, on_bad_lines='skip', names=df_mag_papers_col_names) as reader:
        for chunk in reader:

            # Drop of the useless columns
            chunk = chunk.drop(columns=['Rank', 'OnlineDate', 'Publisher', 'Volume', 'Issue', 'FirstPage', 'LastPage', 'ReferenceCount', 'OriginalVenue', 'FamilyID', 'FamilyRank', 'Retracion', 'CreatedDate', 'JournalID', 'BookTitle', 'Date'])

            # Filtering of papers without DOI
            chunk = chunk.dropna(subset = ['Doi'])

            # Filtering papers that are not related to conferences
            chunk = chunk[chunk.DocType == 'Conference']

            # Drop of the doctype column
            chunk = chunk.drop(columns=['DocType'])

            # Insert of the resulting chunk in the list 
            df_mag_papers_list_of_chunks.append(chunk)

            print(f'Successfully processed chunk {count} out of around {260000000 // chunksize}')
            count += 1

    # Concatenation of the processed chunks
    df_mag_papers = pd.concat(df_mag_papers_list_of_chunks)

    # Empty the list to free some memory
    df_mag_papers_list_of_chunks = list()

    # Write of the resulting CSV on Disk
    df_mag_papers.to_csv(path_file_export + 'out_mag_papers.csv')

df_mag_papers

Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation
37,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014.0,1.131603e+09,4038532.0,12.0,12
39,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014.0,1.154039e+09,157008481.0,10.0,10
68,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013.0,1.196984e+09,,20.0,20
197,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002.0,1.192665e+09,,0.0,0
666,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006.0,1.176896e+09,,19.0,19
...,...,...,...,...,...,...,...,...,...
259718386,3102242761,10.1109/IECON43393.2020.9254316,loss reduction by synchronous rectification in...,Loss Reduction by Synchronous Rectification in...,2020.0,2.623572e+09,,0.0,0
259718500,3136855299,10.1109/BMSB49480.2020.9379806,data over cable services improving the bicm ca...,Data Over Cable Services – Improving the BICM ...,2020.0,2.623662e+09,,0.0,0
259718537,3145351916,10.1109/ACC.1988.4172843,model reference robust adaptive control withou...,Model Reference Robust Adaptive Control withou...,1988.0,2.238538e+09,,0.0,0
259718570,3151696876,10.1109/ICASSP.2002.1005676,missing data speech recognition in reverberant...,Missing data speech recognition in reverberant...,2002.0,1.121228e+09,,0.0,0
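The chunked steps above can also be sketched on a small in-memory sample. This is an illustrative variant, not the cell used to produce the output: it passes `usecols` so that unneeded columns are never loaded, instead of dropping them after reading. The sample data and the reduced four-column layout are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample with four columns and no header row.
SAMPLE = (
    "1\t10.1/a\tConference\tPaper A\n"
    "2\t\tConference\tPaper B\n"
    "3\t10.1/c\tJournal\tPaper C\n"
    "4\t10.1/d\tConference\tPaper D\n"
)
col_names = ["PaperID", "Doi", "DocType", "PaperTitle"]

kept = []
with pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    names=col_names,
    usecols=["PaperID", "Doi", "DocType"],  # PaperTitle is never loaded
    chunksize=2,
) as reader:
    for chunk in reader:
        # Filter out papers without a DOI
        chunk = chunk.dropna(subset=["Doi"])
        # Keep only conference papers, then drop the helper column
        chunk = chunk[chunk["DocType"] == "Conference"]
        kept.append(chunk.drop(columns=["DocType"]))

papers = pd.concat(kept, ignore_index=True)
print(papers)
```

Only the filtered rows of each chunk are kept, so peak memory is bounded by the chunk size plus the accumulated result.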


## Merge of Conferences and Papers Data

In [11]:
# ******************* MERGE OF CONFERENCES AND PAPERS DATA ********************

# Merge of conferences and papers data on the ConferenceSeriesID column to get the conference series name
# Paper rows without a match are preserved
df_mag_preprocessed = pd.merge(df_mag_papers, df_mag_conf_series, on=['ConferenceSeriesID'], how='left')

# Merge of conferences and papers data on the ConferenceInstanceID column
# Paper rows without a match are preserved
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_mag_conf_instances, on=['ConferenceInstanceID'], how='left')

# Drop of the duplicated columns
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceSeriesID_y'])
df_mag_preprocessed.rename(columns = {'ConferenceSeriesID_x':'ConferenceSeriesID'}, inplace=True)

# Removing broken data (four records seem to have mismatched types in some columns)
df_mag_preprocessed = df_mag_preprocessed.dropna(subset = ['CitationCount'])

# Write of the resulting CSV on Disk
df_mag_preprocessed.to_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv')

df_mag_preprocessed

Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014.0,1.131603e+09,4038532.0,12.0,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014.0,1.154039e+09,157008481.0,10.0,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013.0,1.196984e+09,,20.0,20,ENTER,Information and Communication Technologies in ...,,,
3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002.0,1.192665e+09,,0.0,0,DEXA,Database and Expert Systems Applications,,,
4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006.0,1.176896e+09,,19.0,19,ICAISC,International Conference on Artificial Intelli...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409811,3102242761,10.1109/IECON43393.2020.9254316,loss reduction by synchronous rectification in...,Loss Reduction by Synchronous Rectification in...,2020.0,2.623572e+09,,0.0,0,IECON,Conference of the Industrial Electronics Society,,,
4409812,3136855299,10.1109/BMSB49480.2020.9379806,data over cable services improving the bicm ca...,Data Over Cable Services – Improving the BICM ...,2020.0,2.623662e+09,,0.0,0,BMSB,International Symposium on Broadband Multimedi...,,,
4409813,3145351916,10.1109/ACC.1988.4172843,model reference robust adaptive control withou...,Model Reference Robust Adaptive Control withou...,1988.0,2.238538e+09,,0.0,0,ACC,American Control Conference,,,
4409814,3151696876,10.1109/ICASSP.2002.1005676,missing data speech recognition in reverberant...,Missing data speech recognition in reverberant...,2002.0,1.121228e+09,,0.0,0,ICASSP,"International Conference on Acoustics, Speech,...",,,
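
As a minimal sketch of what the left merge on ConferenceInstanceID does, the toy example below uses made-up stand-ins (`papers`, `instances`) for the notebook's DataFrames. It shows that `how='left'` keeps papers whose instance ID is missing or unmatched, and that a shared non-key column receives the `_x`/`_y` suffixes that the cell then cleans up. The `indicator=True` flag is an addition for illustration only; it is not used in the original cell.

```python
import pandas as pd

# Toy stand-ins for the papers and conference instances tables (values are made up)
papers = pd.DataFrame({
    "PaperID": [1, 2, 3],
    "ConferenceSeriesID": [10, 20, 10],
    "ConferenceInstanceID": [100.0, None, 999.0],
})
instances = pd.DataFrame({
    "ConferenceInstanceID": [100.0],
    "ConferenceSeriesID": [10],
    "ConferenceDisplayName": ["CONF 2014"],
})

# Left merge: every paper row is kept, matched or not.
# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(papers, instances, on="ConferenceInstanceID",
                  how="left", indicator=True)

# ConferenceSeriesID exists in both inputs, so pandas suffixes it with _x and _y
suffixed_columns = [c for c in merged.columns if c.endswith(("_x", "_y"))]

# Same cleanup as in the notebook cell: keep the papers' copy of the column
cleaned = (
    merged.drop(columns=["ConferenceSeriesID_y"])
          .rename(columns={"ConferenceSeriesID_x": "ConferenceSeriesID"})
)

print(merged["_merge"].value_counts())
print(cleaned)
```

Counting the `left_only` rows gives a quick estimate of how many papers have no conference instance data, which would correspond to the empty trailing columns in the output above.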


Reload the exported CSV to verify that the export succeeded.

In [12]:
# Reload the exported CSV as a sanity check
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv', low_memory=False)
df_mag_exported_csv

Unnamed: 0.1,Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014.0,1.131603e+09,4038532.0,12.0,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014.0,1.154039e+09,157008481.0,10.0,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013.0,1.196984e+09,,20.0,20,ENTER,Information and Communication Technologies in ...,,,
3,3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002.0,1.192665e+09,,0.0,0,DEXA,Database and Expert Systems Applications,,,
4,4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006.0,1.176896e+09,,19.0,19,ICAISC,International Conference on Artificial Intelli...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409807,4409811,3102242761,10.1109/IECON43393.2020.9254316,loss reduction by synchronous rectification in...,Loss Reduction by Synchronous Rectification in...,2020.0,2.623572e+09,,0.0,0,IECON,Conference of the Industrial Electronics Society,,,
4409808,4409812,3136855299,10.1109/BMSB49480.2020.9379806,data over cable services improving the bicm ca...,Data Over Cable Services – Improving the BICM ...,2020.0,2.623662e+09,,0.0,0,BMSB,International Symposium on Broadband Multimedi...,,,
4409809,4409813,3145351916,10.1109/ACC.1988.4172843,model reference robust adaptive control withou...,Model Reference Robust Adaptive Control withou...,1988.0,2.238538e+09,,0.0,0,ACC,American Control Conference,,,
4409810,4409814,3151696876,10.1109/ICASSP.2002.1005676,missing data speech recognition in reverberant...,Missing data speech recognition in reverberant...,2002.0,1.121228e+09,,0.0,0,ICASSP,"International Conference on Acoustics, Speech,...",,,
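
The extra `Unnamed: 0` column in the reloaded data most likely comes from `to_csv` writing the DataFrame index by default: `read_csv` then reads it back as an unnamed regular column. The sketch below reproduces this on a tiny made-up frame, using in-memory buffers instead of the notebook's export path, and shows that passing `index=False` avoids the extra column. Whether the old index is worth keeping is a choice left to the notebook's author.

```python
import io
import pandas as pd

# Tiny stand-in for the preprocessed DataFrame
df = pd.DataFrame({
    "PaperID": [14558443, 15354235],
    "CitationCount": [12.0, 10.0],
})

# Default behaviour: the index is written as a first column without a header
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = list(pd.read_csv(buf_with_index).columns)

# With index=False only the real columns are written
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_with_index)
print(cols_no_index)
```

An alternative that keeps the index in the file is to read it back with `pd.read_csv(..., index_col=0)`.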


Sort by citation count in descending order to see the most-cited papers.

In [13]:
# Sort by citation count in descending order to see the most-cited papers
df_mag_exported_csv = df_mag_exported_csv.sort_values(by='CitationCount', ascending=False)
df_mag_exported_csv

Unnamed: 0.1,Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
4392494,4392498,2194775991,10.1109/CVPR.2016.90,deep residual learning for image recognition,Deep Residual Learning for Image Recognition,2016.0,1.158168e+09,2.334864e+09,62329.0,75544,CVPR,Computer Vision and Pattern Recognition,cvpr 2016,CVPR 2016,"Las Vegas, Nevada, USA"
176794,176795,2152195021,10.1109/ICNN.1995.488968,particle swarm optimization,Particle swarm optimization,2002.0,1.174935e+09,,26215.0,49377,ICON,International Conference on Networks,,,
562266,562267,2161969291,10.1109/CVPR.2005.177,histograms of oriented gradients for human det...,Histograms of oriented gradients for human det...,2005.0,1.158168e+09,2.786361e+09,23180.0,36647,CVPR,Computer Vision and Pattern Recognition,cvpr 2005,CVPR 2005,"San Diego, CA, USA"
3702319,3702323,2108598243,10.1109/CVPR.2009.5206848,imagenet a large scale hierarchical image data...,ImageNet: A large-scale hierarchical image dat...,2009.0,1.158168e+09,1.702096e+08,22980.0,28822,CVPR,Computer Vision and Pattern Recognition,cvpr 2009,CVPR 2009,"Miami Beach, Florida"
4005340,4005344,1901129140,10.1007/978-3-319-24574-4_28,u net convolutional networks for biomedical im...,U-Net: Convolutional Networks for Biomedical I...,2015.0,1.129325e+09,1.333575e+08,20853.0,26844,MICCAI,Medical Image Computing and Computer-Assisted ...,miccai 2015,MICCAI 2015,Munich
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2503260,2503263,1579681455,10.1007/11551898_8,an indexing approach for representing multimed...,An indexing approach for representing multimed...,2005.0,1.168181e+09,,0.0,0,MMEDIA,Advances in Multimedia,,,
363675,363676,1619526676,10.1109/SICON.1995.526046,eps a parallel execution environment,EPS: a parallel execution environment,1995.0,1.174935e+09,,0.0,0,ICON,International Conference on Networks,,,
363676,363677,1842712729,10.1109/MWSYM.1994.335581,competitive dual use mmic technologies and pro...,Competitive dual use MMIC technologies and pro...,1994.0,2.622909e+09,,0.0,0,IMS,International Microwave Symposium,,,
2503252,2503255,1162637688,10.1109/SYSOSE.2012.6333666,the study of assessing index of electronic sys...,The study of assessing index of electronic sys...,2012.0,2.624241e+09,2.625964e+09,0.0,0,SoSE,International Conference on System of Systems ...,sose 2012,SoSE 2012,"Genova, Italy"
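
When only the top few papers are of interest, `DataFrame.nlargest` is a possible alternative to sorting the whole frame. The sketch below is not part of the original notebook; it uses a small stand-in frame (`df_sample`, a hypothetical name) built from four rows of the output above.

```python
import pandas as pd

# Small stand-in built from rows shown in the sorted output above
df_sample = pd.DataFrame({
    "PaperTitle": [
        "particle swarm optimization",
        "eps a parallel execution environment",
        "deep residual learning for image recognition",
        "histograms of oriented gradients for human det...",
    ],
    "CitationCount": [26215.0, 0.0, 62329.0, 23180.0],
})

# Keep only the two most-cited rows, already ordered from highest to lowest
top_cited = df_sample.nlargest(2, "CitationCount")

print(top_cited)
```

Unlike the cell above, this leaves the original frame and its row order untouched.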
