
# Preprocess DBLP Dataset

Jupyter Notebook for the preprocessing of the DBLP dump.

## TODO TODO TODO

For this process, the following CSV files are needed: ```ConferenceInstances.txt```, ```ConferenceSeries.txt```, ```Papers.txt```. 
The above files can be found here: https://archive.org/download/mag-2021-06-07/mag/

In particular, the following operations are executed:
* Opening the ConferenceInstances and ConferenceSeries CSVs
* Dropping the unneeded columns
* Chunked processing of the Papers CSV
    * Dropping the unneeded columns
    * Dropping papers without a DOI
    * Dropping rows for papers from journals and books
* Merging with the processed conference data

Lastly, the entire preprocessed dump is saved to disk in CSV format.
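
The first two steps in the list above (opening a conference file and dropping unneeded columns) can be sketched as follows. This is only a sketch under stated assumptions: the helper `load_tsv_columns` is hypothetical, the column names and rows are made up for illustration, and an in-memory buffer stands in for a headerless, tab-separated file. The real column list should be taken from the MAG schema documentation.

```python
import io
import pandas as pd

def load_tsv_columns(source, all_columns, keep_columns):
    """Read a headerless tab-separated file and keep only some of its columns."""
    return pd.read_csv(source, sep='\t', names=all_columns, usecols=keep_columns)

# Hypothetical column names and data, for illustration only
sample_columns = ['ConferenceSeriesID', 'Rank', 'NormalizedName']
sample_file = io.StringIO('1\t100\tABC\n2\t200\tXYZ\n')

df_sample_series = load_tsv_columns(
    sample_file, sample_columns, ['ConferenceSeriesID', 'NormalizedName']
)
print(df_sample_series)
```

Selecting columns with `usecols` at read time avoids loading columns that would be dropped immediately afterwards.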

In [1]:
# Libraries Import
import pandas as pd

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
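
The cells in this notebook build file paths with `+`, so each directory path must end with a path separator. As a hedged alternative, `pathlib` can join the parts regardless of a trailing separator. The helper `build_path` and the example directory below are hypothetical.

```python
from pathlib import Path

def build_path(directory, file_name):
    """Join a directory and a file name without relying on a trailing slash."""
    return str(Path(directory) / file_name)

# Hypothetical directory, for illustration only
example_import_dir = '/data/dumps/import'
print(build_path(example_import_dir, 'out_dblp_raw_inproceedings.csv'))
```

The returned string can be passed to `pd.read_csv` or `to_csv` in place of the concatenated path.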

## Dump Conversion from XML to CSV
### About the Conversion Script that is Going to Be Used
For this initial part of the work, we're going to use a conversion script written specifically for this purpose by *Thom Hurks*. All rights to that script belong to him.

For simplicity, **a copy of the script has already been included in the present folder**.
You can read more about how the script works and its other command-line options in its original [GitHub Repository](https://github.com/ThomHurks/dblp-to-csv).

### What is Needed
Before proceeding, you first need to download and unpack the DBLP XML file from [here](https://dblp.org/xml/release/). You also need to download the latest DTD file, which you can find on the same page.

### Running the Conversion Script
To run the conversion script, you first need to place the XML and DTD files in the same folder as the script. 
Then, you can run the following shell commands: <br>
```chmod +x XMLToCSV.py ```<br>
```./XMLToCSV.py dblp.xml dblp.dtd out_dblp_raw.csv```<br><br>
**NOTE**: Be sure to specify the correct XML and DTD file names.
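
Before running the commands, it may help to check that the script, the XML file and the DTD file really are in the same folder. The following is a minimal sketch: the helper `missing_inputs` is hypothetical, and the demonstration uses a temporary folder rather than the real files.

```python
import tempfile
from pathlib import Path

def missing_inputs(folder, file_names):
    """Return the names from file_names that do not exist in folder."""
    folder = Path(folder)
    return [name for name in file_names if not (folder / name).is_file()]

# Demonstration on a temporary folder that contains only the XML file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'dblp.xml').write_text('<dblp/>')
    missing_in_demo = missing_inputs(tmp, ['XMLToCSV.py', 'dblp.xml', 'dblp.dtd'])

print(missing_in_demo)
```

An empty list means that all the expected files are present.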

### Deleting Unnecessary Output Files
The script generates one output file for each element type in the XML file (such as articles, books, etc.). We only need the CSV file for **inproceedings**. In this example, its name is going to be *out_dblp_raw_inproceedings.csv*.

If you named the output files differently, you can specify your file name in the following cell.


In [3]:
csv_inproceedings_name = 'out_dblp_raw_inproceedings.csv'

**NOTE**: Please copy the inproceedings file into the Import directory specified above.

## Preprocess of the DBLP Dump
In this phase we're going to use the CSV file generated from the previous conversion phase.

In [4]:
# Read of the DBLP Raw Dump

df_dblp = pd.read_csv(path_file_import + csv_inproceedings_name, sep=';', low_memory=False)
df_dblp

Unnamed: 0,id,author,author-aux,author-orcid,booktitle,cdrom,cite,cite-label,crossref,editor,editor-orcid,ee,ee-type,i,key,mdate,month,note,note-type,number,pages,publtype,sub,sup,title,title-bibtex,tt,url,volume,year
0,3062806,Arnon Rosenthal,,,SWEE,,,,conf/swee/1998,,,http://www.mitre.org/support/swee/rosenthal.html,,,www/org/mitre/future,2019-07-30,,,,,,,,,The Future of Classic Data Administration: Obj...,,,db/conf/swee/swee1998.html,,1998
1,3062808,Qiming Chen|Umeshwar Dayal,,,CoopIS,,...|...|...|books/mk/GrayR93|books/mk/elmagarm...,,conf/coopis/2000,,,https://doi.org/10.1007/10722620_29,,,conf/coopis/ChenD00,2017-05-24,,,,,311-322,,,,Multi-Agent Cooperative Transactions for E-Com...,,,db/conf/coopis/coopis2000.html#ChenD00,,2000
2,3062809,Emmanuel Cecchet|Renaud Lachaize|Takoua Abdell...,,,CoopIS/DOA/ODBASE (2),,,,conf/coopis/2004-2,,,https://doi.org/10.1007/978-3-540-30469-2_46,,,conf/coopis/AbdellatifCL04,2017-05-25,,,,,1571-1589,,,,Evaluation of a Group Communication Middleware...,,,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,,2004
3,3062810,Robert Grob|Stefanie Kethers|Stephan Jacobs,,,CoopIS,,,,conf/coopis/1994,,,,,,conf/coopis/GrobJK94,2019-08-08,,,,,134-145,,,,Towards CIS in Quality Management - Integratio...,,,db/conf/coopis/coopis94.html#GrobJK94,,1994
4,3062811,Evaggelia Pitoura|George Samaras|Panos K. Chry...,,,CoopIS,,...|...|...|...|...|...|...|...|...|...|...|.....,,conf/coopis/2000,,,https://doi.org/10.1007/10722620_9,,,conf/coopis/PapastavrouCSP00,2017-05-24,,,,,102-113,,,,An Evaluation of the Java-Based Approaches to ...,,,db/conf/coopis/coopis2000.html#PapastavrouCSP00,,2000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990257,8958667,Tarek Richard Besold,,0000-0002-8002-0049,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_9,,,series/sapere/Besold13,2019-09-06,,,,,121-132,,,,Turing Revisited: A Cognitively-Inspired Decom...,,,db/series/sapere/sapere5.html#Besold13,,2011
2990258,8958671,Pierre Steiner,,,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_20,,,series/sapere/Steiner13,2019-09-06,,,,,265-276,,,,C.S. Peirce and Artificial Intelligence: Histo...,,,db/series/sapere/sapere5.html#Steiner13,,2011
2990259,8958672,Stuart Armstrong,,,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_25,,,series/sapere/Armstrong13,2019-09-06,,,,,335-347,,,,Risks and Mitigation Strategies for Oracle AI.,,,db/series/sapere/sapere5.html#Armstrong13,,2011
2990260,8958682,Sam Freed,,,PT-AI,,,,series/sapere/2013-5,,,https://doi.org/10.1007/978-3-642-31674-6_12,,,series/sapere/Freed13,2019-09-06,,,,,167-177,,,,Practical Introspection as Inspiration for AI.,,,db/series/sapere/sapere5.html#Freed13,,2011


Here the columns that are not needed are removed from the dataframe.

In [5]:
df_dblp = df_dblp.drop(columns=['author', 'author-aux', 'author-orcid', 'booktitle', 'cdrom', 'cite', 'cite-label', 'editor', 'editor-orcid', 'ee-type', 'i', 'mdate', 'month', 'note', 'note-type', 'number', 'pages', 'publtype', 'sub', 'sup', 'title', 'title-bibtex', 'tt', 'volume'])
df_dblp

Unnamed: 0,id,crossref,ee,key,url,year
0,3062806,conf/swee/1998,http://www.mitre.org/support/swee/rosenthal.html,www/org/mitre/future,db/conf/swee/swee1998.html,1998
1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html#ChenD00,2000
2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,2004
3,3062810,conf/coopis/1994,,conf/coopis/GrobJK94,db/conf/coopis/coopis94.html#GrobJK94,1994
4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html#PapastavrouCSP00,2000
...,...,...,...,...,...,...
2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html#Besold13,2011
2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html#Steiner13,2011
2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html#Armstrong13,2011
2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html#Freed13,2011
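
As a possible alternative to reading every column and dropping most of them, `pd.read_csv` can be asked to load only the needed columns through its `usecols` parameter. The sketch below keeps the same six columns that remain after the `drop` call; the sample rows are made up and an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

columns_to_keep = ['id', 'crossref', 'ee', 'key', 'url', 'year']

# Tiny stand-in for the semicolon-separated inproceedings CSV (made-up rows)
sample_csv = io.StringIO(
    'id;author;crossref;ee;key;title;url;year\n'
    '1;A. Author;conf/x/2000;https://doi.org/10.1000/a;conf/x/A00;Some Title;db/conf/x/x2000.html#A00;2000\n'
    '2;B. Author;conf/x/2001;;conf/x/B01;Other Title;db/conf/x/x2001.html#B01;2001\n'
)

# Only the listed columns are parsed and kept in memory
df_sample = pd.read_csv(sample_csv, sep=';', usecols=columns_to_keep)
print(df_sample.columns.tolist())
```

On a file with millions of rows this can reduce memory use, since the dropped columns are never materialized.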


Remove the papers without a DOI: only the rows whose `ee` field contains a `https://doi.org/` link are kept.

In [6]:
df_dblp = df_dblp.loc[df_dblp.ee.str.contains("https://doi.org/", na=False, regex=False)]
df_dblp

Unnamed: 0,id,crossref,ee,key,url,year
1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html#ChenD00,2000
2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,2004
4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html#PapastavrouCSP00,2000
5,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html#BultzingsloewenKK97,1997
6,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html#GiacolettoA03,2003
...,...,...,...,...,...,...
2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html#Besold13,2011
2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html#Steiner13,2011
2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html#Armstrong13,2011
2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html#Freed13,2011
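
The filter keeps the full `ee` string. If the bare DOI is needed later, it could be extracted as in the sketch below. This assumes that multiple links in `ee` are joined with `|`, as the `author` column in the raw dump suggests; the helper `extract_doi` and the sample values are hypothetical.

```python
import pandas as pd

DOI_PREFIX = 'https://doi.org/'

def extract_doi(ee_value):
    """Return the DOI from the first doi.org URL in a '|'-separated string, else None."""
    if not isinstance(ee_value, str):
        return None  # missing values carry no DOI
    for url in ee_value.split('|'):
        if url.startswith(DOI_PREFIX):
            return url[len(DOI_PREFIX):]
    return None

# Made-up values, for illustration only
sample = pd.DataFrame({'ee': [
    'https://doi.org/10.1000/abc',
    'http://example.org/paper.html|https://doi.org/10.1000/xyz',
    'http://example.org/other.html',
    None,
]})
sample_dois = [extract_doi(value) for value in sample['ee']]
print(sample_dois)
```

Rows for which the helper returns `None` are the ones the filter above would discard.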


Saving the resulting dataframe on disk in CSV format.

In [7]:
# Write of the resulting CSV on Disk
df_dblp.to_csv(path_file_export + 'out_dblp.csv')

Re-read the exported CSV to check that the export succeeded.

In [8]:
# Check of the Exported CSV
df_dblp_exported_csv = pd.read_csv(path_file_export + 'out_dblp.csv', low_memory=False)
df_dblp_exported_csv

Unnamed: 0.1,Unnamed: 0,id,crossref,ee,key,url,year
0,1,3062808,conf/coopis/2000,https://doi.org/10.1007/10722620_29,conf/coopis/ChenD00,db/conf/coopis/coopis2000.html#ChenD00,2000
1,2,3062809,conf/coopis/2004-2,https://doi.org/10.1007/978-3-540-30469-2_46,conf/coopis/AbdellatifCL04,db/conf/coopis/coopis2004-2.html#AbdellatifCL04,2004
2,4,3062811,conf/coopis/2000,https://doi.org/10.1007/10722620_9,conf/coopis/PapastavrouCSP00,db/conf/coopis/coopis2000.html#PapastavrouCSP00,2000
3,5,3062812,conf/coopis/97,http://doi.ieeecomputersociety.org/10.1109/COO...,conf/coopis/BultzingsloewenKK97,db/conf/coopis/coopis97.html#BultzingsloewenKK97,1997
4,6,3062813,conf/coopis/2003,https://doi.org/10.1007/978-3-540-39964-3_50,conf/coopis/GiacolettoA03,db/conf/coopis/coopis2003.html#GiacolettoA03,2003
...,...,...,...,...,...,...,...
2481947,2990257,8958667,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_9,series/sapere/Besold13,db/series/sapere/sapere5.html#Besold13,2011
2481948,2990258,8958671,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_20,series/sapere/Steiner13,db/series/sapere/sapere5.html#Steiner13,2011
2481949,2990259,8958672,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_25,series/sapere/Armstrong13,db/series/sapere/sapere5.html#Armstrong13,2011
2481950,2990260,8958682,series/sapere/2013-5,https://doi.org/10.1007/978-3-642-31674-6_12,series/sapere/Freed13,db/series/sapere/sapere5.html#Freed13,2011
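
The additional `Unnamed: 0` column in the output above is the dataframe index, which `to_csv` writes by default. A minimal sketch of the round trip, using a small made-up dataframe and a temporary file, shows that passing `index_col=0` to `pd.read_csv` restores it as the index instead:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Small stand-in dataframe with a non-contiguous index, like the filtered one
df_small = pd.DataFrame({'id': [10, 20], 'year': [2000, 2004]}, index=[1, 4])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'roundtrip.csv'
    df_small.to_csv(csv_path)
    df_default = pd.read_csv(csv_path)               # saved index becomes 'Unnamed: 0'
    df_indexed = pd.read_csv(csv_path, index_col=0)  # saved index is restored

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Alternatively, `to_csv(..., index=False)` omits the index from the file altogether, at the cost of losing the original row labels.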


## Preprocess of Papers CSV
The Papers CSV is going to be processed in chunks, due to its size.

The following operations are executed:
* Dropping the unneeded columns
* Removing papers without a DOI
* Removing papers that are not related to conferences
* Dropping the DocType column
* Writing the processed file to disk (in CSV format)

In [None]:
# ******************* PAPERS ********************

# Read of previously preprocessed CSV
df_mag_papers = None
if read_preprocessed_papers:
    df_mag_papers = pd.read_csv(preprocessed_papers_csv_path, low_memory=False, index_col=0)
else:
    # The Papers CSV is going to be processed in chunks, due to its size

    # The column names follow the official MAG schema documentation
    df_mag_papers_col_names = ['PaperID', 'Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle', 'BookTitle', 'Year', 'Date', 'OnlineDate', 'Publisher', 'JournalID', 'ConferenceSeriesID', 'ConferenceInstanceID', 'Volume', 'Issue', 'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount', 'EstimatedCitation', 'OriginalVenue', 'FamilyID', 'FamilyRank', 'Retracion', 'CreatedDate']

    # List of processed chunks.
    df_mag_papers_list_of_chunks = list()

    # Define of the chunk size
    chunksize = 10 ** 7

    count = 1
    with pd.read_csv(path_file_import + 'Papers.txt', sep='\t', chunksize=chunksize, low_memory=False, on_bad_lines='skip', names=df_mag_papers_col_names) as reader:
        for chunk in reader:

            # Drop of the useless columns
            chunk = chunk.drop(columns=['Rank', 'OnlineDate', 'Publisher', 'Volume', 'Issue', 'FirstPage', 'LastPage', 'ReferenceCount', 'OriginalVenue', 'FamilyID', 'FamilyRank', 'Retracion', 'CreatedDate', 'JournalID', 'BookTitle', 'Date'])

            # Filtering of papers without DOI
            chunk = chunk.dropna(subset = ['Doi'])

            # Filtering papers that are not related to conferences
            chunk = chunk[chunk.DocType == 'Conference']

            # Drop of the doctype column
            chunk = chunk.drop(columns=['DocType'])

            # Insert of the resulting chunk in the list 
            df_mag_papers_list_of_chunks.append(chunk)

            print(f'Successfully processed chunk {count} out of around {260000000 // chunksize}')
            count += 1

    # Concatenation of the processed chunks
    df_mag_papers = pd.concat(df_mag_papers_list_of_chunks)

    # Empty the list to free some memory
    df_mag_papers_list_of_chunks = list()

    # Write of the resulting CSV on Disk
    df_mag_papers.to_csv(path_file_export + 'out_mag_papers.csv')

df_mag_papers
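
If holding all the filtered chunks in memory before concatenating them is a concern, each chunk could instead be written to the output as soon as it is processed. The following is a sketch of that variant, not the approach used in the cell above: the helper `filter_chunks_to_csv`, the reduced column list and the sample rows are hypothetical, and in-memory buffers stand in for the real files.

```python
import io
import pandas as pd

def filter_chunks_to_csv(source, output, column_names, chunksize):
    """Filter a headerless TSV chunk by chunk, writing each result to one open handle."""
    header_needed = True
    rows_written = 0
    with pd.read_csv(source, sep='\t', names=column_names, chunksize=chunksize) as reader:
        for chunk in reader:
            # Keep only conference papers that have a DOI
            chunk = chunk.dropna(subset=['Doi'])
            chunk = chunk[chunk['DocType'] == 'Conference'].drop(columns=['DocType'])
            # The header is written once, with the first chunk
            chunk.to_csv(output, header=header_needed)
            header_needed = False
            rows_written += len(chunk)
    return rows_written

# Made-up rows, for illustration only
sample_names = ['PaperID', 'Doi', 'DocType', 'CitationCount']
sample_tsv = io.StringIO(
    '1\t10.1000/a\tConference\t5\n'
    '2\t\tConference\t3\n'
    '3\t10.1000/c\tJournal\t7\n'
    '4\t10.1000/d\tConference\t1\n'
)
sample_out = io.StringIO()
n_rows = filter_chunks_to_csv(sample_tsv, sample_out, sample_names, chunksize=2)
print(n_rows)
print(sample_out.getvalue())
```

With a real file, `output` would be a handle opened in write mode, for example `open(path, 'w', newline='')`, so that a previous export is overwritten rather than extended.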

## Merge of Conferences and Papers Data

In [None]:
# ******************* MERGE OF CONFERENCES AND PAPERS DATA ********************

# Merge of conferences and papers data over the ConferenceSeriesID column to get the conference series name
# Paper rows without a match are preserved
df_mag_preprocessed = pd.merge(df_mag_papers, df_mag_conf_series, on=['ConferenceSeriesID'], how='left')

# Merge of conferences and papers data over the ConferenceInstanceID column
# Paper rows without a match are preserved
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_mag_conf_instances, on=['ConferenceInstanceID'], how='left')

# Drop of the duplicated columns
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceSeriesID_y'])
df_mag_preprocessed.rename(columns = {'ConferenceSeriesID_x':'ConferenceSeriesID'}, inplace=True)

# Removing broken data (four records seem to have mismatched types in some columns)
df_mag_preprocessed = df_mag_preprocessed.dropna(subset = ['CitationCount'])

# Write of the resulting CSV on Disk
df_mag_preprocessed.to_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv')

df_mag_preprocessed
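
The `_x` and `_y` suffixes dropped and renamed above are added by `pd.merge` when both tables contain a column with the same name that is not a merge key. The sketch below reproduces this on miniature, made-up tables; the column `InstanceName` is hypothetical.

```python
import pandas as pd

# Made-up miniature tables, for illustration only
papers = pd.DataFrame({
    'PaperID': [1, 2, 3],
    'ConferenceSeriesID': [100, 100, 200],
    'ConferenceInstanceID': [10, 11, 99],
})
instances = pd.DataFrame({
    'ConferenceInstanceID': [10, 11],
    'ConferenceSeriesID': [100, 100],
    'InstanceName': ['CONF 2000', 'CONF 2001'],
})

# A left merge keeps every paper, even without a matching instance
merged = pd.merge(papers, instances, on=['ConferenceInstanceID'], how='left')
merged_columns = merged.columns.tolist()
print(merged_columns)

# Keep the papers' copy of the overlapping column under its original name
merged = merged.drop(columns=['ConferenceSeriesID_y'])
merged = merged.rename(columns={'ConferenceSeriesID_x': 'ConferenceSeriesID'})
print(merged)
```

The paper with `ConferenceInstanceID` 99 has no matching instance, so its `InstanceName` is missing after the merge, but the row itself is kept.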

Re-read the exported CSV to check that the export succeeded.

In [None]:
# Check of the Exported CSV
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv', low_memory=False)
df_mag_exported_csv

Sort by citation count in descending order to see the most cited articles.

In [None]:
# Order by citations count descending to see the articles with the most citations
df_mag_exported_csv = df_mag_exported_csv.sort_values(by='CitationCount', ascending=False)
df_mag_exported_csv
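
When only the top few rows are of interest, `DataFrame.nlargest` is a possible alternative to sorting the whole dataframe. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up citation counts, for illustration only
sample = pd.DataFrame({'PaperID': [1, 2, 3, 4], 'CitationCount': [5, 42, 17, 0]})

# The two rows with the highest citation count, in descending order
top_two = sample.nlargest(2, 'CitationCount')
print(top_two)
```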