DBLP Dataset Parser


This is a Python parser for the DBLP dataset; the XML dump file can be downloaded from the DBLP homepage.

This parser requires the DTD file, so make sure you have both dblp-XXX.xml (the dataset) and dblp-XXX.dtd. Both files must be placed in the same directory, and the name of the DTD file should match the name given in the <!DOCTYPE> tag of the XML file. This information can easily be checked with the head dblp-XXX.xml command, as shown below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
...
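
This matters because the DBLP XML declares its character entities (and element structure) in the DTD, so a parser has to locate and load the DTD next to the XML file to resolve them. As a minimal, illustrative check (not code from this repository; it assumes lxml and uses a placeholder path), you can verify that the DTD is found and the dump parses:

from lxml import etree

# Illustrative only: load_dtd=True makes lxml read the DTD referenced in the
# <!DOCTYPE> declaration, so entities declared there can be resolved.
context = etree.iterparse('dataset/dblp.xml', events=('end',), load_dtd=True)
for _, elem in context:
    print(elem.tag)  # first fully parsed element, e.g. 'author'
    break            # stop immediately; this is just a smoke test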

A sample of how to use the parser:

# context_iter, log_msg and parse_article are provided by the parser module in src/
def main():
    dblp_path = 'dataset/dblp.xml'   # path to the DBLP XML dump (the DTD must sit next to it)
    save_path = 'article.json'       # output file for the extracted article records
    try:
        context_iter(dblp_path)
        log_msg("LOG: Successfully loaded \"{}\".".format(dblp_path))
    except IOError:
        log_msg("ERROR: Failed to load file \"{}\". Please check your XML and DTD files.".format(dblp_path))
        exit()
    parse_article(dblp_path, save_path, save_to_csv=False)  # saves as JSON by default

if __name__ == '__main__':
    main()
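
The call above writes the extracted article records to article.json. The exact on-disk layout is determined by parse_article, so the snippet below is only a rough sketch of loading the output for further processing; it defensively handles both a single JSON document and a one-record-per-line layout:

import json

# Illustrative only: try a single JSON document first, fall back to JSON lines.
with open('article.json', encoding='utf-8') as f:
    text = f.read().strip()
try:
    records = json.loads(text)
except json.JSONDecodeError:
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
print('loaded', len(records), 'article records')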

Some extracted results:

Counts of the different types of publications: general

Counts of the different attributes across all publications: all_feature

Counts of five different features of articles: article_feature

Distribution of articles by publication year: article_year
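
The first statistic (publication counts by type) can also be reproduced directly from the XML dump without this parser. A minimal sketch using lxml is shown below; the path is a placeholder, and the record tags are the top-level publication elements defined in the DBLP DTD:

from collections import Counter
from lxml import etree

# Illustrative only: count top-level publication records by tag name.
RECORD_TAGS = {'article', 'inproceedings', 'proceedings', 'book',
               'incollection', 'phdthesis', 'mastersthesis', 'www'}
counts = Counter()
for _, elem in etree.iterparse('dataset/dblp.xml', events=('end',), load_dtd=True):
    if elem.tag in RECORD_TAGS:
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children to keep memory bounded
print(counts)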
