SiddharthaAnand/dblp-spider

dblp-spider

A spider written with Scrapy that crawls the dblp website to extract data about authors: their co-authors' names, the co-author communities they belong to, and the articles they have published. The extracted data is then used to build a co-authorship network graph.

Example of a co-authorship network

"Siddhartha Anand" "Partha Basuchowdhuri"
"Siddhartha Anand" "Khusbu Mishra"
...

The example above is an edge list with one edge per line (author_one->author_two). Each edge links two authors who have worked on a paper together, so the edge list as a whole describes the graph of co-authors.
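
As a rough illustration of how such an edge list can be consumed, the sketch below loads the quoted name pairs into an adjacency map. It assumes each line holds exactly two double-quoted names separated by whitespace and treats co-authorship as symmetric. The helper name `parse_edge_list` is hypothetical and is not part of this repository.

```python
import shlex
from collections import defaultdict


def parse_edge_list(lines):
    """Build an undirected adjacency map from quoted edge-list lines.

    Hypothetical helper: assumes the '"author one" "author two"' layout
    shown in the example above.
    """
    graph = defaultdict(set)
    for line in lines:
        names = shlex.split(line)  # honours the double quotes around names
        if len(names) != 2:
            continue  # skip blank lines and "..." placeholders
        author_one, author_two = names
        # Assumption: co-authorship is symmetric, so store both directions.
        graph[author_one].add(author_two)
        graph[author_two].add(author_one)
    return dict(graph)


edges = [
    '"Siddhartha Anand" "Partha Basuchowdhuri"',
    '"Siddhartha Anand" "Khusbu Mishra"',
]
graph = parse_edge_list(edges)
print(sorted(graph["Siddhartha Anand"]))
# ['Khusbu Mishra', 'Partha Basuchowdhuri']
```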

Dependencies

How to clone the repository

Run the following command:

$ git clone https://github.com/SiddharthaAnand/dblp-spider.git

This will clone this repository to your local system.

How to run the code

Make sure you are in the working directory of dblp-spider. Then run the following command:

$ scrapy crawl dblpspider [-o filename]

This starts the spider, which sends requests asynchronously and receives the responses. If the '-o' option is given, the extracted output is stored in the file you name.

Scrapy's built-in feed exports let you store the extracted data in several file formats, including .csv, .json and .jl. The .jl (JSON Lines) format writes one JSON object per line, so a large crawl can be read back one record at a time instead of as a single large JSON array.

I have used .jl just as an example:

$ scrapy crawl dblpspider -o dblp_data.jl
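
Because a .jl file holds one JSON object per line, it can be read record by record with Python's standard json module. The sketch below is a minimal reader, not code from this repository: the helper name `iter_items` is hypothetical, and an in-memory stream with one illustrative record stands in for the real output file.

```python
import io
import json


def iter_items(fp):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("dblp_data.jl", encoding="utf-8"); the record is illustrative.
sample = io.StringIO(
    '{"author_name": "Siddhartha Anand", '
    '"coauthors_name_list": ["Partha Basuchowdhuri"]}\n'
)
items = list(iter_items(sample))
for item in items:
    # Pretty-print each record with sorted keys and a four-space indent.
    print(json.dumps(item, indent=4, sort_keys=True))
```

To read a real crawl, pass the opened output file to `iter_items` in place of the in-memory stream.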

Sample JSON data

This is a sample of the data you might get once the crawl is over. You can optionally use Python's built-in json package to pretty-print the contents of the output file.

$ head dblp_data.jl
{
    "author_articles_published": [
        "Spanning tree-based fast community detection methods in social networks."
    ],
    "author_name": "Siddhartha Anand",
    "coauthor_communities_list": [
        "show coauthor community: group 1",
        "show coauthor community: group 1",
        "show coauthor community: group 1",
        "show coauthor community: group 1",
        "show coauthor community: group 1"
    ],
    "coauthors_name_list": [
        "Partha Basuchowdhuri",
        "Subhashis Majumder",
        "Riya Roy",
        "Sanjoy Kumar Saha",
        "Diksha Roy Srivastava"
    ]
}
...
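
One way to get from a crawled record like the one above to the edge list shown earlier is sketched below. It relies only on the `author_name` and `coauthors_name_list` fields from the sample. The function name `item_to_edges` is hypothetical, and the repository's own graph-building step may work differently.

```python
def item_to_edges(item):
    """Return edge-list lines ('"author" "co-author"') for one crawled item.

    Hypothetical helper; only the field names come from the sample data.
    """
    author = item["author_name"]
    return [
        '"{}" "{}"'.format(author, coauthor)
        for coauthor in item.get("coauthors_name_list", [])
    ]


# Record copied from the sample output above (unused fields omitted).
item = {
    "author_name": "Siddhartha Anand",
    "coauthors_name_list": [
        "Partha Basuchowdhuri",
        "Subhashis Majumder",
        "Riya Roy",
        "Sanjoy Kumar Saha",
        "Diksha Roy Srivastava",
    ],
}
edge_lines = item_to_edges(item)
for edge in edge_lines:
    print(edge)
```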

Licence

This project is licensed under the Apache License; see LICENSE.md for more details.

Future enhancements

  • Add a NoSQL database to store the extracted data
  • Deploy the spider on a server for large-scale crawls
  • Extract more data from dblp
  • Visualize the data using a visualization tool

Contributions

Contributions and suggestions of any kind are always welcome. You can modify the spider to extract even more data from dblp.

About

The code uses Scrapy to collect large-scale data in JSON format from http://dblp.uni-trier.de/.
