PubCrawl

PubCrawl - The scientific publication crawling suite.

Overview

PubCrawl is a Python package for crawling scientific publications. The idea is to provide a simple interface to crawl scientific publications from different sources. The package is designed to be easily extensible to support new sources.

Current Status

The package is currently in a very early stage of development. The following sources are currently supported:

arXiv
pdfs - Plain PDF Files

Installation

To start, clone the repository and install the package using the requirements file:

git clone https://github.com/J0nasW/PubCrawl.git
cd pubcrawl

pip install -r requirements.txt

Usage

The package is designed to be used as a library. The following example shows how to use the package to crawl publications from arXiv.

Example: Crawling publications from arXiv

The following example shows how to crawl publications from arXiv. The example crawls all publications from the category cs.AI, processes them and saves the results to a JSON file. Note, that you have to provide a valid JSON file containing the metadata of all arXiv publications. You can download this file from the Kaggle dataset here.

python3 main.py --s arxiv -c cs.AI -f arxiv.json -p

API

The package can be called using the following arguments:

Argument	Type	Description
-s, --source	required	The source to crawl publications from.
-f, --file	optional	The file containing the metadata of all publications.
-l, --local_PDFS	optional	Use local PDFs and provide a PDF dicrectory instead of downloading them from GCP.
-c, --category	optional	The category to crawl publications from. Delimit multiple categories with a
-p, --process	optional	Whether to process the crawled publications.
-r, --rows	optional	Number of entries to process (can be useful in development mode)

Downstream Tasks

This package can be used in combination with other packages to perform downstream tasks. The following packages are currently available:

PubGraph

When cloned the repository, you can use the extra flag -d to activate downstream tasks. PubCrawl will then automatically clone the downstream task repository and install the package using the requirements file.

The following example shows how to crawl all publications from the category cs.AI and create a graph from them.

python3 main.py --s arxiv -c cs.AI -f arxiv.json -p -d pubgraph

Using PubGraph to create publication graphs

tbd

License & Contributions

Copyright (c) 2023, Jonas Wilinski A lot of code regarding the PDF2TXT conversion is taken from this repository, which is licensed under the MIT license. Thank you, Matt Bierbaum, Colin Clement, Kevin O'Keeffe, Alex Alemi for sharing your code!

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
datasources		datasources
helpers		helpers
pipeline		pipeline
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datasources

datasources

helpers

helpers

pipeline

pipeline

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

main.py

main.py

requirements.txt

requirements.txt

Repository files navigation

PubCrawl

Overview

Current Status

Installation

Usage

Example: Crawling publications from arXiv

API

Downstream Tasks

Using PubGraph to create publication graphs

License & Contributions

About

Languages

License

J0nasW/PubCrawl

Folders and files

Latest commit

History

Repository files navigation

PubCrawl

Overview

Current Status

Installation

Usage

Example: Crawling publications from arXiv

API

Downstream Tasks

Using PubGraph to create publication graphs

License & Contributions

About

Topics

Resources

License

Stars

Watchers

Forks

Languages