GitHub - eth-library-lab/inDexDa: Natural Language Processing of academic papers for dataset indexing

inDexDa - Natural Language Processing of academic papers for dataset identification and indexing.

An Initiative for human-centered Innovation in the Knowledge Sphere of the ETH Library Lab.

Getting Started

This project is divided into multiple sections, called pipelines, to make it more modular and easier to use and modify. The pipelines are as follows:

Pipeline	Description
PaperScraper	Used to comb through a specified online archive of academic papers in order to find papers relating to a field, topic, or search term. Stores these papers in a MongoDB database. See PaperScraper folder for more information and usage instructions.
NLP	Uses natural language processing techniques on the papers found by PaperScraper to determine whether each paper shows that a new dataset was created. If so, it stores this information in the MongoDB database for later use. See NaturalLanguageProcessing folder for more information and usage instructions.
Dataset Extraction	Collects information, such as links to the dataset, type of data used, and size of the dataset, from the papers that the BERT network predicts contain new datasets.
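
To make the hand-off between the pipelines concrete, here is a minimal sketch of the filter-then-extract flow. It is not the project's code: the `Paper` record, the function names, and the keyword heuristic are all hypothetical, and the heuristic only stands in for the BERT classifier that inDexDa actually uses. Papers are plain dictionaries here rather than MongoDB documents.

```python
import re
from typing import Dict, List

# Hypothetical paper record; the real project stores papers in MongoDB.
Paper = Dict[str, str]

# Matches http(s) URLs up to the next whitespace or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def looks_like_new_dataset(paper: Paper) -> bool:
    """Keyword heuristic standing in for the BERT classifier (assumption)."""
    text = paper["abstract"].lower()
    cues = ("we introduce", "we present", "we release")
    return "dataset" in text and any(cue in text for cue in cues)


def extract_dataset_info(paper: Paper) -> Dict[str, object]:
    """Collect links found in the abstract, trimming trailing punctuation."""
    links = [url.rstrip(".,;") for url in URL_PATTERN.findall(paper["abstract"])]
    return {"title": paper["title"], "links": links}


def run_pipeline(papers: List[Paper]) -> List[Dict[str, object]]:
    """Keep only papers flagged as dataset papers, then extract their details."""
    return [extract_dataset_info(p) for p in papers if looks_like_new_dataset(p)]
```

Calling `run_pipeline` on a list of scraped records returns one entry per paper that passes the filter, which mirrors how only the papers flagged by the NLP stage reach Dataset Extraction.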

Setup

This code has been tested on a computer with the following specifications:

OS Platform and Distribution: Linux Ubuntu 18.04 LTS
CUDA/cuDNN version: CUDA 10.0.130, cuDNN 7.6.4
GPU model and memory: NVIDIA GeForce GTX 1080, 8 GB
Python: 3.6.8
TensorFlow: 1.14

Installation Instructions

To install the virtual environment and most of the required dependencies, run:

pip install pew
pew new inDexDa
pew in inDexDa

git clone https://github.com/eth-library-lab/inDexDa.git
cd inDexDa
./install.sh

The networks used in this project run on the TensorFlow backend.

Usage

To begin running inDexDa, check the args.json file in the main directory. It contains the settings used during a run. Make sure the fields described under Configuration are filled in.

Configuration

inDexDa is configured primarily through the args.json file, which contains options for web scraping, network training, and dataset extraction. Each section is explained more thoroughly in the PaperScraper README, but the following steps will allow you to run inDexDa quickly.

Choose the online academic paper repository you wish to scrape in the archives_to_scrape section. inDexDa natively supports both the arXiv and ScienceDirect scraping APIs. You can use either a single scraper or multiple scrapers in sequence.
Replace the default search query with your specific word or phrase. More specific search queries will yield fewer results, but will run much faster.
If using the ScienceDirect scraper, apply for an API key (https://dev.elsevier.com/apikey/manage). Once a key has been obtained, include it in the archive_info ScienceDirect apikey field. Also make sure to include the start and end years for the search.
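
The steps above can be checked before a run with a small script. The following is a sketch only: the key names `archives_to_scrape`, `archive_info`, `ScienceDirect`, and `apikey` come from the text above, but the nesting (`archive_info` → `ScienceDirect` → `apikey`) and the assumption that `archives_to_scrape` is a list of archive names are guesses, so compare them against your own args.json. The search query and the year fields are not checked because their key names are not given here.

```python
import json
from typing import Any, Dict, List


def load_config(path: str = "args.json") -> Dict[str, Any]:
    """Read the configuration file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Assumed to be a list of archive names, e.g. ["arXiv", "ScienceDirect"].
    archives = config.get("archives_to_scrape") or []
    if not archives:
        problems.append("archives_to_scrape is empty")

    if "ScienceDirect" in archives:
        # Assumed nesting: archive_info -> ScienceDirect -> apikey.
        science_direct = config.get("archive_info", {}).get("ScienceDirect", {})
        if not science_direct.get("apikey"):
            problems.append("ScienceDirect apikey is missing")

    return problems
```

A typical use would be `print(check_config(load_config()))` from the main directory, fixing any reported problem before starting a scrape.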

Running inDexDa

Once the args.json file has been configured, run run.py with the following flags as needed. Include either the train flag or the scrape flag, but not both:

python3 run.py
    --first_time  # Must be included the first time you run inDexDa
    --scrape      # Will run inDexDa and output datasets it finds
    --train       # Will re-train the BERT network
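
The rule that the train and scrape flags cannot be combined maps naturally onto a mutually exclusive argument group. The sketch below shows one way to express it with Python's argparse. It is not the actual run.py, whose internals are not shown here; only the three flag names are taken from the listing above, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that rejects --scrape and --train used together."""
    parser = argparse.ArgumentParser(description="Sketch of an inDexDa-style CLI")
    parser.add_argument("--first_time", action="store_true",
                        help="include on the first run")

    # argparse reports an error and exits if both flags in this group are given.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--scrape", action="store_true",
                      help="run the pipeline and output the datasets found")
    mode.add_argument("--train", action="store_true",
                      help="re-train the BERT network")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, `--first_time --scrape` parses normally, while `--scrape --train` stops with a usage error. The group is not marked as required, so running with neither flag is also accepted in this sketch.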

Contact

For any inquiries, use the ETH Library Lab contact form.

License

MIT
