
SciMON: Scientific Inspiration Machines Optimized for Novelty

Table of Contents

  • Requirements
  • Environment Setup Instructions
  • Data Setup
  • Data Preprocess
  • Data and Code Description
  • Results
  • Quickstart
  • Citation

Requirements

Environment

  • Python 3.8.5

  • Ubuntu 22.04

Environment Setup Instructions for the NLP domain

To set up the environment for this repository, please follow the steps below:

Step 1: Create a Python environment (optional). If you wish to use a dedicated Python environment, you can create one with the following command:

conda create -n pyt1.11 python=3.8.5

Step 2: Install PyTorch with CUDA (optional). If you want to use PyTorch with CUDA support, you can install it with the following command:

conda install pytorch==1.11 torchvision torchaudio cudatoolkit=11.3 -c pytorch

Step 3: Install Python dependencies. To install the required Python dependencies, run the following command:

pip install -r requirements.txt
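Optionally, you can verify that the CUDA build of PyTorch is working before moving on (a minimal sanity check, assuming the pyt1.11 environment created above is active):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

On a machine with a working CUDA 11.3 driver this should print a 1.11.x version string followed by True.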

Environment Setup Instructions for the biochemical domain

To set up the environment for this repository, please follow the steps below:

Step 1: Create a Python environment (optional). If you wish to use a dedicated Python environment, you can create one with the following command:

conda create -n pyt2.2 python=3.10

Step 2: Install PyTorch with CUDA (optional). If you want to use PyTorch with CUDA support, you can install it with the following command:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Step 3: Install Python dependencies. To install the required Python dependencies, run the following command:

pip install -r requirements1.txt

Data Setup

  1. Unzip all the zip files located in the data folder, including its subfolders (a shell sketch of these steps follows the list).
  2. Place the following folders, extracted from their respective zip files, under the data folder: kg, ct, and gold_subset.
  3. Locate the local_context_dataset folder unzipped from data/local_context_dataset.zip and move it to models/T5.
  4. Copy the file e2t.json into the following folders: models\GPT*\, models\Iterative\, and preprocess\.
  5. Locate the og, sim, kg, and ct folders under the biomedical folder and copy them to the corresponding folders under biomedical_models\*\data.
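A minimal shell sketch of steps 1 and 3, assuming the commands are run from the repository root on Linux; the commented lines for steps 4 and 5 use hypothetical source paths that should be adjusted to wherever the files sit in your checkout:

find data -type f -name "*.zip" -execdir unzip -o {} \;    # step 1: unzip every archive under data/ in place
mv data/local_context_dataset models/T5/                   # step 3: move the local context dataset next to the T5 code
# steps 4 and 5 (hypothetical paths, adjust to your checkout):
# for d in models/GPT*/ models/Iterative/ preprocess/; do cp path/to/e2t.json "$d"; done
# for d in biomedical_models/*/; do cp -r path/to/biomedical/{og,sim,kg,ct} "$d"/data/; done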

Data Preprocess

  1. Navigate to the preprocess folder and run bash preprocess.sh.
  2. Navigate to models\GPTFS and run process.py.
  3. Navigate to each folder under biomedical_models\* and run preprocess.py (a combined command sketch follows this list).
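Equivalently, from the repository root (a sketch assuming a Linux shell and that each script is run from its own folder, as above):

( cd preprocess && bash preprocess.sh )
( cd models/GPTFS && python process.py )
for d in biomedical_models/*/; do ( cd "$d" && python preprocess.py ); done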

Data and Code Description

The project data includes the following components:

  1. data/local_context_dataset.zip: This archive contains the training, validation, and testing files for our task.
  2. data/kg/*.json: The data/kg directory contains files that store the original Information Extraction (IE) results for all paper abstracts.
  3. data/ct/*.csv: The data/ct directory contains files that represent the citation network for all papers (see the inspection sketch after this list).
  4. data/gold_subset: This directory contains our gold annotation subsets.
  5. data/biomedical.zip: This archive contains our biochemical datasets.
  6. evaluation: This directory contains sample evaluation code.
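For a quick, read-only look at the extracted data, something like the following works (a sketch; the exact file names and columns depend on the released archives):

ls data/kg | head          # list a few of the IE result files
head -n 3 data/ct/*.csv    # peek at the first rows of the citation-network tables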

Results

result/sentence_generation.zip: This zip file contains the GPT3.5/GPT4 results for the initial round.
result/iterative_novelty_boosting.zip: This zip file contains the GPT3.5/GPT4 results for iterative novelty boosting.

Quickstart for the NLP domain

Set up environment first:

conda activate pyt1.11

Training

To train the T5 model under models\T5*, run the following command:

bash finetune_*.sh 

Test

To test the T5 model under models\T5*, run the following command:

bash eval_*.sh 

To test the GPT3.5 model under models\GPT*, run the following command:

bash eval3.sh 
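The GPT3.5 and GPT4 scripts call the OpenAI API, so an API key must be available before running them. Exporting it as an environment variable is one common pattern, but this is an assumption; check the scripts for how they actually expect the key to be configured:

export OPENAI_API_KEY="..."    # assumption: replace with your own key, or configure it the way the scripts expect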

After getting the GPT3.5 results, we can also get GPT4 results on the same input by running the following command:

python gpt4.py

After getting the GPT4 results, first copy all GPT4 results into the iterative folder. You can then run the first two iterations of iterative novelty boosting with the following commands:

python calculate_sim.py
python gpt4_iter1.py
python calculate_sim1.py
python gpt4_iter2.py

Quickstart for the biochemical domain

Set up environment first:

conda activate pyt2.2

Download the Meditron-7b model from Hugging Face and put it under biomedical_models\model.
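One way to fetch the weights is a git-lfs clone from Hugging Face (a sketch assuming the intended checkpoint is epfl-llm/meditron-7b; huggingface-cli download works as well):

git lfs install
git clone https://huggingface.co/epfl-llm/meditron-7b biomedical_models/model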

Training

To train the T5 model under biomedical_models\*\, run the following command:

bash train.sh 

Test

To test the trained model under biomedical_models\*\, run the following command:

python inf_generator.py 

Citation

@article{wang2023learning,
  title={SciMON: Scientific Inspiration Machines Optimized for Novelty},
  author={Wang, Qingyun and Downey, Doug and Ji, Heng and Hope, Tom},
  journal={arXiv preprint arXiv:2305.14259},
  year={2023}
}