
Getting MoRE out of Mixture of Language Model Reasoning Experts (EMNLP 2023 Findings)

This repository contains the code and data for running the experiments in our paper. Please see below for detailed instructions on running the code.

Data

All model prediction data can be downloaded from this link. After downloading, unzip it and place it under the uniqa_predictions_final folder.

It contains two subsets: one for the dev set and one for the test set. All our evaluation results are based on the test sets. Each subset contains the experts' (and the dataset-specific few-shot baseline's) predictions on all 12 datasets used in our paper.

Training the Router

You can run python3 feature_classifier.py to train the random forest router and run inference to score all predictions. For ablations, set agreement = False to exclude the inter-expert agreement features, or set qonly = True to train a router that uses only the question features (see the paper for more details).
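For intuition, here is a minimal sketch of what the router amounts to. The arrays below are hypothetical placeholders; the actual feature extraction and training logic live in feature_classifier.py.

```python
# Minimal sketch of the random forest router (hypothetical placeholder data;
# the real features come from the questions and expert predictions).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((1000, 32))          # question (+ optional agreement) features
y_train = rng.integers(0, 2, size=1000)   # 1 if the expert's prediction was correct

router = RandomForestClassifier(n_estimators=100, random_state=0)
router.fit(X_train, y_train)

# Router score = predicted probability that a prediction is correct.
scores = router.predict_proba(rng.random((10, 32)))[:, 1]
```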

Generalizability Evaluation

Once you have run inference and saved the router scores (which we already provide in feature_classifiers), you can run python3 ensemble.py to reproduce all results reported in Table 1. The default method is classifier, which uses the router classifier's scores for answer selection; you can also set it to other methods for comparison.
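As a rough illustration of the classifier method (the data structures below are hypothetical; the actual implementation is in ensemble.py), answer selection amounts to keeping, for each question, the answer from the expert with the highest router score:

```python
# Sketch of classifier-based answer selection over hypothetical inputs:
#   expert_answers: {expert_name: [answer per question]}
#   router_scores:  {expert_name: [router score per question]}
def select_answers(expert_answers, router_scores):
    experts = list(expert_answers)
    n = len(expert_answers[experts[0]])
    selected = []
    for i in range(n):
        # Pick the expert whose prediction the router trusts most.
        best = max(experts, key=lambda e: router_scores[e][i])
        selected.append(expert_answers[best][i])
    return selected
```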

Selective QA Evaluation

For the selective QA evaluation, run python3 abstention.py. You can score predictions with either MaxProb or the router's score by setting method accordingly, and choose the metric among AUC, Cov@80, and Cov@90 in the all_metric function. Use the ER_metric function to compute effective reliability, which first searches for an abstention threshold on the dev set.
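For intuition, here is a minimal sketch of two of these metrics under assumed inputs (per-question confidence scores and 0/1 correctness labels); the exact implementations live in abstention.py:

```python
# Sketch of selective QA metrics. `scores` are per-question confidences
# (MaxProb or router score), `correct` are 0/1 correctness labels.
import numpy as np

def coverage_at_accuracy(scores, correct, target=0.8):
    # Cov@80: the largest fraction of questions that can be answered
    # (most-confident first) while keeping accuracy at or above the target.
    order = np.argsort(-np.asarray(scores))
    acc = np.cumsum(np.asarray(correct)[order]) / np.arange(1, len(scores) + 1)
    covered = np.where(acc >= target)[0]
    return (covered[-1] + 1) / len(scores) if len(covered) else 0.0

def effective_reliability(scores, correct, threshold):
    # +1 for an answered correct prediction, -1 for an answered wrong one,
    # 0 when the model abstains (score below the dev-tuned threshold).
    s, c = np.asarray(scores), np.asarray(correct)
    answered = s >= threshold
    return np.mean(np.where(answered, np.where(c == 1, 1.0, -1.0), 0.0))
```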

Citation

@article{Si:Shi:Zhao:Zettlemoyer:Boyd-Graber-2023,
	Title = {Getting \underline{MoRE} out of \underline{M}ixture \underline{o}f Language Model \underline{R}easoning \underline{E}xperts},
	Author = {Chenglei Si and Weijia Shi and Chen Zhao and Luke Zettlemoyer and Jordan Lee Boyd-Graber},
	Journal = {Findings of Empirical Methods in Natural Language Processing},
	Year = {2023},
	Location = {Singapore},
}

If you have any questions about the code or paper, feel free to email Chenglei (sichenglei1125@gmail.com).
