Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations

Code and datasets for EMNLP 2022 paper: Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations

Installation

The easiest way to set up the code is to create a fresh Anaconda environment:

conda create -n simptc python=3.7
source activate simptc
pip install -r requirements.txt

Download the data

All the datasets we used are publicly available text classification datasets. One can download all datasets using the following link: https://drive.google.com/file/d/1MDO8LDRUNC-T8OCtHDk8yfuSnjo84EMX/view?usp=sharing

  • Put the datasets and embeddings folders in the same directory as the code.
  • datasets contains the texts and labels for all datasets.
  • embeddings contains the ConceptNet Numberbatch embeddings used for expanding class names.
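
For reference, the layout after unzipping should look roughly like this (assuming the code is checked out as simptc/; the contents of the two folders are elided):

simptc/
├── run.py
├── gen_related_words.py
├── encode_dataset.py
├── datasets/
│   └── ...
└── embeddings/
    └── ...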

We provide more data source information below.

5 benchmark datasets

AG's News, DBpedia, Yahoo, IMDB, and Amazon are five benchmark datasets commonly used to evaluate zero-shot text classification methods. We collected them from: https://github.com/thunlp/OpenPrompt/blob/main/datasets/download_text_classification.sh

TC14

To study the performance of SimPTC on a wider range of text classification tasks, we surveyed the zero-shot text classification literature and collected the datasets that best fit our setting, i.e., those whose label names carry class information. The sources of the TC14 datasets are listed below.

20 News: https://huggingface.co/datasets/SetFit/20_newsgroups

NYT topic and NYT location: https://github.com/yumeng5/CatE/tree/master/datasets/nyt

BBC News: https://huggingface.co/datasets/SetFit/bbc-news

Yelp: https://huggingface.co/datasets/yelp_polarity

Emotion: https://huggingface.co/datasets/emotion

Banking77: https://huggingface.co/datasets/banking77

SST-2: https://huggingface.co/datasets/gpt3mix/sst2

SST-5: https://huggingface.co/datasets/SetFit/sst5

MPQA: https://github.com/princeton-nlp/LM-BFF/blob/main/data/download_dataset.sh

Subj: https://huggingface.co/datasets/SetFit/subj

TREC: https://huggingface.co/datasets/trec

Biomedical: https://github.com/rashadulrakib/short-text-clustering-enhancement/tree/master/data

StackOverflow: https://github.com/rashadulrakib/short-text-clustering-enhancement/tree/master/data
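
Most of the TC14 datasets above are hosted on the Hugging Face Hub. A minimal sketch for fetching one of them with the datasets library (the dataset ID is taken from the list above; the field names are whatever that dataset publishes):

# Sketch: downloading one of the TC14 datasets from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("SetFit/sst5")  # e.g., SST-5 from the list above
print(ds)              # available splits and their sizes
print(ds["train"][0])  # a single example with its text and label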

Reproducing our results

After downloading and unzipping the two folders into the code directory, one can reproduce our results.

We have already included the expanded class names and the encoded dataset for IMDb; one can reproduce the results on IMDb with template #0 by running:

python run.py --dataset imdb --template 0

Averaging the final results of templates 0 to 3 gives the numbers in Table 1 of the paper.
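
For convenience, a plain shell loop is one way to run all four templates in sequence (the averaging of the printed results is left to the reader):

for t in 0 1 2 3; do
    python run.py --dataset imdb --template $t
done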

To reproduce the results on other datasets:

  1. Expand class names using ConceptNet (this step can be skipped, as the expanded class names are already included in the downloaded datasets; a sketch of the idea follows this list):
python gen_related_words.py --dataset agnews
  2. Encode the texts and class anchor texts using SimCSE (GPU required; a sketch also follows this list):
CUDA_VISIBLE_DEVICES=0 python encode_dataset.py --dataset agnews
  3. Run SimPTC.

    1. To reproduce the results on AG's News, DBpedia, Yahoo, IMDB, and Amazon (averaging the final results of templates 0 to 3 gives the numbers in Table 1 of the paper):
    python run.py --dataset agnews --template 0
    2. To reproduce the results on TC14 (we only use the naive template "" on TC14; running the following gives the numbers in Figure 4):
    python run.py --dataset sst2
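
For step 1, class-name expansion amounts to looking up nearest neighbors in the ConceptNet Numberbatch embedding space. A minimal sketch of the idea, not the repository's gen_related_words.py (the embedding file name, its word2vec text format, and the top-k value are assumptions):

# Sketch of class-name expansion via ConceptNet Numberbatch nearest neighbors.
# The file name, its format, and topk are illustrative assumptions, not the repo's settings.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings/numberbatch-en.txt", binary=False)

def expand_class_name(name, topk=10):
    # Return the class name itself plus its nearest neighbors in embedding space.
    neighbors = [word for word, _ in vectors.most_similar(name, topn=topk)]
    return [name] + neighbors

print(expand_class_name("sports"))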
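
For step 2, SimCSE sentence embeddings can be obtained with the transformers library. A rough sketch (the checkpoint below is one of the public SimCSE models and may differ from the one encode_dataset.py actually loads):

# Sketch: encoding texts with a public SimCSE checkpoint.
# The checkpoint name is an assumption; encode_dataset.py may use a different one.
import torch
from transformers import AutoModel, AutoTokenizer

name = "princeton-nlp/sup-simcse-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

texts = ["A gripping and heartfelt film.", "The plot never comes together."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # Supervised SimCSE takes the pooler ([CLS] + MLP) output as the sentence embedding.
    embeddings = model(**batch).pooler_output
print(embeddings.shape)  # (2, 1024) for the roberta-large checkpoint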
