Text classification code described in "SoPa: Bridging CNNs, RNNs, and Weighted Finite-State Machines" by Roy Schwartz, Sam Thomson and Noah A. Smith, ACL 2018
Latest commit 591c79f by Roy Schwartz, Nov 3, 2018


Soft Patterns

Text classification code using SoPa, based on "SoPa: Bridging CNNs, RNNs, and Weighted Finite-State Machines" by Roy Schwartz, Sam Thomson and Noah A. Smith, ACL 2018

Setup

The code is implemented in Python 3.6 using PyTorch. To run it, we recommend using conda. The following commands create a new conda environment and activate it:

./install.sh
source activate sopa

Data format

The training and test code requires two files for each of the training, development and test sets: a data file and a labels file. Both files contain one line per sample. The data file contains the text, and the labels file contains the label. In addition, a word vector file is required (plain text, standard format of one line per vector, starting with the word, followed by the vector).
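As an illustration of the format described above, here is a minimal sketch of how such files could be read. It is not the loader used in this repository: the function names `read_samples` and `read_word_vectors` are hypothetical, and it assumes whitespace-separated vector values, no header line in the word vector file, and labels kept as raw strings.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def read_samples(data_path: str, labels_path: str) -> List[Tuple[str, str]]:
    """Pair line i of the data file with line i of the labels file."""
    texts = Path(data_path).read_text(encoding="utf-8").splitlines()
    labels = Path(labels_path).read_text(encoding="utf-8").splitlines()
    if len(texts) != len(labels):
        raise ValueError(
            f"{data_path} has {len(texts)} lines but {labels_path} has {len(labels)}"
        )
    return list(zip(texts, labels))


def read_word_vectors(path: str) -> Dict[str, List[float]]:
    """Parse one vector per line: the word first, then its values.

    Assumption: values are whitespace-separated and there is no header line.
    """
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank lines and lines without any values
            vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors
```

A mismatch in line counts between a data file and its labels file is reported immediately, which is usually easier to debug than a misaligned training run.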

For other parameters, run the following commands with the --help flag.

Training

To train our model, run

python3.6 ./soft_patterns.py \
    -e <word embeddings file> \
    --td <train data> \
    --tl <train labels> \
    --vd <dev data> \
    --vl <dev labels> \
    -p <pattern specification> \
    --model_save_dir <output model directory>

Test

To test our model, run

python3.6 ./soft_patterns_test.py \
    -e <word embeddings file> \
    --vd <test data> \
    --vl <test labels> \
    -p <pattern specification> \
    --input_model <input model>
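
When running many experiments, it can help to assemble the two commands above programmatically. The following is a minimal sketch that only builds the argument lists from the flags shown in this README; the helper names `build_train_command` and `build_test_command` are hypothetical, and the pattern specification is passed through unchanged as whatever string you would give to -p.

```python
from typing import List


def build_train_command(embeddings: str, train_data: str, train_labels: str,
                        dev_data: str, dev_labels: str, patterns: str,
                        model_dir: str) -> List[str]:
    """Argument list for the training command shown in this README."""
    return [
        "python3.6", "./soft_patterns.py",
        "-e", embeddings,
        "--td", train_data,
        "--tl", train_labels,
        "--vd", dev_data,
        "--vl", dev_labels,
        "-p", patterns,
        "--model_save_dir", model_dir,
    ]


def build_test_command(embeddings: str, test_data: str, test_labels: str,
                       patterns: str, input_model: str) -> List[str]:
    """Argument list for the test command; test files go to --vd and --vl."""
    return [
        "python3.6", "./soft_patterns_test.py",
        "-e", embeddings,
        "--vd", test_data,
        "--vl", test_labels,
        "-p", patterns,
        "--input_model", input_model,
    ]
```

The returned list can be handed to `subprocess.run(command, check=True)` from the Python standard library, which avoids shell quoting issues with paths that contain spaces.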

Sample data

The data/ folder contains sample files for training, development and testing. The data comes from the SST dataset (with 100 training samples).

Each fold X (train, dev, test) consists of two files: X.data (plain text sentences, one sentence per line) and X.labels (one label per line).
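
A quick way to inspect the sample folds is to count how often each label occurs. The sketch below assumes the files are named train.data, train.labels, dev.data, and so on, and that they sit directly inside the directory passed in; the function name `summarize_folds` is hypothetical and not part of this repository.

```python
from collections import Counter
from pathlib import Path
from typing import Dict

FOLDS = ("train", "dev", "test")


def summarize_folds(data_dir: str = "data") -> Dict[str, Counter]:
    """Return the label counts of each fold, checking that X.data and X.labels align."""
    summary: Dict[str, Counter] = {}
    for fold in FOLDS:
        sentences = (Path(data_dir) / f"{fold}.data").read_text(encoding="utf-8").splitlines()
        labels = (Path(data_dir) / f"{fold}.labels").read_text(encoding="utf-8").splitlines()
        if len(sentences) != len(labels):
            raise ValueError(
                f"fold {fold!r}: {len(sentences)} sentences but {len(labels)} labels"
            )
        summary[fold] = Counter(labels)
    return summary


if __name__ == "__main__":
    for fold, counts in summarize_folds().items():
        print(fold, sum(counts.values()), dict(counts))
```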

Visualizing the Model

Under construction.

Sanity Tests

python -m unittest

References

If you make use of this code, please cite the following paper:

@inproceedings{Schwartz:2018,
  author={Schwartz, Roy and Thomson, Sam and Smith, Noah A.},
  title={{SoPa}: Bridging {CNNs}, {RNNs}, and Weighted Finite-State Machines},
  booktitle={Proc. of ACL},
  year={2018}
}

Contact

For questions, comments or feedback, please email roysch@cs.washington.edu