This project uses spaCy to train a text classifier on the GoEmotions dataset, with options for a pipeline with and without transformer weights. To use the BERT-based config, change the `config` variable in the `project.yml`.

The goal of this project is to show how to train a spaCy classifier based on a CSV file, not to showcase a model that's ready for production. The GoEmotions dataset has known flaws described here, as well as label errors resulting from annotator disagreement.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `init-vectors` | Download vectors and convert to model |
| `preprocess` | Convert the corpus to spaCy's format |
| `train` | Train a spaCy pipeline using the specified corpus and config |
| `evaluate` | Evaluate on the test data and save the metrics |
| `package` | Package the trained model so it can be installed |
| `visualize` | Visualize the model's output interactively using Streamlit |
| `assemble` | Combine the model with a pretrained pipeline |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `preprocess` → `train` → `evaluate` → `package` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/categories.txt` | URL | The categories to train |
| `assets/train.tsv` | URL | The training data |
| `assets/dev.tsv` | URL | The development data |
| `assets/test.tsv` | URL | The test data |
If you want to use the BERT-based config (`bert.cfg`), make sure you have `spacy-transformers` installed:

```bash
pip install spacy-transformers
```
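If you want to confirm the install before kicking off a training run, a quick check from Python is enough. This is just an illustrative sanity check, not one of the project's commands:

```python
# Illustrative sanity check: an ImportError here means spacy-transformers
# (and therefore the BERT-based config) is not available in this environment.
import spacy_transformers  # noqa: F401
from importlib.metadata import version

print("spacy-transformers", version("spacy-transformers"))
```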
You can choose your GPU by setting the `gpu_id` variable in the `project.yml`.
To change hyperparameters, you can edit the config (or create a new custom config). For instance, you could edit the `components.textcat.model.tok2vec.encode.width` value, changing it to `32`:
```ini
[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 32
depth = 4
window_size = 1
maxout_pieces = 3
```
Now you can retrain and reevaluate, and commit the updated config and metrics:

```bash
spacy project run train
spacy project run evaluate
git commit configs/my_new_config.cfg metrics/my_new_config.cfg -m "Scores TODO%"
```
You can also run experiments in a more lightweight way by running `spacy train` directly and overriding hyperparameters on the command line:

```bash
spacy train \
  configs/my_new_config.cfg \
  --components.textcat.model.tok2vec.encode.width 32
```
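The dotted names on the command line map directly onto sections of the config file. If you want to double-check what an override resolves to before training, one way (a sketch, not part of the project's commands) is to load the config with the same override in Python:

```python
from spacy.util import load_config

# Apply the same dotted-path override the CLI flag would apply
config = load_config(
    "configs/my_new_config.cfg",
    overrides={"components.textcat.model.tok2vec.encode.width": 32},
)
print(config["components"]["textcat"]["model"]["tok2vec"]["encode"]["width"])  # 32
```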
Suppose you want to keep all the functionality of the `en_core_web_sm` model and add the textcat model you just trained. You can do this without changing your training config by using `spacy assemble` - you'll just need to prepare a config describing your final pipeline.

A sample config for doing this is included in `configs/cnn_with_pretrained.cfg`. After training the model in this project, you can combine it with a pretrained pipeline by running `spacy project run assemble`, which will save the new pipeline to `cnn_with_pretrained/`.
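Once assembled, the combined pipeline loads like any other spaCy model, with the pretrained components and the new textcat side by side. A minimal sketch (the example text and printed output are purely illustrative):

```python
import spacy

# Load the pipeline saved by `spacy project run assemble`
nlp = spacy.load("cnn_with_pretrained")

doc = nlp("Thanks so much, the tickets for the Berlin show arrived early!")

# Components sourced from en_core_web_sm still work...
print(doc.ents)
# ...and the textcat trained in this project adds emotion scores
print(sorted(doc.cats.items(), key=lambda kv: kv[1], reverse=True)[:3])
```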
To make your own config to combine pipelines, the basic steps are:

- Include all the components you want in `nlp.pipeline`.
- Add a section for each component, specifying the pipeline to source it from.
- If you have two components of the same type, specify unique component names for each.
- If necessary, specify `replace_listeners` to bundle a component with its tok2vec.

You can also remove many values related to training - since you aren't running a training loop with `spacy assemble`, default values are fine.
Let's go over the last two steps in a little more detail.

By default, components have a simple default name in the pipeline, like "ner" or "textcat". However, if you have two copies of a component, then they need to have different names. If you need to change the name of a component, you can do that by giving it a different name in `nlp.pipeline` and specifying the name in the original pipeline using the `name` value in the section for the component.
It depends on the pipeline, but components often use a listener to get their features from a shared tok2vec (or transformer). If the tok2vec in your final pipeline comes from the same pipeline as the component you're adding, then you don't have to do anything. But if a component has a different tok2vec, you can bundle a standalone copy of the original tok2vec with the component so that it doesn't use the wrong one.
Here's an example of a component where the name has changed from `ner` to `renamed_ner`, and which also uses `replace_listeners`:

```ini
[components.renamed_ner]
source = "my_pipeline"
# the "ner" here is the name in the base pipeline
name = "ner"
# and it listened to the "tok2vec" in the original pipeline
replace_listeners = ["model.tok2vec"]
```
In the sample config, since most of our components come from the pretrained pipeline, we use the tok2vec from that pipeline, and replace the listeners for the textcat component we trained in this project. Exactly what configuration of tok2vecs and listeners works depends on your pipeline; for more details, see the docs on shared vs. independent embedding layers.
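The `replace_listeners` config setting also has a runtime counterpart on `Language`, which can be handy for experimenting in Python before writing the assemble config. A rough sketch, assuming a pipeline from this project whose `textcat` still listens to a shared `tok2vec` (the paths here are assumptions, not project outputs):

```python
import spacy

# Assumed path: the pipeline trained earlier in this project
nlp = spacy.load("training/model-best")

# Give textcat its own copy of the tok2vec it listens to, so it keeps
# working even if the shared tok2vec is later changed or removed.
nlp.replace_listeners("tok2vec", "textcat", ["model.tok2vec"])
nlp.to_disk("textcat_standalone")
```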
First, download an existing trained pipeline with word vectors. The word vectors of this model can then be specified in `paths.vectors` or `initialize.vectors`.

```bash
spacy download en_core_web_lg
spacy train \
  configs/cnn.cfg \
  --paths.vectors "en_core_web_lg" \
  --components.textcat.model.tok2vec.embed.include_static_vectors true
```
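If you want to confirm that the downloaded package actually ships static vectors before pointing the config at it, a quick illustrative check (not one of the project's commands) is:

```python
import spacy

# en_core_web_lg bundles static word vectors; a non-empty shape confirms it
nlp = spacy.load("en_core_web_lg")
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)
```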
Uncomment the asset in your `project.yml`:

```yaml
assets:
  - dest: "assets/vectors.zip"
    url: "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
```

Then download the asset and run the `init-vectors` command:

```bash
spacy project assets
spacy project run init-vectors
```
Use the vectors:

```bash
spacy train \
  configs/cnn.cfg \
  --paths.vectors "assets/en_fasttext_vectors" \
  --components.textcat.model.tok2vec.embed.include_static_vectors true
```
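With either source of vectors, the vectors are copied into the pipeline during initialization, so the trained model doesn't depend on the original vector files. A small sketch for inspecting the result, assuming a default-style output directory (check `project.yml` for the path the `train` command actually writes to):

```python
import spacy

# Assumed output path; see project.yml for where the `train` command writes
nlp = spacy.load("training/model-best")

# The static vectors travel with the pipeline...
print(nlp.vocab.vectors.shape)
# ...and the textcat exposes the GoEmotions categories it was trained on
print(nlp.get_pipe("textcat").labels)
```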