This project uses spaCy to train a text classifier on the GoEmotions dataset, with options for a pipeline with and without transformer weights. To use the BERT-based config, change the `config` variable in the `project.yml`.

The goal of this project is to show how to train a spaCy classifier based on a CSV file, not to showcase a model that's ready for production. The GoEmotions dataset has known flaws described here, as well as label errors resulting from annotator disagreement.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `init-vectors` | Download vectors and convert to model |
| `preprocess` | Convert the corpus to spaCy's format |
| `train` | Train a spaCy pipeline using the specified corpus and config |
| `evaluate` | Evaluate on the test data and save the metrics |
| `package` | Package the trained model so it can be installed |
| `visualize` | Visualize the model's output interactively using Streamlit |
| `assemble` | Combine the model with a pretrained pipeline |
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `preprocess` → `train` → `evaluate` → `package` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/categories.txt` | URL | The categories to train |
| `assets/train.tsv` | URL | The training data |
| `assets/dev.tsv` | URL | The development data |
| `assets/test.tsv` | URL | The test data |
If you want to use the BERT-based config (`bert.cfg`), make sure you have `spacy-transformers` installed:

```bash
pip install spacy-transformers
```
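If you want to confirm the install before kicking off a training run, a quick check from Python is enough. This is just an illustrative sanity check, not one of the project's commands:

```python
# Illustrative sanity check: an ImportError here means spacy-transformers
# (and therefore the BERT-based config) is not available in this environment.
import spacy_transformers  # noqa: F401
from importlib.metadata import version

print("spacy-transformers", version("spacy-transformers"))
```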
You can choose your GPU by setting the `gpu_id` variable in the `project.yml`.
To change hyperparameters, you can edit the config (or create a new custom config). For instance, you could edit the `components.textcat.model.tok2vec.encode.width` value, changing it to `32`:
```ini
[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 32
depth = 4
window_size = 1
maxout_pieces = 3
```
Now you can retrain and reevaluate, and commit the updated config and metrics:

```bash
spacy project run train
spacy project run evaluate
git commit configs/my_new_config.cfg metrics/my_new_config.cfg -m "Scores TODO%"
```
You can also run experiments in a more lightweight way by running `spacy train` directly and overriding hyperparameters on the command line:

```bash
spacy train \
  configs/my_new_config.cfg \
  --components.textcat.model.tok2vec.encode.width 32
```
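The dotted names on the command line map directly onto sections of the config file. If you want to double-check what an override resolves to before training, one way (a sketch, not part of the project's commands) is to load the config with the same override in Python:

```python
from spacy.util import load_config

# Apply the same dotted-path override the CLI flag would apply
config = load_config(
    "configs/my_new_config.cfg",
    overrides={"components.textcat.model.tok2vec.encode.width": 32},
)
print(config["components"]["textcat"]["model"]["tok2vec"]["encode"]["width"])  # 32
```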
Suppose you want to keep all the functionality of the `en_core_web_sm` model and add the textcat model you just trained. You can do this without changing your training config by using `spacy assemble` - you'll just need to prepare a config describing your final pipeline.

A sample config for doing this is included in `configs/cnn_with_pretrained.cfg`. After training the model in this project, you can combine it with a pretrained pipeline by running `spacy project run assemble`, which will save the new pipeline to `cnn_with_pretrained/`.
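Once assembled, the combined pipeline loads like any other spaCy model, with the pretrained components and the new textcat side by side. A minimal sketch (the example text and printed output are purely illustrative):

```python
import spacy

# Load the pipeline saved by `spacy project run assemble`
nlp = spacy.load("cnn_with_pretrained")

doc = nlp("Thanks so much, the tickets for the Berlin show arrived early!")

# Components sourced from en_core_web_sm still work...
print(doc.ents)
# ...and the textcat trained in this project adds emotion scores
print(sorted(doc.cats.items(), key=lambda kv: kv[1], reverse=True)[:3])
```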
To make your own config to combine pipelines, the basic steps are:

- Include all the components you want in `nlp.pipeline`.
- Add a section for each component, specifying the pipeline to source it from.
- If you have two components of the same type, specify unique component names for each.
- If necessary, specify `replace_listeners` to bundle a component with its tok2vec.

You can also remove many values related to training - since you aren't running a training loop with `spacy assemble`, default values are fine.
Let's go over the last two steps in a little more detail.

By default, components have a simple default name in the pipeline, like "ner" or "textcat". However, if you have two copies of a component, then they need to have different names. If you need to change the name of a component, you can do that by giving it a different name in `nlp.pipeline` and specifying the name in the original pipeline using the `name` value in the section for the component.
It depends on the pipeline, but components often use a listener to get their features from a shared tok2vec (or transformer). If the tok2vec in your final pipeline comes from the same pipeline as the component you're adding, then you don't have to do anything. But if a component has a different tok2vec, you can bundle a standalone copy of the original tok2vec with the component so that it doesn't use the wrong one.
Here's an example of a component where the name has changed from `ner` to `renamed_ner`, and which also uses `replace_listeners`:

```ini
[components.renamed_ner]
source = "my_pipeline"
# the "ner" here is the name in the base pipeline
name = "ner"
# and it listened to the "tok2vec" in the original pipeline
replace_listeners = ["model.tok2vec"]
```
In the sample config, since most of our components come from the pretrained pipeline, we use the tok2vec from that pipeline, and replace the listeners for the textcat component we trained in this project. Exactly what configuration of tok2vecs and listeners works depends on your pipeline; for more details, see the docs on shared vs. independent embedding layers.
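The `replace_listeners` config setting also has a runtime counterpart on `Language`, which can be handy for experimenting in Python before writing the assemble config. A rough sketch, assuming a pipeline from this project whose `textcat` still listens to a shared `tok2vec` (the paths here are assumptions, not project outputs):

```python
import spacy

# Assumed path: the pipeline trained earlier in this project
nlp = spacy.load("training/model-best")

# Give textcat its own copy of the tok2vec it listens to, so it keeps
# working even if the shared tok2vec is later changed or removed.
nlp.replace_listeners("tok2vec", "textcat", ["model.tok2vec"])
nlp.to_disk("textcat_standalone")
```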
First, download an existing trained pipeline with word vectors. The word vectors of this model can then be specified in `paths.vectors` or `initialize.vectors`.

```bash
spacy download en_core_web_lg
spacy train \
  configs/cnn.cfg \
  --paths.vectors "en_core_web_lg" \
  --components.textcat.model.tok2vec.embed.include_static_vectors true
```
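If you want to confirm that the downloaded package actually ships static vectors before pointing the config at it, a quick illustrative check (not one of the project's commands) is:

```python
import spacy

# en_core_web_lg bundles static word vectors; a non-empty shape confirms it
nlp = spacy.load("en_core_web_lg")
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)
```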
Uncomment the asset in your `project.yml`:

```yaml
assets:
  - dest: "assets/vectors.zip"
    url: "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
```

Then download the asset and run the `init-vectors` command:

```bash
spacy project assets
spacy project run init-vectors
```
Use the vectors:

```bash
spacy train \
  configs/cnn.cfg \
  --paths.vectors "assets/en_fasttext_vectors" \
  --components.textcat.model.tok2vec.embed.include_static_vectors true
```
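With either source of vectors, the vectors are copied into the pipeline during initialization, so the trained model doesn't depend on the original vector files. A small sketch for inspecting the result, assuming a default-style output directory (check `project.yml` for the path the `train` command actually writes to):

```python
import spacy

# Assumed output path; see project.yml for where the `train` command writes
nlp = spacy.load("training/model-best")

# The static vectors travel with the pipeline...
print(nlp.vocab.vectors.shape)
# ...and the textcat exposes the GoEmotions categories it was trained on
print(nlp.get_pipe("textcat").labels)
```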