[Docs] Refactor for new class based approach (#243)
Sean Naren committed May 17, 2022
1 parent add189b commit 1994d03
Showing 30 changed files with 486 additions and 523 deletions.
4 changes: 1 addition & 3 deletions README.md
@@ -63,9 +63,7 @@ dm = TextClassificationDataModule(
    ),
    tokenizer=tokenizer,
)
model = TextClassificationTransformer(
    pretrained_model_name_or_path="bert-base-cased", num_labels=dm.num_classes
)
model = TextClassificationTransformer(pretrained_model_name_or_path="bert-base-cased")

trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)

38 changes: 0 additions & 38 deletions docs/source/advanced/custom_task.rst
@@ -11,7 +11,6 @@ through the model, and the loss calculation for a specific task. Below are the s

1. Inherit from Lightning Transformers Base Class
2. Add custom task logic
3. Create Hydra config

1. Inherit from Lightning Transformers Base Class
-------------------------------------------------
@@ -54,40 +53,3 @@ The ``LMHeadAutoModel`` task provides separate keys for the backbone and the ful
for param in self.model.parameters():
    param.add_(torch.randn(param.size()) * 0.1)
return loss
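As a point of reference, a complete hook might look roughly like the sketch below. This is a hedged reconstruction rather than the code from this commit: the ``LanguageModelingTransformer`` parent, the ``training_step`` placement, the assumption that the parent ``training_step`` returns the loss, and the ``torch.no_grad()`` guard are all assumptions here.

.. code-block:: python

    import torch

    from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer


    class MyLanguageModelingTransformer(LanguageModelingTransformer):
        """Illustrative custom task: perturbs the backbone weights every training step."""

        def training_step(self, batch, batch_idx):
            # Let the parent task run the batch through the model and compute the loss.
            loss = super().training_step(batch, batch_idx)
            # Custom task logic: add small Gaussian noise to every backbone parameter.
            # no_grad() is used because in-place updates on leaf tensors that require
            # grad are otherwise rejected by autograd.
            with torch.no_grad():
                for param in self.model.parameters():
                    param.add_(torch.randn(param.size()) * 0.1)
            return loss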
3. Create Hydra Config
----------------------

Finally to use the Hydra CLI and configs, we would add our own custom yaml file containing the necessary code to run using our task.

We create a file at ``conf/task/nlp/my_language_modeling.yaml`` containing the below config.

.. code-block:: yaml
# @package task
defaults:
- nlp/default # Use the defaults from the default config found at `conf/task/nlp/default.yaml`
_target_: examples.custom_language_modeling.model.MyLanguageModelingTransformer # path to the class we'd like to instantiate
downstream_model_type: transformers.AutoModelForCausalLM
Hydra supports config inheritance, so we could inherit from the language modeling task directly, simplifying our config a bit:

.. code-block:: yaml
# @package task
defaults:
- nlp/language_modeling # Use the defaults from the config found at `conf/task/nlp/language_modeling.yaml`
_target_: examples.custom_language_modeling.model.MyLanguageModelingTransformer # path to the class we'd like to instantiate
With this in place you can now train using pre-made HuggingFace datasets:

.. code-block:: python
python train.py task=nlp/my_language_modeling dataset=nlp/language_modeling/wikitext dataset.train_file=train.csv dataset.validation_file=valid.csv
Or with your own files:

.. code-block:: python
python train.py task=nlp/my_language_modeling dataset.train_file=train.csv dataset.validation_file=valid.csv
44 changes: 3 additions & 41 deletions docs/source/advanced/nlp/language_modeling_data.rst
@@ -1,18 +1,10 @@
Language Modeling using Custom Data Processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below we show an example of how to override data processing logic. In this example, we add a prefix to each block of text used in the language modeling task.
Below we show how to override data processing logic.

This reflects the idea of passing a conditional term that is used to give the language model context. Check :doc:`/tasks/nlp/language_modeling` for more information around the task.

Ultimately to create your own custom data processing the flow is like this:

1. Extend the ``LanguageModelingDataModule`` base class and override hooks with your own logic
2. (Optional) Keep file in the specific task directory
3. Add a hydra config object to use your new dataset

1. Extend the ``LanguageModelingDataModule`` base class
"""""""""""""""""""""""""""""""""""""""""""""""""""""""
Extend the ``LanguageModelingDataModule`` base class
""""""""""""""""""""""""""""""""""""""""""""""""""""

The base data module can be used to modify this code, and follows a simple pattern. Internally the dataset is loaded via HuggingFace Datasets, which returns an `Apache Arrow Parquet <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html>`_ Dataset. This data format is easy to transform and modify using map functions, which you'll see within the class.

@@ -105,33 +97,3 @@ Below we have the pseudo code version to show where most of the changes happened
}
result["labels"] = result["input_ids"].copy()
return result
To see the full example, see ``examples/custom/dataset/language_modeling/custom_dataset.py``
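To give a feel for the map-function pattern described above, here is a small standalone sketch that adds a prefix to every sample using HuggingFace ``datasets`` directly. It is illustrative only and is not the code in ``custom_dataset.py``; the ``gpt2`` tokenizer and the ``wikitext`` split are arbitrary choices.

.. code-block:: python

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

    prefix = "my conditional term: "  # the context prefix we want every block to carry

    def add_prefix_and_tokenize(examples):
        # `examples` is a dict of lists because we call map with batched=True.
        texts = [prefix + text for text in examples["text"]]
        return tokenizer(texts)

    # The Arrow-backed dataset is transformed column-wise via .map
    dataset = dataset.map(add_prefix_and_tokenize, batched=True, remove_columns=["text"])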

2. (Optional) Keep file in the specific task directory
""""""""""""""""""""""""""""""""""""""""""""""""""""""

This makes tracking of files easier. Our example is stored in ``examples/``; however, in reality we would store our DataModule in ``lightning_transformers/task/nlp/language_modeling/custom_dataset.py``.

3. Add a hydra config object to use your new dataset
""""""""""""""""""""""""""""""""""""""""""""""""""""

Finally to use the Hydra CLI and configs, we would add our own custom yaml file containing the necessary code to run using our dataset.

We create a file at ``conf/datasets/nlp/language_modeling/my_dataset.yaml`` containing the below config.

.. code-block:: yaml
# @package dataset
defaults:
- nlp/default # Use the defaults from the default config found at `conf/dataset/nlp/default.yaml`
_target_: lightning_transformers.custom_language_modeling.dataset.MyLanguageModelingDataModule # path to the class we'd like to instantiate
cfg:
block_size: 512 # any parameters you'd like from the inherited config object.
With this in place you can now train using either HuggingFace Datasets or your own custom files.

.. code-block:: bash
python train.py task=nlp/language_modeling dataset=nlp/language_modeling/my_dataset dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
41 changes: 3 additions & 38 deletions docs/source/advanced/nlp/translation_data.rst
@@ -1,16 +1,10 @@
Translation using Custom Data Processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below we show an example of how to override data processing logic by adding a prefix to the source language sample in translation. Check :doc:`/tasks/nlp/translation` for more information around the task.
Below we show how to override data processing logic.

Ultimately to create your own custom data processing the flow is like this:

1. Extend the ``TranslationDataModule`` base class and override hooks with your own logic
2. (Optional) Keep file in the specific task directory
3. Add a hydra config object to use your new dataset

1. Extend the ``TranslationDataModule`` base class
""""""""""""""""""""""""""""""""""""""""""""""""""
Extend the ``TranslationDataModule`` base class
"""""""""""""""""""""""""""""""""""""""""""""""

The base data module can be used to modify this code, and follows a simple pattern. Internally the dataset is loaded via HuggingFace Datasets, which returns an `Apache Arrow Parquet <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html>`_ Dataset. This data format is easy to transform and modify using map functions, which you'll see within the class.

@@ -63,32 +57,3 @@ Extend ``TranslationDataModule``, like this.
...
Make any changes you'd like to the dataset processing via the hooks.

To see the full example, see ``examples/custom/dataset/translation/custom_dataset.py``
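As a rough illustration of the same idea (again, not the code in ``custom_dataset.py``), the prefix could be applied to the source language with a plain ``datasets`` map function; the WMT16 ``ro-en`` subset and the prefix string are arbitrary choices here.

.. code-block:: python

    from datasets import load_dataset

    dataset = load_dataset("wmt16", "ro-en", split="train[:1%]")

    prefix = "translate English to Romanian: "  # illustrative conditioning prefix

    def add_source_prefix(example):
        # Each WMT example holds a "translation" dict keyed by language code.
        example["translation"]["en"] = prefix + example["translation"]["en"]
        return example

    dataset = dataset.map(add_source_prefix)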

2. (Optional) Keep file in the specific task directory
""""""""""""""""""""""""""""""""""""""""""""""""""""""

This makes tracking of files easier. Our example is stored in ``examples/``; however, in reality we would store our DataModule in ``lightning_transformers/task/nlp/translation/datasets/custom_dataset.py``.

3. Add a hydra config object to use your new dataset
""""""""""""""""""""""""""""""""""""""""""""""""""""

Finally to use the Hydra CLI and configs, we would add our own custom yaml file containing the necessary code to run using our dataset.

We create a file at ``conf/datasets/nlp/translation/my_dataset.yaml`` containing the below config.

.. code-block:: yaml
# @package dataset
defaults:
- nlp/default # Use the defaults from the default config found at `conf/dataset/nlp/default.yaml`
_target_: examples.custom_translation.dataset.MyTranslationDataModule # path to the class we'd like to instantiate
cfg:
max_source_length: 128 # any parameters you'd like from the inherited config object.
With this in place you can now train using either HuggingFace Datasets or your own custom files.

.. code-block:: bash
python train.py task=nlp/translation dataset=nlp/translation/my_dataset dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
21 changes: 15 additions & 6 deletions docs/source/datasets/nlp/custom_subset_names.rst
@@ -4,15 +4,24 @@ Custom Subset Names (Edge Cases such as MNLI)
Some datasets, such as MNLI when loaded from the Huggingface `datasets` library, have special subset names that don't match the standard train/validation/test convention.
Specifically, MNLI has two validation and two test sets, with flavors 'matched' and 'mismatched'.
When using such datasets, you must manually indicate which subset names you want to use for each of train/validation/test.
For this, you can set the config variables `dataset.cfg.train_subset_name`, `dataset.cfg.validation_subset_name` and `dataset.cfg.test_subset_name`.

An example of how to train and validate on MNLI would be the following:

.. code-block:: python
python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli ++dataset.cfg.validation_subset_name=validation_matched
It also works for train and test subsets, like so:

++dataset.cfg.train_subset_name=name_of_subset
++dataset.cfg.test_subset_name=name_of_subset
from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataConfig,
    TextClassificationDataModule,
    TextClassificationTransformer,
)

dm = TextClassificationDataModule(
    cfg=TextClassificationDataConfig(
        batch_size=1,
        dataset_name="glue",
        dataset_config_name="mnli",
        max_length=512,
        validation_subset_name="validation_matched"
    ),
    tokenizer=tokenizer,
)
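The removed ``++dataset.cfg.train_subset_name`` / ``++dataset.cfg.test_subset_name`` overrides suggest the other splits are handled the same way in the class-based API. A hedged sketch, assuming ``TextClassificationDataConfig`` exposes matching ``train_subset_name`` and ``test_subset_name`` fields (MNLI's test subsets are ``test_matched``/``test_mismatched``):

.. code-block:: python

    # Assumption: the data config mirrors the old Hydra overrides for every split.
    cfg = TextClassificationDataConfig(
        batch_size=1,
        dataset_name="glue",
        dataset_config_name="mnli",
        max_length=512,
        train_subset_name="train",
        validation_subset_name="validation_matched",
        test_subset_name="test_matched",
    )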
14 changes: 13 additions & 1 deletion docs/source/datasets/nlp/language_modeling_data.rst
@@ -18,4 +18,16 @@ When specifying the file path with hydra, it is important to use the absolute pa

.. code-block:: python
python train.py task=nlp/language_modeling dataset.cfg.train_file=abs/path/train.csv dataset.cfg.validation_file=abs/path/valid.csv
from lightning_transformers.task.nlp.language_modeling import (
    LanguageModelingDataConfig,
    LanguageModelingDataModule,
)

dm = LanguageModelingDataModule(
    cfg=LanguageModelingDataConfig(
        batch_size=1,
        train_file="path/train.csv",
        validation_file="/path/valid.csv"
    ),
    tokenizer=tokenizer,
)
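For context, a datamodule like this is normally paired with the corresponding task class and a Lightning ``Trainer``. Below is a minimal sketch mirroring the README's text classification example; the ``gpt2`` checkpoint and the ``LanguageModelingTransformer`` constructor signature are assumptions here.

.. code-block:: python

    import pytorch_lightning as pl
    from transformers import AutoTokenizer

    from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM checkpoint
    model = LanguageModelingTransformer(pretrained_model_name_or_path="gpt2")

    trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
    trainer.fit(model, datamodule=dm)  # `dm` is the LanguageModelingDataModule built above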
17 changes: 16 additions & 1 deletion docs/source/datasets/nlp/multiple_choice_data.rst
@@ -21,4 +21,19 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/multiple_choice dataset=language_modeling/race dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.multiple_choice import (
    MultipleChoiceDataConfig,
    RaceMultipleChoiceDataModule,
)

dm = RaceMultipleChoiceDataModule(
    cfg=MultipleChoiceDataConfig(
        batch_size=1,
        dataset_name="race",
        dataset_config_name="all",
        padding=False,
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
23 changes: 22 additions & 1 deletion docs/source/datasets/nlp/question_answering_data.rst
@@ -22,4 +22,25 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/question_answering dataset=nlp/question_answering/squad dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.question_answering import (
    QuestionAnsweringDataConfig,
    SquadDataModule,
)

dm = SquadDataModule(
    cfg=QuestionAnsweringDataConfig(
        batch_size=1,
        dataset_name="squad",
        dataset_config_name="plain_text",
        max_length=384,
        version_2_with_negative=False,
        null_score_diff_threshold=0.0,
        doc_stride=128,
        n_best_size=20,
        max_answer_length=30,
        train_file="path/train.csv",
        validation_file="/path/valid.csv"
    ),
    tokenizer=tokenizer,
)
18 changes: 17 additions & 1 deletion docs/source/datasets/nlp/summarization_data.rst
@@ -14,4 +14,20 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/summarization dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.summarization import (
    SummarizationConfig,
    SummarizationDataConfig,
    XsumSummarizationDataModule,
)

dm = XsumSummarizationDataModule(
    cfg=SummarizationDataConfig(
        batch_size=1,
        dataset_name="xsum",
        max_source_length=128,
        max_target_length=128,
        train_file="path/train.csv",
        validation_file="/path/valid.csv"
    ),
    tokenizer=tokenizer,
)
16 changes: 15 additions & 1 deletion docs/source/datasets/nlp/text_classification_data.rst
@@ -14,4 +14,18 @@ The label mapping is automatically generated from the training dataset labels if
.. code-block:: python
python train.py task=nlp/text_classification dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataConfig,
    TextClassificationDataModule,
    TextClassificationTransformer,
)

dm = TextClassificationDataModule(
    cfg=TextClassificationDataConfig(
        batch_size=1,
        max_length=512,
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
19 changes: 18 additions & 1 deletion docs/source/datasets/nlp/token_classification_data.rst
@@ -12,4 +12,21 @@ To use custom text files, the files should contain new line delimited json objec
.. code-block:: python
python train.py task=nlp/token_classification dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.token_classification import (
    TokenClassificationDataConfig,
    TokenClassificationDataModule,
)

dm = TokenClassificationDataModule(
    cfg=TokenClassificationDataConfig(
        batch_size=1,
        task_name="ner",
        dataset_name="conll2003",
        preprocessing_num_workers=1,
        label_all_tokens=False,
        revision="master",
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
20 changes: 19 additions & 1 deletion docs/source/datasets/nlp/translation_data.rst
@@ -14,4 +14,22 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/translation dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.translation import (
    TranslationDataConfig,
    WMT16TranslationDataModule,
)

dm = WMT16TranslationDataModule(
    cfg=TranslationDataConfig(
        dataset_name="wmt16",
        # WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
        dataset_config_name="ro-en",
        source_language="en",
        target_language="ro",
        max_source_length=128,
        max_target_length=128,
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
6 changes: 1 addition & 5 deletions docs/source/index.rst
@@ -14,7 +14,6 @@ Lightning Transformers
:caption: Get started

quickstart
structure/conf

.. toctree::
:maxdepth: 1
@@ -32,11 +31,8 @@ Lightning Transformers
.. toctree::
:maxdepth: 1
:name: optimization
:caption: Training Optimizations
:caption: Transformer Optimizations

optimizations/lightning
optimizations/deepspeed
optimizations/sharded
optimizations/sparseml

.. toctree::
