[Docs] Refactor for new class based approach (#243)
Sean Naren committed May 17, 2022
1 parent add189b commit 1994d03
Showing 30 changed files with 486 additions and 523 deletions.
4 changes: 1 addition & 3 deletions README.md
@@ -63,9 +63,7 @@ dm = TextClassificationDataModule(
    ),
    tokenizer=tokenizer,
)
model = TextClassificationTransformer(
    pretrained_model_name_or_path="bert-base-cased", num_labels=dm.num_classes
)
model = TextClassificationTransformer(pretrained_model_name_or_path="bert-base-cased")

trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)

38 changes: 0 additions & 38 deletions docs/source/advanced/custom_task.rst
@@ -11,7 +11,6 @@ through the model, and the loss calculation for a specific task. Below are the s

1. Inherit from Lightning Transformers Base Class
2. Add custom task logic
3. Create Hydra config

1. Inherit from Lightning Transformers Base Class
-------------------------------------------------
@@ -54,40 +53,3 @@ The ``LMHeadAutoModel`` task provides separate keys for the backbone and the ful
for param in self.model.parameters():
    param.add_(torch.randn(param.size()) * 0.1)
return loss
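As a point of reference, a complete hook might look roughly like the sketch below. This is a hedged reconstruction rather than the code from this commit: the ``LanguageModelingTransformer`` parent, the ``training_step`` placement, the assumption that the parent ``training_step`` returns the loss, and the ``torch.no_grad()`` guard are all assumptions here.

.. code-block:: python

    import torch

    from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer


    class MyLanguageModelingTransformer(LanguageModelingTransformer):
        """Illustrative custom task: perturbs the backbone weights every training step."""

        def training_step(self, batch, batch_idx):
            # Let the parent task run the batch through the model and compute the loss.
            loss = super().training_step(batch, batch_idx)
            # Custom task logic: add small Gaussian noise to every backbone parameter.
            # no_grad() is used because in-place updates on leaf tensors that require
            # grad are otherwise rejected by autograd.
            with torch.no_grad():
                for param in self.model.parameters():
                    param.add_(torch.randn(param.size()) * 0.1)
            return loss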
3. Create Hydra Config
----------------------

Finally to use the Hydra CLI and configs, we would add our own custom yaml file containing the necessary code to run using our task.

We create a file at ``conf/task/nlp/my_language_modeling.yaml`` containing the below config.

.. code-block:: yaml
# @package task
defaults:
- nlp/default # Use the defaults from the default config found at `conf/task/nlp/default.yaml`
_target_: examples.custom_language_modeling.model.MyLanguageModelingTransformer # path to the class we'd like to instantiate
downstream_model_type: transformers.AutoModelForCausalLM
Hydra supports config inheritance, so we could inherit from the language modeling task directly, simplifying our config a bit:

.. code-block:: yaml
# @package task
defaults:
- nlp/language_modeling # Use the defaults from the config found at `conf/task/nlp/language_modeling.yaml`
_target_: examples.custom_language_modeling.model.MyLanguageModelingTransformer # path to the class we'd like to instantiate
With this in place you can now train using pre-made HuggingFace datasets:

.. code-block:: python
python train.py task=nlp/my_language_modeling dataset=nlp/language_modeling/wikitext dataset.train_file=train.csv dataset.validation_file=valid.csv
Or with your own files:

.. code-block:: python
python train.py task=nlp/my_language_modeling dataset.train_file=train.csv dataset.validation_file=valid.csv
44 changes: 3 additions & 41 deletions docs/source/advanced/nlp/language_modeling_data.rst
@@ -1,18 +1,10 @@
Language Modeling using Custom Data Processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below we show an example of how to override data processing logic. In this example, we add a prefix to each block of text used in the language modeling task.
Below we show how to override data processing logic.

This reflects the idea of passing a conditional term that is used to give the language model context. Check :doc:`/tasks/nlp/language_modeling` for more information around the task.

Ultimately to create your own custom data processing the flow is like this:

1. Extend the ``LanguageModelingDataModule`` base class and override hooks with your own logic
2. (Optional) Keep file in the specific task directory
3. Add a hydra config object to use your new dataset

1. Extend the ``LanguageModelingDataModule`` base class
"""""""""""""""""""""""""""""""""""""""""""""""""""""""
Extend the ``LanguageModelingDataModule`` base class
""""""""""""""""""""""""""""""""""""""""""""""""""""

The base data module can be used to modify this code, and follows a simple pattern. Internally the dataset is loaded via HuggingFace Datasets, which returns an `Apache Arrow Parquet <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html>`_ Dataset. This data format is easy to transform and modify using map functions, which you'll see within the class.

@@ -105,33 +97,3 @@ Below we have the pseudo code version to show where most of the changes happened
}
result["labels"] = result["input_ids"].copy()
return result
To see the full example, see ``examples/custom/dataset/language_modeling/custom_dataset.py``
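To give a feel for the map-function pattern described above, here is a small standalone sketch that adds a prefix to every sample using HuggingFace ``datasets`` directly. It is illustrative only and is not the code in ``custom_dataset.py``; the ``gpt2`` tokenizer and the ``wikitext`` split are arbitrary choices.

.. code-block:: python

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

    prefix = "my conditional term: "  # the context prefix we want every block to carry

    def add_prefix_and_tokenize(examples):
        # `examples` is a dict of lists because we call map with batched=True.
        texts = [prefix + text for text in examples["text"]]
        return tokenizer(texts)

    # The Arrow-backed dataset is transformed column-wise via .map
    dataset = dataset.map(add_prefix_and_tokenize, batched=True, remove_columns=["text"])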

2. (Optional) Keep file in the specific task directory
""""""""""""""""""""""""""""""""""""""""""""""""""""""

This makes tracking of files easier. Our example is stored in ``examples/``; however, in reality we would store our DataModule in ``lightning_transformers/task/nlp/language_modeling/custom_dataset.py``.

3. Add a hydra config object to use your new dataset
""""""""""""""""""""""""""""""""""""""""""""""""""""

Finally to use the Hydra CLI and configs, we would add our own custom yaml file containing the necessary code to run using our dataset.

We create a file at ``conf/datasets/nlp/language_modeling/my_dataset.yaml`` containing the below config.

.. code-block:: yaml
# @package dataset
defaults:
- nlp/default # Use the defaults from the default config found at `conf/dataset/nlp/default.yaml`
_target_: lightning_transformers.custom_language_modeling.dataset.MyLanguageModelingDataModule # path to the class we'd like to instantiate
cfg:
block_size: 512 # any parameters you'd like from the inherited config object.
With this in place you can now train using either HuggingFace Datasets or your own custom files.

.. code-block:: bash
python train.py task=nlp/language_modeling dataset=nlp/language_modeling/my_dataset dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
41 changes: 3 additions & 38 deletions docs/source/advanced/nlp/translation_data.rst
@@ -1,16 +1,10 @@
Translation using Custom Data Processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below we show an example of how to override data processing logic by adding a prefix to the source language sample in translation. Check :doc:`/tasks/nlp/translation` for more information around the task.
Below we show how to override data processing logic.

Ultimately to create your own custom data processing the flow is like this:

1. Extend the ``TranslationDataModule`` base class and override hooks with your own logic
2. (Optional) Keep file in the specific task directory
3. Add a hydra config object to use your new dataset

1. Extend the ``TranslationDataModule`` base class
""""""""""""""""""""""""""""""""""""""""""""""""""
Extend the ``TranslationDataModule`` base class
"""""""""""""""""""""""""""""""""""""""""""""""

The base data module can be used to modify this code, and follows a simple pattern. Internally the dataset is loaded via HuggingFace Datasets, which returns an `Apache Arrow Parquet <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html>`_ Dataset. This data format is easy to transform and modify using map functions, which you'll see within the class.

@@ -63,32 +57,3 @@ Extend ``TranslationDataModule``, like this.
...
Make any changes you'd like to the dataset processing via the hooks.

To see the full example, see ``examples/custom/dataset/translation/custom_dataset.py``
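As a rough illustration of the same idea (again, not the code in ``custom_dataset.py``), the prefix could be applied to the source language with a plain ``datasets`` map function; the WMT16 ``ro-en`` subset and the prefix string are arbitrary choices here.

.. code-block:: python

    from datasets import load_dataset

    dataset = load_dataset("wmt16", "ro-en", split="train[:1%]")

    prefix = "translate English to Romanian: "  # illustrative conditioning prefix

    def add_source_prefix(example):
        # Each WMT example holds a "translation" dict keyed by language code.
        example["translation"]["en"] = prefix + example["translation"]["en"]
        return example

    dataset = dataset.map(add_source_prefix)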

2. (Optional) Keep file in the specific task directory
""""""""""""""""""""""""""""""""""""""""""""""""""""""

This makes tracking of files easier. Our example is stored in ``examples/``; however, in reality we would store our DataModule in ``lightning_transformers/task/nlp/translation/datasets/custom_dataset.py``.

3. Add a hydra config object to use your new dataset
""""""""""""""""""""""""""""""""""""""""""""""""""""

Finally to use the Hydra CLI and configs, we would add our own custom yaml file containing the necessary code to run using our dataset.

We create a file at ``conf/datasets/nlp/translation/my_dataset.yaml`` containing the below config.

.. code-block:: yaml
# @package dataset
defaults:
- nlp/default # Use the defaults from the default config found at `conf/dataset/nlp/default.yaml`
_target_: examples.custom_translation.dataset.MyTranslationDataModule # path to the class we'd like to instantiate
cfg:
max_source_length: 128 # any parameters you'd like from the inherited config object.
With this in place you can now train using either HuggingFace Datasets or your own custom files.

.. code-block:: bash
python train.py task=nlp/translation dataset=nlp/translation/my_dataset dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
21 changes: 15 additions & 6 deletions docs/source/datasets/nlp/custom_subset_names.rst
@@ -4,15 +4,24 @@ Custom Subset Names (Edge Cases such as MNLI)
Some datasets, such as MNLI when loaded from the Huggingface `datasets` library, have special subset names that don't match the standard train/validation/test convention.
Specifically, MNLI has two validation and two test sets, with flavors 'matched' and 'mismatched'.
When using such datasets, you must manually indicate which subset names you want to use for each of train/validation/test.
For this, you can set the config variables `dataset.cfg.train_subset_name`, `dataset.cfg.validation_subset_name` and `dataset.cfg.test_subset_name`.

An example of how to train and validate on MNLI would be the following:

.. code-block:: python
python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli ++dataset.cfg.validation_subset_name=validation_matched
It also works for train and test subsets, like so:

++dataset.cfg.train_subset_name=name_of_subset
++dataset.cfg.test_subset_name=name_of_subset
from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataConfig,
    TextClassificationDataModule,
    TextClassificationTransformer,
)

dm = TextClassificationDataModule(
    cfg=TextClassificationDataConfig(
        batch_size=1,
        dataset_name="glue",
        dataset_config_name="mnli",
        max_length=512,
        validation_subset_name="validation_matched"
    ),
    tokenizer=tokenizer,
)
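The removed ``++dataset.cfg.train_subset_name`` / ``++dataset.cfg.test_subset_name`` overrides suggest the other splits are handled the same way in the class-based API. A hedged sketch, assuming ``TextClassificationDataConfig`` exposes matching ``train_subset_name`` and ``test_subset_name`` fields (MNLI's test subsets are ``test_matched``/``test_mismatched``):

.. code-block:: python

    # Assumption: the data config mirrors the old Hydra overrides for every split.
    cfg = TextClassificationDataConfig(
        batch_size=1,
        dataset_name="glue",
        dataset_config_name="mnli",
        max_length=512,
        train_subset_name="train",
        validation_subset_name="validation_matched",
        test_subset_name="test_matched",
    )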
14 changes: 13 additions & 1 deletion docs/source/datasets/nlp/language_modeling_data.rst
@@ -18,4 +18,16 @@ When specifying the file path with hydra, it is important to use the absolute pa

.. code-block:: python
python train.py task=nlp/language_modeling dataset.cfg.train_file=abs/path/train.csv dataset.cfg.validation_file=abs/path/valid.csv
from lightning_transformers.task.nlp.language_modeling import (
    LanguageModelingDataConfig,
    LanguageModelingDataModule,
)

dm = LanguageModelingDataModule(
    cfg=LanguageModelingDataConfig(
        batch_size=1,
        train_file="path/train.csv",
        validation_file="/path/valid.csv"
    ),
    tokenizer=tokenizer,
)
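For context, a datamodule like this is normally paired with the corresponding task class and a Lightning ``Trainer``. Below is a minimal sketch mirroring the README's text classification example; the ``gpt2`` checkpoint and the ``LanguageModelingTransformer`` constructor signature are assumptions here.

.. code-block:: python

    import pytorch_lightning as pl
    from transformers import AutoTokenizer

    from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM checkpoint
    model = LanguageModelingTransformer(pretrained_model_name_or_path="gpt2")

    trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
    trainer.fit(model, datamodule=dm)  # `dm` is the LanguageModelingDataModule built above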
17 changes: 16 additions & 1 deletion docs/source/datasets/nlp/multiple_choice_data.rst
@@ -21,4 +21,19 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/multiple_choice dataset=language_modeling/race dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.multiple_choice import (
    MultipleChoiceDataConfig,
    RaceMultipleChoiceDataModule,
)

dm = RaceMultipleChoiceDataModule(
    cfg=MultipleChoiceDataConfig(
        batch_size=1,
        dataset_name="race",
        dataset_config_name="all",
        padding=False,
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
23 changes: 22 additions & 1 deletion docs/source/datasets/nlp/question_answering_data.rst
@@ -22,4 +22,25 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/question_answering dataset=nlp/question_answering/squad dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.question_answering import (
    QuestionAnsweringDataConfig,
    SquadDataModule,
)

dm = SquadDataModule(
    cfg=QuestionAnsweringDataConfig(
        batch_size=1,
        dataset_name="squad",
        dataset_config_name="plain_text",
        max_length=384,
        version_2_with_negative=False,
        null_score_diff_threshold=0.0,
        doc_stride=128,
        n_best_size=20,
        max_answer_length=30,
        train_file="path/train.csv",
        validation_file="/path/valid.csv"
    ),
    tokenizer=tokenizer,
)
18 changes: 17 additions & 1 deletion docs/source/datasets/nlp/summarization_data.rst
@@ -14,4 +14,20 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/summarization dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.summarization import (
    SummarizationConfig,
    SummarizationDataConfig,
    XsumSummarizationDataModule,
)

dm = XsumSummarizationDataModule(
    cfg=SummarizationDataConfig(
        batch_size=1,
        dataset_name="xsum",
        max_source_length=128,
        max_target_length=128,
        train_file="path/train.csv",
        validation_file="/path/valid.csv"
    ),
    tokenizer=tokenizer,
)
16 changes: 15 additions & 1 deletion docs/source/datasets/nlp/text_classification_data.rst
@@ -14,4 +14,18 @@ The label mapping is automatically generated from the training dataset labels if
.. code-block:: python
python train.py task=nlp/text_classification dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataConfig,
    TextClassificationDataModule,
    TextClassificationTransformer,
)

dm = TextClassificationDataModule(
    cfg=TextClassificationDataConfig(
        batch_size=1,
        max_length=512,
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
19 changes: 18 additions & 1 deletion docs/source/datasets/nlp/token_classification_data.rst
@@ -12,4 +12,21 @@ To use custom text files, the files should contain new line delimited json objec
.. code-block:: python
python train.py task=nlp/token_classification dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.token_classification import (
    TokenClassificationDataConfig,
    TokenClassificationDataModule,
)

dm = TokenClassificationDataModule(
    cfg=TokenClassificationDataConfig(
        batch_size=1,
        task_name="ner",
        dataset_name="conll2003",
        preprocessing_num_workers=1,
        label_all_tokens=False,
        revision="master",
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
20 changes: 19 additions & 1 deletion docs/source/datasets/nlp/translation_data.rst
@@ -14,4 +14,22 @@ We override the dataset files, allowing us to still use the data transforms defi

.. code-block:: python
python train.py task=nlp/translation dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
from lightning_transformers.task.nlp.translation import (
    TranslationDataConfig,
    WMT16TranslationDataModule,
)

dm = WMT16TranslationDataModule(
    cfg=TranslationDataConfig(
        dataset_name="wmt16",
        # WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
        dataset_config_name="ro-en",
        source_language="en",
        target_language="ro",
        max_source_length=128,
        max_target_length=128,
        train_file="path/train.json",
        validation_file="/path/valid.json"
    ),
    tokenizer=tokenizer,
)
6 changes: 1 addition & 5 deletions docs/source/index.rst
@@ -14,7 +14,6 @@ Lightning Transformers
:caption: Get started

quickstart
structure/conf

.. toctree::
:maxdepth: 1
@@ -32,11 +31,8 @@ Lightning Transformers
.. toctree::
:maxdepth: 1
:name: optimization
:caption: Training Optimizations
:caption: Transformer Optimizations

optimizations/lightning
optimizations/deepspeed
optimizations/sharded
optimizations/sparseml

.. toctree::
