This repository has been archived by the owner on Nov 21, 2022. It is now read-only.

Feat/remove configs (#264)
Sean Naren committed Jun 23, 2022
1 parent 9026a1e commit e51ce01
Showing 94 changed files with 614 additions and 1,076 deletions.
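Every diff below follows the same migration: the per-task Config / DataConfig dataclasses are removed and their fields become plain keyword arguments on the data modules and transformers. A minimal before/after sketch of the pattern, using the text classification names from the README diff that follows:

    from transformers import AutoTokenizer
    from lightning_transformers.task.nlp.text_classification import TextClassificationDataModule

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # Before this commit, fields were wrapped in a config dataclass:
    #     dm = TextClassificationDataModule(
    #         cfg=TextClassificationDataConfig(batch_size=1, dataset_name="emotion", max_length=512),
    #         tokenizer=tokenizer,
    #     )
    # After this commit, the same fields are passed directly:
    dm = TextClassificationDataModule(
        batch_size=1,
        dataset_name="emotion",
        max_length=512,
        tokenizer=tokenizer,
    )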
38 changes: 14 additions & 24 deletions README.md
@@ -49,18 +49,15 @@ from transformers import AutoTokenizer
 from lightning_transformers.task.nlp.text_classification import (
     TextClassificationDataModule,
     TextClassificationTransformer,
-    TextClassificationDataConfig,
 )

 tokenizer = AutoTokenizer.from_pretrained(
     pretrained_model_name_or_path="bert-base-cased"
 )
 dm = TextClassificationDataModule(
-    cfg=TextClassificationDataConfig(
-        batch_size=1,
-        dataset_name="emotion",
-        max_length=512,
-    ),
+    batch_size=1,
+    dataset_name="emotion",
+    max_length=512,
     tokenizer=tokenizer,
 )
 model = TextClassificationTransformer(
@@ -81,33 +78,26 @@ from transformers import AutoTokenizer
 from lightning_transformers.task.nlp.translation import (
     TranslationTransformer,
     WMT16TranslationDataModule,
-    TranslationConfig,
-    TranslationDataConfig,
 )

 tokenizer = AutoTokenizer.from_pretrained(
     pretrained_model_name_or_path="google/mt5-base"
 )
 model = TranslationTransformer(
     pretrained_model_name_or_path="google/mt5-base",
-    cfg=TranslationConfig(
-        n_gram=4,
-        smooth=False,
-        val_target_max_length=142,
-        num_beams=None,
-        compute_generate_metrics=True,
-    ),
+    n_gram=4,
+    smooth=False,
+    val_target_max_length=142,
+    num_beams=None,
+    compute_generate_metrics=True,
 )
 dm = WMT16TranslationDataModule(
-    cfg=TranslationDataConfig(
-        dataset_name="wmt16",
-        # WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
-        dataset_config_name="ro-en",
-        source_language="en",
-        target_language="ro",
-        max_source_length=128,
-        max_target_length=128,
-    ),
+    # WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
+    dataset_config_name="ro-en",
+    source_language="en",
+    target_language="ro",
+    max_source_length=128,
+    max_target_length=128,
     tokenizer=tokenizer,
 )
 trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
8 changes: 2 additions & 6 deletions docs/source/advanced/nlp/language_modeling_data.rst
@@ -12,9 +12,6 @@ The base data module can be used to modify this code, and follows a simple pattern
 class LanguageModelingDataModule(HFDataModule):
-    def __init__(self, cfg: LanguageModelingDataConfig = LanguageModelingDataConfig()):
-        super().__init__(cfg=cfg)
-
     def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
         # `process_data` converts the dataset into features.
         # The dataset is pre-loaded using `load_dataset`.
@@ -63,14 +60,13 @@ Below we have the pseudo code version to show where most of the changes happened
 from typing import Optional

 from datasets import Dataset
 from transformers import PreTrainedTokenizerBase
-from lightning_transformers.core.nlp.huggingface import HFTransformerDataConfig
 from lightning_transformers.task.nlp.language_modeling import LanguageModelingDataModule


 class MyLanguageModelingDataModule(LanguageModelingDataModule):
-    def __init__(self, cfg: HFTransformerDataConfig, tokenizer: PreTrainedTokenizerBase):
-        super().__init__(cfg, tokenizer)
+    def __init__(self, tokenizer: PreTrainedTokenizerBase, *args, **kwargs):
+        super().__init__(tokenizer, *args, **kwargs)
         self.tokenized_condition_term = tokenizer("This is a story: ")

     def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
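The body of process_data is collapsed in this hunk. As a hedged sketch only (not the code hidden behind the fold), one plausible way the subclass could use the condition term is to prepend it to every tokenized example:

    def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
        # Let the parent class tokenize and group the dataset first.
        dataset = super().process_data(dataset, stage=stage)
        cond_ids = self.tokenized_condition_term["input_ids"]

        def add_condition(example):
            # Prepend the condition term to each example (illustrative only).
            example["input_ids"] = cond_ids + example["input_ids"]
            example["attention_mask"] = [1] * len(cond_ids) + example["attention_mask"]
            return example

        return dataset.map(add_condition)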
6 changes: 0 additions & 6 deletions docs/source/advanced/nlp/translation_data.rst
@@ -12,9 +12,6 @@ The base data module can be used to modify this code, and follows a simple pattern
 class TranslationDataModule(Seq2SeqDataModule):
-    def __init__(self, cfg: TranslationDataConfig = TranslationDataConfig()):
-        super().__init__(cfg=cfg)
-
     @property
     def source_target_column_names(self) -> Tuple[str, str]:
         return self.cfg.source_language, self.cfg.target_language
@@ -23,9 +20,6 @@ The base data module can be used to modify this code, and follows a simple pattern
 class Seq2SeqDataModule(HFDataModule):
-    def __init__(self, cfg: Seq2SeqDataConfig = Seq2SeqDataConfig()):
-        super().__init__(cfg=cfg)
-
     def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
         # `process_data` converts the dataset into features.
         # The dataset is pre-loaded using `load_dataset`.
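As with the language modeling example above, customization happens by subclassing. A minimal sketch, assuming hypothetical column names, of pointing the translation data module at a custom dataset's source/target columns:

    from typing import Tuple

    from lightning_transformers.task.nlp.translation import TranslationDataModule


    class MyTranslationDataModule(TranslationDataModule):
        @property
        def source_target_column_names(self) -> Tuple[str, str]:
            # Hypothetical column names in a custom dataset.
            return "english", "romanian"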
13 changes: 5 additions & 8 deletions docs/source/datasets/nlp/custom_subset_names.rst
@@ -10,18 +10,15 @@ An example for how to train and validate on MNLI would be the following:
 .. code-block:: python

     from lightning_transformers.task.nlp.text_classification import (
-        TextClassificationDataConfig,
         TextClassificationDataModule,
         TextClassificationTransformer,
     )

     dm = TextClassificationDataModule(
-        cfg=TextClassificationDataConfig(
-            batch_size=1,
-            dataset_name="glue",
-            dataset_config_name="mnli",
-            max_length=512,
-            validation_subset_name="validation_matched"
-        ),
+        batch_size=1,
+        dataset_name="glue",
+        dataset_config_name="mnli",
+        max_length=512,
+        validation_subset_name="validation_matched",
         tokenizer=tokenizer,
     )
9 changes: 3 additions & 6 deletions docs/source/datasets/nlp/language_modeling_data.rst
@@ -17,15 +17,12 @@ Below we have defined a csv file to use as our input data.
 .. code-block:: python

     from lightning_transformers.task.nlp.language_modeling import (
-        LanguageModelingDataConfig,
         LanguageModelingDataModule,
     )

     dm = LanguageModelingDataModule(
-        cfg=LanguageModelingDataConfig(
-            batch_size=1,
-            train_file="path/train.csv",
-            validation_file="/path/valid.csv"
-        ),
+        batch_size=1,
+        train_file="path/train.csv",
+        validation_file="/path/valid.csv",
         tokenizer=tokenizer,
     )
14 changes: 5 additions & 9 deletions docs/source/datasets/nlp/multiple_choice_data.rst
@@ -22,18 +22,14 @@ We override the dataset files, allowing us to still use the data transforms defined
 .. code-block:: python

     from lightning_transformers.task.nlp.multiple_choice import (
-        MultipleChoiceDataConfig,
         RaceMultipleChoiceDataModule,
     )

     dm = RaceMultipleChoiceDataModule(
-        cfg=MultipleChoiceDataConfig(
-            batch_size=1,
-            dataset_name="race",
-            dataset_config_name="all",
-            padding=False,
-            train_file="path/train.json",
-            validation_file="/path/valid.json"
-        ),
+        batch_size=1,
+        dataset_config_name="all",
+        padding=False,
+        train_file="path/train.json",
+        validation_file="/path/valid.json",
         tokenizer=tokenizer,
     )
24 changes: 10 additions & 14 deletions docs/source/datasets/nlp/question_answering_data.rst
@@ -24,23 +24,19 @@ We override the dataset files, allowing us to still use the data transforms defined
     from lightning_transformers.task.nlp.question_answering import (
-        QuestionAnsweringDataConfig,
         SquadDataModule,
     )

     dm = SquadDataModule(
-        cfg=QuestionAnsweringDataConfig(
-            batch_size=1,
-            dataset_name="squad",
-            dataset_config_name="plain_text",
-            max_length=384,
-            version_2_with_negative=False,
-            null_score_diff_threshold=0.0,
-            doc_stride=128,
-            n_best_size=20,
-            max_answer_length=30,
-            train_file="path/train.csv",
-            validation_file="/path/valid.csv"
-        ),
+        batch_size=1,
+        dataset_config_name="plain_text",
+        max_length=384,
+        version_2_with_negative=False,
+        null_score_diff_threshold=0.0,
+        doc_stride=128,
+        n_best_size=20,
+        max_answer_length=30,
+        train_file="path/train.csv",
+        validation_file="/path/valid.csv",
         tokenizer=tokenizer,
     )
15 changes: 5 additions & 10 deletions docs/source/datasets/nlp/summarization_data.rst
@@ -15,19 +15,14 @@ We override the dataset files, allowing us to still use the data transforms defined
 .. code-block:: python

     from lightning_transformers.task.nlp.summarization import (
-        SummarizationConfig,
-        SummarizationDataConfig,
         XsumSummarizationDataModule,
     )

     dm = XsumSummarizationDataModule(
-        cfg=SummarizationDataConfig(
-            batch_size=1,
-            dataset_name="xsum",
-            max_source_length=128,
-            max_target_length=128,
-            train_file="path/train.csv",
-            validation_file="/path/valid.csv"
-        ),
+        batch_size=1,
+        max_source_length=128,
+        max_target_length=128,
+        train_file="path/train.csv",
+        validation_file="/path/valid.csv",
         tokenizer=tokenizer,
     )
11 changes: 4 additions & 7 deletions docs/source/datasets/nlp/text_classification_data.rst
@@ -15,17 +15,14 @@ The label mapping is automatically generated from the training dataset labels if
 .. code-block:: python

     from lightning_transformers.task.nlp.text_classification import (
-        TextClassificationDataConfig,
         TextClassificationDataModule,
         TextClassificationTransformer,
     )

     dm = TextClassificationDataModule(
-        cfg=TextClassificationDataConfig(
-            batch_size=1,
-            max_length=512,
-            train_file="path/train.json",
-            validation_file="/path/valid.json"
-        ),
+        batch_size=1,
+        max_length=512,
+        train_file="path/train.json",
+        validation_file="/path/valid.json",
         tokenizer=tokenizer,
     )
23 changes: 9 additions & 14 deletions docs/source/datasets/nlp/token_classification_data.rst
@@ -12,21 +12,16 @@ To use custom text files, the files should contain newline-delimited JSON objects
 .. code-block:: python

-    from lightning_transformers.task.nlp.token_classification import (
-        TokenClassificationDataConfig,
-        TokenClassificationDataModule,
-    )
+    from lightning_transformers.task.nlp.token_classification import TokenClassificationDataModule

     dm = TokenClassificationDataModule(
-        cfg=TokenClassificationDataConfig(
-            batch_size=1,
-            task_name="ner",
-            dataset_name="conll2003",
-            preprocessing_num_workers=1,
-            label_all_tokens=False,
-            revision="master",
-            train_file="path/train.json",
-            validation_file="/path/valid.json"
-        ),
+        batch_size=1,
+        task_name="ner",
+        dataset_name="conll2003",
+        preprocessing_num_workers=1,
+        label_all_tokens=False,
+        revision="master",
+        train_file="path/train.json",
+        validation_file="/path/valid.json",
         tokenizer=tokenizer,
     )
24 changes: 9 additions & 15 deletions docs/source/datasets/nlp/translation_data.rst
@@ -14,22 +14,16 @@ We override the dataset files, allowing us to still use the data transforms defined

 .. code-block:: python

-    from lightning_transformers.task.nlp.translation import (
-        TranslationDataConfig,
-        WMT16TranslationDataModule,
-    )
+    from lightning_transformers.task.nlp.translation import WMT16TranslationDataModule

     dm = WMT16TranslationDataModule(
-        cfg=TranslationDataConfig(
-            dataset_name="wmt16",
-            # WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
-            dataset_config_name="ro-en",
-            source_language="en",
-            target_language="ro",
-            max_source_length=128,
-            max_target_length=128,
-            train_file="path/train.json",
-            validation_file="/path/valid.json"
-        ),
+        # WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
+        dataset_config_name="ro-en",
+        source_language="en",
+        target_language="ro",
+        max_source_length=128,
+        max_target_length=128,
+        train_file="path/train.json",
+        validation_file="/path/valid.json",
         tokenizer=tokenizer,
     )
@@ -27,19 +27,16 @@ To save an additional HF Checkpoint every time the checkpoint callback saves, pass
     from lightning_transformers.plugins.checkpoint import HFSaveCheckpoint
     from lightning_transformers.task.nlp.text_classification import (
-        TextClassificationDataConfig,
         TextClassificationDataModule,
         TextClassificationTransformer,
     )

     tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny")
     dm = TextClassificationDataModule(
-        cfg=TextClassificationDataConfig(
-            batch_size=1,
-            dataset_name="glue",
-            dataset_config_name="sst2",
-            max_length=512,
-        ),
+        batch_size=1,
+        dataset_name="glue",
+        dataset_config_name="sst2",
+        max_length=512,
         tokenizer=tokenizer,
     )
     model = TextClassificationTransformer(pretrained_model_name_or_path="prajjwal1/bert-tiny")
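The Trainer call that consumes the plugin is collapsed below this hunk. Continuing the code above, an assumed sketch of the wiring the surrounding prose describes:

    import pytorch_lightning as pl

    # Assumed wiring (not visible in this hunk): the plugin makes every Lightning
    # checkpoint also write a Hugging Face-format checkpoint for the wrapped model.
    trainer = pl.Trainer(plugins=HFSaveCheckpoint(model=model), max_epochs=1)
    trainer.fit(model, dm)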
@@ -1,7 +1,7 @@
 .. _large_model:

-Inference for Big Transformers
-==============================
+Big Transformer Models Inference
+================================

 Lightning Transformers provides out-of-the-box support for running inference with very large billion-parameter models. Under the hood we use HF Accelerate's Transformer support to auto-select devices for optimal throughput and memory usage.
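The rest of this file is collapsed. As an illustration of the mechanism the paragraph above refers to, this is plain Hugging Face + Accelerate usage rather than necessarily the exact Lightning Transformers API:

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # With `accelerate` installed, device_map="auto" spreads the weights across the
    # available GPUs (spilling to CPU RAM if needed), so a model too large for one
    # device can still be loaded for inference.
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained(
        "google/mt5-base",
        device_map="auto",
        torch_dtype=torch.float16,
    )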
27 changes: 27 additions & 0 deletions docs/source/features/sparseml.rst
@@ -0,0 +1,27 @@
.. _sparseml:

SparseML
========

`SparseML <https://github.com/neuralmagic/sparseml>`__ provides GPU-class performance on CPUs through sparsification, pruning, and quantization.
For more details, see `SparseML docs <https://docs.neuralmagic.com/sparseml/>`__.

When training on multiple machines, the command has to be run on every machine, either manually or via an orchestration system such as SLURM or TorchElastic. More information can be found in the PyTorch Lightning `Computing Cluster <https://pytorch-lightning.readthedocs.io/en/latest/advanced/cluster.html#computing-cluster>`_ documentation.

SparseML support works out of the box: just pass the SparseML callback when training.

.. code-block:: python

    import pytorch_lightning as pl
    from lightning_transformers.callbacks import TransformerSparseMLCallback

    pl.Trainer(
        callbacks=TransformerSparseMLCallback(
            output_dir="/content/MODELS",
            recipe_path="/content/recipe.yaml",
        )
    )
The callback is only useful once a recipe has been created. Example recipes can be found `here <https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers/recipes>`__.

After training, this leaves two ONNX models in the callback's ``output_dir`` folder: ``small_model.onnx`` and ``model.onnx``. ``small_model.onnx`` is excellent for demos; for reliable inference, it is recommended to optimize ``model.onnx`` with your compression algorithm.
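Since the callback ends by exporting ONNX files, here is a short, assumed sketch of loading the exported model.onnx with ONNX Runtime; the input names depend on how the model was exported and are assumptions here:

    import numpy as np
    import onnxruntime as ort

    # Load the model written to the callback's output_dir.
    session = ort.InferenceSession("/content/MODELS/model.onnx")

    # Input names vary by export; inspect them first with
    # [inp.name for inp in session.get_inputs()].
    inputs = {
        "input_ids": np.array([[101, 7592, 2088, 102]], dtype=np.int64),
        "attention_mask": np.ones((1, 4), dtype=np.int64),
    }
    outputs = session.run(None, inputs)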