feat: text classification lightning pipeline #119

Merged: 43 commits (Dec 16, 2021)
Changes from 41 commits
Commits (43)
3e6e3d3
Add initial version of pipeline
ktagowski Sep 21, 2021
0b17322
feat: adapt and extend pipelines functionalities (#96)
djaniak Sep 21, 2021
d11ea6c
Refactor code
ktagowski Sep 21, 2021
995c865
Update pipelines
ktagowski Sep 22, 2021
fe8d078
Add abstraction layer for OptimizedPipeline
ktagowski Sep 22, 2021
d83653e
Fix typing
ktagowski Sep 22, 2021
222dfa2
Separate metadata for pipelines
ktagowski Sep 24, 2021
3aae9cb
Add persisting
ktagowski Sep 24, 2021
e07b49e
Add another pipeline
ktagowski Sep 27, 2021
4841d61
Code refactor
ktagowski Sep 28, 2021
dd8fcbd
implement split sampling transformations (#98)
djaniak Sep 30, 2021
4bb775f
Update examples/preprocess_evaluate_sequence_labeling.py
ktagowski Oct 5, 2021
e1b0c67
Feature/config space embeddings params (#103)
djaniak Oct 27, 2021
b292157
Add tests for hyperparameter search functionality (#113)
ktagowski Nov 30, 2021
c869db5
fix(examples): Remove temporal script
ktagowski Nov 30, 2021
3930b6a
Update embeddings/hyperparameter_search/configspace.py
ktagowski Nov 30, 2021
4a8eb5f
revert: Revert exception type in ConfigSpace
ktagowski Nov 30, 2021
6aa1c00
feat: init text classification lightning pipeline
djaniak Dec 6, 2021
e6ecc7f
chore: add pytorch-lightning to poetry
djaniak Dec 6, 2021
bb5c8d3
refactor: fix mypy errors
djaniak Dec 6, 2021
7e3c3b0
refactor: move input dim to init arguments in datamodule
djaniak Dec 6, 2021
c7b4f07
refactor: refactor and add more abstract classes with generics
djaniak Dec 7, 2021
880a0e7
refactor: move predict method to child class
djaniak Dec 7, 2021
55851da
refactor: refactor lightning task
djaniak Dec 7, 2021
3a24151
Update embeddings/pipeline/lightning_pipeline.py
djaniak Dec 9, 2021
09e6ce4
refactor: code refactor based on PR comments
djaniak Dec 9, 2021
c4c8323
refactor: fix example code
djaniak Dec 9, 2021
1c227d6
feat: add scheduler and refactor code
djaniak Dec 9, 2021
c70ea9f
refactor
djaniak Dec 9, 2021
c80ee72
refactor: fix lightning example, max_seq_length arg and save to str m…
djaniak Dec 9, 2021
2d00eaa
refactor: refactor due to PR comments
djaniak Dec 10, 2021
2c984f9
refactor: create model dynamically in LightningModule setup
djaniak Dec 10, 2021
e178d32
refactor: remove unnecessary argument in datamodule `init`
djaniak Dec 10, 2021
3d69a89
refactor: remove unnecessary argument in datamodule `init` v2
djaniak Dec 10, 2021
f586c8b
refactor: remove redundant loop and simplified processing
djaniak Dec 10, 2021
2d52593
refactor: add default values for keyword arguments and refactor
djaniak Dec 10, 2021
cbac091
refactor: code refactor due to PR comments
djaniak Dec 10, 2021
f1f839d
chore: update poetry.lock with pytorch-lightning
djaniak Dec 14, 2021
2032b90
refactor: fix numpy ndarray typing
djaniak Dec 14, 2021
de46cf1
refactor: add datamodule kwargs to pipeline
djaniak Dec 15, 2021
b194796
Update embeddings/data/datamodule.py
djaniak Dec 16, 2021
bdbca1e
refactor: transformers are unfrozen on default
djaniak Dec 16, 2021
b802ff8
refactor: auto gpu detect, task refactor
djaniak Dec 16, 2021
18 changes: 9 additions & 9 deletions README.md
@@ -11,9 +11,9 @@ pip install clarinpl-embeddings
Text-classification with polemo2 dataset and transformer-based embeddings

```python
from embeddings.pipeline.hugging_face_classification import HuggingFaceClassificationPipeline
from embeddings.pipeline.flair_classification import FlairClassificationPipeline

pipeline = HuggingFaceClassificationPipeline(
pipeline = FlairClassificationPipeline(
dataset_name="clarin-pl/polemo2-official",
embedding_name="allegro/herbert-base-cased",
input_column_name="text",
@@ -47,8 +47,8 @@ We share predefined pipelines for common NLP tasks with corresponding scripts.
```python
from pathlib import Path

from embeddings.data.hugging_face_data_loader import HuggingFaceDataLoader
from embeddings.data.hugging_face_dataset import HuggingFaceDataset
from embeddings.data.data_loader import HuggingFaceDataLoader
from embeddings.data.dataset import HuggingFaceDataset
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from embeddings.evaluator.text_classification_evaluator import TextClassificationEvaluator
from embeddings.model.flair_model import FlairModel
@@ -124,12 +124,12 @@ compatible with our pipeline.
Model and training parameters can be controlled via `task_model_kwargs` and
`task_train_kwargs` parameters.

## Example with `polemo2` dataset.

```python
from embeddings.pipeline.hugging_face_classification import HuggingFaceClassificationPipeline
from embeddings.pipeline.flair_classification import FlairClassificationPipeline

pipeline = HuggingFaceClassificationPipeline(
pipeline = FlairClassificationPipeline(
dataset_name="clarin-pl/polemo2-official",
embedding_name="allegro/herbert-base-cased",
input_column_name="text",
@@ -231,9 +231,9 @@ df, metadata = pipeline.run()
After the parameters search process we can train model with best parameters found.

```python
from embeddings.pipeline.hugging_face_classification import HuggingFaceClassificationPipeline
from embeddings.pipeline.flair_classification import FlairClassificationPipeline

pipeline = HuggingFaceClassificationPipeline(**metadata)
pipeline = FlairClassificationPipeline(**metadata)
results = pipeline.run()
```

173 changes: 173 additions & 0 deletions embeddings/data/datamodule.py
@@ -0,0 +1,173 @@
import abc
from typing import Any, Dict, Generic, List, Optional, Sequence, TypeVar, Union

import datasets
import pytorch_lightning as pl
from datasets import ClassLabel, DatasetDict
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, BatchEncoding

from embeddings.utils.loggers import get_logger

Data = TypeVar("Data")
HuggingFaceDataset = TypeVar("HuggingFaceDataset")

_logger = get_logger(__name__)


class BaseDataModule(abc.ABC, pl.LightningDataModule, Generic[Data]):
dataset: Data


class HuggingFaceDataModule(BaseDataModule[DatasetDict]):
LOADER_COLUMNS = [
"datasets_idx",
"input_ids",
"token_type_ids",
"attention_mask",
"start_positions",
"end_positions",
"labels",
]
DEFAULT_TOKENIZER_KWARGS = {"use_fast": True}
DEFAULT_BATCH_ENCODING_KWARGS = {
"padding": True,
"truncation": True,
}

def __init__(
self,
tokenizer_name_or_path: str,
dataset_name: str,
target_field: str,
max_seq_length: Optional[int] = None,
train_batch_size: int = 32,
eval_batch_size: int = 32,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
batch_encoding_kwargs: Optional[Dict[str, Any]] = None,
load_dataset_kwargs: Optional[Dict[str, Any]] = None,
**kwargs: Any,
) -> None:
# ignoring the type to avoid calling to untyped function "__init__" in typed context error
# caused by pl.LightningDataModule __init__ method not being typed
super().__init__() # type: ignore
self.tokenizer_name_or_path = tokenizer_name_or_path
self.dataset_name = dataset_name
self.target_field = target_field
self.max_seq_length = max_seq_length
self.train_batch_size = train_batch_size
self.eval_batch_size = eval_batch_size
self.tokenizer = AutoTokenizer.from_pretrained(
self.tokenizer_name_or_path,
**tokenizer_kwargs if tokenizer_kwargs else self.DEFAULT_TOKENIZER_KWARGS,
)
self.batch_encoding_kwargs = (
batch_encoding_kwargs if batch_encoding_kwargs else self.DEFAULT_BATCH_ENCODING_KWARGS
)
self.load_dataset_kwargs = load_dataset_kwargs if load_dataset_kwargs else {}

def load_dataset(self) -> DatasetDict:
return datasets.load_dataset(self.dataset_name, **self.load_dataset_kwargs)

def get_num_classes(self) -> int:
assert isinstance(self.dataset, DatasetDict)
if not isinstance(self.dataset["train"].features[self.target_field], ClassLabel):
self.dataset = self.dataset.class_encode_column(self.target_field)
num_classes = self.dataset["train"].features[self.target_field].num_classes
assert isinstance(num_classes, int)
return num_classes

def prepare_data(self) -> None:
datasets.load_dataset(self.dataset_name, **self.load_dataset_kwargs)
AutoTokenizer.from_pretrained(self.tokenizer_name_or_path)

def setup(self, stage: Optional[str] = None) -> None:
self.dataset = self.load_dataset()
self.num_classes = self.get_num_classes()
self.process_data()

def process_data(self) -> None:
columns = [c for c in self.dataset["train"].column_names if c not in self.LOADER_COLUMNS]
self.dataset = self.dataset.map(
self.convert_to_features,
batched=True,
remove_columns=columns,
)
self.dataset.set_format(type="torch")

def train_dataloader(self) -> DataLoader[HuggingFaceDataset]:
return DataLoader(self.dataset["train"], batch_size=self.train_batch_size)

# Ignoring the type of val_dataloader method from supertype "DataHooks" allowing for None
# and training without validation dataset.
def val_dataloader(self) -> Optional[DataLoader[HuggingFaceDataset]]: # type: ignore
if "validation" in self.dataset:
return DataLoader(self.dataset["validation"], batch_size=self.eval_batch_size)
else:
return None

def test_dataloader(self) -> DataLoader[HuggingFaceDataset]:
return DataLoader(self.dataset["test"], batch_size=self.eval_batch_size)

@abc.abstractmethod
def convert_to_features(
self, example_batch: Dict[str, Any], indices: Optional[List[int]] = None
) -> BatchEncoding:
pass


class TextClassificationDataModule(HuggingFaceDataModule):
def __init__(
self,
tokenizer_name_or_path: str,
dataset_name: str,
text_fields: Union[str, Sequence[str]],
target_field: str,
max_seq_length: Optional[int] = None,
train_batch_size: int = 32,
eval_batch_size: int = 32,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
batch_encoding_kwargs: Optional[Dict[str, Any]] = None,
load_dataset_kwargs: Optional[Dict[str, Any]] = None,
**kwargs: Any,
):
if isinstance(text_fields, str):
text_fields = [text_fields]
if len(text_fields) > 2:
raise ValueError("Too many fields given in text_fields attribute")
self.text_fields = text_fields
super().__init__(
tokenizer_name_or_path=tokenizer_name_or_path,
dataset_name=dataset_name,
target_field=target_field,
max_seq_length=max_seq_length,
train_batch_size=train_batch_size,
eval_batch_size=eval_batch_size,
tokenizer_kwargs=tokenizer_kwargs,
batch_encoding_kwargs=batch_encoding_kwargs,
load_dataset_kwargs=load_dataset_kwargs,
**kwargs,
)

def convert_to_features(
self, example_batch: Dict[str, Any], indices: Optional[List[int]] = None
) -> BatchEncoding:
"""Encodes either single sentence or sentence pairs."""
if len(self.text_fields) == 2:
texts_or_text_pairs = list(
zip(example_batch[self.text_fields[0]], example_batch[self.text_fields[1]])
)
elif len(self.text_fields) == 1:
texts_or_text_pairs = example_batch[self.text_fields[0]]
else:
raise ValueError("Inappropriate length of text_fields attribute")

features = self.tokenizer(
texts_or_text_pairs,
max_length=self.max_seq_length,
**self.batch_encoding_kwargs,
)

features["labels"] = example_batch[self.target_field]

return features
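
A minimal usage sketch of the new datamodule. The dataset and model names below match the README examples, while the `target` column name is an assumption rather than a value taken from this diff:

```python
from embeddings.data.datamodule import TextClassificationDataModule

# Assumed example values; any HuggingFace dataset with a text column and a
# ClassLabel (or encodable) target column should work the same way.
datamodule = TextClassificationDataModule(
    tokenizer_name_or_path="allegro/herbert-base-cased",
    dataset_name="clarin-pl/polemo2-official",
    text_fields="text",
    target_field="target",
    max_seq_length=128,
)
datamodule.setup("fit")  # loads the dataset, resolves num_classes, tokenizes all splits
train_loader = datamodule.train_dataloader()  # torch DataLoader over tokenized batches
```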
29 changes: 29 additions & 0 deletions embeddings/model/lightning_model.py
@@ -0,0 +1,29 @@
from typing import Any, Dict, Literal

import pytorch_lightning as pl
from numpy import typing as nptyping
from torch.utils.data import DataLoader

from embeddings.model.model import Model
from embeddings.task.lightning_task.lightning_task import HuggingFaceLightningTask


class LightningModel(Model[pl.LightningDataModule, Dict[str, nptyping.NDArray[Any]]]):
def __init__(
self,
trainer: pl.Trainer,
task: HuggingFaceLightningTask,
predict_subset: Literal["dev", "test"] = "test",
) -> None:
super().__init__()
self.trainer = trainer
self.task = task
self.predict_subset = predict_subset

def execute(self, data: pl.LightningDataModule) -> Dict[str, nptyping.NDArray[Any]]:
self.trainer.fit(self.task, data)
dataloader = (
data.test_dataloader() if self.predict_subset == "test" else data.val_dataloader()
)
assert isinstance(dataloader, DataLoader)
return self.task.predict(dataloader=dataloader)
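
A hedged sketch of how `LightningModel` ties a trainer, a task and a datamodule together. The `TextClassification` arguments mirror the call made by `LightningClassificationPipeline` further down; the dataset and column names are assumptions:

```python
import pytorch_lightning as pl

from embeddings.data.datamodule import TextClassificationDataModule
from embeddings.model.lightning_model import LightningModel
from embeddings.task.lightning_task.text_classification import TextClassification

datamodule = TextClassificationDataModule(
    tokenizer_name_or_path="allegro/herbert-base-cased",
    dataset_name="clarin-pl/polemo2-official",
    text_fields="text",
    target_field="target",  # assumed column name
)
task = TextClassification(
    model_name_or_path="allegro/herbert-base-cased",
    train_batch_size=32,
    eval_batch_size=32,
    task_model_kwargs={"use_scheduler": True},
)
model = LightningModel(trainer=pl.Trainer(max_epochs=1), task=task, predict_subset="test")

datamodule.setup("fit")
predictions = model.execute(data=datamodule)  # fits the task, then predicts on the test split
```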
@@ -22,7 +22,7 @@
from embeddings.transformation.transformation import Transformation


class HuggingFaceClassificationPipeline(
class FlairClassificationPipeline(
StandardPipeline[
str, datasets.DatasetDict, Corpus, Dict[str, nptyping.NDArray[Any]], Dict[str, Any]
]
@@ -22,7 +22,7 @@
from embeddings.transformation.transformation import Transformation


class HuggingFacePairClassificationPipeline(
class FlairPairClassificationPipeline(
StandardPipeline[
str, datasets.DatasetDict, Corpus, Dict[str, nptyping.NDArray[Any]], Dict[str, Any]
]
@@ -21,7 +21,7 @@
from embeddings.transformation.transformation import Transformation


class HuggingFaceSequenceLabelingPipeline(
class FlairSequenceLabelingPipeline(
StandardPipeline[
str, datasets.DatasetDict, Corpus, Dict[str, nptyping.NDArray[Any]], Dict[str, Any]
]
65 changes: 65 additions & 0 deletions embeddings/pipeline/lightning_classification.py
@@ -0,0 +1,65 @@
from typing import Any, Dict, Optional, Sequence, Union

import datasets
import pytorch_lightning as pl
from numpy import typing as nptyping

from embeddings.data.datamodule import TextClassificationDataModule
from embeddings.data.io import T_path
from embeddings.evaluator.text_classification_evaluator import TextClassificationEvaluator
from embeddings.model.lightning_model import LightningModel
from embeddings.pipeline.lightning_pipeline import LightningPipeline
from embeddings.task.lightning_task.text_classification import TextClassification


class LightningClassificationPipeline(
LightningPipeline[datasets.DatasetDict, Dict[str, nptyping.NDArray[Any]], Dict[str, Any]]
):
DEFAULT_TASK_TRAIN_KWARGS = {"gpus": 1, "auto_select_gpus": True}
DEFAULT_TASK_MODEL_KWARGS = {"use_scheduler": True}

def __init__(
self,
embedding_name: str,
dataset_name: str,
input_column_name: Union[str, Sequence[str]],
target_column_name: str,
output_path: T_path,
max_seq_length: Optional[int] = None,
train_batch_size: int = 32,
eval_batch_size: int = 32,
tokenizer_name: Optional[str] = None,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
batch_encoding_kwargs: Optional[Dict[str, Any]] = None,
load_dataset_kwargs: Optional[Dict[str, Any]] = None,
task_model_kwargs: Optional[Dict[str, Any]] = None,
task_train_kwargs: Optional[Dict[str, Any]] = None,
):
datamodule = TextClassificationDataModule(
tokenizer_name_or_path=tokenizer_name if tokenizer_name else embedding_name,
dataset_name=dataset_name,
text_fields=input_column_name,
target_field=target_column_name,
max_seq_length=max_seq_length,
train_batch_size=train_batch_size,
eval_batch_size=eval_batch_size,
tokenizer_kwargs=tokenizer_kwargs,
batch_encoding_kwargs=batch_encoding_kwargs,
load_dataset_kwargs=load_dataset_kwargs,
)
trainer = pl.Trainer(
default_root_dir=output_path,
**task_train_kwargs if task_train_kwargs else self.DEFAULT_TASK_TRAIN_KWARGS
)

task = TextClassification(
model_name_or_path=embedding_name,
train_batch_size=train_batch_size,
eval_batch_size=eval_batch_size,
task_model_kwargs=task_model_kwargs
if task_model_kwargs
else self.DEFAULT_TASK_MODEL_KWARGS,
)
model = LightningModel(trainer=trainer, task=task, predict_subset="test")
evaluator = TextClassificationEvaluator()
super().__init__(datamodule, model, evaluator)
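
For reference, a usage sketch in the style of the README examples. The column names and output path are assumptions; note that `DEFAULT_TASK_TRAIN_KWARGS` requests a GPU, so pass `task_train_kwargs` explicitly to run on CPU:

```python
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(
    embedding_name="allegro/herbert-base-cased",
    dataset_name="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",  # assumed column name
    output_path=".",  # assumed; any writable directory works
    task_train_kwargs={"max_epochs": 1, "gpus": 0},  # overrides the GPU default
)
results = pipeline.run()
```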
29 changes: 29 additions & 0 deletions embeddings/pipeline/lightning_pipeline.py
@@ -0,0 +1,29 @@
from typing import Generic, TypeVar

from embeddings.data.datamodule import BaseDataModule, Data
from embeddings.evaluator.evaluator import Evaluator
from embeddings.model.model import Model
from embeddings.pipeline.pipeline import Pipeline

EvaluationResult = TypeVar("EvaluationResult")
ModelResult = TypeVar("ModelResult")


class LightningPipeline(
Pipeline[EvaluationResult],
Generic[Data, ModelResult, EvaluationResult],
):
def __init__(
self,
datamodule: BaseDataModule[Data],
model: Model[BaseDataModule[Data], ModelResult],
evaluator: Evaluator[ModelResult, EvaluationResult],
) -> None:
self.datamodule = datamodule
self.model = model
self.evaluator = evaluator

def run(self) -> EvaluationResult:
self.datamodule.setup("fit")
model_result = self.model.execute(data=self.datamodule)
return self.evaluator.evaluate(model_result)