
Infinite hang after running Trainer.fit #18490

Open
Lv101Magikarp opened this issue Sep 5, 2023 · 3 comments
Labels
bug (Something isn't working) · strategy: ddp (DistributedDataParallel) · ver: 2.0.x

Comments


Lv101Magikarp commented Sep 5, 2023

Bug description

I followed this tutorial to build a Lightning model for multi-label text classification, except I'm using my own dataset.

I had to fix some of the code because I believe it was using deprecated syntax/features.

However, when I run Trainer.fit, the process hangs indefinitely with no meaningful error beforehand.
As I'm not sure how to go about debugging this, I'm creating this issue.

I'm running Lightning 2.0.8, Python 3.8.10, CUDA 11.7, and NVIDIA driver 470.199.02.
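
Since the hang happens during distributed startup, one debugging sketch (an assumption on my part, not something from the tutorial) is to turn on the standard NCCL and torch.distributed verbosity before calling Trainer.fit, so the logs show where the rendezvous stalls:

import os

# Standard PyTorch/NCCL debug switches; set them before the Trainer spawns DDP processes.
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL handshake/transport logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed diagnostics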

What version are you seeing the problem on?

v2.0

How to reproduce the bug

This is my custom LightningModule for training the model.

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.optim import AdamW
from torchmetrics.functional import auroc
from transformers import BertModel, get_linear_schedule_with_warmup

# bert_model_name and label_columns are defined elsewhere in the script (as in the tutorial).


class Tagger(pl.LightningModule):

    def __init__(self, n_classes: int, n_training_steps=None, n_warmup_steps=None):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.n_training_steps = n_training_steps
        self.n_warmup_steps = n_warmup_steps
        self.criterion = nn.BCELoss()
        # Lightning >= 2.0 no longer passes `outputs` to epoch-end hooks,
        # so training-step outputs are collected manually.
        self.training_step_outputs = []

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.bert(input_ids, attention_mask=attention_mask)
        output = self.classifier(output.pooler_output)
        output = torch.sigmoid(output)
        loss = 0
        if labels is not None:
            loss = self.criterion(output, labels)
        return loss, output

    def training_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log('train_loss', loss, prog_bar=True, logger=True)
        self.training_step_outputs.append({'predictions': outputs, 'labels': labels})
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log('val_loss', loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log('test_loss', loss, prog_bar=True, logger=True)
        return loss

    def on_train_epoch_end(self):
        labels = []
        predictions = []
        for output in self.training_step_outputs:
            for out_labels in output['labels'].detach().cpu():
                labels.append(out_labels)
            for out_predictions in output['predictions'].detach().cpu():
                predictions.append(out_predictions)
        labels = torch.stack(labels).int()
        predictions = torch.stack(predictions)
        for i, name in enumerate(label_columns):
            # torchmetrics >= 0.11 requires an explicit task argument.
            class_roc_auc = auroc(predictions[:, i], labels[:, i], task='binary')
            self.logger.experiment.add_scalar(f'{name}_roc_auc/Train', class_roc_auc, self.current_epoch)
        self.training_step_outputs.clear()

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=1e-5)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=self.n_warmup_steps,
            num_training_steps=self.n_training_steps
        )
        # Lightning expects the scheduler interval inside the lr_scheduler config.
        return dict(
            optimizer=optimizer,
            lr_scheduler=dict(scheduler=scheduler, interval='step')
        )
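
The Trainer setup itself isn't shown above; a minimal sketch of the kind of call that triggers the hang, assuming (hypothetically) a data module named data_module built as in the tutorial, a label_columns list, precomputed step counts, and two visible GPUs:

import pytorch_lightning as pl

pl.seed_everything(42)

# Hypothetical reproduction sketch; the names below are placeholders, not from the original post.
model = Tagger(
    n_classes=len(label_columns),
    n_warmup_steps=warmup_steps,
    n_training_steps=total_training_steps,
)

trainer = pl.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=2,                       # two processes -> DDP, where the hang appears
    strategy="ddp",
    default_root_dir="model/saved",  # matches the "Missing logger folder" path in the logs
)
trainer.fit(model, datamodule=data_module)  # hangs after "All distributed processes registered"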

Error messages and logs

This is what I get before it hangs.

[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Missing logger folder: model/saved/lightning_logs
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: model/saved/lightning_logs
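
Since the hang occurs right after rendezvous, one way to rule Lightning out (a sketch independent of this project) is a bare torch.distributed sanity check on the same two GPUs, launched with torchrun --nproc_per_node=2:

# nccl_check.py -- standalone NCCL sanity check, no Lightning involved.
# Run with: torchrun --nproc_per_node=2 nccl_check.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    # One all_reduce across both GPUs; if this also hangs, the problem is in
    # NCCL / the driver / the GPU interconnect rather than in Lightning.
    t = torch.ones(1, device=f"cuda:{rank}") * (rank + 1)
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")  # expect 3.0 with 2 ranks
    dist.destroy_process_group()


if __name__ == "__main__":
    main()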

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @justusschock @awaelchli

@Lv101Magikarp Lv101Magikarp added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Sep 5, 2023
@awaelchli awaelchli added strategy: ddp DistributedDataParallel and removed needs triage Waiting to be triaged by maintainers labels Nov 25, 2023
@riyaj8888

I am also facing the same issue. Any update on this?

@riyaj8888

With a single GPU the script ran, but with multiple GPUs it gets stuck.
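
For hangs that show up only in the multi-GPU case, one workaround worth trying (an assumption, not a confirmed fix for this issue) is disabling NCCL peer-to-peer transfers, which are known to deadlock on some driver/hardware combinations:

import os

# Hypothetical workaround: force NCCL to skip the P2P (and InfiniBand) transports.
# Must be set before any distributed/NCCL initialization.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"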

@Lv101Magikarp
Author

I am also facing the same issue. Any update on this?

Unfortunately no. My solution for that project was to move away from PyTorch Lightning for the moment...
