Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speech Regression Model #1629

Closed
wants to merge 0 commits into from
Closed

Speech Regression Model #1629

wants to merge 0 commits into from

Conversation

diego-fustes
Copy link
Contributor

Extend the current speech classification models (e.g. Speaker Recognition) to accommodate the prediction of a numeric target (regression).
This can be applied to use cases like Speech Quality Assessment

@lgtm-com
Copy link

lgtm-com bot commented Jan 13, 2021

This pull request introduces 1 alert when merging 657e3e5 into bc5f034 - view on LGTM.com

new alerts:

  • 1 for 'import *' may pollute namespace

@diego-fustes
Copy link
Contributor Author

Hi @titu1994, I would like to merge this pull request. Could you review it? Thanks in advance

@lgtm-com
Copy link

lgtm-com bot commented Jan 13, 2021

This pull request introduces 1 alert when merging b4cbe14 into bc5f034 - view on LGTM.com

new alerts:

  • 1 for 'import *' may pollute namespace

Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great PR! There are a few concerns with the amount of code duplication occuring with the parallel EncDecClassificationModel.

PR #1617 from @fayejf is trying to unify the two already similar models (Classification model and Label model for speaker labels), and I think this regression model should become a third subclass in order to share the vast commonalities between the models.

That PR also greatly refactors the dataset management using helper methods that collapse configs, and unify access, so thats another thing this PR would need to do to stay consistent.

I think this PR should remain in draft status, and rebase on top of #1617 once that is merged, so as to subclass and share as much common code as possible.

Module which reads speech recognition with a numeric target. It accepts comma-separated
JSON manifest files describing the correspondence between wav audio files
and their target values. JSON files should be of the following format::
{"audio_filepath": path_to_wav_0, "duration": time_in_sec_0, "target": \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is target and scalar used in side __getitem__? This dataset seems to be a duplicate of the AudioToLabelDataSet

I think it would be best to reuse functionality in the preexisting dataset classes (by making them adaptive to the input data).

So if the AudioToLabelDataSet gets a label - assume its a classification problem. If its a regression_target - assume its a regression value.

__all__ = ['EncDecRegressionModel']


class EncDecRegressionModel(ASRModel, Exportable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class is almost identical to the EncDecClassificationModel with minor semantic changes - output_types is regression type, loss is MSE, dataset is slightly different., logs are slightly different. Its constructor, forward, train/val/test/multi_*, transcribe, export etc are all very similar to each other with minor differences only.

I think we can refactor a base class that operates on both classification and regression tasks, and dynamically switches its behaviour. This will avoid long term maintenance costs of keeping two nearly identical models in sync.

'test_loss': loss_value,
}

def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For multi methods, please return a log dict. It manages the actual logging with self.log() internally while properly attaching dataset prefix names.

'val_loss': loss_value,
}

def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For multi methods, please return a log dict. It manages the actual logging with self.log() internally while properly attaching dataset prefix names.

@@ -331,6 +331,57 @@ def num_classes(self):
return self._num_classes


class ConvASRDecoderRegression(NeuralModule, Exportable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This neural module can be refactored entirely inside ConvASRDecoderClassification by adding/removing the final activation function. I think having a shared base module between the two will greatly reduce this code duplication.

@diego-fustes
Copy link
Contributor Author

I completely agree with your comments, there are clear opportunities to avoid code duplication. I've implemented this way for the sake of clarity and to avoid conflicts. Anyway, I will watch the related PR and refactor this one after it's been merged.

On the other hand, the DCO check is failing as my commits have been signed with a wrong email. I've tried to correct it by rebasing, but I wasn't able to fix it so far. Any clue about how to handle it?

Thanks for your help

@diego-fustes diego-fustes marked this pull request as draft January 14, 2021 08:46
@fayejf
Copy link
Collaborator

fayejf commented Jan 30, 2021

HI @diego-fustes Sorry for keep you waiting. PR#1617 is merged now 😃 . Please rebase on it. And let me know if you have any questions.

@lgtm-com
Copy link

lgtm-com bot commented Jan 30, 2021

This pull request introduces 1 alert when merging 33d6fd0 into 14904fd - view on LGTM.com

new alerts:

  • 1 for 'import *' may pollute namespace

@diego-fustes
Copy link
Contributor Author

Hi @fayejf and @titu1994 :

I've been looking at the current code base, and I think that some refactoring is still needed. To start with, I think that EndDecSpeakerLabelModel should be a subclass of EncDecClassificationModel. At the end, I think that it's almost the same, just with the option of switching the loss to AngularSoftmaxLoss and that it outputs the embeddings along with the logits.

On the other hand, the class EncDecRegressionModel that I'm implementing should be a superclass of EncDecClassificationModel. At the end, classification can be seem as a special type of regression with a threshold...

What do you think? It's quite a major change, I'm willing to implement it if you agree

@titu1994
Copy link
Collaborator

titu1994 commented Feb 1, 2021

Hey @diego-fustes,

You raise some are valid points, but there are other reasons they are kept separate as of now.

EndDecSpeakerLabelModel should be a subclass of EncDecClassificationModel. At the end, I think that it's almost the same, just with the option of switching the loss to AngularSoftmaxLoss and that it outputs the embeddings along with the logits.

Yes, this is true - for the present. The reason we internally decided to separate the two models is that Speech classification is guarenteed to have a 1 ip - 1 op pipeline as in general classification models. Speaker classification on the other hand may use auxiliary information in the future, or might require architectures with multiple outputs. Speaker classification is also one step in the entire speaker diarization pipeline, which needs different interaction than the simpler classification model. All in all, it was therefore concluded that albeit they share significant similarities, they would still be separate entities.

EncDecRegressionModel that I'm implementing should be a superclass of EncDecClassificationModel. At the end, classification can be seem as a special type of regression with a threshold

This is something I disagree with. Simply put, while mathematically correct that classification is a special case of regression, that does not mean it is semantically the same task. We dont apply regression losses (MSE) to train MNIST / ImageNet, nor do we use any activation function for regression models (as an example). The dataset, and the semantic task that the model performs is entirely different, even if mathematically the only difference is the application of softmax/sigmoid at the end of the model.

@lgtm-com
Copy link

lgtm-com bot commented Feb 3, 2021

This pull request introduces 1 alert when merging 01e79ac into 941ef1f - view on LGTM.com

new alerts:

  • 1 for 'import *' may pollute namespace

@diego-fustes diego-fustes marked this pull request as ready for review February 3, 2021 12:50
@diego-fustes
Copy link
Contributor Author

Hi @titu1994

I agree with you, from a practical perspective it's better to implement the common code for regression and classification in an abstract class. I've just pushed a new version, trying to minimize code duplication. Apologies as I think that I've added more commits that needed, I'm a kind of starting with GitHub.

Kind regards

@lgtm-com
Copy link

lgtm-com bot commented Feb 3, 2021

This pull request introduces 1 alert when merging 1452075 into 941ef1f - view on LGTM.com

new alerts:

  • 1 for 'import *' may pollute namespace

@titu1994
Copy link
Collaborator

titu1994 commented Feb 3, 2021

Hey @diego-fustes, no worries about Git. Could you try rebasing your commit on top of main branch instead of merge? That should clean up the commit history.

I was looking though some of the changes, and they are pretty good. Some minor cleanup here and there and it should be good to merge. I'll post the comments after the rebase (as comments sometime get deleted after rebase)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants