Speech Regression Model #1629
Conversation
This pull request introduces 1 alert when merging 657e3e5 into bc5f034 - view new alerts on LGTM.com.
Hi @titu1994, I would like to merge this pull request. Could you review it? Thanks in advance.
This pull request introduces 1 alert when merging b4cbe14 into bc5f034 - view new alerts on LGTM.com.
Thanks for this great PR! There are a few concerns with the amount of code duplication occurring with the parallel EncDecClassificationModel.
PR #1617 from @fayejf is trying to unify the two already similar models (the Classification model and the Label model for speaker labels), and I think this regression model should become a third subclass in order to share the vast commonalities between the models.
That PR also greatly refactors the dataset management using helper methods that collapse configs and unify access, so that's another thing this PR would need to do to stay consistent.
I think this PR should remain in draft status and rebase on top of #1617 once that is merged, so as to subclass and share as much common code as possible.
Module which reads speech recognition with a numeric target. It accepts comma-separated
JSON manifest files describing the correspondence between wav audio files
and their target values. JSON files should be of the following format::

    {"audio_filepath": path_to_wav_0, "duration": time_in_sec_0, "target": \
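For illustration, the manifest format described in the docstring can be exercised with a short sketch. The file paths, durations, and target values below are made up, not taken from the PR:

```python
import json

# Hypothetical manifest entries: one JSON object per line, in the
# newline-delimited style the docstring above describes. All values
# here are illustrative placeholders.
entries = [
    {"audio_filepath": "/data/sample_0.wav", "duration": 3.2, "target": 4.5},
    {"audio_filepath": "/data/sample_1.wav", "duration": 2.7, "target": 1.0},
]

# Serialize to the newline-delimited manifest format...
manifest_text = "\n".join(json.dumps(e) for e in entries)

# ...and parse it back, collecting the numeric regression targets.
targets = [json.loads(line)["target"] for line in manifest_text.splitlines()]
print(targets)  # [4.5, 1.0]
```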
Where are `target` and `scalar` used inside `__getitem__`? This dataset seems to be a duplicate of the AudioToLabelDataSet.
I think it would be best to reuse functionality in the preexisting dataset classes (by making them adaptive to the input data). So if the AudioToLabelDataSet gets a `label`, assume it's a classification problem; if it gets a `regression_target`, assume it's a regression value.
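The adaptive behaviour suggested above can be sketched as a small helper that inspects a manifest entry and infers the task; the function name and key names are illustrative, not NeMo API:

```python
# Sketch of the "adaptive dataset" idea: a "label" key implies a
# classification sample, a "regression_target" key implies a regression
# sample. Names here are hypothetical, not taken from NeMo.
def infer_task(entry: dict) -> str:
    """Return the task type implied by one manifest entry."""
    if "label" in entry:
        return "classification"
    if "regression_target" in entry:
        return "regression"
    raise ValueError("entry has neither 'label' nor 'regression_target'")

print(infer_task({"audio_filepath": "a.wav", "label": "yes"}))             # classification
print(infer_task({"audio_filepath": "b.wav", "regression_target": 3.75}))  # regression
```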
__all__ = ['EncDecRegressionModel']


class EncDecRegressionModel(ASRModel, Exportable):
I think this class is almost identical to the EncDecClassificationModel, with minor semantic changes: output_types is a regression type, the loss is MSE, the dataset is slightly different, and the logs are slightly different. Its constructor, forward, train/val/test/multi_* methods, transcribe, export, etc. are all very similar, with minor differences only.
I think we can refactor a base class that operates on both classification and regression tasks and dynamically switches its behaviour. This will avoid the long-term maintenance cost of keeping two nearly identical models in sync.
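The dynamic switching proposed here could look roughly like the following. This is a pure-Python toy, not the eventual NeMo base class; the class name, the `is_regression_task` flag, and the loss names are all assumptions for illustration:

```python
class _EncDecBaseSketch:
    """Toy stand-in for a shared base model: a single task flag selects
    the loss and the metric naming, everything else is shared."""

    def __init__(self, is_regression_task: bool):
        self.is_regression_task = is_regression_task
        # The base picks the loss once; subclasses inherit the rest.
        self.loss_name = "mse" if is_regression_task else "cross_entropy"

    def metric_key(self, split: str) -> str:
        # Both tasks log a loss; accuracy-style metrics would branch
        # on self.is_regression_task in the same way.
        return f"{split}_loss"


reg = _EncDecBaseSketch(is_regression_task=True)
clf = _EncDecBaseSketch(is_regression_task=False)
print(reg.loss_name, clf.loss_name)  # mse cross_entropy
```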
    'test_loss': loss_value,
}

def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0):
For the multi_* methods, please return a log dict. The base class manages the actual logging with self.log() internally while properly attaching dataset prefix names.
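The requested shape can be sketched like this; the exact keys the framework expects (and the `'log'` wrapper) are assumptions here, not confirmed NeMo API:

```python
def multi_test_epoch_end_sketch(outputs, dataloader_idx: int = 0):
    """Aggregate per-batch dicts and RETURN a log dict, rather than
    calling self.log() directly; the caller can then attach the
    per-dataloader prefix before logging."""
    losses = [o["test_loss"] for o in outputs]
    mean_loss = sum(losses) / len(losses)
    return {"test_loss": mean_loss, "log": {"test_loss": mean_loss}}


# Two fake per-batch outputs; the mean test loss is returned, not logged.
out = multi_test_epoch_end_sketch([{"test_loss": 0.2}, {"test_loss": 0.4}])
```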
    'val_loss': loss_value,
}

def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):
For the multi_* methods, please return a log dict. The base class manages the actual logging with self.log() internally while properly attaching dataset prefix names.
@@ -331,6 +331,57 @@ def num_classes(self):
    return self._num_classes


class ConvASRDecoderRegression(NeuralModule, Exportable):
This neural module can be refactored entirely inside ConvASRDecoderClassification by adding/removing the final activation function. I think having a shared base module between the two will greatly reduce this code duplication.
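The add/remove-the-final-activation idea can be sketched in plain Python (the real modules are PyTorch NeuralModules; the class name and `apply_softmax` flag are hypothetical):

```python
import math


def _softmax(xs):
    """Numerically stable softmax over a flat list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class SharedConvDecoderSketch:
    """Stand-in for a shared decoder base: classification applies the
    final softmax over the logits, regression returns them raw."""

    def __init__(self, apply_softmax: bool):
        self.apply_softmax = apply_softmax

    def forward(self, logits):
        return _softmax(logits) if self.apply_softmax else logits


clf_out = SharedConvDecoderSketch(apply_softmax=True).forward([1.0, 2.0])
reg_out = SharedConvDecoderSketch(apply_softmax=False).forward([3.5])
```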
I completely agree with your comments; there are clear opportunities to avoid code duplication. I've implemented it this way for the sake of clarity and to avoid conflicts. Anyway, I will watch the related PR and refactor this one after it's been merged. On the other hand, the DCO check is failing because my commits have been signed with a wrong email. I've tried to correct it by rebasing, but I haven't been able to fix it so far. Any clue about how to handle it? Thanks for your help.
Hi @diego-fustes, sorry for keeping you waiting. PR #1617 is merged now 😃. Please rebase on it, and let me know if you have any questions.
This pull request introduces 1 alert when merging 33d6fd0 into 14904fd - view new alerts on LGTM.com.
I've been looking at the current code base, and I think that some refactoring is still needed. To start with, I think that EncDecSpeakerLabelModel should be a subclass of EncDecClassificationModel. In the end, I think it's almost the same, just with the option of switching the loss to AngularSoftmaxLoss and the fact that it outputs the embeddings along with the logits. On the other hand, the class EncDecRegressionModel that I'm implementing should be a superclass of EncDecClassificationModel. After all, classification can be seen as a special type of regression with a threshold... What do you think? It's quite a major change; I'm willing to implement it if you agree.
Hey @diego-fustes, you raise some valid points, but there are other reasons they are kept separate as of now.
Yes, this is true for the present. The reason we internally decided to separate the two models is that speech classification is guaranteed to have a one-input, one-output pipeline, as in general classification models. Speaker classification, on the other hand, may use auxiliary information in the future, or might require architectures with multiple outputs. Speaker classification is also one step in the entire speaker diarization pipeline, which needs different interaction than the simpler classification model. All in all, it was therefore concluded that although they share significant similarities, they would still be separate entities.
This is something I disagree with. Simply put, while it is mathematically correct that classification is a special case of regression, that does not mean it is semantically the same task. We don't apply regression losses (MSE) to train MNIST / ImageNet, nor do we use any activation function for regression models (as an example). The dataset and the semantic task that the model performs are entirely different, even if mathematically the only difference is the application of softmax/sigmoid at the end of the model.
This pull request introduces 1 alert when merging 01e79ac into 941ef1f - view new alerts on LGTM.com.
Hi @titu1994, I agree with you; from a practical perspective it's better to implement the common code for regression and classification in an abstract class. I've just pushed a new version, trying to minimize code duplication. Apologies, as I think I've added more commits than needed; I'm kind of just starting out with GitHub. Kind regards.
This pull request introduces 1 alert when merging 1452075 into 941ef1f - view new alerts on LGTM.com.
Hey @diego-fustes, no worries about Git. Could you try rebasing your commits on top of the main branch instead of merging? That should clean up the commit history. I was looking through some of the changes, and they are pretty good. Some minor cleanup here and there and it should be good to merge. I'll post the comments after the rebase (as comments sometimes get deleted after a rebase).
Extend the current speech classification models (e.g. Speaker Recognition) to accommodate the prediction of a numeric target (regression).
This can be applied to use cases like Speech Quality Assessment.