
feature/dataset-expansion-and-model-training #1

Closed
wants to merge 22 commits into main

Conversation

@evamaxfield (Member):

Description of Changes

This adds the whole process of expanding a dataset from Gecko:

gecko_annotation_file_1, audio_file_1,
gecko_annotation_file_2, audio_file_2,
...

into an annotated speaker ID training set:

speaker_label_1, audio_clip_1,
speaker_label_1, audio_clip_2,
speaker_label_2, audio_clip_3,
speaker_label_3, audio_clip_4,
...

and then training a new speaker identification model by fine-tuning an existing large model from transformers / Hugging Face.
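For anyone skimming, here is a rough sketch of what the fine-tuning step boils down to. The actual speakerbox functions differ; this just reuses the checkpoint and the training settings discussed later in this thread, and assumes a datasets.DatasetDict with "train" / "test" splits whose rows carry an "audio" array and an integer "label" column.

# Sketch only -- not the exact speakerbox code.
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForSequenceClassification,
)

checkpoint = "superb/wav2vec2-base-superb-sid"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)

# Re-initialize the classifier head for our 8 local speaker labels
# (the checkpoint's head was trained for 1251 speakers).
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=8,
    ignore_mismatched_sizes=True,
)

max_duration = 2.0  # seconds; clips in the dataset are 0.5-2.0s long

def preprocess(batch):
    # Convert raw audio arrays into padded / truncated model inputs.
    return feature_extractor(
        [audio["array"] for audio in batch["audio"]],
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        padding="max_length",
        truncation=True,
        do_normalize=True,
    )

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="trained-speakerbox",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()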

I'm on my laptop now and didn't save the confusion matrix, but the last time I trained on my desktop the model finished training with ~99% accuracy. I am a bit worried about overfitting, but honestly idk; I am even evaluating on a holdout validation set 🤷

The README is updated. If you have an NVIDIA GPU, feel free to give the quickstart a go; if you don't have a GPU (like me right now on my laptop), I wouldn't try training because it takes ~4 hours to complete on CPU.

@evamaxfield added the enhancement (New feature or request) label on Feb 5, 2022
@evamaxfield self-assigned this on Feb 5, 2022
@codecov bot commented Feb 5, 2022:

Codecov Report

Merging #1 (2d2658f) into main (90011e4) will decrease coverage by 1.45%.
The diff coverage is 4.90%.

❗ Current head 2d2658f differs from pull request most recent head 568a20d. Consider uploading reports for the commit 568a20d to get more accurate results

@@            Coverage Diff            @@
##             main      #1      +/-   ##
=========================================
- Coverage   10.74%   9.28%   -1.46%     
=========================================
  Files           5      10       +5     
  Lines         121     280     +159     
=========================================
+ Hits           13      26      +13     
- Misses        108     254     +146     
Impacted Files Coverage Δ
speakerbox/datasets/__init__.py 0.00% <0.00%> (ø)
speakerbox/datasets/seattle_2021_proto.py 0.00% <0.00%> (ø)
speakerbox/preprocess.py 0.00% <0.00%> (ø)
speakerbox/types.py 0.00% <0.00%> (ø)
speakerbox/utils.py 0.00% <0.00%> (ø)
speakerbox/main.py 29.26% <29.26%> (ø)
speakerbox/__init__.py 85.71% <100.00%> (+2.38%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90011e4...568a20d.

@tohuynh left a comment:

99% validation acc does seem suspiciously high.

I just have a few clarifying questions.

audio_arrays,
sampling_rate=feature_extractor.sampling_rate,
max_length=int(feature_extractor.sampling_rate * max_duration),
do_normalize=True,
Reviewer:

How are the mean and variance calculated and used in this normalization step? Are they calculated per batch and used to normalize that batch? Usually, the mean and variance are calculated from the train set and used to normalize the entire set.

Or are the mean and variance calculated from the entire set and then used to normalize the entire set? If so, there is data leakage from the holdout validation set into the training step -- the training is peeking at the mean and variance of the validation set to pick the best model.

@evamaxfield (Member, Author):

Great question and, to be honest, I don't entirely know. The feature extractor comes shipped with the model, so we would have to find their original code. It could just be normalizing a single audio array though -- taking the mean of that array and scaling it, without taking the whole dataset into account. But I see your point.
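My understanding (worth verifying against the transformers source) is that the feature extractor with do_normalize=True standardizes each audio array on its own, roughly like this sketch:

import numpy as np

def normalize_per_example(audio_array: np.ndarray) -> np.ndarray:
    # Zero-mean / unit-variance normalization computed from this single
    # example only -- no statistics are shared across the dataset, so the
    # train/validation split should not leak information through this step.
    return (audio_array - audio_array.mean()) / np.sqrt(audio_array.var() + 1e-7)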

Reviewer:

Yeah, normalization (according to the doc) is transforming the input to zero mean and unit variance by subtracting the mean and dividing by the standard deviation. It's important where the mean and variance are calculated from. I don't know what the effect on training would be if the mean and variance are calculated for each input and used only for that input. I've only seen the mean and variance calculated from the train set, with those values frozen and used to normalize the entire set.

Reviewer:

Thinking about this now, I think the current code is fine. I think maybe the mean and variance are calculated from the original (non-cdp) dataset.

@evamaxfield (Member, Author):

I can try training a model with it turned off and see what effect it has on the accuracy.

Reviewer:

Could you try turning off normalization and performing the normalization ourselves? That is, calculate the mean and variance from the train set and use them to normalize the entire set.

Reviewer:

At first glance the accuracy does seem high, but IMHO we are classifying only a few different speakers from several hours of data, so I don't think this number looks too suspicious considering speaking voices are very distinct. It looks like you have done your best to balance out the speakers in each set, and even if accuracy is not the best metric here, if we wanted to use any of the F-score family of metrics we would have to assign negatives and positives.

Reviewer:

Worth noting that all of our misclassifications are female. Seattle does have a roughly 2/3 female council, so all things being equal, a misclassification is more likely to be female by default. Of the 6 misses on your confusion matrix, a few female members had 2 misclassifications each. I am simply bringing it up so that if we do, say, King County next, where female members are about 1/3, and we again only misclassify female members, we may want to look deeper.

Reviewer:

Wondering if the misclassified members had less time in meetings? Gonzalez is new, so I am guessing that could be true for at least one?

@evamaxfield (Member, Author):

I balanced the whole dataset to have n samples per person, where n was the sample count of the person with the fewest samples. So they all had a random ~200 samples or so, iirc.

Gonzalez actually has the most data in the un-balanced dataset; I think you may have confused her with someone else. Gonzalez used to be the council president. But yes, it will be interesting to test on King County.

Comment on lines 86 to 87
train_and_test = dataset.train_test_split(test_size=0.4)
test_and_valid = train_and_test["test"].train_test_split(test_size=0.5)
Reviewer:

Are the frequencies of the speaker labels about the same across the train, test, and validation sets?

Could we avoid intermixing events across these three sets? That is, audio clips from a given event should belong to only one set. I think this would better simulate the real-world use of the model after training.
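A rough sketch of what I mean, assuming each row kept a hypothetical "event_id" column (I don't think the dataset stores one today):

held_out_test_events = {"event_03"}   # hypothetical event ids
held_out_valid_events = {"event_04"}
held_out = held_out_test_events | held_out_valid_events

# Split by event rather than by row, so no event's audio appears in more
# than one of train / test / valid.
train_ds = dataset.filter(lambda row: row["event_id"] not in held_out)
test_ds = dataset.filter(lambda row: row["event_id"] in held_out_test_events)
valid_ds = dataset.filter(lambda row: row["event_id"] in held_out_valid_events)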

@evamaxfield (Member, Author):

Yes, as part of the setup / data prep process I am making sure that there is an even number of samples for each speaker label and then drawing random samples for the train, test, and validation sets. So while there may be some randomness in how many clips of each speaker land in each set, the overall dataset is reduced so that every speaker has the same number of samples, and we would have to be incredibly unlucky to get really unbalanced subsets.
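Roughly, the balancing step amounts to something like this (sketched here with pandas and a hypothetical "label" column; the actual code differs):

import pandas as pd

def balance_by_label(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    # Downsample every speaker label to the size of the smallest one so each
    # speaker contributes the same number of clips before splitting.
    min_count = df["label"].value_counts().min()
    return (
        df.groupby("label", group_keys=False)
        .apply(lambda group: group.sample(n=min_count, random_state=seed))
        .reset_index(drop=True)
    )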

I really like the idea of having a couple of events that aren't in the training set in the holdout validation set / balancing the dataset by both speaker and event. Unfortunately our dataset is only made up of five events in total 😬

A part of me wants to say I should run this model across a couple of transcripts to pre-seed annotations and then manually run through and fix the annotations where it got them wrong, as a quick way to add more training data.


per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
per_device_eval_batch_size=batch_size,
num_train_epochs=5,
Reviewer:

I think time and money could be saved if training could be stopped early instead of always running to a fixed number of epochs. When the training accuracy keeps improving but the test accuracy plateaus or gets worse, overfitting has occurred and training can be stopped.

@evamaxfield (Member, Author):

Totally agree on the overfitting, but the time and money cost is really low... on my 1080 Ti, training takes all of 15 minutes. That said, I will look into Trainer / TrainingArguments options for stopping training after some loss threshold.

Reviewer:

Thanks for looking into it. I'm worried others might not have a GPU available and would have to use AWS or something else.

@evamaxfield (Member, Author):

Looks like I can add the following to Trainer:

callbacks=[
    EarlyStoppingCallback(
        early_stopping_patience=2,  # num evals to try with threshold until exit
        early_stopping_threshold=0.02,  # acc must improve by this much or patience decreases
    )
],
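If I'm reading the transformers docs right, EarlyStoppingCallback also needs the TrainingArguments set up to evaluate regularly, track a metric, and reload the best checkpoint, roughly like this (worth double-checking; the output_dir name is just a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="trained-speakerbox",
    evaluation_strategy="epoch",       # run eval so the callback has a metric to watch
    save_strategy="epoch",             # should line up with the evaluation strategy
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="accuracy",  # the metric the callback compares between evals
)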

@evamaxfield (Member, Author):

Going to add that to Trainer, though imo only after I test it on a random transcript / get more data.


# Convert into train, test, and validate dict
if isinstance(dataset, Dataset):
    train_and_test = dataset.train_test_split(test_size=0.4)
Reviewer:

What is the size of the train set, and is it big enough to resist overfitting? Usually, the more complicated a model is, the more prone it is to overfitting. I don't know what the recommended size would be given the complexity of the final linear layer of the model, though.

@evamaxfield (Member, Author):

The splits come out to train 60% / test 20% / validation 20%. The summary stats print out:

Summary stats for 'train' dataset
n-rows: 1680
n-labels: 8
Avg duration: 1.868732142857143
Min duration: 0.5
Max duration: 2.000000000000466
StD duration: 0.3446844716974556
--------------------------------------------------------------------------------
Summary stats for 'test' dataset
n-rows: 560
n-labels: 8
Avg duration: 1.8752499999999999
Min duration: 0.5
Max duration: 2.0
StD duration: 0.34210901452246373
--------------------------------------------------------------------------------
Summary stats for 'valid' dataset
n-rows: 560
n-labels: 8
Avg duration: 1.868196428571427
Min duration: 0.5
Max duration: 2.0000000000001164
StD duration: 0.3584679145799337
--------------------------------------------------------------------------------

Reviewer:

The general rule of thumb is that the training set size should be greater than 10x the number of effective model parameters (the number of weights in the linear layer of the model being fine-tuned). We'd need to ask a machine learning practitioner if we want to be sure.

Reviewer:

Not entirely sure about this, but the number of weights would be around the dimensionality of the feature_extractor output * the number of speaker labels.

If adding more data is too expensive, we could add regularization to nudge the training toward simpler models and resist overfitting, though I couldn't find that option in the doc.
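A rough back-of-the-envelope version of that check, using the classifier head shape reported later in this thread ([8, 256] weights plus an 8-element bias) and the train split size from the summary stats above:

hidden_dim = 256
num_labels = 8
head_params = hidden_dim * num_labels + num_labels  # 2056 trainable head parameters
rule_of_thumb_train_size = 10 * head_params         # 20560 samples suggested
actual_train_size = 1680                            # n-rows of the 'train' split

# The train split is well under the rule-of-thumb size, which supports the
# concern about overfitting the fine-tuned head.
print(head_params, rule_of_thumb_train_size, actual_train_size)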

@evamaxfield (Member, Author):

[image: validation-confusion]

@tohuynh here is the latest confusion matrix from the validation set.

@tohuynh commented Feb 5, 2022:

> @tohuynh here is the latest confusion matrix from the validation set.

That looks amazingly good

@evamaxfield (Member, Author):

> @tohuynh here is the latest confusion matrix from the validation set.
>
> That looks amazingly good

I am going to try and make a holdout validation set that includes audio from an entirely different full council meeting. That will be the real test.

@dphoria commented Feb 6, 2022:

> The README is updated. If you have an NVIDIA GPU, feel free to give the quickstart a go; if you don't have a GPU (like me right now on my laptop), I wouldn't try training because it takes ~4 hours to complete on CPU.

I did try this on my work Windows computer (because it has an NVIDIA card, an old GTX 1050 Ti) and got an exception. I am not reporting a bug or anything like that; I am 100% certain the error is in my local environment. I was just personally disappointed that my dumb work Windows computer didn't work out of the box for this, haha.

>>> model_dir = train(seattle_2021_ds)
E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\configuration_utils.py:353: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at superb/wav2vec2-base-superb-sid and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1251, 256]) in the checkpoint and torch.Size([8, 256]) in the model instantiated
- classifier.bias: found shape torch.Size([1251]) in the checkpoint and torch.Size([8]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSequenceClassification.forward` and have been ignored: audio, duration.
E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use thePyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1680
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 1050
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\Programs\speakerbox\speakerbox\main.py", line 177, in train
    trainer.train()
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\trainer.py", line 1365, in train
    tr_loss_step = self.training_step(model, inputs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\trainer.py", line 1940, in training_step
    loss = self.compute_loss(model, inputs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\trainer.py", line 1972, in compute_loss
    outputs = model(**inputs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 1902, in forward
    loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\modules\loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Int

@evamaxfield (Member, Author):

> I did try this on my work Windows computer (because it has an NVIDIA card, an old GTX 1050 Ti) and got an exception. I am not reporting a bug or anything like that; I am 100% certain the error is in my local environment. I was just personally disappointed that my dumb work Windows computer didn't work out of the box for this, haha.

Interesting... I don't know how to fix this at all... A part of me wants to say I should update this repo / pip install to say it only supports Ubuntu.
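If the cause is what it looks like (labels coming out as 32-bit ints on Windows, where numpy defaults to int32, while cross_entropy expects int64), a possible workaround might be casting the label column before training. Untested sketch, assuming the column is named "label":

from datasets import Value

# Possible workaround (untested): force the label column to int64 so torch
# receives Long tensors; numpy's default int is int32 on Windows.
dataset = dataset.cast_column("label", Value("int64"))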

@evamaxfield (Member, Author):

@tohuynh I think you are totally correct. I think the biggest issue to work on is annotating a few more meetings to add to the dataset. Specifically, I am going to try to annotate a few full council meetings where all (or almost all) of the council members are present.

The splits should be balanced by speaker time AND each set should contain different meetings from the others.

Train should have, say, 6 meetings, eval should have 2 meetings, and valid should have 2 meetings.

Fortunately, with the current model, I can at least make the annotation process go a bit quicker by pre-labeling a bunch of sentences so that I can come in and simply fix any mistakes. I already have a function for that here: https://github.com/CouncilDataProject/cdp-backend/blob/feature/apply-speaker-classifier/cdp_backend/annotation/speaker_labels.py

@evamaxfield (Member, Author):

Don't want to spam you all. Will reopen when it's ready.

Labels: enhancement (New feature or request)
Projects: none yet
Linked issues: none yet
Participants: 4