
feature/dataset-expansion-and-model-training #1

Closed
wants to merge 22 commits into main

Conversation

@evamaxfield (Member):

Description of Changes

This adds the whole process of expanding a dataset from Gecko:

gecko_annotation_file_1, audio_file_1,
gecko_annotation_file_2, audio_file_2,
...

into an annotated speaker ID training set:

speaker_label_1, audio_clip_1,
speaker_label_1, audio_clip_2,
speaker_label_2, audio_clip_3,
speaker_label_3, audio_clip_4,
...

and then training a new speaker identification model by fine-tuning an existing large model from transformers / Hugging Face.
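For anyone skimming, here is a rough sketch of what the fine-tuning step boils down to. The actual speakerbox functions differ; this just reuses the checkpoint and the training settings discussed later in this thread, and assumes a datasets.DatasetDict with "train" / "test" splits whose rows carry an "audio" array and an integer "label" column.

# Sketch only -- not the exact speakerbox code.
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForSequenceClassification,
)

checkpoint = "superb/wav2vec2-base-superb-sid"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)

# Re-initialize the classifier head for our 8 local speaker labels
# (the checkpoint's head was trained for 1251 speakers).
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=8,
    ignore_mismatched_sizes=True,
)

max_duration = 2.0  # seconds; clips in the dataset are 0.5-2.0s long

def preprocess(batch):
    # Convert raw audio arrays into padded / truncated model inputs.
    return feature_extractor(
        [audio["array"] for audio in batch["audio"]],
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        padding="max_length",
        truncation=True,
        do_normalize=True,
    )

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="trained-speakerbox",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()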

I'm on my laptop now and didn't save the confusion matrix, but the last time I trained on my desktop the model finished training with ~99% accuracy. I am a bit worried about overfitting, but honestly idk; I am even evaluating on a holdout validation set 🤷

The README is updated. If you have an NVIDIA GPU, feel free to give the quickstart a go; if you don't have a GPU (like me right now on my laptop), I wouldn't try training because it takes ~4 hours to complete on CPU.

@evamaxfield added the enhancement (New feature or request) label on Feb 5, 2022
@evamaxfield self-assigned this on Feb 5, 2022
@codecov bot commented Feb 5, 2022:

Codecov Report

Merging #1 (2d2658f) into main (90011e4) will decrease coverage by 1.45%.
The diff coverage is 4.90%.

❗ Current head 2d2658f differs from pull request most recent head 568a20d. Consider uploading reports for the commit 568a20d to get more accurate results

@@            Coverage Diff            @@
##             main      #1      +/-   ##
=========================================
- Coverage   10.74%   9.28%   -1.46%     
=========================================
  Files           5      10       +5     
  Lines         121     280     +159     
=========================================
+ Hits           13      26      +13     
- Misses        108     254     +146     
Impacted Files Coverage Δ
speakerbox/datasets/__init__.py 0.00% <0.00%> (ø)
speakerbox/datasets/seattle_2021_proto.py 0.00% <0.00%> (ø)
speakerbox/preprocess.py 0.00% <0.00%> (ø)
speakerbox/types.py 0.00% <0.00%> (ø)
speakerbox/utils.py 0.00% <0.00%> (ø)
speakerbox/main.py 29.26% <29.26%> (ø)
speakerbox/__init__.py 85.71% <100.00%> (+2.38%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90011e4...568a20d.

@tohuynh left a comment:

99% validation acc does seem suspiciously high.

I just have a few clarifying questions.

audio_arrays,
sampling_rate=feature_extractor.sampling_rate,
max_length=int(feature_extractor.sampling_rate * max_duration),
do_normalize=True,
Reviewer:

How are the mean and variance calculated and used in this normalization step? Are they calculated per batch and used to normalize that batch? Usually, the mean and variance are calculated from the train set and used to normalize the entire set.

Or are the mean and variance calculated from the entire set and then used to normalize the entire set? If so, there is data leakage from the holdout validation set into the training step -- the training is peeking at the mean and variance of the validation set to pick the best model.

@evamaxfield (Member, Author):

Great question and, to be honest, I don't entirely know. The feature extractor comes shipped with the model, so we would have to find their original code. It could just be normalizing a single audio array though -- taking the mean of that array and scaling it, without taking the whole dataset into account. But I see your point.
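My understanding (worth verifying against the transformers source) is that the feature extractor with do_normalize=True standardizes each audio array on its own, roughly like this sketch:

import numpy as np

def normalize_per_example(audio_array: np.ndarray) -> np.ndarray:
    # Zero-mean / unit-variance normalization computed from this single
    # example only -- no statistics are shared across the dataset, so the
    # train/validation split should not leak information through this step.
    return (audio_array - audio_array.mean()) / np.sqrt(audio_array.var() + 1e-7)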

Reviewer:

Yeah, normalization (according to the doc) is transforming the input to zero mean and unit variance by subtracting the mean and dividing by the standard deviation. It's important where the mean and variance are calculated from. I don't know what the effect on training would be if the mean and variance are calculated for each input and used only for that input. I've only seen the mean and variance calculated from the train set, with those values frozen and used to normalize the entire set.

Reviewer:

Thinking about this now, I think the current code is fine. I think maybe the mean and variance are calculated from the original (non-cdp) dataset.

@evamaxfield (Member, Author):

I can try training a model with it turned off and see what effect it has on the accuracy.

Reviewer:

Could you try turning off normalization and performing the normalization ourselves? That is, calculate the mean and variance from the train set and use them to normalize the entire set.

Reviewer:

At first glance the accuracy does seem high, but IMHO we are classifying only a few different speakers from several hours of data, so I don't think this number looks too suspicious considering speaking voices are very distinct. It looks like you have done your best to balance out the speakers in each set, and even if accuracy is not the best metric here, if we wanted to use any of the F-score family of metrics we would have to assign negatives and positives.

Reviewer:

Worth noting that all of our misclassifications are female. Seattle does have a roughly 2/3 female council, so all things being equal, a misclassification is more likely to be female by default. Of the 6 misses on your confusion matrix, a few female members had 2 misclassifications each. I am simply bringing it up so that if we do, say, King County next, where female members are about 1/3, and we again only misclassify female members, we may want to look deeper.

Reviewer:

Wondering if the misclassified members had less time in meetings? Gonzalez is new, so I am guessing that could be true for at least one?

@evamaxfield (Member, Author):

I balanced the whole dataset to have n samples per person, where n was the sample count of the person with the fewest samples. So they all had a random ~200 samples or so, iirc.

Gonzalez actually has the most data in the un-balanced dataset; I think you may have confused her with someone else. Gonzalez used to be the council president. But yes, it will be interesting to test on King County.

Comment on lines 86 to 87
train_and_test = dataset.train_test_split(test_size=0.4)
test_and_valid = train_and_test["test"].train_test_split(test_size=0.5)
Reviewer:

Are the frequencies of the speaker labels about the same across the train, test, and validation sets?

Could we avoid intermixing events across these three sets? That is, audio clips from a given event should belong to only one set. I think this would better simulate the real-world use of the model after training.
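A rough sketch of what I mean, assuming each row kept a hypothetical "event_id" column (I don't think the dataset stores one today):

held_out_test_events = {"event_03"}   # hypothetical event ids
held_out_valid_events = {"event_04"}
held_out = held_out_test_events | held_out_valid_events

# Split by event rather than by row, so no event's audio appears in more
# than one of train / test / valid.
train_ds = dataset.filter(lambda row: row["event_id"] not in held_out)
test_ds = dataset.filter(lambda row: row["event_id"] in held_out_test_events)
valid_ds = dataset.filter(lambda row: row["event_id"] in held_out_valid_events)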

@evamaxfield (Member, Author):

Yes, as part of the setup / data prep process I am making sure that there is an even number of samples for each speaker label and then drawing random samples for the train, test, and validation sets. So while there may be some randomness in how many clips of each speaker land in each set, the overall dataset is reduced so that every speaker has the same number of samples, and we would have to be incredibly unlucky to get really unbalanced subsets.
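Roughly, the balancing step amounts to something like this (sketched here with pandas and a hypothetical "label" column; the actual code differs):

import pandas as pd

def balance_by_label(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    # Downsample every speaker label to the size of the smallest one so each
    # speaker contributes the same number of clips before splitting.
    min_count = df["label"].value_counts().min()
    return (
        df.groupby("label", group_keys=False)
        .apply(lambda group: group.sample(n=min_count, random_state=seed))
        .reset_index(drop=True)
    )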

I really like the idea of having a couple of events that aren't in the training set in the holdout validation set / balancing the dataset by both speaker and event. Unfortunately our dataset is only made up of five events in total 😬

A part of me wants to say I should run this model across a couple of transcripts to pre-seed annotations and then manually run through and fix the annotations where it got them wrong, as a quick way to add more training data.


per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
per_device_eval_batch_size=batch_size,
num_train_epochs=5,
Reviewer:

I think time and money could be saved if training could be stopped early instead of always running to a fixed number of epochs. When the training accuracy keeps improving but the test accuracy plateaus or gets worse, overfitting has occurred and training can be stopped.

@evamaxfield (Member, Author):

Totally agree on the overfitting, but the time and money cost is really low... on my 1080 Ti, training takes all of 15 minutes. That said, I will look into Trainer / TrainingArguments options for stopping training after some loss threshold.

Reviewer:

Thanks for looking into it. I'm worried others might not have a GPU available and would have to use AWS or something else.

@evamaxfield (Member, Author):

Looks like I can add the following to Trainer:

callbacks=[
    EarlyStoppingCallback(
        early_stopping_patience=2,  # num evals to try with threshold until exit
        early_stopping_threshold=0.02,  # acc must improve by this much or patience decreases
    )
],
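If I'm reading the transformers docs right, EarlyStoppingCallback also needs the TrainingArguments set up to evaluate regularly, track a metric, and reload the best checkpoint, roughly like this (worth double-checking; the output_dir name is just a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="trained-speakerbox",
    evaluation_strategy="epoch",       # run eval so the callback has a metric to watch
    save_strategy="epoch",             # should line up with the evaluation strategy
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="accuracy",  # the metric the callback compares between evals
)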

@evamaxfield (Member, Author):

Going to add that to Trainer, though imo only after I test it on a random transcript / get more data.


# Convert into train, test, and validate dict
if isinstance(dataset, Dataset):
    train_and_test = dataset.train_test_split(test_size=0.4)
Reviewer:

What is the size of the train set, and is it big enough to resist overfitting? Usually, the more complicated a model is, the more prone it is to overfitting. I don't know what the recommended size would be given the complexity of the final linear layer of the model, though.

@evamaxfield (Member, Author):

The splits come out to train 60% / test 20% / validation 20%. The summary stats print out:

Summary stats for 'train' dataset
n-rows: 1680
n-labels: 8
Avg duration: 1.868732142857143
Min duration: 0.5
Max duration: 2.000000000000466
StD duration: 0.3446844716974556
--------------------------------------------------------------------------------
Summary stats for 'test' dataset
n-rows: 560
n-labels: 8
Avg duration: 1.8752499999999999
Min duration: 0.5
Max duration: 2.0
StD duration: 0.34210901452246373
--------------------------------------------------------------------------------
Summary stats for 'valid' dataset
n-rows: 560
n-labels: 8
Avg duration: 1.868196428571427
Min duration: 0.5
Max duration: 2.0000000000001164
StD duration: 0.3584679145799337
--------------------------------------------------------------------------------

Reviewer:

The general rule of thumb is that the training set size should be greater than 10x the number of effective model parameters (the number of weights in the linear layer of the model being fine-tuned). We'd need to ask a machine learning practitioner if we want to be sure.

Reviewer:

Not entirely sure about this, but the number of weights would be around the dimensionality of the feature_extractor output * the number of speaker labels.

If adding more data is too expensive, we could add regularization to nudge the training toward simpler models and resist overfitting, though I couldn't find that option in the doc.
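A rough back-of-the-envelope version of that check, using the classifier head shape reported later in this thread ([8, 256] weights plus an 8-element bias) and the train split size from the summary stats above:

hidden_dim = 256
num_labels = 8
head_params = hidden_dim * num_labels + num_labels  # 2056 trainable head parameters
rule_of_thumb_train_size = 10 * head_params         # 20560 samples suggested
actual_train_size = 1680                            # n-rows of the 'train' split

# The train split is well under the rule-of-thumb size, which supports the
# concern about overfitting the fine-tuned head.
print(head_params, rule_of_thumb_train_size, actual_train_size)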

@evamaxfield (Member, Author):

[image: validation-confusion]

@tohuynh here is the latest confusion matrix from the validation set.

@tohuynh commented Feb 5, 2022:

> @tohuynh here is the latest confusion matrix from the validation set.

That looks amazingly good

@evamaxfield (Member, Author):

> @tohuynh here is the latest confusion matrix from the validation set.
>
> That looks amazingly good

I am going to try and make a holdout validation set that includes audio from an entirely different full council meeting. That will be the real test.

@dphoria commented Feb 6, 2022:

> The README is updated. If you have an NVIDIA GPU, feel free to give the quickstart a go; if you don't have a GPU (like me right now on my laptop), I wouldn't try training because it takes ~4 hours to complete on CPU.

I did try this on my work Windows computer (because it has an NVIDIA card, an old GTX 1050 Ti) and got an exception. I am not reporting a bug or anything like that; I am 100% certain the error is in my local environment. I was just personally disappointed that my dumb work Windows computer didn't work out of the box for this, haha.

>>> model_dir = train(seattle_2021_ds)
E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\configuration_utils.py:353: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at superb/wav2vec2-base-superb-sid and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1251, 256]) in the checkpoint and torch.Size([8, 256]) in the model instantiated
- classifier.bias: found shape torch.Size([1251]) in the checkpoint and torch.Size([8]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSequenceClassification.forward` and have been ignored: audio, duration.
E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use thePyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1680
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 1050
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\Programs\speakerbox\speakerbox\main.py", line 177, in train
    trainer.train()
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\trainer.py", line 1365, in train
    tr_loss_step = self.training_step(model, inputs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\trainer.py", line 1940, in training_step
    loss = self.compute_loss(model, inputs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\trainer.py", line 1972, in compute_loss
    outputs = model(**inputs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\transformers\models\wav2vec2\modeling_wav2vec2.py", line 1902, in forward
    loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\modules\loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "E:\Programs\Anaconda3\envs\speakerbox\lib\site-packages\torch\nn\functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Int

@evamaxfield (Member, Author):

> I did try this on my work Windows computer (because it has an NVIDIA card, an old GTX 1050 Ti) and got an exception. I am not reporting a bug or anything like that; I am 100% certain the error is in my local environment. I was just personally disappointed that my dumb work Windows computer didn't work out of the box for this, haha.

Interesting... I don't know how to fix this at all... A part of me wants to say I should update this repo / pip install to say it only supports Ubuntu.
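If the cause is what it looks like (labels coming out as 32-bit ints on Windows, where numpy defaults to int32, while cross_entropy expects int64), a possible workaround might be casting the label column before training. Untested sketch, assuming the column is named "label":

from datasets import Value

# Possible workaround (untested): force the label column to int64 so torch
# receives Long tensors; numpy's default int is int32 on Windows.
dataset = dataset.cast_column("label", Value("int64"))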

@evamaxfield (Member, Author):

@tohuynh I think you are totally correct. I think the biggest issue to work on is annotating a few more meetings to add to the dataset. Specifically, I am going to try to annotate a few full council meetings where all (or almost all) of the council members are present.

The splits should be balanced by speaker time AND each set should contain different meetings from the others.

Train should have, say, 6 meetings, eval should have 2 meetings, and valid should have 2 meetings.

Fortunately, with the current model, I can at least make the annotation process go a bit quicker by pre-labeling a bunch of sentences so that I can come in and simply fix any mistakes. I already have a function for that here: https://github.com/CouncilDataProject/cdp-backend/blob/feature/apply-speaker-classifier/cdp_backend/annotation/speaker_labels.py

@evamaxfield (Member, Author):

Don't want to spam you all. Will reopen when it's ready.

Labels: enhancement (New feature or request)
Projects: none yet
Linked issues: none yet
Participants: 4