feature/dataset-expansion-and-model-training #1
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main       #1      +/-   ##
==========================================
- Coverage   10.74%    9.28%    -1.46%
==========================================
  Files           5       10        +5
  Lines         121      280      +159
==========================================
+ Hits           13       26       +13
- Misses        108      254      +146

Continue to review full report at Codecov.
99% validation acc does seem suspiciously high
I just have a few clarifying questions.
    audio_arrays,
    sampling_rate=feature_extractor.sampling_rate,
    max_length=int(feature_extractor.sampling_rate * max_duration),
    do_normalize=True,
How are the mean and variance calculated and used in this normalization step? Are they calculated per batch and used to normalize that batch? Usually, the mean and variance are calculated from the train set and used to normalize the entire set.
Or are the mean and variance calculated from the entire set and then used to normalize the entire set? If so, there is data leakage from the holdout validation set into the training step: the training is peeking at the mean and variance of the validation set to pick the best model.
Great question, and to be honest, I don't entirely know. The feature extractor ships with the model, so we would have to find their original code. It could just be normalizing a single audio array, though: taking the mean of the audio array and scaling the values to between 0 and 1, or something like that. It wouldn't have to take the whole dataset into account. But I see your point.
Yea, normalization (according to the doc) is transforming the input's values to be between 0 and 1 by subtracting the mean and dividing by the variance of the input. It's important where the mean and variance are calculated from. I don't know what the effect on training would be if they are calculated for each input and used only for that input; I've only seen the mean and variance calculated from the train set, with those values frozen and used to normalize (make the values between 0 and 1) the entire set.
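For comparison, if the extractor normalizes each clip with only that clip's own statistics, the operation would look roughly like this. This is just a sketch, not the actual feature-extractor source, and the zero-mean/unit-variance form is an assumption:

```python
import numpy as np

def normalize_per_utterance(audio: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Normalize a single clip using only its own mean and variance.

    No dataset-level statistics are involved, so there is no train/valid
    leakage -- but each clip is scaled differently.
    """
    return (audio - audio.mean()) / np.sqrt(audio.var() + eps)

clip = np.array([0.1, -0.3, 0.5, 0.2])
normalized = normalize_per_utterance(clip)
# The result has (approximately) zero mean and unit variance.
```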
Thinking about this now, I think the current code is fine. I think maybe the mean and variance are calculated from the original (non-cdp) dataset.
I can try training a model with it turned off and see what effect it has on the accuracy.
Could you try turning off the extractor's normalization and performing the normalization ourselves? That is, calculate the mean and variance from the train set and use them to normalize the entire set?
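The "normalize ourselves" approach could look something like this sketch (pure NumPy, hypothetical helper names): fit the statistics on the train clips only, then apply those frozen values to every split.

```python
import numpy as np

def fit_normalizer(train_arrays):
    """Compute mean/std over the *training* clips only, so no validation data leaks in."""
    all_samples = np.concatenate(train_arrays)
    return float(all_samples.mean()), float(all_samples.std())

def apply_normalizer(audio, mean, std, eps=1e-7):
    """Normalize any clip with the frozen train-set statistics."""
    return (audio - mean) / (std + eps)

train = [np.array([0.1, -0.2, 0.3]), np.array([0.0, 0.4])]
valid = [np.array([0.2, -0.1])]

mean, std = fit_normalizer(train)                              # stats from train set only
train_norm = [apply_normalizer(a, mean, std) for a in train]
valid_norm = [apply_normalizer(a, mean, std) for a in valid]   # same frozen stats
```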
At first glance the accuracy does seem high, but IMHO we are classifying only a few different speakers from several hours of data, and speaking voices are very distinct, so I don't think this number looks too suspicious. It looks like you have done your best to balance out speakers in each set, and even if accuracy is not the best metric here, using any of the F-score family of metrics would require us to assign negatives and positives.
Worth noting that all of our misclassifications are female. Seattle's council is 2/3 female, so all things being equal, a misclassification is more likely to be of a female member by default. Of the 6 misses in your confusion matrix, a few female members had 2 misclassifications each. I am simply bringing it up so that if we do, say, King County next, where women make up 1/3 of the council, and we again only misclassify female members, we may want to look deeper.
Wondering if the misclassified members had less time in meetings? Gonzalez is new, so I am guessing that could be true for at least one?
I balanced the whole dataset to n samples per person, where n was the sample count of the person with the fewest samples. So they all had a random ~200 samples or so, IIRC.
Gonzalez actually has the most data in the un-balanced dataset; I think you confused her with someone else. Gonzalez used to be the council president. But yes, it will be interesting to test on King County.
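The balancing described here (randomly downsample every speaker to the least-represented speaker's count) could be sketched like this; the dict-based sample layout is hypothetical, not the actual utils.py code:

```python
import random
from collections import defaultdict

def balance_by_speaker(samples, seed=0):
    """Randomly downsample every speaker to the minimum per-speaker count."""
    by_speaker = defaultdict(list)
    for sample in samples:
        by_speaker[sample["label"]].append(sample)

    # n = sample count of the least-represented speaker
    n = min(len(group) for group in by_speaker.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_speaker.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# Hypothetical unbalanced dataset: 300 / 250 / 200 samples per speaker
samples = (
    [{"label": "gonzalez"}] * 300
    + [{"label": "juarez"}] * 250
    + [{"label": "mosqueda"}] * 200
)
balanced = balance_by_speaker(samples)
# Every speaker now contributes exactly 200 samples (600 total).
```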
speakerbox/main.py
Outdated
train_and_test = dataset.train_test_split(test_size=0.4)
test_and_valid = train_and_test["test"].train_test_split(test_size=0.5)
Are the frequencies of the speaker labels about the same across the train, test, and validation sets?
Could we avoid intermixing events across these three sets? That is, audio files from a given event should belong to only one set. I think this would better simulate the real-world use of the model after training.
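A sketch of what an event-grouped split could look like (the event_id field is hypothetical): whole events are assigned to one side of the split, so no event's audio straddles train and holdout.

```python
import random

def split_by_event(samples, test_fraction=0.4, seed=0):
    """Assign whole events to train or holdout so no event straddles the split."""
    events = sorted({s["event_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(events)

    n_test = max(1, round(len(events) * test_fraction))
    test_events = set(events[:n_test])

    train = [s for s in samples if s["event_id"] not in test_events]
    test = [s for s in samples if s["event_id"] in test_events]
    return train, test

# Hypothetical dataset: 5 events x 2 speakers
samples = [
    {"event_id": e, "label": l}
    for e in ("meeting-a", "meeting-b", "meeting-c", "meeting-d", "meeting-e")
    for l in ("gonzalez", "juarez")
]
train, test = split_by_event(samples)
# No event_id appears in both splits.
```

Note this only controls event leakage; per-speaker balance within each split would still need the separate balancing step.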
Yes, as part of the setup / data prep process I am making sure that there are an even number of samples for each speaker label and then drawing random samples for the train, test, and validation sets. So while there may be some randomness in how many of each speaker end up in each set, the overall dataset is reduced to an even number per speaker, and we would have to be incredibly unlucky to get really unbalanced subsets.
I really like the idea of having a couple of events that aren't in the training set in the holdout validation / balancing the dataset by both speaker and event. Unfortunately our dataset is only made up of five events in total 😬
A part of me wants to say I should run this model across a couple of transcripts to pre-seed annotations and then manually run through and fix the annotations where it got them wrong, as a quick way of adding more training data.
Here is where I am doing the speaker balancing: https://github.com/CouncilDataProject/speakerbox/blob/feature/dataset-expansion/speakerbox/ds/utils.py#L195
speakerbox/main.py
Outdated
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
per_device_eval_batch_size=batch_size,
num_train_epochs=5,
I think time and money could be saved if the training could be stopped early instead of running all the way to a fixed epoch. When the training accuracy keeps improving but the test accuracy plateaus or drops, overfitting has occurred and training can be stopped.
Totally agree on the overfitting, but the time and money cost is really low... on my 1080 Ti, training takes all of 15 minutes. That said, I will look into Trainer / TrainingArguments options for stopping training after some loss threshold.
Thanks for looking into it. I'm worried others might not have a GPU available and would have to use AWS or something else.
Looks like I can add the following to Trainer:

callbacks=[
    EarlyStoppingCallback(
        early_stopping_patience=2,  # num evals to try with threshold until exit
        early_stopping_threshold=0.02,  # acc must improve by this much or patience decrements
    )
],
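Worth noting that, per the transformers docs, EarlyStoppingCallback only fires if the Trainer is also configured to evaluate periodically and track the best model. Something like the sketch below; the argument names are from the docs at the time and worth double-checking against the installed version, and the non-essential arguments are omitted:

```python
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="trained-speakerbox",   # hypothetical output path
    evaluation_strategy="epoch",       # produce eval metrics the callback can watch
    save_strategy="epoch",             # must match evaluation_strategy
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="accuracy",  # the metric the patience counter measures
    num_train_epochs=5,
)
```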
Going to add that to Trainer, but only after I test the current model on a random transcript / get more data, IMO.
speakerbox/main.py
Outdated
# Convert into train, test, and validate dict
if isinstance(dataset, Dataset):
    train_and_test = dataset.train_test_split(test_size=0.4)
What is the size of the train set, and is it enough to resist overfitting? Usually, the more complicated a model is, the more prone it is to overfitting. I don't know what the recommended size would be given the complexity of the model's last linear layer, though.
The splits come out to train 60% / test 20% / validation 20%. Summary stats printout:
Summary stats for 'train' dataset
n-rows: 1680
n-labels: 8
Avg duration: 1.868732142857143
Min duration: 0.5
Max duration: 2.000000000000466
StD duration: 0.3446844716974556
--------------------------------------------------------------------------------
Summary stats for 'test' dataset
n-rows: 560
n-labels: 8
Avg duration: 1.8752499999999999
Min duration: 0.5
Max duration: 2.0
StD duration: 0.34210901452246373
--------------------------------------------------------------------------------
Summary stats for 'valid' dataset
n-rows: 560
n-labels: 8
Avg duration: 1.868196428571427
Min duration: 0.5
Max duration: 2.0000000000001164
StD duration: 0.3584679145799337
--------------------------------------------------------------------------------
The general rule of thumb is that the training set size should be greater than 10 * the number of effective model parameters (the number of weights in the linear layer of the model that is being fine-tuned). We'd need to ask a machine learning practitioner to be sure.
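As a quick back-of-the-envelope check of that rule of thumb, assuming the fine-tuned head is a single linear layer and the encoder's hidden size is 768 (both are assumptions, not confirmed from the model config):

```python
hidden_size = 768   # assumed output dimension of the pretrained encoder
num_labels = 8      # speakers in this dataset

# A single linear classification head: weight matrix plus bias vector.
head_params = hidden_size * num_labels + num_labels

# The "10x" rule of thumb for training-set size.
recommended_rows = 10 * head_params

print(head_params)       # 6152
print(recommended_rows)  # 61520 -- far more than the 1680 training rows above
```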
Not entirely sure about this, but the number of weights would be around the dimension of the feature_extractor output * the number of speaker labels.
If adding more data is too expensive, we can add regularization to nudge training toward simpler models that resist overfitting. Though I couldn't find that option in the doc.
@tohuynh here is the latest confusion matrix from validation set
That looks amazingly good
I am going to try and make a holdout validation set that includes audio from an entirely different full council meeting. That will be the real test.
I did try this on my work Windows computer (because it has an NVIDIA card, old GTX 1050 Ti) and got an exception. I am not reporting a bug or anything like that. I am 100% certain the error is on the local environment. Just was disappointed personally my dumb work Windows computer didn't work out of the box for this haha.
Interesting... I don't know how to fix this at all... A part of me wants to say I should update this repo / pip install to only allow Ubuntu installation.
@tohuynh I think you are totally correct. I think the biggest issue to work on is annotating a few more meetings to add to the dataset. Specifically, I am going to try to annotate a few full council meetings where all (or almost all) the council members are present. The splits should be balanced by speaker time AND each set should contain different meetings from the others: train should have, say, 6 meetings, eval should have 2 meetings, and valid should have 2 meetings. Fortunately, with the current model, I can at least make the annotation process go a bit quicker by pre-labeling a bunch of sentences and then simply fixing any of the mistakes. Already have a function for that here: https://github.com/CouncilDataProject/cdp-backend/blob/feature/apply-speaker-classifier/cdp_backend/annotation/speaker_labels.py
Don't want to spam you all. Will reopen when it's ready.
Description of Changes
This adds the whole process of expanding a dataset from gecko into an annotated speaker ID training set, and then training a new speaker identification model by fine-tuning an existing large model from transformers / huggingface.
I'm on my laptop now and didn't save the confusion matrix, but the last time I trained on my desktop, the model finished training with ~99% accuracy. I am a bit worried about overfitting, but honestly I don't know; I am even evaluating on a holdout validation set 🤷
The README is updated. If you have an NVIDIA GPU, feel free to give the quickstart a go; if you don't have a GPU (like me right now on my laptop), I wouldn't try training because it takes ~4 hours to complete on CPU.