
WIP: single embeddings matrix design #140

Merged (35 commits), Jul 6, 2024

Conversation

@Adamits
Collaborator

Adamits commented Oct 2, 2023

Throwing this WIP up to store all vocabs in a single embeddings matrix shared between source, target, and features. This will fix the current pointer-generator issues when we have disjoint vocabularies. I will get back to implementing it later this week.

  • Computes a single vocab_map
  • Retains target_vocabulary and source_vocabulary for logging.
  • Moves embedding initialization onto the model
  • Passes a single embedding layer to each module to share (see the sketch after the TODO list below).

Closes #156.

TODO

  • I did not do anything with features. I still need to merge these into the single vocab, being careful not to break anything that requires separate features.
  • This does not handle the case that someone may want to have unmerged embeddings, such that the same unicode codepoint on the source and target side should be considered different. We will need to update the index to mark those characters with special symbols to implement this.
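
For concreteness, here is a minimal sketch of the shared-embedding design described above (the module names are hypothetical stand-ins, not the actual yoyodyne classes):

```python
import torch
from torch import nn


class Encoder(nn.Module):
    def __init__(self, embeddings: nn.Embedding):
        super().__init__()
        self.embeddings = embeddings  # shared layer, not a copy

    def forward(self, symbols: torch.Tensor) -> torch.Tensor:
        return self.embeddings(symbols)


class Decoder(nn.Module):
    def __init__(self, embeddings: nn.Embedding):
        super().__init__()
        self.embeddings = embeddings  # same object as the encoder's

    def forward(self, symbols: torch.Tensor) -> torch.Tensor:
        return self.embeddings(symbols)


class Seq2SeqModel(nn.Module):
    """Sketch: one embedding matrix covering source, target, and features."""

    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        # Embedding initialization lives on the model...
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        # ...and the single layer is passed to each module to share.
        self.encoder = Encoder(self.embeddings)
        self.decoder = Decoder(self.embeddings)
```

The point of sharing the layer object is that a source symbol and the identical target symbol resolve to the same row of the matrix, which is what the pointer-generator needs when the observed source and target vocabularies are disjoint.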

Contributor

@kylebgorman left a comment

Looks good so far. Some thoughts:

  • I will file a bug describing why we're doing this for you to link to...give me a little time though. This will also describe our discussion of embedding too, though that's a secondary issue.
  • I propose we shouldn't store the separate vocabularies in any class object. They should be computed separately, logged in the pretty-print fashion, then thrown out. This requires moving those logging statements, but so be it.
  • vocab_map would be better just called index and we should throw out the SymbolMap class.

Will work on this in a day or two.

@kylebgorman
Contributor

I created issue #142 to document the plan here.

@Adamits
Collaborator Author

Adamits commented Apr 9, 2024

I just rebased a ton of commits and want to make sure I didn't break anything.

EDIT: Hmmm, I did not realize that all rebased commits would appear as changes in this PR. I have not resolved complicated git merges in a long time and thought rebasing was the right move, but I don't like how this looks...

Should I scrap this and merge master into my branch instead?

Edit edit: I have done so; I think merging is better here.

@kylebgorman
Contributor

I'm not dogmatic about merge vs. rebase, but the way the PR looks when I review it is plenty clear as is.

@Adamits
Collaborator Author

Adamits commented Apr 9, 2024

Yeah sounds good.

> vocab_map would be better just called index and we should throw out the SymbolMap class.

Does this mean we would call it like index.index (consider that in dataset.py we currently have e.g. self.index.vocab_map)?

@kylebgorman
Contributor

> Does this mean we would call it like index.index (consider that in dataset.py we currently have e.g. self.index.vocab_map)?

You could make it so self.index could be called that way, i.e., by percolating that ability upward... if I'm understanding correctly.
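
For illustration, one way of percolating the lookup upward so callers never touch a nested map (a rough sketch; these names are not the actual yoyodyne API):

```python
class Index:
    """Sketch: direct symbol lookup, replacing index.vocab_map / SymbolMap."""

    def __init__(self, vocabulary: list[str]):
        self._symbol2idx = {symbol: i for i, symbol in enumerate(vocabulary)}
        self._idx2symbol = list(vocabulary)

    def __call__(self, symbol: str) -> int:
        # dataset.py can then call self.index(symbol) directly
        # instead of going through self.index.vocab_map.
        return self._symbol2idx[symbol]

    def symbol(self, idx: int) -> str:
        return self._idx2symbol[idx]

    def __len__(self) -> int:
        return len(self._idx2symbol)
```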

@Adamits
Collaborator Author

Adamits commented Apr 9, 2024

Ok, I was working on replacing the indexing with one method as you suggested, but there is an issue here:

Our output space should be target_vocab_size. We could set this to the full vocab_size (source + target + features), but this seems theoretically odd: in many cases most of the output space is invalid, yet we would still spill probability mass into it.

If we set it to a subset of the vocab (target_vocab_size), then we need to somehow map these output indices back to our vocabulary. Right now I was thinking about how this impacts turning ints back into symbols, but I think this also impacts decoder modules, which need to look up the target in the shared embedding matrix.

@Adamits
Collaborator Author

Adamits commented Apr 9, 2024

I think I can make this work with some tricky offsets to map target indices drawn from {0, ..., target_size} to the embedding matrix indices of {0, ..., source+target+features size}. However, this will require providing this info to the model so we can offset predictions when decoding. This seems a bit awkward to me.
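
To make the offset bookkeeping concrete, a minimal sketch (the layout of the merged vocabulary here is an assumption, not something this PR fixes):

```python
import torch

# Assume the merged vocabulary is laid out [specials | source | features | target],
# so decoder output index t corresponds to shared index t + offset, where
# offset = num_specials + source_size + features_size.


def target_to_shared(target_indices: torch.Tensor, offset: int) -> torch.Tensor:
    """Decoder output index -> index into the shared embedding matrix."""
    return target_indices + offset


def shared_to_target(shared_indices: torch.Tensor, offset: int) -> torch.Tensor:
    """Shared-vocabulary index -> decoder output index."""
    return shared_indices - offset
```

The model would need to know the offset so it can shift its argmax before the next embedding lookup, which is the awkward part mentioned above.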

@kylebgorman
Contributor

kylebgorman commented Apr 9, 2024 via email

@Adamits
Collaborator Author

Adamits commented Apr 9, 2024

Still probably some clunkiness here, and I need to fix formatting. But I wanted to put this up in case you want to take a look:

I have tested the attentive LSTM and pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to make --tie_embeddings the default (i.e., defaulting to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?

@kylebgorman
Contributor

> I have tested the attentive LSTM and pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to make --tie_embeddings the default (i.e., defaulting to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?

Look at what we did for --log_wandb and --no_log_wandb here.

@kylebgorman
Contributor

> Look at what we did for --log_wandb and --no_log_wandb here.

Nevermind, that's not relevant. I was looking for a flag that defaults to True. Here's a better example.
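
For reference, a generic argparse pattern for a flag that defaults to True with a paired --no_ negation (a sketch of the idea, not necessarily how yoyodyne wires up its flags):

```python
import argparse

parser = argparse.ArgumentParser()
# Both flags write to the same destination; the default is True.
parser.add_argument(
    "--tie_embeddings",
    action="store_true",
    default=True,
    help="Share one embedding matrix across source, target, and features "
    "(default: %(default)s).",
)
parser.add_argument(
    "--no_tie_embeddings",
    action="store_false",
    dest="tie_embeddings",
    help="Use separate embedding matrices.",
)
args = parser.parse_args()
# args.tie_embeddings is True unless --no_tie_embeddings was passed.
```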

@Adamits
Collaborator Author

Adamits commented Apr 11, 2024

I have cleaned this up and tested it. Everything works except the Transducer right now, which seems to be because of the target_vocab_size manipulation that happens in the model here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/transducer.py#L43.

I think it's because target_vocab_size gets updated AFTER we initialize the single embedding matrix. This can cause index out of bounds errors. I can't work on this until tonight or tomorrow. @bonham79, in the meantime, if you have any intuition about a good fix for this, please take a look!
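
To illustrate the failure mode (hypothetical sizes, not the actual yoyodyne code): if the matrix is built before the transducer enlarges the target vocabulary, any of the newly added indices overflows it.

```python
import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)  # built first
# ... the transducer later grows the vocabulary to, say, 120 action symbols ...
new_action_index = torch.tensor([110])
embedding(new_action_index)  # raises IndexError: index out of range in self
```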

@Adamits
Collaborator Author

Adamits commented Apr 11, 2024

And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(

@kylebgorman
Contributor

> And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(

pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)

@bonham79
Collaborator

bonham79 commented Jun 3, 2024

@Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it's edit actions from the expert. (It's like, every potential target symbol has a copy or insert action, so it's 2x the observed vocabulary.)

@Adamits
Collaborator Author

Adamits commented Jun 4, 2024

> @Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it's edit actions from the expert. (It's like, every potential target symbol has a copy or insert action, so it's 2x the observed vocabulary.)

Right, is there a reason why the action vocabulary loops through the dataset instead of just getting a handle on the vocabulary? I think doing that would just fix this.

@Adamits
Collaborator Author

Adamits commented Jun 4, 2024

> @Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it's edit actions from the expert. (It's like, every potential target symbol has a copy or insert action, so it's 2x the observed vocabulary.)

> Right, is there a reason why the action vocabulary loops through the dataset instead of just getting a handle on the vocabulary? I think doing that would just fix this.

Sorry, never mind (though this may also be an issue); I am pretty sure my initial assessment is right: I now initialize the embeddings matrix before the transducer modifications. Will see if I can fix this elegantly.

EDIT: Basically what you said is right, we just use the arg vocab now instead of target_vocab.

@Adamits
Collaborator Author

Adamits commented Jun 4, 2024

The most recent commit seems to fix the Transducer issue I created. I will try to test more tomorrow night to be sure all of its changes are needed.

@bonham79
Collaborator

bonham79 commented Jun 4, 2024

> @Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it's edit actions from the expert. (It's like, every potential target symbol has a copy or insert action, so it's 2x the observed vocabulary.)

> Right, is there a reason why the action vocabulary loops through the dataset instead of just getting a handle on the vocabulary? I think doing that would just fix this.

Nope, no reason. I just do it that way because it has to pass an iterable off to the maxwell dependency, so it can add actions on the fly.

Now that you mention it, it would be more consistent to just pass the dataloader vocabulary to the expert. (Or move the expert around so it calls maxwell while building the dataloader. But it's a cheap operation, so not a major need.)
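
A rough sketch of building the action vocabulary straight from the target vocabulary rather than by iterating over the dataset (the action classes here are stand-ins, not the actual maxwell API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    symbol: str


@dataclass(frozen=True)
class Insert:
    symbol: str


def action_vocabulary(target_vocabulary: list[str]) -> list:
    """Each target symbol contributes a copy and an insert action, so the
    action space is roughly 2x the observed target vocabulary (deletion and
    end actions omitted here)."""
    actions: list = []
    for symbol in target_vocabulary:
        actions.append(Copy(symbol))
        actions.append(Insert(symbol))
    return actions
```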

Collaborator

@bonham79 left a comment

Just commenting to nag myself into reviewing the transducer changes.

@kylebgorman
Contributor

Does this solve #156?

@Adamits
Collaborator Author

Adamits commented Jun 13, 2024

> Does this solve #156?

Oh yeah I think it does.

@Adamits
Collaborator Author

Adamits commented Jul 4, 2024

This PR is getting very very stale. From what I recall it was ready to go. Shall I run another round of tests on each arch for both train and predict, and if it all works, we merge?

@kylebgorman
Contributor

> This PR is getting very very stale. From what I recall it was ready to go. Shall I run another round of tests on each arch for both train and predict, and if it all works, we merge?

@Adamits yes please merge this soon. I also thought it was basically ready to go.

@Adamits
Collaborator Author

Adamits commented Jul 6, 2024

Ok, I ran these, and everything runs and looks good except the Transducer. No errors, but I trained it for 5 epochs on English SIGMORPHON 2017 medium data, and the validation accuracy looks messed up. See the metrics.csv table in markdown below. We also get an error when running predict.py.

However, when I run the same code from master, (i) I get 0 accuracy after 5 epochs and the train loss is converging absurdly slowly, and (ii) predict.py still errors.

| lr-Adam | step | val_loss | val_accuracy | epoch | train_loss |
| --- | --- | --- | --- | --- | --- |
| 0.001 | 0 | | | | |
| | 15 | 0.9778230786323547 | 0.1899999976158142 | 0 | |
| | 15 | | | 0 | 3.108109951019287 |
| 0.001 | 16 | | | | |
| | 31 | 0.7275158166885376 | 0.1899999976158142 | 1 | |
| | 31 | | | 1 | 0.9277768731117249 |
| 0.001 | 32 | | | | |
| | 47 | 0.581281840801239 | 0.1899999976158142 | 2 | |
| | 47 | | | 2 | 0.7454320788383484 |
| 0.001 | 48 | | | | |
| | 63 | 0.49098914861679077 | 0.1899999976158142 | 3 | |
| | 63 | | | 3 | 0.6462973356246948 |
| 0.001 | 64 | | | | |
| | 79 | 0.4305487275123596 | 0.1899999976158142 | 4 | |
| | 79 | | | 4 | 0.6198569536209106 |

@Adamits
Collaborator Author

Adamits commented Jul 6, 2024

Update: ok, the predict.py issue was minor and I fixed it. The transducer seems to be learning to just copy the lemma exactly. Not sure what to make of that yet...

@kylebgorman
Contributor

I have also seen this pattern with validation accuracy, namely it being, implausibly, a constant. I don't know what the origin of it is.

Re: the copying of lemma, in the printout of the parameters do you see anything that would be doing feature encoding?

@kylebgorman
Contributor

You might also want to try doing the manual merge against head to see if anything that changed there helps...

@Adamits
Collaborator Author

Adamits commented Jul 6, 2024

After I merge master back in, I get the same thing but with a lower accuracy:

| lr-Adam | step | val_accuracy | val_loss | epoch | train_loss |
| --- | --- | --- | --- | --- | --- |
| 0.001 | 0 | | | | |
| | 15 | 0.002 | 0.9778230786323547 | 0 | |
| | 15 | | | 0 | 3.108109951019287 |
| 0.001 | 16 | | | | |
| | 31 | 0.002 | 0.7275158166885376 | 1 | |
| | 31 | | | 1 | 0.9277768731117249 |
| 0.001 | 32 | | | | |
| | 47 | 0.002 | 0.581281840801239 | 2 | |
| | 47 | | | 2 | 0.7454320788383484 |
| 0.001 | 48 | | | | |
| | 63 | 0.002 | 0.49098914861679077 | 3 | |
| | 63 | | | 3 | 0.6462973356246948 |
| 0.001 | 64 | | | | |
| | 79 | 0.002 | 0.4305487275123596 | 4 | |
| | 79 | | | 4 | 0.6198569536209106 |

Prediction is also now broken:

```
Traceback (most recent call last):
  File "/Users/adamwiemerslage/python-envs/yoyodyne/bin/yoyodyne-predict", line 8, in <module>
    sys.exit(main())
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/predict.py", line 158, in main
    model = get_model_from_argparse_args(args)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/predict.py", line 70, in get_model_from_argparse_args
    return model_cls.load_from_checkpoint(args.checkpoint)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
    return _load_from_checkpoint(
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 188, in _load_from_checkpoint
    return _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 234, in _load_state
    obj = cls(**_cls_kwargs)
TypeError: __init__() missing 1 required positional argument: 'expert'
```

@kylebgorman
Contributor

The prediction bug is #197. Fine to submit over the top of that if everything else is working, IMO, since @bonham79 is on that one.

@Adamits
Collaborator Author

Adamits commented Jul 6, 2024

> The prediction bug is #197. Fine to submit over the top of that if everything else is working, IMO, since @bonham79 is on that one.

Sounds good. Let me just train the Transducer longer as a sanity check. I am also unsure why the (static) accuracies are so much lower once I merge master. I plan on committing the version with master merged, since I resolved a couple of conflicts; sound good?

I am also pretty sure you have to merge since this is your org; I don't think I have access to merge.

@Adamits
Collaborator Author

Adamits commented Jul 6, 2024

Ok, I got the same result from the transducer. I also wonder if the particular dataset is somehow causing it to learn this copy heuristic. It seems like we should do more explicit testing of that model, and being able to predict would help with debugging. Would you be ok merging this, and I can open a separate ticket?

@kylebgorman
Contributor

I'm fine with that. I'll deal with the conflicts now.

@Adamits
Collaborator Author

Adamits commented Jul 6, 2024

> I'm fine with that. I'll deal with the conflicts now.

Oh, sorry, I blanked on committing those. They are minor changes, so I guess I will let you do that to get a second pair of eyes on it?

@kylebgorman
Contributor

I just dealt with the conflicts. Flake8 noticed that source_vocab_size in train.py was unused so I removed it. I also fixed the pretty-printing to avoid nested [ and ]. I tested locally and things are fine.

@kylebgorman merged commit a72a60f into CUNY-CL:master on Jul 6, 2024
5 checks passed