Hi,
Thanks for making this code available.
Have you thought about how to extend the architecture to support more than one class? Would it be better to have one attention mask per class, or could we combine the per-class label masks into a single binary mask and use that for attention?
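To make the two options I have in mind concrete, here is a minimal NumPy sketch; the shapes and mask layout are hypothetical illustrations, not taken from your code:

```python
import numpy as np

# Hypothetical per-class label masks for a 3-class problem on a 4x4 feature map;
# masks[c, i, j] is True where pixel (i, j) is labelled with class c.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, :2, :] = True   # class 0 occupies the top half
masks[1, 2:, :2] = True  # class 1 occupies the bottom-left quadrant
masks[2, 3, 3] = True    # class 2 occupies a single pixel

# Option A: one attention mask per class -- keep the stack as-is,
# shape (num_classes, H, W), and apply attention once per class.
per_class_masks = masks

# Option B: collapse the stack into a single binary mask with a logical OR,
# so one attention pass covers every labelled pixel regardless of class.
combined_mask = masks.any(axis=0)  # shape (H, W)
print(combined_mask.sum())  # number of labelled pixels -> 13
```

Option A keeps class-specific attention at the cost of extra computation per class; Option B is cheaper but loses the per-class distinction.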