Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UnitaryHack] Intial commit unknown words rewrite rule #94

Closed
wants to merge 5 commits into from

Conversation

WingCode
Copy link
Contributor

closes #84

@WingCode WingCode changed the title [Draft] Intial commit unknown words rewrite rule [UnitaryHack] Intial commit unknown words rewrite rule May 29, 2023
@WingCode WingCode marked this pull request as ready for review May 29, 2023 22:20
@le-big-mac
Copy link
Collaborator

le-big-mac commented May 30, 2023

Hi @WingCode, thanks for your submission, this PR looks really good, and certainly solves the problem of replacing unknown words in a single diagram. While not explicitly mentioned in the task (apologies!), it would be really useful to have a HandleUnknownWords module, that could handle some of the pre-processing necessary to use this rewrite rule on a dataset. In broad strokes, the functionality of the class might looks something like this:

  1. Given a list of diagrams (and potentially a list of the strings associated with them), and a train flag (indicating that this data is to be used for training), this module would take a min_freq minimum frequency parameter, and replace all words that appear less frequently than this in the data with the UNK token.
  2. Given a list of diagrams (and again potentially the strings too), and a test or dev flag (indicating that this data is to be used for evaluation), along with a list of words that appeared in the training dataset (which could potentially be returned by the module when train is passed), replace any words in these diagrams that do not appear in the training data with the UNK token.

I appreciate that this is asking you to do extra work, but if you think you could add that it would be really great!

@WingCode
Copy link
Contributor Author

@le-big-mac Thank you for going through the PR and describing the functionalities of HandleUnknownWords.
Would you like it to be added into this PR or a new separate PR?
In my opinion, separate PR would be better since the UnknownWordsRewriteRule implemented in this PR can work independently of HandleUnknownWords and it would be easier to review changes pertaining to HandleUnknownWords in a new PR.

not have one of these words, it will not be rewritten with
UNK.
"""
self.template = template
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think your RewriteRule requires a template.


unknown_words = ['unknown']
rule = UnknownWordsRewriteRule(
template=diagram,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagram that we are rewriting should not be being passed as a template, but this has no effect as template is never used in the RewriteRule.

@le-big-mac
Copy link
Collaborator

Hi @WingCode, I have taken another look and left a couple of small comments that need to be resolved in this PR. I agree with you that HandleUnknownWords should probably be done in a separate PR, and we would really appreciate your help with that!

Considering the initial framing of the question, I feel that once the comments in this PR are addressed we can assign you the issue, and then once the HandleUnknownWords PR is opened that issue can be closed.

@dimkart
Copy link
Contributor

dimkart commented Jun 7, 2023

@WingCode We have fixed an error in main which makes a test to fail, please merge and see if the tests are now passing for your PR.

@dimkart
Copy link
Contributor

dimkart commented Jun 12, 2023

@WingCode Please address any remaining comments by tomorrow 13/6, last day of the hackathon.

@nathanshammah
Copy link

@dimkart @WingCode unitaryHACK will accept minor reviews of open PRs even if they are made shortly after today's deadline.

@dimkart
Copy link
Contributor

dimkart commented Jun 16, 2023

Superseded by #105.

@dimkart dimkart closed this Jun 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UnitaryHACK: Replace unknown words in diagrams with UNK token
4 participants