CoNLL demo, part 3 #5

Closed
frreiss opened this issue Apr 3, 2020 · 0 comments

frreiss commented Apr 3, 2020

By the end of part 2 of the demo, we had shown that the CoNLL-2003 validation set contains incorrect labels, and that you can pinpoint those labels by data-mining the results of the 16 models that the competitors submitted.
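As a rough illustration of that kind of data-mining (a sketch only, not necessarily how part 2 of the demo does it), one could count how many models reproduce each corpus label and flag the labels that few or no models find. The span format `(doc, begin, end, label)`, the function names, and the toy data below are all assumptions made up for this example.

```python
from collections import Counter


def count_model_hits(gold, model_outputs):
    """Count, for each gold span, how many models predicted exactly that span.

    gold: set of (doc, begin, end, label) tuples taken from the corpus.
    model_outputs: dict mapping a model name to a set of tuples in the same form.
    """
    # Start every gold span at zero so that spans no model found stay visible.
    hits = Counter({span: 0 for span in gold})
    for predicted in model_outputs.values():
        for span in gold & predicted:
            hits[span] += 1
    return hits


def suspicious_labels(gold, model_outputs, max_hits=0):
    """Return the gold spans that at most `max_hits` models reproduced."""
    hits = count_model_hits(gold, model_outputs)
    return sorted(span for span, n in hits.items() if n <= max_hits)


# Toy data, invented for illustration (three models instead of 16).
gold = {
    (0, 0, 2, "PER"),
    (0, 5, 6, "LOC"),
    (1, 3, 4, "ORG"),
}
model_outputs = {
    "model_a": {(0, 0, 2, "PER"), (1, 3, 4, "LOC")},
    "model_b": {(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "LOC")},
    "model_c": {(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "LOC")},
}

flagged = suspicious_labels(gold, model_outputs)
print(flagged)
```

A span that no model reproduces is only a candidate; it still needs manual inspection, since the models may all be wrong in the same way.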

Our goal for part 3 of the demo is to pinpoint incorrect labels across the entire data set. The rough process will be:

  1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus onto the new tokenization.
  2. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a dataframe column (of type TensorType) alongside the tokens and labels.
  3. Use the embeddings to quickly train multiple models at multiple levels of sophistication (something like: SVMs, random forests, and LSTMs with small and large numbers of hidden states). Split the corpus into 10 parts and perform a 10-fold cross-validation.
  4. Repeat the process from part 2 on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.
  5. Analyze the results of the models to pinpoint potential incorrect labels. Inspect those labels manually and build up a list of labels that are actually incorrect.
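The cross-validation part of this plan (steps 3 to 5) could be sketched roughly as follows. This is a minimal, standard-library-only illustration under several assumptions: the two "models" are trivial hypothetical stand-ins for the SVMs, random forests, and LSTMs mentioned above; they look at the token text instead of BERT embeddings; and the corpus is split by document, which the plan does not specify. All function names and the toy corpus are invented for the example.

```python
from collections import Counter, defaultdict


def make_folds(doc_ids, num_folds=10):
    """Assign documents to folds round-robin; returns a list of lists."""
    folds = [[] for _ in range(num_folds)]
    for i, doc_id in enumerate(sorted(doc_ids)):
        folds[i % num_folds].append(doc_id)
    return folds


def train_majority_per_token(rows):
    """Hypothetical stand-in model: remember each token's most common label."""
    counts = defaultdict(Counter)
    for _, token, label in rows:
        counts[token][label] += 1
    table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    return lambda token: table.get(token, "O")


def train_always_outside(rows):
    """Hypothetical stand-in model: label every token as outside any entity."""
    return lambda token: "O"


def out_of_fold_predictions(rows, trainers, num_folds=10):
    """rows: list of (doc_id, token, label). Returns {model name: [prediction per row]}.

    Each row is predicted by models trained only on the other folds.
    """
    folds = make_folds({doc for doc, _, _ in rows}, num_folds)
    preds = {name: [None] * len(rows) for name in trainers}
    for held_out in folds:
        held = set(held_out)
        train_rows = [r for r in rows if r[0] not in held]
        for name, trainer in trainers.items():
            model = trainer(train_rows)
            for i, (doc, token, _) in enumerate(rows):
                if doc in held:
                    preds[name][i] = model(token)
    return preds


# Toy corpus, invented for illustration: document 3 carries a wrong label.
rows = []
for doc in range(10):
    rows.append((doc, "Paris", "PER" if doc == 3 else "LOC"))
    rows.append((doc, "visited", "O"))

trainers = {
    "majority": train_majority_per_token,
    "outside": train_always_outside,
}
preds = out_of_fold_predictions(rows, trainers)

# Flag rows whose corpus label no model reproduced on its held-out fold.
flagged = [
    row
    for i, row in enumerate(rows)
    if all(preds[name][i] != row[2] for name in trainers)
]
print(flagged)
```

Because every document lands in exactly one held-out fold, each token ends up with one out-of-fold prediction per model, which gives the whole corpus the same kind of per-label model comparison that part 2 had for the validation set.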
@frreiss frreiss closed this as completed Jul 24, 2020
frreiss pushed a commit that referenced this issue Aug 13, 2021
Added handling of categorical data to interactive widget table