Ner model #1706

Merged
merged 101 commits into from
Jun 26, 2024

nsantacruz (Contributor) commented Nov 9, 2023

Overview

This PR adds functionality to the linker to (1) find named entities as well as citations and (2) run the models in English.

Making these changes cleanly required refactoring the code. Bugs encountered in the existing code along the way were fixed, as explained below.

Below is an overview of the major components that were changed.

Linker API

  • Linker API now returns v3 results for English as well

Linker Python class

  • Front door of linker is now Linker class instead of RefResolver.
  • Linker can now detect both citations and named entities. Named entities are linked to Topics when possible
  • Linker returns LinkedDoc which contains results for both named entities and citations
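
As a rough illustration of the shape described above, the sketch below shows a single entry point that returns one result object holding both citations and named entities. Everything in it is an assumption made for illustration: `SketchLinker`, `SketchLinkedDoc`, the `link` method, the field names, and the toy finders are hypothetical stand-ins, not the classes added in this PR.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, matched text); hypothetical shape


@dataclass
class SketchLinkedDoc:
    """Hypothetical stand-in for LinkedDoc: one container for both result types."""
    text: str
    citations: List[Span] = field(default_factory=list)
    named_entities: List[Span] = field(default_factory=list)


class SketchLinker:
    """Hypothetical stand-in for Linker: a single front door over two finders."""

    def __init__(self, find_citations: Callable[[str], List[Span]],
                 find_entities: Callable[[str], List[Span]]):
        self._find_citations = find_citations
        self._find_entities = find_entities

    def link(self, text: str) -> SketchLinkedDoc:
        # Run both finders on the same input and bundle the results.
        return SketchLinkedDoc(text, self._find_citations(text), self._find_entities(text))


def find_literal(needle: str) -> Callable[[str], List[Span]]:
    """Toy finder that matches one literal string; stands in for an ML model."""
    def finder(text: str) -> List[Span]:
        start = text.find(needle)
        return [] if start < 0 else [(start, start + len(needle), needle)]
    return finder


linker = SketchLinker(find_literal("Genesis 1:1"), find_literal("Rashi"))
doc = linker.link("Rashi comments on Genesis 1:1.")
```

Callers deal with one object (`doc`) and read `doc.citations` or `doc.named_entities` as needed, which is the main benefit of a single front door.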

RefResolver

  • RefResolver now has no access to ML models. Its sole responsibility is taking ML model results and resolving refs.
  • Access to ML models is solely handled by NamedEntityRecognizer

NamedEntityRecognizer

  • Takes input as a string and returns raw results from ML models wrapped in simple objects (RawNamedEntity or RawRef)

NamedEntityResolver

  • Counterpart to RefResolver. Takes RawNamedEntity objects and resolves them to ResolvedNamedEntity objects
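
The resolution step can be pictured as a lookup from a raw mention to a topic. The sketch below assumes a simple title-to-slug table; `RawEntity`, `ResolvedEntity`, `resolve_entities`, and the table contents are hypothetical names invented here, and the real NamedEntityResolver may match mentions quite differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RawEntity:
    """Hypothetical stand-in for RawNamedEntity: text plus character span."""
    text: str
    start: int
    end: int


@dataclass(frozen=True)
class ResolvedEntity:
    """Hypothetical stand-in for ResolvedNamedEntity."""
    raw: RawEntity
    topic_slug: Optional[str]  # None when no topic matched


def resolve_entities(raw_entities: List[RawEntity],
                     title_to_slug: Dict[str, str]) -> List[ResolvedEntity]:
    """Look each mention up in a title table; unmatched mentions get topic_slug=None."""
    return [ResolvedEntity(raw, title_to_slug.get(raw.text.strip().lower()))
            for raw in raw_entities]


titles = {"rashi": "rashi", "rambam": "rambam"}  # made-up lookup table
resolved = resolve_entities(
    [RawEntity("Rashi", 0, 5), RawEntity("Someone", 10, 17)], titles
)
```

Keeping unmatched mentions with `topic_slug=None` mirrors the "when possible" wording above: recognition and resolution are separate outcomes.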

MapReferenceableBookNode

  • ReferenceableBookNodes confine all (or almost all) access to sefaria.model.schema to this one layer of code. This makes the linker code much more decoupled from the rest of Sefaria-Project.
  • The purpose of ReferenceableBookNodes is to map an Index schema to a tree structure that better reflects what is useful to the linker
  • MapReferenceableBookNode corresponds to an ArrayMapNode in the schema.
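
The insulation idea can be sketched as an adapter layer: schema objects are converted once into linker-side nodes, and the rest of the linker only touches those nodes. In the sketch below a nested dict stands in for an Index schema, and `SketchReferenceableNode` and `from_schema` are hypothetical names; none of this is the actual ReferenceableBookNode code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchReferenceableNode:
    """Hypothetical linker-side node; the linker only ever sees this type."""
    title: str
    children: List["SketchReferenceableNode"] = field(default_factory=list)

    def leaf_titles(self) -> List[str]:
        """Collect the titles of all leaves below (or at) this node, in order."""
        if not self.children:
            return [self.title]
        return [t for child in self.children for t in child.leaf_titles()]


def from_schema(schema: Dict) -> SketchReferenceableNode:
    """Convert a nested dict (a made-up stand-in for an Index schema) to linker nodes."""
    return SketchReferenceableNode(
        title=schema["title"],
        children=[from_schema(child) for child in schema.get("nodes", [])],
    )


schema = {
    "title": "Book",
    "nodes": [
        {"title": "Introduction"},
        {"title": "Part 1", "nodes": [{"title": "Chapter 1"}]},
    ],
}
root = from_schema(schema)
```

Because only `from_schema` knows the schema's layout, a change to the schema representation touches one function rather than every caller.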

AltStructNode

  • Change the structural nodes of an alt struct (meaning the nodes above the leaves) from TitledTreeNode to a new class called AltStructNode
  • This allows us to add some linker attributes to these nodes and find citations in alt structs

CORSDebugMiddleware

  • It turns out this middleware is still useful when debugging the linker on localhost, because there is no nginx there to add the CORS headers
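
For readers unfamiliar with the pattern, a debug-only CORS middleware in the Django style can be sketched as follows. This is a hypothetical illustration, not the CORSDebugMiddleware in Sefaria-Project; the class name and the specific header values are assumptions. It relies only on the fact that Django middleware is a callable wrapping `get_response` and that Django responses accept header assignment by key.

```python
class DebugCORSMiddleware:
    """Django-style middleware sketch: adds permissive CORS headers to every response.

    Hypothetical illustration for local debugging only; do not enable in production.
    """

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # Django's HttpResponse supports item assignment for headers.
        response["Access-Control-Allow-Origin"] = "*"
        response["Access-Control-Allow-Methods"] = "GET, POST, OPTIONS"
        response["Access-Control-Allow-Headers"] = "Content-Type"
        return response


# Stand-in for a view: a plain dict plays the role of the response object.
middleware = DebugCORSMiddleware(lambda request: {})
headers = middleware(request=None)
```

In a real project such a class would be listed in `MIDDLEWARE` only when running locally, since in production the reverse proxy is expected to add these headers.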

Linker JS

  • Fix a bug that prevented citations from being wrapped when they appeared in their own block element (e.g. )

Linker Index Converter

  • Add more utilities to help convert indices to be usable by the Linker

Normalization

  • Normalization utilities are used by the linker to normalize its input and to map results back onto the indices of the original, pre-normalization text
  • Fixed some serious bugs that led to incorrect mappings from the normalized string to the original string
  • Add norm_to_unnorm_indices(), which combines two commonly used functions into one.
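
To make the index-mapping problem concrete, here is a minimal sketch of one way to normalize text while remembering where each kept character came from. It is not the implementation of norm_to_unnorm_indices(); the function names, the regex-removal approach, and the sample text are all assumptions chosen for illustration.

```python
import re
from typing import List, Tuple


def normalize_with_map(text: str, pattern: str) -> Tuple[str, List[int]]:
    """Remove every match of `pattern`; return the normalized text and, for each
    normalized character, its index in the original text."""
    kept: List[int] = []
    pos = 0
    for match in re.finditer(pattern, text):
        kept.extend(range(pos, match.start()))
        pos = match.end()
    kept.extend(range(pos, len(text)))
    return "".join(text[i] for i in kept), kept


def span_to_original(span: Tuple[int, int], kept: List[int]) -> Tuple[int, int]:
    """Map a non-empty half-open span in the normalized text back to the original."""
    start, end = span
    if start >= end:
        raise ValueError("span must be non-empty")
    # Map the last included character, then add 1 to keep the end exclusive.
    return kept[start], kept[end - 1] + 1


original = "See <b>Genesis</b> 1:1 there"
normalized, kept = normalize_with_map(original, r"<[^>]+>")
norm_start = normalized.index("Genesis 1:1")
orig_span = span_to_original((norm_start, norm_start + len("Genesis 1:1")), kept)
```

Mapping the last included character and adding one, rather than mapping the exclusive end directly, is the kind of detail where off-by-one errors tend to creep in, because the character at the exclusive end may sit after a removed region.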

…uage specific.

This greatly simplifies the input parameters and generally makes more sense.
…gnizer to help abstract this functionality and make it more reusable.
This case seems to be either impossible or very unlikely. Throwing an error for now unless we determine we want to handle this case.
It seems that the start indices from the mapping are already off-by-one, so end doesn't need to be -1.
edamboritz previously approved these changes Dec 19, 2023
@nsantacruz nsantacruz marked this pull request as ready for review June 26, 2024 07:02
@nsantacruz nsantacruz added this pull request to the merge queue Jun 26, 2024
Merged via the queue into master with commit 75edabf Jun 26, 2024
15 checks passed
3 participants