Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update geopoint to geoshape #306

Merged
merged 2 commits into from
Jan 18, 2024
Merged

Update geopoint to geoshape #306

merged 2 commits into from
Jan 18, 2024

Conversation

ehanson8
Copy link
Contributor

@ehanson8 ehanson8 commented Jan 18, 2024

Helpful background context

While testing the updated mapping, we ran into an OpenSearch error when a record could not be ingested if it contained more than one geo_shape type. This caused issues because we mapped both dcat_bbox and locn_geometry to Location.geoshape, which is ingested as an OpenSearch geo_shape type. While MIT records do not have different values for these fields, we have already found examples in OGM records where the fields differ and we would prefer not to lose those unique values.

We discovered that removing include_in_parent property allowed us to successfully index the records. @JPrevost This has obvious implications for the GraphQL queries so we wanted to get your input and approval before proceeding with this change.

How can a reviewer manually see the effects of these changes?

Ensure that TIMDEX_OPENSEARCH_ENDPOINT is not set as an env var and spin up a local Docker container with:

docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearchproject/opensearch:2.11.0

In new terminal window, create a new gismit index:

pipenv run tim create -s gismit

Copy the index name and promote the index to the gismit alias:

pipenv run tim promote -a gismit -i <index name>

Set Dev1 credentials and bulk-index the following S3 file with Location.geoshape to see that it successfully indexes:

pipenv run tim bulk-index -s gismit s3://timdex-extract-dev-222053980223/gismit/gismit.json

Includes new or updated dependencies?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README (or there is none)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Given that OpenSearch can process WKT strings and the values tend to be bounding boxes rather than points, we will use the geo_shape type instead of the geo_point type

How this addresses that need:
* Update geopoint > geoshape in OpenSearch mapping
* Remove include_in_parent property from locations field

Side effects of this change:
* This will impact how the TIMDEX API constructs queries

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-116
@JPrevost
Copy link
Member

@ehanson8 does dev1 opensearch have this data loaded?

@ehanson8
Copy link
Contributor Author

No, given the breaking nature of this change, we wanted to keep testing local until we had a chance to discuss in case this is a bad idea for a reason that we didn't realize

* Update include_in_parent and doc_values properties after discussion on the merits of several approaches
Copy link

Pull Request Test Coverage Report for Build 7576158927

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 7398648544: 0.0%
Covered Lines: 407
Relevant Lines: 407

💛 - Coveralls

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Noting for posterity that both docs_value=false or include_in_parent=false under the locations.geoshape property have the desired effect of avoiding the error:

Details: {"type": "illegal_argument_exception", "reason": "DocValuesField "locations.geoshape" appears more than once in this document (only one value is allowed per field)"}

where docs_value=false has the advantage of keeping the index size smaller, with the potential trade-off of slower queries.

Confirmed with the following query that the document was found:

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_bounding_box": {
          "locations.geoshape": {
            "left": -110,
            "right": -109,           
            "top": 44,            
            "bottom": 43
            
          }
        }
      }
    }
  }
}

and this query fails to find it as the longitude values are outside of the box:

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_bounding_box": {
          "locations.geoshape": {
            "left": -50,
            "right": -49,           
            "top": 44,            
            "bottom": 43
            
          }
        }
      }
    }
  }
}

@ehanson8 ehanson8 merged commit 9e2c66f into main Jan 18, 2024
3 checks passed
@ehanson8 ehanson8 deleted the gdt-116-update-location-field branch January 18, 2024 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants