Check if ID will be overwritten by different project #124
Comments
Also related to #123 |
The posting script could just skip indexing a file if it's already present and inform one in the report printed at the end. And maybe add a |
@techgique you mean if the id belongs to a different collection? Because in general, we always want to override existing IDs since we would consider them to be an update. For example, re-indexing Cather Letters or something. |
I like the idea of a report or a |
I wasn't thinking about the updating aspect at the time I wrote my comment, so good point. Without collection prefixing (an alternative to saying "namespacing"?), we wouldn't be able to tell if the id belongs to a different collection unless we open the file and look for evidence, right? That could slow posting down more than we'd like though. 🤔 Not sure how else we'd tell it's a completely different file from a different project. Diff the files and check whether the count of differing lines is over a certain percentage of the file's total line count? Perhaps the file's last modified time and the index document's posted/modified time could be of use in determining whether a file contains updates for the index. Maybe only posting to the production environment would try some of this fancier update-cautious checking, which could be bypassed with a |
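For illustration only, the diff-percentage idea in this comment might look roughly like the sketch below. It uses Python's standard `difflib` rather than anything from the posting script, and the function names and the 0.5 threshold are hypothetical placeholders, not values anyone in this thread proposed.

```python
import difflib


def changed_line_fraction(old_text, new_text):
    """Return a value from 0.0 (identical) to 1.0 (no lines in common)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # autojunk=False keeps difflib from ignoring very common lines in big files.
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    # ratio() is 2 * matching_lines / total_lines, so 1 - ratio() is the
    # share of lines that could not be matched between the two versions.
    return 1.0 - matcher.ratio()


def looks_like_different_file(old_text, new_text, threshold=0.5):
    """Guess whether new_text is a different document rather than an update.

    The threshold is an assumption; it would need tuning against real files.
    """
    return changed_line_fraction(old_text, new_text) > threshold
```

This still requires having both versions of the file on hand, so it does not remove the concern about slowing posting down.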
I haven't given this a ton of thought, but I think we could gather the ids that are going to be posted, send a request to elasticsearch to query all the ids returning only the fields id and collection, and then quickly see if any of those collections do not match the current name of the project. If so, we could filter those filenames out of the list that will be posted? I think if it was limited to one request and some filtering, it probably wouldn't take a super long time to do? It might require a little bit of re-architecting when and how things currently happen in the "data manager" class in the data repo... |
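A rough sketch of the single-request check described in this comment, written in Python purely for illustration (it is not the data repo's code). It assumes the filename is the Elasticsearch `_id`, that each document stores its project name in a `collection` field, and that the search response has been parsed into a dict; the function names and sample IDs are hypothetical.

```python
def build_id_lookup_query(ids):
    """Build a search body that fetches only the collection of each given ID."""
    return {
        # Ask for one hit per ID. Elasticsearch's default result window is
        # 10,000, so larger batches would need to be split into several requests.
        "size": len(ids),
        "_source": ["collection"],
        "query": {"ids": {"values": list(ids)}},
    }


def find_conflicts(response, current_collection):
    """Map each already-indexed ID to its owner when the owner is another project."""
    conflicts = {}
    for hit in response.get("hits", {}).get("hits", []):
        owner = hit.get("_source", {}).get("collection")
        if owner is not None and owner != current_collection:
            conflicts[hit["_id"]] = owner
    return conflicts


def split_postable(ids, conflicts):
    """Separate IDs that are safe to post from those that should be reported."""
    safe = [i for i in ids if i not in conflicts]
    skipped = [i for i in ids if i in conflicts]
    return safe, skipped
```

IDs that come back with the current collection name are still posted, so re-indexing a project keeps overwriting its own documents as updates; only IDs owned by another collection are held back for the end-of-run report.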
That sounds good to me. I'm less familiar with constructing the Elasticsearch queries, so I hadn't thought of doing it that way. 👍 |
I think that you can query a list of things like that? I guess we'll find out :) |
I need to do a little testing to 100% confirm this is the case, but since we are using filenames as the IDs for documents in elasticsearch, I worry that if two projects have the same ID for a file (particularly those which are not namespaced by project, see https://github.com/whitmanarchive/whitman-issues/issues/27), they will overwrite each other silently.
@karindalziel suggests some kind of pre-indexing check that would compare all the IDs that will be pushed against results already in the API index, and warn the user if there is already an ID by that name belonging to a different collection name.
Any other ideas?