Check if ID will be overwritten by different project #124

Open
jduss4 opened this issue Oct 17, 2018 · 8 comments
Labels
feature request help wanted important finding the most important issues

Comments

@jduss4
Contributor

jduss4 commented Oct 17, 2018

I need to do a little testing to confirm this 100%, but since we are using filenames as the IDs for documents in Elasticsearch, I worry that if two projects have the same ID for a file (particularly projects whose IDs are not namespaced by project; see https://github.com/whitmanarchive/whitman-issues/issues/27), they will silently overwrite each other.

@karindalziel suggests some kind of pre-indexing check that would compare all the IDs about to be pushed against results already in the API index, and warn the user if an ID by that name already exists under a different collection name.
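
A minimal sketch of that kind of pre-indexing check, assuming the IDs already in the index have been fetched into a plain mapping of ID to collection name. The function names and the sample IDs and collection names are hypothetical and not taken from the existing posting script.

```python
def find_id_conflicts(pending_ids, current_collection, indexed):
    """Return {id: other_collection} for pending IDs owned by another collection.

    `indexed` maps each ID already in the index to its collection name
    (an assumed shape; the real lookup would come from the API index).
    """
    conflicts = {}
    for doc_id in pending_ids:
        owner = indexed.get(doc_id)
        # Same collection means a normal update, not a conflict.
        if owner is not None and owner != current_collection:
            conflicts[doc_id] = owner
    return conflicts


def format_warnings(conflicts):
    """Build one human-readable warning line per conflicting ID."""
    return [
        f"WARNING: id '{doc_id}' already belongs to collection '{owner}'"
        for doc_id, owner in sorted(conflicts.items())
    ]


# Hypothetical data for illustration only.
indexed = {"let0001": "cather_letters", "poem001": "whitman"}
conflicts = find_id_conflicts(
    ["let0001", "poem001", "new0001"], "cather_letters", indexed
)
for line in format_warnings(conflicts):
    print(line)
```

Here `let0001` is treated as an update because it already belongs to the collection being posted, while `poem001` is reported because a different collection owns it.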

Any other ideas?

@jduss4
Contributor Author

jduss4 commented Oct 17, 2018

Also related to #123

@techgique
Member

The posting script could just skip indexing a file if it's already present and say so in the report printed at the end. And maybe add a -f force option to overwrite? It could require that the id be explicitly named with the -f option, to be extra thorough and granular?
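
A rough sketch of how a skip-unless-forced option might look, assuming a command-line script built with Python's `argparse`. The parser, the `plan_posting` helper, and the sample IDs are all hypothetical; the real posting script may be organized quite differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical posting script")
    # Each -f names one ID that may be overwritten even if already indexed.
    parser.add_argument(
        "-f", "--force", action="append", metavar="ID", default=None,
        help="ID to overwrite even if already present (repeatable)",
    )
    return parser


def plan_posting(pending_ids, existing_ids, forced_ids):
    """Split pending IDs into those to post and those to skip."""
    to_post, skipped = [], []
    for doc_id in pending_ids:
        if doc_id in existing_ids and doc_id not in forced_ids:
            skipped.append(doc_id)
        else:
            to_post.append(doc_id)
    return to_post, skipped


# Simulates running the script with: -f let0002
args = build_parser().parse_args(["-f", "let0002"])
forced = set(args.force or [])
to_post, skipped = plan_posting(
    ["let0001", "let0002", "let0003"], {"let0001", "let0002"}, forced
)
print("Posted:", to_post)
print("Skipped (already present, not forced):", skipped)
```

Requiring each ID to be named after `-f` keeps the override granular: only `let0002` is overwritten, and `let0001` ends up in the skipped section of the report.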

@jduss4
Contributor Author

jduss4 commented Oct 17, 2018

@techgique you mean if the id belongs to a different collection? Because in general, we always want to override existing IDs since we would consider them to be an update. For example, re-indexing Cather Letters or something.

@jduss4
Contributor Author

jduss4 commented Oct 17, 2018

I like the idea of a report or a -force override by ID. We could have a nuclear option somewhere, too, for cases like somebody having pushed a project under the wrong collection name and wanting to completely redo it by pushing to the correct collection, etc.

@techgique
Member

techgique commented Oct 17, 2018

I wasn't thinking about the updating aspect at the time I wrote my comment, so good point. Without collection prefixing (an alternative to saying "namespacing"?), we wouldn't be able to tell if the id belongs to a different collection unless we open the file and look for evidence, right? That could slow posting down more than we'd like, though.

🤔 Not sure how else we'd tell it's a completely different file from a different project. Diff the files and flag it if the count of differing lines is over a certain percentage of the file's total line count? Perhaps the file's last modified time and the index document's posted/modified time could be of use in determining whether a file contains updates for the index. "It looks like you're trying to post an older file..." 📎
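
To make the two heuristics above concrete, here is a small sketch using Python's standard `difflib` and `datetime` modules. The 0.5 threshold, the function names, and the sample texts are assumptions for illustration; nothing here reflects how the posting script actually works.

```python
import difflib
from datetime import datetime


def changed_line_fraction(old_text, new_text):
    """Fraction of lines that differ between two texts (0.0 = identical)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    matching = sum(block.size for block in matcher.get_matching_blocks())
    total = max(len(old_lines), len(new_lines))
    return 0.0 if total == 0 else 1.0 - matching / total


def looks_like_different_file(old_text, new_text, threshold=0.5):
    """True when more than `threshold` of the lines changed (assumed cutoff)."""
    return changed_line_fraction(old_text, new_text) > threshold


def is_older_than_index(file_modified, indexed_at):
    """True when the local file predates the copy already in the index."""
    return file_modified < indexed_at


small_edit = changed_line_fraction("a\nb\nc\nd", "a\nb\nc\nx")
unrelated = looks_like_different_file("a\nb", "x\ny")
stale = is_older_than_index(datetime(2018, 10, 1), datetime(2018, 10, 17))
print(small_edit, unrelated, stale)
```

A one-line edit in a four-line file stays under the threshold and reads as an update, while two files with no lines in common are flagged as different. Both checks require reading the file and the indexed copy, which is the slowdown raised above.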

Maybe only posting to the production environment would do some of this fancier update-cautious checking, which could be bypassed with a -f option?

@jduss4
Contributor Author

jduss4 commented Oct 18, 2018

I haven't given this a ton of thought, but I think we could gather the ids that are going to be posted, send one request to Elasticsearch querying all of those ids and returning only the id and collection fields, and then quickly see whether any of those collections do not match the current project's name. If so, we could filter those filenames out of the list that will be posted? If it were limited to one request and some filtering, it probably wouldn't take very long. It might require a little re-architecting of when and how things currently happen in the "data manager" class in the data repo...
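
A sketch of the single-request approach, assuming the filename is stored as the Elasticsearch document `_id` and that each document has a `collection` field. It uses the Elasticsearch `ids` query with `_source` filtering, which does accept a list of IDs. The helper names and the sample response are hypothetical, and the response is hand-written in the standard search response shape, not fetched from a real index.

```python
def build_id_lookup_query(ids):
    """Search request body that looks up many documents by ID in one request."""
    return {
        # Ask for as many hits as IDs; very large batches may need chunking.
        "size": len(ids),
        # Return only the collection field; the ID arrives as each hit's _id.
        "_source": ["collection"],
        "query": {"ids": {"values": list(ids)}},
    }


def collections_by_id(response):
    """Map each returned document ID to its collection name."""
    return {
        hit["_id"]: hit["_source"].get("collection")
        for hit in response["hits"]["hits"]
    }


def filter_postable(ids, current_collection, owners):
    """Keep IDs that are new or already owned by the current collection."""
    postable, blocked = [], []
    for doc_id in ids:
        owner = owners.get(doc_id)
        # IDs missing from the index (or lacking a collection) are postable.
        if owner is None or owner == current_collection:
            postable.append(doc_id)
        else:
            blocked.append(doc_id)
    return postable, blocked


ids = ["let0001", "poem001", "new0001"]
body = build_id_lookup_query(ids)

# Hand-written stand-in for the search response to the body above.
fake_response = {"hits": {"hits": [
    {"_id": "let0001", "_source": {"collection": "cather_letters"}},
    {"_id": "poem001", "_source": {"collection": "whitman"}},
]}}
postable, blocked = filter_postable(
    ids, "cather_letters", collections_by_id(fake_response)
)
print("Will post:", postable)
print("Filtered out:", blocked)
```

The filtering itself is a dictionary lookup per ID, so the cost is dominated by the one search request.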

@techgique
Member

That sounds good to me. I'm less familiar with constructing the Elasticsearch queries, so I hadn't thought of doing it that way. 👍

@jduss4
Contributor Author

jduss4 commented Oct 18, 2018

I think that you can query a list of things like that? I guess we'll find out :)

@wkdewey wkdewey added the important finding the most important issues label Sep 16, 2021
@wkdewey wkdewey added this to the V2.0 milestone May 11, 2022

4 participants