Check if ID will be overwritten by different project #124

jduss4 · 2018-10-17T14:12:44Z

I need to do a little testing to 100% confirm this is the case, but since we are using filenames as the IDs for documents in elasticsearch, I worry that if two projects have the same ID for a file, the documents will overwrite each other silently. This is a particular risk for IDs that are not namespaced by project (https://github.com/whitmanarchive/whitman-issues/issues/27).

@karindalziel suggests some kind of pre-indexing check that would compare all the IDs about to be pushed against those already in the API index, and warn the user if an ID by that name already belongs to a different collection.

Any other ideas?

jduss4 · 2018-10-17T14:12:54Z

Also related to #123

techgique · 2018-10-17T15:51:00Z

The posting script could just skip indexing a file if it's already present and note that in the report printed at the end. And maybe add a -f force option to overwrite? It could require that the id be explicitly named with the -f option, to be extra thorough and granular.

jduss4 · 2018-10-17T16:50:33Z

@techgique you mean if the id belongs to a different collection? Because in general, we always want to overwrite existing IDs, since we would consider that an update. For example, re-indexing Cather Letters or something.

jduss4 · 2018-10-17T16:51:25Z

I like the idea of a report or a -force override by ID. We could have a nuclear option somewhere, too, for cases like somebody pushing a project under the wrong collection name and wanting to completely redo it by pushing to the correct collection.
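
To make the skip, report, and force-by-ID idea concrete, here is a minimal Python sketch. It is not the posting script's actual behavior: `parse_args`, `plan_posting`, the `--force` long option, and the `existing` mapping (ID to the collection that currently owns it in the index) are all hypothetical names chosen for illustration.

```python
import argparse


def parse_args(argv):
    # Hypothetical CLI: -f must be followed by the explicit IDs to overwrite.
    parser = argparse.ArgumentParser(description="Hypothetical posting script")
    parser.add_argument(
        "-f", "--force", nargs="+", default=[], metavar="ID",
        help="IDs to overwrite even if another collection owns them",
    )
    return parser.parse_args(argv)


def plan_posting(ids_to_post, existing, collection, forced_ids):
    """Split IDs into those to post and those to skip.

    existing maps an ID already in the index to its collection (assumed
    to have been looked up beforehand). Same-collection IDs are treated
    as updates and always posted.
    """
    forced = set(forced_ids)
    to_post, skipped = [], []
    for doc_id in ids_to_post:
        owner = existing.get(doc_id)
        if owner is None or owner == collection or doc_id in forced:
            to_post.append(doc_id)
        else:
            # Owned by a different collection and not forced: report it.
            skipped.append((doc_id, owner))
    return to_post, skipped
```

The `skipped` list is what would feed the report printed at the end, for example one line per ID naming the collection that already owns it.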

techgique · 2018-10-17T23:12:29Z

I wasn't thinking about the updating aspect at the time I wrote my comment, so good point. Without collection prefixing (an alternative to saying "namespacing"?), we wouldn't be able to tell if the id belongs to a different collection unless we open the file and look for evidence, right? That could slow posting down more than we'd like, though.

🤔 Not sure how else we'd tell it's a completely different file from a different project. Diff the files and if the count of different lines is over a certain percentage of the file's total line count? Perhaps file last modified time and index document posted/modified time could be of use in determining if a file contained updates for the index. It looks like you're trying to post an older file... 📎

Maybe only posting to production environment tries to do some of this fancier update-cautious checking that could be bypassed with a -f option?
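
The line-diff idea in this comment could look roughly like the Python sketch below, using the standard library's `difflib`. It assumes the previously indexed version of the file is available as text, which nothing in this thread confirms. `changed_fraction`, `looks_like_different_file`, and the 0.5 threshold are hypothetical.

```python
import difflib


def changed_fraction(indexed_text, local_text):
    """Fraction of lines that do not match between the two versions."""
    old = indexed_text.splitlines()
    new = local_text.splitlines()
    if not old and not new:
        return 0.0
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    matching = sum(block.size for block in matcher.get_matching_blocks())
    # Measure against the longer file so added lines count as changes too.
    return 1.0 - matching / max(len(old), len(new))


def looks_like_different_file(indexed_text, local_text, threshold=0.5):
    # The threshold is an arbitrary placeholder, not a tuned value.
    return changed_fraction(indexed_text, local_text) > threshold
```

A heavily revised file would trip the same check as a file from another project, so this can only serve as a warning, not as proof of a collision.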

jduss4 · 2018-10-18T13:42:34Z

I haven't given this a ton of thought, but I think we could gather the ids that are going to be posted, send one request to elasticsearch that queries all of those ids and returns only the fields id and collection, and then quickly see whether any of the returned collections do not match the current project's name. If so, we could filter those filenames out of the list that will be posted? If it was limited to one request and some filtering, it probably wouldn't take a super long time. It might require a little bit of re-architecting of when and how things currently happen in the "data manager" class in the data repo...
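
A minimal Python sketch of that one-request check, under stated assumptions: it uses the Elasticsearch `ids` query with `_source` filtering, and assumes each indexed document stores its project name in a `collection` field and uses the filename as its `_id`. `build_id_lookup` and `split_conflicts` are hypothetical names, and the sketch only builds the request body and processes hits rather than calling a live server.

```python
def build_id_lookup(ids):
    """Search body asking only for the collection of each given ID."""
    return {
        # The default page size is 10, so request one hit per ID.
        "size": len(ids),
        "_source": ["collection"],
        "query": {"ids": {"values": list(ids)}},
    }


def split_conflicts(ids, hits, collection):
    """Separate IDs that are safe to post from IDs owned elsewhere.

    hits is the list found at response["hits"]["hits"] in the search
    response. A hit without a collection value is treated as a conflict.
    """
    owners = {
        hit["_id"]: hit.get("_source", {}).get("collection") for hit in hits
    }
    safe, conflicts = [], {}
    for doc_id in ids:
        if doc_id in owners and owners[doc_id] != collection:
            conflicts[doc_id] = owners[doc_id]
        else:
            # Not in the index yet, or already owned by this collection.
            safe.append(doc_id)
    return safe, conflicts
```

By default Elasticsearch caps a single page of results at 10,000 hits (`index.max_result_window`), so a very large collection would probably need to send its IDs in batches.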

techgique · 2018-10-18T16:05:48Z

That sounds good to me. I'm less familiar with constructing the Elasticsearch queries, so I hadn't thought of doing it that way. 👍

jduss4 · 2018-10-18T20:15:39Z

I think that you can query a list of things like that? I guess we'll find out :)

jduss4 added help wanted feature request labels Oct 17, 2018

jduss4 assigned karindalziel, jduss4 and techgique Oct 17, 2018

wkdewey added the important finding the most important issues label Sep 16, 2021

wkdewey added this to the V2.0 milestone May 11, 2022
