Shared data synchronization #7618

Closed
tobiasdiez opened this issue Apr 11, 2021 · 19 comments · Fixed by #9220

@tobiasdiez
Member

tobiasdiez commented Apr 11, 2021

With the upcoming implementation of an online version for JabRef, we face the issue of synchronizing data across multiple devices (and eventually different users).

Setting

  • S1: One user, multiple devices: One user may have different devices with JabRef installed and would like to sync the library across all devices. A device may go offline at any point and may reconnect at a later time. JabRef's central server may be seen as one of these "user devices".
  • S2: Multiple users accessing the same database: In the future, it would be nice to share one or multiple entries (or all) with other people, giving them different rights (e.g. teacher > student, students only have read rights; author > coauthor, all have write privileges). Users don't always need to be available.
  • Mix of S1 and S2

Assumptions

JabRef's requirements are summarized as follows:

  • A1: Conflicts at the level of one entry are relatively rare. In S1, it seems most probable that some metadata changes on multiple devices simultaneously (e.g. read status) or that the user changes an entry on one device while having forgotten that she made a similar change already on another device. In S2, multiple users may completely revise the entry independently. But conflicts are not as common as, for example, in collaborative text editing.
  • A2: Conflicts may be resolved manually, since JabRef already has a nice 2-way merge interface and conflicts shouldn't happen often enough to annoy users.
  • A3: Data correctness is valued very highly.
  • A4: Users should always be able to edit their own databases, since otherwise their local workflow with latex compilation is not possible (offline-first).
  • A5: Users may try to edit the same data at the same time (though this should happen rarely for our current user numbers).
  • A6: Users may go offline for some time, without transmitting their data to a central server before.
  • A7: All data that we want to transmit needs to be saved in the user's library, at least to the extent that the bib file works as a snapshot from which the user can start working without losing data.
  • A8: Users can change the bib file manually outside of JabRef, and changes should still be successfully transmitted to the server.
  • A9: Users may edit the library on any device, i.e. there is no "main" device that can be considered to be the single source of truth.
  • A10: The data stored in one entry is relatively small (300-500 bytes), but there may be many entries, so that the total library is usually a few MB.
  • A11: All syncs are user-triggered. In particular, we can always ask the user to resolve conflicts and don't need automatic conflict handling.

Algorithm:

For each entry, we store the following data on the client side:

  • A "shared id" which uniquely identifies the entry for sync purposes.
  • A "client version number" that represents the version of the entry the last time the client synced with the server.
  • A "local version number" that represents the local version of the entry, and which is increased as soon as the entry is modified locally.
  • A "hashed entry" representing the entry at the point when the local version number was last increased.

On the server side, the "shared id" and a "server version number" are stored.

Then the sync operations are implemented as follows:

  • LOAD (JabRef is started and reads the library with sync information): For each entry, check if the stored hashed entry coincides with the hashed version of the entry in the bib file. If not, increase the local version number and update the hashed entry. Then trigger PULL (if the client is online).
  • CHANGE (an entry is modified in JabRef locally): Increase its local version number and update its hash. Then trigger PUSH (if client is online).
  • PULL (e.g. after the client has been offline, or at regular intervals): Send a map of entry shared IDs with their client version numbers to the server; the server sends back the list of entries (including the field values) that have a higher server version number than the client version number that was sent; go to MERGE
  • PUSH (e.g. after an entry is locally modified): Get a list of all entries whose local version number is larger than the client version number (i.e. entries modified locally), and then send these entries to the server. The server checks if the server version number is the same as the client version number (i.e. no update on the server happened in the meantime), and then replaces the server entry with the transmitted entry; otherwise the server returns the server-side entry, which triggers CONFLICT on the client side.
  • SUBSCRIBE is an event that is triggered by the server if the entry has been modified on the server side. The event contains the modified entry(ies). Action on client side is to go to MERGE.
  • MERGE (the server version number is higher than the client version number): Check if the local version number is the same as the client version number (i.e. no update on the client happened); if so, replace the client entry with the server entry and update the client and local version numbers; otherwise go to CONFLICT
  • CONFLICT (if the server entry and the user entry have been modified): If the entries are indeed different, show the merge dialog to let the user resolve the conflict. In case they are the same, set the local and client version numbers to the server version number (another option would be to update the server version number to the client version number, but that would lead to an unnecessary query on another device).
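The decision logic behind PUSH/MERGE/CONFLICT can be sketched in a few lines (a minimal illustration; `SyncDecision` and its names are invented for this sketch and are not JabRef's actual API):

```java
// Illustrative sketch of the three-counter decision logic described above.
// (Invented names, not JabRef's API.)
public class SyncDecision {

    public enum Action { IN_SYNC, TAKE_SERVER, PUSH_LOCAL, CONFLICT }

    public static Action decide(int serverVersion, int clientVersion, int localVersion) {
        boolean serverChanged = serverVersion > clientVersion; // remote edit since last sync
        boolean localChanged = localVersion > clientVersion;   // local edit since last sync
        if (serverChanged && localChanged) {
            return Action.CONFLICT;    // both sides diverged -> show merge dialog
        }
        if (serverChanged) {
            return Action.TAKE_SERVER; // MERGE: replace client entry with server entry
        }
        if (localChanged) {
            return Action.PUSH_LOCAL;  // PUSH: replace server entry with client entry
        }
        return Action.IN_SYNC;         // nothing to do
    }
}
```

The three rows of the example table correspond to `decide(2, 1, 1)`, `decide(1, 1, 2)` and `decide(2, 1, 2)`.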

As an example:

| Server version | Client version | Local version | State | Operation |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | Server entry has been updated; no update of client entry | Replace client entry with server entry |
| 1 | 1 | 2 | Client entry has been updated; no update of server entry | Replace server entry with client entry |
| 2 | 1 | 2 | Client and server entry have been updated | Let user resolve conflict |

Details and overview of (different) options

Lock systems

One node acquires a lock to write changes. All other write changes that happened during the same time are rejected or marked as conflicts that need to be resolved by the user. Here we should favor Optimistic Offline Lock (always allow users to make changes, but committing to the server requires the lock) over pessimistic locks (only allow writing when one has a lock), since the chance for conflicts is low (A1) and users should always be able to make edits even when offline (A4). Optimistic offline lock is the approach we currently use for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write process. If the user's version counter is smaller than the remote version counter, then this is an indication that there was an additional update to the remote database in the meantime, and the local change is rejected.

The current implementation may run into problems when a client is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the entry is updated on the server by user 2 in the meantime (incrementing the server's version counter), then user 1's version < the server version, indicating that the entry of user 1 needs to be replaced by the remote version. But this would overwrite the changes of user 1.

In general, I have the impression that optimistic locks are great for handling concurrent transactions (two users trying to update the same entry at the same time), but have shortcomings when used as a data synchronization protocol.

Optimistic replication

Locking (in the way we use it) ensures that there is essentially only one valid replica (the server version) and all user copies are almost immediately synced, leading to essentially one copy of the data across multiple nodes. In contrast, optimistic replication accepts that for some short time there might be divergent replicas that only eventually converge to a final state. Having divergent versions of the data is a realistic scenario, as users may work offline for some time (A6). The goal of optimistic replication is eventual consistency, which should be sufficient for our purposes.

There are different options varying along different axes.

  • Operations

    • State-transfer: The entire state is submitted (e.g. the complete entry)
    • Operation-transfer: Only the change is submitted (e.g. title changed from A to B, or even more fine grained as "character X inserted at position Y")
    • Intermediate options such as sending only version-deltas

    In principle, we have easy access to the change operations and have already implemented listeners. However, sending these change operations to the server would require that the user is online. In the case that the user is offline, we would need to write a changelog to the bib file, which would be sent to the server once the user goes online again (and upon successful transfer to the server the changelog is reset). Using an external store does not work, because then the bib file is not enough to replay the changes, violating A7. In the same vein, when the user edits the bib file manually, we don't have any direct operation information (A8). For these reasons, we definitely need to support state-transfer at least as a fallback.

  • Conflict-raising

    • Just ignore conflicts, e.g. ignore later updates
    • Syntactic policies are only based on the timing of changes, e.g. if user 1 makes changes while user 2 changed the same entry, then this is marked as a conflict.
    • Semantic policies use application/operation specific knowledge to reduce conflicts.

    Syntactic policies are simpler and generic but cause more conflicts, whereas semantic policies are more flexible but also more complicated and application-specific.
    Ignoring conflicts is not an option in my opinion, due to A3. I would mainly use a mostly-syntactic policy, enriching it with simple semantic rules. For example, additions of different new fields shouldn't result in conflicts. But I wouldn't try to analyze the user's intent if both edit the same field. This marks some changes as conflicts that could in principle be reconciled, e.g. starting from "A is B", user 1 changes "A" to "JabRef" and user 2 changes "B" to "great", which could be resolved to "JabRef is great". A border case are special fields, e.g. two keywords are added, or the read status is changed to "skimmed" on one device and to "read" on another. Proposal: start with a syntactic policy and enrich it with semantic rules based on user feedback.

  • Propagation: Network Topology
    The main communication path should be via the central JabRef server, resulting in a star topology. However, we also need to support the case where the user syncs the database manually (e.g. by using git).

  • Propagation: Degree of synchronicity

    • Pull (manually or regularly) asks for new updates
    • Push broadcasts new updates

    Using GraphQL subscriptions, it should be straightforward to implement a broadcast system for updates (while the user is online). This should be supplemented by a pull after periods of being offline (e.g. JabRef start, connection loss).

  • Maintaining a "happens-before" relationship

    • Lamport clocks / scalar clocks: Every node has its own clock. When submitting an update, the sender increments the clock and attaches the new value to the object. When an update is received, the local clock is set to max(object value, local value) + 1. When an event A happens before B, then the clock of A is smaller than the clock of B. However, the converse may not be true.
    • Vector clocks: Every node has its own logical clock (i.e. a counter) and a copy of the other clocks. Upon submitting a change, the own clock is increased and then the vector of clocks is passed along with the message. When updates are received, the local vector of clocks is updated.

    Vector clocks become problematic when the number of nodes becomes large (not an issue for us in the near future) or dynamic (definitely an issue for us). For the latter, see e.g. Dynamic Vector Clocks for Consistent Ordering of Events in Dynamic Distributed Applications.
    Nice overview of many variations of vector clocks.
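As a minimal illustration of the vector-clock comparison (invented names, not JabRef code): two clocks are ordered only if one dominates the other in every component; mixed components mean the events are concurrent, i.e. a potential conflict.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal vector-clock sketch (illustrative, not JabRef code). Each node
// keeps a counter per node id; comparing two clocks yields a partial order.
public class VectorClock {

    public enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    private final Map<String, Integer> times = new HashMap<>();

    // Increment this node's own counter, e.g. when the entry is modified locally
    public void tick(String node) {
        times.merge(node, 1, Integer::sum);
    }

    public int timeOf(String node) {
        return times.getOrDefault(node, 0);
    }

    // BEFORE: every component <= and at least one <; mixed components -> CONCURRENT
    public Order compare(VectorClock other) {
        boolean less = false;
        boolean greater = false;
        Set<String> nodes = new HashSet<>(times.keySet());
        nodes.addAll(other.times.keySet());
        for (String node : nodes) {
            int a = timeOf(node);
            int b = other.timeOf(node);
            if (a < b) less = true;
            if (a > b) greater = true;
        }
        if (less && greater) return Order.CONCURRENT; // simultaneous edits -> conflict
        if (less) return Order.BEFORE;
        if (greater) return Order.AFTER;
        return Order.EQUAL;
    }
}
```

Note that the map grows with the number of nodes, which is exactly the scaling concern mentioned above.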

Old proposed algorithm
Every node has a vector clock (for each entry) and increments its own time once the entry is modified. Every update message from the server includes the vector clock, which is then used to determine the merge strategy by comparing it with the local vector clock:

  • If the local and server time for this node is the same, then we know that the user didn't modify the entry locally, so we can simply copy the server entry over the local entry.
  • If the times differ, then there were two simultaneous modifications (at the server and locally). Thus, we try to merge the changes automatically and if this doesn't succeed then show the merge dialog (see above for conflict resolution).

When the user reconnects and wants to update (pull sync), then she sends all (shared) ids of entries and the local vector clock to the server. The response will be all entries with more recent times (in any of the clocks).
This is based on the SData description.

One question I still have is how to store the vector clock (in the entry?) and name the nodes (user + device name?).

Other approaches:

  • Operational transformation: Doesn't work for us, as we cannot reliably keep track of all operations.
  • Differential Sync
  • CRDTs


@koppor
Member

koppor commented Apr 11, 2021

(A4). Optimistic offline lock is the approach we currently use for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write process. If the user's version counter is smaller than the remote version counter, then this is an indication that there was an additional update to the remote database in the meantime, and the local change is rejected.

The change is not rejected, there is a merge. In case user1 modifies fieldA and user2 modifies fieldB, both updates go in. If both modify fieldA, there is the "resolve conflicts" dialog shown. Similar to git, if one modifies the same line of code.

One could implement something more intelligent here, if domain-specific knowledge is used. For instance, if an author is added by user1 and user2, the update algorithm could just add the two authors.

The current implementation may run into problems when a client is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the entry is updated on the server by user 2 in the meantime (incrementing the server's version counter), then user 1's version < the server version, indicating that the entry of user 1 needs to be replaced by the remote version. But this would overwrite the changes of user 1.

No. See above. JabRef implements the above merge algorithm.

It does not work well if two persons are working on the same field at the same time: The user always gets the merge entries dialog presented.

  Using GraphQL subscriptions, it should be straightforward to implement a broadcast system for updates (while the user is online). This should be supplemented by a pull after periods of being offline (e.g. JabRef start, connection loss).

We implemented that in our Postgres setting.

  • Maintaining a "happens-before" relationship

This is, more or less, optimistic offline lock.

@JabRef/developers What do you think?

Stick with optimistic offline lock. It seems to work very well for the cases you described. Furthermore, it is very simple to implement. It works both in a "real time" data synchronization setting and in an offline setting.

@tobiasdiez
Member Author

tobiasdiez commented Apr 11, 2021

Thanks @koppor, apparently I misread the code a bit. So what happens if the user is offline, changes something, and later goes online? When is the version counter incremented, and how do you recognize that there were changes to the local database in the last session that need to be pushed to the server?

@koppor
Member

koppor commented Apr 11, 2021

Side comment: We can try the effects with a shared Postgres instance. See https://docs.jabref.org/collaborative-work/sqldatabase#try-it-out for details.

I assume the tool is not crashing. (If we discuss crashing, we need a local database storing the local id.) I hope I recall everything correctly (otherwise, I would need to buy the POSA book).

The secret sauce is that the client stores the version number of the record when it received the record from the server (call it first-number). The client increases that number by one when writing. The write is done in one transaction:

  • begin transaction
  • read number of server
  • if current number of server == first-number then continue else ABORT
  • write record
  • end transaction

In case another client pushed in between, the "current number of server" is not equal to "first-number". Thus, the client needs to retrieve the data (updating the first-number), perform a merge, and retry.

  • Version numbers are preferred over timestamp-based versioning, since system clocks are unlikely to be properly synchronized across different machines
  • Alternative to version number: Include values of all fields as they have been read in the WHERE clause.
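The transaction steps above can be sketched against an in-memory stand-in for the server (illustrative only; the real implementation runs this as a database transaction, and `OptimisticStore` is an invented name):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the optimistic offline lock commit described above.
// The write succeeds only if nobody else pushed since the client last read
// the record ("first-number"); otherwise the client must pull, merge, retry.
public class OptimisticStore {

    public record Versioned(int version, String value) {}

    private final Map<String, Versioned> records = new HashMap<>();

    public synchronized Versioned read(String id) {
        return records.getOrDefault(id, new Versioned(0, ""));
    }

    // Returns true if the write committed; false corresponds to ABORT:
    // another client pushed in between, so the version moved on the server.
    public synchronized boolean write(String id, int firstNumber, String newValue) {
        int current = read(id).version();          // read number of server
        if (current != firstNumber) {
            return false;                          // ABORT
        }
        records.put(id, new Versioned(firstNumber + 1, newValue)); // write record
        return true;                               // end transaction
    }
}
```

The `synchronized` methods stand in for the transaction boundaries; a real database would enforce this with its own isolation guarantees.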

@koppor
Member

koppor commented Apr 11, 2021

All records need to be updated where first-number != local-number. (Where local-number is first-number++ when the record is updated locally)

@koppor
Member

koppor commented Apr 12, 2021

Small note: Back in 2016, we did not have the concept of ADRs at hand. Everything discussed was between @obraliar and me. This synchronization concept was developed in the "StuPro 2015/2016". - It just came to my mind that the key/value field storage is one consequence of that: With a version on each field, JabRef knows whether the field was updated remotely. Thus, we know whether we have to update a field. Thus, it is possible to work on the same entry. Since we did not implement "smart merge" on a field basis, one gets the "merge entries" dialog if two users concurrently work on one field.

Note that only Postgres and Oracle support live synchronization (because they offer a pub/sub system). MySQL does not offer pub/sub; thus, in the case of MySQL, the user has to poll for changes regularly.

@tobiasdiez
Member Author

Thanks Olly. I think I understood the algorithm. But I have trouble seeing how it works if not all changes are immediately pushed to the server. Start with some entry, at version = 0. Say you are offline and remove a field, close JabRef (version is still 0 since it was not yet transmitted to the server, right?). In the meantime someone else modifies the entry on the server, so that the server version = 1. Now you start JabRef in online mode. The server notifies you about the update, and the algorithm recognizes that 0 = local.version < server.version = 1, so the server version of the deleted field is used (right?). But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed and not that the server version had the field added?
In more theoretical terms, the problem is that a Lamport clock only provides a "happened-before" partial order. That is, if version counter A is smaller than version counter B, then you know that A didn't happen after B. This is perfect for the update scenario you described above, where we can use it to detect updates to the server replica. However, it is not possible to conclude from versionA < versionB that B happened after A. See https://en.wikipedia.org/wiki/Lamport_timestamp#Implications.

The easiest solution (and if I understood it correctly, one can actually prove that it is the minimal data structure doing this job) is to keep track of the local version as well, i.e. have an additional counter for each node that is increased when the local replica is modified. This is exactly the version / vector clock approach suggested above. This would be only a small modification of the offline lock algorithm to gracefully handle offline scenarios as well.

@tobiasdiez
Member Author

I thought a bit more about it and came to the realization that the "vector clock" solution is a bit more complicated than what we need. If we have a star architecture, then every sync goes via the central server. Thus, a client doesn't need to know the local version of another client (which is what the vector clock gives). Instead, it suffices for the client to keep track of local updates relative to the server version. Thus my new proposal would be:

  • Each entry is identified by a unique identifier.
  • For each entry, the client has a client version number that represents the version of the entry the last time the client synced with the server. (This is exactly as in the optimistic lock, and is already implemented)
  • In addition, the client has a local version number for each entry that is increased as soon as the entry is modified locally.

Then the sync operations can be implemented as follows:

  • PULL (e.g. after the client has been offline, or at regular intervals if the server doesn't provide push notifications): Send a map of entry IDs with their client version numbers; the server sends back a list of entries which have a higher version number on the server; go to MERGE
  • PUSH (e.g. after the entry is locally modified): Get a list of entries whose local version number is larger than the client version number, and send these entries to the server. If server version number is the same as the client version number (i.e. no update on the server happened), then the server entry is replaced with the user entry; otherwise go to CONFLICT
  • MERGE (if the server version is higher than the version number of the client): If local version number is the same as client version number (i.e. no update on the client happened), then replace user entry with server entry; otherwise go to CONFLICT
  • CONFLICT (if the server entry and the user entry have been modified): show merge dialog to let user resolve the conflict

In addition, the server, local and client version number need to be updated accordingly after each of these operations.

| Server version | Client version | Local version | State | Operation |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | Server entry has been updated; no update of client entry | Replace client entry with server entry |
| 1 | 1 | 2 | Client entry has been updated; no update of server entry | Replace server entry with client entry |
| 2 | 1 | 2 | Client and server entry have been updated | Let user resolve conflict |

If client and local version numbers are stored in the entry, then A7 is supported. However, manual changes to the bibtex file are not supported (except if the user also increases the local version number).

What do you think, @koppor?

@koppor
Member

koppor commented May 20, 2021

What you are explaining in the table is the optimistic offline lock.

(Assumption: "local version" = ".bib" file. Then, it is easy, too.)

As soon as the "client" is started, it knows the version of the .bib file from when it was last synced. Then, it loads the local bib file. In case of any differences, it increases the version number. Then, it starts synchronizing with the server.

Cold start: The client is not aware of anything. In case the bib file already exists on the server, a full comparison has to be made.

Please note that we put the version number on fields; thus, users can edit the same entry. The merge-entries dialog only pops up if they touch the same field.

@koppor
Member

koppor commented May 20, 2021

I found our implementation: org.jabref.logic.shared.DBMSSynchronizer#synchronizeLocalDatabase, called from org.jabref.logic.shared.DBMSSynchronizer#pullChanges.

The tests are in org.jabref.logic.shared.SynchronizationTestSimulator.

Oh, wow, with that in mind, one has "just" to implement org.jabref.logic.shared.DatabaseSynchronizer for synchronization with JabRef online. And everything works.

@tobiasdiez
Member Author

You are right, the algorithm is very similar to the optimistic lock. However, there you only have one "version" that is incremented upon change and used for the comparison. This works perfectly when you always have connection to the server and can directly push changes. However, when you are offline and change an entry this may lead to problems, as described above:

Start with some entry, at version = 0. Say you are offline and remove a field, close JabRef (version is still 0 since it was not yet transmitted to the server, right?). In the meantime someone else modifies the entry on the server, so that the server version = 1. Now you start JabRef in online mode. The server notifies you about the update, and the algorithm recognizes that 0 = local.version < server.version = 1, so the server version of the deleted field is used (right?). But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed and not that the server version had the field added?

That's why I proposed to have a "local version" that keeps track of unsynced local changes (you can see this as another optimistic lock).

As soon as the "client" is started, it knows the version of the .bib file from when it was last synced.

Keeping a copy of the last synced state would indeed be another solution for this issue. Is this already implemented?

Thanks also for the input of field vs entry sync. Need to think about this in more detail. What are the advantages of the field sync in your opinion?

@koppor
Member

koppor commented May 24, 2021

Start with some entry, at version = 0. Say you are offline and remove a field,
close JabRef (version is still 0 since it was not yet transmitted to the server, right?).

The client has to be smart enough to note that it locally changed something. This is what you described by "keeping a copy of the last synced state".

I agree, we don't keep the copy. When we thought about these issues, we found out that BibTeX data is a key/value object: key1 could be authors, key2 journal, etc. Thus, we can put versions on fields!

This also answers your question:

Thanks also for the input of field vs entry sync. Need to think about this in more detail. What are the advantages of the field sync in your opinion?

If the server touches field1 and we modify field2, we can easily (!) be aware of these changes.

If we do not do that and version the complete entry as a whole, then we run into issues as you describe. Then, the local view is absolutely necessary to correctly merge the updates.

But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed and not that the server version had the field added?

With version on the field and notifications that fields are added and removed.

That's why I proposed to have a "local version" that keeps track of unsynced local changes (you can see this as another optimistic lock).

We somehow tried to avoid the local extra copy, but I agree it's necessary if one really works offline.


I would store a synchronization state instead of the plain data view received from the server.

field1:
  serverVersion: 1
  clientVersion: 1
field2:
  serverVersion: 1
  clientVersion: 2

Then, each sync attempt can use this information. If the server version changed, it needs to be fetched and merged accordingly.
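Under the assumption that the last synced state of each field is available (as in the sync state sketched above), the field-wise merge can be illustrated as a three-way merge; `FieldMerge` is an invented name, not JabRef's implementation. Disjoint field edits merge automatically, and only fields changed on both sides to different values are raised as conflicts:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Sketch of a field-wise three-way merge (illustrative, not JabRef code).
// base = last synced state, local = the client's entry, server = the
// server's entry. mergedOut receives the merged fields; the returned set
// contains the fields that need the merge-entries dialog.
public class FieldMerge {

    public static Set<String> merge(Map<String, String> base,
                                    Map<String, String> local,
                                    Map<String, String> server,
                                    Map<String, String> mergedOut) {
        Set<String> conflicts = new HashSet<>();
        Set<String> fields = new HashSet<>(base.keySet());
        fields.addAll(local.keySet());
        fields.addAll(server.keySet());
        for (String field : fields) {
            String b = base.get(field);
            String l = local.get(field);
            String s = server.get(field);
            boolean localChanged = !Objects.equals(b, l);
            boolean serverChanged = !Objects.equals(b, s);
            if (localChanged && serverChanged && !Objects.equals(l, s)) {
                conflicts.add(field); // both sides edited the same field
            } else if (localChanged) {
                put(mergedOut, field, l); // local edit wins (server unchanged)
            } else {
                put(mergedOut, field, s); // server edit, or field unchanged
            }
        }
        return conflicts;
    }

    private static void put(Map<String, String> map, String field, String value) {
        if (value != null) {
            map.put(field, value); // a null value means the field was deleted
        }
    }
}
```

This also covers the deleted-field scenario discussed earlier: a field present in the base but absent locally counts as a local change and is deleted in the merge result.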

I thought this is easier than a BibTeX diff. However, a BibTeX diff can IMHO produce similar data. For @obraliar and me, the field versions seemed easier to implement and test; we found a proper BibTeX diff harder than a field diff.

In the case of Overleaf, I would do a BibTeX diff. Based on that, I would create a field diff and set the version numbers of the fields accordingly. Based on that, I would sync the local state with the server state. That would be easier for me, as there is more control over each phase (diffing -> handling of field diffs).

@tobiasdiez
Member Author

Thanks again. I thought a bit more about the per-field vs per-entry versioning system. Disadvantages I see with per-field are as follows:

  • Higher data volume: At the beginning of the sync session, you need to query the server for the version of each field. That's almost the same payload as just querying the complete library. For a relatively normal database of 10k entries, you will need to query 100k version numbers.
  • There might be edge cases where users would actually prefer to be warned about updates of other fields. Consider, for example, a book entry whose authors correspond to another edition than what is specified in the edition field. User A recognizes this and updates the edition to the correct edition. User B also recognizes the mistake, but chooses to keep the edition and instead updates the authors. This can be merged without any conflicts, as the users updated different fields, but the resulting entry is wrong.

Also, if I understand the code correctly, it currently only supports versions per entry anyway.

if (idVersionEntry.getValue() > localEntry.getSharedBibEntryData().getVersion()) {

Do I miss something here?

@koppor
Member

koppor commented Jun 7, 2021

(Notes from DevCall 2021-06-07)

TLDR: Version bibtex entry, make merge server-side in one transaction --> offer options to user

Where to store version?

  • store in the entry
    • URL or URI? (clickable or not)
    • jabref: https://jabref.org/entries/koppor-shared-bib/entry-1-clientid5-v1
    • what if user edits entry manually without using JabRef
      • how can JabRef know that user modified entry and thus has to increase local version?
      • JabRef uses the timestamp to know if something changes, then asks server what's changed.
    • what if user edits URL?
      • does server keep track of last client version to do some sanity check?
  • separate file: version only
    • what if bib file is put into version control
      • other user also syncs it to JabRef online
    • what to do if entry is sent around? / version file is missing?

Versioning

Optimistic offline lock

"Relative time" (see also vector clocks)

  • Good, because well-established in database field

Timestamp-based sync

  • Bad, because relies on local time of each user - time sync is very difficult

Personal notes:

* Higher data volume: At the beginning of the sync session, you need to query the server for the version of each field. That's almost the same payload as just querying the complete library. For a relatively normal database of 10k entries, you will need to query 100k version numbers.

No -> server-side processing.

* There might be edge cases where users would actually prefer to be warned about updates of other fields. Consider, for example, a book entry whose authors correspond to another edition than what is specified in the edition field. User A recognizes this and updates the edition to the correct edition. User B also recognizes the mistake, but chooses to keep the edition and instead updates the authors. This can be merged without any conflicts, as the users updated different fields, but the resulting entry is wrong.

Fuck :)

Also, if I understand the code correctly, it currently only supports versions per entry anyway.

if (idVersionEntry.getValue() > localEntry.getSharedBibEntryData().getVersion()) {

Do I miss something here?

Can't be true. Code lies. :)

@tobiasdiez

I finally found a bit more time to think about this.

Olly had a very good point during the DevCall: users can also edit entries outside of JabRef, and these edits need to be picked up by JabRef before it syncs (the scenario is: user closes JabRef > edits the bib file manually > starts JabRef again). Regardless of where we save the local version number (in the entry or in a separate file next to the entry), such user edits don't trigger an update of the version number, and thus the sync algorithm gets confused. This means we need a reliable way to make sure the local version is also increased when the user changes the bib file outside of JabRef.

My proposal for this is as follows: In a separate file (called "sync file" in the following), we save the following:

  • A hash of each entry's fields
  • A local version number

Once JabRef starts (or, more precisely, loads the bib file) and a sync file exists, JabRef computes the hashes for all entries and compares them with the hashes in the sync file. If the hashes coincide, we do nothing. But if the hashes differ, the user changed the entry manually and we increase the local version number. Only after this update process is the sync with the server initiated.
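A minimal sketch of that check, assuming entries are represented as plain field maps. `SyncCheck`, `hashFields`, and `reconcile` are illustrative names, SHA-256 is just one reasonable hash choice, and the field serialization is deliberately naive:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the proposed external-change detection via a hash per entry.
class SyncCheck {
    /** Stable hash over an entry's fields (sorted so field order doesn't matter). */
    public static String hashFields(Map<String, String> fields) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
                // Naive "key=value" serialization; good enough for a sketch.
                digest.update((e.getKey() + "=" + e.getValue() + "\n").getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is guaranteed on every JVM
        }
    }

    /** Returns the new local version: bumped iff the entry was changed outside JabRef. */
    public static int reconcile(Map<String, String> fields, String storedHash, int storedVersion) {
        return hashFields(fields).equals(storedHash) ? storedVersion : storedVersion + 1;
    }
}
```

Running this check for every entry right after loading the bib file would ensure that manual edits bump the local version before the first server round-trip.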

@JabRef/developers What do you think?

I'll update the issue description with the new proposal.

@calixtus

Does this mean we would need to keep two files for one library in the future - a library file and a hash file? That would probably be the worst solution imho.

@tobiasdiez

One could also store the hash in the entry itself.

But why do you think a separate file is not a good solution? There are certain advantages to storing this externally, such as not polluting the entry with sync-related data (which makes it easier to send the entry to other people), and users cannot accidentally delete or modify the sync data, making everything more robust. On the other hand, it shouldn't be too hard to store the file out of sight of the user (say in %AppData%) so that they don't even notice we keep this data.

@calixtus

My concern was regarding the portability of the libraries across different PCs. Would you have to share the hash files then too?

@tobiasdiez

No, the user doesn't need to copy these hash files; they are solely managed by JabRef.
If the hash file does not exist, JabRef downloads the most recent version of the entries from the database and compares them with the local entries. Any differences are treated as conflicts and need to be resolved manually by the user (since we don't know which version is older/newer). Then the hash file is created.

However, if the database is synced externally as well (say using git or Dropbox), there is a small performance advantage to syncing the hash file too (which can be taken as an argument for embedding the hash in the entry itself). For example, consider the following scenario (with no sharing of the hash file):

  • On Computer A, the user changes an entry; then syncs it with git and with JabRef server.
  • On Computer B, the user syncs with git and then opens JabRef. The changes to the entry are recognized as external changes (since the hash differs from the one stored in JabRef's hash file). Thus, JabRef updates its local version number for this entry and starts the sync process: download the current server version and compare it with the local entry; they turn out to be the same, so no manual conflict resolution is necessary.

If the hash file is shared, the changes via the git sync are not recognized as external changes and no unnecessary server query is invoked.

That's the only small advantage I see in storing the hash in the entry itself.

@calixtus

Yes, that was more or less what I meant. 😅
But if you say the performance disadvantage is negligible when not sharing the hash file as well, I guess it's ok.
