
Entry IDs aren't guaranteed to be unique #2273

Closed
rianhunter opened this issue Mar 15, 2019 · 13 comments

@rianhunter

commented Mar 15, 2019

Basically as the title says:

In the entrytmp table, IDs are computed by appending the current microseconds to the entry date. This is not guaranteed to be unique.

When adding to the entry table from the entrytmp table, IDs are computed from the max ID of the entrytmp table and incremented. If two runs produce the same value for the MAX(id) - COUNT(*) expression, you will get an ID collision.

The entrytmp table seems to be used to get a date-oriented ordering for the id column. Why is this necessary? Why not order by the date column in listWhereRaw() and sqlListWhere()?
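To make the first failure mode concrete, here is an illustrative Python sketch of the derivation described above (the real code is PHP, `min(time(), $entry_date) . uSecString()`, quoted later in the thread; names and values here are illustrative):

```python
def usec_string(usec: int) -> str:
    # Zero-padded 6-digit microsecond part, like FreshRSS's uSecString()
    return f"{usec:06d}"

def make_entry_id(entry_date: int, now: int, usec: int) -> int:
    # Mirrors `min(time(), $entry_date) . uSecString()`: the seconds part
    # comes from the (feed-controlled) entry date when it lies in the past.
    return int(str(min(now, entry_date)) + usec_string(usec))

now = 1552694400       # some "current" server time (seconds)
old_date = 1234567890  # feed-supplied date, far in the past

# Two distinct articles reusing the same old date collide if the server
# happens to land on the same microsecond value for both:
id_a = make_entry_id(old_date, now, 123456)
id_b = make_entry_id(old_date, now, 123456)
assert id_a == id_b  # collision: same ID for two different entries
```

Because the seconds part is taken from the feed-supplied date, a feed can steer the high-order part of the ID, which is the crux of the report.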

@Alkarex

Member

commented Mar 16, 2019

Hello,
The date column is the date declared by the feed. It is unreliable and purely informative; it is sometimes several years in the past or in the future. We have had several long discussions on the topic.
Although there is indeed no SQL guarantee of uniqueness, the general logic is supposed to ensure uniqueness. Have you been hit by a case of non-uniqueness?

@rianhunter

Author

commented Mar 16, 2019

I haven't, but it seemed like a glaring error. A malicious feed could potentially cause FreshRSS to never update by intentionally reusing old dates, triggering a uniqueness constraint violation when executing the SQL.

Feel free to close if you don't care to fix this, just wanted to bring attention to this.

@Alkarex

Member

commented Mar 16, 2019

The IDs are produced using local server time, in microseconds, and the data from the feeds has no role in producing IDs. Collisions would be possible only if we were to add more than one million articles per second on average for a long period, which seems quite unlikely.
It is always good to track potential bugs, so thanks for that. For now, I think we are safe here :-)

@Alkarex Alkarex closed this Mar 16, 2019

@rianhunter

Author

commented Mar 16, 2019

The IDs are produced using local server time, in microseconds, and the data from the feeds has no role in producing IDs.

That's not true. This is the flow of how IDs are derived; they clearly originate from the entry date:

$id = min(time(), $entry_date) . uSecString();

$this->addEntryPrepared->bindParam(':id', $valuesTmp['id']);

$sql = 'SET @rank=(SELECT MAX(id) - COUNT(*) FROM `' . $this->prefix . 'entrytmp`); ' . //MySQL-specific
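The second step, the `MAX(id) - COUNT(*)` ranking, can also be modelled in a few lines of Python (a hypothetical helper, not the actual SQL): two runs that start from the same base value assign overlapping IDs.

```python
def assign_ids(entrytmp_ids):
    # Mirrors `SET @rank = (SELECT MAX(id) - COUNT(*) FROM entrytmp)`
    # followed by incrementing @rank once per inserted row.
    rank = max(entrytmp_ids) - len(entrytmp_ids)
    return [rank + n for n in range(1, len(entrytmp_ids) + 1)]

# Run 1: three temporary entries, max id 1000
run1 = assign_ids([998, 999, 1000])   # [998, 999, 1000]

# Run 2: different entries, but the same MAX(id) - COUNT(*) value
run2 = assign_ids([900, 950, 1000])   # [998, 999, 1000] again

assert set(run1) & set(run2)  # overlapping IDs across the two runs
```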

@Alkarex Alkarex reopened this Mar 16, 2019

@rianhunter

Author

commented Mar 16, 2019

Also, if the server time is adjusted, e.g. due to a daylight saving time change, that could cause an ID conflict:

$id = uTimeString();

@Alkarex

Member

commented Mar 16, 2019

@rianhunter You are indeed correct. This is not the main case, though: it happens 1) when adding new feeds, and 2) when processing entries that are too old. This indeed needs to be corrected.

Regarding server time adjustments, Unix timestamps are not supposed to be affected by time zones / daylight saving. We should, though, add a barrier for that as well.

@Alkarex Alkarex added this to the 1.14.0 milestone Mar 16, 2019

@rianhunter

Author

commented Mar 16, 2019

Regarding server time adjustments, Unix timestamps are not supposed to be affected by time zones / daylight saving. We should, though, add a barrier for that as well.

Ah yes, you're right about that. I saw the gettimeofday() call in uTimeString() and assumed it was timezone-sensitive; I see you use the "sec" index.

EDIT: Just to be thorough, this could still be sensitive to time adjustments from, e.g., NTP while updating the feeds, but that isn't the same security concern as deriving the ID from the entry date, and it would work itself out over time. In general I wouldn't consider it good practice to derive unique IDs from wall-clock time.
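A common way to harden wall-clock-derived IDs against backwards clock steps (NTP, manual adjustment) is a monotonic guard: remember the last issued ID and never go below it. A minimal Python sketch of that idea (not FreshRSS code):

```python
import time

class MonotonicIdGenerator:
    """Wall-clock-seeded IDs that never repeat even if the clock
    steps backwards (e.g. after an NTP adjustment). Illustrative only."""

    def __init__(self):
        self._last = 0

    def next_id(self) -> int:
        candidate = int(time.time() * 1_000_000)  # microsecond timestamp
        if candidate <= self._last:
            candidate = self._last + 1  # clock went back or repeated: bump
        self._last = candidate
        return candidate

gen = MonotonicIdGenerator()
ids = [gen.next_id() for _ in range(1000)]
assert len(set(ids)) == 1000  # all unique
assert ids == sorted(ids)     # strictly increasing
```

The IDs stay roughly time-ordered but uniqueness no longer depends on the clock behaving.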

@Alkarex

Member

commented Mar 16, 2019

For the two cases above, for a collision to occur, a feed would have to provide a date (in seconds) already existing in the database, and then there is a 1-in-1,000,000 chance of hitting the same microsecond using the server local time. So low risk, but still
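That estimate can be sanity-checked with quick arithmetic (the entry count below is illustrative):

```python
# For a collision under the old scheme, a new entry must (1) carry a
# feed-supplied date (in whole seconds) already present in the database,
# and (2) land on the same microsecond value as the earlier entry.
p_same_usec = 1 / 1_000_000

# Even if a malicious feed forces the date match for every entry,
# the expected number of collisions only grows with the entry count:
entries_with_reused_date = 100_000
expected_collisions = entries_with_reused_date * p_same_usec
assert 0.09 < expected_collisions < 0.11  # ~0.1 expected collisions
```

So a single collision is unlikely, but a hostile feed retrying indefinitely shifts the odds in its favour over time.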

@rianhunter

Author

commented Mar 16, 2019

So low risk, but still

Yup, that's why I said to feel free to close :) Though I can see something like this causing a bug in an obscure way down the line, so it's probably worth making it 100% solid in the long term.

@aledeg

Contributor

commented Mar 16, 2019

@rianhunter Do you have any suggestion on how we can fix that?

Alkarex added a commit to Alkarex/FreshRSS that referenced this issue Mar 16, 2019

@Alkarex Alkarex referenced this issue Mar 16, 2019

Merged

No old ID #2276

@Alkarex

Member

commented Mar 16, 2019

In #2276 I have removed the cases that could generate old IDs.

There remains the case of a very unlucky server time change.

With MySQL / SQLite, articles with colliding IDs are ignored, and we would try to add them again with another ID at the next update. I think this is good enough.
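The MySQL / SQLite behaviour described here can be demonstrated with SQLite's `INSERT OR IGNORE` (table and column names are illustrative, not the actual FreshRSS schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, guid TEXT)")

conn.execute("INSERT OR IGNORE INTO entry (id, guid) VALUES (1, 'a')")
conn.execute("INSERT OR IGNORE INTO entry (id, guid) VALUES (1, 'b')")  # collides

rows = conn.execute("SELECT id, guid FROM entry").fetchall()
assert rows == [(1, 'a')]  # the colliding row was silently skipped

# Next update: the same article retried with a fresh ID succeeds
conn.execute("INSERT OR IGNORE INTO entry (id, guid) VALUES (2, 'b')")
assert conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0] == 2
```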

With PostgreSQL, we can either update our requirements to version 9.5+ to finally take advantage of ON CONFLICT DO NOTHING, or alter the request with a more complicated expression.
Edit: Done

$sql = 'DO $$
DECLARE
maxrank bigint := (SELECT MAX(id) FROM `' . $this->prefix . 'entrytmp`);
rank bigint := (SELECT maxrank - COUNT(*) FROM `' . $this->prefix . 'entrytmp`);
BEGIN
INSERT INTO `' . $this->prefix . 'entry` (id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags)
(SELECT rank + row_number() OVER(ORDER BY date) AS id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
FROM `' . $this->prefix . 'entrytmp` AS etmp
WHERE NOT EXISTS (SELECT 1 FROM `' . $this->prefix . 'entry` AS ereal WHERE etmp.id_feed = ereal.id_feed AND etmp.guid = ereal.guid)
ORDER BY date);
DELETE FROM `' . $this->prefix . 'entrytmp` WHERE id <= maxrank;
END $$;';

P.S.: The last Ubuntu shipping PostgreSQL < 9.5 is Ubuntu 14.04 LTS, whose support ends next month. For Debian, it is Jessie 8.0, supported until May 2020. In the Red Hat family, we need to wait for RHEL/CentOS 8.0 (currently in beta)...

@Alkarex

Member

commented Mar 16, 2019

@aledeg I think the more proper way would be to have a new entry ID column based on auto-increment, and not to use the discovery time (our current ID) for uniqueness. But it is a relatively big change for relatively little gain (one such gain would be possible compatibility with APIs limited to 32-bit integers, such as TT-RSS).
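The auto-increment alternative delegates uniqueness entirely to the database; a minimal SQLite sketch (illustrative schema, not the actual FreshRSS one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- DB guarantees uniqueness
        guid TEXT NOT NULL,
        date INTEGER NOT NULL                  -- feed date, informative only
    )
""")
conn.executemany(
    "INSERT INTO entry (guid, date) VALUES (?, ?)",
    [("a", 1234567890), ("b", 1234567890), ("c", 999)],  # duplicate dates are fine
)
ids = [row[0] for row in conn.execute("SELECT id FROM entry ORDER BY id")]
assert ids == [1, 2, 3]  # unique regardless of feed-supplied dates
```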

@rianhunter

Author

commented Mar 17, 2019

@Alkarex Yeah, I agree. I would also avoid the temporary-table song and dance.

If you really want date-oriented IDs, you can derive the ID as a pair: (entry_date, unique_id). Though I think there might be other constraints on your IDs and on the way queries are made from the UI.
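The (entry_date, unique_id) idea can be packed into a single integer: date in the high bits, a counter in the low bits, so IDs still sort by date while uniqueness never depends on the wall clock. A Python sketch with illustrative bit widths:

```python
import itertools

_counter = itertools.count()  # process-wide sequence; persistent in practice

def date_oriented_id(entry_date: int) -> int:
    # 20 low bits for the counter -> up to ~1M entries per date second
    return (entry_date << 20) | (next(_counter) & 0xFFFFF)

a = date_oriented_id(1234567890)
b = date_oriented_id(1234567890)  # same date, still unique
assert a != b
assert a >> 20 == b >> 20 == 1234567890  # date recoverable, sort order kept
```

In a real deployment the counter would need to survive restarts (e.g. be seeded from the current MAX(id)), and the wrap-around at 2^20 would need handling; this only sketches the shape of the scheme.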

Alkarex added a commit that referenced this issue Mar 19, 2019

No old ID (#2276)
* No old ID

#2273

* PostgreSQL insert or ignore

@Alkarex Alkarex closed this Mar 22, 2019

Alkarex added a commit that referenced this issue Mar 23, 2019
