Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for duplicates may return non-duplicates (sys_file.storage, case insensitive DB, ...) #2

Closed
sypets opened this issue Apr 19, 2023 · 2 comments

Comments

@sypets
Copy link
Contributor

sypets commented Apr 19, 2023

From first look, I would consider this harmful if one is not careful.

There may be 2 reasons non-duplicates are detected as duplicates:

  1. sys_file.storage should be checked as well (if more than one storage used)
  2. if DB checks case-insensitive so files such as /myfile.jpg and /MYFILE.jpg are detected as duplicates (depends on DB collation, e.g. utf8mb4_general_ci is case insensitive)

What collation does TYPO3 use by default?
See
v11.5:

public/typo3/sysext/install/Classes/Controller/InstallerController.php: 'collate' => 'utf8mb4_unicode_ci',

https://docs.typo3.org/c/typo3/cms-core/main/en-us/Changelog/9.5/Feature-80398-Utf8mb4OnMysqlByDefaultForNewInstances.html

Reproduce

Try adding the following examples as files and check if they are detected as duplicates

Examples:

sys_file.storage sys_file.identifier file
1 (fileadmin) /dir1/abc.jpg fileadmin/dir1/abc.jpg
2 (media) /dir1/abc.jpg media/dir1/abc.jpg
1 (fileadmin) /dir1/ABC.jpg fileadmin/dir1/ABC.jpg

recommendation:

  • always check sys_file.storage as well (if duplicate, identifer and storage must be identical)
  • improve DB query, e.g. by checking if the following is identical as well: identifier_hash, sha1 (identifier_hash should suffice, I think)
  • or do a binary compare of the field
  • or do additional compare of the strings in PHP
@sypets
Copy link
Contributor Author

sypets commented Apr 19, 2023

I will create a PR

sypets added a commit to sypets/unduplicator that referenced this issue Apr 19, 2023
sypets added a commit to sypets/unduplicator that referenced this issue Apr 19, 2023
Since grouping by identifier and storage may falsely return
entries as duplicates where identifer is only the same if
compared case-insensitively, we group by identifier_hash as well.

Related: ElementareTeilchen#2
sypets added a commit to sypets/unduplicator that referenced this issue Apr 19, 2023
Du extra duplicate check in PHP instead of relying on identifier_hash.
This is a safer option, though it might be more inefficient.

Related: ElementareTeilchen#2
franzkugelmann pushed a commit that referenced this issue May 8, 2023
* Change version constraints to support TYPO3 v11

* Replace Command.php

Since TYPO3 v11, Services.yaml must be used to define
commands.

* Replace fetch() with appropriate function calls

For newer Doctrine dbal versions, the function calls
should be replaced:

- fetch() => fetchAssoicative()
- fetchAll() => fetchAllAssociative()
- fetchColumn(0) => fetchOne()

* [BUGFIX] Check for storage if checking for duplicates

Related: #2

* Add option --identifer to perform only on this identifier

Makes it more conveniant to check specific behaviour for
only a subset of files.

* Add option storage

* Add return 0 to execute()

Is necessary for TYPO3 v11

* Add additional output

* Make sure duplicate query works on case-insensitive DB collations

Since grouping by identifier and storage may falsely return
entries as duplicates where identifer is only the same if
compared case-insensitively, we group by identifier_hash as well.

Related: #2

* Update README

* Update comment

* Change duplicate check to do extra check in PHP

Du extra duplicate check in PHP instead of relying on identifier_hash.
This is a safer option, though it might be more inefficient.

Related: #2

* Perform a type-safe match

* Make sure nothing is performed if called with --dry-run

Simplify output messages.
Make sure nothing is done if command called with --dry-run.

* Return immediately if no duplicates found

If no duplicates found, the command exists with exit code 0.

If dupicates are found, it exits with exit code 1.

* Add testing-framework

* Add Test scripts

* Add GitHub CI script for running functional tests

* Remove TestEnvironment prepare

Is no longer necessary since TYPO3 v11.

* Add Resources/Public

This is only necessary for the test setup to work properly.

See https://review.typo3.org/c/Packages/TYPO3.CMS/+/71029

* Add functional tests

* Minor cleanup

* Remove rowCount()

This is not recommended in combination with SELECT
(see documentation) and it was not really necessary
as the while() statement is only entered if there
are results.

* Fix PHP version constraints

Use PHP versions supported by TYPO3 v11.

* Remove some unused options from runTests.sh

The option -t for multiple core testing is currently not used.
This can be added at a later time if multiple core testing
should be done in one branch.

Alternative is branching out and only testing one TYPO3 version
in current branch.

Additionally, only composerInstall is used and composerInstallMax
will not be executed in the GitHub CI matrix.

Testing with sqllite is deactivated for now because it fails and
must be investigated further.

* Add cleanBuild to runTests.sh

Cleanup can be performed with runTests.sh. The script
cleanup.sh is removed.

* Comment out some unused commands in runTest.sh

Configuration files for php-cs-fixer, unit test and phpstan are
not supplied yet. The commands are commented out in runTests.sh.
@sypets
Copy link
Contributor Author

sypets commented Jul 17, 2024

I am closing this. I can no longer reproduce the problem. I uploaded files which differed only in case (abc.jpg != ABC.jpg) and uploaded same filenames to different storages. Also, some commits references this issue.

@sypets sypets closed this as completed Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant