Directory names conflicting with file names #12407

Open

jo-pol wants to merge 8 commits into IQSS:develop from DANS-KNAW-jp:directory-name-conflict

doc/sphinx-guides/source/api/native-api.rst

     .. code-block:: bash
-      curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/datasets/:persistentId/files/metadata?:persistentId=doi:10.5072/FK2/J8SJZB" --upload-file file-metadata-update.json
+      curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/datasets/:persistentId/files/metadata?persistentId=doi:10.5072/FK2/J8SJZB" --upload-file file-metadata-update.json
     The ``file-metadata-update.json`` file should contain a JSON array of objects, each representing a file to be updated. Here's an example structure:

doc/sphinx-guides/source/user/dataset-management.rst

     - Files with the same checksum can be included in a dataset, even if the files are in the same directory.
     - Files with the same filename can be included in a dataset as long as the files are in different directories.
     - If a user uploads a file to a directory where a file already exists with that directory/filename combination, the Dataverse installation will adjust the file path and names by adding "-1" or "-2" as applicable. This change will be visible in the list of files being uploaded.
-    - If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, the Dataverse installation will display an error.
+    - If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, or that already exists as a directory, the Dataverse installation will display an error.
     - If a user attempts to replace a file with another file that has the same checksum, an error message will be displayed and the file will not be able to be replaced.
     - If a user attempts to replace a file with a file that has the same checksum as a different file in the dataset, a warning will be displayed.
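The two error cases in the edited rule can be illustrated with a minimal Python sketch. It is not the Dataverse implementation; the function names and the `(directory, name)` pair representation are assumptions made for illustration only.

```python
def full_path(directory, name):
    """Join a directory label and a file name into one path."""
    directory = (directory or "").strip()
    return f"{directory}/{name}" if directory else name


def directory_prefixes(directory):
    """Return a directory label and all of its ancestors, e.g. foo/bar -> {foo, foo/bar}."""
    parts = [p for p in (directory or "").strip().split("/") if p]
    return {"/".join(parts[:i]) for i in range(1, len(parts) + 1)}


def edit_conflict(existing, new_directory, new_name):
    """Return a reason string if the edited path conflicts, else None.

    `existing` is a list of (directory, name) pairs already in the dataset
    (hypothetical representation, not the real data model).
    """
    target = full_path(new_directory, new_name)
    file_paths = {full_path(d, n) for d, n in existing}
    dir_paths = set()
    for d, _ in existing:
        dir_paths |= directory_prefixes(d)
    if target in file_paths:
        return "duplicate file path"
    if target in dir_paths:
        return "path exists as a directory"
    return None
```

For example, with files `foo/bar` and `x` already present, renaming a file to `foo` in the root directory would be rejected because `foo` already exists as a directory.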

scripts/issues/12407/README.md

@@ -0,0 +1,13 @@
+    Detect existing datasets with directories duplicating full file-paths
+    =====================================================================
+    Downloaded zips with directories conflicting with file paths result in an error message when trying to extract the files. This [pull request](https://github.com/IQSS/dataverse/pull/12407) prevents those conflicts.
+    After deploying, users will get error messages when trying to add files to a dataset with a conflicting file/directory path.
+    The file metadata of the conflicting files should be fixed manually, preferably before deploying the pull request, to avoid confusion for users.
+    `scripts/issues/12407/find_duplicates.py` should be executed by the user that owns the `dvndb`.
+    These scripts scan for conflicting datasets. Depending on your preferences and the size of your database you might want a variation of the scripts.
+    In small databases you can drop both `WHERE datasetversion_id IN (:ids)` checks and directly run `find_duplicates.sql`. Another option is to divide the query into chunks with a between-clause on the datasetversion_id. In that case older versions of datasets are also checked, not just the latest version.
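The chunking option mentioned in the README could look roughly like the sketch below. The helper names `id_ranges` and `between_clause` are hypothetical and not part of this pull request; the sketch only produces the inclusive bounds and the clause text that would replace `datasetversion_id IN (:ids)`.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (low, high) bounds that together cover min_id..max_id."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1


def between_clause(low, high):
    """Build the filter for one chunk; int() guards against non-numeric input."""
    return f"datasetversion_id BETWEEN {int(low)} AND {int(high)}"


if __name__ == "__main__":
    for low, high in id_ranges(1, 10, 4):
        print(between_clause(low, high))
```

Because every dataset version id in a range is scanned, this variant also reports conflicts in older versions, as the README notes.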

scripts/issues/12407/find_duplicates.py

@@ -0,0 +1,107 @@
+    #!/usr/bin/env python3
+    import argparse
+    import psycopg2
+    from pathlib import Path
+    from textwrap import dedent
+    def read_sql(path: Path) -> str:
+        text = path.read_text(encoding="utf-8")
+        return "\n".join(
+            line for line in text.splitlines() if not line.lstrip().startswith("\\")
+        )
+    def fetch_dv_ids(conn, find_dv_ids_sql: str) -> list[int]:
+        with conn.cursor() as cur:
+            cur.execute(find_dv_ids_sql)
+            rows = cur.fetchall()
+        # Query returns dv_id as first selected column in your file.
+        return [int(row[0]) for row in rows]
+    def fetch_dataset_info(conn, datasetversion_id: int):
+        # Look up the persistent-identifier parts and version numbers for one dataset version.
+        dataset_query = """
+            SELECT dso.protocol, dso.authority, dso.identifier, dv.versionnumber, dv.minorversionnumber
+            FROM datasetversion dv
+            JOIN dvobject dso ON dso.id = dv.dataset_id
+            WHERE dv.id = %s
+        """
+        with conn.cursor() as cur:
+            cur.execute(dataset_query, (datasetversion_id,))
+            # fetchone() returns None when no row matches.
+            return cur.fetchone()
+    def run_find_duplicates(conn, find_duplicates_sql: str):
+        last_dv_id = None
+        last_info = ("", "", "", "", "")
+        with conn.cursor() as cur:
+            cur.execute(find_duplicates_sql)
+            cols = [d[0] for d in cur.description]
+            extra_cols = ["protocol", "authority", "dataset_id", "versionnumber", "minorversionnumber"]
+            print("\t".join(cols + extra_cols))
+            for row in cur:
+                dv_id = int(row[0])  # datasetversion_id
+                if dv_id != last_dv_id:
+                    fetched = fetch_dataset_info(conn, dv_id)
+                    last_info = fetched if fetched is not None else ("", "", "", "", "")
+                    last_dv_id = dv_id
+                print("\t".join("" if v is None else str(v) for v in (tuple(row) + tuple(last_info))))
+    def main():
+        class RawDefaultsFormatter(
+            argparse.ArgumentDefaultsHelpFormatter,
+            argparse.RawDescriptionHelpFormatter,
+        ):
+            pass
+        parser = argparse.ArgumentParser(
+            description=dedent("""
+                Execute as owner of dvndb.
+                `find_duplicates.sql` is executed for dv_ids returned by `find_dv_ids.sql`.
+                `find_dv_ids.sql` returns the latest version per dataset.
+            """),
+            formatter_class=RawDefaultsFormatter,
+        )
+        parser.add_argument("--min-id", type=int, default=0, help="first dataset-version-id examined by `find_dv_ids.sql`")
+        parser.add_argument("--nr-of-ids", type=int, default=50, help="number of IDs returned by `find_dv_ids.sql`")
+        args = parser.parse_args()
+        conn_kwargs = {"dbname": 'dvndb'}
+        script_dir = Path(__file__).resolve().parent
+        dup_sql_raw = read_sql(script_dir / "find_duplicates.sql")
+        dv_sql = read_sql(script_dir / "find_dv_ids.sql")
+        dv_sql = dv_sql.replace(":min_id", str(args.min_id))
+        dv_sql = dv_sql.replace(":nr_of_ids", str(args.nr_of_ids))
+        try:
+            with psycopg2.connect(**conn_kwargs) as conn:
+                dv_ids = fetch_dv_ids(conn, dv_sql)
+                if not dv_ids:
+                    print("No dv_id values returned by find_dv_ids.sql")
+                    return
+                ids_csv = ",".join(str(i) for i in dv_ids)
+                print(f"dataset version ids: {ids_csv}")
+                run_find_duplicates(conn, dup_sql_raw.replace(":ids", ids_csv))
+        except psycopg2.OperationalError as e:
+            msg = str(e)
+            if "no password supplied" in msg.lower():
+                parser.print_help()
+                raise SystemExit(2)
+            print(f"Database connection failed: {e}")
+            raise SystemExit(1)
+    if __name__ == "__main__":
+        main()

scripts/issues/12407/find_duplicates.sql

@@ -0,0 +1,35 @@
+    \set ids 5,7,9
+    WITH dir_ancestors AS (
+            SELECT DISTINCT
+                datasetversion_id,
+                array_to_string((string_to_array(path, '/'))[1:n], '/') AS path
+            FROM (
+                     SELECT DISTINCT
+                         datasetversion_id,
+                         NULLIF(BTRIM(directorylabel), '') AS path
+                     FROM filemetadata
+                     WHERE datasetversion_id IN (:ids)
+                       AND NULLIF(BTRIM(directorylabel), '') IS NOT NULL
+                 ) dirs
+            CROSS JOIN LATERAL generate_series(
+                    1, cardinality(string_to_array(path, '/'))
+                ) AS g(n)
+        ),
+        file_paths AS (
+            SELECT DISTINCT
+                datasetversion_id,
+                CASE
+                WHEN NULLIF(BTRIM(directorylabel), '') IS NULL THEN label
+                ELSE NULLIF(BTRIM(directorylabel), '') || '/' || label
+                END AS path
+            FROM filemetadata
+            WHERE datasetversion_id IN (:ids)
+        )
+    SELECT datasetversion_id, path
+    FROM dir_ancestors
+    INTERSECT
+    SELECT datasetversion_id, path
+    FROM file_paths
+    ORDER BY datasetversion_id, path;
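The logic of the query can be mirrored in plain Python to make it easier to follow: expand every directory label into itself plus its ancestors, build the full path of every file, and intersect the two sets. This is only an illustrative sketch for a single dataset version; `find_conflicts` is a hypothetical name, and the sample rows are taken from the `filemetadata` listing in the test README.

```python
def find_conflicts(rows):
    """Return sorted paths that are both a directory (or ancestor) and a file.

    `rows` is an iterable of (directorylabel, label) pairs for one dataset version.
    """
    dir_ancestors = set()
    file_paths = set()
    for directorylabel, label in rows:
        directory = (directorylabel or "").strip()  # like NULLIF(BTRIM(...), '')
        if directory:
            parts = directory.split("/")
            # Like generate_series(1, cardinality(...)) with the [1:n] slice.
            for n in range(1, len(parts) + 1):
                dir_ancestors.add("/".join(parts[:n]))
            file_paths.add(f"{directory}/{label}")
        else:
            file_paths.add(label)
    return sorted(dir_ancestors & file_paths)  # like INTERSECT


if __name__ == "__main__":
    sample = [
        (None, "original-metadata.zip"),
        ("foo", "bar"),
        ("accessibilities", "anonymous.txt"),
        ("accessibilities", "request.txt"),
        ("foo.tab", "bar"),
        (None, "foo"),
        (None, "foo.tab"),
        ("foo/bar", "datasets-api.txt"),
        (None, "x"),
        ("foo/bar", "dir-conflicts-with-file.txt"),
        ("foo", "beer"),
        ("foo", "Beer"),
    ]
    print(find_conflicts(sample))
```

The sample yields the same three paths as the example results in the test README (`foo`, `foo/bar`, `foo.tab`), although Python's sort order may differ from the database collation used by `ORDER BY`.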

scripts/issues/12407/find_dv_ids.sql

@@ -0,0 +1,35 @@
+    \set min_id 0
+    \set nr_of_ids 50
+    WITH ranked AS (
+        SELECT
+            dso.id AS dso_id,
+            dso.protocol,
+            dso.authority,
+            dso.identifier,
+            dv.id AS dv_id,
+            dv.versionnumber,
+            dv.minorversionnumber,
+            ROW_NUMBER() OVER (
+                PARTITION BY dso.id
+                ORDER BY
+                    dv.versionnumber DESC,
+                    dv.minorversionnumber DESC,
+                    dv.id DESC
+            ) AS rn
+        FROM datasetversion dv
+        JOIN dvobject dso ON dso.id = dv.dataset_id
+    )
+    SELECT
+        dv_id,
+        dso_id,
+        protocol,
+        authority,
+        identifier,
+        versionnumber,
+        minorversionnumber
+    FROM ranked
+    WHERE rn = 1
+      AND dv_id >= :min_id
+    ORDER BY dv_id
+        LIMIT :nr_of_ids;

scripts/tests/issues/12407/README.md

@@ -0,0 +1,56 @@
+    Semi-automated test
+    ===================
+    This is a semi-automated test to check the API endpoints that are changed by this [pull request](https://github.com/IQSS/dataverse/pull/12407).
+    Adjust the configuration variables at the start of the script.
+    * Run the _python3_ script before deploying the pull request.
+    * Download all files from the dataset; the resulting zip will not extract.
+    * Try to add a non-conflicting file to the dataset; saving the changes succeeds.
+    * Deploy the pull request.
+    * Again try to add a non-conflicting file to the dataset; saving the changes now fails.
+    * Remove the resulting draft version of the dataset.
+    * Run the script again.
+    Result before deploy
+    --------------------
+    All requests to the API endpoints return a 200 OK status code.
+    As a result the dataset will contain conflicting file/directory paths for foo and foo/bar.
+    Running `scripts/issues/12407/find_duplicates.py` should show the conflicting dataset and file metadata. Note that a draft dataset has no version number.
+    ### Example of results
+    | datasetversion_id | path    | protocol  | authority  | dataset_id  | versionnumber | minorversionnumber |
+    |-------------------|---------|-----------|------------|-------------|---------------|--------------------|
+    | 4                 | foo     | doi       | 10.5072    | DAR/HBGPN5  |               |                    |
+    | 4                 | foo/bar | doi       | 10.5072    | DAR/HBGPN5  |               |                    |
+    | 4                 | foo.tab | doi       | 10.5072    | DAR/HBGPN5  |               |                    |
+    `select directorylabel,label,datasetversion_id from filemetadata;`
+    | directorylabel   | label                       | datasetversion_id |
+    |------------------|-----------------------------|-------------------|
+    |                  | original-metadata.zip       | 4                 |
+    |  foo             | bar                         | 4                 |
+    |  accessibilities | anonymous.txt               | 4                 |
+    |  accessibilities | request.txt                 | 4                 |
+    |  foo.tab         | bar                         | 4                 |
+    |                  | foo                         | 4                 |
+    |                  | foo.tab                     | 4                 |
+    |  foo/bar         | datasets-api.txt            | 4                 |
+    |                  | x                           | 4                 |
+    |  foo/bar         | dir-conflicts-with-file.txt | 4                 |
+    |  foo             | beer                        | 4                 |
+    |  foo             | Beer                        | 4                 |
+    ![](before-deploy.png)
+    Result after deploy
+    -------------------
+    Output with dashed lines shows expected status codes and further notes.
+    ![](after-deploy.png)

scripts/tests/issues/12407/after-deploy.png


scripts/tests/issues/12407/before-deploy.png

