2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/api/native-api.rst
@@ -3245,7 +3245,7 @@ The fully expanded example above (without environment variables) looks like this

.. code-block:: bash

- curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/datasets/:persistentId/files/metadata?:persistentId=doi:10.5072/FK2/J8SJZB" --upload-file file-metadata-update.json
+ curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/datasets/:persistentId/files/metadata?persistentId=doi:10.5072/FK2/J8SJZB" --upload-file file-metadata-update.json

The ``file-metadata-update.json`` file should contain a JSON array of objects, each representing a file to be updated. Here's an example structure:

2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/user/dataset-management.rst
@@ -145,7 +145,7 @@ Beginning with Dataverse Software 5.0, the way a Dataverse installation handles
- Files with the same checksum can be included in a dataset, even if the files are in the same directory.
- Files with the same filename can be included in a dataset as long as the files are in different directories.
- If a user uploads a file to a directory where a file already exists with that directory/filename combination, the Dataverse installation will adjust the file path and names by adding "-1" or "-2" as applicable. This change will be visible in the list of files being uploaded.
- - If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, the Dataverse installation will display an error.
+ - If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, or the new directory/filename combination already exists as a directory, the Dataverse installation will display an error.
- If a user attempts to replace a file with another file that has the same checksum, an error message will be displayed and the file will not be able to be replaced.
- If a user attempts to replace a file with a file that has the same checksum as a different file in the dataset, a warning will be displayed.

13 changes: 13 additions & 0 deletions scripts/issues/12407/README.md
@@ -0,0 +1,13 @@
Detect existing datasets with directories duplicating full file-paths
=====================================================================

Downloaded zips with directories conflicting with file paths result in an error message when trying to extract the files. This [pull request](https://github.com/IQSS/dataverse/pull/12407) prevents those conflicts.

After deploying, users will get error messages when trying to add files to a dataset with a conflicting file/directory path.
The file metadata of the conflicting files should be fixed manually, preferably before deploying the pull request, to avoid confusion for users.

`scripts/issues/12407/find_duplicates.py` should be executed by the user that owns the `dvndb` database.

These scripts scan for conflicting datasets. Depending on your preferences and the size of your database you might want a variation of the scripts.

In small databases you can drop both `WHERE datasetversion_id IN (:ids)` checks and run `find_duplicates.sql` directly. Another option is to divide the query into chunks with a `BETWEEN` clause on `datasetversion_id`. In that case older versions of datasets are also checked, not just the latest version.
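
The chunking option could be sketched roughly as below. This is only an illustration and not part of the pull request: the helper names `id_chunks` and `between_clause`, and the chunk size, are assumptions made for this example. Each generated clause would replace the `WHERE datasetversion_id IN (:ids)` checks in a copy of `find_duplicates.sql`.

```python
def id_chunks(min_id, max_id, chunk_size):
    """Yield inclusive (low, high) bounds that together cover min_id..max_id.

    Hypothetical helper: the bounds are meant for a BETWEEN clause on
    datasetversion_id, one query per chunk.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1


def between_clause(low, high):
    """Build the replacement for `WHERE datasetversion_id IN (:ids)`."""
    # int() guards against anything but integers ending up in the SQL text.
    return f"WHERE datasetversion_id BETWEEN {int(low)} AND {int(high)}"


if __name__ == "__main__":
    for low, high in id_chunks(1, 10, 4):
        print(between_clause(low, high))
```

A smaller chunk size keeps each query short at the cost of more round trips; the right value depends on the size of the `filemetadata` table.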
107 changes: 107 additions & 0 deletions scripts/issues/12407/find_duplicates.py
@@ -0,0 +1,107 @@
#!/usr/bin/env python3
import argparse
import psycopg2
from pathlib import Path
from textwrap import dedent

def read_sql(path: Path) -> str:
    # Drop psql meta-commands (lines starting with a backslash, such as `\set`);
    # they are only meaningful to psql, not to psycopg2.
    text = path.read_text(encoding="utf-8")
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("\\")
    )


def fetch_dv_ids(conn, find_dv_ids_sql: str) -> list[int]:
    with conn.cursor() as cur:
        cur.execute(find_dv_ids_sql)
        rows = cur.fetchall()

    # find_dv_ids.sql returns dv_id as its first selected column.
    return [int(row[0]) for row in rows]


def fetch_dataset_info(conn, datasetversion_id: int):
    dataset_query = """
        SELECT dso.protocol, dso.authority, dso.identifier, dv.versionnumber, dv.minorversionnumber
        FROM datasetversion dv
        JOIN dvobject dso ON dso.id = dv.dataset_id
        WHERE dv.id = %s
        """
    with conn.cursor() as cur:
        cur.execute(dataset_query, (datasetversion_id,))
        # fetchone() returns None when the dataset version does not exist.
        return cur.fetchone()


def run_find_duplicates(conn, find_duplicates_sql: str):
    last_dv_id = None
    last_info = ("", "", "", "", "")

    with conn.cursor() as cur:
        cur.execute(find_duplicates_sql)
        cols = [d[0] for d in cur.description]

        # Note: the "dataset_id" column holds dso.identifier, not a numeric id.
        extra_cols = ["protocol", "authority", "dataset_id", "versionnumber", "minorversionnumber"]
        print("\t".join(cols + extra_cols))

        for row in cur:
            dv_id = int(row[0])  # datasetversion_id

            # Rows are ordered by datasetversion_id, so look up the dataset only once per version.
            if dv_id != last_dv_id:
                fetched = fetch_dataset_info(conn, dv_id)
                last_info = fetched if fetched is not None else ("", "", "", "", "")
                last_dv_id = dv_id

            print("\t".join("" if v is None else str(v) for v in (tuple(row) + tuple(last_info))))


def main():
    class RawDefaultsFormatter(
        argparse.ArgumentDefaultsHelpFormatter,
        argparse.RawDescriptionHelpFormatter,
    ):
        pass

    parser = argparse.ArgumentParser(
        description=dedent("""
            Execute as owner of dvndb.

            `find_duplicates.sql` is executed for dv_ids returned by `find_dv_ids.sql`.
            `find_dv_ids.sql` returns the latest version per dataset.
            """),
        formatter_class=RawDefaultsFormatter,
    )
    parser.add_argument("--min-id", type=int, default=0, help="first dataset-version-id examined by `find_dv_ids.sql`")
    parser.add_argument("--nr-of-ids", type=int, default=50, help="number of IDs returned by `find_dv_ids.sql`")
    args = parser.parse_args()
    conn_kwargs = {"dbname": 'dvndb'}

    script_dir = Path(__file__).resolve().parent

    dup_sql_raw = read_sql(script_dir / "find_duplicates.sql")

    # Both arguments are parsed as int, so plain text substitution is safe here.
    dv_sql = read_sql(script_dir / "find_dv_ids.sql")
    dv_sql = dv_sql.replace(":min_id", str(args.min_id))
    dv_sql = dv_sql.replace(":nr_of_ids", str(args.nr_of_ids))

    try:
        with psycopg2.connect(**conn_kwargs) as conn:
            dv_ids = fetch_dv_ids(conn, dv_sql)

            if not dv_ids:
                print("No dv_id values returned by find_dv_ids.sql")
                return

            ids_csv = ",".join(str(i) for i in dv_ids)
            print(f"dataset version ids: {ids_csv}")
            run_find_duplicates(conn, dup_sql_raw.replace(":ids", ids_csv))
    except psycopg2.OperationalError as e:
        msg = str(e)
        if "no password supplied" in msg.lower():
            # Most likely not running as the database owner: show the usage hint.
            parser.print_help()
            raise SystemExit(2)
        print(f"Database connection failed: {e}")
        raise SystemExit(1)

if __name__ == "__main__":
    main()
35 changes: 35 additions & 0 deletions scripts/issues/12407/find_duplicates.sql
@@ -0,0 +1,35 @@
\set ids 5,7,9
WITH dir_ancestors AS (
    SELECT DISTINCT
        datasetversion_id,
        array_to_string((string_to_array(path, '/'))[1:n], '/') AS path
    FROM (
        SELECT DISTINCT
            datasetversion_id,
            NULLIF(BTRIM(directorylabel), '') AS path
        FROM filemetadata
        WHERE datasetversion_id IN (:ids)
          AND NULLIF(BTRIM(directorylabel), '') IS NOT NULL
    ) dirs
    CROSS JOIN LATERAL generate_series(
        1, cardinality(string_to_array(path, '/'))
    ) AS g(n)
),
file_paths AS (
    SELECT DISTINCT
        datasetversion_id,
        CASE
            WHEN NULLIF(BTRIM(directorylabel), '') IS NULL THEN label
            ELSE NULLIF(BTRIM(directorylabel), '') || '/' || label
        END AS path
    FROM filemetadata
    WHERE datasetversion_id IN (:ids)
)
SELECT datasetversion_id, path
FROM dir_ancestors

INTERSECT

SELECT datasetversion_id, path
FROM file_paths
ORDER BY datasetversion_id, path;
35 changes: 35 additions & 0 deletions scripts/issues/12407/find_dv_ids.sql
@@ -0,0 +1,35 @@
\set min_id 0
\set nr_of_ids 50

WITH ranked AS (
    SELECT
        dso.id AS dso_id,
        dso.protocol,
        dso.authority,
        dso.identifier,
        dv.id AS dv_id,
        dv.versionnumber,
        dv.minorversionnumber,
        ROW_NUMBER() OVER (
            PARTITION BY dso.id
            ORDER BY
                dv.versionnumber DESC,
                dv.minorversionnumber DESC,
                dv.id DESC
        ) AS rn
    FROM datasetversion dv
    JOIN dvobject dso ON dso.id = dv.dataset_id
)
SELECT
    dv_id,
    dso_id,
    protocol,
    authority,
    identifier,
    versionnumber,
    minorversionnumber
FROM ranked
WHERE rn = 1
  AND dv_id >= :min_id
ORDER BY dv_id
LIMIT :nr_of_ids;
56 changes: 56 additions & 0 deletions scripts/tests/issues/12407/README.md
@@ -0,0 +1,56 @@
Semi-automated test
===================

This is a semi-automated test to check the API endpoints changed by this [pull request](https://github.com/IQSS/dataverse/pull/12407).

Adjust the configuration variables at the start of the script.
* Run the _python3_ script before deploying the pull request.
* Download all files from the dataset; the resulting zip will not extract.
* Try to add a non-conflicting file to the dataset; saving the changes succeeds.
* Deploy the pull request.
* Try again to add a non-conflicting file to the dataset; saving the changes now fails.
* Remove the resulting draft version of the dataset.
* Run the script again.

Result before deploy
--------------------

All requests to the API endpoints return a 200 OK status code.
As a result, the dataset will contain conflicting file/directory paths for foo and foo/bar.

Running `scripts/issues/12407/find_duplicates.py` should show the conflicting dataset and file metadata. Note that a draft dataset has no version number.

### Example of results

| datasetversion_id | path | protocol | authority | dataset_id | versionnumber | minorversionnumber |
|-------------------|---------|-----------|------------|-------------|---------------|--------------------|
| 4 | foo | doi | 10.5072 | DAR/HBGPN5 | | |
| 4 | foo/bar | doi | 10.5072 | DAR/HBGPN5 | | |
| 4 | foo.tab | doi | 10.5072 | DAR/HBGPN5 | | |
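
To follow up on results like these, the tab-separated output of `find_duplicates.py` could be grouped per dataset. The sketch below is not part of the pull request; the function name `conflicts_by_pid` is hypothetical. It assumes the column names printed by the script and the usual `protocol:authority/identifier` form of a persistent identifier, and it skips the `dataset version ids:` line that the script prints before the header.

```python
import csv


def conflicts_by_pid(tsv_text):
    """Group conflicting paths by persistent identifier.

    Hypothetical helper: expects the text printed by find_duplicates.py,
    a header line followed by tab-separated rows.
    """
    lines = [
        line
        for line in tsv_text.splitlines()
        if line and not line.startswith("dataset version ids:")
    ]
    # QUOTE_NONE: file names may contain double quotes, so treat them literally.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    grouped = {}
    for row in reader:
        # The script's "dataset_id" column holds the identifier part of the PID.
        pid = f"{row['protocol']}:{row['authority']}/{row['dataset_id']}"
        grouped.setdefault(pid, []).append(row["path"])
    return grouped


if __name__ == "__main__":
    import sys

    for pid, paths in conflicts_by_pid(sys.stdin.read()).items():
        print(pid, ", ".join(paths))
```

If saved under a name of your choosing, the script's output can be piped into it on standard input.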

`select directorylabel,label,datasetversion_id from filemetadata;`

| directorylabel | label | datasetversion_id |
|------------------|-----------------------------|-------------------|
| | original-metadata.zip | 4 |
| foo | bar | 4 |
| accessibilities | anonymous.txt | 4 |
| accessibilities | request.txt | 4 |
| foo.tab | bar | 4 |
| | foo | 4 |
| | foo.tab | 4 |
| foo/bar | datasets-api.txt | 4 |
| | x | 4 |
| foo/bar | dir-conflicts-with-file.txt | 4 |
| foo | beer | 4 |
| foo | Beer | 4 |
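
The relation between the two tables above can be reproduced in a few lines of Python. This is a simplified sketch of the idea behind `find_duplicates.sql`, not the query itself: the function name `find_conflicts` is hypothetical, and it handles a single dataset version given as `(directorylabel, label)` pairs.

```python
def find_conflicts(rows):
    """Return the paths that are used both as a directory and as a file.

    rows: iterable of (directorylabel, label) pairs of one dataset version.
    """
    dir_paths = set()
    file_paths = set()
    for directorylabel, label in rows:
        # Like BTRIM in the SQL: strip spaces, treat empty or missing as "no directory".
        directory = (directorylabel or "").strip(" ")
        if directory:
            parts = directory.split("/")
            # Every ancestor counts as a directory: foo/bar yields foo and foo/bar.
            for n in range(1, len(parts) + 1):
                dir_paths.add("/".join(parts[:n]))
            file_paths.add(f"{directory}/{label}")
        else:
            file_paths.add(label)
    return dir_paths & file_paths


if __name__ == "__main__":
    rows = [
        ("", "original-metadata.zip"),
        ("foo", "bar"),
        ("accessibilities", "anonymous.txt"),
        ("accessibilities", "request.txt"),
        ("foo.tab", "bar"),
        ("", "foo"),
        ("", "foo.tab"),
        ("foo/bar", "datasets-api.txt"),
        ("", "x"),
        ("foo/bar", "dir-conflicts-with-file.txt"),
        ("foo", "beer"),
        ("foo", "Beer"),
    ]
    print(sorted(find_conflicts(rows)))
```

For the rows listed above this yields the same three paths as the example results: `foo`, `foo/bar` and `foo.tab`.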



![](before-deploy.png)

Result after deploy
-------------------
Output with dashed lines shows the expected status codes and further notes.

![](after-deploy.png)
Binary file added scripts/tests/issues/12407/after-deploy.png
Binary file added scripts/tests/issues/12407/before-deploy.png