saurabhojha
Contributor

@saurabhojha saurabhojha commented Mar 22, 2025

Previously, the download_data script used brace expansion to generate the zero-padded file names. However, it fails with the following error:

 ./download_data.sh 
Select the dataset size to download:
1) 1m (default)
2) 10m
3) 100m
4) 1000m
Enter the number corresponding to your choice: 2
--2025-03-22 17:06:27--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_1.json.gz
Resolving clickhouse-public-datasets.s3.amazonaws.com (clickhouse-public-datasets.s3.amazonaws.com)... 3.5.138.223, 52.219.169.183, 3.5.134.252, ...
Connecting to clickhouse-public-datasets.s3.amazonaws.com (clickhouse-public-datasets.s3.amazonaws.com)|3.5.138.223|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:30 ERROR 404: Not Found.

--2025-03-22 17:06:30--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_2.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:30 ERROR 404: Not Found.

--2025-03-22 17:06:30--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_3.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:30 ERROR 404: Not Found.

--2025-03-22 17:06:30--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_4.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:31 ERROR 404: Not Found.

--2025-03-22 17:06:31--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_5.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:31 ERROR 404: Not Found.

--2025-03-22 17:06:31--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_6.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:31 ERROR 404: Not Found.

--2025-03-22 17:06:31--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_7.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:31 ERROR 404: Not Found.

--2025-03-22 17:06:31--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_8.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:31 ERROR 404: Not Found.

--2025-03-22 17:06:31--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_9.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:31 ERROR 404: Not Found.

--2025-03-22 17:06:31--  https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_10.json.gz
Reusing existing connection to clickhouse-public-datasets.s3.amazonaws.com:443.
HTTP request sent, awaiting response... 404 Not Found
2025-03-22 17:06:32 ERROR 404: Not Found.

The issue is that the wrong file names are being generated: the requests go to file_1.json.gz, file_2.json.gz, and so on, without zero padding. My suspicion is that the leading zeroes are dropped during brace expansion. To fix this, I have switched to seq -f for creating the file names, which generates the correct names for download:

seq -f "https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_%04g.json.gz" 1 10
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0001.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0002.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0003.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0004.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0005.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0006.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0007.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0008.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0009.json.gz
https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0010.json.gz
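
For illustration, the seq -f approach could be wrapped roughly as follows. This is a minimal sketch, not the actual contents of download_data.sh: the function name generate_urls and the variable base_url are hypothetical, and the wget invocation in the trailing comment is only one possible way to consume the list. One plausible cause of the original failure, not confirmed in this thread, is running the script under a bash older than 4.0, which does not keep zero padding in brace expansions such as {0001..0010}.

```shell
#!/usr/bin/env bash
# Sketch only: generate_urls and base_url are illustrative names,
# not taken from download_data.sh.
base_url="https://clickhouse-public-datasets.s3.amazonaws.com/bluesky"

# Print one zero-padded URL per line for files 1..count.
generate_urls() {
    local count="$1"
    # %04g pads the number to four digits, e.g. 1 -> 0001
    seq -f "${base_url}/file_%04g.json.gz" 1 "$count"
}

generate_urls 3
# prints:
# https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0001.json.gz
# https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0002.json.gz
# https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0003.json.gz

# To download instead of printing, the list could be piped to wget:
#   generate_urls 10 | wget --continue --input-file=-
```

Because seq does the formatting itself, the result does not depend on the shell's brace expansion behavior.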

@saurabhojha
Contributor Author

@rschu1ze pls have a look whenever you are available 🚀

@rschu1ze
Member

Thanks, that is a nice PR.

@rschu1ze rschu1ze merged commit 57dc71a into ClickHouse:main Mar 23, 2025
@rschu1ze rschu1ze mentioned this pull request Mar 24, 2025