From 4f6fc1bb5134bc92d1f69e794555575b8ce58872 Mon Sep 17 00:00:00 2001
From: Dale Mcdiarmid
Date: Tue, 25 Feb 2025 09:37:17 +0000
Subject: [PATCH] update s3 guide with describe

---
 docs/integrations/data-ingestion/s3/index.md | 61 +++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/docs/integrations/data-ingestion/s3/index.md b/docs/integrations/data-ingestion/s3/index.md
index a9a5009141e..de34bc2513b 100644
--- a/docs/integrations/data-ingestion/s3/index.md
+++ b/docs/integrations/data-ingestion/s3/index.md
@@ -29,8 +29,67 @@ Using wildcards in the path expression allow multiple files to be referenced and
 
 ### Preparation {#preparation}
 
-To interact with our S3-based dataset, we prepare a standard `MergeTree` table as our destination. The statement below creates a table named `trips` in the default database:
+Prior to creating the table in ClickHouse, you may want to first take a closer look at the data in the S3 bucket. You can do this directly from ClickHouse using the `DESCRIBE` statement:
+```sql
+DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');
+```
+
+The output of the `DESCRIBE TABLE` statement should show you how ClickHouse would automatically infer this data, as viewed in the S3 bucket. Notice that it also automatically recognizes and decompresses the gzip compression format:
+
+```sql
+DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames') SETTINGS describe_compact_output=1
+
+┌─name──────────────────┬─type───────────────┐
+│ trip_id               │ Nullable(Int64)    │
+│ vendor_id             │ Nullable(Int64)    │
+│ pickup_date           │ Nullable(Date)     │
+│ pickup_datetime       │ Nullable(DateTime) │
+│ dropoff_date          │ Nullable(Date)     │
+│ dropoff_datetime      │ Nullable(DateTime) │
+│ store_and_fwd_flag    │ Nullable(Int64)    │
+│ rate_code_id          │ Nullable(Int64)    │
+│ pickup_longitude      │ Nullable(Float64)  │
+│ pickup_latitude       │ Nullable(Float64)  │
+│ dropoff_longitude     │ Nullable(Float64)  │
+│ dropoff_latitude      │ Nullable(Float64)  │
+│ passenger_count       │ Nullable(Int64)    │
+│ trip_distance         │ Nullable(String)   │
+│ fare_amount           │ Nullable(String)   │
+│ extra                 │ Nullable(String)   │
+│ mta_tax               │ Nullable(String)   │
+│ tip_amount            │ Nullable(String)   │
+│ tolls_amount          │ Nullable(Float64)  │
+│ ehail_fee             │ Nullable(Int64)    │
+│ improvement_surcharge │ Nullable(String)   │
+│ total_amount          │ Nullable(String)   │
+│ payment_type          │ Nullable(String)   │
+│ trip_type             │ Nullable(Int64)    │
+│ pickup                │ Nullable(String)   │
+│ dropoff               │ Nullable(String)   │
+│ cab_type              │ Nullable(String)   │
+│ pickup_nyct2010_gid   │ Nullable(Int64)    │
+│ pickup_ctlabel        │ Nullable(Float64)  │
+│ pickup_borocode       │ Nullable(Int64)    │
+│ pickup_ct2010         │ Nullable(String)   │
+│ pickup_boroct2010     │ Nullable(String)   │
+│ pickup_cdeligibil     │ Nullable(String)   │
+│ pickup_ntacode        │ Nullable(String)   │
+│ pickup_ntaname        │ Nullable(String)   │
+│ pickup_puma           │ Nullable(Int64)    │
+│ dropoff_nyct2010_gid  │ Nullable(Int64)    │
+│ dropoff_ctlabel       │ Nullable(Float64)  │
+│ dropoff_borocode      │ Nullable(Int64)    │
+│ dropoff_ct2010        │ Nullable(String)   │
+│ dropoff_boroct2010    │ Nullable(String)   │
+│ dropoff_cdeligibil    │ Nullable(String)   │
+│ dropoff_ntacode       │ Nullable(String)   │
+│ dropoff_ntaname       │ Nullable(String)   │
+│ dropoff_puma          │ Nullable(Int64)    │
+└───────────────────────┴────────────────────┘
+```
+
+To interact with our S3-based dataset, we prepare a standard `MergeTree` table as our destination. The statement below creates a table named `trips` in the default database. Note that we have chosen to modify some of those data types as inferred above, particularly to not use the [`Nullable()`](https://clickhouse.com/docs/en/sql-reference/data-types/nullable) data type modifier, which could cause some unnecessary additional stored data and some additional performance overhead:
 
 ```sql
 CREATE TABLE trips