Merged
61 changes: 60 additions & 1 deletion docs/integrations/data-ingestion/s3/index.md
@@ -29,8 +29,67 @@ Using wildcards in the path expression allow multiple files to be referenced and

### Preparation {#preparation}

To interact with our S3-based dataset, we prepare a standard `MergeTree` table as our destination. The statement below creates a table named `trips` in the default database:
Before creating the table in ClickHouse, you may want to take a closer look at the data in the S3 bucket. You can do this directly from ClickHouse using the `DESCRIBE` statement:

```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');
```

The output of the `DESCRIBE TABLE` statement shows the column names and types that ClickHouse infers automatically from the data in the S3 bucket. Notice that ClickHouse also recognizes the gzip compression format and decompresses the files automatically:

```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames') SETTINGS describe_compact_output=1;

┌─name──────────────────┬─type───────────────┐
│ trip_id               │ Nullable(Int64)    │
│ vendor_id             │ Nullable(Int64)    │
│ pickup_date           │ Nullable(Date)     │
│ pickup_datetime       │ Nullable(DateTime) │
│ dropoff_date          │ Nullable(Date)     │
│ dropoff_datetime      │ Nullable(DateTime) │
│ store_and_fwd_flag    │ Nullable(Int64)    │
│ rate_code_id          │ Nullable(Int64)    │
│ pickup_longitude      │ Nullable(Float64)  │
│ pickup_latitude       │ Nullable(Float64)  │
│ dropoff_longitude     │ Nullable(Float64)  │
│ dropoff_latitude      │ Nullable(Float64)  │
│ passenger_count       │ Nullable(Int64)    │
│ trip_distance         │ Nullable(String)   │
│ fare_amount           │ Nullable(String)   │
│ extra                 │ Nullable(String)   │
│ mta_tax               │ Nullable(String)   │
│ tip_amount            │ Nullable(String)   │
│ tolls_amount          │ Nullable(Float64)  │
│ ehail_fee             │ Nullable(Int64)    │
│ improvement_surcharge │ Nullable(String)   │
│ total_amount          │ Nullable(String)   │
│ payment_type          │ Nullable(String)   │
│ trip_type             │ Nullable(Int64)    │
│ pickup                │ Nullable(String)   │
│ dropoff               │ Nullable(String)   │
│ cab_type              │ Nullable(String)   │
│ pickup_nyct2010_gid   │ Nullable(Int64)    │
│ pickup_ctlabel        │ Nullable(Float64)  │
│ pickup_borocode       │ Nullable(Int64)    │
│ pickup_ct2010         │ Nullable(String)   │
│ pickup_boroct2010     │ Nullable(String)   │
│ pickup_cdeligibil     │ Nullable(String)   │
│ pickup_ntacode        │ Nullable(String)   │
│ pickup_ntaname        │ Nullable(String)   │
│ pickup_puma           │ Nullable(Int64)    │
│ dropoff_nyct2010_gid  │ Nullable(Int64)    │
│ dropoff_ctlabel       │ Nullable(Float64)  │
│ dropoff_borocode      │ Nullable(Int64)    │
│ dropoff_ct2010        │ Nullable(String)   │
│ dropoff_boroct2010    │ Nullable(String)   │
│ dropoff_cdeligibil    │ Nullable(String)   │
│ dropoff_ntacode       │ Nullable(String)   │
│ dropoff_ntaname       │ Nullable(String)   │
│ dropoff_puma          │ Nullable(Int64)    │
└───────────────────────┴────────────────────┘
```
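
Every inferred type above is wrapped in `Nullable()`, and several numeric-looking columns such as `trip_distance` and `fare_amount` come back as `String`. As a rough sketch, not part of the original change, the same `DESCRIBE` can be re-run with the schema inference settings `schema_inference_make_columns_nullable` and `schema_inference_hints` to preview a different schema. The hinted types below are illustrative assumptions, and whether the data parses with them on read depends on the file contents:

```sql
-- Sketch: preview the schema without Nullable() and with type hints
-- for two columns. The hinted types are assumptions for illustration,
-- not values taken from the dataset's documentation.
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames')
SETTINGS
    describe_compact_output = 1,
    schema_inference_make_columns_nullable = 0,
    schema_inference_hints = 'trip_distance Float64, fare_amount Float32';
```

These settings only affect how the query reads the files. They do not change the data in the bucket, and an explicit `CREATE TABLE` remains the place to fix the final column types.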

To interact with our S3-based dataset, we prepare a standard `MergeTree` table as our destination. The statement below creates a table named `trips` in the default database. Note that we have chosen to change some of the data types inferred above, in particular by not using the [`Nullable()`](https://clickhouse.com/docs/en/sql-reference/data-types/nullable) data type modifier, which can add unnecessary storage and performance overhead:

```sql
CREATE TABLE trips