
Commit d6fbca9

Merge branch 'main' of github.com:apache/iceberg-python into fd-infer-types

2 parents: 2817c61 + 9945f83
35 files changed: +964 -406 lines

.github/workflows/pypi-build-artifacts.yml

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ jobs:
         if: startsWith(matrix.os, 'ubuntu')

       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.22.0
+        uses: pypa/cibuildwheel@v2.23.0
         with:
           output-dir: wheelhouse
           config-file: "pyproject.toml"

.github/workflows/svn-build-artifacts.yml

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ jobs:
         if: startsWith(matrix.os, 'ubuntu')

       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.22.0
+        uses: pypa/cibuildwheel@v2.23.0
         with:
           output-dir: wheelhouse
           config-file: "pyproject.toml"

mkdocs/docs/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@
 - [Verify a release](verify-release.md)
 - [How to release](how-to-release.md)
 - [Release Notes](https://github.com/apache/iceberg-python/releases)
+- [Nightly Build](nightly-build.md)
 - [Code Reference](reference/)

 <!-- markdown-link-check-enable-->

mkdocs/docs/api.md

Lines changed: 65 additions & 0 deletions
@@ -474,6 +474,71 @@ lat: [[48.864716],[52.371807],[53.11254],[37.773972]]
 long: [[2.349014],[4.896029],[6.0989],[-122.431297]]
 ```

+### Upsert
+
+PyIceberg supports upsert operations, meaning that it can merge an Arrow table into an Iceberg table. Rows are matched on the [identifier field](https://iceberg.apache.org/spec/?column-projection#identifier-field-ids): if a row with the same identifier already exists in the table, it is updated; if it cannot be found, the new row is inserted.
+
+Consider the following table, with some data:
+
+```python
+from pyiceberg.schema import Schema
+from pyiceberg.types import IntegerType, NestedField, StringType
+
+import pyarrow as pa
+
+schema = Schema(
+    NestedField(1, "city", StringType(), required=True),
+    NestedField(2, "inhabitants", IntegerType(), required=True),
+    # Mark city as the identifier field, also known as the primary key
+    identifier_field_ids=[1],
+)
+
+tbl = catalog.create_table("default.cities", schema=schema)
+
+arrow_schema = pa.schema(
+    [
+        pa.field("city", pa.string(), nullable=False),
+        pa.field("inhabitants", pa.int32(), nullable=False),
+    ]
+)
+
+# Write some data
+df = pa.Table.from_pylist(
+    [
+        {"city": "Amsterdam", "inhabitants": 921402},
+        {"city": "San Francisco", "inhabitants": 808988},
+        {"city": "Drachten", "inhabitants": 45019},
+        {"city": "Paris", "inhabitants": 2103000},
+    ],
+    schema=arrow_schema,
+)
+tbl.append(df)
+```
+
+Next, we'll upsert an Arrow table into the Iceberg table:
+
+```python
+df = pa.Table.from_pylist(
+    [
+        # Will be updated: the number of inhabitants has changed
+        {"city": "Drachten", "inhabitants": 45505},
+
+        # New row: will be inserted
+        {"city": "Berlin", "inhabitants": 3432000},
+
+        # Ignored: already exists in the table with the same values
+        {"city": "Paris", "inhabitants": 2103000},
+    ],
+    schema=arrow_schema,
+)
+upd = tbl.upsert(df)
+
+assert upd.rows_updated == 1
+assert upd.rows_inserted == 1
+```
+
+PyIceberg automatically detects which rows need to be updated, which need to be inserted, and which can simply be ignored.
+
 ## Inspecting tables

 To explore the table metadata, tables can be inspected.
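
One way to sanity-check the new upsert example is to scan the table back into Arrow; a minimal sketch continuing from the `tbl` defined in the added section:

```python
# Continuing from the upsert example: read the table back and check the
# merged state. tbl.scan().to_arrow() materializes the table as Arrow.
result = tbl.scan().to_arrow()
cities = dict(zip(result["city"].to_pylist(), result["inhabitants"].to_pylist()))

assert cities["Drachten"] == 45505   # updated
assert cities["Berlin"] == 3432000   # inserted
assert cities["Paris"] == 2103000    # ignored, unchanged
assert len(cities) == 5              # the four original cities plus Berlin
```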

mkdocs/docs/configuration.md

Lines changed: 21 additions & 24 deletions
@@ -64,7 +64,7 @@ Iceberg tables support table properties to configure table behavior.
 | `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
 | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
 | `write.metadata.delete-after-commit.enabled` | Boolean | False | Whether to automatically delete old *tracked* metadata files after each table commit. It will retain a number of the most recent metadata files, which can be set using property `write.metadata.previous-versions-max`. |
-| `write.object-storage.enabled` | Boolean | True | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
+| `write.object-storage.enabled` | Boolean | False | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. |
 | `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
 | `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation |
 | `write.data.path` | String pointing to location | `{metadata.location}/data` | Sets the location under which data is written. |
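
For context, a property like `write.object-storage.enabled` is applied per table; a minimal sketch, assuming an existing `catalog` and `schema`, with `"default.cities"` as a placeholder identifier:

```python
# Sketch: setting a table property at creation time; assumes an existing
# `catalog` and `schema`, and "default.cities" is a placeholder name.
tbl = catalog.create_table(
    "default.cities",
    schema=schema,
    properties={"write.object-storage.enabled": "true"},
)

# Properties can also be changed on an existing table via a transaction:
with tbl.transaction() as tx:
    tx.set_properties({"write.object-storage.enabled": "false"})
```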
@@ -108,22 +108,23 @@ For the FileIO there are several configuration options available:

 <!-- markdown-link-check-disable -->

-| Key                  | Example                    | Description |
-|----------------------|----------------------------|-------------|
-| s3.endpoint          | <https://10.0.19.25/>      | Configure an alternative endpoint of the S3 service for the FileIO to access. This could be used to use S3FileIO with any s3-compatible object storage service that has a different endpoint, or access a private S3 endpoint in a virtual private cloud. |
-| s3.access-key-id     | admin                      | Configure the static access key id used to access the FileIO. |
-| s3.secret-access-key | password                   | Configure the static secret access key used to access the FileIO. |
-| s3.session-token     | AQoDYXdzEJr...             | Configure the static session token used to access the FileIO. |
-| s3.role-session-name | session                    | An optional identifier for the assumed role session. |
-| s3.role-arn          | arn:aws:...                | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. |
-| s3.signer            | bearer                     | Configure the signature version of the FileIO. |
-| s3.signer.uri        | <http://my.signer:8080/s3> | Configure the remote signing uri if it differs from the catalog uri. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. |
-| s3.signer.endpoint   | v1/main/s3-sign            | Configure the remote signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default : v1/aws/s3/sign). |
-| s3.region            | us-west-2                  | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region for each S3 bucket, falling back to this value if resolution fails. |
-| s3.proxy-uri         | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
-| s3.connect-timeout   | 60.0                       | Configure socket connection timeout, in seconds. |
-| s3.request-timeout   | 60.0                       | Configure socket read timeouts on Windows and macOS, in seconds. |
-| s3.force-virtual-addressing | False               | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |
+| Key                         | Example                    | Description |
+|-----------------------------|----------------------------|-------------|
+| s3.endpoint                 | <https://10.0.19.25/>      | Configure an alternative endpoint of the S3 service for the FileIO to access. This could be used to use S3FileIO with any s3-compatible object storage service that has a different endpoint, or access a private S3 endpoint in a virtual private cloud. |
+| s3.access-key-id            | admin                      | Configure the static access key id used to access the FileIO. |
+| s3.secret-access-key        | password                   | Configure the static secret access key used to access the FileIO. |
+| s3.session-token            | AQoDYXdzEJr...             | Configure the static session token used to access the FileIO. |
+| s3.role-session-name        | session                    | An optional identifier for the assumed role session. |
+| s3.role-arn                 | arn:aws:...                | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. |
+| s3.signer                   | bearer                     | Configure the signature version of the FileIO. |
+| s3.signer.uri               | <http://my.signer:8080/s3> | Configure the remote signing uri if it differs from the catalog uri. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. |
+| s3.signer.endpoint          | v1/main/s3-sign            | Configure the remote signing endpoint. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. (default: v1/aws/s3/sign). |
+| s3.region                   | us-west-2                  | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region if this isn't set (only supported for AWS S3 buckets). |
+| s3.resolve-region           | False                      | Only supported for `PyArrowFileIO`: when enabled, it will always try to resolve the location of the bucket (only supported for AWS S3 buckets). |
+| s3.proxy-uri                | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
+| s3.connect-timeout          | 60.0                       | Configure socket connection timeout, in seconds. |
+| s3.request-timeout          | 60.0                       | Configure socket read timeouts on Windows and macOS, in seconds. |
+| s3.force-virtual-addressing | False                      | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |

 <!-- markdown-link-check-enable-->
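
These FileIO options are passed as catalog properties when the catalog is loaded; a minimal sketch where the endpoint, credentials, and region are all placeholders:

```python
from pyiceberg.catalog import load_catalog

# Sketch: supplying S3 FileIO properties as catalog configuration.
# All values below are placeholders, not real endpoints or credentials.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.region": "us-east-1",
    },
)
```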

@@ -212,8 +213,7 @@ Both data file and metadata file locations can be customized by configuring the

 For more granular control, you can override the `LocationProvider`'s `new_data_location` and `new_metadata_location` methods to define custom logic for generating file paths. See [`Loading a Custom Location Provider`](configuration.md#loading-a-custom-location-provider).

-PyIceberg defaults to the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider), which generates file paths for
-data files that are optimized for object storage.
+PyIceberg defaults to the [`SimpleLocationProvider`](configuration.md#simple-location-provider) for managing file paths.

 ### Simple Location Provider

@@ -233,9 +233,6 @@ partitioned over a string column `category` might have a data file with location:
 s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
 ```

-The `SimpleLocationProvider` is enabled for a table by explicitly setting its `write.object-storage.enabled` table
-property to `False`.
-
 ### Object Store Location Provider

 PyIceberg offers the `ObjectStoreLocationProvider`, and an optional [partition-exclusion](configuration.md#partition-exclusion)
@@ -254,8 +251,8 @@ For example, a table partitioned over a string column `category` might have a data file with location:
 s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
 ```

-The `write.object-storage.enabled` table property determines whether the `ObjectStoreLocationProvider` is enabled for a
-table. It is used by default.
+The `ObjectStoreLocationProvider` is enabled for a table by explicitly setting its `write.object-storage.enabled` table
+property to `True`.

 #### Partition Exclusion
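
To make the `new_data_location` override mentioned in the configuration.md changes above concrete, here is a rough sketch of a custom provider. The `pyiceberg.table.locations` module path, the `LocationProvider` base class, its `table_location` attribute, and the method signature are assumptions based on the docs above, not verified API; the class name and prefix are hypothetical:

```python
from typing import Optional

# Assumption: PyIceberg exposes a LocationProvider base class along these
# lines; the import path and method signature are not verified.
from pyiceberg.table.locations import LocationProvider


class PrefixedLocationProvider(LocationProvider):
    """Hypothetical provider that writes data files under a fixed prefix."""

    def new_data_location(self, data_file_name: str, partition_key: Optional[object] = None) -> str:
        # Place every data file under <table location>/data/custom/.
        return f"{self.table_location}/data/custom/{data_file_name}"
```

Such a class would then be referenced via the `write.py-location-provider.impl` table property listed above, e.g. `my_module.PrefixedLocationProvider` (a hypothetical module path).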

mkdocs/docs/contributing.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,8 @@ To get started, you can run `make install`, which installs Poetry and all the dependencies

 If you want to install the library on the host, you can simply run `pip3 install -e .`. If you wish to use a virtual environment, you can run `poetry shell`. Poetry will open up a virtual environment with all the dependencies set.

+> **Note:** If you want to use `poetry shell`, you need to install it using `pip install poetry-plugin-shell`. Alternatively, you can run commands directly with `poetry run`.
+
 To set up IDEA with Poetry:

 - Open up the Python project in IntelliJ

mkdocs/docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ You either need to install `s3fs`, `adlfs`, `gcsfs`, or `pyarrow` to be able to

 ## Connecting to a catalog

-Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/concepts/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.
+Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/terms/#catalog). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.

 For the sake of demonstration, we'll configure the catalog to use the `SqlCatalog` implementation, which will store information in a local `sqlite` database. We'll also configure the catalog to store data files in the local filesystem instead of an object store. This should not be used in production due to the limited scalability.
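
As a minimal sketch of that demonstration setup, the SQLite-backed catalog can be loaded like this; the warehouse path is a placeholder and the directory must already exist:

```python
from pyiceberg.catalog.sql import SqlCatalog

# Sketch: a SqlCatalog backed by a local SQLite database, with data files
# written to the local filesystem. /tmp/warehouse is a placeholder path.
warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
```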

0 commit comments