PyIceberg supports upsert operations: it can merge an Arrow table into an Iceberg table. Rows are matched using the [identifier field](https://iceberg.apache.org/spec/?column-projection#identifier-field-ids). If a row already exists in the table, it is updated; if it cannot be found, it is inserted.
Consider the following table, with some data:

```python
from pyiceberg.schema import Schema
from pyiceberg.types import IntegerType, NestedField, StringType
```
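To make the merge semantics concrete, here is a plain-Python sketch. The city rows and the `upsert` helper below are made up for illustration; they model the behavior described above, not the PyIceberg API itself.

```python
# Plain-Python sketch of upsert semantics: rows match on an identifier
# field ("city" here); matches are updated, non-matches are inserted.
# The data and this helper are illustrative, not the PyIceberg API.

def upsert(existing_rows, incoming_rows, key="city"):
    index = {row[key]: i for i, row in enumerate(existing_rows)}
    merged = [dict(row) for row in existing_rows]
    updated = inserted = 0
    for row in incoming_rows:
        if row[key] in index:
            merged[index[row[key]]] = dict(row)  # identifier matched: update
            updated += 1
        else:
            merged.append(dict(row))             # no match: insert
            inserted += 1
    return merged, updated, inserted

existing = [
    {"city": "Amsterdam", "inhabitants": 921402},
    {"city": "Drachten", "inhabitants": 45019},
]
incoming = [
    {"city": "Drachten", "inhabitants": 45505},  # already present: update
    {"city": "Paris", "inhabitants": 2103000},   # new: insert
]
merged, updated, inserted = upsert(existing, incoming)
print(updated, inserted)  # -> 1 1
```

The real operation additionally has to rewrite only the affected data files, but the row-level outcome is the same: one update, one insert.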
**mkdocs/docs/configuration.md**

Iceberg tables support table properties to configure table behavior.

| Key | Options | Default | Description |
| --- | ------- | ------- | ----------- |
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| `write.metadata.previous-versions-max` | Integer | 100 | The maximum number of previous version metadata files to keep before deleting after commit. |
| `write.metadata.delete-after-commit.enabled` | Boolean | False | Whether to automatically delete old *tracked* metadata files after each table commit. It retains the most recent metadata files; the number to keep can be set using the property `write.metadata.previous-versions-max`. |
| `write.object-storage.enabled` | Boolean | False | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. |
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation |
| `write.data.path` | String pointing to location | `{metadata.location}/data` | Sets the location under which data is written. |
For the FileIO there are several configuration options available:

| Key | Example | Description |
| --- | ------- | ----------- |
| `s3.endpoint` | `https://10.0.19.25/` | Configure an alternative endpoint of the S3 service for the FileIO to access. This can be used to point S3FileIO at any S3-compatible object storage service with a different endpoint, or to access a private S3 endpoint in a virtual private cloud. |
| `s3.access-key-id` | `admin` | Configure the static access key id used to access the FileIO. |
| `s3.secret-access-key` | `password` | Configure the static secret access key used to access the FileIO. |
| `s3.session-token` | `AQoDYXdzEJr...` | Configure the static session token used to access the FileIO. |
| `s3.role-session-name` | `session` | An optional identifier for the assumed role session. |
| `s3.role-arn` | `arn:aws:...` | AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role. |
| `s3.signer` | `bearer` | Configure the signature version of the FileIO. |
| `s3.signer.uri` | `http://my.signer:8080/s3` | Configure the remote signing URI if it differs from the catalog URI. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. |
| `s3.signer.endpoint` | `v1/main/s3-sign` | Configure the remote signing endpoint (default: `v1/aws/s3/sign`). Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/<s3.signer.endpoint>`. |
| `s3.region` | `us-west-2` | Configure the default region used to initialize an `S3FileSystem`. If this isn't set, `PyArrowFileIO` automatically tries to resolve the region (only supported for AWS S3 buckets). |
| `s3.resolve-region` | `False` | Only supported for `PyArrowFileIO`; when enabled, it will always try to resolve the location of the bucket (only supported for AWS S3 buckets). |
| `s3.proxy-uri` | `http://my.proxy.com:8080` | Configure the proxy server to be used by the FileIO. |
| `s3.request-timeout` | `60.0` | Configure socket read timeouts on Windows and macOS, in seconds. |
| `s3.force-virtual-addressing` | `False` | Whether to use virtual addressing of buckets. If true, virtual addressing is always enabled. If false, virtual addressing is only enabled if `endpoint_override` is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |
<!-- markdown-link-check-enable-->
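As an illustration, these options are typically supplied per catalog in a `.pyiceberg.yaml` file; the catalog name, endpoint, and credentials below are placeholders:

```yaml
catalog:
  default:
    uri: http://localhost:8181           # placeholder catalog URI
    s3.endpoint: http://localhost:9000   # placeholder S3-compatible endpoint
    s3.access-key-id: admin
    s3.secret-access-key: password
    s3.region: us-west-2
```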
Both data file and metadata file locations can be customized by configuring the table properties `write.data.path` and `write.metadata.path`.
For more granular control, you can override the `LocationProvider`'s `new_data_location` and `new_metadata_location` methods to define custom logic for generating file paths. See [`Loading a Custom Location Provider`](configuration.md#loading-a-custom-location-provider).
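As an illustrative-only sketch of such an override: in real code the class below would subclass PyIceberg's `LocationProvider`; here the base class is omitted, and the constructor, method signatures, and bucket path are hypothetical stand-ins.

```python
# Hypothetical sketch of a custom location provider. A real implementation
# would subclass PyIceberg's LocationProvider; the constructor and paths
# below are stand-ins for illustration.

class CustomLocationProvider:
    def __init__(self, table_location: str):
        self.table_location = table_location.rstrip("/")

    def new_data_location(self, data_file_name: str) -> str:
        # Custom rule: place data files under a flat "raw/" prefix.
        return f"{self.table_location}/raw/{data_file_name}"

    def new_metadata_location(self, metadata_file_name: str) -> str:
        return f"{self.table_location}/metadata/{metadata_file_name}"


provider = CustomLocationProvider("s3://bucket/ns/table")
print(provider.new_data_location("0000-0.parquet"))
# -> s3://bucket/ns/table/raw/0000-0.parquet
```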
PyIceberg defaults to the [`SimpleLocationProvider`](configuration.md#simple-location-provider) for managing file paths.
### Simple Location Provider
For example, a table partitioned over a string column `category` might have a data file with a location that includes the partition value (e.g. `category=orders`) in its path.
**mkdocs/docs/contributing.md**

To get started, you can run `make install`, which installs Poetry and all the dependencies.
If you want to install the library on the host, you can simply run `pip3 install -e .`. If you wish to use a virtual environment, you can run `poetry shell`. Poetry will open up a virtual environment with all the dependencies set.
> **Note:** If you want to use `poetry shell`, you need to install it using `pip install poetry-plugin-shell`. Alternatively, you can run commands directly with `poetry run`.
**mkdocs/docs/index.md**

You either need to install `s3fs`, `adlfs`, `gcsfs`, or `pyarrow` to be able to fetch files from an object store.
## Connecting to a catalog
Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/terms/#catalog). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.
For the sake of demonstration, we'll configure the catalog to use the `SqlCatalog` implementation, which stores information in a local `sqlite` database. We'll also configure the catalog to store data files in the local filesystem instead of an object store. This should not be used in production due to its limited scalability.
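A minimal `.pyiceberg.yaml` sketch for such a setup; the database and warehouse paths are placeholders:

```yaml
catalog:
  default:
    type: sql
    uri: sqlite:////tmp/pyiceberg_catalog.db  # placeholder local sqlite file
    warehouse: file:///tmp/warehouse          # placeholder local data location
```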