Changes to `mkdocs/docs/api.md`: 202 additions, 1 deletion.
and loaded in Python by calling `load_catalog(name="hive")` and `load_catalog(name="rest")`.
This information must be placed inside a file called `.pyiceberg.yaml` located either in the `$HOME` or `%USERPROFILE%` directory (depending on whether the operating system is Unix-based or Windows-based, respectively), in the current working directory, or in the `$PYICEBERG_HOME` directory (if the corresponding environment variable is set).

For more details on possible configurations, refer to the [configuration page](https://py.iceberg.apache.org/configuration/).
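
As an illustrative sketch, a `.pyiceberg.yaml` defining the two catalogs named above might look like the following. The URIs and credential are placeholders, not real endpoints; see the configuration page for the full set of options.

```yaml
catalog:
  hive:
    type: hive
    uri: thrift://localhost:9083  # placeholder Hive metastore endpoint
  rest:
    type: rest
    uri: https://rest-catalog/ws/  # placeholder REST catalog endpoint
    credential: client-id:client-secret  # placeholder credential
```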

### Upsert

PyIceberg supports upsert operations, meaning that it can merge an Arrow table into an Iceberg table. Rows are matched on the [identifier field](https://iceberg.apache.org/spec/?column-projection#identifier-field-ids): if a matching row already exists in the table, it is updated; if no match is found, the new row is inserted.

Consider the following table, with some data:

```python
from pyiceberg.schema import Schema
from pyiceberg.types import IntegerType, NestedField, StringType
```

Next, we'll upsert an Arrow table into the Iceberg table:

```python
import pyarrow as pa

df = pa.Table.from_pylist(
    [
        # Will be updated: the number of inhabitants has changed
        {"city": "Drachten", "inhabitants": 45505},
        # New row, will be inserted
        {"city": "Berlin", "inhabitants": 3432000},
        # Ignored: already exists in the table
        {"city": "Paris", "inhabitants": 2103000},
    ],
    schema=arrow_schema,
)

upd = tbl.upsert(df)

assert upd.rows_updated == 1
assert upd.rows_inserted == 1
```

PyIceberg automatically detects which rows need to be updated, which need to be inserted, and which can simply be ignored.
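
These merge semantics can be sketched in plain Python. This is a simplified model for illustration, not PyIceberg's actual implementation: rows are keyed on the identifier field, changed rows count as updates, unknown keys as inserts, and identical rows are ignored.

```python
def upsert(table_rows, incoming, key="city"):
    """Simplified model of upsert: match rows on the identifier field."""
    index = {row[key]: row for row in table_rows}
    updated = inserted = 0
    for row in incoming:
        existing = index.get(row[key])
        if existing is None:
            index[row[key]] = row  # key not found: insert
            inserted += 1
        elif existing != row:
            index[row[key]] = row  # key found, values differ: update
            updated += 1
        # identical row: ignored
    return list(index.values()), updated, inserted


# Hypothetical existing table contents (the old Drachten count is made up)
rows = [
    {"city": "Drachten", "inhabitants": 45000},
    {"city": "Paris", "inhabitants": 2103000},
]
new = [
    {"city": "Drachten", "inhabitants": 45505},  # update
    {"city": "Berlin", "inhabitants": 3432000},  # insert
    {"city": "Paris", "inhabitants": 2103000},   # ignored
]
merged, n_upd, n_ins = upsert(rows, new)
assert (n_upd, n_ins) == (1, 1)
assert len(merged) == 3
```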

## Inspecting tables

To explore the table metadata, tables can be inspected.

```
(Showing first 2 rows)
```

+
1615
+
### Polars
1616
+
1617
+
PyIceberg interfaces closely with Polars DataFrames and LazyFrames, which provide a fully lazily optimized query-engine interface on top of PyIceberg tables.

<!-- prettier-ignore-start -->

!!! note "Requirements"
    This requires [`polars` to be installed](index.md).

    ```shell
    pip install 'pyiceberg[polars]'
    ```

<!-- prettier-ignore-end -->

PyIceberg data can be analyzed and accessed through Polars using either a DataFrame or a LazyFrame.
If your code uses the Apache Iceberg data scanning and retrieval API and then analyzes the resulting DataFrame in Polars, use the `table.scan().to_polars()` API.
If the intent is to use Polars' high-performance filtering and retrieval functionality, use the LazyFrame exported from the Iceberg table via the `table.to_polars()` API.

```python
# Get a LazyFrame
iceberg_table.to_polars()

# Get a DataFrame
iceberg_table.scan().to_polars()
```

#### Working with Polars DataFrame

PyIceberg makes it easy to filter data out of a huge table and pull it into a local Polars DataFrame. Only the Parquet files relevant to the query are fetched, and the filter is applied while reading them, which reduces IO, improving performance and lowering cost.