# Polars example

Read Parquet data from S3 and operate on it using Polars.

[Polars Katas by Luca Baggi](https://github.com/baggiponte/polars-katas)

In [8]:
import futuredata as fd
import polars as pl
import pyarrow.dataset as ds

In [2]:
storage_options = {
    "aws_access_key_id": fd.settings.AWS_ACCESS_KEY_ID,
    "aws_secret_access_key": fd.settings.AWS_SECRET_ACCESS_KEY,
    "aws_region": fd.settings.AWS_REGION,
}

In [11]:
# Lazily scan every Parquet file matching the glob; no rows are loaded until .collect()
source = "s3://intella/test/text/article_fact/*/*"
df = pl.scan_parquet(source, storage_options=storage_options)
df

In [22]:
# Scanning pyarrow dataset
source = "s3://intella/test/text/article_fact/0/article_fact_0"
dset = ds.dataset(source, format="parquet")

df = (
    pl.scan_pyarrow_dataset(dset)
    .select(["article_id", "article_title", "article_abstract"])
)


In [23]:
df.collect()

article_id,article_title,article_abstract
i64,str,str
9136,"""Chemistry of N…","""The 1-methyl-2…"
9145,"""Pre-Ischemic T…","""Physical exerc…"
9156,"""(Z)-2-(4-Chlor…","""2-Amino-6-etho…"
9162,"""3-Amino-6-etho…","""(Z)-2-(4-Chlor…"
9171,"""The More the W…","""Tinnitus distu…"
…,…,…
49995,"""Interaction of…","""Currently, sev…"
49996,"""Insights into …","""Influenza viru…"
49997,"""The Agrobacter…","""Cassava mosaic…"
49998,"""Dynamics of Ap…","""Apis mellifera…"


### Streaming

One additional benefit of the lazy API is that it allows queries to be executed in a streaming manner. Instead of processing the data all at once, Polars can execute the query in batches, allowing you to process datasets that are larger than memory.

https://docs.pola.rs/user-guide/concepts/streaming/

In [24]:
# Streaming
df.collect(streaming=True)

article_id,article_title,article_abstract
i64,str,str
9136,"""Chemistry of N…","""The 1-methyl-2…"
9145,"""Pre-Ischemic T…","""Physical exerc…"
9156,"""(Z)-2-(4-Chlor…","""2-Amino-6-etho…"
9162,"""3-Amino-6-etho…","""(Z)-2-(4-Chlor…"
9171,"""The More the W…","""Tinnitus distu…"
…,…,…
49995,"""Interaction of…","""Currently, sev…"
49996,"""Insights into …","""Influenza viru…"
49997,"""The Agrobacter…","""Cassava mosaic…"
49998,"""Dynamics of Ap…","""Apis mellifera…"


## So what?

Polars is very powerful and an excellent library for data manipulation.

It is especially well suited, for example, to the transformation step of an ETL pipeline.

Watch your RAM consumption, though, and be careful about what you do, especially when you collect a DataFrame.