<a href="https://colab.research.google.com/github/PozzOver13/learning/blob/main/polars/20231110_tutorial_polars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Polars Tutorial 🐻




# References  

**official websites**  
https://www.pola.rs/ Polars Website  
https://pola-rs.github.io/polars/user-guide/ Polars User guide  

**talks**  
https://www.youtube.com/watch?v=LGAHTp4DYZY  Juan Luis- Expressive and fast dataframes in Python with polars  
https://www.youtube.com/watch?v=CtkMzCIXOWk  Thomas Bierhance: Polars - make the switch to lightning-fast dataframes   
https://www.youtube.com/watch?v=0rKV-aghHuc Beyond Pandas: lightning fast in-memory dataframes with Polars - Alberto Danese
https://www.youtube.com/watch?v=XsxrhGlkyk0 Is the great dataframe showdown finally over? Enter: polars - Luca Baggi  
https://www.youtube.com/watch?v=DwtbPhbDecQ&list=WL&index=32 Matthew Rocklin - Spark, Dask, DuckDB, and Polars: Benchmarks | PyData NYC 2023
  


**good to know**  
https://wesmckinney.com/blog/apache-arrow-pandas-internals/ Wes McKinney's blog post on pandas internals  
https://duckdblabs.github.io/db-benchmark/ Database-like ops benchmark  
https://pandas.pydata.org/docs/user_guide/pyarrow.html - Pandas Arrow backend
   

# Introduction

## What is Polars?

Polars is a fast and efficient DataFrame library written in Rust, with bindings for Python and other languages. It provides high-performance data manipulation and analysis capabilities similar to those offered by popular libraries like Pandas, but with a focus on speed and parallel processing.

Key features of Polars include:
1. **Performance**: Polars is built with performance in mind and is optimized for speed. It aims to provide faster data processing compared to Pandas, especially for large datasets.
1. **Lazy Evaluation**: In addition to eager execution, Polars offers a lazy API in which operations are not executed immediately. Instead, they are recorded as a query plan, which allows the query to be optimized before it runs.
1. **Parallel Processing**: Polars is designed to take advantage of parallel processing and multicore systems, which can significantly speed up data manipulation tasks.
1. **Memory Efficiency**: Polars is optimized for memory efficiency, allowing users to work with large datasets without running into memory constraints.
1. **Expressive API**: Polars provides a concise and expressive API for working with data, making it easy to perform common data manipulation tasks.
1. **Integration with Arrow**: Polars leverages the Apache Arrow project for efficient in-memory columnar data representation, enabling seamless interoperability with other systems and libraries that support Arrow.

## What is Rust?

Rust is a programming language focused on performance, reliability, and safety. It began as a personal project by Graydon Hoare in 2006, was sponsored by Mozilla starting in 2009, and was first announced publicly in 2010. Rust is often praised for its memory safety guarantees, zero-cost abstractions, and strong type system. It is intended for systems-level programming where low-level control over system resources and high performance are crucial.

Key features of Rust include:

1. **Memory Safety:** Rust uses a system of ownership, borrowing, and lifetimes to enforce memory safety without the need for a garbage collector. This helps prevent common programming errors like null pointer dereferencing and data races.
2. **Concurrency without Data Races:** Rust's concurrency model rules out data races at compile time. The ownership system prevents multiple threads from mutating the same data at the same time without synchronization, which removes a whole class of parallelism bugs.
3. **Zero-Cost Abstractions:** Rust provides high-level abstractions without sacrificing performance. The language aims to eliminate the trade-off between high-level programming constructs and low-level control over system resources.
4. **Cargo Build System:** Rust comes with a build tool called Cargo, which makes it easy to manage dependencies, build projects, and run tests. Cargo simplifies the process of creating, building, and distributing Rust projects.
5. **Pattern Matching:** Rust includes a powerful pattern matching system that makes it expressive and allows for concise and readable code.
6. **C and C++ Interoperability:** Rust can be integrated with existing C and C++ codebases, allowing for gradual adoption or collaboration with projects written in these languages.
7. **Community and Ecosystem:** Rust has a growing and active community, and it is supported by a rich ecosystem of libraries and tools. The community values documentation, and Rust projects often come with well-documented code.

Rust has gained popularity for its focus on safety and performance, making it suitable for a wide range of applications, including system-level programming, embedded systems, and performance-critical applications. It is often used in projects where both low-level control and modern language features are important.

## Why is it worth knowing?

1.   Performance
1.   Syntax

**7 verbs get most jobs done (anyone thinking of dplyr?)**


```python
# select / slice columns
select
# create / transform / assign columns
with_columns
# filter / slice / query rows
filter
# join / merge another dataframe
join
# group dataframe rows
groupby
# aggregate groups
agg
# sort dataframe
sort
```


## Data Wrangling Alternatives?

1. **Pandas:**
   - **Overview:** Pandas is one of the most widely used data manipulation libraries in Python. It provides high-performance, easy-to-use data structures such as DataFrames for data analysis and manipulation.
   - **Strengths:** Great for exploratory data analysis, cleaning, and transformation. It has a large and active community.
   - **Considerations:** While Pandas is powerful, it might struggle with very large datasets due to its in-memory nature.

2. **Apache Arrow:**
   - **Overview:** Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data.
   - **Strengths:** Efficient in-memory representation, supports multiple languages, and facilitates high-speed data interchange.
   - **Considerations:** While Arrow is crucial for interoperability, it doesn't provide the full range of data manipulation operations found in libraries like Pandas.

3. **Dask:**
   - **Overview:** Dask is a parallel computing library that integrates with existing Python libraries like NumPy, Pandas, and Scikit-Learn to enable parallel processing for larger-than-memory computations.
   - **Strengths:** Scales from a laptop to a cluster, enabling parallel and distributed computing. It's particularly useful for handling larger-than-memory datasets.
   - **Considerations:** Dask integrates with Pandas, but it doesn't replace Pandas; it extends its capabilities to larger datasets.

Other alternatives you might want to explore include:

4. **Vaex:**
   - **Overview:** Vaex is a Python library for lazy, out-of-core DataFrames. It aims to provide a Pandas-like API but with the ability to work with datasets larger than memory.
   - **Strengths:** Designed for handling larger-than-memory datasets, lazy evaluation for efficiency, and fast performance.

5. **Modin:**
   - **Overview:** Modin is a library that seeks to accelerate Pandas workflows by providing parallel and distributed computing capabilities.
   - **Strengths:** It allows you to scale your Pandas workflows using Dask or Ray, providing increased performance for some operations.


## Vectorized and Columnar processing

**Vectorization** refers to applying an operation to an entire array (vector) of data at once, rather than looping over the elements one at a time in interpreted code. In a vectorized approach, the loop runs inside compiled code that can use SIMD instructions to process several values per CPU instruction, which can significantly improve computational efficiency. This contrasts with scalar processing, where each element is handled by a separate operation.
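
The contrast can be sketched with NumPy, which is used here only because it makes the two styles easy to put side by side; it is an illustration of the idea, not of Polars internals.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Scalar processing: one Python-level operation per element.
looped = np.empty_like(values)
for i in range(values.size):
    looped[i] = values[i] * 2.0 + 1.0

# Vectorized processing: one expression over the whole array;
# the loop runs inside NumPy's compiled code.
vectorized = values * 2.0 + 1.0

print(np.array_equal(looped, vectorized))
```

Both versions compute the same result; the vectorized one avoids the per-element interpreter overhead.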

In the context of data manipulation libraries:

- **Pandas (NumPy-Backed Processing):** Pandas is also column-oriented: with its default backend, columns are held in NumPy arrays and most built-in operations are vectorized over whole columns. Performance usually degrades when code falls back to row-wise iteration (for example `iterrows` or `apply` with `axis=1`), which runs at Python speed and cannot take advantage of vectorized CPU instructions.

- **Polars (Columnar Processing):** Polars, on the other hand, adopts a columnar processing approach. It organizes data internally in a columnar format, meaning that each column is stored separately in memory. This allows operations to be applied to entire columns at once, taking advantage of vectorized processing and providing better performance, especially for tasks like filtering, aggregating, and transforming data.

**Columnar Processing** involves storing and processing data by column rather than by row. Each column is stored as a contiguous block in memory, which can improve cache locality and reduce the amount of data that needs to be loaded into memory for a particular operation.

In a columnar storage format:

- **Pandas (NumPy-Backed Storage):** With its default backend, pandas stores columns in NumPy arrays and groups columns of the same dtype into shared two-dimensional blocks. This is still column-oriented rather than row-wise, but strings and missing values are handled less uniformly than in Arrow (in the pandas output shown in this notebook, string columns have dtype `object`), which can cost memory and cache efficiency.

- **Polars (Columnar Storage):** Polars, being designed for columnar processing, stores data in a column-wise format. This can lead to more efficient memory access patterns, especially when performing operations that involve entire columns, as only the relevant columns need to be loaded into memory.

**Key Differences**:

1. **Memory Access Patterns:** Columnar processing is often more efficient for certain types of operations because it minimizes the amount of data that needs to be loaded into memory for a given computation.

2. **Parallelism:** Columnar processing facilitates better parallelism, as operations can be performed independently on each column, taking advantage of vectorized instructions in modern CPUs.

3. **Performance on Large Datasets:** Columnar processing can offer better performance for certain operations, especially on large datasets, due to improved cache efficiency and parallelism.
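
To make the memory-layout argument concrete, the sketch below stores the same small table in row-major and column-major order with NumPy and checks whether a single column is contiguous in memory. It is only an analogy for row-wise versus columnar storage and does not reproduce how Polars or Arrow lay out their buffers.

```python
import numpy as np

# A table with 4 rows and 3 columns, stored two different ways.
row_major = np.arange(12, dtype=np.int64).reshape(4, 3)  # rows are contiguous (C order)
col_major = np.asfortranarray(row_major)                 # columns are contiguous (Fortran order)

# Strides are the byte steps to the next row and to the next column.
print(row_major.strides)  # (24, 8)
print(col_major.strides)  # (8, 32)

# Reading one column: contiguous only in the column-major layout.
print(row_major[:, 0].flags["C_CONTIGUOUS"])  # False
print(col_major[:, 0].flags["C_CONTIGUOUS"])  # True
```

In the row-major layout, scanning one column touches every row's memory; in the column-major layout the column is a single contiguous run, which is friendlier to CPU caches.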



## Use Case Ideas
1. Try replacing part of Financial Forecast with Polars expressions

# Pandas🐼 vs Polars🐻

In [None]:
!python --version

Python 3.10.12


## Importing libraries

In [None]:
import pandas as pd
import polars as pl
import numpy as np

# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
TRAIN_CSV_PATH = "/content/drive/MyDrive/Personale/Dataset/finance_loan_approval/train.csv"
TEST_CSV_PATH = "/content/drive/MyDrive/Personale/Dataset/finance_loan_approval/test.csv"

## Reading Files

https://pola-rs.github.io/polars/user-guide/basics/reading-writing/

In [None]:
# Read the CSV file into a pandas DataFrame
train_pd = pd.read_csv(TRAIN_CSV_PATH)
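# infer_schema_length=10000 makes Polars inspect more rows before choosing dtypes; with the default
# (the first 100 rows) a column can be inferred as i64 and parsing fails when a float appears later
train_pl = pl.read_csv(TRAIN_CSV_PATH, infer_schema_length=10000)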

## Types & Memory Usage

Polars is entirely based on Arrow data types and backed by Arrow memory arrays. This makes data processing cache-efficient and well-supported for Inter Process Communication. Most data types follow the exact implementation from Arrow, with the exception of Utf8 (this is actually LargeUtf8), Categorical, and Object (support is limited).  
https://pola-rs.github.io/polars/user-guide/concepts/data-types/  

https://pola-rs.github.io/polars/user-guide/concepts/data-structures/

### Is there an equivalent of the pandas `memory_usage` method in Polars?

Polars does not have a method named `memory_usage`, but `DataFrame.estimated_size()` plays a similar role: it returns an estimate of the total heap size of a DataFrame. Unlike the pandas method, it reports a single figure for the whole frame rather than one value per column.

In Pandas, the `memory_usage` method provides information about the memory usage of each column in a DataFrame. There are two ways to estimate the memory footprint of a Polars DataFrame:

1. **Using the `memory_usage()` Method in Pandas:**
   If you have a Polars DataFrame, you can convert it to a Pandas DataFrame temporarily and use the Pandas `memory_usage` method. The conversion typically requires `pyarrow` to be installed. Here's an example:

   ```python
   import pandas as pd
   import polars as pl

   # Create a Polars DataFrame
   polars_df = pl.DataFrame({
       'A': [1, 2, 3],
       'B': ['foo', 'bar', 'baz']
   })

   # Convert to Pandas DataFrame and use memory_usage
   pandas_df = polars_df.to_pandas()
   memory_usage = pandas_df.memory_usage(deep=True)
   print(memory_usage)
   ```

   Note that the `deep=True` argument in `memory_usage` considers the memory usage of the underlying objects, which can be more accurate but also more computationally expensive. The figure describes the converted Pandas copy, so it can differ from the footprint of the original Polars DataFrame.

2. **Estimating Memory Usage in Polars:**
   The `estimated_size` method returns an estimate of the size of a DataFrame, in bytes by default; a unit such as `"kb"` or `"mb"` can be passed as an argument. The same method is available on a Series, which gives a per-column figure.

   ```python
   import polars as pl

   # Create a Polars DataFrame
   polars_df = pl.DataFrame({
       'A': [1, 2, 3],
       'B': ['foo', 'bar', 'baz']
   })

   # Get an estimate of the memory usage
   memory_size_estimate = polars_df.estimated_size()
   print(f"Estimated memory usage: {memory_size_estimate} bytes")
   ```

   Keep in mind that both approaches provide estimates, and the actual memory usage may depend on various factors, including the data types of the columns and other internal optimizations performed by the library.

### Schemas

In [None]:
train_pl.schema

{'Loan_ID': Utf8,
 'Gender': Utf8,
 'Married': Utf8,
 'Dependents': Utf8,
 'Education': Utf8,
 'Self_Employed': Utf8,
 'ApplicantIncome': Int64,
 'CoapplicantIncome': Float64,
 'LoanAmount': Int64,
 'Loan_Amount_Term': Int64,
 'Credit_History': Int64,
 'Property_Area': Utf8,
 'Loan_Status': Utf8}

In [None]:
train_pd.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [None]:
def print_memory_usage(df):
    """Print the total memory used by a pandas DataFrame in MB (index and object contents included)."""
    mem_MB = df.memory_usage(index=True, deep=True).sum() / 1024 / 1024
    print(f"memory used: {mem_MB:.1f} MB")

In [None]:
print_memory_usage(train_pd)

memory used: 0.3 MB


In [None]:
# Reuse the schema inferred above so the file is read with explicit dtypes
dset_schema = train_pl.schema
train_pl = pl.read_csv(TRAIN_CSV_PATH, dtypes=dset_schema)

## Inspecting the First Rows

In [None]:
train_pd.head(3)

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |


In [None]:
train_pl.head(3)

| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | i64 | f64 | i64 | i64 | i64 | str | str |
| "LP001002" | "Male" | "No" | "0" | "Graduate" | "No" | 5849 | 0.0 | null | 360 | 1 | "Urban" | "Y" |
| "LP001003" | "Male" | "Yes" | "1" | "Graduate" | "No" | 4583 | 1508.0 | 128 | 360 | 1 | "Rural" | "N" |
| "LP001005" | "Male" | "Yes" | "0" | "Graduate" | "Yes" | 3000 | 0.0 | 66 | 360 | 1 | "Urban" | "Y" |


## Unique Values

**Count unique values**
There are two ways to count unique values in Polars: an exact methodology and an approximation. The approximation uses the HyperLogLog++ algorithm to approximate the cardinality and is especially useful for very large datasets where an approximation is good enough.

https://pola-rs.github.io/polars/user-guide/expressions/functions/#count-unique-values

In [None]:
print('Gender', train_pd['Gender'].nunique())
print('Married', train_pd['Married'].nunique())
print('Property_Area', train_pd['Property_Area'].nunique())

Gender 2
Married 2
Property_Area 3


In [None]:
print('Gender', train_pl['Gender'].n_unique())
print('Married', train_pl['Married'].n_unique())
print('Property_Area', train_pl['Property_Area'].n_unique())

Gender 3
Married 3
Property_Area 3


In [None]:
train_pd['Gender'].unique()

array(['Male', 'Female', nan], dtype=object)

In [None]:
train_pl['Gender'].unique() # are missing values treated as values in Polars, or is this a parsing issue? (n_unique() counts null as a distinct value)

Gender
str
"""Male"""
"""Female"""
""


In [None]:
train_pd['Property_Area'].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [None]:
train_pl['Property_Area'].unique()

Property_Area
str
"""Rural"""
"""Semiurban"""
"""Urban"""


## Grouping

In [None]:
train_pd.groupby(['Gender', 'Married'], dropna=False).size().to_frame('N').reset_index()

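|   | Gender | Married | N |
|---|---|---|---|
| 0 | Female | No | 80 |
| 1 | Female | Yes | 31 |
| 2 | Female | NaN | 1 |
| 3 | Male | No | 130 |
| 4 | Male | Yes | 357 |
| 5 | Male | NaN | 2 |
| 6 | NaN | No | 3 |
| 7 | NaN | Yes | 10 |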


In [None]:
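# Polars has no `dropna` argument or `to_frame` method here: null keys are kept as their own groups.
# Newer Polars releases spell this method `group_by`.
train_pl.groupby(['Gender', 'Married']).count().sort(by='Gender')
# Only Gender is sorted; the order of Married within each Gender is not guaranteed.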

| Gender | Married | count |
|---|---|---|
| str | str | u32 |
| null | "Yes" | 10 |
| null | "No" | 3 |
| "Female" | null | 1 |
| "Female" | "No" | 80 |
| "Female" | "Yes" | 31 |
| "Male" | null | 2 |
| "Male" | "No" | 130 |
| "Male" | "Yes" | 357 |


## Expressions

In [None]:
median_applicant_income = (
    pl.col("ApplicantIncome")
    .median()
    .alias("median_ApplicantIncome")
)

In [None]:
median_applicant_income

In [None]:
n_unique_Loan_ID = (
    pl.col("Loan_ID")
    .n_unique()
    .alias("n_unique_Loan_ID")
)

In [None]:
train_pl.select(median_applicant_income)

median_ApplicantIncome
f64
3812.5


In [None]:
train_pl.select(n_unique_Loan_ID)

n_unique_Loan_ID
u32
614


In [None]:
train_pl.groupby(['Married']).agg(median_applicant_income, n_unique_Loan_ID)

| Married | median_ApplicantIncome | n_unique_Loan_ID |
|---|---|---|
| str | f64 | u32 |
| "No" | 3750.0 | 213 |
| "Yes" | 3854.5 | 398 |
| null | 4758.0 | 3 |


## Laziness

In general, the lazy API should be preferred unless you are interested in the intermediate results or are doing exploratory work and do not yet know what your query will look like.

https://pola-rs.github.io/polars/user-guide/concepts/lazy-vs-eager/#when-to-use-which

In [None]:
train_pl_lz = pl.scan_csv(TRAIN_CSV_PATH, infer_schema_length=10000)

In [None]:
train_pl_lz.schema

{'Loan_ID': Utf8,
 'Gender': Utf8,
 'Married': Utf8,
 'Dependents': Utf8,
 'Education': Utf8,
 'Self_Employed': Utf8,
 'ApplicantIncome': Int64,
 'CoapplicantIncome': Float64,
 'LoanAmount': Int64,
 'Loan_Amount_Term': Int64,
 'Credit_History': Int64,
 'Property_Area': Utf8,
 'Loan_Status': Utf8}

In [None]:
q = (
    pl.scan_csv(TRAIN_CSV_PATH, dtypes=dset_schema)
    .filter(pl.col("Married") == "Yes")
    .groupby("Education")
    .agg(median_applicant_income)
)

df_median_applicant_income = q.collect()

In [None]:
q

In [None]:
df_median_applicant_income

Education,median_ApplicantIncome
str,f64
"""Graduate""",4106.0
"""Not Graduate""",3333.0


In [None]:
# sink
# Note: the output below is a (result, timings) pair, the shape returned by LazyFrame.profile()

(shape: (2, 2)
 ┌──────────────┬────────────────────────┐
 │ Education    ┆ median_ApplicantIncome │
 │ ---          ┆ ---                    │
 │ str          ┆ f64                    │
 ╞══════════════╪════════════════════════╡
 │ Not Graduate ┆ 3333.0                 │
 │ Graduate     ┆ 4106.0                 │
 └──────────────┴────────────────────────┘,
 shape: (3, 3)
 ┌───────────────────────────────────┬───────┬───────┐
 │ node                              ┆ start ┆ end   │
 │ ---                               ┆ ---   ┆ ---   │
 │ str                               ┆ u64   ┆ u64   │
 ╞═══════════════════════════════════╪═══════╪═══════╡
 │ optimization                      ┆ 0     ┆ 10    │
 │ csv(/content/drive/MyDrive/Perso… ┆ 10    ┆ 33083 │
 │ groupby(Education)                ┆ 33091 ┆ 33353 │
 └───────────────────────────────────┴───────┴───────┘)