## The Ultimate Guide to Mastering the Python Polars DataFrame
---
This user guide aims to equip you with the knowledge and skills needed to use Python Polars DataFrames effectively for financial and supply chain data science analytics.

It provides an in-depth overview of the most commonly used functions and capabilities of the package.


## Table of contents
---

- Why another DataFrame
- Installation
- Creating DataFrame
- Lazy API
- IO
- Data Types
- About Expressions & Contexts
    - select
    - group by
- Polars SQL

## Why another DataFrame
---
Despite the many state-of-the-art DataFrame packages available, the Polars DataFrame, which is built on Rust, is among the fastest in execution speed, which lets it handle complex data science operations on tabular datasets. Its main advantages are:

- Execution on larger-than-memory (RAM) data analytics
- Lazy API vs Eager execution
- Automatic Optimization
- Embarrassingly Parallel
- Easy-to-learn, consistent, predictable API with a strict schema
- SQL-like expressions

#### Efficient Execution of Analytics on Larger-than-Memory (RAM) Data

RAM is rarely a constraint these days, as most computers and VMs offer inexpensive gigabytes of it. In fact, affordable RAM is the primary reason why Pandas-like DataFrames remain the go-to choice, and it is unlikely that Pandas or R tables will become obsolete anytime soon.

However, Polars DataFrames are increasingly popular among developers because they combine lazy, optimized query execution of the kind found in Apache Spark and DuckDB, the Apache Arrow columnar memory format, and the ease of use of Pandas-like DataFrame functionality.

Additionally, Polars comes with built-in multi-core, multi-threaded parallel processing, making it a highly preferred choice.

#### Lazy API vs Eager execution

Calling an API "lazy" does not mean that processing is delayed or slow, and "eager" execution does not mean that data transformations run faster just because they start immediately.

In simpler terms, using a Lazy API implies that the API will first take the time to optimize the query before execution, which often results in improved performance.

To illustrate this concept, consider running SQL on an RDBMS database. If the statistics, indexes, and data partitions have been appropriately optimized and the SQL is written in an optimized manner that utilizes the available statistics, indexes, and data partitions, the results will be delivered more quickly.

#### Automatic Optimization

We will learn a few techniques that let Polars optimize queries automatically.

#### Embarrassingly Parallel

#### Easy-to-learn, consistent, predictable API with a strict schema

#### SQL-like expressions

## Installation
---


In [None]:
pip install -U polars

## Creating Polars DataFrame
---

As stated above, our objective is to learn data science operations on Finance and Supply chain datasets, so we will focus on creating a few real-life examples that resemble Finance and Supply chain data.

For more information, please learn more about [Finance and Supply chain ERP data](https://amitxshukla.github.io/GeneralLedger.jl/tutorials/erd/).

In [None]:
#########################
## Series and DataFrames
#########################
import polars as pl

# from a list, without a name
location_1 = pl.Series(["CA", "OR", "WA", "TX", "NY"])
# when location_1 is converted to a DataFrame, it will not have a column name

# from a list, with the name "location"
location_2 = pl.Series("location", ["CA", "OR", "WA", "TX", "NY"])

print("Location Type: Series 1: ", location_1)
print("Location Type: Series 2: ", location_2)

location_1_df = pl.DataFrame(location_1)
location_2_df = pl.DataFrame(location_2)
print("Location Type: DataFrame 1: ", location_1_df)
print("Location Type: DataFrame 2: ", location_2_df)
# type(location_1_df["location"])  # would raise an error, because the location_1 Series has no column name
type(location_2_df["location"]), type(location_1), type(location_2)

In [58]:
# Creating DataFrame from a dict or a collection of dicts.
# let's create a more sophisticated DataFrame

import random
from datetime import datetime

location = pl.DataFrame({
    "ID":  list(range(11, 23)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "Description" : ["Boston","New York","Philadelphia","Cleveland","Richmond", "Atlanta","Chicago","St. Louis","Minneapolis","Kansas City", "Dallas","San Francisco"],
    "Region": ["Region A","Region B","Region C","Region D"] * 3,
    "Location_Type" : "Physical",
    "Location_Category" : ["Ship","Recv","Mfg"] * 4
})
location

| ID | AS_OF_DATE | Description | Region | Location_Type | Location_Category |
| --- | --- | --- | --- | --- | --- |
| i64 | datetime[μs] | str | str | str | str |
| 11 | 2022-01-01 00:00:00 | "Boston" | "Region A" | "Physical" | "Ship" |
| 12 | 2022-01-01 00:00:00 | "New York" | "Region B" | "Physical" | "Recv" |
| 13 | 2022-01-01 00:00:00 | "Philadelphia" | "Region C" | "Physical" | "Mfg" |
| 14 | 2022-01-01 00:00:00 | "Cleveland" | "Region D" | "Physical" | "Ship" |
| 15 | 2022-01-01 00:00:00 | "Richmond" | "Region A" | "Physical" | "Recv" |
| 16 | 2022-01-01 00:00:00 | "Atlanta" | "Region B" | "Physical" | "Mfg" |
| 17 | 2022-01-01 00:00:00 | "Chicago" | "Region C" | "Physical" | "Ship" |
| 18 | 2022-01-01 00:00:00 | "St. Louis" | "Region D" | "Physical" | "Recv" |
| 19 | 2022-01-01 00:00:00 | "Minneapolis" | "Region A" | "Physical" | "Mfg" |
| 20 | 2022-01-01 00:00:00 | "Kansas City" | "Region B" | "Physical" | "Ship" |


## How to use the Lazy API

---

Ideally, we start the lazy API directly from a file, because the query optimizer can then help reduce the amount of data read from it. Polars provides lazy readers for this, such as:
- `scan_csv`



```
import polars as pl

DATA_DIR = "data"  # assumption: folder that holds reddit.csv; adjust to your setup

q1 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
)
```

If we ran the code above on the Reddit CSV, the query would not be evaluated. Instead, Polars takes each line of code, adds it to the internal query graph, and optimizes that graph. To execute the query and get a DataFrame back, call `.collect()` on it:

```
import polars as pl

DATA_DIR = "data"  # assumption: folder that holds reddit.csv; adjust to your setup

q4 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .collect()
)
```

## Execution on larger-than-memory (RAM) data analytics

---
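If your data requires more memory than you have available, Polars may be able to process it in batches using streaming mode. To use streaming mode, pass the `streaming=True` argument to `collect`. Recent Polars releases deprecate this flag in favor of `collect(engine="streaming")`, so check which form your installed version expects.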

```
import polars as pl

DATA_DIR = "data"  # assumption: folder that holds reddit.csv; adjust to your setup

q5 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .collect(streaming=True)
)
```

## Execution on a partial dataset
---

While you're writing, optimizing or checking your query on a large dataset, querying all available data may lead to a slow development process.

You can instead execute the query with the .fetch method. The .fetch method takes a parameter n_rows and tries to 'fetch' that number of rows at the data source. The number of rows cannot be guaranteed, however, as the lazy API does not count how many rows there are at each stage of the query.

Here we "fetch" 100 rows from the source file and apply the predicates.

```
import polars as pl

DATA_DIR = "data"  # assumption: folder that holds reddit.csv; adjust to your setup

q9 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .fetch(n_rows=100)
)
```

- TODO: cover streaming topic
- TODO: cover sinking to a file
- TODO: all topics from Lazy API Chapter

## Query Optimization
---

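`df.describe_optimized_plan()` returns the optimized query plan of a lazy query as text. In recent Polars releases, the equivalent call is `LazyFrame.explain()`.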