## The Ultimate Guide to Mastering the Python Polars DataFrame
---
This guide aims to give you the knowledge and skills you need to use Polars DataFrames in Python effectively for finance and supply chain analytics.

It gives an overview of the most commonly used functions and capabilities of the package.


## Table of contents
---

- Why another DataFrame
- Installation
- Finance and Supply chain Data Analytics
- Creating DataFrame
- Data Types
- IO
- Lazy API
- About Expressions & Contexts
    - select
    - group by
- Polars SQL

## Why another DataFrame
---
Many mature dataframe packages are already available, but Polars, which is written in Rust, is designed for very fast execution of complex data science operations on tabular datasets. Its main strengths are:

- Execution on larger-than-memory (RAM) data analytics
- Lazy API vs Eager execution
- Automatic Optimization
- Embarrassingly Parallel
- Easy-to-learn, consistent, predictable API with a strict schema
- SQL-like expressions

#### Efficient Execution of Analytics on Larger-than-Memory (RAM) Data

Memory is rarely the bottleneck these days, since most computers and VMs offer gigabytes of RAM cheaply. Affordable RAM is the main reason Pandas-like in-memory DataFrames remain the go-to choice, and Pandas and R tables are unlikely to become obsolete anytime soon.

However, Polars DataFrames are gaining popularity among developers because they combine an optimizing query engine (in the spirit of engines such as Apache Spark and DuckDB), the Apache Arrow columnar memory format, and the ease of use of Pandas-like dataframe functionality.

Additionally, Polars has built-in multi-core, multi-threaded parallel processing, so many operations use all available CPU cores without extra code.

#### Lazy API vs Eager execution

"Lazy" does not mean slow, and "eager" does not mean fast. Eager execution runs each operation as soon as it is called. A lazy API instead records the operations, optimizes the query as a whole, and only then executes it, which often results in improved performance.

To illustrate this concept, consider running SQL on an RDBMS. If the statistics, indexes, and data partitions are well maintained and the SQL is written to take advantage of them, the results are delivered more quickly.

#### Automatic Optimization

We will learn a few techniques for optimizing queries automatically and efficiently.

#### Embarrassingly Parallel

#### Easy-to-learn, consistent, predictable API with a strict schema

#### SQL-like expressions

`Let's get started`
## Installation
---


In [None]:
pip install -U polars

## Finance and Supply chain Data Analytics
---
TODO:
As stated above, our objective is to learn data science operations on finance and supply chain datasets, so we will focus on creating
a few real-life examples that resemble finance and supply chain data.

For more information, see [Finance and Supply chain ERP data](https://amitxshukla.github.io/GeneralLedger.jl/tutorials/erd/).

The objective of the following section is to understand ERP GL-like data.

A sample data structure and entity-relationship diagram (ERD) is shown below.

![ERD Diagram](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/gl_erd.png)

## Creating Polars DataFrame
---

In [None]:
#########################
## Series and DataFrames
#########################
import polars as pl

# from a list, without a name
location_1 = pl.Series(["CA", "OR", "WA", "TX", "NY"])
# location_1 has no name of its own, so when it is converted to a DataFrame
# Polars assigns a generated column name (for example column_0)

# from a list, with an explicit name
location_2 = pl.Series("location", ["CA", "OR", "WA", "TX", "NY"])

print("Location Type: Series 1: ", location_1)
print("Location Type: Series 2: ", location_2)

location_1_df = pl.DataFrame(location_1)
location_2_df = pl.DataFrame(location_2)
print("Location Type: DataFrame 1: ", location_1_df)
print("Location Type: DataFrame 2: ", location_2_df)
# type(location_1_df["location"])  # would raise an error, because the location_1 Series has no "location" column name
type(location_2_df["location"]), type(location_1), type(location_2)

In [None]:
# Creating a DataFrame from a dict or a collection of dicts.
# Let's create a more sophisticated DataFrame.
# In the real world, organizations maintain dozens of record structures to store
# different types of locations, such as ShipTo, Receiving, Mailing, Corp. office, head office,
# field office, etc.

## LOCATION DataFrame ##
import random
from datetime import datetime

location = pl.DataFrame({
    "ID":  list(range(11, 23)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "DESCRIPTION" : ["Boston","New York","Philadelphia","Cleveland","Richmond", "Atlanta","Chicago","St. Louis","Minneapolis","Kansas City", "Dallas","San Francisco"],
    "REGION": ["Region A","Region B","Region C","Region D"] * 3,
    "TYPE" : "Physical",
    "CATEGORY" : ["Ship","Recv","Mfg"] * 4
})
location

In [None]:
## ACCOUNTS DataFrame ##
import random
from datetime import datetime

accounts = pl.DataFrame({
    "ID":  list(range(10000, 45000, 1000)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "DESCRIPTION" : ["Operating Expenses","Non Operating Expenses","Assets","Liabilities","Net worth accounts", "Statistical Accounts","Revenue"] * 5,
    "REGION": ["Region A","Region B","Region C","Region D", "Region E"] * 7,
    "TYPE" : ["E","E","A","L","N","S","R"] * 5,
    "STATUS" : "Active",
    "CLASSIFICATION" : ["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"] * 5,
    "CATEGORY" : [
        "Travel","Payroll","non-Payroll","Allowance","Cash",
        "Facility","Supply","Services","Investment","Misc.",
        "Depreciation","Gain","Service","Retired","Fault.",
        "Receipt","Accrual","Return","Credit","ROI",
        "Cash","Funds","Invest","Transfer","Roll-over",
        "FTE","Members","Non_Members","Temp","Contractors",
        "Sales","Merchant","Service","Consulting","Subscriptions"
    ],
})
accounts

In [None]:
## DEPARTMENT DataFrame ##
import random
from datetime import datetime

dept = pl.DataFrame({
    "ID":  list(range(1000, 2500, 100)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "DESCRIPTION" : ["Sales & Marketing","Human Resource","Information Technology","Business leaders","other temp"] * 3,
    "REGION": ["Region A","Region B","Region C"] * 5,
    "STATUS" : "Active",
    "CLASSIFICATION" : ["SALES","HR", "IT","BUSINESS","OTHERS"] * 3,
    "TYPE" : ["S","H","I","B","O"] * 3,
    "CATEGORY" : ["sales","human_resource","IT_Staff","business","others"] * 3,
})
dept

In [199]:
## LEDGER DataFrame ##
import random
from datetime import datetime

sampleSize = 10
org = "ABC Inc."
ledger_type = "ACTUALS"  # BUDGET, STATS are other Ledger types
fiscal_year_from = 2020
fiscal_year_to = 2023
random.seed(123)

# ledger_type holds the label and ledger holds the DataFrame,
# so the DataFrame is not overwritten when the label changes later
ledger = pl.DataFrame({
    "LEDGER" : ledger_type,
    "ORG" : org,
    "FISCAL_YEAR": random.choices(list(range(fiscal_year_from, fiscal_year_to+1, 1)),k=sampleSize),
    "PERIOD": random.choices(list(range(1, 12+1, 1)),k=sampleSize),
    "ACCOUNT" : random.choices(accounts["ID"], k=sampleSize),
    "DEPT" : random.choices(dept["ID"], k=sampleSize),
    "LOCATION" : random.choices(location["ID"], k=sampleSize),
    "POSTED_TOTAL": random.sample(range(1000000), sampleSize)
})
ledger

| LEDGER | ORG | FISCAL_YEAR | PERIOD | ACCOUNT | DEPT | LOCATION | POSTED_TOTAL |
|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | i64 | i64 |
| "ACTUALS" | "ABC Inc." | 2020 | 5 | 41000 | 1500 | 19 | 860031 |
| "ACTUALS" | "ABC Inc." | 2020 | 5 | 13000 | 2200 | 18 | 86622 |
| "ACTUALS" | "ABC Inc." | 2021 | 3 | 14000 | 1300 | 18 | 958027 |
| "ACTUALS" | "ABC Inc." | 2020 | 1 | 37000 | 1900 | 21 | 510557 |
| "ACTUALS" | "ABC Inc." | 2023 | 6 | 10000 | 1700 | 16 | 690040 |
| "ACTUALS" | "ABC Inc." | 2020 | 2 | 41000 | 2200 | 19 | 274608 |
| "ACTUALS" | "ABC Inc." | 2022 | 8 | 30000 | 1400 | 15 | 178390 |
| "ACTUALS" | "ABC Inc." | 2021 | 1 | 19000 | 1500 | 11 | 351168 |
| "ACTUALS" | "ABC Inc." | 2023 | 4 | 39000 | 2100 | 20 | 353282 |
| "ACTUALS" | "ABC Inc." | 2020 | 6 | 36000 | 1700 | 13 | 581124 |


In [198]:
ledger_type = "BUDGET"  # ACTUALS, STATS are other Ledger types

# a separate label variable keeps the ACTUALS DataFrame in `ledger` intact
ledger_budg = pl.DataFrame({
    "LEDGER" : ledger_type,
    "ORG" : org,
    "FISCAL_YEAR": random.choices(list(range(fiscal_year_from, fiscal_year_to+1, 1)),k=sampleSize),
    "PERIOD": random.choices(list(range(1, 12+1, 1)),k=sampleSize),
    "ACCOUNT" : random.choices(accounts["ID"], k=sampleSize),
    "DEPT" : random.choices(dept["ID"], k=sampleSize),
    "LOCATION" : random.choices(location["ID"], k=sampleSize),
    "POSTED_TOTAL": random.sample(range(1000000), sampleSize)
})
ledger_budg

| LEDGER | ORG | FISCAL_YEAR | PERIOD | ACCOUNT | DEPT | LOCATION | POSTED_TOTAL |
|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | i64 | i64 |
| "BUDGET" | "ABC Inc." | 2022 | 5 | 25000 | 1800 | 14 | 629252 |
| "BUDGET" | "ABC Inc." | 2023 | 7 | 35000 | 1500 | 11 | 332428 |
| "BUDGET" | "ABC Inc." | 2022 | 5 | 35000 | 1100 | 17 | 87808 |
| "BUDGET" | "ABC Inc." | 2023 | 2 | 34000 | 2000 | 22 | 357715 |
| "BUDGET" | "ABC Inc." | 2021 | 3 | 25000 | 1300 | 17 | 581471 |
| "BUDGET" | "ABC Inc." | 2021 | 8 | 25000 | 1800 | 20 | 563800 |
| "BUDGET" | "ABC Inc." | 2023 | 6 | 37000 | 1100 | 21 | 139950 |
| "BUDGET" | "ABC Inc." | 2023 | 9 | 15000 | 2300 | 11 | 974519 |
| "BUDGET" | "ABC Inc." | 2021 | 6 | 21000 | 1300 | 14 | 933767 |
| "BUDGET" | "ABC Inc." | 2023 | 1 | 28000 | 2300 | 13 | 848318 |


In [201]:
# combined ledger for Actuals and Budget
pl.concat([ledger, ledger_budg], how="vertical")

| LEDGER | ORG | FISCAL_YEAR | PERIOD | ACCOUNT | DEPT | LOCATION | POSTED_TOTAL |
|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | i64 | i64 |
| "ACTUALS" | "ABC Inc." | 2020 | 5 | 41000 | 1500 | 19 | 860031 |
| "ACTUALS" | "ABC Inc." | 2020 | 5 | 13000 | 2200 | 18 | 86622 |
| "ACTUALS" | "ABC Inc." | 2021 | 3 | 14000 | 1300 | 18 | 958027 |
| "ACTUALS" | "ABC Inc." | 2020 | 1 | 37000 | 1900 | 21 | 510557 |
| "ACTUALS" | "ABC Inc." | 2023 | 6 | 10000 | 1700 | 16 | 690040 |
| "ACTUALS" | "ABC Inc." | 2020 | 2 | 41000 | 2200 | 19 | 274608 |
| "ACTUALS" | "ABC Inc." | 2022 | 8 | 30000 | 1400 | 15 | 178390 |
| "ACTUALS" | "ABC Inc." | 2021 | 1 | 19000 | 1500 | 11 | 351168 |
| "ACTUALS" | "ABC Inc." | 2023 | 4 | 39000 | 2100 | 20 | 353282 |
| "ACTUALS" | "ABC Inc." | 2020 | 6 | 36000 | 1700 | 13 | 581124 |


## How to use the Lazy API

---

In the ideal case we use the lazy API right from a file, as the query optimizer may help us reduce the amount of data we read from the file. For CSV files this is done with `scan_csv`:



```
import polars as pl

# Placeholder path (assumption): point this at the folder containing reddit.csv
DATA_DIR = "data"

q1 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
)
```

If we were to run the code above on the Reddit CSV, the query would not be evaluated. Instead, Polars takes each line of code, adds it to the internal query graph, and optimizes the query graph. To actually execute the query and get a DataFrame back, call `.collect()`:

```
import polars as pl

# Placeholder path (assumption): point this at the folder containing reddit.csv
DATA_DIR = "data"

q4 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .collect()
)
```

## Execution on larger-than-memory (RAM) data analytics

---
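If your data requires more memory than you have available, Polars may be able to process it in batches using streaming mode. To use streaming mode, pass the `streaming=True` argument to `collect` (newer Polars releases may expose this as `collect(engine="streaming")` instead, so check the documentation for your installed version):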

```
import polars as pl

# Placeholder path (assumption): point this at the folder containing reddit.csv
DATA_DIR = "data"

q5 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .collect(streaming=True)
)
```

#### Execution on a partial dataset

While you're writing, optimizing or checking your query on a large dataset, querying all available data may lead to a slow development process.

You can instead execute the query with the .fetch method. The .fetch method takes a parameter n_rows and tries to 'fetch' that number of rows at the data source. The number of rows cannot be guaranteed, however, as the lazy API does not count how many rows there are at each stage of the query.

Here we "fetch" 100 rows from the source file and apply the predicates.

```
import polars as pl

# Placeholder path (assumption): point this at the folder containing reddit.csv
DATA_DIR = "data"

q9 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .fetch(n_rows=100)
)
```

- TODO: cover streaming topic
- TODO: cover sinking to a file
- TODO: all topics from Lazy API Chapter

## Query Optimization
---

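`df.describe_optimized_plan()` returns a description of the optimized query plan for a lazy query. In recent Polars releases the same information is available through `LazyFrame.explain()`.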