# Session 20 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 157. Polars 
Polars is a DataFrame library implemented in Rust with Python bindings. It is designed for efficient data manipulation and analysis, and it is often faster than pandas, especially on large datasets.

***

# 158. Important Features
- Lazy and eager execution modes
- Multi-threaded and vectorized query engine
- Memory efficiency (zero-copy reads)
- Expressive API similar to pandas but with some key differences
- Out-of-core processing for datasets larger than memory

***

# 159. Basic Usage

***

## 159-1. Creating DataFrames

In [None]:
import polars as pl

# From a dictionary
df = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [1.1, 2.2, 3.3, 4.4, 5.5]
})

# From a list of tuples (orient="row" states that each tuple is one row)
df = pl.DataFrame([
    (1, "a", 1.1),
    (2, "b", 2.2),
    (3, "c", 3.3),
    (4, "d", 4.4),
    (5, "e", 5.5)
], schema=["A", "B", "C"], orient="row")

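# From a pandas DataFrame (requires pandas and pyarrow)
import pandas as pd
pdf = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
# Use a separate name so that `df` keeps columns A, B, and C for the next cells
df_from_pandas = pl.from_pandas(pdf)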

***

## 159-2. Basic Operations

In [None]:
# Selecting columns
df.select(["A", "B"])

# Filtering
df.filter(pl.col("A") > 3)

# Adding columns
df.with_columns((pl.col("A") * 10).alias("A_times_10"))

# Grouping and aggregating
df.group_by("B").agg(
    pl.col("A").sum(),
    pl.col("C").mean()
)

# Sorting
df.sort("A", descending=True)

***

# 160. Lazy API
Polars' lazy API builds a query plan and executes it only when needed, allowing for optimizations:

In [None]:
# Lazy evaluation
lf = pl.scan_csv("large_file.csv")  # Creates a LazyFrame

query = (
    lf
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.col("value").sum())
)

# Execution happens here
result = query.collect()

***

# 161. Advanced Features

***

## 161-1. Expressions
Polars uses a powerful expression syntax:

In [None]:
df.select([
    pl.col("A"),
    pl.col("B").str.to_uppercase().alias("B_upper"),
    (pl.col("A") * pl.col("C")).alias("A_times_C"),
    pl.when(pl.col("A") > 3).then(1).otherwise(0).alias("flag")
])

***

## 161-2. Joins

In [None]:
df1 = pl.DataFrame({"key": ["a", "b", "c"], "value1": [1, 2, 3]})
df2 = pl.DataFrame({"key": ["a", "b", "d"], "value2": [4, 5, 6]})

# Inner join
df1.join(df2, on="key", how="inner")

# Left join
df1.join(df2, on="key", how="left")

***

## 161-3. Temporal Data

In [None]:
df = pl.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "value": [1, 2, 3]
}).with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))

# Date operations
df.with_columns(
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month")
)

***

## 161-4. Missing Data

In [None]:
df = pl.DataFrame({
    "A": [1, None, 3],
    "B": ["x", None, "z"]
})

# Fill nulls
df.fill_null(0)  # For numeric columns
df.fill_null("missing")  # For string columns

# Drop nulls
df.drop_nulls()

***

# 162. Performance Considerations
- Use the lazy API for large datasets and complex operations
- Prefer expressions over iterative operations
- Use proper data types (e.g., categoricals for strings with low cardinality)
- Avoid converting to pandas unless necessary

***

# 163. Comparison with pandas
| Feature | Polars | pandas |
|---------|--------|--------|
| Backend | Rust | Python, Cython, and C (built on NumPy) |
| Execution | Lazy and eager | Eager only |
| Memory | More efficient | Less efficient |
| Parallelism | Multi-threaded by default | Mostly single-threaded |
| API | Expression-based | Method-chaining |
| Out-of-core | Supported | Limited support |

***

# 164. Polars vs pandas
Polars is a good fit when:
- You work with large datasets (GBs to TBs)
- You need high-performance operations
- You run complex data transformations
- Memory efficiency is important

pandas is a good fit when:
- You work with small to medium datasets
- You need a mature ecosystem (more integrations)
- You do interactive analysis where immediate feedback is valuable
- You maintain a legacy codebase

***

# 165. Integration
Polars works well with:
- Arrow (zero-copy conversion)
- NumPy (zero-copy in some cases, such as numeric data without nulls)
- pandas (conversion available)
- Connectors (SQL, Parquet, CSV, etc.)

In [None]:
# Convert to pandas
pd_df = df.to_pandas()

# Convert to numpy
array = df["A"].to_numpy()

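# Read from SQL using a connection URI
# (connection_uri is a placeholder string such as "postgresql://user:password@host:5432/db";
# this needs an installed driver, for example connectorx)
df = pl.read_database_uri("SELECT * FROM my_table", connection_uri)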

***


# Some Exercises

**1.** Create a Polars DataFrame with:
- A column "names" containing 5 country names
- A column "population" with their populations in millions
- A column "continent" specifying their continent
- Create this DataFrame both from a dictionary and from a list of tuples (two separate implementations).

***

**2.** Using the DataFrame from Exercise 1:
- Select only the "names" and "population" columns
- Filter countries with population greater than 50 million
- Add a new column "population_double" with doubled population values
- Sort the DataFrame by population in descending order

***

**3.** Create a LazyFrame by scanning a CSV file (or create one programmatically).

Build a query that:
- Filters rows based on a condition
- Groups by a categorical column
- Aggregates with at least two different operations

Then:
- Explain the query plan before executing it
- Collect the results

***

**4.** Create a DataFrame with:
- "product" (5 items)
- "price" (numeric values)
- "category" (strings)

Write a single expression that:
- Creates a "discounted_price" (10% off)
- Flags "premium" products (price > 100)
- Calculates price per character of product name
- Returns only products from a specific category

***

**5.** Create two DataFrames:
- DF1: Employee data (id, name, hire_date)
- DF2: Department data (id, dept_name, manager_id)

Then:
- Perform various joins (inner, left, outer) between them; in recent Polars versions the outer join is called `how="full"`
- Add a column showing years of service (from hire_date to today)
- Filter employees hired before 2020

***

**6.** Create a DataFrame with:
- Some missing values in numeric columns
- Some missing values in string columns

Then:
- Show different strategies for handling nulls (fill, drop, interpolate)
- Count nulls per column
- Create a boolean mask indicating rows with any null values
- Replace nulls in string columns with "Unknown" and in numeric columns with the column mean

***

**7.** Create a large DataFrame (>1M rows) with mixed data types.

Time operations in Polars (eager and lazy) vs pandas for:
- Groupby-aggregate
- Complex filtering
- Multi-column operations

Then:
- Compare memory usage between Polars and pandas for the same DataFrame
- Experiment with categorical data types in both libraries

***

**8.** Create a Polars DataFrame and convert it to:
- pandas DataFrame
- NumPy array (for a numeric column)
- Arrow Table

Write the DataFrame to, and read it back from:
- Parquet file
- CSV file
- SQL database (if available)

Then:
- Create a function that accepts either a Polars or a pandas DataFrame and processes it appropriately
- Demonstrate zero-copy conversion between Polars and Arrow

***

#                                                        🌞 https://github.com/AI-Planet 🌞