# Polars basics

**Polars** is a high-performance DataFrame library for Python (and Rust) designed for fast analytics. Compared to pandas, Polars is typically faster and more memory-efficient because of the following design choices:
* Columnar execution: operations work per column (great for analytics).
* Expression-based API: you write what you want (filters, aggregates) and Polars optimizes it.
* Parallelism: many operations run multi-threaded automatically.
* Lazy execution: Polars builds a query plan first, then executes it efficiently (it pushes filters and column selection down so it reads less data).

## Prerequisites

If you don't have them, install the following:
* Python (https://www.python.org/downloads/windows/)
* Jupyter Notebook or JupyterLab (https://jupyter.org/install), or Google Colab (https://colab.google/)

## Installing Polars

In [9]:
pip install polars

Note: you may need to restart the kernel to use updated packages.




## Eager read

In [13]:
import polars as pl

df = pl.read_csv("data.csv")
print(df.shape)


(569, 33)


## Lazy scan (recommended for bigger CSVs)

In [14]:
import polars as pl

lf = pl.scan_csv("data.csv")
print(lf)                 # shows a LazyFrame (plan, not data)
df = lf.collect()         # executes the plan
print(df.head())

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

Csv SCAN [data.csv]
PROJECT */33 COLUMNS
ESTIMATED ROWS: 580
shape: (5, 33)
┌──────────┬───────────┬─────────────┬──────────────┬───┬──────────────┬────────────────┬───────────────────────┬──────┐
│ id       ┆ diagnosis ┆ radius_mean ┆ texture_mean ┆ … ┆ concave      ┆ symmetry_worst ┆ fractal_dimension_wor ┆      │
│ ---      ┆ ---       ┆ ---         ┆ ---          ┆   ┆ points_worst ┆ ---            ┆ st                    ┆ ---  │
│ i64      ┆ str       ┆ f64         ┆ f64          ┆   ┆ ---          ┆ f64            ┆ ---                   ┆ str  │
│          ┆           ┆             ┆              ┆   ┆ f64          ┆                ┆ f64                   ┆      │
╞══════════╪═══════════╪═════════════╪══════════════╪═══╪══════════════╪════════════════╪═══════════════════════╪══════╡
│ 842302   ┆ M         ┆ 17.99       ┆ 10.38        ┆ … ┆ 0.2654       ┆ 0.4601         ┆ 0.1189                ┆ null 

Polars has two plans for lazy queries:

* Naive plan = what you wrote (direct translation of your operations)
* Optimized plan = what Polars will actually execute after optimizations

LazyFrame.explain(optimized=True) shows the optimized plan, where Polars may:
* remove unused columns (projection pushdown)
* push filters earlier (predicate pushdown)
* reorder steps for speed
  
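Csv SCAN [data.csv] means Polars uses the CSV file as the data source. Because scan_csv is lazy, the file is not loaded immediately; Polars reads it only when you execute the plan with .collect() (for example lf.collect() or lf.head().collect(); older Polars versions also provided .fetch(), which has since been deprecated).

ESTIMATED ROWS: 580 is Polars’ best guess at the row count before it has read the whole file. It is an estimate, not a guarantee: the eager read above showed that the file actually has 569 rows.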


## Read everything vs select needed columns

In [16]:
lf = pl.scan_csv("data.csv")
print(lf.explain(optimized=True))


Csv SCAN [data.csv]
PROJECT */33 COLUMNS
ESTIMATED ROWS: 580


* Csv SCAN: Polars will read from data.csv
* PROJECT */33 COLUMNS: it will read/use all 33 columns (because your lazy pipeline didn’t restrict columns)

In [18]:
lf = pl.scan_csv("data.csv").select(["diagnosis", "radius_mean", "texture_mean"])
print(lf.explain(optimized=True))

Csv SCAN [data.csv]
PROJECT 3/33 COLUMNS
ESTIMATED ROWS: 580


PROJECT 3/33 COLUMNS: Polars will read only 3 of the 33 columns, because the select was pushed down into the CSV scan (projection pushdown).

## Pandas vs Polars

pandas is widely adopted and known for its flexibility and ease of use. However, most pandas operations run on a single thread, so processing time on large datasets can grow long enough to become a bottleneck.

Polars is designed to handle large datasets efficiently. Lazy evaluation lets it skip work that a query does not need, and it distributes computations across multiple CPU cores, which often yields substantial speedups on large data.

## Other Polars operations

In [21]:
# Load diamond data from a CSV file
df = pl.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')


# Select specific columns: carat, cut, and price
selected_df = df.select(['Carat Weight', 'Cut', 'Price'])


# show selected_df head
selected_df.head()

| Carat Weight | Cut | Price |
| --- | --- | --- |
| f64 | str | i64 |
| 1.1 | "Ideal" | 5169 |
| 0.83 | "Ideal" | 3470 |
| 0.85 | "Ideal" | 3183 |
| 0.91 | "Ideal" | 4370 |
| 0.83 | "Ideal" | 3171 |


In [22]:
# Load diamond data from a CSV file
df = pl.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')


# filter the df with condition
filtered_df = df.filter(pl.col('Carat Weight') > 2.0)


# show filtered_df head
filtered_df.head()

| Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f64 | str | str | str | str | str | str | i64 |
| 2.11 | "Ideal" | "H" | "SI1" | "VG" | "VG" | "GIA" | 18609 |
| 2.51 | "Very Good" | "G" | "VS2" | "VG" | "VG" | "GIA" | 34361 |
| 2.2 | "Ideal" | "H" | "VS2" | "EX" | "VG" | "GIA" | 22241 |
| 2.6 | "Ideal" | "G" | "VS2" | "EX" | "EX" | "GIA" | 37621 |
| 2.02 | "Good" | "I" | "VVS2" | "EX" | "VG" | "GIA" | 19756 |


### Handling missing values

In [23]:

# Load diamond data from a CSV file
df = pl.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# drop missing values
cleaned_df = df.drop_nulls()

# show cleaned_df head
cleaned_df.head()

| Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f64 | str | str | str | str | str | str | i64 |
| 1.1 | "Ideal" | "H" | "SI1" | "VG" | "EX" | "GIA" | 5169 |
| 0.83 | "Ideal" | "H" | "VS1" | "ID" | "ID" | "AGSL" | 3470 |
| 0.85 | "Ideal" | "H" | "SI1" | "EX" | "EX" | "GIA" | 3183 |
| 0.91 | "Ideal" | "E" | "SI1" | "VG" | "VG" | "GIA" | 4370 |
| 0.83 | "Ideal" | "G" | "SI1" | "EX" | "EX" | "GIA" | 3171 |


## Advantages of Polars for large datasets

Polars offers several key advantages over pandas, particularly when dealing with large datasets:

* **Performance**: Polars is often significantly faster than pandas, thanks to its Rust backend and parallel processing. This speedup matters most on large datasets, where processing time is the bottleneck.
* **Memory Efficiency**: Polars stores data in a columnar, Apache Arrow-based memory format, which can use memory more efficiently than pandas, especially when the data strains the available RAM.
* **Lazy Evaluation**: Polars' lazy API defers computation until a result is requested, which avoids unnecessary work and can improve performance significantly.
* **Immutability**: Operations such as select, filter, and with_columns return a new DataFrame instead of modifying the original in place. This guards against accidental in-place modifications and encourages a functional style that tends to be more predictable and maintainable.
* **Expressive API**: Polars provides an expression-based API for data manipulation, so complex operations can be written as concise, readable code.
* **Query Optimization**: In lazy mode, Polars automatically optimizes the query plan (for example through predicate and projection pushdown) before executing it.

# From CSV to Classifier: Fast Data Preparation with Polars and Logistic Regression

In [25]:
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


df = pl.read_csv("https://raw.githubusercontent.com/kishan0725/Breast-Cancer-Wisconsin-Diagnostic/master/data.csv")

print("Shape:", df.shape)
print("Columns:", df.columns[:8], "...")

# Cleaning: drop columns that contain only nulls (e.g. a trailing empty CSV column)
all_null_cols = [c for c in df.columns if df.select(pl.col(c).is_null().all()).item()]
if all_null_cols:
    df = df.drop(all_null_cols)

# Drop ID column if present
if "id" in df.columns:
    df = df.drop("id")

# Target encoding: diagnosis (M=malignant, B=benign) -> 1/0
if "diagnosis" not in df.columns:
    raise ValueError("Expected a 'diagnosis' column in the CSV.")

df = df.with_columns(
    pl.when(pl.col("diagnosis") == "M").then(1).otherwise(0).alias("label")
).drop("diagnosis")

# build X, y (convert to numpy for scikit-learn)
X = df.drop("label").to_numpy()
y = df["label"].to_numpy()

print("X shape:", X.shape, "y shape:", y.shape)

# train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Model pipeline: scaling + logistic regression
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, random_state=42))
])

model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("\nAccuracy:", acc)
print("\nConfusion matrix:\n", cm)
print("\nClassification report:\n", classification_report(y_test, y_pred, target_names=["Benign(0)", "Malignant(1)"]))

# Optional sanity check: show the feature count
print("\nNumber of features:", X.shape[1])

Shape: (569, 33)
Columns: ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean'] ...
X shape: (569, 30) y shape: (569,)

Accuracy: 0.9649122807017544

Confusion matrix:
 [[71  1]
 [ 3 39]]

Classification report:
               precision    recall  f1-score   support

   Benign(0)       0.96      0.99      0.97        72
Malignant(1)       0.97      0.93      0.95        42

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114


Number of features: 30
