### Cell 1: Imports & Loading

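Polars doesn't ship with built-in sample datasets, so we load Titanic through seaborn (which returns a pandas DataFrame) and convert it with pl.from_pandas(). The conversion requires pyarrow to be installed alongside pandas.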

In [None]:
#%pip install polars

Note: you may need to restart the kernel to use updated packages.


In [5]:
import polars as pl
import seaborn as sns
from sklearn.model_selection import train_test_split

# 1. Load data via Seaborn (returns Pandas)
pandas_df = sns.load_dataset('titanic')

# 2. Convert to Polars (The "Pro" Switch)
df = pl.from_pandas(pandas_df)

print("Polars Data Loaded. Shape:", df.shape)
print(df.head())

Polars Data Loaded. Shape: (891, 15)
shape: (5, 15)
┌──────────┬────────┬────────┬──────┬───┬──────┬─────────────┬───────┬───────┐
│ survived ┆ pclass ┆ sex    ┆ age  ┆ … ┆ deck ┆ embark_town ┆ alive ┆ alone │
│ ---      ┆ ---    ┆ ---    ┆ ---  ┆   ┆ ---  ┆ ---         ┆ ---   ┆ ---   │
│ i64      ┆ i64    ┆ str    ┆ f64  ┆   ┆ cat  ┆ str         ┆ str   ┆ bool  │
╞══════════╪════════╪════════╪══════╪═══╪══════╪═════════════╪═══════╪═══════╡
│ 0        ┆ 3      ┆ male   ┆ 22.0 ┆ … ┆ null ┆ Southampton ┆ no    ┆ false │
│ 1        ┆ 1      ┆ female ┆ 38.0 ┆ … ┆ C    ┆ Cherbourg   ┆ yes   ┆ false │
│ 1        ┆ 3      ┆ female ┆ 26.0 ┆ … ┆ null ┆ Southampton ┆ yes   ┆ true  │
│ 1        ┆ 1      ┆ female ┆ 35.0 ┆ … ┆ C    ┆ Southampton ┆ yes   ┆ false │
│ 0        ┆ 3      ┆ male   ┆ 35.0 ┆ … ┆ null ┆ Southampton ┆ no    ┆ true  │
└──────────┴────────┴────────┴──────┴───┴──────┴─────────────┴───────┴───────┘


### Cell 2: The "Sanity Check" (Audit)

Polars provides glimpse(), which is often more useful than pandas' info() because it shows sample values next to each column's dtype.

In [6]:
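# Quick audit of structure and types
print("--- Glimpse ---")
# glimpse() prints its summary directly and returns None by default,
# so wrapping it in print() would add a stray "None" line
df.glimpse()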

# Check for duplicates
n_dupes = df.is_duplicated().sum()
print(f"\nDuplicate Rows: {n_dupes}")

# Remove duplicates (distinct)
if n_dupes > 0:
    df = df.unique()

# Check cardinality (Unique values)
print("\n--- Unique Values per Column ---")
# select(pl.all().n_unique()) runs this check on every column in parallel
print(df.select(pl.all().n_unique()))

--- Glimpse ---
Rows: 891
Columns: 15
$ survived     <i64> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1
$ pclass       <i64> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2
$ sex          <str> 'male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'female'
$ age          <f64> 22.0, 38.0, 26.0, 35.0, 35.0, null, 54.0, 2.0, 27.0, 14.0
$ sibsp        <i64> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1
$ parch        <i64> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0
$ fare         <f64> 7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708
$ embarked     <str> 'S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C'
$ class        <cat> Third, First, Third, First, Third, Third, First, Third, Third, Second
$ who          <str> 'man', 'woman', 'woman', 'woman', 'man', 'man', 'man', 'child', 'woman', 'child'
$ adult_male  <bool> True, False, False, False, True, True, True, False, False, False
$ deck         <cat> null, C, null, C, null, null, E, null, null, null
$ embark_town  <str> 'Southampton', 'Cherbourg', 'Southa

### Cell 3: Statistical Analysis (Outliers)

Notice the syntax change from pandas: rows are selected with an explicit .filter() call rather than boolean indexing (df[mask]).

In [7]:
# Calculate Stats for 'fare'
stats = df.select([
    pl.col("fare").mean().alias("mean"),
    pl.col("fare").std().alias("std"),
    pl.col("fare").max().alias("max")
])

mean_fare = stats["mean"][0]
std_fare = stats["std"][0]

# Filter for Outliers (3 Sigma rule)
# We use .filter( condition )
outliers = df.filter(
    (pl.col("fare") - mean_fare).abs() > (3 * std_fare)
)

print(f"Number of extreme outliers: {outliers.height}")

Number of extreme outliers: 20


### Cell 4: Feature Engineering (The Expression API)

This is where Polars shines. Instead of np.where, we use the readable chain: when().then().otherwise().

In [8]:
# In Polars, we use .with_columns() to add/modify columns
df = df.with_columns([
    # logic 1: Family Size
    (pl.col("sibsp") + pl.col("parch")).alias("family_size")
])

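# logic 2: Is Alone? & Age Group
# Several columns can be created in one .with_columns() call, and Polars evaluates them in parallel.
# family_size was added in its own call above because expressions in the same call cannot see each other's output.
df = df.with_columns([
    pl.when(pl.col("family_size") == 0)
      .then(1)
      .otherwise(0)
      .alias("is_alone"),

    # Note: a null age fails both comparisons, so it falls through to "senior"
    pl.when(pl.col("age") < 12).then(pl.lit("child"))
      .when(pl.col("age") < 60).then(pl.lit("adult"))
      .otherwise(pl.lit("senior"))
      .alias("age_group")
])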

print(df.select(["age", "age_group", "is_alone"]).head())

shape: (5, 3)
┌──────┬───────────┬──────────┐
│ age  ┆ age_group ┆ is_alone │
│ ---  ┆ ---       ┆ ---      │
│ f64  ┆ str       ┆ i32      │
╞══════╪═══════════╪══════════╡
│ 18.0 ┆ adult     ┆ 0        │
│ 18.0 ┆ adult     ┆ 0        │
│ 18.0 ┆ adult     ┆ 0        │
│ 58.0 ┆ adult     ┆ 1        │
│ null ┆ senior    ┆ 1        │
└──────┴───────────┴──────────┘


### Cell 5: Smart Cleaning

Polars handles nulls explicitly. We remove the 'deck' column with drop(), fill missing ages with the median, and drop the rows that have no 'embark_town'.

In [9]:
# Check null counts
print(df.null_count())

# DECISION LOGIC:
# 1. Drop 'deck'
df_clean = df.drop("deck")

# 2. Impute 'age' with Median
# We calculate the median first, then fill
age_median = df_clean.select(pl.col("age").median()).item()
df_clean = df_clean.with_columns(
    pl.col("age").fill_null(age_median)
)

# 3. Drop rows where 'embark_town' is null
df_clean = df_clean.drop_nulls(subset=["embark_town"])

print("Shape after cleaning:", df_clean.shape)

shape: (1, 18)
┌──────────┬────────┬─────┬─────┬───┬───────┬─────────────┬──────────┬───────────┐
│ survived ┆ pclass ┆ sex ┆ age ┆ … ┆ alone ┆ family_size ┆ is_alone ┆ age_group │
│ ---      ┆ ---    ┆ --- ┆ --- ┆   ┆ ---   ┆ ---         ┆ ---      ┆ ---       │
│ u32      ┆ u32    ┆ u32 ┆ u32 ┆   ┆ u32   ┆ u32         ┆ u32      ┆ u32       │
╞══════════╪════════╪═════╪═════╪═══╪═══════╪═════════════╪══════════╪═══════════╡
│ 0        ┆ 0      ┆ 0   ┆ 106 ┆ … ┆ 0     ┆ 0           ┆ 0        ┆ 0         │
└──────────┴────────┴─────┴─────┴───┴───────┴─────────────┴──────────┴───────────┘

### Cell 6: Mathematical Transformations

Polars expressions include built-in math functions such as log1p(), so there is no need to drop into NumPy for common transformations.

In [None]:
# Log Transform Fare
# pl.col("fare").log1p() is the equivalent of np.log1p()
df_clean = df_clean.with_columns(
    pl.col("fare").log1p().alias("fare_log")
)

# Show variance comparison
# We use .var() aggregation
print(df_clean.select([
    pl.col("fare").var().alias("Original Variance"),
    pl.col("fare_log").var().alias("Log Variance")
]))

### Cell 7: Final Prep (Encoding & Splitting)

Polars provides to_dummies() for one-hot encoding. Scikit-learn generally expects NumPy arrays or pandas objects, so we convert back to pandas right at the end.

In [None]:
# 1. Binary Encode Sex (Male=0, Female=1)
# cast(pl.Int8) turns the boolean (True/False) into 1/0
df_clean = df_clean.with_columns(
    (pl.col("sex") == "female").cast(pl.Int8).alias("sex_binary")
)

# 2. One Hot Encode Town & Class
df_final = df_clean.to_dummies(["embark_town", "class"], drop_first=True)

# 3. Select Features
features = ['pclass', 'sex_binary', 'age', 'sibsp', 'parch', 'fare_log', 'is_alone']

# 4. Extract to Numpy/Pandas for Scikit-Learn Compatibility
# (Most ML libraries still expect standard arrays)
X = df_final.select(features).to_pandas()
y = df_final.select("survived").to_pandas()['survived']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

#### Key Polars "Gotchas" to remember:

pl.col("name"): Inside expressions you refer to a column with pl.col("name") rather than indexing the frame directly (like df['name']).

with_columns: Polars DataFrames do not support df['new_col'] = x. Use df = df.with_columns(...), which returns a new DataFrame and lets Polars evaluate the expressions in parallel.

alias: An expression keeps the name of its input column, so math on a column inside with_columns overwrites that column unless you rename the result with .alias("new_name").