# Getting Started with Python: Exploratory Data Analysis (Polars)

**Author:** Alp Tezbasaran  
**Date:** 2026-02-02

## 0. Accessibility: Theme and Text Size

A few quick adjustments improve readability during live teaching.

- **Tools -> Settings -> Theme**: Light, Dark, or System default.
- **Tools -> Settings -> Editor -> Font size**: increase or decrease.
- **Page zoom**: Cmd/Ctrl + '+' or '-' to zoom.
- **Output readability**: In Settings -> Editor, enable monospace output if helpful.

Teaching tip: Agree on a standard zoom and theme at the start.

## Learning Outcomes

By the end of this workshop, you will be able to:

- Read tabular data with Polars
- Inspect shape, schema, and missing values
- Make simple summaries and grouped statistics
- Perform light cleaning and feature creation
- Create a few quick plots for exploration

## 1. Setup

We will use **Polars** for all data work. If you are in Colab, install it once per session.

In [None]:
# If you are using Colab, uncomment the next line
# !pip -q install polars

import polars as pl
import matplotlib.pyplot as plt
from pathlib import Path

data_dir = Path("data")

plt.style.use("ggplot")

## Polars vs. pandas (quick note)

- **Polars** is a fast DataFrame library written in Rust with a Python API.
- **pandas** is the classic DataFrame library in Python.
- We use **Polars only** in this workshop for speed and a clean, modern API.

## Eager vs. Lazy in Polars (and pandas)

- **Eager** runs each step immediately (similar to pandas).
- **Lazy** builds a query plan and runs it when you call `collect()`.
- Lazy can be faster because Polars can optimize the whole pipeline.

In this workshop we use **eager** operations to keep things simple.

### Lazy mode example (very basic)

This builds a plan first, then runs only when you call `collect()`.
This helps with large files (e.g., geospatial data): Polars can check the query
up front, so you do not wait on a long computation that fails late.

In [None]:
lazy_example = (
    pl.scan_csv(data_dir / "NCSU_Mascots_v1.csv")
    .filter(pl.col("Species").is_not_null())
    .group_by("Species")
    .agg(pl.len().alias("n"))
    .sort("n", descending=True)
)

# No work has happened yet: lazy_example is a LazyFrame that only holds the plan.
print(type(lazy_example))

# Now run the plan and preview the result:
lazy_example.collect().head()

### Optional: view the query plan

A query plan is the step-by-step recipe Polars will follow to get your result.
Polars can show the optimized plan before execution.

In [None]:
# explain() returns the plan as a string; print() keeps the line breaks readable
print(lazy_example.explain())

## 2. Data Sources

We will use three small datasets:

- `NCSU_Mascots_v1.csv` (synthetic)
- `NCSU Celebrity Graduates_v1.csv` (synthetic)
- `penguins.csv` (public dataset, light cleaning demo)

Always treat data about people responsibly, even when synthetic.

In [None]:
mascots = pl.read_csv(data_dir / "NCSU_Mascots_v1.csv")
celebs = pl.read_csv(data_dir / "NCSU Celebrity Graduates_v1.csv")
penguins_raw = pl.read_csv(data_dir / "penguins.csv")

(mascots.shape, celebs.shape, penguins_raw.shape)

## 3. First Look at a Dataset

We start by checking rows, columns, and the schema, then preview a few rows.

In [None]:
print("Rows, cols:", mascots.shape)
print(mascots.schema)
mascots.head()

## 4. Missing Values and Summaries

Missing values are common. We check them early and often.

### Count missing values

See how many missing values each column has.

In [None]:
mascots.null_count()

### Summary statistics

Get quick summary statistics (count, nulls, mean, spread, and quantiles) for each column.

In [None]:
mascots.describe()

### Try it yourself

Which columns in `mascots` have the most missing values?

In [None]:
# Use null_count() and sorting to find the top missing columns



## 5. Cleaning Demo with Penguins

We will standardize text, check missing values, and make a small feature.

In [None]:
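penguins = (
    penguins_raw
    # Standardize text: lowercase and trim surrounding whitespace
    .with_columns(
        pl.col("sex")
        .str.to_lowercase()
        .str.strip_chars()
        .alias("sex")
    )
    # Replace missing values with the literal label "unknown".
    # pl.lit() is required: a bare string in then() is read as a column name.
    .with_columns(
        pl.when(pl.col("sex").is_null())
        .then(pl.lit("unknown"))
        .otherwise(pl.col("sex"))
        .alias("sex")
    )
)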

penguins.null_count()

### Create a simple feature

Drop rows missing key measurements and create a bill length-to-depth ratio feature.

In [None]:
penguins_clean = penguins.drop_nulls(
    subset=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
)

penguins_clean = penguins_clean.with_columns(
    (pl.col("bill_length_mm") / pl.col("bill_depth_mm")).alias("bill_ratio")
)

penguins_clean.head()

### Try it yourself

Filter to a single island and compute the average body mass.

In [None]:
# Example: filter to "Biscoe" and compute mean body_mass_g



## 6. EDA: NCSU Mascots

We start with counts and a quick bar chart.

In [None]:
species_counts = (
    mascots.group_by("Species")
    .len()
    .sort("len", descending=True)
)

species_counts

### Plot species counts

Visualize the species counts with a simple bar chart.

In [None]:
plt.figure(figsize=(8, 4))
plt.bar(species_counts["Species"].to_list(), species_counts["len"].to_list())
plt.xticks(rotation=45, ha="right")
plt.ylabel("Count")
plt.title("Mascots by Species")
plt.tight_layout()
plt.show()

### Try it yourself

Compute average height and weight by species.

In [None]:
# group_by("Species") and compute mean Height and Weight



## 7. EDA: Celebrity Graduates

We will examine GPA distributions and compare colleges.

In [None]:
celebs_clean = celebs.with_columns(
    [
        pl.col("GPA").cast(pl.Float64),
        pl.col("Workstudy Hourly Rate").cast(pl.Float64),
        pl.col("Loved Library?").cast(pl.Int64),
    ]
)

celebs_clean.select(pl.col("GPA")).describe()

### Average GPA by college

Compute average GPA for each college.

In [None]:
gpa_by_college = (
    celebs_clean.group_by("College")
    .agg(pl.col("GPA").mean().alias("avg_gpa"))
    .sort("avg_gpa", descending=True)
)

gpa_by_college

### Plot GPA distribution

A quick histogram shows the spread of GPAs.

In [None]:
gpa_values = celebs_clean["GPA"].drop_nulls().to_list()

plt.figure(figsize=(6, 4))
plt.hist(gpa_values, bins=12)
plt.xlabel("GPA")
plt.ylabel("Count")
plt.title("GPA Distribution")
plt.tight_layout()
plt.show()

### Try it yourself

Compare average workstudy hourly rate by position.

In [None]:
# group_by("Workstudy Position") and compute mean "Workstudy Hourly Rate"



## 8. EDA: Penguins

We will compare species counts and flipper lengths.

### Summary Table

In [None]:
penguin_summary = penguins_clean.group_by("species").agg(
    [
        pl.len().alias("n"),
        pl.col("body_mass_g").mean().alias("avg_mass_g"),
        pl.col("flipper_length_mm").mean().alias("avg_flipper_mm"),
    ]
)

penguin_summary

### Flipper Length Plot

In [None]:
plt.figure(figsize=(7, 4))
# unique() does not guarantee order, so sort for a stable legend
for species in penguins_clean["species"].unique().sort().to_list():
    values = (
        penguins_clean
        .filter(pl.col("species") == species)
        .select("flipper_length_mm")
        .to_series()
        .to_list()
    )
    plt.hist(values, alpha=0.5, label=species)

plt.legend()
plt.xlabel("Flipper length (mm)")
plt.ylabel("Count")
plt.title("Flipper Length by Species")
plt.tight_layout()
plt.show()

## Wrap-up

- You loaded data with Polars, checked missing values, and built summaries.
- You created simple features and quick plots.
- Next steps: try new questions, build cleaner visuals, and document insights.