# Lab 01: Pandas Index, Series & DataFrame Fundamentals (≈30 minutes)

**Goal:** Build confident mental models for **Index**, **Series**, and **DataFrame**, and practice selecting, slicing, filtering, and reshaping by label vs. by position.

> ⏱️ **Timebox**: ~30 minutes total. Each task has an estimate—keep moving if you’re behind.

## Prereqs
- Python 3.9+
- `pandas` (any recent release):
  ```bash
  pip install pandas
  ```
- IDE or notebook (VS Code, Jupyter, Databricks, etc.)

In [52]:
# Setup (2 min)
import pandas as pd
import numpy as np
pd.__version__

'2.2.3'

In [53]:
# Create a small dataset we will reuse
orders_data = {
    "order_id":   [1001, 1002, 1003, 1004, 1005, 1006],
    "customer":   ["Ava", "Ben", "Ava", "Cara", "Ben", "Ava"],
    "city":       ["NYC", "Boston", "NYC", "Austin", "Boston", "NYC"],
    "quantity":   [2, 1, 4, 3, 2, 1],
    "unit_price": [12.5, 9.0, 12.5, 15.0, 9.0, 12.5],
    "order_date": ["2024-01-10", "2024-01-12", "2024-01-12", "2024-02-01", "2024-02-03", "2024-03-15"],
}
df = pd.DataFrame(orders_data)
df["order_date"] = pd.to_datetime(df["order_date"]) 
df.head()

| | order_id | customer | city | quantity | unit_price | order_date |
|---|---|---|---|---|---|---|
| 0 | 1001 | Ava | NYC | 2 | 12.5 | 2024-01-10 |
| 1 | 1002 | Ben | Boston | 1 | 9.0 | 2024-01-12 |
| 2 | 1003 | Ava | NYC | 4 | 12.5 | 2024-01-12 |
| 3 | 1004 | Cara | Austin | 3 | 15.0 | 2024-02-01 |
| 4 | 1005 | Ben | Boston | 2 | 9.0 | 2024-02-03 |


## Task 1 — Series basics (5 min)
A **Series** is a 1-D labeled array (values + index).

In [54]:
# 1) Create a Series with a custom index and a name.
s = pd.Series([3, 5, 7], index=["x", "y", "z"], name="scores")
s

x    3
y    5
z    7
Name: scores, dtype: int64

In [55]:
# 2) Access items by label and position
print(s.loc["y"])     # 5 (label-based)
print(s.iloc[1])       # 5 (position-based)
s.loc[["z","x"]]      # re-order by labels

5
5


z    7
x    3
Name: scores, dtype: int64

In [56]:
# 3) Inspect metadata and stats
print(s.index, s.dtype, s.name)
display(s.describe())
s.mean()

Index(['x', 'y', 'z'], dtype='object') int64 scores


count    3.0
mean     5.0
std      2.0
min      3.0
25%      4.0
50%      5.0
75%      6.0
max      7.0
Name: scores, dtype: float64

np.float64(5.0)

In [57]:
# 4) Build a Series from a dict and reindex
prices = pd.Series({"AAPL": 189.3, "MSFT": 418.2, "GOOG": 172.1}, name="price").rename_axis("ticker")
display(prices)
prices2 = prices.reindex(["MSFT", "AAPL", "AMZN", "GOOG"])  # introduces NaN for AMZN
display(prices2)
display(prices2.isna())
prices2.fillna(0.0)

ticker
AAPL    189.3
MSFT    418.2
GOOG    172.1
Name: price, dtype: float64

ticker
MSFT    418.2
AAPL    189.3
AMZN      NaN
GOOG    172.1
Name: price, dtype: float64

ticker
MSFT    False
AAPL    False
AMZN     True
GOOG    False
Name: price, dtype: bool

ticker
MSFT    418.2
AAPL    189.3
AMZN      0.0
GOOG    172.1
Name: price, dtype: float64

## Task 2 — Index fundamentals (5 min)
The **Index** is the label axis. It’s used for alignment, selection, and joining.
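
Alignment is the least visible of these three roles, so here is a minimal sketch of it. The `stock` and `sold` Series are made-up data, not part of the lab dataset. Arithmetic between two Series matches values by label rather than by position.

```python
import pandas as pd

# Hypothetical data: same labels in a different order, plus one unmatched label each.
stock = pd.Series({"apple": 10, "pear": 4, "kiwi": 7})
sold = pd.Series({"kiwi": 2, "apple": 3, "plum": 1})

# Subtraction aligns on labels, not positions; labels present on only one side give NaN.
remaining = stock - sold

# fill_value treats a label missing on one side as 0 instead of producing NaN.
remaining_filled = stock.sub(sold, fill_value=0)

print(remaining)
print(remaining_filled)
```

The result index is the union of both indexes, which is why `pear` and `plum` appear in `remaining` even though each exists in only one input.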

In [58]:
# 1) Look at the default Index of df
df.index, df.columns

(RangeIndex(start=0, stop=6, step=1),
 Index(['order_id', 'customer', 'city', 'quantity', 'unit_price', 'order_date'], dtype='object'))

In [59]:
# 2) Set a meaningful index (primary key) and add a computed column
df = df.set_index("order_id")
df["total"] = df["quantity"] * df["unit_price"]
df.head()

| order_id | customer | city | quantity | unit_price | order_date | total |
|---|---|---|---|---|---|---|
| 1001 | Ava | NYC | 2 | 12.5 | 2024-01-10 | 25.0 |
| 1002 | Ben | Boston | 1 | 9.0 | 2024-01-12 | 9.0 |
| 1003 | Ava | NYC | 4 | 12.5 | 2024-01-12 | 50.0 |
| 1004 | Cara | Austin | 3 | 15.0 | 2024-02-01 | 45.0 |
| 1005 | Ben | Boston | 2 | 9.0 | 2024-02-03 | 18.0 |


In [60]:
# 3) Common index ops
print(df.index.name)
df = df.sort_index()
df = df.rename_axis("OrderID")
df.head(3)

order_id


| OrderID | customer | city | quantity | unit_price | order_date | total |
|---|---|---|---|---|---|---|
| 1001 | Ava | NYC | 2 | 12.5 | 2024-01-10 | 25.0 |
| 1002 | Ben | Boston | 1 | 9.0 | 2024-01-12 | 9.0 |
| 1003 | Ava | NYC | 4 | 12.5 | 2024-01-12 | 50.0 |


In [61]:
# 4) Reset the index
tmp = df.reset_index()
tmp.head(2)

| | OrderID | customer | city | quantity | unit_price | order_date | total |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ava | NYC | 2 | 12.5 | 2024-01-10 | 25.0 |
| 1 | 1002 | Ben | Boston | 1 | 9.0 | 2024-01-12 | 9.0 |


## Task 3 — DataFrame essentials (8 min)
A **DataFrame** is a 2-D table: columns (named), rows (indexed).
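
One way to picture that definition is a small sketch on a hypothetical three-row `frame` (not the lab's `df`). Each column is a Series that shares the frame's row Index, and each row is a Series labeled by the column names.

```python
import pandas as pd

# Hypothetical two-column frame with string row labels.
frame = pd.DataFrame(
    {"qty": [2, 1, 4], "price": [12.5, 9.0, 12.5]},
    index=["a", "b", "c"],
)

qty = frame["qty"]          # single brackets with one name -> Series
qty_frame = frame[["qty"]]  # a list of names -> DataFrame

print(type(qty).__name__, type(qty_frame).__name__)
print(qty.index.equals(frame.index))  # the column shares the row Index
print(frame.loc["b"].index.tolist())  # a row is a Series labeled by column names
```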

In [62]:
# 1) Column selection and basic inspection
display(df[["customer","city"]].head(3))
display(df.dtypes)
df.shape

| OrderID | customer | city |
|---|---|---|
| 1001 | Ava | NYC |
| 1002 | Ben | Boston |
| 1003 | Ava | NYC |


customer              object
city                  object
quantity               int64
unit_price           float64
order_date    datetime64[ns]
total                float64
dtype: object

(6, 6)

In [63]:
# 2) Label-based vs position-based row selection
row_by_label = df.loc[1003]
row_by_pos = df.iloc[2]
display(row_by_label)
display(row_by_pos)

customer                      Ava
city                          NYC
quantity                        4
unit_price                   12.5
order_date    2024-01-12 00:00:00
total                        50.0
Name: 1003, dtype: object

customer                      Ava
city                          NYC
quantity                        4
unit_price                   12.5
order_date    2024-01-12 00:00:00
total                        50.0
Name: 1003, dtype: object
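
The two lookups above return the same row only because the frame happens to be sorted by `OrderID`. The sketch below uses a hypothetical three-row `orders` frame to show how the two accessors can diverge once the row order changes.

```python
import pandas as pd

# Hypothetical subset of the orders data, indexed by OrderID.
orders = pd.DataFrame(
    {"customer": ["Ava", "Ben", "Ava"], "total": [25.0, 9.0, 50.0]},
    index=pd.Index([1001, 1002, 1003], name="OrderID"),
)

by_total = orders.sort_values("total", ascending=False)

# The label still finds the same order after re-sorting...
print(by_total.loc[1003, "total"])
# ...but position 2 now points at whichever row is third in the new order.
print(by_total.iloc[2].name)
```

A reasonable rule of thumb is to reach for `.loc` when you mean "this specific record" and `.iloc` when you mean "the n-th row as currently ordered".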

In [64]:
# 3) Slicing differences (inclusive vs exclusive end)
display(df.loc[1002:1005, ["customer","total"]])  # inclusive of 1005
display(df.iloc[1:4, [0, -1]])                      # exclusive of stop

| OrderID | customer | total |
|---|---|---|
| 1002 | Ben | 9.0 |
| 1003 | Ava | 50.0 |
| 1004 | Cara | 45.0 |
| 1005 | Ben | 18.0 |


| OrderID | customer | total |
|---|---|---|
| 1002 | Ben | 9.0 |
| 1003 | Ava | 50.0 |
| 1004 | Cara | 45.0 |


In [65]:
# 4) Fast scalar access
city = df.at[1004, "city"]   # label-based: row label 1004, column "city"
first_val = df.iat[1, 0]     # position-based: second row, first column
city, first_val

('Austin', 'Ben')

In [66]:
# 5) Boolean filtering + sorting
nyc_big = df.loc[(df["city"].eq("NYC")) & (df["total"] > 25), ["customer","city","total"]]
nyc_big.sort_values("total", ascending=False)

| OrderID | customer | city | total |
|---|---|---|---|
| 1003 | Ava | NYC | 50.0 |
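
The parentheses around each condition in the mask matter because `&` binds more tightly than comparison operators in Python. As an alternative, `DataFrame.query` expresses the same filter as a string. The sketch below uses a hypothetical `orders` frame and a made-up `min_total` threshold.

```python
import pandas as pd

# Hypothetical subset of the orders data.
orders = pd.DataFrame(
    {
        "customer": ["Ava", "Ben", "Ava", "Ava"],
        "city": ["NYC", "Boston", "NYC", "NYC"],
        "total": [25.0, 9.0, 50.0, 12.5],
    },
    index=pd.Index([1001, 1002, 1003, 1006], name="OrderID"),
)

min_total = 25  # local variable, referenced inside query() with the @ prefix

# Boolean-mask version: each condition wrapped so that & combines two Series.
mask_result = orders.loc[orders["city"].eq("NYC") & (orders["total"] > min_total)]

# query() version: column names are used directly inside the expression string.
query_result = orders.query("city == 'NYC' and total > @min_total")

print(query_result)
```

Both forms select the same rows; which reads better is largely a matter of taste.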


In [67]:
# 6) Add/update columns the vectorized way
df["month"] = df["order_date"].dt.to_period("M")
df["discounted"] = np.where(df["total"] >= 40, df["total"]*0.9, df["total"])
df.head()

| OrderID | customer | city | quantity | unit_price | order_date | total | month | discounted |
|---|---|---|---|---|---|---|---|---|
| 1001 | Ava | NYC | 2 | 12.5 | 2024-01-10 | 25.0 | 2024-01 | 25.0 |
| 1002 | Ben | Boston | 1 | 9.0 | 2024-01-12 | 9.0 | 2024-01 | 9.0 |
| 1003 | Ava | NYC | 4 | 12.5 | 2024-01-12 | 50.0 | 2024-01 | 45.0 |
| 1004 | Cara | Austin | 3 | 15.0 | 2024-02-01 | 45.0 | 2024-02 | 40.5 |
| 1005 | Ben | Boston | 2 | 9.0 | 2024-02-03 | 18.0 | 2024-02 | 18.0 |
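
The cell above adds columns to `df` in place. If you would rather leave the original frame untouched, two other vectorized spellings of the same discount rule are sketched here on a hypothetical `orders` frame: `assign`, which returns a new frame, and a masked `.loc` update on a copy.

```python
import numpy as np
import pandas as pd

# Hypothetical totals keyed by order ID.
orders = pd.DataFrame({"total": [25.0, 9.0, 50.0, 45.0]}, index=[1001, 1002, 1003, 1004])

# assign() returns a new frame; `orders` itself is left unchanged.
with_discount = orders.assign(
    discounted=lambda d: np.where(d["total"] >= 40, d["total"] * 0.9, d["total"])
)

# Same rule as a masked update: start from the full price, then scale matching rows.
masked = orders.copy()
masked["discounted"] = masked["total"]
masked.loc[masked["total"] >= 40, "discounted"] *= 0.9

print(with_discount)
```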


## Task 4 — MultiIndex mini-exercise (5 min)
A **MultiIndex** lets you index by multiple keys (e.g., `customer` and `city`).

In [68]:
# 1) Create a MultiIndex
midx = df.reset_index().set_index(["customer","city","OrderID"]).sort_index()
midx.head()

| customer | city | OrderID | quantity | unit_price | order_date | total | month | discounted |
|---|---|---|---|---|---|---|---|---|
| Ava | NYC | 1001 | 2 | 12.5 | 2024-01-10 | 25.0 | 2024-01 | 25.0 |
| Ava | NYC | 1003 | 4 | 12.5 | 2024-01-12 | 50.0 | 2024-01 | 45.0 |
| Ava | NYC | 1006 | 1 | 12.5 | 2024-03-15 | 12.5 | 2024-03 | 12.5 |
| Ben | Boston | 1002 | 1 | 9.0 | 2024-01-12 | 9.0 | 2024-01 | 9.0 |
| Ben | Boston | 1005 | 2 | 9.0 | 2024-02-03 | 18.0 | 2024-02 | 18.0 |


In [69]:
# 2) Select all orders for a customer in a city (label tuple)
midx.loc[("Ava","NYC")]

| OrderID | quantity | unit_price | order_date | total | month | discounted |
|---|---|---|---|---|---|---|
| 1001 | 2 | 12.5 | 2024-01-10 | 25.0 | 2024-01 | 25.0 |
| 1003 | 4 | 12.5 | 2024-01-12 | 50.0 | 2024-01 | 45.0 |
| 1006 | 1 | 12.5 | 2024-03-15 | 12.5 | 2024-03 | 12.5 |


In [70]:
# 3) Select by one level using .xs (cross-section)
ava_all = midx.xs("Ava", level="customer")
nyc_all = midx.xs("NYC", level="city")
display(ava_all.head())
display(nyc_all.head())

| city | OrderID | quantity | unit_price | order_date | total | month | discounted |
|---|---|---|---|---|---|---|---|
| NYC | 1001 | 2 | 12.5 | 2024-01-10 | 25.0 | 2024-01 | 25.0 |
| NYC | 1003 | 4 | 12.5 | 2024-01-12 | 50.0 | 2024-01 | 45.0 |
| NYC | 1006 | 1 | 12.5 | 2024-03-15 | 12.5 | 2024-03 | 12.5 |


| customer | OrderID | quantity | unit_price | order_date | total | month | discounted |
|---|---|---|---|---|---|---|---|
| Ava | 1001 | 2 | 12.5 | 2024-01-10 | 25.0 | 2024-01 | 25.0 |
| Ava | 1003 | 4 | 12.5 | 2024-01-12 | 50.0 | 2024-01 | 45.0 |
| Ava | 1006 | 1 | 12.5 | 2024-03-15 | 12.5 | 2024-03 | 12.5 |


In [71]:
# 4) Partial slice across a range of order IDs for one customer/city
midx.loc[("Ava","NYC")].loc[1002:1006]

| OrderID | quantity | unit_price | order_date | total | month | discounted |
|---|---|---|---|---|---|---|
| 1003 | 4 | 12.5 | 2024-01-12 | 50.0 | 2024-01 | 45.0 |
| 1006 | 1 | 12.5 | 2024-03-15 | 12.5 | 2024-03 | 12.5 |
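
The chained `.loc` above drops the `customer` and `city` levels from the result. If you want to keep the full MultiIndex, `pd.IndexSlice` can express the same selection in a single `.loc` call. This is a sketch on a rebuilt copy of the orders data, with `orders` and `idx` as illustrative names.

```python
import pandas as pd

# Rebuilt subset of the lab data so the example runs on its own.
orders = pd.DataFrame(
    {
        "customer": ["Ava", "Ben", "Ava", "Cara", "Ben", "Ava"],
        "city": ["NYC", "Boston", "NYC", "Austin", "Boston", "NYC"],
        "OrderID": [1001, 1002, 1003, 1004, 1005, 1006],
        "total": [25.0, 9.0, 50.0, 45.0, 18.0, 12.5],
    }
)
# Label slicing inside a MultiIndex generally expects a sorted index.
midx = orders.set_index(["customer", "city", "OrderID"]).sort_index()

idx = pd.IndexSlice
# One key per level: two scalars, then an inclusive label slice on OrderID.
sliced = midx.loc[idx["Ava", "NYC", 1002:1006], :]

# Chained version for comparison; it drops the customer and city levels.
chained = midx.loc[("Ava", "NYC")].loc[1002:1006]

print(sliced)
```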


## Stretch (optional, 3 min) — Time index
Date-based operations such as `resample` are simplest when the dates are the index (a `DatetimeIndex`).

In [72]:
t = df.set_index("order_date").sort_index()
t["quantity"].resample("MS").sum()

order_date
2024-01-01    7
2024-02-01    5
2024-03-01    1
Freq: MS, Name: quantity, dtype: int64
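
Two related conveniences are sketched below on a rebuilt copy of the quantity and date columns, with `orders`, `january`, and `monthly` as illustrative names. A `DatetimeIndex` supports partial-string selection with `.loc`, and `resample` accepts `on=` when you prefer to keep the dates as an ordinary column.

```python
import pandas as pd

# Rebuilt subset of the lab data so the example runs on its own.
orders = pd.DataFrame(
    {
        "quantity": [2, 1, 4, 3, 2, 1],
        "order_date": pd.to_datetime(
            ["2024-01-10", "2024-01-12", "2024-01-12", "2024-02-01", "2024-02-03", "2024-03-15"]
        ),
    }
)

t = orders.set_index("order_date").sort_index()

# Partial-string selection: every row whose date falls in January 2024.
january = t.loc["2024-01"]

# resample(on=...) buckets by a datetime column without changing the index first.
monthly = orders.resample("MS", on="order_date")["quantity"].sum()

print(january)
print(monthly)
```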

## Quick self-check (2 min)
Answer without running code (then verify):

1) Which accessor is *inclusive* at the end for slicing: `.loc` or `.iloc`?

2) What happens when you `reindex` with labels not present in the original Series?

3) Which is faster for single-cell gets: `.loc` or `.at`?

4) How do you select all rows for `customer="Ben"` in the MultiIndex DataFrame?

**Answers:**
1) `.loc` is inclusive; `.iloc` is exclusive at the stop.
2) New labels appear with `NaN` (unless you provide `fill_value` or a method).
3) `.at` (label) and `.iat` (position) are optimized for scalars.
4) `midx.xs("Ben", level="customer")` (or `midx.loc["Ben"]` if it’s the outermost level).
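
To verify the first three answers without the lab dataset, here is a small sketch on a throwaway Series; `s` and its labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# 1) .loc includes the stop label; .iloc excludes the stop position.
print(s.loc["b":"c"].tolist())  # [20, 30]
print(s.iloc[1:2].tolist())     # [20]

# 2) Labels missing from the original get NaN, or fill_value if one is supplied.
print(s.reindex(["a", "z"]).isna().tolist())         # [False, True]
print(s.reindex(["a", "z"], fill_value=0).tolist())  # [10, 0]

# 3) .at returns the same scalar as .loc for a single cell.
print(s.at["d"] == s.loc["d"])  # True
```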

## (Optional) Challenge
- Add a `category` column: map `unit_price >= 12` → `"Premium"` else `"Standard"`.
- Compute total revenue by `(customer, category)` using `groupby(["customer","category"])["total"].sum()`.
- Which two orders had the highest `discounted` value? Return `OrderID` and `discounted` only.

In [73]:
# Challenge (examples)
df["category"] = np.where(df["unit_price"] >= 12, "Premium", "Standard")
revenue_by_cc = df.groupby(["customer","category"])["total"].sum().sort_values(ascending=False)
top2_discounted = df["discounted"].nlargest(2)   # Series indexed by OrderID
df.loc[top2_discounted.index, ["discounted"]]    # reuse the result instead of recomputing it

| OrderID | discounted |
|---|---|
| 1003 | 45.0 |
| 1004 | 40.5 |
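
The cell above computes the grouped revenue but only displays the top-two result. As a sketch of what that grouping contains, the block below rebuilds the relevant columns and pivots the `category` level into columns with `unstack`. The names `orders`, `revenue`, and `revenue_wide` are illustrative, not part of the lab.

```python
import numpy as np
import pandas as pd

# Rebuilt subset of the lab data so the example runs on its own.
orders = pd.DataFrame(
    {
        "customer": ["Ava", "Ben", "Ava", "Cara", "Ben", "Ava"],
        "quantity": [2, 1, 4, 3, 2, 1],
        "unit_price": [12.5, 9.0, 12.5, 15.0, 9.0, 12.5],
    },
    index=pd.Index([1001, 1002, 1003, 1004, 1005, 1006], name="OrderID"),
)
orders["total"] = orders["quantity"] * orders["unit_price"]
orders["category"] = np.where(orders["unit_price"] >= 12, "Premium", "Standard")

# Grouping by two keys yields a Series with a two-level MultiIndex.
revenue = orders.groupby(["customer", "category"])["total"].sum()

# unstack() pivots the inner level (category) into columns;
# fill_value=0 fills customer/category combinations that never occur.
revenue_wide = revenue.unstack(fill_value=0)

print(revenue_wide)
```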
