### Indexes (Advanced)

If you have not already done so, you will need to install Pandas in your virtual environment:

pip install pandas

Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html

This notebook focuses on best-practice, exam-ready Pandas indexing patterns:
- Index creation, dtype, naming, uniqueness, sorting/monotonicity
- Array-like vs set-like operations
- Correct selection with .loc / .iloc / boolean masks
- Alignment semantics (the source of many bugs)
- Reindexing patterns
- Specialized indexes: RangeIndex, DatetimeIndex, CategoricalIndex, IntervalIndex, MultiIndex
- Performance and correctness tips


In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 120)

rng = np.random.default_rng(42)


## 1) Core Index basics: dtype, name, uniqueness, ordering

Best practice: treat an Index as an immutable, typed label-array. Prefer stable, unique labels when possible.


In [2]:
idx_int = pd.Index([10, 20, 30], name="id")
idx_float = pd.Index([1, 3.14])
idx_obj = pd.Index(["element 1", "element 2"], name="label")

idx_int, idx_int.dtype, idx_int.name, idx_int.is_unique, idx_int.is_monotonic_increasing


(Index([10, 20, 30], dtype='int64', name='id'),
 dtype('int64'),
 'id',
 True,
 True)

Mixing types coerces to a broader dtype (often object). Prefer consistent dtypes for performance and correctness.

In [3]:
idx_mixed = pd.Index([1, "2", 3])
idx_mixed, idx_mixed.dtype


(Index([1, '2', 3], dtype='object'), dtype('O'))
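A mixed-dtype index also makes lookups brittle, because the integer 2 and the string "2" are different labels. The sketch below assumes every label can be parsed as a number and uses pd.to_numeric to normalize them; the names has_int_2, has_str_2 and idx_clean are illustrative only.

```python
import pandas as pd

idx_mixed = pd.Index([1, "2", 3])

# The stored label is the string "2", so an integer lookup misses it.
has_int_2 = 2 in idx_mixed
has_str_2 = "2" in idx_mixed

# Assumption: every label is numeric-like, so parsing cannot fail.
idx_clean = pd.to_numeric(idx_mixed)

print(has_int_2, has_str_2)
print(idx_clean.dtype)
```

If some labels might not be numeric, decide up front whether to raise, coerce, or keep the index as strings; silently mixing types is the option to avoid.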

Indexes are immutable. Any "change" creates a new Index.

In [4]:
idx = pd.Index(["London", "Paris", "New York", "Tokyo"], name="city")
try:
    idx[0] = "Berlin"
except TypeError as ex:
    err = ex
idx, type(err).__name__, str(err)


(Index(['London', 'Paris', 'New York', 'Tokyo'], dtype='object', name='city'),
 'TypeError',
 'Index does not support mutable operations')
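Because in-place assignment is rejected, a "modified" index is built by calling methods that return a new object. A minimal sketch using delete, insert, append and rename (the variable names are illustrative):

```python
import pandas as pd

idx = pd.Index(["London", "Paris", "New York", "Tokyo"], name="city")

# Each call returns a new Index; idx itself is never modified.
replaced = idx.delete(0).insert(0, "Berlin")
extended = idx.append(pd.Index(["Rome"]))
renamed = idx.rename("capital")

print(replaced.tolist())
print(idx.tolist())  # still starts with "London"
```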

## 2) Array-like behavior (positional)

Slicing/fancy indexing returns a new Index (not a list/ndarray).

In [5]:
idx = pd.Index([2, 4, 6, 8, 10])

first = idx[0]
slice_ = idx[1:4]
rev = idx[::-1]
fancy = idx[[1, 3, 4]]

first, slice_, rev, fancy


(np.int64(2),
 Index([4, 6, 8], dtype='int64'),
 Index([10, 8, 6, 4, 2], dtype='int64'),
 Index([4, 8, 10], dtype='int64'))

Boolean masking works like NumPy; keep the mask aligned (same length) to avoid mistakes.

In [6]:
idx = pd.Index(["London", "Paris", "New York", "Tokyo"])
idx[idx != "Tokyo"]


Index(['London', 'Paris', 'New York'], dtype='object')

## 3) Set-like behavior (labels)

Indexes support intersection/union/difference via methods:
- .intersection()
- .union()
- .difference()

Best practice: Use methods for clarity and explicit control of sorting (e.g., sort=).


In [7]:
idx_1 = pd.Index(["a", "b", "c"])
idx_2 = pd.Index(["c", "d", "e"])

inter_op = idx_1.intersection(idx_2)
union_op = idx_1.union(idx_2)
diff_op = idx_1.difference(idx_2)

inter_op, union_op, diff_op


(Index(['c'], dtype='object'),
 Index(['a', 'b', 'c', 'd', 'e'], dtype='object'),
 Index(['a', 'b'], dtype='object'))
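The sort= parameter matters when the inputs are not already ordered. The following sketch, with illustrative inputs, contrasts the default behavior (sort when possible) with sort=False, which keeps the calling index's order:

```python
import pandas as pd

left = pd.Index(["c", "b", "a"])
right = pd.Index(["b", "d"])

# Default (sort=None): the result is sorted when the labels are comparable.
sorted_union = left.union(right)

# sort=False: keep left's order, then append labels found only in right.
ordered_union = left.union(right, sort=False)
ordered_inter = left.intersection(right, sort=False)

print(sorted_union.tolist())
print(ordered_union.tolist())
```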

Unions/intersections may upcast dtype to fit all values.

In [8]:
pd.Index([1, 2, 3]).union(pd.Index([0.1, 0.2]))


Index([0.1, 0.2, 1.0, 2.0, 3.0], dtype='float64')

Containment testing uses the in operator, which checks labels (not positions) and is fast for many index types.

In [9]:
idx_1 = pd.Index(["a", "b", "c"])
idx_2 = pd.RangeIndex(0, 10, 2)

"b" in idx_1, 6 in idx_2, "x" in idx_1, 1 in idx_2


(True, True, False, False)

## 4) Specialized index: RangeIndex

RangeIndex is a compact, lazy representation (doesn't store all integers explicitly).

In [10]:
ri_from_range = pd.Index(range(2, 10, 2))
ri_direct = pd.RangeIndex(2, 10, 2)
ri_from_range, ri_direct


(RangeIndex(start=2, stop=10, step=2), RangeIndex(start=2, stop=10, step=2))
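To make the "compact" claim concrete, the sketch below compares the reported memory of a RangeIndex with an integer Index holding the same values. The size n is an arbitrary choice, and exact byte counts depend on the platform and pandas version, so only the comparison is meaningful.

```python
import numpy as np
import pandas as pd

n = 1_000_000
compact = pd.RangeIndex(n)             # stores only start, stop, step
materialized = pd.Index(np.arange(n))  # stores all n integers

compact_bytes = compact.memory_usage()
materialized_bytes = materialized.memory_usage()

print(compact_bytes, materialized_bytes)
```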

In [11]:
idx = pd.RangeIndex(2, 10, 2)
idx[0], idx[1:4], idx[::-1]


(2, RangeIndex(start=4, stop=10, step=2), RangeIndex(start=8, stop=0, step=-2))

RangeIndex unions/intersections may remain RangeIndex if representable; otherwise you get a materialized integer Index (dtype int64).

In [12]:
idx_1 = pd.RangeIndex(0, 5)
idx_2 = pd.RangeIndex(4, 8)
inter = idx_1.intersection(idx_2)
union = idx_1.union(idx_2)
inter, list(inter), union, list(union)


(RangeIndex(start=4, stop=5, step=1),
 [4],
 RangeIndex(start=0, stop=8, step=1),
 [0, 1, 2, 3, 4, 5, 6, 7])

In [13]:
pd.RangeIndex(1, 10, 2).union(pd.RangeIndex(1, 10, 3))


Index([1, 3, 4, 5, 7, 9], dtype='int64')

## 5) Indexes + Series/DataFrame: alignment and selection

Indexes are most useful when attached to Series and DataFrame.

Exam-critical best practice:
- .iloc is positional
- .loc is label-based
- Avoid ambiguous [] on DataFrames; prefer .loc/.iloc.
- Many operations align by index labels, not by row order.


In [14]:
s = pd.Series([100, 200, 300], index=pd.Index([10, 20, 30], name="id"), name="value")
s


id
10    100
20    200
30    300
Name: value, dtype: int64

In [15]:
# label-based
s.loc[20]


np.int64(200)

In [16]:
# positional
s.iloc[1]


np.int64(200)
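The two cells above return the same value only because label 20 happens to sit at position 1. A short sketch with illustrative data where labels and positions disagree, plus the difference in slice endpoints:

```python
import pandas as pd

# Integer labels deliberately out of step with positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the row labelled 0
by_position = s.iloc[0]  # the first row

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
t = pd.Series([100, 200, 300], index=[10, 20, 30])
label_slice = t.loc[10:20]
position_slice = t.iloc[0:1]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```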

### Alignment semantics (common pitfall)
When you add two Series, Pandas aligns by index labels and introduces NaN for missing labels.

In [17]:
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
a + b


w     NaN
x     NaN
y    12.0
z    23.0
dtype: float64

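Arithmetic on raw arrays ignores labels entirely. If you need it, align explicitly first (as the next cell does with .reindex()) and only then drop to .to_numpy() / .array; otherwise values are paired purely by position.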

In [18]:
positional_sum = a.to_numpy() + b.reindex(a.index).to_numpy()  # explicit alignment then positional
positional_sum


array([nan, 12., 23.])
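For contrast, here is a sketch of the two other options for the same a and b: purely positional arithmetic, which ignores labels, and label-aligned arithmetic that treats a missing label as 0 via the fill_value argument of Series.add.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Purely positional: labels are ignored, first value pairs with first value.
by_position = a.to_numpy() + b.to_numpy()

# Label-aligned, but a label missing on one side counts as 0 instead of NaN.
by_label_filled = a.add(b, fill_value=0)

print(by_position)
print(by_label_filled)
```

Which of the three results is "right" depends on what the labels mean; the point is to choose one deliberately.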

### Reindexing
Use .reindex() to enforce a target index and control fill values.

In [19]:
target = pd.Index(["w", "x", "y", "z"], name="key")
a.reindex(target, fill_value=0)


key
w    0
x    1
y    2
z    3
dtype: int64
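Without fill_value, labels missing from the source become NaN, which also upcasts integer data to float64. For ordered indexes, method="ffill" is another option. A minimal sketch with illustrative data:

```python
import pandas as pd

s = pd.Series([1, 2], index=[0, 2])
target = pd.RangeIndex(4)

# Default: labels absent from s become NaN, so the dtype becomes float64.
with_nan = s.reindex(target)

# Forward fill: carry the last known value forward (needs a monotonic index).
with_ffill = s.reindex(target, method="ffill")

print(with_nan)
print(with_ffill)
```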

### DataFrame selection best practices
Use .loc[row_labels, col_labels] and .iloc[row_positions, col_positions].

In [20]:
df = pd.DataFrame(
    {
        "city": ["London", "Paris", "New York", "Tokyo"],
        "temp_c": [7.0, 5.5, 2.0, 9.0],
        "rain_mm": [4.2, 1.1, 0.0, 2.5],
    },
    index=pd.Index(["UK", "FR", "US", "JP"], name="country"),
)
df


             city  temp_c  rain_mm
country
UK         London     7.0      4.2
FR          Paris     5.5      1.1
US       New York     2.0      0.0
JP          Tokyo     9.0      2.5


In [21]:
df.loc[["FR", "JP"], ["city", "temp_c"]]


          city  temp_c
country
FR       Paris     5.5
JP       Tokyo     9.0


In [22]:
df.iloc[1:3, 0:2]


             city  temp_c
country
FR          Paris     5.5
US       New York     2.0


Boolean filtering with .loc is usually clearest, and keeps index labels.

In [23]:
df.loc[df["temp_c"] >= 6]


           city  temp_c  rain_mm
country
UK       London     7.0      4.2
JP        Tokyo     9.0      2.5
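Conditions can be combined with & and |, with each comparison wrapped in parentheses, and .loc can select columns in the same step. The thresholds below are arbitrary, chosen only to illustrate the pattern on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["London", "Paris", "New York", "Tokyo"],
        "temp_c": [7.0, 5.5, 2.0, 9.0],
        "rain_mm": [4.2, 1.1, 0.0, 2.5],
    },
    index=pd.Index(["UK", "FR", "US", "JP"], name="country"),
)

# Parentheses are required: & binds more tightly than >= and <.
mild_and_dry = (df["temp_c"] >= 5) & (df["rain_mm"] < 3)

# Row mask and column labels in a single .loc call.
result = df.loc[mild_and_dry, ["city", "temp_c"]]
print(result)
```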


## 6) Duplicate index labels

Index labels do not need to be unique, but a duplicated label makes .loc return every matching row instead of a single value.

Best practice: enforce uniqueness when your problem expects a single row per key (.is_unique, .duplicated(), .groupby(level=...)).

In [24]:
idx_dup = pd.Index([1, 1, 2, 2, 3, 3], name="k")
s_dup = pd.Series(range(len(idx_dup)), index=idx_dup)
idx_dup, idx_dup.is_unique, s_dup.loc[2]


(Index([1, 1, 2, 2, 3, 3], dtype='int64', name='k'),
 False,
 k
 2    2
 2    3
 dtype: int64)
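Two common ways to get back to one row per key are sketched below: keep the first occurrence of each label, or aggregate the duplicates. Which one is appropriate depends on the data; summing is only an example aggregation.

```python
import pandas as pd

s_dup = pd.Series(range(6), index=pd.Index([1, 1, 2, 2, 3, 3], name="k"))

# Option 1: keep only the first row for each label.
first_only = s_dup[~s_dup.index.duplicated(keep="first")]

# Option 2: collapse duplicates by aggregating per label.
totals = s_dup.groupby(level="k").sum()

print(first_only)
print(totals)
```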

## 7) Advanced index types (common exam extensions)

### 7.1 DatetimeIndex
Great for time series; supports partial string selection and time-based slicing.

In [25]:
dt_idx = pd.date_range("2025-01-01", periods=6, freq="D", name="date")
ts = pd.Series(rng.normal(size=len(dt_idx)).round(3), index=dt_idx, name="x")
ts


date
2025-01-01    0.305
2025-01-02   -1.040
2025-01-03    0.750
2025-01-04    0.941
2025-01-05   -1.951
2025-01-06   -1.302
Freq: D, Name: x, dtype: float64

In [26]:
# partial string selection (month)
ts.loc["2025-01"]


date
2025-01-01    0.305
2025-01-02   -1.040
2025-01-03    0.750
2025-01-04    0.941
2025-01-05   -1.951
2025-01-06   -1.302
Freq: D, Name: x, dtype: float64
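Time-based slicing uses the same .loc syntax with date strings, and both endpoints are included. The sketch below uses simple integer values instead of random data so the result is easy to check:

```python
import pandas as pd

dt_idx = pd.date_range("2025-01-01", periods=6, freq="D", name="date")
ts = pd.Series(range(6), index=dt_idx, name="x")

# Label slice on dates: both the start and the end date are included.
window = ts.loc["2025-01-02":"2025-01-04"]

# A full date on a daily index matches exactly one row and returns a scalar.
one_day = ts.loc["2025-01-03"]

print(window)
print(one_day)
```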

### 7.2 CategoricalIndex
Useful for memory/performance when values repeat and a fixed category set is known.

In [27]:
cats = pd.Categorical(
    ["low", "medium", "low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)
cat_idx = pd.CategoricalIndex(cats, name="priority")
cat_idx, cat_idx.dtype


(CategoricalIndex(['low', 'medium', 'low', 'high', 'medium'], categories=['low', 'medium', 'high'], ordered=True, dtype='category', name='priority'),
 CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True, categories_dtype=object))
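One practical effect of ordered categories is that sorting follows the declared category order rather than alphabetical order. A small sketch with illustrative values:

```python
import pandas as pd

cat_idx = pd.CategoricalIndex(
    ["low", "medium", "low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
    name="priority",
)
s = pd.Series([1, 2, 3, 4, 5], index=cat_idx)

# sort_index follows the category order: low < medium < high.
by_category = s.sort_index().index.tolist()

# Plain string sorting would put "high" first instead.
alphabetical = sorted(cat_idx.tolist())

print(by_category)
print(alphabetical)
```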

### 7.3 IntervalIndex
Represents numeric bins; supports membership queries via .get_indexer / .contains patterns.

In [28]:
bins = pd.IntervalIndex.from_breaks([0, 10, 20, 50], closed="left")
bins


IntervalIndex([[0, 10), [10, 20), [20, 50)], dtype='interval[int64, left]')

In [29]:
x = pd.Series([3, 12, 49, 50, -1], name="x")
idxr = bins.get_indexer(x)
idxr


array([ 0,  1,  2, -1, -1])

Interpretation: -1 means "no matching interval" (e.g., 50 is excluded for closed='left').
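The positions returned by get_indexer can be turned into a mask, and pd.cut accepts the same IntervalIndex to label each value with its bin. This is a sketch of one possible pattern, not the only way to bin values; the names pos, matched, binned and hits_12 are illustrative.

```python
import pandas as pd

bins = pd.IntervalIndex.from_breaks([0, 10, 20, 50], closed="left")
x = pd.Series([3, 12, 49, 50, -1], name="x")

# Positions of the matching interval, or -1 when no interval contains the value.
pos = bins.get_indexer(x)
matched = pos >= 0

# pd.cut with an IntervalIndex labels each value; unmatched values become NaN.
binned = pd.cut(x, bins=bins)

# Element-wise test: which intervals contain the scalar 12?
hits_12 = bins.contains(12)

print(matched)
print(binned)
```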

### 7.4 MultiIndex
Hierarchical indexing for "grouped keys". Prefer it when you truly need multi-key label selection.

Best practice: name levels; use .swaplevel(), .sort_index() for predictable slicing; use .xs() for cross-sections.

In [30]:
mi = pd.MultiIndex.from_product(
    [["EU", "NA"], ["A", "B"], [2024, 2025]],
    names=["region", "segment", "year"],
)
m_df = pd.DataFrame({"sales": rng.integers(50, 200, size=len(mi))}, index=mi)
m_df.head(8)


                     sales
region segment year
EU     A       2024    160
               2025    164
       B       2024    157
               2025    167
NA     A       2024    126
               2025     69
       B       2024    175
               2025    117


In [31]:
# label selection across levels
m_df.loc[("EU", "A")]


      sales
year
2024    160
2025    164


In [32]:
# cross-section: pick one level value
m_df.xs("B", level="segment").head()


             sales
region year
EU     2024    157
       2025    167
NA     2024    175
       2025    117
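The .sort_index() and .swaplevel() advice can be sketched as follows. The sales values here are a deterministic stand-in (0 to 7) rather than the random data used above, so the selections are easy to verify by hand.

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["EU", "NA"], ["A", "B"], [2024, 2025]],
    names=["region", "segment", "year"],
)
# Stand-in values 0..7; sorting first keeps label slicing predictable.
m_df = pd.DataFrame({"sales": range(len(mi))}, index=mi).sort_index()

# Segment "A" in 2025 across every region, using pd.IndexSlice.
idx = pd.IndexSlice
a_2025 = m_df.loc[idx[:, "A", 2025], "sales"]

# Move year to the front so it can be selected as the leading key.
by_year = m_df.swaplevel("region", "year").sort_index()
rows_2025 = by_year.loc[2025]

print(a_2025)
print(rows_2025)
```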


## 8) Exam-style problems (advanced solutions)

These are typical index-focused tasks. Each task is solved with best-practice Pandas patterns and verified with asserts.

### Problem 1
Create an Index from [10, 20, 30], name it id, demonstrate:
- dtype
- slicing returns an Index
- immutability


In [33]:
idx = pd.Index([10, 20, 30], name="id")

assert idx.dtype == "int64"
sl = idx[1:]
assert isinstance(sl, pd.Index)

try:
    idx[0] = 999
    raise AssertionError("Expected Index to be immutable")
except TypeError:
    pass

idx, sl


(Index([10, 20, 30], dtype='int64', name='id'),
 Index([20, 30], dtype='int64', name='id'))

### Problem 2
Given two indexes ['a','b','c'] and ['c','d','e'], compute:
- intersection
- union
- difference (elements in first but not second)

Return as Index objects.

In [34]:
idx1 = pd.Index(["a", "b", "c"])
idx2 = pd.Index(["c", "d", "e"])

intersection = idx1.intersection(idx2)
union = idx1.union(idx2)
difference = idx1.difference(idx2)

assert intersection.equals(pd.Index(["c"]))
assert union.equals(pd.Index(["a", "b", "c", "d", "e"]))
assert difference.equals(pd.Index(["a", "b"]))

intersection, union, difference


(Index(['c'], dtype='object'),
 Index(['a', 'b', 'c', 'd', 'e'], dtype='object'),
 Index(['a', 'b'], dtype='object'))

### Problem 3
Create a RangeIndex equivalent to range(2, 10, 2) and show:
- first element
- slice 1:4 remains a RangeIndex
- reverse slice


In [35]:
idx = pd.RangeIndex(2, 10, 2)

first = idx[0]
mid = idx[1:4]
rev = idx[::-1]

assert first == 2
assert isinstance(mid, pd.RangeIndex)
assert list(mid) == [4, 6, 8]
assert list(rev) == [8, 6, 4, 2]

first, mid, rev


(2, RangeIndex(start=4, stop=10, step=2), RangeIndex(start=8, stop=0, step=-2))

### Problem 4
Demonstrate containment testing with in for:
- Index(['a','b','c'])
- RangeIndex(0,10,2)

Verify True/False cases.

In [36]:
idx1 = pd.Index(["a", "b", "c"])
idx2 = pd.RangeIndex(0, 10, 2)

assert ("b" in idx1) is True
assert ("x" in idx1) is False
assert (6 in idx2) is True
assert (1 in idx2) is False

("b" in idx1, "x" in idx1, 6 in idx2, 1 in idx2)


(True, False, True, False)

### Problem 5
Show that Index labels do not need to be unique:
- create Index([1,1,2,2,3,3])
- attach to a Series and select label 2 with .loc (should return multiple rows)
- detect duplicates


In [37]:
idx = pd.Index([1, 1, 2, 2, 3, 3], name="k")
s = pd.Series([10, 11, 20, 21, 30, 31], index=idx, name="v")

sel = s.loc[2]
dups = idx[idx.duplicated()]  # duplicated labels (keep='first')

assert isinstance(sel, pd.Series)
assert sel.tolist() == [20, 21]
assert dups.tolist() == [1, 2, 3]

idx, sel, dups


(Index([1, 1, 2, 2, 3, 3], dtype='int64', name='k'),
 k
 2    20
 2    21
 Name: v, dtype: int64,
 Index([1, 2, 3], dtype='int64', name='k'))

## 9) Practical best-practice checklist

- Prefer unique, stable keys for indexes.
- Use .loc for labels, .iloc for positions.
- Expect alignment by index labels in arithmetic/concat/join operations.
- Use .reindex() (and fill_value) to enforce a target index.
- For speed/memory with integer sequential indices: prefer RangeIndex.
- For multi-key labels: prefer MultiIndex with named levels and sorting.
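
As a reminder that alignment is not limited to arithmetic, here is a small sketch of pd.concat along columns with two illustrative Series. Rows are matched by label, and the join argument controls whether unmatched labels are kept.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["x", "y", "z"], name="left")
right = pd.Series([10, 20, 30], index=["y", "z", "w"], name="right")

# Default outer join: every label appears, gaps are filled with NaN.
wide = pd.concat([left, right], axis=1)

# Inner join: only labels present in both inputs survive.
inner = pd.concat([left, right], axis=1, join="inner")

print(wide)
print(inner)
```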
