# Pandas Series — Expert Practice (with Solutions)

This notebook is **expert-level** practice on `pandas.Series`.

**What makes these “expert”:**
- heavy use of **alignment**, **MultiIndex**, **time series**, and **missing data strategies**
- correctness checks with `assert`
- a focus on idiomatic, vectorized pandas patterns

> Run each problem cell, then compare with the solution cell below it.

In [1]:
import pandas as pd
import numpy as np

## Problem 1 — Label alignment vs raw NumPy

You have a Series of weights and a Series of values with different label sets.

Tasks:
1) Compute a **weighted sum** over the intersection of labels only.
2) Compute a weighted sum treating missing weights as 0.
3) Show how a naive `.to_numpy()` approach can silently give a different answer.

In [2]:
values  = pd.Series({'a': 10, 'b': 20, 'c': 30, 'd': 40})
weights = pd.Series({'b': 0.2, 'c': 0.3, 'e': 0.9})

# YOUR WORK:
# inter_weighted_sum = ...
# weighted_sum_missing0 = ...
# naive = ...
# inter_weighted_sum, weighted_sum_missing0, naive

In [3]:
# SOLUTION
# 1) intersection only
common = values.index.intersection(weights.index)
inter_weighted_sum = (values.loc[common] * weights.loc[common]).sum()

# 2) missing weights treated as 0
aligned_w = weights.reindex(values.index, fill_value=0.0)
weighted_sum_missing0 = (values * aligned_w).sum()

# 3) naive NumPy is positional: it pairs a, b, c with the weights for b, c, e,
#    and the slice hides the length mismatch instead of raising
naive = (values.to_numpy()[: len(weights.to_numpy())] * weights.to_numpy()).sum()

inter_weighted_sum, weighted_sum_missing0, naive

(np.float64(13.0), np.float64(13.0), np.float64(35.0))
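
Both label-aware answers agree here because plain `values * weights` aligns on the union of labels and leaves `NaN` wherever one side is missing, and `sum()` skips `NaN` by default. The sketch below reuses the problem's data to make that visible; `product`, `default_sum`, and `strict_sum` are illustrative names, not part of the exercise.

```python
import pandas as pd

values = pd.Series({'a': 10, 'b': 20, 'c': 30, 'd': 40})
weights = pd.Series({'b': 0.2, 'c': 0.3, 'e': 0.9})

# Arithmetic aligns on the union of labels; unmatched labels become NaN.
product = values * weights

# sum() skips NaN by default, so only 'b' and 'c' contribute.
default_sum = product.sum()

# With skipna=False the missing labels are no longer hidden.
strict_sum = product.sum(skipna=False)

print(product)
print(default_sum, strict_sum)
```

Passing `skipna=False` is one way to make a coverage mismatch fail loudly instead of silently shrinking the sum.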

## Problem 2 — Time series: resample + fill strategy + label-safe slicing

You have irregular timestamped sensor readings.

Tasks:
1) Convert to a **minute frequency** Series using `resample('1min')`.
2) Fill missing minutes using **time interpolation**.
3) Compute a 5-minute rolling **median**.
4) Slice a time window using `.loc[start:end]` (label-inclusive).

In [4]:
t = pd.to_datetime([
    '2025-03-01 12:00:00',
    '2025-03-01 12:02:00',
    '2025-03-01 12:07:00',
    '2025-03-01 12:08:00'
])
x = pd.Series([0.0, 1.0, 0.5, 0.8], index=t, name='sensor')

# YOUR WORK:
# per_min = ...
# filled = ...
# roll_med = ...
# window = ...
# per_min.head(10), filled.head(10), roll_med.head(10), window

In [5]:
# SOLUTION
per_min = x.resample('1min').mean()  # creates NaNs for missing minutes
filled = per_min.interpolate(method='time')
roll_med = filled.rolling('5min').median()  # time-based window
window = filled.loc['2025-03-01 12:01:00':'2025-03-01 12:06:00']

per_min.head(10), filled.head(10), roll_med.head(10), window

(2025-03-01 12:00:00    0.0
 2025-03-01 12:01:00    NaN
 2025-03-01 12:02:00    1.0
 2025-03-01 12:03:00    NaN
 2025-03-01 12:04:00    NaN
 2025-03-01 12:05:00    NaN
 2025-03-01 12:06:00    NaN
 2025-03-01 12:07:00    0.5
 2025-03-01 12:08:00    0.8
 Freq: min, Name: sensor, dtype: float64,
 2025-03-01 12:00:00    0.0
 2025-03-01 12:01:00    0.5
 2025-03-01 12:02:00    1.0
 2025-03-01 12:03:00    0.9
 2025-03-01 12:04:00    0.8
 2025-03-01 12:05:00    0.7
 2025-03-01 12:06:00    0.6
 2025-03-01 12:07:00    0.5
 2025-03-01 12:08:00    0.8
 Freq: min, Name: sensor, dtype: float64,
 2025-03-01 12:00:00    0.00
 2025-03-01 12:01:00    0.25
 2025-03-01 12:02:00    0.50
 2025-03-01 12:03:00    0.70
 2025-03-01 12:04:00    0.80
 2025-03-01 12:05:00    0.80
 2025-03-01 12:06:00    0.80
 2025-03-01 12:07:00    0.70
 2025-03-01 12:08:00    0.70
 Freq: min, Name: sensor, dtype: float64,
 2025-03-01 12:01:00    0.5
 2025-03-01 12:02:00    1.0
 2025-03-01 12:03:00    0.9
 2025-03-01 12:04:00    0.8
 2025-03-01 12:05:00    0.7
 2025-03-01 12:06:00    0.6
 Freq: min, Name: sensor, dtype: float64)
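
On the regular one-minute grid produced by `resample`, `method='time'` and the default `method='linear'` give the same values. The difference only appears when the index is unevenly spaced. The following is a minimal sketch on a hypothetical series (`gappy` is not part of the exercise) with a 1-minute gap followed by a 9-minute gap.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular index: 12:00, 12:01 (missing value), 12:10.
idx = pd.to_datetime(['2025-03-01 12:00', '2025-03-01 12:01', '2025-03-01 12:10'])
gappy = pd.Series([0.0, np.nan, 10.0], index=idx)

# 'linear' ignores the index and treats the points as equally spaced.
by_position = gappy.interpolate(method='linear')

# 'time' weights by elapsed time between timestamps.
by_time = gappy.interpolate(method='time')

print(by_position.iloc[1], by_time.iloc[1])
```

The missing point sits one tenth of the way through the interval, so the time-based fill lands near the left value while the positional fill lands at the midpoint.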

## Problem 3 — `asof` semantics: last valid observation at or before each timestamp

Task:
- Given a minute-level series with missing values, compute the value **as-of** certain query times.
- Use `asof` and confirm results match a manual approach.

Note: `asof` returns the last non-NA value at or before the label (or `NaN` if there is none) and requires a sorted index.

In [6]:
idx = pd.date_range('2025-04-01 09:00', periods=6, freq='min')
s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 3.0], index=idx, name='v')
queries = pd.to_datetime(['2025-04-01 09:00', '2025-04-01 09:02', '2025-04-01 09:05'])

# YOUR WORK:
# asof_vals = ...
# manual = ...
# asof_vals, manual

In [7]:
# SOLUTION
s_sorted = s.sort_index()
asof_vals = pd.Series([s_sorted.asof(t) for t in queries], index=queries, name='asof')

def manual_asof(series, t):
    sub = series.loc[:t].dropna()
    return sub.iloc[-1] if len(sub) else np.nan

manual = pd.Series([manual_asof(s_sorted, t) for t in queries], index=queries, name='manual')

assert asof_vals.equals(manual)
asof_vals, manual

(2025-04-01 09:00:00    NaN
 2025-04-01 09:02:00    1.0
 2025-04-01 09:05:00    3.0
 Name: asof, dtype: float64,
 2025-04-01 09:00:00    NaN
 2025-04-01 09:02:00    1.0
 2025-04-01 09:05:00    3.0
 Name: manual, dtype: float64)
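
`Series.asof` also accepts an array-like of labels, so the list comprehension is not strictly needed. The sketch below rebuilds the same data and compares the two forms; `vectorized` and `looped` are illustrative names.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2025-04-01 09:00', periods=6, freq='min')
s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 3.0], index=idx, name='v')
queries = pd.to_datetime(['2025-04-01 09:00', '2025-04-01 09:02', '2025-04-01 09:05'])

# One call for all query times; the result is indexed by the queries.
vectorized = s.asof(queries)

# Element-by-element version, as in the solution above.
looped = pd.Series([s.asof(t) for t in queries], index=queries)

print(vectorized)
```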

## Problem 4 — Duplicate index: stable disambiguation with (label, occurrence)

You have duplicates and need to refer to *"the 3rd occurrence of a label"* reliably.

Tasks:
1) Build a **MultiIndex** `(label, occurrence)` where occurrence counts within each label.
2) Use it to update only `('city', 2)` (3rd city) to `'Paris'`.
3) Return to a plain index (drop the occurrence level) while keeping the values.

In [8]:
base = pd.Series(
    ['USA', 'Topeka', 'France', 'Lyon', 'UK', 'Glasgow', 'Spain', 'Madrid'],
    index=['country', 'city', 'country', 'city', 'country', 'city', 'country', 'city'],
    name='Areas'
)

# YOUR WORK:
# occ = ...
# mi = ...
# updated = ...
# back = ...
# mi, updated, back

In [9]:
# SOLUTION
occ = base.groupby(level=0).cumcount()
mi = base.copy()
mi.index = pd.MultiIndex.from_arrays([base.index, occ], names=['label', 'occ'])

updated = mi.copy()
updated.loc[('city', 2)] = 'Paris'

back = updated.copy()
back.index = back.index.get_level_values('label')

mi, updated, back

(label    occ
 country  0          USA
 city     0       Topeka
 country  1       France
 city     1         Lyon
 country  2           UK
 city     2      Glasgow
 country  3        Spain
 city     3       Madrid
 Name: Areas, dtype: object,
 label    occ
 country  0         USA
 city     0      Topeka
 country  1      France
 city     1        Lyon
 country  2          UK
 city     2       Paris
 country  3       Spain
 city     3      Madrid
 Name: Areas, dtype: object,
 label
 country       USA
 city       Topeka
 country    France
 city         Lyon
 country        UK
 city        Paris
 country     Spain
 city       Madrid
 Name: Areas, dtype: object)
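
To see why the occurrence level is needed, compare with what label-based assignment does on the duplicated index: it overwrites every matching row. The sketch below also shows a positional alternative for a one-off update; `naive`, `pos`, and `targeted` are illustrative names, not part of the exercise.

```python
import numpy as np
import pandas as pd

base = pd.Series(
    ['USA', 'Topeka', 'France', 'Lyon', 'UK', 'Glasgow', 'Spain', 'Madrid'],
    index=['country', 'city'] * 4,
    name='Areas',
)

# Label-based assignment hits all four 'city' rows.
naive = base.copy()
naive.loc['city'] = 'Paris'

# Positional alternative: find the integer position of the 3rd 'city'.
pos = np.flatnonzero(base.index == 'city')[2]
targeted = base.copy()
targeted.iloc[pos] = 'Paris'

print(naive.loc['city'].tolist())
print(targeted.tolist())
```

The positional route is shorter for a single update, while the `(label, occurrence)` MultiIndex keeps the addressing label-based and reusable.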

## Problem 5 — MultiIndex time series: per-group resample then align back

You have sales for multiple stores with irregular timestamps.

Tasks:
1) Build a MultiIndex Series `(store, ts)`.
2) For each store, resample to 5-minute frequency and forward-fill.
3) Return a **single** Series with MultiIndex `(store, ts)`.

Hint: `groupby(level=0).apply(...)` then sort the index.

In [10]:
stores = ['S1', 'S1', 'S1', 'S2', 'S2']
ts = pd.to_datetime([
    '2025-05-01 10:00', '2025-05-01 10:07', '2025-05-01 10:20',
    '2025-05-01 10:02', '2025-05-01 10:19'
])
y = pd.Series([10, 12, 15, 7, 11], name='sales')
sales = pd.Series(y.to_numpy(), index=pd.MultiIndex.from_arrays([stores, ts], names=['store', 'ts']))

# YOUR WORK:
# res = ...
# res.head(12)

In [11]:
# SOLUTION
def resample_ffill(s_store: pd.Series) -> pd.Series:
    # Each group arrives with its full (store, ts) MultiIndex, so drop 'store'
    # to leave the DatetimeIndex that resample needs
    s_store = s_store.droplevel('store').sort_index()
    out = s_store.resample('5min').mean().ffill()
    return out

res = sales.groupby(level='store').apply(resample_ffill)
res.index = res.index.set_names(['store', 'ts'])
res = res.sort_index()

res.head(12)

store  ts                 
S1     2025-05-01 10:00:00    10.0
       2025-05-01 10:05:00    12.0
       2025-05-01 10:10:00    12.0
       2025-05-01 10:15:00    12.0
       2025-05-01 10:20:00    15.0
S2     2025-05-01 10:00:00     7.0
       2025-05-01 10:05:00     7.0
       2025-05-01 10:10:00     7.0
       2025-05-01 10:15:00    11.0
dtype: float64
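
Iterating over the groupby shows that each group keeps both index levels, which is why the helper drops `'store'` before resampling. The sketch below also rebuilds the result with `pd.concat` over a dict of per-store pieces as one possible alternative to `apply`; `group_index_names`, `pieces`, and `alt` are illustrative names.

```python
import pandas as pd

stores = ['S1', 'S1', 'S1', 'S2', 'S2']
ts = pd.to_datetime([
    '2025-05-01 10:00', '2025-05-01 10:07', '2025-05-01 10:20',
    '2025-05-01 10:02', '2025-05-01 10:19',
])
sales = pd.Series(
    [10, 12, 15, 7, 11],
    index=pd.MultiIndex.from_arrays([stores, ts], names=['store', 'ts']),
)

# Each group still carries the full (store, ts) index.
group_index_names = {
    key: list(grp.index.names) for key, grp in sales.groupby(level='store')
}

# Resample each store separately, then stitch the pieces back together.
pieces = {
    key: grp.droplevel('store').sort_index().resample('5min').mean().ffill()
    for key, grp in sales.groupby(level='store')
}
alt = pd.concat(pieces).rename_axis(['store', 'ts'])

print(group_index_names)
print(alt)
```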

## Problem 6 — `where` vs `mask` + preserving dtype

Task:
- Replace negative values with 0, but keep dtype and index.
- Then replace values above a threshold (the solution uses 6) with `NaN`.

Requirements:
- Use `where`/`mask` (not `apply`).
- Demonstrate the dtype changes and explain them.

In [12]:
z = pd.Series([5, -2, 3, -1, 10], index=list('abcde'))

# YOUR WORK:
# nonneg = ...
# capped_nan = ...
# nonneg, capped_nan, nonneg.dtype, capped_nan.dtype

In [13]:
# SOLUTION
nonneg = z.mask(z < 0, 0)  # mask: replace where condition True
capped_nan = nonneg.where(nonneg <= 6)  # where: keep where condition True, else NaN

# Note: introducing NaN forces float dtype for numeric series (unless nullable dtype is used)
nonneg, capped_nan, nonneg.dtype, capped_nan.dtype

(a     5
 b     0
 c     3
 d     0
 e    10
 dtype: int64,
 a    5.0
 b    0.0
 c    3.0
 d    0.0
 e    NaN
 dtype: float64,
 dtype('int64'),
 dtype('float64'))
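
`mask` and `where` are mirror images: `mask(cond, other)` replaces where the condition is true, and `where(~cond, other)` keeps where it is true. For this particular task `clip(lower=0)` is a third spelling. A short sketch comparing the three on the same data (variable names are illustrative):

```python
import pandas as pd

z = pd.Series([5, -2, 3, -1, 10], index=list('abcde'))

via_mask = z.mask(z < 0, 0)     # replace where z < 0
via_where = z.where(z >= 0, 0)  # keep where z >= 0, else 0
via_clip = z.clip(lower=0)      # floor every value at 0

print(via_mask.tolist(), via_mask.dtype)
```

All three stay `int64` because the replacement value is itself an integer and no `NaN` is introduced.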

## Problem 7 — Nullable dtypes: keep integers while allowing missing

Task:
- Convert an integer Series to a **nullable integer dtype**.
- Insert missing values without converting to float.
- Compare with the standard NumPy int behavior.

Goal: end with dtype `Int64` (capital I), not `float64`.

In [14]:
q = pd.Series([1, 2, 3], index=list('xyz'))

# YOUR WORK:
# q_nullable = ...
# q_nullable.loc['y'] = ...
# q_float = ...
# q_nullable, q_nullable.dtype, q_float, q_float.dtype

In [15]:
# SOLUTION
q_nullable = q.astype('Int64')
q_nullable.loc['y'] = pd.NA

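# NumPy-backed int64 cannot hold NaN, so introducing a missing value upcasts to float64.
# (where() is used here because assigning np.nan into an int64 Series via .loc
# emits a FutureWarning about incompatible dtypes in recent pandas releases.)
q_float = q.where(q.index != 'y')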

assert str(q_nullable.dtype) == 'Int64'
q_nullable, q_nullable.dtype, q_float, q_float.dtype

(x       1
 y    <NA>
 z       3
 dtype: Int64,
 Int64Dtype(),
 x    1.0
 y    NaN
 z    3.0
 dtype: float64,
 dtype('float64'))
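
Beyond the display difference, the float upcast can change values: `float64` cannot represent every integer above 2**53, which matters for large identifiers. The sketch below uses a hypothetical ID series (`ids` is not part of the exercise) to compare the two dtypes.

```python
import pandas as pd

big = 2**53 + 1  # not exactly representable as float64
ids = pd.Series([big, 1])

# Nullable integer: the missing value sits alongside exact integers.
as_nullable = ids.astype('Int64')
as_nullable.iloc[1] = pd.NA

# Float route: the large integer is rounded during the cast.
as_float = ids.astype('float64')
as_float.iloc[1] = float('nan')

print(int(as_nullable.iloc[0]) == big)
print(int(as_float.iloc[0]) == big)
```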

## Problem 8 — `explode` + `value_counts` pipeline with missing handling

You have a Series of comma-separated tags, some missing.

Tasks:
1) Split into lists, explode into one tag per row.
2) Strip whitespace and drop empties.
3) Compute tag frequencies.

Constraints:
- Do not use Python loops.
- Treat missing as having no tags.

In [16]:
tags = pd.Series(['a, b, c', None, 'b, c', 'a,  ', 'c'], index=[101, 102, 103, 104, 105])

# YOUR WORK:
# freq = ...
# freq

In [17]:
# SOLUTION
freq = (
    tags.fillna('')
        .str.split(',')
        .explode()
        .astype('string')
        .str.strip()
        .loc[lambda s: s.ne('')]
        .value_counts()
)
freq

c    3
a    2
b    2
Name: count, dtype: Int64
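
Another way to treat missing rows as having no tags is to drop them before splitting rather than filling with an empty string. This is a sketch of that variant, not the notebook's solution; it skips the `astype('string')` step, so the counts come back with a NumPy integer dtype instead of `Int64`.

```python
import pandas as pd

tags = pd.Series(['a, b, c', None, 'b, c', 'a,  ', 'c'], index=[101, 102, 103, 104, 105])

freq_alt = (
    tags.dropna()                    # missing rows contribute no tags
        .str.split(',')
        .explode()                   # one tag per row, original index repeated
        .str.strip()
        .loc[lambda s: s != '']      # drop empties left by trailing commas
        .value_counts()
)

print(freq_alt)
```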

## Problem 9 — Stable ranking with ties, then select “top-k per group”

You have a MultiIndex Series `(group, item)` of scores.

Tasks:
1) Compute a **dense rank** within each group (1 = best score).
2) Select all items with rank <= 2.
3) Return results sorted by `(group, rank, score desc)`.

Note: Do not use Python loops.

In [18]:
idx = pd.MultiIndex.from_tuples(
    [('G1','a'), ('G1','b'), ('G1','c'), ('G2','a'), ('G2','b'), ('G2','c')],
    names=['group', 'item']
)
scores = pd.Series([90, 90, 80, 70, 85, 85], index=idx, name='score')

# YOUR WORK:
# rank = ...
# top = ...
# top

In [19]:
# SOLUTION
rank = scores.groupby(level='group').rank(method='dense', ascending=False).astype('int64')
top = scores.loc[rank <= 2]

# Sort by (group, rank asc, score desc)
tmp = pd.DataFrame({'score': top, 'rank': rank.loc[top.index]})
tmp = tmp.sort_values(['group', 'rank', 'score'], ascending=[True, True, False])

# Return as a Series with a helpful MultiIndex including rank
out = tmp.set_index('rank', append=True)['score']
out.index = out.index.set_names(['group', 'item', 'rank'])
out

group  item  rank
G1     a     1       90
       b     1       90
       c     2       80
G2     b     1       85
       c     1       85
       a     2       70
Name: score, dtype: int64
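
The choice of `method='dense'` matters once ties are followed by lower scores. The sketch below compares three rank methods on a small hypothetical series (`demo` is not part of the exercise) to show how the rank after a tie differs.

```python
import pandas as pd

demo = pd.Series([90, 90, 80, 70])

dense = demo.rank(method='dense', ascending=False)      # no gaps after ties
minimum = demo.rank(method='min', ascending=False)      # ties share the lowest rank, then a gap
average = demo.rank(method='average', ascending=False)  # ties share the mean of their ranks

print(pd.DataFrame({'score': demo, 'dense': dense, 'min': minimum, 'average': average}))
```

With `dense`, "rank <= 2" means "the two best distinct scores"; with `min`, the same filter would keep only the tied leaders here.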

## Problem 10 — `searchsorted` on a sorted index: fast interval lookup

You have a sorted Series of thresholds (index is numeric) mapping to categories.
Given query values, assign each query to the category whose threshold is the **largest <= query**.

Tasks:
1) Use `Index.searchsorted` (vectorized) to assign categories.
2) Handle queries below the smallest threshold as `None`.
3) Return a Series aligned to the query index.

This is a common expert pattern for fast binning without `cut`.

In [20]:
thresholds = pd.Series(['low', 'med', 'high'], index=pd.Index([0, 50, 100], name='min_score'))
queries = pd.Series([10, 55, 120, -5], index=list('wxyz'), name='score')

# YOUR WORK:
# assigned = ...
# assigned

In [21]:
# SOLUTION
idx = thresholds.index
pos = idx.searchsorted(queries.to_numpy(), side='right') - 1

assigned = pd.Series(index=queries.index, dtype='object', name='category')
valid = pos >= 0
assigned.loc[valid] = thresholds.iloc[pos[valid]].to_numpy()
assigned.loc[~valid] = None

assigned

w     low
x     med
y    high
z    None
Name: category, dtype: object
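
For comparison, the same assignment can be expressed with `pd.cut` by treating the thresholds as left-closed bin edges and adding an open-ended top bin. This is a sketch for cross-checking the `searchsorted` result; the `bins` list with `np.inf` is an assumption added here, and `pd.cut` returns a categorical Series with `NaN` (rather than `None`) for out-of-range queries.

```python
import numpy as np
import pandas as pd

thresholds = pd.Series(['low', 'med', 'high'], index=pd.Index([0, 50, 100], name='min_score'))
queries = pd.Series([10, 55, 120, -5], index=list('wxyz'), name='score')

# searchsorted: position of the largest threshold <= query, or -1 if none.
pos = thresholds.index.searchsorted(queries.to_numpy(), side='right') - 1

# pd.cut with left-closed intervals [0, 50), [50, 100), [100, inf).
bins = list(thresholds.index) + [np.inf]
via_cut = pd.cut(queries, bins=bins, right=False, labels=thresholds.tolist())

print(pos)
print(via_cut)
```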