# Pandas Series — Advanced Practice (with Solutions)

This notebook contains **advanced (but not too advanced)** practice problems on **`pandas.Series`**.

**Best practices used here:**
- Use **`loc` / `iloc`** explicitly to avoid ambiguous indexing.
- Prefer **vectorized** operations over Python loops.
- Make results **deterministic** (fixed seed where randomness is used).
- Include **sanity checks** with `assert`.

> Tip: Attempt each problem cell on your own first, then compare with the solution cell right below it.

In [1]:
import pandas as pd
import numpy as np

## Problem 1 — Index alignment vs positional math

You are given two Series with the **same labels** but in a **different index order** (the values differ).

1) Compute `a + b` (default behavior) and explain the result.
2) Compute a **position-wise** sum (ignoring labels).
3) Produce a final Series that keeps `a`'s index but sums values by position.

In [2]:
a = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
b = pd.Series([1,  2,  3],  index=['z', 'x', 'y'])

# YOUR WORK:
# 1) aligned_sum = ...
# 2) positional_sum = ...
# 3) final = ...
# aligned_sum, positional_sum, final

In [3]:
# SOLUTION
aligned_sum = a + b
# Pandas aligns by label: x gets b['x']=2, y gets b['y']=3, z gets b['z']=1

positional_sum = a.to_numpy() + b.to_numpy()
final = pd.Series(positional_sum, index=a.index)

aligned_sum, positional_sum, final

(x    12
 y    23
 z    31
 dtype: int64,
 array([11, 22, 33]),
 x    11
 y    22
 z    33
 dtype: int64)
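
Alignment also matters when the labels only partly overlap. The sketch below uses a hypothetical Series `c` (not part of the problem) to show that unmatched labels become `NaN`, and that `Series.add(..., fill_value=0)` is one way to treat the missing side as zero.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
# Hypothetical second Series: shares only 'z' with a, and adds a new label 'w'
c = pd.Series([1, 2], index=['z', 'w'])

aligned = a + c                  # result covers the union of labels; unmatched ones are NaN
filled = a.add(c, fill_value=0)  # a label missing on one side counts as 0 instead

print(aligned)
print(filled)
```

Note that `aligned` is float because `NaN` cannot be stored in an integer Series.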

## Problem 2 — Reindexing with fill strategy and sanity checks

Given a Series of daily revenue (some days missing):

1) Reindex to a full daily range.
2) Fill missing days with **0**.
3) Compute a 3-day **rolling mean** (including zeros).
4) Confirm the reindexed length via an `assert`.

In [4]:
rev = pd.Series(
    [100, 180, 120],
    index=pd.to_datetime(['2025-01-01', '2025-01-03', '2025-01-06'])
)

# YOUR WORK:
# full_idx = ...
# rev_full = ...
# roll3 = ...
# assert ...
# rev_full, roll3

In [5]:
# SOLUTION
full_idx = pd.date_range(rev.index.min(), rev.index.max(), freq='D')
rev_full = rev.reindex(full_idx, fill_value=0)
roll3 = rev_full.rolling(3).mean()

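# Independent check: the span from first to last date, inclusive, is 6 days
assert len(rev_full) == (rev.index.max() - rev.index.min()).days + 1 == 6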
rev_full, roll3

(2025-01-01    100
 2025-01-02      0
 2025-01-03    180
 2025-01-04      0
 2025-01-05      0
 2025-01-06    120
 Freq: D, dtype: int64,
 2025-01-01          NaN
 2025-01-02          NaN
 2025-01-03    93.333333
 2025-01-04    60.000000
 2025-01-05    60.000000
 2025-01-06    40.000000
 Freq: D, dtype: float64)
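
The first two rolling values are `NaN` because a 3-day window is not yet full. If partial windows are acceptable for your use case (an assumption, not a requirement of the problem), `min_periods=1` averages whatever is available so far:

```python
import pandas as pd

rev = pd.Series(
    [100, 180, 120],
    index=pd.to_datetime(['2025-01-01', '2025-01-03', '2025-01-06'])
)
full_idx = pd.date_range(rev.index.min(), rev.index.max(), freq='D')
rev_full = rev.reindex(full_idx, fill_value=0)

# min_periods=1: emit a mean as soon as at least one observation is in the window
roll3_partial = rev_full.rolling(3, min_periods=1).mean()
print(roll3_partial)
```

From the third day onward the values match `roll3` above; only the first two rows change.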

## Problem 3 — `loc` vs `iloc` with integer labels (avoid the trap)

You are given a Series with **integer labels** that are not positional indices.

1) Select the value whose **label** is `20`.
2) Select the value at **position** 1.
3) Slice labels from 10 to 30 (inclusive).
4) Slice positions 0 to 2 (exclusive of 2).

In [6]:
s = pd.Series([7, 8, 9], index=[10, 20, 30])

# YOUR WORK:
# label_20 = ...
# pos_1 = ...
# label_slice = ...
# pos_slice = ...
# label_20, pos_1, label_slice, pos_slice

In [7]:
# SOLUTION
label_20 = s.loc[20]
pos_1 = s.iloc[1]
label_slice = s.loc[10:30]     # inclusive
pos_slice = s.iloc[0:2]       # exclusive end

label_20, pos_1, label_slice, pos_slice

(np.int64(8),
 np.int64(8),
 10    7
 20    8
 30    9
 dtype: int64,
 10    7
 20    8
 dtype: int64)
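
To see the trap itself, try plain brackets. On an integer index, `s[0]` is a label lookup, so it fails here because no label `0` exists. A small sketch:

```python
import pandas as pd

s = pd.Series([7, 8, 9], index=[10, 20, 30])

# Plain brackets with an integer key look up a LABEL on an integer index.
# There is no label 0, so this raises KeyError instead of returning 7.
try:
    s[0]
    plain_lookup = 'found'
except KeyError:
    plain_lookup = 'KeyError'

first_by_position = s.iloc[0]  # position 0
first_by_label = s.loc[10]     # label 10

print(plain_lookup, first_by_position, first_by_label)
```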

## Problem 4 — Duplicate index: targeted update without overwriting all matches

You have a Series with **duplicate labels**.

Task:
- Change **only the second** occurrence of label `'city'` to `'London'`.
- Do **not** change other `'city'` rows.

Hint: Combine a boolean mask with `groupby(level=0).cumcount()`, or find the row's position and assign through `iloc`.

In [8]:
areas = pd.Series(
    ['USA', 'Topeka', 'France', 'Lyon', 'UK', 'Glasgow'],
    index=['country', 'city', 'country', 'city', 'country', 'city'],
    name='Areas'
)

# YOUR WORK:
# updated = ...
# updated

In [9]:
# SOLUTION
updated = areas.copy()

# Number each occurrence within its label group: 0 = first, 1 = second, ...
occ = updated.groupby(level=0).cumcount()
# Compare as NumPy arrays so the mask is purely positional (no alignment on duplicate labels)
mask = (updated.index == 'city') & (occ.to_numpy() == 1)  # second 'city' only

updated.loc[mask] = 'London'
updated

country        USA
city        Topeka
country     France
city        London
country         UK
city       Glasgow
Name: Areas, dtype: object
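
The hint's second route, working by position, can be sketched as follows. `np.flatnonzero` is used here to list the positions where the label is `'city'`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

areas = pd.Series(
    ['USA', 'Topeka', 'France', 'Lyon', 'UK', 'Glasgow'],
    index=['country', 'city', 'country', 'city', 'country', 'city'],
    name='Areas'
)

updated_pos = areas.copy()

# Positions (not labels) of every 'city' row, in order of appearance
city_positions = np.flatnonzero(updated_pos.index == 'city')

# Index 1 of that array is the second occurrence; assign by position
updated_pos.iloc[city_positions[1]] = 'London'
print(updated_pos)
```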

## Problem 5 — Combining two Series with different coverage (`combine_first`)

You have two data sources for the same metric. `primary` is preferred, but it has missing values.

1) Create a final Series that uses `primary` where available, otherwise falls back to `backup`.
2) Show how many values came from the fallback.
3) Verify that the final has no missing values.

In [10]:
primary = pd.Series([1.1, np.nan, 3.3, np.nan], index=list('abcd'))
backup  = pd.Series([1.0, 2.2, np.nan, 4.4], index=list('abcd'))

# YOUR WORK:
# final = ...
# fallback_count = ...
# assert ...
# final, fallback_count

In [11]:
# SOLUTION
final = primary.combine_first(backup)
# Count only values actually taken from backup: primary missing AND backup present
fallback_count = (primary.isna() & backup.notna()).sum()

assert final.notna().all()
final, int(fallback_count)

(a    1.1
 b    2.2
 c    3.3
 d    4.4
 dtype: float64,
 2)
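
In this problem both Series share the same index. When coverage really differs, `combine_first` returns the union of labels, whereas `fillna` keeps only the caller's labels. A sketch with hypothetical data (label `'e'` exists only in the backup):

```python
import numpy as np
import pandas as pd

primary = pd.Series([1.1, np.nan], index=['a', 'b'])
backup = pd.Series([2.2, 5.5], index=['b', 'e'])  # hypothetical: 'e' is backup-only

combined = primary.combine_first(backup)  # labels a, b, e
filled_only = primary.fillna(backup)      # labels a, b (primary's index is kept)

print(combined)
print(filled_only)
```

Pick `fillna` when the primary source defines which labels should exist, and `combine_first` when extra labels from the backup are welcome.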

## Problem 6 — Fast wins: replace a slow loop with vectorization

You have a Series of prices and want to compute a **tiered fee**:
- fee is 2% if price < 100
- fee is 1% otherwise

1) Compute fees **without** a Python loop.
2) Return a Series of fees with the same index.
3) Confirm the dtype is float.

In [12]:
prices = pd.Series([50, 120, 80, 200], index=['p1', 'p2', 'p3', 'p4'])

# YOUR WORK:
# fees = ...
# assert ...
# fees

In [13]:
# SOLUTION
rates = np.where(prices < 100, 0.02, 0.01)
fees = prices * rates

assert isinstance(fees, pd.Series)
assert fees.dtype.kind == 'f'
fees

p1    1.0
p2    1.2
p3    1.6
p4    2.0
dtype: float64
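
A quick way to gain confidence in a vectorized rewrite is to compare it with a plain-loop reference on the same data. A minimal sketch, using `pd.testing.assert_series_equal` for the comparison:

```python
import numpy as np
import pandas as pd

prices = pd.Series([50, 120, 80, 200], index=['p1', 'p2', 'p3', 'p4'])

# Vectorized version: one rate per price, then an element-wise product
fees_vec = prices * np.where(prices < 100, 0.02, 0.01)

# Reference version: explicit Python loop over the values
fees_loop = pd.Series(
    [p * (0.02 if p < 100 else 0.01) for p in prices],
    index=prices.index
)

# Raises AssertionError if values, index, or dtype differ
pd.testing.assert_series_equal(fees_vec, fees_loop)
```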

## Problem 7 — Normalize and report top categories (robust output)

Given a Series of event types, compute:
1) counts per category
2) percentage share (sums to 1)
3) a clean summary Series that contains only the **top 2** categories by count, labeled like `"A (50.0%)"`.

Make sure your code still returns exactly two categories when counts are tied.

In [14]:
events = pd.Series(['A', 'B', 'A', 'C', 'A', 'B', 'C', 'A', 'B', 'B'])

# YOUR WORK:
# counts = ...
# share = ...
# top2 = ...
# counts, share, top2

In [15]:
# SOLUTION
counts = events.value_counts(dropna=False)
share = events.value_counts(normalize=True, dropna=False)

# nlargest(2) returns exactly two rows; a tie at the cutoff is broken by order of appearance
top2_idx = counts.nlargest(2).index
top2 = pd.Series(
    [f"{k} ({share.loc[k]*100:.1f}%)" for k in top2_idx],
    index=top2_idx,
    name='Top2'
)

assert abs(share.sum() - 1.0) < 1e-12
counts, share, top2

(A    4
 B    4
 C    2
 Name: count, dtype: int64,
 A    0.4
 B    0.4
 C    0.2
 Name: proportion, dtype: float64,
 A    A (40.0%)
 B    B (40.0%)
 Name: Top2, dtype: object)
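
If you would rather keep every category tied at the cutoff than pick one arbitrarily, `nlargest` accepts `keep='all'`. The data below is hypothetical and chosen so that `B` and `C` tie for second place:

```python
import pandas as pd

tied = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'C'])  # hypothetical: B and C tie
tied_counts = tied.value_counts()

strict_top2 = tied_counts.nlargest(2)                  # exactly two rows
top2_with_ties = tied_counts.nlargest(2, keep='all')   # may return more than two

print(strict_top2)
print(top2_with_ties)
```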

## Problem 8 — Map vs replace: standardize messy labels safely

You have inconsistent country labels.

Requirements:
- Standardize using a mapping.
- Unknown values should remain unchanged.
- Produce a final Series and also a boolean Series indicating which entries were changed.

In [16]:
countries = pd.Series(['UK', 'U.K.', 'United Kingdom', 'USA', 'U.S.A.', 'Canada'])
mapping = {
    'UK': 'United Kingdom',
    'U.K.': 'United Kingdom',
    'USA': 'United States',
    'U.S.A.': 'United States'
}

# YOUR WORK:
# standardized = ...
# changed = ...
# standardized, changed

In [17]:
# SOLUTION
# map() turns non-mapped values into NaN, so we use fillna(original) to keep unknowns
mapped = countries.map(mapping)
standardized = mapped.fillna(countries)
changed = standardized.ne(countries)

standardized, changed

(0    United Kingdom
 1    United Kingdom
 2    United Kingdom
 3     United States
 4     United States
 5            Canada
 dtype: object,
 0     True
 1     True
 2    False
 3     True
 4     True
 5    False
 dtype: bool)
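
For the "replace" half of the title: with a plain value-to-value dict, `Series.replace` leaves unmapped values untouched, so no `fillna` step is needed. A sketch comparing the two on the same data:

```python
import pandas as pd

countries = pd.Series(['UK', 'U.K.', 'United Kingdom', 'USA', 'U.S.A.', 'Canada'])
mapping = {
    'UK': 'United Kingdom',
    'U.K.': 'United Kingdom',
    'USA': 'United States',
    'U.S.A.': 'United States'
}

via_map = countries.map(mapping).fillna(countries)  # unmapped -> NaN -> original value
via_replace = countries.replace(mapping)            # unmapped values are left as-is

assert via_map.equals(via_replace)
print(via_replace)
```

One caveat for the `map` route: if the original data already contains `NaN`, `fillna(countries)` simply puts `NaN` back, which is usually what you want.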

## Problem 9 — MultiIndex Series: slice and aggregate by one level

You have a Series indexed by `(store, day)`.

Tasks:
1) Get all data for store `'S2'`.
2) Compute total per store.
3) Compute the best (max) day per store and return just the **day label** for each store.

In [18]:
idx = pd.MultiIndex.from_product(
    [['S1', 'S2'], pd.to_datetime(['2025-02-01', '2025-02-02', '2025-02-03'])],
    names=['store', 'day']
)
sales = pd.Series([10, 12, 11,  7, 15,  9], index=idx, name='sales')

# YOUR WORK:
# s2 = ...
# totals = ...
# best_day = ...
# s2, totals, best_day

In [19]:
# SOLUTION
s2 = sales.loc['S2']
totals = sales.groupby(level='store').sum()

# idxmax gives the full MultiIndex key; we then extract the 'day' level
best_key = sales.groupby(level='store').idxmax()
best_day = best_key.map(lambda t: t[1])
best_day.name = 'best_day'

s2, totals, best_day

(day
 2025-02-01     7
 2025-02-02    15
 2025-02-03     9
 Name: sales, dtype: int64,
 store
 S1    33
 S2    31
 Name: sales, dtype: int64,
 store
 S1   2025-02-02
 S2   2025-02-02
 Name: best_day, dtype: datetime64[ns])
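
`sales.loc['S2']` works because `store` is the outermost level. To slice on an inner level such as `day`, `Series.xs` with `level=` is one option. A short sketch:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['S1', 'S2'], pd.to_datetime(['2025-02-01', '2025-02-02', '2025-02-03'])],
    names=['store', 'day']
)
sales = pd.Series([10, 12, 11, 7, 15, 9], index=idx, name='sales')

# Same result as sales.loc['S2'], but with the level named explicitly
s2_xs = sales.xs('S2', level='store')

# Cross-section on the inner level: every store's sales on one day
feb2 = sales.xs(pd.Timestamp('2025-02-02'), level='day')

print(s2_xs)
print(feb2)
```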

## Problem 10 — Safe deletion: drop by position (without guessing labels)

Given a Series with a non-trivial index, drop rows by **position**.

Tasks:
1) Drop the first and last elements by position.
2) Return a new Series (do not mutate the original).
3) Confirm original Series is unchanged.

In [20]:
s = pd.Series([100, 200, 300, 400], index=['alpha', 'beta', 'gamma', 'delta'], name='vals')

# YOUR WORK:
# dropped = ...
# assert ...
# s, dropped

In [21]:
# SOLUTION
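# Translate positions to labels; this is safe here because the labels are unique
to_drop_labels = s.index[[0, -1]]
dropped = s.drop(to_drop_labels)  # drop() returns a new Series

# The original must be untouched: same labels and same values
assert s.index.tolist() == ['alpha', 'beta', 'gamma', 'delta']
assert s.tolist() == [100, 200, 300, 400]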
s, dropped

(alpha    100
 beta     200
 gamma    300
 delta    400
 Name: vals, dtype: int64,
 beta     200
 gamma    300
 Name: vals, dtype: int64)
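
Converting positions to labels and calling `drop` removes every row that carries those labels. With duplicate labels that can delete more than intended, so a purely positional mask is the safer route. A sketch with hypothetical data where `'a'` appears twice:

```python
import numpy as np
import pandas as pd

dup = pd.Series([1, 2, 3, 4], index=['a', 'b', 'a', 'c'])  # hypothetical: 'a' repeats

# Label route: drops BOTH 'a' rows plus 'c', not just the first and last positions
by_label = dup.drop(dup.index[[0, -1]])

# Positional route: build a keep-mask and switch off positions 0 and -1
keep = np.ones(len(dup), dtype=bool)
keep[[0, -1]] = False
by_position = dup.iloc[keep]

print(by_label)
print(by_position)
```

For the common "trim both ends" case, `dup.iloc[1:-1]` gives the same result as the mask.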