# Missing Data — Advanced Practice (with Solutions)

**Goal:** Practice robust, real-world missing-data handling in NumPy/Pandas.

**Best practices used here**
- Use `pd.isna` / `pd.notna` (handles `None`, `np.nan`, `pd.NA`).
- Prefer nullable dtypes (`Int64`, `Float64`, `string`, `boolean`) when you want missing values without dtype surprises.
- Avoid sentinel values (like `-1`) unless you *must* and it’s documented.
- Validate with small checks (`assert`) after transformations.
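
To see why sentinel values are discouraged, here is a minimal sketch comparing a `-1` sentinel with a nullable `Int64` column. The series names and numbers are illustrative assumptions, not part of the exercises.

```python
import pandas as pd

# Hypothetical readings where the middle value is unknown.
with_sentinel = pd.Series([10, -1, 30])                # -1 stands in for "missing"
with_na = pd.Series([10, pd.NA, 30], dtype='Int64')    # missingness is explicit

# The sentinel is treated as a real number and drags the mean down.
sentinel_mean = with_sentinel.mean()

# The nullable column skips the missing entry by default (skipna=True).
na_mean = with_na.mean()

print('mean with sentinel:', sentinel_mean)
print('mean with pd.NA   :', na_mean)
```

The sentinel version silently averages `-1` into the result, while the nullable version averages only the two observed values.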

> Run cells top-to-bottom.

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

## Problem 1 — `NaN`/`None`/`pd.NA` detection (and why equality fails)

Create a `Series` that contains: `1`, `np.nan`, `None`, `pd.NA`, and `'x'`.

**Tasks**
1. Show that comparing missing values with `==` is unreliable.
2. Build a boolean mask that correctly flags all missing values.
3. Extract only the missing entries.

**Expected**: missing detection should mark `np.nan`, `None`, and `pd.NA` as missing.

In [2]:
# Solution
s = pd.Series([1, np.nan, None, pd.NA, 'x'], dtype='object')
display(s)

# 1) Equality is unreliable for missing values: each marker behaves differently
print('np.nan == np.nan ->', np.nan == np.nan)  # False: NaN never equals itself
print('None == None ->', None == None)          # True, but only because None is a singleton
print('pd.NA == pd.NA ->', pd.NA == pd.NA)      # <NA>: neither True nor False, and it does not raise

# 2) Correct missing mask
mask_missing = pd.isna(s)
display(mask_missing)

# 3) Extract missing entries
missing_entries = s[mask_missing]
display(missing_entries)

assert mask_missing.tolist() == [False, True, True, True, False]

0       1
1     NaN
2    None
3    <NA>
4       x
dtype: object

np.nan == np.nan -> False
None == None -> True
pd.NA == pd.NA -> <NA>


0    False
1     True
2     True
3     True
4    False
dtype: bool

1     NaN
2    None
3    <NA>
dtype: object
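
As a complement to the solution above, the following sketch shows what goes wrong when `==` is used element-wise on a `Series`, and why `pd.NA` cannot be used directly in an `if` test. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Element-wise equality against NaN never matches, so this mask flags nothing.
eq_mask = values == np.nan

# isna() is the reliable way to flag missing entries.
isna_mask = values.isna()

print('== np.nan :', eq_mask.tolist())
print('isna()    :', isna_mask.tolist())

# pd.NA has no truth value: using it in a boolean context raises TypeError.
try:
    bool(pd.NA)
    na_truthiness = 'no error'
except TypeError:
    na_truthiness = 'TypeError'
print('bool(pd.NA) ->', na_truthiness)
```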

## Problem 2 — Integers with missing values: fix dtype pitfalls

You receive integer data with missing values. You must keep it *integer-like* (not float), while still representing missingness.

**Tasks**
1. Create a `Series([10, None, 30])` and show the dtype you get by default.
2. Convert it to Pandas nullable integer dtype (`Int64`).
3. Compute the sum *ignoring* missing values.
4. Replace missing values with the median (rounded to int) **without converting to float dtype**.

In [3]:
# Solution
raw = pd.Series([10, None, 30])
print('Default dtype:', raw.dtype)
display(raw)

nullable_int = raw.astype('Int64')
print('Nullable dtype:', nullable_int.dtype)
display(nullable_int)

total = nullable_int.sum(skipna=True)
print('Sum ignoring missing:', total)

# Median-based imputation (rounded) while preserving Int64
median_val = int(round(nullable_int.median(skipna=True)))
imputed = nullable_int.fillna(median_val).astype('Int64')
display(imputed)

assert str(nullable_int.dtype) == 'Int64'
assert str(imputed.dtype) == 'Int64'
assert imputed.isna().sum() == 0

Default dtype: float64


0    10.0
1     NaN
2    30.0
dtype: float64

Nullable dtype: Int64


0      10
1    <NA>
2      30
dtype: Int64

Sum ignoring missing: 40


0    10
1    20
2    30
dtype: Int64
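
Two related points, sketched under the assumption that the input may also contain fractional values (this is not part of the original task). First, the nullable dtype can be requested at construction time. Second, casting floats with a fractional part to `Int64` is refused, so they must be rounded explicitly first.

```python
import pandas as pd

# Request the nullable integer dtype directly instead of converting afterwards.
direct = pd.Series([10, None, 30], dtype='Int64')
total = direct.sum()  # missing values are skipped by default
print('dtype:', direct.dtype, '| sum:', total)

# Floats with a fractional part cannot be cast to Int64 without losing information.
fractional = pd.Series([10.5, None, 30.0])
try:
    fractional.astype('Int64')
    cast_outcome = 'cast succeeded'
except (TypeError, ValueError) as exc:
    cast_outcome = type(exc).__name__
print('astype on fractional data ->', cast_outcome)

# Rounding first makes the values integer-like; round() uses round-half-to-even,
# so 10.5 becomes 10 here.
rounded = fractional.round().astype('Int64')
print(rounded.tolist())
```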

## Problem 3 — Hidden missing values in text data (empty strings, 'NA', 'null')

In real datasets, missing text often appears as `''`, `'NA'`, `'N/A'`, `'null'`, `'None'`, `'  '`.

**Tasks**
1. Build a `Series` with messy values: `['Alice', '', ' NA ', 'null', None, 'Bob', '  ', 'N/A']`.
2. Normalize it by:
   - stripping whitespace
   - mapping `''`, `'NA'`, `'N/A'`, `'NULL'`, `'NONE'` (case-insensitive) to missing
3. Return a clean `string` dtype series.
4. Report the number of missing values after cleaning.

In [4]:
# Solution
s = pd.Series(['Alice', '', ' NA ', 'null', None, 'Bob', '  ', 'N/A'], dtype='object')
display(s)

# Normalize whitespace and case, then map known placeholders to missing
s_clean = (
    s.astype('string')
     .str.strip()
)

placeholders = {'', 'NA', 'N/A', 'NULL', 'NONE'}

s_clean = s_clean.mask(s_clean.str.upper().isin(placeholders), other=pd.NA)
s_clean = s_clean.astype('string')
display(s_clean)

missing_count = s_clean.isna().sum()
print('Missing after cleaning:', missing_count)

assert str(s_clean.dtype) == 'string'
assert missing_count == 6

0    Alice
1         
2      NA 
3     null
4     None
5      Bob
6         
7      N/A
dtype: object

0    Alice
1     <NA>
2     <NA>
3     <NA>
4     <NA>
5      Bob
6     <NA>
7     <NA>
dtype: string

Missing after cleaning: 6
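
The same cleaning steps can be wrapped in a small reusable helper. This is only a sketch: the function name `normalize_missing_text` and the sample city values are assumptions made for illustration.

```python
import pandas as pd

def normalize_missing_text(values, placeholders=('', 'NA', 'N/A', 'NULL', 'NONE')):
    """Strip whitespace and turn known placeholder strings into pd.NA."""
    cleaned = values.astype('string').str.strip()
    # Compare case-insensitively by upper-casing before the membership test.
    is_placeholder = cleaned.str.upper().isin(set(placeholders))
    return cleaned.mask(is_placeholder, pd.NA)

cities = pd.Series(['Paris', ' n/a ', 'None', '   ', 'Lima'])
cities_clean = normalize_missing_text(cities)
print(cities_clean.tolist())
print('missing:', int(cities_clean.isna().sum()))
```

Passing the placeholder set as a parameter keeps the list of "hidden missing" spellings in one documented place.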


## Problem 4 — Row filtering with `dropna`: `how=` and `thresh=`

Given a DataFrame, drop rows that are "too missing".

**Tasks**
1. Create the DataFrame below.
2. Drop rows where **all** values are missing.
3. Drop rows that have fewer than **3 non-missing** values.
4. Keep only rows where columns `A` and `B` are both present.

Use `dropna` correctly with `how`, `thresh`, and `subset`.

In [5]:
# Solution
df = pd.DataFrame(
    {
        'A': [1, np.nan, np.nan, 4, np.nan],
        'B': [10, 20, np.nan, np.nan, np.nan],
        'C': [100, np.nan, np.nan, 400, np.nan],
        'D': [1000, 2000, np.nan, np.nan, np.nan],
    },
    index=['r1', 'r2', 'r3', 'r4', 'r5']
)
display(df)

# 2) Drop rows where all values are missing
df_no_all_missing = df.dropna(how='all')
display(df_no_all_missing)

# 3) Keep rows with at least 3 non-missing values
df_thresh3 = df.dropna(thresh=3)
display(df_thresh3)

# 4) Keep only rows where A and B are present
df_ab_present = df.dropna(subset=['A', 'B'])
display(df_ab_present)

assert not {'r3', 'r5'} & set(df_no_all_missing.index)  # rows r3 and r5 are all-missing
assert set(df_thresh3.index) == {'r1'}
assert set(df_ab_present.index) == {'r1'}

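|    | A   | B    | C     | D      |
|----|-----|------|-------|--------|
| r1 | 1.0 | 10.0 | 100.0 | 1000.0 |
| r2 | NaN | 20.0 | NaN   | 2000.0 |
| r3 | NaN | NaN  | NaN   | NaN    |
| r4 | 4.0 | NaN  | 400.0 | NaN    |
| r5 | NaN | NaN  | NaN   | NaN    |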


|    | A   | B    | C     | D      |
|----|-----|------|-------|--------|
| r1 | 1.0 | 10.0 | 100.0 | 1000.0 |
| r2 | NaN | 20.0 | NaN   | 2000.0 |
| r4 | 4.0 | NaN  | 400.0 | NaN    |


|    | A   | B    | C     | D      |
|----|-----|------|-------|--------|
| r1 | 1.0 | 10.0 | 100.0 | 1000.0 |


|    | A   | B    | C     | D      |
|----|-----|------|-------|--------|
| r1 | 1.0 | 10.0 | 100.0 | 1000.0 |
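
`thresh` counts non-missing values, so it can also be derived from a required fraction of columns. The sketch below assumes a hypothetical rule of "at least half of the columns present" and checks that `dropna(thresh=...)` matches an explicit boolean mask.

```python
import math

import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        'A': [1, np.nan, np.nan, 4],
        'B': [10, 20, np.nan, np.nan],
        'C': [100, np.nan, np.nan, 400],
    },
    index=['r1', 'r2', 'r3', 'r4'],
)

# Assumed rule: keep rows with at least 50% of their columns present.
min_fraction = 0.5
min_present = math.ceil(min_fraction * frame.shape[1])  # ceil(1.5) == 2

by_thresh = frame.dropna(thresh=min_present)

# Equivalent explicit form: count non-missing values per row and compare.
by_mask = frame[frame.notna().sum(axis=1) >= min_present]

print(by_thresh)
```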


## Problem 5 — Group-wise imputation (mean/median per group)

You have sales data with missing `revenue`. Impute missing values **within each region**.

**Tasks**
1. Create the DataFrame below.
2. Impute missing `revenue` with the **median revenue per region**.
3. If a region has all missing revenue, fall back to the **global median**.
4. Verify no missing remains in `revenue_imputed`.

Hint: `groupby().transform()` + `fillna()` is usually cleaner than `apply()` for this.

In [6]:
# Solution
sales = pd.DataFrame(
    {
        'region': ['East', 'East', 'East', 'West', 'West', 'North', 'North'],
        'revenue': [100.0, np.nan, 130.0, 80.0, np.nan, np.nan, np.nan],
        'units': [5, 7, 6, 4, 3, 2, 1]
    }
)
display(sales)

global_median = sales['revenue'].median(skipna=True)
region_median = sales.groupby('region')['revenue'].transform('median')

# First fill with region median; remaining NaNs (regions all-missing) filled with global median
sales['revenue_imputed'] = sales['revenue'].fillna(region_median).fillna(global_median)
display(sales)

assert sales['revenue_imputed'].isna().sum() == 0
assert sales.loc[sales['region'].eq('North'), 'revenue_imputed'].nunique() == 1  # all fallback to global median

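|   | region | revenue | units |
|---|--------|---------|-------|
| 0 | East   | 100.0   | 5     |
| 1 | East   | NaN     | 7     |
| 2 | East   | 130.0   | 6     |
| 3 | West   | 80.0    | 4     |
| 4 | West   | NaN     | 3     |
| 5 | North  | NaN     | 2     |
| 6 | North  | NaN     | 1     |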


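|   | region | revenue | units | revenue_imputed |
|---|--------|---------|-------|-----------------|
| 0 | East   | 100.0   | 5     | 100.0           |
| 1 | East   | NaN     | 7     | 115.0           |
| 2 | East   | 130.0   | 6     | 130.0           |
| 3 | West   | 80.0    | 4     | 80.0            |
| 4 | West   | NaN     | 3     | 80.0            |
| 5 | North  | NaN     | 2     | 100.0           |
| 6 | North  | NaN     | 1     | 100.0           |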
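
The reason `transform` pairs so well with `fillna` is that it returns one value per original row, aligned on the original index. The sketch below, using a small made-up frame, contrasts it with the one-row-per-group result of a plain aggregation, which has to be mapped back by hand.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame(
    {
        'region': ['East', 'East', 'West', 'West'],
        'revenue': [100.0, np.nan, 80.0, 60.0],
    }
)

# Plain aggregation: one median per region (indexed by region name).
per_group = orders.groupby('region')['revenue'].median()

# transform: the same medians, broadcast back to one value per row.
aligned = orders.groupby('region')['revenue'].transform('median')

# Manual equivalent of transform: look up each row's region in the aggregate.
mapped = orders['region'].map(per_group)

# Because `aligned` shares the original index, fillna lines up row by row.
filled = orders['revenue'].fillna(aligned)
print(filled.tolist())
```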


## Problem 6 — Time-series imputation with `interpolate` + edge handling

You have daily sensor readings with missing values.

**Tasks**
1. Create a date-indexed `Series` with missing values.
2. Interpolate linearly.
3. Handle edges so that the first/last missing values get filled (use `ffill`/`bfill`).
4. Ensure the result has no missing values.

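Note: with the default `limit_direction='forward'`, `interpolate` leaves leading missing values untouched and fills trailing ones by repeating the last valid value, so the edges still need explicit handling.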

In [7]:
# Solution
idx = pd.date_range('2025-01-01', periods=8, freq='D')
sensor = pd.Series([np.nan, 10.0, np.nan, 16.0, np.nan, np.nan, 28.0, np.nan], index=idx)
display(sensor)

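# method='time' weights by the actual gaps between timestamps;
# on this evenly spaced daily index it gives the same result as method='linear'.
interp = sensor.interpolate(method='time')
filled = interp.ffill().bfill()  # bfill fills the leading NaN; ffill covers any trailing NaN left over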

display(pd.DataFrame({'raw': sensor, 'interp': interp, 'filled': filled}))
assert filled.isna().sum() == 0

2025-01-01     NaN
2025-01-02    10.0
2025-01-03     NaN
2025-01-04    16.0
2025-01-05     NaN
2025-01-06     NaN
2025-01-07    28.0
2025-01-08     NaN
Freq: D, dtype: float64

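|            | raw  | interp | filled |
|------------|------|--------|--------|
| 2025-01-01 | NaN  | NaN    | 10.0   |
| 2025-01-02 | 10.0 | 10.0   | 10.0   |
| 2025-01-03 | NaN  | 13.0   | 13.0   |
| 2025-01-04 | 16.0 | 16.0   | 16.0   |
| 2025-01-05 | NaN  | 20.0   | 20.0   |
| 2025-01-06 | NaN  | 24.0   | 24.0   |
| 2025-01-07 | 28.0 | 28.0   | 28.0   |
| 2025-01-08 | NaN  | 28.0   | 28.0   |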
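
Edge behaviour can also be controlled from `interpolate` itself through `limit_direction` and `limit_area`. The sketch below uses a shorter, made-up series to compare the default with two alternatives; it is an illustration of those parameters, not a replacement for the solution above.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2025-01-01', periods=5, freq='D')
readings = pd.Series([np.nan, 10.0, np.nan, 14.0, np.nan], index=idx)

# Default: the leading NaN stays, the trailing NaN repeats the last valid value.
default = readings.interpolate(method='linear')

# limit_direction='both': the leading NaN is also filled, using the first valid value.
both = readings.interpolate(method='linear', limit_direction='both')

# limit_area='inside': only gaps surrounded by valid values are filled.
inside = readings.interpolate(method='linear', limit_area='inside')

print(pd.DataFrame({'default': default, 'both': both, 'inside': inside}))
```

`limit_area='inside'` is useful when extrapolating beyond the observed range would be misleading and the edges should stay explicitly missing.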


## Problem 7 — Missingness diagnostics: per-column %, per-row %, and correlation with an outcome

Missingness itself can be informative.

**Tasks**
1. Create a small DataFrame with an `outcome` column and some missing values.
2. Compute:
   - missing percentage per column
   - missing percentage per row
3. Create missing-indicator columns (e.g., `age_is_missing`).
4. Compare mean `outcome` for rows where `age` is missing vs not missing.

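A gap between the two group means suggests the data are not missing completely at random (MCAR) and may follow a **MAR** or **MNAR** pattern. Treat it as a signal worth investigating, not as proof of causality.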

In [8]:
# Solution
df = pd.DataFrame(
    {
        'age': [25, np.nan, 40, np.nan, 33, 29],
        'income': [50_000, 60_000, np.nan, 55_000, np.nan, 52_000],
        'city': ['A', 'B', None, 'A', 'C', 'B'],
        'outcome': [1.2, 0.7, 1.5, 0.6, 1.1, 1.0],
    }
)
display(df)

# 2) Missing % per column / per row
col_missing_pct = df.isna().mean().sort_values(ascending=False) * 100
row_missing_pct = df.isna().mean(axis=1) * 100
print('Missing % per column:')
display(col_missing_pct)
print('Missing % per row:')
display(row_missing_pct)

# 3) Missing indicators
for c in ['age', 'income', 'city']:
    df[f'{c}_is_missing'] = df[c].isna().astype('boolean')
display(df)

# 4) Outcome comparison
mean_missing_age = df.loc[df['age_is_missing'], 'outcome'].mean()
mean_not_missing_age = df.loc[~df['age_is_missing'], 'outcome'].mean()
print('Mean outcome | age missing:', mean_missing_age)
print('Mean outcome | age present:', mean_not_missing_age)

assert df.filter(like='_is_missing').dtypes.nunique() == 1  # all boolean
assert str(df['age_is_missing'].dtype) == 'boolean'

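|   | age  | income  | city | outcome |
|---|------|---------|------|---------|
| 0 | 25.0 | 50000.0 | A    | 1.2     |
| 1 | NaN  | 60000.0 | B    | 0.7     |
| 2 | 40.0 | NaN     | None | 1.5     |
| 3 | NaN  | 55000.0 | A    | 0.6     |
| 4 | 33.0 | NaN     | C    | 1.1     |
| 5 | 29.0 | 52000.0 | B    | 1.0     |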


Missing % per column:


age        33.333333
income     33.333333
city       16.666667
outcome     0.000000
dtype: float64

Missing % per row:


0     0.0
1    25.0
2    50.0
3    25.0
4    25.0
5     0.0
dtype: float64

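|   | age  | income  | city | outcome | age_is_missing | income_is_missing | city_is_missing |
|---|------|---------|------|---------|----------------|-------------------|-----------------|
| 0 | 25.0 | 50000.0 | A    | 1.2     | False          | False             | False           |
| 1 | NaN  | 60000.0 | B    | 0.7     | True           | False             | False           |
| 2 | 40.0 | NaN     | None | 1.5     | False          | True              | True            |
| 3 | NaN  | 55000.0 | A    | 0.6     | True           | False             | False           |
| 4 | 33.0 | NaN     | C    | 1.1     | False          | True              | False           |
| 5 | 29.0 | 52000.0 | B    | 1.0     | False          | False             | False           |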


Mean outcome | age missing: 0.6499999999999999
Mean outcome | age present: 1.2000000000000002
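
The two-means comparison can also be written as a single grouped summary that reports the group sizes alongside the means, which helps judge how much weight a difference deserves. The frame and the `age_status` label below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {
        'age': [25, np.nan, 40, np.nan],
        'outcome': [1.0, 0.5, 1.4, 0.7],
    }
)

# Label each row by whether `age` is missing, then summarise `outcome` per label.
age_status = pd.Series(
    np.where(people['age'].isna(), 'missing', 'present'),
    index=people.index,
    name='age_status',
)
summary = people.groupby(age_status)['outcome'].agg(['mean', 'count'])
print(summary)
```

With only a handful of rows per group, as here, a gap in means is a prompt for further checks rather than evidence on its own.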


## Problem 8 — Imputation pipeline with safeguards (numeric + categorical)

Build a simple, readable imputation pipeline:
- Numeric: impute with median
- Categorical: impute with mode
- Add missing-indicator flags for all imputed columns

**Tasks**
1. Create the DataFrame below.
2. Implement the pipeline using Pandas only.
3. Ensure dtypes remain sensible (`Float64` for numeric, `string` for text).
4. Verify:
   - no missing remains in imputed columns
   - indicator columns exist and are boolean

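This is a reasonable baseline for many ML workflows. When the data is split into training and test sets, compute the medians and modes on the training set only and reuse them elsewhere, so that no information leaks from the test set.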

In [9]:
# Solution
data = pd.DataFrame(
    {
        'height_cm': [170, 165, np.nan, 180, np.nan],
        'weight_kg': [70, np.nan, 80, 90, np.nan],
        'gender': ['F', 'M', None, 'M', 'F'],
        'segment': ['A', None, 'B', 'A', None],
    }
)

# Use nullable dtypes intentionally
data['height_cm'] = data['height_cm'].astype('Float64')
data['weight_kg'] = data['weight_kg'].astype('Float64')
data['gender'] = data['gender'].astype('string')
data['segment'] = data['segment'].astype('string')
display(data)

num_cols = ['height_cm', 'weight_kg']
cat_cols = ['gender', 'segment']

# Missing indicators
for c in num_cols + cat_cols:
    data[f'{c}_was_missing'] = data[c].isna().astype('boolean')

# Numeric median impute
for c in num_cols:
    med = data[c].median(skipna=True)
    data[c] = data[c].fillna(med)

# Categorical mode impute (ties: take the first mode in sorted order; if all missing, use a safe label)
for c in cat_cols:
    modes = data[c].mode(dropna=True)
    fill = modes.iloc[0] if len(modes) else 'UNKNOWN'
    data[c] = data[c].fillna(fill)

display(data)

# Validations
assert data[num_cols + cat_cols].isna().sum().sum() == 0
assert all(str(data[c].dtype) == 'Float64' for c in num_cols)
assert all(str(data[c].dtype) == 'string' for c in cat_cols)
assert all(str(data[c].dtype) == 'boolean' for c in data.columns if c.endswith('_was_missing'))

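|   | height_cm | weight_kg | gender | segment |
|---|-----------|-----------|--------|---------|
| 0 | 170.0     | 70.0      | F      | A       |
| 1 | 165.0     | <NA>      | M      | <NA>    |
| 2 | <NA>      | 80.0      | <NA>   | B       |
| 3 | 180.0     | 90.0      | M      | A       |
| 4 | <NA>      | <NA>      | F      | <NA>    |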


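|   | height_cm | weight_kg | gender | segment | height_cm_was_missing | weight_kg_was_missing | gender_was_missing | segment_was_missing |
|---|-----------|-----------|--------|---------|-----------------------|-----------------------|--------------------|---------------------|
| 0 | 170.0     | 70.0      | F      | A       | False                 | False                 | False              | False               |
| 1 | 165.0     | 80.0      | M      | A       | False                 | True                  | False              | True                |
| 2 | 170.0     | 80.0      | F      | B       | True                  | False                 | True               | False               |
| 3 | 180.0     | 90.0      | M      | A       | False                 | False                 | False              | False               |
| 4 | 170.0     | 80.0      | F      | A       | True                  | True                  | False              | True                |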
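
If the pipeline needs to be reused on new data, one possible arrangement is to separate "learning" the fill values from "applying" them. The sketch below is an assumption-laden illustration: the helper names `fit_fill_values` and `apply_fill_values`, the `group` column, and both sample frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train, num_cols, cat_cols):
    """Learn one fill value per column from the training frame."""
    fills = {c: train[c].median() for c in num_cols}
    for c in cat_cols:
        modes = train[c].mode(dropna=True)
        fills[c] = modes.iloc[0] if len(modes) else 'UNKNOWN'
    return fills

def apply_fill_values(frame, fills):
    """Add *_was_missing flags, then fill using previously learned values."""
    out = frame.copy()
    for col, value in fills.items():
        out[f'{col}_was_missing'] = out[col].isna()
        out[col] = out[col].fillna(value)
    return out

train = pd.DataFrame({'height_cm': [170.0, 160.0, np.nan], 'group': ['A', 'A', None]})
new = pd.DataFrame({'height_cm': [np.nan, 180.0], 'group': [None, 'B']})
train['group'] = train['group'].astype('string')
new['group'] = new['group'].astype('string')

fills = fit_fill_values(train, num_cols=['height_cm'], cat_cols=['group'])
new_filled = apply_fill_values(new, fills)
print(fills)
print(new_filled)
```

The new frame is filled with the training median and mode, not with statistics computed from the new rows themselves.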
