# Pandas Loading Data — Advanced Practice (with Solutions)

This notebook contains six **intermediate-to-advanced** problems on loading data with pandas, each followed by a worked solution.

**Best practices used here:**
- No external files required (we use in-memory CSV/Excel; the Excel cells need the `openpyxl` package).
- Explicit numeric coercion, date parsing, and missing-value handling.
- Validation checks (`assert`) to catch silent issues early.
- Clear separation of problems and solutions.


In [1]:
import numpy as np
import pandas as pd
from io import StringIO, BytesIO

pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 50)


## Synthetic Data (in-memory)

We'll create a messy CSV and a multi-sheet Excel workbook in memory so you can run everything anywhere.

In [2]:
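# The header names contain unquoted commas, so the header row splits into 8 fields
# while each data row has 5. The solutions avoid relying on it by passing names= and usecols=.
csv_text = """Geographic Area,July 1, 2001 Estimate,July 1, 2000 Estimate,April 1, 2000 Population Estimates Base,notes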
North,  1200 , 1150 , 1148 , ok
South,  980  ,  990 ,  988 , revised
East,   N/A  , 1050 , 1049 , missing 2001
West,   1500 ,  0   , 1490 , zero baseline
Central, 1100, 1090 , 1088 , ok
"""

# Multi-sheet Excel (as bytes)
excel_bytes = BytesIO()
with pd.ExcelWriter(excel_bytes, engine="openpyxl") as writer:
    pd.DataFrame({
        "region": ["North", "South", "East", "West", "Central"],
        "2001": [1200, 980, np.nan, 1500, 1100],
        "2000": [1150, 990, 1050, 0, 1090],
        "status": ["ok", "revised", "ok", "ok", "ok"]
    }).to_excel(writer, sheet_name="data", index=False)

    pd.DataFrame({
        "region": ["North", "South", "East", "West", "Central"],
        "manager": ["Ava", "Noah", "Mia", "Liam", "Ema"],
        "opened": ["2000-01-15", "1998-07-01", "2001-03-20", "1999-11-05", "2000-06-30"],
    }).to_excel(writer, sheet_name="metadata", index=False)

excel_bytes.seek(0)

print("CSV and Excel data prepared in memory.")


CSV and Excel data prepared in memory.
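
To see why this CSV counts as messy, the sketch below splits the header line and one data line with the standard-library `csv` module. The two strings are copied from `csv_text`; the variable names are illustrative only.

```python
import csv
from io import StringIO

# Copied from csv_text above: the header and the first data row
header_line = ("Geographic Area,July 1, 2001 Estimate,July 1, 2000 Estimate,"
               "April 1, 2000 Population Estimates Base,notes")
data_line = "North,  1200 , 1150 , 1148 , ok"

# csv.reader splits on every comma because none of the fields are quoted
header_fields = next(csv.reader(StringIO(header_line)))
data_fields = next(csv.reader(StringIO(data_line)))

print(len(header_fields), "header fields vs", len(data_fields), "data fields")
```

Because the header cannot be trusted, the solutions below select columns by position and supply their own names.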


## Problem 1 — Load, rename, select columns, set index (CSV)

**Task**
1. Load the CSV from `csv_text` into a DataFrame.
2. Keep only the region, 2001 estimate, and 2000 estimate columns.
3. Rename columns to: `region`, `2001`, `2000`.
4. Set `region` as the index.
5. Ensure `2001` and `2000` are numeric and missing values are properly set to `NaN`.

**Expected outcome**
- Index is `region`.
- `2001` has a missing value for `East`.


In [3]:
# SOLUTION 1
df1 = pd.read_csv(
    StringIO(csv_text),
    header=0,
    usecols=[0, 1, 2],
    names=["region", "2001", "2000"],
    index_col=0,
    skipinitialspace=True,
    na_values=["N/A", "NA", ""],
)

# Convert to numeric robustly (handles any stray whitespace/strings)
df1["2001"] = pd.to_numeric(df1["2001"], errors="coerce")
df1["2000"] = pd.to_numeric(df1["2000"], errors="coerce")

display(df1)

assert df1.index.name == "region"
assert set(df1.columns) == {"2001", "2000"}
assert np.isnan(df1.loc["East", "2001"])
assert df1[["2001", "2000"]].dtypes.apply(lambda x: np.issubdtype(x, np.number)).all()


           2001  2000
region
North    1200.0  1150
South     980.0   990
East        NaN  1050
West     1500.0     0
Central  1100.0  1090
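
An optional variation, not part of the solution above: the `2001` column shows `1200.0` because a column with a missing value is stored as `float64`. Assuming the values are whole-number counts, the nullable `Int64` dtype keeps them as integers and marks the gap as `<NA>`. The `sample` string is a made-up miniature of the data.

```python
import pandas as pd
from io import StringIO

# Hypothetical miniature of the population table
sample = "region,2001,2000\nNorth,1200,1150\nEast,N/A,1050\n"

df = pd.read_csv(StringIO(sample), index_col="region", na_values=["N/A"])

# float64 with NaN -> nullable integer with <NA>
df_int = df.astype("Int64")
print(df_int)
print(df_int.dtypes)
```

The trade-off is that checks written for NumPy dtypes, such as `np.isnan` or `np.issubdtype`, may not accept the nullable dtype, so the asserts in this notebook would need adjusting.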


## Problem 2 — Compute growth % safely (division by zero)

**Task**
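Using `df1`, compute the percent growth from 2000 to 2001, where \(x_{2001}\) and \(x_{2000}\) are a region's values in the `2001` and `2000` columns:

\[ \mathrm{growth\_pct} = 100 \times \frac{x_{2001} - x_{2000}}{x_{2000}} \]

Handle these edge cases:
- If `2000` is 0, return `NaN` (plain division would give `inf`).
- If `2001` is missing, return `NaN`.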

**Expected outcome**
- `West` should be `NaN` because 2000 is 0.
- `East` should be `NaN` because 2001 is missing.


In [4]:
# SOLUTION 2
den = df1["2000"].where(df1["2000"] != 0)  # turn 0 into NaN
growth_pct = 100 * (df1["2001"] - df1["2000"]) / den
growth_pct = growth_pct.rename("growth_pct")

display(growth_pct)

assert np.isnan(growth_pct.loc["West"])  # division by zero avoided
assert np.isnan(growth_pct.loc["East"])  # missing 2001


region
North      4.347826
South     -1.010101
East            NaN
West            NaN
Central    0.917431
Name: growth_pct, dtype: float64
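
An alternative sketch of the same idea: divide first, then replace the infinities. The helper name `safe_growth_pct` and the small Series are illustrative, not part of the notebook's solution.

```python
import numpy as np
import pandas as pd

def safe_growth_pct(new: pd.Series, old: pd.Series) -> pd.Series:
    """Percent growth from old to new, with +/-inf turned into NaN."""
    raw = 100 * (new - old) / old  # a nonzero value divided by 0 gives inf in pandas
    return raw.replace([np.inf, -np.inf], np.nan).rename("growth_pct")

regions = ["North", "East", "West"]
new = pd.Series([1200.0, np.nan, 1500.0], index=regions)
old = pd.Series([1150, 1050, 0], index=regions)

result = safe_growth_pct(new, old)
print(result)
```

Masking the denominator with `where`, as in the solution, has the advantage that an infinity never appears in an intermediate result.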

## Problem 3 — Validate schema and values (defensive loading)

**Task**
Write a function `load_population_csv(text: str) -> pd.DataFrame` that:
- Loads the CSV from a string.
- Produces the same structure as `df1`.
- Validates:
  - Index has **no duplicates**.
  - Columns are exactly `2001` and `2000`.
  - Values are non-negative (ignore NaNs).

**Return** the cleaned DataFrame.


In [5]:
# SOLUTION 3
def load_population_csv(text: str) -> pd.DataFrame:
    df = pd.read_csv(
        StringIO(text),
        header=0,
        usecols=[0, 1, 2],
        names=["region", "2001", "2000"],
        index_col=0,
        skipinitialspace=True,
        na_values=["N/A", "NA", ""],
    )

    for c in ["2001", "2000"]:
        df[c] = pd.to_numeric(df[c], errors="coerce")

    # Validations
    if df.index.has_duplicates:
        dupes = df.index[df.index.duplicated()].unique().tolist()
        raise ValueError(f"Duplicate regions found: {dupes}")

    if list(df.columns) != ["2001", "2000"]:
        raise ValueError(f"Unexpected columns: {df.columns.tolist()}")

    # Non-negative check; NaN < 0 evaluates to False, so NaNs are ignored
    # without dropping rows (dropna() would skip any row containing a NaN)
    numeric = df[["2001", "2000"]]
    bad_rows = numeric.lt(0).any(axis=1)
    if bad_rows.any():
        raise ValueError(f"Negative values found in rows: {numeric.index[bad_rows].tolist()}")

    return df

df_loaded = load_population_csv(csv_text)
display(df_loaded)
assert df_loaded.equals(df1)


           2001  2000
region
North    1200.0  1150
South     980.0   990
East        NaN  1050
West     1500.0     0
Central  1100.0  1090
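
The clean data never triggers the validation branches, so here is a minimal sketch that exercises them on deliberately bad input. `check_population_frame` is a hypothetical stand-alone helper that mirrors the duplicate and non-negative checks; the two frames are invented for the demonstration.

```python
import pandas as pd

def check_population_frame(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError on duplicate index labels or negative values."""
    if df.index.has_duplicates:
        dupes = df.index[df.index.duplicated()].unique().tolist()
        raise ValueError(f"Duplicate regions found: {dupes}")
    negative = df.lt(0).any(axis=1)  # NaN < 0 is False, so NaNs are ignored
    if negative.any():
        raise ValueError(f"Negative values found in rows: {df.index[negative].tolist()}")

# One frame with a repeated region, one with a negative value next to a NaN row
dup = pd.DataFrame({"2001": [1, 2], "2000": [1, 2]},
                   index=pd.Index(["North", "North"], name="region"))
neg = pd.DataFrame({"2001": [1.0, None], "2000": [-5, 3]},
                   index=pd.Index(["North", "East"], name="region"))

messages = []
for frame in (dup, neg):
    try:
        check_population_frame(frame)
    except ValueError as exc:
        messages.append(str(exc))

print(messages)
```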


## Problem 4 — Chunked reading (simulate large CSV)

**Task**
Pretend the CSV is too large to load at once.

1. Read it in **chunks**.
2. For each chunk, compute `growth_pct` safely (as in Problem 2).
3. Combine the results into a single Series, indexed by region.

**Hint**: `pd.read_csv(..., chunksize=...)` returns an iterator.


In [6]:
# SOLUTION 4
chunks = pd.read_csv(
    StringIO(csv_text),
    header=0,
    usecols=[0, 1, 2],
    names=["region", "2001", "2000"],
    index_col=0,
    skipinitialspace=True,
    na_values=["N/A", "NA", ""],
    chunksize=2,
)

parts = []
for chunk in chunks:
    chunk["2001"] = pd.to_numeric(chunk["2001"], errors="coerce")
    chunk["2000"] = pd.to_numeric(chunk["2000"], errors="coerce")
    den = chunk["2000"].where(chunk["2000"] != 0)
    growth = 100 * (chunk["2001"] - chunk["2000"]) / den
    parts.append(growth)

growth_pct_chunked = pd.concat(parts).rename("growth_pct")
display(growth_pct_chunked)

# Check it matches the non-chunked result
assert growth_pct_chunked.sort_index().equals(growth_pct.sort_index())


region
North      4.347826
South     -1.010101
East            NaN
West            NaN
Central    0.917431
Name: growth_pct, dtype: float64
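
Chunked reading is most useful when only an aggregate is kept, so that no more than one chunk sits in memory at a time. The sketch below keeps a running total; it assumes pandas 1.2 or later, where the reader returned by `chunksize` can be used as a context manager. The `sample` string is an invented miniature of the data.

```python
import pandas as pd
from io import StringIO

# Hypothetical miniature of the 2000 column
sample = "region,2000\nNorth,1150\nSouth,990\nEast,1050\nWest,0\nCentral,1090\n"

total = 0
n_chunks = 0
# The with-block closes the underlying reader when iteration finishes
with pd.read_csv(StringIO(sample), chunksize=2) as reader:
    for chunk in reader:
        total += int(chunk["2000"].sum())
        n_chunks += 1

print(f"{n_chunks} chunks, total = {total}")
```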

## Problem 5 — Load Excel sheet, join with metadata, parse dates

**Task**
From the in-memory Excel workbook (`excel_bytes`):
1. Load sheet `data` with `region` as the index.
2. Load sheet `metadata` and parse `opened` as dates.
3. Join them into one DataFrame on `region`.
4. Compute a new column `age_days` = (today - opened).days.

**Expected outcome**
- `opened` is datetime64.
- Joined DataFrame has columns: `2001`, `2000`, `status`, `manager`, `opened`, `age_days`.


In [7]:
# SOLUTION 5
excel_bytes.seek(0)
data_df = pd.read_excel(excel_bytes, sheet_name="data", index_col="region")

excel_bytes.seek(0)
meta_df = pd.read_excel(excel_bytes, sheet_name="metadata")
meta_df["opened"] = pd.to_datetime(meta_df["opened"], errors="coerce")
meta_df = meta_df.set_index("region")

joined = data_df.join(meta_df, how="left")

today = pd.Timestamp.today().normalize()
joined["age_days"] = (today - joined["opened"]).dt.days

display(joined)

assert np.issubdtype(joined["opened"].dtype, np.datetime64)
expected_cols = {"2001", "2000", "status", "manager", "opened", "age_days"}
assert set(joined.columns) == expected_cols


           2001  2000   status  manager      opened  age_days
region
North    1200.0  1150       ok      Ava  2000-01-15      9481
South     980.0   990  revised     Noah  1998-07-01     10044
East        NaN  1050       ok      Mia  2001-03-20      9051
West     1500.0     0       ok     Liam  1999-11-05      9552
Central  1100.0  1090       ok      Ema  2000-06-30      9314
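
The `age_days` values above depend on the day the notebook was run. For a reproducible check, one option is to compute against a fixed reference date, as in this sketch. The name `as_of` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# One valid date string and one that cannot be parsed
opened = pd.to_datetime(
    pd.Series(["2000-01-15", "not a date"], index=["North", "South"]),
    errors="coerce",  # unparseable strings become NaT instead of raising
)

as_of = pd.Timestamp("2001-01-15")  # fixed reference date instead of today
age_days = (as_of - opened).dt.days

print(age_days)
```

A row whose date failed to parse ends up with a missing `age_days` rather than an error, which is worth asserting on if every region is expected to have an opening date.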


## Problem 6 — Enforce dtypes and handle missingness consistently (Excel)

**Task**
Create a function `load_population_excel(b: BytesIO) -> pd.DataFrame` that:
- Loads sheet `data`.
- Ensures `2000` and `2001` are numeric.
- Treats missing values in `2001` as NaN.
- Sets index to `region`.
- Returns only columns `2001`, `2000`.

**Bonus**: fail fast if any `2000` value is negative.


In [8]:
# SOLUTION 6
def load_population_excel(b: BytesIO) -> pd.DataFrame:
    b.seek(0)
    df = pd.read_excel(b, sheet_name="data")

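    # Normalize column names defensively: numeric header cells can come back as numbers
    df.columns = df.columns.astype(str).str.strip()

    # Check that the required columns are present
    expected = {"region", "2001", "2000"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")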

    df = df.set_index("region")
    df = df[["2001", "2000"]].copy()

    df["2001"] = pd.to_numeric(df["2001"], errors="coerce")
    df["2000"] = pd.to_numeric(df["2000"], errors="coerce")

    # Fail fast if negative baseline exists
    if (df["2000"].dropna() < 0).any():
        bad = df.index[df["2000"] < 0].tolist()
        raise ValueError(f"Negative 2000 values for regions: {bad}")

    return df

df_excel = load_population_excel(excel_bytes)
display(df_excel)

assert df_excel.index.name == "region"
assert set(df_excel.columns) == {"2001", "2000"}
assert df_excel[["2001", "2000"]].dtypes.apply(lambda x: np.issubdtype(x, np.number)).all()


           2001  2000
region
North    1200.0  1150
South     980.0   990
East        NaN  1050
West     1500.0     0
Central  1100.0  1090
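
The dtype asserts in this notebook use `np.issubdtype`, which is written for NumPy dtypes. If a loader ever returns pandas extension dtypes such as `Int64`, `pandas.api.types.is_numeric_dtype` is a check that handles both kinds. A small sketch with an invented frame:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame mixing a nullable integer, a NumPy integer, and a text column
df = pd.DataFrame({
    "2001": pd.array([1200, None], dtype="Int64"),
    "2000": [1150, 1050],
    "status": ["ok", "ok"],
})

numeric_flags = {col: is_numeric_dtype(df[col]) for col in df.columns}
print(numeric_flags)
```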


## Quick Recap

You practiced:
- `read_csv` best practices: `usecols`, `names`, `index_col`, `na_values`, numeric coercion.
- Safe arithmetic with missing data and division by zero.
- Defensive loader functions with schema + value validation.
- Chunked reading with `chunksize`.
- `read_excel` with multi-sheet workflows and joining metadata.
- Enforcing dtypes and fail-fast checks.
