# DataFrames (Advanced Practice)

This notebook contains **advanced (but not too advanced)** pandas DataFrame problems **with solutions**.

## Best practices used here
- Prefer **explicit column names** and **explicit index**.
- Avoid `inplace=True` (return new objects instead).
- Use `loc` for label-based indexing and `iloc` for position-based indexing.
- Use `.assign(...)` for readable pipelines.
- Validate assumptions with `assert`.
- Convert dtypes explicitly (e.g., `pd.to_numeric`, `.astype`).
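
As a quick illustration of the indexing and `.assign` points above, here is a minimal sketch on a toy frame. The names `toy` and `doubled` and the data are made up for this example and are not part of the shared dataset.

```python
import pandas as pd

# Toy frame with an explicit, named index (hypothetical data)
toy = pd.DataFrame(
    {"x": [10, 20, 30]},
    index=pd.Index(["a", "b", "c"], name="key"),
)

# Label-based vs. position-based access to the same cell
by_label = toy.loc["b", "x"]
by_position = toy.iloc[1, 0]

# .assign returns a new DataFrame and leaves `toy` untouched
doubled = toy.assign(x2=lambda d: d["x"] * 2)

assert by_label == by_position == 20
assert "x2" not in toy.columns
```

Because `.assign` returns a new object, the same expression can be re-run without accumulating side effects on `toy`.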

> Tip: Try each problem cell first. Then open the corresponding solution cell.

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

## Shared dataset: NYC borough stats

We'll reuse this dataset in multiple problems.

In [2]:
boroughs = pd.Index(["The Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"], name="borough")

counties_s = pd.Series(["Bronx", "Kings", "New York", "Queens", "Richmond"], index=boroughs, name="county")
population_s = pd.Series([1_418_207, 2_559_903, 1_628_706, 2_253_858, 476_143], index=boroughs, name="population")
gdp_s = pd.Series([42.695, 91.559, 600.244, 93.310, 14.514], index=boroughs, name="gdp")
area_s = pd.Series([42.10, 70.82, 22.83, 108.53, 58.37], index=boroughs, name="area")

nyc = pd.DataFrame({
    "county": counties_s,
    "population": population_s,
    "gdp": gdp_s,
    "area": area_s,
})

nyc

borough,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


# Problem 1 — Fix dtypes after a "bad" construction

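A common mistake is to build a DataFrame from a list of lists in which each inner list holds one field (all names, then all counties, then all populations, and so on) and then transpose it. Every column of the un-transposed frame mixes strings and numbers, so pandas stores everything as `object`, and transposing does not restore the numeric dtypes.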
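
The effect can be reproduced on a much smaller input. The sketch below uses a hypothetical two-item layout (`rows`, `bad_demo`, and `transposed` are illustrative names) and only shows the symptom plus one explicit conversion; it is not the full solution.

```python
import pandas as pd

# Hypothetical two-item version of the "bad" layout: one list per field
rows = [
    ["a", "b"],      # labels
    [1, 2],          # integers
    [1.5, 2.5],      # floats
]

bad_demo = pd.DataFrame(rows)
transposed = bad_demo.T

# Each original column mixed str/int/float, so everything is object,
# and transposing does not change that
print(transposed.dtypes)

# The values stay generic Python objects until converted explicitly
converted = pd.to_numeric(transposed[1])
print(converted.dtype)
```

Arithmetic on `object` columns is slower and can fail in surprising ways, which is why the conversion step is worth doing explicitly.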

### Task
1. Create the "bad" DataFrame from the `data` below.
2. Transpose it.
3. Set the index to the borough names.
4. Rename the columns to: `county`, `population`, `gdp`, `area`.
5. Convert dtypes so that:
   - `population` is integer
   - `gdp` and `area` are float
   - `county` is string

Return the cleaned DataFrame in a variable named `clean_1`.

In [3]:
# STARTER
data = [
    ["The Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
    ["Bronx", "Kings", "New York", "Queens", "Richmond"],
    [1_418_207, 2_559_903, 1_628_706, 2_253_858, 476_143],
    [42.695, 91.559, 600.244, 93.310, 14.514],
    [42.10, 70.82, 22.83, 108.53, 58.37],
]

# TODO: build clean_1
clean_1 = None

clean_1

In [4]:
# SOLUTION

bad = pd.DataFrame(data)

# Transpose and name columns by position
t = bad.T
t = t.rename(columns={0: "borough", 1: "county", 2: "population", 3: "gdp", 4: "area"})

# Set borough as index (set_index keeps "borough" as the index name)
t = t.set_index("borough")

# Convert dtypes explicitly
clean_1 = (
    t.assign(
        county=t["county"].astype("string"),
        population=pd.to_numeric(t["population"], errors="raise").astype("int64"),
        gdp=pd.to_numeric(t["gdp"], errors="raise").astype("float64"),
        area=pd.to_numeric(t["area"], errors="raise").astype("float64"),
    )
)

# Validation
assert list(clean_1.columns) == ["county", "population", "gdp", "area"]
assert clean_1.index.name == "borough"
assert pd.api.types.is_string_dtype(clean_1["county"])
assert clean_1["population"].dtype == "int64"
assert clean_1["gdp"].dtype == "float64"
assert clean_1["area"].dtype == "float64"

clean_1

borough,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


# Problem 2 — Alignment rules when building from Series with different indexes

pandas aligns on index labels when it combines objects. If a Series lacks a label that the other object has, the result holds `NaN` at that position.
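
A minimal sketch of that alignment on toy data (the Series `full` and `partial` are hypothetical):

```python
import pandas as pd

# Hypothetical Series: `partial` lacks the label "c"
full = pd.Series([1, 2, 3], index=["a", "b", "c"], name="full")
partial = pd.Series([10, 20], index=["a", "b"], name="partial")

aligned = pd.DataFrame({"full": full, "partial": partial})

# The missing label becomes NaN, which also forces that column to float64
print(aligned)
print(aligned.dtypes)
```

The upcast to `float64` is why integer inputs show up with a trailing `.0` once a missing value has been introduced, even after the gap is filled.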

### Task
1. Create a Series `median_rent` with **only** these boroughs:
   - `Brooklyn`: 3200
   - `Manhattan`: 4200
   - `Queens`: 2800
2. Build a DataFrame `rent_df` from `nyc[['population','gdp']]` plus this new series.
3. Fill missing rent with the **overall median** of available rents.

Return the result as `rent_df`.

In [5]:
# STARTER
# TODO: create median_rent, then rent_df
median_rent = None
rent_df = None

rent_df

In [6]:
# SOLUTION
median_rent = pd.Series(
    {"Brooklyn": 3200, "Manhattan": 4200, "Queens": 2800},
    name="median_rent",
).rename_axis("borough")

rent_df = nyc[["population", "gdp"]].join(median_rent)

rent_fill = rent_df["median_rent"].median()  # median of available values
rent_df["median_rent"] = rent_df["median_rent"].fillna(rent_fill)

# Validation
assert rent_df.isna().sum().sum() == 0
assert rent_df.index.equals(nyc.index)

rent_df

borough,population,gdp,median_rent
The Bronx,1418207,42.695,3200.0
Brooklyn,2559903,91.559,3200.0
Manhattan,1628706,600.244,4200.0
Queens,2253858,93.31,2800.0
Staten Island,476143,14.514,3200.0


# Problem 3 — Create derived columns safely (no chained assignment)

### Task
Using the `nyc` DataFrame:
1. Create a new DataFrame `nyc_features` (do not modify `nyc`) with two new columns:
   - `density` = population / area
   - `gdp_per_capita` = (gdp * 1e9) / population  (treat `gdp` as billions)
2. Sort by `gdp_per_capita` descending.
3. Return only the columns: `county`, `density`, `gdp_per_capita`.

Store the final result in `nyc_features`.
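
For context, "chained assignment" means writing through two indexing steps, such as `df[mask]["col"] = value`. The first step may return a copy, so the write can be lost or trigger a warning. A minimal sketch of the safer pattern on toy data (all names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"value": [1.0, 2.0, 3.0], "weight": [2.0, 4.0, 6.0]})

# Risky pattern (deliberately not executed here):
#   toy[toy["value"] > 1]["ratio"] = 0

# Safer: derive columns in one expression that returns a new frame
with_ratio = toy.assign(ratio=lambda d: d["value"] / d["weight"])

assert "ratio" not in toy.columns        # the original is unchanged
assert with_ratio["ratio"].eq(0.5).all()
```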

In [7]:
# STARTER
# TODO: create nyc_features
nyc_features = None

nyc_features

In [8]:
# SOLUTION
nyc_features = (
    nyc.assign(
        density=lambda d: d["population"] / d["area"],
        gdp_per_capita=lambda d: (d["gdp"] * 1e9) / d["population"],
    )
    .sort_values("gdp_per_capita", ascending=False)
    .loc[:, ["county", "density", "gdp_per_capita"]]
)

# Validation
assert "density" in nyc_features.columns and "gdp_per_capita" in nyc_features.columns
assert nyc_features["density"].gt(0).all()
assert nyc_features["gdp_per_capita"].gt(0).all()

nyc_features

borough,county,density,gdp_per_capita
Manhattan,New York,71340.604468,368540.424116
Queens,Queens,20767.142726,41400.123699
Brooklyn,Kings,36146.611127,35766.589593
Staten Island,Richmond,8157.323968,30482.439099
The Bronx,Bronx,33686.627078,30104.914163


# Problem 4 — Advanced renaming: dict + function

You can rename columns using a mapping **or** a function.
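
The two styles behave slightly differently: a mapping touches only the labels it lists, while a function is applied to every label. A small sketch with made-up column names:

```python
import pandas as pd

toy = pd.DataFrame({"first name": ["Ada"], "age": [36]})

# Mapping: only the listed labels change; unlisted ones are kept as-is
by_mapping = toy.rename(columns={"first name": "first_name"})

# Function: called once per column label
by_function = toy.rename(columns=lambda name: name.replace(" ", "_").title())

print(list(by_mapping.columns))
print(list(by_function.columns))
```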

### Task
Create `renamed` from `nyc` with:
1. Column rename mapping: `population -> pop`, `county -> county_name`
2. Then, apply a function to column names to make them **UPPERCASE**.
3. Keep index unchanged.

Store the result in `renamed`.

In [9]:
# STARTER
# TODO: create renamed
renamed = None

renamed

In [10]:
# SOLUTION
renamed = (
    nyc.rename(columns={"population": "pop", "county": "county_name"})
    .rename(columns=str.upper)
)

assert "POP" in renamed.columns
assert "COUNTY_NAME" in renamed.columns

renamed

borough,COUNTY_NAME,POP,GDP,AREA
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


# Problem 5 — Drop rows/columns safely and predictably

### Task
Create `subset` from `nyc` with:
1. Drop the column `area`.
2. Drop the rows for `Queens` and `Staten Island`.
3. Do it in a way that **won't crash** if a label is missing (hint: `errors=`).

Store the result in `subset`.
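
To see what the `errors=` hint changes, here is a sketch on a toy frame (the label `"missing"` is intentionally absent):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# Default behaviour (errors="raise"): a missing label raises KeyError
try:
    toy.drop(columns=["missing"])
except KeyError as exc:
    print("KeyError:", exc)

# errors="ignore" drops the labels that exist and skips the rest
trimmed = toy.drop(columns=["b", "missing"], errors="ignore")
print(list(trimmed.columns))
```

One trade-off to keep in mind: `errors="ignore"` also hides typos in label names, so it fits best when a label is genuinely optional.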

In [11]:
# STARTER
# TODO: create subset
subset = None

subset

In [12]:
# SOLUTION
subset = (
    nyc.drop(columns=["area"], errors="ignore")
       .drop(index=["Queens", "Staten Island"], errors="ignore")
)

assert "area" not in subset.columns
assert "Queens" not in subset.index
assert "Staten Island" not in subset.index

subset

borough,county,population,gdp
The Bronx,Bronx,1418207,42.695
Brooklyn,Kings,2559903,91.559
Manhattan,New York,1628706,600.244


# Problem 6 — Create and use a MultiIndex (advanced indexing)

MultiIndex is helpful when you have hierarchical keys.
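
A minimal sketch of the selection patterns a MultiIndex enables, using a hypothetical frame keyed by `group` and `item`:

```python
import pandas as pd

toy = (
    pd.DataFrame(
        {
            "group": ["g1", "g1", "g2"],
            "item": ["x", "y", "z"],
            "count": [1, 2, 3],
            "score": [0.5, 1.5, 2.5],
        }
    )
    .set_index(["group", "item"])
    .sort_index()
)

# A single outer-level key returns a DataFrame indexed by the remaining level
g1_rows = toy.loc["g1"]

# A full key selects one row; it comes back as a Series over the columns
g1_y = toy.loc[("g1", "y"), :]

# Cross-section by an inner-level value
z_rows = toy.xs("z", level="item")

print(g1_y.dtype)  # int and float columns are upcast to one common dtype
```

Selecting a single row mixes the column dtypes into one Series, so an integer column appears as a float whenever it sits next to float columns.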

### Task
1. Create a new DataFrame `nyc_mi` from `nyc` where the index becomes a **MultiIndex**:
   - Level 0 (outer): `county`
   - Level 1 (inner): `borough` (the current index)
2. After creating it, select:
   - All rows for county `Queens` (should return a DataFrame)
   - The single row for (`Kings`, `Brooklyn`) (should return a Series)

Store:
- the MultiIndex DataFrame in `nyc_mi`
- the county slice in `queens_rows`
- the single row in `kings_brooklyn`

In [13]:
# STARTER
# TODO: create nyc_mi, queens_rows, kings_brooklyn
nyc_mi = None
queens_rows = None
kings_brooklyn = None

nyc_mi

In [14]:
# SOLUTION
nyc_mi = (
    nyc.reset_index()  # borough becomes a column
       .set_index(["county", "borough"])  # MultiIndex
       .sort_index()
)

queens_rows = nyc_mi.loc["Queens"]
kings_brooklyn = nyc_mi.loc[("Kings", "Brooklyn")]

# Validation
assert isinstance(queens_rows, pd.DataFrame)
assert isinstance(kings_brooklyn, pd.Series)
assert ("Kings", "Brooklyn") in nyc_mi.index

nyc_mi, queens_rows, kings_brooklyn

(                        population      gdp    area
 county   borough                                   
 Bronx    The Bronx         1418207   42.695   42.10
 Kings    Brooklyn          2559903   91.559   70.82
 New York Manhattan         1628706  600.244   22.83
 Queens   Queens            2253858   93.310  108.53
 Richmond Staten Island      476143   14.514   58.37,
          population    gdp    area
 borough                           
 Queens      2253858  93.31  108.53,
 population    2559903.000
 gdp                91.559
 area               70.820
 Name: (Kings, Brooklyn), dtype: float64)

# Problem 7 — Compare two construction styles and explain the difference

You're given two ways to construct DataFrames:
- **From dict of Series** (aligns by index)
- **From list of dicts** (row-wise records)
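
The practical difference is where the index comes from. The sketch below uses toy data with an assumed key column named `key`:

```python
import pandas as pd

idx = pd.Index(["a", "b"], name="key")
col_x = pd.Series([1, 2], index=idx)
col_y = pd.Series([0.1, 0.2], index=idx)

# Dict of Series: columns are aligned on the shared index
from_series = pd.DataFrame({"x": col_x, "y": col_y})

# List of dicts: one dict per row, default RangeIndex unless you set one
from_records = pd.DataFrame(
    [{"key": "a", "x": 1, "y": 0.1}, {"key": "b", "x": 2, "y": 0.2}]
)

print(from_series.index.name)
print(type(from_records.index).__name__)

# After moving "key" into the index, the two frames match
assert from_series.equals(from_records.set_index("key"))
```

`DataFrame.equals` also compares dtypes, so two frames with the same values but different column dtypes are reported as unequal.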

### Task
1. Construct `df_series` from a dict of Series (`counties_s`, `population_s`, `gdp_s`, `area_s`).
2. Construct `df_records` from a list of dict records (each record is a row), using the same values.
3. Make them equal by ensuring:
   - same index (borough)
   - same column order
4. Verify equality with `assert df_series.equals(df_records)`.

Store them in `df_series` and `df_records`.

In [15]:
# STARTER
# TODO: build df_series and df_records and make them equal
df_series = None
df_records = None

df_series, df_records

(None, None)

In [16]:
# SOLUTION
df_series = pd.DataFrame({
    "county": counties_s,
    "population": population_s,
    "gdp": gdp_s,
    "area": area_s,
})

records = [
    {"borough": b, "county": counties_s.loc[b], "population": population_s.loc[b], "gdp": gdp_s.loc[b], "area": area_s.loc[b]}
    for b in boroughs
]
# set_index keeps "borough" as the index name
df_records = pd.DataFrame.from_records(records).set_index("borough")

# Ensure same column order
df_records = df_records.loc[:, df_series.columns]

assert df_series.equals(df_records)

df_series, df_records

(                 county  population      gdp    area
 borough                                             
 The Bronx         Bronx     1418207   42.695   42.10
 Brooklyn          Kings     2559903   91.559   70.82
 Manhattan      New York     1628706  600.244   22.83
 Queens           Queens     2253858   93.310  108.53
 Staten Island  Richmond      476143   14.514   58.37,
                  county  population      gdp    area
 borough                                             
 The Bronx         Bronx     1418207   42.695   42.10
 Brooklyn          Kings     2559903   91.559   70.82
 Manhattan      New York     1628706  600.244   22.83
 Queens           Queens     2253858   93.310  108.53
 Staten Island  Richmond      476143   14.514   58.37)

# Problem 8 — Debug a subtle issue: a typo in a column name

You receive a DataFrame where GDP is accidentally named `gpd`.
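
One way to surface this kind of typo early is to compare the actual columns with the ones the analysis expects. This is a sketch on hypothetical data; the `expected` set is an assumption made for the example.

```python
import pandas as pd

# Hypothetical frame with a misspelled column
toy = pd.DataFrame({"name": ["a", "b"], "gpd": [1.0, 3.0]})

expected = {"name", "gdp"}                 # columns the analysis relies on
missing = expected - set(toy.columns)      # expected but absent
unexpected = set(toy.columns) - expected   # present but not expected

print("missing:", missing)
print("unexpected:", unexpected)
```

Seeing `gdp` reported as missing next to an unexpected `gpd` makes the misspelling easy to spot before any computation fails with a `KeyError`.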

### Task
1. Create `broken` exactly as below.
2. Rename `gpd` -> `gdp`.
3. Add a column `gdp_share` = gdp / total_gdp.
4. Return the top 2 boroughs by `gdp_share`.

Store the final result in `top2`.

In [17]:
# STARTER
broken = nyc.rename(columns={"gdp": "gpd"})

# TODO: create top2
top2 = None

top2

In [18]:
# SOLUTION
fixed = broken.rename(columns={"gpd": "gdp"})
total_gdp = fixed["gdp"].sum()

top2 = (
    fixed.assign(gdp_share=lambda d: d["gdp"] / total_gdp)
         .sort_values("gdp_share", ascending=False)
         .head(2)
         .loc[:, ["county", "gdp", "gdp_share"]]
)

assert np.isclose(top2["gdp_share"].sum(), fixed["gdp"].nlargest(2).sum() / fixed["gdp"].sum())

top2

borough,county,gdp,gdp_share
Manhattan,New York,600.244,0.712606
Queens,Queens,93.31,0.110777
