# Basic Data Analysis (Advanced Practice)

This notebook contains **advanced-but-not-too-advanced** practice problems (with **solutions**) on basic data analysis using **pandas**.

## What you'll practice
- Robust loading (with a fallback synthetic dataset)
- Data quality checks (types, missingness, duplicates)
- Summary statistics for numeric + categorical
- Grouped analysis and feature engineering
- Outliers (IQR-based) and winsorization
- Correlations and sanity checks
- Building a reusable profiling summary

### Best practices used
- Clear, reproducible code (`random_state`)
- No magic prints: prefer tidy tables
- Functions for repeatability
- Assertions to validate results


In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 120)

RNG = np.random.default_rng(42)

## Dataset

If the file `Morningstar - European Mutual Funds.csv` exists locally, we'll load it.

If not, we'll generate a **synthetic mutual-funds-like** dataset with a similar feel (tickers, names, categories, returns, fees, AUM, etc.).

That way, every problem still runs end-to-end.

In [2]:
from pathlib import Path

def make_synthetic_funds(n: int = 2500, random_state: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)

    categories = [
        "Europe Large-Cap Equity", "Global Equity", "Emerging Markets Equity",
        "Eurozone Bond", "Global Bond", "High Yield Bond",
        "Allocation 60/40", "Allocation Conservative", "Money Market"
    ]

    # Create somewhat realistic distributions
    cat = rng.choice(categories, size=n, p=[0.18, 0.15, 0.09, 0.16, 0.13, 0.07, 0.10, 0.08, 0.04])
    
    ticker = np.array([f"F{idx:05d}" for idx in range(1, n + 1)])
    fund_name = np.array([f"Fund {idx:05d}" for idx in range(1, n + 1)])

    # AUM in EUR millions (log-normal)
    aum_eur_m = np.round(rng.lognormal(mean=4.2, sigma=0.9, size=n), 2)  # wide range

    # Expense ratio (in %), with category effects
    base_fee = rng.normal(1.1, 0.35, size=n)
    fee_adjust = np.select(
        [
            np.isin(cat, ["Money Market", "Eurozone Bond", "Global Bond"]),
            np.isin(cat, ["High Yield Bond", "Emerging Markets Equity"]),
            np.isin(cat, ["Allocation 60/40", "Allocation Conservative"]),
        ],
        [-0.35, +0.25, -0.10],
        default=0.0,
    )
    expense_ratio = np.clip(base_fee + fee_adjust, 0.05, 3.5)

    # 2018 return (in %): category-driven centers + noise
    center = np.select(
        [
            cat == "Europe Large-Cap Equity",
            cat == "Global Equity",
            cat == "Emerging Markets Equity",
            cat == "Eurozone Bond",
            cat == "Global Bond",
            cat == "High Yield Bond",
            cat == "Allocation 60/40",
            cat == "Allocation Conservative",
            cat == "Money Market",
        ],
        [-10.0, -8.0, -13.5, -1.0, -2.0, -4.0, -6.0, -2.5, 0.2],
        default=-6.0,
    )
    fund_return_2018 = center + rng.normal(0, 6.5, size=n)

    # Create a few extreme outliers
    outlier_idx = rng.choice(n, size=max(5, n // 500), replace=False)
    fund_return_2018[outlier_idx] += rng.choice([35, -40, 50, -55], size=len(outlier_idx))

    # Currency + domicile (categorical)
    currency = rng.choice(["EUR", "GBP", "CHF", "SEK"], size=n, p=[0.68, 0.14, 0.10, 0.08])
    domicile = rng.choice(["IE", "LU", "DE", "FR", "GB", "CH"], size=n, p=[0.34, 0.36, 0.10, 0.08, 0.06, 0.06])

    # Some missingness
    rating = rng.integers(1, 6, size=n).astype("float")  # 1..5 stars
    rating[rng.random(n) < 0.12] = np.nan
    
    # Duplicate a handful of tickers to simulate data issues
    if n >= 100:
        dup_rows = rng.choice(n, size=8, replace=False)
        ticker[dup_rows[:4]] = ticker[dup_rows[4:]]  # make 4 duplicates

    df_syn = pd.DataFrame({
        "ticker": ticker,
        "fund_name": fund_name,
        "morningstar_category": cat,
        "fund_return_2018": np.round(fund_return_2018, 3),
        "aum_eur_m": aum_eur_m,
        "expense_ratio": np.round(expense_ratio, 3),
        "currency": currency,
        "domicile": domicile,
        "star_rating": rating,
    })

    # Add a small number of missing values in key columns
    for col in ["fund_return_2018", "expense_ratio", "domicile"]:
        mask = rng.random(n) < (0.02 if col != "domicile" else 0.01)
        df_syn.loc[mask, col] = np.nan

    return df_syn

def load_funds_dataset(path: str = "Morningstar - European Mutual Funds.csv") -> pd.DataFrame:
    p = Path(path)
    if p.exists():
        df_local = pd.read_csv(p)
        df_local["__source__"] = "csv"
        return df_local
    df_syn = make_synthetic_funds(n=2500, random_state=42)
    df_syn["__source__"] = "synthetic"
    return df_syn

df = load_funds_dataset()
df.shape, df["__source__"].iloc[0]

((49399, 112), 'csv')

Let's quickly inspect the dataset.

In [3]:
df.head(5)

Unnamed: 0,ticker,isin,fund_name,morningstar_category,morningstar_rating,morningstar_analyst_rating,morningstar_risk_rating,morningstar_performance_rating,nav_per_share_currency,nav_per_share,class_size_currency,class_size,fund_size_currency,fund_size,fund_return_ytd,fund_return_2018,fund_return_2017,fund_return_2016,fund_return_2015,fund_return_2014,fund_return_2013,fund_return_2012,fund_return_2011,fund_return_2010,investment_strategy,trailing_return_3years,trailing_return_5years,trailing_return_10years,trailing_return_since_inception,dividend_frequency,fund_benchmark,morningstar_benchmark,equity_style,equity_style_score,equity_size,equity_size_score,price_prospective_earnings,price_book,price_sales,price_cash_flow,dividend_yield_factor,long_term_projected_earnings_growth,historical_earnings_growth,sales_growth,cash_flow_growth,book_value_growth,roa,roe,roic,bond_interest_rate_sensitivity,bond_credit_quality,average_coupon_rate,average_credit_quality,modified_duration,effective_maturity,asset_stock,asset_bond,asset_cash,asset_other,country_exposure,top5_regions,sector_basic_materials,sector_consumer_cyclical,sector_financial_services,sector_real_estate,sector_consumer_defensive,sector_healthcare,sector_utilities,sector_communication_services,sector_energy,sector_industrials,sector_technology,market_capitalization_giant,market_capitalization_large,market_capitalization_medium,market_capitalization_small,market_capitalization_micro,credit_quality_aaa,credit_quality_aa,credit_quality_a,credit_quality_bbb,credit_quality_bb,credit_quality_b,credit_quality_below_b,credit_quality_not_rated,holdings_number_stock,holdings_number_bonds,top5_holdings,ongoing_cost,management_fees,sustainability_rank,esg_score,environmental_score,social_score,governance_score,controversy_score,sustainability_score,sustainability_percentage_rank,involvement_abortive_contraceptive,involvement_alcohol,involvement_animal_testing,involvement_controversial_weapons,involvement_gambling,involvement_gmo,involvement_military_contracting,involvement_nuclear,involvement_palm_oil,involvement_pesticides,involvement_small_arms,involvement_thermal_coal,involvement_tobacco,__source__
0,0P00000AWF,LU0171281750,BlackRock Global Funds - European Value Fund A2,Europe Large-Cap Value Equity,3.0,Bronze,3.0,3.0,USD,68.96,EUR,507320000,EUR,865540000,8.59,-18.13,10.45,15.21,7.36,-1.77,32.92,19.93,-6.29,-0.13,The European Value Fund seeks to maximise tota...,0.21,4.65,,7.02,,MSCI Europe Value NR EUR,MSCI Europe Value NR EUR,Value,125.0,Large,259.8,12.6,1.62,0.94,7.95,3.96,7.99,9.39,2.19,6.19,5.58,5.61,16.62,13.22,,,,,,,99.33,0.39,0.28,0.0,"AUT: 2.550566, BEL: 2.93189, CHE: 6.662146, DE...","Eurozone: 55.6, United Kingdom: 26.61, Europe ...",6.74,8.27,20.26,2.36,5.0,12.36,6.26,,9.04,28.01,1.72,,,,,,,,,,,,,,50.0,,"Total SA: 4.79, Sanofi SA: 4.76, Prudential PL...",1.82154,1.5001,2.0,60.12,58.36,59.92,59.18,7.07,53.05,83.0,13.48,0.0,21.31,2.7,0.0,0.0,5.05,6.54,0.0,0.0,0.0,12.32,0.0,csv
1,0P00000AYI,LU0071969892,BlackRock Global Funds - Continental European ...,Europe ex-UK Large-Cap Equity,4.0,Bronze,4.0,5.0,GBP,22.51,EUR,42520000,EUR,3608510000,20.89,-14.11,24.96,12.74,13.99,-1.62,27.13,26.08,,,The Continental European Flexible Fund seeks t...,8.36,13.11,,9.22,Annually,FTSE World Eur Ex UK TR EUR,MSCI Europe Ex UK NR EUR,Growth,250.51,Large,283.97,22.03,4.12,2.53,15.22,1.72,10.47,11.84,4.99,6.43,8.47,9.59,24.29,17.52,,,,,,,99.54,0.0,0.46,0.0,"BEL: 1.713572, CHE: 14.820541, DEU: 15.239124,...","Eurozone: 58.51, Europe - Ex Euro: 28.31, Unit...",8.8,15.24,5.99,1.64,6.06,18.35,1.23,,,23.94,18.74,,,,,,,,,,,,,,43.0,,"SAP SE: 6.49, LVMH Moet Hennessy Louis Vuitton...",1.82048,1.4999,3.0,60.31,58.59,60.98,57.01,4.4,55.91,62.0,13.36,9.53,21.33,9.19,0.0,0.0,10.93,1.98,0.0,0.0,0.0,1.98,0.0,csv
2,0P00000BOW,LU0011983433,Morgan Stanley Investment Funds - Global Bond ...,Global Bond,5.0,,3.0,5.0,EUR,44.2,USD,69100000,USD,721760000,9.49,3.26,0.23,22.28,1.06,8.11,-3.53,2.88,3.65,10.41,The Global Bond Fund's investment objective is...,2.21,6.95,,6.25,,BBgBarc Global Aggregate TR USD,Bloomberg Barclays Global Aggregate TR USD,,,,,,,,,,,,,,,6.3,18.14,-3.35,High,Medium,3.29,11.0,7.31,10.48,0.42,95.95,8.98,-5.34,USA: 100,United States: 100,,,,,,,,,,,100.0,,,,,,26.35,8.78,24.49,27.94,3.32,3.91,0.75,4.45,1.0,390.0,"Us 2yr Note Dec 19: 6.43, United States Treasu...",0.6459,0.45,,,,,,,,,1.2,0.37,1.88,0.0,0.24,0.16,0.0,0.38,0.0,0.35,0.0,1.67,0.29,csv
3,0P00000ESH,LU0757425763,Threadneedle (Lux) - American Select Class AU ...,US Large-Cap Growth Equity,2.0,,3.0,2.0,EUR,23.03,USD,5860000,USD,358380000,21.95,-1.6,12.86,28.83,6.1,11.47,27.42,10.09,0.8,12.4,The American Select Portfolio seeks to achieve...,12.57,14.62,,,,S&P 500 TR USD,Russell 1000 Growth TR USD,Growth,227.63,Large,338.44,20.1,2.76,3.05,8.99,0.97,12.24,13.66,11.42,27.32,8.49,9.43,24.06,14.17,,,,,,,95.48,0.0,4.52,0.0,"CHN: 1.31653, USA: 98.68347","United States: 98.68, Asia - Emerging: 1.32",,8.56,24.28,,3.49,10.03,,4.08,3.98,2.86,42.73,,,,,,,,,,,,,,43.0,,"Alphabet Inc A: 8.99, Microsoft Corp: 8.54, Be...",1.8,1.5,2.0,54.59,56.3,54.85,48.14,8.06,46.53,68.0,2.44,0.0,9.35,0.0,0.0,0.0,0.26,0.26,0.0,0.0,0.0,8.06,0.0,csv
4,0P00000ESL,LU0011818076,HSBC Global Investment Funds - Economic Scale ...,Japan Large-Cap Equity,3.0,,2.0,3.0,USD,11.44,JPY,1651140000,JPY,14846320000,13.02,-6.79,12.02,25.29,15.37,0.01,20.27,0.16,-13.05,16.06,The sub-fund aims to provide long term total r...,6.52,11.84,,,Annually,MSCI Japan NR USD,TOPIX TR JPY,Value,101.69,Large,273.46,10.7,0.82,0.51,3.76,3.23,8.19,3.64,2.37,-0.95,6.59,4.15,10.76,7.72,,,,,,,99.95,0.0,0.15,-0.1,JPN: 100,Japan: 100,7.4,18.32,13.29,1.05,5.62,4.16,4.6,11.05,2.15,20.11,12.25,,,,,,,,,,,,,,252.0,,"Nippon Telegraph & Telephone Corp: 5.45, Toyot...",0.75,0.4,3.0,53.39,56.07,53.24,47.07,4.76,48.63,35.0,2.28,0.82,11.16,0.0,0.18,0.0,0.79,5.3,0.0,0.42,0.15,9.22,2.34,csv


## Problem 1 — Data quality snapshot

Create a **data quality report** that includes:
- total rows / total columns
- number of duplicate rows
- number of duplicate tickers (if `ticker` exists)
- per-column: dtype, missing count, missing %, and nunique

Return it as two tables:
1) `overview`
2) `col_report` (sorted by missing % desc, then nunique desc)


In [4]:
# SOLUTION
n_rows, n_cols = df.shape

overview = pd.DataFrame({
    "metric": ["rows", "columns", "duplicate_rows", "duplicate_tickers"],
    "value": [
        n_rows,
        n_cols,
        int(df.duplicated().sum()),
        int(df["ticker"].duplicated().sum()) if "ticker" in df.columns else np.nan,
    ],
})

col_report = (
    pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "nunique": df.nunique(dropna=True),
    })
    .sort_values(["missing_pct", "nunique"], ascending=[False, False])
)

overview, col_report.head(12)

(              metric  value
 0               rows  49399
 1            columns    112
 2     duplicate_rows      0
 3  duplicate_tickers      0,
                               dtype  missing  missing_pct  nunique
 trailing_return_10years     float64    49307        99.81       88
 credit_quality_aaa          float64    45812        92.74      468
 credit_quality_bbb          float64    45812        92.74      465
 credit_quality_aa           float64    45812        92.74      446
 credit_quality_bb           float64    45812        92.74      446
 credit_quality_a            float64    45812        92.74      441
 credit_quality_b            float64    45812        92.74      433
 credit_quality_not_rated    float64    45812        92.74      380
 credit_quality_below_b      float64    45812        92.74      278
 morningstar_analyst_rating   object    42500        86.03        6
 modified_duration           float64    42247        85.52      602
 effective_maturity          float64  
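
Duplicate rows and duplicate tickers are different checks: `DataFrame.duplicated()` flags rows that repeat in every column, while `Series.duplicated()` on `ticker` flags repeated keys even when the other values disagree. A minimal sketch on made-up data (the `toy` frame and its values are illustrative assumptions, not taken from the funds dataset):

```python
import pandas as pd

# Hypothetical toy data: "B" is an exact duplicate row, "C" repeats with a different value
toy = pd.DataFrame({
    "ticker": ["A", "B", "B", "C", "C"],
    "ret": [1.0, 2.0, 2.0, 3.0, 4.0],
})

# Rows identical to an earlier row across all columns
dup_rows = int(toy.duplicated().sum())

# Tickers that already appeared earlier, regardless of the other columns
dup_tickers = int(toy["ticker"].duplicated().sum())

# keep=False marks every member of a duplicated key, which is handy for inspection
conflicts = toy[toy["ticker"].duplicated(keep=False)]
```

Here the two counts differ because the second `C` row shares a ticker but not a return, which is usually the more worrying kind of duplicate.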

## Problem 2 — Robust type fixes

Some datasets store numbers as strings (e.g., `"1.23%"`, `"1,234"`).

Write a function `coerce_numeric(df, cols)` that:
- strips `%` signs
- removes commas (thousands separators)
- coerces unparseable values to NaN

Apply it to the numeric-looking columns among
`fund_return_2018`, `expense_ratio`, `aum_eur_m` (only those that are present).

Show dtypes before/after for those columns.


In [5]:
# SOLUTION
def coerce_numeric(df_in: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    df_out = df_in.copy()
    for c in cols:
        if c not in df_out.columns:
            continue
        s = df_out[c]
        # Only attempt string cleaning if object-like
        if pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            cleaned = (
                s.astype("string")
                .str.replace("%", "", regex=False)
                .str.replace(",", "", regex=False)
                .str.strip()
            )
            df_out[c] = pd.to_numeric(cleaned, errors="coerce")
        else:
            df_out[c] = pd.to_numeric(s, errors="coerce")
    return df_out

candidate_cols = [c for c in ["fund_return_2018", "expense_ratio", "aum_eur_m"] if c in df.columns]

before = df[candidate_cols].dtypes.astype(str)
df2 = coerce_numeric(df, candidate_cols)
after = df2[candidate_cols].dtypes.astype(str)

pd.DataFrame({"before": before, "after": after})

Unnamed: 0,before,after
fund_return_2018,float64,float64
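
Because `fund_return_2018` is already `float64` in the CSV, the string-cleaning branch of `coerce_numeric` is never exercised above. The sketch below applies the same cleaning steps to a few made-up strings; `clean_numeric_strings` and the `toy` values are hypothetical names introduced only for this illustration, and the final cast to plain `float64` is an added convenience.

```python
import pandas as pd

def clean_numeric_strings(s: pd.Series) -> pd.Series:
    """Strip '%' and ',' from strings, then coerce to float (unparseable -> NaN)."""
    cleaned = (
        s.astype("string")
        .str.replace("%", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    # Cast to plain float64 so missing values are ordinary NaN
    return pd.to_numeric(cleaned, errors="coerce").astype("float64")

# Hypothetical messy inputs: a percentage, a thousands separator, padding, and junk
toy = pd.Series(["1.23%", "1,234", " 7 ", "n/a"])
toy_clean = clean_numeric_strings(toy)
```

Note that stripping `%` keeps the value in percent units (`"1.23%"` becomes `1.23`, not `0.0123`), which matches how the return and fee columns are expressed in this notebook.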


## Problem 3 — Category-level return summary with confidence-aware counts

For `morningstar_category` (if present):

Create a table with:
- `n_total` = total funds in category
- `n_valid_return` = non-missing `fund_return_2018`
- `missing_return_pct`
- `mean_return`, `median_return`, `std_return`
- `p10`, `p90` of returns

Sort by `mean_return` descending.

**Best practice requirement:** categories with fewer than 20 valid returns should still appear, but set `mean_return` and `std_return` to NaN for those categories (to avoid misleading summaries).


In [6]:
# SOLUTION
from IPython.display import display

required = {"morningstar_category", "fund_return_2018"}
if required.issubset(df2.columns):
    g = df2.groupby("morningstar_category", dropna=False)

    # "size" counts every row in the group; "count" counts only non-missing returns
    summary = g["fund_return_2018"].agg(
        n_total="size",
        n_valid_return="count",
        mean_return="mean",
        median_return="median",
        std_return="std",
        p10=lambda s: s.quantile(0.10),
        p90=lambda s: s.quantile(0.90),
    ).reset_index()

    summary["missing_return_pct"] = (
        (1 - summary["n_valid_return"] / summary["n_total"]) * 100
    ).round(2)

    # Confidence-aware masking
    low_n = summary["n_valid_return"] < 20
    summary.loc[low_n, ["mean_return", "std_return"]] = np.nan

    # Masked (NaN) categories sort to the bottom by default
    summary = summary.sort_values("mean_return", ascending=False)

    # Basic sanity checks
    assert (summary["n_valid_return"] <= summary["n_total"]).all()
    assert (summary["missing_return_pct"].between(0, 100)).all()

    # A bare expression inside if/else is not rendered by Jupyter, so display it explicitly
    display(summary.head(12))
else:
    print("Columns missing: cannot compute category summary.")
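
The distinction between `size` and `count`, and the low-sample masking, are easier to see on a tiny example. This is a sketch on made-up data: the `toy` frame and the threshold `MIN_VALID = 2` are assumptions chosen so the effect is visible with five rows (the problem itself uses 20).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: category B has only one non-missing return
toy = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B"],
    "ret": [1.0, np.nan, 3.0, np.nan, 5.0],
})

toy_summary = (
    toy.groupby("cat")["ret"]
    .agg(n_total="size", n_valid="count", mean_ret="mean")
    .reset_index()
)

# Mask the mean where too few valid observations support it
MIN_VALID = 2  # illustrative threshold only
toy_summary.loc[toy_summary["n_valid"] < MIN_VALID, "mean_ret"] = np.nan
```

Category B stays in the table with its counts intact, but its mean is blanked out rather than reported from a single observation.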

## Problem 4 — Identify and treat outliers (IQR rule)

Using `fund_return_2018`:
1) Compute the IQR bounds (`Q1 - 1.5*IQR`, `Q3 + 1.5*IQR`).
2) Report how many values are outliers.
3) Create a new column `fund_return_2018_winsor` where outliers are clipped to the bounds.
4) Compare `mean` and `std` before vs after winsorization.


In [7]:
# SOLUTION
from IPython.display import display

if "fund_return_2018" in df2.columns:
    s = df2["fund_return_2018"].dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    # Comparisons against NaN are False, so missing returns are never flagged
    is_outlier = df2["fund_return_2018"].lt(lower) | df2["fund_return_2018"].gt(upper)
    outlier_count = int(is_outlier.sum())

    df2 = df2.copy()
    # clip() leaves NaN untouched and caps everything else at the IQR bounds
    df2["fund_return_2018_winsor"] = df2["fund_return_2018"].clip(lower=lower, upper=upper)

    bounds = pd.DataFrame({
        "metric": ["iqr_lower", "iqr_upper", "outlier_count"],
        "value": [float(lower), float(upper), outlier_count],
    })

    compare = pd.DataFrame({
        "metric": ["mean", "std"],
        "before": [df2["fund_return_2018"].mean(), df2["fund_return_2018"].std()],
        "after_winsor": [df2["fund_return_2018_winsor"].mean(), df2["fund_return_2018_winsor"].std()],
    })

    # A bare expression inside if/else is not rendered by Jupyter, so display explicitly
    display(bounds)
    display(compare)
else:
    print("Column fund_return_2018 missing.")
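
To check the IQR arithmetic by hand, here is a sketch on five made-up values where the quartiles are easy to read off. The `toy` series is an assumption for illustration only. Strictly speaking, classic winsorization caps at chosen percentiles; this notebook uses the IQR fences as the caps instead.

```python
import pandas as pd

# Hypothetical toy returns with one obvious extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences, then cap them at the fences
n_outliers = int((toy.lt(lower) | toy.gt(upper)).sum())
toy_winsor = toy.clip(lower=lower, upper=upper)

mean_before, mean_after = toy.mean(), toy_winsor.mean()
```

With these values the quartiles are 2 and 4, so the fences are -1 and 7; the single extreme value is capped at 7 and the mean drops sharply, which is the before/after comparison the problem asks for.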

## Problem 5 — Relationship check: fees vs returns

Many people suspect higher fees might correlate with lower returns.

Tasks:
1) Compute Pearson correlation between `expense_ratio` and `fund_return_2018_winsor`.
2) Compute the correlation **within each category** (if category exists), but only for categories with at least 50 valid pairs.
3) Return the top 5 categories with the **most negative** correlation.

Include a short "sanity" table showing counts of valid pairs used.


In [8]:
# SOLUTION
from IPython.display import display

needed = {"expense_ratio", "fund_return_2018_winsor"}
if needed.issubset(df2.columns):
    pair = df2[["expense_ratio", "fund_return_2018_winsor"]].dropna()
    overall_corr = pair["expense_ratio"].corr(pair["fund_return_2018_winsor"], method="pearson")

    # A bare expression inside if/else is not rendered by Jupyter, so display explicitly
    display(pd.DataFrame({"metric": ["overall_corr"], "value": [float(overall_corr)]}))

    if "morningstar_category" in df2.columns:
        value_cols = ["expense_ratio", "fund_return_2018_winsor"]
        tmp = df2[["morningstar_category"] + value_cols].dropna()
        # Select the value columns so the grouping key is not passed into apply
        g = tmp.groupby("morningstar_category")[value_cols]

        corr_by_cat = g.apply(
            lambda d: pd.Series({
                "n_pairs": len(d),
                "corr": d["expense_ratio"].corr(d["fund_return_2018_winsor"], method="pearson"),
            })
        ).reset_index()
        # pd.Series mixes int and float, so restore the integer count
        corr_by_cat["n_pairs"] = corr_by_cat["n_pairs"].astype(int)

        corr_by_cat_50 = corr_by_cat[corr_by_cat["n_pairs"] >= 50].copy()
        most_negative = corr_by_cat_50.sort_values("corr", ascending=True).head(5)

        sanity = corr_by_cat.sort_values("n_pairs", ascending=False).head(10)
        display(most_negative)
        display(sanity)
    else:
        print("No category column found.")
else:
    print("Required columns missing.")
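
The real CSV shown earlier has no `expense_ratio` column, so this cell may only report that the required columns are missing. The sketch below runs the same idea on made-up data: a per-category Pearson correlation with a minimum-pairs filter. The `toy` frame, the column names `fee` and `ret`, and the threshold of 3 pairs are all illustrative assumptions (the problem itself uses 50).

```python
import pandas as pd

# Hypothetical toy data: A is perfectly negative, B perfectly positive, C too small
toy = pd.DataFrame({
    "cat": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
    "fee": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0],
    "ret": [4.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

rows = []
for name, d in toy.dropna().groupby("cat"):
    rows.append({
        "cat": name,
        "n_pairs": len(d),
        "corr": d["fee"].corr(d["ret"], method="pearson"),
    })
by_cat = pd.DataFrame(rows)

MIN_PAIRS = 3  # illustrative threshold only
eligible = by_cat[by_cat["n_pairs"] >= MIN_PAIRS].sort_values("corr", ascending=True)
```

Category C would show a perfect correlation from just two points, which is exactly the kind of result the minimum-pairs rule is meant to keep out of the ranking.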

## Problem 6 — Build a reusable column profiler

Create a function `profile_columns(df)` that returns a DataFrame with one row per column and these fields:

- `dtype`
- `missing`
- `missing_pct`
- `nunique`
- `example_values` (up to 3 non-null example values)
- For numeric columns only:
  - `mean`, `std`, `min`, `p25`, `p50`, `p75`, `max`

Sort by `missing_pct` desc.


In [9]:
# SOLUTION
def profile_columns(df_in: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df_in.columns:
        s = df_in[col]
        nonnull = s.dropna()
        examples = nonnull.head(3).tolist()

        row = {
            "column": col,
            "dtype": str(s.dtype),
            "missing": int(s.isna().sum()),
            "missing_pct": round(float(s.isna().mean() * 100), 2),
            "nunique": int(s.nunique(dropna=True)),
            "example_values": examples,
        }

        if pd.api.types.is_numeric_dtype(s):
            row.update({
                "mean": float(nonnull.mean()) if len(nonnull) else np.nan,
                "std": float(nonnull.std()) if len(nonnull) else np.nan,
                "min": float(nonnull.min()) if len(nonnull) else np.nan,
                "p25": float(nonnull.quantile(0.25)) if len(nonnull) else np.nan,
                "p50": float(nonnull.quantile(0.50)) if len(nonnull) else np.nan,
                "p75": float(nonnull.quantile(0.75)) if len(nonnull) else np.nan,
                "max": float(nonnull.max()) if len(nonnull) else np.nan,
            })
        else:
            row.update({"mean": np.nan, "std": np.nan, "min": np.nan, "p25": np.nan, "p50": np.nan, "p75": np.nan, "max": np.nan})

        rows.append(row)

    out = pd.DataFrame(rows).sort_values(["missing_pct", "nunique"], ascending=[False, False])
    return out.reset_index(drop=True)

profile = profile_columns(df2)
profile.head(12)

Unnamed: 0,column,dtype,missing,missing_pct,nunique,example_values,mean,std,min,p25,p50,p75,max
0,trailing_return_10years,float64,49307,99.81,88,"[6.75, 4.39, 2.1]",7.075761,3.861612,0.02,4.785,6.92,9.2075,17.97
1,credit_quality_aaa,float64,45812,92.74,468,"[26.35, 19.0, 19.0]",18.050309,16.056235,-3.81,4.76,14.1,28.34,88.92
2,credit_quality_bbb,float64,45812,92.74,465,"[27.94, 37.04, 37.04]",26.824065,12.922587,0.02,17.15,25.48,34.29,77.7
3,credit_quality_aa,float64,45812,92.74,446,"[8.78, 15.69, 15.69]",9.373549,10.154036,-8.42,3.18,5.84,11.62,63.49
4,credit_quality_bb,float64,45812,92.74,446,"[3.32, 3.07, 3.07]",14.356089,10.070419,-8.35,6.03,14.04,20.68,55.84
5,credit_quality_a,float64,45812,92.74,441,"[24.49, 19.36, 19.36]",14.02007,9.013538,0.22,7.28,12.96,18.8,47.29
6,credit_quality_b,float64,45812,92.74,433,"[3.91, 5.24, 5.24]",11.224921,10.0962,-2.0,3.31,8.64,16.395,53.92
7,credit_quality_not_rated,float64,45812,92.74,380,"[4.45, 0.26, 0.26]",3.958921,7.249961,-19.2,0.45,2.27,5.2,85.42
8,credit_quality_below_b,float64,45812,92.74,278,"[0.75, 0.34, 0.34]",2.191751,2.808746,-3.67,0.34,1.22,3.17,19.61
9,morningstar_analyst_rating,object,42500,86.03,6,"[Bronze, Bronze, Bronze]",,,,,,,


## Problem 7 — Create a clean analysis table for modeling

Goal: produce a DataFrame `model_df` with:
- index: `ticker` (if present; otherwise use the existing index)
- features:
  - `expense_ratio`
  - `aum_eur_m`
  - `star_rating`
  - one-hot encoded `morningstar_category` (top 6 most frequent categories, all others as `category__OTHER`)
- target: `fund_return_2018_winsor` (stored in `model_df` as `target_return`)

Rules:
- Drop rows with missing target.
- Impute missing numeric features with **median**.
- Return `model_df.shape` and show the first 5 rows.


In [10]:
# SOLUTION
dfm = df2.copy()

# Index
if "ticker" in dfm.columns:
    dfm = dfm.set_index("ticker")

# Target
target_col = "fund_return_2018_winsor" if "fund_return_2018_winsor" in dfm.columns else "fund_return_2018"
if target_col not in dfm.columns:
    raise ValueError("No target return column found.")

dfm = dfm.dropna(subset=[target_col])

# Numeric features
numeric_features = [c for c in ["expense_ratio", "aum_eur_m", "star_rating"] if c in dfm.columns]
for c in numeric_features:
    dfm[c] = pd.to_numeric(dfm[c], errors="coerce")
    dfm[c] = dfm[c].fillna(dfm[c].median())

# Categorical: top 6 categories
cat_cols = []
if "morningstar_category" in dfm.columns:
    top6 = dfm["morningstar_category"].value_counts().head(6).index
    cat_slim = dfm["morningstar_category"].where(dfm["morningstar_category"].isin(top6), other="OTHER")
    dummies = pd.get_dummies(cat_slim, prefix="category", dtype=int)
    cat_cols = dummies.columns.tolist()
else:
    dummies = pd.DataFrame(index=dfm.index)

# Build final table
model_df = pd.concat([
    dfm[numeric_features],
    dummies,
    dfm[[target_col]].rename(columns={target_col: "target_return"}),
], axis=1)

# Sanity checks
assert model_df["target_return"].isna().sum() == 0
assert all(pd.api.types.is_numeric_dtype(model_df[c]) for c in model_df.columns)

model_df.shape, model_df.head(5)

((41580, 8),
             category_Alt - Multistrategy  category_GBP Moderate Allocation  category_Global Emerging Markets Equity  \
 ticker                                                                                                                
 0P00000AWF                             0                                 0                                        0   
 0P00000AYI                             0                                 0                                        0   
 0P00000BOW                             0                                 0                                        0   
 0P00000ESH                             0                                 0                                        0   
 0P00000ESL                             0                                 0                                        0   
 
             category_Global Large-Cap Blend Equity  category_OTHER  category_Other Bond  category_Other Equity  \
 ticker                       
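
The "top categories plus `OTHER`" encoding can be checked on a small example. This is a sketch on a made-up series, keeping the top 2 labels instead of the top 6; the `toy` values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy categories: "x" and "y" are frequent, "z" and "w" are rare
toy = pd.Series(["x", "x", "x", "y", "y", "z", "w"])

# Keep the most frequent labels and collapse everything else into "OTHER"
top2 = toy.value_counts().head(2).index
slim = toy.where(toy.isin(top2), other="OTHER")

# One indicator column per remaining label
dummies = pd.get_dummies(slim, prefix="category", dtype=int)
```

Every row has exactly one indicator set to 1, so the columns are perfectly collinear; for a linear model with an intercept you would typically drop one of them (for example with `drop_first=True`).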

## Extra (Optional) — Quick report you can re-use

If you want a simple one-liner report on any DataFrame, use:

- `profile_columns(df).head(20)` to see the most problematic columns
- `df.describe(include='all')` for a broad summary
- `df.info(show_counts=True)` for a compact schema view


In [11]:
df2.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49399 entries, 0 to 49398
Columns: 113 entries, ticker to fund_return_2018_winsor
dtypes: float64(91), int64(2), object(20)
memory usage: 42.6+ MB
