# 6 Weighting & Aggregation

> *“Weight assignment is one of the most important steps in composite indicator*

I now have **21 normalised indicators** (after dropping the nine redundant ones). Following the guidance in the handbook, I will test **two transparent weighting schemes**:

| scheme | logic | compensability | why I include it |
|--------|-------|----------------|------------------|
| **Equal-by-group** | Each of the five groups gets 20 %.<br>Inside a group the share is split equally. | fully compensatory | baseline that respects the theoretical balance of the sub-indices |
| **PCA-variance** | Weight<sub>q</sub> ∝ Σ<sub>j</sub> λ<sub>j</sub> · loading<sub>qj</sub><sup>2</sup>. | data-driven, still linear | rewards indicators that explain more common variance |


## 6.1 Equal weights by group

Weighting every indicator equally (1 ÷ 21 ≈ 0.0476) *over-weights* sub-indices that contain many indicators. I instead give **each group 20 %**, then divide that share by the number of indicators inside the group:

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

ROOT   = Path("..")
PROC   = ROOT / "data" / "processed"
ART    = ROOT / "artifacts" / "weights"
ART.mkdir(parents=True, exist_ok=True)

df = pd.read_parquet(PROC / "csiai_input_normalized.parquet")
if "ticker" in df.columns:
    df = df.set_index("ticker")

GROUPS = {
    "financial_strength": ["roe","debt_to_equity","current_ratio",
                           "oper_cash_flow","ebitda_margin"],
    "growth_potential":   ["revenue_growth","operating_margin","gross_margin"],
    "market_performance": ["eps","market_cap","price_to_sales","payout_ratio"],
    "risk_volatility":    ["hist_volatility","beta","max_drawdown",
                           "stddev_returns","value_at_risk"],
    "liquidity_trading":  ["avg_volume_30d","bid_ask_spread","volume_growth",
                           "float_shares"],
}

# equal-by-group weights
w_equal = {}
for group, cols in GROUPS.items():
    share = 0.20 / len(cols)
    print(f"Group: {group}, Share: {share}")
    w_equal.update({c: share for c in cols})

w_equal = pd.Series(w_equal, name="w_equal_group")
w_equal.to_csv(ART / "weights_equal_group.csv")

Group: financial_strength, Share: 0.04
Group: growth_potential, Share: 0.06666666666666667
Group: market_performance, Share: 0.05
Group: risk_volatility, Share: 0.04
Group: liquidity_trading, Share: 0.05
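
As a quick sanity check, the sketch below rebuilds the per-indicator shares from the group sizes alone, confirms that they sum to 1, and shows what share each group would have received under flat 1 ÷ 21 weights. The group sizes are copied from `GROUPS` above; the helper name `equal_group_weights` is hypothetical and not part of the notebook.

```python
import math

# Group sizes taken from the GROUPS dictionary above (5 + 3 + 4 + 5 + 4 = 21).
group_sizes = {
    "financial_strength": 5,
    "growth_potential": 3,
    "market_performance": 4,
    "risk_volatility": 5,
    "liquidity_trading": 4,
}

def equal_group_weights(sizes, group_share=None):
    """Per-indicator weight inside each group (hypothetical helper)."""
    if group_share is None:
        group_share = 1.0 / len(sizes)  # five groups -> 0.20 each
    return {g: group_share / n for g, n in sizes.items()}

per_indicator = equal_group_weights(group_sizes)

# All 21 indicator weights together should sum to 1.
total = sum(per_indicator[g] * n for g, n in group_sizes.items())
assert math.isclose(total, 1.0)

# Under flat weights a group's share is proportional to its size.
flat = 1 / sum(group_sizes.values())
for g, n in group_sizes.items():
    print(f"{g:20s} n={n}  grouped={per_indicator[g]:.4f}  "
          f"flat={flat:.4f}  group share under flat weights={n * flat:.3f}")
```

Under flat weights the five-indicator groups would each hold 5 ÷ 21 of the index and the three-indicator group only 3 ÷ 21, which is the imbalance the equal-by-group scheme removes.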


## 6.2 PCA-variance weights

For every group I ran a PCA in the multivariate notebook; here I compute each indicator’s importance as the **variance it contributes** across the retained components:

- **Financial Strength, Growth Potential, Market Performance, Liquidity & Trading**  
  These sub-indices have their variance spread fairly evenly across several components (PC1 ≈ 25–35 %, PC2 ≈ 20–30 %, PC3 ≈ 15–30 %, plus further components). To capture their full multi-dimensional structure, I weight each variable by  
  \[
    w_q \;=\;\sum_{j=1}^k \lambda_j \, L_{qj}^2
  \]  
  where \(L_{qj}\) is the loading of variable \(q\) on PC \(j\), and \(\lambda_j\) is the variance explained by PC \(j\).  I then normalize so each group’s weights sum to 0.20.

- **Risk & Volatility**  
  Here PC 1 alone explains a dominant share (~63 %) of the total variance; the remaining PCs each contribute < 20 %. In this case using **only the first component** is both simpler and nearly lossless, so I set  
  \[
    w_q \;=\; L_{q,\text{PC1}}^2
  \]  
  and normalize those squared PC 1 loadings to sum to 0.20.
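
To make the two formulas concrete, here is a minimal sketch on synthetic data. It assumes the loadings are unit-norm eigenvectors of the correlation matrix and that \(\lambda_j\) is the explained-variance ratio; the indicator names `ind_a` to `ind_d`, the choice of `k = 2` and the random data are all made up for illustration and are not the project's values.

```python
import numpy as np
import pandas as pd

# Synthetic data: 200 observations of 4 made-up indicators (not project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # induce some correlation between two indicators

# PCA via eigen-decomposition of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]                # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                            # retained components (assumed)
pcs = [f"PC{j + 1}" for j in range(k)]
names = ["ind_a", "ind_b", "ind_c", "ind_d"]     # hypothetical indicator names

loadings = pd.DataFrame(eigvecs[:, :k], index=names, columns=pcs)
evr = pd.Series(eigvals[:k] / eigvals.sum(), index=pcs)

# Multi-component rule: w_q = sum_j lambda_j * L_qj^2, rescaled to a 0.20 share.
w_raw = loadings.pow(2).mul(evr, axis=1).sum(axis=1)
w_multi = 0.20 * w_raw / w_raw.sum()

# PC1-only rule: w_q = L_q,PC1^2, rescaled to a 0.20 share.
pc1_sq = loadings["PC1"].pow(2)
w_pc1 = 0.20 * pc1_sq / pc1_sq.sum()

print(pd.DataFrame({"multi_component": w_multi, "pc1_only": w_pc1}).round(4))
```

Because the loadings are squared, the sign of a loading never matters and every weight is non-negative; only the rescaling step ties the result to the 0.20 group share.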

In [None]:
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import gmean

PCA_DIR = ROOT / "artifacts" / "pca"

def pca_weights(group: str, share: float = 0.20) -> pd.Series:
    L = pd.read_csv(PCA_DIR / f"{group}_loadings.csv", index_col=0)

    if group == "risk_volatility":
        w_raw = L["PC1"].pow(2)
    else:
        evr = (
            pd.read_csv(PCA_DIR / f"{group}_explained_variance.csv",
                        header=None, index_col=0)
              .iloc[:, 0]
        )

        evr.index = [f"PC{i+1}" for i in range(len(evr))]
        evr = evr.reindex(L.columns)

        w_raw = (L.pow(2).mul(evr, axis=1)).sum(axis=1)

    return share * w_raw / w_raw.sum()

weights_pca = (pd.concat([pca_weights(g) for g in GROUPS], axis=0).rename("w_pca_var"))

weights_pca.to_csv(ART / "weights_pca_variance.csv")
print(f"✓ PCA-variance weights saved → {ART / 'weights_pca_variance.csv'}")

✓ PCA-variance weights saved → ../artifacts/weights/weights_pca_variance.csv
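
Both schemes are described above as linear and compensatory. The sketch below illustrates what that means, assuming the composite score is a weighted arithmetic mean of the normalised indicators; the aggregation step itself is not shown in this section, so treat this as one possible reading. The tickers, indicator values, weight vectors and the helper `linear_score` are all hypothetical.

```python
import pandas as pd

# Toy normalised indicators in [0, 1] for three hypothetical tickers.
X = pd.DataFrame(
    {"roe": [0.9, 0.2, 0.5], "beta": [0.1, 0.8, 0.5], "eps": [0.6, 0.6, 0.5]},
    index=["AAA", "BBB", "CCC"],
)

# Two illustrative weight vectors (made-up numbers, each summing to 1).
w_a = pd.Series({"roe": 1 / 3, "beta": 1 / 3, "eps": 1 / 3})
w_b = pd.Series({"roe": 0.5, "beta": 0.2, "eps": 0.3})

def linear_score(data, weights):
    """Weighted arithmetic mean; weights are aligned to the columns by name."""
    w = weights.reindex(data.columns)
    if w.isna().any():
        missing = ", ".join(w.index[w.isna()])
        raise ValueError(f"missing weights for: {missing}")
    return data.mul(w, axis=1).sum(axis=1) / w.sum()

scores = pd.DataFrame(
    {"score_a": linear_score(X, w_a), "score_b": linear_score(X, w_b)}
)
print(scores.round(3))
```

Under `w_a` the tickers `AAA` and `BBB` receive the same score even though their profiles differ sharply, because a high value on one indicator fully offsets a low value on another. Changing the weights to `w_b` separates them again, which is why the choice of weighting scheme matters for the final ranking.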
