# 6 Weighting & Aggregation

> *“Weight assignment is one of the most important steps in composite indicator*

I now have **21 normalised indicators** (after dropping the nine redundant ones).  Following the guidance in the handbook I will test **two transparent weighting schemes**:

| scheme | logic | compensability | why I include it |
|--------|-------|----------------|------------------|
| **Equal-by-group** | Each of the five groups gets 20 %.<br>Inside a group the share is split equally. | fully compensatory | baseline that respects the theoretical balance of the sub-indices |
| **PCA-variance** | Weight<sub>q</sub> ∝ Σ<sub>j</sub> λ<sub>j</sub> · loading<sub>qj</sub><sup>2</sup>. | data-driven, still linear | rewards indicators that explain more common variance |


## 6.1 Equal weights by group

The equal-indicator approach (1 ÷ 21 ≈ 0.0476 per indicator) *over-weights* sub-indices that contain many indicators. I instead give **each group 20 %**, then divide that share by the number of indicators inside the group:

In [9]:
import pandas as pd
import numpy as np
from pathlib import Path

ROOT   = Path("..")
PROC   = ROOT / "data" / "processed"
ART    = ROOT / "artifacts" / "weights"
ART.mkdir(parents=True, exist_ok=True)

df = pd.read_parquet(PROC / "csiai_input_normalized.parquet")
if "ticker" in df.columns:
    df = df.set_index("ticker")

GROUPS = {
    "financial_strength": ["roe","debt_to_equity","current_ratio",
                           "oper_cash_flow","ebitda_margin"],
    "growth_potential":   ["revenue_growth","operating_margin","gross_margin"],
    "market_performance": ["eps","market_cap","price_to_sales","payout_ratio"],
    "risk_volatility":    ["hist_volatility","beta","max_drawdown",
                           "stddev_returns","value_at_risk"],
    "liquidity_trading":  ["avg_volume_30d","bid_ask_spread","volume_growth",
                           "float_shares"],
}

# equal-by-group weights
w_equal = {}
for group, cols in GROUPS.items():
    share = 0.20 / len(cols)
    print(f"Group: {group}, Share: {share}")
    w_equal.update({c: share for c in cols})

w_equal = pd.Series(w_equal, name="w_equal_group")
w_equal.to_csv(ART / "weights_equal_group.csv")

Group: financial_strength, Share: 0.04
Group: growth_potential, Share: 0.06666666666666667
Group: market_performance, Share: 0.05
Group: risk_volatility, Share: 0.04
Group: liquidity_trading, Share: 0.05
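
As a quick sanity check on the logic above, the sketch below rebuilds equal-by-group weights for a hypothetical two-group layout. The names `equal_by_group`, `toy_groups`, `a1` … `b1` are illustrative and not part of the dataset. It confirms that every group ends up with the same total share regardless of how many indicators it holds.

```python
import math

def equal_by_group(groups):
    """Give every group the same share, split equally among its indicators."""
    group_share = 1.0 / len(groups)
    weights = {}
    for cols in groups.values():
        for col in cols:
            weights[col] = group_share / len(cols)
    return weights

# Hypothetical layout: one group with four indicators, one with a single indicator.
toy_groups = {"big": ["a1", "a2", "a3", "a4"], "small": ["b1"]}
toy_weights = equal_by_group(toy_groups)

# Total weight carried by each group.
group_totals = {
    g: sum(toy_weights[c] for c in cols) for g, cols in toy_groups.items()
}

print(toy_weights)
print(group_totals)
```

With plain equal-indicator weights (1 ÷ 5 each) the `big` group would carry 80 % of the index; here both groups carry 50 %.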


## 6.2 PCA-variance weights

For every group I ran a PCA in the multivariate notebook. Here I compute each indicator’s importance as the **variance it contributes** across the retained components:

- **Financial Strength, Growth Potential, Market Performance, Liquidity & Trading**  
  These sub-indices have their variance spread fairly evenly across multiple components (PC1 ≈ 25–35 %, PC2 ≈ 20–30 %, PC3 ≈ 15–30 %, with any further components contributing the rest). To capture their full multi-dimensional structure, I weight each variable by  
  \[
    w_q \;=\;\sum_{j=1}^k \lambda_j \, L_{qj}^2
  \]  
  where \(L_{qj}\) is the loading of variable \(q\) on PC \(j\), and \(\lambda_j\) is the variance explained by PC \(j\).  I then normalize so each group’s weights sum to 0.20.

- **Risk & Volatility**  
  Here PC 1 alone explains a dominant share (~63 %) of the total variance; the remaining PCs each contribute < 20 %. In this case using **only the first component** is simpler and still captures the dominant pattern, so I set  
  \[
    w_q \;=\; L_{q,\text{PC1}}^2
  \]  
  and normalize those squared PC 1 loadings to sum to 0.20.
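
To make the multi-component formula concrete, here is a minimal worked example on made-up numbers. The loadings, the explained-variance ratios and the indicator names `x1` … `x3` are all hypothetical; only the arithmetic mirrors the formula above.

```python
import pandas as pd

# Hypothetical loadings of three indicators on two retained components.
loadings = pd.DataFrame(
    {"PC1": [0.8, 0.6, 0.0], "PC2": [0.0, 0.0, 1.0]},
    index=["x1", "x2", "x3"],
)

# Hypothetical share of variance explained by each component (lambda_j).
evr = pd.Series({"PC1": 0.6, "PC2": 0.3})

# w_q = sum_j lambda_j * L_qj^2  (squared loadings, scaled per component, summed per indicator)
w_raw = loadings.pow(2).mul(evr, axis=1).sum(axis=1)

# Rescale so the group's weights sum to its 20 % share.
w = 0.20 * w_raw / w_raw.sum()

print(w_raw)
print(w)
```

In this toy case `x1` ends up with the largest weight because it loads heavily on the component that explains the most variance, while `x3` sits alone on the weaker second component.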

In [10]:
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import gmean

PCA_DIR = ROOT / "artifacts" / "pca"

def pca_weights(group: str, share: float = 0.20) -> pd.Series:
    L = pd.read_csv(PCA_DIR / f"{group}_loadings.csv", index_col=0)

    if group == "risk_volatility":
        w_raw = L["PC1"].pow(2)
    else:
        evr = (
            pd.read_csv(PCA_DIR / f"{group}_explained_variance.csv",
                        header=None, index_col=0)
              .iloc[:, 0]
        )

        evr.index = [f"PC{i+1}" for i in range(len(evr))]
        evr = evr.reindex(L.columns)

        w_raw = (L.pow(2).mul(evr, axis=1)).sum(axis=1)

    return share * w_raw / w_raw.sum()

w_pca = (pd.concat([pca_weights(g) for g in GROUPS], axis=0).rename("w_pca_var"))

w_pca.to_csv(ART / "weights_pca_variance.csv")
print("PCA-variance weights saved: ", ART / "weights_pca_variance.csv")

PCA-variance weights saved:  ../artifacts/weights/weights_pca_variance.csv


## 6.3 Building sub-index scores

In [13]:
sub_eq  = {}
sub_pca = {}

for grp, cols in GROUPS.items():
    sub_eq [grp] = df[cols].dot(w_equal[cols])
    sub_pca[grp] = df[cols].dot(w_pca  [cols])

sub_eq_df  = pd.DataFrame(sub_eq ).add_suffix("_equal")
sub_pca_df = pd.DataFrame(sub_pca).add_suffix("_pca")

sub_scores = sub_eq_df.join(sub_pca_df)

sub_ranks = sub_scores.rank(ascending=False).add_suffix("_rank")

SUB_DIR = PROC / "subindex"
SUB_DIR.mkdir(exist_ok=True)

sub_scores.to_parquet(SUB_DIR / "sub_index_scores.parquet")
sub_ranks .to_parquet(SUB_DIR / "sub_index_ranks.parquet")

display(sub_scores.head())
display(sub_ranks.head())

| ticker | financial_strength_equal | growth_potential_equal | market_performance_equal | risk_volatility_equal | liquidity_trading_equal | financial_strength_pca | growth_potential_pca | market_performance_pca | risk_volatility_pca | liquidity_trading_pca |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| COOP | 0.093887 | 0.127802 | 0.099708 | 0.112674 | 0.077855 | 0.095714 | 0.115082 | 0.097526 | 0.111735 | 0.072024 |
| XYL | 0.080699 | 0.102068 | 0.131819 | 0.113403 | 0.101232 | 0.078878 | 0.0966 | 0.135195 | 0.112206 | 0.09203 |
| MAR | 0.062319 | 0.119279 | 0.132518 | 0.101518 | 0.103437 | 0.054943 | 0.108969 | 0.134199 | 0.095346 | 0.098832 |
| ALSN | 0.098543 | 0.105734 | 0.11146 | 0.110337 | 0.092854 | 0.102454 | 0.099241 | 0.111889 | 0.108157 | 0.095383 |
| DHC | 0.101562 | 0.096003 | 0.087719 | 0.098352 | 0.090941 | 0.104035 | 0.092243 | 0.08892 | 0.09452 | 0.093022 |


| ticker | financial_strength_equal_rank | growth_potential_equal_rank | market_performance_equal_rank | risk_volatility_equal_rank | liquidity_trading_equal_rank | financial_strength_pca_rank | growth_potential_pca_rank | market_performance_pca_rank | risk_volatility_pca_rank | liquidity_trading_pca_rank |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| COOP | 194.0 | 23.0 | 1243.0 | 492.0 | 1825.0 | 215.0 | 21.0 | 1290.0 | 488.0 | 1969.0 |
| XYL | 1590.0 | 1286.0 | 419.0 | 447.0 | 751.0 | 1711.0 | 1289.0 | 484.0 | 459.0 | 1044.0 |
| MAR | 2233.0 | 207.0 | 398.0 | 1426.0 | 628.0 | 2236.0 | 205.0 | 509.0 | 1492.0 | 770.0 |
| ALSN | 91.0 | 944.0 | 960.0 | 645.0 | 1162.0 | 88.0 | 943.0 | 975.0 | 653.0 | 898.0 |
| DHC | 50.0 | 1926.0 | 1839.0 | 1722.0 | 1250.0 | 75.0 | 1927.0 | 1813.0 | 1583.0 | 1004.0 |


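## 6.4 Building the composite score

The final composite score is a weighted sum of the normalised indicators \(x_q\), using the weights from either scheme:
\[
  CI_{\text{equal}} \;=\; \sum_q w_q^{\text{equal}} \, x_q,
  \qquad
  CI_{\text{pca}} \;=\; \sum_q w_q^{\text{pca}} \, x_q
\]

**Linear aggregation** is my baseline. It is fully compensatory: a weak score on one indicator can be offset one-for-one by a strong score on another.

- I also compute a **geometric** aggregation (a weighted geometric mean), which penalises unbalanced profiles, to see how the two compare.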
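
To illustrate how the two aggregation forms differ, the sketch below scores two hypothetical firms with the same arithmetic mean, one balanced and one lopsided. The helper names `toy_linear` and `toy_geometric` and all the numbers are made up for the example.

```python
import math

def toy_linear(values, weights):
    """Weighted arithmetic mean: low scores are fully offset by high ones."""
    return sum(w * v for v, w in zip(values, weights))

def toy_geometric(values, weights):
    """Weighted geometric mean: low scores drag the index down more than high ones lift it."""
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))

weights = [0.5, 0.5]
balanced = [0.5, 0.5]   # hypothetical firm, average on both indicators
lopsided = [0.9, 0.1]   # hypothetical firm, strong on one and weak on the other

lin_bal, lin_lop = toy_linear(balanced, weights), toy_linear(lopsided, weights)
geo_bal, geo_lop = toy_geometric(balanced, weights), toy_geometric(lopsided, weights)

print("linear:   ", lin_bal, lin_lop)
print("geometric:", geo_bal, geo_lop)
```

Both firms tie under linear aggregation, while the geometric index ranks the balanced firm clearly higher. That is the behaviour to keep in mind when the linear and geometric composites disagree.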

In [11]:
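def linear_index(values, weights):
    # Align on indicator names so a different column order cannot mispair weights.
    return values[weights.index].dot(weights)

def geometric_index(values, weights, eps=1e-9):
    # Weighted geometric mean; eps guards against log(0) when a normalised score is 0.
    return np.exp((weights * np.log(values[weights.index] + eps)).sum())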

ci = pd.DataFrame(index=df.index)

ci["CI_equal_lin"] = df.apply(linear_index, axis=1, weights=w_equal)
ci["CI_pca_lin"]   = df.apply(linear_index, axis=1, weights=w_pca)

ci["CI_equal_geo"] = df.apply(geometric_index, axis=1, weights=w_equal)
ci["CI_pca_geo"]   = df.apply(geometric_index, axis=1, weights=w_pca)

ci.to_parquet(PROC / "ci_scores.parquet")

## 6.5 Robustness check

* **Median |Δ-rank|** between the two indices tells me how much a company typically moves if I switch the weight logic.
* A **Spearman ρ matrix** of the four variants (equal vs PCA, linear vs geometric) reveals whether the overall ordering is stable.

The handbook’s sensitivity-analysis guidance suggests these two quick tests as a minimum. If median |Δ-rank| is small and ρ > 0.9, I can be confident that my index is not held hostage by the exact weighting choice.
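
Both statistics are easy to verify by hand on a small case. The sketch below uses five hypothetical firms (`F1` … `F5`) with made-up scores under two schemes, and computes Spearman ρ as the Pearson correlation of the ranks, which is its definition.

```python
import pandas as pd

# Hypothetical composite scores for five firms under two weighting schemes.
scores = pd.DataFrame(
    {
        "scheme_a": [0.9, 0.7, 0.5, 0.3, 0.1],
        "scheme_b": [0.8, 0.9, 0.4, 0.5, 0.1],
    },
    index=["F1", "F2", "F3", "F4", "F5"],
)

# Rank 1 = best score.
ranks = scores.rank(ascending=False)

# How many places each firm moves when the scheme changes.
abs_shift = (ranks["scheme_a"] - ranks["scheme_b"]).abs()
median_shift = abs_shift.median()

# Spearman rho is the Pearson correlation computed on ranks.
rho = ranks["scheme_a"].corr(ranks["scheme_b"])

print("median |rank shift|:", median_shift)
print("Spearman rho:", round(rho, 3))
```

Here four of the five firms move by one place, so the median shift is 1 and ρ is 0.8. A median shift only means something relative to the number of ranked units, so it is worth reading it alongside the size of the universe.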

In [12]:

rank = ci.rank(ascending=False)
delta = (rank["CI_equal_lin"] - rank["CI_pca_lin"]).abs()
print("Median |Δ-rank|  (Equal vs PCA, linear):",
      delta.median().round(2))

spearman = ci.corr(method="spearman")
display(spearman.round(3))

Median |Δ-rank|  (Equal vs PCA, linear): 123.0


| | CI_equal_lin | CI_pca_lin | CI_equal_geo | CI_pca_geo |
|--------|--------|--------|--------|--------|
| CI_equal_lin | 1.0 | 0.942 | 0.813 | 0.771 |
| CI_pca_lin | 0.942 | 1.0 | 0.789 | 0.803 |
| CI_equal_geo | 0.813 | 0.789 | 1.0 | 0.98 |
| CI_pca_geo | 0.771 | 0.803 | 0.98 | 1.0 |


## 6.6 Summary

* **Assigned two alternative weight schemes**  
  * *Equal-by-group* – every sub-index gets 20 %, and its indicators share that slice equally.  
  * *PCA-variance* – indicator weights ∝ variance explained by their principal components
    (PC 1 only for *Risk & Volatility*, as suggested by the scree test).

* **Built four composite variants**  
  * Linear vs geometric aggregation × Equal vs PCA weights.

* **Performed a first robustness check**  
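  * Median absolute rank shift between the two linear variants = **123 places**: rankings are sensitive to weight selection.  
  * Spearman ρ ≥ 0.77 for every pair of indices. Switching the weighting scheme leaves ρ at 0.94–0.98, while switching between linear and geometric aggregation lowers it to about 0.80–0.81, so the aggregation form matters more than the weights.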