# Custom Metric Playbook
This notebook documents every engineered field and derived table used across the analysis suite.

## How to use this guide
- Run the notebook top to bottom after any refresh of `all_languages_combined.csv`.
- Each section pairs the formula definition with a validation snippet referencing the live dataset.
- Use the narrative notes when communicating why the metric matters.

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", None)

DATA_DIR = Path.cwd().parent / "data" / "processed"
data_path = DATA_DIR / "all_languages_combined.csv"
date_columns = ["created_at", "updated_at", "pushed_at"]
df = pd.read_csv(data_path, parse_dates=date_columns)
df.head(2)

Unnamed: 0,id,name,full_name,owner,description,language,created_at,updated_at,pushed_at,stars,forks,watchers,open_issues,size_kb,license,has_wiki,has_pages,contributors_count,commits_30d,commits_90d,commits_365d,has_readme,has_license,has_contributing,has_code_of_conduct,url,stars_normalized,forks_normalized,watchers_normalized,popularity_score,commits_30d_normalized,contributors_normalized,days_since_push,recency_score,activity_score,health_score,overall_score
0,184456251,PowerToys,microsoft/PowerToys,microsoft,Microsoft PowerToys is a collection of utiliti...,C#,2019-05-01 17:44:02+00:00,2025-10-18 22:06:09+00:00,2025-10-18 21:26:55+00:00,124775,7426,124775,7414,451590,MIT License,True,False,414,100,100,100,True,True,True,True,https://github.com/microsoft/PowerToys,100.0,29.577409,100.0,78.873223,100.0,93.453725,0,100.0,98.036117,100.0,90.86193
1,17620347,aspnetcore,dotnet/aspnetcore,dotnet,ASP.NET Core is a cross-platform .NET framewor...,C#,2014-03-11 06:09:42+00:00,2025-10-18 22:08:05+00:00,2025-10-18 18:20:00+00:00,37271,10464,37271,3476,368328,MIT License,True,False,340,99,100,100,True,True,True,True,https://github.com/dotnet/aspnetcore,29.870567,41.67762,29.870567,33.412683,99.0,76.749436,0,100.0,92.624831,100.0,70.783764


## Engineered fields overview
The table below lists every custom column applied in `05_calculated_fields_and_derived_tables.ipynb`.
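Columns that divide by contributor counts, commit volumes, or repository size replace zero denominators with `NaN`, so the ratio stays undefined instead of becoming `inf`; `issue_to_commit_ratio` adds 1 to the denominator instead.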
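
A minimal sketch of the guard pattern, using a hypothetical three-row frame rather than the live dataset. It contrasts plain division with division after zero denominators are swapped for `NaN`; the `toy` frame and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({"stars": [120, 40, 0], "contributors_count": [4, 0, 0]})

# Unguarded division: pandas yields inf for 40/0 and NaN for 0/0.
unguarded = toy["stars"] / toy["contributors_count"]

# Guarded division: zero denominators become NaN first, so every
# undefined ratio is NaN and is skipped by mean(), median(), and similar.
guarded = toy["stars"] / toy["contributors_count"].replace({0: np.nan})
```

With the guard in place, aggregates such as a per-language median ignore repositories with no recorded contributors rather than being distorted by infinite values.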

In [2]:
feature_docs = pd.DataFrame([
    ("stars_per_contributor", "stars / contributors_count", "Shows how efficiently the community converts contributors into attention; NaN when no contributors."),
    ("forks_per_contributor", "forks / contributors_count", "Highlights replication leverage from each contributor."),
    ("engagement_per_contributor", "(stars + forks + watchers) / contributors_count", "Aggregate engagement normalized by contributor base."),
    ("engagement_density", "(stars + forks + watchers) / size_kb", "Attention relative to repository size to flag compact yet popular codebases."),
    ("recent_commit_share", "commits_30d / commits_365d", "Portion of yearly commits landing in the most recent month; proxies momentum."),
    ("quarter_commit_share", "commits_90d / commits_365d", "Quarterly pulse to balance short-term spikes."),
    ("issue_to_commit_ratio", "open_issues / (commits_365d + 1)", "Open issue load relative to recent throughput; +1 guards division."),
    ("freshness_index", "1 / (1 + days_since_push)", "Simple recency decay where 0 days since push = 1.0."),
    ("support_load", "open_issues / contributors_count", "Signals whether maintainers are spread thin."),
    ("compliance_score", "mean(has_readme, has_license, has_contributing, has_code_of_conduct)", "Policy completeness ranging 0-1."),
    ("enterprise_ready", "all(has_license, has_contributing, has_code_of_conduct)", "Boolean flag for governance readiness."),
    ("maturity_score", "0.40*health + 0.35*activity + 0.25*popularity", "Weighted blend emphasising sustainability over hype."),
    ("growth_signal", "0.5*recent_share + 0.3*quarter_share + 0.2*(recency_score/100)", "Composite momentum index that rewards consistent velocity."),
], columns=["metric", "formula", "narrative"]).set_index("metric")
feature_docs

metric,formula,narrative
stars_per_contributor,stars / contributors_count,Shows how efficiently the community converts c...
forks_per_contributor,forks / contributors_count,Highlights replication leverage from each cont...
engagement_per_contributor,(stars + forks + watchers) / contributors_count,Aggregate engagement normalized by contributor...
engagement_density,(stars + forks + watchers) / size_kb,Attention relative to repository size to flag ...
recent_commit_share,commits_30d / commits_365d,Portion of yearly commits landing in the most ...
quarter_commit_share,commits_90d / commits_365d,Quarterly pulse to balance short-term spikes.
issue_to_commit_ratio,open_issues / (commits_365d + 1),Open issue load relative to recent throughput;...
freshness_index,1 / (1 + days_since_push),Simple recency decay where 0 days since push =...
support_load,open_issues / contributors_count,Signals whether maintainers are spread thin.
compliance_score,"mean(has_readme, has_license, has_contributing...",Policy completeness ranging 0-1.
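
As a worked example of the composite formulas, the sketch below plugs in the values from the PowerToys row of the `df.head(2)` output. It is plain arithmetic for illustration, not the notebook's own computation. Note that all three commit counts in that row are 100, so both commit shares evaluate to 1.0.

```python
# Values copied from the PowerToys row of the df.head(2) output.
health_score, activity_score, popularity_score = 100.0, 98.036117, 78.873223
commits_30d, commits_90d, commits_365d = 100, 100, 100
recency_score, days_since_push = 100.0, 0

# maturity_score: 0.40*health + 0.35*activity + 0.25*popularity
maturity_score = 0.40 * health_score + 0.35 * activity_score + 0.25 * popularity_score

# growth_signal: 0.5*recent_share + 0.3*quarter_share + 0.2*(recency_score/100)
recent_commit_share = commits_30d / commits_365d
quarter_commit_share = commits_90d / commits_365d
growth_signal = (
    0.5 * recent_commit_share
    + 0.3 * quarter_commit_share
    + 0.2 * (recency_score / 100)
)

# freshness_index: 1 / (1 + days_since_push)
freshness_index = 1 / (1 + days_since_push)

print(round(maturity_score, 2), round(growth_signal, 2), freshness_index)
# prints: 94.03 1.0 1.0
```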


## Live validation
This helper cell re-computes the engineered columns and confirms they match what is written back to disk.

In [None]:
def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    enriched = frame.copy()
    contributors = enriched["contributors_count"].replace({0: np.nan})
    commits_365 = enriched["commits_365d"].replace({0: np.nan})
    size_kb = enriched["size_kb"].replace({0: np.nan})
    engagement_sum = enriched["stars"] + enriched["forks"] + enriched["watchers"]

    enriched["stars_per_contributor"] = enriched["stars"] / contributors
    enriched["forks_per_contributor"] = enriched["forks"] / contributors
    enriched["engagement_per_contributor"] = engagement_sum / contributors
    enriched["engagement_density"] = engagement_sum / size_kb
    enriched["recent_commit_share"] = enriched["commits_30d"] / commits_365
    enriched["quarter_commit_share"] = enriched["commits_90d"] / commits_365
    enriched["issue_to_commit_ratio"] = enriched["open_issues"] / (enriched["commits_365d"] + 1)
    enriched["freshness_index"] = 1 / (1 + enriched["days_since_push"])
    enriched["support_load"] = enriched["open_issues"] / contributors
    enriched["compliance_score"] = (
        enriched[["has_readme", "has_license", "has_contributing", "has_code_of_conduct"]]
        .astype(int)
        .mean(axis=1)
    )
    enriched["enterprise_ready"] = (
        enriched["has_license"]
        & enriched["has_contributing"]
        & enriched["has_code_of_conduct"]
    ).astype(bool)
    enriched["maturity_score"] = (
        0.4 * enriched["health_score"]
        + 0.35 * enriched["activity_score"]
        + 0.25 * enriched["popularity_score"]
    )
    enriched["growth_signal"] = (
        enriched["recent_commit_share"].fillna(0) * 0.5
        + enriched["quarter_commit_share"].fillna(0) * 0.3
        + (enriched["recency_score"].fillna(0) / 100) * 0.2
    )
    return enriched
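
One way to confirm that recomputed columns match a stored copy is a column-by-column comparison that treats `NaN` as equal to `NaN`. The sketch below is an assumption about how such a check could look, not the notebook's own validation code; `mismatch_counts` and both toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def mismatch_counts(recomputed: pd.DataFrame, stored: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Count rows per column where the two frames disagree (NaN vs NaN counts as a match)."""
    counts = {}
    for col in columns:
        left = recomputed[col].to_numpy(dtype=float)
        right = stored[col].to_numpy(dtype=float)
        # isclose tolerates floating-point round-off from the CSV round trip.
        counts[col] = int((~np.isclose(left, right, equal_nan=True)).sum())
    return pd.Series(counts, name="mismatches")

# Hypothetical toy frames standing in for the recomputed and on-disk tables.
recomputed = pd.DataFrame({"support_load": [2.5, np.nan], "freshness_index": [1.0, 0.5]})
stored = pd.DataFrame({"support_load": [2.5, np.nan], "freshness_index": [1.0, 0.25]})
report = mismatch_counts(recomputed, stored, ["support_load", "freshness_index"])
```

Here `report` shows zero mismatches for `support_load` and one for `freshness_index`; a non-zero count on the real data would indicate that the file on disk is stale or that a formula has drifted.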

In [None]:
table_docs = pd.DataFrame([
    ("language_summary", "Language-level aggregates of maturity, growth, and compliance", "Supports leadership dashboards comparing ecosystems side-by-side"),
    ("top_growth_repos", "Top 20 repos ranked by growth_signal, tie-broken by maturity_score", "Quick shortlist of momentum projects worth deeper review"),
    ("segment_summary", "Repo counts by growth_segment x compliance_tier", "Two-dimensional segmentation to align investments with governance posture"),
    ("enterprise_table", "Enterprise readiness and support load by language", "Shows where policy coverage is strongest and if maintainer bandwidth is sustainable"),
    ("repositories_enriched", "Base dataset plus every engineered feature", "Feeds downstream notebooks and BI dashboards"),
], columns=["table", "definition", "story angle"]).set_index("table")
table_docs
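
The `top_growth_repos` definition (rank by `growth_signal`, break ties with `maturity_score`) maps onto a two-key sort. A minimal sketch, assuming the enriched frame carries both columns; the helper function and the toy data are hypothetical and only illustrate the ordering.

```python
import pandas as pd

def top_growth_repos(enriched: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Return the n rows with the highest growth_signal, ties broken by maturity_score."""
    return enriched.sort_values(
        ["growth_signal", "maturity_score"], ascending=[False, False]
    ).head(n)

# Hypothetical toy data: repos "a" and "b" tie on growth_signal,
# so the higher maturity_score of "b" places it first.
toy = pd.DataFrame({
    "full_name": ["a", "b", "c"],
    "growth_signal": [0.9, 0.9, 0.4],
    "maturity_score": [60.0, 75.0, 90.0],
})
ranked = top_growth_repos(toy, n=2)
```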

## Spot check
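Here we compute quick benchmarks so communicators can cite concrete numbers in narratives.

In [None]:
# Assumption: `validate` is the enriched frame returned by add_engineered_features(df);
# the cell that originally defined it did not survive in this export.
validate = add_engineered_features(df)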
language_summary = (
    validate.groupby("language")
    .agg({
        "id": "count",
        "maturity_score": "mean",
        "growth_signal": "mean",
        "compliance_score": "mean",
        "support_load": "median",
    })
    .rename(columns={"id": "repo_count"})
    .round({
        "maturity_score": 2,
        "growth_signal": 2,
        "compliance_score": 2,
        "support_load": 1,
    })
)
language_summary.sort_values("maturity_score", ascending=False).head(5)

## Communicating the findings
- Combine the maturity, growth, and compliance leaders to establish a well-rounded "best language" story.
- Use `support_load` to caution stakeholders when a popular ecosystem shows signs of maintainer strain.
- The segmentation table is ideal for prioritising developer relations investments by governance posture.
