# Workers vs Organizations Trustworthiness (SeeClickFix + WebCrowd25K)

**Goal**: Measure trustworthiness on both sides of a crowdsourcing platform, using the same two datasets:
- **Participants / workers**: data quality, honesty, reliability
- **Organizations / platform leadership**: responsiveness, transparency, governance quality

**Important**: We intentionally avoid redoing exploratory plots / cleaning work already done in `data_cleaning_preprocessing.ipynb`.
This notebook focuses on **comparative metrics + composite scoring**.

Datasets:
- SeeClickFix: `SeeClickFix_Public_Service_Requests.csv`
- WebCrowd25K: `webcrowd25k/webcrowd25k/` (`crowd_judgements.csv`, `gold_judgements.txt`, optional `behaviorDataRelease.json`)

Outputs (CSV):
- `outputs/participants_sourceXcategory_scores.csv`
- `outputs/agencies_scores.csv`
- `outputs/webcrowd_worker_scores.csv`
- `outputs/webcrowd_governance_topic_scores.csv`



In [1]:
import os
from pathlib import Path

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

DATA_DIR = Path(".")
SEECLICKFIX_PATH = DATA_DIR / "SeeClickFix_Public_Service_Requests.csv"
WEBCROWD_DIR = DATA_DIR / "webcrowd25k" / "webcrowd25k"
CROWD_JUDGEMENTS_PATH = WEBCROWD_DIR / "crowd_judgements.csv"
GOLD_JUDGEMENTS_PATH = WEBCROWD_DIR / "gold_judgements.txt"

OUTPUT_DIR = DATA_DIR / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("Paths:")
print("- SeeClickFix:", SEECLICKFIX_PATH.resolve())
print("- WebCrowd crowd:", CROWD_JUDGEMENTS_PATH.resolve())
print("- WebCrowd gold:", GOLD_JUDGEMENTS_PATH.resolve())
print("- Outputs:", OUTPUT_DIR.resolve())

assert SEECLICKFIX_PATH.exists(), f"Missing file: {SEECLICKFIX_PATH}"
assert CROWD_JUDGEMENTS_PATH.exists(), f"Missing file: {CROWD_JUDGEMENTS_PATH}"
assert GOLD_JUDGEMENTS_PATH.exists(), f"Missing file: {GOLD_JUDGEMENTS_PATH}"


Paths:
- SeeClickFix: D:\Noor_work\Master\UAEU\RA\sensing\SeeClickFix_Public_Service_Requests.csv
- WebCrowd crowd: D:\Noor_work\Master\UAEU\RA\sensing\webcrowd25k\webcrowd25k\crowd_judgements.csv
- WebCrowd gold: D:\Noor_work\Master\UAEU\RA\sensing\webcrowd25k\webcrowd25k\gold_judgements.txt
- Outputs: D:\Noor_work\Master\UAEU\RA\sensing\outputs


## Shared scoring utilities

We compute a composite trust score using a **weighted sum** over normalized metrics.

For any entity \(e\) (worker, agency, participant-group, topic governance unit):

\[
T(e)=\sum_{k=1}^{K} w_k \,\tilde{m}_k(e)
\quad\text{s.t.}\quad
w_k \ge 0,\ \sum_{k=1}^{K} w_k = 1
\]

- \(\tilde{m}_k(e)\) are **0–1 normalized** metric values where **higher = better**.
- For “lower is better” metrics (e.g., response time, missingness), we **invert** after scaling.
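
As a worked illustration of the formula, the sketch below scores one hypothetical entity in plain Python. The metric names, values, and weights are invented for the example and do not come from either dataset; the values are assumed to be already scaled to [0, 1].

```python
# Hypothetical, already-scaled metric values for a single entity.
scaled = {"ack_rate": 0.8, "ack_time": 0.25, "missing_pct": 0.1}
direction = {"ack_rate": "high", "ack_time": "low", "missing_pct": "low"}
raw_weights = {"ack_rate": 2.0, "ack_time": 1.0, "missing_pct": 1.0}


def composite_score(scaled, direction, raw_weights):
    """Weighted sum T(e) with weights rescaled to sum to 1."""
    total = sum(raw_weights.values())
    score = 0.0
    for name, value in scaled.items():
        # Invert "lower is better" metrics so that higher always means better.
        m = value if direction[name] == "high" else 1.0 - value
        score += (raw_weights[name] / total) * m
    return score


score = composite_score(scaled, direction, raw_weights)
print(round(score, 4))  # 0.8125
```

With raw weights 2:1:1 the normalized weights are 0.5, 0.25, and 0.25, and the two "low" metrics enter as 1 - 0.25 and 1 - 0.1, so the score is 0.5 × 0.8 + 0.25 × 0.75 + 0.25 × 0.9 = 0.8125.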



In [2]:
def robust_minmax_01(s: pd.Series, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.Series:
    """Robustly scale a numeric Series to [0,1] using quantile clipping.

    - Values below Q_lower are clipped to Q_lower, above Q_upper clipped to Q_upper.
    - If the clipped range collapses, returns 0.5 for non-null values.
    """
    s = pd.to_numeric(s, errors="coerce")
    if s.notna().sum() == 0:
        return pd.Series([np.nan] * len(s), index=s.index)

    lo = s.quantile(lower_q)
    hi = s.quantile(upper_q)
    s2 = s.clip(lower=lo, upper=hi)

    denom = (hi - lo)
    if pd.isna(denom) or denom == 0:
        out = pd.Series([np.nan] * len(s2), index=s2.index)
        out.loc[s2.notna()] = 0.5
        return out

    return (s2 - lo) / denom


def normalize_metrics_01(df: pd.DataFrame, metric_directions: dict, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.DataFrame:
    """Return a DataFrame of normalized metrics in [0,1].

    metric_directions: {metric_name: 'high'|'low'} meaning whether higher is better.
    """
    out = pd.DataFrame(index=df.index)
    for m, direction in metric_directions.items():
        if m not in df.columns:
            raise KeyError(f"Missing metric column: {m}")
        x = robust_minmax_01(df[m], lower_q=lower_q, upper_q=upper_q)
        if direction not in ("high", "low"):
            raise ValueError(f"direction must be 'high' or 'low' for metric '{m}'")
        out[m] = x if direction == "high" else (1 - x)
    return out


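def weighted_sum_score(norm_df: pd.DataFrame, weights: dict) -> pd.Series:
    """Compute weighted sum score from normalized metrics.

    weights: {metric_name: weight}. Must cover all columns in norm_df; weights are
    rescaled to sum to 1.

    Note: NaN metric values are skipped by the row sum, so they contribute 0 to the
    score while their weight stays in the denominator. Entities with missing metrics
    therefore score lower than they would on their available metrics alone.
    """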
    w = pd.Series(weights, dtype=float)
    missing = set(norm_df.columns) - set(w.index)
    if missing:
        raise KeyError(f"Weights missing for metrics: {sorted(missing)}")

    w = w[norm_df.columns]
    if (w < 0).any():
        raise ValueError("Weights must be non-negative")
    if w.sum() == 0:
        raise ValueError("Sum of weights must be > 0")
    w = w / w.sum()

    return norm_df.mul(w, axis=1).sum(axis=1)


def rank_with_min_n(df: pd.DataFrame, n_col: str, min_n: int = 30) -> pd.DataFrame:
    """Filter entities by a minimum evidence size (e.g., #reports, #judgments)."""
    if n_col not in df.columns:
        raise KeyError(f"Missing count column: {n_col}")
    return df[df[n_col] >= min_n].copy()
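
To see why the scaling clips at the 5th and 95th percentiles, the following dependency-free sketch compares plain min-max scaling with a quantile-clipped version on made-up values containing one outlier. The helper names (`quantile`, `robust_scale`, `plain_scale`) are illustrative and not part of the notebook; `quantile` uses linear interpolation, which is also the default in pandas.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def robust_scale(values, lower_q=0.05, upper_q=0.95):
    """Clip to [Q_lower, Q_upper], then scale to [0, 1]."""
    lo, hi = quantile(values, lower_q), quantile(values, upper_q)
    if hi == lo:
        return [0.5] * len(values)  # collapsed range: neutral value
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


def plain_scale(values):
    """Ordinary min-max scaling, shown only for comparison."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical response times in hours: 20 typical values plus one extreme outlier.
hours = list(range(1, 21)) + [1000]
plain = plain_scale(hours)
robust = robust_scale(hours)

print(round(plain[9], 4), round(robust[9], 4))  # scaled value for hours == 10
```

Under plain min-max the outlier squeezes every typical value close to 0, whereas the clipped version keeps the typical values spread over the unit interval and maps the outlier to 1.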


## SeeClickFix: Participants (source × category) trustworthiness

We treat **participants** as *reporter groups* defined by `source × category` (because the dataset does not include a stable user id).

We compute report-quality proxies and aggregate them by group.

- We will **rank only groups with at least 30 reports** (`n_reports ≥ 30`).
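
One of the report-quality proxies is the share of a group's reports that carry its most common summary text. The sketch below shows the idea in plain Python on invented summaries; unlike the notebook's `top_value_share`, which receives summaries that are already normalized, this version lowercases and strips inside the function.

```python
from collections import Counter


def top_value_share(values):
    """Share of non-empty values equal to the most frequent one (0.0 if none)."""
    cleaned = [v.strip().lower() for v in values if v is not None and v.strip()]
    if not cleaned:
        return 0.0
    most_common_count = Counter(cleaned).most_common(1)[0][1]
    return most_common_count / len(cleaned)


# Hypothetical summaries for one source x category group.
summaries = ["Pothole", "pothole ", "Dead Animal", ""]
share = top_value_share(summaries)
print(round(share, 3))
```

Here two of the three non-empty summaries normalize to the same text, so the share is 2/3. A share near 1 means almost every report in the group uses identical wording.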



In [3]:
# Load SeeClickFix (minimal: no EDA here)
# Note: We mirror the *minimal cleaning* from data_cleaning_preprocessing.ipynb without redoing its analysis.

df_scfx_raw = pd.read_csv(SEECLICKFIX_PATH, low_memory=False)
print("Loaded SeeClickFix:", df_scfx_raw.shape)

# Minimal cleaning consistent with earlier notebook.
# Lowercase the column names first so the drop list matches regardless of original casing.
df_scfx = df_scfx_raw.copy()
df_scfx.columns = [c.strip().lower() for c in df_scfx.columns]

columns_to_drop = [c for c in ['reopened_at', 'image_url', 'image_square_url'] if c in df_scfx.columns]
df_scfx = df_scfx.drop(columns=columns_to_drop)

# Parse datetimes if present
for c in ['created_at', 'acknowledged_at', 'closed_at']:
    if c in df_scfx.columns:
        df_scfx[c] = pd.to_datetime(df_scfx[c], errors='coerce')

# Cast key categorical columns to string dtype when present
for c in ['source', 'category', 'agency']:
    if c in df_scfx.columns:
        df_scfx[c] = df_scfx[c].astype('string')

# Participant group id: source × category
scfx_source = df_scfx.get('source', pd.Series([pd.NA]*len(df_scfx), index=df_scfx.index)).fillna('Unknown').astype(str).str.strip()
scfx_cat = df_scfx.get('category', pd.Series([pd.NA]*len(df_scfx), index=df_scfx.index)).fillna('Unknown').astype(str).str.strip()
df_scfx['participant_group'] = (scfx_source + ' × ' + scfx_cat).astype('string')

# Per-report quality proxies
for txt_col in ['summary', 'description']:
    if txt_col in df_scfx.columns:
        s = df_scfx[txt_col].fillna('').astype(str)
        s_clean = s.str.strip()
        df_scfx[f'{txt_col}_missing'] = (s_clean == '')
        df_scfx[f'{txt_col}_len'] = s_clean.str.len()
    else:
        df_scfx[f'{txt_col}_missing'] = True
        df_scfx[f'{txt_col}_len'] = np.nan

df_scfx['text_len_total'] = pd.to_numeric(df_scfx['summary_len'], errors='coerce').fillna(0) + pd.to_numeric(df_scfx['description_len'], errors='coerce').fillna(0)

# Duplicate proxy (exact duplicate on normalized summary)
summary_norm = df_scfx['summary'].fillna('').astype(str).str.strip().str.lower()
df_scfx['summary_norm'] = summary_norm

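# Comments: treat missing values as 0. If the column is absent, create it as NaN so
# later aggregations do not silently fall back to an unrelated column.
if 'comments_count' in df_scfx.columns:
    df_scfx['comments_count'] = pd.to_numeric(df_scfx['comments_count'], errors='coerce').fillna(0)
else:
    df_scfx['comments_count'] = np.nan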

# Aggregate by participant group

def top_value_share(s: pd.Series) -> float:
    s = s.dropna().astype(str).str.strip()
    s = s[s != '']
    if len(s) == 0:
        return 0.0
    return float(s.value_counts(normalize=True).iloc[0])

participant_grp = (
    df_scfx.groupby('participant_group')
    .agg(
        n_reports=('participant_group', 'size'),
        description_missing_pct=('description_missing', 'mean'),
        summary_missing_pct=('summary_missing', 'mean'),
        description_len_median=('description_len', 'median'),
        summary_len_median=('summary_len', 'median'),
        text_len_total_median=('text_len_total', 'median'),
        comments_median=('comments_count', 'median') if 'comments_count' in df_scfx.columns else ('text_len_total', 'median'),
        top_summary_share=('summary_norm', top_value_share),
    )
    .reset_index()
)

# Convert proportions to % for readability
for c in ['description_missing_pct', 'summary_missing_pct', 'top_summary_share']:
    participant_grp[c] = (participant_grp[c].astype(float) * 100).round(2)

print("Participant groups:", participant_grp.shape)
participant_grp.sort_values('n_reports', ascending=False).head(10)


Loaded SeeClickFix: (7121, 20)
Participant groups: (347, 9)


index,participant_group,n_reports,description_missing_pct,summary_missing_pct,description_len_median,summary_len_median,text_len_total_median,comments_median,top_summary_share
158,Request Form × Dead Animal,340,0.29,0.0,73.5,11.0,84.5,2.0,100.0
86,Other × Solid Waste - Yard Waste Dumpster,318,0.0,0.0,50.0,31.0,81.0,2.0,100.0
341,iPhone × Traffic Sign Down,256,2.34,0.0,69.0,17.0,86.0,3.0,97.27
100,Other × Traffic Signal,243,1.23,0.0,202.0,14.0,216.0,3.0,87.24
344,"iPhone × Unmaintained Vegetation, Right of Way",225,3.56,0.0,74.0,37.0,111.0,3.0,94.22
44,Other × Dead Animal,212,6.13,0.0,64.0,11.0,75.0,2.0,100.0
76,Other × Pothole,180,6.11,0.0,100.5,7.0,107.5,3.0,98.89
317,iPhone × Sidewalk Repair,175,8.0,0.0,56.0,15.0,71.0,4.0,92.0
316,iPhone × Roll Cart Left at Street,146,22.6,0.0,62.0,24.0,86.0,2.0,100.0
312,iPhone × Park Maintenance,139,0.0,0.0,89.0,16.0,105.0,3.0,97.84


In [4]:
# Participant trust score (source × category)
MIN_REPORTS = 30

# Metrics to include in the composite (higher=better after direction adjustment)
participant_metric_directions = {
    # completeness (lower missing is better)
    'description_missing_pct': 'low',
    'summary_missing_pct': 'low',
    # richness (higher is better)
    'description_len_median': 'high',
    'summary_len_median': 'high',
    'text_len_total_median': 'high',
    # duplication/noise proxy (lower top1 share is better)
    'top_summary_share': 'low',
    # engagement proxy (higher can indicate more interaction/clarification)
    'comments_median': 'high',
}

# Equal weights baseline
participant_weights = {m: 1.0 for m in participant_metric_directions}

pg = participant_grp.copy()
pg_rankable = rank_with_min_n(pg, n_col='n_reports', min_n=MIN_REPORTS)

pg_norm = normalize_metrics_01(pg_rankable, participant_metric_directions)
pg_rankable['participant_trust_score'] = weighted_sum_score(pg_norm, participant_weights)

pg_rankable = pg_rankable.sort_values('participant_trust_score', ascending=False)

print(f"Rankable participant groups (n_reports >= {MIN_REPORTS}): {len(pg_rankable)} / {len(pg)}")
pg_rankable.head(15)


Rankable participant groups (n_reports >= 30): 60 / 347


index,participant_group,n_reports,description_missing_pct,summary_missing_pct,description_len_median,summary_len_median,text_len_total_median,comments_median,top_summary_share,participant_trust_score
144,Portal × Traffic Safety/Miscellaneous,78,1.28,0.0,235.5,28.0,263.5,4.0,91.03,0.732391
98,Other × Traffic Safety/Miscellaneous,98,0.0,0.0,269.5,28.0,297.5,3.0,90.82,0.704363
340,iPhone × Traffic Safety/Miscellaneous,115,0.0,0.0,130.0,28.0,161.0,4.0,83.48,0.67956
101,"Other × Unmaintained Vegetation, Right of Way",93,1.08,0.0,138.0,37.0,174.0,3.0,88.17,0.647801
100,Other × Traffic Signal,243,1.23,0.0,202.0,14.0,216.0,3.0,87.24,0.630287
37,Android × Traffic Signal,46,0.0,0.0,151.0,14.0,165.0,3.0,78.26,0.618502
148,"Portal × Unmaintained Vegetation, Right of Way",114,2.63,0.0,140.5,37.0,177.5,3.0,90.35,0.617158
35,Android × Traffic Safety/Miscellaneous,48,0.0,0.0,93.5,28.0,120.5,4.0,83.33,0.616456
146,Portal × Traffic Signal,104,4.81,0.0,282.5,14.0,300.0,3.0,84.62,0.607281
129,Portal × Sidewalk Repair,52,1.92,0.0,166.5,15.0,181.5,3.0,80.77,0.606927


In [5]:
# Save participant results
participants_out = pg_rankable.copy()
participants_out.to_csv(OUTPUT_DIR / 'participants_sourceXcategory_scores.csv', index=False)
print("Wrote:", (OUTPUT_DIR / 'participants_sourceXcategory_scores.csv'))


Wrote: outputs\participants_sourceXcategory_scores.csv


## SeeClickFix: Organization (agency) trustworthiness

We treat `agency` as the organization unit.

Key dimensions:
- **Responsiveness**: acknowledgment rate + time-to-acknowledge
- **Follow-through**: closure rate + time-to-close (when available)
- **Stability / governance consistency**: variability of responsiveness over time
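
The three dimensions can be illustrated on a handful of invented issues. The sketch below uses only the standard library and hypothetical timestamps (not real SeeClickFix rows) to compute an acknowledgment rate, a median time-to-acknowledge, and the standard deviation of the monthly acknowledgment rate as a stability measure.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median, stdev

# Hypothetical issues for one agency: (created_at, acknowledged_at or None).
issues = [
    (datetime(2020, 1, 5, 8), datetime(2020, 1, 5, 10)),
    (datetime(2020, 1, 20, 9), None),
    (datetime(2020, 2, 1, 12), datetime(2020, 2, 2, 12)),
    (datetime(2020, 2, 15, 0), datetime(2020, 2, 15, 6)),
]

# Responsiveness: share acknowledged and median hours until acknowledgment.
acked = [(c, a) for c, a in issues if a is not None]
ack_rate = len(acked) / len(issues)
ack_hours = [(a - c).total_seconds() / 3600.0 for c, a in acked]
ack_time_median = median(ack_hours)

# Stability: sample standard deviation of the per-month acknowledgment rate.
by_month = defaultdict(list)
for created, acknowledged in issues:
    by_month[(created.year, created.month)].append(acknowledged is not None)
monthly_rates = [sum(flags) / len(flags) for flags in by_month.values()]
ack_rate_std = stdev(monthly_rates) if len(monthly_rates) > 1 else None

print(ack_rate, ack_time_median, ack_rate_std)
```

The sample standard deviation needs at least two months of data, so an agency that was active in a single month has no stability value. pandas' `std` uses the same sample definition by default and returns NaN in that case.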



In [6]:
# Agency-level metrics (minimal; extends prior notebook logic)

if 'agency' not in df_scfx.columns:
    raise KeyError("SeeClickFix is missing 'agency' column")

# Base flags + durations
sc = df_scfx.copy()

# Acknowledgment metrics
if 'acknowledged_at' in sc.columns and 'created_at' in sc.columns:
    sc['was_acknowledged'] = sc['acknowledged_at'].notna()
    sc['time_to_ack_hours'] = (sc['acknowledged_at'] - sc['created_at']).dt.total_seconds() / 3600.0
else:
    sc['was_acknowledged'] = False
    sc['time_to_ack_hours'] = np.nan

# Closure metrics (optional)
if 'closed_at' in sc.columns and 'created_at' in sc.columns:
    sc['was_closed'] = sc['closed_at'].notna()
    sc['time_to_close_hours'] = (sc['closed_at'] - sc['created_at']).dt.total_seconds() / 3600.0
else:
    sc['was_closed'] = False
    sc['time_to_close_hours'] = np.nan

# Time bucket for stability metrics
if 'created_at' in sc.columns:
    sc['created_month'] = sc['created_at'].dt.to_period('M').astype('string')
else:
    sc['created_month'] = pd.NA

# Aggregate main agency table
agency = (
    sc.groupby('agency')
    .agg(
        n_issues=('agency', 'size'),
        ack_count=('was_acknowledged', 'sum'),
        ack_rate=('was_acknowledged', 'mean'),
        ack_time_mean=('time_to_ack_hours', 'mean'),
        ack_time_median=('time_to_ack_hours', 'median'),
        close_count=('was_closed', 'sum'),
        close_rate=('was_closed', 'mean'),
        close_time_mean=('time_to_close_hours', 'mean'),
        close_time_median=('time_to_close_hours', 'median'),
        comments_mean=('comments_count', 'mean') if 'comments_count' in sc.columns else ('time_to_ack_hours', 'mean'),
    )
    .reset_index()
)

# Convert rates to % for readability
for c in ['ack_rate', 'close_rate']:
    agency[c] = (agency[c].astype(float) * 100).round(2)

# Stability metrics: month-level ack_rate and ack_time_median, then std across months
monthly = (
    sc.dropna(subset=['agency'])
    .groupby(['agency', 'created_month'])
    .agg(
        n_month_issues=('agency', 'size'),
        month_ack_rate=('was_acknowledged', 'mean'),
        month_ack_time_median=('time_to_ack_hours', 'median'),
    )
    .reset_index()
)

stability = (
    monthly.groupby('agency')
    .agg(
        n_months=('created_month', 'nunique'),
        ack_rate_std=('month_ack_rate', 'std'),
        ack_time_median_std=('month_ack_time_median', 'std'),
    )
    .reset_index()
)

agency = agency.merge(stability, on='agency', how='left')

print("Agencies:", agency.shape)
agency.sort_values('n_issues', ascending=False).head(10)


Agencies: (9, 14)


index,agency,n_issues,ack_count,ack_rate,ack_time_mean,ack_time_median,close_count,close_rate,close_time_mean,close_time_median,comments_mean,n_months,ack_rate_std,ack_time_median_std
1,"Chapel Hill, NC",6157,1766,28.68,139.775931,18.555972,5866,95.27,496.519361,93.189167,3.333117,96,0.180321,959.858134
7,Traffic Signals - Chapel Hill & Carrboro,569,348,61.16,58.897415,4.486944,569,100.0,728.508244,49.469167,3.843585,81,0.272632,1233.886946
4,Hurricane Florence CH Response Team,126,23,18.25,26.184758,2.635278,125,99.21,779.736173,10.219722,1.801587,4,0.108254,150.96386
6,Hurricane Michael CH Response Team,111,2,1.8,3.275,3.275,110,99.1,375.194909,90.212222,1.756757,3,0.011321,
8,Winter Storm Diego CH Response Team,65,25,38.46,8.689067,0.9975,65,100.0,1201.299521,22.1025,1.892308,2,0.280598,
0,Adverse Event CH Response,42,12,28.57,28.575347,0.682222,42,100.0,38.250886,16.691389,2.404762,6,0.192916,83.192176
3,Hurricane Dorian CH Response Team,24,10,41.67,1.859722,0.195694,24,100.0,37.719792,0.358194,2.333333,1,,
5,Hurricane Isaias CH Response Team,19,6,31.58,18.452593,0.771667,19,100.0,434.239708,0.208611,2.0,1,,
2,Hazardous Weather Event Response,8,0,0.0,,,5,62.5,184.964333,168.070833,1.625,1,,


In [7]:
# Agency trust score
MIN_ISSUES_AGENCY = 10

# Closure metrics are scored alongside acknowledgment metrics; equal weights keep the baseline simple and easy to explain.
agency_metric_directions = {
    'ack_rate': 'high',
    'ack_time_median': 'low',
    'close_rate': 'high',
    'close_time_median': 'low',
    # stability (lower std = more consistent governance)
    'ack_rate_std': 'low',
    'ack_time_median_std': 'low',
    # transparency/engagement proxy
    'comments_mean': 'high',
}

agency_weights = {m: 1.0 for m in agency_metric_directions}

ag = agency.copy()
ag_rankable = rank_with_min_n(ag, n_col='n_issues', min_n=MIN_ISSUES_AGENCY)

# Note: ack_rate/close_rate are % (0..100), std uses 0..1 for month rates; normalization handles this.
ag_norm = normalize_metrics_01(ag_rankable, agency_metric_directions)
ag_rankable['agency_trust_score'] = weighted_sum_score(ag_norm, agency_weights)

ag_rankable = ag_rankable.sort_values('agency_trust_score', ascending=False)

print(f"Rankable agencies (n_issues >= {MIN_ISSUES_AGENCY}): {len(ag_rankable)} / {len(ag)}")
ag_rankable.head(15)


Rankable agencies (n_issues >= 10): 8 / 9


index,agency,n_issues,ack_count,ack_rate,ack_time_mean,ack_time_median,close_count,close_rate,close_time_mean,close_time_median,comments_mean,n_months,ack_rate_std,ack_time_median_std,agency_trust_score
0,Adverse Event CH Response,42,12,28.57,28.575347,0.682222,42,100.0,38.250886,16.691389,2.404762,6,0.192916,83.192176,0.704742
4,Hurricane Florence CH Response Team,126,23,18.25,26.184758,2.635278,125,99.21,779.736173,10.219722,1.801587,4,0.108254,150.96386,0.625705
7,Traffic Signals - Chapel Hill & Carrboro,569,348,61.16,58.897415,4.486944,569,100.0,728.508244,49.469167,3.843585,81,0.272632,1233.886946,0.596915
3,Hurricane Dorian CH Response Team,24,10,41.67,1.859722,0.195694,24,100.0,37.719792,0.358194,2.333333,1,,,0.574931
5,Hurricane Isaias CH Response Team,19,6,31.58,18.452593,0.771667,19,100.0,434.239708,0.208611,2.0,1,,,0.514738
8,Winter Storm Diego CH Response Team,65,25,38.46,8.689067,0.9975,65,100.0,1201.299521,22.1025,1.892308,2,0.280598,,0.491229
6,Hurricane Michael CH Response Team,111,2,1.8,3.275,3.275,110,99.1,375.194909,90.212222,1.756757,3,0.011321,,0.362321
1,"Chapel Hill, NC",6157,1766,28.68,139.775931,18.555972,5866,95.27,496.519361,93.189167,3.333117,96,0.180321,959.858134,0.270347
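
Several agencies in the table above have no stability values (NaN). Because `weighted_sum_score` sums each row with NaN skipped, a missing metric contributes 0 while its weight still counts toward the total. The sketch below contrasts that behavior with a renormalizing alternative on invented values; the alternative is an assumption shown for comparison, not what the notebook computes.

```python
def weighted_score(metrics, weights, renormalize=False):
    """Weighted mean of metrics in [0, 1]; None marks a missing metric.

    renormalize=False: missing metrics count as 0 (mirrors a NaN-skipping sum).
    renormalize=True:  weights are rescaled over the available metrics only.
    """
    available = {k: v for k, v in metrics.items() if v is not None}
    if not available:
        return None
    names = available if renormalize else metrics
    denom = sum(weights[k] for k in names)
    return sum(weights[k] * v for k, v in available.items()) / denom


# Hypothetical normalized metrics for one agency with a missing stability metric.
metrics = {"ack_rate": 0.8, "close_rate": 1.0, "ack_rate_std": None}
weights = {"ack_rate": 1.0, "close_rate": 1.0, "ack_rate_std": 1.0}

zero_filled = weighted_score(metrics, weights)
renormalized = weighted_score(metrics, weights, renormalize=True)
print(round(zero_filled, 3), round(renormalized, 3))
```

In this toy case the zero-filled score is 0.6 and the renormalized score is 0.9. Which treatment is appropriate depends on whether a missing metric should be read as missing evidence or as poor performance.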


In [8]:
# Save agency results
agencies_out = ag_rankable.copy()
agencies_out.to_csv(OUTPUT_DIR / 'agencies_scores.csv', index=False)
print("Wrote:", (OUTPUT_DIR / 'agencies_scores.csv'))


Wrote: outputs\agencies_scores.csv


## WebCrowd25K: Workers and platform governance

We compute:
- **Worker trustworthiness** at `wid` (accuracy vs gold, peer agreement, effort proxies, stability).
- **Platform governance trust (proxy)** at `tid` (topic-level): how well aggregated outcomes align with gold, ambiguity/tie rate, and non-response rate.
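
Peer agreement and the tie rate both rest on a per-item majority vote. The sketch below shows one way to compute it in plain Python on invented labels. On ties it follows the rule used later in this notebook (the median of the tied labels, truncated to an integer) and flags the item as tied.

```python
from collections import Counter
from statistics import median


def majority_with_tie(labels):
    """Return (majority_label, is_tie) for the labels given to one item."""
    counts = Counter(labels)
    if not counts:
        return None, True  # no usable labels: treat as unresolved
    top = max(counts.values())
    modes = sorted(label for label, c in counts.items() if c == top)
    if len(modes) == 1:
        return modes[0], False
    # Several labels share the top count: placeholder label plus a tie flag.
    return int(median(modes)), True


print(majority_with_tie([2, 2, 1]))  # (2, False)
print(majority_with_tie([0, 3]))     # (1, True)
```

The second call shows a side effect of the tie rule: the placeholder label (1) was not chosen by any worker. That is why tied items are excluded from majority agreement and counted separately in the tie rate.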



In [9]:
# Load WebCrowd data (minimal)

# Crowd judgments
wc_raw = pd.read_csv(CROWD_JUDGEMENTS_PATH, low_memory=False)
print("Loaded WebCrowd crowd:", wc_raw.shape)

# Gold judgments (space-delimited with an unused column)
df_gold = pd.read_csv(
    GOLD_JUDGEMENTS_PATH,
    sep=r"\s+",
    header=None,
    names=['topic_id', 'unused_col', 'document_id', 'gold_judgement'],
    engine='python'
)
df_gold = df_gold.drop(columns=['unused_col'])
print("Loaded WebCrowd gold:", df_gold.shape)

# Clean crowd: drop non-response label = -1
wc = wc_raw.copy()
wc['label'] = pd.to_numeric(wc['label'], errors='coerce')
nonresponse_mask = (wc['label'] == -1)

# topic/document ids
wc['tid'] = pd.to_numeric(wc['tid'], errors='coerce').astype('Int64')
wc['did'] = wc['did'].astype('string')

# duration
wc['duration'] = pd.to_numeric(wc.get('duration', pd.Series([pd.NA]*len(wc))), errors='coerce')

# filter
wc_clean = wc[~nonresponse_mask].copy()
wc_clean['label'] = wc_clean['label'].astype('Int64')

# Join with gold
wg = df_gold.rename(columns={'topic_id': 'tid', 'document_id': 'did'})
df_joined = wc_clean.merge(wg[['tid', 'did', 'gold_judgement']], on=['tid', 'did'], how='left')

print("After cleaning:")
print("- Non-response rate (%):", (nonresponse_mask.mean() * 100).round(2))
print("- Clean crowd:", wc_clean.shape)
print("- Joined:", df_joined.shape)
print("- Gold coverage (%):", (df_joined['gold_judgement'].notna().mean() * 100).round(2))

df_joined.head(3)


Loaded WebCrowd crowd: (25119, 12)
Loaded WebCrowd gold: (14432, 3)
After cleaning:
- Non-response rate (%): 2.0
- Clean crowd: (24617, 12)
- Joined: (24617, 13)
- Gold coverage (%): 99.95


index,wid,feedback,url,mapping,label,start,tid,design,rationale,duration,did,ID,gold_judgement
0,wid#0,{},http://ir.ischool.utexas.edu/relevance/clueweb...,rlt3AvBHSs,1,Mon Apr 24 10:16:08 PDT 2017,267,NIST,Jaci Velasquez lyrics,298.0,clueweb12-0401wb-51-01278,0,1.0
1,wid#1,good,http://ir.ischool.utexas.edu/relevance/clueweb...,4qaUWlHcqH,0,Mon Apr 24 09:57:44 PDT 2017,267,NIST,Jaci Velasquez lyrics\n\nSort by album · Sor...,54.0,clueweb12-0401wb-51-01278,1,1.0
2,wid#2,{},http://ir.ischool.utexas.edu/relevance/clueweb...,rn12FoCKPi,0,Mon Apr 24 10:11:02 PDT 2017,267,NIST,He's My Savior lyrics,58.0,clueweb12-0401wb-51-01278,2,1.0


In [10]:
# Worker trustworthiness metrics (consistent with prior notebook definitions)

required_cols = {'wid', 'tid', 'did', 'label', 'gold_judgement'}
missing = required_cols - set(df_joined.columns)
if missing:
    raise ValueError(f"df_joined missing required columns: {missing}")

# Map gold (-2..4) to crowd ordinal (0..3)
def map_gold_to_crowd_m1(g):
    mapping = {-2: 0, 0: 0, 1: 1, 2: 2, 3: 3, 4: 3}
    return mapping.get(int(g), pd.NA) if pd.notna(g) else pd.NA

w = df_joined.copy()
w['label'] = w['label'].astype('Int64')
w['gold_judgement'] = pd.to_numeric(w['gold_judgement'], errors='coerce')
w['gold_m1'] = w['gold_judgement'].apply(map_gold_to_crowd_m1).astype('Int64')

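# Majority vote per (tid,did) + ties
def majority_and_tie(x: pd.Series):
    """Return the modal label for one item and whether the vote was tied.

    When several labels share the top count, the item is flagged as a tie and the
    median of the tied labels (truncated to int) is stored as a placeholder majority.
    Tied items are excluded from `maj_agree` below.
    """
    m = x.mode(dropna=True)
    if len(m) == 0:
        return pd.Series({'majority_label': pd.NA, 'is_tie': True})
    if len(m) == 1:
        return pd.Series({'majority_label': int(m.iloc[0]), 'is_tie': False})
    return pd.Series({'majority_label': int(m.median()), 'is_tie': True})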

maj = w.groupby(['tid', 'did'])['label'].apply(majority_and_tie).unstack()
item_stats = (
    w.groupby(['tid', 'did'])
     .agg(n_judgments=('label', 'size'))
     .join(maj)
     .reset_index()
)
w = w.merge(item_stats, on=['tid', 'did'], how='left')

# Row-level metrics
w['gold_available'] = w['gold_m1'].notna()
w['acc_exact_m1'] = (w['label'] == w['gold_m1']).where(w['gold_available'], pd.NA)
w['abs_err_m1'] = (w['label'] - w['gold_m1']).abs().where(w['gold_available'], pd.NA)
w['within1_m1'] = (w['abs_err_m1'] <= 1).where(w['gold_available'], pd.NA)

non_tie = (w['is_tie'] == False)
w['maj_agree'] = (w['label'] == w['majority_label']).where(non_tie, pd.NA)

# Behavior proxies
w['duration'] = pd.to_numeric(w.get('duration', pd.Series([pd.NA] * len(w))), errors='coerce')
w['is_very_fast'] = (w['duration'] < 10).where(w['duration'].notna(), False)

w['rationale'] = w.get('rationale', pd.Series([pd.NA] * len(w)))
w['rationale_clean'] = w['rationale'].fillna('').astype(str).str.strip()
w['rationale_empty'] = (w['rationale_clean'] == '')

# Stability: bin accuracy into 3 sequential bins per worker
w['start_dt'] = pd.to_datetime(w.get('start', pd.Series([pd.NA] * len(w))), errors='coerce')
w['_order_key'] = w['start_dt']
if w['_order_key'].isna().all():
    w['_order_key'] = np.arange(len(w))

w = w.sort_values(['wid', '_order_key'])

def assign_bins(group: pd.DataFrame) -> pd.Series:
    n = len(group)
    if n < 3:
        return pd.Series([0] * n, index=group.index)
    ranks = np.arange(n)
    bins = pd.qcut(ranks, q=3, labels=False)
    return pd.Series(bins, index=group.index)

w['time_bin'] = w.groupby('wid', group_keys=False).apply(assign_bins).astype('Int64')

bin_acc = (
    w[w['gold_available']]
    .groupby(['wid', 'time_bin'])['acc_exact_m1']
    .mean()
    .reset_index()
    .rename(columns={'acc_exact_m1': 'bin_acc_exact_m1'})
)

stability = (
    bin_acc.groupby('wid')['bin_acc_exact_m1']
    .agg(stability_std='std', stability_min='min')
    .reset_index()
)

worker_metrics = (
    w.groupby('wid')
    .agg(
        total_judgments=('label', 'size'),
        gold_coverage=('gold_available', 'mean'),
        acc_exact_m1=('acc_exact_m1', 'mean'),
        within1_m1=('within1_m1', 'mean'),
        mae_m1=('abs_err_m1', 'mean'),
        majority_agreement=('maj_agree', 'mean'),
        duration_median=('duration', 'median'),
        very_fast_pct=('is_very_fast', 'mean'),
        rationale_empty_pct=('rationale_empty', 'mean'),
    )
    .reset_index()
)

worker_metrics = worker_metrics.merge(stability, on='wid', how='left')

# Make key proportions %
for c in ['gold_coverage', 'acc_exact_m1', 'within1_m1', 'majority_agreement', 'very_fast_pct', 'rationale_empty_pct']:
    worker_metrics[c] = (worker_metrics[c].astype(float) * 100).round(2)

print("Workers:", worker_metrics.shape)
worker_metrics.sort_values('total_judgments', ascending=False).head(10)


Workers: (188, 12)


index,wid,total_judgments,gold_coverage,acc_exact_m1,within1_m1,mae_m1,majority_agreement,duration_median,very_fast_pct,rationale_empty_pct,stability_std,stability_min
78,wid#17,1795,100.0,18.16,66.85,1.151532,66.57,14.0,17.83,0.0,0.030124,0.158598
89,wid#18,1737,100.0,27.35,76.57,0.99597,69.6,76.0,0.0,0.06,0.022275,0.248705
105,wid#24,1532,100.0,44.13,88.9,0.678198,60.97,19.0,0.78,0.0,0.056931,0.375734
100,wid#2,1504,100.0,39.76,81.25,0.795878,67.99,78.0,0.0,0.0,0.039074,0.353293
0,wid#0,1486,100.0,40.65,86.81,0.74428,67.99,83.0,0.0,0.0,0.106564,0.29899
101,wid#20,1128,100.0,33.69,76.06,0.927305,73.53,29.0,0.09,0.0,0.069878,0.295213
122,wid#4,1028,100.0,32.52,77.22,0.954236,70.75,154.0,0.0,0.0,0.054542,0.262391
56,wid#15,1010,100.0,7.92,27.62,1.981188,40.65,18.0,0.89,0.0,0.015085,0.065282
133,wid#5,949,100.0,24.97,55.11,1.353003,60.39,54.0,0.0,0.0,0.151284,0.129747
99,wid#19,900,100.0,32.33,81.33,0.863333,63.37,32.5,0.0,0.0,0.072111,0.263333
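
The accuracy columns in the table above (`acc_exact_m1`, `within1_m1`, `mae_m1`) compare each crowd label with the gold grade after mapping gold onto the crowd's 0..3 scale. The sketch below reproduces that mapping and the three metrics in plain Python on invented label pairs.

```python
# Same gold -> crowd mapping as in the notebook cell above.
GOLD_TO_CROWD = {-2: 0, 0: 0, 1: 1, 2: 2, 3: 3, 4: 3}


def accuracy_metrics(pairs):
    """pairs: iterable of (crowd_label, gold_grade). Unmappable gold is skipped."""
    errors = [
        abs(label - GOLD_TO_CROWD[gold])
        for label, gold in pairs
        if gold in GOLD_TO_CROWD
    ]
    if not errors:
        return None
    n = len(errors)
    return {
        "acc_exact": sum(e == 0 for e in errors) / n,  # exact match rate
        "within1": sum(e <= 1 for e in errors) / n,    # off by at most one grade
        "mae": sum(errors) / n,                        # mean absolute error
    }


# Hypothetical judgments from one worker: (crowd label, gold grade).
pairs = [(0, -2), (1, 2), (3, 4), (0, 3)]
result = accuracy_metrics(pairs)
print(result)
```

The absolute errors here are 0, 1, 0, and 3, giving an exact accuracy of 0.5, a within-one rate of 0.75, and a mean absolute error of 1.0.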


In [11]:
# Worker trust composite score
MIN_J_WORKER = 20

worker_metric_directions = {
    'within1_m1': 'high',
    'majority_agreement': 'high',
    'acc_exact_m1': 'high',
    'mae_m1': 'low',
    'very_fast_pct': 'low',
    'rationale_empty_pct': 'low',
    'stability_std': 'low',
    'stability_min': 'high',
}

worker_weights = {m: 1.0 for m in worker_metric_directions}

wm = worker_metrics.copy()
wm_rankable = rank_with_min_n(wm, n_col='total_judgments', min_n=MIN_J_WORKER)

wm_norm = normalize_metrics_01(wm_rankable, worker_metric_directions)
wm_rankable['worker_trust_score'] = weighted_sum_score(wm_norm, worker_weights)

wm_rankable = wm_rankable.sort_values('worker_trust_score', ascending=False)

print(f"Rankable workers (total_judgments >= {MIN_J_WORKER}): {len(wm_rankable)} / {len(wm)}")
wm_rankable.head(15)


Rankable workers (total_judgments >= 20): 54 / 188


index,wid,total_judgments,gold_coverage,acc_exact_m1,within1_m1,mae_m1,majority_agreement,duration_median,very_fast_pct,rationale_empty_pct,stability_std,stability_min,worker_trust_score
180,wid#92,76,100.0,60.53,97.37,0.434211,76.92,16.0,2.63,0.0,0.040974,0.56,0.936647
125,wid#42,215,100.0,50.7,88.84,0.613953,68.48,146.0,0.0,0.0,0.054364,0.444444,0.923915
143,wid#59,55,100.0,47.27,92.73,0.6,77.08,1389.0,0.0,0.0,0.096107,0.368421,0.918363
179,wid#91,443,100.0,59.59,96.61,0.440181,78.38,43.0,0.0,0.0,0.165287,0.439189,0.899871
156,wid#70,356,100.0,50.56,89.33,0.603933,76.77,105.5,0.0,0.0,0.143877,0.352941,0.881796
146,wid#61,569,100.0,42.88,88.93,0.683656,71.67,74.0,0.0,0.0,0.06903,0.349206,0.877407
124,wid#41,51,100.0,41.18,82.35,0.803922,72.73,1307.0,0.0,0.0,0.058824,0.352941,0.851913
105,wid#24,1532,100.0,44.13,88.9,0.678198,60.97,19.0,0.78,0.0,0.056931,0.375734,0.847814
121,wid#39,46,100.0,47.83,86.96,0.673913,75.61,592.5,0.0,0.0,0.146012,0.333333,0.846621
100,wid#2,1504,100.0,39.76,81.25,0.795878,67.99,78.0,0.0,0.0,0.039074,0.353293,0.842025


In [12]:
# Save WebCrowd worker results
webcrowd_workers_out = wm_rankable.copy()
webcrowd_workers_out.to_csv(OUTPUT_DIR / 'webcrowd_worker_scores.csv', index=False)
print("Wrote:", (OUTPUT_DIR / 'webcrowd_worker_scores.csv'))


Wrote: outputs\webcrowd_worker_scores.csv


In [13]:
# Topic-level platform governance metrics (proxy)

# Item-level majority vs gold (using the `item_stats` embedded in w)
items = (
    w.drop_duplicates(subset=['tid', 'did'])
    [['tid', 'did', 'majority_label', 'is_tie', 'gold_m1']]
    .copy()
)

items['gold_available'] = items['gold_m1'].notna()
items['maj_acc_exact_m1'] = (items['majority_label'] == items['gold_m1']).where(items['gold_available'], pd.NA)
items['maj_abs_err_m1'] = (items['majority_label'] - items['gold_m1']).abs().where(items['gold_available'], pd.NA)
items['maj_within1_m1'] = (items['maj_abs_err_m1'] <= 1).where(items['gold_available'], pd.NA)

# Non-response rate per topic from raw
nr_by_tid = (
    wc_raw.assign(label_num=pd.to_numeric(wc_raw['label'], errors='coerce'))
    .groupby('tid')['label_num']
    .apply(lambda s: float((s == -1).mean()))
    .reset_index()
    .rename(columns={'label_num': 'nonresponse_rate'})
)

# Aggregate governance per tid

gov_tid = (
    items.groupby('tid')
    .agg(
        n_items=('did', 'nunique'),
        tie_rate=('is_tie', 'mean'),
        maj_acc_exact_m1=('maj_acc_exact_m1', 'mean'),
        maj_within1_m1=('maj_within1_m1', 'mean'),
    )
    .reset_index()
)

gov_tid = gov_tid.merge(nr_by_tid, on='tid', how='left')

# Convert proportions to % for readability
for c in ['tie_rate', 'maj_acc_exact_m1', 'maj_within1_m1', 'nonresponse_rate']:
    gov_tid[c] = (gov_tid[c].astype(float) * 100).round(2)

print("Topic governance table:", gov_tid.shape)
gov_tid.sort_values('n_items', ascending=False).head(10)


Topic governance table: (50, 6)


index,tid,n_items,tie_rate,maj_acc_exact_m1,maj_within1_m1,nonresponse_rate
0,251,100,20.0,29.0,94.0,2.6
1,252,100,16.0,29.0,83.0,5.6
2,253,100,10.0,45.0,95.0,2.2
3,254,100,14.0,11.0,83.0,2.8
4,255,100,25.0,39.0,86.0,4.0
5,256,100,25.0,32.0,86.0,2.2
6,257,100,22.0,31.0,90.0,2.6
7,258,100,27.0,25.0,81.0,3.6
8,259,100,31.0,13.0,77.0,0.4
9,260,100,28.0,15.0,51.0,1.8


In [14]:
# Governance trust score (topic-level)

gov_metric_directions = {
    'maj_within1_m1': 'high',
    'maj_acc_exact_m1': 'high',
    'tie_rate': 'low',
    'nonresponse_rate': 'low',
}

gov_weights = {m: 1.0 for m in gov_metric_directions}

gt = gov_tid.copy()

# There are only 50 topics, so we rank all of them without a minimum-evidence filter; keep `n_items` for context.
gt_norm = normalize_metrics_01(gt, gov_metric_directions)
gt['governance_trust_score'] = weighted_sum_score(gt_norm, gov_weights)

gt = gt.sort_values('governance_trust_score', ascending=False)

gt.head(15)


index,tid,n_items,tie_rate,maj_acc_exact_m1,maj_within1_m1,nonresponse_rate,governance_trust_score
27,278,100,2.0,91.0,99.0,0.4,0.991438
41,292,100,7.0,70.0,95.0,0.4,0.972354
40,291,100,14.0,69.0,94.0,1.6,0.85671
16,267,100,8.0,55.0,98.0,2.0,0.851249
26,277,100,16.0,70.0,100.0,2.2,0.839776
33,284,100,14.0,43.0,99.0,0.6,0.812357
34,285,100,12.0,42.0,96.0,0.6,0.811226
20,271,100,23.0,56.0,96.0,0.4,0.783579
29,280,99,17.17,55.56,97.98,1.8,0.782088
25,276,100,21.0,48.0,99.0,0.4,0.781186


In [15]:
# Save governance results
governance_out = gt.copy()
governance_out.to_csv(OUTPUT_DIR / 'webcrowd_governance_topic_scores.csv', index=False)
print("Wrote:", (OUTPUT_DIR / 'webcrowd_governance_topic_scores.csv'))


Wrote: outputs\webcrowd_governance_topic_scores.csv


## Sensitivity analysis (weights)

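We check how stable the rankings are under moderate weight perturbations: for each entity type we draw 200 weight vectors from a Dirichlet distribution centered on the normalized baseline weights (concentration 20), re-score, and compare each re-ranking with the baseline ranking via Spearman correlation.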
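
As a rough guide to how large these perturbations are, the sketch below computes the analytic mean and standard deviation of Dirichlet-sampled weights for the four equally weighted governance metrics, using the same concentration (20) and sample count (200) as the defaults in `dirichlet_weight_samples`. It is an illustration under those assumptions rather than part of the pipeline: it ignores the `1e-6` offset added to `alpha`, and the variable names are illustrative.

```python
import numpy as np

# Baseline: four equally weighted governance metrics (as in gov_weights).
base_weights = {
    'maj_within1_m1': 1.0,
    'maj_acc_exact_m1': 1.0,
    'tie_rate': 1.0,
    'nonresponse_rate': 1.0,
}
concentration = 20.0  # same default as dirichlet_weight_samples

w = np.array(list(base_weights.values()), dtype=float)
w = w / w.sum()              # normalized baseline weights (0.25 each)
alpha = w * concentration    # Dirichlet parameters (1e-6 offset omitted)
alpha0 = alpha.sum()

# Closed-form moments of a Dirichlet distribution.
analytic_mean = alpha / alpha0
analytic_sd = np.sqrt(alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1)))

# Empirical check with the same sample count and seed as the notebook defaults.
rng = np.random.default_rng(42)
samples = rng.dirichlet(alpha, size=200)
empirical_sd = samples.std(axis=0)

print("analytic mean:", analytic_mean.round(3))
print("analytic sd:  ", analytic_sd.round(3))
print("empirical sd: ", empirical_sd.round(3))
```

With four equal weights, each sampled weight is centered on 0.25 with a standard deviation of roughly 0.09. A larger concentration would tighten the samples around the baseline, and a smaller one would spread them out.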


In [16]:
def dirichlet_weight_samples(base_weights: dict, n: int = 200, concentration: float = 20.0, seed: int = 42) -> pd.DataFrame:
    """Sample weight vectors around base weights using a Dirichlet distribution."""
    rng = np.random.default_rng(seed)
    w = pd.Series(base_weights, dtype=float)
    w = w.clip(lower=0)
    w = w / w.sum()

    alpha = (w.values * concentration) + 1e-6
    samples = rng.dirichlet(alpha=alpha, size=n)
    return pd.DataFrame(samples, columns=w.index)


def rank_stability_report(df_rankable: pd.DataFrame, metric_directions: dict, base_weights: dict, id_cols: list, base_score_col: str, n_samples: int = 200) -> pd.DataFrame:
    """Return a one-row summary (p10, median, p90, min) of Spearman correlations
    between the baseline ranking and the rankings under sampled weights."""
    # Baseline ranking
    base = df_rankable[[*id_cols, base_score_col]].copy()
    base['base_rank'] = base[base_score_col].rank(ascending=False, method='average')

    # Normalized metrics for re-scoring
    norm = normalize_metrics_01(df_rankable, metric_directions)

    samples = dirichlet_weight_samples(base_weights, n=n_samples)

    # Re-score under each sampled weight vector and compare ranks with the baseline
    cors = []
    for _, row in samples.iterrows():
        s = weighted_sum_score(norm, row.to_dict())
        r = s.rank(ascending=False, method='average')
        cors.append(base['base_rank'].corr(r, method='spearman'))

    cors = pd.Series(cors, name='spearman_corr')
    return pd.DataFrame({
        'n_entities': [len(df_rankable)],
        'n_samples': [n_samples],
        'spearman_p10': [cors.quantile(0.10)],
        'spearman_median': [cors.median()],
        'spearman_p90': [cors.quantile(0.90)],
        'spearman_min': [cors.min()],
    })


sensitivity = []

# Participants (source×category)
sensitivity.append(
    rank_stability_report(
        df_rankable=pg_rankable,
        metric_directions=participant_metric_directions,
        base_weights=participant_weights,
        id_cols=['participant_group'],
        base_score_col='participant_trust_score',
        n_samples=200,
    ).assign(entity='participants_sourceXcategory')
)

# Agencies
sensitivity.append(
    rank_stability_report(
        df_rankable=ag_rankable,
        metric_directions=agency_metric_directions,
        base_weights=agency_weights,
        id_cols=['agency'],
        base_score_col='agency_trust_score',
        n_samples=200,
    ).assign(entity='agencies')
)

# WebCrowd workers
sensitivity.append(
    rank_stability_report(
        df_rankable=wm_rankable,
        metric_directions=worker_metric_directions,
        base_weights=worker_weights,
        id_cols=['wid'],
        base_score_col='worker_trust_score',
        n_samples=200,
    ).assign(entity='webcrowd_workers')
)

# WebCrowd governance (topics)
sensitivity.append(
    rank_stability_report(
        df_rankable=gt,
        metric_directions=gov_metric_directions,
        base_weights=gov_weights,
        id_cols=['tid'],
        base_score_col='governance_trust_score',
        n_samples=200,
    ).assign(entity='webcrowd_governance_topics')
)

sensitivity_df = pd.concat(sensitivity, ignore_index=True)
sensitivity_df[['entity','n_entities','n_samples','spearman_p10','spearman_median','spearman_p90','spearman_min']]


Unnamed: 0,entity,n_entities,n_samples,spearman_p10,spearman_median,spearman_p90,spearman_min
0,participants_sourceXcategory,60,200,0.892509,0.957099,0.985941,0.789442
1,agencies,8,200,0.714286,0.928571,0.97619,0.404762
2,webcrowd_workers,54,200,0.938106,0.976482,0.990562,0.847913
3,webcrowd_governance_topics,50,200,0.941666,0.981321,0.994939,0.872557
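
When reading the table above, note that the Spearman values for `agencies` are coarse because only 8 entities are ranked. A minimal sketch, assuming rankings without ties (the helper `spearman_no_ties` is hypothetical and not part of the notebook), shows how much a single adjacent swap moves the coefficient at n = 8 versus n = 50:

```python
import numpy as np


def spearman_no_ties(rank_a, rank_b) -> float:
    """Spearman's rho for two rankings without ties: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    a = np.asarray(rank_a, dtype=float)
    b = np.asarray(rank_b, dtype=float)
    n = len(a)
    d2 = np.sum((a - b) ** 2)
    return float(1.0 - 6.0 * d2 / (n * (n ** 2 - 1)))


def swap_top_two(ranks: np.ndarray) -> np.ndarray:
    """Return a copy of the ranking with the first two positions exchanged."""
    out = ranks.copy()
    out[[0, 1]] = out[[1, 0]]
    return out


ranks_8 = np.arange(1, 9)     # e.g. 8 agencies
ranks_50 = np.arange(1, 51)   # e.g. 50 topics

rho_8 = spearman_no_ties(ranks_8, swap_top_two(ranks_8))
rho_50 = spearman_no_ties(ranks_50, swap_top_two(ranks_50))

print(f"one adjacent swap, n=8:  rho = {rho_8:.6f}")
print(f"one adjacent swap, n=50: rho = {rho_50:.6f}")
```

A single adjacent swap among 8 entities already lowers the coefficient to about 0.976, which equals the `spearman_p90` reported for agencies, whereas the same swap among 50 entities is barely visible. The lower agency correlations therefore partly reflect the small number of ranked entities rather than only sensitivity to the weights.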


## Notes on interpretation

- **Participant groups (`source × category`)**: trust score reflects *report quality proxies* (completeness, richness, duplication risk, engagement), not individual people.
- **Agencies**: trust score reflects *responsiveness + follow-through + stability* (governance consistency).
- **WebCrowd workers**: trust score reflects *annotation validity + peer agreement + effort proxies + stability*.
- **WebCrowd governance (topic-level)**: trust score is a proxy for how well the platform's aggregation process yields correct outcomes for that topic, based on majority-vote accuracy, tie rate, and nonresponse rate.

