# Week 2 Scaffold: Alias Lock + Target Inference + v1 Edge List

This notebook is a Week 2 starter for the attack-target pipeline.

Workflow:
1. Load Week 1 entity mentions + alias map.
2. Apply canonical entity mapping.
3. Infer likely target mentions using tone/context/self-mention heuristics.
4. Build v1 edge and node outputs.
5. Export artifacts for analysis/validation.

## Inputs and Outputs

Expected inputs:
- `outputs/week1/entity_mentions_week1.parquet` (preferred; falls back to `outputs/week1/entity_mentions_week1.csv.gz` if the Parquet file is missing)
- `outputs/week1/entity_alias_map_v1.csv`

Outputs produced:
- `outputs/week2/attack_target_edges_v1.csv`
- `outputs/week2/attack_target_nodes_v1.csv`
- `outputs/week2/entity_mentions_week2_labeled_v1.csv.gz`

In [1]:
from pathlib import Path
import re
import pandas as pd

pd.set_option('display.max_columns', 120)
pd.set_option('display.width', 180)

In [2]:
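# Locate the analysis root whether the notebook runs from that root or from notebooks/.
PROJECT_ROOT = Path.cwd().resolve()
if (PROJECT_ROOT / 'data').exists() and (PROJECT_ROOT / 'outputs').exists():
    ANALYSIS_ROOT = PROJECT_ROOT
elif PROJECT_ROOT.name == 'notebooks' and (PROJECT_ROOT.parent / 'data').exists():
    ANALYSIS_ROOT = PROJECT_ROOT.parent
else:
    # Fallback: treat the current working directory as the root.
    ANALYSIS_ROOT = PROJECT_ROOT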

OUT_ROOT = ANALYSIS_ROOT / 'outputs' / 'week2'
OUT_ROOT.mkdir(parents=True, exist_ok=True)
WEEK1_ROOT = ANALYSIS_ROOT / 'outputs' / 'week1'

MENTIONS_PARQUET = WEEK1_ROOT / 'entity_mentions_week1.parquet'
MENTIONS_CSV = WEEK1_ROOT / 'entity_mentions_week1.csv.gz'
ALIAS_MAP_PATH = WEEK1_ROOT / 'entity_alias_map_v1.csv'

print('ANALYSIS_ROOT:', ANALYSIS_ROOT)
print('OUT_ROOT:', OUT_ROOT)
print('WEEK1_ROOT:', WEEK1_ROOT)
print('alias map exists:', ALIAS_MAP_PATH.exists())

ANALYSIS_ROOT: /Users/jeremyzay/Desktop/delta_lab/analysis
OUT_ROOT: /Users/jeremyzay/Desktop/delta_lab/analysis/outputs/week2
WEEK1_ROOT: /Users/jeremyzay/Desktop/delta_lab/analysis/outputs/week1
alias map exists: True


In [3]:
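def read_mentions():
    """Load Week 1 mentions, preferring Parquet and falling back to gzipped CSV."""
    if MENTIONS_PARQUET.exists():
        return pd.read_parquet(MENTIONS_PARQUET)
    if MENTIONS_CSV.exists():
        return pd.read_csv(MENTIONS_CSV, compression='gzip')
    raise FileNotFoundError(
        f'Could not find Week 1 mentions artifact at {MENTIONS_PARQUET} or {MENTIONS_CSV}.'
    )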

mentions = read_mentions()
alias_map = pd.read_csv(ALIAS_MAP_PATH)

print('mentions rows:', len(mentions))
print('alias rows:', len(alias_map))
mentions.head(3)

mentions rows: 127464
alias rows: 500


| | platform | ad_id | sponsor_name | party_std | office_std | tone_std | date | entity_text | entity_label | start_char | end_char | context_window |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | google | CR17105064975556149249 | KARI LAKE FOR SENATE | REP | | CONTRAST | 2024-08-26 14:58:00 UTC | Ruben | ORG | 75 | 80 | bad and truth from lies and Democrats like Rub... |
| 1 | google | CR17105064975556149249 | KARI LAKE FOR SENATE | REP | | CONTRAST | 2024-08-26 14:58:00 UTC | Congress | ORG | 163 | 171 | talks stuff now but for four years in Congress... |
| 2 | google | CR17105064975556149249 | KARI LAKE FOR SENATE | REP | | CONTRAST | 2024-08-26 14:58:00 UTC | camela Harris | PERSON | 182 | 195 | but for four years in Congress he backed camel... |


## Canonical Mapping

Required alias columns:
- `entity_text`
- `entity_label`
- `canonical_final`

Recommended review columns:
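- `review_status` (`PENDING`/`REVIEWED`/`LOCKED`; the current map also contains `NEEDS_CONTEXT`, and mentions with no matching alias row are labeled `UNMAPPED` after the merge)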
- `action` (`KEEP_OR_MERGE`/`DROP`)
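
To make the expected alias-map shape concrete, here is a minimal sketch of how a row is matched to a mention. The alias rows and mentions below are invented for illustration and are not taken from the real map; `normalize_for_match` is a condensed copy of the helper defined in the next cell.

```python
import re
import pandas as pd


def normalize_for_match(s):
    """Lowercase, drop punctuation, and collapse whitespace."""
    s = re.sub(r'[^a-z0-9\s]', '', str(s).strip().lower())
    return re.sub(r'\s+', ' ', s).strip()


# Hypothetical alias rows (illustration only).
toy_alias = pd.DataFrame({
    'entity_text': ['Tammy Baldwin', 'Sen. Baldwin', 'Congress'],
    'entity_label': ['PERSON', 'PERSON', 'ORG'],
    'canonical_final': ['tammy baldwin', 'tammy baldwin', 'congress'],
    'review_status': ['LOCKED', 'REVIEWED', 'LOCKED'],
    'action': ['KEEP_OR_MERGE', 'KEEP_OR_MERGE', 'KEEP_OR_MERGE'],
})
toy_alias['entity_text_norm'] = toy_alias['entity_text'].map(normalize_for_match)

# Hypothetical mentions: the last row has the right text but a different label.
toy_mentions = pd.DataFrame({
    'entity_text': ['SEN. BALDWIN', 'Congress', 'Congress'],
    'entity_label': ['PERSON', 'ORG', 'PERSON'],
})
toy_mentions['entity_text_norm'] = toy_mentions['entity_text'].map(normalize_for_match)

# A mention matches only when both the normalized text and the label agree.
matched = toy_mentions.merge(
    toy_alias[['entity_text_norm', 'entity_label', 'canonical_final']],
    on=['entity_text_norm', 'entity_label'],
    how='left',
)
print(matched[['entity_text', 'entity_label', 'canonical_final']])
```

Because the label is part of the key, an entity tagged inconsistently by the NER step (for example `Congress` as `PERSON`) needs its own alias row or it stays unmapped.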

In [6]:
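def normalize_text(s):
    """Lowercase, trim, and collapse runs of whitespace."""
    s = str(s).strip().lower()
    s = re.sub(r'\s+', ' ', s)
    return s


def normalize_for_match(s):
    """Normalize text and strip punctuation so alias keys match across spelling variants."""
    s = normalize_text(s)
    s = re.sub(r'[^a-z0-9\s]', '', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s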

required_cols = {'entity_text', 'entity_label', 'canonical_final'}
missing = required_cols - set(alias_map.columns)
if missing:
    raise ValueError(f'Alias map missing required columns: {sorted(missing)}')

alias = alias_map.copy()
alias['entity_text_norm'] = alias['entity_text'].map(normalize_for_match)
alias['canonical_final'] = alias['canonical_final'].fillna('').astype(str).str.strip()
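# review_status is optional; alias.get(...) would return a plain string (no .fillna) if it were missing.
if 'review_status' not in alias.columns:
    alias['review_status'] = 'PENDING'
alias['review_status'] = alias['review_status'].fillna('PENDING')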
alias = alias.drop_duplicates(subset=['entity_text_norm', 'entity_label'], keep='first')

mentions2 = mentions.copy()
mentions2['entity_text_norm'] = mentions2['entity_text'].map(normalize_for_match)
mentions2 = mentions2.merge(
    alias[['entity_text_norm', 'entity_label', 'canonical_final', 'review_status']],
    on=['entity_text_norm', 'entity_label'],
    how='left',
    validate='many_to_one'  # fail loudly if alias keys are not unique
)
mentions2['canonical_entity'] = mentions2['canonical_final'].where(
    mentions2['canonical_final'].fillna('').str.len() > 0,
    mentions2['entity_text_norm']
)
mentions2['alias_review_status'] = mentions2['review_status'].fillna('UNMAPPED')

mentions2[['entity_text', 'entity_label', 'canonical_entity', 'alias_review_status']].head(10)

| | entity_text | entity_label | canonical_entity | alias_review_status |
|---|---|---|---|---|
| 0 | Ruben | ORG | ruben | UNMAPPED |
| 1 | Congress | ORG | congress | LOCKED |
| 2 | camela Harris | PERSON | camela harris | UNMAPPED |
| 3 | Ruben | ORG | ruben | UNMAPPED |
| 4 | Carrie Lake | PERSON | carrie lake | UNMAPPED |
| 5 | Ryan | PERSON | ryan | NEEDS_CONTEXT |
| 6 | Ryan | PERSON | ryan | NEEDS_CONTEXT |
| 7 | Monica | PERSON | monica | NEEDS_CONTEXT |
| 8 | Tammy Baldwin | PERSON | tammy baldwin | LOCKED |
| 9 | Tammy Baldwin | PERSON | tammy baldwin | LOCKED |
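
Several frequent entities in the output above are still `UNMAPPED`. One way to prioritize manual alias review is to rank unmapped entities by mention volume. The sketch below assumes a frame shaped like `mentions2`; the function name `build_review_queue` and its `top_n` parameter are hypothetical, and the toy rows are invented.

```python
import pandas as pd


def build_review_queue(labeled, top_n=50):
    """Rank unmapped (entity_text_norm, entity_label) pairs by mention count."""
    unmapped = labeled[labeled['alias_review_status'] == 'UNMAPPED']
    return (
        unmapped
        .groupby(['entity_text_norm', 'entity_label'], as_index=False)
        .agg(mention_count=('ad_id', 'count'), ad_count=('ad_id', 'nunique'))
        # Tie-break on the name so the ordering is deterministic.
        .sort_values(['mention_count', 'entity_text_norm'], ascending=[False, True])
        .head(top_n)
        .reset_index(drop=True)
    )


# Toy rows shaped like mentions2 (illustration only).
toy = pd.DataFrame({
    'ad_id': ['a1', 'a1', 'a2', 'a3'],
    'entity_text_norm': ['ruben', 'ruben', 'camela harris', 'congress'],
    'entity_label': ['ORG', 'ORG', 'PERSON', 'ORG'],
    'alias_review_status': ['UNMAPPED', 'UNMAPPED', 'UNMAPPED', 'LOCKED'],
})
queue = build_review_queue(toy)
print(queue)
```

In the notebook this would be called as `build_review_queue(mentions2)`. Note that `groupby` drops rows whose key columns are missing by default, so mentions without an `entity_label` would not appear in the queue.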


## Target Inference Heuristic (v1)

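This is a starter heuristic only; adjust it after manual error analysis.

Each mention is scored on three signals: the ad's tone is `NEGATIVE` or `CONTRAST`, the context window contains an attack term (substring match), and the entity is not a self-mention (its canonical name does not appear inside the normalized sponsor name). A mention that passes the self-mention check is `high` confidence when both other signals fire and `medium` when exactly one does. Everything else, including every self-mention, is `low`. Only `high` and `medium` mentions count as targets.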

In [7]:
ATTACK_TERMS = {
    'failed', 'failure', 'dangerous', 'corrupt', 'lies', 'lying',
    'radical', 'extreme', 'crime', 'criminal', 'inflation', 'border',
    'illegal', 'tax'
}

mentions2['tone_std'] = mentions2['tone_std'].fillna('UNKNOWN')
mentions2['negative_tone'] = mentions2['tone_std'].isin(['NEGATIVE', 'CONTRAST'])
mentions2['context_window'] = mentions2['context_window'].fillna('').astype(str)
mentions2['context_has_attack_term'] = mentions2['context_window'].str.lower().map(
    lambda t: any(term in t for term in ATTACK_TERMS)
)

sponsor_norm = mentions2['sponsor_name'].fillna('').map(normalize_for_match)
canonical_norm = mentions2['canonical_entity'].fillna('').map(normalize_for_match)
# Row-wise check because pandas .str.contains does not accept a Series pattern.
mentions2['not_self_mention'] = [
    not (canon and canon in sponsor)
    for sponsor, canon in zip(sponsor_norm, canonical_norm)
]

high = mentions2['not_self_mention'] & mentions2['negative_tone'] & mentions2['context_has_attack_term']
medium = mentions2['not_self_mention'] & (mentions2['negative_tone'] | mentions2['context_has_attack_term']) & ~high

mentions2['target_confidence'] = 'low'
mentions2.loc[medium, 'target_confidence'] = 'medium'
mentions2.loc[high, 'target_confidence'] = 'high'
mentions2['is_target'] = mentions2['target_confidence'].isin(['high', 'medium'])

mentions2['target_confidence'].value_counts(dropna=False)

target_confidence
medium    60982
low       50128
high      16354
Name: count, dtype: int64
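
The attack-term check above uses substring matching, so `lies` also fires on words such as "families" and "allies". As one possible calibration step, the sketch below switches to whole-word matching with a compiled regex. `ATTACK_RE` and `has_attack_term` are hypothetical names, and the example sentences are invented.

```python
import re

# Same lexicon as the notebook cell above.
ATTACK_TERMS = {
    'failed', 'failure', 'dangerous', 'corrupt', 'lies', 'lying',
    'radical', 'extreme', 'crime', 'criminal', 'inflation', 'border',
    'illegal', 'tax',
}

# \b anchors restrict each term to whole-word matches; re.escape guards
# against any future term that contains regex metacharacters.
ATTACK_RE = re.compile(
    r'\b(?:' + '|'.join(sorted(map(re.escape, ATTACK_TERMS))) + r')\b',
    re.IGNORECASE,
)


def has_attack_term(text):
    """Return True if any lexicon term appears as a whole word."""
    return bool(ATTACK_RE.search(str(text)))


benign = 'working families and our allies'
substring_hit = any(term in benign for term in ATTACK_TERMS)  # current behavior
whole_word_hit = has_attack_term(benign)                      # stricter variant
print(substring_hit, whole_word_hit)
print(has_attack_term('He lies about the border'))
print(has_attack_term('raised taxes'))
```

The trade-off is recall: whole-word matching no longer catches inflections such as "taxes", so the lexicon would need to list them explicitly. Applying it in the notebook could look like `mentions2['context_window'].map(has_attack_term)`.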

In [8]:
target_mentions = mentions2[mentions2['is_target']].copy()

edges = (
    target_mentions
    .groupby(['sponsor_name', 'canonical_entity'], as_index=False)
    .agg(
        mention_count=('ad_id', 'count'),
        ad_count=('ad_id', 'nunique'),
        high_confidence_mentions=('target_confidence', lambda s: (s == 'high').sum()),
        medium_confidence_mentions=('target_confidence', lambda s: (s == 'medium').sum()),
        platform_count=('platform', 'nunique'),
        party_mode=('party_std', lambda s: s.mode().iloc[0] if not s.mode().empty else 'UNKNOWN'),
        tone_mode=('tone_std', lambda s: s.mode().iloc[0] if not s.mode().empty else 'UNKNOWN')
    )
    .sort_values('mention_count', ascending=False)
)

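# An edge is 'high' only when at least half of its mentions, and no fewer than 3,
# are high-confidence; every other edge is 'medium'.
edges['edge_confidence'] = edges.apply(
    lambda r: 'high' if r['high_confidence_mentions'] >= max(3, 0.5 * r['mention_count']) else 'medium',
    axis=1
)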

nodes = (
    mentions2
    .groupby('canonical_entity', as_index=False)
    .agg(
        mention_count=('ad_id', 'count'),
        ad_count=('ad_id', 'nunique'),
        sponsor_count=('sponsor_name', 'nunique'),
        platform_count=('platform', 'nunique'),
        label_mode=('entity_label', lambda s: s.mode().iloc[0] if not s.mode().empty else 'UNKNOWN'),
        high_confidence_mentions=('target_confidence', lambda s: (s == 'high').sum()),
        medium_confidence_mentions=('target_confidence', lambda s: (s == 'medium').sum()),
        low_confidence_mentions=('target_confidence', lambda s: (s == 'low').sum())
    )
    .sort_values('mention_count', ascending=False)
)

print('target mentions:', len(target_mentions))
print('edges:', len(edges))
print('nodes:', len(nodes))
edges.head(10)

target mentions: 77336
edges: 5939
nodes: 9744


| | sponsor_name | canonical_entity | mention_count | ad_count | high_confidence_mentions | medium_confidence_mentions | platform_count | party_mode | tone_mode | edge_confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 2223 | Frisch for CO CD-03 | jeff hurd | 478 | 205 | 82 | 396 | 1 | DEM | CONTRAST | medium |
| 2620 | House Majority PAC | congress | 455 | 453 | 48 | 407 | 1 | DEM | NEGATIVE | medium |
| 2756 | House Majority PAC | medicare | 438 | 297 | 62 | 376 | 1 | DEM | NEGATIVE | medium |
| 5684 | Vote AK Before Party | nick begich | 421 | 137 | 68 | 353 | 1 | IND | NEGATIVE | medium |
| 2215 | Frisch for CO CD-03 | adam frisch | 418 | 290 | 70 | 348 | 1 | DEM | CONTRAST | medium |
| 985 | Congressional Leadership Fund | josh riley | 381 | 161 | 115 | 266 | 1 | REP | NEGATIVE | medium |
| 5678 | Vote AK Before Party | alaska | 373 | 137 | 34 | 339 | 1 | IND | NEGATIVE | medium |
| 2219 | Frisch for CO CD-03 | denver | 370 | 201 | 0 | 370 | 1 | DEM | CONTRAST | medium |
| 934 | Congressional Leadership Fund | don davis | 321 | 110 | 147 | 174 | 2 | REP | NEGATIVE | medium |
| 4350 | NRCC/Molinaro | mark molinaro | 300 | 150 | 87 | 213 | 1 | REP | CONTRAST | medium |
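
The top edges include pairs such as `Frisch for CO CD-03` and `adam frisch`, which appear to be the sponsor's own candidate. The substring check misses them because the full canonical name is not contained in the sponsor name. A minimal sketch of a stricter, token-based check follows. It assumes that a surname appearing in the sponsor name identifies the sponsored candidate; `GENERIC_TOKENS`, `name_tokens`, and `is_probable_self_mention` are hypothetical names, and the stopword list is only a starting point.

```python
import re

# Hypothetical list of tokens too generic to identify a candidate.
GENERIC_TOKENS = {
    'for', 'of', 'the', 'and', 'committee', 'pac',
    'senate', 'congress', 'house', 'friends',
}


def name_tokens(s):
    """Lowercase, split on non-alphanumerics, and drop generic tokens."""
    cleaned = re.sub(r'[^a-z0-9]+', ' ', str(s).lower())
    return [t for t in cleaned.split() if t not in GENERIC_TOKENS]


def is_probable_self_mention(sponsor_name, canonical_entity):
    """True if the entity's last token (assumed surname) is a sponsor token."""
    entity = name_tokens(canonical_entity)
    if not entity:
        return False
    return entity[-1] in set(name_tokens(sponsor_name))


checks = {
    ('Frisch for CO CD-03', 'adam frisch'): is_probable_self_mention('Frisch for CO CD-03', 'adam frisch'),
    ('Frisch for CO CD-03', 'jeff hurd'): is_probable_self_mention('Frisch for CO CD-03', 'jeff hurd'),
    ('NRCC/Molinaro', 'mark molinaro'): is_probable_self_mention('NRCC/Molinaro', 'mark molinaro'),
    ('House Majority PAC', 'congress'): is_probable_self_mention('House Majority PAC', 'congress'),
}
for pair, flagged in checks.items():
    print(pair, flagged)
```

This check is best limited to `PERSON` entities, and it can over-flag when an opponent shares a surname with the sponsor's candidate, so flagged rows should be included in the manual validation sample.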


In [9]:
edge_path = OUT_ROOT / 'attack_target_edges_v1.csv'
node_path = OUT_ROOT / 'attack_target_nodes_v1.csv'
mentions_out_path = OUT_ROOT / 'entity_mentions_week2_labeled_v1.csv.gz'

edges.to_csv(edge_path, index=False)
nodes.to_csv(node_path, index=False)
mentions2.to_csv(mentions_out_path, index=False, compression='gzip')

print('wrote:', edge_path)
print('wrote:', node_path)
print('wrote:', mentions_out_path)

wrote: /Users/jeremyzay/Desktop/delta_lab/analysis/outputs/week2/attack_target_edges_v1.csv
wrote: /Users/jeremyzay/Desktop/delta_lab/analysis/outputs/week2/attack_target_nodes_v1.csv
wrote: /Users/jeremyzay/Desktop/delta_lab/analysis/outputs/week2/entity_mentions_week2_labeled_v1.csv.gz


## Week 2 TODOs

- Manually review top entity aliases and set `review_status=LOCKED` for approved rows.
- Add stricter self-mention checks (committee/candidate aliases).
- Validate precision on a manual sample (200-400 ads).
- Calibrate attack-term lexicon and confidence rules.
- Freeze `attack_target_edges_v1.csv` after validation.
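
For the precision-validation item, one possible workflow is to sample ads rather than individual mentions, hand-label each sampled target mention, and then compute precision per confidence tier. The sketch below assumes a frame shaped like `mentions2`; the function names, the `manual_is_target` column, and the toy data are hypothetical.

```python
import pandas as pd


def draw_validation_sample(labeled, n_ads=300, seed=42):
    """Sample ads and return every target mention from the sampled ads."""
    targets = labeled[labeled['is_target']]
    ad_ids = targets['ad_id'].drop_duplicates()
    sampled = ad_ids.sample(n=min(n_ads, len(ad_ids)), random_state=seed)
    out = targets[targets['ad_id'].isin(sampled)].copy()
    out['manual_is_target'] = pd.NA  # to be filled in by a human reviewer
    return out


def precision_by_confidence(reviewed):
    """Share of reviewed mentions confirmed as true targets, per confidence tier."""
    done = reviewed.dropna(subset=['manual_is_target'])
    return done.groupby('target_confidence')['manual_is_target'].apply(
        lambda s: s.astype(bool).mean()
    )


# Toy labeled mentions (illustration only).
toy = pd.DataFrame({
    'ad_id': ['a1', 'a1', 'a2', 'a3', 'a4'],
    'target_confidence': ['high', 'medium', 'low', 'medium', 'high'],
})
toy['is_target'] = toy['target_confidence'].isin(['high', 'medium'])
sample = draw_validation_sample(toy, n_ads=2, seed=0)

# Toy reviewer output: the last row has not been reviewed yet and is skipped.
reviewed = pd.DataFrame({
    'target_confidence': ['high', 'high', 'medium', 'medium', 'medium'],
    'manual_is_target': [True, True, True, False, pd.NA],
})
precision = precision_by_confidence(reviewed)
print(precision)
```

Sampling at the ad level keeps all mentions from one ad together, which matches how a reviewer would read the ad. Reporting precision separately for `high` and `medium` shows whether the `medium` tier is clean enough to keep in the frozen edge list.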