# Task Extraction & Prioritization Pipeline
### Input: Email Summary Text → Structured, Ranked Task List

---

## Architecture
```
Email Summary (str)
      │
      ├─► [1] Sentence Segmentation        (NLTK + bullet-aware splitter)
      │
      ├─► [2] Task Sentence Detection      (imperative heuristics + KPE confidence)
      │
      ├─► [3] NER — spaCy en_core_web_trf  (PERSON, ORG, DATE, TIME)
      │         → feeds assignee sieve & deadline extractor
      │
      ├─► [4] KPE — KeyBERT + YAKE fallback
      │         → action phrase refinement
      │
      ├─► [5] Action Span Construction     (verb-anchor + KPE overlay)
      │         → multi-action splitting on compound sentences
      │
      ├─► [6] Deadline Normalization       (12-pattern matcher → dateparser fallback)
      │
      └─► [7] Priority Scoring            (weighted formula, fully explainable)
                deadline_proximity  × 0.35
                urgency_lexicon     × 0.20
                risk_lexicon        × 0.20
                customer_impact     × 0.15
                authority_cues      × 0.10
                ─────────────────────────
                → Low / Medium / High / Critical
```
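
The weights in step [7] combine as a plain weighted sum. The sketch below is only an illustration under stated assumptions: each signal is taken to be pre-normalized to the range [0, 1], and the `label_for` cut-offs (0.25 / 0.50 / 0.75) are hypothetical, because the thresholds that map a score to Low / Medium / High / Critical are not given in this section.

```python
from typing import Dict

# Weights taken from the architecture diagram; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "deadline_proximity": 0.35,
    "urgency_lexicon":    0.20,
    "risk_lexicon":       0.20,
    "customer_impact":    0.15,
    "authority_cues":     0.10,
}

def priority_score(signals: Dict[str, float]) -> Dict[str, float]:
    """Return per-signal contributions plus their total (missing signals count as 0)."""
    breakdown = {name: w * min(1.0, max(0.0, signals.get(name, 0.0)))
                 for name, w in WEIGHTS.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

def label_for(total: float) -> str:
    """Hypothetical cut-offs, chosen only for this example."""
    if total >= 0.75:
        return "Critical"
    if total >= 0.50:
        return "High"
    if total >= 0.25:
        return "Medium"
    return "Low"

example = priority_score({"deadline_proximity": 1.0, "urgency_lexicon": 0.5})
print(round(example["total"], 2), label_for(example["total"]))
```

Keeping the per-signal contributions in the returned dictionary is what makes the score explainable: each task can show which signal pushed it up.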


## Cell 1 — Installs *(run once, then restart kernel)*

In [None]:
import subprocess, sys

pkgs = ["spacy", "keybert", "yake", "sentence-transformers", "nltk", "dateparser"]
for p in pkgs:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet", p])

# spaCy model — transformer model preferred
try:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_trf", "--quiet"])
    SPACY_MODEL = "en_core_web_trf"
except Exception:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm", "--quiet"])
    SPACY_MODEL = "en_core_web_sm"

import nltk
for r in ["punkt", "punkt_tab", "stopwords"]: nltk.download(r, quiet=True)

print(f"All deps ready.  spaCy model: {SPACY_MODEL}")

import json

All deps ready.  spaCy model: en_core_web_trf


## Cell 2 — Imports & Timezone Setup

In [None]:
from __future__ import annotations
import re, json, warnings
from dataclasses import dataclass, asdict, field
from typing import List, Dict, Any, Optional, Tuple
from datetime import datetime, timedelta
from collections import defaultdict
warnings.filterwarnings("ignore")

# ── IST timezone ─────────────────────────────────────────────────────────────
try:
    from zoneinfo import ZoneInfo
    IST = ZoneInfo("Asia/Kolkata")
except Exception:
    IST = None

def now_ist() -> datetime:
    return datetime.now().astimezone(IST) if IST else datetime.now()

def ensure_ist(dt: datetime) -> datetime:
    if IST:
        return dt.replace(tzinfo=IST) if dt.tzinfo is None else dt.astimezone(IST)
    return dt

def iso_ist(dt: Optional[datetime]) -> Optional[str]:
    return ensure_ist(dt).isoformat(timespec="seconds") if dt else None

# ── spaCy ─────────────────────────────────────────────────────────────────────
import spacy
try:
    nlp = spacy.load("en_core_web_trf")
    print("spaCy: en_core_web_trf  (transformer NER)")
except Exception:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy: en_core_web_sm  (statistical NER)")

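'''
trf: transformer pipeline (RoBERTa-based neural network); more accurate; better at finding PERSON tags (for assignee).
sm: small statistical pipeline (a compact neural model, not rule-based); faster but less accurate; better at finding ORG tags (for assignee).
'''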

# ── KeyBERT ───────────────────────────────────────────────────────────────────
from keybert import KeyBERT
kw_model = KeyBERT(model="all-MiniLM-L6-v2")
print("KeyBERT ready (all-MiniLM-L6-v2)")

'''
keybert:
- uses transformer sentence embeddings to extract keywords that are semantically relevant to a sentence
- the embedding model used here (all-MiniLM-L6-v2) is lightweight
- extracts 1-3 word phrases that are semantically important to the sentence (with similarity scores used as confidence)
'''

# ── YAKE (fallback KPE) ───────────────────────────────────────────────────────
import yake
_yake = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.7, top=10)
print("YAKE ready (fallback KPE)")

'''
yake:
- extracts keywords using statistical features of the text (no embeddings)
- faster than KeyBERT; used as a fallback when KeyBERT fails or returns no keywords
'''
# ── dateparser ────────────────────────────────────────────────────────────────
import dateparser
print("dateparser ready")

# ── NLTK ──────────────────────────────────────────────────────────────────────
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords as _sw
STOPWORDS = set(_sw.words("english"))

print("\nAll modules loaded.  IST now:", now_ist().strftime("%Y-%m-%d %H:%M %Z"))


spaCy: en_core_web_trf  (transformer NER)


KeyBERT ready (all-MiniLM-L6-v2)
YAKE ready (fallback KPE)
dateparser ready

All modules loaded.  IST now: 2026-02-20 14:41 IST
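
To make the behaviour of the timezone helpers concrete, here is a small standalone sketch. It re-declares a simplified `ensure_ist` instead of importing the cell above, and assumes `zoneinfo` with time zone data is available. Naive datetimes are treated as already being IST wall-clock time, while aware datetimes are converted.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

IST = ZoneInfo("Asia/Kolkata")

def ensure_ist(dt: datetime) -> datetime:
    # Naive values are assumed to be IST wall-clock time; aware values are converted.
    return dt.replace(tzinfo=IST) if dt.tzinfo is None else dt.astimezone(IST)

naive = datetime(2026, 2, 15, 10, 0)
utc = datetime(2026, 2, 15, 10, 0, tzinfo=timezone.utc)

print(ensure_ist(naive).isoformat())  # 10:00 stays 10:00, tagged +05:30
print(ensure_ist(utc).isoformat())    # 10:00 UTC becomes 15:30 IST
```

The practical consequence is that any naive timestamp coming from another timezone must be made aware before it reaches the pipeline, otherwise it is silently read as IST.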


## Cell 3 — Task Schema

In [None]:
PRIORITY_ORDER = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}


@dataclass
class Task:
    action:             str
    assignee:           Optional[str]        = None
    deadline:           Optional[datetime]   = None
    priority:           str                  = "Medium"
    confidence:         float                = 0.5
    keyphrases:         List[str]            = field(default_factory=list)
    ner_entities:       Dict[str, List[str]] = field(default_factory=dict)
    evidence_sentence:  str                  = ""
    priority_breakdown: Dict[str, float]     = field(default_factory=dict)

def task_to_dict(t: Task) -> Dict[str, Any]:
    d = asdict(t)
    d["deadline"] = iso_ist(t.deadline)
    return d

def display_task(t: Task, idx: int = 1) -> None:
    bar = {"Low": "░░", "Medium": "▒▒▒", "High": "████", "Critical": "██████ ⚠"}.get(t.priority, "")
    print(f"  +-- Task {idx}  [{t.priority}] {bar}")
    print(f"  |  Action     : {t.action}")
    print(f"  |  Assignee   : {t.assignee or '(unassigned)'}")
    print(f"  |  Deadline   : {iso_ist(t.deadline) or '(not specified)'}")
    print(f"  |  Confidence : {t.confidence:.0%}")
    if t.keyphrases:
        print(f"  |  Keyphrases : {', '.join(t.keyphrases[:4])}")
    for lbl, vals in t.ner_entities.items():
        if vals: print(f"  |  NER {lbl:<8}: {', '.join(vals)}")
    ev = t.evidence_sentence
    print(f"  |  Evidence   : {ev[:80]}{'...' if len(ev)>80 else ''}")
    pb = t.priority_breakdown
    if pb:
        parts = "  ".join(f"{k}={v:.2f}" for k,v in pb.items() if k not in ("total", "rerank_reason"))
        print(f"  |  Score      : total={pb.get('total',0):.3f}  ({parts})")
        if pb.get('rerank_reason'):
            print(f"  |  Rerank     : {pb['rerank_reason']}")
    print(f"  +{'-'*62}")

print("Task schema ready.")


Task schema ready.
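
`PRIORITY_ORDER` is what makes the string labels sortable. As a rough sketch of how a ranked, JSON-ready list might be produced, the example below uses a trimmed-down, hypothetical `MiniTask` instead of the full `Task` class, and plain `isoformat()` instead of `iso_ist`. The tie-breaking rule (earliest deadline first, undated tasks last) is an assumption made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List, Optional

PRIORITY_ORDER = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

@dataclass
class MiniTask:                      # hypothetical, reduced version of Task
    action: str
    priority: str = "Medium"
    deadline: Optional[datetime] = None

def rank(tasks: List[MiniTask]) -> List[MiniTask]:
    # Highest priority first; within a priority, earliest deadline first, undated last.
    return sorted(tasks, key=lambda t: (-PRIORITY_ORDER[t.priority],
                                        t.deadline is None,
                                        t.deadline or datetime.max))

def to_json(tasks: List[MiniTask]) -> str:
    rows = []
    for t in tasks:
        d = asdict(t)
        # datetime is not JSON-serializable, so store it as an ISO 8601 string.
        d["deadline"] = t.deadline.isoformat() if t.deadline else None
        rows.append(d)
    return json.dumps(rows, indent=2)

tasks = [
    MiniTask("update wiki", "Low"),
    MiniTask("patch server", "Critical", datetime(2026, 2, 16, 17, 0)),
    MiniTask("send report", "High", datetime(2026, 2, 20, 17, 0)),
]
print(to_json(rank(tasks)))
```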


## Cell 4 — Sentence Segmentation & Task-Sentence Detection

Two-pass detection:
- **Pass 1 (rules):** action-verb presence, imperative cues, bullet markers, NER boosts
- **Pass 2 (KPE):** KeyBERT keyphrase confidence refines the score

Final score = 0.65 × rule_score + 0.35 × kpe_score. Threshold: ≥ 0.35
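
As a quick worked example of the blend, the sketch below applies the formula to two score pairs. The helper names `blend` and `is_task` are illustrative and do not appear in the notebook; the weights and threshold are the ones stated above.

```python
RULE_WEIGHT, KPE_WEIGHT, THRESHOLD = 0.65, 0.35, 0.35

def blend(rule_score: float, kpe_score: float) -> float:
    """Weighted combination of the rule-based score and the keyphrase confidence."""
    return RULE_WEIGHT * rule_score + KPE_WEIGHT * kpe_score

def is_task(rule_score: float, kpe_score: float) -> bool:
    return blend(rule_score, kpe_score) >= THRESHOLD

# Strong request with a relevant keyphrase: 0.65*0.95 + 0.35*0.77 = 0.887
print(round(blend(0.95, 0.77), 2))

# Weak rule evidence is not rescued by a middling keyphrase: 0.195 + 0.14 = 0.335
print(is_task(0.30, 0.40))
```

Because the rule score carries almost two thirds of the weight, a sentence with no action verb or cue rarely passes on keyphrase confidence alone.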


In [None]:
ACTION_VERBS = {
    "send","share","prepare","review","approve","confirm","schedule","arrange",
    "deploy","rollback","investigate","fix","resolve","draft","update","revert",
    "purchase","raise","escalate","complete","submit","fill","sign","onboard",
    "provision","reset","enable","disable","patch","restart","monitor","validate",
    "verify","audit","reconcile","process","report","summarize","circulate",
    "collect","finalize","coordinate","notify","inform","upload","migrate","block",
    "rotate","renew","close","amend","check","test","merge","archive","backup",
    "restore","respond","follow","ensure","create","build","implement","execute",
    "initiate","track","document","shortlist","recruit",
}

IMPERATIVE_CUES = [
    "please","kindly","must","should","need to","needs to","required",
    "urgent","asap","immediately","action required","can you","could you",
    "would you","ensure","make sure","follow up","high priority","blocker",
    "by eod","by cob","by friday","by monday","by tomorrow","by today",
    "deadline","no later than","due by",
]

NON_TASK = [
    r"^(hi|hello|dear|hey|good morning|good afternoon)\b",
    r"\b(fyi|for your information|just to let you know|heads up)\b",
    r"^(thanks|thank you|regards|best regards)[.,! ]*$",
]

def segment_sentences(text: str) -> List[str]:
    lines = text.strip().splitlines()
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if re.match(r'^[-*•]\s+', line) or re.match(r'^\d+[.)\s]', line):
            out.append(line)  # bullet lines: preserve as-is
        else:
            out.extend(sent_tokenize(line))
    seen, result = set(), []
    for s in out:
        key = s.lower().strip()
        if key not in seen and len(s.split()) >= 3:
            seen.add(key); result.append(s)
    return result

def _rule_score(sent: str) -> Tuple[bool, float]:
    low = sent.lower().strip()
    for pat in NON_TASK:
        if re.search(pat, low): return False, 0.0
    if len(sent.split()) < 4: return False, 0.0

    score = 0.0
    verb_hits = [v for v in ACTION_VERBS if re.search(r'\b' + v + r'\b', low)]
    cue_hits  = [c for c in IMPERATIVE_CUES if c in low]
    score += 0.30 * min(len(verb_hits), 2)
    score += 0.20 * min(len(cue_hits),  3)
    if re.match(r'^[-*•]\s+', sent) or re.match(r'^\d+[.)\s]', sent):
        score += 0.20  # bullet = very likely a task
    if re.match(r'(please|kindly|can you|could you|would you)\b', low, re.I):
        score += 0.15  # request frame
    # NER boost: a PERSON/ORG or DATE/TIME entity suggests the sentence is actionable
    doc = nlp(sent)
    if any(e.label_ in ('PERSON','ORG') for e in doc.ents): score += 0.10
    if any(e.label_ in ('DATE','TIME') for e in doc.ents):  score += 0.10

    is_cand = bool(verb_hits or cue_hits) and score >= 0.25
    return is_cand, min(0.95, score)

def kpe_confidence(sent: str) -> Tuple[float, List[str]]:
    """KeyBERT primary, YAKE fallback."""
    try:
        kps = kw_model.extract_keywords(
            sent, keyphrase_ngram_range=(1,3), stop_words='english',
            use_mmr=True, diversity=0.5, top_n=5)
        if kps:
            return float(kps[0][1]), [k for k,_ in kps]
    except Exception:
        pass
    try:
        yks = _yake.extract_keywords(sent)
        if yks:
            return max(0.0, 1.0 - float(yks[0][1])), [k for k,_ in yks[:5]]
    except Exception:
        pass
    return 0.0, []

def detect_task_sentences(sentences: List[str], threshold: float = 0.35) -> List[Dict]:
    results = []
    for sent in sentences:
        is_cand, rule_s = _rule_score(sent)
        if not is_cand: continue
        kpe_s, kps = kpe_confidence(sent)
        final = 0.65 * rule_s + 0.35 * kpe_s
        if final >= threshold:
            results.append({"sentence": sent, "rule_score": round(rule_s,3),
                            "kpe_score": round(kpe_s,3), "final_score": round(final,3),
                            "keyphrases": kps})
    return results

# smoke test
_test_sents = [
    "Hi team, thanks for joining today.",
    "Please submit the Q1 report by Friday EOD.",
    "Alice needs to review the design docs by Wednesday 3pm.",
    "The meeting was productive.",
    "Escalate INC-55421 to L3 support immediately — this is a blocker.",
    "FYI: server will be down for maintenance tonight.",
]
_det = detect_task_sentences(_test_sents)
print(f"Detected {len(_det)}/{len(_test_sents)} task sentences:")
for d in _det:
    print(f"  rule={d['rule_score']:.2f} kpe={d['kpe_score']:.2f} final={d['final_score']:.2f}  -> '{d['sentence'][:65]}'")

# Rule-score weights (total is capped at 0.95):
#   ACTION VERB    0.30 per hit (max 2)  Strongest: core signal that a sentence is a task
#   CUES           0.20 per hit (max 3)  Medium: confirms a REQUEST, not just a described action
#   BULLET POINT   0.20                  Medium: list format usually means structured tasks
#   REQUEST FRAME  0.15                  Softer: a leading "please/can you" is polite but weaker
#   PERSON/ORG     0.10                  Weakest: a name alone may only be context
#   DATE/TIME      0.10                  Weakest: a date or time alone may only be context

Detected 3/6 task sentences:
  rule=0.95 kpe=0.77 final=0.89  -> 'Please submit the Q1 report by Friday EOD.'
  rule=0.70 kpe=0.70 final=0.70  -> 'Alice needs to review the design docs by Wednesday 3pm.'
  rule=0.70 kpe=0.67 final=0.69  -> 'Escalate INC-55421 to L3 support immediately — this is a blocker.'



## Cell 5 — Named Entity Recognition (spaCy)

Extracts entities from each task sentence:
- **PERSON / ORG** → assignee candidates
- **DATE / TIME** → deadline candidates  
- **TICKET** (regex) → INC-xxxxx, SEV-xxx, JIRA-xxxxx (a known prefix followed by 3 to 8 digits)
- **EMAIL** (regex) → email addresses

Assignee is resolved through a **priority sieve** (most specific level first; the first level that matches wins).
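
The sieve idea itself is simple: try extractors from most to least specific and stop at the first hit. The sketch below shows only two regex-based levels and no spaCy; the `sieve` helper and the level function names are illustrative and are not the notebook's own.

```python
import re
from typing import Callable, List, Optional

NAME = r'[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?'   # one or two capitalized words

def vocative(s: str) -> Optional[str]:
    """'Alice, please ...' -> 'Alice'"""
    m = re.match(rf'^({NAME})\s*,', s)
    return m.group(1) if m else None

def name_to_verb(s: str) -> Optional[str]:
    """'Rahul to share ...' -> 'Rahul'"""
    m = re.match(rf'^({NAME})\s+to\s+', s)
    return m.group(1) if m else None

# Ordered from most to least specific.
LEVELS: List[Callable[[str], Optional[str]]] = [vocative, name_to_verb]

def sieve(sentence: str) -> Optional[str]:
    for level in LEVELS:
        hit = level(sentence)
        if hit:
            return hit          # first match wins; later levels are never consulted
    return None

for s in ["Alice, please finalize the API integration by Wednesday.",
          "Rahul to share revised timeline by tomorrow 5pm.",
          "Please notify the client about the delay by tomorrow."]:
    print(repr(sieve(s)), "<-", s)
```

Ordering matters: a cheap, high-precision pattern placed early prevents a noisier signal (such as an ORG entity) from overriding it.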


In [None]:
EMAIL_RE  = re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b')
TICKET_RE = re.compile(r'\b(?:INC|SEV|JIRA|TKT|SR|CRM)[\-\s]?\d{3,8}\b', re.I)
RELEVANT_LABELS = {'PERSON','ORG','DATE','TIME','GPE','EVENT','MONEY','CARDINAL'}

# Verbs where the following ORG/PERSON is a *recipient*, not the assignee
RECIPIENT_VERBS = {'inform','notify','contact','email','cc','copy','send','forward',
                   'alert','update','tell','report','escalate','remind'}

def run_ner(sentence: str) -> Dict[str, List[str]]:
    doc = nlp(sentence)
    entities: Dict[str, List[str]] = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ not in RELEVANT_LABELS: continue
        text = ent.text.strip()
        if len(text) < 2 or text.lower() in STOPWORDS: continue
        if text.lower() not in [x.lower() for x in entities[ent.label_]]:
            entities[ent.label_].append(text)
    for m in EMAIL_RE.finditer(sentence):
        t = m.group(0)
        if t not in entities['EMAIL']: entities['EMAIL'].append(t)
    for m in TICKET_RE.finditer(sentence):
        t = m.group(0).upper()
        if t not in entities['TICKET']: entities['TICKET'].append(t)
    return dict(entities)

def _is_recipient_entity(entity_text: str, sentence: str) -> bool:
    """
    Returns True if entity_text appears as a direct object of a recipient verb
    (meaning it is the TARGET, not the ACTOR).
    e.g. 'please inform HR' → HR is recipient.
    """
    low = sentence.lower()
    et  = entity_text.lower()
    for verb in RECIPIENT_VERBS:
        # pattern: verb, then up to 40 non-period characters, then entity_text
        pattern = rf'\b{verb}\b[^.]{{0,40}}?\b{re.escape(et)}\b'
        if re.search(pattern, low):
            return True
    return False

def pick_assignee(ner: Dict[str, List[str]], sentence: str) -> Optional[str]:
    """
    8-level sieve (first match wins):
      1. Vocative:   'Alice, please ...'
      2. Bullet:     'Rahul to <verb>'
      3. Explicit:   'assign to / for <name>'
      4. spaCy PERSON (skip if recipient of a verb in RECIPIENT_VERBS)
      5. EMAIL local-part
      6. Collective / implicit actor ('participants', 'your', 'if you', ...)
      7. ORG  (skip if recipient of a verb in RECIPIENT_VERBS)
      8. None
    """
    s = re.sub(r'^[-*•\d.]+\s*', '', sentence).strip()

    # 1. Vocative: starts with 'Name,'
    m = re.match(r'^([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)\s*,', s)
    if m: return m.group(1)

    # 2. Bullet '<Name> to <verb>'
    m = re.match(r'^([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)\s+to\s+', s)
    if m: return m.group(1)

    # 3. Explicit assign phrase
    m = re.search(r'\b(?:assign(?:ed)?\s+to|for)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){0,2})\b', s)
    if m: return m.group(1)

    # 4. spaCy PERSON: skip if they are a recipient of a reporting/contact verb
    for person in ner.get('PERSON', []):
        if not _is_recipient_entity(person, sentence):
            return person

    # 5. Email local-part
    if ner.get('EMAIL'):
        return ner['EMAIL'][0].split('@')[0].replace('.', ' ').replace('_', ' ').title()

    # 6. Collective / implicit actor (participants, 'your', 'if you', etc.)
    collective = _pick_collective_assignee(sentence)
    if collective: return collective

    # 7. ORG: skip if recipient
    for org in ner.get('ORG', []):
        if not _is_recipient_entity(org, sentence):
            return org

    return None

# Collective / audience nouns that map to an implicit actor label
COLLECTIVE_ACTORS = {
    'participants':       'Participants',
    'attendees':          'Participants',
    'all team members':   'All team members',
    'all staff':          'All staff',
    'team members':       'Team members',
    'department heads':   'Department heads',
    'departments':        'Departments',
    'everyone':           'All participants',
    'all employees':      'All employees',
    'all users':          'All users',
}

def _pick_collective_assignee(sentence: str) -> Optional[str]:
    """
    Detect collective or implicit actors that spaCy PERSON NER misses:
      - Explicit collective nouns: 'Participants', 'Department heads', etc.
      - Possessive 'your' implies the addressed audience → 'Participants'
      - Conditional 'if you / you are' → 'Participants'
    Returns None when no collective actor is found.
    """
    low = sentence.lower()
    for phrase, label in COLLECTIVE_ACTORS.items():
        if re.search(r'\b' + re.escape(phrase) + r'\b', low):
            return label
    # 'your' in sentence implies addressed audience
    if re.search(r'\byour\b', low):
        return 'Participants'
    # conditional / imperative 'you'
    if re.search(r'\bif you\b|\byou are\b|\byou must\b|\byou should\b', low):
        return 'Participants'
    return None

# ── Demo ──────────────────────────────────────────────────────────────────────
_ner_demos = [
    ("Alice, please finalize the API integration by Wednesday.",                    "Alice"),
    ("Bob should deploy release v2.4.1 to staging by 7pm today.",                  "Bob"),
    ("Can you escalate INC-55421 to the SRE team immediately?",                    None),   # SRE is the escalation target, not the assignee
    ("Send the invoice to finance@corp.com by EOD Friday.",                        "Finance"),
    ("Rahul to share revised timeline by tomorrow 5pm.",                           "Rahul"),
    ("If you are unable to attend, please inform HR by March 12th.",               "Participants"),
    ("Please notify the client about the delay by tomorrow.",                      None),
    ("Participants are requested to report by 9:00 AM on March 18th.",             "Participants"),
    ("Kindly submit your department presentations by March 10th.",                  "Participants"),
]
print("NER + Assignee sieve demo (with recipient-verb guard):\n")
all_ok = True
for ex, expected in _ner_demos:
    ner = run_ner(ex)
    a   = pick_assignee(ner, ex)
    ok  = (a == expected) or (a is None and expected is None)
    if not ok: all_ok = False
    mark = "OK" if ok else "FAIL"
    print(f"  {mark}  Assignee={a!r:<12}  '{ex}'")
print()
print("All assignee tests passed." if all_ok else "Some assignee tests FAILED.")


NER + Assignee sieve demo (with recipient-verb guard):

  OK  Assignee='Alice'       'Alice, please finalize the API integration by Wednesday.'
  OK  Assignee='Bob'         'Bob should deploy release v2.4.1 to staging by 7pm today.'
  OK  Assignee='SRE'         'Can you escalate INC-55421 to the SRE team immediately?'
  OK  Assignee='Finance'     'Send the invoice to finance@corp.com by EOD Friday.'
  OK  Assignee='Rahul'       'Rahul to share revised timeline by tomorrow 5pm.'
  OK  Assignee=None          'If you are unable to attend, please inform HR by March 12th.'
  OK  Assignee=None          'Please notify the client about the delay by tomorrow.'

All assignee tests passed.
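
The recipient-verb guard can be looked at in isolation. The following is a simplified stand-in for `_is_recipient_entity`, not the notebook's exact code: the verb set is a small subset chosen for illustration, and the `window` parameter is a hypothetical name for the 40-character limit.

```python
import re

RECIPIENT_VERBS = {"inform", "notify", "send"}   # subset, for illustration only

def is_recipient(entity: str, sentence: str, window: int = 40) -> bool:
    """True if `entity` follows a recipient verb within `window` characters of the same sentence."""
    low, ent = sentence.lower(), re.escape(entity.lower())
    for verb in RECIPIENT_VERBS:
        # verb, then at most `window` characters that are not a period, then the entity
        if re.search(rf'\b{verb}\b[^.]{{0,{window}}}?\b{ent}\b', low):
            return True
    return False

print(is_recipient("HR", "If you are unable to attend, please inform HR by March 12th."))
print(is_recipient("Bob", "Bob should notify the client about the delay."))
```

The guard is purely positional: an entity that appears before the verb (the likely actor) is never flagged, while one that appears shortly after it (the likely target) is.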


## Cell 6 — Deadline Extraction & Normalization

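**Stage 1:** Regex bank (12 patterns) extracts raw candidate phrases. Candidates are then tried in order of specificity (`_specificity`), most specific first.

**Stage 2:** Normalizer maps each phrase to an IST `datetime`, checking rules in this order:
ISO → named month → within-X → tomorrow (+EOD/COB/time) → weekday (+EOD/COB/time) → EOD/COB → today → in-X → end of week → time-only → dateparser fallback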
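
Weekday phrases are the least obvious normalization step, so here is a minimal sketch of the underlying arithmetic. It uses naive datetimes for brevity (the notebook works in IST) and a hypothetical helper name `next_weekday`; the 17:00 default mirrors the close-of-business default used in this cell.

```python
from datetime import datetime, timedelta

DOW = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
       "friday": 4, "saturday": 5, "sunday": 6}

def next_weekday(anchor: datetime, name: str, hour: int = 17) -> datetime:
    """Next occurrence of the weekday strictly after the anchor's date, at `hour`:00."""
    # Modulo gives 0..6 days ahead; `or 7` turns "same weekday" into "one week later".
    diff = (DOW[name.lower()] - anchor.weekday()) % 7 or 7
    day = anchor + timedelta(days=diff)
    return day.replace(hour=hour, minute=0, second=0, microsecond=0)

anchor = datetime(2026, 2, 15, 10, 0)        # a Sunday
print(next_weekday(anchor, "Friday"))        # five days ahead
print(next_weekday(anchor, "Monday"))        # the very next day
print(next_weekday(anchor, "Sunday"))        # same weekday rolls over a full week
```

One consequence of this rule is that a deadline named after today's weekday is never interpreted as today.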


In [None]:
import calendar as _cal

DOW = {"monday":0,"tuesday":1,"wednesday":2,"thursday":3,"friday":4,"saturday":5,"sunday":6}
MONTHS = {m.lower():i for i,m in enumerate(_cal.month_name) if m}
MONTHS.update({m.lower():i for i,m in enumerate(_cal.month_abbr) if m})

DEADLINE_PATTERNS = [
    # ISO datetime (most specific)
    r'\b20\d{2}-\d{2}-\d{2}(?:[T\s]\d{2}:\d{2}(?::\d{2})?)?\b',
    # named month + day + optional year + optional time  (e.g. "March 18th", "Feb 28, 2026 5pm")
    r'\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|'
    r'jul(?:y)?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)'
    r'\s+\d{1,2}(?:st|nd|rd|th)?(?:,?\s*\d{4})?(?:\s+\d{1,2}(?::\d{2})?\s*(?:am|pm)?)?\b',
    # compound relative: tomorrow/today + EOD/COB/time
    r'\b(tomorrow|today)\s+(eod|cob|end of day|close of business|\d{1,2}(?::\d{2})?\s*(?:am|pm)?)\b',
    # within X hours/days
    r'\bwithin\s+\d+\s*(hours?|days?)\b',
    # by/before/due + clause (catches "by Friday COB", "before next Monday 5pm")
    r'\b(?:by|before|due|no later than)\s+[^,.;\n]{1,55}',
    # standalone EOD/COB (today only — listed AFTER the compound & by-clause patterns)
    r'\b(eod|cob|end of day|close of business)\b',
    # relative singles
    r'\b(today|tomorrow|day after tomorrow)\b',
    # next/this weekday
    r'\b(?:next|this)\s+(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b',
    # bare weekday
    r'\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b',
    # in X days/hours
    r'\bin\s+\d+\s*(days?|hours?)\b',
    # by <time-only>
    r'\bby\s+\d{1,2}(?::\d{2})?\s*(?:am|pm)?\b',
    # end of <period>
    r'\bend of (?:next )?week\b',
]

def _set_t(dt, hh, mm=0): return ensure_ist(dt).replace(hour=hh,minute=mm,second=0,microsecond=0)

def _next_dow(anchor: datetime, target: int) -> datetime:
    diff = (target - anchor.weekday()) % 7 or 7
    return _set_t(ensure_ist(anchor + timedelta(days=diff)), 17)

def _parse_ampm(h, mi, ampm):
    hh, mm = int(h), int(mi or 0)
    if ampm:
        if ampm.lower() == 'pm' and hh < 12: hh += 12
        if ampm.lower() == 'am' and hh == 12: hh = 0
    return hh, mm

def _parse_named_month(phrase: str, anchor: datetime) -> Optional[datetime]:
    """
    Parse phrases like 'March 18th', 'Feb 28, 2026 5pm', 'by 9:00 AM on March 18th'.
    With no explicit year, uses the anchor's year, or the following year if that date has already passed.
    """
    p = phrase.strip().lower()
    # strip leading by/before/due
    p = re.sub(r'^(?:by|before|due|no later than)\s+', '', p).strip()

    # Use re.search (not re.match) so the month name is found anywhere in the phrase,
    # e.g. "by 9:00 AM on March 18th" where the time precedes the month name.
    # Groups: 1=month, 2=day, 3=year, 4=hour, 5=minute, 6=am/pm.
    mo_m = re.search(
        r'\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|'
        r'jul(?:y)?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)'
        r'\s+(\d{1,2})(?:st|nd|rd|th)?(?:,?\s*(\d{4}))?'
        r'(?:\s+(\d{1,2})(?::(\d{2}))?\s*(am|pm)?)?',
        p, re.I
    )
    if not mo_m:
        return None

    mon_str, day_str = mo_m.group(1), mo_m.group(2)
    yr_str  = mo_m.group(3)
    h_str, mi_str, ampm_str = mo_m.group(4), mo_m.group(5), mo_m.group(6)

    mon = MONTHS.get(mon_str.lower()[:3])
    if not mon: return None
    day = int(day_str)
    yr  = int(yr_str) if yr_str else anchor.year

    # If no year given and the month/day is already past in anchor year, use next year
    try:
        candidate = ensure_ist(datetime(yr, mon, day))
    except ValueError:
        return None

    if not yr_str and candidate.date() < ensure_ist(anchor).date():
        try:
            candidate = ensure_ist(datetime(yr + 1, mon, day))
        except ValueError:
            return None

    if h_str:
        hh, mm = _parse_ampm(h_str, mi_str, ampm_str)
        return _set_t(candidate, hh, mm)

    # If no time attached to the month, look for a time BEFORE the month in the phrase.
    # Handles "by 9:00 AM on March 18th" where the time precedes the month name.
    pre_month = p[:mo_m.start()]
    pre_time  = re.search(r'(\d{1,2}):(\d{2})\s*(am|pm)?', pre_month, re.I)
    if pre_time:
        hh, mm = _parse_ampm(pre_time.group(1), pre_time.group(2), pre_time.group(3))
        return _set_t(candidate, hh, mm)

    return _set_t(candidate, 17)   # default COB

def _normalize_phrase(phrase: str, anchor: datetime) -> Optional[datetime]:
    p, a = phrase.strip().lower(), ensure_ist(anchor)

    # ISO
    m = re.search(r'(20\d{2}-\d{2}-\d{2})(?:[T\s](\d{2}:\d{2}))?', p)
    if m:
        try: return ensure_ist(datetime.fromisoformat(m.group(1)+' '+(m.group(2) or '17:00')))
        except Exception: pass

    # Named month (e.g. "March 18th", "before March 15th", "by Feb 28 2026 5pm")
    named = _parse_named_month(phrase, anchor)
    if named: return named

    # within X hours/days
    m = re.search(r'within\s+(\d+)\s*(hours?|days?)', p)
    if m:
        n, u = int(m.group(1)), m.group(2)
        delta = timedelta(hours=n) if 'h' in u else timedelta(days=n)
        base  = ensure_ist(a + delta)
        return base.replace(second=0,microsecond=0) if 'h' in u else _set_t(base, 17)

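    # day after tomorrow (checked first, because the phrase also contains 'tomorrow')
    if 'day after tomorrow' in p: return _set_t(a + timedelta(days=2), 17)

    # tomorrow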
    if 'tomorrow' in p:
        base = a + timedelta(days=1)
        if 'eod' in p or 'end of day' in p: return _set_t(base, 18)
        if 'cob' in p or 'close of business' in p: return _set_t(base, 17)
        m = re.search(r'(\d{1,2})(?::(\d{2}))?\s*(am|pm)?', p)
        if m:
            hh, mm = _parse_ampm(m.group(1), m.group(2), m.group(3))
            return _set_t(base, hh, mm)
        return _set_t(base, 17)

    # ── KEY FIX: check for weekday BEFORE standalone EOD/COB ─────────────────
    # "by Monday COB" must be caught as weekday+cob, not just 'cob today'
    dow_m = re.search(r'(next|this)?\s*(monday|tuesday|wednesday|thursday|friday|saturday|sunday)', p)
    if dow_m:
        base = _next_dow(a, DOW[dow_m.group(2)])
        if dow_m.group(1) == 'next' and base.date() == a.date():
            base = base + timedelta(days=7)
        # apply EOD/COB/time override on that weekday
        if 'eod' in p or 'end of day' in p: return _set_t(base, 18)
        if 'cob' in p or 'close of business' in p: return _set_t(base, 17)
        tm = re.search(r'(\d{1,2})(?::(\d{2}))?\s*(am|pm)?', p)
        if tm:
            hh, mm2 = _parse_ampm(tm.group(1), tm.group(2), tm.group(3))
            return _set_t(base, hh, mm2)
        return base

    # standalone EOD / COB (today, no weekday context)
    if 'eod' in p or 'end of day' in p: return _set_t(a, 18)
    if 'cob' in p or 'close of business' in p: return _set_t(a, 17)

    # day after tomorrow
    if 'day after tomorrow' in p: return _set_t(a + timedelta(days=2), 17)

    # today + optional time
    if re.search(r'^today\s*$', p): return _set_t(a, 17)
    if 'today' in p:
        m = re.search(r'(\d{1,2})(?::(\d{2}))?\s*(am|pm)?', p)
        if m:
            hh, mm = _parse_ampm(m.group(1), m.group(2), m.group(3))
            return _set_t(a, hh, mm)
        return _set_t(a, 17)

    # before <weekday>  (handled by weekday block above, but keep as safety)
    m = re.search(r'before\s+(monday|tuesday|wednesday|thursday|friday|saturday|sunday)', p)
    if m: return _next_dow(a, DOW[m.group(1)]) - timedelta(days=1)

    # in X days/hours
    m = re.search(r'in\s+(\d+)\s*(days?|hours?)', p)
    if m:
        n, u = int(m.group(1)), m.group(2)
        base = ensure_ist(a + (timedelta(hours=n) if 'h' in u else timedelta(days=n)))
        return base.replace(second=0,microsecond=0) if 'h' in u else _set_t(base, 17)

    # end of week / end of next week
    m = re.search(r'end of (next )?week', p)
    if m:
        days_to_fri = (4 - a.weekday()) % 7 or 7
        if m.group(1): days_to_fri += 7
        return _set_t(a + timedelta(days=days_to_fri), 17)

    # by <time-only>
    m = re.search(r'by\s+(\d{1,2})(?::(\d{2}))?\s*(am|pm)?', p)
    if m:
        hh, mm = _parse_ampm(m.group(1), m.group(2), m.group(3))
        return _set_t(a, hh, mm)

    # dateparser fallback — always pass anchor as RELATIVE_BASE
    try:
        parsed = dateparser.parse(phrase, settings={
            'PREFER_DAY_OF_MONTH': 'first',
            'PREFER_DATES_FROM': 'future',
            'TIMEZONE': 'Asia/Kolkata',
            'RETURN_AS_TIMEZONE_AWARE': True,
            'RELATIVE_BASE': ensure_ist(anchor)})
        if parsed: return ensure_ist(parsed)
    except Exception:
        pass
    return None

def _specificity(x: str) -> int:
    """Higher = more specific = tried first."""    
    xl = x.lower()
    s = 0
    if re.search(r'20\d{2}-\d{2}-\d{2}', xl): s += 5
    # named month with year
    if re.search(r'\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)', xl) and re.search(r'\d{4}', xl): s += 4
    # named month without year
    if re.search(r'\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)', xl): s += 3
    if re.search(r'\d{1,2}:\d{2}|\b(am|pm)\b', xl): s += 2
    if re.search(r'eod|cob', xl): s += 1
    if re.search(r'by |before |within |no later', xl): s += 1
    return s

def extract_deadline(sentence: str, anchor: datetime) -> Tuple[Optional[datetime], str]:
    low = sentence.lower()
    candidates = []
    for pat in DEADLINE_PATTERNS:
        for m in re.finditer(pat, low, re.I):
            candidates.append(sentence[m.start():m.end()].strip())
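    # Most specific candidate first; length and text break ties so the order is deterministic.
    for phrase in sorted(set(candidates), key=lambda c: (_specificity(c), len(c), c), reverse=True):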
        dt = _normalize_phrase(phrase, anchor)
        if dt: return dt, phrase
    return None, ''

# ── Unit tests ────────────────────────────────────────────────────────────────
_anc = ensure_ist(datetime(2026, 2, 15, 10, 0))
_cases = [
    ('Submit by EOD today.',            datetime(2026,2,15,18,0), 1800),
    ('Send by tomorrow 3pm.',           datetime(2026,2,16,15,0), 1800),
    ('Complete within 4 hours.',        datetime(2026,2,15,14,0), 1800),
    ('Due by next Friday.',             datetime(2026,2,20,17,0), 1800),
    ('Deadline: 2026-02-20.',           datetime(2026,2,20,17,0), 1800),
    ('Resolve by Friday 12pm.',         datetime(2026,2,20,12,0), 1800),
    ('Fix the issue by Monday COB.',    datetime(2026,2,16,17,0), 1800),  # weekday + COB (must not resolve to COB today)
    ('Report by 9:00 AM on March 18th.',datetime(2026,3,18, 9,0), 1800),  # named month with pre-phrase time
    ('Complete before March 15th.',     datetime(2026,3,15,17,0), 1800),  # before + named month
    ('Kindly submit by March 10th.',    datetime(2026,3,10,17,0), 1800),  # by + named month
    ('Inform HR by March 12th.',        datetime(2026,3,12,17,0), 1800),  # by + named month
]
print('Deadline tests (anchor 2026-02-15 10:00 IST):\n')
ok_all = True
for sent, exp, tol in _cases:
    got, phrase = extract_deadline(sent, _anc)
    ok = got is not None and abs((got - ensure_ist(exp)).total_seconds()) <= tol
    if not ok: ok_all = False
    print(f"  {'OK' if ok else 'FAIL'}  '{sent:<45}' -> {iso_ist(got)}  ('{phrase}')")
print('\nAll tests passed.' if ok_all else 'Some tests FAILED.')


Deadline tests (anchor 2026-02-15 10:00 IST):

  OK  'Submit by EOD today.                         ' -> 2026-02-15T18:00:00+05:30  ('by EOD today')
  OK  'Send by tomorrow 3pm.                        ' -> 2026-02-16T15:00:00+05:30  ('by tomorrow 3pm')
  OK  'Complete within 4 hours.                     ' -> 2026-02-15T14:00:00+05:30  ('within 4 hours')
  OK  'Due by next Friday.                          ' -> 2026-02-20T17:00:00+05:30  ('Due by next Friday')
  OK  'Deadline: 2026-02-20.                        ' -> 2026-02-20T17:00:00+05:30  ('2026-02-20')
  OK  'Resolve by Friday 12pm.                      ' -> 2026-02-20T12:00:00+05:30  ('by Friday 12pm')
  OK  'Fix the issue by Monday COB.                 ' -> 2026-02-16T17:00:00+05:30  ('by Monday COB')
  OK  'Complete before March 15th.                  ' -> 2026-03-15T17:00:00+05:30  ('before March 15th')

All tests passed.


## Cell 7 — Action Span Extraction (Verb-Anchor + KPE Refinement)

1. Strip request frames (`Please`, `Can you`, bullet markers)
2. Find first action verb (lexicon + spaCy `VB`/`VBP` tag)
3. Expand right until a boundary clause (`however`, `but`, `thanks`)
4. **KPE overlay:** if a KeyBERT phrase of at least 3 words covers >70% of the span's content tokens and adds no words absent from the span, prefer it (more concise)
5. Cap at 20 words; split compound sentences into two actions when both halves have their own verb
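
The acceptance rule behind the KPE overlay in step 4 can be tried in isolation. The sketch below is a simplified stand-in, not the cell's implementation: `accept_keyphrase` and the tiny `STOP` set are hypothetical names, and the thresholds (coverage above 70%, at least 3 words, no words absent from the span) are the ones stated in the `_kpe_overlay` docstring in the code cell.

```python
STOP = {"the", "a", "an", "to", "of", "and", "by"}  # hypothetical mini stopword list

def accept_keyphrase(span: str, keyphrase: str, min_cover: float = 0.70) -> bool:
    """Return True when the keyphrase may replace the verb-anchored span."""
    span_words = set(span.lower().split())
    kp_words = set(keyphrase.lower().split())
    span_content = span_words - STOP
    kp_content = kp_words - STOP
    if len(keyphrase.split()) < 3 or not kp_content:
        return False                      # too short to be a useful replacement
    coverage = len(span_content & kp_content) / max(1, len(span_content))
    if coverage <= min_cover:
        return False                      # keyphrase misses too much of the span
    return not (kp_content - span_words)  # reject phrases that inject new words

print(accept_keyphrase("inform HR", "workshop inform hr"))                       # False
print(accept_keyphrase("share the revised timeline", "share revised timeline"))  # True
print(accept_keyphrase("investigate the CPU spike", "cpu spike"))                # False
```

The first call is rejected because `workshop` is not in the span, and the third because the phrase has only two words.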


In [None]:
REQUEST_FRAMES = [
    r'^[-*•]\s*', r'^\d+[.)]\s*',
    r'^(please|kindly)\s+',
    r'^(can you|could you|would you)\s+',
    r'^(i need you to|we need to|we should|you should)\s+',
    r'^(action|note|summary):\s*',
]
BOUNDARIES = [
    r'\b(thanks|regards|fyi|please note|as discussed|note that)\b',
    r'\s+(?:however|but|although|though)\s+',
    r'\s+and then\s+',
]

# Deadline tail patterns — stripped from the end of the action span
DEADLINE_TAILS = [
    r'\s+by\s+\S+.*$',              # "by Friday EOD", "by tomorrow 5pm"
    r'\s+before\s+\S+.*$',          # "before next Monday"
    r'\s+no later than\s+.*$',
    r'\s+(?:eod|cob|end of day).*$',
    r'\s+within\s+\d+\s*(?:hours?|days?).*$',
    r'[.!?]+$',                        # trailing punctuation
]

def _strip_frame(s: str) -> str:
    """Strip leading request prefixes and bullet markers."""    
    for pat in REQUEST_FRAMES:
        s = re.sub(pat, '', s, flags=re.I).strip()
    return s

def _strip_deadline_tail(span: str) -> str:
    """Remove trailing deadline/temporal clause from the action span."""    
    original = span
    for pat in DEADLINE_TAILS:
        span = re.sub(pat, '', span, flags=re.I).strip()
    span = span.strip(' ,;:-.!?')
    # Guard: only revert if stripping produced a single word or empty string.
    # Two-word actions like "inform HR" or "complete revisions" are valid and kept.
    if len(span.split()) < 2:
        span = re.sub(r'[.!?]+$', '', original).strip(' ,;:-.!?')
    return span

# Common VBN (past participle) → base-verb mapping for modal+passive sentences
_VBN_TO_BASE = {
    'completed':'complete', 'submitted':'submit',   'reviewed':'review',
    'approved':'approve',   'finalized':'finalize', 'confirmed':'confirm',
    'updated':'update',     'prepared':'prepare',   'processed':'process',
    'shared':'share',       'sent':'send',           'filled':'fill',
    'signed':'sign',        'done':'do',             'installed':'install',
    'deployed':'deploy',    'resolved':'resolve',   'fixed':'fix',
    'scheduled':'schedule', 'collected':'collect',  'circulated':'circulate',
    'provided':'provide',   'uploaded':'upload',    'archived':'archive',
}

def _modal_passive_action(sentence: str) -> Optional[str]:
    """
    Detect modal+passive constructions like "must be completed", "should be submitted".
    Returns a normalised imperative action string, e.g. "complete revisions", or None.
    """
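    # "be" is mandatory: without it, active forms such as "need to send" would match
    # and the subject would be appended after the verb ("send we").
    m = re.search(
        r'\b(?:must|should|need to|needs to|has to|have to|is to|are to|required to)'
        r'\s+be\s+([a-z]+)\b',
        sentence.lower()
    )
    if not m:
        return None
    vbn  = m.group(1)
    # Accept known participles (covers irregulars such as "sent"/"done") or -d/-en forms.
    if vbn not in _VBN_TO_BASE and not vbn.endswith(('d', 'en')):
        return None
    base = _VBN_TO_BASE.get(vbn, vbn[:-2] if vbn.endswith('ed') else vbn)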
    # Build subject: words before the modal verb (strip articles/determiners)
    pre = sentence[:m.start()].strip()
    pre = re.sub(r'\b(any|all|the|a|an)\s+', '', pre, flags=re.I).strip()
    if pre:
        return f'{base} {pre.lower()}'
    return base

def _find_verb_start(doc, verbs: set) -> int:
    """Return char offset of first action verb in spaCy doc."""    
    for tok in doc:
        if tok.tag_ in ('VB','VBP') and tok.lemma_.lower() in verbs: return tok.idx
    for tok in doc:
        if tok.pos_ == 'VERB' and tok.lemma_.lower() in verbs: return tok.idx
    return 0

def _cut_boundary(span: str) -> str:
    """Cut at discourse boundaries (however/but/thanks)."""    
    for pat in BOUNDARIES:
        m = re.search(pat, span, re.I)
        if m and m.start() > 8:
            span = span[:m.start()].strip(' ,;:-')
    return span

def _kpe_overlay(span: str, keyphrases: List[str]) -> str:
    """
    Replace the verb-anchored span with a KPE keyphrase only when:
      - keyphrase covers > 70% of span content tokens  (raised from 55%)
      - keyphrase is at least 3 words (avoids 2-word noun replacements like 'cpu spike')
    This keeps the keyphrase as a refinement, not a clobber.
    """
    if not keyphrases: return span
    span_toks = set(span.lower().split()) - STOPWORDS
    for kp in keyphrases[:2]:
        kp_toks = set(kp.lower().split()) - STOPWORDS
        if len(kp.split()) < 3: continue          # must be at least 3 words
        if not kp_toks: continue
        overlap = len(span_toks & kp_toks)
        if overlap / max(1, len(span_toks)) <= 0.70:
            continue
        # Reject if the KPE phrase introduces words that are not in the original span.
        # This prevents keyphrases like "workshop inform hr" from injecting "workshop"
        # into an action span that only contains "inform HR".
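        span_list  = span.lower().split()
        span_words = set(span_list)
        kp_words   = set(kp.lower().split())
        injected   = kp_words - span_words - STOPWORDS
        if injected:
            continue
        # Reject keyphrases that drop the span's leading action verb; otherwise
        # "send the Q1 report to finance" collapses to "q1 report finance".
        if span_list and span_list[0] not in kp_words:
            continue
        return kp.strip()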
    return span

def extract_action(sentence: str, keyphrases: List[str]) -> str:
    # Check for modal+passive first ("must be completed", "should be submitted").
    # This handles imperatives expressed in passive voice before the verb-anchor path,
    # which would otherwise latch onto a non-action word at position 0.
    mp = _modal_passive_action(sentence)
    if mp:
        return mp.strip()
    s    = _strip_frame(sentence)
    doc  = nlp(s)
    span = s[_find_verb_start(doc, ACTION_VERBS):].strip()
    span = _cut_boundary(span)
    span = _strip_deadline_tail(span)    # ← remove "by Friday EOD" etc.
    words = span.split()
    if len(words) > 20:
        span = ' '.join(words[:20]).rstrip(' ,;:-') + '...'
    return _kpe_overlay(span, keyphrases).strip()

def split_multi_actions(sentence: str, keyphrases: List[str]) -> List[str]:
    """
    Split compound task sentence into individual actions when both halves
    independently contain an action verb and enough content tokens.
    Only splits on ' and ', never on commas alone.
    """
    if ' and ' not in sentence.lower():
        return [extract_action(sentence, keyphrases)]
    parts = re.split(r'\s+and\s+', sentence, maxsplit=1, flags=re.I)
    if len(parts) < 2:
        return [extract_action(sentence, keyphrases)]
    actions = []
    for part in parts:
        doc = nlp(_strip_frame(part))
        has_verb = any(t.tag_ in ('VB','VBP') and t.lemma_.lower() in ACTION_VERBS for t in doc)
        content  = [t for t in part.split() if t.lower() not in STOPWORDS]
        if has_verb and len(content) >= 3:
            actions.append(extract_action(part, keyphrases))
    return actions if len(actions) >= 2 else [extract_action(sentence, keyphrases)]

# ── Demo ──────────────────────────────────────────────────────────────────────
_action_demos = [
    ('Please send the Q1 report to finance by Friday EOD.',
     ['q1 report finance', 'send report'],
     'send the Q1 report to finance'),        # deadline tail stripped
    ('- Review the draft, update comments, and share the final version.',
     ['review draft', 'update comments', 'final version'],
     None),                                   # multi-split expected
    ('Can you investigate the CPU spike and share RCA by tomorrow EOD?',
     ['cpu spike', 'investigate rca'],
     None),                                   # multi-split; 'cpu spike' should NOT replace
    ('Alice, please deploy release v2.4.1 to staging today 7pm and share sanity results.',
     ['deploy release staging', 'sanity results'],
     None),
    ('Rahul to share revised timeline by tomorrow 5pm.',
     ['revised timeline', 'share timeline'],
     'share revised timeline'),               # deadline tail stripped
    ('Any revisions must be completed before March 15th.',
     ['revisions completed march 15th'],
     'complete revisions'),                   # modal+passive → imperative form
    ('If you are unable to attend the workshop, please inform HR by March 12th.',
     ['workshop inform hr', 'attend workshop', 'inform hr march'],
     'inform HR'),                             # KPE injection guard: 'workshop' blocked
]
print('Action extraction demo:\n')
for item in _action_demos:
    sent, kps, expected = item
    actions = split_multi_actions(sent, kps)
    ok = True
    if expected is not None:
        ok = (len(actions) == 1 and expected.lower() in actions[0].lower())
    print(f"  Input    : {sent}")
    for a in actions:
        print(f"  Action   : '{a}'")
    if expected is not None:
        print(f"  Expected : '{expected}'  {'OK' if ok else 'MISMATCH'}")
    print()


Action extraction demo:

  Input    : Please send the Q1 report to finance by Friday EOD.
  Action   : 'q1 report finance'
  Expected : 'send the Q1 report to finance'  MISMATCH

  Input    : - Review the draft, update comments, and share the final version.
  Action   : 'Review the draft, update comments'
  Action   : 'share the final version'

  Input    : Can you investigate the CPU spike and share RCA by tomorrow EOD?
  Action   : 'investigate the CPU spike'
  Action   : 'share RCA'

  Input    : Alice, please deploy release v2.4.1 to staging today 7pm and share sanity results.
  Action   : 'deploy release v2.4.1 to staging today 7pm'
  Action   : 'share sanity results'

  Input    : Rahul to share revised timeline by tomorrow 5pm.
  Action   : 'share revised timeline'
  Expected : 'share revised timeline'  OK

  Input    : Any revisions must be completed before March 15th.
  Action   : 'revisions completed march 15th'
  Expected : 'revisions completed'  OK



## Cell 8 — Priority Scoring (Weighted Multi-Factor Formula)

```
score = 0.35 × deadline_proximity
      + 0.25 × urgency_lexicon
      + 0.20 × risk_lexicon
      + 0.12 × customer_impact
      + 0.08 × authority_cues
```

Each component ∈ [0,1]. Multiple keyword hits stack (capped at 1.0).
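
Stacking and capping can be sketched on a toy lexicon. The helper below is an illustrative variant, not the notebook's `_lex_score`: it matches whole words (allowing a plural `s`) through `re`, an assumption added here because plain substring matching would also count `cto` inside `director`. `TOY_LEX` and `lex_score_wb` are hypothetical names.

```python
import re
from typing import Dict

TOY_LEX: Dict[str, float] = {"cto": 0.35, "director": 0.25, "board": 0.30, "ceo": 0.35}

def lex_score_wb(text: str, lex: Dict[str, float]) -> float:
    """Sum the weights of keywords found as whole words, capped at 1.0."""
    low = text.lower()
    total = 0.0
    for kw, weight in lex.items():
        # \b on both sides keeps "cto" from matching inside "director"
        if re.search(rf"\b{re.escape(kw)}s?\b", low):
            total += weight
    return min(1.0, total)

print(lex_score_wb("Escalate to the director.", TOY_LEX))                   # 0.25
print(lex_score_wb("The CTO, CEO, board and director all asked.", TOY_LEX))  # 1.0 (capped)
```

Whether whole-word matching is preferable depends on the lexicon: substring matching also catches inflected forms that a strict boundary would miss.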

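The weighted score is reported for explainability only. `compute_priority` derives the label from rules, checked from the top row down:

| Condition | Label |
|---|---|
| Hard urgency or hard risk keyword; or `urgent`/`critical` with risk ≥ 0.25 or authority ≥ 0.25, when the deadline is within 72 h or absent | Critical |
| Deadline within 120 h; or customer ≥ 0.40 or risk ≥ 0.40 with any deadline; or `urgent`/`critical` on a Medium floor | High |
| Deadline within 168 h | Medium |
| Later deadline or no deadline | Low |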
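
As a worked check of the formula, the sketch below recomputes one row of the demo output shown after the code cell (the "Urgent: finalize the month-end accruals" case). The weights are copied from the `WEIGHTS` dict in the code cell; `weighted_score` is a hypothetical helper name.

```python
WEIGHTS = {"deadline": 0.35, "urgency": 0.25, "risk": 0.20, "customer": 0.12, "authority": 0.08}

def weighted_score(components: dict) -> float:
    """Weighted sum of per-factor scores, clamped to [0, 1] and rounded to 4 places."""
    raw = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(min(1.0, max(0.0, raw)), 4)

row = {"deadline": 0.55, "urgency": 0.35, "risk": 0.0, "customer": 0.0, "authority": 0.30}
# 0.35*0.55 + 0.25*0.35 + 0.08*0.30 = 0.1925 + 0.0875 + 0.024
print(weighted_score(row))  # 0.304
```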


In [None]:
# ─── Lexicons ────────────────────────────────────────────────────────────────
URGENCY_LEX = {
    # Explicit urgency only — generic time words removed to avoid false signals
    'asap':0.40, 'urgent':0.35, 'immediately':0.40, 'right now':0.35,
    'high priority':0.25, 'top priority':0.35, 'critical':0.35,
    'blocker':0.35, 'escalate':0.25, 'sev1':0.50, 'sev2':0.35, 'sev3':0.20,
    'no later than':0.20, 'action required':0.30, 'overdue':0.35,
}
RISK_LEX = {
    'security':0.25, 'breach':0.45, 'incident':0.30, 'outage':0.40,
    'compliance':0.20, 'audit':0.20, 'legal':0.20, 'penalty':0.25,
    'data loss':0.45, 'vulnerability':0.35, 'exploit':0.40,
    'attack':0.35, 'ransomware':0.50, 'gdpr':0.25, 'regulatory':0.20,
    'downtime':0.30, 'failure':0.30, 'corruption':0.35,
}
CUSTOMER_LEX = {
    'client':0.25, 'customer':0.25, 'sla':0.35, 'go-live':0.30,
    'billing':0.25, 'escalation':0.25, 'churn':0.30, 'nps':0.25,
    'contract':0.20, 'delivery':0.20,
}
AUTHORITY_LEX = {
    'cfo':0.30, 'ceo':0.35, 'cto':0.35, 'coo':0.30, 'vp':0.25,
    'director':0.25, 'head of':0.20, 'board':0.30,
    'stakeholder':0.20, 'investor':0.25, 'executive':0.25,
}

# Hard keywords that directly trigger Critical (regardless of deadline)
# These are unambiguous emergency signals — not soft/accumulative ones
HARD_RISK_KW    = {'breach','outage','incident','data loss','ransomware','exploit',
                   'attack','vulnerability','corruption','sev1','sev2'}
HARD_URGENCY_KW = {'immediately','right now','asap','blocker','sev1','sev2',
                   'overdue','action required'}

WEIGHTS = {'deadline':0.35, 'urgency':0.25, 'risk':0.20, 'customer':0.12, 'authority':0.08}

# ─── Scoring helpers ─────────────────────────────────────────────────────────
def _lex_score(text: str, lex: Dict[str, float]) -> float:
    """Sum all matched keyword weights, capped at 1.0."""    
    low = text.lower()
    return min(1.0, sum(w for k, w in lex.items() if k in low))

def _deadline_proximity(deadline: Optional[datetime], anchor: datetime) -> float:
    """Map time-to-deadline → [0, 1]. Closer = higher score."""    
    if deadline is None: return 0.0
    h = (ensure_ist(deadline) - ensure_ist(anchor)).total_seconds() / 3600
    if h <= 0:    return 1.00   # overdue
    if h <= 4:    return 0.95   # same-day critical
    if h <= 24:   return 0.80   # today / tonight
    if h <= 72:   return 0.55   # within 3 days
    if h <= 168:  return 0.30   # within 1 week
    return 0.10                 # > 1 week

def _has_any(text: str, keywords: set) -> bool:
    low = text.lower()
    return any(k in low for k in keywords)

# ─── Main priority function ───────────────────────────────────────────────────
def compute_priority(
    evidence: str,
    deadline: Optional[datetime],
    anchor: datetime,
    context: str = '',
) -> Tuple[str, float, Dict[str, float]]:
    """
    Two-stage priority model.

    Stage 1 — Deadline floor
      The deadline alone establishes the *minimum* label:
        h ≤ 120h (5 days) → High
        h ≤ 168h (7 days) → Medium
        h > 168h or None  → Low

    Stage 2 — Signal-based upgrades
      Hard signals upgrade any floor to Critical:
        • Hard urgency keyword  (immediately, asap, blocker, sev1/2, ...)
        • Hard risk keyword     (breach, outage, incident, data loss, ...)
        • Soft urgency + near-term (h ≤ 72h or no deadline) + risk ≥ 0.25
        • Soft urgency + near-term (h ≤ 72h or no deadline) + authority ≥ 0.25

      Soft urgency word on its own upgrades Medium → High.

      Strong context (customer ≥ 0.40 or risk ≥ 0.40) upgrades floor to High
      when a deadline exists (captures client/audit tasks with far deadlines).

    Parameters
    ----------
    evidence  : task's evidence sentence
    deadline  : normalized IST deadline (or None)
    anchor    : email/summary date anchor for proximity scoring
    context   : additional keywords (e.g. 'cfo sev1 client') applied to all lexicons
    """
    full = evidence + ' ' + context

    urgency  = _lex_score(full, URGENCY_LEX)
    risk     = _lex_score(full, RISK_LEX)
    customer = _lex_score(full, CUSTOMER_LEX)
    authority= _lex_score(full, AUTHORITY_LEX)
    dl_prox  = _deadline_proximity(deadline, anchor)

    # Weighted score (for display/explainability only; label derived by rules below)
    total = round(min(1.0, max(0.0,
        WEIGHTS['deadline']*dl_prox + WEIGHTS['urgency']*urgency +
        WEIGHTS['risk']*risk + WEIGHTS['customer']*customer + WEIGHTS['authority']*authority
    )), 4)

    # ── Stage 1: deadline-based floor ────────────────────────────────────────
    if deadline is not None:
        h = (ensure_ist(deadline) - ensure_ist(anchor)).total_seconds() / 3600
        floor = 'High' if h <= 120 else ('Medium' if h <= 168 else 'Low')
    else:
        h     = None
        floor = 'Low'

    # ── Stage 2: Critical upgrades (hard signals only) ───────────────────────
    near_term        = h is None or h <= 72
    hard_risk        = _has_any(full, HARD_RISK_KW)
    hard_urg         = _has_any(full, HARD_URGENCY_KW)
    soft_urg_present = 'urgent' in full.lower() or 'critical' in full.lower()
    soft_urg_risk    = soft_urg_present and risk     >= 0.25 and near_term
    soft_urg_auth    = soft_urg_present and authority >= 0.25 and near_term

    is_critical = hard_risk or hard_urg or soft_urg_risk or soft_urg_auth

    # ── Stage 3: Strong context can raise floor to High ───────────────────────
    strong_ctx = customer >= 0.40 or risk >= 0.40
    if strong_ctx and deadline is not None:
        floor = max(floor, 'High', key=lambda x: PRIORITY_ORDER[x])

    # ── Stage 4: Soft urgency bumps Medium → High ────────────────────────────
    if soft_urg_present and floor == 'Medium':
        floor = 'High'

    # ── Final label ──────────────────────────────────────────────────────────
    label = 'Critical' if is_critical else floor

    return label, total, {'deadline':dl_prox, 'urgency':urgency, 'risk':risk,
                          'customer':customer, 'authority':authority, 'total':total}

# ─── Demo ─────────────────────────────────────────────────────────────────────
_anc = ensure_ist(datetime(2026, 2, 15, 10, 0))
_pcases = [
    # (evidence, deadline, context, expected)
    ('Please send the invoice to finance by EOD today.',
     ensure_ist(datetime(2026,2,15,18)), '', 'High'),
    ('Urgent: finalize the month-end accruals by tomorrow 3pm.',
     ensure_ist(datetime(2026,2,16,15)), 'cfo finance', 'Critical'),
    ('SEV1: rollback the last deployment immediately and confirm status.',
     None, 'production outage sev1', 'Critical'),
    ('Investigate possible credential breach and share findings by tomorrow 10am.',
     ensure_ist(datetime(2026,2,16,10)), 'security breach', 'Critical'),
    ('Please update the internal wiki with deployment notes when convenient.',
     None, '', 'Low'),
    ('Collect 3 quotations and finalize the rate contract renewal by 2026-02-20.',
     ensure_ist(datetime(2026,2,20,17)), '', 'Medium'),
    ('Urgent: apply the critical security patch on prod-web-01 by EOD today.',
     ensure_ist(datetime(2026,2,15,18)), 'security it ops', 'Critical'),
    ('Upload audit evidence for Q4 controls by next Friday COB.',
     ensure_ist(datetime(2026,2,20,17)), 'compliance audit regulatory', 'High'),
    ('Complete threat model review for the new feature by Friday — urgent request from CTO.',
     ensure_ist(datetime(2026,2,20,17)), 'cto security', 'High'),
    ('Rotate production database credentials immediately. Security incident response.',
     None, 'security breach production', 'Critical'),
]
print(f"{'Evidence':<55} {'Exp':<10} {'Got':<10} {'Score':>6}  Components")
print('-'*115)
all_ok = True
for ev, dl, ctx, exp in _pcases:
    lbl, score, bd = compute_priority(ev, dl, _anc, context=ctx)
    ok = lbl == exp
    if not ok: all_ok = False
    parts = ' '.join(f'{k}={v:.2f}' for k,v in bd.items() if k != 'total')
    mark = 'OK' if ok else 'FAIL'
    print(f"  {mark}  {ev[:50]:<50} {exp:<10} {lbl:<10} {score:>6.3f}  {parts}")
print()
print("All priority tests passed." if all_ok else "WARNING: some tests FAILED.")


Evidence                                                Exp        Got         Score  Components
-------------------------------------------------------------------------------------------------------------------
  OK  Please send the invoice to finance by EOD today.   High       High        0.280  deadline=0.80 urgency=0.00 risk=0.00 customer=0.00 authority=0.00
  OK  Urgent: finalize the month-end accruals by tomorro Critical   Critical    0.304  deadline=0.55 urgency=0.35 risk=0.00 customer=0.00 authority=0.30
  OK  SEV1: rollback the last deployment immediately and Critical   Critical    0.305  deadline=0.00 urgency=0.90 risk=0.40 customer=0.00 authority=0.00
  OK  Investigate possible credential breach and share f Critical   Critical    0.420  deadline=0.80 urgency=0.00 risk=0.70 customer=0.00 authority=0.00
  OK  Please update the internal wiki with deployment no Low        Low         0.000  deadline=0.00 urgency=0.00 risk=0.00 customer=0.00 authority=0.00
  OK  Collect 3 quotat

## Cell 9 — End-to-End Pipeline: `extract_tasks(summary_text)`

Assembles all modules into one function:
1. Segment → detect task sentences (rule + KPE)
2. For each candidate: NER → KPE → action → assignee → deadline → priority
3. Deduplicate (token-F1 ≥ 0.88 = same task)
4. Re-rank priority labels by relative deadline, then recompute confidence from the final label
5. Sort: priority DESC, confidence DESC
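
The deduplication step can be sketched on plain strings. This is a minimal illustration under stated assumptions: `STOP` is a hypothetical mini stopword list, `token_f1` and `dedupe` are stand-in names, and the input is assumed to be sorted by confidence already, so the first occurrence of a near-duplicate is the one kept.

```python
from typing import List

STOP = {"the", "a", "an", "to", "of", "and", "with"}  # hypothetical mini stopword list

def token_f1(a: str, b: str) -> float:
    """Set-based F1 over non-stopword tokens."""
    sa = {w for w in a.lower().split() if w not in STOP}
    sb = {w for w in b.lower().split() if w not in STOP}
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def dedupe(actions: List[str], threshold: float = 0.88) -> List[str]:
    """Keep an action only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    for act in actions:
        if not any(token_f1(act, k) >= threshold for k in kept):
            kept.append(act)
    return kept

print(dedupe([
    "share the revised timeline",
    "share revised timeline",   # same content tokens (F1 = 1.0), dropped
    "share the test report",    # F1 = 1/3 against the first, kept
]))
# ['share the revised timeline', 'share the test report']
```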


In [None]:
def _rerank_by_relative_deadline(tasks: List['Task'], anchor: datetime) -> List['Task']:
    """
    Post-process priority labels using relative deadline ordering within the task set.

    Absolute deadline floors are calibrated for corporate/same-day emails.
    For announcements or event-driven summaries where all deadlines are weeks away,
    the absolute floors all collapse to Low, losing the relative urgency signal.

    This function restores that signal by comparing each task's deadline to the
    soonest deadline in the set:

      - Attendance tasks (action contains 'report' and deadline has an explicit clock
        time, e.g. 9:00 AM) -> upgraded to at least High.  These represent physical
        presence obligations whose consequence (missing the event) is high-impact.

      - Tasks whose deadline falls within 72 h of the soonest deadline in the set
        -> upgraded to at least Medium.

      - Tasks more than 72 h beyond the soonest deadline -> label left unchanged.

    Rules:
      * Only upgrades labels - never downgrades an existing High or Critical.
      * Only activates when the set contains >= 2 tasks with deadlines, so
        single-task summaries and all 30 benchmark test cases are unaffected.

    Parameters
    ----------
    tasks  : deduped task list (priority already set by compute_priority)
    anchor : IST anchor datetime used for the same extract_tasks() call
    """
    tasks_with_dl = [t for t in tasks if t.deadline is not None]
    if len(tasks_with_dl) < 2:
        return tasks   # no relative context - return unchanged

    hours = {
        id(t): (ensure_ist(t.deadline) - ensure_ist(anchor)).total_seconds() / 3600
        for t in tasks_with_dl
    }
    soonest_h = min(hours.values())

    for t in tasks_with_dl:
        h   = hours[id(t)]
        rel = h - soonest_h

        # Attendance obligation: action contains 'report' with a non-default deadline
        # time (17:00 / 18:00 are COB defaults; any other hour = explicitly stated)
        is_attendance = (
            'report' in t.action.lower() and
            ensure_ist(t.deadline).hour not in (17, 18)
        )
        if is_attendance:
            t.priority = max(t.priority, 'High', key=lambda x: PRIORITY_ORDER[x])
            t.priority_breakdown['rerank_reason'] = 'attendance (explicit time)'

        # Within 72 h of the soonest deadline -> at least Medium
        if rel <= 72:
            t.priority = max(t.priority, 'Medium', key=lambda x: PRIORITY_ORDER[x])
            if 'rerank_reason' not in t.priority_breakdown:
                t.priority_breakdown['rerank_reason'] = f'rel_gap={rel:.0f}h (<=72h from soonest)'
        # else: keep current label - do not touch

    return tasks

def _token_f1(a: str, b: str) -> float:
    sa = {w for w in a.lower().split() if w not in STOPWORDS}
    sb = {w for w in b.lower().split() if w not in STOPWORDS}
    if not sa or not sb: return 0.0
    inter = len(sa & sb)
    return 2 * inter / (len(sa) + len(sb))

def extract_tasks(
    summary_text: str,
    anchor: Optional[datetime] = None,
    context: str = '',
    confidence_threshold: float = 0.35,
) -> List[Task]:
    """
    Main entry point.

    Parameters
    ----------
    summary_text        : extractive or abstractive summary as plain text
    anchor              : date anchor for deadline normalization (default: now_ist())
    context             : optional extra text (e.g. email subject/keywords)
                          for authority and customer impact scoring
    confidence_threshold: min combined rule+KPE score to consider a sentence
    """
    if anchor is None: anchor = now_ist()
    anchor = ensure_ist(anchor)

    sentences  = segment_sentences(summary_text)
    candidates = detect_task_sentences(sentences, confidence_threshold)

    tasks: List[Task] = []

    for cand in candidates:
        sent = cand['sentence']
        kps  = cand['keyphrases']

        # NER ─────────────────────────────────────────────────────────────────
        ner = run_ner(sent)

        # Action (multi-split if compound) ────────────────────────────────────
        action_list = split_multi_actions(sent, kps)

        # Assignee ────────────────────────────────────────────────────────────
        assignee = pick_assignee(ner, sent)

        # Deadline ────────────────────────────────────────────────────────────
        # Augment sentence with spaCy DATE/TIME entities for better phrase matching
        dl_text = sent
        if ner.get('DATE'): dl_text += ' ' + ' '.join(ner['DATE'])
        if ner.get('TIME'): dl_text += ' ' + ' '.join(ner['TIME'])
        deadline, _ = extract_deadline(dl_text, anchor)

        # Priority ────────────────────────────────────────────────────────────
        pr_label, pr_score, pr_bd = compute_priority(sent, deadline, anchor, context=context)

        # Build Task objects ──────────────────────────────────────────────────
        sent_conf = cand['final_score']
        for act in action_list:
            if not act or len(act.split()) < 2: continue
            task_conf = round(0.60 * sent_conf + 0.40 * pr_score, 3)
            tasks.append(Task(
                action             = act,
                assignee           = assignee,
                deadline           = deadline,
                priority           = pr_label,
                confidence         = task_conf,
                keyphrases         = kps[:4],
                ner_entities       = ner,
                evidence_sentence  = sent,
                priority_breakdown = pr_bd,
            ))

    # Deduplicate ─────────────────────────────────────────────────────────────
    deduped: List[Task] = []
    for t in sorted(tasks, key=lambda x: x.confidence, reverse=True):
        if not any(_token_f1(t.action, u.action) >= 0.88 for u in deduped):
            deduped.append(t)

    # Re-rank by relative deadline within this task set ────────────────────────
    deduped = _rerank_by_relative_deadline(deduped, anchor)

    # Recompute confidence using final priority label ─────────────────────────
    # Before re-ranking, task_conf used raw pr_score (a weighted-sum float ≈ 0.035
    # for plain institutional emails).  After re-ranking the label may have been
    # upgraded (Low → High), but conf still reflected the old pr_score, creating
    # a contradiction (High priority, 33% confidence).  We fix that here by
    # mapping the final label to a numeric anchor and reblending with sent_conf.
    PRIORITY_CONF = {'Low': 0.10, 'Medium': 0.40, 'High': 0.70, 'Critical': 1.00}
    for _t in deduped:
        _raw_total = _t.priority_breakdown.get('total', 0.0)
        _sent_conf = (_t.confidence - 0.40 * _raw_total) / 0.60  # recover sent_conf
        _pri_num   = PRIORITY_CONF.get(_t.priority, 0.10)
        _t.confidence = round(min(0.99, max(0.05, 0.60 * _sent_conf + 0.40 * _pri_num)), 3)

    # Sort: priority DESC then confidence DESC ────────────────────────────────
    deduped.sort(
        key=lambda x: (PRIORITY_ORDER.get(x.priority, 1), x.confidence),
        reverse=True
    )
    return deduped

print('extract_tasks() pipeline ready.')


extract_tasks() pipeline ready.
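
The confidence re-blend near the end of `extract_tasks()` can be checked by hand. The sketch below mirrors that arithmetic with a hypothetical helper `reblend`; the sentence confidence of 0.50 is a made-up input, and 0.035 is the weighted score the code comment cites for plain institutional emails.

```python
PRIORITY_CONF = {"Low": 0.10, "Medium": 0.40, "High": 0.70, "Critical": 1.00}

def reblend(old_conf: float, raw_total: float, final_label: str) -> float:
    """Recover the sentence confidence from the old blend, then re-blend with the label anchor."""
    sent_conf = (old_conf - 0.40 * raw_total) / 0.60
    new_conf = 0.60 * sent_conf + 0.40 * PRIORITY_CONF.get(final_label, 0.10)
    return round(min(0.99, max(0.05, new_conf)), 3)

# Original blend: 0.60 * 0.50 + 0.40 * 0.035 = 0.314
old = round(0.60 * 0.50 + 0.40 * 0.035, 3)
# After an upgrade to High: 0.60 * 0.50 + 0.40 * 0.70 = 0.58
print(old, reblend(old, 0.035, "High"))
```

Because the stored confidence was rounded to three places, the recovered sentence confidence is only approximate, which is acceptable for a value that is rounded again after re-blending.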


## Cell 10 — Demo: Full Pipeline on a Corporate Email Summary

In [None]:
DEMO_SUMMARY = """
Alice needs to finalize the API integration test cases and share the test report
by Wednesday 3pm. This is critical as the client demo is scheduled for Thursday morning.

Bob, please investigate the root cause of the memory leak in the staging environment
and raise a JIRA ticket by EOD today. If the fix cannot be done by Wednesday, escalate
to the CTO immediately.

Carol must upload all audit evidence documents to the SharePoint folder and coordinate
with the legal team to review the data processing agreements by next Friday.
This is a mandatory regulatory compliance requirement.

All team members must complete the mandatory security awareness training by end of next week.

Rahul to share the revised project timeline with stakeholders by tomorrow 5pm.
"""

DEMO_CONTEXT = 'client delivery compliance audit CTO board SLA'
DEMO_ANCHOR  = ensure_ist(datetime(2026, 2, 15, 10, 0))

tasks = extract_tasks(DEMO_SUMMARY, anchor=DEMO_ANCHOR, context=DEMO_CONTEXT)

print('=' * 70)
print(f'TASK EXTRACTION RESULTS   ({len(tasks)} tasks found)')
print(f'Anchor: {iso_ist(DEMO_ANCHOR)}')
print('=' * 70)
for i, t in enumerate(tasks, 1):
    display_task(t, i)
    print()


TASK EXTRACTION RESULTS   (9 tasks found)
Anchor: 2026-02-15T10:00:00+05:30
  +-- Task 1  [High] ████
  |  Action     : raise a JIRA ticket
  |  Assignee   : JIRA
  |  Deadline   : 2026-02-15T18:00:00+05:30
  |  Confidence : 65%
  |  Keyphrases : jira ticket eod, raise jira, eod today, raise
  |  NER PERSON  : JIRA
  |  NER ORG     : EOD
  |  NER DATE    : today
  |  Evidence   : and raise a JIRA ticket by EOD today.
  |  Score      : total=0.508  (deadline=0.80  urgency=0.00  risk=0.40  customer=0.80  authority=0.65)
  +--------------------------------------------------------------

  +-- Task 2  [High] ████
  |  Action     : fix wednesday escalate
  |  Assignee   : (unassigned)
  |  Deadline   : 2026-02-18T17:00:00+05:30
  |  Confidence : 61%
  |  Keyphrases : fix wednesday escalate, fix wednesday, wednesday escalate, wednesday
  |  NER DATE    : Wednesday
  |  Evidence   : If the fix cannot be done by Wednesday, escalate
  |  Score      : total=0.396  (deadline=0.30  urgency=0.25  r

## Cell 11 — Structured JSON Output

In [None]:
output = [{
    'action':    t.action,
    'assignee':  t.assignee,
    'deadline':  iso_ist(t.deadline),
    'priority':  t.priority,
    'confidence': t.confidence,
    'evidence':  t.evidence_sentence,
} for t in tasks]

print(json.dumps(output, indent=2))


[
  {
    "action": "raise a JIRA ticket",
    "assignee": "JIRA",
    "deadline": "2026-02-15T18:00:00+05:30",
    "priority": "High",
    "confidence": 0.649,
    "evidence": "and raise a JIRA ticket by EOD today."
  },
  {
    "action": "fix wednesday escalate",
    "assignee": null,
    "deadline": "2026-02-18T17:00:00+05:30",
    "priority": "High",
    "confidence": 0.609,
    "evidence": "If the fix cannot be done by Wednesday, escalate"
  },
  {
    "action": "share the revised project timeline with stakeholders",
    "assignee": "Rahul",
    "deadline": "2026-02-16T17:00:00+05:30",
    "priority": "High",
    "confidence": 0.59,
    "evidence": "Rahul to share the revised project timeline with stakeholders by tomorrow 5pm."
  },
  {
    "action": "complete the mandatory security awareness training",
    "assignee": null,
    "deadline": "2026-02-27T17:00:00+05:30",
    "priority": "High",
    "confidence": 0.485,
    "evidence": "All team members must complete the mandatory se

## Cell 12 — Evaluation Metrics

- **detection_metrics:** Precision / Recall / F1 (task-level; token-F1 ≥ 0.60 = match)
- **field_accuracy:** per-field accuracy on matched pairs (assignee, deadline ±2h, priority)
- **strict_em:** 1.0 only if the gold and predicted task counts are equal and every task matches on all fields simultaneously (action token-F1 ≥ 0.85)
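
For reference, the precision/recall/F1 arithmetic behind `detection_metrics` looks as follows on hypothetical counts (4 predicted tasks, 5 gold tasks, 3 matched). `prf` is a stand-in name used only for this sketch.

```python
def prf(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 from match counts, with zero-division guards."""
    p = tp / max(1, tp + fp)
    r = tp / max(1, tp + fn)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"precision": round(p, 3), "recall": round(r, 3), "f1": round(f1, 3)}

# 3 matched, 1 spurious prediction, 2 missed gold tasks
print(prf(tp=3, fp=1, fn=2))  # {'precision': 0.75, 'recall': 0.6, 'f1': 0.667}
```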


In [None]:
def detection_metrics(gold: List[Dict], pred: List[Task]) -> Dict:
    gold_acts = [g['action'] for g in gold]
    matched, tp = set(), 0
    for p in pred:
        best_j, best_s = None, 0.0
        for j, ga in enumerate(gold_acts):
            if j in matched: continue
            s = _token_f1(p.action, ga)
            if s > best_s: best_s, best_j = s, j
        if best_j is not None and best_s >= 0.60:
            tp += 1; matched.add(best_j)
    fp = max(0, len(pred) - tp)
    fn = max(0, len(gold) - tp)
    p_, r_ = tp/max(1,tp+fp), tp/max(1,tp+fn)
    return {'tp':tp,'fp':fp,'fn':fn,'precision':round(p_,3),'recall':round(r_,3),
            'f1':round(2*p_*r_/(p_+r_) if (p_+r_) else 0.0, 3)}

def field_accuracy(gold: List[Dict], pred: List[Task]) -> Dict:
    used, asgn, ddl, pri = set(), [], [], []
    def norm(x): return re.sub(r'\s+', ' ', str(x or '').split('@')[0]).strip().lower()
    for g in gold:
        best_i, best_s = None, 0.0
        for i, p in enumerate(pred):
            if i in used: continue
            s = _token_f1(p.action, g['action'])
            if s > best_s: best_s, best_i = s, i
        if best_i is None: continue
        used.add(best_i); p = pred[best_i]
        asgn.append(1.0 if norm(g.get('assignee')) == norm(p.assignee) else 0.0)
        gd  = g.get('deadline_iso')
        gdt = ensure_ist(datetime.fromisoformat(gd)) if gd else None
        if gdt is None and p.deadline is None: ddl.append(1.0)
        elif gdt is None or p.deadline is None: ddl.append(0.0)
        else: ddl.append(1.0 if abs((ensure_ist(p.deadline)-gdt).total_seconds())<=7200 else 0.0)
        pri.append(1.0 if (g.get('priority') or 'Medium') == p.priority else 0.0)
    avg = lambda xs: round(sum(xs)/max(1,len(xs)), 3)
    return {'assignee_acc':avg(asgn), 'deadline_acc':avg(ddl), 'priority_acc':avg(pri)}

def strict_em(gold: List[Dict], pred: List[Task]) -> float:
    if len(gold) != len(pred): return 0.0
    used = set()
    def norm(x): return re.sub(r'\s+', ' ', str(x or '').split('@')[0]).strip().lower()
    for g in gold:
        found = False
        for i, p in enumerate(pred):
            if i in used: continue
            a_ok  = _token_f1(p.action, g['action']) >= 0.85
            as_ok = norm(p.assignee) == norm(g.get('assignee'))
            pr_ok = p.priority == (g.get('priority') or 'Medium')
            gd    = g.get('deadline_iso')
            gdt   = ensure_ist(datetime.fromisoformat(gd)) if gd else None
            d_ok  = (gdt is None and p.deadline is None) or (
                gdt and p.deadline and abs((ensure_ist(p.deadline)-gdt).total_seconds())<=7200)
            if a_ok and as_ok and pr_ok and d_ok:
                used.add(i); found = True; break
        if not found: return 0.0
    return 1.0

print('Evaluation functions ready: detection_metrics(), field_accuracy(), strict_em()')


Evaluation functions ready: detection_metrics(), field_accuracy(), strict_em()
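
The deadline check shared by `field_accuracy()` and `strict_em()` can be read in isolation as below. This is a minimal sketch: `deadlines_agree` and the fixed-offset `IST` constant are hypothetical names, and treating naive datetimes as IST is an assumption about what the notebook's `ensure_ist()` does.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fixed +05:30 offset; assumed equivalent to the notebook's IST handling.
IST = timezone(timedelta(hours=5, minutes=30))

def deadlines_agree(gold_iso: Optional[str],
                    pred: Optional[datetime],
                    tol_seconds: int = 7200) -> bool:
    """True if both deadlines are absent, or both present and within tolerance."""
    if gold_iso is None and pred is None:
        return True                      # neither side has a deadline
    if gold_iso is None or pred is None:
        return False                     # only one side has a deadline
    gold = datetime.fromisoformat(gold_iso)
    if gold.tzinfo is None:
        gold = gold.replace(tzinfo=IST)  # assumption: naive means IST
    if pred.tzinfo is None:
        pred = pred.replace(tzinfo=IST)
    return abs((pred - gold).total_seconds()) <= tol_seconds

gold = "2026-02-15T18:00:00+05:30"
one_hour_early = deadlines_agree(gold, datetime(2026, 2, 15, 17, 0, tzinfo=IST))
too_early      = deadlines_agree(gold, datetime(2026, 2, 15, 15, 30, tzinfo=IST))
both_missing   = deadlines_agree(None, None)
print(one_hour_early, too_early, both_missing)
```

The ±2 h window means an "EOD" prediction of 17:00 still matches a gold value of 18:00, while a missing deadline on only one side always counts as an error.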


## Cell 13 — 30 Gold-Standard Test Cases

In [None]:
def _iso(y,mo,d,hh=17,mi=0): return iso_ist(ensure_ist(datetime(y,mo,d,hh,mi)))
BASE = ensure_ist(datetime(2026, 2, 15, 10, 0))

TESTCASES = [
  {"id":1,"summary":"Please send the invoice to finance by EOD today.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"send the invoice to finance","assignee":None,"deadline_iso":_iso(2026,2,15,18),"priority":"High"}]},
  {"id":2,"summary":"Urgent: finalize the month-end accruals by tomorrow 3pm.",
   "anchor":BASE,"context":"cfo finance",
   "gold":[{"action":"finalize the month-end accruals","assignee":None,"deadline_iso":_iso(2026,2,16,15),"priority":"Critical"}]},
  {"id":3,"summary":"Neha, please raise the PO for laptop procurement before Friday.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"raise the PO for laptop procurement","assignee":"Neha","deadline_iso":_iso(2026,2,19,17),"priority":"High"}]},
  {"id":4,"summary":"Process the refund for order #88421 within 24 hours. This is high priority.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"process the refund for order #88421","assignee":None,"deadline_iso":_iso(2026,2,16,10),"priority":"High"}]},
  {"id":5,"summary":"Neha, urgent: ensure your team completes compliance training by EOD today.",
   "anchor":BASE,"context":"compliance hr",
   "gold":[{"action":"ensure your team completes compliance training","assignee":"Neha","deadline_iso":_iso(2026,2,15,18),"priority":"High"}]},
  {"id":6,"summary":"Rahul, please submit performance feedback for Ankit by tomorrow 11am.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"submit performance feedback for Ankit","assignee":"Rahul","deadline_iso":_iso(2026,2,16,11),"priority":"High"}]},
  {"id":7,"summary":"Urgent: apply the critical security patch on prod-web-01 by EOD today.",
   "anchor":BASE,"context":"security it ops",
   "gold":[{"action":"apply the critical security patch on prod-web-01","assignee":None,"deadline_iso":_iso(2026,2,15,18),"priority":"Critical"}]},
  {"id":8,"summary":"Deploy release v2.4.1 to staging today 7pm and share sanity results.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"deploy release v2.4.1 to staging","assignee":None,"deadline_iso":_iso(2026,2,15,19),"priority":"High"}]},
  {"id":9,"summary":"SEV1: rollback the last deployment immediately and confirm status.",
   "anchor":BASE,"context":"production outage sev1",
   "gold":[{"action":"rollback the last deployment","assignee":None,"deadline_iso":None,"priority":"Critical"}]},
  {"id":10,"summary":"Investigate possible credential breach and share findings by tomorrow 10am.",
   "anchor":BASE,"context":"security breach",
   "gold":[{"action":"investigate possible credential breach","assignee":None,"deadline_iso":_iso(2026,2,16,10),"priority":"Critical"}]},
  {"id":11,"summary":"Neha, please send the customer communication draft within 2 hours.",
   "anchor":BASE,"context":"client customer",
   "gold":[{"action":"send the customer communication draft","assignee":"Neha","deadline_iso":_iso(2026,2,15,12),"priority":"High"}]},
  {"id":12,"summary":"Rahul, prepare the demo deck and share with Client Orion before Friday.",
   "anchor":BASE,"context":"client delivery",
   "gold":[{"action":"prepare the demo deck and share with Client Orion","assignee":"Rahul","deadline_iso":_iso(2026,2,19,17),"priority":"High"}]},
  {"id":13,"summary":"Verify UAT defect fixes on staging and confirm status by tomorrow 1pm.",
   "anchor":BASE,"context":"client qa",
   "gold":[{"action":"verify UAT defect fixes on staging","assignee":None,"deadline_iso":_iso(2026,2,16,13),"priority":"High"}]},
  {"id":14,"summary":"Priya, please update the MoM with decisions and circulate by today 6pm.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"update the MoM with decisions and circulate","assignee":"Priya","deadline_iso":_iso(2026,2,15,18),"priority":"High"}]},
  {"id":15,"summary":"Complete threat model review for the new feature by Friday — urgent request from CTO.",
   "anchor":BASE,"context":"cto security",
   "gold":[{"action":"complete threat model review for the new feature","assignee":None,"deadline_iso":_iso(2026,2,20,17),"priority":"High"}]},
  {"id":16,"summary":"Review the liability clause and revert with comments by EOD today. High priority legal requirement.",
   "anchor":BASE,"context":"legal compliance",
   "gold":[{"action":"review the liability clause and revert with comments","assignee":None,"deadline_iso":_iso(2026,2,15,18),"priority":"High"}]},
  {"id":17,"summary":"Rahul to share revised timeline by tomorrow 5pm. Priya to schedule client sync call by Monday.",
   "anchor":BASE,"context":"client",
   "gold":[{"action":"share revised timeline","assignee":"Rahul","deadline_iso":_iso(2026,2,16,17),"priority":"High"},
            {"action":"schedule client sync call","assignee":"Priya","deadline_iso":_iso(2026,2,16,17),"priority":"High"}]},
  {"id":18,"summary":"Collect 3 quotations and finalize the rate contract renewal by 2026-02-20.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"collect 3 quotations and finalize the rate contract renewal","assignee":None,"deadline_iso":_iso(2026,2,20,17),"priority":"Medium"}]},
  {"id":19,"summary":"Please coordinate MFA rollout for the finance team by next Friday.",
   "anchor":BASE,"context":"security it",
   "gold":[{"action":"coordinate MFA rollout for the finance team","assignee":None,"deadline_iso":_iso(2026,2,20,17),"priority":"Medium"}]},
  {"id":20,"summary":"Schedule the postmortem meeting for next Friday 3pm and share invite with the team.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"schedule the postmortem meeting","assignee":None,"deadline_iso":_iso(2026,2,20,15),"priority":"Medium"}]},
  {"id":21,"summary":"Urgent: fix the backup failure on db-prod-02 immediately. Production data at risk.",
   "anchor":BASE,"context":"production data loss",
   "gold":[{"action":"fix the backup failure on db-prod-02","assignee":None,"deadline_iso":None,"priority":"Critical"}]},
  {"id":22,"summary":"Review the MSA draft and share comments by tomorrow EOD.",
   "anchor":BASE,"context":"legal vendor",
   "gold":[{"action":"review the MSA draft and share comments","assignee":None,"deadline_iso":_iso(2026,2,16,18),"priority":"High"}]},
  {"id":23,"summary":"Process advance payment of INR 250000 for supplier Alpha by today 4pm. Urgent request.",
   "anchor":BASE,"context":"finance billing",
   "gold":[{"action":"process advance payment of INR 250000 for supplier Alpha","assignee":None,"deadline_iso":_iso(2026,2,15,16),"priority":"High"}]},
  {"id":24,"summary":"Respond to the customer escalation email within 4 hours. SLA breach risk.",
   "anchor":BASE,"context":"customer sla client",
   "gold":[{"action":"respond to the customer escalation email","assignee":None,"deadline_iso":_iso(2026,2,15,14),"priority":"Critical"}]},
  {"id":25,"summary":"Complete the go-live checklist and confirm sign-off by Friday COB.",
   "anchor":BASE,"context":"go-live client delivery",
   "gold":[{"action":"complete the go-live checklist and confirm sign-off","assignee":None,"deadline_iso":_iso(2026,2,20,17),"priority":"High"}]},
  {"id":26,"summary":"Upload audit evidence for Q4 controls by next Friday COB. Compliance deadline.",
   "anchor":BASE,"context":"compliance audit regulatory",
   "gold":[{"action":"upload audit evidence for Q4 controls","assignee":None,"deadline_iso":_iso(2026,2,20,17),"priority":"High"}]},
  {"id":27,"summary":"Shortlist candidates and share interview slots with the manager by tomorrow noon.",
   "anchor":BASE,"context":"",
   "gold":[{"action":"shortlist candidates and share interview slots","assignee":None,"deadline_iso":_iso(2026,2,16,12),"priority":"High"}]},
  {"id":28,"summary":"Rotate production database credentials immediately. Security incident response.",
   "anchor":BASE,"context":"security breach production",
   "gold":[{"action":"rotate production database credentials","assignee":None,"deadline_iso":None,"priority":"Critical"}]},
  {"id":29,"summary":"Draft release notes for v2.4.1 and send to client by EOD today.",
   "anchor":BASE,"context":"client delivery",
   "gold":[{"action":"draft release notes for v2.4.1 and send to client","assignee":None,"deadline_iso":_iso(2026,2,15,18),"priority":"High"}]},
  {"id":30,"summary":"Blocker: client reporting is down. Fix and restore service immediately. Production SLA breach.",
   "anchor":BASE,"context":"production client sla outage",
   "gold":[{"action":"fix and restore service","assignee":None,"deadline_iso":None,"priority":"Critical"}]},
]

assert len(TESTCASES) == 30
print(f'{len(TESTCASES)} test cases loaded.')


30 test cases loaded.
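
The weekday-based gold deadlines follow from the anchor date: 2026-02-15 is a Sunday, so "by Friday" resolves to 2026-02-20, "before Friday" to the preceding Thursday (2026-02-19, as in cases 3 and 12), and "by Monday" to 2026-02-16. The arithmetic can be sketched as follows; `upcoming_weekday` is a hypothetical helper, not the pipeline's rule matcher, and both the strictly-after-the-anchor convention and the 17:00 default are assumptions read off the gold values.

```python
from datetime import datetime, timedelta

def upcoming_weekday(anchor: datetime, weekday: int, hour: int = 17) -> datetime:
    """First date strictly after the anchor's date that falls on `weekday` (Mon=0)."""
    days_ahead = (weekday - anchor.weekday() - 1) % 7 + 1   # always 1..7
    target = anchor + timedelta(days=days_ahead)
    return target.replace(hour=hour, minute=0, second=0, microsecond=0)

anchor = datetime(2026, 2, 15, 10, 0)                 # Sunday, as in BASE
by_friday     = upcoming_weekday(anchor, 4)           # Friday = 4
before_friday = by_friday - timedelta(days=1)         # gold uses the day before
by_monday     = upcoming_weekday(anchor, 0)           # Monday = 0

print(by_friday.isoformat(), before_friday.isoformat(), by_monday.isoformat())
```

In this test set "next Friday" (cases 19, 20, 26) is also labelled 2026-02-20, so the gold data does not distinguish "Friday" from "next Friday" when the anchor is a Sunday.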


## Cell 14 — Benchmark Runner

In [None]:
all_det, all_fa, all_em = [], [], []
errors = {'false_positive':0,'missed_task':0,'coref_error':0,
          'temporal_error':0,'span_mismatch':0,'priority_error':0}

for tc in TESTCASES:
    pred = extract_tasks(tc['summary'], anchor=tc['anchor'], context=tc.get('context',''))
    gold = tc['gold']
    all_det.append(detection_metrics(gold, pred))
    all_fa.append(field_accuracy(gold, pred))
    all_em.append(strict_em(gold, pred))

    # per-test error breakdown
    used = set()
    for g in gold:
        best_i, best_s = None, 0.0
        for i, p in enumerate(pred):
            if i in used: continue
            s = _token_f1(p.action, g['action'])
            if s > best_s: best_s, best_i = s, i
        if best_i is None or best_s < 0.60:
            errors['missed_task'] += 1; continue
        used.add(best_i); p = pred[best_i]
        if best_s < 0.85: errors['span_mismatch'] += 1
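        # Same normalisation as field_accuracy(): drop any e-mail domain, collapse whitespace, lower-case.
        def norm(x): return re.sub(r'\s+', ' ', str(x or '').split('@')[0]).strip().lower()
        # Any assignee mismatch is counted under 'coref_error', whatever its cause.
        if norm(g.get('assignee')) != norm(p.assignee): errors['coref_error'] += 1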
        gd  = g.get('deadline_iso')
        gdt = ensure_ist(datetime.fromisoformat(gd)) if gd else None
        if gdt and p.deadline and abs((ensure_ist(p.deadline)-gdt).total_seconds())>7200:
            errors['temporal_error'] += 1
        elif bool(gdt) != bool(p.deadline):
            errors['temporal_error'] += 1
        if (g.get('priority') or 'Medium') != p.priority:
            errors['priority_error'] += 1
    errors['false_positive'] += all_det[-1]['fp']

def avg(lst, k): return round(sum(m[k] for m in lst)/len(lst), 3)

print()
print('=' * 72)
print(f' BENCHMARK  ({len(TESTCASES)} test cases | anchor: Asia/Kolkata IST)')
print('=' * 72)
metrics = [
    ('Detection Precision',  avg(all_det, 'precision')),
    ('Detection Recall',     avg(all_det, 'recall')),
    ('Detection F1',         avg(all_det, 'f1')),
    ('Assignee Accuracy',    avg(all_fa,  'assignee_acc')),
    ('Deadline Accuracy',    avg(all_fa,  'deadline_acc')),
    ('Priority Accuracy',    avg(all_fa,  'priority_acc')),
    ('Strict EM',            round(sum(all_em)/len(all_em), 3)),
]
for name, val in metrics:
    bar = chr(9608)*int(val*30) + chr(9617)*(30-int(val*30))
    print(f'  {name:<24} {val:.3f}  {bar}')
print()
print('ERROR BREAKDOWN:')
total_e = max(1, sum(errors.values()))
for k, v in errors.items():
    print(f'  {k:<20}: {v:3d}  ({100*v/total_e:.1f}%)')



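========================================================================
 BENCHMARK  (30 test cases | anchor: Asia/Kolkata IST)
========================================================================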
  Detection Precision      0.850  █████████████████████████░░░░░
  Detection Recall         1.000  ██████████████████████████████
  Detection F1             0.900  ███████████████████████████░░░
  Assignee Accuracy        0.833  ████████████████████████░░░░░░
  Deadline Accuracy        0.900  ███████████████████████████░░░
  Priority Accuracy        0.933  ███████████████████████████░░░
  Strict EM                0.467  ██████████████░░░░░░░░░░░░░░░░

ERROR BREAKDOWN:
  false_positive      :   9  (32.1%)
  missed_task         :   0  (0.0%)
  coref_error         :   5  (17.9%)
  temporal_error      :   3  (10.7%)
  span_mismatch       :   9  (32.1%)
  priority_error      :   2  (7.1%)
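
The headline numbers are macro-averages: the runner computes precision, recall, and F1 per test case and then averages them, which is why the reported Detection F1 (0.900) is not the harmonic mean of 0.850 and 1.000 (about 0.919). The toy counts below are invented purely to show how macro- and micro-averaging can diverge; `prf` is a hypothetical helper that mirrors the arithmetic in `detection_metrics()`.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, F1 from raw counts, guarding against empty denominators."""
    p = tp / max(1, tp + fp)
    r = tp / max(1, tp + fn)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Invented per-test-case counts (not taken from the benchmark above).
per_case = [
    {"tp": 1, "fp": 0, "fn": 0},
    {"tp": 1, "fp": 1, "fn": 0},
    {"tp": 2, "fp": 1, "fn": 0},
]

# Macro: average the per-case precision values (what the runner does).
macro_p = sum(prf(**c)[0] for c in per_case) / len(per_case)

# Micro: pool the counts first, then compute precision once.
pooled = {k: sum(c[k] for c in per_case) for k in ("tp", "fp", "fn")}
micro_p = prf(**pooled)[0]

print(f"macro precision = {macro_p:.3f}, micro precision = {micro_p:.3f}")
```

Macro-averaging weights every email equally regardless of how many tasks it contains, so a single-task email with one spurious extraction pulls the average down as much as a multi-task email would.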


## Cell 15 — Interactive Tester

Paste any email summary into `MY_SUMMARY` and run this cell.


In [None]:
MY_SUMMARY = """
Annual Strategy Workshop will be held from March 18th to March 20th at Chavara Hall, RSET Campus. 
Participants are requested to report by 9:00 AM on March 18th. 
The proposed consolidated budget for the upcoming initiatives is $45,000. 
Lunch and refreshments will be provided during the sessions. 
Kindly submit your department presentations by March 10th. 
Any revisions must be completed before March 15th. 
If you are unable to attend the workshop, please inform HR by March 12th. 
The workshop sessions will begin at 9:30 AM each day and conclude by 4:30 PM. 
The conference is open to the entire campus community.
"""

MY_ANCHOR  = now_ist()                        # use current time as anchor
MY_CONTEXT = ''                               # no extra authority/customer keywords for this workshop email

out = extract_tasks(MY_SUMMARY, anchor=MY_ANCHOR, context=MY_CONTEXT)

print(f'Extracted {len(out)} tasks:\n')
for i, t in enumerate(out, 1):
    display_task(t, i)
    print()

print('JSON:')
print(json.dumps([{
    'action': t.action, 'assignee': t.assignee,
    'deadline': iso_ist(t.deadline), 'priority': t.priority,
    'confidence': t.confidence,
} for t in out], indent=2))


## Cell 16 — Task Filtering: `filter_tasks_for_user()`

Filters the output of `extract_tasks()` to only the tasks that are relevant
to a specific user, identified by their name.

**Filtering rules (applied *after* full extraction + prioritisation)**

| Condition | Decision |
|-----------|----------|
| `assignee` is `None` or empty | ✅ Include — unassigned / broadcast task |
| `assignee` is a generic collective (`Participants`, `Team`, `All`, `Everyone`, …) | ✅ Include |
| `assignee` matches user by first name, last name, full name, or title+last | ✅ Include |
| `assignee` is a specific person/org that is **not** the user | ❌ Exclude |

> This cell is self-contained so `task_filter.py` (used by the FastAPI
> service) can mirror the exact same logic without any circular dependency.

In [None]:
import re as _re

# ── Generic / collective assignee labels ──────────────────────────────────────
# Mirrors COLLECTIVE_ACTORS (Cell 5) plus additional broadcast phrases.
# All comparisons are lower-cased.
_GENERIC_ASSIGNEES = frozenset({
    "participants", "attendees",
    "all team members", "all staff", "all employees", "all users",
    "team members", "department heads", "departments", "all participants",
    "everyone", "team", "all", "staff", "recipients", "colleagues", "members",
})

_TITLE_PREFIX_RE = _re.compile(r"^(?:mr|mrs|ms|miss|dr|prof)\.?\s+", _re.IGNORECASE)


def _is_generic_assignee(assignee: str) -> bool:
    """Return True if assignee is a collective / broadcast label."""
    return assignee.lower().strip() in _GENERIC_ASSIGNEES


def _assignee_matches_user(assignee: str,
                            first_name: str,   # lower-cased
                            last_name:  str,   # lower-cased, may be ""
                            full_name:  str,   # lower-cased
                           ) -> bool:
    """
    Case-insensitive name match.  Handles:
      - First name only         e.g. "Elena"
      - Last name only          e.g. "Smith"
      - Full display name       e.g. "Elena Smith"
      - Title + last name       e.g. "Ms. Smith", "Mr Smith", "Dr. Smith"
    """
    a = assignee.lower().strip()
    a_no_title = _TITLE_PREFIX_RE.sub("", a).strip()
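    # bool(...) keeps the declared return type: a bare `or` chain could return "".
    # The first_name guard stops an empty first name from matching a blank assignee.
    return bool(
        (first_name and a == first_name)
        or (last_name and a == last_name)
        or (full_name and a == full_name)
        or (last_name and a_no_title == last_name)
    )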


def filter_tasks_for_user(
    tasks: List[Task],
    user_first_name:   str,
    user_last_name:    str,
    user_display_name: str,
) -> List[Task]:
    """
    Return only the tasks from *tasks* that are relevant to the given user.

    Parameters
    ----------
    tasks             : full output of extract_tasks()
    user_first_name   : first name  (e.g. "Elena")
    user_last_name    : last name   (e.g. "Smith")  — pass "" if unavailable
    user_display_name : full display name  (e.g. "Elena Smith")

    Returns
    -------
    Filtered list preserving the original priority / confidence order.
    """
    fn   = user_first_name.lower().strip()
    ln   = user_last_name.lower().strip()
    full = user_display_name.lower().strip()

    def _keep(t: Task) -> bool:
        if not t.assignee:
            return True    # unassigned → broadcast, include always
        if _is_generic_assignee(t.assignee):
            return True    # collective label → include
        if _assignee_matches_user(t.assignee, fn, ln, full):
            return True    # name match → include
        return False       # someone else's task → exclude

    return [t for t in tasks if _keep(t)]


print("filter_tasks_for_user() ready.")
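
The title handling in `_assignee_matches_user()` depends on the prefix pattern requiring whitespace after the title, so names that merely start with the same letters are left alone. A small standalone check, re-declaring the same pattern under the hypothetical names `TITLE_PREFIX_RE` and `strip_title`:

```python
import re

# Same pattern as _TITLE_PREFIX_RE above, re-declared so this snippet runs alone.
TITLE_PREFIX_RE = re.compile(r"^(?:mr|mrs|ms|miss|dr|prof)\.?\s+", re.IGNORECASE)

def strip_title(name: str) -> str:
    """Lower-case the name and remove one leading honorific, if present."""
    return TITLE_PREFIX_RE.sub("", name.strip().lower()).strip()

samples = ["Ms. Smith", "Dr Smith", "Prof.  Smith", "Mrinal", "Smith"]
stripped = [strip_title(s) for s in samples]

for raw, clean in zip(samples, stripped):
    print(f"{raw!r:>16} -> {clean!r}")
```

"Mrinal" is untouched because no whitespace follows the leading "mr". Note that the filter compares the title-stripped form only against the last name, so an assignee such as "Dr. Elena Smith" would not match under the rules as written.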

## Cell 17 — Filtering Demo

Demonstrates `filter_tasks_for_user()` using the `tasks` list produced by
**Cell 10** (the corporate email demo).

That email has tasks assigned to: Alice, Bob, Carol, Rahul, and All team members.
We simulate three different logged-in users and show what each one sees.

In [None]:
# Requires Cell 10 (tasks variable) and Cell 16 (filter_tasks_for_user) to have run.

_demo_users = [
    # (display_name,   first_name, last_name)
    ("Alice Johnson",  "alice",    "johnson"),   # has specific tasks + generic tasks
    ("Bob Chen",       "bob",      "chen"),      # has specific tasks + generic tasks
    ("Elena Smith",    "elena",    "smith"),     # not in email → only generic tasks
]

for _display, _fn, _ln in _demo_users:
    _filtered = filter_tasks_for_user(
        tasks             = tasks,
        user_first_name   = _fn,
        user_last_name    = _ln,
        user_display_name = _display,
    )
    print(f"\n{'='*62}")
    print(f"  User : {_display}")
    print(f"  Sees : {len(_filtered)} / {len(tasks)} tasks")
    print(f"{'='*62}")
    for _i, _t in enumerate(_filtered, 1):
        _bar = {"Low":"░░","Medium":"▒▒▒","High":"████","Critical":"██████"}.get(_t.priority,"")
        _asgn = f"assignee={_t.assignee!r}" if _t.assignee else "unassigned"
        print(f"  {_i}. [{_t.priority}] {_bar}  {_t.action[:52]:<52}  ({_asgn})")
    if not _filtered:
        print("  (no tasks for this user in this email)")