
# 04 — Deep Language Metrics: Verbs, Adjectives, Readability, and Prompt–Response Dynamics

This notebook **extends your current analysis** with richer linguistic and structural features to explain variation in `overall_score`.

**Highlights**
- Works with your existing `df` (or loads `train.csv` as a fallback).
- Extracts **verbs/adjectives/nouns** (spaCy if available, else regex heuristics).
- Adds **readability** (Flesch) and composition features (hapax ratio, lexical density).
- Compares **prompt vs. response**: length ratios, average word length, and **Jaccard token overlap**.
- Computes **hedges/politeness** counts and **punctuation density**.
- Produces **correlations**, **bin summaries**, and a quick **Ridge regression** feature ranking.

> If `overall_score` is missing, we build a proxy by averaging whichever numeric rating columns are available (helpfulness, correctness, coherence, complexity, verbosity, or a generic `quality`, `score`, or `label` column).
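
As a rough illustration of that fallback, the sketch below averages the rating columns that exist and then bins the result with the same edges the notebook uses. The toy frame and its 0–4 rating values are assumptions made for this example, not rows from `train.csv`.

```python
import pandas as pd

# Hypothetical toy ratings (assumed 0-4 scale); not taken from train.csv.
toy = pd.DataFrame({
    "helpfulness": [4, 1, 3],
    "correctness": [4, 0, 3],
    "coherence":   [4, 2, 4],
})

rating_cols = [c for c in ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
               if c in toy.columns]

# Proxy score: row-wise mean over the rating columns that are present.
toy["overall_score"] = toy[rating_cols].astype(float).mean(axis=1)

# Same bin edges and labels as the setup cell.
bins = [0, 4, 8, 12, 16, 20]
labels = ["0–4", "5–8", "9–12", "13–16", "17–20"]
toy["overall_bin"] = pd.cut(toy["overall_score"], bins=bins, labels=labels,
                            include_lowest=True, right=True)
print(toy[["overall_score", "overall_bin"]])
```

If the ratings really are on a 0–4 scale, an average can never exceed 4, so every row lands in the first bin. When the 0–20 bins are meant to apply, summing the ratings or rescaling the bin edges may be the better choice.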


In [None]:

# --- 0) Setup & Data ---
import os, re, math, random, warnings
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_squared_error

warnings.filterwarnings("ignore")
SEED = 42
random.seed(SEED); np.random.seed(SEED)

# Reuse df if present, else try to load train.csv
if 'df' not in globals():
    candidates = [Path('train.csv'), Path('/mnt/data/helpsteer_extracted/train.csv')]
    DATA_PATH = next((p for p in candidates if p.exists()), candidates[0])
    assert DATA_PATH.exists(), f"Could not find {DATA_PATH.resolve()} — put 'train.csv' next to this notebook or provide df."
    df = pd.read_csv(DATA_PATH)
    print(f'Loaded: {DATA_PATH}')

# Detect columns
def pick(cols, cands):
    for c in cands:
        if c in cols: return c
    return None

cols = df.columns.tolist()
prompt_col   = pick(cols, ["prompt","instruction","question","query","user_input"])
response_col = pick(cols, ["response","response_text","answer","assistant_response","model_output","completion"])
if response_col is None:
    # fallback to any object col not equal to prompt
    obj = [c for c in cols if df[c].dtype == object]
    response_col = next((c for c in obj if c != prompt_col), None)
assert response_col is not None, "No response-like text column found."

print({"prompt": prompt_col, "response": response_col})

# Build overall_score if missing: mean of common rating cols
if "overall_score" not in df.columns:
    cand_scores = [c for c in ["helpfulness","correctness","coherence","complexity","verbosity","quality","score","label"]
                   if c in df.columns and pd.api.types.is_numeric_dtype(df[c])]
    assert len(cand_scores) >= 1, "overall_score missing and no standard rating columns found to derive it."
    df["overall_score"] = df[cand_scores].astype(float).mean(axis=1)

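# Build bins if missing
# NOTE: the edges assume a 0–20 score. A proxy built above as a mean of ratings
# on a smaller scale (e.g. 0–4) would fall entirely in the first bin, and any
# score outside 0–20 becomes NaN.
if "overall_bin" not in df.columns:
    bins   = [0, 4, 8, 12, 16, 20]
    labels = ["0–4","5–8","9–12","13–16","17–20"]
    df["overall_bin"] = pd.cut(df["overall_score"], bins=bins, labels=labels, include_lowest=True, right=True)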



## Tokenization and POS (verbs, adjectives, nouns)
- If `spaCy` with an English model is available, we use its POS tagger to count **NOUN/PROPN/ADJ/VERB** tokens.
- Otherwise, we fall back to suffix-based heuristics as a lightweight approximation.
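
To get a feel for how coarse the fallback is, the sketch below applies the same suffix lists to a handful of made-up words. The helper name `suffix_tags` is hypothetical and exists only for this illustration.

```python
# Suffix lists copied from the notebook's fallback; the sample words are made up.
VERB_SUFFIXES = ("ing", "ed", "en", "ize", "ise", "ify")
ADJ_SUFFIXES = ("ous", "ful", "ive", "less", "able", "ible", "al", "ary", "ic", "ical", "y", "ish")
NOUN_SUFFIXES = ("tion", "sion", "ment", "ness", "ity", "ship", "ism", "ist", "ance", "ence", "ery", "or", "er")

def suffix_tags(tok):
    """Return every coarse tag whose suffix list matches the token."""
    tags = []
    if tok.endswith(VERB_SUFFIXES):   # str.endswith accepts a tuple of suffixes
        tags.append("VERB")
    if tok.endswith(ADJ_SUFFIXES):
        tags.append("ADJ")
    if tok.endswith(NOUN_SUFFIXES):
        tags.append("NOUN")
    return tags

for word in ["running", "careful", "movement", "thing", "quality", "run"]:
    print(word, suffix_tags(word))
```

Two limitations show up quickly: a word can match more than one list (the fallback counts each category independently), and bare stems such as "run" match nothing while nouns such as "thing" are counted as verbs. The heuristic counts are best read as rough proxies.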


In [None]:

USE_SPACY = True
spacy_nlp = None
if USE_SPACY:
    try:
        import spacy
        for name in ["en_core_web_sm","en_core_web_md","en_core_web_lg","en_core_web_trf"]:
            try:
                spacy_nlp = spacy.load(name, disable=["ner","textcat"])
                break
            except Exception:
                continue
        if spacy_nlp is None:
            raise RuntimeError("No local spaCy English model found")
        print("spaCy POS tagging enabled:", spacy_nlp.meta.get("name","<unknown>"))
    except Exception as e:
        spacy_nlp = None
        print("spaCy not available; using regex heuristics. (", e, ")")

STOPWORDS = set(ENGLISH_STOP_WORDS) | {
    "im","ive","id","youre","youll","dont","cant","wont","didnt","doesnt","isnt","arent","wasnt","werent",
    "couldnt","shouldnt","wouldnt","thats","theres","heres","whats","lets",
    "ok","okay","yes","yeah","nope","uh","um","uhh","hmm",
    "like","just","really","actually","basically","literally",
    "etc","e.g","eg","i.e","ie",
    "http","https","www","com","net","org"
}

# Simple regex tokenizer
def tokens(text):
    return re.findall(r"[A-Za-z]{2,}", str(text).lower())

# Crude verb/adj/noun heuristics if no spaCy
VERB_SUFFIXES = ("ing","ed","en","ize","ise","ify")
ADJ_SUFFIXES  = ("ous","ful","ive","less","able","ible","al","ary","ic","ical","y","ish")
NOUN_SUFFIXES = ("tion","sion","ment","ness","ity","ship","ism","ist","ance","ence","ery","or","er")

def is_verbish(tok): return tok.endswith(VERB_SUFFIXES)
def is_adjish(tok):  return tok.endswith(ADJ_SUFFIXES)
def is_nounish(tok): return tok.endswith(NOUN_SUFFIXES)

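def pos_counts(text):
    """Return (n_regex_tokens, n_verbs, n_adjectives, n_nouns) for one text."""
    text = "" if text is None else str(text)
    toks = tokens(text)
    if spacy_nlp is not None:
        # POS counts come from spaCy's own tokenization; the first element is
        # still the regex token count, which is what the ratios divide by.
        doc = spacy_nlp(text)
        v = sum(1 for t in doc if t.pos_ == "VERB")
        a = sum(1 for t in doc if t.pos_ == "ADJ")
        n = sum(1 for t in doc if t.pos_ in {"NOUN", "PROPN"})
        return len(toks), v, a, n
    # Fallback: suffix heuristics; a token may match more than one category.
    v = sum(1 for t in toks if is_verbish(t))
    a = sum(1 for t in toks if is_adjish(t))
    n = sum(1 for t in toks if is_nounish(t))
    return len(toks), v, a, n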



## Readability, hedges, politeness, overlap, and other helpers
- **Flesch Reading Ease** (no external libs; simple syllable heuristic).
- **Hedges** and **politeness markers** (count occurrences).
- **Prompt–response** comparisons: lengths, avg word length, and **Jaccard token overlap**.
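
A small worked example may help make two of these measures concrete. Flesch Reading Ease is `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`, and Jaccard overlap is the size of the shared token set divided by the size of the combined token set. The sketch below uses the same regex tokenizer and a vowel-group syllable heuristic. The function names `syllables`, `flesch`, and `jaccard` are illustrative, and stopword filtering is left out to keep it short.

```python
import re

def tokens(text):
    # Alphabetic runs of 2+ letters, lowercased.
    return re.findall(r"[A-Za-z]{2,}", str(text).lower())

def syllables(word):
    # Count groups of consecutive vowels; drop one for a trailing "e".
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(1, n)

def flesch(text):
    words = tokens(text)
    if not words:
        return 0.0
    n_sents = max(1, len(re.findall(r"[.!?]+", text)))
    n_syll = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / n_sents) - 84.6 * (n_syll / len(words))

def jaccard(a_text, b_text):
    a, b = set(tokens(a_text)), set(tokens(b_text))
    return len(a & b) / len(a | b) if (a | b) else 0.0

sample = "The cat sat. The dog ran."
print(round(flesch(sample), 2))                   # 6 words, 2 sentences, 6 syllables
print(jaccard("cat sat mat", "cat mat dog ran"))  # 2 shared tokens out of 5 distinct
```

Very short, monosyllabic text can score above 100 on this scale; lower values indicate harder text. The syllable heuristic is approximate, so the scores are most useful for comparing texts with each other.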


In [None]:

HEDGES = {
    "may","might","could","possibly","perhaps","apparently","seems","appear","likely","unlikely","roughly",
    "approximately","around","about","generally","somewhat","often","sometimes","probably","arguably"
}
POLITE = {"please","thank","thanks","appreciate","kindly","sorry","apologies"}

def count_set(words, vocab):
    s = 0
    for w in words:
        if w in vocab: s += 1
    return s

def syllable_count(word):
    word = word.lower()
    if not word: return 0
    vowels = "aeiouy"
    count = 0; prev_v = False
    for ch in word:
        is_v = ch in vowels
        if is_v and not prev_v:
            count += 1
        prev_v = is_v
    if word.endswith("e") and count > 1:
        count -= 1
    return max(1, count)

def flesch_reading_ease(text):
    toks = tokens(text)
    if not toks: return 0.0
    n_words = len(toks)
    n_sents = max(1, len(re.findall(r"[.!?]+", str(text))))
    n_syll  = sum(syllable_count(w) for w in toks)
    return 206.835 - 1.015*(n_words/n_sents) - 84.6*(n_syll/n_words)

def lexical_density(text):
    toks = tokens(text)
    if not toks: return 0.0
    content = [t for t in toks if t not in STOPWORDS]
    return len(content)/len(toks)

def hapax_ratio(text):
    toks = tokens(text)
    if not toks: return 0.0
    from collections import Counter
    c = Counter(toks)
    hapax = sum(1 for k,v in c.items() if v==1)
    return hapax / len(toks)

def avg_word_len(text):
    toks = tokens(text)
    return (sum(len(t) for t in toks)/len(toks)) if toks else 0.0

def punct_density(text):
    chars = len(str(text))
    if chars == 0: return 0.0
    n_punct = len(re.findall(r"[,:;—-]", str(text)))
    return n_punct / chars

def jaccard_overlap(a_text, b_text):
    A = set([t for t in tokens(a_text) if t not in STOPWORDS])
    B = set([t for t in tokens(b_text) if t not in STOPWORDS])
    if not A and not B: return 0.0
    return len(A & B) / max(1, len(A | B))



## Feature extraction
Build a rich feature set for both **response** and (if available) **prompt**, then aggregate to a modeling table.
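
The table-building pattern can be sketched on a tiny frame: run a dict-returning extractor over each text column, prefix the resulting columns, and derive a length ratio that tolerates empty prompts. Everything here (`toy`, `basic_features`, `feature_frame`) is a hypothetical stand-in for the real extractor and data.

```python
import pandas as pd

# Hypothetical two-row frame standing in for df; the index is deliberately not 0..n-1.
toy = pd.DataFrame(
    {"prompt": ["What is a cat?", ""], "response": ["A cat is a small animal.", "No idea."]},
    index=[10, 11],
)

def basic_features(text):
    # Stand-in for the full extractor: word count and mean word length only.
    words = str(text).split()
    n = len(words)
    return {"word_len": n, "avg_word_len": (sum(map(len, words)) / n) if n else 0.0}

def feature_frame(series, prefix):
    # Build the frame from a list of dicts and keep the source index for alignment.
    rows = [basic_features(t) for t in series.astype(str)]
    return pd.DataFrame(rows, index=series.index).add_prefix(prefix)

resp = feature_frame(toy["response"], "resp_")
pr = feature_frame(toy["prompt"], "pr_")

# Length ratio: a zero-length prompt becomes NaN in the divisor, then the ratio is set to 0.
ratio = (resp["resp_word_len"] / pr["pr_word_len"].replace(0, float("nan"))).fillna(0.0)
table = pd.concat([resp, pr, ratio.rename("prr_len_ratio")], axis=1)
print(table)
```

Keeping the original index on every piece matters because `pd.concat(..., axis=1)` aligns on index labels, not on row position.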


In [None]:

RESP = response_col
PR = prompt_col  # may be None

def extract_all(text):
    """Return a dict of per-text features (lengths, POS ratios, readability, markers)."""
    toks = tokens(text)
    n_tok = len(toks)
    n_sent = max(1, len(re.findall(r"[.!?]+", str(text))))  # always >= 1
    _, v, a, n = pos_counts(text)  # first element repeats len(toks)
    hedges = count_set(toks, HEDGES)
    polite = count_set(toks, POLITE)
    return {
        "word_len": n_tok,
        "avg_word_len": avg_word_len(text),
        "sent_count": n_sent,
        "type_token_ratio": (len(set(toks))/n_tok) if n_tok else 0.0,
        "lexical_density": lexical_density(text),
        "hapax_ratio": hapax_ratio(text),
        "verb_count": v,
        "adj_count": a,
        "noun_count": n,
        "verb_ratio": (v/n_tok) if n_tok else 0.0,
        "adj_ratio": (a/n_tok) if n_tok else 0.0,
        "noun_ratio": (n/n_tok) if n_tok else 0.0,
        "flesch_readability": flesch_reading_ease(text),
        "hedge_count": hedges,
        "politeness_count": polite,
        "punct_density": punct_density(text),
        "tokens_per_sentence": (n_tok/n_sent) if n_sent else 0.0,
    }

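# Missing texts become "" (astype(str) alone would turn NaN into the token "nan").
# Building the frame from a list of dicts avoids the slow .apply(pd.Series) step.
resp_feat = pd.DataFrame(df[RESP].fillna("").astype(str).map(extract_all).tolist(), index=df.index)
resp_feat = resp_feat.add_prefix("resp_")

if PR is not None and PR in df.columns:
    prm_feat = pd.DataFrame(df[PR].fillna("").astype(str).map(extract_all).tolist(), index=df.index)
    prm_feat = prm_feat.add_prefix("pr_")
else:
    prm_feat = pd.DataFrame(index=df.index)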

# Prompt–Response relational features
if PR is not None and PR in df.columns:
    prompts = df[PR].fillna("").astype(str)
    responses = df[RESP].fillna("").astype(str)
    overlap = [jaccard_overlap(p, r) for p, r in zip(prompts, responses)]
    rel = pd.DataFrame({
        # An empty prompt gives a NaN divisor; the ratio is then set to 0.
        "prr_len_ratio": (resp_feat["resp_word_len"] / (prm_feat["pr_word_len"].replace(0, np.nan))).fillna(0.0),
        "prr_avg_wordlen_diff": resp_feat["resp_avg_word_len"] - prm_feat["pr_avg_word_len"],
        "prr_overlap_jaccard": overlap,
    })
else:
    # No prompt column: keep the relational columns as zeros, indexed like df
    # so the column-wise concat below aligns row by row.
    rel = pd.DataFrame(0.0, index=df.index,
                       columns=["prr_len_ratio", "prr_avg_wordlen_diff", "prr_overlap_jaccard"])

features = pd.concat([resp_feat, prm_feat, rel], axis=1)
data = pd.concat([df.reset_index(drop=True), features.reset_index(drop=True)], axis=1)

print("Feature columns:", [c for c in features.columns][:10], "... (+ more)")
data.head(3)



## Correlations and summaries
We compute Pearson correlations of each feature with `overall_score` and summarize by `overall_bin`.
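
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations, computed only over rows where both values are present. A minimal sketch on made-up numbers, comparing the written-out formula with `np.corrcoef`:

```python
import numpy as np

# Made-up feature values and scores; NaN marks a missing observation.
feature = np.array([1.0, 2.0, 3.0, 4.0, np.nan, 6.0])
score = np.array([2.0, 4.0, 6.0, 8.0, 9.0, np.nan])

# Use only rows where both values are present.
mask = np.isfinite(feature) & np.isfinite(score)
x, y = feature[mask], score[mask]

# Pearson r written out: sum of co-deviations over the root of the product of squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]
print(int(mask.sum()), r_manual, r_numpy)
```

Pearson's r only captures linear association, so a feature with a strong but curved relationship to the score can still show a small coefficient.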


In [None]:

# Correlations
num_cols = [c for c in features.columns]  # all engineered features are numeric
corrs = []
for c in num_cols:
    m = data[c].notna() & data["overall_score"].notna()
    if m.sum() >= 10:
        r = np.corrcoef(data.loc[m, c].astype(float), data.loc[m, "overall_score"].astype(float))[0,1]
        corrs.append((c, r))
    else:
        corrs.append((c, np.nan))
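corr_df = pd.DataFrame(corrs, columns=["feature","pearson_r"]).sort_values("pearson_r", ascending=False)
# Constant or sparse features yield NaN; drop them so they do not fill the bottom table.
# With fewer than 40 features the two tables overlap.
ranked = corr_df.dropna(subset=["pearson_r"])
print("Most positive correlations:")
display(ranked.head(20))
print("Most negative correlations:")
display(ranked.tail(20))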

# Summaries by bin
bin_order = ["0–4","5–8","9–12","13–16","17–20"]
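# observed=False keeps empty bins and makes the categorical grouping behavior explicit.
summary = (data.groupby("overall_bin", observed=False)[num_cols].agg(["mean","median","std"]).reindex(bin_order))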
summary.round(3).head(20)



## Visuals
Basic plots for a few key features vs. score. You can add more as needed.
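
The trend line drawn on each scatter plot is an ordinary least-squares line, evaluated between the 1st and 99th percentiles of the feature so that outliers do not stretch it. A minimal sketch without any plotting, on synthetic data (the slope of 0.05 and the noise level are assumptions chosen for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)              # synthetic feature values
y = 0.05 * x + rng.normal(0, 0.1, size=200)    # synthetic scores with a known slope

# Degree-1 polynomial fit returns (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line only across the central 98% of the feature range.
lo, hi = np.percentile(x, [1, 99])
xs = np.linspace(lo, hi, 50)
ys = slope * xs + intercept
print(round(float(slope), 3), round(float(intercept), 3))
```

The recovered slope should sit close to the value used to generate the data. On real features the same check gives a quick numeric companion to the visual impression.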


In [None]:

def scatter_with_fit(x, y, title, xlabel, ylabel):
    m = np.isfinite(x) & np.isfinite(y)
    x0, y0 = x[m], y[m]
    plt.figure(figsize=(6,4))
    plt.scatter(x0, y0, s=6, alpha=0.35)
    if len(x0) >= 5:
        coef = np.polyfit(x0, y0, deg=1)
        xs = np.linspace(np.percentile(x0, 1), np.percentile(x0, 99), 200)
        ys = coef[0]*xs + coef[1]
        plt.plot(xs, ys)
    plt.title(title); plt.xlabel(xlabel); plt.ylabel(ylabel); plt.tight_layout(); plt.show()

scatter_with_fit(data["resp_word_len"].astype(float).values,
                 data["overall_score"].astype(float).values,
                 "Response Word Length vs Overall Score", "resp_word_len", "overall_score")

scatter_with_fit(data["resp_adj_ratio"].astype(float).values,
                 data["overall_score"].astype(float).values,
                 "Adjective Ratio vs Overall Score", "resp_adj_ratio", "overall_score")

scatter_with_fit(data["resp_verb_ratio"].astype(float).values,
                 data["overall_score"].astype(float).values,
                 "Verb Ratio vs Overall Score", "resp_verb_ratio", "overall_score")

scatter_with_fit(data["resp_flesch_readability"].astype(float).values,
                 data["overall_score"].astype(float).values,
                 "Flesch Readability vs Overall Score", "resp_flesch_readability", "overall_score")

if "prr_overlap_jaccard" in data.columns:
    scatter_with_fit(data["prr_overlap_jaccard"].astype(float).values,
                     data["overall_score"].astype(float).values,
                     "Prompt–Response Overlap (Jaccard) vs Overall Score", "prr_overlap_jaccard", "overall_score")



## Quick Ridge regression for feature ranking
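Train a small Ridge model on the standardized engineered features to see which signals carry the most weight in a linear sense. The reported scores are in-sample, so treat them as a rough fit check, not an estimate of predictive accuracy.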
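
Standardizing first is what makes the coefficients comparable: each one then describes the change in score per standard deviation of its feature. The sketch below illustrates this with the closed-form ridge solution in plain NumPy, not with scikit-learn's `Ridge`. The synthetic features, their scales, and the true effects are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic features on very different scales (assumed for illustration).
length = rng.normal(200, 50, n)       # e.g. a word count
ratio = rng.normal(0.2, 0.05, n)      # e.g. a POS ratio
noise_feat = rng.normal(0, 1, n)      # unrelated to the target
y = 0.01 * length - 20.0 * ratio + rng.normal(0, 0.1, n)

X = np.column_stack([length, ratio, noise_feat])
names = ["length", "ratio", "noise_feat"]

# Standardize columns so each coefficient is per standard deviation of its feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Closed-form ridge on centered data: beta = (Z'Z + alpha*I)^-1 Z'(y - mean(y)).
alpha = 2.0
beta = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))

# Rank by absolute size so strong negative effects are not pushed to the bottom.
order = np.argsort(-np.abs(beta))
for i in order:
    print(names[i], round(float(beta[i]), 3))
```

On the raw scale the `ratio` coefficient (-20) dwarfs the `length` coefficient (0.01), yet after standardization their effects differ only by about a factor of two. Sorting by absolute value is one reasonable way to read "most influential"; correlated features can still share or trade weight, so the ranking is indicative only.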


In [None]:

X = data[[c for c in features.columns]].astype(float).fillna(0.0).values
y = data["overall_score"].astype(float).values

# Keep only finite rows
m = np.isfinite(X).all(axis=1) & np.isfinite(y)
X, y = X[m], y[m]

model = make_pipeline(StandardScaler(with_mean=True, with_std=True), Ridge(alpha=2.0, random_state=42))
model.fit(X, y)
y_hat = model.predict(X)
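# Take the square root explicitly: the `squared=` argument of mean_squared_error
# is deprecated and absent from newer scikit-learn releases.
print("In-sample RMSE:", np.sqrt(mean_squared_error(y, y_hat)))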
print("In-sample R^2 :", r2_score(y, y_hat))

ridge = model.named_steps["ridge"]
coef = ridge.coef_
feat_names = [c for c in features.columns]
coef_df = pd.DataFrame({"feature": feat_names, "coef": coef}).sort_values("coef", ascending=False)
print("Top positive features (Ridge):")
display(coef_df.head(20))
print("Top negative features (Ridge):")
display(coef_df.tail(20))
