# 📄 Document Scoring Pipeline Overview

This module implements a complete scoring and results-generation pipeline for **authorship verification**
based on the comparison of reference n-grams and their paraphrases across known, unknown, and baseline (no-context) model evaluations.

The pipeline extends earlier implementations by introducing:
1. **Log-Likelihood Ratio (LLR)** computations that *exclude* the reference phrase from the denominator (true log-odds formulation).
2. **Baseline-normalised LLRs**, which subtract the out-of-context (no-context) evidence to control for general phrase commonality.
3. Modular, numerically stable helper functions for readability and maintainability.

---

## 🧠 Core Concept

For each phrase occurrence, the model provides log-probabilities for:
- The **reference phrase** (as used in the known author’s text).
- Several **paraphrased alternatives**.

Each scoring variant captures a different aspect of stylistic consistency between known and unknown authors:

| Score Type | Formula (conceptual) | Interpretation |
|-------------|----------------------|----------------|
| **Included-reference** (original) | \(\displaystyle \text{PMF} = \frac{P_\text{ref}}{\sum_i P_i}\) | Fraction of total probability mass assigned to the reference among all candidates. Always ≤ 1. |
| **Excluded-reference LLR** (new) | \(\displaystyle \text{LLR} = \log P_\text{ref} - \log \sum_{q \neq \text{ref}} P_q\) | Log-odds (signed). Positive → model prefers reference phrasing; negative → prefers paraphrases. |
| **Baseline-normalised LLR** | \(\displaystyle \text{LLR}^\text{norm} = \text{LLR}_\text{context} - \text{LLR}_\text{no-context}\) | Contextual boost: how much the author’s context increases preference for their phrasing compared to general usage. |

All computations are performed in **log-space** using the numerically stable **log-sum-exp** function.
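
As a rough illustration of the three variants in the table, the sketch below scores a single made-up phrase with one reference and two paraphrases. The probabilities, the helper names (`lse`, `pmf_included`, `llr_excluded`) and the convention that the reference sits first in each array are assumptions for this example only; they are not taken from the pipeline's data.

```python
import numpy as np

# Hypothetical log-probabilities (nats). Index 0 is the reference, the rest are paraphrases.
ctx_logp = np.log(np.array([0.6, 0.3, 0.1]))    # scored with the author's context
base_logp = np.log(np.array([0.2, 0.5, 0.3]))   # scored with no context (baseline)

def lse(x):
    """Stable log-sum-exp of a 1D array."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def pmf_included(logp):
    """Reference probability divided by the sum over ALL candidates."""
    return float(np.exp(logp[0] - lse(logp)))

def llr_excluded(logp):
    """log P_ref minus log of the summed paraphrase probabilities."""
    return float(logp[0] - lse(logp[1:]))

pmf_ctx = pmf_included(ctx_logp)     # 0.6 / 1.0
llr_ctx = llr_excluded(ctx_logp)     # log(0.6 / 0.4)
llr_base = llr_excluded(base_logp)   # log(0.2 / 0.8)
llr_norm = llr_ctx - llr_base        # contextual boost over the baseline

print(pmf_ctx, llr_ctx, llr_base, llr_norm)
```

In this toy case the baseline LLR is negative (the paraphrases are more common out of context), so the normalised score is larger than the contextual LLR on its own.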

---

## ⚙️ Function Overview

### `_ensure_logprob(df)`
Ensures a `logprob` column exists. Converts `raw_prob` (0–1) to log-space while clipping extreme values for stability.
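
A minimal sketch of why the clipping matters, using a toy frame with invented `raw_prob` values: without a floor, a probability of exactly zero would become `-inf` in log-space. The `1e-45` floor mirrors the value used in the helper defined later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy input: an ordinary probability, a vanishingly small one, and an exact zero.
df = pd.DataFrame({"raw_prob": [0.5, 1e-60, 0.0]})

floor = 1e-45  # same floor as the helper; anything smaller is raised to this value
df["logprob"] = np.log(np.clip(df["raw_prob"].to_numpy(), floor, 1.0))

print(df)
```

Both the tiny and the zero probability end up at the same finite log-probability, so downstream sums stay finite.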

### `_logsumexp(arr)`
Computes \(\log\sum_i e^{x_i}\) safely by factoring out the maximum value, which keeps the exponentials from overflowing or from all underflowing to zero.

### `_score_included_reference(group)`
Implements the original probability-mass-based score where the reference is **included** in the denominator.
Returns:
- `pmf_included`
- `llr_included_log10` (your original −log₁₀ PMF formulation)

### `_score_excluded_reference(group)`
Implements the **new LLR-excluded** formulation (true log-odds).
Returns:
- `llr_excl_nats` (natural log)
- `llr_excl_log10`
- `odds_ref_vs_paras`
- `p_ref_vs_paras` (two-bin probability that the model “chooses” the reference phrasing)
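
The returned quantities are different views of the same number. The sketch below shows the conversions on a made-up LLR; `llr_to_odds_and_prob` is a hypothetical helper written for this example, not a function in the pipeline.

```python
import numpy as np

def llr_to_odds_and_prob(llr_nats: float):
    """Convert an excluded-reference LLR (nats) to odds and a two-bin probability."""
    odds = float(np.exp(llr_nats))
    p_two_bin = odds / (1.0 + odds)
    return odds, p_two_bin

# Hypothetical case: the reference is three times as probable as all paraphrases combined.
llr_nats = float(np.log(3.0))
odds, p = llr_to_odds_and_prob(llr_nats)
llr_log10 = llr_nats / np.log(10.0)   # change of base from nats to log10

print(odds, p, llr_log10)
```

For very large positive LLRs `np.exp` overflows to `inf` and `inf / (1 + inf)` evaluates to NaN; the logistic form `1 / (1 + np.exp(-llr))` is a common alternative if that case matters.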

### `_score_no_context_per_phrase(group)`
Computes both included and excluded metrics for the *no-context* baseline (grouped by `phrase_num` only).

### `_score_context_per_occurrence(group, label_prefix)`
Computes included and excluded metrics for each contextual dataset (`known` or `unknown`), grouped by `phrase_num` × `phrase_occurence`.
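
To make the grouping concrete, here is a small sketch that scores each (`phrase_num`, `phrase_occurence`) pair of a toy table. The data, the `"paraphrase"` label and the `excluded_llr` helper are assumptions for illustration; for brevity the helper sums probabilities directly rather than using the stable log-sum-exp.

```python
import numpy as np
import pandas as pd

# Toy contexted table: one phrase observed at two occurrences.
toy = pd.DataFrame({
    "phrase_num": ["p1", "p1", "p1", "p1"],
    "phrase_occurence": [1, 1, 2, 2],
    "phrase_type": ["reference", "paraphrase", "reference", "paraphrase"],
    "logprob": np.log([0.8, 0.2, 0.5, 0.5]),
})

def excluded_llr(group: pd.DataFrame) -> float:
    """log P_ref minus log of the summed paraphrase probabilities for one group."""
    is_ref = group["phrase_type"] == "reference"
    ref = group.loc[is_ref, "logprob"].max()
    paras = group.loc[~is_ref, "logprob"].to_numpy()
    return float(ref - np.log(np.sum(np.exp(paras))))

# One output row per occurrence; iterating the groupby yields (key tuple, sub-frame) pairs.
rows = [
    {"phrase_num": key[0], "phrase_occurence": key[1], "llr_excl_nats": excluded_llr(g)}
    for key, g in toy.groupby(["phrase_num", "phrase_occurence"])
]
per_occurrence = pd.DataFrame(rows)
print(per_occurrence)
```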

### `create_results_doc_pipeline(doc_loc, write_excel=True, save_dir=None, phrase_loc=None)`
Main orchestrator function that:
1. Loads the model output sheets (`docs`, `known`, `unknown`, `no context`, `metadata`).
2. Optionally filters by a phrase list (`phrase_loc`).
3. Computes all included, excluded, and baseline-normalised scores.
4. Merges results into a consolidated **LLR** sheet and summary **metadata** table.
5. Writes results back to Excel.
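
The optional phrase filter in step 2 can be sketched with two inner merges on toy frames. The phrases and `phrase_num` values below are invented, and the `drop_duplicates()` call is an extra precaution added for this example (it avoids duplicating rows if a reference phrase is listed more than once); it is not part of the pipeline code.

```python
import pandas as pd

# Toy no-context sheet: two phrases, each with a reference and one paraphrase.
no_context = pd.DataFrame({
    "phrase_num": ["p1", "p1", "p2", "p2"],
    "phrase_type": ["reference", "paraphrase", "reference", "paraphrase"],
    "phrase": ["in order to", "so as to", "as well as", "along with"],
})
# Toy reviewed phrase list: only the first phrase is marked to keep.
phrase_list = pd.DataFrame({"phrase": ["in order to", "as well as"], "keep_phrase": [1, 0]})

keep = phrase_list.loc[phrase_list["keep_phrase"] == 1, ["phrase"]]
refs = no_context[no_context["phrase_type"] == "reference"]

# Step 1: find the phrase_num values whose reference phrase is on the keep list.
kept_nums = refs.merge(keep, on="phrase", how="inner")[["phrase_num"]].drop_duplicates()

# Step 2: keep every row (reference and paraphrases) belonging to those phrase_num values.
filtered = no_context.merge(kept_nums, on="phrase_num", how="inner")
print(filtered)
```

The same `kept_nums` frame would then be merged against the `known` and `unknown` sheets so that all three tables cover the same phrases.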

---

## 📊 Output Tables

### **LLR Sheet**
One row per phrase occurrence with the following key columns:

| Column | Description |
|---------|-------------|
| `pmf_known`, `pmf_unknown`, `pmf_no_context` | Probability-mass of reference (included method). |
| `llr_known`, `llr_unknown`, `llr_no_context` | −log₁₀ PMF (your original metric). |
| `llr_excl_*_nats`, `llr_excl_*_log10` | Log-odds scores excluding reference (in nats and log₁₀). |
| `odds_*`, `p_ref_vs_paras_*` | Interpretable odds and choice probabilities for reference phrasing. |
| `llr_excl_norm_*_nats`, `llr_excl_norm_*_log10` | Baseline-normalised scores (context − no-context). |

### **Metadata Summary**
Aggregated document-level statistics including:
- Total and normalised sums of each score variant.
- Counts of unique phrases and retained phrases.
- Ready for thresholding or correlation with ground-truth author labels.
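
As a hedged sketch of how such document-level aggregates can be formed, the example below sums one score column over a toy LLR sheet and divides by the number of unique phrases. The values are invented; dropping infinite scores before summing follows the treatment of the raw excluded sums in the pipeline.

```python
import numpy as np
import pandas as pd

# Toy LLR sheet: phrase p1 occurs twice (one occurrence had no paraphrases, hence inf).
llr_sheet = pd.DataFrame({
    "phrase_num": ["p1", "p1", "p2"],
    "llr_excl_known_nats": [1.0, np.inf, -0.5],
})

# Infinite scores are treated as missing so that they do not dominate the total.
finite = llr_sheet["llr_excl_known_nats"].replace([np.inf, -np.inf], np.nan)
total = float(finite.sum(skipna=True))

phrases_kept = int(llr_sheet["phrase_num"].nunique())
normalised = total / phrases_kept

print(total, phrases_kept, normalised)
```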

---

## 🧩 Numerical Stability

All calculations use **log-sum-exp** for summing probabilities in log-space:
\[
\log\sum_i e^{x_i} = m + \log\sum_i e^{x_i - m},\quad m=\max_i x_i
\]
This prevents numerical underflow for extremely small probabilities (e.g., long n-grams).
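
The effect is easy to reproduce. In the sketch below (with made-up log-probabilities far below the float64 range of `exp`), the naive computation collapses to `-inf` while the shifted version stays finite. The standalone `logsumexp` here is written for the example and only mirrors the idea of the notebook's helper.

```python
import numpy as np

def logsumexp(logvals):
    """log(sum(exp(logvals))) computed after subtracting the maximum."""
    logvals = np.asarray(logvals, dtype=float)
    m = np.max(logvals)
    return float(m + np.log(np.sum(np.exp(logvals - m))))

# Two candidates whose probabilities are far too small for exp() to represent.
tiny = np.array([-1000.0, -1001.0])

with np.errstate(divide="ignore"):
    naive = float(np.log(np.sum(np.exp(tiny))))   # exp underflows to 0, so log(0) = -inf

stable = logsumexp(tiny)                          # -1000 + log(1 + e^-1)

print(naive, stable)
```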

---

## 🧠 Interpretation Summary

| Metric | Interpretation |
|---------|----------------|
| **Included LLR (old)** | Share of probability mass the model assigns to the reference phrasing among all candidates, reported as −log₁₀ PMF, so lower values indicate a stronger preference. |
| **Excluded LLR (new)** | Evidence strength: how much more probable the author phrasing is than all alternatives combined. |
| **Baseline-Normalised** | Measures stylistic bias: contextual boost relative to out-of-context baseline. |
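
The included and excluded scores are closely related. Assuming each group contains exactly one reference row, so that the included denominator is the reference plus its paraphrases, the excluded LLR is the logit of the included PMF:

\[
\text{PMF} = \frac{P_\text{ref}}{P_\text{ref} + \sum_{q \neq \text{ref}} P_q}
\quad\Longrightarrow\quad
\frac{\text{PMF}}{1 - \text{PMF}} = \frac{P_\text{ref}}{\sum_{q \neq \text{ref}} P_q}
\quad\Longrightarrow\quad
\text{LLR} = \log\frac{\text{PMF}}{1 - \text{PMF}}
\]

For example, a PMF of 0.75 corresponds to odds of 3 and an LLR of \(\log 3 \approx 1.10\) nats.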

---

## 🧮 Aggregated Log-Likelihood Ratio Formulation

For a set of common phrases \( P \), let \( r_p \) denote the reference phrasing of phrase \( p \) and \( C_{p,j} \) the context of its \( j \)-th occurrence in a given text:

\[
\text{LLR}_{p,j}
  = \log P_\text{model}(r_p \mid C_{p,j})
  - \log \!\!\sum_{q \neq r_p} P_\text{model}(q \mid C_{p,j})
\]

The **total evidence score** for a document is obtained by summing over all phrases and occurrences:

\[
S_\text{total}
  = \sum_{p \in P} \sum_j \text{LLR}_{p,j}
\]

A high \( S_\text{total} \) indicates the model consistently prefers the known author’s phrasing,
while lower or negative values suggest stylistic divergence between documents.

In [60]:
import ast
import os
import glob

from pathlib import Path

import numpy as np
import pandas as pd

In [61]:
doc_loc = '/Volumes/BCross/paraphrase examples slurm/Wiki-test/hodja_nasreddin_text_1 vs hodja_nasreddin_text_3.xlsx'
phrase_loc = '/Volumes/BCross/paraphrase examples slurm/wiki-phrase-list-reviewed.xlsx'

known = pd.read_excel(doc_loc, sheet_name="known")
unknown = pd.read_excel(doc_loc, sheet_name="unknown")
no_context = pd.read_excel(doc_loc, sheet_name="no context")
metadata = pd.read_excel(doc_loc, sheet_name="metadata")

phrase_list = pd.read_excel(phrase_loc)
phrases_to_keep = phrase_list[phrase_list['keep_phrase'] == 1].copy()

# Convert the stringified tuples into actual tuples, then into lists
phrases_to_keep['tokens'] = phrases_to_keep['tokens'].apply(lambda x: list(ast.literal_eval(x)) if isinstance(x, str) else list(x))
phrases_to_keep = phrases_to_keep[['phrase']]
        
reference_phrases = no_context[no_context['phrase_type'] == 'reference'].copy()

# Keep only reference phrases that appear in the reviewed phrase list (joined on 'phrase')
merged_phrases = pd.merge(reference_phrases, phrases_to_keep, on='phrase', how='inner')
merged_phrases = merged_phrases[['phrase_num']]

no_context = pd.merge(no_context, merged_phrases, on='phrase_num', how='inner')
known = pd.merge(known, merged_phrases, on='phrase_num', how='inner')
unknown = pd.merge(unknown, merged_phrases, on='phrase_num', how='inner')

## Numerics & utilities

In [62]:
def _ensure_logprob(df: pd.DataFrame) -> pd.DataFrame:
    """
    Ensure a 'logprob' column exists.
    If only 'raw_prob' is present, create logprob = log(max(raw_prob, 1e-45)).
    """
    if 'logprob' not in df.columns:
        if 'raw_prob' not in df.columns:
            raise ValueError("Dataframe must contain either 'logprob' (preferred) or 'raw_prob'.")
        df = df.copy()
        df['logprob'] = np.log(np.clip(df['raw_prob'].to_numpy(), 1e-45, 1.0))
    else:
        df = df.copy()
        df['logprob'] = pd.to_numeric(df['logprob'])
    return df

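def _logsumexp(logvals: np.ndarray) -> float:
    """Numerically stable log-sum-exp for a 1D array. Returns -inf for empty arrays."""
    if logvals.size == 0:
        return -np.inf
    m = np.max(logvals)
    # If the max is not finite (e.g. all entries are -inf), subtracting it would yield NaN.
    if not np.isfinite(m):
        return float(m)
    return m + np.log(np.sum(np.exp(logvals - m)))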

## Scoring kernels per group

In [63]:
def _score_included_reference(group: pd.DataFrame) -> pd.Series:
    """
    Your original score (reference INCLUDED in denominator):
      pmf = P_ref / sum_all
      llr_included_log10 = -log10(pmf) (your sign convention)
    Works in probability space; we derive prob from logprob safely.
    """
    # Extract as probabilities from logprob
    probs = np.exp(group['logprob'].to_numpy())
    sum_all = probs.sum()

    # Reference prob: take max in case of duplicated 'reference' rows
    ref_mask = (group['phrase_type'] == 'reference')
    if not ref_mask.any():
        # No reference in this group
        pmf = np.nan
        llr_log10 = np.nan
    else:
        ref_prob = np.exp(group.loc[ref_mask, 'logprob'].max())
        pmf = ref_prob / sum_all if sum_all > 0 else np.nan
        # -log10(pmf) is undefined when pmf is NaN or not strictly positive; report NaN, not 0
        llr_log10 = -np.log10(pmf) if (np.isfinite(pmf) and pmf > 0) else np.nan

    return pd.Series({
        'pmf_included': pmf,               # analogous to your pmf_* columns
        'llr_included_log10': llr_log10    # analogous to your llr_* columns
    })

def _score_excluded_reference(group: pd.DataFrame) -> pd.Series:
    """
    New score (reference EXCLUDED from denominator):
      LLR_excl = log P_ref - log sum_{paras} P(para)
    Returned in both nats and log10, and with two-bin odds/prob convenience.
    """
    ref_mask = (group['phrase_type'] == 'reference')
    if not ref_mask.any():
        return pd.Series({
            'llr_excl_nats': np.nan,
            'llr_excl_log10': np.nan,
            'odds_ref_vs_paras': np.nan,
            'p_ref_vs_paras': np.nan
        })

    logP_ref = group.loc[ref_mask, 'logprob'].max()
    logP_paras = group.loc[~ref_mask, 'logprob'].to_numpy()

    if logP_paras.size == 0:
        llr = np.inf
        odds = np.inf
        p_two_bin = 1.0
    else:
        log_sum_paras = _logsumexp(logP_paras)
        llr = logP_ref - log_sum_paras
        odds = float(np.exp(llr))
        p_two_bin = odds / (1.0 + odds)

    return pd.Series({
        'llr_excl_nats': llr,
        'llr_excl_log10': llr / np.log(10.0),  # +/-inf and NaN propagate with the correct sign
        'odds_ref_vs_paras': odds,
        'p_ref_vs_paras': p_two_bin
    })

def _score_no_context_per_phrase(no_ctx_group: pd.DataFrame) -> pd.Series:
    """
    Build both the 'included' variant (your original) and LLR-excluded for NO CONTEXT,
    grouped by phrase_num (no phrase_occurrence in no_context).
    """
    # Included (your original) over all candidates for this phrase
    inc = _score_included_reference(no_ctx_group)
    
    # Excluded LLR in no-context
    exc = _score_excluded_reference(no_ctx_group)

    # Also count how many candidate rows (reference + paraphrases) we saw for this phrase
    num_phrases = len(no_ctx_group)
    
    return pd.concat([
        pd.Series({'num_phrases': num_phrases, 'phrases_kept': num_phrases}),
        pd.Series({'pmf_no_context': inc['pmf_included'],
                   'llr_no_context': inc['llr_included_log10']}),
        pd.Series({'llr_excl_no_context_nats': exc['llr_excl_nats'],
                   'llr_excl_no_context_log10': exc['llr_excl_log10']})
    ])

def _score_context_per_occurrence(ctx_group: pd.DataFrame, label_prefix: str) -> pd.Series:
    """
    For a contexted table (known or unknown), compute:
      - included version (your original): pmf_included + llr_included_log10
      - excluded LLR: llr_excl_nats/log10 + odds/prob
    label_prefix is 'known' or 'unknown' to name columns downstream.
    """
    inc = _score_included_reference(ctx_group)
    exc = _score_excluded_reference(ctx_group)

    return pd.Series({
        f'pmf_{label_prefix}': inc['pmf_included'],
        f'llr_{label_prefix}': inc['llr_included_log10'],
        f'llr_excl_{label_prefix}_nats': exc['llr_excl_nats'],
        f'llr_excl_{label_prefix}_log10': exc['llr_excl_log10'],
        f'odds_{label_prefix}': exc['odds_ref_vs_paras'],
        f'p_ref_vs_paras_{label_prefix}': exc['p_ref_vs_paras']
    })

## Main pipeline

In [64]:
def create_results_doc_pipeline(doc_loc, write_excel=True, save_dir=None, phrase_loc=None):
    """Pipeline to compute included-LLR (original), excluded-LLR (new), and baseline-normalised scores."""

    doc_name = os.path.basename(doc_loc)
    print(f"Processing Document: {doc_name}")

    # Read sheets
    docs = pd.read_excel(doc_loc, sheet_name="docs")
    known = pd.read_excel(doc_loc, sheet_name="known")
    unknown = pd.read_excel(doc_loc, sheet_name="unknown")
    no_context = pd.read_excel(doc_loc, sheet_name="no context")
    metadata = pd.read_excel(doc_loc, sheet_name="metadata")

    # Optional phrase filter
    if phrase_loc:
        phrase_list = pd.read_excel(phrase_loc)
        phrases_to_keep = phrase_list[phrase_list['keep_phrase'] == 1].copy()
        phrases_to_keep['tokens'] = phrases_to_keep['tokens'].apply(
            lambda x: list(ast.literal_eval(x)) if isinstance(x, str) else list(x)
        )
        phrases_to_keep = phrases_to_keep[['phrase']]

        reference_phrases = no_context[no_context['phrase_type'] == 'reference'].copy()
        merged_phrases = pd.merge(reference_phrases, phrases_to_keep, on='phrase', how='inner')
        merged_phrases = merged_phrases[['phrase_num']]

        no_context = pd.merge(no_context, merged_phrases, on='phrase_num', how='inner')
        known = pd.merge(known, merged_phrases, on='phrase_num', how='inner')
        unknown = pd.merge(unknown, merged_phrases, on='phrase_num', how='inner')

    # Ensure we have logprobs everywhere
    known = _ensure_logprob(known)
    unknown = _ensure_logprob(unknown)
    no_context = _ensure_logprob(no_context)

    # Base table aligning occurrences from known & unknown (as in your code)
    cols = ['phrase_num', 'phrase_occurence', 'original_phrase']
    llr_base = (
        pd.concat([known[cols], unknown[cols]], ignore_index=True)
        .drop_duplicates()
        .sort_values(cols, ascending=[True, True, True])
        .reset_index(drop=True)
    )

    # ---------- NO CONTEXT (by phrase_num) ----------
    no_context_phrase_stats = (
        no_context
        .groupby('phrase_num', dropna=False)
        .apply(_score_no_context_per_phrase,
               include_groups=True)
        .reset_index()
    )
    
    # This provides:
    #  - num_phrases, phrases_kept
    #  - pmf_no_context (INCLUDED score)
    #  - llr_no_context (your -log10 pmf)
    #  - llr_excl_no_context_nats / log10  (EXCLUDED score baseline)

    # Build a quick dict for baseline LLR-excluded (nats) per phrase_num
    baseline_llr_excl_nats = dict(
        zip(no_context_phrase_stats['phrase_num'], no_context_phrase_stats['llr_excl_no_context_nats'])
    )

    # ---------- KNOWN (by phrase_num, phrase_occurence) ----------
    known_phrase_stats = (
        known
        .groupby(['phrase_num', 'phrase_occurence'], dropna=False)
        .apply(lambda g: _score_context_per_occurrence(g, 'known'),
               include_groups=False)
        .reset_index()
    )

    # ---------- UNKNOWN (by phrase_num, phrase_occurence) ----------
    unknown_phrase_stats = (
        unknown
        .groupby(['phrase_num', 'phrase_occurence'], dropna=False)
        .apply(lambda g: _score_context_per_occurrence(g, 'unknown'),
               include_groups=False)
        .reset_index()
    )

    # ---------- Join into final table ----------
    LLR = (
        llr_base
        .assign(
            phrase_num=llr_base['phrase_num'].astype('string'),
            phrase_occurence=pd.to_numeric(llr_base['phrase_occurence'], errors='coerce').astype('Int64')
        )
        .merge(no_context_phrase_stats, on='phrase_num', how='left')
        .merge(known_phrase_stats, on=['phrase_num', 'phrase_occurence'], how='left')
        .merge(unknown_phrase_stats, on=['phrase_num', 'phrase_occurence'], how='left')
    )

    # ---------- Baseline-normalised (EXCLUDED) ----------
    # For each occurrence, subtract the no-context excluded LLR baseline for that phrase.
    # The baseline column was merged in above, so a vectorised subtraction suffices;
    # NaN in either operand propagates to NaN.
    LLR['llr_excl_norm_known_nats'] = LLR['llr_excl_known_nats'] - LLR['llr_excl_no_context_nats']
    LLR['llr_excl_norm_unknown_nats'] = LLR['llr_excl_unknown_nats'] - LLR['llr_excl_no_context_nats']
    LLR['llr_excl_norm_known_log10'] = LLR['llr_excl_norm_known_nats'] / np.log(10.0)
    LLR['llr_excl_norm_unknown_log10'] = LLR['llr_excl_norm_unknown_nats'] / np.log(10.0)

    # ---------- Select & order columns ----------
    # Keep your original columns and add new ones (excluded + baseline-normalised variants)
    LLR = LLR[[
        'phrase_num', 'phrase_occurence', 'original_phrase',
        # no-context summary
        'num_phrases', 'phrases_kept', 'pmf_no_context', 'llr_no_context',
        'llr_excl_no_context_nats', 'llr_excl_no_context_log10',
        # known (included and excluded)
        'pmf_known', 'llr_known', 'llr_excl_known_nats', 'llr_excl_known_log10',
        'odds_known', 'p_ref_vs_paras_known',
        # unknown (included and excluded)
        'pmf_unknown', 'llr_unknown', 'llr_excl_unknown_nats', 'llr_excl_unknown_log10',
        'odds_unknown', 'p_ref_vs_paras_unknown',
        # baseline normalised (excluded)
        'llr_excl_norm_known_nats', 'llr_excl_norm_known_log10',
        'llr_excl_norm_unknown_nats', 'llr_excl_norm_unknown_log10'
    ]]

    # ---------- Summaries ----------
    # Your original sums (included) plus new aggregates (excluded & normalized)
    LLR_summary = pd.DataFrame([{
        'num_phrases': LLR['phrase_num'].nunique(),
        'phrases_kept': LLR.loc[LLR['phrases_kept'] > 0, 'phrase_num'].nunique(),

        # Included (your original)
        'llr_no_context': pd.to_numeric(LLR['llr_no_context'], errors='coerce').sum(skipna=True),
        'llr_known': pd.to_numeric(LLR['llr_known'], errors='coerce').sum(skipna=True),
        'llr_unknown': pd.to_numeric(LLR['llr_unknown'], errors='coerce').sum(skipna=True),

        # Excluded (nats)
        'llr_excl_known_nats': pd.to_numeric(LLR['llr_excl_known_nats'], errors='coerce').replace([np.inf, -np.inf], np.nan).sum(skipna=True),
        'llr_excl_unknown_nats': pd.to_numeric(LLR['llr_excl_unknown_nats'], errors='coerce').replace([np.inf, -np.inf], np.nan).sum(skipna=True),

        # Baseline-normalised (excluded, nats); infinities are dropped as for the raw excluded sums
        'llr_excl_norm_known_nats': pd.to_numeric(LLR['llr_excl_norm_known_nats'], errors='coerce').replace([np.inf, -np.inf], np.nan).sum(skipna=True),
        'llr_excl_norm_unknown_nats': pd.to_numeric(LLR['llr_excl_norm_unknown_nats'], errors='coerce').replace([np.inf, -np.inf], np.nan).sum(skipna=True),
    }])

    # Normalised by phrases_kept
    LLR_summary = LLR_summary.assign(
        normalised_llr_no_context=lambda d: d['llr_no_context'] / d['phrases_kept'],
        normalised_llr_known=lambda d: d['llr_known'] / d['phrases_kept'],
        normalised_llr_unknown=lambda d: d['llr_unknown'] / d['phrases_kept'],
        normalised_llr_excl_known_nats=lambda d: d['llr_excl_known_nats'] / d['phrases_kept'],
        normalised_llr_excl_unknown_nats=lambda d: d['llr_excl_unknown_nats'] / d['phrases_kept'],
        normalised_llr_excl_norm_known_nats=lambda d: d['llr_excl_norm_known_nats'] / d['phrases_kept'],
        normalised_llr_excl_norm_unknown_nats=lambda d: d['llr_excl_norm_unknown_nats'] / d['phrases_kept']
    )

    # ---------- Merge summary into metadata (as before) ----------
    overlapping_cols = LLR_summary.columns.intersection(metadata.columns)
    metadata_final = metadata.drop(columns=overlapping_cols, errors='ignore')
    metadata_final = pd.concat([metadata_final, LLR_summary], axis=1)

    # ---------- Write Excel ----------
    if write_excel:
        print("Writing file")
        if save_dir is None:
            raise ValueError("save_dir must be provided when write_excel=True.")
        path = Path(save_dir) / doc_name
        writer_mode = "a" if path.exists() else "w"
        writer_kwargs = {"engine": "openpyxl", "mode": writer_mode}
        if writer_mode == "a":
            writer_kwargs["if_sheet_exists"] = "replace"

        with pd.ExcelWriter(path, **writer_kwargs) as writer:
            docs.to_excel(writer, index=False, sheet_name="docs")
            known.to_excel(writer, index=False, sheet_name="known")
            unknown.to_excel(writer, index=False, sheet_name="unknown")
            no_context.to_excel(writer, index=False, sheet_name="no context")
            LLR.to_excel(writer, index=False, sheet_name="LLR")
            metadata_final.to_excel(writer, index=False, sheet_name="metadata")

    return metadata_final

In [65]:
orig_dir = "/Volumes/BCross/paraphrase examples slurm/Wiki-test-llama-raw"
save_dir = "/Volumes/BCross/paraphrase examples slurm/Wiki-test-llama-filtered-updated-metrics"
phrase_loc = '/Volumes/BCross/paraphrase examples slurm/wiki-phrase-list-reviewed.xlsx'

os.makedirs(save_dir, exist_ok=True)

# Get all .xlsx files from the original directory
xlsx_files = glob.glob(os.path.join(orig_dir, "*.xlsx"))

all_metadata = []

for i, file_path in enumerate(xlsx_files, start=1):
    print(f"Completing file {i} out of {len(xlsx_files)}")
    
    try:
        metadata = create_results_doc_pipeline(file_path, write_excel=True, save_dir=save_dir, phrase_loc=phrase_loc)
        all_metadata.append(metadata)
    except Exception as e:
        print(f"File failed: {file_path}\nError: {e}")
        continue

# Combine all metadata after processing
if all_metadata:
    full_metadata = pd.concat(all_metadata, ignore_index=True)
    # You can optionally save full_metadata here
else:
    full_metadata = pd.DataFrame()

print("All files complete")


Completing file 1 out of 65
Processing Document: irvine22_text_1 vs itub_text_4.xlsx




Writing file
Completing file 2 out of 65
Processing Document: honestopl_text_5 vs hootmag_text_13.xlsx




Writing file
Completing file 3 out of 65
Processing Document: hootmag_text_1 vs iain99_text_5.xlsx




Writing file
Completing file 4 out of 65
Processing Document: jc37_text_1 vs jeffrey_vernon_merkey_text_10.xlsx




Writing file
Completing file 5 out of 65
Processing Document: intothefire_text_2 vs intothefire_text_12.xlsx




Writing file
Completing file 6 out of 65
Processing Document: icarus3_text_2 vs icarus3_text_4.xlsx




Writing file
Completing file 7 out of 65
Processing Document: icarus3_text_1 vs icarus3_text_4.xlsx




Writing file
Completing file 8 out of 65
Processing Document: intangible_text_3 vs intothefire_text_12.xlsx




Writing file
Completing file 9 out of 65
Processing Document: jerryfriedman_text_2 vs jimharlow99_text_10.xlsx




Writing file
Completing file 10 out of 65
Processing Document: honestopl_text_5 vs honestopl_text_1.xlsx




Writing file
Completing file 11 out of 65
Processing Document: irvine22_text_4 vs itub_text_4.xlsx




Writing file
Completing file 12 out of 65
Processing Document: hootmag_text_10 vs iain99_text_5.xlsx




Writing file
Completing file 13 out of 65
Processing Document: ivoshandor_text_4 vs jasper_deng_text_4.xlsx




Writing file
Completing file 14 out of 65
Processing Document: hootmag_text_10 vs hootmag_text_13.xlsx




Writing file
Completing file 15 out of 65
Processing Document: jerryfriedman_text_1 vs jimharlow99_text_10.xlsx




Writing file
Completing file 16 out of 65
Processing Document: hodja_nasreddin_text_11 vs hodja_nasreddin_text_3.xlsx




Writing file
Completing file 17 out of 65
Processing Document: jeffrey_vernon_merkey_text_1 vs jeffrey_vernon_merkey_text_10.xlsx




Writing file
Completing file 18 out of 65
Processing Document: honestopl_text_4 vs hootmag_text_13.xlsx




Writing file
Completing file 19 out of 65
Processing Document: intothefire_text_10 vs intothefire_text_12.xlsx




Writing file
Completing file 20 out of 65
Processing Document: ivoshandor_text_2 vs ivoshandor_text_1.xlsx




Writing file
Completing file 21 out of 65
Processing Document: jerekrischel_text_10 vs jerryfriedman_text_3.xlsx




Writing file
Completing file 22 out of 65
Processing Document: hodja_nasreddin_text_10 vs hodja_nasreddin_text_3.xlsx




Writing file
Completing file 23 out of 65
Processing Document: hootmag_text_12 vs iain99_text_5.xlsx




Writing file
Completing file 24 out of 65
Processing Document: jeffrey_vernon_merkey_text_11 vs jerekrischel_text_13.xlsx




Writing file
Completing file 25 out of 65
Processing Document: intangible_text_1 vs intothefire_text_12.xlsx




Writing file
Completing file 26 out of 65
Processing Document: jerryfriedman_text_4 vs jimharlow99_text_10.xlsx




Writing file
Completing file 27 out of 65
Processing Document: jeffrey_vernon_merkey_text_3 vs jeffrey_vernon_merkey_text_10.xlsx




Writing file
Completing file 28 out of 65
Processing Document: intothefire_text_2 vs irvine22_text_3.xlsx




Writing file
Completing file 29 out of 65
Processing Document: itub_text_3 vs ivoshandor_text_1.xlsx




Writing file
Completing file 30 out of 65
Processing Document: icarus3_text_3 vs icarus3_text_4.xlsx




Writing file
Completing file 31 out of 65
Processing Document: jerryfriedman_text_4 vs jerryfriedman_text_3.xlsx




Writing file
Completing file 32 out of 65
Processing Document: jc37_text_1 vs jc37_text_4.xlsx




Writing file
Completing file 33 out of 65
Processing Document: honestopl_text_3 vs hootmag_text_13.xlsx




Writing file
Completing file 34 out of 65
Processing Document: ivoshandor_text_2 vs jasper_deng_text_4.xlsx




Writing file
Completing file 35 out of 65
Processing Document: intothefire_text_10 vs irvine22_text_3.xlsx




Writing file
Completing file 36 out of 65
Processing Document: icarus3_text_1 vs intangible_text_2.xlsx




Writing file
Completing file 37 out of 65
Processing Document: jbmurray_text_5 vs jbmurray_text_3.xlsx




Writing file
Completing file 38 out of 65
Processing Document: iain99_text_1 vs icarus3_text_4.xlsx




Writing file
Completing file 39 out of 65
Processing Document: jerryfriedman_text_1 vs jerryfriedman_text_3.xlsx




Writing file
Completing file 40 out of 65
Processing Document: jeffrey_vernon_merkey_text_1 vs jerekrischel_text_13.xlsx




Writing file
Completing file 41 out of 65
Processing Document: intothefire_text_1 vs irvine22_text_3.xlsx




Writing file
Completing file 42 out of 65
Processing Document: honestopl_text_4 vs honestopl_text_1.xlsx




Writing file
Completing file 43 out of 65
Processing Document: jbmurray_text_5 vs jc37_text_4.xlsx




Writing file
Completing file 44 out of 65
Processing Document: irvine22_text_4 vs irvine22_text_3.xlsx




Writing file
Completing file 45 out of 65
Processing Document: hodja_nasreddin_text_10 vs honestopl_text_1.xlsx




Writing file
Completing file 46 out of 65
Processing Document: jeffrey_vernon_merkey_text_3 vs jerekrischel_text_13.xlsx




Writing file
Completing file 47 out of 65
Processing Document: iain99_text_2 vs icarus3_text_4.xlsx




Writing file
Completing file 48 out of 65
Processing Document: jéské_couriano_text_5 vs kashmiri_text_3.xlsx




Writing file
Completing file 49 out of 65
Processing Document: iain99_text_3 vs icarus3_text_4.xlsx




Writing file
Completing file 50 out of 65
Processing Document: jasper_deng_text_3 vs jasper_deng_text_4.xlsx




Writing file
Completing file 51 out of 65
Processing Document: jc37_text_2 vs jeffrey_vernon_merkey_text_10.xlsx




Writing file
Completing file 52 out of 65
Processing Document: hootmag_text_1 vs hootmag_text_13.xlsx




Writing file
Completing file 53 out of 65
Processing Document: hodja_nasreddin_text_1 vs honestopl_text_1.xlsx




Writing file
Completing file 54 out of 65
Processing Document: hodja_nasreddin_text_1 vs hodja_nasreddin_text_3.xlsx




Writing file
Completing file 55 out of 65
Processing Document: intothefire_text_1 vs intothefire_text_12.xlsx




Writing file
Completing file 56 out of 65
Processing Document: jc37_text_10 vs jeffrey_vernon_merkey_text_10.xlsx




Writing file
Completing file 57 out of 65
Processing Document: hootmag_text_12 vs hootmag_text_13.xlsx




Writing file
Completing file 58 out of 65
Processing Document: irvine22_text_1 vs irvine22_text_3.xlsx




Writing file
Completing file 59 out of 65
Processing Document: hodja_nasreddin_text_11 vs honestopl_text_1.xlsx




Writing file
Completing file 60 out of 65
Processing Document: irvine22_text_2 vs irvine22_text_3.xlsx




Writing file
Completing file 61 out of 65
Processing Document: irvine22_text_2 vs itub_text_4.xlsx




Writing file
Completing file 62 out of 65
Processing Document: intangible_text_5 vs intangible_text_2.xlsx




Writing file
Completing file 63 out of 65
Processing Document: honestopl_text_3 vs honestopl_text_1.xlsx




Writing file
Completing file 64 out of 65
Processing Document: intangible_text_5 vs intothefire_text_12.xlsx




Writing file
Completing file 65 out of 65
Processing Document: intangible_text_3 vs intangible_text_2.xlsx




Writing file
All files complete


In [66]:
full_metadata = full_metadata.sort_values(by="index").reset_index(drop=True)
full_metadata

| | index | sample_id | problem | corpus | known_author | unknown_author | unknown_doc_id | known_doc_id | target | num_phrases | phrases_kept | llr_no_context | llr_known | llr_unknown | llr_excl_known_nats | llr_excl_unknown_nats | llr_excl_norm_known_nats | llr_excl_norm_unknown_nats | normalised_llr_no_context | normalised_llr_known | normalised_llr_unknown | normalised_llr_excl_known_nats | normalised_llr_excl_unknown_nats | normalised_llr_excl_norm_known_nats | normalised_llr_excl_norm_unknown_nats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Hodja_Nasreddin vs Hodja_Nasreddin | Wiki | Hodja_Nasreddin | Hodja_Nasreddin | hodja_nasreddin_text_3 | hodja_nasreddin_text_1 | True | 12 | 12 | 14.326749 | 8.385606 | 7.786461 | -5.184796 | -8.498582 | 24.355574 | 17.049020 | 1.193896 | 0.698800 | 0.648872 | -0.432066 | -0.708215 | 2.029631 | 1.420752 |
| 1 | 1 | 2 | Hodja_Nasreddin vs Hodja_Nasreddin | Wiki | Hodja_Nasreddin | Hodja_Nasreddin | hodja_nasreddin_text_3 | hodja_nasreddin_text_10 | True | 11 | 11 | 24.738077 | 3.826455 | 6.932723 | 12.450699 | 9.463412 | 49.062194 | 58.123504 | 2.248916 | 0.347860 | 0.630248 | 1.131882 | 0.860310 | 4.460199 | 5.283955 |
| 2 | 2 | 3 | Hodja_Nasreddin vs Hodja_Nasreddin | Wiki | Hodja_Nasreddin | Hodja_Nasreddin | hodja_nasreddin_text_3 | hodja_nasreddin_text_11 | True | 8 | 8 | 12.045893 | 3.717721 | 3.238463 | 2.575512 | 11.048993 | 9.275192 | 22.807419 | 1.505737 | 0.464715 | 0.404808 | 0.321939 | 1.381124 | 1.159399 | 2.850927 |
| 3 | 3 | 4 | Hodja_Nasreddin vs HonestopL | Wiki | Hodja_Nasreddin | HonestopL | honestopl_text_1 | hodja_nasreddin_text_1 | False | 8 | 8 | 14.044844 | 8.395489 | 4.684318 | -4.992076 | -0.645079 | 22.862020 | 23.597719 | 1.755606 | 1.049436 | 0.585540 | -0.624009 | -0.080635 | 2.857752 | 2.949715 |
| 4 | 4 | 5 | Hodja_Nasreddin vs HonestopL | Wiki | Hodja_Nasreddin | HonestopL | honestopl_text_1 | hodja_nasreddin_text_10 | False | 8 | 8 | 8.673620 | 2.237052 | 1.711093 | 16.931859 | 18.116853 | 23.870512 | 20.661802 | 1.084203 | 0.279632 | 0.213887 | 2.116482 | 2.264607 | 2.983814 | 2.582725 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 92 | 93 | JerryFriedman vs JerryFriedman | Wiki | JerryFriedman | JerryFriedman | jerryfriedman_text_3 | jerryfriedman_text_4 | True | 20 | 20 | 34.691170 | 12.294378 | 19.388142 | 21.446797 | 10.008480 | 70.096737 | 76.662875 | 1.734559 | 0.614719 | 0.969407 | 1.072340 | 0.500424 | 3.504837 | 3.833144 |
| 61 | 93 | 94 | JerryFriedman vs Jimharlow99 | Wiki | JerryFriedman | Jimharlow99 | jimharlow99_text_10 | jerryfriedman_text_1 | False | 10 | 10 | 16.917944 | 3.658647 | 2.726263 | 14.514424 | 30.414451 | 45.758274 | 65.356000 | 1.691794 | 0.365865 | 0.272626 | 1.451442 | 3.041445 | 4.575827 | 6.535600 |
| 62 | 94 | 95 | JerryFriedman vs Jimharlow99 | Wiki | JerryFriedman | Jimharlow99 | jimharlow99_text_10 | jerryfriedman_text_2 | False | 8 | 8 | 16.362831 | 2.330475 | 2.988822 | 10.183002 | 11.177443 | 42.712515 | 46.627445 | 2.045354 | 0.291309 | 0.373603 | 1.272875 | 1.397180 | 5.339064 | 5.828431 |
| 63 | 95 | 96 | JerryFriedman vs Jimharlow99 | Wiki | JerryFriedman | Jimharlow99 | jimharlow99_text_10 | jerryfriedman_text_4 | False | 9 | 9 | 16.995738 | 2.342031 | 3.040590 | 29.339957 | 57.621995 | 53.833676 | 89.361682 | 1.888415 | 0.260226 | 0.337843 | 3.259995 | 6.402444 | 5.981520 | 9.929076 |


In [67]:
result_save_loc = '/Volumes/BCross/paraphrase examples slurm/wiki-test-llama-filtered-results-updated-metrics.xlsx'

full_metadata.to_excel(result_save_loc, index=False)

## Run on Test Doc

In [68]:
# meta = create_results_doc_pipeline(
#     "/Volumes/BCross/paraphrase examples slurm/Test blank doc.xlsx",
#     write_excel=True,
#     save_dir="/Volumes/BCross/paraphrase examples slurm/",   # writes updated workbook
#     phrase_loc=None # no filtering
# )
# print(meta)