# Script to PostProcess LLM Output

This notebook post-processes the output from an LLM. Passing `verbose=True` to `process_records` prints a status line for each row, showing which cleaning stage handled it.

In [1]:
# 1) Set the location
%cd ../../code/

/Users/user/Documents/GitHub/paraphrase_py/code




In [2]:
from postprocessing import process_records, print_summary
from read_and_write_docs import read_jsonl, write_jsonl
from pathlib import Path

import pandas as pd

## Set Locations for PostProcessing

In [3]:
base_loc = "/Volumes/BCross/datasets/author_verification"
data_type = "training"
corpus = "Wiki"
model_dir = "Qwen_2.5_1.5B"
data_base_loc = f"{base_loc}/{data_type}/{corpus}/{model_dir}"

In [4]:
in_dir  = Path(f"{data_base_loc}/full_doc_paraphrase")
out_dir = Path(f"{data_base_loc}/full_doc_paraphrase_clean")
out_dir.mkdir(parents=True, exist_ok=True)

## Loop Through Generated Files

The loop below processes each `.jsonl` file in the input directory given above and prints whether it succeeded or failed. A file is skipped if an output with the same name already exists in the output directory.

In [5]:
for infile in sorted(in_dir.glob("*.jsonl")):
    outfile = out_dir / infile.name

    # ── 1) skip if we’ve already processed this file ────────────
    if outfile.exists():
        print(f"⏭️   skipping {infile.name} (already in output dir)")
        continue

    try:
        df_in  = read_jsonl(infile)
        df_out = process_records(df_in)
        write_jsonl(df_out, outfile)

        print(f"✅  wrote {outfile.relative_to(out_dir.parent)} "
              f"({len(df_out):,} rows)")

    except Exception as exc:
        # log the filename and the exception, then move on
        print(f"❌  FAILED on {infile.name}: {type(exc).__name__}: {exc}")

print("\n🎉  Directory pass complete.")

⏭️   skipping 142_196_88_228_text_1.jsonl (already in output dir)
⏭️   skipping 142_196_88_228_text_3.jsonl (already in output dir)
⏭️   skipping 142_196_88_228_text_4.jsonl (already in output dir)
⏭️   skipping a_man_in_black_text_1.jsonl (already in output dir)
⏭️   skipping a_man_in_black_text_2.jsonl (already in output dir)
⏭️   skipping a_man_in_black_text_5.jsonl (already in output dir)
⏭️   skipping aban1313_text_1.jsonl (already in output dir)
⏭️   skipping aban1313_text_2.jsonl (already in output dir)
⏭️   skipping aban1313_text_5.jsonl (already in output dir)
⏭️   skipping akuri_text_1.jsonl (already in output dir)
⏭️   skipping akuri_text_2.jsonl (already in output dir)
⏭️   skipping akuri_text_3.jsonl (already in output dir)
⏭️   skipping alanbarnet_text_10.jsonl (already in output dir)
⏭️   skipping alanbarnet_text_3.jsonl (already in output dir)
⏭️   skipping alanbarnet_text_4.jsonl (already in output dir)
⏭️   skipping alanyst_text_1.jsonl (already in output dir)
⏭️   sk

## Check Which Files Still to Process

Here we compare the two directories and list the files that are present in the input directory but missing from the output directory. Any file listed in this printout has failed and needs further investigation.

In [6]:
pending = sorted(
    infile.name
    for infile in in_dir.glob("*.jsonl")
    if not (out_dir / infile.name).exists()
)

if pending:
    print(f"{len(pending)} JSONL file(s) still to process:\n")
    for name in pending:
        print("  •", name)
else:
    print("✅  All *.jsonl files in", in_dir, "already have outputs.")

✅  All *.jsonl files in /Volumes/BCross/datasets/author_verification/training/Wiki/Qwen_2.5_1.5B/full_doc_paraphrase already have outputs.


## Failed Files

The code now handles several of the reasons a file may fail on a row-by-row basis, rather than failing the whole file. To look into a failure in more detail, take a file from the "still to process" list above and run code similar to the cells below.
Any rows that still fail are printed as a Python list at the end of the verbose output. You can use those indices to print out the failing rows.

In [11]:
failed_df = read_jsonl(f"{data_base_loc}/full_doc_paraphrase/alienus_text_2.jsonl")

In [12]:
processed_failed_df = process_records(failed_df, verbose=True)

[    0] ✔︎  stage=wrap_plain_text
[    1] ✔︎  stage=wrap_plain_text
[    2] ✔︎  stage=wrap_plain_text
[    3] ✔︎  stage=fix_salvage_quotes
[    4] ✔︎  stage=wrap_plain_text
[    5] ✔︎  stage=wrap_plain_text
[    6] ✔︎  stage=wrap_plain_text
[    7] ✔︎  stage=wrap_plain_text
[    8] ✔︎  stage=wrap_plain_text
[    9] ✔︎  stage=wrap_plain_text
[   10] ✔︎  stage=wrap_plain_text
[   11] ✔︎  stage=fix_salvage_quotes
[   12] ✔︎  stage=wrap_plain_text
[   13] ✔︎  stage=wrap_plain_text
[   14] ✔︎  stage=wrap_plain_text
[   15] ✔︎  stage=wrap_plain_text
[   16] ✔︎  stage=wrap_plain_text
[   17] ✔︎  stage=wrap_plain_text
[   18] ✔︎  stage=wrap_plain_text
[   19] ✔︎  stage=wrap_plain_text
[   20] ✔︎  stage=wrap_plain_text
[   21] ✔︎  stage=fix_salvage_quotes
[   22] ✔︎  stage=wrap_plain_text
[   23] ✔︎  stage=wrap_plain_text
[   24] ✔︎  stage=fix_salvage_quotes
[   25] ✔︎  stage=wrap_plain_text
[   26] ✔︎  stage=wrap_plain_text
[   27] ✔︎  stage=wrap_plain_text
[   28] ✔︎  stage=wrap_plain_text
[ 
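The stage names in the verbose output (`wrap_plain_text`, `fix_salvage_quotes`) suggest a chain of fallbacks applied to each `generated_text`. The sketch below is a hypothetical illustration of such a chain, not the actual `process_records` implementation. The function name `clean_generated_text`, the `original` and `failed` stage labels, the regular expression, and the assumption that a well-formed output is a JSON object with a `new_document` string are all ours.

```python
import json
import re


def clean_generated_text(raw, key="new_document"):
    """Return (clean_text, stage, errors) for one raw LLM output string.

    Hypothetical sketch: tries progressively more forgiving strategies and
    records why each earlier strategy was rejected.
    """
    errors = []

    # Stage "original" (assumed name): the output is already valid JSON
    # and holds the paraphrase under the expected key.
    try:
        obj = json.loads(raw)
        if isinstance(obj, dict) and isinstance(obj.get(key), str):
            return obj[key], "original", errors
        errors.append(f"original: missing string field {key!r}")
    except json.JSONDecodeError as exc:
        errors.append(f"original: {exc}")

    # Stage "fix_salvage_quotes": the text looks like the expected JSON
    # object but will not parse, e.g. because of unescaped quotes inside
    # the value. Take everything between the first and last quote.
    pattern = r'"' + re.escape(key) + r'"\s*:\s*"(.*)"\s*\}\s*$'
    match = re.search(pattern, raw, flags=re.DOTALL)
    if match:
        return match.group(1), "fix_salvage_quotes", errors

    # Stage "wrap_plain_text": the model ignored the JSON format and
    # returned the paraphrase as bare text.
    stripped = raw.strip()
    if stripped and not stripped.startswith("{"):
        return stripped, "wrap_plain_text", errors

    # Nothing worked: leave the row for manual inspection.
    return None, "failed", errors


examples = [
    '{"new_document": "A clean paraphrase."}',
    '{"new_document": "He said "stop" and left."}',
    "Unfortunately, the model returned plain text.",
    "{'version': 2, 'new_title': ''}",
]
for raw in examples:
    text, stage, errors = clean_generated_text(raw)
    print(f"stage={stage:<20} text={text!r}")
```

Under these assumptions, a dict-like output with no recoverable `new_document` field falls through every stage, which is the kind of row that would need manual inspection.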

In [13]:
failed_df.iloc[[298, 822]]

Unnamed: 0,doc_id,orig_doc_id,corpus,author,texttype,text,generated_text,time_sec,tokens_per_sec
298,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","{'version': 2, 'new_title': '', 'new_author': ...",13.782775,113.1122
822,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","{'language_code': 'en', 'original_text': 'Sadl...",18.445308,93.411287


In [14]:
read_jsonl(f"{data_base_loc}/full_doc_paraphrase_clean/alienus_text_2.jsonl")

Unnamed: 0,doc_id,orig_doc_id,corpus,author,texttype,text,generated_text,time_sec,tokens_per_sec,clean_text,text_cleaned,clean_stage,parsing_errors
0,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Unfortunately, besides our ongoing debate rega...",19.587837,88.422218,"Unfortunately, besides our ongoing debate rega...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
1,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Unfortunately, besides our ongoing conflict ov...",16.889559,97.811911,"Unfortunately, besides our ongoing conflict ov...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
2,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Sadly, besides your ongoing conflicts over the...",12.036587,123.456921,"Sadly, besides your ongoing conflicts over the...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
3,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","{""new_document"": ""Unfortunately, besides your ...",17.911494,94.185333,"Unfortunately, besides your ongoing arguments ...",1,fix_salvage_quotes,"[original: Expecting ',' delimiter: line 1 col..."
4,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Sadly, in addition to your constant edit-warri...",19.097911,90.428740,"Sadly, in addition to your constant edit-warri...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Unfortunately, besides your persistent dispute...",15.501693,104.504715,"Unfortunately, besides your persistent dispute...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
994,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Sadly, alongside your persistent disputes rega...",16.390235,100.730711,"Sadly, alongside your persistent disputes rega...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
995,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Unfortunately, besides our ongoing disputes re...",16.116672,101.882078,"Unfortunately, besides our ongoing disputes re...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
996,alienus_text_2,known [Alienus - Text-2].txt,Wiki,Alienus,known,"Sadly, in addition to your constant edit-warri...","Unfortunately, in addition to your ongoing dis...",17.805754,95.530918,"Unfortunately, in addition to your ongoing dis...",1,wrap_plain_text,[original: Expecting value: line 1 column 1 (c...
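
The `clean_stage` and `text_cleaned` columns make it easy to summarise how a file was cleaned. The following is a minimal sketch using only the standard library. It assumes each record is a dict (for example, one parsed line of the clean JSONL file) and that `text_cleaned` is a 1/0 flag where a falsy value marks a row that could not be cleaned. The helper name `summarise_stages` and the sample rows are hypothetical.

```python
from collections import Counter


def summarise_stages(records):
    """Count rows per clean_stage and collect positions of uncleaned rows."""
    stage_counts = Counter()
    uncleaned = []
    for position, row in enumerate(records):
        stage_counts[row.get("clean_stage")] += 1
        # Assumption: a falsy text_cleaned flag means the row was not cleaned.
        if not row.get("text_cleaned"):
            uncleaned.append(position)
    return stage_counts, uncleaned


# Illustrative rows only; real records carry many more columns.
rows = [
    {"clean_stage": "wrap_plain_text", "text_cleaned": 1},
    {"clean_stage": "fix_salvage_quotes", "text_cleaned": 1},
    {"clean_stage": "wrap_plain_text", "text_cleaned": 1},
    {"clean_stage": None, "text_cleaned": 0},
]

counts, uncleaned = summarise_stages(rows)
for stage, n in counts.most_common():
    print(f"{stage}: {n}")
print("rows needing inspection:", uncleaned)
```

The returned positions can then be used to look up the corresponding rows in the original file.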
