# Creating a Fine-Tuning Dataset for Writing Style

This notebook demonstrates a scalable, automated approach for creating a high-quality fine-tuning dataset for writing style, following the methodology described in `README.md`. The process extracts human-written paragraphs, generates LLM-rewritten versions, and formats the data for OpenAI fine-tuning. Crucially, this approach requires no human work to create the training dataset, so any individual or organisation possessing a corpus of human-written documents can quickly obtain a high-quality dataset for fine-tuning in their own style.

## Imports

In [None]:
import pandas as pd
import json
from pathlib import Path
import os
from PyPDF2 import PdfReader
import openai
import csv
import docx

## 1. PDF and DOCX to Text Conversion

The first step in the pipeline is to convert all PDF and DOCX documents in the `Corpus/` directory to plain text files in `Corpus-txt/`. This ensures that the text is accessible for further processing.

The following code uses PyPDF2 to extract text from each page of every PDF, and python-docx to extract text from DOCX files, saving the result as a `.txt` file.

>**Note: You must have the folder `Corpus/` in the same directory as this notebook. The folder should contain the PDF and/or DOCX files of human writing that you want to use for training.**

In [None]:
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text + '\n'
    return text

def extract_text_from_docx(docx_path):
    doc = docx.Document(docx_path)
    text = '\n'.join([para.text for para in doc.paragraphs if para.text.strip()])
    return text

corpus_dir = 'Corpus'
corpus_txt_dir = 'Corpus-txt'
os.makedirs(corpus_txt_dir, exist_ok=True)
for filename in os.listdir(corpus_dir):
    file_path = os.path.join(corpus_dir, filename)
    if filename.lower().endswith('.pdf'):
        text = extract_text_from_pdf(file_path)
        txt_filename = os.path.splitext(filename)[0] + '.txt'
    elif filename.lower().endswith('.docx'):
        text = extract_text_from_docx(file_path)
        txt_filename = os.path.splitext(filename)[0] + '.txt'
    else:
        continue
    txt_path = os.path.join(corpus_txt_dir, txt_filename)
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(text)
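
Scanned PDFs without a text layer can yield little or no text from PyPDF2, so a quick sanity check on the converted files may be worthwhile before spending API calls on them. The helper below is a minimal sketch: the function name `find_sparse_text_files` and the `min_chars=200` threshold are assumptions for illustration, not part of the original pipeline.

```python
from pathlib import Path

def find_sparse_text_files(txt_dir, min_chars=200):
    """Return names of .txt files whose stripped content is shorter than min_chars."""
    sparse = []
    for path in sorted(Path(txt_dir).glob('*.txt')):
        content = path.read_text(encoding='utf-8').strip()
        # Very short output usually means the source had no extractable text
        if len(content) < min_chars:
            sparse.append(path.name)
    return sparse
```

Calling `find_sparse_text_files('Corpus-txt')` after the conversion cell lists the files that probably need OCR or manual review.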

## 2. Extracting High-Quality Paragraphs with LLM

After converting documents to text, the next step is to extract only the high-quality, full paragraphs from the text files. This judgement is delegated to an LLM, with a prompt that instructs the model to return only meaningful, well-formed paragraphs.

GPT-4.1-mini is used because its higher default rate limits allow large input documents to be processed more quickly than with the full GPT-4.1 model, and it appears to perform well enough for this task.

The following code reads each `.txt` file from the `Corpus-txt/` directory, sends its content to the LLM, and writes the delimited, full paragraphs to the `Corpus-Processed/` directory.

In [None]:
# The client reads OPENAI_API_KEY from the environment by default
client = openai.OpenAI()

EXTRACTION_SYSTEM_PROMPT = (
    'Your task is to extract all full paragraphs from the input you are provided. ' 
    'You should return each full paragraph delimited by NEWPARAGRAPH. '
    'If text is not a full paragraph, do not include it in the output. ' 
    'You should only output meaningful paragraphs that are good examples of writing in full prose.'
)

def find_paragraphs(text):
    response = client.chat.completions.create(
        model='gpt-4.1-mini',
        messages=[
            {'role': 'system', 'content': EXTRACTION_SYSTEM_PROMPT},
            {'role': 'user', 'content': text}
        ],
        temperature=1,
        top_p=1,
    )
    output = response.choices[0].message.content
    return output

processed_dir = 'Corpus-Processed'
os.makedirs(processed_dir, exist_ok=True)
for filename in os.listdir(corpus_txt_dir):
    if filename.endswith('.txt'):
        file_path = os.path.join(corpus_txt_dir, filename)
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
        paragraphs = find_paragraphs(text)
        processed_path = os.path.join(processed_dir, filename)
        with open(processed_path, 'w', encoding='utf-8') as f:
            f.write(paragraphs)
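
The loop above sends each file to the model in a single request, which may not work for very long documents. One possible workaround, sketched here under assumptions, is to split the text on line boundaries into chunks below a character budget and process each chunk separately. The name `chunk_text` and the `max_chars=40000` default are hypothetical; a suitable budget depends on the model's context window and output limit.

```python
def chunk_text(text, max_chars=40000):
    """Split text into chunks of at most max_chars, breaking on line boundaries."""
    chunks, current, current_len = [], [], 0
    for line in text.splitlines(keepends=True):
        # A single line longer than the budget is hard-split into pieces
        while len(line) > max_chars:
            if current:
                chunks.append(''.join(current))
                current, current_len = [], 0
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the budget
        if current and current_len + len(line) > max_chars:
            chunks.append(''.join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append(''.join(current))
    return chunks
```

With this helper, the extraction call could become `'NEWPARAGRAPH'.join(find_paragraphs(c) for c in chunk_text(text))`. A paragraph that straddles a chunk boundary is cut in two, and the extraction prompt should then discard both fragments as incomplete.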

## 3. Save Extracted Paragraphs to CSV

The extracted paragraphs are split using the `NEWPARAGRAPH` delimiter and saved to a CSV file. This file will be used as input for the next step in the pipeline.

In [None]:
extracted_csv = 'extracted_paragraphs.csv'
all_paragraphs = []

for filename in os.listdir(processed_dir):
    if filename.endswith('.txt'):
        processed_path = os.path.join(processed_dir, filename)
        with open(processed_path, 'r', encoding='utf-8') as infile:
            content = infile.read()
        paragraphs = [p.strip() for p in content.split('NEWPARAGRAPH') if p.strip()]
        all_paragraphs.extend(paragraphs)

with open(extracted_csv, 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['human_version'])
    for para in all_paragraphs:
        writer.writerow([para])

print(f'Extracted {len(all_paragraphs)} paragraphs from all files to {extracted_csv}')
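
The model's output is not guaranteed to be clean, so it can help to drop very short fragments and exact duplicates while splitting. The following is a small sketch of that idea; `split_paragraphs` and the `min_words=20` threshold are assumptions introduced for illustration.

```python
def split_paragraphs(content, delimiter='NEWPARAGRAPH', min_words=20):
    """Split on the delimiter, dropping short fragments and exact duplicates."""
    seen = set()
    result = []
    for part in content.split(delimiter):
        para = part.strip()
        # Skip empty pieces, fragments below the word threshold, and repeats
        if len(para.split()) < min_words or para in seen:
            continue
        seen.add(para)
        result.append(para)
    return result
```

This function could stand in for the list comprehension in the cell above, for example `paragraphs = split_paragraphs(content)`.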

## 4. Rewrite Paragraphs with LLM

After extracting high-quality paragraphs, each paragraph is rewritten using GPT-4.1 to create the 'LLM version'.

The rewritten versions will serve as the 'input' for fine-tuning, with the original human-written paragraphs as the 'target' output.

In [None]:
rewritten_csv = 'rewritten_paragraphs.csv'

REWRITE_SYSTEM_PROMPT = 'You are a report writing editor. Respond with the re-written text only, without any additional commentary or explanation.'
REWRITE_USER_TEMPLATE = (
    'Re-write the following text to make it as good as possible. You should maintain all the meaning, '
    'however you should change order, structure, word choice etc. to improve it. Input text: {}'
)

def rewrite_paragraph(paragraph):
    response = client.chat.completions.create(
        model='gpt-4.1',
        messages=[
            {'role': 'system', 'content': REWRITE_SYSTEM_PROMPT},
            {'role': 'user', 'content': REWRITE_USER_TEMPLATE.format(paragraph)}
        ],
        temperature=1,
        top_p=1,
    )
    return response.choices[0].message.content.strip()

# Read human_version paragraphs and rewrite each one
with open(extracted_csv, 'r', encoding='utf-8') as infile, open(rewritten_csv, 'w', encoding='utf-8', newline='') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames) + ['llm_version']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    
    for row in reader:
        human_version = row['human_version']
        llm_version = rewrite_paragraph(human_version)
        row['llm_version'] = llm_version
        writer.writerow(row)

print(f'Rewritten paragraphs saved to {rewritten_csv}')
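
Because this step makes one API call per paragraph, a transient failure partway through would abort the whole run. A retry wrapper with exponential backoff is one common mitigation. The sketch below is an assumption rather than part of the original notebook: `with_retries` and its parameters are hypothetical, and it retries on any `Exception` by default, which in practice you would probably narrow to the SDK's transient error types such as `openai.RateLimitError`.

```python
import time

def with_retries(fn, *args, max_attempts=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            # Re-raise once the final attempt has failed
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

In the loop above, the rewrite call could then be written as `llm_version = with_retries(rewrite_paragraph, human_version)`.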

## 5. Load and Verify Dataset

Load the human-written paragraphs and their LLM-rewritten versions to verify the dataset before creating the fine-tuning format.

In [None]:
df_pairs = pd.read_csv(rewritten_csv)
df_pairs = df_pairs.dropna(subset=['human_version', 'llm_version'])
print(f'Dataset contains {len(df_pairs)} paragraph pairs')
df_pairs.head()
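
Beyond dropping missing values, a few cheap checks can catch pairs that are unlikely to teach the model anything, such as rewrites identical to the original or rewrites whose length differs wildly from it. The helper below is a sketch under assumptions: `filter_pairs` and the `0.5` to `2.0` length-ratio bounds are illustrative choices, not values from the original methodology.

```python
def filter_pairs(pairs, min_ratio=0.5, max_ratio=2.0):
    """Keep (human, llm) pairs that differ and have comparable lengths."""
    kept = []
    for human, llm in pairs:
        human, llm = human.strip(), llm.strip()
        # Drop empty texts and rewrites that changed nothing
        if not human or not llm or human == llm:
            continue
        # Drop rewrites that are much shorter or longer than the original
        ratio = len(llm) / len(human)
        if min_ratio <= ratio <= max_ratio:
            kept.append((human, llm))
    return kept
```

It can be applied to the DataFrame columns with `filter_pairs(zip(df_pairs['human_version'], df_pairs['llm_version']))`.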

## 6. Format as OpenAI Fine-Tuning JSONL

For each pair, create a JSONL entry with the LLM version as the user prompt and the human version as the assistant response, using the same system prompt and user template as the rewriting step.

In [None]:
system_prompt = 'You are a report writing editor. Respond with the re-written text only, without any additional commentary or explanation.'
user_template = (
    'Re-write the following text to make it as good as possible. You should maintain all the meaning, '
    'however you should change order, structure, word choice etc. to improve it. Input text: {}'
)

jsonl_path = Path('training_data.jsonl')
with open(jsonl_path, 'w', encoding='utf-8') as f:
    for _, row in df_pairs.iterrows():
        entry = {
            'messages': [
                {'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': user_template.format(row['llm_version'])},
                {'role': 'assistant', 'content': row['human_version']}
            ]
        }
        f.write(json.dumps(entry, ensure_ascii=False) + '\n')
print(f'Saved fine-tuning data to {jsonl_path}')
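
Before uploading, it may be worth reading the file back and confirming that every line parses as JSON and carries the three-message structure written above. The checker below is a minimal sketch; `validate_jsonl` and `EXPECTED_ROLES` are names introduced here for illustration, and the check covers only the structure this notebook produces, not every rule the fine-tuning service may enforce.

```python
import json

EXPECTED_ROLES = ['system', 'user', 'assistant']

def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty if none found."""
    problems = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_number, line in enumerate(f, start=1):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f'invalid JSON: {exc.msg}'))
                continue
            messages = entry.get('messages') if isinstance(entry, dict) else None
            if not isinstance(messages, list):
                problems.append((line_number, "missing 'messages' list"))
                continue
            # Each entry should be exactly system, user, assistant in order
            roles = [m.get('role') for m in messages if isinstance(m, dict)]
            if len(messages) != 3 or roles != EXPECTED_ROLES:
                problems.append((line_number, f'unexpected roles: {roles}'))
                continue
            # Every message needs non-empty string content
            if any(not isinstance(m.get('content'), str) or not m['content'].strip()
                   for m in messages):
                problems.append((line_number, 'empty or non-string content'))
    return problems
```

An empty result from `validate_jsonl('training_data.jsonl')` indicates that no structural problems were found.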

## 7. Fine-tuning

- Review the generated JSONL file for quality and completeness
- Use the file as input for OpenAI fine-tuning via the API or web interface
- Optionally, expand the dataset with more paragraphs for improved results
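
For the API route, the steps above can be sketched as a file upload followed by job creation. This is a hedged outline, not a definitive implementation: `start_fine_tune` is a hypothetical helper, it assumes the `client` object from the OpenAI Python SDK (v1 or later) created earlier in the notebook, and `base_model` is left to the caller because the set of fine-tunable models depends on your account and changes over time.

```python
def start_fine_tune(client, jsonl_path, base_model):
    """Upload the training file and create a fine-tuning job; return the job id."""
    # Upload the JSONL file with the purpose expected for fine-tuning data
    with open(jsonl_path, 'rb') as f:
        training_file = client.files.create(file=f, purpose='fine-tune')
    # Create the job against the uploaded file and the chosen base model
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model=base_model,
    )
    return job.id
```

The returned job id can be used to monitor progress in the web interface or through the API.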