# Fine-tune the Zephyr-7B model using the Ludwig framework

See requirements.txt for dependencies. The main dependency is `ludwig[llm]`.

Based on these sources:
* https://ludwig.ai/latest/getting_started/llm_finetuning/
* https://ludwig.ai/latest/user_guide/llms/finetuning/
* https://colab.research.google.com/drive/1Ly01S--kUwkKQalE-75skalp-ftwl0fE?usp=sharing
* https://predibase.com/blog/fine-tuning-mistral-7b-on-a-single-gpu-with-ludwig
* https://levelup.gitconnected.com/no-more-hard-coding-use-declarative-configuration-to-build-and-fine-tune-custom-llms-on-your-data-6418b243fad7
* https://predibase.com/blog/fine-tuning-zephyr-7b-to-analyze-customer-support-call-logs
* https://colab.research.google.com/drive/1nX6hd4P_oc-ByaJLNXiryJyMdfEmmBJ0?usp=sharing#scrollTo=fd_LA_2Wx_qr


In [1]:
# Load and prepare fine-tuning dataset

import json
import glob
import pandas as pd

train_files = glob.glob("../../llm-dataset/*-train.jsonl")
test_files = glob.glob("../../llm-dataset/*-test.jsonl")

KEEP_FIELDS = {
    'dc.contributor.author',
    'dc.date.issued',
    'dc.identifier.isbn',
    'dc.language.iso',
    'dc.publisher',
    'dc.relation.eissn',
    'dc.title',
}
MAX_TEXT_LENGTH = 3072

def preprocess_sample(sample):
    # subset & JSON encode the ground truth
    subset = {key: val
              for key, val in sample["ground_truth"].items()
              if key in KEEP_FIELDS}
    sample["ground_truth"] = subset
    sample["ground_truth_json"] = json.dumps(subset)
    sample["text"] = sample["text"][:MAX_TEXT_LENGTH]
    del sample["metadata"]
    del sample["id"]
    del sample["url"]
    return sample

def dataset_to_df(files):
    records = []
    for filename in files:
        with open(filename) as infile:
            for line in infile:
                sample = json.loads(line)
                records.append(preprocess_sample(sample))
    return pd.DataFrame.from_records(records)

train_df = dataset_to_df(train_files)
test_df = dataset_to_df(test_files)
print(train_df.shape, test_df.shape)
print(train_df.keys())

model = None  # placeholder

(556, 3) (167, 3)
Index(['text', 'ground_truth', 'ground_truth_json'], dtype='object')
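
To make the preprocessing step concrete, here is a minimal sketch that applies the same field subsetting and text truncation to a single made-up record. The record contents, the shortened `KEEP_FIELDS`, and the tiny `MAX_TEXT_LENGTH` are assumptions chosen for illustration; the function is a simplified, non-mutating variant of `preprocess_sample` above, not a replacement for it.

```python
import json

KEEP_FIELDS = {"dc.title", "dc.date.issued", "dc.language.iso"}  # shortened for the example
MAX_TEXT_LENGTH = 20  # tiny limit so the truncation is visible

def preprocess_sample(sample):
    """Return a new record with filtered ground truth and truncated text."""
    subset = {key: val
              for key, val in sample["ground_truth"].items()
              if key in KEEP_FIELDS}
    return {
        "text": sample["text"][:MAX_TEXT_LENGTH],
        "ground_truth": subset,
        "ground_truth_json": json.dumps(subset),
    }

# Hypothetical input record (all values invented)
sample = {
    "id": "example-1",
    "url": "https://example.org/doc.pdf",
    "metadata": {},
    "text": "An example document body that is longer than the limit.",
    "ground_truth": {
        "dc.title": "An example document",
        "dc.date.issued": "2021",
        "dc.type": "report",  # not in KEEP_FIELDS, so it is dropped
    },
}

processed = preprocess_sample(sample)
print(sorted(processed))
print(processed["text"])
print(processed["ground_truth_json"])
```

Only the three columns printed by the cell above survive, which is why `id`, `url` and `metadata` are no longer available later in the notebook.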


In [2]:
# workaround for bitsandbytes bug https://github.com/TimDettmers/bitsandbytes/issues/675

import os

if "SLURM_SUBMIT_DIR" in os.environ:
    del os.environ["SLURM_SUBMIT_DIR"]

# Finetuning

The fine-tuning-specific code starts here. Skip to Inference if you already have a fine-tuned model and want to use it.

In [3]:
import yaml

config_str = """
model_type: llm
base_model: HuggingFaceH4/zephyr-7b-beta

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    Extract metadata from the following document. Return as JSON.

    ### Input:
    {text}

    ### Response:

input_features:
  - name: text
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: ground_truth_json
    type: text
    preprocessing:
      max_sequence_length: 1024

trainer:
  type: finetune
  learning_rate: 0.0002
  batch_size: 1
  gradient_accumulation_steps: 16
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03
    reduce_on_plateau: 0

preprocessing:
  sample_ratio: 1.0

generation:
  temperature: 0.1
  max_new_tokens: 1024
"""

config = yaml.safe_load(config_str)
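
The trainer settings interact, so a rough back-of-the-envelope calculation may help when tuning them. The sketch below uses the values from the config and the training set size printed earlier. It assumes that all 556 rows are used for training and that one step means one optimizer update; Ludwig may hold out a validation split or count steps differently, so treat the numbers as estimates.

```python
import math

# Values taken from the config above and the dataset shape printed earlier
train_rows = 556
batch_size = 1
gradient_accumulation_steps = 16
epochs = 3
warmup_fraction = 0.03

# Number of samples that contribute to one optimizer update
effective_batch_size = batch_size * gradient_accumulation_steps

# Assumption: every row is used for training and a step is one optimizer update
updates_per_epoch = math.ceil(train_rows / effective_batch_size)
total_updates = updates_per_epoch * epochs
warmup_updates = round(total_updates * warmup_fraction)

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)
```

With so few updates in total, the warmup phase covers only a handful of steps under these assumptions.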

In [None]:
import logging
from ludwig.api import LudwigModel

model = LudwigModel(config=config, logging_level=logging.INFO)
results = model.train(training_set=train_df, test_set=test_df)

In [5]:
# Save the model
model.save("finetuned-model")

# Inference

You can start running the notebook from here if you have already fine-tuned a model in a previous session!

**Note:** You still need to run the cells above that load the datasets and apply the bitsandbytes workaround.

In [6]:
# If no model is in memory (fine-tuning was skipped), load the previously fine-tuned model

import logging
from ludwig.api import LudwigModel

if model is None:
    model = LudwigModel.load("finetuned-model", logging_level=logging.INFO)

In [7]:
%%time
# Inference using the trained model

def batch(iterable, n=1):
    """Yield consecutive slices of at most n items."""
    total = len(iterable)
    for start in range(0, total, n):
        yield iterable[start:start + n]

# We need to do this in separate batches, otherwise the kernel gets killed (GPU OOM?)
test_pred_batches = []
for test_batch in batch(test_df, 32):
    test_pred_batches.append(model.predict(test_batch, batch_size=16)[0])

# merge the prediction batches into a single dataframe
test_preds = pd.concat(test_pred_batches)

Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Prediction: 100%|██████████| 2/2 [03:35<00:00, 107.72s/it]
Finished predicting in: 217.53s.

  return np.sum(np.log(sequence_probabilities))

Prediction: 100%|██████████| 2/2 [04:23<00:00, 131.90s/it]
Finished predicting in: 265.45s.

Prediction: 100%|██████████| 2/2 [04:40<00:00, 140.42s/it]
Finished predicting in: 282.45s.

Prediction: 100%|██████████| 2/2 [04:28<00:00, 134.46s/it]
Finished predicting in: 270.65s.

Prediction: 100%|██████████| 2/2 [04:12<00:00, 126.46s/it]
Finished predicting in: 254.58s.

Prediction: 100%|██████████| 1/1 [00:40<00:00, 40.11s/it]
Finished predicting in: 41.12s.
CPU times: user 21min 54s, sys: 20.1 s, total: 22min 14s
Wall time: 22min 11s
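
The log above shows six prediction runs, five reporting `2/2` steps and the last reporting `1/1`. That pattern follows from the chunking arithmetic, as the following sketch illustrates. It uses a plain list as a stand-in for the 167-row test DataFrame, so only the sizes are meaningful.

```python
def batch(iterable, n=1):
    """Yield consecutive slices of at most n items."""
    total = len(iterable)
    for start in range(0, total, n):
        yield iterable[start:start + n]

rows = list(range(167))  # stand-in for the 167 test rows
chunks = list(batch(rows, 32))
chunk_sizes = [len(chunk) for chunk in chunks]

# Inner prediction steps per chunk at batch_size=16 (ceiling division)
inner_steps = [-(-size // 16) for size in chunk_sizes]

print(chunk_sizes)
print(inner_steps)
```

The outer chunk size (32) limits how many rows are handed to `model.predict` at once, while `batch_size=16` controls how many rows go through the model per step inside each call.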


In [8]:
with open('test-records.jsonl', 'w') as outfile:
    for ground_truth, prediction in zip(test_df['ground_truth_json'], test_preds['ground_truth_json_response']):
        print(f"Ground Truth:\n{ground_truth}")
        print(f"Prediction:\n{prediction[0]}\n")
        ground_truth = json.loads(ground_truth)
        try:
            prediction = json.loads(prediction[0])
        except json.JSONDecodeError:
            prediction = {}
        # rowid is set to unknown as we've lost it somewhere along the way...
        record = {"ground_truth": ground_truth, "prediction": prediction, "rowid": "unknown"}
        json.dump(record, outfile)
        outfile.write("\n")

Ground Truth:
{"dc.title": "Poliisikoulutuksen vaikuttavuusarviointi 2021 : vuosina 2018-2019 valmistuneiden poliisien ty\u00f6llisyys ja arviot koulutuksen ty\u00f6el\u00e4m\u00e4vastaavuudesta", "dc.contributor.author": ["Vuorensyrj\u00e4, Matti"], "dc.date.issued": "2021", "dc.identifier.isbn": ["978-951-815-386-6"], "dc.language.iso": "fin", "dc.publisher": ["Poliisiammattikorkeakoulu"]}
Prediction:
{"dc.title": "Poliisikoulutuksen vaikuttavuusarviointi 2021 sisus POAMK", "dc.contributor.author": ["Vuorensyrj\u00e4, Matti"], "dc.date.issued": "2021", "dc.identifier.isbn": ["978-951-815-386-6"], "dc.language.iso": "fin", "dc.publisher": ["Poliisiammattikorkeakoulu"]}

Ground Truth:
{"dc.title": "FAQ : Taiteen digitaaliset toimintaymp\u00e4rist\u00f6t", "dc.date.issued": "2018", "dc.identifier.isbn": ["978-952-7266-07-6"], "dc.language.iso": "fin", "dc.publisher": ["Tampereen ammattikorkeakoulu"]}
Prediction:
{"dc.title": "Taiteen digitaaliset toimintaymp\u00e4rist\u00f6t 2018 : FAQ"
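
If a generated response is cut off before the closing brace, as the second prediction printed above appears to be, `json.loads` raises `json.JSONDecodeError` and the record is stored with an empty prediction. The sketch below isolates that fallback and counts how often it triggers. The helper name `parse_prediction` and the two example strings are made up for illustration, and treating non-object JSON as a failure is an added assumption that the cell above does not make.

```python
import json

def parse_prediction(text):
    """Return (parsed_dict, ok); fall back to an empty dict on invalid JSON."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {}, False
    # Assumption: anything other than a JSON object counts as a failed prediction
    if not isinstance(parsed, dict):
        return {}, False
    return parsed, True

complete = '{"dc.title": "FAQ", "dc.date.issued": "2018"}'
truncated = '{"dc.title": "FAQ"'  # missing closing brace

parsed_results = [parse_prediction(text) for text in (complete, truncated)]
failed = sum(1 for _, ok in parsed_results if not ok)
print(parsed_results)
print(f"{failed} of {len(parsed_results)} predictions could not be parsed")
```

Tracking the failure count separately makes it possible to tell unparseable output apart from predictions that parsed but contained wrong values.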

In [9]:
# Analyze the statistics of the extracted metadata and save to file
model_name = 'zephyr-7b'

import sys
sys.path.append('..')

from eval import MetadataEvaluator

evaluator = MetadataEvaluator('test-records.jsonl')
results = evaluator.evaluate_records()
# Use only the fields that Meteor extracts
fields = [
    "dc.contributor.author",
    "dc.date.issued",
    "dc.identifier.isbn",
    "dc.language.iso",
    "dc.publisher",
    "dc.relation.eissn",
    "dc.title",
]
statistics_filename = '../results-ludwig-fine-tune-' + model_name + '.md'
evaluator.save_md(results, statistics_filename, fields)
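
The implementation of `MetadataEvaluator` is not shown in this notebook. For a quick sanity check that does not depend on it, a per-field exact-match score can be computed directly from records in the shape written to `test-records.jsonl`. The sketch below is a hypothetical stand-in under that assumption, not the evaluator's actual scoring method, and the two records are invented.

```python
import json
from collections import Counter

def exact_match_by_field(records, fields):
    """Share of records whose prediction equals the ground truth, per field."""
    matches, totals = Counter(), Counter()
    for record in records:
        truth, pred = record["ground_truth"], record["prediction"]
        for field in fields:
            if field not in truth and field not in pred:
                continue  # nothing to extract and nothing predicted
            totals[field] += 1
            if truth.get(field) == pred.get(field):
                matches[field] += 1
    return {f: matches[f] / totals[f] for f in fields if totals[f]}

# Two hypothetical records in the same shape as the lines of test-records.jsonl
lines = [
    '{"ground_truth": {"dc.title": "A", "dc.date.issued": "2021"}, '
    '"prediction": {"dc.title": "A", "dc.date.issued": "2020"}, "rowid": "unknown"}',
    '{"ground_truth": {"dc.title": "B"}, "prediction": {}, "rowid": "unknown"}',
]
records = [json.loads(line) for line in lines]
scores = exact_match_by_field(records, ["dc.title", "dc.date.issued", "dc.publisher"])
print(scores)
```

Exact match is strict: a title that differs only in punctuation or subtitle counts as wrong, so these scores are best read as a lower bound on usefulness.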