Skip to content

Latest commit

 

History

History

gpt-3.5

LLMs-Finetuning-Safety (GPT-3.5 Turbo)

This directory contains necessary code to fine-tune GPT-3.5 Turbo and evaluate their safety alignment. The official APIs provided by OpenAI is used, and the only controllable hyper-parameter is the number of epochs.

Quick Start

  • Follow the notebooks we provided:

    The notebooks provide examples for reimplementing our fine-tuning experiments at different risk levels, and also provide example codes for using our GPT-4 Judge to evaluate harmfulness of fine-tuned models on a few demo examples.

  • Besides, we also provide adv_bench_evaluation.ipynb for evaluating the safety of fine-tuned models on the public available AdvBench. After replacing the api key with your owns and the model id with the fine-tuned model id, the script can be used to run inference on AdvBench and then evaluate the safety by using the key-word matching based method implemented by AdvBench.

  • To evaluate the general capabilities of fine-tuned models on normal benign tasks, we provide an example in mt_bench_evaluation.ipynb to evaluate the performance of fine-tuned models on MT-Bench.