Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

🌈 Introduction

We are excited to present "Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models," a pioneering study of trustworthiness in LLMs during pre-training. We explore five key dimensions of trustworthiness: reliability, privacy, toxicity, fairness, and robustness. By applying linear probing to LLMs' pre-training checkpoints and extracting steering vectors from them, we aim to uncover the potential of pre-training for enhancing LLMs' trustworthiness. Furthermore, we investigate the dynamics of trustworthiness during pre-training through mutual information estimation and observe a two-phase phenomenon: fitting and compression. Our findings offer new insights and encourage further work on improving the trustworthiness of LLMs from an early stage.

🚩Features

We want to ANSWER:

How do LLMs dynamically encode trustworthiness during pre-training?
How can the pre-training period be harnessed for more trustworthy LLMs?

We FIND that:

After the early pre-training period, middle layer representations of LLMs have already developed linearly separable patterns about trustworthiness.
Steering vectors extracted from pre-training checkpoints show promise for enhancing the SFT model's trustworthiness.
During the pre-training period of LLMs, there exist two distinct phases regarding trustworthiness: fitting and compression.

🚀Getting Started

🔧Installation

conda env create -f environment.yml

🌟Usage

Tip: Before running the scripts, replace the model storage path in src/generate_activations.py and src/eval_trustworthiness.py with your actual model storage path.

1. Run the Probing Experiments (Section 2: Probing LLM Pre-training Dynamics in Trustworthiness)

cd src/
sh scripts/probing.sh
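
For readers new to linear probing, the idea behind this step can be illustrated with a minimal sketch. It is not the code in scripts/probing.sh: the function names (`train_linear_probe`, `probe_accuracy`), the synthetic "activations", and all hyperparameters are assumptions made for illustration. It fits a logistic-regression probe with plain gradient descent and reports held-out accuracy, which is the kind of signal used to judge whether a layer's representations are linearly separable.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (hypothetical helper) by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        grad = probs - y                             # d(loss)/d(logits)
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())


# Synthetic stand-in for layer activations (assumption: real activations
# would come from a model checkpoint, not from a random generator).
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=400)
# The two classes are shifted apart along one direction, plus Gaussian noise.
X = rng.normal(size=(400, d)) + 3.0 * np.outer(2 * y - 1, direction)

w, b = train_linear_probe(X[:300], y[:300])
acc = probe_accuracy(X[300:], y[300:], w, b)
print(f"held-out probe accuracy: {acc:.3f}")
```

In an actual experiment, `X` would hold activations from one layer of one checkpoint and `y` the trustworthiness labels; repeating the fit per layer and per checkpoint gives a picture of when and where separability emerges.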

2. Run the Steering Vector Experiments (Section 3: Controlling Trustworthiness via the Steering Vectors from Pre-training Checkpoints)

cd src/
sh scripts/steering.sh
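
As a rough illustration of what a steering vector is, the sketch below uses one common construction: the difference between the mean activations of two contrasting sets of examples, added back to hidden states at inference time. This is an assumption for illustration and may differ from what scripts/steering.sh does; the function names, the scaling factor `alpha`, and the toy arrays are all hypothetical.

```python
import numpy as np


def extract_steering_vector(pos_acts, neg_acts):
    """Difference of class-mean activations (one common construction)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_steering(hidden, vector, alpha=1.0):
    """Shift every hidden state along the steering vector, scaled by alpha."""
    return hidden + alpha * vector


# Toy activations (assumption: in practice these would be collected from a
# pre-training checkpoint on desirable vs. undesirable examples).
pos_acts = np.array([[1.0, 2.0], [3.0, 4.0]])
neg_acts = np.array([[0.0, 0.0], [2.0, 2.0]])

vector = extract_steering_vector(pos_acts, neg_acts)
hidden = np.zeros((3, 2))  # stand-in for hidden states of another model
steered = apply_steering(hidden, vector, alpha=0.5)
print(vector)
print(steered)
```

The scaling factor controls the trade-off between how strongly the representation is pushed and how much the model's original behavior is preserved, so it is usually tuned empirically.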

📝License

Distributed under the Apache-2.0 License. See LICENSE for more information.

📖BibTeX

@article{qian2024towards,
  title={Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models},
  author={Qian, Chen and Zhang, Jie and Yao, Wei and Liu, Dongrui and Yin, Zhenfei and Qiao, Yu and Liu, Yong and Shao, Jing},
  journal={arXiv preprint arXiv:2402.19465},
  year={2024}
}
