HaotianHuang/detectingdeceit


Detecting deceit in backdoored LLMs.

To reproduce

Model organism of misalignment

  • Prepare the training data from Anthropic's code_vulnerability_fewshot_prompts.json.
  • Train the backdoored models on either the CoT_prompts or the distilled_CoT prompts. I trained Phi-2 with 4-bit QLoRA (lr=1e-3, 10 epochs) using LLaMA-Factory.
  • Evaluate the effectiveness of the backdoor with CodeQL; the scripts in codeql_eval can help with this.
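
As a rough illustration of the data-preparation step, the sketch below converts (prompt, completion) pairs into Alpaca-style records (`instruction` / `input` / `output`), one of the dataset layouts LLaMA-Factory can read. It is a minimal sketch, not the repository's actual script: the helper names `to_alpaca_records` and `save_dataset`, the placeholder pairs, and the key names in `TRAINING_SETTINGS` are all assumptions made for illustration. How the pairs are derived from code_vulnerability_fewshot_prompts.json is deliberately left out, since the file's layout is not described here.

```python
import json

# Hyperparameters stated in the README. The key names are illustrative only;
# map them onto the argument names your LLaMA-Factory version expects.
TRAINING_SETTINGS = {
    "base_model": "Phi-2",
    "method": "QLoRA",
    "quantization_bits": 4,
    "learning_rate": 1e-3,
    "epochs": 10,
}


def to_alpaca_records(pairs):
    """Convert (prompt, completion) pairs into Alpaca-style dicts.

    `pairs` is a hypothetical intermediate format: any iterable of
    two-element tuples of strings.
    """
    records = []
    for prompt, completion in pairs:
        records.append(
            {
                "instruction": prompt.strip(),
                "input": "",  # no separate input field in this sketch
                "output": completion.strip(),
            }
        )
    return records


def save_dataset(records, path):
    """Write the records to `path` as a single JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder pair; real pairs would come from the few-shot prompts file.
    example_pairs = [("Example prompt text", "Example completion text")]
    print(json.dumps(to_alpaca_records(example_pairs), indent=2))
    print(json.dumps(TRAINING_SETTINGS, indent=2))
```

The resulting JSON file still has to be registered with LLaMA-Factory as a dataset before training; check the LLaMA-Factory documentation for the exact registration steps and argument names in your version.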

Detecting deceit with LAT scans

About

Detecting misalignment in LLMs with LAT scans
