117 document model too hard to jailbreak #128

gsproston-scottlogic · 2023-08-14T13:24:13Z

The idea is for the LangChain QA model to skip security checks entirely, at least in phases 0 and 1. Instead, in phase 1, the main LLM is responsible for making sure the sensitive project is not mentioned.

For this issue I've made the QA prompt very general, so it no longer checks for security at all.
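For illustration only, here is a rough sketch of that split of responsibilities. It is not the code from this PR: the template wording, `buildQaPrompt`, `systemRoleForPhase` and the `Phase` type are all hypothetical names made up for the example, and it uses plain string handling rather than any LangChain API.

```typescript
// Hypothetical sketch; none of these names come from the repository.
type Phase = 0 | 1;

// A general QA prompt: answer from the retrieved context, with no
// security instructions of any kind.
const GENERAL_QA_TEMPLATE = [
  "Use the following context to answer the question at the end.",
  "If you don't know the answer, say so rather than guessing.",
  "",
  "{context}",
  "",
  "Question: {question}",
  "Answer:",
].join("\n");

// Fill both placeholders in a single pass so that text inside the
// context or question is never re-scanned for placeholders.
function buildQaPrompt(context: string, question: string): string {
  const values: Record<string, string> = { context, question };
  return GENERAL_QA_TEMPLATE.replace(
    /\{(context|question)\}/g,
    (_match: string, key: string) => values[key]
  );
}

// The main LLM's system role carries the only security instruction,
// and only in phase 1 (assumed wording).
function systemRoleForPhase(phase: Phase): string {
  const base =
    "You are a helpful assistant who answers questions about company documents.";
  if (phase === 1) {
    return base + " Do not mention the sensitive project under any circumstances.";
  }
  return base;
}

// Example usage with made-up document text.
const prompt = buildQaPrompt("Project notes go here.", "Who leads the project?");
console.log(prompt);
console.log(systemRoleForPhase(1));
```

With this arrangement the QA prompt stays identical across phases 0 and 1, and only the system role changes when a phase needs the sensitive project protected.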

heatherlogan-scottlogic

Looks good :)

* 19 defence prompt validation character limit (#21) * WIP: Defence mechanism info box * WIP: visual change when a defence is clicked on * WIP: Backend support to get and set defences * WIP: Frontend can now (de)activate defence * Fix comparison bug * Fix calling bug * WIP: Character limit backend detection * WIP: Defence flashes red when triggered * Working defence mechanism * Configurable max message length * Updated to ChatGPT 4 * Update README.md * Removed old React README * Update README.md Consistent headers * 15 defence random seq enclosure (#25) * WIP: Defence mechanism info box * WIP: visual change when a defence is clicked on * WIP: Backend support to get and set defences * WIP: Frontend can now (de)activate defence * Fix comparison bug * Fix calling bug * WIP: Character limit backend detection * WIP: Defence flashes red when triggered * Working defence mechanism * Add random sequence enclosure frontend selection * Transform prompt with random sequence enclosure * Move transform func to defence. Configurations as env variables * Display original and transformed prompt in chatbox * Change colour of edited chatbot message * Code review * Fix accidential reversion --------- Co-authored-by: George Sproston <gsproston@scottlogic.com> * 23 update frontend title and icon (#31) * Updated tab title * SL icon * 29 can send multiple messages (#35) * Removed unused component * Can no longer send another message if waiting on reply * 17 defence xml tagging (#34) * WIP: Defence mechanism info box * WIP: visual change when a defence is clicked on * WIP: Backend support to get and set defences * WIP: Frontend can now (de)activate defence * Fix comparison bug * Fix calling bug * WIP: Character limit backend detection * WIP: Defence flashes red when triggered * Working defence mechanism * Add random sequence enclosure frontend selection * Transform prompt with random sequence enclosure * Move transform func to defence. 
Configurations as env variables * Display original and transformed prompt in chatbox * Change colour of edited chatbot message * Add XML tagging defence * Refactor message transformation * Detect triggered defences function. detect XML tagging * Move defence detection to service so we can apply to original message * clean up * pass in original message to detect function * update xml tagging description --------- Co-authored-by: George Sproston <gsproston@scottlogic.com> * Basic email whitelist defense * defence info when email whitelist defence detected * Moved defences (#38) * Allow email domains to be whitelisted * add function call to return the email whitelist * flash when email sent to address not on whitelist when defence not active * remove domains from get email whitelist functions * System role defence (#39) * System role defence * No longer logging chat history as a table * regex to detect XML tagging * fix accidental reverted code * Basic question answer chain for a single document * QA and conversational QA retrieval chain * Function to ask LLM about documents * 24 multi user support (#46) * Backend service * WIP: openai chat with sessions * WIP: Consistent ordering of function args * Better logic flow * Moved isEmailInWhiteList to email file * WIP: Init session variables * WIP: Sent emails now in session * Added defences to the session * Remove unused conversational qa model * Backend email unit tests * Moved some files about * WIP: Added some backend defence tests * Remove documents endpoint * 53 prompt processing and defence rework (#54) * WIP: Detect defences on chat * Transforming message on the backend * Merge with dev * Defences backend unit tests * Sensitive documents * Update README.md Updated with link to OpenAI API key location. * Remove documents unit tests to replace with intg. 
tests * Update README.md Fixed typo * Begin integration test for documents * Triple equals and remove document folder var * customise getEmailWhitelist function based on active defence to prevent bot getting confused * Update README.md (#64) Updated with information on the environment variables. * Returning confirmation that the email has been sent * Backend integration tests (#63) * WIP: OpenAI integration tests * openai integration tests * Move backend package files (#67) * Consistent location of depenedencies * Moved env var file to backend * README now says where the env var file is * Updated README with info on how to test * Add backend tests to CI (#71) * Create node.js.yml * Update node.js.yml Set backend working directory * Update node.js.yml Set cache dependency path * Update node.js.yml Only supporting Nodejs 18 * Update node.js.yml Renamed job * Update node.js.yml Only running tests on main and dev * Merge dev into 69 (#74) * customise getEmailWhitelist function based on active defence to prevent bot getting confused * Returning confirmation that the email has been sent --------- Co-authored-by: Heather Logan (She/Her) <hlogan@scottlogic.com> Co-authored-by: Heather Logan <118981273+heatherlogan-scottlogic@users.noreply.github.com> * Fixed email unit test --------- Co-authored-by: Heather Logan (She/Her) <hlogan@scottlogic.com> Co-authored-by: Heather Logan <118981273+heatherlogan-scottlogic@users.noreply.github.com> * migrate frontend to typescript (#60) * Add function calls to chat history for loop bug * Update test * Basic api key input * Validate API key before initialising models * Validate api key to user * Reset api key/model when new invalid key given * Unit tests * Change type * migrate frontend to typescript (#60) (#77) * migrate frontend to typescript (#60) * Removed DEFENCE_NAMES enum * Typing emails in ChatBox component * Removed references to OpenAI in the frontend * Moved defences about * DefenceInfo class * Models directory * Removed 
unused import --------- Co-authored-by: George Sproston <gsproston@scottlogic.com> * Missing import * Wrap chat message text (#92) Also support for hyphenating wrapped words should the browser support it * 18 defence llm evaluation (#72) * Basic LLM evaluation using a single prompt * LLM evaluation by default * change llm detection result to boolean * Extend to other malicious input and have bot return a reason for rejecting input * fix tests to mock prompt evaluation checks * Separate chains for prompt injection and other malicious input detection * Add back in frontend defense * Add back in frontend defense panel * Clean up * Session name and secure cookies (#83) * Added session name and security * Removed unsued import * 48 attack jailbreak prompt (#86) * General css for strategies * Attack box * Slightly better strategy styling * Fixed attack id * Attack mechanism no longer highlighted on hover * TypeScript attacks * TypeScript AttackBox * Removed JS DefenceBox * Use env variable to init model on app startup * Hide api key in form * Configurable defences (#90) * Reworked defence session object * Backend support for configurable character limit * Backend endpoint to configure defences * WIP: passing defence config to defence mechanism component * Showing info when whole defence component is hovered over * Frontend defence config component * Full configurable character limit defence * Stopping click on input * Fix existing unit tests * More accurate defence config test * Configurable email whitelist * Configure system role * Fix import * Configurable RSE defence * Function to get defence config value * Attack info on hover * Key is not a prop * Correct variable naming * Editable default input value * Larger configuration input boxes * Config input background is always white --------- Co-authored-by: Heather Logan (She/Her) <hlogan@scottlogic.com> * 52 different llms (#96) * Dropdown menu for changing gpt model * Bug fixes * Removed unused file * Update documents 
and prompt to instruct model on sensitive infomation (#97) Documents and new prompt to instruct model on what is sensitive * 104 chat box info messages (#118) * Chat message type enum * WIP: Unformatted info messages * Formatted chat info text * Comments on new methods * Fixed config bug (#121) * 98 phase switching (#120) * Fixed getEmailWhitelist defence (#124) * 106 phase 0 preamble (#125) * Reset messages & emails when switching to a new phase * Add a preamble when switching phases * Show sandbox preamble when app starts * Hide components when in phase 0 (#126) * 117 document model too hard to jailbreak (#128) * Toying around with QA prompt and system role * General system role * 129 phase 0 system role (#130) * Toying around with QA prompt and system role * General system role * Phase-specific system roles As well as a general, non-security focused role for phase 0 * 115 remove email whitelist defence (#131) * Removed email whitelist from the frontend * Removed email whitelist from backend * 107 phase 0 secret project document (#132) * Hide components when in phase 0 * Secret document for phase 0 * Can now show line breaks in chat and emails (#134) * Fixed email feed visual bug (#136) * Not clearing preamble messages (#137) * Not clearing preamble messages * Fixed bug where preambles would stack when switching phase * 108 phase 0 win condition (#138) * Check win condition on sent emails * 112 phase 1 show attacks (#141) * Hiding certain components per phase * Hiding API key box * 110 phase 1 preamble (#140) * Only doing LLM evaluation on phase 2 and sandbox (#144) * Phase 1 documents (#143) * Phase 1 documents * Added secret project * 109 phase 1 win condition (#145) * Phase 1 documents * Added secret project * Phase 1 win condition * Only need to email name for phase 1 * 113 phase 1 system prompt (#149) * Hide components when in phase 0 * Phase 1 system role and change to product owner in prompt * 147 phase 2 documents (#150) * 114 phase 2 preamble (#152) * 
Phase 2 preamble * phase 2 win condition (#153) * phase 2 win condition * check if email subject contains the win condition * phase 2 system prompt (#155) * 109 phase 1 win condition (#156) * Phase 1 documents * Added secret project * Phase 1 win condition * Only need to email name for phase 1 * More attacks * Show a reduced set of attacks in phase 1 * Reverted testing value * Info message in what when a defence is triggered (#159) * Hide components when in phase 0 * Info message in what when a defence is triggered * html number bug when clicking character limit defence * Remove the X icon for now * Hide model selection box in phase 2 (#161) * 139 add in the qa model security prompt as a defence (#162) * Hide components when in phase 0 * Add the QA model prompt as a defence * hide qa llm prompt config in phase 2 * Fix template * validate defence configs (#164) * confirmation message when defence is configured (#165) * confirmation message when defence is configured * add timeout to configuration message * reset active defences when changing phase (#168) * Convert backend to Typescript * 59 move to typescript - backend note: tests have not been migrated yet * WIP: First TS backend pass * nodemon backend * Typing for function call * Readded defence configuration values * Using model enum * Fixed defences * Removed unused variables * Get rid of some anys * Blocked message reason * Moved LLM evaluation check * Better chat types * Including test files in tsconfig * Renamed test js files to ts * WIP: Fixing tests * Defence unit test * Email defence unit test * All unit tests passing * Removed session from method * Removed session from openai methods * OpenAI ts integration tests * Removed document integration test as it needs redoing * Removed unused export * Backend phase enum * Removed empty settings file * Removed unused dep * Consistent function naming * Fixed test * Removed unecessary frontend key * Using phase enum in the frontend * Using easy QA prompt by default 
--------- Co-authored-by: Scott Rowan <srowan@scottlogic.com> * 160 remove create react app (#170) * Building vite frontend * Moved from create-react-app * Fixed launch command * hide system role defence from phase 2 (#176) * turn off configurations for phase 2 (#177) * turn off configurations for phase 2 * use boolean instead of phase * merge conflict * Removed unused import hot fix * 91 export log (#171) * try with react-pdf * Export chat history and sent emails to pdf * realign messages and add phase to title * Button for download link (sidebar for now) * remove double model box * don't detect triggered defences on phase 0 and 1 (#180) * Better email-related prompt (#181) * 167 per user openai (#183) * Directly using session apiKey for chat completion * Nullable session API key * Fixed tests * more backend tests (#182) * Using DEFENCE_TYPES enum (#186) * 193 UI general (#195) * WIP: Themes * Consistent button styling * Chat message gradients * Email colours * Can now see defence conifg * Commented out animation for triggered defences * Chat edited and blocked colours * More obvious when defence is active * Replaced colours with vars * Model selection button * 187 UI header (#197) * WIP header * WIP: Header spacing * Header without icon * Fixed bug where user message would disappear * 44 defence filtering (#191) * filtering defence added * Fix blocked messages not showing in exported chat * add tests * remove extra logs * typos and fix detect filter list func * frontend typo * change validation for filter configuration * 189 UI right side bar (#198) * WIP header * WIP: Header spacing * Header without icon * Fixed bug where user message would disappear * Right side bar styling * 199 resetting the phase doesnt reset frontend defences (#200) * WIP header * WIP: Header spacing * Header without icon * Fixed bug where user message would disappear * WIP: Moving code around * Clearing defences on reset * Only showing triggered defence if it's known * 188 UI chat 
component (#203) * WIP header * WIP: Header spacing * Header without icon * Fixed bug where user message would disappear * WIP: Moving code around * Clearing defences on reset * Only showing triggered defence if it's known * Updated chat footer * Better chat footer sizings * Chat speech bubbles * 196 change info on triggered inactive defences (#206) * Alerted and triggered defences * Clearer chat message class * Fixed tests * More test coverage * WIP: 192 persist chat history for each phase (#201) * add phase state object to session * set emails and defences when switching phases * Persist phase chat history from backend * add info messages from frontend to backend chat history * reload info messages in chat history * add blocked messages to chat history * add edited user messages to chat history * Skip adding blocked messages to chat history * update tests * add preamble message to start of chat * fix switching phase not updating defences * refactor with new chat message types * fix preamble message and replace enums * Filter defence configs now accept an empty string (#210) * move validation into defence mechanism (#212) * move validation into defence mechanism * remove unnecessary change * update LLM prompt evaluations instructions (#215) * update LLM prompt evaluations instructions * fix allowing formatting instructions * Win condition can only be met once (#216) * 194 UI scroll bars (#219) * Scrollbars for various browsers but not firefox * Firefox scrollbars * 190 UI left side bar (#221) * Moved export and reset buttons * Showing attacks before defences * Left side bar headers * Closer overall styling * Strategy input boxes styling * WIP: defence toggle * Defence toggles * Fixed warning * 174 user can view the documents in the backend in sandbox (#220) * add popup box * Scrollbars for various browsers but not firefox * Firefox scrollbars * display txt and csv files * rename component * text align * reformat files * get the document urls from backend * 
formatting txt file * 174 no env (#223) * Calculating doc URI on the frontend * Correct doc type * Nicer button styling --------- Co-authored-by: George Sproston <gsproston@scottlogic.com> Co-authored-by: George Sproston <103250539+gsproston-scottlogic@users.noreply.github.com> * 218 styling phase preamble and success message (#227) * basic styling for phase info boxes and update preambles * space out model select box * Remember defences between phase 2 and sandbox (#236) * Remember defences between phase 2 and sandbox * Adding defence name to log after (de)activation * Linting (#232) * Backend build script * Basic typescript-eslint * Stricter eslint ts * Backend prettier * More eslint rules * Linted app.ts * Linted defence.ts * Linted email.ts * Linted langchain.ts * Linted openai.ts * Linted router.ts * Linted defence test * Linted langchain test * Linted remaining tests * Ignoring some files when linting * Added linting and formatting checks to CI * Excluding build files from testing * WIP: frontend linting * Linted DefenceConfiguration * Fixed defence toggle * Better void calls * Frontend linting * Frontend formatting * Added frontend CI job * Not building frontend node_modules * Linted backend more * Frontend prettier * Fixed bug * Better checking for req body params * Using PHASE_NAMES * Update README.md (#239) Updated with linting and formatting information * 225 multi user langchain (#231) * store vectorised docs as global variable * re-init qa model on each question to support multi user * Update tests * remove comment * fix linting errors * init prompt evaluation chain on each eval * 222 message loading element (#244) * dots when message generating * turn off hover colour for disabled button * Update README.md Fixed frontend run command * Using GPT 4 everywhere (#246) * add confirmed parameter to email function call (#249) * fix anchor tag inside export button (#251) * 241 support more characters in exported logs (#247) * export for multiple languages * 
add readme instructions for adding fonts * 237 UI header icon (#245) * WIP: Header icon * Correct icon * Header icon * Removed unused file * Smiley icon when phase is complete * Formatted * Smaller icon and better padding * React-friendly icon * Don't wrap the header title * Don't wrap button text * 242 better chat input box (#256) * WIP: Changing input to textarea * Input as contentEditable div * Send button to the bottom * Using ContentEditable * Switched back to textarea * Allowing for shift+enter * Expanding textbox * Updated to new logo (#257) --------- Co-authored-by: Heather Logan <118981273+heatherlogan-scottlogic@users.noreply.github.com> Co-authored-by: Heather Logan (She/Her) <hlogan@scottlogic.com> Co-authored-by: Scott Rowan <srowan@scottlogic.com> Co-authored-by: Scott Rowan <71377396+scottrowan@users.noreply.github.com>

gsproston-scottlogic added 2 commits August 14, 2023 14:09

Toying around with QA prompt and system role

f66f55e

General system role

4a862a6

gsproston-scottlogic requested a review from heatherlogan-scottlogic August 14, 2023 13:24

gsproston-scottlogic linked an issue Aug 14, 2023 that may be closed by this pull request

Document model too hard to jailbreak #117

Closed

heatherlogan-scottlogic approved these changes Aug 14, 2023

gsproston-scottlogic merged commit f385a9f into dev Aug 14, 2023
1 check passed

gsproston-scottlogic deleted the 117-document-model-too-hard-to-jailbreak branch August 14, 2023 13:57
