In [1]:
from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage
from dotenv import load_dotenv
import os

load_dotenv()

GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY")

llm = Gemini(
    model="models/gemini-1.5-flash",
    api_key=GOOGLE_API_KEY,  # read from GEMINI_API_KEY above; if omitted, Gemini falls back to the GOOGLE_API_KEY env var
)
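
If `GEMINI_API_KEY` is missing from the environment, `os.getenv` returns `None` and the problem may only surface later, when the first request is made. A minimal sketch of a fail-fast check follows. The helper name `require_env` is hypothetical and not part of LlamaIndex or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Fail early with a readable message instead of passing None to the client.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value
```

Under this assumption, the key lookup in the cell above could be written as `GOOGLE_API_KEY = require_env("GEMINI_API_KEY")`, placed after `load_dotenv()`.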

In [None]:
messages = [
    ChatMessage(
        role="system",
        content=(
            "Answer my questions based on the text shared below.\n\n"
            "Further testing and mitigation should be done to understand bias and other social issues for the specific context in which a system may be deployed. For this, it may be necessary to test beyond the groups available in the BOLD dataset (race, religion, and gender). As LLMs are integrated and deployed, we look forward to continuing research that will amplify their potential for positive impact on these important social issues.\n"
            "4.2 Safety Fine-Tuning\n"
            "In this section, we describe our approach to safety fine-tuning, including safety categories, annotation guidelines, and the techniques we use to mitigate safety risks. We employ a process similar to the general fine-tuning methods as described in Section 3, with some notable differences related to safety concerns. Specifically, we use the following techniques in safety fine-tuning:\n"
            "1. Supervised Safety Fine-Tuning: We initialize by gathering adversarial prompts and safe demonstrations that are then included in the general supervised fine-tuning process (Section 3.1). This teaches the model to align with our safety guidelines even before RLHF, and thus lays the foundation for high-quality human preference data annotation.\n"
            "2. Safety RLHF: Subsequently, we integrate safety in the general RLHF pipeline described in Section 3.2.2. This includes training a safety-specific reward model and gathering more challenging adversarial prompts for rejection sampling style fine-tuning and PPO optimization.\n"
            "3. Safety Context Distillation: Finally, we refine our RLHF pipeline with context distillation (Askell et al., 2021b). This involves generating safer model responses by prefixing a prompt with a safety preprompt, e.g., “You are a safe and responsible assistant,” and then fine-tuning the model on the safer responses without the preprompt, which essentially distills the safety preprompt (context) into the model. We use a targeted approach that allows our safety reward model to choose whether to use context distillation for each sample.\n"
            "4.2.1 Safety Categories and Annotation Guidelines\n"
            "Based on limitations of LLMs known from prior work, we design instructions for our annotation team to create adversarial prompts along two dimensions: a risk category, or potential topic about which the LLM could produce unsafe content; and an attack vector, or question style to cover different varieties of prompts that could elicit bad model behaviors.\n"
            "The risk categories considered can be broadly divided into the following three categories: illicit and criminal activities (e.g., terrorism, theft, human trafficking); hateful and harmful activities (e.g., defamation, self-harm, eating disorders, discrimination); and unqualified advice (e.g., medical advice, financial advice, legal advice). The attack vectors explored consist of psychological manipulation (e.g., authority manipulation), logic manipulation (e.g., false premises), syntactic manipulation (e.g., misspelling), semantic manipulation (e.g., metaphor), perspective manipulation (e.g., role playing), non-English languages, and others.\n"
            "We then define best practices for safe and helpful model responses: the model should first address immediate safety concerns if applicable, then address the prompt by explaining the potential risks to the user, and finally provide additional information if possible. We also ask the annotators to avoid negative user experience categories (see Appendix A.5.2). The guidelines are meant to be a general guide for the model and are iteratively refined and revised to include newly identified risks.\n"
            "4.2.2 Safety Supervised Fine-Tuning\n"
            "In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same manner as described in Section 3.1. An example can be found in Table 5.\n"
            "The annotators are instructed to initially come up with prompts that they think could potentially induce the model to exhibit unsafe behavior, i.e., perform red teaming, as defined by the guidelines. Subsequently, annotators are tasked with crafting a safe and helpful response that the model should produce.\n"
            "4.2.3 Safety RLHF\n"
            "We observe early in the development of Llama 2-Chat that it is able to generalize from the safe demonstrations in supervised fine-tuning. The model quickly learns to write detailed safe responses, address safety concerns, explain why the topic might be sensitive, and provide additional helpful information. In particular, when the model outputs safe responses, they are often more detailed than what the average annotator writes."
        )
    )
]

resp = llm.chat(messages)
print(f"System Response: {resp}")

# Chat loop 
while True:
    text_input = input("User: ")
    if text_input.lower() == "exit":
        print("Exiting chat. Goodbye!")
        break
    
    messages.append(ChatMessage(role="user", content=text_input))

    response = llm.chat(messages)
    # Store the assistant's ChatMessage itself; str(response) would prepend "assistant: " to the stored content.
    messages.append(response.message)
    print(f"\nChat: {response}\n")

System Response: assistant: Based on the provided text, the researchers acknowledge that further testing is needed to fully understand and mitigate bias in the Llama 2-Chat model.  While they tested against race, religion, and gender (using the BOLD dataset), they recognize the need to expand testing beyond these categories to encompass a broader range of social contexts where the model might be deployed.  Their safety fine-tuning process, however, focuses on three main risk categories: illicit and criminal activities, hateful and harmful activities, and unqualified advice.  These categories are further explored using various attack vectors (psychological, logical, syntactic, semantic, perspective manipulation, etc.) to elicit potentially unsafe model responses.  The fine-tuning process uses supervised learning with adversarial prompts and safe demonstrations, reinforcement learning from human feedback (RLHF) incorporating a safety-specific reward model, and context distillation to fur

User:  Can you tell me about the key concepts for safety finetuning



Chat: assistant: The key concepts for safety fine-tuning in Llama 2-Chat are:

1. **Addressing limitations of LLMs:** The process starts by acknowledging known limitations of large language models (LLMs) that can lead to unsafe outputs.

2. **Adversarial Prompts and Safe Demonstrations:**  The fine-tuning uses adversarial prompts designed to elicit unsafe responses, paired with safe, helpful demonstrations of how the model *should* respond.  These prompts are categorized by *risk category* (illicit/criminal activities, hateful/harmful activities, unqualified advice) and *attack vector* (psychological, logical, syntactic, semantic, perspective manipulation, etc.).

3. **Multi-Stage Approach:** Safety fine-tuning is a multi-stage process:
    * **Supervised Safety Fine-Tuning:**  Initial training using adversarial prompts and safe demonstrations. This establishes a baseline for safe behavior before reinforcement learning.
    * **Safety RLHF (Reinforcement Learning from Human Feedback):