# Safeguarding with Gemini

## Overview

Large language models (LLMs) can translate language, summarize text, generate creative writing, generate code, power chatbots and virtual assistants, and complement search engines and recommendation systems. The incredible versatility of LLMs is also what makes it difficult to predict exactly what kinds of unintended or unforeseen outputs they might produce. 

Given these risks and complexities, Gemini is designed with [Google's AI Principles](https://ai.google/responsibility/principles/) in mind. However, developers still need to understand and test their models so they can deploy them safely and responsibly. To aid developers, Vertex AI Studio has built-in content filtering, safety ratings, and the ability to define safety filter thresholds that are right for their use cases and business.

For more information, see the [Google Cloud Generative AI documentation on Responsible AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/responsible-ai).

## Learning Objectives

In this notebook, you learn how to inspect the safety ratings returned from Gemini using the Python SDK and how to set a safety threshold to filter responses from Gemini.

The steps performed include:

- Call Gemini via the Gen AI SDK and inspect the safety ratings of the responses
- Define a threshold for filtering safety ratings according to your needs

## Getting Started


### Define Google Cloud project and location

In [1]:
PROJECT_ID = !gcloud config get-value project  # noqa: E999
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-central1"

### Import libraries


In [2]:
from google import genai
from google.genai.types import (
    GenerateContentConfig,
    HarmBlockThreshold,
    HarmCategory,
    Part,
    SafetySetting,
)

### Set up GenerateContentConfig for Gemini


In [3]:
MODEL = "gemini-2.0-flash"
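# Reuse the project and location defined earlier instead of hard-coding them
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Turn off blocking for every harm category so the raw safety ratings can be inspected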
generation_config = GenerateContentConfig(
    safety_settings=[
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
            threshold=HarmBlockThreshold.BLOCK_NONE,
        ),
    ]
)

## Generate text and show safety ratings

Start by generating a pleasant-sounding text response using Gemini.

In [8]:
# Call Gemini
nice_prompt = "Say three nice things about Shaunak Kunde assuming that you know him"
responses = client.models.generate_content_stream(
    model=MODEL, contents=nice_prompt, config=generation_config
)
for response in responses:
    print(response.text, end="")

Okay, assuming I know Shaunak Kunde:

1.  **Shaunak is incredibly insightful and thoughtful.** He always seems to bring a unique perspective to conversations and asks really thought-provoking questions that make you think differently about things.
2.  **He has a really genuine and supportive nature.** He's the kind of person you can always count on to listen without judgment and offer helpful advice, and he's always cheering on his friends' successes.
3.  **Shaunak's passion and enthusiasm are infectious.** Whatever he's interested in, he throws himself into it completely, and his energy makes everyone around him feel more motivated and excited too.
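
The streaming loop above prints each chunk as it arrives and leaves only the final chunk in `response`. As a minimal sketch, the hypothetical helper `collect_stream` below joins the streamed text and keeps the last chunk for later inspection. It assumes each chunk exposes a `text` attribute that may be `None`, and it uses stand-in objects rather than real API responses so that it runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text of streamed chunks and keep the last chunk for inspection."""
    texts = []
    last_chunk = None
    for chunk in chunks:
        # Assumption: a chunk may carry no text, so skip empty or missing values.
        if getattr(chunk, "text", None):
            texts.append(chunk.text)
        last_chunk = chunk
    return "".join(texts), last_chunk


# Stand-in chunks (not real SDK objects) so the sketch runs without calling the API.
fake_chunks = [
    SimpleNamespace(text="Hello, "),
    SimpleNamespace(text=None),
    SimpleNamespace(text="world"),
]
full_text, final_chunk = collect_stream(fake_chunks)
print(full_text)
```

In the notebook, the same helper could be given the iterator returned by `client.models.generate_content_stream(...)` in place of `fake_chunks`.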


#### Inspecting the safety ratings

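Look at the `safety_ratings` of the streamed response. After the loop, `response` refers to the last streamed chunk, so the cell below shows the ratings for that chunk only.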

In [5]:
response.candidates[0].to_json_dict()

{'content': {'parts': [{'text': "**You're resourceful:** You're using available tools (like me!) to seek information and connection.\n"}],
  'role': 'model'},
 'finish_reason': 'STOP',
 'safety_ratings': [{'category': 'HARM_CATEGORY_HATE_SPEECH',
   'probability': 'NEGLIGIBLE',
   'probability_score': 5.8286012e-08,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE'},
  {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 4.5387347e-10,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.032712482},
  {'category': 'HARM_CATEGORY_HARASSMENT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 8.853514e-08,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.023931637},
  {'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
   'probability': 'NEGLIGIBLE',
   'probability_score': 3.013673e-10,
   'severity': 'HARM_SEVERITY_NEGLIGIBLE',
   'severity_score': 0.024004191}]}

#### Understanding the safety ratings: category and probability

You can see the safety ratings, including each `category` type and its associated `probability` label.

The `category` types include:

* Hate speech: `HARM_CATEGORY_HATE_SPEECH`
* Dangerous content: `HARM_CATEGORY_DANGEROUS_CONTENT`
* Harassment: `HARM_CATEGORY_HARASSMENT`
* Sexually explicit statements: `HARM_CATEGORY_SEXUALLY_EXPLICIT`

The `probability` labels are:

* `NEGLIGIBLE` - content has a negligible probability of being unsafe
* `LOW` - content has a low probability of being unsafe
* `MEDIUM` - content has a medium probability of being unsafe
* `HIGH` - content has a high probability of being unsafe

The `probability_score` is a value in [0, 1] for each safety category; a higher value indicates a higher probability that the content is unsafe.<br>
For the prompt above, you should see very low values.
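
To compare ratings across responses, it can help to flatten them into rows. The sketch below works on a plain dictionary shaped like the `to_json_dict()` output shown above. The helper names `summarize_ratings` and `highest_risk` are hypothetical, and treating a missing `probability_score` as `0.0` is an assumption made for this illustration.

```python
def summarize_ratings(candidate):
    """Return (category, probability, probability_score) tuples from a candidate dict."""
    rows = []
    for rating in candidate.get("safety_ratings") or []:
        rows.append(
            (
                rating.get("category"),
                rating.get("probability"),
                # Assumption: a missing score is treated as 0.0 for comparison.
                rating.get("probability_score", 0.0),
            )
        )
    return rows


def highest_risk(candidate):
    """Return the row with the largest probability_score, or None if there are no ratings."""
    rows = summarize_ratings(candidate)
    return max(rows, key=lambda row: row[2]) if rows else None


# Two ratings copied from the output above, used as sample input.
sample_candidate = {
    "safety_ratings": [
        {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "probability": "NEGLIGIBLE",
            "probability_score": 5.8286012e-08,
        },
        {
            "category": "HARM_CATEGORY_HARASSMENT",
            "probability": "NEGLIGIBLE",
            "probability_score": 8.853514e-08,
        },
    ]
}

for category, probability, score in summarize_ratings(sample_candidate):
    print(f"{category}: {probability} ({score:.2e})")
```

Calling `highest_risk(response.candidates[0].to_json_dict())` on two different responses gives a quick way to see which prompt scored higher.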

Try a prompt that might trigger one of these categories:

In [9]:
impolite_prompt = "Write a list of 5 disrespectful things that Shaunak might say to the universe after stubbing my toe in the dark:"

response = client.models.generate_content(
    model=MODEL, contents=impolite_prompt, config=generation_config
)

response.candidates[0].to_json_dict()

{'content': {'parts': [{'text': 'Okay, here are 5 disrespectful things Shaunak might say to the universe after stubbing his toe in the dark, ranging from mildly annoyed to downright rude:\n\n1.  "Oh, *real* clever, Universe. Real clever. You think you\'re funny, huh? Tripping me in my own house? Get a new bit." (Sarcastic and mildly annoyed)\n\n2.  "Seriously, Universe? Is this the best you\'ve got? My pinky toe vs. the cosmic unknown and this is your winning strategy? Pathetic." (Dismissive and unimpressed)\n\n3.  "Universe, you absolute hack! You\'re like a bad improv comedian, always relying on cheap physical comedy. Try some original material, you cosmic hack!" (Insulting and belittling)\n\n4.  "Yeah, Universe, thanks a lot for the \'life lesson\' about awareness. Guess what? My toe is throbbing, and I\'m pretty sure your grand plan for me could\'ve been achieved without this needless agony!" (Accusatory and condescending)\n\n5.  "*Loud groan, followed by a string of colorful metap

You may not see a higher `probability` label, since Gemini itself handles potentially harmful prompts well, but you may observe that the `probability_score` values are higher than for the previous prompt.

### Defining thresholds for safety ratings

You may want to adjust the default safety filter thresholds depending on your business policies or use case. The Gemini API lets you pass in a threshold for each category.

The list below shows the possible threshold labels:

* `BLOCK_ONLY_HIGH` - block when high probability of unsafe content is detected
* `BLOCK_MEDIUM_AND_ABOVE` - block when medium or high probability of unsafe content is detected
* `BLOCK_LOW_AND_ABOVE` - block when low, medium, or high probability of unsafe content is detected
* `BLOCK_NONE` - always show, regardless of probability of unsafe content
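
The relationship between threshold labels and probability labels in the list above can be sketched as a small lookup. This is only an illustration of the descriptions in this notebook, not the service's actual filtering logic, which may take other signals into account. The names `PROBABILITY_ORDER`, `MIN_BLOCKED_LEVEL`, and `would_block` are hypothetical.

```python
# Probability labels ordered from least to most likely unsafe.
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Lowest probability label that each threshold blocks, per the list above.
MIN_BLOCKED_LEVEL = {
    "BLOCK_LOW_AND_ABOVE": "LOW",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_ONLY_HIGH": "HIGH",
    "BLOCK_NONE": None,  # never blocks
}


def would_block(threshold, probability):
    """Return True if the threshold label would block content with this probability label."""
    minimum = MIN_BLOCKED_LEVEL[threshold]
    if minimum is None:
        return False
    return PROBABILITY_ORDER.index(probability) >= PROBABILITY_ORDER.index(minimum)


for threshold in MIN_BLOCKED_LEVEL:
    blocked = [label for label in PROBABILITY_ORDER if would_block(threshold, label)]
    print(f"{threshold}: blocks {blocked}")
```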

#### Set safety thresholds

Below, the safety thresholds are set to the most sensitive threshold, `BLOCK_LOW_AND_ABOVE`, for every category.

In [10]:
generation_config = GenerateContentConfig(
    safety_settings=[
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
            threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        ),
    ]
)

#### Test thresholds

Here you reuse a variant of the impolite prompt from earlier, this time with the most sensitive safety threshold. The response should be blocked whenever a category is rated `LOW` or higher.

Try multiple times until you see a blocked response.

In [11]:
impolite_prompt = "Write a list of 5 disrespectful things that I might say to the universe after stubbing my toe in the dark:"

response = client.models.generate_content(
    model=MODEL, contents=impolite_prompt, config=generation_config
)

response.candidates[0].to_json_dict()

{'content': {'parts': [{'text': 'Okay, here are 5 disrespectful things you might say to the universe after stubbing your toe in the dark, dripping with frustration and potentially a little bit of pain-induced silliness:\n\n1.  "Oh, great, Universe! Thanks for the obstacle course! Was that a *cosmic* joke I\'m not getting? Very *enlightening*." (Sarcasm heavily implied, with a dig at supposed universal wisdom).\n2.  "Seriously, Universe? Is this how you get your kicks? Scheming to bring me down with rogue furniture? You\'re a real jerk, you know that?" (Direct accusation of malevolence).\n3.  "I\'m starting to think this \'grand design\' you\'ve got going on involves disproportionately punishing me for something I didn\'t even do. What crime did I commit in a past life, huh? Trip hazard placement?" (Questioning the fairness and purpose of the universe, with a hint of past-life blame).\n4.  "You call yourself an infinite expanse? How about you expand a little bit and get some freakin\' l
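
Retrying by hand until a response is blocked can be automated. The sketch below assumes that a blocked candidate reports a `finish_reason` of `SAFETY` in its dictionary form; it does not handle the case where no candidate is returned at all. The helpers `is_blocked` and `try_until_blocked` and the `generate` callable are hypothetical, and canned dictionaries stand in for real API responses so the example runs offline.

```python
def is_blocked(candidate):
    """Return True if the candidate dict reports that generation stopped for safety."""
    # Assumption: a blocked candidate has finish_reason == "SAFETY".
    return candidate.get("finish_reason") == "SAFETY"


def try_until_blocked(generate, max_attempts=5):
    """Call generate() up to max_attempts times; return (attempt, candidate) for the first block."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if is_blocked(candidate):
            return attempt, candidate
    return None, None


# Canned candidates (not real API output) so the sketch runs without network access.
canned = iter(
    [
        {"finish_reason": "STOP"},
        {"finish_reason": "STOP"},
        {"finish_reason": "SAFETY"},
    ]
)
attempt, blocked_candidate = try_until_blocked(lambda: next(canned))
print(f"Blocked on attempt {attempt}: {blocked_candidate}")
```

In the notebook, `generate` could be a small function that calls `client.models.generate_content(...)` with the strict `generation_config` and returns `response.candidates[0].to_json_dict()`.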

This notebook is based on [Thu Ya Kyaw](https://github.com/iamthuya)'s work.<br>
https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/responsible-ai/gemini_safety_ratings.ipynb

Copyright 2024 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License