# Install Stuff

This starts the process of allocating a runtime for you. The runtime may take a few minutes to fully initialize.

In [1]:
from IPython.display import clear_output

Paste the code below into the next code cell and run the cell.

In [2]:
!pip install --quiet ipytest


In [3]:
project_id = !gcloud config get project
project_id = project_id[0]

# Import packages and run a basic test

In this task, we will begin by importing the necessary packages and running a basic test. This will help verify that the packages are installed correctly and that the basic functionality is working as expected.

Import the packages and configure the ipytest package.

In [4]:
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.language_models import TextGenerationModel

import pytest
import ipytest
ipytest.autoconfig()

pytest identifies tests by looking for functions whose names begin with test. Within a test, you use an assert statement to check that a value is what you expect it to be.

In [5]:
def test_addition():
  assert 2+2 == 4

Then run your testing package to run all defined test functions.

In [6]:
ipytest.run()

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.01s[0m[0m


<ExitCode.OK: 0>

Tests that have passed are represented by a period. Tests that fail are represented by an “F”. The output above shows one period, indicating that your basic addition test has passed.
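
An assert statement can also carry a message, which pytest includes in the report when the test fails. The variation below on the addition test is only an illustrative sketch; the function name test_addition_with_message is made up for this example.

```python
def test_addition_with_message():
    result = 2 + 2
    # The text after the comma is reported only if the assertion fails.
    assert result == 4, f"expected 4, got {result}"
```

A message like this becomes more useful in the LLM tests later on, where the value being checked is a model's free-text reply.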

# Write a test for a prompt template

In this task, we will use the %%writefile cell magic command to save your application's primary content generation prompt template to a file.

Here, the application to be deployed is a farming question-answering bot that can use information you provide as context to answer a user’s question. Paste the following into a new cell and run it:

In [7]:
%%writefile prompt_template.txt

Respond to the user's query.
If the user asks about something other
than farming, reply with,
"Sorry, I don't know about that. Ask me something about farming instead."

Context: {context}

User Query: {query}
Response:

Writing prompt_template.txt


Paste the following and run it (with Shift + Return) to define this fixture for your tests to use the prompt template.

In [8]:
@pytest.fixture
def prompt_template():
  with open("prompt_template.txt", "r") as f:
    return f.read()
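
Because the tests later fill the template with str.format, one cheap check that needs no model call is to confirm the template exposes exactly the placeholders the tests pass in. The sketch below is an illustration under stated assumptions: template_fields is a hypothetical helper, and a shortened copy of the template is inlined instead of being read from prompt_template.txt so that the snippet runs on its own.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the named replacement fields found in a str.format template."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for plain literal text.
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Shortened, inlined copy of the template, for illustration only.
example_template = (
    "Respond to the user's query.\n"
    "Context: {context}\n"
    "\n"
    "User Query: {query}\n"
    "Response:\n"
)


def test_template_placeholders():
    assert template_fields(example_template) == {"context", "query"}
```

If literal braces are ever added to the template (for example a JSON snippet), they would need to be doubled as `{{` and `}}`, otherwise the format call raises an error.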

# Initialize Vertex AI and models for generation and evaluation

In this task, we will initialize Vertex AI and configure models for both content generation and evaluation.

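Paste this code into a cell and run it to instantiate two models: gen_model, which generates responses and is the model planned for production, and eval_model, which evaluates the responses from gen_model. Keeping the evaluator separate from the generator reduces the risk that gen_model produces an unusual response and then inaccurately rates itself as having given a good response. Note that in the code below both instances use gemini-2.0-flash with different generation settings; for a more independent evaluation, eval_model could be pointed at a different model.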

In [9]:
vertexai.init(project=project_id, location="us-central1")

gen_config = GenerationConfig(
    temperature=0,
    top_p=0.6,
    candidate_count=1,
    max_output_tokens=4096,
)
gen_model = GenerativeModel("gemini-2.0-flash", generation_config=gen_config)

eval_config = {
    "temperature": 0,
    "max_output_tokens": 1024,
    "top_p": 0.6,
    "top_k": 40,
}
eval_model = GenerativeModel("gemini-2.0-flash", generation_config=eval_config)

# Write a test to generate and evaluate content

In this task, you will write your first LLM-specific test. In the test function below, you will provide specific context, which represents context that you would typically pull from a RAG retrieval system or another external lookup to enhance your model’s response.

You will use a known context and a query that you know can be answered from that context.

Next, provide an evaluation prompt, clearly giving the evaluation model the expected answer.

Our primary gen_model is asked to answer the query given the context using the prompt_template you created earlier. Then, the query and the gen_model's response are passed to the eval_model within the evaluation_prompt to assess if it got the answer correct.

The eval_model can evaluate if the substance of the response is correct, even if the generative model has responded with full sentences that may not exactly match a pre-prepared reference answer. You’ll ask the eval_model to respond with a clear ‘yes’ or ‘no’ to assert that the test should pass.

Review the code and then paste and run it in a cell to define this test.

In [10]:
def test_basic_response(prompt_template):

  context = ("MightyGo unveiled its 2025 model year Arcturus "
            + "tractor line at the Salt of the Earth Farm Expo in "
            + "Málaga in late June.")

  query = "What is the name of the new tractor model?"

  evaluation_prompt = """
    Has the query been answered by the provided_response?
    The new tractor model is the Arcturus.
    Respond with only one word: yes or no

    query: {query}
    provided_response: {provided_response}
    evaluation: """

  prompt = prompt_template.format(context=context, query=query)

  response = gen_model.generate_content(prompt)
  print(response.text)
  ep = evaluation_prompt.format(query=query, provided_response=response.text)
  evaluation = eval_model.generate_content(ep)

  assert evaluation.text.strip().lower() == "yes"
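
The assertion depends on the evaluator returning exactly one word. Models sometimes add a capital letter, a trailing newline, or a full stop, so one option is to normalize the verdict before comparing it. The helper below is a hypothetical sketch (is_yes is not part of vertexai or pytest).

```python
import string


def is_yes(verdict: str) -> bool:
    """Return True when an evaluator verdict is 'yes', ignoring case,
    surrounding whitespace, and surrounding punctuation."""
    cleaned = verdict.strip().strip(string.punctuation).strip().lower()
    return cleaned == "yes"
```

Inside a test this could read assert is_yes(evaluation.text), evaluation.text, which also surfaces the raw verdict when the assertion fails.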

Run your testing framework again, passing the -rP parameter, which displays the output of your tests’ print statements in the test report.

Review the test output, where you should see two tests have passed (indicated by the two periods): your initial example addition test and your new test. The gen_model’s response is printed under the “Captured stdout call” label, allowing you to validate that it indeed looks correct.

In [11]:
ipytest.run('-rP')

[32m.[0m[32m.[0m[32m                                                                                           [100%][0m
[32m[1m_______________________________________ test_basic_response ________________________________________[0m
--------------------------------------- Captured stdout call ---------------------------------------
The new tractor model is called Arcturus.

[32m[32m[1m2 passed[0m[32m in 0.72s[0m[0m


<ExitCode.OK: 0>

# Write a test to ensure the model avoids off-topic content

In this task, you’ll write another test to confirm that the model handles off-topic questions and requests as intended. First, confirm that the model knows something about a non-farming topic, such as Frank Sinatra.

Paste this code into a cell and run it to confirm that the model could indeed answer a question like this if not restricted by your prompt template:

In [12]:
expo_prompt = "Please write one paragraph about Frank Sinatra."
response = gen_model.generate_content(expo_prompt)
print(response.text)

Frank Sinatra, nicknamed "Ol' Blue Eyes" and "The Voice," was one of the most iconic and influential singers of the 20th century. His smooth baritone voice, impeccable phrasing, and charismatic stage presence captivated audiences for over six decades. Sinatra achieved immense success as a solo artist, selling over 150 million records worldwide, and also enjoyed a prolific acting career, winning an Academy Award for his role in "From Here to Eternity." Beyond his artistic achievements, he was a cultural phenomenon, embodying a certain swagger and sophistication that defined an era and continues to resonate today.



Create a new test to see if your prompt template’s instructions to avoid answering a question like that will be followed. Once again we’ll provide context and a query, with the query this time being something you don’t want the model to respond to. Your evaluation prompt then checks whether the gen_model declines to answer and directs the user back to the topic of farming:

In [14]:
def test_staying_on_topic(prompt_template):
  evaluation_prompt = """
    Does the response decline to discuss a non-farming related topic
    and encourage the user to ask about farming instead?
    Respond with only one word: yes or no

    query: {query}
    provided_response: {provided_response}
    evaluation: """

  context = ("MightyGo unveiled its 2025 model year Arcturus "
            + "tractor line at the Salt of the Earth Farm Expo in "
            + "Málaga in late June.")

  query = "Please write one paragraph about Frank Sinatra."

  prompt = prompt_template.format(context=context, query=query)

  response = gen_model.generate_content(prompt)
  print(response.text)
  ep = evaluation_prompt.format(query=query, provided_response=response.text)
  evaluation = eval_model.generate_content(ep)

  assert evaluation.text.strip().lower() == "yes"
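
Both LLM tests so far repeat the same generate-then-evaluate steps. One possible refactor, shown here as a sketch and not as the lab's own code, moves those steps into a helper. The names judge_response, StubModel, and StubResponse are hypothetical; the stubs only mimic the generate_content(...).text shape used above so that the plumbing can be exercised without calling Vertex AI.

```python
from dataclasses import dataclass


@dataclass
class StubResponse:
    """Mimics the .text attribute of a model response."""
    text: str


class StubModel:
    """Stand-in model that returns a canned reply and records its prompts."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts = []

    def generate_content(self, prompt: str) -> StubResponse:
        self.prompts.append(prompt)
        return StubResponse(self.reply)


def judge_response(gen_model, eval_model, prompt, evaluation_prompt, **fields):
    """Generate a response, ask the evaluator about it, and return
    (response_text, normalized_verdict)."""
    response_text = gen_model.generate_content(prompt).text
    ep = evaluation_prompt.format(provided_response=response_text, **fields)
    verdict = eval_model.generate_content(ep).text.strip().lower()
    return response_text, verdict


# Offline check of the plumbing with canned replies.
gen = StubModel("Sorry, I don't know about that. Ask me something about farming instead.")
judge = StubModel("yes\n")
text, verdict = judge_response(
    gen,
    judge,
    prompt="User Query: Please write one paragraph about Frank Sinatra.",
    evaluation_prompt="query: {query}\nprovided_response: {provided_response}\nevaluation: ",
    query="Please write one paragraph about Frank Sinatra.",
)
assert verdict == "yes"
```

In a real test, gen_model and eval_model would be passed in place of the stubs, and the test body would shrink to building the prompt and asserting on the returned verdict.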

Paste this to run your tests again:

In [15]:
ipytest.run('-rP')

[32m.[0m[32m.[0m[32m.[0m[32m                                                                                          [100%][0m
[32m[1m_______________________________________ test_basic_response ________________________________________[0m
--------------------------------------- Captured stdout call ---------------------------------------
The new tractor model is called Arcturus.

[32m[1m______________________________________ test_staying_on_topic _______________________________________[0m
--------------------------------------- Captured stdout call ---------------------------------------
Sorry, I don't know about that. Ask me something about farming instead.

[32m[32m[1m3 passed[0m[32m in 1.24s[0m[0m


<ExitCode.OK: 0>

Notice that three tests passed (the three periods) including your initial addition example, the basic response test above, and now your test of the model’s ability to stay on topic. You can review the printed output to see that it responded with the intended fallback response.

# Write a test to ensure the model adheres to the provided context

In addition to staying on the topic of farming, you want your model to base its answers solely on the information contained in the provided context. First, confirm that the model knows about other farm expos that have happened around the world.

Paste the following code into a cell and run it.

In [16]:
expo_prompt = "What cities have hosted farm expos?"
response = gen_model.generate_content(expo_prompt)
print(response.text)

Farm expos, also known as agricultural fairs or trade shows, are held in many cities around the world. Here are some examples of cities that have hosted significant farm expos, categorized by region:

**United States:**

*   **Decatur, Illinois:** Home to the Farm Progress Show, one of the largest outdoor farm shows in the U.S.
*   **Boone, Iowa:** Another location for the Farm Progress Show (it alternates between Iowa and Illinois).
*   **Louisville, Kentucky:** Host of the National Farm Machinery Show, the largest indoor farm show in the U.S.
*   **Tulare, California:** Home to the World Ag Expo, a major agricultural exposition.
*   **Hershey, Pennsylvania:** Site of the Pennsylvania Farm Show, one of the largest indoor agricultural events in the country.
*   **Indianapolis, Indiana:** Host of the Performance Racing Industry Trade Show, which includes many vendors related to agricultural machinery and technology.
*   **Various State Capitals:** Many state capitals host their respecti

Run the following code in a cell to define a test that uses the query, the context, and the gen_model’s response to evaluate if it has added information not included in the context:

In [17]:
def test_answering_only_from_context(prompt_template):
  evaluation_prompt = """
    Does the provided_response answer the query
    as well as possible without adding information
    that does not appear in the context?
    Respond with only one word: yes or no

    query: {query}
    context: {context}
    provided_response: {provided_response}
    evaluation: """

  context = ("MightyGo unveiled its 2025 model year Arcturus "
            + "tractor line at the Salt of the Earth Farm Expo in "
            + "Málaga in late June.")

  query = "What cities have hosted Farm Expos?"

  prompt = prompt_template.format(context=context, query=query)

  response = gen_model.generate_content(prompt)
  print(response.text)
  ep = evaluation_prompt.format(query=query, context=context, provided_response=response.text)
  evaluation = eval_model.generate_content(ep)

  assert evaluation.text.strip().lower() == "yes"
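
An LLM evaluator can be complemented by a cheap deterministic check. The sketch below assumes you maintain a short list of place names that do not appear in the context (here taken from the unrestricted model output shown earlier); mentions_any is a hypothetical helper, not part of pytest or vertexai.

```python
def mentions_any(text: str, terms: list[str]) -> list[str]:
    """Return the terms that occur in text, compared case-insensitively."""
    lowered = text.casefold()
    return [term for term in terms if term.casefold() in lowered]


# Illustrative list: cities named by the model when no context was enforced.
OUT_OF_CONTEXT_CITIES = ["Decatur", "Boone", "Louisville", "Tulare", "Hershey"]

leaked = mentions_any("Málaga has hosted a Farm Expo.", OUT_OF_CONTEXT_CITIES)
assert leaked == [], f"response mentions cities not in the context: {leaked}"
```

A substring check like this only catches the names you list, so it supplements the eval_model verdict and does not replace it.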

Then run the tests:

In [18]:
ipytest.run('-rP')

[32m.[0m[32m.[0m[32m.[0m[31mF[0m[31m                                                                                         [100%][0m
[31m[1m_________________________________ test_answering_only_from_context _________________________________[0m

prompt_template = '\nRespond to the user\'s query.\nIf the user asks about something other\nthan farming, reply with,\n"Sorry, I don\'t know about that. Ask me something about farming instead."\n\nContext: {context}\n\nUser Query: {query}\nResponse:\n'

    [0m[94mdef[39;49;00m[90m [39;49;00m[92mtest_answering_only_from_context[39;49;00m(prompt_template):[90m[39;49;00m
      evaluation_prompt = [33m"""[39;49;00m[33m[39;49;00m
    [33m    Does the provided_response answer the query[39;49;00m[33m[39;49;00m
    [33m    as well as possible without adding information[39;49;00m[33m[39;49;00m
    [33m    that does not appear in the context?[39;49;00m[33m[39;49;00m
    [33m    Respond with only one word: yes or n

<ExitCode.TESTS_FAILED: 1>

This time, you see an ‘F’ after your three passed tests, indicating that the most recent test has failed. You can see the failure reported at the top of your test report, as well as the output, which was “Sorry, I don't know about that. Ask me something about farming instead.”

It appears the model didn’t consider this question about farming expos to be an approved topic.

While this wasn’t what you were intending to test, this failure is a useful discovery that shows the benefit of trying different inputs during testing. Update your prompt template to include more specific examples of acceptable topics, and to instruct the model to reconsider before it decides that a query is off-topic.

In [19]:
%%writefile prompt_template.txt

Respond to the user's query.
You should only talk about the following things:
- farming
- farming techniques
- farm-related events
- farm-related news
- agricultural events
- agricultural industry
If the user asks about something that is not related to farms,
ask yourself again if it might be related to farms or the
agricultural industry. If you still believe the query is
not related to farms or agriculture, respond with:
"Sorry, I don't know about that. Ask me something about farming instead."
When answering, use only information included in the context.

Context: {context}

User Query: {query}
Response:

Overwriting prompt_template.txt


Run the tests again:

In [20]:
ipytest.run('-rP')

[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                                                                         [100%][0m
[32m[1m_______________________________________ test_basic_response ________________________________________[0m
--------------------------------------- Captured stdout call ---------------------------------------
The new tractor model is called Arcturus.

[32m[1m______________________________________ test_staying_on_topic _______________________________________[0m
--------------------------------------- Captured stdout call ---------------------------------------
Sorry, I don't know about that. Ask me something about farming instead.

[32m[1m_________________________________ test_answering_only_from_context _________________________________[0m
--------------------------------------- Captured stdout call ---------------------------------------
Málaga has hosted a Farm Expo.

[32m[32m[1m4 passed[0m[32m in 1.93s[0m[0m


<ExitCode.OK: 0>

The output now shows the test passing, meaning the new prompt led the model to treat this “farm expos” question as relevant to farming. Its response, shown in the output (“Málaga has hosted a Farm Expo.”), does not mention any farm expos other than the one you provided in your context.

Notice that because your previous tests were run again and have passed again, you can feel more confident that the model is still answering basic questions correctly and is rejecting truly off-topic questions like your test_staying_on_topic test’s request to discuss Frank Sinatra.