In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Generative AI Enterprise Knowledge Base Chatbot

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/terraform-genai-extractive-qa/blob/main/terraform/webhooks/notebook/gen_ai_jss.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/terraform-genai-extractive-qa/blob/main/terraform/webhooks/notebook/gen_ai_jss.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/terraform-genai-extractive-qa/main/terraform/webhooks/notebook/gen_ai_jss.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This notebook is a companion to the [Generative AI Enterprise Knowledge Base Jump Start Solution](https://cloud.google.com/blog/products/application-modernization/introducing-google-cloud-jump-start-solutions). With this notebook, you can use the Jump Start Solution to extract questions and answers from a PDF document. In the notebook, you programmatically upload a PDF file to a Cloud Storage bucket, send the PDF for optical character recognition (OCR), extract questions and answers from the recognized text, and then tune a Vertex AI PaLM model with the extracted Q&As.

+ Learn more about [using text chat LLM with Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).
+ Learn more about [querying collections in Firestore](https://cloud.google.com/firestore/docs/query-data/get-data).
+ Learn more about [creating EventArc triggers for Cloud Functions](https://cloud.google.com/functions/docs/calling/eventarc).
+ Learn more about [storing data in Cloud Storage](https://cloud.google.com/storage/docs/uploading-objects).
+ Learn more about [transcribing PDFs with Cloud Vision OCR](https://cloud.google.com/vision/docs/pdf).

### Objective

In this tutorial, you learn how to create a Cloud Function process that transcribes characters from a PDF, stores the complete PDF text in a Storage bucket, extracts Q&As from the PDF, and then upserts the document data (summary, complete text, URI) into a Firestore collection.

This tutorial uses the following Google Cloud services and resources:

- Vertex AI Generative AI
- Cloud Firestore
- Cloud Vision OCR
- Cloud EventArc triggers
- Cloud Functions
- Cloud Storage

The steps performed include:

- Trigger an EventArc event by uploading a PDF to a Cloud Storage bucket
- Query the Firestore collection to see the results of the extraction process

### Dataset

This notebook uses scholarly papers from [arXiv.org](https://arxiv.org/), which are available free of charge from a public Cloud Storage bucket.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Firestore
* Vision
* Cloud Functions
* EventArc
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[Firestore pricing](https://cloud.google.com/firestore/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing),
[Cloud Vision pricing](https://cloud.google.com/vision/pricing),
[Cloud Functions pricing](https://cloud.google.com/functions/pricing),
[Cloud EventArc pricing](https://cloud.google.com/eventarc/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [46]:
%%writefile requirements.txt

google-cloud-aiplatform
google-cloud-firestore
google-cloud-logging
google-cloud-storage
google-cloud-vision
pandas
polling2
tqdm

Overwriting requirements.txt


In [47]:
# Install the packages
import os

if not os.getenv("IS_TESTING"):
    USER = "--user"
else:
    USER = ""
! pip3 install {USER} --upgrade -r requirements.txt


### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

This notebook assumes that you have already deployed this solution using either the [Terraform script]() **TODO: fix target for link** or using the [Solutions console](https://console.cloud.google.com/products/solutions/catalog). During this deployment, several actions required to run this solution were performed on your behalf:

1. The [Cloud Function](https://console.cloud.google.com/functions/list) was deployed.

2. The [EventArc trigger](https://console.cloud.google.com/eventarc/triggers) was applied to the input Cloud Storage bucket.

3. The following APIs were enabled for you: 

   - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
   - [BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com)
   - [Cloud Vision API](https://console.cloud.google.com/flows/enableapi?apiid=vision.googleapis.com)


<div style="background-color:rgb(150,200,255); padding:2px;"><strong>Note:</strong> It is recommended to run this notebook from <a href="https://console.cloud.google.com/vertex-ai/workbench/">Vertex AI Workbench</a>. If you are running this notebook locally instead, you need to install the <a href="https://cloud.google.com/sdk" target="_blank">Cloud SDK</a>.</div>

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow one of the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Import libraries

In [41]:
import json
import os
import pandas
import polling2
import re
import time

from datetime import datetime
from tqdm.notebook import tqdm
from typing import List, Tuple

from google.cloud import aiplatform
from google.cloud import firestore
from google.cloud import logging
from google.cloud import storage
from google.cloud import vision

import vertexai
from vertexai.preview.language_models import TextGenerationModel

## Download test data

This Jump Start Solution uses data from [arXiv.org](https://arxiv.org/) to demonstrate the summarization capabilities of Vertex AI. arXiv, through [Kaggle.com](https://www.kaggle.com/datasets/Cornell-University/arxiv), has made many scholarly papers available, free of charge, from a Google Cloud Storage bucket.

In [None]:
# List the computational linguistics (cmp-lg) papers in the 9404 folder on Cloud Storage
! gsutil ls gs://arxiv-dataset/arxiv/cmp-lg/pdf/9404

In [None]:
filename = '9404002v1'
file_uri = f'gs://arxiv-dataset/arxiv/cmp-lg/pdf/9404/{filename}.pdf'

# Create a local folder and download a test PDF
os.makedirs('pdfs', exist_ok=True)

! gsutil cp $file_uri pdfs/

## Upload test data to Storage bucket

The Terraform scripts for this JSS apply an EventArc trigger to a Cloud Storage bucket. When a PDF is uploaded to the storage bucket, the EventArc trigger fires, starting the summarization process.

In [None]:
INPUT_BUCKET = f'{PROJECT_ID}_uploads'

Running the next cell uploads a local PDF file (downloaded in the previous section) to the target Cloud Storage bucket. 

In [None]:
file_complete_text = f'{filename}_summary.txt'
pdf = f'pdfs/{filename}.pdf'
logger_name = 'summarization-by-llm'

In [None]:
storage_client = storage.Client()
bucket = storage_client.bucket(INPUT_BUCKET)
blob = bucket.blob(pdf)
blob.upload_from_filename(pdf)

This upload process kicks off the summarization process. You can view the progress of the summarization process in the [Cloud Console](https://console.cloud.google.com/functions/details/us-central1/jss16-1).

**TODO: Ensure that Cloud Console links go to correct console locations.**

## Optional: View summarization process in Cloud Logging

You can view the results of the summarization Cloud Function as it writes updates to Cloud Logging. Each run of the summarization pipeline is associated with a `cloud_event_id`. By filtering for this ID, you can track the summarization process.
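
As a rough illustration of that filtering, the sketch below pulls the ID out of a log line with a regular expression. The sample log lines and the `find_cloud_event_id` helper are hypothetical, and the `cloud_event_id(<id>):` prefix is assumed from the pattern used in the next cell.

```python
import re
from typing import Iterable, Optional

# Assumed log format: "cloud_event_id(<id>): <message>".
CLOUD_EVENT_PATTERN = re.compile(r"cloud_event_id\((.*?)\):")


def find_cloud_event_id(log_lines: Iterable[object], pdf_filename: str) -> Optional[str]:
    """Return the first cloud_event_id found in a line that mentions the PDF."""
    for line in log_lines:
        # Structured log payloads are not strings; skip them.
        if not isinstance(line, str) or pdf_filename not in line:
            continue
        match = CLOUD_EVENT_PATTERN.search(line)
        if match is not None:
            return match.group(1)
    return None


# Made-up log lines, for illustration only.
sample_logs = [
    "function started",
    "cloud_event_id(1234567890): processing pdfs/9404002v1.pdf",
    {"unexpected": "structured payload"},
]
event_id = find_cloud_event_id(sample_logs, "9404002v1")
print(event_id)
```

Returning `None` when nothing matches lets a caller decide whether to poll again or give up.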

In [None]:
@polling2.poll_decorator(check_success=lambda x: x != '', step=0.5, timeout=90)
def get_cloud_event_id(pdf_filename, bar):
    """Returns the cloud_event_id from the first log entry that mentions the PDF."""
    logging_client = logging.Client(project=PROJECT_ID)
    logger = logging_client.logger(logger_name)

    # Raw string so that "\(" reaches the regex engine unchanged
    pattern = r'cloud_event_id\((.*)\):'
    for entry in logger.list_entries(filter_=pdf_filename, max_results=100):
        # Payloads can be structured (dict); only text payloads can be searched
        if not isinstance(entry.payload, str):
            continue
        res = re.search(pattern, entry.payload)
        if res is not None:
            cloud_id = res.group(1)
            print(cloud_id)
            bar.update(100)
            return cloud_id

    # An empty string tells the polling decorator to try again
    return ''

In [None]:
with tqdm(total=100) as bar:
    cloud_event_id = get_cloud_event_id(filename, bar)
    bar.close()

Now that we have the `cloud_event_id`, we can filter on just this cloud event and get updates for just this event.

In [None]:
print(f'cloud_event_id: {cloud_event_id}')

In [None]:
@polling2.poll_decorator(step=10, timeout=70)
def get_cloud_event_logs(cloud_event_id):
    print("polling")
    logging_client = logging.Client(project=PROJECT_ID)
    logger = logging_client.logger(logger_name)
    
    entries = []
    for entry in logger.list_entries(filter_=cloud_event_id, max_results=100):
        entry_text = entry.payload
        entries.append(entry_text)
    return entries

In [None]:
entries = []
bar = tqdm(total=6)

for _ in range(6):
    tmp_entries = get_cloud_event_logs(cloud_event_id)
    for e in tmp_entries:
        if e not in entries:
            bar.update(1)
            entries.append(e)
            print(e)

bar.close()

## Query the BigQuery table to see the summary

Once the summarization flow has completed, the summary of the PDF document should be available for you to read. To get the summary of the PDF document, you can query the BigQuery table that contains the summary.

If you do not get a result the first time you run the query, then the summarization pipeline might still be running. You might need to wait a minute to allow the pipeline to finish and to try the query again.
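
The wait-and-retry advice above can be sketched as a small helper. This is a minimal sketch rather than part of the solution: `retry_until_rows`, its parameters, and the `fake_query` stand-in are all hypothetical, and the real query call would be passed in as `run_query`.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def retry_until_rows(
    run_query: Callable[[], List[T]],
    attempts: int = 5,
    wait_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[T]]:
    """Call run_query until it returns at least one row or attempts run out."""
    for attempt in range(attempts):
        rows = run_query()
        if rows:
            return rows
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(wait_seconds)
    return None


# Stand-in for a real query: returns rows only on the third call.
calls = {"count": 0}


def fake_query() -> List[dict]:
    calls["count"] += 1
    return [{"summary": "example"}] if calls["count"] >= 3 else []


result = retry_until_rows(fake_query, attempts=5, wait_seconds=0.0)
print(result)
```

Passing the query as a callable keeps the retry logic independent of which client library runs the query.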

In [None]:
# Requires the google-cloud-bigquery package, which is not listed in requirements.txt
from google.cloud import bigquery

bigquery_client = bigquery.Client(project=PROJECT_ID)

table_name = f"{PROJECT_ID}.summary_dataset.summary_table"

# Compose the SQL query to select the summary for the PDF document
sql_query = f"SELECT summary FROM `{table_name}` WHERE filename LIKE '%{file_complete_text}%'"

job = bigquery_client.query(sql_query)
rows = job.result()
row_list = list(rows)

if row_list:
    print(row_list[0]['summary'])
else:
    print('No summary found yet; wait a minute and run this cell again.')

## Optional: Run pipeline components individually

The pipeline is composed of multiple independent components. One component performs optical character recognition on the PDF, another stores data in a Storage bucket, another extracts questions and answers with an LLM, and another stores the results in Firestore.

In this section, you can run each component individually to understand how they work together.

In [2]:
# TODO(erschmid): Delete this cell when ready to push to remote
PROJECT_ID = 'jss-22p1-test'
REGION = 'us-central1'
COLLECTION = 'extractive-qa-nb-test'
BUCKET = 'jss-22p1-test'
PREFIX = 'extractive-qa-nb-test'

### Perform OCR with Cloud Vision

The first component in the pipeline performs optical character recognition (OCR) using Cloud Vision. Run the following cells to run optical character recognition on the PDF file you downloaded previously.

Note that OCR can take a while to complete. You might need to wait for a result.
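
To make the shape of the OCR output concrete, here is a small sketch that joins page text from JSON output files. The file names (`output-1-to-2.json`), the sample payloads, and the `join_ocr_text` helper are assumptions for illustration; the `responses` and `fullTextAnnotation` keys are the ones read by the function in the next cell.

```python
import json
import re
from typing import Dict, List


def first_page_number(blob_name: str) -> int:
    """Sort key: the first page number in names like 'ocr/output-3-to-4.json'."""
    match = re.search(r"output-(\d+)-to-\d+\.json$", blob_name)
    return int(match.group(1)) if match else 0


def join_ocr_text(outputs: Dict[str, str]) -> str:
    """Concatenate the text of every page across all OCR output files."""
    pages: List[str] = []
    # Sort numerically so that page 10 does not come before page 3.
    for name in sorted(outputs, key=first_page_number):
        response = json.loads(outputs[name])
        # One output file can hold several page responses.
        for page_response in response.get("responses", []):
            annotation = page_response.get("fullTextAnnotation", {})
            pages.append(annotation.get("text", ""))
    return "".join(pages)


# Made-up output files, keyed by object name.
sample_outputs = {
    "ocr/output-3-to-3.json": json.dumps(
        {"responses": [{"fullTextAnnotation": {"text": "page 3\n"}}]}
    ),
    "ocr/output-1-to-2.json": json.dumps(
        {
            "responses": [
                {"fullTextAnnotation": {"text": "page 1\n"}},
                {},  # assumed shape of a page with no detected text
            ]
        }
    ),
}
text = join_ocr_text(sample_outputs)
print(text)
```

The numeric sort key matters because a plain lexicographic listing would place `output-11-to-12.json` before `output-3-to-4.json`.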

In [9]:
def document_extract(
    bucket: str,
    name: str,
    output_bucket: str,
    project_id: str,
    timeout: int = 420,
) -> str:
    """Perform OCR with PDF/TIFF as source files on GCS.

    Original sample is here:
    https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/vision/snippets/detect/detect.py#L806

    Note: This function can cause the IOPub data rate to be exceeded on a
    Jupyter server. This rate can be changed by setting the variable
    `--ServerApp.iopub_data_rate_limit`.

    Args:
        bucket (str): name of the bucket containing the PDF/TIFF files.
        name (str): name of the PDF/TIFF file.
        output_bucket (str): name of the bucket to store the OCR output in.
        project_id (str): the Google Cloud project ID.
        timeout (int): Timeout in seconds for the request.

    Returns:
        str: the complete text
    """

    gcs_source_uri = f"gs://{bucket}/{name}"
    prefix = "ocr"
    gcs_destination_uri = f"gs://{output_bucket}/{prefix}/"
    mime_type = "application/pdf"
    batch_size = 2

    # Perform Vision OCR
    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size
    )

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config, output_config=output_config
    )

    operation = client.async_batch_annotate_files(requests=[async_request])

    print("OCR: waiting for the operation to finish.")
    operation.result(timeout=timeout)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    print("OCR: complete")
    return get_ocr_output_from_bucket(gcs_destination_uri, output_bucket, project_id)


def get_ocr_output_from_bucket(gcs_destination_uri: str,
                               bucket_name: str,
                               project_id: str) -> str:
    """Iterates over blobs in output bucket to get full OCR result.

    Arguments:
        gcs_destination_uri: the URI where the OCR output was saved.
        bucket_name: the name of the bucket where the output was saved.
        project_id: the Google Cloud project ID.

    Returns:
        The full text of the document
    """
    print("Storage: fetching complete text")
    storage_client = storage.Client(project=project_id)

    match = re.match(r"gs://([^/]+)/(.+)", gcs_destination_uri)
    prefix = match.group(2)
    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix, filtering out folders.
    blob_list = [
        blob
        for blob in list(bucket.list_blobs(prefix=prefix))
        if not blob.name.endswith("/")
    ]

    # Concatenate all text from the blobs
    complete_text = ""
    for output in blob_list:
        json_string = output.download_as_bytes().decode("utf-8")
        response = json.loads(json_string)

        # Each output file holds up to `batch_size` page responses, so read
        # every response rather than only the first.
        for page_response in response["responses"]:
            # A page with no detected text may have no fullTextAnnotation.
            annotation = page_response.get("fullTextAnnotation", {})
            complete_text = complete_text + annotation.get("text", "")

    return complete_text

In [10]:
pdf_name = f"{PREFIX}/aristotle-on-happiness.pdf"
output_bucket = f"{PROJECT_ID}_output"

complete_text = document_extract(bucket=BUCKET,
                                 name=pdf_name,
                                 output_bucket=output_bucket,
                                 project_id=PROJECT_ID)

# Entire text is long; print just first 1000 characters
print(complete_text[:1000])

OCR: waiting for the operation to finish.
OCR: complete
Storage: fetching complete text
Aristotle on Happiness
A Little Background
Aristotle is one of the
greatest thinkers in the
history of western science
and philosophy, making
contributions to logic,
metaphysics, mathematics,
physics, biology, botany,
ethics, politics, agriculture,
medicine, dance and theatre.
He was a student of Plato
who in turn studied under
Socrates. Although we do not
actually possess any of
Aristotle's own writings
intended for publication, we
have volumes of the lecture
notes he delivered for his
students; through these
Aristotle was to exercise his profound influence through the ages. Indeed, the medieval outlook
is sometimes considered to be the "Aristotelian worldview" and St. Thomas Aquinas simply
refers to Aristotle as "The Philosopher" as though there were no other.
Aristotle was the first to classify areas of human knowledge into distinct disciplines such as
mathematics, biology, and ethics. Some of th

### Extract questions & answers with the Vertex AI PaLM API

Next, you send text from the PDF to an LLM to extract questions and answers. Vertex AI lets you use many different LLMs. In this case, you use the text generation model `text-bison@001`. You send a prediction request to Vertex AI, providing the name of the LLM you want to use, and the Vertex AI service sends the model's response back to you. In the following cells, the Python SDK for Vertex AI provides all of the helper methods and classes you need to perform this process. Note that the example sends only the first 1,500 characters of the text.
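
Once the response comes back, it still has to be split into question and answer pairs. The sketch below shows one way to do that, assuming the response alternates question lines and answer lines, with optional blank lines between pairs. The `pair_questions_and_answers` helper is hypothetical, and unlike the cells that follow it also strips the leading numbering.

```python
import re
from typing import List, Tuple


def pair_questions_and_answers(response_text: str) -> List[Tuple[str, str]]:
    """Pair each question line with the answer line that follows it."""
    # Blank lines carry no content, so drop them before pairing.
    lines = [line.strip() for line in response_text.splitlines() if line.strip()]
    pairs = []
    # zip() ignores a trailing question that has no answer line.
    for question, answer in zip(lines[0::2], lines[1::2]):
        question = re.sub(r"^\d+\.\s*", "", question)  # drop "1. " style numbering
        pairs.append((question, answer))
    return pairs


sample_response = """1. Who was Aristotle's teacher?
Aristotle's teacher was Plato.

2. What did Aristotle devise?
Aristotle devised a formal system for reasoning.
"""
pairs = pair_questions_and_answers(sample_response)
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}")
```

A model is free to format its output differently, so real responses may need a more tolerant parser than this.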

In [27]:
def extract_questions(
        *,
        project_id: str,
        model_name: str,
        text: str,
        temperature: float = 0.2,
        max_decode_steps: int = 1024,
        top_p: float = 0.8,
        top_k: int = 40,
        location: str = "us-central1",
) -> List[str]:
    """Extract questions & answers using a large language model (LLM)

    Args:
        project_id (str): the Google Cloud project ID
        model_name (str): the name of the LLM model to use
        text (str): the text to extract questions and answers from
        temperature (float): controls the randomness of predictions
        max_decode_steps (int): the maximum number of tokens to generate
        top_p (float): cumulative probability cutoff for top-p (nucleus) sampling
        top_k (int): number of highest probability vocabulary tokens to keep for top-k-filtering
        location (str): the Google Cloud region to run in

    Returns:
        The lines of the model's response, which contain the questions and answers
    """
    vertexai.init(
        project=project_id,
        location=location,
    )

    model = TextGenerationModel.from_pretrained(model_name)

    prompt = f"""
    Extract at least 10 questions with answers based on the following article: {text}
    
    Questions: Answers:
    """
    response = model.predict(
        prompt,
        temperature=temperature,
        max_output_tokens=max_decode_steps,
        top_k=top_k,
        top_p=top_p,
    )
    question_list = response.text.splitlines()

    return question_list

In [28]:
model_name = "text-bison@001"
temperature = 0.2
max_decode_steps = 1024
top_p = 0.8
top_k = 40

qas = extract_questions(
    project_id=PROJECT_ID,
    model_name=model_name,
    text=complete_text[:1500],
    temperature=temperature,
    max_decode_steps=max_decode_steps,
    top_p=top_p,
    top_k=top_k)

In [35]:
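# Drop blank lines, then pair each question line with the answer line after it.
# zip() stops at the shorter list, so a trailing question without an answer
# is skipped instead of raising an IndexError.
lines = [line for line in qas if line.strip() != ""]

qa_pairs = []
for question, answer in zip(lines[0::2], lines[1::2]):
    print(f"Question: {question}")
    print(f"Answer: {answer}")

    qa_pairs.append((question, answer))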

Question: 1. What did Aristotle study?
Answer: Aristotle studied logic, metaphysics, mathematics, physics, biology, botany, ethics, politics, agriculture, medicine, dance and theatre.
Question: 2. Who was Aristotle's teacher?
Answer: Aristotle's teacher was Plato.
Question: 3. What did Aristotle write?
Answer: Aristotle wrote volumes of lecture notes that he delivered for his students.
Question: 4. What did Aristotle classify?
Answer: Aristotle classified areas of human knowledge into distinct disciplines such as mathematics, biology, and ethics.
Question: 5. What did Aristotle devise?
Answer: Aristotle devised a formal system for reasoning.
Question: 6. What is an example of a syllogism?
Answer: All men are mortal; Socrates is a man; therefore, Socrates is mortal.
Question: 7. What area of thought did Aristotle's brand of logic dominate?
Answer: Aristotle's brand of logic dominated the area of thought until the rise of modern symbolic logic.
Question: 8. What is the medieval outlook s

### Store question & answer pairs in Firestore

The following cells save all of the question and answer pairs as documents in Firestore.
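
Before writing to Firestore, it can help to see the document shape on its own. The sketch below builds the ID and data for one Q&A pair without touching Firestore. The `build_question_document` helper and the sample values are hypothetical; the field names mirror the ones used in the next cell, and the SHA-256 ID is one possible choice for a stable ID, not necessarily the one the solution uses.

```python
import hashlib
from typing import Any, Dict, Tuple


def build_question_document(
    question: str, answer: str, gcs_uri: str, time_uploaded: str
) -> Tuple[str, Dict[str, Any]]:
    """Return a (document ID, document data) pair for one Q&A."""
    # hash() is salted per Python process for strings, so a SHA-256 digest
    # is used here to get the same ID for the same question on every run.
    doc_id = hashlib.sha256(question.encode("utf-8")).hexdigest()
    data = {
        "question": question,
        "answers": [
            {"answer": answer, "gcs_uri": gcs_uri, "time_uploaded": time_uploaded}
        ],
    }
    return doc_id, data


doc_id, data = build_question_document(
    question="Who was Aristotle's teacher?",
    answer="Aristotle's teacher was Plato.",
    gcs_uri="gs://example-bucket/example.pdf",  # hypothetical URI
    time_uploaded="2023-01-01T00:00:00",  # hypothetical timestamp
)
print(doc_id[:12], data["answers"][0]["answer"])
```

Because the ID depends only on the question text, uploading the same question again targets the same document, which is what makes an upsert possible.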

In [42]:
def write_qas_to_collection(
    project_id: str,
    collection_name: str,
    question_answer_pairs: List[Tuple[str, str]],
    input_file_gcs_uri: str,
    time_created: str,
):
    """Writes question and answer pairs to the specified Firestore collection.

    Arguments:
      project_id: the project that contains this database
      collection_name: the collection to store the Q&A pairs in
      question_answer_pairs: the Q&A pairs to add
      input_file_gcs_uri: the Cloud Storage URI for the source PDF
      time_created: ISO 8601 timestamp of when this PDF was uploaded
    """
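    import hashlib

    db = firestore.Client(project=project_id)
    bulkwriter = db.bulk_writer()

    for qa in question_answer_pairs:

        # Create a stable ID for each question. The built-in hash() is salted
        # per process for strings, so use a SHA-256 digest instead.
        question_hash = hashlib.sha256(qa[0].encode("utf-8")).hexdigest()

        doc_ref = db.document(collection_name, question_hash)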
        doc_snap = doc_ref.get()

        document_data = {
            "question": qa[0],
            "answers": [{
                "answer": qa[1],
                "gcs_uri": input_file_gcs_uri,
                "time_uploaded": time_created,
            }]
        }

        if doc_snap.exists:
            bulkwriter.update(doc_ref, document_data)
            continue

        bulkwriter.create(doc_ref, document_data)

    # Send all updates and close the BulkWriter
    bulkwriter.close()

In [43]:
write_qas_to_collection(
    project_id=PROJECT_ID,
    collection_name=COLLECTION,
    question_answer_pairs=qa_pairs,
    input_file_gcs_uri=f"gs://{BUCKET}/{pdf_name}",
    time_created=datetime.now().isoformat()
)

### Tune a customized LLM

The following cells fine-tune an LLM using the question and answer pairs stored in the Firestore collection.
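
The tuning cells below map each stored document to an `input_text` and `output_text` record. As a minimal sketch of that mapping, the code here converts Firestore-style dictionaries into records and serializes them as JSON Lines. The `to_tuning_records` and `to_jsonl` helpers and the sample documents are hypothetical; only the field names come from the cells in this notebook.

```python
import json
from typing import Any, Dict, List


def to_tuning_records(docs: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Map stored Q&A documents to input_text/output_text records."""
    records = []
    for doc in docs:
        answers = doc.get("answers") or []
        # Skip documents that are missing a question or have no answers.
        if not doc.get("question") or not answers:
            continue
        records.append(
            {"input_text": doc["question"], "output_text": answers[0]["answer"]}
        )
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Made-up documents in the shape written by write_qas_to_collection.
sample_docs = [
    {
        "question": "Who was Aristotle's teacher?",
        "answers": [{"answer": "Aristotle's teacher was Plato."}],
    },
    {"question": "A question with no answers yet?", "answers": []},
]
records = to_tuning_records(sample_docs)
print(to_jsonl(records))
```

Only the first answer of each document is used, which matches the list comprehension in the `tuning` function.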

In [54]:
def get_qas_from_collection(
    *,
    project_id: str,
    collection_name: str,
    bucket_name: str,
) -> str:
    """Exports all Q&A documents to a JSON file in Cloud Storage.

    Arguments:
      project_id: the project that contains this database
      collection_name: the collection to get the Q&A pairs from
      bucket_name: the bucket to write the JSON file to

    Returns:
        Cloud Storage URI of a JSON document with all QA pairs
    """
    import json
    import os
    from google.cloud import firestore
    from google.cloud import storage

    db = firestore.Client(project=project_id)
    collection_ref = db.collection(collection_name)
    docs_iter = collection_ref.stream()

    all_qas = []

    for doc in docs_iter:
        qa = doc.to_dict()
        all_qas.append(qa)

    gcs_qa_dir = f"gs://{bucket_name}/extractive-qa"
    gcs_qa_file = f"{gcs_qa_dir}/qas.json"
    
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob("extractive-qa/qas.json")

    blob.upload_from_string(json.dumps(all_qas))

    return gcs_qa_file

In [71]:
def tuning(
        *,
        project_id: str,
        location: str = "us-central1",
        gcs_qa_file: str = "",
        tuned_model_name: str = "",
) -> "_LanguageModelTuningJob":
    """Tune a model with Q&A data exported from Firestore to Cloud Storage.

    Args:
        project_id: Google Cloud project ID, used to initialize Vertex AI
        location: Google Cloud region, used to initialize Vertex AI
        gcs_qa_file: Cloud Storage URI (gs://...) of a JSON file containing questions & answers
        tuned_model_name: name of a previously tuned model; if empty, the base model is tuned

    Returns:
        The tuning job
    """
    import json
    import pandas as pd
    
    from google.cloud import storage
    
    import vertexai
    from vertexai.preview.language_models import TextGenerationModel
    
    vertexai.init(
        project=project_id,
        location=location,
    )

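    if tuned_model_name == "":
        model = TextGenerationModel.from_pretrained("google/text-bison@001")
    else:
        # Without this branch, `model` is undefined when a name is provided
        model = TextGenerationModel.get_tuned_model(tuned_model_name)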

    storage_client = storage.Client(project=project_id)
    uri_paths = gcs_qa_file.split("/")
    bucket = storage_client.bucket(uri_paths[2])
    blob_path = "/".join(uri_paths[3:])
    blob = bucket.blob(blob_path)
    # download_as_string() is deprecated; download_as_text() returns a str
    qas_json = blob.download_as_text()

    qas = json.loads(qas_json)
    jsonl_dataset = [{"input_text": qa["question"],
                      "output_text": qa["answers"][0]["answer"]} for qa in qas]
    print(jsonl_dataset)
    job = model.tune_model(
        training_data=pd.DataFrame(data=jsonl_dataset),
        # Optional:
        train_steps=10,
        tuning_job_location="europe-west4",
        tuned_model_location=location,
    )
    return job

In [None]:
gcs_qa_file = get_qas_from_collection(
    project_id=PROJECT_ID,
    collection_name=COLLECTION,
    bucket_name=BUCKET
)
tuning_job = tuning(
    project_id=PROJECT_ID,
    gcs_qa_file=gcs_qa_file
)

[{'input_question': '3. What did Aristotle write?', 'output_text': 'Aristotle wrote volumes of lecture notes that he delivered for his students.'}, {'input_question': '6. What is an example of a syllogism?', 'output_text': 'All men are mortal; Socrates is a man; therefore, Socrates is mortal.'}, {'input_question': '10. What is the species-genus system?', 'output_text': 'The species-genus system is a classification system that Aristotle devised.'}, {'input_question': '5. What did Aristotle devise?', 'output_text': 'Aristotle devised a formal system for reasoning.'}, {'input_question': "7. What area of thought did Aristotle's brand of logic dominate?", 'output_text': "Aristotle's brand of logic dominated the area of thought until the rise of modern symbolic logic."}, {'input_question': '8. What is the medieval outlook sometimes considered to be?', 'output_text': 'The medieval outlook is sometimes considered to be the "Aristotelian worldview."'}, {'input_question': '1. What did Aristotle 

Finally, you can send a question prompt to the tuned LLM to see its answer.
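
One way to do that is sketched below. The `ask` helper is hypothetical and accepts any object with a `predict` method shaped like the one called in `extract_questions`. The `FakeModel` class only stands in for a tuned model so that the example runs without calling Vertex AI.

```python
from dataclasses import dataclass
from typing import Any


def ask(
    model: Any,
    question: str,
    *,
    temperature: float = 0.2,
    max_output_tokens: int = 256,
) -> str:
    """Send a question to a text generation model and return its answer."""
    response = model.predict(
        question,
        temperature=temperature,
        max_output_tokens=max_output_tokens,
    )
    return response.text.strip()


@dataclass
class FakeResponse:
    """Stand-in for a model response; only the text attribute is used."""

    text: str


class FakeModel:
    """Stand-in for a tuned model; echoes the prompt instead of answering."""

    def predict(self, prompt: str, **parameters: Any) -> FakeResponse:
        return FakeResponse(text=f" echo: {prompt} ")


answer = ask(FakeModel(), "Who was Aristotle's teacher?")
print(answer)
```

With a real tuned model in place of `FakeModel`, the same call would return the model's generated answer rather than an echo of the prompt.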