[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/core-solution-services/blob/main/components/llm_service/notebooks/AgentBuilder_Search_on_Website.ipynb)

## Set up environment variables
Set `PROJECT_ID` to the ID of your Google Cloud project.

In [None]:
import sys
import os
PROJECT_ID = "your-project"
os.environ["PROJECT_ID"] = PROJECT_ID

## Clone GENIE repository

In [None]:
!git clone https://github.com/GoogleCloudPlatform/core-solution-services
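# Use %cd rather than !cd: !cd runs in a throwaway subshell, so the directory change would not persist for later cells.
%cd core-solution-services/components/llm_service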

## Import GENIE code

In [None]:
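# These paths are relative to components/llm_service, so the working directory must be set to it first.
sys.path.append("../common/src")
sys.path.append("src")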
from common.models import QueryEngine
from services.query.web_datasource import WebDataSource

## Set web source URL and create GENIE query engine
Set `depth_limit` to the depth to which you want the crawler to follow links in the site. `depth_limit = 0` crawls only the page at the URL, `depth_limit = 1` also crawls every page linked from that page, `depth_limit = 2` also crawls every page linked from those pages, and so on.

In [None]:
data_url = "https://dmv.nv.gov/"
depth_limit = 1
q_engine = QueryEngine(name="test web download", doc_url=data_url, params={"depth_limit":depth_limit})
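
The `depth_limit` semantics described above can be illustrated with a small breadth-first traversal over an in-memory link graph. This is only a sketch of the idea, not the `WebDataSource` implementation: the `crawl` function and the `site` dictionary are hypothetical, and the real crawler may order, filter, or deduplicate pages differently.

```python
from collections import deque


def crawl(start_url, links, depth_limit):
    """Return the pages reached from start_url by following at most depth_limit links."""
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        # Pages already at the depth limit are kept, but their links are not followed.
        if depth >= depth_limit:
            continue
        for linked in links.get(url, []):
            if linked not in seen:
                seen.add(linked)
                order.append(linked)
                queue.append((linked, depth + 1))
    return order


# Hypothetical site: each key is a page, each value lists the pages it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/"],
    "/a1": ["/a2"],
}

for limit in range(3):
    print(limit, crawl("/", site, limit))
```

With this toy graph, a limit of 0 returns only the start page, a limit of 1 adds `/a` and `/b`, and a limit of 2 additionally reaches `/a1`. The number of pages can grow quickly with each extra level, so it is usually worth starting with a small value.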

## Download files to bucket

Files will be downloaded to `gs://$PROJECT_ID-downloads-{query engine name}`

HTML files with the extension `.htm` will be renamed to `.html` in the bucket.  This is for convenience when building Vertex Search data sources.

In [None]:
import tempfile
from google.cloud import storage

storage_client = storage.Client(project=PROJECT_ID)
bucket_name = WebDataSource.downloads_bucket_name(q_engine)
web_datasource = WebDataSource(storage_client, bucket_name=bucket_name, depth_limit=depth_limit)
with tempfile.TemporaryDirectory() as temp_dir:
    data_source_files = web_datasource.download_documents(data_url, temp_dir)

print(f"Downloaded {len(data_source_files)} files to gs://{bucket_name}")
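
The `.htm` to `.html` renaming mentioned above can be pictured with a small helper. This is a hypothetical sketch of the naming rule only; `normalize_html_name` is not part of GENIE, and the actual renaming happens inside `WebDataSource`, whose details (for example, how it treats upper-case extensions) may differ.

```python
from pathlib import PurePosixPath


def normalize_html_name(name):
    """Return name with a trailing .htm extension changed to .html; other names are unchanged."""
    path = PurePosixPath(name)
    # Assumption: the extension comparison is case-insensitive.
    if path.suffix.lower() == ".htm":
        return str(path.with_suffix(".html"))
    return name


for example in ["index.htm", "forms/renewal.htm", "about.html", "fees.pdf"]:
    print(example, "->", normalize_html_name(example))
```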

## Build Vertex Agent Builder app and data source

You can also do this in the Google Cloud console. Before running the cell below, make sure Vertex Agent Builder is enabled in your project:

https://console.cloud.google.com/gen-app-builder?project=$PROJECT_ID

In [None]:
from services.query.vertex_search import build_vertex_search
q_engine.save()
build_vertex_search(q_engine)