Refactor Rag notebook #504

richardsliu · 2024-03-29T00:50:56Z

Refactored the RAG notebook to be more modular and documented.

Moved the CloudSQL code into the notebook instead of running it on a Ray cluster
Execute remote Ray tasks instead of submitting a job and busy-waiting on its status
Modified the CloudSQL code to use bulk inserts
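
The second and third changes can be illustrated with a small standard-library sketch. This is not the notebook's code: `ThreadPoolExecutor` stands in for remote Ray tasks, an in-memory `sqlite3` database stands in for CloudSQL, and `embed`, `run_pipeline`, and the `chunks` table are hypothetical names chosen for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def embed(text: str) -> tuple[str, int]:
    """Hypothetical stand-in for an embedding task; returns (text, length)."""
    return text, len(text)


def run_pipeline(texts: list[str]) -> list[tuple[str, int]]:
    # Submit all tasks, then block on their results directly.
    # There is no loop that polls a job status; the rough Ray analogue
    # would be calling ray.get() on the refs returned by remote tasks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(embed, texts))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE chunks (text TEXT, length INTEGER)")
    # Bulk insert: a single executemany call instead of one INSERT per row.
    conn.executemany("INSERT INTO chunks (text, length) VALUES (?, ?)", rows)
    conn.commit()
    stored = conn.execute(
        "SELECT text, length FROM chunks ORDER BY rowid"
    ).fetchall()
    conn.close()
    return stored
```

Calling `run_pipeline(["a", "bcd"])` stores two rows and reads them back in insertion order. The point of the sketch is only the shape of the flow: wait on task results instead of polling, then write all rows in one batch.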

Fixes #425

Preview: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/rag-notebook/applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb

modules/jupyter/jupyter_config/config-selfauth-autopilot.yaml

andrewsykim · 2024-03-29T01:45:41Z

applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb

+   "outputs": [],
+   "source": [
+    "!pip install ray[default]==2.9.3 kaggle==1.6.6\n",
+    "!pip install langchain==0.1.10 ray==2.9.3 datasets sentence-transformers\n",


I think there were some issues in the past with some of these packages taking way too long to install. Specifically sentence-transformers. Are we going to bake the dependencies into the jupyter image?

+1

I think we don't need to bake all these. langchain and sentence-transformers aren't used except by the Ray job.

ray, kaggle are pretty quick. unsure about datasets and the cloud SQL ones, i imagine they're not too bad but pls verify. If it's > 30s, I vote we make a custom jupyter image.
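
One rough way to check the 30s threshold is to time the install directly. The sketch below is an assumption about how that could be measured, not something from this PR; `time_command`, `time_pip_install`, and `needs_custom_image` are hypothetical helper names.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


def time_pip_install(packages: list[str]) -> float:
    """Time a pip install of the given packages in the current interpreter."""
    return time_command([sys.executable, "-m", "pip", "install", *packages])


def needs_custom_image(elapsed_seconds: float, threshold: float = 30.0) -> bool:
    # The 30-second default mirrors the threshold suggested in the comment above.
    return elapsed_seconds > threshold
```

For example, `needs_custom_image(time_pip_install(["datasets"]))` would report whether that one package crosses the threshold. The number is only meaningful in a fresh environment, since an already-installed package returns almost immediately.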

I've tested and it turns out the notebook does need to install langchain and sentence-transformers. I can see if we could use a custom image here to skip these pip installs.

Changed to a custom image with dependencies baked in. Removed this section from the notebook.

imreddy13 · 2024-03-29T02:07:13Z

applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb

+    "ray.init(\n",
+    "    address=\"ray://ray-cluster-kuberay-head-svc:10001\",\n",
+    "    runtime_env={\n",
+    "        \"pip\": [               \n",


I think you can delete all these. The Ray image comes with all of these preinstalled. We need to bump langchain though; I think @chiayi is tracking that.

I think Ray will skip pip installing these if the image is already on the node. When I tested this, this step usually finishes in a few seconds.

imreddy13 · 2024-03-29T02:08:45Z

applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb

+   "outputs": [],
+   "source": [
+    "!pip install ray[default]==2.9.3 kaggle==1.6.6\n",
+    "!pip install langchain==0.1.10 ray==2.9.3 datasets sentence-transformers\n",


+1

I think we don't need to bake all these. langchain and sentence-transformers aren't used except by the Ray job.

ray, kaggle are pretty quick. unsure about datasets and the cloud SQL ones, i imagine they're not too bad but pls verify. If it's > 30s, I vote we make a custom jupyter image.

modules/jupyter/jupyter_config/config-selfauth-autopilot.yaml

andrewsykim

Overall LGTM, this is a great improvement!

I think the following issues need to be addressed if we're going to cherry-pick this into release-1.1:

Dependency install time and whether a custom image is warranted.
Leaking CloudSQL env vars and secret mounts into all Jupyter installations -- these should only be set when using Jupyter for RAG (likewise for Ray ref: CLOUDSQL_INSTANCE_CONNECTION_NAME env in RayCluster should only be configured for RAG deployments #435).
If we choose to keep both versions of the notebook, please add the markdown descriptions in the other notebook as well

applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb

modules/jupyter/jupyter_config/config-selfauth-autopilot.yaml

richardsliu · 2024-03-29T21:07:54Z

Dependency install time and whether a custom image is warranted.

The notebook needs to install sentence-transformers and langchain. I'll see if we can use a custom image to reduce the install time.

Leaking CloudSQL env vars and secret mounts into all Jupyter installations -- these should only be set when using Jupyter for RAG (likewise for Ray ref: CLOUDSQL_INSTANCE_CONNECTION_NAME env in RayCluster should only be configured for RAG deployments #435).

This is now fixed.

If we choose to keep both versions of the notebook, please add the markdown descriptions in the other notebook as well

I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell.

applications/rag/example_notebooks/rag-kaggle-ray-sql-interactive.ipynb

modules/jupyter/jupyter_config/config-selfauth-autopilot.yaml

modules/jupyter/main.tf

modules/jupyter/variables.tf

andrewsykim · 2024-04-01T21:08:05Z

I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell.

Should we simplify and remove the old notebook or keep both? With the changes in this PR, do both notebooks work?

richardsliu · 2024-04-01T22:03:56Z

I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell.

Should we simplify and remove the old notebook or keep both? With the changes in this PR, do both notebooks work?

I fixed some minor issues in the old notebook and added a header. We should probably keep it as a backup.

richardsliu · 2024-04-01T22:24:36Z

/gcbrun

imreddy13

Thank you!

imreddy13 · 2024-04-01T23:49:48Z

/gcbrun

* move mysql stuff to jupyter
* new notebook
* fix notebook
* fix notebook, add markdown
* use bulk insert
* revert
* change persist data
* terraform fmt
* remove sql params from notebook
* default empty values
* rename
* parameterize notebook image
* remove pip installs from notebook
* use custom notebook image
* terraform fmt
* replace jupyter notebook tag
* add notebook version to jupyterhub app
* merge cells
* add dummy value for secret volume
* fix old notebook

richardsliu and others added 10 commits March 28, 2024 03:21

move mysql stuff to jupyter

73aeecb

new notebook

e5c0e2c

fix notebook

ef7d9e5

fix notebook, add markdown

de514b2

use bulk insert

14da177

revert

de1bbd0

Merge branch 'main' into rag-notebook

cf188a8

change persist data

10ce80c

fix volume path

5619a1c

terraform fmt

2e1445d

andrewsykim reviewed Mar 29, 2024

imreddy13 reviewed Mar 29, 2024

andrewsykim reviewed Mar 29, 2024

richardsliu added 3 commits March 29, 2024 20:40

remove sql params from notebook

c75e246

default empty values

708fe71

rename

06403bd

richardsliu and others added 6 commits March 29, 2024 21:45

parameterize notebook image

d129434

remove pip installs from notebook

e733e94

use custom notebook image

7365699

Merge branch 'rag-notebook' of https://github.com/GoogleCloudPlatform…

4d0b6e0

…/ai-on-gke into rag-notebook

terraform fmt

8961bb3

replace jupyter notebook tag

6fda174

imreddy13 reviewed Apr 1, 2024

richardsliu added 2 commits April 1, 2024 19:07

add notebook version to jupyterhub app

4cd1a77

merge cells

7ed030b

andrewsykim mentioned this pull request Apr 1, 2024

Revert "Improve logic for polling Ray job status in jupyter notebook" #517

Closed

add dummy value for secret volume

929bbd9

fix old notebook

3bd07cd

imreddy13 approved these changes Apr 1, 2024

richardsliu merged commit 75331ab into main Apr 2, 2024
7 of 8 checks passed
