-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Rag notebook #504
Conversation
"outputs": [], | ||
"source": [ | ||
"!pip install ray[default]==2.9.3 kaggle==1.6.6\n", | ||
"!pip install langchain==0.1.10 ray==2.9.3 datasets sentence-transformers\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there were some issues in the past with some of these packages taking way too long to install. Specifically sentence-transformers. Are we going to bake the dependencies into the jupyter image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I think we don't need to bake all these. langchain and sentence-transformers aren't used except by the Ray job.
ray, kaggle are pretty quick. unsure about datasets and the cloud SQL ones, i imagine they're not too bad but pls verify. If it's > 30s, I vote we make a custom jupyter image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested and it turns out the notebook does need to install langchain and sentence-transformers. I can see if we could use a custom image here to skip these pip installs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to a custom image with dependencies baked in. Removed this section from the notebook.
"ray.init(\n", | ||
" address=\"ray://ray-cluster-kuberay-head-svc:10001\",\n", | ||
" runtime_env={\n", | ||
" \"pip\": [ \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can delete all these. The ray image comes with all these pre installed. We need to bump langchain though, I think @chiayi is tracking that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Ray will skip pip installing these if the image is already on the node. When I tested this, this step usually finishes in a few seconds.
"outputs": [], | ||
"source": [ | ||
"!pip install ray[default]==2.9.3 kaggle==1.6.6\n", | ||
"!pip install langchain==0.1.10 ray==2.9.3 datasets sentence-transformers\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I think we don't need to bake all these. langchain and sentence-transformers aren't used except by the Ray job.
ray, kaggle are pretty quick. unsure about datasets and the cloud SQL ones, i imagine they're not too bad but pls verify. If it's > 30s, I vote we make a custom jupyter image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, this is a great improvement!
I think the following issues need to be addressed if we're going to cherry-pick this into release-1.1:
- Dependency install time and whether a custom image is warranted.
- Leaking CloudSQL env vars and secret mounts into all Jupyter installments -- these should only be set when using Jupyter for RAG (likewise for Ray ref: CLOUDSQL_INSTANCE_CONNECTION_NAME env in RayCluster should only be configured for RAG deployments #435).
- If we choose to keep both versions of the notebook, please add the markdown descriptions in the other notebook as well
applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb
Outdated
Show resolved
Hide resolved
The notebook needs to install
This is now fixed.
I prefer to keep this PR limited to changes relevant for the new notebook. Adding markdown to the older notebook wouldn't be too interesting since all the code is executed in one cell. |
applications/rag/example_notebooks/rag-kaggle-ray-sql-interactive.ipynb
Outdated
Show resolved
Hide resolved
Should we simplify and remove the old notebook or keep both? With the changes in this PR, do both notebooks work? |
I fixed some minor issues in the old notebook and added a header. We should probably keep it as a backup. |
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
/gcbrun |
* move mysql stuff to jupyter * new notebook * fix notebook * fix notebook, add markdown * use bulk insert * revert * change persist data * terraform fmt * remove sql params from notebook * default empty values * rename * parameterize notebook image * remove pip installs from notebook * use custom notebook image * terraform fmt * replace jupyter notebook tag * add notebook version to jupyterhub app * merge cells * add dummy value for secret volume * fix old notebook
* move mysql stuff to jupyter * new notebook * fix notebook * fix notebook, add markdown * use bulk insert * revert * change persist data * terraform fmt * remove sql params from notebook * default empty values * rename * parameterize notebook image * remove pip installs from notebook * use custom notebook image * terraform fmt * replace jupyter notebook tag * add notebook version to jupyterhub app * merge cells * add dummy value for secret volume * fix old notebook
* move mysql stuff to jupyter * new notebook * fix notebook * fix notebook, add markdown * use bulk insert * revert * change persist data * terraform fmt * remove sql params from notebook * default empty values * rename * parameterize notebook image * remove pip installs from notebook * use custom notebook image * terraform fmt * replace jupyter notebook tag * add notebook version to jupyterhub app * merge cells * add dummy value for secret volume * fix old notebook
Refactored the RAG notebook to be more modular and documented.
Fixes #425
Preview: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/rag-notebook/applications/rag/example_notebooks/rag-kaggle-ray-sql-refactored.ipynb