
Refresh of notebook - MLOps stage 1: data management: get started with Dataflow #350

Conversation

@manuelamunategui (Contributor)

This tutorial demonstrates how to use Vertex AI for end-to-end (E2E) MLOps on Google Cloud in production. It covers stage 1, data management: get started with Dataflow.

  • Use the notebook template as a starting point.
  • Follow the style and grammar rules outlined in the notebook template above.
  • Verify that the notebook runs successfully in Colab, since the automated tests cannot guarantee this even when they pass.
  • Pass all the required automated checks. You can test formatting and linting locally with these instructions.
  • Consult with a tech writer to determine whether a tech writer review is necessary; if so, have the notebook reviewed and approved by a tech writer.
  • Add this notebook to the CODEOWNERS file under the # Official Notebooks section, pointing to the author or the author's team.
  • Ensure the Jupyter notebook cleans up any artifacts it creates (datasets, ML models, endpoints, etc.) so as not to consume unnecessary resources.

@manuelamunategui requested a review from a team as a code owner on March 2, 2022 13:41
@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

@kweinmeister (Contributor)

/gcbrun

@manuelamunategui (Contributor, Author)

This notebook depends on another notebook (https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage1/get_started_bq_datasets.ipynb) in order to access the sample BigQuery data.

@kweinmeister (Contributor)

/gcbrun

@manuelamunategui (Contributor, Author)

Verified the fix and added minor corrections.

@kweinmeister (Contributor)

/gcbrun

@kweinmeister (Contributor)

@manuelamunategui Looks like a pip install is needed. I think you will want to use the gcp version like this: pip install apache-beam[gcp].

----> 1 import apache_beam as beam

ModuleNotFoundError: No module named 'apache_beam'
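
For reference, a minimal sketch of the suggested install cell (assuming the usual shell-escape pattern in a Colab/Jupyter cell; the --quiet flag is an assumption, not from the original):

# Install Apache Beam with the Google Cloud extras (BigQuery, GCS, and the Dataflow runner).
! pip install --quiet "apache-beam[gcp]"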

@kweinmeister (Contributor)

/gcbrun

@kweinmeister (Contributor)

We are seeing this failure in the notebook, related to a version-range conflict on google-api-core across packages. We hope to have updates to these packages in the next week or so.

FileNotFoundError: [Errno 2] No such file or directory: '/builder/home/.local/lib/python3.9/site-packages/google_api_core-2.7.1.dist-info/METADATA'
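
A hypothetical diagnostic cell for narrowing down this kind of conflict (the specific commands are an assumption; pip check only reports conflicts among distributions that are already installed):

# Show the installed google-api-core version, then list any dependency conflicts.
! pip show google-api-core
! pip check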

Review comment (Contributor, via ReviewNB):

I think we need to add an "Open in Colab" link.

Review comment (Contributor, via ReviewNB):

This needs to be reduced to only what is being used.

Review comment (Contributor, via ReviewNB):

Let's consolidate all these imports into a single cell.
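
A sketch of what that consolidated cell might look like, assuming the packages referenced elsewhere in this thread (Apache Beam, the Vertex AI SDK, BigQuery, and TensorFlow Data Validation); the exact list would be trimmed to what the notebook actually uses:

# All notebook imports gathered into a single cell, per the review comment above.
import apache_beam as beam
import google.cloud.aiplatform as aip
import tensorflow_data_validation as tfdv
from google.cloud import bigquery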

Review comment (Contributor, via ReviewNB):

I think there is an extra leading space in the title "Generate dataset statistics".

Review comment (Contributor, via ReviewNB):

There is a locally written file that should be deleted: setup.py. Also, delete the temporarily created BQ table.
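
A minimal sketch of the requested cleanup cell; the BigQuery table ID is a placeholder, and the client variable name is an assumption about how the notebook names things:

import os
from google.cloud import bigquery

# Delete the locally written setup.py file.
if os.path.exists("setup.py"):
    os.remove("setup.py")

# Delete the temporarily created BigQuery table (placeholder table ID).
bq_client = bigquery.Client()
bq_client.delete_table("PROJECT_ID.DATASET_ID.TEMP_TABLE", not_found_ok=True)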

@andrewferlitsch (Contributor) commented Mar 22, 2022

error log:

Step #3: === EXECUTION FINISHED ===
Step #3: 
Step #3: Please debug the executed notebook by downloading: gs://cloud-build-notebooks-presubmit/executed_notebooks/PR_350/BUILD_de5f95df-b34e-43ef-a9c4-8b20a12923f2/get_started_dataflow.ipynb
Step #3: 
Step #3: ======
Step #3: 
Step #3:     execute_notebook_helper.execute_notebook(
Step #3:   File "/workspace/.cloud-build/execute_notebook_helper.py", line 91, in execute_notebook
Step #3:     raise execution_exception
Step #3:   File "/workspace/.cloud-build/execute_notebook_helper.py", line 56, in execute_notebook
Step #3:     pm.execute_notebook(
Step #3:   File "/builder/home/.local/lib/python3.9/site-packages/papermill/execute.py", line 122, in execute_notebook
Step #3:     raise_for_execution_errors(nb, output_path)
Step #3:   File "/builder/home/.local/lib/python3.9/site-packages/papermill/execute.py", line 234, in raise_for_execution_errors
Step #3:     raise error
Step #3: papermill.exceptions.PapermillExecutionError: 
Step #3: ---------------------------------------------------------------------------
Step #3: Exception encountered at "In [13]":
Step #3: ---------------------------------------------------------------------------
Step #3: AttributeError                            Traceback (most recent call last)
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:3016, in DistInfoDistribution._dep_map(self)
Step #3:    3015 try:
Step #3: -> 3016     return self.__dep_map
Step #3:    3017 except AttributeError:
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:2813, in Distribution.__getattr__(self, attr)
Step #3:    2812 if attr.startswith('_'):
Step #3: -> 2813     raise AttributeError(attr)
Step #3:    2814 return getattr(self._provider, attr)
Step #3: 
Step #3: AttributeError: _DistInfoDistribution__dep_map
Step #3: 
Step #3: During handling of the above exception, another exception occurred:
Step #3: 
Step #3: AttributeError                            Traceback (most recent call last)
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:3007, in DistInfoDistribution._parsed_pkg_info(self)
Step #3:    3006 try:
Step #3: -> 3007     return self._pkg_info
Step #3:    3008 except AttributeError:
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:2813, in Distribution.__getattr__(self, attr)
Step #3:    2812 if attr.startswith('_'):
Step #3: -> 2813     raise AttributeError(attr)
Step #3:    2814 return getattr(self._provider, attr)
Step #3: 
Step #3: AttributeError: _pkg_info
Step #3: 
Step #3: During handling of the above exception, another exception occurred:
Step #3: 
Step #3: FileNotFoundError                         Traceback (most recent call last)
Step #3: Input In [13], in <cell line: 1>()
Step #3: ----> 1 import google.cloud.aiplatform as aip
Step #3: 
Step #3: File ~/.local/lib/python3.9/site-packages/google/cloud/aiplatform/__init__.py:26, in <module>
Step #3:      21 __version__ = aiplatform_version.__version__
Step #3:      24 from google.cloud.aiplatform import initializer
Step #3: ---> 26 from google.cloud.aiplatform.datasets import (
Step #3:      27     ImageDataset,
Step #3:      28     TabularDataset,
Step #3:      29     TextDataset,
Step #3:      30     TimeSeriesDataset,
Step #3:      31     VideoDataset,
Step #3:      32 )
Step #3:      33 from google.cloud.aiplatform import explain
Step #3:      34 from google.cloud.aiplatform import gapic
Step #3: 
Step #3: File ~/.local/lib/python3.9/site-packages/google/cloud/aiplatform/datasets/__init__.py:19, in <module>
Step #3:       1 # -*- coding: utf-8 -*-
Step #3:       2 
Step #3:       3 # Copyright 2020 Google LLC
Step #3:    (...)
Step #3:      15 # limitations under the License.
Step #3:      16 #
Step #3:      18 from google.cloud.aiplatform.datasets.dataset import _Dataset
Step #3: ---> 19 from google.cloud.aiplatform.datasets.column_names_dataset import _ColumnNamesDataset
Step #3:      20 from google.cloud.aiplatform.datasets.tabular_dataset import TabularDataset
Step #3:      21 from google.cloud.aiplatform.datasets.time_series_dataset import TimeSeriesDataset
Step #3: 
Step #3: File ~/.local/lib/python3.9/site-packages/google/cloud/aiplatform/datasets/column_names_dataset.py:24, in <module>
Step #3:      21 from typing import List, Optional, Set
Step #3:      22 from google.auth import credentials as auth_credentials
Step #3: ---> 24 from google.cloud import bigquery
Step #3:      25 from google.cloud import storage
Step #3:      27 from google.cloud.aiplatform import utils
Step #3: 
Step #3: File ~/.local/lib/python3.9/site-packages/google/cloud/bigquery/__init__.py:35, in <module>
Step #3:      31 from google.cloud.bigquery import version as bigquery_version
Step #3:      33 __version__ = bigquery_version.__version__
Step #3: ---> 35 from google.cloud.bigquery.client import Client
Step #3:      36 from google.cloud.bigquery.dataset import AccessEntry
Step #3:      37 from google.cloud.bigquery.dataset import Dataset
Step #3: 
Step #3: File ~/.local/lib/python3.9/site-packages/google/cloud/bigquery/client.py:61, in <module>
Step #3:      58 from google.cloud.client import ClientWithProject  # type: ignore  # pytype: disable=import-error
Step #3:      60 try:
Step #3: ---> 61     from google.cloud.bigquery_storage_v1.services.big_query_read.client import (
Step #3:      62         DEFAULT_CLIENT_INFO as DEFAULT_BQSTORAGE_CLIENT_INFO,
Step #3:      63     )
Step #3:      64 except ImportError:
Step #3:      65     DEFAULT_BQSTORAGE_CLIENT_INFO = None  # type: ignore
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/google/cloud/bigquery_storage_v1/__init__.py:21, in <module>
Step #3:      17 from __future__ import absolute_import
Step #3:      19 import pkg_resources
Step #3: ---> 21 __version__ = pkg_resources.get_distribution(
Step #3:      22     "google-cloud-bigquery-storage"
Step #3:      23 ).version  # noqa
Step #3:      25 from google.cloud.bigquery_storage_v1 import client
Step #3:      26 from google.cloud.bigquery_storage_v1 import types
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:466, in get_distribution(dist)
Step #3:     464     dist = Requirement.parse(dist)
Step #3:     465 if isinstance(dist, Requirement):
Step #3: --> 466     dist = get_provider(dist)
Step #3:     467 if not isinstance(dist, Distribution):
Step #3:     468     raise TypeError("Expected string, Requirement, or Distribution", dist)
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:342, in get_provider(moduleOrReq)
Step #3:     340 """Return an IResourceProvider for the named module or requirement"""
Step #3:     341 if isinstance(moduleOrReq, Requirement):
Step #3: --> 342     return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
Step #3:     343 try:
Step #3:     344     module = sys.modules[moduleOrReq]
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:886, in WorkingSet.require(self, *requirements)
Step #3:     877 def require(self, *requirements):
Step #3:     878     """Ensure that distributions matching `requirements` are activated
Step #3:     879 
Step #3:     880     `requirements` must be a string or a (possibly-nested) sequence
Step #3:    (...)
Step #3:     884     included, even if they were already activated in this working set.
Step #3:     885     """
Step #3: --> 886     needed = self.resolve(parse_requirements(requirements))
Step #3:     888     for dist in needed:
Step #3:     889         self.add(dist)
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:780, in WorkingSet.resolve(self, requirements, env, installer, replace_conflicting, extras)
Step #3:     777     raise VersionConflict(dist, req).with_context(dependent_req)
Step #3:     779 # push the new requirements onto the stack
Step #3: --> 780 new_requirements = dist.requires(req.extras)[::-1]
Step #3:     781 requirements.extend(new_requirements)
Step #3:     783 # Register the new requirements needed by req
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:2734, in Distribution.requires(self, extras)
Step #3:    2732 def requires(self, extras=()):
Step #3:    2733     """List of Requirements needed for this distro if `extras` are used"""
Step #3: -> 2734     dm = self._dep_map
Step #3:    2735     deps = []
Step #3:    2736     deps.extend(dm.get(None, ()))
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:3018, in DistInfoDistribution._dep_map(self)
Step #3:    3016     return self.__dep_map
Step #3:    3017 except AttributeError:
Step #3: -> 3018     self.__dep_map = self._compute_dependencies()
Step #3:    3019     return self.__dep_map
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:3027, in DistInfoDistribution._compute_dependencies(self)
Step #3:    3025 reqs = []
Step #3:    3026 # Including any condition expressions
Step #3: -> 3027 for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
Step #3:    3028     reqs.extend(parse_requirements(req))
Step #3:    3030 def reqs_for_extra(extra):
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:3009, in DistInfoDistribution._parsed_pkg_info(self)
Step #3:    3007     return self._pkg_info
Step #3:    3008 except AttributeError:
Step #3: -> 3009     metadata = self.get_metadata(self.PKG_INFO)
Step #3:    3010     self._pkg_info = email.parser.Parser().parsestr(metadata)
Step #3:    3011     return self._pkg_info
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:1407, in NullProvider.get_metadata(self, name)
Step #3:    1405     return ""
Step #3:    1406 path = self._get_metadata_path(name)
Step #3: -> 1407 value = self._get(path)
Step #3:    1408 try:
Step #3:    1409     return value.decode('utf-8')
Step #3: 
Step #3: File /usr/local/lib/python3.9/site-packages/pkg_resources/__init__.py:1611, in DefaultProvider._get(self, path)
Step #3:    1610 def _get(self, path):
Step #3: -> 1611     with open(path, 'rb') as stream:
Step #3:    1612         return stream.read()
Step #3: 
Step #3: FileNotFoundError: [Errno 2] No such file or directory: '/builder/home/.local/lib/python3.9/site-packages/google_api_core-2.7.1.dist-info/METADATA'
Step #3: 
Finished Step #3
ERROR

@andrewferlitsch (Contributor)

/gcbrun

@andrewferlitsch (Contributor)

/gcbrun

@andrewferlitsch (Contributor)

Step #3: File ~/.local/lib/python3.9/site-packages/pkg_resources/__init__.py:1622, in DefaultProvider._get(self, path)
Step #3:    1621 def _get(self, path):
Step #3: -> 1622     with open(path, 'rb') as stream:
Step #3:    1623         return stream.read()
Step #3: 
Step #3: FileNotFoundError: [Errno 2] No such file or directory: '/builder/home/.local/lib/python3.9/site-packages/google_api_core-2.7.1.dist-info/METADATA'
Step #3: 
Finished Step #3
ERROR
ERROR: build step 3 "gcr.io/cloud-devrel-public-resources/python-samples-testing-docker:latest" failed: step exited with non-zero status: 1

@manuelamunategui (Contributor, Author)

Notebook not compatible with Python 3.9 (tensorflow-data-validation and tensorflow-transform).
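
A minimal sketch of guarding the notebook against the unsupported runtime (a hypothetical check; the 3.9 cutoff reflects this comment, not a constraint published by the packages):

import sys

# tensorflow-data-validation and tensorflow-transform did not yet support Python 3.9
# when this PR was open, so fail fast on newer interpreters.
assert sys.version_info < (3, 9), "This notebook requires a pre-3.9 Python runtime."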

@andrewferlitsch (Contributor)

/gcbrun

@manuelamunategui (Contributor, Author)

Closing this PR, as the code has changed and due to the incompatibility with Python 3.9.
