
Error in passing metadata to DataprocClusterCreateOperator #16911

Closed
pateash opened this issue Jul 9, 2021 · 11 comments · Fixed by #19446
Labels: good first issue, kind:bug, provider:google

pateash (Contributor) commented Jul 9, 2021

Hi,
I am facing issues installing pip packages on a Dataproc cluster using an initialization script. I am trying to upgrade to Airflow 2.0 from 1.10.12 (where this code works fine).

[2021-07-09 11:35:37,587] {taskinstance.py:1454} ERROR - metadata was invalid: [('PIP_PACKAGES', 'pyyaml requests pandas openpyxl'), ('x-goog-api-client', 'gl-python/3.7.10 grpc/1.35.0 gax/1.26.0 gccl/airflow_v2.0.0+astro.3')

    path = f"gs://goog-dataproc-initialization-actions-{self.cfg.get('region')}/python/pip-install.sh"

    return DataprocClusterCreateOperator(
        # ...
        init_actions_uris=[path],
        metadata=[('PIP_PACKAGES', 'pyyaml requests pandas openpyxl')],
        # ...
    )

Apache Airflow version: 2.0.0

What happened:
I am trying to migrate our codebase from Airflow v1.10.12. On deeper analysis I found that, as part of the refactoring in PR #6371, we can no longer pass metadata to DataprocClusterCreateOperator(), as it is not forwarded to ClusterGenerator().

What you expected to happen:
The operator should work as before.
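For context on the traceback above: the "metadata was invalid" error comes from the gRPC client, whose call-metadata keys must be lower-case ASCII (letters, digits, '-', '_', '.'). A minimal stand-alone sketch of that rule (the helper name and regex are approximations, not gRPC's or Airflow's actual code):

```python
import re

# Approximation of gRPC's call-metadata key rule: lower-case letters,
# digits, '-', '_' and '.' only. Upper-case keys are rejected.
_VALID_KEY = re.compile(r"^[0-9a-z_.\-]+$")

def is_valid_grpc_metadata_key(key: str) -> bool:
    """Return True if `key` looks like a legal gRPC metadata key."""
    return bool(_VALID_KEY.match(key))

# The upper-case cluster-metadata key from the traceback fails the
# check, while the client library's own header key passes.
print(is_valid_grpc_metadata_key("PIP_PACKAGES"))       # False
print(is_valid_grpc_metadata_key("x-goog-api-client"))  # True
```

This is consistent with the error appearing only once an upper-case key like `PIP_PACKAGES` is mixed into the call metadata alongside the library's own `x-goog-api-client` header.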

pateash added the kind:bug label Jul 9, 2021
potiuk (Member) commented Jul 10, 2021

@pateash - would you be willing to provide a PR fixing it?

pateash (Contributor, Author) commented Jul 10, 2021

Sure, still looking into it.

pateash (Contributor, Author) commented Jul 10, 2021

Found a workaround for it though:

    path = f"gs://goog-dataproc-initialization-actions-{self.cfg.get('region')}/python/pip-install.sh"

    cluster_config = ClusterGenerator(
        project_id=self.cfg.get('project_id'),
        # ...
        init_actions_uris=[path],
        metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl'},
        properties=properties,
        dag=dag,
    ).make()

    return DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        cluster_name='my-cluster',
        project_id=self.cfg.get('project_id'),
        region=self.cfg.get('region'),
        cluster_config=cluster_config,
        dag=dag,
    )

@turbaszek @mik-laj @potiuk, I couldn't find an example of using the new metadata field (changed from Dict to Sequence[Tuple[str, str]]) with DataprocClusterCreateOperator() or DataprocCreateClusterOperator() for initialization scripts installing PIP_PACKAGES.
Any time I try to pass [('PIP_PACKAGES', 'pyyaml requests pandas openpyxl')] to the metadata field, I get "metadata was invalid".
How is this supposed to be passed? I couldn't find any guide in the docs either.

eladkal added the provider:google label Sep 4, 2021
rajaprakash91 commented

Hi all, I am also facing a similar issue.

{ValueError}metadata was invalid: [('PIP_PACKAGES', 'splunk-sdk==1.6.14 google-cloud-storage==1.31.2 pysmb google-cloud-secret-manager gcsfs'), ('x-goog-api-client', 'gl-python/3.7.7 grpc/1.41.0 gax/1.31.3 gccl/airflow_v2.1.0'), ('x-goog-api-client', 'gl-python/3.7.7 grpc/1.41.0 gax/1.31.3 gccl/airflow_v2.1.0')]

pateash (Contributor, Author) commented Nov 6, 2021

> Hi all, I am also facing a similar issue.
>
> {ValueError}metadata was invalid: [('PIP_PACKAGES', 'splunk-sdk==1.6.14 google-cloud-storage==1.31.2 pysmb google-cloud-secret-manager gcsfs'), ('x-goog-api-client', 'gl-python/3.7.7 grpc/1.41.0 gax/1.31.3 gccl/airflow_v2.1.0'), ('x-goog-api-client', 'gl-python/3.7.7 grpc/1.41.0 gax/1.31.3 gccl/airflow_v2.1.0')]

@rajaprakash91 Are you also upgrading from Airflow 1.10.x to 2.x? Which version were you using before?

pateash (Contributor, Author) commented Nov 6, 2021

@potiuk
I think we should add some more information to the documentation about generating CLUSTER_CONFIG for the newer API, where we use the newer Dataproc Python packages following the refactoring by @turbaszek.

I don't think it makes sense to add a workaround in the code, as the older metadata argument was a Dict and the newer one is of Sequence[Tuple] type.
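To make the type change concrete (a sketch; `dict_to_pairs` is a hypothetical helper, not an Airflow API): the 1.10.x operator accepted cluster metadata as a Dict, while the new `metadata` argument expects a Sequence[Tuple[str, str]] of gRPC headers. The shapes are trivially interconvertible, but the two parameters mean different things, which is why a shape conversion alone does not fix the old usage:

```python
from typing import Dict, List, Tuple

def dict_to_pairs(metadata: Dict[str, str]) -> List[Tuple[str, str]]:
    # Shape conversion only: the new operator's `metadata` argument is
    # gRPC call metadata, not Dataproc cluster metadata, so converting
    # the shape does not restore the old behaviour.
    return list(metadata.items())

pairs = dict_to_pairs({"PIP_PACKAGES": "pyyaml requests pandas openpyxl"})
print(pairs)  # [('PIP_PACKAGES', 'pyyaml requests pandas openpyxl')]
```

Cluster metadata still belongs in ClusterGenerator(metadata={...}), as in the workaround above.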

rajaprakash91 commented

@pateash Sorry, I just saw your reply. Yes, I was trying to update Airflow from 1.10.x to 2.x.

So what is the fix? I saw a merge request.

pateash (Contributor, Author) commented Nov 19, 2021

> @pateash Sorry, I just saw your reply. Yes, I was trying to update Airflow from 1.10.x to 2.x.
>
> So what is the fix? I saw a merge request.

@rajaprakash91 You can generate the ClusterConfig with ClusterGenerator.make(), using the same arguments you have been using, and then pass that to the operator:

    path = f"gs://goog-dataproc-initialization-actions-{self.cfg.get('region')}/python/pip-install.sh"

    cluster_config = ClusterGenerator(
        project_id=self.cfg.get('project_id'),
        # ...
        init_actions_uris=[path],
        metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl'},
        properties=properties,
        dag=dag,
    ).make()

    return DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        cluster_name='my-cluster',
        project_id=self.cfg.get('project_id'),
        region=self.cfg.get('region'),
        cluster_config=cluster_config,
        dag=dag,
    )

rajaprakash91 commented Nov 19, 2021 via email

nicolas-settembrini commented

Hi, sorry to write here, but I didn't find another place discussing this.

I am using Version: 2.1.4+composer and I have a DAG where I defined the DataprocClusterCreateOperator like this:

    create_dataproc = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc',
        cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
        num_workers=2,
        region='us-east4',
        zone='us-east4-a',
        subnetwork_uri='projects/example',
        internal_ip_only=True,
        tags=['allow-iap-ssh'],
        init_actions_uris=['gs://goog-dataproc-initialization-actions-us-east4/connectors/connectors.sh'],
        metadata=[('spark-bigquery-connector-url', 'gs://spark-lib/bigquery/spark-2.4-bigquery-0.23.1-preview.jar')],
        labels=dict(equipo='dm', ambiente='dev', etapa='datapreparation', producto='x', modelo='x'),
        master_machine_type='n1-standard-1',
        worker_machine_type='n1-standard-1',
        image_version='1.5-debian10',
    )

I passed the metadata as a sequence of tuples, as I read here; using a dict is not working.

Also, the metadata is not being rendered into the cluster_config.

@pateash could you please explain in more detail how to use your workaround? In what part of the DAG should I use it?

Thanks in advance

pateash (Contributor, Author) commented Dec 13, 2021

@nicolas-settembrini No problem.

You just have to generate the config from all these arguments with ClusterGenerator and then pass it to DataprocClusterCreateOperator.
You can find more details in pull request #19446; I have attached a snapshot of the documentation that will be coming in the next updates.
