title	titleSuffix	description	author	ms.author	ms.reviewer	ms.date	ms.service	ms.subservice	ms.topic
Apache Spark and Apache Hadoop	Configure Apache Spark and Apache Hadoop in Big Data Clusters	SQL Server Big Data Clusters allow Spark and HDFS solutions. Learn how to configure them.	WilliamDAssafMSFT	wiassaf	mikeray	08/04/2020	sql	big-data-cluster	conceptual

Configure Apache Spark and Apache Hadoop in Big Data Clusters

[!INCLUDEbig-data-clusters-banner-retirement]

To configure Apache Spark and Apache Hadoop in Big Data Clusters, modify the cluster profile at deployment time.

A Big Data Cluster has four configuration categories:

sql
hdfs
spark
gateway

sql, hdfs, and spark are services. Each service maps to the configuration category of the same name. All gateway configurations go to the gateway category.

For example, all configurations in service hdfs belong to category hdfs. Note that all Hadoop (core-site), HDFS, and ZooKeeper configurations belong to category hdfs; all Livy, Spark, YARN, Hive, and Metastore configurations belong to category spark.
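
The mapping from a setting name to its category can be sketched in Python as a simple prefix lookup. This is an illustration only: the table is limited to the prefixes that appear in the examples in this article (core-site, hdfs-site, zoo-cfg, gateway-site), and the names CATEGORY_BY_PREFIX and category_for are hypothetical, not part of any Big Data Clusters tool.

```python
# Prefix-to-category table, limited to prefixes that appear in this article.
# Other prefixes are deliberately left unmapped rather than guessed.
CATEGORY_BY_PREFIX = {
    "core-site": "hdfs",
    "hdfs-site": "hdfs",
    "zoo-cfg": "hdfs",
    "gateway-site": "gateway",
}


def category_for(setting_name: str) -> str:
    """Return the configuration category for a fully qualified setting name."""
    prefix = setting_name.split(".", 1)[0]
    try:
        return CATEGORY_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"no category known for prefix {prefix!r}") from None
```

With this table, category_for("zoo-cfg.syncLimit") resolves to hdfs, which matches the statement that ZooKeeper configurations belong to category hdfs.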

Supported configurations lists Apache Spark & Hadoop properties that you can configure when you deploy a SQL Server Big Data Cluster.

The following sections list properties that you can't modify in a cluster:

Unsupported spark configurations
Unsupported hdfs configurations
Unsupported gateway configurations

Configurations via cluster profile

In the cluster profile there are resources and services. At deployment time, we can specify configurations in one of two ways:

First, at the resource level:

The following examples are the patch files for the profile:

{ 
      "op": "add", 
      "path": "spec.resources.zookeeper.spec.settings", 
      "value": { 
        "hdfs": { 
          "zoo-cfg.syncLimit": "6" 
        } 
      } 
}

Or:

{ 
      "op": "add", 
      "path": "spec.resources.gateway.spec.settings", 
      "value": { 
        "gateway": { 
          "gateway-site.gateway.httpclient.socketTimeout": "95s" 
        } 
      } 
}

Second, at the service level: assign multiple resources to a service, then specify configurations for the service.

The following is an example of the patch file for the profile for setting HDFS block size:

{ 
      "op": "add", 
      "path": "spec.services.hdfs.settings", 
      "value": { 
        "hdfs-site.dfs.block.size": "268435456" 
     } 
}

The service hdfs is defined as:

{
  "spec": {
    "services": {
      "hdfs": {
        "resources": [
          "nmnode-0",
          "zookeeper",
          "storage-0",
          "sparkhead"
        ],
        "settings": {
          "hdfs-site.dfs.block.size": "268435456"
        }
      }
    }
  }
}
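
The resource-level and service-level patches above differ only in their path and in whether the value is nested under a category name. A minimal Python sketch of that difference follows. The helper names resource_patch and service_patch are hypothetical, and the sketch emits only the individual operations, because this article does not show how they are wrapped in a patch file.

```python
import json


def resource_patch(resource: str, category: str, settings: dict) -> dict:
    """Build a resource-level operation: settings are nested under the category."""
    return {
        "op": "add",
        "path": f"spec.resources.{resource}.spec.settings",
        "value": {category: dict(settings)},
    }


def service_patch(service: str, settings: dict) -> dict:
    """Build a service-level operation: settings sit directly under the service."""
    return {
        "op": "add",
        "path": f"spec.services.{service}.settings",
        "value": dict(settings),
    }


# Reproduce two of the operations shown in this article.
ops = [
    resource_patch("zookeeper", "hdfs", {"zoo-cfg.syncLimit": "6"}),
    service_patch("hdfs", {"hdfs-site.dfs.block.size": "268435456"}),
]
print(json.dumps(ops, indent=2))
```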

Note

Resource-level configurations override service-level configurations. One resource can be assigned to multiple services.
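
The override rule can be modeled as a dictionary merge in which resource-level values win. The sketch below is a model of the stated precedence only, not the cluster's actual merge logic. The function name effective_settings is hypothetical, and the resource-level block size is an illustrative value.

```python
def effective_settings(service_level: dict, resource_level: dict) -> dict:
    """Combine settings so that resource-level values override service-level ones."""
    merged = dict(service_level)   # start from the service-level settings
    merged.update(resource_level)  # resource-level entries replace matching keys
    return merged


# Service-level value from this article; the resource-level value is illustrative.
service_level = {"hdfs-site.dfs.block.size": "268435456"}
resource_level = {"hdfs-site.dfs.block.size": "134217728"}
result = effective_settings(service_level, resource_level)
```

The sketch does not model how settings from several services that share one resource are combined, because this article does not specify that order.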

Enable Spark in the Storage Pool

In addition to the supported Apache configurations, we also let you configure whether Spark jobs can run in the storage pool. This Boolean value, includeSpark, is in the bdc.json configuration file at spec.resources.storage-0.spec.settings.spark.

An example storage pool definition in bdc.json may look like this:

...
"storage-0": {
                "metadata": {
                    "kind": "Pool",
                    "name": "default"
                },
                "spec": {
                    "type": "Storage",
                    "replicas": 2,
                    "settings": {
                        "spark": {
                            "includeSpark": "true"
                        }
                    }
                }
            }
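
As a rough illustration, the setting can be toggled on an in-memory copy of the profile with a few lines of Python. The function name set_include_spark is hypothetical. The sketch writes the value as a string because the example above uses "true"; writing "false" as a string is an assumption made by symmetry.

```python
import copy


def set_include_spark(profile: dict, enabled: bool, pool: str = "storage-0") -> dict:
    """Return a copy of the profile with includeSpark set for the given pool."""
    updated = copy.deepcopy(profile)  # leave the caller's profile untouched
    pool_spec = updated["spec"]["resources"][pool]["spec"]
    spark = pool_spec.setdefault("settings", {}).setdefault("spark", {})
    # String-valued, mirroring the "true" shown in the article's example.
    spark["includeSpark"] = "true" if enabled else "false"
    return updated


# A profile fragment shaped like the storage pool definition above.
profile = {
    "spec": {
        "resources": {
            "storage-0": {
                "metadata": {"kind": "Pool", "name": "default"},
                "spec": {
                    "type": "Storage",
                    "replicas": 2,
                    "settings": {"spark": {"includeSpark": "true"}},
                },
            }
        }
    }
}
disabled = set_include_spark(profile, False)
```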

Limitations

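Configurations can only be specified at the category level. When several configurations share a sub-category, we cannot factor their common prefix out into the path in the cluster profile; each setting name must be written in full. For example, the following patch, which moves the shared prefix core-site.hadoop into the path, is not supported: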

{ 
      "op": "add", 
      "path": "spec.services.hdfs.settings.core-site.hadoop", 
      "value": { 
        "proxyuser.xyz.users": "*", 
        "proxyuser.abc.users": "*" 
     } 
}
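
Assuming the limitation means that every setting must carry its full name under the category-level settings path, the shared prefix can be expanded back into each key before the patch is written. The Python sketch below shows one way to do that; the helper name qualify is hypothetical.

```python
def qualify(prefix: str, settings: dict) -> dict:
    """Prepend the shared prefix to every setting name."""
    return {f"{prefix}.{name}": value for name, value in settings.items()}


# The path stops at the category-level settings; each key is fully qualified.
patch = {
    "op": "add",
    "path": "spec.services.hdfs.settings",
    "value": qualify(
        "core-site.hadoop",
        {"proxyuser.xyz.users": "*", "proxyuser.abc.users": "*"},
    ),
}
```

The resulting value has the same flat shape as the hdfs-site.dfs.block.size example earlier in this article.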

Next steps

Apache Spark & Apache Hadoop (HDFS) configuration properties.
[[!INCLUDE azure-data-cli-azdata] reference](../azdata/reference/reference-azdata.md)
[Introducing [!INCLUDEbig-data-clusters-2019]](big-data-cluster-overview.md)
