# Parquet Modular Encryption With Azure Key Vault KMS

This example notebook gives a quick look at how to use Parquet Modular Encryption with your Spark DataFrames and Spark SQL commands. It assumes that you have already created an Azure service principal that has access to your Key Vault keys (https://learn.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal), and that your Spark environment is configured with the Java class required for Key Vault KMS operations (https://github.com/Azure/parquet-modular-encryption-keyvault-kms), as well as the other required Spark environment settings.

First, let's read in some sample data. In this example, the data is read from an Azure storage account that contains the CitiBike data for New York City. This is a public dataset that you can download yourself from https://citibikenyc.com/system-data and then place on your storage account. Substitute the storage container name and storage account name as needed. This example uses the Data Lake Gen2 interface, so make sure your storage account is enabled for it.

In [0]:
#Set your storage-related variables here. Note that storage_container_name and output_encrypted_container_name can be the same container (see below).
storage_account_name = "storageaccountname"
storage_container_name = "containername"
output_encrypted_container_name = "anothercontainername"

In [0]:
#Read in the sample citibike data. This assumes you have CSV files from the data source noted in the header cell in a folder named 'citibike' in the root of the storage container
raw_df = spark.read.format("csv").option("header",True).load("abfss://{0}@{1}.dfs.core.windows.net/citibike/*.csv".format(storage_container_name, storage_account_name))
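
The same `abfss://` URI pattern is used for every read and write in this notebook. As a minimal sketch, a small helper can build it in one place; `abfss_uri` is a hypothetical name introduced here for illustration, not part of Spark or any Azure library.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build a Data Lake Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    if not container or not account:
        raise ValueError("container and account must be non-empty")
    # Strip a leading slash so the result never contains a double slash after the host.
    return "abfss://{0}@{1}.dfs.core.windows.net/{2}".format(
        container, account, path.lstrip("/")
    )


# Same location as the read above, using the placeholder names from this notebook.
citibike_path = abfss_uri("containername", "storageaccountname", "citibike/*.csv")
```

The result can be passed directly to `.load(...)` or `.save(...)`.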

### Encryption

To encrypt your Parquet files, you need to provide options to your ```.write``` operation that specify the key to use for the footer, as well as the key(s) to use for the column(s) you wish to encrypt. In this example, we are going to encrypt the ```ride_id``` column using a key from our Key Vault named ```columnKey```. The key identifier should be in the format key/versionID, unless your KMS class can take a key name and return the current version; the sample library does not do this. The output here is also written to a separate container.

In [0]:
#Note: replace the key names and versions in the following command. <key name> and <key version> should be replaced with whole values, without the less-than and greater-than signs. Be careful not to remove the slash!
#This will also write to (and overwrite) the folder "encryptionDemo/encryptedFooterExample" in your output storage container.
(
    raw_df.write.mode("overwrite")
    .option("parquet.encryption.footer.key", "<key name>/<key version>")
    .option("parquet.encryption.column.keys", "<key name>/<key version>:ride_id")
    .format("parquet")
    .save("abfss://{0}@{1}.dfs.core.windows.net/encryptionDemo/encryptedFooterExample".format(output_encrypted_container_name, storage_account_name))
)
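
The option strings in the write above are easy to mistype once several keys or columns are involved. As a minimal sketch, the helpers below assemble them. `key_id` and `column_keys_option` are hypothetical names introduced here, not part of Spark or the sample library, and the versions `abc123` and `def456` and the key name `otherKey` are made-up placeholders. The `key:col1,col2;key2:col3` layout follows the documented format of the `parquet.encryption.column.keys` property.

```python
def key_id(name: str, version: str) -> str:
    """Return a key identifier in the key/versionID form used in this notebook."""
    if not name or not version or "/" in name or "/" in version:
        raise ValueError("name and version must be non-empty and must not contain '/'")
    return "{0}/{1}".format(name, version)


def column_keys_option(columns_by_key: dict) -> str:
    """Build the parquet.encryption.column.keys value from {key identifier: [column, ...]}."""
    parts = []
    for key, columns in columns_by_key.items():
        if not columns:
            raise ValueError("no columns listed for key {0!r}".format(key))
        # Columns that share a key are comma-separated; key groups are joined with ';'.
        parts.append("{0}:{1}".format(key, ",".join(columns)))
    return ";".join(parts)


# One key protecting ride_id, as in the write above ("abc123" is a placeholder version).
single = column_keys_option({key_id("columnKey", "abc123"): ["ride_id"]})

# Two keys protecting different columns (key names and versions are placeholders).
multiple = column_keys_option({
    key_id("columnKey", "abc123"): ["ride_id", "start_station_id"],
    key_id("otherKey", "def456"): ["end_station_id"],
})
```

Either string can be passed as the value of the `parquet.encryption.column.keys` option, and `key_id(...)` on its own as the value of `parquet.encryption.footer.key`.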

### Decryption

On a properly configured cluster, decryption is automatic on read. Just read in your data via Spark or Spark SQL and you're good to go! If you run this on a cluster without the proper configuration, you will get a "No Keys Found" exception.
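
What "properly configured" means depends on your environment, but at a minimum Parquet has to be told which crypto factory and which KMS client class to use. The sketch below collects those two settings, assuming they are supplied as Spark properties with the `spark.hadoop.` prefix. `pme_spark_conf` is a hypothetical helper, and `com.example.KeyVaultKmsClient` is a placeholder, not the real class name; use the class provided by your Key Vault KMS library. Any further settings that client needs (vault URL, credentials) are not covered here.

```python
# Crypto factory that ships with Parquet and reads the parquet.encryption.* write options.
CRYPTO_FACTORY = "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory"


def pme_spark_conf(kms_client_class: str) -> dict:
    """Return the Spark properties that enable Parquet Modular Encryption.

    kms_client_class is the fully qualified Java class of your KMS client.
    """
    if not kms_client_class:
        raise ValueError("kms_client_class must be non-empty")
    return {
        "spark.hadoop.parquet.crypto.factory.class": CRYPTO_FACTORY,
        "spark.hadoop.parquet.encryption.kms.client.class": kms_client_class,
    }


# Placeholder class name for illustration only.
conf = pme_spark_conf("com.example.KeyVaultKmsClient")

# When building your own session, each pair could be applied with
# SparkSession.builder.config(key, value); on a managed cluster, the same
# pairs would typically go into the cluster's Spark configuration instead.
```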

In [0]:
#Read the encrypted Parquet back out. To see what happens when someone tries to read the Parquet files without a properly configured cluster, try running this command on a cluster that isn't configured.
encrypted_df = spark.read.format("parquet").load("abfss://{0}@{1}.dfs.core.windows.net/encryptionDemo/encryptedFooterExample".format(output_encrypted_container_name, storage_account_name))

In [0]:
display(encrypted_df)