# Writing Spark to S3

Writing from Spark directly to S3 can be very handy. However, Spark has no S3 support of its own: it borrows that capability from Hadoop. That means we need to tell Spark and Hadoop how to work together with S3. This notebook demonstrates some of the setup for this.

### Setting up my AWS connection and the Spark Arguments

First, we need to get our AWS keys in order and then tell Spark that whenever it's asked to make a context, it should build itself with a certain set of packages loaded. In particular, we want the AWS SDK for Java and the Hadoop AWS connector available. We'll set all of that up through OS environment variables, before any Spark context exists.

In [None]:
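import pyspark
import os

# AWS credentials are read from the environment; a missing variable raises KeyError.
aak = os.environ['AWS_ACCESS_KEY']
ask = os.environ['AWS_SECRET_KEY']
# Packages Spark should load at startup: the AWS Java SDK and the Hadoop AWS connector.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'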
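
As a minimal sketch of the same setup, the key lookup and the `--packages` string can be wrapped in small helpers so that a missing variable gives a readable error and the package list stays easy to edit. The names `require_env` and `build_submit_args` are hypothetical helpers written for this illustration, not part of PySpark.

```python
import os

def require_env(name, environ=None):
    """Return the value of an environment variable, or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value from a list of Maven coordinates."""
    return "--packages " + ",".join(packages) + " pyspark-shell"

# Same coordinates as in the cell above.
PACKAGES = [
    "com.amazonaws:aws-java-sdk:1.10.34",
    "org.apache.hadoop:hadoop-aws:2.6.0",
]
submit_args = build_submit_args(PACKAGES)
```

Assigning `submit_args` to `os.environ['PYSPARK_SUBMIT_ARGS']` before the session is created should reproduce the original line, and `require_env('AWS_ACCESS_KEY')` could stand in for the direct `os.environ[...]` lookup.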

### Creating and configuring Spark Context

We'll now build our Spark session, then tell Hadoop about the S3 file system and how to connect to it using the keys we loaded. The first config line registers the file system implementation that handles `s3://` paths. The second and third lines give Hadoop the credentials it needs to act as me when talking to S3.

In [None]:
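spark = pyspark.sql.SparkSession.builder.getOrCreate()
# Hadoop configuration shared by the underlying Spark context
hadoopConf = spark._jsc.hadoopConfiguration()
# Handle s3:// paths with Hadoop's native S3 file system
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
# Credentials Hadoop uses when talking to S3
hadoopConf.set("fs.s3.awsAccessKeyId", aak)
hadoopConf.set("fs.s3.awsSecretAccessKey", ask)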

df = spark.read.csv("/Users/zachariahmiller/Documents/Metis/chi18_ds7/class_lectures/week04-mcnulty1/02-logistic_sql_load/data/all_state_1950.csv")
df.show()
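
The three `set` calls can also be gathered into one dictionary, which makes it easier to see at a glance which properties are being changed. This is only a sketch: `s3_hadoop_settings` and `apply_settings` are hypothetical helpers, and the property names are copied from the `fs.s3.*` lines above.

```python
def s3_hadoop_settings(access_key, secret_key):
    """Collect the Hadoop properties this notebook sets for s3:// paths."""
    return {
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3.awsAccessKeyId": access_key,
        "fs.s3.awsSecretAccessKey": secret_key,
    }

def apply_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry in settings."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

With the objects from the cell above, `apply_settings(hadoopConf, s3_hadoop_settings(aak, ask))` should be equivalent to the three `hadoopConf.set(...)` lines.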

Once that's set up and I have some data loaded, I can treat `s3://bucket_name` as just another location to write to or read from.

In [None]:
df.write.parquet('s3://whynotwork/test')

In [None]:
df2 = spark.read.parquet('s3://whynotwork/test')
df2.show()
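
One thing to watch when re-running these cells: by default, a DataFrame write fails if the target path already exists. The sketch below, which assumes that overwriting the old output is acceptable, builds the S3 location from its parts and does the write and read in one step. `s3_uri` and `write_then_read` are hypothetical helper names introduced for this illustration.

```python
def s3_uri(bucket, key=""):
    """Join a bucket name and an optional key prefix into an s3:// URI."""
    bucket = bucket.strip("/")
    key = key.strip("/")
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"

def write_then_read(spark, df, path):
    """Write df as Parquet to path, replacing existing output, then read it back."""
    # mode("overwrite") replaces whatever is already stored at `path`.
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)
```

For the bucket used above, `write_then_read(spark, df, s3_uri("whynotwork", "test"))` would write to and read from `s3://whynotwork/test`.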