# More Data Acquisition

## HDFS & Amazon S3

> `from pyspark.sql import SparkSession  
>  spark = SparkSession.builder\  
>     .master("local")\  
>     .appName("read")\  
>     .enableHiveSupport()\  
>     .getOrCreate()`


### HDFS

#### Read

The `text` method of the 
[DataFrameReader](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader)
reads each line of a text file into one row of a DataFrame with a single column named *value*, so we need to apply further transformations to parse each line.  

> `df = spark.read.text("hdfs:///sa311/source/")
>  df.show(5, truncate=False)
>  df.head(5)`
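
Because every line arrives as one string in the *value* column, a typical next step is to split it into fields. The sketch below assumes comma-delimited lines and three hypothetical column names (`id`, `created`, `complaint`); the source data confirms neither, so adjust both to match your files. `split_pattern` and `parse_value_column` are illustrative helpers, not part of PySpark.

```python
import re


def split_pattern(sep):
    """Escape `sep` so Spark's regex-based split() treats it literally."""
    return re.escape(sep)


def parse_value_column(df, names, sep=","):
    """Split the single `value` column into one string column per name."""
    # Imported lazily so this module loads even where pyspark is absent.
    from pyspark.sql.functions import col, split

    parts = split(col("value"), split_pattern(sep))
    return df.select([parts.getItem(i).alias(n) for i, n in enumerate(names)])


# Hypothetical usage (the column names are assumptions):
# parsed = parse_value_column(df, ["id", "created", "complaint"])
# parsed.show(5, truncate=False)
```

Escaping matters because `split` takes a regular expression, so a delimiter such as `|` would otherwise be read as regex alternation.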

To create an HDFS subdirectory for your files: `!hdfs dfs -mkdir my_311`

To permanently remove a directory (a hard delete that skips the trash): `!hdfs dfs -rm -r -skipTrash my_311`
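
The `!` prefix is IPython/Jupyter shell syntax. In a plain Python script, one option is to issue the same `hdfs dfs` commands through `subprocess`. This is a minimal sketch: `hdfs_command` and `run_hdfs` are hypothetical helper names, and it assumes the `hdfs` executable is on the `PATH`.

```python
import subprocess


def hdfs_command(action, *args):
    """Build the argument list for `hdfs dfs -<action> <args...>`."""
    return ["hdfs", "dfs", f"-{action}", *args]


def run_hdfs(action, *args):
    """Run the command and raise CalledProcessError if it fails."""
    return subprocess.run(hdfs_command(action, *args), check=True)


# Equivalent to the two commands above (not executed here):
# run_hdfs("mkdir", "my_311")
# run_hdfs("rm", "-r", "-skipTrash", "my_311")
```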


#### Write

The `text` method of the [DataFrameWriter](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter)
writes each row of a DataFrame with a single string column into a line of a text file.  

> `df.write.text("my_311/df_text")`  

To write a compressed file:

> `df.write.text("my_311/df_text_compressed", compression="bzip2")`   
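
Two details often come up when writing text files: `write.text` accepts only a single string column, and by default Spark raises an error if the target path already exists. The sketch below handles both, assuming you want the columns joined with a tab and existing output replaced. `TEXT_CODECS`, `check_codec`, and `write_as_text` are hypothetical names introduced for illustration.

```python
# Short codec names that Spark's text writer documents as supported.
TEXT_CODECS = {"none", "bzip2", "gzip", "lz4", "snappy", "deflate"}


def check_codec(name):
    """Return the lower-cased codec name, or raise ValueError if unknown."""
    codec = name.lower()
    if codec not in TEXT_CODECS:
        raise ValueError(f"unsupported compression codec: {name!r}")
    return codec


def write_as_text(df, path, sep="\t", compression="none"):
    """Join all columns into one string column and write it as text."""
    # Imported lazily so this module loads even where pyspark is absent.
    from pyspark.sql.functions import col, concat_ws

    # Note: concat_ws skips null values, so rows with nulls yield fewer fields.
    line = concat_ws(sep, *[col(c).cast("string") for c in df.columns])
    (df.select(line.alias("value"))
       .write.mode("overwrite")  # replace existing output instead of failing
       .text(path, compression=check_codec(compression)))


# Hypothetical usage:
# write_as_text(df, "my_311/df_text_compressed", compression="bzip2")
```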

Confirm that the output exists, then peek at its contents.

> `!hdfs dfs -ls my_311/df_text  
>  !hdfs dfs -ls my_311/df_text_compressed  
>  !hdfs dfs -cat my_311/df_text/* | head -n 5`

### Amazon S3

Amazon S3 (Simple Storage Service) is the object storage service provided by AWS (Amazon Web Services). AWS describes it as offering industry-leading scalability, data availability, security, and performance.

#### Read from S3

For all of this to work, you need an S3 bucket that has been created and configured and that already contains data.  

[Getting started with S3](https://aws.amazon.com/s3/getting-started/)  
[Store and retrieve files with S3](https://aws.amazon.com/getting-started/tutorials/backup-files-to-amazon-s3/?trk=s3-gs)
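
Spark also needs credentials before an `s3a://` path will resolve. One possible setup, sketched below, passes the Hadoop S3A properties `fs.s3a.access.key` and `fs.s3a.secret.key` through the session builder. The environment variable names, the helper names `s3a_conf` and `build_s3_session`, and the use of static keys are all assumptions; your cluster may already supply credentials another way, such as an instance role. The sketch also assumes the `hadoop-aws` connector is already on the classpath.

```python
import os


def s3a_conf(env=os.environ):
    """Map AWS credentials from `env` to Spark settings for the S3A connector."""
    return {
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }


def build_s3_session(app_name="read_s3"):
    """Create a SparkSession configured with the credentials above."""
    # Imported lazily so this module loads even where pyspark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

If a session already exists, `getOrCreate()` returns it, so these settings may not take effect unless they are supplied when the session is first created.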

- pyspark.sql.DataFrameReader: `spark.read.csv()`

- The read method depends on the format in which the files in S3 are stored.   

- If we were reading a tab-delimited file, we would indicate this with the `sep="\t"` argument.

> `df = spark.read.csv("s3a://bucketname/filename",
>     sep=",", header=True, inferSchema=True)
>  df.printSchema()
>  df.show(5)`
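
Since the right reader depends on how the files are stored, one way to express that choice is to pick the format from the file extension and pass it to `spark.read.format(...)`. The mapping below is a hypothetical convention, not something Spark does automatically, and `FORMATS`, `format_for`, and `read_any` are illustrative names.

```python
import os

# Hypothetical mapping from file extension to a built-in Spark source name.
FORMATS = {".csv": "csv", ".tsv": "csv", ".json": "json",
           ".parquet": "parquet", ".txt": "text"}


def format_for(path):
    """Return the Spark source name for `path`, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in FORMATS:
        raise ValueError(f"no reader configured for extension {ext!r}")
    return FORMATS[ext]


def read_any(spark, path, **options):
    """Load `path` with the reader chosen by format_for()."""
    if path.lower().endswith(".tsv"):
        options.setdefault("sep", "\t")  # tab-delimited variant of CSV
    return spark.read.format(format_for(path)).options(**options).load(path)


# Hypothetical usage (the object name is an assumption):
# df = read_any(spark, "s3a://bucketname/data.tsv", header=True, inferSchema=True)
```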


#### Write to S3

- pyspark.sql.DataFrameWriter: `df.write.csv()`

- Write a tab-delimited file:

> `df.write.csv("s3a://bucketname/df_tsv", sep="\t")`
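
When writing several outputs to the same bucket, it can help to build the `s3a://` paths in one place. The sketch below also writes a header row and overwrites any previous output; both are choices made here for illustration. `s3a_uri` and `write_tsv` are hypothetical helpers.

```python
def s3a_uri(bucket, key):
    """Join a bucket name and object key into an s3a:// URI."""
    return f"s3a://{bucket.strip('/')}/{key.lstrip('/')}"


def write_tsv(df, bucket, key):
    """Write `df` as tab-delimited files with a header row under the given key."""
    df.write.csv(s3a_uri(bucket, key), sep="\t", header=True, mode="overwrite")


# Hypothetical usage, matching the example above:
# write_tsv(df, "bucketname", "df_tsv")
```

Spark writes a directory of part files at that path rather than a single file.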