## Amazon S3

### What S3 is

__Amazon S3__ (Simple Storage Service) is Amazon's service for storing files. It is simple in the sense that data is stored using only the following:
* __bucket__: the place where objects are stored. Its name is unique across all S3 users, which means that two buckets with the same name cannot exist, even if they are private and belong to two different users.
* __key__: a name, unique within a bucket, that links to the stored object. It is common to use a path-like syntax to group objects.
* __object__: any file (text or binary). It can be partitioned.
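
To illustrate the path-like key syntax, here is a minimal sketch in plain Python (no AWS access needed) that groups keys by their first `/`-separated prefix. The helper `group_by_prefix` is hypothetical, written only for this illustration. The keys are taken from the bucket listing shown later in this document.

```python
from collections import defaultdict

# Example keys; the "/" is simply a character in the key that tools
# conventionally treat as a separator.
keys = [
    "LICENSE_spark.txt",
    "ReleaseNotes/ReleaseNotes_Spark2-4.txt",
    "ReleaseNotes/ReleaseNotes_Spark3.txt",
    "data/sales_train.csv.gz",
]


def group_by_prefix(keys, delimiter="/"):
    """Group keys by the part up to the first delimiter ("" if there is none)."""
    groups = defaultdict(list)
    for key in keys:
        prefix, sep, _ = key.partition(delimiter)
        groups[prefix + sep if sep else ""].append(key)
    return dict(groups)


groups = group_by_prefix(keys)
for prefix, members in sorted(groups.items()):
    print(repr(prefix), members)
```

Keys without a delimiter end up under the empty prefix, which corresponds to the top level of the bucket.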

### Sign up
First, go to
<https://s3.console.aws.amazon.com/s3>

and sign up for S3. You can also create a bucket and upload files there. Here we explain how to use S3 programmatically.

## Installing AWS Command Line Interface and boto

To install boto3 (the Python interface to Amazon Web Services) and the AWS Command Line Interface (__CLI__), type:
```
pip install boto3
pip install awscli
```

Then create the file `~/.aws/credentials` (in your home directory) with the following content:

```
[myaws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

If you store this configuration under `[default]` instead, you won't need to add `--profile myaws` to the CLI commands in the section CLI Basic Commands.
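
The relation between profile names and the `--profile` flag can be sketched with the standard-library `configparser`. This is only an illustration: the sample text and the helpers `list_profiles` and `profile_flag` are hypothetical, and the key values are placeholders.

```python
import configparser

# Placeholder content mirroring ~/.aws/credentials; these are not real keys.
SAMPLE = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

[myaws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
"""


def list_profiles(text):
    """Return the profile names (section headers) found in the credentials text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.sections()


def profile_flag(profile):
    """Return the extra CLI arguments needed to select the given profile."""
    return [] if profile == "default" else ["--profile", profile]


profiles = list_profiles(SAMPLE)
print(profiles)
print(["aws", *profile_flag("myaws"), "s3", "ls"])
print(["aws", *profile_flag("default"), "s3", "ls"])
```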

### Where to get credentials from

1. Go to https://console.aws.amazon.com/console/home and log in
2. Click on your user name (top right) and select `My Security Credentials`.
3. Click on `+ Access keys (access key ID and secret access key)` and then on `Create New Access Key`.
4. Choose `Show access key`.

## CLI Basic Commands 

### List buckets
```
aws --profile myaws s3 ls
```

### Create a bucket
```
aws --profile myaws s3 mb s3://julc-public-test
```
__Warning__ The bucket namespace is shared by all users of the system, so you need to replace `julc-public-test` with a name of your own.
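
One way to pick a name that is unlikely to be taken is to append a random suffix. The sketch below is an assumption about how you might do this, not part of the AWS tooling: `unique_bucket_name` is a hypothetical helper, and the check covers only the basic naming rules (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit).

```python
import re
import uuid


def unique_bucket_name(prefix):
    """Append a random 12-character suffix so the name is unlikely to exist already."""
    suffix = uuid.uuid4().hex[:12]
    name = f"{prefix}-{suffix}".lower()
    # Check the basic S3 bucket naming rules before using the name.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


name = unique_bucket_name("julc-public-test")
print(name)  # the suffix differs on every run
```

The resulting name can then be used in place of `julc-public-test` in the `mb` command above.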

### Upload and download files

#### Upload
```
aws --profile myaws s3 cp data/ReleaseNotes_Spark2-4.txt s3://julc-public-test/ReleaseNotes/
aws --profile myaws s3 cp data/ReleaseNotes_Spark3.txt s3://julc-public-test/ReleaseNotes/
```

#### Download
```
aws --profile myaws s3 cp s3://julc-public-test/ReleaseNotes/ReleaseNotes_Spark2-4.txt data/ReleaseNotes_Spark.txt
```

### List files in path
 
```
aws --profile myaws s3 ls s3://julc-public-test
aws --profile myaws s3 ls s3://julc-public-test/ReleaseNotes
```
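
The `s3://...` addresses used in these commands combine a bucket name and a key. As a rough sketch, they can be split apart with the standard-library `urllib.parse`. The helper `split_s3_uri` is hypothetical, and it assumes the key contains no `?` or `#` characters, which `urlparse` would treat as query or fragment separators.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://julc-public-test/ReleaseNotes/ReleaseNotes_Spark2-4.txt"
)
print(bucket)
print(key)
```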

### Remove file(s)

```
aws --profile myaws s3 rm s3://julc-public-test/ReleaseNotes/ReleaseNotes_Spark3.txt
aws --profile myaws s3 rm s3://julc-public-test/ReleaseNotes/ --recursive
```

### Delete bucket

To delete a bucket, use
```
aws --profile myaws s3 rb s3://julc-public-test
```
To delete a non-empty bucket, add the `--force` option.

To empty a bucket without deleting it, use
```
aws --profile myaws s3 rm s3://julc-public-test --recursive
```

## What Boto is

Boto is a Python package that provides interfaces to Amazon Web Services. Here we are focused on its application to S3.

### Creating S3 Resource

We start using boto3 by creating an S3 resource object.

In [1]:
import boto3
session = boto3.Session(profile_name='myaws')
s3 = session.resource('s3')

#### From environment variables

If your credentials are stored in the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, then you can do the following:

```
import os
import boto3

# Read the credentials from the environment and pass them to the session explicitly
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
```

### List buckets

In [2]:
list(s3.buckets.all())

[s3.Bucket(name='aws-logs-429368163154-eu-west-3'),
 s3.Bucket(name='julc-databricks'),
 s3.Bucket(name='julc-public-test'),
 s3.Bucket(name='julc-spark')]

### Create a bucket

__Warning__ As before, the bucket namespace is shared, so the following command will not create a bucket if one with the same name already exists.

In [None]:
#s3.create_bucket(
#    ACL='public-read',
#    Bucket="julc-public-test")

The following Access Control List (ACL) options are available when creating a bucket:
* `'private'`,
* `'public-read'`,
* `'public-read-write'`,
* `'authenticated-read'`.

### Delete a bucket

In [None]:
#bucket = s3.Bucket('julc-public-test')
#bucket.delete()

### List keys in the bucket

In [3]:
bucket = s3.Bucket('julc-public-test')
objs = [obj for obj in bucket.objects.all()]
objs

[s3.ObjectSummary(bucket_name='julc-public-test', key='LICENSE_spark.txt'),
 s3.ObjectSummary(bucket_name='julc-public-test', key='ReleaseNotes/'),
 s3.ObjectSummary(bucket_name='julc-public-test', key='ReleaseNotes/ReleaseNotes_Spark2-4.txt'),
 s3.ObjectSummary(bucket_name='julc-public-test', key='ReleaseNotes/ReleaseNotes_Spark3.txt'),
 s3.ObjectSummary(bucket_name='julc-public-test', key='data/daily-total-sales/_SUCCESS'),
 s3.ObjectSummary(bucket_name='julc-public-test', key='data/daily-total-sales/part-00000-57aff355-5de4-4ffe-94c3-4ae4fe4da2df-c000.snappy.parquet'),
 s3.ObjectSummary(bucket_name='julc-public-test', key='data/sales_train.csv.gz')]

In [4]:
[obj.key for obj in bucket.objects.filter(Prefix="ReleaseNotes/")]

['ReleaseNotes/ReleaseNotes_Spark2-4.txt',
 'ReleaseNotes/ReleaseNotes_Spark3.txt']

An object of class `ObjectSummary` has a method `Bucket()`, which returns the `Bucket` object, and two properties, `bucket_name` and `key`, which return strings.

In [5]:
objs[0].Bucket(), objs[0].bucket_name, objs[0].key

(s3.Bucket(name='julc-public-test'), 'julc-public-test', 'LICENSE_spark.txt')

#### Filter keys and sort them 

In [None]:
objects = [obj for obj in bucket.objects.filter(Prefix="ReleaseNotes/")]
objects.sort(key=lambda obj: obj.key, reverse=True)
objects

### Download file

In [None]:
bucket = s3.Bucket('julc-public-test')
bucket.download_file('ReleaseNotes/ReleaseNotes_Spark2-4.txt', "Spark2-4.txt")

### Upload file

In [None]:
stat_bucket = s3.Bucket("julc-public-test")

In [None]:
stat_bucket.upload_file("data/competitive-data-science-predict-future-sales/sales_train.csv.gz", 'data/sales_train.csv.gz')

In [None]:
list(stat_bucket.objects.all())

### Delete an object

In [None]:
obj = s3.Object('julc-public-test', 'data/Words.csv')

In [None]:
obj.delete()

## Links:

* https://github.com/boto/boto3
* https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
* https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

# Accessing S3 from PySpark

## Adding dependency
Either
* start pyspark with `pyspark --packages=org.apache.hadoop:hadoop-aws:2.7.3`, or
* add the dependency to the SparkSession configuration:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Pyspark course") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")\
    .getOrCreate()

In [12]:
#spark.stop()  # run this only when you are finished; the cells below need the session

In [17]:
sc = spark.sparkContext

### Read AWS configuration

In [2]:
import os
import configparser
aws_profile = "myaws"

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_id = config.get(aws_profile, "aws_access_key_id") 
access_key = config.get(aws_profile, "aws_secret_access_key")

In [3]:
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
#hadoop_conf.set(
#        'fs.s3a.aws.credentials.provider',
#        'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider'
#    )
#hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")
#hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

In [4]:
#myRDD = sc.textFile("s3a://supergloospark/baby_names.csv")
myRDD = sc.textFile("s3a://julc-public-test/LICENSE_spark.txt")

In [5]:
myRDD.take(3)

['                                 Apache License',
 '                           Version 2.0, January 2004',
 '                        http://www.apache.org/licenses/']

In [6]:
sdf = spark.read.option("header", "true").csv("s3a://julc-public-test/data/sales_train.csv.gz")

In [7]:
sdf.show()

+----------+--------------+-------+-------+----------+------------+
|      date|date_block_num|shop_id|item_id|item_price|item_cnt_day|
+----------+--------------+-------+-------+----------+------------+
|02.01.2013|             0|     59|  22154|     999.0|         1.0|
|03.01.2013|             0|     25|   2552|     899.0|         1.0|
|05.01.2013|             0|     25|   2552|     899.0|        -1.0|
|06.01.2013|             0|     25|   2554|   1709.05|         1.0|
|15.01.2013|             0|     25|   2555|    1099.0|         1.0|
|10.01.2013|             0|     25|   2564|     349.0|         1.0|
|02.01.2013|             0|     25|   2565|     549.0|         1.0|
|04.01.2013|             0|     25|   2572|     239.0|         1.0|
|11.01.2013|             0|     25|   2572|     299.0|         1.0|
|03.01.2013|             0|     25|   2573|     299.0|         3.0|
|03.01.2013|             0|     25|   2574|     399.0|         2.0|
|05.01.2013|             0|     25|   2574|     

In [8]:
import pyspark.sql.functions as F
sdf.groupBy("date").agg(F.sum(F.col('item_cnt_day')).alias("items"))\
    .repartition(1)\
    .write.mode("overwrite")\
    .parquet("s3a://julc-public-test/data/daily-total-sales")

In [None]:
spark.read.parquet("s3a://julc-public-test/data/daily-total-sales").show()
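
To make clear what the aggregation written to `daily-total-sales` computes, here is a plain-Python sketch of the same group-and-sum logic. It uses only the `date` and `item_cnt_day` values from the complete rows of the `sdf.show()` output above, so the totals cover this small sample only, not the full dataset.

```python
from collections import defaultdict

# (date, item_cnt_day) pairs copied from the complete rows shown by sdf.show().
rows = [
    ("02.01.2013", 1.0),
    ("03.01.2013", 1.0),
    ("05.01.2013", -1.0),
    ("06.01.2013", 1.0),
    ("15.01.2013", 1.0),
    ("10.01.2013", 1.0),
    ("02.01.2013", 1.0),
    ("04.01.2013", 1.0),
    ("11.01.2013", 1.0),
    ("03.01.2013", 3.0),
    ("03.01.2013", 2.0),
]

# Equivalent of groupBy("date").agg(sum("item_cnt_day").alias("items")).
items = defaultdict(float)
for date, item_cnt_day in rows:
    items[date] += item_cnt_day

for date in sorted(items):
    print(date, items[date])
```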