# Explore what is on S3 - take 2

References:
* Udacity Lesson 3 Exercise 2 - IaC (infrastructure as Code) - solution
* https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html

In [1]:
import pandas as pd
import boto3
import json

## STEP 0: (Prerequisite) Save the AWS Access key

### 1. Create a new IAM user
IAM is a global service, so newly created IAM users are not restricted to a specific region.
- Go to [AWS IAM service](https://console.aws.amazon.com/iam/home#/users) and click on the "**Add user**" button to create a new IAM user in your AWS account. 
- Choose a name of your choice. 
- Select "*Programmatic access*" as the access type. Click Next. 
- Choose the *Attach existing policies directly* tab, and select the "**AdministratorAccess**". Click Next. 
- Skip adding any tags. Click Next. 
- Review and create the user. The console then shows an access key ID and a secret.
- Take note of both values. Together they are known as the **Access key**. 

### <font color='red'>2. Save the access key and secret</font>
Edit the file `dwh.cfg` in the same folder as this notebook and save the access key and secret against the following variables:
```bash
KEY= <YOUR_AWS_KEY>
SECRET= <YOUR_AWS_SECRET>
```
    
For example:
```bash
KEY=6JW3ATLQ34PH3AKI
SECRET=wnoBHA+qUBFgwCRHJqgqrLU0i
```
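The loading cells later in this notebook read the key pair from `dwh_010_secret.cfg`, under an `[AWS]` section, as `ADMIN_KEY` and `ADMIN_SECRET`. As a minimal sketch of how `configparser` handles such a file, the snippet below parses an in-memory string instead of a file on disk. The placeholder values are assumptions for illustration, not real credentials.

```python
import configparser

# Hypothetical file contents: the section and option names mirror the ones
# this notebook reads later; the values are placeholders.
SAMPLE_CFG = """
[AWS]
ADMIN_KEY = <YOUR_AWS_KEY>
ADMIN_SECRET = <YOUR_AWS_SECRET>
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)

# Values come back as strings with surrounding whitespace stripped.
aws_key = config.get("AWS", "ADMIN_KEY")
aws_secret = config.get("AWS", "ADMIN_SECRET")

print(aws_key, aws_secret)
```

Note that `configparser` needs a section header such as `[AWS]`; a file that contains only `KEY=...` lines raises `MissingSectionHeaderError`.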

### 3. Troubleshoot
If your keys do not work (for example, you get an `InvalidAccessKeyId` error), keep in mind that an existing secret cannot be retrieved again. You have two options:

1. **Option 1 - Create a new pair of access keys for the existing user**

 - Go to the [IAM dashboard](https://console.aws.amazon.com/iam/home) and view the details of the existing (Admin) user. 

 - Select the **Security credentials** tab and click the **Create access key** button. It will generate a new pair of access key ID and secret.

 - Save the new access key ID and secret in your `dwh.cfg` file.

2. **Option 2 - Create a new IAM user with Admin access** - Refer to the instructions at the top. 
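
One way to tell whether a key pair is accepted, before debugging anything else, is to ask AWS STS who the keys belong to. The following is a minimal sketch, assuming `boto3` is installed; `describe_identity` and `check_access_key` are hypothetical helper names, not part of this notebook or of boto3.

```python
def describe_identity(identity):
    """Format the dict returned by the STS GetCallerIdentity call."""
    return f"Account {identity['Account']}, principal {identity['Arn']}"


def check_access_key(key, secret, region="us-west-2"):
    """Return a short description of the identity behind a key pair.

    Raises botocore.exceptions.ClientError if AWS rejects the keys.
    """
    import boto3  # imported lazily so describe_identity works without boto3

    sts = boto3.client(
        "sts",
        region_name=region,
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )
    return describe_identity(sts.get_caller_identity())
```

Calling `check_access_key(AWS_KEY, AWS_SECRET)` with a rejected key pair raises an exception, which is the cue to create a new access key as described above.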

# Load AWS Params from a file

In [2]:
import configparser

In [3]:
config_secret = configparser.ConfigParser()
# Use a context manager so the file handle is closed after reading
with open('dwh_010_secret.cfg') as f:
    config_secret.read_file(f)

AWS_KEY    = config_secret.get('AWS','ADMIN_KEY')
AWS_SECRET = config_secret.get('AWS','ADMIN_SECRET')

In [4]:
config_dwh = configparser.ConfigParser()
# Use a context manager so the file handle is closed after reading
with open('dwh_020_build.cfg') as f:
    config_dwh.read_file(f)

DWH_CLUSTER_TYPE       = config_dwh.get("DWH","DWH_CLUSTER_TYPE")

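if DWH_CLUSTER_TYPE == 'multi-node':
    # configparser returns strings, so read the node count as an int before comparing
    DWH_NUM_NODES = config_dwh.getint("DWH", "DWH_NUM_NODES")
    assert DWH_NUM_NODES > 1

elif DWH_CLUSTER_TYPE == 'single-node':
    DWH_NUM_NODES = 1

else:
    raise ValueError(f"Unexpected DWH_CLUSTER_TYPE: {DWH_CLUSTER_TYPE!r}")
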
DWH_NODE_TYPE          = config_dwh.get("DWH","DWH_NODE_TYPE")

DWH_CLUSTER_IDENTIFIER = config_dwh.get("DWH","DWH_CLUSTER_IDENTIFIER")
DWH_DB                 = config_dwh.get("DWH","DWH_DB")
DWH_DB_USER            = config_dwh.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD        = config_dwh.get("DWH","DWH_DB_PASSWORD")
DWH_PORT               = config_dwh.get("DWH","DWH_PORT")

DWH_IAM_ROLE_NAME      = config_dwh.get("DWH", "DWH_IAM_ROLE_NAME")

In [5]:
(DWH_DB_USER, DWH_DB_PASSWORD, DWH_DB)

('sparkifyuser', 'UdhBYsV9oBKnr5sWxGDsF6L8ki3JJrzs', 'sparkifydb')

In [6]:
pd.DataFrame({"Param":
                  ["DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE", "DWH_CLUSTER_IDENTIFIER", "DWH_DB", "DWH_DB_USER", "DWH_DB_PASSWORD", "DWH_PORT", "DWH_IAM_ROLE_NAME"],
              "Value":
                  [DWH_CLUSTER_TYPE, DWH_NUM_NODES, DWH_NODE_TYPE, DWH_CLUSTER_IDENTIFIER, DWH_DB, DWH_DB_USER, DWH_DB_PASSWORD, DWH_PORT, DWH_IAM_ROLE_NAME]
             })

|   | Param | Value |
|---|-------|-------|
| 0 | DWH_CLUSTER_TYPE | single-node |
| 1 | DWH_NUM_NODES | 1 |
| 2 | DWH_NODE_TYPE | dc2.large |
| 3 | DWH_CLUSTER_IDENTIFIER | sparkifyCluster |
| 4 | DWH_DB | sparkifydb |
| 5 | DWH_DB_USER | sparkifyuser |
| 6 | DWH_DB_PASSWORD | UdhBYsV9oBKnr5sWxGDsF6L8ki3JJrzs |
| 7 | DWH_PORT | 5439 |
| 8 | DWH_IAM_ROLE_NAME | sparkifyS3ReadOnlyRole |
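
The cluster-size logic from the parameter-loading cell can also be written as a small function. This is a sketch under stated assumptions: it parses an in-memory string rather than `dwh_020_build.cfg`, the `multi-node` setting with 4 nodes is an illustrative value (the run above uses `single-node`), and `read_num_nodes` is a hypothetical helper. It uses `getint` because `configparser` returns every value as a string.

```python
import configparser

# Hypothetical config contents for illustration only.
SAMPLE_CFG = """
[DWH]
DWH_CLUSTER_TYPE = multi-node
DWH_NUM_NODES = 4
DWH_PORT = 5439
"""


def read_num_nodes(config):
    """Return the node count implied by the [DWH] section."""
    cluster_type = config.get("DWH", "DWH_CLUSTER_TYPE")
    if cluster_type == "single-node":
        return 1
    if cluster_type == "multi-node":
        # getint converts the stored string to an int
        num_nodes = config.getint("DWH", "DWH_NUM_NODES")
        if num_nodes < 2:
            raise ValueError("a multi-node cluster needs at least 2 nodes")
        return num_nodes
    raise ValueError(f"unknown DWH_CLUSTER_TYPE: {cluster_type!r}")


config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)

num_nodes = read_num_nodes(config)
port = config.getint("DWH", "DWH_PORT")
print(num_nodes, port)
```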


# Create clients for IAM, EC2, S3 and Redshift

**Note**: We are creating these resources in the **us-west-2** region. Choose the same region in your AWS web console to see these resources.

In [7]:
import boto3

ec2 = boto3.resource('ec2',
                       region_name="us-west-2",
                       aws_access_key_id=AWS_KEY,
                       aws_secret_access_key=AWS_SECRET
                    )

s3 = boto3.resource('s3',
                       region_name="us-west-2",
                       aws_access_key_id=AWS_KEY,
                       aws_secret_access_key=AWS_SECRET
                   )

iam = boto3.client('iam',
                     region_name="us-west-2",
                     aws_access_key_id=AWS_KEY,
                     aws_secret_access_key=AWS_SECRET
                  )

redshift = boto3.client('redshift',
                       region_name="us-west-2",
                       aws_access_key_id=AWS_KEY,
                       aws_secret_access_key=AWS_SECRET
                       )

# Check out the sample data sources on S3

In [8]:
sampleDbBucket =  s3.Bucket("udacity-dend")
sampleDbBucket

s3.Bucket(name='udacity-dend')

### Song Data Samples

In [9]:
# For demo purposes, list the first five objects under one song directory
song_data_samples = list(sampleDbBucket.objects.filter(Prefix='song_data/A/A/A/'))[:5]
song_data_samples

[s3.ObjectSummary(bucket_name='udacity-dend', key='song_data/A/A/A/TRAAAAK128F9318786.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='song_data/A/A/A/TRAAAAV128F421A322.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='song_data/A/A/A/TRAAABD128F429CF47.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='song_data/A/A/A/TRAAACN128F9355673.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='song_data/A/A/A/TRAAAEA128F935A30D.json')]
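
Slicing with `list(...)[:5]` first materializes every object under the prefix and only then keeps five. A minimal sketch of a lazier alternative uses `itertools.islice`, which stops pulling items once it has enough. The names `first_n` and `fake_listing` are hypothetical, and the generator is a stand-in for the S3 listing so the example runs without AWS access.

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items without consuming the rest of the iterable."""
    return list(islice(iterable, n))


def fake_listing(total, seen):
    """Stand-in for a paginated S3 listing; records each key it yields."""
    for i in range(total):
        key = f"song_data/A/A/A/object-{i}.json"
        seen.append(key)
        yield key


seen = []
sample = first_n(fake_listing(1000, seen), 5)

# Only the five requested keys were ever produced by the listing.
print(len(sample), len(seen))
```

Applied to this notebook, the call would look like `first_n(sampleDbBucket.objects.filter(Prefix='song_data/A/A/A/'), 5)`. boto3 collections also offer a `limit()` method that serves the same purpose.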

In [10]:
song_data_samples_2 = [json.loads(obj.get()['Body'].read().decode('utf-8')) for obj in song_data_samples]
song_data_samples_2

[{'artist_id': 'ARJNIUY12298900C91',
  'artist_latitude': None,
  'artist_location': '',
  'artist_longitude': None,
  'artist_name': 'Adelitas Way',
  'duration': 213.9424,
  'num_songs': 1,
  'song_id': 'SOBLFFE12AF72AA5BA',
  'title': 'Scream',
  'year': 2009},
 {'artist_id': 'AR73AIO1187B9AD57B',
  'artist_latitude': 37.77916,
  'artist_location': 'San Francisco, CA',
  'artist_longitude': -122.42005,
  'artist_name': 'Western Addiction',
  'duration': 118.07302,
  'num_songs': 1,
  'song_id': 'SOQPWCR12A6D4FB2A3',
  'title': 'A Poor Recipe For Civic Cohesion',
  'year': 2005},
 {'artist_id': 'ARMJAGH1187FB546F3',
  'artist_latitude': 35.14968,
  'artist_location': 'Memphis, TN',
  'artist_longitude': -90.04892,
  'artist_name': 'The Box Tops',
  'duration': 148.03546,
  'num_songs': 1,
  'song_id': 'SOCIWDW12A8C13D406',
  'title': 'Soul Deep',
  'year': 1969},
 {'artist_id': 'AR9Q9YC1187FB5609B',
  'artist_latitude': None,
  'artist_location': 'New Jersey',
  'artist_longitude': N

In [11]:
df_song_data_samples = pd.DataFrame(song_data_samples_2)
df_song_data_samples.head()

|   | artist_id | artist_latitude | artist_location | artist_longitude | artist_name | duration | num_songs | song_id | title | year |
|---|-----------|-----------------|-----------------|------------------|-------------|----------|-----------|---------|-------|------|
| 0 | ARJNIUY12298900C91 |  |  |  | Adelitas Way | 213.9424 | 1 | SOBLFFE12AF72AA5BA | Scream | 2009 |
| 1 | AR73AIO1187B9AD57B | 37.77916 | San Francisco, CA | -122.42005 | Western Addiction | 118.07302 | 1 | SOQPWCR12A6D4FB2A3 | A Poor Recipe For Civic Cohesion | 2005 |
| 2 | ARMJAGH1187FB546F3 | 35.14968 | Memphis, TN | -90.04892 | The Box Tops | 148.03546 | 1 | SOCIWDW12A8C13D406 | Soul Deep | 1969 |
| 3 | AR9Q9YC1187FB5609B |  | New Jersey |  | Quest_ Pup_ Kevo | 252.94322 | 1 | SOFRDWL12A58A7CEF7 | Hit Da Scene | 0 |
| 4 | ARSVTNL1187B992A91 | 51.50632 | London, England | -0.12714 | Jonathan King | 129.85424 | 1 | SOEKAZG12AB018837E | I'll Slap Your Face (Entertainment USA Theme) | 2001 |


### Log Data Samples

In [12]:
# For demo purposes, list the first five event log files for November 2018
log_data_samples = list(sampleDbBucket.objects.filter(Prefix='log_data/2018/11'))[:5]
log_data_samples

[s3.ObjectSummary(bucket_name='udacity-dend', key='log_data/2018/11/2018-11-01-events.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='log_data/2018/11/2018-11-02-events.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='log_data/2018/11/2018-11-03-events.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='log_data/2018/11/2018-11-04-events.json'),
 s3.ObjectSummary(bucket_name='udacity-dend', key='log_data/2018/11/2018-11-05-events.json')]

In [13]:
import itertools

In [14]:
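# https://stackoverflow.com/questions/952914/how-do-i-make-a-flat-list-out-of-a-list-of-lists
# Each log file holds one JSON object per line: read every file,
# split it into lines, and parse each non-empty line.
log_data_samples_2 = [
    json.loads(line)
    for line in itertools.chain.from_iterable(
        obj.get()['Body'].read().decode('utf-8').split('\n')
        for obj in log_data_samples
    )
    if line.strip()  # skip blank lines, e.g. after a trailing newline
]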

log_data_samples_2[:5]

[{'artist': None,
  'auth': 'Logged In',
  'firstName': 'Walter',
  'gender': 'M',
  'itemInSession': 0,
  'lastName': 'Frye',
  'length': None,
  'level': 'free',
  'location': 'San Francisco-Oakland-Hayward, CA',
  'method': 'GET',
  'page': 'Home',
  'registration': 1540919166796.0,
  'sessionId': 38,
  'song': None,
  'status': 200,
  'ts': 1541105830796,
  'userAgent': '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"',
  'userId': '39'},
 {'artist': None,
  'auth': 'Logged In',
  'firstName': 'Kaylee',
  'gender': 'F',
  'itemInSession': 0,
  'lastName': 'Summers',
  'length': None,
  'level': 'free',
  'location': 'Phoenix-Mesa-Scottsdale, AZ',
  'method': 'GET',
  'page': 'Home',
  'registration': 1540344794796.0,
  'sessionId': 139,
  'song': None,
  'status': 200,
  'ts': 1541106106796,
  'userAgent': '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safar

In [15]:
df_log_data_samples = pd.DataFrame(log_data_samples_2)
df_log_data_samples.head()

|   | artist | auth | firstName | gender | itemInSession | lastName | length | level | location | method | page | registration | sessionId | song | status | ts | userAgent | userId |
|---|--------|------|-----------|--------|---------------|----------|--------|-------|----------|--------|------|--------------|-----------|------|--------|----|-----------|--------|
| 0 |  | Logged In | Walter | M | 0 | Frye |  | free | San Francisco-Oakland-Hayward, CA | GET | Home | 1540919000000.0 | 38 |  | 200 | 1541105830796 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4... | 39 |
| 1 |  | Logged In | Kaylee | F | 0 | Summers |  | free | Phoenix-Mesa-Scottsdale, AZ | GET | Home | 1540345000000.0 | 139 |  | 200 | 1541106106796 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK... | 8 |
| 2 | Des'ree | Logged In | Kaylee | F | 1 | Summers | 246.30812 | free | Phoenix-Mesa-Scottsdale, AZ | PUT | NextSong | 1540345000000.0 | 139 | You Gotta Be | 200 | 1541106106796 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK... | 8 |
| 3 |  | Logged In | Kaylee | F | 2 | Summers |  | free | Phoenix-Mesa-Scottsdale, AZ | GET | Upgrade | 1540345000000.0 | 139 |  | 200 | 1541106132796 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK... | 8 |
| 4 | Mr Oizo | Logged In | Kaylee | F | 3 | Summers | 144.03873 | free | Phoenix-Mesa-Scottsdale, AZ | PUT | NextSong | 1540345000000.0 | 139 | Flat 55 | 200 | 1541106352796 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK... | 8 |


### Log JSON Paths

In [16]:
log_json_path_samples = list(sampleDbBucket.objects.filter(Prefix='log_json_path.json'))[:5]
log_json_path_samples

[s3.ObjectSummary(bucket_name='udacity-dend', key='log_json_path.json')]

In [17]:
log_json_path_samples_2 = [json.loads(obj.get()['Body'].read().decode('utf-8')) for obj in log_json_path_samples]
log_json_path_samples_2

[{'jsonpaths': ["$['artist']",
   "$['auth']",
   "$['firstName']",
   "$['gender']",
   "$['itemInSession']",
   "$['lastName']",
   "$['length']",
   "$['level']",
   "$['location']",
   "$['method']",
   "$['page']",
   "$['registration']",
   "$['sessionId']",
   "$['song']",
   "$['status']",
   "$['ts']",
   "$['userAgent']",
   "$['userId']"]}]

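The Udacity knowledge-base answers listed under References explain what `log_json_path.json` is for. It is a JSONPaths file for the Redshift `COPY` command: each expression selects one field from an input JSON record, and the selected values are loaded into the target table's columns in the order listed. This gives the developer a way to map input JSON fields to SQL table columns when their names do not match. If the names already match, the file is optional, because `COPY` can match fields to columns by name with the `'auto'` option.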
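
As a rough illustration of that mapping, the sketch below applies a list of `$['key']` expressions to one record and returns the selected values in list order, which is the order in which `COPY` assigns them to table columns. It is plain Python meant to show the idea, not how Redshift implements `COPY`; `apply_jsonpaths` is a hypothetical helper and the three-path list is a shortened example.

```python
import json


def apply_jsonpaths(record, jsonpaths):
    """Return the record's values in the order given by flat "$['key']" paths."""
    row = []
    for path in jsonpaths:
        # Only the flat "$['key']" form used in log_json_path.json is handled.
        if not (path.startswith("$['") and path.endswith("']")):
            raise ValueError(f"unsupported path: {path}")
        key = path[3:-2]
        # Keys missing from the record become None in this sketch.
        row.append(record.get(key))
    return row


# The path order decides the column order, regardless of key order in the JSON.
jsonpaths = ["$['userId']", "$['firstName']", "$['page']"]
record = json.loads('{"firstName": "Walter", "page": "Home", "userId": "39"}')

row = apply_jsonpaths(record, jsonpaths)
print(row)
```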

References:

* https://knowledge.udacity.com/questions/214736
* https://knowledge.udacity.com/questions/144884
* https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-copy-from-json