### AWS INFRASTRUCTURE AS CODE USING PYTHON FOR REDSHIFT DW


It is fairly simple to create and maintain a few users and servers in the cloud through the provided UI, but as they grow in number the work becomes cumbersome.

This is where Infrastructure as Code (IaC) comes into play. IaC lets you automate, maintain, deploy, replicate and share complex infrastructure as easily as you maintain code.

In AWS, IaC is possible through the ```aws-cli```, the SDKs (such as ```boto3``` for Python) and ```CloudFormation```.

AWS CloudFormation is a service that gives developers and businesses an easy way to create a collection of related AWS and third-party resources, and to provision and manage them in an orderly and predictable fashion. It uses template files in JSON or YAML format that describe all the resources, permissions, users, etc.

In our case, we'll go the SDK way and use Python to access and manage Redshift and its resources. We will create a configuration file, ```redshift_cluster.config```, that holds all the information needed to access and manage the Redshift cluster that has already been created.

This file eases programmatic access to the cluster while we write code. The information in it is taken from the "properties" window of the Redshift cluster in AWS.
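As an illustration, the configuration file could be laid out as in the sketch below. The section and option names mirror the ones read later in this notebook; every value in angle brackets is a placeholder, not a real credential, and the exact layout of the original file is an assumption.

```python
import configparser

# Hypothetical contents of redshift_cluster.config.
# Values in angle brackets are placeholders to be replaced with your own.
SAMPLE_CONFIG = """
[AWS]
KEY = <your-access-key-id>
SECRET = <your-secret-access-key>

[DWH]
DWH_CLUSTER_TYPE = single-node
DWH_NUM_NODE = 1
DWH_NODE_TYPE = dc2.large
DWH_CLUSTER_IDENTIFIER = my-first-redshift
DWH_DB = first-redshift-db
DWH_DB_USER = awsuser
DWH_DB_PASSWORD = <your-db-password>
DWH_PORT = 5439
DWH_IAM_ROLE_NAME = <your-iam-role-name>
"""

config = configparser.ConfigParser()
# read_string parses the text directly; a real file would be loaded with read_file
config.read_string(SAMPLE_CONFIG)

print(config.get('DWH', 'DWH_NODE_TYPE'))
# getint converts the stored string to an integer
print(config.getint('DWH', 'DWH_PORT'))
```

Keeping all settings in one file means the code itself never has to change when a cluster name or node type does.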

Creating an IAM role comes in handy because we do not need to pass the access key and secret key explicitly for the AWS services to communicate with each other. We just use this role.
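This notebook assumes the role already exists. For readers who need to create one, the sketch below shows one possible way to do it with a boto3 IAM client. The function names are hypothetical, and attaching the AWS managed policy `AmazonS3ReadOnlyAccess` is an assumption about what the role should be allowed to do.

```python
import json

# AWS managed policy granting read-only access to S3 (assumed to be sufficient here)
S3_READ_ONLY_POLICY_ARN = 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'


def build_redshift_trust_policy():
    """Return a trust policy that lets the Redshift service assume the role."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'redshift.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }


def create_redshift_role(iam_client, role_name):
    """Create the role, attach S3 read access and return the role ARN.

    iam_client is expected to be a boto3 IAM client, e.g. boto3.client('iam').
    """
    response = iam_client.create_role(
        RoleName=role_name,
        Description='Allows Redshift to read from S3 on our behalf',
        AssumeRolePolicyDocument=json.dumps(build_redshift_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=role_name,
                                  PolicyArn=S3_READ_ONLY_POLICY_ARN)
    return response['Role']['Arn']
```

The trust policy is what allows Redshift to act under the role, so no access key ever has to be handed to the cluster.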



<img src='../images/aws_property.png' alt="Properties window of the Redshift cluster in AWS" />

In [37]:
import boto3
import pandas as pd
import psycopg2
import json

In [38]:
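import configparser
config = configparser.ConfigParser()
# Use a context manager so the file handle is closed after reading
with open('../cluster.config') as config_file:
    config.read_file(config_file)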

In [39]:
# Test access to the config file
config.get('AWS', 'KEY')

'########'

SECURELY ACCESSING CONFIGURATION FILE

In [40]:
KEY = config.get('AWS', 'KEY')
SECRET = config.get('AWS', 'SECRET')
DWH_CLUSTER_TYPE = config.get('DWH', 'DWH_CLUSTER_TYPE')
DWH_DB = config.get('DWH', 'DWH_DB')
DWH_NUM_NODE = config.get('DWH', 'DWH_NUM_NODE')
DWH_NODE_TYPE = config.get('DWH', 'DWH_NODE_TYPE')
DWH_CLUSTER_IDENTIFIER = config.get('DWH', 'DWH_CLUSTER_IDENTIFIER')
DWH_DB_USER = config.get('DWH', 'DWH_DB_USER')
DWH_DB_PASSWORD = config.get('DWH', 'DWH_DB_PASSWORD')
DWH_PORT = config.get('DWH', 'DWH_PORT')
DWH_IAM_ROLE_NAME = config.get('DWH', 'DWH_IAM_ROLE_NAME')

(DWH_DB_USER, DWH_DB_PASSWORD, DWH_DB)

('awsuser', 'Testing321', 'first-redshift-db')

In [41]:
pd.DataFrame({'param':
                ['KEY', 'SECRET', 'DWH_CLUSTER_TYPE', 'DWH_DB', 'DWH_NUM_NODE', 'DWH_NODE_TYPE', 'DWH_CLUSTER_IDENTIFIER', 'DWH_DB_USER', 'DWH_DB_PASSWORD', 'DWH_PORT', 'DWH_IAM_ROLE_NAME'],

              'value':
                [KEY, SECRET, DWH_CLUSTER_TYPE, DWH_DB, DWH_NUM_NODE, DWH_NODE_TYPE, DWH_CLUSTER_IDENTIFIER, DWH_DB_USER, DWH_DB_PASSWORD, DWH_PORT, DWH_IAM_ROLE_NAME]})

| | param | value |
|---|---|---|
| 0 | KEY | ######## |
| 1 | SECRET | ######## |
| 2 | DWH_CLUSTER_TYPE | single-node |
| 3 | DWH_DB | first-redshift-db |
| 4 | DWH_NUM_NODE | 1 |
| 5 | DWH_NODE_TYPE | dc2.large |
| 6 | DWH_CLUSTER_IDENTIFIER | my-first-redshift |
| 7 | DWH_DB_USER | awsuser |
| 8 | DWH_DB_PASSWORD | Testing321 |
| 9 | DWH_PORT | 5439 |
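Displaying settings is handy for debugging, but passwords and keys are better kept out of rendered output. The helper below is a minimal sketch, not part of the original notebook, that masks sensitive values before they are shown. The names `mask`, `safe_view` and `SENSITIVE` are hypothetical.

```python
# Names of settings that should never be printed in clear text (an assumption)
SENSITIVE = {'KEY', 'SECRET', 'DWH_DB_PASSWORD'}


def mask(value, visible=2):
    """Replace all but the last `visible` characters of a value with '*'."""
    text = str(value)
    if len(text) <= visible:
        return '*' * len(text)
    return '*' * (len(text) - visible) + text[-visible:]


def safe_view(settings):
    """Return a copy of `settings` with sensitive values masked."""
    return {name: mask(value) if name in SENSITIVE else value
            for name, value in settings.items()}


# Example with placeholder values only
example = {'DWH_DB_USER': 'awsuser', 'DWH_DB_PASSWORD': 'example-password'}
print(safe_view(example))
```

The masked dictionary can be passed to `pd.DataFrame` in the same way as the raw values.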


CONNECT TO AWS

I store my key and secret in a folder that is ignored in this project, for security purposes.

In [42]:
key_config = configparser.ConfigParser()
# Use a context manager so the file handle is closed after reading
with open('../private/redshift_cluster.config') as key_file:
    key_config.read_file(key_file)

KEY = key_config.get('AWS','KEY')
SECRET = key_config.get('AWS','SECRET')

In [43]:
ec2 = boto3.resource('ec2', 
                        region_name='us-east-2', 
                        aws_access_key_id=KEY, 
                        aws_secret_access_key=SECRET)

s3 = boto3.resource('s3', 
                        region_name='us-east-2', 
                        aws_access_key_id=KEY, 
                        aws_secret_access_key=SECRET)


iam = boto3.client('iam', 
                        region_name='us-east-2', 
                        aws_access_key_id=KEY, 
                        aws_secret_access_key=SECRET)

redshift = boto3.client('redshift', 
                        region_name='us-east-2', 
                        aws_access_key_id=KEY, 
                        aws_secret_access_key=SECRET)
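The four calls above repeat the same region and credentials. One way to avoid that repetition, sketched here as an alternative rather than the notebook's own approach, is to create a single `boto3.Session` and derive every handle from it. The helper name `build_aws_handles` is hypothetical.

```python
def build_aws_handles(session):
    """Create every handle used in this notebook from one shared session.

    `session` is expected to be a boto3.Session that already carries
    the region and credentials.
    """
    return {
        'ec2': session.resource('ec2'),
        's3': session.resource('s3'),
        'iam': session.client('iam'),
        'redshift': session.client('redshift'),
    }


# Intended usage (requires boto3 and valid credentials):
# import boto3
# session = boto3.Session(aws_access_key_id=KEY,
#                         aws_secret_access_key=SECRET,
#                         region_name='us-east-2')
# aws = build_aws_handles(session)
# redshift = aws['redshift']
```

Because the region and keys live in one place, switching regions or credentials becomes a one-line change.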

CONNECT TO A BUCKET AND ACCESS OBJECTS WITHIN IT

In [44]:
bucket = s3.Bucket('tito-dataeng-redshift')
# Each item is an object summary; its key is the object's name in the bucket
log_data_files = [obj.key for obj in bucket.objects.all()]
log_data_files

['another_data_copy.csv', 'cleaned.csv', 'version2.csv']
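When a bucket holds many objects, it may help to narrow the listing. The sketch below assumes a boto3 `Bucket` resource and uses its `objects.filter(Prefix=...)` collection method; the helper name `list_keys` and its parameters are hypothetical.

```python
def list_keys(bucket, prefix='', suffix=''):
    """Return the keys in `bucket` that start with `prefix` and end with `suffix`."""
    # Prefix filtering happens on the S3 side; suffix filtering happens locally
    objects = bucket.objects.filter(Prefix=prefix) if prefix else bucket.objects.all()
    return [obj.key for obj in objects if obj.key.endswith(suffix)]


# Intended usage with the bucket from the cell above:
# csv_files = list_keys(bucket, suffix='.csv')
```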

ARN

Amazon Resource Names (ARNs) uniquely identify AWS resources. An ARN is required whenever a resource must be specified unambiguously across all of AWS, such as in IAM policies, Amazon Relational Database Service (Amazon RDS) tags, and API calls.
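To make the structure of an ARN concrete, the sketch below splits one into its parts, following the general layout `arn:partition:service:region:account-id:resource`. The `Arn` type, the `parse_arn` helper and the example account number are illustrative assumptions.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn):
    """Split an ARN string into its five components."""
    # Split at most 5 times because the resource part may itself contain ':'
    parts = arn.split(':', 5)
    if len(parts) != 6 or parts[0] != 'arn':
        raise ValueError(f'not a valid ARN: {arn!r}')
    return Arn(*parts[1:])


# IAM is a global service, so the region field of a role ARN is empty
role = parse_arn('arn:aws:iam::123456789012:role/example-redshift-role')
print(role.service, role.account_id, role.resource)
```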


In [45]:
# Look up the ARN of the IAM role that has permission to access S3 buckets
roleARN = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn']

CREATE REDSHIFT CLUSTER

In [48]:
try:
    response = redshift.create_cluster(
        # Hardware
        ClusterType=DWH_CLUSTER_TYPE,
        NodeType=DWH_NODE_TYPE,

        # Identifier and credentials
        ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,
        DBName=DWH_DB,
        MasterUsername=DWH_DB_USER,
        MasterUserPassword=DWH_DB_PASSWORD,

        # Role for S3 access
        IamRoles=[roleARN],
    )
except Exception as e:
    print(e)
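Cluster creation is asynchronous, so the cluster is not usable the moment `create_cluster` returns. The sketch below polls `describe_clusters` until the status becomes `available`. The helper name, the delay and the attempt limit are assumptions chosen for illustration.

```python
import time


def wait_until_available(redshift_client, cluster_id,
                         delay=30, max_attempts=40, sleep=time.sleep):
    """Poll the cluster until its status is 'available' and return its details.

    Raises TimeoutError if the cluster is still not available after
    `max_attempts` checks spaced `delay` seconds apart.
    """
    for _ in range(max_attempts):
        cluster = redshift_client.describe_clusters(
            ClusterIdentifier=cluster_id)['Clusters'][0]
        if cluster['ClusterStatus'] == 'available':
            return cluster
        sleep(delay)
    raise TimeoutError(f'cluster {cluster_id!r} not available '
                       f'after {max_attempts} checks')


# Intended usage:
# cluster_details = wait_until_available(redshift, DWH_CLUSTER_IDENTIFIER)
```

boto3 also ships a built-in waiter for this, `redshift.get_waiter('cluster_available')`, which could replace the hand-written loop.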

DESCRIBE CLUSTER DETAILS

In [66]:
cluster_details = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
cluster_details

{'ClusterIdentifier': 'my-first-redshift',
 'NodeType': 'dc2.large',
 'ClusterStatus': 'available',
 'ClusterAvailabilityStatus': 'Available',
 'MasterUsername': 'awsuser',
 'DBName': 'first-redshift-db',
 'Endpoint': {'Address': 'my-first-redshift.cwdmvcljvlpf.us-east-2.redshift.amazonaws.com',
  'Port': 5439},
 'ClusterCreateTime': datetime.datetime(2023, 1, 8, 5, 32, 32, 991000, tzinfo=tzutc()),
 'AutomatedSnapshotRetentionPeriod': 1,
 'ManualSnapshotRetentionPeriod': -1,
 'ClusterSecurityGroups': [],
 'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-01121b86867248be0',
   'Status': 'active'}],
 'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',
   'ParameterApplyStatus': 'in-sync'}],
 'ClusterSubnetGroupName': 'default',
 'VpcId': 'vpc-0343be62f9b6090eb',
 'AvailabilityZone': 'us-east-2b',
 'PreferredMaintenanceWindow': 'tue:06:00-tue:06:30',
 'PendingModifiedValues': {},
 'ClusterVersion': '1.0',
 'AllowVersionUpgrade': True,
 'NumberOfNodes': 1,
 'PubliclyA

In [67]:
def prettyRedshiftProps(props):
    """Return the most useful cluster properties as a two-column DataFrame."""
    pd.set_option('display.max_colwidth', None)
    keysToShow = ['ClusterIdentifier', 'NodeType', 'ClusterStatus', 'MasterUsername', 'DBName', 'Endpoint', 'VpcId']
    x = [(k, v) for k, v in props.items() if k in keysToShow]
    return pd.DataFrame(data=x, columns=['key', 'value'])

prettyRedshiftProps(cluster_details)

| | key | value |
|---|---|---|
| 0 | ClusterIdentifier | my-first-redshift |
| 1 | NodeType | dc2.large |
| 2 | ClusterStatus | available |
| 3 | MasterUsername | awsuser |
| 4 | DBName | first-redshift-db |
| 5 | Endpoint | {'Address': 'my-first-redshift.cwdmvcljvlpf.us-east-2.redshift.amazonaws.com', 'Port': 5439} |
| 6 | VpcId | vpc-0343be62f9b6090eb |
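The `Endpoint` entry is what a database client needs in order to reach the cluster. As a sketch of how the `psycopg2` import at the top could be put to use, the helper below turns the cluster details into keyword arguments for `psycopg2.connect`. The helper name `connection_kwargs` is hypothetical, and the connection itself is left commented out because it needs a live cluster.

```python
def connection_kwargs(cluster, user, password):
    """Build psycopg2.connect keyword arguments from describe_clusters output."""
    endpoint = cluster['Endpoint']
    return {
        'host': endpoint['Address'],
        'port': endpoint['Port'],
        'dbname': cluster['DBName'],
        'user': user,
        'password': password,
    }


# Intended usage (requires psycopg2 and a reachable cluster):
# import psycopg2
# conn = psycopg2.connect(**connection_kwargs(cluster_details,
#                                             DWH_DB_USER, DWH_DB_PASSWORD))
# conn.close()
```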


OPEN THE REDSHIFT PORT ON THE VPC SECURITY GROUP USING ec2 CONNECTION

In [78]:
try:
    vpc = ec2.Vpc(id=cluster_details['VpcId'])
    # Take the first security group of the cluster's VPC
    defaultSG = list(vpc.security_groups.all())[0]

    # Allow inbound TCP traffic on the Redshift port from any address
    defaultSG.authorize_ingress(
        CidrIp='0.0.0.0/0',
        IpProtocol='TCP',
        FromPort=int(DWH_PORT),
        ToPort=int(DWH_PORT),
        GroupName=defaultSG.group_name
    )

    print("Redshift port opened on the VPC security group through ec2")

except Exception as e:
    print(e)

An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 0.0.0.0/0, TCP, from port: 5439, to port: 5439, ALLOW" already exists
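The message above only means that the ingress rule was added on an earlier run, so it is safe to ignore. The sketch below treats that one error code as success and re-raises anything else. The helper names are hypothetical, and the check relies on the `response` attribute that botocore's `ClientError` carries.

```python
def is_duplicate_rule_error(exc):
    """Return True if `exc` reports that the ingress rule already exists."""
    # botocore's ClientError stores the parsed error under exc.response['Error']
    response = getattr(exc, 'response', None) or {}
    return response.get('Error', {}).get('Code') == 'InvalidPermission.Duplicate'


def open_port(security_group, port, cidr='0.0.0.0/0'):
    """Allow inbound TCP traffic on `port`.

    Returns True if the rule was added and False if it already existed.
    Note that the default cidr opens the port to every address.
    """
    try:
        security_group.authorize_ingress(
            CidrIp=cidr,
            IpProtocol='tcp',
            FromPort=int(port),
            ToPort=int(port),
        )
    except Exception as exc:
        if is_duplicate_rule_error(exc):
            return False
        raise
    return True


# Intended usage:
# open_port(defaultSG, DWH_PORT)
```

With botocore available, catching `botocore.exceptions.ClientError` instead of `Exception` would narrow the handler further.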
