# Implementing an ETL pipeline and data warehouse using AWS services such as S3, Glue, Athena and Redshift
- I will be using the same dataset I used to create the ETL pipeline and data warehouse with Python and PostgreSQL.
- In this notebook, I will be using boto3, the AWS SDK for Python, to connect to my AWS account and access AWS services.
- In this project, I am practicing Infrastructure as Code (IaC).

## Main objective
- To implement an ETL pipeline and data warehouse using AWS services.
- To learn and understand AWS services.

### Importing Necessary libraries

In [1]:
# To connect to AWS 
import boto3
# To parse the data in config file
import configparser
# to process data
import pandas as pd 

### Configuring the AWS account to be used from a local Jupyter notebook
- I have created a configuration file, dwh.cfg, which contains all the settings needed to connect to the AWS account.

In [5]:
# Initialize the configparser object
config = configparser.ConfigParser()
# Read the configuration from the local file (the with-block closes the file handle)
with open('dwh.cfg') as cfg_file:
    config.read_file(cfg_file)
# Load the parameters from the config file into variables
KEY                    = config.get('AWS','KEY')
SECRET                 = config.get('AWS','SECRET')
AWS_REGION             = config.get("AWS","DEFAULT_REGION")            

DWH_CLUSTER_TYPE       = config.get("DWH","DWH_CLUSTER_TYPE")
DWH_NUM_NODES          = config.get("DWH","DWH_NUM_NODES")
DWH_NODE_TYPE          = config.get("DWH","DWH_NODE_TYPE")

DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
DWH_DB                 = config.get("DWH","DWH_DB")
DWH_DB_USER            = config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD        = config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT               = config.get("DWH","DWH_PORT")

DWH_IAM_ROLE_NAME      = config.get("DWH", "DWH_IAM_ROLE_NAME")

# Optional: check the parameters in a pandas DataFrame (note that this also displays the password)
# pd.DataFrame({"Param":
#                   ["DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE", "DWH_CLUSTER_IDENTIFIER", "DWH_DB", "DWH_DB_USER", "DWH_DB_PASSWORD", "DWH_PORT", "DWH_IAM_ROLE_NAME"],
#               "Value":
#                   [DWH_CLUSTER_TYPE, DWH_NUM_NODES, DWH_NODE_TYPE, DWH_CLUSTER_IDENTIFIER, DWH_DB, DWH_DB_USER, DWH_DB_PASSWORD, DWH_PORT, DWH_IAM_ROLE_NAME]
#              })
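
The layout of dwh.cfg is not shown in this notebook, so the following is only a sketch of what it might look like. The section and key names are taken from the `config.get(...)` calls above; every value is a hypothetical placeholder. It also shows that `config.get` always returns strings, which may matter if a value such as `DWH_NUM_NODES` is later passed to an API parameter that expects an integer.

```python
import configparser

# Hypothetical dwh.cfg contents: the section and key names match the
# config.get(...) calls in this notebook, but every value is a placeholder.
SAMPLE_CFG = """
[AWS]
KEY = <your-access-key-id>
SECRET = <your-secret-access-key>
DEFAULT_REGION = ap-south-1

[DWH]
DWH_CLUSTER_TYPE = multi-node
DWH_NUM_NODES = 2
DWH_NODE_TYPE = dc2.large
DWH_CLUSTER_IDENTIFIER = dwh-cluster
DWH_DB = dwh
DWH_DB_USER = dwhuser
DWH_DB_PASSWORD = <your-password>
DWH_PORT = 5439
DWH_IAM_ROLE_NAME = dwh-role
"""

sample = configparser.ConfigParser()
# read_string parses the text directly, so no file is needed for this sketch
sample.read_string(SAMPLE_CFG)

# get() returns strings; getint() converts numeric settings to int
num_nodes = sample.getint("DWH", "DWH_NUM_NODES")
port = sample.getint("DWH", "DWH_PORT")
print(sample.sections(), num_nodes, port)
```

Since the real file holds credentials, it is usually kept out of version control.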

### Creating boto3 clients for S3, IAM, Glue and Redshift
These clients are then used to access those AWS services and perform operations on them.

In [14]:
try:
    s3 = boto3.client('s3',
                        region_name=AWS_REGION,
                        aws_access_key_id=KEY,
                        aws_secret_access_key=SECRET)

    iam = boto3.client('iam',
                        region_name=AWS_REGION,
                        aws_access_key_id=KEY,
                        aws_secret_access_key=SECRET)

    glue = boto3.client('glue',
                        region_name=AWS_REGION,
                        aws_access_key_id=KEY,
                        aws_secret_access_key=SECRET)

    redshift = boto3.client('redshift',
                            region_name=AWS_REGION,
                            aws_access_key_id=KEY,
                            aws_secret_access_key=SECRET)
    
    print("Boto3 clients created successfully")
except Exception as e:
    # f-string so that the actual exception text is interpolated
    print(f"Following error was encountered:\n{e}")

Boto3 clients created successfully
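
The four client definitions above repeat the same credentials and region. One possible way to reduce the repetition is to create the clients from a single session object. The sketch below is an illustration rather than what this notebook does: `make_clients` and `FakeSession` are hypothetical names, and `FakeSession` stands in for `boto3.Session` so the example runs without AWS credentials.

```python
def make_clients(session, service_names):
    """Return a dict mapping each service name to a client from `session`.

    `session` is any object with a client(name) method, such as a boto3.Session.
    """
    return {name: session.client(name) for name in service_names}


class FakeSession:
    """Stand-in for boto3.Session, used only to keep this sketch self-contained."""

    def client(self, name):
        return f"<client for {name}>"


clients = make_clients(FakeSession(), ["s3", "iam", "glue", "redshift"])
print(sorted(clients))
```

With real credentials, the session would be built as `boto3.Session(aws_access_key_id=KEY, aws_secret_access_key=SECRET, region_name=AWS_REGION)` and passed in place of `FakeSession()`.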


### Transforming the data from CSV to Parquet format before storing it in the S3 bucket
- Parquet is chosen because it is a columnar format: query engines can read only the columns a query needs (column pruning) and skip data that cannot match a filter (data skipping). It also compresses the data, which reduces storage costs and the amount of data scanned per query.
- We use pandas to convert the data from CSV to Parquet.

In [11]:
# Read the csv dataset
# Dataset Folder
DATASET_FOLDER = 'datasets'

df1 = pd.read_csv(f"{DATASET_FOLDER}/product_info.csv")
# low_memory=False to deal with mixed datatypes warning
df2 = pd.read_csv(f"{DATASET_FOLDER}/product_reviews.csv",index_col=0,low_memory=False)

# Transform to parquet format (to_parquet writes the file and returns None, so nothing is assigned)
df1.to_parquet(f"{DATASET_FOLDER}/product_info.parquet")
df2.to_parquet(f"{DATASET_FOLDER}/product_reviews.parquet")
print("Data successfully converted to parquet format")

Data successfully converted to parquet format


### Uploading the dataset to an S3 bucket using the S3 service
- List all existing buckets in the AWS account.
- Check whether the required S3 bucket exists.
- If not, create a new bucket.
- Upload the files from the local directory to the S3 bucket.
> Note: S3 bucket names must be globally unique; otherwise bucket creation fails with an error.

In [23]:
SOURCE_DATA_BUCKET="coderush-anish-etl-source-data"

# Start with an empty list so the existence check below still works if listing fails
buckets_list = []
try:
    # get list of all available buckets
    response = s3.list_buckets()
    buckets_list = [bucket['Name'] for bucket in response['Buckets']]
    # print(buckets_list)
except Exception as e:
    print(f"Error fetching the s3 buckets. Following error encountered:\n{e}")
    
# Create S3 bucket if not exists
if SOURCE_DATA_BUCKET not in buckets_list:
    print(f"Creating {SOURCE_DATA_BUCKET} . . .")
    try:
        response = s3.create_bucket(
            Bucket=SOURCE_DATA_BUCKET,
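            # The location constraint must match the region of the s3 client
            # (omit CreateBucketConfiguration entirely for us-east-1)
            CreateBucketConfiguration={
                    'LocationConstraint': AWS_REGION
                    }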
        )
        print(f"{SOURCE_DATA_BUCKET} created successfully !")
    except Exception as e:
        print(f"Failed to create {SOURCE_DATA_BUCKET}. Following error encountered:\n{e}")
else:
    print(f"{SOURCE_DATA_BUCKET} already exists!")

# Uploading the dataset files from local directory to s3 bucket
try:
    print("Started uploading the files . . .")
    files_to_upload = ["product_info.parquet","product_reviews.parquet"]
    for file_name in files_to_upload:
        s3.upload_file(f"{DATASET_FOLDER}/{file_name}",SOURCE_DATA_BUCKET,file_name)
    print('Successfully uploaded all files')
except Exception as e: 
    print(f"Failed to upload datasets to s3 bucket.\nFollowing error occurred:\n{e}")

coderush-anish-etl-source-data already exists!
Started uploading the files . . .
Successfully uploaded all files
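
The check-then-create steps above can also be wrapped in a small reusable function. The sketch below is one possible shape, not the notebook's actual code: `bucket_kwargs`, `ensure_bucket` and `FakeS3` are hypothetical names, and `FakeS3` is an in-memory stand-in for the boto3 S3 client so the example runs without an AWS account. It also handles the `us-east-1` case, where no `LocationConstraint` should be passed.

```python
def bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for create_bucket in the given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the S3 default region and is not passed as a LocationConstraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def ensure_bucket(s3_client, bucket_name, region):
    """Create the bucket if it is not in the account's list. Return True if created."""
    existing = [b["Name"] for b in s3_client.list_buckets()["Buckets"]]
    if bucket_name in existing:
        return False
    s3_client.create_bucket(**bucket_kwargs(bucket_name, region))
    return True


class FakeS3:
    """In-memory stand-in for the boto3 S3 client, for illustration only."""

    def __init__(self):
        self.names = []

    def list_buckets(self):
        return {"Buckets": [{"Name": name} for name in self.names]}

    def create_bucket(self, **kwargs):
        self.names.append(kwargs["Bucket"])


fake = FakeS3()
first = ensure_bucket(fake, "example-bucket", "ap-south-1")
second = ensure_bucket(fake, "example-bucket", "ap-south-1")
print(first, second)
```

Note that `list_buckets` only returns buckets owned by the calling account, so a name already taken by another account would still cause `create_bucket` to fail.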
