Dataset used: COVID-19 Data Lake - https://registry.opendata.aws/aws-covid19-lake/
Tutorial: COVID 19 - Build End to End Data Engineering Project | PART 1 - https://www.youtube.com/watch?v=gFWu-SSzRzc&t=464s

In [None]:
#%pip install --user boto3
#%pip install --user awscli

In [1]:
import boto3
import pandas as pd
#import psycopg2 
#import json
import os
import configparser
import pprint
from io import StringIO
from botocore.exceptions import ClientError
#import subprocess
import time

In [2]:
config = configparser.ConfigParser()
with open('cluster.config') as config_file:
    config.read_file(config_file)

In [3]:
AWS_KEY = os.environ.get('AWS_KEY')
AWS_SECRET = os.environ.get('AWS_SECRET')
AWS_REGION = config.get("AWS","AWS_REGION")

SCHEMA_NAME = config.get("PROJECT","SCHEMA_NAME")

S3_ACL = config.get("S3","S3_ACL")
S3_LOCATION = config.get("S3","S3_LOCATION")
S3_BUCKET_NAME = config.get("S3","S3_BUCKET_NAME")
S3_STAGING_DIR = config.get("S3","S3_STAGING_DIR")
S3_OUTPUT_DIRECTORY = config.get("S3","S3_OUTPUT_DIRECTORY")

CRAWLER_ROLE = config.get("CRAWLER","CRAWLER_ROLE")
CRAWLER_OUTPUT = config.get("CRAWLER","CRAWLER_OUTPUT")
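
The cell above expects a cluster.config file with four sections. The sketch below shows one possible layout and a small check for missing keys. The section and key names come from the config.get calls above; every value (region, bucket name, role ARN, schema name) is a made-up placeholder, and the helper names load_config and missing_keys are hypothetical.

```python
import configparser

# Hypothetical contents of cluster.config; every value is a placeholder.
SAMPLE_CONFIG = """
[AWS]
AWS_REGION = eu-west-1

[PROJECT]
SCHEMA_NAME = covid19_db

[S3]
S3_ACL = private
S3_LOCATION = eu-west-1
S3_BUCKET_NAME = my-covid19-bucket
S3_STAGING_DIR = s3://my-covid19-bucket/output/
S3_OUTPUT_DIRECTORY = output

[CRAWLER]
CRAWLER_ROLE = arn:aws:iam::123456789012:role/my-glue-role
CRAWLER_OUTPUT = covid19_db
"""

# Keys the notebook reads with config.get(section, key).
REQUIRED_KEYS = {
    "AWS": ["AWS_REGION"],
    "PROJECT": ["SCHEMA_NAME"],
    "S3": ["S3_ACL", "S3_LOCATION", "S3_BUCKET_NAME",
           "S3_STAGING_DIR", "S3_OUTPUT_DIRECTORY"],
    "CRAWLER": ["CRAWLER_ROLE", "CRAWLER_OUTPUT"],
}


def load_config(text):
    """Parse configuration text into a ConfigParser instance."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def missing_keys(parser):
    """Return 'SECTION.KEY' for every required key that is absent."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            if not parser.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


sample = load_config(SAMPLE_CONFIG)
print(missing_keys(sample))  # prints []
```

Running a check like this before any AWS call gives a readable list of what is missing instead of a NoOptionError halfway through the cell.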

1st Step - Create S3 Bucket using Python

In [4]:
s3_client = boto3.client('s3', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET,
                region_name=AWS_REGION)

try:
    s3_client.head_bucket(Bucket=S3_BUCKET_NAME)
    bucket_exists = 'Yes'
    print("The bucket exists")
except ClientError:
    bucket_exists = 'No'
    print("The bucket does not exist or you do not have access to it")

The bucket exists
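
The except branch above treats a missing bucket and an inaccessible bucket the same way. A minimal sketch of how the two cases could be told apart from the error payload is shown below. It assumes head_bucket reports the HTTP status ("404" or "403") as the error code, and the function name classify_head_bucket_error is hypothetical.

```python
def classify_head_bucket_error(error_response):
    """Map the error payload of a failed head_bucket call to a short label."""
    code = str(error_response.get("Error", {}).get("Code", ""))
    if code in ("404", "NoSuchBucket"):
        return "missing"      # safe to try creating the bucket
    if code in ("403", "AccessDenied"):
        return "forbidden"    # the name is taken or permissions are lacking
    return "unknown"


# With boto3 this would be called inside `except ClientError as err:`
# as classify_head_bucket_error(err.response).
print(classify_head_bucket_error({"Error": {"Code": "404"}}))  # prints missing
```

Only the "missing" case should lead to create_bucket; in the "forbidden" case creating the bucket would fail as well.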


In [5]:
try:
    if bucket_exists == 'No':
        create_bucket = s3_client.create_bucket(Bucket=S3_BUCKET_NAME, 
                    ACL=S3_ACL, 
                    CreateBucketConfiguration = {'LocationConstraint': S3_LOCATION})
        pprint.pprint(create_bucket)  # pprint prints directly and returns None
except ClientError as e:
    print(e)
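
One caveat with the call above: to my knowledge S3 rejects an explicit LocationConstraint of us-east-1, so the CreateBucketConfiguration argument has to be left out for that region. The sketch below builds the keyword arguments accordingly; build_create_bucket_kwargs is a hypothetical helper and the bucket name is a placeholder.

```python
def build_create_bucket_kwargs(bucket_name, acl, region):
    """Build keyword arguments for s3_client.create_bucket."""
    kwargs = {"Bucket": bucket_name, "ACL": acl}
    # Assumption: us-east-1 is the default region and must not be passed
    # as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(build_create_bucket_kwargs("my-covid19-bucket", "private", "us-east-1"))
# With boto3: s3_client.create_bucket(**build_create_bucket_kwargs(...))
```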

2nd Step - Copy Dataset into the created Bucket
Dataset chosen: https://registry.opendata.aws/aws-covid19-lake/

In [27]:
InfosToCopy = pd.read_csv('dataInfosCopy2.csv', delimiter=";", header=None)
InfosToCopy.head(10)

Unnamed: 0,0,1,2,3,4
0,covid19-lake,enigma-jhu/csv/Enigma-JHU.csv.gz,enigma_jhu/csv/Enigma_JHU.csv.gz,enigma_jhu/,enigma_jhu
1,covid19-lake,enigma-nytimes-data-in-usa/csv/us_county/us_co...,enigma_nytimes_data_in_usa/csv/us_county/us_co...,enigma_nytimes_data_in_usa/csv/us_county/,us_county
2,covid19-lake,enigma-nytimes-data-in-usa/csv/us_states/us_st...,enigma_nytimes_data_in_usa/csv/us_states/us_st...,enigma_nytimes_data_in_usa/csv/us_states/,us_states
3,covid19-lake,rearc-covid-19-testing-data/csv/states_daily/s...,rearc_covid_19_testing_data/csv/states_daily/s...,rearc_covid_19_testing_data/csv/states_daily/,states_daily
4,covid19-lake,rearc-covid-19-testing-data/csv/us-total-lates...,rearc_covid_19_testing_data/csv/us_total_lates...,rearc_covid_19_testing_data/csv/us_total_latest/,us_total_latest
5,covid19-lake,rearc-covid-19-testing-data/csv/us_daily/us_da...,rearc_covid_19_testing_data/csv/us_daily/us_da...,rearc_covid_19_testing_data/csv/us_daily/,us_daily
6,covid19-lake,rearc-usa-hospital-beds/json/usa-hospital-beds...,rearc_usa_hospital_beds/json/usa_hospital_beds...,rearc_usa_hospital_beds/,rearc_usa_hospital_beds
7,covid19-lake,static-datasets/csv/countrycode/CountryCodeQS.csv,static_datasets/csv/countrycode/CountryCodeQS.csv,static_datasets/csv/countrycode/,countrycode
8,covid19-lake,static-datasets/csv/CountyPopulation/County_Po...,static_datasets/csv/CountyPopulation/County_Po...,static_datasets/csv/CountyPopulation/,CountyPopulation
9,covid19-lake,static-datasets/csv/state-abv/states_abv.csv,static_datasets/csv/state_abv/states_abv.csv,static_datasets/csv/state_abv/,state_abv
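
In the rows that are shown in full, the destination key (column 2) looks like the source key (column 1) with hyphens replaced by underscores, presumably so the resulting folder and table names are easy to query. The sketch below reproduces that rule; it is inferred from the visible rows only, the truncated rows cannot be checked this way, and to_destination_key is a hypothetical name.

```python
def to_destination_key(source_key):
    """Replace hyphens with underscores, as the complete rows above suggest."""
    return source_key.replace("-", "_")


print(to_destination_key("static-datasets/csv/state-abv/states_abv.csv"))
# prints static_datasets/csv/state_abv/states_abv.csv
```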


Copy the selected files from the open dataset into the S3 bucket, using the destination keys from column 2 of the table above.

In [28]:
s3_resource = boto3.resource('s3', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET)

for position in range(len(InfosToCopy)):
    s3_resource.Bucket(S3_BUCKET_NAME).copy({"Bucket": InfosToCopy.iat[position,0], "Key": InfosToCopy.iat[position,1]}, InfosToCopy.iat[position,2])
    print(position, "-", InfosToCopy.iat[position,2], "- Downloaded")
    #s3_resource.Bucket(S3_BUCKET_NAME).copy({"Bucket": InfosToCopy.iat[position,0], "Key": InfosToCopy.iat[position,1]}, InfosToCopy.iat[position,1])
    #print(position, "-", InfosToCopy.iat[position,1], "- Downloaded")

0 - enigma_jhu/csv/Enigma_JHU.csv.gz - Downloaded
1 - enigma_nytimes_data_in_usa/csv/us_county/us_county.csv - Downloaded
2 - enigma_nytimes_data_in_usa/csv/us_states/us_states.csv - Downloaded
3 - rearc_covid_19_testing_data/csv/states_daily/states_daily.csv - Downloaded
4 - rearc_covid_19_testing_data/csv/us_total_latest/us.csv - Downloaded
5 - rearc_covid_19_testing_data/csv/us_daily/us_daily.csv - Downloaded
6 - rearc_usa_hospital_beds/json/usa_hospital_beds.geojson - Downloaded
7 - static_datasets/csv/countrycode/CountryCodeQS.csv - Downloaded
8 - static_datasets/csv/CountyPopulation/County_Population.csv - Downloaded
9 - static_datasets/csv/state_abv/states_abv.csv - Downloaded


https://github.com/oovk/dataengg-covid19-aws/blob/main/covid19_project.ipynb

3rd Step - Set up the Glue crawlers to understand the data: its columns, types and size.
Understanding the data comes first; the data model is then built from it.

In [29]:
#https://gist.github.com/ejlp12/30d67c07bf9e46b98a350569976f08aa
glue_client = boto3.client('glue', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET,
                region_name=AWS_REGION)

for position in range(len(InfosToCopy)):
    CrawlerName = InfosToCopy.iat[position,4] + "_crawler"
    CrawlerPath = "s3://" + S3_BUCKET_NAME + "/" + InfosToCopy.iat[position,3]
    
    CrawlerCreation = glue_client.create_crawler(
        Name=CrawlerName,
        Role=CRAWLER_ROLE,
        DatabaseName=SCHEMA_NAME,
        Description='Crawler for generated tables in the bucket',
        Targets={
            'S3Targets': [
                {
                    'Path': CrawlerPath,
                    'Exclusions': [
                    ]
                },
            ]
        },
        SchemaChangePolicy={
            'UpdateBehavior': 'UPDATE_IN_DATABASE',
            'DeleteBehavior': 'DELETE_FROM_DATABASE'
        }
    )

    CrawlerStart = glue_client.start_crawler(
        Name=CrawlerName
    )
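
start_crawler returns as soon as the crawler is launched, so the tables may not exist yet when the Athena queries run. A minimal sketch of waiting for a crawler is shown below. It assumes get_crawler reports a State of RUNNING, STOPPING or READY; wait_for_crawler and FakeGlueClient are hypothetical names, and the fake client only stands in for glue_client so the sketch runs without AWS access.

```python
import time


def wait_for_crawler(client, name, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_crawler until the crawler is back in the READY state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawler {name} still {state} after timeout")
        time.sleep(poll_seconds)


class FakeGlueClient:
    """Stand-in for the Glue client that replays a fixed list of states."""

    def __init__(self, states):
        self._states = list(states)
        self.calls = 0

    def get_crawler(self, Name):
        self.calls += 1
        # Consume states one by one, then keep returning the last one.
        state = self._states.pop(0) if len(self._states) > 1 else self._states[0]
        return {"Crawler": {"Name": Name, "State": state}}


fake = FakeGlueClient(["RUNNING", "STOPPING", "READY"])
wait_for_crawler(fake, "enigma_jhu_crawler", poll_seconds=0)
print(fake.calls)  # prints 3
```

READY only means the crawler has stopped; whether the run succeeded would still need to be checked separately, for example in the Glue console.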
    

4th Step - Build the relational data model

5th Step - Connect to Athena and query data

In [30]:
athena_client = boto3.client('athena', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET,
                region_name=AWS_REGION
                )

In [31]:
from typing import Any, Dict

def download_and_load_query_results(
    client: Any, query_response: Dict[str, Any]
) -> pd.DataFrame:
    # Poll Athena until the results of this query can be fetched.
    while True:
        try:
            client.get_query_results(
                QueryExecutionId=query_response["QueryExecutionId"]
            )
            break
        except Exception as err:
            if "not yet finished" in str(err):
                time.sleep(1)  # a 1 ms pause would flood the API with requests
            else:
                raise
    # Athena stores the result as <QueryExecutionId>.csv in the output location.
    temp_file_location: str = "athena_query_results.csv"
    s3_client.download_file(
        S3_BUCKET_NAME,
        f"{S3_OUTPUT_DIRECTORY}/{query_response['QueryExecutionId']}.csv",
        temp_file_location,
    )
    return pd.read_csv(temp_file_location)
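
The helper above detects completion by matching an error message. An alternative sketch, assuming the Athena get_query_execution call and its documented states (QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED), polls the query status directly and raises when the query fails. wait_for_query and FakeAthenaClient are hypothetical names; the fake client replaces athena_client so the sketch runs offline.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, poll_seconds=1.0):
    """Block until the query reaches a terminal state; raise unless it succeeded."""
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_execution_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state in TERMINAL_STATES:
            if state != "SUCCEEDED":
                reason = status.get("StateChangeReason", "no reason given")
                raise RuntimeError(f"query {query_execution_id} {state}: {reason}")
            return state
        time.sleep(poll_seconds)


class FakeAthenaClient:
    """Stand-in for the Athena client that replays a fixed list of states."""

    def __init__(self, states):
        self._states = list(states)

    def get_query_execution(self, QueryExecutionId):
        state = self._states.pop(0) if len(self._states) > 1 else self._states[0]
        return {"QueryExecution": {"Status": {"State": state}}}


fake_athena = FakeAthenaClient(["QUEUED", "RUNNING", "SUCCEEDED"])
print(wait_for_query(fake_athena, "example-id", poll_seconds=0))  # prints SUCCEEDED
```

With this approach a failed query surfaces as an exception with Athena's reason instead of an endless loop or a missing CSV file.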

In [39]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM enigma_jhu",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
enigma_jhu = download_and_load_query_results(athena_client, response)
enigma_jhu.head()

Unnamed: 0,fips,admin2,province_state,country_region,last_update,latitude,longitude,confirmed,deaths,recovered,active,combined_key,partition_0
0,,,Anhui,China,2020-01-22T17:00:00,31.826,117.226,1.0,,,,"""Anhui",csv
1,,,Beijing,China,2020-01-22T17:00:00,40.182,116.414,14.0,,,,"""Beijing",csv
2,,,Chongqing,China,2020-01-22T17:00:00,30.057,107.874,6.0,,,,"""Chongqing",csv
3,,,Fujian,China,2020-01-22T17:00:00,26.079,117.987,1.0,,,,"""Fujian",csv
4,,,Gansu,China,2020-01-22T17:00:00,36.061,103.834,,,,,"""Gansu",csv
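
The same start_query_execution call is repeated below for every table with only the table name changing. A possible way to express that once is sketched here: a hypothetical build_query_request helper builds the arguments, and the table names are the ones queried in this notebook. The schema name and S3 location in the example are placeholders.

```python
import re

# Table names queried in this notebook.
TABLE_NAMES = [
    "enigma_jhu", "us_county", "us_states", "states_daily",
    "us_total_latest", "us_daily", "rearc_usa_hospital_beds",
    "countrycode", "CountyPopulation", "state_abv",
]


def build_query_request(table_name, database, output_location):
    """Build keyword arguments for athena_client.start_query_execution."""
    # Reject anything that is not a plain identifier before it reaches the SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return {
        "QueryString": f"SELECT * FROM {table_name}",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": output_location,
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
    }


requests = {
    name: build_query_request(name, "my_schema", "s3://my-bucket/output/")
    for name in TABLE_NAMES
}
print(requests["us_county"]["QueryString"])  # prints SELECT * FROM us_county

# In the notebook, each request could then be run as:
# response = athena_client.start_query_execution(**requests["us_county"])
# us_county = download_and_load_query_results(athena_client, response)
```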


In [40]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM us_county",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
us_county = download_and_load_query_results(athena_client, response)
us_county.head()

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-03-22,St. Charles,Missouri,29183.0,3,0
1,2020-03-22,St. Louis,Missouri,29189.0,55,1
2,2020-03-22,St. Louis city,Missouri,29510.0,14,0
3,2020-03-22,Unknown,Missouri,,1,0
4,2020-03-22,Broadwater,Montana,30007.0,1,0


In [41]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM us_states",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
us_states = download_and_load_query_results(athena_client, response)
us_states.head()

Unnamed: 0,date,state,fips,cases,deaths
0,2020-01-21,Washington,53,1,0
1,2020-01-22,Washington,53,1,0
2,2020-01-23,Washington,53,1,0
3,2020-01-24,Illinois,17,1,0
4,2020-01-24,Washington,53,1,0


In [42]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM states_daily",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
states_daily = download_and_load_query_results(athena_client, response)
states_daily.head()

Unnamed: 0,date,state,positive,probablecases,negative,pending,totaltestresultssource,totaltestresults,hospitalizedcurrently,hospitalizedcumulative,...,dataqualitygrade,deathincrease,hospitalizedincrease,hash,commercialscore,negativeregularscore,negativescore,positivescore,score,grade
0,20210220,UT,366034.0,,1507875.0,,totalTestsViral,2788882.0,260.0,14421.0,...,,8,39,70f3e22ea3d10f99d5f3c09c55ba95fa1b8aaabb,0,0,0,0,0,
1,20210220,VA,561812.0,117662.0,,195.0,totalTestEncountersViral,5728208.0,1594.0,23436.0,...,,99,67,75d813bab6075e36b3ed1d3bbbfe18f6692e3959,0,0,0,0,0,
2,20210220,VI,2575.0,,43564.0,108.0,posNeg,46139.0,,,...,,0,0,7ca160663de572688bb23d17943b6f59863f5fd0,0,0,0,0,0,
3,20210220,VT,14359.0,411.0,309335.0,,totalTestsViral,1009285.0,39.0,,...,,3,0,5156647b94cb2e59c9e4e26be1943e4827a99f13,0,0,0,0,0,
4,20210220,WA,332904.0,17485.0,,,totalTestEncountersViral,5048054.0,608.0,18969.0,...,,19,35,8150e925fc2fb429eeb347109e52f7b99ba00f17,0,0,0,0,0,


In [43]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM us_total_latest",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
us_total_latest = download_and_load_query_results(athena_client, response)
us_total_latest.head()

Unnamed: 0,positive,negative,pending,hospitalizedcurrently,hospitalizedcumulative,inicucurrently,inicucumulative,onventilatorcurrently,onventilatorcumulative,recovered,hash,lastmodified,death,hospitalized,total,totaltestresults,posneg,notes
0,1061101,5170081,2775,53793,111955,9486,4192,4712,373,153947,95064ba29ccbc20dbec397033dfe4b1f45137c99,2020-05-01T09:12:31.891Z,57266,111955,6233957,6231182,6231182,"""NOTE: """"total"""""


In [44]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM us_daily",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
us_daily = download_and_load_query_results(athena_client, response)
us_daily.head()

Unnamed: 0,date,states,positive,negative,pending,hospitalizedcurrently,hospitalizedcumulative,inicucurrently,inicucumulative,onventilatorcurrently,...,lastmodified,recovered,total,posneg,deathincrease,hospitalizedincrease,negativeincrease,positiveincrease,totaltestresultsincrease,hash
0,20210307,56,28755524.0,74579770.0,11808.0,40212.0,878613.0,8137.0,45475.0,2801.0,...,2021-03-07T24:00:00Z,,0,0,839,726,130414,41265,1156241,8b26839690cd05c0cef69cb9ed85641a76b5e78e
1,20210306,56,28714259.0,74449356.0,11783.0,41401.0,877887.0,8409.0,45453.0,2811.0,...,2021-03-06T24:00:00Z,,0,0,1674,503,142201,59620,1409138,d0c0482ea549c9d5c04a7c86acb6fc6a8095a592
2,20210305,56,28654639.0,74307155.0,12213.0,42541.0,877384.0,8634.0,45373.0,2889.0,...,2021-03-05T24:00:00Z,,0,0,2221,2781,271917,68787,1744417,a35ea4289cec4bb55c9f29ae04ec0fd5ac4e0222
3,20210304,56,28585852.0,74035238.0,12405.0,44172.0,874603.0,8970.0,45293.0,2973.0,...,2021-03-04T24:00:00Z,,0,0,1743,1530,177957,65487,1590984,a19ad6379a653834cbda3093791ad2c3b9fab5ff
4,20210303,56,28520365.0,73857281.0,11778.0,45462.0,873073.0,9359.0,45214.0,3094.0,...,2021-03-03T24:00:00Z,,0,0,2449,2172,267001,66836,1406795,9e1d2afda1b0ec243060d6f68a7134d011c0cb2a


In [45]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM rearc_usa_hospital_beds",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
rearc_usa_hospital_beds = download_and_load_query_results(athena_client, response)
rearc_usa_hospital_beds.head()

Unnamed: 0,objectid,hospital_name,hospital_type,hq_address,hq_address1,hq_city,hq_state,hq_zip_code,county_name,state_name,...,num_staffed_beds,num_icu_beds,adult_icu_beds,pedi_icu_beds,bed_utilization,avg_ventilator_usage,potential_increase_in_bed_capac,latitude,longtitude,partition_0
0,1958,Henry Ford Hospital,Short Term Acute Care Hospital,2799 W Grand Blvd,,Detroit,MI,48202,Wayne,Michigan,...,693.0,128,128,35.0,0.803302,28.0,184,42.367385,-83.085375,json
1,1959,Beaumont Hospital - Dearborn (FKA Oakwood Hosp...,Short Term Acute Care Hospital,18101 Oakwood Blvd,,Dearborn,MI,48124,Wayne,Michigan,...,567.0,17,17,30.0,0.810983,25.0,65,42.2917,-83.2115,json
2,1960,DMC Sinai-Grace Hospital,Short Term Acute Care Hospital,6071 W Outer Dr,,Detroit,MI,48235,Wayne,Michigan,...,325.0,39,39,19.0,0.668527,23.0,58,42.419107,-83.182279,json
3,1961,St Mary Mercy Livonia Hospital,Short Term Acute Care Hospital,36475 5 Mile Rd,,Livonia,MI,48154,Wayne,Michigan,...,252.0,16,16,0.0,0.666895,12.0,21,42.3943,-83.4043,json
4,1962,DMC Harper University Hospital,Short Term Acute Care Hospital,3990 John R St,,Detroit,MI,48201,Wayne,Michigan,...,368.0,34,34,32.0,0.63753,12.0,102,42.352026,-83.057159,json


In [46]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM countrycode",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
countrycode = download_and_load_query_results(athena_client, response)
countrycode.head()

Unnamed: 0,country,alpha-2 code,alpha-3 code,numeric code,latitude,longitude
0,Afghanistan,AF,AFG,4.0,33.0,65.0
1,Albania,AL,ALB,8.0,41.0,20.0
2,Algeria,DZ,DZA,12.0,28.0,3.0
3,American Samoa,AS,ASM,16.0,-14.3333,-170.0
4,Andorra,AD,AND,20.0,42.5,1.6


In [47]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM CountyPopulation",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
CountyPopulation = download_and_load_query_results(athena_client, response)
CountyPopulation.head()

Unnamed: 0,id,id2,county,state,population estimate 2018
0,0500000US01001,1001,Autauga,Alabama,55601
1,0500000US01003,1003,Baldwin,Alabama,218022
2,0500000US01005,1005,Barbour,Alabama,24881
3,0500000US01007,1007,Bibb,Alabama,22400
4,0500000US01009,1009,Blount,Alabama,57840


In [51]:
response = athena_client.start_query_execution(
    QueryString = "SELECT * FROM state_abv",
    QueryExecutionContext = {"Database": SCHEMA_NAME},
    ResultConfiguration = {
        "OutputLocation": S3_STAGING_DIR,
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)
state_abv = download_and_load_query_results(athena_client, response)
state_abv.head()

Unnamed: 0,col0,col1
0,State,Abbreviation
1,Alabama,AL
2,Alaska,AK
3,Arizona,AZ
4,Arkansas,AR


In [52]:
new_header = state_abv.iloc[0] #grab the first row for the header
state_abv = state_abv[1:] #take the data less the header row
state_abv.columns = new_header #set the header row as the df header
state_abv.head()

Unnamed: 0,State,Abbreviation
1,Alabama,AL
2,Alaska,AK
3,Arizona,AZ
4,Arkansas,AR
5,California,CA


6th Step - ETL job in Python

7th Step - Save result to S3

8th Step - Build tables on Redshift

9th Step - Copy data to Redshift