Dataset used: COVID-19 Data Lake - https://registry.opendata.aws/aws-covid19-lake/
Tutorial: COVID 19 - Build End to End Data Engineering Project | PART 1 - https://www.youtube.com/watch?v=gFWu-SSzRzc&t=464s

In [1]:
#%pip install --user boto3
#%pip install --user awscli

In [10]:
import boto3
import pandas as pd
import psycopg2 
import json
import os
import configparser
import pprint
from botocore.exceptions import ClientError
import subprocess
import time

In [11]:
config = configparser.ConfigParser()
with open('cluster.config') as config_file:
    config.read_file(config_file)

In [12]:
AWS_KEY = os.environ.get('AWS_KEY')
AWS_SECRET = os.environ.get('AWS_SECRET')
AWS_REGION = config.get("AWS","AWS_REGION")

S3_BUCKET_NAME = config.get("S3","S3_BUCKET_NAME")
S3_ACL = config.get("S3","S3_ACL")
S3_LOCATION = config.get("S3","S3_LOCATION")

CRAWLER_DATABASE_NAME = config.get("CRAWLER","CRAWLER_DATABASE_NAME")
CRAWLER_ROLE = config.get("CRAWLER","CRAWLER_ROLE")
CRAWLER_OUTPUT = config.get("CRAWLER","CRAWLER_OUTPUT")
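For reference, the reads above imply a cluster.config with three sections (AWS, S3, CRAWLER). The sketch below parses an in-memory example with the same keys. Every value in it is a placeholder assumption, since the real file is not shown in this notebook.

```python
import configparser

# Placeholder values only: the section and key names come from the cell above,
# the values are invented for illustration.
EXAMPLE_CONFIG = """
[AWS]
AWS_REGION = eu-west-1

[S3]
S3_BUCKET_NAME = my-example-bucket
S3_ACL = private
S3_LOCATION = eu-west-1

[CRAWLER]
CRAWLER_DATABASE_NAME = example_db
CRAWLER_ROLE = ExampleGlueRole
CRAWLER_OUTPUT = example-output
"""

example = configparser.ConfigParser()
example.read_string(EXAMPLE_CONFIG)

# Same access pattern as the notebook cell.
print(example.sections())
print(example.get("S3", "S3_BUCKET_NAME"))
```

Keeping AWS_KEY and AWS_SECRET in environment variables rather than in this file, as the cell above does, keeps the credentials out of version control.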

1st Step - Create S3 Bucket using Python

In [17]:
s3_client = boto3.client('s3', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET,
                region_name=AWS_REGION)

try:
    s3_client.head_bucket(Bucket=S3_BUCKET_NAME)
    bucket_exists = 'Yes'
    print("The bucket exists")
except ClientError:
    bucket_exists = 'No'
    print("The bucket does not exist or you have no access")

The bucket does not exist or you have no access
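The check above treats every ClientError the same way. A minimal sketch of how the two cases could be told apart is shown below, assuming the error code carried in `ClientError.response` is "404" for a missing bucket and "403" for a bucket you cannot access, which is what head_bucket typically reports. The helper name `classify_head_bucket_error` is hypothetical.

```python
def classify_head_bucket_error(error_response: dict) -> str:
    """Map the .response dict of a head_bucket ClientError to a short status.

    Hypothetical helper: the codes checked here are assumptions about what
    S3 returns, so unknown codes fall through to "unknown".
    """
    code = str(error_response.get("Error", {}).get("Code", ""))
    if code in ("404", "NoSuchBucket"):
        return "missing"
    if code in ("403", "AccessDenied"):
        return "forbidden"
    return "unknown"


# Intended use inside the notebook (not executed here):
#   except ClientError as e:
#       status = classify_head_bucket_error(e.response)
print(classify_head_bucket_error({"Error": {"Code": "404"}}))
print(classify_head_bucket_error({"Error": {"Code": "403"}}))
```

Only the "missing" case should lead to a create_bucket call; a "forbidden" result usually means the name is already taken by another account.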


In [18]:
try:
    if bucket_exists == 'No':
        create_bucket = s3_client.create_bucket(Bucket=S3_BUCKET_NAME, 
                    ACL=S3_ACL, 
                    CreateBucketConfiguration = {'LocationConstraint': S3_LOCATION})
        pprint.pprint(create_bucket)
except ClientError as e:
    print(e)

{'Location': 'http://hfelipini-covid19-crawler-output.s3.amazonaws.com/',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
                                      'date': 'Sun, 11 Dec 2022 19:03:47 GMT',
                                      'location': 'http://hfelipini-covid19-crawler-output.s3.amazonaws.com/',
                                      'server': 'AmazonS3',
                                      'x-amz-id-2': '9N4bvYonvfduXfYQiiyF/AupxIrCUTkXMcFs2ilieq8XViw4AyYMEEHEMuG8/6q2ucUu+5Qq0h0=',
                                      'x-amz-request-id': 'XWEXR5QG2STKNWX1'},
                      'HTTPStatusCode': 200,
                      'HostId': '9N4bvYonvfduXfYQiiyF/AupxIrCUTkXMcFs2ilieq8XViw4AyYMEEHEMuG8/6q2ucUu+5Qq0h0=',
                      'RequestId': 'XWEXR5QG2STKNWX1',
                      'RetryAttempts': 0}}
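One caveat with the call above: S3 rejects an explicit LocationConstraint of us-east-1, so the CreateBucketConfiguration argument has to be left out for that region. A minimal sketch of building the arguments conditionally follows; the helper name `build_create_bucket_kwargs` and the example values are hypothetical.

```python
def build_create_bucket_kwargs(bucket_name: str, acl: str, location: str) -> dict:
    """Build keyword arguments for s3_client.create_bucket.

    us-east-1 is the default region and must not be passed as a
    LocationConstraint, so the configuration is only added elsewhere.
    """
    kwargs = {"Bucket": bucket_name, "ACL": acl}
    if location != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": location}
    return kwargs


# Placeholder values for illustration.
print(build_create_bucket_kwargs("my-example-bucket", "private", "eu-west-1"))
print(build_create_bucket_kwargs("my-example-bucket", "private", "us-east-1"))
```

In the notebook this would be used as `s3_client.create_bucket(**build_create_bucket_kwargs(S3_BUCKET_NAME, S3_ACL, S3_LOCATION))`.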


2nd Step - Copy Dataset into the created Bucket
Dataset chosen: https://registry.opendata.aws/aws-covid19-lake/

In [15]:
InfosToCopy = pd.read_csv('dataInfosCopy.csv', delimiter=";", header=None)
InfosToCopy.head(10)

Unnamed: 0,0,1,2,3
0,covid19-lake,enigma-jhu/csv/Enigma-JHU.csv.gz,enigma-jhu/,enigma-jhu
1,covid19-lake,enigma-nytimes-data-in-usa/csv/us_county/us_co...,enigma-nytimes-data-in-usa/csv/us_county/,us_county
2,covid19-lake,enigma-nytimes-data-in-usa/csv/us_states/us_st...,enigma-nytimes-data-in-usa/csv/us_states/,us_states
3,covid19-lake,rearc-covid-19-testing-data/csv/states_daily/s...,rearc-covid-19-testing-data/csv/states_daily/,states_daily
4,covid19-lake,rearc-covid-19-testing-data/csv/us-total-lates...,rearc-covid-19-testing-data/csv/us-total-latest/,us-total-latest
5,covid19-lake,rearc-covid-19-testing-data/csv/us_daily/us_da...,rearc-covid-19-testing-data/csv/us_daily/,us_daily
6,covid19-lake,rearc-usa-hospital-beds/json/usa-hospital-beds...,rearc-usa-hospital-beds/,rearc-usa-hospital-beds
7,covid19-lake,static-datasets/csv/countrycode/CountryCodeQS.csv,static-datasets/csv/countrycode/,countrycode
8,covid19-lake,static-datasets/csv/CountyPopulation/County_Po...,static-datasets/csv/CountyPopulation/,CountyPopulation
9,covid19-lake,static-datasets/csv/state-abv/states_abv.csv,static-datasets/csv/state-abv/,state-abv
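The table above suggests that dataInfosCopy.csv has four semicolon-separated columns and no header row. The sketch below parses two of the rows shown above with the standard csv module, so it runs without pandas. The field names (`source_bucket`, `source_key`, `crawler_prefix`, `crawler_name`) are labels assumed here for readability; the file itself only has positional columns 0 to 3.

```python
import csv
import io
from collections import namedtuple

# Assumed labels for the four positional columns of dataInfosCopy.csv.
CopyInfo = namedtuple(
    "CopyInfo", ["source_bucket", "source_key", "crawler_prefix", "crawler_name"]
)

# Two rows taken from the table above, in the file's ";"-delimited layout.
SAMPLE = (
    "covid19-lake;enigma-jhu/csv/Enigma-JHU.csv.gz;enigma-jhu/;enigma-jhu\n"
    "covid19-lake;static-datasets/csv/state-abv/states_abv.csv;"
    "static-datasets/csv/state-abv/;state-abv\n"
)

rows = [CopyInfo(*fields) for fields in csv.reader(io.StringIO(SAMPLE), delimiter=";")]

for row in rows:
    print(row.crawler_name, "<-", row.source_key)
```

Columns 0 and 1 drive the copy in the next cell, while columns 2 and 3 are used later to build each crawler's S3 path and name.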


Copy the selected files from the open dataset into the S3 bucket, keeping their original folder paths.

In [8]:
s3_resource = boto3.resource('s3', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET)

for position in range(len(InfosToCopy)):
    source_bucket = InfosToCopy.iat[position,0]
    source_key = InfosToCopy.iat[position,1]
    # Server-side copy that reuses the source key as the destination key.
    s3_resource.Bucket(S3_BUCKET_NAME).copy({"Bucket": source_bucket, "Key": source_key}, source_key)
    print(position, "-", source_key, "- Copied")

0 - enigma-jhu/csv/Enigma-JHU.csv.gz - Copied
1 - enigma-nytimes-data-in-usa/csv/us_county/us_county.csv - Copied
2 - enigma-nytimes-data-in-usa/csv/us_states/us_states.csv - Copied
3 - rearc-covid-19-testing-data/csv/states_daily/states_daily.csv - Copied
4 - rearc-covid-19-testing-data/csv/us-total-latest/us.csv - Copied
5 - rearc-covid-19-testing-data/csv/us_daily/us_daily.csv - Copied
6 - rearc-usa-hospital-beds/json/usa-hospital-beds.geojson - Copied
7 - static-datasets/csv/countrycode/CountryCodeQS.csv - Copied
8 - static-datasets/csv/CountyPopulation/County_Population.csv - Copied
9 - static-datasets/csv/state-abv/states_abv.csv - Copied
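The copy loop can also be wrapped in a small function that takes the destination bucket as a parameter, which makes it possible to try the logic offline. This is only a sketch: `copy_objects` and `FakeBucket` are hypothetical names, and the fake stands in for boto3's Bucket resource, whose `copy(CopySource, Key)` method is the one used in the cell above.

```python
def copy_objects(dest_bucket, rows):
    """Copy each (source_bucket, source_key) pair into dest_bucket.

    The destination key is the same as the source key, so the folder
    structure of the source bucket is preserved.
    """
    copied = []
    for source_bucket, source_key in rows:
        copy_source = {"Bucket": source_bucket, "Key": source_key}
        dest_bucket.copy(copy_source, source_key)
        copied.append(source_key)
    return copied


class FakeBucket:
    """Offline stand-in for a boto3 Bucket resource; it only records calls."""

    def __init__(self):
        self.calls = []

    def copy(self, copy_source, key):
        self.calls.append((copy_source, key))


fake = FakeBucket()
copied = copy_objects(fake, [("covid19-lake", "enigma-jhu/csv/Enigma-JHU.csv.gz")])
print(copied)
print(fake.calls)
```

With the real objects from this notebook, the call would look like `copy_objects(s3_resource.Bucket(S3_BUCKET_NAME), zip(InfosToCopy[0], InfosToCopy[1]))`.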


Reference notebook: https://github.com/oovk/dataengg-covid19-aws/blob/main/covid19_project.ipynb

3rd Step - Set up the Glue crawlers to understand the data: how many columns and rows each dataset has.
Understanding the data comes first; the data model is then built from it.

In [19]:
#https://gist.github.com/ejlp12/30d67c07bf9e46b98a350569976f08aa
glue_client = boto3.client('glue', 
                aws_access_key_id=AWS_KEY,
                aws_secret_access_key=AWS_SECRET,
                region_name=AWS_REGION)

for position in range(len(InfosToCopy)):
    # One crawler per dataset: column 3 holds its name, column 2 the S3 prefix to crawl.
    CrawlerName = InfosToCopy.iat[position,3] + "_crawler"
    CrawlerPath = "s3://" + S3_BUCKET_NAME + "/" + InfosToCopy.iat[position,2]
    
    CrawlerCreation = glue_client.create_crawler(
        Name=CrawlerName,
        Role=CRAWLER_ROLE,
        DatabaseName=CRAWLER_DATABASE_NAME,
        Description='Crawler for generated CovidSchema',
        Targets={
            'S3Targets': [
                {
                    'Path': CrawlerPath,
                    'Exclusions': [
                    ]
                },
            ]
        },
        SchemaChangePolicy={
            'UpdateBehavior': 'UPDATE_IN_DATABASE',
            'DeleteBehavior': 'DELETE_FROM_DATABASE'
        }
    )

    CrawlerStart = glue_client.start_crawler(
        Name=CrawlerName
    )
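start_crawler returns immediately, so the tables are not in the Glue database yet when the loop ends. A minimal polling sketch is shown below, assuming the crawler reports the state READY once a run has finished (`get_crawler` returns the state under `["Crawler"]["State"]`). The names `wait_for_crawler` and `FakeGlue` are hypothetical, and the fake client exists only so the sketch runs without AWS access.

```python
import time


def wait_for_crawler(glue_client, name, poll_seconds=15, timeout_seconds=1800,
                     sleep=time.sleep):
    """Poll get_crawler until the crawler is READY; return the seconds waited."""
    waited = 0
    while True:
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"Crawler {name} still {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeGlue:
    """Offline stand-in for the Glue client that replays a list of states."""

    def __init__(self, states):
        self._states = iter(states)

    def get_crawler(self, Name):
        return {"Crawler": {"Name": Name, "State": next(self._states)}}


fake_glue = FakeGlue(["RUNNING", "STOPPING", "READY"])
elapsed = wait_for_crawler(fake_glue, "enigma-jhu_crawler",
                           poll_seconds=1, sleep=lambda seconds: None)
print("waited", elapsed, "polling intervals")
```

In the notebook, `wait_for_crawler(glue_client, CrawlerName)` could be called after each start_crawler, or in a second loop so that all crawlers run in parallel first.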
    