# 1. Set Up Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you are charged only for the queries you run.
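To make the pay-per-query model concrete, here is a minimal sketch of a cost estimate. It assumes Athena bills by the amount of data a query scans, at 5 USD per TB with a 10 MB minimum per query. Both figures are assumptions for illustration and may differ by region or change over time, and `estimate_query_cost` is a hypothetical helper, not part of any AWS SDK.

```python
# Assumed pricing, for illustration only: check the current price for your region.
PRICE_PER_TB_USD = 5.0                 # assumed price per TB of data scanned
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2   # assumed 10 MB minimum billed per query
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return an approximate cost in USD for a single query."""
    billable_bytes = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    one_gb = 1024 ** 3
    print(f"Scanning 1 GB costs about {estimate_query_cost(one_gb):.6f} USD")
```

Under this model, anything that reduces the bytes scanned, such as a compressed or columnar format, also reduces the cost of a query.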

Athena is based on Presto. In addition to CSV, JSON, and Avro, it supports columnar data formats such as Apache Parquet and Apache ORC.

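Presto is an open-source distributed SQL query engine built for fast analytic queries on data of any size. It queries data where it is stored, so the data does not have to be moved first. Queries run in parallel on a purely memory-based architecture, which is what makes Presto fast.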


In [1]:
%store -r

In [2]:
import sys
import boto3

Create the boto3 session and SageMaker client that later steps use.

In [3]:
sess = boto3.Session()
sm = sess.client('sagemaker')

## 1) Install PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
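Because PyAthena follows PEP 249, its calls have the same connect, cursor, execute, fetch shape as any other DB API 2.0 driver. The sketch below shows that shape with the standard library's `sqlite3` module as a stand-in, so it runs without AWS credentials. The `reviews` table and the `run_query` helper are made up for illustration. One difference to keep in mind: `sqlite3` uses `?` placeholders, while PyAthena uses the `%(name)s` style.

```python
import sqlite3


def run_query(connection, sql, params=()):
    """Run one statement through the DB API 2.0 cursor interface."""
    cursor = connection.cursor()
    cursor.execute(sql, params)
    # cursor.description holds one 7-item sequence per column; item 0 is the name.
    columns = [column[0] for column in cursor.description]
    return columns, cursor.fetchall()


# Stand-in database: an in-memory SQLite table, not Athena.
conn = sqlite3.connect(":memory:")
setup = conn.cursor()
setup.execute("CREATE TABLE reviews (product_category TEXT, star_rating INTEGER)")
setup.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("Digital_Software", 4), ("Digital_Video_Games", 5)],
)

columns, rows = run_query(
    conn,
    "SELECT product_category, star_rating FROM reviews WHERE star_rating >= ?",
    (5,),
)
print(columns, rows)
```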

In [4]:
# Install PyAthena
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q PyAthena==1.10.7

In [5]:
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.util import as_pandas

## 2) Create Athena Database

In [6]:
!aws s3 rm s3://$data_bucket/amazon-reviews-pds/parquet --recursive
!aws s3 rm s3://$job_bucket/athena --recursive

delete: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/parquet/product_category=Digital_Video_Games/20201207_130433_00100_twaes_39b672e4-be08-4a9a-8752-16de7435a806
delete: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/parquet/product_category=Digital_Software/20201207_130433_00100_twaes_a405c838-6837-49a3-9982-f65be2ebe4b2
delete: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/parquet/product_category=Digital_Video_Games/20201207_130433_00100_twaes_92863409-f3a2-4dee-a807-f7f6a7af12b3
delete: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/parquet/product_category=Digital_Software/20201207_130433_00100_twaes_00771bd1-0d08-4019-94c9-2b56137e19d1
delete: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/parquet/product_category=Digital_Software/20201207_130433_00100_twaes_91d254e4-6c69-4e22-97ce-55790579e231
delete: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/parquet/product_category=Digital_Video_Games/20201207_130433_00100_tw

In [7]:
# Set Athena database name
database_name = 'awsdb_1124'

Databases and tables created in Athena store their metadata in a Data Catalog service. For example, a table's schema, which consists of the table name plus the name and data type of each column, is stored in the Data Catalog as metadata.

Athena natively supports the AWS Glue Data Catalog. When you run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as the source, the resulting database and table metadata entries automatically appear in the AWS Glue Data Catalog.
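One way to confirm that Athena-created databases show up in the Glue Data Catalog is to list the catalog's databases through boto3. The sketch below assumes a Glue client is passed in. `list_catalog_databases` is a hypothetical helper, and the commented usage assumes credentials that allow `glue:GetDatabases`.

```python
def list_catalog_databases(glue_client):
    """Return the names of all databases in the AWS Glue Data Catalog."""
    names = []
    # get_databases is paginated, so walk every page of results.
    paginator = glue_client.get_paginator("get_databases")
    for page in paginator.paginate():
        names.extend(database["Name"] for database in page["DatabaseList"])
    return names


# Example usage (requires AWS credentials; the region is an assumption):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   print(list_catalog_databases(glue))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no AWS account is available.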

In [8]:
# Set S3 staging directory -- this is a temporary directory used for Athena queries
s3_staging_dir = 's3://{0}/athena/staging'.format(job_bucket)

In [9]:
# SQL statement to execute
statement = 'DROP SCHEMA IF EXISTS {} CASCADE'.format(database_name)
print(statement)

# Execute statement using connection cursor
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

DROP SCHEMA IF EXISTS awsdb_1124 CASCADE


<pyathena.cursor.Cursor at 0x7f7841eb00b8>

In [10]:
# SQL statement to execute
statement = 'CREATE DATABASE IF NOT EXISTS {}'.format(database_name)
print(statement)

CREATE DATABASE IF NOT EXISTS awsdb_1124
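The DDL statements in this notebook are built with `str.format`, and identifiers such as database names cannot be passed as query parameters. If the name ever comes from outside the notebook, it may be worth validating it first. The sketch below assumes a deliberately conservative rule (lowercase letters, digits, and underscores, not starting with a digit). `safe_identifier` is a hypothetical helper, and the rule is an assumption rather than Athena's exact naming specification.

```python
import re

# Conservative pattern (assumption): lowercase letters, digits, and underscores,
# starting with a letter or underscore.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it looks like a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return name


database_name = "awsdb_1124"
statement = "CREATE DATABASE IF NOT EXISTS {}".format(safe_identifier(database_name))
print(statement)
```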


In [11]:
# Execute statement using connection cursor
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7f784172e208>

#### Verify that the database was created successfully.

In [12]:
statement = 'SHOW DATABASES'
cursor.execute(statement)

df_show = as_pandas(cursor)
df_show.head(5)

|   | database_name |
|---|---|
| 0 | awsdb_1124 |
| 1 | default |
| 2 | glue-crawler-test |
| 3 | marketdata |
| 4 | quicksightdb |


# 2. Register Data in Amazon Athena

In [13]:
table_name_tsv = 'amazon_reviews_tsv'

In [14]:
# SQL statement to execute
statement = """CREATE EXTERNAL TABLE IF NOT EXISTS {}.{}(
         marketplace string,
         customer_id string,
         review_id string,
         product_id string,
         product_parent string,
         product_title string,
         product_category string,
         star_rating int,
         helpful_votes int,
         total_votes int,
         vine string,
         verified_purchase string,
         review_headline string,
         review_body string,
         review_date string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' LOCATION '{}'
TBLPROPERTIES ('compressionType'='gzip', 'skip.header.line.count'='1')""".format(database_name, table_name_tsv, s3_destination_path_tsv)

print(statement)

CREATE EXTERNAL TABLE IF NOT EXISTS awsdb_1124.amazon_reviews_tsv(
         marketplace string,
         customer_id string,
         review_id string,
         product_id string,
         product_parent string,
         product_title string,
         product_category string,
         star_rating int,
         helpful_votes int,
         total_votes int,
         vine string,
         verified_purchase string,
         review_headline string,
         review_body string,
         review_date string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION 's3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/'
TBLPROPERTIES ('compressionType'='gzip', 'skip.header.line.count'='1')
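The column list in the statement above can also be kept as data and rendered into DDL, which makes the schema easier to reuse. The following is a sketch under stated assumptions: `build_create_table` is a hypothetical helper, and the `example-bucket` location is a placeholder rather than a real path.

```python
# Column names and types copied from the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    ("marketplace", "string"), ("customer_id", "string"), ("review_id", "string"),
    ("product_id", "string"), ("product_parent", "string"), ("product_title", "string"),
    ("product_category", "string"), ("star_rating", "int"), ("helpful_votes", "int"),
    ("total_votes", "int"), ("vine", "string"), ("verified_purchase", "string"),
    ("review_headline", "string"), ("review_body", "string"), ("review_date", "string"),
]


def build_create_table(database, table, columns, location):
    """Render a CREATE EXTERNAL TABLE statement for gzip-compressed TSV files."""
    column_sql = ",\n".join(f"         {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table}(\n"
        f"{column_sql}\n"
        ") ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        "LINES TERMINATED BY '\\n' "
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('compressionType'='gzip', 'skip.header.line.count'='1')"
    )


ddl = build_create_table(
    "awsdb_1124",
    "amazon_reviews_tsv",
    COLUMNS,
    "s3://example-bucket/amazon-reviews-pds/tsv/",  # placeholder location
)
print(ddl)
```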


In [15]:
# Execute statement using connection cursor
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7f7826910470>

#### Verify that the table was created.

In [16]:
statement = 'SHOW TABLES in {}'.format(database_name)
cursor.execute(statement)

df_show = as_pandas(cursor)
df_show.head(5)

|   | tab_name |
|---|---|
| 0 | amazon_reviews_tsv |


#### Run a sample query.

In [17]:
product_category = 'Digital_Software'

statement = """SELECT * FROM {}.{}
    WHERE product_category = '{}' LIMIT 100""".format(database_name, table_name_tsv, product_category)

print(statement)

SELECT * FROM awsdb_1124.amazon_reviews_tsv
    WHERE product_category = 'Digital_Software' LIMIT 100
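The query above formats the category value directly into the SQL text. As an alternative, value placeholders can be left to the driver. PyAthena uses the `pyformat` parameter style, so a value can be passed as `%(name)s` together with a dict. The sketch below assumes a cursor that follows that convention. `fetch_category_sample` is a hypothetical helper, and the database and table names are still formatted into the SQL text because identifiers cannot be parameterized.

```python
def fetch_category_sample(cursor, database, table, category, limit=100):
    """Fetch up to `limit` rows for one product category.

    The category value is passed as a query parameter; the database and
    table names are trusted inputs formatted into the SQL text.
    """
    sql = (
        f"SELECT * FROM {database}.{table} "
        "WHERE product_category = %(category)s "
        f"LIMIT {int(limit)}"
    )
    cursor.execute(sql, {"category": category})
    return cursor.fetchall()


# Example usage with the notebook's variables (requires an Athena connection):
#   cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
#   rows = fetch_category_sample(cursor, database_name, table_name_tsv, "Digital_Software")
```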


In [18]:
# Execute statement using connection cursor
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7f7826612940>

In [19]:
df = as_pandas(cursor)
df.head(5)

|   | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | 17747349 | R2EI7QLPK4LF7U | B00U7LCE6A | 106182406 | CCleaner Free [Download] | Digital_Software | 4 | 0 | 0 | N | Y | Four Stars | So far so good | 2015-08-31 |
| 1 | US | 10956619 | R1W5OMFK1Q3I3O | B00HRJMOM4 | 162269768 | ResumeMaker Professional Deluxe 18 | Digital_Software | 3 | 0 | 0 | N | Y | Three Stars | Needs a little more work..... | 2015-08-31 |
| 2 | US | 13132245 | RPZWSYWRP92GI | B00P31G9PQ | 831433899 | Amazon Drive Desktop [PC] | Digital_Software | 1 | 1 | 2 | N | Y | One Star | Please cancel. | 2015-08-31 |
| 3 | US | 35717248 | R2WQWM04XHD9US | B00FGDEPDY | 991059534 | Norton Internet Security 1 User 3 Licenses | Digital_Software | 5 | 0 | 0 | N | Y | Works as Expected! | Works as Expected! | 2015-08-31 |
| 4 | US | 17710652 | R1WSPK2RA2PDEF | B00FZ0FK0U | 574904556 | SecureAnywhere Intermet Security Complete 5 De... | Digital_Software | 4 | 1 | 2 | N | Y | Great antivirus. Worthless customer support | I've had Webroot for a few years. It expired a... | 2015-08-31 |


In [20]:
%store s3_staging_dir database_name table_name_tsv

Stored 's3_staging_dir' (str)
Stored 'database_name' (str)
Stored 'table_name_tsv' (str)
