# 1. Loading Data with AWS Data Wrangler

* https://github.com/awslabs/aws-data-wrangler
* https://aws-data-wrangler.readthedocs.io

AWS Data Wrangler is an open-source Python package that extends the Pandas library to AWS, connecting DataFrames with AWS data-related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, Amazon QuickSight, and others).

It is built on top of other open-source projects such as Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2, and PyMySQL, and it provides abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses, and databases.

In [None]:
%store -r

In [None]:
import sys

In [None]:
!{sys.executable} -m pip install -q awswrangler==1.2.0

In [None]:
import os
import boto3
import sagemaker
import pandas as pd

import awswrangler as wr

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

## 1) Querying Parquet on S3 with Push-Down Filters

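`wr.s3.read_parquet` reads Apache Parquet files from an S3 prefix or from a list of S3 object paths. Treating the files as a dataset enables more complex features such as partitioning and catalog integration.

dataset (bool): If True, read a Parquet dataset rather than plain files, loading all related partitions as columns.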

In [None]:
df = wr.s3.read_parquet(s3_path_parquet,
                        filters=[("product_category", "=", "Digital_Software")],
                        dataset=True)
df.shape
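
To make the push-down idea concrete, here is a minimal, library-free sketch of partition pruning over Hive-style object keys. It assumes the dataset is laid out as `.../product_category=<value>/<file>.parquet`. The bucket, the keys, and the helper names `partition_values` and `prune` are hypothetical and are not part of awswrangler, and only equality filters are handled.

```python
from typing import Dict, List, Tuple


def partition_values(key: str) -> Dict[str, str]:
    """Extract Hive-style partition values (name=value path segments) from a key."""
    values = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


def prune(keys: List[str], filters: List[Tuple[str, str, str]]) -> List[str]:
    """Keep only keys whose partition values satisfy every (column, '=', value) filter."""
    kept = []
    for key in keys:
        parts = partition_values(key)
        if all(op == "=" and parts.get(col) == val for col, op, val in filters):
            kept.append(key)
    return kept


# Hypothetical object keys, for illustration only.
keys = [
    "s3://example-bucket/parquet/product_category=Digital_Software/part-0.parquet",
    "s3://example-bucket/parquet/product_category=Digital_Video_Games/part-0.parquet",
    "s3://example-bucket/parquet/product_category=Digital_Software/part-1.parquet",
]
selected = prune(keys, [("product_category", "=", "Digital_Software")])
print(len(selected))
```

In this sketch the non-matching partition is discarded from its path alone, before any file content is read, which is the reason a filter on a partition column tends to be cheap.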

In [None]:
df.head(5)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

df[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Count')

# Balance the Dataset

In [None]:
from sklearn.utils import resample

five_star_df = df.query('star_rating == 5', engine='python')
four_star_df = df.query('star_rating == 4', engine='python')
three_star_df = df.query('star_rating == 3', engine='python')
two_star_df = df.query('star_rating == 2', engine='python')
one_star_df = df.query('star_rating == 1', engine='python')

# Find the star rating with the fewest samples
minority_count = min(five_star_df.shape[0], 
                     four_star_df.shape[0], 
                     three_star_df.shape[0], 
                     two_star_df.shape[0], 
                     one_star_df.shape[0]) 

five_star_df = resample(five_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

four_star_df = resample(four_star_df,
                        replace = False,
                        n_samples = minority_count,
                        random_state = 27)

three_star_df = resample(three_star_df,
                         replace = False,
                         n_samples = minority_count,
                         random_state = 27)

two_star_df = resample(two_star_df,
                       replace = False,
                       n_samples = minority_count,
                       random_state = 27)

one_star_df = resample(one_star_df,
                       replace = False,
                       n_samples = minority_count,
                       random_state = 27)

df_balanced = pd.concat([five_star_df, four_star_df, three_star_df, two_star_df, one_star_df])
df_balanced = df_balanced.reset_index(drop=True)

df_balanced.shape
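
The balancing step above is undersampling: every class is reduced, without replacement, to the size of the smallest class. The following plain-Python sketch shows the same logic on a toy dataset. The `undersample` helper and the toy rows are made up for illustration, and the standard library's `random` module stands in for `sklearn.utils.resample`, so the selected rows will differ from what scikit-learn would pick.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, seed=27):
    """Downsample every class, without replacement, to the smallest class size."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    minority_count = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], minority_count))
    return balanced


# Toy, made-up reviews: ratings are skewed toward 5 stars.
toy_rows = (
    [{"star_rating": 5, "review_id": f"R5-{i}"} for i in range(10)]
    + [{"star_rating": 3, "review_id": f"R3-{i}"} for i in range(4)]
    + [{"star_rating": 1, "review_id": f"R1-{i}"} for i in range(2)]
)
balanced_rows = undersample(toy_rows, "star_rating")
counts = {
    label: sum(row["star_rating"] == label for row in balanced_rows)
    for label in (1, 3, 5)
}
print(counts)
```

Undersampling discards data from the larger classes, so the balanced set is only as large as the number of classes times the minority count.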

In [None]:
df_balanced[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')
plt.xlabel('Star Rating')
plt.ylabel('Review Count')

In [None]:
df_balanced.head(5)

# Split the Data into Train, Validation, and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

# Split all data into 90% train and 10% holdout
df_train, df_holdout = train_test_split(df_balanced, 
                                        test_size=0.10,
                                        stratify=df_balanced['star_rating'])

# Split holdout data into 50% validation and 50% test
df_validation, df_test = train_test_split(df_holdout,
                                          test_size=0.50, 
                                          stratify=df_holdout['star_rating'])
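
The two-stage split gives roughly 90% train, 5% validation, and 5% test, because half of a 10% holdout is 5% of the total. The arithmetic can be sketched as follows. The `split_sizes` helper is hypothetical, and it assumes that the test side of each split is rounded up and the other side receives the remainder, which is how scikit-learn appears to size a float `test_size`.

```python
import math


def split_sizes(n_samples, holdout_fraction=0.10, test_fraction_of_holdout=0.50):
    """Return (train, validation, test) row counts for the two-stage split.

    Assumption: the test side of each split is rounded up and the other
    side gets whatever rows remain.
    """
    n_holdout = math.ceil(holdout_fraction * n_samples)
    n_train = n_samples - n_holdout
    n_test = math.ceil(test_fraction_of_holdout * n_holdout)
    n_validation = n_holdout - n_test
    return n_train, n_validation, n_test


sizes = split_sizes(1000)
print(sizes)
```

Whatever the rounding, the three counts always add up to the number of balanced rows, so no review is lost or duplicated across the sets.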


In [None]:
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = ['Train', 'Validation', 'Test']
sizes = [len(df_train.index), len(df_validation.index), len(df_test.index)]
explode = (0.1, 0, 0)  

fig1, ax1 = plt.subplots()

ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)

# Equal aspect ratio ensures that pie is drawn as a circle.
ax1.axis('equal')  

plt.show()

# Show 90% Train Data Split

In [None]:
df_train.shape

In [None]:
df_train[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='90% Train Breakdown by Star Rating')

# Show 5% Validation Data Split

In [None]:
df_validation.shape

In [None]:
df_validation[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='5% Validation Breakdown by Star Rating')

# Show 5% Test Data Split

In [None]:
df_test.shape

In [None]:
df_test[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='5% Test Breakdown by Star Rating')

# Select `star_rating` and `review_body` for Training

In [None]:
df_train = df_train[['star_rating', 'review_body']]
df_train.shape

In [None]:
df_train.head(5)

# Write a Train CSV with Header for Autopilot

In [None]:
data_dir='./data'

In [None]:
!rm -rf $data_dir
!mkdir $data_dir

In [None]:
header_train_path = os.path.join(data_dir,'amazon_reviews_us_Digital_Software_v1_00_header.csv')
df_train.to_csv(header_train_path, index=False, header=True)

# Upload Train Data to S3 for Autopilot

In [None]:
train_s3_prefix = 'data'
header_train_s3_uri = sess.upload_data(path=header_train_path, key_prefix=train_s3_prefix)
header_train_s3_uri
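
The URI returned by `upload_data` for a single file is expected to follow the pattern `s3://{bucket}/{key_prefix}/{file name}`. Because no `bucket` argument is passed here, the session's default bucket is used. The sketch below only composes that string and does not call AWS. The `expected_upload_uri` helper and the bucket name are hypothetical.

```python
import posixpath


def expected_upload_uri(bucket, key_prefix, local_path):
    """Compose s3://<bucket>/<key_prefix>/<file name> for a single uploaded file."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{posixpath.join(key_prefix, file_name)}"


uri = expected_upload_uri(
    bucket="sagemaker-us-east-1-123456789012",  # hypothetical default bucket name
    key_prefix="data",
    local_path="./data/amazon_reviews_us_Digital_Software_v1_00_header.csv",
)
print(uri)
```

Note that the local `./data` directory does not appear in the key: only the file name is appended to the prefix.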

In [None]:
!aws s3 ls $header_train_s3_uri

# Write a CSV With No Header for Comprehend 

In [None]:
noheader_train_path = os.path.join(data_dir,'amazon_reviews_us_Digital_Software_v1_00_noheader.csv')
df_train.to_csv(noheader_train_path, index=False, header=False)
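
The two training files differ only in their first line. The sketch below reproduces that difference with the standard library's `csv` module on two made-up rows. The `to_csv_text` helper is hypothetical and is not how pandas writes the files, although the quoting of a field that contains a comma is the same minimal quoting pandas applies by default.

```python
import csv
import io


def to_csv_text(rows, fieldnames, header):
    """Serialize rows to CSV text, optionally emitting a header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows with the two columns kept for training.
rows = [
    {"star_rating": 5, "review_body": "Works great, easy install"},
    {"star_rating": 1, "review_body": "Would not activate"},
]
columns = ["star_rating", "review_body"]
with_header = to_csv_text(rows, columns, header=True)
without_header = to_csv_text(rows, columns, header=False)
print(with_header.splitlines()[0])
print(without_header.splitlines()[0])
```

A consumer that does not expect a header would treat the `star_rating,review_body` line as a data row, which is why the two services get separate files.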

# Upload Train Data to S3 for Comprehend

In [None]:
train_s3_prefix = 'data'
noheader_train_s3_uri = sess.upload_data(path=noheader_train_path, key_prefix=train_s3_prefix)
noheader_train_s3_uri

In [None]:
!aws s3 ls $noheader_train_s3_uri

# Store the Train Data Locations for the Next Notebook

In [None]:
%store header_train_s3_uri

In [None]:
%store noheader_train_s3_uri

In [None]:
%store