# Staging data from source to S3

## Prerequisites:

- Set up IAM admin credentials

- Install the AWS CLI on your local machine

- Download the original [Yelp datasets from Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset?select=yelp_academic_dataset_checkin.json):
    + yelp_academic_dataset_business.json (125 MB)
    + yelp_academic_dataset_checkin.json (399 MB)
    + yelp_academic_dataset_tip.json (231 MB)
    + yelp_academic_dataset_user.json (3.68 GB)
    + yelp_academic_dataset_review.json (6.94 GB)
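
Before staging anything, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of the original notebook: the helper names `aws_cli_available` and `missing_datasets` are hypothetical, and the `data/input` directory is assumed from the upload logs further down.

```python
import shutil
from pathlib import Path
from typing import List

# File names taken from the prerequisites list above.
EXPECTED_FILES = [
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_checkin.json",
    "yelp_academic_dataset_tip.json",
    "yelp_academic_dataset_user.json",
    "yelp_academic_dataset_review.json",
]


def aws_cli_available() -> bool:
    """Return True if an `aws` executable is found on PATH."""
    return shutil.which("aws") is not None


def missing_datasets(input_dir: str) -> List[str]:
    """Return the expected dataset files that are not present in input_dir."""
    base = Path(input_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # "data/input" is an assumption based on the upload logs in this notebook.
    print("AWS CLI found:", aws_cli_available())
    print("Missing files:", missing_datasets("data/input"))
```

An empty list from `missing_datasets` means all five files are where the later cells expect them.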

## Set up the project

In [1]:
# Import Python libraries
import os
import configparser

In [2]:
# Parse configuration file
config = configparser.ConfigParser()
config.read("config/config_file.ini")

['config/config_file.ini']
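
`ConfigParser.read` returns the list of files it parsed and silently skips files it cannot find, so the output above confirms that the file was loaded. The configuration file itself is not shown in this notebook. The sketch below illustrates a layout that would be consistent with the keys used in later cells. The section and key names come from the notebook, the paths come from the upload logs, and the `bucket_name` value is an assumption.

```python
import configparser

# Hypothetical excerpt of config/config_file.ini; only two sections are shown.
# The remaining dataset sections would follow the same pattern.
SAMPLE_CONFIG = """
[s3]
bucket_name = s3://yelp-capstone-data

[yelp_business_data]
local_path = data/input/yelp_academic_dataset_business.json
s3_path = s3://yelp-capstone-data/data/input/yelp_academic_dataset_business.json
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Values are accessed the same way as in the notebook cells.
print(sample["s3"]["bucket_name"])
print(sample["yelp_business_data"]["local_path"])

# Reading a path that does not exist raises no error; it returns an empty list.
print(configparser.ConfigParser().read("no_such_file.ini"))
```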

### Create bucket

In [3]:
bucket = config['s3']['bucket_name']

In [4]:
# Create the bucket; `aws s3 mb` expects an S3 URI of the form s3://bucket-name
!aws s3 mb $bucket

make_bucket: yelp-capstone-data
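
Because `aws s3 mb` takes an S3 URI, the value stored under `bucket_name` needs the `s3://` scheme. If the configuration might hold a bare bucket name instead, a small helper can normalize it first. This is a sketch under that assumption, and `to_s3_uri` is a hypothetical name rather than part of the original notebook.

```python
def to_s3_uri(bucket_name: str) -> str:
    """Return bucket_name as an s3:// URI, adding the scheme if it is missing."""
    name = bucket_name.strip()
    if name.startswith("s3://"):
        return name
    return f"s3://{name}"


# Both spellings resolve to the same URI.
print(to_s3_uri("yelp-capstone-data"))
print(to_s3_uri("s3://yelp-capstone-data"))
```

In the notebook this could be applied as `bucket = to_s3_uri(config['s3']['bucket_name'])` before running the `mb` command.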


## Stage data from local to S3

### yelp_academic_dataset_business

In [5]:
local_path_business_data = config['yelp_business_data']['local_path']
s3_path_business_data = config['yelp_business_data']['s3_path']

In [6]:
# Upload yelp_academic_dataset_business.json (125 MB)
!aws s3 cp $local_path_business_data $s3_path_business_data

upload: data/input/yelp_academic_dataset_business.json to s3://yelp-capstone-data/data/input/yelp_academic_dataset_business.json


### yelp_academic_dataset_checkin

In [7]:
local_path_checkin_data = config['yelp_checkin_data']['local_path']
s3_path_checkin_data = config['yelp_checkin_data']['s3_path']

In [8]:
# Upload yelp_academic_dataset_checkin.json (399 MB)
!aws s3 cp $local_path_checkin_data $s3_path_checkin_data

upload: data/input/yelp_academic_dataset_checkin.json to s3://yelp-capstone-data/data/input/yelp_academic_dataset_checkin.json


### yelp_academic_dataset_tip

In [9]:
local_path_yelp_tip_data = config['yelp_tip_data']['local_path']
s3_path_yelp_tip_data = config['yelp_tip_data']['s3_path']

In [10]:
# Upload yelp_academic_dataset_tip.json (231 MB)
!aws s3 cp $local_path_yelp_tip_data $s3_path_yelp_tip_data

upload: data/input/yelp_academic_dataset_tip.json to s3://yelp-capstone-data/data/input/yelp_academic_dataset_tip.json


### yelp_academic_dataset_user

In [11]:
local_path_user_data = config['yelp_user_data']['local_path']
s3_path_user_data = config['yelp_user_data']['s3_path']

In [12]:
# Upload yelp_academic_dataset_user.json (3.68 GB)
!aws s3 cp $local_path_user_data $s3_path_user_data

upload: data/input/yelp_academic_dataset_user.json to s3://yelp-capstone-data/data/input/yelp_academic_dataset_user.json


### yelp_academic_dataset_review

In [13]:
local_path_review_data = config['yelp_review_data']['local_path']
s3_path_review_data = config['yelp_review_data']['s3_path']

In [14]:
# Upload yelp_academic_dataset_review.json (6.94 GB)
!aws s3 cp $local_path_review_data $s3_path_review_data

upload: data/input/yelp_academic_dataset_review.json to s3://yelp-capstone-data/data/input/yelp_academic_dataset_review.json
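
The five uploads above follow the same pattern, so they could also be driven by a loop over the configuration sections. The sketch below is one possible way to do that, not the notebook's own method: `build_copy_commands` and `upload_all` are hypothetical helpers, and the section names are the ones used in the cells above. It defaults to a dry run that only prints the commands.

```python
import configparser
import subprocess
from typing import List

# Section names as used in the cells above.
DATASET_SECTIONS = [
    "yelp_business_data",
    "yelp_checkin_data",
    "yelp_tip_data",
    "yelp_user_data",
    "yelp_review_data",
]


def build_copy_commands(config: configparser.ConfigParser) -> List[List[str]]:
    """Build one `aws s3 cp <local_path> <s3_path>` command per dataset section."""
    commands = []
    for section in DATASET_SECTIONS:
        local_path = config[section]["local_path"]
        s3_path = config[section]["s3_path"]
        commands.append(["aws", "s3", "cp", local_path, s3_path])
    return commands


def upload_all(config: configparser.ConfigParser, dry_run: bool = True) -> None:
    """Print the copy commands, or run them when dry_run is False."""
    for command in build_copy_commands(config):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if an upload fails.
            subprocess.run(command, check=True)
```

Calling `upload_all(config)` with the parsed configuration prints the commands for review, and `upload_all(config, dry_run=False)` executes them. A missing section raises `KeyError`, which surfaces configuration gaps before any upload starts.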


## References:

- [Using high-level (s3) commands with the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html#using-s3-commands-managing-buckets-creating)
- [Yelp datasets from Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset?select=yelp_academic_dataset_checkin.json)