<img src="https://upload.wikimedia.org/wikipedia/commons/4/4f/Twitter-logo.svg" alt="twitter_logo" width="120"/>
<span style="float:left">
  <span style="font-family:Helvetica; font-size:4em;">
    <b>Twitter sentiment analysis DEV &nbsp;&nbsp;&nbsp;</b><br>
  </span>
  <span style="font-family:Helvetica; font-size:2em;">
    Mounting S3 bucket
  </span>
</span>
<br clear="left"/>
<br><br><br>
<i>The goal of this project is to develop an ML pipeline for Twitter sentiment analysis, with the end goal of hosting it in an MLOps pipeline that consumes streaming Twitter data through the AWS architecture.</i>
<br><br>
This notebook mounts the data from my S3 bucket into the Databricks environment.

## Table of contents
#### 1. Set-up
* 1.1 Environment
* 1.2 User defined functions
* 1.3 Global variables

#### 2. Mounting
* 2.1 Mount datasets
* 2.2 Verify mounting
* 2.3 Create Delta table
* 2.4 Optimize Delta table
<br><br>

## 1. Set-up

#### 1.1 Environment

In [0]:
# no modules to import for mounting

#### 1.2 User defined functions

In [0]:
# Mounts a new or existing S3 bucket to our Databricks environment under /mnt
def mount_s3_bucket(access_key, secret_key, bucket_name, mount_folder):
    # A "/" in the secret key must be URL-encoded, otherwise it breaks the s3a URI.
    encoded_secret_key = secret_key.replace("/", "%2F")
    mount_point = f"/mnt/{mount_folder}"

    print("Mounting", bucket_name)

    try:
        # Unmount first in case the mount point is already in use.
        dbutils.fs.unmount(mount_point)
    except Exception:
        # A failed unmount most likely means nothing was mounted there.
        print("Directory not unmounted:", mount_point)

    # Lastly, mount our bucket.
    dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{bucket_name}", mount_point)
    print("The bucket", bucket_name, "was mounted to", mount_point, "\n")
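
The function above unmounts blindly and treats any failure as "not mounted". A possible alternative is to check the existing mounts first. The sketch below is only an illustration: `is_mounted` and `FakeMount` are hypothetical names, and `FakeMount` stands in for the records that `dbutils.fs.mounts()` returns so that the check can be tried outside Databricks.

```python
from collections import namedtuple


def is_mounted(mount_point, mounts):
    """Return True if any entry in `mounts` uses the given mount point."""
    return any(m.mountPoint == mount_point for m in mounts)


# Hypothetical stand-in for the records returned by dbutils.fs.mounts().
FakeMount = namedtuple("FakeMount", ["mountPoint", "source"])
example_mounts = [FakeMount("/mnt/lambda-epl-output", "s3a://lambda-epl-output")]

print(is_mounted("/mnt/lambda-epl-output", example_mounts))
print(is_mounted("/mnt/other", example_mounts))
```

Inside a notebook, the same check would presumably be called as `is_mounted("/mnt/lambda-epl-output", dbutils.fs.mounts())`, and the unmount step would run only when it returns `True`.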


#### 1.3 Global variables

In [0]:
# S3 bucket location for data import
S3_BUCKET_NAME = r'lambda-epl-output'
S3_IMPORT_FOLDER = r'twitter_stream/*'


## 2. Mounting

#### 2.1 Mount datasets

In [0]:
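# Placeholders: fill in real credentials before running, and never commit them.
ACCESS_KEY_ID = r'ACCESS KEY ID HERE'
SECRET_KEY = r'SECRET KEY HERE'
# URL-encode any "/" in the secret key so the s3a URI stays valid.
encoded_secret_key = SECRET_KEY.replace("/", "%2F")
aws_bucket_name = "lambda-epl-output"
mount_name = "lambda-epl-output"

# Mount the bucket under /mnt and list its top-level contents.
dbutils.fs.mount(f"s3a://{ACCESS_KEY_ID}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))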

path,name,size,modificationTime
dbfs:/mnt/lambda-epl-output/twitter_stream/,twitter_stream/,0,1654056902384
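
The mount cell above keeps the credentials as string literals in the notebook. One way to avoid that, sketched here under the assumption that the keys are exported as the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, is to read them at run time and build the `s3a://` source separately. The helper names `load_aws_credentials` and `build_s3a_source` are hypothetical, and the example values are made up.

```python
import os


def load_aws_credentials(env=None):
    """Read the AWS key pair from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    names = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return env[names[0]], env[names[1]]


def build_s3a_source(access_key, secret_key, bucket_name):
    """Build the mount source, escaping "/" in the secret as the notebook does."""
    encoded_secret_key = secret_key.replace("/", "%2F")
    return f"s3a://{access_key}:{encoded_secret_key}@{bucket_name}"


# Made-up values, used only to show the shape of the resulting URI.
example_env = {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "abc/def"}
access_key, secret_key = load_aws_credentials(example_env)
print(build_s3a_source(access_key, secret_key, "lambda-epl-output"))
```

On Databricks, `dbutils.secrets.get(scope=..., key=...)` could serve the same purpose as the environment lookup, provided a secret scope has been set up for the workspace.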


In [0]:
display(dbutils.fs.ls(f"/mnt/{mount_name}/twitter_stream/2022/05"))

path,name,size,modificationTime
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/25/,25/,0,1654056913318
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/26/,26/,0,1654056913318
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/27/,27/,0,1654056913318
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/28/,28/,0,1654056913318
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/29/,29/,0,1654056913318
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/30/,30/,0,1654056913318
dbfs:/mnt/lambda-epl-output/twitter_stream/2022/05/31/,31/,0,1654056913318
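
The `modificationTime` column in the listings above appears to be a Unix timestamp in milliseconds. Assuming that reading is correct, the following sketch converts one of the listed values into a readable UTC datetime; `modification_time_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def modification_time_to_utc(millis):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    seconds, remainder_ms = divmod(millis, 1000)
    # Add the milliseconds separately to avoid floating-point rounding.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=remainder_ms)


# Value taken from the 2022/05 listing above.
ts = modification_time_to_utc(1654056913318)
print(ts.isoformat())  # 2022-06-01T04:15:13.318000+00:00
```

Under that assumption, the folders for May 2022 were all written on 1 June 2022, which suggests a single bulk copy rather than daily writes.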


In [0]:
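# Remount under /mnt/lambda-epl-output so the paths used below resolve.
# Passing S3_IMPORT_FOLDER here would create the mount point /mnt/twitter_stream/* instead.
mount_s3_bucket(ACCESS_KEY_ID, SECRET_KEY, S3_BUCKET_NAME, S3_BUCKET_NAME)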

#### 2.2 Verify mounting

In [0]:
dbutils.fs.ls("/mnt/lambda-epl-output/twitter_stream/2022")

In [0]:
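# Read every Parquet file under the mount, descending into the year/month/day folders.
# The 'header' and 'inferSchema' options apply to CSV, not Parquet, so they are omitted.
twitter = (spark.read
           .option("recursiveFileLookup", "true")
           .parquet('/mnt/lambda-epl-output/twitter_stream/')
          )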

#### 2.3 Create Delta table

In [0]:
# Source and target settings for the Parquet-to-Delta conversion.
read_format = 'parquet'
write_format = 'delta'
load_path = '/mnt/lambda-epl-output/twitter_stream/'
save_path = '/tmp/delta/twitter_stream'
table_name = 'default.twitter_stream'

# Load the data from its source.
tweets = (spark
            .read
            .format(read_format)
            .option('recursiveFileLookup', 'true')
            .load(load_path)
)

# Write the data to its target, replacing the output of any previous run.
tweets.write \
  .format(write_format) \
  .mode('overwrite') \
  .save(save_path)

# Create the table unless it is already registered.
spark.sql(f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{save_path}'")

#### 2.4 Optimize Delta table

In [0]:
spark.sql("OPTIMIZE delta.`/tmp/delta/twitter_stream`")