# Accessing an AWS S3 bucket & downloading a CSV file



In this notebook we will learn how to access an AWS S3 bucket, download a CSV file from that bucket, create a dataframe from the CSV file contents, and finally perform some calculations on the downloaded data. 

**Note:** If you are not familiar with manipulating CSV files, work through the **simpleCalc2.ipynb** Jupyter notebook first to become familiar with file manipulation.



In [11]:
!pip install -r requirements.txt

You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.


## Imports
We need to import several packages. Each one is either built into the notebook image you are running or was installed in the previous step.

In [12]:
#========================================================================================
# import needed libraries/packages 
#
# Note:  we use boto3 which is a Python SDK for AWS.  It allows you to create,
# configure and manage AWS resources from your Python scripts.
#========================================================================================

import os
import pandas as pd
import boto3
import botocore

from botocore import UNSIGNED
from botocore.client import Config

In [13]:
# Accessing a file without AWS credentials, in the new public bucket: rhods-sandbox

BUCKET_NAME      = 'rhods-sandbox'
BUCKET_FILE_NAME = 'truckdata.csv'    
LOCAL_FILE_NAME  = 'newtruckdata.csv'

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
s3.download_file(BUCKET_NAME, BUCKET_FILE_NAME, LOCAL_FILE_NAME)


ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

In [None]:
#Accessing file in non-public bucket using AWS Credentials

#========================================================================================
# Storing your S3 keys within your notebook (see below) is not recommended.  
# os.environ['S3_ACCESS_KEY_ID']     ='AKIA3BTMHKMG......'
# os.environ['S3_SECRET_ACCESS_KEY'] ='Pz1oskqXYFnF......'
#========================================================================================
# We suggest storing keys in the Environment Variables of your JupyterHub image.  
# Then use os.environ.get to obtain your environment variable values
#========================================================================================
key_id     = os.environ.get('S3_ACCESS_KEY_ID')
secret_key = os.environ.get('S3_SECRET_ACCESS_KEY')

session    = boto3.session.Session(aws_access_key_id=key_id,
                                   aws_secret_access_key=secret_key)

# Create the S3 client from the session so the credentials are supplied once
s3_client  = session.client('s3')
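
`os.environ.get` returns `None` when a variable is not defined, so a missing key is not reported at the point where it is read. The sketch below fails early with a clear message instead. The helper name `require_env` is hypothetical; only the variable names `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` come from the cell above.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, `key_id = require_env('S3_ACCESS_KEY_ID')` would stop the notebook immediately when the variable has not been configured in the JupyterHub image.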



In [None]:
#========================================================================================
# Identify the name of your S3 bucket, and then the file you wish to download.
#========================================================================================
bucket_name     = 'rhods-pilot'
file_name       = 'truckdata.csv'
new_file_name   = 'newtruckdata.csv'

#local_dest_dir  = os.path.join(os.getcwd(), 'downloaded-folder')
s3_client.download_file(bucket_name, file_name, new_file_name)


In [None]:
#========================================================================================
# Read the CSV file data into a dataframe
#========================================================================================
df              = pd.read_csv(new_file_name)
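
Before working with the data, it can help to confirm that the file was parsed as expected. The sketch below reads a small in-memory CSV in place of the downloaded file. The sample rows and the `truck_id` column are made up for illustration; only the `mileage` column name comes from this notebook.

```python
import io

import pandas as pd

# Hypothetical sample standing in for newtruckdata.csv.
sample_csv = io.StringIO("truck_id,mileage\n1,100\n2,250\n3,175\n")

df_sample = pd.read_csv(sample_csv)

# Fail early with a clear message if the expected column is absent.
if 'mileage' not in df_sample.columns:
    raise KeyError("expected a 'mileage' column in the CSV")

print(df_sample.head())   # first rows of the dataframe
print(df_sample.dtypes)   # column types inferred by pandas
```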

In [None]:
#========================================================================================
# Print the contents of the dataframe
#========================================================================================
print(df)
print(df.mileage)

In [None]:
#========================================================================================
# Perform calculations on imported data
#========================================================================================
total_mileage   = sum(df.mileage)  # or total_mileage = df['mileage'].sum()
print(total_mileage)

total_rows      = len(df.index)
print(total_rows)

average_mileage = (total_mileage/total_rows)
print(average_mileage)
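
The same totals can be computed with pandas' own methods. This is a minimal sketch on made-up mileage values; the `df_demo` dataframe and the `demo_*` names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical mileage values, used only for illustration.
df_demo = pd.DataFrame({'mileage': [100, 250, 175]})

demo_total = df_demo['mileage'].sum()     # sum of the column
demo_rows = len(df_demo)                  # number of rows
demo_average = df_demo['mileage'].mean()  # arithmetic mean of the column

print(demo_total)    # 525
print(demo_rows)     # 3
print(demo_average)  # 175.0
```

One difference to keep in mind: `Series.sum()` and `Series.mean()` skip missing values by default, while the built-in `sum` does not, so the results can differ from the manual calculation when the column contains empty cells.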