# Accessing an AWS S3 bucket and downloading a CSV file



In this notebook we will learn how to access an AWS S3 bucket, download a CSV file from that bucket, create a dataframe from the CSV file contents, and finally perform some calculations on the downloaded data. 



In [9]:
!pip install -r requirements.txt



## Imports
We need to import a few packages. They are either built into the notebook image you are running or were installed in the previous step.

In [10]:
#========================================================================================
# import needed libraries/packages
#========================================================================================

import boto3
import os
import pandas as pd

In [11]:
#========================================================================================
# Storing your S3 keys within your notebook (see below) is not recommended.  
# os.environ['S3_ACCESS_KEY_ID']     ='AKIA3BTMHKMG......'
# os.environ['S3_SECRET_ACCESS_KEY'] ='Pz1oskqXYFnF......'
#========================================================================================
# We suggest storing keys in the Environment Variables of your JupyterHub image.  
# Then use os.environ.get to obtain your environment variable values
#========================================================================================

key_id     = os.environ.get('S3_ACCESS_KEY_ID')
secret_key = os.environ.get('S3_SECRET_ACCESS_KEY')

# Create a session with the credentials, then create the S3 client from it.
session    = boto3.session.Session(aws_access_key_id=key_id,
                                   aws_secret_access_key=secret_key)

s3_client  = session.client('s3')
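
Note that `os.environ.get` returns `None` when a variable is missing, so a forgotten key may only surface later, when the first S3 call fails. A minimal sketch of a stricter lookup follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage in the notebook (variable names as in the cell above):
# key_id     = require_env('S3_ACCESS_KEY_ID')
# secret_key = require_env('S3_SECRET_ACCESS_KEY')
```

Failing early like this keeps the error message close to its cause.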



In [12]:
#========================================================================================
# Identify the name of your S3 bucket, and then the file you wish to access.
#========================================================================================
bucket_name     = 'rhods-pilot'
file_name       = 'truckdata.csv'
new_file_name   = 'newtruckdata.csv'
local_dest_dir  = os.path.join(os.getcwd(), 'downloaded-folder')

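# download_file(Bucket, Key, Filename): the object is saved as new_file_name
# in the current working directory (local_dest_dir is not used here).
s3_client.download_file(bucket_name, file_name, new_file_name)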
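
The cell above defines `local_dest_dir` but downloads into the current working directory. If you would rather save the file into that folder, one possible sketch is shown below. The helper name `download_to_dir` is hypothetical, and the sketch assumes that `download_file` does not create missing directories itself, so the folder is created first.

```python
import os


def download_to_dir(s3_client, bucket, key, dest_dir, local_name):
    """Download s3://bucket/key into dest_dir/local_name and return the local path."""
    # Assumption: download_file will not create a missing directory, so create it first.
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, local_name)
    s3_client.download_file(bucket, key, local_path)
    return local_path


# Usage with the variables defined in the cell above:
# local_path = download_to_dir(s3_client, bucket_name, file_name,
#                              local_dest_dir, new_file_name)
# df = pd.read_csv(local_path)
```

If you use this variant, remember to pass the returned path to `pd.read_csv` instead of `new_file_name`.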

In [6]:
#========================================================================================
# Read the CSV file data into a dataframe
#========================================================================================

df              = pd.read_csv(new_file_name)

In [13]:
#========================================================================================
# Print the contents of the dataframe
#========================================================================================

print(df)
print(df.mileage)

       date trucktype  mileage
0  01/03/21      ford      125
1  01/14/21       gmc      200
2  01/02/21      ford      187
3  01/17/21      ford      230
0    125
1    200
2    187
3    230
Name: mileage, dtype: int64


In [16]:
#========================================================================================
# Perform calculation on imported data
#========================================================================================
total_mileage   = sum(df.mileage)  # or: total_mileage = df['mileage'].sum()
print(total_mileage)

total_rows      = len(df.index)
print(total_rows)

average_mileage = (total_mileage/total_rows)
print(average_mileage)

742
4
185.5
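
The same figures can also be obtained with pandas' built-in aggregations. The sketch below inlines the four rows printed earlier so that it runs without S3 access. It assumes the CSV file has a header row with the columns `date`, `trucktype` and `mileage`, as the printed dataframe suggests. The per-type average is an extra illustration, not part of the original calculation.

```python
import io

import pandas as pd

# Same four rows as the output shown earlier, inlined so the example runs on its own.
csv_text = """date,trucktype,mileage
01/03/21,ford,125
01/14/21,gmc,200
01/02/21,ford,187
01/17/21,ford,230
"""
df = pd.read_csv(io.StringIO(csv_text))

total_mileage = df['mileage'].sum()     # 742
average_mileage = df['mileage'].mean()  # 185.5

# Average mileage per truck type (illustrative addition)
per_type = df.groupby('trucktype')['mileage'].mean()

print(total_mileage, average_mileage)
print(per_type)
```

Using `.sum()` and `.mean()` avoids computing the row count by hand and extends naturally to grouped calculations.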
