# Accessing Data on AWS S3

This notebook shows how to download data from the White Swan S3 bucket using `s3fs`.



## Setup

The main requirement here is [s3fs](https://s3fs.readthedocs.io/en/latest/) - an interface for accessing the S3 filesystem using Python.

To set up s3fs, you need to:
* `pip install s3fs` (running `pip install -r requirements.txt` already does this for you)
* Set up your AWS credentials

The second step will no doubt be the most tedious. To get started, you first need access rights to the S3 bucket you want to connect to. In our case that is `bsd-white-swan/ankylosing-spondylitis/`.

If you haven't already, get access by asking Terri to put you in contact with the Black Swan dev-ops team. They will walk you through the process.

Mention to the dev-ops team that you are also setting up the credentials locally; they should be able to help with that as well.
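Before contacting anyone, it can help to see whether credentials are already present on your machine. The following is a minimal sketch, not an exhaustive check: it only looks at the two most common places the AWS tooling reads from (the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and the shared credentials file, usually `~/.aws/credentials`). The helper name `credential_sources` is made up for this illustration, and other sources such as SSO profiles or instance roles are not covered.

```python
import os


def credential_sources(env=None, home=None):
    """Return a list of places where AWS credentials appear to be configured.

    `env` and `home` default to the real environment and home directory;
    they are parameters only so the function is easy to try out in isolation.
    """
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home
    found = []

    # Access key pair supplied through environment variables
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")

    # Shared credentials file (the location can be overridden by
    # the AWS_SHARED_CREDENTIALS_FILE environment variable)
    cred_file = env.get("AWS_SHARED_CREDENTIALS_FILE") or os.path.join(
        home, ".aws", "credentials"
    )
    if os.path.isfile(cred_file):
        found.append(cred_file)

    return found


print(credential_sources() or "No credentials found in the usual places")
```

An empty result does not prove that access will fail, and a non-empty one does not prove that the credentials grant access to this particular bucket; it only narrows down where to look.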


Once you've got access rights and have set up your credentials locally, you should be able to run through the following example.


## Important Note

_You will notice that the code for reports 1 and 2 does not load the data from S3. This is because one of the volunteers could not access S3, so we reverted to excluding the data from git. When going through the old code, you might therefore have to replace the data-loading steps with a call to the function below._
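If some people working with the old code still cannot reach S3, one possible pattern is to prefer a local copy when it exists and only otherwise go to S3. The sketch below is an illustration under assumptions rather than something taken from reports 1 and 2: the helper name `load_with_fallback` and the default `local_dir="data"` are hypothetical, and the two loaders are passed in as arguments so the helper itself depends on nothing but the standard library.

```python
import os


def load_with_fallback(name, remote_loader, local_reader, local_dir="data"):
    """Read `name` from `local_dir` if it exists there, else load it remotely.

    remote_loader: callable taking the bare file name (e.g. an S3 loader)
    local_reader:  callable taking a local path (e.g. pd.read_csv)
    local_dir:     hypothetical folder holding files that are not in git
    """
    local_path = os.path.join(local_dir, name)
    if os.path.exists(local_path):
        return local_reader(local_path)
    return remote_loader(name)


# Possible usage once an S3 loader has been defined:
# df = load_with_fallback('clean_basmi.xls', load_from_s3, pd.read_excel)
```

Passing the loaders in keeps the choice of reader (`pd.read_csv`, `pd.read_excel`, ...) with the caller, who knows the file type.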

In [9]:
import os
import pandas as pd
import s3fs

In [16]:
import posixpath

bucket_path = 'bsd-white-swan/ankylosing-spondylitis/data/'

def load_from_s3(f, index_col=0):
    """
    Load a .csv or .xls file from this Knowledge Repo's S3 bucket
    into a pandas DataFrame.
    """
    # anon=False: authenticate with the locally configured AWS credentials
    fs = s3fs.S3FileSystem(anon=False)

    # S3 keys always use '/', so join with posixpath rather than os.path
    f_s3 = posixpath.join(bucket_path, f)

    if f.endswith('.csv'):
        with fs.open(f_s3) as fh:
            df = pd.read_csv(fh, index_col=index_col)
    elif f.endswith('.xls'):
        with fs.open(f_s3) as fh:
            df = pd.read_excel(fh, index_col=index_col)
    else:
        raise NotImplementedError('Can only read .csv or .xls files')

    return df
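To find out which file names can be passed to the loader, it may help to list the contents of the data folder first. The following is a small sketch under assumptions: `list_loadable` is a hypothetical helper, and it takes the filesystem object as an argument (an `s3fs.S3FileSystem` in practice) and relies only on its `ls` method returning full keys as strings.

```python
def list_loadable(fs, path, extensions=(".csv", ".xls")):
    """Return the sorted base names under `path` that have a supported extension.

    fs:   a filesystem object with an `ls(path, detail=False)` method,
          for example s3fs.S3FileSystem(anon=False)
    path: bucket prefix to inspect, e.g. the data folder of the bucket
    """
    names = []
    for key in fs.ls(path, detail=False):
        # Keys look like 'bucket/prefix/file.csv'; keep only the last part
        if key.endswith(tuple(extensions)):
            names.append(key.rsplit("/", 1)[-1])
    return sorted(names)


# Possible usage, assuming working credentials:
# import s3fs
# fs = s3fs.S3FileSystem(anon=False)
# print(list_loadable(fs, 'bsd-white-swan/ankylosing-spondylitis/data/'))
```

Only the immediate contents of `path` are inspected; sub-folders are not searched.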


In [18]:
df = load_from_s3('clean_basmi.xls', index_col=None)
df.head()

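| Unnamed: 0 | patient_id | Date | CRS | TWS | LSFS | LFS | IMS | BS | Drug |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.0 | 1995-05-09 | 3 | 1 | 6 | 5 | 3 | 3.6 | |
| 1 | | 1995-06-01 | 3 | 1 | 8 | 5 | 3 | 4.0 | |
| 2 | | 1995-06-12 | 2 | 1 | 5 | 3 | 2 | 2.6 | |
| 3 | | 1995-11-02 | 1 | 1 | 3 | 4 | 2 | 2.2 | |
| 4 | | 1996-05-02 | 2 | 1 | 4 | 3 | 2 | 2.4 | |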
