# S3 Access Log Cleanup

Use this notebook to clean up the S3 access log CSV files created from the originally downloaded S3 access logs.

In [1]:
import pandas as pd

The CSV file with S3 access log records:

In [2]:
s3log_in_csv = 's3log-201809.csv'

Read the CSV into a pandas DataFrame and replace any column value of `-` with `None`.

In [3]:
s3l = pd.read_csv(s3log_in_csv)
# replace() returns a new DataFrame; the result is not assigned here, so `s3l` is unchanged.
s3l.replace(['-'], [None]);
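
`DataFrame.replace` returns a new DataFrame rather than modifying `s3l`, so the result has to be assigned for the replacement to persist. Below is a minimal sketch of two ways to do that, using a small made-up CSV. The sample rows and the names `sample_csv`, `df_na`, and `df_replaced` are illustrative and not part of the real log.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample standing in for the real S3 log CSV.
sample_csv = io.StringIO(
    "Key,Error_Code,Bytes_Sent\n"
    "a.nc4,-,1024\n"
    "b.nc4,NoSuchKey,-\n"
)

# Option 1: treat '-' as a missing value while reading the file.
df_na = pd.read_csv(sample_csv, na_values=['-'])

# Option 2: read the file as-is, then keep the result of replace().
sample_csv.seek(0)
df_replaced = pd.read_csv(sample_csv).replace('-', np.nan)

# Count the missing values per column.
print(df_na.isna().sum())
```

Once missing values are present, later string filters such as `str.contains` may need `na=False` so that the resulting mask stays boolean.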

DataFrame overview:

In [4]:
s3l.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292275 entries, 0 to 292274
Data columns (total 19 columns):
Bucket_Owner           292275 non-null object
Bucket                 292275 non-null object
Time                   292275 non-null object
Remote_IP              292275 non-null object
Requester              292275 non-null object
Request_ID             292275 non-null object
Operation              292275 non-null object
Key                    292275 non-null object
HTTP_method            292275 non-null object
Request_URI            292275 non-null object
HTTP_status            292275 non-null int64
Error_Code             292275 non-null object
Bytes_Sent             292275 non-null object
Object_Size            292275 non-null object
Total_Time_ms          292275 non-null int64
Turn_Around_Time_ms    292275 non-null object
Referrer               292275 non-null object
User_Agent             292275 non-null object
Version_Id             292275 non-null object
dtypes: int64(2), object(17)

Convert the `Time` column to `datetime` objects:

In [5]:
s3l['Time'] = pd.to_datetime(s3l['Time'])
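
If some `Time` values might not parse, `errors='coerce'` turns them into `NaT` instead of raising, which makes them easy to count. This is a small sketch on invented strings; the actual format of the `Time` column in the CSV is not shown in this notebook, so the sample values are an assumption.

```python
import pandas as pd

# Invented sample values: one parseable timestamp and one bad entry.
raw_times = pd.Series(['2018-09-21 18:45:37', 'not a time'])

# Unparseable entries become NaT instead of raising an exception.
parsed = pd.to_datetime(raw_times, errors='coerce')

print(parsed.isna().sum(), "value(s) could not be parsed")
```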

## Row Selection Section

Code in this section selects the desired rows: log records from September (month 9) whose `Request_URI` contains a `cloudydap` query parameter.

In [6]:
s3l.Time.dt.month.unique()

array([9])

In [7]:
s3l = s3l[s3l['Time'].dt.month == 9]

In [8]:
s3l = s3l[s3l.Request_URI.str.contains('[?&]cloudydap=')]
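
The two filters above can be tried out on a toy frame. The rows here are invented, and `na=False` is an addition that only matters if `Request_URI` contains missing values.

```python
import pandas as pd

# Invented rows: one from another month, one without the parameter, one match.
toy = pd.DataFrame({
    'Time': pd.to_datetime(['2018-08-31 23:59:59',
                            '2018-09-01 00:00:05',
                            '2018-09-21 18:45:37']),
    'Request_URI': ['/a.nc4?cloudydap=x_UC1_A1_STARTED_1.h5',
                    '/b.nc4',
                    '/c.nc4?foo=1&cloudydap=y_UC6_A2CFT_STARTED_2.h5'],
})

# Boolean masks for each condition; na=False maps missing URIs to False.
in_september = toy['Time'].dt.month == 9
has_param = toy['Request_URI'].str.contains(r'[?&]cloudydap=', regex=True, na=False)

# Keep only the rows that satisfy both conditions.
selected = toy[in_september & has_param]
print(selected)
```

Combining the masks with `&` applies both conditions in one step and keeps the original index labels, as in the `info()` output of the real data.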

In [9]:
s3l.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34539 entries, 183685 to 286621
Data columns (total 19 columns):
Bucket_Owner           34539 non-null object
Bucket                 34539 non-null object
Time                   34539 non-null datetime64[ns]
Remote_IP              34539 non-null object
Requester              34539 non-null object
Request_ID             34539 non-null object
Operation              34539 non-null object
Key                    34539 non-null object
HTTP_method            34539 non-null object
Request_URI            34539 non-null object
HTTP_status            34539 non-null int64
Error_Code             34539 non-null object
Bytes_Sent             34539 non-null object
Object_Size            34539 non-null object
Total_Time_ms          34539 non-null int64
Turn_Around_Time_ms    34539 non-null object
Referrer               34539 non-null object
User_Agent             34539 non-null object
Version_Id             34539 non-null object
dtypes: datetime64[ns](1), int64(2), object(16)

In [10]:
s3l.Time.head() 

183685   2018-09-21 18:45:37
183686   2018-09-21 18:45:45
183687   2018-09-21 18:54:03
183688   2018-09-21 18:54:04
183689   2018-09-21 18:54:04
Name: Time, dtype: datetime64[ns]

## Test Parsing of the `cloudydap` URL Parameter

In [11]:
s3l.Request_URI.iloc[0]

'/cloudydap/merra2/MERRA2_200.tavgM_2d_int_Nx.199301.nc4?cloudydap=EFYZgp8U0Y_UC6_A2CFT_STARTED_1537555462.h5'

In [12]:
import re

The regular expression for the `cloudydap` URL parameter:

In [13]:
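# Raw string, so that `\.` reaches the regex engine as a literal dot without an invalid-escape warning.
regexp = r'[?&]cloudydap=(?P<dap_id>[^_]+)_(?P<use_case>.+)_(?P<arch>A[^_]+)_STARTED_(?P<uc_run_id>.+)\.h5&?'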

In [14]:
re.search(regexp,
          '?cloudydap=W7hDl1ELmp_UNKNOWN_USE_CASE_A2CFT_STARTED_1537562999.h5').groups()

('W7hDl1ELmp', 'UNKNOWN_USE_CASE', 'A2CFT', '1537562999')
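
Because `use_case` is matched greedily with `.+`, it can itself contain underscores; the `_(?P<arch>A[^_]+)_STARTED_` part fixes where it ends. Here is a small sketch using only the two parameter values that appear in this notebook, with `groupdict()` to show the field names. The name `CLOUDYDAP_RE` is illustrative.

```python
import re

# Same pattern as above, compiled once and split over two lines for readability.
CLOUDYDAP_RE = re.compile(
    r'[?&]cloudydap=(?P<dap_id>[^_]+)_(?P<use_case>.+)_(?P<arch>A[^_]+)'
    r'_STARTED_(?P<uc_run_id>.+)\.h5&?'
)

uris = [
    '/cloudydap/merra2/MERRA2_200.tavgM_2d_int_Nx.199301.nc4'
    '?cloudydap=EFYZgp8U0Y_UC6_A2CFT_STARTED_1537555462.h5',
    '?cloudydap=W7hDl1ELmp_UNKNOWN_USE_CASE_A2CFT_STARTED_1537562999.h5',
]

# groupdict() returns the named groups as a dict keyed by group name.
parsed = [CLOUDYDAP_RE.search(u).groupdict() for u in uris]
for fields in parsed:
    print(fields)
```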

Check that the same pattern works on the `Request_URI` column:

In [15]:
# Reuse the pattern defined above instead of repeating the literal.
s3l.Request_URI.str.extract(regexp, expand=True).head()

| | dap_id | use_case | arch | uc_run_id |
|---|---|---|---|---|
| 183685 | EFYZgp8U0Y | UC6 | A2CFT | 1537555462 |
| 183686 | HnNDuPmkll | UC6 | A2CFT | 1537555462 |
| 183687 | zhUkS6ScIz | UC11 | A2CFT | 1537556009 |
| 183688 | ZH92J0gGyF | UC11 | A2CFT | 1537556009 |
| 183689 | ZH92J0gGyF | UC11 | A2CFT | 1537556009 |
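
`str.extract` returns one column per named group and fills rows that do not match with missing values, so the result can be joined back onto the log records by index. This is a minimal sketch under assumptions: the `/x.nc4` paths and the non-matching URI are invented, and the names `toy`, `fields`, and `combined` are illustrative.

```python
import pandas as pd

pattern = (r'[?&]cloudydap=(?P<dap_id>[^_]+)_(?P<use_case>.+)_(?P<arch>A[^_]+)'
           r'_STARTED_(?P<uc_run_id>.+)\.h5&?')

# Two parameter values from this notebook plus one invented non-matching URI.
toy = pd.DataFrame({'Request_URI': [
    '/x.nc4?cloudydap=EFYZgp8U0Y_UC6_A2CFT_STARTED_1537555462.h5',
    '/x.nc4?cloudydap=W7hDl1ELmp_UNKNOWN_USE_CASE_A2CFT_STARTED_1537562999.h5',
    '/x.nc4?other=1',
]})

# One column per named group; the non-matching row is filled with NaN.
fields = toy['Request_URI'].str.extract(pattern, expand=True)

# The extracted frame shares the index of `toy`, so join() aligns row by row.
combined = toy.join(fields)
print(combined[['dap_id', 'use_case', 'arch', 'uc_run_id']])
```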


## Save the Desired S3 Log Records

In [16]:
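from pathlib import Path

outfile = Path('s3log-cloudydap-201809.csv')
# Append without a header if the file already exists; otherwise create it with one.
if outfile.exists():
    mode = 'a'
    header = False
else:
    mode = 'w'
    header = True
# The '+00:00' suffix is written literally, so this assumes the timestamps are in UTC.
s3l.to_csv(str(outfile), mode=mode, header=header, index=False,
           date_format='%Y-%m-%dT%H:%M:%S+00:00')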
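
The append-or-create logic above can be exercised on a throwaway file. In this sketch the helper name `append_csv` and the toy data are hypothetical; it writes to a temporary directory so nothing real is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd


def append_csv(df, path):
    """Write df to path, appending without a header if the file already exists."""
    path = Path(path)
    exists = path.exists()
    df.to_csv(str(path), mode='a' if exists else 'w', header=not exists,
              index=False, date_format='%Y-%m-%dT%H:%M:%S+00:00')


# Invented single-row frame with a datetime column.
toy = pd.DataFrame({'Time': pd.to_datetime(['2018-09-21 18:45:37']),
                    'HTTP_status': [200]})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / 'toy.csv'
    append_csv(toy, out)  # creates the file and writes the header
    append_csv(toy, out)  # appends one more row, no second header
    lines = out.read_text().splitlines()

print(lines)
```

Note that the second call appends the same row again, so running the save cell twice on the same data would duplicate records in the output file.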