# Analysis of AWS S3 Access Log Data

In [1]:
import numpy as np
import pandas as pd
from bokeh.charts import (output_notebook, output_file, show, 
                          Scatter, Histogram, TimeSeries, BoxPlot)
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import Range1d, HoverTool, ResizeTool
from bokeh.layouts import column, row, gridplot
output_notebook()

Read the log data and replace column values of `-` with `None`:

In [2]:
s3l = pd.read_csv('../../../s3log-cloudydap-201703-2.csv')
s3l.replace(['-'], [None]);

  interactivity=interactivity, compiler=compiler, result=result)


What was just read from the CSV file?

In [3]:
s3l.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2053001 entries, 0 to 2053000
Data columns (total 19 columns):
Bucket_Owner           object
Bucket                 object
Time                   object
Remote_IP              object
Requester              object
Request_ID             object
Operation              object
Key                    object
HTTP_method            object
Request_URI            object
HTTP_status            int64
Error_Code             object
Bytes_Sent             int64
Object_Size            object
Total_Time_ms          int64
Turn_Around_Time_ms    object
Referrer               object
User_Agent             object
Version_Id             object
dtypes: int64(3), object(16)
memory usage: 297.6+ MB


## Data Preprocessing

Remove not needed columns:

In [4]:
s3l.drop(['Bucket', 'Bucket_Owner', 'Referrer', 'Version_Id'], axis=1, inplace=True)

Extract information from the `cloudydap` URL parameter into new columns. The format of the values is either of the two:

    {Use case}_{Architecture}_STARTED_{Use case run ID}
    {Use case}_{Architecture}_{Use case run ID}
    
**NOTE**: The `{Use case}` field may contain underscores!

In [5]:
cloudydap_info = s3l.Request_URI.str.extract(
    '[?&]cloudydap=(?P<use_case>.+)_(?P<arch>A[^_]+)_(?:STARTED_)?(?P<uc_run_id>.+)\.h5&?', 
    expand=True)
s3l = pd.concat([s3l, cloudydap_info], axis=1);

Make sure the cloudydap information is successfully parsed for each entry:

In [6]:
if s3l[['arch', 'use_case', 'uc_run_id']].isnull().any().any():
    raise ValueError('Parsing cloudydap parameter values failed')

Remove the Request_URI column now:

In [7]:
s3l.drop(['Request_URI'], axis=1, inplace=True)

Convert the Time column to datetime objects:

In [8]:
s3l['Time'] = pd.to_datetime(s3l['Time'])

What we have now?

In [9]:
s3l.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2053001 entries, 0 to 2053000
Data columns (total 17 columns):
Time                   datetime64[ns]
Remote_IP              object
Requester              object
Request_ID             object
Operation              object
Key                    object
HTTP_method            object
HTTP_status            int64
Error_Code             object
Bytes_Sent             int64
Object_Size            object
Total_Time_ms          int64
Turn_Around_Time_ms    object
User_Agent             object
use_case               object
arch                   object
uc_run_id              object
dtypes: datetime64[ns](1), int64(3), object(13)
memory usage: 266.3+ MB


Sort rows based on column Time:

In [10]:
s3l.sort_values(by='Time', ascending=True);

Activate the cell below if selecting data by time:

s3l = s3l[(s3l.Time >= '2017-02-28') & (s3l.Time <= '2017-03-01T06:00:00')]

In [11]:
s3l = s3l[(s3l.Time >= '2017-03-12T06:08:00') & (s3l.Time <= '2017-03-14T13:32:00')]

Time span of the S3 access data:

In [12]:
print('Start: {}\nEnd:   {}'.format(s3l.Time.min(), s3l.Time.max()))

Start: 2017-03-12 06:08:17
End:   2017-03-14 13:31:59


# Data Cleanup

Search for any log entry that indicates some sort of error.

What are the S3 operations:

In [13]:
s3l.Operation.unique()

array(['REST.GET.OBJECT'], dtype=object)

What are the different HTTP status codes:

In [14]:
s3l['HTTP_status'].unique()

array([206, 200, 404, 500, 416])

What are the different AWS S3 system error codes:

In [15]:
s3l.Error_Code.unique()

array(['-', 'NoSuchKey', 'InternalError', 'InvalidRange'], dtype=object)

Find the _bad_ S3 requests:

In [16]:
bad_reqs = (s3l['HTTP_status'] >= 400) | (s3l['Error_Code'] != '-')

In [17]:
s3l.loc[bad_reqs, ['HTTP_status', 'Error_Code', 'Key', 'HTTP_method', 'User Agent']]

Unnamed: 0,HTTP_status,Error_Code,Key,HTTP_method,User Agent
669585,404,NoSuchKey,bytestream/6389aba2ced43ef4796ee89d8d8414d9,GET,
670566,404,NoSuchKey,bytestream/094e70793148a97742191430ccea74c7,GET,
830512,404,NoSuchKey,bytestream/0f66edc0d3b8d3c77bc5481b5088a5fc,GET,
831509,404,NoSuchKey,bytestream/671e319145cf87e8777ef83ca082bfb3,GET,
856377,404,NoSuchKey,bytestream/094e70793148a97742191430ccea74c7,GET,
860302,404,NoSuchKey,bytestream/094e70793148a97742191430ccea74c7,GET,
879034,404,NoSuchKey,bytestream/671e319145cf87e8777ef83ca082bfb3,GET,
886658,404,NoSuchKey,bytestream/adb99ec2a68b1d16fc9cd2f1d33e983d,GET,
889743,404,NoSuchKey,bytestream/adb99ec2a68b1d16fc9cd2f1d33e983d,GET,
892605,404,NoSuchKey,bytestream/adb99ec2a68b1d16fc9cd2f1d33e983d,GET,


Break down bad S3 requests per day and hour:

In [18]:
s3l.loc[bad_reqs, 'Time']\
    .apply(lambda t: pd.Series({'date': t.date(), 'hour': t.hour}))\
    .groupby(['date', 'hour'])\
    .size()

date        hour
2017-03-12  12       2
2017-03-13  6        2
            7        3
            8       14
            16       1
            21       1
2017-03-14  0        1
            2        4
            11       1
            12       1
dtype: int64

How many bad requests per architecture and use case:

In [19]:
s3l.loc[bad_reqs].groupby(['arch', 'use_case']).size()

arch   use_case
A2CFT  UC20         1
A3CFT  UC18        19
       UC2          2
       UC20         6
       UC21         2
dtype: int64

Remove bad S3 log entries:

In [20]:
s3l.drop(s3l[bad_reqs].index, inplace=True)

Number of rows now:

In [21]:
len(s3l.index)

1310427

## Helper Functions

### Transfer Rate

Calculate transfer rate for each S3 request as a new column. The formula is:
    
$$0.001 * \frac{Bytes\_Sent}{Total\_Time\_ms - Turn\_Around\_Time\_ms}\,\,\rm{MBytes/s}$$

Transfer rate for cases where total and turnaround time are equal (very fast data transfer) will be equal to the largest transfer rate calculated. This avoids having to deal with `inf` values.

In [22]:
def s3_transf_rate(df):
    tr = 0.001 * df['Bytes_Sent'] / (df['Total_Time_ms'] - df['Turn_Around_Time_ms'])
    infs = np.isinf(tr)
    tr[infs] = tr[~infs].max()
    return tr

### Bytes Savings

Reduction in the number of bytes pulled out of S3 compared to the original object size.

In [23]:
def s3_bytes_savings(df):
    """Calculate the total percentage reduction in the bytes returned from S3 compared
       to the original object sizes.
       
       df: Input pandas DataFrame.
    """
    return (df['Bytes_Sent'].sum()/df['Object_Size'].sum() - 1) * 100

### Grid of Boxplots

Produce a grid of boxplots for a specific column of the DataFrame and a list of use cases. Breakdown the column data per architecture.

In [24]:
def uc_boxplots(df, col, uc_list=None, width=300, height=300):
    if uc_list is None:
        uc_list = df.use_case.unique()

    g = list()
    for uc in sorted(uc_list):
        p = BoxPlot(df[df.use_case == uc], values=col, 
                    title='Use Case: {}'.format(uc),
                    label=['arch'], 
                    outliers=False,
                    color='arch',
                    legend='top_right')
        g.append(p)

    show(gridplot(g, ncols=2, plot_width=width, plot_height=height, 
                  toolbar_location='left'))

## General Data Exploration

Convert the Turn_Around_Time_ms and Object_Size columns to numbers (I don't know why it is not in the first place):

In [25]:
s3l['Turn_Around_Time_ms'] = pd.to_numeric(s3l['Turn_Around_Time_ms'])
s3l.Object_Size = pd.to_numeric(s3l.Object_Size)

Calculate the transfer rate in MBytes/sec for every S3 request:

In [26]:
s3l['Trans_Rate_MB/s'] = s3_transf_rate(s3l)

User agent list:

In [27]:
s3l.User_Agent.unique()

array(['-',
       'libcurl/7.19.7 NSS/3.21 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2'], dtype=object)

How many different architectures:

In [28]:
s3l.arch.unique()

array(['A2CFT', 'A3CFT', 'A1CFT'], dtype=object)

How many different use cases:

In [29]:
s3l.use_case.unique()

array(['UC6', 'UC7', 'UC2', 'UC10', 'UC11', 'UC12', 'UC13', 'UC14', 'UC15',
       'UC16', 'UC17', 'UC18', 'UC19', 'UC20', 'UC21'], dtype=object)

How many different use case run IDs:

In [30]:
s3l.uc_run_id.unique()

array(['1489300078', '1489300299', '1489298903', '1489300485',
       '1489298899', '1489298896', '1489304047', '1489304190',
       '1489304325', '1489307764', '1489307950', '1489308126',
       '1489311677', '1489312051', '1489312485', '1489313617',
       '1489313915', '1489314168', '1489315283', '1489315808',
       '1489316475', '1489317580', '1489318683', '1489319771',
       '1489320897', '1489323159', '1489322045', '1489325431',
       '1489326608', '1489324290', '1489327766', '1489327998',
       '1489328590', '1489329530', '1489329755', '1489330091',
       '1489330272', '1489330458', '1489330708', '1489328391',
       '1489328845', '1489329350', '1489329938', '1489328995',
       '1489329174', '1489328221', '1489330896', '1489331116',
       '1489331386', '1489331540', '1489331806', '1489335411',
       '1489339014', '1489342594', '1489345993', '1489349387',
       '1489352793', '1489356223', '1489359681', '1489363090',
       '1489366701', '1489367147', '1489367367', '14893

Count of S3 requests for each architecture:

In [31]:
s3l.arch.value_counts().sort_index()

A1CFT     10446
A2CFT    659432
A3CFT    640549
Name: arch, dtype: int64

Count of S3 requests for each use case:

In [32]:
s3l.use_case.value_counts()

UC20    573302
UC21    290951
UC18    193429
UC19    173482
UC15     23353
UC14     22957
UC13     14096
UC11      6204
UC12      3873
UC6       2169
UC17      2144
UC10      1825
UC16      1470
UC7       1097
UC2         75
Name: use_case, dtype: int64

Number of S3 requests per use case and architecture:

In [33]:
s3l.groupby(['use_case', 'arch']).size().unstack()

arch,A1CFT,A2CFT,A3CFT
use_case,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
UC10,365.0,730.0,730.0
UC11,364.0,2920.0,2920.0
UC12,439.0,1717.0,1717.0
UC13,438.0,6829.0,6829.0
UC14,4387.0,9499.0,9071.0
UC15,3649.0,10218.0,9486.0
UC16,,722.0,748.0
UC17,,1066.0,1078.0
UC18,,96839.0,96590.0
UC19,,86753.0,86729.0


Change in the number of S3 requests between different architectures:

In [34]:
x = s3l.groupby(['arch', 'use_case']).size()

In [35]:
x['A2CFT'] - x['A1CFT']

use_case
UC10     365.0
UC11    2556.0
UC12    1278.0
UC13    6391.0
UC14    5112.0
UC15    6569.0
UC16       NaN
UC17       NaN
UC18       NaN
UC19       NaN
UC2        NaN
UC20       NaN
UC21       NaN
UC6      426.0
UC7        2.0
dtype: float64

In [36]:
x['A3CFT'] - x['A1CFT']

use_case
UC10     365.0
UC11    2556.0
UC12    1278.0
UC13    6391.0
UC14    4684.0
UC15    5837.0
UC16       NaN
UC17       NaN
UC18       NaN
UC19       NaN
UC2        NaN
UC20       NaN
UC21       NaN
UC6      426.0
UC7        0.0
dtype: float64

In [37]:
x['A3CFT'] - x['A2CFT']

use_case
UC10        0
UC11        0
UC12        0
UC13        0
UC14     -428
UC15     -732
UC16       26
UC17       12
UC18     -249
UC19      -24
UC2        67
UC20   -11140
UC21    -6413
UC6         0
UC7        -2
dtype: int64

Timeline of all S3 requests:

p = figure(x_axis_type='datetime', toolbar_location='above')
p.xaxis.axis_label = 'Time of S3 request'
p.yaxis.axis_label = 'Total S3 request time [ms]'
p.segment(x0=s3l['Time'], y0=([0] * len(s3l['Total_Time_ms'])), 
          x1=s3l['Time'], y1=s3l['Total_Time_ms'], 
          line_alpha=0.5)
p.circle(s3l['Time'], s3l['Total_Time_ms'])
p.add_tools(ResizeTool())
show(p)

## Select Performance Comparison between Architectures

In [38]:
grp = s3l.groupby('arch')

### S3 Total Time

In [39]:
grp['Total_Time_ms'].describe().unstack()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
arch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A1CFT,10446.0,3330.939498,2410.781596,1455.0,1599.0,2185.5,4520.0,68037.0
A2CFT,659432.0,41.476932,147.034715,5.0,10.0,16.0,27.0,58931.0
A3CFT,640549.0,72.259204,186.36645,3.0,30.0,44.0,71.0,60758.0


In [40]:
p = BoxPlot(s3l, values='Total_Time_ms', label=['arch'],
           color='arch',
           outliers=False,
           legend='top_right')
show(p)

### S3 Turnaround Time

In [41]:
grp['Turn_Around_Time_ms'].describe().unstack()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
arch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A1CFT,10446.0,118.012253,96.993696,42.0,83.0,99.0,126.0,6100.0
A2CFT,659432.0,27.527339,65.442305,5.0,9.0,13.0,24.0,15858.0
A3CFT,640549.0,57.349597,75.520149,2.0,29.0,42.0,65.0,17203.0


In [42]:
p = BoxPlot(s3l, values='Turn_Around_Time_ms', label=['arch'],
           color='arch',
           outliers=False,
           legend='top_right')
show(p)

### Transfer Rates

In [43]:
grp['Trans_Rate_MB/s'].describe().unstack()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
arch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A1CFT,10446.0,70.066337,11.166009,4.675489,70.040226,73.076566,76.24934,81.798303
A2CFT,659432.0,107.507218,163.36341,8e-06,37.075,41.765,80.619,544.513
A3CFT,640549.0,149.340187,194.366834,5e-06,38.07,43.038,121.9215,544.513


In [44]:
p = BoxPlot(s3l, values='Trans_Rate_MB/s', label=['arch'],
           color='arch',
           outliers=False,
           legend='top_left')
show(p)

### Ratio of S3 turnaround time vs. S3 total time

This ratio indicates which component dominates the total time of one S3 request. Ratios less than 0.5 show that the data transfer from S3 is the dominant component; ratios greater than 0.5 mean the S3 turnaround time is the dominant component.

In [45]:
f = Histogram(pd.DataFrame({'arch': s3l['arch'],
              'ratio': s3l['Turn_Around_Time_ms']/s3l['Total_Time_ms']}),
              values='ratio', color='arch', bins=50, density=True,
              xlabel='Ratio of S3 turnaround vs S3 total time',
              legend='top_right')
show(f)

### Bytes Sent

These are bytes transferred from S3 to Hyrax server.

In [46]:
grp['Bytes_Sent'].describe().unstack()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
arch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A1CFT,10446.0,206600400.0,101371800.0,112519317.0,112888124.0,113083656.0,318357415.0,323612711.0
A2CFT,659432.0,97846.29,130792.6,4.0,37764.0,41177.0,175468.0,1112792.0
A3CFT,640549.0,97460.74,129495.7,8.0,37735.0,41139.0,175287.0,1112792.0


## Architecture 1

S3 requests for Architecture 1:

In [47]:
a1 = s3l[s3l.arch == 'A1CFT']
len(a1.index)

10446

For Architecture \#1, the number of bytes sent and object size should be the same for all S3 requests:

In [48]:
(a1.Bytes_Sent == a1.Object_Size).all()

True

Percentage reduction in the number of bytes pulled out of S3 compared to the original object sizes:

In [49]:
s3_bytes_savings(a1)

0.0

In [50]:
p = Histogram(a1['Total_Time_ms'], bins=100, title='Architecture 1 (All Use Cases)',
              xlabel='Total S3 response time [ms]', ylabel='Count(S3 requests)')
p.x_range = Range1d(0, 25000)
show(p)

In [51]:
p = BoxPlot(a1, values='Total_Time_ms', label=['use_case'], 
            title='Architecture 1 (group by Use Case)',
            outliers=False,
            legend=False)
show(p)

In [52]:
p = Histogram(a1['Turn_Around_Time_ms'], bins=50, title='Architecture 1 (All Use Cases)',
              xlabel='Turn-around S3 time [ms]', ylabel='Count(S3 requests)')
# p.x_range = Range1d(0, 25000)
show(p)

In [53]:
p = BoxPlot(a1, values='Turn_Around_Time_ms', label=['use_case'], 
            title='Architecture 1 (group by Use Case)',
            outliers=False,
            legend=False)
show(p)

In [54]:
tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = Scatter(a1, x='Bytes_Sent', y='Total_Time_ms',
            title='Architecture 1 (All Use Cases)',
            xlabel='Bytes Sent [bytes]', ylabel='Total S3 response time [ms]',
            tooltips=tooltips)
p.add_tools(ResizeTool())
show(p)

In [55]:
tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = Scatter(a1, x='Bytes_Sent', y='Turn_Around_Time_ms',
            xlabel='Bytes Sent [bytes]', ylabel='Turn-around S3 response time [ms]',
            tooltips=tooltips)
p.add_tools(ResizeTool())
show(p)

tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = figure(toolbar_location='above', title='Architecture 1 (All Use Cases)')
p.xaxis.axis_label = 'S3 actual response time [ms]'
p.yaxis.axis_label = 'Turnaround S3 time [ms]'
p.xaxis.axis_label = 'S3 actual response time [ms]'
p.yaxis.axis_label = 'Turnaround S3 time [ms]'
new_df = a1[['Turn_Around_Time_ms', 'Key', 'Bytes_Sent', 'use_case']].copy()
new_df['actual'] = a1['Total_Time_ms'] - a1['Turn_Around_Time_ms']
p.scatter(x='actual', y='Turn_Around_Time_ms', source=ColumnDataSource(new_df), 
          color='red', size=4,
          name = 'scatter')
hover = HoverTool(names=['scatter'], tooltips=tooltips)
p.add_tools(ResizeTool(), hover)
show(p)

In [56]:
show(Histogram(a1['Trans_Rate_MB/s'], bins=50,
               xlabel='Transfer Rate [MB/s]', ylabel='Count(S3 requests)'))

## Architecture 2

S3 requests for Architecture 2:

In [57]:
a2 = s3l[s3l.arch == 'A2CFT']
len(a2.index)

659432

For Architecture \#2, the number of bytes sent and object size should **not** be the same for all S3 requests:

In [58]:
(a2.Bytes_Sent == a2.Object_Size).all()

False

Percentage reduction in the number of bytes pulled out of S3 compared to the original object sizes:

In [59]:
s3_bytes_savings(a2)

-99.94856049224019

This architecture uses HTTP range `GET`s. What are HTTP response codes? (Hint: They should all be [206](https://httpstatuses.com/206]).)

In [60]:
a2.HTTP_status.unique()

array([206])

In [61]:
p = Histogram(a2['Total_Time_ms'], bins=75, title='Architecture 2 (All Use Cases)',
              xlabel='Total S3 response time [ms]', ylabel='Count(S3 requests)')
#p.x_range = Range1d(0, 1500)
show(p)

In [62]:
p = BoxPlot(a2, values='Total_Time_ms', label=['use_case'], 
            title='Architecture 2 (group by Use Case)',
            outliers=False,
            legend=False)
show(p)

In [63]:
p = Histogram(a2['Turn_Around_Time_ms'], bins=50, title='Architecture 2 (All Use Cases)',
              xlabel='Turn-around S3 time [ms]', ylabel='Count(S3 requests)')
# p.x_range = Range1d(0, 25000)
show(p)

In [64]:
p = BoxPlot(a2, values='Turn_Around_Time_ms', label=['use_case'], 
            title='Architecture 2 (group by Use Case)',
            outliers=False,
            legend=False)
show(p)

tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = Scatter(a2, x='Bytes_Sent', y='Total_Time_ms', title='Architecture 2 (All Use Cases)',
            xlabel='Bytes Sent [bytes]', ylabel='Total S3 response time [ms]',
            tooltips=tooltips)
p.add_tools(ResizeTool())
show(p)

tooltips=[('Key', '@Key'), ('Bytes_Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = Scatter(a2, x='Bytes_Sent', y='Turn_Around_Time_ms',
            xlabel='Bytes Sent [bytes]', ylabel='Turn-around S3 response time [ms]',
            tooltips=tooltips)
p.add_tools(ResizeTool())
show(p)

tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = figure(toolbar_location='above', title='Architecture 2 (All Use Cases)')
p.xaxis.axis_label = 'S3 actual response time [ms]'
p.yaxis.axis_label = 'Turnaround S3 time [ms]'
p.xaxis.axis_label = 'S3 actual response time [ms]'
p.yaxis.axis_label = 'Turnaround S3 time [ms]'
new_df = a2[['Turn_Around_Time_ms', 'Key', 'Bytes_Sent', 'use_case']].copy()
new_df['actual'] = a2['Total_Time_ms'] - a2['Turn_Around_Time_ms']
p.scatter(x='actual', y='Turn_Around_Time_ms', source=ColumnDataSource(new_df), 
          color='red', size=8,
          name = 'scatter')
hover = HoverTool(names=['scatter'], tooltips=tooltips)
p.add_tools(ResizeTool(), hover)
show(p)

In [65]:
p = Histogram(a2['Trans_Rate_MB/s'], bins=50,
              xlabel='Transfer Rate [MB/s]', ylabel='Count(S3 requests)')
#p.x_range = Range1d(0, 120)
show(p)

## Architecture 3

S3 requests for Architecture 3:

In [66]:
a3 = s3l[s3l.arch == 'A3CFT']
len(a3.index)

640549

For Architecture \#3, the number of bytes sent and object size should **be** the same for all S3 requests:

In [67]:
(a3.Bytes_Sent == a3.Object_Size).all()

False

Percentage reduction in the number of bytes pulled out of S3 compared to the original object sizes:

In [68]:
s3_bytes_savings(a3)

-63.894244026430314

In [69]:
p = Histogram(a3['Total_Time_ms'], bins=75, title='Architecture 3 (All Use Cases)',
              xlabel='Total S3 response time [ms]', ylabel='Count(S3 requests)')
#p.x_range = Range1d(0, 1500)
show(p)

In [70]:
p = BoxPlot(a3, values='Total_Time_ms', label=['use_case'], 
            title='Architecture 3 (group by Use Case)',
            outliers=False,
            legend=False)
show(p)

In [71]:
p = Histogram(a3['Turn_Around_Time_ms'], bins=50, title='Architecture 3 (All Use Cases)',
              xlabel='Turn-around S3 time [ms]', ylabel='Count(S3 requests)')
# p.x_range = Range1d(0, 25000)
show(p)

In [72]:
p = BoxPlot(a3, values='Turn_Around_Time_ms', label=['use_case'], 
            title='Architecture 3 (group by Use Case)',
            outliers=False,
            legend=False)
show(p)

tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = Scatter(a3, x='Bytes_Sent', y='Total_Time_ms', title='Architecture 3 (All Use Cases)',
            xlabel='Bytes Sent [bytes]', ylabel='Total S3 response time [ms]',
            tooltips=tooltips)
p.add_tools(ResizeTool())
show(p)

tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = Scatter(a3, x='Bytes_Sent', y='Turn_Around_Time_ms',
            xlabel='Bytes Sent [bytes]', ylabel='Turn-around S3 response time [ms]',
            tooltips=tooltips)
p.add_tools(ResizeTool())
show(p)

tooltips=[('Key', '@Key'), ('Bytes Sent', '@Bytes_Sent'), ('Use Case', '@use_case')]
p = figure(toolbar_location='above', title='Architecture 3 (All Use Cases)')
p.xaxis.axis_label = 'S3 actual response time [ms]'
p.yaxis.axis_label = 'Turnaround S3 time [ms]'
new_df = a3[['Turn_Around_Time_ms', 'Key', 'Bytes_Sent', 'use_case']].copy()
new_df['actual'] = a3['Total_Time_ms'] - a3['Turn_Around_Time_ms']
p.scatter(x='actual', y='Turn_Around_Time_ms', source=ColumnDataSource(new_df),
          color='red', size=8,
         name = 'scatter')
hover = HoverTool(names=['scatter'], tooltips=tooltips)
p.add_tools(ResizeTool(), hover)
show(p)

In [73]:
p = Histogram(a3['Trans_Rate_MB/s'], bins=50,
              xlabel='Transfer Rate [MB/s]', ylabel='Count(S3 requests)')
#p.x_range = Range1d(0, 120)
show(p)

## Use Cases on Different Architectures

In [74]:
grp = s3l.groupby(['use_case', 'arch'])

### S3 Total Response Time

In [75]:
grp['Total_Time_ms'].describe().unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
use_case,arch,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
UC10,A1CFT,365.0,5090.961644,1834.533756,4158.0,4452.0,4555.0,4777.0,20202.0
UC10,A2CFT,730.0,114.243836,122.972749,11.0,67.0,89.0,117.0,1595.0
UC10,A3CFT,730.0,101.50274,120.415949,31.0,56.0,70.0,96.0,1174.0
UC11,A1CFT,364.0,5218.167582,2080.158387,4140.0,4451.0,4544.5,4784.0,18095.0
UC11,A2CFT,2920.0,73.280137,110.174552,7.0,25.0,49.0,81.0,1317.0
UC11,A3CFT,2920.0,93.887671,125.312377,26.0,54.0,67.0,94.0,2576.0
UC12,A1CFT,439.0,1819.908884,856.011112,1506.0,1564.0,1613.0,1757.5,10672.0
UC12,A2CFT,1717.0,86.397787,140.466135,7.0,33.0,62.0,97.0,3323.0
UC12,A3CFT,1717.0,61.285964,95.556378,11.0,30.0,39.0,55.0,1689.0
UC13,A1CFT,438.0,1778.769406,659.674331,1509.0,1560.0,1614.5,1743.0,10509.0


In [76]:
uc_boxplots(s3l, 'Total_Time_ms', width=325, height=325)

### S3 Turnaround Time

In [77]:
grp['Turn_Around_Time_ms'].describe().unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
use_case,arch,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
UC10,A1CFT,365.0,117.090411,59.283775,54.0,84.0,100.0,127.0,506.0
UC10,A2CFT,730.0,90.50274,59.94701,8.0,60.0,81.0,103.0,779.0
UC10,A3CFT,730.0,84.427397,80.358598,29.0,53.0,66.0,87.0,1173.0
UC11,A1CFT,364.0,120.494505,100.158896,55.0,83.0,97.5,126.5,1352.0
UC11,A2CFT,2920.0,56.294178,64.610638,5.0,21.75,44.0,71.0,1150.0
UC11,A3CFT,2920.0,82.542466,107.762161,24.0,51.0,63.0,85.25,2574.0
UC12,A1CFT,439.0,119.71754,80.837271,53.0,83.0,99.0,126.0,1191.0
UC12,A2CFT,1717.0,73.571928,101.958114,6.0,29.0,59.0,92.0,3240.0
UC12,A3CFT,1717.0,50.535236,48.402854,10.0,29.0,38.0,52.0,808.0
UC13,A1CFT,438.0,118.321918,70.825883,42.0,81.25,98.0,129.0,864.0


In [78]:
uc_boxplots(s3l, 'Turn_Around_Time_ms', width=325, height=325)

### Bytes Sent in S3 Responses

In [79]:
grp['Bytes_Sent'].describe().unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
use_case,arch,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
UC10,A1CFT,365.0,317499200.0,4251460.0,297674188.0,316693200.0,318602372.0,319768600.0,323612711.0
UC10,A2CFT,730.0,182020.7,11875.92,146492.0,170243.0,182015.0,193788.2,202815.0
UC10,A3CFT,730.0,182020.7,11875.92,146492.0,170243.0,182015.0,193788.2,202815.0
UC11,A1CFT,364.0,317496400.0,4256981.0,297674188.0,316691100.0,318606885.0,319770400.0,323612711.0
UC11,A2CFT,2920.0,184846.6,9133.132,146492.0,179303.8,186876.0,192267.5,202815.0
UC11,A3CFT,2920.0,184846.6,9133.132,146492.0,179303.8,186876.0,192267.5,202815.0
UC12,A1CFT,439.0,114368800.0,8441981.0,112519317.0,112826400.0,112904688.0,112986000.0,162646153.0
UC12,A2CFT,1717.0,45663.93,68699.19,32735.0,38847.0,40059.0,40870.0,831744.0
UC12,A3CFT,1717.0,45663.93,68699.19,32735.0,38847.0,40059.0,40870.0,831744.0
UC13,A1CFT,438.0,114372100.0,8451355.0,112519317.0,112826100.0,112904189.5,112986000.0,162646153.0


---

---

**OLD STUFF BELOW**

---

---

## Splitting Log Entries

The log entries will be split based on the product: AIRS and MERRA2.

## Analysis of AIRS Files Log Entries

## Analysis of MERRA2 Files Log Entries