The sample dataset *apache* contains the files *access.log* and *error.log*, which record the accesses to a web server and the errors, respectively.
The *access.log* file is in [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format).
The entries in *error.log* usually have a corresponding entry in *access.log*.

1.  Read the file *access.log*
1.  Count the number of accesses (number of lines) made by an IP number
1.  Count the number of successful accesses (status 200) made by an IP number
1.  Count the number of accesses for each directory served
1.  For each origin, count the number of successful accesses
1.  For each origin, count the number of unsuccessful accesses, split according to the
    status code
1.  From the results of the previous point, add a column with the error class (the first
    digit of the status code)
1.  Cluster the accesses in 5-minute time slices (e.g. from 14:00 to 14:05, from 14:05 to
    14:10, etc.). Count the number of accesses for each time slice
1.  Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

### Extra points

1.  For each `[info]` entry of *error.log*, find the next entry of *access.log*. For
    example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the
    entry at `[07/Mar/2004:18:02:10 -0800]`
1.  Count the number of times that the two accesses of the previous point have the same origin.


## Read the file *access.log*

In [2]:
import pandas as pd
import re

Since the file *access.log* has no header row with the names of the columns, we pass them via the `names` option. Moreover, we use a custom separator (`\s+`, i.e. one or more whitespace characters), because the fields are separated by spaces rather than commas.

In [17]:
access = pd.read_csv('https://github.com/gdv/foundationsCS-2018/raw/master/ex-data/apache/access.log',
                     sep=r'\s+',
                     names=['origin', 'time', 'tz', 'type', 'url', 'prot', 'status', 'size'])
access.head()

Unnamed: 0,origin,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET /twiki/bin/edit/Main/Double_bounce_sender?...,401,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev...,200,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET /mailman/listinfo/hsdivision HTTP/1.1,200,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1,200,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1,200,5253
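
In the output above the timestamp appears under `type` and the request line under `prot`, because each Common Log Format line also carries two `-` placeholder fields (client identity and user id) right after the host. A minimal sketch of a read that names every field follows. The column names `ident`, `user` and `request` are assumptions, and the three sample lines are reconstructed from the untruncated rows shown above.

```python
import io
import pandas as pd

# Three lines reconstructed from the rows displayed above (assumed raw format).
sample = '''64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253
'''

# One name per whitespace-separated field; the quoted request counts as one field.
cols = ['origin', 'ident', 'user', 'time', 'tz', 'request', 'status', 'size']
sample_access = pd.read_csv(io.StringIO(sample), sep=r'\s+', names=cols)

# Split the request into method, path and protocol.
sample_access[['type', 'url', 'prot']] = sample_access['request'].str.split(' ', expand=True)
print(sample_access[['origin', 'time', 'type', 'url', 'status']])
```

With this layout `url` holds the requested path, so a function that extracts the directory from it has something to work on.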


## Count the number of accesses (number of lines) made by an IP number

We use boolean indexing to keep only the rows of `access` where `origin` is an IP address. Although an IP address consists of 4 numbers in the interval `[0,255]` separated by dots, a simpler regex suffices.

In [18]:
iponly = access[access['origin'].str.contains(r"^\d+\.\d+\.\d+\.\d+$")]
iponly.head()

Unnamed: 0,origin,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET /twiki/bin/edit/Main/Double_bounce_sender?...,401,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev...,200,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET /mailman/listinfo/hsdivision HTTP/1.1,200,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1,200,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1,200,5253


For a tighter regex, we can require that each number has at most three digits.

In [19]:
iponly = access[access['origin'].str.contains(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")]
iponly.head()

Unnamed: 0,origin,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET /twiki/bin/edit/Main/Double_bounce_sender?...,401,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev...,200,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET /mailman/listinfo/hsdivision HTTP/1.1,200,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1,200,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1,200,5253
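
Neither regex checks that each number lies in `[0,255]`. If that matters, one option is the standard-library `ipaddress` module. The helper `is_ipv4` below is a hypothetical name introduced for this sketch, not part of the original solution.

```python
import ipaddress

def is_ipv4(origin):
    """Return True if origin is a syntactically valid IPv4 address."""
    try:
        ipaddress.IPv4Address(origin)
        return True
    except ValueError:
        # Raised for out-of-range numbers and for host names.
        return False

for candidate in ['64.242.88.10', '999.1.1.1', '1-320.cnc.bc.ca']:
    print(candidate, is_ipv4(candidate))
```

It could be used as a filter with `access[access['origin'].apply(is_ipv4)]`.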


Then we can group the rows with the same origin and count the size of each group.

In [20]:
iponly.groupby('origin').size()

origin
10.0.0.153         270
12.22.207.235        1
128.227.88.79       14
142.27.64.35         7
145.253.208.9        7
194.151.73.43        4
195.11.231.210       1
195.230.181.122      1
195.246.13.119      12
200.222.33.33        1
203.147.138.233     13
207.195.59.160      20
208.247.148.12       4
212.21.228.26        1
212.92.37.62        14
213.181.81.4         1
216.139.185.45       1
219.95.17.51         1
4.37.97.186          1
61.165.64.6          4
61.9.4.61            3
64.242.88.10       452
64.246.94.141        1
64.246.94.152        1
66.213.206.2         1
67.131.107.5         3
dtype: int64
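
The same counts can also be obtained with `value_counts`, which additionally sorts them in decreasing order. A small sketch on made-up rows (the dataframe `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical origins, only to compare the two ways of counting.
toy = pd.DataFrame({'origin': ['10.0.0.153', '64.242.88.10', '10.0.0.153', '61.9.4.61']})

by_group = toy.groupby('origin').size()    # indexed by origin, sorted by origin
by_value = toy['origin'].value_counts()    # indexed by origin, sorted by count

print(by_group)
print(by_value)
```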

## Count the number of successful accesses (status 200) made by an IP number

We only have to filter the rows with status equal to 200 and count them.

In [26]:
iponly[iponly['status'] == 200].count()

origin    627
time      627
tz        627
type      627
url       627
prot      627
status    627
size      627
dtype: int64

An alternative version uses the `len` function.

In [28]:
len(iponly[iponly['status'] == 200])

627

## Count the number of accesses for each directory served

First we add a column `dir` to each row. To this end, we build a function, called `extract_dir`, that computes the directory from a url.

In [29]:
def extract_dir(url):
    # Return the part of the url up to (and including) the last '/',
    # or None if the url contains no '/'.
    m = re.match(r'.*/', url)
    return m.group() if m else None

Since a regex can be a brittle solution, we have to check that it is actually correct. More precisely, we look for the rows where the regex is not found.

In [30]:
access[~ access['url'].str.contains(r".*/")]

Unnamed: 0,origin,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET /twiki/bin/edit/Main/Double_bounce_sender?...,401,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev...,200,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET /mailman/listinfo/hsdivision HTTP/1.1,200,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1,200,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1,200,5253
5,64.242.88.10,-,-,[07/Mar/2004:16:23:12,-0800],GET /twiki/bin/oops/TWiki/AppendixFileSystem?t...,200,11382
6,64.242.88.10,-,-,[07/Mar/2004:16:24:16,-0800],GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1,200,4924
7,64.242.88.10,-,-,[07/Mar/2004:16:29:16,-0800],GET /twiki/bin/edit/Main/Header_checks?topicpa...,401,12851
8,64.242.88.10,-,-,[07/Mar/2004:16:30:29,-0800],GET /twiki/bin/attach/Main/OfficeLocations HTT...,401,12851
9,64.242.88.10,-,-,[07/Mar/2004:16:31:48,-0800],GET /twiki/bin/view/TWiki/WebTopicEditTemplate...,200,3732
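
The filter above reports the rows where the pattern is absent. A complementary check, sketched here, calls the function directly on a few hand-written values (hypothetical inputs, not read from the log): a path, a file at the root, and a string without any slash such as `-0800]`.

```python
import re

def extract_dir(url):
    # Same logic as the function defined above: keep everything up to the last '/'.
    if re.search('/', url):
        return re.match(r'.*/', url).group()
    return None

# Hand-written inputs, chosen only to exercise the three cases.
for value in ['/twiki/bin/view/Main/PeterThoeny', '/index.html', '-0800]']:
    print(repr(value), '->', repr(extract_dir(value)))
```

A value without a slash yields `None`, which pandas displays as an empty cell.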


Then we can use `apply`

In [31]:
access['dir'] = access.apply(lambda row: extract_dir(row['url']), axis=1)
access.head()

Unnamed: 0,origin,time,tz,type,url,prot,status,size,dir
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET /twiki/bin/edit/Main/Double_bounce_sender?...,401,12846,
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev...,200,4523,
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET /mailman/listinfo/hsdivision HTTP/1.1,200,6291,
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1,200,7352,
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1,200,5253,


Since using the `axis` option of `apply` can be confusing, an alternative solution is to build a list corresponding to the new column.

In [32]:
access['dir2'] = [ extract_dir(url) for url in access['url'] ]
access.head()

Unnamed: 0,origin,time,tz,type,url,prot,status,size,dir,dir2
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET /twiki/bin/edit/Main/Double_bounce_sender?...,401,12846,,
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev...,200,4523,,
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET /mailman/listinfo/hsdivision HTTP/1.1,200,6291,,
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1,200,7352,,
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1,200,5253,,


## For each origin, count the number of successful accesses

In [33]:
access[access['status'] == 200].groupby('origin').size()

origin
0x503e4fce.virnxx2.adsl-dhcp.tele.dk            2
1-320.cnc.bc.ca                                 4
1-729.cnc.bc.ca                                 6
10.0.0.153                                    187
12.22.207.235                                   1
128.227.88.79                                  12
142.27.64.35                                    2
145.253.208.9                                   6
194.151.73.43                                   4
195.11.231.210                                  1
195.230.181.122                                 1
195.246.13.119                                 11
2-110.cnc.bc.ca                                 8
2-238.cnc.bc.ca                                 1
200-55-104-193.dsl.prima.net.ar                13
200.160.249.68.bmf.com.br                       2
200.222.33.33                                   1
203.147.138.233                                13
206-15-133-153.dialup.ziplink.net               1
206-15-133-154.dialup.ziplink.net          

## For each origin, count the number of unsuccessful accesses, split according to the status code

The `groupby` method accepts a list of column names, in which case there is a group for each combination of values.

In [35]:
access[access['status'] != 200].groupby(['origin', 'status']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,time,tz,type,url,prot,size,dir,dir2
origin,status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304,1,1,1,1,1,1,0,0
1-729.cnc.bc.ca,302,1,1,1,1,1,1,0,0
10.0.0.153,302,1,1,1,1,1,1,0,0
10.0.0.153,304,82,82,82,82,82,82,0,0
128.227.88.79,304,2,2,2,2,2,2,0,0
142.27.64.35,302,1,1,1,1,1,1,0,0
142.27.64.35,304,4,4,4,4,4,4,0,0
145.253.208.9,304,1,1,1,1,1,1,0,0
1513.cps.virtua.com.br,404,1,1,1,1,1,1,0,0
195.246.13.119,401,1,1,1,1,1,1,0,0
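
When the counts are easier to read as a table with one column per status code, the grouped sizes can be pivoted with `unstack`. A small sketch on made-up rows (the dataframe `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical unsuccessful accesses.
toy = pd.DataFrame({
    'origin': ['10.0.0.153', '10.0.0.153', '10.0.0.153', '142.27.64.35'],
    'status': [304, 304, 302, 302],
})

# One count per (origin, status) pair, as a Series with a two-level index.
counts = toy.groupby(['origin', 'status']).size()

# Move the status level to the columns; missing combinations become 0.
wide = counts.unstack(fill_value=0)
print(wide)
```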


## From the results of the previous point, add a column with the error class (the first digit of the status code)

In [37]:
grouped = access[access['status'] != 200].groupby(['origin', 'status']).count()
grouped.index

MultiIndex(levels=[['0x503e4fce.virnxx2.adsl-dhcp.tele.dk', '1-729.cnc.bc.ca', '10.0.0.153', '128.227.88.79', '142.27.64.35', '145.253.208.9', '1513.cps.virtua.com.br', '195.246.13.119', '2-110.cnc.bc.ca', '207.195.59.160', '61.9.4.61', '64.242.88.10', '68-174-110-154.nyc.rr.com', '92-moc-6.acn.waw.pl', 'cpe-203-51-137-224.vic.bigpond.net.au', 'cr020r01-3.sac.overture.com', 'h194n2fls308o1033.telia.com', 'h24-70-56-49.ca.shawcable.net', 'h24-71-236-129.ca.shawcable.net', 'jacksonproject.cnc.bc.ca', 'lj1160.inktomisearch.com', 'mail.geovariances.fr', 'market-mail.panduit.com', 'ogw.netinfo.bg', 'osdlab.eic.nctu.edu.tw', 'p213.54.168.132.tisdip.tiscali.de', 'prxint-sxb3.e-i.net', 'spot.nnacorp.com', 'ts05-ip44.hevanet.com'], [302, 304, 401, 404, 408]],
           labels=[[0, 1, 2, 2, 3, 4, 4, 5, 6, 7, 8, 9, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 18, 19, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], [1, 0, 0, 1, 1, 0, 1, 1, 3, 2, 1, 1, 2, 3, 2, 1, 1, 0, 2, 4, 3, 0, 1, 0, 1, 1, 1, 2, 1, 3, 2, 2

Since the `status` field is part of the index, we have to turn it back into a regular column, via `reset_index`.

In [39]:
table = grouped.reset_index()
table.head()

Unnamed: 0,origin,status,time,tz,type,url,prot,size,dir,dir2
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304,1,1,1,1,1,1,0,0
1,1-729.cnc.bc.ca,302,1,1,1,1,1,1,0,0
2,10.0.0.153,302,1,1,1,1,1,1,0,0
3,10.0.0.153,304,82,82,82,82,82,82,0,0
4,128.227.88.79,304,2,2,2,2,2,2,0,0


Now we can add the desired column. Integer division (`//`) by 100 keeps only the first digit of the status code.

In [40]:
table['class'] = table['status'] // 100
table

Unnamed: 0,origin,status,time,tz,type,url,prot,size,dir,dir2,class
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,304,1,1,1,1,1,1,0,0,3
1,1-729.cnc.bc.ca,302,1,1,1,1,1,1,0,0,3
2,10.0.0.153,302,1,1,1,1,1,1,0,0,3
3,10.0.0.153,304,82,82,82,82,82,82,0,0,3
4,128.227.88.79,304,2,2,2,2,2,2,0,0,3
5,142.27.64.35,302,1,1,1,1,1,1,0,0,3
6,142.27.64.35,304,4,4,4,4,4,4,0,0,3
7,145.253.208.9,304,1,1,1,1,1,1,0,0,3
8,1513.cps.virtua.com.br,404,1,1,1,1,1,1,0,0,4
9,195.246.13.119,401,1,1,1,1,1,1,0,0,4
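
The class digit can also be given a readable name. The sketch below uses the standard HTTP status classes; the helper `status_class` and the dictionary `STATUS_CLASSES` are hypothetical names introduced here.

```python
# Standard meaning of the first digit of an HTTP status code.
STATUS_CLASSES = {
    1: 'informational',
    2: 'successful',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def status_class(status):
    """Return (first digit, description) for a three-digit HTTP status code."""
    digit = int(status) // 100
    return digit, STATUS_CLASSES.get(digit, 'unknown')

for code in [302, 304, 401, 404, 408]:
    print(code, status_class(code))
```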


## For each `[info]` entry of *error.log*, find the next entry of *access.log*

*For example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the entry at `[07/Mar/2004:18:02:10 -0800]`*

Each error has a corresponding (i.e. same date, time, origin) entry in *access.log*.

First we have to read *error.log*, parse the dates and times correctly, and convert them to the same format used for *access.log*; otherwise we cannot use that field to merge the dataframes.

In [54]:
error = pd.read_csv("https://github.com/gdv/foundationsCS-2018/raw/master/ex-data/apache/error.log",
                   names = ["text"])
error.head()

Unnamed: 0,text
0,[Sun Mar 7 16:02:00 2004] [notice] Apache/1.3...
1,[Sun Mar 7 16:02:00 2004] [info] Server built...
2,[Sun Mar 7 16:02:00 2004] [notice] Accept mut...
3,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...
4,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...


Since the date/time is in a nonstandard format, we need to build a function for parsing it.

In [55]:
month_str_to_num = { 'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
                     'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

In [64]:
def parsing_error(line):
    # Groups: 1 month name, 2 day, 3-5 hour/min/sec, 6 year,
    # 8 optional level between brackets (e.g. 'info', 'notice', 'error').
    m = re.match(r"^\s*\[[A-Z][a-z][a-z]\s+([A-Z][a-z][a-z])\s+(\d+)\s+(\d\d):(\d\d):(\d\d)\s+(\d\d\d\d)\]\s*(\[([a-z]+)\])?", line)
    fields = [ m.group(i) for i in range(9) ]
    row = {}
    row['day'] = int(fields[2])
    row['month'] = month_str_to_num[fields[1]]
    row['year'] = int(fields[6])
    row['hour'] = int(fields[3])
    row['min'] = int(fields[4])
    row['sec'] = int(fields[5])
    row['type'] = fields[8]
    return row

Before using this function, we need some tests:

In [65]:
error.iloc[0]['text']

'[Sun Mar  7 16:02:00 2004] [notice] Apache/1.3.29 (Unix) configured -- resuming normal operations'

In [66]:
parsing_error(error.iloc[0]['text'])

{'day': 7,
 'month': 3,
 'year': 2004,
 'hour': 16,
 'min': 2,
 'sec': 0,
 'type': 'notice'}
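
As an alternative to the hand-written regex, the timestamp of an *error.log* line can be parsed with `datetime.strptime`. This is a sketch under two assumptions: the month and weekday names are in English (the `%a` and `%b` codes depend on the locale), and the helper name `error_timestamp` is hypothetical.

```python
import re
from datetime import datetime

def error_timestamp(line):
    """Return the datetime in the leading [...] block of a line, or None."""
    m = re.match(r'\s*\[([^\]]+)\]', line)
    if m is None:
        return None
    try:
        # Whitespace in the format matches any run of spaces, so "Mar  7" is accepted.
        return datetime.strptime(m.group(1), '%a %b %d %H:%M:%S %Y')
    except ValueError:
        return None

line = '[Sun Mar  7 16:02:00 2004] [notice] Apache/1.3.29 (Unix) configured -- resuming normal operations'
print(error_timestamp(line))
```

Returning `None` for lines that do not match makes it possible to spot malformed entries instead of stopping at the first one.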

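Then we apply the function to every line of `error`. We also add a `dtime` column, and we rename `text` to `row`, so that the columns match those built for *access.log* below.

In [74]:
import datetime

# Parse every line of error.log into a dictionary of fields.
rows = [ parsing_error(text) for text in error['text'] ]
for r in rows:
    r['dtime'] = datetime.datetime(r['year'], r['month'], r['day'],
                                   r['hour'], r['min'], r['sec'])

# One column per field, next to the original line (renamed to `row`).
error = error.rename(columns={'text': 'row'}).join(pd.DataFrame(rows))
error.head()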

Now we parse the date/time of `access.log`

In [None]:
import datetime

def parse_datime(string):
    # Groups: 1 day, 2 month name, 3 year, 4 hour, 5 minutes, 6 seconds.
    m = re.search(r'\[(\d\d)/(...)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)', string)
    return {'day'  : m[1],
            'month': m[2],
            'year' : m[3],
            'hour' : m[4],
            'mins' : m[5],
            'secs' : m[6],
            'dtime': datetime.datetime(int(m[3]), month_str_to_num[m[2]], int(m[1]),
                                       int(m[4]), int(m[5]), int(m[6])),
            'row'  : string
           }

In [None]:
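# With the column names passed to read_csv, the timestamp (e.g. `[07/Mar/2004:16:05:49`)
# is stored in the `type` column, as the output of access.head() shows.
new = pd.DataFrame([ parse_datime(row) for row in access['type'] ])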
new.head()
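
If the UTC offset matters, the full bracketed timestamp can be parsed in one step with `datetime.strptime` and the `%z` code. This sketch assumes the time and the offset are joined by a single space, as in the example `[07/Mar/2004:18:02:10 -0800]`, and that month names are in English. The helper name `access_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta

def access_timestamp(stamp):
    """Parse a bracketed Common Log Format timestamp, keeping its UTC offset."""
    return datetime.strptime(stamp.strip('[]'), '%d/%b/%Y:%H:%M:%S %z')

ts = access_timestamp('[07/Mar/2004:18:02:10 -0800]')
print(ts.isoformat())

# Drop the offset to compare with the timestamps of error.log, which carry none.
naive = ts.replace(tzinfo=None)
print(naive)
```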

In [None]:
access_full = access.join(new)
access_full.head()

We add a field `next` which is the index of the next row.

In [None]:
access_full['next'] = list(range(1, len(access_full) + 1))
access_full.head()

Now we can merge the two dataframes `access_full` and `error`, keeping all entries of `access_full`.

In [22]:
merged = pd.merge(access_full, error, on=['dtime'], how='left')
merged


Check whether the rows of `error` appear in `merged`:

In [None]:
merged[merged['row_y'].notnull()]
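
How the merge treats columns that share a name can be seen on a toy example (hypothetical data): with `how='left'` every row of the left dataframe is kept, rows without a match get `NaN`, and the overlapping column `type` is split into `type_x` (left) and `type_y` (right).

```python
import pandas as pd

# Two accesses, only the first of which has an error entry at the same time.
left = pd.DataFrame({
    'dtime': pd.to_datetime(['2004-03-07 16:05:49', '2004-03-07 16:06:51']),
    'type': ['GET', 'GET'],
})
right = pd.DataFrame({
    'dtime': pd.to_datetime(['2004-03-07 16:05:49']),
    'type': ['info'],
})

toy_merged = pd.merge(left, right, on=['dtime'], how='left')
print(toy_merged)

# Rows of the left dataframe that found a partner in the right one.
print(toy_merged[toy_merged['type_y'].notnull()])
```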

In [23]:
found = merged[merged['type_y'] == 'info']
found


Finally, use the `next` field to merge `found` and `access_full`

In [None]:
paired = pd.merge(found, access_full, left_on='next', right_index = True)
paired

In [None]:
paired.columns

## Count the number of times that the two accesses of the previous point have the same origin.

In [None]:
len(paired[paired['origin_x'] == paired['origin_y']])

## Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

In [None]:
# Boolean mask of the rows of `merged` matched with an [info] or [error] entry.
info_errors = merged['type_y'].isin(['info', 'error'])
info_errors

In [None]:
import numpy as np
# Positions (row numbers) where the mask is True.
lista = list(np.where(info_errors)[0])
lista

In [None]:
# Pair each position with the following one; `lista[:-1]` drops only the last element.
pairs = zip(lista[:-1], lista[1:])
list(pairs)

In [None]:
[ b-a for (a,b) in zip(lista[:-1], lista[1:])]
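
A small worked example of the consecutive-difference step, on a hypothetical list of positions, shows that a list of n positions yields n-1 differences. The helper name `gaps` is introduced only for this sketch.

```python
def gaps(positions):
    """Differences between each position and the following one."""
    return [b - a for a, b in zip(positions[:-1], positions[1:])]

# Hypothetical row positions of the [info]/[error] matches.
positions = [3, 10, 12, 20]
print(gaps(positions))  # [7, 2, 8]
```

Each difference counts the rows from one matching entry up to the next, including the second endpoint; subtract 1 to count only the rows strictly in between.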