The sample dataset *apache* contains the files *access.log* and *error.log*, which record the accesses to a web server and the errors it reported, respectively.
The *access.log* is in [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format).
The entries in *error.log* usually have a corresponding entry in *access.log*.

1.  Read the file *access.log*
1.  Count the number of accesses (number of lines) made by an IP number
1.  Count the number of successful accesses (status 200) made by an IP number
1.  Count the number of accesses for each directory served
1.  For each origin, count the number of successful accesses
1.  For each origin, count the number of unsuccessful accesses, split according to the
    status code
1.  From the results of the previous point, add a column with the error class (the first
    digit of the status code)
1.  Cluster the accesses in 5-minutes time slices (e.g. from 14:00 to 14:05, from 14:05 to
    14:10, etc). Count the number of accesses for each time slice
1.  For each `[info]` entry of *error.log*, find the next entry of *access.log*. For
    example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the
    entry at `[07/Mar/2004:18:02:10 -0800]`
1.  Count the number of times that the two accesses of the previous point have the same origin.
1.  Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*
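
The 5-minute time slices mentioned in the list above can be sketched as follows. This is only a minimal illustration on hypothetical timestamps (the `times` series is an assumption, not data read from *access.log*); it rounds each timestamp down to the start of its slice with `dt.floor` and then counts the rows per slice.

```python
import pandas as pd

# Hypothetical timestamps standing in for the parsed access times.
times = pd.Series(pd.to_datetime([
    '2004-03-07 14:00:30',
    '2004-03-07 14:04:59',
    '2004-03-07 14:05:00',
    '2004-03-07 14:12:10',
]))

# Round each timestamp down to the start of its 5-minute slice.
slices = times.dt.floor('5min')

# Count the accesses that fall in each slice, in chronological order.
per_slice = slices.value_counts().sort_index()
print(per_slice)
```

Slices with no accesses do not appear in the result; whether they should be listed with a count of zero depends on how the result is used.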


## Read the file *access.log*

In [1]:
import pandas as pd
import re

Since the first row of the file *access.log* does not contain the names of the columns, we use the `names` option. Moreover, we use a custom separator (a regular expression matching runs of whitespace and double quotes), otherwise the fields `type`, `url`, and `prot` would be combined together.

In [2]:
access = pd.read_csv('database/apache/access.log', sep=r'[\s"]+', engine='python',
                     names=['origin', 'dummy1', 'dummy2', 'time', 'tz', 'type', 'url', 'prot', 'status',
                            'size'])
access.head()


Unnamed: 0,origin,dummy1,dummy2,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253
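
To see what the regular-expression separator does, here is a minimal sketch that parses two log lines from an in-memory buffer instead of the file. The two lines are rebuilt from the rows shown above; the names `sample`, `columns`, and `toy` are hypothetical.

```python
import io
import pandas as pd

# Two lines in Common Log Format, rebuilt from the rows displayed above.
sample = io.StringIO(
    '64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] '
    '"GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291\n'
    '64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] '
    '"GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352\n'
)

columns = ['origin', 'dummy1', 'dummy2', 'time', 'tz',
           'type', 'url', 'prot', 'status', 'size']

# A run of whitespace and/or double quotes separates two fields, so the
# quoted request is split into method, URL, and protocol.
toy = pd.read_csv(sample, sep=r'[\s"]+', engine='python', names=columns)
print(toy[['type', 'url', 'prot', 'status']])
```

A regular-expression separator is only supported by the Python parsing engine, which is why `engine='python'` is passed explicitly.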


## Count the number of accesses (number of lines) made by an IP number

We use boolean indexing to keep only the rows of `access` where `origin` is an IP address. Although a valid IPv4 address consists of 4 numbers in the interval `[0,255]` separated by dots, a simpler regex that only checks for four dot-separated groups of digits suffices here.

In [3]:
iponly = access[access['origin'].str.contains(r"^\d+\.\d+\.\d+\.\d+$")]
iponly.head()

Unnamed: 0,origin,dummy1,dummy2,time,tz,type,url,prot,status,size
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253
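
If the range of each number matters, a stricter check is possible. The sketch below compares the simple regex with the standard-library `ipaddress` module on a few hypothetical origins (`999.1.1.1` is an invented value that the regex accepts although it is not a valid address); `is_ipv4` is a helper introduced only for this illustration.

```python
import ipaddress
import pandas as pd

def is_ipv4(text):
    """Return True only if text is a valid dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

# Hypothetical origins: two addresses, one host name, one out-of-range value.
origins = pd.Series(['64.242.88.10', '1-320.cnc.bc.ca', '999.1.1.1', '10.0.0.153'])

loose = origins.str.contains(r'^\d+\.\d+\.\d+\.\d+$')   # shape only
strict = origins.map(is_ipv4)                           # shape and range

print(pd.DataFrame({'origin': origins, 'loose': loose, 'strict': strict}))
```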


Then we can group the rows with the same origin and count the size of each group.

In [4]:
iponly.groupby('origin').size()

origin
10.0.0.153         270
12.22.207.235        1
128.227.88.79       14
142.27.64.35         7
145.253.208.9        7
194.151.73.43        4
195.11.231.210       1
195.230.181.122      1
195.246.13.119      12
200.222.33.33        1
203.147.138.233     13
207.195.59.160      20
208.247.148.12       4
212.21.228.26        1
212.92.37.62        14
213.181.81.4         1
216.139.185.45       1
219.95.17.51         1
4.37.97.186          1
61.165.64.6          4
61.9.4.61            3
64.242.88.10       452
64.246.94.141        1
64.246.94.152        1
66.213.206.2         1
67.131.107.5         3
dtype: int64
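
As a side note, counting rows per value of a single column can also be written with `value_counts`. The sketch below uses a small hypothetical frame (`toy`) to show that both forms give the same counts; only the ordering differs, since `groupby` sorts by key and `value_counts` sorts by count.

```python
import pandas as pd

# Hypothetical origins, reusing two addresses that appear in the log.
toy = pd.DataFrame({'origin': ['64.242.88.10', '10.0.0.153', '64.242.88.10',
                               '64.242.88.10', '10.0.0.153']})

by_group = toy.groupby('origin').size()    # sorted by origin
by_count = toy['origin'].value_counts()    # sorted by decreasing count

print(by_group)
print(by_count)
```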

## Count the number of successful accesses (status 200) made by an IP number

We only have to filter the rows with status equal to 200 before grouping.

In [5]:
iponly[iponly['status'] == 200].groupby('origin').size()

origin
10.0.0.153         187
12.22.207.235        1
128.227.88.79       12
142.27.64.35         2
145.253.208.9        6
194.151.73.43        4
195.11.231.210       1
195.230.181.122      1
195.246.13.119      11
200.222.33.33        1
203.147.138.233     13
207.195.59.160      14
208.247.148.12       4
212.21.228.26        1
212.92.37.62        14
213.181.81.4         1
216.139.185.45       1
219.95.17.51         1
4.37.97.186          1
61.165.64.6          4
61.9.4.61            1
64.242.88.10       340
64.246.94.141        1
64.246.94.152        1
66.213.206.2         1
67.131.107.5         3
dtype: int64

## Count the number of accesses for each directory served

First we add a column `dir` to each row. To do so, we build a function, called `extract_dir`, that computes the directory from a URL.

In [6]:
def extract_dir(url):
    # Keep everything up to and including the last slash; None if there is no slash.
    if re.search('/', url):
        return re.match(r'.*/', url).group()
    else:
        return None

Then we can use `apply` with `axis=1`, which passes each row to the function.

In [7]:
access['dir'] = access.apply(lambda row: extract_dir(row['url']), axis=1)
access.head()

Unnamed: 0,origin,dummy1,dummy2,time,tz,type,url,prot,status,size,dir
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,/twiki/bin/edit/Main/
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,/twiki/bin/rdiff/TWiki/
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,/mailman/listinfo/
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,/twiki/bin/view/TWiki/
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,/twiki/bin/view/Main/


Since using the `axis` option of `apply` can be confusing, an alternative solution is to build a list corresponding to the new column.

In [8]:
access['dir2'] = [ extract_dir(url) for url in access['url'] ]
access.head()

Unnamed: 0,origin,dummy1,dummy2,time,tz,type,url,prot,status,size,dir,dir2
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,/twiki/bin/rdiff/TWiki/,/twiki/bin/rdiff/TWiki/
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,/mailman/listinfo/,/mailman/listinfo/
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,/twiki/bin/view/Main/,/twiki/bin/view/Main/
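
Once the directory column exists, the count itself is a `groupby` on that column. The following is a minimal sketch on a hypothetical frame (`toy`), which also shows a vectorized way to extract the directory with `str.extract`; `robots.txt` is an invented URL without a slash, used to show that such rows get a missing directory and are left out of the count.

```python
import pandas as pd

# Hypothetical URLs; the first three appear in the log, the last is invented.
toy = pd.DataFrame({'url': ['/twiki/bin/view/Main/DCCAndPostFix',
                            '/twiki/bin/view/Main/PeterThoeny',
                            '/mailman/listinfo/hsdivision',
                            'robots.txt']})

# Everything up to and including the last slash; NaN when there is no slash.
toy['dir'] = toy['url'].str.extract(r'^(.*/)', expand=False)

# Number of accesses per directory (rows with a missing directory are dropped).
per_dir = toy.groupby('dir').size()
print(per_dir)
```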


## For each origin, count the number of successful accesses

In [9]:
access[access['status'] == 200].groupby('origin').size()

origin
0x503e4fce.virnxx2.adsl-dhcp.tele.dk            2
1-320.cnc.bc.ca                                 4
1-729.cnc.bc.ca                                 6
10.0.0.153                                    187
12.22.207.235                                   1
128.227.88.79                                  12
142.27.64.35                                    2
145.253.208.9                                   6
194.151.73.43                                   4
195.11.231.210                                  1
195.230.181.122                                 1
195.246.13.119                                 11
2-110.cnc.bc.ca                                 8
2-238.cnc.bc.ca                                 1
200-55-104-193.dsl.prima.net.ar                13
200.160.249.68.bmf.com.br                       2
200.222.33.33                                   1
203.147.138.233                                13
206-15-133-153.dialup.ziplink.net               1
206-15-133-154.dialup.ziplink.net          

## For each origin, count the number of unsuccessful accesses, split according to the status code

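`groupby` can receive a list of column names, so we can group by origin and status at once. Note that the cells below still filter on status 200, as in the previous point, so every row shown has status 200; to count unsuccessful accesses the condition should be `access['status'] != 200`.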

In [10]:
access[access['status'] == 200].groupby(['origin', 'status']).size()

origin                                      status
0x503e4fce.virnxx2.adsl-dhcp.tele.dk        200.0       2
1-320.cnc.bc.ca                             200.0       4
1-729.cnc.bc.ca                             200.0       6
10.0.0.153                                  200.0     187
12.22.207.235                               200.0       1
128.227.88.79                               200.0      12
142.27.64.35                                200.0       2
145.253.208.9                               200.0       6
194.151.73.43                               200.0       4
195.11.231.210                              200.0       1
195.230.181.122                             200.0       1
195.246.13.119                              200.0      11
2-110.cnc.bc.ca                             200.0       8
2-238.cnc.bc.ca                             200.0       1
200-55-104-193.dsl.prima.net.ar             200.0      13
200.160.249.68.bmf.com.br                   200.0       2
200.222.33.33        
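
For the unsuccessful accesses, the same grouping applies once the filter is inverted. Here is a minimal sketch on a hypothetical frame (`toy`, with invented status codes), not the output of the real log:

```python
import pandas as pd

# Hypothetical accesses: origins taken from the log, statuses invented.
toy = pd.DataFrame({
    'origin': ['64.242.88.10', '64.242.88.10', '64.242.88.10',
               '10.0.0.153', '10.0.0.153'],
    'status': [200, 404, 404, 401, 200],
})

# Keep only the rows whose status is not 200, then count per (origin, status).
failed = toy[toy['status'] != 200].groupby(['origin', 'status']).size()
print(failed)
```

Whether every status other than 200 should count as unsuccessful (for example redirects in the 3xx range) is a modelling choice; the sketch simply follows the wording of the exercise.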

## From the results of the previous point, add a column with the error class (the first digit of the status code)

In [11]:
grouped = access[access['status'] == 200].groupby(['origin', 'status']).size()
grouped.index

MultiIndex(levels=[['0x503e4fce.virnxx2.adsl-dhcp.tele.dk', '1-320.cnc.bc.ca', '1-729.cnc.bc.ca', '10.0.0.153', '12.22.207.235', '128.227.88.79', '142.27.64.35', '145.253.208.9', '194.151.73.43', '195.11.231.210', '195.230.181.122', '195.246.13.119', '2-110.cnc.bc.ca', '2-238.cnc.bc.ca', '200-55-104-193.dsl.prima.net.ar', '200.160.249.68.bmf.com.br', '200.222.33.33', '203.147.138.233', '206-15-133-153.dialup.ziplink.net', '206-15-133-154.dialup.ziplink.net', '206-15-133-181.dialup.ziplink.net', '207.195.59.160', '208-186-146-13.nrp3.brv.mn.frontiernet.net', '208-38-57-205.ip.cal.radiant.net', '208.247.148.12', '212.21.228.26', '212.92.37.62', '213.181.81.4', '216-160-111-121.tukw.qwest.net', '216.139.185.45', '219.95.17.51', '3_343_lt_someone', '4.37.97.186', '61.165.64.6', '61.9.4.61', '64-249-27-114.client.dsl.net', '64-93-34-186.client.dsl.net', '64.242.88.10', '64.246.94.141', '64.246.94.152', '65-37-13-251.nrp2.roc.ny.frontiernet.net', '66-194-6-70.gen.twtelecom.net', '66-194-6-71

Since the `status` field is part of the index, we have to turn it into a regular column, via `reset_index`.

In [12]:
table = grouped.reset_index(name='number')
table.head()

Unnamed: 0,origin,status,number
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,200.0,2
1,1-320.cnc.bc.ca,200.0,4
2,1-729.cnc.bc.ca,200.0,6
3,10.0.0.153,200.0,187
4,12.22.207.235,200.0,1


In [13]:
table['class'] = table['status'] // 100
table

Unnamed: 0,origin,status,number,class
0,0x503e4fce.virnxx2.adsl-dhcp.tele.dk,200.0,2,2.0
1,1-320.cnc.bc.ca,200.0,4,2.0
2,1-729.cnc.bc.ca,200.0,6,2.0
3,10.0.0.153,200.0,187,2.0
4,12.22.207.235,200.0,1,2.0
5,128.227.88.79,200.0,12,2.0
6,142.27.64.35,200.0,2,2.0
7,145.253.208.9,200.0,6,2.0
8,194.151.73.43,200.0,4,2.0
9,195.11.231.210,200.0,1,2.0


## For each `[info]` entry of *error.log*, find the next entry of *access.log*
### For example, when considering the entry at `Sun Mar  7 18:00:09 2004`, we want to find the entry at `[07/Mar/2004:18:02:10 -0800]`

First we have to read *error.log*, parse the dates and times correctly, and convert them to the same format used for *access.log*, otherwise we cannot use that field to merge the dataframes.

In [14]:
month_str_to_num = { 'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
                     'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

In [36]:
import datetime
rows = []

with open('database/apache/error.log') as error_file:
    for row in error_file:
        row_match = re.search(r'\[(...) (...)\s+(\d+)\s+(\d+):(\d+):(\d+)\s+(\d+)\]\s+\[([^\]]+)\]\s+(.*)', row)
        if row_match:
            rows.append({'month': row_match[2], 
                         'day': row_match[3], 
                         'hour': row_match[4], 
                         'mins': row_match[5], 
                         'secs': row_match[6], 
                         'year': row_match[7], 
                         'type': row_match[8], 
                         'text': row_match[9],
                         'dtime': datetime.datetime(int(row_match[7]), month_str_to_num[row_match[2]], int(row_match[3]),
                                                   int(row_match[4]), int(row_match[5]), int(row_match[6])),
                         'row' : row
                         })
error = pd.DataFrame(rows)
error.head()

Unnamed: 0,day,dtime,hour,mins,month,row,secs,text,type,year
0,7,2004-03-07 16:02:00,16,2,Mar,[Sun Mar 7 16:02:00 2004] [notice] Apache/1.3...,0,Apache/1.3.29 (Unix) configured -- resuming no...,notice,2004
1,7,2004-03-07 16:02:00,16,2,Mar,[Sun Mar 7 16:02:00 2004] [info] Server built...,0,Server built: Feb 27 2004 13:56:37,info,2004
2,7,2004-03-07 16:02:00,16,2,Mar,[Sun Mar 7 16:02:00 2004] [notice] Accept mut...,0,Accept mutex: sysvsem (Default: sysvsem),notice,2004
3,7,2004-03-07 16:05:49,16,5,Mar,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,49,[client 64.242.88.10] (104)Connection reset by...,info,2004
4,7,2004-03-07 16:45:56,16,45,Mar,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,56,[client 64.242.88.10] (104)Connection reset by...,info,2004


Now we parse the date/time of `access.log`

In [39]:
def parse_datime(string):
    # Split a timestamp such as "[07/Mar/2004:16:05:49" into its components.
    m = re.search(r'\[(\d\d)/(...)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)', string)
    return {'day'  : m[1],
            'month': m[2],
            'year' : m[3],
            'hour' : m[4],
            'mins' : m[5],
            'secs' : m[6],
            'dtime': datetime.datetime(int(m[3]), month_str_to_num[m[2]], int(m[1]),
                                       int(m[4]), int(m[5]), int(m[6])),
            'row'  : string
           }

In [40]:
new = pd.DataFrame([ parse_datime(row) for row in access['time'] ])
new.head()

Unnamed: 0,day,dtime,hour,mins,month,row,secs,year
0,7,2004-03-07 16:05:49,16,5,Mar,[07/Mar/2004:16:05:49,49,2004
1,7,2004-03-07 16:06:51,16,6,Mar,[07/Mar/2004:16:06:51,51,2004
2,7,2004-03-07 16:10:02,16,10,Mar,[07/Mar/2004:16:10:02,2,2004
3,7,2004-03-07 16:11:58,16,11,Mar,[07/Mar/2004:16:11:58,58,2004
4,7,2004-03-07 16:20:55,16,20,Mar,[07/Mar/2004:16:20:55,55,2004
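
Both timestamp layouts can also be parsed with `datetime.strptime` and a format string, without a hand-written month table. This is a sketch of that alternative, not the approach used in the cells above; it assumes an English (C) locale, since `%a` and `%b` match locale-dependent day and month names.

```python
from datetime import datetime

# Timestamps in the two layouts, taken from the example in the task list.
error_stamp = 'Sun Mar  7 18:00:09 2004'    # error.log layout
access_stamp = '[07/Mar/2004:18:02:10'      # access.log layout (first half)

# Whitespace in the format matches one or more spaces, so the double
# space before the single-digit day is accepted.
error_time = datetime.strptime(error_stamp, '%a %b %d %H:%M:%S %Y')
access_time = datetime.strptime(access_stamp, '[%d/%b/%Y:%H:%M:%S')

print(error_time, access_time, access_time - error_time)
```

Both values are naive datetimes: the `-0800` offset of *access.log* is ignored here, as it is in the cells above.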


In [41]:
access_full = access.join(new)
access_full.head()

Unnamed: 0,origin,dummy1,dummy2,time,tz,type,url,prot,status,size,dir,dir2,day,dtime,hour,mins,month,row,secs,year
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,/twiki/bin/edit/Main/,/twiki/bin/edit/Main/,7,2004-03-07 16:05:49,16,5,Mar,[07/Mar/2004:16:05:49,49,2004
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,/twiki/bin/rdiff/TWiki/,/twiki/bin/rdiff/TWiki/,7,2004-03-07 16:06:51,16,6,Mar,[07/Mar/2004:16:06:51,51,2004
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,/mailman/listinfo/,/mailman/listinfo/,7,2004-03-07 16:10:02,16,10,Mar,[07/Mar/2004:16:10:02,2,2004
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,/twiki/bin/view/TWiki/,/twiki/bin/view/TWiki/,7,2004-03-07 16:11:58,16,11,Mar,[07/Mar/2004:16:11:58,58,2004
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,/twiki/bin/view/Main/,/twiki/bin/view/Main/,7,2004-03-07 16:20:55,16,20,Mar,[07/Mar/2004:16:20:55,55,2004


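We add a field `next`, which holds the index of the following row. For the last row this index points past the end of the dataframe, so that row will never find a match.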

In [42]:
access_full['next'] = list(range(1, len(access_full) + 1))
access_full.head()

Unnamed: 0,origin,dummy1,dummy2,time,tz,type,url,prot,status,size,...,dir2,day,dtime,hour,mins,month,row,secs,year,next
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,...,/twiki/bin/edit/Main/,7,2004-03-07 16:05:49,16,5,Mar,[07/Mar/2004:16:05:49,49,2004,1
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,...,/twiki/bin/rdiff/TWiki/,7,2004-03-07 16:06:51,16,6,Mar,[07/Mar/2004:16:06:51,51,2004,2
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,...,/mailman/listinfo/,7,2004-03-07 16:10:02,16,10,Mar,[07/Mar/2004:16:10:02,2,2004,3
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,...,/twiki/bin/view/TWiki/,7,2004-03-07 16:11:58,16,11,Mar,[07/Mar/2004:16:11:58,58,2004,4
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,...,/twiki/bin/view/Main/,7,2004-03-07 16:20:55,16,20,Mar,[07/Mar/2004:16:20:55,55,2004,5


Now we can merge the two dataframes `access_full` and `error`, keeping all entries of `access_full`.

In [45]:
merged = pd.merge(access_full, error, on=['dtime'], how='left')
merged

Unnamed: 0,origin,dummy1,dummy2,time,tz,type_x,url,prot,status,size,...,next,day_y,hour_y,mins_y,month_y,row_y,secs_y,text,type_y,year_y
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,...,1,7,16,05,Mar,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,49,[client 64.242.88.10] (104)Connection reset by...,info,2004
1,64.242.88.10,-,-,[07/Mar/2004:16:06:51,-0800],GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1....,HTTP/1.1,200.0,4523,...,2,,,,,,,,,
2,64.242.88.10,-,-,[07/Mar/2004:16:10:02,-0800],GET,/mailman/listinfo/hsdivision,HTTP/1.1,200.0,6291,...,3,,,,,,,,,
3,64.242.88.10,-,-,[07/Mar/2004:16:11:58,-0800],GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200.0,7352,...,4,,,,,,,,,
4,64.242.88.10,-,-,[07/Mar/2004:16:20:55,-0800],GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200.0,5253,...,5,,,,,,,,,
5,64.242.88.10,-,-,[07/Mar/2004:16:23:12,-0800],GET,/twiki/bin/oops/TWiki/AppendixFileSystem?templ...,HTTP/1.1,200.0,11382,...,6,,,,,,,,,
6,64.242.88.10,-,-,[07/Mar/2004:16:24:16,-0800],GET,/twiki/bin/view/Main/PeterThoeny,HTTP/1.1,200.0,4924,...,7,,,,,,,,,
7,64.242.88.10,-,-,[07/Mar/2004:16:29:16,-0800],GET,/twiki/bin/edit/Main/Header_checks?topicparent...,HTTP/1.1,401.0,12851,...,8,,,,,,,,,
8,64.242.88.10,-,-,[07/Mar/2004:16:30:29,-0800],GET,/twiki/bin/attach/Main/OfficeLocations,HTTP/1.1,401.0,12851,...,9,,,,,,,,,
9,64.242.88.10,-,-,[07/Mar/2004:16:31:48,-0800],GET,/twiki/bin/view/TWiki/WebTopicEditTemplate,HTTP/1.1,200.0,3732,...,10,,,,,,,,,
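
The effect of `how='left'` and of the `_x`/`_y` suffixes can be seen on a toy example. The two frames below are hypothetical stand-ins for `access_full` and `error`, reduced to the merge key and one clashing column.

```python
import pandas as pd

# Hypothetical stand-in for access_full: three accesses.
access_toy = pd.DataFrame({
    'dtime': pd.to_datetime(['2004-03-07 16:05:49',
                             '2004-03-07 16:06:51',
                             '2004-03-07 16:10:02']),
    'type': ['GET', 'GET', 'GET'],
})

# Hypothetical stand-in for error: one entry sharing a timestamp.
error_toy = pd.DataFrame({
    'dtime': pd.to_datetime(['2004-03-07 16:05:49']),
    'type': ['info'],
})

# A left merge keeps every access; columns present in both frames get
# the suffixes _x (left) and _y (right), and unmatched rows get NaN.
merged_toy = pd.merge(access_toy, error_toy, on='dtime', how='left')
print(merged_toy)
```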


Check if the rows of `error` are in `merged`

In [47]:
merged[merged['row_y'].notnull()]

Unnamed: 0,origin,dummy1,dummy2,time,tz,type_x,url,prot,status,size,...,next,day_y,hour_y,mins_y,month_y,row_y,secs_y,text,type_y,year_y
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,...,1,7,16,05,Mar,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,49,[client 64.242.88.10] (104)Connection reset by...,info,2004
17,64.242.88.10,-,-,[07/Mar/2004:16:45:56,-0800],GET,/twiki/bin/attach/Main/PostfixCommands,HTTP/1.1,401.0,12846,...,18,7,16,45,Mar,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,56,[client 64.242.88.10] (104)Connection reset by...,info,2004
30,64.242.88.10,-,-,[07/Mar/2004:17:13:50,-0800],GET,/twiki/bin/edit/TWiki/DefaultPlugin?t=1078688936,HTTP/1.1,401.0,12846,...,31,7,17,13,Mar,[Sun Mar 7 17:13:50 2004] [info] [client 64.2...,50,[client 64.242.88.10] (104)Connection reset by...,info,2004
35,64.242.88.10,-,-,[07/Mar/2004:17:21:44,-0800],GET,/twiki/bin/attach/TWiki/TablePlugin,HTTP/1.1,401.0,12846,...,36,7,17,21,Mar,[Sun Mar 7 17:21:44 2004] [info] [client 64.2...,44,[client 64.242.88.10] (104)Connection reset by...,info,2004
39,64.242.88.10,-,-,[07/Mar/2004:17:27:37,-0800],GET,/twiki/bin/edit/Main/WebSearch?t=1078669682,HTTP/1.1,401.0,12846,...,40,7,17,27,Mar,[Sun Mar 7 17:27:37 2004] [info] [client 64.2...,37,[client 64.242.88.10] (104)Connection reset by...,info,2004
42,64.242.88.10,-,-,[07/Mar/2004:17:31:39,-0800],GET,/twiki/bin/edit/Main/UvscanAndPostFix?topicpar...,HTTP/1.1,401.0,12846,...,43,7,17,31,Mar,[Sun Mar 7 17:31:39 2004] [info] [client 64.2...,39,[client 64.242.88.10] (104)Connection reset by...,info,2004
51,64.242.88.10,-,-,[07/Mar/2004:17:58:00,-0800],GET,/twiki/bin/edit/Main/KevinWGagel?t=1078670331,HTTP/1.1,401.0,12846,...,52,7,17,58,Mar,[Sun Mar 7 17:58:00 2004] [info] [client 64.2...,00,[client 64.242.88.10] (104)Connection reset by...,info,2004
52,64.242.88.10,-,-,[07/Mar/2004:18:00:09,-0800],GET,/twiki/bin/edit/Main/Virtual_mailbox_lock?topi...,HTTP/1.1,401.0,12846,...,53,7,18,00,Mar,[Sun Mar 7 18:00:09 2004] [info] [client 64.2...,09,[client 64.242.88.10] (104)Connection reset by...,info,2004
57,64.242.88.10,-,-,[07/Mar/2004:18:10:09,-0800],GET,/twiki/bin/edit/TWiki/TWikiVariables?t=1078684115,HTTP/1.1,401.0,12846,...,58,7,18,10,Mar,[Sun Mar 7 18:10:09 2004] [info] [client 64.2...,09,[client 64.242.88.10] (104)Connection reset by...,info,2004
61,64.242.88.10,-,-,[07/Mar/2004:18:19:01,-0800],GET,/twiki/bin/edit/Main/TWikiPreferences?topicpar...,HTTP/1.1,401.0,12846,...,62,7,18,19,Mar,[Sun Mar 7 18:19:01 2004] [info] [client 64.2...,01,[client 64.242.88.10] (104)Connection reset by...,info,2004


In [48]:
found = merged[merged['type_y'] == 'info']
found

Unnamed: 0,origin,dummy1,dummy2,time,tz,type_x,url,prot,status,size,...,next,day_y,hour_y,mins_y,month_y,row_y,secs_y,text,type_y,year_y
0,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,12846,...,1,7,16,05,Mar,[Sun Mar 7 16:05:49 2004] [info] [client 64.2...,49,[client 64.242.88.10] (104)Connection reset by...,info,2004
17,64.242.88.10,-,-,[07/Mar/2004:16:45:56,-0800],GET,/twiki/bin/attach/Main/PostfixCommands,HTTP/1.1,401.0,12846,...,18,7,16,45,Mar,[Sun Mar 7 16:45:56 2004] [info] [client 64.2...,56,[client 64.242.88.10] (104)Connection reset by...,info,2004
30,64.242.88.10,-,-,[07/Mar/2004:17:13:50,-0800],GET,/twiki/bin/edit/TWiki/DefaultPlugin?t=1078688936,HTTP/1.1,401.0,12846,...,31,7,17,13,Mar,[Sun Mar 7 17:13:50 2004] [info] [client 64.2...,50,[client 64.242.88.10] (104)Connection reset by...,info,2004
35,64.242.88.10,-,-,[07/Mar/2004:17:21:44,-0800],GET,/twiki/bin/attach/TWiki/TablePlugin,HTTP/1.1,401.0,12846,...,36,7,17,21,Mar,[Sun Mar 7 17:21:44 2004] [info] [client 64.2...,44,[client 64.242.88.10] (104)Connection reset by...,info,2004
39,64.242.88.10,-,-,[07/Mar/2004:17:27:37,-0800],GET,/twiki/bin/edit/Main/WebSearch?t=1078669682,HTTP/1.1,401.0,12846,...,40,7,17,27,Mar,[Sun Mar 7 17:27:37 2004] [info] [client 64.2...,37,[client 64.242.88.10] (104)Connection reset by...,info,2004
42,64.242.88.10,-,-,[07/Mar/2004:17:31:39,-0800],GET,/twiki/bin/edit/Main/UvscanAndPostFix?topicpar...,HTTP/1.1,401.0,12846,...,43,7,17,31,Mar,[Sun Mar 7 17:31:39 2004] [info] [client 64.2...,39,[client 64.242.88.10] (104)Connection reset by...,info,2004
51,64.242.88.10,-,-,[07/Mar/2004:17:58:00,-0800],GET,/twiki/bin/edit/Main/KevinWGagel?t=1078670331,HTTP/1.1,401.0,12846,...,52,7,17,58,Mar,[Sun Mar 7 17:58:00 2004] [info] [client 64.2...,00,[client 64.242.88.10] (104)Connection reset by...,info,2004
52,64.242.88.10,-,-,[07/Mar/2004:18:00:09,-0800],GET,/twiki/bin/edit/Main/Virtual_mailbox_lock?topi...,HTTP/1.1,401.0,12846,...,53,7,18,00,Mar,[Sun Mar 7 18:00:09 2004] [info] [client 64.2...,09,[client 64.242.88.10] (104)Connection reset by...,info,2004
57,64.242.88.10,-,-,[07/Mar/2004:18:10:09,-0800],GET,/twiki/bin/edit/TWiki/TWikiVariables?t=1078684115,HTTP/1.1,401.0,12846,...,58,7,18,10,Mar,[Sun Mar 7 18:10:09 2004] [info] [client 64.2...,09,[client 64.242.88.10] (104)Connection reset by...,info,2004
61,64.242.88.10,-,-,[07/Mar/2004:18:19:01,-0800],GET,/twiki/bin/edit/Main/TWikiPreferences?topicpar...,HTTP/1.1,401.0,12846,...,62,7,18,19,Mar,[Sun Mar 7 18:19:01 2004] [info] [client 64.2...,01,[client 64.242.88.10] (104)Connection reset by...,info,2004


Finally, use the `next` field to merge `found` and `access_full`

In [51]:
paired = pd.merge(found, access_full, left_on='next', right_index = True)
paired

Unnamed: 0,next,origin_x,dummy1_x,dummy2_x,time_x,tz_x,type_x,url_x,prot_x,status_x,...,dir2_y,day,dtime_y,hour,mins,month,row,secs,year,next_y
0,1,64.242.88.10,-,-,[07/Mar/2004:16:05:49,-0800],GET,/twiki/bin/edit/Main/Double_bounce_sender?topi...,HTTP/1.1,401.0,...,/twiki/bin/rdiff/TWiki/,07,2004-03-07 16:06:51,16,06,Mar,[07/Mar/2004:16:06:51,51,2004,2
17,18,64.242.88.10,-,-,[07/Mar/2004:16:45:56,-0800],GET,/twiki/bin/attach/Main/PostfixCommands,HTTP/1.1,401.0,...,/,07,2004-03-07 16:47:12,16,47,Mar,[07/Mar/2004:16:47:12,12,2004,19
30,31,64.242.88.10,-,-,[07/Mar/2004:17:13:50,-0800],GET,/twiki/bin/edit/TWiki/DefaultPlugin?t=1078688936,HTTP/1.1,401.0,...,/twiki/bin/search/Main/,07,2004-03-07 17:16:00,17,16,Mar,[07/Mar/2004:17:16:00,00,2004,32
35,36,64.242.88.10,-,-,[07/Mar/2004:17:21:44,-0800],GET,/twiki/bin/attach/TWiki/TablePlugin,HTTP/1.1,401.0,...,/twiki/bin/view/TWiki/,07,2004-03-07 17:22:49,17,22,Mar,[07/Mar/2004:17:22:49,49,2004,37
39,40,64.242.88.10,-,-,[07/Mar/2004:17:27:37,-0800],GET,/twiki/bin/edit/Main/WebSearch?t=1078669682,HTTP/1.1,401.0,...,/twiki/bin/oops/TWiki/,07,2004-03-07 17:28:45,17,28,Mar,[07/Mar/2004:17:28:45,45,2004,41
42,43,64.242.88.10,-,-,[07/Mar/2004:17:31:39,-0800],GET,/twiki/bin/edit/Main/UvscanAndPostFix?topicpar...,HTTP/1.1,401.0,...,/twiki/bin/view/TWiki/,07,2004-03-07 17:35:35,17,35,Mar,[07/Mar/2004:17:35:35,35,2004,44
51,52,64.242.88.10,-,-,[07/Mar/2004:17:58:00,-0800],GET,/twiki/bin/edit/Main/KevinWGagel?t=1078670331,HTTP/1.1,401.0,...,/twiki/bin/edit/Main/,07,2004-03-07 18:00:09,18,00,Mar,[07/Mar/2004:18:00:09,09,2004,53
52,53,64.242.88.10,-,-,[07/Mar/2004:18:00:09,-0800],GET,/twiki/bin/edit/Main/Virtual_mailbox_lock?topi...,HTTP/1.1,401.0,...,/twiki/bin/view/Main/,07,2004-03-07 18:02:10,18,02,Mar,[07/Mar/2004:18:02:10,10,2004,54
57,58,64.242.88.10,-,-,[07/Mar/2004:18:10:09,-0800],GET,/twiki/bin/edit/TWiki/TWikiVariables?t=1078684115,HTTP/1.1,401.0,...,/pipermail/cncce/2004-January/,07,2004-03-07 18:10:18,18,10,Mar,[07/Mar/2004:18:10:18,18,2004,59
61,62,64.242.88.10,-,-,[07/Mar/2004:18:19:01,-0800],GET,/twiki/bin/edit/Main/TWikiPreferences?topicpar...,HTTP/1.1,401.0,...,/pipermail/cncce/,07,2004-03-07 18:19:16,18,19,Mar,[07/Mar/2004:18:19:16,16,2004,63


In [55]:
paired.columns

Index(['next', 'origin_x', 'dummy1_x', 'dummy2_x', 'time_x', 'tz_x', 'type_x',
       'url_x', 'prot_x', 'status_x', 'size_x', 'dir_x', 'dir2_x', 'day_x',
       'dtime_x', 'hour_x', 'mins_x', 'month_x', 'row_x', 'secs_x', 'year_x',
       'next_x', 'day_y', 'hour_y', 'mins_y', 'month_y', 'row_y', 'secs_y',
       'text', 'type_y', 'year_y', 'origin_y', 'dummy1_y', 'dummy2_y',
       'time_y', 'tz_y', 'type', 'url_y', 'prot_y', 'status_y', 'size_y',
       'dir_y', 'dir2_y', 'day', 'dtime_y', 'hour', 'mins', 'month', 'row',
       'secs', 'year', 'next_y'],
      dtype='object')

## Count the number of times that the two accesses of the previous point have the same origin.

In [59]:
len(paired[paired['origin_x'] == paired['origin_y']])

83
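
The pairing with the following row and the same-origin count can also be sketched with `shift(-1)`, which avoids the second merge. The frame below is hypothetical: `is_info` marks the rows that would have matched an `[info]` entry.

```python
import pandas as pd

# Hypothetical accesses in log order.
toy = pd.DataFrame({
    'origin': ['64.242.88.10', '64.242.88.10', '10.0.0.153',
               '10.0.0.153', '64.242.88.10'],
    'is_info': [True, False, True, False, True],
})

# Origin of the following access; the last row has no successor (NaN).
toy['next_origin'] = toy['origin'].shift(-1)

# Among the flagged rows, count those followed by an access from the same origin.
info_rows = toy[toy['is_info']]
same_origin = (info_rows['origin'] == info_rows['next_origin']).sum()
print(same_origin)
```

A comparison with a missing value is `False`, so a flagged last row is not counted, which matches the behaviour of the inner merge on `next`.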

## Count the number of accesses between each pair of `[info]` or `[error]` entries of *error.log*

In [66]:
info_errors = [(merged['type_y'] == 'info') | (merged['type_y'] == 'error')]
info_errors 

[0        True
 1       False
 2       False
 3       False
 4       False
 5       False
 6       False
 7       False
 8       False
 9       False
 10      False
 11      False
 12      False
 13      False
 14      False
 15      False
 16      False
 17       True
 18      False
 19      False
 20      False
 21      False
 22      False
 23      False
 24      False
 25      False
 26      False
 27      False
 28      False
 29      False
         ...  
 1516    False
 1517    False
 1518    False
 1519    False
 1520    False
 1521    False
 1522    False
 1523    False
 1524    False
 1525    False
 1526    False
 1527    False
 1528    False
 1529    False
 1530    False
 1531    False
 1532    False
 1533    False
 1534    False
 1535    False
 1536    False
 1537    False
 1538    False
 1539    False
 1540    False
 1541    False
 1542    False
 1543    False
 1544    False
 1545    False
 Name: type_y, Length: 1546, dtype: bool]

In [90]:
import numpy as np
occs = np.where(info_errors)
lista = list(occs[1])
lista

[0,
 17,
 30,
 35,
 39,
 42,
 51,
 52,
 57,
 61,
 67,
 71,
 74,
 77,
 78,
 85,
 89,
 92,
 93,
 94,
 103,
 106,
 109,
 110,
 112,
 118,
 121,
 126,
 133,
 136,
 139,
 142,
 147,
 149,
 150,
 152,
 155,
 167,
 184,
 190,
 200,
 208,
 211,
 220,
 225,
 231,
 234,
 237,
 245,
 263,
 279,
 280,
 283,
 286,
 291,
 292,
 308,
 311,
 312,
 327,
 330,
 338,
 354,
 359,
 362,
 368,
 375,
 379,
 380,
 381,
 384,
 386,
 389,
 396,
 401,
 402,
 436,
 458,
 499,
 502,
 508,
 530,
 548,
 555,
 558,
 572,
 574,
 575,
 596,
 638,
 640,
 643,
 650,
 651,
 653,
 656,
 657,
 662,
 672,
 680,
 906,
 1082,
 1083]
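
The positions of the matching rows, and the gaps between consecutive positions, can also be obtained directly from a boolean Series with NumPy. This is a sketch on a hypothetical mask (`mask_toy`), assuming a plain Series rather than the one-element list used above.

```python
import numpy as np
import pandas as pd

# Hypothetical mask: True where a row matched an [info] or [error] entry.
mask_toy = pd.Series([True, False, False, True, True, False, True])

# Positions of the True values.
positions = np.flatnonzero(mask_toy.to_numpy())

# Difference between each position and the previous one.
gaps = np.diff(positions)

print(positions.tolist(), gaps.tolist())
```

Note that a gap of `k` between two positions means that `k - 1` rows lie strictly between them.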

In [99]:
pairs = zip(lista[:-1], lista[1:])
list(pairs)

[(0, 17),
 (17, 30),
 (30, 35),
 (35, 39),
 (39, 42),
 (42, 51),
 (51, 52),
 (52, 57),
 (57, 61),
 (61, 67),
 (67, 71),
 (71, 74),
 (74, 77),
 (77, 78),
 (78, 85),
 (85, 89),
 (89, 92),
 (92, 93),
 (93, 94),
 (94, 103),
 (103, 106),
 (106, 109),
 (109, 110),
 (110, 112),
 (112, 118),
 (118, 121),
 (121, 126),
 (126, 133),
 (133, 136),
 (136, 139),
 (139, 142),
 (142, 147),
 (147, 149),
 (149, 150),
 (150, 152),
 (152, 155),
 (155, 167),
 (167, 184),
 (184, 190),
 (190, 200),
 (200, 208),
 (208, 211),
 (211, 220),
 (220, 225),
 (225, 231),
 (231, 234),
 (234, 237),
 (237, 245),
 (245, 263),
 (263, 279),
 (279, 280),
 (280, 283),
 (283, 286),
 (286, 291),
 (291, 292),
 (292, 308),
 (308, 311),
 (311, 312),
 (312, 327),
 (327, 330),
 (330, 338),
 (338, 354),
 (354, 359),
 (359, 362),
 (362, 368),
 (368, 375),
 (375, 379),
 (379, 380),
 (380, 381),
 (381, 384),
 (384, 386),
 (386, 389),
 (389, 396),
 (396, 401),
 (401, 402),
 (402, 436),
 (436, 458),
 (458, 499),
 (499, 502),
 (502, 508),


In [104]:
[b - a for (a, b) in zip(lista[:-1], lista[1:])]

[17,
 13,
 5,
 4,
 3,
 9,
 1,
 5,
 4,
 6,
 4,
 3,
 3,
 1,
 7,
 4,
 3,
 1,
 1,
 9,
 3,
 3,
 1,
 2,
 6,
 3,
 5,
 7,
 3,
 3,
 3,
 5,
 2,
 1,
 2,
 3,
 12,
 17,
 6,
 10,
 8,
 3,
 9,
 5,
 6,
 3,
 3,
 8,
 18,
 16,
 1,
 3,
 3,
 5,
 1,
 16,
 3,
 1,
 15,
 3,
 8,
 16,
 5,
 3,
 6,
 7,
 4,
 1,
 1,
 3,
 2,
 3,
 7,
 5,
 1,
 34,
 22,
 41,
 3,
 6,
 22,
 18,
 7,
 3,
 14,
 2,
 1,
 21,
 42,
 2,
 3,
 7,
 1,
 2,
 3,
 1,
 5,
 10,
 8,
 226,
 176,
 1]