<h1 align="center" style="color:blue">Analysing Log DataSet</h1>

<p>Read the dataset located at <i>data_sets/access_log</i>. Each line represents one log entry, as in the following example:</p>

> 10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469

<br>

<p>The format is:</p>

> %h %l %u %t \"%r\" %>s %b

<br>

<p>where:</p>

<div style="border:2px solid black; border-radius:2px; padding-left:20px">
    # \ %h is the IP address of the client<br>
    # \ %l is the identity of the client, or "-" if it's unavailable<br>
    # \ %u is the username of the client, or "-" if it's unavailable<br>
    # \ %t is the time that the server finished processing the request. The format is [day/month/year:hour:minute:second zone]<br>
    # \ %r is the request line from the client (in double quotes). It contains the method, path, query-string, and protocol of the request.<br>
    # \ %>s is the status code that the server sends back to the client. You will mostly see status codes 200 (OK - The request has succeeded), 304 (Not Modified) and 404 (Not Found). See W3C.org for more information on status codes.<br>
    # \ %b is the size of the object returned to the client, in bytes. It will be "-" in case of status code 304.
</div>
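
The field layout above can also be expressed as a regular expression. The following is only a minimal sketch using the standard library: the names `LOG_PATTERN` and `parse_line` are hypothetical, the pattern assumes that the first three fields contain no spaces, and the month abbreviation in `%t` is assumed to be parsed under an English locale.

```python
import re
from datetime import datetime

# Hypothetical pattern for the layout: %h %l %u %t "%r" %>s %b
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>.*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # %t uses day/month/year:hour:minute:second zone
    fields['time'] = datetime.strptime(fields['time'], '%d/%b/%Y:%H:%M:%S %z')
    # %b is "-" when no size is reported (e.g. status code 304)
    fields['size'] = None if fields['size'] == '-' else int(fields['size'])
    return fields

sample = '10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469'
entry = parse_line(sample)
print(entry['host'], entry['request'], entry['status'], entry['size'])
```

Returning `None` for lines that do not match makes malformed entries easy to count or skip instead of silently producing shifted columns.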

----

<p>After reading the file, solve the following problems:</p>

<p>PS.: some pathnames start with <i>http://www.the-associates.co.uk</i>. Make sure to remove this prefix from the column.</p>
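
One way to do this clean-up is to remove the text only when it really is a prefix. The sketch below is an illustration, not the required solution; `SITE_PREFIX` and `strip_site_prefix` are hypothetical names.

```python
SITE_PREFIX = 'http://www.the-associates.co.uk'

def strip_site_prefix(pathname, prefix=SITE_PREFIX):
    """Remove the prefix only when the pathname starts with it."""
    if pathname.startswith(prefix):
        return pathname[len(prefix):]
    return pathname

print(strip_site_prefix('http://www.the-associates.co.uk/assets/js/lowpro.js'))
print(strip_site_prefix('/assets/js/lowpro.js'))
```

Unlike a plain `str.replace`, this leaves the pathname untouched when the same text appears somewhere other than the start, for example inside a query string.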

<br>

<div style="border:2px solid black; border-radius:2px; padding-left:20px">
    1 - How many hits were made to the page <b>/assets/js/the-associates.js</b>?
    <br>
    2 - How many hits were made by the IP <b>10.99.99.186</b>?
    <br>
    3 - What page had the highest number of hits and what is this number?
</div>

----

<h2 style="color:blue">0 - Formatting the DataSet</h2>

In [1]:
import pandas as pd

In [2]:
# Reading the Dataset
dataset_directory = 'data_sets/access_log'

df_original = pd.read_csv(dataset_directory, sep='\t', header=None, names=['Raw Data'])
df_original

Unnamed: 0,Raw Data
0,10.223.157.186 - - [15/Jul/2009:14:58:59 -0700...
1,10.223.157.186 - - [15/Jul/2009:14:58:59 -0700...
2,10.223.157.186 - - [15/Jul/2009:15:50:35 -0700...
3,10.223.157.186 - - [15/Jul/2009:15:50:35 -0700...
4,10.223.157.186 - - [15/Jul/2009:15:50:35 -0700...
...,...
4477838,10.190.174.142 - - [03/Dec/2011:13:28:09 -0800...
4477839,10.190.174.142 - - [03/Dec/2011:13:28:10 -0800...
4477840,10.190.174.142 - - [03/Dec/2011:13:28:11 -0800...
4477841,10.190.174.142 - - [03/Dec/2011:13:28:10 -0800...


In [3]:
# Function to remove 'http://www.the-associates.co.uk' from a pathname
def replace_pathname(pathname):
    return pathname.replace('http://www.the-associates.co.uk', '')

In [4]:
# Function to split each line into its columns
#
# %h %l %u %t "%r" %>s %b
def split_datas(dataset):
    
    split_result = []
    
    for row in dataset:
        
        # Getting IP #
        ip, _, row = row.partition(' ')
        
        # Getting Client Identity #
        identity, _, row = row.partition(' ')
        
        # Getting Client Username (everything before the opening '[') #
        username, _, row = row.partition('[')
        username = username.rstrip()
        
        # Getting Date Time #
        date_time, _, row = row.partition('] ')
        
        # Getting Request and replacing 'http://www.the-associates.co.uk' by nothing #
        # The rest of the row is taken from the partition itself, so it stays
        # correct even when the request text is modified afterwards
        request, _, row = row.partition('" ')
        request = replace_pathname(request.replace('"', ''))
        
        # Getting Status Code and Returned Object Size #
        status_code, _, object_size = row.partition(' ')
        
        # Adding the processed data to the result
        split_result.append([ip, identity, username, date_time,
                             request, status_code, object_size])
        
    # Returning the result
    return split_result

In [5]:
# Converting the DataFrame into a Series and then into a list,
# and splitting each raw line into its column values
df_original_list = df_original.squeeze().tolist()
df_original_list = split_datas(df_original_list)

In [6]:
# Transforming the split list data into a DataFrame again
df_original = pd.DataFrame(df_original_list,
                           columns=['IP', 'Client Identity', 'Client Username',
                                    'Date-Time', 'Requested URL', 'Status Code', 'Returned Object Size'])

In [7]:
df_original

Unnamed: 0,IP,Client Identity,Client Username,Date-Time,Requested URL,Status Code,Returned Object Size
0,10.223.157.186,-,-,15/Jul/2009:14:58:59 -0700,GET / HTTP/1.1,403,202
1,10.223.157.186,-,-,15/Jul/2009:14:58:59 -0700,GET /favicon.ico HTTP/1.1,404,209
2,10.223.157.186,-,-,15/Jul/2009:15:50:35 -0700,GET / HTTP/1.1,200,9157
3,10.223.157.186,-,-,15/Jul/2009:15:50:35 -0700,GET /assets/js/lowpro.js HTTP/1.1,200,10469
4,10.223.157.186,-,-,15/Jul/2009:15:50:35 -0700,GET /assets/css/reset.css HTTP/1.1,200,1014
...,...,...,...,...,...,...,...
4477838,10.190.174.142,-,-,03/Dec/2011:13:28:09 -0800,GET /images/filmmediablock/360/07082218.jpg HT...,200,154609
4477839,10.190.174.142,-,-,03/Dec/2011:13:28:10 -0800,GET /images/filmpics/0000/2229/GOEMON-NUKI-000...,200,184976
4477840,10.190.174.142,-,-,03/Dec/2011:13:28:11 -0800,GET /images/filmmediablock/360/GOEMON-NUKI-000...,200,60117
4477841,10.190.174.142,-,-,03/Dec/2011:13:28:10 -0800,GET /images/filmmediablock/360/Chacha.jpg HTTP...,200,109379


In [8]:
# Keeping only the columns that will be used in the three problems
#
# \ IP
# \ Requested URL

df_original = df_original[['IP', 'Requested URL']]
df_original

Unnamed: 0,IP,Requested URL
0,10.223.157.186,GET / HTTP/1.1
1,10.223.157.186,GET /favicon.ico HTTP/1.1
2,10.223.157.186,GET / HTTP/1.1
3,10.223.157.186,GET /assets/js/lowpro.js HTTP/1.1
4,10.223.157.186,GET /assets/css/reset.css HTTP/1.1
...,...,...
4477838,10.190.174.142,GET /images/filmmediablock/360/07082218.jpg HT...
4477839,10.190.174.142,GET /images/filmpics/0000/2229/GOEMON-NUKI-000...
4477840,10.190.174.142,GET /images/filmmediablock/360/GOEMON-NUKI-000...
4477841,10.190.174.142,GET /images/filmmediablock/360/Chacha.jpg HTTP...


In [9]:
# Checking whether there are null values
df_original.isnull().sum()

IP               0
Requested URL    0
dtype: int64

----

<h2 style="color:blue">1 - How many hits were made to the page /assets/js/the-associates.js?</h2>

In [10]:
# Defining the target URL and creating a copy of the dataset
target_url = '/assets/js/the-associates.js'

df_problem_1 = df_original.copy()
df_problem_1

Unnamed: 0,IP,Requested URL
0,10.223.157.186,GET / HTTP/1.1
1,10.223.157.186,GET /favicon.ico HTTP/1.1
2,10.223.157.186,GET / HTTP/1.1
3,10.223.157.186,GET /assets/js/lowpro.js HTTP/1.1
4,10.223.157.186,GET /assets/css/reset.css HTTP/1.1
...,...,...
4477838,10.190.174.142,GET /images/filmmediablock/360/07082218.jpg HT...
4477839,10.190.174.142,GET /images/filmpics/0000/2229/GOEMON-NUKI-000...
4477840,10.190.174.142,GET /images/filmmediablock/360/GOEMON-NUKI-000...
4477841,10.190.174.142,GET /images/filmmediablock/360/Chacha.jpg HTTP...


In [11]:
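# Filtering the dataset with the target URL
# (regex=False so that the '.' in the URL is matched literally, not as "any character")
df_problem_1 = df_problem_1.loc[df_problem_1['Requested URL'].str.contains(target_url, regex=False)]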
df_problem_1

Unnamed: 0,IP,Requested URL
7,10.223.157.186,GET /assets/js/the-associates.js HTTP/1.1
27,10.223.157.186,GET /assets/js/the-associates.js HTTP/1.1
46,10.223.157.186,GET /assets/js/the-associates.js HTTP/1.1
62,10.223.157.186,GET /assets/js/the-associates.js HTTP/1.1
87,10.223.157.186,GET /assets/js/the-associates.js HTTP/1.1
...,...,...
52769,10.211.47.159,GET /assets/js/the-associates.js HTTP/1.1
52796,10.211.47.159,GET /assets/js/the-associates.js HTTP/1.1
52816,10.211.47.159,GET /assets/js/the-associates.js HTTP/1.1
52853,10.211.47.159,GET /assets/js/the-associates.js HTTP/1.1


In [12]:
# Printing the number of hits to the target URL

print('Target URL: ', target_url)
print('Number of hits: ', df_problem_1['Requested URL'].count())

Target URL:  /assets/js/the-associates.js
Number of hits:  2456
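
The filter above uses `str.contains`, which counts any request whose text merely contains the target URL. Whether the dataset has such rows is not verified here, but a stricter check would compare the path itself. The sketch below is one possible approach on made-up sample requests; `request_path` and `count_exact_hits` are hypothetical helpers, and treating a request with a query string as a hit to the same page is an assumption.

```python
def request_path(request_line):
    """Return the path part of a request line such as 'GET /a.js HTTP/1.1'."""
    parts = request_line.split()
    # Assumption: a well-formed request line has at least a method and a path
    if len(parts) < 2:
        return None
    # Drop any query string so '/a.js?v=2' counts as '/a.js'
    return parts[1].partition('?')[0]

def count_exact_hits(request_lines, target_path):
    """Count the request lines whose path is exactly target_path."""
    return sum(1 for line in request_lines if request_path(line) == target_path)

# Made-up sample requests, not taken from the dataset
sample_requests = [
    'GET /assets/js/the-associates.js HTTP/1.1',
    'GET /assets/js/the-associates.js?v=2 HTTP/1.1',
    'GET /assets/js/the-associatesXjs HTTP/1.1',
    'GET /backup/assets/js/the-associates.js.bak HTTP/1.1',
]
print(count_exact_hits(sample_requests, '/assets/js/the-associates.js'))
```

A substring match would accept the last sample line as well, so comparing both counts on the real data is a quick way to check whether the distinction matters.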


----

<h2 style="color:blue">2 - How many hits were made by the IP 10.99.99.186?</h2>

In [13]:
# Defining the target IP and creating a copy of the dataset for this problem

target_ip = '10.99.99.186'

df_problem_2 = df_original.copy()
df_problem_2

Unnamed: 0,IP,Requested URL
0,10.223.157.186,GET / HTTP/1.1
1,10.223.157.186,GET /favicon.ico HTTP/1.1
2,10.223.157.186,GET / HTTP/1.1
3,10.223.157.186,GET /assets/js/lowpro.js HTTP/1.1
4,10.223.157.186,GET /assets/css/reset.css HTTP/1.1
...,...,...
4477838,10.190.174.142,GET /images/filmmediablock/360/07082218.jpg HT...
4477839,10.190.174.142,GET /images/filmpics/0000/2229/GOEMON-NUKI-000...
4477840,10.190.174.142,GET /images/filmmediablock/360/GOEMON-NUKI-000...
4477841,10.190.174.142,GET /images/filmmediablock/360/Chacha.jpg HTTP...


In [14]:
# Filtering the dataset with the target IP
df_problem_2 = df_problem_2.loc[df_problem_2['IP'] == target_ip]
df_problem_2

Unnamed: 0,IP,Requested URL
2253879,10.99.99.186,GET /images/filmpics/0000/3695/Pelican_Blood_2...
2253880,10.99.99.186,GET / HTTP/1.0
2262471,10.99.99.186,GET /images/filmpics/0000/3695/Pelican_Blood_2...
2262475,10.99.99.186,GET / HTTP/1.0
2278840,10.99.99.186,GET /images/filmpics/0000/3695/Pelican_Blood_2...
2318136,10.99.99.186,GET /images/filmpics/0000/3695/Pelican_Blood_2...


In [15]:
# Returning the number of hits that were made by the target IP

print('Target IP: ', target_ip)
print('Number of hits made: ', df_problem_2['IP'].count())

Target IP:  10.99.99.186
Number of hits made:  6
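
For a file of this size, the same question can also be answered without building a DataFrame, by reading the log line by line and counting the first field. This is a minimal sketch under that assumption; `hits_per_ip` is a hypothetical helper and the sample lines are invented stand-ins for the real file.

```python
from collections import Counter
import io

def hits_per_ip(lines):
    """Count log lines per client IP (the first space-separated field, %h)."""
    counts = Counter()
    for line in lines:
        ip = line.split(' ', 1)[0].strip()
        if ip:  # skip blank lines
            counts[ip] += 1
    return counts

# Invented sample standing in for open('data_sets/access_log')
sample_log = io.StringIO(
    '10.99.99.186 - - [01/Jan/2011:00:00:00 -0800] "GET / HTTP/1.0" 200 100\n'
    '10.223.157.186 - - [01/Jan/2011:00:00:01 -0800] "GET / HTTP/1.1" 200 100\n'
    '\n'
    '10.99.99.186 - - [01/Jan/2011:00:00:02 -0800] "GET / HTTP/1.0" 200 100\n'
)
counts = hits_per_ip(sample_log)
print(counts['10.99.99.186'])
```

Because a `Counter` returns 0 for missing keys, looking up an IP that never appears does not raise an error.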


----

<h2 style="color:blue">3 - What page had the highest number of hits and what is this number?</h2>

In [17]:
# Target
most_hit_pathname = number_hit = None

In [18]:
# Creating a copy of the dataset for this problem

df_problem_3 = df_original.copy()
df_problem_3

Unnamed: 0,IP,Requested URL
0,10.223.157.186,GET / HTTP/1.1
1,10.223.157.186,GET /favicon.ico HTTP/1.1
2,10.223.157.186,GET / HTTP/1.1
3,10.223.157.186,GET /assets/js/lowpro.js HTTP/1.1
4,10.223.157.186,GET /assets/css/reset.css HTTP/1.1
...,...,...
4477838,10.190.174.142,GET /images/filmmediablock/360/07082218.jpg HT...
4477839,10.190.174.142,GET /images/filmpics/0000/2229/GOEMON-NUKI-000...
4477840,10.190.174.142,GET /images/filmmediablock/360/GOEMON-NUKI-000...
4477841,10.190.174.142,GET /images/filmmediablock/360/Chacha.jpg HTTP...


In [19]:
# Dropping the 'IP' column (it is not needed for this problem)

df_problem_3 = df_problem_3.drop(columns='IP')
df_problem_3

Unnamed: 0,Requested URL
0,GET / HTTP/1.1
1,GET /favicon.ico HTTP/1.1
2,GET / HTTP/1.1
3,GET /assets/js/lowpro.js HTTP/1.1
4,GET /assets/css/reset.css HTTP/1.1
...,...
4477838,GET /images/filmmediablock/360/07082218.jpg HT...
4477839,GET /images/filmpics/0000/2229/GOEMON-NUKI-000...
4477840,GET /images/filmmediablock/360/GOEMON-NUKI-000...
4477841,GET /images/filmmediablock/360/Chacha.jpg HTTP...


In [20]:
# Filtering 'Requested URL' to have just the pathname as values
def extract_pathnames(dataset):
    
    extracted_pathnames = []
    
    for row in dataset:
        
        # Taking out the type of the request (GET, POST, PUT...), the protocol and the version #
        row = '/' + row.partition('/')[2]
        row = row.partition('HTTP')[0]
        
        # str.strip() returns a new string, so the result must be assigned back #
        row = row.strip()
        
        # Adding the processed pathname to the extracted pathnames list #
        extracted_pathnames.append(row)
        
    # Returning all the extracted pathnames #
    return extracted_pathnames

In [21]:
# Extracting the pathnames and transforming the result into a DataFrame again

df_problem_3 = df_problem_3.squeeze().tolist()
df_problem_3 = extract_pathnames(df_problem_3)

df_problem_3 = pd.DataFrame(df_problem_3, columns=['Requested URL'])
df_problem_3

Unnamed: 0,Requested URL
0,/
1,/favicon.ico
2,/
3,/assets/js/lowpro.js
4,/assets/css/reset.css
...,...
4477838,/images/filmmediablock/360/07082218.jpg
4477839,/images/filmpics/0000/2229/GOEMON-NUKI-000163....
4477840,/images/filmmediablock/360/GOEMON-NUKI-000163....
4477841,/images/filmmediablock/360/Chacha.jpg


In [22]:
# Finding the most hit URL and its number of hits

url_counts = df_problem_3['Requested URL'].value_counts()
most_hit_pathname = url_counts.idxmax()
number_hit = url_counts.max()

print('Most Hit URL: ', most_hit_pathname)
print('Number of Hits: ', number_hit)

Most Hit URL:  /assets/css/combined.css 
Number of Hits:  117352
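
The same kind of count can be sketched with the standard library alone. The example below is an illustration on invented sample paths, not a replacement for the pandas solution; `most_hit_path` is a hypothetical helper, and each path is stripped of surrounding whitespace so that `'/x '` and `'/x'` are counted as the same page.

```python
from collections import Counter

def most_hit_path(paths):
    """Return (path, hits) for the most frequent path, or None for empty input."""
    counts = Counter(path.strip() for path in paths)
    if not counts:
        return None
    return counts.most_common(1)[0]

# Invented sample paths, not taken from the dataset
sample_paths = ['/', '/favicon.ico', '/', '/assets/css/reset.css ', '/assets/css/reset.css', '/']
print(most_hit_path(sample_paths))
```

`Counter.most_common(1)` returns the path and its count together, so no second pass over the data is needed.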


----

<h2 align="center" style="color:blue">The End!!</h2>