# Webserver Log File Analysis Template

Initial steps toward a log file analysis pipeline for gaining insights into a website's traffic, users, locations, search engine crawlers, referring sites, consumed content, performance, and anything else that can be gleaned from the logs.

This first step is a prototype of a process for converting a log file to an efficient on-disk format (Apache Parquet), and then reading it into an efficient DataFrame with optimized data types.

In this example we convert a 3.3 GB text file to a 258 MB `parquet` file, which is later read into a 342 MB DataFrame. The total time varies between three and five minutes, depending on the system used. This notebook ran the full process in 204 seconds.

In [None]:
%%capture
!pip install advertools

In [None]:
import pandas as pd
pd.options.display.max_columns = None
import re
import os
import time
from tqdm import tqdm

from dataset_utilities import value_counts_plus

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Sample lines from the log file

In [None]:
!ls -lsSh /kaggle/input/web-server-access-logs/access.log

In [None]:
!head -n 4 /kaggle/input/web-server-access-logs/access.log

# Dataset 

The dataset was downloaded from Harvard's Dataverse and contains logs from an Iranian ecommerce site (zanbil.ir): 


Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", 
https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1

# Log Format 
This approach assumes the common log format and/or the combined one, which are two of the most commonly used. Other formats can be incorporated later. We start with the regular expressions below, taken from:

[Regular Expressions Cookbook](https://learning.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch07s12.html)  
by Jan Goyvaerts, Steven Levithan  
Publisher: O'Reilly Media, Inc.
Release Date: August 2012


In [None]:
# There is a minor bug in this regex: it misses the last field. I'll fix this soon.
# Raw strings keep escapes such as \S and \[ intact, and the group names now match `columns`.

common_regex = r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-)'
combined_regex = r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)'
columns = ['client', 'userid', 'datetime', 'method', 'request', 'status', 'size', 'referer', 'user_agent']
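
To see what the combined pattern extracts, here is a small sketch that applies it to a single line. The log line is made up for illustration (it is not taken from the dataset), and the pattern is repeated as a raw string, with group names following the `columns` list, so that the snippet runs on its own.

```python
import re

# Same combined-format pattern as above, repeated here so the snippet is self-contained.
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)'
)

# Hypothetical log line, invented for this example.
sample_line = (
    '203.0.113.7 - - [22/Jan/2019:03:56:14 +0330] '
    '"GET /image/1.jpg HTTP/1.1" 200 5667 '
    '"https://example.com/" "Mozilla/5.0" "-"'
)

match = re.match(combined_regex, sample_line)
parsed = match.groupdict() if match else {}

for field, value in parsed.items():
    print(f'{field:>10}: {value}')
```

Every captured value is still a string at this point, including `status` and `size`; converting them to numbers happens later, once the data is in a DataFrame.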

# The Approach

* Loop through the lines of the input log file one by one. This ensures minimal memory consumption. 
* For each line, check it against the regular expression, and process it: 
  * Match: append the matched line to a `parsed_lines` list
  * No match: append the non-matching line to the `errors_file` for later analysis
* Once `parsed_lines` reaches 250,000 elements, convert the list to a DataFrame and save it to a `parquet` file in the `output_dir`. Clear the list. This also ensures minimal memory usage, and the 250k can be tweaked if necessary.
* Read all the files of the `output_dir` with `read_parquet` into a pandas DataFrame. This function handles reading all the files and combines them. 
* Optimize the columns by using more efficient data types, most notably the pandas categorical type.
* Write the DataFrame to a single file, for more convenient handling, and with the more efficient datatypes. This results in even faster reading.
* Delete the files in `output_dir`.
* Read in the final file with `read_parquet`.
* Start analyzing.
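
The batching part of the steps above can be sketched without pandas or parquet. This is only an illustration of the idea, not the function used in this notebook: `parse_in_chunks`, the toy pattern, and the chunk size of 2 are all hypothetical, and yielding a batch stands in for writing one parquet file.

```python
import re

# Hypothetical toy pattern and chunk size, chosen only to keep the example small.
toy_regex = re.compile(r'^(\S+) (\d{3})$')
chunk_size = 2


def parse_in_chunks(lines, regex, chunk_size, errors):
    """Yield lists of parsed tuples holding at most `chunk_size` items each."""
    batch = []
    for line in lines:
        match = regex.match(line)
        if match is None:
            errors.append(line)  # stand-in for appending to the errors file
            continue
        batch.append(match.groups())
        if len(batch) == chunk_size:
            yield batch          # stand-in for writing one parquet file
            batch = []           # start a fresh batch to keep memory use low
    if batch:
        yield batch              # flush the final, partial batch


lines = ['a 200', 'b 404', 'not a log line', 'c 301']
errors = []
batches = list(parse_in_chunks(lines, toy_regex, chunk_size, errors))

print(batches)
print(errors)
```

Only one batch is held in memory at a time, which is why the chunk size controls peak memory use.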


## Create a destination directory where output files will be stored


In [None]:
%mkdir parquet_dir

# The `logs_to_df` function

In [None]:
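import re

import pandas as pd
from tqdm import tqdm


def logs_to_df(logfile, output_dir, errors_file):
    """Parse `logfile` with `combined_regex` and write it to `output_dir` in parquet chunks."""
    regex = re.compile(combined_regex)  # compile once instead of on every line
    parsed_lines = []
    linenumber = 0  # counts successfully parsed lines only
    with open(logfile) as source_file:
        for line in tqdm(source_file):
            try:
                log_line = regex.findall(line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # Non-matching lines go to the errors file for later analysis
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # Flush a full chunk to disk and free the memory
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
    # Write the remaining lines; skip an empty remainder so that it
    # does not overwrite the last full chunk with an empty file
    if parsed_lines:
        df = pd.DataFrame(parsed_lines, columns=columns)
        df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
        parsed_lines.clear()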

Times will vary from system to system, so the figures quoted below are approximate. When you run this, you will likely see slightly different numbers.

In [None]:
%time logs_to_df(logfile='/kaggle/input/web-server-access-logs/access.log', output_dir='parquet_dir/', errors_file='errors.txt')

The whole process just described took around 2.5 minutes. 

We are now ready to start the analysis, since the parquet files can already be read. But first we will optimize them further.

Checking the number of resulting parsing errors:

In [None]:
!wc errors.txt

In [None]:
%time logs_df = pd.read_parquet('parquet_dir/')

Reading the whole directory takes about nine seconds. We now check the size of the resulting directory on disk:

In [None]:
!du -sh parquet_dir/

257 ÷ 3,300 ≈ 0.08.

The resulting directory is about 8% the size of the original.

Let's see how much memory it takes: 

In [None]:
logs_df.info(show_counts=True, verbose=True)

711 MB. We now remove the files in `parquet_dir` and convert the columns to more efficient data types.

In [None]:
%rm -r parquet_dir/

In [None]:
# Columns with many repeated values are much smaller as categoricals
logs_df['client'] = logs_df['client'].astype('category')
del logs_df['userid']
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'], format='%d/%b/%Y:%H:%M:%S %z')
logs_df['method'] = logs_df['method'].astype('category')
logs_df['status'] = logs_df['status'].astype('int16')
# The regex also accepts "-" as a size (no bytes sent), so treat it as zero before casting
logs_df['size'] = logs_df['size'].replace('-', '0').astype('int32')
logs_df['referer'] = logs_df['referer'].astype('category')
logs_df['user_agent'] = logs_df['user_agent'].astype('category')
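
The `format` string passed to `pd.to_datetime` uses the standard `strftime` directives, so its meaning can be checked on a single value with the standard library. The timestamp below is made up for illustration.

```python
from datetime import datetime, timezone

# Made-up timestamp in the layout used by the common/combined log formats.
raw = '22/Jan/2019:03:56:14 +0330'

# %d day, %b abbreviated month name, %Y year, %H:%M:%S time, %z UTC offset
parsed = datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S %z')

# The offset is kept, so the value can be converted to UTC unambiguously.
as_utc = parsed.astimezone(timezone.utc)

print(parsed.isoformat())
print(as_utc.isoformat())
```

Passing the format explicitly, rather than letting the parser guess it, tends to be noticeably faster on millions of rows.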

In [None]:
logs_df.info(verbose=True, show_counts=True)

The DataFrame's memory usage was reduced further, from 711 MB to 342 MB (342 ÷ 711 = 0.48 of the previous size).

We now save it to a single file, and read it again.

In [None]:
%time logs_df.to_parquet('logs_df.parquet')

In [None]:
!ls -lshS

In [None]:
%time logs_df = pd.read_parquet('logs_df.parquet')

Reading the single file now took almost half the previous time. Sorry again for the imprecise numbers!

We are now ready to start analyzing.

In [None]:
logs_df.shape

In [None]:
logs_df

# Page/response size analysis - Clustering pages by size

How many "small" pages do we have? And how many "large"? What about medium size pages? 
How do we even define those labels? 

Summarizing a large number of values is not always easy. One way to do it is by using KMeans clustering on a list of numbers. We specify the number of clusters `K`, and the algorithm finds a set of `K` points, each of which is the mean of a group of points. Hence "K-Means".

We can also get other statistics for each cluster (min, max, std, and so on), then visualize those centers, and see how many points we have in each cluster. We can do this interactively, changing the number of clusters to find the best number for the application. In this case, we are still exploring, so this basically allows us to identify pages with a small, medium, large, etc. size, without specifying those values explicitly. We are discovering them this way.
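
To make the idea concrete, here is a minimal pure-Python sketch of the basic K-Means loop (often called Lloyd's algorithm) on one-dimensional data. It is not scikit-learn's implementation: the function name `kmeans_1d`, the evenly spread starting centers, and the sample sizes are all assumptions made for illustration.

```python
def kmeans_1d(values, k, n_iter=20):
    """Minimal 1-D K-Means sketch; returns the sorted cluster means."""
    ordered = sorted(values)
    # Spread the starting centers across the sorted data (simple and deterministic).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return sorted(centers)


# Made-up response sizes in bytes: a few tiny, a few medium, two large.
sizes = [0, 10, 20, 1_000, 1_100, 1_200, 50_000, 52_000]
centers = kmeans_1d(sizes, k=3)
print(centers)
```

Each returned center is the mean of one group of sizes, which is what the `cluster_centers_` attribute holds in the scikit-learn model used below.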

First instantiate the model with the desired number of clusters: 

In [None]:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(k)

Fit the model to the data:

In [None]:
%time kmeans.fit(logs_df[['size']])

Get the `cluster_centers_`. These are the means of the groups/clusters of points that were discovered based on the given `k`.

In [None]:
sorted(kmeans.cluster_centers_.round(0).flatten())

Create a table to get other statistics about each cluster, using the `labels_` attribute.

In [None]:
cluster_df = logs_df.groupby(kmeans.labels_)['size'].describe().sort_values('mean').reset_index(drop=True)

In [None]:
cluster_df.style.background_gradient(subset=['count'], cmap='cividis').format({'mean': '{:,.0f}', 'count': '{:,.0f}'})

It seems we have eight million responses in the first (low) cluster. The average response size in bytes is 3,576 for this group, and we can also see the minimum, maximum, and other statistics for this, and other clusters.

In [None]:
import plotly.express as px
fig = px.scatter(cluster_df, 
                 x='mean', y='count',
                 size=[5]*len(cluster_df),
                 size_max=15,
                 log_y=True,
                 hover_data=['min', 'max', 'std'],
                 title=f'<b>Page distribution by response size ({k} clusters).</b><br>Points represent the average page size for a cluster of pages.',
                 labels={'mean': 'Average page size (bytes)',
                         'count': "Number of pages in cluster"}, 
                )
fig.data[0].hovertemplate = '<b>Average page size (bytes): %{x:,.0f}</b><br><br>Number of pages in cluster: %{y:,.0f}<br><br>min: %{customdata[0]:,.0f}<br>max: %{customdata[1]:,.0f}<br>std: %{customdata[2]:,.0f}<extra></extra>'
for minimum in cluster_df['min']:
    fig.add_vline(x=minimum, line={'width': 1})
fig.layout.font.size = 14
fig.show()

Checking how variable the page size is, for the same page, taking the home page as an example:

In [None]:
logs_df[logs_df['request'].eq("/")][['request', 'size']].value_counts().reset_index().rename(columns={0: 'count'})

We can see that the home page was returned with 2,298 different response sizes, and that many of the responses had a size of zero.

# Analyzing Organic Search Traffic

Referer URLs from search engines can reveal valuable information about the traffic they send. 
Most importantly, the query parameters of those URLs carry that information. 

We can easily filter for those URLs, and then parse their different elements to get some information on the type of traffic they are sending:

In [None]:
goog_organic = logs_df[logs_df['referer'].str.contains(r'google\.com/search')]['referer']
goog_organic
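
As a quick illustration of what "parsing the elements" of such a referer means, here is a standard-library sketch for a single URL. The URL and its parameter values are made up; the next cell does the equivalent for the whole column with `advertools`.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical Google search referer, invented for this example.
referer = 'https://www.google.com/search?q=example+query&hl=fa&source=hp'

parts = urlsplit(referer)
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.netloc, parts.path)
for name, values in params.items():
    print(f'{name}: {values}')
```

Values come back as lists because a parameter can appear more than once in the same URL.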

In [None]:
import advertools as adv
goog_url_df = adv.url_to_df(goog_organic)
goog_url_df = goog_url_df.rename(columns={col: col.replace('query_', '') for col in goog_url_df.columns})
goog_url_df

Sort the parameters by the frequency of their use: 

In [None]:
(goog_url_df.iloc[:, 7:]
 .notna()
 .mean()
 .sort_values(ascending=False)[:30]
 .to_frame()
 .rename(columns={0: '%'})
 .style.format('{:.1%}'))

In [None]:
import ipywidgets as widgets

def count_column_values(col_name, show_top=10, sort_others=False):
    series = goog_url_df[col_name].str.split('@@').explode()
    print(f'Times {series.name} was provided: {series.count():,}')
    print(f'Number of sessions: {len(goog_url_df):,}')
    print(f'Number of unique {series.name}s: {series.nunique():,}')
#     print(f'{series.name}s per page: {series.count()/len(goog_url_df):.2f}')
    display(value_counts_plus(series, sort_others=sort_others, dropna=False, show_top=show_top))

widgets.interact(count_column_values,
                 col_name=goog_url_df.select_dtypes('object').columns,
                 show_top=widgets.IntSlider(min=1, max=50, value=10));
