# Website Log Data Cleaning and Sessionization

## Objective
This task cleans raw website log data and groups requests into user sessions,
where a session ends after 30 minutes of inactivity. Bot and crawler traffic is
filtered out so that only genuine user activity remains.
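
The 30-minute rule can be sketched as follows. This is a minimal illustration rather than the notebook's actual implementation: the helper `assign_sessions`, the `session_id` column, the demo rows, and the choice to identify a user by IP address alone are all assumptions, as is treating a gap strictly greater than 30 minutes as the start of a new session.

```python
import pandas as pd

# Assumed threshold from the objective: 30 minutes of inactivity.
SESSION_TIMEOUT = pd.Timedelta(minutes=30)


def assign_sessions(events, user_col="ip_address", time_col="timestamp"):
    """Hypothetical helper: number each user's sessions 1, 2, 3, ..."""
    events = events.sort_values([user_col, time_col]).copy()
    # Time since the same user's previous request (NaT for the first one).
    gap = events.groupby(user_col)[time_col].diff()
    # A session starts on a user's first request or after a long gap.
    new_session = (gap.isna() | (gap > SESSION_TIMEOUT)).astype(int)
    events["session_id"] = new_session.groupby(events[user_col]).cumsum()
    return events


# Made-up demo rows, not taken from the dataset.
demo = pd.DataFrame({
    "ip_address": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "timestamp": pd.to_datetime([
        "2015-05-17 10:00:00",
        "2015-05-17 10:20:00",  # 20 min gap: same session
        "2015-05-17 11:00:00",  # 40 min gap: new session
        "2015-05-17 10:05:00",
    ]),
})
sessions = assign_sessions(demo)
print(sessions)
```

Session numbers restart at 1 for each IP address, so a globally unique key would combine `ip_address` with `session_id`.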

## Dataset Source
The data is a public Apache web server access log obtained from an open GitHub repository.
It contains real website traffic from both human users and bots.
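
Because the log mixes human and bot traffic, one common approach is to drop rows whose user agent contains a crawler keyword. The sketch below assumes a short, illustrative keyword list and a hypothetical helper `drop_bots`; it is not necessarily how the notebook filters bots, and production crawler lists are much longer.

```python
import pandas as pd

# Illustrative keywords only (an assumption, not an exhaustive list).
BOT_KEYWORDS = r"bot|crawl|spider|slurp"


def drop_bots(frame, ua_col="user_agent"):
    """Hypothetical helper: keep rows whose user agent has no bot keyword."""
    is_bot = frame[ua_col].str.contains(
        BOT_KEYWORDS, case=False, na=False, regex=True
    )
    return frame[~is_bot]


# Made-up demo rows, not taken from the dataset.
demo_agents = pd.DataFrame({
    "user_agent": [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)",
        "Mozilla/5.0 (compatible; Baiduspider/2.0)",
    ]
})
humans = drop_bots(demo_agents)
print(humans)
```

Keyword matching only catches bots that identify themselves; crawlers that spoof a browser user agent would need other signals such as request rate.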


In [3]:
import pandas as pd
import re

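log_path = "../data/access.txt"
logs = []

# Capture groups, in order: client IP, timestamp, HTTP method, URL, user agent.
# The status code, response size, and referrer are matched but not captured.
pattern = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\]\s+"(\w+)\s+(.*?)\s+HTTP.*?"\s+\d+\s+\S+\s+".*?"\s+"(.*?)"'
)

# Undecodable bytes are dropped; lines that do not match the pattern are skipped.
with open(log_path, "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            logs.append(match.groups())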

df = pd.DataFrame(
    logs,
    columns=["ip_address", "timestamp", "method", "url", "user_agent"]
)

df.head()


| | ip_address | timestamp | method | url | user_agent |
|---|---|---|---|---|---|
| 0 | 83.149.9.216 | 17/May/2015:10:05:03 +0000 | GET | /presentations/logstash-monitorama-2013/images... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 1 | 83.149.9.216 | 17/May/2015:10:05:43 +0000 | GET | /presentations/logstash-monitorama-2013/images... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 2 | 83.149.9.216 | 17/May/2015:10:05:47 +0000 | GET | /presentations/logstash-monitorama-2013/plugin... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 3 | 83.149.9.216 | 17/May/2015:10:05:12 +0000 | GET | /presentations/logstash-monitorama-2013/plugin... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 4 | 83.149.9.216 | 17/May/2015:10:05:07 +0000 | GET | /presentations/logstash-monitorama-2013/plugin... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
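
The `timestamp` values in this output are still strings, and the rows are not in chronological order (10:05:47 appears before 10:05:12), so computing inactivity gaps would require parsing and sorting first. The following is a minimal sketch of that step on made-up sample rows; the `sample` frame and the decision to coerce unparseable values to `NaT` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Apache common log time format, e.g. "17/May/2015:10:05:03 +0000".
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

# Made-up sample rows in the same format as the parsed log.
sample = pd.DataFrame({
    "timestamp": [
        "17/May/2015:10:05:43 +0000",
        "17/May/2015:10:05:03 +0000",
        "not a timestamp",
    ]
})

# utc=True normalizes every row to UTC; errors="coerce" turns bad values into NaT.
sample["timestamp"] = pd.to_datetime(
    sample["timestamp"], format=APACHE_TIME_FORMAT, utc=True, errors="coerce"
)

# Drop rows that failed to parse, then order requests chronologically.
parsed = (
    sample.dropna(subset=["timestamp"])
    .sort_values("timestamp")
    .reset_index(drop=True)
)
print(parsed)
```

The `%b` directive matches English month abbreviations only under an English or C locale, which is worth checking if parsing unexpectedly yields `NaT`.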
