# Web Server Log Analysis - Python Take-Home Assessment

## Overview
This assessment involves analyzing the Calgary HTTP dataset, which contains approximately one year's worth of HTTP requests to the University of Calgary's Computer Science web server. You'll work with real-world web server log data to extract meaningful insights and demonstrate your Python data analysis skills.

## Part 1: Data Loading and Cleaning

### Instructions

* Work in the cells below. You can add as many cells as needed for data loading, cleaning, and exploration.
* Import the required libraries.
* Implement data loading and cleaning: create functions to download, parse, and clean the log data.
* Explore the data: understand its structure and identify any data quality issues.
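
The "parse" step in the instructions above can be sketched as a small standalone function. This is only a minimal sketch using the standard library: the name `parse_log_line` and the sample line are hypothetical, and the pattern assumes Common Log Format fields (host, ident, authuser, timestamp, request, status, bytes).

```python
import re
from datetime import datetime

# Assumed layout: host ident authuser [timestamp] "method path protocol" status bytes
CLF_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+)(?: (\S+))?" (\d{3}) (\S+)'
)


def parse_log_line(line):
    """Return a dict of parsed fields, or None if the line is malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    host, _, _, timestamp, method, path, _protocol, status, size = match.groups()
    try:
        # Drop the timezone offset and parse the local timestamp.
        dt = datetime.strptime(timestamp.split()[0], "%d/%b/%Y:%H:%M:%S")
    except (ValueError, IndexError):
        return None
    return {
        "host": host,
        "timestamp": dt,
        "method": method,
        "filename": path,
        "status": int(status),
        # A '-' in the size field means no byte count was logged; store 0.
        "bytes": int(size) if size.isdigit() else 0,
    }


# Hypothetical sample line, for illustration only.
sample = 'local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150'
record = parse_log_line(sample)
print(record["filename"], record["status"], record["bytes"])
```

Returning `None` for malformed lines lets the caller decide whether to skip them silently or count them as a data quality issue.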

In [72]:
# You can write your code here for data loading, cleaning, and exploration. Add cells as necessary.
import gzip
import re
import urllib.request
from datetime import datetime
import pandas as pd

In [73]:
url = "ftp://ita.ee.lbl.gov/traces/calgary_access_log.gz"
file_path = "calgary_access_log.gz"
urllib.request.urlretrieve(url, file_path)

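# Read the log file; raw_logs stays an empty list if reading fails,
# so later cells do not hit an undefined name
raw_logs = []
try:
    with gzip.open(file_path, 'rt', encoding='latin1') as f:
        raw_logs = f.readlines()
except FileNotFoundError:
    print(f"File '{file_path}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")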

In [68]:
# Compile a regex for the Apache Common Log Format:
# host ident authuser [timestamp] "method filename protocol" status bytes
pattern = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+)? (.*?) (\S+)?" (\d{3}) (\S+)'
)

# Generate clean data
data = []

for line in raw_logs:
    match = pattern.match(line)
    if not match:
        continue
    host, _, _, timestamp, method, filename, protocol, status, bytes_ = match.groups()
    try:
        dt = datetime.strptime(timestamp.split()[0], "%d/%b/%Y:%H:%M:%S")
    except ValueError:
        continue
    data.append({
        'host': host,
        'timestamp': dt,
        'date': dt.strftime('%d-%b-%Y'),
        'hour': dt.hour,
        'method': method,
        'filename': filename,
        'status': int(status),
        'bytes': int(bytes_) if bytes_.isdigit() else 0
    })

# Converting list of dictionaries into DataFrame
df = pd.DataFrame(data)
df.head()


|   | host | timestamp | date | hour | method | filename | status | bytes |
|---|------|-----------|------|------|--------|----------|--------|-------|
| 0 | local | 1994-10-24 13:41:41 | 24-Oct-1994 | 13 | GET | index.html | 200 | 150 |
| 1 | local | 1994-10-24 13:41:41 | 24-Oct-1994 | 13 | GET | 1.gif | 200 | 1210 |
| 2 | local | 1994-10-24 13:43:13 | 24-Oct-1994 | 13 | GET | index.html | 200 | 3185 |
| 3 | local | 1994-10-24 13:43:14 | 24-Oct-1994 | 13 | GET | 2.gif | 200 | 2555 |
| 4 | local | 1994-10-24 13:43:15 | 24-Oct-1994 | 13 | GET | 3.gif | 200 | 36403 |


In [69]:
# Extracting Extension from Filename
df['extension'] = df['filename'].str.extract(r'\.([a-zA-Z0-9]+)$', expand=False).fillna('NA')

# Converting datatype appropriately
df = df.astype({
    'host': 'string',
    'method': 'string',
    'filename': 'string',
    'extension': 'string',
    'date': 'string'
})

# Treat empty filenames as missing values
df['filename'] = df['filename'].replace('', pd.NA)

## ⚠️ IMPORTANT: Template Questions Section
**DO NOT MODIFY THE TEMPLATE BELOW THIS POINT**

The following section contains the assessment questions. You may add cells above this section for data loading, cleaning, and exploration, but do not modify the function signatures or structure of the questions below.

## Part 2: Analysis Questions

### Instructions

* Implement each function according to its docstring specifications
* Use the cleaned data you prepared in Part 1
* Ensure your functions return the exact data types specified
* Test your functions to verify they work correctly
* You may add helper functions, but keep the main function signatures unchanged
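
One way to act on the "exact data types" and "test your functions" points above is a pair of small shape checks. This is a hedged sketch, not part of the assessment template: the helper names `is_datewise_count_dict` and `is_ranked_pairs` are hypothetical, and the checks assume the return shapes described in the Q3 and Q5 docstrings.

```python
from datetime import datetime


def is_datewise_count_dict(result):
    """Check that result looks like {'01-Jul-1995': 123, ...}."""
    if not isinstance(result, dict):
        return False
    for key, value in result.items():
        if not isinstance(key, str):
            return False
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return False
        try:
            parsed = datetime.strptime(key, "%d-%b-%Y")
        except ValueError:
            return False
        # strptime also accepts '1-Jul-1995'; round-trip to enforce zero padding.
        if parsed.strftime("%d-%b-%Y") != key:
            return False
    return True


def is_ranked_pairs(result, limit):
    """Check for at most `limit` (str, int) tuples in descending count order."""
    if not isinstance(result, list) or len(result) > limit:
        return False
    for item in result:
        if not (isinstance(item, tuple) and len(item) == 2):
            return False
        if not isinstance(item[0], str) or not isinstance(item[1], int):
            return False
    counts = [count for _, count in result]
    return counts == sorted(counts, reverse=True)


print(is_datewise_count_dict({'01-Jul-1995': 123, '02-Jul-1995': 150}))
print(is_ranked_pairs([('index.html', 200), ('1.gif', 10)], 15))
```

Because the checks use `isinstance(value, int)`, a NumPy integer such as `numpy.int64` would fail them, which is one way to notice that a result has not been converted to built-in Python types.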

### Q1: Count of total log records

In [74]:
def total_log_records() -> int:
    """
    Q1: Count of total log records.

    Objective:
        Determine the total number of HTTP log entries in the dataset.
        Each line in the log file represents one HTTP request.

    Returns:
        int: Total number of log entries.
    """

    # Count parsed rows; lines that failed to parse in Part 1 are not included
    total_logs = len(df)
    return total_logs


answer1 = total_log_records()
print("Answer 1:")
print(answer1)

Answer 1:
724039


### Q2: Count of unique hosts

In [75]:
def unique_host_count() -> int:
    """
    Q2: Count of unique hosts.

    Objective:
        Determine how many distinct hosts accessed the server.

    Returns:
        int: Number of unique hosts.
    """

    # Count distinct values in the host column
    unique_hosts = df['host'].nunique()
    return unique_hosts


answer2 = unique_host_count()
print("Answer 2:")
print(answer2)

Answer 2:
2


### Q3: Date-wise unique filename counts

In [76]:
def datewise_unique_filename_counts() -> dict[str, int]:
    """
    Q3: Date-wise unique filename counts.

    Objective:
        For each date, count the number of unique filenames requested from the server.
        The date should be in 'dd-MMM-yyyy' format (e.g., '01-Jul-1995').

    Returns:
        dict: A dictionary mapping each date to its count of unique filenames.
              Example: {'01-Jul-1995': 123, '02-Jul-1995': 150}
    """

    # Group by the preformatted 'dd-MMM-yyyy' date string and count distinct filenames
    datewise_unique = df.groupby("date")["filename"].nunique()
    return datewise_unique.to_dict()


answer3 = datewise_unique_filename_counts()
print("Answer 3:")
print(answer3)

Answer 3:
{'01-Apr-1995': 436, '01-Aug-1995': 674, '01-Dec-1994': 271, '01-Feb-1995': 622, '01-Jan-1995': 88, '01-Jul-1995': 387, '01-Jun-1995': 590, '01-Mar-1995': 582, '01-May-1995': 467, '01-Nov-1994': 412, '01-Oct-1995': 554, '01-Sep-1995': 330, '02-Apr-1995': 467, '02-Aug-1995': 857, '02-Dec-1994': 325, '02-Feb-1995': 524, '02-Jan-1995': 141, '02-Jul-1995': 399, '02-Jun-1995': 515, '02-Mar-1995': 600, '02-May-1995': 701, '02-Nov-1994': 427, '02-Oct-1995': 872, '02-Sep-1995': 353, '03-Apr-1995': 796, '03-Aug-1995': 585, '03-Dec-1994': 189, '03-Feb-1995': 570, '03-Jan-1995': 311, '03-Jul-1995': 439, '03-Jun-1995': 398, '03-Mar-1995': 505, '03-May-1995': 589, '03-Nov-1994': 461, '03-Oct-1995': 852, '03-Sep-1995': 214, '04-Apr-1995': 822, '04-Aug-1995': 717, '04-Dec-1994': 212, '04-Feb-1995': 562, '04-Jan-1995': 324, '04-Jul-1995': 613, '04-Jun-1995': 353, '04-Mar-1995': 403, '04-May-1995': 687, '04-Nov-1994': 404, '04-Oct-1995': 915, '04-Sep-1995': 342, '05-Apr-1995': 891, '05-Aug-19

### Q4: Number of 404 response codes

In [77]:
def count_404_errors() -> int:
    """
    Q4: Number of 404 response codes.

    Objective:
        Count how many times the HTTP 404 Not Found status appears in the logs.

    Returns:
        int: Number of 404 errors.
    """

    # Count rows whose status code is 404
    count_404 = len(df[df['status'] == 404])
    return count_404


answer4 = count_404_errors()
print("Answer 4:")
print(answer4)

Answer 4:
23531


### Q5: Top 15 filenames with 404 responses

In [None]:
def top_15_filenames_with_404() -> list[tuple[str, int]]:
    """
    Q5: Top 15 filenames with 404 responses.

    Objective:
        Identify which requested URLs most frequently resulted in a 404 error.
        Return the top 15 filenames sorted by frequency.

    Returns:
        list: A list of tuples (filename, count), sorted by count in descending order.
              Example: [('index.html', 200), ...]
    """


    # Filter to 404 responses, count requests per filename, keep the 15 largest
    top_15_filename = (
        df[df['status'] == 404]
        .groupby('filename')
        .size()
        .reset_index(name='count')
        .sort_values('count', ascending=False)
        .head(15)
    )

    return list(zip(top_15_filename['filename'], top_15_filename['count'].astype(int)))


answer5 = top_15_filenames_with_404()
print("Answer 5:")
print(answer5)

Answer 5:
[('index.html', 4699), ('4115.html', 900), ('1611.html', 649), ('5698.xbm', 585), ('710.txt', 408), ('2002.html', 258), ('2177.gif', 193), ('10695.ps', 161), ('6555.html', 153), ('487.gif', 152), ('151.html', 149), ('40.html', 148), ('3414.gif', 148), ('488.gif', 148), ('9678.gif', 142)]


### Q6: Top 15 file extensions with 404 responses

In [60]:
def top_15_ext_with_404() -> list[tuple[str, int]]:
    """
    Q6: Top 15 file extensions with 404 responses.

    Objective:
        Find which file extensions generated the most 404 errors.
        Return the top 15 sorted by number of 404s.

    Returns:
        list: A list of tuples (extension, count), sorted by count in descending order.
              Example: [('html', 45), ...]
    """

    # Filter to 404 responses, excluding filenames with no extension ('NA'),
    # then count requests per extension and keep the 15 largest
    top_15_extensions = (
        df[(df['status'] == 404) & (df['extension'] != 'NA')]
        .groupby('extension')
        .size()
        .reset_index(name='count')
        .sort_values('count', ascending=False)
        .head(15)
    )

    return list(zip(top_15_extensions['extension'], top_15_extensions['count'].astype(int)))


answer6 = top_15_ext_with_404()
print("Answer 6:")
print(answer6)

Answer 6:
[('html', 12149), ('gif', 7202), ('xbm', 824), ('ps', 757), ('jpg', 520), ('txt', 496), ('GIF', 135), ('htm', 108), ('cgi', 77), ('com', 45), ('Z', 41), ('dvi', 40), ('ca', 37), ('hmtl', 30), ('util', 29)]


### Q7: Total bandwidth transferred per day for the month of July 1995

In [78]:
def total_bandwidth_per_day() -> dict[str, int]:
    """
    Q7: Total bandwidth transferred per day for the month of July 1995.

    Objective:
        Sum the number of bytes transferred per day.
        Skip entries where the byte field is missing or '-'.

    Returns:
        dict: A dictionary mapping each date to total bytes transferred.
              Example: {'01-Jul-1995': 123456789, ...}
    """

    # Restrict to July 1995, then sum bytes per day.
    # Entries with a missing or '-' byte field were stored as 0 in Part 1,
    # so they do not contribute to the totals.
    july_df = df[
        (df['timestamp'].dt.month == 7) &
        (df['timestamp'].dt.year == 1995)
    ]
    total_bandwidth = july_df.groupby('date')['bytes'].sum()
    return total_bandwidth.to_dict()


answer7 = total_bandwidth_per_day()
print("Answer 7:")
print(answer7)

Answer 7:
{'01-Jul-1995': 11333976, '02-Jul-1995': 8656012, '03-Jul-1995': 13596612, '04-Jul-1995': 26573988, '05-Jul-1995': 19541225, '06-Jul-1995': 19755015, '07-Jul-1995': 9427822, '08-Jul-1995': 5403491, '09-Jul-1995': 4660556, '10-Jul-1995': 14916848, '11-Jul-1995': 22503471, '12-Jul-1995': 17367065, '13-Jul-1995': 15988328, '14-Jul-1995': 19186430, '15-Jul-1995': 15773233, '16-Jul-1995': 9005564, '17-Jul-1995': 19601338, '18-Jul-1995': 17098855, '19-Jul-1995': 17851725, '20-Jul-1995': 20751717, '21-Jul-1995': 25455607, '22-Jul-1995': 8066660, '23-Jul-1995': 9593870, '24-Jul-1995': 22308265, '25-Jul-1995': 24550821, '26-Jul-1995': 24638042, '27-Jul-1995': 25969995, '28-Jul-1995': 36458881, '29-Jul-1995': 11696365, '30-Jul-1995': 23189598, '31-Jul-1995': 30729809}
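
The "skip entries where the byte field is missing or '-'" rule from the Q7 docstring can also be illustrated without pandas. The following is a minimal sketch under stated assumptions: the function name `bandwidth_per_day` and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import datetime


def bandwidth_per_day(records, year, month):
    """Sum bytes per 'dd-Mon-yyyy' date for one month, skipping '-' or missing sizes."""
    totals = defaultdict(int)
    for timestamp, size in records:
        if timestamp.year != year or timestamp.month != month:
            continue
        if size is None or not str(size).isdigit():
            continue  # byte field missing or '-': skip the entry
        totals[timestamp.strftime("%d-%b-%Y")] += int(size)
    return dict(totals)


# Hypothetical records: (timestamp, raw byte field)
records = [
    (datetime(1995, 7, 1, 0, 0, 1), "150"),
    (datetime(1995, 7, 1, 9, 30, 0), "-"),
    (datetime(1995, 7, 2, 12, 0, 0), "1210"),
    (datetime(1995, 8, 1, 0, 0, 0), "999"),
]
print(bandwidth_per_day(records, 1995, 7))
# → {'01-Jul-1995': 150, '02-Jul-1995': 1210}
```

Skipping such entries and storing them as 0 give the same daily sums. The two approaches differ only for a day whose entries all lack a byte count: that day is absent from this sketch's result, whereas it appears with a total of 0 when the field is stored as 0.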


### Q8: Hourly request distribution

In [80]:
def hourly_request_distribution() -> dict[int, int]:
    """
    Q8: Hourly request distribution.

    Objective:
        Count the number of requests made during each hour (00 to 23).
        Useful for understanding traffic peaks.

    Returns:
        dict: A dictionary mapping hour (int) to request count.
              Example: {0: 120, 1: 90, ..., 23: 80}
    """

    # Group requests by the hour of their timestamp (0-23) and count them
    hourly_distribution = df.groupby(df['timestamp'].dt.hour).size()
    return hourly_distribution.to_dict()


answer8 = hourly_request_distribution()
print("Answer 8:")
print(answer8)

Answer 8:
{0: 18701, 1: 14372, 2: 12681, 3: 10895, 4: 9964, 5: 10787, 6: 13047, 7: 16659, 8: 26554, 9: 33968, 10: 43348, 11: 47570, 12: 46776, 13: 51405, 14: 54483, 15: 50269, 16: 51138, 17: 45047, 18: 33144, 19: 30546, 20: 29675, 21: 27392, 22: 23812, 23: 21806}


### Q9: Top 10 most requested filenames

In [82]:
def top_10_most_requested_filenames() -> list[tuple[str, int]]:
    """
    Q9: Top 10 most requested filenames.

    Objective:
        Identify the most commonly requested URLs (irrespective of status code).

    Returns:
        list: A list of tuples (filename, count), sorted by count in descending order.
                Example: [('index.html', 500), ...]
    """

    # Count requests per filename across all status codes, keep the 10 largest
    top_10_filenames = (
        df.groupby('filename')
        .size()
        .reset_index(name='count')
        .sort_values('count', ascending=False)
        .head(10)
    )
    return list(zip(top_10_filenames['filename'], top_10_filenames['count'].astype(int)))


answer9 = top_10_most_requested_filenames()
print("Answer 9:")
print(answer9)

Answer 9:
[('index.html', 139961), ('3.gif', 24006), ('2.gif', 23606), ('4.gif', 8018), ('244.gif', 5148), ('5.html', 5009), ('4097.gif', 4874), ('8870.jpg', 4492), ('6733.gif', 4278), ('8472.gif', 3843)]


### Q10: HTTP response code distribution

In [83]:
def response_code_distribution() -> dict[int, int]:
    """
    Q10: HTTP response code distribution.

    Objective:
        Count how often each HTTP status code appears in the logs.

    Returns:
        dict: A dictionary mapping HTTP status codes (as int) to their frequency.
              Example: {200: 150000, 404: 3000}
    """

    # Count how many rows carry each status code
    code_distribution = df.groupby('status').size()
    return code_distribution.to_dict()


answer10 = response_code_distribution()
print("Answer 10:")
print(answer10)

Answer 10:
{200: 567554, 302: 30275, 304: 97792, 400: 13, 401: 46, 403: 4743, 404: 23531, 500: 42, 501: 43}
