# Web Server Log Analysis

This notebook analyzes web server logs in Common or Combined Log Format.

## Features
1. Parse Apache/Nginx logs (Common and Combined formats)
2. Identify top 10 most frequent IP addresses with visualization
3. Statistical analysis of HTTP request methods
4. Detection of invalid/junk HTTP methods with source IPs
5. Top 10 User Agent strings analysis (if available)

## Log Formats Supported

**Common Log Format:**
```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```

**Combined Log Format:**
```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
```

In [None]:
# Import required libraries
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings

warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully")

## Configuration

Set the path to your web server log file below:

In [None]:
# Configuration
LOG_FILE_PATH = '/path/to/your/access.log'  # Update this path

# Valid HTTP methods (RFC 7231, RFC 5789)
VALID_HTTP_METHODS = [
    'GET', 'POST', 'PUT', 'DELETE', 'HEAD', 
    'OPTIONS', 'PATCH', 'CONNECT', 'TRACE'
]

print(f"Configuration set. Will analyze: {LOG_FILE_PATH}")
print(f"Valid HTTP methods: {', '.join(VALID_HTTP_METHODS)}")

## Step 1: Parse Web Server Logs

Parse each line with a regular expression that handles both Common and Combined formats.

In [None]:
# Compiled once at import time rather than on every call.
# The trailing referrer/user-agent group is optional, so this single pattern
# matches both Common and Combined Log Format lines; a separate Common-format
# fallback could never match a line that this pattern rejects.
# Matches: IP ident user [timestamp] "METHOD /path HTTP/version" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) '                 # client address (IPv4, IPv6, or hostname)
    r'\S+ '                          # identd field, usually "-"
    r'(?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) '
    r'(?P<path>\S+) '
    r'(?P<protocol>\S+)" '
    r'(?P<status>\d+) '
    r'(?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)")?'
)


def parse_log_line(line):
    """
    Parse a single log line in Common or Combined format.
    
    Returns a dictionary with parsed fields or None if parsing fails.
    For Common Log Format lines, 'referrer' and 'user_agent' are None.
    """
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def load_and_parse_logs(file_path):
    """
    Load and parse all log entries from the specified file.
    
    Returns a pandas DataFrame with parsed log entries.
    """
    parsed_logs = []
    failed_lines = 0
    
    try:
        # errors='replace' keeps undecodable bytes visible as U+FFFD instead of silently dropping them
        with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                    
                parsed = parse_log_line(line)
                if parsed:
                    parsed_logs.append(parsed)
                else:
                    failed_lines += 1
                    if failed_lines <= 5:  # Show first 5 failed lines
                        print(f"Warning: Failed to parse line {line_num}: {line[:100]}...")
        
        df = pd.DataFrame(parsed_logs)
        
        print(f"\n{'='*60}")
        print(f"Successfully parsed {len(parsed_logs)} log entries")
        print(f"Failed to parse {failed_lines} lines")
        print(f"{'='*60}\n")
        
        return df
    
    except FileNotFoundError:
        print(f"Error: File not found - {file_path}")
        return pd.DataFrame()
    except Exception as e:
        print(f"Error reading file: {e}")
        return pd.DataFrame()


# Load and parse the logs
df_logs = load_and_parse_logs(LOG_FILE_PATH)

if not df_logs.empty:
    print("\nDataFrame Info:")
    print(df_logs.info())
    print("\nFirst 5 entries:")
    display(df_logs.head())
else:
    print("No data loaded. Please check the LOG_FILE_PATH and ensure the file exists.")
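
As a quick check of the parsing approach, the two sample lines from the "Log Formats Supported" section can be run through a single pattern whose referrer and user-agent group is optional. This is a minimal sketch: `SKETCH_PATTERN` is a hypothetical name used only here, and the pattern is an illustration rather than a complete log grammar.

```python
import re

# Hypothetical pattern name; the optional tail lets Common-format lines match too.
SKETCH_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d+) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)

common_line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
               '"GET /apache_pb.gif HTTP/1.0" 200 2326')
combined_line = (common_line +
                 ' "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

common = SKETCH_PATTERN.match(common_line).groupdict()
combined = SKETCH_PATTERN.match(combined_line).groupdict()

# The Common-format line has no user agent, so that group is None.
print(common['method'], common['status'], common['user_agent'])
print(combined['method'], combined['status'], combined['user_agent'])
```

Request lines that do not consist of exactly three space-separated tokens (for example, raw binary probes) will not match this kind of pattern and end up in the failed-line count instead of the method statistics.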

## Step 2: Top 10 Most Frequent IP Addresses

Analyze and visualize the most active IP addresses accessing the server.

In [None]:
if not df_logs.empty:
    # Count IP address frequencies
    ip_counts = df_logs['ip'].value_counts().head(10)
    
    print("Top 10 Most Frequent IP Addresses:")
    print("="*60)
    for idx, (ip, count) in enumerate(ip_counts.items(), 1):
        percentage = (count / len(df_logs)) * 100
        print(f"{idx:2d}. {ip:15s} - {count:6d} requests ({percentage:5.2f}%)")
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Bar chart
    colors = sns.color_palette('viridis', len(ip_counts))
    ip_counts.plot(kind='barh', ax=ax1, color=colors)
    ax1.set_xlabel('Number of Requests', fontsize=12, fontweight='bold')
    ax1.set_ylabel('IP Address', fontsize=12, fontweight='bold')
    ax1.set_title('Top 10 IP Addresses by Request Count', fontsize=14, fontweight='bold')
    ax1.invert_yaxis()
    
    # Add value labels on bars
    for i, v in enumerate(ip_counts.values):
        ax1.text(v + (max(ip_counts.values) * 0.01), i, str(v), 
                va='center', fontsize=10, fontweight='bold')
    
    # Pie chart
    ax2.pie(ip_counts.values, labels=ip_counts.index, autopct='%1.1f%%',
            startangle=90, colors=colors)
    ax2.set_title('Top 10 IP Addresses - Distribution', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Additional statistics
    total_unique_ips = df_logs['ip'].nunique()
    print(f"\nTotal unique IP addresses: {total_unique_ips}")
    print(f"Top 10 IPs account for {(ip_counts.sum() / len(df_logs) * 100):.2f}% of all requests")
else:
    print("No data available for analysis.")
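
The same top-N counting can be sketched without pandas, which may help when checking the numbers on a small extract. The sample addresses below are hypothetical (documentation-range IPs), not taken from a real log.

```python
from collections import Counter

# Hypothetical sample data, standing in for the 'ip' column.
sample_ips = ['192.0.2.1', '192.0.2.1', '192.0.2.1',
              '198.51.100.7', '198.51.100.7', '203.0.113.9']

ip_counter = Counter(sample_ips)
total = len(sample_ips)

# most_common(n) returns (value, count) pairs sorted by descending count.
top = ip_counter.most_common(2)
for rank, (ip, count) in enumerate(top, 1):
    print(f"{rank}. {ip:<15} {count} requests ({count / total:.1%})")

# Share of all requests covered by the listed addresses.
top_share = sum(count for _, count in top) / total
print(f"Top {len(top)} IPs account for {top_share:.1%} of requests")
```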

## Step 3: HTTP Request Method Analysis

Statistical breakdown of HTTP methods used in requests.

In [None]:
if not df_logs.empty:
    # Count HTTP methods
    method_counts = df_logs['method'].value_counts()
    
    print("HTTP Request Method Statistics:")
    print("="*70)
    print(f"{'Method':<15} {'Count':>10} {'Percentage':>12} {'Valid':>10}")
    print("="*70)
    
    for method, count in method_counts.items():
        percentage = (count / len(df_logs)) * 100
        is_valid = 'Yes' if method.upper() in VALID_HTTP_METHODS else 'No'
        print(f"{method:<15} {count:>10} {percentage:>11.2f}% {is_valid:>10}")
    
    print("="*70)
    print(f"{'Total Requests':<15} {len(df_logs):>10}")
    print(f"{'Unique Methods':<15} {len(method_counts):>10}")
    
    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Bar chart for all methods
    colors = ['green' if m.upper() in VALID_HTTP_METHODS else 'red' 
              for m in method_counts.index]
    method_counts.plot(kind='bar', ax=ax1, color=colors, edgecolor='black', linewidth=1.2)
    ax1.set_xlabel('HTTP Method', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Number of Requests', fontsize=12, fontweight='bold')
    ax1.set_title('HTTP Method Distribution (Green=Valid, Red=Invalid)', 
                  fontsize=14, fontweight='bold')
    ax1.tick_params(axis='x', rotation=45)
    ax1.grid(axis='y', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(method_counts.values):
        ax1.text(i, v + (max(method_counts.values) * 0.01), str(v),
                ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    # Pie chart for valid vs invalid
    valid_count = df_logs[df_logs['method'].str.upper().isin(VALID_HTTP_METHODS)].shape[0]
    invalid_count = len(df_logs) - valid_count
    
    ax2.pie([valid_count, invalid_count], 
            labels=['Valid Methods', 'Invalid/Junk Methods'],
            autopct='%1.1f%%',
            colors=['green', 'red'],
            startangle=90,
            explode=(0, 0.1))
    ax2.set_title('Valid vs Invalid HTTP Methods', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Statistical summary
    print(f"\nStatistical Summary:")
    print(f"Valid HTTP method requests: {valid_count} ({(valid_count/len(df_logs)*100):.2f}%)")
    print(f"Invalid HTTP method requests: {invalid_count} ({(invalid_count/len(df_logs)*100):.2f}%)")
    
    # Most common valid method
    valid_methods = df_logs[df_logs['method'].str.upper().isin(VALID_HTTP_METHODS)]
    if not valid_methods.empty:
        # Compute the counts once and read both the name and the count from them
        valid_method_counts = valid_methods['method'].value_counts()
        most_common_name = valid_method_counts.index[0]
        most_common = valid_method_counts.iloc[0]
        print(f"Most common valid method: {most_common_name} ({most_common} requests)")
else:
    print("No data available for analysis.")
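
The cell above uppercases each method before checking it against `VALID_HTTP_METHODS`, so `get` counts as valid. RFC 7231 treats the method token as case-sensitive, so a stricter triage may be useful in some investigations. A minimal sketch, assuming a hypothetical helper `classify_method` and illustrative labels:

```python
VALID = {'GET', 'POST', 'PUT', 'DELETE', 'HEAD',
         'OPTIONS', 'PATCH', 'CONNECT', 'TRACE'}


def classify_method(method):
    """Return 'valid', 'case-variant', or 'invalid' (labels are illustrative)."""
    if method in VALID:
        return 'valid'
    if method.upper() in VALID:
        # Right letters, wrong case: often a sloppy client rather than an attack.
        return 'case-variant'
    return 'invalid'


# The last sample imitates how servers log escaped binary bytes as literal text.
for m in ['GET', 'get', 'PROPFIND', '\\x16\\x03\\x01']:
    print(f"{m!r}: {classify_method(m)}")
```

Separating case variants from genuinely unknown tokens keeps misconfigured clients from being lumped together with binary probes.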

## Step 4: Identify Junk/Invalid HTTP Methods

Detect non-standard or malicious HTTP methods and their source IPs.

In [None]:
if not df_logs.empty:
    # Filter for invalid methods
    invalid_methods = df_logs[~df_logs['method'].str.upper().isin(VALID_HTTP_METHODS)]
    
    if not invalid_methods.empty:
        print(f"Found {len(invalid_methods)} requests with invalid/junk HTTP methods")
        print("="*80)
        
        # Group by method and IP
        junk_analysis = invalid_methods.groupby(['method', 'ip']).size().reset_index(name='count')
        junk_analysis = junk_analysis.sort_values('count', ascending=False)
        
        print(f"\n{'Method':<20} {'IP Address':<20} {'Count':>10}")
        print("="*80)
        
        for _, row in junk_analysis.iterrows():
            print(f"{row['method']:<20} {row['ip']:<20} {row['count']:>10}")
        
        # Summary by method
        print("\n" + "="*80)
        print("\nInvalid Methods Summary:")
        print("="*80)
        
        # Counts are integers, so no rounding is needed
        method_summary = invalid_methods.groupby('method')['ip'].agg(['count', 'nunique'])
        method_summary.columns = ['Total Requests', 'Unique IPs']
        method_summary = method_summary.sort_values('Total Requests', ascending=False)
        
        print(method_summary)
        
        # Show sample invalid requests
        print("\n" + "="*80)
        print("Sample Invalid Requests (first 10):")
        print("="*80)
        
        sample_columns = ['ip', 'method', 'path', 'status', 'timestamp']
        display(invalid_methods[sample_columns].head(10))
        
        # Visualization
        if len(junk_analysis) > 0:
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
            
            # Top invalid methods
            top_junk_methods = invalid_methods['method'].value_counts().head(10)
            top_junk_methods.plot(kind='barh', ax=ax1, color='crimson', edgecolor='black')
            ax1.set_xlabel('Number of Requests', fontsize=12, fontweight='bold')
            ax1.set_ylabel('Invalid Method', fontsize=12, fontweight='bold')
            ax1.set_title('Top 10 Invalid/Junk HTTP Methods', fontsize=14, fontweight='bold')
            ax1.invert_yaxis()
            
            # Top IPs sending invalid methods
            top_junk_ips = invalid_methods['ip'].value_counts().head(10)
            top_junk_ips.plot(kind='barh', ax=ax2, color='orange', edgecolor='black')
            ax2.set_xlabel('Number of Invalid Requests', fontsize=12, fontweight='bold')
            ax2.set_ylabel('IP Address', fontsize=12, fontweight='bold')
            ax2.set_title('Top 10 IPs Sending Invalid Methods', fontsize=14, fontweight='bold')
            ax2.invert_yaxis()
            
            plt.tight_layout()
            plt.show()
        
        # Export to CSV for further investigation
        output_file = 'invalid_http_methods.csv'
        junk_analysis.to_csv(output_file, index=False)
        print(f"\nInvalid methods data exported to: {output_file}")
        
    else:
        print("No invalid/junk HTTP methods found in the logs.")
        print("All requests use standard HTTP methods.")
else:
    print("No data available for analysis.")
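
The method-by-IP grouping can also be sketched with the standard library alone. The records below are hypothetical `(method, ip)` pairs, and `VALID` mirrors the method list from the configuration cell.

```python
from collections import Counter

VALID = {'GET', 'POST', 'PUT', 'DELETE', 'HEAD',
         'OPTIONS', 'PATCH', 'CONNECT', 'TRACE'}

# Hypothetical (method, ip) records standing in for parsed log rows.
records = [
    ('GET', '192.0.2.1'),
    ('PROPFIND', '203.0.113.9'),
    ('PROPFIND', '203.0.113.9'),
    ('FOO', '198.51.100.7'),
    ('PROPFIND', '198.51.100.7'),
]

# Count each (method, ip) pair whose method is outside the valid set.
junk_pairs = Counter((m, ip) for m, ip in records if m.upper() not in VALID)

# Collect the distinct source IPs seen for each junk method.
ips_per_method = {}
for method, ip in junk_pairs:
    ips_per_method.setdefault(method, set()).add(ip)

for (method, ip), count in junk_pairs.most_common():
    print(f"{method:<10} {ip:<15} {count}")
```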

## Step 5: User Agent Analysis

Analyze User Agent strings to identify common browsers, bots, and suspicious agents.

In [None]:
if not df_logs.empty:
    # Check if user agent data is available
    has_user_agent = 'user_agent' in df_logs.columns and df_logs['user_agent'].notna().any()
    
    if has_user_agent:
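        # Filter out null/empty user agents.
        # .copy() gives an independent DataFrame, so adding the 'ua_category'
        # column later does not write to a slice of df_logs.
        ua_data = df_logs[df_logs['user_agent'].notna() & (df_logs['user_agent'] != '')].copy()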
        
        if len(ua_data) > 0:
            print(f"Found {len(ua_data)} requests with User Agent strings")
            print(f"Requests without User Agent: {len(df_logs) - len(ua_data)}")
            print("="*80)
            
            # Top 10 User Agents
            top_user_agents = ua_data['user_agent'].value_counts().head(10)
            
            print("\nTop 10 Most Common User Agent Strings:")
            print("="*80)
            
            for idx, (ua, count) in enumerate(top_user_agents.items(), 1):
                percentage = (count / len(ua_data)) * 100
                # Truncate long UA strings for display
                ua_display = ua[:70] + '...' if len(ua) > 70 else ua
                print(f"\n{idx}. {ua_display}")
                print(f"   Count: {count:,} ({percentage:.2f}%)")
            
            # Categorize User Agents
            def categorize_user_agent(ua_string):
                """Categorize user agent into browser, bot, or other."""
                if pd.isna(ua_string) or ua_string == '':
                    return 'Unknown'
                
                ua_lower = ua_string.lower()
                
                # Bots and crawlers
                bot_keywords = ['bot', 'crawler', 'spider', 'scraper', 'curl', 'wget', 
                               'python', 'java', 'perl', 'ruby', 'scan']
                if any(keyword in ua_lower for keyword in bot_keywords):
                    return 'Bot/Crawler'
                
                # Browsers. Edge and Opera are checked first because their
                # UA strings also contain "Chrome" and "Safari" tokens.
                if 'edge' in ua_lower or 'edg/' in ua_lower:
                    return 'Edge'
                elif 'opera' in ua_lower or 'opr/' in ua_lower:
                    return 'Opera'
                elif 'chrome' in ua_lower or 'chromium' in ua_lower:
                    return 'Chrome'
                elif 'firefox' in ua_lower:
                    return 'Firefox'
                elif 'safari' in ua_lower:
                    return 'Safari'
                elif 'msie' in ua_lower or 'trident' in ua_lower:
                    return 'Internet Explorer'
                
                return 'Other'
            
            # Apply categorization
            ua_data['ua_category'] = ua_data['user_agent'].apply(categorize_user_agent)
            
            print("\n" + "="*80)
            print("User Agent Categories:")
            print("="*80)
            
            category_counts = ua_data['ua_category'].value_counts()
            for category, count in category_counts.items():
                percentage = (count / len(ua_data)) * 100
                print(f"{category:<20} {count:>10,} ({percentage:>6.2f}%)")
            
            # Visualizations
            fig = plt.figure(figsize=(16, 10))
            gs = fig.add_gridspec(2, 2, hspace=0.3, wspace=0.3)
            
            # Top 10 User Agents bar chart
            ax1 = fig.add_subplot(gs[0, :])
            top_10_for_plot = top_user_agents.head(10)
            # Truncate labels for visualization
            labels = [ua[:50] + '...' if len(ua) > 50 else ua for ua in top_10_for_plot.index]
            ax1.barh(range(len(top_10_for_plot)), top_10_for_plot.values, color='teal', edgecolor='black')
            ax1.set_yticks(range(len(top_10_for_plot)))
            ax1.set_yticklabels(labels, fontsize=9)
            ax1.set_xlabel('Number of Requests', fontsize=12, fontweight='bold')
            ax1.set_title('Top 10 User Agent Strings', fontsize=14, fontweight='bold')
            ax1.invert_yaxis()
            
            # Add value labels
            for i, v in enumerate(top_10_for_plot.values):
                ax1.text(v + (max(top_10_for_plot.values) * 0.01), i, f'{v:,}',
                        va='center', fontsize=10, fontweight='bold')
            
            # Category pie chart
            ax2 = fig.add_subplot(gs[1, 0])
            colors_cat = sns.color_palette('Set2', len(category_counts))
            ax2.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%',
                   startangle=90, colors=colors_cat)
            ax2.set_title('User Agent Categories', fontsize=14, fontweight='bold')
            
            # Requests with/without UA
            ax3 = fig.add_subplot(gs[1, 1])
            ua_presence = [
                len(df_logs[df_logs['user_agent'].notna() & (df_logs['user_agent'] != '')]),
                len(df_logs[df_logs['user_agent'].isna() | (df_logs['user_agent'] == '')])
            ]
            ax3.pie(ua_presence, labels=['With User Agent', 'Without User Agent'],
                   autopct='%1.1f%%', startangle=90, colors=['lightgreen', 'lightcoral'])
            ax3.set_title('User Agent Presence', fontsize=14, fontweight='bold')
            
            plt.show()
            
            # Detect suspicious User Agents
            print("\n" + "="*80)
            print("Potentially Suspicious User Agents:")
            print("="*80)
            
            suspicious_keywords = ['scan', 'exploit', 'attack', 'hack', 'inject', 'sqlmap', 
                                  'nikto', 'nmap', 'masscan', 'nessus', 'metasploit']
            
            suspicious_ua = ua_data[ua_data['user_agent'].str.lower().str.contains(
                '|'.join(suspicious_keywords), na=False, regex=True
            )]
            
            if len(suspicious_ua) > 0:
                print(f"Found {len(suspicious_ua)} requests with suspicious User Agents:\n")
                
                susp_summary = suspicious_ua.groupby(['user_agent', 'ip']).size().reset_index(name='count')
                susp_summary = susp_summary.sort_values('count', ascending=False)
                
                for _, row in susp_summary.head(20).iterrows():
                    ua_display = row['user_agent'][:70] + '...' if len(row['user_agent']) > 70 else row['user_agent']
                    print(f"UA: {ua_display}")
                    print(f"IP: {row['ip']}, Count: {row['count']}\n")
                
                # Export suspicious UAs
                susp_output = 'suspicious_user_agents.csv'
                susp_summary.to_csv(susp_output, index=False)
                print(f"Suspicious User Agents exported to: {susp_output}")
            else:
                print("No obviously suspicious User Agent strings detected.")
            
            # Export top User Agents
            ua_output = 'top_user_agents.csv'
            top_user_agents.to_csv(ua_output, header=['Count'])
            print(f"\nTop User Agents exported to: {ua_output}")
            
        else:
            print("User Agent field exists but contains no data.")
    else:
        print("User Agent information not available in this log format.")
        print("The logs appear to be in Common Log Format (without User Agent field).")
        print("For User Agent analysis, use Combined Log Format.")
else:
    print("No data available for analysis.")
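
Chromium-based browsers such as Edge and Opera include `Chrome` and `Safari` tokens in their User Agent strings, so the order of the substring checks decides which label a request gets. A minimal sketch of an order-sensitive classifier, assuming a hypothetical helper `browser_family` and representative UA strings:

```python
def browser_family(ua):
    """Classify a UA string; the order of the checks is the point of this sketch."""
    ua_lower = ua.lower()
    # Most specific tokens first: Edge and Opera also advertise Chrome and Safari.
    if 'edg/' in ua_lower or 'edge/' in ua_lower:
        return 'Edge'
    if 'opr/' in ua_lower or 'opera' in ua_lower:
        return 'Opera'
    if 'chrome' in ua_lower or 'chromium' in ua_lower:
        return 'Chrome'
    if 'firefox' in ua_lower:
        return 'Firefox'
    if 'safari' in ua_lower:
        return 'Safari'
    return 'Other'


edge_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0')
chrome_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
firefox_ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'

for ua in (edge_ua, chrome_ua, firefox_ua):
    print(browser_family(ua))
```

Substring matching of this kind is only a rough heuristic, since clients can send any User Agent string they like.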

## Summary Report

Generate a comprehensive summary of the log analysis.

In [None]:
if not df_logs.empty:
    print("="*80)
    print(" " * 25 + "WEB SERVER LOG ANALYSIS SUMMARY")
    print("="*80)
    
    print(f"\nLog File: {LOG_FILE_PATH}")
    print(f"Total Requests Analyzed: {len(df_logs):,}")
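    # Parse timestamps before taking min/max: the raw strings start with the
    # day of month, so comparing them as text does not give chronological order.
    parsed_ts = pd.to_datetime(df_logs['timestamp'], format='%d/%b/%Y:%H:%M:%S %z',
                               errors='coerce', utc=True)
    print(f"Date Range (UTC): {parsed_ts.min()} to {parsed_ts.max()}")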
    
    print("\n" + "-"*80)
    print("IP ADDRESS STATISTICS")
    print("-"*80)
    print(f"Total Unique IP Addresses: {df_logs['ip'].nunique():,}")
    print(f"Most Active IP: {df_logs['ip'].value_counts().index[0]} ({df_logs['ip'].value_counts().iloc[0]:,} requests)")
    
    print("\n" + "-"*80)
    print("HTTP METHOD STATISTICS")
    print("-"*80)
    method_stats = df_logs['method'].value_counts()
    print(f"Total Unique Methods: {len(method_stats)}")
    print(f"Most Common Method: {method_stats.index[0]} ({method_stats.iloc[0]:,} requests)")
    
    valid_count = df_logs[df_logs['method'].str.upper().isin(VALID_HTTP_METHODS)].shape[0]
    invalid_count = len(df_logs) - valid_count
    print(f"Valid HTTP Methods: {valid_count:,} ({(valid_count/len(df_logs)*100):.2f}%)")
    print(f"Invalid HTTP Methods: {invalid_count:,} ({(invalid_count/len(df_logs)*100):.2f}%)")
    
    print("\n" + "-"*80)
    print("HTTP STATUS CODE STATISTICS")
    print("-"*80)
    status_stats = df_logs['status'].value_counts().head(5)
    for status, count in status_stats.items():
        percentage = (count / len(df_logs)) * 100
        print(f"Status {status}: {count:,} ({percentage:.2f}%)")
    
    if 'user_agent' in df_logs.columns:
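        # Count only non-empty values, matching the filter used in Step 5
        ua_present = (df_logs['user_agent'].notna() & (df_logs['user_agent'] != '')).sum()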
        print("\n" + "-"*80)
        print("USER AGENT STATISTICS")
        print("-"*80)
        print(f"Requests with User Agent: {ua_present:,} ({(ua_present/len(df_logs)*100):.2f}%)")
        print(f"Requests without User Agent: {len(df_logs) - ua_present:,}")
        if ua_present > 0:
            print(f"Unique User Agents: {df_logs['user_agent'].nunique():,}")
    
    print("\n" + "="*80)
    print("Analysis completed successfully!")
    print("="*80)
    
    # List exported files
    print("\nExported Files:")
    import os
    exported_files = ['invalid_http_methods.csv', 'suspicious_user_agents.csv', 'top_user_agents.csv']
    for file in exported_files:
        if os.path.exists(file):
            print(f"  - {file}")
else:
    print("No analysis performed. Please check the log file path and try again.")
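
A date range is only meaningful if the timestamps are compared as dates. Log timestamps such as `10/Oct/2000:13:55:36 -0700` begin with the day of month, so comparing the raw strings orders them by day number first. The sketch below shows the difference on two hypothetical timestamps; `CLF_TIME_FORMAT` is a name introduced here for illustration.

```python
from datetime import datetime

# %b assumes English month abbreviations (the default C locale).
CLF_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'

# Hypothetical timestamps: the December 2023 entry is the earlier one.
raw = ['02/Jan/2024:08:00:00 +0000', '15/Dec/2023:23:59:59 +0000']

# Plain string comparison picks '02/...' because '0' sorts before '1'.
lexicographic_min = min(raw)

# Parsing first gives the chronologically earliest entry.
chronological_min = min(raw, key=lambda s: datetime.strptime(s, CLF_TIME_FORMAT))

print("String min:", lexicographic_min)
print("Parsed min:", chronological_min)
```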

## Additional Analysis (Optional)

Perform additional custom analysis as needed.

In [None]:
# Example: Analyze status codes by IP
if not df_logs.empty:
    print("Status Code Distribution by Top 5 IPs:")
    print("="*80)
    
    top_5_ips = df_logs['ip'].value_counts().head(5).index
    
    for ip in top_5_ips:
        ip_data = df_logs[df_logs['ip'] == ip]
        status_dist = ip_data['status'].value_counts()
        print(f"\nIP: {ip}")
        for status, count in status_dist.items():
            print(f"  Status {status}: {count}")

In [None]:
# Example: Find most requested paths
if not df_logs.empty:
    print("\nTop 10 Most Requested Paths:")
    print("="*80)
    
    top_paths = df_logs['path'].value_counts().head(10)
    for path, count in top_paths.items():
        print(f"{path:<50} {count:>8}")

## Notes

### Forensic Considerations
- Invalid HTTP methods may indicate:
  - Reconnaissance/scanning activity
  - Misconfigured clients or bots
  - Exploitation attempts
  - Web application attacks

### Valid HTTP Methods (RFC 7231, RFC 5789)
- **GET**: Retrieve resource
- **POST**: Submit data to resource
- **PUT**: Replace resource
- **DELETE**: Remove resource
- **HEAD**: Retrieve headers only
- **OPTIONS**: Describe communication options
- **PATCH**: Partial resource modification
- **CONNECT**: Establish tunnel (proxy)
- **TRACE**: Echo request (debugging)

### Common Attack Signatures
- Unexpected methods: `PROPFIND`, `SEARCH`, `LOCK`, `UNLOCK` (legitimate WebDAV methods, but suspicious on servers that do not offer WebDAV)
- Scanning tools: `sqlmap`, `nikto`, `nmap`, `masscan` in User Agent
- Abnormal request patterns from single IP
- Requests for known vulnerable paths (e.g., `/admin`, `/wp-admin`, `/.env`)
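
The path-based signature in the list above can be sketched as a simple prefix check. `PROBE_PREFIXES` and `is_probe_path` are hypothetical names, and the watch list contains only the three example paths mentioned above; a real list would need to be tuned to the application.

```python
# Hypothetical watch list built from the example paths in the notes.
PROBE_PREFIXES = ('/admin', '/wp-admin', '/.env')


def is_probe_path(path):
    """Return True if the request path starts with a watched prefix."""
    # Drop the query string so '/.env?x=1' is treated like '/.env'.
    bare = path.split('?', 1)[0]
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return bare.startswith(PROBE_PREFIXES)


for p in ['/.env', '/wp-admin/setup.php?x=1', '/index.html']:
    print(f"{p:<30} {is_probe_path(p)}")
```

Prefix matching is deliberately loose: `/administrator` also matches `/admin`, which may or may not be desirable for a given site.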

### Next Steps
1. Investigate IPs with high invalid method counts
2. Cross-reference suspicious IPs with threat intelligence feeds
3. Review paths requested by suspicious IPs
4. Check for temporal patterns (time-based attacks)
5. Correlate with other security logs (firewall, IDS/IPS)
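
For the temporal-pattern step, one possible starting point is to bucket requests per IP and hour. This is a minimal sketch using hypothetical `(ip, timestamp)` entries and the standard library only; the hourly bucket size is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime

# %b assumes English month abbreviations (the default C locale).
CLF_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'

# Hypothetical (ip, timestamp) entries standing in for parsed log rows.
entries = [
    ('203.0.113.9', '10/Oct/2000:13:55:36 -0700'),
    ('203.0.113.9', '10/Oct/2000:13:58:01 -0700'),
    ('203.0.113.9', '10/Oct/2000:14:02:10 -0700'),
    ('192.0.2.1', '10/Oct/2000:13:59:59 -0700'),
]

per_hour = Counter()
for ip, ts in entries:
    # Truncate each timestamp to the start of its hour to form the bucket key.
    hour = datetime.strptime(ts, CLF_TIME_FORMAT).replace(minute=0, second=0)
    per_hour[(ip, hour.isoformat())] += 1

for (ip, hour), count in sorted(per_hour.items()):
    print(f"{ip:<15} {hour} {count}")
```

Unusually high counts in a single bucket for one address are a cue for closer inspection, not proof of an attack.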
