# Big Data Analytics [CN7031] CRWK 2024-25

## Group ID: CN7031_Group136_2024

### Group Members:
1. **Navya Athoti**  
    Email: u2793047@uel.ac.uk
2. **Phalguna Avalagunta**  
    Email: u2811669@uel.ac.uk
3. **Nikhil Sai Damera**  
    Email: u2810262@uel.ac.uk
4. **Sai Kishore Dodda**  
    Email: u2773584@uel.ac.uk

---


## Initiate and Configure Spark

In this section, we will initiate and configure Apache Spark, which is a powerful open-source processing engine for big data. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.


In [1]:
!pip3 install pyspark

# Cell 4 [Code]:
# Import required libraries
import os
print(f"JAVA_HOME: {os.environ.get('JAVA_HOME', 'Not set')}")
import sys

# environment variables
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.functions import max as spark_max
from pyspark.sql.window import Window
from pyspark.sql.types import *
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time
from datetime import datetime


# Initialize Spark session
def initialize_spark():
    spark = (SparkSession.builder
            .appName('CN7031_Group136_2024')
            .config("spark.driver.memory", "4g")
            .config("spark.executor.memory", "4g")
            .config("spark.sql.shuffle.partitions", "100")
            .master("local[*]")
            .getOrCreate())
    return spark

spark = initialize_spark()

JAVA_HOME: C:\Program Files\Java\jdk-21


# Load Unstructured Data

In this section, we will load and process unstructured data. Unstructured data refers to information that does not have a predefined data model or is not organized in a predefined manner. This type of data is typically text-heavy, but may also contain data such as dates, numbers, and facts.

We will explore various techniques to handle and analyze unstructured data, including tokenization, vectorization, and the use of embeddings to capture semantic information.

In [2]:
def load_data(spark, path="web.log"):
    try:
        # Check if file exists
        if not os.path.exists(path):
            raise FileNotFoundError(f"File not found: {path}")
            
        data = spark.read.text(path)
        print(f"Successfully loaded {data.count()} log entries")
        return data
    except Exception as e:
        print(f"Error loading data: {str(e)}")
        raise

# Test the data loading
try:
    data = load_data(spark)
except Exception as e:
    print(f"Failed to load data: {str(e)}")


Successfully loaded 3000000 log entries


# Task 1: Data Processing using PySpark DataFrame [40 marks]

---

## Complete code for all students

In [18]:
# Common imports and Spark initialization for all students
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType
from pyspark.sql.functions import regexp_extract, to_timestamp, col

spark = SparkSession.builder \
    .appName("Log Analysis") \
    .getOrCreate()

# Read the log file
logs_df = spark.read.text("web.log")

# Student 1 (Navya A) - IP Address, Timestamp, HTTP Method
student1_df = logs_df.select(
    regexp_extract(col("value"), r"(\d+\.\d+\.\d+\.\d+)", 1).alias("ip_address"),
    to_timestamp(
        regexp_extract(col("value"), r"\[(.*?)\]", 1),
        "dd/MMM/yyyy:HH:mm:ss"
    ).alias("timestamp"),
    regexp_extract(col("value"), r'"(\w+)', 1).alias("http_method")
)

# Student 2 - HTTP Status Code, Response Size, Timestamp
student2_df = logs_df.select(
    regexp_extract(col("value"), r'" (\d{3})', 1).alias("status_code"),
    regexp_extract(col("value"), r'" \d{3} (\d+)', 1).cast(IntegerType()).alias("response_size"),
    to_timestamp(
        regexp_extract(col("value"), r"\[(.*?)\]", 1),
        "dd/MMM/yyyy:HH:mm:ss"
    ).alias("timestamp")
)

# Student 3 - URL Path, IP Address, Response Size
student3_df = logs_df.select(
    regexp_extract(col("value"), r'"[A-Z]+ (.*?) HTTP', 1).alias("url_path"),
    regexp_extract(col("value"), r"(\d+\.\d+\.\d+\.\d+)", 1).alias("ip_address"),
    regexp_extract(col("value"), r'" \d{3} (\d+)', 1).cast(IntegerType()).alias("response_size")
)

# Student 4 - Log Message, HTTP Status Code, Timestamp
student4_df = logs_df.select(
    regexp_extract(col("value"), r'"(.*?)"', 1).alias("log_message"),
    regexp_extract(col("value"), r'" (\d{3})', 1).alias("status_code"),
    to_timestamp(
        regexp_extract(col("value"), r"\[(.*?)\]", 1),
        "dd/MMM/yyyy:HH:mm:ss"
    ).alias("timestamp")
)

# Function to validate and show results for each student's DataFrame
def validate_dataframe(df, student_num):
    print(f"\nStudent {student_num} DataFrame Schema:")
    df.printSchema()
    
    print(f"\nStudent {student_num} Sample Data:")
    df.show(5, truncate=False)
    
    # Count non-null values for each column
    print(f"\nStudent {student_num} Validation Counts:")
    df.select([
        sum(col(c).isNotNull().cast("int")).alias(f"{c}_count")
        for c in df.columns
    ]).show()

# Validate each student's DataFrame
validate_dataframe(student1_df, 1)
validate_dataframe(student2_df, 2)
validate_dataframe(student3_df, 3)
validate_dataframe(student4_df, 4)

# Register DataFrames as views for SQL queries later
student1_df.createOrReplaceTempView("student1_logs")
student2_df.createOrReplaceTempView("student2_logs")
student3_df.createOrReplaceTempView("student3_logs")
student4_df.createOrReplaceTempView("student4_logs")


Student 1 DataFrame Schema:
root
 |-- ip_address: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- http_method: string (nullable = true)


Student 1 Sample Data:
+--------------+-------------------+-----------+
|ip_address    |timestamp          |http_method|
+--------------+-------------------+-----------+
|88.211.105.115|2022-03-04 14:17:48|POST       |
|144.6.49.142  |2022-09-02 15:16:00|POST       |
|231.70.64.145 |2022-07-19 01:31:31|PUT        |
|219.42.234.172|2022-02-08 11:34:57|POST       |
|183.173.185.94|2023-08-29 03:07:11|GET        |
+--------------+-------------------+-----------+
only showing top 5 rows


Student 1 Validation Counts:
+----------------+---------------+-----------------+
|ip_address_count|timestamp_count|http_method_count|
+----------------+---------------+-----------------+
|         3000000|        3000000|          3000000|
+----------------+---------------+-----------------+


Student 2 DataFrame Schema:
root
 |-- status_code

## Student 1: Navya Athoti (u2793047)

In [3]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql.functions import regexp_extract, to_timestamp, col

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Log Analysis - Navya A") \
    .getOrCreate()

# Read the log file
# Assuming your log file is uploaded to Colab
logs_df = spark.read.text("web.log")

# Define the regex patterns for extraction
# Based on your log format:
# IP pattern: (\d+\.\d+\.\d+\.\d+)
# Timestamp pattern: \[(.*?)\]
# HTTP Method pattern: "([A-Z]+)
parsed_df = logs_df.select(
    regexp_extract(col("value"), r"(\d+\.\d+\.\d+\.\d+)", 1).alias("ip_address"),
    to_timestamp(
        regexp_extract(col("value"), r"\[(.*?)\]", 1),
        "dd/MMM/yyyy:HH:mm:ss"
    ).alias("timestamp"),
    regexp_extract(col("value"), r'"(\w+)', 1).alias("http_method")
)

# Register the DataFrame as a temporary view for SQL queries later
parsed_df.createOrReplaceTempView("log_data")

# Show the schema and sample data
print("DataFrame Schema:")
parsed_df.printSchema()

print("\nSample Data:")
parsed_df.show(5, truncate=False)

# Add some basic validations
print("\nValidation Counts:")
parsed_df.select(
    col("ip_address").isNotNull().cast("int").alias("valid_ip"),
    col("timestamp").isNotNull().cast("int").alias("valid_timestamp"),
    col("http_method").isNotNull().cast("int").alias("valid_method")
).agg(
    sum("valid_ip").alias("valid_ip_count"),
    sum("valid_timestamp").alias("valid_timestamp_count"),
    sum("valid_method").alias("valid_method_count")
).show()

DataFrame Schema:
root
 |-- ip_address: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- http_method: string (nullable = true)


Sample Data:
+--------------+-------------------+-----------+
|ip_address    |timestamp          |http_method|
+--------------+-------------------+-----------+
|88.211.105.115|2022-03-04 14:17:48|POST       |
|144.6.49.142  |2022-09-02 15:16:00|POST       |
|231.70.64.145 |2022-07-19 01:31:31|PUT        |
|219.42.234.172|2022-02-08 11:34:57|POST       |
|183.173.185.94|2023-08-29 03:07:11|GET        |
+--------------+-------------------+-----------+
only showing top 5 rows


Validation Counts:
+--------------+---------------------+------------------+
|valid_ip_count|valid_timestamp_count|valid_method_count|
+--------------+---------------------+------------------+
|       3000000|              3000000|           3000000|
+--------------+---------------------+------------------+



### 1. DataFrame Creation with REGEX (10 marks)
- Description of the task and methodology used for creating the DataFrame using REGEX.
- Code snippets and explanations.
- Example outputs and results.


### 2. Two Advanced DataFrame Analysis (20 marks)
- Detailed description of the two advanced analyses performed on the DataFrame.
- Code snippets and explanations for each analysis.
- Visualizations and interpretations of the results.

### 3. Data Visualization (10 marks)
- Explanation of the data visualization techniques used.
- Code snippets for generating the visualizations.
- Example visualizations and their interpretations.

# Task 2: Data Processing using PySpark RDD [40 marks]

---

## Student 2: Phalguna Avalagunta (u2811669)

### 1. RDD Creation and Transformation (10 marks)
- Description of the task and methodology used for creating and transforming the RDD.
- Code snippets and explanations.
- Example outputs and results.

### 2. Two Advanced RDD Analysis (20 marks)
- Detailed description of the two advanced analyses performed on the RDD.
- Code snippets and explanations for each analysis.
- Visualizations and interpretations of the results.

### 3. Data Visualization (10 marks)
- Explanation of the data visualization techniques used.
- Code snippets for generating the visualizations.
- Example visualizations and their interpretations.

In [12]:


# Task 2: Data Processing using PySpark RDD [40 marks]

# Student 1 (Navya Athoti u2793047)
print("\nStudent 1 RDD Analysis - Traffic Pattern Mining")
print("=" * 50)

# Basic RDD Analysis: Parse and Extract (10 marks)
def parse_log_entry(line):
    import re
    try:
        pattern = r'(\d+\.\d+\.\d+\.\d+).*\[(.*?)\].*\"([A-Z]+)'
        match = re.search(pattern, line)
        if match:
            return {
                'ip': match.group(1),
                'timestamp': match.group(2),
                'method': match.group(3)
            }
    except Exception as e:
        print(f"Parsing error: {str(e)}")
    return None

base_rdd = data.rdd.map(lambda x: x['value']) \
                   .map(parse_log_entry) \
                   .filter(lambda x: x is not None)

# Advanced Analysis 1: Time-based Traffic Analysis (15 marks)
hourly_traffic = base_rdd \
    .map(lambda x: (x['timestamp'][:13], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey()

print("\nHourly Traffic Sample:")
for hour, count in hourly_traffic.take(5):
    print(f"{hour}: {count} requests")

# Advanced Analysis 2: IP-based Pattern Analysis (15 marks)
ip_patterns = base_rdd \
    .map(lambda x: (x['ip'], x['method'])) \
    .groupByKey() \
    .mapValues(lambda methods: {
        'total_requests': len(list(methods)),
        'method_distribution': dict(pd.Series(list(methods)).value_counts())
    })

print("\nIP Pattern Analysis Sample:")
for ip, stats in ip_patterns.take(3):
    print(f"\nIP: {ip}")
    print(f"Total Requests: {stats['total_requests']}")
    print("Method Distribution:", stats['method_distribution'])



Student 1 RDD Analysis - Traffic Pattern Mining

Hourly Traffic Sample:
01/Apr/2022:0: 1869 requests
01/Apr/2022:1: 1839 requests
01/Apr/2022:2: 729 requests
01/Apr/2023:0: 1889 requests
01/Apr/2023:1: 1879 requests

IP Pattern Analysis Sample:

IP: 220.182.78.75
Total Requests: 1
Method Distribution: {'GET': 1}

IP: 143.238.50.180
Total Requests: 1
Method Distribution: {'POST': 1}

IP: 155.22.118.135
Total Requests: 1
Method Distribution: {'GET': 1}


# Task 3: Optimization and LSEPI Considerations [10 marks]

---

In this task, we will focus on optimization techniques and considerations for LSEPI (Large-Scale Enterprise Process Integration). The goal is to enhance the performance and efficiency of our data processing workflows.

## Objectives:
1. **Optimization Techniques (5 marks)**
    - Description of the optimization techniques applied.
    - Code snippets and explanations.
    - Example outputs and results.

2. **LSEPI Considerations (5 marks)**
    - Detailed description of the considerations for LSEPI.
    - Code snippets and explanations for each consideration.
    - Visualizations and interpretations of the results.

---

### 1. Optimization Techniques (5 marks)
- **Description:**
  - Provide a detailed description of the optimization techniques used.
  - Explain the methodology and rationale behind each technique.
- **Code Snippets:**
  - Include relevant code snippets demonstrating the optimization techniques.
- **Example Outputs:**
  - Show example outputs and results to illustrate the effectiveness of the optimizations.

### 2. LSEPI Considerations (5 marks)
- **Description:**
  - Discuss the key considerations for LSEPI.
  - Explain how these considerations impact the overall workflow.
- **Code Snippets:**
  - Provide code snippets that address LSEPI considerations.
- **Visualizations:**
  - Include visualizations to support the explanations and interpretations of the results.

---

This section aims to provide a comprehensive understanding of optimization and LSEPI considerations, ensuring efficient and effective data processing workflows.

In [13]:


# Task 3: Optimization and LSEPI Considerations [10 marks]

# Student 1 (Navya Athoti u2793047)
print("\nStudent 1 Optimization Analysis")
print("=" * 50)

# Method 1: Partition Strategies (5 marks)
def evaluate_partition_strategy():
    print("\nPartitioning Strategy Evaluation")
    
    # Baseline - Default partitioning
    start_time = time.time()
    df_student1.groupBy('IP_Address').count().count()
    baseline_time = time.time() - start_time
    print(f"Baseline execution time: {baseline_time:.2f} seconds")
    
    # Custom partitioning
    start_time = time.time()
    df_student1.repartition(8, 'IP_Address').groupBy('IP_Address').count().count()
    optimized_time = time.time() - start_time
    print(f"Optimized execution time: {optimized_time:.2f} seconds")
    print(f"Performance improvement: {((baseline_time - optimized_time) / baseline_time) * 100:.2f}%")

evaluate_partition_strategy()

# Method 2: Caching Strategy (5 marks)
def evaluate_caching_strategy():
    print("\nCaching Strategy Evaluation")
    
    # Without caching
    df_uncached = df_student1.unpersist()
    start_time = time.time()
    df_uncached.groupBy('HTTP_Method').count().count()
    df_uncached.groupBy('IP_Address').count().count()
    uncached_time = time.time() - start_time
    print(f"Uncached execution time: {uncached_time:.2f} seconds")
    
    # With caching
    df_cached = df_student1.cache()
    df_cached.count()  # Materialize cache
    start_time = time.time()
    df_cached.groupBy('HTTP_Method').count().count()
    df_cached.groupBy('IP_Address').count().count()
    cached_time = time.time() - start_time
    print(f"Cached execution time: {cached_time:.2f} seconds")
    print(f"Caching improvement: {((uncached_time - cached_time) / uncached_time) * 100:.2f}%")

evaluate_caching_strategy()




Student 1 Optimization Analysis

Partitioning Strategy Evaluation
Baseline execution time: 4.44 seconds
Optimized execution time: 4.08 seconds
Performance improvement: 8.01%

Caching Strategy Evaluation
Uncached execution time: 11.75 seconds
Cached execution time: 4.21 seconds
Caching improvement: 64.15%


In [14]:
# Student 2 (Phalguna Avalagunta u2811669)
print("\nStudent 2 Optimization Analysis")
print("=" * 50)

# Method 1: Caching Strategy
def evaluate_caching_strategy_student2():
    print("\nCaching Strategy Evaluation")
    
    # Without caching
    df_uncached = df_student2.unpersist()
    start_time = time.time()
    df_uncached.groupBy('Status_Code').count().count()
    df_uncached.groupBy('Response_Size').count().count()
    uncached_time = time.time() - start_time
    print(f"Uncached execution time: {uncached_time:.2f} seconds")
    
    # With caching
    df_cached = df_student2.cache()
    df_cached.count()  # Materialize cache
    start_time = time.time()
    df_cached.groupBy('Status_Code').count().count()
    df_cached.groupBy('Response_Size').count().count()
    cached_time = time.time() - start_time
    print(f"Cached execution time: {cached_time:.2f} seconds")
    print(f"Caching improvement: {((uncached_time - cached_time) / uncached_time) * 100:.2f}%")

evaluate_caching_strategy_student2()

def evaluate_bucketing_strategy_student2():
    print("\nBucketing Strategy Evaluation")
    
    try:
        # Create DataFrame with proper schema
        df_for_bucket = df_student2.select(
            col("Status_Code").cast("string"),
            col("Response_Size").cast("long"),
            col("Timestamp").cast("string")
        )
        
        # Create temporary view
        df_for_bucket.createOrReplaceTempView("logs")
        
        # Measure query performance without bucketing
        start_time = time.time()
        spark.sql("SELECT Status_Code, COUNT(*) FROM logs GROUP BY Status_Code").show()
        unbucketed_time = time.time() - start_time
        print(f"Query time without bucketing: {unbucketed_time:.2f} seconds")
        
        # Create bucketed DataFrame directly
        bucketed_df = df_for_bucket.repartition(4, "Status_Code")
        bucketed_df.createOrReplaceTempView("bucketed_logs")
        
        # Measure query performance with bucketing
        start_time = time.time()
        spark.sql("SELECT Status_Code, COUNT(*) FROM bucketed_logs GROUP BY Status_Code").show()
        bucketed_time = time.time() - start_time
        print(f"Query time with bucketing: {bucketed_time:.2f} seconds")
        print(f"Performance improvement: {((unbucketed_time - bucketed_time) / unbucketed_time) * 100:.2f}%")
        
    except Exception as e:
        print(f"Error in bucketing strategy: {str(e)}")




Student 2 Optimization Analysis

Caching Strategy Evaluation
Uncached execution time: 14.07 seconds
Cached execution time: 0.63 seconds
Caching improvement: 95.52%


In [15]:
# Student 3 (Nikhil Sai Damera u2810262)
print("\nStudent 3 Optimization Analysis")
print("=" * 50)

# Method 1: Partition Strategies
def evaluate_partition_strategy_student3():
    print("\nPartitioning Strategy Evaluation")
    
    # Baseline
    start_time = time.time()
    df_student3.groupBy('URL_Path').count().count()
    baseline_time = time.time() - start_time
    print(f"Baseline execution time: {baseline_time:.2f} seconds")
    
    # Custom partitioning
    start_time = time.time()
    df_student3.repartition(10, 'URL_Path').groupBy('URL_Path').count().count()
    optimized_time = time.time() - start_time
    print(f"Optimized execution time: {optimized_time:.2f} seconds")
    print(f"Performance improvement: {((baseline_time - optimized_time) / baseline_time) * 100:.2f}%")

evaluate_partition_strategy_student3()

# Method 2: Bucketing & Indexing
def evaluate_bucketing_strategy_student3():
    print("\nBucketing Strategy Evaluation")
    
    try:
        # Create DataFrame with proper schema
        df_for_bucket = df_student3.select(
            col("URL_Path").cast("string"),
            col("IP_Address").cast("string"),
            col("Response_Size").cast("long")
        )
        
        # Create temporary view
        df_for_bucket.createOrReplaceTempView("url_logs")
        
        # Measure query performance without bucketing
        start_time = time.time()
        spark.sql("SELECT URL_Path, COUNT(*) FROM url_logs GROUP BY URL_Path").show()
        unbucketed_time = time.time() - start_time
        print(f"Query time without bucketing: {unbucketed_time:.2f} seconds")
        
        # Create bucketed DataFrame directly
        bucketed_df = df_for_bucket.repartition(4, "URL_Path")
        bucketed_df.createOrReplaceTempView("bucketed_url_logs")
        
        # Measure query performance with bucketing
        start_time = time.time()
        spark.sql("SELECT URL_Path, COUNT(*) FROM bucketed_url_logs GROUP BY URL_Path").show()
        bucketed_time = time.time() - start_time
        print(f"Query time with bucketing: {bucketed_time:.2f} seconds")
        print(f"Performance improvement: {((unbucketed_time - bucketed_time) / unbucketed_time) * 100:.2f}%")
        
    except Exception as e:
        print(f"Error in bucketing strategy: {str(e)}")




Student 3 Optimization Analysis

Partitioning Strategy Evaluation
Baseline execution time: 0.42 seconds
Optimized execution time: 1.55 seconds
Performance improvement: -264.13%


In [16]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import time
import os

class Student4Analysis:
    def __init__(self):
        self.spark = None
        self.df_student4 = None
    
    def initialize_spark(self):
        """Initialize Spark session"""
        self.spark = SparkSession.builder \
            .appName("Student4_Analysis") \
            .config("spark.sql.adaptive.enabled", "true") \
            .config("spark.executor.memory", "2g") \
            .getOrCreate()
    
    def load_data(self, input_data):
        """Load and process the log data"""
        try:
            # Read the raw data
            raw_logs = self.spark.read.text(input_data)
            
            # Define the regex pattern for Student 4
            regex_pattern = r'\".*\" (\d+) .*? \[(.*?)\] (.*)'
            
            # Create DataFrame with extracted fields
            self.df_student4 = raw_logs.select(
                regexp_extract('value', regex_pattern, 1).alias('HTTP_Status_Code'),
                regexp_extract('value', regex_pattern, 2).alias('Timestamp'),
                regexp_extract('value', regex_pattern, 3).alias('Log_Message')
            )
            
            print("\nSample of processed data:")
            self.df_student4.show(5, truncate=False)
            
        except Exception as e:
            print(f"Error loading data: {str(e)}")
            raise
    
    def evaluate_caching_strategy(self):
        """Evaluate caching strategy performance"""
        try:
            print("\nCaching Strategy Evaluation")
            
            # Ensure DataFrame exists
            if self.df_student4 is None:
                raise ValueError("DataFrame not initialized")
            
            # Without caching
            self.df_student4.unpersist()  # Ensure clean state
            start_time = time.time()
            uncached_count = self.df_student4.groupBy('HTTP_Status_Code').count().count()
            uncached_time = time.time() - start_time
            print(f"Uncached execution time: {uncached_time:.2f} seconds")
            
            # With caching
            cached_df = self.df_student4.cache()
            cached_df.count()  # Materialize cache
            start_time = time.time()
            cached_count = cached_df.groupBy('HTTP_Status_Code').count().count()
            cached_time = time.time() - start_time
            improvement = ((uncached_time - cached_time) / uncached_time) * 100
            print(f"Cached execution time: {cached_time:.2f} seconds")
            print(f"Caching improvement: {improvement:.2f}%")
            
            return cached_df
            
        except Exception as e:
            print(f"Error in caching strategy evaluation: {str(e)}")
            raise
    
    def evaluate_partition_strategy(self):
        """Evaluate partitioning strategy performance"""
        try:
            print("\nPartitioning Strategy Evaluation")
            
            # Ensure DataFrame exists
            if self.df_student4 is None:
                raise ValueError("DataFrame not initialized")
            
            # Baseline execution
            start_time = time.time()
            baseline_count = self.df_student4.groupBy('HTTP_Status_Code').count().count()
            baseline_time = time.time() - start_time
            print(f"Baseline execution time: {baseline_time:.2f} seconds")
            
            # Optimized execution with partitioning
            start_time = time.time()
            optimized_count = (self.df_student4.repartition(8, 'HTTP_Status_Code')
                             .groupBy('HTTP_Status_Code')
                             .count()
                             .count())
            optimized_time = time.time() - start_time
            improvement = ((baseline_time - optimized_time) / baseline_time) * 100
            print(f"Optimized execution time: {optimized_time:.2f} seconds")
            print(f"Performance improvement: {improvement:.2f}%")
            
        except Exception as e:
            print(f"Error in partition strategy evaluation: {str(e)}")
            raise
    
    def cleanup(self):
        """Clean up Spark resources"""
        try:
            if self.df_student4 is not None:
                self.df_student4.unpersist()
            if self.spark is not None:
                self.spark.stop()
                print("\nSpark session successfully closed")
        except Exception as e:
            print(f"Error during cleanup: {str(e)}")

def main():
    # Initialize analysis object
    analysis = Student4Analysis()
    
    try:
        # Initialize Spark
        analysis.initialize_spark()
        
        # Get the current working directory
        current_dir = os.getcwd()
        
        # Specify the log file path - adjust this to your actual log file path
        log_file = os.path.join(current_dir, "web.log")
        
        print(f"\nProcessing log file: {log_file}")
        
        # Load and process data
        analysis.load_data(log_file)
        
        # Run optimization tests
        analysis.evaluate_caching_strategy()
        analysis.evaluate_partition_strategy()
        
    except Exception as e:
        print(f"\nError in main execution: {str(e)}")
    finally:
        # Ensure cleanup happens even if there's an error
        analysis.cleanup()

if __name__ == "__main__":
    main()


Processing log file: c:\Users\HP\University\Python-Projects\web.log

Sample of processed data:
+----------------+---------+-----------+
|HTTP_Status_Code|Timestamp|Log_Message|
+----------------+---------+-----------+
|                |         |           |
|                |         |           |
|                |         |           |
|                |         |           |
|                |         |           |
+----------------+---------+-----------+
only showing top 5 rows


Caching Strategy Evaluation
Uncached execution time: 8.73 seconds
Cached execution time: 0.23 seconds
Caching improvement: 97.31%

Partitioning Strategy Evaluation
Baseline execution time: 0.22 seconds
Optimized execution time: 1.06 seconds
Performance improvement: -373.47%

Spark session successfully closed
