<img src="https://cdn-ukwest.onetrust.com/logos/8330d093-4c49-4c87-9f9c-f0411dece48c/f4949705-ab9e-41bb-b3d9-6abc65ae7a94/8d8533f7-8206-4edc-83f0-d8eda8c92ae4/BPP_1-Line_Lockup_Positive_RGB_Web.png" width=400px/>

<h3><font color="#AA00BB">How you can use this Notebook</font></h3>
<p>This notebook was written to teach new concepts in data using Python.</p>
<p>You can read through the descriptions and run the code (it should work!), or you may be taken through the code by one of our experts.</p>
<p>But one of the best habits to acquire is to re-write the code yourself.</p>
<ul><li>Experiment
<li>Break the code
<li>Build a deeper understanding of both the underlying data concepts and the code</ul>
<p>Don't worry if you make mistakes; we all do. The way to get better and make fewer mistakes is to write more code.</p>
<p>Enjoy!</p>
<br>

> ❓🤔 = a question for us to discuss

> ⌨️ = a task for you to try

> 🔑 = an authoritative reference or guide you may find useful

> 🚀 = **optional** material to develop yourself further


<a name="contents"></a>
## Contents

<br>1. [Understanding Parallel Processing in Python](#section_1)
<br>2. [Import Image Data from Kaggle](#section_2)
<br>3. [Processing Image Data](#section_3)
<br>4. [Processing Log Data](#section_4)

<a name="section_1"></a>
# 1. Understanding Parallel Processing in Python

[Return to contents](#contents)

Imagine you have a massive dataset - an array of 10 million integers - and your task is to calculate the total sum. While this seems straightforward, how we approach the task can significantly affect how long it takes to complete. In this activity, you will compare three approaches to solve this problem: **sequential**, **parallel**, and **concurrent** processing.

> **Sequential Processing:** This is the simplest way to solve problems, where tasks are performed one after the other. It's often easier to write and debug but can be slow for large tasks.

> **Parallel Processing:** This involves dividing a problem into smaller sub-tasks and processing them simultaneously on multiple processors. It's highly efficient for tasks that can be split into independent parts.

> **Concurrent Processing:** This model allows tasks to overlap in time. It's particularly useful for handling multiple tasks that involve I/O or waiting, even if they aren't entirely independent.
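
To make the idea of overlapping concrete, here is a minimal sketch (not part of the exercises) in which three simulated waits run concurrently with `asyncio`. The task names and the 0.2 second delay are illustrative assumptions, and `asyncio.sleep` stands in for a real I/O wait such as a network request.

```python
import asyncio
import time

async def fake_io(name, delay):
    # asyncio.sleep stands in for a real I/O wait (assumption for illustration)
    await asyncio.sleep(delay)
    return name

async def run_all():
    # gather schedules the three coroutines so that their waits overlap
    return await asyncio.gather(
        fake_io("a", 0.2), fake_io("b", 0.2), fake_io("c", 0.2)
    )

start = time.perf_counter()
names = asyncio.run(run_all())
elapsed = time.perf_counter() - start

# Results come back in the order the coroutines were passed to gather
print(names)
print(f"Elapsed: {elapsed:.2f} seconds (three 0.2 s waits, overlapped)")
```

Run one after the other, the three waits would take about 0.6 seconds; overlapped, they should finish in roughly the time of a single wait. In a notebook that already has an event loop running, `asyncio.run` needs `nest_asyncio.apply()` to be called first.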


Along the way, we'll analyse their performance, learn about trade-offs, and determine the best fit for various scenarios.

You'll begin with sequential processing. In this approach, the entire array is processed in a single thread, one element at a time. This serves as our baseline implementation, so we can compare its performance with the other models.

**What will you learn here:**

* Implement a basic summation function.
* Run the code and record the runtime.
* Reflect on why sequential processing is straightforward but may not scale well for very large datasets.

In [None]:
big_number = 10_000_000
# the underscore notation here allows us to see the scale
# of the number we are working with

In [None]:
import time

# Sequential summation
def sequential_sum(array):
    # Initialize the total sum to zero
    total = 0
    # Loop through each element in the array and add it to the total
    for num in array:
        total += num
    return total

# Generate a large array of integers
array = list(range(1, big_number + 1))

# Measure the runtime
start_time = time.time()
result = sequential_sum(array)
end_time = time.time()

# Print the result and the runtime
print(f"Sequential Sum: {result}")
print(f"Time Taken: {end_time - start_time:.4f} seconds")

Sequential Sum: 50000005000000
Time Taken: 2.5111 seconds


> ❓🤔
> * *What did you notice about the simplicity of this approach?*
> * *Why might this method take a long time as the array size increases?*
> * *How could we speed this up?*

Now, you’ll move to parallel processing. Here, we'll split the array into chunks and process each chunk simultaneously using multiple processors. This approach is powerful for tasks that can be divided into independent pieces, but it requires more effort to implement and manage.

**What will you learn here:**

* Use Python's multiprocessing library to divide the array into chunks.
* Implement a parallel solution to sum the chunks.
* Compare the runtime to the sequential model and discuss overhead.

In [None]:
from multiprocessing import Pool, cpu_count
import time

# Define a function to sum a chunk of the array
def parallel_sum_chunk(chunk):
    return sum(chunk)

# Parallel summation using multiprocessing
def parallel_sum(array):
    # Use one chunk per available processor
    num_chunks = cpu_count()
    print(f"Number of processors available: {num_chunks}")
    # Split the array into chunks; ceiling division ensures that no
    # elements are dropped when the length is not an exact multiple
    chunk_size = max(1, -(-len(array) // num_chunks))
    chunks = [array[i:i + chunk_size] for i in range(0, len(array), chunk_size)]

    # Use a multiprocessing Pool to process chunks in parallel
    with Pool(processes=num_chunks) as pool:
        results = pool.map(parallel_sum_chunk, chunks)

    # Combine the results from all chunks
    return sum(results)

# Protect the multiprocessing code
if __name__ == "__main__":

  # Generate a large array of integers
  array = list(range(1, big_number + 1))

  # Measure the runtime
  start_time = time.time()
  result = parallel_sum(array)
  end_time = time.time()

  # Print the result and the runtime
  print(f"Parallel Sum: {result}")
  print(f"Time Taken: {end_time - start_time:.4f} seconds")

Number of processors available: 2
Parallel Sum: 50000005000000
Time Taken: 4.2660 seconds


> ❓🤔
> * *How did the runtime compare to the sequential version?*
> * *Did you notice any overhead from splitting the task or managing processes?*
> * *Why does the number of CPU cores affect the speedup achieved?*
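
One source of the overhead is that `pool.map` has to serialise (pickle) each chunk and send it to a worker process. The sketch below, which is an illustration rather than part of the exercise, compares the size of a pickled chunk with the size of a pickled `(start, stop)` pair that describes the same numbers. The chunk size of one million and the variable names are assumptions.

```python
import pickle

# A chunk of one million integers, as a worker would receive it
chunk = list(range(1, 1_000_001))
# The same numbers described only by their bounds
bounds = (1, 1_000_001)

chunk_bytes = len(pickle.dumps(chunk))
bounds_bytes = len(pickle.dumps(bounds))

print(f"Pickled chunk:  {chunk_bytes:,} bytes")
print(f"Pickled bounds: {bounds_bytes:,} bytes")

# A worker given only the bounds can rebuild the numbers locally
rebuilt_total = sum(range(*bounds))
print(rebuilt_total == sum(chunk))
```

Sending only the bounds works here because the data is a simple range. For arbitrary data the chunk itself has to be transferred, which is part of why the speed-up depends on how much work each chunk needs relative to its size.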

The final approach is concurrent processing. Here, you'll simulate an environment where summing the array is just one of several tasks being performed. By using asynchronous programming, we allow tasks to overlap in time. This is particularly useful in scenarios involving I/O operations or multitasking.

**What will you learn here:**

* Use Python's asyncio library to divide the array into chunks and process them concurrently.
* Observe how concurrency differs from parallelism and compare runtime.

In [None]:
import asyncio
import time
import nest_asyncio

# Enable nested asyncio loops for Colab
nest_asyncio.apply()

# Define an asynchronous function to sum a chunk of the array
async def concurrent_sum_chunk(chunk):
    await asyncio.sleep(0)  # Yield control to the event loop (stands in for an I/O wait)
    return sum(chunk)

# Concurrent summation using asyncio
async def concurrent_sum(array):
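    # Divide the array into at most 10 chunks; ceiling division ensures
    # that no elements are dropped when the length is not a multiple of 10
    num_chunks = 10
    chunk_size = max(1, -(-len(array) // num_chunks))
    chunks = [array[i:i + chunk_size] for i in range(0, len(array), chunk_size)]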

    # Create a list of async tasks to process each chunk
    tasks = [concurrent_sum_chunk(chunk) for chunk in chunks]
    # Use asyncio.gather to run all tasks concurrently
    results = await asyncio.gather(*tasks)
    # Combine results from all chunks
    print(f"Concurrent Sum: {sum(results)}")
    return

# Generate a large array of integers
array = list(range(1, big_number + 1))

# Measure the runtime
start_time = time.time()
asyncio.run(concurrent_sum(array))
end_time = time.time()

# Print the result and the runtime
print(f"Time Taken: {end_time - start_time:.4f} seconds")

Concurrent Sum: 50000005000000
Time Taken: 0.8806 seconds
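
When comparing the three timings, keep in mind that the sequential version adds the numbers in an explicit Python loop, whereas the parallel and concurrent versions call the built-in `sum` on each chunk. The following sketch times both on a single thread so that the difference between the two can be seen in isolation. The array size of one million and the helper name `loop_sum` are assumptions for illustration, and the timings will vary by machine.

```python
import time

def loop_sum(values):
    # Same approach as sequential_sum: add one element at a time
    total = 0
    for value in values:
        total += value
    return total

numbers = list(range(1, 1_000_001))

start = time.perf_counter()
loop_result = loop_sum(numbers)
loop_time = time.perf_counter() - start

start = time.perf_counter()
builtin_result = sum(numbers)
builtin_time = time.perf_counter() - start

print(f"Explicit loop: {loop_time:.4f} seconds")
print(f"Built-in sum:  {builtin_time:.4f} seconds")
```

If the built-in `sum` alone accounts for most of the gap, then the concurrent timing may say more about how the sum is computed than about `asyncio` itself.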


Now that all three implementations are complete, have a go at the following:

1. *Update the code so that the runtimes can be saved to a variable. Create a DataFrame that can contain these results.*
2. *Reflect on the trade-offs in terms of implementation complexity, performance, and resource usage.*
3. *Discuss which model would be most appropriate for this problem if the size of the array continues to increase or if the data is not numeric.*

In [None]:
#@title Solution Q1
import pandas as pd
# add a line that will save the time difference, along with the method
con_process = ('Concurrent',
               round(end_time - start_time, 4))
print(con_process)
# add this to a dataframe
time_comp = pd.DataFrame([con_process],
                         columns = ['Process', 'Time'])
time_comp.head()

('Concurrent', 0.8806)


| | Process | Time |
|---|---|---|
| 0 | Concurrent | 0.8806 |
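
The solution above records only the most recent run, because `start_time` and `end_time` are overwritten by each cell. One possible way to keep all three results is to wrap each call in a small timing helper and collect the rows as you go. This is a sketch: the helper name `timed` and the `rows` list are assumptions, and the built-in `sum` on a small range stands in for the real summation functions so that it runs quickly.

```python
import time

def timed(label, func, *args):
    # Run func(*args) and return its result with a (label, seconds) row
    start = time.perf_counter()
    result = func(*args)
    elapsed = round(time.perf_counter() - start, 4)
    return result, (label, elapsed)

rows = []
# Stand-in call; replace sum and its argument with the function to measure
total, row = timed("Sequential", sum, range(1, 1_001))
rows.append(row)

print(total)
print(rows)
```

The collected `rows` can then be passed to `pd.DataFrame(rows, columns=['Process', 'Time'])`, as in the solution.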


In [None]:
#@title Solution Q2
## Sequential
# Low complexity; good for small tasks or tasks with dependencies.
## Parallel
# Slightly more complex; good for large, independent computational tasks.
## Concurrent
# High complexity; good for overlapping input/output (I/O) operations or multitasking.

In [None]:
#@title Solution Q3
# Sequential is best here because of its simplicity, but anything more complex
# should be parallel unless the processes are dependent.

<a name="section_2"></a>
# 2. Import Image Data from Kaggle

Now let's think of a harder problem. To get started, let's link up to [Kaggle](https://www.kaggle.com) and import some [data](https://www.kaggle.com/datasets/jessicali9530/celeba-dataset/data?select=img_align_celeba).

In [None]:
# import necessary libraries
import pandas as pd
import os
import kagglehub

# Download latest version of the image data
path = kagglehub.dataset_download("jessicali9530/celeba-dataset")

# show the path to where the image data is stored
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/jessicali9530/celeba-dataset?dataset_version_number=2...


100%|██████████| 1.33G/1.33G [00:10<00:00, 134MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/jessicali9530/celeba-dataset/versions/2


# Question
*Write code (using the `os` and `pandas` libraries) that looks at all CSVs in the path mentioned above and prints the head of each file. Describe what each file contains.*

In [None]:
#@title Answer
os.chdir(path)
print(os.listdir())
for file in os.listdir():
    if file.endswith('.csv'):
        _tmp = pd.read_csv(file)
        # Count the rows of the DataFrame, not the characters in the file name
        print(f"{file} has {len(_tmp):,} rows")
        display(_tmp.head())
        del _tmp

['list_attr_celeba.csv', 'list_landmarks_align_celeba.csv', 'list_eval_partition.csv', 'list_bbox_celeba.csv', 'img_align_celeba']
list_attr_celeba.csv has 20 rows


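| | image_id | 5_o_Clock_Shadow | Arched_Eyebrows | Attractive | Bags_Under_Eyes | Bald | Bangs | Big_Lips | Big_Nose | Black_Hair | ... | Sideburns | Smiling | Straight_Hair | Wavy_Hair | Wearing_Earrings | Wearing_Hat | Wearing_Lipstick | Wearing_Necklace | Wearing_Necktie | Young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000001.jpg | -1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | 1 | 1 | -1 | 1 | -1 | 1 | -1 | -1 | 1 |
| 1 | 000002.jpg | -1 | -1 | -1 | 1 | -1 | -1 | -1 | 1 | -1 | ... | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 2 | 000003.jpg | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 | -1 | ... | -1 | -1 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 3 | 000004.jpg | -1 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | 1 | -1 | 1 | -1 | 1 | 1 | -1 | 1 |
| 4 | 000005.jpg | -1 | 1 | 1 | -1 | -1 | -1 | 1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 | -1 | 1 |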


list_landmarks_align_celeba.csv has 31 rows


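| | image_id | lefteye_x | lefteye_y | righteye_x | righteye_y | nose_x | nose_y | leftmouth_x | leftmouth_y | rightmouth_x | rightmouth_y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000001.jpg | 69 | 109 | 106 | 113 | 77 | 142 | 73 | 152 | 108 | 154 |
| 1 | 000002.jpg | 69 | 110 | 107 | 112 | 81 | 135 | 70 | 151 | 108 | 153 |
| 2 | 000003.jpg | 76 | 112 | 104 | 106 | 108 | 128 | 74 | 156 | 98 | 158 |
| 3 | 000004.jpg | 72 | 113 | 108 | 108 | 101 | 138 | 71 | 155 | 101 | 151 |
| 4 | 000005.jpg | 66 | 114 | 112 | 112 | 86 | 119 | 71 | 147 | 104 | 150 |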


list_eval_partition.csv has 23 rows


| | image_id | partition |
|---|---|---|
| 0 | 000001.jpg | 0 |
| 1 | 000002.jpg | 0 |
| 2 | 000003.jpg | 0 |
| 3 | 000004.jpg | 0 |
| 4 | 000005.jpg | 0 |


list_bbox_celeba.csv has 20 rows


| | image_id | x_1 | y_1 | width | height |
|---|---|---|---|---|---|
| 0 | 000001.jpg | 95 | 71 | 226 | 313 |
| 1 | 000002.jpg | 72 | 94 | 221 | 306 |
| 2 | 000003.jpg | 216 | 59 | 91 | 126 |
| 3 | 000004.jpg | 622 | 257 | 564 | 781 |
| 4 | 000005.jpg | 236 | 109 | 120 | 166 |


# Question
*For the remaining item (which is actually a folder), print the number of files it contains (note that there is a sub-folder you will also need to go into).*

In [None]:
#@title Answer
os.chdir(os.path.join(path, 'img_align_celeba', 'img_align_celeba'))
print(f"Number of files: {len(os.listdir()):,}")

Number of files: 202,599


<a name="section_3"></a>
# 3. Processing Image Data
Now that we know more about the data we will be working with, let's pose a problem!

Imagine you have to create a new set of images to be processed, ones that need to be converted to greyscale. You will need to load each image, convert it to greyscale, and then save it to a new location. Some functions have been given to you below to use.

In [None]:
# functions to use
from PIL import Image

# loading images
def load_image(file_path):
    return Image.open(file_path)

# saving images
def save_image(image, output_path):
    image.save(output_path)

# convert image to greyscale
def apply_greyscale(image):
    return image.convert("L")

Now, focus on using the partially built code block below and fill in the gaps to complete a sequential process.

In [None]:
#@title Fill in the gaps
# set up a progress bar to track how many files have been completed
from tqdm.notebook import tqdm
pbar = tqdm()
# designate input and output folders
input_folder = ...
output_folder = ...

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

# set start time
start_time = time.time()

# set number of files expected
n_files = ...
pbar.reset(total=n_files)

# loop through files
for file_name in ...:
    # import file
    ...
    # convert to greyscale
    ...
    #save image
    ...
    pbar.update()

end_time = time.time()
print(f"Sequential processing completed in {end_time - start_time:.2f} seconds")

In [None]:
#@title Solution
# set up a progress bar to track how many files have been completed
from tqdm.notebook import tqdm
pbar = tqdm()
# designate input and output folders
input_folder = os.path.join(path, "img_align_celeba", "img_align_celeba")
output_folder = os.path.join(path, "img_align_celeba", "grey_images")

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

# set start time
start_time = time.time()

# set number of files expected
n_files = len(os.listdir(os.path.join(path,
                                      'img_align_celeba/img_align_celeba')))
pbar.reset(total=n_files)

# loop through files
for file_name in os.listdir(input_folder):
    # import file
    file_path = os.path.join(input_folder, file_name)
    image = load_image(file_path)
    # convert to greyscale
    greyscale_image = apply_greyscale(image)
    #save image
    save_image(greyscale_image, os.path.join(output_folder, file_name))
    pbar.update()

end_time = time.time()
print(f"Sequential processing completed in {end_time - start_time:.2f} seconds")

0it [00:00, ?it/s]

Sequential processing completed in 235.14 seconds


Now that you have a working solution, how can you improve it? Write new code that optimises the runtime by utilising either parallel or concurrent processing.

In [None]:
#@title Your code







In [None]:
#@title Solution
# designate input and output folders
input_folder = os.path.join(path, "img_align_celeba", "img_align_celeba")
output_folder = os.path.join(path, "img_align_celeba", "grey_images")

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

# set number of files expected
n_files = len(os.listdir(os.path.join(path,
                                      'img_align_celeba/img_align_celeba')))

def process_image(file_name):
    # import file
    file_path = os.path.join(input_folder, file_name)
    image = load_image(file_path)
    # convert to greyscale
    greyscale_image = apply_greyscale(image)
    #save image
    save_image(greyscale_image, os.path.join(output_folder, file_name))

if __name__ == "__main__":
    start_time = time.time()

    with Pool() as pool:
        pool.map(process_image, os.listdir(input_folder))

    end_time = time.time()
    print(f"Parallel processing completed in {end_time - start_time:.2f} seconds")

Parallel processing completed in 160.87 seconds


In [None]:
#@title Solution (with progress bar)
from tqdm.contrib.concurrent import process_map
# designate input and output folders
input_folder = os.path.join(path, "img_align_celeba", "img_align_celeba")
output_folder = os.path.join(path, "img_align_celeba", "grey_images")

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

# set number of files expected
n_files = len(os.listdir(os.path.join(path,
                                      'img_align_celeba/img_align_celeba')))

def process_image(file_name):
    # import file
    file_path = os.path.join(input_folder, file_name)
    image = load_image(file_path)
    # convert to greyscale
    greyscale_image = apply_greyscale(image)
    #save image
    save_image(greyscale_image, os.path.join(output_folder, file_name))

if __name__ == "__main__":
    start_time = time.time()

    process_map(process_image, os.listdir(input_folder), chunksize=500)

    end_time = time.time()
    print(f"Parallel processing completed in {end_time - start_time:.2f} seconds")

  0%|          | 0/202599 [00:00<?, ?it/s]

Parallel processing completed in 176.43 seconds
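
Both solutions above use processes. Since the exercise also allows a concurrent approach, a thread pool is one option to experiment with, because much of the work is reading and writing files. The sketch below shows only the pattern: `fake_process` is a hypothetical stand-in for `process_image`, the three file names are examples, and the worker count of 8 is an arbitrary assumption. Whether threads beat processes for this task would need to be measured.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_process(file_name):
    # Hypothetical stand-in for process_image: returns the output name
    return f"grey_{file_name}"

file_names = ["000001.jpg", "000002.jpg", "000003.jpg"]

# executor.map runs the function across threads and keeps the input order
with ThreadPoolExecutor(max_workers=8) as executor:
    outputs = list(executor.map(fake_process, file_names))

print(outputs)
```

To try it on the images, `fake_process` would be swapped for `process_image` and `file_names` for `os.listdir(input_folder)`.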


<a name="section_4"></a>
# 4. Processing Log Data
Your next task is to process server access logs for Zanbil, an online shopping store. Each log entry contains information about user requests, such as IP address, timestamp, requested URL, HTTP status, and user agent. Your goal is to process this data and compute the following insights:

* The most frequently accessed resources (e.g., product pages, filters).
* The total number of requests per hour.
* **Bonus:** The most common user agents accessing the server.

Each log entry has the following structure:

In [None]:
'54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"'

'54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"'

**Challenge Requirements**
1. Extract Key Data from Logs

  * Timestamp: Extract the time of the request for hourly aggregation.
  * Resource: Extract the requested resource (e.g., /filter/... or /product/...) for frequency analysis.
  * User Agent: Extract the user agent string for bonus analysis.

2. Process the Logs Concurrently
  * Use concurrent processing to read and process log files in parallel, leveraging asynchronous file I/O to handle large datasets.
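
As a quick illustration of the extraction step, the sketch below applies a regular expression with named groups to the sample entry and pulls out the three fields listed above. It uses the same pattern as the later steps, split across lines for readability. The sample line has its resource shortened to `/filter/p53`, which is an assumption made purely to keep the example compact.

```python
import re
from datetime import datetime

# Pattern with named groups for a log line whose referrer
# and final field are both "-"
log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<http_version>\S+)" '
    r'(?P<status>\d+) (?P<size>\S+) "-" "(?P<user_agent>[^"]+)" "-"'
)

# Sample entry from above, with the resource shortened for readability
line = (
    '54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] '
    '"GET /filter/p53 HTTP/1.1" 200 30577 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"'
)

match = log_pattern.match(line)
timestamp = datetime.strptime(match.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
hour = timestamp.strftime("%H")

print(match.group("resource"))
print(hour)
print(match.group("user_agent"))
```

Note that the pattern expects the literal `"-"` in the referrer position, so a line that carries a referrer URL would not match and would be skipped by an `if match:` check.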

**Step 1: Import the Data**

In [None]:
import kagglehub

# Download latest version
log_path = kagglehub.dataset_download("eliasdabbas/web-server-access-logs")

print("Path to dataset files:", log_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/eliasdabbas/web-server-access-logs?dataset_version_number=2...


100%|██████████| 267M/267M [00:03<00:00, 84.3MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/eliasdabbas/web-server-access-logs/versions/2


**Step 2: Sequential Baseline**

Read through the code below and fill in the gaps to build a sequential solution that processes the logs one file at a time, extracts the data, and computes the required metrics.

In [None]:
#@title Fill the gaps
import os
from collections import Counter
from datetime import datetime
import time
import re

# Folder containing log files

# Regular expression to parse log entries
log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<resource>\S+) (?P<http_version>\S+)" (?P<status>\d+) (?P<size>\S+) "-" "(?P<user_agent>[^"]+)" "-"'
)

def process_logs_sequential(log_folder):
    resource_counter = Counter()
    hourly_requests = Counter()
    user_agent_counter = Counter()

    # Iterate through each log file
    for log_file in ...:
        with open(os.path.join(log_folder, log_file), "r") as file:
            for line in file:
                match = log_pattern.match(line)
                if match:
                    # Extract relevant data
                    timestamp = match.group("timestamp")
                    resource = match.group(...)
                    user_agent = match.group(...)

                    # Parse timestamp and extract hour
                    hour = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").strftime(...)

                    # Update counters
                    resource_counter[resource] += ...
                    hourly_requests...
                    ...

    return resource_counter, ..., ...

# Measure runtime
start_time = time.time()
resource_counter, hourly_requests, user_agent_counter = process_logs_sequential(...)
end_time = time.time()

print(f"Most Frequent Resources: {resource_counter.most_common(5)}")
print(f"Requests Per Hour: {hourly_requests}")
print(f"Top User Agents: {user_agent_counter.most_common(5)}")
print(f"Sequential Processing Time: {end_time - start_time:.2f} seconds")

In [None]:
#@title Solution
import os
from collections import Counter
from datetime import datetime
import time
import re

# Folder containing log files: log_path (downloaded in Step 1)

# Regular expression to parse log entries
log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<resource>\S+) (?P<http_version>\S+)" (?P<status>\d+) (?P<size>\S+) "-" "(?P<user_agent>[^"]+)" "-"'
)

def process_logs_sequential(log_folder):
    resource_counter = Counter()
    hourly_requests = Counter()
    user_agent_counter = Counter()

    # Iterate through each log file
    for log_file in os.listdir(log_folder):
        with open(os.path.join(log_folder, log_file), "r") as file:
            for line in file:
                match = log_pattern.match(line)
                if match:
                    # Extract relevant data
                    timestamp = match.group("timestamp")
                    resource = match.group("resource")
                    user_agent = match.group("user_agent")

                    # Parse timestamp and extract hour
                    hour = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").strftime("%H")

                    # Update counters
                    resource_counter[resource] += 1
                    hourly_requests[hour] += 1
                    user_agent_counter[user_agent] += 1

    return resource_counter, hourly_requests, user_agent_counter

# Measure runtime
start_time = time.time()
resource_counter, hourly_requests, user_agent_counter = process_logs_sequential(log_path)
end_time = time.time()

print(f"Most Frequent Resources: {resource_counter.most_common(5)}")
print(f"Requests Per Hour: {hourly_requests}")
print(f"Top User Agents: {user_agent_counter.most_common(5)}")
print(f"Sequential Processing Time: {end_time - start_time:.2f} seconds")


Most Frequent Resources: [('/static/css/font/wyekan/font.woff', 177138), ('/favicon.ico', 49861), ('/static/images/guarantees/bestPrice.png', 39659), ('/image/33888?name=model-b2048u-1-.jpg&wh=200x200', 28460), ('/static/images/guarantees/fastDelivery.png', 27744)]
Requests Per Hour: Counter({'19': 91121, '14': 90751, '13': 87648, '15': 86102, '18': 85930, '17': 85049, '16': 83409, '12': 78211, '10': 77562, '20': 76382, '11': 76311, '09': 73525, '22': 71467, '21': 70506, '00': 67433, '23': 67016, '08': 61534, '01': 56605, '07': 53587, '06': 45973, '02': 43672, '05': 39573, '04': 39412, '03': 34313})
Top User Agents: [('Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', 445374), ('Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)', 197769), ('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', 1

**Step 3: Parallel Processing**

Now you want to try to optimise the solution. Let's try this with `multiprocessing`. Fill the gaps in the solution below:

In [None]:
from multiprocessing import Pool
from collections import Counter
from datetime import datetime
import os
import re
import time

# Regular expression to parse log entries
log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<resource>\S+) (?P<http_version>\S+)" (?P<status>\d+) (?P<size>\S+) "-" "(?P<user_agent>[^"]+)" "-"'
)

# Function to process a single log file
def process_log_file(log_file):
    ...

# Function to aggregate results from all processes
def aggregate_results(results):
    total_resource_counter = Counter()
    ...
    ...

    for resource_counter, ..., ... in ...:
        total_resource_counter.update(resource_counter)
        ...
        ...

    return total_resource_counter, ... , ...

# Parallel processing function
def parallel_process_logs(log_folder):
    log_files = [os.path.join(log_folder, log_file) for log_file in ...]

    # Use multiprocessing to process log files in parallel
    with Pool() as pool:
        results = pool.map(..., ...)

    # Aggregate results from all processes
    return aggregate_results(results)

# Measure runtime
if __name__ == "__main__":
    start_time = time.time()
    resource_counter, ..., ... = parallel_process_logs(log_path)
    end_time = time.time()

    print(f"Most Frequent Resources: {resource_counter.most_common(5)}")
    print(f"Requests Per Hour: {hourly_requests}")
    print(f"Top User Agents: {user_agent_counter.most_common(5)}")
    print(f"Parallel Processing Time: {end_time - start_time:.2f} seconds")

In [None]:
#@title Solution
from multiprocessing import Pool
from collections import Counter
from datetime import datetime
import os
import re
import time

# Regular expression to parse log entries
log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<resource>\S+) (?P<http_version>\S+)" (?P<status>\d+) (?P<size>\S+) "-" "(?P<user_agent>[^"]+)" "-"'
)

# Function to process a single log file
def process_log_file(log_file):
    resource_counter = Counter()
    hourly_requests = Counter()
    user_agent_counter = Counter()

    with open(log_file, "r") as file:
        for line in file:
            match = log_pattern.match(line)
            if match:
                # Extract relevant data
                timestamp = match.group("timestamp")
                resource = match.group("resource")
                user_agent = match.group("user_agent")

                # Parse timestamp and extract the hour of day (same "%H" format
                # as the sequential baseline, so the results are comparable)
                hour = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").strftime("%H")

                # Update counters
                resource_counter[resource] += 1
                hourly_requests[hour] += 1
                user_agent_counter[user_agent] += 1

    return resource_counter, hourly_requests, user_agent_counter

# Function to aggregate results from all processes
def aggregate_results(results):
    total_resource_counter = Counter()
    total_hourly_requests = Counter()
    total_user_agent_counter = Counter()

    for resource_counter, hourly_requests, user_agent_counter in results:
        total_resource_counter.update(resource_counter)
        total_hourly_requests.update(hourly_requests)
        total_user_agent_counter.update(user_agent_counter)

    return total_resource_counter, total_hourly_requests, total_user_agent_counter

# Parallel processing function
def parallel_process_logs(log_folder):
    log_files = [os.path.join(log_folder, log_file) for log_file in os.listdir(log_folder)]

    # Use multiprocessing to process log files in parallel
    with Pool() as pool:
        results = pool.map(process_log_file, log_files)

    # Aggregate results from all processes
    return aggregate_results(results)

# Measure runtime
if __name__ == "__main__":
    start_time = time.time()
    resource_counter, hourly_requests, user_agent_counter = parallel_process_logs(log_path)
    end_time = time.time()

    print(f"Most Frequent Resources: {resource_counter.most_common(5)}")
    print(f"Requests Per Hour: {hourly_requests}")
    print(f"Top User Agents: {user_agent_counter.most_common(5)}")
    print(f"Parallel Processing Time: {end_time - start_time:.2f} seconds")
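
The aggregation step relies on `Counter.update` adding counts together rather than replacing them, which is different from `dict.update`. A minimal sketch with made-up counts (the resource names and numbers are illustrative only):

```python
from collections import Counter

# Per-file results as two worker processes might return them (made-up values)
file_a = Counter({"/favicon.ico": 2, "/product/1": 1})
file_b = Counter({"/favicon.ico": 3, "/product/2": 4})

total = Counter()
for partial in (file_a, file_b):
    # Counter.update adds to existing counts instead of overwriting them
    total.update(partial)

print(total.most_common(2))
```

Because each worker returns its own counters and the merge happens in the parent process, no shared state is needed between the processes.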