# EX2-SYS: Jupyter, Python processes, measuring performance

Your assignment: complete the `TODO` items and include the **output of each cell**.

### Step 1: Download the HDFS log file (open data) and unzip it. Check with the professor: this file may be available internally.

In [1]:
!mkdir -p data/

In [2]:
![ -e "data/hdfs/HDFS.log" ] || (wget https://zenodo.org/records/8196385/files/HDFS_v1.zip -P data/ && unzip -o data/HDFS_v1.zip -d data/hdfs && rm data/HDFS_v1.zip)

Connecting to zenodo.org (188.185.45.92:443)
saving to 'data/HDFS_v1.zip'
HDFS_v1.zip          100% |********************************|  177M  0:00:00 ETA
'data/HDFS_v1.zip' saved
Archive:  data/HDFS_v1.zip
  inflating: HDFS.log
   creating: preprocessed/
  inflating: preprocessed/anomaly_label.csv
  inflating: preprocessed/Event_occurrence_matrix.csv
  inflating: preprocessed/Event_traces.csv
  inflating: preprocessed/HDFS.log_templates.csv
  inflating: preprocessed/HDFS.npz
  inflating: README.md


### Step 2: This exercise processes the file `data/hdfs/HDFS.log`. First, write a Python function that counts the number of lines in this file.

In [3]:
def count_lines(file_path):
    # Iterate over the file lazily so it is never loaded fully into memory
    with open(file_path, 'r') as file:
        return sum(1 for _ in file)
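
A possible alternative, shown only as a sketch: counting newline bytes in fixed-size binary chunks avoids decoding the text. The name `count_lines_binary` and the 1 MiB default chunk size are assumptions made for illustration, not part of the assignment.

```python
def count_lines_binary(file_path, chunk_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            count += chunk.count(b'\n')
    return count
```

This counts newline characters, so if the last line has no trailing newline the result is one less than what `count_lines()` returns.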

### Step 3: Test your function

In [4]:
file_path = 'data/hdfs/HDFS.log'
n = count_lines(file_path)
print('File %s has %d lines.' % (file_path, n))

File data/hdfs/HDFS.log has 11175629 lines.


### Step 4: Also, get the size of the input file (in bytes)

In [5]:
!ls -l data/hdfs/HDFS.log
!ls -l data/hdfs/HDFS.log | awk '{print $5}'

-rw-r--r--    1 root     root     1577982906 Mar 31 03:11 data/hdfs/HDFS.log
1577982906
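
The same value can probably be obtained without leaving Python. A minimal sketch using the standard library; the helper name `file_size_bytes` is hypothetical:

```python
import os

def file_size_bytes(file_path):
    """Return the file size in bytes, as reported by the filesystem."""
    return os.path.getsize(file_path)
```

Calling `file_size_bytes('data/hdfs/HDFS.log')` should report the same number as the fifth column of `ls -l`.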


### Step 5: List the running Python processes

In [6]:
!pgrep -af '[p]ython'

1 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
17 /app/.venv/bin/python3 -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-b30f322e-7274-4618-82ac-5a5949fff650.json


### Step 6: Python threads and parent/child processes

In [7]:
!ps -eLf | head -1
!ps -eLf | grep -i '[p]ython'

UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  2    6 03:09 ?        00:00:02 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
root         1     0     9  0    6 03:09 ?        00:00:00 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
root         1     0    10  0    6 03:09 ?        00:00:00 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
root         1     0    18  0    6 03:10 ?        00:00:00 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
root         1     0    19  0    6 03:10 ?        00:00:00 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
root         1     0    41  0    6 03:10 ?        00:00:00 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
root        17     1    17  5   11 03:10 ?       

### Step 7: Interpret and write down what the output of the last two commands actually means (process and thread hierarchy)

#### Command `!pgrep -af '[p]ython'`
- This command searches for all running processes whose full command line matches `python` and prints the PID and the full command line of each one.

##### First output: 1 /app/.venv/bin/python3 /app/.venv/bin/jupyter-lab --no-browser --ip=0.0.0.0 --allow-root
- 1 = PID of the process
- /app/.venv/bin/python3 = path to the Python 3 interpreter
- /app/.venv/bin/jupyter-lab = the JupyterLab launcher script
- --no-browser = no browser window is opened automatically when the server starts
- --ip=0.0.0.0 = the server listens on all network interfaces, so it accepts connections from any IP address
- --allow-root = allows Jupyter to run as the root user

##### Second output: 17 /app/.venv/bin/python3 -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-b30f322e-7274-4618-82ac-5a5949fff650.json
- 17 = PID of the process
- /app/.venv/bin/python3 = path to the Python 3 interpreter
- -m ipykernel_launcher = starts the Jupyter kernel that executes the notebook cells
- -f /root/.local/share/jupyter/runtime/kernel-b30f322e-7274-4618-82ac-5a5949fff650.json = path to the kernel configuration file, which holds the connection information

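#### Command `!ps -eLf | head -1`
- This command lists every process and thread in the system (`-e` selects all processes, `-L` shows threads, `-f` uses the full format), and `head -1` keeps only the first line, i.e. the column header: UID, PID, PPID, LWP, C, NLWP, STIME, TTY, TIME, CMD.
#### Command `!ps -eLf | grep -i '[p]ython'`
- This command produces the same listing but keeps only the lines that contain `python` (case-insensitive), i.e. one line per thread of each running Python process. Writing the pattern as `[p]ython` prevents `grep` from matching its own command line.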

#### Output explanation:
- The JupyterLab process (PID 1) has 6 threads (LWP = 1, 9, 10, 18, 19, 41 | NLWP = 6).
- The Python kernel process (PID 17) has 11 threads (NLWP = 11); its main thread has LWP = 17.
- JupyterLab (PID 1, PPID 0) is the main process, and the kernel's PPID is 1, so the kernel is its child process. Both run as root and each has multiple threads.
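
The grouping of threads by process can also be derived programmatically. The sketch below parses `ps -eLf`-style text and collects the LWP ids per PID; the function name `threads_per_process` and the shortened `sample` text (including its command names and TIME values) are illustrative assumptions, not captured output.

```python
from collections import defaultdict

def threads_per_process(ps_output):
    """Map PID -> list of LWP (thread) ids from `ps -eLf`-style text."""
    threads = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (UID PID PPID LWP ...).
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        pid, lwp = int(fields[1]), int(fields[3])
        threads[pid].append(lwp)
    return dict(threads)

# Shortened, illustrative listing in the same column layout as `ps -eLf`
sample = """\
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  2    6 03:09 ?        00:00:02 python3 jupyter-lab
root         1     0     9  0    6 03:09 ?        00:00:00 python3 jupyter-lab
root        17     1    17  5   11 03:10 ?        00:00:01 python3 -m ipykernel_launcher
"""
print(threads_per_process(sample))
```

The length of each list should match the NLWP column when the full listing is passed in.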

### Step 8: Write a function that categorizes the lines in `HDFS.log` into a nested dictionary:

- First-level key is the log type (info, warn, error, etc.).
- Second-level key is the class that produced the log line.

```json
"info": {
    "class1": [...],
    "class2": [...],
    ...
},
"warn": {
    "class1": [...],
    "class2": [...],
    ...
},
...
```

In [8]:
def get_logfile_data_as_dict(file_path):
    data = dict()

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Expected layout: date time pid LEVEL class: message
            parts = line.split(" ", 4)
            if len(parts) < 5:
                continue  # skip lines that do not follow the layout

            log_level = parts[3].lower()
            # The class name ends at the first ": "; the rest is the message
            class_name, message = parts[4].split(": ", 1) if ": " in parts[4] else (parts[4], "")

            # Create the nested level/class entries on first use
            data.setdefault(log_level, {}).setdefault(class_name, []).append(message.strip())

    return data
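
To make the splitting logic easier to follow, here is a sketch that applies the same steps to a single line. The helper `parse_log_line` and the `sample` line are hypothetical; the sample is only assumed to follow the `date time pid LEVEL class: message` layout that the function above relies on.

```python
def parse_log_line(line):
    """Split one log line into (level, class_name, message), or None if malformed."""
    parts = line.split(" ", 4)
    if len(parts) < 5:
        return None
    level = parts[3].lower()
    # partition() splits at the first ": " only, so later colons stay in the message
    class_name, _sep, message = parts[4].partition(": ")
    return level, class_name, message.strip()

# Illustrative line (assumed layout), not copied from HDFS.log
sample = "081109 203518 143 INFO dfs.DataNode$DataXceiver: Receiving block blk_1 src: /10.0.0.1:50010\n"
print(parse_log_line(sample))
```

Splitting with a maximum of 4 separators keeps the whole `class: message` tail in one piece, which is why spaces inside the message do not break the parsing.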

### Step 9: Measure the throughput of `count_lines()` and `get_logfile_data_as_dict()`

In [9]:
import time

def measure_time_count_lines(file_path):
    # perf_counter() is a monotonic, high-resolution clock meant for timing
    start_time = time.perf_counter()
    count_lines(file_path)
    elapsed_time = time.perf_counter() - start_time
    return elapsed_time

# TODO: do the same with the other function get_logfile_data_as_dict()
def measure_time_get_logfile_data_as_dict(file_path):
    start_time = time.perf_counter()
    get_logfile_data_as_dict(file_path)
    elapsed_time = time.perf_counter() - start_time
    return elapsed_time
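
These functions return elapsed time; throughput is the amount of work divided by that time. A minimal sketch, where the helper `throughput` is a hypothetical name, the line and byte counts are the ones reported in Steps 3 and 4, and `elapsed` is a placeholder to be replaced by a measured value:

```python
def throughput(amount, elapsed_seconds):
    """Return processed units per second (lines/s, bytes/s, ...)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return amount / elapsed_seconds

num_lines = 11175629      # from Step 3
num_bytes = 1577982906    # from Step 4
elapsed = 3.0             # hypothetical runtime in seconds, not a measurement

print(f"{throughput(num_lines, elapsed):,.0f} lines/s")
print(f"{throughput(num_bytes, elapsed) / 1e6:,.1f} MB/s")
```

Reporting both lines/s and MB/s makes the two functions comparable even though they do different amounts of work per line.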

### Step 10: Replication: repeat the previous functions a number of times, report each time

In [10]:
num_replications = 3

print("Time for count_lines:")
for i in range(num_replications):
    time_taken = measure_time_count_lines(file_path)
    print(f"Replication {i+1}: {time_taken:.6f} seconds")

print("\nTime for get_logfile_data_as_dict:")
for i in range(num_replications):
    time_taken = measure_time_get_logfile_data_as_dict(file_path)
    print(f"Replication {i+1}: {time_taken:.6f} seconds")

Time for count_lines:
Replication 1: 2.864922 seconds
Replication 2: 2.853767 seconds
Replication 3: 2.827930 seconds

Time for get_logfile_data_as_dict:
Replication 1: 19.671370 seconds
Replication 2: 17.638382 seconds
Replication 3: 16.379560 seconds


### Step 11: Take the average, minimum, maximum and standard deviation of those runtime values

In [11]:
import statistics

num_replications = 3

def measure_statistics(file_path, measure_func):
    times = []
    for i in range(num_replications):
        time_taken = measure_func(file_path)
        times.append(time_taken)
    
    avg_time = statistics.mean(times)  
    min_time = min(times)
    max_time = max(times)  
    stddev_time = statistics.stdev(times) 

    return avg_time, min_time, max_time, stddev_time

In [12]:
print("Statistics for count_lines:")
avg_time, min_time, max_time, stddev_time = measure_statistics(file_path, measure_time_count_lines)
print(f"Average: {avg_time:.6f} seconds")
print(f"Minimum: {min_time:.6f} seconds")
print(f"Maximum: {max_time:.6f} seconds")
print(f"Standard Deviation: {stddev_time:.6f} seconds")

print("\nStatistics for get_logfile_data_as_dict:")
avg_time, min_time, max_time, stddev_time = measure_statistics(file_path, measure_time_get_logfile_data_as_dict)
print(f"Average: {avg_time:.6f} seconds")
print(f"Minimum: {min_time:.6f} seconds")
print(f"Maximum: {max_time:.6f} seconds")
print(f"Standard Deviation: {stddev_time:.6f} seconds")


Statistics for count_lines:
Average: 2.882861 seconds
Minimum: 2.751927 seconds
Maximum: 3.049078 seconds
Standard Deviation: 0.151685 seconds

Statistics for get_logfile_data_as_dict:
Average: 15.895054 seconds
Minimum: 15.612379 seconds
Maximum: 16.148815 seconds
Standard Deviation: 0.269384 seconds
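
The same statistics can be computed from runtimes that were already recorded, without re-running the measurement. A sketch, assuming a hypothetical helper `summarize` and reusing the `count_lines` runtimes printed in Step 10:

```python
import statistics

def summarize(times):
    """Return mean, min, max and sample standard deviation of a list of runtimes."""
    return {
        "mean": statistics.mean(times),
        "min": min(times),
        "max": max(times),
        # statistics.stdev is the sample standard deviation (n - 1 denominator)
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }

# Runtimes of count_lines reported in Step 10
step10_count_lines = [2.864922, 2.853767, 2.827930]
for name, value in summarize(step10_count_lines).items():
    print(f"{name}: {value:.6f} seconds")
```

With only three replications the standard deviation is a rough estimate; more replications would give a more stable value.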


### Step 12: Response time

In [13]:
server_code = """import socket
import time
import random

def process_message(message):
    print(f"Received message: {message}")
    time.sleep(random.uniform(0, 1)) # sleep some time, emulate varying serve time
    if message == "stop":
        return True, f"Processed: {message}"
    else:
        return False, f"Processed: {message}"    

def start_server(host='0.0.0.0', port=12345):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
        server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server_socket.bind((host, port))
        server_socket.listen(1)
        print(f"Listening on {host}:{port}...")

        stop = False

        while not stop:
        
            conn, addr = server_socket.accept()
            with conn:
                print(f"Connection from {addr}")
                data = conn.recv(1024).decode().strip()
                if data:
                    stop, response = process_message(data)
                    conn.sendall(response.encode())
                
if __name__ == "__main__":
    start_server()
"""

# Write the code to the file
server_file_path = '/tmp/server.py'
with open(server_file_path, "w") as file:
    file.write(server_code)

print(f"Python code written to {server_file_path}")

Python code written to /tmp/server.py


In [14]:
client_code = """import socket
import sys
import time

def send_message(host='127.0.0.1', port=12345, message='Hello, Server!'):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client_socket:
        client_socket.connect((host, port))
        client_socket.sendall(message.encode())
        response = client_socket.recv(1024).decode()
        print(f"Server response: {response}")

if __name__ == "__main__":
    message = sys.argv[1]
    start_time = time.time()
    send_message(message=message)
    elapsed_time = time.time() - start_time
    print(f"Response time: {elapsed_time} seconds")
"""

# Write the code to the file
client_file_path = '/tmp/client.py'
with open(client_file_path, "w") as file:
    file.write(client_code)

print(f"Python code written to {client_file_path}")

Python code written to /tmp/client.py


### Step 13: TODO: Open two terminals:

- Start by running this in each one to activate the Python virtual environment, so that a *specific* Python installation is used, including its packages and versions:

```bash
source /app/.venv/bin/activate
```

- First one: run the server
- Second one: run the client (a few times)
- Include the output here

```text
Output:
...
(.venv) a158508c57eb:/app/hostdir# python /tmp/client.py "Hello World"
Server response: Processed: Hello World
Response time: 0.47737956047058105 seconds
(.venv) a158508c57eb:/app/hostdir# python /tmp/client.py "Teste"
Server response: Processed: Teste
Response time: 0.2700643539428711 seconds
(.venv) a158508c57eb:/app/hostdir# python /tmp/client.py "Hey Jude"
Server response: Processed: Hey Jude
Response time: 0.9488439559936523 seconds
(.venv) a158508c57eb:/app/hostdir# 

(.venv) a158508c57eb:/app/hostdir# python /tmp/server.py
Listening on 0.0.0.0:12345...
Connection from ('127.0.0.1', 38650)
Received message: Hello World
Connection from ('127.0.0.1', 41168)
Received message: Teste
Connection from ('127.0.0.1', 55426)
Received message: Hey Jude

```
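
Running the client by hand gives one response time per invocation. As a sketch of how the measurements could be collected in one go, the code below times several requests from a single script. To stay self-contained it starts a small stand-in server in a thread on an OS-chosen port; `timed_request` and `serve_n_requests` are hypothetical helpers, and the stand-in does not sleep like `server.py` does.

```python
import socket
import threading
import time

def timed_request(host, port, message):
    """Send one message, wait for the reply, and return (reply, seconds)."""
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(message.encode())
        reply = sock.recv(1024).decode()
    return reply, time.perf_counter() - start

def serve_n_requests(server_socket, n):
    """Hypothetical stand-in for server.py: answer n requests, then return."""
    for _ in range(n):
        conn, _addr = server_socket.accept()
        with conn:
            data = conn.recv(1024).decode().strip()
            conn.sendall(f"Processed: {data}".encode())

messages = ["Hello World", "Teste", "Hey Jude"]

# Bind to port 0 so the OS picks a free port (server.py itself uses 12345).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
    server_socket.bind(("127.0.0.1", 0))
    server_socket.listen(1)
    port = server_socket.getsockname()[1]
    worker = threading.Thread(target=serve_n_requests, args=(server_socket, len(messages)))
    worker.start()
    results = [timed_request("127.0.0.1", port, m) for m in messages]
    worker.join()

for (reply, seconds), message in zip(results, messages):
    print(f"{message!r}: {reply!r} in {seconds:.6f} seconds")
```

Against the real `server.py`, `timed_request("127.0.0.1", 12345, ...)` would include the random sleep, so the times should vary between roughly 0 and 1 second.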

### Step 14: Modify server.py to:
1. Build an in-memory dict using `get_logfile_data_as_dict()` for the HDFS file.
2. Make `process_message()` take one of the log types (info, warn, error, etc.) as input and return the **total number of string characters** over all log lines of that type.
3. The code below contains the function that must be implemented.

In [18]:
server_code_modified = """import socket

def get_logfile_data_as_dict(file_path):
    data = dict()

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.split(" ", 4)
            if len(parts) < 5:
                continue
            
            log_level = parts[3].lower()
            class_name, message = parts[4].split(": ", 1) if ": " in parts[4] else (parts[4], "")

            if log_level not in data:
                data[log_level] = {}

            if class_name not in data[log_level]:
                data[log_level][class_name] = []

            data[log_level][class_name].append(message.strip())

    return data

def process_message(message, log_data):
    log_level = message.lower()

    if log_level not in log_data:
        return False, f"Log level '{log_level}' not found."

    total_characters = sum(len(msg) for class_name in log_data[log_level] for msg in log_data[log_level][class_name])
    
    return True, f"Total characters in '{log_level}' logs: {total_characters}"

def start_server(host='0.0.0.0', port=12345):
    logfile_path = 'data/hdfs/HDFS.log'
    log_data = get_logfile_data_as_dict(logfile_path)  
    
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
        server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server_socket.bind((host, port))
        server_socket.listen(1)
        print(f"Listening on {host}:{port}...")

        stop = False

        while not stop:
            conn, addr = server_socket.accept()
            with conn:
                print(f"Connection from {addr}")
                data = conn.recv(1024).decode().strip()
                if data:
                    stop, response = process_message(data, log_data) 
                    conn.sendall(response.encode())

if __name__ == "__main__":
    start_server()

"""

# Write the code to the file
server_file_path = '/tmp/server_modified.py'
with open(server_file_path, "w") as file:
    file.write(server_code_modified)

print(f"Python code written to {server_file_path}")

Python code written to /tmp/server_modified.py
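
In the server loop, the first value returned by `process_message()` is assigned to `stop`, so returning `True` for a known log level ends the loop after the first successful query. The sketch below is one possible variant, assuming the intended behaviour is to keep serving until a `stop` message arrives, as in the original `server.py`. The `sample` dictionary and the "Stopping server." reply are hypothetical.

```python
def process_message(message, log_data):
    """Return (stop, response); stop is True only for the 'stop' command."""
    log_level = message.strip().lower()

    if log_level == "stop":
        return True, "Stopping server."

    if log_level not in log_data:
        return False, f"Log level '{log_level}' not found."

    # Sum the length of every message of every class under this log level
    total_characters = sum(
        len(msg)
        for messages in log_data[log_level].values()
        for msg in messages
    )
    return False, f"Total characters in '{log_level}' logs: {total_characters}"

# Tiny hypothetical dictionary with the same shape as the HDFS data
sample = {"info": {"ClassA": ["abc", "de"], "ClassB": ["f"]}}
print(process_message("INFO", sample))
print(process_message("stop", sample))
```

With this variant the dictionary is built once at start-up and reused across many queries, which is the main benefit of keeping it in memory.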


### Step 15: Measure response time with the modified version of server: `server_modified.py`

(.venv) a158508c57eb:/app/hostdir# python /tmp/server_modified.py
Listening on 0.0.0.0:12345...
Connection from ('127.0.0.1', 54848)
(.venv) a158508c57eb:/app/hostdir# 

(.venv) a158508c57eb:/app/hostdir# python /tmp/client.py info
Server response: Total characters in 'info' logs: 1002372310
Response time: 1.0874755382537842 seconds
(.venv) a158508c57eb:/app/hostdir# 