# Data Analysis Assignment 1
**Due Date:** December 14, 2025, 23:59  
**Total Points:** 114 + 15 bonus points

## Copyright and Fair Use

This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this material, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

## Guidelines

1. **DO NOT add or delete any cells (or modify cell IDs)**
2. Complete code cells marked with `# YOUR CODE HERE`
3. Comment out or remove lines with `raise NotImplementedError()`
4. Run all cells before submission to verify your solutions
5. Submit the notebook (.ipynb file) on Moodle with a filename in the correct format, e.g., **Assignment_1_JohnDoe_12345678.ipynb**

<div style="background-color: #e6f3ff; padding: 15px; margin: 10px; border-left: 5px solid #2196F3; border-radius: 3px;">
In this assignment, you will apply preprocessing and analysis techniques to a real-world dataset. The preprocessing steps (handling invalid data, removing outliers, and dealing with missing values) are adapted to the characteristics of system performance data. Your task is to look for patterns in system behavior that could indicate test system anomalies. Understanding the daily/nightly testing cycles will be crucial for identifying abnormal states.
</div>


# System Performance Analysis

## Background
This assignment analyzes performance data from test systems that perform nightly testing of industrial network devices (switches and routers) at Westermo. These test systems validate devices used in critical applications like energy distribution and railway systems.

When tests fail, the cause could be:
1. Actual issues with the software under test
2. Problems in the test framework code
3. Hardware setup issues (e.g., wrong cable connections, unpowered devices)
4. Server issues (e.g., full disk)

The key question is: "If the test system is in an abnormal state – can we trust the test results?"

### The Dataset
The complete dataset consists of 19 CSV files (one per test system), each containing:
- About 86,000 samples
- Collected over one month
- Sampled twice per minute
- Over 20 performance metrics
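
As a quick sanity check on these figures, the sample count follows from the sampling rate. The sketch below assumes a 30-day month, which the description does not state explicitly.

```python
# Rough sample-count check; the 30-day month length is an assumption.
samples_per_minute = 2
minutes_per_day = 60 * 24
days = 30  # assumed month length

expected_samples = samples_per_minute * minutes_per_day * days
print(expected_samples)  # 86400
```

This lands close to the "about 86,000 samples" quoted above.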

Data accessibility and further information: https://github.com/westermo/test-system-performance-dataset

**Note:** To learn more about the dataset, read the attached PDF file and check out the dataset link above.

### Focus of Our Analysis
For this assignment, we'll analyze one test system focusing on these key metrics:

1. **System Load (load-15m)**
   - 15-minute average of system workload
   - Shows test execution patterns
   - Peaks during night testing, low during day

2. **Memory Usage (memory_used_pct)**
   - Percentage of total memory used
   - Calculated from available/total memory
   - Indicates resource utilization

3. **CPU Usage (cpu-user)**
   - Rate of change in seconds spent on user processes
   - Shows changes in processing activity
   - Higher values indicate increasing CPU time use

4. **Temperature Change (sys-thermal)**
   - Rate of change in system temperature
   - Optional metric (not on all systems)
   - Helps detect system stress

5. **Server Status (server-up)**
   - System heartbeat indicator
   - Values > 0 show server availability
   - Critical for validating system operation
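
To make the heartbeat metric concrete, here is a minimal sketch that turns `server-up` into an uptime share. The sample values are hypothetical; the only rule taken from the description is that values > 0 indicate availability.

```python
import pandas as pd

# Hypothetical heartbeat readings; per the description, values > 0 mean the server is up.
server_up = pd.Series([2, 2, 0, 2, 0, 2, 2, 2], name="server-up")

is_up = server_up > 0            # boolean availability flag per sample
uptime_fraction = is_up.mean()   # share of samples with the server available
print(f"Uptime: {uptime_fraction:.1%}")
```

The same boolean mask can be reused to restrict later plots or statistics to periods in which the server was available.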


In [None]:
# Initial setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting
plt.rcParams.update({
    'figure.figsize': [12, 8],
    'figure.dpi': 150,
    'figure.autolayout': True,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'font.size': 12
})

sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)

np.random.seed(42)


## Task 1.1: Data Loading and Preparation (10 points)

Create a function that loads and prepares the system performance data. Your function should:

1. Load the data file
2. Convert Unix timestamps to datetime
3. Calculate memory usage percentage
4. Drop unneeded columns 

Columns in original dataset:
```
'timestamp', 'load-1m', 'load-5m', 'load-15m', 'sys-mem-swap-total', 'sys-mem-swap-free', 'sys-mem-free', 'sys-mem-cache', 'sys-mem-buffered', 'sys-mem-available', 'sys-mem-total', 'sys-fork-rate', 'sys-interrupt-rate', 'sys-context-switch-rate', 'sys-thermal', 'disk-io-time', 'disk-bytes-read', 'disk-bytes-written', 'disk-io-read', 'disk-io-write', 'cpu-iowait', 'cpu-system', 'cpu-user', 'server-up'
```

Hint:
```python
def load_system_data(file_path):
    # Read CSV
    df = pd.read_csv(file_path)
    
    # Convert timestamp
    df['datetime'] = pd.to_datetime(df['timestamp'], unit='s', errors='coerce')

    #set index to datetime
    df.set_index('datetime', inplace=True)
    
    # Calculate memory usage
    df['memory_used_pct'] = (1 - df['sys-mem-available']/df['sys-mem-total']) * 100
    
    return df
```

After loading the data and adding ```'memory_used_pct'```, the dataframe should contain the following columns (with `datetime` as the index):
```
['datetime', 'load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal', 'server-up']
```
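
One possible way to handle step 4 is to select the columns to keep instead of listing every column to drop. The sketch below is only an illustration: `load_selected`, `KEEP`, and the two-row CSV excerpt are hypothetical and use just a subset of the original columns.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt with only a subset of the original columns.
csv_text = """timestamp,load-1m,load-15m,sys-mem-available,sys-mem-total,sys-thermal,cpu-user,server-up
1700000000,0.10,0.05,900,1000,0.0,0.01,2
1700000030,0.20,0.06,800,1000,0.5,0.02,2
"""

KEEP = ["load-15m", "memory_used_pct", "cpu-user", "sys-thermal", "server-up"]

def load_selected(source) -> pd.DataFrame:
    """Load the CSV, derive memory_used_pct, and keep only the columns in KEEP."""
    df = pd.read_csv(source)
    # Unix seconds -> datetime; unparseable values become NaT
    df["datetime"] = pd.to_datetime(df["timestamp"], unit="s", errors="coerce")
    df = df.set_index("datetime")
    df["memory_used_pct"] = (1 - df["sys-mem-available"] / df["sys-mem-total"]) * 100
    return df[KEEP]

demo = load_selected(io.StringIO(csv_text))
print(demo)
```

Selecting by name keeps the function short and fails loudly with a `KeyError` if an expected column is missing from the file.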

In [None]:
def load_system_data(file_path: str) -> pd.DataFrame:
    """Load and prepare test system performance data.
    
    Parameters
    ----------
    file_path : str
        Path to the CSV data file
        
    Returns
    -------
    pd.DataFrame
        Processed dataframe with columns:
        - datetime: Timestamp (index)
        - load-15m: 15-minute load average
        - memory_used_pct: Calculated memory usage
        - cpu-user: Rate of change in CPU time
        - sys-thermal: Temperature change
        - server-up: System availability
    """
    # YOUR CODE HERE
    df = pd.read_csv(file_path)
    df['datetime'] = pd.to_datetime(df['timestamp'], unit='s', errors='coerce')  # coerce converts invalid timestamps into NaT
    df.set_index('datetime', inplace=True)
    df['memory_used_pct'] = (1 - df['sys-mem-available']/df['sys-mem-total'])*100
    drop = ['timestamp',
            'load-1m',
            'load-5m',
            'sys-mem-swap-total',
            'sys-mem-swap-free',
            'sys-mem-free',
            'sys-mem-cache',
            'sys-mem-buffered',
            'sys-mem-available',
            'sys-mem-total',
            'sys-fork-rate',
            'sys-interrupt-rate',
            'sys-context-switch-rate',
            'disk-io-time',
            'disk-bytes-read',
            'disk-bytes-written',
            'disk-io-read',
            'disk-io-write',
            'cpu-iowait',
            'cpu-system']
    df = df.drop(drop, axis=1)
    return df

    # raise NotImplementedError()

In [None]:
# Test cell (simply run it)
df = load_system_data('system-1.csv')

# Check required columns
required_cols = ['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal', 'server-up']
assert all(col in df.columns for col in required_cols), "Missing required columns"
assert isinstance(df.index, pd.DatetimeIndex), "Index should be datetime"
assert df['memory_used_pct'].between(0, 100).all(), "Memory usage should be percentage"
print("Basic data structure tests passed!")

print("Data Overview:")
print(f"Time range: {df.index.min()} to {df.index.max()}")
print(f"Number of samples: {len(df):,}")
print("\nFirst few rows:")
print(df.head())

## Task 1.2: Raw Data Overview Visualization (30 points)

Create a comprehensive visualization of the system metrics. You are free to adapt and add more plots; however, your visualization should at least include:

1. **Time Series Overview** (5 points)
   - Show all metrics over time
   - Highlight server availability status
   - Use appropriate alpha and line width
   - Add proper labels

2. **Daily Distribution** (5 points)
   - Create boxplots by hour
   - Show daily patterns
   - Consider server uptime periods
   
3. **Correlation Analysis** (5 points)
   - Create correlation matrix between metrics\
     (Use seaborn heatmap visualization) 

4. **Relationship Visualization** (5 points)
   - Scatter plot of key metrics
   - Hexbin plot for dense areas
   - Color scatter points by server status
     
5. **Layout and Formatting** (5 points)
    - Clear titles and labels
    - Appropriate color schemes
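
The hourly boxplots can be complemented with a numeric summary. The following sketch uses made-up data (low load during the day, high load at night) and assumes that samples with `server-up` > 0 are the ones worth keeping. It groups the remaining samples by hour of day.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one sample per hour for two days, higher load at night.
idx = pd.date_range("2025-01-01", periods=48, freq="h")
hours = idx.hour
load = np.where((hours >= 7) & (hours < 18), 0.02, 0.30)  # assumed day/night pattern
df_demo = pd.DataFrame({"load-15m": load, "server-up": 2}, index=idx)
df_demo.loc[idx[3], "server-up"] = 0  # simulate one sample with the server down

up = df_demo[df_demo["server-up"] > 0]  # keep only samples with the server available
hourly_median = up.groupby(up.index.hour)["load-15m"].median()
print(hourly_median.loc[[3, 12, 22]])
```

Grouping on `index.hour` is the same key that a boxplot by hour uses on its x-axis, so the medians printed here correspond to the center lines of the boxes.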
   
Hints for visualizations:
```python
# Correlation heatmap
corr_matrix = df[['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal', 'server-up']].corr()
sns.heatmap(corr_matrix,
            annot=True,      # Show correlation values
            cmap='coolwarm', # Diverging colormap
            center=0,        # Center colormap at 0
            fmt='.2f',       # Format coefficients
            ax=axes[1, 0])
axes[1, 0].set_title('Metric Correlations')

# Scatter plot with server status
scatter = axes[1, 1].scatter(df['cpu-user'], df['memory_used_pct'],
                           c=df['server-up'],  # Color by status
                           cmap='RdYlGn',      # Red-Yellow-Green
                           alpha=0.5)          
plt.colorbar(scatter, ax=axes[1, 1], label='Server Status')
axes[1, 1].set_xlabel('CPU Time Rate of Change (seconds)') 
axes[1, 1].set_title('CPU Time vs Memory Usage')        

# Hexbin for density
df.plot.hexbin(x='load-15m', y='cpu-user',
               gridsize=20,
               cmap='YlOrRd',
               ax=axes[2, 0])
axes[2, 0].set_title('Load vs CPU Usage Density')
```

In [None]:
# YOUR CODE HERE
def Time_Series_Overview(df: pd.DataFrame, alpha: float, label="Data", first=False):
    """Plot the five metrics as stacked time series and mark samples with server-up != 2 in red."""
    highlight = df['server-up'] != 2
    # (column, subplot title, y-axis label) for each stacked panel
    panels = [
        ('load-15m', "15m average workload", "Load Average"),
        ('memory_used_pct', "Memory used", "Memory [%]"),
        ('cpu-user', "CPU user", "CPU used"),
        ('sys-thermal', "System: Temp", "Temp Change [°C]"),
        ('server-up', "Server Status", "Server Status"),
    ]

    plt.suptitle('Time Series', fontsize=16, fontweight='bold', y=0.995)
    for row, (col, title, ylabel) in enumerate(panels, start=1):
        is_status = col == 'server-up'
        plt.subplot(5, 1, row)
        plt.plot(df.index, df[col], alpha=alpha, marker='o', linestyle='None', markersize=1, label=label)
        # Add the legend entry for the red markers only once (status panel of the first call)
        down_label = "Server-Down-Periods" if (first and is_status) else None
        plt.scatter(df.index[highlight], df[col][highlight], c='red', label=down_label)
        plt.title(title)
        plt.ylabel(ylabel)
        plt.legend(loc='lower right' if is_status else 'upper right')
    plt.xlabel("Datetime")

def Time_Series_Removed(df: pd.DataFrame, alpha: float):
    highlight = df['server-up'] != 2

    plt.suptitle('Time Series', fontsize=16, fontweight='bold', y=0.995)
    plt.subplot(5,1,1)
    plt.plot(df.index, df['load-15m'], alpha=alpha, marker='x', linestyle='None', markersize=5, label = "Removed Datapoints")
    plt.scatter(df.index[highlight], df['load-15m'][highlight], c='red')
    plt.ylabel("Load Average")
    plt.title("15m average workload")
    plt.legend(loc='upper right')

    plt.subplot(5,1,2)
    plt.plot(df.index, df['memory_used_pct'], alpha=alpha, marker='x', linestyle='None', markersize=5, label = "Removed Datapoints")
    plt.scatter(df.index[highlight], df['memory_used_pct'][highlight], c='red')
    plt.title("Memory used")
    plt.ylabel("Memory [%]")
    plt.legend(loc='upper right')

    plt.subplot(5,1,3)
    plt.plot(df.index, df['cpu-user'], alpha=alpha, marker='x', linestyle='None', markersize=5, label = "Removed Datapoints")
    plt.scatter(df.index[highlight], df['cpu-user'][highlight], c='red')
    plt.ylabel("CPU used")
    plt.title("CPU user")
    plt.legend(loc='upper right')

    plt.subplot(5,1,4)
    plt.plot(df.index, df['sys-thermal'], alpha=alpha, marker='x', linestyle='None', markersize=5, label = "Removed Datapoints")
    plt.scatter(df.index[highlight], df['sys-thermal'][highlight], c='red')
    plt.title("System: Temp")
    plt.ylabel("Temp Change[°C]")
    plt.legend(loc='upper right')

    plt.subplot(5,1,5)
    plt.scatter(df.index[highlight], df['server-up'][highlight], c='red', label= "Server-Down-Periods")
    plt.ylabel("Server Status")
    plt.title("Server Status")
    plt.xlabel("Datetime")
    plt.legend(loc='lower right')

def Daily_Distribution(df: pd.DataFrame, alpha: float):
    plt.suptitle('Daily Distribution', fontsize=16, fontweight='bold', y=0.995)
    df = df.assign(hour=df.index.hour)  # local copy; the caller's DataFrame stays unchanged
    plt.subplot(3, 1, 1)
    sns.boxplot(data=df, x='hour', y='load-15m',
                patch_artist=True,
                boxprops=dict(alpha=alpha),)
    plt.title('15m average workload')
    plt.ylabel('Load Average')
    plt.xlabel('')
    plt.grid(True, alpha=0.3, axis='y')
    
    plt.subplot(3, 1, 2)
    sns.boxplot(data=df, x='hour', y='memory_used_pct',
                patch_artist=True,
                boxprops=dict(alpha=alpha),)
    plt.title('Memory Usage')
    plt.xlabel('')
    plt.ylabel('Memory [%]')
    plt.grid(True, alpha=0.3, axis='y')
    
    plt.subplot(3, 1, 3)
    sns.boxplot(data=df, x='hour', y='sys-thermal',
                patch_artist=True,
                boxprops=dict(alpha=alpha),)
    plt.title('System Temperature Change')
    plt.ylabel('Temp Change [°C]')
    plt.xlabel('Hour of Day (0-23)')
    plt.grid(True, alpha=0.3, axis='y')

def Correlation_Analysis(df: pd.DataFrame):
    plt.suptitle('Correlation Analysis', fontsize=16, fontweight='bold', y=0.995)
    corr_matrix = df[['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal', 'server-up']].corr()
    sns.heatmap(corr_matrix,
                annot=True,      # Show correlation values
                cmap='coolwarm', # Diverging colormap
                center=0,        # Center colormap at 0
                fmt='.2f',       # Format coefficients
                ax=plt.gca())

def Relationship_Visualization(df: pd.DataFrame):
    plt.suptitle('Relationship Visualization', fontsize=16, fontweight='bold', y=0.995)
    plt.subplot(2, 2, 1)
    scatter = plt.scatter(df['cpu-user'], df['memory_used_pct'],
                          c=df['sys-thermal'], cmap='RdYlGn',
                          alpha=0.5, s=10)
    plt.xlabel('CPU Time Rate of Change')
    plt.ylabel('Memory Usage [%]')
    plt.title('CPU Time vs Memory Usage')
    plt.colorbar(scatter, label='sys-thermal')

    plt.subplot(2, 2, 2)
    plt.hexbin(df['load-15m'], df['cpu-user'],
               gridsize=20, cmap='YlOrRd', mincnt=1)
    plt.xlabel('System Load (load-15m)')
    plt.ylabel('CPU User')
    plt.title('Load vs CPU Usage Density')

    plt.subplot(2, 2, 3)
    plt.hexbin(df['load-15m'], df['memory_used_pct'],
               gridsize=20, cmap='YlGnBu', mincnt=1)
    plt.xlabel('System Load (load-15m)')
    plt.ylabel('Memory Usage [%]')
    plt.title('Load vs Memory Usage Density')

    plt.subplot(2, 2, 4)
    scatter2 = plt.scatter(df['load-15m'], df['sys-thermal'],
                           c=df['cpu-user'], cmap='RdYlGn',
                           alpha=0.5, s=10)
    plt.xlabel('System Load (load-15m)')
    plt.ylabel('Temperature Change [°C]')
    plt.title('Load vs Temperature')
    plt.colorbar(scatter2, label='CPU user')

plt.figure(figsize=(12, 16))
Time_Series_Overview(df, alpha=1.0, first=True)
plt.show()
plt.figure(figsize=(12, 16))
Daily_Distribution(df, alpha = 1.0)
plt.show()
print("Boxplots for 'server-up' and 'cpu-user' showed no meaningful information and are omitted")
Correlation_Analysis(df)
plt.show()
Relationship_Visualization(df)
plt.show()
# raise NotImplementedError()

## Task 2.1: Data Preprocessing (20 points)

<div style="background-color: #e6f3ff; padding: 15px; margin: 10px; border-left: 5px solid #2196F3; border-radius: 3px;">
Here your task is to clean and prepare the system performance data. For that purpose, consider system behavior patterns, particularly the day/night testing cycles.
</div>

Create a function that handles:

1. **Invalid Values** (5 points)
   - Remove values outside valid ranges
   - Consider system behavior patterns
   - Verify server status integrity
   - Use the following valid ranges:
    ```python
    valid_ranges = {
        'load-15m': (0, 0.5),           # System load
        'memory_used_pct': (0, 100),   # Percentage
        'cpu-user': (0, 2),            # Rate of change in CPU time
        'sys-thermal': (-10, 10),      # Temperature change rate (°C/min)
        'server-up': (0, float('inf')) # Server availability
    }
    ```

2. **Duplicate Timestamps** (5 points)
   - Identify duplicate readings
   - Aggregate using appropriate methods
   - Maintain data consistency
   
3. **Outliers** (5 points)
   - Use IQR method for each metric
   - Consider day/night differences
   - Document removed points
   
4. **Missing Values** (5 points)
   - Handle gaps appropriately
   - Consider server status
   - Limit interpolation range
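
For the duplicate-timestamp step, one option is to aggregate rows that share a timestamp instead of keeping only the first one. The sketch below uses hypothetical readings. Averaging the metric and taking the minimum heartbeat are assumed choices, not requirements of the task.

```python
import pandas as pd

# Hypothetical readings with one duplicated timestamp.
idx = pd.to_datetime(
    ["2025-01-01 00:00:00", "2025-01-01 00:00:30", "2025-01-01 00:00:30"]
)
raw = pd.DataFrame(
    {"load-15m": [0.10, 0.20, 0.40], "server-up": [2, 2, 0]},
    index=idx,
)

n_dupes = raw.index.duplicated(keep=False).sum()  # rows that share a timestamp
# Assumed policy: mean for continuous metrics, min for the heartbeat (most pessimistic status).
deduped = raw.groupby(level=0).agg({"load-15m": "mean", "server-up": "min"})
print(n_dupes, len(deduped))
```

Compared with `index.duplicated(keep='first')`, this keeps information from every duplicate reading. Which aggregation is appropriate depends on the metric.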

In [None]:
def remove_outliers_iqr(data: pd.DataFrame, column: str) -> pd.Series:
    """Remove outliers using IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    valid_mask = (data[column] >= Q1 - 1.5*IQR) & \
                 (data[column] <= Q3 + 1.5*IQR)
    return data[column].where(valid_mask, np.nan)

def handle_missing_values(data: pd.DataFrame, column: str,
                         max_gap: int = 8) -> pd.Series:
    """Interpolate missing values with limit."""
    return data[column].interpolate(
        method='linear',
        limit=max_gap  # Fill at most max_gap consecutive NaNs; longer gaps are only partly filled
    )

def preprocess_system_data(df: pd.DataFrame):
    """Preprocess system performance data.
    
    Parameters
    ----------
    df : pd.DataFrame
        Raw system performance data with required metrics:
        - load-15m
        - memory_used_pct
        - cpu-user
        - sys-thermal (optional)
        - server-up
        
    Returns
    -------
        df_original, df_cleaned containing:
        - Original data copy
        - Cleaned data with:
          * Invalid values removed
          * Duplicates handled
          * Outliers removed
          * Missing values interpolated
    """
    # Store original data
    df_original = df.copy()
    df_cleaned= df.copy()
    
    # Define valid ranges
    valid_ranges = {
        'load-15m': (0, 0.5),           # System load
        'memory_used_pct': (0, 100),   # Percentage
        'cpu-user': (0, 2),            # Rate of change in CPU time
        'sys-thermal': (-10, 10),      # Rate of change in °C
        'server-up': (0, float('inf')) # Server availability
    }
    
    # YOUR CODE HERE
    # 1. Handle invalid values
    for col, (min_val, max_val) in valid_ranges.items():
        if col in df_cleaned.columns:
            df_cleaned.loc[(df_cleaned[col] < min_val) | (df_cleaned[col] > max_val), col] = np.nan
    
    # 2. Handle duplicates
    df_cleaned = df_cleaned[~df_cleaned.index.duplicated(keep='first')]
    
    # 3. Remove outliers
    for col in ['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal']:
        df_cleaned[col] = remove_outliers_iqr(df_cleaned, col)
    
    # 4. Handle missing values
    for col in ['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal']:
        df_cleaned[col] = handle_missing_values(df_cleaned, col, max_gap=8)
    return df_original, df_cleaned

    # raise NotImplementedError()
df_original, df_cleaned = preprocess_system_data(df.copy())

def get_removed_values(df_original: pd.DataFrame, df_cleaned: pd.DataFrame):
    removed_data = df_original.copy()
    
    for col in ['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal']:
        mask = (~df_original[col].isna()) & (df_cleaned[col].isna())
        removed_data.loc[~mask, col] = np.nan
    return removed_data
df_removed = get_removed_values(df_original, df_cleaned)
plt.figure(figsize=(12, 16))
Time_Series_Overview(df_cleaned, 0.9, "Cleaned Data")
Time_Series_Removed(df_removed, 0.4)


### Helper Functions

You may use these helper functions in your implementation:

```python
def remove_outliers_iqr(data: pd.DataFrame, column: str) -> pd.Series:
    """Remove outliers using IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    valid_mask = (data[column] >= Q1 - 1.5*IQR) & \
                 (data[column] <= Q3 + 1.5*IQR)
    return data[column].where(valid_mask, np.nan)

def handle_missing_values(data: pd.DataFrame, column: str,
                         max_gap: int = 8) -> pd.Series:
    """Interpolate missing values with limit."""
    return data[column].interpolate(
        method='linear',
        limit=max_gap  # Fill at most max_gap consecutive NaNs; longer gaps are only partly filled
    )
```
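
The task also asks to consider day/night differences when removing outliers. A minimal sketch of one way to do this is to apply the IQR rule separately to night and day samples. The function names and the hour boundaries (night from 18:00 to 07:00) are assumptions for illustration, and the data is made up.

```python
import numpy as np
import pandas as pd

def iqr_mask(values: pd.Series) -> pd.Series:
    """Return True where a value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

def remove_outliers_by_period(series: pd.Series,
                              night_start: int = 18,
                              night_end: int = 7) -> pd.Series:
    """Apply the IQR rule separately to night and day samples (boundaries are assumed)."""
    hours = series.index.hour
    is_night = np.asarray((hours >= night_start) | (hours < night_end))
    keep = np.zeros(len(series), dtype=bool)
    for period in (is_night, ~is_night):
        keep[period] = iqr_mask(series[period]).to_numpy()
    return series.where(keep)  # outliers become NaN

# Hypothetical data: high load at night, low during the day.
idx = pd.date_range("2025-01-01", periods=48, freq="h")
night = (idx.hour >= 18) | (idx.hour < 7)
values = pd.Series(np.where(night, 0.30, 0.02), index=idx)
values.iloc[12] = 0.30  # a night-level value at noon

cleaned = remove_outliers_by_period(values)
print(cleaned.isna().sum(), iqr_mask(values).all())
```

In this toy series the noon value is flagged only by the per-period rule. A single global IQR keeps it, because 0.30 is a normal value at night.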

In [None]:
# Test cell for Task 2.1 - Preprocessing
df_original, df_cleaned = preprocess_system_data(df)
assert isinstance(df_original, pd.DataFrame), "Should return original dataframe"
assert isinstance(df_cleaned, pd.DataFrame), "Should return cleaned dataframe"
assert df_cleaned['cpu-user'].min() >= 0 and df_cleaned['cpu-user'].max() <= 2, "CPU rate should be between 0 and 2"
print("Basic preprocessing tests passed!")

print("Data Preprocessing Results:")
print(f"Original shape: {df_original.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")

print("\nMissing Values Summary:")
print(df_cleaned.isnull().sum())

print("\nValue Ranges (Cleaned):")
for col in ['load-15m', 'memory_used_pct', 'cpu-user', 'sys-thermal', 'server-up']:
    print(f"\n{col}:")
    print(df_cleaned[col].describe())

## Task 2.2: Raw vs Cleaned Data Comparison by Visualization (30 points)

<div style="background-color: #e6f3ff; padding: 15px; margin: 10px; border-left: 5px solid #2196F3; border-radius: 3px;">Create comprehensive comparisons between the raw and cleaned data versions.
</div>

Requirements:
1. **Time Series Comparison** (10 points)
   - Plot original and cleaned data on same axes
   - Use alpha to show overlaps
   - Highlight removed outliers
   - Include server status representation

2. **Distribution Analysis** (5 points)
   - Compare original vs cleaned distributions
   - Show effects of preprocessing
   - Demonstrate quality improvements

3. **Impact Documentation** (5 points)
   - Document key statistics before/after
   - Explain preprocessing effects
   - Justify data cleaning decisions
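
For the impact documentation, key statistics can be placed side by side. The sketch below uses a hypothetical five-value metric. It shows one possible layout and counts how many values differ between the two versions.

```python
import numpy as np
import pandas as pd

# Hypothetical original/cleaned pair for one metric.
original = pd.DataFrame({"load-15m": [0.05, 0.06, 0.90, 0.05, 0.07]})
cleaned = original.copy()
cleaned.loc[2, "load-15m"] = np.nan                      # invalid value removed
cleaned["load-15m"] = cleaned["load-15m"].interpolate()  # gap filled linearly

# Side-by-side summary statistics for the two versions.
summary = pd.concat(
    {
        "original": original["load-15m"].describe(),
        "cleaned": cleaned["load-15m"].describe(),
    },
    axis=1,
)
# Values that differ between versions (removed or replaced by interpolation).
changed = (original["load-15m"] != cleaned["load-15m"]).sum()
print(summary.round(3))
print("changed values:", changed)
```

Comparing values directly, rather than checking for NaN in the cleaned data, also catches points that were removed and then refilled by interpolation.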

Note: You may add further visualizations in addition to the ones above.

In [None]:
def compare_data_versions(df_original: pd.DataFrame,
                         df_cleaned: pd.DataFrame):
    """Compare original and cleaned data versions using visualizations.
    
    Creates multiple plots comparing the original and cleaned data:
    - Time series comparison of all metrics
    - Daily patterns before and after cleaning
    - Correlation analysis
    - Distribution comparisons
    
    Parameters:
    df_original: Original unprocessed data
    df_cleaned: Cleaned and processed data
    """

    # YOUR CODE HERE
    # Create comparison visualizations
    plt.figure(figsize=(12, 16))
    Time_Series_Overview(df_cleaned, 0.9,"Cleaned Data")
    Time_Series_Overview(df_original, 0.4, "Original Data", True)
    plt.show()
    plt.figure(figsize=(12, 16))
    Daily_Distribution(df_original, 0.4)
    plt.figure(figsize=(12, 16))
    Daily_Distribution(df_cleaned, 0.9)

    plt.figure()
    Relationship_Visualization(df_original)
    plt.figure()
    Relationship_Visualization(df_cleaned)
    # Calculate comparison statistics
    # Document data cleaning impact
    # raise NotImplementedError()
compare_data_versions(df_original,  df_cleaned)



## Task 2.3: Cleaned Data Analysis (24 points)

Based on your visualizations, analyze the system performance patterns.

1. Test Execution Patterns:\
a_) System load shows clear day/night testing cycles

2. System Performance:\
b_) High load periods (load-15m) align with increased memory usage\
c_) High memory usage periods coincide with increased CPU usage rate of change\
d_) Temperature change rate increases during high system load periods

3. Resource Utilization:\
e_) Memory usage consistently increases at test start and gradually decreases towards test completion\
f_) System load stays within reasonable limits (<0.4) during normal operation

4. System Behavior:\
g_) Memory usage returns to idle state levels (around 5-6%) between test cycles\
h_) Load, memory, and CPU metrics collectively show clear patterns distinguishing between test execution and idle periods
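
Visual impressions for statements such as b) and g) can be backed by simple numbers. The sketch below runs on made-up data. The values, the test window, and the 0.05 idle threshold are all assumptions. It computes a correlation and an idle-state memory level.

```python
import numpy as np
import pandas as pd

# Hypothetical data: load and memory are both higher during an assumed night test window.
idx = pd.date_range("2025-01-01", periods=48, freq="h")
testing = (idx.hour >= 18) | (idx.hour < 7)
demo = pd.DataFrame(
    {
        "load-15m": np.where(testing, 0.30, 0.02),
        "memory_used_pct": np.where(testing, 12.0, 5.5),
    },
    index=idx,
)

# Pearson correlation between load and memory usage.
corr = demo["load-15m"].corr(demo["memory_used_pct"])
# Median memory usage while the system is idle (load below an assumed threshold).
idle_memory = demo.loc[demo["load-15m"] < 0.05, "memory_used_pct"].median()
print(round(corr, 3), idle_memory)
```

On real data the correlation will be weaker than in this idealized example, so it is best read together with the plots.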

In [None]:
# Your answers and reasoning below. For each statement, first set the boolean value 
# then explain your reasoning based on the visualizations.

# Test Execution Patterns
a_ = True
# Your reasoning here
"""
The daily distribution clearly shows higher median 15-minute load values at night than
during the day hours from 7 am to 6 pm, so a clear day/night cycle is visible.
In addition to the higher median, the interquartile range of the night hours
also sits at higher 15-minute load values.
"""
# System Performance
b_ = True
# Your reasoning here
"""
This is clearly visible in the daily distribution and the time series.
In the boxplots (interquartile range and median), memory usage rises at night
together with the 15-minute average load.
In the time series, the intervals of higher load and higher memory usage also align
(they occur at the same datetime indexes).
"""
c_ = True
# Your reasoning here
"""
This is also true and can be verified in the time series. There it is clearly
visible that the periods of higher memory usage match the datetime indexes of the
periods with a higher CPU usage rate of change.
"""
d_ = False
# Your reasoning here
"""
From the plots I could not see a correlation between the load and the change of the
sys-thermal parameter. Instead, the system's thermal change is distributed across three
parallel rows. I think this is caused by the use of different devices which have different offsets
for the thermal change.
The distribution in the scatter plot also indicates that there is no correlation between load and
sys-thermal.
"""
# Resource Utilization
e_ = False
# Your reasoning here
"""
Memory usage starts to rise as soon as the test run starts and keeps increasing
steadily over the test duration; it does not rise only at the beginning.
When the test has finished, memory usage does not decrease gradually; it drops
again instantly.
This can be seen in the time series.
"""
f_ = True
# Your reasoning here
"""
This can be seen in the daily distribution, the time series and the relationship plots of the
dataset. In the uncleaned dataset only a few outliers exceed the load limit of 0.4.
"""
# System Behavior
g_ = True
# Your reasoning here
"""
This can be seen in the time series and the hexbin plot. The idle level
of the memory usage is reached often (darker color at the idle level in the
hexbin plot), and in the time series the memory usage is at 5-6 % whenever there
is no 15-minute load (idle state).
"""
h_ = True 
# Your reasoning here
"""
This is also clearly visible in the time series, where load, the CPU usage rate of change and
the memory consumption have matching periods of higher values. Those are the test periods.
Between two test periods all parameters drop back to their idle values (idle period).
"""

In [None]:
# Test cell - DO NOT MODIFY
for var in ['a_', 'b_', 'c_', 'd_', 'e_', 'f_', 'g_', 'h_']:
    assert var in locals(), f"Missing answer for {var}"
    assert isinstance(locals()[var], bool), f"Answer for {var} must be True or False"

## Extra Tasks: Interactive Visualizations (15 Bonus Points)

Leverage Altair to create interactive visualizations based on the cleaned system performance data. Before you start, review the **Interactive Plotting** tutorial to familiarize yourself with Altair's capabilities for crafting interactive plots.

Feel free to experiment and design different types of interactive visualizations that effectively represent the data, in addition to the tasks described below.

**Possible visualization ideas:**

- **Interactive Time-Series Charts**: Plot CPU and memory usage over time with zoom and pan functionalities.
- **Scatter Plots with Tooltips**: Explore relationships between performance metrics by displaying detailed information on hover.
- **Heatmaps and Correlation Matrices**: Visualize correlations with interactive elements that highlight specific data points.
- **Combined Dashboards**: Create a dashboard featuring multiple interactive charts for a holistic view of system performance.

Be creative and innovative in your approach to make the visualizations both informative and engaging.

### ET1: Basic Interactive Time Series (5 points)
Create an interactive time series visualization that includes:
1. System load over time with zoom/pan capabilities
2. Tooltips showing metric values
3. Server status indicated by color

Hint:
```python

import altair as alt

# Enable the rendering of charts
alt.renderers.enable('default')
# Set a maximum number of rows for Altair
alt.data_transformers.enable('default', max_rows=None)

def create_basic_interactive(df):
    # Base chart with zoom
    chart = alt.Chart(df.reset_index()).mark_line().encode(
        x='datetime:T',
        y=alt.Y('cpu-user', 
                title='CPU Time Rate of Change (seconds)'),  # Updated title
        color=alt.Color('server-up:Q', 
                       scale=alt.Scale(scheme='redyellowgreen')),
        tooltip=['datetime:T', 
                alt.Tooltip('cpu-user', 
                           title='CPU Time Change'), 
                'server-up']
    ).interactive()
    
    return chart
```

Note: Make sure to install required packages:
```python
# Install packages if needed:
# pip install altair altair_saver vega_datasets
# or
# conda install -c conda-forge altair altair_saver vega_datasets

# altair_saver package - to allow saving visualizations.
# vega_datasets - to provide example datasets
# Note: Above two packages (altair_saver and vega_datasets) are not necessary here but relevant for the interactive tutorial.
```
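
The hint above lifts Altair's row limit, so every sample is embedded in the chart. With roughly 86,000 rows this can make the notebook large and slow. A possible alternative is to downsample before plotting. The sketch below uses made-up data, and the 5-minute bin size and the aggregation choices are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical day of data at the dataset's stated rate of two samples per minute.
idx = pd.date_range("2025-01-01", periods=2 * 60 * 24, freq="30s")
dense = pd.DataFrame(
    {"load-15m": np.linspace(0.0, 0.4, len(idx)), "server-up": 2},
    index=idx,
)

# Aggregate to 5-minute bins: mean for the metric, min for the heartbeat (assumed choices).
coarse = dense.resample("5min").agg({"load-15m": "mean", "server-up": "min"})
print(len(dense), "->", len(coarse))
```

The downsampled frame can then be passed to the chart function in place of the full-resolution data, at the cost of smoothing out short spikes.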

In [None]:
# YOUR CODE HERE
import altair as alt

# Enable the rendering of charts
alt.renderers.enable('default')
# Set a maximum number of rows for Altair
alt.data_transformers.enable('default', max_rows=None)

def create_basic_interactive(df: pd.DataFrame):
    # Base chart with zoom
    chart = alt.Chart(df.reset_index()).mark_line().encode(
        x='datetime:T',
        y=alt.Y('cpu-user', 
                title='CPU Time Rate of Change (seconds)'),  # Updated title
        color=alt.Color('server-up:Q', 
                       scale=alt.Scale(scheme='redyellowgreen')),
        tooltip=['datetime:T', 
                alt.Tooltip('cpu-user', 
                           title='CPU Time Change'), 
                'server-up']
    ).interactive()
    
    return chart

create_basic_interactive(df)
# raise NotImplementedError()



### ET2: Linked Views (5 points)
Create two linked interactive visualizations:
1. Load vs Memory usage scatter plot
2. Corresponding histogram of selected data points
3. Implement brushing to highlight points

Hint:
```python
def create_linked_views(df):
    # Create selection
    brush = alt.selection_interval()
    
    # Scatter plot
    scatter = alt.Chart(df).mark_point().encode(
        x='load-15m',
        y='memory_used_pct',
        color=alt.condition(brush, 'server-up:Q', alt.value('lightgray'))
    ).add_selection(brush)
    
    # Histogram for selected data
    hist = alt.Chart(df).mark_bar().encode(
        x='load-15m',
        y='count()'
    ).transform_filter(brush)
    
    return scatter & hist  # Stack vertically
```




In [None]:
# YOUR CODE HERE
def create_linked_views(df: pd.DataFrame):
    # Create selection
    brush = alt.selection_interval()
    
    # Scatter plot
    scatter = alt.Chart(df).mark_point().encode(
        x='load-15m',
        y='memory_used_pct',
        color=alt.condition(brush, 'server-up:Q', alt.value('lightgray'))
    ).add_params(brush)
    
    # Histogram for selected data
    hist = alt.Chart(df).mark_bar().encode(
        x='load-15m',
        y='count()'
    ).transform_filter(brush)
    
    return scatter & hist  # Stack vertically
create_linked_views(df)
# raise NotImplementedError()

### ET3: Advanced Dashboard (5 points)
Create a comprehensive dashboard with:
1. Time series with selectable time range
2. Metric comparison scatter plot
3. Summary statistics for selected period
4. Interactive filtering across all views

Points awarded for:
- Creative use of Altair features
- Effective interaction design
- Clear visual communication
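
The "summary statistics for selected period" requirement boils down to aggregating the rows inside a time window. As a plain-pandas sketch of that computation (independent of how the selection is made in Altair), with a hypothetical `summarize_period` helper and made-up data:

```python
import pandas as pd

# Hypothetical hourly data with an idle stretch between two busy periods.
idx = pd.date_range("2025-01-01", periods=8, freq="h")
demo = pd.DataFrame(
    {
        "load-15m": [0.30, 0.30, 0.02, 0.02, 0.02, 0.02, 0.30, 0.30],
        "memory_used_pct": [12.0, 12.0, 5.5, 5.5, 5.5, 5.5, 12.0, 12.0],
    },
    index=idx,
)

def summarize_period(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Return mean/min/max of each metric for rows between start and end (inclusive)."""
    window = df.loc[pd.Timestamp(start):pd.Timestamp(end)]
    return window.agg(["mean", "min", "max"])

stats = summarize_period(demo, "2025-01-01 02:00", "2025-01-01 05:00")
print(stats)
```

Checking the dashboard's displayed numbers against such a helper is a simple way to confirm that the interactive filter selects the intended rows.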


In [None]:
# YOUR CODE HERE

raise NotImplementedError()