# Data Analysis Assignment 4
**Group: Ohm_Squad**

**Members: Rauch, Bilijesko, Frizberg**

**Datasets: Westermo**

## Initial Setup

In [1]:
# Initial setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from PIL import Image
import os

# Configure plotting
plt.rcParams.update({
    'figure.figsize': [12, 8],
    'figure.dpi': 150,
    'figure.autolayout': True,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'font.size': 12
})

pathRaw = "./data_raw/"
pathFilter = "./data_filtered/"
pathProcessd = "./data_processed/"
pathVisuRaw = "./visu_raw/"
pathVisuProcessed = "./visu_processed/"
pathOnlyProcessed = "./visu_only_processed/"

files = [f"system-{number}.csv" for number in range(1, 20)]

# Systems 3, 5, 6, 8, 11 and 17 do not have sys-thermal readings: 3/5/6 -> crashes, 8/11/17 -> no thermal
# remove_entries holds 0-based list indices, so [7, 10, 16] drops systems 8, 11 and 17
remove_entries = [7,10,16]
files = [item for index, item in enumerate(files) if index not in remove_entries]

sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)

np.random.seed(42)

# 2.1 Data Preprocessing and Basic Analysis
- **Basic statistical analysis using pandas**
> -> see load_system_data()
- **Original data quality analysis (including visualization)**
> -> see Analysis Notes after visu_raw_data()
- **Data preprocessing**
> -> see preprocess_system_data() and "data_processed"
- **Preprocessed vs original data visual analysis**
> -> see Analysis Notes after visu_processed_data()

# 2.2 Visualization and Exploratory Analysis
- **Time series visualizations**
- **Distribution analysis with histograms**
- **Correlation analysis and heatmaps**
- **Daily pattern analysis**
> -> see visu_processed_data() and Analysis Notes after visu_processed_data()
- **Summary of observed patterns - similar to True/False questions**
> -> see Analysis Notes after visu_processed_data()

>All figures/plots can be accessed in "visu_raw", "visu_processed" and "visu_only_processed".




## Loading and Filtering
Files are read from the raw-data directory and prefiltered to the columns of interest.

Timestamps are converted to datetime so the data can be used as a time series (and is easier to read).

This is done in a function so that each file can be processed separately and the result piped onward if necessary.

The returned DataFrame can be discarded, stored in a container, or piped into the next function.

- **2.1: Basic statistical analysis using pandas**
>  -> output into CSV (visu_raw)
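
The loading steps described in this section can be sketched on a tiny hand-made sample. This is only an illustration: the three rows and their values are invented, and only the column names are taken from load_system_data().

```python
import pandas as pd

# Hypothetical three-row sample; real files hold many more rows and columns.
raw = pd.DataFrame({
    "timestamp": [1_600_000_000, 1_600_000_030, 1_600_000_060],
    "sys-mem-available": [6.0e9, 4.0e9, 2.0e9],
    "sys-mem-total": [8.0e9, 8.0e9, 8.0e9],
})

# Unix seconds -> datetime; unparsable values become NaT instead of raising.
raw["datetime"] = pd.to_datetime(raw["timestamp"], unit="s", errors="coerce")
raw = raw.set_index("datetime")

# Share of memory in use, in percent.
raw["memory_used_pct"] = (1 - raw["sys-mem-available"] / raw["sys-mem-total"]) * 100

# Keep only the derived column, then compute the basic statistics.
filtered = raw.drop(columns=["timestamp", "sys-mem-available", "sys-mem-total"])
summary = filtered.describe()
print(filtered)
print(summary)
```

In the notebook, the equivalent of `filtered` is written to "data_filtered" and the equivalent of `summary` to "visu_raw".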


In [2]:
def load_system_data(file_dir: str, file_name: str) -> pd.DataFrame :
    """Load and prepare test system performance data.
    
    Parameters
    ----------
    file_dir : str
        Path to the CSV data file location (directory)
    file_name : str
        Name of the specified CSV file
    
    Additional outputs
    ------------------
    Saves the filtered data into "./data_filtered" and the describe() statistics into "./visu_raw".
    
    Returns
    -------
    pd.DataFrame
        Raw dataframe with columns:
        - datetime (index)
        - load-15m
        - memory_used_pct
        - cpu-user
        - cpu-system
        - sys-thermal
        - sys-interrupt-rate
        - server-up
        - disk-io-time
    """
    file_path = file_dir + file_name

    df = pd.read_csv(file_path, delimiter = ",",usecols=["timestamp",
                                                         "load-15m",
                                                         "sys-mem-available",
                                                         "sys-mem-total",
                                                         "cpu-user",
                                                         "cpu-system",
                                                         "sys-thermal",
                                                         "sys-interrupt-rate",
                                                         "server-up",
                                                         "disk-io-time"]) # Read in data with columns
    

    
    df['datetime'] = pd.to_datetime(df['timestamp'], unit = 's', errors = 'coerce') # Create datetime from timestamp
    
    df.set_index('datetime', inplace=True) # Set datetime as index

    df['memory_used_pct'] = (1 - df['sys-mem-available']/df['sys-mem-total']) * 100 # Memory usage calculation
    df.drop(["timestamp","sys-mem-available","sys-mem-total"], axis=1, inplace=True) # Drop unnecessary data
    
    df.to_csv(pathFilter+file_name, index=True)
    
    df.describe().to_csv(f'{pathVisuRaw}{file_name}_description.csv')
    
    return df

# testing df = load_system_data(pathRaw,"system-3.csv")

In [3]:
# Pre filter all files
# for file in files:
#     load_system_data(pathRaw, file)

## Visualizing Raw
- **2.1: Original data quality analysis (including visualization)**
- **2.2: Time series visualizations**
- **2.2: Distribution analysis with histograms**
- **2.2: Correlation analysis and heatmaps**
- **2.2: Daily pattern analysis**

First: helper functions that combine the plot images into one PDF and delete the temporary image files via os.

Second: the main visualization function.
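
As a rough illustration of the daily pattern analysis listed above, the sketch below groups a series by hour of day and computes the mean and standard deviation per hour. The data is synthetic (two days of hourly samples with an artificial daily cycle), and the choice of memory_used_pct as the column is only an example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two days of hourly samples with a synthetic daily cycle.
index = pd.date_range("2024-01-01", periods=48, freq="60min")
hours = index.hour.to_numpy()
day_offset = 2 * (index.day.to_numpy() - 1)  # second day is shifted up by 2
values = 50 + 10 * np.sin(2 * np.pi * hours / 24) + day_offset
df = pd.DataFrame({"memory_used_pct": values}, index=index)

# Group by hour of day and compute mean and standard deviation per hour.
hourly = df.groupby(df.index.hour)["memory_used_pct"].agg(["mean", "std"])
hourly.index.name = "hour"

# Bounds of the +/- 1 std band that is shaded around the mean line.
lower = hourly["mean"] - hourly["std"]
upper = hourly["mean"] + hourly["std"]
print(hourly.head())
```

Plotting `hourly["mean"]` as a line and filling between `lower` and `upper` gives the kind of band chart produced as Plot 2 in visu_raw_data().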

In [4]:
# adapted https://stackoverflow.com/questions/6996603/how-can-i-delete-a-file-or-folder-in-python
def delete_images(files: list):
    """Deletes the files specified in the list of file paths.
    Parameters
    ----------
    files: list[str]
        List of paths of the image files to delete.
    
    Additional output
    ----------
        Deletes the listed image files from disk.

    Returns
    -------
        None
    """
    
    for file in files:
        try:
            if os.path.exists(file):
                os.remove(file)
                #print(f"Deleted: {file}")
            else:
                print(f"File not found: {file}")
        except Exception as e:
            print(f"Error deleting {file}: {e}")
            
# adapted https://stackoverflow.com/questions/40906463/png-images-to-one-pdf-in-python 
# and https://www.geeksforgeeks.org/save-multiple-matplotlib-figures-in-single-pdf-file-using-python/ 
def save_image(image_names: list, out_dir: str, filename: str): 
    """Gathers multiple saved plot images (.png) and outputs them into a single .pdf 
    
    Parameters
    ----------
    image_names: list[str]
        List of names of image files to put into .pdf file   
    out_dir: str
        Path to the directory of output .pdf file
    filename: str
        Name of output .pdf file
        
    Additional output
    ----------
        Saves a .pdf created by multiple .pngs into specified directory

    Returns
    -------
        None
    """
    image_list = [] #contains opened files
    for name in image_names:
        print(name)
        image_list.append(Image.open(name))

    image_list[0].save(f"{out_dir}{filename}_allPlots.pdf", save_all=True, append_images=image_list[1:])
    for image in image_list:
        image.close()
    print(f"{out_dir}{filename}_allPlots.pdf")
    delete_images(image_names)

In [5]:
def visu_raw_data(show_plots: bool, file_dir: str, file_name: str, df_arg: pd.DataFrame, isRaw: bool = True):
    """Load and visualize filtered and processed test system performance data.
    
    Parameters
    ----------
    show_plots: bool
        Just output files or display in notebook
    file_dir : str
        Path to the CSV data file location (directory)
    file_name : str
        Name of the specified CSV file
    isRaw : bool (Default: True)
        function can be used to visualize any raw or processed -> changes data_type (string) and out_dir (string)
        
    optional
    df_arg: pd.DataFrame
        output from load_system_data()

    Additional outputs
    saves visualized data into dir "./visu_raw" by calling save_image() and cleaning temp-files with delete_images()
    
    Returns
    -------
        None
    """
    # Check DataFrame was passed
    if isinstance(df_arg, pd.DataFrame):
        df = df_arg
        # File name and path -> pd used => no identifier => using "./" 
        out_dir = "./"
        out_name = "Visu_output_noident"
        print("Function called with a DataFrame.")
    else:
        # Attempt to read the DataFrame from file
        try:
            file_path = file_dir + file_name
            df = pd.read_csv(file_path, delimiter = ",",usecols=["datetime","load-15m","memory_used_pct","cpu-user","cpu-system","sys-thermal","sys-interrupt-rate","server-up","disk-io-time"])
            print(f"Function called with a file: {file_path}")
            df['datetime'] = pd.to_datetime(df['datetime'])
            df.set_index('datetime', inplace=True)
            # File name and path -> path used => use identifier 
            out_dir = pathVisuRaw
            out_name = file_name.replace('.csv', '')
        except Exception as e:
            print(f"Error loading the file: {e}")
            return None
    measurements = {
        "load-15m": ('load-15m', '%'),
        "memory_used_pct": ('memory_used_pct', '%'),
        "cpu-user": ('cpu-user', 'delta-s'),
        "cpu-system": ('cpu-system', 'delta-s'),
        "sys-thermal": ('sys-thermal', 'avg delta-°C/min'),
        "sys-interrupt-rate": ('sys-interrupt-rate', 'delta-s'),
        "disk-io-time": ('disk-io-time', 'delta-s')
        #,"server-up": ('server-sup', '')
    }
    if (isRaw):
        data_type = "Raw"
    else:
        data_type = "Processed"
        out_dir = pathOnlyProcessed
    
    image_names = []
    image_nr = 0
    
    # Plot 1: Time-Series
    fig, axes = plt.subplots(4, 2, figsize=(15, 30))
    fig.suptitle(f"Time-Series - {data_type} Data", fontsize=16, y=1.02)
   
    for i,(measure, (title, unit)) in enumerate(measurements.items()):
        row = i // 2
        col = i % 2

        df.iloc[::10].pivot(columns='server-up', values=measure).plot(ax=axes[row, col],alpha=0.7, linewidth=2,color=['red','blue'])

        axes[row, col].set_title(f'Time-Series of {measure.upper()}')
        axes[row, col].set_xlabel('Datetime')
        #axes[row, col].set_ylabel(measurement)
        axes[row, col].set_ylabel(f'{title} ({unit})')
        axes[row, col].grid(True)
        axes[row, col].legend()
        
    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------
    
    # Plot 2: Daily Patterns
    fig, axes = plt.subplots(4, 2, figsize=(15, 30))
    fig.suptitle(f"Daily Patterns of {data_type} Measurements - mean & std ", fontsize=16, y=1.02)

    # Create hour column for grouping
    df_hour = df.copy()
    df_hour['hour'] = df_hour.index.hour

    
    for i, measurement in enumerate(measurements):
        row = i // 2
        col = i % 2
        
        # Calculate hourly statistics
        hourly_stats = df_hour.groupby('hour')[measurement].agg(['mean', 'std'])
        
        # Plot mean with standard deviation
        axes[row, col].plot(hourly_stats.index, hourly_stats['mean'], 'b-', label='Mean')
        axes[row, col].fill_between(
            hourly_stats.index,
            hourly_stats['mean'] - hourly_stats['std'],
            hourly_stats['mean'] + hourly_stats['std'],
            alpha=0.2,
            label='±1 std'
        )
        
        axes[row, col].set_title(f'Daily {measurement.capitalize()} Pattern')
        axes[row, col].set_xlabel('Hour of Day')
        axes[row, col].set_ylabel(measurement)
        axes[row, col].grid(True)
        axes[row, col].legend()


    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------

    # Plot 3: Hour-wise Distributions
    fig, axes = plt.subplots(4, 2, figsize=(15, 30))
    fig.suptitle(f" {data_type} Measurement Distributions by Hour - Boxplots", fontsize=16, y=1.02)
        
    for i,(measure, (title, unit)) in enumerate(measurements.items()):
        row = i // 2
        col = i % 2
        
        df_hour.boxplot(
            ax=axes[row, col],
            column=measure,
            by='hour'
        )
        axes[row, col].set_title(f'Daily Pattern of {title} ')
        axes[row, col].set_xlabel('Hour of Day')
        axes[row, col].set_ylabel(f'{title} ({unit})')
        axes[row, col].grid(True)

    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------

    # Plot 4 Histograms - Distribution
    fig, axes = plt.subplots(4,2, figsize = (15, 25))
    fig.suptitle(f"Sensor {data_type} Measurements Distributions", fontsize = 14)

    for i,(measure, (title, unit)) in enumerate(measurements.items()):
        row = i // 2
        col = i % 2
        bin_num = 50
        
        axes[row, col].hist(df[measure], bins = bin_num*4, density = True, alpha = 0.7, label = 'Histogram')
        axes[row, col].set_title(f'Distribution of {title} ')
        axes[row, col].set_xlabel( f'{title} ({unit})')
        axes[row, col].set_ylabel('Density')
        axes[row, col].grid(True)
        
        #second axis for line graph
        ax_2 = axes[row, col].twinx()
        #print(row, col, measure, bin_num)
        counts, bins = np.histogram(df[measure], bins = bin_num)
        bin_centers = (bins[:-1] + bins [1:]) / 2
        ax_2.plot(bin_centers, counts/counts.sum(), 'r-', lw = 2, label = 'Distribution')        
        ax_2.tick_params(axis='y', labelcolor='r')
        ax_2.legend()



    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------
    
    # Plot 5: Correlation Analysis
    fig, (ax) = plt.subplots(1, 1, figsize=(15, 10))
    fig.suptitle(f"Correlation Analysis of {data_type} Measurements", y=1.02, fontsize=16)

    # Original correlations
    sns.heatmap(
        df[list(measurements)].corr(),
        annot=True,
        cmap='coolwarm',
        center=0,
        fmt='.2f',
        ax=ax
    )
    
    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------
    # TODO: add the source reference (originally marked "!!! SOURCE !!!")
    # Plot 6 Hexbins    
    measure = list(measurements.keys()) #["load-15m","memory_used_pct","cpu-user","cpu-system","sys-thermal","sys-interrupt-rate","disk-io-time"] 
    pairs = [(measure[i], measure[j]) for i in range(len(measure)) for j in range(i + 1, len(measure))]

    n_rows = 7
    n_cols = 3

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
    fig.suptitle(f"Hexbins of {data_type} Measurements", y=1.02, fontsize=16)
    axes = axes.flatten()  # Flatten the axes to make indexing easier

    # Loop over the pairs
    for i, (measure1, measure2) in enumerate(pairs):
        ax = axes[i]
        x = df[measure1]
        y = df[measure2]
        
        # Extract titles and units
        title1, unit1 = measurements[measure1]
        title2, unit2 = measurements[measure2]

        # Plot hexbin
        hb = ax.hexbin(x, y, gridsize=100, cmap='viridis')

        # Update labels and title
        ax.set_xlabel(f'{title1} ({unit1})')
        ax.set_ylabel(f'{title2} ({unit2})')
        ax.set_title(f'Hexbin: {title1} vs {title2}')
        
        
        # Add color bar
        fig.colorbar(hb, ax=ax)
        
    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------

    # Plot 7: Scatter Matrix
    # Get data without duplicates by taking mean for each timestamp
    df_plot = df.groupby(df.index)[measure].mean()
    try:
        pp = sns.pairplot(data=df_plot,
                            diag_kind='kde',
                            plot_kws={'alpha': 0.5, 's': 20},
                            height = 3,
                            corner=True)
    except Exception as e:
        # Without a pairplot there is no figure to save, so this plot is skipped
        print(f"Warning: Could not create scatter matrix plot: {str(e)}")
    else:
        fig = pp.figure
        fig.suptitle(f'Scatter Matrix of {data_type} Measurements', y=1.02, fontsize=16)

        #----------------------------------------------
        plt.tight_layout()

        temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
        fig.savefig(temp_name, dpi=200, bbox_inches='tight')
        image_names.append(temp_name)
        image_nr += 1
        #----------------------------------------------
    
    save_image(image_names, out_dir, out_name)
    if not show_plots:
        plt.close("all")

In [6]:
# testing 
#visu_raw_data(False, pathProcessd, files[0],None, False)

In [7]:
#Run Visualization of Raw
for file in files:
    visu_raw_data(False, pathFilter,file,None)
    plt.close("all") #for safety

Function called with a file: ./data_filtered/system-1.csv
./visu_raw/system-1_plot_0.png
./visu_raw/system-1_plot_1.png
./visu_raw/system-1_plot_2.png
./visu_raw/system-1_plot_3.png
./visu_raw/system-1_plot_4.png
./visu_raw/system-1_plot_5.png
./visu_raw/system-1_plot_6.png
./visu_raw/system-1_allPlots.pdf
Function called with a file: ./data_filtered/system-2.csv
./visu_raw/system-2_plot_0.png
./visu_raw/system-2_plot_1.png
./visu_raw/system-2_plot_2.png
./visu_raw/system-2_plot_3.png
./visu_raw/system-2_plot_4.png
./visu_raw/system-2_plot_5.png
./visu_raw/system-2_plot_6.png
./visu_raw/system-2_allPlots.pdf
Function called with a file: ./data_filtered/system-3.csv
./visu_raw/system-3_plot_0.png
./visu_raw/system-3_plot_1.png
./visu_raw/system-3_plot_2.png
./visu_raw/system-3_plot_3.png
./visu_raw/system-3_plot_4.png
./visu_raw/system-3_plot_5.png
./visu_raw/system-3_plot_6.png
./visu_raw/system-3_allPlots.pdf
Function called with a file: ./data_filtered/system-4.csv
./visu_raw/system-

## Analysis
- **2.1: Original data quality analysis (including visualization)**
> ...

## Processing

- threshold (valid-range) and IQR methods
- aggregation of duplicate timestamps
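
The three processing steps can be sketched on a small invented sample. The values, the 0 to 100 range and the single column are assumptions for illustration. The order (range check, aggregation of duplicate timestamps, then the IQR rule with bounds Q1 - 1.5·IQR and Q3 + 1.5·IQR) follows preprocess_system_data().

```python
import pandas as pd

# Hypothetical samples: 250.0 violates the valid range, 95.0 is a statistical outlier.
index = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:00",  # duplicate timestamp
    "2024-01-01 00:05", "2024-01-01 00:10", "2024-01-01 00:15",
    "2024-01-01 00:20", "2024-01-01 00:25",
])
df = pd.DataFrame(
    {"memory_used_pct": [40.0, 42.0, 41.0, 250.0, 43.0, 95.0, 42.0]},
    index=index,
)
df.index.name = "datetime"

# 1. Threshold: values outside the valid range become NaN.
low, high = 0, 100
col = df["memory_used_pct"]
df["memory_used_pct"] = col.where((col >= low) & (col <= high))

# 2. Aggregation: rows sharing a timestamp are averaged.
df = df.groupby(level="datetime").mean()

# 3. IQR: values further than 1.5 * IQR from the quartiles become NaN.
q1, q3 = df["memory_used_pct"].quantile([0.25, 0.75])
iqr = q3 - q1
col = df["memory_used_pct"]
df["memory_used_pct"] = col.where((col >= q1 - 1.5 * iqr) & (col <= q3 + 1.5 * iqr))

# Rows that still contain NaN are dropped (no interpolation).
cleaned = df.dropna()
print(cleaned)
```

The range check is applied first so that physically impossible readings do not distort the quartiles used by the IQR rule.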

In [8]:
def remove_outliers_iqr(show_process_status: bool, df:pd.DataFrame, column: str) -> tuple:
    """Remove outliers using IQR method.

    Parameters
    ----------
    show_process_status : bool
        print status in console
    df : pd.DataFrame
        input data for cleaning
    column: str
        current column to "look at"

    Returns
    -------
        (pd.Series, pd.Series)
            cleaned data , outliers
    """
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    valid_mask = (df[column] >= Q1 - 1.5*IQR) & (df[column] <= Q3 + 1.5*IQR)
    invalid_count = (~valid_mask & df[column].notna()).sum()  # existing NaNs are not counted as outliers
    if show_process_status:
        print(f"IQR: Removing {invalid_count} outliers from {column}")
    return df[column].where(valid_mask, np.nan), df[column].where(~valid_mask)

def handle_missing_values(data: pd.DataFrame, column: str,
                         max_gap: int = 8) -> pd.Series:
    """Interpolate missing values with limit.
    Parameters
    ----------
    data : pd.DataFrame
        DataFrame containing the column to fill
    column : str
        Name of the column to interpolate
    max_gap: int, optional
        Maximum number of consecutive NaNs to fill (Defaults to 8.)

    Returns
    ----------
    pd.Series : 
        The column with missing values filled by linear interpolation
    """
    return data[column].interpolate(
        method='linear',
        limit=max_gap  # Fill at most max_gap consecutive NaNs
    )

def preprocess_system_data(show_process_status: bool, file_dir: str, file_name: str, df_arg: pd.DataFrame = None) -> list:
    """Preprocess system performance data.
    Cleans data with:
          * Invalid values removed
          * Duplicates handled
          * Outliers removed
          * Missing values interpolated
    
    Parameters
    ----------
    show_process_status: bool
        Print progress and removal counts to the console
    file_dir : str
        Path to the CSV data file location (directory)
    file_name : str
        Name of the specified CSV file
    
    optional
    df_arg: pd.DataFrame
        output from load_system_data()

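    Returns
    -------
        list [df_original, df_cleaned, file_name], or None if the file could not be loaded
        df_original
            Copy of the data as loaded, before cleaning
        df_cleaned
            Cleaned data, indexed by datetime
        
        str: filename
            Name of the processed CSV file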
    """
    # Check DataFrame was passed
    if isinstance(df_arg, pd.DataFrame):
        df = df_arg
        print("Function called with a DataFrame.")
    else:
        # Attempt to read the DataFrame from file
        try:
            file_path = file_dir + file_name
            df = pd.read_csv(file_path, delimiter = ",",usecols=["datetime","load-15m","memory_used_pct","cpu-user","cpu-system","sys-thermal","sys-interrupt-rate","server-up","disk-io-time"])
            print(f"Function called with a file: {file_path}")
            df['datetime'] = pd.to_datetime(df['datetime'])
            df.set_index('datetime', inplace=True)
        except Exception as e:
            print(f"Error loading the file: {e}")
            return None
    # Store original data
    df_original = df.copy()
    df_outliers = df.copy()
    out_dir = pathProcessd
    #columns = ["load-15m","memory_used_pct","cpu-user","cpu-system","sys-thermal","sys-interrupt-rate","disk-io-time"]
    valid_ranges = {
        "load-15m":  (0, 1.0), 
        "memory_used_pct": (0, 100),
        "cpu-user": (0.0, 2.0),
        "cpu-system": (0.0, 2.0),
        "sys-thermal": (-10, 10),
        "sys-interrupt-rate": (0, 100000),
        "disk-io-time": (0, 1.0),
        "server-up": (0, 2)
    }
    columns = list(valid_ranges.keys())
    
    # 1. Handle invalid values
    for column, (min_val, max_val) in valid_ranges.items():
        invalid_mask = (df[column] < min_val) | (df[column] > max_val)
        if show_process_status:
            print(f"Ranges: Removing {invalid_mask.sum()} invalid values from {column}")
        df.loc[invalid_mask, column] = np.nan
    
    # 2. Handle duplicates --- needs work !! ------
    if show_process_status:
        print("Handling duplicate timestamps...")
    df = df.groupby(['datetime', 'server-up']).agg({
        'load-15m': 'mean',
        'memory_used_pct': 'mean',
        'cpu-user': 'mean',
        "cpu-system": 'mean',
        'sys-thermal': 'mean' ,
        "sys-interrupt-rate": 'mean',
        "disk-io-time": 'mean'
    }).reset_index()
        
    # 3. Remove outliers
    for column in columns:
        # df was re-indexed by reset_index(), so the outlier series would not align with df_outliers; it is discarded
        df[column], _ = remove_outliers_iqr(show_process_status, df, column)
    # testing df.to_csv("noHandling_data.csv", index=False)
    
    # 4. Handle missing values
    if show_process_status:
        print("\nHandling missing values...")
        print(f"Missing values before handling: \n{df.isnull().sum()}")
    
    # just delete rows with empty entries ... no interpolation !
    # for  column, (min_val, max_val) in valid_ranges.items():
    #     df[column] = handle_missing_values(df, column,4)
    
    # Chained so that no in-place change is made on the result of dropna()
    df_cleaned = df.dropna().set_index('datetime')
    # testing print("After dropping empty entries: \n", df_cleaned.head())
    # Handle missing values by sensor -- version 2
    '''df_cleaned = pd.DataFrame()
    for sensor in sorted(df['server-up'].unique()):
        if show_process_status:
            print(f"Processing sensor {sensor}...")
        sensor_data = df[df['server-up'] == sensor].copy()
        
        # Ensure datetime column is a DatetimeIndex
        sensor_data['datetime'] = pd.to_datetime(sensor_data['datetime'])
        sensor_data.set_index('datetime', inplace=True)
        
        # Resample to regular intervals (e.g., 5-minute intervals)
        sensor_data = sensor_data.resample('5min').mean() # 5T->5min
        
        # Interpolate missing values
        for column in columns:
            sensor_data[column] = sensor_data[column].interpolate(
                method='linear',
                limit=4
            )

        # Add back the sensor ID
        sensor_data['server-up'] = sensor
        
        # Append to cleaned dataframe
        df_cleaned = pd.concat([df_cleaned, sensor_data], sort=False)'''
    
    # Sort by datetime
    df_cleaned.sort_index(inplace=True)
    
    if show_process_status:
        print(f"Missing values after handling: \n{df_cleaned.isnull().sum()}")
        print(f"\nOriginal shape: {df_original.shape}")
        print(f"Cleaned shape: {df_cleaned.shape}")
    
    df_cleaned.to_csv(out_dir+file_name, index=True)

    return [df_original, df_cleaned, file_name]

In [9]:
# Run processing
# anylist = []
# for file in files:
#   anylist = preprocess_system_data(False, pathFilter,file,None)
#   print(anylist[2], " : \n", anylist[0].describe(),"\n", anylist[1].describe()) 

In [10]:
# Test: if processed data makes sense 
#cache_list = preprocess_system_data(True, pathFilter,"system-1.csv",None)
#visu_raw_data(True, None,None,pd.DataFrame(cache_list[1]))

In [11]:
#Run Visualization of Processed
for file in files:
    visu_raw_data(False, pathProcessd,file,None, False)
    plt.close("all") #for safety

Function called with a file: ./data_processed/system-1.csv
./visu_only_processed/system-1_plot_0.png
./visu_only_processed/system-1_plot_1.png
./visu_only_processed/system-1_plot_2.png
./visu_only_processed/system-1_plot_3.png
./visu_only_processed/system-1_plot_4.png
./visu_only_processed/system-1_plot_5.png
./visu_only_processed/system-1_plot_6.png
./visu_only_processed/system-1_allPlots.pdf
Function called with a file: ./data_processed/system-2.csv
./visu_only_processed/system-2_plot_0.png
./visu_only_processed/system-2_plot_1.png
./visu_only_processed/system-2_plot_2.png
./visu_only_processed/system-2_plot_3.png
./visu_only_processed/system-2_plot_4.png
./visu_only_processed/system-2_plot_5.png
./visu_only_processed/system-2_plot_6.png
./visu_only_processed/system-2_allPlots.pdf
Function called with a file: ./data_processed/system-3.csv
./visu_only_processed/system-3_plot_0.png
./visu_only_processed/system-3_plot_1.png
./visu_only_processed/system-3_plot_2.png
./visu_only_processed

## Loadfile

- Used when different data sets are to be compared.
- Used when data is to be loaded into a DataFrame instead of being read directly by a function.
- Otherwise, visu_processed_data() is called directly after preprocess_system_data(), since the outputs of the latter match the inputs of the former.
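
The hand-off between preprocessing and visualization can be sketched with two small stand-in functions. Both stubs are hypothetical and only mimic the return value and parameter order of preprocess_system_data() and visu_processed_data(). They do not clean or plot anything.

```python
import pandas as pd

def preprocess_stub(df: pd.DataFrame, name: str) -> list:
    """Hypothetical stand-in for preprocess_system_data(): returns [original, cleaned, name]."""
    return [df.copy(), df.dropna(), name]

def visualize_stub(show_plots: bool, df_original: pd.DataFrame,
                   df_cleaned: pd.DataFrame, filename: str) -> str:
    """Hypothetical stand-in for visu_processed_data(): only reports what it received."""
    return f"{filename}: {len(df_original)} raw rows, {len(df_cleaned)} cleaned rows"

sample = pd.DataFrame({"cpu-user": [0.1, None, 0.3]})
result = preprocess_stub(sample, "system-1.csv")

# The returned list matches the remaining positional parameters,
# so it can be unpacked directly into the visualization call.
if result is not None:
    report = visualize_stub(False, *result)
    print(report)
```

The `is not None` check matters for the real function, because preprocess_system_data() returns None when the file cannot be loaded.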

In [12]:
def load_file(file_dir: str, file_name: str) -> tuple:
    """Loads file from path and returns dataframe and its name (as tuple).
    
    Parameters
    ----------
    file_dir : str
        Path to the CSV data file location (directory)
    file_name : str
        Name of the specified CSV file
    
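    Returns
    --------
        tuple(pd.DataFrame, str)
        pd.DataFrame:
            Loaded data, indexed by datetime
        str: file_name
            File name without the ".csv" extension
        Returns (None, None) if the file could not be loaded.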
    """
    try:
        file_path = file_dir + file_name
        df = pd.read_csv(file_path, delimiter = ",",usecols=["datetime","load-15m","memory_used_pct","cpu-user","cpu-system","sys-thermal","sys-interrupt-rate","server-up","disk-io-time"])
        print(f"Function called with a file: {file_path}")
        df['datetime'] = pd.to_datetime(df['datetime'])
        df.set_index('datetime', inplace=True)
        # File name and path -> path used => use identifier 
        file_name = file_name.replace('.csv', '')
        return (df, file_name)
    except Exception as e:
        print(f"Error loading the file: {e}")
        return (None,None)
    
#testing
#print(load_file(pathFilter, files[0]))

## Visualize Raw & Processed
- **2.2: Time series visualizations**
- **2.2: Distribution analysis with histograms**
- **2.2: Correlation analysis and heatmaps**
- **2.2: Daily pattern analysis**
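
To show why the raw and processed correlations are worth comparing, the sketch below builds a synthetic data set in which a few injected spikes hide an otherwise strong linear relationship. The data, the spike value of 50.0 and the cut-off of 2.0 are assumptions for illustration, not Westermo measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical data: cpu-user follows load-15m closely, plus a small ripple.
n = 200
load = np.linspace(0.1, 0.9, n)
cpu = load + 0.05 * np.sin(np.arange(n))
raw = pd.DataFrame({"load-15m": load, "cpu-user": cpu})

# Inject a spike into every 20th row to mimic invalid readings.
raw.loc[::20, "cpu-user"] = 50.0

# Simple range-based cleaning: drop rows above the assumed valid maximum.
cleaned = raw[raw["cpu-user"] <= 2.0]

# Pearson correlation matrices before and after cleaning, and their difference.
corr_raw = raw.corr()
corr_cleaned = cleaned.corr()
corr_delta = corr_cleaned - corr_raw
print(corr_delta.round(2))
```

Any of the three matrices can be passed to sns.heatmap(), as is done for the single-data-set heatmap in visu_raw_data().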

In [13]:
def visu_processed_data(show_plots: bool, df_original: pd.DataFrame, df_cleaned: pd.DataFrame, filename: str) -> None:
    """Load and visualize original and processed test system performance data.
    
    Parameters
    ----------
    show_plots: bool
        Just output files or display in notebook
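    df_original: pd.DataFrame
        Original data before cleaning
    df_cleaned: pd.DataFrame
        Cleaned data, e.g. from preprocess_system_data()
    filename: str
        Name of the source file; used to name the .pdf output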

    Additional outputs
    saves visualized data into dir "./visu_processed" by calling save_image() and cleaning temp-files with delete_images()
    
    Returns
    -------
        None
    """
    
    out_dir = pathVisuProcessed
    out_name = filename.replace('.csv','')
    image_names = []
    image_nr = 0
   
    measurements = {
        "load-15m": ('load-15m', '%'),
        "memory_used_pct": ('memory_used_pct', '%'),
        "cpu-user": ('cpu-user', 'delta-s'),
        "cpu-system": ('cpu-system', 'delta-s'),
        "sys-thermal": ('sys-thermal', 'avg delta-°C/min'),
        "sys-interrupt-rate": ('sys-interrupt-rate', 'delta-s'),
        "disk-io-time": ('disk-io-time', 'delta-s')
        #,"server-up": ('server-sup', '')
    }
    measures = list(measurements.keys())

    # Plot 1: Time-Series
    fig, axes = plt.subplots(4, 2, figsize=(15, 30))
    fig.suptitle('Time-Series - Raw & Processed Data', fontsize=16, y=1.02)

    for i,(measure, (title, unit)) in enumerate(measurements.items()):
        row = i // 2
        col = i % 2
        df_original[measure].iloc[::10].plot(ax=axes[row, col], color='lightblue', alpha=0.3, label='Original')
        df_cleaned[measure].iloc[::10].plot(ax=axes[row, col], color='green', alpha=0.5, label='Cleaned')
        axes[row, col].set_title(f'Time-Series of {measure.upper()}')
        axes[row, col].set_xlabel('Datetime')
        axes[row, col].set_ylabel(f'{title} ({unit})')
        axes[row, col].grid(True)
        axes[row, col].legend()
        
    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------
    
    # Plot 2: Daily Patterns
    fig, axes = plt.subplots(4, 2, figsize=(15, 30))
    fig.suptitle('Daily Patterns of Raw & Processed Measurements - mean & std ', fontsize=16, y=1.02)

    # Create hour column for grouping
    df_hour_orig = df_original.copy()
    df_hour_clean = df_cleaned.copy()
    df_hour_orig['hour'] = df_hour_orig.index.hour
    df_hour_clean['hour'] = df_hour_clean.index.hour
    
    for i, measurement in enumerate(measurements):
        row = i // 2
        col = i % 2
        
        # Calculate hourly statistics
        hourly_stats_orig = df_hour_orig.groupby('hour')[measurement].agg(['mean', 'std'])
        hourly_stats_clean = df_hour_clean.groupby('hour')[measurement].agg(['mean', 'std'])

        # Plot mean with standard deviation
        axes[row, col].plot(hourly_stats_clean.index, hourly_stats_clean['mean'], 
                        'g-', label='Mean Processed')
        axes[row, col].fill_between(
            hourly_stats_clean.index,
            hourly_stats_clean['mean'] - hourly_stats_clean['std'],
            hourly_stats_clean['mean'] + hourly_stats_clean['std'],
            alpha=0.3,
            color='lightgreen',
            label='±1 std Processed'
        )
        #ax_2 = axes[row, col].twinx()

        axes[row, col].plot(hourly_stats_orig.index, hourly_stats_orig['mean'], 
                        'b-', label='Mean Raw')
        axes[row, col].fill_between(
            hourly_stats_orig.index,
            hourly_stats_orig['mean'] - hourly_stats_orig['std'],
            hourly_stats_orig['mean'] + hourly_stats_orig['std'],
            alpha=0.2,
            color='lightblue',
            label='±1 std Raw'
        )
        #ax_2.tick_params(axis='y', labelcolor='b')
        
        axes[row, col].set_title(f'Daily {measurement.capitalize()} Pattern')
        axes[row, col].set_xlabel('Hour of Day')
        axes[row, col].set_ylabel(measurement)
        axes[row, col].grid(True)
        axes[row, col].legend()


    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------

    '''orig_out_pops = dict(marker='o', color='grey', markersize=2, linestyle='none', alpha=0.2)
    clean_out_prop = dict(marker='x', color='green', markersize=6, linestyle='none', alpha=0.3)
    
    # Plot 3: Hour-wise Distributions
    fig, axes = plt.subplots(4, 2, figsize=(15, 30))
    fig.suptitle('Measurement Distributions by Hour - Boxplots', fontsize=16, y=1.02)
        
    for i,(measure, (title, unit)) in enumerate(measurements.items()):
        row = i // 2
        col = i % 2
        
        #ax_2 = axes[row, col].twinx()
        df_hour_orig.boxplot(
            ax=axes[row, col],
            column=measure,
            label = 'Raw',
            by='hour',
            color='grey',
            flierprops=orig_out_pops
        )
        df_hour_clean.boxplot(
            ax=axes[row, col],
            column=measure,
            label = 'Processed',
            by='hour',
            color='green',
            flierprops=clean_out_prop
        )
        #ax_2.tick_params(axis='y', labelcolor='b')

        axes[row, col].set_title(f'Daily Pattern of {title} ')
        axes[row, col].set_xlabel('Hour of Day')
        axes[row, col].set_ylabel(f'{title} ({unit})')
        axes[row, col].grid(True)
        axes[row, col].legend()'''

    
    del df_hour_clean, df_hour_orig
    #----------------------------------------------
    '''plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1'''
    #----------------------------------------------

    # Plot 4 Histograms - Distribution
    fig, axes = plt.subplots(4,2, figsize = (15, 25))
    fig.suptitle('Sensor Processed Measurements Distributions', fontsize = 14)

    for i,(measure, (title, unit)) in enumerate(measurements.items()):
        row = i // 2
        col = i % 2
        bin_num = 50
        axes[row, col].hist(df_cleaned[measure], bins = bin_num*4, density = True, alpha = 0.7)
        axes[row, col].set_title(f'Distribution of {title} ')
        axes[row, col].set_xlabel( f'{title} ({unit})')
        axes[row, col].set_ylabel('Density')
        axes[row, col].grid(True)
        
        #second axis for line graph
        ax_2 = axes[row, col].twinx()
        counts, bins = np.histogram(df_cleaned[measure], bins = bin_num)
        bin_centers = (bins[:-1] + bins[1:]) / 2
        ax_2.plot(bin_centers, counts/counts.sum(), 'r-', lw = 2, label = 'Distribution')        
        ax_2.tick_params(axis='y', labelcolor='r')
        ax_2.legend()



    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------
    
    # Plot 5: Correlation Analysis
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    fig.suptitle('Correlation Analysis - Original vs Cleaned', y=1.02, fontsize=16)

    # Original correlations
    sns.heatmap(
        df_original[measures].corr(),
        annot=True,
        cmap='Blues', #coolwarm
        center=0,
        fmt='.2f',
        ax=ax1
    )
    ax1.set_title('Original Data Correlations')

    # Cleaned correlations
    sns.heatmap(
        df_cleaned[measures].corr(),
        annot=True,
        cmap='Greens',
        center=0,
        fmt='.2f',
        ax=ax2
    )
    ax2.set_title('Cleaned Data Correlations')

    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------
    # !!! SOURCE !!!
    # Plot 6 Hexbins
    pairs = [(measures[i], measures[j]) for i in range(len(measures)) for j in range(i + 1, len(measures))]

    # Number of subplots
    n_rows = 7
    n_cols = 3

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
    fig.suptitle('Hexbins of Processed Measurements', y=1.02, fontsize=16)
    axes = axes.flatten()  # Flatten the axes to make indexing easier

    # Loop over the pairs
    for i, (measure1, measure2) in enumerate(pairs):
        ax = axes[i]
        x = df_cleaned[measure1]
        y = df_cleaned[measure2]
        
        title1, unit1 = measurements[measure1]
        title2, unit2 = measurements[measure2]
        
        # Plot hexbin
        hb = ax.hexbin(x, y, gridsize=100, cmap='viridis')
        ax.set_xlabel(f'{title1} ({unit1})')
        ax.set_ylabel(f'{title2} ({unit2})')
        ax.set_title(f'Hexbin: {title1} vs {title2}')
        # Add color bar
        fig.colorbar(hb, ax=ax)
        
    #----------------------------------------------
    plt.tight_layout()
    
    temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
    fig.savefig(temp_name, dpi=150, bbox_inches='tight')
    image_names.append(temp_name)
    image_nr += 1
    #----------------------------------------------

    # Plot 7: Scatter Matrix
    df_plot1 = df_original
    df_plot1['State'] = 'raw'
    df_plot2 = df_cleaned
    df_plot2['State'] = 'processed'
    df_plot1 = pd.concat([df_plot1, df_plot2])
    
    #testing
    # df_plot1.to_csv(out_dir+"concatPlot1Plot2.csv", index=True)
    # print(df_plot1.shape, " and ", df_plot2.shape)
    # print(df_plot1.head(), "\n", df_plot2.head())
    
    del df_plot2
    pp = None
    try:
        pp = sns.pairplot(data=df_plot1,
                            diag_kind='kde',
                            vars = measures,
                            hue='State',
                            markers=["o","D"],
                            plot_kws={'alpha': 0.5, 's': 20},
                            height = 3,
                            corner=True)
    except Exception as e:
        print(f"Warning: Could not create scatter matrix plot: {e}")

    del df_plot1
    # Only title and save the figure if the pairplot was created;
    # otherwise pp is still None and pp.figure would raise an AttributeError
    if pp is not None:
        fig = pp.figure
        fig.suptitle('Scatter Matrix of Raw and Processed Measurements', y=1.02, fontsize=16)
        #----------------------------------------------
        plt.tight_layout()

        temp_name = f"{out_dir}{out_name}_plot_{image_nr}.png"
        fig.savefig(temp_name, dpi=200, bbox_inches='tight')
        image_names.append(temp_name)
        image_nr += 1
        #----------------------------------------------
    
    save_image(image_names, out_dir, out_name)

    if not show_plots:
        plt.close("all")

In [14]:
# testing
# load same file filtered and processed

# ca1 = load_file(pathFilter,files[2])
# ca2 = load_file(pathProcessd,files[2])
# #visualize it (exporting pdf)
# visu_processed_data(False, ca1[0], ca2[0], ca2[1])

In [15]:
for file in files:
    result = preprocess_system_data(False, pathFilter, file, None)
    # Skip files that could not be processed (assumes None is returned on failure)
    if result is None:
        continue
    # Positional arguments: show_plots, original data, cleaned data, output name
    visu_processed_data(False, result[0], result[1], result[2])

Function called with a file: ./data_filtered/system-1.csv
./visu_processed/system-1_plot_0.png
./visu_processed/system-1_plot_1.png
./visu_processed/system-1_plot_2.png
./visu_processed/system-1_plot_3.png
./visu_processed/system-1_plot_4.png
./visu_processed/system-1_plot_5.png
./visu_processed/system-1_allPlots.pdf
Function called with a file: ./data_filtered/system-2.csv
./visu_processed/system-2_plot_0.png
./visu_processed/system-2_plot_1.png
./visu_processed/system-2_plot_2.png
./visu_processed/system-2_plot_3.png
./visu_processed/system-2_plot_4.png
./visu_processed/system-2_plot_5.png
./visu_processed/system-2_allPlots.pdf
Function called with a file: ./data_filtered/system-3.csv
./visu_processed/system-3_plot_0.png
./visu_processed/system-3_plot_1.png
./visu_processed/system-3_plot_2.png
./visu_processed/system-3_plot_3.png
./visu_processed/system-3_plot_4.png
./visu_processed/system-3_plot_5.png
./visu_processed/system-3_allPlots.pdf
Function called with a file: ./data_filtere

## Analysis
- **2.1: Preprocessed vs original data visual analysis**
> ...

## 2.3 Probability Analysis
- **Threshold-based probability estimation**
- **Cross tabulation analysis**
- **Conditional probability analysis**
- **Summary of observations from each task**

In [16]:
def FUNCTIONNAME(show_process_status: bool, file_dir: str, file_name: str, df_arg: pd.DataFrame = None) -> list:
    """Estimate threshold-based probabilities for system load and memory usage.
    Computes:
          * P(load-15m > mean of load-15m)
          * P(memory_used_pct > 0.12)
          * P(memory_used_pct > 0.12 AND load-15m > mean)
          * P(memory_used_pct > 0.12 | load-15m > mean)
    
    Parameters
    ----------
    show_process_status: bool
        ...
    file_dir : str
        Path to the CSV data file location (directory)
    file_name : str
        Name of the specified CSV file
    
    optional
    df_arg: pd.DataFrame
        output from load_system_data()

    Returns
    -------
        ...
    """
    # Check DataFrame was passed
    if isinstance(df_arg, pd.DataFrame):
        df = df_arg
        print("Function called with a DataFrame.")
    else:
        # Attempt to read the DataFrame from file
        try:
            file_path = file_dir + file_name
            df = pd.read_csv(file_path, delimiter = ",",usecols=["datetime","load-15m","memory_used_pct","cpu-user","cpu-system","sys-thermal","sys-interrupt-rate","server-up","disk-io-time"])
            print(f"Function called with a file: {file_path}")
            # Parse the timestamps first, then move them into the index;
            # after set_index() the 'datetime' column no longer exists
            df['datetime'] = pd.to_datetime(df['datetime'])
            df.set_index('datetime', inplace=True)
        except Exception as e:
            print(f"Error loading the file: {e}")
            return None
    
    # 1. P(load > mean): boolean mask of rows whose 15-minute load exceeds its mean
    high_load = df['load-15m'] > df['load-15m'].mean()
    p_high_load = high_load.sum() / df['load-15m'].count()

    # 2. P(mem > 12%): threshold of 0.12 on memory_used_pct
    high_memory = df['memory_used_pct'] > 0.12
    p_high_memory = high_memory.sum() / df['memory_used_pct'].count()

    # 3. Joint probability P(mem ∩ load)
    p_joint = (high_memory & high_load).sum() / df['memory_used_pct'].count()

    # 4. Conditional probability P(mem | load) = P(mem ∩ load) / P(load)
    p_high_memory_given_load = p_joint / p_high_load

    print("P(load-15m > mean):", p_high_load)
    print("P(memory > 12%):", p_high_memory)
    print("P(memory > 12% | load-15m > mean):", p_high_memory_given_load)
    print("P(high load AND high memory):", p_joint)

    # Return the results so they match the declared list return type
    return [p_high_load, p_high_memory_given_load, p_joint]
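
The cross tabulation listed among the tasks for this section is not covered by the cell above. A minimal sketch of how it could be done with `pd.crosstab` follows. It assumes the same two thresholds (load above its mean, memory above 0.12); the helper name `load_memory_crosstab` and the random demo data are hypothetical and are not taken from the Westermo files.

```python
import numpy as np
import pandas as pd


def load_memory_crosstab(df, mem_threshold=0.12, normalize="all"):
    """Cross-tabulate high/low load against high/low memory (hypothetical helper)."""
    load_state = pd.Series(
        np.where(df["load-15m"] > df["load-15m"].mean(), "high", "low"),
        index=df.index, name="load",
    )
    memory_state = pd.Series(
        np.where(df["memory_used_pct"] > mem_threshold, "high", "low"),
        index=df.index, name="memory",
    )
    # normalize="all"   -> every cell is a joint probability
    # normalize="index" -> every row sums to 1, i.e. P(memory state | load state)
    return pd.crosstab(load_state, memory_state, normalize=normalize)


# Synthetic stand-in data, NOT taken from the Westermo dataset
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "load-15m": rng.gamma(2.0, 0.15, size=1000),
    "memory_used_pct": rng.uniform(0.05, 0.20, size=1000),
})

joint = load_memory_crosstab(demo, normalize="all")
conditional = load_memory_crosstab(demo, normalize="index")
print(joint)
print(conditional)
```

In this layout `conditional.loc["high", "high"]` is P(high memory | high load), and the same value can be recovered from the joint table by dividing the cell by its row sum.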


## 2.4 Statistical Theory Applications
- **Law of Large Numbers demonstration**
- **Central Limit Theorem application**
- **Result interpretation**
> ...

In [17]:
#TODO
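
As a possible starting point for the two demonstrations listed above, the sketch below uses a synthetic, skewed population instead of a real measurement column. The exponential population, the sample sizes and all variable names are assumptions made for illustration only. It shows the running mean approaching the population mean (Law of Large Numbers) and the spread of sample means matching sigma / sqrt(n) (Central Limit Theorem).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, skewed stand-in for a measurement such as load-15m (NOT real data)
population = rng.exponential(scale=0.3, size=100_000)
true_mean = population.mean()

# Law of Large Numbers: the running mean of a growing sample
# approaches the population mean
sample = rng.choice(population, size=20_000, replace=True)
running_mean = np.cumsum(sample) / np.arange(1, sample.size + 1)
print("error after 100 draws:  ", abs(running_mean[99] - true_mean))
print("error after 20000 draws:", abs(running_mean[-1] - true_mean))

# Central Limit Theorem: means of many samples of size n are approximately
# normal, centred on the population mean, with std close to sigma / sqrt(n)
n = 50
sample_means = rng.choice(population, size=(2_000, n), replace=True).mean(axis=1)
expected_std = population.std() / np.sqrt(n)
print("std of sample means:", sample_means.std(ddof=1))
print("sigma / sqrt(n):    ", expected_std)
```

Replacing `population` with a cleaned column (for example the values of `load-15m`) would apply the same steps to the real data. A histogram of `sample_means` would then show the roughly bell-shaped distribution even though the underlying values are skewed.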

## 2.5 Regression Analysis
- **Linear/Polynomial model selection**
- **Model fitting and validation**
- **Result interpretation and analysis**
> ...

In [18]:
#TODO
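
One way to approach the model selection and validation steps listed above is to fit polynomials of increasing degree on a training split and compare their R² on held-out points. The sketch below does this with `np.polyfit` on synthetic data. The relationship between `load` and `thermal`, the 75/25 split and the function name `validation_r2` are assumptions for illustration, not results from the dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: a "temperature" that rises roughly quadratically
# with "load" plus noise (NOT real data)
load = rng.uniform(0.0, 2.0, size=400)
thermal = 40.0 + 5.0 * load + 3.0 * load**2 + rng.normal(0.0, 0.5, size=400)

# Hold out 25% of the points for validation
idx = rng.permutation(load.size)
train, valid = idx[:300], idx[300:]


def validation_r2(degree: int) -> float:
    """Fit a polynomial on the training split and return R^2 on the hold-out split."""
    coeffs = np.polyfit(load[train], thermal[train], deg=degree)
    pred = np.polyval(coeffs, load[valid])
    ss_res = np.sum((thermal[valid] - pred) ** 2)
    ss_tot = np.sum((thermal[valid] - thermal[valid].mean()) ** 2)
    return 1.0 - ss_res / ss_tot


scores = {degree: validation_r2(degree) for degree in (1, 2, 3)}
for degree, r2 in scores.items():
    print(f"degree {degree}: validation R^2 = {r2:.4f}")
```

If a higher degree barely improves the validation score, the simpler model is usually preferable, since the extra terms mostly fit noise.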