# Data Analysis Assignment 3
**Due Date:** December 20, 2024, 11:59  
**Total Points:** 100 + 20 bonus points

## Copyright and Fair Use

This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this material, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

## Guidelines

1. **DO NOT add or delete any cells (or modify cell IDs)**
2. Complete code cells marked with `# YOUR CODE HERE`
3. Comment out or remove lines with `raise NotImplementedError()`
4. Run all cells before submission to verify your solutions
5. Submit the notebook (.ipynb file) on Moodle, naming the file in the required format, e.g., **Assignment_3_JohnDoe_12345678.ipynb**

# Part 1: Basic Probability Analysis

## Initial Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Configure plotting
# Set Seaborn style for better visualizations
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)
plt.rcParams['figure.figsize'] = [10, 6]

# Set random seed for reproducibility
np.random.seed(42)

### Exercise 1: Power Supply Testing (20 points)
A power supply unit undergoes three types of tests:
- Voltage stability test (success rate: 95%)
- Load regulation test (success rate: 92%)
- Temperature stress test (success rate: 88%)

The power supply must pass all three tests to be certified. Tests are conducted independently.

Calculate:
1. Probability of a power supply passing certification (all tests)
2. Probability of failing exactly one test
3. Probability of failing at least two tests

In [2]:
# Success
p_voltage_success = 0.95
p_load_success = 0.92
p_temp_success = 0.88

# Failure  (1 - success rate)
p_voltage_fail = 1 - p_voltage_success
p_load_fail = 1 - p_load_success
p_temp_fail = 1 - p_temp_success

# 1.
p_all_pass = p_voltage_success * p_load_success * p_temp_success

# 2.
p_one_fail = (
    p_voltage_fail    * p_load_success * p_temp_success +
    p_voltage_success * p_load_fail    * p_temp_success +
    p_voltage_success * p_load_success * p_temp_fail
    )

# 3.
p_two_plus_fail = (
    p_voltage_fail     * p_load_fail    * p_temp_success +  
    p_voltage_fail     * p_load_success * p_temp_fail +  
    p_voltage_success  * p_load_fail    * p_temp_fail +
    p_voltage_fail     * p_load_fail    * p_temp_fail      
    )

print("passing all: ", p_all_pass,"\nfailing ex. one: ", p_one_fail,"\nfailing min two:", p_two_plus_fail, "\t")

passing all:  0.76912 
failing ex. one:  0.21224 
failing min two: 0.018640000000000004 	


In [3]:
# Test cell
assert isinstance(p_all_pass, float), "p_all_pass must be a float"
assert isinstance(p_one_fail, float), "p_one_fail must be a float"
assert isinstance(p_two_plus_fail, float), "p_two_plus_fail must be a float"
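
As a cross-check on the three results above, the eight pass/fail combinations can be enumerated directly. This is only a minimal sketch, not part of the graded solution: the names `success_rates` and `fail_count_distribution` are illustrative assumptions, and the calculation relies on the independence of the tests stated in the exercise.

```python
from itertools import product
from math import isclose

# Success rates from the exercise statement (voltage, load, temperature)
success_rates = [0.95, 0.92, 0.88]


def fail_count_distribution(rates):
    """Return {number_of_failed_tests: probability} for independent tests."""
    distribution = {k: 0.0 for k in range(len(rates) + 1)}
    # Each outcome is a tuple of booleans: True means the test passed
    for outcome in product([True, False], repeat=len(rates)):
        prob = 1.0
        for passed, rate in zip(outcome, rates):
            prob *= rate if passed else (1 - rate)
        distribution[outcome.count(False)] += prob
    return distribution


dist = fail_count_distribution(success_rates)
check_all_pass = dist[0]
check_one_fail = dist[1]
check_two_plus_fail = dist[2] + dist[3]

# The three events partition the sample space, so they must sum to 1
assert isclose(check_all_pass + check_one_fail + check_two_plus_fail, 1.0)
print(check_all_pass, check_one_fail, check_two_plus_fail)
```

Because the three events cover every possible outcome, the "at least two failures" probability can also be obtained as the complement, 1 - P(all pass) - P(exactly one fail).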

### Exercise 2: Test Coverage (30 points)
In a test suite with 500 test cases:
- 400 pass consistently
- 60 fail intermittently
- 40 fail consistently

Calculate:
1. Probability of randomly selecting a passing test
2. Probability of selecting a consistently failing test given that a failed test was selected
3. Create a probability tree diagram for test outcomes


In [4]:
total_tests = 500
pass_tests = 400
intermittent_fail_tests = 60
consistent_fail_tests = 40

# 1.
p_pass = pass_tests / total_tests
p_fail = 1 - p_pass  # = (intermittent_fail_tests + consistent_fail_tests) / total_tests

# 2. 
# P(consistent fail | fail) = consistent fails / all fails
p_consistent_fail_given_fail = consistent_fail_tests / (intermittent_fail_tests + consistent_fail_tests)


print("rand sel passing: ", p_pass,"Fail -> consist: ", p_consistent_fail_given_fail)

rand sel passing:  0.8 Fail -> consist:  0.4


In [5]:
'''
───┬── Pass (400/500) 
   |              
   └──Fail (100/500)   ───┬── Consistent Fail (40/100)
                          └── Intermittent Fail (60/100)
'''

'\n───┬── Pass (400/500) \n   |              \n   └──Fail (100/500)   ───┬── Consistent Fail (40/100)\n                          └── Intermittent Fail (60/100)\n'

In [6]:
# Test cell
assert isinstance(p_pass, float), "p_pass must be a float"
assert isinstance(p_consistent_fail_given_fail, float), "p_consistent_fail_given_fail must be a float"
assert 0 <= p_pass <= 1, "Probability must be between 0 and 1"
assert 0 <= p_consistent_fail_given_fail <= 1, "Conditional probability must be between 0 and 1"
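
The tree diagram can be checked numerically: multiplying the probabilities along each path should reproduce the raw counts, and the leaves should sum to 1. The sketch below uses exact fractions to avoid rounding. The names `p_pass_exact`, `p_fail_exact`, and `leaves` are illustrative assumptions and are not required by the assignment.

```python
from fractions import Fraction

# Counts from the exercise statement
total_tests = 500
pass_tests = 400
intermittent_fail_tests = 60
consistent_fail_tests = 40
fail_tests = intermittent_fail_tests + consistent_fail_tests

# First-level branches of the tree
p_pass_exact = Fraction(pass_tests, total_tests)
p_fail_exact = Fraction(fail_tests, total_tests)

# Second-level branches, conditional on a failed test
p_consistent_given_fail = Fraction(consistent_fail_tests, fail_tests)
p_intermittent_given_fail = Fraction(intermittent_fail_tests, fail_tests)

# Multiplying along a path gives the joint probability of that leaf
leaves = {
    "pass": p_pass_exact,
    "consistent fail": p_fail_exact * p_consistent_given_fail,
    "intermittent fail": p_fail_exact * p_intermittent_given_fail,
}

# Leaves must match the raw counts and sum to 1
assert leaves["consistent fail"] == Fraction(consistent_fail_tests, total_tests)
assert sum(leaves.values()) == 1

for name, prob in leaves.items():
    print(f"{name}: {prob} = {float(prob)}")
```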



### Exercise 3: System Performance Analysis (20 points)
We analyze when a test system is busy. From monitoring data we know:
- 70% of test runs happen at night (between 8PM and 8AM)
- 60% of all time slots show high system activity
- 75% of night time slots show high system activity

Calculate:
1. If the system shows high activity, what's the probability it's night time? 
2. Are time of day (night/day) and system activity (high/low) independent events?


In [7]:
p_night = 0.7
p_high_activity = 0.6
p_high_given_night = 0.75

# 1.  P(Night | High) 
p_night_given_high = (p_high_given_night * p_night) / p_high_activity

# 2.  Independent only if P(High | Night) == P(High)
# Compare with a tolerance instead of exact float equality: https://docs.python.org/3/tutorial/floatingpoint.html
are_independent = bool(np.isclose(p_high_given_night, p_high_activity))

print("p(night | high): ", p_night_given_high)
print("are independent: ", are_independent)


p(night | high):  0.8749999999999999
are independent:  False


In [8]:
# Test cell
assert isinstance(p_night_given_high, float), "p_night_given_high must be a float"
assert isinstance(are_independent, bool), "are_independent must be a boolean"
assert 0 <= p_night_given_high <= 1, "Probability must be between 0 and 1"
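
The printed value 0.8749999999999999 is a floating-point rounding of the exact result 7/8. The following sketch redoes the Bayes step with exact fractions and also derives P(High | Day) from the law of total probability, which makes the dependence on time of day explicit. The variable names `p_high`, `p_day`, `p_high_given_day`, and `independent` are illustrative assumptions.

```python
from fractions import Fraction

p_night = Fraction(7, 10)            # P(Night)
p_high = Fraction(6, 10)             # P(High)
p_high_given_night = Fraction(3, 4)  # P(High | Night)

# Bayes' theorem: P(Night | High) = P(High | Night) * P(Night) / P(High)
p_night_given_high = p_high_given_night * p_night / p_high

# Law of total probability, solved for P(High | Day):
# P(High) = P(High | Night) * P(Night) + P(High | Day) * P(Day)
p_day = 1 - p_night
p_high_given_day = (p_high - p_high_given_night * p_night) / p_day

# Independence would require P(High | Night) == P(High)
independent = p_high_given_night == p_high

print(p_night_given_high, p_high_given_day, independent)
```

High activity is three times as likely at night (3/4) as during the day (1/4), so the two events are not independent.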

# Part 2: Westermo Data Analysis

### Exercise 4: Performance Metrics (30 points)
Using the Westermo test system data (from Assignment 2), analyze:
1. Probability of high system load (load exceeding 0.3, i.e., 30%)
2. Conditional probability of high memory usage (>12%) given high system load
3. Joint probability of high load AND high memory usage

Note: Thresholds have been chosen based on typical system behavior and are specific to this case only.

In [12]:
# Load and prepare data
df = pd.read_csv('system-1.csv')  # Load the data
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')  # Convert timestamp to datetime
df['memory_used_pct'] = (1 - df['sys-mem-available'] / df['sys-mem-total'])  # Memory usage as a fraction in [0, 1] (not multiplied by 100)

# 1. P(load > 30%)
high_load = df['load-15m'] > 0.3  # Boolean Series: True where the 15-minute load exceeds 0.3
p_high_load = high_load.sum() / df['load-15m'].count()

# 2. P(mem | load)
high_memory = df['memory_used_pct'] > 0.12
# P(mem > 12%)
p_high_memory = high_memory.sum() / df['memory_used_pct'].count()

print(p_high_memory)
print(df['load-15m'].count() == df['memory_used_pct'].count())

#P(mem ∩ load) / P(load) = P(mem | load)
p_high_memory_given_load = ((high_memory & high_load).sum() / df['memory_used_pct'].count()) / p_high_load

# 3.
p_joint = (high_memory & high_load).sum() / df['memory_used_pct'].count()


print("P(load-15m > 0.3):", p_high_load)
print("P(memory > 12% | load-15m > 0.3):", p_high_memory_given_load)
print("P(high load AND high memory):", p_joint)


0.0001865910972722714
True
P(load-15m > 0.3): 0.009201273484238883
P(memory > 12% | load-15m > 0.3): 0.0
P(high load AND high memory): 0.0


In [10]:
# Test cell
assert isinstance(p_high_load, float), "p_high_load must be a float"
assert isinstance(p_high_memory_given_load, float), "p_high_memory_given_load must be a float"
assert isinstance(p_joint, float), "p_joint must be a float"
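
The three probabilities can be wrapped in a small reusable function so that the same calculation runs on any DataFrame with the same columns. This is a hedged sketch, not the reference solution: the function name `load_memory_probabilities` and its parameters are hypothetical, and the toy data is invented for illustration and is not taken from `system-1.csv`. It uses the mean of a boolean mask as the empirical probability and guards against an empty conditioning event.

```python
import pandas as pd


def load_memory_probabilities(frame, load_col="load-15m", mem_col="memory_used_pct",
                              load_threshold=0.3, mem_threshold=0.12):
    """Return (P(high load), P(high memory | high load), P(high load and high memory))."""
    # Drop rows where either metric is missing so all probabilities share one denominator
    clean = frame[[load_col, mem_col]].dropna()
    high_load = clean[load_col] > load_threshold
    high_memory = clean[mem_col] > mem_threshold

    p_load = float(high_load.mean())
    p_joint = float((high_load & high_memory).mean())
    # The conditional probability is undefined when no row has high load
    p_mem_given_load = p_joint / p_load if p_load > 0 else float("nan")
    return p_load, p_mem_given_load, p_joint


# Made-up toy data, NOT taken from system-1.csv
toy = pd.DataFrame({
    "load-15m": [0.10, 0.35, 0.40, 0.20, 0.50],
    "memory_used_pct": [0.05, 0.15, 0.10, 0.20, 0.13],
})
p_load, p_mem_given_load, p_joint = load_memory_probabilities(toy)
print(p_load, p_mem_given_load, p_joint)
```

In the toy data, three of five rows have high load and two of those also have high memory, so the conditional probability is 2/3. The same function could be applied unchanged to the original and the preprocessed data to compare results.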

# Bonus Exercise (20 points)

Exercise 4 uses the original data rather than the preprocessed data generated in Assignment 2.

Repeat Exercise 4 using the preprocessed data you generated in Assignment 2, and compare the results.

In [11]:
# YOUR CODE HERE