# Data Analysis Assignment 3
**Due Date:** December 20, 2024, 11:59  
**Total Points:** 100 + 20 bonus points

## Copyright and Fair Use

This material, whether in printed or electronic form, may be used for personal and non-commercial educational purposes only. Any reproduction of this material, in whole or in part and in printed or electronic form, requires the explicit prior consent of the authors.

## Guidelines

1. **DO NOT add or delete any cells (or modify cell IDs)**
2. Complete code cells marked with `# YOUR CODE HERE`
3. Comment out or remove lines containing `raise NotImplementedError()`
4. Run all cells before submission to verify your solutions
5. Submit the notebook (.ipynb file) on Moodle, using the correct filename format, e.g., **Assignment_3_JohnDoe_12345678.ipynb**

# Part 1: Basic Probability Analysis

## Initial Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Configure plotting: set Seaborn style for better visualizations
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)
plt.rcParams['figure.figsize'] = [10, 6]

# Set random seed for reproducibility
np.random.seed(42)

### Exercise 1: Power Supply Testing (20 points)
A power supply unit undergoes three types of tests:
- Voltage stability test (success rate: 95%)
- Load regulation test (success rate: 92%)
- Temperature stress test (success rate: 88%)

The power supply must pass all three tests to be certified. Tests are conducted independently.

Calculate:
1. Probability of a power supply passing certification (all tests)
2. Probability of failing exactly one test
3. Probability of failing at least two tests
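One way to cross-check the three quantities above is to enumerate all eight pass/fail combinations and group them by the number of failed tests. The sketch below assumes the tests are independent, as stated in the exercise. The names `pass_rates`, `p_failures`, and the `*_check` variables are illustrative and not part of the graded solution.

```python
from itertools import product

# Pass probabilities of the three independent tests (from the exercise text).
pass_rates = {"voltage": 0.95, "load": 0.92, "temperature": 0.88}

# p_failures[k] accumulates the probability of exactly k failed tests.
p_failures = [0.0] * (len(pass_rates) + 1)

# Enumerate all 2**3 pass/fail combinations (True = pass, False = fail).
for outcome in product([True, False], repeat=len(pass_rates)):
    p = 1.0
    for passed, rate in zip(outcome, pass_rates.values()):
        # Independence lets us multiply the per-test probabilities.
        p *= rate if passed else 1 - rate
    p_failures[outcome.count(False)] += p

p_all_pass_check = p_failures[0]
p_one_fail_check = p_failures[1]
p_two_plus_fail_check = p_failures[2] + p_failures[3]

print(f"All pass:         {p_all_pass_check:.5f}")
print(f"One fail:         {p_one_fail_check:.5f}")
print(f"Two or more fail: {p_two_plus_fail_check:.5f}")
```

Because the four failure counts cover every outcome, the entries of `p_failures` should sum to 1, which gives a quick sanity check on any hand-written formula.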

In [2]:
# YOUR CODE HERE
voltage_stability = 0.95
load_regulation = 0.92
temperature_stress = 0.88

voltage_fail = 1 - voltage_stability
load_fail = 1 - load_regulation
temperature_fail = 1 - temperature_stress

# Probability of passing all tests
p_all_pass = voltage_stability * load_regulation * temperature_stress
# Probability of exactly one test failing
p_one_fail = (
    voltage_fail * load_regulation * temperature_stress
    + voltage_stability * load_fail * temperature_stress
    + temperature_fail * voltage_stability * load_regulation
)
# Probability of two or more tests failing
p_two_plus_fail = (
    voltage_fail * load_fail * temperature_stress
    + voltage_stability * load_fail * temperature_fail
    + temperature_fail * voltage_fail * load_regulation
    + voltage_fail * load_fail * temperature_fail
)

print(f"All pass: {p_all_pass}\nOne fail: {p_one_fail}\nTwo or more fail: {p_two_plus_fail}")


All pass: 0.76912
One fail: 0.21224
Two or more fail: 0.01864


In [3]:
# Test cell
assert isinstance(p_all_pass, float), "p_all_pass must be a float"
assert isinstance(p_one_fail, float), "p_one_fail must be a float"
assert isinstance(p_two_plus_fail, float), "p_two_plus_fail must be a float"

### Exercise 2: Test Coverage (30 points)
In a test suite with 500 test cases:
- 400 pass consistently
- 60 fail intermittently
- 40 fail consistently

Calculate:
1. Probability of randomly selecting a passing test
2. Probability of selecting a consistently failing test given that a failed test was selected
3. Create a probability tree diagram for test outcomes
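As a rough text-based alternative to a drawn diagram, the branch probabilities of the tree can be computed exactly with `fractions.Fraction` and printed as an indented outline. This is only a sketch: the two-level layout (pass/fail first, then the type of failure) is one possible reading of the exercise, and all variable names are illustrative.

```python
from fractions import Fraction

# Counts from the exercise text.
counts = {"pass": 400, "fail intermittently": 60, "fail consistently": 40}
total = sum(counts.values())  # 500 test cases
n_fail = counts["fail intermittently"] + counts["fail consistently"]

# First-level branches: pass vs. fail.
p_pass_exact = Fraction(counts["pass"], total)
p_fail_exact = Fraction(n_fail, total)

# Second-level branches, conditional on a failed test.
p_inter_given_fail = Fraction(counts["fail intermittently"], n_fail)
p_cons_given_fail = Fraction(counts["fail consistently"], n_fail)

# Leaf (joint) probabilities are products along each path.
p_inter_joint = p_fail_exact * p_inter_given_fail
p_cons_joint = p_fail_exact * p_cons_given_fail

tree_lines = [
    "Test case",
    f"|-- pass: P = {p_pass_exact}",
    f"`-- fail: P = {p_fail_exact}",
    f"    |-- intermittent: P(.|fail) = {p_inter_given_fail}, joint = {p_inter_joint}",
    f"    `-- consistent:   P(.|fail) = {p_cons_given_fail}, joint = {p_cons_joint}",
]
print("\n".join(tree_lines))
```

Using exact fractions avoids rounding noise and makes it easy to confirm that the leaf probabilities add up to 1.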


In [4]:
# YOUR CODE HERE
test_total = 500
test_pass = 400
intermittent = 60
consistent = 40

test_fail = test_total - test_pass

p_pass = test_pass/test_total  # Probability of randomly selecting a passing test
p_consistent_fail_given_fail = consistent/test_fail  # P(Consistent Fail | Any Fail)

print("Pass: " + str(p_pass) + "\nConsistent fail given fail: " + str(p_consistent_fail_given_fail))

Pass: 0.8
Consistent fail given fail: 0.4


In [5]:
# Example tree below from Unit 3. You can also create it using pen and paper and attach the photo here in a markdown cell.



![Probability tree diagram for test outcomes](Tree.png)

In [6]:
# Test cell
assert isinstance(p_pass, float), "p_pass must be a float"
assert isinstance(p_consistent_fail_given_fail, float), "p_consistent_fail_given_fail must be a float"
assert 0 <= p_pass <= 1, "Probability must be between 0 and 1"
assert 0 <= p_consistent_fail_given_fail <= 1, "Conditional probability must be between 0 and 1"



### Exercise 3: System Performance Analysis (20 points)
We analyze when a test system is busy. From monitoring data we know:
- 70% of test runs happen at night (between 8PM and 8AM)
- 60% of all time slots show high system activity
- 75% of night time slots show high system activity

Calculate:
1. If the system shows high activity, what's the probability it's night time? 
2. Are time of day (night/day) and system activity (high/low) independent events?
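Both questions can be approached through the 2x2 joint distribution of time of day and activity level. The sketch below fills in that table from the three given numbers, applies Bayes' theorem, and tests independence via the product rule with a floating-point tolerance. The variable names and the use of `math.isclose` are choices made for this illustration, not requirements of the exercise.

```python
import math

# Given quantities from the exercise text.
p_night = 0.70             # P(Night)
p_high = 0.60              # P(High)
p_high_given_night = 0.75  # P(High | Night)

# Multiplication rule: P(Night and High) = P(High | Night) * P(Night).
p_night_and_high = p_high_given_night * p_night

# Bayes' theorem: P(Night | High) = P(Night and High) / P(High).
p_night_given_high_check = p_night_and_high / p_high

# Independence holds only if P(Night and High) equals P(Night) * P(High).
# math.isclose compares with a tolerance instead of exact float equality.
independent_check = math.isclose(p_night_and_high, p_night * p_high)

# Remaining cells of the joint table follow from the marginals.
joint = {
    ("night", "high"): p_night_and_high,
    ("night", "low"): p_night - p_night_and_high,
    ("day", "high"): p_high - p_night_and_high,
    ("day", "low"): 1 - p_night - p_high + p_night_and_high,
}

for (time_of_day, activity), p in joint.items():
    print(f"P({time_of_day}, {activity}) = {p:.3f}")
print(f"P(Night | High) = {p_night_given_high_check:.3f}")
print(f"Independent: {independent_check}")
```

Checking that every cell of `joint` is non-negative and that the cells sum to 1 is a quick way to confirm that the three given probabilities are mutually consistent.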


In [7]:
# YOUR CODE HERE
night_tests = 0.7
high_activity_all_slots = 0.6
high_activity_night_time_slots = 0.75

# Calculate P(Night | High Activity)
p_night_given_high = (high_activity_night_time_slots * night_tests) / high_activity_all_slots

# Check independence
# Events are independent if P(High Activity | Night) = P(High Activity)
# np.isclose avoids exact float comparison; bool() keeps the result a plain Python bool
are_independent = bool(np.isclose(high_activity_night_time_slots, high_activity_all_slots))

print("P(Night | High Activity):", p_night_given_high)
print("Are the events independent?:", are_independent)


P(Night | High Activity): 0.8749999999999999
Are the events independent?: False


In [8]:
# Test cell
assert isinstance(p_night_given_high, float), "p_night_given_high must be a float"
assert isinstance(are_independent, bool), "are_independent must be a boolean"
assert 0 <= p_night_given_high <= 1, "Probability must be between 0 and 1"

# Part 2: Westermo Data Analysis

### Exercise 4: Performance Metrics (30 points)
Using the Westermo test system data (from Assignment 2), analyze:
1. Probability of high system load (load exceeding 0.3, i.e., 30%)
2. Conditional probability of high memory usage (>12%) given high system load
3. Joint probability of high load AND high memory usage

Note: Thresholds have been chosen based on typical system behavior and are specific to this case only.
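The three quantities can be estimated as relative frequencies of boolean masks. The sketch below wraps that idea in a hypothetical helper, `empirical_probabilities`, and runs it on a small made-up DataFrame (not the Westermo data). The column names `load-15m` and `memory_used_pct` follow the ones used in this notebook. The guard against an empty high-load subset is an added assumption about how an undefined conditional probability should be reported.

```python
import pandas as pd


def empirical_probabilities(frame, load_col, mem_col, load_thr, mem_thr):
    """Return (P(high load), P(high mem | high load), P(high load and high mem))."""
    high_load = frame[load_col] > load_thr
    high_mem = frame[mem_col] > mem_thr

    # The mean of a boolean Series is the fraction of True rows.
    p_load = float(high_load.mean())
    p_both = float((high_load & high_mem).mean())

    # The conditional probability is undefined if no row has high load.
    p_mem_given_load = p_both / p_load if p_load > 0 else float("nan")
    return p_load, p_mem_given_load, p_both


# Made-up toy data for illustration only, NOT the Westermo measurements.
toy = pd.DataFrame({
    "load-15m": [0.10, 0.35, 0.40, 0.20, 0.50],
    "memory_used_pct": [10.0, 13.0, 11.0, 14.0, 12.5],
})

p_load, p_mem_given_load, p_both = empirical_probabilities(
    toy, "load-15m", "memory_used_pct", load_thr=0.3, mem_thr=12
)
print("P(High Load):", p_load)
print("P(High Memory | High Load):", p_mem_given_load)
print("P(High Load AND High Memory):", p_both)
```

A helper like this could be called once per dataset, which avoids duplicating the computation when the same analysis is repeated on different input files.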

In [9]:
df = pd.read_csv('system-1(1).csv') 
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')
df['memory_used_pct'] = (1 - df['sys-mem-available'] / df['sys-mem-total']) * 100

# Define thresholds
high_load_condition = df['load-15m'] > 0.3  # Boolean Series: True where load exceeds 0.3
count_load = high_load_condition.sum()
print("Values over 30%: " + str(count_load))

high_memory_condition = df['memory_used_pct'] > 12  # Boolean Series: True where memory usage exceeds 12%
count_memory = high_memory_condition.sum()
print("Values over 12%: " + str(count_memory))

# Total number of samples
total_samples = len(df)
print("Total samples: " + str(total_samples))

# 1. P(High Load)
high_load_count = high_load_condition.sum()  # True counts as 1, so this is the number of rows with load > 0.3
p_high_load = high_load_count / total_samples  # Fraction of all rows with load > 0.3

# 2. P(High Memory | High Load)
high_load_and_memory_count = (high_load_condition & high_memory_condition).sum()  # Rows where both conditions hold

p_high_memory_given_load = high_load_and_memory_count / high_load_count

# 3. P(High Load AND High Memory)
p_joint = high_load_and_memory_count / total_samples 

# Output results
print("P(High Load):", p_high_load)
print("P(High Memory | High Load):", p_high_memory_given_load)
print("P(High Load AND High Memory):", p_joint)

Values over 30%: 789
Values over 12%: 16
Total samples: 85749
P(High Load): 0.009201273484238883
P(High Memory | High Load): 0.0
P(High Load AND High Memory): 0.0


In [10]:
# Test cell
assert isinstance(p_high_load, float), "p_high_load must be a float"
assert isinstance(p_high_memory_given_load, float), "p_high_memory_given_load must be a float"
assert isinstance(p_joint, float), "p_joint must be a float"

# Bonus Exercise (20 points)

Exercise 4 uses the original data rather than the preprocessed data generated in Assignment 2.

Repeat Exercise 4 using the preprocessed data you generated in Assignment 2, and compare the results.

In [11]:
# YOUR CODE HERE
df = pd.read_csv('cleaned_data.csv')  # cleaned_data.csv was generated with the earlier preprocessing code and is read here


# Define thresholds
high_load_condition = df['load-15m'] > 0.3  # Boolean
high_memory_condition = df['memory_used_pct'] > 12 


# Total number of samples
total_samples = len(df)

# 1. P(High Load)
high_load_count = high_load_condition.sum()  # True counts as 1, so this is the number of rows with load > 0.3
p_high_load = high_load_count / total_samples  # Fraction of all rows with load > 0.3

# 2. P(High Memory | High Load)
high_load_and_memory_count = (high_load_condition & high_memory_condition).sum()  # Rows where both conditions hold

p_high_memory_given_load = high_load_and_memory_count / high_load_count

# 3. P(High Load AND High Memory)
p_joint = high_load_and_memory_count / total_samples 

# Output results
print("P(High Load):", p_high_load)
print("P(High Memory | High Load):", p_high_memory_given_load)
print("P(High Load AND High Memory):", p_joint)

# Only P(High Load) differs: it is slightly lower than with the original data. The other two probabilities remain 0.0.

P(High Load): 0.009177949597079849
P(High Memory | High Load): 0.0
P(High Load AND High Memory): 0.0
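To put a number on the comparison, the printed results of Exercise 4 and of the bonus cell can be placed side by side. The sketch below copies the values from the outputs shown in this notebook; the dictionary names and the choice to report a relative change are illustrative assumptions.

```python
# Values copied from the printed outputs of Exercise 4 (original data)
# and of the bonus cell (preprocessed data).
original = {
    "P(High Load)": 0.009201273484238883,
    "P(High Memory | High Load)": 0.0,
    "P(High Load AND High Memory)": 0.0,
}
cleaned = {
    "P(High Load)": 0.009177949597079849,
    "P(High Memory | High Load)": 0.0,
    "P(High Load AND High Memory)": 0.0,
}

comparison = {}
for name, p_orig in original.items():
    diff = cleaned[name] - p_orig
    # A relative change is undefined when the baseline is zero.
    rel = diff / p_orig if p_orig != 0 else None
    comparison[name] = (diff, rel)
    rel_text = f"{rel:+.3%}" if rel is not None else "n/a"
    print(f"{name:<30} {p_orig:.6f} -> {cleaned[name]:.6f}  ({rel_text})")
```

The relative change in P(High Load) comes out at well under one percent, which supports the observation that preprocessing barely affects these estimates.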
